Multilateral price indexes that use large volumes of transaction data can be challenging to compute. I show some tricks to calculate big indexes fast.
Author
Steve Martin
Published
November 24, 2025
Multilateral price indexes are often used to measure the evolution of prices over time when there are large volumes of transaction data, such as retail scanner data or housing data. The main challenge with computing multilateral indexes with large amounts of data is that these indexes often depend on a matrix whose dimensions are at least as large as the number of products \(n\). Lots of data means lots of different products, which in turn means making quite large matrices. Indeed, assuming these matrices are stored as 64-bit floats (more on that below), at least \(8n^2\) bytes of memory are needed just to store one of these matrices—with, say, 100,000 products, that’s at least 80 GB of memory before doing any computations. Computing these indexes efficiently requires a way to reduce the dimensions of these matrices.
Weighted time product dummy
Let’s start with the WTPD index. This index is recovered from the coefficients on a set of time-dummy variables in a weighted log-linear regression model with a dummy variable for each product (e.g., Diewert and Fox 2022, eq. 9). Fitting this model is impractical with many products—with \(n\) observations for \(k\) products over \(t\) time periods, the design matrix is an \(n \times (k + t - 1)\) matrix and will consume at least \(8n(k + t - 1)\) bytes of memory. Fortunately, only the coefficients on the time-dummy variables are needed to make the index, and so the product-dummy variables can be removed by treating them as fixed effects and demeaning the regression equation (Wooldridge 2002, sec. 10.5). Now the design matrix has only \(t - 1\) columns and, as \(t\) is usually much smaller than \(k\), will consume far less memory.
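Concretely, the model looks something like this (a sketch in my own notation, not necessarily the exact parameterization in Diewert and Fox 2022):

\[
\ln p_{it} = \delta_t + \pi_i + \varepsilon_{it},
\]

where the \(\delta_t\) are time effects and the \(\pi_i\) are product effects. Demeaning within each product,

\[
\ln p_{it} - \overline{\ln p_i} = \delta_t - \overline{\delta_i} + \varepsilon_{it} - \overline{\varepsilon_i},
\]

sweeps out the \(\pi_i\) (bars denote weighted means over the periods in which product \(i\) sells), and the index between periods \(s\) and \(t\) is \(\exp(\delta_t - \delta_s)\).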
Let’s consider an example. We’ll start with a function to make (unbalanced) transaction data for a collection of products over time.
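The original helper isn’t shown here; a minimal sketch could look like the following (the function name and the distributions for prices and quantities are my own assumptions):

```r
# A sketch of a data-generating helper: prices and quantities for k
# products over t time periods, with a random 10% of product-period
# pairs dropped so that the data are unbalanced.
make_transactions <- function(k, t) {
  transactions <- expand.grid(
    product = factor(seq_len(k)),
    period = factor(seq_len(t))
  )
  drop <- sample.int(nrow(transactions), nrow(transactions) %/% 10)
  transactions <- transactions[-drop, ]
  transactions$price <- rlnorm(nrow(transactions))
  transactions$quantity <- rpois(nrow(transactions), 10) + 1
  transactions
}

set.seed(20251124)
transactions <- make_transactions(100000, 25)
```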
Now let’s build a WTPD index with data for 100,000 products over 25 time periods. Doing this with a regular linear model, where each product has its own dummy variable, would require 1.8 TB of memory just to store the design matrix, making it impractical to calculate. Using a fixed-effects model to recover the WTPD index can be done without issue on any decent laptop.
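One way to fit the fixed-effects model is with the fixest package; treat this as a sketch of the approach, not necessarily what was used originally (the expenditure-share weights follow Diewert and Fox 2022):

```r
library(fixest)

# Expenditure shares within each period serve as regression weights.
transactions$share <- with(
  transactions,
  ave(price * quantity, period, FUN = \(x) x / sum(x))
)

# Weighted log-linear regression with time dummies; the 100,000 product
# dummies are absorbed as fixed effects instead of being built into the
# design matrix.
fit <- feols(
  log(price) ~ period | product,
  data = transactions,
  weights = ~share
)

# The WTPD index is the exponential of the time-dummy coefficients,
# normalizing the first period to 1.
wtpd <- exp(c(0, coef(fit)))
```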
Geary-Khamis
The GK index is another multilateral index that’s often written as a function of a large matrix. Following Diewert and Fox (2022), the GK index can be written as a function of two \(k \times k\) matrices. One of these matrices is diagonal, so it can be handled efficiently, but the other is usually dense (unless product imbalance is really bad) and is impractical to store in memory. Using the same data as with the WTPD index, it would require 80 GB of memory just to store this matrix.
An alternative approach (Balk 2008, 245–46) expresses the GK index as a function of three \(t \times t\) matrices. With the same data as before, this now requires only 15 KB of memory. Here is a relatively fast function to implement this approach.
```r
#' Intertemporal Geary-Khamis index
#'
#' Make a multilateral Geary-Khamis price index.
#'
#' @param p A numeric vector of prices.
#' @param q A numeric vector of quantities.
#' @param period A factor, or something that can be coerced into one, that
#'   gives the corresponding time period for each element in `p` and
#'   `q`. The ordering of time periods follows the levels of `period`
#'   to agree with [`cut()`][cut.Date].
#' @param product A factor, or something that can be coerced into one, that
#'   gives the corresponding product identifier for each element in `p` and
#'   `q`.
#' @param na.rm Should missing values be removed? By default they are not.
#'
#' @returns
#' A numeric vector of period-over-period price indexes.
fast_gk <- function(p, q, period, product, na.rm = FALSE) {
  period <- as.factor(period)
  product <- as.factor(product)
  qs <- unsplit(lapply(split(q, product), gpindex::scale_weights), product)
  attributes(product) <- NULL # faster to match on numeric codes
  ux <- unique(product)
  product <- lapply(
    split(product, period),
    \(x) match(ux, x, incomparables = NA)
  )
  v <- Map(`[`, split(p * q, period), product)
  qs <- Map(`[`, split(qs, period), product)
  n <- nlevels(period)
  # Make the matrices from pp. 245-246 of Balk (2008) to build the quantity
  # index, then deflate.
  m <- vector("list", n)
  for (i in seq_along(m)) {
    m[[i]] <- vapply(
      qs,
      \(qs) sum(qs * gpindex::scale_weights(v[[i]]), na.rm = na.rm),
      numeric(1L)
    )
  }
  e <- diag(n)
  r <- cbind(rep.int(1, n), matrix(0, n, n - 1))
  q <- solve(do.call(rbind, m) - e + r)[1L, ]
  v <- vapply(v, sum, numeric(1L), na.rm = na.rm)
  # Return the period-over-period index.
  v[-1L] / v[-length(v)] / (q[-1L] / q[-length(q)])
}
```
As with the WTPD index, any decent laptop can compute this index.
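For example, using the simulated transactions from before (with `na.rm = TRUE`, since the data are unbalanced):

```r
gk <- with(
  transactions,
  fast_gk(price, quantity, period, product, na.rm = TRUE)
)
# Chain the period-over-period indexes into index levels.
cumprod(c(1, gk))
```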
Repeat sales
The repeat-sales indexes listed by Shiller (1991) are a type of multilateral index usually found in the housing-economics literature. Like the WTPD index, they come from linear models and suffer from the same large design matrices. The geometric repeat-sales index is similar to the WTPD index, although usually with a much longer time horizon, except that it sweeps out the product fixed effects with the first-difference transformation instead of the within-group transformation. The arithmetic repeat-sales index is more complicated: it comes from an IV estimator, and it’s not obvious, at least to me, how the matrices for this index can be made smaller.
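To see what first-differencing does here, note that if house \(i\) sells in periods \(s\) and \(t\), differencing the log-linear model gives something like

\[
\ln(p_{it} / p_{is}) = \delta_t - \delta_s + \varepsilon_{it} - \varepsilon_{is},
\]

so each row of the design matrix has a 1 in the column for period \(t\), a \(-1\) in the column for period \(s\), and zeros everywhere else.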
There’s no need to reduce the dimensionality of the repeat-sales matrices, however: these matrices are naturally sparse when products sell infrequently (as with housing) and can be stored as sparse matrices. Let’s adapt the vignette from my {rsmatrix} package to see this in action with some data for 5 million house sales over 30 years.
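The vignette’s data-generating code isn’t reproduced here; a rough stand-in (the names and distributions are my own) could be:

```r
set.seed(20251124)

# About 5 million sales of roughly 1 million houses over 30 annual periods.
sales <- data.frame(
  id = sample.int(1e6, 5e6, replace = TRUE),
  period = sample.int(30, 5e6, replace = TRUE),
  price = rlnorm(5e6, meanlog = 12)
)
# Keep at most one sale per house per period.
sales <- sales[!duplicated(sales[c("id", "period")]), ]
```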
We can now build a generator to make the sparse repeat-sales matrices. If these were dense matrices they would consume 22.1 GB of memory. This isn’t nearly as bad as with the WTPD or GK indexes, but still enough to make computing these indexes cumbersome. With sparse matrices, however, it is easy to make the repeat-sales index.
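With rsmatrix this looks roughly like the following (a sketch adapted from the package vignette; the benchmark times just the arithmetic repeat-sales computation):

```r
library(rsmatrix)

# Pair each sale with the previous sale of the same house.
sales[c("period_prev", "price_prev")] <-
  sales[rs_pairs(sales$period, sales$id), c("period", "price")]
sales <- subset(sales, period > period_prev)

# A generator for the repeat-sales matrices, stored sparsely.
matrices <- with(
  sales,
  rs_matrix(period, period_prev, price, price_prev, sparse = TRUE)
)
Z <- matrices("Z")
X <- matrices("X")
Y <- matrices("Y")

# Shiller's (1991) arithmetic repeat-sales index solves the IV
# estimating equations Z'Xb = Z'Y.
bench::mark(ars = solve(crossprod(Z, X), crossprod(Z, Y)))
```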
```
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 ars           1.59s    1.59s     0.627       3MB        0
```
GEKS
Let’s conclude by briefly mentioning the GEKS family of multilateral indexes. As with the others, a GEKS index depends on a matrix, but it is only a \(t \times t\) matrix and requires very little memory. This makes GEKS indexes somewhat more straightforward to compute than the other popular multilateral indexes.
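For example, a Fisher GEKS index for the scanner data from earlier can be computed with the GEKS functions in gpindex (a sketch, assuming the simulated transactions from above):

```r
geks <- with(
  transactions,
  gpindex::fisher_geks(price, quantity, period, product, na.rm = TRUE)
)
# With the default window spanning every period, the result is a single
# window of period-over-period indexes; chain them into index levels.
cumprod(c(1, unlist(geks)))
```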
Balk, Bert M. 2008. Price and Quantity Index Numbers: Models for Measuring Aggregate Change and Difference. Cambridge University Press. https://doi.org/10.1017/CBO9780511720758.
Diewert, W. Erwin, and Kevin J. Fox. 2022. “Substitution Bias in Multilateral Methods for CPI Construction.” Journal of Business & Economic Statistics 40 (1): 355–69. https://doi.org/10.1080/07350015.2020.1816176.
Shiller, Robert J. 1991. “Arithmetic Repeat Sales Price Estimators.” Journal of Housing Economics 1 (1): 110–26.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press.