Multilateral price indexes that use large volumes of transaction data can be challenging to compute. I show some tricks to calculate big indexes fast.
Author
Steve Martin
Published
November 24, 2025
Multilateral price indexes are often used to measure the evolution of prices over time when there are large volumes of transaction data, such as retail scanner data or housing data. The main challenge with computing multilateral indexes with large amounts of data is that these indexes often depend on a matrix whose dimensions are at least as large as the number of products \(n\). Lots of data means lots of different products, which in turn means making quite large matrices. Indeed, assuming these matrices are stored as 64-bit floats (more on that below), at least \(8n^2\) bytes of memory are needed just to store one of these matrices—with, say, 100,000 products, that’s at least 80 GB of memory before doing any computations. Computing these indexes efficiently requires a way to reduce the dimensions of these matrices.
Weighted time product dummy
Let’s start with the WTPD index. This index is recovered from the coefficients on a set of time-dummy variables in a weighted log-linear regression model with a dummy variable for each product (e.g., Diewert and Fox 2022, eq. 9). Fitting this model is impractical with many products—with \(n\) observations for \(k\) products over \(t\) time periods, the design matrix is an \(n \times (k + t - 1)\) matrix and will consume at least \(8n(k + t - 1)\) bytes of memory. Fortunately, only the coefficients on the time-dummy variables are needed to make the index, and so the product-dummy variables can be removed by treating them as fixed effects and demeaning the regression equation (Wooldridge 2002, sec. 10.5). Now the design matrix has only \(t - 1\) columns and, as \(t\) is usually much smaller than \(k\), will consume far less memory.
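Concretely, the model looks something like this (a sketch in my own notation, not necessarily the exact parameterization in Diewert and Fox 2022):

\[
\ln p_{it} = \delta_t + \pi_i + \varepsilon_{it},
\]

where the \(\delta_t\) are time effects and the \(\pi_i\) are product effects. Demeaning within each product,

\[
\ln p_{it} - \overline{\ln p_i} = \delta_t - \overline{\delta_i} + \varepsilon_{it} - \overline{\varepsilon_i},
\]

sweeps out the \(\pi_i\) (bars denote weighted means over the periods in which product \(i\) sells), and the index between periods \(s\) and \(t\) is \(\exp(\delta_t - \delta_s)\).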
Let’s consider an example. We’ll start with a function to make (unbalanced) transaction data for a collection of products over time.
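The original helper isn’t shown here; a minimal sketch could look like the following (the function name and the distributions for prices and quantities are my own assumptions):

```r
# A sketch of a data-generating helper: prices and quantities for k
# products over t time periods, with a random 10% of product-period
# pairs dropped so that the data are unbalanced.
make_transactions <- function(k, t) {
  transactions <- expand.grid(
    product = factor(seq_len(k)),
    period = factor(seq_len(t))
  )
  drop <- sample.int(nrow(transactions), nrow(transactions) %/% 10)
  transactions <- transactions[-drop, ]
  transactions$price <- rlnorm(nrow(transactions))
  transactions$quantity <- rpois(nrow(transactions), 10) + 1
  transactions
}

set.seed(20251124)
transactions <- make_transactions(100000, 25)
```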
Now let’s build a WTPD index with data for 100,000 products over 25 time periods. Doing this with a regular linear model, where each product has its own dummy variable, would require 1.8 TB of memory just to store the design matrix, making it impractical to calculate. Using a fixed-effects model to recover the WTPD index can be done without issue on any decent laptop.
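One way to fit the fixed-effects model is with the fixest package; treat this as a sketch of the approach, not necessarily what was used originally (the expenditure-share weights follow Diewert and Fox 2022):

```r
library(fixest)

# Expenditure shares within each period serve as regression weights.
transactions$share <- with(
  transactions,
  ave(price * quantity, period, FUN = \(x) x / sum(x))
)

# Weighted log-linear regression with time dummies; the 100,000 product
# dummies are absorbed as fixed effects instead of being built into the
# design matrix.
fit <- feols(
  log(price) ~ period | product,
  data = transactions,
  weights = ~share
)

# The WTPD index is the exponential of the time-dummy coefficients,
# normalizing the first period to 1.
wtpd <- exp(c(0, coef(fit)))
```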
Geary-Khamis
The GK index is another multilateral index that’s often written as a function of a large matrix. Following Diewert and Fox (2022), the GK index can be written as a function of two \(k \times k\) matrices. One of these matrices is diagonal, so it can be handled efficiently, but the other is usually dense (unless product imbalance is really bad) and is impractical to store in memory. Using the same data as with the WTPD index, it would require 80 GB of memory just to store this matrix.
An alternative approach (Balk 2008, 245–46) expresses the GK index as a function of three \(t \times t\) matrices. With the same data as before, this now requires only 15 KB of memory. Here is a relatively fast function to implement this approach.
```r
#' Intertemporal Geary-Khamis index
#'
#' Make a multilateral Geary-Khamis price index.
#'
#' @param p A numeric vector of prices.
#' @param q A numeric vector of quantities.
#' @param period A factor, or something that can be coerced into one, that
#'   gives the corresponding time period for each element in `p` and
#'   `q`. The ordering of time periods follows the levels of `period`
#'   to agree with [`cut()`][cut.Date].
#' @param product A factor, or something that can be coerced into one, that
#'   gives the corresponding product identifier for each element in `p` and
#'   `q`.
#' @param na.rm Should missing values be removed? By default they are not.
#'
#' @returns
#' A numeric vector of period-over-period price indexes.
fast_gk <- function(p, q, period, product, na.rm = FALSE) {
  period <- as.factor(period)
  product <- as.factor(product)
  qs <- unsplit(lapply(split(q, product), gpindex::scale_weights), product)
  attributes(product) <- NULL # faster to match on numeric codes
  ux <- unique(product)
  product <- lapply(
    split(product, period),
    \(x) match(ux, x, incomparables = NA)
  )
  v <- Map(`[`, split(p * q, period), product)
  qs <- Map(`[`, split(qs, period), product)
  n <- nlevels(period)
  # Make the matrices from pp. 245-246 of Balk (2008) to build the quantity
  # index, then deflate.
  m <- vector("list", n)
  for (i in seq_along(m)) {
    m[[i]] <- vapply(
      qs,
      \(qs) sum(qs * gpindex::scale_weights(v[[i]]), na.rm = na.rm),
      numeric(1L)
    )
  }
  e <- diag(n)
  r <- cbind(rep.int(1, n), matrix(0, n, n - 1))
  q <- solve(do.call(rbind, m) - e + r)[1L, ]
  v <- vapply(v, sum, numeric(1L), na.rm = na.rm)
  # Return the period-over-period index.
  v[-1L] / v[-length(v)] / (q[-1L] / q[-length(q)])
}
```
As with the WTPD index, any decent laptop can compute this index.
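For example, using the simulated transactions from before (with `na.rm = TRUE`, since the data are unbalanced):

```r
gk <- with(
  transactions,
  fast_gk(price, quantity, period, product, na.rm = TRUE)
)
# Chain the period-over-period indexes into index levels.
cumprod(c(1, gk))
```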
Repeat sales
The repeat-sales indexes listed by Shiller (1991) are a type of multilateral index usually found in the housing-economics literature. Like the WTPD index, they come from linear models and suffer from the same large design matrices. The geometric repeat-sales index is similar to the WTPD index, although usually with a much longer time horizon, except that it sweeps out the product fixed effects with the first-difference transformation instead of the within-group transformation. The arithmetic repeat-sales index is more complicated: it comes from an IV estimator, and it’s not obvious, at least to me, how the matrices for this index can be made smaller.
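To see what first-differencing does here, note that if house \(i\) sells in periods \(s\) and \(t\), differencing the log-linear model gives something like

\[
\ln(p_{it} / p_{is}) = \delta_t - \delta_s + \varepsilon_{it} - \varepsilon_{is},
\]

so each row of the design matrix has a 1 in the column for period \(t\), a \(-1\) in the column for period \(s\), and zeros everywhere else.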
There’s no need to reduce the dimensionality of the repeat-sales matrices, however: these matrices are naturally sparse when products sell infrequently (as with housing) and can be stored as sparse matrices. Let’s adapt the vignette from my {rsmatrix} package to see this in action with some data for 5 million house sales over 30 years.
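The vignette’s data-generating code isn’t reproduced here; a rough stand-in (the names and distributions are my own) could be:

```r
set.seed(20251124)

# About 5 million sales of roughly 1 million houses over 30 annual periods.
sales <- data.frame(
  id = sample.int(1e6, 5e6, replace = TRUE),
  period = sample.int(30, 5e6, replace = TRUE),
  price = rlnorm(5e6, meanlog = 12)
)
# Keep at most one sale per house per period.
sales <- sales[!duplicated(sales[c("id", "period")]), ]
```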
We can now build a generator to make the sparse repeat-sales matrices. If these were dense matrices they would consume 22.1 GB of memory. This isn’t nearly as bad as with the WTPD or GK indexes, but still enough to make computing these indexes cumbersome. With sparse matrices, however, it is easy to make the repeat-sales index.
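With rsmatrix this looks roughly like the following (a sketch adapted from the package vignette; the benchmark times just the arithmetic repeat-sales computation):

```r
library(rsmatrix)

# Pair each sale with the previous sale of the same house.
sales[c("period_prev", "price_prev")] <-
  sales[rs_pairs(sales$period, sales$id), c("period", "price")]
sales <- subset(sales, period > period_prev)

# A generator for the repeat-sales matrices, stored sparsely.
matrices <- with(
  sales,
  rs_matrix(period, period_prev, price, price_prev, sparse = TRUE)
)
Z <- matrices("Z")
X <- matrices("X")
Y <- matrices("Y")

# Shiller's (1991) arithmetic repeat-sales index solves the IV
# estimating equations Z'Xb = Z'Y.
bench::mark(ars = solve(crossprod(Z, X), crossprod(Z, Y)))
```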
```
# A tibble: 1 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 ars           1.59s    1.59s     0.627       3MB        0
```
GEKS
Let’s conclude by briefly mentioning the GEKS family of multilateral indexes. As with the others, a GEKS index depends on a matrix, but it is only a \(t \times t\) matrix and requires very little memory. This makes GEKS indexes somewhat more straightforward to compute than the other popular multilateral indexes.
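For example, a Fisher GEKS index for the scanner data from earlier can be computed with the GEKS functions in gpindex (a sketch, assuming the simulated transactions from above):

```r
geks <- with(
  transactions,
  gpindex::fisher_geks(price, quantity, period, product, na.rm = TRUE)
)
# With the default window spanning every period, the result is a single
# window of period-over-period indexes; chain them into index levels.
cumprod(c(1, unlist(geks)))
```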
Balk, Bert M. 2008. Price and Quantity Index Numbers: Models for Measuring Aggregate Change and Difference. Cambridge University Press. https://doi.org/10.1017/CBO9780511720758.
Diewert, W. Erwin, and Kevin J. Fox. 2022. “Substitution Bias in Multilateral Methods for CPI Construction.” Journal of Business & Economic Statistics 40 (1): 355–69. https://doi.org/10.1080/07350015.2020.1816176.
Shiller, Robert J. 1991. “Arithmetic Repeat Sales Price Estimators.” Journal of Housing Economics 1 (1): 110–26.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. MIT Press.