How to run CUPED at scale
I’ve worked on many iterations of the CUPED method, and from a practical standpoint I’ve found that CUPED is one of those ideas that looks simple on a slide and gets tricky in production. In practice you’re juggling messy user behavior, and at scale you’re also juggling a huge cloud bill. For large companies that run thousands of concurrent experiments, a naive CUPED implementation can be extremely expensive (easily tens of millions of dollars in compute). This post is a field guide for getting more signal per dollar from CUPED; under the right conditions, the tactics below routinely deliver 100×–1000× compute-efficiency gains without giving up precision. We’ll focus on four tactics:
- picking the right pre-experiment window
- caching pre-experiment covariates
- using delta method with CUPED
- sampling when we compute CUPED aggregates
A quick note on why this matters. I mentioned compute efficiency and cloud bills above, but the real constraint isn’t just money, it’s time. When pipelines are expensive, teams ration freshness due to budget constraints: “we only refresh daily or hourly,” “no realtime monitoring,” “kick off a job and check back in 30 minutes.” That slows learning and turns an experimentation platform into a reporting tool. The playbook below is about cutting variance and latency so answers arrive while the question is still interesting.
I’ll start with the most visible and most often mis-specified choice: the window.
Selecting a pre-experiment window for CUPED
When people say “use pre-experiment data as a covariate,” it hides a real question: which pre-experiment data? If your experiment runs 7 days and you have two weeks of history, you’ve got several options. The choice matters for both variance reduction and compute.
Option 1 — Match the experiment window (week −1 only)
Use 7 days of history for a 7-day experiment. This is the default most teams start with (see Fig. A). It’s simple and respects recency, but it often underperforms in consumer settings because the pre-period data are sparse. If many users don’t transact every week, week −1 has a lot of zeros, which limits how much variance you can soak up. And if you have more history available for your users, why not use it?
Option 2 — A longer single covariate (weeks −2 & −1 summed)
Use a 14-day total as one covariate (Fig. B). By pooling two weeks, you reduce zeros and get a stronger proxy for latent “propensity to spend.” That almost always improves variance reduction relative to Option 1. The downside is that you have to process twice as much pre-experiment data, which roughly doubles compute costs and leaves users waiting longer for results to refresh.
Option 3 — Two non-overlapping covariates (week −1 and week −2)
Use both weeks as separate covariates (Fig. C). This typically beats Option 2 because the regression can learn different weights for each week. If week −1 is more predictive (recency), it gets a bigger weight; if week −1 had a promo/holiday, the model can down-weight it and lean on week −2. Either way, residual variance drops, so the SE on the treatment effect tightens.
Option 4 — Overlapping pair (week −1 and weeks −2 & −1)
This is not strictly better than Option 3, but it is something to watch out for because it tends to increase compute requirements. Here we use two covariates: one spanning week −1 and one spanning weeks −2 & −1. You will generally get the same estimate as in Option 3, since the two covariate sets carry the same information, but you have to reprocess the overlapping data multiple times. I discourage this setup.
Some results
Here are some OLS results that show what these options look like. Note that I’m using a data generation process that deliberately mimics users who contribute no purchases in the week before the experiment but do have data for weeks further back. In practice, Option 3 tends to get the best value out of CUPED.
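If you want to reproduce the pattern yourself, here is a minimal simulation sketch (a hypothetical data generation process, not the exact one behind the numbers above) that compares the standard error of the treatment effect under each option; it assumes numpy and statsmodels are available.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000

# Latent spend propensity; recent behavior (week -1) tracks it more closely than older behavior (week -2).
prop_now = rng.gamma(1.5, 20.0, size=n)
prop_old = 0.7 * prop_now + 0.3 * rng.gamma(1.5, 20.0, size=n)

def weekly_gmv(propensity, active_rate=0.3):
    # Sparse weekly spend: most users are inactive in any given week.
    active = rng.random(n) < active_rate
    return active * rng.poisson(propensity)

x_w1 = weekly_gmv(prop_now)               # week -1 covariate (sparse)
x_w2 = weekly_gmv(prop_old)               # week -2 covariate (sparse)
treat = rng.integers(0, 2, size=n)
y = weekly_gmv(prop_now) + 2.0 * treat    # experiment-week GMV with a small lift

def treatment_se(covariates):
    # OLS of the metric on treatment + chosen pre-period covariates
    X = sm.add_constant(np.column_stack([treat] + covariates))
    return sm.OLS(y, X).fit().bse[1]      # SE of the treatment coefficient

print("No covariate       :", treatment_se([]))
print("Option 1 (week -1) :", treatment_se([x_w1]))
print("Option 2 (sum)     :", treatment_se([x_w1 + x_w2]))
print("Option 3 (separate):", treatment_se([x_w1, x_w2]))
print("Option 4 (overlap) :", treatment_se([x_w1, x_w1 + x_w2]))

With this setup, Option 3 matches or beats Option 2 because the regression can down-weight the noisier week −2, and Option 4 reproduces Option 3’s standard error while forcing you to process the overlapping data twice.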
Caching pre-experiment user-level aggregates
The advice above says to leverage more historical data and more covariates to get the best variance reduction from CUPED. The downside is a large increase in compute requirements. Fortunately, the best thing about CUPED inputs is also the simplest: for a fixed anchor date, they never change. Once you know a user’s GMV in week −1 and week −2 relative to an experiment start, those numbers are deterministic (ignoring late-arriving data). Recomputing them for every analysis is just lighting money on fire.
Why does this matter? If you have usage telemetry, you’ll find that most folks who run experiments are impatient. A 7-day experiment often gets re-analyzed 100+ times: scheduled jobs, ad hoc refreshes, curious PMs. If you don’t cache randomization-unit aggregates (user/store/courier, etc.), every single one of those 100+ runs recomputes identical pre-experiment data.
There’s another dividend to this caching: users are shared across experiments. A single user can sit in hundreds of concurrent tests, which means the same cached aggregates can be reused across pipelines. This trick alone can drive a 100× improvement in pipeline efficiency.
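Here is a minimal sketch of the caching pattern, with hypothetical helper names (covariate_tile_path, get_covariate_tile) and a hypothetical storage path; build_fn stands in for the expensive pre-period aggregation. In production the “tile” is usually just a warehouse table keyed by (user_id, anchor_date), which is exactly what the user_covariates table in the SQL below assumes.

import os
import pandas as pd

def covariate_tile_path(anchor_date: str, root: str = "/warehouse/cuped_tiles") -> str:
    # One tile per anchor date; every experiment anchored on this date reuses it.
    return f"{root}/anchor_date={anchor_date}/tile.parquet"

def get_covariate_tile(anchor_date: str, build_fn) -> pd.DataFrame:
    """Return cached pre-period aggregates keyed by user_id, building them only once."""
    path = covariate_tile_path(anchor_date)
    if os.path.exists(path):
        return pd.read_parquet(path)       # cache hit: no pre-period scan at all
    tile = build_fn(anchor_date)           # expensive scan of weeks -1 and -2, run exactly once
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tile.to_parquet(path, index=False)
    return tile

# build_fn would run the warehouse query that aggregates pre_gmv_1_7 / pre_gmv_8_14 per user.
# The 100+ reruns of a single experiment, and every other experiment sharing the anchor date, hit the cache.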
Delta Method + CUPED
For years, the most expensive part of experiment analysis wasn’t the math, it was the data movement. Data lived in the warehouse; analysis lived on a big EC2 box. The loop looked like this:
- pull a giant table out of the warehouse,
- ship it across the wire,
- run a regression to get estimates and standard errors.
Steps 1 and 2 dominate the bill. The delta method flips this around. Instead of hauling rows to Python, you turn rows into a handful of aggregates you can compute right in the warehouse: means, variances, and covariances (metric with covariates, and covariates with each other). This is why in recent years Statsig and Eppo have adopted the delta method as the default for their metric computation, rebranded as “warehouse-native.” There are quite a few articles describing the delta method here, here and here, and I’m not going to cover the theoretical underpinnings. The highlight is that the delta method shifts 99% of the compute to the warehouse.
What you need to compute (warehouse-side inputs)
Here’s a bird’s-eye view of the pipeline for a metric like total GMV per randomization unit. We take the exposures table to define who’s in the test and when they started, roll up event data into a per-user metric (e.g., total GMV and its denominator) over the experiment window, and join a cached tile of pre-period covariates. Then we collapse everything to a tiny set of moments per arm (and once pooled): counts, means, variances, and covariances. That’s the whole warehouse-native pipeline. No user-level rows leave the warehouse, and the app layer just consumes this small bundle (<1 KB of aggregates) to run the delta method.
-- PARAMETERS
-- :exp_id
-- :treatment_window_days -- e.g., 7
-- :anchor_date -- DATE used to select the pre-period covariate tile
WITH exposures AS (
-- One row per user in the experiment (deduped to first exposure)
SELECT
e.user_id,
e.arm, -- 'control' or 'treatment'
MIN(e.first_exposure_time) AS t0
FROM experiment_exposures e -- (experiment_id, user_id, arm, first_exposure_time)
WHERE e.experiment_id = :exp_id
GROUP BY 1,2
),
metric_events AS (
-- Event-level facts for the metric; the [t0, t0 + D) window is applied downstream in metric_per_user
SELECT
ev.user_id,
ev.event_ts,
ev.gmv_numerator_atom AS gmv_num_atom, -- e.g., GMV dollars
ev.gmv_denominator_atom AS gmv_den_atom -- e.g., active_day flag; use 1 if not a true ratio
FROM events ev -- (user_id, event_ts, gmv_numerator_atom, gmv_denominator_atom)
),
metric_per_user AS (
-- Aggregate metric numerator/denominator per user, anchored on t0
SELECT
ex.user_id,
ex.arm,
/* GMV numerator: sum over [t0, t0 + D) */
COALESCE(SUM(CASE
WHEN me.event_ts >= ex.t0
AND me.event_ts < DATEADD('day', :treatment_window_days, ex.t0)
THEN me.gmv_num_atom ELSE 0 END), 0.0)::FLOAT AS gmv_numerator,
/* GMV denominator: sum over [t0, t0 + D) (set to 1.0 if your metric is a mean) */
COALESCE(SUM(CASE
WHEN me.event_ts >= ex.t0
AND me.event_ts < DATEADD('day', :treatment_window_days, ex.t0)
THEN me.gmv_den_atom ELSE 0 END), 0.0)::FLOAT AS gmv_denominator
FROM exposures ex
LEFT JOIN metric_events me
ON me.user_id = ex.user_id
GROUP BY 1,2
),
covariate_tile AS (
-- Cached pre-period covariates keyed by (user_id, anchor_date).
-- Add as many *_numerator / *_denominator pairs as you need.
SELECT
ct.user_id,
/* Covariate: pre-period GMV in days -7..-1 */
COALESCE(ct.pre_gmv_1_7_numerator, 0.0)::FLOAT AS pre_gmv_1_7_numerator,
COALESCE(ct.pre_gmv_1_7_denominator, 1.0)::FLOAT AS pre_gmv_1_7_denominator,
/* Covariate: pre-period GMV in days -14..-8 */
COALESCE(ct.pre_gmv_8_14_numerator, 0.0)::FLOAT AS pre_gmv_8_14_numerator,
COALESCE(ct.pre_gmv_8_14_denominator, 1.0)::FLOAT AS pre_gmv_8_14_denominator
FROM user_covariates ct -- (user_id, anchor_date, pre_gmv_1_7_numerator, ..., pre_gmv_8_14_denominator, …)
WHERE ct.anchor_date = :anchor_date
),
per_user AS (
-- One row per user with metric + covariates
SELECT
m.user_id,
m.arm,
m.gmv_numerator,
m.gmv_denominator,
COALESCE(c.pre_gmv_1_7_numerator, 0.0) AS pre_gmv_1_7_numerator,
COALESCE(c.pre_gmv_1_7_denominator, 1.0) AS pre_gmv_1_7_denominator,
COALESCE(c.pre_gmv_8_14_numerator, 0.0) AS pre_gmv_8_14_numerator,
COALESCE(c.pre_gmv_8_14_denominator, 1.0) AS pre_gmv_8_14_denominator
FROM metric_per_user m
LEFT JOIN covariate_tile c
ON c.user_id = m.user_id
),
-- Sample means/variances/covariances per arm
arm AS (
SELECT
arm,
COUNT(*) AS n,
-- Metric aggregates (GMV numerator/denominator)
AVG(gmv_numerator) AS mu_gmv_numerator,
AVG(gmv_denominator) AS mu_gmv_denominator,
VAR_SAMP(gmv_numerator) AS var_gmv_numerator,
VAR_SAMP(gmv_denominator) AS var_gmv_denominator,
COVAR_SAMP(gmv_numerator, gmv_denominator) AS cov_gmv_num_gmv_den,
-- Covariate aggregates: pre_gmv_1_7 (numerator/denominator)
AVG(pre_gmv_1_7_numerator) AS mu_pre_1_7_num,
AVG(pre_gmv_1_7_denominator) AS mu_pre_1_7_den,
VAR_SAMP(pre_gmv_1_7_numerator) AS var_pre_1_7_num,
VAR_SAMP(pre_gmv_1_7_denominator) AS var_pre_1_7_den,
COVAR_SAMP(pre_gmv_1_7_numerator,
pre_gmv_1_7_denominator) AS cov_pre_1_7_num_den,
-- Covariate aggregates: pre_gmv_8_14 (numerator/denominator)
AVG(pre_gmv_8_14_numerator) AS mu_pre_8_14_num,
AVG(pre_gmv_8_14_denominator) AS mu_pre_8_14_den,
VAR_SAMP(pre_gmv_8_14_numerator) AS var_pre_8_14_num,
VAR_SAMP(pre_gmv_8_14_denominator) AS var_pre_8_14_den,
COVAR_SAMP(pre_gmv_8_14_numerator,
pre_gmv_8_14_denominator) AS cov_pre_8_14_num_den,
-- Covariate ↔ covariate (needed to solve for the multi-covariate CUPED weights)
COVAR_SAMP(pre_gmv_1_7_numerator, pre_gmv_8_14_numerator) AS cov_pre_1_7_num__pre_8_14_num,
-- Cross-aggregates: metric ↔ covariate components
COVAR_SAMP(gmv_numerator, pre_gmv_1_7_numerator) AS cov_gmv_num__pre_1_7_num,
COVAR_SAMP(gmv_numerator, pre_gmv_1_7_denominator) AS cov_gmv_num__pre_1_7_den,
COVAR_SAMP(gmv_denominator, pre_gmv_1_7_numerator) AS cov_gmv_den__pre_1_7_num,
COVAR_SAMP(gmv_denominator, pre_gmv_1_7_denominator) AS cov_gmv_den__pre_1_7_den,
COVAR_SAMP(gmv_numerator, pre_gmv_8_14_numerator) AS cov_gmv_num__pre_8_14_num,
COVAR_SAMP(gmv_numerator, pre_gmv_8_14_denominator) AS cov_gmv_num__pre_8_14_den,
COVAR_SAMP(gmv_denominator, pre_gmv_8_14_numerator) AS cov_gmv_den__pre_8_14_num,
COVAR_SAMP(gmv_denominator, pre_gmv_8_14_denominator) AS cov_gmv_den__pre_8_14_den
FROM per_user
GROUP BY arm
),
-- Pooled (control + treatment) for CUPED weights β
pooled AS (
SELECT
'pooled' AS arm,
COUNT(*) AS n,
AVG(gmv_numerator) AS mu_gmv_numerator,
AVG(gmv_denominator) AS mu_gmv_denominator,
VAR_SAMP(gmv_numerator) AS var_gmv_numerator,
VAR_SAMP(gmv_denominator) AS var_gmv_denominator,
COVAR_SAMP(gmv_numerator, gmv_denominator) AS cov_gmv_num_gmv_den,
AVG(pre_gmv_1_7_numerator) AS mu_pre_1_7_num,
AVG(pre_gmv_1_7_denominator) AS mu_pre_1_7_den,
VAR_SAMP(pre_gmv_1_7_numerator) AS var_pre_1_7_num,
VAR_SAMP(pre_gmv_1_7_denominator) AS var_pre_1_7_den,
COVAR_SAMP(pre_gmv_1_7_numerator,
pre_gmv_1_7_denominator) AS cov_pre_1_7_num_den,
AVG(pre_gmv_8_14_numerator) AS mu_pre_8_14_num,
AVG(pre_gmv_8_14_denominator) AS mu_pre_8_14_den,
VAR_SAMP(pre_gmv_8_14_numerator) AS var_pre_8_14_num,
VAR_SAMP(pre_gmv_8_14_denominator) AS var_pre_8_14_den,
COVAR_SAMP(pre_gmv_8_14_numerator,
pre_gmv_8_14_denominator) AS cov_pre_8_14_num_den,
COVAR_SAMP(pre_gmv_1_7_numerator, pre_gmv_8_14_numerator) AS cov_pre_1_7_num__pre_8_14_num,
COVAR_SAMP(gmv_numerator, pre_gmv_1_7_numerator) AS cov_gmv_num__pre_1_7_num,
COVAR_SAMP(gmv_numerator, pre_gmv_1_7_denominator) AS cov_gmv_num__pre_1_7_den,
COVAR_SAMP(gmv_denominator, pre_gmv_1_7_numerator) AS cov_gmv_den__pre_1_7_num,
COVAR_SAMP(gmv_denominator, pre_gmv_1_7_denominator) AS cov_gmv_den__pre_1_7_den,
COVAR_SAMP(gmv_numerator, pre_gmv_8_14_numerator) AS cov_gmv_num__pre_8_14_num,
COVAR_SAMP(gmv_numerator, pre_gmv_8_14_denominator) AS cov_gmv_num__pre_8_14_den,
COVAR_SAMP(gmv_denominator, pre_gmv_8_14_numerator) AS cov_gmv_den__pre_8_14_num,
COVAR_SAMP(gmv_denominator, pre_gmv_8_14_denominator) AS cov_gmv_den__pre_8_14_den
FROM per_user
)
SELECT * FROM arm
UNION ALL
SELECT * FROM pooled;
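With those three rows of aggregates (control, treatment, pooled) in hand, the app layer can finish CUPED without touching user-level data. Below is a minimal sketch of that step in Python, simplified to a per-user mean metric (denominator ≡ 1) with the two pre-period covariates; it assumes the query’s output columns (including the covariate ↔ covariate covariance) have been loaded into plain dicts, and it omits the extra linearization you’d need for true ratio metrics.

import numpy as np

def cuped_ate_se(arm_rows):
    """arm_rows maps 'control' / 'treatment' / 'pooled' to a dict of the query's output columns."""
    pooled = arm_rows["pooled"]

    # CUPED weights: theta = (pooled covariate covariance)^-1 (pooled metric-covariate covariance)
    sxx = np.array([
        [pooled["var_pre_1_7_num"], pooled["cov_pre_1_7_num__pre_8_14_num"]],
        [pooled["cov_pre_1_7_num__pre_8_14_num"], pooled["var_pre_8_14_num"]],
    ])
    sxy = np.array([pooled["cov_gmv_num__pre_1_7_num"], pooled["cov_gmv_num__pre_8_14_num"]])
    theta = np.linalg.solve(sxx, sxy)
    mu_x_pooled = np.array([pooled["mu_pre_1_7_num"], pooled["mu_pre_8_14_num"]])

    def adjusted(row):
        # CUPED-adjusted mean and its variance, using only this arm's moments
        mu_x = np.array([row["mu_pre_1_7_num"], row["mu_pre_8_14_num"]])
        s_xx = np.array([
            [row["var_pre_1_7_num"], row["cov_pre_1_7_num__pre_8_14_num"]],
            [row["cov_pre_1_7_num__pre_8_14_num"], row["var_pre_8_14_num"]],
        ])
        s_xy = np.array([row["cov_gmv_num__pre_1_7_num"], row["cov_gmv_num__pre_8_14_num"]])
        mean = row["mu_gmv_numerator"] - theta @ (mu_x - mu_x_pooled)
        var = row["var_gmv_numerator"] - 2 * theta @ s_xy + theta @ s_xx @ theta
        return mean, var / row["n"]

    mean_t, vmean_t = adjusted(arm_rows["treatment"])
    mean_c, vmean_c = adjusted(arm_rows["control"])
    ate = mean_t - mean_c
    se = float(np.sqrt(vmean_t + vmean_c))
    return ate, se

The entire input is the handful of numbers the warehouse emitted; everything else is a few small matrix operations.

Subsampling the covariate aggregates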
Even after you cache pre-period features and move CUPED into the warehouse, there’s one more easy win: estimate the CUPED weights on a sample, then apply them to the full experiment.
CUPED needs one thing learned from data: a small set of weights (one per covariate) that tell you how much to subtract from the metric. Those weights come from simple aggregates (means, variances, covariances). But do you really need ALL users to get those estimates? If you have 1M users in your experiment, do you really have to scan weeks of history for every one of them just to estimate the weights? The answer is no.
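As a sanity check, here is a minimal simulation sketch (hypothetical data, a single covariate for brevity) showing that a weight estimated on a 10% random sample of users gives essentially the same CUPED-adjusted estimate and standard error as the full-sample weight.

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

pre = rng.gamma(2.0, 15.0, size=n)                        # pre-period covariate (e.g., prior GMV)
treat = rng.integers(0, 2, size=n)
y = 0.6 * pre + 1.5 * treat + rng.normal(0, 20, size=n)   # experiment metric with a small lift

def cuped_diff_se(theta):
    # Apply a given CUPED weight to ALL users and report the adjusted ATE and its SE
    y_adj = y - theta * (pre - pre.mean())
    diff = y_adj[treat == 1].mean() - y_adj[treat == 0].mean()
    se = np.sqrt(y_adj[treat == 1].var(ddof=1) / (treat == 1).sum()
                 + y_adj[treat == 0].var(ddof=1) / (treat == 0).sum())
    return diff, se

theta_full = np.cov(pre, y)[0, 1] / pre.var(ddof=1)       # weight from ALL users

sample = rng.random(n) < 0.10                             # 10% of randomization units
theta_sub = np.cov(pre[sample], y[sample])[0, 1] / pre[sample].var(ddof=1)

print("full-sample weight:", cuped_diff_se(theta_full))
print("10%-sample weight :", cuped_diff_se(theta_sub))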
Why this works
- Aggregates stabilize fast: We’re not fitting a complex model; we’re estimating a handful of averages and correlations. If your experiment has 1M users, a 20% sample is still 200k users, plenty for stable CUPED aggregates over the historical data.
- The estimate stays honest: The sampling is random and done at the randomization unit, which keeps the CUPED-adjusted estimate unbiased.
- Precision is basically the same: The final statistics end up very close to the full-sample version, typically within 2–3%, even for aggregates like count distinct. Aggregates like min/max, however, don’t survive sampling.
- Cost drops a lot: Computing user-level pre-period aggregates for multiple covariates is the pricey part, and sampling shrinks exactly that.
Example: here are some results showing the impact of subsampling on the same data generation process. Here I use the delta method and make one small change: the CUPED inputs are computed in the warehouse on a much smaller sample of users.
You can see that the ATE and SE are very close to the model that computes CUPED by scanning ALL the data.
Conclusion: The real win is time
Cost matters, but what compounds is speed: how quickly a question turns into an answer, and an answer into the next question. Slow pipelines create a curiosity tax; people stop poking at outliers, wait for the nightly job, and make blunter decisions. Use the four tactics above and that tax disappears. Results show up while the ramp is still live, PMs iterate the same day, analysts chase the interesting insights, and engineers ship smaller, safer ramps because they can see what is happening in real time. If you want your experimentation pipelines to scale efficiently without keeping users waiting, lean on the tactics above.
