A/B Testing: How Misaligned Incentives Can Sabotage Your Experimentation Culture

Stas Sajin
9 min read · Feb 24, 2023


A/B testing, also known as split testing, is a popular method used by companies to improve the performance of their products. It involves comparing two versions of a design, feature, or content to determine which one performs better. Despite its popularity, A/B testing programs often fail to deliver the intended impact.

Misaligned incentives

Most experiments you run test a change to which you have a personal attachment. You are ultimately trying to increase some number, receive a big congratulatory pat on the back, and feel validation for a job well done. Will you truly be a passive observer, a good scientist, and simply accept restarting the work if the numbers tell an uncomfortable story? Will you truly be impartial, not influenced by incentives and recognition? Are you able to set aside that uncomfortable feeling when you inform your stakeholders about the time and resources that were wasted? I don’t think so.

In academia, there is a strong incentive to “publish or perish,” and there are untold stories of questionable research practices and even outright fraud. What folks don’t discuss enough is that in industry this cultural phenomenon is probably worse: incentives are more short-term and more tangible, pressure is higher, and people may be less scrupulous and more prone to misusing statistics. Because of misaligned incentives, the result of almost any experiment can be massaged until it looks significant. For example, Simmons et al. (2011) ran simulations showing that with just four analysis choices available, nearly anything can be made to look statistically significant (a toy simulation of the same effect follows the list):

  • Testing for two correlated metrics, but reporting only on the one that looks better.
  • Peeking sequentially at data.
  • Adding a covariate.
  • Trying three variants and reporting only on the successful one.
Source: Simmons, Nelson, and Simonsohn (2011), “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.”
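
To make the mechanism concrete, here is a minimal simulation sketch of the same effect (my own toy example, not taken from the paper): run A/A tests where there is no true difference, and compare a single pre-specified analysis against a “flexible” analyst who checks two correlated metrics and also peeks at half the data. All numbers and metric choices are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims, alpha = 500, 2_000, 0.05

def pval(a, b):
    return stats.ttest_ind(a, b).pvalue

hits_single = hits_flexible = 0
for _ in range(sims):
    # A/A test: no true effect; each user has two correlated metrics
    cov = [[1.0, 0.7], [0.7, 1.0]]
    control = rng.multivariate_normal([0, 0], cov, size=n)
    treatment = rng.multivariate_normal([0, 0], cov, size=n)

    # Pre-specified analysis: metric 0 on the full sample only
    hits_single += pval(control[:, 0], treatment[:, 0]) < alpha

    # "Flexible" analysis: either metric, plus an interim peek at half the data
    ps = [
        pval(control[:, 0], treatment[:, 0]),
        pval(control[:, 1], treatment[:, 1]),
        pval(control[: n // 2, 0], treatment[: n // 2, 0]),
        pval(control[: n // 2, 1], treatment[: n // 2, 1]),
    ]
    hits_flexible += min(ps) < alpha

print(f"false positive rate, pre-specified: {hits_single / sims:.3f}")   # ~0.05
print(f"false positive rate, flexible:      {hits_flexible / sims:.3f}")  # noticeably higher
```

Even this modest amount of flexibility pushes the false positive rate well above the nominal 5%; add a few more degrees of freedom and you can “find” an effect almost every time.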

I can tell you that these four choices are just a few of the many options an analyst has when examining results. I know of a dozen other ways to make results statistically significant: transforming the metric, handling outliers, trying different combinations of CUPED covariates, segmenting data, restarting experiments, trying different statistical methodologies (vanilla OLS vs complex LMM), introducing interference, etc.

The unfortunate outcome of this flexibility is that you end up with a lot of experiment “wins” that don’t really do anything. Your north star metric will remain flat year-over-year despite all the reported cumulative improvements suggesting you’ve increased revenue by 300%. Moreover, because A/B testing carries an appeal of rigor, shrouded in science and statistics and treated as the gold standard, exceptionally few internal stakeholders will be willing to point out that the emperor has no clothes. This means you might go through a few years before you call it quits and recognize that something needs to be done to fix the pathology caused by misaligned incentives.

Solution

One solution to this problem might be to have a governing body that ultimately has a say in an experiment’s outcome. In other words, “you do the work, we do the analysis and decide the outcome.” Although this might sound like it would work, it’s not scalable and creates far too many organizational bottlenecks. You might add a 30–50% delay to decisions because you’re waiting for someone to say “ship it.” Moreover, you need and want people to be somewhat invested in outcomes, since they are the same people who come up with new ideas and follow-ups to existing experiments. I can hardly see a dispassionate centralized authority being creative or nuanced enough in making ship decisions.

A solution that I personally prefer, largely for the peace of mind it provides, is to preregister the experiment design. In writing, I lay out every choice I make about how the data will be analyzed, which metrics I will look at, how outliers will be handled, and even the code that will run the analysis. I like this solution because I hate feeling like a fraud simply because I engaged in post hoc analyses and can’t remember which analysis I actually planned to use. By writing things down before the test, I get more precise feedback from stakeholders, improve the quality of my design, and report effects that are more likely to be real (a minimal sketch of what such a record can look like follows). But even then, I still think most people will succumb to peer pressure and revert toward an experimentation culture lacking trust. Your honesty and integrity can easily crumble if those around you get more recognition for engaging in dubious practices.
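
As a concrete illustration, here is a lightweight sketch of what a preregistration record can look like (every field name and value is invented; this is not a prescribed format or tool): write the plan down as structured data before the experiment starts and record a hash of it, so any later deviation from the plan is at least visible.

```python
import hashlib
import json
from datetime import date

# Illustrative pre-registration record; all field names and values are made up.
plan = {
    "experiment": "checkout_button_color_v2",
    "start_date": str(date(2023, 3, 1)),
    "primary_metric": "orders_per_visitor",
    "guardrail_metrics": ["p90_page_load_ms", "refund_rate"],
    "outlier_handling": "winsorize order value at the 99.5th percentile",
    "test": "one-sided two-sample t-test on the treated population",
    "alpha": 0.05,
    "power": 0.80,
    "min_detectable_effect": 0.02,
    "planned_runtime_days": 14,
    "analysis_code": "analyses/checkout_button_color_v2.py",
}

# Freeze the plan: any later change to the analysis shows up as a hash mismatch.
blob = json.dumps(plan, sort_keys=True).encode()
print("pre-registration hash:", hashlib.sha256(blob).hexdigest())
```

Committing this file and its hash alongside the experiment config is usually enough; the point is not enforcement but making deviations visible and discussable.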

Feeling blocked and bottlenecked

One other observation I see across some companies that use A/B testing is that they run their tests for far too long, to the point where experimentation capacity becomes a bottleneck on how much work can get done. Analytics-Toolkit published a meta-analysis of 1,000 tests in which the median test duration was 30 days (very similar to what folks report at many other companies). That is far too long to reach a decision, and it creates large swaths of idle time within teams. You have the capacity and desire to iterate on five ideas a month, but because you’re waiting on an A/B test, you have to pick just one to implement. Moreover, because you’ve put all your eggs in one basket, you’ll feel more pressure to make the analysis results look “right” and p-hack your way to a successful outcome.

Personally, I think the majority of A/B tests should take no more than two weeks to run, with the exception of holdouts, tests that are one-way doors, or tests that can create UX thrash. The opportunity cost of not trying new things is just much higher than the cost of occasionally shipping something that doesn’t work as expected.

Solution

I think the industry has largely trodden the same path of recommending some form of residualized regression (see CUPED, CUPAC, MLRATE, etc.). But beyond using covariates to reduce variance, there are many more levers available, which I list below.
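
For reference, the core of CUPED itself is small: regress the in-experiment metric on a pre-experiment covariate and analyze the adjusted metric, which cuts variance roughly by the squared correlation between the two. A minimal sketch with synthetic data (the numbers are invented):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED adjustment: y - theta * (x - mean(x)), where x is a
    pre-experiment covariate (e.g. the same metric in the prior period)."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Synthetic example: pre-period metric correlated with the in-experiment metric
rng = np.random.default_rng(1)
pre = rng.normal(10, 2, size=10_000)
post = 0.8 * pre + rng.normal(0, 1, size=10_000)

adjusted = cuped_adjust(post, pre)
print("variance before:", round(post.var(), 2), "after:", round(adjusted.var(), 2))
```

(In practice theta is estimated on the pooled data across arms, and the adjusted metric feeds into whatever test you were already running.)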

  • Frequentist defaults are inherently conservative: in many tests you implicitly trade off false positives against false negatives at 1:4 (i.e., an alpha of 5% with 80% power). That’s not the right choice for many experiments, which would benefit more from balanced error rates (see the sketch after this list). See Justify Error Rates by Lakens.
  • Moreover, most hypotheses we test are directional, yet we still shoot ourselves in the foot by defaulting to two-sided tests.
  • We also often track insensitive metrics in experiments. MAU as the primary metric for a two-week experiment is unlikely to move much. Simply by being smarter about metric choice, you can gain a lot of sensitivity.
  • We also over-track events and dilute effects by not restricting the analysis to the population that was actually exposed to the treatment.
  • Or we use an inappropriate design that doesn’t control for observed covariates. See the fully blocked randomized design by King et al.
  • Or use conditional power or predictive power metrics to identify “lost causes”: experiments where, halfway through your planned sample size, the results look flat and you should just stop the test and move on to the next idea.
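
To put rough numbers on the first two bullets alone, here is a quick sketch using statsmodels’ power solver; the effect size is illustrative, and the “balanced” configuration (alpha = beta = 20%) is just one defensible choice, not a universal recommendation.

```python
from statsmodels.stats.power import zt_ind_solve_power

effect_size = 0.05  # standardized effect (Cohen's d); purely illustrative

# Conventional setup: alpha = 5%, power = 80%, two-sided
n_default = zt_ind_solve_power(effect_size=effect_size, alpha=0.05,
                               power=0.80, alternative="two-sided")

# Balanced error rates (alpha = beta = 20%) with a directional, one-sided test
n_relaxed = zt_ind_solve_power(effect_size=effect_size, alpha=0.20,
                               power=0.80, alternative="larger")

print(f"per-arm sample size, conventional:         {n_default:,.0f}")
print(f"per-arm sample size, balanced + one-sided: {n_relaxed:,.0f}")
print(f"reduction: {1 - n_relaxed / n_default:.0%}")
```

On its own this cuts the required sample size by roughly two thirds; combine it with a more sensitive metric and covariate adjustment, and the compounding gains are where the claimed order-of-magnitude improvement comes from.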

All of the choices above, when implemented reliably, can lead to a 10x improvement in sensitivity, which materializes as a more efficient use of time and resources. You also won’t feel so bad when one of your ideas fails, because you have four more chances to land a successful one.

Some critics might say that this increases the false positive rate, but my response is to stop thinking about A/B testing in the context of single experiments. Instead, think like Markowitz, or like a gambler who walks into a casino knowing that with even a tiny edge, the payoff across many trials compounds in their favor.
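
A toy simulation of that portfolio view (every number here is invented) compares a cautious program running a few long, conservative tests against a high-velocity program running many short tests with looser error rates; the quantity that matters is the expected value of the whole portfolio, not the correctness of any single call.

```python
import numpy as np

rng = np.random.default_rng(7)
sims = 10_000

def program_value(n_tests, p_true_effect, power, alpha, avg_lift, avg_harm):
    """Toy model: expected shipped value for a portfolio of tests.
    True winners are caught with probability `power`; nulls are falsely
    shipped with probability `alpha` and cost `avg_harm` each."""
    has_effect = rng.random((sims, n_tests)) < p_true_effect
    true_ships = has_effect & (rng.random((sims, n_tests)) < power)
    false_ships = ~has_effect & (rng.random((sims, n_tests)) < alpha)
    value = true_ships.sum(axis=1) * avg_lift - false_ships.sum(axis=1) * avg_harm
    return value.mean()

# A few long, conservative tests vs. many short, looser ones (illustrative numbers)
slow = program_value(n_tests=12, p_true_effect=0.2, power=0.8, alpha=0.05,
                     avg_lift=1.0, avg_harm=0.3)
fast = program_value(n_tests=60, p_true_effect=0.2, power=0.7, alpha=0.20,
                     avg_lift=1.0, avg_harm=0.3)
print(f"expected value per year, conservative program:  {slow:.1f}")
print(f"expected value per year, high-velocity program: {fast:.1f}")
```

With these made-up parameters the high-velocity program comes out well ahead even though each individual decision is noisier; the assumption doing the work is that a bad ship costs less than a good ship gains, which is worth checking for your own product.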

Ignoring new users

Meta is well known for its A/B testing program. Although I thoroughly enjoy reading some of their pioneering work in this area, I can’t help but feel they have reached a point of local optimization where new users don’t get as much coverage as existing users. When I tried to use their products after several years away, I just couldn’t onboard to anything because everything was filled with ads or noise. For almost any consumer-facing product out there, the interface keeps getting more bloated, effectively killing platform growth. Let me give another example, so as not to pick only on Meta. Google has increased the number of ads on its platform, presumably reasoning that the tradeoff between ad revenue and product adoption looks good. But that might be true only for existing users. A user who comes from DuckDuckGo, where they might see half as many ads, might find it very difficult to onboard to this experience. Today, for example, I never trust Google with searches related to reviews or purchasing decisions because even the non-ads look and feel like ads.

My experience on Google

A/B testing focus is partly to blame for this local optimization. There is probably someone in the company presenting a slide with A/B test results suggesting that if you add one more ad, you get $100B in revenue with no negative impact. Everyone is excited, and nobody asks whether a new user will like that experience. If someone does ask, the answer is that “new users are only 0.1% of our user base and we couldn’t even track their performance in an A/B test.” I find this perspective very shortsighted, because the cumulative impact of that type of decision-making brings the company’s growth to zero within 2–4 years.

Solution

Always track new user adoption and experience in experiments, even if that segment looks too small for anyone to care about. Moreover, given that most A/B tests exist to add features, try to reserve a portion of your tests for removing things and decluttering your product. One product that has done an exceptionally good job here is Notion, and maybe more of us should follow its philosophy of intuitive design that feels natural while still allowing a range of user-customizable experiences.

There is too much snacking

Let’s assume you’re getting really good at A/B testing. Maybe you’ve become part of the tier of companies that run 1000 A/B tests a day. What you’ll find is that most of those A/B tests aren’t really testing for anything useful and represent small configuration changes.

I personally like these kinds of tests because they are good learning opportunities for finding the optimal decision boundary in a system. Done smartly, they can have a very good ROI. Unfortunately, because you’ve made A/B testing so easy, everyone in your company or team is always just “snacking,” addicted to quick wins the way you get addicted to junk food. You’ll find that folks keep nudging one config value to another until they get a positive outcome, but no real work happens. How do you get out of that local maximum?

Solution

I don’t know what the optimal split between real work and “snacking” is. Because snacking can also provide extremely attractive wins, you don’t want to discourage the behavior entirely. Nonetheless, if you fundamentally don’t see any major product development from a team and all you see is snacking, it might be useful to realign incentives so that the team is explicitly asked to take on more meaningful long-term work. Moreover, I recommend delegating snacking projects to newer or more junior members of the team, since they provide good learning opportunities. If you have a bunch of Staff+ engineers who only deliver impact through snacking, your incentives are probably not aligned well, because they are not working on the more challenging problems.

Takeaways

I generally find that if you do A/B testing long enough, you can become disillusioned with the overall process. Even with all the supposed rigor and objectivity, humans are ultimately involved in decision-making, and they’ll find a way to bring subjectivity back. Overall, if you want to see value in experimentation, I suggest focusing on the following:

  • Set up a pre-registration program to minimize the “everything looks significant” pathology. The more you deviate from the original experimental design, the less trust you should put in the results. Required pre-registration can help alleviate the issue of misaligned incentives.
  • Improve experimentation velocity. A good mandate is to run most experiments in under two weeks. Exceptions should obviously exist, but the objective should be to reach a decision fast and minimize opportunity costs. This also reduces the risk of misaligned incentives, since folks get more opportunities to try out their ideas.
  • Pay attention to new user metrics. Even if those metrics are only directional and have higher variance, simply by tracking them you might find that you’re harming new user adoption because existing users carry too much weight in your analysis.
  • Don’t eliminate snacking; balance it against real work. I personally encourage snacking to be done by new or junior team members, since it presents good learning opportunities.
