Lester Leong
Your A/B Test Didn't Fail. You Measured the Wrong Metric.
The Real Reason Experiments Fail
Most A/B tests do not fail because the hypothesis was wrong. They fail because the team measured the wrong thing.
The pattern is remarkably consistent. A product team develops a credible hypothesis about user behavior. They design a clean experiment. They implement it correctly. They run it for a reasonable duration. They get a null result. They conclude the hypothesis was wrong and move on to the next idea.
But the hypothesis was often right. The metric was wrong. And because the team conflated "no measurable effect on this metric" with "no effect at all," they abandoned an intervention that was actually working. This is not a statistical error. It is a design error that occurs before a single user enters the experiment.
Metric selection is the most under-discussed skill in experimentation. Teams spend hours on hypothesis formation, sample size calculations, and statistical methodology. They spend minutes on whether the metric they chose actually captures the behavior they care about. That imbalance explains more failed experiments than any issue with p-values or confidence intervals.
The Time Horizon Problem
Consider a concrete example. A product team believes that a redesigned onboarding flow will produce more engaged long-term users. They pick "completed onboarding" as their primary metric. They run the test for one week. No significant difference. Hypothesis rejected.
The problem is not the hypothesis. It is the mismatch between the metric and the behavior. "Completed onboarding" measures whether users finished a sequence of steps. It does not measure whether they became engaged. A user can complete onboarding efficiently and never return. A user can abandon onboarding at step three and still become a power user through a different entry point.
The behavior the team actually cared about (users becoming regular, engaged customers) takes three weeks to manifest. A seven-day observation window on a three-week behavior is not an experiment. It is noise wrapped in a confidence interval.
This time horizon mismatch is the most common form of metric selection failure. Teams choose metrics they can observe quickly because organizational patience for experiments is limited. But selecting a metric based on convenience rather than accuracy does not produce faster learning. It produces false negatives that waste far more time than running a longer experiment would have.
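To make that concrete, here is a minimal simulation of the onboarding example, with all effect sizes invented: the redesign genuinely lifts week-3 engagement but does nothing to day-7 onboarding completion. The convenient metric has no true effect to detect; the metric matched to the behavior does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000  # users per arm; illustrative only

# Assumed effects: the redesign lifts week-3 engagement from 20% to
# 23% but leaves day-7 onboarding completion (62%) untouched.
p_complete = 0.62
p_engage_ctrl, p_engage_treat = 0.20, 0.23

complete_ctrl = rng.binomial(1, p_complete, n)
complete_treat = rng.binomial(1, p_complete, n)
engage_ctrl = rng.binomial(1, p_engage_ctrl, n)
engage_treat = rng.binomial(1, p_engage_treat, n)

def two_prop_p(a, b):
    """Two-sided two-proportion z-test p-value."""
    pooled = (a.sum() + b.sum()) / (len(a) + len(b))
    se = np.sqrt(pooled * (1 - pooled) * (1 / len(a) + 1 / len(b)))
    z = (b.mean() - a.mean()) / se
    return 2 * stats.norm.sf(abs(z))

print(f"day-7 completion:  p = {two_prop_p(complete_ctrl, complete_treat):.3f}")
print(f"week-3 engagement: p = {two_prop_p(engage_ctrl, engage_treat):.2e}")
```

In expectation, the week-3 comparison is decisive and the day-7 comparison looks like chance. Same users, same intervention, opposite conclusions.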
The Leading Indicator Trap
Experienced teams recognize the time horizon problem and attempt to solve it with leading indicators. Instead of measuring 30-day retention directly, they identify an early behavior (say, "completed three sessions in the first week") that correlates with long-term retention. In principle, this is sound. In practice, it introduces a new failure mode.
The leading indicator must actually predict the lagging outcome. This sounds obvious, but the validation step is routinely skipped. Teams pick a leading indicator based on intuition or a single historical analysis, then treat it as ground truth for every subsequent experiment. But the relationship between early behaviors and long-term outcomes is not static. It shifts as the product evolves, as the user base changes, and as the competitive landscape moves.
A leading indicator that predicted 12-month retention in Q1 may be meaningless by Q3 if you shipped a major feature update that changed how new users engage. Every leading indicator has a shelf life. Teams that do not periodically revalidate their proxies against actual outcomes are running experiments against a metric that no longer means what they think it means.
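Revalidation does not need to be elaborate. Here is a sketch, assuming a table with one row per user and entirely hypothetical column names, that recomputes the proxy-to-outcome relationship per signup cohort:

```python
import pandas as pd

# Assumed schema: one row per user with a signup date, the leading
# indicator (three_sessions_wk1), and the lagging outcome
# (retained_12mo). All names here are invented for illustration.
users = pd.read_csv("users.csv", parse_dates=["signup_date"])
users["cohort"] = users["signup_date"].dt.to_period("Q")

def proxy_health(g):
    # How strongly the early behavior separates retained users from
    # churned users within a single signup cohort.
    hit = g.loc[g["three_sessions_wk1"] == 1, "retained_12mo"].mean()
    miss = g.loc[g["three_sessions_wk1"] == 0, "retained_12mo"].mean()
    return pd.Series({
        "corr": g["three_sessions_wk1"].corr(g["retained_12mo"]),
        "lift": hit / miss,
    })

# A downward trend across cohorts means the proxy has expired.
print(users.groupby("cohort").apply(proxy_health))
```

If the correlation or lift decays quarter over quarter, stop trusting the indicator before it quietly invalidates your next experiment.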
The Metric Selection Protocol
The fix is not complicated. It requires discipline, not sophistication. Before any experiment, write down three things.
The behavior you expect to change. Not the metric. The actual human behavior. "Users will integrate the product into their weekly workflow" is a behavior. "Day-7 retention" is a metric. Start with the behavior. If you cannot articulate the behavior you expect to change, your hypothesis is not ready for testing.
The metric that captures that behavior. Now translate the behavior into something measurable. The translation will always be lossy, and acknowledging that loss is important. "Day-7 retention" captures some aspects of "weekly workflow integration" but misses others. A user who logs in on day 7 to check a notification is counted the same as a user who completes a 45-minute workflow. If the distinction matters for your hypothesis, you need a more precise metric.
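When the distinction matters, the fix is usually a stricter event definition, not a new data source. A sketch, with an invented session schema and arbitrary thresholds standing in for whatever "real workflow use" means in your product:

```python
import pandas as pd

# Assumed session log: user_id, days_since_signup, session length in
# minutes, and a count of core workflow actions. The schema and the
# 10-minute / 3-action thresholds are placeholders, not prescriptions.
sessions = pd.read_csv("sessions.csv")
day7 = sessions[sessions["days_since_signup"] == 7]

# Loose metric: any day-7 session counts, including the user who
# opens the app only to dismiss a notification.
logged_in = set(day7["user_id"])

# Tighter metric: only sessions that plausibly reflect workflow use.
deep = day7[(day7["minutes"] >= 10) & (day7["core_actions"] >= 3)]
workflow_users = set(deep["user_id"])

print(f"day-7 login:    {len(logged_in)} users")
print(f"day-7 workflow: {len(workflow_users)} users")
```

The gap between those two counts is the loss your metric translation is hiding.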
The time horizon required to observe it. How long does the behavior take to manifest? If your hypothesis is about long-term engagement, you need a long-term observation window. If organizational patience will not tolerate that window, find a leading indicator, but validate it first. Run the correlation analysis. Confirm that the leading indicator actually predicts the lagging outcome in your current product and current user base.
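The same historical data can tell you how short the window can safely get. One approach, again with hypothetical column names: correlate progressively earlier signals against the long-run outcome, and take the shortest window where the relationship plateaus.

```python
import pandas as pd

# Assumed per-user table: active_day_k flags (0/1) marking whether
# the user was still active through day k, plus the 90-day outcome.
# Column names and the 90-day horizon are illustrative.
users = pd.read_csv("user_activity.csv")

for k in (3, 7, 14, 21, 28):
    r = users[f"active_day_{k}"].corr(users["retained_90d"])
    print(f"day-{k:>2} signal vs 90-day retention: r = {r:.2f}")

# Use the shortest window where r has plateaued; anything shorter
# trades experiment runtime for false negatives.
```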
This three-part protocol takes thirty minutes. It prevents the most common and most expensive experimentation failures. The return on that time investment is asymmetric: thirty minutes of design against weeks of runtime spent producing a null nobody can interpret.
Where the Leverage Lives
The best experimentation teams I have observed spend more time on metric selection than on any other part of the process. They debate whether a metric captures what they actually care about. They document the assumptions embedded in their metric choice. They revisit those assumptions when results are surprising.
This is not perfectionism. It is pragmatism. A well-chosen metric makes the experiment interpretable regardless of the outcome. A positive result means something specific. A null result means something specific. A poorly chosen metric makes every outcome ambiguous, and ambiguous results drive no decisions.
If your experimentation program is producing frequent null results that feel unsatisfying, the problem is probably not your hypotheses. Look at your metrics. The answer is usually there.