Math, Applied
When Everything Wins Once: Running Many A/B Tests
The idea
One A/B test at 95% confidence accepts a 5% false alarm rate when there is no real lift. Run twenty independent null tests and luck adds up: the chance of at least one false winner is about 64%, even though nothing actually works.
Growth teams feel this every quarter: a sprint full of tests, a few winners, and a roadmap built on noise. The single-test readout posts still apply. This post is about what changes when you run many of them at once.
The multiple-test view answers one question: if nothing worked, how many false wins should we still expect?
Example: how many false wins should you expect?
Assume every test is a true null (no real lift). At a 5% significance threshold per test, running many tests still produces winners by luck. The odds depend on two inputs: the number of tests and the per-test threshold.
Scenario: a team runs 20 homepage tests in one quarter. Chance of ≥1 false win: 64%. Expected false wins: 1.0. (In the chart, a dark bar marks a null test that still looks like a winner.)
With 20 independent null tests at 5% each, you have about a 64% chance of celebrating at least one false win. Expected false alarms: ~1.0.
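The 64% figure can be sanity-checked with a small simulation. This is a minimal sketch, assuming each null test behaves like a well-calibrated test whose p-value is uniform on [0, 1], so it shows a false win with probability α. The function name, the number of simulated quarters, and the seed are illustrative choices, not part of the original example.

```python
import random


def simulate_quarters(n_tests=20, alpha=0.05, n_quarters=20_000, seed=0):
    """Simulate quarters in which every test is a true null.

    Assumption: under a true null each test independently shows a
    false win with probability alpha.
    Returns (share of quarters with >= 1 false win,
             average number of false wins per quarter).
    """
    rng = random.Random(seed)
    quarters_with_false_win = 0
    total_false_wins = 0
    for _ in range(n_quarters):
        # Count the null tests that happen to clear the threshold by luck.
        false_wins = sum(rng.random() < alpha for _ in range(n_tests))
        total_false_wins += false_wins
        if false_wins > 0:
            quarters_with_false_win += 1
    return quarters_with_false_win / n_quarters, total_false_wins / n_quarters


share, mean_false_wins = simulate_quarters()
print(f"Quarters with at least one false win: {share:.1%}")
print(f"Average false wins per quarter: {mean_false_wins:.2f}")
```

With enough simulated quarters, the two printed numbers should land close to 64% and 1.0.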
The math
Per-test false alarm
P(false win | no real lift) = α. Here α is the threshold you set per test. At 5%, a true null still looks like a winner about one time in twenty.
Across n independent tests
P(at least one false win) = 1 − (1 − α)^n. With α = 0.05 and n = 20: 1 − 0.95^20 ≈ 64%. You are more likely than not to see at least one false winner if every test is a null.
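The same calculation takes a few lines of code. The sketch below assumes independent tests that are all true nulls; both function names are hypothetical helpers introduced for illustration.

```python
def prob_any_false_win(n_tests, alpha=0.05):
    """Chance of at least one false win across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def tests_until_likely(alpha=0.05, target=0.5):
    """Smallest number of null tests whose chance of a false win exceeds target."""
    n_tests = 1
    while prob_any_false_win(n_tests, alpha) <= target:
        n_tests += 1
    return n_tests


print(f"20 tests at 5%: {prob_any_false_win(20):.0%}")  # about 64%
print(f"Tests needed to pass 50%: {tests_until_likely()}")  # 14
```

At α = 0.05 the odds of at least one false winner pass 50% at 14 tests, well before a 20-test quarter is done.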
Average false alarms
E[false wins] = n × α. Twenty tests at 5% produce about one false win on average. Some quarters you see zero, some you see two or three. That is why pre-registering a primary metric matters.
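To see how often "zero" versus "two or three" actually happens, the count of false wins can be treated as binomial. This is a sketch under the same assumption of independent null tests; `false_win_pmf` is a hypothetical helper name.

```python
from math import comb


def false_win_pmf(k, n_tests=20, alpha=0.05):
    """Probability of exactly k false wins among n independent null tests."""
    return comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)


for k in range(5):
    print(f"{k} false wins: {false_win_pmf(k):.1%}")
```

For 20 tests at 5%, this gives roughly 36% for zero false wins, 38% for one, 19% for two, and 6% for three, so a quarter with two or more false winners is not rare.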
A simple application: experiment sprints
Name one primary metric before the sprint. Treat secondary tests as directional. Hold each winner to the same interval-separation and sample-size bar you would apply to a single test. When many tests run at once, skepticism should scale with n.
Experiment sprint: skepticism scales with n
As the number of concurrent tests increases, expected false wins rise even when each test looks fine alone.
Example sprint: 12 tests at α = 5% → ~0.6 expected false wins.
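A short sweep over sprint sizes shows how the two numbers grow together. This is a sketch assuming independent null tests; `sprint_summary` and the list of sprint sizes are illustrative choices.

```python
def sprint_summary(n_tests, alpha=0.05):
    """Return (expected false wins, chance of at least one false win)."""
    expected = n_tests * alpha
    chance_any = 1 - (1 - alpha) ** n_tests
    return expected, chance_any


for n_tests in (5, 12, 20, 40):
    expected, chance_any = sprint_summary(n_tests)
    print(
        f"{n_tests:>3} tests: ~{expected:.1f} expected false wins, "
        f"{chance_any:.0%} chance of at least one"
    )
```

Expected false wins grow linearly with the number of tests, while the chance of at least one false win climbs quickly at first and then flattens as it approaches 100%.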
Optimize (move here)
- One primary metric per sprint
- Treat secondary wins as directional
Hold (do not over-react)
- Shipping every 'winner' from a 20-test sprint unchanged
Escalate if
- More than two winners lack interval separation
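The escalation rule can be expressed as a small check. This is only a sketch: it assumes "interval separation" means the variant's confidence interval sits entirely above the control's, and the `Readout` structure, function names, and sample numbers are hypothetical, made up purely for illustration.

```python
from typing import NamedTuple


class Readout(NamedTuple):
    """One declared winner, with (low, high) confidence intervals per arm."""
    name: str
    control_ci: tuple[float, float]
    variant_ci: tuple[float, float]


def intervals_separated(readout):
    """Assumed definition: variant's lower bound exceeds control's upper bound."""
    return readout.variant_ci[0] > readout.control_ci[1]


def should_escalate(winners, max_unseparated=2):
    """Escalate when more than max_unseparated winners lack interval separation."""
    unseparated = [r.name for r in winners if not intervals_separated(r)]
    return len(unseparated) > max_unseparated, unseparated


# Hypothetical conversion-rate intervals, not real data.
winners = [
    Readout("hero copy", (0.040, 0.050), (0.052, 0.063)),
    Readout("cta color", (0.040, 0.050), (0.045, 0.056)),
    Readout("nav order", (0.040, 0.050), (0.047, 0.058)),
    Readout("footer link", (0.040, 0.050), (0.049, 0.060)),
]
escalate, unseparated = should_escalate(winners)
print(f"Escalate: {escalate}; winners without separation: {unseparated}")
```

In this made-up sprint, three of the four winners have intervals that overlap the control, so the rule would trigger.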
At 12 tests and α = 5%, false win risk is modest.
False wins are not a reason to stop testing. They are a reason to rank evidence before you ship a bundle of changes that never really worked.