When Everything Wins Once: Running Many A/B Tests

Many A/B tests and false winner probability

The idea

One A/B test at 95% confidence accepts a 5% false alarm rate when there is no real lift. Run twenty independent null tests and luck adds up: the chance of at least one false winner is about 64%, even when nothing actually works.

Growth teams feel this every quarter: a sprint full of tests, a few winners, and a roadmap built on noise. The posts on single-test readouts still apply. This post is about what changes when you run many tests at once.

The question for multiple tests: if nothing worked, how many false wins should we still expect?

Example: how many false wins should you expect?

Assume every test is a true null (no real lift). At a 5% false alarm rate per test (95% confidence), running many tests still produces winners by luck. The odds depend on the test count and the threshold.

Team runs many homepage tests in one quarter

Independent tests: 20

False alarm rate per test: 5%

Chance of ≥1 false win: 64%

Expected false wins: 1.0

[Chart of tests run. Dark bar = a null test that still looks like a winner]

With 20 independent null tests at 5% each, you have about a 64% chance of celebrating at least one false win. Expected false alarms: ~1.0.
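A quick simulation can serve as a sanity check on those two numbers. This is a minimal sketch, assuming each null test "wins" independently with probability α. The function name `simulate_false_wins`, the number of simulated quarters, and the seed are illustrative choices, not part of the original example.

```python
import random

def simulate_false_wins(n_tests=20, alpha=0.05, n_quarters=100_000, seed=0):
    """Simulate quarters in which every test is a true null.

    Returns (share of quarters with at least one false win,
             average number of false wins per quarter).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    quarters_with_false_win = 0
    total_false_wins = 0
    for _ in range(n_quarters):
        # Each null test independently looks like a winner with probability alpha.
        false_wins = sum(rng.random() < alpha for _ in range(n_tests))
        total_false_wins += false_wins
        if false_wins > 0:
            quarters_with_false_win += 1
    return quarters_with_false_win / n_quarters, total_false_wins / n_quarters

share_with_false_win, mean_false_wins = simulate_false_wins()
print(f"Share of quarters with at least one false win: {share_with_false_win:.2f}")
print(f"Average false wins per quarter: {mean_false_wins:.2f}")
```

The simulated share should land close to 64% and the average close to 1.0, with small run-to-run differences if you change the seed.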

The math

Per-test false alarm

P(false win on one null test) ≈ α (often 5%)

α is the threshold you set per test. At 5%, a true null still looks like a winner about one time in twenty.

Across n independent tests

P(≥1 false win) = 1 − (1 − α)^n

With α = 0.05 and n = 20: 1 − 0.95^20 ≈ 64%. You are more likely than not to see at least one false winner if every test is a null.
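The formula is short enough to check directly. The sketch below evaluates it for the example above and also searches for the smallest number of tests at which a false winner becomes more likely than not. The function names are hypothetical, and independence between tests is assumed throughout.

```python
def chance_of_any_false_win(n_tests, alpha=0.05):
    """P(>=1 false win) across n independent null tests: 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n_tests

def smallest_n_exceeding(target, alpha=0.05):
    """Smallest test count whose chance of any false win is above target.

    Assumes 0 < target < 1 and 0 < alpha < 1, so the loop terminates.
    """
    n = 1
    while chance_of_any_false_win(n, alpha) <= target:
        n += 1
    return n

print(f"20 tests at 5%: {chance_of_any_false_win(20):.0%} chance of at least one false win")
print(f"Tests needed to pass 50%: {smallest_n_exceeding(0.5)}")
```

Under these assumptions the 50% mark is passed at 14 tests, well before the 20-test example.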

Average false alarms

expected false wins = n × α

Twenty tests at 5% produce about one false win on average. Some quarters you see zero, some you see two or three. That is why pre-registering a primary metric matters.
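To see how the count of false wins spreads around that average, you can treat it as a binomial draw. That follows from the assumptions already in play: n independent null tests, each with the same false alarm rate α. The helper name `false_win_distribution` and the cutoff `max_k` are illustrative.

```python
from math import comb

def false_win_distribution(n_tests=20, alpha=0.05, max_k=4):
    """Binomial probability of exactly k false wins, for k = 0..max_k."""
    return {
        k: comb(n_tests, k) * alpha**k * (1 - alpha) ** (n_tests - k)
        for k in range(max_k + 1)
    }

dist = false_win_distribution()
for k, p in dist.items():
    print(f"{k} false wins: {p:.1%}")
```

For 20 tests at 5%, zero and one false win are each a bit over a third of quarters, and two false wins show up in roughly one quarter in five.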

A simple application: experiment sprints

Name one primary metric before the sprint. Treat secondary tests as directional. Hold winners to the same interval overlap and sample size bar as a single test. When many tests run at once, skepticism scales with n.
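One way to make "skepticism scales with n" concrete is to tabulate both measures for a few sprint sizes. This is a rough sketch: the sprint sizes below are illustrative (12 and 20 match the examples in this post), and `sprint_summary` is a hypothetical helper that assumes independent null tests.

```python
def sprint_summary(n_tests, alpha=0.05):
    """Return (expected false wins, chance of at least one false win)."""
    expected = n_tests * alpha
    chance_any = 1 - (1 - alpha) ** n_tests
    return expected, chance_any

for n in (5, 12, 20, 40):
    expected, chance_any = sprint_summary(n)
    print(f"{n:>3} tests: ~{expected:.1f} expected false wins, "
          f"{chance_any:.0%} chance of at least one")
```

Expected false wins grow linearly with the number of tests, while the chance of at least one climbs quickly at first and then flattens as it approaches 100%.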

Experiment sprint: skepticism scales with n

Expected false wins rise with the number of concurrent tests, even when each test looks fine alone.

Concurrent tests: 12

Significance level: 5%

12 tests at α = 5% → ~0.6 expected false wins

Optimize (move here)

• One primary metric per sprint
• Treat secondary wins as directional

Hold (do not over-react)

• Shipping every 'winner' from a 20-test sprint unchanged

Escalate if

• More than two winners lack interval separation

At a 12-test sprint, false win risk is modest: about 0.6 expected false wins.

False wins are not a reason to stop testing. They are a reason to rank evidence before you ship a bundle of changes that never really worked.