Math, Applied
False Alarm vs Missed Win: Two Ways an Experiment Decision Goes Wrong
The idea
Every test decision can go wrong in two ways. A false alarm is shipping when there is no real lift. A missed win is holding when a real lift was there. You cannot minimize both at once with the same bar.
Stricter rules reduce false alarms but increase missed wins. Looser rules do the opposite. The A/B readout posts cover lift and intervals. This post names the two errors in plain language so teams can pick which mistake they can afford.
The tradeoff answers one question: which error is costlier for this launch, a false alarm or a missed win?
Example: balance false alarms and missed wins
Scenario: ship a redesign when the readout looks positive. A tighter decision bar cuts false alarms but raises missed wins.
False alarm rate: 17% (~17 of 100 null tests would ship)
Missed win rate: 36% (~36 of 100 real wins would never ship)
If you ship too eagerly: revenue dip while you roll back a change that never helped.
If you wait too long: months of flat conversion while a real lift waits in the data.
Balanced middle: watch both error rates
Out of 100 tests with no real lift, about 17 would still ship (false alarms).
Out of 100 tests with a real lift, about 36 would never ship (missed wins).
This setting trades 17% false alarms for 36% missed wins. Ask which cost hurts more here: a revenue dip while you roll back a change that never helped, or months of flat conversion while a real lift waits in the data.
The math
False alarm (Type I error)
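You roll out a checkout change that looked positive but was noise. A stricter significance level and a higher minimum lift bar both push this rate down. A larger sample also lowers it when the ship rule is a fixed lift threshold, because the noise shrinks; at a fixed significance level, more data mainly reduces missed wins instead.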
Missed win (Type II error)
You kill a variant that would have helped because the readout was inconclusive. Small samples and strict bars make this more likely.
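The two definitions above can be put into numbers with a minimal sketch. It assumes the readout is summarized as a z-score that is approximately normal, that the ship rule is "ship if z exceeds a bar", and that a real lift is worth `effect_in_se` standard errors. The function name, the parameter names, and the example effect of 1.5 standard errors are illustrative assumptions, not values from this post.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def error_rates(z_bar: float, effect_in_se: float) -> tuple[float, float]:
    """Return (false_alarm_rate, missed_win_rate) for the rule 'ship if z > z_bar'.

    Assumption: with no real lift, z ~ Normal(0, 1); with a real lift,
    z ~ Normal(effect_in_se, 1).
    """
    # Type I: the null test clears the bar by chance.
    false_alarm = 1.0 - STD_NORMAL.cdf(z_bar)
    # Type II: the real lift fails to clear the bar.
    missed_win = STD_NORMAL.cdf(z_bar - effect_in_se)
    return false_alarm, missed_win


# Hypothetical bars, from loose to strict, for a lift worth 1.5 standard errors.
for z_bar in (0.5, 1.0, 1.645, 2.0):
    fa, mw = error_rates(z_bar, effect_in_se=1.5)
    print(f"bar z > {z_bar:<5}  false alarm {fa:6.1%}  missed win {mw:6.1%}")
```

Reading the printed rows from top to bottom shows the false alarm rate falling and the missed win rate rising as the bar tightens.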
The tradeoff
There is no free lunch. Reversible, low-cost tests can tolerate more false alarms. High-stakes launches should tolerate more missed wins until evidence is solid.
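One way to make "which mistake can we afford" concrete is to attach a cost to each error and pick the bar with the lowest expected cost. The sketch below reuses the normal z-score assumption and adds three more: a prior share of tests with a real lift (`p_real`) and a relative cost for each mistake. All names and numbers are hypothetical, chosen only to illustrate the idea.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_cost(z_bar, effect_in_se, p_real, cost_false_alarm, cost_missed_win):
    """Expected cost per test for the rule 'ship if z > z_bar' (normal model assumed)."""
    false_alarm = 1.0 - STD_NORMAL.cdf(z_bar)
    missed_win = STD_NORMAL.cdf(z_bar - effect_in_se)
    return ((1.0 - p_real) * false_alarm * cost_false_alarm
            + p_real * missed_win * cost_missed_win)


def best_bar(effect_in_se, p_real, cost_false_alarm, cost_missed_win):
    """Grid-search the z bar in [-1, 4] (step 0.01) that minimizes expected cost."""
    candidates = [i / 100 for i in range(-100, 401)]
    return min(
        candidates,
        key=lambda z: expected_cost(z, effect_in_se, p_real,
                                    cost_false_alarm, cost_missed_win),
    )


# Hypothetical inputs: half of tests have a real lift worth 2 standard errors.
reversible = best_bar(2.0, 0.5, cost_false_alarm=1.0, cost_missed_win=1.0)
high_stakes = best_bar(2.0, 0.5, cost_false_alarm=5.0, cost_missed_win=1.0)
print(f"reversible test bar:   z > {reversible:.2f}")
print(f"high-stakes test bar:  z > {high_stakes:.2f}")
```

When a false alarm is made five times as costly, the cost-minimizing bar moves stricter, which matches the rule of thumb above. The grid search is a deliberate simplification; an optimum outside the searched range would be clamped to its edge.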
A simple application: launch policy
Checkout and pricing changes with rollback plans can use a moderate bar: overlap checks plus a minimum lift. Fraud rules and billing logic need fewer false alarms even if that means waiting longer. Feature launches with high build cost sit in the middle: missed wins waste engineering, false alarms waste trust.
Launch policy: false alarm vs missed win
Tighten or loosen your ship bar to see the tradeoff between crying wolf and missing real lifts.
Example setting: strictness 6/10, false alarm risk ~22%, missed win ~29%.
Optimize (move here)
- Write policy before the test finishes
- Match bar to cost of each mistake
Hold (do not over-react)
- One global p-value rule for fraud and UI copy tests alike
Escalate if
- Policy disagrees with finance on rollback cost
More wins captured. Reversible checkout tests can live looser with rollback plans.
Write the policy before the test finishes. Name which mistake hurts more, then set sample size and ship rules to match. That turns significance talk into a business choice.
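Once the two tolerated error rates are named, the sample size follows. The sketch below uses the standard normal-approximation formula for comparing two conversion rates with a one-sided test. The function name, the parameters, and the example numbers (10% baseline, 1 point absolute lift) are assumptions for illustration, not figures from this post.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(base_rate, lift_abs, false_alarm_rate, missed_win_rate):
    """Approximate users needed per arm for a one-sided two-proportion test.

    base_rate:        control conversion rate, e.g. 0.10
    lift_abs:         smallest absolute lift worth detecting (must be > 0)
    false_alarm_rate: tolerated Type I error rate (alpha)
    missed_win_rate:  tolerated Type II error rate (beta)
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - false_alarm_rate)
    z_beta = std_normal.inv_cdf(1.0 - missed_win_rate)
    p1, p2 = base_rate, base_rate + lift_abs
    # Sum of the per-user variances of the two arms.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / lift_abs ** 2)


# Hypothetical policy: 5% false alarms and 20% missed wins tolerated.
n = sample_size_per_arm(0.10, 0.01, false_alarm_rate=0.05, missed_win_rate=0.20)
print(f"users per arm: {n}")
```

Tightening either tolerated rate raises the required sample, which is why the policy has to be written before the test starts rather than after the readout arrives.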