Math, Applied

False Alarm vs Missed Win: Two Ways an Experiment Decision Goes Wrong

The idea

Every test decision can go wrong in two ways. A false alarm is shipping when there is no real lift. A missed win is holding when a real lift was there. With the data fixed, you cannot minimize both at once by moving the same bar.

Stricter rules reduce false alarms but increase missed wins. Looser rules do the opposite. The A/B readout posts cover lift and intervals. This post names the two errors in plain language so teams can pick which mistake they can afford.

The tradeoff question: which error is costlier for this launch, a false alarm or a missed win?

Example: balance false alarms and missed wins

[Interactive slider: decision bar, from Loose (ship more) to Strict (wait more). Scenario: ship a redesign when the readout looks positive.]

A tighter bar cuts false alarms but raises missed wins. At the setting shown:

Decision bar: Moderate (45)

False alarm rate: 17% (~17 of 100 null tests would ship)

Missed win rate: 36% (~36 of 100 real wins would never ship)

If you ship too eagerly: revenue dip while you roll back a change that never helped.

If you wait too long: months of flat conversion while a real lift waits in the data.

Balanced middle: watch both error rates

Out of 100 tests with no real lift, about 17 would still ship (false alarms).

Out of 100 tests with a real lift, about 36 would never ship (missed wins).

At this setting you trade a 17% false alarm rate for a 36% missed win rate. Ask which cost hurts more here: a revenue dip while you roll back a change that never helped, or months of flat conversion while a real lift waits in the data.
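To see why the two rates move in opposite directions, here is a minimal sketch under an assumed model, not the one behind the slider above. It treats the readout as a z-score that is standard normal when there is no real lift and shifted by `effect` when there is one. The names `error_rates`, `bar`, and `effect` are hypothetical, and the numbers it prints will not match the 17% and 36% shown earlier.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def error_rates(bar: float, effect: float) -> tuple:
    """Return (false_alarm_rate, missed_win_rate) for a one-sided ship rule.

    Assumed model: the readout z is N(0, 1) with no real lift and
    N(effect, 1) with a real lift. We ship when z > bar.
    """
    false_alarm = 1.0 - STD_NORMAL.cdf(bar)    # P(ship | no real lift)
    missed_win = STD_NORMAL.cdf(bar - effect)  # P(hold | real lift exists)
    return false_alarm, missed_win


if __name__ == "__main__":
    effect = 1.5  # hypothetical true lift, in standard-error units
    for bar in (0.5, 1.0, 1.5, 2.0):
        fa, mw = error_rates(bar, effect)
        print(f"bar={bar:.1f}  false alarm={fa:.0%}  missed win={mw:.0%}")
```

Each step up in `bar` lowers the first number and raises the second. Only more data or a larger true effect improves both.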

The math

False alarm (Type I error)

false alarm rate = P(ship | no real lift)

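You roll out a checkout change that looked positive but was noise. A higher confidence level and a higher minimum lift bar both push this rate down. A larger sample pushes it down when the ship rule is a fixed lift bar, because the readout gets less noisy; under a fixed significance threshold, a larger sample lowers missed wins instead.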

Missed win (Type II error)

missed win rate = P(hold | real lift exists)

You kill a variant that would have helped because the readout was inconclusive. Small samples and strict bars make this more likely.
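The effect of sample size and strictness on missed wins can be sketched with a normal approximation. This assumes a one-sided two-proportion z-test with an unpooled standard error, which may not be the test your readout uses. The function name and the conversion numbers in the usage lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def missed_win_rate(n_per_arm: int, base_rate: float, lift: float,
                    alpha: float = 0.05) -> float:
    """Approximate P(hold | real lift exists) for a one-sided z-test.

    base_rate: control conversion rate.
    lift: true absolute lift of the variant over control.
    alpha: false alarm rate the ship rule is set to.
    """
    p1, p2 = base_rate, base_rate + lift
    # Standard error of the difference in conversion rates (unpooled).
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    z_crit = Z.inv_cdf(1 - alpha)  # the bar, in standard-error units
    return Z.cdf(z_crit - lift / se)


if __name__ == "__main__":
    for n in (2_000, 5_000, 20_000):
        rate = missed_win_rate(n, base_rate=0.10, lift=0.01)
        print(f"n per arm={n:>6}  missed win rate={rate:.0%}")
```

Shrinking `n_per_arm` or `alpha` both raise the returned rate, which is the claim above in numeric form.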

The tradeoff

stricter bar → lower false alarms, higher missed wins

There is no free lunch. Reversible, low-cost tests can tolerate more false alarms. High-stakes launches should tolerate more missed wins until evidence is solid.
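One way to turn "which mistake hurts more" into a bar is to minimize expected cost. The sketch below assumes the readout is a z-score that is N(0, 1) without a real lift and N(effect, 1) with one. It also assumes you can put rough numbers on the cost of each mistake and on the share of tests that have a real lift. All names and inputs are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def expected_cost(bar, effect, p_real, cost_false_alarm, cost_missed_win):
    """Average cost per test when we ship if the z-score exceeds bar."""
    false_alarm = 1.0 - Z.cdf(bar)    # P(ship | no real lift)
    missed_win = Z.cdf(bar - effect)  # P(hold | real lift exists)
    return ((1 - p_real) * false_alarm * cost_false_alarm
            + p_real * missed_win * cost_missed_win)


def best_bar(effect, p_real, cost_false_alarm, cost_missed_win):
    """Grid-search the bar (z units, -1.00 to 4.00) with the lowest cost."""
    candidates = [i / 100 for i in range(-100, 401)]
    return min(
        candidates,
        key=lambda b: expected_cost(
            b, effect, p_real, cost_false_alarm, cost_missed_win
        ),
    )


if __name__ == "__main__":
    # Reversible test: both mistakes cost about the same.
    print(best_bar(effect=1.5, p_real=0.5,
                   cost_false_alarm=1, cost_missed_win=1))
    # High-stakes launch: a false alarm is ten times as costly.
    print(best_bar(effect=1.5, p_real=0.5,
                   cost_false_alarm=10, cost_missed_win=1))
```

Raising the cost of a false alarm pushes the chosen bar up, and raising the cost of a missed win pushes it down.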

A simple application: launch policy

Checkout and pricing changes with rollback plans can use a moderate bar: overlap checks plus a minimum lift. Fraud rules and billing logic need fewer false alarms even if that means waiting longer. Feature launches with high build cost sit in the middle: missed wins waste engineering, false alarms waste trust.
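A per-category policy like this can be written down as data plus one small rule. The sketch below reads "overlap check" as "the lower end of the lift interval must sit above zero", which is an assumption about what the readout provides. The policy names and thresholds are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShipPolicy:
    name: str
    min_lift: float              # smallest observed lift worth shipping
    require_ci_above_zero: bool  # overlap check: interval must exclude zero


def decide(policy: ShipPolicy, observed_lift: float, ci_low: float) -> str:
    """Return "ship" or "hold" for one readout under the given policy."""
    if policy.require_ci_above_zero and ci_low <= 0:
        return "hold"
    if observed_lift < policy.min_lift:
        return "hold"
    return "ship"


# Hypothetical thresholds, not recommendations.
CHECKOUT = ShipPolicy("checkout", min_lift=0.01, require_ci_above_zero=True)
BILLING = ShipPolicy("billing", min_lift=0.03, require_ci_above_zero=True)

if __name__ == "__main__":
    readout = {"observed_lift": 0.02, "ci_low": 0.001}
    print(decide(CHECKOUT, **readout))
    print(decide(BILLING, **readout))
```

The same readout can ship under one policy and hold under another, which is the point of matching the bar to the launch.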

Launch policy: false alarm vs missed win

[Interactive slider: ship bar strictness, from Loose (ship early) to Strict (wait for proof), showing the tradeoff between crying wolf and missing real lifts.]

Ship bar strictness: 6/10. False alarm risk ~22%, missed win ~29%.

[Chart: strictness curve, plotting the false alarm index and missed win index against strictness.]

Optimize (move here)

• Write policy before the test finishes
• Match bar to cost of each mistake

Hold (do not over-react)

• One global p-value rule for fraud and UI copy tests alike

Escalate if

• Policy disagrees with finance on rollback cost

A looser bar captures more wins. Reversible checkout tests can run looser when they have rollback plans.

Write the policy before the test finishes. Name which mistake hurts more, then set sample size and ship rules to match. That turns significance talk into a business choice.
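Once both tolerated error rates are named, the sample size follows from them. This is a minimal sketch using the standard normal-approximation formula for a one-sided two-proportion test. The function name and the example inputs are hypothetical, and a real plan should use whatever test the readout actually runs.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(base_rate: float, lift: float,
                        false_alarm_rate: float,
                        missed_win_rate: float) -> int:
    """Users needed in each arm to hit both error rates (approximate).

    base_rate: control conversion rate.
    lift: smallest absolute lift you want to be able to detect.
    """
    z = NormalDist()
    p1, p2 = base_rate, base_rate + lift
    z_alpha = z.inv_cdf(1 - false_alarm_rate)  # bar against false alarms
    z_beta = z.inv_cdf(1 - missed_win_rate)    # margin against missed wins
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


if __name__ == "__main__":
    n = sample_size_per_arm(base_rate=0.10, lift=0.01,
                            false_alarm_rate=0.05, missed_win_rate=0.20)
    print(f"about {n} users per arm")
```

Tightening either rate raises the required sample, so the policy and the test duration have to be set together.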