Math, Applied
We Ran the Test: Could We Even See a Win? Statistical Power
The idea
Sample size tells you how much data you collected. Statistical power tells you something different: if the variant is actually better by the amount you care about, how often will this test detect it?
Remember it in one line: an underpowered test is a coin flip that looks like science.
Teams launch tests with five thousand visitors, see flat results, and kill a good idea. Or they see a tiny bump and ship noise. Power connects baseline rate, minimum detectable lift, and sample size before you start.
Power answers: If the effect is real, will this experiment notice?
Example: can this test detect the lift you care about?
Power is the chance you detect a real effect before you give up. Drag sample size per arm. Small tests often end at “no winner” even when the variant works.
10k visitors per arm — can you detect a 1.5 point lift from 8%?
Baseline rate
8%
Minimum detectable lift
+1.5 pts
Statistical power
96%
Well powered
At 10,000 per arm, power is about 96%. A real 1.5 point lift is likely visible.
The math
Definition
High power means a true improvement is likely to show up as a significant readout. Low power means you often conclude “no difference” even when the variant works.
Levers
More traffic per arm helps. So does accepting that you can only detect larger lifts. Trying to spot a 0.5 point move on a 5% baseline needs far more data than a 3 point move on 20%.
Business cost
False alarm posts cover shipping noise. Power covers killing winners because the test was too thin to see them.
A simple application: experiment planning
Product wants to detect a 1.5 point lift on 8% checkout conversion. At 3,000 users per arm, power sits near 35%. They extend two weeks, reach 12,000 per arm, power crosses 80%. The readout is still flat, but now “no winner” is a real conclusion, not a sample size excuse.
Experiment planning: can you see the lift you care about?
Adjust sample size and minimum detectable lift. Power tells you if a flat readout is informative.
Power ~41% to detect +1.5 pp on 8% baseline
Power vs sample size
Detectability
Current power: 41% · Target: 80%
Power
~41%
MDE
+1.5 pp
Per arm
3,000
Optimize (move here)
- • State baseline, MDE, and 80% power before launch
- • Extend run when power is under 60%
Hold (do not over-react)
- • Treating null as proof when underpowered
Escalate if
- • Power < 50% at planned readout date
Underpowered: a null result might mean 'cannot see lift' not 'no lift.' Extend run or widen MDE.
The habit: state baseline, MDE, and target power before launch. Sample size posts cover noise; power covers detectability of the lift you actually care about.