Math, Applied

We Ran the Test: Could We Even See a Win? Statistical Power

Figure: statistical power gauge for an experiment

The idea

Sample size tells you how much data you collected. Statistical power tells you something different: if the variant is actually better by the amount you care about, how often will this test detect it?

Remember it in one line: an underpowered test is a coin flip that looks like science.

Teams launch tests with five thousand visitors, see flat results, and kill a good idea. Or they see a tiny bump and ship noise. Power ties together baseline rate, minimum detectable lift, and sample size, so you can check detectability before you start.

Power answers: If the effect is real, will this experiment notice?

Example: can this test detect the lift you care about?

Power is the chance you detect a real effect before you give up. Small tests often end at “no winner” even when the variant works.

10,000 visitors per arm: can you detect a 1.5 point lift from 8%?

• Baseline rate: 8%
• Minimum detectable lift: +1.5 pts
• Sample size per arm: 10,000
• Statistical power: 96% (well powered)

At 10,000 per arm, power is about 96%. A real 1.5 point lift is likely visible.

The math

Definition

power = P(detect lift | lift is real)

High power means a true improvement is likely to show up as a significant readout. Low power means you often conclude “no difference” even when the variant works.
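To make the definition concrete, here is a minimal sketch of a power calculation. It assumes a two-sided z-test on two conversion rates at α = 0.05 with the normal approximation; the post does not say which test it uses, and the function name `power_two_proportions` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(baseline, lift, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two conversion rates.

    baseline and lift are proportions: 0.08 and 0.015 for 8% and +1.5 points.
    Assumption: normal approximation with unpooled variance.
    """
    p1, p2 = baseline, baseline + lift
    # Standard error of the difference between the two observed rates.
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    # Critical value for a two-sided test at the chosen alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the observed z-score clears the critical value in the direction
    # of the true lift. The opposite tail is negligible and ignored here.
    return NormalDist().cdf(abs(lift) / se - z_crit)


# The gauge example: 8% baseline, +1.5 points, 10,000 visitors per arm.
print(f"{power_two_proportions(0.08, 0.015, 10_000):.0%}")  # prints 96%
```

Under these assumptions the result lines up with the 96% shown for 10,000 per arm.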

Levers

power ↑ when sample size ↑ or minimum detectable effect ↑

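More traffic per arm helps. So does accepting that you can only detect larger lifts: a bigger effect is easier to see at the same sample size. Trying to spot a 0.5 point move on a 5% baseline needs far more data than a 3 point move on 20%, roughly ten times as many users per arm for the same power.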
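The two scenarios can be compared with a quick sample size sketch. It assumes the standard normal-approximation formula for a two-sided test at α = 0.05 and 80% power; neither setting is stated in the post, and `n_per_arm` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(baseline, lift, power=0.80, alpha=0.05):
    """Approximate users per arm needed to detect `lift` on `baseline`.

    Assumption: two-sided z-test on two proportions, unpooled variance.
    """
    p1, p2 = baseline, baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


small_move = n_per_arm(0.05, 0.005)  # 0.5 point move on a 5% baseline
large_move = n_per_arm(0.20, 0.03)   # 3 point move on a 20% baseline
print(small_move, large_move, round(small_move / large_move, 1))
```

With these settings the small move needs on the order of 31,000 users per arm and the large move about 2,900. The ratio between them does not depend on the α or power you pick, only on the variance and the squared lift.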

Business cost

underpowered test → missed wins (Type II errors)

Posts on false alarms (Type I errors) cover shipping noise. Power covers the opposite mistake: killing winners because the test was too thin to see them.

A simple application: experiment planning

Product wants to detect a 1.5 point lift on 8% checkout conversion. At 3,000 users per arm, power sits near 54% (two-sided test, α = 0.05). They extend the run by two weeks and reach 12,000 per arm, where power is about 98%, well past the 80% target. The readout is still flat, but now “no winner” is a real conclusion, not a sample size excuse.
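The planning step can be sketched as a small search for the first sample size that reaches the target. The assumptions are a two-sided z-test at α = 0.05, the normal approximation, and a search step of 100 users. The helper names are illustrative, and other calculators may give somewhat different figures.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power(baseline, lift, n, alpha=0.05):
    """Approximate power of a two-sided z-test on two conversion rates."""
    p1, p2 = baseline, baseline + lift
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return Z.cdf(lift / se - Z.inv_cdf(1 - alpha / 2))


def min_n_for_power(baseline, lift, target=0.80, alpha=0.05, step=100):
    """Smallest per-arm sample size (in multiples of `step`) reaching `target`."""
    n = step
    while power(baseline, lift, n, alpha) < target:
        n += step
    return n


# Checkout example: 8% baseline, +1.5 point lift.
for n in (3_000, 6_000, 12_000):
    print(f"{n:>6,} per arm: power {power(0.08, 0.015, n):.0%}")

needed = min_n_for_power(0.08, 0.015)
print(f"first size reaching 80% power: {needed:,} per arm")
```

Under these assumptions the search lands at 5,600 per arm, so 12,000 per arm clears the target with room to spare.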

Experiment planning: can you see the lift you care about?

Adjust sample size and minimum detectable lift. Power tells you whether a flat readout is informative.

• Users per arm: 3,000
• Minimum lift (MDE): +1.5 pp
• Baseline: 8%
• Power: ~54% (target: 80%)

Chart: power vs sample size (detectability)

Optimize (move here)

• State baseline, MDE, and target power (80%) before launch
• Extend the run when power is under 60%

Hold (do not over-react)

• Do not treat a null result as proof of no effect when the test is underpowered

Escalate if

• Power < 50% at planned readout date

Underpowered: a null result might mean “cannot see the lift,” not “no lift.” Extend the run or widen the MDE.
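The checklist above can be written down as a tiny decision rule. The 50%, 60%, and 80% thresholds come from the lists above. How to treat the band between 60% and 80% is an assumption made here, and `readout_action` is a hypothetical name.

```python
def readout_action(power, target=0.80, extend_below=0.60, escalate_below=0.50):
    """Map estimated power at the planned readout date to a next step.

    Thresholds follow the checklist. The wording for the band between
    extend_below and target is an assumption, not from the post.
    """
    if power < escalate_below:
        return "escalate"
    if power < extend_below:
        return "extend run or widen MDE"
    if power < target:
        return "underpowered: treat a null result as inconclusive"
    return "well powered: a flat readout is informative"


for p in (0.41, 0.54, 0.70, 0.96):
    print(f"power {p:.0%}: {readout_action(p)}")
```

Keeping the thresholds as parameters lets a team set its own bar without touching the logic.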

The habit: state baseline, MDE, and target power before launch. Posts on sample size cover noise; power covers whether you can detect the lift you actually care about.