Math, Applied

A/B Test Readouts: Significance Without Jargon

The idea

Experiment readouts often hide behind words like significant or not significant. You do not need that vocabulary to decide. You need three things: observed lift, how many users were in each arm, and whether the uncertainty bands overlap.

If the variant beats control by 0.7 percentage points but the 95% intervals still overlap, the result is still compatible with no real difference. If the bands separate and the lift clears your minimum bar, you have a stronger case to ship.

A good readout answers: Is the lift big enough for us, and is the sample large enough that we are not fooling ourselves?

Example: read an A/B test without p-value jargon

Compare observed lift to your minimum bar and check whether the 95% bands overlap. Overlap means the result is still compatible with no real difference.

Control: 3.2% (95% band: 2.3% to 4.4%)

Variant: 3.9% (95% band: 2.9% to 5.2%)

Lift: +0.7 pts (minimum to ship: 0.3 pts)

Users per arm: 1,200

Verdict: the 95% bands overlap, so the read is directional only.

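Variant leads by +0.7 pts on paper, but the intervals still overlap. A few hundred more users per arm would not settle it: if the observed rates held, the bands would not separate until roughly 10,000 users per arm (normal approximation).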

The math

Observed lift

lift = variant rate − control rate

3.9% variant vs 3.2% control is +0.7 percentage points. That is the headline move. Decision quality depends on whether that move is real or within sampling noise.

Uncertainty per arm

95% interval around each arm (Wilson or normal approx.)

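Each rate gets a band. Overlap means both stories could still be true at once: variant ahead by luck, or truly tied. No overlap means the arms are separated at your chosen confidence level. Band overlap is a conservative check: two 95% bands can overlap slightly while an interval on the lift itself still excludes zero, so read overlap as "not yet clear" rather than "no effect".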
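As a rough sketch of the per-arm band, the snippet below computes a Wilson interval from a rate and a user count, then checks whether two bands overlap. The function names `wilson_interval` and `bands_overlap` are illustrative, and z = 1.96 is assumed for a 95% band.

```python
import math

def wilson_interval(rate, n, z=1.96):
    """Wilson score interval for a conversion rate observed over n users.

    z=1.96 is assumed here for a two-sided 95% band.
    """
    denom = 1 + z**2 / n
    center = (rate + z**2 / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def bands_overlap(band_a, band_b):
    """True when two (low, high) bands share at least one value."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Inputs taken from the example readout: 3.2% vs 3.9% at 1,200 users per arm.
control = wilson_interval(0.032, 1200)
variant = wilson_interval(0.039, 1200)
print(f"control: {control[0]:.1%} to {control[1]:.1%}")
print(f"variant: {variant[0]:.1%} to {variant[1]:.1%}")
print("overlap:", bands_overlap(control, variant))
```

With the example's inputs this gives bands of about 2.3% to 4.4% and 2.9% to 5.2%, which overlap.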

When to ship (practical rule)

ship when lift ≥ minimum bar AND intervals do not overlap

Set a minimum lift that covers engineering cost, risk, or your revenue goal. Then check separation. A tiny win on a huge sample might be statistically separated but still not worth the rollout tax.
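The practical rule can be written as a small decision helper. This is a minimal sketch: `readout_decision` and its return labels are hypothetical names, and every input is assumed to be in the same unit (percentage points here).

```python
def readout_decision(lift, min_lift, control_band, variant_band):
    """Apply the rule: ship when lift >= minimum bar AND the bands do not overlap.

    Bands are (low, high) tuples in the same unit as lift and min_lift.
    The returned labels are illustrative, not a standard vocabulary.
    """
    clears_bar = lift >= min_lift
    separated = (control_band[1] < variant_band[0]
                 or variant_band[1] < control_band[0])
    if clears_bar and separated:
        return "ship"
    if separated:
        return "separated but below the bar"
    if clears_bar:
        return "directional only"
    return "no case yet"

# Example readout: +0.7 pts lift, 0.3 pts bar, overlapping bands.
print(readout_decision(0.7, 0.3, (2.3, 4.4), (2.9, 5.2)))
```

Keeping the two conditions as separate booleans makes it easy to report which one failed, which is usually what the readout discussion turns on.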

Sample size shrinks the bands: width falls roughly with the square root of users per arm, so quadrupling the sample about halves each band. Pre-set a minimum detectable lift before you launch so you know when to stop. If the bands overlap, extend the test or accept a directional read only. This connects directly to the sample size and confidence interval posts: same machinery, decision-first framing.
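To get a feel for how far a test must run before the bands separate, here is a back-of-the-envelope sketch. It assumes equal arm sizes, a normal-approximation band, and that the observed rates hold as traffic grows. The helper name `users_per_arm_to_separate` is hypothetical.

```python
import math

def users_per_arm_to_separate(control_rate, variant_rate, z=1.96):
    """Smallest n per arm at which two normal-approximation bands stop overlapping.

    Separation requires z * (se_control + se_variant) <= gap, where each
    se is sqrt(rate * (1 - rate) / n). Solving for n gives the formula below.
    """
    gap = abs(variant_rate - control_rate)
    if gap == 0:
        raise ValueError("rates are equal; the bands never separate")
    spread = (math.sqrt(control_rate * (1 - control_rate))
              + math.sqrt(variant_rate * (1 - variant_rate)))
    return math.ceil((z * spread / gap) ** 2)

# Example readout rates: 3.2% control vs 3.9% variant.
n_needed = users_per_arm_to_separate(0.032, 0.039)
print(n_needed)
```

For the 3.2% vs 3.9% example this lands at roughly 10,700 users per arm, well above the 1,200 in the readout. The figure is only a planning estimate, since the true rates are not known in advance.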

A simple application: experiment readouts

Product and growth teams paste lift, n per arm, and interval overlap into readout docs instead of a lone p-value. Ops sets reversible rollouts when separation is thin. Leadership asks for the minimum lift bar up front so debates happen before data arrives, not after.

Experiment readout: overlap and next step

Example setting: variant lift of +1.4 pp at 4,000 users per arm.

Conversion: Control 8.0% · Variant 9.4%

Lift band (pp): Low +0.2 · Lift +1.4 · High +2.6

Intervals overlap: Yes

Next step: wait or slice
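One way to reproduce a lift band like the one above is an unpooled normal-approximation interval on the difference of the two rates. This is an assumption about how such a band could be computed, not a statement of how the readout tool does it. `lift_band` is an illustrative name.

```python
import math

def lift_band(control_rate, variant_rate, n_control, n_variant, z=1.96):
    """Normal-approximation interval on (variant rate - control rate).

    Returns (low, lift, high) as proportions; multiply by 100 for pp.
    """
    lift = variant_rate - control_rate
    se = math.sqrt(control_rate * (1 - control_rate) / n_control
                   + variant_rate * (1 - variant_rate) / n_variant)
    return lift - z * se, lift, lift + z * se

# Example setting: 8.0% vs 9.4% at 4,000 users per arm.
low, lift, high = lift_band(0.080, 0.094, 4000, 4000)
print(f"low {low * 100:+.1f} pp, lift {lift * 100:+.1f} pp, high {high * 100:+.1f} pp")
```

With these inputs the band comes out at about +0.2 to +2.6 pp. Its low end sits just above zero even though the two per-arm bands overlap, which is why the overlap check is the more cautious of the two.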

Optimize (move here)

• Paste lift, n, and interval overlap into readout docs
• Set minimum lift bar before data arrives

Hold (do not over-react)

• Shipping on the point estimate alone (hold off until you have checked the bands)

Escalate if

• Aggregate wins but every segment loses

Report overlap explicitly. Ops can run a reversible rollout while traffic accumulates.

When you report overlap clearly, the next step is obvious: ship, wait for more traffic, or slice the result before trusting the aggregate. That last step matters when segment mix can flip the story.
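To see how segment mix can flip the story, here is a small sketch with made-up counts. The numbers are hypothetical and chosen only to show the effect: the variant loses in each segment yet wins in aggregate because far more of its traffic falls in the high-converting segment.

```python
# Hypothetical (users, conversions) per arm and segment; not real data.
segments = {
    "mobile":  {"control": (900, 18), "variant": (500, 9)},
    "desktop": {"control": (100, 10), "variant": (500, 45)},
}

def rate(users, conversions):
    return conversions / users

def segment_lifts(data):
    """Variant rate minus control rate within each segment."""
    return {name: rate(*arms["variant"]) - rate(*arms["control"])
            for name, arms in data.items()}

def aggregate_lift(data):
    """Variant rate minus control rate after pooling all segments."""
    pooled = {}
    for arm in ("control", "variant"):
        users = sum(arms[arm][0] for arms in data.values())
        conversions = sum(arms[arm][1] for arms in data.values())
        pooled[arm] = rate(users, conversions)
    return pooled["variant"] - pooled["control"]

def aggregate_flips(data):
    """True when the aggregate lift has the opposite sign to every segment lift."""
    overall = aggregate_lift(data)
    return all(lift * overall < 0 for lift in segment_lifts(data).values())

print(segment_lifts(segments))
print(aggregate_lift(segments))
print("escalate:", aggregate_flips(segments))
```

A segment mix this uneven between arms is itself a warning sign in a randomized test, so a flip like this is usually a reason to check assignment before trusting either number.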