One Number Is Not Enough: Confidence Intervals for Real Decisions

[Figure: confidence interval band around a point estimate]

The idea

A conversion rate, CSAT score, or defect rate from one sample is a point estimate. It is the best single guess given what you observed. It is not a guarantee about the full customer base or every future week.

A confidence interval adds a range. In plain language: if we repeated this measurement many times and built a band each time, most of those bands would contain the true rate. A wider band means more uncertainty. A narrower band means the sample carried more information.

Confidence intervals answer: How wrong could this headline number be, given our sample size?

Example: how wide is the range?

A point estimate is the center. A 95% confidence interval is the range where the true rate likely lives given your sample size. Changing the rate or n widens or tightens the band.

Conversion rate (point estimate): 4.2%

Sample size: 800

95% interval: 3.0% to 5.8%

Long-run rate for this metric: ~3.8%

Tight enough to plan around. The true rate likely sits in this band.

The math

A confidence interval wraps a point estimate with a range that reflects sample size and variability. Wider band means less certainty.

Observed rate

point estimate p = successes ÷ n

120 conversions out of 1,000 visits gives p = 12%. That is your best single guess, but the true long-run rate could be a bit higher or lower.

Standard error

SE ≈ √(p(1 − p) ÷ n)

This is the typical wobble in p due to random sampling. A small n or a p near 50% produces a larger SE. Table 1 shows the same 4.2% rate with bands that shrink as n grows.

Confidence interval (conceptual)

95% interval ≈ p ± 1.96 × SE (rough normal approximation)

About 95% of the time, a procedure like this captures the true rate. The explorer uses a Wilson interval, which behaves better for small n and extreme rates, but the intuition is the same: center plus/minus a margin based on SE.
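As a rough illustration, the three formulas above can be chained for the 120-out-of-1,000 example. This is a minimal sketch of the normal approximation only, not the Wilson interval the explorer uses; the function name `normal_interval` is made up for this example.

```python
import math

def normal_interval(successes, n, z=1.96):
    """Return (p, se, low, high) using the rough normal approximation."""
    p = successes / n                    # point estimate
    se = math.sqrt(p * (1 - p) / n)      # standard error of the rate
    margin = z * se                      # half-width of the band
    return p, se, p - margin, p + margin

# Hypothetical helper applied to the example in the text: 120 conversions, 1,000 visits.
p, se, low, high = normal_interval(120, 1000)
print(f"p = {p:.1%}, SE = {se:.2%}, 95% interval = {low:.1%} to {high:.1%}")
# prints: p = 12.0%, SE = 1.03%, 95% interval = 10.0% to 14.0%
```

So the 12% headline comes with a band of roughly plus or minus 2 percentage points at this sample size.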

• Doubling n narrows the band by only about 29%; you need roughly four times as many observations to halve the width.
• Rates near 50% produce the widest intervals for a given n.
• A 99% interval is wider than a 95% interval because you are demanding more coverage.
• Two experiments can show the same headline rate with different bands if n differs.
• Ship when the whole interval clears your decision bar; if the band straddles zero lift or minimum ROI, gather more data.
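The four-times rule can be checked by inverting the margin formula: under the normal approximation, margin = z × SE, so n ≈ z² × p(1 − p) ÷ margin². The sketch below assumes that approximation; `required_n` is a hypothetical helper, not something taken from the explorer.

```python
import math

def required_n(p, margin, z=1.96):
    """Smallest n whose normal-approximation margin (z * SE) is at most `margin`."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# Same 12% rate, two target margins: 2 pp and 1 pp.
n_2pp = required_n(0.12, 0.02)
n_1pp = required_n(0.12, 0.01)
print(n_2pp, n_1pp, round(n_1pp / n_2pp, 2))
# prints: 1015 4057 4.0
```

Halving the margin from 2 pp to 1 pp takes about four times the observations, which is why the last bit of precision is the most expensive.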

Why sample size still matters

The same observed rate can imply a tight or loose range depending on n. That is why sample size posts and interval posts belong together. Small pilots produce wide bands. Large rollouts shrink them.

Table 1: Same 4.2% conversion, different sample sizes
Sample size	Observed rate	95% interval	Band width (pp)
100	4.2%	1.7% to 10.1%	8.4
400	4.2%	2.6% to 6.6%	4.0
1,600	4.2%	3.3% to 5.3%	2.0
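The intervals in Table 1 are consistent with a Wilson score interval at z = 1.96. The sketch below is one way to reproduce them. It passes the observed rate in directly rather than a success count, which is an assumption made here because 4.2% of 100 is not a whole number; `wilson_interval` is an illustrative name.

```python
import math

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p from n trials."""
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom          # center is pulled toward 50%
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half

# Same 4.2% observed rate at the three sample sizes from Table 1.
for n in (100, 400, 1600):
    low, high = wilson_interval(0.042, n)
    print(f"n={n:>5}: {low:.1%} to {high:.1%}, width {(high - low) * 100:.1f} pp")
# prints:
# n=  100: 1.7% to 10.1%, width 8.4 pp
# n=  400: 2.6% to 6.6%, width 4.0 pp
# n= 1600: 3.3% to 5.3%, width 2.0 pp
```

Each fourfold increase in n roughly halves the width, matching the rule of thumb from the math section.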

A simple application: shipping decisions

Report the interval in experiment readouts and ops reviews. If the band crosses your decision threshold, wait or gather more data. If the band clears the bar with room to spare, you have a stronger case to ship or scale.
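The rule above can be written down as a small check. This is a sketch, not a policy: the function name `ship_decision`, the labels it returns, and the handling of a band that sits entirely below the bar are assumptions added for illustration. All inputs are in percentage points of lift.

```python
def ship_decision(low, high, bar):
    """Classify a lift band (low, high) against a ship bar, all in percentage points."""
    if low > bar:
        return "ship"                                # whole band clears the bar
    if low <= 0 <= high:
        return "hold: band includes zero"            # lift may not be real
    if high < bar:
        return "do not ship: band is below the bar"  # assumed label, not from the post
    return "wait or gather more data"                # band crosses the bar

# A band of +0.2 to +2.2 pp against a +1.0 pp bar.
print(ship_decision(0.2, 2.2, 1.0))
# prints: wait or gather more data
```

A point estimate of +1.2 pp beats the +1.0 pp bar on its own, but the lower end of the band does not, so the check returns the cautious answer.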

Shipping decision: does the band clear the bar?

Changing the observed lift or the sample size changes whether the confidence band clears your ship threshold. One example:

Observed lift: +1.2 pp

Users per arm: 6,000

Lift band: +0.2 pp to +2.2 pp (margin of error 1.0 pp)

Ship bar: +1.0 pp

Band crosses threshold: wait or gather more data.

Optimize (move here)

• Show interval next to every rate in readouts
• Ship when band clears bar with margin

Hold (do not over-react)

• Hold the full launch while the band still crosses zero

Escalate if

• Point estimate wins but band includes zero

When the band overlaps the threshold, prefer a reversible rollout or more traffic before a full launch.

The habit is showing a range next to every rate. The math above explains why that range widens or tightens.