Math, Applied

How Many Labels Before You Trust the Metric?

95% confidence band around 68% precision on a thin labeled set

The idea

The model team reports 68% precision on a held-out eval set. Leadership wants to auto-block at 0.85 tomorrow. The number sounds solid until you learn the eval set had 500 rows and only 20 flagged cases. Precision on twenty outcomes is a coin flip with extra decimals.

Remember it in one line: report the metric and the labeled count behind it.

Sample size posts usually cover A/B tests and dashboard averages. Eval sample size is the same uncertainty logic applied to ML metrics: precision and recall are proportions computed on a thin slice of labeled data, and F1 inherits the uncertainty of both. Small denominators mean wide bands.

Eval sample size answers: Is this precision/recall read tight enough to change policy?

Example: how wide is your eval metric band?

A point estimate is the center. The 95% band is where the true metric likely lives given your labeled count. Effective n is the number of flagged rows (for precision) or actual positives (for recall), not total labels.

500-row eval set, 4% flagged. The precision band is still about ±20 pp.

Point estimate: 68% (precision on flagged orders)
95% interval: 47.6% to 88.4%
Labeled eval set: 500 rows · effective n = 20 (20 flagged)
Verdict: too thin to ship

Precision on flagged orders is 68% on paper, but ±20 pp at 95% with only 20 flagged rows in the eval set. Label more before changing thresholds or shipping the model.

The math

Binomial uncertainty on a rate

SE(metric) ≈ √(p(1 − p) ÷ n_effective)

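p is the observed precision or recall. n_effective is the count in the denominator you care about: flagged rows for precision, actual positives (true positives plus false negatives) for recall. It is not always the total number of labeled rows. F1 is not a simple proportion, so treat this formula as only a rough guide for it.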
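A minimal sketch of reading n_effective off a confusion matrix. The class name, function name, and counts are hypothetical and chosen only for illustration. It takes the recall denominator to be all actual positives (caught plus missed), which is the standard definition of recall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Labeled eval counts at one threshold (hypothetical container)."""
    tp: int  # flagged and truly positive
    fp: int  # flagged but actually negative
    fn: int  # positives the model missed
    tn: int  # negatives correctly left alone


def effective_n(counts: ConfusionCounts) -> dict[str, int]:
    """Return the denominator behind each metric, next to the total row count."""
    return {
        "precision": counts.tp + counts.fp,  # every flagged row
        "recall": counts.tp + counts.fn,     # every actual positive
        "total_rows": counts.tp + counts.fp + counts.fn + counts.tn,
    }


# Made-up counts: 500 labeled rows, 20 of them flagged.
example = ConfusionCounts(tp=14, fp=6, fn=10, tn=470)
print(effective_n(example))
```

With these made-up counts the slide would show 500 labeled rows, while precision rests on 20 rows and recall on 24.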

Margin of error

95% band ≈ p ± 1.96 × SE

At p = 0.68 and n_effective = 20, SE ≈ 0.10, so the band is roughly 48%–88%. At n_effective = 200, SE ≈ 0.03 and the band narrows to about 62%–74%. Same model, different label budget.
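The two formulas above can be checked with a few lines of Python. This is a sketch of the normal approximation only; the function name is hypothetical, and the clamp to the 0–1 range is an added convenience rather than part of the formula.

```python
import math


def normal_band(p: float, n_effective: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (se, low, high) for a proportion using the normal approximation."""
    if n_effective <= 0:
        raise ValueError("n_effective must be positive")
    se = math.sqrt(p * (1.0 - p) / n_effective)
    margin = z * se
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return se, max(0.0, p - margin), min(1.0, p + margin)


for n in (20, 200):
    se, low, high = normal_band(0.68, n)
    print(f"n_effective={n}: SE={se:.3f}, band={low:.1%} to {high:.1%}")
# n_effective=20: SE=0.104, band=47.6% to 88.4%
# n_effective=200: SE=0.033, band=61.5% to 74.5%
```

With counts as small as 20, the normal approximation is itself rough. A Wilson score interval is a common alternative for small n and would give a somewhat different band.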

Imbalanced eval sets

rare positives → need more labels than headline row count suggests

A 2,000-row eval with 1% fraud may include only ~20 fraud cases. Recall bands stay wide even when total rows look impressive. Stratified labeling buys tighter reads on the metrics that drive policy.
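To see how quickly the label budget grows when positives are rare, the SE formula can be inverted to solve for n_effective and then divided by the positive share. This is a rough planning sketch: the function names are hypothetical, it assumes simple random sampling of rows, and it uses the normal approximation.

```python
import math


def effective_rows_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Smallest n_effective with z * sqrt(p(1-p)/n) <= margin."""
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must be between 0 and 1")
    return math.ceil(p * (1.0 - p) * (z / margin) ** 2)


def total_rows_needed(p: float, margin: float, positive_share: float) -> int:
    """Randomly sampled rows expected to contain that many effective rows."""
    if not 0.0 < positive_share <= 1.0:
        raise ValueError("positive_share must be in (0, 1]")
    return math.ceil(effective_rows_needed(p, margin) / positive_share)


# Precision near 68%, target band of ±12 pp, 4% of rows flagged.
print(effective_rows_needed(0.68, 0.12))      # flagged rows required
print(total_rows_needed(0.68, 0.12, 0.04))    # random rows to label
```

Stratified labeling sidesteps the second function: if labelers are sent flagged or positive rows directly, only the first number has to be paid for.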

Where teams get stuck

Launch reviews. A slide shows precision up 6 points. Nobody asks how many flagged eval rows supported that read. Policy ships on noise.

Model comparisons. Version B beats version A 71% vs 68% precision. Bands overlap by 15 points. The team rewrites infrastructure anyway.

Rare events. Fraud or safety classes are 1–3% of traffic. Total eval rows look healthy while positive counts stay in the twenties. Recall and precision both look solid on the slide but swing wildly from week to week.
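For the model comparison case above (71% vs 68%), one way to check whether the gap is real is to put a band on the difference itself. The sketch below assumes the two versions were scored on independent flagged sets; if both were scored on the same rows, a paired analysis would be more appropriate. The function name and the 200 flagged rows per version are hypothetical.

```python
import math


def difference_band(p_a: float, n_a: int, p_b: float, n_b: int,
                    z: float = 1.96) -> tuple[float, float]:
    """95% band on (p_b - p_a), assuming independent samples."""
    se = math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical: 200 flagged rows behind each version's precision.
low, high = difference_band(0.68, 200, 0.71, 200)
print(f"B minus A: {low:+.1%} to {high:+.1%}")
if low <= 0.0 <= high:
    print("Band includes zero: the 3-point gap is not distinguishable from noise.")
```

Even with 200 flagged rows per version, the band on the difference runs from clearly negative to clearly positive, so a 3-point gap alone does not justify a rewrite.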

A simple application: ship or label more?

Before moving a fraud model to auto-block, ask how many flagged eval rows supported the precision readout. If the band spans 50%–85%, label another few hundred flagged cases or run a shadow period. Threshold and calibration posts assume you can measure the metric. This post checks whether you have enough labels to do that.

Model eval: how many labels for a tight precision read?

As the labeled set grows, the 95% band on precision shrinks. Effective n is flagged rows, not total labels.

Labeled eval rows: 500
Share flagged in eval: 4%
Effective n: 20 flagged rows
Precision: 68% ±20 pp
95% band: 48–88%
Verdict: label more

Optimize (move here)

• Report effective n next to every precision/recall slide
• Stratified labeling on rare positives before launch

Hold (do not over-react)

• Shipping auto-policy on thin flagged-count evals

Escalate if

• Fewer than 30 flagged rows in eval
• Precision band wider than ±12 pp

Band is too wide to change auto-block policy. Label more flagged cases first.
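The two "Escalate if" rules can be turned into a small pre-launch check. This is a sketch, not a policy engine: the function name and the wording of the passing verdict are hypothetical, the thresholds are the ones listed above, and the margin uses the normal approximation.

```python
import math

MIN_FLAGGED_ROWS = 30  # "Fewer than 30 flagged rows in eval"
MAX_MARGIN = 0.12      # "Precision band wider than ±12 pp"


def escalation_check(precision: float, flagged_rows: int,
                     z: float = 1.96) -> tuple[str, list[str]]:
    """Return a verdict plus the reasons that triggered it."""
    reasons: list[str] = []
    if flagged_rows < MIN_FLAGGED_ROWS:
        reasons.append(f"only {flagged_rows} flagged rows (need {MIN_FLAGGED_ROWS}+)")
    if flagged_rows > 0:
        margin = z * math.sqrt(precision * (1.0 - precision) / flagged_rows)
        if margin > MAX_MARGIN:
            reasons.append(f"band is ±{margin:.0%}, wider than ±{MAX_MARGIN:.0%}")
    # The passing label is illustrative wording, not from the post.
    verdict = "Label more" if reasons else "Band is tight enough to discuss"
    return verdict, reasons


print(escalation_check(0.68, 20))
print(escalation_check(0.68, 200))
```

On the 20-row example both rules fire; at 200 flagged rows neither does.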

The habit: every model metric slide shows n_effective next to the point estimate. Call one model version better than another only when their bands barely overlap or do not overlap at all. Labeling budget is part of the launch plan, not a footnote after the fact.