Math, Applied
How Many Labels Before You Trust the Metric?
The idea
The model team reports 68% precision on a held-out eval set. Leadership wants to auto-block at 0.85 tomorrow. The number sounds solid until you learn the eval set had 500 rows and only 20 flagged cases. Precision on twenty outcomes is a coin flip with extra decimals.
Remember it in one line: report the metric and the labeled count behind it.
Sample size posts cover A/B tests and dashboard averages. Eval sample size is the same uncertainty logic applied to ML metrics: precision and recall are proportions on a thin slice of labeled data, and F1 is built from both. Small denominators mean wide bands.
Eval sample size answers: Is this precision/recall read tight enough to change policy?
Example: how wide is your eval metric band?
A point estimate is the center. The 95% band is where the true metric likely lives given your labeled count. Effective n is flagged or positive rows, not total labels.
500-row eval set, 4% flagged. The precision CI is still ±20 pp wide.
Point estimate: 68% (precision on flagged orders)
95% interval: 47.6% to 88.4%
500 labeled rows · effective n = 20 (20 flagged)
Precision on flagged orders is 68% on paper, but ±20 pp at 95% with only 20 flagged rows in the eval set. Label more before changing thresholds or shipping the model.
The math
Binomial uncertainty on a rate
$$\mathrm{SE} = \sqrt{\frac{p\,(1 - p)}{n_{\text{effective}}}}$$
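p is the observed precision or recall. n_effective is the count in the denominator you care about: flagged rows for precision, actual positives (true positives plus false negatives) for recall. It is usually not the total number of labeled rows. F1 mixes both denominators, so it is not a single binomial rate; read its band off the precision and recall bands rather than from this formula.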
Margin of error
$$\text{margin} = 1.96 \times \mathrm{SE}, \qquad \text{95\% band} = p \pm \text{margin}$$
At p = 0.68 and n_effective = 20, SE ≈ 0.10 and the margin is about 0.20, so the band is roughly 48%–88%. At n_effective = 200, SE ≈ 0.03 and the band narrows to about 62%–74%. Same model, different label budget.
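The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes the normal (Wald) approximation to the binomial; the function name `metric_band` is illustrative, not from any library.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% band


def metric_band(p: float, n_effective: int, z: float = Z_95) -> tuple[float, float, float]:
    """Return (standard_error, lower, upper) for an observed rate p.

    Assumes the normal approximation to the binomial, which gets rough
    when n_effective is small or p is close to 0 or 1.
    """
    if n_effective <= 0:
        raise ValueError("n_effective must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    se = math.sqrt(p * (1.0 - p) / n_effective)
    margin = z * se
    # Clip to [0, 1] because a rate cannot leave that range.
    return se, max(0.0, p - margin), min(1.0, p + margin)


if __name__ == "__main__":
    for n in (20, 200):
        se, lower, upper = metric_band(0.68, n)
        print(f"n_effective={n}: SE={se:.3f}, band={lower:.1%} to {upper:.1%}")
```

With 20 flagged rows the band comes out near 47.6% to 88.4%; with 200 it tightens to roughly 61.5% to 74.5%.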
Imbalanced eval sets
A 2,000-row eval with 1% fraud may include only ~20 fraud cases. Recall bands stay wide even when total rows look impressive. Stratified labeling buys tighter reads on the metrics that drive policy.
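To see why total rows mislead, here is a rough sketch of the recall margin as a function of positive count. The observed recall of 0.80 is a hypothetical value chosen for illustration (the post does not give one), and the helper names are made up for this example.

```python
import math


def expected_positives(total_rows: int, prevalence: float) -> int:
    """Approximate number of positive cases in a randomly sampled eval set."""
    return round(total_rows * prevalence)


def recall_margin(recall: float, n_positives: int, z: float = 1.96) -> float:
    """95% margin of error on recall; the denominator is actual positives."""
    if n_positives <= 0:
        raise ValueError("need at least one positive case")
    return z * math.sqrt(recall * (1.0 - recall) / n_positives)


if __name__ == "__main__":
    hypothetical_recall = 0.80  # assumption for illustration only

    # Random sample: 2,000 rows at 1% fraud gives about 20 positives.
    random_positives = expected_positives(2000, 0.01)
    # Stratified sample: deliberately label 200 fraud cases instead.
    stratified_positives = 200

    for label, n_pos in (("random", random_positives), ("stratified", stratified_positives)):
        margin = recall_margin(hypothetical_recall, n_pos)
        print(f"{label}: {n_pos} positives, recall margin ±{margin * 100:.1f} pp")
```

Under these assumptions the random sample leaves recall at about ±17.5 pp, while 200 stratified positives bring it near ±5.5 pp.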
Where teams get stuck
Launch reviews. A slide shows precision up 6 points. Nobody asks how many flagged eval rows supported that read. Policy ships on noise.
Model comparisons. Version B beats version A, 71% vs 68% precision. The bands overlap by 15 points, so the 3-point gap could easily be noise. The team rewrites infrastructure anyway.
Rare events. Fraud or safety classes are 1–3% of traffic. Total eval rows look healthy while positive counts stay in the twenties. Recall and precision are reported as if they were stable, but they swing wildly from week to week.
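For the model-comparison case, one way to check whether a gap is real is to put a band on the difference itself. The sketch below assumes the two versions were scored on independent flagged samples and uses an unpooled normal approximation; if both versions ran on the same eval rows, a paired test would be more appropriate. The sample sizes are hypothetical.

```python
import math


def difference_band(p_a: float, n_a: int, p_b: float, n_b: int, z: float = 1.96):
    """Return (diff, lower, upper) for p_b - p_a.

    Assumes independent samples and the normal approximation.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both sample sizes must be positive")
    se_diff = math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * se_diff, diff + z * se_diff


def is_clear_win(p_a: float, n_a: int, p_b: float, n_b: int) -> bool:
    """True only when the whole 95% band on the difference sits above zero."""
    _, lower, _ = difference_band(p_a, n_a, p_b, n_b)
    return lower > 0.0


if __name__ == "__main__":
    # Hypothetical: 20 flagged rows behind each version's precision.
    diff, lower, upper = difference_band(0.68, 20, 0.71, 20)
    print(f"B - A = {diff * 100:+.0f} pp, 95% band {lower * 100:+.1f} to {upper * 100:+.1f} pp")
    print("clear win:", is_clear_win(0.68, 20, 0.71, 20))
```

With 20 flagged rows per version, the band on the 3-point gap runs from well below zero to well above it, so the comparison cannot pick a winner.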
A simple application: ship or label more?
Before moving a fraud model to auto-block, ask how many flagged eval rows supported the precision readout. If the band spans 50%–85%, label another few hundred flagged cases or run a shadow period. Threshold and calibration posts assume you can measure the metric. This post checks whether you have enough labels to do that.
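The "label more" decision can be turned around: pick the margin you can live with and solve for the flagged count. This sketch inverts the normal-approximation margin formula; `required_flagged_rows` is an illustrative name, and using the current point estimate as the planning value for p is an assumption.

```python
import math


def required_flagged_rows(p: float, target_margin: float, z: float = 1.96) -> int:
    """Smallest n_effective whose 95% margin is at most target_margin.

    Solves z * sqrt(p * (1 - p) / n) <= target_margin for n.
    p is a planning guess, here the precision observed so far.
    """
    if not 0.0 < target_margin < 1.0:
        raise ValueError("target_margin must be a fraction between 0 and 1")
    return math.ceil(z ** 2 * p * (1.0 - p) / target_margin ** 2)


if __name__ == "__main__":
    for margin in (0.12, 0.05):
        n = required_flagged_rows(0.68, margin)
        print(f"±{margin * 100:.0f} pp at p=0.68 needs about {n} flagged rows")
```

At 68% precision, ±12 pp needs about 59 flagged rows and ±5 pp needs about 335, which is where "another few hundred flagged cases" comes from.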
Model eval: how many labels for a tight precision read?
As the labeled set grows, the 95% band on precision shrinks. Effective n is flagged rows, not total labels.
Precision 68% ±20 pp on 20 flagged rows
Chart: margin of error vs eval size
Effective n: 20
95% band: 48–88%
Verdict: Label more
Optimize (move here)
- Report effective n next to every precision/recall slide
- Stratified labeling on rare positives before launch
Hold (do not over-react)
- Shipping auto-policy on thin flagged-count evals
Escalate if
- Fewer than 30 flagged rows in eval
- Precision band wider than ±12 pp
Band is too wide to change auto-block policy. Label more flagged cases first.
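The escalation rules above are simple enough to encode as a pre-launch check. In this sketch the two thresholds come from the list above; the function name and the "Band is tight enough" label are hypothetical, since the post only shows the "Label more" verdict.

```python
import math

MIN_FLAGGED_ROWS = 30   # "Fewer than 30 flagged rows in eval"
MAX_MARGIN_PP = 12.0    # "Precision band wider than ±12 pp"


def launch_verdict(precision: float, n_flagged: int, z: float = 1.96) -> str:
    """Apply the two escalation rules to a precision readout."""
    if n_flagged <= 0:
        return "Label more"
    # Normal-approximation margin, expressed in percentage points.
    margin_pp = 100.0 * z * math.sqrt(precision * (1.0 - precision) / n_flagged)
    if n_flagged < MIN_FLAGGED_ROWS or margin_pp > MAX_MARGIN_PP:
        return "Label more"
    return "Band is tight enough"  # hypothetical label, not from the post


if __name__ == "__main__":
    for n in (20, 40, 200):
        print(n, "flagged rows:", launch_verdict(0.68, n))
```

Note that 40 flagged rows clears the count rule but still fails the width rule at 68% precision, so both checks earn their place.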
The habit: every model metric slide shows n_effective next to the point estimate. Declare a winner between model versions only when their bands barely overlap or do not overlap at all. Labeling budget is part of the launch plan, not a footnote after the fact.
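One low-effort way to make the habit stick is a helper that refuses to print a metric without its count and margin. A minimal sketch, assuming the normal approximation; `metric_readout` is an illustrative name.

```python
import math


def metric_readout(name: str, p: float, n_effective: int, unit: str, z: float = 1.96) -> str:
    """Format a metric with its 95% margin and the count behind it."""
    if n_effective <= 0:
        raise ValueError("n_effective must be positive")
    margin_pp = 100.0 * z * math.sqrt(p * (1.0 - p) / n_effective)
    return f"{name} {p:.0%} ±{margin_pp:.0f} pp on {n_effective} {unit}"


if __name__ == "__main__":
    print(metric_readout("Precision", 0.68, 20, "flagged rows"))
```

For the running example this yields a slide-ready line of the form "Precision 68% ±20 pp on 20 flagged rows".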