Math, Applied

Classifier Metrics: Precision, Recall, Accuracy, and AUC in Plain Language

Figure: confusion matrix cells beside an ROC curve.

The idea

A classifier outputs a score. Policy turns that score into an action: block, flag, advance, or ignore. Before you argue about the cutoff, you need a shared vocabulary for what the model got right and wrong. Every metric you will see in a model card or eval slide comes from one four-cell table.

Remember it in one line: precision is about the flagged pile; recall is about the missed pile; accuracy mixes both classes and lies easily when positives are rare.

This post is the reference layer. The sensitivity and specificity post builds the 2×2 table from screening. The threshold tradeoffs post adds review capacity and dollar cost. The eval sample size post asks whether your precision read is tight enough to ship. Start here when a stakeholder says "the model is 94% accurate" and nobody has drawn the confusion matrix yet.

Classifier metrics answer: Given our mistakes, which number should we optimize and which errors can the business afford?

Example: confusion matrix, scores, and curves at one threshold

Scenario: rare fraud, costly misses. Teams debate recall until the false-block queue explodes.

Score threshold: 72%. Lowering the threshold produces more flags and higher recall; raising it produces fewer flags and typically higher precision.

Confusion matrix (n = 10,000 · 1,721 flagged · prevalence 2.0%)

Predicted +, actually + (TP): 105
Predicted +, actually - (FP): 1,616
Predicted -, actually + (FN): 95
Predicted -, actually - (TN): 8,184

All scores at this threshold

Accuracy: 83%
Precision (PPV): 6%
Recall (TPR): 53%
Specificity (TNR): 84%
F1: 11%
NPV: 99%
FPR: 16%
FNR (miss rate): 48%
Balanced accuracy: 68%

Figure: ROC curve and precision-recall curve, with the same threshold marked as the operating point on each.

When to optimize precision

• Auto-block with no human review
• Customer-facing denial without appeal path

Auto-block without review: every false block angers a customer.

When to optimize recall

• Chargebacks dominate unit economics
• Manual review capacity exists

Manual review queue: missed fraud is dollars walking out the door.

AUC summarizes ranker quality across thresholds. At the 72% threshold, precision is 6% and recall is 53%. Pick the operating point from the cost of clean orders blocked versus fraud missed, not from accuracy alone.

The four cells: TP, FP, FN, TN

True positive (TP): Model flagged positive, and the case was truly positive. A fraud score that caught real fraud. A resume screen that advanced a strong candidate.

False positive (FP): Model flagged positive, but the case was negative. The clean order sent to review. The weak candidate invited to interview. False positives consume reviewer time, customer goodwill, and downstream capacity.

False negative (FN): Model said negative, but the case was positive. Fraud that shipped. A policy violation that stayed live. A strong hire that never got surfaced. False negatives are often silent until someone audits losses.

True negative (TN): Model said negative, and the case was negative. Most rows in imbalanced problems land here. TN dominates accuracy, which is why accuracy alone misleads leadership.
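As a minimal sketch of how the four cells are counted, the snippet below tallies them from paired labels and predictions. The function name `confusion_counts` and the eight-row toy data are made up for illustration; they are not from any real eval set.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = positive, 0 = negative."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # flagged, truly positive
        elif predicted == 1 and actual == 0:
            fp += 1   # flagged, actually negative
        elif predicted == 0 and actual == 1:
            fn += 1   # cleared, actually positive (a miss)
        else:
            tn += 1   # cleared, truly negative
    return tp, fp, fn, tn

# Toy data, invented for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

Every metric in the rest of this post is a ratio built from these four integers, so it is worth printing them before any percentage.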

The math

Accuracy

accuracy = (TP + TN) ÷ (TP + TN + FP + FN)

Overall fraction correct. Useful when classes are balanced and mistakes are symmetric. Misleading when negatives dominate: predicting "no fraud" on every row can still score 98% on a 2% fraud rate.

Precision (PPV)

precision = TP ÷ (TP + FP)

Of everything you flagged positive, how many were real? High precision means the review queue is mostly signal. Also called positive predictive value (PPV).

Recall (sensitivity, TPR)

recall = TP ÷ (TP + FN)

Of all actual positives in the population, how many did you catch? High recall means fewer misses. Same as sensitivity and true positive rate (TPR).

Specificity (TNR)

specificity = TN ÷ (TN + FP)

Of all actual negatives, how many were correctly cleared? High specificity means fewer false flags. It is recall's counterpart on the negative class: moving the threshold to raise one usually lowers the other.

F1 score

F1 = 2 × precision × recall ÷ (precision + recall)

Harmonic mean of precision and recall. Useful when you need one number and both mistake types matter. Punishes lopsided models: 99% precision with 5% recall still scores poorly.
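To see how hard the harmonic mean punishes a lopsided model, here is a small sketch using the 99% precision, 5% recall case from the text. The helper name `f1` and the comparison against a plain arithmetic mean are illustrative additions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

lopsided = f1(0.99, 0.05)             # high precision, very low recall
arithmetic_mean = (0.99 + 0.05) / 2   # what a simple average would report
balanced = f1(0.52, 0.52)             # same arithmetic mean, balanced model

print(f"F1 (lopsided):   {lopsided:.3f}")
print(f"Arithmetic mean: {arithmetic_mean:.3f}")
print(f"F1 (balanced):   {balanced:.3f}")
```

The lopsided model lands near 0.095, far below the 0.52 a simple average would suggest, while the balanced model keeps its 0.52.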

Negative predictive value

NPV = TN ÷ (TN + FN)

Of everything cleared as negative, how many were truly negative? Matters when a false clear is dangerous, for example releasing a flagged account too early.

False positive rate

FPR = FP ÷ (FP + TN)

Fraction of real negatives that get flagged. The x-axis on an ROC curve. Lower FPR at a fixed recall means fewer clean cases in the queue.

Balanced accuracy

balanced accuracy = (recall + specificity) ÷ 2

Averages performance on both classes. Better than raw accuracy when prevalence is skewed, but still ignores business cost asymmetry between FP and FN.
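The formulas in this section can be checked against the fraud example at the 72% threshold. The sketch below assumes FN = 95, which is inferred rather than displayed: 2.0% of 10,000 rows is 200 actual positives, and 105 of them were caught. The function name `metrics` is hypothetical, and the code does not guard against empty denominators.

```python
def metrics(tp, fp, fn, tn):
    """Compute the metrics defined above from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Counts from the fraud example; fn=95 is inferred from n and prevalence.
m = metrics(tp=105, fp=1616, fn=95, tn=8184)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

To one decimal place the results agree with the rounded scores listed for the example: accuracy near 83%, precision near 6%, recall 52.5%, and balanced accuracy near 68%.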

When accuracy looks great and still misleads

Fraud at 2% prevalence: a model that never flags anything is 98% accurate. Leadership sees a green dashboard. Losses climb. The fix is not a better slide template. Plot precision and recall, report prevalence, and show the four counts.

Hiring at 12% pass rate: accuracy hides whether you are wasting interviews (FP) or missing talent (FN). Ask which error shows up in ops metrics first.
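The do-nothing baseline is easy to make concrete. This sketch scores a model that predicts negative on every row at the two prevalences mentioned above; the row count of 10,000 and the function name `always_negative_baseline` are assumptions for illustration.

```python
def always_negative_baseline(n, positives):
    """Accuracy, recall, and balanced accuracy for a model that never flags anything."""
    tp, fp = 0, 0
    fn, tn = positives, n - positives
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)          # 0: every positive is missed
    specificity = tn / (tn + fp)     # 1: no negative is ever flagged
    balanced_accuracy = (recall + specificity) / 2
    return accuracy, recall, balanced_accuracy

fraud = always_negative_baseline(n=10_000, positives=200)      # 2% prevalence
hiring = always_negative_baseline(n=10_000, positives=1_200)   # 12% pass rate

print("fraud :", fraud)
print("hiring:", hiring)
```

Accuracy reads 98% and 88% while recall is zero in both cases. Balanced accuracy sits at 0.5, which is one reason to report it next to raw accuracy.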

Precision vs recall: what the business should optimize

Optimize precision when false alarms are expensive. Auto-block with no appeal. Customer-facing denial. Interview slots that cost manager hours. Contract reviewers who quit when most flags turn out to be clean content. Raising the threshold usually helps precision but drops recall.

Optimize recall when misses are expensive. Chargebacks and policy violations. Safety incidents. Spam drowning real mail. Screening for a rare disease where a miss is catastrophic. Lowering the threshold usually helps recall but floods the queue with false positives.

You rarely get both at once. Moving the threshold trades one error type for the other. AUC and PR curves summarize how good the ranker is before you pick the operating point. Dollar costs and review capacity pick the final cutoff.
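The trade can be sketched by sweeping a threshold over scored cases. The ten (score, label) pairs below are toy data invented for illustration, and `precision_recall_at` is a hypothetical helper; real score distributions will not be this tidy.

```python
# Toy (score, label) pairs, made up for illustration: 1 = fraud, 0 = clean.
scored = [
    (0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.70, 0),
    (0.60, 1), (0.50, 0), (0.40, 0), (0.30, 1), (0.20, 0),
]

def precision_recall_at(threshold, scored):
    """Flag every case with score >= threshold, then compute precision and recall."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

for t in (0.25, 0.55, 0.75, 0.88):
    p, r = precision_recall_at(t, scored)
    print(f"threshold {t:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy set precision climbs from about 0.56 to 1.00 as the threshold rises, while recall falls from 1.00 to 0.40. On real data precision is not guaranteed to rise at every step, which is why the text says "usually".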

ROC, AUC, and precision-recall curves

ROC curve: plots true positive rate (recall) against false positive rate as you sweep the threshold. A model that ranks positives above negatives bows toward the top-left. The diagonal is random guessing.

AUC (area under ROC): one number for ranker quality across all thresholds. It equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. 0.5 is a coin flip; 0.85+ is strong on many business problems. AUC does not tell you which threshold to ship. It answers whether the scores are worth tuning at all.

Precision-recall curve: more informative when positives are rare. A model can have decent AUC while precision collapses at the recall level ops needs. Report both curves in model reviews for fraud, safety, and medical-style screens.

ROC AUC

AUC ≈ integral of TPR d(FPR) across thresholds

Higher AUC means more recall, on average, across false-positive rates. It does not guarantee more recall at every FPR, because two ROC curves can cross. Compare model versions on AUC before debating a single production threshold.
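One way to compute the integral is the trapezoid rule over the ROC points; another is to count how often a positive outranks a negative. The sketch below does both on toy data invented for illustration. The function names are hypothetical, and `roc_points` assumes no tied scores (ties would need to be grouped into a single step).

```python
# Toy (score, label) pairs, made up for illustration: 1 = positive, 0 = negative.
scored = [
    (0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.70, 0),
    (0.60, 1), (0.50, 0), (0.40, 0), (0.30, 1), (0.20, 0),
]

def roc_points(scored):
    """Sweep the threshold from high to low, collecting (FPR, TPR) points."""
    pos = sum(y for _, y in scored)
    neg = len(scored) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in sorted(scored, key=lambda pair: pair[0], reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def auc_pairwise(scored):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

print(auc_trapezoid(roc_points(scored)), auc_pairwise(scored))
```

Both routes give 0.72 on this toy set: 18 of the 25 positive-negative pairs are ranked correctly. The pairwise version is quadratic in the number of rows, so it is shown here only to make the ranking interpretation concrete.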

Where teams get stuck

The accuracy slide. Eval deck leads with 96% accuracy on imbalanced data. Nobody shows TP, FP, FN, TN. Policy ships on the wrong metric.

Precision without sample size. "70% precision" sounds fine until you learn only 40 rows were flagged in eval. Pair with eval sample size before you auto-block.
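To put a rough width on that 70%, here is a sketch using the Wilson score interval, one common choice for a binomial proportion. Using Wilson rather than another interval is an assumption on my part, not something the post prescribes; 28 of 40 is simply 70% of the 40 flagged rows.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 28 true positives among 40 flagged rows = 70% observed precision.
lo, hi = wilson_interval(28, 40)
print(f"precision 70%, 95% interval roughly {lo:.0%} to {hi:.0%}")
```

The interval runs from roughly 55% to 82%, so the same eval is consistent with a queue that is nearly half clean orders.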

Recall without capacity. Safety lowers threshold after one incident. Recall rises, precision falls, contractors burn out, and real violations still slip through at volume.

AUC as a deployment metric. Two models with the same AUC can behave very differently at the threshold your ops team actually uses. Always mark the operating point on the curves.

A simple application: fraud auto-block review

Fraud ops has a model with 0.88 AUC. Leadership wants to auto-block at a score of 0.85. Precision at that cutoff is 62%: nearly four in ten blocked orders are clean. Recall is 71%: nearly a third of fraud still ships. The right question is not "is 0.88 AUC good?" but "what is the dollar cost of FP overturns vs FN chargebacks, and how many cases can humans review per day?"
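A back-of-envelope version of that cost question, as a sketch: precision and recall come from the example, while the dollar figures (`cost_fp=15.0`, `cost_fn=120.0`) are hypothetical placeholders that a real team would replace with its own numbers. The function name is also invented for illustration.

```python
def cost_per_100_fraud(precision, recall, cost_fp, cost_fn):
    """Expected FP and FN cost for every 100 actual fraud cases at one operating point."""
    tp = 100 * recall
    fn = 100 - tp
    # Rearranged from precision = TP / (TP + FP).
    fp = tp * (1 - precision) / precision
    total = fp * cost_fp + fn * cost_fn
    return total, fp, fn

# Precision and recall from the example; the dollar costs are HYPOTHETICAL.
total, fp, fn = cost_per_100_fraud(
    precision=0.62, recall=0.71, cost_fp=15.0, cost_fn=120.0
)
print(f"per 100 fraud cases: {fp:.1f} clean orders blocked, {fn:.0f} frauds missed")
print(f"expected cost: ${total:,.2f}")
```

Rerunning this at each candidate threshold, with real costs and a cap on daily review volume, turns the debate about the cutoff into a comparison of totals.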

The habit: draw the confusion matrix, report precision and recall at the proposed threshold, plot ROC and PR with the operating point marked, and state prevalence. Compare models on AUC; ship policy on costs and capacity. If the confidence interval on a metric is still wide, label more rows before you change production.