Sandhya Indurkar

Math, Applied

Where Do You Draw the Line? Threshold Tradeoffs in Real Decisions

The precision-recall tradeoff when you move the classifier threshold

The idea

A classifier outputs a score. Policy turns that score into an action: auto-block, flag for review, advance a candidate. The cutoff is where math meets operations. Lower it and you catch more true positives, but you also flag more false positives, fill review queues, and burn analyst time.

Remember it in one line: precision and recall usually move in opposite directions when you move the threshold.

Calibration asks whether 90% on the score means 90% in reality. Threshold tradeoffs ask: given a workable score, where should we act, and what does each mistake cost?

Threshold choice answers one question: which errors can we afford? False alarms, missed cases, or queue overflow?
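To make the one-line rule concrete, here is a minimal Python sketch that sweeps a cutoff over a handful of scored cases. The scores and labels are invented for illustration (they are not data from this article), and `precision_recall_at` is a hypothetical helper name, not a library function.

```python
from typing import List, Tuple


def precision_recall_at(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float, int]:
    """Return (precision, recall, flagged_count) when flagging score >= threshold."""
    flagged_labels = [y for s, y in zip(scores, labels) if s >= threshold]
    tp = sum(flagged_labels)          # flagged cases that are truly positive
    total_positives = sum(labels)     # all true positives in the population
    # Convention (an assumption): precision is 1.0 when nothing is flagged.
    precision = tp / len(flagged_labels) if flagged_labels else 1.0
    recall = tp / total_positives if total_positives else 0.0
    return precision, recall, len(flagged_labels)


# Hypothetical scores and ground truth (1 = truly positive), illustration only.
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.50, 0.40, 0.30]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

# Walk the cutoff downward and watch flagged volume, precision, and recall.
for t in (0.90, 0.75, 0.60, 0.45):
    p, r, n = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f} flagged={n} precision={p:.2f} recall={r:.2f}")
```

On this toy data, recall climbs from 0.40 to 0.80 as the cutoff drops, while precision slides from 1.00 to 0.50 and the flagged count grows from 2 to 8.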

Example: precision vs recall as you move the threshold

Lower the score cutoff to catch more positives. Precision usually falls and review queues grow.

A lower threshold catches more fraud but floods review with clean orders.

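At the selected threshold:

Precision: 21% (47 true positives in 220 flagged)

Recall: 47% (53 missed · $24,745 daily cost)

[Chart: precision and recall rate vs. threshold, with threshold ticks at 60%, 75%, and 90%. Lower thresholds mean more flags, higher thresholds fewer flags. A marker shows your threshold.]

Flagged queue breakdown: 220 flagged today, of which 47 are true positives and 173 are false positives.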

Balanced zone: precision 21%, recall 47%, daily cost $24,745. Tune the threshold against the dollar weight of a clean order blocked versus a fraud case missed.

The math

Precision

precision = TP ÷ (TP + FP)

Of everything you flagged, how many were truly positive? High precision means fewer clean cases in the review queue. It usually falls when you lower the threshold.

Recall

recall = TP ÷ (TP + FN)

Of all true positives in the population, how many did you catch? High recall means fewer misses. It rises, or at worst stays flat, when you lower the threshold.
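As a quick check of the two formulas, the sketch below plugs in the counts from the fraud example in this article: 47 true positives among 220 flagged (so 173 false positives) and 53 missed cases. The function names are illustrative, and returning 0.0 for an empty denominator is an assumed convention.

```python
def precision(tp: int, fp: int) -> float:
    """precision = TP / (TP + FP): share of flagged cases that are truly positive."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0  # assumed convention for an empty queue


def recall(tp: int, fn: int) -> float:
    """recall = TP / (TP + FN): share of true positives that were caught."""
    positives = tp + fn
    return tp / positives if positives else 0.0  # assumed convention for no positives


# Counts from the article's example: 220 flagged, 47 of them true positives, 53 missed.
tp, fp, fn = 47, 220 - 47, 53

print(f"precision = {precision(tp, fp):.1%}")  # 21.4%
print(f"recall    = {recall(tp, fn):.1%}")     # 47.0%
```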

Capacity and cost

expected daily cost ≈ FP × cost_false_alarm + FN × cost_miss

False positives cost review time and customer goodwill. False negatives cost fraud losses, policy risk, or missed hires. If flagged volume exceeds reviewer slots, the queue backs up regardless of precision on paper.
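The cost formula and the capacity check can be sketched in a few lines of Python. The false-positive and false-negative counts (173 and 53) come from the fraud example in this article. The unit costs are not stated anywhere in the article: $45 per false alarm and $320 per miss are hypothetical values, chosen only because they happen to reproduce the $24,745 daily cost figure. The reviewer capacity of 200 is likewise an assumption, inferred from 220 flagged cases and a reported overflow of 20.

```python
def expected_daily_cost(
    fp: int, fn: int, cost_false_alarm: float, cost_miss: float
) -> float:
    """expected daily cost = FP * cost_false_alarm + FN * cost_miss."""
    return fp * cost_false_alarm + fn * cost_miss


def queue_overflow(flagged: int, reviewer_capacity: int) -> int:
    """Cases flagged beyond what reviewers can handle in a day (never negative)."""
    return max(0, flagged - reviewer_capacity)


# Counts from the article's example.
fp, fn, flagged = 173, 53, 220

# ASSUMED values, not given in the article.
cost_false_alarm = 45.0    # hypothetical dollars per clean order flagged
cost_miss = 320.0          # hypothetical dollars per fraud case missed
reviewer_capacity = 200    # hypothetical reviews per day

print(expected_daily_cost(fp, fn, cost_false_alarm, cost_miss))  # 24745.0
print(queue_overflow(flagged, reviewer_capacity))                # 20
```

Other pairs of unit costs could produce the same total, so treat these numbers as one plausible reading rather than the actual weights behind the example.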

Where teams get stuck

Fraud auto-block. Ops raises threshold to cut overturns. Recall drops, chargebacks climb, and nobody plotted both curves on one chart.

Content moderation. Safety lowers threshold after one incident. Precision collapses, contractors quit from false-flag fatigue, and real violations still slip through at volume.

Hiring screens. Recruiting advances more candidates to hit headcount. Interview load spikes on weak matches while strong passive candidates never get scored high enough to surface.

A simple application: auto-block policy

Fraud ops debates raising the block threshold from 0.80 to 0.90. Precision improves: fewer overturned blocks. But recall falls and more fraud slips through. The right cutoff depends on dollar costs and how many cases humans can review per day, not on accuracy alone.
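A minimal sketch of that decision, assuming you have a labeled sample of scored orders: evaluate each candidate threshold on cost and flagged volume, then pick the cheapest one that fits reviewer capacity. Everything here is hypothetical, including the toy scores, the unit costs ($45 per false alarm, $320 per miss), and the helper names `evaluate` and `pick_threshold`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ThresholdResult:
    threshold: float
    flagged: int
    fp: int
    fn: int
    cost: float
    within_capacity: bool


def evaluate(scores: Sequence[float], labels: Sequence[int], threshold: float,
             cost_false_alarm: float, cost_miss: float, capacity: int) -> ThresholdResult:
    """Count errors at one threshold and price them with explicit FP and FN costs."""
    flagged = fp = fn = 0
    for score, is_fraud in zip(scores, labels):
        if score >= threshold:
            flagged += 1
            if not is_fraud:
                fp += 1          # clean order blocked
        elif is_fraud:
            fn += 1              # fraud that slipped through
    cost = fp * cost_false_alarm + fn * cost_miss
    return ThresholdResult(threshold, flagged, fp, fn, cost, flagged <= capacity)


def pick_threshold(scores: Sequence[float], labels: Sequence[int],
                   candidates: Sequence[float], cost_false_alarm: float,
                   cost_miss: float, capacity: int) -> Optional[ThresholdResult]:
    """Cheapest candidate whose flagged volume fits capacity, or None if none fits."""
    results = [evaluate(scores, labels, t, cost_false_alarm, cost_miss, capacity)
               for t in candidates]
    feasible = [r for r in results if r.within_capacity]
    return min(feasible, key=lambda r: r.cost) if feasible else None


# Hypothetical labeled sample (1 = fraud), illustration only.
scores = [0.97, 0.93, 0.91, 0.88, 0.86, 0.83, 0.81, 0.78, 0.72, 0.65]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

for capacity in (10, 5):
    best = pick_threshold(scores, labels, [0.80, 0.90], 45.0, 320.0, capacity)
    print(f"capacity={capacity}: {best}")
```

On this toy sample, 0.80 is cheaper (cost 455 versus 1005) when reviewers can absorb its 7 flags, but with capacity for only 5 reviews the sketch falls back to 0.90. The same cost weights give a different answer once capacity binds.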

Fraud policy: threshold vs precision, recall, and cost

As the block threshold and review capacity change, precision and recall trade off, and so do daily false-alarm cost and miss cost.

Current setting: precision 21% · recall 47% · 220 flagged/day · daily cost $24,745 · queue overflow +20

[Charts: precision vs threshold; recall vs threshold]

Optimize (move here)

  • Plot precision-recall across thresholds before auto-policy
  • Weight FP vs FN cost explicitly in the cutoff choice

Hold (do not over-react)

  • Hold off on lowering the threshold when the review queue is already full

Escalate if

  • Review overflow exceeds 50 cases/day
  • Precision falls below 35% at the chosen threshold
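The two escalation rules above can be written as a small check. This is a sketch under assumptions: the limits of 50 cases/day and 35% come from the list, the counts (220 flagged, 47 true positives) come from the fraud example, and the reviewer capacity of 200 is inferred from the reported overflow of 20 rather than stated. `escalation_reasons` is a hypothetical helper name.

```python
from typing import List


def escalation_reasons(flagged: int, capacity: int, tp: int,
                       max_overflow: int = 50,
                       min_precision: float = 0.35) -> List[str]:
    """Return the escalation rules that fire for one day's flagged queue."""
    reasons = []
    overflow = max(0, flagged - capacity)
    if overflow > max_overflow:
        reasons.append(f"Review overflow {overflow} exceeds {max_overflow} cases/day")
    # Precision is undefined when nothing is flagged, so skip the check in that case.
    if flagged > 0:
        precision = tp / flagged
        if precision < min_precision:
            reasons.append(f"Precision {precision:.0%} is below {min_precision:.0%}")
    return reasons


# 220 flagged and 47 true positives are from the article; capacity=200 is ASSUMED.
for reason in escalation_reasons(flagged=220, capacity=200, tp=47):
    print(reason)
```

With these inputs, the overflow of 20 stays under the 50-case limit, but precision of about 21% is under the 35% floor, so only the precision rule fires.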

Queue exceeds capacity by 20 cases. Raise the threshold or add reviewers before tightening policy.

The habit: plot precision and recall across thresholds before you lock policy. Pair with calibration so the scores behind the threshold mean what they say. Report flagged volume alongside rates. A threshold that looks good on a slide can still drown the team.