Math, Applied

The Model Says 90%: Can You Trust the Score?

[Figure: reliability diagram with the model curve below the perfect calibration line]

The idea

A fraud model auto-blocks orders scored above 0.85. The dashboard shows 92% accuracy, yet ops keeps overturning blocks because the customers were legitimate. The model ranks reasonably well, but the scores are overconfident: orders scored 90% fraud turn out to be fraud only 60% of the time.

Remember it in one line: a high score is not the same as a reliable score.

Base rates teach you what happens after one alert fires. Calibration teaches whether the number on the score itself means what it says. A model can catch most fraud while its probabilities lie — and that breaks any rule tied to a fixed threshold.

Calibration answers: among everything the model scored ~90%, is roughly 90% actually positive?

Example: reliability diagram (predicted score vs actual rate)

[Interactive figure: reliability diagram. The dashed diagonal is perfect calibration; the solid curve is what actually happens in each score bucket. A slider sets calibration quality from overconfident (scores too high) through perfect to underconfident.]

On a perfectly calibrated model, points sit on the diagonal: among orders scored 90%, about 90% are truly positive. Overconfident models sit below the line.

At the setting shown (calibration quality 25, overconfident), only 68% of orders scored 90% are actually fraud. That is a 22 pp calibration gap, measured as predicted minus actual at the top bucket. Auto-block rules built on raw scores will misfire even if ranking is useful.

The math

Well calibrated

reliability: actual rate in score bucket ≈ average predicted score

Sort predictions into buckets (70–80%, 80–90%, and so on). In each bucket, count how many were truly positive. If the model says 90%, about 90% should be positive. That is the reliability check.
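The bucket check above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production tool: the function name `reliability_table`, the equal-width buckets, and the toy data are all assumptions made for the example.

```python
def reliability_table(scores, labels, n_buckets=10):
    """Group predictions into equal-width score buckets and compare the
    average predicted score with the observed positive rate."""
    buckets = [[] for _ in range(n_buckets)]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 is placed in the top bucket.
        idx = min(int(score * n_buckets), n_buckets - 1)
        buckets[idx].append((score, label))

    rows = []
    for idx, items in enumerate(buckets):
        if not items:
            continue  # skip empty buckets
        rows.append({
            "bucket": (idx / n_buckets, (idx + 1) / n_buckets),
            "count": len(items),
            "avg_predicted": sum(s for s, _ in items) / len(items),
            "actual_rate": sum(y for _, y in items) / len(items),
        })
    return rows


# Toy data (made up): ten orders scored 0.90, six of them truly fraud.
scores = [0.90] * 10
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
table = reliability_table(scores, labels)
```

Plotting `avg_predicted` against `actual_rate` for each row gives the reliability diagram. Buckets with few orders give noisy rates, so the `count` column is worth reading alongside the other two.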

Overconfident scores

calibration gap = average predicted − actual rate (in a bucket)

When the gap is positive, scores run hot: the model says 90% but only 60% are real. Auto-block rules and cost models built on raw scores will over-punish clean cases.
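As a worked instance of the gap formula, here is the 90% vs 60% case from the text. The helper name `calibration_gap` and the choice to report percentage points are assumptions for illustration.

```python
def calibration_gap(avg_predicted, actual_rate):
    """Gap in percentage points for one bucket.
    Positive: scores run hot (overconfident). Negative: underconfident."""
    return (avg_predicted - actual_rate) * 100


# The model says 90%, but only 60% of those orders are really fraud.
gap_pp = calibration_gap(avg_predicted=0.90, actual_rate=0.60)
print(f"gap: {gap_pp:.0f} pp")
```

A gap near zero in every bucket is what "well calibrated" means in practice. A single hot bucket at the top matters most when that bucket is the one driving auto-block.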

Why accuracy misleads

ranking can be fine while calibration is broken

Accuracy mixes threshold choice with score quality. A model can rank fraud above clean orders (good for prioritizing review) while every probability is too high (bad for auto-decisions). Check calibration before you wire scores to policy.
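To see how ranking and calibration come apart, the sketch below inflates every score with an order-preserving transform. Everything here is illustrative: the toy data is made up, `pairwise_auc` is a simple pairwise ranking measure, and `mean_gap` is a crude overall check (average score minus overall positive rate), not a per-bucket one.

```python
def pairwise_auc(scores, labels):
    """Share of (positive, negative) pairs where the positive scores higher.
    Ties count as half. This measures ranking only."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_gap(scores, labels):
    """Average predicted score minus overall positive rate."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)


# Toy data (made up): scores roughly in line with outcomes on average.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
base = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80, 0.90]
# Inflate every score with a monotonic transform: the order is unchanged.
hot = [s ** 0.25 for s in base]

auc_base, auc_hot = pairwise_auc(base, labels), pairwise_auc(hot, labels)
gap_base, gap_hot = mean_gap(base, labels), mean_gap(hot, labels)
```

Both score sets rank the orders identically, so the ranking measure does not move, while the inflated set runs well above the true positive rate. A ranking metric alone would not flag the second model as unsafe for auto-decisions.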

A simple application: auto-block threshold

Fraud ops sets auto-block at 0.85 and human review for 0.60–0.85. After launch, review queues flood and overturn rates climb. The fix is not always a higher threshold — it may be recalibrating scores (Platt scaling, isotonic regression) or pairing scores with base rate context before auto-action.
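For a sense of what recalibration does, here is a minimal sketch of isotonic regression using the pool-adjacent-violators idea, written in plain Python. The function names and toy data are assumptions for illustration; in practice the mapping would be fit on held-out reviewed orders, and a library implementation would likely be preferable.

```python
def fit_isotonic(scores, labels):
    """Return (highest_score, calibrated_rate) steps, non-decreasing in score."""
    # Pool identical scores first so tied scores share one rate.
    pooled = {}
    for score, label in zip(scores, labels):
        total, count = pooled.get(score, (0, 0))
        pooled[score] = (total + label, count + 1)

    blocks = []  # each block: [sum_of_labels, count, highest_score]
    for score in sorted(pooled):
        total, count = pooled[score]
        blocks.append([total, count, score])
        # Merge backwards while the newest block's rate is below its neighbor's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def calibrate(score, steps):
    """Map a raw score to the rate of the first block that covers it."""
    for top, rate in steps:
        if score <= top:
            return rate
    return steps[-1][1]  # above every seen score: use the top block


# Toy review outcomes (made up): raw score -> was it really fraud?
scores = [0.70] * 5 + [0.80] * 5 + [0.90] * 10
labels = [1, 1, 0, 0, 0] + [1, 0, 0, 0, 0] + [1] * 6 + [0] * 4
steps = fit_isotonic(scores, labels)
calibrated_90 = calibrate(0.90, steps)
```

On this toy data a raw 0.90 maps to a calibrated rate of about 0.60, so a 0.85 auto-block rule applied to calibrated scores would no longer fire on those orders. The ordering of orders is preserved, which is why recalibration can repair the probabilities without hurting the review queue's prioritization.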

Auto-block policy: threshold vs score reliability

[Interactive figure: daily auto-blocks vs capacity, and overturn rate vs calibration. Sliders set the auto-block threshold, review capacity (orders/day), and calibration quality from overconfident to well calibrated.]

Overconfident scores inflate overturns even when ranking is fine. At the setting shown (auto-block threshold 85%, review capacity 180 orders/day, calibration quality 25), the policy produces 147 auto-blocks/day with a ~31% overturn rate (46 clean orders) and no queue overflow.

Optimize (move here)

• Plot reliability before wiring scores to auto-block
• Recalibrate (Platt/isotonic) when overturn rate spikes

Hold (do not over-react)

• Lowering threshold alone when scores are overconfident

Escalate if

• Review queue overflow exceeds 50 orders/day
• Overturn rate above 45% after auto-block
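The escalation rules above can be written as a small check. This is a sketch under stated assumptions: the function name is hypothetical, and "queue overflow" is taken to mean auto-blocks beyond daily review capacity, which the text does not define explicitly.

```python
def escalation_reasons(auto_blocks, review_capacity, overturn_rate,
                       max_overflow=50, max_overturn_rate=0.45):
    """Return the list of escalation rules that fired (empty if none)."""
    # Assumption: overflow = auto-blocks per day beyond review capacity.
    overflow = max(0, auto_blocks - review_capacity)
    reasons = []
    if overflow > max_overflow:
        reasons.append(f"queue overflow {overflow} > {max_overflow} orders/day")
    if overturn_rate > max_overturn_rate:
        reasons.append(
            f"overturn rate {overturn_rate:.0%} > {max_overturn_rate:.0%}"
        )
    return reasons


# Numbers from the example: 147 auto-blocks/day, capacity 180, 31% overturned.
reasons = escalation_reasons(auto_blocks=147, review_capacity=180,
                             overturn_rate=0.31)
```

With the example numbers neither rule fires, which fits the "hold" guidance: a 31% overturn rate calls for recalibration, not escalation.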

Scores run hot: 91% predicted, but only ~69% truly fraud. Recalibrate before tightening the threshold.

The habit: plot reliability before you automate. Keep ranking and calibration as separate questions. If scores are overconfident, use them to sort the queue — not to auto-block without review.