Math, Applied

Last Quarter's Model, This Quarter's Data: Concept Drift

[Figure: training accuracy stays high while live accuracy drops after a distribution shift]

The idea

Fraudsters launch a new payment scam. Product changes onboarding. Support tickets spike on a feature nobody trained on. The model still shows 94% accuracy on last quarter's holdout, but live false positives climb and ops keeps overturning automated decisions.

Remember it in one line: the world moved; the scoreboard stayed on old data.

Concept drift is when the relationship between inputs and outcomes shifts in production. It is different from overfitting (memorizing past noise) and different from miscalibration (scores lying at a fixed point in time). Drift is the model getting stale while metrics on frozen training data still look fine.

Concept drift asks: does the live population still behave like the one the model learned from?

Example: training metrics hold while live performance slides

[Interactive figure: a drift-severity slider running from "Stable world" to "Behavior shifted". It reports train accuracy, live accuracy, and live false-positive rate, and plots the score distribution (Low, Mid, High bins) for the training period and for the live period.]

Reading at severity 55 (severe drift), after a new payment scam shifts behavior: train accuracy 91.8%, live accuracy 76.4%, live false-positive rate 24.9%. Training accuracy masks the live drop. The world changed, not just noise.

The math

The shift

P(Y | X, today) ≠ P(Y | X, training period)

The inputs X may look similar — same columns, same dashboard — but the mapping to outcomes Y changed. New fraud patterns, new user cohorts, new product flows all break the old mapping without changing training accuracy on history.
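A minimal sketch of the inequality above, using a made-up one-feature world. The boundaries 0.5 and 0.7, the sample size, and the function names are illustrative assumptions, not values from any real fraud system. The inputs are drawn from the same distribution in both periods; only the rule that maps X to Y moves.

```python
import random

def true_label(x, boundary):
    """Outcome Y: 1 (fraud) when x exceeds the boundary in force at the time."""
    return 1 if x > boundary else 0

def frozen_model(x):
    """The model learned the training-period boundary (0.5) and never updates."""
    return 1 if x > 0.5 else 0

def accuracy(xs, ys):
    return sum(frozen_model(x) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs_train = [rng.random() for _ in range(10_000)]
xs_live = [rng.random() for _ in range(10_000)]   # same input distribution

ys_train = [true_label(x, 0.5) for x in xs_train]  # P(Y | X, training period)
ys_live = [true_label(x, 0.7) for x in xs_live]    # P(Y | X, today): mapping moved

train_acc = accuracy(xs_train, ys_train)
live_acc = accuracy(xs_live, ys_live)
print(f"train accuracy: {train_acc:.3f}")
print(f"live accuracy:  {live_acc:.3f}")
```

In this toy setup every live error is a false positive: inputs between 0.5 and 0.7 used to be fraud and no longer are, yet the frozen model still flags them. Rescoring the training data would never reveal that.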

What to watch

monitor live accuracy, false-positive rate, and score distributions

Track the same metrics on fresh labeled data weekly or monthly. A widening gap between train and live performance, or a shifting score histogram, signals drift before the damage to revenue or trust shows up in executive reviews.
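The three signals can be computed with a few lines of plain Python. This is a sketch under assumptions: the Low/Mid/High bin edges at 1/3 and 2/3 are arbitrary, the sample scores are invented, and the population stability index (PSI) is one common way to summarize a histogram shift, chosen here for illustration rather than taken from the text.

```python
import math

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def false_positive_rate(preds, labels):
    """Share of true negatives (label 0) that the model flagged (pred 1)."""
    neg_preds = [p for p, y in zip(preds, labels) if y == 0]
    return sum(neg_preds) / len(neg_preds) if neg_preds else 0.0

def bin_shares(scores, edges=(1 / 3, 2 / 3)):
    """Fraction of scores in each bin (Low, Mid, High by default)."""
    counts = [0] * (len(edges) + 1)
    for s in scores:
        counts[sum(s >= e for e in edges)] += 1
    return [c / len(scores) for c in counts]

def psi(expected, actual, eps=1e-6):
    """Population stability index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

# Invented scores, for illustration only.
train_scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.9]
live_scores = [0.4, 0.5, 0.7, 0.8, 0.9, 0.95]
train_bins = bin_shares(train_scores)
live_bins = bin_shares(live_scores)
print("train bins:", train_bins)
print("live bins: ", live_bins)
print("PSI:", round(psi(train_bins, live_bins), 3))
```

A PSI of zero means the two histograms match. What value counts as alarming is a judgment call for each team, so the sketch reports the number without setting a cutoff.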

Decision playbook

response: retrain, retune thresholds, or pause automation

Mild drift may need only threshold tweaks. Severe drift needs retraining on recent data, or rolling back auto-decisions until the model catches up. Pair either response with calibration checks so that scores mean what they say after retraining.
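For the mild case, a threshold tweak can be as simple as re-picking the cutoff on recent labeled data so the false-positive rate stays under a target. The sketch below assumes higher scores mean higher risk and that a transaction is blocked when its score is at or above the threshold. The function names, the 25% target, and the sample data are hypothetical.

```python
def fpr_at(threshold, scores, labels):
    """False-positive rate if everything scoring >= threshold is blocked."""
    neg_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not neg_scores:
        return 0.0
    return sum(s >= threshold for s in neg_scores) / len(neg_scores)

def retune_threshold(scores, labels, max_fpr):
    """Lowest observed score that keeps FPR on recent data within max_fpr."""
    for t in sorted(set(scores)):
        if fpr_at(t, scores, labels) <= max_fpr:
            return t
    return float("inf")  # no observed score qualifies: block nothing

# Invented recent labeled data: 1 = confirmed fraud, 0 = legitimate.
recent_scores = [0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
recent_labels = [0, 0, 0, 0, 1, 1]
new_threshold = retune_threshold(recent_scores, recent_labels, max_fpr=0.25)
print("new threshold:", new_threshold)
```

Choosing the lowest qualifying threshold keeps as much fraud coverage as the false-positive budget allows. It does not repair a broken mapping, which is why severe drift calls for retraining instead.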

A simple application: fraud model review

Risk reviews training accuracy monthly, and it still reads 93%. Meanwhile the live review overturn rate has doubled since a wallet feature launched, and score distributions have shifted toward low-risk bins for new scam types the model never saw. The team schedules retraining on the last 90 days of data and pauses auto-block until live calibration is replotted.

Fraud model review: train vs live after a launch

[Interactive figure: a concept-drift severity slider running from "Stable" to "Severe shift", with an auto-block share control, a chart of live accuracy vs drift, and readouts for train accuracy, live accuracy, and false blocks per day.]

Reading at severity 55 with a 6% auto-block share: train accuracy 92%, live accuracy 76%, about 75 false blocks per day. Training accuracy holds while live accuracy and false blocks worsen.

Optimize (move here)

• Monitor live labeled accuracy weekly after launches
• Retrain on rolling 90-day window when gap opens

Hold (do not over-react)

• Do not keep trusting training accuracy after product or fraud-pattern changes

Escalate if

• Train vs live accuracy gap exceeds 15 percentage points
• Live false-positive rate above 25%

Pause or narrow automation. Retrain on recent labeled data before trusting auto-block.
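The escalation rules above can be written as a small check. The 15-point gap and 25% false-positive limits come from the list. The 5-point "mild" cutoff, the function name, and the wording of the returned actions are assumptions added for illustration.

```python
def drift_action(train_acc, live_acc, live_fpr,
                 gap_limit_pp=15.0, fpr_limit=25.0, mild_gap_pp=5.0):
    """Map monitoring readouts (all in percent) to a playbook response.

    gap_limit_pp and fpr_limit mirror the escalation list;
    mild_gap_pp is a hypothetical cutoff for "mild" drift.
    """
    gap = train_acc - live_acc
    if gap > gap_limit_pp or live_fpr > fpr_limit:
        return "escalate: pause or narrow automation, retrain on recent labeled data"
    if gap > mild_gap_pp:
        return "retune: adjust thresholds and keep monitoring"
    return "monitor: no action beyond the regular review"

# Readouts from the fraud model review: train 92%, live 76%, live FPR 24.9%.
print(drift_action(92.0, 76.0, 24.9))
```

With those readouts the gap is 16 points, so the check escalates even though the false-positive rate sits just under its own limit. Either condition alone is enough.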

The habit: treat training metrics as history, live metrics as truth. Monitor drift on the same cadence as seasonality and benchmark reviews — especially after launches, policy changes, or new attack patterns.