Math, Applied
The Model Looked Perfect on Past Data: Overfitting in Real Decisions
The idea
A forecast can look excellent on the weeks you used to build it and fall apart on weeks you held back. The model memorized noise, seasonality quirks, and one-off promos instead of learning a pattern that repeats.
Remember it in one line: if it only works on history you already saw, it is not ready for next month's budget.
Overfitting is not a data science buzzword. It is the gap between training performance and holdout performance. Finance sees it when inventory plans miss. Marketing sees it when ROI models break after a channel mix change. Product sees it when churn scores misfire on new accounts.
An overfitting check answers one question: did we fit the past, or will the fit hold in the future?
Example: train error falls while holdout error rises
As model complexity increases, a flexible curve hugs training history. Holdout weeks tell you if the forecast will survive new data.
Scenario: Finance trusts a wiggly forecast that memorized last quarter.
Train error: 11.2% · Holdout error: 30.8%
Holdout error is 30.8% against a train error of 11.2%. The model is memorizing history, not forecasting new weeks.
The math
The warning sign
Train error keeps falling as you add variables and flexibility. Holdout error often bottoms out, then rises. The widening gap is overfitting.
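The gap can be written down as a plain subtraction. The sketch below is illustrative, not the post's own tooling: the post does not say which error metric it uses, so mean absolute percentage error (MAPE) is an assumption, and the function names `mape` and `generalization_gap` are hypothetical.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent.

    Assumed metric; the actuals must be non-zero or the division fails.
    """
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts))
    return 100.0 * total / len(actuals)


def generalization_gap(train_error_pct, holdout_error_pct):
    """Holdout error minus train error, in percentage points."""
    return holdout_error_pct - train_error_pct


# Numbers from the example above: train 11.2%, holdout 30.8%.
gap_pp = generalization_gap(11.2, 30.8)
print(f"gap = {gap_pp:.1f} pp")  # prints "gap = 19.6 pp"
```

Tracking this one number as terms are added makes the warning sign visible: train error alone will keep improving, while the gap shows when holdout error has stopped following it.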
Why it happens
Twelve weeks of data cannot support twenty interaction terms. Each extra dial lets the curve bend to fit random wiggles that will not repeat.
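A small experiment makes the point concrete. The sketch below swaps the interaction terms for a simpler stand-in: a polynomial with twelve coefficients fitted to twelve weeks, which can pass through every training point exactly. The order numbers are made up for illustration (a flat level near 100 plus noise), and `interpolate` and `mean_abs_error` are hypothetical helper names.

```python
def interpolate(xs, ys, x):
    """Evaluate the polynomial through all (xs, ys) points at x (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def mean_abs_error(actuals, forecasts):
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)


# Made-up weekly orders: no real trend, just noise around a level of ~100.
train_weeks = list(range(12))
train_orders = [102, 97, 105, 99, 101, 94, 108, 100, 96, 103, 98, 104]
holdout_weeks = [12, 13, 14, 15]
holdout_orders = [101, 99, 102, 97]

# Flexible model: twelve coefficients for twelve data points.
flex_train = [interpolate(train_weeks, train_orders, w) for w in train_weeks]
flex_holdout = [interpolate(train_weeks, train_orders, w) for w in holdout_weeks]

# Simple model: a single number, the training mean.
level = sum(train_orders) / len(train_orders)
simple_train = [level] * len(train_weeks)
simple_holdout = [level] * len(holdout_weeks)

flex_train_err = mean_abs_error(train_orders, flex_train)
flex_holdout_err = mean_abs_error(holdout_orders, flex_holdout)
simple_train_err = mean_abs_error(train_orders, simple_train)
simple_holdout_err = mean_abs_error(holdout_orders, simple_holdout)

print("flexible: train", flex_train_err, "holdout", flex_holdout_err)
print("simple:   train", simple_train_err, "holdout", simple_holdout_err)
```

The flexible model reproduces every training week, so its train error is zero, yet its holdout predictions swing far away from the actual orders. The one-number model is worse on training weeks and far better on holdout weeks.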
What to do
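Split by time: fit on early weeks, score on later weeks. A random split would let the model see weeks that come after the ones it is scored on, which flatters the holdout number. Prefer simpler models when the holdout gap opens. Regression posts cover the mechanics; this post covers the decision to ship or wait.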
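A time-based split takes only a few lines. This is a minimal sketch, assuming the history is a list of weekly values already sorted oldest to newest; `split_by_time` is a hypothetical helper and the order numbers are made up for illustration.

```python
def split_by_time(weekly_values, holdout_weeks):
    """Return (train, holdout); holdout is the most recent `holdout_weeks` values."""
    if not 0 < holdout_weeks < len(weekly_values):
        raise ValueError("holdout_weeks must be between 1 and len(weekly_values) - 1")
    cut = len(weekly_values) - holdout_weeks
    return weekly_values[:cut], weekly_values[cut:]


# Made-up weekly orders, oldest first.
weekly_orders = [102, 97, 105, 99, 101, 94, 108, 100, 96, 103, 98, 104]
train, holdout = split_by_time(weekly_orders, holdout_weeks=4)
print(train)    # the first 8 weeks, in order
print(holdout)  # the last 4 weeks, in order
```

Nothing is shuffled, so every holdout week comes strictly after every training week, which mirrors how the forecast will be used.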
A simple application: the forecast review
Ops presents a weekly order forecast with 4% train error. Finance asks for holdout weeks, and the error jumps to 19%. The team drops three channel interaction terms; holdout error falls to 9% while train error rises slightly. That is a model you can plan inventory against.
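The review comes down to ranking candidates by holdout error rather than train error. The sketch below uses only the numbers from the story above; the candidate labels and the `pick_by_holdout` helper are hypothetical, and the simplified model's train error is left out because the story only says it rose slightly.

```python
def pick_by_holdout(holdout_errors):
    """Return the candidate name with the lowest holdout error (percent)."""
    return min(holdout_errors, key=holdout_errors.get)


# Holdout errors from the forecast review, in percent.
holdout_errors = {
    "with channel interactions": 19.0,
    "without channel interactions": 9.0,
}

# Gap for the original model: 19% holdout minus 4% train.
original_gap_pp = 19.0 - 4.0

best = pick_by_holdout(holdout_errors)
print(f"original gap = {original_gap_pp:.0f} pp")  # prints "original gap = 15 pp"
print(f"plan against: {best}")  # prints "plan against: without channel interactions"
```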
Forecast review: train vs holdout
As model complexity increases, train error falls while holdout error rises, which is the signature of memorizing history.
Current setting (12 terms): train error 11%, holdout error 14%, gap 3 pp.
Complexity vs holdout error: low complexity 8% · current setting 14% · high complexity 22%
Optimize (move here)
- Always show train and holdout error together
- Cap complexity when history is short
Hold (do not over-react)
- Hold off on shipping a forecast on train error alone
Escalate if
- Holdout error worsens after adding drivers
At train 11% and holdout 14%, the generalization gap is acceptable for this complexity.
The habit: always show train and holdout together. Cap complexity when history is short. Pair with sample size and confidence intervals before you lock spend.