Math, Applied

Ridge Regularization: Shrink Unstable Coefficients Without Dropping Features

Figure: ridge penalty shrinking regression coefficients toward zero.

The idea

Ordinary least squares can assign huge, opposite-signed coefficients when inputs correlate. Ridge adds a penalty on coefficient size. Higher λ pulls betas toward zero, stabilizing predictions on collinear features without deleting columns from the model.

Ridge answers: Can I keep all drivers in the model while making coefficient swings less violent?

Example: ridge penalty shrinks unstable coefficients

Interactive example (λ slider for regularization strength, shown at 35%): higher λ pulls coefficients toward zero and stabilizes collinear inputs. When ad spend and branded search correlate, ridge shrinks both unstable coefficients.

Coefficients at λ = 35%:

Ad spend: ridge 0.20, OLS 0.41

Branded search: ridge 0.67, OLS 1.38

At λ = 35%, ridge reduces the coefficient swing relative to OLS (0.41, 1.38). The result is better suited to prediction than to causal attribution.

The math

Ridge objective

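min over β of ||y − Xβ||² + λ||β||²

The first term is the squared residual error of the fit; the second penalizes the squared size of the coefficients. λ ≥ 0 controls shrinkage strength, and λ = 0 recovers ordinary least squares.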

Closed form (ridge)

β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy

Adding λ to the diagonal of XᵀX raises every eigenvalue by λ, so for λ > 0 the matrix is always invertible and the near-singular directions created by collinearity are damped.
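A minimal sketch of the closed form, assuming NumPy, no intercept, and synthetic data. The channel names, the value λ = 10, and the helper `ridge_closed_form` are illustrative assumptions, not taken from the example above. It fits the same data with λ = 0 (OLS) and λ > 0 and compares the condition number of the matrix being inverted.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y. Setting lam = 0 gives OLS."""
    gram = X.T @ X
    penalized = gram + lam * np.eye(X.shape[1])
    # solve() avoids forming the explicit inverse
    return np.linalg.solve(penalized, X.T @ y)

# Synthetic, hypothetical data: two channels that move almost together
rng = np.random.default_rng(0)
n_weeks = 200
ad = rng.normal(size=n_weeks)
search = 0.95 * ad + 0.1 * rng.normal(size=n_weeks)
X = np.column_stack([ad, search])
y = 0.5 * ad + 0.5 * search + rng.normal(scale=0.5, size=n_weeks)

lam = 10.0  # illustrative value, not a recommendation
beta_ols = ridge_closed_form(X, y, 0.0)
beta_ridge = ridge_closed_form(X, y, lam)

# Condition number of the matrix before and after adding lam to the diagonal
cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + lam * np.eye(2))

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print("Condition number: OLS", cond_ols, "ridge", cond_ridge)
```

In practice features are usually standardized before applying the penalty so that λ treats every column on the same scale; the sketch skips that step because both synthetic columns already have similar variance.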

Shrinkage

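λ ↑ → ||β|| ↓

As λ grows, the overall size of the coefficient vector shrinks toward zero, although an individual coefficient can occasionally grow or change sign along the way. Predictions often stay stable, but the coefficients do not become any more interpretable as causal credit.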
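The shrinkage can be traced numerically. The sketch below is an illustration under assumptions (NumPy, synthetic data, a hypothetical helper `ridge_path`, an arbitrary λ grid): it computes ridge coefficients through the singular value decomposition of X, where each singular direction with value d is scaled by d / (d² + λ), and prints the coefficient norm for each λ.

```python
import numpy as np

def ridge_path(X, y, lams):
    """Ridge coefficients for each lam via the SVD X = U diag(d) V^T.

    Each singular direction is scaled by d / (d**2 + lam), so directions
    with small d (the collinear ones) are shrunk the most.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y
    return np.array([Vt.T @ (d / (d**2 + lam) * Uty) for lam in lams])

# Synthetic, hypothetical data with two correlated features
rng = np.random.default_rng(1)
n = 150
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 0.4 * x1 + 0.6 * x2 + rng.normal(scale=0.5, size=n)

lams = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])  # illustrative grid
path = ridge_path(X, y, lams)
norms = np.linalg.norm(path, axis=1)

for lam, beta, norm in zip(lams, path, norms):
    print(f"lam={lam:7.1f}  beta={beta}  ||beta||={norm:.4f}")
```

The norm falls at every step of the grid, while the printed individual coefficients show whether each one moves monotonically on this particular dataset.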

A simple application: collinear marketing channels

When ad spend and branded search rise together, OLS coefficients can flip sign from one weekly refit to the next. Ridge keeps the forecast usable for planning but does not turn correlated inputs into clean attribution. Read this alongside the multicollinearity post before presenting driver slides.
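The refit instability can be simulated. This is a rough sketch under stated assumptions: NumPy, synthetic data in which both channels have a true coefficient of 0.3, 52 weeks per refit, and an arbitrary λ = 10. The function names are hypothetical. It refits OLS and ridge on many fresh draws and compares how much the coefficients move and how often they come out negative.

```python
import numpy as np

def fit(X, y, lam):
    """Closed-form ridge fit; lam = 0 is OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def simulate_refits(n_refits=500, n_weeks=52, lam=10.0, seed=2):
    """Refit both models on independent synthetic 'years' of weekly data."""
    rng = np.random.default_rng(seed)
    ols, ridge = [], []
    for _ in range(n_refits):
        ad = rng.normal(size=n_weeks)
        search = 0.95 * ad + 0.15 * rng.normal(size=n_weeks)  # collinear
        X = np.column_stack([ad, search])
        y = 0.3 * ad + 0.3 * search + rng.normal(size=n_weeks)
        ols.append(fit(X, y, 0.0))
        ridge.append(fit(X, y, lam))
    return np.array(ols), np.array(ridge)

ols, ridge = simulate_refits()

# Spread of each coefficient across refits (columns: ad, search)
ols_std, ridge_std = ols.std(axis=0), ridge.std(axis=0)

# Share of refits where a truly positive coefficient comes out negative
ols_flips = (ols < 0).mean(axis=0)
ridge_flips = (ridge < 0).mean(axis=0)

print("Std across refits:  OLS", ols_std, "ridge", ridge_std)
print("Share negative:     OLS", ols_flips, "ridge", ridge_flips)
```

The ridge coefficients vary less across refits, but their average is pulled away from the true values, which is the bias that makes them unsuitable as attribution.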

Marketing mix: ridge vs raw OLS on collinear channels

Interactive example: increase λ to shrink the ad and search coefficients when both channels correlate. The goal is to stabilize forecasts without pretending attribution is clean.

Settings shown: ridge penalty λ = 35%, feature overlap = 88%

Ridge coefficients: ad spend 0.23, branded search 0.64

Correlation: r ≈ 84%

Stability: improved

Optimize (move here)

• Use ridge when you must keep correlated features for prediction
• Cross-validate λ on holdout weeks
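One way to pick λ on holdout weeks is sketched below. The assumptions are NumPy, synthetic weekly data, a single chronological split (first 78 weeks for training, last 26 held out), and an arbitrary λ grid in raw units rather than the percentage scale used by the slider. The helper names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit; lam = 0 is OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def holdout_mse(X_train, y_train, X_test, y_test, lam):
    """Fit on the training weeks, score squared error on the held-out weeks."""
    beta = fit_ridge(X_train, y_train, lam)
    return float(np.mean((y_test - X_test @ beta) ** 2))

# Synthetic, hypothetical two years of weekly data with collinear channels
rng = np.random.default_rng(3)
n_weeks = 104
ad = rng.normal(size=n_weeks)
search = 0.95 * ad + 0.15 * rng.normal(size=n_weeks)
X = np.column_stack([ad, search])
y = 0.3 * ad + 0.3 * search + rng.normal(size=n_weeks)

# Chronological split: never shuffle weeks, so the holdout is truly "future"
split = 78
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

lam_grid = [0.0, 0.1, 1.0, 10.0, 100.0]  # illustrative grid
scores = {lam: holdout_mse(X_tr, y_tr, X_te, y_te, lam) for lam in lam_grid}
best_lam = min(scores, key=scores.get)

for lam, mse in scores.items():
    print(f"lam={lam:6.1f}  holdout MSE={mse:.4f}")
print("Selected lam:", best_lam)
```

A single split is noisy; repeating the procedure over several rolling train and holdout windows and averaging the scores would give a steadier choice of λ.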

Hold (do not over-react)

• Do not base budget splits on ridge coefficients when r > 0.8

Escalate if

• λ above 50% and coefficients still flip sign on refit
• Leadership asks for causal driver credit on collinear channels

Coefficients shrink toward zero. Better for stable forecasts; still not clean causal attribution.

The habit: try ridge when you must keep correlated features and care about prediction stability. Still plot correlation and avoid causal language on individual coefficients.