Math foundations

Gradient Descent: How Classifiers Learn Their Weights

The idea

Logistic regression, neural nets, and many other models start with random weights and repeatedly step downhill on a loss surface. The gradient points uphill; we move opposite it. Learning rate controls step size. Mini-batches and Adam are production refinements on the same idea.

Gradient descent answers: How do we find weights that lower loss when we cannot solve in one formula?

Example: gradient descent steps along the loss curve

Each step moves opposite the gradient. Learning rate controls step size; too large oscillates.

Gradient descent rolls downhill on a smooth loss surface toward one minimum.

Purple = start · Green line = minimum · Orange path = gradient steps

Learning rate: 0.15

Start w: -1.20

Steps: 12

After 12 steps at lr = 0.15, w = 1.97. Still 0.53 from optimum. Increase steps or adjust learning rate.

The math

Think of loss as height on a landscape. Weights are your position. The gradient points uphill; each update steps downhill. Learning rate is stride length; too long and you overshoot the valley.

Gradient descent update

wₜ₊₁ = wₜ − η · ∇L(wₜ)

η (eta) is learning rate. wₜ are current weights. ∇L is the vector of partial derivatives of loss with respect to each weight. Repeat until loss stops improving or a step budget ends.

Partial derivative

∂L/∂wⱼ = limit of (L(w+εeⱼ) − L(w)) / ε

How fast loss changes when only wⱼ moves. The full gradient collects one partial derivative per weight. Frameworks compute this with backpropagation on large graphs.

Mini-batch SGD

wₜ₊₁ = wₜ − η · (1/B) Σᵢ∈batch ∇Lᵢ(wₜ)

B is batch size. Each step uses a random subset of rows instead of the full table. Noisier path, much faster on big data. B = 1 is pure SGD; B = N is full-batch gradient descent.

Learning rate

η too large → oscillate · η too small → slow

The explorer shows overshoot when η is high. Production training often decays η over time or uses adaptive methods (Adam, RMSprop) that tune step size per parameter.

Momentum (optional)

mₜ = βmₜ₋₁ + (1−β)∇L · wₜ₊₁ = wₜ − η mₜ

Smooths zig-zags by accumulating past gradients. Helps on ravines and speeds convergence when the loss surface is ill-conditioned.

Early stopping

stop when holdout loss rises

Training loss can keep falling while holdout loss climbs. Stop when validation loss worsens to limit overfitting. Pairs with the overfitting post.

Where teams get stuck

Loss not decreasing at all. Learning rate too high (divergence) or too low (crawl). Features unscaled so gradients are tiny on one column and huge on another.

Training loss perfect, deploy weak. Memorized training rows. Check holdout metrics and regularization before adding model complexity.

A simple application

Fitting logistic regression on 2M rows uses mini-batch SGD, not one matrix solve. Ops sees training loss curves in MLflow; product sees precision after thresholding. Gradient descent is the bridge between the loss post and a shipped score.