Math foundations
Gradient Descent: How Classifiers Learn Their Weights
The idea
Logistic regression, neural nets, and many other models start from initial weights (often random or zero) and repeatedly step downhill on a loss surface. The gradient points uphill, so each update moves in the opposite direction. The learning rate controls step size. Mini-batches and Adam are production refinements of the same idea.
Gradient descent answers: how do we find weights that lower the loss when no single closed-form formula gives the answer?
Example: gradient descent steps along the loss curve
[Interactive explorer. Purple marks the start, the green line marks the minimum, and the orange path traces the gradient steps.]
Each step moves opposite the gradient. The learning rate controls step size; set it too large and the path oscillates. On a smooth loss surface with a single minimum, gradient descent rolls downhill toward that minimum.
Sample readout: after 12 steps at lr = 0.15, w = 1.97, still 0.53 from the optimum. Increase the number of steps or adjust the learning rate to close the gap.
The math
Think of loss as height on a landscape. Weights are your position. The gradient points uphill; each update steps downhill. Learning rate is stride length; too long and you overshoot the valley.
Gradient descent update
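In standard notation, the update described here is usually written as follows (a reconstruction from the symbol definitions in this section, with t indexing the step):

```latex
w_{t+1} = w_t - \eta \, \nabla L(w_t)
```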
η (eta) is the learning rate. wₜ are the current weights. ∇L is the vector of partial derivatives of the loss with respect to each weight. Repeat until the loss stops improving or a step budget runs out.
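A minimal sketch of this update loop, assuming a made-up one-dimensional loss L(w) = (w - 2.5)² whose gradient is 2(w - 2.5). The function name `gradient_descent`, the starting point, and the loss itself are illustrative assumptions, not the function behind the explorer.

```python
def gradient_descent(grad, w0, lr, steps):
    """Apply w <- w - lr * grad(w) for a fixed number of steps."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(w)  # step opposite the gradient
        path.append(w)
    return path


# Hypothetical loss L(w) = (w - 2.5)**2, so dL/dw = 2 * (w - 2.5).
def grad_quadratic(w):
    return 2.0 * (w - 2.5)


path = gradient_descent(grad_quadratic, w0=0.0, lr=0.15, steps=12)
print(f"final w = {path[-1]:.4f}")
```

With these assumed settings each step shrinks the distance to the minimum by a constant factor of 1 - 2η = 0.7, so the path approaches 2.5 from below without overshooting.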
Partial derivative
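Written out, with d as an assumed symbol for the number of weights, the partial derivative for one weight and the gradient that collects them are:

```latex
\frac{\partial L}{\partial w_j}
\qquad\text{and}\qquad
\nabla L(w) = \left( \frac{\partial L}{\partial w_1}, \, \dots, \, \frac{\partial L}{\partial w_d} \right)
```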
The partial derivative ∂L/∂wⱼ measures how fast the loss changes when only wⱼ moves. The full gradient collects one partial derivative per weight. Frameworks compute it with backpropagation over the model's computation graph.
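To make "only wⱼ moves" concrete, here is a small sketch that estimates each partial derivative by nudging one weight at a time and compares the result with a hand-derived gradient. The two-weight loss is hypothetical, and this finite-difference approach is only a sanity check: it is not how frameworks compute gradients.

```python
def loss(w):
    """Hypothetical two-weight loss: L(w) = (w0 - 1)^2 + 3 * (w1 + 2)^2."""
    return (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2


def analytic_gradient(w):
    """Partial derivatives worked out by hand for the loss above."""
    return [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]


def numerical_partial(f, w, j, eps=1e-6):
    """Central-difference estimate of dL/dw_j: nudge only weight j."""
    up = list(w)
    down = list(w)
    up[j] += eps
    down[j] -= eps
    return (f(up) - f(down)) / (2.0 * eps)


def numerical_gradient(f, w):
    """Collect one partial derivative per weight."""
    return [numerical_partial(f, w, j) for j in range(len(w))]


w = [0.0, 0.0]
exact = analytic_gradient(w)          # [-2.0, 12.0]
approx = numerical_gradient(loss, w)  # close to the exact values
print(exact, approx)
```

The numerical version needs two loss evaluations per weight, which is why it does not scale to large models and backpropagation is used instead.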
Mini-batch SGD
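One common way to write the mini-batch step, assuming ℓᵢ denotes the loss on row i and 𝓑ₜ the random set of B row indices drawn at step t (both symbols are introduced here for illustration):

```latex
w_{t+1} = w_t - \eta \, \frac{1}{B} \sum_{i \in \mathcal{B}_t} \nabla \ell_i(w_t)
```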
B is the batch size and N is the total number of rows. Each step uses a random subset of B rows instead of the full table: a noisier path, but each step is much cheaper on big data. B = 1 is pure SGD; B = N is full-batch gradient descent.
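A rough sketch of mini-batch SGD for a one-feature logistic regression, written in plain Python so the loop is visible. The synthetic data, the hyperparameters, and the helper names (`sigmoid`, `log_loss`, `minibatch_sgd`) are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(w, b, rows):
    """Mean binary cross-entropy over (x, y) rows with a scalar feature x."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in rows:
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(rows)


def minibatch_sgd(rows, batch_size, lr, epochs, seed=0):
    """Shuffle each epoch, then update on consecutive slices of batch_size rows."""
    rng = random.Random(seed)
    rows = list(rows)  # work on a copy so the caller's data keeps its order
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(rows)
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            gw = gb = 0.0
            for x, y in batch:
                err = sigmoid(w * x + b) - y  # per-row gradient of log loss w.r.t. the logit
                gw += err * x
                gb += err
            w -= lr * gw / len(batch)  # average gradient over the batch
            b -= lr * gb / len(batch)
    return w, b


# Synthetic data: label is 1 when a noisy copy of x is positive.
data_rng = random.Random(42)
data = []
for _ in range(400):
    x = data_rng.gauss(0.0, 1.0)
    y = 1 if x + data_rng.gauss(0.0, 0.5) > 0 else 0
    data.append((x, y))

start_loss = log_loss(0.0, 0.0, data)
w, b = minibatch_sgd(data, batch_size=32, lr=0.5, epochs=20)
end_loss = log_loss(w, b, data)
print(f"loss before = {start_loss:.3f}, after = {end_loss:.3f}")
```

Setting `batch_size=1` gives pure SGD and `batch_size=len(data)` gives full-batch gradient descent, so the same loop covers both extremes.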
Learning rate
The explorer shows overshoot when η is high. Production training often decays η over time or uses adaptive methods (Adam, RMSprop) that tune step size per parameter.
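The effect of step size can be sketched on a hypothetical loss L(w) = (w - 2.5)². The learning rates and the inverse-time decay schedule below are assumptions picked to show each regime; adaptive methods such as Adam and RMSprop are not implemented here.

```python
def run(lr_schedule, w0=0.0, steps=20):
    """Gradient descent on the hypothetical loss L(w) = (w - 2.5)**2."""
    w = w0
    path = [w]
    for t in range(steps):
        grad = 2.0 * (w - 2.5)
        w = w - lr_schedule(t) * grad
        path.append(w)
    return path


def constant(lr):
    """Schedule that returns the same learning rate at every step."""
    return lambda t: lr


def inverse_time_decay(lr0, k):
    """lr_t = lr0 / (1 + k * t): large early steps, smaller later ones."""
    return lambda t: lr0 / (1.0 + k * t)


small = run(constant(0.15))      # steady approach from one side
large = run(constant(0.95))      # jumps across the minimum on every step
too_large = run(constant(1.05))  # each jump lands farther away: divergence
decayed = run(inverse_time_decay(0.95, k=0.5))

for name, path in [("0.15", small), ("0.95", large),
                   ("1.05", too_large), ("decay", decayed)]:
    print(f"{name:>6}: distance from minimum = {abs(path[-1] - 2.5):.4f}")
```

On this toy loss the decayed schedule starts with the same aggressive step as the oscillating run but settles much closer to the minimum, which is the motivation for shrinking η over time.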
Momentum (optional)
Smooths zig-zags by accumulating past gradients. Helps on ravines and speeds convergence when the loss surface is ill-conditioned.
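A small sketch of classical momentum on a hypothetical ravine-shaped loss, L(w) = 0.5 (w₀² + 100 w₁²), where one direction is 100 times steeper than the other. The learning rate and the momentum coefficient `beta = 0.9` are hand-picked for this toy surface; on other problems they would need tuning.

```python
def loss(w):
    """Hypothetical ill-conditioned loss: shallow along w0, steep along w1."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)


def grad(w):
    """Gradient of the loss above."""
    return [w[0], 100.0 * w[1]]


def descend(w0, lr, steps, beta=0.0):
    """Gradient descent with classical momentum; beta = 0 is the plain update."""
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        for j in range(len(w)):
            v[j] = beta * v[j] + g[j]  # running accumulation of past gradients
            w[j] = w[j] - lr * v[j]
    return w


start = [10.0, 1.0]
plain = descend(start, lr=0.018, steps=100)
with_momentum = descend(start, lr=0.018, steps=100, beta=0.9)
print(f"plain:    loss = {loss(plain):.6f}")
print(f"momentum: loss = {loss(with_momentum):.6f}")
```

The steep direction caps how large the learning rate can be, which leaves plain gradient descent crawling along the shallow direction. The accumulated velocity lets the momentum run keep moving along that shallow direction at the same learning rate.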
Early stopping
Training loss can keep falling while holdout loss climbs. Stop when the holdout (validation) loss starts to worsen, which limits overfitting. Pairs with the overfitting post.
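One simple way to express the stopping rule is a patience counter over per-epoch validation losses. The sketch below assumes the losses are already available as a list; the function name `early_stop`, the `patience` parameter, and the made-up validation curve are illustrative.

```python
def early_stop(val_losses, patience=2):
    """Scan per-epoch validation losses and stop after `patience`
    consecutive epochs without a new best.

    Returns (best_epoch, stop_epoch), both zero-based.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            bad_epochs = 0  # an improvement resets the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation curve: improves, then climbs as the model overfits.
val_curve = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.55]
best, stopped = early_stop(val_curve, patience=2)
print(best, stopped)  # 3 5
```

In a real training loop the weights from the best epoch would be saved and restored, since the weights at the stopping epoch are already past the best point.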
Where teams get stuck
Loss not decreasing at all. The learning rate is too high (divergence) or too low (crawl), or the features are unscaled, so gradients are tiny on one column and huge on another.
Training loss perfect, deployed model weak. The model memorized training rows. Check holdout metrics and regularization before adding model complexity.
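For the unscaled-features case, a common remedy is to standardize each column before training. A minimal sketch, assuming two hypothetical columns (`income`, `rate`) on very different scales:

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        return [0.0 for _ in column]  # a constant column carries no signal
    return [(x - mean) / std for x in column]


# Hypothetical raw features on very different scales.
income = [32000.0, 58000.0, 41000.0, 97000.0, 66000.0]
rate = [0.02, 0.05, 0.03, 0.09, 0.04]

income_z = standardize(income)
rate_z = standardize(rate)
print([round(v, 2) for v in income_z])
print([round(v, 2) for v in rate_z])
```

For a linear or logistic model the gradient for each weight is proportional to its feature value, so once both columns sit on a comparable scale a single learning rate can suit both. In practice the mean and standard deviation would be computed on training rows only and reused for holdout data.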
A simple application
Fitting logistic regression on 2M rows typically uses mini-batch SGD rather than one matrix solve. Ops sees training loss curves in MLflow; product sees precision after thresholding. Gradient descent is the bridge between the loss post and a shipped score.