Classification Loss: What Optimizers Actually Minimize

Figure: Cross-entropy, hinge, and zero-one loss curves.

The idea

Training does not maximize accuracy directly. As a function of the weights, accuracy is flat almost everywhere, so its gradient is zero and gradient descent gets no signal. Instead we minimize a smooth loss that punishes confident mistakes: cross-entropy for probabilistic classifiers, hinge loss for margin classifiers.

Loss functions answer: What numeric penalty should a wrong or uncertain prediction pay during training?

Example: why training uses smooth loss, not accuracy

With the true label fixed at 1, vary the predicted probability. Cross-entropy and hinge penalize confident mistakes; 0-1 loss is flat almost everywhere.

Predicted probability: 62%

Cross-entropy: 0.48
Hinge: 0.38
0-1 (misclassify): 0.00

Uncertain but correct prediction (p = 62%): cross-entropy (0.48) is higher than hinge (0.38), while 0-1 loss is already 0. Smooth losses guide optimizers; 0-1 loss does not.

The math

Training minimizes average loss over labeled rows. Evaluation reports accuracy, precision, and recall. Those are different jobs: the loss must be smooth so the weights can move; the metrics must match what the business cares about at deploy time.

0-1 loss (misclassification)

L₀₋₁ = 𝟙[ŷ ≠ y]

One if wrong, zero if right. Matches accuracy but is flat between steps: a small change in weights usually leaves the loss unchanged, so the gradient is zero. Optimizers need a slope to follow.
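A minimal sketch of that flatness, assuming a hypothetical one-weight logistic model and a made-up five-row dataset (neither comes from the original post). Nudging the weight leaves the 0-1 loss where it was, while cross-entropy moves:

```python
import math

# Hypothetical toy data: one feature per row, labels in {0, 1}.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def zero_one_loss(w):
    """Fraction of rows where the thresholded prediction is wrong."""
    wrong = sum((sigmoid(w * x) >= 0.5) != (y == 1) for x, y in zip(xs, ys))
    return wrong / len(xs)

def cross_entropy_loss(w):
    """Mean binary cross-entropy of the one-weight model p = sigmoid(w * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

# A small nudge to the weight: 0-1 loss is identical, cross-entropy is not.
for w in (1.00, 1.01):
    print(f"w={w:.2f}  zero_one={zero_one_loss(w):.2f}  cross_entropy={cross_entropy_loss(w):.6f}")
```

Both weights classify every row correctly, so the 0-1 loss reads 0.00 twice and offers no direction. Cross-entropy still drops slightly at w = 1.01, which is the slope an optimizer can follow.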

Binary cross-entropy

L_CE = −[y log p + (1−y) log(1−p)]

y is 0 or 1, and p is the predicted P(y=1). Punishes confident mistakes heavily. Standard loss for logistic regression and probabilistic classifiers. The loss rises as p moves away from the true label.
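As a rough illustration, the formula can be evaluated directly. The `eps` clipping is an assumption added here so that p = 0 or p = 1 does not hit log(0); it is not part of the formula above.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y log p + (1-y) log(1-p)], with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)  # assumption: clip to keep log() finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# True label is 1 in every row; only the predicted probability changes.
for p in (0.99, 0.62, 0.10, 0.01):
    print(f"p={p:.2f}  loss={binary_cross_entropy(1, p):.2f}")
```

The printed losses are 0.01, 0.48, 2.30, and 4.61. The loss grows slowly while the prediction is on the right side and steeply once it becomes a confident mistake.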

Why cross-entropy trains

∂L_CE/∂p = −y/p + (1−y)/(1−p)

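The gradient pushes p toward y. Wrong and confident predictions get a large slope; correct and confident predictions flatten out. With a sigmoid output p = σ(z), the chain rule gives ∂L_CE/∂z = p − y, which shrinks to zero as p approaches y. That is the signal gradient descent uses.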
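One way to sanity-check the derivative is to compare it with a finite difference. The sketch below also assumes the probability comes from a sigmoid, p = σ(z), so that dp/dz = p(1 − p). That assumption is standard for logistic regression but is not stated in the formula itself.

```python
import math

def bce(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def dbce_dp(y, p):
    """Analytic derivative from the text: -y/p + (1-y)/(1-p)."""
    return -y / p + (1 - y) / (1 - p)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a check."""
    return (f(x + h) - f(x - h)) / (2 * h)

y = 1
for p in (0.05, 0.50, 0.95):
    analytic = dbce_dp(y, p)
    numeric = numeric_derivative(lambda q: bce(y, q), p)
    # Chain rule under the sigmoid assumption: dL/dz = dL/dp * p * (1 - p).
    dz = analytic * p * (1 - p)
    print(f"p={p:.2f}  dL/dp={analytic:8.3f}  numeric={numeric:8.3f}  dL/dz={dz:6.3f}")
```

With y = 1, the slope with respect to p is about −20 at p = 0.05 and close to −1 at p = 0.95. The slope with respect to the logit equals p − y in each row, so it is the logit-space gradient that fades as the prediction becomes confident and correct.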

Hinge loss (margin)

L_hinge = max(0, 1 − y·f(x))

y is −1 or +1 in the classic form; f(x) is a raw score. The loss is zero once y·f(x) ≥ 1, that is, once the example is on the correct side with enough margin. Used in support vector machines. Flat when the model is already confident and correct.
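A small sketch of the hinge formula for a single example, with labels in {−1, +1}. The scores are arbitrary illustrative values.

```python
def hinge_loss(y, score):
    """max(0, 1 - y * score), with y in {-1, +1} and score = f(x)."""
    return max(0.0, 1.0 - y * score)

# Positive example (y = +1) evaluated at several scores.
for score in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(f"f(x)={score:+.1f}  hinge={hinge_loss(+1, score):.1f}")
```

The losses are 3.0, 1.0, 0.5, 0.0, and 0.0. A score of 0.5 is on the correct side but still pays a penalty because the margin is below 1; past a margin of 1 the loss and its slope are both zero.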

Empirical risk

(1/N) Σᵢ L(yᵢ, pᵢ)

Average loss on the training set. What SGD minimizes in practice. Holdout loss tells you whether you are learning signal or memorizing noise.
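The average can be sketched in a few lines. The labels and probabilities below are made up for illustration, and `empirical_risk` is a hypothetical helper name, not something from a library.

```python
import math

def binary_cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def empirical_risk(labels, probs, loss=binary_cross_entropy):
    """(1/N) * sum of per-row losses."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    return sum(loss(y, p) for y, p in zip(labels, probs)) / len(labels)

# Made-up predictions, for illustration only.
train_labels, train_probs = [1, 0, 1, 0], [0.95, 0.05, 0.90, 0.10]
holdout_labels, holdout_probs = [1, 0, 1, 0], [0.60, 0.45, 0.30, 0.20]

train_risk = empirical_risk(train_labels, train_probs)
holdout_risk = empirical_risk(holdout_labels, holdout_probs)
print(f"train={train_risk:.3f}  holdout={holdout_risk:.3f}  gap={holdout_risk - train_risk:.3f}")
```

In this toy case the holdout risk is well above the training risk. On real data, a gap that keeps widening while training loss falls is the usual hint that the model is memorizing rather than learning.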

Train vs eval

train on L_CE · report precision/recall at cutoff

Optimize a smooth loss during training. Choose the threshold and report business metrics afterward. Never back-propagate accuracy; it has no usable gradient.
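The reporting half can be sketched as a thresholding step applied after training. The helper name `precision_recall_at` and the scores are hypothetical, chosen only to show how the two metrics move with the cutoff.

```python
def precision_recall_at(labels, probs, cutoff):
    """Precision and recall when every p >= cutoff is flagged positive."""
    tp = sum(1 for y, p in zip(labels, probs) if p >= cutoff and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= cutoff and y == 0)
    fn = sum(1 for y, p in zip(labels, probs) if p < cutoff and y == 1)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Made-up scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.92, 0.81, 0.40, 0.70, 0.30, 0.10, 0.55, 0.65]

for cutoff in (0.5, 0.75):
    precision, recall = precision_recall_at(labels, probs, cutoff)
    print(f"cutoff={cutoff:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the cutoff from 0.50 to 0.75 moves precision from 0.60 to 1.00 and recall from 0.75 to 0.50 on these rows. The model and its training loss are unchanged; only the decision rule differs.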

Where teams get stuck

Training loss down, eval flat. The model is fitting training noise. Pair this with the posts on overfitting and eval sample size.

Class imbalance ignored. Cross-entropy on 99% negatives still pushes the model toward always predicting zero. Use class weights, resampling, or metrics that match the decision (PR curve, not accuracy).
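A sketch of the class-weight option, assuming the common heuristic of weighting positives by the negative-to-positive ratio. The dataset and the two candidate prediction vectors are made up, and `weighted_bce` is a hypothetical helper.

```python
import math

def weighted_bce(labels, probs, pos_weight):
    """Mean cross-entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

# Made-up imbalanced set: 1 positive per 99 negatives.
labels = [1] + [0] * 99
pos_weight = labels.count(0) / labels.count(1)  # 99.0, one common heuristic

always_low = [0.01] * 100               # near-zero probability for every row
finds_positive = [0.60] + [0.08] * 99   # pays a little on negatives to catch the positive

for name, probs in (("always_low", always_low), ("finds_positive", finds_positive)):
    plain = weighted_bce(labels, probs, 1.0)
    weighted = weighted_bce(labels, probs, pos_weight)
    print(f"{name:15s} unweighted={plain:.3f}  weighted={weighted:.3f}")
```

Unweighted, the always-low predictions score better, which is the pull toward predicting zero. With the weight applied, the predictions that find the positive score better. Weighting also shifts the predicted probabilities, so the cutoff is best re-chosen on holdout data afterward.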

A simple application

The fraud model minimizes cross-entropy on six months of labeled chargebacks. Leadership cares about precision at a 0.85 cutoff. Training loss and deployment metrics are related but not the same number. Log both on the model card.