Logistic Regression: Linear Boundary, Probability Score

Logistic regression boundary and sigmoid curve

The idea

Logistic regression extends the line fit you already know into classification. A linear combination of features passes through a sigmoid to produce a probability between 0 and 1. Fraud models, churn scores, and hiring rankers often start here.

Logistic regression answers: What is the probability this row is positive, and where is the linear decision boundary?

Example: linear boundary and sigmoid probability

Adjust weights to rotate the boundary. The cutoff turns probability into a class label.

Logistic regression draws a linear boundary and outputs a probability, not just yes/no.

Explorer panels: feature space (Velocity vs Amount) and the sigmoid curve. The orange line marks the classification cutoff.

Snapshot settings: bias w₀ = -0.20, Velocity weight = 2.40, Amount weight = 2.80, cutoff = 50%, training accuracy 44%.

With these settings, the boundary at 50% probability misclassifies many points. Logistic regression assumes a linear separator; nonlinear patterns need trees or more features. Sample score at center: 92%.
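The 92% sample score can be reproduced from the slider values as a quick check. This is a minimal sketch that assumes both features are scaled to [0, 1], so the center of the feature space is (0.5, 0.5); the page does not state the scaling, and `score` is an illustrative name.

```python
import math

# Slider values from the explorer snapshot.
W0, W_VELOCITY, W_AMOUNT = -0.20, 2.40, 2.80

def score(velocity, amount):
    """Probability of the positive class for one point."""
    z = W0 + W_VELOCITY * velocity + W_AMOUNT * amount
    return 1.0 / (1.0 + math.exp(-z))

# Assumption: features are scaled to [0, 1], so the center is (0.5, 0.5).
center = score(0.5, 0.5)
print(f"{center:.0%}")  # 92%
```

Under that assumption z = -0.20 + 1.20 + 1.40 = 2.40, and the sigmoid of 2.40 rounds to 92%.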

The math

Logistic regression keeps the linear structure of regression but outputs a probability. The sigmoid squashes the score into (0, 1); cross-entropy (see the loss post) defines the loss; gradient descent fits the weights.

Linear score (logit input)

z = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ

w₀ is the intercept (baseline log-odds). Each wⱼ is how much feature j pushes the score up or down, holding other columns fixed. Same form as y = a + bx, but z is not the final prediction.
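As a small illustration, the linear score is a dot product plus the intercept. The function name `linear_score` and the example weights are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def linear_score(weights, bias, features):
    """z = w0 + sum_j w_j * x_j for one row."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical weights and row, chosen only for illustration.
z = linear_score(weights=[2.0, -1.0], bias=0.5, features=[1.0, 3.0])
print(z)  # -0.5
```

Here z = 0.5 + 2.0·1.0 − 1.0·3.0 = −0.5: a raw score, not yet a probability.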

Sigmoid

P(y=1|x) = σ(z) = 1 / (1 + e⁻ᶻ)

Maps any real z to (0, 1). z = 0 gives P = 0.5. Large positive z approaches 1; large negative z approaches 0. This is the number fraud and churn dashboards show before policy applies a cutoff.
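The properties above can be checked with a few lines. The sketch below uses a branch on the sign of z, a common way to keep `math.exp` from overflowing for very negative z; it is one possible implementation, not the only one.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(z, round(sigmoid(z), 4))
```

The naive form 1 / (1 + exp(-z)) raises `OverflowError` in Python at z = -1000, which is why the negative branch is written differently.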

Log-odds (logit)

log(P / (1−P)) = z

The linear part models log-odds, not probability directly. Adding one unit to feature j, with the other features held fixed, multiplies the odds by e^wⱼ. Coefficients are interpretable on the log-odds scale.
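A short numeric check of the odds interpretation, using a hypothetical coefficient `w_j` and baseline score `z_before` (both made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficient and baseline score.
w_j = 0.7
z_before = -1.0
z_after = z_before + w_j  # feature j increases by one unit

odds_before = sigmoid(z_before) / (1.0 - sigmoid(z_before))
odds_after = sigmoid(z_after) / (1.0 - sigmoid(z_after))

print(odds_after / odds_before)   # odds ratio
print(math.exp(w_j))              # e^w_j, the same value up to rounding
print(logit(sigmoid(z_before)))   # recovers z_before up to rounding
```

The probability itself does not change by a fixed amount per unit; only the odds scale by a constant factor.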

Decision boundary

P(y=1|x) = 0.5 ⟺ z = 0

In two features, z = 0 is a line. In n features, it is a hyperplane. Points on one side score above 0.5, the other below. Nonlinear boundaries need feature engineering, trees, or other models.
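For two features, the boundary can be written as an explicit line by solving z = 0 for x₂. The sketch below uses hypothetical weights that put the boundary at x₁ + x₂ = 1; the helper names are illustrative.

```python
def boundary_x2(w0, w1, w2, x1):
    """x2 on the line w0 + w1*x1 + w2*x2 = 0 (requires w2 != 0)."""
    return -(w0 + w1 * x1) / w2

def side(w0, w1, w2, x1, x2):
    """1 if the point scores above 0.5 (z > 0), else 0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0

# Hypothetical weights: the boundary is the line x1 + x2 = 1.
w0, w1, w2 = -1.0, 1.0, 1.0
print(boundary_x2(w0, w1, w2, x1=0.25))  # 0.75
print(side(w0, w1, w2, 0.8, 0.8))        # 1
print(side(w0, w1, w2, 0.1, 0.1))        # 0
```

Changing w₁ relative to w₂ rotates the line; changing w₀ shifts it without rotating.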

Training objective

min Σᵢ L(yᵢ, pᵢ) + λ||w||²

Fit weights by minimizing classification loss (usually cross-entropy) on labeled rows. An optional λ penalty shrinks coefficients, as in ridge regression, which stabilizes them when features are correlated.
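One way to minimize this objective is plain batch gradient descent. The sketch below makes several assumptions not stated in the text: it averages the cross-entropy over rows instead of summing (which only rescales λ), it leaves the intercept unpenalized, and the learning rate, step count, and toy data are arbitrary.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit(rows, labels, lam=0.01, lr=0.5, steps=2000):
    """Batch gradient descent on mean cross-entropy + lam * ||w||^2."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
            err = p - y  # derivative of cross-entropy with respect to z
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        # Assumption: penalize the weights only, not the intercept.
        w = [wj - lr * (gj + 2.0 * lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data (made up): the label is 1 when both features are large.
rows = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.6], [0.9, 0.9]]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit(rows, labels)
preds = [1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0
         for x in rows]
print(w, b, preds)
```

The key line is `err = p - y`: with a sigmoid output and cross-entropy loss, the gradient with respect to z reduces to prediction minus label.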

Multiclass extension

softmax(zₖ) = e^zₖ / Σⱼ e^zⱼ

With more than two classes, the model computes one score per class, and softmax turns the scores into probabilities that sum to 1. Same linear core, wider output layer.
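A minimal softmax sketch, subtracting the maximum score before exponentiating so large scores do not overflow (a standard trick that leaves the result unchanged). The example scores are arbitrary.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

With two classes and scores (z, 0), softmax gives e^z / (e^z + 1) for the first class, which is the sigmoid of z, so the binary model is a special case.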

Where teams get stuck

Treating 0.87 as calibrated. A score is a ranking aid until calibration checks pass. See the model calibration post.

Reading coefficients as causal. Correlated inputs share credit. wⱼ is a lever only when features are independent enough to interpret.

Linear boundary on nonlinear data. If the explorer boundary misclassifies obvious clusters, add interaction terms or polynomial features, or switch model family (trees, k-NN).
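To see why an interaction term can help, consider XOR-style clusters, which no single line in (x₁, x2) separates. The sketch below adds the product x₁·x₂ as a third feature; the weights are hand-picked for illustration, not fitted.

```python
def predict(w0, weights, features):
    """Class label from the sign of the linear score."""
    z = w0 + sum(w * x for w, x in zip(weights, features))
    return 1 if z > 0 else 0

# XOR-style clusters: positive when exactly one feature is high.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]

# Hand-picked weights (not fitted) on the expanded features [x1, x2, x1*x2].
w0, weights = -0.5, [1.0, 1.0, -2.0]
preds = [predict(w0, weights, [x1, x2, x1 * x2]) for x1, x2 in points]
print(preds)  # [0, 1, 1, 0]
```

The model is still linear in its inputs; the boundary only looks curved when drawn back in the original two features.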

A simple application

Fraud ops ships a logistic score from velocity and amount. Policy auto-blocks at 0.85. The math post stops at the boundary; the classifier metrics post asks whether precision and recall at the 0.85 cutoff are good enough, and the threshold tradeoffs post adds queue capacity and dollar cost.
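A cutoff on probability is equivalent to a cutoff on the linear score, because the sigmoid is monotonic. The sketch below applies the 0.85 policy on the log-odds scale; the function names and the example scores are hypothetical.

```python
import math

def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))

def decide(z, cutoff=0.85):
    """Auto-block when sigmoid(z) >= cutoff, tested on the log-odds scale."""
    return "block" if z >= logit(cutoff) else "allow"

print(round(logit(0.85), 3))  # 1.735
print(decide(2.4))            # block
print(decide(0.32))           # allow
```

Raising the cutoff from 0.5 to 0.85 moves the boundary from z = 0 to z ≈ 1.735: the same line, shifted parallel to itself.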