Math foundations
k-Nearest Neighbors: Classify by the Closest Labeled Examples
The idea
k-NN has no training phase in the usual sense. The model is your labeled history. For a new row, find the k closest points in feature space and let them vote. Fraud looks like past fraud; support tickets route like similar resolved tickets.
k-NN answers: What label do the nearest examples carry, and how many neighbors should vote?
Example: drag the query point and watch neighbors vote
No training phase: the model is the labeled history. Distance picks the k closest rows; majority label wins.
Drag the new order. k-NN classifies by the closest past cases in feature space.
- Legitimate
- Fraud
- Query (drag me)
Prediction
Legitimate
Votes: 3 Legitimate · 2 Fraud
k = 5 splits the vote (3 vs 2). Borderline cases need more features or a different k.
The math
k-NN is geometry plus a vote. You define how close is close (distance), how many neighbors get a say (k), and how ties break. Everything else is lookup on labeled history.
Euclidean distance (L2)
Default in tabular ML when features are on comparable scales. Sensitive to unit choice: dollars and session counts must be normalized first or distance is dominated by one column.
Manhattan distance (L1)
Sums absolute gaps. Less sensitive to one wild coordinate than L2. Useful when outliers in a single feature should not dominate the neighbor search.
Cosine similarity
Ignores vector length; measures direction. Common for text embeddings and catalog search when document size varies. k-NN on cosine often uses distance = 1 − cosine.
Classification vote
Each of the k closest training rows casts one ballot. Majority label wins. Ties go to the nearer neighbor or to a business default (for example route to human review).
Regression average
For numeric targets (forecast amount, handle time), k-NN averages neighbor values instead of voting. Same distance logic, different aggregation.
Curse of dimensionality
In high dimensions, most points are far from each other. Neighbors stop being local and votes become noisy. Feature selection, PCA, or a parametric model often follow k-NN prototypes.
Where teams get stuck
Unscaled features. Transaction amount in dollars outweighs every other column. Neighbors match on money, not behavior.
k too small or too large. k = 1 chases noise. k = 50 washes out local structure. Plot accuracy or precision on a holdout grid of k before you ship.
Latency at scoring time. Every prediction scans the training set. Fine for thousands of rows; expensive for millions without an index (ball tree, approximate nearest neighbor).
Where business uses it
Fraud similarity, catalog deduplication, content recommendations, and cold-start routing when you have labels but no time to train a parametric model. Weak when dimension is high, data is huge, or you need a compact score at millisecond latency.
The habit: scale features, try a small grid of k on holdout data, and state which distance metric ranked neighbors. Pair with the norms and distance post for L1 vs L2, and classifier metrics for eval once labels exist.