Support Vector Machines

When two classes of points can be separated by a straight line, there are usually infinitely many lines that do the job. A decision boundary squeezed right up against one class is a nervous one — a single new point could fall on the wrong side. A support vector machine (SVM) picks the safest boundary: the one that sits as far as possible from both classes, with the widest possible margin of empty space around it.

Step through the idea — the same points, then the maximum-margin boundary, then what actually determines it:

Margins and support vectors

The margin is the strip of empty space on either side of the boundary, reaching out to the nearest points. An SVM chooses the boundary that makes this strip as wide as possible — a "maximum-margin classifier." A wide margin means the boundary is committed, robust, and generalises better to unseen data.

Here's the striking part: only the handful of points sitting right on the edge of the margin matter. Those are the support vectors — they alone "hold up" the boundary. Every other point could be moved (or deleted) without changing the answer at all. The whole model is pinned by just a few critical examples.

Real data isn't always perfectly separable, so a soft-margin SVM allows a few points to sit inside the margin or on the wrong side, trading a little error for a wider, more robust boundary. A parameter C sets how harshly those violations are penalised.

The kernel trick

What if no straight line can separate the classes — say one class forms a ring around the other? The SVM's secret weapon is the kernel trick: map the points into a higher-dimensional space where they do become linearly separable, find the flat boundary there, and it curves back into a nonlinear boundary in the original space.

The magic is that you never actually compute those high-dimensional coordinates. An SVM only ever needs dot products between points, and a kernel function K(\mathbf{x}, \mathbf{z}) computes the dot product in the high-dimensional space directly and cheaply. Popular kernels (polynomial, and the "RBF" Gaussian kernel) turn a linear SVM into a powerful nonlinear classifier for almost no extra cost.

Chooses the maximum-margin boundary — farthest from both classes.
Only the support vectors (points on the margin edge) determine it.
The kernel trick gives nonlinear boundaries by an implicit map to a higher-dimensional space.

Picture red points in a circle surrounded by blue points — no line separates them in the plane. Now add a third coordinate z = x^2 + y^2: the inner red points, being close to the origin, get a small z, and the outer blue points a large one. In this lifted 3-D space a flat plane slices cleanly between them — and viewed back in 2-D, that plane is a circle. That is the kernel trick in one picture: borrow a dimension, separate with a plane, project back to a curve.

SVMs are sensitive to feature scaling — a feature measured in thousands will dominate the margin. Standardise your features first.
The boundary depends only on the support vectors. Adding masses of easy, far-away points changes nothing — which is also what makes SVMs memory-efficient.