Differentiability as a Linear Map

Zoom in on a smooth curve — really zoom, as if pressing your eye to the graph — and the wiggles flatten out. A little arc of a parabola, magnified enough, is indistinguishable from a straight line. This is not a quirk of parabolas; it is the whole content of the word differentiable. A differentiable function is one that, viewed closely enough at a point, looks linear.

That single sentence is worth more than the formula for a derivative. In school the derivative f'(a) is "the slope of the tangent," a number. But a number is a poor thing to carry into higher dimensions — what is the "slope" of a map from \mathbb{R}^3 to \mathbb{R}^2? The idea that survives the jump to many variables is not the number but the linear map it stands for: the derivative at a is the linear transformation that best approximates f near a.

Engineers live on this. Linearising a nonlinear system about an operating point — a pendulum near the bottom of its swing, a circuit near its bias, a spacecraft near its trajectory — is replacing f by its derivative map and studying the linear model instead. Get the definition right and multivariable calculus, the inverse function theorem, and numerical optimisation all open up.

Rewriting the one-variable derivative to expose the map

Start where you are comfortable. The usual definition,

f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h},

hides the geometry. Multiply through by h and collect whatever is left over into a remainder term. Differentiability at a is equivalent to the existence of a number A = f'(a) for which

f(a + h) = f(a) + A\,h + r(h), \qquad \text{where } \frac{r(h)}{h} \to 0 \text{ as } h \to 0.

Read this as an accounting statement. To predict f(a + h) you take the value f(a), add the linear correction A\,h, and are left with a remainder r(h) that vanishes faster than h itself. That last clause is the whole game: for any continuous f the error tends to 0 whatever A you pick, but only a differentiable f admits an A whose error is little-o of h, small even compared to the already-small h.

The map h \mapsto A\,h is a linear function of the displacement h. In one dimension it is just "multiply by f'(a)." We give it a name: the differential Df(a). The tangent line you already know, L(x) = f(a) + f'(a)(x - a), is nothing but f(a) plus this linear map applied to the displacement x - a.
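As a quick check of this accounting, take the parabola f(x) = x^2 (chosen here only for illustration). Expanding the square gives

f(a + h) = (a + h)^2 = a^2 + 2a\,h + h^2,

so A = 2a and r(h) = h^2, and indeed \frac{r(h)}{h} = h \to 0 as h \to 0. By contrast, for g(x) = \lvert x \rvert at a = 0 no number A works: the ratio \frac{r(h)}{h} = \frac{\lvert h \rvert}{h} - A equals 1 - A for h > 0 and -1 - A for h < 0, and these cannot both tend to 0.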

The definition that scales to many variables

Now every symbol survives promotion. Let f : \mathbb{R}^n \to \mathbb{R}^m and let a \in \mathbb{R}^n. We say f is differentiable at a if there is a linear map Df(a) : \mathbb{R}^n \to \mathbb{R}^m such that

\lim_{\mathbf{h} \to \mathbf{0}} \frac{\big\lVert f(a + \mathbf{h}) - f(a) - Df(a)\,\mathbf{h} \big\rVert}{\lVert \mathbf{h} \rVert} = 0.

The number h became a vector \mathbf{h}; division by h (illegal for vectors) became division by the norm \lVert \mathbf{h} \rVert; and the slope f'(a) became a linear map Df(a). Everything else is word-for-word the one-variable story. The map Df(a) is the best linear approximation, and the requirement is again that the error be o(\lVert \mathbf{h} \rVert).

So, in standard coordinates, the derivative of a map \mathbb{R}^n \to \mathbb{R}^m is represented by an m \times n matrix, the Jacobian matrix. For a scalar field f : \mathbb{R}^n \to \mathbb{R} it is a 1 \times n row: the gradient laid sideways. For an ordinary curve \mathbb{R} \to \mathbb{R}^m it is an m \times 1 column: the velocity vector. The humble 1 \times 1 case is your old friend f'(a).

Building a Jacobian, one partial at a time

The Jacobian is assembled mechanically: entry (i, j) is the partial derivative of the i-th output with respect to the j-th input. Take the map f : \mathbb{R}^2 \to \mathbb{R}^2,

f(x, y) = \big(\, x^2 y,\ \ x + \sin y \,\big).

There are two outputs and two inputs, so Df is 2 \times 2. Differentiate row by row:

Df(x, y) = \begin{pmatrix} \dfrac{\partial (x^2 y)}{\partial x} & \dfrac{\partial (x^2 y)}{\partial y} \\[2ex] \dfrac{\partial (x + \sin y)}{\partial x} & \dfrac{\partial (x + \sin y)}{\partial y} \end{pmatrix} = \begin{pmatrix} 2xy & x^2 \\ 1 & \cos y \end{pmatrix}.
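The matrix just derived can be sanity-checked numerically. The sketch below, in plain Python, compares it with a central-difference estimate at an arbitrary point. The helper names jacobian_analytic and jacobian_fd, the step eps, and the test point (1.5, 0.7) are illustrative assumptions, not part of the derivation.

```python
import math

def f(x, y):
    # The example map R^2 -> R^2 from the text.
    return (x * x * y, x + math.sin(y))

def jacobian_analytic(x, y):
    # Entry (i, j) is the partial of output i with respect to input j.
    return [[2 * x * y, x * x],
            [1.0, math.cos(y)]]

def jacobian_fd(func, point, eps=1e-6):
    """Central-difference estimate of the Jacobian of func at point."""
    n = len(point)
    m = len(func(*point))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(point)
        minus = list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

point = (1.5, 0.7)  # arbitrary test point, chosen only for illustration
J_exact = jacobian_analytic(*point)
J_approx = jacobian_fd(f, point)
max_error = max(abs(J_exact[i][j] - J_approx[i][j])
                for i in range(2) for j in range(2))
print("analytic:", J_exact)
print("finite difference:", J_approx)
print("largest entrywise gap:", max_error)
```

Agreement to many digits is evidence, not proof, that the partials were computed correctly; a mistake in any single entry would show up as a large gap in that position.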

Evaluate at, say, (2, 0): Df(2, 0) = \begin{pmatrix} 0 & 4 \\ 1 & 1 \end{pmatrix}. This is the linear map that best matches f near (2, 0): a small input nudge \mathbf{h} = (h_1, h_2) produces an output nudge Df(2,0)\,\mathbf{h} = (4 h_2,\ h_1 + h_2), up to an error that is negligible beside \lVert \mathbf{h} \rVert.
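The claim that the error is negligible beside \lVert \mathbf{h} \rVert can be watched directly. The following minimal sketch shrinks \mathbf{h} along one fixed direction and prints the ratio of the error norm to \lVert \mathbf{h} \rVert; the direction (0.6, 0.8) and the list of scales are arbitrary choices made for illustration.

```python
import math

def f(x, y):
    # The example map R^2 -> R^2 from the text.
    return (x * x * y, x + math.sin(y))

def linear_part(h1, h2):
    # Df(2, 0) applied to h = (h1, h2), as computed in the text.
    return (4 * h2, h1 + h2)

a = (2.0, 0.0)
direction = (0.6, 0.8)  # unit vector, arbitrary choice of approach direction

ratios = []
for k in range(1, 6):
    t = 10.0 ** (-k)
    h1, h2 = t * direction[0], t * direction[1]
    fa = f(*a)
    fah = f(a[0] + h1, a[1] + h2)
    lin = linear_part(h1, h2)
    # Remainder vector: f(a + h) - f(a) - Df(a) h
    r = (fah[0] - fa[0] - lin[0], fah[1] - fa[1] - lin[1])
    ratio = math.hypot(*r) / math.hypot(h1, h2)
    ratios.append(ratio)
    print(f"|h| = {t:.0e}   |remainder| / |h| = {ratio:.3e}")
```

Each tenfold reduction in \lVert \mathbf{h} \rVert should cut the ratio by roughly a factor of ten, which is what an error of order \lVert \mathbf{h} \rVert^2 looks like.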

Its determinant, \det Df(2, 0) = 0\cdot 1 - 4 \cdot 1 = -4, measures local area scaling: small regions near (2, 0) have their area multiplied by about \lvert -4 \rvert = 4, and the negative sign flags an orientation flip. Because this single number is nonzero, the inverse function theorem will guarantee that f can be locally undone.
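The area-scaling reading of the determinant can be illustrated with a small experiment: push two short edges of a tiny square at (2, 0) through f and compare the signed area they span with the area of the square. This is only a sketch; the edge length eps and the helper signed_area are assumptions introduced here.

```python
import math

def f(x, y):
    # The example map R^2 -> R^2 from the text.
    return (x * x * y, x + math.sin(y))

def signed_area(p0, p1, p2):
    """Signed area of the parallelogram spanned by p1 - p0 and p2 - p0."""
    ux, uy = p1[0] - p0[0], p1[1] - p0[1]
    vx, vy = p2[0] - p0[0], p2[1] - p0[1]
    return ux * vy - uy * vx

eps = 1e-4          # side of the tiny input square (illustrative choice)
a = (2.0, 0.0)

corner = f(*a)
along_x = f(a[0] + eps, a[1])   # image of the corner one step along x
along_y = f(a[0], a[1] + eps)   # image of the corner one step along y

input_area = eps * eps
output_area = signed_area(corner, along_x, along_y)
ratio = output_area / input_area
print("signed area ratio:", ratio)
```

The ratio should come out close to the determinant computed by hand, negative sign included.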

See the map do its job

Consider f(x) = \sin x and its best linear approximation L(x) = f(a) + f'(a)(x - a) at a base point a. As a slides along the axis, the line pivots to stay tangent. The gap between curve and line at a displacement h is the remainder r(h), and the point of differentiability is that this gap shrinks faster than h as you approach a.

In one dimension the "linear map" is just this pivoting line, and its slope is the lone Jacobian entry. In two dimensions the same picture becomes a tangent plane resting on a surface; in n dimensions, a tangent hyperplane. The dimension changes, the idea does not: differentiable means locally flat.
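For readers without a plot to hand, the shrinking gap can be tabulated instead. The sketch below measures the remainder for f = \sin against its tangent line; the base point a = 1.0 and the helper name remainder are illustrative assumptions.

```python
import math

def remainder(func, slope, a, h):
    """r(h) = func(a + h) - func(a) - slope * h: the gap between curve and tangent line."""
    return func(a + h) - func(a) - slope * h

a = 1.0                 # base point, arbitrary choice for illustration
slope = math.cos(a)     # f'(a) for f = sin

ratios = []
for k in range(1, 6):
    h = 10.0 ** (-k)
    r = remainder(math.sin, slope, a, h)
    ratios.append(abs(r) / h)
    print(f"h = {h:.0e}   |r(h)| = {abs(r):.3e}   |r(h)| / h = {abs(r) / h:.3e}")
```

The last column is the quantity the definition cares about: it is not enough for r(h) to shrink, the ratio r(h)/h must shrink as well.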

The Jacobian is built from partial derivatives, so it is tempting to declare a function differentiable the moment all its partials exist. They do not suffice. Partials only probe f along the coordinate axes; differentiability demands a single linear map that works for approach from every direction at once, error and all.

The classic counterexample is

f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2}, & (x, y) \ne (0, 0), \\ 0, & (x, y) = (0, 0). \end{cases}

At the origin both partials exist and equal 0 (along either axis f is identically 0). So the only candidate linear map is the zero map. Yet along the diagonal y = x the function holds steady at f(t, t) = \tfrac{t^2}{2t^2} = \tfrac12 for every t \ne 0. It does not even tend to f(0, 0) = 0, so it is not continuous, let alone differentiable. The rescue is the continuously-differentiable (C^1) test: if all partials exist and are continuous near a, then f genuinely is differentiable there. That is the box worth ticking in practice.
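The counterexample is easy to probe numerically. In this minimal sketch the difference quotients along the axes vanish, while values along the diagonal refuse to approach f(0, 0); the step h and the sample points are arbitrary choices.

```python
def f(x, y):
    # The classic counterexample, with the value at the origin set to 0.
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x * x + y * y)

h = 1e-6
# Difference quotients along the coordinate axes at the origin.
partial_x = (f(h, 0.0) - f(0.0, 0.0)) / h
partial_y = (f(0.0, h) - f(0.0, 0.0)) / h

# Values along the diagonal y = x, approaching the origin.
diagonal_values = [f(t, t) for t in (1e-1, 1e-3, 1e-6, 1e-9)]

print("partial in x at origin:", partial_x)
print("partial in y at origin:", partial_y)
print("values on the diagonal:", diagonal_values)
```

Both quotients are zero for every h, since f vanishes on the axes, yet the diagonal values stay at one half however close the sample point gets to the origin.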

Worked consequences

1. Differentiable ⟹ continuous. If f(a + \mathbf{h}) = f(a) + Df(a)\mathbf{h} + o(\lVert\mathbf{h}\rVert), then as \mathbf{h} \to \mathbf{0} both the linear term and the remainder vanish, so f(a + \mathbf{h}) \to f(a). The best-linear-approximation picture makes a theorem obvious that is fiddly to prove from difference quotients.

2. The derivative of a linear map is itself. If T(\mathbf{x}) = A\mathbf{x} is already linear, then T(a + \mathbf{h}) = A a + A\mathbf{h} exactly — no remainder at all. So DT(a) = A at every point. A linear map is its own best linear approximation, which is exactly what you would hope a sane definition would say.

3. The chain rule becomes matrix multiplication. For f : \mathbb{R}^n \to \mathbb{R}^m differentiable at a and g : \mathbb{R}^m \to \mathbb{R}^p differentiable at f(a), the composite g \circ f is differentiable at a and its derivative is the composite of the derivatives:

D(g \circ f)(a) = Dg\big(f(a)\big) \cdot Df(a),

a product of a p \times m matrix with an m \times n matrix. Best-linear-approximations compose by composing their linear maps — the single cleanest reason to think of the derivative as a map and not a number.
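The chain rule can be checked on a concrete pair of maps. The sketch below reuses f(x, y) = (x^2 y, x + \sin y) and pairs it with a hypothetical g(u, v) = (uv, u + v^2), introduced here purely for illustration, then compares the matrix product with a finite-difference Jacobian of the composite. All helper names are assumptions.

```python
import math

def f(x, y):
    # The example map from the Jacobian section: R^2 -> R^2.
    return (x * x * y, x + math.sin(y))

def g(u, v):
    # Hypothetical second map, chosen only for illustration: R^2 -> R^2.
    return (u * v, u + v * v)

def Df(x, y):
    return [[2 * x * y, x * x],
            [1.0, math.cos(y)]]

def Dg(u, v):
    return [[v, u],
            [1.0, 2 * v]]

def matmul(A, B):
    """Plain nested-list matrix product of A and B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def jacobian_fd(func, point, eps=1e-6):
    """Central-difference estimate of the Jacobian of func at point."""
    n = len(point)
    m = len(func(*point))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(point)
        minus = list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def composite(x, y):
    return g(*f(x, y))

a = (2.0, 0.0)
chain_rule = matmul(Dg(*f(*a)), Df(*a))   # Dg(f(a)) times Df(a)
numeric = jacobian_fd(composite, a)       # direct estimate of D(g o f)(a)
max_error = max(abs(chain_rule[i][j] - numeric[i][j])
                for i in range(2) for j in range(2))
print("Dg(f(a)) Df(a):", chain_rule)
print("finite difference:", numeric)
print("largest entrywise gap:", max_error)
```

Note the order of the factors: Dg is evaluated at f(a), not at a, and it multiplies Df(a) from the left.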