Differentiability as a Linear Map
Zoom in on a smooth curve — really zoom, as if pressing your eye to the graph — and the wiggles
flatten out. A little arc of a parabola, magnified enough, is indistinguishable from a straight
line. This is not a quirk of parabolas; it is the whole content of the word differentiable.
A differentiable function is one that, viewed closely enough at a point, looks linear.
That single sentence is worth more than the formula for a derivative. In school the derivative
f'(a) is "the slope of the tangent," a number. But a number is a poor
thing to carry into higher dimensions — what is the "slope" of a map from
\mathbb{R}^3 to \mathbb{R}^2? The idea that
survives the jump to many variables is not the number but the linear map it stands
for: the derivative at a is the
linear transformation
that best approximates f near a.
Engineers live on this. Linearising a nonlinear system about an operating point — a pendulum near
the bottom of its swing, a circuit near its bias, a spacecraft near its trajectory — is
replacing f by its derivative map and studying the linear model instead.
Get the definition right and multivariable calculus, the
inverse function theorem, and
numerical optimisation all open up.
Rewriting the one-variable derivative to expose the map
Start where you are comfortable. The usual definition,
f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h},
hides the geometry. Multiply through by h and collect whatever is left over into a remainder. Differentiability at
a is equivalent to the existence of a number A = f'(a)
for which
f(a + h) = f(a) + A\,h + r(h), \qquad \text{where } \frac{r(h)}{h} \to 0 \text{ as } h \to 0.
Read this as an accounting statement. To predict f(a + h) you take the
value f(a), add the linear correction
A\,h, and are left with a remainder r(h) that
vanishes faster than h itself. That last clause is the whole
game: for any function continuous at a the error tends to 0 whatever slope A you pick,
but only a differentiable one has an error that is little-o of h, small even compared to the
already-small h.
The map h \mapsto A\,h is a linear function of the displacement
h. In one dimension it is just "multiply by f'(a)."
We give it a name: the differential Df(a). The tangent
line you already know, L(x) = f(a) + f'(a)(x - a), is nothing but
f(a) plus this linear map applied to the displacement
x - a.
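To make the little-o condition concrete, here is a minimal numerical sketch. The helper name `remainder_ratio` and the two test functions are illustrative choices, not part of the text above. It computes r(h)/h for f(x) = x^2 at a = 1 (with A = 2) and for f(x) = |x| at a = 0 (with the trial slope A = 0).

```python
def remainder_ratio(f, a, A, h):
    """Return r(h)/h, where r(h) = f(a + h) - f(a) - A*h."""
    return (f(a + h) - f(a) - A * h) / h


def square(x):
    return x * x


# Shrinking positive displacements h = 1e-1, ..., 1e-5.
steps = [10.0 ** (-k) for k in range(1, 6)]

# f(x) = x^2 at a = 1 with A = f'(1) = 2: here r(h) = h^2, so r(h)/h = h.
square_ratios = [remainder_ratio(square, 1.0, 2.0, h) for h in steps]

# f(x) = |x| at a = 0 with trial slope A = 0: r(h)/h = |h|/h = 1 for h > 0.
abs_ratios = [remainder_ratio(abs, 0.0, 0.0, h) for h in steps]

for h, rs, ra in zip(steps, square_ratios, abs_ratios):
    print(f"h={h:.0e}   x^2: {rs:.2e}   |x|: {ra:.2e}")
```

The first column of ratios shrinks with h; the second stays put. Changing the trial slope A does not rescue |x|: for h > 0 the ratio is 1 - A and for h < 0 it is -1 - A, and no single A makes both zero.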
The definition that scales to many variables
Now every symbol survives promotion. Let
f : \mathbb{R}^n \to \mathbb{R}^m and let
a \in \mathbb{R}^n. We say f is
differentiable at a if there is a linear map
Df(a) : \mathbb{R}^n \to \mathbb{R}^m such that
\lim_{\mathbf{h} \to \mathbf{0}} \frac{\big\lVert f(a + \mathbf{h}) - f(a) - Df(a)\,\mathbf{h} \big\rVert}{\lVert \mathbf{h} \rVert} = 0.
The number h became a vector \mathbf{h};
division by h (illegal for vectors) became division by the
norm \lVert \mathbf{h} \rVert; and the slope
f'(a) became a linear map Df(a). Everything
else is word-for-word the one-variable story. The map Df(a) is the best
linear approximation, and the requirement is again that the error be
o(\lVert \mathbf{h} \rVert).
- f is differentiable at a iff a linear map Df(a) exists with f(a + \mathbf{h}) = f(a) + Df(a)\,\mathbf{h} + o(\lVert \mathbf{h}\rVert).
- When it exists, Df(a) is unique: there is only one best linear approximation.
- Its matrix in the standard basis is the Jacobian J = \big[\,\partial f_i / \partial x_j\,\big], an m \times n array of partial derivatives.
So the derivative of a map \mathbb{R}^n \to \mathbb{R}^m is an
m \times n matrix. For a scalar field
f : \mathbb{R}^n \to \mathbb{R} it is a 1 \times n
row — the gradient laid sideways. For an ordinary curve
\mathbb{R} \to \mathbb{R}^m it is an m \times 1
column — the velocity vector. The humble 1 \times 1 case is your old
friend f'(a).
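The shape bookkeeping can be checked numerically. The sketch below approximates a Jacobian by central differences; `jacobian_fd`, `field`, and `curve` are hypothetical names and examples chosen for illustration, and finite differences are a stand-in for the exact partials, not a definition.

```python
import math


def jacobian_fd(f, a, eps=1e-6):
    """Approximate the m x n Jacobian of f at a by central differences.

    f takes a list of n floats and returns a list of m floats.
    """
    n = len(a)
    m = len(f(a))
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(a)
        plus[j] += eps
        minus = list(a)
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(m):
            # Entry (i, j): i-th output differentiated in the j-th input.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def shape(J):
    return (len(J), len(J[0]))


def field(p):
    """Scalar field R^3 -> R (illustrative): x*y + z^2."""
    x, y, z = p
    return [x * y + z * z]


def curve(p):
    """Curve R -> R^2 (illustrative): the unit circle."""
    (t,) = p
    return [math.cos(t), math.sin(t)]


J_field = jacobian_fd(field, [1.0, 2.0, 3.0])  # 1 x 3 row, close to (2, 1, 6)
J_curve = jacobian_fd(curve, [0.0])            # 2 x 1 column, close to (0, 1)
print(shape(J_field), J_field)
print(shape(J_curve), J_curve)
```

The scalar field yields a single row (the gradient laid sideways) and the curve yields a single column (the velocity), matching the m \times n rule.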
Building a Jacobian, one partial at a time
The Jacobian is assembled mechanically: entry (i, j) is the partial
derivative of the i-th output with respect to the
j-th input. Take the map
f : \mathbb{R}^2 \to \mathbb{R}^2,
f(x, y) = \big(\, x^2 y,\ \ x + \sin y \,\big).
There are two outputs and two inputs, so Df is
2 \times 2. Differentiate row by row:
Df(x, y) = \begin{pmatrix} \dfrac{\partial (x^2 y)}{\partial x} & \dfrac{\partial (x^2 y)}{\partial y} \\[2ex] \dfrac{\partial (x + \sin y)}{\partial x} & \dfrac{\partial (x + \sin y)}{\partial y} \end{pmatrix} = \begin{pmatrix} 2xy & x^2 \\ 1 & \cos y \end{pmatrix}.
Evaluate at, say, (2, 0):
Df(2, 0) = \begin{pmatrix} 0 & 4 \\ 1 & 1 \end{pmatrix}. This is the linear
map that best matches f near (2, 0): a small
input nudge \mathbf{h} = (h_1, h_2) produces an output nudge
Df(2,0)\,\mathbf{h} = (4 h_2,\ h_1 + h_2), up to an error that is
negligible beside \lVert \mathbf{h} \rVert.
Its determinant, \det Df(2, 0) = 0\cdot 1 - 4 \cdot 1 = -4, is the local
area-scaling factor (and its sign flags an orientation flip). That single number feeds the
inverse function theorem: because it is nonzero and f is continuously differentiable,
f can be locally undone near (2, 0).
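The worked example can be tested against the definition directly. The sketch below uses the analytic Jacobian derived above and measures the ratio of error norm to displacement norm along one direction; the helper `error_ratio` and the chosen direction (0.6, -0.8) are assumptions made for illustration.

```python
import math


def f(x, y):
    return (x * x * y, x + math.sin(y))


def Df(x, y):
    """Analytic Jacobian from the text: rows are outputs, columns are inputs."""
    return ((2 * x * y, x * x),
            (1.0, math.cos(y)))


def error_ratio(a, h):
    """Return ||f(a + h) - f(a) - Df(a) h|| / ||h|| for a 2-vector h."""
    J = Df(*a)
    fa = f(*a)
    fah = f(a[0] + h[0], a[1] + h[1])
    lin = (J[0][0] * h[0] + J[0][1] * h[1],
           J[1][0] * h[0] + J[1][1] * h[1])
    err = math.hypot(fah[0] - fa[0] - lin[0], fah[1] - fa[1] - lin[1])
    return err / math.hypot(h[0], h[1])


a = (2.0, 0.0)
J = Df(*a)
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
print("Df(2, 0) =", J, " det =", det)

direction = (0.6, -0.8)  # a unit vector, chosen arbitrarily
ratios = [error_ratio(a, (t * direction[0], t * direction[1]))
          for t in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

Each tenfold shrink of the displacement shrinks the ratio by roughly a factor of ten, which is what an o(\lVert \mathbf{h} \rVert) error looks like in practice. One direction is only evidence, of course; the definition asks for the same behaviour along every approach.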
See the map do its job
Picture f(x) = \sin x together with its best linear approximation
L(x) = f(a) + f'(a)(x - a) at a base point a. As
a slides along the curve, the line pivots to stay tangent. The gap between them at a
displacement h is the remainder r(h), and the
point of differentiability is that this gap shrinks faster than h as you
approach a.
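The same picture can be reproduced in numbers. This is a small sketch with illustrative helper names (`tangent_line`, `gap`) that moves the base point a through a few values and prints the gap between \sin x and its tangent line at several displacements.

```python
import math


def tangent_line(a):
    """Return L(x) = sin(a) + cos(a) * (x - a), the tangent line to sin at a."""
    def L(x):
        return math.sin(a) + math.cos(a) * (x - a)
    return L


def gap(a, h):
    """Vertical distance between sin and its tangent line at displacement h."""
    return abs(math.sin(a + h) - tangent_line(a)(a + h))


for a in (0.5, 1.0, 2.0):  # move the base point along the curve
    for h in (0.1, 0.01, 0.001):
        g = gap(a, h)
        print(f"a={a}  slope={math.cos(a):+.3f}  h={h}  gap={g:.2e}  gap/h={g / h:.2e}")
```

At every base point the column gap/h heads to zero as h shrinks, while the slope \cos a changes with a: the line pivots, the quality of the fit does not.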
In one dimension the "linear map" is just this pivoting line, and its slope is the lone Jacobian
entry. In two dimensions the same picture becomes a tangent plane resting on a surface; in
n dimensions, a tangent hyperplane. The dimension changes, the idea does
not: differentiable means locally flat.
The Jacobian is built from partial derivatives, so it is tempting to declare a function
differentiable the moment all its partials exist. They do not suffice. Partials
only probe f along the coordinate axes; differentiability demands a single
linear map that works for approach from every direction at once, error and all.
The classic counterexample is
f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2}, & (x, y) \ne (0, 0), \\ 0, & (x, y) = (0, 0). \end{cases}
At the origin both partials exist and equal 0 (along either axis
f is identically 0). So the only
candidate linear map is the zero map. Yet along the diagonal y = x the
function holds steady at \tfrac12 — it does not even tend to
0, so it is not
continuous, let
alone differentiable. The rescue is the continuously-differentiable
(C^1) test: if all partials exist and are continuous near
a, then f genuinely is differentiable there.
That is the box worth ticking in practice.
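The counterexample is easy to probe numerically. The sketch below evaluates the difference quotients for both partials at the origin and then samples the function along the diagonal y = x. The helper names are illustrative.

```python
def f(x, y):
    """The counterexample: xy / (x^2 + y^2) away from the origin, 0 at it."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x * x + y * y)


def partial_x_quotient(h):
    """Difference quotient for the x-partial at the origin."""
    return (f(h, 0.0) - f(0.0, 0.0)) / h


def partial_y_quotient(h):
    """Difference quotient for the y-partial at the origin."""
    return (f(0.0, h) - f(0.0, 0.0)) / h


steps = [10.0 ** (-k) for k in range(1, 8)]
px = [partial_x_quotient(h) for h in steps]
py = [partial_y_quotient(h) for h in steps]
diagonal = [f(t, t) for t in steps]

print("x-partial quotients:", px)
print("y-partial quotients:", py)
print("values along y = x: ", diagonal)
```

Both lists of quotients are identically zero, so both partials exist and vanish, yet the diagonal values sit at one half however close to the origin you sample.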
Worked consequences
1. Differentiable ⟹ continuous. If
f(a + \mathbf{h}) = f(a) + Df(a)\mathbf{h} + o(\lVert\mathbf{h}\rVert),
then as \mathbf{h} \to \mathbf{0} the linear term vanishes (a linear map on \mathbb{R}^n is continuous) and so does the remainder,
so f(a + \mathbf{h}) \to f(a). The best-linear-approximation
picture makes obvious a theorem that is fiddly to prove from difference quotients.
2. The derivative of a linear map is itself. If
T(\mathbf{x}) = A\mathbf{x} is already linear, then
T(a + \mathbf{h}) = A a + A\mathbf{h} exactly — no remainder at all. So
DT(a) = A at every point. A linear map is its own best linear
approximation, which is exactly what you would hope a sane definition would say.
3. The chain rule becomes matrix multiplication. For
f : \mathbb{R}^n \to \mathbb{R}^m and
g : \mathbb{R}^m \to \mathbb{R}^p, the derivative of the composite is the
composite of the derivatives:
D(g \circ f)(a) = Dg\big(f(a)\big) \cdot Df(a),
a product of a p \times m matrix with an
m \times n matrix. Best-linear-approximations compose by composing their
linear maps — the single cleanest reason to think of the derivative as a map and not a number.
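The chain rule can be checked on a small example. The sketch below reuses f(x, y) = (x^2 y, x + \sin y) from earlier and pairs it with a hypothetical outer map g(u, v) = uv, chosen only for illustration. It multiplies the two analytic Jacobians and compares the product with a central-difference derivative of the composite.

```python
import math


def f(x, y):
    return (x * x * y, x + math.sin(y))


def Df(x, y):
    """2 x 2 Jacobian of f."""
    return [[2 * x * y, x * x],
            [1.0, math.cos(y)]]


def g(u, v):
    """Hypothetical outer map R^2 -> R, returned as a 1-tuple."""
    return (u * v,)


def Dg(u, v):
    """1 x 2 Jacobian of g."""
    return [[v, u]]


def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def composite(x, y):
    return g(*f(x, y))[0]


a = (1.0, 0.5)
chain = matmul(Dg(*f(*a)), Df(*a))  # (1 x 2) times (2 x 2) gives 1 x 2

eps = 1e-6
numeric = [[
    (composite(a[0] + eps, a[1]) - composite(a[0] - eps, a[1])) / (2 * eps),
    (composite(a[0], a[1] + eps) - composite(a[0], a[1] - eps)) / (2 * eps),
]]
print("chain rule product:", chain)
print("finite differences:", numeric)
```

Note the order of the factors: Dg is evaluated at f(a), not at a, and it multiplies on the left, so the shapes (1 \times 2 and 2 \times 2 here) line up exactly as the p \times m and m \times n pattern predicts.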