The derivation, line by line
This is the centrepiece, so let us prove it with no steps skipped. Fix a partition
0 = t_0 < t_1 < \cdots < t_n = t of [0, t], and abbreviate the
i-th increment and its time-step by
\Delta W_i = W_{t_{i+1}} - W_{t_i}, \qquad \Delta t_i = t_{i+1} - t_i,
and write the random sum we are studying as
Q_n = \sum_{i=0}^{n-1} \big(\Delta W_i\big)^2.
We will show two things: its mean is exactly t for every partition, and its variance
goes to 0 as the mesh shrinks. The convergence can then be read off directly.
Step 1: the mean is t. Each increment runs over an
interval of length \Delta t_i, so by the Gaussian-increments property
\Delta W_i \sim N(0,\, \Delta t_i).
For a mean-zero variable the expected square is the variance, so
\mathbb{E}\big[(\Delta W_i)^2\big] = \operatorname{Var}(\Delta W_i) = \Delta t_i.
Take the expectation of Q_n term by term (expectation is linear) and
sum:
\mathbb{E}[Q_n] = \sum_{i=0}^{n-1} \mathbb{E}\big[(\Delta W_i)^2\big] = \sum_{i=0}^{n-1} \Delta t_i.
But the time-steps \Delta t_i = t_{i+1} - t_i telescope when
summed: consecutive endpoints cancel, leaving only the outermost,
\sum_{i=0}^{n-1} \Delta t_i = (t_n - t_0) = t - 0 = t.
So \mathbb{E}[Q_n] = t exactly, for every partition, however
coarse. The sum is already centred on the right answer; what remains is to show it stops
wobbling.
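The claim that the mean is exact even on a coarse partition can be checked numerically. The following is a minimal Python sketch rather than part of the proof: the partition `times`, the seed and the sample count are illustrative assumptions, and `quadratic_sum` is a hypothetical helper that draws each increment directly as N(0, \Delta t_i).

```python
import random

def quadratic_sum(times, rng):
    """One sample of Q_n = sum (Delta W_i)^2 over the partition `times`."""
    total = 0.0
    for left, right in zip(times, times[1:]):
        dt = right - left
        dW = rng.gauss(0.0, dt ** 0.5)   # Delta W_i ~ N(0, dt): std dev sqrt(dt)
        total += dW * dW
    return total

rng = random.Random(0)                   # fixed seed, chosen arbitrarily
times = [0.0, 0.1, 0.5, 0.6, 2.0]        # deliberately coarse and uneven, t = 2
samples = 100_000

mean_Q = sum(quadratic_sum(times, rng) for _ in range(samples)) / samples
print(mean_Q)                            # sample mean of Q_n, to compare with t = 2
```

With only four steps the individual samples of Q_n scatter widely, yet their average sits near t = 2, which is what Step 1 predicts.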
Step 2: the variance vanishes. The increments
\Delta W_i are taken over non-overlapping intervals, so by the
independent-increments property they are independent, and therefore so are their squares
(\Delta W_i)^2. The variance of a sum of
independent terms is the sum of their variances (no cross terms):
\operatorname{Var}(Q_n) = \sum_{i=0}^{n-1} \operatorname{Var}\big((\Delta W_i)^2\big).
Now we need the variance of a squared Gaussian. For
X \sim N(0, \sigma^2) one has
\operatorname{Var}(X^2) = 2\sigma^4 (the fourth-moment fact, derived
in the vignette below from \mathbb{E}[Z^4] = 3 for a standard normal
Z). Here \sigma^2 = \Delta t_i, so
\operatorname{Var}\big((\Delta W_i)^2\big) = 2\,(\Delta t_i)^2.
Substituting back,
\operatorname{Var}(Q_n) = \sum_{i=0}^{n-1} 2\,(\Delta t_i)^2 = 2 \sum_{i=0}^{n-1} (\Delta t_i)^2.
To bound this, pull one factor of \Delta t_i out of each square and
replace it by the largest step — the mesh
\|\Delta\| = \max_i \Delta t_i:
(\Delta t_i)^2 = \Delta t_i \cdot \Delta t_i \le \|\Delta\| \cdot \Delta t_i.
Summing the bound and using the telescoping identity
\sum_i \Delta t_i = t from Step 1,
\operatorname{Var}(Q_n) \le 2 \sum_{i=0}^{n-1} \|\Delta\|\, \Delta t_i = 2\,\|\Delta\| \sum_{i=0}^{n-1} \Delta t_i = 2\,\|\Delta\|\, t.
As the partition refines, the mesh shrinks, \|\Delta\| \to 0, and
therefore
\operatorname{Var}(Q_n) \le 2\,\|\Delta\|\, t \longrightarrow 0.
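Both sides of this bound are deterministic, so they can be tabulated without any simulation. The sketch below is illustrative only: `variance_and_bound` and `uneven_partition` are hypothetical helpers, and the family of partitions t(i/n)^2 is an assumption chosen so that the steps are visibly unequal.

```python
def variance_and_bound(times):
    """Return (Var(Q_n), 2 * mesh * t) for the partition `times`."""
    steps = [right - left for left, right in zip(times, times[1:])]
    variance = 2.0 * sum(dt * dt for dt in steps)   # 2 * sum (Delta t_i)^2
    mesh = max(steps)                               # largest step
    t = times[-1] - times[0]                        # telescoped total length
    return variance, 2.0 * mesh * t

def uneven_partition(n, t=1.0):
    """Points t * (i/n)^2: the steps widen towards the right end."""
    return [t * (i / n) ** 2 for i in range(n + 1)]

rows = []
for n in (4, 16, 64, 256):
    variance, bound = variance_and_bound(uneven_partition(n))
    rows.append((n, variance, bound))
    print(n, variance, bound)
```

Each row should show the exact variance sitting below 2\|\Delta\| t, with both columns shrinking as n grows.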
Step 3: put them together — convergence in L^2.
Mean-square (L^2) convergence to the constant
t means the expected squared distance
\mathbb{E}\big[(Q_n - t)^2\big] goes to zero. Because
\mathbb{E}[Q_n] = t exactly, that distance is precisely the
variance:
\mathbb{E}\big[(Q_n - t)^2\big] = \mathbb{E}\big[(Q_n - \mathbb{E}[Q_n])^2\big] = \operatorname{Var}(Q_n) \le 2\,\|\Delta\|\, t \to 0.
So Q_n \to t in L^2: the random sum
tightens onto the deterministic number t as the mesh shrinks. That
limit is the quadratic variation,
[W]_t = \lim_{\|\Delta\| \to 0} \sum_{i=0}^{n-1} (\Delta W_i)^2 = t.
Let (W_t) be a standard Brownian motion and partition
[0, t] by 0 = t_0 < \cdots < t_n = t. As
the mesh \max_i (t_{i+1} - t_i) \to 0,
\sum_{i=0}^{n-1} \big(W_{t_{i+1}} - W_{t_i}\big)^2 \;\xrightarrow{\;L^2\;}\; t,
so the quadratic variation is [W]_t = t — a deterministic number, not
random and not zero. In differential shorthand this is written
(dW)^2 = dt.
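The L^2 convergence can also be watched in a simulation. This is a rough Monte Carlo sketch under stated assumptions: uniform partitions, t = 1, an arbitrary seed, and a hypothetical helper `mean_square_error`. On a uniform partition the variance formula from Step 2 gives exactly 2t^2/n, which is printed alongside for comparison.

```python
import random

def mean_square_error(n, t, samples, rng):
    """Monte Carlo estimate of E[(Q_n - t)^2] on the uniform n-step partition."""
    sigma = (t / n) ** 0.5               # std dev of each increment
    total = 0.0
    for _ in range(samples):
        q = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
        total += (q - t) ** 2
    return total / samples

rng = random.Random(1)                   # fixed seed, chosen arbitrarily
t = 1.0
errors = {}
for n in (10, 40, 160):
    errors[n] = mean_square_error(n, t, samples=4000, rng=rng)
    # Uniform partition: Var(Q_n) = 2 * n * (t/n)^2 = 2 t^2 / n.
    print(n, errors[n], 2.0 * t * t / n)
```

The estimated mean-square distance should fall roughly fourfold each time n is quadrupled, tracking 2t^2/n.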
The contrast with smooth functions, and "(dW)² = dt"
Why is this so special? Take a smooth (differentiable) function
g. Over a sub-interval of length
\Delta t its increment equals
g'(\tau)\,\Delta t for some point \tau in that sub-interval (by the mean value theorem), so the squared increment is of order
(\Delta t)^2. Summing n \approx t/\Delta t
of them gives order t \cdot \Delta t \to 0: a smooth function has
quadratic variation zero. Brownian increments are larger — of order
\sqrt{\Delta t}, so their squares are of order
\Delta t and the sum survives.
The shorthand that records all of this is
(dW)^2 = dt.
A Brownian increment squared behaves like dt, not like the
negligible (dt)^2 of smooth calculus. This one extra term — kept
instead of discarded — is the seed of the correction in
Itô's lemma, the chain rule of stochastic
calculus.
The hand-wavy "order of magnitude" argument above can be made completely rigorous, and it is
worth seeing because it is the exact mirror image of the Brownian computation. Let
f be continuously differentiable
(C^1) on [0, t], and form the quadratic
sum over a partition,
Q_n^f = \sum_{i=0}^{n-1} \big(\Delta f_i\big)^2, \qquad \Delta f_i = f(t_{i+1}) - f(t_i).
Pull one factor of |\Delta f_i| out of each square and bound it by
the largest increment over the partition:
Q_n^f = \sum_{i} |\Delta f_i|\cdot |\Delta f_i| \;\le\; \Big(\max_i |\Delta f_i|\Big) \sum_{i} |\Delta f_i|.
Look at the two factors separately. The sum \sum_i |\Delta f_i| is
at most the total variation of f; for a C^1
function that is finite (it equals \int_0^t |f'| \le t\max|f'|),
so the sum stays bounded as we refine. Meanwhile f is uniformly
continuous on the closed interval, so the single largest increment
\max_i |\Delta f_i| \to 0 as the mesh shrinks. Therefore
Q_n^f \;\le\; \underbrace{\Big(\max_i |\Delta f_i|\Big)}_{\to\, 0} \cdot \underbrace{\sum_i |\Delta f_i|}_{\text{bounded (at most the total variation)}} \;\longrightarrow\; 0.
A smooth function has quadratic variation 0. The contrast is exact:
for the smooth path it was the first-power sum that stayed finite and dragged the
second-power sum down to zero; for the Brownian path the first-power sum is infinite, and the
second-power sum settles on t.
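The smooth case is fully deterministic, so the bound can be checked directly. In this sketch the choice f = sin on [0, 3] is an assumption made for illustration (so \max|f'| = 1 and t\max|f'| = 3), and `power_sums` is a hypothetical helper.

```python
import math

def power_sums(f, times):
    """Return (sum |df_i|, sum (df_i)^2, max |df_i|) over the partition."""
    increments = [f(right) - f(left) for left, right in zip(times, times[1:])]
    first = sum(abs(d) for d in increments)      # first-power sum
    second = sum(d * d for d in increments)      # quadratic sum Q_n^f
    largest = max(abs(d) for d in increments)    # largest single increment
    return first, second, largest

t = 3.0
results = []
for n in (10, 100, 1000, 10000):
    times = [t * i / n for i in range(n + 1)]    # uniform partition of [0, t]
    first, second, largest = power_sums(math.sin, times)
    results.append((n, first, second, largest))
    print(n, first, second, largest)
```

The first-power column should stay below 3 throughout, while the quadratic column is squeezed towards zero along with the largest increment.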
Step 2 of the derivation used the fact that a squared centred Gaussian
X \sim N(0, \sigma^2) has variance
2\sigma^4. Here it is, line by line, from the fourth moment of a
standard normal.
Write X = \sigma Z with Z \sim N(0, 1).
By definition,
\operatorname{Var}(X^2) = \mathbb{E}\big[(X^2)^2\big] - \big(\mathbb{E}[X^2]\big)^2 = \mathbb{E}[X^4] - \big(\mathbb{E}[X^2]\big)^2.
Pull out the powers of \sigma, since
X^k = \sigma^k Z^k:
\mathbb{E}[X^2] = \sigma^2\, \mathbb{E}[Z^2], \qquad \mathbb{E}[X^4] = \sigma^4\, \mathbb{E}[Z^4].
For a standard normal the second moment is \mathbb{E}[Z^2] = 1 (its
variance), and the fourth moment is
\mathbb{E}[Z^4] = 3.
(This is the standard Gaussian fourth moment — for example via the moment formula
\mathbb{E}[Z^{2k}] = (2k-1)!! = 1\cdot 3 \cdots (2k-1), which gives
3!! = 1 \cdot 3 = 3; it also drops out of one integration by parts,
\mathbb{E}[Z^4] = 3\,\mathbb{E}[Z^2] = 3.) Substituting both
moments,
\operatorname{Var}(X^2) = \sigma^4 \cdot 3 - \big(\sigma^2 \cdot 1\big)^2 = 3\sigma^4 - \sigma^4 = 2\sigma^4.
The leftover 3 - 1 = 2 is exactly the constant that appears as
\operatorname{Var}((\Delta W_i)^2) = 2(\Delta t_i)^2 in the
derivation above.
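The two moments used here can be confirmed numerically. The sketch below is an illustration under assumptions: `standard_normal_moment` is a hypothetical helper that applies a plain midpoint rule to the Gaussian density on [-12, 12], `double_factorial_odd` evaluates (2k-1)!!, and \sigma = 0.5 is an arbitrary test value.

```python
import math

def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1), the value of E[Z^(2k)]."""
    result = 1
    for odd in range(1, 2 * k, 2):
        result *= odd
    return result

def standard_normal_moment(power, half_width=12.0, cells=24_000):
    """Midpoint-rule estimate of E[Z^power] for Z ~ N(0, 1)."""
    h = 2.0 * half_width / cells
    total = 0.0
    for j in range(cells):
        z = -half_width + (j + 0.5) * h          # midpoint of cell j
        total += z ** power * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

m2 = standard_normal_moment(2)
m4 = standard_normal_moment(4)
sigma = 0.5                                      # arbitrary test value
var_x_squared = sigma ** 4 * m4 - (sigma ** 2 * m2) ** 2

print(m2, m4, double_factorial_odd(2))           # second and fourth moments, and 3!!
print(var_x_squared, 2.0 * sigma ** 4)           # the two should agree closely
```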
A Brownian path has infinite total variation (the sum of the absolute increments
\sum|\Delta W| blows up) yet finite quadratic variation
(\sum(\Delta W)^2 \to t). It is exactly this gap — too rough for
the first power, perfectly tame at the second — that ordinary calculus has no machinery for,
and that the Itô integral is built to handle.
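The gap between the two powers shows up on a single simulated path. This is a sketch under assumptions, not a proof: one path is sampled on a fine uniform grid (t = 1, 25,600 steps, arbitrary seed) and then viewed through coarser partitions by adding fine increments in blocks, so every row describes the same path.

```python
import random

rng = random.Random(2)                   # fixed seed, chosen arbitrarily
t = 1.0
n_fine = 25_600
# Fine-grid increments of one Brownian path, each N(0, t / n_fine).
fine = [rng.gauss(0.0, (t / n_fine) ** 0.5) for _ in range(n_fine)]

table = []
for block in (256, 16, 1):
    # Coarsen the same path: each coarse increment is a sum of `block` fine ones.
    increments = [sum(fine[i:i + block]) for i in range(0, n_fine, block)]
    first = sum(abs(d) for d in increments)      # first-power sum
    second = sum(d * d for d in increments)      # quadratic sum
    table.append((len(increments), first, second))
    print(len(increments), first, second)
```

As the partition refines from 100 to 25,600 steps, the first-power column keeps climbing (its expected size is \sqrt{2nt/\pi}), while the quadratic column hovers near t = 1.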