The derivation, line by line (simple H)
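We prove the Itô isometry for a simple adapted process
H_s = \sum_{i=0}^{n-1} H_{t_i}\mathbf{1}_{(t_i, t_{i+1}]}(s), where 0 = t_0 < t_1 < \dots < t_n = T and each coefficient H_{t_i} is \mathcal{F}_{t_i}-measurable and square-integrable; the
general case then follows by the very approximation the isometry makes possible. Write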
\Delta W_i = W_{t_{i+1}} - W_{t_i} and
\Delta t_i = t_{i+1} - t_i, so the integral is
\int_0^T H\, dW = \sum_i H_{t_i}\Delta W_i.
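To make the sum concrete, here is a minimal Python sketch that evaluates \sum_i H_{t_i}\Delta W_i along one simulated Brownian path. The function name `ito_sum` and the callback `coeff(t, w)` are hypothetical, introduced only for illustration; the callback sees only the time t_i and the value W_{t_i}, which is one simple way to keep the coefficient adapted.

```python
import math
import random

def ito_sum(coeff, times, rng):
    """Return sum_i H_{t_i} * (W_{t_{i+1}} - W_{t_i}) along one simulated path.

    coeff(t, w) is a hypothetical rule that sets H_{t_i} from the time t_i and
    the path value W_{t_i}, so H_{t_i} uses only information available at t_i.
    """
    w = 0.0       # W_0 = 0
    total = 0.0
    for t0, t1 in zip(times[:-1], times[1:]):
        h = coeff(t0, w)                          # coefficient fixed at the left endpoint
        dw = rng.gauss(0.0, math.sqrt(t1 - t0))   # Delta W_i ~ N(0, Delta t_i)
        total += h * dw
        w += dw
    return total

rng = random.Random(0)
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
print(ito_sum(lambda t, w: w, grid, rng))  # one sample of sum_i W_{t_i} * Delta W_i
```

Each call produces one random sample of the integral; the expectations in the steps below are averages over many such samples.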
Step 1 — expand the square into a double sum. A finite sum squared is the
double sum of all pairwise products:
\left(\sum_{i} H_{t_i}\Delta W_i\right)^{2} = \sum_{i}\sum_{j} H_{t_i} H_{t_j}\,\Delta W_i\,\Delta W_j.
Taking expectations and using linearity,
\mathbb{E}\!\left[\left(\int_0^T H\, dW\right)^{2}\right] = \sum_{i}\sum_{j} \mathbb{E}\big[\,H_{t_i} H_{t_j}\,\Delta W_i\,\Delta W_j\,\big].
Split the double sum into off-diagonal terms
(i \neq j) and diagonal terms
(i = j). We show every off-diagonal term is zero, then evaluate the
diagonal.
Step 2 — the off-diagonal terms vanish. Take i < j
(the case i > j is symmetric). Then the three factors
H_{t_i}, H_{t_j}, \Delta W_i are all
\mathcal{F}_{t_j}-measurable: H_{t_i} and
\Delta W_i are determined by time t_{i+1} \le t_j, and
H_{t_j} is \mathcal{F}_{t_j}-measurable because H is adapted.
Only the last increment \Delta W_j reaches into the future. Condition
on \mathcal{F}_{t_j} and pull out everything known:
\mathbb{E}\big[\,H_{t_i} H_{t_j}\,\Delta W_i\,\Delta W_j\,\big] = \mathbb{E}\Big[\,H_{t_i} H_{t_j}\,\Delta W_i\;\mathbb{E}\big[\Delta W_j \mid \mathcal{F}_{t_j}\big]\,\Big].
The future increment \Delta W_j is independent of
\mathcal{F}_{t_j} and mean-zero, so the inner conditional expectation
is 0, and the whole term collapses:
= \mathbb{E}\big[\,H_{t_i} H_{t_j}\,\Delta W_i \cdot 0\,\big] = 0.
Every cross term is killed by the same mechanism that made the integral mean-zero:
the future increment is independent of the past and has mean zero. Only the diagonal survives.
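The vanishing of a cross term can be checked numerically. The sketch below assumes the illustrative choice H_{t_i} = W_{t_i} on an equally spaced grid t_1 < t_2 < t_3 and estimates \mathbb{E}[H_{t_1} H_{t_2}\,\Delta W_1\,\Delta W_2] by Monte Carlo; the function name `cross_term_mean` and all parameter values are assumptions, not part of the derivation.

```python
import math
import random

def cross_term_mean(n_paths=100_000, dt=0.5, seed=0):
    """Monte Carlo estimate of E[H_{t_1} H_{t_2} dW_1 dW_2] with H_{t_i} = W_{t_i}.

    Grid (an illustrative choice): t_0 = 0, t_1 = dt, t_2 = 2*dt, t_3 = 3*dt.
    """
    rng = random.Random(seed)
    sd = math.sqrt(dt)
    acc = 0.0
    for _ in range(n_paths):
        w1 = rng.gauss(0.0, sd)    # W_{t_1}
        dw1 = rng.gauss(0.0, sd)   # Delta W_1 = W_{t_2} - W_{t_1}
        dw2 = rng.gauss(0.0, sd)   # Delta W_2 = W_{t_3} - W_{t_2}, the future increment
        w2 = w1 + dw1              # W_{t_2}
        acc += w1 * w2 * dw1 * dw2
    return acc / n_paths

print(cross_term_mean())  # should be close to 0, up to Monte Carlo noise
```

The estimate is small because \Delta W_2 is drawn independently of everything that multiplies it, which is the conditioning argument in sampled form.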
Step 3 — the diagonal terms. On the diagonal
i = j the term is
\mathbb{E}\big[H_{t_i}^2\,(\Delta W_i)^2\big]. Condition on
\mathcal{F}_{t_i} and pull out the known coefficient
H_{t_i}^2:
\mathbb{E}\big[\,H_{t_i}^2\,(\Delta W_i)^2\,\big] = \mathbb{E}\Big[\,H_{t_i}^2\;\mathbb{E}\big[(\Delta W_i)^2 \mid \mathcal{F}_{t_i}\big]\,\Big].
The increment \Delta W_i is independent of \mathcal{F}_{t_i}, so the
conditional mean of its square is the plain mean, which is the variance of an
N(0, \Delta t_i) random variable:
\mathbb{E}\big[(\Delta W_i)^2 \mid \mathcal{F}_{t_i}\big] = \mathbb{E}\big[(\Delta W_i)^2\big] = \operatorname{Var}(\Delta W_i) = \Delta t_i.
Therefore each diagonal term is
\mathbb{E}\big[\,H_{t_i}^2\,(\Delta W_i)^2\,\big] = \mathbb{E}\big[\,H_{t_i}^2\,\big]\,\Delta t_i.
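A quick numerical check of the factorisation, again under the illustrative assumption H_{t_i} = W_{t_i}: with t_i = 0.5 and \Delta t_i = 0.5, both sides should be near \mathbb{E}[W_{0.5}^2]\cdot 0.5 = 0.25. The function name `diagonal_term` and its defaults are hypothetical.

```python
import math
import random

def diagonal_term(n_paths=100_000, t1=0.5, dt=0.5, seed=0):
    """Estimate E[H^2 (dW)^2] and E[H^2] * dt for H = W_{t1}, dW = W_{t1+dt} - W_{t1}."""
    rng = random.Random(seed)
    lhs = 0.0     # running sum of H^2 * (dW)^2
    h_sq = 0.0    # running sum of H^2
    for _ in range(n_paths):
        h = rng.gauss(0.0, math.sqrt(t1))    # H = W_{t1} ~ N(0, t1)
        dw = rng.gauss(0.0, math.sqrt(dt))   # increment, independent of H
        lhs += h * h * dw * dw
        h_sq += h * h
    return lhs / n_paths, (h_sq / n_paths) * dt

lhs, rhs = diagonal_term()
print(lhs, rhs)  # the two estimates should agree up to Monte Carlo noise
```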
Step 4 — sum the diagonal and recognise the time-integral. Adding the surviving
terms,
\mathbb{E}\!\left[\left(\int_0^T H\, dW\right)^{2}\right] = \sum_{i=0}^{n-1} \mathbb{E}\big[H_{t_i}^2\big]\,\Delta t_i = \mathbb{E}\!\left[\sum_{i=0}^{n-1} H_{t_i}^2\,\Delta t_i\right].
But \sum_i H_{t_i}^2\,\Delta t_i is, path by path, the
Riemann sum of \int_0^T H_s^2\, ds for the step integrand
H^2, and since H is simple (constant on each interval (t_i, t_{i+1}]), the sum equals that
integral exactly. Hence
\mathbb{E}\!\left[\left(\int_0^T H\, dW\right)^{2}\right] = \mathbb{E}\!\left[\int_0^T H_s^2\, ds\right].
Two ingredients did everything: independent increments killed the off-diagonal,
and variance = elapsed time ((dW)^2 = dt again)
evaluated the diagonal.
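The whole identity can be sanity-checked by simulation. The sketch below assumes the simple process H_{t_i} = W_{t_i} on n equal steps of [0, T] (an illustrative choice) and estimates both sides of the isometry from the same paths. For this choice the right-hand side is \sum_i t_i\,\Delta t_i = T^2 (n-1)/(2n), which is 0.375 for n = 4 and T = 1. The function name `isometry_check` is hypothetical.

```python
import math
import random

def isometry_check(n_steps=4, T=1.0, n_paths=100_000, seed=0):
    """Estimate E[(sum_i H dW)^2] and E[sum_i H^2 dt] for H_{t_i} = W_{t_i}."""
    rng = random.Random(seed)
    dt = T / n_steps
    sd = math.sqrt(dt)
    lhs = 0.0   # accumulates (stochastic sum)^2 over paths
    rhs = 0.0   # accumulates sum_i H_{t_i}^2 * dt over paths
    for _ in range(n_paths):
        w = 0.0
        integral = 0.0
        time_sum = 0.0
        for _step in range(n_steps):
            dw = rng.gauss(0.0, sd)
            integral += w * dw       # H_{t_i} * Delta W_i with H_{t_i} = W_{t_i}
            time_sum += w * w * dt   # H_{t_i}^2 * Delta t_i
            w += dw
        lhs += integral ** 2
        rhs += time_sum
    return lhs / n_paths, rhs / n_paths

lhs, rhs = isometry_check()
print(lhs, rhs)  # both should be close to 0.375 for the default settings
```

Both estimates fluctuate with the seed and the number of paths; the agreement is statistical, not exact.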
Stated in full generality: let H be an adapted process with
\mathbb{E}\big[\int_0^T H_s^2\, ds\big] < \infty. Then the Itô integral
is an isometry from L^2(dt \times d\mathbb{P}) into
L^2(\Omega):
\mathbb{E}\!\left[\left(\int_0^T H_s\, dW_s\right)^{2}\right] = \mathbb{E}\!\left[\int_0^T H_s^{2}\, ds\right].
Equivalently, the L^2(\Omega) norm of the integral equals the
L^2(dt\times d\mathbb{P}) norm of the integrand,
\big\|\int_0^T H\, dW\big\|_{L^2(\Omega)} = \|H\|_{L^2(dt\times d\mathbb{P})}.
This identity is not a footnote — it is what lets Stage 2 of the construction even make sense.
Suppose H is a general adapted L^2 integrand
and H^{(m)} is a sequence of simple processes converging to it
in L^2(dt\times d\mathbb{P}). A convergent sequence is Cauchy, so
\|H^{(m)} - H^{(k)}\|_{L^2(dt\times d\mathbb{P})} \to 0 as m, k \to \infty.
Apply the isometry to the difference (using linearity of the integral on simple processes):
\mathbb{E}\!\left[\left(\int H^{(m)} dW - \int H^{(k)} dW\right)^{2}\right] = \mathbb{E}\!\left[\int_0^T \big(H^{(m)}_s - H^{(k)}_s\big)^2\, ds\right] \longrightarrow 0.
So a Cauchy sequence of integrands maps to a Cauchy sequence of integrals in
L^2(\Omega). Because L^2(\Omega) is
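complete, that sequence has a limit, and the isometry also makes the limit
independent of the approximating sequence: two sequences of simple processes converging to the same H
differ by something whose L^2(dt\times d\mathbb{P}) norm vanishes, so their integrals differ by something whose L^2(\Omega) norm vanishes and share one limit. That limit is the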
definition of \int_0^T H\, dW for general adapted
H. The isometry is the bridge that carries the easy step-function
definition across to every square-integrable integrand.
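The Cauchy property can be watched in a small experiment. As an illustrative assumption, take H_s = W_s and approximate it by left-endpoint simple processes on n and 2n equal steps of [0, T]. A direct calculation with the isometry gives \mathbb{E}[(I^{(2n)} - I^{(n)})^2] = T^2/(4n), so the gap between successive integrals should shrink as n grows. The function name `mean_sq_gap` and its defaults are hypothetical.

```python
import math
import random

def mean_sq_gap(n, n_paths=20_000, seed=0, T=1.0):
    """Estimate E[(I_fine - I_coarse)^2] for left-endpoint approximations of H_s = W_s.

    I_coarse uses n equal steps on [0, T]; I_fine uses 2n steps on the same path.
    """
    rng = random.Random(seed)
    sd = math.sqrt(T / (2 * n))   # std dev of one fine-grid increment
    acc = 0.0
    for _ in range(n_paths):
        # Sample W on the fine grid: w[k] is W at time k * T / (2n).
        w = [0.0]
        for _k in range(2 * n):
            w.append(w[-1] + rng.gauss(0.0, sd))
        fine = sum(w[k] * (w[k + 1] - w[k]) for k in range(2 * n))
        coarse = sum(w[2 * k] * (w[2 * k + 2] - w[2 * k]) for k in range(n))
        acc += (fine - coarse) ** 2
    return acc / n_paths

for n in (2, 4, 8):
    print(n, mean_sq_gap(n), 1.0 / (4 * n))  # estimate next to the predicted T^2 / (4n)
```

The estimates are noisy but should track the predicted values, which is the numerical face of "Cauchy integrands give Cauchy integrals".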
The analogy is to Parseval / Plancherel: the Fourier transform is an isometry
between two L^2 spaces, and that single fact lets it be extended from
nice test functions to all of L^2 by continuity. The Itô isometry plays
exactly that role for the stochastic integral — an isometry between
L^2(dt\times d\mathbb{P}) and L^2(\Omega),
and the extension is "by continuity" in precisely the same sense.