The Bellman Equation
The
principle
of optimality says every tail of an optimal trajectory is optimal, so the
cost-to-go depends only on the current state. In discrete time that observation becomes an
explicit, mechanical recursion — the Bellman equation — that fills in the
value function one stage at a time, working backward from the end.
The backward recursion
Take an N-stage problem. The state evolves by a known dynamics
x_{k+1} = f(x_k, u_k), each stage incurs a running cost
g(x_k, u_k), and the end state is charged a terminal cost
\varphi(x_N). Write V_k(x) for the
cost-to-go: the least total cost incurred from stage k onward,
starting in state x.
Step 1 — the terminal condition. At the last stage nothing remains to be
decided, so the cost-to-go is just the terminal charge:
V_N(x) = \varphi(x).
Step 2 — one stage of the recursion. Standing in state
x at stage k, picking control
u costs g(x, u) right now and moves you
to f(x, u), from where — by the principle of optimality — the best
you can do is V_{k+1}(f(x, u)). Choosing
u to minimise the sum gives Bellman's equation:
V_k(x) = \min_{u} \Big[\, \underbrace{g(x, u)}_{\text{cost now}} + \underbrace{V_{k+1}\big(f(x, u)\big)}_{\text{optimal cost-to-go next}} \,\Big].
Step 3 — record the minimiser. The control that attains the minimum is the
optimal action at (x, k):
\mu_k^\*(x) = \arg\min_{u} \Big[\, g(x, u) + V_{k+1}\big(f(x, u)\big) \,\Big].
Collecting \mu_k^\* over every state and stage gives the optimal
feedback policy — a lookup table of "what to do in each state" — which is far
more than the maximum principle's single open-loop trajectory. We sweep
k = N, N-1, \dots, 0, each stage built from the one already solved.
A fully worked example
Let the state be x \in \{0, 1, 2\}, with two controls
u \in \{0, 1\} and dynamics
x_{k+1} = \min(x + u,\, 2) (you may step up by one, capped at
2). The running cost is
g(x, u) = x + 2u — occupying a higher state costs more, and each
step up costs 2 — while the terminal cost
\varphi(x) = 4(2 - x) rewards finishing high. There are
N = 2 stages, and we start at x_0 = 0.
Stage 2 — terminal. V_2(x) = \varphi(x) = 4(2 - x):
V_2(0) = 8, \qquad V_2(1) = 4, \qquad V_2(2) = 0.
Stage 1. Apply V_1(x) = \min_u[\, g(x, u) + V_2(\min(x+u, 2)) \,]
at each state:
\begin{aligned}
V_1(0) &= \min\big(\, \underbrace{0 + V_2(0)}_{u=0:\ 8},\ \underbrace{2 + V_2(1)}_{u=1:\ 6} \,\big) = 6, & \mu_1^\*(0) &= 1, \\
V_1(1) &= \min\big(\, \underbrace{1 + V_2(1)}_{u=0:\ 5},\ \underbrace{3 + V_2(2)}_{u=1:\ 3} \,\big) = 3, & \mu_1^\*(1) &= 1, \\
V_1(2) &= \min\big(\, \underbrace{2 + V_2(2)}_{u=0:\ 2},\ \underbrace{4 + V_2(2)}_{u=1:\ 4} \,\big) = 2, & \mu_1^\*(2) &= 0.
\end{aligned}
Stage 0. Now apply V_0(x) = \min_u[\, g(x, u) + V_1(\min(x+u, 2)) \,],
reusing the stage-1 values:
\begin{aligned}
V_0(0) &= \min\big(\, \underbrace{0 + V_1(0)}_{u=0:\ 6},\ \underbrace{2 + V_1(1)}_{u=1:\ 5} \,\big) = 5, & \mu_0^\*(0) &= 1, \\
V_0(1) &= \min\big(\, \underbrace{1 + V_1(1)}_{u=0:\ 4},\ \underbrace{3 + V_1(2)}_{u=1:\ 5} \,\big) = 4, & \mu_0^\*(1) &= 0, \\
V_0(2) &= \min\big(\, \underbrace{2 + V_1(2)}_{u=0:\ 4},\ \underbrace{4 + V_1(2)}_{u=1:\ 6} \,\big) = 4, & \mu_0^\*(2) &= 0.
\end{aligned}
Read off the optimal run. From x_0 = 0 the optimal
cost is V_0(0) = 5. The policy says: at stage
0 play \mu_0^\*(0) = 1, moving to
x_1 = 1; at stage 1 play
\mu_1^\*(1) = 1, moving to x_2 = 2. The
trajectory 0 \to 1 \to 2 costs
g(0,1) + g(1,1) + \varphi(2) = 2 + 3 + 0 = 5, matching
V_0(0) exactly.
The table, filling backward
Here is that computation as a table — one column per stage, one row per state. Slide
backward stages computed: the terminal column
V_2 = \varphi appears first, then
V_1, then V_0, each built from the column
to its right. The small u^\* under each computed cell is the optimal
action there; tracing them from x_0 = 0 gives the optimal run.
The Bellman equation is the discrete skeleton of the whole subject. Shrink the stage to an
infinitesimal time step and the recursion becomes a partial differential equation for
V — the
Hamilton–Jacobi–Bellman
equation. Let the horizon recede to infinity and it becomes a fixed-point
equation solved by
value
and policy iteration. Every one of these is the line "cost now plus optimal
cost-to-go next" written in a different limit.