The Bellman Equation

The principle of optimality says every tail of an optimal trajectory is optimal, so the cost-to-go depends only on the current state. In discrete time that observation becomes an explicit, mechanical recursion — the Bellman equation — that fills in the value function one stage at a time, working backward from the end.

The backward recursion

Take an N-stage problem. The state evolves by a known dynamics x_{k+1} = f(x_k, u_k), each stage incurs a running cost g(x_k, u_k), and the end state is charged a terminal cost \varphi(x_N). Write V_k(x) for the cost-to-go: the least total cost incurred from stage k onward, starting in state x.

Step 1 — the terminal condition. At the last stage nothing remains to be decided, so the cost-to-go is just the terminal charge:

V_N(x) = \varphi(x).

Step 2 — one stage of the recursion. Standing in state x at stage k, picking control u costs g(x, u) right now and moves you to f(x, u), from where — by the principle of optimality — the best you can do is V_{k+1}(f(x, u)). Choosing u to minimise the sum gives Bellman's equation:

V_k(x) = \min_{u} \Big[\, \underbrace{g(x, u)}_{\text{cost now}} + \underbrace{V_{k+1}\big(f(x, u)\big)}_{\text{optimal cost-to-go next}} \,\Big].

Step 3 — record the minimiser. The control that attains the minimum is the optimal action at (x, k):

\mu_k^\*(x) = \arg\min_{u} \Big[\, g(x, u) + V_{k+1}\big(f(x, u)\big) \,\Big].

Collecting \mu_k^\* over every state and stage gives the optimal feedback policy — a lookup table of "what to do in each state" — which is far more than the maximum principle's single open-loop trajectory. We sweep k = N, N-1, \dots, 0, each stage built from the one already solved.

A fully worked example

Let the state be x \in \{0, 1, 2\}, with two controls u \in \{0, 1\} and dynamics x_{k+1} = \min(x + u,\, 2) (you may step up by one, capped at 2). The running cost is g(x, u) = x + 2u — occupying a higher state costs more, and each step up costs 2 — while the terminal cost \varphi(x) = 4(2 - x) rewards finishing high. There are N = 2 stages, and we start at x_0 = 0.

Stage 2 — terminal. V_2(x) = \varphi(x) = 4(2 - x):

V_2(0) = 8, \qquad V_2(1) = 4, \qquad V_2(2) = 0.

Stage 1. Apply V_1(x) = \min_u[\, g(x, u) + V_2(\min(x+u, 2)) \,] at each state:

\begin{aligned} V_1(0) &= \min\big(\, \underbrace{0 + V_2(0)}_{u=0:\ 8},\ \underbrace{2 + V_2(1)}_{u=1:\ 6} \,\big) = 6, & \mu_1^\*(0) &= 1, \\ V_1(1) &= \min\big(\, \underbrace{1 + V_2(1)}_{u=0:\ 5},\ \underbrace{3 + V_2(2)}_{u=1:\ 3} \,\big) = 3, & \mu_1^\*(1) &= 1, \\ V_1(2) &= \min\big(\, \underbrace{2 + V_2(2)}_{u=0:\ 2},\ \underbrace{4 + V_2(2)}_{u=1:\ 4} \,\big) = 2, & \mu_1^\*(2) &= 0. \end{aligned}

Stage 0. Now apply V_0(x) = \min_u[\, g(x, u) + V_1(\min(x+u, 2)) \,], reusing the stage-1 values:

\begin{aligned} V_0(0) &= \min\big(\, \underbrace{0 + V_1(0)}_{u=0:\ 6},\ \underbrace{2 + V_1(1)}_{u=1:\ 5} \,\big) = 5, & \mu_0^\*(0) &= 1, \\ V_0(1) &= \min\big(\, \underbrace{1 + V_1(1)}_{u=0:\ 4},\ \underbrace{3 + V_1(2)}_{u=1:\ 5} \,\big) = 4, & \mu_0^\*(1) &= 0, \\ V_0(2) &= \min\big(\, \underbrace{2 + V_1(2)}_{u=0:\ 4},\ \underbrace{4 + V_1(2)}_{u=1:\ 6} \,\big) = 4, & \mu_0^\*(2) &= 0. \end{aligned}

Read off the optimal run. From x_0 = 0 the optimal cost is V_0(0) = 5. The policy says: at stage 0 play \mu_0^\*(0) = 1, moving to x_1 = 1; at stage 1 play \mu_1^\*(1) = 1, moving to x_2 = 2. The trajectory 0 \to 1 \to 2 costs g(0,1) + g(1,1) + \varphi(2) = 2 + 3 + 0 = 5, matching V_0(0) exactly.

The table, filling backward

Here is that computation as a table — one column per stage, one row per state. Slide backward stages computed: the terminal column V_2 = \varphi appears first, then V_1, then V_0, each built from the column to its right. The small u^\* under each computed cell is the optimal action there; tracing them from x_0 = 0 gives the optimal run.

The Bellman equation is the discrete skeleton of the whole subject. Shrink the stage to an infinitesimal time step and the recursion becomes a partial differential equation for V — the Hamilton–Jacobi–Bellman equation. Let the horizon recede to infinity and it becomes a fixed-point equation solved by value and policy iteration. Every one of these is the line "cost now plus optimal cost-to-go next" written in a different limit.