Value and Policy Iteration
The
Bellman
equation filled a finite table backward, stage by stage. But many problems have no
last stage: the system runs forever and we discount the future. Then there is no terminal
column to start from — the value function is instead a fixed point of the
Bellman operator, and we find it by iterating. Two classic schemes do this:
value iteration and policy iteration.
Discounting and the Bellman operator
For an infinite-horizon problem we weight a cost incurred k steps
from now by \gamma^k, with a discount factor
0 \le \gamma < 1. The total cost stays finite (a geometric tail), and
the value function obeys a Bellman equation with no time index — the same in every step:
V(x) = \min_{u} \Big[\, g(x, u) + \gamma\, V\big(f(x, u)\big) \,\Big].
The right-hand side defines the Bellman operator
T: feed it any guess V and it returns a
new function (TV)(x) = \min_u[\,g(x,u) + \gamma V(f(x,u))\,]. The
true value function is the one guess that comes back unchanged — the fixed point
V^\* = T V^\*.
-
For any two guesses V, W, the operator
shrinks the gap by a factor \gamma in the sup norm:
\lVert TV - TW \rVert_\infty \le \gamma \, \lVert V - W \rVert_\infty.
-
Because \gamma < 1, the Banach fixed-point theorem guarantees a
unique fixed point V^\*, and repeatedly applying
T from any start converges to it.
The discount factor is doing the work: the \gamma multiplying the
future value is exactly the factor by which two guesses are pulled together each application,
so a smaller \gamma means faster convergence.
Value iteration
Value iteration is the contraction in action: pick any
V_0 and apply the operator until it stops moving,
V_{n+1}(x) \leftarrow \min_{u} \Big[\, g(x, u) + \gamma\, V_n\big(f(x, u)\big) \,\Big].
Each pass pulls the estimate a factor \gamma closer to
V^\*, so the error decays geometrically.
A worked iteration. Two states \{A, B\},
discount \gamma = \tfrac12. In A you may
stay (cost 1, back to A) or
exit (cost 3, to B);
B is absorbing at cost 0, so
V(B) = 0. Bellman at A is
V(A) = \min(\,1 + \tfrac12 V(A),\ 3\,). Start from
V_0(A) = 0 and apply the update; stay always wins, so
\begin{aligned}
V_1 &= \min(1 + \tfrac12\cdot 0,\ 3) = 1, &
V_2 &= \min(1 + \tfrac12\cdot 1,\ 3) = 1.5, \\
V_3 &= \min(1 + \tfrac12\cdot 1.5,\ 3) = 1.75, &
V_4 &= \min(1 + \tfrac12\cdot 1.75,\ 3) = 1.875,
\end{aligned}
marching toward the fixed point V^\*(A) = 2 (solving
V = 1 + \tfrac12 V). The error
2 - V_n = 2 \cdot (\tfrac12)^n halves each pass — exactly the
contraction factor \gamma = \tfrac12 at work.
Policy iteration
Policy iteration works with policies rather than raw values, alternating two
steps until the policy stops changing:
-
Policy evaluation — fix the current policy
\pi and solve the (now linear) equations
V^\pi(x) = g(x, \pi(x)) + \gamma V^\pi(f(x, \pi(x))) for its
value V^\pi.
-
Policy improvement — update the policy greedily against that value,
\pi'(x) = \arg\min_{u} \Big[\, g(x, u) + \gamma\, V^\pi\big(f(x, u)\big) \,\Big].
-
Each round produces a policy no worse than the last, and with finitely many policies it
converges in a finite number of iterations — typically very few.
On the same example: start from the policy exit. Evaluation gives
V^\pi(A) = 3 + \tfrac12\cdot 0 = 3. Improvement compares
1 + \tfrac12\cdot 3 = 2.5 (stay) against
3 (exit) and switches to stay. Re-evaluating stay
gives V^\pi(A) = 2; improvement now compares
1 + \tfrac12\cdot 2 = 2 against 3 and keeps
stay — no change, so it has converged in two rounds to the optimal policy and
V^\*(A) = 2.
These two algorithms — sweeping values to a fixed point, or alternating evaluation and greedy
improvement — are the exact backbone of
reinforcement learning, where the
same updates run on a system whose costs and dynamics are learned from experience rather than
known in advance.
Watch it converge
Below is the value-iteration sequence V_n(A) for the worked example,
with the fixed point V^\* = 2 dashed in. Slide the
iteration to step a marker along it and read off the value and the remaining
error 2 \cdot (\tfrac12)^n. Each step closes half the gap — the
signature of a contraction — and the iterates settle onto the fixed point.