Value and Policy Iteration

The Bellman equation filled a finite table backward, stage by stage. But many problems have no last stage: the system runs forever and we discount the future. Then there is no terminal column to start from — the value function is instead a fixed point of the Bellman operator, and we find it by iterating. Two classic schemes do this: value iteration and policy iteration.

Discounting and the Bellman operator

For an infinite-horizon problem we weight a cost incurred k steps from now by \gamma^k, with a discount factor 0 \le \gamma < 1. The total cost stays finite (a geometric tail), and the value function obeys a Bellman equation with no time index — the same in every step:

V(x) = \min_{u} \Big[\, g(x, u) + \gamma\, V\big(f(x, u)\big) \,\Big].

The right-hand side defines the Bellman operator T: feed it any guess V and it returns a new function (TV)(x) = \min_u[\,g(x,u) + \gamma V(f(x,u))\,]. The true value function is the one guess that comes back unchanged — the fixed point V^\* = T V^\*.

The discount factor is doing the work: the \gamma multiplying the future value is exactly the factor by which two guesses are pulled together each application, so a smaller \gamma means faster convergence.

Value iteration

Value iteration is the contraction in action: pick any V_0 and apply the operator until it stops moving,

V_{n+1}(x) \leftarrow \min_{u} \Big[\, g(x, u) + \gamma\, V_n\big(f(x, u)\big) \,\Big].

Each pass pulls the estimate a factor \gamma closer to V^\*, so the error decays geometrically.

A worked iteration. Two states \{A, B\}, discount \gamma = \tfrac12. In A you may stay (cost 1, back to A) or exit (cost 3, to B); B is absorbing at cost 0, so V(B) = 0. Bellman at A is V(A) = \min(\,1 + \tfrac12 V(A),\ 3\,). Start from V_0(A) = 0 and apply the update; stay always wins, so

\begin{aligned} V_1 &= \min(1 + \tfrac12\cdot 0,\ 3) = 1, & V_2 &= \min(1 + \tfrac12\cdot 1,\ 3) = 1.5, \\ V_3 &= \min(1 + \tfrac12\cdot 1.5,\ 3) = 1.75, & V_4 &= \min(1 + \tfrac12\cdot 1.75,\ 3) = 1.875, \end{aligned}

marching toward the fixed point V^\*(A) = 2 (solving V = 1 + \tfrac12 V). The error 2 - V_n = 2 \cdot (\tfrac12)^n halves each pass — exactly the contraction factor \gamma = \tfrac12 at work.

Policy iteration

Policy iteration works with policies rather than raw values, alternating two steps until the policy stops changing:

On the same example: start from the policy exit. Evaluation gives V^\pi(A) = 3 + \tfrac12\cdot 0 = 3. Improvement compares 1 + \tfrac12\cdot 3 = 2.5 (stay) against 3 (exit) and switches to stay. Re-evaluating stay gives V^\pi(A) = 2; improvement now compares 1 + \tfrac12\cdot 2 = 2 against 3 and keeps stay — no change, so it has converged in two rounds to the optimal policy and V^\*(A) = 2.

These two algorithms — sweeping values to a fixed point, or alternating evaluation and greedy improvement — are the exact backbone of reinforcement learning, where the same updates run on a system whose costs and dynamics are learned from experience rather than known in advance.

Watch it converge

Below is the value-iteration sequence V_n(A) for the worked example, with the fixed point V^\* = 2 dashed in. Slide the iteration to step a marker along it and read off the value and the remaining error 2 \cdot (\tfrac12)^n. Each step closes half the gap — the signature of a contraction — and the iterates settle onto the fixed point.