The Principle of Optimality

The maximum principle attacks an optimal-control problem along one trajectory: it writes down necessary conditions a single optimal path must obey. Dynamic programming takes the opposite stance. Instead of one path it reasons about every starting state at once, and it does so by exploiting a deceptively simple structural fact about optimal decisions — Bellman's principle of optimality. This page states that principle, shows why it lets us work backward from the goal, and defines the object it gives us: the value function.

Bellman's principle of optimality

Richard Bellman put it in one sentence:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Equivalently: every tail of an optimal trajectory is itself optimal. If the best route from x_0 to the goal passes through some intermediate state x_k, then the portion of that route from x_k onward is the best route from x_k to the goal.

The argument is a one-line proof by contradiction. Suppose the tail from x_k were not optimal — that some cheaper route from x_k to the goal existed. Then we could splice that cheaper tail onto the original head (which still reaches x_k) and obtain a strictly cheaper route from x_0 overall, contradicting the assumption that the full route was optimal. So no cheaper tail can exist: the tail must already be optimal.

Shortest paths, and why we work backward

Picture a road network with a start, a goal, and a cost on every leg. The principle says: if the cheapest journey from the start to the goal happens to pass through a town M, then from M onward it follows the cheapest journey from M to the goal — irrespective of how it reached M. The cost of the best continuation depends only on where you are, never on the history of how you got there.

This is exactly what licenses a backward sweep. We do not need to enumerate whole paths. Starting at the goal (cost 0 to finish) we compute, for each node, the cheapest cost to reach the goal from that node, reusing the answers already found one step closer to the goal. Each node is solved once; the optimal continuation from it is then fixed forever. An exponential search over paths collapses into a linear sweep over states — the engine of dynamic programming.

The value function (cost-to-go)

The object the backward sweep computes deserves a name. The value function, or cost-to-go, is the minimum cost achievable from a given state:

V(x, t) = \min_{u(\cdot)} \left[\, \varphi\big(x(T)\big) + \int_t^T L\big(x(s), u(s), s\big)\,ds \,\right] \quad\text{subject to } x(t) = x.

In words: V(x, t) is the best total cost you can still incur if you find yourself in state x at time t and play optimally from there on. The principle of optimality is precisely the statement that this V is well defined — that the best cost-to-go depends only on the current state, not on the past — which is what makes a recursion in V possible. At the goal there is nothing left to do, so V equals the terminal cost \varphi alone; everywhere else it is built from values nearer the goal. The next page turns that idea into an explicit recursion.

Watch the cost-to-go fill in

Below is a small layered network from a start S to a goal T; each leg carries a cost. Slide the backward sweep step from the goal outward. At step 0 only V(T) = 0 is known. Each further step labels the next layer of nodes with their cost-to-go V — the cheapest cost from that node to T — computed as (leg cost) + (cost-to-go of the next node), minimised over the outgoing legs. By the last step every node knows its value, and the optimal route is read off by always stepping to the neighbour that achieved the minimum.

Notice the value at B: V(B) = 4, the cost of the tail B \to D \to T. The optimal full route S \to B \to D \to T costs 8, and its tail from B costs exactly V(B) = 4 — not a penny more or less. That is the principle of optimality made concrete: the optimal route's continuation from any node it visits is that node's own optimal continuation. Because of it we never re-solve a tail; we look it up.