The Principle of Optimality
The
maximum
principle attacks an optimal-control problem along one trajectory: it
writes down necessary conditions a single optimal path must obey. Dynamic programming takes the
opposite stance. Instead of one path it reasons about every starting state at
once, and it does so by exploiting a deceptively simple structural fact about optimal
decisions — Bellman's principle of optimality. This page states that
principle, shows why it lets us work backward from the goal, and defines the object it
gives us: the value function.
Bellman's principle of optimality
Richard Bellman put it in one sentence:
-
An optimal policy has the property that whatever the initial state and initial decision
are, the remaining decisions must constitute an optimal policy with regard to the state
resulting from the first decision.
-
Equivalently: every tail of an optimal trajectory is itself optimal. If
the best route from x_0 to the goal passes through some
intermediate state x_k, then the portion of that route from
x_k onward is the best route from
x_k to the goal.
The argument is a one-line proof by contradiction. Suppose the tail from
x_k were not optimal — that some cheaper route from
x_k to the goal existed. Then we could splice that cheaper tail onto
the original head (which still reaches x_k) and obtain a strictly
cheaper route from x_0 overall, contradicting the assumption that
the full route was optimal. So no cheaper tail can exist: the tail must already be optimal.
Shortest paths, and why we work backward
Picture a road network with a start, a goal, and a cost on every leg. The principle says: if
the cheapest journey from the start to the goal happens to pass through a town
M, then from M onward it follows the
cheapest journey from M to the goal — irrespective of how it
reached M. The cost of the best continuation depends only on
where you are, never on the history of how you got there.
This is exactly what licenses a backward sweep. We do not need to enumerate whole paths.
Starting at the goal (cost 0 to finish) we compute, for each
node, the cheapest cost to reach the goal from that node, reusing the answers already
found one step closer to the goal. Each node is solved once; the optimal continuation from it
is then fixed forever. An exponential search over paths collapses into a linear sweep over
states — the engine of dynamic programming.
The value function (cost-to-go)
The object the backward sweep computes deserves a name. The value function,
or cost-to-go, is the minimum cost achievable from a given state:
V(x, t) = \min_{u(\cdot)} \left[\, \varphi\big(x(T)\big) + \int_t^T L\big(x(s), u(s), s\big)\,ds \,\right] \quad\text{subject to } x(t) = x.
In words: V(x, t) is the best total cost you can still incur if you
find yourself in state x at time t and
play optimally from there on. The principle of optimality is precisely the statement that this
V is well defined — that the best cost-to-go depends only on the
current state, not on the past — which is what makes a recursion in
V possible. At the goal there is nothing left to do, so
V equals the terminal cost
\varphi alone; everywhere else it is built from values nearer the
goal. The
next
page turns that idea into an explicit recursion.
Watch the cost-to-go fill in
Below is a small layered network from a start S to a goal
T; each leg carries a cost. Slide the backward sweep
step from the goal outward. At step 0 only
V(T) = 0 is known. Each further step labels the next layer of
nodes with their cost-to-go V — the cheapest cost from that node to
T — computed as (leg cost) + (cost-to-go of the next node),
minimised over the outgoing legs. By the last step every node knows its value, and the optimal
route is read off by always stepping to the neighbour that achieved the minimum.
Notice the value at B: V(B) = 4, the
cost of the tail B \to D \to T. The optimal full route
S \to B \to D \to T costs 8, and its
tail from B costs exactly V(B) = 4 — not
a penny more or less. That is the principle of optimality made concrete: the optimal route's
continuation from any node it visits is that node's own optimal continuation. Because
of it we never re-solve a tail; we look it up.