The Hamilton–Jacobi–Bellman Equation
The Bellman equation was a backward recursion over discrete stages. Let the stage shrink to an
infinitesimal slice of time and that recursion becomes a partial differential equation for the
value function: the Hamilton–Jacobi–Bellman (HJB) equation. It is the
continuous-time crown of dynamic programming, and remarkably, the Hamiltonian
steps straight back out of it.
Deriving HJB from the principle of optimality
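Let V(x, t) be the optimal cost-to-go, the smallest cost achievable from state x at time t, for the system
\dot{x} = f(x, u) with running cost L(x, u) and terminal cost \varphi(x(T)):
V(x, t) = \min_{u(\cdot)} \Big[\, \int_t^T L\big(x(\tau), u(\tau)\big)\,d\tau + \varphi\big(x(T)\big) \,\Big].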
Step 1 — split the horizon at a tiny interval. The principle of optimality
says the best cost from (x, t) equals the cost over
[t, t+dt] plus the best cost from wherever you land:
V(x, t) = \min_{u} \Big[\, L(x, u)\,dt + V\big(x + f(x, u)\,dt,\ t + dt\big) \,\Big] + o(dt).
Step 2 — Taylor-expand the cost-to-go at the next instant. Expand
V about (x, t) to first order, using the time derivative V_t = \partial V/\partial t and the
gradient V_x = \nabla_x V:
V\big(x + f\,dt,\ t + dt\big) = V(x, t) + V_x^\top f(x, u)\,dt + V_t\,dt + o(dt).
Step 3 — substitute and cancel V(x, t). The term
V(x, t) and the time-derivative term V_t\,dt
do not depend on u, so they come out of the minimisation:
V(x, t) = \min_{u} \Big[\, L\,dt + V_x^\top f\,dt \,\Big] + V(x, t) + V_t\,dt + o(dt).
Cancelling V(x, t) from both sides leaves
0 = \min_{u} \Big[\, L\,dt + V_x^\top f\,dt \,\Big] + V_t\,dt + o(dt).
Step 4 — divide by dt and let
dt \to 0. Every remaining term carries a common factor dt, which divides out, while
o(dt)/dt \to 0 by definition:
-V_t = \min_{u} \Big[\, L(x, u) + V_x^\top f(x, u) \,\Big].
This is the Hamilton–Jacobi–Bellman equation, a first-order partial
differential equation for V(x, t), closed by the
terminal condition V(x, T) = \varphi(x) (at the
final time there is nothing left to do).
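The o(dt) claim in Step 1 can be probed numerically. The sketch below is only an illustration under stated assumptions: it borrows the stationary example worked later in this section (\dot{x} = u, L = x^2 + u^2, V(x) = x^2), and the function names, the control grid, and its bounds are hypothetical choices. It performs the one-step backup by brute-force search over u and reports how far the result is from V(x), divided by dt.

```python
def value(x):
    """Assumed stationary value function V(x) = x^2 (from the worked example)."""
    return x * x


def one_step_backup(x, dt, n_grid=4001, u_max=5.0):
    """Minimise L(x,u)*dt + V(x + f(x,u)*dt) over a uniform grid of controls.

    Here f(x, u) = u and L(x, u) = x^2 + u^2. The grid on [-u_max, u_max]
    is a hypothetical discretisation, not part of the derivation.
    """
    best = float("inf")
    for i in range(n_grid):
        u = -u_max + 2.0 * u_max * i / (n_grid - 1)
        cost = (x * x + u * u) * dt + value(x + u * dt)
        best = min(best, cost)
    return best


def backup_gap(x, dt):
    """Difference between the one-step backup and V(x); should be o(dt)."""
    return one_step_backup(x, dt) - value(x)


if __name__ == "__main__":
    for dt in (0.1, 0.01, 0.001):
        # gap / dt should shrink as dt shrinks if the gap really is o(dt)
        print(dt, backup_gap(1.0, dt) / dt)
```

If the gap divided by dt keeps shrinking as dt shrinks, the leftover term is negligible at first order, which is what allows Step 4 to discard it.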
The value function satisfies
-\frac{\partial V}{\partial t} = \min_{u}\Big[\, L(x, u) + (\nabla_x V)^\top f(x, u) \,\Big], \qquad V(x, T) = \varphi(x).
The optimal control is the pointwise minimiser — a feedback law in the
current state and time:
u^*(x, t) = \arg\min_{u}\Big[\, L(x, u) + (\nabla_x V)^\top f(x, u) \,\Big].
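One way to get a feel for the PDE and its terminal condition is to test a candidate V against them numerically. The sketch below rests on assumptions not made in the text: the test problem \dot{x} = u, L = x^2 + u^2, \varphi = 0, a hypothetical horizon T = 1, and the candidate V(x, t) = \tanh(T - t)\,x^2 (the form a scalar Riccati equation gives for this problem). Derivatives are taken by central differences, and the minimiser of the quadratic bracket is used in closed form.

```python
import math

T = 1.0  # hypothetical final time, chosen only for this check


def V(x, t):
    """Candidate value function tanh(T - t) * x^2 (assumed test problem)."""
    return math.tanh(T - t) * x * x


def grad_x(x, t, h=1e-5):
    """Central-difference estimate of V_x."""
    return (V(x + h, t) - V(x - h, t)) / (2.0 * h)


def grad_t(x, t, h=1e-5):
    """Central-difference estimate of V_t."""
    return (V(x, t + h) - V(x, t - h)) / (2.0 * h)


def feedback(x, t):
    """Pointwise minimiser of x^2 + u^2 + V_x * u, which is u = -V_x / 2."""
    return -0.5 * grad_x(x, t)


def hjb_residual(x, t):
    """V_t + min_u [L + V_x f]; close to zero if V satisfies HJB at (x, t)."""
    u = feedback(x, t)
    bracket = x * x + u * u + grad_x(x, t) * u
    return grad_t(x, t) + bracket


if __name__ == "__main__":
    print("residual at (1.5, 0.3):", hjb_residual(1.5, 0.3))
    print("terminal value V(2, T):", V(2.0, T))
    print("feedback at (1, 0):", feedback(1.0, 0.0))
```

A residual near zero at sampled points, together with V(x, T) = 0, is consistent with the candidate solving HJB; it is a spot check, not a proof.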
The Hamiltonian, reappearing
Look at the bracket being minimised. Since H(x, u, \lambda) = L(x, u) + \lambda^\top f(x, u) is the
very definition of the Hamiltonian, the HJB bracket
is the Hamiltonian, provided we identify the costate with the gradient of the value
function:
\lambda = \nabla_x V \quad\Longrightarrow\quad -V_t = \min_{u} H\big(x, u, \nabla_x V\big).
So the costate \lambda of the maximum principle is not an abstract
multiplier after all: it is the sensitivity of the optimal cost to the state,
\partial V/\partial x — the shadow price of being where you are. HJB
and Pontryagin are looking at the same Hamiltonian from two directions.
The two are complementary. HJB is a sufficient condition: a smooth
V solving the PDE delivers the globally optimal
feedback control u^*(x, t) at every state. The maximum principle
gives necessary conditions, in the form of coupled ODEs along a single
trajectory. One PDE for all states, versus ODEs for one path.
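The identification \lambda = \nabla_x V can be checked by hand on the scalar problem worked later in this section (\dot{x} = u, L = x^2 + u^2, V(x) = x^2). The sketch assumes the usual minimisation convention for the costate equation, \dot{\lambda} = -\partial H/\partial x, which is not restated on this page.

```latex
\begin{aligned}
H(x, u, \lambda) &= x^2 + u^2 + \lambda u, \\
\frac{\partial H}{\partial u} = 2u + \lambda = 0 &\;\Longrightarrow\; u^* = -\tfrac12 \lambda, \\
\dot{x} = u^* = -\tfrac12 \lambda, &\qquad \dot{\lambda} = -\frac{\partial H}{\partial x} = -2x. \\
\text{Try } \lambda = V'(x) = 2x: &\qquad \dot{x} = -x, \quad \dot{\lambda} = 2\dot{x} = -2x .
\end{aligned}
```

The candidate \lambda = 2x reproduces the costate equation \dot{\lambda} = -2x along the closed-loop path, so on this example the gradient of the value function does behave as the costate.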
A worked HJB solution
Take the scalar system \dot{x} = u with the infinite-horizon cost
\int_0^\infty (x^2 + u^2)\,dt, so
L = x^2 + u^2 and f = u. Neither depends explicitly on time and the
horizon is infinite, so the value function is stationary, V = V(x) with V_t = 0, and HJB
reads
0 = \min_{u}\Big[\, x^2 + u^2 + V'(x)\,u \,\Big].
Step 1 — minimise the bracket over u. It is a
quadratic in u; setting its derivative to zero,
2u + V'(x) = 0, gives the feedback
u^* = -\tfrac12 V'(x).
Step 2 — substitute back. Putting u^* into the
bracket,
0 = x^2 + \tfrac14 V'^2 - \tfrac12 V'^2 = x^2 - \tfrac14 V'^2 \quad\Longrightarrow\quad V'(x)^2 = 4x^2.
Step 3 — solve. Taking the root that makes V a
genuine (nonnegative, increasing-away-from-zero) cost, V'(x) = 2x, and integrating with V(0) = 0,
V(x) = x^2, \qquad u^*(x) = -\tfrac12 V'(x) = -x.
The other root, V'(x) = -2x, would give V(x) = -x^2 < 0, which is impossible for an integral of
x^2 + u^2, and the destabilising feedback u = +x.
The value function is the parabola V(x) = x^2 and the optimal policy
is the linear feedback u^*(x) = -x: steer toward the origin in
proportion to how far away you are. No trajectory is needed: HJB hands back the control law for
every state at once.
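The result can be sanity-checked by simulation. The sketch below is a rough forward-Euler estimate under stated assumptions: the function name, step size, and the truncation of the infinite horizon at 20 time units are all hypothetical choices. It accumulates the cost of the linear feedback u = -k x for a few gains k, so the gain k = 1 from HJB can be compared with V(x_0) = x_0^2 and with neighbouring gains.

```python
def closed_loop_cost(gain, x0, dt=1e-3, horizon=20.0):
    """Forward-Euler estimate of the integral of x^2 + u^2 under u = -gain * x.

    The system is x' = u. The infinite horizon is truncated at `horizon`,
    which is an approximation that is adequate only for stabilising gains.
    """
    x = x0
    cost = 0.0
    for _ in range(int(round(horizon / dt))):
        u = -gain * x
        cost += (x * x + u * u) * dt  # running cost over one step
        x += u * dt                   # Euler step of x' = u
    return cost


if __name__ == "__main__":
    x0 = 2.0
    print("V(x0) =", x0 * x0)
    for k in (0.5, 1.0, 2.0):
        print("gain", k, "cost", closed_loop_cost(k, x0))
```

For this problem the cost of the gain k can also be worked out by hand as (1 + k^2)/(2k) \cdot x_0^2, which is smallest at k = 1 and equals x_0^2 there, in agreement with the HJB solution.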
The value landscape and the feedback law
Below are the two objects HJB produced: the value function
V(x) = x^2, which is the cost landscape, lowest at the goal
x = 0, and the feedback law
u^*(x) = -x. Slide the state
x: the markers read off the cost-to-go and the control HJB
prescribes there. Notice that the optimal control is largest in magnitude, and points hardest toward the origin,
exactly where the landscape is steepest, because u^* = -\tfrac12 V'(x) is set by the
slope V'(x).
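The same readings can be reproduced offline for a few sample states. This is a minimal sketch with hypothetical helper names; the slope is estimated by a central difference so that the control is visibly computed from V rather than typed in.

```python
def value(x):
    """Cost-to-go V(x) = x^2 from the worked example."""
    return x * x


def slope(x, h=1e-6):
    """Central-difference estimate of V'(x)."""
    return (value(x + h) - value(x - h)) / (2.0 * h)


def control(x):
    """Feedback law u*(x) = -V'(x) / 2."""
    return -0.5 * slope(x)


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x = {x:+.1f}  V = {value(x):.2f}  u* = {control(x):+.3f}")
```

Each row pairs a state with its cost-to-go and the control prescribed there, so the link between steep slope and strong control can be read off directly.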
Solving the HJB PDE is a complete answer — a sufficient condition giving a global feedback
law — but it is a PDE over the whole state space. In a handful of dimensions that is a gift;
in dozens it is hopeless, because the grid on which one would represent
V(x, t) grows exponentially with the dimension of
x — Bellman's own "curse of dimensionality". The maximum
principle's ODEs sidestep that (they live along one trajectory), at the price of giving only
that one open-loop path. The next page sets the two methods precisely side by side.
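The exponential growth is easy to put numbers on. The sketch below assumes a uniform tensor-product grid with the same number of points along every state axis and 8 bytes per stored value; both are illustrative assumptions, and the function names are hypothetical.

```python
def grid_nodes(points_per_axis, dim):
    """Number of nodes in a uniform grid over a dim-dimensional state space."""
    return points_per_axis ** dim


def grid_bytes(points_per_axis, dim, bytes_per_value=8):
    """Memory needed to store one value of V per node (assumed 8-byte floats)."""
    return grid_nodes(points_per_axis, dim) * bytes_per_value


if __name__ == "__main__":
    for dim in (1, 2, 3, 6, 12):
        print(dim, grid_nodes(100, dim), grid_bytes(100, dim))
```

With 100 points per axis, each added state dimension multiplies the storage by 100, and that is before the time axis of V(x, t) is counted.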