The Hamilton–Jacobi–Bellman Equation

The Bellman equation was a backward recursion over discrete stages. Let the stage shrink to an infinitesimal slice of time and that recursion becomes a partial differential equation for the value function — the Hamilton–Jacobi–Bellman (HJB) equation. It is the continuous-time crown of dynamic programming, and remarkably, the Hamiltonian steps straight back out of it.

Deriving HJB from the principle of optimality

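Let V(x, t) be the cost-to-go, the least cost achievable when starting from state x at time t, for the system \dot{x} = f(x, u) with running cost L(x, u) and terminal cost \varphi(x(T)):

V(x, t) = \min_{u(\cdot)} \Big[\, \int_t^T L\big(x(s), u(s)\big)\,ds + \varphi\big(x(T)\big) \,\Big], \qquad x(t) = x.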

Step 1 — split the horizon at a tiny interval. The principle of optimality says the best cost from (x, t) equals the cost over [t, t+dt] plus the best cost from wherever you land:

V(x, t) = \min_{u} \Big[\, L(x, u)\,dt + V\big(x + f(x, u)\,dt,\ t + dt\big) \,\Big] + o(dt).

Step 2 — Taylor-expand the cost-to-go at the next instant. Assuming V is differentiable, expand it about (x, t) to first order, using the time derivative V_t = \partial V/\partial t and the gradient V_x = \nabla_x V:

V\big(x + f\,dt,\ t + dt\big) = V(x, t) + V_x^\top f(x, u)\,dt + V_t\,dt + o(dt).

Step 3 — substitute and cancel V(x, t). The term V(x, t) and the time-derivative term V_t\,dt do not depend on u, so they come out of the minimisation:

V(x, t) = \min_{u} \Big[\, L\,dt + V_x^\top f\,dt \,\Big] + V(x, t) + V_t\,dt + o(dt).

Cancelling V(x, t) from both sides leaves

0 = \min_{u} \Big[\, L\,dt + V_x^\top f\,dt \,\Big] + V_t\,dt + o(dt).

Step 4 — divide by dt and let dt \to 0. The common factor dt cancels from every term, the remainder o(dt)/dt tends to zero, and moving V_t to the left-hand side gives:

-V_t = \min_{u} \Big[\, L(x, u) + V_x^\top f(x, u) \,\Big].

This is the Hamilton–Jacobi–Bellman equation, a first-order (and, through the minimisation, generally nonlinear) partial differential equation for V(x, t). It is solved backward in time from the terminal condition V(x, T) = \varphi(x): at the final time no running cost remains, only the terminal cost.

The value function satisfies

-\frac{\partial V}{\partial t} = \min_{u}\Big[\, L(x, u) + (\nabla_x V)^\top f(x, u) \,\Big], \qquad V(x, T) = \varphi(x).

The optimal control is the pointwise minimiser, a feedback law in the current state and time:

u^*(x, t) = \arg\min_{u}\Big[\, L(x, u) + (\nabla_x V)^\top f(x, u) \,\Big].
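
The backward-in-time character of the equation can be made concrete with a small numerical sketch. It applies the Step 1 recursion literally on a grid, starting from the terminal condition and marching back to t = 0. Everything specific here is an assumption chosen for illustration: the scalar problem \dot{x} = u with L = x^2 + u^2 and \varphi = 0, the horizon T = 1, the grid sizes, and the function name solve_hjb_backward. For this particular problem the exact cost-to-go is V(x, t) = \tanh(T - t)\,x^2, which gives something to compare against.

```python
import math

def solve_hjb_backward(T=1.0, n_steps=50, x_max=2.0, n_x=161, u_max=3.0, n_u=61):
    """Grid-based backward Bellman recursion for xdot = u, L = x^2 + u^2, phi = 0.

    Returns the state grid and the approximate cost-to-go V(x, 0) on it.
    All defaults are illustrative choices, not tuned values.
    """
    dt = T / n_steps
    dx = 2.0 * x_max / (n_x - 1)
    du = 2.0 * u_max / (n_u - 1)
    xs = [-x_max + i * dx for i in range(n_x)]   # state grid
    us = [-u_max + j * du for j in range(n_u)]   # candidate controls

    def f(x, u):                                 # dynamics xdot = f(x, u)
        return u

    def running_cost(x, u):                      # L(x, u)
        return x * x + u * u

    def terminal_cost(x):                        # phi(x)
        return 0.0

    def interp(V, x):
        # Linear interpolation of the grid values V at x, clamped to the grid.
        x = min(max(x, -x_max), x_max)
        s = (x + x_max) / dx
        i = min(int(s), n_x - 2)
        w = s - i
        return (1.0 - w) * V[i] + w * V[i + 1]

    V = [terminal_cost(x) for x in xs]           # V(x, T) = phi(x)
    for _ in range(n_steps):                     # march from t = T back to t = 0
        # V(x, t) = min_u [ L dt + V(x + f dt, t + dt) ]
        V = [
            min(running_cost(x, u) * dt + interp(V, x + f(x, u) * dt) for u in us)
            for x in xs
        ]
    return xs, V

xs, V0 = solve_hjb_backward()
for x_query in (0.0, 0.5, 1.0):
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_query))
    exact = math.tanh(1.0) * xs[i] ** 2
    print(f"x = {xs[i]:+.2f}   grid V(x, 0) = {V0[i]:.4f}   tanh(T) x^2 = {exact:.4f}")
```

The grid values should agree with the closed form only up to discretisation error in dt, the state grid and the control grid. Note that the work per time step scales with the number of grid nodes, which is cheap in one dimension.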

The Hamiltonian, reappearing

Look at the bracket being minimised. Since H(x, u, \lambda) = L(x, u) + \lambda^\top f(x, u) is the very definition of the Hamiltonian, the HJB bracket is the Hamiltonian, provided we identify the costate with the gradient of the value function:

\lambda = \nabla_x V \quad\Longrightarrow\quad -V_t = \min_{u} H\big(x, u, \nabla_x V\big).

So the costate \lambda of the maximum principle is not an abstract multiplier after all: it is the sensitivity of the optimal cost to the state, \partial V/\partial x — the shadow price of being where you are. HJB and Pontryagin are looking at the same Hamiltonian from two directions.
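
A short derivation makes the identification plausible. It is a sketch under extra assumptions not stated above: V is twice continuously differentiable, and the minimiser u^* is either interior or drawn from a control set that does not depend on x, so that (by the envelope theorem) the dependence of u^* on x contributes nothing at first order. Here f_x denotes the Jacobian \partial f/\partial x and all terms are evaluated at u = u^*.

```latex
\begin{aligned}
-V_{tx} &= L_x + f_x^\top V_x + V_{xx}\, f
  && \text{(differentiate } -V_t = L + V_x^\top f \text{ with respect to } x\text{)} \\
\frac{d}{dt}\, V_x\big(x(t), t\big) &= V_{xt} + V_{xx}\, f
  && \text{(chain rule along } \dot{x} = f\text{)} \\
&= -L_x - f_x^\top V_x
  && \text{(substitute the first line, using } V_{xt} = V_{tx}\text{)} \\
&= -\frac{\partial H}{\partial x}\bigg|_{\lambda = V_x}
\end{aligned}
```

So along an optimal trajectory the gradient V_x obeys \dot{\lambda} = -\partial H/\partial x, which is the costate equation, and the terminal condition V(x, T) = \varphi(x) supplies \lambda(T) = \nabla\varphi(x(T)).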

The two are complementary. HJB is a sufficient condition: a smooth V solving the PDE delivers the globally optimal feedback control u^*(x, t) at every state. The maximum principle gives necessary conditions, coupled ODEs along a single optimal trajectory. One PDE for all states, versus ODEs for one path.

A worked HJB solution

Take the scalar system \dot{x} = u with the infinite-horizon cost \int_0^\infty (x^2 + u^2)\,dt, so L = x^2 + u^2 and f = u. Because the horizon is infinite and neither f nor L depends explicitly on time, the cost-to-go depends on the state alone, V = V(x), so V_t = 0 and HJB reads

0 = \min_{u}\Big[\, x^2 + u^2 + V'(x)\,u \,\Big].

Step 1 — minimise the bracket over u. It is a convex quadratic in u, so its stationary point is the minimum; setting the derivative to zero, 2u + V'(x) = 0, gives the feedback u^* = -\tfrac12 V'(x).

Step 2 — substitute back. Putting u^* into the bracket,

0 = x^2 + \tfrac14 V'^2 - \tfrac12 V'^2 = x^2 - \tfrac14 V'^2 \quad\Longrightarrow\quad V'(x)^2 = 4x^2.

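Step 3 — solve. The two roots are V'(x) = \pm 2x. The running cost is non-negative and vanishes only at the origin, so a genuine cost-to-go must satisfy V(0) = 0 and V(x) > 0 elsewhere. That selects V'(x) = 2x; the other root would give V = -x^2, a negative "cost", together with the destabilising law u = x. Integrating with V(0) = 0,

V(x) = x^2, \qquad u^*(x) = -\tfrac12 V'(x) = -x.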
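
As a sanity check on this result, the sketch below simulates the closed loop \dot{x} = -k x for a few linear gains k and accumulates the cost x^2 + u^2 numerically. The function name closed_loop_cost, the finite horizon T = 20 standing in for infinity, and the step size are all illustrative assumptions. For this system the cost of u = -k x from x_0 works out analytically to (1 + k^2)\,x_0^2 / (2k), so the simulation should show the smallest cost at k = 1, where it equals x_0^2 = V(x_0).

```python
def closed_loop_cost(gain, x0=1.0, T=20.0, dt=0.01):
    """Cost of the linear feedback u = -gain * x for xdot = u, L = x^2 + u^2.

    Integrates the state and the accumulated cost together with classical RK4
    over [0, T]; T is a finite stand-in for the infinite horizon.
    """
    def rhs(x):
        u = -gain * x
        return u, x * x + u * u      # (xdot, rate of cost accumulation)

    x, cost = x0, 0.0
    for _ in range(int(round(T / dt))):
        k1x, k1c = rhs(x)
        k2x, k2c = rhs(x + 0.5 * dt * k1x)
        k3x, k3c = rhs(x + 0.5 * dt * k2x)
        k4x, k4c = rhs(x + dt * k3x)
        x += dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
        cost += dt * (k1c + 2.0 * k2c + 2.0 * k3c + k4c) / 6.0
    return cost

for gain in (0.5, 0.8, 1.0, 1.25, 2.0):
    print(f"u = -{gain:g} x   cost from x0 = 1: {closed_loop_cost(gain):.6f}")
```

This only compares linear feedback laws with one another; it does not by itself prove optimality over all controls, which is what the HJB argument provides.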

The value function is the parabola V(x) = x^2 and the optimal policy is the linear feedback u^*(x) = -x: steer toward the origin in proportion to how far away you are. No trajectory needed: HJB hands back the control law for every state at once.
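
For contrast, here is a sketch of the single-trajectory view of the same problem. With H = x^2 + u^2 + \lambda u, minimising over u gives u = -\lambda/2, and the canonical equations are \dot{x} = -\lambda/2, \dot{\lambda} = -\partial H/\partial x = -2x. The code integrates them forward from one initial state and checks that the costate stays equal to V'(x) = 2x along the path. One loud assumption: the initial costate \lambda(0) = 2x_0 is borrowed from the HJB solution rather than found from boundary conditions, as a real maximum-principle calculation would have to do. The function name pontryagin_trajectory and the horizon T = 3 are likewise illustrative.

```python
import math

def pontryagin_trajectory(x0=1.0, T=3.0, dt=0.01):
    """Integrate xdot = -lam/2, lamdot = -2x with RK4 from (x0, 2*x0).

    Returns a list of (t, x, lam) samples. The initial costate 2*x0 is
    taken from V'(x0) = 2*x0, i.e. it is supplied by the HJB solution.
    """
    def rhs(x, lam):
        u = -0.5 * lam               # minimiser of H = x^2 + u^2 + lam*u over u
        return u, -2.0 * x           # (xdot = u, lamdot = -dH/dx)

    x, lam = x0, 2.0 * x0
    path = [(0.0, x, lam)]
    n_steps = int(round(T / dt))
    for i in range(n_steps):
        k1 = rhs(x, lam)
        k2 = rhs(x + 0.5 * dt * k1[0], lam + 0.5 * dt * k1[1])
        k3 = rhs(x + 0.5 * dt * k2[0], lam + 0.5 * dt * k2[1])
        k4 = rhs(x + dt * k3[0], lam + dt * k3[1])
        x += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        lam += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
        path.append(((i + 1) * dt, x, lam))
    return path

path = pontryagin_trajectory()
worst_gap = max(abs(lam - 2.0 * x) for _, x, lam in path)
t_end, x_end, lam_end = path[-1]
print(f"max |lambda - V'(x)| along the path: {worst_gap:.2e}")
print(f"x(T) = {x_end:.6f}   x0 * exp(-T) = {math.exp(-t_end):.6f}")
```

The output describes one open-loop path from one x_0. Starting somewhere else means integrating again, whereas the feedback law u^*(x) = -x already covers every starting point.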

The value landscape and the feedback law

HJB produced two objects: the value function V(x) = x^2, the cost landscape that is lowest at the goal x = 0, and the feedback law u^*(x) = -x. At any state x the first gives the cost-to-go and the second gives the control HJB prescribes there. The optimal control is largest in magnitude, and points hardest toward the origin, exactly where the landscape is steepest, because u^* is set by the slope V'(x).

[Interactive figure: plots of V(x) = x^2 and u^*(x) = -x with a movable state marker.]

Solving the HJB PDE is a complete answer, a sufficient condition giving a global feedback law, but it is a PDE over the whole state space. In a handful of dimensions that is a gift; in dozens it is hopeless, because a grid with N points per axis needs N^n nodes to represent V(x, t) for an n-dimensional state x. That exponential growth in the dimension is Bellman's own "curse of dimensionality". The maximum principle's ODEs sidestep that (they live along one trajectory), at the price of giving only that one open-loop path. The next page sets the two methods precisely side by side.
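
The growth is easy to put numbers on. The sketch below counts the nodes of a full tensor-product grid and the memory for a single time slice of V. The figures of 100 points per axis and 8 bytes per stored value are assumptions picked only for illustration, as are the helper names grid_nodes and grid_bytes.

```python
def grid_nodes(points_per_axis, state_dim):
    """Number of nodes in a full tensor-product grid over the state space."""
    return points_per_axis ** state_dim

def grid_bytes(points_per_axis, state_dim, bytes_per_value=8):
    """Memory for one time slice of V, one 8-byte float per node (assumed layout)."""
    return grid_nodes(points_per_axis, state_dim) * bytes_per_value

for state_dim in (1, 2, 3, 6, 12):
    nodes = grid_nodes(100, state_dim)
    gigabytes = grid_bytes(100, state_dim) / 1e9
    print(f"dim = {state_dim:2d}   nodes = {nodes:.3e}   memory = {gigabytes:.3e} GB")
```

Each extra state dimension multiplies both numbers by the points per axis, which is why refining the grid does not rescue the method in high dimension.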