The quadratic ansatz
The HJB equation for the LQ problem, with running cost
L = \tfrac12(x^{\mathsf{T}} Q x + u^{\mathsf{T}} R u) and dynamics
f = A x + B u, reads
-V_t = \min_{u}\Big[\, \tfrac12\big(x^{\mathsf{T}} Q x + u^{\mathsf{T}} R u\big) + V_x^{\mathsf{T}}\big(A x + B u\big) \,\Big].
The cost is quadratic and the dynamics linear, so we guess that the value function
(the cost-to-go) is itself a pure quadratic form in the state, with a symmetric, time-varying matrix
P(t) carrying the coefficients:
V(x, t) = \tfrac12\, x^{\mathsf{T}} P(t)\, x, \qquad P(t) = P(t)^{\mathsf{T}}.
This is the LQ counterpart of the parabola V(x) = x^2 that HJB produced
for the scalar example. From it the two derivatives HJB needs fall straight out:
V_x = P(t)\,x, \qquad V_t = \tfrac12\, x^{\mathsf{T}} \dot{P}(t)\, x.
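The gradient formula can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix P and the state x are arbitrary illustrative values, not taken from the text, and the helper names are hypothetical.

```python
import numpy as np

def value(P, x):
    """Quadratic ansatz V(x) = 1/2 x^T P x at a fixed time t."""
    return 0.5 * x @ P @ x

def numerical_gradient(P, x, h=1e-6):
    """Central-difference gradient of value() with respect to x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (value(P, x + step) - value(P, x - step)) / (2 * h)
    return grad

# Illustrative symmetric P and state x (assumed values).
P = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -3.0])

analytic = P @ x                      # V_x = P x from the ansatz
numeric = numerical_gradient(P, x)
gradient_matches = np.allclose(analytic, numeric, atol=1e-6)
print(gradient_matches)
```

The agreement relies on P being symmetric; for a non-symmetric P the gradient would be \tfrac12 (P + P^{\mathsf{T}}) x instead.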
Deriving the differential Riccati equation
Step 1 — substitute the ansatz into HJB. Substituting
V_x = Px and V_t = \tfrac12 x^{\mathsf{T}}\dot{P}x into the HJB equation gives
-\tfrac12\, x^{\mathsf{T}} \dot{P}\, x = \min_{u}\Big[\, \tfrac12 x^{\mathsf{T}} Q x + \tfrac12 u^{\mathsf{T}} R u + (P x)^{\mathsf{T}}(A x + B u) \,\Big].
Step 2 — minimise over u. Only two terms in the bracket
involve u — the control penalty
\tfrac12 u^{\mathsf{T}} R u and the coupling
(Px)^{\mathsf{T}} B u = x^{\mathsf{T}} P B u (using
P^{\mathsf{T}} = P). Differentiate the bracket with respect to
u and set it to zero:
\frac{\partial}{\partial u}\Big[\, \tfrac12 u^{\mathsf{T}} R u + x^{\mathsf{T}} P B u \,\Big] = R\,u + B^{\mathsf{T}} P\, x = 0.
Because R \succ 0 it is invertible, so we can solve for the minimiser
outright — and since the bracket is convex in u (its
u-Hessian is R \succ 0) this stationary point is
the genuine minimum:
u^* = -R^{-1} B^{\mathsf{T}} P\, x.
The optimal control is already a linear feedback in the state — the whole point of
the stage, falling out at the first minimisation.
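The minimisation in Step 2 can be checked on a small instance. This is a sketch under assumed data (two states, one input, arbitrary P, B, R and x chosen for illustration); it evaluates only the two u-dependent terms of the bracket and compares the stationary point against random perturbations.

```python
import numpy as np

def bracket_u_terms(u, x, P, B, R):
    """The u-dependent part of the HJB bracket: 1/2 u^T R u + x^T P B u."""
    return 0.5 * u @ R @ u + x @ P @ B @ u

# Illustrative data (assumed): two states, one input, R positive definite.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[0.0], [1.0]])
R = np.array([[2.0]])
x = np.array([1.0, -3.0])

# Minimiser from the stationarity condition R u + B^T P x = 0.
u_star = -np.linalg.solve(R, B.T @ P @ x)
stationarity = R @ u_star + B.T @ P @ x      # should vanish

# Compare against 100 randomly perturbed controls.
rng = np.random.default_rng(0)
best = bracket_u_terms(u_star, x, P, B, R)
is_minimum = all(
    bracket_u_terms(u_star + rng.normal(size=1), x, P, B, R) >= best - 1e-12
    for _ in range(100)
)
print(u_star, is_minimum)
```

Using np.linalg.solve rather than forming R^{-1} explicitly is a numerical convenience; the result is the same feedback law.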
Step 3 — substitute u^* back in. We evaluate the two
u-terms at u^* = -R^{-1}B^{\mathsf{T}}Px.
The control penalty, using R^{-1} R\, R^{-1} = R^{-1}, is
\tfrac12\, u^{*\mathsf{T}} R\, u^* = \tfrac12\, x^{\mathsf{T}} P B\, R^{-1} R\, R^{-1} B^{\mathsf{T}} P\, x = \tfrac12\, x^{\mathsf{T}} P B R^{-1} B^{\mathsf{T}} P\, x,
and the coupling term is
x^{\mathsf{T}} P B\, u^* = -\,x^{\mathsf{T}} P B R^{-1} B^{\mathsf{T}} P\, x.
Adding these two leaves a single
-\tfrac12\, x^{\mathsf{T}} P B R^{-1} B^{\mathsf{T}} P\, x (the positive
half minus the whole). Collecting everything on the right, the minimised HJB bracket becomes
-\tfrac12\, x^{\mathsf{T}} \dot{P}\, x = \tfrac12 x^{\mathsf{T}} Q x + x^{\mathsf{T}} P A x - \tfrac12\, x^{\mathsf{T}} P B R^{-1} B^{\mathsf{T}} P\, x.
Step 4 — symmetrise the lone PA term. The scalar
x^{\mathsf{T}} P A x equals its own transpose
x^{\mathsf{T}} A^{\mathsf{T}} P x, so we may replace it by the symmetric
average \tfrac12 x^{\mathsf{T}}(P A + A^{\mathsf{T}} P) x. Now every term
wears the form \tfrac12 x^{\mathsf{T}}(\cdots) x:
-\tfrac12\, x^{\mathsf{T}} \dot{P}\, x = \tfrac12\, x^{\mathsf{T}}\Big[\, A^{\mathsf{T}} P + P A - P B R^{-1} B^{\mathsf{T}} P + Q \,\Big] x.
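Steps 3 and 4 together claim that the minimised bracket equals a single symmetric quadratic form. A quick numerical check, sketched here with randomly generated matrices (all values are assumptions for illustration), evaluates both sides at one state.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2                       # assumed state and input dimensions
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
Q = np.eye(n)
R = 2.0 * np.eye(m)               # positive definite
M = rng.normal(size=(n, n))
P = M @ M.T                       # an arbitrary symmetric P
x = rng.normal(size=n)

# Left: the HJB bracket evaluated at the minimising control.
u_star = -np.linalg.solve(R, B.T @ P @ x)
bracket = (0.5 * x @ Q @ x
           + 0.5 * u_star @ R @ u_star
           + (P @ x) @ (A @ x + B @ u_star))

# Right: the symmetrised matrix A^T P + P A - P B R^{-1} B^T P + Q.
riccati_rhs = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
quadratic_form = 0.5 * x @ riccati_rhs @ x

forms_agree = np.isclose(bracket, quadratic_form)
rhs_is_symmetric = np.allclose(riccati_rhs, riccati_rhs.T)
print(forms_agree, rhs_is_symmetric)
```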
Step 5 — strip the x's. This must hold for
every state x, and both sides are symmetric quadratic forms, so
the bracketed matrices must be equal. The x's and the
\tfrac12's cancel, leaving a pure matrix differential equation — the
differential Riccati equation:
-\dot{P} = A^{\mathsf{T}} P + P A - P B R^{-1} B^{\mathsf{T}} P + Q, \qquad P(T) = S.
The terminal condition comes from HJB's own terminal condition
V(x, T) = \tfrac12 x^{\mathsf{T}} S x, matching the terminal cost of the
LQ problem: at the final instant the cost-to-go is the terminal penalty, so
P(T) = S.
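One way to solve the differential Riccati equation numerically is to march backward from P(T) = S with a fixed-step integrator. The sketch below uses classical RK4, which is a choice made here for illustration rather than a method prescribed by the text; the double-integrator matrices, the horizon and the function names are likewise assumptions.

```python
import numpy as np

def riccati_rhs(P, A, B, Q, R):
    """Right-hand side of -dP/dt = A^T P + P A - P B R^{-1} B^T P + Q."""
    return A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q

def solve_riccati_backward(A, B, Q, R, S, T, steps):
    """Integrate from P(T) = S back to t = 0 with classical RK4.

    In reversed time tau = T - t the equation reads dP/dtau = riccati_rhs(P),
    so a forward RK4 march in tau is a backward march in t.
    Returns the list [P(T), ..., P(0)].
    """
    h = T / steps
    P = S.copy()
    history = [P]
    for _ in range(steps):
        k1 = riccati_rhs(P, A, B, Q, R)
        k2 = riccati_rhs(P + 0.5 * h * k1, A, B, Q, R)
        k3 = riccati_rhs(P + 0.5 * h * k2, A, B, Q, R)
        k4 = riccati_rhs(P + h * k3, A, B, Q, R)
        P = P + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        history.append(P)
    return history

# Assumed example: double integrator with unit weights and zero terminal cost.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
S = np.zeros((2, 2))

history = solve_riccati_backward(A, B, Q, R, S, T=10.0, steps=1000)
P_T, P_0 = history[0], history[-1]
print(P_0)
```

For this particular example the stationary solution can be worked out by hand as [[\sqrt{3}, 1], [1, \sqrt{3}]], and with a horizon of T = 10 the computed P(0) should sit very close to it.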
- The value function of the LQ problem is the quadratic form V(x, t) = \tfrac12 x^{\mathsf{T}} P(t) x with P(t) symmetric.
- P(t) solves the matrix ODE -\dot{P} = A^{\mathsf{T}} P + P A - P B R^{-1} B^{\mathsf{T}} P + Q, integrated backward from the terminal value P(T) = S.
- The optimal control is the linear feedback u^*(t) = -R^{-1} B^{\mathsf{T}} P(t)\, x(t).
Watching p(t) integrate backward
In one dimension the matrices are numbers. Take a = 0,
b = 1, r = 1; the differential Riccati equation
collapses to -\dot{p} = q - p^2, started at
p(T) = S and solved backward in time. Vary the state
weight q and the terminal value S and watch the
curve: integrating from the right edge t = T leftward, for any q > 0 and
any terminal value S \ge 0, p(t) rushes to the same steady value
p_\infty = \sqrt{q} (where \dot p = 0). That
plateau, reached well before t = 0 for a long horizon, is the constant
the next page solves for directly.
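The convergence to \sqrt{q} can be reproduced in a few lines. This is a minimal sketch of the scalar case a = 0, b = 1, r = 1; the values of q, T and the three terminal values S are assumptions chosen for illustration, and RK4 is simply one convenient integrator.

```python
import math

def scalar_riccati_backward(q, S, T, steps=2000):
    """Integrate -dp/dt = q - p^2 from p(T) = S back to t = 0.

    Uses RK4 in reversed time tau = T - t, where dp/dtau = q - p^2.
    Returns p(0).
    """
    def f(p):
        return q - p * p

    h = T / steps
    p = S
    for _ in range(steps):
        k1 = f(p)
        k2 = f(p + 0.5 * h * k1)
        k3 = f(p + 0.5 * h * k2)
        k4 = f(p + h * k3)
        p += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return p

q, T = 4.0, 5.0                     # assumed weight and horizon
p0_values = {S: scalar_riccati_backward(q, S, T) for S in (0.0, 1.0, 5.0)}
for S, p0 in p0_values.items():
    print(f"S = {S}: p(0) = {p0:.6f}, sqrt(q) = {math.sqrt(q):.6f}")
```

Terminal values below, at and above the plateau are included on purpose, so the curve approaches \sqrt{q} from both sides.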
The Riccati equation carries a terminal condition, not an initial one — we know
P at the end, P(T) = S, because that is where
the cost-to-go is pinned to the terminal penalty. Dynamic programming always reasons from the
finish line backward, and the Riccati equation inherits that direction. The system trajectory
x(t) runs forward from x(0); the cost matrix
P(t) runs backward from P(T). The backward sweep supplies the gain
R^{-1} B^{\mathsf{T}} P(t) at every instant, and the forward sweep supplies the state x(t) it multiplies.
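The two sweeps can be sketched for the scalar case a = 0, b = 1, r = 1: first integrate p(t) backward and store it on a grid, then simulate the state forward under the feedback u = -p(t) x and add up the cost actually incurred. If the derivation holds, that cost should match the predicted cost-to-go \tfrac12 p(0) x(0)^2 up to discretisation error. The numerical methods (RK4 backward, Heun forward, trapezoidal cost), the parameter values and the function names are all assumptions made for this illustration.

```python
def riccati_rhs(p, q):
    """Scalar case a = 0, b = 1, r = 1: dp/dtau = q - p^2 with tau = T - t."""
    return q - p * p

def backward_sweep(q, S, T, steps):
    """Return [p(t_0), ..., p(t_N)] on a uniform grid, integrated from p(T) = S."""
    h = T / steps
    p = S
    grid = [p]
    for _ in range(steps):
        k1 = riccati_rhs(p, q)
        k2 = riccati_rhs(p + 0.5 * h * k1, q)
        k3 = riccati_rhs(p + 0.5 * h * k2, q)
        k4 = riccati_rhs(p + h * k3, q)
        p += (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        grid.append(p)
    grid.reverse()          # reorder so index k corresponds to t_k = k * h
    return grid

def forward_sweep(p_grid, q, S, T, x0):
    """Simulate dx/dt = u with u = -p(t) x (Heun's method) and accumulate the cost."""
    steps = len(p_grid) - 1
    h = T / steps
    x, cost = x0, 0.0
    for k in range(steps):
        u = -p_grid[k] * x
        x_pred = x + h * u                       # Euler predictor
        u_next = -p_grid[k + 1] * x_pred
        x_next = x + 0.5 * h * (u + u_next)      # Heun corrector
        u_end = -p_grid[k + 1] * x_next
        # Trapezoidal rule for the running cost 1/2 (q x^2 + u^2).
        running_now = 0.5 * (q * x * x + u * u)
        running_next = 0.5 * (q * x_next * x_next + u_end * u_end)
        cost += 0.5 * h * (running_now + running_next)
        x = x_next
    return x, cost + 0.5 * S * x * x             # add the terminal penalty

q, S, T, x0 = 1.0, 0.0, 5.0, 1.0                 # assumed illustrative values
p_grid = backward_sweep(q, S, T, steps=2000)
x_T, achieved_cost = forward_sweep(p_grid, q, S, T, x0)
predicted_cost = 0.5 * p_grid[0] * x0 * x0
print(achieved_cost, predicted_cost)
```

Note that the backward sweep never looks at x(0): the same stored p(t) serves every initial state, which is what makes the result a feedback law rather than a single open-loop trajectory.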