Stochastic Optimal Control
So far the uncertainty lived only in our measurements: the Kalman filter cleaned up a noisy
sensor, but the system itself evolved by a deterministic
\dot{x} = Ax + Bu. Now we push the noise into the
dynamics. A gust hits the aircraft; a customer order jolts the inventory; thermal
noise rattles the circuit. The state no longer follows a smooth curve — it follows a random one.
The right language for a continuously, randomly buffeted state is a controlled stochastic
differential equation, that is, an Itô process whose drift and diffusion we get to steer:
dx = f(x, u)\,dt + \sigma(x, u)\,dW.
The drift f(x,u)\,dt is the controllable, deterministic push, our old dynamics. The diffusion
\sigma(x,u)\,dW is the irreducible randomness, a Brownian kick that no control can fully cancel.
In the linear case with a constant noise intensity \sigma this is just
dx = (Ax + Bu)\,dt + \sigma\,dW.
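To make the linear SDE concrete, here is a minimal sketch that simulates it with the Euler-Maruyama scheme, which replaces dt by a small step and dW by a Gaussian draw of variance dt. The scalar coefficients (a = 0, b = 1, sigma = 0.5), the feedback gain, the horizon and the function name `simulate_linear_sde` are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

def simulate_linear_sde(a, b, sigma, policy, x0, T, n_steps, rng):
    """Euler-Maruyama path of the scalar SDE dx = (a*x + b*u) dt + sigma dW."""
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        u = policy(x[i], i * dt)            # control chosen from the current state
        dW = rng.normal(0.0, np.sqrt(dt))   # Brownian increment, distributed N(0, dt)
        x[i + 1] = x[i] + (a * x[i] + b * u) * dt + sigma * dW
    return x

rng = np.random.default_rng(0)
policy = lambda x, t: -0.6 * x              # hypothetical linear feedback
# Same policy, same start, two different noise realisations.
path_a = simulate_linear_sde(0.0, 1.0, 0.5, policy, 2.0, 5.0, 500, rng)
path_b = simulate_linear_sde(0.0, 1.0, 0.5, policy, 2.0, 5.0, 500, rng)
print(path_a[-1], path_b[-1])
```

Setting `sigma` to zero recovers the deterministic trajectory, which is a quick sanity check on the drift term.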
The cost becomes an expectation
With deterministic dynamics the cost of a control was a single number — integrate the running cost
along the one trajectory the control produces. But now every run is different: the same
feedback policy, applied twice, traces two different random paths because the Brownian kicks differ.
A single trajectory's cost is itself a random variable, so we cannot minimise it directly.
The fix is to minimise the cost on average. We take the expectation over all the randomness:
J = \mathbb{E}\!\left[\, \phi\big(x(T)\big) + \int_0^T L\big(x(t), u(t)\big)\,dt \,\right].
Here \phi(x(T)) is the terminal cost and
L(x,u) the running cost — the same ingredients as the deterministic
problem, now wrapped in an \mathbb{E}[\cdot] because both the terminal
state and the whole trajectory are random. Minimising J means minimising
the expected cost, accepting that any one realisation may come out lucky or unlucky.
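In practice the expectation can be approximated by averaging the cost over many simulated runs. The sketch below does this for an assumed scalar problem dx = u dt + sigma dW with assumed costs L = x^2 + u^2 and phi = x(T)^2; these choices, the policy and all numerical values are hypothetical and only serve to show how a single run's cost scatters around the estimate of J.

```python
import numpy as np

def sample_costs(policy, x0, T, n_steps, n_paths, sigma, rng):
    """Per-path cost phi(x(T)) + integral of L dt for dx = u dt + sigma dW."""
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    running = np.zeros(n_paths)
    for i in range(n_steps):
        u = policy(x, i * dt)
        running += (x**2 + u**2) * dt          # assumed running cost L = x^2 + u^2
        x = x + u * dt + sigma * rng.normal(0.0, np.sqrt(dt), n_paths)
    return x**2 + running                      # assumed terminal cost phi = x(T)^2

rng = np.random.default_rng(1)
costs = sample_costs(lambda x, t: -0.6 * x, 2.0, 5.0, 500, 10_000, 0.5, rng)
J_hat = costs.mean()                           # Monte Carlo estimate of J
std_err = costs.std(ddof=1) / np.sqrt(costs.size)
print(f"J ~ {J_hat:.3f} +/- {std_err:.3f}; single runs span "
      f"{costs.min():.3f} to {costs.max():.3f}")
```

The spread between the smallest and largest single-run cost is the "lucky or unlucky" variation; only the mean is what J measures.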
The goal. Choose a feedback policy
u = u(x, t) — a rule that maps the current state and time to a control —
minimising J. We want a policy, not a fixed schedule
u(t), precisely because the state is random: the controller must react to
wherever the noise has actually pushed the system, which only a state-dependent rule can do.
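The advantage of a policy over a schedule can be checked numerically. The sketch below assumes the scalar model dx = u dt + sigma dW with running cost x^2 + u^2, and compares a feedback rule u = -k x against the fixed schedule u(t) = -k x0 e^{-kt}, which is what that feedback would apply along the noiseless trajectory. The model, the cost and every numerical value are illustrative assumptions.

```python
import numpy as np

def expected_cost(control, x0, T, n_steps, n_paths, sigma, seed):
    """Monte Carlo mean of the integral of (x^2 + u^2) dt for dx = u dt + sigma dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    cost = np.zeros(n_paths)
    for i in range(n_steps):
        u = control(x, i * dt)
        cost += (x**2 + u**2) * dt
        x = x + u * dt + sigma * rng.normal(0.0, np.sqrt(dt), n_paths)
    return cost.mean()

k, x0 = 0.6, 2.0
feedback = lambda x, t: -k * x                                       # reacts to the realised state
schedule = lambda x, t: -k * x0 * np.exp(-k * t) * np.ones_like(x)   # fixed in advance, ignores x

for sigma in (0.0, 0.5):
    # The same seed gives both controllers the same noise realisations.
    J_fb = expected_cost(feedback, x0, 5.0, 500, 5_000, sigma, seed=2)
    J_ol = expected_cost(schedule, x0, 5.0, 500, 5_000, sigma, seed=2)
    print(f"sigma={sigma}: feedback {J_fb:.3f}, schedule {J_ol:.3f}")
```

Without noise the two controllers cost almost the same, up to discretisation error. With noise the schedule cannot correct the accumulated kicks, so its state variance grows with time and its expected cost ends up higher.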
The new wrinkle: optimise the distribution, not a path
In deterministic control the principle of optimality reasoned about one trajectory and its
cost-to-go. Stochastic control reasons about the whole distribution of trajectories
a policy induces. Two consequences follow, and they are the heart of what makes the stochastic
problem genuinely different.
- You cannot ride a single path. Pontryagin-style methods that perturb one optimal
trajectory have nothing fixed to perturb, because the trajectory is a random object. Dynamic programming,
which already works backward over states rather than paths, survives the move to
randomness, which is why the value-function approach is the one we extend.
- Curvature now costs. When we expand the value function V(x, t) along the noisy state, the
diffusion forces us to keep a second-order term: exactly the Itô correction
\tfrac12\sigma^2 V_{xx}. Noise couples to the
curvature of the value function. Ordinary calculus never sees this term; Itô's lemma insists on it.
That second point is the whole pivot of the stage. The deterministic value-function analysis used
the ordinary chain rule; the stochastic one must use Itô's lemma, and the lone extra term it carries
is the entire difference between deterministic and stochastic optimal control. We assemble it into
the stochastic Hamilton–Jacobi–Bellman equation on the next page.
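As a sketch of where the extra term comes from, take a scalar state and assume V is smooth enough (once differentiable in t, twice in x). Expand V to second order along dx = f\,dt + \sigma\,dW, apply the Itô rule (dW)^2 = dt, and drop the higher-order dt\,dW and dt^2 terms:
dV = V_t\,dt + V_x\,dx + \tfrac12 V_{xx}\,(dx)^2 = \big(V_t + f\,V_x + \tfrac12\sigma^2 V_{xx}\big)\,dt + \sigma V_x\,dW.
With \sigma = 0 this collapses to the ordinary chain rule dV = (V_t + f\,V_x)\,dt. For a concrete case, take V = x^2 under pure diffusion dx = \sigma\,dW: then dV = \sigma^2\,dt + 2\sigma x\,dW, so \mathbb{E}[x(T)^2] = x(0)^2 + \sigma^2 T even though the drift is zero.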
- Dynamics: a controlled Itô SDE dx = f(x,u)\,dt + \sigma(x,u)\,dW, with a steerable drift and an
irreducible Brownian diffusion.
- Cost: an expected total J = \mathbb{E}\big[\phi(x(T)) + \int_0^T L(x,u)\,dt\big], because the
trajectory is random.
- Solution: a feedback policy u(x,t) minimising J over the distribution of trajectories, found by
extending dynamic programming through Itô's lemma.
Seeing the randomness
Take the scalar controlled SDE dx = -k\,x\,dt + \sigma\,dW with the
feedback u = -k x already in place (here k = 0.6),
all paths starting at x(0) = 2. Each curve below is one random
realisation — the drift -kx\,dt pulls every path toward
0, while the diffusion \sigma\,dW jostles it.
The readout is the sample average of the running cost \int_0^T x^2\,dt across the paths, a
Monte Carlo estimate of the expected cost J = \mathbb{E}\big[\int_0^T x^2\,dt\big]. Raise
\sigma and watch the paths fan out and the expected cost climb: that growth is the price of
the noise the control cannot cancel.
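The same experiment can be reproduced offline. The sketch below simulates dx = -k x dt + sigma dW with k = 0.6 and x(0) = 2 by Euler-Maruyama and compares the sample-average cost with the closed form that follows from E[x_t^2] = x0^2 e^{-2kt} + sigma^2 (1 - e^{-2kt}) / (2k). The horizon T = 5, the step and path counts, and the function names are assumptions, since the text does not specify them.

```python
import numpy as np

def mean_running_cost(sigma, k=0.6, x0=2.0, T=5.0, n_steps=500, n_paths=5_000, seed=0):
    """Sample average of the integral of x^2 dt over paths of dx = -k x dt + sigma dW."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0)
    cost = np.zeros(n_paths)
    for _ in range(n_steps):
        cost += x**2 * dt                      # left-endpoint rule for the time integral
        x = x - k * x * dt + sigma * rng.normal(0.0, np.sqrt(dt), n_paths)
    return cost.mean()

def exact_running_cost(sigma, k=0.6, x0=2.0, T=5.0):
    """Closed form of E[integral of x^2 dt], obtained by integrating E[x_t^2] over [0, T]."""
    s = sigma**2 / (2 * k)                     # stationary variance of the process
    return (x0**2 - s) * (1 - np.exp(-2 * k * T)) / (2 * k) + s * T

for sigma in (0.0, 0.5, 1.0):
    print(sigma, round(mean_running_cost(sigma), 3), round(exact_running_cost(sigma), 3))
```

The closed form shows that the noise contribution enters through sigma^2, so doubling sigma quadruples the part of the cost that the control cannot remove.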