Stochastic Optimal Control

So far the uncertainty lived only in our measurements — the Kalman filter cleaned up a noisy sensor, but the system itself evolved by a deterministic \dot{x} = Ax + Bu. Now we push the noise into the dynamics. A gust hits the aircraft; a customer order jolts the inventory; thermal noise rattles the circuit. The state no longer follows a smooth curve — it follows a random one.

The right language for a continuously, randomly buffeted state is a controlled stochastic differential equation, an Itô process whose drift and diffusion we get to steer:

dx = f(x, u)\,dt + \sigma(x, u)\,dW.

The drift f(x,u)\,dt is the controllable, deterministic push — our old dynamics. The diffusion \sigma(x,u)\,dW is the irreducible randomness, a Brownian kick that no control can fully cancel. In the linear case this is just dx = (Ax + Bu)\,dt + \sigma\,dW.
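To make the drift and diffusion terms concrete, here is a minimal sketch of simulating one path of the controlled SDE with the Euler–Maruyama scheme, in which each Brownian increment dW is drawn as a Gaussian with variance dt. The scheme itself, the function name euler_maruyama, and every numerical value (a, b, s, the gain 0.6, the horizon and the step count) are illustrative assumptions, not part of the text above.

```python
import math
import random

def euler_maruyama(f, sigma, policy, x0, T, n_steps, rng):
    """Simulate one path of dx = f(x,u) dt + sigma(x,u) dW on [0, T]."""
    dt = T / n_steps
    x = x0
    path = [x]
    for i in range(n_steps):
        u = policy(x, i * dt)                    # control chosen from the current state
        dW = rng.gauss(0.0, math.sqrt(dt))       # Brownian increment, N(0, dt)
        x = x + f(x, u) * dt + sigma(x, u) * dW  # drift push plus diffusion kick
        path.append(x)
    return path

# Scalar linear case dx = (a x + b u) dt + s dW, with assumed numbers.
a, b, s = 0.0, 1.0, 0.5

def f(x, u):
    return a * x + b * u

def sigma(x, u):
    return s

def policy(x, t):
    return -0.6 * x

# Same policy, two different noise realisations: two different paths.
path_a = euler_maruyama(f, sigma, policy, 2.0, 5.0, 500, random.Random(1))
path_b = euler_maruyama(f, sigma, policy, 2.0, 5.0, 500, random.Random(2))
print(path_a[-1], path_b[-1])
```

Setting sigma to return 0 recovers an ordinary Euler integration of the deterministic dynamics, which is a quick sanity check on the drift term.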

The cost becomes an expectation

With deterministic dynamics the cost of a control was a single number — integrate the running cost along the one trajectory the control produces. But now every run is different: the same feedback policy, applied twice, traces two different random paths because the Brownian kicks differ. A single trajectory's cost is itself a random variable, so we cannot minimise it directly.

The fix is to minimise the cost on average. We take the expectation over all the randomness:

J = \mathbb{E}\!\left[\, \phi\big(x(T)\big) + \int_0^T L\big(x(t), u(t)\big)\,dt \,\right].

Here \phi(x(T)) is the terminal cost and L(x,u) the running cost — the same ingredients as the deterministic problem, now wrapped in an \mathbb{E}[\cdot] because both the terminal state and the whole trajectory are random. Minimising J means minimising the expected cost, accepting that any one realisation may come out lucky or unlucky.
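One way to see why the expectation is needed is to compute the cost path by path and then average. The sketch below does this by Monte Carlo for the scalar system dx = u dt + s dW. The cost functions phi(x) = x^2 and L(x,u) = x^2 + 0.1 u^2, the policy u = -0.6 x, and all numerical values are assumptions chosen only for illustration.

```python
import math
import random
import statistics

def path_cost(policy, x0, T, n_steps, s, rng):
    """Cost of ONE realisation of dx = u dt + s dW: phi(x(T)) + integral of L dt."""
    dt = T / n_steps
    x, running = x0, 0.0
    for i in range(n_steps):
        u = policy(x, i * dt)
        running += (x * x + 0.1 * u * u) * dt    # assumed L(x,u) = x^2 + 0.1 u^2
        x += u * dt + s * rng.gauss(0.0, math.sqrt(dt))
    return x * x + running                       # assumed phi(x) = x^2

def policy(x, t):
    return -0.6 * x

rng = random.Random(0)
costs = [path_cost(policy, 2.0, 5.0, 200, 0.5, rng) for _ in range(2000)]

J_hat = statistics.fmean(costs)                            # sample mean estimates J
std_err = statistics.stdev(costs) / math.sqrt(len(costs))  # Monte Carlo error of the mean
print(f"luckiest {min(costs):.2f}, unluckiest {max(costs):.2f}, "
      f"J estimate {J_hat:.2f} +/- {std_err:.2f}")
```

The spread between the luckiest and unluckiest path is the randomness of a single realisation; the sample mean is the quantity the optimisation targets.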

The goal. Choose a feedback policy u = u(x, t) — a rule that maps the current state and time to a control — minimising J. We want a policy, not a fixed schedule u(t), precisely because the state is random: the controller must react to wherever the noise has actually pushed the system, which only a state-dependent rule can do.
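The difference between a policy and a schedule can be checked numerically. The sketch below compares, on the same noise samples, the feedback rule u = -K x with the fixed schedule u(t) = -K x0 e^{-K t}, which is the control that same rule would produce if there were no noise. The system dx = u dt + S dW, the running cost x^2, and all constants are illustrative assumptions, and neither control is claimed to be optimal.

```python
import math
import random

K, X0, T, N, S = 0.6, 2.0, 5.0, 200, 0.5   # assumed illustrative values
DT = T / N

def mean_cost(control, n_paths, seed):
    """Average of the integral of x^2 dt over simulated paths of dx = u dt + S dW."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x, cost = X0, 0.0
        for i in range(N):
            cost += x * x * DT
            x += control(x, i * DT) * DT + S * rng.gauss(0.0, math.sqrt(DT))
        total += cost
    return total / n_paths

def feedback(x, t):
    return -K * x                          # reacts to where the noise pushed the state

def schedule(x, t):
    return -K * X0 * math.exp(-K * t)      # ignores x: planned for the noise-free path

# The same seed gives both controllers the same Brownian increments.
J_feedback = mean_cost(feedback, 2000, seed=0)
J_schedule = mean_cost(schedule, 2000, seed=0)
print(f"feedback {J_feedback:.2f}   schedule {J_schedule:.2f}")
```

Under the schedule the noise accumulates uncorrected, so the state's variance grows with time, while the feedback rule keeps pulling each perturbed path back toward 0.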

The new wrinkle: optimise the distribution, not a path

In deterministic control the principle of optimality reasoned about one trajectory and its cost-to-go. Stochastic control reasons about the whole distribution of trajectories a policy induces. Two consequences follow, and they are the heart of what makes the stochastic problem genuinely different.

You cannot ride a single path. Pontryagin-style methods that perturb one optimal trajectory have nothing fixed to perturb — the trajectory is a random object. Dynamic programming, which already works backward over states rather than paths, survives the move to randomness, which is why the value-function approach is the one we extend.
Curvature now costs. When we expand the value function V(x, t) along the noisy state, the diffusion forces us to keep a second-order term — exactly the Itô correction \tfrac12\sigma^2 V_{xx}. Noise couples to the curvature of the value function. Ordinary calculus never sees this term; Itô's lemma insists on it.

That second point is the whole pivot of the stage. The deterministic value-function analysis used the ordinary chain rule; the stochastic one must use Itô's lemma, and the lone extra term it carries is the entire difference between deterministic and stochastic optimal control. We assemble it into the stochastic Hamilton–Jacobi–Bellman equation on the next page.
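The two expansions can be set side by side. The lines below are a sketch for a scalar state, assuming V is once continuously differentiable in t and twice in x, and using the heuristic rule (dW)^2 = dt.

```latex
% Deterministic dynamics dx = f dt: ordinary chain rule
dV = \big( V_t + f\,V_x \big)\,dt

% Stochastic dynamics dx = f dt + \sigma dW: Ito's lemma
dV = \big( V_t + f\,V_x + \tfrac12 \sigma^2 V_{xx} \big)\,dt + \sigma\,V_x\,dW

% Conditioning on the current state removes the dW term, since E[dW] = 0
\mathbb{E}\big[\, dV \mid x(t) = x \,\big] = \big( V_t + f\,V_x + \tfrac12 \sigma^2 V_{xx} \big)\,dt
```

The term \tfrac12\sigma^2 V_{xx} appears because the second-order Taylor term \tfrac12 V_{xx}\,(dx)^2 contains \sigma^2 (dW)^2 = \sigma^2\,dt, which is of first order in dt and so cannot be dropped.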

Dynamics: a controlled Itô SDE dx = f(x,u)\,dt + \sigma(x,u)\,dW, with a steerable drift and an irreducible Brownian diffusion.
Cost: an expected total J = \mathbb{E}\big[\phi(x(T)) + \int_0^T L(x,u)\,dt\big], because the trajectory is random.
Solution: a feedback policy u(x,t) minimising J over the distribution of trajectories — found by extending dynamic programming through Itô's lemma.

Seeing the randomness

Take the scalar controlled SDE dx = u\,dt + \sigma\,dW and close the loop with the feedback u = -k x (here k = 0.6), so that dx = -k\,x\,dt + \sigma\,dW, with all paths starting at x(0) = 2. Each curve below is one random realisation: the drift -kx\,dt pulls every path toward 0, while the diffusion \sigma\,dW jostles it. The readout is the running cost \int_0^T x^2\,dt averaged across the simulated paths, a Monte Carlo estimate of the expected cost J = \mathbb{E}\big[\int_0^T x^2\,dt\big]. Raise \sigma and watch the paths fan out and the expected cost climb: that growth is the price of the noise the control cannot cancel.
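The same experiment can be sketched offline. The code below estimates the expected running cost by Monte Carlo for a few noise levels and compares it with the closed form that follows from the second moment of this linear SDE, \mathbb{E}[x(t)^2] = x_0^2 e^{-2kt} + \tfrac{\sigma^2}{2k}\big(1 - e^{-2kt}\big). The horizon T = 5, the step and path counts, the chosen values of \sigma, and the use of Euler–Maruyama are assumptions; only k = 0.6 and x(0) = 2 come from the text.

```python
import math
import random

def expected_cost_mc(k, sigma, x0, T, n_steps, n_paths, seed=0):
    """Monte Carlo estimate of E[integral of x^2 dt] for dx = -k x dt + sigma dW."""
    rng = random.Random(seed)
    dt = T / n_steps
    total = 0.0
    for _ in range(n_paths):
        x, cost = x0, 0.0
        for _ in range(n_steps):
            cost += x * x * dt
            x += -k * x * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
        total += cost
    return total / n_paths

def expected_cost_exact(k, sigma, x0, T):
    """Integral over [0, T] of x0^2 e^{-2kt} + sigma^2/(2k) (1 - e^{-2kt})."""
    decay = (1.0 - math.exp(-2.0 * k * T)) / (2.0 * k)
    return x0 * x0 * decay + sigma * sigma / (2.0 * k) * (T - decay)

results = {}
for sigma in (0.0, 0.5, 1.0):                      # assumed noise levels
    mc = expected_cost_mc(0.6, sigma, 2.0, 5.0, 200, 1000)
    exact = expected_cost_exact(0.6, sigma, 2.0, 5.0)
    results[sigma] = (mc, exact)
    print(f"sigma={sigma:.1f}  Monte Carlo {mc:.3f}  closed form {exact:.3f}")
```

The closed form splits the cost into a part set by the initial condition and a part proportional to \sigma^2, so the noise contribution grows quadratically in \sigma. Small gaps between the two columns come from the time discretisation and the finite number of paths.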