Optimal Control and Reinforcement Learning
We have reached the end of the course, and it is time to step back and see what we have really been
studying. Every stage circled the same object: a value function that measures the
best achievable future, and a backward consistency law, the Bellman equation, that pins it down. That single idea did not stay inside control theory. Under a
different name, with rewards instead of costs and data instead of a model, it became
reinforcement learning (RL) — the engine behind game-playing agents and robot
locomotion. Optimal control and RL are, at heart, the same subject viewed from two sides.
The grand unification
The cleanest statement of the connection is a slogan:
\textbf{Reinforcement learning} \;=\; \textbf{optimal control when you don't know the model.}
Once you accept that, the two vocabularies line up term for term. They are the same equations wearing
different clothes:
- a cost to minimise ↔ a negative reward to maximise;
- the cost-to-go value function V ↔ the RL value (or action-value Q) function;
- the Bellman equation ↔ the RL Bellman backup;
- the Hamilton–Jacobi–Bellman equation ↔ continuous-time RL;
- dynamic programming with a known model ↔ model-based RL;
- value and policy iteration ↔ Q-learning and policy-gradient methods.
Read the optimality condition on either side and you find the same line: the best value now equals the
immediate cost (reward) plus the best value next. Control writes it with a
\min over a known f; RL writes it with a
\max over expectations estimated from experience. The skeleton is identical.
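A short check of why the \min and \max forms coincide, under assumptions made here for illustration only: deterministic dynamics f, no discounting, a stage cost g(x,u), and a reward defined as its negation, r(x,u) = -g(x,u). The symbol V_{\text{RL}} is introduced just for this check. Negating the cost-form Bellman equation gives
-V^{*}(x) \;=\; -\min_{u}\big[\, g(x,u) + V^{*}(f(x,u)) \,\big] \;=\; \max_{u}\big[\, -g(x,u) - V^{*}(f(x,u)) \,\big] \;=\; \max_{u}\big[\, r(x,u) + V_{\text{RL}}(f(x,u)) \,\big], \qquad V_{\text{RL}} := -V^{*}.
So V_{\text{RL}} = -V^{*} satisfies the reward-form equation, and the action that minimises cost-to-go at a state is the same action that maximises reward-to-go.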
The one real difference: model versus data
So where do the fields actually part? On a single question — do you know the dynamics?
Classical optimal control assumes a known model x_{k+1} = f(x_k, u_k) and
solves: it computes the value function and policy directly from
f, by Riccati, by HJB, by dynamic programming. Reinforcement learning is for
when f is unknown, or known but far too complex to optimise against. It
learns the value function and policy from sampled experience — sequences of
state, action, reward, next state — gathered by interacting with the system. The same Bellman
optimality condition is now satisfied approximately, from data:
\underbrace{V^{*}(x) = \min_{u}\big[\, g(x,u) + V^{*}(f(x,u)) \,\big]}_{\text{optimal control: solve the known model}} \qquad\longleftrightarrow\qquad \underbrace{Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')}_{\text{RL: sampled backup from experience}}.
In deep RL the value or policy is represented by a neural network and fitted by gradient descent down a loss
landscape whose minimum is the Bellman-consistent value: the optimal-control objective,
now approximated by a function approximator instead of a grid or a Riccati solve. The backbone has not
changed; only the way we represent and compute the value function has.
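The idea of fitting a value function by gradient descent on a Bellman-consistency loss can be sketched on a toy problem. Everything specific here is an assumption made for illustration: a seven-state corridor with the goal at state 6 and reward -1 per step, a one-parameter-per-state "network" `w` standing in for a real neural network, and a semi-gradient step in which the Bellman target is held fixed while the parameters move. It is a minimal sketch, not how any particular deep RL library does it.

```python
import numpy as np

N_STATES, GOAL = 7, 6      # assumed corridor: states 0..6, goal at 6
ACTIONS = (-1, +1)         # assumed actions: step left or right
LEARNING_RATE = 0.5        # illustrative step size


def next_state(s, a):
    """Assumed deterministic dynamics, clipped to the corridor."""
    return min(max(s + a, 0), N_STATES - 1)


def bellman_targets(w):
    """Targets y(s) = max_a [ r + V_w(s') ] with r = -1; the goal keeps target 0."""
    y = np.zeros(N_STATES)
    for s in range(N_STATES):
        if s != GOAL:
            y[s] = max(-1.0 + w[next_state(s, a)] for a in ACTIONS)
    return y


w = np.zeros(N_STATES)     # V_w(s) = w[s]: the simplest function approximator
for _ in range(200):
    y = bellman_targets(w)  # target treated as a constant for this step
    grad = -(y - w)         # gradient of 0.5 * sum((y - w)**2) with respect to w
    w -= LEARNING_RATE * grad

print(np.round(w, 3))
```

On this corridor the fitted parameters should settle at the Bellman-consistent values -(6 - s), that is, minus the number of steps to the goal. A real deep RL method replaces `w` by network weights and the full sweep over states by sampled minibatches.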
Q-learning is sampled value iteration
To see the unity concretely, take value iteration (the sweep that repeatedly applies the Bellman backup until the value function
stops changing) and remove the model. You can no longer evaluate the backup through
f (the next state, or the expectation over next states when the dynamics are noisy), so you sample it: take an action, observe the actual reward
r and next state s', and nudge your estimate
toward the backed-up target. That is Q-learning:
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big[\, \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{Bellman target (sampled)}} - Q(s,a) \,\Big].
Set the learning rate \alpha to 1 and replace the sampled transition with the model's prediction, and the update is
exactly a Bellman backup of the action-value function: value iteration performed one experienced
transition at a time, model-free. Policy-gradient methods make the dual move,
improving the policy directly, the data-driven cousin of policy iteration. Either way, the optimal
control we spent the course deriving is what RL is climbing toward without a map.
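The Q-learning update above can be sketched in tabular form. The environment is an assumed toy problem (a seven-state corridor, goal at the right end, reward -1 per step), and the names `env_step`, `q_learning`, `ALPHA`, `GAMMA` and `EPSILON`, along with the epsilon-greedy exploration rule, are illustrative choices rather than anything fixed by the course. The learner only ever calls `env_step`; it never reads the dynamics directly.

```python
import random

N_STATES, GOAL = 7, 6               # assumed corridor: states 0..6, goal at 6
ACTIONS = (-1, +1)                  # assumed actions: step left or right
ALPHA, GAMMA, EPSILON = 0.5, 1.0, 0.2   # illustrative hyperparameters


def env_step(s, a):
    """Black-box environment: returns (reward, next state, episode finished)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return -1.0, s_next, s_next == GOAL


def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[s][i] for action ACTIONS[i]
    for _ in range(episodes):
        s = rng.randrange(GOAL)                 # start in a random non-goal state
        done = False
        while not done:
            # Epsilon-greedy choice: explore sometimes, otherwise act greedily.
            if rng.random() < EPSILON:
                i = rng.randrange(len(ACTIONS))
            else:
                i = 0 if Q[s][0] > Q[s][1] else 1
            r, s_next, done = env_step(s, ACTIONS[i])
            # Sampled Bellman target; no bootstrap term once the goal is reached.
            target = r if done else r + GAMMA * max(Q[s_next])
            Q[s][i] += ALPHA * (target - Q[s][i])
            s = s_next
    return Q


Q = q_learning()
greedy_values = [max(q) for q in Q]
print([round(v, 3) for v in greedy_values])
```

With these settings the greedy values max_a Q(s,a) should approach -(6 - s), the same numbers value iteration would compute from the model, and the greedy action in every non-goal state should be a step to the right.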
- RL is optimal control without a known model: it learns the value/policy from sampled experience instead of solving a given f.
- Both rest on the same Bellman optimality backbone; cost ↔ negative reward, V ↔ value/Q, dynamic programming ↔ model-based RL, value/policy iteration ↔ Q-learning/policy-gradient.
- Q-learning is sampled, model-free value iteration; deep RL represents the value with a neural network fitted by gradient descent.
Watching the value converge
Here is the shared engine in miniature: a one-dimensional grid world, states
s = 0, \dots, 6, with the goal at
s = 6 and a cost of one per step (reward -1).
Each iteration applies one Bellman backup
V(s) \leftarrow \max_a[\, r + V(s') \,]. Slide the iteration count and watch
the optimal value propagate outward from the goal: after k sweeps
every state within k steps of the goal knows its true value, and once the wave reaches
the far end the estimate (dots) has converged to the optimal V^{*} (the
dashed line). Optimal control computes this from the model; RL learns the very same values by sampling
the world — the algorithm a robot runs when no map is given.
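The sweep described above can be reproduced in a few lines. This is a minimal sketch under assumptions not spelled out in the text: two actions (step left or right), moves clipped at the ends of the grid, a terminal goal whose value stays at 0, and synchronous sweeps that read the previous iterate. The names `step` and `sweep` are illustrative.

```python
import numpy as np

N_STATES = 7          # states s = 0, ..., 6
GOAL = 6              # terminal goal state, value fixed at 0
STEP_REWARD = -1.0    # cost of one per step
ACTIONS = (-1, +1)    # assumed action set: step left or step right


def step(s, a):
    """Assumed deterministic model: move by a, staying inside the grid."""
    return min(max(s + a, 0), N_STATES - 1)


def sweep(V):
    """One synchronous Bellman backup V(s) <- max_a [ r + V(s') ]."""
    V_new = V.copy()
    for s in range(N_STATES):
        if s == GOAL:
            continue  # the goal is terminal and keeps value 0
        V_new[s] = max(STEP_REWARD + V[step(s, a)] for a in ACTIONS)
    return V_new


V = np.zeros(N_STATES)
history = [V]                 # history[k] is the estimate after k sweeps
for k in range(1, 8):
    V = sweep(V)
    history.append(V)
    print(k, V)
```

Under these assumptions the estimate after k sweeps is -min(k, 6 - s): states within k steps of the goal already hold their true value, the rest still read -k. After six sweeps the values are -6, -5, -4, -3, -2, -1, 0 and a seventh sweep changes nothing.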
It is a remarkable lineage. In the 1950s Lev Pontryagin in Moscow wrote down the maximum principle
and Richard Bellman in California wrote down dynamic programming and the value function — two ways of
characterising the optimal future, born of missile guidance and operations research. Half a century
later the same value function, now learned by deep neural networks from millions of self-played
games, let AlphaGo defeat the world's best Go players, and the same Bellman backup,
learned from simulated falls, teaches legged robots to walk, run and recover. The cart-pole you
stabilised with LQR is the shared benchmark of both fields: control computes its optimal feedback from a model, RL
discovers the same balancing policy from trial and reward. From a Riccati equation to a policy
network, it is one idea — the value of the future — running through all of it. That is where this
course has been leading.