Optimal Control and Reinforcement Learning

We have reached the end of the course, and it is time to step back and see what we have really been studying. Every stage circled the same object: a value function that measures the best achievable future, and a backward consistency law — the Bellman equation — that pins it down. That single idea did not stay inside control theory. Under a different name, with rewards instead of costs and data instead of a model, it became reinforcement learning (RL) — the engine behind game-playing agents and robot locomotion. Optimal control and RL are, at heart, the same subject viewed from two sides.

The grand unification

The cleanest statement of the connection is a slogan:

\textbf{Reinforcement learning} \;=\; \textbf{optimal control when you don't know the model.}

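Once you accept that, the two vocabularies line up term for term: the state x becomes s, the control u becomes the action a, the stage cost g becomes the reward r with its sign flipped, and the \min over controls becomes a \max over actions. They are the same equations wearing different clothes.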

Read the optimality condition on either side and you find the same line: the best value now equals the immediate cost (reward) plus the best value next. Control writes it with a \min over a known f; RL writes it with a \max over expectations estimated from experience. The skeleton is identical.

The one real difference: model versus data

So where do the fields actually part? On a single question — do you know the dynamics?

Classical optimal control assumes a known model x_{k+1} = f(x_k, u_k) and solves: it computes the value function and policy directly from f, by Riccati, by HJB, by dynamic programming. Reinforcement learning is for when f is unknown, or known but far too complex to optimise against. It learns the value function and policy from sampled experience — sequences of state, action, reward, next state — gathered by interacting with the system. The same Bellman optimality condition is now satisfied approximately, from data:

\underbrace{V^{*}(x) = \min_{u}\big[\, g(x,u) + V^{*}(f(x,u)) \,\big]}_{\text{optimal control: solve the known model}} \qquad\longleftrightarrow\qquad \underbrace{Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s', a')}_{\text{RL: sampled backup from experience}}.
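The \min on one side and the \max on the other are a sign convention, not a different equation. As a short worked check, assume the undiscounted case (\gamma = 1), take the reward to be the negative of the stage cost, r(x,u) = -g(x,u), and define \tilde{V} = -V^{*} (the symbol \tilde{V} is introduced here only for this illustration). Then

\tilde{V}(x) \;=\; -\min_{u}\big[\, g(x,u) + V^{*}(f(x,u)) \,\big] \;=\; \max_{u}\big[\, -g(x,u) - V^{*}(f(x,u)) \,\big] \;=\; \max_{u}\big[\, r(x,u) + \tilde{V}(f(x,u)) \,\big].

So minimising accumulated cost and maximising accumulated reward produce the same optimal actions, with value functions that differ only by sign.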

In deep RL the value function or policy is represented by a neural network and fitted by gradient descent on a loss whose minimum is the Bellman-consistent value: the optimal-control objective, now represented by a function approximator instead of a grid or a Riccati solve. The backbone has not changed; only the way we represent and compute the value function has.
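One common way to make such a loss concrete is the squared Bellman residual used in DQN-style methods. The sketch below rests on assumptions not stated in the text: a parametrised action-value function Q_{\theta}, a dataset \mathcal{D} of recorded transitions, a step size \eta, and a copy of the parameters \theta^{-} that is held fixed while \theta is updated.

L(\theta) \;=\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(\, r + \gamma \max_{a'} Q_{\theta^{-}}(s',a') - Q_{\theta}(s,a) \,\big)^{2}\Big], \qquad \theta \;\leftarrow\; \theta - \eta\, \nabla_{\theta} L(\theta).

With deterministic transitions, the loss is zero exactly when Q_{\theta} agrees with its own backed-up target on every recorded transition. If Q_{\theta} is a plain table, a gradient step on a single transition moves the entry Q(s,a) toward the target by a fraction 2\eta of the gap.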

Q-learning is sampled value iteration

To see the unity concretely, take value iteration (the sweep that repeatedly applies the Bellman backup until the value function stops changing) and remove the model. You can no longer evaluate the backup through f, or take the expectation over next states when the dynamics are stochastic, so you sample it: take an action, observe the actual reward r and next state s', and nudge your estimate toward the backed-up target. That is Q-learning:

Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\Big[\, \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{Bellman target (sampled)}} - Q(s,a) \,\Big].

Strip away the learning rate \alpha and the sampling, and the update is exactly a Bellman backup of the action-value function — value iteration performed one experienced transition at a time, model-free. Policy-gradient methods make the dual move, improving the policy directly, the data-driven cousin of policy iteration. Either way, the optimal control we spent the course deriving is what RL is climbing toward without a map.
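The update above can be sketched as tabular Q-learning on a tiny chain world. This is a minimal illustration, not a reference implementation. It assumes the seven-state chain this section uses for value iteration (goal at s = 6, reward -1 per step), two actions (step left or right), no discounting, epsilon-greedy exploration, and random non-goal start states. The names env_step and q_learning and all hyperparameter values are hypothetical choices made for the example.

```python
import random

N_STATES = 7        # states s = 0, ..., 6
GOAL = 6            # episode ends on reaching the goal
ACTIONS = (-1, +1)  # assumed action set: step left or step right
GAMMA = 1.0         # undiscounted, matching the cost-of-one-per-step setting
ALPHA = 0.5         # learning rate (illustrative value)

def env_step(s, a):
    """The 'world'. The learner sees only (reward, next state, done)."""
    s_next = min(max(s + a, 0), N_STATES - 1)  # walls clamp the move
    return -1.0, s_next, s_next == GOAL

def q_learning(n_episodes=500, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        s = rng.randrange(GOAL)  # random non-goal start state, 0..5
        done = False
        while not done:
            # Epsilon-greedy: mostly exploit the current estimate, sometimes explore.
            if rng.random() < epsilon:
                a_idx = rng.randrange(len(ACTIONS))
            else:
                a_idx = max(range(len(ACTIONS)), key=lambda i: Q[s][i])
            r, s_next, done = env_step(s, ACTIONS[a_idx])
            # Sampled Bellman target; no bootstrap from a terminal state.
            target = r if done else r + GAMMA * max(Q[s_next])
            # Nudge the estimate toward the target by a fraction ALPHA.
            Q[s][a_idx] += ALPHA * (target - Q[s][a_idx])
            s = s_next
    return Q

Q = q_learning()
for s in range(N_STATES):
    print(s, round(max(Q[s]), 3))
```

The learner never calls the dynamics to plan ahead. It only observes what env_step returns for the action it actually took, which is the sense in which the backup is sampled rather than computed. Setting ALPHA to 1 turns each update into a plain Bellman backup of Q along the experienced transition.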

Watching the value converge

Here is the shared engine in miniature: a one-dimensional grid world with states s = 0, \dots, 6, the goal at s = 6, and a cost of one per step (reward -1). Starting from V = 0, each iteration applies one Bellman backup V(s) \leftarrow \max_a[\, r + V(s') \,] to every non-goal state. As the iteration count grows, the optimal value propagates outward from the goal: after k sweeps every state within k steps of the goal holds its true value, and once the wave reaches the far end at s = 0, after six sweeps, the estimate has converged to the optimal V^{*}(s) = -(6 - s). Optimal control computes this from the model; RL learns the very same values by sampling the world, which is the algorithm a robot runs when no map is given.
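The sweep described above can be reproduced in a few lines of Python. This is a minimal sketch under assumptions the text leaves open: two actions (step left or right), walls that clamp moves at the ends, a terminal goal whose value stays at 0, synchronous sweeps, and V initialised to zero. The names step_model, bellman_sweep and value_iteration are illustrative.

```python
N_STATES = 7          # states s = 0, ..., 6
GOAL = 6              # terminal goal state, value fixed at 0
STEP_REWARD = -1.0    # cost of one per step
ACTIONS = (-1, +1)    # assumed action set: step left or step right

def step_model(s, a):
    """Known deterministic dynamics s' = f(s, a); walls clamp the move."""
    return min(max(s + a, 0), N_STATES - 1)

def bellman_sweep(V):
    """One synchronous backup V(s) <- max_a [ r + V(s') ] over non-goal states."""
    V_new = list(V)
    for s in range(N_STATES):
        if s == GOAL:
            continue  # the goal is terminal and keeps value 0
        V_new[s] = max(STEP_REWARD + V[step_model(s, a)] for a in ACTIONS)
    return V_new

def value_iteration(n_sweeps):
    """Return the list of value functions V_0, V_1, ..., V_n."""
    V = [0.0] * N_STATES
    history = [list(V)]
    for _ in range(n_sweeps):
        V = bellman_sweep(V)
        history.append(list(V))
    return history

for k, V in enumerate(value_iteration(6)):
    print(k, V)
```

Under these assumptions the value after k sweeps is V_k(s) = -\min(k,\, 6 - s): states within k steps of the goal already hold their final value, and the rest are still waiting for the wave to arrive. A seventh sweep changes nothing, which is the fixed point the Bellman equation describes.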

It is a remarkable lineage. In the 1950s Lev Pontryagin in Moscow wrote down the maximum principle and Richard Bellman in California wrote down dynamic programming and the value function — two ways of characterising the optimal future, born of missile guidance and operations research. Half a century later the same value function, now learned by deep neural networks from millions of self-played games, let AlphaGo defeat the world's best Go players, and the same Bellman backup, learned from simulated falls, teaches legged robots to walk, run and recover. The cart-pole you stabilised with LQR is the shared benchmark of both fields — control computes its optimal feedback from a model, RL discovers the same balancing policy from trial and reward. From a Riccati equation to a policy network, it is one idea — the value of the future — running through all of it. That is where this course has been leading.