Reinforcement Learning

There is a third great branch of machine learning, quite unlike the other two. It isn't supervised (no one hands you the right answers) and it isn't unsupervised (you're not just finding structure). In reinforcement learning (RL) an agent learns by trial and error: it acts in an environment, receives rewards and punishments, and gradually works out how to behave to earn the most reward over time. It is how you learned to ride a bike, and how a program learned to beat the world champion at Go.

The agent–environment loop

Everything in RL happens in one repeating cycle:

The agent observes the current state of the environment.
It chooses an action.
The environment returns a reward (a number) and the next state.

Round and round it goes. The agent's goal is to learn a policy \pi (a rule mapping states to actions) that maximises its total return, the sum of rewards over the long run (usually discounted by a factor \gamma, so that sooner rewards count for more):

G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots, \qquad 0 \le \gamma \le 1.
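To make the formula concrete, here is a minimal sketch that evaluates the discounted return for a finite list of rewards. The function name `discountedReturn` and the reward sequences are illustrative assumptions, not part of any library. Note that when the interaction never ends, \gamma usually has to be strictly below 1 for the sum to stay finite.

```typescript
// Discounted return G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
// `discountedReturn` is a hypothetical helper written for this illustration.
function discountedReturn(rewards: number[], gamma: number): number {
  let g = 0;
  // Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
  for (let t = rewards.length - 1; t >= 0; t--) {
    g = rewards[t] + gamma * g;
  }
  return g;
}

// Made-up reward sequences, chosen so the arithmetic is easy to check by hand.
const steady = discountedReturn([1, 1, 1], 0.5); // 1 + 0.5 + 0.25 = 1.75
const undiscounted = discountedReturn([1, 1, 1], 1); // 1 + 1 + 1 = 3
const delayed = discountedReturn([0, 0, 10], 0.5); // 0 + 0 + 0.25 * 10 = 2.5

console.log(steady, undiscounted, delayed);
```

With \gamma = 0.5, a reward of 10 that arrives two steps late is worth only 2.5 today, which is the sense in which sooner rewards count for more.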

The central tension is exploration versus exploitation: should the agent exploit the best action it knows, or explore an untried one that might be even better? Below, a simple agent faces a few "slot machines" of unknown payout and, by mostly exploiting its current best guess but occasionally exploring, learns which pays best:

    // Three actions with hidden average rewards; action 2 is secretly the best.
    const trueMean = [0.2, 0.5, 0.8];
    const Q = [0, 0, 0]; // the agent's learned value estimate for each action
    const N = [0, 0, 0]; // how many times each action has been tried
    const epsilon = 0.1; // 10% of the time, explore at random

    function argmax(a: number[]): number {
      let best = 0;
      for (let i = 1; i < a.length; i++) if (a[i] > a[best]) best = i;
      return best;
    }

    for (let t = 0; t < 800; t++) {
      const a = Math.random() < epsilon ? Math.floor(Math.random() * 3) : argmax(Q);
      const reward = trueMean[a] + (Math.random() - 0.5) * 0.2; // noisy reward
      N[a] += 1;
      Q[a] += (reward - Q[a]) / N[a]; // running average of the rewards seen
    }

    console.log("learned values:", Q.map((q) => q.toFixed(2)).join(", "));
    console.log("agent's best action:", argmax(Q)); // should discover action 2

The agent never sees the true payouts. It estimates them purely from the rewards its own actions return, and settles on choosing the best one most of the time (it still explores on roughly 10% of steps).

Why RL is hard, and powerful

An agent learns a policy by acting and receiving rewards — no labelled answers.
The goal is to maximise the long-run (discounted) return, not the immediate reward.
It must balance exploration (trying new actions) against exploitation (using the best known one).

The deep difficulty is delayed reward: a move early in a chess game might only pay off twenty moves later, so the agent must solve the credit-assignment problem — figuring out which past actions deserve the credit. Value functions and algorithms like Q-learning exist precisely to propagate future rewards back to the decisions that earned them.
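As a rough illustration of how Q-learning propagates a delayed reward backwards, the sketch below uses a made-up five-cell corridor: the agent starts in cell 0 and is rewarded only on reaching cell 4. The corridor, the constants (`gamma`, `alpha`, `epsilon`, 500 episodes) and the helper names `step` and `greedy` are all assumptions chosen for this example; it is tabular Q-learning in its simplest form, not a tuned implementation.

```typescript
// Hypothetical corridor: states 0..4, start at 0, reward 1 only on reaching state 4.
const GOAL = 4;
const LEFT = 0;
const RIGHT = 1;
const gamma = 0.9; // discount factor
const alpha = 0.5; // learning rate
const epsilon = 0.2; // probability of taking a random action

// Environment dynamics: move one cell; moving left in state 0 stays in state 0.
function step(state: number, action: number): { next: number; reward: number } {
  const next = action === RIGHT ? state + 1 : Math.max(0, state - 1);
  return { next, reward: next === GOAL ? 1 : 0 };
}

// Greedy choice between the two actions, breaking ties at random.
function greedy(q: number[]): number {
  if (q[LEFT] === q[RIGHT]) return Math.random() < 0.5 ? LEFT : RIGHT;
  return q[RIGHT] > q[LEFT] ? RIGHT : LEFT;
}

// Q[s][a]: estimated return for taking action a in state s.
// The goal row is never updated, so it stays at 0.
const Q: number[][] = Array.from({ length: GOAL + 1 }, () => [0, 0]);

for (let episode = 0; episode < 500; episode++) {
  let s = 0;
  while (s !== GOAL) {
    const explore = Math.random() < epsilon;
    const a = explore ? (Math.random() < 0.5 ? LEFT : RIGHT) : greedy(Q[s]);
    const { next, reward } = step(s, a);
    // Q-learning update: move Q[s][a] toward reward + gamma * (best value of next state).
    const target = reward + gamma * Math.max(...Q[next]);
    Q[s][a] += alpha * (target - Q[s][a]);
    s = next;
  }
}

const policy = Q.slice(0, GOAL).map((q) => (q[RIGHT] > q[LEFT] ? "right" : "left"));
console.log("greedy policy:", policy.join(", "));
console.log(
  "value of going right:",
  Q.slice(0, GOAL).map((q) => q[RIGHT].toFixed(2)).join(", "),
);
```

Only the final move is ever rewarded, yet the learned values for moving right should approach 0.73, 0.81, 0.90 and 1.00 from cell 0 to cell 3: each earlier decision inherits a discounted share of the credit.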

RL's breakthroughs are spectacular. A single deep-RL architecture (DQN) learned to play dozens of Atari games straight from the pixels, given nothing but the score as reward. AlphaGo defeated a world champion at Go, and its successor AlphaZero mastered Go, chess and shogi from self-play alone, discovering strategies that challenged long-standing human theory. The same framework now trains robots to walk and helps fine-tune large language models to be helpful. A classic introductory testbed is the cart-pole balancing task.

An agent that only exploits can lock onto a mediocre action forever, never discovering a better one. Some exploration is essential.
RL optimises the reward you actually give it, not the one you meant. A badly shaped reward leads to agents that "cheat" — winning the letter of the game while missing its spirit.
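The first of these cautions can be checked directly in the slot-machine setting. The sketch below reuses the same three hidden payouts and compares a purely exploiting agent (epsilon = 0) with one that explores 10% of the time. `runBandit` is a hypothetical helper introduced for this comparison.

```typescript
// Same hidden payouts as the slot-machine example: action 2 is the best.
const trueMean = [0.2, 0.5, 0.8];

function argmax(a: number[]): number {
  let best = 0;
  for (let i = 1; i < a.length; i++) if (a[i] > a[best]) best = i;
  return best;
}

// Runs an epsilon-greedy agent; returns its value estimates Q and pull counts N.
function runBandit(epsilon: number, steps: number): { Q: number[]; N: number[] } {
  const Q = trueMean.map(() => 0);
  const N = trueMean.map(() => 0);
  for (let t = 0; t < steps; t++) {
    const a =
      Math.random() < epsilon
        ? Math.floor(Math.random() * trueMean.length)
        : argmax(Q);
    const reward = trueMean[a] + (Math.random() - 0.5) * 0.2; // noisy reward
    N[a] += 1;
    Q[a] += (reward - Q[a]) / N[a]; // running average of the rewards seen
  }
  return { Q, N };
}

const greedy = runBandit(0, 800); // never explores
const exploring = runBandit(0.1, 800); // explores 10% of the time

console.log("pure exploitation, pulls per action:", greedy.N.join(", ")); // 800, 0, 0
console.log("epsilon = 0.1, pulls per action:", exploring.N.join(", "));
```

With all estimates starting at zero, the purely greedy agent tries action 0 first, sees a positive reward, and never has a reason to try anything else, so it spends all 800 pulls on the worst machine. The exploring agent's counts vary from run to run, but most of its pulls should end up on action 2.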