The agent–environment loop
Everything in RL happens in one repeating cycle:
- The agent observes the current state of the environment.
- It chooses an action.
- The environment returns a reward (a number) and the next state.
This cycle repeats at every step. The agent's goal is to learn a policy
\pi, a rule mapping states to actions, that maximises its
return: the sum of rewards over the long run (usually discounted, so that
sooner rewards count for more than later ones):
G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots, \qquad 0 \le \gamma \le 1.
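To make the formula concrete, here is a minimal sketch that computes the discounted return for a finite list of rewards. The function name `discountedReturn` and the reward lists are made up for illustration; they are not part of any library.

```typescript
// Discounted return G = r_1 + discount * r_2 + discount^2 * r_3 + ...
// for a finite list of rewards (hypothetical helper, for illustration only).
function discountedReturn(rewards: number[], discount: number): number {
  let g = 0;
  // Work backwards through the rewards: G_t = r_t + discount * G_{t+1}.
  for (let i = rewards.length - 1; i >= 0; i--) {
    g = rewards[i] + discount * g;
  }
  return g;
}

const exampleRewards = [1, 1, 1]; // made-up rewards
console.log(discountedReturn(exampleRewards, 1));   // 3: no discounting
console.log(discountedReturn(exampleRewards, 0.5)); // 1.75 = 1 + 0.5 + 0.25
console.log(discountedReturn([0, 0, 8], 0.5));      // 2: the late reward of 8 is scaled by 0.25
```

With \gamma = 1 every reward counts equally; as \gamma shrinks towards 0, the agent cares mostly about the next reward.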
The central tension is exploration versus exploitation: should the agent
exploit the best action it knows, or explore an untried one that might be even
better? Below, a simple agent faces a few "slot machines" of unknown payout and, by mostly
exploiting its current best guess but occasionally exploring, learns which pays best:
// Three actions with hidden average rewards; action 2 is secretly the best.
const trueMean = [0.2, 0.5, 0.8];
const Q = [0, 0, 0]; // the agent's learned value estimate for each action
const N = [0, 0, 0]; // how many times each action has been tried
const epsilon = 0.1; // 10% of the time, explore at random

function argmax(a: number[]): number {
  let best = 0;
  for (let i = 1; i < a.length; i++) if (a[i] > a[best]) best = i;
  return best;
}

for (let t = 0; t < 800; t++) {
  const a = Math.random() < epsilon ? Math.floor(Math.random() * 3) : argmax(Q);
  const reward = trueMean[a] + (Math.random() - 0.5) * 0.2; // noisy reward
  N[a] += 1;
  Q[a] += (reward - Q[a]) / N[a]; // running average of the rewards seen
}

console.log("learned values:", Q.map((q) => q.toFixed(2)).join(", "));
console.log("agent's best action:", argmax(Q)); // should discover action 2
The agent never sees the true payouts — it discovers them purely from the rewards its own actions
return, and converges on choosing the best one.
Why RL is hard, and powerful
- An agent learns a policy by acting and receiving
rewards — no labelled answers.
- The goal is to maximise the long-run (discounted) return, not the immediate
reward.
- It must balance exploration (trying new actions) against
exploitation (using the best known one).
The deep difficulty is delayed reward: a move early in a chess game might only pay
off twenty moves later, so the agent must solve the credit-assignment problem — figuring
out which past actions deserve the credit. Value functions and algorithms like Q-learning exist
precisely to propagate future rewards back to the decisions that earned them.
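As a rough illustration of how Q-learning passes a delayed reward back to earlier decisions, the sketch below uses a hypothetical five-state corridor in which only the final step is rewarded. The environment, the constants (`corridorGamma`, `corridorAlpha`) and the small seeded random generator are all assumptions chosen to keep the example short and reproducible; the update line is the standard tabular Q-learning rule.

```typescript
// Hypothetical toy environment: a corridor of 5 states; state 4 is the goal.
const N_STATES = 5;
const GOAL = N_STATES - 1;
const LEFT = 0;
const RIGHT = 1;
const corridorGamma = 0.9; // discount factor (assumed value)
const corridorAlpha = 0.5; // learning rate (assumed value)

// Small seeded generator so the run is reproducible (an LCG; illustrative only).
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Deterministic dynamics: reward 1 only on the step that reaches the goal.
function step(s: number, a: number): { next: number; reward: number } {
  const next = a === RIGHT ? s + 1 : Math.max(0, s - 1);
  return { next, reward: next === GOAL ? 1 : 0 };
}

function trainCorridor(episodes: number, seed: number): number[][] {
  const rand = makeRng(seed);
  const q: number[][] = Array.from({ length: N_STATES }, () => [0, 0]);
  for (let ep = 0; ep < episodes; ep++) {
    let s = 0;
    // Cap the episode length as a safety net for the random walk.
    for (let t = 0; t < 1000 && s !== GOAL; t++) {
      // Act at random; Q-learning is off-policy, so it still learns greedy values.
      const a = rand() < 0.5 ? LEFT : RIGHT;
      const { next, reward } = step(s, a);
      const bestNext = next === GOAL ? 0 : Math.max(...q[next]);
      // Tabular Q-learning update: move Q(s, a) towards reward + gamma * max Q(next, .).
      q[s][a] += corridorAlpha * (reward + corridorGamma * bestNext - q[s][a]);
      s = next;
    }
  }
  return q;
}

const corridorQ = trainCorridor(500, 42);
// One row per state, printed as "value of LEFT / value of RIGHT".
console.log(corridorQ.map((row) => row.map((v) => v.toFixed(3)).join(" / ")).join("\n"));
```

In this setup the value of moving right should approach 1 in the state next to the goal and shrink by a factor of 0.9 for each step further away (roughly 0.9, 0.81, 0.729), which is the single final reward being credited to the earlier moves that led to it.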
RL's breakthroughs are spectacular. A single deep-RL agent learned to play dozens of Atari games
straight from the pixels, given nothing but the score as reward. AlphaGo's successors, AlphaGo Zero
and AlphaZero, mastered Go, chess and shogi from self-play alone, discovering strategies that
overturned centuries of human theory. The same framework now trains robots to walk and helps
fine-tune large language models to be helpful. A classic testbed for these methods is the
cart-pole balancing task.
- An agent that only exploits can lock onto a mediocre action forever, never
discovering a better one. Some exploration is essential.
- RL optimises the reward you actually give it, not the one you meant. A badly
shaped reward leads to agents that "cheat", winning the letter of the game while missing its
spirit.
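The exploit-only pitfall can be seen by rerunning the slot-machine setup with exploration switched off. The sketch below is a self-contained variant under stated assumptions: the helper names (`runBandit`, `seededRandom`, `indexOfMax`) are hypothetical, and a small seeded generator replaces `Math.random` so that both runs are reproducible.

```typescript
// Same hidden payouts as the slot-machine example in this section.
const payouts = [0.2, 0.5, 0.8];

// Small seeded generator so both runs are reproducible (an LCG; illustrative only).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

function indexOfMax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) if (values[i] > values[best]) best = i;
  return best;
}

// Run an epsilon-greedy agent and report how often each action was tried.
function runBandit(exploreRate: number, steps: number, seed: number) {
  const rand = seededRandom(seed);
  const estimates = payouts.map(() => 0);
  const pulls = payouts.map(() => 0);
  for (let t = 0; t < steps; t++) {
    const a = rand() < exploreRate
      ? Math.floor(rand() * payouts.length) // explore: pick any action
      : indexOfMax(estimates);              // exploit: pick the current best guess
    const reward = payouts[a] + (rand() - 0.5) * 0.2; // noisy reward
    pulls[a] += 1;
    estimates[a] += (reward - estimates[a]) / pulls[a]; // running average
  }
  return { pulls, best: indexOfMax(estimates) };
}

const greedy = runBandit(0, 800, 1);     // never explores
const explorer = runBandit(0.1, 800, 1); // explores 10% of the time
console.log("greedy pulls:  ", greedy.pulls, "-> picks action", greedy.best);
console.log("explorer pulls:", explorer.pulls, "-> picks action", explorer.best);
```

With exploration off, the agent tries action 0 first, sees a positive reward, and never has a reason to try anything else, so it settles on the worst machine. The exploring agent samples every machine and should end up preferring action 2.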