Instruction Tuning

A freshly pretrained language model is a magnificent autocomplete and nothing more: it has read the internet and learned to continue text, so when you type a question it is just as likely to invent five more questions as to answer the one you asked. It knows an enormous amount; it has no idea that you want it to help. Instruction tuning — also called supervised fine-tuning (SFT) — is the short, cheap step that turns that raw next-token predictor into an assistant that follows instructions.

The trick is almost embarrassingly simple: keep training the model with the same next-token objective it was pretrained with, but on a small, curated dataset of demonstrations — examples of an instruction paired with its ideal response. No new loss, no new architecture. Just better data, and one crucial bookkeeping choice about which tokens count.

The objective, line by line

Each training example is a pair: a prompt x (the instruction, perhaps with a system message and prior turns) and a target response y = (y_1, \dots, y_T) written by a skilled human (or a trusted model). We tokenise the concatenation and predict it left to right — exactly like pretraining.

Step 1 — start from the causal-LM loss. The pretraining objective is the negative log-likelihood of each token given everything before it, summed over the sequence (implementations usually average it over tokens, which only rescales the loss). Equivalently, it is the cross-entropy of the model's next-token distribution against the actual next token:

\mathcal{L}_{\text{LM}} = -\sum_{t} \log \pi_\theta\!\left(w_t \mid w_{<t}\right),

where w is the whole token sequence and \pi_\theta is the model.
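As a rough illustration of the causal-LM loss, here is a minimal pure-Python sketch. The names `log_softmax`, `causal_lm_loss`, `step_logits` and `targets` are hypothetical, and the logits are made-up numbers standing in for a real model's outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]

def causal_lm_loss(step_logits, targets):
    """Sum of -log pi(w_t | w_<t) over all positions.

    step_logits[i] holds the scores the model gave every vocabulary item
    after reading the tokens before position i; targets[i] is the token
    that actually came next.
    """
    assert len(step_logits) == len(targets)
    return -sum(log_softmax(row)[tok] for row, tok in zip(step_logits, targets))

# Toy 3-token vocabulary with invented logits (assumption, not real model output).
step_logits = [[0.0, 0.0, 0.0],   # uniform guess: contributes log 3
               [2.0, 0.0, 0.0]]   # confident and correct: contributes less
targets = [1, 0]
loss = causal_lm_loss(step_logits, targets)
print(round(loss, 3))  # about 1.338
```

The uniform first step costs log 3, while the confident, correct second step costs much less. Cross-entropy rewards probability mass placed on the token that actually occurred.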

Step 2 — concatenate prompt and response. Lay the example out as one sequence, the prompt tokens followed by the response tokens:

w = (\underbrace{x_1, \dots, x_m}_{\text{prompt } x},\ \underbrace{y_1, \dots, y_T}_{\text{response } y}).
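The concatenation can be sketched in a few lines of Python. This is only an illustration: `build_sequence` is a hypothetical helper and the token ids are invented. It also shows the one-position shift that next-token prediction implies, where the model reads `inputs[i]` and is scored on `targets[i]`.

```python
def build_sequence(prompt_ids, response_ids):
    """Concatenate prompt and response, then shift by one for next-token prediction."""
    w = list(prompt_ids) + list(response_ids)
    inputs = w[:-1]    # what the model has just read at each step
    targets = w[1:]    # the token it must predict at that step
    return w, inputs, targets

prompt_ids = [11, 12, 13]   # hypothetical token ids for x (m = 3)
response_ids = [21, 22]     # hypothetical token ids for y (T = 2)
w, inputs, targets = build_sequence(prompt_ids, response_ids)
print(w)        # [11, 12, 13, 21, 22]
print(inputs)   # [11, 12, 13, 21]
print(targets)  # [12, 13, 21, 22]
```

Because of the shift, the first response token is predicted from the last prompt token, which matters when deciding which positions belong to the response.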

Step 3 — mask the prompt; supervise only the response. Here is the one idea that distinguishes SFT from plain pretraining. We do not want to reward the model for predicting the user's instruction — only for producing the answer. So we apply a loss mask that zeroes out the prompt positions and keeps only the response positions:

\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right).
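A minimal sketch of the loss mask, assuming we already have the log-probability the model assigned to each target token. The helpers `response_mask` and `sft_loss` and the probabilities are hypothetical. With m prompt tokens and T response tokens, the shifted target sequence has m - 1 prompt targets followed by T response targets.

```python
import math

def response_mask(m, T):
    """1 where the target token belongs to the response, 0 where it belongs to the prompt.

    Targets are the full sequence shifted left by one, so there are
    m - 1 prompt targets followed by T response targets.
    """
    return [0] * (m - 1) + [1] * T

def sft_loss(token_logprobs, mask):
    """Negative sum of log-probabilities at the unmasked (response) positions."""
    assert len(token_logprobs) == len(mask)
    return -sum(lp * keep for lp, keep in zip(token_logprobs, mask))

# Invented values of log pi(w_t | w_<t) for m = 3 prompt and T = 2 response tokens.
token_logprobs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.25)]
mask = response_mask(m=3, T=2)
loss = sft_loss(token_logprobs, mask)
print(mask)  # [0, 0, 1, 1]
# loss equals -(log 0.5 + log 0.25) = log 8; the prompt terms are ignored
```

Frameworks usually express the same idea by setting the prompt positions' labels to a sentinel value that the loss function skips. The arithmetic is the same.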

Step 4 — average over a dataset of demonstrations. Take the expectation of that response-only loss over the pairs (x, y) in the demonstration set \mathcal{D} (in practice, a mini-batch average) and minimise it by gradient descent (often with parameter-efficient adapters rather than touching all the weights):

\theta^\star = \arg\min_\theta\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\!\left[\,-\sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)\right].
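To make the optimisation concrete, here is a toy sketch under strong assumptions. The "model" is a bigram table of logits that conditions only on the previous token, the dataset is two invented token-id pairs, and training is plain full-batch gradient descent with no adapters. It is not how a real transformer is trained, but the loss is the same response-only cross-entropy, and its gradient with respect to a row of logits is softmax minus one-hot.

```python
import math

V = 4  # toy vocabulary size (assumption)

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def sft_loss_and_grad(table, dataset):
    """Mean response-only NLL per example, and its gradient w.r.t. the logits table."""
    grad = [[0.0] * V for _ in range(V)]
    total = 0.0
    for prompt, response in dataset:
        w = prompt + response
        for t in range(len(prompt), len(w)):   # response positions only
            prev, tok = w[t - 1], w[t]
            probs = softmax(table[prev])
            total -= math.log(probs[tok])
            for k in range(V):
                grad[prev][k] += probs[k] - (1.0 if k == tok else 0.0)
    n = len(dataset)
    return total / n, [[g / n for g in row] for row in grad]

dataset = [([0, 1], [2, 3]), ([1, 0], [3, 2])]   # hypothetical (x, y) pairs
table = [[0.0] * V for _ in range(V)]            # theta: one logit row per previous token
learning_rate = 0.5
losses = []
for step in range(200):
    loss, grad = sft_loss_and_grad(table, dataset)
    losses.append(loss)
    table = [[v - learning_rate * g for v, g in zip(row, g_row)]
             for row, g_row in zip(table, grad)]

print(losses[0], losses[-1])  # starts at 2 * log(4) and falls as training proceeds
```

The prompt-internal transitions (0 to 1 in the first example, 1 to 0 in the second) never contribute a gradient. Only the transitions that produce response tokens shape the table, which is the mask from Step 3 at work.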

Step 5 — read what it learned. Nothing in this loss injects new facts; it only raises the probability of responding the way the demonstrations do. So the model learns format and behaviour — answer directly, follow the instruction, adopt the assistant's tone — drawing on knowledge it already had. The instruction style is taught; the world knowledge was already there.

Instruction tuning continues language-model training on curated demonstrations:

Data. A dataset of (\text{instruction } x,\ \text{ideal response } y) pairs.
Objective. The ordinary causal-LM cross-entropy -\sum_{t}\log \pi_\theta(y_t \mid x, y_{<t}), computed only on the response tokens (the prompt is masked out of the loss).
What it teaches. Format and instruction-following — how to behave like an assistant — not new facts. The knowledge already lives in the pretrained weights.

See the difference

Same prompt, two models. The base model treats your instruction as text to continue and rambles on, inventing more instructions. The instruction-tuned model recognises the instruction and answers it. Only the behaviour changed; the underlying knowledge of primes was present all along.

Because the loss is just cross-entropy, the leverage of instruction tuning lies almost entirely in the demonstrations. The model copies whatever it is shown, so the quality, diversity, and honesty of the examples become the model's manners. A famous line of results (the “less is more” finding) showed that a few thousand carefully written, varied demonstrations can beat hundreds of thousands of noisy ones: the model already knows the facts from pretraining, so SFT mostly needs to elicit them cleanly across a broad spread of tasks. Garbage demonstrations teach garbage format — sloppy answers, refusals to easy questions, a single rigid template — no matter how large the pile. In SFT, curating the dataset is the engineering.
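As a very small illustration of what curation can mean in code, the sketch below drops demonstrations with empty responses and exact repeats of an instruction. The function `curate`, its threshold and the sample pairs are all hypothetical. Real pipelines rely on far richer quality and diversity checks, usually with human review.

```python
def curate(pairs, min_response_chars=1):
    """Keep the first demonstration per normalised instruction; drop empty responses."""
    seen = set()
    kept = []
    for instruction, response in pairs:
        key = " ".join(instruction.lower().split())   # normalise case and whitespace
        if len(response.strip()) < min_response_chars:
            continue   # nothing useful to imitate
        if key in seen:
            continue   # duplicate instruction adds no diversity
        seen.add(key)
        kept.append((instruction, response))
    return kept

raw = [
    ("List three primes.", "2, 3 and 5."),
    ("list  three primes.", "2, 3, 5"),        # duplicate after normalisation
    ("Define a prime number.", "   "),          # empty response
]
clean = curate(raw)
print(clean)  # [('List three primes.', '2, 3 and 5.')]
```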

Where this sits

SFT gets you an assistant that follows instructions, but it can only imitate the demonstrations it saw: it has no way to learn that one answer is better than another that is merely acceptable. Capturing human preferences needs a different signal, which is the subject of reinforcement learning from human feedback.