A freshly
The trick is almost embarrassingly simple: keep training the model with the same next-token objective it was pretrained with, but on a small, curated dataset of demonstrations — examples of an instruction paired with its ideal response. No new loss, no new architecture. Just better data, and one crucial bookkeeping choice about which tokens count.
Each training example is a pair: a prompt $x$ and its ideal response $y$.
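Step 1 — start from the causal-LM loss. The pretraining objective is the average negative log-likelihood of each token given everything before it, i.e. the cross-entropy of the model's next-token predictions. For a sequence of tokens $s_1, \dots, s_T$:

$$\mathcal{L}_{\text{LM}}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(s_t \mid s_{<t})$$

where $p_\theta$ is the model with parameters $\theta$ and $s_{<t}$ denotes the tokens before position $t$.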
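As a concrete check on the causal-LM loss, here is a minimal sketch in plain Python. It assumes we already have, for each position, the probability the model assigned to the token that actually appeared; the function name `causal_lm_loss` and the numbers are hypothetical, chosen only for illustration.

```python
import math

def causal_lm_loss(token_probs):
    """Average negative log-likelihood over one sequence.

    token_probs[t] is the probability the model gave to the token that
    actually occurred at position t, given all earlier tokens.
    """
    if not token_probs:
        raise ValueError("need at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a four-token sequence.
probs = [0.5, 0.25, 0.5, 0.125]
loss = causal_lm_loss(probs)
print(f"average NLL: {loss:.4f} nats")
```

Confident predictions (probabilities near 1) contribute almost nothing to the loss, while the 0.125 token dominates it. Training lowers the loss by raising the probability of the tokens that actually occurred.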
Step 2 — concatenate prompt and response. Lay the example out as one sequence, the prompt tokens $x_1, \dots, x_m$ followed by the response tokens $y_1, \dots, y_n$:

$$(x_1, \dots, x_m, \; y_1, \dots, y_n)$$
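The concatenation can be sketched in a few lines. The sketch also shows the one-position shift that next-token prediction implies, a detail that is easy to get wrong when locating the response later. The token IDs and the helper name `build_sequence` are hypothetical, and real pipelines usually also insert chat-template or end-of-sequence tokens, which are omitted here.

```python
def build_sequence(prompt_ids, response_ids):
    """Concatenate prompt and response and derive next-token inputs/targets.

    Returns (inputs, targets, response_start), where targets[i] is the token
    the model must predict after reading inputs[: i + 1], and response_start
    is the first index in `targets` that holds a response token.
    """
    if not prompt_ids or not response_ids:
        raise ValueError("prompt and response must both be non-empty")
    sequence = list(prompt_ids) + list(response_ids)
    inputs = sequence[:-1]   # everything except the last token
    targets = sequence[1:]   # the same sequence shifted left by one
    # The first response token is predicted from the last prompt token,
    # so it sits at index len(prompt_ids) - 1 in `targets`.
    response_start = len(prompt_ids) - 1
    return inputs, targets, response_start

# Hypothetical token IDs: three prompt tokens, two response tokens.
prompt_ids = [11, 12, 13]
response_ids = [21, 22]
inputs, targets, response_start = build_sequence(prompt_ids, response_ids)
print(inputs, targets, response_start)
```

Everything in `targets` from `response_start` onward is a response token; everything before it belongs to the prompt.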
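Step 3 — mask the prompt; supervise only the response. Here is the one idea that distinguishes SFT from plain pretraining. We do not want to reward the model for predicting the user's instruction, only for producing the answer. So we apply a loss mask that zeroes out the prompt positions and keeps only the response positions:

$$\ell_\theta(x, y) = -\frac{1}{n} \sum_{t=1}^{n} \log p_\theta(y_t \mid x, y_{<t})$$

where $x$ is the prompt, $y_1, \dots, y_n$ are the response tokens, and $y_{<t}$ denotes the response tokens before position $t$. The prompt still appears in the conditioning, so the model reads it; it simply contributes no terms to the sum.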
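A minimal sketch of the loss mask, in plain Python rather than any particular training framework. It assumes one probability per target position (the probability the model gave to the correct next token) and a 0/1 mask marking response positions; `response_only_loss` and all numbers are hypothetical.

```python
import math

def response_only_loss(target_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is 1."""
    if len(target_probs) != len(loss_mask):
        raise ValueError("target_probs and loss_mask must have equal length")
    kept = [p for p, m in zip(target_probs, loss_mask) if m]
    if not kept:
        raise ValueError("mask removes every position")
    return -sum(math.log(p) for p in kept) / len(kept)

# Two prompt positions (masked out) followed by two response positions.
target_probs = [0.01, 0.02, 0.5, 0.25]
loss_mask = [0, 0, 1, 1]
loss = response_only_loss(target_probs, loss_mask)

# Changing how well the prompt is predicted leaves the loss untouched.
same_loss = response_only_loss([0.9, 0.9, 0.5, 0.25], loss_mask)
print(f"{loss:.4f} {same_loss:.4f}")
```

In PyTorch-based training code the same effect is commonly obtained by setting the labels at prompt positions to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so those positions are skipped when the loss is computed.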
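Step 4 — average over a dataset of demonstrations. Sum that response-only loss over every pair $(x, y)$ in the demonstration dataset $\mathcal{D}$ and divide by the number of pairs:

$$\mathcal{L}_{\text{SFT}}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \ell_\theta(x, y)$$

where $\ell_\theta(x, y)$ is the response-only loss of Step 3 for a single pair.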
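The dataset-level average can be sketched as follows. Each entry of the hypothetical `dataset` holds the probabilities the model assigned to the response tokens of one demonstration, with prompt positions already masked out. The sketch also shows a second, token-level average for comparison; which of the two a given training setup uses is an assumption that should be checked against its loss reduction settings.

```python
import math

def example_loss(response_probs):
    """Response-only average negative log-likelihood for one demonstration."""
    return -sum(math.log(p) for p in response_probs) / len(response_probs)

def per_example_objective(dataset):
    """Mean of per-example losses: every demonstration counts equally."""
    return sum(example_loss(r) for r in dataset) / len(dataset)

def per_token_objective(dataset):
    """Mean over all response tokens: longer responses carry more weight."""
    total = -sum(math.log(p) for r in dataset for p in r)
    count = sum(len(r) for r in dataset)
    return total / count

# Hypothetical response-token probabilities for two demonstrations.
dataset = [
    [0.5, 0.5],                # short response, well predicted
    [0.25, 0.25, 0.25, 0.25],  # longer response, less well predicted
]
per_example = per_example_objective(dataset)
per_token = per_token_objective(dataset)
print(f"{per_example:.4f} {per_token:.4f}")
```

The two numbers differ because the longer response contributes four of the six tokens. Per-example averaging matches the "sum over every pair" description, while per-token averaging lets long demonstrations dominate.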
Step 5 — read what it learned. Nothing in this loss injects new facts; it only raises the probability of responding the way the demonstrations do. So the model learns format and behaviour — answer directly, follow the instruction, adopt the assistant's tone — drawing on knowledge it already had. The instruction style is taught; the world knowledge was already there.
Same prompt, two models. The base model treats your instruction as text to continue, and rambles on inventing more instructions. The instruction-tuned model recognises the instruction and answers it. Flip the switch — only the behaviour changed; the underlying knowledge of primes was present all along.
Because the loss is just cross-entropy, the leverage of instruction tuning lies almost entirely in the demonstrations. The model copies whatever it is shown, so the quality, diversity, and honesty of the examples become the model's manners. A famous line of results (the “less is more” finding) showed that a few thousand carefully written, varied demonstrations can beat hundreds of thousands of noisy ones: the model already knows the facts from pretraining, so SFT mostly needs to elicit them cleanly across a broad spread of tasks. Garbage demonstrations teach garbage format — sloppy answers, refusals to easy questions, a single rigid template — no matter how large the pile. In SFT, curating the dataset is the engineering.
SFT gets you an assistant that follows instructions, but it can only imitate the demonstrations it saw; it has no way to learn that one good answer is better than another merely-acceptable one. Capturing human preferences needs a different signal, which is the subject of