Once
Step 1 — zero-shot vs few-shot. The baseline lever is how many examples you
supply. Zero-shot states the task and asks; few-shot prepends
Few-shot pins down the desired format and disambiguates the task — often the cheapest accuracy you can buy.
Step 2 — chain-of-thought: ask it to show its work. For a multi-step problem,
append a simple instruction — "Let's think step by step" — or demonstrate a worked
solution that spells out the reasoning. Instead of emitting the answer directly, the model now
generates intermediate steps
Step 3 — why writing the steps helps: each step conditions the next. The model
is autoregressive — every token it emits is fed back in as context for the next. So a written
reasoning step
The intermediate tokens are scratch memory the model can read back. A direct answer has to do all the work in one forward pass with nowhere to store partial results; a chain-of-thought answer spreads the computation across many tokens, each building on the last. That is why "show your working" dramatically improves multi-step arithmetic and reasoning — and why it does little for one-step lookups, which need no scratch space.
Step 4 — system prompts and role conditioning. A final lever is to prepend a persistent instruction that frames who the model is and how it should respond — a system prompt ("You are a careful maths tutor; explain each step"). Because every later token is conditioned on it, that framing steers tone, format, and behaviour across the whole exchange — role-conditioning the same weights without touching them.
The flip side of "the prompt is the program" is prompt sensitivity: trivial rewordings — a relocated example, a changed instruction, even the order of the few-shot demonstrations — can swing accuracy noticeably. The model has no stable API; you are programming in natural language, a famously underspecified one. This is why prompt engineering is empirical, and why robust prompts are tested, not just written.
And chain-of-thought is more than a trick — it buys two concrete things. First,
more compute per answer: a fixed-depth transformer does a bounded amount of
work per token, so a one-token answer caps the computation, whereas emitting
Flip between a plain prompt and a chain-of-thought prompt for the same word problem. The plain prompt asks for the answer directly and the model blurts a wrong one-step guess. The chain-of-thought prompt adds "let's think step by step," so the model writes out each intermediate result — and because each line is fed back as context for the next, it lands on the right total. Same model, same question; only the phrasing changed.