So far an RNN maps a sequence to a label, or to one output per input. But translation, summarisation, and question answering are sequence-to-sequence tasks: the input and the output are both sequences, and of different lengths — “the cat sat” (3 words) becomes “le chat s'est assis” (4). We can't emit one output per input when the counts don't match. The fix is to split the job between two networks: an encoder that reads the whole input into a summary, and a decoder that writes the output from that summary, one token at a time.
Step 1 — run an RNN over the input. The encoder is a
Step 2 — take the final state as the context. After the last input token, the
encoder's final hidden state is declared the context vector. Everything the decoder will ever know about the input must pass through this one vector.
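The two encoder steps can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a plain tanh RNN cell, random untrained weights, and made-up sizes, and the names `encode`, `E`, `W_xh`, and `W_hh` are hypothetical. It uses plain Python lists so it runs without dependencies.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def encode(token_ids, E, W_xh, W_hh):
    """Run a tanh RNN over the input and return its final hidden state."""
    h = [0.0] * len(W_hh)                      # initial hidden state
    for tok in token_ids:
        x = E[tok]                             # embedding of the current token
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]        # updated hidden state
    return h                                   # final state = context vector

rng = random.Random(0)
VOCAB, EMBED, HIDDEN = 10, 4, 8                # illustrative sizes (assumed)
E = rand_matrix(VOCAB, EMBED, rng)
W_xh = rand_matrix(HIDDEN, EMBED, rng)
W_hh = rand_matrix(HIDDEN, HIDDEN, rng)

context = encode([3, 1, 4], E, W_xh, W_hh)     # three arbitrary token ids
print(len(context))  # 8, the hidden size
```

Only the last value of `h` leaves the function. The intermediate states are discarded, which is exactly why the decoder sees nothing but the context.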
Step 1 — start the decoder from the context. A second RNN, the decoder, takes
Step 2 — generate one token, then feed it back. The decoder is autoregressive: at each step it updates its state from the previous state and the previous output token, and emits the next one:
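One way to make this loop concrete is the sketch below. It rests on several assumptions that the text does not fix: the decoder's initial state is set equal to the context, the cell is a plain tanh RNN, the next token is chosen greedily (argmax) rather than sampled, and generation stops at a hypothetical end-of-sequence id. The weights are random, so the emitted tokens are meaningless. Only the control flow matters.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def decode_greedy(context, E, W_xs, W_ss, W_out, bos_id, eos_id, max_len=10):
    s = list(context)          # assumption: decoder state starts at the context
    prev = bos_id              # assumption: a start-of-sequence token id
    output = []
    for _ in range(max_len):
        # New state from the previous state and the previous output token.
        pre = [a + b for a, b in zip(matvec(W_xs, E[prev]), matvec(W_ss, s))]
        s = [math.tanh(p) for p in pre]
        probs = softmax(matvec(W_out, s))      # distribution over the vocabulary
        prev = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        if prev == eos_id:
            break
        output.append(prev)    # the chosen token is fed back on the next step
    return output

rng = random.Random(0)
VOCAB, EMBED, HIDDEN = 10, 4, 8                # illustrative sizes (assumed)
BOS, EOS = 0, 1                                # hypothetical special token ids
E = rand_matrix(VOCAB, EMBED, rng)
W_xs = rand_matrix(HIDDEN, EMBED, rng)
W_ss = rand_matrix(HIDDEN, HIDDEN, rng)
W_out = rand_matrix(VOCAB, HIDDEN, rng)
context = [rng.uniform(-1, 1) for _ in range(HIDDEN)]  # stand-in for an encoder output

tokens = decode_greedy(context, E, W_xs, W_ss, W_out, BOS, EOS)
print(tokens)
```

The number of loop iterations is decided by the decoder itself (via the stop token or `max_len`), not by the input length, which is how the output can be longer or shorter than the input.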
The
Step 3 — read the probability it defines. Each token is conditioned on the context and everything generated so far, so the model factorises the output sequence probability by the chain rule:
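Written out, the factorisation takes the following form. The symbols are assumptions introduced here for illustration: $x_1, \dots, x_n$ are the input tokens, $y_1, \dots, y_T$ the output tokens, and $c$ the context vector produced by the encoder.

```latex
P(y_1, \dots, y_T \mid x_1, \dots, x_n)
  \;=\; \prod_{t=1}^{T} P\bigl(y_t \mid y_1, \dots, y_{t-1},\, c\bigr)
\qquad\Longleftrightarrow\qquad
\log P(y_1, \dots, y_T \mid x_1, \dots, x_n)
  \;=\; \sum_{t=1}^{T} \log P\bigl(y_t \mid y_1, \dots, y_{t-1},\, c\bigr)
```

The input appears on the right-hand side only through $c$. The log form is the one typically used in practice, since a sum of per-token log-probabilities is easier to compute and optimise than a product of small numbers.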
That is the encoder–decoder model in one line: condition every output token on a fixed summary
Step 1 — notice what must fit. Whether the input is three words or eighty, the
entire meaning is forced through the single fixed-size vector.
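A quick check makes the point tangible. The sketch below assumes the same kind of toy tanh RNN encoder with random weights and illustrative sizes (all names are hypothetical) and compares the context produced for a 3-token input with the one produced for an 80-token input.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def encode(token_ids, E, W_xh, W_hh):
    """Toy tanh RNN encoder: returns only the final hidden state."""
    h = [0.0] * len(W_hh)
    for tok in token_ids:
        pre = [a + b for a, b in zip(matvec(W_xh, E[tok]), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]
    return h

rng = random.Random(0)
VOCAB, EMBED, HIDDEN = 10, 4, 8                # illustrative sizes (assumed)
E = rand_matrix(VOCAB, EMBED, rng)
W_xh = rand_matrix(HIDDEN, EMBED, rng)
W_hh = rand_matrix(HIDDEN, HIDDEN, rng)

short_input = [1, 2, 3]                                   # three tokens
long_input = [rng.randrange(VOCAB) for _ in range(80)]    # eighty tokens

short_ctx = encode(short_input, E, W_xh, W_hh)
long_ctx = encode(long_input, E, W_xh, W_hh)
print(len(short_ctx), len(long_ctx))  # 8 8
```

The input grew by a factor of more than 25, yet the decoder still receives the same eight numbers' worth of capacity.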
Step 2 — conclude the strain. This is the
Step through it. First the encoder reads
At generation time the decoder feeds its own output back in. But early in training those outputs are garbage, and feeding garbage back in would derail every subsequent step, so the model could never get a foothold. The remedy is teacher forcing: during training, feed the decoder the true previous token from the reference output, not its own guess. So at every step the decoder's input is the reference token from the step before, whatever the model itself predicted there.
It trains fast and stably, but it creates a mismatch — exposure bias: at test time the decoder sees its own (possibly wrong) tokens, a distribution it never trained on, so one early mistake can cascade. Mitigations like “scheduled sampling” occasionally feed the model's own predictions during training to bridge the gap. The tension between fast teacher-forced training and faithful free-running generation is a recurring theme in sequence modelling.
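The difference between the two regimes comes down to which token is fed in at each step, as the sketch below shows. Everything in it is a hypothetical stand-in: `step_fn` represents one decoder step, `untrained_step` is a toy model that always guesses token 9, and `teacher_forcing_prob` is an illustrative knob (1.0 gives pure teacher forcing, 0.0 gives free-running generation, and values in between mix the two in the spirit of scheduled sampling).

```python
import random

def run_decoder(step_fn, state, reference, bos_id, teacher_forcing_prob, rng):
    """Unroll a decoder over a reference sequence and record what it was fed.

    step_fn(state, prev_token) -> (new_state, predicted_token) is a
    hypothetical stand-in for one decoder step.
    """
    prev = bos_id
    fed, predicted = [], []
    for target in reference:
        fed.append(prev)
        state, guess = step_fn(state, prev)
        predicted.append(guess)
        # Teacher forcing feeds the true token next; otherwise feed the guess.
        use_truth = rng.random() < teacher_forcing_prob
        prev = target if use_truth else guess
    return fed, predicted

def untrained_step(state, prev_token):
    # Toy stand-in for an untrained decoder: it always predicts token 9.
    return state, 9

BOS = 0                        # hypothetical start-of-sequence id
reference = [5, 6, 7, 1]       # made-up reference output
rng = random.Random(0)

fed_tf, _ = run_decoder(untrained_step, None, reference, BOS, 1.0, rng)
fed_free, _ = run_decoder(untrained_step, None, reference, BOS, 0.0, rng)
print(fed_tf)    # [0, 5, 6, 7]
print(fed_free)  # [0, 9, 9, 9]
```

With teacher forcing the inputs are the reference shifted right by one position, however bad the guesses are. In free-running mode the bad guesses become the inputs, which is the situation the decoder faces at test time. Scheduled sampling as usually described lowers the probability of feeding the true token gradually over the course of training rather than holding it fixed.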
The bottleneck is the obvious thing to attack. What if, instead of compressing the whole input
into one frozen