Sequence to Sequence

So far an RNN maps a sequence to a label, or to one output per input. But translation, summarisation, and question answering are sequence-to-sequence tasks: the input and the output are both sequences, and their lengths generally differ. For example, “the cat sat” (3 words) becomes “le chat s'est assis” (4 words). We can't emit one output per input when the counts don't match. The fix is to split the job between two networks: an encoder that reads the whole input into a summary, and a decoder that writes the output from that summary, one token at a time.

The encoder: read the input into one context vector

Step 1 — run an RNN over the input. The encoder is a gated RNN (an LSTM or GRU, so it can actually remember). It consumes the input tokens x_1, \dots, x_n in order, updating its hidden state at each step:

h_i = \text{Encoder}(h_{i-1},\, x_i),\qquad i = 1, \dots, n.

Step 2 — take the final state as the context. After the last input token, the encoder's final hidden state is declared the context vector c — a single fixed-size summary of the entire input:

c = h_n.

Everything the decoder will ever know about the input must pass through this one vector c. Read the whole sentence, hand over one summary.
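The two encoder steps can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: it swaps the gated cell for a plain tanh recurrence to stay short, and the names rnn_step, encode, random_matrix, and the weight matrices W_h and W_x are assumptions made for the example, with random untrained weights.

```python
import math
import random

def rnn_step(h_prev, x, W_h, W_x):
    """One simplified recurrent update: h = tanh(W_h h_prev + W_x x).
    (A real encoder would use an LSTM or GRU cell here.)"""
    return [
        math.tanh(
            sum(w * h for w, h in zip(W_h[j], h_prev))
            + sum(w * v for w, v in zip(W_x[j], x))
        )
        for j in range(len(h_prev))
    ]

def encode(token_ids, embed, W_h, W_x):
    """Run the encoder over x_1..x_n and return the context c = h_n."""
    h = [0.0] * len(W_h)              # h_0: all-zero initial state
    for tok in token_ids:             # consume the tokens in order
        h = rnn_step(h, embed[tok], W_h, W_x)
    return h                          # the final hidden state is the context

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights standing in for trained ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, emb_dim, hidden = 10, 4, 6
embed = random_matrix(vocab_size, emb_dim, rng)   # one embedding row per token id
W_h = random_matrix(hidden, hidden, rng)
W_x = random_matrix(hidden, emb_dim, rng)

c_short = encode([1, 2, 3], embed, W_h, W_x)        # 3-token input
c_long = encode([1, 2, 3] * 20, embed, W_h, W_x)    # 60-token input
print(len(c_short), len(c_long))                    # prints "6 6"
```

Both calls return a vector of the same width, however long the input was: that is what "fixed-size summary" means in practice.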

The decoder: write the output, one token at a time

Step 1 — start the decoder from the context. A second RNN, the decoder, takes c as its initial state — it begins “knowing” the input:

s_0 = c.

Step 2 — generate one token, then feed it back. The decoder is autoregressive: at each step it updates its state from the previous state and the previous output token, and emits the next one:

s_t = \text{Decoder}(s_{t-1},\, y_{t-1}),\qquad y_t = \operatorname{softmax}(W s_t).

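The softmax turns the decoder's scores W s_t into a probability distribution over the vocabulary. Strictly speaking, y_t is the token chosen from that distribution (the most probable one, say, or a random sample), and the chosen token is fed back in as the next input. Start with a special \langle\text{start}\rangle token as y_0 and keep going until the decoder emits \langle\text{end}\rangle. Because the decoder itself decides when to stop, the output length m can differ from the input length n.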
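The generation loop can be sketched as follows. Several things here are assumptions for illustration only: the token ids START and END, the plain tanh cell in decoder_step, the greedy (most probable token) choice, the max_len safety cap, and the random untrained weights. The context c is a stand-in for an encoder's final state.

```python
import math
import random

START, END = 0, 1   # hypothetical ids for the <start> and <end> tokens

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    top = max(scores)                         # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(s_prev, y_prev, embed, U_s, U_y):
    """Simplified update s_t = tanh(U_s s_{t-1} + U_y e(y_{t-1}))."""
    a = matvec(U_s, s_prev)
    b = matvec(U_y, embed[y_prev])
    return [math.tanh(p + q) for p, q in zip(a, b)]

def greedy_decode(c, embed, U_s, U_y, W, max_len=20):
    s, y = c, START                           # s_0 = c, first input is <start>
    output = []
    for _ in range(max_len):                  # cap length in case <end> never comes
        s = decoder_step(s, y, embed, U_s, U_y)
        probs = softmax(matvec(W, s))         # distribution over the vocabulary
        y = max(range(len(probs)), key=probs.__getitem__)   # greedy choice
        if y == END:                          # the decoder decides when to stop
            break
        output.append(y)                      # y is fed back in on the next pass
    return output

def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for trained ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, emb_dim, hidden = 8, 4, 6
embed = random_matrix(vocab_size, emb_dim, rng)
U_s = random_matrix(hidden, hidden, rng)
U_y = random_matrix(hidden, emb_dim, rng)
W = random_matrix(vocab_size, hidden, rng)
c = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]   # stand-in for the encoder's h_n

tokens = greedy_decode(c, embed, U_s, U_y, W)
print(tokens)
```

With untrained weights the printed tokens are meaningless; the point is the control flow. Greedy selection is only one option: sampling from probs, or keeping several candidates as in beam search, would replace the single max line.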

Step 3 — read the probability it defines. Each token is conditioned on the context and everything generated so far, so the model factorises the output sequence probability by the chain rule:

p(y_1, \dots, y_m \mid x) = \prod_{t=1}^{m} p\!\left(y_t \mid y_{<t},\, c\right).

That is the encoder–decoder model in one line: condition every output token on a fixed summary c of the input and on the tokens already written.
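As a small worked example of the chain-rule factorisation, the probability of a whole output is the product of the per-step probabilities of the tokens actually written, usually accumulated as a sum of logs to avoid underflow. The distributions below are made-up numbers over a three-token vocabulary, and sequence_log_prob is a hypothetical helper name.

```python
import math

def sequence_log_prob(step_probs, tokens):
    """Sum of log p(y_t | y_<t, c) over the output positions."""
    return sum(math.log(dist[tok]) for dist, tok in zip(step_probs, tokens))

# Made-up per-step distributions, one row per decoder step.
step_probs = [
    [0.70, 0.20, 0.10],   # p(y_1 | c)
    [0.10, 0.80, 0.10],   # p(y_2 | y_1, c)
    [0.25, 0.25, 0.50],   # p(y_3 | y_1, y_2, c)
]
tokens = [0, 1, 2]        # the tokens that were written

log_p = sequence_log_prob(step_probs, tokens)
p = math.exp(log_p)       # 0.70 * 0.80 * 0.50 = 0.28, up to rounding
print(round(p, 6))        # prints 0.28
```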

The bottleneck hiding in c = h_n

Step 1 — notice what must fit. Whether the input is three words or eighty, the entire meaning is forced through the single fixed-size vector c = h_n. Translating word forty of the output, the decoder cannot look back at the input — it can only consult c.

Step 2 — conclude the strain. This is the fixed-state bottleneck again, now at the level of whole sentences: a long, detailed input must be crammed into the same-width c as a short one, so quality sags as inputs grow. The model also leans hardest on the last few input tokens (they touched h_n most recently), often mistranslating the beginning of a long sentence. The single-vector summary is the model's great simplification — and its central weakness.

Map an input sequence to an output sequence with a reader and a writer:

Encoder → context → decoder. The encoder RNN reads x_1,\dots,x_n and its final state becomes the context c = h_n; the decoder RNN is initialised s_0 = c.
Autoregressive decoding. The decoder generates one token at a time, s_t = \text{Decoder}(s_{t-1}, y_{t-1}) and y_t = \operatorname{softmax}(W s_t), feeding each output back in, so p(y\mid x) = \prod_t p(y_t \mid y_{<t}, c) and the output length may differ from the input length.
The single-vector bottleneck. All of the input must fit one fixed-size c, so long or detailed inputs degrade — the limitation attention was invented to remove.

Watch the handoff

Step through it. First the encoder reads x_1, x_2, x_3 left to right, its state flowing along the bottom row. Then the final state is handed off as the context c. Finally the decoder starts from c and writes y_1, y_2, y_3, each output looping back as the next input. Notice that every decoder step pulls from the same single c — that is the bottleneck, drawn.

At generation time the decoder feeds its own output back in. But early in training those outputs are garbage, and feeding garbage back in would derail every subsequent step, so the model could never get a foothold. The remedy is teacher forcing: during training, feed the decoder the true previous token from the reference output, not its own guess. So at step t the decoder is conditioned on the correct y_{t-1}^{\text{true}}, and the loss only asks it to predict the next true token. The whole sequence becomes m clean next-token predictions whose inputs are all known in advance; the recurrent state is still computed step by step, but no step has to wait for the model to choose a token.
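A teacher-forced loss can be sketched like this. The names are illustrative assumptions: START and END are hypothetical token ids, decoder_step is a plain tanh cell rather than a gated one, and the weights are random. The sketch also assumes the reference is extended with the end token so the model learns when to stop, and it averages the cross-entropy over steps.

```python
import math
import random

START, END = 0, 1   # hypothetical ids for the <start> and <end> tokens

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(s_prev, y_prev, embed, U_s, U_y):
    """Simplified update s_t = tanh(U_s s_{t-1} + U_y e(y_{t-1}))."""
    a = matvec(U_s, s_prev)
    b = matvec(U_y, embed[y_prev])
    return [math.tanh(p + q) for p, q in zip(a, b)]

def teacher_forced_loss(c, reference, embed, U_s, U_y, W):
    """Mean next-token cross-entropy, feeding the true previous token each step."""
    inputs = [START] + reference       # y_0 = <start>, then the true tokens
    targets = reference + [END]        # at each step, predict the next true token
    s, total = c, 0.0
    for y_prev, y_true in zip(inputs, targets):
        s = decoder_step(s, y_prev, embed, U_s, U_y)   # conditioned on the truth
        probs = softmax(matvec(W, s))
        total += -math.log(probs[y_true])   # the model's own guess is never fed back
    return total / len(targets)

def random_matrix(rows, cols, rng):
    """Hypothetical helper: random weights standing in for trained ones."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, emb_dim, hidden = 8, 4, 6
embed = random_matrix(vocab_size, emb_dim, rng)
U_s = random_matrix(hidden, hidden, rng)
U_y = random_matrix(hidden, emb_dim, rng)
W = random_matrix(vocab_size, hidden, rng)
c = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]   # stand-in context vector

loss = teacher_forced_loss(c, [3, 5, 2], embed, U_s, U_y, W)
print(loss)
```

Compare the loop with free-running generation: the next input comes from the reference list, not from the distribution the model just produced. Training would then adjust the weights to lower this loss, which the sketch leaves out.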

It trains fast and stably, but it creates a mismatch — exposure bias: at test time the decoder sees its own (possibly wrong) tokens, a distribution it never trained on, so one early mistake can cascade. Mitigations like “scheduled sampling” occasionally feed the model's own predictions during training to bridge the gap. The tension between fast teacher-forced training and faithful free-running generation is a recurring theme in sequence modelling.
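The core of scheduled sampling is a coin flip at each training step over which token to feed next. The sketch below isolates that choice; next_decoder_input and sample_prob are names invented for the example, and the predicted tokens are made up rather than produced by a model. In practice sample_prob is often raised gradually over training, which is left out here.

```python
import random

def next_decoder_input(true_token, predicted_token, sample_prob, rng):
    """Choose the token fed to the decoder at the next training step.

    With probability sample_prob use the model's own prediction,
    otherwise use the true token (plain teacher forcing)."""
    if rng.random() < sample_prob:
        return predicted_token
    return true_token

rng = random.Random(0)
reference = [3, 5, 2, 7]   # true tokens from the reference output
predicted = [3, 4, 2, 6]   # made-up model guesses for the same positions

# sample_prob = 0.0 reduces to teacher forcing: always the true token.
forced = [next_decoder_input(t, p, 0.0, rng) for t, p in zip(reference, predicted)]

# sample_prob = 1.0 is fully free-running: always the model's own token.
free = [next_decoder_input(t, p, 1.0, rng) for t, p in zip(reference, predicted)]

# Values in between mix the two, exposing the model to its own mistakes.
mixed = [next_decoder_input(t, p, 0.5, rng) for t, p in zip(reference, predicted)]

print(forced)   # prints [3, 5, 2, 7]
print(free)     # prints [3, 4, 2, 6]
```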

Where this is going

The bottleneck is the obvious thing to attack. What if, instead of compressing the whole input into one frozen c, the decoder could look back at all the encoder states and pick out the relevant ones at each step? That is the attention mechanism — and it changed everything.