Sequence Data

Some data comes with an order, and the order carries meaning. A sentence is a sequence of words; a piece of audio is a sequence of samples; a stock chart is a sequence of prices in time. Shuffle the order and you destroy the message — "dog bites man" and "man bites dog" use exactly the same words, yet one is a Tuesday and the other is a headline.

Such data is called sequential. Two facts make it awkward for an ordinary network:

Order matters. The position of each element is part of the information.
Length varies. One sentence has 4 words, the next has 40. There is no fixed number of slots to fill.

From a sentence to a sequence of vectors

To feed text to a network we must turn it into numbers — and we do it one piece at a time, preserving the order. Take the sentence

\text{"the cat sat"}.

Step 1 — split into tokens. A token is the atomic unit the model reads (here, a word). Splitting on spaces gives an ordered list:

(\,\text{the},\ \text{cat},\ \text{sat}\,).

Step 2 — fix a vocabulary. Collect every token the model knows into a vocabulary of size V, and number them 0, 1, \dots, V-1. Say

\text{the} \mapsto 0, \quad \text{cat} \mapsto 7, \quad \text{sat} \mapsto 3.

Step 3 — map tokens to token ids. Replacing each token by its number turns the sentence into a sequence of integers — the token ids:

(\,\text{the},\ \text{cat},\ \text{sat}\,) \;\longmapsto\; (0,\ 7,\ 3).
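
Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the names VOCAB, TOKEN_TO_ID, tokenize and encode are hypothetical, and the five extra vocabulary words are invented only so that the ids match the example above (the ↦ 0, sat ↦ 3, cat ↦ 7).

```python
# Hypothetical 8-word vocabulary (V = 8), ordered so the ids match the text:
# the -> 0, sat -> 3, cat -> 7. The other words are made up for illustration.
VOCAB = ["the", "a", "dog", "sat", "on", "mat", "ran", "cat"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def tokenize(sentence):
    """Step 1: split on whitespace, keeping the original order."""
    return sentence.split()


def encode(sentence):
    """Steps 2-3: look each token up in the fixed vocabulary."""
    # A KeyError here means the token is not in the vocabulary;
    # handling unknown tokens is outside the scope of this sketch.
    return [TOKEN_TO_ID[token] for token in tokenize(sentence)]


ids = encode("the cat sat")
print(ids)  # [0, 7, 3]
```

Because the output is a list, the order of the ids mirrors the order of the words: encoding "cat sat the" with the same vocabulary would give [7, 3, 0].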

Step 4 — a sequence of vectors. Each id is then turned into a vector x_t (the next page, on word embeddings, shows how good vectors are learned). The sentence of length T becomes an ordered sequence

x_1,\ x_2,\ \dots,\ x_T,

where the index t is the position in the sequence (often called "time"). This ordered, variable-length list of vectors is the shape every sequence model consumes.
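
As a rough sketch of Step 4, the id-to-vector step can be pictured as a table lookup. Everything specific here is an assumption made for illustration: the vocabulary size V = 8, the vector dimension D = 4, the names EMBEDDING_TABLE and embed, and above all the random numbers, which are placeholders standing in for learned vectors.

```python
import random

V = 8  # vocabulary size (assumed for this sketch)
D = 4  # vector dimension (assumed; the text does not fix one)

rng = random.Random(0)
# One row per token id. Random placeholders stand in for learned vectors.
EMBEDDING_TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(V)]


def embed(token_ids):
    """Map token ids to vectors x_1, ..., x_T, preserving their order."""
    return [EMBEDDING_TABLE[i] for i in token_ids]


xs = embed([0, 7, 3])       # ids for "the cat sat"
print(len(xs), len(xs[0]))  # 3 4  (T vectors, each of dimension D)
```

The same function accepts a sequence of any length T, and the same id always yields the same vector regardless of where it appears in the sequence.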

Why not just flatten it into a plain network?

A standard feed-forward network (an MLP) wants one fixed-size feature vector. You might try to flatten the sequence — glue x_1, x_2, \dots, x_T end to end into one long vector. Two things break:

Fixed input size. A flattened length-T sequence needs T input slots, one per vector (so T times the vector dimension in raw numbers). But T changes from example to example, and an MLP's first weight matrix has a fixed number of columns. A 4-word sentence and a 40-word sentence simply do not fit the same network.
Order-blindness (with a fix that doesn't scale). If you pad every sequence to a maximum length and flatten, the network can in principle see order, but the word in position 1 and the same word in position 50 are handled by completely separate weights, so it must relearn the meaning of "cat" once per position. Nothing is shared across time, and the input layer grows with the maximum length, so long sequences quickly become impractical.
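
Both failure modes are easy to see with plain lists. The sketch below assumes a vector dimension D = 4 and a padding length MAX_LEN = 6; the helpers flatten and pad and the example vector cat are hypothetical names and values chosen only for this illustration.

```python
D = 4        # vector dimension (assumed)
MAX_LEN = 6  # padding length (assumed)


def flatten(vectors):
    """Glue x_1, ..., x_T end to end into one long list."""
    return [value for vec in vectors for value in vec]


def pad(vectors, max_len, dim):
    """Append zero vectors until the sequence has max_len entries."""
    return vectors + [[0.0] * dim for _ in range(max_len - len(vectors))]


# Problem 1: the flattened size depends on T.
short_seq = [[1.0] * D for _ in range(4)]   # a 4-token sequence
long_seq = [[1.0] * D for _ in range(40)]   # a 40-token sequence
print(len(flatten(short_seq)), len(flatten(long_seq)))  # 16 160

# Problem 2: after padding, the same vector lands on different input
# slots depending on its position, so different weights would see it.
cat = [0.5, -0.2, 0.1, 0.9]   # made-up vector for "cat"
blank = [0.0] * D
first = flatten(pad([cat], MAX_LEN, D))
third = flatten(pad([blank, blank, cat], MAX_LEN, D))
print([i for i, v in enumerate(first) if v != 0.0])  # [0, 1, 2, 3]
print([i for i, v in enumerate(third) if v != 0.0])  # [8, 9, 10, 11]
```

In an MLP, input slot i is multiplied by column i of the first weight matrix, so "cat" in the first position and "cat" in the third position touch disjoint sets of weights.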

What we want instead is a model that walks the sequence one step at a time, reuses the same machinery at every position, and naturally handles any length. That is exactly the recurrent neural network — but first we must turn those token ids into good vectors.
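
To make "the same machinery at every position" concrete, here is a toy sketch of a loop that carries a running state through the sequence. It is not the recurrent network itself: the update rule, the fixed scalar weights w_state and w_input, and the names step and run are all assumptions chosen to keep the example tiny.

```python
import math


def step(state, x, w_state=0.5, w_input=1.0):
    """One update, applied identically at every position (toy rule)."""
    return [math.tanh(w_state * s + w_input * xi) for s, xi in zip(state, x)]


def run(vectors, dim):
    """Walk the sequence left to right, reusing `step` each time."""
    state = [0.0] * dim
    for x in vectors:
        state = step(state, x)
    return state


a = [1.0, 0.0]
b = [0.0, 1.0]

# Any length works, and the result always has the same size.
print(len(run([a], 2)), len(run([a, b, a, b, a], 2)))  # 2 2

# Order matters: the same two vectors in a different order
# leave the loop in a different final state.
print(run([a, b], 2) != run([b, a], 2))  # True
```

The two weights are the only parameters, and they are shared across all positions, so nothing about the function depends on the sequence length T.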

Data is sequential when its elements come in an order that carries meaning.

Order is information. Permuting the elements can change the meaning (\text{"dog bites man"} \neq \text{"man bites dog"}), so a model must respect position.
Length is variable. Different examples have different lengths T; the model must accept any T.
Canonical form. Text becomes tokens → token ids → an ordered sequence of vectors x_1, \dots, x_T. A plain MLP, needing a fixed input size and giving each position its own weights, is the wrong tool.

Tokenise, left to right

Step the control forward to reveal the sentence one token at a time. Each cell is a token; the number under it is its token id (its slot in the vocabulary). The strip is ordered — reading it right to left would be a different sentence — and its length is just however many tokens the sentence has.