Sequence Data

Some data comes with an order, and the order carries meaning. A sentence is a sequence of words; a piece of audio is a sequence of samples; a stock chart is a sequence of prices in time. Shuffle the order and you destroy the message — "dog bites man" and "man bites dog" use exactly the same words, yet one is a Tuesday and the other is a headline.

Such data is called sequential. Two facts make it awkward for an ordinary network:

From a sentence to a sequence of vectors

To feed text to a network we must turn it into numbers — and we do it one piece at a time, preserving the order. Take the sentence

\text{"the cat sat"}.

Step 1 — split into tokens. A token is the atomic unit the model reads (here, a word). Splitting on spaces gives an ordered list:

(\,\text{the},\ \text{cat},\ \text{sat}\,).

Step 2 — fix a vocabulary. Collect every token the model knows into a vocabulary of size V, and number them 0, 1, \dots, V-1. Say

\text{the} \mapsto 0, \quad \text{cat} \mapsto 7, \quad \text{sat} \mapsto 3.

Step 3 — map tokens to token ids. Replacing each token by its number turns the sentence into a sequence of integers — the token ids:

(\,\text{the},\ \text{cat},\ \text{sat}\,) \;\longmapsto\; (0,\ 7,\ 3).

Step 4 — a sequence of vectors. Each id is then turned into a vector x_t (the next page, word embeddings, learns good ones). The sentence of length T becomes an ordered sequence

x_1,\ x_2,\ \dots,\ x_T,

where the index t is position in the sequence (often "time"). This — an ordered, variable-length list of vectors — is the shape every sequence model consumes.

Why not just flatten it into a plain network?

A standard feed-forward network (an MLP) wants one fixed-size feature vector. You might try to flatten the sequence — glue x_1, x_2, \dots, x_T end to end into one long vector. Two things break:

What we want instead is a model that walks the sequence one step at a time, reuses the same machinery at every position, and naturally handles any length. That is exactly the recurrent neural network — but first we must turn those token ids into good vectors.

Data is sequential when its elements come in an order that carries meaning.

Tokenise, left to right

Step the control forward to reveal the sentence one token at a time. Each cell is a token; the number under it is its token id (its slot in the vocabulary). The strip is ordered — reading it right to left would be a different sentence — and its length is just however many tokens the sentence has.