Some data comes with an order, and the order carries meaning. A sentence is a sequence of words; a piece of audio is a sequence of samples; a stock chart is a sequence of prices in time. Shuffle the order and you destroy the message — "dog bites man" and "man bites dog" use exactly the same words, yet one is a Tuesday and the other is a headline.
Such data is called sequential. Two facts make it awkward for an ordinary network:
To feed text to a network we must turn it into numbers — and we do it one piece at a time, preserving the order. Take the sentence
Step 1 — split into tokens. A token is the atomic unit the model reads (here, a word). Splitting on spaces gives an ordered list:
Step 2 — fix a vocabulary. Collect every token the model knows into a
vocabulary of size
Step 3 — map tokens to token ids. Replacing each token by its number turns the sentence into a sequence of integers — the token ids:
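Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: it reuses the "dog bites man" example from the introduction as its sentence, and the helper names `tokenize`, `build_vocab`, and `encode` are hypothetical. Ids are assigned in order of first appearance, which is an arbitrary choice made here for simplicity.

```python
def tokenize(sentence):
    """Step 1: split on whitespace, keeping the original order."""
    return sentence.split()


def build_vocab(tokens):
    """Step 2: give each distinct token an integer id (first-seen order, an assumption)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Step 3: replace each token by its id."""
    return [vocab[tok] for tok in tokens]


tokens = tokenize("dog bites man")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
reversed_ids = encode(tokenize("man bites dog"), vocab)

print(tokens)        # ['dog', 'bites', 'man']
print(vocab)         # {'dog': 0, 'bites': 1, 'man': 2}
print(ids)           # [0, 1, 2]
print(reversed_ids)  # [2, 1, 0]
```

Both sentences draw on the same three vocabulary entries; only the order of the ids tells them apart, which is why the order has to be preserved.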
Step 4 — a sequence of vectors. Each id is then turned into a vector
where the index
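One common way to carry out Step 4 is a lookup table with one row of numbers per vocabulary entry. The sketch below assumes that approach; the text may equally intend one-hot vectors, which correspond to the special case where the table is the identity matrix. The names `make_embedding_table` and `embed`, the vector dimension of 4, and the random initial values are all hypothetical choices for illustration.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of random values (illustrative initialisation)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids, table):
    """Look up one vector per id, preserving the sequence order."""
    return [table[i] for i in ids]


ids = [0, 1, 2]  # token ids for a three-token sentence
table = make_embedding_table(vocab_size=3, dim=4)
vectors = embed(ids, table)

print(len(vectors), len(vectors[0]))  # 3 4
```

The result is a list with one vector per token, in the same order as the ids, so reversing the ids reverses the list of vectors.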
A standard feed-forward network (an
What we want instead is a model that walks the sequence one step at a time, reuses the same machinery at every position, and naturally handles any length. That is exactly the
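The idea of walking a sequence with shared machinery can be sketched as a loop that applies one update function at every position. The update rule below, a `tanh` of a weighted sum of the current input and the previous state, is an assumption chosen for illustration, since the text does not specify one. The names `make_params`, `step`, and `run`, and all dimensions, are hypothetical.

```python
import math
import random


def make_params(input_dim, hidden_dim, seed=0):
    """Create one shared set of weights (random values, for illustration only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_x": mat(hidden_dim, input_dim),
        "W_h": mat(hidden_dim, hidden_dim),
        "b": [0.0] * hidden_dim,
    }


def step(params, h, x):
    """One position: combine the previous state h with the input x."""
    new_h = []
    for i in range(len(h)):
        total = params["b"][i]
        total += sum(w * xj for w, xj in zip(params["W_x"][i], x))
        total += sum(w * hj for w, hj in zip(params["W_h"][i], h))
        new_h.append(math.tanh(total))
    return new_h


def run(params, sequence, hidden_dim):
    """Walk the sequence left to right, reusing the same params at every step."""
    h = [0.0] * hidden_dim
    for x in sequence:
        h = step(params, h, x)
    return h


params = make_params(input_dim=2, hidden_dim=3)
short = [[1.0, 0.0], [0.0, 1.0]]
longer = short + [[1.0, 1.0], [0.5, -0.5]]

h_short = run(params, short, hidden_dim=3)
h_longer = run(params, longer, hidden_dim=3)
h_swapped = run(params, short[::-1], hidden_dim=3)

print(len(h_short), len(h_longer))  # 3 3
```

The same `params` process a sequence of two vectors and a sequence of four, and the final state has the same size in both cases. Feeding the two-vector sequence in reverse order generally produces a different state, so the loop is sensitive to order as well as content.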
Step the control forward to reveal the sentence one token at a time. Each cell is a token; the number under it is its token id (its slot in the vocabulary). The strip is ordered — reading it right to left would be a different sentence — and its length is just however many tokens the sentence has.