When you store text with a
Huffman coding is a clever
Fixed-length codes are tidy but blind to frequency. Suppose a message only ever uses four
letters. Two bits give exactly
Huffman does something smarter. It looks at how often each letter actually appears, then
hands out codes of different lengths: maybe just
In Morse code — over 150 years old and built on exactly this principle. Samuel
Morse gave the commonest English letter,
Let's compress the word MISSISSIPPI. First, count how often each letter appears:
Now we build a binary tree from the bottom up. Start with every letter as its own little node, each carrying its frequency. Then repeat one step over and over:
Step the diagram forward to watch the tree grow — the two smallest nodes join first
(
Notice why the rare letters sink to the bottom: a low-frequency node is combined early, so it ends up deep in the tree, far from the root — and the deeper the leaf, the longer its code. The busiest letters get combined last, so they sit near the top with short codes. The algorithm arranges this automatically.
To find a letter's code, walk from the root down to its leaf, writing
MISSISSIPPI that gives:
| Letter | Frequency | Code | Bits used |
|---|---|---|---|
S | 4 | 1 (1 bit) | |
I | 4 | 01 (2 bits) | |
P | 2 | 001 (3 bits) | |
M | 1 | 000 (3 bits) |
The two most common letters, S and I, got the shortest codes
(M,
got a long one — exactly as promised.
Add up the "bits used" column: the whole word squeezes into
Compare that with fixed-length schemes for the same 11-letter word:
Huffman's
Once the tree has given us a code for each letter, encoding a message is just "look up each letter
and glue its code on." Here is the code table for MISSISSIPPI and an encoder that
measures the result against plain ASCII. Change the message and press
Run:
The encoded stream is 000011101110100100101 — Ss and watch the bit count stay
small; a message full of Ms costs more.
Here is a puzzle. The encoded stream is one long run of bits with no spaces:
000011101110100100101. Codes are different lengths — 1, 01,
000 — so how can the decoder possibly tell where one letter stops and the next begins?
The answer is the whole reason Huffman works, and it's worth stating carefully.
Huffman codes are prefix-free (also called a "prefix code"): no code is
the start of any other code. Look back at the table — 1 is not the
beginning of 01, 001 or 000; 01 is not the
beginning of the others; and so on. This is guaranteed automatically, because in a Huffman tree
every letter sits at a leaf, so no letter's path is ever the start of another letter's
path.
Why does this matter so much? Because it lets the decoder read the stream with no separators at all. It reads bits one at a time until the bits so far exactly match a code — and because that code can't be the beginning of a longer one, there is never any ambiguity about whether to stop. It emits that letter, resets, and carries on. A single unbroken stream of bits decodes back to precisely one message. Without the prefix-free property you would need to store the length of every code or insert markers, and the compression would fall apart. This one property is the entire trick that makes variable-length codes usable.
Run the decoder below. It walks the stream bit by bit, building up current until it
matches a code, then emits that letter and starts fresh — and because the code is prefix-free, that
greedy "stop at the first match" is always the right call:
Getting back precisely MISSISSIPPI from an unpunctuated string of bits is
what makes Huffman coding lossless — and it only works because the code is prefix-free.
Huffman wins when some symbols are much more common than others — a skewed frequency distribution — because then the short codes get used a great deal and the long ones hardly at all. On text, images with large flat areas, and most everyday files, that skew is real, so Huffman (often as part of bigger schemes like DEFLATE inside ZIP, PNG and JPEG) saves useful space.
For the job it's given — assigning a whole number of bits to each symbol independently — Huffman
coding is provably optimal: no other such code produces a shorter average
length. That's a remarkable guarantee for so simple an algorithm. But there's a subtlety: it's
stuck with a whole number of bits per symbol. If the ideal length for a very common
symbol were really
Then you simply pick either one — different tie-breaking choices can produce different-looking trees and even different individual code lengths, but the total compressed size always comes out the same (and always optimal). So there isn't a single "correct" Huffman tree for a given message; there's a family of equally good ones. An exam will usually tell you a tie-breaking rule (for instance, "combine the leftmost pair") so everyone reaches the same diagram.