Huffman Coding

When you store text with a character set like ASCII, every character costs the same number of bits — a fixed 8 bits each, whether it is a common e or a rare z. That feels wasteful, and it is. In a normal page of English the letter e turns up again and again while q, x and z barely appear at all — yet ASCII spends exactly as many bits on each rare letter as on each common one.

Huffman coding is a clever lossless compression method that fixes this. Its single idea is beautifully simple: give the most frequent characters the shortest codes, and let the rare ones have longer codes. Because the letters you use most now cost the fewest bits, the whole message shrinks — and, being lossless, it can be rebuilt exactly, bit for bit. This page shows you how those codes are built, works a full example by hand, and lets you run the encoder and decoder yourself.

The core idea: shorter codes for commoner letters

Fixed-length codes are tidy but blind to frequency. Suppose a message only ever uses four letters. Two bits give exactly 2^2 = 4 patterns (00, 01, 10, 11), so a fixed-length scheme would spend 2 bits on every letter — even the one that appears far more often than the rest.

Huffman does something smarter. It looks at how often each letter actually appears, then hands out codes of different lengths: maybe just 1 bit for the letter that dominates the message, and 3 bits for a letter that shows up only once. Spend fewer bits on the many, more bits on the few, and the total comes out smaller. That trade is the entire trick.

In Morse code — over 150 years old and built on exactly this principle. Samuel Morse gave the commonest English letter, e, the shortest possible signal: a single dot. The next commonest, t, is one dash. Rare letters like q (dash-dash-dot-dash) are long. Morse worked out his letter frequencies by counting the little metal type blocks in a printer's tray — the more of a letter the printer stocked, the more often it must be used, so the shorter he made its code. Huffman coding, invented by David Huffman in 1952 as a student assignment, is the same instinct turned into a method that finds the provably best such code.

Building a Huffman tree

Let's compress the word MISSISSIPPI. First, count how often each letter appears:

\texttt{S}: 4 \qquad \texttt{I}: 4 \qquad \texttt{P}: 2 \qquad \texttt{M}: 1

Now we build a binary tree from the bottom up. Start with every letter as its own little node, each carrying its frequency. Then repeat one step over and over:

Start with every symbol as a separate node, labelled with its frequency.
Take the two nodes of lowest frequency and join them under a new parent node, whose frequency is the sum of the two.
Repeat — always combining the two lowest-frequency nodes still available — until a single node is left. That node is the root.
Finally, label every left branch 0 and every right branch 1. Each symbol's code is the sequence of labels read from the root down to that symbol's leaf.

Step the diagram forward to watch the tree grow — the two smallest nodes join first (\texttt{M}=1 with \texttt{P}=2), and so on up to the root:

Notice why the rare letters sink to the bottom: a low-frequency node is combined early, so it ends up deep in the tree, far from the root — and the deeper the leaf, the longer its code. The busiest letters get combined last, so they sit near the top with short codes. The algorithm arranges this automatically.

Reading the codes off the tree

To find a letter's code, walk from the root down to its leaf, writing 0 for each left branch and 1 for each right branch. For MISSISSIPPI that gives:

Letter	Frequency	Code	Bits used
`S`	4	`1` (1 bit)	4 \times 1 = 4
`I`	4	`01` (2 bits)	4 \times 2 = 8
`P`	2	`001` (3 bits)	2 \times 3 = 6
`M`	1	`000` (3 bits)	1 \times 3 = 3

The two most common letters, S and I, got the shortest codes (1 and 2 bits); the rarest, M, got a long one — exactly as promised.

Counting the saving

Add up the "bits used" column: the whole word squeezes into

4 + 8 + 6 + 3 = 21 \text{ bits.}

Compare that with fixed-length schemes for the same 11-letter word:

ASCII (8 bits each): 11 \times 8 = 88 bits.
A tight 2-bit fixed code (enough for 4 distinct letters): 11 \times 2 = 22 bits.

Huffman's 21 bits beats even the tightest fixed-length code — and crushes ASCII's 88. On real text, where a handful of letters dominate, the saving is far larger.

Run it: encoding with a Huffman code table

Once the tree has given us a code for each letter, encoding a message is just "look up each letter and glue its code on." Here is the code table for MISSISSIPPI and an encoder that measures the result against plain ASCII. Change the message and press Run:

// The code table we read off the Huffman tree. const codes: Record<string, string> = { M: "000", P: "001", I: "01", S: "1" }; const message: string = "MISSISSIPPI"; let encoded: string = ""; for (const ch of message) { encoded = encoded + codes[ch]; // glue this letter's code on the end } console.log("Message: " + message); console.log("Encoded: " + encoded); console.log("Huffman: " + encoded.length + " bits"); console.log("ASCII: " + message.length * 8 + " bits");

The encoded stream is 000011101110100100101 — 21 bits, just as we counted by hand. Try a message made mostly of Ss and watch the bit count stay small; a message full of Ms costs more.

Decoding: how does the computer know where each code ends?

Here is a puzzle. The encoded stream is one long run of bits with no spaces: 000011101110100100101. Codes are different lengths — 1, 01, 000 — so how can the decoder possibly tell where one letter stops and the next begins?

The answer is the whole reason Huffman works, and it's worth stating carefully.

Huffman codes are prefix-free (also called a "prefix code"): no code is the start of any other code. Look back at the table — 1 is not the beginning of 01, 001 or 000; 01 is not the beginning of the others; and so on. This is guaranteed automatically, because in a Huffman tree every letter sits at a leaf, so no letter's path is ever the start of another letter's path.

Why does this matter so much? Because it lets the decoder read the stream with no separators at all. It reads bits one at a time until the bits so far exactly match a code — and because that code can't be the beginning of a longer one, there is never any ambiguity about whether to stop. It emits that letter, resets, and carries on. A single unbroken stream of bits decodes back to precisely one message. Without the prefix-free property you would need to store the length of every code or insert markers, and the compression would fall apart. This one property is the entire trick that makes variable-length codes usable.

Run the decoder below. It walks the stream bit by bit, building up current until it matches a code, then emits that letter and starts fresh — and because the code is prefix-free, that greedy "stop at the first match" is always the right call:

// Reverse table: bits -> letter. const fromBits: Record<string, string> = { "000": "M", "001": "P", "01": "I", "1": "S" }; const stream: string = "000011101110100100101"; let current: string = ""; // bits collected so far let out: string = ""; // the decoded message for (const bit of stream) { current = current + bit; if (fromBits[current] !== undefined) { // a complete code? (safe because prefix-free) out = out + fromBits[current]; current = ""; // reset and read the next letter } } console.log(out); // "MISSISSIPPI" — the exact original, from a stream with no separators

Getting back precisely MISSISSIPPI from an unpunctuated string of bits is what makes Huffman coding lossless — and it only works because the code is prefix-free.

When does Huffman actually help?

Huffman wins when some symbols are much more common than others — a skewed frequency distribution — because then the short codes get used a great deal and the long ones hardly at all. On text, images with large flat areas, and most everyday files, that skew is real, so Huffman (often as part of bigger schemes like DEFLATE inside ZIP, PNG and JPEG) saves useful space.

For the job it's given — assigning a whole number of bits to each symbol independently — Huffman coding is provably optimal: no other such code produces a shorter average length. That's a remarkable guarantee for so simple an algorithm. But there's a subtlety: it's stuck with a whole number of bits per symbol. If the ideal length for a very common symbol were really 0.4 bits, Huffman must round up to 1. Cleverer methods called arithmetic coding and range coding escape that rounding by encoding the whole message as one enormous number, squeezing out those fractional bits — which is why modern formats sometimes reach for them instead. Huffman remains the classic, understandable starting point that gets you most of the way there.

Then you simply pick either one — different tie-breaking choices can produce different-looking trees and even different individual code lengths, but the total compressed size always comes out the same (and always optimal). So there isn't a single "correct" Huffman tree for a given message; there's a family of equally good ones. An exam will usually tell you a tie-breaking rule (for instance, "combine the leftmost pair") so everyone reaches the same diagram.