Compression

Every file on a computer is really just a long list of numbers — the bits and bytes that spell out your photos, songs, essays and games. Big files are a nuisance: they eat up storage on your phone, and they are slow to send across a network. A one-minute video straight off a camera can be hundreds of megabytes. Sending that to a friend, or fitting a thousand songs on a phone, would be painful.

Compression is the trick that fixes this. It rewrites a file so it takes up fewer bytes — smaller to store, faster to send, cheaper to stream — and later the file can be turned back into something you can use. The whole art is finding a cleverer, shorter way to say the same thing. This page is about the two big families of this trick, and it shows you one of them working, line by line, in code you can run.

Two families: lossless and lossy

There are two fundamentally different ways to make a file smaller, and the difference between them is the single most important idea on this page.

Lossless compression shrinks the file so it can later be restored exactly — bit for bit, with nothing missing. You get back precisely what you started with. This is what a .zip file does, and what run-length encoding (coming up next) does. Because nothing is thrown away, lossless is the only safe choice for text, program code, spreadsheets and anything where a single wrong character would matter.

Lossy compression takes a bolder approach: it throws away detail that a human can barely notice, and it can shrink a file far more as a result. A JPEG photo drops subtle colour differences your eye glosses over; an MP3 song discards sounds too quiet or too high to hear behind the louder ones. The result is a fraction of the size — but the discarded detail is gone forever. You cannot get the original back from a JPEG or an MP3, so lossy is only for things like photos, music and video, where "close enough for a human" is genuinely good enough.
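
To make "gone forever" concrete, here is a minimal sketch of a lossy step. It is not how JPEG or MP3 actually work; it only rounds brightness values (assumed to lie between 0 and 255) down to the nearest multiple of a chosen step, which is about the simplest way to throw detail away. The function name `quantize`, the step of 16 and the sample values are all made up for illustration.

```typescript
// A toy lossy step (NOT real JPEG/MP3): round each value down to a multiple of `step`.
// `quantize`, `step` and the sample data are hypothetical, chosen only for illustration.
function quantize(values: number[], step: number): number[] {
  return values.map((v) => Math.floor(v / step) * step);
}

// Eight nearly identical brightness values, as might appear in a patch of sky.
const pixels: number[] = [200, 201, 199, 202, 200, 198, 201, 200];
const rounded: number[] = quantize(pixels, 16);

console.log("Before: " + pixels.join(" "));
console.log("After:  " + rounded.join(" "));
```

After rounding, every value is 192, so the data has become far more repetitive and easier to shrink. But 199 and 202 both turned into 192, and nothing in the rounded data says which was which. That is the sense in which the detail cannot be recovered.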

A lossless method up close: run-length encoding

Run-length encoding (RLE) is the friendliest compression method to understand, and it is lossless. The idea is tiny: whenever the same symbol repeats several times in a row — a run — don't write it out over and over. Instead, write how many, then which symbol.

Take the string AAAABBB. That's four As followed by three Bs. Rather than storing all seven characters, RLE records the runs:

AAAABBB → 4A3B

Seven characters become four, and no information is lost, because 4A3B tells us exactly how to rebuild the original: write A four times, then B three times.

This is genuinely useful for data with big blocks of the same value — the wide areas of one colour in a simple cartoon or icon, or long stretches of blank space in a scanned document. Fax machines and simple image formats have used RLE for decades for exactly this reason.

Run it: RLE in TypeScript

Here is run-length encoding written out as a program. It walks through the string counting each run, builds the compressed version (such as 4A3B), and then prints the length of both strings so you can see the saving. Try changing the original string and running it again:

```typescript
// Try changing this! Long runs compress well; try "AAAAAAAA" vs "ABCDEF".
const original: string = "AAAABBBAAAAAAAAC";

function runLengthEncode(text: string): string {
  let encoded: string = "";
  let i: number = 0;
  while (i < text.length) {
    const symbol: string = text[i];
    let count: number = 1;
    // count how long this run of the same symbol is
    while (i + count < text.length && text[i + count] === symbol) {
      count = count + 1;
    }
    encoded = encoded + count + symbol; // e.g. "4" + "A"
    i = i + count; // jump past the whole run
  }
  return encoded;
}

const compressed: string = runLengthEncode(original);
console.log("Original: " + original + " (" + original.length + " characters)");
console.log("Compressed: " + compressed + " (" + compressed.length + " characters)");
```

With AAAABBBAAAAAAAAC the sixteen-character original squashes down to 4A3B8A1C, which is just eight characters. The longer the runs, the bigger the win.

Because RLE is lossless, decoding is just as simple as encoding — and reversing it is a great way to prove nothing was lost. Read the compressed string in pairs of "number, symbol" and write each symbol out that many times:

```typescript
const compressed: string = "4A3B8A1C";
let original: string = "";

for (let i = 0; i < compressed.length; i = i + 2) {
  const count: number = Number(compressed[i]); // the number
  const symbol: string = compressed[i + 1]; // the symbol
  for (let n = 0; n < count; n++) {
    original = original + symbol;
  }
}

console.log(original); // "AAAABBBAAAAAAAAC", exactly what we started with
```

Getting back precisely AAAABBBAAAAAAAAC is what "lossless" means. (Real RLE has to handle runs longer than 9 and symbols that are themselves digits, but the idea is exactly this.)
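
The first of those gaps can be closed with a small change. The sketch below is one possible approach, not a standard format: the decoder keeps reading digits until it reaches a non-digit, so a count such as 12 is allowed. It assumes the original text contains no digit characters (the encoder rejects them), because handling those would need an escape scheme that is not shown here. The names `encodeRuns` and `decodeRuns` are made up for this example.

```typescript
// Sketch: RLE that supports multi-digit counts.
// Assumption: the text to encode contains no digit characters.
function encodeRuns(text: string): string {
  if (/[0-9]/.test(text)) {
    throw new Error("This sketch cannot encode text that contains digits");
  }
  let encoded = "";
  let i = 0;
  while (i < text.length) {
    const symbol = text[i];
    let count = 1;
    while (i + count < text.length && text[i + count] === symbol) {
      count++;
    }
    encoded += String(count) + symbol;
    i += count; // jump past the whole run
  }
  return encoded;
}

function decodeRuns(encoded: string): string {
  let decoded = "";
  let i = 0;
  while (i < encoded.length) {
    // Collect every digit of the count, however many there are.
    let digits = "";
    while (i < encoded.length && encoded[i] >= "0" && encoded[i] <= "9") {
      digits += encoded[i];
      i++;
    }
    if (digits === "" || i >= encoded.length) {
      throw new Error("Malformed input near position " + i);
    }
    decoded += encoded[i].repeat(Number(digits));
    i++; // step past the symbol
  }
  return decoded;
}

const longRun: string = "A".repeat(12) + "BBB";
const packed: string = encodeRuns(longRun);
console.log(packed);                         // "12A3B"
console.log(decodeRuns(packed) === longRun); // true
```

The fixed "read two characters at a time" loop would misread 12A3B, because it would treat the 1 as a count and the 2 as a symbol. Reading digits until the first non-digit avoids that, at the cost of ruling out digits as symbols.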

Which one should you use?

The choice is really a question about the data. Ask: would losing a tiny bit of detail matter? If the answer is yes — because it's text, code, a bank statement or a program — you must use lossless. If the answer is no, because it's a photo or a song and a human just needs it to look or sound right, lossy will give you a much smaller file.

Lossless (RLE, ZIP, PNG, FLAC): the original can be restored exactly. Smaller savings, but nothing is ever lost. Use for text, code, spreadsheets, and archives.
Lossy (JPEG, MP3, most video): permanently discards detail humans barely notice. Much smaller files, but the original can never be recovered. Use for photos, music and video.

Two traps catch people out with compression:

Lossy compression is permanent — the lost data is gone forever. Never use a lossy format for anything that must stay exact. Saving a spreadsheet, an essay or program code as a JPEG-style "lossy" file would corrupt it — a single flipped character or number can wreck the whole thing, and you can't undo it. And each time you re-save a JPEG it loses a little more quality, so a photo edited and re-saved many times slowly turns blurry and blocky.
RLE only helps when there are long runs. It shines on AAAAAAAA, but on data with no repeats it actually makes things bigger: ABCDEF (6 characters) becomes 1A1B1C1D1E1F (12 characters!), because every single symbol now drags a "count of 1" along with it. Compression is not magic — a method that saves space on one kind of data can waste it on another.
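
The second trap is easy to check for yourself. This sketch uses the same run-counting idea as the encoder earlier on the page (repeated here so the snippet runs on its own) and prints the before and after lengths for a few sample strings. The samples are arbitrary.

```typescript
function runLengthEncode(text: string): string {
  let encoded = "";
  let i = 0;
  while (i < text.length) {
    const symbol = text[i];
    let count = 1;
    while (i + count < text.length && text[i + count] === symbol) {
      count++;
    }
    encoded += String(count) + symbol;
    i += count;
  }
  return encoded;
}

// Arbitrary samples: one long run, runs of two, and no runs at all.
const samples: string[] = ["AAAAAAAA", "AABBCC", "ABCDEF"];

for (const sample of samples) {
  const encoded = runLengthEncode(sample);
  const change = encoded.length - sample.length;
  const verdict = change < 0 ? "smaller" : change > 0 ? "bigger" : "no change";
  console.log(
    sample + " -> " + encoded +
    " (" + sample.length + " -> " + encoded.length + " characters, " + verdict + ")"
  );
}
```

Runs of exactly two only break even, since two characters are replaced by two. In this format a run has to be at least three symbols long before it saves anything.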

Measuring the saving: the compression ratio

How good is a particular squeeze? We measure it with the compression ratio — the original size divided by the compressed size. A ratio of 3 means the file is now a third of its old size (often written "3:1"). You can also talk about the percentage saved. This program works both out:

```typescript
const originalBytes: number = 1200;
const compressedBytes: number = 300;

const ratio: number = originalBytes / compressedBytes;
const percentSaved: number = (1 - compressedBytes / originalBytes) * 100;

console.log("Compression ratio: " + ratio + " : 1");
console.log("Space saved: " + percentSaved + "%");
```

Lossless methods on ordinary files typically reach around 2:1 or 3:1. Lossy methods on a photo can comfortably hit 10:1 or more, which is exactly why your holiday snaps are JPEGs and not giant exact copies.

It's tempting to imagine zipping a file, then zipping the zip, then zipping that, until a whole film fits in a single byte. It can't work. Lossless compression saves space by spotting patterns — repeats, predictable structure — and rewriting them shorter. Once those patterns are gone, there is nothing left to exploit: a truly random, pattern-free file cannot be losslessly compressed at all, and trying usually makes it a touch bigger. This is why an already-zipped file barely shrinks if you zip it again. There is a hard floor, set by how much genuine information the file really contains.
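
RLE makes a handy small-scale demonstration of this. The sketch below feeds a run-length encoder its own output several times (the encoder is repeated so the snippet runs on its own). The helper name `encodeRepeatedly` and the starting string are made up, and this is an illustration with one simple method, not a proof of the general limit.

```typescript
function runLengthEncode(text: string): string {
  let encoded = "";
  let i = 0;
  while (i < text.length) {
    const symbol = text[i];
    let count = 1;
    while (i + count < text.length && text[i + count] === symbol) {
      count++;
    }
    encoded += String(count) + symbol;
    i += count;
  }
  return encoded;
}

// Hypothetical helper: returns the text after each pass, starting with the input itself.
function encodeRepeatedly(text: string, passes: number): string[] {
  const stages: string[] = [text];
  for (let pass = 0; pass < passes; pass++) {
    stages.push(runLengthEncode(stages[stages.length - 1]));
  }
  return stages;
}

const stages: string[] = encodeRepeatedly("A".repeat(20), 4);
stages.forEach((stage, pass) => {
  console.log("After " + pass + " passes: " + stage + " (" + stage.length + " characters)");
});
```

The first pass takes twenty characters down to three (20A). The second pass has no long runs left to work with, so it turns 20A into 12101A, which is twice as long, and later passes never get back below three characters.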