Every file on a computer is really just a long list of numbers, and the longer the list, the more space the file takes to store and the longer it takes to send.
Compression is the trick that fixes this. It rewrites a file so it takes up fewer bytes — smaller to store, faster to send, cheaper to stream — and later the file can be turned back into something you can use. The whole art is finding a cleverer, shorter way to say the same thing. This page is about the two big families of this trick, and it shows you one of them working, line by line, in code you can run.
There are two fundamentally different ways to make a file smaller, and the difference between them is the single most important idea on this page.
Lossless compression shrinks the file so it can later be restored
exactly — bit for bit, with nothing missing. You get back precisely
what you started with. This is what a .zip file does, and what
run-length encoding (coming up next) does. Because nothing is thrown away,
lossless is the only safe choice for text, program code, spreadsheets and
anything where a single wrong character would matter.
Lossy compression takes a bolder approach: it throws away detail that a human can barely notice, and it can shrink a file far more as a result. A JPEG photo drops subtle colour differences your eye glosses over; an MP3 song discards sounds too quiet or too high to hear behind the louder ones. The result is a fraction of the size — but the discarded detail is gone forever. You cannot get the original back from a JPEG or an MP3, so lossy is only for things like photos, music and video, where "close enough for a human" is genuinely good enough.
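The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib, from Python's standard library, stands in for a lossless method, and rounding brightness values down to a multiple of 16 is a made-up stand-in for lossy compression. Real JPEG and MP3 encoders are far more sophisticated.

```python
import zlib

# Lossless: compress, decompress, and get back exactly the same bytes.
text = b"the cat sat on the mat, the cat sat on the mat"
restored = zlib.decompress(zlib.compress(text))
print(restored == text)  # True

# Lossy (toy stand-in, not how JPEG really works): round each
# brightness value (0-255) down to a multiple of STEP.
STEP = 16  # hypothetical quantisation step, chosen only for illustration

def quantise(values):
    """Throw away fine detail by snapping each value to a multiple of STEP."""
    return [v // STEP * STEP for v in values]

pixels = [200, 201, 203, 198, 64, 66, 65, 70]
rough = quantise(pixels)
print(rough)            # [192, 192, 192, 192, 64, 64, 64, 64]
print(rough == pixels)  # False
```

The rough list has long runs of equal values, so it would compress well, but many different originals produce the same rough list, so there is no way to work backwards to the exact pixels.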
Run-length encoding (RLE) is the friendliest compression method to understand, and it is lossless. The idea is tiny: whenever the same symbol repeats several times in a row — a run — don't write it out over and over. Instead, write how many, then which symbol.
Take the string AAAABBB. That's four As followed by three
Bs. Rather than storing all seven characters, RLE records the runs: 4A3B.
Seven characters become four, and no information is lost, because 4A3B tells us
exactly how to rebuild the original: write A four times, then B
three times. Step through a longer example to see the runs being counted:
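One way to watch the runs being counted is Python's standard itertools.groupby, which groups consecutive equal items. A small sketch; the helper name runs and the example string are chosen here purely for illustration:

```python
from itertools import groupby

def runs(text):
    """Return each run in text as a (symbol, length) pair, in order."""
    return [(symbol, len(list(group))) for symbol, group in groupby(text)]

# Written in pieces so the four runs are easy to see.
example = "WWWWW" + "BB" + "WWW" + "R"

for symbol, length in runs(example):
    print(f"{length} x {symbol}")
# 5 x W
# 2 x B
# 3 x W
# 1 x R
```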
This is genuinely useful for data with big blocks of the same value — the wide areas of one colour in a simple cartoon or icon, or long stretches of blank space in a scanned document. Fax machines and simple image formats have used RLE for decades for exactly this reason.
Here is run-length encoding written out as a program. It walks through the string counting each
run, builds the compressed version like 4A3B, and then compares the two
lengths so you can see the saving. Change the original string and
press Run:
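A minimal Python sketch of such a program. The name rle_encode is illustrative, and the sketch assumes each symbol is a single character:

```python
def rle_encode(text):
    """Run-length encode text as count-then-symbol pairs, e.g. 4A3B."""
    if not text:
        return ""
    pieces = []
    current = text[0]   # symbol of the run being counted
    count = 1           # length of that run so far
    for symbol in text[1:]:
        if symbol == current:
            count += 1
        else:
            pieces.append(f"{count}{current}")  # run ended: record it
            current, count = symbol, 1
    pieces.append(f"{count}{current}")          # record the final run
    return "".join(pieces)

original = "AAAABBBAAAAAAAAC"
compressed = rle_encode(original)
print(compressed)
print(len(original), "characters ->", len(compressed), "characters")
```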
With AAAABBBAAAAAAAAC the sixteen-character original squashes down to
4A3B8A1C — just eight characters. The longer the runs, the bigger the win.
Because RLE is lossless, decoding is just as simple as encoding — and reversing it is a great way to prove nothing was lost. Read the compressed string in pairs of "number, symbol" and write each symbol out that many times:
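A minimal Python sketch of that decoder. The name rle_decode is illustrative, and the sketch assumes the symbols are never digits, which is what lets it accept counts of more than one digit:

```python
import re

def rle_decode(compressed):
    """Rebuild the original text from count-then-symbol pairs like 4A3B."""
    # Each pair is one or more digits followed by a single non-digit symbol.
    pairs = re.findall(r"(\d+)(\D)", compressed)
    return "".join(symbol * int(count) for count, symbol in pairs)

print(rle_decode("4A3B"))      # AAAABBB
print(rle_decode("4A3B8A1C"))
```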
Getting back precisely AAAABBBAAAAAAAAC is what "lossless" means. (Real
RLE has to handle runs longer than 9 and symbols that are themselves digits, but the idea is
exactly this.)
The choice is really a question about the data. Ask: would losing a tiny bit of detail matter? If the answer is yes — because it's text, code, a bank statement or a program — you must use lossless. If the answer is no, because it's a photo or a song and a human just needs it to look or sound right, lossy will give you a much smaller file.
Two traps catch people out with compression:
RLE works well on long runs such as AAAAAAAA, but on data with no repeats it actually makes things
bigger: ABCDEF (6 characters) becomes
1A1B1C1D1E1F (12 characters!), because every single symbol now drags a
"count of 1" along with it. Compression is not magic — a method that saves space on one kind
of data can waste it on another.
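Both cases can be checked with a compact encoder. A Python sketch, assuming single-character symbols; rle_encode is an illustrative name, built here on the standard itertools.groupby:

```python
from itertools import groupby

def rle_encode(text):
    """Encode each run of equal symbols as its length followed by the symbol."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(text))

for sample in ("A" * 8, "ABCDEF"):
    encoded = rle_encode(sample)
    print(sample, "->", encoded, f"({len(sample)} -> {len(encoded)} characters)")
# AAAAAAAA -> 8A (8 -> 2 characters)
# ABCDEF -> 1A1B1C1D1E1F (6 -> 12 characters)
```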
How good is a particular squeeze? We measure it with the compression ratio —
the original size divided by the compressed size. A ratio of 2:1 means the compressed file is half the size of the original.
Lossless methods on ordinary files typically reach around 2:1 or 3:1. Lossy methods on a photo can happily hit 10:1 or more, which is exactly why your holiday snaps are JPEGs and not giant exact copies.
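The calculation itself is a single division. A Python sketch with an illustrative function name, applied to two of the run-length examples from this page:

```python
def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size; above 1 means the data shrank."""
    return original_size / compressed_size

# 7 characters squeezed to 4 (AAAABBB -> 4A3B)
print(compression_ratio(7, 4))    # 1.75
# 6 characters grown to 12 (ABCDEF -> 1A1B1C1D1E1F)
print(compression_ratio(6, 12))   # 0.5
```

A result below 1, like the second one, means the "compressed" version is actually bigger than the original.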
It's tempting to imagine zipping a file, then zipping the zip, then zipping that, until a whole film fits in a single byte. It can't work. Lossless compression saves space by spotting patterns — repeats, predictable structure — and rewriting them shorter. Once those patterns are gone, there is nothing left to exploit: a truly random, pattern-free file cannot be losslessly compressed at all, and trying usually makes it a touch bigger. This is why an already-zipped file barely shrinks if you zip it again. There is a hard floor, set by how much genuine information the file really contains.
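This can be tried with zlib from Python's standard library, which uses the same DEFLATE method as most .zip files. A rough sketch: the made-up word list and the fixed random seed are only there to produce repeatable, text-like data, and the exact sizes printed will vary between zlib versions.

```python
import os
import random
import zlib

# Build some text-like data with plenty of repeated words.
rng = random.Random(0)  # fixed seed so each run uses the same data
words = ["the", "cat", "sat", "on", "a", "mat", "and", "then", "it", "slept"]
data = " ".join(rng.choice(words) for _ in range(20_000)).encode("ascii")

once = zlib.compress(data)    # first pass: the patterns get squeezed out
twice = zlib.compress(once)   # second pass: few patterns left to find
print(len(data), "->", len(once), "->", len(twice))

# Random bytes have no patterns at all, so there is nothing to gain.
noise = os.urandom(100_000)
print(len(noise), "->", len(zlib.compress(noise)))
```

The first pass should shrink the text considerably, while the second pass and the random bytes should come out about the same size as their input or slightly larger.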