Character Sets: ASCII and Unicode

A computer can only really store numbers — patterns of binary 0s and 1s. So how does it store the message you're reading? The trick is a simple agreement: every letter, digit and symbol is given its own number, called its code. Write down the code, store the code in binary, and you've stored the character.

The lookup table that says "this character means this number" is called a character set (or character encoding). It's just a code book that everyone agrees on. As long as the computer that saves your text and the computer that opens it use the same character set, the message survives the journey.

character ↔ number (a code) ↔ binary
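To make the idea of an agreed code book concrete, here is a minimal sketch using a made-up character set of just four characters. The names codeBook, reverseBook, encode and decode are invented for this illustration, and the codes 1 to 4 are arbitrary; they do not belong to any real character set.

```typescript
// A made-up code book: these four codes are arbitrary, for illustration only.
const codeBook: Record<string, number> = { H: 1, I: 2, "!": 3, " ": 4 };

// Build the reverse table (code -> character) from the same agreement.
const reverseBook: Record<number, string> = {};
for (const char of Object.keys(codeBook)) {
  reverseBook[codeBook[char]] = char;
}

// Sender: swap each character for its agreed code.
function encode(message: string): number[] {
  return message.split("").map((char) => {
    const code = codeBook[char];
    if (code === undefined) {
      throw new Error("No code agreed for character: " + char);
    }
    return code;
  });
}

// Receiver: swap each code back, using the same code book.
// A code the receiver does not know becomes "?".
function decode(codes: number[]): string {
  return codes.map((code) => reverseBook[code] ?? "?").join("");
}

const stored: number[] = encode("HI!");
console.log(stored);         // the codes 1, 2, 3
console.log(decode(stored)); // "HI!"
```

If the receiver used a different code book, the same stored numbers would come out as different characters, which is why both sides have to agree.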

There is nothing clever or secret about the numbers — someone simply had to sit down and decide them. The most famous decision is called ASCII.

ASCII: 7 bits, 128 codes

ASCII (the American Standard Code for Information Interchange, said "ass-key") gives each character a code using 7 bits. Seven bits can count from 0 to 127, which is 2^7 = 128 different codes — plenty of room for the English alphabet in capitals and lower case, the digits, punctuation, a space, and some invisible "control" characters like newline.
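The arithmetic above can be checked with a short sketch. It assumes TypeScript's built-in charCodeAt(0), which returns the code of the first character of a piece of text; toBinary7 is a hypothetical helper name chosen for this example.

```typescript
// How many different codes fit in 7 bits?
const codeCount: number = 2 ** 7;
console.log(codeCount); // 128

// Show a character's code as a 7-bit binary pattern.
// toBinary7 is a hypothetical helper, not a built-in function.
function toBinary7(char: string): string {
  const code: number = char.charCodeAt(0);
  // toString(2) writes the number in binary; padStart adds leading zeros.
  return code.toString(2).padStart(7, "0");
}

console.log(toBinary7("A")); // "1000001" (code 65)
console.log(toBinary7("a")); // "1100001" (code 97)
console.log(toBinary7("0")); // "0110000" (code 48)
console.log(toBinary7(" ")); // "0100000" (code 32)
```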

A few codes worth remembering:

Character    Code
space        32
0            48
A            65
a            97

Notice the codes run in order: A is 65, so B must be 66, C is 67, and so on. The same is true for lower-case letters (starting at 97) and for the digits (the character '0' is 48, '1' is 49, and so on). This tidy ordering is what makes it easy to sort words alphabetically or check whether a key press was a letter.
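Because each group of characters sits in one unbroken run of codes, a "was that a letter?" check can be written as a simple range test. This is a minimal sketch for plain ASCII characters only; isUpperCaseLetter and isDigit are hypothetical helper names, and charCodeAt(0) is the built-in that returns the code of the first character.

```typescript
// True if the first character is a capital letter A to Z (codes 65 to 90).
function isUpperCaseLetter(char: string): boolean {
  const code: number = char.charCodeAt(0);
  return code >= 65 && code <= 90;
}

// True if the first character is a digit '0' to '9' (codes 48 to 57).
function isDigit(char: string): boolean {
  const code: number = char.charCodeAt(0);
  return code >= 48 && code <= 57;
}

console.log(isUpperCaseLetter("Q")); // true
console.log(isUpperCaseLetter("q")); // false (lower case starts at 97)
console.log(isDigit("7"));           // true
console.log(isDigit("x"));           // false
```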

Try it — turn characters into codes

In TypeScript, "A".charCodeAt(0) asks "what is the code of the first character of this text?" and String.fromCharCode(65) does the reverse — it turns a code back into a character. Press Run, then change the letters and numbers and run it again:

    // Character -> code
    console.log("A".charCodeAt(0)); // 65
    console.log("a".charCodeAt(0)); // 97
    console.log("0".charCodeAt(0)); // 48
    console.log(" ".charCodeAt(0)); // 32 (space has a code too!)

    // Code -> character
    console.log(String.fromCharCode(66)); // "B"
    console.log(String.fromCharCode(97)); // "a"
    console.log(String.fromCharCode(48)); // "0"

Looping through the alphabet

Because the letters are stored in order, we can start at the code for A and count up 26 times, printing the character and its code at each step. Run it and watch the codes climb from 65 to 90:

    const A: number = "A".charCodeAt(0); // 65

    for (let i: number = 0; i < 26; i++) {
      const code: number = A + i;
      const letter: string = String.fromCharCode(code);
      console.log(letter + " = " + code);
    }

Change "A" to "a" and you'll print the lower-case alphabet instead, running from 97 to 122. The gap between a capital and its lower-case partner is always 32 (97 - 65 = 32).
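The fixed gap of 32 means a lower-case ASCII letter can be turned into its capital by subtracting 32 from its code. The sketch below shows the idea; toUpperAscii is a hypothetical helper written for this example, and it only handles the 26 unaccented letters. In everyday code the built-in toUpperCase() would normally be the better choice.

```typescript
// Convert one lower-case ASCII letter to upper case by subtracting 32.
// Any other character is returned unchanged.
function toUpperAscii(char: string): string {
  const code: number = char.charCodeAt(0);
  const isLowerCase: boolean = code >= 97 && code <= 122; // 'a' to 'z'
  return isLowerCase ? String.fromCharCode(code - 32) : char;
}

console.log(toUpperAscii("a")); // "A" (97 - 32 = 65)
console.log(toUpperAscii("z")); // "Z" (122 - 32 = 90)
console.log(toUpperAscii("A")); // "A" (already a capital)
console.log(toUpperAscii("5")); // "5" (not a letter, left alone)
```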

Unicode: room for every language (and emoji)

ASCII was designed for English, so it has no é, no ñ, no Greek, Arabic, Chinese or Hindi characters, and certainly no 😀. There simply aren't enough codes: 128 is tiny compared with the world's writing systems.

Unicode is the modern character set that fixes this. It keeps the first 128 codes exactly the same as ASCII (so A is still 65) but then keeps going, with room for over a million codes, of which roughly 150,000 have been assigned so far: the world's writing systems in everyday use, historical scripts, mathematical symbols, and thousands of emoji. To hold such large code numbers, each character needs more bits than ASCII's 7.

    // Unicode keeps ASCII's codes, then adds far more.
    console.log("A".charCodeAt(0));  // 65 (same as ASCII)
    console.log("é".charCodeAt(0));  // 233
    console.log("Ω".charCodeAt(0));  // 937 (Greek capital omega)
    console.log("好".charCodeAt(0)); // 22909 (a Chinese character)

    console.log(String.fromCharCode(233)); // "é"
    console.log(String.fromCharCode(937)); // "Ω"

This is the pattern behind almost every character set: more codes need more bits. ASCII buys 128 codes with 7 bits; Unicode spends more bits per character to buy room for the whole planet.
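The "more codes need more bits" rule can be turned into a small calculation: keep adding bits until the number of patterns reaches the number of codes required. In this sketch bitsNeeded is a hypothetical helper name, and the figure 1,114,112 is the total number of code positions Unicode defines. How Unicode text is actually packed into bytes when saved is a separate topic that this sketch does not cover.

```typescript
// Smallest number of bits whose patterns can cover codeCount different codes.
function bitsNeeded(codeCount: number): number {
  let bits: number = 0;
  // Each extra bit doubles the number of patterns available.
  while (2 ** bits < codeCount) {
    bits++;
  }
  return bits;
}

console.log(bitsNeeded(128));     // 7  (ASCII)
console.log(bitsNeeded(256));     // 8  (one full byte)
console.log(bitsNeeded(1114112)); // 21 (every Unicode code position)
```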

An emoji is just a character with a Unicode code, exactly like a letter. The grinning face 😀 sits at code 128512. A committee (the Unicode Consortium) meets to decide which new emoji and scripts get added each year, so somewhere, adults in a meeting genuinely voted on whether the world needed a melting-face emoji. The reason a message can arrive as a little empty box is that the code was stored correctly, but the receiving device didn't yet have a picture (a "glyph") drawn for that code.
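One practical wrinkle when looking up an emoji's code in TypeScript: charCodeAt works in 16-bit units, so it cannot return a code above 65535 in one go. For large codes such as 128512, the built-in codePointAt and String.fromCodePoint are the ones to reach for. A short sketch, where the variable name face is just for illustration:

```typescript
const face: string = "😀";

// codePointAt returns the full Unicode code, even when it is above 65535.
console.log(face.codePointAt(0)); // 128512

// String.fromCodePoint does the reverse for large codes.
console.log(String.fromCodePoint(128512)); // "😀"

// charCodeAt only sees the first 16-bit half of the emoji...
console.log(face.charCodeAt(0)); // 55357

// ...which is also why the text's length is reported as 2, not 1.
console.log(face.length); // 2
```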

Two traps catch almost everyone at first:

Case matters. A is code 65 but a is code 97 — different characters with different codes. To a computer, "Hello" and "hello" are not the same text, which is exactly why passwords can be case-sensitive.
The character '5' is not the number 5. The character '5' is stored as code 53. So "5" + "3" glues text together to make "53", while the numbers 5 + 3 add to 8. Mixing up the digit-you-see with the number-it-means is a classic bug.

See the trap for yourself

Here is the "character '5' vs number 5" mistake in action. Run it and read each line carefully:

    const charFive: string = "5"; // the CHARACTER '5', stored as code 53
    const numFive: number = 5;    // the NUMBER five

    console.log(charFive.charCodeAt(0)); // 53, the code for '5'
    console.log(charFive + "3");         // "53" (text glued together)
    console.log(numFive + 3);            // 8 (real addition)

    // Declared as string so that TypeScript accepts the comparison.
    const upper: string = "A";
    const lower: string = "a";
    console.log(upper === lower); // false: case matters!
    console.log(upper.charCodeAt(0), lower.charCodeAt(0)); // 65 97
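Once the traps are spotted, getting around them is usually a one-line job. The sketch below shows one possible approach using the built-ins Number() and toLowerCase(); the names typedA, typedB, digitValue and sameIgnoringCase are invented for this example.

```typescript
// Text typed by a user arrives as characters, not numbers.
const typedA: string = "5";
const typedB: string = "3";

console.log(typedA + typedB);                 // "53" (the trap)
console.log(Number(typedA) + Number(typedB)); // 8 (converted first, then added)

// A single digit character can also be converted using its code:
// '0' is 48, so subtracting 48 leaves the digit's value.
function digitValue(char: string): number {
  return char.charCodeAt(0) - 48;
}
console.log(digitValue("5")); // 5 (53 - 48)

// Compare text while ignoring case by lower-casing both sides first.
function sameIgnoringCase(a: string, b: string): boolean {
  return a.toLowerCase() === b.toLowerCase();
}
console.log(sameIgnoringCase("Hello", "hello")); // true
console.log(sameIgnoringCase("Hello", "Help"));  // false
```

Whether to ignore case is a design decision: it suits usernames or search boxes, but a password check would normally keep the comparison case-sensitive.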