Character Sets: ASCII and Unicode
A computer can only really store numbers: patterns of binary 0s and 1s. So how does it store the
message you're reading? The trick is a simple agreement: every letter, digit and symbol is
given its own number, called its code. Write down the code,
store the code in binary, and you've stored the character.
The lookup table that says "this character means this number" is called a
character set (or character encoding). It's just a code book that
everyone agrees on. As long as the computer that saves your text and the computer
that opens it use the same character set, the message survives the journey.
character ↔ number (a code) ↔ binary
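The chain above can be followed one step at a time in code. This is only a sketch: the helper name toBinary is made up for this example, and the width of 7 binary digits is an assumption chosen for illustration.

```typescript
// Hypothetical helper (not part of the lesson): character -> code -> binary.
function toBinary(character: string, width: number = 7): string {
  const code: number = character.charCodeAt(0); // character -> number
  let bits: string = code.toString(2); // number -> binary digits
  while (bits.length < width) {
    bits = "0" + bits; // pad on the left so every result has the same width
  }
  return bits;
}

console.log(toBinary("A")); // 1000001
console.log(toBinary("a")); // 1100001
console.log(toBinary(" ")); // 0100000
```

Reading a result from right to left, the digits stand for 1, 2, 4, 8, 16, 32 and 64, so 1000001 is 64 + 1 = 65.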
There is nothing clever or secret about the numbers — someone simply had to sit down and
decide them. The most famous decision is called ASCII.
ASCII: 7 bits, 128 codes
ASCII (the American Standard Code for Information Interchange, said
"ass-key") gives each character a code using 7 bits. Seven bits can count
from 0 to 127, which is
2^7 = 128 different codes — plenty of room for the English
alphabet in capitals and lower case, the digits, punctuation, a space, and some invisible
"control" characters like newline.
A few codes worth remembering: the space is 32, the character '0' is 48, A is 65 and a is 97.
Notice the codes run in order: A is
65, so B must be
66, C is
67, and so on. The same is true for lower-case letters (starting at
97) and for the digits (the character '0' is 48, '1' is 49…). This tidy ordering
is what makes it easy to sort words alphabetically or check whether a key press was a letter.
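Because the codes sit in unbroken runs, checking what kind of character you have can be as simple as a range test. A minimal sketch, where the helper names isUpperCaseLetter and isDigit are invented for this example:

```typescript
// Hypothetical helpers (names are ours): classify a character by its code.
function isUpperCaseLetter(character: string): boolean {
  const code: number = character.charCodeAt(0);
  return code >= 65 && code <= 90; // "A" (65) up to "Z" (90)
}

function isDigit(character: string): boolean {
  const code: number = character.charCodeAt(0);
  return code >= 48 && code <= 57; // "0" (48) up to "9" (57)
}

console.log(isUpperCaseLetter("Q")); // true
console.log(isUpperCaseLetter("q")); // false
console.log(isDigit("7")); // true
console.log(isDigit("x")); // false

// Alphabetical order is just numerical order of the codes.
console.log("B".charCodeAt(0) < "C".charCodeAt(0)); // true
```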
Try it — turn characters into codes
In TypeScript, "A".charCodeAt(0) asks "what is the code of the first character
of this text?" and String.fromCharCode(65) does the reverse — it turns a code
back into a character. Press Run, then change the letters and numbers and
run it again:
// Character -> code
console.log("A".charCodeAt(0)); // 65
console.log("a".charCodeAt(0)); // 97
console.log("0".charCodeAt(0)); // 48
console.log(" ".charCodeAt(0)); // 32 (space has a code too!)
// Code -> character
console.log(String.fromCharCode(66)); // "B"
console.log(String.fromCharCode(97)); // "a"
console.log(String.fromCharCode(48)); // "0"
Looping through the alphabet
Because the letters are stored in order, we can start at the code for
A and count up 26 times, printing the
character and its code at each step. Run it and watch the codes climb from
65 to 90:
const A: number = "A".charCodeAt(0); // 65
for (let i: number = 0; i < 26; i++) {
  const code: number = A + i;
  const letter: string = String.fromCharCode(code);
  console.log(letter + " = " + code);
}
Change "A" to "a" and you'll print the lower-case alphabet instead,
running from 97 to 122. The gap between
a capital and its lower-case partner is always 32
(97 - 65 = 32).
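That fixed gap of 32 means a lower-case letter can be turned into its capital with plain subtraction. The sketch below assumes we only care about the 26 English letters, and the helper name toUpperCaseLetter is made up for this example (in real programs the built-in toUpperCase() is the usual choice):

```typescript
// Hypothetical helper (name is ours): capitalise one English letter
// by subtracting the gap of 32 from its code.
function toUpperCaseLetter(character: string): string {
  const code: number = character.charCodeAt(0);
  const isLowerCase: boolean = code >= 97 && code <= 122; // "a" to "z"
  return isLowerCase ? String.fromCharCode(code - 32) : character;
}

console.log(toUpperCaseLetter("h")); // "H"
console.log(toUpperCaseLetter("H")); // "H" (already a capital, left alone)
console.log(toUpperCaseLetter("?")); // "?" (not a letter, left alone)
```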
Unicode: room for every language (and emoji)
ASCII was designed for English, so it has no é, no ñ, no Greek, Arabic, Chinese or Hindi
characters, and certainly no 😀. There simply aren't enough codes: 128 is
tiny compared with the world's writing systems.
Unicode is the modern character set that fixes this. It keeps the first
128 codes exactly the same as ASCII (so
A is still 65) but then keeps going,
with room for more than a million codes, of which well over a hundred thousand are already assigned: every alphabet in
use, historical scripts, mathematical symbols, and thousands of emoji. To hold such large
code numbers, each character needs more bits than ASCII's 7.
// Unicode keeps ASCII's codes, then adds far more.
console.log("A".charCodeAt(0)); // 65 (same as ASCII)
console.log("é".charCodeAt(0)); // 233
console.log("Ω".charCodeAt(0)); // 937 (Greek capital omega)
console.log("好".charCodeAt(0)); // 22909 (a Chinese character)
console.log(String.fromCharCode(233)); // "é"
console.log(String.fromCharCode(937)); // "Ω"
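One caveat if you try larger codes: charCodeAt reads text in 16-bit units, so any code above 65535 (which is where emoji such as 😀, code 128512, live) comes back as two halves rather than one number. The sketch below joins the halves by hand to show what is going on. The helper name fullCode is invented for this example, and modern JavaScript also has a built-in codePointAt(0) that does the same job.

```typescript
// Hypothetical helper (name is ours): read the full code of the first
// character, joining two 16-bit halves when the character needs both.
function fullCode(text: string): number {
  const first: number = text.charCodeAt(0);
  const isFirstHalf: boolean = first >= 0xd800 && first <= 0xdbff;
  if (!isFirstHalf || text.length < 2) {
    return first; // fits in a single unit, like "A" or "é"
  }
  const second: number = text.charCodeAt(1);
  const isSecondHalf: boolean = second >= 0xdc00 && second <= 0xdfff;
  if (!isSecondHalf) {
    return first; // no matching second half, so report the unit as it is
  }
  // Each half carries 10 bits; joined codes start at 65536 (0x10000).
  return (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
}

console.log("😀".charCodeAt(0)); // 55357 (only the first half)
console.log(fullCode("😀")); // 128512
console.log(fullCode("A")); // 65
```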
This is the pattern behind almost every character set: more codes need more
bits. ASCII buys 128 codes with 7 bits; Unicode spends more bits per character to
buy room for the whole planet.
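The link between codes and bits can be worked out by doubling: each extra bit doubles how many codes fit. A small sketch, where the helper name bitsNeeded is made up for this example and the figure of 100,000 is just a round number standing in for "a hundred thousand characters":

```typescript
// Hypothetical helper (name is ours): the smallest number of bits
// that gives at least `codeCount` different codes.
function bitsNeeded(codeCount: number): number {
  let bits: number = 0;
  let capacity: number = 1; // with 0 bits there is only one possible pattern
  while (capacity < codeCount) {
    capacity *= 2; // one more bit doubles the number of patterns
    bits++;
  }
  return bits;
}

console.log(bitsNeeded(128)); // 7 (ASCII)
console.log(bitsNeeded(129)); // 8 (one code too many for 7 bits)
console.log(bitsNeeded(100000)); // 17
```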
An emoji is just a character with a Unicode code, exactly like a letter. The grinning face
😀 sits at code 128512. A
committee (the Unicode Consortium) meets to decide which new emoji and scripts get added
each year — so somewhere, adults in a meeting genuinely voted on whether the world needed a
melting-face emoji. The reason a message can arrive as a little empty box is that the code
was stored correctly, but the receiving device didn't yet have a picture (a
"glyph") drawn for that code.
Two traps catch almost everyone at first:
- Case matters. A is code 65 but a is code 97: different characters with different codes.
  To a computer, "Hello" and "hello" are not the same text, which is exactly why passwords
  can be case-sensitive.
- The character '5' is not the number 5. The character '5' is stored as code 53. So
  "5" + "3" glues text together to make "53", while the numbers 5 + 3 add to 8. Mixing up
  the digit-you-see with the number-it-means is a classic bug.
See the trap for yourself
Here is the "character '5' vs number 5" mistake in action. Run it and read each line carefully:
const charFive: string = "5"; // the CHARACTER '5', stored as code 53
const numFive: number = 5; // the NUMBER five
console.log(charFive.charCodeAt(0)); // 53, the code for '5'
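console.log(charFive + "3"); // "53" (text glued together)
console.log(numFive + 3); // 8 (real addition)
// Stored in variables, because TypeScript rejects comparing two different literals directly.
const upper: string = "A";
const lower: string = "a";
console.log(upper === lower); // false: case matters!
console.log(upper.charCodeAt(0), lower.charCodeAt(0)); // 65 97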
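Both traps have a simple way out. One possible sketch: Number(...) turns digit text into a real number before adding, and toLowerCase() puts two pieces of text into the same case before comparing. The helper names addAsNumbers and sameIgnoringCase are invented for this example.

```typescript
// Hypothetical helper (name is ours): treat two pieces of digit text as numbers.
function addAsNumbers(first: string, second: string): number {
  return Number(first) + Number(second); // convert first, then add
}

// Hypothetical helper (name is ours): compare text while ignoring case.
function sameIgnoringCase(first: string, second: string): boolean {
  return first.toLowerCase() === second.toLowerCase();
}

const typedFive: string = "5";
const typedThree: string = "3";

console.log(typedFive + typedThree); // "53" (the trap: text glued together)
console.log(addAsNumbers(typedFive, typedThree)); // 8 (converted to numbers first)

console.log(sameIgnoringCase("Hello", "hello")); // true
console.log(sameIgnoringCase("Hello", "Help")); // false
```

Ignoring case suits things like search boxes. For passwords you would normally keep the strict comparison.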