Hash Tables

Imagine a school register with two thousand pupils, and you need to find one pupil's record from their name. With an array you'd have to check the records one by one until you hit the right name — up to two thousand comparisons for an unlucky search. A phone book helps because it's sorted, but keeping things sorted as you add and remove them is its own headache.

A hash table pulls off something that feels like magic: it stores key → value pairs (name → record, word → definition, username → account) and lets you fetch a value from its key almost instantly, no matter how many pairs it holds. Instead of searching for where a key lives, it calculates where the key lives — in one step. That trick is the single idea of this page, and it powers dictionaries in Python, objects and Map in JavaScript/TypeScript, database indexes, and much more besides.

The idea: turn the key into an array index

Underneath, a hash table is just an ordinary array of slots, which we'll call buckets. The clever part is the hash function: a small piece of code that takes a key and crunches it down to a whole number — the index of the bucket where that key's value should live.

\text{index} = \operatorname{hash}(\text{key}) \bmod \text{tableSize}

The mod (remainder) step is what squashes any number the hash produces into a valid index — a value from 0 up to tableSize - 1, so it always lands inside the array. To store a pair you hash the key and drop the value in that bucket; to look it up later you hash the same key, go straight to the same bucket, and there it is. You never scan the whole array — you compute the address and jump to it:

Because computing a hash and then indexing an array both take the same tiny amount of time whether the table holds ten items or ten million, lookup, insertion and deletion are all constant time, written O(1). That's the headline: a hash table doesn't get slower as it grows.

A hash function you can read

A hash function for text can be as simple as this: add up the character code of every letter, then take the remainder when divided by the table size. (Every character has a numeric code — 'A' is 65, 'B' is 66, and so on; charCodeAt reads it.) Press Run and watch three keys turn into bucket indices:

const TABLE_SIZE = 7; function hash(key: string): number { let total = 0; for (let i = 0; i < key.length; i++) { total += key.charCodeAt(i); // add each character's numeric code } return total % TABLE_SIZE; // squash into a bucket 0..6 } console.log("hash('cat') =", hash("cat")); console.log("hash('dog') =", hash("dog")); console.log("hash('bird') =", hash("bird"));

The same key always produces the same index — that's what makes lookup possible. A hash function must be deterministic (same input, same output every time) and fast to compute; the best ones also spread different keys as evenly as possible across the buckets, for a reason we're about to meet.

Storing and retrieving with it

Put the hash function to work as a tiny dictionary. To store, hash the key and place the value in that bucket; to look up, hash again and read the same bucket. Notice there is no searching anywhere — the index is computed both times:

This works beautifully — right up until two different keys land in the same bucket. And with only seven buckets, that is going to happen sooner than you might think.

Collisions: when two keys want the same bucket

A collision is when hash(keyA) and hash(keyB) give the same index for two different keys. Our simple "sum the character codes" hash collides very easily — any two words made of the same letters (an anagram!) add up to exactly the same total. Watch "listen" and "silent" collide, and see the danger:

const TABLE_SIZE = 7; const buckets: (string | undefined)[] = new Array(TABLE_SIZE); function hash(key: string): number { let total = 0; for (let i = 0; i < key.length; i++) total += key.charCodeAt(i); return total % TABLE_SIZE; } console.log("hash('listen') =", hash("listen")); console.log("hash('silent') =", hash("silent")); // same index — a collision! // Naively overwriting is a DISASTER: buckets[hash("listen")] = "to hear"; buckets[hash("silent")] = "making no sound"; // clobbers 'listen'! console.log("Looking up 'listen':", buckets[hash("listen")]); // wrong answer

The value for "listen" has been silently overwritten. A hash table is useless unless it has a plan for collisions — and it always needs one, because collisions can never be fully avoided (more on that in the warning below). There are two classic plans.

Resolving collisions #1: chaining

With chaining (also called separate chaining), each bucket doesn't hold a single value — it holds a small list of all the pairs that hashed there. To store, you append to the bucket's list; to look up, you go to the bucket and scan its (usually very short) list for the matching key. Colliding keys simply queue up together instead of clobbering one another:

The chain is why we store the whole key alongside the value, and compare on it inside the bucket: the index alone can't tell "listen" from "silent", so we confirm by matching the actual key. As long as the chains stay short, lookup is still effectively O(1).

Resolving collisions #2: open addressing

The other approach keeps just one value per bucket but, on a collision, goes looking for another free bucket. This is open addressing, and the simplest version is linear probing: if the computed bucket is taken, try the next one along, then the next, wrapping around the end of the array, until you find an empty slot.

\text{try } \operatorname{hash}(k),\; \operatorname{hash}(k)+1,\; \operatorname{hash}(k)+2,\; \dots \pmod{\text{tableSize}}

Look-up follows the very same trail: start at the hashed bucket and step forward until you either find the key or hit an empty slot (which means it isn't there). Run it and watch "silent" get bumped one place past the bucket "listen" already occupies:

Chaining is simpler and copes gracefully when a table gets crowded; open addressing avoids the extra lists and keeps everything in one tidy array, which the hardware often likes. Both are widely used — what matters is that some strategy is always there.

The load factor: how full is too full?

The speed of a hash table depends on how crowded it is. The load factor measures exactly that — the number of stored items divided by the number of buckets:

\lambda = \frac{\text{number of items}}{\text{number of buckets}}

A load factor of 0.5 means the table is half full; 1.0 means there are as many items as buckets. As \lambda climbs, collisions pile up: chains grow longer, probe trails grow longer, and lookup drifts away from the lovely O(1) toward a slow linear scan. (For open addressing it's worse — once \lambda nears 1 there's barely a free slot left, so probing crawls.)

The cure is resizing (also called rehashing): when the load factor crosses a threshold — commonly around 0.7 — the table makes a bigger array (often double the size) and re-inserts every item using the new table size. This keeps the load factor low and lookups fast, which is why a hash table can grow indefinitely and still behave like O(1) on average.

The honest answer is that O(1) is the average case, not a guarantee. In a pathological situation — a terrible hash function, or an attacker deliberately feeding keys they know will collide — every key could land in one bucket, turning the table into a single long list with O(n) lookups. This is a real security issue (a "hash-flooding" denial-of-service attack), and it's why serious languages use hash functions that mix in a secret random seed so an outsider can't predict the collisions. Keep the hash good and the load factor low, and O(1) is what you get in practice.

Collisions are not a sign of a bug — they are unavoidable, and the maths proves it. By the pigeonhole principle, if you have more keys than buckets then at least two keys must share a bucket, no matter how brilliant the hash function is. And there are always vastly more possible keys (every possible word, name or number) than the handful of buckets in the array. So a hash table without a collision strategy is simply broken; handling collisions is not optional.

The subtler trap is a poor hash function. A hash that clusters keys into just a few buckets — leaving most empty — creates long chains or long probe trails, and your beautiful O(1) quietly rots into O(n): you're back to scanning a list. Our "sum the character codes" hash is a good teaching example but a poor real one, because all anagrams collide and short words only ever reach small totals. A good hash function scatters keys evenly across all the buckets — that even spread is what keeps chains short and lookups fast.