It is payday, and your bank balance is
That is a race condition: the outcome depends on the exact, unpredictable order in
which two threads run.
A race needs exactly two ingredients, and both must be present: the data is shared between threads, and at least one thread writes to it.
Take either ingredient away and the race evaporates. If the data is never shared — each thread has its own private copy — there is nothing to fight over. If the data is shared but never modified — everyone only reads — then it doesn't matter who reads first, because a read never changes what the next reader sees. Immutable data and thread-local data are both perfectly safe under concurrency. It is the collision of shared and writeable that opens the door.
Here is the entire problem in one line. A shared counter starts at some value, and two threads each run count = count + 1 once. You would bet your lunch the counter ends two higher than it started.
count = count + 1 is not one action: it is three

The trouble starts with a lie the source code tells you. That single, innocent-looking assignment is not carried out by the hardware in one indivisible step. Underneath, the CPU does three separate things:

1. read count from memory into a register;
2. add 1 to the value in the register;
3. write the register back to count in memory.

This read–modify–write trio is three distinct moments in time, and the operating system's scheduler is free to pause a thread between any two of them and let another thread run for a while. The gap between reading a value and writing back the result is where the whole family of concurrency bugs lives. When both threads read the old value before either has written the new one, one of the two increments simply vanishes: the classic lost update.
People use the terms data race and race condition interchangeably, but they name slightly different sins. A data race is the narrow, mechanical event of two threads accessing the same memory location at the same time with at least one write and no synchronisation between them, a property a tool or the language memory model can detect. A race condition is the broader, logical problem: your program's correctness depends on timing. You can have a race condition with no data race at all, for example two threads that each take a lock correctly but in an order that produces the wrong answer (a bank transfer that checks the balance, releases the lock, then withdraws). Fixing every data race does not automatically fix every race condition. The two overlap heavily, but neither strictly contains the other.
We need two precise words. An interleaving is one particular merge of the individual steps of several threads into a single sequence — as if a dealer shuffled two decks together, keeping each deck's own order but mixing them against each other. A schedule is the specific interleaving the scheduler actually chose on a given run.
Because each thread's steps must stay in order but can be separated by the other thread's steps, the
number of possible interleavings explodes fast. Two threads of three steps each already admit 20 of them: choosing which 3 of the 6 slots go to the first thread gives C(6, 3) = 20.
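To make the count concrete, here is a minimal sketch that enumerates every interleaving of two threads while keeping each thread's own steps in order. The function name `interleavings` and the step labels such as `A.read` are illustrative choices, not names from the original page.

```javascript
// Enumerate every merge of two step lists that preserves each list's order.
// At each point the next step comes either from thread A or from thread B.
function interleavings(stepsA, stepsB) {
  if (stepsA.length === 0) return [stepsB.slice()];
  if (stepsB.length === 0) return [stepsA.slice()];
  const startWithA = interleavings(stepsA.slice(1), stepsB)
    .map((rest) => [stepsA[0], ...rest]);
  const startWithB = interleavings(stepsA, stepsB.slice(1))
    .map((rest) => [stepsB[0], ...rest]);
  return [...startWithA, ...startWithB];
}

// Hypothetical labels for the read, modify and write steps of each thread.
const threadA = ["A.read", "A.add", "A.write"];
const threadB = ["B.read", "B.add", "B.write"];
const allInterleavings = interleavings(threadA, threadB);

console.log(allInterleavings.length); // 20
```

The recursion branches once per remaining step of either thread, so the output grows combinatorially as steps or threads are added, which is why checking interleavings by hand stops being practical almost immediately.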
Let's freeze the scheduler and make it choose the worst possible order. Both threads want to add 1 to count.
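The disaster is pure timing. Thread A reads count, and Thread B reads the same value before A has written anything back. Each adds 1 to its own stale copy and writes the same result, so two increments ran but count rose by
only one. A's update was silently clobbered: a lost update.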
Change the interleaving so that A fully finishes read-modify-write before B starts, and you get the
correct result: count rises by two.
The sandbox below runs on a single thread, so it cannot produce a real race — which is exactly what makes it useful. Because we control the timeline, we can script the precise interleaving and make the bug happen every time instead of one run in a thousand. The program models each thread's read/modify/write as explicit steps, then plays several different schedules of the same two increments and prints the final value for each.
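A minimal sketch of a program along these lines follows. It assumes a starting value of 0 and encodes a schedule as a string in which each letter names the thread that takes the next step; `runSchedule` and the specific schedules are hypothetical choices made for illustration.

```javascript
// Play one schedule. Each thread performs three steps in order:
// step 0 reads count into its register, step 1 adds 1, step 2 writes back.
function runSchedule(schedule, start = 0) {
  let count = start; // the shared counter (starting value is an assumption)
  const threads = {
    A: { step: 0, register: 0 },
    B: { step: 0, register: 0 },
  };
  for (const name of schedule) {
    const t = threads[name];
    if (t.step === 0) t.register = count;        // read
    else if (t.step === 1) t.register += 1;      // modify
    else if (t.step === 2) count = t.register;   // write
    t.step += 1;
  }
  return count;
}

// Two well-behaved schedules followed by three that interleave the steps.
const schedules = ["AAABBB", "BBBAAA", "ABABAB", "AABABB", "ABBBAA"];
for (const s of schedules) {
  console.log(s, "->", runSchedule(s));
}
// AAABBB -> 2
// BBBAAA -> 2
// ABABAB -> 1
// AABABB -> 1
// ABBBAA -> 1
```

In this model, every schedule in which one thread reads before the other has written loses an update; only the two schedules that keep each thread's three steps together give the correct total.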
Nothing in the increments changed between runs; only the order did. Yet the final value swings between
one increment and two, depending on the schedule.
On a real multi-core machine you do not get to script the schedule — the hardware and OS pick one for you, differently on every run, depending on cache state, interrupts, what else the machine is doing, and plain luck. A racy program is therefore nondeterministic: the same input can give different outputs on different runs. Very often the bad interleaving is rare, so the program appears to work — passing your tests 999 times out of 1000, then failing in production at 3 a.m.
Worse still, races are famously shy. The moment you try to observe one — add a
print, attach a debugger, single-step — you change the timing, the rare interleaving
stops happening, and the bug "disappears." A bug that vanishes when you look at it is nicknamed a
Heisenbug, a pun on Heisenberg, whose uncertainty principle is popularly linked to the observer effect: measurement disturbs the thing measured.
You add a console.log inside the loop, the crash stops, you ship it. You have fixed
nothing. The extra I/O merely slowed one thread by a few microseconds and nudged the
scheduler away from the unlucky interleaving — this time, on this machine, under this load. The
race is still there, waiting for a faster CPU, a different core count, or a heavier server to line
the steps up again. Changing timing is not the same as enforcing correctness. The only real fix is
to make the shared access atomic or mutually exclusive, so the
bad interleaving becomes impossible rather than merely unlikely.
Lost updates are one flavour of race. Another huge family is check-then-act: a thread inspects some shared state, decides what to do based on what it saw, and then acts — but between the check and the act, another thread changes the state, so the decision is now based on stale information. Our ATM story was exactly this: check "is the balance ≥ $100?", then act "withdraw $100" — with a fatal gap in between.
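The check-then-act gap can be reproduced deterministically on a single thread by using generators to pause between the check and the act. This is only a sketch: the starting balance of 100, the customer labels and the function names are assumptions made for illustration.

```javascript
// Racy version: the check and the act are separated by a pause (yield),
// during which the other withdrawal is allowed to run.
function* withdraw(account, amount, log, who) {
  const allowed = account.balance >= amount; // check
  yield;                                     // gap: the other customer runs here
  if (allowed) {                             // act on a possibly stale decision
    account.balance -= amount;
    log.push(who + " withdrew " + amount);
  }
}

const account = { balance: 100 };            // assumed starting balance
const log = [];
const a = withdraw(account, 100, log, "A");
const b = withdraw(account, 100, log, "B");
a.next(); b.next();                          // both check: both see 100
a.next(); b.next();                          // both act: both withdraw
console.log(account.balance);                // -100

// Fused version: in this single-threaded model a function with no pause
// cannot be interrupted, so the check and the act form one step.
function withdrawFused(acct, amount) {
  if (acct.balance < amount) return false;
  acct.balance -= amount;
  return true;
}

const safe = { balance: 100 };
const firstOk = withdrawFused(safe, 100);    // true
const secondOk = withdrawFused(safe, 100);   // false
console.log(firstOk, secondOk, safe.balance); // true false 0
```

On real threads the fused version would still need a lock or an atomic operation around it; the sketch only shows that removing the gap removes the bug.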
When the shared thing being checked is a resource — a file, a lock file, a network port — this
pattern has a specific name: a TOCTOU race, for Time Of Check To Time Of
Use. A program checks "does this file exist / am I allowed to open it?" and then, a beat later,
opens it. In that beat an attacker can swap the file for a symbolic link to /etc/passwd,
and the program dutifully writes to a file the check never approved. TOCTOU bugs are a whole
category of real-world security vulnerabilities, not just wrong-answer bugs — the
window between check and use is an attacker's playground.
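One common way to close the window for file creation, sketched here for Node.js, is to let the operating system perform the check and the create as one operation instead of checking first and opening later. The `"wx"` flag of `fs.openSync` fails with `EEXIST` if the path already exists, and on POSIX systems an exclusive create also refuses a path that is a symbolic link. The helper names and the temporary file name are hypothetical.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Racy: the existence check and the write are two separate steps.
function createRacy(file, data) {
  if (fs.existsSync(file)) return false; // time of check
  fs.writeFileSync(file, data);          // time of use: the path may have changed
  return true;
}

// Fused: "wx" creates the file only if the path does not already exist,
// so there is no gap between the check and the use.
function createExclusive(file, data) {
  try {
    const fd = fs.openSync(file, "wx");
    try {
      fs.writeSync(fd, data);
    } finally {
      fs.closeSync(fd);
    }
    return true;
  } catch (err) {
    if (err.code === "EEXIST") return false; // something was already there
    throw err;
  }
}

// Demonstration in a fresh temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "toctou-"));
const target = path.join(dir, "report.txt");
const firstCreate = createExclusive(target, "first");   // true
const secondCreate = createExclusive(target, "second"); // false
const contents = fs.readFileSync(target, "utf8");       // "first"
fs.rmSync(dir, { recursive: true });
console.log(firstCreate, secondCreate, contents);
```

This covers only the create-if-absent case; other TOCTOU patterns, such as permission checks, need their own fused operations.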
Notice the shape shared by every example: a value is observed, a decision hinges on it, and the value
can change before the decision is carried out. The cure, again, is to fuse check-and-act into one
indivisible step so nothing can slip into the gap — the subject of
Every race we have met has the same root and the same remedy. The root is that a sequence of steps that should be one indivisible action can be interrupted partway. The remedy is to restore indivisibility — to make sure that while one thread is in the middle of touching shared state, no other thread can observe or modify it.
Both remedies, atomic operations and mutual exclusion, make the dangerous interleavings impossible rather than merely improbable, which is the only kind of fix that survives a faster CPU and a heavier load. How locks, mutexes and semaphores actually deliver that guarantee is the next page.
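As a small illustration of restoring indivisibility, JavaScript exposes atomic read-modify-write operations through the built-in `Atomics` object, which operates on integer typed arrays backed by a `SharedArrayBuffer`. The sketch below runs on one thread, so it only shows the shape of the API; the guarantee matters once the same buffer is handed to worker threads. The function names are illustrative.

```javascript
// One shared 32-bit integer cell, initialised to 0.
const shared = new Int32Array(new SharedArrayBuffer(4));

// Still racy across threads: the load and the store are each atomic,
// but another thread could run between them and its update would be lost.
function splitIncrement(cells) {
  const register = Atomics.load(cells, 0);
  Atomics.store(cells, 0, register + 1);
}

// Indivisible: read, add and write happen as a single operation.
function atomicIncrement(cells) {
  return Atomics.add(cells, 0, 1); // returns the value before the add
}

const valueBefore = atomicIncrement(shared);
atomicIncrement(shared);
const valueAfter = Atomics.load(shared, 0);
console.log(valueBefore, valueAfter); // 0 2
```

`splitIncrement` is worth a second look: it contains no unsynchronised access, yet the gap between its two steps is still there, which is why making each access atomic is not the same as making the whole update atomic.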
Yes — and it is one of the most sobering stories in computing. The Therac-25, a radiation-therapy machine from the mid-1980s, had a race condition in its control software. If an operator typed the treatment settings unusually fast, a background thread setting up the machine's hardware could interleave badly with the thread reading the console, leaving the machine in an inconsistent state — firing a high-power electron beam without the protective target in place. Because it only triggered on a fast, specific keystroke timing, it passed testing and worked almost every time. Between 1985 and 1987 it delivered massive radiation overdoses to at least six patients, several fatally. It is now the canonical case study in software-safety courses: proof that "it works 999 times out of 1000" is not good enough when the thousandth time matters, and that nondeterministic timing bugs must be designed out, not tested away.