Race Conditions

It is payday, and your bank balance is $100. At the very same instant, you tap your card at an ATM downtown and your flatmate, who shares the account, taps theirs across town. Both machines read the balance: $100. Both check "is there enough for a $100 withdrawal?" and both get a yes. Both dispense five crisp twenties. Both write the new balance back: $0. Two hundred dollars just walked out of the building, and the bank's ledger insists you only spent one hundred. Nobody wrote a bug. The timing was the bug.

That is a race condition: the outcome depends on the exact, unpredictable order in which two threads happen to interleave while touching the same shared state — and that order is not something your program controls. This page is about why races happen, the vocabulary to talk about them precisely, and why they are so uniquely horrible to debug. The cure — making operations mutually exclusive or atomic — we only preview here; locks earn a page of their own.

The root cause: shared mutable state

A race needs exactly two ingredients, and both must be present:

Shared — two or more threads can reach the same piece of memory; and
Mutable — at least one of them writes to it.

Take either ingredient away and the race evaporates. If the data is never shared — each thread has its own private copy — there is nothing to fight over. If the data is shared but never modified — everyone only reads — then it doesn't matter who reads first, because a read never changes what the next reader sees. Immutable data and thread-local data are both perfectly safe under concurrency. It is the collision of shared and writeable that opens the door.

Here is the entire problem in one line. A shared counter starts at 0. Two threads each run count = count + 1 once. You would bet your lunch the counter ends at 2. Distressingly often, it ends at 1.

    let count = 0;          // shared, mutable: both conditions met

    // Thread A             // Thread B
    count = count + 1;      count = count + 1;

    // You expect count === 2. Sometimes count === 1. How?

`count = count + 1` is not one action — it is three

The trouble starts with a lie the source code tells you. That single, innocent-looking assignment is not carried out by the hardware in one indivisible step. Underneath, the CPU does three separate things:

Read the current value of count from memory into a register;
Modify it — add one inside the register;
Write the new value back to memory.

This read–modify–write trio is three distinct moments in time, and the operating system's scheduler is free to pause a thread between any two of them and let another thread run for a while. The gap between reading a value and writing back the result is where the whole family of concurrency bugs lives. When both threads read the old value before either has written the new one, one of the two increments simply vanishes — the classic lost update.

People use "data race" and "race condition" interchangeably, but the terms name slightly different sins. A data race is the narrow, mechanical event of two threads accessing the same memory location at the same time, with at least one write and no synchronisation between them; it is a property that a tool or the language's memory model can detect. A race condition is the broader, logical problem: your program's correctness depends on timing. You can have a race condition with no data race at all, for example a bank transfer that takes the lock to check the balance, releases it, then takes it again to withdraw. Every access is properly locked, yet another thread can slip in between the two steps. Fixing every data race does not automatically fix every race condition. The two categories overlap heavily, but neither strictly contains the other.

Interleavings and schedules

We need two precise words. An interleaving is one particular merge of the individual steps of several threads into a single sequence — as if a dealer shuffled two decks together, keeping each deck's own order but mixing them against each other. A schedule is the specific interleaving the scheduler actually chose on a given run.

Because each thread's steps must stay in order but can be separated by the other thread's steps, the number of possible interleavings explodes fast. Two threads of three steps each already admit C(6, 3) = 6! / (3! × 3!) = 20 distinct interleavings. Your program is correct only if every single one of those orderings gives the right answer. A race condition is precisely the situation where some interleavings are fine and some are wrong, and you don't get to pick which one runs.
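To make that count concrete, the sketch below enumerates every interleaving of two three-step threads and replays each one against a counter starting at 0. It assumes the read/modify/write model described earlier on this page; the names Step, interleavings and finalCount are illustrative only and not part of any library.

```typescript
// Illustrative model: each thread's increment is three steps,
// R (read), M (modify), W (write), which must stay in that order.
type Step = { thread: 0 | 1; kind: "R" | "M" | "W" };

const KINDS = ["R", "M", "W"] as const;

// Build every merge of the two step lists that keeps each thread's own order.
function interleavings(done0 = 0, done1 = 0, prefix: Step[] = []): Step[][] {
  if (done0 === 3 && done1 === 3) return [prefix];
  const out: Step[][] = [];
  if (done0 < 3) {
    out.push(...interleavings(done0 + 1, done1, [...prefix, { thread: 0, kind: KINDS[done0] }]));
  }
  if (done1 < 3) {
    out.push(...interleavings(done0, done1 + 1, [...prefix, { thread: 1, kind: KINDS[done1] }]));
  }
  return out;
}

// Replay one schedule against a shared counter that starts at 0.
function finalCount(schedule: Step[]): number {
  let count = 0;
  const reg = [0, 0]; // each thread's private register
  for (const { thread, kind } of schedule) {
    if (kind === "R") reg[thread] = count;
    else if (kind === "M") reg[thread] += 1;
    else count = reg[thread];
  }
  return count;
}

const all = interleavings();
const correct = all.filter((s) => finalCount(s) === 2).length;
const lost = all.filter((s) => finalCount(s) === 1).length;

console.log("interleavings:", all.length); // 20
console.log("end at 2:", correct);         // 2
console.log("end at 1:", lost);            // 18
```

Under this model only the two fully serial orders reach 2; every other ordering lets one thread read before the other has written, so an increment is lost.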

A losing interleaving, step by step

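Let's freeze the scheduler and make it choose the worst possible order. Both threads want to add 1 to a counter starting at 0. Read down the table: each row is one CPU step, and the "reg" columns are each thread's private register ("-" means nothing has been loaded yet). Watch the moment the two reads overlap.

Step  Thread  Action              reg A  reg B  count
1     A       read count          0      -      0
2     B       read count          0      0      0
3     A       add 1 in register   1      0      0
4     B       add 1 in register   1      1      0
5     A       write register      1      1      1
6     B       write register      1      1      1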

The disaster is pure timing. Thread A reads 0, and before it can write its result back, thread B also reads 0. Now both are computing 0 + 1 = 1. A writes 1; B writes 1 on top of it. Two increments happened, yet count rose by only one. A's update was silently clobbered — a lost update.

Change the interleaving — let A fully finish read-modify-write before B starts — and you get the correct 2. The same code yields a different answer depending purely on scheduler timing. Code that misbehaves only on unlucky interleavings is a nightmare, precisely because it usually works.

Run the race yourself

The program below runs on a single thread, so it cannot produce a real race, which is exactly what makes it useful. Because we control the timeline, we can script the precise interleaving and make the bug happen every time instead of one run in a thousand. The program models each thread's read/modify/write as explicit steps, then plays several different schedules of the same two increments and prints the final value for each:

    // Each thread's increment is THREE steps on a shared counter.
    // A "schedule" is just the order those steps actually execute in.
    // reg[t] is thread t's private register (what it has "read" so far).
    interface Machine {
      count: number;
      reg: number[];
    }

    function step(m: Machine, action: string): void {
      const kind = action[0];       // "R", "M" or "W"
      const t = Number(action[1]);  // "R0" => thread 0, "R1" => thread 1
      if (kind === "R") m.reg[t] = m.count;            // READ shared -> my register
      else if (kind === "M") m.reg[t] = m.reg[t] + 1;  // MODIFY my register
      else if (kind === "W") m.count = m.reg[t];       // WRITE my register -> shared
    }

    function run(label: string, schedule: string[]): void {
      const m: Machine = { count: 0, reg: [0, 0] };
      for (const a of schedule) step(m, a);
      console.log(label.padEnd(26) + schedule.join(" ") + " => count = " + m.count);
    }

    console.log("Two threads, each doing count = count + 1, starting from 0:");
    console.log("");

    // Serial: one thread finishes entirely before the other begins.
    run("serial (0 then 1):", ["R0", "M0", "W0", "R1", "M1", "W1"]); // 2
    run("serial (1 then 0):", ["R1", "M1", "W1", "R0", "M0", "W0"]); // 2

    // Interleaved: both READ before either WRITES -> a lost update.
    run("both read first:", ["R0", "R1", "M0", "M1", "W0", "W1"]); // 1
    run("ping-pong:", ["R0", "R1", "M1", "W1", "M0", "W0"]); // 1

    // Barely overlapping: 1 READS just before 0 WRITES -> still a lost update.
    // In this model only the two serial orders above reach 2.
    run("late overlap:", ["R0", "M0", "R1", "W0", "M1", "W1"]); // 1

    console.log("");
    console.log("Same code, same two increments. Only the ORDER changed,");
    console.log("and the answer swings between 1 and 2.");

Nothing in the increments changed between runs; only the order did. Yet the final value can land on either member of the set {1, 2}. Order alone decides correctness: that is the whole essence of a race.

Nondeterminism: the Heisenbug

On a real multi-core machine you do not get to script the schedule — the hardware and OS pick one for you, differently on every run, depending on cache state, interrupts, what else the machine is doing, and plain luck. A racy program is therefore nondeterministic: the same input can give different outputs on different runs. Very often the bad interleaving is rare, so the program appears to work — passing your tests 999 times out of 1000, then failing in production at 3 a.m.

Worse still, races are famously shy. The moment you try to observe one (add a print, attach a debugger, single-step) you change the timing, the rare interleaving stops happening, and the bug "disappears." A bug that vanishes when you look at it is nicknamed a Heisenbug, a pun on Werner Heisenberg and the observer effect popularly associated with his uncertainty principle: measurement disturbs the thing measured.

You add a console.log inside the loop, the crash stops, you ship it. You have fixed nothing. The extra I/O merely slowed one thread by a few microseconds and nudged the scheduler away from the unlucky interleaving — this time, on this machine, under this load. The race is still there, waiting for a faster CPU, a different core count, or a heavier server to line the steps up again. Changing timing is not the same as enforcing correctness. The only real fix is to make the shared access atomic or mutually exclusive, so the bad interleaving becomes impossible rather than merely unlikely.

Check-then-act and TOCTOU

Lost updates are one flavour of race. Another huge family is check-then-act: a thread inspects some shared state, decides what to do based on what it saw, and then acts — but between the check and the act, another thread changes the state, so the decision is now based on stale information. Our ATM story was exactly this: check "is the balance ≥ $100?", then act "withdraw $100" — with a fatal gap in between.

    // CHECK-THEN-ACT: classic and broken under concurrency.
    if (balance >= 100) {          // CHECK: true for BOTH threads...
      // ...scheduler may switch threads right here...
      balance = balance - 100;     // ACT: BOTH subtract; balance goes negative
    }
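The same gap can be reproduced without threads. The sketch below is only an analogy, under the assumption of a single-threaded event loop: an await placed between the check and the act plays the role of the scheduler switch. The names withdrawRacy, withdrawFused and dispenseCash are hypothetical.

```typescript
// Hypothetical account state shared by both withdrawals.
let balance = 100;
let dispensed = 0;

// Stand-in for any slow step; it yields to the event loop before finishing.
function dispenseCash(amount: number): Promise<void> {
  return new Promise((resolve) =>
    setTimeout(() => {
      dispensed += amount;
      resolve();
    }, 0)
  );
}

// BROKEN: the await sits between the check and the act.
async function withdrawRacy(amount: number): Promise<boolean> {
  if (balance < amount) return false; // CHECK
  await dispenseCash(amount);         // gap: the other caller runs its CHECK here
  balance -= amount;                  // ACT on a decision that is now stale
  return true;
}

// FUSED: check and act run back to back with no await between them,
// so no other task can be scheduled inside the gap.
async function withdrawFused(amount: number): Promise<boolean> {
  if (balance < amount) return false; // CHECK
  balance -= amount;                  // ACT: reserve the money first
  await dispenseCash(amount);
  return true;
}

// Start two $100 withdrawals at once against a fresh $100 balance.
async function demo(withdraw: (amount: number) => Promise<boolean>) {
  balance = 100;
  dispensed = 0;
  const results = await Promise.all([withdraw(100), withdraw(100)]);
  return { results, balance, dispensed };
}

async function main() {
  const racy = await demo(withdrawRacy);
  const fused = await demo(withdrawFused);
  console.log("racy: ", racy);  // both succeed, balance -100, dispensed 200
  console.log("fused:", fused); // one succeeds, balance 0, dispensed 100
  return { racy, fused };
}

const outcome = main();
```

The fused version works here only because nothing can run between two synchronous statements on one event loop. With real threads the same fusion needs a lock or an atomic operation.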

When the shared thing being checked is a resource (a file, a lock file, a network port), this pattern has a specific name: a TOCTOU race, for Time Of Check To Time Of Use. A program checks "does this file exist / am I allowed to open it?" and then, a beat later, opens it. In that beat an attacker can swap the file for a symbolic link to /etc/passwd, and the program dutifully writes to a file that its check never approved. TOCTOU bugs are a whole category of real-world security vulnerabilities, not just wrong-answer bugs: the window between check and use is an attacker's playground.
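One common way to close that window for file creation is to let the operating system do the check and the open together. The sketch below assumes Node.js and its fs module; the "wx" open flag creates the file only if the path does not already exist. The helper names createRacy and createExclusive, and the scratch file report.txt, are made up for illustration.

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// Hypothetical scratch directory, used only for this demonstration.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "toctou-"));
const target = path.join(dir, "report.txt");

// RACY: check, then use. Another process could create or replace the path
// (for example with a symlink) between existsSync and writeFileSync.
function createRacy(file: string, data: string): boolean {
  if (fs.existsSync(file)) return false; // time of check
  fs.writeFileSync(file, data);          // time of use
  return true;
}

// FUSED: "wx" asks the OS to create the file only if nothing is at that
// path, so the existence check and the open happen as one operation.
function createExclusive(file: string, data: string): boolean {
  try {
    const fd = fs.openSync(file, "wx");
    try {
      fs.writeSync(fd, data);
    } finally {
      fs.closeSync(fd);
    }
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
}

// With no attacker around, the racy version appears to work fine.
const racyOk = createRacy(path.join(dir, "racy.txt"), "unchallenged\n");

// The exclusive version refuses the second writer instead of overwriting.
const first = createExclusive(target, "first writer\n");
const second = createExclusive(target, "second writer\n");
const content = fs.readFileSync(target, "utf8");
console.log(racyOk, first, second); // true true false
console.log(content);               // first writer

fs.rmSync(dir, { recursive: true, force: true });
```

This covers only the "create a new file" case. Other TOCTOU situations, such as permission checks on an existing file, need different tools, for instance operating on an already opened file descriptor rather than on a path.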

Notice the shape shared by every example: a value is observed, a decision hinges on it, and the value can change before the decision is carried out. The cure, again, is to fuse check-and-act into one indivisible step so nothing can slip into the gap, which is the job of synchronisation primitives.

The fix, previewed

Every race we have met has the same root and the same remedy. The root is that a sequence of steps that should be one indivisible action can be interrupted partway. The remedy is to restore indivisibility — to make sure that while one thread is in the middle of touching shared state, no other thread can observe or modify it.

Atomicity — make the whole read-modify-write happen as a single, unbreakable hardware step, so there is no "middle" to interrupt (e.g. an atomic increment, or compare-and-swap).
Mutual exclusion — wrap the shared access in a lock so that at most one thread is inside the critical section at a time; everyone else waits their turn.
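As a rough preview of the atomicity option, JavaScript and TypeScript expose atomic operations on shared integer arrays through the built-in Atomics object. The sketch below runs on a single thread, so it shows only the shape of the calls and not a real contest between threads; in practice the buffer would be handed to worker threads. The helper casIncrement is a hypothetical name.

```typescript
// One shared 32-bit slot. In a real program this buffer would be passed to
// worker threads; here everything runs on one thread.
const shared = new Int32Array(new SharedArrayBuffer(4));

// Atomic increment: the read-modify-write is one indivisible operation,
// so there is no "middle" in which another thread could interleave.
Atomics.add(shared, 0, 1);

// The same increment built from compare-and-swap: retry until no other
// thread changed the slot between our read and our write.
function casIncrement(cells: Int32Array, index: number): number {
  for (;;) {
    const seen = Atomics.load(cells, index);
    // compareExchange stores seen + 1 only if the slot still holds `seen`,
    // and returns the value that was actually there.
    const previous = Atomics.compareExchange(cells, index, seen, seen + 1);
    if (previous === seen) return seen + 1; // nobody interfered; update landed
    // Otherwise another thread won the race: loop and try again.
  }
}

casIncrement(shared, 0);
console.log(Atomics.load(shared, 0)); // 2
```

The compare-and-swap loop is the check-then-act pattern made safe: the check ("is the value still what I read?") and the act ("store the new value") are performed by the hardware as one step.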

Both make the dangerous interleavings impossible rather than merely improbable — which is the only kind of fix that survives a faster CPU and a heavier load. How locks, mutexes and semaphores actually deliver that guarantee is the next page.

Race conditions have caused real harm, and the best-known case is one of the most sobering stories in computing. The Therac-25, a radiation-therapy machine from the mid-1980s, had a race condition in its control software. If an operator typed the treatment settings unusually fast, a background task setting up the machine's hardware could interleave badly with the task reading the console, leaving the machine in an inconsistent state: firing a high-power electron beam without the protective target in place. Because it only triggered on a fast, specific keystroke timing, it passed testing and worked almost every time. Between 1985 and 1987 it delivered massive radiation overdoses to at least six patients, several fatally. It is now the canonical case study in software-safety courses: proof that "it works 999 times out of 1000" is not good enough when the thousandth time matters, and that nondeterministic timing bugs must be designed out, not tested away.