Picture a busy restaurant kitchen the night before it opens. There are two completely different ways to get a mountain of work done fast. In the first, the head chef hands ten cooks an identical job: "each of you take a pile of onions and chop it, exactly the same way." Ten piles shrink at once — the same operation applied across many pieces of data. In the second, the chef gives each cook a different job: one chops, one stirs the stock, one plates the starters, one washes up — different tasks running at the same time.
Those two pictures are the two oldest ideas in parallel computing —
data parallelism and task parallelism — and almost every way we
structure parallel work is a variation or combination of them. This page is a tour of these
models: the high-level shapes we pour concurrent work into, sitting a level above the
Data parallelism is the ten-cooks-chopping-onions model. You have one operation and
a large collection of data; you apply the operation to every element, and the elements do not depend
on each other, so they can all be done at once. Doubling a million numbers, brightening every pixel
in an image, running the same physics update on a million particles — all data-parallel. It shows up
as map, as a parallel-for loop, as SIMD (Single
Instruction, Multiple Data — one instruction driving many arithmetic lanes), and most spectacularly
as a GPU kernel, where thousands of tiny cores each run the identical program on a
different slice of the data.
Task parallelism is the other kitchen: different jobs, run concurrently. A web server that, for one request, kicks off "query the database", "call the payment API" and "render the template" at the same time is task-parallel. So is a thread pool chewing through a queue of heterogeneous jobs. Here the parallelism comes not from splitting data but from the fact that the tasks themselves are independent of one another.
The two are not rivals — real programs mix them. A video encoder might run task-parallel stages (read, transform, compress, write) where the transform stage is itself data-parallel across every pixel block. The question a model answers is: where does the independence live — in the data, or in the work?
Once several workers run at once, they usually need to coordinate. There are two fundamentally different communication models, and choosing between them shapes everything else.
In the shared-memory model, threads live in one address space and simply read and
write the same variables. Communication is implicit: I write total, you read
total. It is fast and convenient, and it is exactly the world of the previous pages —
which is also why it needs
In the message-passing model there is no shared memory. Each process keeps
its own private data and the only way to communicate is to send a message — an
explicit send and receive. This is the model behind MPI
(the Message Passing Interface that powers supercomputers) and behind essentially every
distributed system, where the "processes" sit on different machines and could not
share memory even if they wanted to. There are no data races because there is no shared data to race
over; the trade is that every exchange is explicit, and messages take time and can arrive out of
order.
The actor model takes message passing and makes it the only rule. The world is a swarm of actors, each a tiny independent unit with private state nobody else can touch and a mailbox. An actor does exactly three kinds of thing in response to a message: change its own state, send messages to other actors it knows, and create new actors. Crucially, an actor processes its mailbox one message at a time, so its private state is never touched by two things at once.
That single restriction is the whole point. Because no state is shared, there is nothing to lock and no data race is even possible — you have traded shared-memory hazards for the simpler discipline of "just send a message." It is the model behind Erlang and the Akka framework, and it scales beautifully: actors are cheap, so you can have millions of them, and because they only talk by messages, it does not matter whether the actor you message is in the same process or on a machine across the world.
Erlang was built at Ericsson to run telephone exchanges, which are allowed roughly a few minutes of downtime per century. Its answer was millions of lightweight actor-like processes, each isolated, each cheap to spawn — and a motto of "let it crash". If one process hits a bug, it dies alone without corrupting anyone else's memory (there is no shared memory to corrupt), and a supervisor simply restarts it. Isolated state turned crashes from catastrophes into a shrug. The same design now underpins WhatsApp, which famously handled millions of connections per server.
Fork–join is the pattern that turns
Read the tree top-down for the fork: the whole array splits in half, and each half splits again, until the pieces are small enough to add up directly. Then read it bottom-up for the join: each pair of partial sums combines into its parent, and the totals bubble all the way up to the answer. The leaves are independent, so on a machine with several cores they can be computed genuinely at the same time; the joins impose the only ordering (a parent must wait for its children).
The sandbox below is single-threaded, so nothing here really runs in parallel — but that is
fine, because the fork–join structure is what we are showing. The function splits the array,
recurses into each half (these are the subtasks a real runtime would hand to different cores), and
then joins the two partial sums. Watch the fork lines march inward as the work splits,
and the join lines march back out as the sums combine. Press Run ▶:
The shape of the output is the tree from the diagram above: every fork has a
matching join, and the deepest indentation is where the independent leaf work lives — the
part that parallel hardware actually speeds up.
It is tempting to think "twice the cores, twice the speed." Almost never true. Nearly every program
has some part that is stubbornly serial — it cannot be parallelised (reading the
input, a final combine step, a bit of coordination). Amdahl's law makes the
consequence brutally precise. If a fraction
As
Notice how flat the curves go. Each one races upward at first and then bends over toward its ceiling: more cores buy less and less, and the smaller the parallel fraction, the sooner the wall arrives. This is why real speedups are measured, not assumed — and why shrinking the serial fraction often beats buying more cores.
This is the single most common misconception in parallel programming, and Amdahl's law is its
antidote. Speedup is not proportional to the number of cores; it is throttled by
whatever fraction of the work stays serial. At
It removes one whole class — data races — because there is no shared state to race over. But it does not make concurrency easy; it trades one kind of hard problem for another. Instead of reasoning about locks and interleavings, you now reason about message ordering, mailbox back-pressure (what if messages arrive faster than an actor can process them?), and deadlocks of a new flavour (actor A waits for a reply from B while B waits for A). No model makes concurrency free — each one just moves the difficulty to a place you might reason about more clearly.