A single
COUNT, SUM, AVG,
MIN, MAX) with GROUP BY — collapses many rows into one
summary row per group.Together they turn a database from a filing cabinet into something you can genuinely interrogate. These are staples of A-level Computer Science, so it pays to see exactly which rows each query hands back.
Everything below uses these two little tables. The pupil table has one row per pupil,
and its classId column is a foreign key — each value is the
primary key of a row over in the class table. That shared value is the
link between them.
Notice the tables deliberately store each fact once: the pupil table doesn't repeat
"Maths", "Ms Lee" on every maths pupil — it just stores the tiny code
C1 and lets the class table hold the details. A JOIN is
how we glue them back together when we want the full picture.
Suppose you want a list of pupils with the name of the class they're in — the pupil's name
comes from pupil, but the class name comes from class. An
INNER JOIN matches rows from the two tables wherever the foreign key equals the
primary key:
Read the ON line as the matching rule: "pair up a pupil row with the class row
that has the same classId." Because column names now come from two tables, we write
pupil.name and class.className — table.column — so there's no
confusion about which table each field belongs to. Here is exactly what comes back:
Every pupil now sits next to their class name. Under the hood the database walked each pupil's
classId across to the class table, found the matching row, and merged the
two into one wide row. You can filter and sort a joined result just like any other query — add
WHERE class.className = 'Maths' and ORDER BY pupil.name and only the maths
pupils come back, alphabetically.
They hold the same value for matched rows, but they live in different tables and mean
different things. In class, classId is the primary key
— it identifies a class. In pupil, classId is a foreign
key — it points at a class. The ON condition is precisely the
instruction "line up the pointer with the thing it points at." That single idea — foreign key
equals primary key — is the join at the heart of almost every relational query you'll ever write.
An aggregate function takes a whole column of values and boils it down to a single number. There are five you must know:
COUNT(*) — how many rows;SUM(column) — the total of a numeric column;AVG(column) — the mean (average);MIN(column) — the smallest value;MAX(column) — the largest value.Used on their own they crunch the entire table down to one row. "How many pupils are there, and what's the average mark across all of them?"
With six pupils and marks of 72, 55, 88, 64, 91 and 60, that returns a single summary row:
COUNT = 6, AVG = 71.7, MAX = 91, MIN = 55. The
six individual pupils have vanished — an aggregate answers a question about the group, not
about any one pupil.
Averaging all the pupils is a blunt instrument. Usually you want the average per class. GROUP BY sorts the rows into piles that share a value, then runs the aggregate once per pile:
The database gathers all the C1 rows into one group, all the C2 rows into
another, and so on, then reports the count and average within each group:
The golden rule of GROUP BY: everything in the SELECT list must be
either the column(s) you grouped on or wrapped in an aggregate function. You can
select classId (you grouped by it, so it's the same for every row in the group) and
AVG(mark) (an aggregate), but you cannot also select a raw
name — a group of three pupils has three different names, and only one row comes back,
so which name would it print?
The two ideas combine beautifully. To show the average mark next to each class's name (not just its code), join first, then group:
The join supplies the readable class name; the GROUP BY then rolls each class's pupils
into a single summary row. This is the workhorse pattern behind almost every report you've ever
seen: "sales per region", "goals per team", "orders per customer".
There are two different filters, and telling them apart is a classic exam question.
WHERE filters individual rows before they are grouped;
HAVING filters the summary rows after the grouping, so it's the one
that can test an aggregate.
"Show each class's average mark, but only for classes averaging 70 or more." You couldn't put
AVG(mark) >= 70 in a WHERE — at the moment WHERE runs, the
groups (and therefore the averages) don't exist yet. A neat way to remember the running order:
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY.
A database does joins and aggregates for you, but they're no mystery — a table is just an array of records. Here is the join written in TypeScript: for each pupil, find the matching class and merge the two. Press Run and compare it with the SQL above.
And here is the aggregate: gather pupils into piles by classId, then average each pile.
The Map is doing the job of GROUP BY.
Two mistakes trip almost everyone at first — know them cold for the exam.
classId pointed at no class (or was blank), that
row simply vanishes from an INNER JOIN — it has no partner to pair with,
so it's silently dropped. Imagine adding a brand-new class C4 "Chemistry" with no
pupils yet: the inner join above would never mention Chemistry at all. When you need to keep the
unmatched rows, use a LEFT JOIN instead — it keeps every row from
the left-hand table and fills the missing right-hand columns with NULL. So
class LEFT JOIN pupil would list Chemistry with a NULL pupil, whereas
class INNER JOIN pupil would hide it. A join that mysteriously loses rows is almost
always an INNER JOIN discarding unmatched data.
SELECT classId, name, AVG(mark) FROM pupil GROUP BY classId is a mistake: class
C1 has three pupils with three different names, but the group produces only
one row, so there's no single name to show. Every column you
SELECT must be either something you GROUP BY or something inside an
aggregate function. (Some databases quietly pick a random name and let it slide — which is worse,
because the answer looks fine but is meaningless.)
Picture the two tables as overlapping circles (a Venn diagram). An INNER JOIN
returns only the overlap — rows that exist on both sides. The OUTER
joins reach beyond the overlap: a LEFT OUTER JOIN keeps the whole left circle
(matched or not), a RIGHT OUTER JOIN the whole right circle, and a
FULL OUTER JOIN both circles entirely. "Inner" means the inner overlap; "outer" means
you also reach out to the parts with no partner. At A-level you'll mostly meet INNER
and LEFT — but knowing they're the same machine with a different reach makes the
whole family click.
A subtle but favourite exam point: COUNT(*) counts rows, but
COUNT(column) counts only the rows where that column is not NULL.
So in a table where two pupils have no recorded email, COUNT(*) might be 6 while
COUNT(email) is 4. The same holds for AVG and SUM: they
skip NULLs entirely, which means an average is taken over "the pupils who actually
have a mark", not over every row. Blanks don't count — literally.