Reliability and Validity

You collect a set of results, work out an average, and write a neat conclusion. But before anyone believes you, there is a harder question waiting: how good is your investigation, really? Doing an experiment is easy. Doing one whose results you can actually trust — and that genuinely answers the question you asked — is the real skill of working scientifically.

Scientists judge an investigation with two different words that sound similar but mean completely different things: reliable and valid. Getting them straight is the whole point of this page — because an experiment can be one without the other, and mixing them up is how good-looking results quietly lead people astray.

Repeatable and reproducible

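Suppose you time how long a pendulum takes to swing. You do it once and get 1.8 s. Should the world rewrite the textbooks? Not yet, because a single reading could be a fluke. So you ask two questions. If you repeat the measurement yourself, with the same method and equipment, do you get the same result? And if someone else carries out the investigation, do they get the same result too?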

Repeatable is you agreeing with yourself; reproducible is the world agreeing with you. A result that is both is on very solid ground — a single lucky reading is not.

Reliable data: repeats that agree

Here is a plot of repeated readings. Each dot is one run of the experiment; the dashed line is the accepted value — the answer everyone agrees is right. Reliable (also called precise) data means the repeats bunch tightly together: a small spread. Drag the spread slider and watch the dots tighten or scatter.

When the spread is tiny, two good things happen. First, you can quote your result with confidence — the runs back each other up. Second, an anomaly — a run that clearly doesn't fit — sticks out like a sore thumb, so you can spot it and deal with it. Flip the anomaly switch on and see how one stray reading jumps away from the pack and yanks the mean upward.

Now try the last switch — what the experiment measures. Move it to "wrong thing" and the whole tight cluster slides off the accepted value. The repeats still agree beautifully — the data is just as precise — but every one of them is wrong. That is the trap the next card is all about.
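The two ideas on this card, spread and distance from the accepted value, can be put into numbers. The sketch below is only an illustration: the readings and the accepted value of 2.00 s are invented, the name `summarise` is hypothetical, and the spread is taken as the simple range (largest reading minus smallest), which is one of several reasonable choices.

```python
from statistics import mean


def summarise(readings, accepted):
    """Return (mean, spread, offset) for a list of repeat readings.

    spread = largest reading minus smallest (the range of the repeats)
    offset = mean of the repeats minus the accepted value
    """
    average = mean(readings)
    spread = max(readings) - min(readings)
    offset = average - accepted
    return average, spread, offset


ACCEPTED = 2.00  # hypothetical accepted value, in seconds

# Hypothetical readings, invented for illustration only.
tight_on_target = [1.99, 2.01, 2.00, 2.02, 1.98]
tight_off_target = [2.29, 2.31, 2.30, 2.32, 2.28]
scattered = [1.70, 2.35, 1.95, 2.20, 1.80]

for name, readings in [
    ("tight, on target", tight_on_target),
    ("tight, off target", tight_off_target),
    ("scattered", scattered),
]:
    average, spread, offset = summarise(readings, ACCEPTED)
    print(f"{name:17s} mean={average:.2f} s  spread={spread:.2f} s  offset={offset:+.2f} s")
```

The two tight sets have the same small spread, so they are equally precise, but only one of them sits on the accepted value. The spread alone cannot tell them apart.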

Valid: measuring the right thing, the right way

An investigation is valid when it actually tests what it claims to test. That takes two things together: it measures the right thing, and it measures it the right way, as a fair test.

Miss either one and the experiment is invalid — and here is the sting: an invalid experiment can still be wonderfully reliable. If your ruler is missing its first centimetre, every length you read is 1 cm too big; repeat it a hundred times and all hundred agree perfectly. Your data is precise, tidy, convincing… and confidently wrong. Reliability polishes a result; it cannot rescue a broken design.

Reliable is not the same as valid

Anomalies: spot it, check it, exclude it

An anomaly (or outlier) is a result that plainly doesn't fit the pattern of the others. Say five readings of a length come out as

21, 22, 21, 48, 22 cm.

The 48 is screaming at you. The honest scientist follows three steps, in order: spot it, check it, then exclude it.

The one thing you must not do is quietly blend an anomaly into the average and hope. An anomaly is information — it's telling you something went wrong that time — so investigate it, then set it aside with a note, never bury it.
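To see how much damage one blended-in anomaly does, here is a minimal sketch using the five readings above. The helper `flag_anomalies` and the 5 cm cut-off are assumptions made for illustration, not a standard rule, and the code only does the "spot" step. Checking what went wrong is still a job for the person who took the reading.

```python
from statistics import mean, median

readings_cm = [21, 22, 21, 48, 22]  # the five readings from the text


def flag_anomalies(readings, max_gap):
    """Split readings into (kept, flagged) by their distance from the median.

    A reading is flagged when it lies more than max_gap from the median.
    This is a simple illustrative rule, not an official test.
    """
    middle = median(readings)
    kept = [r for r in readings if abs(r - middle) <= max_gap]
    flagged = [r for r in readings if abs(r - middle) > max_gap]
    return kept, flagged


MAX_GAP_CM = 5  # assumed cut-off, chosen only for this example

kept, flagged = flag_anomalies(readings_cm, MAX_GAP_CM)

print("mean of all five readings:", mean(readings_cm))  # 26.8
print("flagged for checking:", flagged)                 # [48]
print("mean after excluding it:", mean(kept))           # 21.5
```

Leaving the 48 in drags the mean up to 26.8 cm, a value that none of the sensible readings comes close to. Printing the flagged value, rather than silently dropping it, is the code's version of setting it aside with a note.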

Usually an outlier is a slip-up. But not always — sometimes the odd reading is the whole point. When Alexander Fleming found a stray patch of mould had killed the bacteria on one of his dishes, a tidier scientist might have thrown that "spoiled" plate away as an anomaly. He looked closer instead, and that outlier became penicillin. The lesson isn't "keep every weird result" — it's that you check an anomaly before you exclude it, because just occasionally the thing that doesn't fit is trying to tell you something new.

Putting it together

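Picture two students both timing that pendulum. Ada takes ten careful repeats with the same stopwatch and they cluster within a tenth of a second, but she started her stopwatch on "go" and stopped it a beat late every single time. Ben takes just three readings that jump around a bit, but his method and timing are spot on. Ada's results are reliable but not valid: they agree with each other, yet every one of them is too long. Ben's are valid but not yet reliable: he is measuring the right thing, but three scattered readings don't pin it down.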

The best investigation is both. You reach validity by designing a fair test that measures the right thing, and you reach reliability by taking enough good repeats to pin the value down. Neither alone is enough — and no pile of repeats can ever turn an invalid design into a valid one.
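The claim that repeats cannot fix an invalid design can be tried out with a small simulation of Ada's late stop. Everything numerical here is assumed for illustration: a true period of 1.80 s, a constant 0.30 s delay, and random scatter of about 0.05 s. The function `timed_repeats` is a hypothetical helper, not a model of any real stopwatch.

```python
import random
from statistics import mean

TRUE_PERIOD = 1.80  # s, hypothetical true value
LATE_STOP = 0.30    # s, assumed constant delay added to every reading
NOISE = 0.05        # s, assumed size of the random scatter


def timed_repeats(n, systematic_error, rng):
    """Simulate n readings: true value + constant error + random scatter."""
    return [
        TRUE_PERIOD + systematic_error + rng.gauss(0, NOISE)
        for _ in range(n)
    ]


rng = random.Random(1)  # fixed seed so the run can be repeated

for n in (10, 100, 10_000):
    average = mean(timed_repeats(n, LATE_STOP, rng))
    error = average - TRUE_PERIOD
    print(f"{n:6d} repeats: mean = {average:.3f} s, error = {error:+.3f} s")
```

As the number of repeats grows, the random scatter averages away and the mean settles down, but it settles on the true value plus the delay. More repeats buy reliability, and the error caused by the method stays exactly where it was.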

In the 1970s and 80s, huge, careful studies found that women who took hormone replacement therapy (HRT) had far fewer heart attacks. The numbers were tight and reproducible across many groups, so they were highly reliable, and doctors began prescribing HRT to protect the heart. But the studies had a hidden flaw: the women choosing HRT also tended to be healthier and wealthier to begin with. The comparison wasn't a fair test, so it was measuring the wrong thing, and the conclusion wasn't valid. When large randomised trials finally reported in the early 2000s, the heart benefit vanished. Years of precise, confident, invalid data had pointed medicine the wrong way: a textbook case of reliable-but-not-valid.