Reliability and Validity

You collect a set of results, work out an average, and write a neat conclusion. But before anyone believes you, there is a harder question waiting: how good is your investigation, really? Doing an experiment is easy. Doing one whose results you can actually trust — and that genuinely answers the question you asked — is the real skill of working scientifically.

Scientists judge an investigation with two words that are easy to confuse but mean completely different things: reliable and valid. Getting them straight is the whole point of this page, because an experiment can be one without the other, and mixing them up is how good-looking results quietly lead people astray.

Repeatable and reproducible

Suppose you time how long a pendulum takes to swing. You do it once and get 1.8 s. Should the world rewrite the textbooks? Not yet — a single reading could be a fluke. So you ask two questions.

Repeatable: if you do it again with the same equipment and method, do you get a similar result? Time the same pendulum ten times; if you keep getting close to 1.8 s, your measurement is repeatable.
Reproducible: if a different person, or someone using a different method or different equipment, tries it, do they get a similar result too? If a class across the country, using its own stopwatch and its own pendulum, also lands near 1.8 s, your result is reproducible.

Repeatable is you agreeing with yourself; reproducible is the world agreeing with you. A result that is both is on very solid ground — a single lucky reading is not.
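As a rough sketch, the two questions can be written as two separate checks. The timings and the tolerance below are hypothetical, chosen only for illustration; what counts as "similar" depends on the instrument you are using.

```python
from statistics import mean

# Hypothetical timings in seconds (not from a real experiment).
my_runs = [1.80, 1.79, 1.82, 1.81, 1.78]            # same person, same kit
other_class_runs = [1.83, 1.80, 1.81, 1.79, 1.82]   # different people, different kit

TOLERANCE_S = 0.05  # assumed meaning of "similar"; pick one to suit your stopwatch


def spread(runs):
    """Range of a set of readings: largest minus smallest."""
    return max(runs) - min(runs)


# Repeatable: my own repeats agree with each other.
repeatable = spread(my_runs) <= TOLERANCE_S

# Reproducible: someone else's mean lands near my mean.
reproducible = abs(mean(my_runs) - mean(other_class_runs)) <= TOLERANCE_S

print("repeatable:", repeatable)
print("reproducible:", reproducible)
```

Note that the first check only ever looks at one person's readings, while the second compares two independent sets.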

Reliable data: repeats that agree

Here is a plot of repeated readings. Each dot is one run of the experiment; the dashed line is the accepted value — the answer everyone agrees is right. Reliable (also called precise) data means the repeats bunch tightly together: a small spread. Drag the spread slider and watch the dots tighten or scatter.

When the spread is tiny, two good things happen. First, you can quote your result with confidence — the runs back each other up. Second, an anomaly — a run that clearly doesn't fit — sticks out like a sore thumb, so you can spot it and deal with it. Flip the anomaly switch on and see how one stray reading jumps away from the pack and yanks the mean upward.

Now try the last switch: what the experiment measures. Move it to "wrong thing" and the whole tight cluster slides off the accepted value. The repeats still agree beautifully, and the data is just as precise, but every one of them is wrong. That is the trap the next section is all about.
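The same idea can be put into numbers. The sketch below uses made-up readings and an assumed accepted value of 1.80 s. It reports the spread (how well the repeats agree) separately from the offset (how far the mean sits from the accepted value), to show that one can be small while the other is not.

```python
from statistics import mean


def summarise(readings, accepted):
    """Return (mean, spread, offset) for a list of repeat readings."""
    avg = mean(readings)
    spread = max(readings) - min(readings)  # small spread = repeats agree
    offset = avg - accepted                 # distance from the accepted value
    return avg, spread, offset


ACCEPTED_S = 1.80  # hypothetical accepted value, in seconds

# Two hypothetical data sets with the same spread.
on_target = [1.79, 1.81, 1.80, 1.82, 1.78]
off_target = [2.09, 2.11, 2.10, 2.12, 2.08]  # whole cluster shifted by 0.30 s

for label, data in [("on target", on_target), ("off target", off_target)]:
    avg, spread, offset = summarise(data, ACCEPTED_S)
    print(f"{label}: mean={avg:.2f} s, spread={spread:.2f} s, offset={offset:+.2f} s")
```

Both sets have the same spread of 0.04 s, so they are equally precise; only the offset reveals that the second set is measuring something other than the intended value.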

Valid: measuring the right thing, the right way

An investigation is valid when it actually tests what it claims to test. That takes two things together:

it must be a fair test — only the one intended variable is changed, and everything else is held constant, so any effect really is caused by that variable;
it must measure the right quantity — with a suitable instrument, over a sensible range, actually capturing what you set out to find.

Miss either one and the experiment is invalid — and here is the sting: an invalid experiment can still be wonderfully reliable. If your ruler is missing its first centimetre, every length you read is 1 cm too big; repeat it a hundred times and all hundred agree perfectly. Your data is precise, tidy, convincing… and confidently wrong. Reliability polishes a result; it cannot rescue a broken design.

Reliable is not the same as valid

Reliable (precise) — the repeats agree with each other: a small spread.
Valid — the experiment answers the real question: a fair test that measures the right thing.
These are independent. Data can be reliable-but-invalid, valid-but-noisy, both, or neither — so you must judge each one separately.

Anomalies: spot it, check it, exclude it

An anomaly (or outlier) is a result that plainly doesn't fit the pattern of the others. Say five readings of a length come out as

21 cm, 22 cm, 21 cm, 48 cm, 22 cm.

The 48 is screaming at you. The honest scientist follows three steps, in order:

Spot it — a value far from the cluster (much easier when your data is reliable).
Check it — was it a misread scale, a slipped stopwatch, a wobble? If you can, go back and repeat that run.
Exclude it — leave a genuine anomaly out of the mean. Averaging the four good readings gives 21.5 cm; letting the 48 in drags the mean up to a meaningless 26.8 cm.
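The arithmetic behind those two means can be checked with a few lines of Python. The readings are the five from the text; the "more than 5 cm from the median" rule used to flag a suspect is only an assumption for this sketch, not a standard test, and a flagged value should still be checked by hand before it is excluded.

```python
from statistics import mean, median

readings_cm = [21, 22, 21, 48, 22]  # the five readings from the text

# Assumed rule of thumb for "far from the cluster" (hypothetical threshold).
THRESHOLD_CM = 5

centre = median(readings_cm)
suspects = [r for r in readings_cm if abs(r - centre) > THRESHOLD_CM]
kept = [r for r in readings_cm if abs(r - centre) <= THRESHOLD_CM]

mean_all = mean(readings_cm)  # anomaly blended in
mean_kept = mean(kept)        # anomaly excluded

print("suspect readings:", suspects)      # [48]
print("mean without anomaly:", mean_kept)  # 21.5
print("mean with anomaly:", mean_all)      # 26.8
```

The median is used as the centre here because, unlike the mean, it is barely moved by the one stray reading.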

The one thing you must not do is quietly blend an anomaly into the average and hope. An anomaly is information — it's telling you something went wrong that time — so investigate it, then set it aside with a note, never bury it.

Usually an outlier is a slip-up. But not always — sometimes the odd reading is the whole point. When Alexander Fleming found a stray patch of mould had killed the bacteria on one of his dishes, a tidier scientist might have thrown that "spoiled" plate away as an anomaly. He looked closer instead, and that outlier became penicillin. The lesson isn't "keep every weird result" — it's that you check an anomaly before you exclude it, because just occasionally the thing that doesn't fit is trying to tell you something new.

Putting it together

Picture two students both timing that pendulum. Ada takes ten careful repeats with the same stopwatch and they cluster within a tenth of a second — but she started her stopwatch on "go" and stopped it a beat late every single time. Ben takes just three readings that jump around a bit — but his method and timing are spot on.

Ada's data is reliable but not valid: precise, and precisely wrong.
Ben's data is valid but not very reliable: aimed right, but too noisy to trust yet.

The best investigation is both. You reach validity by designing a fair test that measures the right thing, and you reach reliability by taking enough good repeats to pin the value down. Neither alone is enough — and no pile of repeats can ever turn an invalid design into a valid one.

Reliable does not mean valid. Repeats that agree only tell you the readings are consistent with each other — not that you measured the right thing. You can reliably measure the wrong quantity.
More repeats can't fix an invalid experiment. Extra readings shrink the spread (better reliability), but if the design is flawed they just make you more confident of a wrong answer. Fix the design, not the sample size.
Don't blend an anomaly into the mean. Investigate it first, then exclude a genuine outlier from the average — never average it in and hope it washes out.
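A small simulation can illustrate the second point. Everything in it is an assumption made for the sketch: a true length of 20 cm, a ruler that always reads 1 cm too big, and a random scatter of 0.3 cm on each reading. Taking more readings shrinks the uncertainty in the mean, but the mean settles on the wrong value.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch gives the same numbers on every run

TRUE_LENGTH_CM = 20.0  # hypothetical true value
OFFSET_CM = 1.0        # systematic error: every reading is 1 cm too big
NOISE_CM = 0.3         # assumed random scatter on a single reading


def measure():
    """One simulated reading: true value + fixed offset + random scatter."""
    return TRUE_LENGTH_CM + OFFSET_CM + random.gauss(0, NOISE_CM)


results = []
for n in (5, 50, 500):
    readings = [measure() for _ in range(n)]
    avg = mean(readings)
    std_err = stdev(readings) / n ** 0.5  # uncertainty of the mean, falls as n grows
    results.append((n, avg, std_err))
    print(f"n={n:3d}  mean={avg:.2f} cm  standard error={std_err:.3f} cm  "
          f"error vs true={avg - TRUE_LENGTH_CM:+.2f} cm")
```

The standard error column falls as the number of repeats rises, while the error against the true value stays close to the 1 cm offset built into the simulated ruler.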

In the 1970s and 80s, huge, careful studies found that women who took hormone replacement therapy (HRT) had far fewer heart attacks. The numbers were tight and reproducible across many groups, so they were highly reliable, and doctors began prescribing HRT to protect the heart. But the studies had a hidden flaw: they observed women who had chosen HRT for themselves, and those women also tended to be healthier and wealthier to begin with. That is not a fair test. More than one thing differed between the groups, so the studies were partly measuring the effect of being healthier and wealthier rather than the effect of HRT itself, and the conclusion wasn't valid. When large randomised trials finally reported their results around the turn of the 2000s, the heart benefit vanished. Years of precise, confident, invalid data had pointed medicine the wrong way: a textbook case of reliable-but-not-valid.