The Multiple Testing Problem
Understand p-hacking, the Bonferroni correction, false discovery rates, and why the replication crisis exists.
The Problem with Running Many Tests
If you test one hypothesis at α = 0.05, there's a 5% chance of a false positive. Seems manageable.
But what if you test 20 hypotheses? The probability of at least one false positive is:
P(at least one false positive) = 1 − (1 − 0.05)^20 ≈ 0.64
A 64% chance of at least one false positive! Test 100 hypotheses? You'd expect about 5 "significant" results purely by chance.
This is the multiple testing problem — the more tests you run, the more false discoveries you'll make.
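The calculation above can be sketched in a few lines. A minimal Python helper (the function name is my own, not from the text):

```python
def family_wise_error_rate(m, alpha=0.05):
    # P(at least one false positive) across m independent tests at level
    # alpha, assuming every null hypothesis is true.
    return 1 - (1 - alpha) ** m

print(f"{family_wise_error_rate(1):.3f}")    # 0.050
print(f"{family_wise_error_rate(20):.3f}")   # 0.642
print(f"{family_wise_error_rate(100):.3f}")  # 0.994
```

With 100 tests you are nearly certain (≈ 99.4%) to see at least one false positive, which is why "we found something significant" means little without knowing how many things were tested.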
p-Hacking: The Dark Side
p-hacking means manipulating the data analysis until you find p < 0.05. This includes:
- Testing many variables but only reporting the "significant" ones
- Trying multiple statistical tests and picking the best result
- Removing "outliers" until the result becomes significant
- Collecting data until p < 0.05, then stopping
- Splitting the data into subgroups and testing each
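One bullet above — collecting data until p < 0.05, then stopping — can be demonstrated directly. A small simulation sketch, where the batch size, sample cap, and the simple z-test are illustrative choices of mine, not from the text:

```python
import math
import random

random.seed(0)

def p_value(xs):
    # Two-sided one-sample z-test of mean 0 with known sd = 1.
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def optional_stopping_trial(max_n=200, batch=10, alpha=0.05):
    # Peek at the p-value after every batch of data; stop and declare
    # "significance" the moment p < alpha. The null is true throughout,
    # so every "significant" stop is a false positive.
    xs = []
    while len(xs) < max_n:
        xs += [random.gauss(0, 1) for _ in range(batch)]
        if p_value(xs) < alpha:
            return True
    return False

trials = 2000
fp_rate = sum(optional_stopping_trial() for _ in range(trials)) / trials
print(f"False-positive rate with optional stopping: {fp_rate:.0%}")
```

Even though each individual test is run at α = 0.05, repeatedly peeking inflates the false-positive rate well above 5%.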
A researcher studies whether jelly beans cause acne. They test 20 different colors:
- Red jelly beans & acne: p = 0.67 ❌
- Blue jelly beans & acne: p = 0.12 ❌
- Green jelly beans & acne: p = 0.04 ✅
- ... (17 more: all p > 0.05)
Published headline: "Green Jelly Beans Linked to Acne! (p = 0.04)"
The 19 non-significant tests are never mentioned. With 20 tests at α = 0.05, we'd EXPECT about 1 false positive. The green result is almost certainly noise.
This is not a hypothetical — it's based on an actual XKCD comic illustrating the problem, and similar practices happen in real research.
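The "expect about 1 false positive" claim is easy to check by simulation. Under a true null hypothesis, p-values are uniformly distributed on [0, 1], so a minimal sketch of the jelly bean study (no color actually linked to acne) looks like this:

```python
import random

random.seed(42)

# Simulate many repetitions of the study: 20 colors tested, all nulls true.
n_sims = 10_000
false_hits = 0
for _ in range(n_sims):
    p_values = [random.random() for _ in range(20)]  # uniform under the null
    false_hits += sum(p < 0.05 for p in p_values)

avg = false_hits / n_sims
print(f"Average 'significant' colors per study: {avg:.2f}")  # about 1.00
```

On average, each 20-color study produces roughly 20 × 0.05 = 1 spurious "discovery," exactly like the green jelly bean.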
p-hacking is arguably the biggest methodological problem in science today. It's not always intentional — researchers naturally gravitate toward analyses that produce interesting results. This makes it all the more insidious.
The Bonferroni Correction
The simplest fix: if you run m tests, use α/m as your significance threshold instead of α.
Testing 20 jelly bean colors with α = 0.05:
Bonferroni threshold: 0.05 / 20 = 0.0025
Now the green jelly bean result (p = 0.04) is NOT significant under the corrected threshold. Crisis averted.
Pros: Simple, controls the family-wise error rate (probability of any false positives). Cons: Very conservative — with many tests, the threshold becomes so strict that real effects are missed too.
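The correction itself is a one-liner. A minimal sketch applied to the jelly bean study (the green p = 0.04 plus placeholder values for the other 19 tests, since the text only reports a few of them):

```python
def bonferroni(p_values, alpha=0.05):
    # Reject only p-values below alpha / m, which controls the
    # family-wise error rate at alpha.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Red (0.67), blue (0.12), green (0.04), plus 17 illustrative non-significant
# results standing in for the unreported tests.
p_values = [0.67, 0.12, 0.04] + [0.5] * 17
print(bonferroni(p_values))  # all False: no color survives the correction
```

Against the corrected threshold of 0.05 / 20 = 0.0025, nothing is significant, which matches the intuition that the green result was noise.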
False Discovery Rate (FDR)
In fields like genomics where you test thousands of genes simultaneously, Bonferroni is too strict. The Benjamini-Hochberg procedure controls the false discovery rate instead — the expected proportion of false positives among the rejected hypotheses.
Controlling the FDR at 5% means: of all the hypotheses you reject, you expect about 5% to be false discoveries, and the rest to be real effects.
Bonferroni asks: "What's the chance of even ONE false positive?" (very strict) FDR asks: "What FRACTION of my discoveries are false?" (more permissive)
FDR is the default in genomics, neuroimaging, and other "big data" fields where thousands of simultaneous tests are the norm.
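The Benjamini-Hochberg step-up procedure is short enough to sketch in full: sort the p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject everything up to that rank. The example p-values below are illustrative, not from the text:

```python
def benjamini_hochberg(p_values, q=0.05):
    # Step-up procedure controlling the false discovery rate at level q.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value clears (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

ps = [0.001, 0.008, 0.012, 0.041, 0.06]
print(benjamini_hochberg(ps))  # [True, True, True, False, False]
```

Note the contrast with Bonferroni: at threshold 0.05 / 5 = 0.01, Bonferroni would reject only the first hypothesis, while BH also keeps 0.008 and 0.012, trading a few expected false discoveries for much better power.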
Real-World Consequences
The multiple testing problem has real consequences:
The Replication Crisis: Many published "significant" findings fail to replicate. A large-scale study found that only 36-39% of psychology results replicated successfully.
Reasons:
- p-hacking (conscious or unconscious)
- Publication bias (journals prefer positive results)
- Underpowered studies (small samples + multiple tests = false discoveries)
- Lack of correction for multiple testing
The file drawer problem: Studies with p > 0.05 don't get published. So the published literature is biased toward false positives. If 20 labs independently test the same false hypothesis, we'd expect 1 to find p < 0.05. That one gets published; the other 19 go into the "file drawer."
Protecting Yourself
As a researcher:
- Pre-register your hypotheses and analysis plan before collecting data
- Correct for multiple comparisons whenever testing multiple hypotheses
- Report all tests, not just significant ones
- Replicate before drawing strong conclusions
- Focus on effect sizes and confidence intervals, not just p-values
As a consumer of statistics:
- Be skeptical of surprising single-study results
- Ask "How many things did they test?"
- Look for replication across multiple studies
- Check if the study was pre-registered
- Remember: extraordinary claims require extraordinary evidence
The antidote to p-hacking is transparency. Pre-registration, open data, and reporting ALL analyses (not just the significant ones) dramatically reduce false discovery rates.