The Multiple Testing Problem
Understand p-hacking, the Bonferroni correction, false discovery rates, and why the replication crisis exists.
The Problem with Running Many Tests
If you test one hypothesis at α = 0.05, there's a 5% chance of a false positive. Seems manageable.
But what if you test 20 hypotheses? The probability of at least one false positive is:
P(at least one false positive) = 1 − (1 − 0.05)^20 ≈ 0.64
A 64% chance of at least one false positive! Test 100 hypotheses? You'd expect about 5 "significant" results purely by chance.
This is the multiple testing problem — the more tests you run, the more false discoveries you'll make.
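The calculation above can be sketched in a few lines. A minimal Python helper (the function name is my own, not from the text):

```python
def family_wise_error_rate(m, alpha=0.05):
    # P(at least one false positive) across m independent tests at level
    # alpha, assuming every null hypothesis is true.
    return 1 - (1 - alpha) ** m

print(f"{family_wise_error_rate(1):.3f}")    # 0.050
print(f"{family_wise_error_rate(20):.3f}")   # 0.642
print(f"{family_wise_error_rate(100):.3f}")  # 0.994
```

With 100 tests you are nearly certain (≈ 99.4%) to see at least one false positive, which is why "we found something significant" means little without knowing how many things were tested.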
p-Hacking: The Dark Side
p-hacking means manipulating the data analysis until you find p < 0.05. This includes:
- Testing many variables but only reporting the "significant" ones
- Trying multiple statistical tests and picking the best result
- Removing "outliers" until the result becomes significant
- Collecting data until p < 0.05, then stopping
- Splitting the data into subgroups and testing each
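One bullet above — collecting data until p < 0.05, then stopping — can be demonstrated directly. A small simulation sketch, where the batch size, sample cap, and the simple z-test are illustrative choices of mine, not from the text:

```python
import math
import random

random.seed(0)

def p_value(xs):
    # Two-sided one-sample z-test of mean 0 with known sd = 1.
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def optional_stopping_trial(max_n=200, batch=10, alpha=0.05):
    # Peek at the p-value after every batch of data; stop and declare
    # "significance" the moment p < alpha. The null is true throughout,
    # so every "significant" stop is a false positive.
    xs = []
    while len(xs) < max_n:
        xs += [random.gauss(0, 1) for _ in range(batch)]
        if p_value(xs) < alpha:
            return True
    return False

trials = 2000
fp_rate = sum(optional_stopping_trial() for _ in range(trials)) / trials
print(f"False-positive rate with optional stopping: {fp_rate:.0%}")
```

Even though each individual test is run at α = 0.05, repeatedly peeking inflates the false-positive rate well above 5%.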
A researcher studies whether jelly beans cause acne. They test 20 different colors:
- Red jelly beans & acne: p = 0.67 ❌
- Blue jelly beans & acne: p = 0.12 ❌
- Green jelly beans & acne: p = 0.04 ✅
- ... (17 more: all p > 0.05)
Published headline: "Green Jelly Beans Linked to Acne! (p = 0.04)"
The 19 non-significant tests are never mentioned. With 20 tests at α = 0.05, we'd EXPECT about 1 false positive. The green result is almost certainly noise.
This is not a hypothetical — it's based on an actual XKCD comic illustrating the problem, and similar practices happen in real research.
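The "expect about 1 false positive" claim is easy to check by simulation. Under a true null hypothesis, p-values are uniformly distributed on [0, 1], so a minimal sketch of the jelly bean study (no color actually linked to acne) looks like this:

```python
import random

random.seed(42)

# Simulate many repetitions of the study: 20 colors tested, all nulls true.
n_sims = 10_000
false_hits = 0
for _ in range(n_sims):
    p_values = [random.random() for _ in range(20)]  # uniform under the null
    false_hits += sum(p < 0.05 for p in p_values)

avg = false_hits / n_sims
print(f"Average 'significant' colors per study: {avg:.2f}")  # about 1.00
```

On average, each 20-color study produces roughly 20 × 0.05 = 1 spurious "discovery," exactly like the green jelly bean.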
p-hacking is arguably the biggest methodological problem in science today. It's not always intentional — researchers naturally gravitate toward analyses that produce interesting results. This makes it all the more insidious.
The Bonferroni Correction
The simplest fix: if you run m tests, use α/m as your significance threshold instead of α.
Testing 20 jelly bean colors with α = 0.05:
Bonferroni threshold: 0.05 / 20 = 0.0025
Now the green jelly bean result (p = 0.04) is NOT significant under the corrected threshold. Crisis averted.
Pros: Simple, controls the family-wise error rate (probability of any false positives). Cons: Very conservative — with many tests, the threshold becomes so strict that real effects are missed too.
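The correction itself is a one-liner. A minimal sketch applied to the jelly bean study (the green p = 0.04 plus placeholder values for the other 19 tests, since the text only reports a few of them):

```python
def bonferroni(p_values, alpha=0.05):
    # Reject only p-values below alpha / m, which controls the
    # family-wise error rate at alpha.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Red (0.67), blue (0.12), green (0.04), plus 17 illustrative non-significant
# results standing in for the unreported tests.
p_values = [0.67, 0.12, 0.04] + [0.5] * 17
print(bonferroni(p_values))  # all False: no color survives the correction
```

Against the corrected threshold of 0.05 / 20 = 0.0025, nothing is significant, which matches the intuition that the green result was noise.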
False Discovery Rate (FDR)
In fields like genomics where you test thousands of genes simultaneously, Bonferroni is too strict. The Benjamini-Hochberg procedure controls the false discovery rate instead — the expected proportion of false positives among the rejected hypotheses.
Controlling the FDR at 5% means: of all the hypotheses you reject, you expect about 5% to be false discoveries, and the rest to be real effects.
Bonferroni asks: "What's the chance of even ONE false positive?" (very strict) FDR asks: "What FRACTION of my discoveries are false?" (more permissive)
FDR is the default in genomics, neuroimaging, and other "big data" fields where thousands of simultaneous tests are the norm.
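The Benjamini-Hochberg step-up procedure is short enough to sketch in full: sort the p-values, find the largest rank k with p_(k) ≤ (k/m)·q, and reject everything up to that rank. The example p-values below are illustrative, not from the text:

```python
def benjamini_hochberg(p_values, q=0.05):
    # Step-up procedure controlling the false discovery rate at level q.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value clears (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

ps = [0.001, 0.008, 0.012, 0.041, 0.06]
print(benjamini_hochberg(ps))  # [True, True, True, False, False]
```

Note the contrast with Bonferroni: at threshold 0.05 / 5 = 0.01, Bonferroni would reject only the first hypothesis, while BH also keeps 0.008 and 0.012, trading a few expected false discoveries for much better power.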
Real-World Consequences
The multiple testing problem has real consequences:
The Replication Crisis: Many published "significant" findings fail to replicate. A large-scale study found that only 36-39% of psychology results replicated successfully.
Reasons:
- p-hacking (conscious or unconscious)
- Publication bias (journals prefer positive results)
- Underpowered studies (small samples + multiple tests = false discoveries)
- Lack of correction for multiple testing
The file drawer problem: Studies with p > 0.05 don't get published. So the published literature is biased toward false positives. If 20 labs independently test the same false hypothesis, we'd expect 1 to find p < 0.05. That one gets published; the other 19 go into the "file drawer."
Protecting Yourself
As a researcher:
- Pre-register your hypotheses and analysis plan before collecting data
- Correct for multiple comparisons whenever testing multiple hypotheses
- Report all tests, not just significant ones
- Replicate before drawing strong conclusions
- Focus on effect sizes and confidence intervals, not just p-values
As a consumer of statistics:
- Be skeptical of surprising single-study results
- Ask "How many things did they test?"
- Look for replication across multiple studies
- Check if the study was pre-registered
- Remember: extraordinary claims require extraordinary evidence
The antidote to p-hacking is transparency. Pre-registration, open data, and reporting ALL analyses (not just the significant ones) dramatically reduce false discovery rates.