Test Statistics & p-values
Understand what p-values really mean (and don't mean), Type I and II errors, statistical power, and the limitations of significance testing.
What Is a p-value?
The probability of observing data as extreme as or more extreme than what you actually got, assuming H₀ is true.
In other words: if nothing interesting is happening (H₀), how surprising is your data?
- Small p-value (e.g., 0.01): Your data is very unlikely under H₀ → strong evidence against H₀
- Large p-value (e.g., 0.45): Your data is perfectly plausible under H₀ → no evidence against H₀
You suspect a coin is biased. You flip it 100 times and get 60 heads.
p-value answers: "If the coin WERE fair, what's the probability of getting 60 or more heads (or 40 or fewer)?"
p-value ≈ 0.046 (two-sided, using the normal approximation to the binomial)
This means: if the coin is truly fair, there's only a 4.6% chance of seeing a result this extreme. That's pretty unlikely — you might conclude the coin is biased.
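The arithmetic behind the coin example can be checked directly. A minimal standard-library sketch showing both the normal approximation (which gives the ≈ 0.046 quoted above) and the exact binomial calculation (which comes out slightly larger, ≈ 0.057):

```python
import math

def two_sided_p_normal(heads, n, p0=0.5):
    """Two-sided p-value via the normal approximation to the
    binomial (no continuity correction)."""
    mean = n * p0
    se = math.sqrt(n * p0 * (1 - p0))
    z = (heads - mean) / se          # here: (60 - 50) / 5 = 2.0
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

def two_sided_p_exact(heads, n, p0=0.5):
    """Exact two-sided binomial p-value: 2 * P(X >= heads).
    Doubling the upper tail is valid here because p0 = 0.5
    makes the distribution symmetric."""
    upper = sum(math.comb(n, k) for k in range(heads, n + 1)) * p0**n
    return 2 * upper

print(round(two_sided_p_normal(60, 100), 3))  # ≈ 0.046
print(round(two_sided_p_exact(60, 100), 3))   # ≈ 0.057
```

Either way, the conclusion is the same order of "surprise": a fair coin produces a result this extreme only about 5% of the time.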
What p-values Do NOT Mean
This is where most people (including many scientists) go wrong:
Common misinterpretations:
❌ "There's a 3% chance H₀ is true" (for, say, p = 0.03) — The p-value is NOT the probability that H₀ is true. H₀ is either true or false; the p-value is about the data, not the hypothesis.
❌ "There's a 97% chance H₁ is true" — Same error in reverse.
❌ "The effect is large" — A tiny, meaningless effect can have a small p-value with a huge sample. p-values measure evidence, not effect size.
❌ "The result will replicate" — A p-value of 0.04 doesn't guarantee the finding will hold up in future studies.
❌ "p > 0.05 means there's no effect" — It means we didn't detect one with this data. Could be no effect, or could be insufficient power.
Test Statistics
A test statistic converts your data into a single number that measures how far your sample result is from what H₀ predicts. The general form:

test statistic = (sample estimate − value predicted by H₀) / standard error
It answers: "How many standard errors away from the H₀ prediction is our sample?"
| Test | Statistic | When to Use |
|---|---|---|
| Z-test | z = (x̄ - μ₀) / (σ/√n) | σ known, large n |
| One-sample t-test | t = (x̄ - μ₀) / (s/√n) | σ unknown |
| Two-sample t-test | t = (x̄₁ - x̄₂) / SE | Comparing two means |
| Chi-square test | χ² = Σ(O-E)²/E | Categorical data |
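As a sketch of how the table's formulas turn raw data into a statistic, here is the one-sample t computed with the standard library (the reaction-time numbers and the claimed mean of 250 ms are made up for illustration):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """One-sample t statistic from the table: t = (x̄ - μ₀) / (s/√n)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample std dev (n - 1 denominator)
    return (xbar - mu0) / (s / math.sqrt(n))

# Hypothetical data: do these reaction times differ from a claimed 250 ms?
times = [255, 261, 248, 252, 259, 263, 247, 256]
t = one_sample_t(times, 250)
print(f"t = {t:.2f} on {len(times) - 1} degrees of freedom")  # t ≈ 2.48
```

The sample mean sits about 2.5 standard errors above the hypothesized value, which is exactly the "how many standard errors away?" question the statistic is built to answer.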
Type I and Type II Errors
| Decision | H₀ is True (reality) | H₀ is False (reality) |
|---|---|---|
| Reject H₀ | ❌ Type I Error (false positive) | ✅ Correct (true positive) |
| Fail to Reject H₀ | ✅ Correct (true negative) | ❌ Type II Error (false negative) |
False positive: Concluding there's an effect when there isn't one. The probability of this is α (the significance level you chose).
Real-world cost: Approving an ineffective drug, convicting an innocent person, launching a feature that doesn't actually help.
False negative: Failing to detect a real effect. The probability of this is β.
Real-world cost: Missing an effective drug, failing to diagnose a disease, not launching a feature that would have helped.
The tradeoff: Reducing α (fewer false positives) increases β (more false negatives), and vice versa. You can't minimize both simultaneously with fixed n. The only way to reduce both is to increase sample size.
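The tradeoff shows up directly in a quick Monte Carlo sketch (a hypothetical two-sided z-test with known σ; all numbers here are illustration values): when H₀ is true, the rejection rate tracks α; when H₀ is false, shrinking α also shrinks power.

```python
import math
import random

random.seed(42)

def z_test_rejects(sample, mu0, sigma, alpha):
    """Two-sided z-test with known σ: reject H₀ if |z| > critical value."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    z_crit = {0.05: 1.96, 0.01: 2.576}[alpha]  # two-sided critical values
    return abs(z) > z_crit

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=20, alpha=0.05,
                   trials=10_000):
    """Fraction of simulated experiments that reject H₀."""
    rejections = sum(
        z_test_rejects([random.gauss(true_mu, sigma) for _ in range(n)],
                       mu0, sigma, alpha)
        for _ in range(trials)
    )
    return rejections / trials

# H₀ true (true_mu = 0): the rejection rate IS the Type I error rate ≈ α
print("alpha=0.05:", rejection_rate(0.0, alpha=0.05))   # ≈ 0.05
print("alpha=0.01:", rejection_rate(0.0, alpha=0.01))   # ≈ 0.01
# H₀ false (true_mu = 0.5): stricter α means lower power, i.e. higher β
print("power at alpha=0.05:", rejection_rate(0.5, alpha=0.05))  # ≈ 0.61
print("power at alpha=0.01:", rejection_rate(0.5, alpha=0.01))  # ≈ 0.37
```

Dropping α from 0.05 to 0.01 cuts false positives fivefold but, at this fixed n, pushes β from roughly 0.39 up to roughly 0.63.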
Statistical Power
The probability of correctly rejecting H₀ when H₁ is true:

Power = 1 − β
Power increases when:
- Effect size is larger (easier to detect big effects)
- Sample size n is larger (more data = more ability to detect)
- α is larger (lower bar for rejection, but more false positives)
- Variability σ is smaller (less noise obscuring the signal)
Convention: Studies should aim for power ≥ 0.80 (80% chance of detecting a real effect).
Power analysis is done BEFORE collecting data to determine the sample size needed. Underpowered studies are a waste of resources — they're unlikely to detect effects even when they exist.
The replication crisis in science is partly due to underpowered studies. With power of 0.50, half of real effects go undetected. The ones that DO get detected tend to overestimate the effect size (because they needed to be "lucky" to clear the significance threshold). This is called the winner's curse.
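A rough pre-study power calculation for a two-sided z-test at α = 0.05 shows how this works in practice (the effect size 0.3 and σ = 1 are arbitrary illustration values; the formula drops the negligible opposite-tail term):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def z_test_power(effect, sigma, n):
    """Approximate power of a two-sided z-test (α = 0.05) for a mean
    shift of `effect`: Power ≈ Φ(effect·√n/σ − z_crit)."""
    z_crit = 1.96                         # z for α/2 = 0.025
    shift = effect * math.sqrt(n) / sigma  # true shift in SE units
    return phi(shift - z_crit)

# Power grows with n; about n ≈ 90 is needed to reach 0.80 here
for n in (30, 60, 90, 120):
    print(n, round(z_test_power(0.3, 1.0, n), 2))
```

This is the calculation done BEFORE data collection: pick the smallest effect you care about, then solve for the n that pushes power past 0.80. Running it in reverse on a completed underpowered study is how the winner's curse gets diagnosed.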
Significance Level: The 0.05 Threshold
The α = 0.05 standard was popularized by Ronald Fisher in the 1920s. He chose it somewhat arbitrarily as "convenient." It is NOT a law of nature.
Problems with rigid thresholds:
- p = 0.049 → "significant!" vs p = 0.051 → "not significant!" — This is absurd. The evidence is nearly identical.
- It encourages binary thinking (effect/no effect) instead of considering the strength of evidence
- It incentivizes p-hacking (manipulating analyses until p < 0.05)
Better practice:
- Report the actual p-value, not just "significant/not significant"
- Always report effect sizes and confidence intervals alongside p-values
- Consider the practical significance, not just statistical significance
- Use multiple lines of evidence, not a single p-value
Statistical significance ≠ practical significance. A drug that lowers blood pressure by 0.1 mmHg might be statistically significant with n = 100,000 (because the SE is tiny), but it's clinically meaningless. Always ask: "Is the effect large enough to matter in the real world?"
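The blood-pressure point can be made concrete with a small sketch (assuming, purely for illustration, a between-patient σ of 5 mmHg and a z-test): the identical 0.1 mmHg effect goes from nowhere near significant to overwhelmingly "significant" only because n grows.

```python
import math

def p_value_for_mean_shift(effect, sigma, n):
    """Two-sided p-value for an observed mean shift `effect`, known σ."""
    z = effect / (sigma / math.sqrt(n))   # shift measured in standard errors
    return math.erfc(abs(z) / math.sqrt(2))

# Same clinically trivial 0.1 mmHg drop, two sample sizes:
print(p_value_for_mean_shift(0.1, 5.0, 100))      # p ≈ 0.84: undetectable
print(p_value_for_mean_shift(0.1, 5.0, 100_000))  # p < 1e-9: "significant"
```

Nothing about the drug changed between the two lines; only the standard error shrank. The p-value answered "can we detect it?", while "does it matter?" needs the effect size itself.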