Test Statistics & p-values

Understand what p-values really mean (and don't mean), Type I and II errors, statistical power, and the limitations of significance testing.

25 min read
Intermediate

What Is a p-value?

The probability of observing data as extreme as or more extreme than what you actually got, assuming H₀ is true.

In other words: if nothing interesting is happening (H₀), how surprising is your data?

  • Small p-value (e.g., 0.01): Your data is very unlikely under H₀ → strong evidence against H₀
  • Large p-value (e.g., 0.45): Your data is perfectly plausible under H₀ → no evidence against H₀

p-value Intuition

You suspect a coin is biased. You flip it 100 times and get 60 heads.

p-value answers: "If the coin WERE fair, what's the probability of getting 60 or more heads (or 40 or fewer)?"

p-value ≈ 0.046

This means: if the coin is truly fair, there's only a 4.6% chance of seeing a result this extreme. That's pretty unlikely — you might conclude the coin is biased.
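The p ≈ 0.046 figure comes from the normal approximation to the binomial. A minimal sketch of that calculation (an exact binomial test would give a slightly larger value, around 0.057):

```python
import math

# Normal approximation to the binomial for 60 heads in 100 flips of a fair coin
n, heads, p0 = 100, 60, 0.5

# Under H0 (fair coin): mean = n*p0 = 50, sd = sqrt(n*p0*(1-p0)) = 5
mean = n * p0
sd = math.sqrt(n * p0 * (1 - p0))
z = (heads - mean) / sd  # 2.0 standard errors above expectation

# Two-sided p-value: P(|Z| >= 2) = erfc(2 / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p-value ≈ {p_value:.3f}")  # z = 2.00, p-value ≈ 0.046
```

Note the two-sided tail: we count 60-or-more heads *and* 40-or-fewer, exactly as the question above is phrased.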

What p-values Do NOT Mean

This is where most people (including many scientists) go wrong:

Common misinterpretations:

❌ "There's a 3% chance H₀ is true" — The p-value is NOT the probability that H₀ is true. H₀ is either true or false; the p-value is about the data, not the hypothesis.

❌ "There's a 97% chance H₁ is true" — Same error in reverse.

❌ "The effect is large" — A tiny, meaningless effect can have a small p-value with a huge sample. p-values measure evidence, not effect size.

❌ "The result will replicate" — A p-value of 0.04 doesn't guarantee the finding will hold up in future studies.

❌ "p > 0.05 means there's no effect" — It means we didn't detect one with this data. Could be no effect, or could be insufficient power.

Test Statistics

A test statistic converts your data into a single number that measures how far your sample result is from what H₀ predicts. The general form:

Test Statistic = (Observed − Expected under H₀) / Standard Error

It answers: "How many standard errors away from the H₀ prediction is our sample?"

Common Test Statistics
Test                 Statistic                    When to Use
Z-test               z = (x̄ − μ₀) / (σ/√n)        σ known, large n
One-sample t-test    t = (x̄ − μ₀) / (s/√n)        σ unknown
Two-sample t-test    t = (x̄₁ − x̄₂) / SE           Comparing two means
Chi-square test      χ² = Σ(O − E)² / E           Categorical data
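These formulas plug straight into code. A minimal sketch of the one-sample t statistic; the sample values and μ₀ = 5.0 are made up for illustration:

```python
import math
import statistics

# Hypothetical sample; test H0: μ = 5.0
data = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.4]
mu0 = 5.0

n = len(data)
x_bar = statistics.mean(data)  # 5.1
s = statistics.stdev(data)     # sample std dev (n-1 denominator): 0.2

# t = (x̄ − μ₀) / (s/√n): distance from the H0 prediction in standard-error units
se = s / math.sqrt(n)
t = (x_bar - mu0) / se
print(f"t = {t:.3f} with {n - 1} degrees of freedom")  # t = 1.414
```

With 7 degrees of freedom, t ≈ 1.41 is well inside the plausible range under H₀, so this sample alone would not reject μ = 5.0.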

Type I and Type II Errors

Decision Outcomes

                      H₀ is True (reality)               H₀ is False (reality)
Reject H₀             ❌ Type I Error (false positive)   ✅ Correct (true positive)
Fail to Reject H₀     ✅ Correct (true negative)         ❌ Type II Error (false negative)

False positive: Concluding there's an effect when there isn't one. The probability of this is α (the significance level you chose).

Real-world cost: Approving an ineffective drug, convicting an innocent person, launching a feature that doesn't actually help.

False negative: Failing to detect a real effect. The probability of this is β.

Real-world cost: Missing an effective drug, failing to diagnose a disease, not launching a feature that would have helped.

The tradeoff: Reducing α (fewer false positives) increases β (more false negatives), and vice versa. You can't minimize both simultaneously with fixed n. The only way to reduce both is to increase sample size.
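You can see α at work by simulation. A sketch that estimates the false-positive rate when H₀ really is true (a fair coin), using the |z| > 1.96 rejection rule for α = 0.05; the trial counts are illustrative:

```python
import random

random.seed(0)
trials, n = 2000, 100
rejections = 0

for _ in range(trials):
    # H0 is TRUE here: the coin is genuinely fair
    heads = sum(random.random() < 0.5 for _ in range(n))
    z = (heads - 50) / 5       # standardize: sd = sqrt(100 * 0.5 * 0.5) = 5
    if abs(z) > 1.96:          # reject at α = 0.05
        rejections += 1

# Every rejection above is a Type I error; the rate hovers near α
print(f"Type I error rate ≈ {rejections / trials:.3f}")
```

Because the binomial is discrete, the realized rate sits slightly above 0.05 here; the point is that α is a long-run error rate you choose, not a property of any one experiment.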

Statistical Power

The probability of correctly rejecting H₀ when H₁ is true:

Power = 1 − β = P(Reject H₀ | H₀ is false)

Power increases when:

  • Effect size is larger (easier to detect big effects)
  • Sample size n is larger (more data = more ability to detect)
  • α is larger (lower bar for rejection, but more false positives)
  • Variability σ is smaller (less noise obscuring the signal)

Convention: Studies should aim for power ≥ 0.80 (80% chance of detecting a real effect).

Power analysis is done BEFORE collecting data to determine the sample size needed. Underpowered studies are a waste of resources — they're unlikely to detect effects even when they exist.
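Under normal-approximation assumptions, the relationship between effect size, n, and power can be sketched directly. The standardized effect size d = 0.5 ("medium") and σ = 1 below are illustrative choices:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_one_sample_z(d, n, z_crit=1.96):
    """Approximate power of a two-sided one-sample z-test for a
    standardized effect size d (true shift in sd units)."""
    # Under H1, the z statistic is centered at d*sqrt(n) instead of 0
    shift = d * math.sqrt(n)
    # Probability of clearing the critical value (upper tail dominates)
    return 1 - normal_cdf(z_crit - shift)

# Medium effect (d = 0.5): n = 32 lands near the conventional 80% power
print(f"power ≈ {power_one_sample_z(0.5, 32):.2f}")  # ≈ 0.81
```

Running this for a range of n before collecting data is exactly what a power analysis does: find the smallest n whose power clears 0.80.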

The replication crisis in science is partly due to underpowered studies. With power of 0.50, half of real effects go undetected. The ones that DO get detected tend to overestimate the effect size (because they needed to be "lucky" to clear the significance threshold). This is called the winner's curse.

Significance Level: The 0.05 Threshold

The α = 0.05 standard was popularized by Ronald Fisher in the 1920s. He chose it somewhat arbitrarily as "convenient." It is NOT a law of nature.

Problems with rigid thresholds:

  • p = 0.049 → "significant!" vs p = 0.051 → "not significant!" — This is absurd. The evidence is nearly identical.
  • It encourages binary thinking (effect/no effect) instead of considering the strength of evidence
  • It incentivizes p-hacking (manipulating analyses until p < 0.05)

Better practice:

  • Report the actual p-value, not just "significant/not significant"
  • Always report effect sizes and confidence intervals alongside p-values
  • Consider the practical significance, not just statistical significance
  • Use multiple lines of evidence, not a single p-value

Statistical significance ≠ practical significance. A drug that lowers blood pressure by 0.1 mmHg might be statistically significant with n = 100,000 (because the SE is tiny), but it's clinically meaningless. Always ask: "Is the effect large enough to matter in the real world?"
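The blood-pressure example can be checked numerically. The spread σ = 10 mmHg below is an assumed, illustrative value:

```python
import math

# A clinically trivial 0.1 mmHg drop, with an assumed sd of 10 mmHg
effect, sigma, n = 0.1, 10.0, 100_000

se = sigma / math.sqrt(n)             # ≈ 0.032: huge n makes the SE tiny
z = effect / se                       # ≈ 3.16 standard errors from zero
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
print(f"z = {z:.2f}, p ≈ {p:.4f}")    # "significant", yet clinically meaningless
```

The p-value collapses toward zero purely because n is enormous; the 0.1 mmHg effect itself never changed.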
