Hypothesis Testing Framework

Learn the logic of hypothesis testing: null and alternative hypotheses, one-tailed vs two-tailed tests, significance levels, and the testing procedure.

The Logic of Hypothesis Testing

Hypothesis testing is statistics' framework for making decisions from data. It follows the same logic as a criminal trial:

  1. Start by assuming innocence (nothing is happening)
  2. Look at the evidence (data)
  3. Decide if the evidence is strong enough to reject innocence (conclude something is happening)

We never "prove" anything. We either find sufficient evidence to reject the default assumption, or we don't.

Null and Alternative Hypotheses

Null hypothesis (H₀): the default assumption, typically "nothing is happening," "no difference," or "no effect." It is the claim we try to disprove.

Examples:

  • The drug has no effect (μ_drug = μ_placebo)
  • The coin is fair (p = 0.5)
  • There's no difference between groups (μ₁ = μ₂)

Alternative hypothesis (H₁): what we're trying to find evidence for, typically "something is happening." It's the research hypothesis.

Examples:

  • The drug has an effect (μ_drug ≠ μ_placebo)
  • The coin is biased (p ≠ 0.5)
  • The groups differ (μ₁ ≠ μ₂)

The burden of proof is on H₁. We don't try to prove H₀ — we assume it's true and see if the data contradicts it strongly enough. This is like "innocent until proven guilty."

One-Tailed vs Two-Tailed Tests

The alternative hypothesis determines whether the test is one-tailed or two-tailed:

Types of Tests

| Type | H₁ | When to Use |
|---|---|---|
| Two-tailed | μ ≠ μ₀ | You care about any difference (higher or lower) |
| Right-tailed | μ > μ₀ | You only care if the value is higher |
| Left-tailed | μ < μ₀ | You only care if the value is lower |

Choosing the Right Tail

"Does the new drug affect blood pressure?" → Two-tailed (could raise or lower it)

"Does the new drug lower blood pressure?" → Left-tailed (we only care about a decrease)

"Do students score higher with the new teaching method?" → Right-tailed

Important: You must choose one-tailed vs two-tailed BEFORE looking at the data. Choosing after seeing which direction the data went is a form of cheating (p-hacking).
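
To make the effect of the tail choice concrete, here is a minimal sketch that converts the same z statistic into a p-value under each of the three alternatives. It assumes the test statistic follows a standard normal distribution under H₀; the function name `p_value_from_z` and the example value z = 1.8 are illustrative choices, not part of the lesson.

```python
from statistics import NormalDist


def p_value_from_z(z, tail="two"):
    """Return the p-value for a z statistic under a standard normal null.

    tail: "two" (H1: mu != mu0), "right" (H1: mu > mu0), "left" (H1: mu < mu0).
    """
    cdf = NormalDist().cdf  # standard normal CDF (mean 0, sd 1)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))  # area in both tails
    if tail == "right":
        return 1 - cdf(z)             # area above z
    if tail == "left":
        return cdf(z)                 # area below z
    raise ValueError("tail must be 'two', 'right', or 'left'")


z = 1.8  # hypothetical test statistic
for tail in ("two", "right", "left"):
    print(f"{tail:>5}-tailed p-value: {p_value_from_z(z, tail):.4f}")
```

With this hypothetical z, the right-tailed p-value falls below 0.05 while the two-tailed one does not. That gap is why picking the direction after seeing the data inflates the false-positive rate.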

The Testing Procedure

Step 1: State H₀ and H₁

Step 2: Choose a significance level α (typically 0.05)

Step 3: Collect data and compute a test statistic

Step 4: Find the p-value or compare to the critical value

Step 5: Make a decision:

  • If p-value ≤ α → Reject H₀ (evidence supports H₁)
  • If p-value > α → Fail to reject H₀ (insufficient evidence)

We never "accept H₀." We only "fail to reject" it. Absence of evidence is not evidence of absence. Not finding evidence that a drug works doesn't mean it doesn't work — maybe our sample was too small or our test wasn't sensitive enough.
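
Steps 4 and 5 can be carried out in two equivalent ways: compare the p-value to α, or compare the test statistic to a critical value. The sketch below shows both for a two-tailed z-test, assuming a standard normal test statistic under H₀. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def decide_by_p_value(z, alpha=0.05):
    """Two-tailed decision: reject H0 when the p-value is at most alpha."""
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return "reject H0" if p_value <= alpha else "fail to reject H0"


def decide_by_critical_value(z, alpha=0.05):
    """Two-tailed decision: reject H0 when |z| reaches the critical value."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return "reject H0" if abs(z) >= z_crit else "fail to reject H0"


for z in (1.5, 2.0, -2.7):
    print(z, decide_by_p_value(z), decide_by_critical_value(z))
```

Note that neither function ever returns "accept H0": the only outcomes are rejecting or failing to reject.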

Significance Level (α)

The significance level α is the probability threshold for rejecting H₀. It's the maximum P(reject H₀ | H₀ is true), that is, the false-positive rate, you're willing to tolerate.

Common choices:

  • α = 0.05 (5%) — Standard in most fields
  • α = 0.01 (1%) — More conservative (medicine, physics)
  • α = 0.10 (10%) — More liberal (exploratory research)

Setting α = 0.05 means you're willing to accept a 5% chance of falsely concluding something is happening when it isn't. In other words, when H₀ is actually true, about 1 in 20 tests will cry wolf.
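
One way to see this is a small simulation in which H₀ is true by construction. The sketch below assumes normally distributed data with known standard deviation 1 and a two-tailed z-test at α = 0.05. The sample size, trial count, and seed are arbitrary illustrative values.

```python
import random
from math import sqrt


def simulate_false_positive_rate(trials=10_000, n=25, z_crit=1.96, seed=0):
    """Estimate how often a two-tailed z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        # Draw a sample from Normal(0, 1), so H0: mu = 0 really holds.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * sqrt(n)  # sample mean / (sigma / sqrt(n)), sigma = 1
        if abs(z) > z_crit:
            rejections += 1  # a false positive (Type I error)
    return rejections / trials


rate = simulate_false_positive_rate()
print(f"Estimated false-positive rate: {rate:.3f}")
```

The estimated rate should land near 0.05, with some run-to-run noise if the seed is changed.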

Why not set α = 0.001 to be super safe? Because, for a fixed sample size, the stricter you make it against false positives, the more likely you are to miss real effects (false negatives). It's a tradeoff.
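
The tradeoff can be sketched numerically. The example below assumes a two-tailed z-test in which a real effect shifts the mean of the z statistic to 2.0. That shift is a made-up value for illustration, and `type_ii_error_rate` is a hypothetical helper name.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def type_ii_error_rate(alpha, shift):
    """P(fail to reject H0) when the z statistic is really Normal(shift, 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    # We fail to reject when -z_crit < Z < z_crit, with Z centered at `shift`.
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


alphas = (0.10, 0.05, 0.01, 0.001)
betas = [type_ii_error_rate(a, shift=2.0) for a in alphas]
for a, b in zip(alphas, betas):
    print(f"alpha = {a:<5}  false-negative rate = {b:.3f}")
```

Under these assumptions, each stricter α produces a larger false-negative rate.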

A Complete Example

Testing a Coin for Fairness

You flip a coin 100 times and get 60 heads. Is the coin fair?

Step 1: H₀: p = 0.5 (fair coin), H₁: p ≠ 0.5 (biased coin) — two-tailed

Step 2: α = 0.05

Step 3: Under H₀, the number of heads ~ Binomial(100, 0.5).

  • Expected: 50 heads
  • Standard deviation of the count: √(100 × 0.5 × 0.5) = 5
  • Z = (60 - 50) / 5 = 2.0 (using the normal approximation to the binomial)

Step 4: P(|Z| > 2.0) = 2 × P(Z > 2.0) = 2 × 0.0228 = 0.0456

Step 5: p-value (0.046) < α (0.05) → Reject H₀

At the 5% significance level, there's sufficient evidence to conclude the coin is biased.

Note: At α = 0.01, we would NOT reject (0.046 > 0.01). The conclusion depends on our chosen threshold!
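
The five steps of this example can be reproduced in a few lines. This is a sketch that uses the same normal approximation as the calculation above; an exact binomial test can give a slightly different p-value. The function name `coin_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def coin_z_test(heads, flips, p0=0.5, alpha=0.05):
    """Two-tailed z-test of H0: p = p0 using the normal approximation."""
    expected = flips * p0                # expected heads under H0
    sd = sqrt(flips * p0 * (1 - p0))     # sd of the head count under H0
    z = (heads - expected) / sd          # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails
    return z, p_value, p_value <= alpha  # True in the last slot means "reject H0"


z, p_value, reject_05 = coin_z_test(60, 100, alpha=0.05)
_, _, reject_01 = coin_z_test(60, 100, alpha=0.01)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"Reject H0 at alpha = 0.05? {reject_05}")
print(f"Reject H0 at alpha = 0.01? {reject_01}")
```

The p-value should come out close to the 0.0456 computed by hand; any small difference comes from rounding the table value 0.0228.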

The Courtroom Analogy

| Court System | Hypothesis Testing |
|---|---|
| Innocent until proven guilty | H₀ assumed true until rejected |
| Evidence presented | Data collected |
| Beyond reasonable doubt | p-value ≤ α |
| Guilty verdict | Reject H₀ |
| Not guilty verdict | Fail to reject H₀ |
| Not guilty ≠ innocent | Fail to reject ≠ H₀ is true |
| Wrongful conviction | Type I error |
| Guilty person goes free | Type II error |

This analogy breaks down in one way: courts use subjective judgment, while hypothesis tests use a precise numerical threshold. But the logical structure is identical.
