When Normality Fails

Handle skewed data, heavy tails, and outliers. Learn to detect non-normality and choose appropriate alternatives.


The Real World Is Messy

Most statistical methods you've learned assume data is approximately normal. But real-world data often isn't:

  • Income data is heavily right-skewed (most people earn moderate amounts, a few earn millions)
  • Survival times are right-skewed (many patients have short survival times, but a few survive much longer)
  • Insurance claims have heavy tails (most claims are small, but catastrophic ones are enormous)
  • Likert scale data (1-5 ratings) can't be normal: it's discrete and bounded

When normality fails, standard methods can give misleading results: wrong p-values, incorrect confidence intervals, and poor predictions.

Skewed Distributions

Right skew (positive skew) - long tail to the right:

  • Mean > Median
  • Examples: income, house prices, hospital stays, claim sizes

Left skew (negative skew) - long tail to the left:

  • Mean < Median
  • Examples: age at death in developed countries, scores on easy exams
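The mean-versus-median pattern above is easy to check numerically. A minimal sketch using simulated right-skewed, income-like data (the lognormal parameters are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed "income" data: a lognormal has a long right tail
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)

mean_income = incomes.mean()
median_income = np.median(incomes)

# For right-skewed data, the tail pulls the mean above the median
print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

For left-skewed data the inequality flips: the long left tail pulls the mean below the median.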

Impact on standard methods:

  • t-tests can produce incorrect p-values
  • Confidence intervals may not have the claimed coverage
  • The mean becomes a poor measure of "typical"

Heavy-Tailed Distributions

A distribution where extreme values are more likely than the normal distribution predicts. In normal data, values beyond 4σ from the mean are essentially impossible. In heavy-tailed data, they happen surprisingly often.
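One way to see the difference is to compare tail probabilities under a normal distribution and a heavy-tailed one, such as Student's t with 3 degrees of freedom (a common textbook stand-in for heavy tails; the cutoff of 4 and the choice of df are illustrative):

```python
from scipy import stats

# Two-sided probability of landing more than 4 units from the center.
# Note: we compare raw values, not standardized ones, for illustration.
p_normal = 2 * stats.norm.sf(4)      # tiny: roughly 1 in 16,000
p_heavy = 2 * stats.t.sf(4, df=3)    # hundreds of times larger

print(f"normal tail beyond 4: {p_normal:.2e}")
print(f"t(3)   tail beyond 4: {p_heavy:.2e}")
```

The heavy-tailed distribution assigns far more probability to extreme values, which is exactly why normal-based risk estimates can be badly overconfident.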

The Danger of Thin-Tail Assumptions

Financial models before 2008 assumed market returns were approximately normal. Under normality, the 2008 crash was a "25-sigma event": probability essentially zero (on the order of 10⁻¹³⁸).

But it happened, because financial returns have heavy tails. Events that should be "impossible" under normality occur every few decades.

The lesson: Using normal-based methods when tails are heavy makes you overconfident. You underestimate risk.

Detecting Non-Normality

Visual methods (most useful):

  1. Histogram: does it look bell-shaped? Skewed? Multimodal?
  2. QQ plot (quantile-quantile plot): plots your data quantiles against theoretical normal quantiles. Points should fall on a straight line if the data is normal; deviations at the tails indicate non-normality.
  3. Box plot: an asymmetric box or many outliers suggests non-normality.

Formal tests:

  • Shapiro-Wilk test: best for small-to-medium samples
  • Kolmogorov-Smirnov test: compares data to a reference distribution

Warning about formal tests: With large samples, they reject normality for trivially small deviations. With small samples, they lack power to detect meaningful departures. Visual assessment is usually more informative.
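As a sketch of the formal-test route, here is Shapiro-Wilk applied to a simulated normal sample and a simulated skewed one (seed, sample size, and the exponential choice are all illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# Shapiro-Wilk: a small p-value is evidence against normality
_, p_normal = stats.shapiro(normal_sample)
_, p_skewed = stats.shapiro(skewed_sample)

print(f"normal sample: p = {p_normal:.3f}")
print(f"skewed sample: p = {p_skewed:.2e}")
```

The test flags the exponential sample decisively, but remember the warning above: with n in the thousands it would also flag harmless, tiny deviations in the normal-ish sample.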

What To Do When Normality Fails

Option 1: Transform the data. Apply a mathematical function to make the data more normal:

  • Log transform: great for right-skewed positive data (income, prices)
  • Square root transform: milder than log, useful for count data
  • Box-Cox transform: finds the optimal power transformation
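A minimal sketch of the log transform on simulated right-skewed positive data (the lognormal "prices" are illustrative), measuring skewness before and after:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Right-skewed positive data: illustrative lognormal "prices"
prices = rng.lognormal(mean=5.0, sigma=1.0, size=5_000)

skew_before = stats.skew(prices)
skew_after = stats.skew(np.log(prices))  # log of a lognormal is normal

print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

For the Box-Cox route, `scipy.stats.boxcox(prices)` searches for the power parameter automatically; it also requires strictly positive data.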

Option 2: Use non-parametric methods (next lesson). Methods that don't assume any particular distribution shape.

Option 3: Use robust methods

  • Trimmed means (remove top/bottom 5-10% before averaging)
  • Median instead of mean
  • Robust standard errors
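A quick sketch of how robust location estimates behave on skewed data (simulated; the 10% trim level is one common choice from the 5-10% range above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Right-skewed data with a long tail
data = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)

plain_mean = data.mean()
trimmed = stats.trim_mean(data, proportiontocut=0.10)  # drop top/bottom 10%
median = np.median(data)

print(f"mean:        {plain_mean:.3f}")
print(f"10% trimmed: {trimmed:.3f}")
print(f"median:      {median:.3f}")
```

The trimmed mean sits between the median and the plain mean: trimming discards the tail values that drag the mean away from "typical".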

Option 4: Use bootstrapping (lesson after next). Estimate sampling distributions empirically; no normality needed.
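As a preview, here is a minimal percentile-bootstrap sketch for a confidence interval on the mean of skewed data (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# A skewed sample where a normal-theory CI for the mean is questionable
sample = rng.exponential(scale=2.0, size=150)

# Percentile bootstrap: resample with replacement, recompute the statistic
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean: {sample.mean():.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```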

Option 5: Rely on the CLT. With a large enough n, the sampling distribution of the mean is approximately normal regardless of the underlying distribution. But "large enough" can be very large for heavily skewed data.
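A simulation sketch of that caveat: for skewed source data, the sampling distribution of the mean is still noticeably skewed at small n and only slowly approaches symmetry (all parameters here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def sampling_skew(n, reps=4_000):
    """Skewness of the sampling distribution of the mean, estimated
    by simulating `reps` samples of size n from a skewed source."""
    means = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n)).mean(axis=1)
    return stats.skew(means)

skew_small = sampling_skew(5)
skew_large = sampling_skew(200)

print(f"skew of sample mean, n=5:   {skew_small:.2f}")
print(f"skew of sample mean, n=200: {skew_large:.2f}")
```

The residual skewness shrinks roughly like 1/√n, so a heavily skewed population can need hundreds of observations before normal-based intervals behave as advertised.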

The order of preference: first, check whether normality violations actually matter for your analysis (often they don't with large n). If they do, try transformations. If those don't work, use non-parametric or bootstrap methods.

Outliers: Friends or Foes?

Outliers are observations that are far from the rest of the data. They can be:

  1. Data errors: typos, measurement malfunctions. These should be corrected or removed.
  2. Natural extreme values: real but rare observations. Often the most interesting data points!
  3. From a different population: mixed data sources (measuring adults and accidentally including children).

Never automatically remove outliers. Always investigate first. The "outlier" might be your most important data point, or it might be a data entry error. Context matters.
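A common way to flag (not delete) candidates for investigation is Tukey's 1.5 × IQR rule. A minimal sketch with a made-up sample containing one planted suspicious value (the 41.0, perhaps a misplaced decimal for 4.1):

```python
import numpy as np

# Illustrative sample with one suspicious value planted at the end
data = np.array([4.1, 3.8, 5.0, 4.4, 3.9, 4.6, 4.2, 4.8, 3.7, 4.3, 41.0])

# Tukey's rule: flag points beyond 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged = data[(data < low) | (data > high)]
print("flagged for investigation:", flagged)
```

Note the output is a list of points to investigate, not a cleaned dataset; the decision to correct, keep, or remove each one stays with you.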

Removing outliers to make results "significant" is a form of p-hacking. If you remove data points, report it transparently and show results both with and without them.
