Covariance & Correlation
Understand covariance, Pearson correlation, Spearman rank correlation, and the many pitfalls of interpreting correlation.
Do Two Variables Move Together?
So far we've analyzed one variable at a time. But the most interesting questions involve relationships: Does more education lead to higher income? Does exercise lower blood pressure? Does advertising increase sales?
Covariance and correlation quantify how two variables move together.
Covariance
A measure of how two variables change together. Positive covariance means they tend to increase together; negative means one increases as the other decreases.
Intuition: For each data point, multiply how far X is from its mean by how far Y is from its mean. If both tend to be above (or both below) their means simultaneously, the product is positive → positive covariance. If one is above while the other is below, the product is negative → negative covariance.
Problem with covariance: Its magnitude depends on the units of X and Y. Cov(height in cm, weight in kg) gives a completely different number than Cov(height in inches, weight in pounds). You can't tell if the relationship is "strong" or "weak" from covariance alone.
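To make the unit-dependence concrete, here is a minimal sketch in Python. The height/weight numbers are invented for illustration:

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical height/weight data (illustrative values, not a real dataset)
height_cm = [160, 165, 170, 175, 180]
weight_kg = [55, 62, 66, 74, 80]

# Same people, same relationship -- just expressed in inches and pounds
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.2046 for w in weight_kg]

print(covariance(height_cm, weight_kg))  # 77.5
print(covariance(height_in, weight_lb))  # ~67.3 -- a different number for the same data
```

Identical data, identical relationship, two different covariances — which is exactly why the raw number can't tell you how strong the relationship is.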
Pearson Correlation Coefficient
The solution: standardize the covariance by dividing it by the product of the two standard deviations: r = Cov(X, Y) / (σ_X · σ_Y).
- -1 ≤ r ≤ 1 always
- r = 1: Perfect positive linear relationship
- r = -1: Perfect negative linear relationship
- r = 0: No linear relationship (but there could be a non-linear one!)
- r is unitless — changing units of X or Y doesn't change r
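A from-scratch sketch of Pearson's r, on small invented data, just to show the mechanics and the unit-invariance property:

```python
import math

def pearson_r(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
# Rescaling or shifting either variable leaves r unchanged
# (up to floating-point rounding)
r2 = pearson_r([xi * 100 for xi in x], [yi + 7 for yi in y])
print(round(r, 3))  # 0.775
```

(Python 3.10+ also ships this as `statistics.correlation`; the hand-rolled version just makes the formula visible.)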
| r | Strength |
|---|---|
| 0.00 – 0.19 | Very weak |
| 0.20 – 0.39 | Weak |
| 0.40 – 0.59 | Moderate |
| 0.60 – 0.79 | Strong |
| 0.80 – 1.00 | Very strong |
These labels are rough guidelines, not rules. In some fields (physics), r = 0.7 might be disappointing. In others (psychology), r = 0.3 might be impressive.
Spearman Rank Correlation
Measures the strength of a monotonic relationship (not just linear). It works by ranking the data first, then computing Pearson's r on the ranks.
When to use Spearman instead of Pearson:
- Data contains outliers (ranks are robust to outliers)
- Relationship is monotonic but not linear (e.g., exponential)
- Data is ordinal (rankings, ratings)
- Distribution is heavily skewed
Example: Income and happiness might have a positive but diminishing relationship — going from $40k to $80k matters more than going from $400k to $800k. Spearman captures this monotonic-but-nonlinear pattern better than Pearson.
Correlation Pitfalls
1. Correlation ≠ Causation (again!) An r = 0.95 between ice cream sales and drownings doesn't mean ice cream causes drowning. Confounders (here, hot weather drives both) lurk everywhere.
2. r measures LINEAR relationships only. A perfect U-shaped relationship gives r ≈ 0. Always plot your data! If the scatter plot shows a curve, Pearson's r will miss it.
3. Outliers can dominate. A single extreme point can create an apparent correlation where none exists, or destroy a real one.
4. Restriction of range. If you only measure the relationship among a narrow group (e.g., SAT scores vs college GPA among Harvard students), the correlation will be artificially low because the range of SAT scores is restricted.
5. Ecological fallacy. Correlation at the group level doesn't imply correlation at the individual level. Countries with more chocolate consumption win more Nobel Prizes, but individual chocolate eaters aren't smarter.
Always, always, always make a scatter plot before computing a correlation. The number alone can be deeply misleading. Anscombe's Quartet demonstrates this: four datasets with identical summary statistics, including r ≈ 0.82, that look completely different when plotted.