Simple Linear Regression
Build and interpret regression models: least squares, slope, intercept, R², residuals, and the LINE assumptions.
From Correlation to Prediction
Correlation tells you two variables are related. Regression tells you how: it builds a mathematical model that predicts one variable from the other.
If you know study hours predict exam scores (r = 0.75), regression answers: "If a student studies 5 hours, what score do we predict?" and "For each additional hour of studying, how much does the score increase?"
The Line of Best Fit
The regression equation is ŷ = b₀ + b₁x, where:
- ŷ (y-hat): the predicted value of Y
- b₀ (intercept): the predicted Y when X = 0
- b₁ (slope): the change in Y for each one-unit increase in X
- x: the predictor (independent variable)
Interpreting the slope: b₁ = 8 in a study-hours-vs-score model means "each additional hour of studying is associated with 8 more points on the exam." The word "associated" is crucial: regression alone doesn't prove causation.
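As a sketch, the equation translates directly into a one-line function. The coefficients below are illustrative: b₁ = 8 is the hypothetical slope from the paragraph above, and the intercept of 50 is made up for the example.

```python
# Regression equation: yhat = b0 + b1 * x.
# b1 = 8 matches the hypothetical slope in the text; b0 = 50 is made up.
def predict(x, b0=50.0, b1=8.0):
    """Predicted exam score after x hours of study."""
    return b0 + b1 * x

print(predict(5))  # 50 + 8*5 = 90.0
```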
Least Squares: How We Find the Best Line
Among all possible lines, we want the one that minimizes the total squared distance between observed values and predicted values:

SSE = Σ(yᵢ - ŷᵢ)²

The solution (using calculus):

b₁ = r × (sᵧ/sₓ)    b₀ = ȳ - b₁x̄
Why squared errors? Same reasoning as variance: squaring penalizes large errors more, gives a smooth function to optimize, and connects to the geometry of projection.
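The closed-form solution can be sketched directly in Python (a minimal implementation, not code from the text; it uses the equivalent form b₁ = Σ(xᵢ - x̄)(yᵢ - ȳ)/Σ(xᵢ - x̄)², which equals r × sᵧ/sₓ):

```python
# Closed-form least-squares fit:
#   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
#   b0 = ybar - b1 * xbar
def least_squares(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept
    return b0, b1

# Toy data lying exactly on y = 1 + 2x recovers those coefficients:
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```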
| Study Hours (X) | Score (Y) |
|---|---|
| 2 | 65 |
| 3 | 70 |
| 5 | 80 |
| 7 | 85 |
| 8 | 92 |
x̄ = 5, ȳ = 78.4, sₓ = 2.55, sᵧ = 10.97, r = 0.992
b₁ = 0.992 × (10.97/2.55) ≈ 4.27    b₀ = 78.4 - 4.27(5) ≈ 57.05
Model: ŷ = 57.05 + 4.27x
Predict score for 6 hours: ŷ = 57.05 + 4.27(6) ≈ 82.7
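The fit can be double-checked with NumPy's `polyfit`, which returns the degree-1 least-squares coefficients as [slope, intercept]; the numbers come straight from the table's raw data:

```python
import numpy as np

# Degree-1 least-squares fit for the study-hours data.
hours = np.array([2, 3, 5, 7, 8])
scores = np.array([65, 70, 80, 85, 92])

slope, intercept = np.polyfit(hours, scores, 1)
print(round(slope, 2), round(intercept, 2))  # slope ~ 4.27, intercept ~ 57.05

# Predicted score for 6 hours of study
print(round(intercept + slope * 6, 1))       # ~ 82.7
```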
Residuals: What the Model Misses
The difference between what we observed and what we predicted:

eᵢ = yᵢ - ŷᵢ
Residuals are the "errors" of the model. Analyzing them tells you if your model is any good:
- Residuals should be randomly scattered around zero. If you see patterns (curves, funnels), your model is missing something.
- Residuals should have constant spread (homoscedasticity). If the spread increases with X, your model's reliability varies.
- Residuals should be approximately normal for inference to work.
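A quick sketch (using coefficients computed from the study-hours data) that lists the residuals and verifies a basic least-squares property, namely that they sum to zero:

```python
# Residuals e_i = y_i - yhat_i for the study-hours model.
xs = [2, 3, 5, 7, 8]
ys = [65, 70, 80, 85, 92]
b1 = 111 / 26          # slope = Sxy / Sxx for this data
b0 = 78.4 - b1 * 5     # intercept = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in residuals])
print(round(sum(residuals), 10))  # ~ 0.0: least-squares residuals sum to zero
```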
R²: The Coefficient of Determination
R² tells you what fraction of the variation in Y is explained by X:

R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
- R² = 0.85 → 85% of the variation in scores is explained by study hours
- R² = 0.30 → only 30% explained; most variation is due to other factors
For simple linear regression: R² = r² (the square of the correlation coefficient).
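A short numerical check on the study-hours data, computing R² both ways (a minimal sketch, not code from the text):

```python
import math

# Compute R^2 two ways: (1) 1 - SS_res / SS_tot, (2) square of r.
xs = [2, 3, 5, 7, 8]
ys = [65, 70, 80, 85, 92]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)

b1 = sxy / sxx
b0 = ybar - b1 * xbar

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r = sxy / math.sqrt(sxx * syy)

print(round(1 - ss_res / syy, 4), round(r ** 2, 4))  # same value both ways
```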
R² is overrated. A high R² doesn't mean your model is correct; it could be fitting noise. A low R² doesn't mean X is useless; in some fields, explaining 10% of variation is very valuable. And R² always increases when you add more predictors, even useless ones. Use adjusted R² for model comparison.
Assumptions of Linear Regression
For the results to be trustworthy, four assumptions must (roughly) hold, remembered as LINE:
- Linearity: the relationship between X and Y is actually linear
- Independence: residuals are independent of each other
- Normality: residuals are approximately normally distributed
- Equal variance: residuals have constant spread (homoscedasticity)
When assumptions are violated:
- Non-linearity → transform variables or use non-linear models
- Non-independence → use time series or mixed models
- Non-normality → may still be okay for large n (CLT), or use robust methods
- Unequal variance → use weighted regression or robust standard errors
Extrapolation: The Danger Zone
Your regression model is only valid within the range of your data. Predicting beyond that range is called extrapolation and is dangerously unreliable.
If you built a model of height vs age for children 5-15 years old, predicting height at age 30 using that model would give absurd results. The linear relationship doesn't continue forever.
Never extrapolate far beyond your data's range. The relationship might change, plateau, reverse, or break down entirely. Stay within the observed range for reliable predictions.
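To make the danger concrete, here is a sketch with made-up coefficients for a child-height model fit on ages 5-15 (illustrative numbers, not real data):

```python
# Hypothetical linear fit (made-up numbers): height_cm = 80 + 6 * age,
# estimated from children aged 5-15.
def predict_height(age):
    return 80 + 6 * age

print(predict_height(10))  # 140 cm: inside the fitted range, plausible
print(predict_height(30))  # 260 cm: far outside the range, absurd for an adult
```

Inside the observed range the predictions look reasonable; at age 30 the same line keeps climbing, which is exactly the failure mode extrapolation invites.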