Simple Linear Regression
Build and interpret regression models: least squares, slope, intercept, R², residuals, and the LINE assumptions.
From Correlation to Prediction
Correlation tells you two variables are related. Regression tells you how: it builds a mathematical model that predicts one variable from the other.
If you know study hours predict exam scores (r = 0.75), regression answers: "If a student studies 5 hours, what score do we predict?" and "For each additional hour of studying, how much does the score increase?"
The Line of Best Fit
The regression equation is ŷ = b₀ + b₁x, where:
- ŷ (y-hat): the predicted value of Y
- b₀ (intercept): the predicted Y when X = 0
- b₁ (slope): the change in Y for each one-unit increase in X
- x: the predictor (independent variable)
Interpreting the slope: b₁ = 8 in a study-hours-vs-score model means "each additional hour of studying is associated with 8 more points on the exam." The word "associated" is crucial: regression alone doesn't prove causation.
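As a sketch, the equation translates directly into a one-line function. The coefficients below are illustrative: b₁ = 8 is the hypothetical slope from the paragraph above, and the intercept of 50 is made up for the example.

```python
# Regression equation: yhat = b0 + b1 * x.
# b1 = 8 matches the hypothetical slope in the text; b0 = 50 is made up.
def predict(x, b0=50.0, b1=8.0):
    """Predicted exam score after x hours of study."""
    return b0 + b1 * x

print(predict(5))  # 50 + 8*5 = 90.0
```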
Least Squares: How We Find the Best Line
Among all possible lines, we want the one that minimizes the total squared distance between observed values and predicted values:

SSE = Σ(yᵢ - ŷᵢ)²

The solution (using calculus):

b₁ = r × (sᵧ/sₓ)    b₀ = ȳ - b₁x̄
Why squared errors? Same reasoning as variance: squaring penalizes large errors more, gives a smooth function to optimize, and connects to the geometry of projection.
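The closed-form solution can be sketched directly in Python (a minimal implementation, not code from the text; it uses the equivalent form b₁ = Σ(xᵢ - x̄)(yᵢ - ȳ)/Σ(xᵢ - x̄)², which equals r × sᵧ/sₓ):

```python
# Closed-form least-squares fit:
#   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
#   b0 = ybar - b1 * xbar
def least_squares(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # intercept
    return b0, b1

# Toy data lying exactly on y = 1 + 2x recovers those coefficients:
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```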
| Study Hours (X) | Score (Y) |
|---|---|
| 2 | 65 |
| 3 | 70 |
| 5 | 80 |
| 7 | 85 |
| 8 | 92 |
x̄ = 5, ȳ = 78.4, sₓ = 2.55, sᵧ = 10.97, r = 0.992
b₁ = 0.992 × (10.97/2.55) ≈ 4.27    b₀ = 78.4 - 4.27(5) ≈ 57.05
Model: ŷ = 57.05 + 4.27x
Predict score for 6 hours: ŷ = 57.05 + 4.27(6) ≈ 82.7
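The fit can be double-checked with NumPy's `polyfit`, which returns the degree-1 least-squares coefficients as [slope, intercept]; the numbers come straight from the table's raw data:

```python
import numpy as np

# Degree-1 least-squares fit for the study-hours data.
hours = np.array([2, 3, 5, 7, 8])
scores = np.array([65, 70, 80, 85, 92])

slope, intercept = np.polyfit(hours, scores, 1)
print(round(slope, 2), round(intercept, 2))  # slope ~ 4.27, intercept ~ 57.05

# Predicted score for 6 hours of study
print(round(intercept + slope * 6, 1))       # ~ 82.7
```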
Residuals: What the Model Misses
The difference between what we observed and what we predicted:

eᵢ = yᵢ - ŷᵢ
Residuals are the "errors" of the model. Analyzing them tells you if your model is any good:
- Residuals should be randomly scattered around zero. If you see patterns (curves, funnels), your model is missing something.
- Residuals should have constant spread (homoscedasticity). If the spread increases with X, your model's reliability varies.
- Residuals should be approximately normal for inference to work.
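A quick sketch (using coefficients computed from the study-hours data) that lists the residuals and verifies a basic least-squares property, namely that they sum to zero:

```python
# Residuals e_i = y_i - yhat_i for the study-hours model.
xs = [2, 3, 5, 7, 8]
ys = [65, 70, 80, 85, 92]
b1 = 111 / 26          # slope = Sxy / Sxx for this data
b0 = 78.4 - b1 * 5     # intercept = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in residuals])
print(round(sum(residuals), 10))  # ~ 0.0: least-squares residuals sum to zero
```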
R²: The Coefficient of Determination
R² tells you what fraction of the variation in Y is explained by X:

R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²
- R² = 0.85 → 85% of the variation in scores is explained by study hours
- R² = 0.30 → only 30% explained; most variation is due to other factors
For simple linear regression: R² = r² (the square of the correlation coefficient).
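A short numerical check on the study-hours data, computing R² both ways (a minimal sketch, not code from the text):

```python
import math

# Compute R^2 two ways: (1) 1 - SS_res / SS_tot, (2) square of r.
xs = [2, 3, 5, 7, 8]
ys = [65, 70, 80, 85, 92]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)

b1 = sxy / sxx
b0 = ybar - b1 * xbar

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
r = sxy / math.sqrt(sxx * syy)

print(round(1 - ss_res / syy, 4), round(r ** 2, 4))  # same value both ways
```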
R² is overrated. A high R² doesn't mean your model is correct; it could be fitting noise. A low R² doesn't mean X is useless; in some fields, explaining 10% of variation is very valuable. And R² always increases when you add more predictors, even useless ones. Use adjusted R² for model comparison.
Assumptions of Linear Regression
For the results to be trustworthy, four assumptions must (roughly) hold, remembered as LINE:
- Linearity: the relationship between X and Y is actually linear
- Independence: residuals are independent of each other
- Normality: residuals are approximately normally distributed
- Equal variance: residuals have constant spread (homoscedasticity)
When assumptions are violated:
- Non-linearity → transform variables or use non-linear models
- Non-independence → use time series or mixed models
- Non-normality → may still be okay for large n (CLT), or use robust methods
- Unequal variance → use weighted regression or robust standard errors
Extrapolation: The Danger Zone
Your regression model is only valid within the range of your data. Predicting beyond that range is called extrapolation and is dangerously unreliable.
If you built a model of height vs age for children 5-15 years old, predicting height at age 30 using that model would give absurd results. The linear relationship doesn't continue forever.
Never extrapolate far beyond your data's range. The relationship might change, plateau, reverse, or break down entirely. Stay within the observed range for reliable predictions.
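To make the danger concrete, here is a sketch with made-up coefficients for a child-height model fit on ages 5-15 (illustrative numbers, not real data):

```python
# Hypothetical linear fit (made-up numbers): height_cm = 80 + 6 * age,
# estimated from children aged 5-15.
def predict_height(age):
    return 80 + 6 * age

print(predict_height(10))  # 140 cm: inside the fitted range, plausible
print(predict_height(30))  # 260 cm: far outside the range, absurd for an adult
```

Inside the observed range the predictions look reasonable; at age 30 the same line keeps climbing, which is exactly the failure mode extrapolation invites.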