Multiple Regression

Extend regression to multiple predictors: interpret coefficients, handle categories with dummy variables, and avoid multicollinearity and overfitting.

25 min read
Advanced

Beyond One Predictor

In reality, outcomes depend on many factors. Exam scores depend on study hours AND sleep AND prior knowledge AND stress levels. Simple linear regression uses one predictor; multiple regression uses several:

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p

Each coefficient bⱼ represents the effect of xⱼ on y, holding all other predictors constant. This "controlling for" other variables is one of the most powerful features of multiple regression.

Interpreting Coefficients

Salary Prediction Model

ŷ = 30,000 + 2,500(years_experience) + 5,000(has_masters) - 500(commute_miles)

  • bโ‚ = 2,500: Each additional year of experience is associated with $2,500 higher salary, holding education and commute constant
  • bโ‚‚ = 5,000: Having a master's degree is associated with $5,000 more, holding experience and commute constant
  • bโ‚ƒ = -500: Each additional mile of commute is associated with $500 less salary

The "holding constant" part is crucial. In simple regression, the experience coefficient might be $3,200 because it also captures the effect of education (more experienced people tend to have more education). Multiple regression separates these effects.
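The salary model above can be estimated from data with ordinary least squares. A minimal sketch using only NumPy, on hypothetical data generated to follow the example model exactly:

```python
import numpy as np

# Hypothetical data: columns are years_experience, has_masters, commute_miles
X = np.array([
    [2, 0, 10],
    [5, 1, 5],
    [8, 0, 20],
    [3, 1, 15],
    [10, 1, 8],
    [6, 0, 12],
], dtype=float)

# Salaries generated exactly from the model in the example above
y = 30_000 + 2_500 * X[:, 0] + 5_000 * X[:, 1] - 500 * X[:, 2]

# Prepend an intercept column and solve the least-squares problem
design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(coefs.round())  # recovers b0=30000, b1=2500, b2=5000, b3=-500
```

Because these salaries were generated with no noise, the fit recovers the coefficients exactly; real data would yield estimates with standard errors around them.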

Dummy Variables: Handling Categories

Regression requires numerical inputs, but many predictors are categorical (gender, region, department). Dummy variables (indicator variables) solve this.

For a variable with k categories, create k-1 dummy variables (0 or 1):

Encoding Color

Product color: Red, Blue, Green

| Product | Blue | Green |
|---|---|---|
| Red widget | 0 | 0 |
| Blue widget | 1 | 0 |
| Green widget | 0 | 1 |

Red is the reference category (both dummies = 0). The coefficient for "Blue" means the difference between Blue and Red, holding everything else constant.

Why k-1 and not k? Because the kth category is fully determined when you know the others (if it's not Blue and not Green, it must be Red). Including all k creates perfect collinearity, which breaks the math.
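One common way to build this encoding is pandas' `get_dummies`. A sketch, where the explicit `pd.Categorical` ordering is only there to make Red the reference category, matching the table above:

```python
import pandas as pd

colors = pd.Series(
    pd.Categorical(["Red", "Blue", "Green", "Blue"],
                   categories=["Red", "Blue", "Green"]),
    name="color",
)

# drop_first=True keeps k-1 dummies; the first category (Red) is dropped
dummies = pd.get_dummies(colors, drop_first=True)
print(dummies.columns.tolist())       # ['Blue', 'Green']
print(int(dummies.iloc[0].sum()))     # 0 -- the Red row has both dummies off
```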

Multicollinearity

Multicollinearity occurs when predictor variables are highly correlated with each other. It makes individual coefficient estimates unstable and hard to interpret, even though the overall model may still predict well.

Example: Including both "years of experience" and "age" in a salary model. They're highly correlated (older people have more experience). The model can't tell which one is really driving salary, so both coefficients become unreliable.

Signs of multicollinearity:

  • Coefficients change dramatically when you add/remove a variable
  • A variable you know is important has a non-significant p-value
  • Variance Inflation Factor (VIF) > 5-10

Solutions: Remove redundant predictors, combine correlated variables, or use regularization techniques (Ridge/Lasso regression).
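A VIF check can be done by hand: regress each predictor on all the others and compute 1/(1 − R²). A minimal NumPy sketch on synthetic data, where the near-collinear age/experience pair is constructed deliberately:

```python
import numpy as np

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing x_j on the other columns."""
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
    resid = X[:, j] - design @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
experience = rng.uniform(0, 20, 200)
age = experience + 22 + rng.normal(0, 1, 200)  # nearly collinear with experience
other = rng.normal(0, 1, 200)                  # unrelated predictor
X = np.column_stack([experience, age, other])

print([round(vif(X, j), 1) for j in range(3)])
```

Expect large VIFs for experience and age (each is almost a linear function of the other) and a VIF near 1 for the unrelated column.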

Overfitting: When the Model Learns Noise

Overfitting occurs when a model captures random noise in the training data rather than the true underlying pattern. The model performs great on your training data but poorly on new data.

With enough predictors, you can "explain" anything. A model with as many predictors as data points will have R² = 1 (perfect fit) but zero predictive ability.

Signs of overfitting:

  • R² is very high but adjusted R² is much lower
  • The model performs well on training data but poorly on test data
  • Coefficients are very large in magnitude
  • Adding irrelevant predictors improves R² but not adjusted R²
R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}

where n is the number of observations and p is the number of predictors. Adjusted R² penalizes adding predictors, so it only increases if the new predictor genuinely improves the model.

Rule of thumb: You need at least 10-20 observations per predictor variable. With 50 data points and 30 predictors, you're almost certainly overfitting.
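Plugging numbers into the adjusted-R² formula shows how the penalty grows with the predictor count, using the 50-observation scenario from the rule of thumb:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2 = 0.90, very different adjusted values:
print(round(adjusted_r2(0.90, 50, 3), 3))   # 0.893 -- modest penalty
print(round(adjusted_r2(0.90, 50, 30), 3))  # 0.742 -- heavy penalty
```

With 30 predictors on 50 observations, a seemingly strong R² of 0.90 shrinks substantially once the predictor count is accounted for.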

Building Good Models

Start simple, add complexity only when justified:

  1. Start with theory: include variables that make conceptual sense, not just everything available
  2. Check assumptions: plot residuals; check for non-linearity and heteroscedasticity
  3. Watch for multicollinearity: check VIF values
  4. Compare models with adjusted R² or AIC/BIC, not raw R²
  5. Validate: test on data the model hasn't seen (cross-validation or a holdout set)
  6. Report honestly: include variables that weren't significant
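Step 5, validation, can be sketched with a simple holdout split. Everything here is synthetic and NumPy-only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)  # third predictor is pure noise

def design(A):
    return np.column_stack([np.ones(len(A)), A])

# Fit on the first 70 rows only; the last 30 are held out
train, test = slice(0, 70), slice(70, None)
beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

def r_squared(X_part, y_part):
    resid = y_part - design(X_part) @ beta
    return 1 - np.sum(resid**2) / np.sum((y_part - y_part.mean())**2)

# A large gap between these two numbers would signal overfitting
print(round(r_squared(X[train], y[train]), 2),
      round(r_squared(X[test], y[test]), 2))
```

Here the model is well specified, so training and holdout R² stay close; an overfit model shows a high training score and a much lower holdout score.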

The best model isn't the one with the highest R²; it's the one that generalizes best to new data while remaining interpretable.

George Box famously said: "All models are wrong, but some are useful." The goal isn't a perfect model; it's a useful one that captures the most important patterns without overfitting to noise.

Test your knowledge

🧠 Knowledge Check
1 / 3

In multiple regression, b₂ = 5,000 for "has_masters" means: