Multiple Regression

Extend regression to multiple predictors: interpret coefficients, handle categories with dummy variables, and avoid multicollinearity and overfitting.

25 min read
Advanced

Beyond One Predictor

In reality, outcomes depend on many factors. Exam scores depend on study hours AND sleep AND prior knowledge AND stress levels. Simple linear regression uses one predictor; multiple regression uses several:

\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p

Each coefficient bⱼ represents the effect of xⱼ on y, holding all other predictors constant. This "controlling for" other variables is one of the most powerful features of multiple regression.

Interpreting Coefficients

Salary Prediction Model

ŷ = 30,000 + 2,500(years_experience) + 5,000(has_masters) - 500(commute_miles)

  • bโ‚ = 2,500: Each additional year of experience is associated with $2,500 higher salary, holding education and commute constant
  • bโ‚‚ = 5,000: Having a master's degree is associated with $5,000 more, holding experience and commute constant
  • bโ‚ƒ = -500: Each additional mile of commute is associated with $500 less salary

The "holding constant" part is crucial. In simple regression, the experience coefficient might be $3,200 because it also captures the effect of education (more experienced people tend to have more education). Multiple regression separates these effects.
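The salary model above can be estimated from data with ordinary least squares. A minimal sketch using only NumPy, on hypothetical data generated to follow the example model exactly:

```python
import numpy as np

# Hypothetical data: columns are years_experience, has_masters, commute_miles
X = np.array([
    [2, 0, 10],
    [5, 1, 5],
    [8, 0, 20],
    [3, 1, 15],
    [10, 1, 8],
    [6, 0, 12],
], dtype=float)

# Salaries generated exactly from the model in the example above
y = 30_000 + 2_500 * X[:, 0] + 5_000 * X[:, 1] - 500 * X[:, 2]

# Prepend an intercept column and solve the least-squares problem
design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)

print(coefs.round())  # recovers b0=30000, b1=2500, b2=5000, b3=-500
```

Because these salaries were generated with no noise, the fit recovers the coefficients exactly; real data would yield estimates with standard errors around them.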

Dummy Variables: Handling Categories

Regression requires numerical inputs, but many predictors are categorical (gender, region, department). Dummy variables (indicator variables) solve this.

For a variable with k categories, create k-1 dummy variables (0 or 1):

Encoding Color

Product color: Red, Blue, Green

| Product | Blue | Green |
|---|---|---|
| Red widget | 0 | 0 |
| Blue widget | 1 | 0 |
| Green widget | 0 | 1 |

Red is the reference category (both dummies = 0). The coefficient for "Blue" means the difference between Blue and Red, holding everything else constant.

Why k-1 and not k? Because the kth category is fully determined when you know the others (if it's not Blue and not Green, it must be Red). Including all k creates perfect collinearity, which breaks the math.
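One common way to build this encoding is pandas' `get_dummies`. A sketch, where the explicit `pd.Categorical` ordering is only there to make Red the reference category, matching the table above:

```python
import pandas as pd

colors = pd.Series(
    pd.Categorical(["Red", "Blue", "Green", "Blue"],
                   categories=["Red", "Blue", "Green"]),
    name="color",
)

# drop_first=True keeps k-1 dummies; the first category (Red) is dropped
dummies = pd.get_dummies(colors, drop_first=True)
print(dummies.columns.tolist())       # ['Blue', 'Green']
print(int(dummies.iloc[0].sum()))     # 0 -- the Red row has both dummies off
```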

Multicollinearity

Multicollinearity occurs when predictor variables are highly correlated with each other. It makes individual coefficient estimates unstable and hard to interpret, even though the overall model may still predict well.

Example: Including both "years of experience" and "age" in a salary model. They're highly correlated (older people have more experience). The model can't tell which one is really driving salary, so both coefficients become unreliable.

Signs of multicollinearity:

  • Coefficients change dramatically when you add/remove a variable
  • A variable you know is important has a non-significant p-value
  • Variance Inflation Factor (VIF) > 5-10

Solutions: Remove redundant predictors, combine correlated variables, or use regularization techniques (Ridge/Lasso regression).
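A VIF check can be done by hand: regress each predictor on all the others and compute 1/(1 − R²). A minimal NumPy sketch on synthetic data, where the near-collinear age/experience pair is constructed deliberately:

```python
import numpy as np

def vif(X, j):
    """VIF for column j: 1 / (1 - R^2) from regressing x_j on the other columns."""
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
    resid = X[:, j] - design @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
experience = rng.uniform(0, 20, 200)
age = experience + 22 + rng.normal(0, 1, 200)  # nearly collinear with experience
other = rng.normal(0, 1, 200)                  # unrelated predictor
X = np.column_stack([experience, age, other])

print([round(vif(X, j), 1) for j in range(3)])
```

Expect large VIFs for experience and age (each is almost a linear function of the other) and a VIF near 1 for the unrelated column.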

Overfitting: When the Model Learns Noise

Overfitting occurs when a model captures random noise in the training data rather than the true underlying pattern. The model performs great on your training data but poorly on new data.

With enough predictors, you can "explain" anything. A model with as many predictors as data points will have R² = 1 (perfect fit) but zero predictive ability.

Signs of overfitting:

  • R² is very high but adjusted R² is much lower
  • The model performs well on training data but poorly on test data
  • Coefficients are very large in magnitude
  • Adding irrelevant predictors improves R² but not adjusted R²
R^2_{\text{adj}} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}

where n is the number of observations and p is the number of predictors. Adjusted R² penalizes adding predictors, so it only increases if the new predictor genuinely improves the model.

Rule of thumb: You need at least 10-20 observations per predictor variable. With 50 data points and 30 predictors, you're almost certainly overfitting.
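Plugging numbers into the adjusted-R² formula shows how the penalty grows with the predictor count, using the 50-observation scenario from the rule of thumb:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R^2 = 0.90, very different adjusted values:
print(round(adjusted_r2(0.90, 50, 3), 3))   # 0.893 -- modest penalty
print(round(adjusted_r2(0.90, 50, 30), 3))  # 0.742 -- heavy penalty
```

With 30 predictors on 50 observations, a seemingly strong R² of 0.90 shrinks substantially once the predictor count is accounted for.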

Building Good Models

Start simple, add complexity only when justified:

  1. Start with theory: include variables that make conceptual sense, not just everything available
  2. Check assumptions: plot residuals; check for non-linearity and heteroscedasticity
  3. Watch for multicollinearity: check VIF values
  4. Compare models with adjusted R² or AIC/BIC, not raw R²
  5. Validate: test on data the model hasn't seen (cross-validation or a holdout set)
  6. Report honestly: include variables that weren't significant
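Step 5, validation, can be sketched with a simple holdout split. Everything here is synthetic and NumPy-only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)  # third predictor is pure noise

def design(A):
    return np.column_stack([np.ones(len(A)), A])

# Fit on the first 70 rows only; the last 30 are held out
train, test = slice(0, 70), slice(70, None)
beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

def r_squared(X_part, y_part):
    resid = y_part - design(X_part) @ beta
    return 1 - np.sum(resid**2) / np.sum((y_part - y_part.mean())**2)

# A large gap between these two numbers would signal overfitting
print(round(r_squared(X[train], y[train]), 2),
      round(r_squared(X[test], y[test]), 2))
```

Here the model is well specified, so training and holdout R² stay close; an overfit model shows a high training score and a much lower holdout score.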

The best model isn't the one with the highest R²; it's the one that generalizes best to new data while remaining interpretable.

George Box famously said: "All models are wrong, but some are useful." The goal isn't a perfect model; it's a useful one that captures the most important patterns without overfitting to noise.

Test your knowledge

🧠 Knowledge Check
1 / 3

In multiple regression, b₂ = 5,000 for "has_masters" means: