Supervised learning: linear regression


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Resources

“Regression”, in statistical jargon, is the problem of guessing the average level of some quantitative response variable from various predictor variables. - Note GOAT

Simple linear regression

Linear regression is used when the response variable is quantitative

Assume a linear relationship for \(Y = f(X)\) \[Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n\]

  • \(Y_i\) is the \(i\)th value for the response variable

  • \(X_i\) is the \(i\)th value for the predictor variable

  • \(\beta_0\) is an unknown, constant intercept

    • average value for \(Y\) if \(X = 0\) (be careful: this interpretation may not be meaningful if \(X = 0\) is outside the range of the observed data)
  • \(\beta_1\) is an unknown, constant slope

    • change in average value for \(Y\) for each one-unit increase in \(X\)
  • \(\epsilon_i\) is the random noise

    • assume independent, identically distributed (iid) from a normal distribution

    • \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)
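
To make the model concrete, here is a minimal simulation sketch in R (the parameter values below are arbitrary choices for illustration):

# simulate n observations from Y = beta_0 + beta_1 * X + epsilon
set.seed(100)
n_sim <- 100
beta_0 <- 2; beta_1 <- 0.5; sd_noise <- 1       # arbitrary "true" values for illustration
x <- runif(n_sim, min = 0, max = 10)            # predictor values
y <- beta_0 + beta_1 * x + rnorm(n_sim, mean = 0, sd = sd_noise)  # add iid normal noise
coef(lm(y ~ x))  # estimates should be close to beta_0 and beta_1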

Simple linear regression estimation

We are estimating the conditional expectation (mean) for \(Y\):

\[\mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1X_i\]

  • average value for \(Y\) given the value for \(X\)

How do we estimate the best-fitting line?

Ordinary least squares (OLS): minimize the residual sum of squares (RSS)

\[\text{RSS} \left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}\]

\[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}\]

where \(\displaystyle \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) and \(\displaystyle \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)
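
A quick sketch checking the closed-form estimates against lm(), reusing the simulated x and y from the sketch above (any numeric vectors would work):

# closed-form OLS estimates
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)
c(intercept = b0_hat, slope = b1_hat)
coef(lm(y ~ x))  # matches the closed-form estimates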

Connection to covariance and correlation

Covariance describes the joint variability of two variables \[\text{Cov}(X, Y) = \sigma_{X,Y} = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\]

Sample covariance (use \(n - 1\) since the means are used and we want unbiased estimates) \[\hat{\sigma}_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)\]

Connection to covariance and correlation

Correlation is a normalized form of covariance, ranges from -1 to 1 \[\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]

Sample correlation uses the sample covariance and standard deviations, e.g. \(\displaystyle s_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2\) \[r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}\]
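
These sample quantities are available via cov(), cor(), and sd() in R; a quick check with the simulated x and y from earlier:

sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # sample covariance by hand
cov(x, y)                                             # built-in sample covariance
cov(x, y) / (sd(x) * sd(y))                           # sample correlation by hand
cor(x, y)                                             # built-in sample correlation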

Connection to covariance and correlation

We have \[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}} \quad \text{ and } \quad r_{X,Y} = \frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2} \sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}}}\]

We can rewrite \(\hat{\beta}_1\) as \[\widehat{\beta}_{1} = r_{X,Y} \cdot \frac{s_Y}{s_X}\]

We can rewrite \(\displaystyle r_{X,Y}\) as \[r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y}\]

Can think of \(\widehat{\beta}_{1}\) as the correlation \(r_{X,Y}\) scaled by the ratio of the standard deviations of \(Y\) and \(X\)
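
A quick numerical check of this identity (again with the simulated x and y):

cor(x, y) * sd(y) / sd(x)   # correlation rescaled by ratio of standard deviations
coef(lm(y ~ x))[["x"]]      # OLS slope estimate, same value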

Linear regression in R

Gapminder data

Health and income outcomes for 184 countries from 1960 to 2016 from the famous Gapminder project

library(tidyverse)
theme_set(theme_light())
library(dslabs)
clean_gapminder <- gapminder |>
  filter(year == 2011, !is.na(gdp)) |>
  mutate(log_gdp = log(gdp))
glimpse(clean_gapminder)
Rows: 168
Columns: 10
$ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ infant_mortality <dbl> 14.3, 22.8, 106.8, 7.2, 12.7, 15.3, 3.8, 3.4, 32.5, 1…
$ life_expectancy  <dbl> 77.4, 76.1, 58.1, 75.9, 76.0, 73.5, 82.2, 80.7, 70.8,…
$ fertility        <dbl> 1.75, 2.83, 6.10, 2.12, 2.20, 1.50, 1.88, 1.44, 1.96,…
$ population       <dbl> 2886010, 36717132, 21942296, 88152, 41655616, 2967984…
$ gdp              <dbl> 6321690864, 81143448101, 27013935821, 801787943, 4729…
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Oce…
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…
$ log_gdp          <dbl> 22.56725, 25.11948, 24.01962, 20.50235, 26.88222, 22.…

Modeling life expectancy

Interested in modeling a country’s life expectancy

clean_gapminder |>
  ggplot(aes(x = life_expectancy)) +
  geom_histogram(color = "black", fill = "gray")

Relationship between life expectancy and log GDP

gdp_plot <- clean_gapminder |>
  ggplot(aes(x = log_gdp, y = life_expectancy)) +
  geom_point(size = 3, alpha = 0.5)
gdp_plot

We fit linear regression models using lm(); the formula is specified as: response ~ predictor

simple_lm <- lm(life_expectancy ~ log_gdp, 
                data = clean_gapminder) 

Summarize the model using summary()

summary(simple_lm)

Call:
lm(formula = life_expectancy ~ log_gdp, data = clean_gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.901  -4.781   1.879   5.335  13.962 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.174      5.758   4.198 4.38e-05 ***
log_gdp        1.975      0.242   8.161 7.87e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.216 on 166 degrees of freedom
Multiple R-squared:  0.2864,    Adjusted R-squared:  0.2821 
F-statistic: 66.61 on 1 and 166 DF,  p-value: 7.865e-14

Summarize the model using the broom package

The 3 main broom functions

  • tidy(): coefficients table in a tidy format

  • glance(): produces summary metrics of a model fit

  • augment(): adds/“augments” columns to the original data (e.g., adding predictions)

Note: the output of tidy(), glance(), and augment() is always a tibble.

library(broom)
tidy(simple_lm)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    24.2      5.76       4.20 4.38e- 5
2 log_gdp         1.97     0.242      8.16 7.87e-14
glance(simple_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.286         0.282  7.22      66.6 7.87e-14     1  -569. 1145. 1154.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Inference with OLS

Intercept and coefficient estimates \[\quad \hat{\beta}_0 = 24.174 \quad \text{and} \quad \hat{\beta}_1 = 1.975\]

Estimates of uncertainty for \(\beta\)’s via standard errors \[\quad \widehat{SE}(\hat{\beta}_0) = 5.758 \quad \text{and} \quad \widehat{SE}(\hat{\beta}_1) = 0.242\]

\(t\)-statistics: Estimate / Std. Error, i.e., the number of standard errors the estimate is away from 0

  • \(p\)-values (i.e., Pr(>|t|)): the probability of observing a value as extreme as |t value|, assuming the null hypothesis \(\beta_1 = 0\) is true

  • \(p\)-value \(< \alpha = 0.05\) (conventional threshold): sufficient evidence to reject the null hypothesis that the coefficient is zero

  • i.e., there is a significant association between life_expectancy and log_gdp
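
We can also get interval estimates for the coefficients directly with confint(), which combines the estimates and standard errors above:

confint(simple_lm, level = 0.95)  # 95% confidence intervals for the intercept and slope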

Coefficient of determination

  • In simple linear regression, the squared correlation between \(X\) and \(Y\) is exactly the coefficient of determination
cor(clean_gapminder$log_gdp, clean_gapminder$life_expectancy)
[1] 0.5351189
cor(clean_gapminder$log_gdp, clean_gapminder$life_expectancy) ^ 2
[1] 0.2863522
  • \(R^2\) estimates the proportion of the variance of \(Y\) explained by \(X\)

  • More generally: variance of model predictions / variance of \(Y\)

var(predict(simple_lm)) / var(clean_gapminder$life_expectancy) 
[1] 0.2863522

Generating predictions

We can use either the predict() or the fitted() function to get the fitted values of the regression:

train_preds <- predict(simple_lm)
head(train_preds)
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
head(fitted(simple_lm))
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
head(simple_lm$fitted.values)
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
simple_lm |> 
  pluck("fitted.values") |> 
  head()
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 

Predictions for new data

Or we can provide new data via the newdata argument, which must contain the explanatory variable(s):

us_data <- clean_gapminder |> 
  filter(country == "United States")

new_us_data <- us_data |>
  select(country, gdp) |>
  slice(rep(1, 3)) |> 
  mutate(adj_factor = c(0.25, 0.5, 0.75),
         log_gdp = log(gdp * adj_factor))
new_us_data$pred_life_exp <- 
  predict(simple_lm, newdata = new_us_data) 
gdp_plot +
  geom_point(data = new_us_data,
             aes(x = log_gdp, y = pred_life_exp),
             color = "darkred", size = 3)

Plot observed values against predictions

Useful diagnostic (for any type of model, not just linear regression!)

clean_gapminder |>
  mutate(pred_vals = predict(simple_lm)) |> 
  ggplot(aes(x = pred_vals, y = life_expectancy)) +
  geom_point(alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed",
              color = "red",
              linewidth = 2)
  • “Perfect” model will follow diagonal

Plot observed values against predictions

Augment the data with model output using the broom package

clean_gapminder <- simple_lm |> 
  augment(clean_gapminder) 
clean_gapminder |>
  ggplot(aes(x = .fitted, y = life_expectancy)) + 
  geom_point(alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, color = "red", 
              linetype = "dashed", linewidth = 2)
  • Adds various columns from the model fit that we can use for plotting and model diagnostics

Assessing assumptions of linear regression

Simple linear regression assumes \(Y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)\)

  • If this is true, then \(Y_i - \hat{Y}_i \overset{iid}{\sim} N(0, \sigma^2)\)

Plot residuals against \(\hat{Y}_i\): the residuals vs fit plot

  • Used to assess linearity, i.e., any divergence from mean 0

  • Used to assess equal variance, i.e., whether \(\sigma^2\) is homogeneous across predictions/fits \(\hat{Y}_i\)

More difficult to assess the independence and fixed \(X\) assumptions

  • Make these assumptions based on subject-matter knowledge

Plot residuals against predicted values

  • Residuals = observed - predicted

  • Interpretation of residuals in context?

  • Conditional on the predicted values, the residuals should have a mean of zero

clean_gapminder |>
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.5, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "red", linewidth = 2) +
  # plot the residual mean
  geom_smooth(se = FALSE)
  • Residuals should NOT display any pattern

  • Two things to look for:

    • Any trend around horizontal reference line?

    • Equal vertical spread?

Multiple linear regression

We can include as many variables as we want (assuming \(n > p\))

\[Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon\]

OLS estimates in matrix notation (\(\boldsymbol{X}\) is an \(n \times p\) matrix):

\[\hat{\boldsymbol{\beta}} = (\boldsymbol{X} ^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\]
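
A minimal sketch verifying the matrix formula with the simple model from earlier (model.matrix() builds the design matrix, including a leading column of 1s for the intercept):

X <- model.matrix(~ log_gdp, data = clean_gapminder)  # design matrix: intercept column + log_gdp
Y <- clean_gapminder$life_expectancy
solve(t(X) %*% X) %*% t(X) %*% Y  # matches coef(simple_lm)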

Can just add more variables to the formula in R

multiple_lm <- lm(life_expectancy ~ log_gdp + fertility, data = clean_gapminder)
  • Use adjusted \(R^2\) when including multiple variables: \(\displaystyle 1 - \frac{(1 - R^2)(n - 1)}{(n - p - 1)}\) (see the quick check after this list)

    • Adjusts for the number of variables in the model \(p\)

    • Adding more variables will always increase the model’s (unadjusted) \(R^2\)
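
A quick check of the adjusted \(R^2\) formula for the two-predictor model above, compared with the value reported by glance():

n_obs <- nobs(multiple_lm)                 # number of observations used in the fit
p_vars <- 2                                # number of predictors: log_gdp and fertility
r2 <- summary(multiple_lm)$r.squared
1 - (1 - r2) * (n_obs - 1) / (n_obs - p_vars - 1)
glance(multiple_lm)$adj.r.squared          # same value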

READ THIS: \(R^2\) myths

Get your \(R^2\) out of here

What about the Normal distribution assumption?

  • Model: \(Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon\)

  • \(\epsilon_i\) is the random noise: assume independent and identically distributed (iid) from a Normal distribution \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)

  • OLS doesn’t care about this assumption; it’s just estimating coefficients!

  • In order to perform inference, we need to impose additional assumptions

  • By assuming \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\), what we really mean is \(Y \overset{iid}{\sim}N(\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}, \sigma^2)\)

  • So we’re estimating the mean \(\mu\) of this conditional distribution, but what about \(\sigma^2\)?

  • Unbiased estimate \(\displaystyle \hat{\sigma}^2 = \frac{\text{RSS}}{n - (p + 1)}\)

  • Degrees of freedom: \(n - (p + 1)\), data supplies us with \(n\) “degrees of freedom” and we used up \(p + 1\)
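
A quick check of this estimate for multiple_lm; sigma() returns the residual standard error \(\hat{\sigma}\) reported by summary(), and df.residual() returns \(n - (p + 1)\):

rss <- sum(residuals(multiple_lm)^2)   # residual sum of squares
rss / df.residual(multiple_lm)         # unbiased estimate of sigma^2
sigma(multiple_lm)^2                   # same value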

Check the assumptions about normality with ggfortify

library(ggfortify)
autoplot(multiple_lm, ncol = 4)
  • Standardized residuals = each residual divided by an estimate of its standard deviation (see also .std.resid from augment())
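
For reference, rstandard() computes these standardized residuals and matches the .std.resid column from augment():

head(rstandard(multiple_lm))
augment(multiple_lm) |> pull(.std.resid) |> head()  # same values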

Regression confidence intervals vs prediction intervals

Do we want to predict the mean response for a particular value \(x^*\) of the explanatory variable or do we want to predict the response for an individual case?

Example:

  • if we would like to know the average price for all Teslas with 50 thousand miles, then we would use an interval for the mean response

  • if we want an interval to contain the price of a particular Tesla with 50 thousand miles, then we need the interval for a single prediction

Regression confidence intervals

  • geom_smooth() displays confidence intervals for the regression line
# predict(simple_lm, interval = "confidence")
lm_plot <- clean_gapminder |>
  ggplot(aes(x = log_gdp,
             y = life_expectancy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "log(GDP)",
       y = "Life expectancy")
lm_plot

Regression confidence intervals

  • Interval estimate for the MEAN response at a given observation (i.e. for the predicted AVERAGE \(\hat{\beta}_0 + \hat{\beta}_1 x^*\))

  • Based on standard errors for the estimated regression line at \(x^*\) \[SE_{\text{line}}\left(x^{*}\right)=\hat{\sigma} \cdot \sqrt{\frac{1}{n}+\frac{\left(x^{*}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}\]

  • Variation only comes from uncertainty in the parameter estimates

Regression prediction intervals

  • Interval estimate for an INDIVIDUAL response at a given (new) observation

  • Add the variance \(\sigma^2\) of a single predicted value \[SE_{\text{pred}}\left(x^{*}\right)=\hat{\sigma} \cdot \sqrt{1 + \frac{1}{n}+\frac{\left(x^{*}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}\]

  • Variation comes from uncertainty in the parameter estimates and the error variance
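
A minimal sketch comparing the two intervals at a single hypothetical value of log GDP (the log_gdp = 24 value below is just for illustration), using predict():

new_obs <- tibble(log_gdp = 24)  # hypothetical log(GDP) value
predict(simple_lm, newdata = new_obs, interval = "confidence", level = 0.95)
predict(simple_lm, newdata = new_obs, interval = "prediction", level = 0.95)  # wider interval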

Confidence intervals versus prediction intervals

Generate 95% intervals

# predict(simple_lm, interval = "prediction")
lm_plot +
  geom_ribbon(
    data = augment(simple_lm, interval = "prediction"),
    aes(ymin = .lower, ymax = .upper),
    color = "red", fill = NA
  )
  • The standard error for predicting an individual response will always be larger than for predicting a mean response

  • Prediction intervals will always be wider than confidence intervals

Appendix: Interpret a linear regression model with log transformations

  • Log-transformed predictor: \(Y = \beta_0 + \beta_1 \log(X)\)

    • A 1% increase in \(X\) is associated with an average change of \(\beta_1/100\) units in \(Y\)
  • Log-transformed outcome: \(\log(Y) = \beta_0 + \beta_1 X\)

    • A 1 unit increase in \(X\) is associated with an average change of \((100 \times \beta_1) \%\) in \(Y\)
  • Log-transformed predictor and outcome: \(\log(Y) = \beta_0 + \beta_1 \log(X)\)

    • A 1% increase in \(X\) is associated with an average change of \(\beta_1 \%\) in \(Y\)
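
For example, with the fitted simple_lm from earlier (log-transformed predictor), \(\hat{\beta}_1 \approx 1.975\), so a 1% increase in GDP is associated with roughly 0.02 additional years of average life expectancy:

coef(simple_lm)[["log_gdp"]] / 100  # approximately 0.02 years per 1% increase in GDP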

Useful reading