Supervised learning: linear regression


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Resources

“Regression”, in statistical jargon, is the problem of guessing the average level of some quantitative response variable from various predictor variables. - Note GOAT

Simple linear regression

Linear regression is used when the response variable is quantitative

Assume a linear relationship for \(Y = f(X)\) \[Y_{i}=\beta_{0}+\beta_{1} X_{i}+\epsilon_{i}, \quad \text { for } i=1,2, \dots, n\]

  • \(Y_i\) is the \(i\)th value for the response variable

  • \(X_i\) is the \(i\)th value for the predictor variable

  • \(\beta_0\) is an unknown, constant intercept

    • average value for \(Y\) if \(X = 0\) (be careful: this interpretation may not be meaningful if \(X = 0\) is outside the range of the observed data)
  • \(\beta_1\) is an unknown, constant slope

    • change in average value for \(Y\) for each one-unit increase in \(X\)
  • \(\epsilon_i\) is the random noise

    • assume independent, identically distributed (iid) from a normal distribution

    • \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)
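
To make the model concrete, here is a minimal simulation sketch in R (the parameter values below are arbitrary choices for illustration):

# simulate n observations from Y = beta_0 + beta_1 * X + epsilon
set.seed(100)
n_sim <- 100
beta_0 <- 2; beta_1 <- 0.5; sd_noise <- 1       # arbitrary "true" values for illustration
x <- runif(n_sim, min = 0, max = 10)            # predictor values
y <- beta_0 + beta_1 * x + rnorm(n_sim, mean = 0, sd = sd_noise)  # add iid normal noise
coef(lm(y ~ x))  # estimates should be close to beta_0 and beta_1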

Simple linear regression estimation

We are estimating the conditional expectation (mean) for \(Y\):

\[\mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1X_i\]

  • average value for \(Y\) given the value for \(X\)

How do we estimate the best-fitting line?

Ordinary least squares (OLS): minimize the residual sum of squares (RSS)

\[\text{RSS} \left(\beta_{0}, \beta_{1}\right)=\sum_{i=1}^{n}\left[Y_{i}-\left(\beta_{0}+\beta_{1} X_{i}\right)\right]^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}\]

\[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \quad \text{ and } \quad \widehat{\beta}_{0}=\bar{Y}-\widehat{\beta}_{1} \bar{X}\]

where \(\displaystyle \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i\) and \(\displaystyle \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i\)
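
A quick sketch checking the closed-form estimates against lm(), reusing the simulated x and y from the sketch above (any numeric vectors would work):

# closed-form OLS estimates
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)
c(intercept = b0_hat, slope = b1_hat)
coef(lm(y ~ x))  # matches the closed-form estimates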

Connection to covariance and correlation

Covariance describes the joint variability of two variables \[\text{Cov}(X, Y) = \sigma_{X,Y} = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\]

Sample covariance (use \(n - 1\) since the means are used and we want unbiased estimates) \[\hat{\sigma}_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)\]

Connection to covariance and correlation

Correlation is a normalized form of covariance, ranges from -1 to 1 \[\rho_{X,Y} = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\]

Sample correlation uses the sample covariance and standard deviations, e.g. \(\displaystyle s_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2\) \[r_{X,Y} = \frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2} \sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}}\]
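
These sample quantities are available via cov(), cor(), and sd() in R; a quick check with the simulated x and y from earlier:

sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # sample covariance by hand
cov(x, y)                                             # built-in sample covariance
cov(x, y) / (sd(x) * sd(y))                           # sample correlation by hand
cor(x, y)                                             # built-in sample correlation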

Connection to covariance and correlation

We have \[\widehat{\beta}_{1}=\frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2}} \quad \text{ and } \quad r_{X,Y} = \frac{\sum_{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_{i}-\bar{X})^{2} \sum_{i=1}^{n}(Y_{i}-\bar{Y})^{2}}}\]

We can rewrite \(\hat{\beta}_1\) as \[\widehat{\beta}_{1} = r_{X,Y} \cdot \frac{s_Y}{s_X}\]

We can rewrite \(\displaystyle r_{X,Y}\) as \[r_{X,Y} = \widehat{\beta}_{1} \cdot \frac{s_X}{s_Y}\]

Can think of \(\widehat{\beta}_{1}\) as the correlation \(r_{X,Y}\) scaled by the ratio of the standard deviations of \(Y\) and \(X\)
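
A quick numerical check of this identity (again with the simulated x and y):

cor(x, y) * sd(y) / sd(x)   # correlation rescaled by ratio of standard deviations
coef(lm(y ~ x))[["x"]]      # OLS slope estimate, same value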

Linear regression in R

Gapminder data

Health and income outcomes for 184 countries from 1960 to 2016 from the famous Gapminder project

library(tidyverse)
theme_set(theme_light())
library(dslabs)
clean_gapminder <- gapminder |>
  filter(year == 2011, !is.na(gdp)) |>
  mutate(log_gdp = log(gdp))
glimpse(clean_gapminder)
Rows: 168
Columns: 10
$ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
$ infant_mortality <dbl> 14.3, 22.8, 106.8, 7.2, 12.7, 15.3, 3.8, 3.4, 32.5, 1…
$ life_expectancy  <dbl> 77.4, 76.1, 58.1, 75.9, 76.0, 73.5, 82.2, 80.7, 70.8,…
$ fertility        <dbl> 1.75, 2.83, 6.10, 2.12, 2.20, 1.50, 1.88, 1.44, 1.96,…
$ population       <dbl> 2886010, 36717132, 21942296, 88152, 41655616, 2967984…
$ gdp              <dbl> 6321690864, 81143448101, 27013935821, 801787943, 4729…
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Oce…
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…
$ log_gdp          <dbl> 22.56725, 25.11948, 24.01962, 20.50235, 26.88222, 22.…

Modeling life expectancy

Interested in modeling a country’s life expectancy

clean_gapminder |>
  ggplot(aes(x = life_expectancy)) +
  geom_histogram(color = "black", fill = "gray")

Relationship between life expectancy and log GDP

gdp_plot <- clean_gapminder |>
  ggplot(aes(x = log_gdp, y = life_expectancy)) +
  geom_point(size = 3, alpha = 0.5)
gdp_plot

We fit linear regression models using lm(); the formula is specified as: response ~ predictor

simple_lm <- lm(life_expectancy ~ log_gdp, 
                data = clean_gapminder) 

Summarize the model using summary()

summary(simple_lm)

Call:
lm(formula = life_expectancy ~ log_gdp, data = clean_gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.901  -4.781   1.879   5.335  13.962 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   24.174      5.758   4.198 4.38e-05 ***
log_gdp        1.975      0.242   8.161 7.87e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.216 on 166 degrees of freedom
Multiple R-squared:  0.2864,    Adjusted R-squared:  0.2821 
F-statistic: 66.61 on 1 and 166 DF,  p-value: 7.865e-14

Summarize the model using the broom package

The 3 main broom functions

  • tidy(): coefficients table in a tidy format

  • glance(): produces summary metrics of a model fit

  • augment(): adds/“augments” columns to the original data (e.g., adding predictions)

Note: the output of tidy(), glance(), and augment() is always a tibble.

library(broom)
tidy(simple_lm)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    24.2      5.76       4.20 4.38e- 5
2 log_gdp         1.97     0.242      8.16 7.87e-14
glance(simple_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.286         0.282  7.22      66.6 7.87e-14     1  -569. 1145. 1154.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Inference with OLS

Intercept and coefficient estimates \[\quad \hat{\beta}_0 = 24.174 \quad \text{and} \quad \hat{\beta}_1 = 1.975\]

Estimates of uncertainty for \(\beta\)’s via standard errors \[\quad \widehat{SE}(\hat{\beta}_0) = 5.758 \quad \text{and} \quad \widehat{SE}(\hat{\beta}_1) = 0.242\]

\(t\)-statistics: Estimate / Std. Error, i.e., the number of standard errors the estimate is away from 0

  • \(p\)-values (i.e., Pr(>|t|)): the probability of observing a value as extreme as |t value|, assuming the null hypothesis \(\beta_1 = 0\) is true

  • \(p\)-value \(< \alpha = 0.05\) (conventional threshold): sufficient evidence to reject the null hypothesis that the coefficient is zero

  • i.e., there is a significant association between life_expectancy and log_gdp
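
We can also get interval estimates for the coefficients directly with confint(), which combines the estimates and standard errors above:

confint(simple_lm, level = 0.95)  # 95% confidence intervals for the intercept and slope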

Coefficient of determination

  • In simple linear regression, the squared correlation between \(X\) and \(Y\) is exactly the coefficient of determination
cor(clean_gapminder$log_gdp, clean_gapminder$life_expectancy)
[1] 0.5351189
cor(clean_gapminder$log_gdp, clean_gapminder$life_expectancy) ^ 2
[1] 0.2863522
  • \(R^2\) estimates the proportion of the variance of \(Y\) explained by \(X\)

  • More generally: variance of model predictions / variance of \(Y\)

var(predict(simple_lm)) / var(clean_gapminder$life_expectancy) 
[1] 0.2863522

Generating predictions

We can use either the predict() or the fitted() function to get the fitted values of the regression:

train_preds <- predict(simple_lm)
head(train_preds)
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
head(fitted(simple_lm))
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
head(simple_lm$fitted.values)
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 
simple_lm |> 
  pluck("fitted.values") |> 
  head()
       1        2        3        4        5        6 
68.74401 73.78465 71.61243 64.66585 77.26605 67.97876 

Predictions for new data

Or we can provide new data via the newdata argument, which must contain the explanatory variable(s):

us_data <- clean_gapminder |> 
  filter(country == "United States")

new_us_data <- us_data |>
  select(country, gdp) |>
  slice(rep(1, 3)) |> 
  mutate(adj_factor = c(0.25, 0.5, 0.75),
         log_gdp = log(gdp * adj_factor))
new_us_data$pred_life_exp <- 
  predict(simple_lm, newdata = new_us_data) 
gdp_plot +
  geom_point(data = new_us_data,
             aes(x = log_gdp, y = pred_life_exp),
             color = "darkred", size = 3)

Plot observed values against predictions

Useful diagnostic (for any type of model, not just linear regression!)

clean_gapminder |>
  mutate(pred_vals = predict(simple_lm)) |> 
  ggplot(aes(x = pred_vals, y = life_expectancy)) +
  geom_point(alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed",
              color = "red",
              linewidth = 2)
  • “Perfect” model will follow diagonal

Plot observed values against predictions

Augment the data with model output using the broom package

clean_gapminder <- simple_lm |> 
  augment(clean_gapminder) 
clean_gapminder |>
  ggplot(aes(x = .fitted, y = life_expectancy)) + 
  geom_point(alpha = 0.5, size = 3) +
  geom_abline(slope = 1, intercept = 0, color = "red", 
              linetype = "dashed", linewidth = 2)
  • Adds various columns from the model fit that we can use for plotting and model diagnostics

Assessing assumptions of linear regression

Simple linear regression assumes \(Y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)\)

  • If this is true, then \(Y_i - \hat{Y}_i \overset{iid}{\sim} N(0, \sigma^2)\)

Plot residuals against \(\hat{Y}_i\): the residuals vs fit plot

  • Used to assess linearity, i.e., any divergence from mean 0

  • Used to assess equal variance, i.e., whether \(\sigma^2\) is homogeneous across predictions/fits \(\hat{Y}_i\)

More difficult to assess the independence and fixed \(X\) assumptions

  • Make these assumptions based on subject-matter knowledge

Plot residuals against predicted values

  • Residuals = observed - predicted

  • Interpretation of residuals in context?

  • Conditional on the predicted values, the residuals should have a mean of zero

clean_gapminder |>
  ggplot(aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.5, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", 
             color = "red", linewidth = 2) +
  # plot the residual mean
  geom_smooth(se = FALSE)
  • Residuals should NOT display any pattern

  • Two things to look for:

    • Any trend around horizontal reference line?

    • Equal vertical spread?

Multiple linear regression

We can include as many variables as we want (assuming \(n > p\))

\[Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon\]

OLS estimates in matrix notation (\(\boldsymbol{X}\) is an \(n \times p\) matrix):

\[\hat{\boldsymbol{\beta}} = (\boldsymbol{X} ^T \boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\]
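
A minimal sketch verifying the matrix formula with the simple model from earlier (model.matrix() builds the design matrix, including a leading column of 1s for the intercept):

X <- model.matrix(~ log_gdp, data = clean_gapminder)  # design matrix: intercept column + log_gdp
Y <- clean_gapminder$life_expectancy
solve(t(X) %*% X) %*% t(X) %*% Y  # matches coef(simple_lm)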

Can just add more variables to the formula in R

multiple_lm <- lm(life_expectancy ~ log_gdp + fertility, data = clean_gapminder)
  • Use adjusted \(R^2\) when including multiple variables: \(\displaystyle 1 - \frac{(1 - R^2)(n - 1)}{(n - p - 1)}\) (see the quick check after this list)

    • Adjusts for the number of variables in the model \(p\)

    • Adding more variables will always increase the model’s (unadjusted) \(R^2\)
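
A quick check of the adjusted \(R^2\) formula for the two-predictor model above, compared with the value reported by glance():

n_obs <- nobs(multiple_lm)                 # number of observations used in the fit
p_vars <- 2                                # number of predictors: log_gdp and fertility
r2 <- summary(multiple_lm)$r.squared
1 - (1 - r2) * (n_obs - 1) / (n_obs - p_vars - 1)
glance(multiple_lm)$adj.r.squared          # same value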

READ THIS: \(R^2\) myths

Get your \(R^2\) out of here

What about the Normal distribution assumption?

  • Model: \(Y=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}+\epsilon\)

  • \(\epsilon_i\) is the random noise: assume independent and identically distributed (iid) from a Normal distribution \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\) with constant variance \(\sigma^2\)

  • OLS doesn’t care about this assumption; it’s just estimating coefficients!

  • In order to perform inference, we need to impose additional assumptions

  • By assuming \(\epsilon_i \overset{iid}{\sim}N(0, \sigma^2)\), what we really mean is \(Y \overset{iid}{\sim}N(\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}, \sigma^2)\)

  • So we’re estimating the mean \(\mu\) of this conditional distribution, but what about \(\sigma^2\)?

  • Unbiased estimate \(\displaystyle \hat{\sigma}^2 = \frac{\text{RSS}}{n - (p + 1)}\)

  • Degrees of freedom: \(n - (p + 1)\), data supplies us with \(n\) “degrees of freedom” and we used up \(p + 1\)
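
A quick check of this estimate for multiple_lm; sigma() returns the residual standard error \(\hat{\sigma}\) reported by summary(), and df.residual() returns \(n - (p + 1)\):

rss <- sum(residuals(multiple_lm)^2)   # residual sum of squares
rss / df.residual(multiple_lm)         # unbiased estimate of sigma^2
sigma(multiple_lm)^2                   # same value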

Check the assumptions about normality with ggfortify

library(ggfortify)
autoplot(multiple_lm, ncol = 4)
  • Standardized residuals = each residual divided by an estimate of its standard deviation (see also .std.resid from augment())
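
For reference, rstandard() computes these standardized residuals and matches the .std.resid column from augment():

head(rstandard(multiple_lm))
augment(multiple_lm) |> pull(.std.resid) |> head()  # same values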

Regression confidence intervals vs prediction intervals

Do we want to predict the mean response for a particular value \(x^*\) of the explanatory variable or do we want to predict the response for an individual case?

Example:

  • if we would like to know the average price for all Teslas with 50 thousand miles, then we would use an interval for the mean response

  • if we want an interval to contain the price of a particular Tesla with 50 thousand miles, then we need the interval for a single prediction

Regression confidence intervals

  • geom_smooth() displays confidence intervals for the regression line
# predict(simple_lm, interval = "confidence")
lm_plot <- clean_gapminder |>
  ggplot(aes(x = log_gdp,
             y = life_expectancy)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "log(GDP)",
       y = "Life expectancy")
lm_plot

Regression confidence intervals

  • Interval estimate for the MEAN response at a given observation (i.e. for the predicted AVERAGE \(\hat{\beta}_0 + \hat{\beta}_1 x^*\))

  • Based on standard errors for the estimated regression line at \(x^*\) \[SE_{\text{line}}\left(x^{*}\right)=\hat{\sigma} \cdot \sqrt{\frac{1}{n}+\frac{\left(x^{*}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}\]

  • Variation only comes from uncertainty in the parameter estimates

Regression prediction intervals

  • Interval estimate for an INDIVIDUAL response at a given (new) observation

  • Add the variance \(\sigma^2\) of a single predicted value \[SE_{\text{pred}}\left(x^{*}\right)=\hat{\sigma} \cdot \sqrt{1 + \frac{1}{n}+\frac{\left(x^{*}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}\]

  • Variation comes from uncertainty in the parameter estimates and the error variance
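
A minimal sketch comparing the two intervals at a single hypothetical value of log GDP (the log_gdp = 24 value below is just for illustration), using predict():

new_obs <- tibble(log_gdp = 24)  # hypothetical log(GDP) value
predict(simple_lm, newdata = new_obs, interval = "confidence", level = 0.95)
predict(simple_lm, newdata = new_obs, interval = "prediction", level = 0.95)  # wider interval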

Confidence intervals versus prediction intervals

Generate 95% intervals

# predict(simple_lm, interval = "prediction")
lm_plot +
  geom_ribbon(
    data = augment(simple_lm, interval = "prediction"),
    aes(ymin = .lower, ymax = .upper),
    color = "red", fill = NA
  )
  • The standard error for predicting an individual response will always be larger than for predicting a mean response

  • Prediction intervals will always be wider than confidence intervals

Appendix: Interpret a linear regression model with log transformations

  • Log-transformed predictor: \(Y = \beta_0 + \beta_1 \log(X)\)

    • A 1% increase in \(X\) is associated with an average change of \(\beta_1/100\) units in \(Y\)
  • Log-transformed outcome: \(\log(Y) = \beta_0 + \beta_1 X\)

    • A 1 unit increase in \(X\) is associated with an average change of \((100 \times \beta_1) \%\) in \(Y\)
  • Log-transformed predictor and outcome: \(\log(Y) = \beta_0 + \beta_1 \log(X)\)

    • A 1% increase in \(X\) is associated with an average change of \(\beta_1 \%\) in \(Y\)
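
For example, with the fitted simple_lm from earlier (log-transformed predictor), \(\hat{\beta}_1 \approx 1.975\), so a 1% increase in GDP is associated with roughly 0.02 additional years of average life expectancy:

coef(simple_lm)[["log_gdp"]] / 100  # approximately 0.02 years per 1% increase in GDP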

Useful reading