Supervised learning: regularization


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

The big picture

Regularized/shrinkage/penalized regression

  • Involves fitting a model containing all \(p\) predictors
  • However, the estimated coefficients are shrunken towards zero relative to the least squares estimates of vanilla linear regression
  • This “shrinkage” (also known as regularization) has the effect of reducing variance
  • Depending on what type of regularization is performed, some coefficient estimates can be exactly zero
  • Thus, regularization methods can perform variable selection

Recap: linear regression

Linear regression estimates coefficients by minimizing

\[\text{RSS} = \sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2\]
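
As a quick sanity check (a sketch, not part of the original slides), we can verify numerically that the least squares coefficients minimize RSS, using the built-in mtcars data and optim():

# minimize RSS directly and compare with lm(), which does the same minimization
rss <- function(beta, x, y) sum((y - beta[1] - x %*% beta[-1])^2)
x_mt <- as.matrix(mtcars[, c("wt", "hp")])
y_mt <- mtcars$mpg
optim(par = rep(0, 3), fn = rss, x = x_mt, y = y_mt, method = "BFGS")$par
coef(lm(mpg ~ wt + hp, data = mtcars))  # essentially identical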

Shrinkage methods: ridge regression

Linear regression estimates coefficients by minimizing

\[\text{RSS} = \sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2\]

Ridge regression introduces a shrinkage penalty, controlled by a tuning parameter \(\lambda \geq 0\), and estimates coefficients by minimizing

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\]

  • as \(\lambda\) increases, flexibility of models decreases

    • increases bias, but decreases variance
  • for fixed value of \(\lambda\), ridge regression fits only a single model

    • need to use cross-validation to tune \(\lambda\)
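
To make the penalty concrete, here is a minimal sketch (not from the original slides) of the closed-form ridge solution \(\hat\beta = (X^\top X + \lambda I)^{-1} X^\top Y\) on standardized predictors with a centered response; note that glmnet parameterizes \(\lambda\) differently (it divides the RSS by \(n\)), so the values are not directly comparable:

# closed-form ridge coefficients on standardized data (no intercept needed)
ridge_coef <- function(x, y, lambda) {
  x <- scale(x)            # standardize predictors
  y <- y - mean(y)         # center response so the intercept drops out
  solve(t(x) %*% x + lambda * diag(ncol(x)), t(x) %*% y)
}
x_ridge <- as.matrix(mtcars[, c("wt", "hp", "disp")])
ridge_coef(x_ridge, mtcars$mpg, lambda = 0)    # matches least squares on standardized x
ridge_coef(x_ridge, mtcars$mpg, lambda = 100)  # coefficients shrink toward zero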

Shrinkage methods: ridge regression

For example: note how the magnitude of the coefficient for Income changes as \(\lambda \rightarrow \infty\)

The coefficient shrinks towards zero, but never actually reaches it

  • Income is always a variable in the learned model, regardless of the value of \(\lambda\)

Shrinkage methods: lasso regression

Ridge regression keeps all variables, but we may believe there is a sparse solution

Lasso (“Least Absolute Shrinkage and Selection Operator”) enables variable selection with \(\lambda\) by minimizing:

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_{j=1}^p \vert \beta_j \vert\]

  • Lasso uses an \(\ell_1\) (“ell 1”) penalty, whereas ridge uses an \(\ell_2\) penalty
  • as \(\lambda\) increases, flexibility of models decreases

    • increases bias, but decreases variance
  • Can handle the \(p > n\) case, i.e. more variables than observations!
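
A tiny sketch (not from the original slides, using simulated data) of the \(p > n\) point: with more predictors than observations, least squares is not unique (lm() reports NA coefficients), but the lasso still produces a sparse fit:

# simulate p = 50 predictors with only n = 20 observations
set.seed(1)
n_sim <- 20
p_sim <- 50
x_sim <- matrix(rnorm(n_sim * p_sim), n_sim, p_sim)
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + rnorm(n_sim)
sum(is.na(coef(lm(y_sim ~ x_sim))))           # many coefficients cannot be estimated
library(glmnet)
lasso_pn <- cv.glmnet(x_sim, y_sim, alpha = 1, nfolds = 5)
sum(coef(lasso_pn, s = "lambda.1se") != 0)    # only a handful of non-zero coefficients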

Shrinkage methods: lasso regression

Lasso regression performs variable selection yielding sparse models (i.e. models that involve only a subset of the variables)

The coefficient shrinks towards and eventually equals zero

Be lasso

Lasso and ridge as optimization problems

Why does the lasso, unlike ridge, result in coefficient estimates that are exactly 0?

  • Lasso

\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left(Y_i - \beta_0 - \sum_{j=1}^p\beta_j X_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p|\beta_j| \le s \]

  • Ridge

\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left(Y_i - \beta_0 - \sum_{j=1}^p\beta_j X_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j ^2 \le s \]
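
A one-dimensional sketch (not from the original slides) captures the difference: for a single coefficient with least squares estimate \(z\), minimizing \((z - \beta)^2\) plus the penalty gives proportional shrinkage under ridge but soft-thresholding under lasso, which can land exactly at zero:

# ridge shrinks proportionally; lasso soft-thresholds (exact zeros for small |z|)
ridge_1d <- function(z, lambda) z / (1 + lambda)
lasso_1d <- function(z, lambda) sign(z) * pmax(abs(z) - lambda / 2, 0)
z <- seq(-3, 3, by = 0.5)
cbind(z, ridge = ridge_1d(z, lambda = 2), lasso = lasso_1d(z, lambda = 2))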

Lasso and ridge in picture

Best of both worlds: elastic net

\[\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left(\frac{1-\alpha}{2}\|\beta\|_{2}^{2}+\alpha\|\beta\|_{1} \right)\]

  • \(\Vert \beta \Vert_1\): \(\ell_1\) norm: \(\Vert \beta \Vert_1 = \sum_{j=1}^p \vert \beta_j \vert\)

  • \(\Vert \beta \Vert_2\): \(\ell_2\) (Euclidean) norm: \(\Vert \beta \Vert_2 = \sqrt{\sum_{j=1}^p \beta_j^2}\)

  • Ridge penalty: \(\lambda \cdot (1 - \alpha) / 2\)

  • Lasso penalty: \(\lambda \cdot \alpha\)

  • \(\alpha\) controls the mixing between the two types, ranges from 0 to 1

    • \(\alpha = 1\) returns lasso, \(\alpha = 0\) returns ridge
  • \(\lambda\) and \(\alpha\) are tuning parameters

  • Choose appropriate values based on out-of-sample performance/cross-validation

  • Use cv.glmnet() function in glmnet to perform cross-validation
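
As a small check (not from the original slides), the elastic net penalty above is easy to compute directly for a given coefficient vector and choice of \(\lambda\) and \(\alpha\):

# evaluate the elastic net penalty term for a hypothetical coefficient vector
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
enet_penalty(beta = c(1, -2, 0.5), lambda = 0.1, alpha = 1)  # pure lasso penalty
enet_penalty(beta = c(1, -2, 0.5), lambda = 0.1, alpha = 0)  # pure ridge penalty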

Scaling of predictors

  • In linear regression, the (least squares) coefficient estimates are scale equivariant

    • multiplying \(X_j\) by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\)

    • i.e., regardless of how the \(j\)th predictor is scaled, \(X_j \hat \beta_j\) will remain the same

  • For ridge, lasso, or elastic net: you should standardize your data

    • The coefficient estimates can change substantially when a given predictor is multiplied by a constant, because of the penalty term (e.g. the sum of squared coefficients in ridge)
  • Common convention: for each predictor column, subtract off the mean, and divide by the standard deviation \(\qquad \displaystyle \tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_{x,j}}\)

  • glmnet package does this by default and reports coefficients on the original scale
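
A brief sketch (not from the original slides) of the scale equivariance point, using the built-in mtcars data: multiplying a predictor by a constant just rescales its least squares coefficient, leaving the fitted values unchanged.

# multiply wt by 1000; its coefficient is divided by 1000, so wt * beta is unchanged
fit_orig <- lm(mpg ~ wt + hp, data = mtcars)
fit_scaled <- lm(mpg ~ I(wt * 1000) + hp, data = mtcars)
coef(fit_orig)["wt"]
coef(fit_scaled)["I(wt * 1000)"] * 1000   # identical after undoing the scaling

# manual standardization matching the formula above (glmnet does this internally by default)
mtcars_std <- scale(mtcars[, c("wt", "hp")])  # subtract column means, divide by column SDs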

Examples

Data: prostate cancer

Examine the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy

library(tidyverse)
theme_set(theme_light())
# more info: https://rdrr.io/cran/ElemStatLearn/man/prostate.html
prostate <- read_tsv("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")
glimpse(prostate)
Rows: 97
Columns: 11
$ ...1    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ lcavol  <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
$ lweight <dbl> 2.769459, 3.319626, 2.691243, 3.282789, 3.432373, 3.228826, 3.…
$ age     <dbl> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
$ lbph    <dbl> -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1…
$ svi     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lcp     <dbl> -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1…
$ gleason <dbl> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
$ pgg45   <dbl> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
$ lpsa    <dbl> -0.4307829, -0.1625189, -0.1625189, -0.1625189, 0.3715636, 0.7…
$ train   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,…

Introduction to glmnet

Fit ridge, lasso, and elastic net models with glmnet

library(glmnet)
prostate <- prostate |> 
  select(lcavol:lpsa)

Create predictor matrix and response vector

# predictors
# model_x <- model.matrix(lpsa ~ ., prostate)
model_x <- prostate |> 
  select(lcavol:pgg45) |> 
  as.matrix()

# response
# model_y <- prostate$lpsa
model_y <- prostate |> 
  pull(lpsa)

Vanilla linear regression model

  • What do the initial regression coefficients look like?

  • Use broom to tidy model output for plotting

prostate_lm <- lm(lpsa ~ ., data = prostate)
library(broom)
prostate_lm |> 
  tidy() |> 
  mutate(term = fct_reorder(term, estimate)) |> 
  ggplot(aes(x = estimate, y = term, 
             fill = estimate > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue"))

Ridge regression

Perform ridge regression using glmnet with alpha = 0

By default, predictors are standardized and models are fitted across a range of \(\lambda\) values

prostate_ridge <- glmnet(model_x, model_y, alpha = 0)
plot(prostate_ridge, xvar = "lambda")

Ridge regression

Use cross-validation to select \(\lambda\) with cv.glmnet(), which uses 10 folds by default

Specify ridge regression with alpha = 0

prostate_ridge_cv <- cv.glmnet(model_x, model_y, alpha = 0)
plot(prostate_ridge_cv)

Tidy ridge regression

# str(prostate_ridge_cv)
tidy_ridge_coef <- tidy(prostate_ridge_cv$glmnet.fit)
tidy_ridge_coef |> 
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.min) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.1se, 
             linetype = "dashed", color = "red")

  • Could easily add color with legend for variables…

Tidy ridge regression

tidy_ridge_cv <- tidy(prostate_ridge_cv)
tidy_ridge_cv |> 
  ggplot(aes(x = lambda, y = estimate)) +
  geom_line() + 
  scale_x_log10() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), 
              alpha = 0.2) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.min) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.1se,
             linetype = "dashed", color = "red")

Ridge regression coefficients

Coefficients using the one-standard-error rule

(select the most regularized model whose estimated test error is within one standard error of the minimum test error)

# ridge_final <- glmnet(
#   model_x, model_y, alpha = 0,
#   lambda = prostate_ridge_cv$lambda.1se,
# )
# library(vip)
# ridge_final |> 
#   vip()

tidy_ridge_coef |>
  filter(lambda == prostate_ridge_cv$lambda.1se) |>
  mutate(term = fct_reorder(term, estimate)) |>
  ggplot(aes(x = estimate, y = term, 
             fill = estimate > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue"))
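
The same coefficients can also be pulled directly from the cross-validation object (a small aside, not on the original slide):

coef(prostate_ridge_cv, s = "lambda.1se")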

Lasso regression example

Similar syntax to ridge but specify alpha = 1:

prostate_lasso_cv <- cv.glmnet(model_x, model_y, 
                               alpha = 1)
tidy_lasso_coef <- tidy(prostate_lasso_cv$glmnet.fit)
tidy_lasso_coef |> 
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.min) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.1se, 
             linetype = "dashed", color = "red")

Lasso regression example

Number of non-zero predictors by \(\lambda\)

tidy_lasso_cv <- tidy(prostate_lasso_cv)
tidy_lasso_cv |>
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = prostate_lasso_cv$lambda.min) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.1se, 
             linetype = "dashed", color = "red") +
  scale_x_log10()

Reduction in variables using one-standard-error rule

Lasso regression example

Coefficients using the one-standard-error rule

# this will only print out non-zero coefficient estimates
# tidy_lasso_coef |>
#   filter(lambda == prostate_lasso_cv$lambda.1se)

lasso_final <- glmnet(
  model_x, model_y, 
  alpha = 1,
  lambda = prostate_lasso_cv$lambda.1se
)
library(vip)
lasso_final |> 
  vi() |> 
  mutate(Variable = fct_reorder(Variable, Importance)) |>
  ggplot(aes(x = Importance, y = Variable, 
             fill = Importance > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue")) +
  labs(x = "estimate", y = NULL)

Elastic net example

Need to tune both \(\lambda\) and \(\alpha\) - can do so manually with our own folds

set.seed(100)
fold_id <- sample(rep(1:10, length.out = nrow(model_x)))

Then use cross-validation with these folds for different candidate alpha values:

cv_enet_25 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0.25)
cv_enet_50 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0.5)
cv_ridge <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0)
cv_lasso <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 1)

Can see which one had the lowest CV error among its candidate \(\lambda\) values:

which.min(c(min(cv_enet_25$cvm), min(cv_enet_50$cvm), min(cv_ridge$cvm), min(cv_lasso$cvm)))
[1] 3

Elastic net example

Can view same type of summary

cv_enet_50 |> 
  tidy() |> 
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = cv_enet_50$lambda.min) +
  geom_vline(xintercept = cv_enet_50$lambda.1se, 
             linetype = "dashed", 
             color = "red") +
  scale_x_log10()

  • More relaxed than lasso for variable entry

Comparison of models based on holdout performance

set.seed(101)
k <- 5
prostate <- prostate |>
  mutate(test_fold = sample(rep(1:k, length.out = n())))
get_test_pred <- function(k) {
  test_data <- prostate |> filter(test_fold == k)                     # get test and training data
  train_data <- prostate |> filter(test_fold != k)
  test_x <- as.matrix(select(test_data, -lpsa, -test_fold))            # get test and training matrices (drop the fold ID)
  train_x <- as.matrix(select(train_data, -lpsa, -test_fold))
  
  lm_fit <- lm(lpsa ~ . - test_fold, data = train_data)               # fit models to training data
  ridge_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 0)
  lasso_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 1)
  enet_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 0.5)
  
  tibble(lm_pred = predict(lm_fit, newdata = test_data),              # return test results
         ridge_pred = as.numeric(predict(ridge_fit, newx = test_x)),
         lasso_pred = as.numeric(predict(lasso_fit, newx = test_x)),
         enet_pred = as.numeric(predict(enet_fit, newx = test_x)),
         test_actual = test_data$lpsa,
         test_fold = k)
}
test_pred_all <- map(1:k, get_test_pred) |> 
  bind_rows()

Comparison of models based on holdout performance

Compute RMSE across folds with standard error intervals

test_pred_all |>
  pivot_longer(lm_pred:enet_pred, 
               names_to = "type", 
               values_to = "test_pred") |>
  group_by(type, test_fold) |>
  summarize(
    rmse = sqrt(mean((test_actual - test_pred)^2))
  ) |> 
  ggplot(aes(x = type, y = rmse)) + 
  geom_point(size = 4) +
  stat_summary(fun = mean, geom = "point", 
               color = "red", size = 4) + 
  stat_summary(fun.data = mean_se, geom = "errorbar", 
               color = "red", width = 0.2)

Linear regression actually performs “better” than regularization here, but the differences fall within the standard error intervals