Supervised learning: regularization


SURE 2025

Department of Statistics & Data Science
Carnegie Mellon University

Background

Shortcomings of linear regression

  • Prediction accuracy

    • Recall that we can decompose estimation (prediction) error into squared bias and variance

    • Linear regression: low bias, but potentially high variance (especially when \(p\) is large)

  • Variable selection

    • With a large number of predictors, it can be helpful to identify a subset of relevant features and eliminate uninformative ones

    • Linear regression “freely” assigns a coefficient to each predictor variable

    • Can we have a model fitting procedure that makes only a subset of the coefficients large, and keeps the others small (or, even better, exactly zero)?

The big picture

Regularized/shrinkage/penalized regression

  • Involves fitting a model containing all \(p\) predictors
  • However, the estimated coefficients are shrunken towards zero relative to the least squares estimates of vanilla linear regression
  • This “shrinkage” (also known as regularization) has the effect of reducing variance
  • Depending on what type of regularization is performed, some coefficient estimates can be exactly zero
  • Thus, regularization methods (e.g. LASSO) can perform variable selection

Recap: linear regression

  • Linear regression (least squares) estimates coefficients by minimizing

\[\text{RSS} = \sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2\]

Ridge regression

Linear regression vs ridge regression

  • Linear regression (least squares) estimates coefficients by minimizing

\[\text{RSS} = \sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2\]

  • Ridge regression introduces a shrinkage penalty term and minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\]

  • Ridge regression uses a (squared) \(\ell_2\) penalty (hence ridge is also known as \(\ell_2\) regularization)

    • The \(\ell_2\) norm of a coefficient vector \(\beta\) is given by \(\displaystyle \Vert \beta \Vert_2 = \sqrt{\sum_{j=1}^p \vert \beta_j \vert ^2}\)

Ridge regression: more details

  • Recall that ridge regression minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\]

  • Here, \(\lambda \ge 0\) is a tuning parameter, which controls the strength of the penalty term
  • Use cross-validation to tune \(\lambda\) (for a fixed value of \(\lambda\), ridge regression fits only a single model)
  • As \(\lambda\) increases, flexibility of models decreases (bias increases, variance decreases)

    • When \(\lambda = 0\), it’s just linear regression

    • When \(\lambda \rightarrow \infty\), the coefficient estimates approach 0

    • For \(\lambda\) in between, we are balancing two ideas: fitting a linear model and shrinking the coefficients
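
To make the objective concrete, here is a minimal sketch that evaluates the ridge criterion directly on toy data (the data and coefficient values below are made up for illustration):

set.seed(1)
n <- 50
p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - X[, 2] + rnorm(n)

ridge_objective <- function(beta0, beta, lambda) {
  rss <- sum((y - beta0 - X %*% beta)^2)  # residual sum of squares
  rss + lambda * sum(beta^2)              # plus the squared l2 penalty
}

# larger lambda penalizes the same coefficients more heavily
ridge_objective(0, c(2, -1, 0), lambda = 0)
ridge_objective(0, c(2, -1, 0), lambda = 10)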

Ridge regression: textbook example

  • Notice how the magnitudes of the coefficients for the 4 highlighted variables trend as \(\lambda \rightarrow \infty\)

  • The coefficients shrink towards zero, but never actually reach it

Lasso & sparsity

Moving from ridge to lasso

  • Ridge never sets coefficients to exactly zero (i.e., always keeps all variables)
  • Therefore ridge regression cannot perform variable selection
  • But we may believe there is a sparse solution (i.e., a model involving only a subset of the variables)
  • Enter the lasso (least absolute shrinkage and selection operator)
  • Compared to ridge, one big advantage of lasso/sparsity: it performs variable selection in the linear model, so we can understand which relevant features the model relies on for prediction

Linear regression vs ridge vs lasso

  • Linear regression (least squares) estimates coefficients by minimizing

\[\text{RSS} = \sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2\]

  • Ridge regression minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\]

  • Lasso regression enables variable selection by minimizing

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_{j=1}^p \vert \beta_j \vert\]

The lasso: more details

  • Recall that the lasso minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_{j=1}^p \vert \beta_j \vert\]

  • Lasso uses an \(\ell_1\) penalty (hence also known as \(\ell_1\) regularization)

    • The \(\ell_1\) norm of a coefficient vector \(\beta\) is given by \(\displaystyle \Vert \beta \Vert_1 = \sum_{j=1}^p \vert \beta_j \vert\)
  • The only difference between lasso and ridge is that ridge uses a (squared) \(\ell_2\) penalty \(\Vert \beta \Vert_2^2\), while lasso uses an \(\ell_1\) penalty \(\Vert \beta \Vert_1\)
  • Importantly, although these two problems look relatively similar, their solutions behave very differently

The lasso: more details

  • Recall that the lasso minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_{j=1}^p \vert \beta_j \vert\]

  • Here, \(\lambda \ge 0\) is a tuning parameter, which controls the strength of the penalty (like ridge)
  • Use cross-validation to tune \(\lambda\) (for a fixed value of \(\lambda\), lasso regression fits only a single model)
  • As \(\lambda\) increases, flexibility of models decreases (bias increases, variance decreases)

    • When \(\lambda = 0\), it’s just linear regression

    • When \(\lambda \rightarrow \infty\), the coefficient estimates approach 0

    • For \(\lambda\) in between, we are balancing two ideas: fitting a linear model and shrinking the coefficients

    • But the nature of the \(\ell_1\) penalty allows some coefficients to be shrunken to zero exactly

  • Lasso can handle the \(p > n\) case, i.e. more variables than observations
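
One way to see why the \(\ell_1\) penalty produces exact zeros: in the special case of a single standardized predictor (orthonormal design), the lasso solution is a soft-thresholded version of the least squares estimate. A minimal sketch with made-up numbers:

# soft-thresholding: shrink toward zero by lambda, and snap to zero past the threshold
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
# estimates with |b| <= lambda become exactly zero; the rest are shrunk toward zero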

Shrinkage methods: lasso regression

  • Lasso regression performs variable selection yielding sparse models (i.e. models that involve only a subset of the variables)

  • We get different models with different number of predictors at different values of \(\lambda\)

  • The coefficients shrink towards and eventually equal zero

Lasso and ridge as optimization problems

Why does the lasso, unlike ridge, result in coefficient estimates that are exactly 0?

  • Lasso

\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left(Y_i - \beta_0 - \sum_{j=1}^p\beta_j X_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p|\beta_j| \le s \]

  • Ridge

\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \left(Y_i - \beta_0 - \sum_{j=1}^p\beta_j X_{ij} \right)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j ^2 \le s \]

Lasso and ridge in pictures

Lasso and ridge in pictures, explained

  • Consider the case of \(p=2\) (2 predictors, hence 2 coefficients \(\beta_1\) and \(\beta_2\) to be estimated)

  • The residual sum of squares has elliptical contours, centered at the unregularized least squares estimate \(\hat \beta\)

  • The constraint region for lasso (left) is the diamond \(\vert \beta_1 \vert + \vert \beta_2 \vert \le s\)

  • The constraint region for ridge (right) is the disk \(\beta_1 ^ 2 + \beta_2 ^ 2 \le s\)

  • Both methods aim to find the first point where the elliptical contours hit the constraint region

  • Unlike the disk, the diamond has corners; if the solution occurs at a corner, then it has one \(\beta_j\) equal to zero

  • When \(p>2\), the diamond becomes a rhomboid (with many corners, flat edges, faces, etc.), allowing more opportunities for the estimated coefficients to be zero
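
As a rough schematic of the figure described above (not a reproduction of it), a minimal sketch that draws the two \(p = 2\) constraint regions with \(s = 1\), assuming ggplot2 is available:

library(ggplot2)
theta <- seq(0, 2 * pi, length.out = 200)
regions <- rbind(
  data.frame(b1 = c(1, 0, -1, 0, 1), b2 = c(0, 1, 0, -1, 0),
             method = "lasso: |b1| + |b2| <= 1"),
  data.frame(b1 = cos(theta), b2 = sin(theta),
             method = "ridge: b1^2 + b2^2 <= 1")
)
regions |> 
  ggplot(aes(x = b1, y = b2)) +
  geom_path() +
  coord_equal() +
  facet_wrap(~ method)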

Elastic net

Best of both worlds: elastic net

  • Motivation: when developing models, there are often strong correlations among the features
  • The lasso penalty is somewhat indifferent to the choice among a set of strong but correlated features
  • The ridge penalty tends to shrink the coefficients of correlated features toward each other
  • The elastic net penalty is a compromise, aiming to

    • encourage highly correlated features to be averaged

    • encourage a sparse solution in the coefficients of these averaged features

  • From the original paper’s abstract:

… the elastic net encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together.

Ridge vs lasso vs elastic net

  • Ridge minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\]

  • Lasso minimizes

\[\sum_{i=1}^n \big(Y_i - \beta_0 - \sum_{j=1}^p \beta_j X_{ij}\big)^2 + \lambda \sum_{j=1}^p\vert \beta_j \vert = \text{RSS} + \lambda \sum_{j=1}^p \vert \beta_j \vert\]

  • Elastic net compromises between ridge and lasso

\[\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left(\frac{1-\alpha}{2}\|\beta\|_{2}^{2}+\alpha\|\beta\|_{1} \right)\]

Elastic net: more details

  • Elastic net minimizes

\[\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\sum_{j=1}^{p} \beta_{j} X_{i j}\right)^{2}+\lambda\left(\frac{1-\alpha}{2}\|\beta\|_{2}^{2}+\alpha\|\beta\|_{1} \right)\]

  • Penalty terms: \(\lambda \cdot (1 - \alpha) / 2\) (ridge) and \(\lambda \cdot \alpha\) (lasso)

    • \(\alpha\) controls the mixing between the two types, ranges from 0 to 1

    • \(\alpha = 1\) returns lasso, \(\alpha = 0\) returns ridge

  • \(\lambda\) and \(\alpha\) are tuning parameters; use cross-validation to tune
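
To see how \(\alpha\) mixes the two penalties, a minimal sketch that evaluates the elastic net penalty term for made-up coefficient values:

enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
enet_penalty(c(2, -1, 0), lambda = 1, alpha = 0)    # pure (halved) ridge penalty
enet_penalty(c(2, -1, 0), lambda = 1, alpha = 1)    # pure lasso penalty
enet_penalty(c(2, -1, 0), lambda = 1, alpha = 0.5)  # a 50/50 compromise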

Scaling of predictors

  • In linear regression, the (least squares) coefficient estimates are scale equivariant

    • Multiplying \(X_j\) by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of \(1/c\)

    • i.e., regardless of how the \(j\)th predictor is scaled, \(X_j \hat \beta_j\) will remain the same

  • For ridge, lasso, and elastic net: standardize the predictors

    • The penalized coefficient estimates can change substantially when a given predictor is multiplied by a constant, because the penalty term puts all coefficients on a common scale

    • For each predictor column, subtract off the mean, and divide by the standard deviation \[\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{s_{x,j}}\]
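
A minimal sketch of both points using toy data (the names x, y, and X below are made up for illustration):

set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
coef(lm(y ~ x))[["x"]]         # least squares slope
coef(lm(y ~ I(100 * x)))[[2]]  # predictor multiplied by 100: slope is 1/100 as large

X <- cbind(x1 = x, x2 = exp(x))
X_std <- scale(X)              # subtract column means, divide by column SDs
round(colMeans(X_std), 10)     # approximately 0
apply(X_std, 2, sd)            # exactly 1
# glmnet standardizes predictors internally by default (standardize = TRUE)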

Examples

Data: prostate cancer

Examine the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy

library(tidyverse)
theme_set(theme_light())
# more info: https://rdrr.io/cran/ElemStatLearn/man/prostate.html
prostate <- read_tsv("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")
glimpse(prostate)
Rows: 97
Columns: 11
$ ...1    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ lcavol  <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
$ lweight <dbl> 2.769459, 3.319626, 2.691243, 3.282789, 3.432373, 3.228826, 3.…
$ age     <dbl> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
$ lbph    <dbl> -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1…
$ svi     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lcp     <dbl> -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1.3862944, -1…
$ gleason <dbl> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
$ pgg45   <dbl> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
$ lpsa    <dbl> -0.4307829, -0.1625189, -0.1625189, -0.1625189, 0.3715636, 0.7…
$ train   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,…

Introduction to glmnet

Fit ridge, lasso, and elastic net models with glmnet

library(glmnet)
prostate <- prostate |> 
  select(lcavol:lpsa)

Create predictor matrix and response vector

# predictors
# model_x <- model.matrix(lpsa ~ ., prostate)
model_x <- prostate |> 
  select(lcavol:pgg45) |> 
  as.matrix()

# response
# model_y <- prostate$lpsa
model_y <- prostate |> 
  pull(lpsa)

Vanilla linear regression model

  • What do the initial regression coefficients look like?

  • Use broom to tidy model output for plotting

prostate_lm <- lm(lpsa ~ ., data = prostate)
library(broom)
prostate_lm |> 
  tidy() |> 
  mutate(term = fct_reorder(term, estimate)) |> 
  ggplot(aes(x = estimate, y = term, 
             fill = estimate > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue"))

Coefficient plot (w/ uncertainty)

prostate_lm |> 
  tidy(conf.int = TRUE) |> 
  filter(term != "(Intercept)") |> 
  ggplot(aes(x = estimate, y = term))  +
  geom_point(size = 4) +
  geom_errorbar(aes(xmin = conf.low, 
                    xmax = conf.high), 
                width = 0.2) +
  geom_vline(xintercept = 0, 
             linetype = "dashed", 
             color = "red")

Ridge regression

Perform ridge regression using glmnet with alpha = 0

By default, predictors are standardized and models are fitted across a range of \(\lambda\) values

prostate_ridge <- glmnet(model_x, model_y, alpha = 0)
plot(prostate_ridge, xvar = "lambda")

Ridge regression

Use cross-validation to select \(\lambda\) with cv.glmnet(), which uses 10 folds by default

Specify ridge regression with alpha = 0

prostate_ridge_cv <- cv.glmnet(model_x, model_y, alpha = 0)
plot(prostate_ridge_cv)
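
The two vertical lines in the CV plot correspond to values stored on the fitted object; a quick way to inspect them (and the coefficients at the one-standard-error choice):

prostate_ridge_cv$lambda.min               # lambda with the smallest CV error
prostate_ridge_cv$lambda.1se               # largest lambda within one SE of that minimum
coef(prostate_ridge_cv, s = "lambda.1se")  # coefficients at the one-standard-error choice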

Tidy ridge regression

# str(prostate_ridge_cv)
tidy_ridge_coef <- tidy(prostate_ridge_cv$glmnet.fit)
tidy_ridge_coef |> 
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.min) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.1se, 
             linetype = "dashed", color = "red")

  • Could easily add color with a legend for the variables…

Tidy ridge regression

tidy_ridge_cv <- tidy(prostate_ridge_cv)
tidy_ridge_cv |> 
  ggplot(aes(x = lambda, y = estimate)) +
  geom_line() + 
  scale_x_log10() +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), 
              alpha = 0.2) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.min) +
  geom_vline(xintercept = prostate_ridge_cv$lambda.1se,
             linetype = "dashed", color = "red")

Ridge regression coefficients

Coefficients using the one-standard-error rule

(choose the largest \(\lambda\), i.e. the simplest model, whose estimated test error is within one standard error of the minimum test error)

# ridge_final <- glmnet(
#   model_x, model_y, alpha = 0,
#   lambda = prostate_ridge_cv$lambda.1se,
# )
# library(vip)
# ridge_final |> 
#   vip()

tidy_ridge_coef |>
  filter(lambda == prostate_ridge_cv$lambda.1se) |>
  mutate(term = fct_reorder(term, estimate)) |>
  ggplot(aes(x = estimate, y = term, 
             fill = estimate > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue"))

Lasso regression example

Similar syntax to ridge but specify alpha = 1:

prostate_lasso_cv <- cv.glmnet(model_x, model_y, 
                               alpha = 1)
tidy_lasso_coef <- tidy(prostate_lasso_cv$glmnet.fit)
tidy_lasso_coef |> 
  ggplot(aes(x = lambda, y = estimate, group = term)) +
  scale_x_log10() +
  geom_line(alpha = 0.75) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.min) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.1se, 
             linetype = "dashed", color = "red")

Lasso regression example

Number of non-zero predictors by \(\lambda\)

tidy_lasso_cv <- tidy(prostate_lasso_cv)
tidy_lasso_cv |>
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = prostate_lasso_cv$lambda.min) +
  geom_vline(xintercept = prostate_lasso_cv$lambda.1se, 
             linetype = "dashed", color = "red") +
  scale_x_log10()

Reduction in variables using one-standard-error rule

Lasso regression example

Coefficients using the one-standard-error rule

# this will only print out non-zero coefficient estimates
# tidy_lasso_coef |>
#   filter(lambda == prostate_lasso_cv$lambda.1se)

lasso_final <- glmnet(
  model_x, model_y, 
  alpha = 1,
  lambda = prostate_lasso_cv$lambda.1se
)
library(vip)
lasso_final |> 
  vi() |> 
  mutate(Variable = fct_reorder(Variable, Importance)) |>
  ggplot(aes(x = Importance, y = Variable, 
             fill = Importance > 0)) +
  geom_col(color = "white", show.legend = FALSE) +
  scale_fill_manual(values = c("darkred", "darkblue")) +
  labs(x = "estimate", y = NULL)

Elastic net example

Need to tune both \(\lambda\) and \(\alpha\); we can do so manually with our own folds

set.seed(100)
fold_id <- sample(rep(1:10, length.out = nrow(model_x)))

Then use cross-validation with these folds for different candidate alpha values:

cv_enet_25 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0.25)
cv_enet_50 <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0.5)
cv_ridge <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 0)
cv_lasso <- cv.glmnet(model_x, model_y, foldid = fold_id, alpha = 1)

Can see which one had the lowest CV error among its candidate \(\lambda\) values:

which.min(c(min(cv_enet_25$cvm), min(cv_enet_50$cvm), min(cv_ridge$cvm), min(cv_lasso$cvm)))
[1] 3
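
A small convenience sketch (not part of the workflow above) that attaches labels to the same comparison:

cv_fits <- list(enet_25 = cv_enet_25, enet_50 = cv_enet_50,
                ridge = cv_ridge, lasso = cv_lasso)
sapply(cv_fits, function(fit) min(fit$cvm))  # minimum CV error for each alpha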

Elastic net example

Can view same type of summary

cv_enet_50 |> 
  tidy() |> 
  ggplot(aes(x = lambda, y = nzero)) +
  geom_line() +
  geom_vline(xintercept = cv_enet_50$lambda.min) +
  geom_vline(xintercept = cv_enet_50$lambda.1se, 
             linetype = "dashed", 
             color = "red") +
  scale_x_log10()

  • Compared to the lasso, the elastic net is more relaxed about letting variables enter the model

Comparison of models based on holdout performance

set.seed(101)
N_FOLDS <- 5
prostate <- prostate |>
  mutate(test_fold = sample(rep(1:N_FOLDS, length.out = n())))
get_test_pred <- function(x) {
  test_data <- prostate |> filter(test_fold == x)                     # get test and training data
  train_data <- prostate |> filter(test_fold != x)
  test_x <- as.matrix(select(test_data, -lpsa, -test_fold))           # get test and training matrices (drop the fold id)
  train_x <- as.matrix(select(train_data, -lpsa, -test_fold))
  
  lm_fit <- lm(lpsa ~ . - test_fold, data = train_data)               # fit models to training data (excluding the fold id)
  ridge_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 0)
  lasso_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 1)
  enet_fit <- cv.glmnet(train_x, train_data$lpsa, alpha = 0.5)
  
  tibble(lm_pred = predict(lm_fit, newdata = test_data),              # return test results
         ridge_pred = as.numeric(predict(ridge_fit, newx = test_x)),
         lasso_pred = as.numeric(predict(lasso_fit, newx = test_x)),
         enet_pred = as.numeric(predict(enet_fit, newx = test_x)),
         test_actual = test_data$lpsa,
         test_fold = x)
}
test_pred_all <- map(1:N_FOLDS, get_test_pred) |> 
  bind_rows()

Comparison of models based on holdout performance

Compute RMSE across folds with standard error intervals

test_pred_all |>
  pivot_longer(lm_pred:enet_pred, 
               names_to = "type", 
               values_to = "test_pred") |>
  group_by(type, test_fold) |>
  summarize(
    rmse = sqrt(mean((test_actual - test_pred)^2))
  ) |> 
  ggplot(aes(x = type, y = rmse)) + 
  geom_point(size = 4) +
  stat_summary(fun = mean, geom = "point", 
               color = "red", size = 4) + 
  stat_summary(fun.data = mean_se, geom = "errorbar", 
               color = "red", width = 0.2)

Linear regression actually slightly outperforms the regularized models here, but the differences are within the standard error intervals