library(tidyverse)
theme_set(theme_light())
bikes <- read_csv("https://raw.githubusercontent.com/36-SURE/2025/refs/heads/main/data/bikes.csv")
Best subset selection
Forward/backward/stepwise selection (see the sketch after this list)
Limitations
only feasible for a small number of variables
computationally infeasible for a large number of predictors
arbitrary way of defining the best model and ignores contextual knowledge about potential predictors
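As a quick reference, here is a minimal sketch of stepwise selection with base R's step(), using the built-in mtcars data purely for illustration (it is not part of the bikes example):

full_fit <- lm(mpg ~ ., data = mtcars)           # start from the full model
step_fit <- step(full_fit, direction = "both",   # add/drop terms by AIC
                 trace = FALSE)
summary(step_fit)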
Justify which predictors used in modeling based on:
domain knowledge
extensive EDA
model assessment based on holdout predictions (w/ cross-validation, we’ll discuss this in a bit)
A confounding variable (or lurking variable) is one that is associated with both the response and predictor variables
It can be tempting to immediately jump to conclusions when faced with a strong association between the variables
However, it is important to stop and think about whether there are confounding variables which could be explaining the association instead
In studying the relationship between a predictor \(X\) and a response variable \(Y\),
We should adjust for confounding variables that can influence that relationship because they are associated both with \(X\) and with \(Y\)
If the variable is a potential confounder, including it in the model may help to reduce bias in estimating relevant effects of key explanatory variables
Example: Examine the impact of fertilizer on the yield of an apple orchard
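To see why adjusting matters, here is a tiny hypothetical simulation (not from the original slides) in which a lurking variable z drives both the "treatment" x and the response y:

set.seed(1)
n <- 1000
z <- rnorm(n)                    # confounder (e.g., soil quality / tree age)
x <- 0.8 * z + rnorm(n)          # "fertilizer" level depends on the confounder
y <- 2 * x + 3 * z + rnorm(n)    # "yield" depends on both
coef(lm(y ~ x))["x"]       # biased: picks up part of z's effect
coef(lm(y ~ x + z))["x"]   # adjusting for z approximately recovers the true effect of 2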
Conditioning is the soul of statistics. — Joe Blitzstein
Setup: Let \(Y\) be the response variable, and \(X = (X_1, X_2, \dots, X_p)\) be the predictors
We want to fit a model to estimate an unknown function \(f\) describing the relationship between \(Y\) and \(X\)
The learned model will take the form \(\hat{Y}=\hat{f}(X)\)
What is the goal of your modeling problem?
If you care more about the details of \(\hat{f}(X)\), then it’s an inference problem
If you care more about predictive power, then it’s a prediction problem
Model flexibility vs interpretability
Good fit vs overfit or underfit
Parsimony vs black-box
Black curve (left): truth
Orange, blue and green curves/squares: fits of different flexibility
Gray curve (right): \(\text{MSE}_{\texttt{train}}\). Red curve (right): \(\text{MSE}_{\texttt{test}}\)
Black curve (left): true model
Data are generated from a smoothly varying non-linear model with random noise added
Orange line (left): linear regression fit: inflexible, fully parametric, underfits the data
Green curve (left): an overly flexible, nonparametric model: overfits (follows noise too closely)
Blue curve (left): aims to strike a balance, not too smooth, not too wiggly.
Gray line (right): training error (MSE) decreases as flexibility increases.
Red line (right): test error (MSE) decreases until a “sweet spot”, then increases due to overfitting
Typical trend
Underfitting means high bias & low variance (e.g. fitting a straight line to data with a clearly nonlinear trend)
Overfitting means low bias & high variance (e.g. drawing a wiggly curve that passes through every single training observation)
Orange curve/squares (linear model): high bias, low variance
Green curve/squares (overly flexible model): low bias, high variance
Blue curve/squares: optimal (“just right”) amount of flexibility is somewhere in the middle
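A quick simulated illustration of this pattern (a hypothetical example, not the data behind the figure): fit polynomials of increasing degree and track training vs. test MSE.

set.seed(2)
sim_data <- function(n) tibble(x = runif(n, 0, 10), y = sin(x) + rnorm(n, sd = 0.3))
train <- sim_data(100)
test <- sim_data(100)
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)
flex_results <- map(1:15, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  tibble(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}) |>
  bind_rows()
flex_results  # training MSE keeps dropping; test MSE bottoms out, then rises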
Can we ever predict \(Y\) from \(X\) with zero error?
Generally, no… even the true \(f\) cannot generically do this, and will incur some error due to noise
This is called the irreducible error \(\epsilon\)
What happens if our fitted model \(\hat f\) belongs to a model class that is far from the true \(f\)?
For example, we fit a linear model in a setting where the true relationship is far from linear
As a result, we encounter a source of error: estimation bias
What happens if our fitted model \(\hat f\) is itself quite variable?
In other words, over different training sets, we end up constructing substantially different \(\hat f\)
This is another source of error: estimation variance
Thus, there’s a trade-off between bias and variance
Even if our prediction is unbiased, we can still incur a large error if it is highly variable
Meanwhile, even when our prediction is stable and not variable, we can incur a large error if it is badly biased
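This trade-off is captured by the standard decomposition of the expected test error at a point \(x_0\) (as in ISLR):
\[
\mathbb{E}\big[(Y - \hat{f}(x_0))^2\big]
= \underbrace{\big[\text{Bias}\big(\hat{f}(x_0)\big)\big]^2}_{\text{bias}^2}
+ \underbrace{\text{Var}\big(\hat{f}(x_0)\big)}_{\text{variance}}
+ \underbrace{\text{Var}(\epsilon)}_{\text{irreducible error}}
\]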
Remember, when building predictive models, try to find the sweet spot between bias and variance. #babybear
— Dr. G. J. Matthews (@StatsClass) September 6, 2018
“Numquam ponenda est pluralitas sine necessitate” (plurality must never be posited without necessity)
From ISLR:
When faced with new data modeling and prediction problems, it’s tempting to always go for the trendy new methods. Often they give extremely impressive results, especially when the datasets are very large and can support the fitting of high-dimensional nonlinear models. However, if we can produce models with the simpler tools that perform as well, they are likely to be easier to fit and understand, and potentially less fragile than the more complex approaches. Wherever possible, it makes sense to try the simpler models as well, and then make a choice based on the performance/complexity trade-off.
In short: “When faced with several methods that give roughly equivalent performance, pick the simplest.”
Adding additional signal features that are truly associated with the response will improve the fitted model
Adding noise features that are not truly associated with the response will lead to a deterioration in the model
Increase in test set error
Noise features increase the dimensionality of the problem, exacerbating the risk of overfitting
(Noise features may be associated with the response on the training set due to chance)
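A hypothetical sketch of this effect (my own simulation, not from the original slides): add pure-noise columns to a simple linear model and compare test error.

set.seed(3)
n <- 100
p_noise <- 50   # number of pure-noise features
make_data <- function(n, p_noise) {
  x <- rnorm(n)
  noise <- matrix(rnorm(n * p_noise), ncol = p_noise)
  data.frame(y = 2 * x + rnorm(n), x = x, noise)
}
sim_train <- make_data(n, p_noise)
sim_test <- make_data(n, p_noise)
signal_fit <- lm(y ~ x, data = sim_train)
noisy_fit <- lm(y ~ ., data = sim_train)  # includes all 50 noise features
mean((sim_test$y - predict(signal_fit, newdata = sim_test))^2)
mean((sim_test$y - predict(noisy_fit, newdata = sim_test))^2)   # typically larger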
For each fold \(k=1,2,\dots,K\)
Fit the model on data excluding observations in fold \(k\)
Obtain predictions for fold \(k\)
Compute performance measure (e.g. MSE) for fold \(k\)
Aggregate performance measure across folds
We can justify model choice (e.g. model selection/comparison, variable selection, tuning parameter selection) using any performance measure we want with CV
Why do we typically choose \(K=5\) or \(K=10\)? Think about: What’s the smallest and largest \(K\) could be? And what’s the trade-off?
Compare the following models for predicting the number of bikeshare rides
Model 1: includes only one predictor, feels-like temperature
Model 2: includes two predictors, feels-like temperature and season
Model 3: same as Model 2, but adding the interaction between feels-like temperature and season
Perform 5-fold cross-validation to assess model performance
Create a column of test fold assignments to our dataset with the sample() function, as sketched below
Always remember to set.seed() prior to performing cross-validation
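A minimal sketch of this setup, assuming 5 folds (the specific seed value is my own choice, not from the original code):

set.seed(100)  # assumed seed; any fixed value ensures reproducibility
N_FOLDS <- 5
bikes <- bikes |>
  mutate(fold = sample(rep(1:N_FOLDS, length.out = n())))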
Function to perform cross-validation
bikes_cv <- function(x) {
  # get test and training data
  test_data <- bikes |> filter(fold == x)
  train_data <- bikes |> filter(fold != x)
  # fit models to training data
  interaction_fit <- lm(rides ~ temp_feel * season, data = train_data)
  temp_season_fit <- lm(rides ~ temp_feel + season, data = train_data)
  temp_fit <- lm(rides ~ temp_feel, data = train_data)
  # return test results
  out <- tibble(
    interaction_pred = predict(interaction_fit, newdata = test_data),
    temp_season_pred = predict(temp_season_fit, newdata = test_data),
    temp_pred = predict(temp_fit, newdata = test_data),
    test_actual = test_data$rides,
    test_fold = x
  )
  return(out)
}
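The summary code that follows uses bikes_test_preds, which can be obtained by applying this function to every fold and stacking the results, for example:

bikes_test_preds <- map(1:5, bikes_cv) |>
  bind_rows()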
Compute RMSE across folds with standard error intervals
bikes_test_summary <- bikes_test_preds |>
  pivot_longer(interaction_pred:temp_pred, names_to = "model", values_to = "test_pred") |>
  group_by(model, test_fold) |>
  summarize(rmse = sqrt(mean((test_actual - test_pred)^2)))

bikes_test_summary |>
  group_by(model) |>
  summarize(avg_cv_rmse = mean(rmse),
            sd_rmse = sd(rmse),
            k = n()) |>
  mutate(se_rmse = sd_rmse / sqrt(k),
         lower_rmse = avg_cv_rmse - se_rmse,
         upper_rmse = avg_cv_rmse + se_rmse) |>
  select(model, avg_cv_rmse, lower_rmse, upper_rmse)
# A tibble: 3 × 4
model avg_cv_rmse lower_rmse upper_rmse
<chr> <dbl> <dbl> <dbl>
1 interaction_pred 1242. 1216. 1269.
2 temp_pred 1288. 1261. 1316.
3 temp_season_pred 1239. 1211. 1268.
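Optionally, one way to visualize these averages with one-standard-error bars (a sketch, not part of the original output):

bikes_test_summary |>
  group_by(model) |>
  summarize(avg_cv_rmse = mean(rmse),
            se_rmse = sd(rmse) / sqrt(n())) |>
  ggplot(aes(x = model, y = avg_cv_rmse)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = avg_cv_rmse - se_rmse,
                    ymax = avg_cv_rmse + se_rmse),
                width = 0.2) +
  coord_flip()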
Including season (together with temperature) clearly provides better performance
Including an interaction term between season and temperature is not worthwhile
(in fact, the performance is slightly worse compared to the model without interaction)
Think carefully about what the ideal data would be
Do you have the suitable data?
What is the appropriate data format for the modeling problem?
Write down different aspects you want your model to capture
Distribution of the response?
Different types of features?
Any dependence structure you want your model to capture?
…
Why is your model appropriate for the problem?
For simplicity and explainability purposes? (Why do you prefer a simpler model over a black box?)
For complexity and flexibility purposes? (Why do you prefer a black box over a simpler model?)
How do you evaluate your model?
Do you compare it with other models? Do you have a baseline model?
Do you compare it with alternative specifications (e.g. different set of predictors)?
How do you choose the predictors?
domain knowledge?
extensive EDA?
model assessment based on holdout predictions?
Don’t just say “we find a statistically significant association between Y and X…”
What do those coefficient estimates tell you? Interpret in context
How large/small is the effect size?
What’s the margin of error?
Did the model also adjust for other potential meaningful predictors?
This also applies to (\(k\)-fold) cross-validation results
Report the average of the performance measure (e.g. MSE, accuracy, etc.) across the \(k\) test sets, along with standard errors
In some cases, all \(k\) folds yield similar results, but in other cases, the results are vastly different across the folds
Without some indication of uncertainty, no one will have a clue how reliable your results are