Supervised learning: nonparametric regression


SURE 2025

Department of Statistics & Data Science
Carnegie Mellon University

Background

Recall: Model flexibility vs interpretability

  • Generally speaking: trade-off between a model’s flexibility (i.e. how “wiggly” it is) and how interpretable it is
  • Parametric methods (e.g. linear regression, GLMs): we make an assumption about the functional form, or shape, of \(f\) before observing the data
  • Nonparametric methods: we do not make explicit assumptions about the functional form of \(f\) (i.e., \(f\) is estimated from the data)
  • Parametric models are generally easier to interpret, whereas nonparametric models are more flexible

\(k\)-nearest neighbors (\(k\)-NN)

  • Perhaps the simplest nonparametric method…
  • Find the \(k\) data points closest to an observation \(x\), use these to make predictions

    • Need to use some measure of distance, e.g., Euclidean distance
  • Take the average value of the response over the \(k\) nearest neighbors

    • \(k\)-NN classification: most common class among the \(k\) nearest neighbors (“majority vote”)

    • \(k\)-NN regression: average of the response values of the \(k\) nearest neighbors (see the sketch after this list)

  • The number of neighbors \(k\) is a tuning parameter (like \(\lambda\) is for ridge/lasso)
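
As a concrete illustration, here is a minimal base R sketch of \(k\)-NN regression for a single new observation; the toy data, the knn_reg helper name, and the choice of k = 5 are all invented for this example.

# k-NN regression from scratch: predict at x0 by averaging the responses
# of the k training points closest to x0 (Euclidean distance)
knn_reg <- function(x_train, y_train, x0, k = 5) {
  dists <- sqrt(rowSums((x_train - matrix(x0, nrow(x_train), length(x0), byrow = TRUE))^2))
  nbrs <- order(dists)[1:k]  # indices of the k nearest neighbors
  mean(y_train[nbrs])        # average their responses
}

# toy data: y is a noisy nonlinear function of one predictor
set.seed(1)
x_train <- matrix(runif(200, 0, 10), ncol = 1)
y_train <- sin(x_train[, 1]) + rnorm(200, sd = 0.3)
knn_reg(x_train, y_train, x0 = 5, k = 5)  # predicted response at x = 5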

Finding the optimal number of neighbors \(k\)

Recall: bias-variance tradeoff

  • If \(k\) is too small, the resulting model is too flexible: low bias, high variance
  • If \(k\) is too large, the resulting model is not flexible enough: high bias, low variance
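
One simple way to navigate this tradeoff is to compare held-out error across several candidate values of \(k\). The sketch below does this with simulated data; the train/test split, the candidate values of \(k\), and the knn_pred helper are all arbitrary choices for illustration.

# compare held-out mean squared error of k-NN regression across values of k
set.seed(2)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)
train_idx <- sample(300, 200)

knn_pred <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(pt) {
    nbrs <- order(abs(x_train - pt))[1:k]  # k closest training points
    mean(y_train[nbrs])
  })
}

ks <- c(1, 5, 25, 100)
test_mse <- sapply(ks, function(k) {
  preds <- knn_pred(x[train_idx], y[train_idx], x[-train_idx], k)
  mean((y[-train_idx] - preds)^2)
})
data.frame(k = ks, test_mse)  # small k: high variance; large k: high bias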

Moving beyond linearity

  • The truth is (almost) never linear
  • But often the linearity assumption is good enough
  • What if it’s not linear?

    • splines

    • local regression

    • generalized additive models

    • wavelets

  • The above methods can offer a lot of flexibility, without losing the ease and interpretability of linear models

Generalized additive models (GAMs)


Generalized additive models (GAMs)

  • Previously: generalized linear models (GLMs) (e.g., linear regression, logistic regression, Poisson regression)
  • GLM generalizes linear regression to permit non-normal distributions and link functions of the mean \[g(\mathbb E[Y \mid X = x]) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\]
  • What if we replace the linear predictor by additive smooth functions of the explanatory variables?

  • Enter generalized additive models (GAMs)

    • relax the restriction that the relationship must be a simple weighted sum

    • instead assume that the response can be modeled by a sum of arbitrary functions of each predictor

Generalized additive models (GAMs)

  • Allows for flexible nonlinearities in several variables
  • But retains the additive structure of linear models

\[g(\mathbb E[Y \mid X = x]) = \beta_0 + s_1(x_1) + s_2(x_2) + \dots + s_p(x_p)\]

  • Like GLMs, a GAM specifies a link function \(g()\) and a probability distribution for the response \(Y\)
  • \(s_j\) is some smooth function of predictor \(j\)
  • GLM is the special case of GAM in which each \(s_j\) is a linear function

Why use smooth functions rather than just coefficients?

  • Assume the relationship between each individual predictor and the response is smooth
  • Estimate these smooth relationships simultaneously, then add them up to predict the response

  • GAMs have the advantage over GLMs of greater flexibility: each covariate can be modeled flexibly, since covariates and response are not necessarily linearly related
  • A disadvantage of GAMs/smoothing methods is the loss of interpretability

    • How do we interpret the effect of a predictor on the response?

    • How do we perform inferences for those effects?

Generalized additive models (GAMs)

Toy example: is this linear?

Toy example: linear regression fit

Toy example: GAM fit
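
As a sketch of what such a toy example might look like, the code below simulates clearly nonlinear data (an arbitrary sine curve plus noise of my own choosing) and overlays a linear regression fit and a GAM fit.

# simulate clearly nonlinear data, then compare a linear fit with a GAM fit
library(tidyverse)
library(mgcv)
set.seed(3)
toy <- tibble(x = runif(200, 0, 2 * pi),
              y = sin(x) + rnorm(200, sd = 0.3))

toy_lm <- lm(y ~ x, data = toy)                        # straight-line fit
toy_gam <- gam(y ~ s(x), data = toy, method = "REML")  # smooth fit

toy |> 
  mutate(lm_fit = fitted(toy_lm), gam_fit = fitted(toy_gam)) |> 
  ggplot(aes(x = x)) +
  geom_point(aes(y = y), alpha = 0.3) +
  geom_line(aes(y = lm_fit), color = "blue") +
  geom_line(aes(y = gam_fit), color = "red")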

Splines

  • A common way to smooth and learn nonlinear functions is to use splines
  • Splines can be used to approximate other, more complex functions
  • Splines are functions constructed from a set of simpler building blocks called basis functions (usually polynomials); together, these basis functions form a basis
  • Each basis function \(b_k\) has a coefficient \(\beta_k\); the spline is just the weighted sum of the basis functions

\[\displaystyle s(x) = \sum_{k=1}^K \beta_k b_k(x)\]

Weight basis functions to obtain spline
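
To make the weighted-sum idea concrete, here is a small sketch using the B-spline basis from the base splines package; the number of basis functions (6) and the coefficients are made up purely for illustration.

# build a B-spline basis over x and form a spline as a weighted sum of its columns
library(splines)
x <- seq(0, 10, length.out = 200)
B <- bs(x, df = 6)               # 200 x 6 matrix: one column per basis function b_k(x)
beta <- c(1, -2, 3, 0.5, -1, 2)  # made-up coefficients beta_k
s_x <- B %*% beta                # s(x) = sum_k beta_k * b_k(x)

matplot(x, B, type = "l", lty = 1, ylim = range(B, s_x))  # individual basis functions
lines(x, s_x, lwd = 3)                                    # the resulting spline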

Why not just a polynomial fit?

\[y_i \sim \mathcal{N}(\mu_i, \sigma^2)\] \[\mu_i = \beta_0 + \beta_1 x_{i} + \beta_2 x^2_{i} + \cdots + \beta_j x^j_{i}\]

  • In the polynomial models we use a polynomial basis expansion of \(x\)

    • \(x^0 = 1\) (constant term), \(x, x^2, x^3, x^4, \dots\)
  • We can keep on adding ever more powers of \(x\) to the model (hmm…)
  • Runge phenomenon: oscillations at the edges of an interval—using higher-order polynomials doesn’t always improve accuracy
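
As a quick, hedged illustration of the Runge phenomenon, the sketch below fits polynomials of two arbitrary degrees to equally spaced points from Runge's function \(1/(1 + 25x^2)\) and shows the edge oscillations of the higher-degree fit.

# Runge phenomenon: high-degree polynomial fits oscillate near the interval edges
runge <- function(x) 1 / (1 + 25 * x^2)
x <- seq(-1, 1, length.out = 21)   # equally spaced design points
y <- runge(x)
x_grid <- seq(-1, 1, length.out = 500)

fit_poly <- function(degree) {
  fit <- lm(y ~ poly(x, degree))
  predict(fit, newdata = data.frame(x = x_grid))
}

plot(x_grid, fit_poly(15), type = "l", col = "red")  # large oscillations at the edges
lines(x_grid, fit_poly(5), col = "blue")             # milder wiggle
lines(x_grid, runge(x_grid), lwd = 2)                # true function
points(x, y)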

Smoothing splines

Use a smooth function \(s(x)\) to predict \(y\), and control its smoothness directly by minimizing the spline objective function

\[\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int s''(x)^2dx\]

\[= \text{fit data} + \text{impose smoothness}\]

\[\longrightarrow \text{model fit} = \text{likelihood} - \lambda \cdot \text{wiggliness}\]

Smoothing splines

  • The most commonly considered case: cubic smoothing splines

\[\text{minimize }\sum_{i=1}^n (y_i - s(x_i))^2 + \lambda \int s''(x)^2dx\]

  • First term: RSS, tries to make \(s(x)\) fit the data at each \(x_i\)
  • Second term: roughness penalty, controls how wiggly \(s(x)\) is via tuning parameter \(\lambda \ge 0\)

    • Balances the accuracy of the fit and the flexibility of the function

    • The smaller \(\lambda\), the more wiggly the function

    • As \(\lambda \rightarrow \infty\), \(s(x)\) becomes linear
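
To see the effect of \(\lambda\) directly, base R's smooth.spline() accepts a fixed value of the penalty; the simulated data and the two \(\lambda\) values below are arbitrary, chosen only to contrast a wiggly fit with a nearly linear one.

# effect of the smoothing parameter lambda in a cubic smoothing spline
set.seed(4)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit_wiggly <- smooth.spline(x, y, lambda = 1e-6)  # small lambda: wiggly fit
fit_stiff <- smooth.spline(x, y, lambda = 10)     # large lambda: approaches a linear fit
fit_auto <- smooth.spline(x, y)                   # lambda chosen by generalized cross-validation

plot(x, y, col = "grey")
lines(fit_wiggly, col = "red")
lines(fit_stiff, col = "blue")
lines(fit_auto, lwd = 2)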

Smoothing splines

Goal: Estimate the smoothing spline \(\hat{s}(x)\) that balances the tradeoff between the model fit and wiggliness

GAM recap

  • A framework to model flexible nonlinear relationships

  • Use little functions (basis functions) to make big functions (smooths)

  • The predictor matrix has columns for each basis function, evaluated at each observation

  • Use a penalty to trade off model fit and wiggliness

  • Need to make sure the smooths are wiggly enough

Examples

Predicting MLB HR probability

library(tidyverse)
savant <- read_csv("https://raw.githubusercontent.com/36-SURE/2025/main/data/savant.csv")
batted_balls <- savant |> 
  filter(type == "X") |> 
  mutate(is_hr = as.numeric(events == "home_run")) |> 
  filter(!is.na(launch_angle), !is.na(launch_speed), !is.na(is_hr))
# glimpse(batted_balls)

Predicting HRs with launch angle and exit velocity

HRs are relatively rare and confined to one area of this plot

batted_balls |> 
  ggplot(aes(x = launch_speed, 
             y = launch_angle, 
             color = as.factor(is_hr))) +
  geom_point(alpha = 0.2)

Predicting HRs with launch angle and exit velocity

There is a sweet spot of launch_angle (mid-way ish) and launch_speed (relatively high) where almost all HRs occur

batted_balls |> 
  group_by(
    # round launch angle and speed to the nearest 5 units (degrees and mph)
    launch_angle_bucket = round(launch_angle * 2, -1) / 2,
    launch_speed_bucket = round(launch_speed * 2, -1) / 2
  ) |> 
  summarize(hr = sum(is_hr == 1),
            n = n()) |> 
  ungroup() |> 
  mutate(pct_hr = hr / n) |> 
  ggplot(aes(x = launch_speed_bucket, 
             y = launch_angle_bucket, 
             fill = pct_hr)) +
  geom_tile() +
  scale_fill_gradient2(labels = scales::percent_format(),
                       low = "blue", 
                       high = "red", 
                       midpoint = 0.2)

Fitting GAMs with mgcv

  • Modeling the log-odds of a home run using non-linear functions of launch_speed and launch_angle \[ \log \left( \frac{p_{\texttt{is_hr}}}{1 - p_\texttt{is_hr}} \right) = \beta_0 + s_1 (\texttt{launch_speed}) + s_2 (\texttt{launch_angle})\] where \(p_\texttt{is_hr}\) is the probability of a home run
  • Fit the model with smoothing spline functions s()
library(mgcv)
hr_gam <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
              family = binomial,
              method = "REML", # more stable solution than default
              data = batted_balls)

View model summary

  • mgcv performs hypothesis tests for the smooth terms — these are roughly the equivalent of an \(F\)-test for dropping each term

  • Effective degrees of freedom (edf): basically the number of free parameters required to represent the function

  • In a smoothing spline, different choices of \(\lambda\) correspond to different values of the edf, representing different amounts of smoothness

# summary(hr_gam)
library(broom)
# glance(hr_gam)
tidy(hr_gam)
# A tibble: 2 × 5
  term              edf ref.df statistic p.value
  <chr>           <dbl>  <dbl>     <dbl>   <dbl>
1 s(launch_speed)  1.00   1.00      804.       0
2 s(launch_angle)  3.73   3.98      468.       0
tidy(hr_gam, parametric = TRUE)
# A tibble: 1 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    -38.3      12.3     -3.11 0.00189
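
As a hedged side illustration of the edf–\(\lambda\) link, mgcv's sp argument fixes the smoothing parameters; forcing them to be very large should shrink each smooth toward a linear effect with edf near 1 (the sp values below are arbitrary).

# refit with very large (fixed) smoothing parameters: each smooth is penalized
# toward its linear null space, so the reported edf should drop toward 1
hr_gam_stiff <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                    family = binomial,
                    sp = c(1e6, 1e6),  # one fixed smoothing parameter per smooth term
                    data = batted_balls)
tidy(hr_gam_stiff)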

Visualize partial response functions

Display the partial effect of each term in the model and how they add up to the overall prediction

library(gratia)
draw(hr_gam)

Model comparison: GAM vs GLM (logistic regression)

  • 10-fold cross-validation: assigning games to folds
set.seed(25)
N_FOLDS <- 10
games_folds <- batted_balls |> 
  distinct(game_date, home_team, away_team) |>
  mutate(fold = sample(rep(1:N_FOLDS, length.out = n())))

batted_balls <- batted_balls |> 
  left_join(games_folds, by = join_by(game_date, home_team, away_team))

Cross-validation

hr_cv <- function(x) {
  hr_train <- batted_balls |> filter(fold != x)
  hr_test <- batted_balls |> filter(fold == x)
  
  gam_fit <- gam(is_hr ~ s(launch_speed) + s(launch_angle),
                 family = binomial,
                 method = "REML",
                 data = hr_train)
  logit_fit <- glm(is_hr ~ launch_speed + launch_angle, 
                   family = binomial, data = hr_train)
  
  hr_out <- tibble(
    gam_pred = predict(gam_fit, newdata = hr_test, type = "response"),
    logit_pred = predict(logit_fit, newdata = hr_test, type = "response"),
    test_actual = hr_test$is_hr,
    test_fold = x
  )
  return(hr_out)  
}

hr_preds <- map(1:N_FOLDS, hr_cv) |> 
  list_rbind()

Out-of-sample comparison

  • There are very few situations in practice where GLMs (with a linear predictor) outperform an additive model using smooth functions

  • Especially since smooth functions can simply recover a linear fit when the relationship really is linear

hr_preds |> 
  pivot_longer(cols = gam_pred:logit_pred, 
               names_to = "model", 
               values_to = "test_pred") |> 
  mutate(test_pred_class = round(test_pred)) |> 
  group_by(model, test_fold) |>
  summarize(accuracy = mean(test_pred_class == test_actual)) |> 
  group_by(model) |> 
  summarize(cv_accuracy = mean(accuracy),
            se_accuracy = sd(accuracy) / sqrt(N_FOLDS))
# A tibble: 2 × 3
  model      cv_accuracy se_accuracy
  <chr>            <dbl>       <dbl>
1 gam_pred         0.973     0.00105
2 logit_pred       0.959     0.00148

Appendix: code to build dataset

# pip install pybaseball
# dictionary: https://baseballsavant.mlb.com/csv-docs
import pandas as pd
from pybaseball import statcast
savant = statcast(start_dt='2025-05-01', end_dt='2025-05-31')
savant.to_csv('savant.csv', index=False)