Supervised learning: generalized linear models


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Recap

  • When the response is continuous, we can use linear regression

  • When the response is binary categorical, we can use logistic regression

  • Which model do we use if the response contains discrete counts?

Generalized linear models (GLMs)

Three components of a GLM

1. Random component (distribution of the response)

  • Identifies the response variable and assumes a probability distribution for it

2. Linear predictor

  • Specifies the predictor variables

  • Linear combination of predictors on the right-hand side of the model equation

  • In the form \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)

3. Link function (a function of the parameter)

  • Connects the random component with the linear predictor

  • Via a function of the expected value of the probability distribution of the response

See also: Exponential family

Linear regression

  1. \(Y \sim N(\mu, \sigma^2)\)

  2. Linear predictor: \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)

  3. Identity link function: \(g(\mu) = \mu\)

Logistic regression

  1. \(Y \sim \text{Binomial}(n,p)\)

  2. Linear predictor: \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)

  3. Logit link function: \(\displaystyle g(p) = \log\left(\frac{p}{1-p}\right)\)

(Recall: we cannot simply model \(p = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\), since the linear predictor can take any real value while the probability \(p\) must lie between 0 and 1)

Poisson regression

  1. \(Y \sim \text{Poisson}(\lambda)\)

  2. Linear predictor: \(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\)

  3. Log link function: \(\displaystyle g(\lambda) = \log (\lambda)\)

(Likewise, we cannot simply model \(\lambda = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\), since the linear predictor can be negative while the Poisson rate \(\lambda\) must be positive)
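Inverting the link functions makes this explicit: for any real value of the linear predictor \(\eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\),

\[p = \frac{e^{\eta}}{1 + e^{\eta}} \in (0, 1) \qquad \text{and} \qquad \lambda = e^{\eta} > 0,\]

so the fitted probability and the fitted rate always land in their valid ranges.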

Generalization example

In linear regression, the distribution is normal and the domain of \(Y \mid x\) is \((-\infty,\infty)\)

What, however, happens if we know that

  • the domain of observed values of the response is actually \([0, \infty)\)

  • the observed values are discrete, with possible values \(0, 1, 2, \dots\)

The normal distribution doesn’t hold here

  • Which distribution could possibly govern \(Y \mid x\)?

  • Remember, we might not know truly how \(Y \mid x\) is distributed, but any assumption we make has to fit with the nature of the data

Generalization: Poisson regression

  • PMF of a Poisson distribution: \(\displaystyle f(x \mid \lambda)= \frac{ e^{-\lambda} \lambda^x}{x!}\); \(x = 0, 1, 2, \dots\) and \(\lambda > 0\)
  • Model for the counts of an event in a fixed period of time or space, with a rate of occurrence parameter \(\lambda\)

  • \(\lambda\) is both the mean and the variance of the distribution

    • in general, the variance governs the distribution’s shape
  • distribution of independent event occurrences in an interval, e.g. soccer goals in a match

  • \(\lambda\) is the average number of the events in an interval
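A quick sanity check of the equidispersion property via simulation (a sketch; the rate 4.2 and the number of draws are arbitrary choices):

```r
# simulate many Poisson draws and compare the sample mean and variance
set.seed(36)
x <- rpois(100000, lambda = 4.2)
mean(x)  # both should be close to 4.2
var(x)
```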

So, when we apply a GLM in this context, we would identify the family as Poisson

But there’s another step in generalization…

More distributions

Gamma distribution

  • \(Y \mid x\) continuous, but bounded between 0 and \(\infty\)

More distributions

Beta distribution

  • \(Y \mid x\) continuous, but bounded between 0 and 1

More distributions

Bernoulli distribution

  • \(Y \mid x\) discrete, but can only take on the values 0 and 1

Modeling count data

Goals scored across men’s soccer matches in the five biggest European leagues during the 2023-2024 season, via the worldfootballR package

library(tidyverse)
theme_set(theme_light())
goals <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/goals.csv")
glimpse(goals)
Rows: 3,504
Columns: 7
$ n_goals  <dbl> 0, 2, 0, 0, 4, 1, 5, 2, 1, 1, 1, 0, 2, 5, 2, 0, 3, 1, 3, 0, 0…
$ xG       <dbl> 0.3, 0.8, 2.7, 0.5, 4.0, 1.3, 3.3, 2.2, 1.4, 2.2, 0.6, 1.9, 1…
$ off_team <chr> "Burnley", "Arsenal", "Everton", "Sheffield Utd", "Brighton",…
$ def_team <chr> "Manchester City", "Nott'ham Forest", "Fulham", "Crystal Pala…
$ league   <chr> "ENG", "ENG", "ENG", "ENG", "ENG", "ENG", "ENG", "ENG", "ENG"…
$ match_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ is_home  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

EDA: Distribution for the number of goals

goals |> 
  count(n_goals) |> 
  ggplot(aes(n_goals, n)) +
  geom_col(width = 0.5) +
  scale_x_continuous(breaks = 0:8)

EDA: Distribution for the number of goals by league

goals |> 
  count(n_goals, league) |> 
  ggplot(aes(n_goals, n)) +
  geom_col(width = 0.5) +
  scale_x_continuous(breaks = 0:8) +
  facet_wrap(~ league, nrow = 1)

EDA: Distribution for the number of goals by home/away team

goals |> 
  count(n_goals, is_home) |> 
  mutate(is_home = factor(is_home)) |> 
  ggplot(aes(n_goals, n, fill = is_home, group = is_home)) +
  geom_col(position = "dodge", width = 0.5) +
  scale_x_continuous(breaks = 0:8)

Fitting a Poisson regression model

goals_poisson <- glm(n_goals ~ league + is_home, 
                     family = "poisson",
                     data = goals)
# summary(goals_poisson)
library(broom)
tidy(goals_poisson)
# A tibble: 6 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   0.385     0.0323    11.9   1.18e-32
2 leagueESP    -0.215     0.0424    -5.07  3.98e- 7
3 leagueFRA    -0.195     0.0449    -4.34  1.46e- 5
4 leagueGER    -0.0185    0.0426    -0.433 6.65e- 1
5 leagueITA    -0.228     0.0426    -5.36  8.44e- 8
6 is_home       0.208     0.0283     7.36  1.87e-13

What observations in the dataset correspond to predictions using only the intercept?

Fitting a Poisson regression model

tidy(goals_poisson, exponentiate = TRUE)
# A tibble: 6 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1.47     0.0323    11.9   1.18e-32
2 leagueESP      0.807    0.0424    -5.07  3.98e- 7
3 leagueFRA      0.823    0.0449    -4.34  1.46e- 5
4 leagueGER      0.982    0.0426    -0.433 6.65e- 1
5 leagueITA      0.796    0.0426    -5.36  8.44e- 8
6 is_home        1.23     0.0283     7.36  1.87e-13

We have sufficient evidence to suggest the expected number of goals is different between ENG teams and each of ESP, FRA, and ITA, but not GER, after accounting for home field advantage

The expected number of goals changes (is multiplied) by a factor of 1.23 if the team is home in comparison to away teams, after accounting for league
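This multiplicative interpretation comes directly from exponentiating the `is_home` coefficient from the table above:

```r
# log link: going from away (0) to home (1) multiplies the expected count by exp(0.208)
exp(0.208)  # approximately 1.23
```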

Model residuals and goodness-of-fit measures

  • Pearson residuals: \(\displaystyle r_i = \frac{y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\)
  • Deviance residuals: \(\displaystyle d_i = \textrm{sign}(y_i-\hat{\lambda}_i) \sqrt{2 \left[y_i \log\left(\frac{y_i}{\hat{\lambda}_i}\right) -(y_i - \hat{\lambda}_i) \right]}\)
residuals(goals_poisson, type = "pearson")
residuals(goals_poisson, type = "deviance")
  • Model summaries
glance(goals_poisson)
# A tibble: 1 × 8
  null.deviance df.null logLik    AIC    BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl>  <dbl>  <dbl>    <dbl>       <int> <int>
1         4379.    3503 -5363. 10738. 10775.    4272.        3498  3504

As before, we prefer assessment in terms of out-of-sample/test-set performance over these in-sample measures
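The Pearson residual formula above can be verified by hand against `residuals()` (a sketch on simulated data; the covariate, sample size, and coefficients are arbitrary):

```r
# fit a Poisson GLM on simulated counts
set.seed(36)
sim <- data.frame(x = runif(200))
sim$y <- rpois(200, lambda = exp(0.5 + sim$x))
fit <- glm(y ~ x, family = "poisson", data = sim)
# Pearson residuals computed manually: (y - lambda_hat) / sqrt(lambda_hat)
lambda_hat <- fitted(fit)
r_manual <- (sim$y - lambda_hat) / sqrt(lambda_hat)
all.equal(unname(residuals(fit, type = "pearson")), unname(r_manual))  # TRUE
```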

Overdispersion

  • Overdispersion is a common problem faced in Poisson regression when the variance is larger than what is assumed under the model

  • Recall that a Poisson distribution assumes equidispersion (i.e. the mean and variance are equal)

var(goals$n_goals)
[1] 1.58617
mean(goals$n_goals)
[1] 1.442352
  • The variance is only slightly larger than the mean, so overdispersion does not appear to be a concern here

Quasipoisson regression

In the case of overdispersion (the variance is larger than expected), the coefficient variance estimates should be inflated by the dispersion parameter \[\displaystyle \varphi = \frac{\sum_{i} r_i^2}{n - p},\] where \(p\) is the number of model parameters, and \(r_i\) is the Pearson residual for the \(i\)th observation

We can then multiply each coefficient standard error by \(\sqrt{\varphi}\)

sum(residuals(goals_poisson, type = "pearson") ^ 2) / df.residual(goals_poisson)
[1] 1.070286
goals_qp <- glm(n_goals ~ league + is_home, 
                family = "quasipoisson",
                data = goals)
# summary(goals_qp)
# goals_qp |> 
#   glance() |> 
#   summarize(p_val = pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
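The scaling can be checked directly: quasi-Poisson standard errors equal the Poisson standard errors multiplied by \(\sqrt{\varphi}\) (a sketch on simulated overdispersed data; the data-generating values are arbitrary):

```r
# generate overdispersed counts from a negative binomial
set.seed(36)
d <- data.frame(x = rnorm(300))
d$y <- rnbinom(300, mu = exp(1 + 0.3 * d$x), size = 2)
fit_p  <- glm(y ~ x, family = "poisson", data = d)
fit_qp <- glm(y ~ x, family = "quasipoisson", data = d)
# dispersion estimate: Pearson chi-square / residual df
phi <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)
all.equal(sqrt(diag(vcov(fit_p))) * sqrt(phi), sqrt(diag(vcov(fit_qp))))  # TRUE
```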

Negative binomial regression

Key idea: introduces another parameter such that the variance can exceed the mean

We have \(E(Y) = \lambda\) and \(\text{Var}(Y)=\lambda(1+\alpha \lambda)\)

  • The variance still increases with \(\lambda\), but now also depends on the dispersion parameter \(\alpha \ge 0\)
  • The further \(\alpha\) falls above 0, the greater the overdispersion relative to Poisson variability
  • As \(\alpha\) decreases toward 0, the variance decreases toward \(\lambda\), and the model converges to a Poisson distribution

See here for more information
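This mean–variance relationship can be illustrated by simulation; in R's `rnbinom`, the `size` parameter corresponds to \(1/\alpha\), so \(\text{Var}(Y) = \lambda(1 + \alpha \lambda)\) (the chosen \(\lambda\) and \(\alpha\) are arbitrary):

```r
# simulate negative binomial draws with lambda = 3, alpha = 0.5
set.seed(36)
lambda <- 3
alpha <- 0.5
y <- rnbinom(200000, mu = lambda, size = 1 / alpha)
mean(y)  # close to lambda = 3
var(y)   # close to lambda * (1 + alpha * lambda) = 7.5
```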

Negative binomial regression

goals_nb <- MASS::glm.nb(n_goals ~ league + is_home, data = goals)
summary(goals_nb)

Call:
MASS::glm.nb(formula = n_goals ~ league + is_home, data = goals, 
    init.theta = 21.53331101, link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.38498    0.03348  11.500  < 2e-16 ***
leagueESP   -0.21511    0.04383  -4.908 9.22e-07 ***
leagueFRA   -0.19437    0.04638  -4.190 2.78e-05 ***
leagueGER   -0.01865    0.04423  -0.422    0.673    
leagueITA   -0.22798    0.04398  -5.184 2.17e-07 ***
is_home      0.20807    0.02922   7.122 1.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(21.5333) family taken to be 1)

    Null deviance: 4141.0  on 3503  degrees of freedom
Residual deviance: 4040.6  on 3498  degrees of freedom
AIC: 10732

Number of Fisher Scoring iterations: 1

              Theta:  21.53 
          Std. Err.:  8.31 

 2 x log-likelihood:  -10718.22 

Zero-inflated Poisson regression

  • One common problem in modeling count data is the prevalence of zeros

  • While the Poisson distribution allows for zero counts, problems arise when many more zeros are observed than a Poisson model would predict

  • A zero-inflated Poisson (ZIP) model is a mixture model with two components

    • Logistic regression model to predict whether the response will be zero (whether 0 goals are observed) \[P(Y = 0) = \pi + (1 - \pi)e^{-\lambda}\]

    • Poisson regression model for non-zero counts (non-zero goal values) \[P(Y = y) = (1 - \pi) \frac{\lambda^y e^{-\lambda}}{y!}, \text{ for } y = 1, 2, 3, \dots\]
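The mixture structure can be checked by simulating ZIP draws and comparing the observed proportion of zeros to \(\pi + (1 - \pi)e^{-\lambda}\) (a sketch; the chosen \(\pi\) and \(\lambda\) are arbitrary):

```r
# simulate from a ZIP: structural zero with probability pi0, otherwise a Poisson draw
set.seed(36)
n <- 200000
pi0 <- 0.2
lambda <- 1.5
y <- ifelse(rbinom(n, 1, pi0) == 1, 0, rpois(n, lambda))
mean(y == 0)                    # observed proportion of zeros
pi0 + (1 - pi0) * exp(-lambda)  # theoretical proportion, about 0.379
```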

Zero-inflated Poisson (ZIP) regression

library(pscl) # install.packages("pscl")
goals_zip <- zeroinfl(n_goals ~ league + is_home, data = goals)
summary(goals_zip)

Call:
zeroinfl(formula = n_goals ~ league + is_home, data = goals)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.3289 -1.0266 -0.1845  0.7004  5.2685 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.41532    0.04103  10.123  < 2e-16 ***
leagueESP   -0.17010    0.05467  -3.111  0.00186 ** 
leagueFRA   -0.18004    0.05792  -3.108  0.00188 ** 
leagueGER   -0.01636    0.05351  -0.306  0.75985    
leagueITA   -0.22493    0.05737  -3.921 8.82e-05 ***
is_home      0.18472    0.03752   4.923 8.53e-07 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -3.6716     1.2681  -2.895  0.00379 **
leagueESP     1.3000     1.3596   0.956  0.33902   
leagueFRA     0.6130     1.6018   0.383  0.70195   
leagueGER     0.1166     1.7986   0.065  0.94831   
leagueITA     0.1689     2.0463   0.083  0.93421   
is_home      -0.8116     0.9188  -0.883  0.37709   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 40 
Log-likelihood: -5359 on 12 Df

Model selection: Poisson vs ZIP

set.seed(2008)
n_folds <- 5
goals_folds <- goals |> 
  distinct(match_id) |>
  mutate(match_fold = sample(rep(1:n_folds, length.out = n())))

# goals_folds |> count(match_fold)

goals <- goals |> 
  left_join(goals_folds, by = "match_id")

Cross-validation

poisson_cv <- function(k) {

  train_goals <- goals |> 
    filter(match_fold != k)
  test_goals <- goals |> 
    filter(match_fold == k)

  poisson_fit <- glm(n_goals ~ league + is_home, family = "poisson", data = train_goals)
  poisson_test_preds <- predict(poisson_fit, newdata = test_goals, type = "response")
  
  # return a table with RMSE and fold id
  poisson_out <- tibble(rmse = sqrt(mean((test_goals$n_goals - poisson_test_preds) ^ 2)), 
                        match_fold = k)
  return(poisson_out)
}

poisson_cv_results <- map(1:n_folds, poisson_cv) |> 
  list_rbind()

Cross-validation

zip_cv <- function(k) {

  train_goals <- goals |> 
    filter(match_fold != k)
  test_goals <- goals |> 
    filter(match_fold == k)
  
  zip_fit <- zeroinfl(n_goals ~ league + is_home, data = train_goals)
  zip_test_preds <- predict(zip_fit, newdata = test_goals)
  
  zip_out <- tibble(rmse = sqrt(mean((test_goals$n_goals - zip_test_preds)^2)),
                    match_fold = k)
  return(zip_out)
}

zip_cv_results <- map(1:n_folds, zip_cv) |> 
  list_rbind()

Out-of-sample comparison

Which model would you select?

poisson_cv_results |> 
  summarize(avg_rmse = mean(rmse))
# A tibble: 1 × 1
  avg_rmse
     <dbl>
1     1.24
zip_cv_results |> 
  summarize(avg_rmse = mean(rmse))
# A tibble: 1 × 1
  avg_rmse
     <dbl>
1     1.24

Remember: Occam’s razor

Resources

Appendix: Code to build dataset

library(worldfootballR)
library(tidyverse)
big5 <- fb_match_results(country = c("ENG", "ESP", "ITA", "GER", "FRA"),
                         gender = "M", 
                         season_end_year = 2024, 
                         tier = "1st")
big5 <- big5 |>
  mutate(match_id = row_number()) |>  # add unique id for each row
  filter(!is.na(Wk))
  
home_goals <- big5 |>
  select(n_goals = HomeGoals, xG = Home_xG, off_team = Home, def_team = Away, league = Country, match_id) |>
  mutate(is_home = 1)

away_goals <- big5 |>
  select(n_goals = AwayGoals, xG = Away_xG, off_team = Away, def_team = Home, league = Country, match_id) |>
  mutate(is_home = 0)

goals <- home_goals |>
  bind_rows(away_goals)
  
# write_csv(goals, "goals.csv")