Supervised learning: logistic regression


SURE 2025

Department of Statistics & Data Science
Carnegie Mellon University

Background

Classification

  • Categorical variables take values in an unordered set of categories
  • Classification: predicting a categorical response \(Y\) for an observation (since it involves assigning the observation to a category/class)
  • Often, we are more interested in estimating the probability for each category of the response \(Y\)
  • Classification is different from clustering, since we know the true label \(Y\)
  • We will first focus on binary categorical response (categorical response variables with only two categories)

    • Examples: whether it will rain or not tomorrow, whether the kicker makes or misses a field goal attempt

Logistic regression

  • Review: in linear regression, the response variable \(Y\) is quantitative
  • Logistic regression is used to model binary categorical outcomes (i.e. \(Y \in \{0, 1\}\))
  • Similar concepts from linear regression

    • tests for coefficients
    • indicator variables
    • interactions
    • model building procedure
    • regularization
  • Other concepts are different

    • interpretation of coefficients
    • residuals are not normally distributed
    • model evaluation

Why not linear regression for categorical data?

Linear regression: predicted probabilities might be outside the \([0, 1]\) interval

Logistic regression: ensures all probabilities lie between 0 and 1
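
A quick simulated check (not part of the lecture example): fitting a linear model and a logistic model to the same binary outcome shows that only the latter keeps its fitted probabilities inside \([0, 1]\).

set.seed(42)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # simulated binary response
range(fitted(lm(y ~ x)))                       # linear fit can stray outside [0, 1]
range(fitted(glm(y ~ x, family = binomial)))   # logistic fit stays within (0, 1)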

Logistic regression

  • Logistic regression models the probability of success \(p(x)\) of a binary categorical response variable \(Y \in \{0, 1\}\)
  • To constrain the predicted probabilities to \([0, 1]\), we model the log-odds of success using the logit function

\[ \log \left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]

or equivalently,

\[ p(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}} \]

  • Note: \(p(x)= P(Y = 1 \mid X = x) = \mathbb E[Y \mid X = x]\)
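
In base R, these two mappings are implemented by qlogis() (the logit, i.e., the log-odds) and plogis() (the inverse logit). A quick check:

p <- 0.75
qlogis(p)          # log-odds: log(0.75 / 0.25)
plogis(qlogis(p))  # back-transforming recovers the probability, 0.75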

Maximum likelihood estimation (MLE)

  • Likelihood function: how likely to observe data for given parameter values \[L(\theta) = \prod_{i=1}^n f(y_i; \theta)\]
  • Goal: find values for parameters that maximize likelihood function (i.e., MLEs)
  • Why: likelihood methods can handle variety of data types and assumptions
  • How: typically numerical approaches and optimization techniques

    • Easier to work with log-likelihood (sums compared to products)

Example: linear regression MLE

Reminder: probability density function (pdf) for a normal distribution \[f(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{\frac{-(y - \mu) ^ 2}{2 \sigma^2}}\]

Consider a simple linear regression model (with an intercept and one predictor) \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]

Then \(\qquad \displaystyle L(\beta_0, \beta_1, \sigma) = \prod_{i=1}^n f(y_i \mid x_i; \beta_0, \beta_1, \sigma) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} e^{\frac{-(y_i - \beta_0 - \beta_1 x_i) ^ 2}{2 \sigma^2}}\)

And \(\qquad \displaystyle \log L(\beta_0, \beta_1, \sigma) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\)

We can then use optimization methods (e.g., simple calculus) to find the optimal parameter values.
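
As a sanity check, here is a minimal sketch (with simulated data, not part of the lecture example) that maximizes this log-likelihood numerically with optim() and compares the answer to lm():

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
neg_log_lik <- function(theta) {
  # theta = (beta0, beta1, log_sigma); the log scale keeps sigma positive
  -sum(dnorm(y, mean = theta[1] + theta[2] * x, sd = exp(theta[3]), log = TRUE))
}
optim(c(0, 0, 0), neg_log_lik)$par  # approximately (2, 3, log(0.5))
coef(lm(y ~ x))                     # least squares gives the same beta estimates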

For more details, see here

Major difference between linear and logistic regression

Logistic regression involves numerical optimization

  • \(y_i\) is observed response for \(n\) observations, either 0 or 1
  • We need to use an iterative algorithm to find \(\beta\)’s that maximize the likelihood \[\prod_{i=1}^{n} p\left(x_{i}\right)^{y_{i}}\left(1-p\left(x_{i}\right)\right)^{1-y_{i}}\]
  • Newton’s method: start with an initial guess, then repeatedly update the parameters using the gradient (and curvature) of the log-likelihood, moving up the log-likelihood surface (a bare-bones sketch follows this list)
  • This means logistic regression runs more slowly than linear regression
  • Another approach (if you’re interested): iteratively re-weighted least squares, see here
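
Here is a bare-bones sketch of Newton’s method for logistic regression on simulated data (an illustration only; glm() itself uses iteratively re-weighted least squares):

set.seed(2)
X <- cbind(1, rnorm(100))                        # design matrix with an intercept column
y <- rbinom(100, 1, plogis(X %*% c(-0.5, 1)))    # simulated binary response
beta <- c(0, 0)                                  # initial guess
for (iter in 1:25) {
  p <- plogis(X %*% beta)                        # current fitted probabilities
  grad <- t(X) %*% (y - p)                       # gradient of the log-likelihood
  hess <- -t(X) %*% (X * as.numeric(p * (1 - p)))  # Hessian of the log-likelihood
  beta <- beta - solve(hess, grad)               # Newton update
}
cbind(newton = as.numeric(beta), glm = coef(glm(y ~ X[, 2], family = binomial)))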

Inference with logistic regression

  • Major motivation for logistic regression (and all GLMs) is inference

    • How does the response change when we change a predictor by one unit?
  • For linear regression, the answer is straightforward

\[\mathbb{E}[Y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\]

  • For logistic regression, it is a little less straightforward

\[\mathbb E[Y \mid x] = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}\]

Note:

  • The predicted response varies non-linearly with the predictor variable values

  • Interpretation on the odds scale

The odds interpretation

The odds of an event are the probability of the event divided by the probability of the event not occurring

  • \(\displaystyle \text{odds} = \frac{\text{probability}}{1 - \text{probability}}\)

  • \(\displaystyle \text{probability} = \frac{\text{odds}}{1 + \text{odds}}\)

Example: Rolling a fair six-sided die

  • The probability of rolling a 2 is 1/6

  • The odds of rolling a 2 are 1 to 5 (or 1:5), since \((1/6) \,/\, (5/6) = 1/5\)

The odds interpretation

Suppose we fit a simple logistic regression model (with one predictor), and say the predicted probability is 0.8 for a particular predictor value

  • This means that if we were to repeatedly sample response values given that predictor variable value: we expect class 1 to appear 4 times as often as class 0 \[\text{odds} = \frac{\mathbb{E}[Y \mid x]}{1-\mathbb{E}[Y \mid x]} = \frac{0.8}{1-0.8} = 4 = e^{\beta_0+\beta_1x}\]

  • Thus we say that for the given predictor variable value, the odds are 4 (or 4:1) in favor of class 1

How do the odds change if we change the value of a predictor variable by one unit?

\[\text{odds}_{\rm new} = e^{\beta_0+\beta_1(x+1)} = e^{\beta_0+\beta_1x}e^{\beta_1} = e^{\beta_1}\text{odds}_{\rm old}\]

For every unit change in \(x\), the odds change by a factor \(e^{\beta_1}\)
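
A numeric check of this identity, using hypothetical coefficient values chosen only for illustration:

beta0 <- -2; beta1 <- 0.5; x <- 3                # hypothetical values
odds_old <- exp(beta0 + beta1 * x)
odds_new <- exp(beta0 + beta1 * (x + 1))
odds_new / odds_old   # equals exp(beta1), about 1.65
exp(beta1)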

Example

Data: diabetes in Pima women (from MASS::Pima.tr)

  • Each observation represents one Pima woman, at least 21 years old, who was tested for diabetes as part of the study.

  • type: whether the woman had diabetes (Yes or No) according to diagnostic criteria

  • npreg: number of pregnancies

  • bp: diastolic blood pressure (in mmHg)

library(tidyverse)
theme_set(theme_light())
pima <- as_tibble(MASS::Pima.tr)
pima <- pima |> 
  mutate(pregnancy = ifelse(npreg > 0, "Yes", "No")) # whether the patient has had any pregnancies

Fitting a logistic regression model

  • Use the glm() function, specifying family = binomial
  • Get coefficients table using the tidy() function in the broom package
pima_logit <- glm(type ~ pregnancy + bp, data = pima, family = binomial)
library(broom)
# summary(pima_logit)
tidy(pima_logit)
# A tibble: 3 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -3.16      1.08       -2.94 0.00329
2 pregnancyYes  -0.468     0.427      -1.10 0.273  
3 bp             0.0403    0.0140      2.89 0.00391
  • Exponentiate the coefficient estimates (log-odds) to back-transform and get the odds
tidy(pima_logit, conf.int = TRUE, exponentiate = TRUE)
# A tibble: 3 × 7
  term         estimate std.error statistic p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)    0.0422    1.08       -2.94 0.00329  0.00472     0.329
2 pregnancyYes   0.626     0.427      -1.10 0.273    0.272       1.47 
3 bp             1.04      0.0140      2.89 0.00391  1.01        1.07 

Interpreting a logistic regression model

tidy(pima_logit)
# A tibble: 3 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -3.16      1.08       -2.94 0.00329
2 pregnancyYes  -0.468     0.427      -1.10 0.273  
3 bp             0.0403    0.0140      2.89 0.00391
  • Pregnancy: having had at least one pregnancy changes the log-odds of diabetes, relative to the no-pregnancy baseline, by -0.468; the corresponding odds ratio is exp(-0.468) = 0.626
  • Blood pressure: each unit of increase in blood pressure (in mmHg) is associated with an increase in the log-odds of diabetes of 0.0403, or an increase in the odds of diabetes by a multiplicative factor of exp(0.0403) = 1.04

    • Every extra mmHg in blood pressure multiplies the odds of diabetes by 1.04
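
To connect the coefficients back to probabilities, here is a by-hand prediction for a hypothetical patient with a previous pregnancy and bp = 70 (values chosen purely for illustration):

b <- coef(pima_logit)
log_odds <- b["(Intercept)"] + b["pregnancyYes"] * 1 + b["bp"] * 70
plogis(log_odds)   # inverse logit: exp(log_odds) / (1 + exp(log_odds))
# matches predict(pima_logit, data.frame(pregnancy = "Yes", bp = 70), type = "response")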

View predicted probability relationship

pima |> 
  mutate(pred_prob = fitted(pima_logit),
         i_type = as.numeric(type == "Yes")) |> 
  ggplot(aes(bp)) +
  geom_line(aes(y = pred_prob, color = pregnancy), linewidth = 2) +
  geom_point(aes(y = i_type), alpha = 0.3, color = "darkorange", size = 4)

Deviance

  • Recall that in linear regression, we have the residual sum of squares (RSS)—an RSS of 0 means the model predicts perfectly

  • In logistic regression, the RSS is less meaningful; instead we define the deviance as a measure of goodness-of-fit

  • The deviance generalizes the RSS from linear regression to any distribution family

Deviance

  • For model of interest \(\mathcal{M}\) the total deviance is

\[D_{\mathcal{M}}= -2 \log \frac{\mathcal{L}_{\mathcal{M}}}{\mathcal{L}_{\mathcal{S}}} = 2\left(\log \mathcal{L}_{\mathcal{S}}-\log \mathcal{L}_{\mathcal{M}}\right)\]

  • \(\mathcal{L}_{\mathcal{M}}\) is the likelihood for model \(\mathcal{M}\)
  • \(\mathcal{L}_{\mathcal{S}}\) is the likelihood for the saturated model (i.e., the most complex model possible, providing a perfect fit to the data)
  • Deviance is a measure of goodness-of-fit: the smaller the deviance, the better the fit

For more, see here

Model summaries

glance(pima_logit)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1          256.     199  -123.  252.  262.     246.         197   200
  • deviance = -2 * logLik (for binary data, the saturated model has log-likelihood 0, so the deviance reduces to -2 * logLik)

  • null.deviance corresponds to intercept-only model

  • AIC = 2 * (number of estimated parameters) - 2 * logLik (here, 2 * 3 - 2 * (-123) = 252)

  • We will consider these to be less important than out-of-sample/test set performance
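
A quick sketch verifying both identities on the fitted model:

ll <- as.numeric(logLik(pima_logit))
c(deviance = deviance(pima_logit), minus_2_loglik = -2 * ll)               # both about 246
c(AIC = AIC(pima_logit), by_hand = 2 * length(coef(pima_logit)) - 2 * ll)  # both about 252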

Predictions

  • The default output of predict() is on the log-odds scale

  • To get predicted probabilities, specify type = "response"

pima_pred_prob <- predict(pima_logit, type = "response")
  • How do we predict the class (diabetes or not)?

  • Typically if predicted probability is > 0.5 then we predict Yes, else No

    • Could also be encoded as 1 and 0
pima_pred_class <- ifelse(pima_pred_prob > 0.5, "Yes", "No")
pima_pred_binary <- ifelse(pima_pred_prob > 0.5, 1, 0)

Classification model assessment

  • Accuracy and misclassification rate
mean(pima_pred_class == pima$type)
[1] 0.665
mean(pima_pred_class != pima$type)
[1] 0.335
# mean squared difference between predicted class and predicted probability
# (the usual Brier score would instead compare pred_prob to the observed outcomes)
mean((pima_pred_binary - pima_pred_prob)^2)
[1] 0.1140525
  • Confusion matrix
table("Predicted" = factor(pima_pred_class, levels = c("Yes", "No")), 
      "Observed" = factor(pima$type, levels = c("Yes", "No")))
         Observed
Predicted Yes  No
      Yes   8   7
      No   60 125

Confusion matrix

  • Accuracy: How often is the classifier correct? \((\text{TP} + \text{TN}) \ / \ {\text{total}}\)

  • Precision: How often is it right for predicted positives? \(\text{TP} \ / \ (\text{TP} + \text{FP})\)

  • Sensitivity (or true positive rate (TPR) or power): How often does it detect positives? \(\text{TP} \ / \ (\text{TP} + \text{FN})\)

  • Specificity (or true negative rate (TNR), or 1 - false positive rate (FPR)): How often does it detect negatives? \(\text{TN} \ / \ (\text{TN} + \text{FP})\)
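
As a sketch, all four quantities can be computed directly from the confusion matrix shown earlier (treating Yes as the positive class):

conf_mat <- table("Predicted" = factor(pima_pred_class, levels = c("Yes", "No")),
                  "Observed" = factor(pima$type, levels = c("Yes", "No")))
tp <- conf_mat["Yes", "Yes"]; fp <- conf_mat["Yes", "No"]
fn <- conf_mat["No", "Yes"];  tn <- conf_mat["No", "No"]
c(accuracy    = (tp + tn) / sum(conf_mat),   # 0.665, matching the earlier result
  precision   = tp / (tp + fp),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp))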

Receiver operating characteristic (ROC) curve

  • Question: How do we balance between high power and low false positive rate?
  • In general, we can obtain predicted class for each observation by comparing the predicted probability to some cutoff \(c\) (typically, \(c=0.5\)) \[\hat{Y} =\left\{\begin{array}{ll} 1, & \hat{p}(x)>c \\ 0, & \hat{p}(x) \leq c \end{array}\right.\]
  • ROC curve: check all possible values for the cutoff \(c\), plot the power against false positive rate
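
Before turning to a package, here is a by-hand sketch of that cutoff sweep, reusing pima_pred_prob from the earlier predictions:

cutoffs <- seq(0, 1, by = 0.01)
obs <- as.numeric(pima$type == "Yes")
tpr <- sapply(cutoffs, function(thr) mean(pima_pred_prob[obs == 1] > thr))  # power
fpr <- sapply(cutoffs, function(thr) mean(pima_pred_prob[obs == 0] > thr))
tibble(fpr, tpr) |> 
  ggplot(aes(fpr, tpr)) + 
  geom_path()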

Receiver operating characteristic (ROC) curve

  • Want to maximize the area under the curve (AUC)

Calculating AUC & plotting an ROC curve

library(pROC)
pima_roc <- pima |> 
  mutate(pred_prob = predict(pima_logit, 
                             type = "response")) |> 
  roc(type, pred_prob)
# str(pima_roc)
pima_roc$auc
Area under the curve: 0.6428
tibble(threshold = pima_roc$thresholds,
       specificity = pima_roc$specificities,
       sensitivity = pima_roc$sensitivities) |> 
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed")

Calibration

  • Classification models go beyond binary predictions by producing a predicted probability for each class
  • It might be desirable to assess the quality of the predicted probabilities instead of just the accuracy of the crude predicted class labels
  • From The Signal and the Noise (by Nate Silver):

One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If, over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated. If it wound up raining just 20 percent of the time instead, or 60 percent of the time, they weren’t.

Calibration

  • For a model to be calibrated, the predicted probabilities should match the actual probabilities (actual likelihood of an event occurring)
  • For example, if we’ve got a model which tells us that the probability of rain on a certain class of days is 50%, it had better rain on about half of those days in reality, or the model is just wrong about the probability of rain
  • Calibration should be used together with other model evaluation and diagnostic tools
  • A model being well-calibrated does not mean it is good at making predictions (example: a constant classifier, as in the quick demo below)
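
A quick simulated demo (not from the lecture data): a classifier that predicts the overall base rate for every observation is perfectly calibrated, yet it cannot separate the classes at all.

set.seed(3)
y <- rbinom(1000, 1, 0.3)
constant_pred <- rep(mean(y), 1000)                  # predict about 0.3 for everyone
c(predicted = constant_pred[1], observed = mean(y))  # these match: well calibrated
# but every observation gets the same score, so the AUC is only 0.5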

Calibration plot: smoothing

Plot the observed outcome against the fitted probability, and apply a smoother

pima |> 
  mutate(pred = predict(pima_logit, type = "response"),
         obs = ifelse(type == "Yes", 1, 0)) |> 
  ggplot(aes(pred, obs)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, intercept = 0, 
              linetype = "dashed") 

Looks good for the most part

  • odd behavior for high probability values

  • since there are only a few observations with high predicted probability

Calibration plot: binning

pima |> 
  mutate(
    pred_prob = predict(pima_logit, type = "response"),
    bin_pred_prob = cut(pred_prob, breaks = seq(0, 1, .1))
  ) |> 
  group_by(bin_pred_prob) |> 
  summarize(n = n(),
            bin_actual_prob = mean(type == "Yes")) |> 
  # bin midpoints, hard-coded to match the six non-empty bins for this fit
  mutate(mid_point = seq(0.15, 0.65, 0.1)) |> 
  ggplot(aes(x = mid_point, y = bin_actual_prob)) +
  geom_point(aes(size = n)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, 
              color = "black", linetype = "dashed") +
  # expand_limits(x = c(0, 1), y = c(0, 1)) +
  coord_fixed()

Other calibration methods