Data visualization: density estimation


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Data: Sabrina Ionescu’s shots

Shot attempts by the Sabrina Ionescu in the 2023 WNBA season using wehoop

library(tidyverse)
theme_set(theme_light())
ionescu_shots <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/ionescu_shots.csv")
glimpse(ionescu_shots)
Rows: 617
Columns: 6
$ shot_x        <dbl> -10, 11, 2, -13, -18, 19, -1, 0, -1, -12, -1, -14, -7, -…
$ shot_y        <dbl> 24, 23, 2, 21, 19, 16, 3, 2, 4, 5, 2, 21, 26, 14, 7, 6, …
$ shot_distance <dbl> 26.000000, 25.495098, 2.828427, 24.698178, 26.172505, 24…
$ shot_type     <chr> "Pullup Jump Shot", "Running Jump Shot", "Cutting Layup …
$ scoring_play  <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FAL…
$ score_value   <dbl> 0, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 3, 0, 0,…
  • Each row is a shot attempt by Ionescu in the 2023 WNBA season

  • Categorical / qualitative variables: scoring_play, shot_type

  • Continuous / quantitative variables: shot_x, shot_y, shot_distance, score_value

Revisiting histograms

fd_bw <- 2 * IQR(ionescu_shots$shot_distance) / length(ionescu_shots$shot_distance)^(1/3)
ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = fd_bw)
  • Split observed data into bins

  • Count number of observations in each bin

Need to choose the number of bins, adjust with:

  • bins: number of bins (default is 30)

  • binwidth: width of bins (overrides bins), various rules of thumb

  • breaks: vector of bin boundaries (overrides both bins and binwidth)

Adjusting the binwidth

Small binwidth \(\rightarrow\) “undersmooth” / spiky

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Large binwidth \(\rightarrow\) “oversmooth” / flat

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 25)

Adjusting the binwidth

  • A binwidth that is too narrow shows too much detail

    • too many bins: low bias, high variance
  • A binwidth that is too wide hides detail

    • too few bins: high bias, low variance
  • Always pick a value that is “just right” (The Goldilocks principle)

Try several values, the R / ggplot2 default is NOT guaranteed to be an optimal choice

A subtle point about the histogram code…

By default the bins are centered on the integers

  • left-closed, right-open intervals
  • starting at -0.5 to 0.5, 0.5 to 1.5, …
ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Specify center of one bin (e.g. 0.5)

  • Use closed = "left"
ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1, center = 0.5, 
                 closed = "left")

How do histograms relate to the PDF and CDF?

  • Histograms approximate the PDF with bins, and points are equally likely within a bin

  • PDF is the derivative of the cumulative distribution function (CDF)

Density estimation

Kernel density estimation

Goal: estimate the PDF \(f(x)\) for all possible values (assuming it is smooth)

The kernel density estimator (KDE) is \(\displaystyle \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i)\)

  • \(n\): sample size

  • \(x\): new point to estimate \(f(x)\) (does NOT have to be in the dataset!)

  • \(h\): bandwidth, analogous to histogram binwidth, ensures \(\hat{f}(x)\) integrates to 1

  • \(x_i\): \(i\)th observation in the dataset

  • \(K_h(x - x_i)\): kernel function, creates weight given distance of \(i\)th observation from new point
    • as \(|x - x_i| \rightarrow \infty\) then \(K_h(x - x_i) \rightarrow 0\), i.e. the further apart the \(i\)th observation is from \(x\), the smaller the weight

    • as bandwidth \(h\) increases, weights are more evenly spread out

    • Choice of kernel functions: Gaussian/normal, etc.

    • \(K_h(x - x_i)\) is large when \(x_i\) is close to \(x\)

Kernel density estimation

Intuition:

  • smooth each data point into a small density bumps
  • sum all these small bumps together to obtain the final density estimate

How do we compute and display the density estimate?

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density() +
  geom_rug(alpha = 0.3)
  • Pros:
    • Displays full shape of distribution
    • Can easily layer
    • Add categorical variable with color
  • Cons:
    • Need to pick bandwidth and kernel…

What about the bandwidth?

Use Gaussian reference rule (rule-of-thumb) \(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\) (\(\sigma\): observed standard deviation)

Modify the bandwidth using the adjust argument - value to multiply default bandwidth by

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density(adjust = 0.5) +
  geom_rug(alpha = 0.3)

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density(adjust = 2) +
  geom_rug(alpha = 0.3)

Notes on density estimation

  • In KDE, the bandwidth parameter is analogous to the binwidth in histograms.

  • If the bandwidth is too small, the density estimate can become overly peaky and the main trends in the data may be obscured.

  • If the bandwidth is too large, then smaller features in the distribution of the data may disappear

  • The choice of the kernel can affect the shape of the density curve.

    • A Gaussian kernel typically gives density estimates that look bell-shaped (ish)

    • A rectangular kernel can generate the appearance of steps in the density curve

    • Kernel choice matters less with more data points

Density plots are often reliable and informative for large datasets but can be misleading for smaller ones.

Common pitfall: bounded data

set.seed(36)
bounded_data <- tibble(x = runif(100))
bounded_data |> 
  ggplot(aes(x)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  stat_function(data = tibble(x = c(0, 1)),
                fun = dunif, color = "red") +
  scale_x_continuous(limits = c(-0.5, 1.5))
  • Observe density estimates for impossible values (in the tails) - ALWAYS be mindful

  • Reflection method: first perform standard KDE, then “reflect” tails outside of desired interval to be inside

  • See also: evmix package

Use density curves and ECDFs together

Code interlude: easy way to arrange multiple figures

Use the cowplot package to easily arrange your plots (see also patchwork)

library(cowplot)

ionescu_shot_dens <- ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density() +
  geom_rug(alpha = 0.3) +
  theme_bw() +
  labs(x = "Shot distance (in feet)",
       y = "Number of shot attempts")

ionescu_shot_ecdf <- ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  stat_ecdf() +
  geom_rug(alpha = 0.3) +
  theme_bw() +
  labs(x = "Shot distance (in feet)",
       y = "Proportion of Ionescu shot attempts")

# library(patchwork)
# ionescu_shot_dens + ionescu_shot_ecdf
plot_grid(ionescu_shot_dens, ionescu_shot_ecdf)

Use density curves and ECDFs together

Another code interlude: collect the legends with patchwork

ionescu_shot_dens_made <- ionescu_shots |>
  ggplot(aes(x = shot_distance, 
             color = scoring_play)) + 
  geom_density() +
  geom_rug(alpha = 0.3) +
  labs(x = "Shot distance (in feet)",
       y = "Number of shot attempts")

ionescu_shot_ecdf_made <- ionescu_shots |>
  ggplot(aes(x = shot_distance,
             color = scoring_play)) + 
  stat_ecdf() +
  geom_rug(alpha = 0.3) +
  labs(x = "Shot distance (in feet)",
       y = "Proportion of Ionescu shot attempts")

library(patchwork)
ionescu_shot_dens_made + ionescu_shot_ecdf_made + plot_layout(guides = "collect")

Ridgeline plots

  • Check out the ggridges package for a variety of customization options
library(ggridges)
ionescu_shots |>
  ggplot(aes(x = shot_distance, y = shot_type)) + 
  geom_density_ridges(rel_min_height = 0.01) 
  • Useful to display conditional distributions across many levels

Going from 1D to 2D density estimation

In 1D: estimate density \(f(x)\), assuming that \(f(x)\) is smooth:

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]

In 2D: estimate joint density \(f(x_1, x_2)\)

\[\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K\left(\frac{x_1 - x_{i1}}{h_1}\right) K\left(\frac{x_2 - x_{i2}}{h_2}\right)\]

In 1D there’s one bandwidth, now we have two bandwidths

  • \(h_1\): controls smoothness as \(X_1\) changes, holding \(X_2\) fixed
  • \(h_2\): controls smoothness as \(X_2\) changes, holding \(X_1\) fixed

Gaussian kernels are still a popular choice

Display densities for 2D data

How to read contour plots

Best known in topography: outlines (contours) denote levels of elevation

2D density estimation

We can visualize all of the shot locations: (shot_x, shot_y)

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) +
  geom_point(size = 4, alpha = 0.3)
  • Adjust transparency with alpha for overlapping points

Create contours of 2D kernel density estimate (KDE)

ionescu_shots |>
  filter(shot_y < 35) |> # remove outliers
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(size = 4, alpha = 0.3) + 
  geom_density2d() +
  coord_fixed() +
  theme(legend.position = "bottom")
  • Extend KDE for joint density estimates in 2D (see section 14.4.2 for details)

  • Inner lines denote “peaks”

  • coord_fixed() forced a fixed ratio

Create contours of 2D kernel density estimate (KDE)

ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(size = 4, alpha = 0.3) + 
  geom_density2d(adjust = 0.1) +
  coord_fixed() +
  theme(legend.position = "bottom")
  • Can use adjust to modify the multivariate bandwidth

Contours are difficult… let’s make a heatmap instead

  • Use stat_density_2d() and the after_stat() function to make 2D KDE heatmaps

  • May be easier to read than nested lines with color

  • Default color scale is awful. Always change it.

ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(level)),
                 h = 0.6, bins = 60, geom = "polygon") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") +
  coord_fixed() +
  theme(legend.position = "bottom")

Multivariate density estimation can be difficult

Turn off contours and use tiles instead

  • Divide the space into a grid and color the grid according to high/low values
ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(density)),
                 h = 0.6, bins = 60, contour = FALSE,
                 geom = "raster") +
  # scale_fill_gradient(low = "white", high = "red") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") +
  theme(legend.position = "bottom") +
  coord_fixed()

Best alternative? Hexagonal binning

  • Use geom_hex() to make hexagonal heatmaps

  • Need to have the hexbin package installed

  • 2D version of histogram

ionescu_shots |>
  filter(shot_y < 35) |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_hex(binwidth = c(1, 1)) +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") + 
  theme(legend.position = "bottom") +
  coord_fixed()
  • Can specify binwidth in both directions
  • Avoids limitations from smoothing

What about shooting efficiency?

ionescu_shots |>
  filter(shot_y < 35) |>
  ggplot(aes(x = shot_x, y = shot_y, 
             z = scoring_play, group = -1)) +
  stat_summary_hex(binwidth = c(2, 2), fun = mean, 
                   color = "black") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") + 
  theme(legend.position = "bottom") +
  coord_fixed()

Appendix: Making shot charts and drawing courts with sportyR

library(sportyR)
wnba_court <- geom_basketball("wnba", display_range = "offense", rotation = 270, x_trans = -41.5)
wnba_court +
  geom_hex(data = ionescu_shots, aes(x = shot_x, y = shot_y), binwidth = c(1, 1)) + 
  scale_fill_gradient(low = "midnightblue", high = "gold")

Appendix: Code to build dataset

# install.packages("wehoop")
library(wehoop)
wnba_pbp <- load_wnba_pbp(2023)
ionescu_shots <- wnba_pbp |> 
  filter(shooting_play) |> 
  filter(str_detect(text, "Ionescu")) |> 
  filter(!str_detect(text, "Ionescu assists")) |> 
  filter(!str_detect(text, "free throw")) |> 
  transmute(
    shot_x = coordinate_x_raw - 25,
    shot_y = coordinate_y_raw,
    shot_distance = sqrt((abs(shot_x) ^ 2) + shot_y ^ 2), 
    shot_type = type_text,
    scoring_play,
    score_value
  )