Data visualization: density estimation

SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Data: Sabrina Ionescu’s shots

Shot attempts by the Sabrina Ionescu in the 2023 WNBA season using wehoop

library(tidyverse)
theme_set(theme_light())
ionescu_shots <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/ionescu_shots.csv")
glimpse(ionescu_shots)

Rows: 617
Columns: 6
$ shot_x        <dbl> -10, 11, 2, -13, -18, 19, -1, 0, -1, -12, -1, -14, -7, -…
$ shot_y        <dbl> 24, 23, 2, 21, 19, 16, 3, 2, 4, 5, 2, 21, 26, 14, 7, 6, …
$ shot_distance <dbl> 26.000000, 25.495098, 2.828427, 24.698178, 26.172505, 24…
$ shot_type     <chr> "Pullup Jump Shot", "Running Jump Shot", "Cutting Layup …
$ scoring_play  <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FAL…
$ score_value   <dbl> 0, 3, 0, 0, 3, 0, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 3, 0, 0,…

Each row is a shot attempt by Ionescu in the 2023 WNBA season
Categorical / qualitative variables: scoring_play, shot_type
Continuous / quantitative variables: shot_x, shot_y, shot_distance, score_value

Revisiting histograms

fd_bw <- 2 * IQR(ionescu_shots$shot_distance) / length(ionescu_shots$shot_distance)^(1/3)
ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = fd_bw)

Split observed data into bins
Count number of observations in each bin

Need to choose the number of bins, adjust with:

bins: number of bins (default is 30)
binwidth: width of bins (overrides bins), various rules of thumb
- e.g., see fd_bw for Freedman–Diaconis rule
breaks: vector of bin boundaries (overrides both bins and binwidth)

Adjusting the binwidth

Small binwidth \(\rightarrow\) “undersmooth” / spiky

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Large binwidth \(\rightarrow\) “oversmooth” / flat

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 25)

Adjusting the binwidth

A binwidth that is too narrow shows too much detail
- too many bins: low bias, high variance
A binwidth that is too wide hides detail
- too few bins: high bias, low variance
Always pick a value that is “just right” (The Goldilocks principle)

Try several values, the R / ggplot2 default is NOT guaranteed to be an optimal choice

A subtle point about the histogram code…

By default the bins are centered on the integers

left-closed, right-open intervals
starting at -0.5 to 0.5, 0.5 to 1.5, …

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1)

Specify center of one bin (e.g. 0.5)

Use closed = "left"

ionescu_shots |>
  ggplot(aes(x = shot_distance)) +
  geom_histogram(binwidth = 1, center = 0.5, 
                 closed = "left")

How do histograms relate to the PDF and CDF?

Histograms approximate the PDF with bins, and points are equally likely within a bin
PDF is the derivative of the cumulative distribution function (CDF)

Density estimation

Kernel density estimation

Goal: estimate the PDF \(f(x)\) for all possible values (assuming it is smooth)

The kernel density estimator (KDE) is \(\displaystyle \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i)\)

\(n\): sample size
\(x\): new point to estimate \(f(x)\) (does NOT have to be in the dataset!)
\(h\): bandwidth, analogous to histogram binwidth, ensures \(\hat{f}(x)\) integrates to 1
\(x_i\): \(i\)th observation in the dataset

\(K_h(x - x_i)\): kernel function, creates weight given distance of \(i\)th observation from new point
- as \(|x - x_i| \rightarrow \infty\) then \(K_h(x - x_i) \rightarrow 0\), i.e. the further apart the \(i\)th observation is from \(x\), the smaller the weight
- as bandwidth \(h\) increases, weights are more evenly spread out
- Choice of kernel functions: Gaussian/normal, etc.
- \(K_h(x - x_i)\) is large when \(x_i\) is close to \(x\)

Kernel density estimation

Intuition:

smooth each data point into a small density bumps
sum all these small bumps together to obtain the final density estimate

How do we compute and display the density estimate?

We make kernel density estimates with geom_density()

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density() +
  geom_rug(alpha = 0.3)

Pros:
- Displays full shape of distribution
- Can easily layer
- Add categorical variable with color
Cons:
- Need to pick bandwidth and kernel…

What about the bandwidth?

Use Gaussian reference rule (rule-of-thumb) \(\approx 1.06 \cdot \sigma \cdot n^{-1/5}\) (\(\sigma\): observed standard deviation)

Modify the bandwidth using the adjust argument - value to multiply default bandwidth by

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density(adjust = 0.5) +
  geom_rug(alpha = 0.3)

ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density(adjust = 2) +
  geom_rug(alpha = 0.3)

Notes on density estimation

In KDE, the bandwidth parameter is analogous to the binwidth in histograms.
If the bandwidth is too small, the density estimate can become overly peaky and the main trends in the data may be obscured.
If the bandwidth is too large, then smaller features in the distribution of the data may disappear
The choice of the kernel can affect the shape of the density curve.
- A Gaussian kernel typically gives density estimates that look bell-shaped (ish)
- A rectangular kernel can generate the appearance of steps in the density curve
- Kernel choice matters less with more data points

Density plots are often reliable and informative for large datasets but can be misleading for smaller ones.

Common pitfall: bounded data

set.seed(36)
bounded_data <- tibble(x = runif(100))
bounded_data |> 
  ggplot(aes(x)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  stat_function(data = tibble(x = c(0, 1)),
                fun = dunif, color = "red") +
  scale_x_continuous(limits = c(-0.5, 1.5))

Observe density estimates for impossible values (in the tails) - ALWAYS be mindful
Reflection method: first perform standard KDE, then “reflect” tails outside of desired interval to be inside
See also: evmix package

Use density curves and ECDFs together

Code interlude: easy way to arrange multiple figures

Use the cowplot package to easily arrange your plots (see also patchwork)

library(cowplot)

ionescu_shot_dens <- ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  geom_density() +
  geom_rug(alpha = 0.3) +
  theme_bw() +
  labs(x = "Shot distance (in feet)",
       y = "Number of shot attempts")

ionescu_shot_ecdf <- ionescu_shots |>
  ggplot(aes(x = shot_distance)) + 
  stat_ecdf() +
  geom_rug(alpha = 0.3) +
  theme_bw() +
  labs(x = "Shot distance (in feet)",
       y = "Proportion of Ionescu shot attempts")

# library(patchwork)
# ionescu_shot_dens + ionescu_shot_ecdf
plot_grid(ionescu_shot_dens, ionescu_shot_ecdf)

Use density curves and ECDFs together

Another code interlude: collect the legends with `patchwork`

ionescu_shot_dens_made <- ionescu_shots |>
  ggplot(aes(x = shot_distance, 
             color = scoring_play)) + 
  geom_density() +
  geom_rug(alpha = 0.3) +
  labs(x = "Shot distance (in feet)",
       y = "Number of shot attempts")

ionescu_shot_ecdf_made <- ionescu_shots |>
  ggplot(aes(x = shot_distance,
             color = scoring_play)) + 
  stat_ecdf() +
  geom_rug(alpha = 0.3) +
  labs(x = "Shot distance (in feet)",
       y = "Proportion of Ionescu shot attempts")

library(patchwork)
ionescu_shot_dens_made + ionescu_shot_ecdf_made + plot_layout(guides = "collect")

Ridgeline plots

Check out the ggridges package for a variety of customization options

library(ggridges)
ionescu_shots |>
  ggplot(aes(x = shot_distance, y = shot_type)) + 
  geom_density_ridges(rel_min_height = 0.01)

Useful to display conditional distributions across many levels

Going from 1D to 2D density estimation

In 1D: estimate density \(f(x)\), assuming that \(f(x)\) is smooth:

\[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K_h(x - x_i) \]

In 2D: estimate joint density \(f(x_1, x_2)\)

\[\hat{f}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_1h_2} K\left(\frac{x_1 - x_{i1}}{h_1}\right) K\left(\frac{x_2 - x_{i2}}{h_2}\right)\]

In 1D there’s one bandwidth, now we have two bandwidths

\(h_1\): controls smoothness as \(X_1\) changes, holding \(X_2\) fixed
\(h_2\): controls smoothness as \(X_2\) changes, holding \(X_1\) fixed

Gaussian kernels are still a popular choice

Display densities for 2D data

How to read contour plots

Best known in topography: outlines (contours) denote levels of elevation

2D density estimation

We can visualize all of the shot locations: (shot_x, shot_y)

ionescu_shots |>
  ggplot(aes(x = shot_x, y = shot_y)) +
  geom_point(size = 4, alpha = 0.3)

Adjust transparency with alpha for overlapping points

Create contours of 2D kernel density estimate (KDE)

Use geom_density2d() to display contour lines

ionescu_shots |>
  filter(shot_y < 35) |> # remove outliers
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(size = 4, alpha = 0.3) + 
  geom_density2d() +
  coord_fixed() +
  theme(legend.position = "bottom")

Extend KDE for joint density estimates in 2D (see section 14.4.2 for details)
Inner lines denote “peaks”
coord_fixed() forced a fixed ratio

Create contours of 2D kernel density estimate (KDE)

We make 2D KDE contour plots using geom_density2d()

ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_point(size = 4, alpha = 0.3) + 
  geom_density2d(adjust = 0.1) +
  coord_fixed() +
  theme(legend.position = "bottom")

Can use adjust to modify the multivariate bandwidth

Contours are difficult… let’s make a heatmap instead

Use stat_density_2d() and the after_stat() function to make 2D KDE heatmaps
May be easier to read than nested lines with color
Default color scale is awful. Always change it.

ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(level)),
                 h = 0.6, bins = 60, geom = "polygon") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") +
  coord_fixed() +
  theme(legend.position = "bottom")

Multivariate density estimation can be difficult

Turn off contours and use tiles instead

Divide the space into a grid and color the grid according to high/low values

ionescu_shots |>
  filter(shot_y < 35) |> 
  ggplot(aes(x = shot_x, y = shot_y)) + 
  stat_density2d(aes(fill = after_stat(density)),
                 h = 0.6, bins = 60, contour = FALSE,
                 geom = "raster") +
  # scale_fill_gradient(low = "white", high = "red") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") +
  theme(legend.position = "bottom") +
  coord_fixed()

Best alternative? Hexagonal binning

Use geom_hex() to make hexagonal heatmaps
Need to have the hexbin package installed
2D version of histogram

ionescu_shots |>
  filter(shot_y < 35) |>
  ggplot(aes(x = shot_x, y = shot_y)) + 
  geom_hex(binwidth = c(1, 1)) +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") + 
  theme(legend.position = "bottom") +
  coord_fixed()

Can specify binwidth in both directions
Avoids limitations from smoothing

What about shooting efficiency?

Can compute a function of another variable inside hexagons with stat_summary_hex()

ionescu_shots |>
  filter(shot_y < 35) |>
  ggplot(aes(x = shot_x, y = shot_y, 
             z = scoring_play, group = -1)) +
  stat_summary_hex(binwidth = c(2, 2), fun = mean, 
                   color = "black") +
  scale_fill_gradient(low = "midnightblue", 
                      high = "gold") + 
  theme(legend.position = "bottom") +
  coord_fixed()

Appendix: Making shot charts and drawing courts with `sportyR`

library(sportyR)
wnba_court <- geom_basketball("wnba", display_range = "offense", rotation = 270, x_trans = -41.5)
wnba_court +
  geom_hex(data = ionescu_shots, aes(x = shot_x, y = shot_y), binwidth = c(1, 1)) + 
  scale_fill_gradient(low = "midnightblue", high = "gold")

Appendix: Code to build dataset

# install.packages("wehoop")
library(wehoop)
wnba_pbp <- load_wnba_pbp(2023)
ionescu_shots <- wnba_pbp |> 
  filter(shooting_play) |> 
  filter(str_detect(text, "Ionescu")) |> 
  filter(!str_detect(text, "Ionescu assists")) |> 
  filter(!str_detect(text, "free throw")) |> 
  transmute(
    shot_x = coordinate_x_raw - 25,
    shot_y = coordinate_y_raw,
    shot_distance = sqrt((abs(shot_x) ^ 2) + shot_y ^ 2), 
    shot_type = type_text,
    scoring_play,
    score_value
  )

Data visualization: density estimation

Background

Data: Sabrina Ionescu’s shots

Revisiting histograms

Adjusting the binwidth

Adjusting the binwidth

A subtle point about the histogram code…

How do histograms relate to the PDF and CDF?

Density estimation

Kernel density estimation

Kernel density estimation

How do we compute and display the density estimate?

What about the bandwidth?

Notes on density estimation

Common pitfall: bounded data

Use density curves and ECDFs together

Code interlude: easy way to arrange multiple figures

Use density curves and ECDFs together

Another code interlude: collect the legends with patchwork

Ridgeline plots

Going from 1D to 2D density estimation

Display densities for 2D data

How to read contour plots

2D density estimation

Create contours of 2D kernel density estimate (KDE)

Create contours of 2D kernel density estimate (KDE)

Contours are difficult… let’s make a heatmap instead

Turn off contours and use tiles instead

Best alternative? Hexagonal binning

What about shooting efficiency?

Appendix: Making shot charts and drawing courts with sportyR

Appendix: Code to build dataset

Another code interlude: collect the legends with `patchwork`

Appendix: Making shot charts and drawing courts with `sportyR`