Data visualization: the grammar of graphics and ggplot2


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Motivation and background

Goals of data visualization

  • to represent the data in a visual way (enough with tables… though tables are useful in a lot of situations)
  • most importantly, to deliver the information to your audience and help them understand the story behind the data

ALWAYS visualize your data before modeling and analysis

Anscombe’s quartet

# A tibble: 4 × 6
  set   x_mean x_var y_mean y_var x_y_cor
  <chr>  <dbl> <dbl>  <dbl> <dbl>   <dbl>
1 1          9    11   7.50  4.13   0.816
2 2          9    11   7.50  4.13   0.816
3 3          9    11   7.5   4.12   0.816
4 4          9    11   7.50  4.12   0.817

ALWAYS visualize your data before modeling and analysis

The Datasaurus dozen

# A tibble: 13 × 6
   dataset    x_mean x_var y_mean y_var x_y_cor
   <chr>       <dbl> <dbl>  <dbl> <dbl>   <dbl>
 1 away         54.3  281.   47.8  726. -0.0641
 2 bullseye     54.3  281.   47.8  726. -0.0686
 3 circle       54.3  281.   47.8  725. -0.0683
 4 dino         54.3  281.   47.8  726. -0.0645
 5 dots         54.3  281.   47.8  725. -0.0603
 6 h_lines      54.3  281.   47.8  726. -0.0617
 7 high_lines   54.3  281.   47.8  726. -0.0685
 8 slant_down   54.3  281.   47.8  726. -0.0690
 9 slant_up     54.3  281.   47.8  726. -0.0686
10 star         54.3  281.   47.8  725. -0.0630
11 v_lines      54.3  281.   47.8  726. -0.0694
12 wide_lines   54.3  281.   47.8  726. -0.0666
13 x_shape      54.3  281.   47.8  725. -0.0656

Viz crime?

Florence Nightingale’s rose diagram

Viz crime?

Previously…

Yearly MLB batting statistics from Lahman with tidyverse:

  • total hits, home runs, strikeouts, walks, at bats
  • total batting average for each year
    = total H / total AB
  • only keeps AL and NL leagues
library(tidyverse)
library(Lahman) 
yearly_batting <- Batting |>
  filter(lgID %in% c("AL", "NL")) |>
  group_by(yearID) |>
  summarize(total_h = sum(H, na.rm = TRUE),
            total_hr = sum(HR, na.rm = TRUE),
            total_so = sum(SO, na.rm = TRUE),
            total_bb = sum(BB, na.rm = TRUE),
            total_ab = sum(AB, na.rm = TRUE)) |>
  mutate(batting_avg = total_h / total_ab)

How do we make data visualization?

What are the steps to make this figure below?

The grammar of graphics

  • Key idea: specify plotting “layers” and combine them to produce a graphic

  • ggplot2 provides an implementation of the grammar of graphics

  • The following layers are building blocks of data graphics

  1. data - one or more datasets (in tidy tabular format)

  2. geom - geometric objects to visually represent the data (e.g. points, lines, bars, etc.)

  3. aes - mappings of variables to visual properties (i.e. aesthetics) of the geometric objects

  4. scale - one scale for each variable displayed (e.g. axis limits, log scale, colors, etc.)

  5. facet - similar subplots (i.e. facets) for subsets of the same data using a conditioning variable

  6. stat - statistical transformations and summaries (e.g. identity, count, smooth, quantile, etc.)

  7. coord - one or more coordinate systems (e.g. cartesian, polar, map projection)

  8. labs - labels/guides for each variable and other parts of the plot (e.g. title, subtitle, caption, etc.)

  9. theme - customization of plot layout (e.g. text size, alignment, legend position, etc.)

A walkthrough of ggplot2

First, start with the data


ggplot(data = yearly_batting)


or equivalently, using |>

yearly_batting |> 
  ggplot()


So far, nothing is displayed

Specify variables and geometric object


yearly_batting |> 
  ggplot() +
  geom_point(aes(x = yearID, y = total_hr))
  • Adding (+) a geometric layer of points to the plot

  • Map yearID to the x-axis and total_hr to the y-axis via aes()

  • Implicitly using coord_cartesian()

yearly_batting |> 
  ggplot() + 
  geom_point(aes(x = yearID, y = total_hr)) +
  coord_cartesian()

Now, can we add another geometric layer?


yearly_batting |> 
  ggplot() +
  geom_point(aes(x = yearID, y = total_hr)) +
  geom_line(aes(x = yearID, y = total_hr))
  • Adding (+) a line geometric layer

  • Include mappings shared across geometric layers inside ggplot()

yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line()

Scaling axes: changing axis label breaks


yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(breaks = seq(0, 6000, 1000))

Scaling axes: customizing axis limits


yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2015))

Scaling axes: having different axis scales


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() + 
  scale_x_reverse() +
  scale_y_log10()

We can easily adjust variable scales without directly modifying the columns in the data

Adding a statistical summary


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() + 
  stat_smooth()
  • Smoothing regression summary (will cover later) using yearID and total_hr
  • Geometric layers implicitly use a default statistical summary
  • Technically we’re already using geom_point(stat = "identity")
yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() +
  geom_smooth()

Mapping additional variables


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr,
             color = total_so,
             size = total_bb)) +
  geom_point() +
  geom_line()
  • total_hr, total_so, and total_bb are all displayed

  • color and size are being shared globally across layers

  • This is a bit odd to look at…

Customizing mappings by layer


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line()
  • Now mapping total_so and total_bb to color and size of the point layer only

Changing aesthetics without mapping variables


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed")
  • Manually set the color and linetype of the line layer

Remember: one scale for each mapped variable


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  scale_size_continuous(breaks = seq(0, 20000, 2500))

Always label your plots! (seriously…)


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  labs(
    x = "Year",
    y = "Homeruns",
    color = "Strikeouts",
    size = "Walks",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  )
  • Each mapped aesthetic can be labeled

Custom theme


yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  labs(
    x = "Year",
    y = "Homeruns",
    color = "Strikeouts",
    size = "Walks",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  ) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))

A lesson about data visualization…

  • So far we’ve plotted total home runs across the years, with point size representing total walks and point color representing total strikeouts
  • Simpler is better. What can we do to improve and make the plot simpler?
  • How about creating three separate plots for home runs, strikeouts, and walks, with each mapped to the y-axis?

  • But how do we do this without repeating the same code?

Pivoting

Remember: data should be in tidy format

Within the tidyverse, the tidyr package offers functions for reshaping the data

  • pivot_longer: casts/gathers information spread out across variables

    • transforms data from wide format into long format

    • increase number of rows and decrease number of columns

  • pivot_wider: melts/spreads information out from observations

    • transforms data from long format into wide format

    • decrease number of rows and increase number of columns

Pivoting

yearly_batting |> 
  select(yearID, HRs = total_hr, Strikeouts = total_so, Walks = total_bb) |> # renaming while also selecting
  pivot_longer(HRs:Walks, # can also do !yearID (to select everything but yearID)
               names_to = "stat",
               values_to = "val")
# A tibble: 441 × 3
   yearID stat         val
    <int> <chr>      <int>
 1   1876 HRs           40
 2   1876 Strikeouts   589
 3   1876 Walks        336
 4   1877 HRs           24
 5   1877 Strikeouts   726
 6   1877 Walks        345
 7   1878 HRs           23
 8   1878 Strikeouts  1081
 9   1878 Walks        364
10   1879 HRs           58
# ℹ 431 more rows

We’ve pivoted the data and created the following variables

  • stat, to represent the name of the batting statistics
  • val, to represent the total value of each statistic in each year.

Faceting

yearly_batting |>
  select(yearID, HRs = total_hr, 
         Strikeouts = total_so, Walks = total_bb) |>
  pivot_longer(HRs:Walks, 
               names_to = "stat", 
               values_to = "val") |>
  ggplot(aes(yearID, val)) +
  geom_line(color = "darkblue") +
  geom_point(alpha = 0.8, color = "darkblue") +
  facet_wrap(~ stat, scales = "free_y", ncol = 1) +
  labs(
    x = "Year", 
    y = "Total of statistic",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  ) +
  theme_bw(base_size = 20) +
  theme(strip.background = element_blank(),
        plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))
  • Create a multi-panel plot faceted by a conditioning variable (in our case, stat)

Exercise

The babynames package contains a dataset (also) named babynames, which contains information on the number of children of each sex given each name from 1880 to 2017, provided by the United States Social Security Administration.

How does the popularity (in terms of frequency) of your own name (combination of name and sex) change over time? Also, stick a thick, red, vertical dashed line on the plot at your birth year (try geom_vline()).

Next, pick two other names and compare their popularity over time with your own name.

# install.packages("babynames")
# library(babynames)
# babynames |> 
#   INSERT CODE HERE

Resources