Data visualization: the grammar of graphics and `ggplot2`

SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Motivation and background

Goals of data visualization

to represent the data in a visual way (enough with tables… though tables are useful in a lot of situations)

most importantly, to deliver the information to your audience and help them understand the story behind the data

Data Visualization (good DataViz anyway) answers a question. - Greggy J. M.

ALWAYS visualize your data before modeling and analysis

Anscombe’s quartet

# A tibble: 4 × 6
  set   x_mean x_var y_mean y_var x_y_cor
  <chr>  <dbl> <dbl>  <dbl> <dbl>   <dbl>
1 1          9    11   7.50  4.13   0.816
2 2          9    11   7.50  4.13   0.816
3 3          9    11   7.5   4.12   0.816
4 4          9    11   7.50  4.12   0.817

ALWAYS visualize your data before modeling and analysis

The Datasaurus dozen

# A tibble: 13 × 6
   dataset    x_mean x_var y_mean y_var x_y_cor
   <chr>       <dbl> <dbl>  <dbl> <dbl>   <dbl>
 1 away         54.3  281.   47.8  726. -0.0641
 2 bullseye     54.3  281.   47.8  726. -0.0686
 3 circle       54.3  281.   47.8  725. -0.0683
 4 dino         54.3  281.   47.8  726. -0.0645
 5 dots         54.3  281.   47.8  725. -0.0603
 6 h_lines      54.3  281.   47.8  726. -0.0617
 7 high_lines   54.3  281.   47.8  726. -0.0685
 8 slant_down   54.3  281.   47.8  726. -0.0690
 9 slant_up     54.3  281.   47.8  726. -0.0686
10 star         54.3  281.   47.8  725. -0.0630
11 v_lines      54.3  281.   47.8  726. -0.0694
12 wide_lines   54.3  281.   47.8  726. -0.0666
13 x_shape      54.3  281.   47.8  725. -0.0656

Viz crime?

Florence Nightingale’s rose diagram

Viz crime?

Previously…

Yearly MLB batting statistics from Lahman with tidyverse:

total hits, home runs, strikeouts, walks, at bats
total batting average for each year
= total H / total AB
only keeps AL and NL leagues

library(tidyverse)
library(Lahman) 
yearly_batting <- Batting |>
  filter(lgID %in% c("AL", "NL")) |>
  group_by(yearID) |>
  summarize(total_h = sum(H, na.rm = TRUE),
            total_hr = sum(HR, na.rm = TRUE),
            total_so = sum(SO, na.rm = TRUE),
            total_bb = sum(BB, na.rm = TRUE),
            total_ab = sum(AB, na.rm = TRUE)) |>
  mutate(batting_avg = total_h / total_ab)

How do we make data visualization?

What are the steps to make this figure below?

The grammar of graphics

Key idea: specify plotting “layers” and combine them to produce a graphic
ggplot2 provides an implementation of the grammar of graphics
The following layers are building blocks of data graphics

data - one or more datasets (in tidy tabular format)
geom - geometric objects to visually represent the data (e.g. points, lines, bars, etc.)
aes - mappings of variables to visual properties (i.e. aesthetics) of the geometric objects
scale - one scale for each variable displayed (e.g. axis limits, log scale, colors, etc.)
facet - similar subplots (i.e. facets) for subsets of the same data using a conditioning variable
stat - statistical transformations and summaries (e.g. identity, count, smooth, quantile, etc.)
coord - one or more coordinate systems (e.g. cartesian, polar, map projection)
labs - labels/guides for each variable and other parts of the plot (e.g. title, subtitle, caption, etc.)
theme - customization of plot layout (e.g. text size, alignment, legend position, etc.)

A walkthrough of `ggplot2`

First, start with the data

ggplot(data = yearly_batting)

or equivalently, using |>

yearly_batting |> 
  ggplot()

So far, nothing is displayed

Specify variables and geometric object

yearly_batting |> 
  ggplot() +
  geom_point(aes(x = yearID, y = total_hr))

Adding (+) a geometric layer of points to the plot
Map yearID to the x-axis and total_hr to the y-axis via aes()
Implicitly using coord_cartesian()

yearly_batting |> 
  ggplot() + 
  geom_point(aes(x = yearID, y = total_hr)) +
  coord_cartesian()

Now, can we add another geometric layer?

yearly_batting |> 
  ggplot() +
  geom_point(aes(x = yearID, y = total_hr)) +
  geom_line(aes(x = yearID, y = total_hr))

Adding (+) a line geometric layer
Include mappings shared across geometric layers inside ggplot()

yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line()

Scaling axes: changing axis label breaks

yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(breaks = seq(0, 6000, 1000))

Scaling axes: customizing axis limits

yearly_batting |>
  ggplot(aes(x = yearID, y = total_hr)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(limits = c(2000, 2015))

Scaling axes: having different axis scales

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() + 
  scale_x_reverse() +
  scale_y_log10()

We can easily adjust variable scales without directly modifying the columns in the data

Adding a statistical summary

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() + 
  stat_smooth()

Smoothing regression summary (will cover later) using yearID and total_hr
Geometric layers implicitly use a default statistical summary
Technically we’re already using geom_point(stat = "identity")

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point() +
  geom_line() +
  geom_smooth()

Mapping additional variables

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr,
             color = total_so,
             size = total_bb)) +
  geom_point() +
  geom_line()

total_hr, total_so, and total_bb are all displayed
color and size are being shared globally across layers
This is a bit odd to look at…

Customizing mappings by layer

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line()

Now mapping total_so and total_bb to color and size of the point layer only

Changing aesthetics without mapping variables

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed")

Manually set the color and linetype of the line layer

Remember: one scale for each mapped variable

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  scale_size_continuous(breaks = seq(0, 20000, 2500))

Always label your plots! (seriously…)

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  labs(
    x = "Year",
    y = "Homeruns",
    color = "Strikeouts",
    size = "Walks",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  )

Each mapped aesthetic can be labeled

Custom theme

yearly_batting |> 
  ggplot(aes(x = yearID, y = total_hr)) + 
  geom_point(aes(color = total_so, size = total_bb)) +
  geom_line(color = "darkred", linetype = "dashed") +
  scale_color_gradient(low = "darkblue", high = "gold") +
  labs(
    x = "Year",
    y = "Homeruns",
    color = "Strikeouts",
    size = "Walks",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  ) +
  theme_bw(base_size = 20) +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))

For more theme options, check out the ggthemes and hrbrthemes packages

A lesson about data visualization…

So far we’ve plotted total home runs across the years, with point size representing total walks and point color representing total strikeouts

Simpler is better. What can we do to improve and make the plot simpler?

How about creating three separate plots for home runs, strikeouts, and walks, with each mapped to the y-axis?
But how do we do this without repeating the same code?

Pivoting

Remember: data should be in tidy format

Within the tidyverse, the tidyr package offers functions for reshaping the data

pivot_longer: casts/gathers information spread out across variables
- transforms data from wide format into long format
- increase number of rows and decrease number of columns
pivot_wider: melts/spreads information out from observations
- transforms data from long format into wide format
- decrease number of rows and increase number of columns

Pivoting

yearly_batting |> 
  select(yearID, HRs = total_hr, Strikeouts = total_so, Walks = total_bb) |> # renaming while also selecting
  pivot_longer(HRs:Walks, # can also do !yearID (to select everything but yearID)
               names_to = "stat",
               values_to = "val")

# A tibble: 441 × 3
   yearID stat         val
    <int> <chr>      <int>
 1   1876 HRs           40
 2   1876 Strikeouts   589
 3   1876 Walks        336
 4   1877 HRs           24
 5   1877 Strikeouts   726
 6   1877 Walks        345
 7   1878 HRs           23
 8   1878 Strikeouts  1081
 9   1878 Walks        364
10   1879 HRs           58
# ℹ 431 more rows

We’ve pivoted the data and created the following variables

stat, to represent the name of the batting statistics
val, to represent the total value of each statistic in each year.

Faceting

yearly_batting |>
  select(yearID, HRs = total_hr, 
         Strikeouts = total_so, Walks = total_bb) |>
  pivot_longer(HRs:Walks, 
               names_to = "stat", 
               values_to = "val") |>
  ggplot(aes(yearID, val)) +
  geom_line(color = "darkblue") +
  geom_point(alpha = 0.8, color = "darkblue") +
  facet_wrap(~ stat, scales = "free_y", ncol = 1) +
  labs(
    x = "Year", 
    y = "Total of statistic",
    title = "The rise of three true outcomes in baseball",
    caption = "Data courtesy of Lahman"
  ) +
  theme_bw(base_size = 20) +
  theme(strip.background = element_blank(),
        plot.title = element_text(hjust = 0.5, 
                                  face = "bold"))

Create a multi-panel plot faceted by a conditioning variable (in our case, stat)

Exercise

The babynames package contains a dataset (also) named babynames, which contains information on the number of children of each sex given each name from 1880 to 2017, provided by the United States Social Security Administration.

How does the popularity (in terms of frequency) of your own name (combination of name and sex) change over time? Also, stick a thick, red, vertical dashed line on the plot at your birth year (try geom_vline()).

Next, pick two other names and compare their popularity over time with your own name.

# install.packages("babynames")
# library(babynames)
# babynames |> 
#   INSERT CODE HERE

Resources

ggplot2 website: cheatsheets, FAQs, extensions, and more
TidyTuesday
David Robinson’s TidyTuesday screencasts
VizBuzz: LIVE data viz replication game show

Data visualization: the grammar of graphics and ggplot2

Motivation and background

Goals of data visualization

ALWAYS visualize your data before modeling and analysis

ALWAYS visualize your data before modeling and analysis

Viz crime?

Viz crime?

Previously…

The grammar of graphics

A walkthrough of ggplot2

First, start with the data

Specify variables and geometric object

Now, can we add another geometric layer?

Scaling axes: changing axis label breaks

Scaling axes: customizing axis limits

Scaling axes: having different axis scales

Adding a statistical summary

Mapping additional variables

Customizing mappings by layer

Changing aesthetics without mapping variables

Remember: one scale for each mapped variable

Always label your plots! (seriously…)

Custom theme

A lesson about data visualization…

Pivoting

Pivoting

Faceting

Exercise

Resources

Data visualization: the grammar of graphics and `ggplot2`

A walkthrough of `ggplot2`