Lab: data visualization

Reading in data

Let’s start again by reading in the data from yesterday using the read_csv() function after loading the tidyverse:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nba_stats <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/nba_stats.csv")
Rows: 657 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): player, position, team
dbl (20): age, games, games_started, minutes_played, field_goals, field_goal...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
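The last message above names two ways to quiet the column specification printout. Here is a minimal sketch of both, using a tiny made-up CSV string (the players and values are invented for illustration) so that it runs without a download; the same arguments should work with the URL above.

```r
library(readr)

# a tiny made-up CSV, wrapped in I() so readr treats it as literal data
toy_csv <- I("player,position,age\nPlayer A,PG,25\nPlayer B,C,30\n")

# option 1: let readr guess the column types, but hide the message
quiet <- read_csv(toy_csv, show_col_types = FALSE)

# option 2: state the column types yourself, which also hides the message
typed <- read_csv(
  toy_csv,
  col_types = cols(
    player = col_character(),
    position = col_character(),
    age = col_double()
  )
)
```

Stating the types explicitly takes more typing, but it makes the code fail loudly if a column ever arrives in an unexpected format.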

Previewing the data

Write code that displays the column names of nba_stats. Also, look at the first six rows of your dataset to get an idea of what these variables look like. Which variables are quantitative, and which are categorical?

# INSERT CODE HERE

Always make a bar chart…

Now we’ll use the ggplot() function to create a bar chart of the position variable. To make things easier, we provide the code for you to do this below; just uncomment the code and run it to create the bar graph. In what follows, you must answer some questions about the code and plot.

# Create the bar graph of position:
# nba_stats |>
#   ggplot(aes(x = position)) +
#   geom_bar(fill = "darkblue") +
#   labs(title = "Number of NBA players by position",
#        x = "Position",
#        y = "Number of players",
#        caption = "Source: Basketball-Reference.com")

Answer the following questions about the code and plot:

  • In general, ggplot() code takes the following format: ggplot(blank1, aes(x = blank2)). Looking at the above code, what kind of R object should blank1 be, and what should blank2 be?

  • What do you think the line geom_bar(fill = "darkblue") does?

  • What do you think the remaining lines of code do (contained in labs())?

More area plots (but bar charts are better!)

Now we’ll make a few other area plots:

  • spine chart

  • pie chart

  • rose diagram

Your goal for this part is to create each of these plots. These plots can be created by copy-and-pasting the bar chart code from above and modifying it slightly. Follow these directions to create each of these plots:

  • spine chart: First, copy-and-paste the bar chart code from above. Then, delete the fill = "darkblue" within geom_bar(). Finally, within ggplot(), replace aes(x = position) with aes(x = "", fill = position). Also, change the labels in labs() if necessary.
# PUT YOUR SPINE CHART CODE HERE
  • pie chart: First, copy-and-paste the spine chart code you just made. Then, after geom_bar(), “add” coord_polar("y"). Be sure to put plus signs before and after coord_polar("y"). Also, change the labels in labs() if necessary.
# PUT YOUR PIE CHART CODE HERE
  • rose diagram: First, copy-and-paste your original bar chart code. Then, after geom_bar(fill = "darkblue"), “add” coord_polar() + scale_y_sqrt(). Be sure to put plus signs before and after coord_polar() + scale_y_sqrt(). Also, change the labels in labs() if necessary. After you make the rose diagram: In 1-2 sentences, what do you think scale_y_sqrt() does, and what is a benefit to including scale_y_sqrt() when making the rose diagram?
# PUT YOUR ROSE DIAGRAM CODE HERE

Notes on colors in plots

Three types of color scales to work with:

  1. Qualitative: for distinguishing discrete items that don’t have an order (nominal categorical). Colors should be distinct and carry equal visual weight, with none standing out unless you want it to for emphasis.
  • Do NOT use a discrete scale on a continuous variable
  2. Sequential: for data values mapped to shades of one color, e.g., for an ordered categorical variable or a low-to-high continuous variable
  • Do NOT use a sequential scale on an unordered variable
  3. Divergent: think of it as two sequential scales joined at a natural midpoint. The midpoint could represent 0 (assuming +/- values) or 50% if your data spans the full scale
  • Do NOT use a divergent scale on data without a natural midpoint
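As a rough illustration of the three scale types, the sketch below pairs each with one ggplot2 scale function. The toy data frame and its column names (group, amount, change) are made up for this example, and the palettes are just one reasonable choice among many.

```r
library(ggplot2)

# toy data, made up for illustration only
toy <- data.frame(
  x = 1:6,
  y = c(3, 1, 4, 1, 5, 9),
  group = c("a", "b", "c", "a", "b", "c"),  # unordered categories
  amount = c(10, 20, 30, 40, 50, 60),       # low-to-high values
  change = c(-3, -2, -1, 1, 2, 3)           # values on both sides of 0
)

base_plot <- ggplot(toy, aes(x = x, y = y)) +
  theme_light()

# qualitative: distinct colors for an unordered categorical variable
p_qual <- base_plot +
  geom_point(aes(color = group), size = 3) +
  scale_color_brewer(palette = "Set2")

# sequential: shades of one color for a low-to-high continuous variable
# (direction = 1 keeps the palette in its original order)
p_seq <- base_plot +
  geom_point(aes(color = amount), size = 3) +
  scale_color_distiller(palette = "Blues", direction = 1)

# divergent: two color ramps that meet at a natural midpoint (here 0)
p_div <- base_plot +
  geom_point(aes(color = change), size = 3) +
  scale_color_gradient2(low = "darkblue", mid = "white",
                        high = "darkred", midpoint = 0)
```

Printing p_qual, p_seq, and p_div one after another shows how the legend changes from separate keys to a continuous color bar.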

Options for ggplot2 colors

The default color scheme is, to put it bluntly, pretty bad, but ggplot2 has ColorBrewer built in, which makes it easy to customize your color scales. For instance, we can make a scatterplot with three_pointers on the y-axis and offensive_rebounds on the x-axis using the geom_point() layer, with each point colored by position:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = position)) +
  geom_point(alpha = 0.5) +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Position") +
  theme_light()

What does alpha change? We can change the color palette for this plot using the scale_color_brewer() function:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers,
             color = position)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Position") +
  theme_light()

Which do you prefer, the default palette or this new one? You can check out more color palettes here.

Something you should keep in mind is to pick a color-blind friendly palette. One simple way to do this is with the ggthemes package (you need to install it before running this code!), which includes color-blind friendly palettes:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = position)) +
  geom_point(alpha = 0.5) +
  # call the function directly from the package using `::` instead of library(ggthemes)
  ggthemes::scale_color_colorblind() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Position") +
  theme_light()
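The ColorBrewer palettes also carry a color-blind flag that you can look up yourself. The sketch below assumes the RColorBrewer package is available, which should be the case whenever ggplot2 is installed since ggplot2 relies on it through the scales package.

```r
# table of all ColorBrewer palettes: max colors, category, color-blind flag
palettes <- RColorBrewer::brewer.pal.info

# keep only the qualitative palettes flagged as color-blind friendly
friendly_qual <- palettes[palettes$category == "qual" & palettes$colorblind, ]

# the row names are the values you can pass to scale_color_brewer(palette = ...)
rownames(friendly_qual)
```

Swapping "qual" for "seq" or "div" lists the sequential or divergent palettes instead.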

In terms of displaying color from low to high, the viridis scales are excellent choices (and are also color-blind friendly!). For instance, we can map another continuous variable (minutes_played) to the color:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_light()

What does this reveal about the plot? What happens if you delete scale_color_viridis_c() + from above? Which do you prefer?
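The viridis scales come in several variants. As a small sketch on made-up toy data, the option argument picks the color map and direction = -1 reverses its order; scale_color_viridis_d() is the discrete counterpart for categorical variables.

```r
library(ggplot2)

# toy data, made up for illustration only
toy <- data.frame(
  x = 1:5,
  y = c(2, 4, 3, 5, 1),
  z = c(10, 20, 30, 40, 50),
  level = factor(c("low", "low", "mid", "high", "high"),
                 levels = c("low", "mid", "high"))
)

# continuous variable: the "magma" color map, with its order reversed
p_magma <- ggplot(toy, aes(x = x, y = y, color = z)) +
  geom_point(size = 3) +
  scale_color_viridis_c(option = "magma", direction = -1) +
  theme_light()

# categorical variable with ordered levels: the discrete viridis scale
p_discrete <- ggplot(toy, aes(x = x, y = y, color = level)) +
  geom_point(size = 3) +
  scale_color_viridis_d() +
  theme_light()
```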

Notes on themes

You might have noticed that the plots above include various changes to the theme for customization. You will constantly be changing the theme of your plots to optimize the display. Fortunately, there are a number of built-in themes you can use to start with rather than the default theme_gray():

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_gray()

For instance, Quang’s go-to theme is theme_light():

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_light()

There are options such as theme_minimal():

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_minimal()

or theme_classic():

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_classic()

or theme_bw():

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_bw()

There are also packages with popular themes, such as the ggthemes package which includes, for example, theme_economist():

library(ggthemes)
nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_economist()

and theme_fivethirtyeight(), to name a couple:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       color = "Minutes played") +
  theme_fivethirtyeight()

With any theme you have picked, you can then modify specific components directly using the theme() layer. There are many aspects of the plot’s theme to modify, such as my decision to move the legend to the bottom of the figure, drop the legend title, increase the font size of the y-axis text, and shrink the x-axis text:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       title = "Joint distribution of three-pointers and offensive rebounds",
       subtitle = "NBA statistics from 2021-2022 season",
       color = "Minutes played") +
  theme_light() +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        axis.text.y = element_text(size = 14),
        axis.text.x = element_text(size = 6))

If you’re tired of explicitly customizing every plot in the same way all the time, then you should make a custom theme. It’s quite easy to make a custom theme for ggplot2, and of course there are an incredible number of ways to customize your theme. Below, we modify theme_bw() using the %+replace% operator to create a new customized theme named theme_cus(), which is stored as a function:

theme_cus <- function() {
  # start from theme_bw() with a base font size of 10, then replace the elements below
  theme_bw(base_size = 10) %+replace%
    theme(
      panel.background  = element_blank(),
      plot.background = element_rect(fill = "transparent", color = NA), 
      legend.position = "bottom",
      legend.background = element_rect(fill = "transparent", color = NA),
      legend.key = element_rect(fill = "transparent", color = NA),
      axis.ticks = element_blank(),
      panel.grid.major = element_line(color = "grey90", linewidth = 0.3), 
      panel.grid.minor = element_blank(),
      plot.title = element_text(size = 15, hjust = 0, vjust = 0.5, face = "bold", 
                                margin = margin(b = 0.2, unit = "cm")),
      plot.subtitle = element_text(size = 12, hjust = 0, vjust = 0.5, 
                                   margin = margin(b = 0.2, unit = "cm")),
      plot.caption = element_text(size = 7, hjust = 1, face = "italic", 
                                  margin = margin(t = 0.1, unit = "cm")),
      axis.text.x = element_text(size = 13),
      axis.text.y = element_text(size = 13)
    )
}
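The %+replace% operator behaves differently from the usual +. Here is a minimal sketch of the difference, assuming the ggplot2 3.5.x behavior loaded at the top of this lab: + updates only the properties you name, while %+replace% swaps out the whole element, so anything you leave unspecified is reset.

```r
library(ggplot2)

# `+` merges: face changes, every other text property is kept
add_el <- theme_gray() + theme(text = element_text(face = "bold"))

# `%+replace%` replaces: face is set, the other properties become NULL
rep_el <- theme_gray() %+replace% theme(text = element_text(face = "bold"))

add_el$text$size  # still the base font size from theme_gray()
rep_el$text$size  # NULL, because the whole element was replaced
```

This is why each element_text() in theme_cus() spells out the properties it needs instead of relying on values carried over from theme_bw().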

Create the plot from before with this theme:

nba_stats |>
  ggplot(aes(x = offensive_rebounds, y = three_pointers, color = minutes_played)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c() +
  labs(x = "Offensive rebounds", 
       y = "Three-pointers",
       title = "Joint distribution of three-pointers and offensive rebounds",
       subtitle = "NBA statistics from 2021-2022 season",
       color = "Minutes played") +
  theme_cus()
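If you want one theme for every plot in a session, ggplot2's theme_set() changes the default so you don't have to add a theme layer each time. The sketch below uses theme_light() and a made-up toy data frame as stand-ins so that it runs on its own; theme_cus() could be passed to theme_set() in the same way.

```r
library(ggplot2)

# theme_set() returns the previously active theme, so we can restore it later
old_theme <- theme_set(theme_light())

# toy data, made up for illustration only
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 1))

# no theme layer here: the default theme is looked up when the plot is drawn
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()
print(p)  # drawn with theme_light()

# restore the original default theme
theme_set(old_theme)
```

Because the default is applied at draw time, printing p again after the last line would show the original default theme.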