Data visualization: quantitative data


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Quantitative data

Two types of quantitative data:

Discrete: countable, with clear gaps between possible values (e.g., whole numbers only)

  • Examples: number of goals scored in a game, number of children in a family

Continuous: can take any value within some interval

  • Examples: price of houses in Pittsburgh, water temperature, wind speed

Data

Taylor Swift songs via the taylor package (data dictionary here)

library(tidyverse)
theme_set(theme_light())
library(taylor)
names(taylor_all_songs)
 [1] "album_name"          "ep"                  "album_release"      
 [4] "track_number"        "track_name"          "artist"             
 [7] "featuring"           "bonus_track"         "promotional_release"
[10] "single_release"      "track_release"       "danceability"       
[13] "energy"              "key"                 "loudness"           
[16] "mode"                "speechiness"         "acousticness"       
[19] "instrumentalness"    "liveness"            "valence"            
[22] "tempo"               "time_signature"      "duration_ms"        
[25] "explicit"            "key_name"            "mode_name"          
[28] "key_mode"            "lyrics"             
# convert song duration from milliseconds to minutes
taylor_all_songs <- taylor_all_songs |> 
  mutate(duration = duration_ms / 60000)

1D quantitative data

Summarizing 1D quantitative data

  • Center: mean, median, number and location of modes
  • Spread: range, variance, standard deviation, IQR, etc.
  • Shape: skew vs symmetry, outliers, heavy vs light tails, etc.

Compute various statistics in R with summary(), mean(), median(), quantile(), range(), sd(), var(), etc.

Example: Summarizing the duration of Taylor Swift songs

summary(taylor_all_songs$duration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  2.198   3.543   3.930   3.992   4.349  10.217      11 
sd(taylor_all_songs$duration, na.rm = TRUE)
[1] 0.7562114
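
The other functions listed above work the same way. As a minimal sketch, the following uses a small made-up vector of durations (hypothetical values, not taken from taylor_all_songs) to compute measures of center and spread:

```r
# Hypothetical song durations in minutes (illustrative values only)
toy_duration <- c(2.5, 3.0, 3.5, 4.0, 10.0, NA)

# Center: the one long song pulls the mean above the median
mean(toy_duration, na.rm = TRUE)
median(toy_duration, na.rm = TRUE)

# Spread
range(toy_duration, na.rm = TRUE)  # minimum and maximum
IQR(toy_duration, na.rm = TRUE)    # 3rd quartile minus 1st quartile
var(toy_duration, na.rm = TRUE)    # sample variance
sd(toy_duration, na.rm = TRUE)     # square root of the sample variance

# Arbitrary percentiles
quantile(toy_duration, probs = c(0.1, 0.5, 0.9), na.rm = TRUE)
```

Note that na.rm = TRUE is needed whenever the variable contains missing values; otherwise most of these functions return NA.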

Boxplots visualize summary statistics

Pros:

  • Displays outliers, percentiles, spread, skew

  • Useful for side-by-side comparison

Cons:

  • Does not display the full distribution shape

  • Does not display modes

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_boxplot() +
  theme(axis.text.y = element_blank())
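
Points drawn individually in a boxplot are typically flagged with the 1.5 × IQR rule, which is the convention geom_boxplot() documents for its whiskers. A minimal sketch of that rule, computed by hand on a made-up vector (hypothetical values, not the Taylor Swift data):

```r
# Hypothetical durations in minutes (illustrative values only)
toy_duration <- c(2.5, 3.0, 3.5, 4.0, 10.0)

# First and third quartiles, and the interquartile range
quartiles <- quantile(toy_duration, probs = c(0.25, 0.75))
iqr <- unname(diff(quartiles))

# Fences: 1.5 * IQR below the first quartile and above the third
lower_fence <- unname(quartiles[1]) - 1.5 * iqr
upper_fence <- unname(quartiles[2]) + 1.5 * iqr

# Observations beyond the fences would be drawn as individual points
flagged <- toy_duration[toy_duration < lower_fence | toy_duration > upper_fence]
flagged
```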

Histograms display 1D continuous distributions

\(\displaystyle \text{# total obs.} = \sum_{j=1}^k \text{# obs. in bin }j\)

Pros:

  • Displays full shape of distribution

  • Easy to interpret

Cons:

  • Have to choose number of bins and bin locations (will revisit later)

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_histogram()
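
To make the bin-count identity and the effect of bin choice concrete, here is a minimal base R sketch on made-up durations (hypothetical values). It uses hist() with plot = FALSE to get the counts only; in ggplot2 the corresponding controls are the bins and binwidth arguments of geom_histogram().

```r
# Hypothetical durations in minutes (illustrative values only)
toy_duration <- c(2.2, 3.1, 3.4, 3.5, 3.9, 4.0, 4.3, 4.8, 5.6, 10.2)

# Same data, two bin widths: 3 minutes vs. 0.5 minutes
coarse <- hist(toy_duration, breaks = seq(2, 11, by = 3), plot = FALSE)
fine   <- hist(toy_duration, breaks = seq(2, 11, by = 0.5), plot = FALSE)

# The counts across bins always add up to the number of observations
sum(coarse$counts) == length(toy_duration)
sum(fine$counts) == length(toy_duration)

# But the apparent shape depends on the bins
coarse$counts
fine$counts
```

With wide bins almost everything lands in one bar; with narrow bins the gap between the bulk of the data and the longest value becomes visible.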

Display the data points directly with beeswarm plots

Pros:

  • Displays each data point

  • Easy to view full shape of distribution

Cons:

  • Can be overwhelming with large datasets

  • Must choose an algorithm for arranging the points

library(ggbeeswarm)
taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_beeswarm(cex = 2)

Smooth summary with violin plots

Pros:

  • Displays full shape of distribution

  • Can easily layer…

taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_violin()

Smooth summary with violin plots + box plots

Pros:

  • Displays full shape of distribution

  • Can easily layer… with box plots on top

Cons:

  • Summary of data via density estimate

  • Mirror image is duplicate information

taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_violin() +
  geom_boxplot(width = 0.4)

What do visualizations of continuous distributions display?

The probability that a continuous variable \(X\) takes any particular value is 0

e.g. \(P(\) duration \(= 3) = 0\) (why?)
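
One way to see this, assuming \(X\) has a probability density function \(f\) (an assumption added here for illustration): probabilities are areas under \(f\), and an interval of zero width has zero area. \[P(X = 3) = \int_{3}^{3} f(t)\,dt = 0\]

Only intervals can carry positive probability, e.g. \(P(2.9 \leq X \leq 3.1) = \int_{2.9}^{3.1} f(t)\,dt\).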

For continuous variables, the cumulative distribution function (CDF) is \[F(x) = P(X \leq x)\]

For \(n\) observations, the empirical CDF (ECDF) can be computed based on the observed data \[\hat{F}_n(x) = \frac{\text{# obs. with variable} \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n} I (x_i \leq x)\]

where \(I()\) is the indicator function, i.e. ifelse(x_i <= x, 1, 0)
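
The ECDF formula can be checked by hand. The sketch below uses a made-up vector (hypothetical values) and a helper named ecdf_by_hand(), which is introduced here only for illustration, and compares it with base R's ecdf():

```r
# Hypothetical durations in minutes (illustrative values only)
toy_duration <- c(2.5, 3.0, 3.5, 4.0, 10.0)

# Proportion of observations less than or equal to x:
# the mean of the indicator I(x_i <= x)
ecdf_by_hand <- function(x, obs) {
  mean(ifelse(obs <= x, 1, 0))
}
ecdf_by_hand(3.5, toy_duration)

# ecdf() returns a function that evaluates the same quantity
F_hat <- ecdf(toy_duration)
F_hat(3.5)
```

Both calls give the proportion of the five values that are at most 3.5, which is 3 out of 5.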

Display full distribution with ECDF plot

Pros:

  • Displays all of your data at once

  • As \(n \rightarrow \infty\), the ECDF \(\hat F_n(x)\) converges to the true CDF \(F(x)\)

Cons:

  • What are the cons?

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  stat_ecdf()

Rug plots display raw data

Pros:

  • Displays raw data points

  • Useful supplement for summaries and 2D plots

Cons:

  • Can be overwhelming for large datasets

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_rug(alpha = 0.5)

Rug plots supplement other displays

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_histogram() +
  geom_rug(alpha = 0.5)

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  stat_ecdf() +
  geom_rug(alpha = 0.5)

2D quantitative data

Summarizing 2D quantitative data

  • Direction/trend (positive, negative)

  • Strength of the relationship (strong, moderate, weak)

  • Linearity (linear, non-linear)

Big picture

  • Scatterplots are by far the most common visual

  • Regression analysis is by far the most popular analysis (we will have a class on this)

  • Relationships may vary across other variables, e.g., categorical variables

Making scatterplots

  • Use geom_point()

  • Displaying the joint (bivariate) distribution

  • What is the obvious flaw with this plot?

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4)

Making scatterplots: always adjust the transparency (alpha)

  • Adjust the transparency of points via alpha to visualize overlap

  • Provides better understanding of joint frequency

  • Especially important with larger datasets

  • See also: ggblend

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5)

Summarizing 2D quantitative data

  • Scatterplot

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5)

  • Correlation coefficient

cor(taylor_all_songs$loudness, 
    taylor_all_songs$energy, 
    use = "complete.obs")
[1] 0.7826175

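Note: by default, cor() returns the Pearson correlation coefficient, which measures linear association

Other correlations: Spearman and Kendall (both rank-based), available via the method argument of cor()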
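
As a minimal sketch of how correlation coefficients can differ, the following uses two made-up vectors (hypothetical values) with a monotone but non-linear relationship. It computes Pearson by hand and the rank-based Spearman and Kendall coefficients through the method argument of cor():

```r
# Hypothetical data: y always increases with x, but not along a line
x <- c(1, 2, 3, 4, 5)
y <- c(1, 4, 9, 16, 100)

# Pearson: covariance scaled by both standard deviations
pearson_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
pearson_default <- cor(x, y)  # method = "pearson" is the default

# Rank-based alternatives only use the ordering of the values
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

c(pearson = pearson_default, spearman = spearman, kendall = kendall)
```

Because the ordering of x and y agrees perfectly, the rank-based coefficients equal 1, while Pearson is smaller since the relationship is not linear.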

Displaying trend line: linear regression (a preview)

  • Display regression line for energy ~ loudness

  • 95% confidence intervals by default

  • Estimating the conditional expectation of energy | loudness

    • i.e., \(\mathbb{E}[\text{energy} \mid \text{loudness}]\)

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5) +
  geom_smooth(method = "lm", linewidth = 2)
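
The line drawn by geom_smooth(method = "lm") comes from an ordinary lm() fit of y on x. A minimal sketch of the equivalent computation, using a small made-up data frame (hypothetical values standing in for the loudness and energy columns):

```r
# Hypothetical loudness (dB) and energy values (illustrative only)
toy <- data.frame(
  loudness = c(-12, -10, -8, -6, -4),
  energy   = c(0.30, 0.42, 0.55, 0.61, 0.78)
)

# Fit the regression line energy ~ loudness
fit <- lm(energy ~ loudness, data = toy)
coef(fit)  # intercept and slope

# Estimated E[energy | loudness] with a 95% confidence interval
new_points <- data.frame(loudness = c(-9, -5))
ci <- predict(fit, newdata = new_points, interval = "confidence", level = 0.95)
ci
```

The fit column is the height of the line at each loudness value; lwr and upr are the bounds of the confidence band at that point.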

Summarizing 2D quantitative data

Add rug plots to supplement scatterplot

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5) +
  geom_rug(alpha = 0.4)

Pairs plot

library(GGally)
taylor_all_songs |> 
  select(danceability, energy, loudness, tempo) |> 
  ggpairs()
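
A numeric companion to the pairs plot is the matrix of pairwise correlations. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the example runs without extra packages); the same call should work on the four selected song columns.

```r
# Stand-in data: four continuous columns from the built-in mtcars dataset
toy <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Pairwise Pearson correlations, dropping rows with missing values
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.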

Continuous by categorical data

Continuous by categorical: side by side plots

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, y = album_name)) +
  geom_violin() +
  geom_boxplot(width = 0.4)

Continuous by categorical: color

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, color = album_name)) +
  stat_ecdf(linewidth = 1) +
  scale_color_albums() + # from the taylor package 
  theme(legend.position = "bottom")

Continuous by categorical: ridgeline plot (joyplot)

For more, see this tutorial

library(ggridges)
taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, y = album_name)) +
  geom_density_ridges(scale = 1)

What about for histograms?

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, fill = album_name)) +
  geom_histogram(alpha = 0.6, bins = 15) +
  scale_fill_albums()

What about facets?

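Difference between facet_wrap() and facet_grid(): facet_wrap() wraps a one-dimensional sequence of panels into rows and columns, while facet_grid() lays panels out in a grid defined by row and/or column variables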

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ album_name, nrow = 1)

What about facets?

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration)) +
  geom_histogram(bins = 15) +
  facet_grid(album_name ~ ., margins = TRUE)