Data visualization: quantitative data

SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Quantitative data

Two types of quantitative data:

Discrete: countable, with clear gaps between possible values (e.g., whole numbers only)

Examples: number of goals scored in a game, number of children in a family

Continuous: can take any value within some interval

Examples: price of houses in Pittsburgh, water temperature, wind speed

Data

Taylor Swift songs via the taylor package (data dictionary here)

library(tidyverse)
theme_set(theme_light())
library(taylor)
names(taylor_all_songs)

 [1] "album_name"          "ep"                  "album_release"      
 [4] "track_number"        "track_name"          "artist"             
 [7] "featuring"           "bonus_track"         "promotional_release"
[10] "single_release"      "track_release"       "danceability"       
[13] "energy"              "key"                 "loudness"           
[16] "mode"                "speechiness"         "acousticness"       
[19] "instrumentalness"    "liveness"            "valence"            
[22] "tempo"               "time_signature"      "duration_ms"        
[25] "explicit"            "key_name"            "mode_name"          
[28] "key_mode"            "lyrics"

taylor_all_songs <- taylor_all_songs |> 
  mutate(duration = duration_ms / 60000)

1D quantitative data

Summarizing 1D quantitative data

Center: mean, median, number and location of modes

Spread: range, variance, standard deviation, IQR, etc.

Shape: skew vs symmetry, outliers, heavy vs light tails, etc.

Compute various statistics in R with summary(), mean(), median(), quantile(), range(), sd(), var(), etc.

Example: Summarizing the duration of Taylor Swift songs

summary(taylor_all_songs$duration)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  2.198   3.543   3.930   3.992   4.349  10.217      11

sd(taylor_all_songs$duration, na.rm = TRUE)

[1] 0.7562114
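
The other functions listed above follow the same pattern. A minimal sketch on a small made-up vector (the values in `durations` are hypothetical, not taken from the Taylor Swift data), showing that each function needs `na.rm = TRUE` once a missing value is present:

```r
# Hypothetical song durations in minutes (made-up values), one of them missing
durations <- c(2.5, 3.0, 3.5, 4.0, 10.0, NA)

# Center
dur_mean <- mean(durations, na.rm = TRUE)      # pulled upward by the 10-minute value
dur_median <- median(durations, na.rm = TRUE)  # less sensitive to that extreme value

# Spread
dur_range <- range(durations, na.rm = TRUE)    # minimum and maximum
dur_iqr <- IQR(durations, na.rm = TRUE)        # 3rd quartile minus 1st quartile
dur_sd <- sd(durations, na.rm = TRUE)          # standard deviation

# Quartiles, using the default quantile algorithm
dur_quartiles <- quantile(durations, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

print(c(mean = dur_mean, median = dur_median, iqr = dur_iqr, sd = dur_sd))
print(dur_quartiles)
```

Without `na.rm = TRUE`, `mean(durations)` returns `NA`, which is why the `sd()` call above includes it.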

Boxplots visualize summary statistics

Pros:

Displays outliers, percentiles, spread, skew
Useful for side-by-side comparison

Cons:

Does not display the full distribution shape
Does not display modes

The expert weighed in…

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_boxplot() +
  theme(axis.text.y = element_blank())
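
To make the outlier display concrete, here is a sketch of the arithmetic behind a boxplot, assuming the usual convention (the default in `geom_boxplot()`) that points more than 1.5 times the IQR beyond the quartiles are drawn individually. The vector `durations` is made up for illustration.

```r
# Hypothetical durations in minutes (made-up values)
durations <- c(2.5, 3.0, 3.5, 4.0, 10.0)

# Box edges: first and third quartiles
q1 <- unname(quantile(durations, 0.25))
q3 <- unname(quantile(durations, 0.75))
iqr <- q3 - q1

# Fences: 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are plotted as individual outliers
outliers <- durations[durations < lower_fence | durations > upper_fence]

# Whiskers reach the most extreme points that are still inside the fences
whiskers <- range(durations[durations >= lower_fence & durations <= upper_fence])

print(outliers)
print(whiskers)
```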

Histograms display 1D continuous distributions

\(\displaystyle \text{# total obs.} = \sum_{j=1}^k \text{# obs. in bin }j\)

Pros:

Displays full shape of distribution
Easy to interpret

Cons:

Have to choose number of bins and bin locations (will revisit later)

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_histogram()
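
The counting identity above can be checked directly. A small sketch using base R's `hist()` with `plot = FALSE` on made-up values; the two sets of break points are arbitrary choices, included to show that bin locations change the picture while the total count stays the same.

```r
# Hypothetical durations in minutes (made-up values)
durations <- c(2.5, 3.0, 3.5, 4.0, 10.0)

# Same data, two different binnings (break points chosen arbitrarily)
wide_bins <- hist(durations, breaks = seq(2, 10, by = 2), plot = FALSE)
narrow_bins <- hist(durations, breaks = seq(2, 10, by = 1), plot = FALSE)

# Counts per bin differ between the two binnings ...
print(wide_bins$counts)
print(narrow_bins$counts)

# ... but in both cases the bin counts add up to the number of observations
print(sum(wide_bins$counts) == length(durations))
print(sum(narrow_bins$counts) == length(durations))
```

In ggplot2, the corresponding controls are the `bins`, `binwidth`, and `breaks` arguments of `geom_histogram()`.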

Display the data points directly with beeswarm plots

Pros:

Displays each data point
Easy to view full shape of distribution

Cons:

Can be overwhelming with large datasets
Which algorithm for arranging points?

library(ggbeeswarm)
taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_beeswarm(cex = 2)

Smooth summary with violin plots

Pros:

Displays full shape of distribution
Can easily layer…

taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_violin()

Smooth summary with violin plots + box plots

Pros:

Displays full shape of distribution
Can easily layer… with box plots on top

Cons:

Shows a smoothed summary of the data (a density estimate), not the data itself
Mirror image is duplicate information

taylor_all_songs |> 
  ggplot(aes(x = duration, y = "")) +
  geom_violin() +
  geom_boxplot(width = 0.4)
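
The outline of a violin is a kernel density estimate. A sketch with base R's `density()` on simulated values (the data, seed, and parameters are arbitrary assumptions) showing that the estimate is a smooth curve with area approximately 1, so it summarizes shape rather than showing individual points:

```r
set.seed(2024)

# Simulated durations in minutes (not real song data)
durations <- rnorm(200, mean = 4, sd = 0.75)

# Kernel density estimate with the default Gaussian kernel and bandwidth
kde <- density(durations)

# The curve is evaluated on an evenly spaced grid; approximate the area
# under it with a simple Riemann sum
grid_step <- diff(kde$x)[1]
area <- sum(kde$y) * grid_step

print(kde$bw)  # bandwidth chosen by the default rule
print(area)    # close to 1, as expected for a density
```

A larger bandwidth gives a smoother violin and a smaller one a bumpier violin, which is the violin-plot analogue of choosing histogram bins.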

What do visualizations of continuous distributions display?

Probability that continuous variable \(X\) takes a particular value is 0

e.g. \(P(\) duration \(= 3) = 0\) (why?)

For continuous variables, the cumulative distribution function (CDF) is \[F(x) = P(X \leq x)\]

For \(n\) observations, the empirical CDF (ECDF) can be computed based on the observed data \[\hat{F}_n(x) = \frac{\text{# obs. with variable} \leq x}{n} = \frac{1}{n} \sum_{i=1}^{n} I (x_i \leq x)\]

where \(I()\) is the indicator function, i.e. ifelse(x_i <= x, 1, 0)
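
The ECDF formula translates almost word for word into R. A minimal sketch on made-up values, where `ecdf_at()` is a hypothetical helper written for illustration and `ecdf()` is the base R equivalent:

```r
# Hypothetical durations in minutes (made-up values)
durations <- c(2.5, 3.0, 3.5, 4.0, 10.0)

# ECDF at a point x: average of the indicators I(x_i <= x)
ecdf_at <- function(x, data) {
  mean(ifelse(data <= x, 1, 0))
}

# Proportion of observations that are at most 3.5 minutes
manual <- ecdf_at(3.5, durations)

# Base R builds the same step function with ecdf()
builtin <- ecdf(durations)(3.5)

print(c(manual = manual, builtin = builtin))
```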

Display full distribution with ECDF plot

Pros:

Displays all of your data at once
As \(n \rightarrow \infty\), the ECDF \(\hat F_n(x)\) converges to the true CDF \(F(x)\)

Cons:

What are the cons?

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  stat_ecdf()
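
The convergence property listed in the pros can be illustrated by simulation. A sketch assuming standard normal data, so that the true CDF is `pnorm()`; the seed, sample sizes, and grid are arbitrary choices, and `max_ecdf_error()` is a hypothetical helper:

```r
set.seed(2024)

# Grid of points at which to compare the ECDF with the true CDF
grid <- seq(-4, 4, by = 0.01)

# Largest gap on the grid between the ECDF of n normal draws and pnorm()
max_ecdf_error <- function(n) {
  x <- rnorm(n)
  max(abs(ecdf(x)(grid) - pnorm(grid)))
}

err_small <- max_ecdf_error(50)
err_large <- max_ecdf_error(5000)

# The gap shrinks as the sample size grows
print(c(n_50 = err_small, n_5000 = err_large))
```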

Rug plots display raw data

Pros:

Displays raw data points
Useful supplement for summaries and 2D plots

Cons:

Can be overwhelming for large datasets

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_rug(alpha = 0.5)

Rug plots supplement other displays

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  geom_histogram() +
  geom_rug(alpha = 0.5)

taylor_all_songs |> 
  ggplot(aes(x = duration)) +
  stat_ecdf() +
  geom_rug(alpha = 0.5)

2D quantitative data

Summarizing 2D quantitative data

Direction/trend (positive, negative)
Strength of the relationship (strong, moderate, weak)
Linearity (linear, non-linear)

Big picture

Scatterplots are by far the most common visual
Regression analysis is by far the most popular analysis (we will have a class on this)
Relationships may vary across other variables, e.g., categorical variables

Making scatterplots

Use geom_point()
Displaying the joint (bivariate) distribution
What is the obvious flaw with this plot?

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4)

Making scatterplots: always adjust the transparency (`alpha`)

Adjust the transparency of points via alpha to visualize overlap
Provides better understanding of joint frequency
Especially important with larger datasets
See also: ggblend

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5)

Summarizing 2D quantitative data

Scatterplot

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5)

Correlation coefficient

cor(taylor_all_songs$loudness, 
    taylor_all_songs$energy, 
    use = "complete.obs")

[1] 0.7826175

Note: the default correlation returned by cor() is the Pearson correlation coefficient

Other correlations:
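
Besides Pearson, `cor()` can compute the rank-based Spearman and Kendall correlations through its `method` argument. A small sketch on made-up data where the relationship is perfectly monotone but not linear, which is the situation where the three measures disagree:

```r
# Made-up data: y increases with x, but along a curve rather than a line
x <- 1:5
y <- x^2

pearson <- cor(x, y)                        # measures linear association
spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks
kendall <- cor(x, y, method = "kendall")    # based on concordant vs discordant pairs

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

Here the rank-based measures equal 1 because the ordering of `y` matches the ordering of `x` exactly, while Pearson falls slightly below 1 because the points do not lie on a straight line.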

When the correlation’s high…

Displaying trend line: linear regression (a preview)

Display regression line for energy ~ loudness
95% confidence intervals by default
Estimating the conditional expectation of energy | loudness
- i.e., \(\mathbb{E}[\) energy \(\mid\) loudness \(]\)

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5) +
  geom_smooth(method = "lm", linewidth = 2)
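
The line drawn by `geom_smooth(method = "lm")` is a least-squares fit of the same kind that `lm()` produces. A sketch on a tiny made-up data frame (the `songs` values are hypothetical, chosen only to resemble the scales of loudness and energy) showing the fitted coefficients and a 95% confidence interval for the estimated conditional mean:

```r
# Hypothetical loudness/energy pairs (made-up values, not from the taylor package)
songs <- data.frame(
  loudness = c(-12, -10, -8, -6, -4),
  energy = c(0.30, 0.42, 0.50, 0.63, 0.70)
)

# Simple linear regression of energy on loudness
fit <- lm(energy ~ loudness, data = songs)
print(coef(fit))  # intercept and slope of the trend line

# Estimated E[energy | loudness = -7] with a 95% confidence interval
pred <- predict(
  fit,
  newdata = data.frame(loudness = -7),
  interval = "confidence"
)
print(pred)  # columns: fit, lwr, upr
```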

Summarizing 2D quantitative data

Add rug plots to supplement scatterplot

taylor_all_songs |> 
  ggplot(aes(x = loudness, y = energy)) +
  geom_point(color = "darkred", size = 4, alpha = 0.5) +
  geom_rug(alpha = 0.4)

Pairs plot

library(GGally)
taylor_all_songs |> 
  select(danceability, energy, loudness, tempo) |> 
  ggpairs()
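
A pairs plot has a numeric counterpart in the correlation matrix, which holds the pairwise Pearson correlations for every pair of columns. A sketch using the built-in `iris` data (an assumption made so the example is self-contained; the slides use `taylor_all_songs`):

```r
# Four quantitative columns from the built-in iris data
measurements <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Pairwise Pearson correlations: one row and one column per variable
cor_matrix <- cor(measurements)

print(round(cor_matrix, 2))
```

With missing values, as in the song data, `cor()` would also need a `use` argument such as `use = "complete.obs"`.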

Continuous by categorical data

Continuous by categorical: side by side plots

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, y = album_name)) +
  geom_violin() +
  geom_boxplot(width = 0.4)
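
Side-by-side plots pair naturally with per-group summary statistics. A base R sketch using `tapply()` on the built-in `iris` data (an assumption for self-containment; with the song data the grouping variable would be `album_name`):

```r
# Median and IQR of a continuous variable within each level of a categorical one
group_medians <- tapply(iris$Sepal.Length, iris$Species, median)
group_iqrs <- tapply(iris$Sepal.Length, iris$Species, IQR)

print(group_medians)
print(group_iqrs)
```

The same summaries can be written in tidyverse style with `group_by()` followed by `summarize()`.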

Continuous by categorical: color

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, color = album_name)) +
  stat_ecdf(linewidth = 1) +
  scale_color_albums() + # from the taylor package 
  theme(legend.position = "bottom")

Continuous by categorical: ridgeline plot (joyplot)

For more, see this tutorial

library(ggridges)
taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, y = album_name)) +
  geom_density_ridges(scale = 1)

What about for histograms?

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration, fill = album_name)) +
  geom_histogram(alpha = 0.6, bins = 15) +
  scale_fill_albums()

What about facets?

Difference between facet_wrap and facet_grid

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration)) +
  geom_histogram(bins = 15) +
  facet_wrap(~ album_name, nrow = 1)

What about facets?

taylor_all_songs |> 
  filter(album_name %in% c("Lover", "folklore", "evermore", "Midnights")) |>
  ggplot(aes(x = duration)) +
  geom_histogram(bins = 15) +
  facet_grid(album_name ~ ., margins = TRUE)
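
On the difference between the two: `facet_wrap()` takes the panels for one variable and wraps them into rows and columns, while `facet_grid()` places panels in a grid defined by a row variable and/or a column variable. With `margins = TRUE`, `facet_grid()` adds an extra panel that pools all groups. A sketch on the built-in `iris` data (an assumption for self-containment), assuming ggplot2 is installed:

```r
library(ggplot2)

# Shared histogram layer; the faceting is added separately below
base_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 15)

# One ribbon of panels, wrapped into a single row
p_wrap <- base_plot + facet_wrap(~ Species, nrow = 1)

# One panel per row of the grid, plus an "(all)" panel because margins = TRUE
p_grid <- base_plot + facet_grid(Species ~ ., margins = TRUE)

p_wrap
p_grid
```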
