Data visualization: categorical data


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Data

  • Flying ettiquette survey

  • Publicly available on GitHub and also via the ggmosaic package (the dataset is called fly).

  • What does each row represent here?

library(tidyverse)
theme_set(theme_light()) # setting the ggplot theme
library(ggmosaic) # make sure to install it first
flying_etiquette <- fly |> 
  filter(!is.na(do_you_recline), !is.na(rude_to_recline))
names(flying_etiquette)
 [1] "id"                             "flight_freq"                   
 [3] "do_you_recline"                 "height"                        
 [5] "has_child_under_18"             "three_seats_two_arms"          
 [7] "two_seats_one_arm"              "window_shade"                  
 [9] "rude_to_move_to_unsold_seat"    "rude_to_talk_to_neighbor"      
[11] "six_hr_flight_leave_seat"       "reclining_obligation_to_behind"
[13] "rude_to_recline"                "eliminate_reclining"           
[15] "rude_to_switch_seats_friends"   "rude_to_switch_seats_family"   
[17] "rude_to_wake_neighbor_bathroom" "rude_to_wake_neighbor_walk"    
[19] "rude_to_bring_baby"             "rude_to_bring_unruly_child"    
[21] "use_electronics_takeoff"        "smoked_inflight"               
[23] "gender"                         "age"                           
[25] "household_income"               "education"                     
[27] "region"                        

Categorical data

Two different versions of categorical data:

Nominal: categorical variables having unordered scales

  • Examples: race, gender, species, etc,

Ordinal: ordered categories; levels with a meaningful order

  • Examples: education level, grades, ranks

Factors in R

  • In R, factors are used to work with categorical variables

  • R treats factors as ordinal - defaults to alphabetical

    • May need to manually define the factor levels (e.g., the reference level)
  • See the forcats package (automatically loaded with tidyverse)

class(flying_etiquette$do_you_recline)
[1] "factor"
levels(flying_etiquette$do_you_recline)
[1] "never"               "once in a while"     "about half the time"
[4] "usually"             "always"             

1D categorical data

Summarizing 1D categorical data

How often do these respondents recline?

Frequency tables (counts)

table(flying_etiquette$do_you_recline)

              never     once in a while about half the time             usually 
                170                 256                 117                 175 
             always 
                136 
# flying_etiquette |> 
#   group_by(do_you_recline) |>
#   summarize(n = n(), .groups = "drop")

flying_etiquette |> 
  count(do_you_recline)
# A tibble: 5 × 2
  do_you_recline          n
  <fct>               <int>
1 never                 170
2 once in a while       256
3 about half the time   117
4 usually               175
5 always                136

Summarizing 1D categorical data

Proportion table

prop.table(table(flying_etiquette$do_you_recline))

              never     once in a while about half the time             usually 
          0.1990632           0.2997658           0.1370023           0.2049180 
             always 
          0.1592506 
flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n))
# A tibble: 5 × 3
  do_you_recline          n  prop
  <fct>               <int> <dbl>
1 never                 170 0.199
2 once in a while       256 0.300
3 about half the time   117 0.137
4 usually               175 0.205
5 always                136 0.159

Visualizing 1D categorical data

Create a bar chart with geom_bar()

  • Map do_you_recline to the x-axis

  • Counts of each category are displayed on the y-axis

flying_etiquette |> 
  ggplot(aes(x = do_you_recline)) +
  geom_bar()

Behind the scenes of geom_bar()

  • start with the data
  • aggregate and count the number of observations in each bar
  • map to plot aesthetics

Visualizing 1D categorical data

Instead of geom_bar(), do this “by hand” (Quang prefers this way)

  • aggregate and obtain the counts first with count() or (group_by and summarize())

  • then use geom_col()

flying_etiquette |>
  count(do_you_recline, name = "count") |> 
  ggplot(aes(x = do_you_recline, y = count)) +
  geom_col()

Visualizing 1D categorical data

Flip your bar chart axes!

Just simply replace x with y (Quang prefers this way)

flying_etiquette |>
  ggplot(aes(y = do_you_recline)) +
  geom_bar()

Or use coord_flip()

flying_etiquette |> 
  ggplot(aes(x = do_you_recline)) +
  geom_bar() +
  coord_flip()

What does a bar chart show?

Marginal distribution: probability that a categorical variable \(X\) (e.g., do_you_recline) takes each particular category value \(x\) (always, usually, …, never)

  • Frequency bar charts (earlier version) give info about sample size, but this could be labeled in the chart (use geom_text() or geom_label())

  • Now, we create a proportion/percent bar chart to display the individual probabilities

  • This shows the probability mass function (PMF) for discrete variables

    • (e.g. \(P(\) do_you_recline \(=\) never\()\))
flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n)) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col()   # + geom_label(aes(label = n), hjust = 1)

Population vs sample

Population: The collection of all subjects of interest

Sample: A representative subset of the population of interest

The survey respondents is just a subset of all airplane flyers

Empirical distribution: estimating the true marginal distribution with observed (sample) data

  • Estimate \(P(\) do_you_recline = \(C_j\)) with \(\hat p_j\) for each category \(C_j\) (e.g., \(\hat p_\texttt{always}\), …, \(\hat p_\texttt{never}\))

    • Standard error for each \(\hat p_j\): \(\quad \displaystyle \text{SE}(\hat{p}_j) = \sqrt{\frac{\hat{p}_j (1 - \hat{p}_j)}{n}}\)

Adding confidence intervals to bar chart


flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", 
                width = 0.2, 
                linewidth = 1)

Ordering factors in a bar chart

Order the bars by proportion

(Let’s also flip the axes)

flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(
    prop = n / sum(n),
    se = sqrt(prop * (1 - prop) / sum(n)),
    lower = prop - 2 * se,
    upper = prop + 2 * se,
    do_you_recline = fct_reorder(do_you_recline, prop)
  ) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", 
                width = 0.2, 
                linewidth = 1)

Pie charts… don’t make them

Why?

3D pie charts?… even worse

Inference for 1D categorical data

Chi-square test for 1D categorical data

  • Null hypothesis: \(H_0\): \(p_1 = p_2 = \cdots = p_K\)

  • Test statistic: \(\displaystyle \chi^2 = \sum_{j=1}^K \frac{(O_j - E_j)^2}{E_j}\), where

    • \(O_j\): observed counts in category \(j\)

    • \(E_j\) : expected counts under the null (i.e., \(n/K\) or each category is equally likely to occur)

chisq.test(table(flying_etiquette$do_you_recline))

    Chi-squared test for given probabilities

data:  table(flying_etiquette$do_you_recline)
X-squared = 66.644, df = 4, p-value = 1.159e-13

Hypothesis testing in general

Computing \(p\)-values works like this:

  • Choose a test statistic

  • Compute the test statistic using the data

  • Is test statistic “unusual” compared to what we would expect under the null?

  • Compare \(p\)-value to the target error rate (“significance level”) \(\alpha\)

2D categorical data

Summarizing 2D categorical data

Continuing with the flying etiquette survey data, let’s look at the responses to 2 questions

  • do_you_recline (Do you ever recline your seat when you fly?)

  • rude_to_recline (Is it rude to recline your seat on a plane?)

How many levels does each variable have?

table(flying_etiquette$do_you_recline)

              never     once in a while about half the time             usually 
                170                 256                 117                 175 
             always 
                136 
table(flying_etiquette$rude_to_recline)

      no somewhat      yes 
     502      281       71 

Summarizing 2D categorical data

Two-way table (or contingency table, cross tabulation, crosstab)

table("Recline?" = flying_etiquette$do_you_recline, 
      "Rude to reline?" = flying_etiquette$rude_to_recline)
                     Rude to reline?
Recline?               no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3
xtabs(~ do_you_recline + rude_to_recline, data = flying_etiquette)
                     rude_to_recline
do_you_recline         no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3

Visualizing 2D categorical data

Stacked bar chart: a bar chart of spine charts

Emphasizes the marginal distribution of each category of x variable

  • e.g., \(P(\) rude_to_recline \(=\) somewhat \()\)

Similar to 1D bar charts, start with counting every combination of 2 variables (using count() or group_by() and summarize()), then plot with geom_col()

# flying_etiquette |>
#   ggplot(aes(x = rude_to_recline,
#              fill = do_you_recline)) +
#   geom_bar()
flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             # filled by the other categorical variable
             fill = do_you_recline)) + 
  geom_col()

Visualizing 2D categorical data

Stacked bar chart (proportion version)

flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             fill = do_you_recline)) +
  geom_col(position = "fill")

Visualizing 2D categorical data

Side-by-side (grouped, dodged) bar chart: a bar chart of bar charts

Shows the conditional distribution of fill variable given x variable

  • e.g., \(P(\) do_you_recline \(=\) always \(\mid\) rude_to_recline \(=\) somewhat \()\)
flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             fill = do_you_recline)) + 
  geom_col(position = "dodge")

Joint, marginal, and conditional probabilities

Let \(X\) = rude_to_recline and \(Y\) = do_you_recline

  • Joint distribution: frequency of the intersection

    • e.g., \(P(X =\) somewhat \(, Y =\) always \()\)
  • Marginal distribution: row sums or column sums

    • e.g., \(P(X =\) somewhat \()\), \(P(Y =\) always \()\)
  • Conditional distribution: probability event \(X\) given event \(Y\)

    • e.g., \(P(X =\) somewhat \(\mid Y =\) always \()\)

    \(\displaystyle \qquad \quad = \frac{P(X = \texttt{somewhat}, Y = \texttt{always})}{P(Y = \texttt{always})}\)

flying_etiquette |> 
  select(do_you_recline, rude_to_recline) |> 
  table()
                     rude_to_recline
do_you_recline         no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3
flying_etiquette |> 
  select(do_you_recline, rude_to_recline) |> 
  table() |> 
  prop.table()
                     rude_to_recline
do_you_recline                 no    somewhat         yes
  never               0.040983607 0.094847775 0.063231850
  once in a while     0.135831382 0.151053864 0.012880562
  about half the time 0.096018735 0.040983607 0.000000000
  usually             0.169789227 0.031615925 0.003512881
  always              0.145199063 0.010538642 0.003512881

Joint, marginal, and conditional probabilities

Two-way proportion table (the tidyverse way) with pivot_wider

flying_etiquette |>
  group_by(rude_to_recline, do_you_recline) |>
  summarize(joint = n() / nrow(flying_etiquette)) |>
  pivot_wider(names_from = rude_to_recline, values_from = joint, values_fill = 0)
# A tibble: 5 × 4
  do_you_recline          no somewhat     yes
  <fct>                <dbl>    <dbl>   <dbl>
1 never               0.0410   0.0948 0.0632 
2 once in a while     0.136    0.151  0.0129 
3 about half the time 0.0960   0.0410 0      
4 usually             0.170    0.0316 0.00351
5 always              0.145    0.0105 0.00351

Categorical heatmaps

  • Use geom_tile to display joint distribution of two categorical variables

  • Annotate tiles with labels of percentages using geom_text() and the scales package (a very neat package)

flying_etiquette |>
  group_by(rude_to_recline, do_you_recline) |>
  summarize(
    freq = n(), 
    joint = n() / nrow(flying_etiquette)
  ) |> 
  ggplot(aes(x = rude_to_recline, y = do_you_recline)) +
  geom_tile(aes(fill = freq), color = "white") +
  geom_text(aes(label = scales::percent(joint))) +
  scale_fill_gradient2()

Visualizing independence

Mosaic plot

  • spine chart of spine charts

  • width: marginal distribution of rude_to_recline

  • height: conditional distribution of do_you_recline | rude_to_recline

  • area: joint distribution

Using a mosaic plot to visually check for independence:

  • check whether all proportions are the same (the boxes line up in a grid)
flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  mosaicplot(main = "Relationship between reclining frequency and opinion on rudeness")

Visualizing independence

Mosaic plot with ggmosaic package

flying_etiquette |> 
  ggplot() +
  geom_mosaic(aes(x = product(do_you_recline, rude_to_recline), fill = do_you_recline))

Inference for 2D categorical data

Chi-square test for 2D categorical data

  • Null hypothesis: \(H_0\): 2 categorical variables are independent of each other

    • e.g., no association between do_you_recline and rude_to_recline
  • Test statistic: \(\displaystyle \chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

# chisq.test(table(flying_etiquette$rude_to_recline, flying_etiquette$do_you_recline))
flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  chisq.test()

    Pearson's Chi-squared test

data:  table(select(flying_etiquette, rude_to_recline, do_you_recline))
X-squared = 316.73, df = 8, p-value < 2.2e-16

Residuals

Recall the test statistic: \(\displaystyle \chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

Define the Pearson residuals: \(\displaystyle r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\)

Some rules of thumb:

  • \(r_{ij} \approx 0\): observed counts are close to expected counts

  • \(|r_{ij}| > 2\): significant at \(\alpha = 0.05\)

  • very positive \(r_{ij}\): higher than expected

  • very negative \(r_{ij}\): lower than expected

Residuals

Mosaic plots with boxes color-coded by Pearson residuals

Tells us which combinations of 2 categorical variables (cells) are much higher/lower than expected

flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  mosaicplot(main = "Relationship between reclining frequency and opinion on rudeness", shade = TRUE)

Beyond 2D: facets!

flying_etiquette %>%
  ggplot(aes(x = rude_to_recline, fill = do_you_recline)) + 
  geom_bar() +
  facet_wrap(~ flight_freq)

Appendix

The janitor package

The most popular janitor function is clean_names()… for cleaning column names

# before
iris |> 
  head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
# after
library(janitor)
iris |> 
  clean_names() |> 
  head()
  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Tabulation with the janitor package

The lesser-known stars of janitor: functions for tabulation of categorical data

tabyl

flying_etiquette |> 
  tabyl(do_you_recline)
      do_you_recline   n   percent
               never 170 0.1990632
     once in a while 256 0.2997658
 about half the time 117 0.1370023
             usually 175 0.2049180
              always 136 0.1592506

adorn_*() functions

flying_etiquette |> 
  tabyl(do_you_recline, rude_to_recline) |> 
  adorn_percentages("row") |> 
  adorn_pct_formatting(digits = 2) |> 
  adorn_ns()
      do_you_recline           no     somewhat         yes
               never 20.59%  (35) 47.65%  (81) 31.76% (54)
     once in a while 45.31% (116) 50.39% (129)  4.30% (11)
 about half the time 70.09%  (82) 29.91%  (35)  0.00%  (0)
             usually 82.86% (145) 15.43%  (27)  1.71%  (3)
              always 91.18% (124)  6.62%   (9)  2.21%  (3)

For more, see this overview and this tutorial