Data visualization: categorical data

SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

Data

Flying ettiquette survey
Publicly available on GitHub and also via the ggmosaic package (the dataset is called fly).
What does each row represent here?

library(tidyverse)
theme_set(theme_light()) # setting the ggplot theme
library(ggmosaic) # make sure to install it first
flying_etiquette <- fly |> 
  filter(!is.na(do_you_recline), !is.na(rude_to_recline))
names(flying_etiquette)

 [1] "id"                             "flight_freq"                   
 [3] "do_you_recline"                 "height"                        
 [5] "has_child_under_18"             "three_seats_two_arms"          
 [7] "two_seats_one_arm"              "window_shade"                  
 [9] "rude_to_move_to_unsold_seat"    "rude_to_talk_to_neighbor"      
[11] "six_hr_flight_leave_seat"       "reclining_obligation_to_behind"
[13] "rude_to_recline"                "eliminate_reclining"           
[15] "rude_to_switch_seats_friends"   "rude_to_switch_seats_family"   
[17] "rude_to_wake_neighbor_bathroom" "rude_to_wake_neighbor_walk"    
[19] "rude_to_bring_baby"             "rude_to_bring_unruly_child"    
[21] "use_electronics_takeoff"        "smoked_inflight"               
[23] "gender"                         "age"                           
[25] "household_income"               "education"                     
[27] "region"

Categorical data

Two different versions of categorical data:

Nominal: categorical variables having unordered scales

Examples: race, gender, species, etc,

Ordinal: ordered categories; levels with a meaningful order

Examples: education level, grades, ranks

Factors in `R`

In R, factors are used to work with categorical variables
R treats factors as ordinal - defaults to alphabetical
- May need to manually define the factor levels (e.g., the reference level)
See the forcats package (automatically loaded with tidyverse)

class(flying_etiquette$do_you_recline)

[1] "factor"

levels(flying_etiquette$do_you_recline)

[1] "never"               "once in a while"     "about half the time"
[4] "usually"             "always"

1D categorical data

Summarizing 1D categorical data

How often do these respondents recline?

Frequency tables (counts)

table(flying_etiquette$do_you_recline)


              never     once in a while about half the time             usually 
                170                 256                 117                 175 
             always 
                136

# flying_etiquette |> 
#   group_by(do_you_recline) |>
#   summarize(n = n(), .groups = "drop")

flying_etiquette |> 
  count(do_you_recline)

# A tibble: 5 × 2
  do_you_recline          n
  <fct>               <int>
1 never                 170
2 once in a while       256
3 about half the time   117
4 usually               175
5 always                136

Summarizing 1D categorical data

Proportion table

prop.table(table(flying_etiquette$do_you_recline))


              never     once in a while about half the time             usually 
          0.1990632           0.2997658           0.1370023           0.2049180 
             always 
          0.1592506

flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n))

# A tibble: 5 × 3
  do_you_recline          n  prop
  <fct>               <int> <dbl>
1 never                 170 0.199
2 once in a while       256 0.300
3 about half the time   117 0.137
4 usually               175 0.205
5 always                136 0.159

Visualizing 1D categorical data

Create a bar chart with geom_bar()

Map do_you_recline to the x-axis
Counts of each category are displayed on the y-axis

flying_etiquette |> 
  ggplot(aes(x = do_you_recline)) +
  geom_bar()

Behind the scenes of `geom_bar()`

start with the data
aggregate and count the number of observations in each bar
map to plot aesthetics

Visualizing 1D categorical data

Instead of geom_bar(), do this “by hand” (Quang prefers this way)

aggregate and obtain the counts first with count() or (group_by and summarize())
then use geom_col()

flying_etiquette |>
  count(do_you_recline, name = "count") |> 
  ggplot(aes(x = do_you_recline, y = count)) +
  geom_col()

Visualizing 1D categorical data

Flip your bar chart axes!

Just simply replace x with y (Quang prefers this way)

flying_etiquette |>
  ggplot(aes(y = do_you_recline)) +
  geom_bar()

Or use coord_flip()

flying_etiquette |> 
  ggplot(aes(x = do_you_recline)) +
  geom_bar() +
  coord_flip()

What does a bar chart show?

Marginal distribution: probability that a categorical variable \(X\) (e.g., do_you_recline) takes each particular category value \(x\) (always, usually, …, never)

Frequency bar charts (earlier version) give info about sample size, but this could be labeled in the chart (use geom_text() or geom_label())
Now, we create a proportion/percent bar chart to display the individual probabilities
This shows the probability mass function (PMF) for discrete variables
- (e.g. \(P(\) do_you_recline \(=\) never\()\))

flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n)) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col()   # + geom_label(aes(label = n), hjust = 1)

Population vs sample

Population: The collection of all subjects of interest

Sample: A representative subset of the population of interest

The survey respondents is just a subset of all airplane flyers

Empirical distribution: estimating the true marginal distribution with observed (sample) data

Estimate \(P(\) do_you_recline = \(C_j\)) with \(\hat p_j\) for each category \(C_j\) (e.g., \(\hat p_\texttt{always}\), …, \(\hat p_\texttt{never}\))
- Standard error for each \(\hat p_j\): \(\quad \displaystyle \text{SE}(\hat{p}_j) = \sqrt{\frac{\hat{p}_j (1 - \hat{p}_j)}{n}}\)

Adding confidence intervals to bar chart

flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(prop = n / sum(n),
         se = sqrt(prop * (1 - prop) / sum(n)),
         lower = prop - 2 * se,
         upper = prop + 2 * se) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", 
                width = 0.2, 
                linewidth = 1)

Ordering factors in a bar chart

Order the bars by proportion

(Let’s also flip the axes)

flying_etiquette |> 
  count(do_you_recline) |> 
  mutate(
    prop = n / sum(n),
    se = sqrt(prop * (1 - prop) / sum(n)),
    lower = prop - 2 * se,
    upper = prop + 2 * se,
    do_you_recline = fct_reorder(do_you_recline, prop)
  ) |> 
  ggplot(aes(x = prop, y = do_you_recline)) +
  geom_col() +
  geom_errorbar(aes(xmin = lower, xmax = upper), 
                color = "blue", 
                width = 0.2, 
                linewidth = 1)

Pie charts… don’t make them

Why?

3D pie charts?… even worse

Inference for 1D categorical data

Chi-square test for 1D categorical data

Null hypothesis: \(H_0\): \(p_1 = p_2 = \cdots = p_K\)
Test statistic: \(\displaystyle \chi^2 = \sum_{j=1}^K \frac{(O_j - E_j)^2}{E_j}\), where
- \(O_j\): observed counts in category \(j\)
- \(E_j\) : expected counts under the null (i.e., \(n/K\) or each category is equally likely to occur)

chisq.test(table(flying_etiquette$do_you_recline))


    Chi-squared test for given probabilities

data:  table(flying_etiquette$do_you_recline)
X-squared = 66.644, df = 4, p-value = 1.159e-13

Hypothesis testing in general

Computing \(p\)-values works like this:

Choose a test statistic
Compute the test statistic using the data
Is test statistic “unusual” compared to what we would expect under the null?
Compare \(p\)-value to the target error rate (“significance level”) \(\alpha\)
- Typically choose \(\alpha = 0.05\) (the origins of 0.05)

2D categorical data

Summarizing 2D categorical data

Continuing with the flying etiquette survey data, let’s look at the responses to 2 questions

do_you_recline (Do you ever recline your seat when you fly?)
rude_to_recline (Is it rude to recline your seat on a plane?)

How many levels does each variable have?

table(flying_etiquette$do_you_recline)


              never     once in a while about half the time             usually 
                170                 256                 117                 175 
             always 
                136

table(flying_etiquette$rude_to_recline)


      no somewhat      yes 
     502      281       71

Summarizing 2D categorical data

Two-way table (or contingency table, cross tabulation, crosstab)

table("Recline?" = flying_etiquette$do_you_recline, 
      "Rude to reline?" = flying_etiquette$rude_to_recline)

                     Rude to reline?
Recline?               no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3

xtabs(~ do_you_recline + rude_to_recline, data = flying_etiquette)

                     rude_to_recline
do_you_recline         no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3

Visualizing 2D categorical data

Stacked bar chart: a bar chart of spine charts

Emphasizes the marginal distribution of each category of x variable

e.g., \(P(\) rude_to_recline \(=\) somewhat \()\)

Similar to 1D bar charts, start with counting every combination of 2 variables (using count() or group_by() and summarize()), then plot with geom_col()

# flying_etiquette |>
#   ggplot(aes(x = rude_to_recline,
#              fill = do_you_recline)) +
#   geom_bar()
flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             # filled by the other categorical variable
             fill = do_you_recline)) + 
  geom_col()

Visualizing 2D categorical data

Stacked bar chart (proportion version)

flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             fill = do_you_recline)) +
  geom_col(position = "fill")

Visualizing 2D categorical data

Side-by-side (grouped, dodged) bar chart: a bar chart of bar charts

Shows the conditional distribution of fill variable given x variable

e.g., \(P(\) do_you_recline \(=\) always \(\mid\) rude_to_recline \(=\) somewhat \()\)

flying_etiquette |>
  count(rude_to_recline, do_you_recline) |>
  ggplot(aes(x = rude_to_recline, y = n, 
             fill = do_you_recline)) + 
  geom_col(position = "dodge")

Joint, marginal, and conditional probabilities

Let \(X\) = rude_to_recline and \(Y\) = do_you_recline

Joint distribution: frequency of the intersection
- e.g., \(P(X =\) somewhat \(, Y =\) always \()\)
Marginal distribution: row sums or column sums
- e.g., \(P(X =\) somewhat \()\), \(P(Y =\) always \()\)
Conditional distribution: probability event \(X\) given event \(Y\)
- e.g., \(P(X =\) somewhat \(\mid Y =\) always \()\)
\(\displaystyle \qquad \quad = \frac{P(X = \texttt{somewhat}, Y = \texttt{always})}{P(Y = \texttt{always})}\)

flying_etiquette |> 
  select(do_you_recline, rude_to_recline) |> 
  table()

                     rude_to_recline
do_you_recline         no somewhat yes
  never                35       81  54
  once in a while     116      129  11
  about half the time  82       35   0
  usually             145       27   3
  always              124        9   3

flying_etiquette |> 
  select(do_you_recline, rude_to_recline) |> 
  table() |> 
  prop.table()

                     rude_to_recline
do_you_recline                 no    somewhat         yes
  never               0.040983607 0.094847775 0.063231850
  once in a while     0.135831382 0.151053864 0.012880562
  about half the time 0.096018735 0.040983607 0.000000000
  usually             0.169789227 0.031615925 0.003512881
  always              0.145199063 0.010538642 0.003512881

Joint, marginal, and conditional probabilities

Two-way proportion table (the tidyverse way) with pivot_wider

flying_etiquette |>
  group_by(rude_to_recline, do_you_recline) |>
  summarize(joint = n() / nrow(flying_etiquette)) |>
  pivot_wider(names_from = rude_to_recline, values_from = joint, values_fill = 0)

# A tibble: 5 × 4
  do_you_recline          no somewhat     yes
  <fct>                <dbl>    <dbl>   <dbl>
1 never               0.0410   0.0948 0.0632 
2 once in a while     0.136    0.151  0.0129 
3 about half the time 0.0960   0.0410 0      
4 usually             0.170    0.0316 0.00351
5 always              0.145    0.0105 0.00351

Categorical heatmaps

Use geom_tile to display joint distribution of two categorical variables
Annotate tiles with labels of percentages using geom_text() and the scales package (a very neat package)

flying_etiquette |>
  group_by(rude_to_recline, do_you_recline) |>
  summarize(
    freq = n(), 
    joint = n() / nrow(flying_etiquette)
  ) |> 
  ggplot(aes(x = rude_to_recline, y = do_you_recline)) +
  geom_tile(aes(fill = freq), color = "white") +
  geom_text(aes(label = scales::percent(joint))) +
  scale_fill_gradient2()

Visualizing independence

Mosaic plot

spine chart of spine charts
width: marginal distribution of rude_to_recline
height: conditional distribution of do_you_recline | rude_to_recline
area: joint distribution

Using a mosaic plot to visually check for independence:

check whether all proportions are the same (the boxes line up in a grid)

flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  mosaicplot(main = "Relationship between reclining frequency and opinion on rudeness")

Visualizing independence

Mosaic plot with ggmosaic package

flying_etiquette |> 
  ggplot() +
  geom_mosaic(aes(x = product(do_you_recline, rude_to_recline), fill = do_you_recline))

Inference for 2D categorical data

Chi-square test for 2D categorical data

Null hypothesis: \(H_0\): 2 categorical variables are independent of each other
- e.g., no association between do_you_recline and rude_to_recline
Test statistic: \(\displaystyle \chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

# chisq.test(table(flying_etiquette$rude_to_recline, flying_etiquette$do_you_recline))
flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  chisq.test()


    Pearson's Chi-squared test

data:  table(select(flying_etiquette, rude_to_recline, do_you_recline))
X-squared = 316.73, df = 8, p-value < 2.2e-16

Residuals

Recall the test statistic: \(\displaystyle \chi^2 = \sum_i^{k_1} \sum_j^{k_2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\)

Define the Pearson residuals: \(\displaystyle r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}\)

Some rules of thumb:

\(r_{ij} \approx 0\): observed counts are close to expected counts
\(|r_{ij}| > 2\): significant at \(\alpha = 0.05\)
very positive \(r_{ij}\): higher than expected
very negative \(r_{ij}\): lower than expected

Residuals

Mosaic plots with boxes color-coded by Pearson residuals

Tells us which combinations of 2 categorical variables (cells) are much higher/lower than expected

flying_etiquette |> 
  select(rude_to_recline, do_you_recline) |> 
  table() |> 
  mosaicplot(main = "Relationship between reclining frequency and opinion on rudeness", shade = TRUE)

Appendix

The `janitor` package

The most popular janitor function is clean_names()… for cleaning column names

# before
iris |> 
  head()

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# after
library(janitor)
iris |> 
  clean_names() |> 
  head()

  sepal_length sepal_width petal_length petal_width species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Tabulation with the `janitor` package

The lesser-known stars of janitor: functions for tabulation of categorical data

tabyl

flying_etiquette |> 
  tabyl(do_you_recline)

      do_you_recline   n   percent
               never 170 0.1990632
     once in a while 256 0.2997658
 about half the time 117 0.1370023
             usually 175 0.2049180
              always 136 0.1592506

adorn_*() functions

flying_etiquette |> 
  tabyl(do_you_recline, rude_to_recline) |> 
  adorn_percentages("row") |> 
  adorn_pct_formatting(digits = 2) |> 
  adorn_ns()

      do_you_recline           no     somewhat         yes
               never 20.59%  (35) 47.65%  (81) 31.76% (54)
     once in a while 45.31% (116) 50.39% (129)  4.30% (11)
 about half the time 70.09%  (82) 29.91%  (35)  0.00%  (0)
             usually 82.86% (145) 15.43%  (27)  1.71%  (3)
              always 91.18% (124)  6.62%   (9)  2.21%  (3)

For more, see this overview and this tutorial

Data visualization: categorical data

Background

Data

Categorical data

Factors in R

1D categorical data

Summarizing 1D categorical data

Summarizing 1D categorical data

Visualizing 1D categorical data

Behind the scenes of geom_bar()

Visualizing 1D categorical data

Visualizing 1D categorical data

What does a bar chart show?

Population vs sample

Adding confidence intervals to bar chart

Ordering factors in a bar chart

Pie charts… don’t make them

Inference for 1D categorical data

Hypothesis testing in general

2D categorical data

Summarizing 2D categorical data

Summarizing 2D categorical data

Visualizing 2D categorical data

Visualizing 2D categorical data

Visualizing 2D categorical data

Joint, marginal, and conditional probabilities

Joint, marginal, and conditional probabilities

Categorical heatmaps

Visualizing independence

Visualizing independence

Inference for 2D categorical data

Residuals

Residuals

Beyond 2D: facets!

Appendix

The janitor package

Tabulation with the janitor package

Factors in `R`

Behind the scenes of `geom_bar()`

The `janitor` package

Tabulation with the `janitor` package