Unsupervised learning: hierarchical clustering


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

The big picture

  • \(k\)-means clustering: partition the observations into a pre-specified number of clusters
  • Hierarchical clustering: does not require committing to a particular number of clusters in advance

    • In fact, we end up with a tree-like visual representation of the observations, called a dendrogram

    • This allows us to view at once the clusterings obtained for each possible number of clusters

    • Common approach: agglomerative (bottom-up) hierarchical clustering, which builds a dendrogram starting from the leaves and combining clusters up to the trunk

    • There’s also divisive (top-down) hierarchical clustering: start with one large cluster and then break the cluster recursively into smaller and smaller pieces

Data: County-level health indicators for Utah in 2014

library(tidyverse)
theme_set(theme_light())
utah_health <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/utah_health.csv")
glimpse(utah_health)
Rows: 30
Columns: 20
$ County                 <chr> "Tooele", "Uintah", "Wasatch", "Statewide", "Sa…
$ Population             <dbl> 59870, 34524, 25273, 2855287, 27906, 1524, 7221…
$ PercentUnder18         <dbl> 35.3, 33.6, 33.1, 31.1, 29.4, 27.8, 23.4, 33.6,…
$ PercentOver65          <dbl> 7.9, 9.1, 9.1, 9.5, 12.3, 23.7, 20.5, 11.0, 10.…
$ DiabeticRate           <dbl> 9, 8, 6, 7, 8, 10, 10, 8, 8, 7, 7, 9, 6, 4, 9, …
$ HIVRate                <dbl> 37, 21, 28, 111, 22, NA, NA, NA, 72, 36, NA, NA…
$ PrematureMortalityRate <dbl> 347.6, 396.2, 227.2, 286.7, 314.3, 445.8, 293.6…
$ InfantMortalityRate    <dbl> 5.0, 5.2, NA, 5.0, NA, NA, NA, NA, 5.8, 7.1, NA…
$ ChildMortalityRate     <dbl> 47.7, 57.4, 69.0, 52.9, 61.2, NA, NA, 70.4, 58.…
$ LimitedAccessToFood    <dbl> 14, 14, 14, 17, 17, 15, 14, 14, 16, 19, 11, 18,…
$ FoodInsecure           <dbl> 7, 5, 1, 5, 4, 41, 16, 8, 6, 12, 0, 14, 4, 4, 8…
$ MotorDeathRate         <dbl> 18, 23, 16, 11, 18, NA, 21, 36, 10, 13, NA, NA,…
$ DrugDeathRate          <dbl> 18, 13, 17, 17, 16, NA, 26, 18, 20, 18, NA, NA,…
$ Uninsured              <dbl> 17, 24, 24, 20, 24, 26, 19, 23, 20, 26, 14, 26,…
$ UninsuredChildren      <dbl> 10, 16, 16, 11, 15, 17, 12, 14, 12, 15, 10, 20,…
$ HealthCareCosts        <dbl> 9095, 7086, 8327, 8925, 8942, 7824, 8121, 8527,…
$ CouldNotSeeDr          <dbl> 13, 12, 13, 13, 13, NA, NA, 13, 11, 13, 13, NA,…
$ MedianIncome           <dbl> 61927, 60419, 62014, 57067, 43921, 36403, 43128…
$ ChildrenFreeLunch      <dbl> 34, 36, 30, 31, 38, 58, 31, 34, 37, 40, 12, 33,…
$ HomicideRate           <dbl> NA, NA, NA, 2, NA, NA, NA, NA, 2, NA, NA, NA, 1…

General setup

  • Given a dataset with \(p\) variables (columns) and \(n\) observations (rows) \(x_1,\dots,x_n\)

  • Compute the distance/dissimilarity between observations

  • e.g. Euclidean distance between observations \(i\) and \(j\)

\[d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}\]
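
As a quick sanity check of this formula, here is a minimal sketch (the vectors x_i and x_j below are made-up observations with \(p = 3\) variables) comparing a manual computation with R's built-in dist():

# two made-up observations with p = 3 variables each
x_i <- c(5.0, 2.1, 10.3)
x_j <- c(3.5, 4.0, 9.8)

# Euclidean distance from the formula above
sqrt(sum((x_i - x_j)^2))

# the same value via dist() on a two-row matrix
dist(rbind(x_i, x_j))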

What are the distances between these counties using PercentOver65 (percent of county population that is 65 and over) and DiabeticRate (prevalence of diabetes)?


utah_health |> 
  ggplot(aes(x = PercentOver65, y = DiabeticRate)) +
  geom_point(size = 4)

Remember to standardize!

utah_health <- utah_health |> 
  mutate(
    std_pct_over65 = as.numeric(scale(PercentOver65)),
    std_diabetic_rate = as.numeric(scale(DiabeticRate))
  )

utah_health |> 
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate)) +
  geom_point(size = 4) +
  coord_fixed()

Compute the distance matrix using dist()

  • Compute pairwise Euclidean distances
county_dist <- utah_health |> 
  select(std_pct_over65, std_diabetic_rate) |> 
  dist()
  • Returns an object of dist class… but not a matrix

  • Convert to a matrix, then set the row and column names:

county_dist_matrix <- as.matrix(county_dist)
rownames(county_dist_matrix) <- utah_health$County
colnames(county_dist_matrix) <- utah_health$County
county_dist_matrix[1:4, 1:4]
             Tooele    Uintah   Wasatch Statewide
Tooele    0.0000000 0.7474104 2.1010897 1.4367342
Uintah    0.7474104 0.0000000 1.3885164 0.7003632
Wasatch   2.1010897 1.3885164 0.0000000 0.7003632
Statewide 1.4367342 0.7003632 0.7003632 0.0000000
  • Convert to a long table with pivot_longer for plotting purposes
long_dist_matrix <- county_dist_matrix |> 
  as_tibble() |> 
  mutate(county1 = rownames(county_dist_matrix)) |> 
  pivot_longer(cols = !county1, names_to = "county2", values_to = "distance")

This heatmap is useless…


long_dist_matrix |> 
  ggplot(aes(x = county1, y = county2, fill = distance)) +
  geom_tile() +
  scale_fill_gradient(low = "darkorange", 
                      high = "darkblue") +
  coord_fixed() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank())

Arrange heatmap with seriation


library(seriation)
county_dist_seriate <- seriate(county_dist)
county_order <- get_order(county_dist_seriate)
county_names_order <- utah_health$County[county_order]
long_dist_matrix |> 
  mutate(
    county1 = fct_relevel(county1, county_names_order),
    county2 = fct_relevel(county2, county_names_order)
  ) |> 
  ggplot(aes(x = county1, y = county2, fill = distance)) +
  scale_fill_gradient(low = "darkorange", 
                      high = "darkblue") +
  geom_tile() +
  coord_fixed() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank())

Hierarchical clustering

(Agglomerative) Hierarchical clustering

Let’s pretend all \(n\) observations are in their own cluster

  • Step 1: Compute the pairwise dissimilarities between each pair of clusters

    • e.g., distance matrix on previous slides
  • Step 2: Identify the pair of clusters that are least dissimilar
  • Step 3: Fuse these two clusters into a new cluster!
  • Repeat Steps 1 to 3 until all observations are in the same cluster

This “bottom-up”, agglomerative clustering forms a tree/hierarchy of merges

No mention of any randomness. And no mention of the number of clusters \(k\).
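
To make Steps 1 to 3 concrete, here is a minimal sketch of the agglomerative loop on a toy dataset, using single linkage for simplicity (all object names are illustrative; in practice we rely on hclust(), introduced below):

set.seed(1)
pts <- matrix(rnorm(10), ncol = 2)       # 5 toy observations, 2 variables
D <- as.matrix(dist(pts))                # pairwise Euclidean distances
clusters <- as.list(seq_len(nrow(pts)))  # start: every observation is its own cluster

while (length(clusters) > 1) {
  # Step 1: dissimilarity between every pair of current clusters (single linkage here)
  pairs <- combn(length(clusters), 2)
  d_between <- apply(pairs, 2, function(p) min(D[clusters[[p[1]]], clusters[[p[2]]]]))
  # Step 2: the pair of clusters that are least dissimilar
  best <- pairs[, which.min(d_between)]
  # Step 3: fuse these two clusters into a new cluster
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
}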


The merging process forms a dendrogram (typically displayed from the bottom up)

Dissimilarity between clusters

  • We know how to compute distance/dissimilarity between two observations

  • But how do we handle clusters?

    • Dissimilarity between a cluster and an observation, or between two clusters

We need to choose a linkage function. Clusters are built up by linking them together

Types of linkage

First, compute all pairwise dissimilarities between the observations in the two clusters

i.e., compute the distance matrix between observations, \(d(x_i, x_j)\) for \(i \in C_1\) and \(j \in C_2\)

  • Complete linkage: use the maximum (largest) value of these dissimilarities \(\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)\) (maximal inter-cluster dissimilarity)
  • Single linkage: use the minimum (smallest) value of these dissimilarities \(\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)\) (minimal inter-cluster dissimilarity)
  • Average linkage: use the average value of these dissimilarities \(\displaystyle \frac{1}{|C_1||C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)\) (mean inter-cluster dissimilarity)
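
As an illustration, a minimal sketch computing all three linkages from the standardized distance matrix built earlier (the cluster memberships C1 and C2 are made up purely for the example):

# hypothetical cluster memberships, given as row indices of county_dist_matrix
C1 <- 1:3
C2 <- 4:6

# all pairwise dissimilarities between the two clusters
between <- county_dist_matrix[C1, C2]

max(between)   # complete linkage
min(between)   # single linkage
mean(between)  # average linkage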

Complete linkage example

  • Use hclust() with a dist() object

  • Uses complete linkage by default (method = "complete")

utah_complete <- county_dist |> 
  hclust(method = "complete")
  • Use cutree() to return cluster labels

  • Complete linkage tends to return compact clusters (similar to \(k\)-means)

utah_health |> 
  mutate(
    cluster = as.factor(cutree(utah_complete, k = 3))
  ) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate,
             color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")

What are we cutting? Dendrograms

Use the ggdendro package (instead of plot())

library(ggdendro)
utah_complete |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())
  • Each leaf is one observation

  • Height of a branch indicates the dissimilarity between the clusters being merged

    • (After the first step) the horizontal position along the x-axis carries no meaning
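
To connect the picture back to the algorithm, a minimal sketch inspecting the merge heights stored in the fitted object ($height and $merge are standard components of an hclust object):

# heights at which successive fusions occur (non-decreasing for complete linkage)
head(utah_complete$height)

# the corresponding merges: negative entries are single observations (leaves),
# positive entries refer to clusters formed at earlier steps
head(utah_complete$merge)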

Textbook example


Cut dendrograms to obtain cluster labels

Specify the height to cut with h (instead of k)

For example, cutree(utah_complete, h = 4)
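
A minimal sketch comparing a cut by height with a cut by number of clusters (the height 4 is the value mentioned above; the resulting number of clusters depends on the fitted tree):

# cut the dendrogram at dissimilarity height 4
cut_by_height <- cutree(utah_complete, h = 4)
table(cut_by_height)

# compare with asking for exactly 3 clusters
cut_by_k <- cutree(utah_complete, k = 3)
table(cut_by_k)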

Single linkage example

Change the method argument to "single"

Results in a chaining effect: observations tend to be fused one at a time into a single long, stringy cluster
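
A minimal sketch, following the same pattern as the complete linkage code above (the object name utah_single is ours):

utah_single <- county_dist |> 
  hclust(method = "single")

utah_health |> 
  mutate(cluster = as.factor(cutree(utah_single, k = 3))) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate, color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")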

Average linkage example

Change the method argument to "average"

Closer to complete linkage, but the clusters vary in compactness
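
Again a minimal sketch (utah_average is an illustrative name); the dendrogram can then be drawn exactly as before with ggdendrogram():

utah_average <- county_dist |> 
  hclust(method = "average")

utah_average |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())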

More linkage functions

  • Centroid linkage: computes the dissimilarity between the centroid of cluster 1 and the centroid of cluster 2

    • i.e. the distance between the averages of the two clusters

    • use method = "centroid"

  • Ward’s linkage: merges the pair of clusters that minimizes the increase in within-cluster variance

    • i.e. the aim is to minimize the objective function from \(k\)-means

    • can use method = "ward.D" or method = "ward.D2" (different algorithms)

  • There’s another one…

Minimax linkage

  • Each cluster is defined by a prototype observation (its most representative member)

  • The prototype minimizes the maximum dissimilarity to every other point in the cluster (hence “minimax”)

  • Dendrogram interpretation: at height \(h\), each point is within dissimilarity \(h\) of the prototype of its cluster

  • Cluster centers are chosen from among the observations themselves, hence the term prototype

Minimax linkage

  • For each point belonging to either cluster, find the maximum distance between it and all the other points in the two clusters.

  • The smallest of these maximum distances (“minimal-maximum” distance) is defined as the distance between the two clusters

  • The point achieving this minimal-maximum distance is the prototype of the merged cluster, and that distance is its minimax radius: every member lies within this radius of the prototype
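
To make this concrete, a minimal sketch computing the minimax distance between two hypothetical clusters directly from the distance matrix (the index sets C1 and C2 are made up for the example):

# hypothetical clusters, given as row indices of county_dist_matrix
C1 <- 1:3
C2 <- 4:6
members <- c(C1, C2)

# for each point in the union, its maximum distance to all other members
max_dist <- apply(county_dist_matrix[members, members], 1, max)

# minimax distance between the two clusters: the smallest of these maxima
min(max_dist)

# the point achieving it would serve as the prototype of the merged cluster
names(which.min(max_dist))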

Minimax linkage example

  • Easily done in R via the protoclust package

  • Use the protoclust() function to apply the clustering to the dist() object

library(protoclust)
utah_minimax <- protoclust(county_dist)

utah_minimax |> 
  as.hclust() |>
  as.dendrogram() |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())

Minimax linkage example

  • Use the protocut() function to make the cut

  • Then access the cluster labels via the cl element of the result

minimax_county_clusters <- utah_minimax |> 
  protocut(k = 3)
utah_health |> 
  mutate(cluster = as.factor(minimax_county_clusters$cl)) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate,
             color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")

Minimax linkage example

  • Let’s check out the prototypes for the three clusters

  • protocut returns the indices of the prototypes (in order of the cluster labels)

minimax_county_clusters$protos
[1] 18 13  7
  • Subset the rows for these counties using slice()
utah_health |>
  select(County, std_pct_over65, std_diabetic_rate) |>
  slice(minimax_county_clusters$protos)
# A tibble: 3 × 3
  County std_pct_over65 std_diabetic_rate
  <chr>           <dbl>             <dbl>
1 Emery          0.0600             0.116
2 Davis         -0.978             -1.27 
3 Kane           1.74               1.50 

Post-clustering analysis

  • For context, how does population relate to our clustering results?
utah_health |> 
  mutate(cluster = as.factor(minimax_county_clusters$cl)) |> 
  ggplot(aes(x = log(Population), fill = cluster)) +
  geom_density(alpha = 0.1) +
  scale_fill_manual(values = c("darkblue", "purple", "gold", "orange"))

Post-clustering analysis

# utah_health |> 
#   arrange(Population)
library(dendextend)
utah_minimax |> 
  as.hclust() |>
  as.dendrogram() |> 
  set("branches_k_color", k = 3) |> 
  set("labels_col", k = 3) |> 
  ggplot(horiz = TRUE)
  • Different population levels tend to fall within particular clusters…

  • It’s easy to include more variables - just change the distance matrix

Practical issues

  • What dissimilarity measure should be used?

  • What type of linkage should be used?

  • How many clusters to choose?

  • Which features should we use to drive the clustering?

    • Categorical variables?
  • Hard clustering vs. soft clustering

    • Hard clustering (\(k\)-means, hierarchical): assigns each observation to exactly one cluster

    • Soft (fuzzy) clustering: assigns each observation a probability of belonging to a cluster


IT DEPENDS…