Unsupervised learning: hierarchical clustering


SURE 2024

Department of Statistics & Data Science
Carnegie Mellon University

Background

The big picture

  • \(k\)-means clustering: partition the observations into a pre-specified number of clusters
  • Hierarchical clustering: does not require committing to a particular number of clusters in advance

    • In fact, we end up with a tree-like visual representation of the observations, called a dendrogram

    • This allows us to view at once the clusterings obtained for each possible number of clusters

    • Common approach: agglomerative (bottom-up) hierarchical clustering, which builds a dendrogram starting from the leaves and combining clusters up to the trunk

    • There’s also divisive (top-down) hierarchical clustering: start with one large cluster and then break the cluster recursively into smaller and smaller pieces

Data: County-level health indicators for Utah in 2014

library(tidyverse)
theme_set(theme_light())
utah_health <- read_csv("https://raw.githubusercontent.com/36-SURE/36-SURE.github.io/main/data/utah_health.csv")
glimpse(utah_health)
Rows: 30
Columns: 20
$ County                 <chr> "Tooele", "Uintah", "Wasatch", "Statewide", "Sa…
$ Population             <dbl> 59870, 34524, 25273, 2855287, 27906, 1524, 7221…
$ PercentUnder18         <dbl> 35.3, 33.6, 33.1, 31.1, 29.4, 27.8, 23.4, 33.6,…
$ PercentOver65          <dbl> 7.9, 9.1, 9.1, 9.5, 12.3, 23.7, 20.5, 11.0, 10.…
$ DiabeticRate           <dbl> 9, 8, 6, 7, 8, 10, 10, 8, 8, 7, 7, 9, 6, 4, 9, …
$ HIVRate                <dbl> 37, 21, 28, 111, 22, NA, NA, NA, 72, 36, NA, NA…
$ PrematureMortalityRate <dbl> 347.6, 396.2, 227.2, 286.7, 314.3, 445.8, 293.6…
$ InfantMortalityRate    <dbl> 5.0, 5.2, NA, 5.0, NA, NA, NA, NA, 5.8, 7.1, NA…
$ ChildMortalityRate     <dbl> 47.7, 57.4, 69.0, 52.9, 61.2, NA, NA, 70.4, 58.…
$ LimitedAccessToFood    <dbl> 14, 14, 14, 17, 17, 15, 14, 14, 16, 19, 11, 18,…
$ FoodInsecure           <dbl> 7, 5, 1, 5, 4, 41, 16, 8, 6, 12, 0, 14, 4, 4, 8…
$ MotorDeathRate         <dbl> 18, 23, 16, 11, 18, NA, 21, 36, 10, 13, NA, NA,…
$ DrugDeathRate          <dbl> 18, 13, 17, 17, 16, NA, 26, 18, 20, 18, NA, NA,…
$ Uninsured              <dbl> 17, 24, 24, 20, 24, 26, 19, 23, 20, 26, 14, 26,…
$ UninsuredChildren      <dbl> 10, 16, 16, 11, 15, 17, 12, 14, 12, 15, 10, 20,…
$ HealthCareCosts        <dbl> 9095, 7086, 8327, 8925, 8942, 7824, 8121, 8527,…
$ CouldNotSeeDr          <dbl> 13, 12, 13, 13, 13, NA, NA, 13, 11, 13, 13, NA,…
$ MedianIncome           <dbl> 61927, 60419, 62014, 57067, 43921, 36403, 43128…
$ ChildrenFreeLunch      <dbl> 34, 36, 30, 31, 38, 58, 31, 34, 37, 40, 12, 33,…
$ HomicideRate           <dbl> NA, NA, NA, 2, NA, NA, NA, NA, 2, NA, NA, NA, 1…

General setup

  • Given a dataset with \(p\) variables (columns) and \(n\) observations (rows) \(x_1,\dots,x_n\)

  • Compute the distance/dissimilarity between observations

  • e.g. Euclidean distance between observations \(i\) and \(j\)

\[d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + \cdots + (x_{ip}-x_{jp})^2}\]
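
As a quick sanity check of this formula, here is a minimal sketch (the vectors x_i and x_j below are made-up observations with \(p = 3\) variables) comparing a manual computation with R's built-in dist():

# two made-up observations with p = 3 variables each
x_i <- c(5.0, 2.1, 10.3)
x_j <- c(3.5, 4.0, 9.8)

# Euclidean distance from the formula above
sqrt(sum((x_i - x_j)^2))

# the same value via dist() on a two-row matrix
dist(rbind(x_i, x_j))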

What are the distances between these counties using PercentOver65 (percent of county population that is 65 and over) and DiabeticRate (prevalence of diabetes)?


utah_health |> 
  ggplot(aes(x = PercentOver65, y = DiabeticRate)) +
  geom_point(size = 4)

Remember to standardize!

utah_health <- utah_health |> 
  mutate(
    std_pct_over65 = as.numeric(scale(PercentOver65)),
    std_diabetic_rate = as.numeric(scale(DiabeticRate))
  )

utah_health |> 
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate)) +
  geom_point(size = 4) +
  coord_fixed()

Compute the distance matrix using dist()

  • Compute pairwise Euclidean distances
county_dist <- utah_health |> 
  select(std_pct_over65, std_diabetic_rate) |> 
  dist()
  • Returns an object of dist class… but not a matrix

  • Convert to a matrix, then set the row and column names:

county_dist_matrix <- as.matrix(county_dist)
rownames(county_dist_matrix) <- utah_health$County
colnames(county_dist_matrix) <- utah_health$County
county_dist_matrix[1:4, 1:4]
             Tooele    Uintah   Wasatch Statewide
Tooele    0.0000000 0.7474104 2.1010897 1.4367342
Uintah    0.7474104 0.0000000 1.3885164 0.7003632
Wasatch   2.1010897 1.3885164 0.0000000 0.7003632
Statewide 1.4367342 0.7003632 0.7003632 0.0000000
  • Convert to a long table with pivot_longer for plotting purposes
long_dist_matrix <- county_dist_matrix |> 
  as_tibble() |> 
  mutate(county1 = rownames(county_dist_matrix)) |> 
  pivot_longer(cols = !county1, names_to = "county2", values_to = "distance")

This heatmap is useless…


long_dist_matrix |> 
  ggplot(aes(x = county1, y = county2, fill = distance)) +
  geom_tile() +
  scale_fill_gradient(low = "darkorange", 
                      high = "darkblue") +
  coord_fixed() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank())

Arrange heatmap with seriation


library(seriation)
county_dist_seriate <- seriate(county_dist)
county_order <- get_order(county_dist_seriate)
county_names_order <- utah_health$County[county_order]
long_dist_matrix |> 
  mutate(
    county1 = fct_relevel(county1, county_names_order),
    county2 = fct_relevel(county2, county_names_order)
  ) |> 
  ggplot(aes(x = county1, y = county2, fill = distance)) +
  scale_fill_gradient(low = "darkorange", 
                      high = "darkblue") +
  geom_tile() +
  coord_fixed() +
  theme(axis.text = element_blank(), 
        axis.ticks = element_blank())

Hierarchical clustering

(Agglomerative) Hierarchical clustering

Let’s pretend all \(n\) observations are in their own cluster

  • Step 1: Compute the pairwise dissimilarities between each pair of clusters

    • e.g., distance matrix on previous slides
  • Step 2: Identify the pair of clusters that are least dissimilar
  • Step 3: Fuse these two clusters into a new cluster!
  • Repeat Steps 1 to 3 until all observations are in the same cluster

This “bottom-up”, agglomerative clustering forms a tree/hierarchy of merges

No mention of any randomness. And no mention of the number of clusters \(k\).
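
To make Steps 1 to 3 concrete, here is a minimal sketch of the agglomerative loop on a toy dataset, using single linkage for simplicity (all object names are illustrative; in practice we rely on hclust(), introduced below):

set.seed(1)
pts <- matrix(rnorm(10), ncol = 2)       # 5 toy observations, 2 variables
D <- as.matrix(dist(pts))                # pairwise Euclidean distances
clusters <- as.list(seq_len(nrow(pts)))  # start: every observation is its own cluster

while (length(clusters) > 1) {
  # Step 1: dissimilarity between every pair of current clusters (single linkage here)
  pairs <- combn(length(clusters), 2)
  d_between <- apply(pairs, 2, function(p) min(D[clusters[[p[1]]], clusters[[p[2]]]]))
  # Step 2: the pair of clusters that are least dissimilar
  best <- pairs[, which.min(d_between)]
  # Step 3: fuse these two clusters into a new cluster
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
}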


The merging process forms a dendrogram (typically displayed from the bottom up)

Dissimilarity between clusters

  • We know how to compute distance/dissimilarity between two observations

  • But how do we handle clusters?

    • Dissimilarity between a cluster and an observation, or between two clusters

We need to choose a linkage function. Clusters are built up by linking them together

Types of linkage

First, compute all pairwise dissimilarities between the observations in the two clusters

i.e., compute the distance matrix between observations, \(d(x_i, x_j)\) for \(i \in C_1\) and \(j \in C_2\)

  • Complete linkage: use the maximum (largest) value of these dissimilarities \(\underset{i \in C_1, j \in C_2}{\text{max}} d(x_i, x_j)\) (maximal inter-cluster dissimilarity)
  • Single linkage: use the minimum (smallest) value of these dissimilarities \(\underset{i \in C_1, j \in C_2}{\text{min}} d(x_i, x_j)\) (minimal inter-cluster dissimilarity)
  • Average linkage: use the average value of these dissimilarities \(\displaystyle \frac{1}{|C_1||C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(x_i, x_j)\) (mean inter-cluster dissimilarity)
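
As an illustration, a minimal sketch computing all three linkages from the standardized distance matrix built earlier (the cluster memberships C1 and C2 are made up purely for the example):

# hypothetical cluster memberships, given as row indices of county_dist_matrix
C1 <- 1:3
C2 <- 4:6

# all pairwise dissimilarities between the two clusters
between <- county_dist_matrix[C1, C2]

max(between)   # complete linkage
min(between)   # single linkage
mean(between)  # average linkage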

Complete linkage example

  • Use hclust() with a dist() object

  • Uses complete linkage by default (method = "complete")

utah_complete <- county_dist |> 
  hclust(method = "complete")
  • Use cutree() to return cluster labels

  • Complete linkage tends to return compact clusters (similar to \(k\)-means)

utah_health |> 
  mutate(
    cluster = as.factor(cutree(utah_complete, k = 3))
  ) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate,
             color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")

What are we cutting? Dendrograms

Use the ggdendro package (instead of plot())

library(ggdendro)
utah_complete |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())
  • Each leaf is one observation

  • Height of a branch indicates the dissimilarity between the clusters being merged

    • (After the first step) the horizontal position along the x-axis carries no meaning
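
To connect the picture back to the algorithm, a minimal sketch inspecting the merge heights stored in the fitted object ($height and $merge are standard components of an hclust object):

# heights at which successive fusions occur (non-decreasing for complete linkage)
head(utah_complete$height)

# the corresponding merges: negative entries are single observations (leaves),
# positive entries refer to clusters formed at earlier steps
head(utah_complete$merge)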

Textbook example


Cut dendrograms to obtain cluster labels

Specify the height to cut with h (instead of k)

For example, cutree(utah_complete, h = 4)
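
A minimal sketch comparing a cut by height with a cut by number of clusters (the height 4 is the value mentioned above; the resulting number of clusters depends on the fitted tree):

# cut the dendrogram at dissimilarity height 4
cut_by_height <- cutree(utah_complete, h = 4)
table(cut_by_height)

# compare with asking for exactly 3 clusters
cut_by_k <- cutree(utah_complete, k = 3)
table(cut_by_k)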

Single linkage example

Change the method argument to "single"

Results in a chaining effect: observations tend to be fused one at a time into a single long, stringy cluster
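
A minimal sketch, following the same pattern as the complete linkage code above (the object name utah_single is ours):

utah_single <- county_dist |> 
  hclust(method = "single")

utah_health |> 
  mutate(cluster = as.factor(cutree(utah_single, k = 3))) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate, color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")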

Average linkage example

Change the method argument to "average"

Closer to complete linkage, but the clusters vary in compactness
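
Again a minimal sketch (utah_average is an illustrative name); the dendrogram can then be drawn exactly as before with ggdendrogram():

utah_average <- county_dist |> 
  hclust(method = "average")

utah_average |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())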

More linkage functions

  • Centroid linkage: computes the dissimilarity between the centroid of cluster 1 and the centroid of cluster 2

    • i.e. the distance between the averages of the two clusters

    • use method = "centroid"

  • Ward’s linkage: merges the pair of clusters that minimizes the increase in within-cluster variance

    • i.e. the aim is to minimize the objective function from \(k\)-means

    • can use method = "ward.D" or method = "ward.D2" (different algorithms)

  • There’s another one…

Minimax linkage

  • Each cluster is defined by a prototype observation (its most representative member)

  • The prototype minimizes the maximum dissimilarity to every other point in the cluster (hence “minimax”)

  • Dendrogram interpretation: at height \(h\), each point is within dissimilarity \(h\) of the prototype of its cluster

  • Cluster centers are chosen from among the observations themselves, hence the term prototype

Minimax linkage

  • For each point belonging to either cluster, find the maximum distance between it and all the other points in the two clusters.

  • The smallest of these maximum distances (“minimal-maximum” distance) is defined as the distance between the two clusters

  • The point achieving this minimal-maximum distance is the prototype of the merged cluster, and that distance is its minimax radius: every member lies within this radius of the prototype
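
To make this concrete, a minimal sketch computing the minimax distance between two hypothetical clusters directly from the distance matrix (the index sets C1 and C2 are made up for the example):

# hypothetical clusters, given as row indices of county_dist_matrix
C1 <- 1:3
C2 <- 4:6
members <- c(C1, C2)

# for each point in the union, its maximum distance to all other members
max_dist <- apply(county_dist_matrix[members, members], 1, max)

# minimax distance between the two clusters: the smallest of these maxima
min(max_dist)

# the point achieving it would serve as the prototype of the merged cluster
names(which.min(max_dist))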

Minimax linkage example

  • Easily done in R via the protoclust package

  • Use the protoclust() function to apply the clustering to the dist() object

library(protoclust)
utah_minimax <- protoclust(county_dist)

utah_minimax |> 
  as.hclust() |>
  as.dendrogram() |> 
  ggdendrogram(labels = FALSE, 
               leaf_labels = FALSE,
               theme_dendro = FALSE) +  
  labs(y = "Dissimilarity between clusters") +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        panel.grid = element_blank())

Minimax linkage example

  • Use the protocut() function to make the cut

  • Then access the cluster labels via the cl element of the result

minimax_county_clusters <- utah_minimax |> 
  protocut(k = 3)
utah_health |> 
  mutate(cluster = as.factor(minimax_county_clusters$cl)) |>
  ggplot(aes(x = std_pct_over65, y = std_diabetic_rate,
             color = cluster)) +
  geom_point(size = 4) + 
  ggthemes::scale_color_colorblind() +
  theme(legend.position = "bottom")

Minimax linkage example

  • Let’s check out the prototypes for the three clusters

  • protocut returns the indices of the prototypes (in order of the cluster labels)

minimax_county_clusters$protos
[1] 18 13  7
  • Subset the rows for these counties using slice()
utah_health |>
  select(County, std_pct_over65, std_diabetic_rate) |>
  slice(minimax_county_clusters$protos)
# A tibble: 3 × 3
  County std_pct_over65 std_diabetic_rate
  <chr>           <dbl>             <dbl>
1 Emery          0.0600             0.116
2 Davis         -0.978             -1.27 
3 Kane           1.74               1.50 

Post-clustering analysis

  • For context, how does population relate to our clustering results?
utah_health |> 
  mutate(cluster = as.factor(minimax_county_clusters$cl)) |> 
  ggplot(aes(x = log(Population), fill = cluster)) +
  geom_density(alpha = 0.1) +
  scale_fill_manual(values = c("darkblue", "purple", "gold", "orange"))

Post-clustering analysis

# utah_health |> 
#   arrange(Population)
library(dendextend)
utah_minimax |> 
  as.hclust() |>
  as.dendrogram() |> 
  set("branches_k_color", k = 3) |> 
  set("labels_col", k = 3) |> 
  ggplot(horiz = TRUE)
  • Different population levels tend to fall within particular clusters…

  • It’s easy to include more variables - just change the distance matrix

Practical issues

  • What dissimilarity measure should be used?

  • What type of linkage should be used?

  • How many clusters to choose?

  • Which features should we use to drive the clustering?

    • Categorical variables?
  • Hard clustering vs. soft clustering

    • Hard clustering (\(k\)-means, hierarchical): assigns each observation to exactly one cluster

    • Soft (fuzzy) clustering: assigns each observation a probability of belonging to a cluster


IT DEPENDS…