Topic 15 Hierarchical Clustering
Learning Goals
- Clearly describe / implement by hand the hierarchical clustering algorithm
- Compare and contrast k-means and hierarchical clustering in their outputs and algorithms
- Interpret cuts of the dendrogram for single and complete linkage
- Describe the rationale for how clustering algorithms work in terms of within-cluster variation
- Describe the tradeoff of more vs. less clusters in terms of interpretability
- Implement strategies for interpreting / contextualizing the clusters
Exercises
You can download a template RMarkdown file to start from here.
In this paper, Gorman et al. study characteristics of penguin populations in the Antarctic. We’ll be looking at a dataset of penguin body measurements available in the palmerpenguins
package. (Make sure to install this package before beginning.)
Our goal in using this data is to better understand the following questions: What similarities are there among the penguins? Do there appear to be different species? If so, how many species are there?
library(dplyr)
library(ggplot2)
library(palmerpenguins)
data(penguins)
# Remove observations with missing data on key variables
<- penguins %>%
penguins filter(!is.na(bill_length_mm), !is.na(bill_depth_mm), !is.na(flipper_length_mm))
Exercise 1: Hierarchical clustering by hand
To practice the hierarchical clustering algorithm, let’s look at a small example. Suppose we collect the following bill depth and length measurements from 5 penguins:
# NOTE: these data are already scaled!
<- data.frame(
penguins_small depth = c(2.5, 2.7, 3.2, 3.5, 3.6),
length = c(5.5, 6.0, 4.5, 5.0, 4.7)
)
penguins_small
ggplot(penguins_small, aes(x = depth, y = length)) +
geom_point() +
geom_text(aes(label = 1:5), vjust = 1.5)
In the exercises below, you’ll draw a dendrogram for these 5 penguins by hand. Sketch the following plotting frame on some scrap paper:
Step 1: First fusion
Calculate the distance between each pair of penguins:
round(dist(penguins_small), 2)
Which pair of penguins 1-5 is most similar? Draw the fusion between this pair of leaves on your plot. Clearly indicate the height at which you draw this fusion.
Step 2: Second fusion
Construct a new distance matrix reflecting the distances between each pair of branches (where 4 and 5 have now been fused). Use complete linkage. That is, the distance between one branch and another is the maximum distance between any pair of leaves in those branches. We can do this by taking the distances from the distance matrix above and using the complete linkage strategy to obtain distances to the 4+5 cluster.
1 | 2 | 3 | 4 & 5 | |
---|---|---|---|---|
1 | 0 | |||
2 | xxxx | 0 | ||
3 | xxxx | xxxx | 0 | |
4 & 5 | xxxx | xxxx | xxxx | 0 |
Which pair of branches is most similar? Draw the fusion between this pair on the plot.
Step 3: Third fusion
Repeat! Construct a new distance matrix reflecting the distances between each pair of branches, those that have been fused already and those that have not.
Which pair of branches is most similar? Draw the fusion between this pair on the plot.
Step 4: Final fusion
At this point, you should have 2 penguins in one cluster and 3 in another. The final step is to combine these into the tree trunk. Draw this fusion.
Final step: Check your work in R:
<- hclust(dist(penguins_small), method = "complete")
penguin_cluster plot(penguin_cluster)
Exercise 2: Exploring penguin dendrograms
Let’s use hierarchical clustering under different linkage strategies to look at nestings of clusters for the full set of penguins. We’ll continue to cluster based on bill length and depth as well as flipper length. (We’ll use a smaller random subset of 50 penguins for ease of visualization.)
- Remind yourselves about the importance of considering variable scaling by looking at the summary statistics for the 3 variables.
- In looking at the dendrograms under the different linkage types, do you notice any particular features of how the dendrograms look (as mentioned in the video for single and centroid linkage)?
# Random subsample of 50 penguins
set.seed(253)
<- penguins %>%
penguins slice_sample(n = 50)
# Select the variables to be used in clustering
<- penguins %>%
penguins_sub select(bill_length_mm, bill_depth_mm, flipper_length_mm)
# Summary statistics for the variables
summary(penguins_sub)
# Compute a distance matrix on the scaled data
<- dist(scale(penguins_sub))
dist_mat_scaled
# The (scaled) distance matrix is the input to hclust()
# The method argument indicates the linkage type
<- hclust(dist_mat_scaled, method = "complete")
hc_complete <- hclust(dist_mat_scaled, method = "single")
hc_single <- hclust(dist_mat_scaled, method = "average")
hc_average <- hclust(dist_mat_scaled, method = "centroid")
hc_centroid
# Plot dendrograms
plot(hc_complete)
plot(hc_single)
plot(hc_average)
plot(hc_centroid)
Exercise 3: Interpreting the clusters visually
Let’s continue exploring the dendrogram from complete linkage. The plot()
function for hclust()
output allows a labels
argument which can show custom labels for the leaves (cases). The code below labels the leaves with the species of each penguin.
- What do you notice about the clustering in relation to the actual penguin species?
- Try changing the labels to show the
sex
of the penguin. What do you notice?
plot(hc_complete, labels = penguins$species)
Exercise 4: Tree-cutting and interpretation
When we cut a dendrogram at a given height, we indicate that the branches that have fused before (below) that point should be distinct clusters.
For the complete linkage dendrogram, say that we cut the tree at a height of 3? How many clusters should result (eyeball from dendrogram)? How can we interpret this cut in terms of the distances between cases in the resulting clusters?
We can add new variables containing cluster assignments with
mutate()
andcutree()
, as below. Using either set of cluster assignments, make some visualizations to explore how variables of interest are related to the cluster assignments.
<- penguins %>%
penguins mutate(
hclust_height3 = factor(cutree(hc_complete, h = 3)), # Cut at height (h) 3
hclust_num6 = factor(cutree(hc_complete, k = 6)) # Cut into 6 clusters (k)
)