Hierarchical Clustering in R: Step-by-Step Example | Online Statistics library

Clustering is a technique in machine learning that attempts to find groups or clusters of observations within a dataset such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

Clustering is a form of unsupervised learning because we’re simply attempting to find structure within a dataset rather than predicting the value of some response variable.

Clustering is often used in marketing when companies have access to information like:

Household income
Household size
Head of household Occupation
Distance from nearest urban area

When this information is available, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising.

One of the most common forms of clustering is known as k-means clustering. Unfortunately this method requires us to pre-specify the number of clusters K.

An alternative to this method is known as hierarchical clustering, which does not require us to pre-specify the number of clusters to be used and is also able to produce a tree-based representation of the observations known as a dendrogram.

What is Hierarchical Clustering?

Similar to k-means clustering, the goal of hierarchical clustering is to produce clusters of observations that are quite similar to each other while the observations in different clusters are quite different from each other.

In practice, we use the following steps to perform hierarchical clustering:

1. Calculate the pairwise dissimilarity between each observation in the dataset.

First, we must choose some distance metric – like the Euclidean distance – and use this metric to compute the dissimilarity between each observation in the dataset.
For a dataset with n observations, there will be a total of n(n-1)/2 pairwise dissimilarities.

2. Fuse observations into clusters.

At each step in the algorithm, fuse together the two observations that are most similar into a single cluster.
Repeat this procedure until all observations are members of one large cluster. The end result is a tree, which can be plotted as a dendrogram.

To determine how close together two clusters are, we can use a few different methods including:

Complete linkage clustering: Find the max distance between points belonging to two different clusters.
Single linkage clustering: Find the minimum distance between points belonging to two different clusters.
Mean linkage clustering: Find all pairwise distances between points belonging to two different clusters and then calculate the average.
Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between the centroids of two different clusters.
Ward’s minimum variance method: Minimize the total

Depending on the structure of the dataset, one of these methods may tend to produce better (i.e. more compact) clusters than the other methods.

Hierarchical Clustering in R

The following tutorial provides a step-by-step example of how to perform hierarchical clustering in R.

Step 1: Load the Necessary Packages

First, we’ll load two packages that contain several useful functions for hierarchical clustering in R.

library(factoextra)
library(cluster)

Step 2: Load and Prep the Data

For this example we’ll use the USArrests dataset built into R, which contains the number of arrests per 100,000 residents in each U.S. state in 1973 for Murder, Assault, and Rape along with the percentage of the population in each state living in urban areas, UrbanPop.

The following code shows how to do the following:

Load the USArrests dataset
Remove any rows with missing values
Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df #remove rows with missing values
df omit(df)

#scale each variable to have a mean of 0 and sd of 1
df #view first six rows of dataset
head(df)

               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Step 3: Find the Linkage Method to Use

To perform hierarchical clustering in R we can use the agnes() function from the cluster package, which uses the following syntax:

agnes(data, method)

where:

data: Name of the dataset.
method: The method to use to calculate dissimilarity between clusters.

Since we don’t know beforehand which method will produce the best clusters, we can write a short function to perform hierarchical clustering using several different methods.

Note that this function calculates the agglomerative coefficient of each method, which is metric that measures the strength of the clusters. The closer this value is to 1, the stronger the clusters.

#define linkage methods
m #function to compute agglomerative coefficient
ac function(x) {
  agnes(df, method = x)$ac
}

#calculate agglomerative coefficient for each clustering linkage method
sapply(m, ac)

  average    single  complete      ward 
0.7379371 0.6276128 0.8531583 0.9346210

We can see that Ward’s minimum variance method produces the highest agglomerative coefficient, thus we’ll use that as the method for our final hierarchical clustering:

#perform hierarchical clustering using Ward's minimum variance
clust ward")

#produce dendrogram
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")

Dendrogram in R

Each leaf at the bottom of the dendrogram represents an observation in the original dataset. As we move up the dendrogram from the bottom, observations that are similar to each other are fused together into a branch.

Step 4: Determine the Optimal Number of Clusters

To determine how many clusters the observations should be grouped in, we can use a metric known as the gap statistic, which compares the total intra-cluster variation for different values of k with their expected values for a distribution with no clustering.

We can calculate the gap statistic for each number of clusters using the clusGap() function from the cluster package along with a plot of clusters vs. gap statistic using the fviz_gap_stat() function:

#calculate gap statistic for each number of clusters (up to 10 clusters)
gap_stat #produce plot of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

Optimal clusters for hierarchical clustering

From the plot we can see that the gap statistic is highest at k = 4 clusters. Thus, we’ll choose to group our observations into 4 distinct clusters.

Step 5: Apply Cluster Labels to Original Dataset

To actually add cluster labels to each observation in our dataset, we can use the cutree() method to cut the dendrogram into 4 clusters:

#compute distance matrix
d euclidean")

#perform hierarchical clustering using Ward's method
final_clust ward.D2" )

#cut the dendrogram into 4 clusters
groups #find number of observations in each cluster
table(groups)

 1  2  3  4 
 7 12 19 12

We can then append the cluster labels of each state back to the original dataset:

#append cluster labels to original data
final_data 
#display first six rows of final data
head(final_data)

               Murder Assault UrbanPop Rape cluster
Alabama          13.2     236       58 21.2       1
Alaska           10.0     263       48 44.5       2
Arizona           8.1     294       80 31.0       2
Arkansas          8.8     190       50 19.5       3
California        9.0     276       91 40.6       2
Colorado          7.9     204       78 38.7       2

Lastly, we can use the aggregate() function to find the mean of the variables in each cluster:

#find mean values for each cluster
aggregate(final_data, by=list(cluster=final_data$cluster), mean)

  cluster    Murder  Assault UrbanPop     Rape cluster
1       1 14.671429 251.2857 54.28571 21.68571       1
2       2 10.966667 264.0000 76.50000 33.60833       2
3       3  6.210526 142.0526 71.26316 19.18421       3
4       4  3.091667  76.0000 52.08333 11.83333       4

We interpret this output is as follows:

The mean number of murders per 100,000 citizens among the states in cluster 1 is 14.67.
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 251.28.
The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 21.68.

You can find the complete R code used in this example here.