Hierarchical Clustering

Posted by Shsun on March 12, 2020

Cluster analysis (CA) can be classified as hierarchical and non-hierarchical. The two most popular CA techniques are introduced here: hierarchical agglomerative clustering and k-means. Currently I am trying to identify spatial and temporal patterns, as well as impacts from local synoptic meteorology, with CA and PCA.

Hierarchical clustering

A hierarchical clustering algorithm in the agglomerative (bottom-up) manner is implemented in the following steps:

  1. Pre-process the observations by scaling and normalizing.
  2. Treat each observation as its own cluster and calculate the distances between clusters.
  3. Combine the two clusters with the minimum distance into a single cluster.
  4. Repeat steps 2 and 3 until all observations are merged into one cluster.

There are different ways to calculate the distance between clusters: complete linkage, single linkage, average linkage, and centroid linkage. The dendrogram from hierarchical clustering can be used to decide the number of clusters.
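
These steps map directly onto base R. Below is a minimal sketch, assuming x is a numeric data frame of observations; the variable names and the choices of complete linkage and four clusters are illustrative:

x.scaled = scale(x)                              # step 1: scale and normalize
hc = hclust(dist(x.scaled), method = "complete") # steps 2-4: merge clusters bottom-up
plot(hc)                                         # dendrogram, to eyeball a cut height
hc.cluster = cutree(hc, k = 4)                   # cut the tree into 4 clusters

Changing method to "single", "average", or "centroid" switches the linkage.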

Non-hierarchical clustering

K-means clustering aims to divide m objects in n dimensions into k clusters such that the within-cluster sum of squares is minimized. The iterative k-means method is as follows (a hand-rolled sketch follows the list):

  1. Choose a value of k and an initial set of k centroids.
  2. Assign each object to the cluster with the nearest centroid.
  3. Recompute the centroids of the k clusters.
  4. Repeat steps 2 and 3 until the assignments no longer change.
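
To make the loop concrete, here is a minimal hand-rolled version, assuming x is a numeric matrix; simple_kmeans, k, and max_iter are illustrative names, and in practice the built-in kmeans() discussed below is what I actually use:

simple_kmeans = function(x, k, max_iter = 100) {
  # step 1: pick k observations at random as the initial centroids
  centroids = x[sample(nrow(x), k), , drop = FALSE]
  assignment = rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # step 2: assign each object to the nearest centroid (Euclidean distance)
    d = as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignment = apply(d, 1, which.min)
    if (all(new_assignment == assignment)) break  # step 4: stop when stable
    assignment = new_assignment
    # step 3: recompute each centroid as the mean of its cluster
    # (empty clusters are not handled in this sketch)
    for (j in seq_len(k)) {
      centroids[j, ] = colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centroids)
}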

Two-stage clustering

Two-stage clustering is a combination of two clustering methods, where the output of the first is used as the input of the second. Here, hierarchical clustering is the first stage and k-means is the second: the hierarchical grouping from the first stage provides the “seeds” for the k-means algorithm.

A useful package for two-stage clustering is prcr, which is designed for person-centered analysis.
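
Without the package, the same idea can be sketched in base R; here x is assumed to be a scaled numeric matrix, and k = 4 and Ward linkage are illustrative choices:

hc = hclust(dist(x), method = "ward.D2")  # stage 1: hierarchical clustering
grp = cutree(hc, k = 4)                   # cut the tree into 4 groups
seeds = apply(x, 2, tapply, grp, mean)    # group means from stage 1 as "seeds"
km = kmeans(x, centers = seeds)           # stage 2: k-means started from the seeds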

Clustering Routine in R

kmeans() does basic k-means clustering, and hclust() does hierarchical agglomerative clustering; plot() can then be used to check the results. Basically, the agglomerative hierarchical method hclust aims to merge small clusters into larger ones, which is also computationally easy and fast.

Using an optimal partitioning method such as kmeans requires deciding how many clusters to use, as well as an initial clustering. By default, kmeans starts from a random assignment of groups; a nonrandom start avoids this by specifying the centers of the clusters, as sketched below.
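
For a nonrandom start, centers is given a k x p matrix of starting centroids, one row per cluster, instead of a single number (for example, the seeds matrix from the two-stage sketch above):

km.out = kmeans(x, centers = seeds, iter.max = 100)  # one row of seeds per cluster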

To determine the number of clusters, we can use the pseudo-F test. Pseudo-F (the Calinski-Harabasz index) is the ratio of between-cluster variance to within-cluster variance, F = [SSB / (k - 1)] / [SSW / (n - k)], where n is the number of observations and k the number of clusters. Before using kmeans, we can also draw an elbow plot to check how many clusters to use (a sketch follows at the end of this section). The following code is what I use to find the clusters:

km.out = kmeans(PCA$PC.data[,1:5], centers = ncluster, nstart = 50, iter.max = 100)
km.cluster = km.out$cluster  # cluster label for each observation

To avoid the error that kmeans does not converge within the default 10 iterations, we can set iter.max to a larger number.
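
And the elbow plot mentioned above, reusing the same input; the range of k is an illustrative choice. The total within-cluster sum of squares is plotted against k, and the bend ("elbow") in the curve suggests a reasonable number of clusters:

wss = sapply(1:10, function(k) {
  kmeans(PCA$PC.data[,1:5], centers = k, nstart = 50, iter.max = 100)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")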
