WPGMA
WPGMA is a simple agglomerative hierarchical clustering method, generally attributed to Sokal and Michener.
The WPGMA method is similar to its unweighted variant, the UPGMA method.
Algorithm
The WPGMA algorithm constructs a rooted tree that reflects the structure present in a pairwise distance matrix. At each step, the nearest two clusters, say and, are combined into a higher-level cluster. Then, its distance to another cluster is simply the arithmetic mean of the average distances between members of and and and :The WPGMA algorithm produces rooted dendrograms and requires a constant-rate assumption: it produces an ultrametric tree in which the distances from the root to every branch tip are equal. This ultrametricity assumption is called the molecular clock when the tips involve DNA, RNA and protein data.
Working example
This working example is based on a JC69 genetic distance matrix computed from the 5S ribosomal RNA sequence alignment of five bacteria: Bacillus subtilis, Bacillus stearothermophilus, Lactobacillus viridescens, Acholeplasma modicum, and Micrococcus luteus.First step
- First clustering
| a | b | c | d | e | |
| a | 0 | 17 | 21 | 31 | 23 |
| b | 17 | 0 | 30 | 34 | 21 |
| c | 21 | 30 | 0 | 28 | 39 |
| d | 31 | 34 | 28 | 0 | 43 |
| e | 23 | 21 | 39 | 43 | 0 |
In this example, is the smallest value of, so we join elements and.
- First branch length estimation
The branches joining and to then have lengths
- First distance matrix update
Bold values in correspond to the new distances, calculated by averaging distances between each element of the first cluster and each of the remaining elements:
Italicized values in are not affected by the matrix update as they correspond to distances between elements not involved in the first cluster.
Second step
- Second clustering
| c | d | e | ||
| 0 | 25.5 | 32.5 | 22 | |
| c | 25.5 | 0 | 28 | 39 |
| d | 32.5 | 28 | 0 | 43 |
| e | 22 | 39 | 43 | 0 |
Here, is the smallest value of, so we join cluster and element.
- Second branch length estimation
We deduce the missing branch length:
- Second distance matrix update
Of note, this average calculation of the new distance does not account for the larger size of the cluster with respect to . Similarly:
The averaging procedure therefore gives differential weight to the initial distances of matrix. This is the reason why the method is weighted, not with respect to the mathematical procedure but with respect to the initial distances.
Third step
- Third clustering
| c | d | ||
| 0 | 32.25 | 37.75 | |
| c | 32.25 | 0 | 28 |
| d | 37.75 | 28 | 0 |
Here, is the smallest value of, so we join elements and.
- Third branch length estimation
The branches joining and to then have lengths
- Third distance matrix update
Final step
The final matrix is:| 0 | 35 | |
| 35 | 0 |
So we join clusters and.
Let denote the node to which and are now connected.
The branches joining and to then have lengths:
We deduce the two remaining branch lengths:
The WPGMA dendrogram
The dendrogram is now complete. It is ultrametric because all tips are equidistant from :The dendrogram is therefore rooted by, its deepest node.