K-Medoids Distance-Based Clustering

Weksi Budiaji

2019-01-02

1. Introduction

The kmed package was designed to analyse k-medoids based clustering. The features include distance computation for numerical, binary/categorical, and mixed variables; k-medoids algorithms; cluster validation via internal and relative criteria; and cluster visualization.

2. Distance Computation

A. Numerical variables

For numerical variables, there are four distance options, namely Manhattan weighted by rank (mrw), squared Euclidean weighted by rank (ser), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The distNumeric function provides a method argument through which the desired distance can be chosen. In this article, the Manhattan weighted by rank distance is applied to the popular iris data.
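A minimal sketch of the call that could produce the output below, assuming the four iris measurements as input; the distance matrix is stored as mrwdist for reuse in later sections.

library(kmed)
num <- as.matrix(iris[, 1:4])
# manhattan weighted by rank ("mrw") distance between all pairs of objects
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6, 1:6]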

##            [,1]      [,2]      [,3]      [,4]       [,5]      [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000

B. Binary and Categorical variables

For binary and categorical variables, on the other hand, two distance methods are available, i.e. the simple matching (matching) and co-occurrence (cooccur) distances. The matching function requires two matrices/data frames, while cooccur needs only one matrix/data frame.
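A sketch with a hypothetical 7 by 3 binary matrix a, hard-coded here to equal the bin1-bin3 columns of the mixed data shown in the next subsection:

# 7 x 3 binary matrix; its columns equal bin1-bin3 of the mixed data below
a <- matrix(c(1, 1, 2, 2, 1, 2, 2,
              2, 2, 1, 1, 1, 2, 1,
              2, 1, 2, 2, 1, 2, 2), nrow = 7, ncol = 3)
matching(a, a)  # simple matching distance
cooccur(a)      # cooccurrence distance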

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000

C. Mixed variables

There are six distance methods for mixed-variable data, namely Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), Harikumar-PV (harikumar), and Ahmad-Dey (ahmad). The distmix function calculates mixed-variable distances; it requires the column ids of each class of variables. As an example, we create a 7 by 9 data frame that consists of three variables of each class. Then the Gower and Wishart distances are calculated.
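A sketch of how such a data frame could be constructed; the numerical columns reuse the first rows of iris, the binary matrix a comes from the previous subsection, and the categorical columns are hard-coded to match the printout below.

mixdata <- cbind(iris[1:7, 1:3], a,
                 c(1, 2, 1, 1, 2, 1, 2),
                 c(3, 2, 2, 2, 2, 1, 3),
                 c(3, 3, 1, 3, 2, 3, 2))
colnames(mixdata) <- c(paste0("num", 1:3), paste0("bin", 1:3), paste0("cat", 1:3))
mixdata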

##   num1 num2 num3 bin1 bin2 bin3 cat1 cat2 cat3
## 1  5.1  3.5  1.4    1    2    2    1    3    3
## 2  4.9  3.0  1.4    1    2    1    2    2    3
## 3  4.7  3.2  1.3    2    1    2    1    2    1
## 4  4.6  3.1  1.5    2    1    2    1    2    3
## 5  5.0  3.6  1.4    1    1    1    2    2    2
## 6  5.4  3.9  1.7    2    2    2    1    1    3
## 7  4.6  3.4  1.4    2    1    2    2    3    2
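The two distances can then be computed by passing the column ids of each variable class; idnum, idbin, and idcat are assumed to be the names of these arguments.

distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)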
##           1         2         3         4         5         6         7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
##           1         2         3         4         5        6         7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000

3. K-medoids algorithms

Several k-medoids algorithms exist; partitioning around medoids (pam), for example, is available in the cluster package. In the kmed package, the available algorithms are simple and fast k-medoids (fastkmed) (Park and Jun 2009), ranked k-medoids (rankkmed) (Zadegan, Mirzaie, and Sadoughi 2013), and step k-medoids (stepkmed) (Yu et al. 2018).

# simple and fast k-medoids with 3 clusters on the mrw distance matrix
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
(fastiris <- table(result$cluster, iris[,5]))
##    
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         39         3
##   3      0         11        47
# misclassification rate (%): share of off-diagonal entries in the table
(misclass <- (1-sum(diag(fastiris))/length(iris[,5]))*100)
## [1] 9.333333

Applying fastkmed to the iris data with the Manhattan weighted by rank distance, the misclassification rate is 9.33%.

4. Cluster validation

A. Internal criteria

To validate a clustering result using internal criteria, the silhouette and shadow value (CSV) indices can be applied via the silhoutte and shadow functions. The silhoutte function produces a silhouette index and a silhouette plot, while the shadow function produces and plots shadow values. Both functions take three arguments, i.e. the distance matrix, the medoid ids, and the cluster membership. The output is a list containing the silhouette/shadow index and the corresponding plot.
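A sketch of both calls for the fastkmed result from section 3, assuming the argument order described above (distance matrix, medoid ids, memberships); the row subsets printed below come from the returned index tables.

# result$medoid is assumed to hold the medoid ids returned by fastkmed
siliris <- silhoutte(mrwdist, result$medoid, result$cluster)  # silhouette index and plot
shadiris <- shadow(mrwdist, result$medoid, result$cluster)    # shadow values and plot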

Silhouette values (selected rows) and plot:

##     silhoutte cluster
## 1   0.8244574       1
## 2   0.7491049       1
## 3   0.7855731       1
## 70  0.6479288       2
## 71  0.1344483       3
## 72  0.5455987       2
## 73  0.2186731       2
## 74  0.4462472       2
## 75  0.3542140       2
## 101 0.4871723       3
## 102 0.0241100       3
## 103 0.5690102       3

Shadow values (CSV, selected rows) and plot:

##       shadval cluster
## 1   0.1158460       1
## 2   0.2888389       1
## 3   0.2493415       1
## 70  0.3422222       2
## 71  0.8000000       3
## 72  0.4744792       2
## 73  0.9568075       2
## 74  0.6676907       2
## 75  0.7713964       2
## 101 0.5912951       3
## 102 0.9042793       3
## 103 0.4388626       3

B. Relative criteria

1. Bootstrap replicate matrix

To evaluate a clustering algorithm, a bootstrap evaluation function (clustboot) can be applied. Before applying clustboot, the clustering algorithm to be evaluated must be defined first as a function with two input arguments, i.e. the distance/data matrix and the number of clusters, whose output is a vector of memberships. We create two such functions to be evaluated by the bootstrap: the first wraps simple and fast k-medoids, and the second wraps the k-means algorithm, which is obtained from the kmeans function in the stats package, as sketched below.
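A sketch of the two wrappers and their bootstrap evaluation; the clustboot arguments shown (nclust, algorithm, nboot, and a diss flag marking distance input) are assumptions based on the description above.

# wrapper 1: simple and fast k-medoids, returning only the membership vector
fastclust <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  return(res$cluster)
}
# wrapper 2: k-means from the stats package (takes raw data, not a distance)
kmclust <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}
irisfast <- clustboot(mrwdist, nclust = 3, algorithm = fastclust, nboot = 50)
iriskm <- clustboot(num, nclust = 3, algorithm = kmclust, nboot = 50, diss = FALSE)
irisfast[1:5, 1:10]
iriskm[1:5, 1:10]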

The result is an n x b matrix, where n is the number of objects and b is the number of bootstrap replicates. An entry of 0 indicates that the object is not selected in that bootstrap replicate. The first rows of the fastkmed (first matrix) and kmeans (second matrix) replicates:

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    1    0    2    3    0    3    1     1
## [2,]    3    0    1    1    2    3    0    0    1     1
## [3,]    0    0    0    0    2    3    3    0    0     1
## [4,]    3    0    1    0    0    3    3    0    0     1
## [5,]    0    3    0    1    2    0    3    3    1     1
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    2    0    3    3    0    1    2    2     0
## [2,]    1    0    3    3    3    2    1    2    2     2
## [3,]    1    2    3    3    3    2    1    2    2     2
## [4,]    0    2    0    3    0    2    1    2    0     2
## [5,]    1    0    3    0    0    1    1    2    0     0

2. Consensus matrix

A consensus matrix (n x n) can be produced from a bootstrap replicate matrix. To create the consensus matrix, a function to order the objects is required. The consensusmatrix function provides a reorder argument through which an ordering method can be supplied. The ordering method must have two input arguments, namely a distance matrix and the number of clusters, and output only a membership vector. For example, hierarchical clustering with Ward linkage is applied to order the objects in the consensus matrix, as sketched below.
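A sketch of a Ward-linkage ordering function and the consensusmatrix call; reorder is assumed to be the name of the argument taking the ordering function.

# ordering method: hierarchical clustering with ward linkage
wardorder <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")
  member <- cutree(res, nclust)
  return(member)
}
consensusfast <- consensusmatrix(irisfast, nclust = 3, reorder = wardorder)
consensusfast[c(1:5, 51:60), c(1:5, 51:60)]  # print a slice (indices assumed)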

##   1 1 1 1 1         2         2         2         2         2         2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.5555556 0.9047619 0.7666667
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9000000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.4545455 0.8500000 0.7931034
## 2 0 0 0 0 0 0.5555556 0.9000000 0.4545455 1.0000000 0.7777778 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.7777778 1.0000000 0.7037037
## 2 0 0 0 0 0 0.7666667 0.6000000 0.7931034 0.3200000 0.7037037 1.0000000
## 2 0 0 0 0 0 0.7727273 0.6000000 0.7500000 0.3684211 0.8125000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.4761905 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.7368421 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.7000000 1.0000000
##           2         2         2         2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7727273 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.7500000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.4761905 0.3500000 0.3000000
## 2 0.8125000 0.8800000 0.7368421 0.7000000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000

This consensus matrix can then be visualized directly as a heatmap.

3. Visualization of cluster consensus matrix

To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. Here, the consensus matrix heatmap of the iris data clustered by simple and fast k-medoids is produced.
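A sketch, assuming clustheatmap takes the consensus matrix and an optional title:

clustheatmap(consensusfast, "Iris data via fastkmed")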

We can also create a heatmap of the consensus matrix from the k-means bootstrap replicates.
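The same sketch applies, reusing the wardorder function from above:

consensuskm <- consensusmatrix(iriskm, nclust = 3, reorder = wardorder)
clustheatmap(consensuskm, "Iris data via kmeans")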

5. Cluster visualization

A. Biplot

When all variables in the data set are numerical, a PCA biplot can be plotted, giving a different colour to each cluster. pcabiplot has five arguments, with colobj setting the colour of the objects. The x and y arguments can also be changed to "PC2" and "PC3", for example, to plot the second versus the third principal component.
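A sketch, assuming pcabiplot accepts a prcomp object (its axes are named PC1, PC2, and so on):

pcadat <- prcomp(iris[, 1:4], scale. = TRUE)  # principal components of the iris data
pcabiplot(pcadat, colobj = result$cluster)    # colour objects by fastkmed cluster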

B. Barplot

The clustering result can also be plotted as a barplot. The barplotnum function creates a barplot for each cluster and also offers a significance test for each variable. It requires three arguments, i.e. the original data, the cluster membership, and the number of columns in the plot layout (nc). If there are four clusters, nc can be set to 2 such that the plots are arranged in a 2 by 2 grid.
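A sketch for the three-cluster iris result, assuming the argument order given above:

barplotnum(iris[, 1:4], result$cluster, nc = 2)  # barplots arranged in 2 columns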


References

Park, H., and C. Jun. 2009. “A Simple and Fast Algorithm for K-Medoids Clustering.” Expert Systems with Applications 36 (2). Elsevier:3336–41. https://doi.org/10.1016/j.eswa.2008.01.039.

Yu, D., G. Liu, M. Guo, and X. Liu. 2018. “An Improved K-Medoids Algorithm Based on Step Increasing and Optimizing Medoids.” Expert Systems with Applications 92 (February). Elsevier:464–73. https://doi.org/10.1016/j.eswa.2017.09.052.

Zadegan, S.M.R., M. Mirzaie, and F. Sadoughi. 2013. “Ranked K-Medoids: A Fast and Accurate Rank-Based Partitioning Algorithm for Clustering Large Datasets.” Knowledge-Based Systems 39 (February). Elsevier:133–43. https://doi.org/10.1016/j.knosys.2012.10.012.