The kmed package was designed to perform k-medoids based cluster analysis. Its features include:
For numerical variables, there are four distance options, namely Manhattan weighted by range (mrw), squared Euclidean weighted by range (ser), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The distNumeric function provides a method argument in which the desired distance can be chosen. In this article, the Manhattan weighted by range distance is applied to the popular iris data.
library(kmed)
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000
For binary and categorical variables, on the other hand, two distance methods are available, i.e. the simple matching (matching) and the co-occurrence (cooccur) distance. The matching function requires two matrices/data frames, while the cooccur function only needs one matrix/data frame.
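A minimal sketch of both calls (the binary matrix a is illustrative; because it is sampled randomly, the values printed below may differ). The first matrix below is the simple matching distance and the second is the co-occurrence distance.
# illustrative 7 x 3 binary matrix (values 1 or 2); reused later in the mixed-variable example
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
# simple matching distance: requires two matrices/data frames
matching(a, a)
# co-occurrence distance: needs only one matrix/data frame
cooccur(a)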
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
There are six distance methods for mixed-variable data, namely Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), Harikumar-PV (harikumar), and Ahmad and Dey (ahmad). The distmix function calculates mixed-variable distances; it requires the column ids of each class of variables. As an example, we create a 7 by 9 data frame that consists of three variables of each class. Then the gower and wishart distances are calculated.
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste(c("num"), 1:3, sep = ""), paste(c("bin"), 1:3, sep = ""), paste(c("cat"), 1:3, sep = ""))
mixdata
## num1 num2 num3 bin1 bin2 bin3 cat1 cat2 cat3
## 1 5.1 3.5 1.4 1 2 2 1 3 3
## 2 4.9 3.0 1.4 1 2 1 2 2 3
## 3 4.7 3.2 1.3 2 1 2 1 2 1
## 4 4.6 3.1 1.5 2 1 2 1 2 3
## 5 5.0 3.6 1.4 1 1 1 2 2 2
## 6 5.4 3.9 1.7 2 2 2 1 1 3
## 7 4.6 3.4 1.4 2 1 2 2 3 2
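The two distances can then be computed with distmix; a minimal sketch, assuming the column ids of the numerical, binary, and categorical variables are passed via the idnum, idbin, and idcat arguments. The first matrix below is the gower distance and the second is the wishart distance.
# gower and wishart distances; columns 1-3 numerical, 4-6 binary, 7-9 categorical
gowermix <- distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
wishartmix <- distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)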
## 1 2 3 4 5 6 7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
## 1 2 3 4 5 6 7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000
Several k-medoids algorithms exist; partitioning around medoids (pam), for example, is available in the cluster package. In the kmed package, the available algorithms are simple and fast k-medoids (fastkmed) (Park and Jun 2009), ranked k-medoids (rankkmed) (Zadegan, Mirzaie, and Sadoughi 2013), and step k-medoids (stepkmed) (Yu et al. 2018).
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
(fastiris <- table(result$cluster, iris[,5]))
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 39 3
## 3 0 11 47
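The misclassification rate printed below can be recovered from this cross-tabulation; a minimal sketch that counts the objects falling outside the majority class of their cluster:
# percentage of objects outside the majority class of their cluster
(sum(fastiris) - sum(apply(fastiris, 1, max))) / sum(fastiris) * 100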
## [1] 9.333333
Applying fastkmed to the iris data with the Manhattan weighted by range distance, the misclassification rate is 9.33%.
To validate the clustering using an internal criterion, the silhouette and shadow value (CSV) indices can be applied. The silhoutte function produces a silhouette index and a silhouette plot, while the shadow function produces and plots the shadow values. Both functions require three arguments, i.e. a distance matrix or object, the medoid ids, and the cluster membership. The output is then a list containing the silhouette/shadow index and the corresponding plot.
Silhouette plot:
rownames(mrwdist) <- colnames(mrwdist) <- 1:nrow(mrwdist)
siliris <- silhoutte(mrwdist, result$medoid, result$cluster)
siliris$result[c(1:3,70:75,101:103),]
## silhoutte cluster
## 1 0.8244574 1
## 2 0.7491049 1
## 3 0.7855731 1
## 70 0.6479288 2
## 71 0.1344483 3
## 72 0.5455987 2
## 73 0.2186731 2
## 74 0.4462472 2
## 75 0.3542140 2
## 101 0.4871723 3
## 102 0.0241100 3
## 103 0.5690102 3
Shadow value (CSV) plot:
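A minimal sketch of the call, assuming the shadow function takes the same three arguments and returns the same result structure as silhoutte:
sdiris <- shadow(mrwdist, result$medoid, result$cluster)
sdiris$result[c(1:3,70:75,101:103),]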
## shadval cluster
## 1 0.1158460 1
## 2 0.2888389 1
## 3 0.2493415 1
## 70 0.3422222 2
## 71 0.8000000 3
## 72 0.4744792 2
## 73 0.9568075 2
## 74 0.6676907 2
## 75 0.7713964 2
## 101 0.5912951 3
## 102 0.9042793 3
## 103 0.4388626 3
To evaluate a clustering algorithm, a bootstrap evaluation function (clustboot) can be applied. Before applying clustboot, the clustering algorithm to be evaluated must be defined first. It must be a function with two arguments, i.e. the distance matrix (or the data) and the number of clusters, and its output must be a vector of cluster memberships. We create two such clustering functions to be evaluated by the bootstrap. The first is the simple and fast k-medoids and the second is the k-means algorithm, taken from the kmeans function in the stats package.
k <- 3
# a simple and fast k-medoids function for bootstrap evaluation
parkboot <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  return(res$cluster)
}
# k-means function for bootstrap evaluation
kmboot <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}
The result is an n x b matrix, where n is the number of objects and b is the number of bootstrap replicates.
fastkmedboot <- clustboot(mrwdist, nclust=k, parkboot, nboot=50)
kmeansboot <- clustboot(num, nclust=k, kmboot, nboot=50, diss = FALSE)
fastkmedboot[1:5,c(1:5,46:50)]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 1 0 2 3 0 3 1 1
## [2,] 3 0 1 1 2 3 0 0 1 1
## [3,] 0 0 0 0 2 3 3 0 0 1
## [4,] 3 0 1 0 0 3 3 0 0 1
## [5,] 0 3 0 1 2 0 3 3 1 1
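The second matrix below shows the same slice of the k-means bootstrap matrix, presumably printed with:
kmeansboot[1:5,c(1:5,46:50)]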
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 2 0 3 3 0 1 2 2 0
## [2,] 1 0 3 3 3 2 1 2 2 2
## [3,] 1 2 3 3 3 2 1 2 2 2
## [4,] 0 2 0 3 0 2 1 2 0 2
## [5,] 1 0 3 0 0 1 1 2 0 0
A consensus matrix (n x n) can be produced from a bootstrap replicate matrix. To create the consensus matrix, a function to order the objects is required. The consensusmatrix function provides a reorder argument in which a reordering method can be supplied. The reorder method must have two input arguments, namely a distance matrix and the number of clusters, and its output must be a membership vector. For example, a hierarchical clustering algorithm with Ward linkage is applied to order the objects in the consensus matrix.
wardorder <- function(x, nclust) {
  # Ward-linkage hierarchical clustering; as.dist() coerces a plain distance
  # matrix to a "dist" object, as required by hclust()
  res <- hclust(as.dist(x), method = "ward.D2")
  member <- cutree(res, nclust)
  return(member)
}
consensusfastkmed <- consensusmatrix(fastkmedboot, nclust = k, wardorder)
consensusfastkmed[c(1:5,51:55,101:105),c(1:5,51:55,101:105)]
## 1 1 1 1 1 2 2 2 2 2 2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.5555556 0.9047619 0.7666667
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9000000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.4545455 0.8500000 0.7931034
## 2 0 0 0 0 0 0.5555556 0.9000000 0.4545455 1.0000000 0.7777778 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.7777778 1.0000000 0.7037037
## 2 0 0 0 0 0 0.7666667 0.6000000 0.7931034 0.3200000 0.7037037 1.0000000
## 2 0 0 0 0 0 0.7727273 0.6000000 0.7500000 0.3684211 0.8125000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.4761905 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.7368421 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.7000000 1.0000000
## 2 2 2 2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7727273 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.7500000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.4761905 0.3500000 0.3000000
## 2 0.8125000 0.8800000 0.7368421 0.7000000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000
This consensus matrix can then be visualized directly as a heatmap. To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. The consensus matrix heatmap of the iris data clustered by simple and fast k-medoids is produced below.
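A minimal sketch of the call, mirroring the k-means call shown below (the title string is illustrative):
clustheatmap(consensusfastkmed, "Iris Data via Fast K-medoids")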
We can also create a heatmap of the consensus matrix from the k-means bootstrapping.
consensuskmeans <- consensusmatrix(kmeansboot, nclust = k, wardorder)
clustheatmap(consensuskmeans, "Iris Data via K-means")
With all of the variables in the data set being numerical, a PCA biplot can be plotted, giving a different colour to each cluster. pcabiplot has five arguments, with colobj setting the colour of the objects. The x and y arguments can also be changed to "PC2" and "PC3", for example, if we want to plot the second versus the third principal components.
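A minimal sketch, assuming pcabiplot takes the numerical data as its first argument and the cluster memberships via colobj:
# colour the objects by their fastkmed cluster membership
pcabiplot(num, x = "PC1", y = "PC2", colobj = result$cluster)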
The clustering result can also be plotted as a barplot. The barplotnum function creates a barplot for each cluster and also offers a significance test for each variable. It requires three arguments: the original data, the cluster membership, and the number of columns in the plot layout (nc). If there are four clusters, nc can be set to 2 such that the plots are arranged in a 2 by 2 grid.
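A minimal sketch, assuming barplotnum takes the original data, the cluster membership, and nc in that order:
# barplots per cluster; with three clusters, nc = 2 arranges them in two columns
barplotnum(num, result$cluster, nc = 2)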
Park, H., and C. Jun. 2009. “A Simple and Fast Algorithm for K-Medoids Clustering.” Expert Systems with Applications 36 (2). Elsevier:3336–41. https://doi.org/10.1016/j.eswa.2008.01.039.
Yu, D., G. Liu, M. Guo, and X. Liu. 2018. “An Improved K-Medoids Algorithm Based on Step Increasing and Optimizing Medoids.” Expert Systems with Applications 92 (February). Elsevier:464–73. https://doi.org/10.1016/j.eswa.2017.09.052.
Zadegan, S.M.R, M. Mirzaie, and F. Sadoughi. 2013. “Ranked K-Medoids A Fast and Accurate Rank-Based Partitioning Algorithm for Clustering Large Datasets.” Knowledge-Based Systems 39 (February). Elsevier:133–43. https://doi.org/10.1016/j.knosys.2012.10.012.