# ClustVarLV for beginners

The ClustVarLV package is dedicated to the CLV method for the Clustering of Variables around Latent Variables (Vigneau & Qannari, 2003; Vigneau, Chen & Qannari, 2015).

library(ClustVarLV)

For illustration, we consider the “apples_sh” dataset, which includes the sensory characterization and the consumer preference scores for 12 varieties of apples (Daillant-Spinnler et al., 1996).

data(apples_sh)
# 43 sensory attributes of 12 varieties of apple from the southern hemisphere
senso<-apples_sh$senso
# Liking scores given by 60 consumers for each of the 12 varieties of apple
pref<-apples_sh$pref

## Clustering of the sensory attributes

The aim is to find groups of sensory attributes that are correlated, or anti-correlated, with one another. Herein “directional” groups are sought. Each group is associated with a latent component, which makes it possible to identify the underlying sensory dimensions.
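As a rough illustration of what the latent component of a “directional” group is, the base R sketch below simulates one group of four variables, two of which are anti-correlated with the other two. It takes the first principal component of the standardized variables as the group latent variable, which is how the directional criterion is described in Vigneau & Qannari (2003). The simulated data and all object names are assumptions made for this example; this is not the package's internal code.

```r
set.seed(1)
n <- 100
z <- rnorm(n)                              # hidden dimension driving the group
X <- cbind(a =  z + rnorm(n, sd = 0.3),
           b =  z + rnorm(n, sd = 0.3),
           c = -z + rnorm(n, sd = 0.3),    # anti-correlated with a and b
           d = -z + rnorm(n, sd = 0.3))
Xs <- scale(X)                             # auto-scaling, as with sX = TRUE

# First principal component of the group, used here as the latent variable
pc <- prcomp(Xs, center = FALSE, scale. = FALSE)
latent <- pc$x[, 1]

# Signed correlations: large in absolute value, with opposite signs
# for (a, b) on one side and (c, d) on the other
r <- drop(cor(Xs, latent))
print(round(r, 2))
```

The sign of a principal component is arbitrary, so only the relative signs within the group are meaningful.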

resclv_senso <- CLV(X = senso, method = "directional", sX = TRUE)
# option sX=TRUE means that each attribute will be auto-scaled (standard deviation =1)

# Print of the 'clv' object
print(resclv_senso)
# Dendrogram of the CLV hierarchical clustering algorithm :
plot(resclv_senso,"dendrogram")

# Graph of the variation of the clustering criterion
plot(resclv_senso,"delta")

The graph of the variation of the clustering criterion between a partition into K clusters and a partition into (K-1) clusters (after consolidation) is useful for determining the number of clusters to be retained. Because the criterion clearly jumps when passing from 4 to 3 groups, a partition into 4 groups is retained.
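The reasoning of looking for a jump can be sketched with a rough base R analogue. Note that this example does not use the CLV criterion: it swaps in `hclust()` with the dissimilarity 1 - cor² on simulated data, and reads the number of groups from the largest jump in merge heights. The data, the dissimilarity and the names `jump` and `K_hat` are assumptions made for the illustration.

```r
set.seed(2)
n <- 200
lat <- matrix(rnorm(n * 3), n, 3)           # three independent hidden dimensions
# Four noisy copies of each hidden dimension: 12 variables in 3 groups
X <- lat[, rep(1:3, each = 4)] + matrix(rnorm(n * 12, sd = 0.3), n, 12)

d  <- as.dist(1 - cor(X)^2)                 # small when |cor| is large
hc <- hclust(d, method = "complete")

# After merge i, (p - i) clusters remain. The largest increase in merge
# height separates the cheap merges from the costly ones.
p <- ncol(X)
jump <- diff(hc$height)
i <- which.max(jump)                        # jump between merge i and merge i + 1
K_hat <- p - i                              # clusters present before the costly merge
print(K_hat)
```

As in the text above, the partition retained is the one that exists just before the costly merge.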

# Summary of the CLV results for a partition into 4 groups
summary(resclv_senso,K=4)
## $number
## clusters
##  1  2  3  4
## 12 14 12  5
##
## $prop_within
##      Group.1 Group.2 Group.3 Group.4
## [1,]  0.8355  0.7337   0.734  0.7289
##
## $prop_tot
## [1] 0.7616
##
## $groups
## $groups[[1]]
##         cor in group  |cor|next group
## iogreen         0.98             0.74
## ioredap        -0.97             0.80
## ioacids         0.96             0.74
## iounrip         0.96             0.68
## iocooka         0.96             0.81
## iagreen         0.92             0.60
## ioplums        -0.90             0.75
## iograss         0.89             0.72
## iayelow        -0.89             0.63
## iagreli         0.89             0.55
## iosweet        -0.86             0.79
## iawhite         0.76             0.60
##
## $groups[[2]]
##         cor in group  |cor|next group
## asgreen         0.94             0.80
## flgreen         0.93             0.81
## flredap        -0.93             0.88
## flunrip         0.93             0.64
## asredap        -0.93             0.83
## asastri         0.92             0.58
## assweet        -0.90             0.63
## flsweet        -0.86             0.60
## flacids         0.86             0.62
## flgrass         0.85             0.82
## flplumc        -0.83             0.66
## asacids         0.83             0.56
## flpdrop        -0.76             0.59
## iodampt         0.33             0.27
##
## $groups[[3]]
##         cor in group  |cor|next group
## txcrisp         0.97             0.54
## txjuicy         0.94             0.58
## txspong        -0.94             0.53
## fbhardn         0.93             0.55
## iajuicy         0.90             0.52
## flfresh         0.90             0.66
## iapulpy        -0.83             0.64
## flpearl        -0.81             0.77
## iatrans         0.79             0.66
## fbjuicy         0.79             0.53
## txslowb         0.78             0.45
## flsoapy        -0.64             0.53
##
## $groups[[4]]
##         cor in group  |cor|next group
## asbitte         0.95             0.34
## flbitte         0.90             0.41
## flcoxli        -0.84             0.54
## flofffl         0.81             0.29
## flwater         0.75             0.31
##
##
## $set_aside
## NULL
##
## $cormatrix
##       Comp1 Comp2 Comp3 Comp4
## Comp1  1.00  0.76  0.43  0.43
## Comp2  0.76  1.00  0.67  0.19
## Comp3  0.43  0.67  1.00  0.01
## Comp4  0.43  0.19  0.01  1.00

The function plot_var() displays the groups of variables in a two-dimensional space obtained by Principal Component Analysis. Several options are available: choosing the axes, adding labels, using symbols instead of colours, and producing either a single plot or one plot per group of variables.

# Representation of the group membership for a partition into 4 groups
plot_var(resclv_senso,K=4,label=TRUE,cex.lab=0.8)

or

plot_var(resclv_senso,K=4,beside=TRUE)
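To give an idea of what this kind of display contains, here is a minimal base R sketch of a PCA variable map coloured by group. It is not the code of plot_var(): the simulated data, the `group` vector and the plotting choices are all assumptions made for the illustration. Each variable is positioned by its correlations with the first two principal components.

```r
set.seed(3)
n <- 100
lat <- matrix(rnorm(n * 2), n, 2)           # two hidden dimensions
X <- lat[, rep(1:2, each = 3)] + matrix(rnorm(n * 6, sd = 0.4), n, 6)
colnames(X) <- paste0("v", 1:6)
group <- rep(1:2, each = 3)                 # hypothetical group membership

pc <- prcomp(X, scale. = TRUE)
coords <- cor(X, pc$x[, 1:2])               # one row per variable: PC1, PC2

plot(coords, type = "n", xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
     xlab = "Dim 1", ylab = "Dim 2")
abline(h = 0, v = 0, lty = 3)
arrows(0, 0, coords[, 1], coords[, 2], length = 0.08, col = group)
text(coords, labels = rownames(coords), col = group, pos = 3, cex = 0.8)
```

Variables of the same group point in similar (or, for directional groups, opposite) directions on such a map.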

# Extract the group membership of each variable
get_partition(resclv_senso,K=4,type="vector")
# or
get_partition(resclv_senso,K=4,type="matrix")

# Extract the group latent variables
get_comp(resclv_senso,K=4)

## Clustering of the consumers’ preference data

The aim is to find segments of consumers. Herein “local” groups are sought. Each group latent variable represents a synthetic direction of preference. If, simultaneously, the aim is to explain these directions of preference by means of the sensory attributes of the products, the sensory data has to be included as external data.

res.segext<- CLV(X = pref, Xr = senso, method = "local", sX=TRUE, sXr = TRUE)

print(res.segext)
plot(res.segext,"dendrogram")

plot(res.segext,"delta") 

Two or three segments may be explored. To compare the partitions into two and three segments:

table(get_partition(res.segext,K=2),get_partition(res.segext,K=3))
##
##      1  2  3
##   1 12 28  0
##   2  2  0 18
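Cluster labels are arbitrary, so such a table is read through its pattern of zero and non-zero cells. The sketch below rebuilds two label vectors from the counts printed above (the actual order of the consumers is not known here, so `k2` and `k3` are reconstructions, not the package output) and expresses each segment of the three-segment partition as shares of the two-segment partition.

```r
# Label vectors rebuilt from the four non-zero cells of the table above
k2 <- rep(c(1, 1, 2, 2), times = c(12, 28, 2, 18))
k3 <- rep(c(1, 2, 1, 3), times = c(12, 28, 2, 18))

tab <- table(K2 = k2, K3 = k3)
print(tab)

# Share of each K = 3 segment coming from each K = 2 segment
# (each column sums to 1)
print(round(prop.table(tab, margin = 2), 2))
```

Segments 2 and 3 of the three-segment partition each come entirely from a single segment of the two-segment partition, whereas segment 1 gathers 12 consumers from the first segment and 2 from the second.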

Since each latent variable is a linear combination of the external (sensory) variables, it is possible to extract the associated loadings:

get_load(res.segext,K=3)
##               Comp1       Comp2        Comp3
## iosweet  0.09589740 -0.11243600  0.170065231
## ioacids -0.04386473  0.14575014 -0.181554627
## iogreen -0.11714632  0.09817949 -0.188648499
## ioredap  0.08408320 -0.10912674  0.210758453
## iograss -0.12869347  0.09064836 -0.164391644
## iounrip -0.11022465  0.10832505 -0.159098176
## iocooka -0.03972714  0.15610353 -0.180077680
## iodampt  0.01716467  0.06379372 -0.104010800
## ioplums  0.12919401 -0.06974771  0.194629392
## iawhite -0.14814587  0.05571348 -0.072367992
## iagreen -0.06489321  0.11467716 -0.167953543
## iayelow  0.10745652 -0.09712042  0.135004928
## iagreli  0.01142059  0.12511441 -0.172633043
## iajuicy  0.20647838  0.21459247 -0.113849158
## iatrans  0.08181864  0.20027109 -0.113725956
## iapulpy -0.02530062 -0.16927896  0.123476168
## fbjuicy  0.23053897  0.21393868 -0.085459713
## fbhardn  0.14217811  0.21350356 -0.074432978
## txcrisp  0.18273207  0.23060437 -0.081569097
## txjuicy  0.21365042  0.23119349 -0.100979076
## txslowb  0.05108176  0.15451433 -0.073779800
## txspong -0.17462895 -0.21172483  0.075861428
## flgreen  0.08240198  0.21148600 -0.212099157
## flredap  0.06608129 -0.14564504  0.190123018
## flsweet  0.07624093 -0.08040511  0.191963762
## flacids  0.12759887  0.17878462 -0.214577876
## flbitte -0.28768482 -0.01853208 -0.002779502
## flgrass -0.05981788  0.15996944 -0.166073967
## flfresh  0.25544419  0.25096548 -0.132150873
## flpdrop  0.09307223 -0.10540908  0.103396259
## flwater -0.24348387 -0.05511837  0.021650008
## flofffl -0.32171652 -0.12038473  0.098986497
## flplumc -0.02766669 -0.15985868  0.159218105
## flunrip  0.03659315  0.16572691 -0.227213615
## flcoxli  0.27848883  0.04023097  0.064057522
## flpearl -0.14561103 -0.20872572  0.152113914
## flsoapy -0.22462943 -0.17028960  0.122183453
## assweet  0.11042842 -0.08378596  0.195131491
## asacids  0.10209954  0.16407373 -0.208978301
## asbitte -0.32002510 -0.04858914  0.007748031
## asgreen  0.08426373  0.20758903 -0.222199420
## asredap  0.08574768 -0.13062593  0.184568786
## asastri  0.02003463  0.15128160 -0.216116849

## Using the CLV_kmeans function

This procedure is less time-consuming when the number of variables is large. The number of clusters needs to be fixed in advance (e.g. 3).

The initialization of the algorithm can be random, repeated “nstart” times:

res.clvkm.rd<-CLV_kmeans(X = pref, Xr = senso, method = "local", sX=TRUE,
sXr = TRUE, clust=3, nstart=100)

or the initialization can be defined by the user, for instance on the basis of the clusters obtained by cutting the CLV dendrogram to get 3 clusters:

res.clvkm.hc<-CLV_kmeans(X = pref, Xr = senso, method = "local", sX=TRUE,
sXr = TRUE, clust=res.segext[[3]]$clusters[1,])

It is possible to compare the partitions according to the procedure used:

table(get_partition(res.segext,K=3),get_partition(res.clvkm.hc,K=3))
##
##      1  2  3
##   1 14  0  0
##   2  0 28  0
##   3  0  0 18

In this case, the CLV solution is the same as the CLV_kmeans solution with an initialization based on the partition obtained by cutting the dendrogram.

table(get_partition(res.segext,K=3),get_partition(res.clvkm.rd,K=3))
##
##      1  2  3
##   1 13  0  1
##   2  0  0 28
##   3  1 17  0

The partitions are very close: up to a relabelling of the clusters, only 2 of the 60 consumers change groups.

## Clustering while setting aside atypical or noisy variables

This functionality is available with the CLV_kmeans procedure. You can refer to Vigneau, Qannari, Navez & Cottet (2016) and Vigneau & Chen (2016) for theoretical details.

Considering the sensory data, applying (as shown below) the “kplusone” strategy makes it possible to identify and set aside (in a group “G0”) a spurious attribute.

clvkm_senso_kpone<-CLV_kmeans(X = senso, method = "directional",sX=TRUE,
clust=4, strategy="kplusone",rho=0.5)
get_partition(clvkm_senso_kpone,type="matrix")
##         G.0 G.1 G.2 G.3 G.4
## iosweet   0   1   0   0   0
## ioacids   0   1   0   0   0
## iogreen   0   1   0   0   0
## ioredap   0   1   0   0   0
## iograss   0   1   0   0   0
## iounrip   0   1   0   0   0
## iocooka   0   1   0   0   0
## iodampt   1   0   0   0   0
## ioplums   0   1   0   0   0
## iawhite   0   1   0   0   0
## iagreen   0   1   0   0   0
## iayelow   0   1   0   0   0
## iagreli   0   1   0   0   0
## iajuicy   0   0   0   1   0
## iatrans   0   0   0   1   0
## iapulpy   0   0   0   1   0
## fbjuicy   0   0   0   1   0
## fbhardn   0   0   0   1   0
## txcrisp   0   0   0   1   0
## txjuicy   0   0   0   1   0
## txslowb   0   0   0   1   0
## txspong   0   0   0   1   0
## flgreen   0   0   0   0   1
## flredap   0   0   0   0   1
## flsweet   0   0   0   0   1
## flacids   0   0   0   0   1
## flbitte   0   0   1   0   0
## flgrass   0   0   0   0   1
## flfresh   0   0   0   1   0
## flpdrop   0   0   0   0   1
## flwater   0   0   1   0   0
## flofffl   0   0   1   0   0
## flplumc   0   0   0   0   1
## flunrip   0   0   0   0   1
## flcoxli   0   0   1   0   0
## flpearl   0   0   0   1   0
## flsoapy   0   0   0   1   0
## assweet   0   0   0   0   1
## asacids   0   0   0   0   1
## asbitte   0   0   1   0   0
## asgreen   0   0   0   0   1
## asredap   0   0   0   0   1
## asastri   0   0   0   0   1

For the consumers' liking data, varying the parameter “rho” associated with the “kplusone” strategy sets aside a larger or smaller proportion of consumers:

sizG0<-NULL
for (r in seq(0,1,0.1)) {
  res<-CLV_kmeans(X = pref, method = "local", sX=TRUE, clust=3, nstart=20,
  strategy="kplusone",rho=r)
  sizG0<-c(sizG0,sum(get_partition(res)==0))
}
plot(seq(0,1,0.1),sizG0,type="b",xlab="rho",ylab="# var in noise cluster")

By choosing rho=0.4, 8 out of 60 consumers are assigned to the noise cluster. They are highlighted in gray when using the “plot_var” function.

plot_var(CLV_kmeans(X = pref, method = "local", sX=TRUE, clust=3, nstart=20,
strategy="kplusone",rho=0.4))

## Warning: Changes with respect to versions 1.1, 1.2 and 1.3 of the ClustVarLV package

The changes are illustrated on the basis of the examples given above.

| from version 1.4.0 | for earlier versions |
|---|---|
| resclv_senso <- CLV(X = senso, method = "directional", sX = TRUE) | resclv_senso <- CLV(X = senso, method=1, sX = TRUE, graph=TRUE) |
| plot(resclv_senso,"dendrogram"); plot(resclv_senso,"delta") | |
| summary(resclv_senso,K=4) | descript_gp(resclv_senso,X=senso,K=4) |
| plot_var(resclv_senso,K=4) | gpmb_on_pc(resclv_senso,X=senso,K=4) |
| get_partition(resclv_senso,K=4,type="vector") | resclv_senso[[4]]$clusters[2,] |
| get_comp(resclv_senso,K=4) | resclv_senso[[4]]$comp |
| get_load(res.segext,K=3) | res.segext[[3]]$loading |
| res.clvkm.rd<-CLV_kmeans(X = pref, Xr = senso, method = "local", sX=TRUE, sXr = TRUE, clust=3, nstart=100) | res.clvkm.rd<-CLV_kmeans(X = pref, Xr = senso, method = 2, sX=TRUE, sXr = TRUE, init=3, nstart=100) |

## References

Daillant-Spinnler B., MacFie H.J.H., Beyts P., Hedderley D. (1996). Relationships between perceived sensory properties and major preference directions of 12 varieties of apples from the southern hemisphere. Food Quality and Preference, 7(2), 113-126.

Vigneau E., Qannari E.M. (2003). Clustering of variables around latent components. Communications in Statistics - Simulation and Computation, 32(4), 1131-1150.

Vigneau E., Chen M., Qannari E.M. (2015). ClustVarLV: An R Package for the Clustering of Variables Around Latent Variables. The R Journal, 7(2), 134-148.

Vigneau E., Qannari E.M., Navez B., Cottet V. (2016). Segmentation of consumers in preference studies while setting aside atypical or irrelevant consumers. Food Quality and Preference, 47, 54-63.

Vigneau E., Chen M. (2016). Dimensionality reduction by clustering of variables while setting aside atypical variables. Electronic Journal of Applied Statistical Analysis, 9(1), 134-153.