To be able to run the SOM algorithm, you have to load the package called
SOMbrero. The function used to run it is called trainSOM() and is detailed
below.
This documentation only considers the case of dissimilarity matrices.
The trainSOM function has several arguments, but only the first one is
required. This argument is x.data, the dataset used to train the SOM. In this
documentation, it is passed to the function as a matrix or a data frame. It
must be a dissimilarity matrix, i.e., a symmetric matrix of non-negative
numbers with zero entries on the diagonal.
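The three properties just listed can be checked before training. The following is a minimal sketch in base R; the helper name is_dissimilarity and its tolerance argument are hypothetical and are not part of SOMbrero.

```r
# Hypothetical helper (not part of SOMbrero): checks that a candidate input
# is square, symmetric, non-negative and has a zero diagonal.
is_dissimilarity <- function(d, tol = 1e-8) {
  d <- as.matrix(d)
  is.numeric(d) &&
    nrow(d) == ncol(d) &&
    !anyNA(d) &&
    all(d >= 0) &&
    all(abs(d - t(d)) <= tol) &&
    all(abs(diag(d)) <= tol)
}

# Toy example: Euclidean distances between three points
pts <- matrix(c(0, 0, 3, 4, 6, 8), ncol = 2, byrow = TRUE)
d_ok <- as.matrix(dist(pts))
is_dissimilarity(d_ok)

# Breaking the symmetry makes the check fail
d_bad <- d_ok
d_bad[1, 2] <- 1
is_dissimilarity(d_bad)
```

Running such a check first gives a clearer error than a failure inside the training function.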
The other arguments are the same as the arguments passed to the initSOM
function (they are parameters defining the algorithm, see help(initSOM)
for further details).
The trainSOM function returns an object of class somRes (see help(trainSOM)
for further details on this class).
The following table indicates which graphics are available for a relational SOM.
Type | Energy | Obs | Prototypes | Add | Super Cluster |
---|---|---|---|---|---|
no type | x | ||||
hitmap | x | x | |||
color | x | ||||
lines | x | x | x2 | ||
barplot | x | x | x2 | ||
radar | x | x | x2 | ||
pie | x | x2 | |||
boxplot | x | ||||
3d | |||||
poly.dist | x | x | |||
umatrix | x | ||||
smooth.dist | x | ||||
words | x | ||||
names | x | x | |||
graph | x | x | |||
mds | x | x | |||
grid.dist | x | ||||
grid | x | ||||
dendrogram | x | ||||
dendro3d | x |
In the “Super Cluster” column, “x2” marks a plot that is available for both the data set variables and the additional variables.
lesmis data set
The lesmis data set provides the coappearance graph of the characters of
the novel Les Miserables (Victor Hugo). Each vertex stands for a character whose
name is given by the vertex label. One edge means that the corresponding two
characters appear in a common chapter in the book. Each edge also has a value
indicating the number of coappearances. The lesmis data contain two
objects: the first one, lesmis, is an igraph object (see the igraph web page)
with 77 nodes and 254 edges.
Further information on this data set is provided with help(lesmis).
data(lesmis)
lesmis
## IGRAPH 3babff7 U--- 77 254 --
## + attr: layout (g/n), id (v/n), label (v/c), value (e/n)
## + edges from 3babff7:
## [1] 1-- 2 1-- 3 1-- 4 3-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1--10
## [11] 11--12 4--12 3--12 1--12 12--13 12--14 12--15 12--16 17--18 17--19
## [21] 18--19 17--20 18--20 19--20 17--21 18--21 19--21 20--21 17--22 18--22
## [31] 19--22 20--22 21--22 17--23 18--23 19--23 20--23 21--23 22--23 17--24
## [41] 18--24 19--24 20--24 21--24 22--24 23--24 13--24 12--24 24--25 12--25
## [51] 25--26 24--26 12--26 25--27 12--27 17--27 26--27 12--28 24--28 26--28
## [61] 25--28 27--28 12--29 28--29 24--30 28--30 12--30 24--31 31--32 12--32
## [71] 24--32 28--32 12--33 12--34 28--34 12--35 30--35 12--36 35--36 30--36
## + ... omitted several edges
plot(lesmis, vertex.size=0)
The dissim.lesmis object is a matrix with entries equal to the length of
the shortest path between two characters (obtained with the function
shortest.paths of package igraph). Note that its row and column
names have been initialized to the characters' names to ease the use of the
graphical functions of SOMbrero.
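A matrix of this kind can be built from any igraph object. The sketch below assumes that the igraph package is installed and uses distances(), the name under which recent igraph versions expose shortest-path lengths. It works on a toy ring graph with hypothetical vertex labels, not on lesmis, and it counts edges rather than using edge values, which is an assumption about how the dissimilarity is meant to be computed.

```r
library(igraph)

# Toy graph: a ring of 5 vertices with hypothetical labels
g <- make_ring(5)
V(g)$label <- c("A", "B", "C", "D", "E")

# Shortest-path lengths between all pairs of vertices;
# weights = NA ignores any edge weights and counts edges only
d <- distances(g, weights = NA)

# Name rows and columns after the vertices, as done for dissim.lesmis
rownames(d) <- colnames(d) <- V(g)$label
d
```

The result is symmetric with a zero diagonal, so it has the form expected for a relational SOM input, provided the graph is connected (unreachable pairs would yield infinite entries).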
set.seed(622)
mis.som <- trainSOM(x.data=dissim.lesmis, type="relational", nb.save=10,
init.proto="random", radius.type="letremy")
plot(mis.som, what="energy")
The dissimilarity matrix dissim.lesmis is passed to the trainSOM
function as input. Because intermediate states of the SOM have been saved
(nb.save=10), the evolution of the energy can be plotted: it stabilizes in the
last 100 iterations.
The clustering component provides the classification of each of the 77
characters. The table function is a simple way to view the distribution of
the data on the map.
mis.som$clustering
## Myriel Napoleon MlleBaptistine MmeMagloire
## 5 5 4 4
## CountessDeLo Geborand Champtercier Cravatte
## 5 5 5 5
## Count OldMan Labarre Valjean
## 5 5 2 2
## Marguerite MmeDeR Isabeau Gervais
## 2 6 1 7
## Tholomyes Listolier Fameuil Blacheville
## 21 21 21 21
## Favourite Dahlia Zephine Fantine
## 21 21 21 22
## MmeThenardier Thenardier Cosette Javert
## 18 23 13 17
## Fauchelevent Bamatabois Perpetue Simplice
## 1 11 22 17
## Scaufflaire Woman1 Judge Champmathieu
## 3 1 11 11
## Brevet Chenildieu Cochepaille Pontmercy
## 11 11 11 19
## Boulatruelle Eponine Anzelma Woman2
## 23 23 23 3
## MotherInnocent Gribier Jondrette MmeBurgon
## 1 1 15 15
## Gavroche Gillenormand Magnon MlleGillenormand
## 20 13 18 13
## MmePontmercy MlleVaubois LtGillenormand Marius
## 13 13 13 19
## BaronessT Mabeuf Enjolras Combeferre
## 19 25 25 25
## Prouvaire Feuilly Courfeyrac Bahorel
## 25 25 25 25
## Bossuet Joly Grantaire MotherPlutarch
## 25 25 20 25
## Gueulemer Babet Claquesous Montparnasse
## 23 23 23 23
## Toussaint Child1 Child2 Brujon
## 7 15 15 23
## MmeHucheloup
## 20
table(mis.som$clustering)
##
## 1 2 3 4 5 6 7 11 13 15 17 18 19 20 21 22 23 25
## 5 3 2 2 8 1 2 6 6 4 2 2 3 3 7 2 9 10
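Because the clustering component is a named vector, ordinary subsetting is enough to list the members of a neuron. The sketch below uses a short excerpt of the values printed above, so that it runs without the fitted map; with the real object, mis.som$clustering would take the place of the excerpt.

```r
# Excerpt of the clustering vector printed above (character -> neuron number)
clustering <- c(Myriel = 5, Napoleon = 5, MlleBaptistine = 4,
                MmeMagloire = 4, Valjean = 2, Cosette = 13)

# Names of the observations assigned to neuron 5
names(clustering)[clustering == 5]

# All names grouped by neuron number
split(names(clustering), clustering)
```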
plot(mis.som)
The clustering can be displayed using the plot function with type="names".
plot(mis.som, what="obs", type="names")
The original igraph object can also be superimposed on the map:
plot(mis.som, what="add", type="graph", var=lesmis)
Cluster profiles can be plotted with, e.g., lines or radar.
plot(mis.som, what="prototypes", type="lines")
plot(mis.som, what="prototypes", type="radar")
On these graphics, each variable is represented by a point (lines) or a slice (radar). It is therefore easy to see which variable affects which cluster.
To see how different the clusters are, some graphics show the distances between prototypes. These graphics have exactly the same behaviour as in the other SOM types.
"poly.dist" represents the distances between neighboring prototypes with polygons plotted for each cell of the grid. The smaller the distance between a polygon's vertex and a cell border, the closer the pair of prototypes. The colors indicate the number of observations in the neuron (white is used for empty neurons);
"umatrix" fills the neurons of the grid using colors that represent the average distance between the current prototype and its neighbors;
"smooth.dist" plots the mean distance between the current prototype and its neighbors with a color gradation;
"mds" plots the number of the neuron on a map according to a Multi Dimensional Scaling (MDS) projection;
"grid.dist" plots a point for each pair of prototypes, with x coordinates representing the distance between the prototypes in the input space, and y coordinates representing the distance between the corresponding neurons on the grid.
plot(mis.som, what="prototypes", type="poly.dist", print.title=TRUE)
plot(mis.som, what="prototypes", type="smooth.dist")
plot(mis.som, what="prototypes", type="umatrix", print.title=TRUE)
plot(mis.som, what="prototypes", type="mds")
plot(mis.som, what="prototypes", type="grid.dist")
Here we can see that the prototypes located in the top left corner of the map (e.g., clusters 4 and 5) are far from the others.
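The quantity shown by "grid.dist", as described above, can be sketched in base R. This is only an illustration under stated assumptions: the 3 x 3 grid and the prototypes are hypothetical, and the prototypes are taken to be plain 2D vectors so that dist() applies, whereas a relational SOM works from the dissimilarity matrix. It is not the code used by SOMbrero.

```r
set.seed(1)

# Hypothetical 3 x 3 grid: one row of coordinates per neuron
grid_coords <- as.matrix(expand.grid(x = 1:3, y = 1:3))

# Hypothetical vector prototypes: grid positions plus a little noise
# (assumption for illustration only)
protos <- grid_coords + matrix(rnorm(18, sd = 0.2), ncol = 2)

# One value per pair of neurons, in the same pair order for both vectors
d_input <- as.vector(dist(protos))       # distance between prototypes
d_grid  <- as.vector(dist(grid_coords))  # distance between neurons on the grid

# x: distance in the input space, y: distance on the grid
plot(d_input, d_grid, xlab = "Prototype distance", ylab = "Grid distance")
```

When the map preserves the topology of the data, the points roughly follow an increasing trend: prototypes that are close in the input space sit on neighboring neurons.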
Finally, the clustering can be displayed on the original graph:
plot(lesmis, vertex.label.color=rainbow(25)[mis.som$clustering], vertex.size=0)
legend(x="left", legend=1:25, col=rainbow(25), pch=19)
We can see that cluster 5 is very relevant to the story: as the characters of
this cluster appear only in the sub-story of the Bishop Myriel, he is the
only connection for all other characters of cluster 5. The same kind of
conclusion holds for cluster 11, among others. Most of the other clusters have a
small number of observations: it thus seems relevant to compute super clusters.
As the SOM algorithm produces quite a large number of clusters, it is possible to perform a hierarchical clustering. First, let us have an overview of the dendrogram:
plot(superClass(mis.som))
## Warning in plot.somSC(superClass(mis.som)): Impossible to plot the rectangles: no super clusters.
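The two-step workflow used here (inspect a dendrogram, then cut it into a chosen number of groups) can be sketched with base R alone. The prototype coordinates below are hypothetical, and the "ward.D" linkage is an assumption; the sketch uses hclust() and cutree() as a stand-in and does not claim to reproduce what superClass() does internally.

```r
set.seed(2)

# Hypothetical prototype coordinates for 6 neurons, forming two loose groups
protos <- rbind(matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
                matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))

# Step 1: build the tree and look at it before choosing a number of groups
tree <- hclust(dist(protos), method = "ward.D")  # linkage is an assumption
plot(tree)

# Step 2: cut the tree into k groups once k has been chosen
super <- cutree(tree, k = 2)
table(super)
```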
According to the proportion of variance explained by super clusters, 5 groups seem to be a good choice.
sc.mis <- superClass(mis.som, k=5)
summary(sc.mis)
##
## SOM Super Classes
## Initial number of clusters : 25
## Number of super clusters : 5
##
##
## Frequency table
## 1 2 3 4 5
## 9 2 4 6 4
##
## Clustering
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 1 1 2 2 1 1 1 1 3 1 1 4 4 3 5 5 4 4 3 5 5 4 4 3
##
##
## ANOVA
## F : 10.09429
## Degrees of freedom : 4
## p-value : 1.526375e-06
## significativity : ***
table(sc.mis$cluster)
##
## 1 2 3 4 5
## 9 2 4 6 4
plot(sc.mis)
plot(sc.mis, type="grid", plot.legend=TRUE)
plot(sc.mis, type="lines", print.title=TRUE)
plot(sc.mis, type="mds", plot.legend=TRUE)
plot(sc.mis, type="dendro3d")
library(RColorBrewer)
# Use the same 5-color palette as the legend below
plot(lesmis, vertex.size=0, vertex.label.color=
       brewer.pal(5, "Set2")[sc.mis$cluster[mis.som$clustering]])
legend(x="left", legend=paste("SC",1:5), col=brewer.pal(5, "Set2"), pch=19)
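The expression sc.mis$cluster[mis.som$clustering] used above maps each character to a super cluster by indexing the per-neuron super cluster vector with the per-character neuron numbers. The sketch below reproduces that lookup with values copied from the outputs shown earlier, so that it runs without the fitted objects; the variable names are hypothetical.

```r
# Super cluster of each of the 25 neurons, copied from summary(sc.mis) above
sc_of_neuron <- c(1, 1, 1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 4,
                  4, 3, 5, 5, 4, 4, 3, 5, 5, 4, 4, 3)

# Neuron of a few characters, copied from mis.som$clustering above
neuron_of_char <- c(Myriel = 5, Valjean = 2, Cosette = 13,
                    Tholomyes = 21, Enjolras = 25)

# Indexing one vector by the other gives the super cluster of each character
sc_of_char <- setNames(sc_of_neuron[neuron_of_char], names(neuron_of_char))
sc_of_char
```

The resulting vector has one entry per character and can be used directly to index a color palette, as in the plot call above.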