Self-Organizing Map for contingency tables

Basic package description

To be able to run the SOM algorithm, you have to load the package called SOMbrero. The function used to run it is called trainSOM() and is detailed below.

This documentation only considers the case of contingency tables.

Arguments

The trainSOM function has several arguments, but only the first one is required. This argument is x.data, the data set used to train the SOM. In this documentation, it is passed to the function as a matrix or a data frame. This data set must be a contingency table, i.e., all of its entries must be zero or positive integers. Column and row names must be supplied.
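The requirements on x.data can be checked before training. The following is a minimal sketch in base R; the helper name is_contingency_table and the toy table are illustrative assumptions and are not part of SOMbrero.

```r
# Hypothetical helper (not part of SOMbrero): checks the conditions stated
# above for the input table of a korresp SOM.
is_contingency_table <- function(x) {
  m <- as.matrix(x)
  is.numeric(m) &&
    !anyNA(m) &&
    all(m >= 0) &&            # no negative counts
    all(m == round(m)) &&     # integer-valued entries only
    !is.null(rownames(m)) &&
    !is.null(colnames(m))
}

# Toy table, invented for illustration: 3 districts x 2 candidates.
toy <- matrix(c(12, 0, 7, 3, 5, 9), nrow = 3,
              dimnames = list(c("d1", "d2", "d3"), c("A", "B")))

is_contingency_table(toy)          # every condition holds
is_contingency_table(unname(toy))  # row and column names are missing
```

The check is deliberately strict about names, since the names are what the "names" plot and the clustering vector rely on later.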

The other arguments are the same as the arguments passed to the initSOM function (they are parameters defining the algorithm, see help(initSOM) for further details).

Outputs

The trainSOM function returns an object of class somRes (see help(trainSOM) for further details on this class).

Case study: the presidentielles2002 data set

The presidentielles2002 data set provides the number of votes in the first round of the 2002 French presidential election for each of the 16 candidates in each of the 106 French administrative districts called “departements”. Further details about this data set and the 2002 French presidential election are given with help(presidentielles2002).

data(presidentielles2002)
apply(presidentielles2002, 2, sum)
##      MEGRET      LEPAGE  GLUCKSTEIN      BAYROU      CHIRAC      LE_PEN 
##      667043      535875      132696     1949219     5666021     4804772 
##     TAUBIRA SAINT_JOSSE      MAMERE      JOSPIN      BOUTIN         HUE 
##      660515     1204801     1495774     4610267      339157      960548 
## CHEVENEMENT     MADELIN   LAGUILLER  BESANCENOT 
##     1518568     1113551     1630118     1210562

(The two candidates who qualified for the second round of the election were Jacques Chirac and the far-right candidate Jean-Marie Le Pen.)
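The column totals printed above can be turned into vote shares to see how the candidates rank. This is a small base R sketch that copies the totals from the output rather than reloading the data; the variable names totals and shares are illustrative.

```r
# Column totals copied from the output of apply(presidentielles2002, 2, sum).
totals <- c(MEGRET = 667043, LEPAGE = 535875, GLUCKSTEIN = 132696,
            BAYROU = 1949219, CHIRAC = 5666021, LE_PEN = 4804772,
            TAUBIRA = 660515, SAINT_JOSSE = 1204801, MAMERE = 1495774,
            JOSPIN = 4610267, BOUTIN = 339157, HUE = 960548,
            CHEVENEMENT = 1518568, MADELIN = 1113551, LAGUILLER = 1630118,
            BESANCENOT = 1210562)

# Share of the recorded votes per candidate, in percent, largest first.
shares <- sort(100 * totals / sum(totals), decreasing = TRUE)
round(head(shares, 3), 1)
```

The first two names in shares are CHIRAC and LE_PEN, with JOSPIN third.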

Training the SOM

set.seed(01091407)
korresp.som <- trainSOM(x.data=presidentielles2002, dimension=c(8,8),
                        type="korresp", scaling="chi2", nb.save=10,
                        radius.type="letremy")
korresp.som
##       Self-Organizing Map object...
##          online learning, type: korresp 
##          8 x 8 grid with square topology
##          neighbourhood type: letremy 
##          distance type: letremy

As the energy is recorded during the intermediate backups, we can look at its evolution

plot(korresp.som, what="energy")

plot of chunk energyPresi

which is stabilized during the last 100 iterations.

Resulting clustering

The clustering component contains the final classification of the dataset. As both row and column variables are classified, the length of the resulting vector is equal to the sum of the number of rows and the number of columns.

NB: The clustering component lists the column variables first (here, the candidates) and then the row variables (here, the departements).

The following table indicates which graphics are available for a korresp SOM.

Type Energy Obs Prototypes Add Super Cluster
no type x
hitmap x x
color x2 x2
lines x2 x2
barplot x
radar x
pie
boxplot
3d x2
poly.dist x x
umatrix x
smooth.dist x
words
names x
graph
mds x x
grid.dist x
grid x
dendrogram x
dendro3d x

In the “Prototypes” column, a plot marked “x2” is available for both row and column variables. In the “Super Cluster” column, an “x2” cell means the plot is available for both data set variables and additional variables.

korresp.som$clustering
##                   MEGRET                   LEPAGE               GLUCKSTEIN 
##                       15                       16                        8 
##                   BAYROU                   CHIRAC                   LE_PEN 
##                       32                       54                        9 
##                  TAUBIRA              SAINT_JOSSE                   MAMERE 
##                        6                       12                       22 
##                   JOSPIN                   BOUTIN                      HUE 
##                       57                       16                       21 
##              CHEVENEMENT                  MADELIN                LAGUILLER 
##                       22                       23                       21 
##               BESANCENOT                      ain                    aisne 
##                       21                       27                       18 
##                   allier  alpes_de_haute_provence             hautes_alpes 
##                       43                       33                       41 
##          alpes_maritimes                  ardeche                 ardennes 
##                        2                       35                       25 
##                   ariege                     aube                     aude 
##                       33                       25                       26 
##                  aveyron         bouches_du_rhone                 calvados 
##                       43                        3                       52 
##                   cantal                 charente        charente_maritime 
##                       42                       43                       52 
##                     cher                  correze                corse_sud 
##                       35                       51                       41 
##              haute_corse                cote_d'or            cotes_d'armor 
##                       41                       27                       44 
##                   creuse                 dordogne                    doubs 
##                       42                       43                       27 
##                    drome                     eure             eure_et_loir 
##                       26                       18                       26 
##                finistere                     gard            haute_garonne 
##                       61                       18                       28 
##                     gers                  gironde                  herault 
##                       34                        4                       19 
##          ille_et_vilaine                    indre          indre_et_loire_ 
##                       61                       34                       44 
##                    isere                     jura                   landes 
##                       28                       25                       43 
##             loir_et_cher                    loire              haute_loire 
##                       35                       27                       33 
##         loire_atlantique                   loiret                      lot 
##                       61                       27                       34 
##          lot_et_garonne_                   lozere          maine_et_loire_ 
##                       34                       41                       60 
##                   manche                    marne              haute_marne 
##                       52                       27                       33 
##                  mayenne       meurthe_et_moselle                    meuse 
##                       51                       27                       33 
##                 morbihan                  moselle                   nievre 
##                       52                        2                       33 
##                     nord                     oise                     orne 
##                       21                       27                       35 
##            pas_de_calais              puy_de_dome     pyrenees_atlantiques 
##                       13                       44                       52 
##          hautes_pyrenees      pyrenees_orientales                 bas_rhin 
##                       34                       26                       40 
##                haut_rhin                    rhone              haute_saone 
##                       27                       39                       25 
##          saone_et_loire_                   sarthe                   savoie 
##                       27                       44                       26 
##             haute_savoie                    paris          seine_maritime_ 
##                       27                       30                       13 
##          seine_et_marne_                 yvelines              deux_sevres 
##                       28                       48                       43 
##                    somme                     tarn          tarn_et_garonne 
##                       12                       43                       34 
##                      var                 vaucluse                   vendee 
##                        2                       18                       52 
##                   vienne             haute_vienne                   vosges 
##                       43                       43                       26 
##                    yonne    territoire_de_belfort                  essonne 
##                       26                       41                       28 
##          hauts_de_seine_        seine_saint-denis             val_de_marne 
##                       37                       28                       28 
##               val_d'oise               guadeloupe               martinique 
##                       28                        6                        6 
##                   guyane               la_reunion                  mayotte 
##                        6                       57                       41 
##       nouvelle_caledonie      polynesie_francaise saint_pierre_et_miquelon 
##                       50                       50                       41 
##         wallis_et_futuna   francais_de_l'etranger 
##                       41                       51
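Because the column variables come first, the clustering vector can be split by position. Here is a minimal sketch on a toy named vector; the names and neuron numbers are invented for illustration. With the real object, the vector would be korresp.som$clustering, with 16 column variables followed by 106 row variables.

```r
# Toy stand-in for a korresp clustering vector: 2 column variables followed
# by 3 row variables (names and neuron numbers are invented).
clustering <- c(A = 4, B = 1, d1 = 1, d2 = 4, d3 = 2)
n_col <- 2  # number of column variables (16 candidates in the case study)

col_clusters <- clustering[seq_len(n_col)]
row_clusters <- clustering[-seq_len(n_col)]

# For each column variable, the row variables assigned to the same neuron.
shared <- lapply(col_clusters,
                 function(k) names(row_clusters)[row_clusters == k])
shared
```

Splitting this way makes it easy to ask which departements share a neuron with a given candidate.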

The resulting distribution of the clustering on the map can also be visualized by a hitmap:

plot(korresp.som, what="obs", type="hitmap")

plot of chunk presiHitmap

For a more precise view, the "names" plot is implemented: it prints, in each neuron, the names of the variables assigned to it; in the korresp SOM, both row and column variable names are printed.

plot(korresp.som, what="obs", type="names", scale=c(0.9,0.5))

plot of chunk presiGraphObs

The map is divided into two main parts: minor candidates are classified at its top left-hand side whereas the three main candidates CHIRAC, LE PEN and JOSPIN are classified at the bottom right-hand side of the map, in three different parts of this corner. Some striking facts are:

Clustering interpretation

Some graphics from the numeric SOM algorithm are still available in the korresp case. They are detailed below. As the resulting clustering provides the classification for both rows and columns, a new argument view is used to specify which one should be considered. Its possible values are either "r" for row variables (the default value) or "c" for column variables.

Graphics on prototype values

Three representations are available:

# plot the row prototypes (106 French departements)
plot(korresp.som, what="prototypes", type="lines", view="r", print.title=TRUE)

plot of chunk presiProtoL

# plot the column prototypes (16 candidates)
plot(korresp.som, what="prototypes", type="lines", view="c", print.title=TRUE)

plot of chunk presiProtoL

The peaks in neurons 5 and 6 correspond, in the row view, to the overseas departements and, in the column view, to the candidate TAUBIRA. In the column view, the two peaks clearly identified in the bottom right side clusters correspond to the two “main” traditional candidates JOSPIN and CHIRAC (the left-wing and right-wing candidates, respectively).

A more precise individual view is given by the graphics “color” and “3d”, drawn here, as an example, for the candidate “Taubira” and for the departement “Martinique”.

par(mfrow=c(1,2))
plot(korresp.som, what="prototypes", type="color", variable="TAUBIRA")
plot(korresp.som, what="prototypes", type="3d", variable="martinique")

plot of chunk presiProtoC3d

The first graphic shows that TAUBIRA obtained her best scores in the departements located at the left-hand side of the map. The second graphic shows that the candidates who obtained the highest scores in Martinique are located at the bottom right-hand side of the map.

The graphics can also be drawn by giving the variable number and the view it belongs to, either “r” or “c” (here, as an example, CHIRAC, who is the 5th candidate):

par(mfrow=c(1,2))
plot(korresp.som, what="prototypes", type="color", variable=5, view="c")
plot(korresp.som, what="prototypes", type="3d", variable=5, view="c")

plot of chunk presiProtoNumber

Hence CHIRAC obtained more votes in the departements located at the top of the map and has his lowest scores in the departements Jura, Aube, Haute Saone, Ardennes, Alpes de Haute Provence, Ariege, … located at the bottom (middle) of the map.

Graphics on prototype distances

These graphics are exactly the same as in the numerical case:

plot(korresp.som, what="prototypes", type="poly.dist", print.title=TRUE)

plot of chunk presiGraphProto2

plot(korresp.som, what="prototypes", type="umatrix", print.title=TRUE)

plot of chunk presiGraphProto2

plot(korresp.som, what="prototypes", type="smooth.dist", print.title=TRUE)

plot of chunk presiGraphProto2

plot(korresp.som, what="prototypes", type="mds")

plot of chunk presiGraphProto2

plot(korresp.som, what="prototypes", type="grid.dist")

plot of chunk presiGraphProto2

Neuron 6, which was already picked out in the section Clustering interpretation for having prototypes rather different from the rest of the map, shows a larger distance to the other neurons.

Analyze the projection quality

quality(korresp.som)
## $topographic
## [1] 0.1415094
## 
## $quantization
## [1] 11292.83

By default, the quality function calculates both quantization and topographic errors. It is also possible to specify which one you want to obtain, by using the argument quality.type.

The topographic error value varies between 0 (good projection quality) and 1 (poor projection quality). Here, the topographic quality of the mapping is rather good with a topographic error equal to 0.142.

The quantization error is an unbounded positive number. The closer it is to 0, the better the projection quality.
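To make the two measures concrete, here is a sketch of their usual definitions on a toy map in base R. It is not the code used by quality(): the data are invented, and treating two neurons as direct neighbours when their Manhattan distance on the grid equals 1 is an assumption made for this example.

```r
# Toy data, invented for illustration: 4 prototypes on a 2 x 2 grid and
# 3 observations in the plane. The map is deliberately "twisted": neurons
# 1 and 4 are close in the data space but diagonal on the grid.
prototypes <- rbind(c(0, 0), c(1, 1), c(1, 0), c(0, 1))
grid_pos   <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2))  # grid coordinates
obs        <- rbind(c(0.1, 0.0), c(0.9, 1.0), c(0.0, 0.4))

# Squared Euclidean distances: one row per observation, one column per neuron.
sq_dist <- apply(prototypes, 1, function(p) colSums((t(obs) - p)^2))

best   <- apply(sq_dist, 1, which.min)                # best matching unit
second <- apply(sq_dist, 1, function(d) order(d)[2])  # second best unit

# Quantization error: mean squared distance to the best matching prototype.
quantization <- mean(sq_dist[cbind(seq_len(nrow(obs)), best)])

# Topographic error: share of observations whose two best matching units
# are not direct neighbours on the grid (assumed: Manhattan distance 1).
grid_gap <- rowSums(abs(grid_pos[best, , drop = FALSE] -
                        grid_pos[second, , drop = FALSE]))
topographic <- mean(grid_gap > 1)

c(quantization = quantization, topographic = topographic)
```

In this toy configuration, the third observation has neurons 1 and 4 as its two best units, which sit diagonally on the grid, so one observation out of three counts as a topographic error.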

Building super classes from the resulting SOM

In the SOM algorithm, the number of clusters is necessarily close to the number of neurons on the grid (not necessarily equal, as some neurons may have no observations assigned to them). This rather large number may not suit the original data for clustering purposes.

A usual way to address clustering with SOM is to perform a hierarchical clustering on the prototypes. This clustering is directly available in the package SOMbrero using the function superClass. To do so, you can first have a quick overview to decide on the number of super clusters which suits your data.
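The idea of clustering the prototypes hierarchically can be sketched in base R without the package. The prototypes below are invented, and the choice of Ward's criterion on Euclidean distances is an assumption made for this example; it is not a statement about how superClass is implemented.

```r
# Toy prototypes, invented for illustration: 6 neurons described by 2 values,
# forming three well-separated pairs.
prototypes <- rbind(c(0.0, 0.1), c(0.1, 0.0),
                    c(5.0, 5.1), c(5.1, 5.0),
                    c(9.9, 0.0), c(10.0, 0.2))
rownames(prototypes) <- paste0("neuron", 1:6)

# Hierarchical clustering of the prototypes (Ward's criterion is an assumption).
tree <- hclust(dist(prototypes), method = "ward.D")

# Cut the dendrogram into k super clusters: one label per neuron.
super <- cutree(tree, k = 3)
table(super)
```

Each observation then inherits the super cluster of the neuron it is assigned to, which is how a small number of groups is obtained from a large grid.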

plot(superClass(korresp.som))
## Warning in plot.somSC(superClass(korresp.som)): Impossible to plot the rectangles: no super clusters.

plot of chunk presiSC1

By default, the function plots both a dendrogram and the evolution of the percentage of explained variance. Here, 3 super clusters seem to be a good choice. The output of superClass is a somSC class object. Basic functions have been defined for this class:

my.sc <- superClass(korresp.som, k=3)
summary(my.sc)
## 
##    SOM Super Classes
##      Initial number of clusters :  64 
##      Number of super clusters   :  3 
## 
## 
##   Frequency table
##  1  2  3 
## 13 24 27 
## 
##   Clustering
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  1  1  2  2  2  2  2  2  1  1  2  2  2  2  2  2  1  1  1  2  2  2  2  2  1 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  1  1  3  2  2  2  2  1  1  1  3  3  2  2  2  3  3  3  3  3  3  3  3  3  3 
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3
plot(my.sc, plot.var=FALSE)

plot of chunk presiSC2

Like plot.somRes, the function plot.somSC has an argument 'type' which offers many different plots and can thus be combined with most of the graphics produced by plot.somRes:

The "grid" type fills the grid with colors according to the super clustering (and can display a legend). The "dendro3d" type plots a 3d dendrogram.

plot(my.sc, type="grid", plot.legend=TRUE)

plot of chunk presiSC3

plot(my.sc, type="dendro3d")

plot of chunk presiSC3

The three super-clusters correspond to traditional votes (blue), far right votes (green) and votes for minor candidates (orange).

A couple of plots from plot.somRes are also available for the super clustering. Some identify the super clusters with colors:

plot(my.sc, type="hitmap", plot.legend=TRUE)

plot of chunk presiSC4

plot(my.sc, type="lines", print.title=TRUE)

plot of chunk presiSC4

plot(my.sc, type="lines", print.title=TRUE, view="c")

plot of chunk presiSC4

plot(my.sc, type="mds", plot.legend=TRUE)

plot of chunk presiSC4

And some others identify the super clusters with titles:

plot(my.sc, type="color", view="r", variable="correze")

plot of chunk presiSC5

plot(my.sc, type="color", view="c", variable="JOSPIN")