Explore univariate and bivariate structures of new datasets with the package mbgraphic

Katrin Grimm

2017-05-13

Introduction

This vignette demonstrates how to use the package mbgraphic to explore new and unknown data sets with the help of interactive apps. The documentation focuses on the exploration of single variables and bivariate structures of numeric variables.

The idea of characterising graphics by different measures was mentioned by Paul and John Tukey in the mid-1980s (see J. W. Tukey and Tukey 1985). They suggested criteria, which they called cognostics, for handling huge data sets. For scatterplots the idea was made concrete in the form of scagnostics (a neologism built from scatterplot and diagnostics). The Tukeys themselves never published any details on cognostics and scagnostics. The topic was revived by Wilkinson, Anand, and Grossman (2005), who proposed specific measures for scatterplots and implemented them in the package scagnostics (see Wilkinson and Anand 2012).

The general approach in the package mbgraphic is to calculate measures that describe the univariate and bivariate behavior of the variables in a first step and to select plots based on these measures in a second step. We concentrate more on presenting flexible ways of selecting graphics based on the measures than on the criteria themselves. In this vignette, criteria from the package mbgraphic as well as scagnostics from the package scagnostics are used.

The package’s interactive functions are programmed using shiny (see Chang et al. 2016).

Load the package

library(mbgraphic)

The data

To demonstrate what the package does, data from the German ‘Bundestag’ election in 2013 are used. Election2013 contains 299 observations on 114 variables. It includes information about the elections in 2013 and in 2009, separately for each of the 299 constituencies, as well as additional information about the constituencies themselves. For details see ?Election2013.

data(Election2013)
dim(Election2013)
## [1] 299 114

Explore individual variables

By calling the function iaunivariate we can explore the variables interactively. The output includes measures for discreteness, skewness and multimodality calculated by the functions discrete1d, skew1d and multimod1d.

We can interactively choose variables based on their values on the three criteria. A categorical variable can also be chosen and is then displayed in a barplot. Selecting a bar means that cases in the selected category are highlighted in the other plots.

iaunivariate(Election2013)

In this example, choosing the variable Name is not advisable, because every constituency has a unique name. Land tells us in which Bundesland the constituency is located. By selecting different bars, the Bundesländer can be compared. The barplot displays the number of constituencies in each of the 16 Bundesländer. ‘Nordrhein-Westfalen’ has the most constituencies by far.
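
The idea of choosing variables by their values on a criterion can also be sketched outside the app. The following is a minimal illustration in base R on the built-in mtcars data. It assumes a moment-based skewness coefficient as a stand-in measure; this is not the definition used by skew1d, and the helper name moment_skewness and the threshold of 0.5 are hypothetical.

```r
# Stand-in measure (an assumption, not skew1d): third standardised moment.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Compute the measure for every numeric variable of a built-in data set.
num <- mtcars[, vapply(mtcars, is.numeric, logical(1)), drop = FALSE]
skew <- vapply(num, moment_skewness, numeric(1))

# Keep only the variables whose absolute skewness exceeds a chosen threshold.
threshold <- 0.5  # hypothetical cut-off
selected <- names(skew)[abs(skew) > threshold]

print(sort(skew, decreasing = TRUE))
print(selected)
```

In the app the same kind of filtering is done interactively, on the three criteria at once.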

Explore bivariate structures of numeric variables

We only consider numeric variables.

election_num <- Election2013[,sapply(Election2013,is.numeric)] 
dim(election_num)
## [1] 299 112

If categorical variables are not excluded explicitly, they are ignored by the following functions.

Interactive corrgrams

First we explore the correlation structure of the numeric variables.

iacorrgram(election_num)

There are a large number of variables in this example. With the help of interactive corrgrams we can explore whether any variables are highly correlated and what might be a good number of clusters for grouping the variables. With optimal leaf ordering (OLO), ‘cluster lines’ can be added. These lines show the clusters for a fixed, chosen number of clusters. The number can also be set by choosing a ‘minimal correlation within the clusters’. This is the minimum correlation that every single pair of variables within a cluster must exceed.

Additionally a ‘range of absolute correlation’ can be determined: only correlations with absolute value within the range are drawn in color. Selections of single scatterplots and scatterplot matrices can be made by clicking and drawing boxes in the corrgram.
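
The effect of a ‘range of absolute correlation’ can be illustrated in base R. This is only a sketch on the built-in mtcars data; the range of 0.7 to 0.9 and all object names are hypothetical, and the code does not reproduce how iacorrgram draws its output.

```r
# Correlation matrix of a built-in data set.
cm <- cor(mtcars)

# Hypothetical range of absolute correlation to be shown.
abs_range <- c(0.7, 0.9)
in_range <- abs(cm) >= abs_range[1] & abs(cm) <= abs_range[2]

# Correlations outside the range are blanked out (set to NA).
shown <- cm
shown[!in_range] <- NA

# List each variable pair in the range once (upper triangle only).
idx <- which(in_range & upper.tri(cm), arr.ind = TRUE)
pairs_in_range <- data.frame(
  x = rownames(cm)[idx[, "row"]],
  y = colnames(cm)[idx[, "col"]],
  cor = cm[idx],
  stringsAsFactors = FALSE
)
print(pairs_in_range)
```

Note that the diagonal is blanked as well, because a correlation of 1 lies outside the chosen range.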

Cluster variables by function ‘varclust’

The function varclust does a clustering based on an optimal leaf ordering of the variables. The number of clusters is determined by specifying the number directly or by choosing the ‘minimal correlation’ (mincor).

vc <- varclust(Election2013,mincor=0.8)
summary(vc)
##           Length Class      Mode     
## c           1    -none-     numeric  
## mincor      1    -none-     numeric  
## clusters  112    -none-     numeric  
## clusrep    54    -none-     character
## dfclusrep  54    data.frame list
# the reduced data set
election_reduced <- vc$dfclusrep
dim(election_reduced)
## [1] 299  54
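
To make the meaning of a ‘minimal correlation’ more concrete, the following sketch builds clusters in which every pair of variables has a correlation of at least mincor. It uses complete-linkage hierarchical clustering from base R on the dissimilarity 1 - correlation, applied to the built-in mtcars data. This is a swapped-in technique chosen for illustration, not the algorithm used by varclust, and the object names are hypothetical.

```r
# Correlation matrix and the dissimilarity derived from it.
cm <- cor(mtcars)
d <- as.dist(1 - cm)

# With complete linkage, the merge height is the largest pairwise
# dissimilarity within the merged cluster.
hc <- hclust(d, method = "complete")

# Cutting at height 1 - mincor therefore keeps every within-cluster
# correlation at or above mincor.
mincor <- 0.8
cl <- cutree(hc, h = 1 - mincor)

# Check: smallest correlation within each cluster.
min_within <- sapply(split(names(cl), cl), function(v) min(cm[v, v]))
print(cl)
print(min_within)
```

Lowering mincor gives fewer and larger clusters; raising it gives more and smaller ones.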

Further scagnostics and plots for them

We can use the reduced data set for exploring further bivariate structures faster. First we calculate the nine scagnostics from the package scagnostics with the function sdf. It calls the function scagnostics and stores the results in a list which holds the scagnostics and a data frame (the original data frame or only the numeric variables of the original data frame).

scagdf <- sdf(election_reduced)
# List of class "sdfdata"
class(scagdf)
## [1] "sdfdata"
# Entries 
summary(scagdf)
##      Length Class      Mode
## sdf  12     data.frame list
## data 54     data.frame list

Additional and self-defined scagnostics can be used and integrated with the function scag2sdf.

addscag <- scag2sdf(election_reduced,scagfun.list=list(dcor2d=dcor2d,splines2d=splines2d))
# merge 'addscag' and 'scagdf'
scagdf2 <- mergesdfdata(scagdf,addscag)
# merged list contains 11 scagnostics
names(scagdf2$sdf)
##  [1] "Outlying"  "Skewed"    "Clumpy"    "Sparse"    "Striated" 
##  [6] "1-Convex"  "Skinny"    "Stringy"   "Monotonic" "dcor2d"   
## [11] "splines2d" "x"         "y"         "status"
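
What a self-defined measure might look like can be sketched in base R. The example assumes that a bivariate measure is a function of two numeric vectors x and y that returns a single number; whether scag2sdf expects exactly this form should be checked in ?scag2sdf. The names absspearman2d and pairwise_measure are hypothetical, and the code does not call scag2sdf itself.

```r
# Hypothetical self-defined measure: absolute Spearman rank correlation.
absspearman2d <- function(x, y) {
  abs(cor(x, y, method = "spearman", use = "complete.obs"))
}

# Apply a bivariate measure to every pair of numeric variables.
pairwise_measure <- function(data, fun) {
  data <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  pairs <- combn(names(data), 2)
  data.frame(
    x = pairs[1, ],
    y = pairs[2, ],
    value = apply(pairs, 2, function(p) fun(data[[p[1]]], data[[p[2]]])),
    stringsAsFactors = FALSE
  )
}

# The built-in iris data has four numeric variables, hence six pairs.
res <- pairwise_measure(iris, absspearman2d)
print(res)
```

The result has one row per variable pair, with the pair identified by the columns x and y.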

Interactive parallel coordinate plot (pcp)

iascagpcp(scagdf2)

All scagnostics stored in scagdf2 are drawn in a parallel coordinate plot. To select a line within the pcp, draw a box around it on one of the axes. The line will be highlighted and the corresponding scatterplot drawn. If you use the function sdf to calculate the scagnostics from the package scagnostics, you can decide whether you want to consider all plots or only the defined Outliers and Exemplars.

Interactive scaggram

The package includes so-called scaggrams (function scaggram). These graphics are a generalization of corrgrams. The idea is to represent scatterplots through different colors. By using the RGB color space, three different measures can be applied at the same time. That means that (up to) three measures are represented using the colors red, green and blue. The mixture of colors determines the color of the boxes in the scaggram. Using scaggrams within interactive environments allows users to select scatterplots and scatterplot matrices using the measures.
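
The color mixing described above can be sketched with the base R function rgb. The example assumes three stand-in measures scaled to [0, 1] (absolute Pearson, Spearman and Kendall correlations of four mtcars variables); these are chosen only for illustration and are not necessarily the measures or the scaling used by scaggram. The helper clamp01 is hypothetical.

```r
# Four variables of a built-in data set.
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Keep values inside [0, 1], as required by rgb() with its default scale.
clamp01 <- function(m) pmin(pmax(m, 0), 1)

# Three stand-in measures, one per color channel.
red   <- clamp01(abs(cor(x, method = "pearson")))
green <- clamp01(abs(cor(x, method = "spearman")))
blue  <- clamp01(abs(cor(x, method = "kendall")))

# One mixed color per variable pair.
cols <- matrix(
  rgb(as.vector(red), as.vector(green), as.vector(blue)),
  nrow = nrow(red),
  dimnames = dimnames(red)
)
print(cols)
```

A pair that is high on all three measures comes out close to white, while a pair that is high on only the first measure comes out reddish.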

iascaggram(scagdf)

Reordering can be carried out by the functions sdf_sort (reordering based on similarity of scatterplots) and sdf_quicksort (reordering based on similarity of variables). sdf_sort can be slow if there are many variables. ‘Quick’ reordering based on the OLO algorithm or ordering based on the algorithm from sdf_sort with a time break might be good choices.

For smaller data frames the option ‘Add -> Glyphs’ can be interesting.

The glyphs representing all scagnostics which are stored in scagdf2_ds are added above the diagonal of the scaggram. The shadings of the boxes are drawn using transparency. It’s also possible to add the scatterplots for each pair of variables.

References

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2016. “Shiny: Web Application Framework for R.” https://cran.r-project.org/package=shiny.

Tukey, J. W., and P. A. Tukey. 1985. “Computer Graphics and Exploratory Data Analysis: An Introduction.” In Proceedings of the Sixth Annual Conference and Exposition: Computer Graphics ’85, 3:773–85.

Wilkinson, Leland, Anushka Anand, and Robert Grossman. 2005. “Graph-Theoretic Scagnostics.” Proceedings of the 2005 IEEE Symposium on Information Visualization, 157–64.

Wilkinson, Leland, and Anushka Anand. 2012. “Scagnostics: Compute Scagnostics - Scatterplot Diagnostics.” https://cran.r-project.org/package=scagnostics.