This vignette demonstrates how to use the package **mbgraphic** to explore new and unknown data sets with the help of interactive apps. The documentation focuses on the exploration of single variables and bivariate structures of numeric variables.

The idea of characterising graphics by different measures was mentioned by Paul and John Tukey in the mid-1980s (see J. W. Tukey and Tukey 1985). They suggested criteria they called *cognostics* to handle huge data sets. For scatterplots the idea was made concrete in the form of *scagnostics* (a neologism built from *scatterplot* and *diagnostics*). The Tukeys themselves never published any details on cognostics and scagnostics. The topic was revived by Wilkinson, Anand, and Grossman (2005), who proposed specific measures for scatterplots and implemented them in the package **scagnostics** (see Wilkinson and Anand 2012).

The general approach in the package **mbgraphic** is to first calculate measures that describe the univariate and bivariate behavior of the variables and then to select plots based on these measures. We concentrate more on presenting flexible ways of selecting graphics based on the measures than on the criteria themselves. In this vignette, criteria from the package **mbgraphic** as well as scagnostics from the package **scagnostics** are used.

The package’s interactive functions are programmed using **shiny** (see Chang et al. 2016).

`library(mbgraphic)`

To demonstrate what the package does, we use data from the German ‘Bundestag’ election in 2013. **Election2013** contains 299 observations on 114 variables. It includes information about the elections in 2013 and in 2009 separately for each of the 299 constituencies, as well as additional information about the constituencies themselves. For details see `?Election2013`.

```
data(Election2013)
dim(Election2013)
```

`## [1] 299 114`

By calling the function `iaunivariate` we can explore the variables interactively. The output includes measures for discreteness, skewness and multimodality calculated by the functions `discrete1d`, `skew1d` and `multimod1d`.

We can interactively choose variables based on their values on the three criteria. A categorical variable can also be chosen and is then displayed in a barplot. Selecting a bar means that cases in the selected category are highlighted in the other plots.

`iaunivariate(Election2013)`

In the example choosing variable *Name* is not advisable, because every constituency has a unique name. *Land* tells us in which Bundesland the constituency is located. Through different selections the Bundesländer can be compared. The barplot displays the number of constituencies in the 16 Bundesländer. ‘Nordrhein-Westfalen’ has the most constituencies by far.

We only consider numeric variables.

```
election_num <- Election2013[,sapply(Election2013,is.numeric)]
dim(election_num)
```

`## [1] 299 112`

If categorical variables are not excluded explicitly, they are ignored by the following functions.

First we explore the correlation structure of the numeric variables.

`iacorrgram(election_num)`

There are a large number of variables in this example. With the help of interactive corrgrams we can explore if any variables are highly correlated and what might be a good number of clusters to group variables. With optimal leaf reordering (OLO) ‘cluster lines’ can be added. These lines show the clusters for a fixed chosen number of clusters. The number can also be set by choosing a ‘minimal correlation within the clusters’. This is the minimum correlation every single pair of variables within a cluster must exceed.
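The ‘minimal correlation within the clusters’ criterion can be illustrated with base R alone. The following is a minimal sketch, not the implementation used by `varclust` or `iacorrgram`: it assumes complete-linkage hierarchical clustering on the dissimilarity 1 - correlation, uses the built-in `mtcars` data, and the names `mincor_demo` and `within_min` are made up for the illustration.

```r
# Sketch only: complete linkage on 1 - correlation (an assumption,
# not necessarily what mbgraphic does internally).
num <- mtcars[, sapply(mtcars, is.numeric)]
cm <- cor(num)
mincor_demo <- 0.8  # hypothetical threshold for this illustration

# With complete linkage, cutting the tree at height 1 - mincor_demo
# guarantees that every pair inside a cluster has a correlation of
# at least mincor_demo.
hc <- hclust(as.dist(1 - cm), method = "complete")
clusters <- cutree(hc, h = 1 - mincor_demo)

# Smallest pairwise correlation within each cluster (1 for singletons)
within_min <- tapply(names(clusters), clusters, function(v) min(cm[v, v]))
within_min
```

Lowering the threshold merges more variables into each cluster, so fewer clusters remain.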

Additionally a ‘range of absolute correlation’ can be determined: only correlations with absolute value within the range are drawn in color. Selections of single scatterplots and scatterplot matrices can be made by clicking and drawing boxes in the corrgram.
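The effect of a ‘range of absolute correlation’ can be sketched in base R. This is only an illustration on the built-in `mtcars` data and does not show how `iacorrgram` draws its cells. The range `abs_range` and the object names are assumptions chosen for the example.

```r
# Sketch only: keep the correlations whose absolute value lies in a range.
num <- mtcars[, sapply(mtcars, is.numeric)]
cm <- cor(num)
abs_range <- c(0.7, 0.9)  # hypothetical range of absolute correlation

in_range <- abs(cm) >= abs_range[1] & abs(cm) <= abs_range[2]

# Matrix in which all correlations outside the range are blanked out
shown <- cm
shown[!in_range] <- NA

# Each qualifying pair of variables listed once (upper triangle only)
idx <- which(in_range & upper.tri(cm), arr.ind = TRUE)
pairs_in_range <- data.frame(
  x = rownames(cm)[idx[, "row"]],
  y = colnames(cm)[idx[, "col"]],
  cor = cm[idx]
)
head(pairs_in_range)
```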

The function `varclust` does a clustering based on an optimal leaf ordering of the variables. The number of clusters is determined by specifying the number directly or by choosing the ‘minimal correlation’ (`mincor`).

```
vc <- varclust(Election2013,mincor=0.8)
summary(vc)
```

```
##           Length Class      Mode
## c           1    -none-     numeric
## mincor      1    -none-     numeric
## clusters  112    -none-     numeric
## clusrep    54    -none-     character
## dfclusrep  54    data.frame list
```

```
# the reduced data set
election_reduced <- vc$dfclusrep
dim(election_reduced)
```

`## [1] 299 54`

We can use the reduced data set to explore further bivariate structures faster. First we calculate the nine scagnostics from the package **scagnostics** with the function `sdf`. It calls the function `scagnostics`.

```
scagdf <- sdf(election_reduced)
# List of class "sdfdata"
class(scagdf)
```

`## [1] "sdfdata"`

```
# Entries
summary(scagdf)
```

```
##      Length Class      Mode
## sdf  12     data.frame list
## data 54     data.frame list
```

Additional and self-defined scagnostics can be used and integrated with the function `scag2sdf`.
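A self-defined measure might look like the sketch below. It assumes that the functions passed in `scagfun.list` take two numeric vectors and return a single number, as the names `dcor2d` and `splines2d` suggest; check `?scag2sdf` for the exact requirements. The name `abscor2d` is hypothetical, and the sketch only evaluates the function directly in base R.

```r
# Hypothetical user-defined measure: absolute Pearson correlation,
# computed on complete pairs and therefore bounded between 0 and 1.
abscor2d <- function(x, y) {
  abs(cor(x, y, use = "complete.obs"))
}

# Direct evaluation on two columns of a built-in data set
abscor2d(mtcars$mpg, mtcars$wt)
```

If the assumption about the expected signature holds, such a function could be supplied as `scagfun.list=list(abscor2d=abscor2d)`.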

```
addscag <- scag2sdf(election_reduced,scagfun.list=list(dcor2d=dcor2d,splines2d=splines2d))
# merge 'addscag' and 'scagdf'
scagdf2 <- mergesdfdata(scagdf,addscag)
# merged list contains 11 scagnostics
names(scagdf2$sdf)
```

```
## [1] "Outlying" "Skewed" "Clumpy" "Sparse" "Striated"
## [6] "1-Convex" "Skinny" "Stringy" "Monotonic" "dcor2d"
## [11] "splines2d" "x" "y" "status"
```

`iascagpcp(scagdf2)`

All scagnostics stored in *scagdf2* are drawn in a parallel coordinate plot (pcp). To select a line within the pcp, draw a box around it on one of the axes. The line will be highlighted and the corresponding scatterplot drawn. If you use the function `sdf` to calculate the scagnostics from the package **scagnostics**, you can decide whether to consider all plots or only the defined *Outliers* and *Exemplars*.

The package includes so-called scaggrams (function `scaggram`). These graphics are a generalization of corrgrams. The idea is to represent scatterplots through different colors. By using the RGB color space, up to three measures can be applied at the same time, represented by the colors red, green and blue. The mixture of colors determines the color of the boxes in the scaggram. Using scaggrams within interactive environments allows users to select scatterplots and scatterplot matrices using the measures.
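The color mixing can be sketched with the base R function `rgb`. This is an illustration of the RGB idea only; how `scaggram` actually scales and maps the measures may differ. The data frame `measures` and its column names are hypothetical, and each measure is assumed to lie in [0, 1].

```r
# Three hypothetical measures, each scaled to [0, 1], for four scatterplots
measures <- data.frame(
  m_red   = c(1, 0, 0, 1),
  m_green = c(0, 1, 0, 1),
  m_blue  = c(0, 0, 1, 1)
)

# One color per scatterplot: each measure drives one RGB channel
box_colors <- rgb(measures$m_red, measures$m_green, measures$m_blue)
box_colors
# "#FF0000" "#00FF00" "#0000FF" "#FFFFFF"
```

In this sketch a plot that scores high on only one measure appears in the corresponding pure color, while a plot that scores high on all three appears white.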

`iascaggram(scagdf)`

Reordering can be carried out by the functions `sdf_sort` (reordering based on similarity of scatterplots) and `sdf_quicksort` (reordering based on similarity of variables). `sdf_sort` can be slow if there are many variables. ‘Quick’ reordering based on the OLO algorithm or ordering based on the algorithm from `sdf_sort` with a time break might be good choices.

For smaller data frames the option ‘Add -> Glyphs’ can be interesting.

The glyphs representing all scagnostics stored in `scagdf2_ds` are added above the diagonal of the scaggram. The shadings of the boxes are drawn using transparency. It’s also possible to add the scatterplots for each pair of variables.

Chang, Winston, Joe Cheng, JJ Allaire, Yihui Xie, and Jonathan McPherson. 2016. “Shiny: Web Application Framework for R.” https://cran.r-project.org/package=shiny.

Tukey, J. W., and P. A. Tukey. 1985. “Computer Graphics and Exploratory Data Analysis: An Introduction.” In *Proceedings of the Sixth Annual Conference and Exposition: Computer Graphics ’85*, 3:773–85.

Wilkinson, Leland, Anushka Anand, and Robert Grossman. 2005. “Graph-Theoretic Scagnostics.” *Proceedings of the 2005 IEEE Symposium on Information Visualization*, 157–64.

Wilkinson, Leland, and Anushka Anand. 2012. “Scagnostics: Compute Scagnostics - Scatterplot Diagnostics.” https://cran.r-project.org/package=scagnostics.