A course named “data analysis” is a common gateway to the concepts and techniques for drawing information from data. In this context, as in so many, “analysis” is used to infuse a scientific flavor into the word it modifies; it has no precise meaning.^{1} Strictly speaking, “analysis” is the opposite of “synthesis.” Synthesis brings parts together to form a whole; analysis splits the whole into its parts. In statistics, “analysis of variance” uses “analysis” in precisely this sense: dividing the whole variance into parts.

Wikipedia describes “data analysis” as having “multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.” This means that “data analysis” is, ironically, the synthesis of many different things.

Many, but not all, of the techniques taught in a typical “data analysis” course are really about modeling. In my experience, “modeling” provides a useful framework for describing much of what we want to accomplish by looking at data. The goal of this vignette is to motivate, describe, and document some software — the `statisticalModeling` package — designed to make it easier for students and their teachers to discuss, study, and carry out modeling.

My favorite definition of a model comes from a book, *How to Model It: Problem Solving for the Computer Age*, by Starfield, Smith, and Bleloch. They wrote that

a model is a representation for a purpose.

This definition points out that a model is not the thing itself: it is a representation that stands for the thing and displays its important facets. The definition also emphasizes that the best representation depends in large part on the purpose for which that representation will be used. The definition provides a useful framework for organizing concepts, e.g.

- What “material” are representations made from?
- What are typical purposes for models?
- What operations are performed on models?

In teaching and conducting statistical modeling, I think it’s important to be able to hold in one’s head the answers to these questions. With this accomplished, building a model is a matter of making choices from the possibilities.

With respect to data, models are typically created to:

- Make predictions about new events based on observations of events that have already occurred.
- Provide a description of the real-world mechanisms that underlie the system being modeled.
- Look for (potentially complex) relationships in a mass of data.

The `statisticalModeling` package is intended to make it straightforward to carry out these objectives. This is achieved by using notation that is consistent across different modeling operations and by providing the specific operations most needed for building and interpreting models.

The starting point for using `statisticalModeling` is data organized into the usual tabular form: cases and variables. The package does not provide facilities for collecting and organizing raw data. There are many excellent packages already in wide use for that, e.g., `dplyr`, `tidyr`, and `data.table`. So start working with `statisticalModeling` with a data table (e.g. a “data.frame”, a “tibble”, etc.) already in hand.

Throughout the operations provided by `statisticalModeling`, you will access a variable by its column name, and direct the operation to a particular data table by using a `data =` argument.

Some data tables used for examples are provided directly by `statisticalModeling`. There is nothing special or essential about these data. They are in the package because it is helpful in examples or demonstrations to work with data that the learner also has access to, and because over-familiarity with ever-popular data sets such as the famous `iris`, `mtcars`, and `diamonds` can lead to a failure to see what’s new about a demonstration.

Models are built using the standard R functions for such things, e.g. `lm()`, `glm()`, `rpart()`, and so on.

Each of these model-building functions takes the same kinds of inputs:

- a data table passed via the `data =` argument.
- a formula specifying the response variable on the left and the explanatory variables on the right. Depending on the model architecture, e.g. `lm()` or `rpart()`, the formula may also specify interactions, transformations, etc.
- whatever parameters are specific to the model architecture, e.g. the complexity parameter `cp` for recursive partitioning models or the `family` for generalized linear models.
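To make the shared interface concrete, here is a sketch (assuming the `rpart` package and the `mosaicData::CPS85` data used later in this vignette are available):

```r
library(rpart)
data(CPS85, package = "mosaicData")

# The same formula + data interface across architectures:
mod_lm    <- lm(wage ~ age + sex, data = CPS85)
mod_glm   <- glm(married ~ age + wage, family = "binomial", data = CPS85)
mod_rpart <- rpart(wage ~ age + sector, data = CPS85, cp = 0.02)
```

Only the architecture-specific parameters (`family`, `cp`) differ from call to call.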

This vignette isn’t about how to select response and explanatory variables, or which model architectures are appropriate for any given purpose. For our purposes here, it suffices to say that the choice depends on (among other things) the *purpose* for building the model:

- Prediction. Include whatever explanatory terms produce the best predictions. (See cross-validation, below.) It doesn’t matter what causes what, just what variables are known *before* you need to make the prediction.
- Description of the system under study. Think carefully about covariates.
- Looking for patterns in masses of data. Graphics (see above) are, of course, a nice way to start. Architectures such as `rpart()` that provide facilities for selecting subsets of variables can be useful for winnowing out the irrelevant.

Accomplishing these purposes of course involves constructing models. R already provides many relevant model-fitting capabilities in regression, machine learning, classification, etc. Typically, constructing a model means investigating several or many competing models, comparing them, and evaluating and interpreting them to extract information of interest.

The `statisticalModeling` package provides a small set of operations on models to make it straightforward to compare, evaluate, and interpret models. My goals in choosing these operations were:

- generality. The operations should apply to a wide range of model architectures.
- ease of use. There should be a minimal amount of set up needed to apply the operations.
- direct applicability to making the decisions that modelers need to make in carrying out their work.

The operations provided are:

- Evaluating a model for “new” inputs: see `evaluate_model()`.
- Calculating the influence of a variable on the output of the model: see `effect_size()`.
- Supporting comparison of multiple models: see `cv_pred_error()`.
- Estimating uncertainty in model results: see, e.g., `ensemble()`, which works hand-in-hand with `evaluate_model()` and `effect_size()`.

The function `gmodel()`^{2} draws a graph of the function induced by a model. I’ll start with this because it illustrates well how the `statisticalModeling` functions provide ease of use.

When graphing a model, choices need to be made about the levels at which to evaluate the explanatory variables, and about which explanatory variable is mapped to which feature of the graph: x-axis, color, facets. The only required argument to `gmodel()` is a model object. You can use a formula if you want to control the roles played by the explanatory variables.

Example: The `Houses_for_sale` dataset was organized by Prof. Dick De Veaux of Williams College as a demonstration of confounding for introductory statistics students. Here is a model of the asking prices of houses as a function of several familiar attributes of the property.

`house_mod <- lm(price ~ fireplaces * living_area + land_value + bedrooms, data = Houses_for_sale)`

Since five variables are involved in the model, some quantitative and some categorical, decisions need to be made about how to depict the model in a graph. `gmodel()` provides plausible defaults so that a graph of the model function can be made without specifying any details. `gmodel()` also provides a simple mechanism, via a formula, to set the roles of the various explanatory variables in the graph.

```
gmodel(house_mod)
gmodel(house_mod, ~ living_area + fireplaces)
```

`gmodel(house_mod, ~ living_area + land_value + bedrooms, bedrooms = 1:4)`

Note that the argument `bedrooms = 1:4` indicates to `gmodel()` that it should use the values 1, 2, 3, and 4 as the levels shown in the plot for the number of bedrooms.

Data can be added to the graphic using `geom_point()`. Here’s an example of a model linking wages to age, sex, and the sector of work:

```
library(ggplot2)
wage_mod <- lm(wage ~ age * sex + sector, data = mosaicData::CPS85)
gmodel(wage_mod, ~ age + sex + sector, nlevels = 8) +
ggplot2::geom_point(data = mosaicData::CPS85, alpha = 0.1)
```

Often, we apply transforms such as the logarithm to the response or to explanatory variables *before* constructing the model. It can be convenient to be able to display the variable itself rather than the log. The `post_transform` argument supports this for the response variable. [Feedback: would you like to be able to do this for explanatory variables? How about for transforms such as the rank?]

```
data(CPS85, package = "mosaicData")  # make CPS85 available by name
mod <- lm(log(wage) ~ age + educ + sex, data = CPS85)
gmodel(mod, post_transform = c("hourly wage" = exp))
```

When the output of a classifier is to be plotted, `gmodel()` selects one of the possible output levels and presents the probability of that level as a function of the explanatory variables. You can specify this level with the `prob_of` argument. For example:

```
data(CPS85, package = "mosaicData")
mod <- glm(married ~ age*wage*sex + sector, family = "binomial", data = CPS85)
gmodel(mod, prob_of = "Single")
```

Evaluating a model means providing a set of inputs for the explanatory variables and then computing the output of the model at each of those inputs. In base R, the `predict()` method provides this functionality, but it has several deficiencies:

- The default inputs are the data used to train the model. But often, the modeler wants a quick view of the model output for just a few different inputs. Accomplishing this involves creating a new data frame.
- `predict()` has an argument interface that is inconsistent among different model types.
- `predict()` does not record which inputs got mapped to which outputs.

The `evaluate_model()` function in the `statisticalModeling` package addresses these difficulties. Its default inputs are a small set of “typical” values derived from the training data. It’s easy to override this default for any particular explanatory variable, while continuing to use the defaults for the others. The interface is consistent across kinds of model objects. And the output of `evaluate_model()` is a data frame that includes both the inputs and the corresponding output.

To illustrate:

`evaluate_model(wage_mod)`

```
## age sex sector model_output
## 1 20 M prof 11.135906
## 2 40 M prof 13.356867
## 3 60 M prof 15.577828
## 4 20 F prof 10.083368
## 5 40 F prof 10.994436
## 6 60 F prof 11.905504
## 7 20 M clerical 7.317333
## 8 40 M clerical 9.538293
## 9 60 M clerical 11.759254
## 10 20 F clerical 6.264794
## 11 40 F clerical 7.175862
## 12 60 F clerical 8.086931
## 13 20 M service 5.951303
## 14 40 M service 8.172264
## 15 60 M service 10.393225
## 16 20 F service 4.898765
## 17 40 F service 5.809833
## 18 60 F service 6.720902
```

Or, to override the levels selected for evaluation:

`evaluate_model(wage_mod, nlevels = 2)`

```
## age sex sector model_output
## 1 20 M prof 11.135906
## 2 30 M prof 12.246386
## 3 40 M prof 13.356867
## 4 50 M prof 14.467347
## 5 20 F prof 10.083368
## 6 30 F prof 10.538902
## 7 40 F prof 10.994436
## 8 50 F prof 11.449970
## 9 20 M clerical 7.317333
## 10 30 M clerical 8.427813
## 11 40 M clerical 9.538293
## 12 50 M clerical 10.648774
## 13 20 F clerical 6.264794
## 14 30 F clerical 6.720328
## 15 40 F clerical 7.175862
## 16 50 F clerical 7.631397
```

`evaluate_model(wage_mod, sector = "service", age = c(25, 55))`

```
## age sex sector model_output
## 1 25 M service 6.506544
## 2 55 M service 9.837985
## 3 25 F service 5.126532
## 4 55 F service 6.493135
```

Use the `data =` argument to provide an explicit set of cases, or set `on_training = TRUE` to evaluate the model on the training data. When this is done, the response variable used for training will also be in the output of `evaluate_model()`, making it easy to calculate residuals or prediction error.
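As a sketch of the residual calculation (using the `wage_mod` model from above; the `wage` and `model_output` column names follow the output shown earlier):

```r
# Evaluate the model on its training data; the output keeps the
# response variable alongside the model output.
On_training <- evaluate_model(wage_mod, on_training = TRUE)

# Residuals and in-sample mean squared error:
resids <- On_training$wage - On_training$model_output
mean(resids^2)
```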

A simple but effective way to interpret a model is to look at how a change in an explanatory variable leads to a change in model output: the effect size. Conventionally, this is done by looking at model coefficients, but this can become complicated in the presence of interactions or with models involving link functions.

A shortcut is to use `evaluate_model()` with inputs at different levels, for example:

`evaluate_model(wage_mod, sector = "service", sex = "F", age = 50)`

```
## age sex sector model_output
## 1 50 F service 6.265368
```

`evaluate_model(wage_mod, sector = "service", sex = "F", age = 55)`

```
## age sex sector model_output
## 1 55 F service 6.493135
```

The `effect_size()` function provides a simpler interface:

`effect_size(wage_mod, ~ age, age = 50, sector = "service")`

```
## slope age to:age sex sector
## 1 0.111048 50 61.72657 M service
```

The explanatory variable whose effect size is being calculated is set by the one-sided formula, e.g. `~ age` in the above example. Note also that the effect size is specified as a *slope* when the explanatory variable is quantitative, and as a *change* when the explanatory variable is categorical.

Also notice that for quantitative variables, the two levels being compared are separated by a mid-sized step (based on the range of the variable in the training data). This finite-difference approach can be more appropriate than an instantaneous slope when working with discontinuous architectures such as recursive partitioning.
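For a categorical explanatory variable, the result is reported as a change between levels rather than a slope. A sketch, using the `wage_mod` model from above:

```r
# Effect of the categorical variable sex on predicted wage:
# reported as a change from one level to another, not a slope.
effect_size(wage_mod, ~ sex)
```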

Statistical inference is one of the most daunting subjects that a beginning student faces: there are parameters like degrees of freedom to learn about, the null hypothesis plays an outsize role, and there are several different kinds of sampling distribution involved.

Any inference technique provided by a model object is, of course, always available for your use. For instance, regression tables and ANOVA are often used in linear models.

The `statisticalModeling` package attempts to provide a much simpler basis for inference. There are two elements behind the strategy for doing so:

- Use conceptually simple techniques, e.g. the bootstrap, that work in the same way across many different model architectures.
- Use cross-validation to avoid having to worry about the consequences of in-sample evaluation of data.

To illustrate, let’s draw on the question that motivated the collection of the `Houses_for_sale` data: how much does a fireplace influence the sales price of a house? It’s hardly reasonable to claim that a fireplace is a primary determinant of a house’s price: there are covariates that are likely much more important. Here’s a baseline model of price based on some of these covariates:

`baseline_mod <- lm(price ~ living_area + bathrooms + land_value, data = Houses_for_sale)`

And here’s a model that adds `fireplaces` to the mix:

```
fireplace_mod <- lm(price ~ living_area + bathrooms + land_value + fireplaces,
data = Houses_for_sale)
```

A conventional comparison of the two models is done using in-sample residuals and therefore needs to take into account degrees of freedom, the shape of the F distribution, etc.

`anova(baseline_mod, fireplace_mod)`

```
## Analysis of Variance Table
##
## Model 1: price ~ living_area + bathrooms + land_value
## Model 2: price ~ living_area + bathrooms + land_value + fireplaces
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1724 6.3822e+12
## 2 1723 6.3749e+12 1 7250106066 1.9595 0.1617
```

This is fine, but somewhat complicated and abstract.

A cross-validation approach looks at the prediction error of the models on out-of-sample data (implemented in this case with k-fold cross-validation). That prediction error will vary depending on the data used for training and the data used for evaluation. The `cv_pred_error()` function calculates several random replicates of cross-validation, producing an output of mean-squared error for each replicate.

```
Trials <- cv_pred_error(baseline_mod, fireplace_mod, ntrials = 10)
Trials
```

```
## mse model
## 1 3726240397 baseline_mod
## 2 3722219654 baseline_mod
## 3 3721471575 baseline_mod
## 4 3716207772 baseline_mod
## 5 3714097912 baseline_mod
## 6 3722824667 baseline_mod
## 7 3721183965 baseline_mod
## 8 3711782807 baseline_mod
## 9 3721790392 baseline_mod
## 10 3717647960 baseline_mod
## 11 3735891158 fireplace_mod
## 12 3728906736 fireplace_mod
## 13 3731685298 fireplace_mod
## 14 3724325113 fireplace_mod
## 15 3733313634 fireplace_mod
## 16 3711272118 fireplace_mod
## 17 3722476485 fireplace_mod
## 18 3725142870 fireplace_mod
## 19 3739402885 fireplace_mod
## 20 3717633031 fireplace_mod
```

Assessing whether `fireplaces` shows up as determining the house’s price is a matter of comparing the MSE for the two models:

`boxplot(mse ~ model, data = Trials)`

Not so much. Or, if you want to generate a p-value, try

`t.test(mse ~ model, data = Trials)`

```
##
## Welch Two Sample t-test
##
## data: mse by model
## t = -2.4364, df = 13.517, p-value = 0.02933
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -14045750.6 -870695.1
## sample estimates:
## mean in group baseline_mod mean in group fireplace_mod
## 3719546710 3727004933
```

Cross-validation tests the performance of models when assessed on *out-of-sample* data. Here, we see that `fireplaces` is not a major player.

Returning to the house-price models, consider the effect size of `fireplaces`.

`effect_size(fireplace_mod, ~ fireplaces)`

```
## slope fireplaces to:fireplaces living_area bathrooms land_value
## 1 4236.482 1 1.556102 1634.5 2 25000
```

This indicates that a fireplace adds about $4000 to the price of a house. How precise is this estimate? We can address this by bootstrapping.

In the `statisticalModeling` package, the process is to create an ensemble of bootstrap replicates of a model.

`E <- ensemble(fireplace_mod, nreps = 50)`

This particular ensemble has 50 bootstrap replicates. These replicates can be passed to functions such as `effect_size()` in the same way that an individual model is used.

Calculating the effect size on this ensemble will produce an ensemble of effect sizes, the spread of which gives the precision of the estimate.

```
EE <- effect_size(E, ~ fireplaces)
with(data = EE, sd(slope))
```

`## [1] 4114.25`

[Please provide feedback: Would it be useful to be able to pass a bootstrap ensemble to `gmodel()` to plot the ensemble of model functions?]

This is different from the approach in the `mosaic` package. I think the mosaic approach is expressive and elegant, but it’s not well suited to the automatic choice of input values supported by `evaluate_model()`. Nonetheless, you can use the mosaic approach with `statisticalModeling` functions, for instance:

```
library(mosaic)
do(10) * {
fireplace_mod <-
lm(price ~ living_area + bathrooms + land_value + fireplaces,
data = resample(Houses_for_sale))
effect_size(fireplace_mod, ~ fireplaces)
}
```

```
## slope fireplaces to.fireplaces living_area bathrooms land_value
## 1 2626.4342 1 1.555714 1649.0 2 25500
## 2 8357.9940 1 1.567239 1666.0 2 25000
## 3 3877.9657 1 1.564845 1656.0 2 26900
## 4 3527.0284 1 1.563655 1623.5 2 25050
## 5 6660.0502 1 1.566740 1636.0 2 24500
## 6 -892.9505 1 1.554782 1640.0 2 25400
## 7 6427.5210 1 1.553534 1609.0 2 24100
## 8 7962.4726 1 1.564586 1649.0 2 25400
## 9 5135.1025 1 1.575627 1668.0 2 25400
## 10 5531.7272 1 1.543186 1658.0 2 25000
## .row .index
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
```

One of the main materials from which statistical models are made is graphics. A data graphic is a representation of a data table which has particular resonance with human cognitive abilities and which builds on topics found in widely taught math curricula, e.g. Cartesian coordinates.

There are many excellent graphics packages provided for R. The `statisticalModeling` package currently builds on one of them, `ggplot2`. The purpose of the graphics functions in `statisticalModeling` is to provide a clear, concise, and consistent interface to the underlying graphics package.

A simple example:

```
library(statisticalModeling)
gf_point(mpg ~ hp, data = mtcars)
```

Each of the marks in the plot is a *glyph*. Every glyph has graphical *attributes*. In the above plot, the attributes are x- and y-position. Other possible graphical attributes for a point glyph are color, shape, size, stroke, fill, and alpha (transparency).

Experts, of course, will want to use the underlying package directly in order to maximize expressiveness and minimize the “bureaucracy” induced by intervening layers of software. The `statisticalModeling` graphics are suited for beginners because they reduce the number of elements used in notation.

There are three aspects to describing graphics using `statisticalModeling`:

- The “kind of graphic” to be created, e.g. scatter plots, density plots, bar plots, maps, etc.
- The role that each variable plays in the graphic, e.g. x- or y-axis, color, shape, facet.
- And, of course, the data table(s) on which the graphic is to be based.

The “kind of graphic” is specified by the name of the graphics function. All of the `statisticalModeling` data graphics functions have names starting with `gf_`, which is intended to remind the user that they are formula-based interfaces to `ggplot2`: `g` for `ggplot2` and `f` for “formula.” Commonly used functions are

- `gf_point()` for scatter plots
- `gf_density()` or `gf_histogram()` or `gf_freqpoly()` to display distributions
- `gf_boxplot()` for comparing distributions side-by-side
- `gf_counts()` for bar-graph style depictions of counts
- `gf_bar()` for more general bar-graph style graphics

The function names generally match the corresponding operation in `ggplot2`, although `gf_counts()` is a simplified special case.

Each of the `gf_` functions creates the coordinate axes and fills them in with a single operation. (In `ggplot2` nomenclature, a `gf_` function creates a frame and adds a geom layer, all in one operation.)
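As a sketch of that correspondence (the `ggplot2` calls are standard; the `gf_point()` line uses this package as shown earlier):

```r
library(ggplot2)
library(statisticalModeling)

# One gf_ call:
gf_point(mpg ~ hp, data = mtcars)

# corresponds roughly to creating a frame and adding a geom layer:
ggplot(data = mtcars, aes(x = hp, y = mpg)) + geom_point()
```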

In the `gf_` functions, you specify the graphical attributes using a formula. Each such specification takes the form `attribute:value`, where `attribute` is one of `color`, `shape`, etc. and `value` is either a constant (e.g. `"red"` or `0.5`, as appropriate) or a variable (e.g. `cyl`). For instance:

```
gf_point(mpg ~ hp + color:cyl + size:carb + alpha:0.75, data = mtcars) +
ggplot2::ylab("Miles per gallon") +
ggplot2::xlab("Horsepower")
```

You add labels and facets using the standard `ggplot2` notation. `xlab()` and `ylab()` add the axis labels in the above. (For faceting, see below.)

The `ggplot` system offers two ways to connect points. `gf_line()` ignores the order of the points in the data and draws the line going from left to right. `gf_path()` goes from point to point according to the order in the data. Both forms can use a `color` or `group` aesthetic as a flag to draw groupwise lines.

```
# use a categorical variable for the grouping
library(dplyr)   # for %>% and mutate()
mtcars <- mtcars %>% mutate(n_cylinders = as.character(cyl))
gf_line(mpg ~ hp, data = mtcars)
gf_path(mpg ~ hp, data = mtcars)
gf_line(mpg ~ hp + color:n_cylinders, data = mtcars)
```

The above are examples of *bad plots*. The viewer is unnecessarily distracted by the zigs and zags in the connecting lines. It would be better to use `gf_point()` here, but then you wouldn’t see how `gf_line()` and `gf_path()` work!

The `ggplot` system allows you to make subplots — called “facets” — based on the values of one or two categorical variables. This is done by “adding” a `facet_grid()` directive. The directive uses a formula to specify which variable(s) are to be used in the grid.

```
gf_density_2d(net ~ age, data = Runners) + facet_grid( ~ sex)
gf_density_2d(net ~ age, data = Runners) + facet_grid(start_position ~ sex)
```

Sometimes you have so many points in a scatter plot that they obscure one another. The `ggplot` system provides two easy ways to deal with this: translucency and jittering.

Use `alpha:0.5` to make the points semi-translucent. If there are many points overlapping at one location, use a much smaller value of alpha, say `alpha:0.01`.

Using `gf_jitter()` in place of `gf_point()` will move the plotted points to reduce overlap. You can use both of these techniques at once, e.g.

`gf_jitter(age ~ sex + alpha:0.01, data = Runners)`

The `gf_` functions generate a ggplot object as well as a character string containing the `ggplot()` command that generates the graphic. This can be useful when you want to use the `gf_` functions to remind yourself how `ggplot()` works, but want the `ggplot()` commands directly in your document for future modification.

Use `verbose = TRUE` to see the string being generated.
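For instance (a sketch; the exact string printed will depend on the package version):

```r
# Print the ggplot2 command that gf_point() constructs behind the
# scenes, in addition to drawing the plot.
gf_point(mpg ~ hp + color:cyl, data = mtcars, verbose = TRUE)
```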