# Support for Statistical Modeling

#### November 11, 2016

A course named “data analysis” is a common gateway to the concepts and techniques for drawing information from data. In this context, as in so many, “analysis” is used to infuse a scientific flavor into the word it modifies; it has no precise meaning.1 Strictly speaking, “analysis” is the opposite of “synthesis.” Synthesis brings parts together to form a whole; analysis splits the whole into its parts. In statistics, “analysis of variance” uses “analysis” in precisely this sense: dividing the whole variance into parts.

Wikipedia describes “data analysis” as having “multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.” This means that “data analysis” is, ironically, the synthesis of many different things.

Many, but not all, of the techniques taught in a typical “data analysis” course are really about modeling. In my experience, “modeling” provides a useful framework for describing much of what we want to accomplish by looking at data. The goal of this vignette is to motivate, describe, and document some software — the statisticalModeling package — designed to make it easier for students and their teachers to discuss, study, and carry out modeling.

# Modeling

My favorite definition of a model comes from a book, How to Model It: Problem Solving for the Computer Age, by Starfield, Smith, and Blelock. They wrote that

a model is a representation for a purpose.

This definition points out that a model is not the thing itself, it is a representation, it stands for, it displays important facets of the thing itself. The definition also emphasizes that the best representation depends in large part on the purpose for which that representation will be used. The definition provides a useful framework for organizing concepts, e.g.

• What “material” are representations made from?
• What are typical purposes for models?
• What operations are performed on models?

In teaching and conducting statistical modeling, I think it’s important to be able to hold in one’s head the answers to these questions. With this accomplished, building a model is a matter of making choices from the possibilities.

With respect to data, models are typically created to:

1. Make predictions about new events based on observations of events that have already occurred.
2. Provide a description of the real-world mechanisms that underlie the system being modeled.
3. Look for (potentially complex) relationships in a mass of data.

The statisticalModeling package is intended to make it straightforward to carry out these objectives. This is achieved by using notation that is consistent across different operations for modeling and by providing specific operations most importantly needed for building and interpreting models.

# Data

The starting point for using statisticalModeling is data organized into the usual tabular form: cases and variables. The package does not provide facilities for collecting and organizing raw data. There are many excellent packages already in wide use for that, e.g., dplyr, tidyr, and data.table. So start working with statisticalModeling with a data table (e.g. a “data.frame”“, a”tibble“, etc.) already in hand.

Throughout the operations provided by statisticalModeling, you will access a variable by its column name, and direct the operation to a particular data table by using a data = argument.

Some data tables used for examples are provided directly by statisticalModeling. There is nothing special or essential about these data. They are in the package because it is helpful in examples or demonstrations to work with data that the learner also has access to, and because over-familiarity with ever-popular data sets such as the famous iris, mtcars, and diamonds can lead to a failure to see what’s new about a demonstration.

# Models and functions

Models are built using the standard R functions for such things, e.g. lm(), glm(), rpart() and so on.

Each of these model-building functions takes the same kinds of inputs:

1. a data table passed via the data = argument.
2. a formula specifying the response variable on the left and the explanatory variables on the right. Depending on the model architecture, e.g. lm() or rpart(), the formula may also specify interactions, transformations, etc.
3. whatever parameters are specific to the model architecture, e.g. the complexity parameter cp for recursive partitioning models or the family for generalized linear models.

This vignette isn’t about how to select response and explanatory variables, or which model architectures are appropriate for any given purpose. For our purposes here, it suffices to say that the choice depends on (among other things) the purpose for building the model:

• Prediction. Include whatever explanatory terms produce the best predictions. (See cross-validation, below.) It doesn’t matter what causes what, just what variables are known before you need to make the prediction.
• Description of the system under study. Think carefully about covariates.
• Looking for patterns in masses of data. Graphics (see above) are, of course, a nice way to start. Architectures such as rpart() that provide facilities for selecting subsets of variables can be useful for winnowing out the irrelevant.

Accomplishing these purposes of course involves constructing models. R already provides many relevant model-fitting capabilities in regression, machine learning, classification, etc. Typically, constructing a model means investigating several or many competing models, comparing them, and evaluating and interpreting them to extract information of interest.

The statisticalModeling package proviodes a small set of operations on models to make it straightforward to compare, evaluate, and interpret models. My goals in choosing these operations were:

• generality. The operations should apply to a wide range of model architectures.
• ease of use. There should be a minimal amount of set up needed to apply the operations.
• direct applicability to making the decisions that modelers need to make in carrying out their work.
1. Evaluating a model for “new” inputs: see evaluate_model().
2. Calculating the influence of a variable on the output of the model: see effect_size().
3. Supporting comparison of multiple models: see cv_pred_error().
4. Estimating uncertainty in model results: see, e.g. ensemble() which works hand-in-hand with evaluate_model() and effect_size().

## Graphics of model functions

The function gmodel()2 draws a graph of the function induced by a model. I’ll start with this because it illustrates well how the statisticalModeling functions provide ease of use.

When graphing a model, choices need to be made about the levels at which to evaluate the explanatory variables, and which explanatory variable is to be mapped to which feature of the graph: x-axis, color, facets. The only required argument to gmodel() is a model object. You can use a formula if you want to control which the roles played by the explanatory variables.

Example: The Houses_for_sale dataset was organized by Prof. Dick de Veau of Williams college as a demonstration of confounding for introductory statistics students. Here is a model of the asking prices of houses as a function of several familiar attributes of the property.

house_mod <- lm(price ~ fireplaces * living_area + land_value + bedrooms, data = Houses_for_sale)

Since five variables are involved in the model, some quantitative and some categorical, decisions need to be made about how to depict the model in a graph. gmodel() provides plausible defaults so that a graph of the model function can be made without specifying any details. gmodel() also provides a simple mechanism, via a formula, to set the roles of the various explanatory variables in the graph.

gmodel(house_mod)
gmodel(house_mod, ~ living_area + fireplaces)

gmodel(house_mod, ~ living_area + land_value + bedrooms, bedrooms = 1:4)

Note that the argument bedrooms = 1:4 indicates to gmodel() that it should use the values 1, 2, 3, and 4 as levels shown in the plot for the number of bedrooms.

Data can be added to the graphic using geom_point(). Here’s an example of a model linking wages to age, sex, and the sector of work:

library(ggplot2)
wage_mod <- lm(wage ~ age * sex + sector, data = mosaicData::CPS85)
gmodel(wage_mod, ~ age + sex + sector, nlevels = 8) +
ggplot2::geom_point(data = mosaicData::CPS85, alpha = 0.1)

### Transforms

Often, we apply transforms such as the logarithm to the response or to explanatory variables before constructing the model. It can be convenient to be able to display the variable itself rather than the log. The post_transform argument supports this for the response variable. [Feedback: would you like to be able to do this for explanatory variables? How about for transforms such as the rank?]

mod <- lm(log(wage) ~ age + educ + sex, data = CPS85)
gmodel(mod, post_transform = c("hourly wage" = exp))

### Plotting classifiers

When the output of a classifier is to be plotted, gmodel() selects one of the possible output levels and presents the probability of that level as a function of the explanatory variables. You can specify this level with the prob_of argument. For example:

data(CPS85, package = "mosaicData")
mod <- glm(married ~ age*wage*sex + sector, family = "binomial", data = CPS85)
gmodel(mod, prob_of = "Single")

## Evaluating models

Evaluating a model means to provide a set of inputs for the explanatory variables and then compute the output of the model at each of those inputs. In base R, the predict() method provides this functionality, but it has several deficiencies:

1. The default inputs are the data used to train the model. But often, the modeller wants a quick view of the model output for just a few different inputs. Accomplishing this involves creating a new data frame.
2. predict() has an argument interface that is inconsistent among different model types.
3. predict() does not record which inputs got mapped to which outputs.

The evaluate_model() function in the statisticalModeling package addresses these difficulties. It’s default inputs are a small set of “typical” values derived from the training data. It’s easy to override this default for any particular explanatory variable, while continuing to use the defaults for the others. The interface is consistent across kinds of model objects. And the output of evaluate_model() is a data frame that includes both the inputs and the corresponding output.

To illustrate:

evaluate_model(wage_mod)
##    age sex   sector model_output
## 1   20   M     prof    11.135906
## 2   40   M     prof    13.356867
## 3   60   M     prof    15.577828
## 4   20   F     prof    10.083368
## 5   40   F     prof    10.994436
## 6   60   F     prof    11.905504
## 7   20   M clerical     7.317333
## 8   40   M clerical     9.538293
## 9   60   M clerical    11.759254
## 10  20   F clerical     6.264794
## 11  40   F clerical     7.175862
## 12  60   F clerical     8.086931
## 13  20   M  service     5.951303
## 14  40   M  service     8.172264
## 15  60   M  service    10.393225
## 16  20   F  service     4.898765
## 17  40   F  service     5.809833
## 18  60   F  service     6.720902

Or, to override the levels selected for evaluation:

evaluate_model(wage_mod, nlevels = 2)
##    age sex   sector model_output
## 1   20   M     prof    11.135906
## 2   30   M     prof    12.246386
## 3   40   M     prof    13.356867
## 4   50   M     prof    14.467347
## 5   20   F     prof    10.083368
## 6   30   F     prof    10.538902
## 7   40   F     prof    10.994436
## 8   50   F     prof    11.449970
## 9   20   M clerical     7.317333
## 10  30   M clerical     8.427813
## 11  40   M clerical     9.538293
## 12  50   M clerical    10.648774
## 13  20   F clerical     6.264794
## 14  30   F clerical     6.720328
## 15  40   F clerical     7.175862
## 16  50   F clerical     7.631397
evaluate_model(wage_mod, sector = "service", age = c(25, 55))
##   age sex  sector model_output
## 1  25   M service     6.506544
## 2  55   M service     9.837985
## 3  25   F service     5.126532
## 4  55   F service     6.493135

Use the data = argument to provide an explicit set of cases, or set on_training = TRUE to evaluate the model on the training data. When this is done, the response variable used for training will also be in the output of evaluate_model(), making it easy to calculate residuals or prediction error.

## Effect sizes

A simple but effective way to interpret a model is to look at how a change in an explanatory variable leads to a change in model output: the effect size. Conventionally, this is done by looking at model coefficients, but this can become complicated in the presence of interactions or with models involving link functions.

A short-cut is to use evaluate_model() with inputs at different levels, for example:

evaluate_model(wage_mod, sector = "service", sex = "F", age = 50)
##   age sex  sector model_output
## 1  50   F service     6.265368
evaluate_model(wage_mod, sector = "service", sex = "F", age = 55)
##   age sex  sector model_output
## 1  55   F service     6.493135

The effect_size() function provides a simpler interface:

effect_size(wage_mod, ~ age, age = 50, sector = "service")
##      slope age   to:age sex  sector
## 1 0.111048  50 61.72657   M service

The explanatory variable whose effect size is being calculated is set by the one-sided formula, e.g. ~ age in the above example. Note also that the effect size is specified as a slope when the explanatory variable is quantitative, and as a change when the explanatory variable is categorical.

Also notice that for quantitative variables, the two levels being compared are set to a mid-sized value (based on the range in the training data). This finite-difference can be more appropriate when working with discontinuous architectures such as recursive partitioning.

## Inference

Statistical inference is one of the most daunting subjects that a beginning student faces. Students need to learn about parameters like degrees of freedom, the null hypothesis plays an outside role, there are several different kinds of sampling distribution involved.

Any inference technique provided by a model object is, of course, always available for your use. For instance, regression tables and ANOVA are often used in linear models.

The statisticalModeling package attempts to provide a much simpler basis for inference. There are two elements behind the strategy for doing so:

1. Use conceptually simple techniques, e.g. the bootstrap, that work in the same way across many different model architectures.
2. Use cross-validation to avoid having to worry about the consequences of in-sample evaluation of data.

To illustrate, let’s draw on the question that motivated the collection of the Houses_for_sale data: how much does a fireplace influence the sales price of a house? It’s hardly reasonable to claim that a fireplace is a primary determinant of a house’s price: there are covariates that are likely much more important. Here’s a baseline model of price based on some of these covariates:

baseline_mod <- lm(price ~ living_area + bathrooms + land_value, data = Houses_for_sale)

And here’s a model that adds fireplaces to the mix:

fireplace_mod <- lm(price ~ living_area + bathrooms + land_value + fireplaces,
data = Houses_for_sale)

A conventional comparison of the two models is done using in-sample residuals and therefore needs to take into account degrees of freedom, the shape of the F distribution, etc.

anova(baseline_mod, fireplace_mod)
## Analysis of Variance Table
##
## Model 1: price ~ living_area + bathrooms + land_value
## Model 2: price ~ living_area + bathrooms + land_value + fireplaces
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1724 6.3822e+12
## 2   1723 6.3749e+12  1 7250106066 1.9595 0.1617

This is fine, but somewhat complicated and abstract.

A cross-validation approach looks at the prediction error of the models on out-of-sample data (implemented in this case with k-fold cross-validation). That prediction error will vary depending on the data used for training and the data used for evaluation. The cv_pred_error() function calculates several random replicates of cross-validation, producing an output of mean-squared error for each replicate.

Trials <- cv_pred_error(baseline_mod, fireplace_mod, ntrials = 10)
Trials
##           mse         model
## 1  3726240397  baseline_mod
## 2  3722219654  baseline_mod
## 3  3721471575  baseline_mod
## 4  3716207772  baseline_mod
## 5  3714097912  baseline_mod
## 6  3722824667  baseline_mod
## 7  3721183965  baseline_mod
## 8  3711782807  baseline_mod
## 9  3721790392  baseline_mod
## 10 3717647960  baseline_mod
## 11 3735891158 fireplace_mod
## 12 3728906736 fireplace_mod
## 13 3731685298 fireplace_mod
## 14 3724325113 fireplace_mod
## 15 3733313634 fireplace_mod
## 16 3711272118 fireplace_mod
## 17 3722476485 fireplace_mod
## 18 3725142870 fireplace_mod
## 19 3739402885 fireplace_mod
## 20 3717633031 fireplace_mod

Assessing whether fireplaces shows up as determining the house’s price is a matter of comparing the MSE for the two models:

boxplot(mse ~ model, data = Trials)

Not so much. Or, if you want to generate a p-value, try

t.test(mse ~ model, data = Trials)
##
##  Welch Two Sample t-test
##
## data:  mse by model
## t = -2.4364, df = 13.517, p-value = 0.02933
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14045750.6   -870695.1
## sample estimates:
##  mean in group baseline_mod mean in group fireplace_mod
##                  3719546710                  3727004933

Cross-validation tests the performance of models when assessed on out-of-sample data. Here, we see that fireplaces is not a major player.

## Bootstrapping

Returning the the house-price models, consider the effect size of fireplaces.

effect_size(fireplace_mod, ~ fireplaces)
##      slope fireplaces to:fireplaces living_area bathrooms land_value
## 1 4236.482          1      1.556102      1634.5         2      25000

This indicates that a fireplace adds about \$4000 to the price of a house. How precise is this estimate? We can address this by bootstrapping.

In the statisticalModeling package, the process is to create an ensemble of bootstrap replicates of a model.

E <- ensemble(fireplace_mod, nreps = 50)

This particular ensemble has 50 bootstrap replicates. These replicates can be passed to functions such as effect_size() in the same way that an individual model is used.

Calculating the effect size on this ensemble will produce an ensemble of effect sizes, the spread of which gives the precision of the estimate.

EE <- effect_size(E, ~ fireplaces)
with(data = EE, sd(slope))
## [1] 4114.25

[Please provide feedback: Would it be useful to be able to pass a bootstrap ensemble to gmodel() to plot the ensemble of model functions?]

This is different from the approach in the mosaic package. I think the mosaic approch is expressive and elegant, but it’s not well suited to the automatic choice of input values supported by evaluate_model(). Nonetheless, you can use the mosaic approach with statisticalModeling functions, for instance:

library(mosaic)

do(10) * {
fireplace_mod <-
lm(price ~ living_area + bathrooms + land_value + fireplaces,
data = resample(Houses_for_sale))
effect_size(fireplace_mod, ~ fireplaces)
}
##        slope fireplaces to.fireplaces living_area bathrooms land_value
## 1  2626.4342          1      1.555714      1649.0         2      25500
## 2  8357.9940          1      1.567239      1666.0         2      25000
## 3  3877.9657          1      1.564845      1656.0         2      26900
## 4  3527.0284          1      1.563655      1623.5         2      25050
## 5  6660.0502          1      1.566740      1636.0         2      24500
## 6  -892.9505          1      1.554782      1640.0         2      25400
## 7  6427.5210          1      1.553534      1609.0         2      24100
## 8  7962.4726          1      1.564586      1649.0         2      25400
## 9  5135.1025          1      1.575627      1668.0         2      25400
## 10 5531.7272          1      1.543186      1658.0         2      25000
##    .row .index
## 1     1      1
## 2     1      2
## 3     1      3
## 4     1      4
## 5     1      5
## 6     1      6
## 7     1      7
## 8     1      8
## 9     1      9
## 10    1     10

# Formula-driven graphics

One of the main materials from which statistical models are made is graphics. A data graphic is a representation of a data table which has particular resonance with human cognitive abilities and which builds on topics found in widely taught math curricula, e.g. cartesian coordinates.

There are many excellent graphics packages provided for R. The statisticalModeling package currently builds on one of them, ggplot2. The purpose of the graphics functions in statisticalModeling is to provide a clear, concise, and consistent interface to the underlying graphics package.

A simple example:

library(statisticalModeling)
gf_point(mpg ~ hp, data = mtcars)

Each of the marks in the plot is a glyph. Every glyph has graphical attributes. In the above plot, the attributes are x- and y-position. Other possible graphical attributes for a point glyph are color, shape, size, stroke, fill, and alpha (transparency).

Experts, of course, will want to use the underlying package directly in order to maximize expressiveness and minimize the “bureaucracy” induced by intervening layers of software. The statisticalModeling graphics are suited for beginners because they reduce the number of elements used in notation.

There are three aspects to describing graphics using statisticalModeling:

1. The “kind of graphic” to be created, e.g. scatter plots, density plots, bar plots, maps, etc.
2. The role that each variable plays in the graphic, e.g. x- or y-axis, color, shape, facet.
3. And, of course, the data table(s) on which the graphic is to be based.

The “kind of graphic” is specified by the name of the graphics function. All of the statisticalModeling data graphics functions have names starting with gf_, which is intended to remind the user that they are formula-based interfaces to ggplot2: g for ggplot2 and f for “formula.” Commonly used functions are

• gf_points() for scatter plots
• gf_density() or gf_histogram() or gf_freqpoly() to display distributions
• gf_boxplot() for comparing distributions side-by-side
• gf_counts() for bar-graph style depictions of counts.
• gf_bar() for more general bar-graph style graphics

The function names generally match the corresponding operation in ggplot2, although gf_counts() is a simplified special case.

Each of the gf_ functions creates the coordinate axes and fills it in one operation. (In ggplot2 nomenclature, gf_ function create a frame and add a geom layer, all in one operation.)

In the gf_ functions, you specify the graphical attributes using a formula. Each such specification takes the form attribute:value, where attribute is one of color, shape, etc. and value is either a constant (e.g. "red" or 0.5 as appropriate) or a variable (e.g. cyl). For instance:

gf_point(mpg ~ hp + color:cyl + size:carb + alpha:0.75, data = mtcars) +
ggplot2::ylab("Miles per gallon") +
ggplot2::xlab("Horsepower")

You add labels and facets using the standard ggplot2 notation. xlab() and ylab() are adding the axis labels in the above. (For faceting, see below.)

## Paths and lines

The ggplot system offers two ways to connect points. gf_line() ignores the order of the points in the data draws the line going from left to right. gf_path() goes from point to point according to the order in the data. Both forms can use a color or group aesthetic as a flag to draw groupwise lines.

# use a categorical variable
mtcars <- mtcars %>% mutate(n_cylinders = as.character(cyl))
gf_line(mpg ~ hp, data = mtcars)
gf_path(mpg ~ hp, data = mtcars)
gf_line(mpg ~ hp + color:n_cylinders, data = mtcars)

The above are examples of bad plots. The viewer is unnecessarily distracted by the zigs and zags in the connecting lines. It would be better to use gf_point() here, but then you wouldn’t see how gf_line() and gf_path() work!

## Faceting

The ggplot system allows you to make subplots — called “facets” — based on the values of one or two categorical variables. This is done by “adding” a facet_grid() directive. The directive uses a formula to specify which variable(s) are to be used in the grid.

gf_density_2d(net ~ age, data = Runners) + facet_grid( ~ sex)
gf_density_2d(net ~ age, data = Runners) + facet_grid(start_position ~ sex)

## Overlapping cases

Sometimes you have so many points in a scatter plot that they obscure one another. The ggplot system provides two easy ways to deal with this: translucency and jittering.

Use alpha:0.5 to make the points semi-translucent. If there are many points overlapping at one point, a much smaller value of alpha, say alpha:0.01.

Using gf_jitter() in place of gf_point() will move the plotted points to reduce overlap. You can use both of these techniques, e.g.

gf_jitter(age ~ sex + alpha:0.01, data = Runners)

## For ggplot2 users: gf and gg together

The gf_ functions generate a ggplot object as well as a character string containing the ggplot() command to generate the graphic. This can be useful when you want to use the gf_ functions to remind you about how ggplot() works, but you want to have the ggplot() commands directly in your document for future modification.

Use verbose = TRUE to see the string being generated.

1. The character Spock in Star Trek says in one episode, “I have no analysis due to insufficient information.” That sounds more scientific or logical than its simpler equivalent: “I don’t know.”

2. Called fmodel() in earlier versions of this package.