Tidy Data Frames of Marginal Effects

Daniel Lüdecke

2018-10-11

Aim of the ggeffects-package

The aim of this package is similar to the broom-package: transforming “untidy” input into a tidy data frame, especially for further use with ggplot. However, ggeffects does not return model-summaries; rather, this package computes marginal effects at the mean or average marginal effects from statistical models and returns the result as tidy data frame.

Since the focus lies on plotting the data (the marginal effects), at least one model term needs to be specified for which the effects are computed. It is also possible to compute marginal effects for model terms, grouped by the levels of another model’s predictor. The package also allows plotting marginal effects for two- or three-way-interactions, or for specific values of a model term only. Examples are shown below.

Consistent and tidy structure

The returned data frames always have the same, consistent structure and column names, so it’s easy to create ggplot-plots without the need to re-write the arguments to be mapped in each ggplot-call. x and predicted are the values for the x- and y-axis. conf.low and conf.high could be used as ymin and ymax aesthetics for ribbons to add confidence bands to the plot. group can be used as grouping-aesthetics, or for faceting.

Marginal effects at the mean

ggpredict() computes predicted values for all possible levels and values from a model’s predictors. In the simplest case, a fitted model is passed as first argument, and the term in question as second argument:

library(ggeffects)
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)

ggpredict(fit, terms = "c12hour")
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = average number of hours of care per week 
#> 
#>   x predicted conf.low conf.high
#>   0    75.444   73.257    77.630
#>   5    74.177   72.098    76.256
#>  10    72.911   70.931    74.890
#>  15    71.644   69.753    73.535
#>  20    70.378   68.564    72.191
#>  25    69.111   67.361    70.861
#>  30    67.845   66.144    69.545
#>  35    66.578   64.911    68.245
#>  40    65.312   63.661    66.962
#>  45    64.045   62.393    65.697
#>  ... and 25 more rows.
#> 
#> Adjusted for:
#> *  neg_c_7 = 11.84
#> *  c161sex = 1.76
#> * c172code = 1.97

The output shows the predicted values for the response at each value from the term c12hour. The data is already in shape for ggplot:

library(ggplot2)
theme_set(theme_bw())

mydf <- ggpredict(fit, terms = "c12hour")
ggplot(mydf, aes(x, predicted)) + geom_line()

Marginal effects at the mean for different groups

The terms-argument accepts up to three model terms, where the second and third term indicate grouping levels. This allows predictions for the term in question at different levels for other model terms:

ggpredict(fit, terms = c("c12hour", "c172code"))
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = average number of hours of care per week 
#> 
#> # low level of education
#>   x predicted conf.low conf.high
#>   0    74.746   71.266    78.227
#>   5    73.480   70.080    76.880
#>  10    72.213   68.887    75.540
#>  15    70.947   67.686    74.207
#>  20    69.680   66.478    72.883
#>  25    68.414   65.262    71.566
#>  30    67.147   64.037    70.257
#>  35    65.881   62.804    68.958
#>  40    64.614   61.561    67.668
#>  45    63.348   60.309    66.387
#>  ... and 25 more rows.
#> 
#> # intermediate level of education
#>   x predicted conf.low conf.high
#>   0    75.465   73.282    77.647
#>   5    74.198   72.123    76.273
#>  10    72.932   70.955    74.909
#>  15    71.665   69.777    73.554
#>  20    70.399   68.587    72.211
#>  25    69.132   67.383    70.881
#>  30    67.866   66.165    69.566
#>  35    66.599   64.931    68.267
#>  40    65.333   63.680    66.985
#>  45    64.066   62.412    65.720
#>  ... and 25 more rows.
#> 
#> # high level of education
#>   x predicted conf.low conf.high
#>   0    76.183   72.816    79.550
#>   5    74.917   71.604    78.229
#>  10    73.650   70.384    76.916
#>  15    72.384   69.155    75.612
#>  20    71.117   67.918    74.316
#>  25    69.851   66.672    73.029
#>  30    68.584   65.417    71.751
#>  35    67.318   64.153    70.482
#>  40    66.051   62.880    69.223
#>  45    64.785   61.597    67.972
#>  ... and 25 more rows.
#> 
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76

Creating a ggplot is pretty straightforward: the colour-aesthetics is mapped with the group-column:

mydf <- ggpredict(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Finally, a second grouping structure can be defined, which will create another column named facet, which - as the name implies - might be used to create a facted plot:

mydf <- ggpredict(fit, terms = c("c12hour", "c172code", "c161sex"))
mydf
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = average number of hours of care per week 
#> 
#> # low level of education
#> # [1] Male
#>   x predicted conf.low conf.high
#>   0    73.954   69.354    78.554
#>   5    72.688   68.143    77.233
#>  10    71.421   66.925    75.917
#>  15    70.155   65.702    74.607
#>  20    68.888   64.472    73.304
#>  25    67.622   63.237    72.007
#>  30    66.355   61.995    70.716
#>  35    65.089   60.746    69.431
#>  40    63.822   59.490    68.154
#>  45    62.556   58.228    66.883
#>  ... and 25 more rows.
#> 
#> # low level of education
#> # [2] Female
#>   x predicted conf.low conf.high
#>   0    74.996   71.406    78.585
#>   5    73.729   70.219    77.239
#>  10    72.463   69.026    75.899
#>  15    71.196   67.826    74.566
#>  20    69.930   66.618    73.241
#>  25    68.663   65.403    71.924
#>  30    67.397   64.179    70.614
#>  35    66.130   62.947    69.313
#>  40    64.864   61.706    68.021
#>  45    63.597   60.456    66.738
#>  ... and 25 more rows.
#> 
#> # intermediate level of education
#> # [1] Male
#>   x predicted conf.low conf.high
#>   0    74.673   71.055    78.290
#>   5    73.406   69.846    76.966
#>  10    72.139   68.629    75.650
#>  15    70.873   67.404    74.342
#>  20    69.606   66.171    73.042
#>  25    68.340   64.931    71.749
#>  30    67.073   63.681    70.465
#>  35    65.807   62.424    69.190
#>  40    64.540   61.158    67.923
#>  45    63.274   59.883    66.665
#>  ... and 25 more rows.
#> 
#> # intermediate level of education
#> # [2] Female
#>   x predicted conf.low conf.high
#>   0    75.714   73.313    78.115
#>   5    74.447   72.146    76.748
#>  10    73.181   70.972    75.390
#>  15    71.914   69.787    74.041
#>  20    70.648   68.592    72.703
#>  25    69.381   67.385    71.378
#>  30    68.115   66.165    70.065
#>  35    66.848   64.931    68.766
#>  40    65.582   63.682    67.482
#>  45    64.315   62.418    66.213
#>  ... and 25 more rows.
#> 
#> # high level of education
#> # [1] Male
#>   x predicted conf.low conf.high
#>   0    75.391   71.040    79.741
#>   5    74.124   69.810    78.439
#>  10    72.858   68.573    77.143
#>  15    71.591   67.330    75.853
#>  20    70.325   66.080    74.570
#>  25    69.058   64.823    73.294
#>  30    67.792   63.559    72.025
#>  35    66.525   62.288    70.762
#>  40    65.259   61.011    69.507
#>  45    63.992   59.727    68.258
#>  ... and 25 more rows.
#> 
#> # high level of education
#> # [2] Female
#>   x predicted conf.low conf.high
#>   0    76.432   72.887    79.977
#>   5    75.166   71.674    78.657
#>  10    73.899   70.454    77.345
#>  15    72.633   69.226    76.040
#>  20    71.366   67.989    74.743
#>  25    70.100   66.744    73.455
#>  30    68.833   65.491    72.175
#>  35    67.567   64.229    70.904
#>  40    66.300   62.958    69.642
#>  45    65.034   61.679    68.388
#>  ... and 25 more rows.
#> 
#> Adjusted for:
#> * neg_c_7 = 11.84
ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() + 
  facet_wrap(~facet)

Marginal effects for each model term

If the term argument is either missing or NULL, marginal effects for each model term are calculated. The result is returned as a list, which can be plotted manually (or using the plot() function).

mydf <- ggpredict(fit)
mydf
#> $c12hour
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = average number of hours of care per week 
#> 
#>   x predicted conf.low conf.high
#>   0    75.444   73.257    77.630
#>   5    74.177   72.098    76.256
#>  10    72.911   70.931    74.890
#>  15    71.644   69.753    73.535
#>  20    70.378   68.564    72.191
#>  25    69.111   67.361    70.861
#>  30    67.845   66.144    69.545
#>  35    66.578   64.911    68.245
#>  40    65.312   63.661    66.962
#>  45    64.045   62.393    65.697
#>  ... and 25 more rows.
#> 
#> Adjusted for:
#> *  neg_c_7 = 11.84
#> *  c161sex = 1.76
#> * c172code = 1.97
#> 
#> 
#> $neg_c_7
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = Negative impact with 7 items 
#> 
#>   x predicted conf.low conf.high
#>   6    78.166   75.107    81.225
#>   8    73.571   71.207    75.935
#>  10    68.977   67.139    70.815
#>  12    64.383   62.732    66.033
#>  14    59.788   57.883    61.693
#>  16    55.194   52.725    57.662
#>  18    50.599   47.419    53.779
#>  20    46.005   42.043    49.966
#>  22    41.410   36.632    46.188
#>  24    36.816   31.201    42.430
#>  ... and 2 more rows.
#> 
#> Adjusted for:
#> *  c12hour = 42.20
#> *  c161sex = 1.76
#> * c172code = 1.97
#> 
#> 
#> $c161sex
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = carer's gender 
#> 
#>  x predicted conf.low conf.high
#>  1    63.962   60.575    67.350
#>  2    65.004   63.110    66.897
#> 
#> Adjusted for:
#> *  c12hour = 42.20
#> *  neg_c_7 = 11.84
#> * c172code = 1.97
#> 
#> 
#> $c172code
#> 
#> # Predicted values for Total score BARTHEL INDEX 
#> # x = carer's level of education 
#> 
#>  x predicted conf.low conf.high
#>  1    64.057   61.012    67.103
#>  2    64.776   63.125    66.427
#>  3    65.494   62.317    68.672
#> 
#> Adjusted for:
#> * c12hour = 42.20
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#> 
#> 
#> attr(,"class")
#> [1] "ggalleffects" "list"

Average marginal effects

ggaverage() compute average marginal effects. While ggpredict() creates a data-grid (using expand.grid()) for all possible combinations of values (even if some combinations are not present in the data), ggaverage() computes predicted values based on the given data. This means that different predicted values for the outcome may occure at the same value or level for the term in question. The predicted values are then averaged for each value of the term in question and the linear trend is smoothed accross the averaged predicted values. This means that the line representing the marginal effects may cross or diverge, and are not necessarily in paralell to each other.

mydf <- ggaverage(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Two- and Three-Way-Interactions

To plot the marginal effects of interaction terms, simply specify these terms in the terms-argument.

library(sjmisc)
data(efc)

# make categorical
efc$c161sex <- to_factor(efc$c161sex)

# fit model with interaction
fit <- lm(neg_c_7 ~ c12hour + barthtot * c161sex, data = efc)

# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()

Since the terms-argument accepts up to three model terms, you can also compute marginal effects for a 3-way-interaction.

To plot the marginal effects of interaction terms, simply specify these terms in the terms-argument.

# fit model with 3-way-interaction
fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc)

# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex"))

ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() +
  facet_wrap(~facet)

Polynomial terms and splines

ggpredict() also works for models with polynomial terms or splines. Following code reproduces the plot from ?splines::bs:

library(splines)
data(women)

fm1 <- lm(weight ~ bs(height, df = 5), data = women)
dat <- ggpredict(fm1, "height")

ggplot(dat, aes(x, predicted)) + 
  geom_line() +
  geom_point()

Survival models

ggpredict() also supports coxph-models from the survival-package and is able to either plot risk-scores (the default), probabilities of survival (type = "surv") or cumulative hazards (type = "cumhaz").

Since probabilities of survival and cumulative hazards are changing accross time, the time-variable is automatically used as x-axis in such cases, so the terms-argument only needs up to two variables for type = "surv" or type = "cumhaz".

data("lung", package = "survival")
# remove category 3 (outlier)
lung <- subset(lung, subset = ph.ecog %in% 0:2)
lung$sex <- factor(lung$sex, labels = c("male", "female"))
lung$ph.ecog <- factor(lung$ph.ecog, labels = c("good", "ok", "limited"))

m <- survival::coxph(survival::Surv(time, status) ~ sex + age + ph.ecog, data = lung)

# predicted risk-scores
ggpredict(m, c("sex", "ph.ecog"))

# probability of survival
ggpredict(m, c("sex", "ph.ecog"), type = "surv")

Labelling the data

ggeffects makes use of the sjlabelled-package and supports labelled data. If the data from the fitted models is labelled, the value and variable label attributes are usually copied to the model frame stored in the model object. ggeffects provides various getter-functions to access these labels, which are returned as character vector and can be used in ggplot’s lab()- or scale_*()-functions.

The data frame returned by ggpredict() or ggaverage() must be used as argument to one of the above function calls.

get_x_title(mydf)
#> [1] "average number of hours of care per week"
get_y_title(mydf)
#> [1] "Negative impact with 7 items"

ggplot(mydf, aes(x, predicted, colour = group)) + 
  geom_line() +
  facet_wrap(~facet) +
  labs(
    x = get_x_title(mydf),
    y = get_y_title(mydf),
    colour = get_legend_title(mydf)
  )