The aim of this package is similar to the broom-package: transforming “untidy” input into a tidy data frame, especially for further use with ggplot. However, ggeffects does not return model-summaries; rather, this package computes marginal effects at the mean or average marginal effects from statistical models and returns the result as tidy data frame.
Since the focus lies on plotting the data (the marginal effects), at least one model term needs to be specified for which the effects are computed. It is also possible to compute marginal effects for model terms, grouped by the levels of another model’s predictor. The package also allows plotting marginal effects for two- or three-way-interactions, or for specific values of a model term only. Examples are shown below.
The returned data frames always have the same, consistent structure and column names, so it’s easy to create ggplot-plots without the need to re-write the arguments to be mapped in each ggplot-call. x
and predicted
are the values for the x- and y-axis. conf.low
and conf.high
could be used as ymin
and ymax
aesthetics for ribbons to add confidence bands to the plot. group
can be used as grouping-aesthetics, or for faceting.
ggpredict()
computes predicted values for all possible levels and values from a model’s predictors. In the simplest case, a fitted model is passed as first argument, and the term in question as second argument:
library(ggeffects)
data(efc)
fit <- lm(barthtot ~ c12hour + neg_c_7 + c161sex + c172code, data = efc)
ggpredict(fit, terms = "c12hour")
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> x predicted conf.low conf.high
#> 0 75.444 73.257 77.630
#> 5 74.177 72.098 76.256
#> 10 72.911 70.931 74.890
#> 15 71.644 69.753 73.535
#> 20 70.378 68.564 72.191
#> 25 69.111 67.361 70.861
#> 30 67.845 66.144 69.545
#> 35 66.578 64.911 68.245
#> 40 65.312 63.661 66.962
#> 45 64.045 62.393 65.697
#> ... and 25 more rows.
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#> * c172code = 1.97
The output shows the predicted values for the response at each value from the term c12hour. The data is already in shape for ggplot:
library(ggplot2)
theme_set(theme_bw())
mydf <- ggpredict(fit, terms = "c12hour")
ggplot(mydf, aes(x, predicted)) + geom_line()
The terms
-argument accepts up to three model terms, where the second and third term indicate grouping levels. This allows predictions for the term in question at different levels for other model terms:
ggpredict(fit, terms = c("c12hour", "c172code"))
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> # low level of education
#> x predicted conf.low conf.high
#> 0 74.746 71.266 78.227
#> 5 73.480 70.080 76.880
#> 10 72.213 68.887 75.540
#> 15 70.947 67.686 74.207
#> 20 69.680 66.478 72.883
#> 25 68.414 65.262 71.566
#> 30 67.147 64.037 70.257
#> 35 65.881 62.804 68.958
#> 40 64.614 61.561 67.668
#> 45 63.348 60.309 66.387
#> ... and 25 more rows.
#>
#> # intermediate level of education
#> x predicted conf.low conf.high
#> 0 75.465 73.282 77.647
#> 5 74.198 72.123 76.273
#> 10 72.932 70.955 74.909
#> 15 71.665 69.777 73.554
#> 20 70.399 68.587 72.211
#> 25 69.132 67.383 70.881
#> 30 67.866 66.165 69.566
#> 35 66.599 64.931 68.267
#> 40 65.333 63.680 66.985
#> 45 64.066 62.412 65.720
#> ... and 25 more rows.
#>
#> # high level of education
#> x predicted conf.low conf.high
#> 0 76.183 72.816 79.550
#> 5 74.917 71.604 78.229
#> 10 73.650 70.384 76.916
#> 15 72.384 69.155 75.612
#> 20 71.117 67.918 74.316
#> 25 69.851 66.672 73.029
#> 30 68.584 65.417 71.751
#> 35 67.318 64.153 70.482
#> 40 66.051 62.880 69.223
#> 45 64.785 61.597 67.972
#> ... and 25 more rows.
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
Creating a ggplot is pretty straightforward: the colour-aesthetics is mapped with the group
-column:
mydf <- ggpredict(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Finally, a second grouping structure can be defined, which will create another column named facet
, which - as the name implies - might be used to create a facted plot:
mydf <- ggpredict(fit, terms = c("c12hour", "c172code", "c161sex"))
mydf
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> # low level of education
#> # [1] Male
#> x predicted conf.low conf.high
#> 0 73.954 69.354 78.554
#> 5 72.688 68.143 77.233
#> 10 71.421 66.925 75.917
#> 15 70.155 65.702 74.607
#> 20 68.888 64.472 73.304
#> 25 67.622 63.237 72.007
#> 30 66.355 61.995 70.716
#> 35 65.089 60.746 69.431
#> 40 63.822 59.490 68.154
#> 45 62.556 58.228 66.883
#> ... and 25 more rows.
#>
#> # low level of education
#> # [2] Female
#> x predicted conf.low conf.high
#> 0 74.996 71.406 78.585
#> 5 73.729 70.219 77.239
#> 10 72.463 69.026 75.899
#> 15 71.196 67.826 74.566
#> 20 69.930 66.618 73.241
#> 25 68.663 65.403 71.924
#> 30 67.397 64.179 70.614
#> 35 66.130 62.947 69.313
#> 40 64.864 61.706 68.021
#> 45 63.597 60.456 66.738
#> ... and 25 more rows.
#>
#> # intermediate level of education
#> # [1] Male
#> x predicted conf.low conf.high
#> 0 74.673 71.055 78.290
#> 5 73.406 69.846 76.966
#> 10 72.139 68.629 75.650
#> 15 70.873 67.404 74.342
#> 20 69.606 66.171 73.042
#> 25 68.340 64.931 71.749
#> 30 67.073 63.681 70.465
#> 35 65.807 62.424 69.190
#> 40 64.540 61.158 67.923
#> 45 63.274 59.883 66.665
#> ... and 25 more rows.
#>
#> # intermediate level of education
#> # [2] Female
#> x predicted conf.low conf.high
#> 0 75.714 73.313 78.115
#> 5 74.447 72.146 76.748
#> 10 73.181 70.972 75.390
#> 15 71.914 69.787 74.041
#> 20 70.648 68.592 72.703
#> 25 69.381 67.385 71.378
#> 30 68.115 66.165 70.065
#> 35 66.848 64.931 68.766
#> 40 65.582 63.682 67.482
#> 45 64.315 62.418 66.213
#> ... and 25 more rows.
#>
#> # high level of education
#> # [1] Male
#> x predicted conf.low conf.high
#> 0 75.391 71.040 79.741
#> 5 74.124 69.810 78.439
#> 10 72.858 68.573 77.143
#> 15 71.591 67.330 75.853
#> 20 70.325 66.080 74.570
#> 25 69.058 64.823 73.294
#> 30 67.792 63.559 72.025
#> 35 66.525 62.288 70.762
#> 40 65.259 61.011 69.507
#> 45 63.992 59.727 68.258
#> ... and 25 more rows.
#>
#> # high level of education
#> # [2] Female
#> x predicted conf.low conf.high
#> 0 76.432 72.887 79.977
#> 5 75.166 71.674 78.657
#> 10 73.899 70.454 77.345
#> 15 72.633 69.226 76.040
#> 20 71.366 67.989 74.743
#> 25 70.100 66.744 73.455
#> 30 68.833 65.491 72.175
#> 35 67.567 64.229 70.904
#> 40 66.300 62.958 69.642
#> 45 65.034 61.679 68.388
#> ... and 25 more rows.
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet)
If the term
argument is either missing or NULL
, marginal effects for each model term are calculated. The result is returned as a list, which can be plotted manually (or using the plot()
function).
mydf <- ggpredict(fit)
mydf
#> $c12hour
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = average number of hours of care per week
#>
#> x predicted conf.low conf.high
#> 0 75.444 73.257 77.630
#> 5 74.177 72.098 76.256
#> 10 72.911 70.931 74.890
#> 15 71.644 69.753 73.535
#> 20 70.378 68.564 72.191
#> 25 69.111 67.361 70.861
#> 30 67.845 66.144 69.545
#> 35 66.578 64.911 68.245
#> 40 65.312 63.661 66.962
#> 45 64.045 62.393 65.697
#> ... and 25 more rows.
#>
#> Adjusted for:
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#> * c172code = 1.97
#>
#>
#> $neg_c_7
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = Negative impact with 7 items
#>
#> x predicted conf.low conf.high
#> 6 78.166 75.107 81.225
#> 8 73.571 71.207 75.935
#> 10 68.977 67.139 70.815
#> 12 64.383 62.732 66.033
#> 14 59.788 57.883 61.693
#> 16 55.194 52.725 57.662
#> 18 50.599 47.419 53.779
#> 20 46.005 42.043 49.966
#> 22 41.410 36.632 46.188
#> 24 36.816 31.201 42.430
#> ... and 2 more rows.
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * c161sex = 1.76
#> * c172code = 1.97
#>
#>
#> $c161sex
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = carer's gender
#>
#> x predicted conf.low conf.high
#> 1 63.962 60.575 67.350
#> 2 65.004 63.110 66.897
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * neg_c_7 = 11.84
#> * c172code = 1.97
#>
#>
#> $c172code
#>
#> # Predicted values for Total score BARTHEL INDEX
#> # x = carer's level of education
#>
#> x predicted conf.low conf.high
#> 1 64.057 61.012 67.103
#> 2 64.776 63.125 66.427
#> 3 65.494 62.317 68.672
#>
#> Adjusted for:
#> * c12hour = 42.20
#> * neg_c_7 = 11.84
#> * c161sex = 1.76
#>
#>
#> attr(,"class")
#> [1] "ggalleffects" "list"
ggaverage()
compute average marginal effects. While ggpredict()
creates a data-grid (using expand.grid()
) for all possible combinations of values (even if some combinations are not present in the data), ggaverage()
computes predicted values based on the given data. This means that different predicted values for the outcome may occure at the same value or level for the term in question. The predicted values are then averaged for each value of the term in question and the linear trend is smoothed accross the averaged predicted values. This means that the line representing the marginal effects may cross or diverge, and are not necessarily in paralell to each other.
mydf <- ggaverage(fit, terms = c("c12hour", "c172code"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
To plot the marginal effects of interaction terms, simply specify these terms in the terms
-argument.
library(sjmisc)
data(efc)
# make categorical
efc$c161sex <- to_factor(efc$c161sex)
# fit model with interaction
fit <- lm(neg_c_7 ~ c12hour + barthtot * c161sex, data = efc)
# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) + geom_line()
Since the terms
-argument accepts up to three model terms, you can also compute marginal effects for a 3-way-interaction.
To plot the marginal effects of interaction terms, simply specify these terms in the terms
-argument.
# fit model with 3-way-interaction
fit <- lm(neg_c_7 ~ c12hour * barthtot * c161sex, data = efc)
# select only levels 30, 50 and 70 from continuous variable Barthel-Index
mydf <- ggpredict(fit, terms = c("c12hour", "barthtot [30,50,70]", "c161sex"))
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet)
ggpredict()
also works for models with polynomial terms or splines. Following code reproduces the plot from ?splines::bs
:
library(splines)
data(women)
fm1 <- lm(weight ~ bs(height, df = 5), data = women)
dat <- ggpredict(fm1, "height")
ggplot(dat, aes(x, predicted)) +
geom_line() +
geom_point()
ggpredict()
also supports coxph
-models from the survival-package and is able to either plot risk-scores (the default), probabilities of survival (type = "surv"
) or cumulative hazards (type = "cumhaz"
).
Since probabilities of survival and cumulative hazards are changing accross time, the time-variable is automatically used as x-axis in such cases, so the terms
-argument only needs up to two variables for type = "surv"
or type = "cumhaz"
.
data("lung", package = "survival")
# remove category 3 (outlier)
lung <- subset(lung, subset = ph.ecog %in% 0:2)
lung$sex <- factor(lung$sex, labels = c("male", "female"))
lung$ph.ecog <- factor(lung$ph.ecog, labels = c("good", "ok", "limited"))
m <- survival::coxph(survival::Surv(time, status) ~ sex + age + ph.ecog, data = lung)
# predicted risk-scores
ggpredict(m, c("sex", "ph.ecog"))
# probability of survival
ggpredict(m, c("sex", "ph.ecog"), type = "surv")
ggeffects makes use of the sjlabelled-package and supports labelled data. If the data from the fitted models is labelled, the value and variable label attributes are usually copied to the model frame stored in the model object. ggeffects provides various getter-functions to access these labels, which are returned as character vector and can be used in ggplot’s lab()
- or scale_*()
-functions.
get_title()
- a generic title for the plot, based on the model family, like “predicted values” or “predicted probabilities”get_x_title()
- the variable label of the first model term in terms
.get_y_title()
- the variable label of the response.get_legend_title()
- the variable label of the second model term in terms
.get_x_labels()
- value labels of the first model term in terms
.get_legend_labels()
- value labels of the second model term in terms
.The data frame returned by ggpredict()
or ggaverage()
must be used as argument to one of the above function calls.
get_x_title(mydf)
#> [1] "average number of hours of care per week"
get_y_title(mydf)
#> [1] "Negative impact with 7 items"
ggplot(mydf, aes(x, predicted, colour = group)) +
geom_line() +
facet_wrap(~facet) +
labs(
x = get_x_title(mydf),
y = get_y_title(mydf),
colour = get_legend_title(mydf)
)