A Tutorial on Using the Simulation Functions Included in powerlmm

Kristoffer Magnusson

2018-04-17

This vignette shows how you can further investigate your models using Monte Carlo simulations. In addition to reporting power, simulations allow you to investigate parameter bias, model misspecification, Type I errors, and convergence issues. For instance, even when the analytical power calculations work well, it can be useful to check the small-sample properties of the model, e.g. the estimates of cluster-level random effects when there are only a few clusters.

Setup

The simulation-function accepts the same study setup-object that we’ve used in the two- and three-level power vignettes, and it supports all models that are available in powerlmm. As an example, we’ll investigate the partially nested three-level model.

Partially nested three-level model

When doing simulations, it can be a good idea to specify the model using raw parameter values, to ensure that all parameters are reasonable. Here I use estimates from a real clinical trial.

p <- study_parameters(n1 = 11,
                      n2 = 10,
                      n3 = 6,
                      fixed_intercept = 37,
                      fixed_slope = -0.64,
                      sigma_subject_intercept = 2.8,
                      sigma_subject_slope = 0.4,
                      sigma_cluster_intercept = 0,
                      cor_subject = -0.2,
                      icc_slope = 0.05,
                      sigma_error = 2.6,
                      dropout = dropout_weibull(proportion = 0.3, 
                                                rate = 1/2),
                      partially_nested = TRUE,
                      effect_size = cohend(-0.8, 
                                           standardizer = "pretest_SD"))

p

How many simulations to run?

The get_monte_carlo_se-function accepts a power object and returns the 95% Monte Carlo interval for a given number of simulations. This is useful for gauging the precision of the empirical power estimates.

x <- get_power(p)
get_monte_carlo_se(x, nsim = c(1000, 2000, 5000))

We would need about 5000 simulations if we want our empirical power estimate to be within +/- 1% of the analytical power.
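To see roughly where this number comes from, note that the Monte Carlo standard error of an empirical power estimate is approximately sqrt(power * (1 - power) / nsim). A minimal sketch, assuming a hypothetical true power of about 0.8 (substitute the analytical power from get_power(p) for your own design):

# Hypothetical true power; replace with the analytical power for your design
power <- 0.8
nsim <- 5000
se <- sqrt(power * (1 - power) / nsim)
se        # Monte Carlo SE of the empirical power estimate
1.96 * se # half-width of the approximate 95% interval, about 0.011 here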

Now, let’s run the simulation. First, we need to write the lme4-formula, since the simulated data sets are analyzed using lme4::lmer. Then we pass the study_parameters-object to the simulate function, and when the simulation is finished we use summary to view the results. You can run the simulation in parallel by setting cores > 1. The parallel computations are done using the parallel package. On a Windows machine or inside a GUI, PSOCK clusters are used; on Unix/macOS, if the simulation is run non-interactively in a terminal, forking is used.

f <- "y ~ treatment * time + (1 + time | subject) + (0 + treatment:time | cluster)"

cores <- parallel::detectCores() # use all cores

res <- simulate(object = p,
                nsim = 5000,
                formula = f,
                satterthwaite = TRUE,
                cores = cores,
                save = FALSE)

summary(res)

Automatic formula creation

It is also possible to create the formula automatically by leaving the formula argument blank. The lmer-formula is then created by including the parameters that are not specified as NA or NULL in the model; see ?create_lmer_formula. N.B., parameters specified as 0 are included in the model.
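As a quick check, you can compare the automatically generated formula to the one written by hand. A minimal sketch, assuming create_lmer_formula accepts the study_parameters-object directly (see its help page):

# Sketch: build the lmer-formula from the study setup
# (assumes create_lmer_formula() takes the study_parameters-object)
create_lmer_formula(p)

# Or simply omit 'formula' and let simulate() create it:
# res <- simulate(object = p, nsim = 5000, satterthwaite = TRUE, cores = cores)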

Investigate model misspecification

Let’s compare the correct partially nested model to a two-level linear mixed-effects model. The formula argument accepts a named list with a correct and a wrong formula. By setting effect_size to 0, the simulation results tell us the empirical Type I errors for the correct and the misspecified model.

p <- update(p, effect_size = 0)

f <- list("correct" = "y ~ treatment * time + (1 + time | subject) + (0 + treatment:time | cluster)",
          "wrong" = "y ~ treatment * time + (1 + time | subject)")

res <- simulate(object = p,
                nsim = 5000,
                formula = f,
                satterthwaite = TRUE,
                cores = cores,
                save = FALSE)

summary(res)

All of the parameters are summarized and presented for both models. Since we specified satterthwaite = TRUE, empirical power is based on both the Wald Z test and the Wald t test using Satterthwaite’s degrees-of-freedom approximation (from the lmerTest package).

Compare multiple combinations of parameter values

We can also simulate multiple designs and compare them. Let’s compare 6 versus 12 clusters in the treatment arm, an icc_slope of 0.05 or 0.1, and a Cohen’s d of 0.5 or 0.8.

p <- study_parameters(n1 = 11,
                      n2 = 10,
                      n3 = c(6, 12),
                      fixed_intercept = 37,
                      fixed_slope = -0.64,
                      sigma_subject_intercept = 2.8,
                      sigma_subject_slope = 0.4,
                      sigma_cluster_intercept = 0,
                      cor_subject = -0.2,
                      icc_slope = c(0.05, 0.1),
                      sigma_error = 2.6,
                      dropout = dropout_weibull(proportion = 0.3, 
                                                rate = 1/2),
                      partially_nested = TRUE,
                      effect_size = cohend(c(0.5, 0.8), 
                                           standardizer = "pretest_SD"))


f <- "y ~ treatment * time + (1 + time | subject) + (0 + treatment:time | cluster)"

res <- simulate(object = p,
                nsim = 5000,
                formula = f,
                satterthwaite = TRUE,
                cores = cores,
                save = FALSE)

The setup and simulation are the same, except that we define vectors of values for some of the parameters. In this scenario we have a total of 2 * 2 * 2 = 8 different combinations of parameter values. For each of the 8 setups, nsim simulations will be performed. With multiple designs, summary works a bit differently: instead of summarizing all parameters, we now have to choose which parameter to summarize.

# Summarize the 'time:treatment' results 
summary(res, para = "time:treatment", type = "fixed", model = "correct")

# Summarize the cluster-level random slope 
summary(res, para = "cluster_slope", type = "random", model = "correct")