The simglm
package allows the ability to conduct a power analysis through simulation. This will be particularly helpful with multilevel models and generalized linear models. To show the process, we will start with basic regression models.
Let’s look at a simple single level regression example to get started:
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 100
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates)
Much of the output here is the same from the sim_reg
function. The additional arguments, pow_param
represents the terms to conduct a power analysis for and must be a subset of the fixed
argument, alpha
represents the per term level of significance, pow_dist
represents the sampling distribution to refer to, either ‘z’ or ‘t’, pow_tail
represents whether a one or two tailed hypothesis is being tested, and replicates
represents the number of simulations to conduct. Note, to do a power analysis for the intercept, ‘(Intercept)’ must be used. By default, if pow_param is not specified power is conducted for all terms.
Finally, looking at the output from the above call:
## # A tibble: 4 x 9
## term avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 0.549 0.364 0.371 0.0225 0 0
## 2 act 1.11 0.208 0.189 0.0161 0 0
## 3 diff 0.602 0.171 0.185 0.0157 0 0
## 4 numC… 0.866 0.374 0.376 0.0306 0 0
## # … with 2 more variables: num_repl <dbl>, data <list>
The output contains the variable name, the average test statistic, the standard deviation of the test statistic, the power rate, the number of null hypotheses rejects, and the total number of replications. Increasing the number of replications would increase the precision of the power analysis, however may significantly increase the computational time.
By default, the simglm
package uses unstandardized regression coefficients when doing the simulation. A way to use standardized coefficients however, would be to generate standardized variables. For example:
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.2, 0.4, 0.25, 0.7, 0.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 1),
list(mean = 0, sd = 1),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 1
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 100
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates)
head(power_out)
## # A tibble: 4 x 9
## term avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 0.192 0.0678 0.0831 0.00498 0 0
## 2 act 0.404 0.0797 0.0835 0.00707 0 0
## 3 diff 0.257 0.0823 0.0835 0.00678 0 0
## 4 numC… 0.695 0.0749 0.0836 0.00675 0 0
## # … with 2 more variables: num_repl <dbl>, data <list>
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 10
terms_vary <- list(n = c(20, 40, 60, 80, 100), error_var = c(5, 10, 20),
fixed_param = list(c(0.5, 1.1, 0.6, 0.9, 1.1),
c(0.6, 1.1, 0.6, 0.9, 1.1)),
cov_param = list(list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
mean = c(0, 0, 0), sd = c(2, 2, 1),
var_type = c("single", "single", "single")),
list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
mean = c(0.5, 0, 0), sd = c(2, 2, 1),
var_type = c("single", "single", "single"))
)
)
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, error_var = error_var, with_err_gen = with_err_gen,
data_str = "single", pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, terms_vary = terms_vary)
It is also possible to specify a model different than the generating model for evaluation of power. This may be useful if it is thought another variable is important in explaining variation, however due to design issues is not possible to collect this variable. As a result, there would likely be additional variation due to this variable that will not be able to be explained. Building this into the generating model can help provide additional information about the impact on power.
An example using single level power analysis is shown below.
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 100
lm_fit_mod <- sim_data ~ 1 + act + diff
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, lm_fit_mod = lm_fit_mod)
head(power_out)
## # A tibble: 3 x 9
## term avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 0.545 0.455 0.415 0.0288 0 0
## 2 act 1.08 0.249 0.210 0.0171 0 0
## 3 diff 0.609 0.204 0.210 0.0185 0 0
## # … with 2 more variables: num_repl <dbl>, data <list>
You’ll notice, compared to above, the power is somewhat reduced for the other three fixed effects when not modeling the numCourse
variable. You can also build in the lm_fit_mod
argument into the design via the terms_vary
argument.
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 2),
list(mean = 0, sd = 1)))
n <- 150
error_var <- 20
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 100
terms_vary <- list(n = c(20, 40, 60, 80, 100), error_var = c(5, 10, 20),
fixed_param = list(c(0.5, 1.1, 0.6, 0.9, 1.1),
c(0.6, 1.1, 0.6, 0.9, 1.1)),
lm_fit_mod = list(sim_data ~ 1 + act + diff,
sim_data ~ 1 + act)
)
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, terms_vary = terms_vary)
head(power_out)
## # A tibble: 6 x 13
## # Groups: term, n, error_var, fixed_param [3]
## term n error_var fixed_param lm_fit_mod avg_test_stat sd_test_stat
## <chr> <dbl> <dbl> <fct> <fct> <dbl> <dbl>
## 1 (Int… 20 5 0.5,1.1,0.… ~,sim_dat… 0.535 0.799
## 2 (Int… 20 5 0.5,1.1,0.… ~,sim_dat… 0.526 0.675
## 3 (Int… 20 5 0.6,1.1,0.… ~,sim_dat… 0.499 0.772
## 4 (Int… 20 5 0.6,1.1,0.… ~,sim_dat… 0.579 0.849
## 5 (Int… 20 10 0.5,1.1,0.… ~,sim_dat… 0.408 0.920
## 6 (Int… 20 10 0.5,1.1,0.… ~,sim_dat… 0.495 0.874
## # … with 6 more variables: avg_std_err <dbl>, sd_std_err <dbl>,
## # power <dbl>, num_reject <dbl>, num_repl <dbl>, data <list>
Extending the power analysis to nested data is fairly straightforward. Below is a three level example
fixed <- ~1 + time + diff + act + actClust + time:act
random <- ~1 + time
random3 <- ~ 1 + time
fixed_param <- c(4, 2, 6, 2.3, 7, 0)
random_param <- list(random_var = c(7, 4), rand_gen = 'rnorm')
random_param3 <- list(random_var = c(4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("level1", "level2", "level3"),
opts = list(list(mean = 0, sd = 1.5),
list(mean = 0, sd = 4),
list(mean = 0, sd = 2)))
k <- 10
n <- 45
p <- 8
error_var <- 4
with_err_gen <- 'rnorm'
data_str <- "long"
pow_param <- c('time', 'diff', 'act', 'actClust')
alpha <- .01
pow_dist <- "z"
pow_tail <- 2
replicates <- 3
power_out <- sim_pow(fixed = fixed, random = random, random3 = random3,
fixed_param = fixed_param,
random_param = random_param,
random_param3 = random_param3,
cov_param = cov_param,
k = k, n = n, p = p,
error_var = error_var, with_err_gen = "rnorm",
data_str = data_str,
unbal = list(level2 = FALSE, level3 = FALSE),
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates)
head(power_out)
## # A tibble: 4 x 9
## term avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 act 2.27 0.0117 0.0333 0.000722 0 0
## 2 actC… 6.85 0.135 0.468 0.219 1 3
## 3 diff 5.97 0.00978 0.0250 0.000174 1 3
## 4 time 1.88 0.380 0.429 0.0914 0 0
## # … with 2 more variables: num_repl <dbl>, data <list>
fixed <- ~ 1 + act + diff
fixed_param <- c(0.1, 0.5, 0.3)
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("single", "single"),
opts = list(list(mean = 0, sd = 2),
list(mean = 0, sd = 4)))
n <- 50
pow_param <- c('(Intercept)', 'act', 'diff')
alpha <- .01
pow_dist <- "z"
pow_tail <- 2
replicates <- 10
power_out <- sim_pow_glm(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, data_str = "single",
outcome_type = 'logistic',
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates)
head(power_out)
## # A tibble: 3 x 9
## term avg_test_stat sd_test_stat avg_std_err sd_std_err power[,1]
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 0.316 0.322 0.377 0.0345 0
## 2 act 0.588 0.267 0.237 0.0597 0
## 3 diff 0.352 0.0998 0.121 0.0229 0
## # … with 3 more variables: num_reject[,1] <dbl>, num_repl <dbl>,
## # data <list>
fixed <- ~ 1 + act + diff
fixed_param <- c(0.1, 0.5, 0.3)
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("single", "single"),
opts = list(list(mean = 0, sd = 5),
list(mean = 0, sd = 8)))
n <- NULL
pow_param <- c('(Intercept)', 'act', 'diff')
alpha <- .01
pow_dist <- "z"
pow_tail <- 2
replicates <- 10
terms_vary <- list(n = c(20, 40, 60, 80, 100),
fixed_param = list(c(0.5, 0.1, 0.2),
c(0.6, 0.1, 0.2)))
power_out <- sim_pow_glm(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, data_str = "single",
outcome_type = 'logistic',
pow_param = pow_param, alpha = alpha,
pow_dist = pow_dist, pow_tail = pow_tail,
replicates = replicates, terms_vary = terms_vary)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## # A tibble: 6 x 11
## # Groups: term, n [3]
## term n fixed_param avg_test_stat sd_test_stat avg_std_err sd_std_err
## <chr> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 20 0.5,0.1,0.2 9.88 24.8 6297. 19906.
## 2 (Int… 20 0.6,0.1,0.2 0.320 1.46 1.01 0.990
## 3 (Int… 40 0.5,0.1,0.2 0.327 0.441 0.434 0.0520
## 4 (Int… 40 0.6,0.1,0.2 0.650 0.542 0.459 0.0631
## 5 (Int… 60 0.5,0.1,0.2 0.430 0.323 0.349 0.0459
## 6 (Int… 60 0.6,0.1,0.2 0.750 0.240 0.363 0.0401
## # … with 4 more variables: power[,1] <dbl>, num_reject[,1] <dbl>,
## # num_repl <dbl>, data <list>