Power with simglm - legacy code

2019-01-23

Power Analysis with simglm

The simglm package allows the ability to conduct a power analysis through simulation. This will be particularly helpful with multilevel models and generalized linear models. To show the process, we will start with basic regression models.

Single Level Power Analysis

Let’s look at a simple single level regression example to get started:

Much of the output here is the same from the sim_reg function. The additional arguments, pow_param represents the terms to conduct a power analysis for and must be a subset of the fixed argument, alpha represents the per term level of significance, pow_dist represents the sampling distribution to refer to, either ‘z’ or ‘t’, pow_tail represents whether a one or two tailed hypothesis is being tested, and replicates represents the number of simulations to conduct. Note, to do a power analysis for the intercept, ‘(Intercept)’ must be used. By default, if pow_param is not specified power is conducted for all terms.

Finally, looking at the output from the above call:

## # A tibble: 4 x 9
##   term  avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
##   <chr>         <dbl>        <dbl>       <dbl>      <dbl> <dbl>      <dbl>
## 1 (Int…         0.549        0.364       0.371     0.0225     0          0
## 2 act           1.11         0.208       0.189     0.0161     0          0
## 3 diff          0.602        0.171       0.185     0.0157     0          0
## 4 numC…         0.866        0.374       0.376     0.0306     0          0
## # … with 2 more variables: num_repl <dbl>, data <list>

The output contains the variable name, the average test statistic, the standard deviation of the test statistic, the power rate, the number of null hypotheses rejects, and the total number of replications. Increasing the number of replications would increase the precision of the power analysis, however may significantly increase the computational time.

Standardized Coefficients

By default, the simglm package uses unstandardized regression coefficients when doing the simulation. A way to use standardized coefficients however, would be to generate standardized variables. For example:

## # A tibble: 4 x 9
##   term  avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
##   <chr>         <dbl>        <dbl>       <dbl>      <dbl> <dbl>      <dbl>
## 1 (Int…         0.192       0.0678      0.0831    0.00498     0          0
## 2 act           0.404       0.0797      0.0835    0.00707     0          0
## 3 diff          0.257       0.0823      0.0835    0.00678     0          0
## 4 numC…         0.695       0.0749      0.0836    0.00675     0          0
## # … with 2 more variables: num_repl <dbl>, data <list>

Model Misspecification

It is also possible to specify a model different than the generating model for evaluation of power. This may be useful if it is thought another variable is important in explaining variation, however due to design issues is not possible to collect this variable. As a result, there would likely be additional variation due to this variable that will not be able to be explained. Building this into the generating model can help provide additional information about the impact on power.

An example using single level power analysis is shown below.

## # A tibble: 3 x 9
##   term  avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
##   <chr>         <dbl>        <dbl>       <dbl>      <dbl> <dbl>      <dbl>
## 1 (Int…         0.545        0.455       0.415     0.0288     0          0
## 2 act           1.08         0.249       0.210     0.0171     0          0
## 3 diff          0.609        0.204       0.210     0.0185     0          0
## # … with 2 more variables: num_repl <dbl>, data <list>

You’ll notice, compared to above, the power is somewhat reduced for the other three fixed effects when not modeling the numCourse variable. You can also build in the lm_fit_mod argument into the design via the terms_vary argument.

## # A tibble: 6 x 13
## # Groups:   term, n, error_var, fixed_param [3]
##   term      n error_var fixed_param lm_fit_mod avg_test_stat sd_test_stat
##   <chr> <dbl>     <dbl> <fct>       <fct>              <dbl>        <dbl>
## 1 (Int…    20         5 0.5,1.1,0.… ~,sim_dat…         0.535        0.799
## 2 (Int…    20         5 0.5,1.1,0.… ~,sim_dat…         0.526        0.675
## 3 (Int…    20         5 0.6,1.1,0.… ~,sim_dat…         0.499        0.772
## 4 (Int…    20         5 0.6,1.1,0.… ~,sim_dat…         0.579        0.849
## 5 (Int…    20        10 0.5,1.1,0.… ~,sim_dat…         0.408        0.920
## 6 (Int…    20        10 0.5,1.1,0.… ~,sim_dat…         0.495        0.874
## # … with 6 more variables: avg_std_err <dbl>, sd_std_err <dbl>,
## #   power <dbl>, num_reject <dbl>, num_repl <dbl>, data <list>

Nested Data

Extending the power analysis to nested data is fairly straightforward. Below is a three level example

## # A tibble: 4 x 9
##   term  avg_test_stat sd_test_stat avg_std_err sd_std_err power num_reject
##   <chr>         <dbl>        <dbl>       <dbl>      <dbl> <dbl>      <dbl>
## 1 act            2.27      0.0117       0.0333   0.000722     0          0
## 2 actC…          6.85      0.135        0.468    0.219        1          3
## 3 diff           5.97      0.00978      0.0250   0.000174     1          3
## 4 time           1.88      0.380        0.429    0.0914       0          0
## # … with 2 more variables: num_repl <dbl>, data <list>

Generalized Power Analysis

## # A tibble: 3 x 9
##   term  avg_test_stat sd_test_stat avg_std_err sd_std_err power[,1]
##   <chr>         <dbl>        <dbl>       <dbl>      <dbl>     <dbl>
## 1 (Int…         0.316       0.322        0.377     0.0345         0
## 2 act           0.588       0.267        0.237     0.0597         0
## 3 diff          0.352       0.0998       0.121     0.0229         0
## # … with 3 more variables: num_reject[,1] <dbl>, num_repl <dbl>,
## #   data <list>

Vary Arguments

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## # A tibble: 6 x 11
## # Groups:   term, n [3]
##   term      n fixed_param avg_test_stat sd_test_stat avg_std_err sd_std_err
##   <chr> <dbl> <fct>               <dbl>        <dbl>       <dbl>      <dbl>
## 1 (Int…    20 0.5,0.1,0.2         9.88        24.8      6297.    19906.    
## 2 (Int…    20 0.6,0.1,0.2         0.320        1.46        1.01      0.990 
## 3 (Int…    40 0.5,0.1,0.2         0.327        0.441       0.434     0.0520
## 4 (Int…    40 0.6,0.1,0.2         0.650        0.542       0.459     0.0631
## 5 (Int…    60 0.5,0.1,0.2         0.430        0.323       0.349     0.0459
## 6 (Int…    60 0.6,0.1,0.2         0.750        0.240       0.363     0.0401
## # … with 4 more variables: power[,1] <dbl>, num_reject[,1] <dbl>,
## #   num_repl <dbl>, data <list>