The simglm
package aims to define a consistent framework for simulating regression models - including single level and multilevel models. This will hopefully allow the user to quickly simulate data for a class, project, or even a dissertation.
Currently development is happening on github. To install the package, use the devtools
package:
This should load the devtools
package, install the simglm
package, and finally load the simglm
package. The package has currently not been tested on a Mac machine. I do not anticipate any problems installing on a Mac however.
The master function that handles the simulation grunt work is the sim_reg
function. As always, you can do ?sim_reg
to pull up the help file for the function.
Let’s look at a simple single level regression example to get started:
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 4),
list(mean = 0, sd = 3),
list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, error_var = error_var,
with_err_gen = with_err_gen,
data_str = "single")
A few things to highlight about the above syntax, first the object fixed
is a one sided formula that gives the names of the variables to be included in the simulated data. The intercept is directly shown in the formulation here, but can also be omitted (similar to linear models in R). I like to include the 1 as it reminds me that I do in fact want an intercept. The next object, fixed_param
is the regression weights for the fixed effects, this must be the same length as fixed (or one larger if the 1 is not explicitly stated in the fixed
object). Next, cov_param
represents the mean, standard deviation, and type of variable from the fixed
object (must by “single” for single level regression). The cov_param
object must contain all variables except the intercept and any interactions.
The rest of the arguments are pretty straightforward, n
is the sample size, error_var
is the error variance, with_err_gen
is the distribution of the residuals, and finally in the function call itself, data_str
must be “single” in this instance to reflect a single level regression.
Finally, looking at the output from the above call:
## X.Intercept. act diff numCourse act.numCourse Fbeta
## 1 1 -7.45234915 2.4228645 -0.3275686 2.4411554 -21.650711
## 2 1 -0.72964305 0.9803266 -1.3505910 0.9854494 -2.694416
## 3 1 -3.67360588 -1.7859986 -3.2554373 11.9591937 -1.956065
## 4 1 -2.49082346 -3.4102050 6.1164966 -15.2351133 -20.435987
## 5 1 0.02477098 1.0935405 4.4918000 0.1112663 19.136457
## 6 1 -1.65048518 -3.3462656 -2.2954946 3.7886799 -8.405078
## err sim_data ID
## 1 1.2730017 -20.377710 1
## 2 1.0484233 -1.645992 2
## 3 -1.3564645 -3.312530 3
## 4 3.4826562 -16.953331 4
## 5 0.3500679 19.486525 5
## 6 1.3636882 -7.041390 6
As can be seen from the data itself, the first 5 columns represent the raw data used in the simulation, the column labeled “Fbeta” is the matrix multiplication of the design matrix (first 5 columns in this case) by the fixed_param
object above (the desired values for the fixed effects). The “err” column is the simulated errors, the column labeled “sim_data” is the simulated data (taking “Fbeta” + “err”), and lastly an ID variable reflecting the individuals.
You could then use simple regression on these data to see how the simulation went:
##
## Call:
## lm(formula = sim_data ~ 1 + act + diff + numCourse + act:numCourse,
## data = temp_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2844 -1.1770 -0.0531 1.0707 4.5384
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.01895 0.14301 14.12 <2e-16 ***
## act 3.99541 0.03472 115.06 <2e-16 ***
## diff 0.88918 0.05270 16.87 <2e-16 ***
## numCourse 3.53002 0.04626 76.31 <2e-16 ***
## act:numCourse 1.99807 0.01226 163.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.733 on 145 degrees of freedom
## Multiple R-squared: 0.997, Adjusted R-squared: 0.9969
## F-statistic: 1.206e+04 on 4 and 145 DF, p-value: < 2.2e-16
To add a factor, categorical, or ordinal variable, just append one of the following to the end of the variable name in the fixed
object: “.f”, “.c”, “.o”, "_f“,”_c“, or”_o“. These indicate the variable is a discrete variable of some sort. By default, all variables with a”.f" or “.c” will be treated as factors and expanded using dummy coding. If “.o” is used in the variable names, the variable will be discrete but treated as continuous in the simulation (i.e. only a single fixed effect variable). See below for an example.
fixed <- ~ 1 + act_o + diff.o + numCourse_f + act_o:numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2, 0.5, 0.02, -0.6, -0.1)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5),
var_type = c('single', 'single', "single"),
opts = list(list(replace = TRUE), list(replace = TRUE),
list(replace = TRUE)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
fact_vars = fact_vars)
The next thing to add is a new object called fact_vars
. This object must be a list that contains numlevels
and var_type
. Other optional features will be added in the future to increase functionality - such as including value labels, user specified probabilities, etc. Once these are passed into the sim.reg()
function, the simulated data now looks like the following.
## X.Intercept. act_o diff.o factor.numCourse_f.2 factor.numCourse_f.3
## 1 1 25 1 0 0
## 2 1 17 7 0 1
## 3 1 7 6 0 1
## 4 1 21 4 0 0
## 5 1 29 8 0 0
## 6 1 10 2 1 0
## factor.numCourse_f.4 factor.numCourse_f.5 act_o.factor.numCourse_f.2
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 1 0
## 5 0 1 0
## 6 0 0 10
## act_o.factor.numCourse_f.3 act_o.factor.numCourse_f.4
## 1 0 25
## 2 17 0
## 3 7 0
## 4 0 0
## 5 0 0
## 6 0 0
## act_o.factor.numCourse_f.5 act_o1 diff.o1 numCourse_f Fbeta err
## 1 0 25 1 4 11.15 2.22511666
## 2 0 17 7 3 19.54 2.88599017
## 3 0 7 6 3 9.14 0.01653154
## 4 21 21 4 5 20.70 -0.98233954
## 5 29 29 8 5 28.70 0.06788205
## 6 0 10 2 2 16.30 0.24663623
## sim_data ID
## 1 13.375117 1
## 2 22.425990 2
## 3 9.156532 3
## 4 19.717660 4
## 5 28.767882 5
## 6 16.546636 6
The distribution of covariates can also be changed to any R distribution function (e.g. rnorm
or rt
). These are called via the cov_param
argument using dist_fun
. Any additional parameters besides sample size are called via the optional opts
argument within cov_param
. The distributions need not be the same for each covariate. Below is an example generating two covariates using two different distributions and correlating them.
fixed <- ~ 1 + act + diff + numCourse_o + act:numCourse_o
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rt', 'rgamma'),
var_type = c("single", "single"),
opts = list(list(df = 5),
list(shape = 2)))
fact_vars <- list(numlevels = 5, var_type = "single",
opts = list(list(replace = TRUE)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
cor_vars <- 0.6
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n,
error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
cor_vars = cor_vars, fact_vars = fact_vars)
cor(temp_single_o[, 2:3])
## act diff
## act 1.0000000 0.6105847
## diff 0.6105847 1.0000000
The skewness and kurtosis can be ways to ensure that these distributions are actually generated as desired. For example, the skewness and kurtosis for the rt
distribution are: -1.1731523 and 4.0197038. Similarly done for the rgamma
distribution: -1.239144 and 5.5475066.
It is also possible to build in heterogeneity of variance into the simulation design. This is done through two arguments, homogeneity
and heterogeneity_var
. Below is a possible specification within a single level framework. By specifying homogeneity = FALSE
, this indicates that some level of heterogeneity is desired. Finally, by passing a variable name as a character string to the heterogeneity_var
argument, this is the variable in which heterogeneity in the residuals depends on (i.e. this variable is used to group variables).
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
cov_param = list(list(mean = 0, sd = 4),
list(mean = 0, sd = 3),
list(mean = 0, sd = 3)))
n <- 1500
error_var <- c(3, 50, 10)
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, error_var = error_var,
with_err_gen = with_err_gen,
data_str = "single",
homogeneity = FALSE, heterogeneity_var = 'diff')
The outcome of the heterogeneity from the simulation is best show with a figure.
As you can see from the figure, there is an increase in the error variance in the middle of the variable “diff” and smaller amounts of variance in the tails. What was done internally was that the variable “diff” was grouped into 3 groups (based on the length of the argument error_var
) and the variance values passed from error_var
are used to create the heterogeneity. In many instances, heterogeneity is thought of with regard to groups. This can be shown with the following:
fixed <- ~ 1 + act_o + diff.o + numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5),
var_type = c('single', 'single', "single"),
opts = list(list(replace = TRUE), list(replace = TRUE),
list(replace = TRUE)))
n <- 150
error_var <- c(3, 10, 20, 5, 4)
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
fact_vars = fact_vars,
homogeneity = FALSE, heterogeneity_var = 'numCourse_f')
Showing a plot of the outcome.
This package currently supports the simulation of two-level nested or two-level longitudinal models. A few additional arguments are needed to do these models but much is the same.
fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("level1", "level2"),
opts = list(list(mean = 0, sd = 1.5),
list(mean = 0, sd = 4)))
n <- 150
p <- 30
error_var <- 4
data_str <- "long"
temp_long <- sim_reg(fixed = fixed, random = random,
fixed_param = fixed_param,
random_param = random_param, cov_param = cov_param,
k = NULL, n = n, p = p,
error_var = error_var, with_err_gen = "rnorm",
data_str = data_str, unbal = list(level2 = FALSE))
Highlighting the new agruments needed, the first is a one sided formula random
. This specifies which terms should be specified above. Related to random
, random_param
is a list of characteristics when simulation the random effects. This list must contain two named components: random_var
and rand_gen
. These two components represent the variance of the terms specified in the one sided random
formula and rand_gen
which represents a quoted distribution function (e.g. ‘rnorm’) to simulate the random effects. The random_var
component must be the same length as the number of terms in random
.
The optional named arguments to the random_param
argument include ther
, ther_sim
, and cor_vars
. First, cor_vars
is a vector of correlations between random effects, the default is no correlation between random effects. The argument ther
represents a vector of the theoretical mean and standard deviation of the random effects. This can be used to center simulated values at 0 or scale them to have a standard deviation of 1. The following standardization formula is used: \(z = \frac{X - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the values specified within the ther
argument. The default values are 0 and 1 for \(\mu\) and \(\sigma\) respectively so that no scaling is performed. The ther_sim
argument does the same process as ther
, but the \(\mu\) and \(\sigma\) values are empirically drawn by sampling one million values from the rand_gen
specified. Lastly, depending on the rand_gen
function specified, additional arguments may be needed for that distribution function and can be specified as well. For example, if rand_gen = 'rchisq'
, including df = 1
would simulate a chi-square distribution with one degree of freedom.
The specification of the covariates is still done with the cov_param
argument, however, now the var_type
portion of cov_param
must be either “level1” or “level2” to represent either level 1 or level 2 variables respectively. Note that the time variable is not included in the cov_param
argument and is automatically specified as discrete starting from 0. In the future, differing time scales will be expanded.
The other new terms needed are straightforward, p
is the within cluster sample size (i.e. how many repeated measurements) and data_str
is now “long” for longitudinal data.
The simulated data now look like the following:
## X.Intercept. time diff act time.act b0 b1
## 1 1 0 0.2577159 2.94335 0.000000 -1.638033 -0.07811534
## 2 1 1 -1.0729046 2.94335 2.943350 -1.638033 -0.07811534
## 3 1 2 2.6141752 2.94335 5.886699 -1.638033 -0.07811534
## 4 1 3 -1.5084224 2.94335 8.830049 -1.638033 -0.07811534
## 5 1 4 0.6697206 2.94335 11.773399 -1.638033 -0.07811534
## 6 1 5 0.3252211 2.94335 14.716748 -1.638033 -0.07811534
## b2 Fbeta randEff err sim_data withinID clustID
## 1 0.8934488 12.31600 -1.4077766 0.01334516 10.92157 1 1
## 2 0.8934488 26.93572 -2.6747332 -1.42888701 22.83210 2 1
## 3 0.8934488 71.66165 0.5413685 2.86304449 75.06606 3 1
## 4 0.8934488 69.52951 -3.2200768 0.60372346 66.91316 4 1
## 5 0.8934488 105.20182 -1.3521329 2.53227041 106.38196 5 1
## 6 0.8934488 125.73827 -1.7380409 -1.16330696 122.83692 6 1
This structure is very similar to before, except now there are columns for the specific random effects as denoted by the lower case b’s, a column reflecting the contribution for the random effects combined and lastly now two ID variables, one reflecting the within cluster ID and another being the cluster ID.
Checking how the simulation worked with the following:
## Loading required package: Matrix
## Linear mixed model fit by REML ['lmerMod']
## Formula: sim_data ~ 1 + time + diff + act + time:act + (1 + time + diff |
## clustID)
## Data: temp_long
## REML criterion at convergence: 21375.53
## Random effects:
## Groups Name Std.Dev. Corr
## clustID (Intercept) 2.732
## time 2.008 0.00
## diff 1.292 -0.03 0.14
## Residual 2.024
## Number of obs: 4500, groups: clustID, 150
## Fixed Effects:
## (Intercept) time diff act time:act
## 4.307 1.895 5.902 2.310 6.967
One note when looking at the output is that standard deviations are given, not variances as inputted into the simulation function.
For longitudinal designs, the time variable (linear slope) is generated going from 0 to number of observations minus 1. Therefore, if there are 5 time points (measurement occasions), the time variable by default will be a sequence from 0 to 4 indexing by 1. An optional named argument within cov_param
, named time_var
, allows the user to specify the time scale. Below is an example with 5 time points. In this example the first three measurement occasions may be separated by 6 months with the last two occasions being measured 2 years apart and the time variable is represented on the time scale.
fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("level1", "level2"),
time_var = c(0, 0.5, 1, 3, 5),
opts = list(list(mean = 0, sd = 1.5),
list(mean = 0, sd = 4)))
n <- 150
p <- 5
error_var <- 4
with_err_gen <- 'rnorm'
data_str <- "long"
temp_long <- sim_reg(fixed, random, random3 = NULL, fixed_param,
random_param, random_param3 = NULL,
cov_param, k = NULL, n, p, error_var, with_err_gen,
data_str = data_str)
head(temp_long)
## X.Intercept. time diff act time.act b0 b1
## 1 1 0.0 0.09742512 4.519589 0.000000 -4.1668851 -0.4592707
## 2 1 0.5 -0.75493604 4.519589 2.259795 -4.1668851 -0.4592707
## 3 1 1.0 0.37881711 4.519589 4.519589 -4.1668851 -0.4592707
## 4 1 3.0 -0.16605870 4.519589 13.558768 -4.1668851 -0.4592707
## 5 1 5.0 -2.77293590 4.519589 22.597947 -4.1668851 -0.4592707
## 6 1 0.0 -0.29929534 -2.436166 0.000000 -0.9243513 0.7249551
## b2 Fbeta randEff err sim_data withinID clustID
## 1 -0.9631169 14.979606 -4.260717 -0.16751084 10.551378 1 1
## 2 -0.9631169 26.684002 -3.669429 1.22952150 24.244095 2 1
## 3 -0.9631169 50.305083 -4.991001 -0.02080311 45.293279 3 1
## 4 -0.9631169 114.310079 -5.384763 0.75638418 109.681700 4 1
## 5 -0.9631169 165.943067 -3.792577 1.60892101 163.759410 5 1
## 6 2.0776312 -3.398954 -1.546177 0.78081168 -4.164319 1 2
A similar framework can be used for cross-sectional data, such as when students are nested within schools. The only thing that would need to be changed from the longitudinal portion above is the data_str
argument from “long” to “cross” for cross-sectional data.
One last note about cross-sectional data. As a default for longitudinal data, time is always considered to be in the model in some fashion (as noted above when talking about the cov_param
object). Therefore, when specifying the cov_param
and fact_vars
objects, ensure information about all variables is there.
The same framework to specify categorical data for single level regression designs is used for nested designs. Just asign the “.f”, “.c”, or “.o” (or "_f“,”_c“, or”_o") to the end of the name and the function will take care of the rest.
Lastly, for any bugs or feature requests go to the github repository to create post an issue. I will work to resolve them as quickly as possible. See simglm github repository