In this vignette, we show how the simulated data included as an example dataset in simsum
has been generated.
Say we want to run a simulation study in which we want to compare the sensitivity of parametric and semiparametric survival models on relative risk estimates.
We simulate an hypotetical trial with a binary treatment. We fix the log-treatment effect to \(-0.50\), and we generate a treatment indicator variable for each simulated individual via a \(Binom(1, 0.5)\) random variable. We simulate two different sample sizes (50 and 250 individuals) and we assume two different baseline hazard functions: exponential with scale parameter \(\lambda = 0.5\), and Weibull with scale parameter \(\lambda = 0.5\) and shape parameter \(\gamma = 1.5\). Finally, we apply administrative censoring at time \(t = 5\).
exp_basehaz <- function(t, lambda = 0.5) lambda * 1 * t^0
exp_weibull <- function(t, lambda = 0.5, gamma = 1.5) lambda * gamma * t^(gamma - 1)
curve(exp_basehaz, from = 0, to = 5, lty = 1, ylim = c(0, 2), ylab = expression(h[0](t)), xlab = "Follow-up time t")
curve(exp_weibull, from = 0, to = 5, lty = 2, add = TRUE)
legend(x = "topleft", lty = 1:2, legend = c("Exponential baseline hazard", "Weibull baseline hazard"), bty = "n")
The survival times are estimated using the approach of Bender et al. (2005), based on drawing from a \(U(0, 1)\) random variable and applying the following transformations:
for an exponential baseline hazard, the survival time \(t\) is simulated as: \[t = -\frac{log(U)}{\lambda \exp(\beta ^ T X)}\]
for a Weibull baseline hazard, the survival time \(t\) is simulated as: \[t = \left(-\frac{log(U)}{\lambda \exp(\beta ^ T X)}\right) ^ {1 / \gamma}\]
The R function to simulate a dataset for our simulation study is defined as follows:
simulate_data <- function(dataset, n, baseline, params = list(), coveff = -0.50) {
# Simulate treatment indicator variable
x <- rbinom(n = n, size = 1, prob = 0.5)
# Draw from a U(0,1) random variable
u <- runif(n)
# Simulate survival times depending on the baseline hazard
if (baseline == "Exponential") {
t <- -log(u) / (params$lambda * exp(x * coveff))
} else {
t <- (-log(u) / (params$lambda * exp(x * coveff)))^(1 / params$gamma)
}
# Winsorising tiny values for t (smaller than one day on a yearly-scale, e.g. 1 / 365.242), and adding a tiny amount of white noise not to have too many concurrent values
t <- ifelse(t < 1 / 365.242, 1 / 365.242, t)
t[t == 1 / 365.242] <- t[t == 1 / 365.242] + rnorm(length(t[t == 1 / 365.242]), mean = 0, sd = 1e-4)
# ...and make sure that the resulting value is positive
t <- abs(t)
# Make event indicator variable applying administrative censoring at t = 5
d <- as.numeric(t < 5)
t <- pmin(t, 5)
# Return a data.frame
data.frame(dataset = dataset, x = x, t = t, d = d, n = n, baseline = baseline, stringsAsFactors = FALSE)
}
We compare the Cox model (Cox, 1972) with a fully parametric survival model assuming an exponential baseline hazard and a flexible parametric model with 2 degrees of freedom for modelling the baseline hazard (Royston and Parmar, 2002). The Cox model can be fit via the coxph
function from the survival
package, the exponential model can be fit via the phreg
function from the eha
package, and the Royston-Parmar model can be fixed via the stpm2
function from the rstpm2
package.
Say we are interested in the following performance measures:
We are primarily interested in bias, and assume that the variance of the estimated log-treatment effect is \(0.1\). The Monte Carlo standard error for the bias is:
\[\text{MCSE} = \sqrt{\frac{\text{variance}}{\# \text{simulations}}}\]
Aiming for a Monte Carlo standard error of 0.01 on the estimated bias, we would require \(1,000\) replications.
The Monte Carlo standard error for converage is:
\[\text{MCSE} = \sqrt{\frac{\text{coverage} \times (1 - \text{coverage})}{\# \text{simulations}}}\]
This Monte Carlo standard error is maximised for a coverage = \(0.5\). In that setting, the Monte Carlo standard error with \(1,000\) replications would be \(0.01581139\), which is deemed to be acceptable.
Therefore, we will run \(1,000\) replications of this simulation study.
We generate \(1,000\) datasets for each data-generating mechanism.
First, we set a random seed for reproducibility:
Then, we simulate the data:
data <- list()
data[["n = 50, baseline = Exp"]] <- lapply(
X = 1:1000,
FUN = simulate_data,
n = 50,
baseline = "Exponential",
params = list(lambda = 0.5)
)
data[["n = 250, baseline = Exp"]] <- lapply(
X = 1:1000,
FUN = simulate_data,
n = 250,
baseline = "Exponential",
params = list(lambda = 0.5)
)
data[["n = 50, baseline = Wei"]] <- lapply(
X = 1:1000,
FUN = simulate_data,
n = 50,
baseline = "Weibull",
params = list(lambda = 0.5, gamma = 1.5)
)
data[["n = 250, baseline = Wei"]] <- lapply(
X = 1:1000,
FUN = simulate_data,
n = 250,
baseline = "Weibull",
params = list(lambda = 0.5, gamma = 1.5)
)
We define a function to fit the models of interest:
library(survival)
library(rstpm2)
#> Loading required package: splines
#>
#> Attaching package: 'rstpm2'
#> The following object is masked from 'package:survival':
#>
#> colon
library(eha)
fit_models <- function(data, model) {
# Fit model
if (model == "Cox") {
fit <- survival::coxph(Surv(t, d) ~ x, data = data)
} else if (model == "RP(2)") {
fit <- rstpm2::stpm2(Surv(t, d) ~ x, data = data, df = 2)
} else {
fit <- eha::phreg(Surv(t, d) ~ x, data = data, dist = "weibull", shape = 1)
}
# Return relevant coefficients
data.frame(
dataset = unique(data$dataset),
n = unique(data$n),
baseline = unique(data$baseline),
theta = coef(fit)["x"],
se = sqrt(ifelse(model == "Exp", fit$var["x", "x"], vcov(fit)["x", "x"])),
model = model,
stringsAsFactors = FALSE,
row.names = NULL
)
}
We now run the models for each simulated dataset:
results <- list()
results[["n = 50, baseline = Exp, model = Cox"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Exp"]],
FUN = fit_models,
model = "Cox"
)
)
results[["n = 250, baseline = Exp, model = Cox"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Exp"]],
FUN = fit_models,
model = "Cox"
)
)
results[["n = 50, baseline = Wei, model = Cox"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Wei"]],
FUN = fit_models,
model = "Cox"
)
)
results[["n = 250, baseline = Wei, model = Cox"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Wei"]],
FUN = fit_models,
model = "Cox"
)
)
results[["n = 50, baseline = Exp, model = Exp"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Exp"]],
FUN = fit_models,
model = "Exp"
)
)
results[["n = 250, baseline = Exp, model = Exp"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Exp"]],
FUN = fit_models,
model = "Exp"
)
)
results[["n = 50, baseline = Wei, model = Exp"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Wei"]],
FUN = fit_models,
model = "Exp"
)
)
results[["n = 250, baseline = Wei, model = Exp"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Wei"]],
FUN = fit_models,
model = "Exp"
)
)
results[["n = 50, baseline = Exp, model = RP(2)"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Exp"]],
FUN = fit_models,
model = "RP(2)"
)
)
results[["n = 250, baseline = Exp, model = RP(2)"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Exp"]],
FUN = fit_models,
model = "RP(2)"
)
)
results[["n = 50, baseline = Wei, model = RP(2)"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 50, baseline = Wei"]],
FUN = fit_models,
model = "RP(2)"
)
)
results[["n = 250, baseline = Wei, model = RP(2)"]] <- do.call(
rbind.data.frame,
lapply(
X = data[["n = 250, baseline = Wei"]],
FUN = fit_models,
model = "RP(2)"
)
)
We save the final results, that will be included as an example in the R package rsimsum
.
Finally, we obtain summary statistics by calling the simsum
function:
library(rsimsum)
s <- rsimsum::simsum(data = relhaz, estvarname = "theta", se = "se", true = -0.50, methodvar = "model", ref = "Cox", mcse = TRUE, by = c("n", "baseline"))
s
#>
#> Call:
#> rsimsum::simsum(data = relhaz, estvarname = "theta", true = -0.5,
#> se = "se", methodvar = "model", ref = "Cox", by = c("n",
#> "baseline"), mcse = TRUE)
#>
#> Method variable: model
#> Unique methods: Cox, Exp, RP(2)
#> Reference method: Cox
#>
#> By factors: n, baseline
#>
#> Monte Carlo standard errors were computed.
summary(s)
#>
#> Call:
#> rsimsum::simsum(data = relhaz, estvarname = "theta", true = -0.5,
#> se = "se", methodvar = "model", ref = "Cox", by = c("n",
#> "baseline"), mcse = TRUE)
#>
#> Method variable: model
#> Unique methods: Cox, Exp, RP(2)
#> Reference method: Cox
#> By factors: n, baseline
#>
#> Summary statistics:
#>
#> Method = Cox, n = 250, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5004 NA NA NA
#> Median point estimate -0.4969 NA NA NA
#> Average standard error 0.0194 NA NA NA
#> Median standard error 0.0194 NA NA NA
#> Bias in point estimate -0.0004 0.0043 -0.0088 0.0081
#> Empirical standard error 0.1363 0.0030 0.1303 0.1422
#> Mean squared error 0.0185 0.0009 0.0168 0.0203
#> % gain in precision relative to method Cox 1.0000 0.0000 1.0000 1.0000
#> Model-based standard error 0.1394 0.0001 0.1393 0.1396
#> Relative % error in standard error 2.3293 2.2894 -2.1578 6.8164
#> Coverage of nominal 95% CI 0.9510 0.0068 0.9376 0.9644
#> Bias corrected coverage of nominal 95% CI 0.9520 0.0068 0.9388 0.9652
#> Power of 5% level test 0.9570 0.0064 0.9444 0.9696
#>
#> Method = Exp, n = 250, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.4996 NA NA NA
#> Median point estimate -0.5004 NA NA NA
#> Average standard error 0.0191 NA NA NA
#> Median standard error 0.0190 NA NA NA
#> Bias in point estimate 0.0004 0.0043 -0.0080 0.0088
#> Empirical standard error 0.1350 0.0030 0.1291 0.1409
#> Mean squared error 0.0182 0.0009 0.0165 0.0199
#> % gain in precision relative to method Cox 1.0192 0.0093 1.0009 1.0375
#> Model-based standard error 0.1381 0.0001 0.1379 0.1382
#> Relative % error in standard error 2.2813 2.2882 -2.2035 6.7661
#> Coverage of nominal 95% CI 0.9540 0.0066 0.9410 0.9670
#> Bias corrected coverage of nominal 95% CI 0.9540 0.0066 0.9410 0.9670
#> Power of 5% level test 0.9570 0.0064 0.9444 0.9696
#>
#> Method = RP(2), n = 250, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5015 NA NA NA
#> Median point estimate -0.5000 NA NA NA
#> Average standard error 0.0194 NA NA NA
#> Median standard error 0.0193 NA NA NA
#> Bias in point estimate -0.0015 0.0043 -0.0100 0.0069
#> Empirical standard error 0.1363 0.0030 0.1303 0.1423
#> Mean squared error 0.0186 0.0009 0.0168 0.0203
#> % gain in precision relative to method Cox 0.9996 0.0034 0.9929 1.0063
#> Model-based standard error 0.1392 0.0001 0.1391 0.1394
#> Relative % error in standard error 2.1570 2.2855 -2.3225 6.6365
#> Coverage of nominal 95% CI 0.9520 0.0068 0.9388 0.9652
#> Bias corrected coverage of nominal 95% CI 0.9520 0.0068 0.9388 0.9652
#> Power of 5% level test 0.9590 0.0063 0.9467 0.9713
#>
#> Method = Cox, n = 50, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5255 NA NA NA
#> Median point estimate -0.5226 NA NA NA
#> Average standard error 0.1016 NA NA NA
#> Median standard error 0.1003 NA NA NA
#> Bias in point estimate -0.0255 0.0104 -0.0458 -0.0052
#> Empirical standard error 0.3277 0.0073 0.3133 0.3421
#> Mean squared error 0.1079 0.0047 0.0988 0.1170
#> % gain in precision relative to method Cox 1.0000 0.0000 1.0000 1.0000
#> Model-based standard error 0.3188 0.0004 0.3180 0.3197
#> Relative % error in standard error -2.7072 2.1801 -6.9802 1.5658
#> Coverage of nominal 95% CI 0.9410 0.0075 0.9264 0.9556
#> Bias corrected coverage of nominal 95% CI 0.9400 0.0075 0.9253 0.9547
#> Power of 5% level test 0.3740 0.0153 0.3440 0.4040
#>
#> Method = Exp, n = 50, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5223 NA NA NA
#> Median point estimate -0.5181 NA NA NA
#> Average standard error 0.0977 NA NA NA
#> Median standard error 0.0970 NA NA NA
#> Bias in point estimate -0.0223 0.0101 -0.0422 -0.0024
#> Empirical standard error 0.3208 0.0072 0.3068 0.3349
#> Mean squared error 0.1033 0.0045 0.0945 0.1122
#> % gain in precision relative to method Cox 1.0431 0.0126 1.0183 1.0679
#> Model-based standard error 0.3126 0.0004 0.3118 0.3133
#> Relative % error in standard error -2.5836 2.1821 -6.8603 1.6932
#> Coverage of nominal 95% CI 0.9370 0.0077 0.9219 0.9521
#> Bias corrected coverage of nominal 95% CI 0.9390 0.0076 0.9242 0.9538
#> Power of 5% level test 0.3850 0.0154 0.3548 0.4152
#>
#> Method = RP(2), n = 50, baseline = Exponential
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5337 NA NA NA
#> Median point estimate -0.5269 NA NA NA
#> Average standard error 0.1005 NA NA NA
#> Median standard error 0.0993 NA NA NA
#> Bias in point estimate -0.0337 0.0105 -0.0543 -0.0131
#> Empirical standard error 0.3323 0.0074 0.3177 0.3468
#> Mean squared error 0.1114 0.0049 0.1018 0.1210
#> % gain in precision relative to method Cox 0.9726 0.0066 0.9597 0.9855
#> Model-based standard error 0.3171 0.0004 0.3162 0.3179
#> Relative % error in standard error -4.5711 2.1383 -8.7620 -0.3802
#> Coverage of nominal 95% CI 0.9370 0.0077 0.9219 0.9521
#> Bias corrected coverage of nominal 95% CI 0.9390 0.0076 0.9242 0.9538
#> Power of 5% level test 0.3920 0.0154 0.3617 0.4223
#>
#> Method = Cox, n = 250, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5057 NA NA NA
#> Median point estimate -0.5031 NA NA NA
#> Average standard error 0.0173 NA NA NA
#> Median standard error 0.0172 NA NA NA
#> Bias in point estimate -0.0057 0.0040 -0.0136 0.0022
#> Empirical standard error 0.1274 0.0029 0.1219 0.1330
#> Mean squared error 0.0163 0.0008 0.0147 0.0178
#> % gain in precision relative to method Cox 1.0000 0.0000 1.0000 1.0000
#> Model-based standard error 0.1316 0.0001 0.1315 0.1317
#> Relative % error in standard error 3.2717 2.3104 -1.2566 7.8001
#> Coverage of nominal 95% CI 0.9440 0.0073 0.9297 0.9583
#> Bias corrected coverage of nominal 95% CI 0.9500 0.0069 0.9365 0.9635
#> Power of 5% level test 0.9740 0.0050 0.9641 0.9839
#>
#> Method = Exp, n = 250, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.3522 NA NA NA
#> Median point estimate -0.3500 NA NA NA
#> Average standard error 0.0164 NA NA NA
#> Median standard error 0.0163 NA NA NA
#> Bias in point estimate 0.1478 0.0028 0.1423 0.1533
#> Empirical standard error 0.0886 0.0020 0.0847 0.0925
#> Mean squared error 0.0297 0.0009 0.0279 0.0314
#> % gain in precision relative to method Cox 2.0701 0.0381 1.9954 2.1448
#> Model-based standard error 0.1280 0.0000 0.1279 0.1280
#> Relative % error in standard error 44.4489 3.2309 38.1165 50.7813
#> Coverage of nominal 95% CI 0.8900 0.0099 0.8706 0.9094
#> Bias corrected coverage of nominal 95% CI 0.9910 0.0030 0.9851 0.9969
#> Power of 5% level test 0.8870 0.0100 0.8674 0.9066
#>
#> Method = RP(2), n = 250, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5072 NA NA NA
#> Median point estimate -0.5055 NA NA NA
#> Average standard error 0.0171 NA NA NA
#> Median standard error 0.0171 NA NA NA
#> Bias in point estimate -0.0072 0.0040 -0.0150 0.0007
#> Empirical standard error 0.1267 0.0028 0.1212 0.1323
#> Mean squared error 0.0161 0.0008 0.0146 0.0176
#> % gain in precision relative to method Cox 1.0116 0.0067 0.9984 1.0248
#> Model-based standard error 0.1309 0.0001 0.1308 0.1310
#> Relative % error in standard error 3.3058 2.3110 -1.2238 7.8353
#> Coverage of nominal 95% CI 0.9470 0.0071 0.9331 0.9609
#> Bias corrected coverage of nominal 95% CI 0.9500 0.0069 0.9365 0.9635
#> Power of 5% level test 0.9740 0.0050 0.9641 0.9839
#>
#> Method = Cox, n = 50, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5123 NA NA NA
#> Median point estimate -0.5133 NA NA NA
#> Average standard error 0.0926 NA NA NA
#> Median standard error 0.0900 NA NA NA
#> Bias in point estimate -0.0123 0.0100 -0.0318 0.0073
#> Empirical standard error 0.3158 0.0071 0.3020 0.3297
#> Mean squared error 0.0998 0.0049 0.0903 0.1094
#> % gain in precision relative to method Cox 1.0000 0.0000 1.0000 1.0000
#> Model-based standard error 0.3043 0.0005 0.3035 0.3052
#> Relative % error in standard error -3.6377 2.1601 -7.8714 0.5960
#> Coverage of nominal 95% CI 0.9450 0.0072 0.9309 0.9591
#> Bias corrected coverage of nominal 95% CI 0.9440 0.0073 0.9297 0.9583
#> Power of 5% level test 0.3820 0.0154 0.3519 0.4121
#>
#> Method = Exp, n = 50, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.3492 NA NA NA
#> Median point estimate -0.3506 NA NA NA
#> Average standard error 0.0834 NA NA NA
#> Median standard error 0.0825 NA NA NA
#> Bias in point estimate 0.1508 0.0066 0.1379 0.1636
#> Empirical standard error 0.2073 0.0046 0.1982 0.2163
#> Mean squared error 0.0656 0.0028 0.0602 0.0711
#> % gain in precision relative to method Cox 2.3224 0.0503 2.2238 2.4209
#> Model-based standard error 0.2888 0.0002 0.2884 0.2891
#> Relative % error in standard error 39.3334 3.1175 33.2231 45.4436
#> Coverage of nominal 95% CI 0.9740 0.0050 0.9641 0.9839
#> Bias corrected coverage of nominal 95% CI 0.9950 0.0022 0.9906 0.9994
#> Power of 5% level test 0.1450 0.0111 0.1232 0.1668
#>
#> Method = RP(2), n = 50, baseline = Weibull
#> Estimate MCSE Lower 2.5% Upper 97.5%
#> Simulations with non-missing estimates/SEs 1000.0000 NA NA NA
#> Average point estimate -0.5187 NA NA NA
#> Median point estimate -0.5120 NA NA NA
#> Average standard error 0.0894 NA NA NA
#> Median standard error 0.0873 NA NA NA
#> Bias in point estimate -0.0187 0.0100 -0.0382 0.0009
#> Empirical standard error 0.3155 0.0071 0.3017 0.3294
#> Mean squared error 0.0998 0.0047 0.0905 0.1091
#> % gain in precision relative to method Cox 1.0020 0.0110 0.9804 1.0237
#> Model-based standard error 0.2989 0.0004 0.2982 0.2997
#> Relative % error in standard error -5.2531 2.1230 -9.4140 -1.0922
#> Coverage of nominal 95% CI 0.9380 0.0076 0.9231 0.9529
#> Bias corrected coverage of nominal 95% CI 0.9430 0.0073 0.9286 0.9574
#> Power of 5% level test 0.4140 0.0156 0.3835 0.4445
With this vignette we showed how to simulate survival data and run a small, simple simulation study.
Cox D.R. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological), 1972, 34(2):187-220
Royston P. and Parmar M.K. Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine, 2002, 21(15):2175-2197
Bender R., Augustin T., and Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine, 2005, 24(11):1713-1723