rprev User Guide

S J Lax and S E Lacy



In this vignette we are going to demonstrate how predictions of disease prevalence are calculated from a synthetic patient data set using rprev. We define the term prevalence as “point estimates of disease prevalence at a specific index date”. This is typically estimated from incidence and modelled survival probability using registry data. Following the methodology of (S. Crouch et al. 2014), which has been employed in published work (Roman et al. 2016, A. Smith et al. (2015)), we calculate prevalence for n years, with n defined as the number of years prior to the index date for which we have included incident cases in the prevalent pool. In other words, prevalence estimates consider only those diagnosed in the preceding n years. Therefore, the higher the value of n, the more incident cases are included. If n is less than or equal to the number of registry years, then the prevalence estimate is made by simply counting those remaining in the prevalent pool at the index date. If n is greater than the number of registry years for which we have real patient data, then the contribution for the years preceding the registry back to n years is estimated using Monte Carlo simulation. This is illustrated in the diagram below:

Estimating prevalence from registry data using rprev.

Estimating prevalence from registry data using rprev.

Throughout we assume that a patient diagnosed with a particular disease remains prevalent while they are still alive. If time to death is replaced with “time to removal from prevalent pool” in the survival modelling, then estimates corresponding to this notion of prevalence can be made.

As prevalence is estimated for the simulated years using probability of survival from the disease modelled on the available registry data, an appropriate value of n should be selected. If n is too low, the prevalence prediction will be an underestimate, as long-surviving cases may exist that were incident before n years preceding the index date. If n is too high relative to the length of the registry, the longer the simulation and the greater the reliance on extrapolation outside the registry data. If the disease of interest is rapidly fatal or the registry is sufficiently long that survival is accurately modelled on the available data, larger values of n might be appropriate. It is necessary to keep in mind that, as survival is presently modelled using Weibull regression, some long-surviving cases may appear to be immortal. To overcome the problem of extrapolating the survival of these patients, we recommend implementing our so-called cure model, which we will discuss in more detail later in this vignette. In short, the higher the value of n, the more accurate the prevalence prediction, as long as survival can be adequately modelled over n years from the available registry data.

In the following sections we aim to take the user through the successive stages of the modelling process, outlining the correct use of the basic functions, illustrating how the diagnostics provided are used to check the validity of the assumptions required for simulating years of interest before the registry, and checking the agreement between observed and simulated prevalence estimates. First, we will inspect the consistency of incidence data between years of the registry and determine if it can be appropriately modelled as an homogeneous Poisson process. Second, we will look at the consistency of the survival data between years of the registry and check the adequacy of a parametric model of survival. We emphasise that the user should check that their registry data does meet the required assumptions and if not, should understand that estimates of prevalence made using simulation based on that data may not be correct. Then, we will also introduce the population survival data and explain how it integrates with patient survival to produce a cure model, where it is desired. Lastly, we will introduce our functions for estimating prevalence, demonstrate how to check the resulting estimates are reasonable, and extract the posterior distribution of a modelled covariate.

Basic setup and summary of data


prevsim is a data set that has been synthesised to resemble disease registry data. Incident cases are recorded from 2003-01-01 to 2013-01-30, and events occur between 2003-01-09 and 2015-03-17. It has 6 columns and is organised in a fashion typical to that found in real registry data sets. Patient data includes the date of both entry into the registry and last known event (be it death or a censored event time), survival time and an event indicator (status) along with both age and sex. The latter two are incorporated into the prevalence model at several levels, from being used as covariates in the survival modelling, and marking the incidence Poisson process (see S. Crouch et al. 2014). Currently rprev does not support the inclusion of additional variables in either the survival or incidence models, although this feature will be introduced in a later release.

##       time            status           age        sex    
##  Min.   :   1.0   Min.   :0.000   Min.   :36.43   0:508  
##  1st Qu.: 235.0   1st Qu.:0.000   1st Qu.:58.16   1:492  
##  Median : 893.5   Median :1.000   Median :64.56          
##  Mean   :1236.0   Mean   :0.558   Mean   :64.95          
##  3rd Qu.:1868.2   3rd Qu.:1.000   3rd Qu.:71.88          
##  Max.   :4444.0   Max.   :1.000   Max.   :95.32          
##   entrydate          eventdate        
##  Length:1000        Length:1000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  

The following Kaplan-Meier plot shows the survival probabilities associated with the included data set, prevsim, where, typical of many diseases, males have poorer survival outcomes than females. It also highlights that survival starts to level off after 2000 days. As our registry is short relative to the duration of survival from this disease, we do not have much information about long-term survival after this point in our patient population in order to base a model. In this scenario, instead of extrapolating survival (and thus risking the problem of “immortal” subjects), we recommend using general population survival data to model long-term survivors, and we refer to this as a cure model (S. Crouch et al. 2014). In this model, after a specified “cure time”, surviving subjects are considered to revert to the survival characteristics for someone of their age and sex in the general population. In our example, a patient is considered cured after five years survival with the disease.

Until the point in time that the cure model takes effect, survival is modelled on disease registry data; after this time, survival probability for the surviving cases is modelled using population mortality rates. rprev includes data from the Office of National Statistics (ONS) (see National Statistics, n.d., data licenced under the Open Government Licence v3.0) to describe UK mortality rates (in data set UKmortality), however, the user may supply their own population mortality data.

survf <- survfit(Surv(time, status) ~ sex, data=prevsim)
plot(survf, lwd=1, col = c("blue","red"), main="Survival stratified by sex", xlab="Days", 
legend(3000, 1, c("Males", "Females"), lty = 1, col=c("blue","red"))


The simulation of incident cases in the years for which registry data is not available currently assumes that the incidence process is homogeneous Poisson. The package includes several functions for inspecting the incidence data, as well as diagnosing the validity of this assumption.

The primary function is incidence, which calculates several summary statistics of the disease incidence. It requires a vector of entry dates into the registry, along with the size of the population under consideration. The registry under interest can be subset to only count incident cases from specific years of interest, useful if there are doubts about the accuracy of the data, for example during the first year of operation. This function also fits a spline smooth to the incidence rate data.

inc <- incidence(prevsim$entrydate, population_size=3.2e6, start='2005-09-01', num_reg_years=5)

This function returns an S3 object of class incidence with the following attributes.

## Cumulative incidence object with 5 years of data.
## Smooth fitted using 6 degrees of freedom.
## [1] "raw_incidence"     "ordered_diagnoses" "smooth"           
## [4] "index_dates"       "mean"              "pvals"            
## [7] "dof"

The summary.incidence method displays the most useful information about the incidence process of the disease of interest.

## Number of years of registry data: 5 
## Incidence
## ~~~~~~~~~
## Known incidence by year: 101 89 86 95 99 
## Diagnoses (time since registry began):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   448.0   940.5   922.4  1387.0  1822.0 
## p-values for over/under dispersion: 0.78305 0.21695 
## Fitted smooth:
## ~~~~~~~~~~~~~
## Call:
## smooth.spline(x = diags, y = seq(length(diags)), df = df)
## Smoothing Parameter  spar= 0.9582532  lambda= 0.01219356 (11 iterations)
## Equivalent Degrees of Freedom (Df): 5.999155
## Penalized Criterion: 3506.247
## GCV: 7.732843

The known yearly incidence rates are displayed starting at start date for num_reg_years, starting with the first year. For the simulated data set prevsim, there were 96 incident cases between 2005-09-01 and 2006-09-01. It can be concluded that the disease has around 100 new cases each year.

Below that is a summary of the cumulative diagnosis times (in days), useful as a quick check of the distribution of incident cases. The p-values of a simulation test indicating whether the yearly incidence estimates are under or over dispersed relative to a homogeneous Poisson process are also displayed. Inspection of the smoothed incidence function should reveal whether the problem is one of non-homogeneity (which may lead to inaccurate prevalence estimates) or of failure of the Poisson assumption (which may lead to inaccurate estimates of confidence intervals).

Another useful incidence statistic is the mean incidence rate per 100,000 within the study population, which is obtained from the mean attribute. Confidence intervals are provided at the specified level (default is 95%):

## $absolute
## [1] 94
## $per100K
## [1] 2.94
## $per100K.lower
## [1] 2.82
## $per100K.upper
## [1] 3.06

Alongside the p-values from the over/under dispersion test, rprev provides visual tools to assess the consistency of the incidence data with an homogeneous Poisson process.

plot.incidence displays a plot of the actual incidence rate as recorded in the registry, with the smooth overlaid. If incidence is an homogeneous Poisson process, both the smooth (green) and incidence process (red) should remain within the 95% confidence interval (dashed blue) and be evenly distributed about the mean (blue line).


Currently, the incidence process can only be modelled as a Poisson homogeneous process; if the diagnostics indicate this is not a suitable fit then the prevalence estimates (or their confidence intervals) will be unreliable. Future releases of rprev will allow for custom incidence processes.

Another function used to describe the incident cases is incidence_age_distribution, which simply plots the distribution of incident cases by age.

prevsim_r <- prevsim[prevsim$entrydate >= "2004-01-30", ]


Survival modelling

The second component to the prevalence estimation is the survival modelling process. Survival is currently modelled using a parametric Weibull model with age and sex as covariates. It is recommended that the user inspects the survival data for consistency with the assumptions of this model and for homogeneity:

First, it is always useful to plot the Kaplan-Meier estimator of the data, both as a whole and stratified by age to visually inspect for any inconsistencies:

km <- survfit(Surv(time, status) ~ 1, data=prevsim_r)
plot(km, lwd=2, col="blue", main="Overall Survival", xlab="Days", 
     ylab="Survival probability")

ages = c(55, 65, 75, 85, 100)
km2 <- survfit(Surv(time, status) ~ cut(age, breaks=ages), data=prevsim_r)
plot(km2, lwd=2, col=1:length(ages), main="Survival stratified by age", xlab="Days", 
     ylab="Survival probability")
legend("topright", legend=substring(names(km2$strata), 25, 32), lty = 1, 

It is also a useful diagnostic aid to plot the survival curve for each year of the registry to determine whether there is any inhomogeneity:

plot(km, lwd=2, col="blue", mark.time=F, conf.int=T, xlab="Days", 
     ylab="Survival probability")
num_reg_years <- 9
registry_years <- determine_registry_years(start='2004-01-30', 
       function(i) lines(survfit(Surv(time, status) ~ 1, 
                                 data=prevsim_r[prevsim_r$entrydate >= 
                                                          registry_years[i] & 
                                                          prevsim_r$entrydate < 
                                                          registry_years[i + 1], ]), 
                         mark.time = F, conf.int = F))

The effect of age on hazard can be visualised to determine if there are any non-proportional effects, by inspecting Schoenfeld residuals from a Cox model. This is easily done using the cox.zph function from the survival package.

cx <- coxph(Surv(time, status) ~ age, data=prevsim_r)
cxp <- survfit(cx, 
               newdata=data.frame(age=sapply(seq(length(ages) - 1), 
                                             function(i) mean(c(ages[i], ages[i + 1]))))) 
lines(cxp, lwd=2, col=1:length(ages), lty=2, mark.time=F)

An overall test of the proportional hazards assumption may be made.

##         rho chisq    p
## age -0.0142 0.102 0.75

Functional form of age

functional_form_age() is provided so that the user can further investigate if modelling the effect of age as linear is appropriate. This function fits a Cox model to the data with a restricted cubic spline with the specified degrees of freedom. The model is returned by the function as an rms::cph object, allowing the user to inspect the fit. The function can also plot relative hazard as a function of age with the plot_fit argument.

functional_form_age(Surv(time, status) ~ age, prevsim_r, df=4, plot_fit=T)

## Cox Proportional Hazards Model
## rms::cph(formula = myform, data = mydf, x = TRUE, y = TRUE, surv = TRUE, 
##     time.inc = 1)
##                     Model Tests       Discrimination    
##                                          Indexes        
## Obs       900    LR chi2     29.24    R2       0.032    
## Events    496    d.f.            3    Dxy      0.136    
## Center 1.8153    Pr(> chi2) 0.0000    g        0.266    
##                  Score chi2  30.92    gr       1.305    
##                  Pr(> chi2) 0.0000                      
##       Coef    S.E.   Wald Z Pr(>|Z|)
## age    0.0293 0.0189  1.55  0.1208  
## age'  -0.0252 0.0485 -0.52  0.6037  
## age''  0.1165 0.1843  0.63  0.5275

Weibull model

Survival is modelled in the prevalence estimations using Weibull regression. Currently, there is no flexibility to how this model is fitted, it defaults to using both age and sex. It is therefore important to manually verify that this is an appropriate model.

wb <- survreg(Surv(time, status) ~ age + sex, data=prevsim_r)
## Call:
## survreg(formula = Surv(time, status) ~ age + sex, data = prevsim_r)
##               Value Std. Error     z        p
## (Intercept) 10.7691    0.59009 18.25 2.07e-74
## age         -0.0479    0.00873 -5.49 4.04e-08
## sex1         0.5254    0.17214  3.05 2.27e-03
## Log(scale)   0.6422    0.03967 16.19 6.09e-59
## Scale= 1.9 
## Weibull distribution
## Loglik(model)= -4087.9   Loglik(intercept only)= -4107.2
##  Chisq= 38.48 on 2 degrees of freedom, p= 4.4e-09 
## Number of Newton-Raphson Iterations: 5 
## n= 900

Population survival

For modelling long-term survivors in registry data, we have implemented a cure model option. For example, the default cure = 10 in the following functions for estimating prevalence, means a patient can be considered “cured” if they have survived until 10 years after diagnosis, and therefore from this point onwards their survival is modelled using population survival rather than cohort survival.

General population period survival data for the UK is loaded from the supplied UKmortality data set, and yearly rates are translated into daily rates by linear interpolation using population_survival_rate().


daily_survival_males <- population_survival_rate(rate ~ age, subset(UKmortality, 
                                                                    sex == 0))
daily_survival_females <- population_survival_rate(rate ~ age, subset(UKmortality, 
                                                                      sex == 1))

plot(daily_survival_males, type="l", col="blue", xlab="days", ylab="survival")
lines(daily_survival_females, col="red")
legend("topright", legend = c("Males", "Females"),
               bty = "n", lty = 1, col = c(4,2))

Prevalence estimates

As a reminder, prevalence is estimated using incidence and survival data from n years. However, registry data (and thus known incidence and survival) data may only be known for r years, where r <= n. If r < n, the remaining n-r years of incidence and survival are simulated using Monte Carlo techniques.


In the first instance, prevalence can be estimated by counting from the registry data. It is imperative that the number of incident cases is complete for estimating prevalence, and that follow-up data for all cases is complete at least until after the index date. In the case of the provided prevsim data set the last recorded incident case date and the last date of follow-up are different, as is typically the case in registry data sets where incidence and follow-up data are provided by different sources.

prevalence_counted calculates prevalence contributions for each year of the provided registry data to the index date - specified using the index_date argument. The num_reg_years parameter allows the user to specify the number of years of data to use in the estimation, in the example below incident cases from the first year of the registry in 2003 are discarded due to concerns about their accuracy. In prevalence_counted(), all cases alive after the index date are censored.

## [1] 44 41 50 35 44 52 57 66 85

The function returns the prevalence contributions by year, in ascending order (i.e there are 44 contributions from the year between 2004-01-30 and 2005-01-30).


Prevalence is estimated using simulation by calling prevalence(). index_date and num_reg_years perform the same role as in prevalence_counted: specifying the date at which to estimate point prevalence, and allowing for the specification of the number of years of registry data to use. num_years_to_estimate corresponds to the required number of years preceding the index date that contribute incident cases. If any values are larger than the number of complete years of registry data available before index_date, then the remainder of years have their incidence and survival characteristics simulated. By passing a vector to num_years_to_estimate, multiple estimates of prevalence at the index date can be calculated with their own confidence intervals. The population_size variable is used to estimate prevalence per one hundred thousand (or at any rate specified by the proportion argument).

NB: all follow-up data for the available incident cases is still used in survival modelling.

prevalence_total <- prevalence(Surv(time, status) ~ sex(sex) + age(age) + 
                                   entry(entrydate) + event(eventdate),
                               prevsim, num_years_to_estimate=c(3, 5, 10), 
                               num_reg_years=9, cure=5) 

Printing the prevalence object returned by the function of the same name displays the point estimate of prevalence at the index date using num_years_to_estimate years of data:

## Estimated prevalence per 1e+05 at 2013-01-30 
## 3 years: 20.8 
## 5 years: 30.4 
## 10 years: 51.16

More detail from the prevalence object can be extracted using summary, including the p-value from a chi squared test of the difference between the predicted and counted prevalence for the available years of registry data:

## Registry Data
## ~~~~~~~~~~~~~
## Index date: 2013-01-30 
## Start date: 2004-01-30 
## Number of years: 9 
## Known incidence rate:
## 107 97 97 88 83 100 111 109 107 
## Counted prevalent cases:
## 44 41 50 35 44 52 57 66 85
## Bootstrapping
## ~~~~~~~~~~~~~
## Iterations: 1000 
## Posterior age distribution summary:
##      Length Class  Mode
## [1,] 10     -none- list
## [2,] 10     -none- list
## Average simulated prevalent cases per year:
## 38 43 40 40 38 38 52 64 71 85 
## P-value from chi-square test: 0.7735422

The prevalence object’s estimates attribute holds the point prevalence estimate along with rates per one hundred thousand and confidence intervals. As n increases, the prevalence estimate increases:

## $y3
## $y3$absolute.prevalence
## [1] 208
## $y3$per100K
## [1] 20.8
## $y3$per100K.upper
## [1] 23.63
## $y3$per100K.lower
## [1] 17.97
## $y5
## $y5$absolute.prevalence
## [1] 304
## $y5$per100K
## [1] 30.4
## $y5$per100K.upper
## [1] 33.82
## $y5$per100K.lower
## [1] 26.98
## $y10
## $y10$absolute.prevalence
## [1] 511.6
## $y10$per100K
## [1] 51.16
## $y10$per100K.upper
## [1] 55.61
## $y10$per100K.lower
## [1] 46.71

Inspecting bootstrapped survival modelling

To inspect the distribution of the bootstrapped survival models (see S. Crouch et al. 2014 for details), a survfit.prev object can be constructed using the usual survfit call, accepting a data frame of new data (with identical column names to those found in the original data set). In the example below, survival probability is estimated for a 60 year old male:

prevsurv <- survfit(prevalence_total, newdata=list(age=60, sex=0))
## Survival probability calculated at 4444 timepoints, across 1000 bootstraps.

The summary.survfit.prev method provides N-year survival probabilities, with N specified as an argument vector:

summary(prevsurv, years=c(1, 3, 5, 10))
## Survival probability estimated using 1000 bootstrap survival curves:
## 1 year survival: 0.706 (0.671 - 0.743) 
## 3 year survival: 0.537 (0.495 - 0.583) 
## 5 year survival: 0.443 (0.397 - 0.493) 
## 10 year survival: 0.31 (0.263 - 0.362)

Plotting the survfit.prev object displays the survival curve of a Weibull model using the original data set (orange), along with 95% confidence intervals derived using the bootstrapped models shaded in light grey, while outlier curves are individually plotted. Outliers are defined as those survival curves where at least pct_show proportion of their point predictions lie outside the 95% confidence interval.

plot(prevsurv, pct_show=0.90)

Comparison between simulated and counted prevalence

As a test of whether the model is predicting reasonable values of prevalence, we can use the fact that we can directly measure the discrepancy between the predicted and actual prevalence for the available registry years. This difference can be formally tested using a chi-square test; the resulting p-value resulting is returned as an attribute of a prevalence object, called pval.

## [1] 0.7735422

For this model, there is no evidence to reject the null hypothesis.

This can also be calculated manually with the test_prevalence_fit function.

## [1] 0.7735422

Age distribution of prevalent cases

The age distribution of simulated prevalent cases can be viewed as a histogram and compared with the age distribution of incident cases:

hist(prevsim$age[prevsim$entrydate >= min(registry_years) & 
                     prevsim$entrydate < max(registry_years)], 
     col=rgb(1,0,0, alpha=0.5), xlim=c(0,100), ylim=c(0,0.045), 
     main = "", xlab = "age", prob = TRUE)
     col=rgb(0,1,0, alpha=0.5), prob = TRUE, add=T)
legend("topleft", legend = c("Incident", "Prevalent"), bty = "n", lty = 1, 
       col = c(rgb(1,0,0, alpha=0.5),
               rgb(0,1,0, alpha=0.5)))


Crouch, Simon, Alex Smith, Dan Painter, Jinlei Li, and Eve Roman. 2014. “Determining Disease Prevalence from Incidence and Survival Using Simulation Techniques.” Cancer Epidemiology 38 (2). Elsevier: 193–99. http://www.ncbi.nlm.nih.gov/pubmed/24656754.

National Statistics, Office for. n.d. “Life Expectancies.” http://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/lifeexpectancies.

Roman, Eve, Alex Smith, Simon Appleton, Simon Crouch, Richard Kelly, Sally Kinsey, Catherine Cargo, and Russell Patmore. 2016. “Myeloid Malignancies in the Real-World: Occurrence, Progression and Survival in the UK’s Population-Based Haematological Malignancy Research Network 2004–15.” Cancer Epidemiology 42. Elsevier: 186–98.

Smith, A, S Crouch, S Lax, J Li, D Painter, D Howell, R Patmore, A Jack, and E Roman. 2015. “Lymphoma Incidence, Survival and Prevalence 2004–2014: Sub-Type Analyses from the UK’s Haematological Malignancy Research Network.” British Journal of Cancer 112 (9). Nature Publishing Group: 1575–84.