Bayesian Small Area Estimation using Complex Survey Data

Zehang Richard Li


The SUMMER package offers a class of tools for small-area estimation with survey data in space and time. Such data are usually collected with complex stratified designs, which must be acknowledge in the analysis. In this vignette, we offer two main examples to illustrate spatial and spatial-temporal smoothing of design-based estimates:

The second example of estimating U5MR is the main purpose of this package, and involves more complex modeling steps. For many users, the first BRFSS example may be of more general interest, and provides an introduction to the idea of small area estimation with survey data. At the end of this vignette, we also provide a toy example of simulating spatially correlated data and performing spatial smoothing of design-based estimates.

Small Area Estimation with BRFSS Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone health survey conducted by the Centers for Disease Control and Prevention (CDC) that tracks health conditions and risk behaviors in the United States and its territories since 1984. The BRFSS sampling scheme is complex with high variability in the sampling weights. In this example, we estimate the prevalence of Type II diabetes in health reporting areas (HRAs) in King County, using BRFSS data. We will compare the weighted direct estimates, with simple spatial smoothed estimates (ignore weighting), and the smoothed and weighted estimates.

Load Package and Data

First, we load the package and the necessary data. INLA is not in a standard repository, so we check if it is available and install it if it is not.

BRFSS contains the full BRFSS dataset with \(16,283\) observations. The diab2 variable is the binary indicator of Type II diabetes, strata is the strata indicator and rwt_llcp is the final design weight. For the purpose of this analysis, we first remove records with missing HRA code or diabetes status from this dataset.

KingCounty contains the map of the King County HRAs. In order to fit spatial smoothing model, we first need to compute the adjacency matrix for the HRAs, mat, and make sure both the column and row names correspond to the HRA names.

## [1] 48 48

The Direct Estimates

Let \(y_i\) and \(m_i\) be the number of individuals flagged as having type II diabetes and the denominators in areas \(i = 1, ..., n\). Ignoring the survey design, the naive estimates for the prevalence of Type II diabetes can be easily calculated as \(\hat p_i = y_i / m_i\), with associated standard errors \(\sqrt{\hat p_i(1 - \hat p_i)/m_i}\). The design-based weighted estimates of \(p_i\) and the associated variances can be easily calculated using the survey package.

##                                                 hracode diab2    se
## Auburn-North                               Auburn-North 0.104 0.021
## Auburn-South                               Auburn-South 0.233 0.049
## Ballard                                         Ballard 0.070 0.022
## Beacon/Gtown/S.Park                 Beacon/Gtown/S.Park 0.081 0.026
## Bear Creek/Carnation/Duvall Bear Creek/Carnation/Duvall 0.052 0.012
## Bellevue-Central                       Bellevue-Central 0.059 0.015

The Smoothed Estimates

When the number of samples in each area is large, the design-based variance is usually small and the direct estimates will work well. However, when we have small sample from each area, we would like to perform some form of smoothing over the areas. For now, let us ignore the survey weights, we can consider the following Bayesian smoothing model:

\[\begin{eqnarray*} y_i | p_i &\sim& \mbox{Binomial}(m_i,p_i)\\ \theta_i &=& \log \left( \frac{p_i}{1-p_i}\right) = \mu+ \epsilon_i+s_i,\\ \epsilon_i &\sim & N(0,\sigma_\epsilon^2)\\ s_i | s_j, j \in \text{ne}(i) &\sim& N\left(\bar{s_i}, \dfrac{\sigma_s^2}{n_i} \right). \end{eqnarray*}\] where \(n_i\) is the number of neighbors for area \(i\), \(\bar{s_i} = \frac{1}{n_i}\sum_{j \in \text{ne}(i)} s_j\), and hyperpriors are put on \(\mu,\sigma_\epsilon^2,\sigma_s^2\). This simple smoothing model can be fitted using the fitGeneric function by specifying NULL for the survey parameters

The smoothed estimates of \(p_i\) and \(\theta_i\) can be found in the smooth object returned by the function, and the direct estimates are stored in the HT object (without specifying survey weights, these are the simple binomial probabilities).

##                        region time mean variance median lower upper
## 1                Auburn-North   NA -1.9    0.018   -1.9  -2.1  -1.6
## 2                Auburn-South   NA -1.5    0.026   -1.5  -1.8  -1.1
## 3                     Ballard   NA -2.7    0.021   -2.7  -3.0  -2.4
## 4         Beacon/Gtown/S.Park   NA -2.4    0.023   -2.4  -2.7  -2.1
## 5 Bear Creek/Carnation/Duvall   NA -2.6    0.017   -2.6  -2.8  -2.3
## 6            Bellevue-Central   NA -2.4    0.029   -2.4  -2.8  -2.1
##   mean.original variance.original median.original lower.original
## 1         0.136          0.000254           0.136          0.107
## 2         0.191          0.000625           0.190          0.145
## 3         0.065          0.000078           0.065          0.049
## 4         0.085          0.000137           0.085          0.064
## 5         0.072          0.000076           0.072          0.056
## 6         0.081          0.000158           0.080          0.058
##   upper.original
## 1          0.168
## 2          0.241
## 3          0.083
## 4          0.109
## 5          0.089
## 6          0.106
##   HT.est HT.variance HT.prec HT.est.original
## 1   -1.8  0.17       0.030      34           0.140
## 2   -1.2  0.18       0.031      32           0.232
## 3   -2.6  0.17       0.029      35           0.067
## 4   -2.4  0.25       0.061      16           0.086
## 5   -2.6  0.17       0.028      35           0.069
## 6   -2.4  0.21       0.045      22           0.084
##   HT.variance.original   n  y                      region
## 1              0.00043 278 39                Auburn-North
## 2              0.00098 181 42                Auburn-South
## 3              0.00011 555 37                     Ballard
## 4              0.00037 210 18         Beacon/Gtown/S.Park
## 5              0.00012 550 38 Bear Creek/Carnation/Duvall
## 6              0.00027 286 24            Bellevue-Central

The Weighted and Smoothed Estimates

To account for the survey designs in the smoothing, we instead model the logit of the design-based direct estimates \(\hat p_i \sim N(\theta_i, \hat V_i)\) directly, where both \(\hat p_i\) and the asymptotic variance on the logit scale, \(\hat V_i\), are assumed known. This model can be fit with

Again, the design-based direct estimates and the smoothed estimates accounting for design can be obtained by svysmoothed$smooth and svysmoothed$HT. We can now compare the three types of estimates and their associated variance and it can be seen that smoothing reduces the variance of estimates significantly.

Area-level covariates

Covariates at the area level could be included in the smoothing model by specifying the X argument. The first column of the covariate matrix should be the names of regions that match the column and row names of the adjacency matrix.

##                     mean    sd 0.025quant 0.5quant 0.97quant  mode
## (Intercept)        -2.67 0.071      -2.81    -2.67     -2.54 -2.67
## educauless.than.HS  1.93 3.207      -4.42     1.94      7.95  1.95
## educauHS.grad      -1.36 1.350      -4.02    -1.37      1.19 -1.37
##  1.59 1.144      -0.66     1.59      3.75  1.58
## educaucollege.grad -0.74 0.507      -1.73    -0.74      0.22 -0.75
##                        kld
## (Intercept)        2.9e-07
## educauless.than.HS 9.3e-07
## educauHS.grad      1.4e-06
## 5.9e-07
## educaucollege.grad 5.6e-06

Customized prior distribution

The fitGeneric function has some default hyperprior choices built in using the Penalising Complexity (PC) priors. There are two ways to customize this default hyperprior choice. To simply update the hyper parameters of the PC prior, we can simply use the pc.u and pc.alpha for hyperpriors on precision parameters, and pc.u.phi and pc.alpha.phi for mixing parameter \(\phi\) in the BYM2 model for spatial effects. The interpretation of PC prior parameters for the precision \(\tau\) is \[ \mbox{Prob}(\sigma > u) = \alpha \] where \(\sigma = 1 / \sqrt{\tau}\) is the standard deviation. And for the mixing parameter \(\phi\) is \[ \mbox{Prob}(\phi < u) = \alpha \]

For example, if we change pc.u and pc.alpha from the default value \(u = 1, \alpha = 0.01\) to \(u = 0.1, \alpha = 0.01\), we would assign more prior mass on smaller variance of the random effects.

Another way to visualize one or more metrics on the map is by the mapPlot function, for example,

Or we can change the random effect model entirely using the formula argument. To use this option, we will need to use the internal indicator for region, region.struct or region.unstruct as the index. This index variable name can also be observed from the fitted object by summary(svysmoothed$fit) or svysmoothed$formula. For example, we can reassign a different parameterization of the BYM model. For users familiar with INLA syntax, this allows more possibilities of expanding the modeling framework.

U5MR Estimation in Space and Time

Load Data

DemoData contains model survey data provided by DHS. Note that this data is fake, and does not represent any real country’s data. Data similar to the DemoData data used in this vignette can be obtained by using getBirths. DemoMap contains geographic data from the 1995 Uganda Admin 1 regions defined by DHS. Data similar to the DemoMap data used in this vignette can be obtained by using read_shape.

DemoData is a list of \(5\) data frames where each row represent one person-month record and contains the \(8\) variables as shown below. Notice that time variable is turned into 5-year bins from 80-84 to 10-14.

##      Length Class      Mode
## 1999 8      data.frame list
## 2003 8      data.frame list
## 2007 8      data.frame list
## 2011 8      data.frame list
## 2015 8      data.frame list
##   clustid id  region  time  age weights        strata died
## 1       1  1 eastern 00-04    0     1.1 eastern.rural    0
## 2       1  1 eastern 00-04 1-11     1.1 eastern.rural    0
## 3       1  1 eastern 00-04 1-11     1.1 eastern.rural    0
## 4       1  1 eastern 00-04 1-11     1.1 eastern.rural    0
## 5       1  1 eastern 00-04 1-11     1.1 eastern.rural    0
## 6       1  1 eastern 00-04 1-11     1.1 eastern.rural    0

DemoData is obtained by processing the raw DHS birth data (in .dta format) in R. The raw file of birth recodes can be downloaded from the DHS website For this example dataset, no registration is needed. For real DHS survey datasets, permission to access needs to be registered with DHS directly. DemoData contains a small sample of the observations in this dataset randomly assigned to \(5\) example DHS surveys.

Here we demonstrate how to split the raw data into person-month format from. Notice that to read the file from early version of stata, the package readstata13 is required. The following script is based on the example dataset ZZBR62FL.DTA available from the DHS website. We use the interaction of v024 and v025 as the strata indicator for the purpose of demonstration. We can see from range(data$v008) that the CMC code for the date of interview corresponds to the year 1900 + floor(1386/12)= 2015. In practice, however, the survey year is usually known. The survey year variable allows the mis-recorded data. Dates after the surveyyear will be removed. Thus for survey taking place over multiple years, the later year is suggested to be used as surveyyear. If set to NA then no checking will be performed.

Horvitz-Thompson estimators of U5MR

Next, we obtain Horvitz-Thompson estimators using getDirectList. We specify the survey design as the two-stage stratified cluster sampling where strata are specified in the strata column, and clusters are specified by both the cluster ID (clusterid) and household ID (id).

Before fitting the model, we also aggregate estimates based on different surveys into a single set of estimates, using the inverse design-based variances as the weights.

## [1] 150  11
## [1] 30 10

National estimates of U5MR

With the combined direct estimates, we are ready to fit the smoothing models. First, we ignore the subnational estimates, and fit a model with temporal random effects only. In this part, we use the subset of data region variable being “All”. In fitting this model, we first define the list of time periods we wish to project the estimates on. First we can fit a Random Walk 2 only model defined on the 5-year period.

We can also estimate the Random Walk 2 random effects on the yearly scale, with direct estimates calculated in 5-year intervals.

The marginal posteriors are already stored in the fitted object. We use the following function to extract and re-arrange them.

We can compare the results visually using the function below.

Subnational estimates of U5MR

Similarly we can fit the full model on all subnational regions as well. First, we fit the Random Walk 2 model defined on the 5-year period.

Similarly we can also estimate the Random Walk 2 random effects on the yearly scale.

The figures below shows the comparison of the subnational model with different temporal scales.

We can also add back the direct estimates for comparison when plotting the smoothed estimates.

Finally, we show the estimates over time on maps.

Simulate spatial(temporal) random effects

Finally, we can simulate spatially correlated data in SUMMER for simulation studies as well. The simulation scheme in this section can be extended to temporal and spatial-temporal case as well. As an illustration, we simulate \(3\) draws from the Besag model using the King County map.

u <- rst(n = 3, type = "s", Amat = getAmat(KingCounty, names = KingCounty$HRA2010v2_))
mapPlot(data.frame(r1 = u[1, ], r2 = u[2, ], r3 = u[3, ], region = colnames(u)), 
    geo = KingCounty, = "region", by.geo = "HRA2010v2_", variables = c("r1", 
        "r2", "r3"))