Analyzing network scale-up data using the networkreporting package

Introduction

The networkreporting package has several tools for analyzing survey data that have been collected using the network scale-up method.

This introduction will assume that you already have the networkreporting package installed. If you don't, please refer to the introductory vignette (“getting started”) for instructions on how to do this.

Review of the network scale-up method

For the purposes of this vignette, we'll assume that we have conducted a survey using network scale-up questions to try and estimate the size of an important population. Analytically, this involves two main steps:

We'll quickly review each of these steps, and then we'll show how to use the package to carry the estimation out.

Step 1: estimating network sizes

Here, we will use the known population estimator for respondents' degrees. (TODO CITE). In order to estimate the degree of the \(i\) th survey respondent, we use

\[ \begin{align} \label{eqn:kpdegree} \hat{d_i} = \sum_{j=1}^{K} y_{ij} ~ \frac{N}{\sum_{j=1}^{K} N_j}, \end{align} \]

where \(N\) is the total size of the population, \(N_j\) is the size of the \(j\) th population of known size, and \(y_{ij}\) is the number of connections that survey respondent \(i\) reports between herself and members of the \(j\) th population of known size.

Step 2: estimating hidden population sizes

Once we have the estimates of the respondents' degrees, we use them to produce an estimate for the size of the hidden population:

\[ \begin{align} \label{eqn:nsum} \hat{N}_h = \frac{ \sum_{i \in s} y_{ih} }{ \sum_{i \in s} \hat{d_i} }, \end{align} \]

where \(N_h\) is the size of the population of interest (which we want to estimate), \(s\) is the set of respondents in our sample, and \(\hat{d_i}\) is the estimate of the size of respondent \(i\)'s degree, obtained using the known population method.

Preparing data

The first step is to prepare the data. We'll assume that we start with two datasets: the first is a survey containing information collected from respondents about their personal networks; the second is information about the sizes of several populations.

The example data for this vignette are provided with the networkreporting package, and can be loaded by typing

library(networkreporting)
library(plyr)
library(ggplot2)  # we'll use qplot from ggplot2 for plots
theme_set(theme_minimal())

data(hhsurvey)  # this is a demo dataset included with the package

The demo data include two datasets: one has all of the responses from a network scale-up survey, and the other has the known population sizes for use with the known population estimator.

Preparing the known population data

The demo known population data are in knownpop.dat:

knownpop.dat
##               known.popn   size
## 1                widower  36147
## 2        nurse.or.doctor   7807
## 3  male.community.health  22000
## 4                teacher  47745
## 5            woman.smoke 119438
## 6                 priest   1004
## 7       woman.gave.birth 256164
## 8                 muslim 195449
## 9           incarcerated  68000
## 10          man.divorced  50698
## 11            nsengimana  32528
## 12            murekatete  30531
## 13              twahirwa  10420
## 14           mukandekezi  10520
## 15             nsabimana  48560
## 16              mukamana  51449
## 17            ndayambaje  22724
## 18             nyiraneza  21705
## 19              bizimana  38497
## 20         nyirahabimana  42727
## 21           ndagijimana  37375
## 22        mukandayisenga  35055

knownpop.dat is very simple: one column has a name for each known population, and the other has its toal size. We expect that users will typically start with a small dataset like this one. When using the networkreporting package, it is more useful to have a vector whose entries are known population sizes and whose names are the known population names. The df.to.kpvec function makes it easy for us to create it:

kp.vec <- df.to.kpvec(knownpop.dat, kp.var = "known.popn", kp.value = "size")

kp.vec
##               widower       nurse.or.doctor male.community.health 
##                 36147                  7807                 22000 
##               teacher           woman.smoke                priest 
##                 47745                119438                  1004 
##      woman.gave.birth                muslim          incarcerated 
##                256164                195449                 68000 
##          man.divorced            nsengimana            murekatete 
##                 50698                 32528                 30531 
##              twahirwa           mukandekezi             nsabimana 
##                 10420                 10520                 48560 
##              mukamana            ndayambaje             nyiraneza 
##                 51449                 22724                 21705 
##              bizimana         nyirahabimana           ndagijimana 
##                 38497                 42727                 37375 
##        mukandayisenga 
##                 35055

Finally, we also need to know the total size of the population we are making estimates about. In this case, let's assume that we're working in a country of 10 million people:

# total size of the population
tot.pop.size <- 1e+07

TODO – attach known population example

Preparing the survey data

Now let's take a look at the demo survey dataset, which is called example.survey:

head(example.survey)
##      id cluster      region indweight  sex age.cat widower nurse.or.doctor
## 1 1.1.1       1 Kigali City      1353 Male [25,35)       3               2
## 2 1.1.2       1 Kigali City      1353 Male [25,35)       0               2
## 3 1.1.3       1 Kigali City      1353 Male [25,35)       2               8
## 4 1.1.4       1 Kigali City      1353 Male [25,35)       0               1
## 5 1.1.5       1 Kigali City      1353 Male [25,35)       0               0
## 6 1.1.6       1 Kigali City      1353 Male [25,35)       7               4
##   male.community.health teacher woman.smoke priest civil.servant
## 1                     1       5           1      0             5
## 2                     1       5           0      0             5
## 3                     0       3           0      0            50
## 4                     0       0           0      0             5
## 5                     0       0           0      0             5
## 6                     0       8           2      0             6
##   woman.gave.birth muslim incarcerated judge man.divorced treatedfortb
## 1                3      4            2     3            2            0
## 2                3      0            2     3            1            0
## 3                4      3            0     0            2            0
## 4                0      0            0     0            0            0
## 5                0      0            0     0            1            0
## 6                3      4            3     0            1            0
##   nsengimana murekatete twahirwa mukandekezi nsabimana mukamana ndayambaje
## 1          0          0        2           1         2        3          1
## 2          3          2        0           0         1        0          0
## 3          0          0        0           1         2        0          0
## 4          1          0        0           0         0        0          0
## 5          0          0        0           0         0        0          0
## 6          1          1        0           0         0        0          0
##   nyiraneza bizimana nyirahabimana ndagijimana mukandayisenga died
## 1         0        2             0           1              0    0
## 2         0        0             0           0              0    1
## 3         0        0             0           0              0    2
## 4         0        0             0           0              0    0
## 5         0        0             0           0              0    0
## 6         0        0             0           0              0    4
##   sex.workers msm idu clients
## 1           0   0   0       2
## 2           0   0   0       1
## 3           0   0   0       0
## 4           0   0   0       0
## 5           0   0   0       0
## 6           0   0   0      10

The columns fall into one of a few categories:

Topcoding

Many network scale-up studies have topcoded the responses to the aggregate relational data questions. This means that researchers considered any responses above a certain value, called the topcode, to be implausible. Before proceeding with the analysis, they substitute the maximum plausible value in for the implausible ones. For example, in many studies, researchers assumed that responses more than 30 are implausible, so they replaced responses with the value 31 or higher with the value 30 before conducting their analysis.

We won't discuss whether or not this is advisable here, but this is currently a common practice in scale-up studies. If you wish to follow it, you can use the topcode.data function. As an example, let's topcode the responses to the questions about populations of known size to the value 30. First, we'll examine the distribution of the responses before topcoding:

## make a vector with the list of known population names from our dataset of
## known population totals
known.popn.vars <- paste(knownpop.dat$known.popn)

## before topcoding: max. response for several popns is > 30
summary(example.survey[, known.popn.vars])
##     widower      nurse.or.doctor male.community.health    teacher     
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00         Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 0.00         1st Qu.: 0.00  
##  Median : 0.00   Median : 0.00   Median : 0.00         Median : 0.00  
##  Mean   : 0.61   Mean   : 0.51   Mean   : 0.72         Mean   : 1.36  
##  3rd Qu.: 0.00   3rd Qu.: 0.00   3rd Qu.: 1.00         3rd Qu.: 1.00  
##  Max.   :95.00   Max.   :40.00   Max.   :95.00         Max.   :95.00  
##                                                                       
##   woman.smoke        priest       woman.gave.birth     muslim     
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00    Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.00    1st Qu.: 0.00  
##  Median : 0.00   Median : 0.000   Median : 1.00    Median : 0.00  
##  Mean   : 1.02   Mean   : 0.148   Mean   : 1.89    Mean   : 2.09  
##  3rd Qu.: 1.00   3rd Qu.: 0.000   3rd Qu.: 3.00    3rd Qu.: 1.00  
##  Max.   :95.00   Max.   :25.000   Max.   :30.00    Max.   :95.00  
##                                                    NA's   :1      
##   incarcerated    man.divorced      nsengimana     murekatete    
##  Min.   : 0.00   Min.   : 0.000   Min.   :0.00   Min.   : 0.000  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.: 0.000  
##  Median : 0.00   Median : 0.000   Median :0.00   Median : 0.000  
##  Mean   : 0.43   Mean   : 0.337   Mean   :0.36   Mean   : 0.342  
##  3rd Qu.: 0.00   3rd Qu.: 0.000   3rd Qu.:0.00   3rd Qu.: 1.000  
##  Max.   :95.00   Max.   :20.000   Max.   :8.00   Max.   :12.000  
##                                                                  
##     twahirwa       mukandekezi      nsabimana        mukamana     
##  Min.   : 0.000   Min.   :0.000   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.:0.000   1st Qu.: 0.00   1st Qu.: 0.000  
##  Median : 0.000   Median :0.000   Median : 0.00   Median : 0.000  
##  Mean   : 0.239   Mean   :0.165   Mean   : 0.47   Mean   : 0.414  
##  3rd Qu.: 0.000   3rd Qu.:0.000   3rd Qu.: 1.00   3rd Qu.: 1.000  
##  Max.   :10.000   Max.   :7.000   Max.   :20.00   Max.   :15.000  
##                                                                   
##    ndayambaje      nyiraneza         bizimana      nyirahabimana   
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.00   Median : 0.000   Median : 0.000   Median : 0.000  
##  Mean   : 0.33   Mean   : 0.268   Mean   : 0.433   Mean   : 0.261  
##  3rd Qu.: 0.00   3rd Qu.: 0.000   3rd Qu.: 1.000   3rd Qu.: 0.000  
##  Max.   :30.00   Max.   :10.000   Max.   :12.000   Max.   :17.000  
##                                                                    
##   ndagijimana     mukandayisenga  
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.000  
##  Mean   : 0.328   Mean   : 0.258  
##  3rd Qu.: 0.000   3rd Qu.: 0.000  
##  Max.   :10.000   Max.   :20.000  
## 

Several populations, including widower, male.community.health, teacher, woman.smoke, muslim, and incarcerated have maximum values that are very high. (It turns out that 95 is the highest value that could be recorded during the interviews; if respondents said that they were connected to more than 95 people in the group, the interviewers wrote 95 down.)

Now we use the topcode.data function to topcode all of the responses at 30:

example.survey <- topcode.data(example.survey, vars = known.popn.vars, max = 30)

## after topcoding: max. response for all popns is 30
summary(example.survey[, known.popn.vars])
##     widower       nurse.or.doctor  male.community.health    teacher     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000        Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000        1st Qu.: 0.00  
##  Median : 0.000   Median : 0.000   Median : 0.000        Median : 0.00  
##  Mean   : 0.583   Mean   : 0.506   Mean   : 0.653        Mean   : 1.22  
##  3rd Qu.: 0.000   3rd Qu.: 0.000   3rd Qu.: 1.000        3rd Qu.: 1.00  
##  Max.   :30.000   Max.   :30.000   Max.   :30.000        Max.   :30.00  
##                                                                         
##   woman.smoke         priest       woman.gave.birth     muslim     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00    Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.00    1st Qu.: 0.00  
##  Median : 0.000   Median : 0.000   Median : 1.00    Median : 0.00  
##  Mean   : 0.964   Mean   : 0.148   Mean   : 1.89    Mean   : 1.47  
##  3rd Qu.: 1.000   3rd Qu.: 0.000   3rd Qu.: 3.00    3rd Qu.: 1.00  
##  Max.   :30.000   Max.   :25.000   Max.   :30.00    Max.   :30.00  
##                                                     NA's   :1      
##   incarcerated     man.divorced      nsengimana     murekatete    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.00   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:0.00   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.000   Median :0.00   Median : 0.000  
##  Mean   : 0.381   Mean   : 0.337   Mean   :0.36   Mean   : 0.342  
##  3rd Qu.: 0.000   3rd Qu.: 0.000   3rd Qu.:0.00   3rd Qu.: 1.000  
##  Max.   :30.000   Max.   :20.000   Max.   :8.00   Max.   :12.000  
##                                                                   
##     twahirwa       mukandekezi      nsabimana        mukamana     
##  Min.   : 0.000   Min.   :0.000   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.:0.000   1st Qu.: 0.00   1st Qu.: 0.000  
##  Median : 0.000   Median :0.000   Median : 0.00   Median : 0.000  
##  Mean   : 0.239   Mean   :0.165   Mean   : 0.47   Mean   : 0.414  
##  3rd Qu.: 0.000   3rd Qu.:0.000   3rd Qu.: 1.00   3rd Qu.: 1.000  
##  Max.   :10.000   Max.   :7.000   Max.   :20.00   Max.   :15.000  
##                                                                   
##    ndayambaje      nyiraneza         bizimana      nyirahabimana   
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.00   Median : 0.000   Median : 0.000   Median : 0.000  
##  Mean   : 0.33   Mean   : 0.268   Mean   : 0.433   Mean   : 0.261  
##  3rd Qu.: 0.00   3rd Qu.: 0.000   3rd Qu.: 1.000   3rd Qu.: 0.000  
##  Max.   :30.00   Max.   :10.000   Max.   :12.000   Max.   :17.000  
##                                                                    
##   ndagijimana     mukandayisenga  
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.000  
##  Mean   : 0.328   Mean   : 0.258  
##  3rd Qu.: 0.000   3rd Qu.: 0.000  
##  Max.   :10.000   Max.   :20.000  
## 

If you look at the help page for topcode.data, you'll see that it can also handle situations where the variables can take on special codes for missing values, refusals, and so forth.

Estimating network sizes

Now that we have finished preparing the data, we turn to esimating the sizes of each respondent's personal network. To do this using the known population estimator, we use the kp.degree.estimator function:

d.hat <- kp.degree.estimator(survey.data=example.survey,
                             known.popns=kp.vec,
                             total.popn.size=tot.pop.size,
                             missing="complete.obs")

summary(d.hat)
##        V1        
##  Min.   :   0.0  
##  1st Qu.:  25.3  
##  Median :  59.0  
##  Mean   : 101.2  
##  3rd Qu.: 126.4  
##  Max.   :1904.7

Note that the function reports that it's working in absolute numbers (instead of, for example, proportions.) We can examine the results with a histogram

qplot(d.hat, binwidth = 25)

plot of chunk unnamed-chunk-9

Let's attach the degree estimates to the dataframe to keep track of them:

example.survey$d.hat <- d.hat

Missing data in known population questions

TODO

Internal valididty checks

In order to run internal validation checks, you can use the nsum.internal.validation function. We specify that we wish to use only complete observations (ie, we will remove rows that have any missing values from our calculations).

iv.result <- nsum.internal.validation(survey.data=example.survey,
                                      known.popns=kp.vec,
                                      missing="complete.obs",
                                      killworth.se=TRUE,
                                      total.popn.size=tot.pop.size,
                                      kp.method=TRUE,
                                      return.plot=TRUE)

Now iv.result is a list that has a summary of the results in the entry results

iv.result$results
##                     name nsum.holdout.est known.size d.hat.sum
## 1                widower            58708      36147    238980
## 2        nurse.or.doctor            51873       7807    234804
## 3  male.community.health            66955      22000    234635
## 4                teacher           128262      47745    228049
## 5            woman.smoke            93114     119438    249050
## 6                 priest            14831       1004    240719
## 7       woman.gave.birth           173203     256164    261832
## 8                 muslim           137934     195449    255929
## 9           incarcerated            36620      68000    250140
## 10          man.divorced            32759      50698    247263
## 11            nsengimana            35697      32528    242876
## 12            murekatete            33933      30531    242828
## 13              twahirwa            23922      10420    240784
## 14           mukandekezi            16383      10520    242327
## 15             nsabimana            46399      48560    243969
## 16              mukamana            40565      51449    245777
## 17            ndayambaje            32841      22724    241465
## 18             nyiraneza            26637      21705    242516
## 19              bizimana            42949      38497    242614
## 20         nyirahabimana            25412      42727    247131
## 21           ndagijimana            32260      37375    244579
## 22        mukandayisenga            25249      35055    245554
##    killworth.se killworth.se.wgtdenom    err abserr     sqerr   relerr
## 1        1562.7                1562.7  22561  22561 5.090e+08  0.62414
## 2        1482.5                1482.5  44066  44066 1.942e+09  5.64443
## 3        1683.6                1683.6  44955  44955 2.021e+09  2.04342
## 4        2356.3                2356.3  80517  80517 6.483e+09  1.68639
## 5        1924.6                1924.6 -26324  26324 6.930e+08  0.22040
## 6         784.3                 784.3  13827  13827 1.912e+08 13.77147
## 7        2549.6                2549.6 -82961  82961 6.883e+09  0.32386
## 8        2305.5                2305.5 -57515  57515 3.308e+09  0.29427
## 9        1207.7                1207.7 -31380  31380 9.847e+08  0.46148
## 10       1149.1                1149.1 -17939  17939 3.218e+08  0.35385
## 11       1210.2                1210.2   3169   3169 1.004e+07  0.09743
## 12       1180.1                1180.1   3402   3402 1.158e+07  0.11144
## 13        995.5                 995.5  13502  13502 1.823e+08  1.29576
## 14        821.6                 821.6   5863   5863 3.437e+07  0.55730
## 15       1375.9                1375.9  -2161   2161 4.668e+06  0.04449
## 16       1282.1                1282.1 -10884  10884 1.185e+08  0.21154
## 17       1164.3                1164.3  10117  10117 1.024e+08  0.44522
## 18       1046.6                1046.6   4932   4932 2.433e+07  0.22725
## 19       1327.6                1327.6   4452   4452 1.982e+07  0.11564
## 20       1012.7                1012.7 -17315  17315 2.998e+08  0.40526
## 21       1146.6                1146.6  -5115   5115 2.617e+07  0.13687
## 22       1012.7                1012.7  -9806   9806 9.616e+07  0.27973

Since we passed the argument return.plot=TRUE to the function, we also get a plot:

print(iv.result$plot)

plot of chunk unnamed-chunk-13

This is a ggplot2 object, so we can customize it if we want. As a very simple example, we can change the title:

print(iv.result$plot + ggtitle("internal validation checks"))

plot of chunk unnamed-chunk-14

Estimating hidden population size

Now that we have estimated degrees, we can use them to produce estimates of the populations we're interested in. Here, we'll take the example of injecting drug users, idu

idu.est <- nsum.estimator(survey.data=example.survey,
                          d.hat.vals=d.hat,
                          total.popn.size=tot.pop.size,
                          y.vals="idu",
                          missing="complete.obs")

Note that we had to specify that we should use only rows in our dataset with no missing values through the missing = "complete.obs" option, and also that we had to pass in the total population size using the total.popn.size option. The resulting estimate is

idu.est
## $estimate
## [1] 1068
## 
## $tot.connections
## [1] 26
## 
## $sum.d.hat
## [1] 243524

This returns the estimate, and also the numerator and denominator used to compute it.

Variance estimation

In order to estimate the sampling uncertainty of our estimated totals, we can use the rescaled bootstrap technique that is in the networkreporting package. In order to do so, we need to be able to describe the sampling design of our study. In particular, we need to be able to describe the stratifcation (if any) and the primary sampling units used in the study.

idu.est <- bootstrap.estimates(## this describes the sampling design of the
                               ## survey; here, the PSUs are given by the
                               ## variable cluster, and the strata are given
                               ## by the variable region
                               survey.design = ~ cluster + strata(region),
                               ## the number of bootstrap resamples to obtain
                               ## (NOTE: in practice, you should use more than 100.
                               ##  this keeps building the package relatively fast)
                               num.reps=100,
                               ## this is the name of the function
                               ## we want to use to produce an estimate
                               ## from each bootstrapped dataset
                               estimator.fn="nsum.estimator",
                               ## these are the sampling weights
                               weights="indweight",
                               ## this is the name of the type of bootstrap
                               ## we wish to use
                               bootstrap.fn="rescaled.bootstrap.sample",
                               ## our dataset
                               survey.data=example.survey,
                               ## other parameters we need to pass
                               ## to the nsum.estimator function
                               d.hat.vals=d.hat,
                               total.popn.size=tot.pop.size,
                               y.vals="idu",
                               missing="complete.obs")

By default, bootstrap.estimates produces a list with num.reps entries; each entry is the result of calling the estimator function on one bootstrap resample. We can write a bit of code that will help us put all of these results together, for plotting and summarizing

## combine the estimates together in one data frame (bootstrap.estimates
## gives us a list)
all.idu.estimates <- ldply(idu.est, function(x) {
    data.frame(estimate = x$estimate)
})

We can examine the summarized results with a histogram or with summarize.

## look at a histogram of the results
qplot(all.idu.estimates$estimate, binwidth = 50)

plot of chunk unnamed-chunk-19


## summarize the results
summary(all.idu.estimates$estimate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     160     452     581     611     780    1180

Attaching known population totals to the dataframe

Several of the functions we demonstrated above required us to pass in the vector containing the known population sizes and also the size of the total population. We can avoid this step by attaching these two pieces of information to the survey dataframe using the add.kp function:

example.survey <- add.kp(example.survey, kp.vec, tot.pop.size)

d.hat.new <- kp.degree.estimator(survey.data=example.survey,
                                 # we don't need this anymore, since we
                                 # them to survey.data's attributes using add.kp
                                 #known.popns=kp.vec,
                                 #total.popn.size=tot.pop.size,
                                 missing="complete.obs")

summary(d.hat.new)
##        V1        
##  Min.   :   0.0  
##  1st Qu.:  25.3  
##  Median :  59.0  
##  Mean   : 101.2  
##  3rd Qu.: 126.4  
##  Max.   :1904.7

This is exactly the same result we obtained before.