Error estimation

2019-07-15

This document mainly presents the function surveysd::calc.stError(), which generates point estimates and standard errors for user-supplied estimation functions.

Prerequisites

In order to use a dataset with calc.stError(), several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc() and attach the bootstrap weights using draw.bootstrap() and recalib(). Please refer to the documentation of those functions for more detail.

library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region")
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps 
## 
## Convergence reached in 3 steps
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
year povertyRisk eqIncome onePerson pWeight w1 w2 w3 w4 w5
2010 FALSE 16090.69 FALSE 504.5696 1.467361 1.461925 1.448003 1.479799 1.490757
2010 FALSE 16090.69 FALSE 504.5696 1.467361 1.461925 1.448003 1.479799 1.490757
2010 FALSE 16090.69 FALSE 504.5696 1.467361 1.461925 1.448003 1.479799 1.490757
2011 FALSE 16090.69 FALSE 504.5696 1.464252 1.400870 1.360662 1.510375 1.477272
2011 FALSE 16090.69 FALSE 504.5696 1.464252 1.400870 1.360662 1.510375 1.477272

Estimator functions

The parameters fun and var in calc.stError() define the estimator to be used in the error analysis. There are two built-in estimator functions, weightedSum() and weightedRatio(), which can be used as follows.

povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)

Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income, respectively. By default, the results are calculated separately for each reference period.

povertyRate$Estimates
year n N val_povertyRisk stE_povertyRisk
2010 14827 8182222 14.44422 0.4710230
2011 14827 8182222 14.77393 0.5615682
2012 14827 8182222 15.04515 0.3879097
2013 14827 8182222 14.89013 0.4429596
2014 14827 8182222 15.14556 0.5121590
2015 14827 8182222 15.53640 0.4302365
2016 14827 8182222 15.08315 0.3469344
2017 14827 8182222 15.42019 0.5633740
totalIncome$Estimates
year n N val_eqIncome stE_eqIncome
2010 14827 8182222 162750998071 973119034
2011 14827 8182222 161926931417 840418104
2012 14827 8182222 162576509628 1115195638
2013 14827 8182222 163199507862 1396066604
2014 14827 8182222 163986275009 1384785226
2015 14827 8182222 163416275447 1339544407
2016 14827 8182222 162706205137 1274753118
2017 14827 8182222 164314959107 1396270280

Columns that use the val_ prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight in case of the dataset used here.

Columns with the stE_ prefix denote standard errors calculated from the bootstrap replicates. Each replicate applies the estimator with one of the bootstrap weights w1, w2, …, w10 in place of pWeight.

n denotes the number of observations for the year and N denotes the total weight of those persons.
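
How the val_ and stE_ columns relate to the weight columns can be illustrated on toy data. The sketch below is not the surveysd implementation: ratio_pct() is a hypothetical stand-in for weightedRatio(), the data are made up, and the standard error is assumed to be the sample standard deviation of the replicate estimates.

```r
# Toy data: four persons, a main weight and three bootstrap weights (all made up).
toy <- data.frame(
  povertyRisk = c(TRUE, FALSE, FALSE, TRUE),
  pWeight     = c(10, 20, 30, 40),
  w1          = c(20, 20, 30, 30),
  w2          = c( 0, 40, 30, 30),
  w3          = c(10,  0, 60, 30)
)

# Hypothetical stand-in for weightedRatio(): weighted share of TRUE values in percent.
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# Point estimate (val_): the estimator applied with the main weight.
val <- ratio_pct(toy$povertyRisk, toy$pWeight)

# Replicate estimates: the same estimator applied with each bootstrap weight.
reps <- sapply(c("w1", "w2", "w3"),
               function(wcol) ratio_pct(toy$povertyRisk, toy[[wcol]]))

# Assumed standard error (stE_): standard deviation of the replicate estimates.
stE <- sd(reps)

n <- nrow(toy)         # number of observations
N <- sum(toy$pWeight)  # total weight of those observations
```

With only three replicates the standard error is of course very unstable; the example is meant to show the mechanics, not a sensible number of replicates.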

Custom estimators

In order to define a custom estimator function to be used in fun, the function needs to have two arguments like the example below.

## [1] TRUE

The parameters x and w can be assumed to be vectors of equal length, with w being numeric and x being the column defined in the var argument. The function will be called once for each period (in this case year) and for each weight column (in this case pWeight, w1, w2, …, w10).
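
A minimal sketch of such a two-argument estimator is given below. The name weightedMean and the toy vectors are illustrative assumptions and not part of surveysd. The commented-out call only indicates how the function would be passed to fun, following the pattern of the calls shown earlier.

```r
# Hypothetical custom estimator: weighted mean of x with weights w.
# It takes exactly two arguments, the values x and the weights w.
weightedMean <- function(x, w) {
  sum(x * w) / sum(w)
}

# Toy check on made-up vectors of equal length.
x <- c(100, 200, 400)
w <- c(1, 1, 2)
est <- weightedMean(x, w)

# The result agrees with base R's weighted.mean().
same <- isTRUE(all.equal(est, weighted.mean(x, w)))

# Passed to calc.stError() in the same way as the built-in estimators (not run here):
# meanIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedMean)
```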

Multiple estimators

In case an estimator should be applied to several columns of the dataset, var can be set to a vector containing all necessary columns.

year n N val_povertyRisk stE_povertyRisk val_onePerson stE_onePerson
2010 14827 8182222 14.44422 0.4710230 14.85737 0.2602878
2011 14827 8182222 14.77393 0.5615682 14.85737 0.3228597
2012 14827 8182222 15.04515 0.3879097 14.85737 0.3342605
2013 14827 8182222 14.89013 0.4429596 14.85737 0.3798783
2014 14827 8182222 15.14556 0.5121590 14.85737 0.4507181
2015 14827 8182222 15.53640 0.4302365 14.85737 0.4143875
2016 14827 8182222 15.08315 0.3469344 14.85737 0.3280088
2017 14827 8182222 15.42019 0.5633740 14.85737 0.2870508

Here we see the relative number of persons at risk of poverty and the relative number of one-person households.
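
The idea of applying one estimator to several columns can be sketched in base R. The data and the helper ratio_pct() are made up for illustration and only mimic the naming of the val_ columns. The commented-out line shows the presumed form of the corresponding calc.stError() call, which is an assumption based on the description of var above.

```r
# Made-up data: two logical indicator columns and one weight column.
toy <- data.frame(
  povertyRisk = c(TRUE, FALSE, FALSE, TRUE),
  onePerson   = c(FALSE, FALSE, TRUE, FALSE),
  pWeight     = c(10, 20, 30, 40)
)

# Hypothetical stand-in for weightedRatio(): weighted share of TRUE values in percent.
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# Apply the same estimator to every column named in `vars`.
vars <- c("povertyRisk", "onePerson")
vals <- sapply(vars, function(v) ratio_pct(toy[[v]], toy$pWeight))
names(vals) <- paste0("val_", vars)

# Presumed corresponding call with surveysd (not run here):
# calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
```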

Grouping

The group argument can be used to calculate estimators for different subsets of the data. It takes the name of a grouping column in dat (usually a factor) as a string. If set, all estimators are split not only by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.

## keep a single reference period
dat2 <- subset(dat_boot_calib, year == 2010)
## copy the bootstrap-related attributes over to the subset
for (att in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

To calculate the ratio of persons at risk of poverty for each federal state of Austria, group = "region" can be used.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
year n N region val_povertyRisk stE_povertyRisk
2010 549 260564 Burgenland 19.53984 1.7962051
2010 733 377355 Vorarlberg 16.53731 3.2682348
2010 924 535451 Salzburg 13.78734 2.2561119
2010 1078 563648 Carinthia 13.08627 1.7480745
2010 1317 701899 Tyrol 15.30819 2.0314532
2010 2295 1167045 Styria 14.37464 1.3554432
2010 2322 1598931 Vienna 17.23468 1.0438717
2010 2804 1555709 Lower Austria 13.84362 1.6944545
2010 2805 1421620 Upper Austria 10.88977 0.8676276
2010 14827 8182222 NA 14.44422 0.4710230

The last row with region = NA denotes the aggregate over all regions. Note that the columns N and n now show the weighted and unweighted number of persons in each region.
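
The structure of such a grouped result, one row per group plus an aggregate row marked with NA, can be mimicked in base R. The sketch uses made-up data and a hypothetical helper ratio_pct(); it does not reproduce how surveysd computes the table internally.

```r
# Made-up data: five persons in two regions.
toy <- data.frame(
  region      = c("A", "A", "B", "B", "B"),
  povertyRisk = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  pWeight     = c(10, 30, 20, 20, 20)
)

# Hypothetical stand-in for weightedRatio(): weighted share of TRUE values in percent.
ratio_pct <- function(x, w) 100 * sum(w[x]) / sum(w)

# One row per region: unweighted count n, weighted count N and the estimate.
by_region <- do.call(rbind, lapply(split(toy, toy$region), function(d) {
  data.frame(region = d$region[1], n = nrow(d), N = sum(d$pWeight),
             val = ratio_pct(d$povertyRisk, d$pWeight))
}))

# Aggregate row over all regions, marked with region = NA.
total <- data.frame(region = NA, n = nrow(toy), N = sum(toy$pWeight),
                    val = ratio_pct(toy$povertyRisk, toy$pWeight))

res <- rbind(by_region, total)
```

In this toy result, n and N of the two regional rows add up to the values in the aggregate row, just as in the table above.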

Several grouping variables

If more than one grouping variable is used, there are several ways to call calc.stError(), depending on whether combinations of grouping levels should be considered. We will use the variables gender and region as grouping variables and show three options for calling calc.stError().
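
The three call forms can be sketched as follows. The forms of the group argument used here are an assumption based on the calc.stError() documentation, which describes group as a character vector or a list of character vectors where each list entry defines one split. The object names opt1, opt2 and opt3 are illustrative, and the setup repeats the code from the earlier sections so that the block runs on its own.

```r
library(surveysd)

# Same setup as in the Prerequisites section.
set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region")

# Restrict to one reference period and carry over the bootstrap attributes.
dat2 <- subset(dat_boot_calib, year == 2010)
for (att in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

# Option 1: each grouping variable on its own (character vector).
opt1 <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
                     group = c("gender", "region"))

# Option 2: all combinations of the two variables (a single list entry).
opt2 <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
                     group = list(c("gender", "region")))

# Option 3: both of the above (three list entries).
opt3 <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio,
                     group = list("gender", "region", c("gender", "region")))

# Number of rows in each result table.
sapply(list(opt1, opt2, opt3), function(res) nrow(res$Estimates))
```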

Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]

The last row is again the estimate for the whole period.

year n N gender region val_povertyRisk stE_povertyRisk
2010 549 260564 NA Burgenland 19.53984 1.7962051
2010 733 377355 NA Vorarlberg 16.53731 3.2682348
2010 924 535451 NA Salzburg 13.78734 2.2561119
2010 1078 563648 NA Carinthia 13.08627 1.7480745
2010 1317 701899 NA Tyrol 15.30819 2.0314532
2010 2295 1167045 NA Styria 14.37464 1.3554432
2010 2322 1598931 NA Vienna 17.23468 1.0438717
2010 2804 1555709 NA Lower Austria 13.84362 1.6944545
2010 2805 1421620 NA Upper Austria 10.88977 0.8676276
2010 7267 3979572 male NA 12.02660 0.5814477
2010 7560 4202650 female NA 16.73351 0.4651848
2010 14827 8182222 NA NA 14.44422 0.4710230

Option 2: All combinations of state and gender

Split the data by all combinations of the two grouping variables. This results in a larger output table of size

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]

year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2240337
2010 288 137822.2 female Burgenland 21.432598 2.0722169
2010 359 182732.9 male Vorarlberg 12.973259 3.0907951
2010 374 194622.1 female Vorarlberg 19.883637 3.7366664
2010 440 253143.7 male Salzburg 9.156964 1.8991413
2010 484 282307.3 female Salzburg 17.939382 2.5337406
2010 517 268581.4 male Carinthia 10.552148 2.0564400
2010 561 295066.6 female Carinthia 15.392924 1.9761700
2010 650 339566.5 male Tyrol 12.857542 2.2576659
2010 667 362332.5 female Tyrol 17.604861 2.0523531
2010 1128 571011.7 male Styria 11.671247 1.5091092
2010 1132 774405.4 male Vienna 15.590616 1.3336286
2010 1167 596033.3 female Styria 16.964539 1.3880995
2010 1190 824525.6 female Vienna 18.778813 0.9322183
2010 1363 684272.5 male Upper Austria 9.074690 1.1708532
2010 1387 772593.2 female Lower Austria 16.372949 1.8277282
2010 1417 783115.8 male Lower Austria 11.348283 1.6402739
2010 1442 737347.5 female Upper Austria 12.574205 0.7678252
2010 14827 8182222.0 NA NA 14.444218 0.4710230

Option 3: Combination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

  • every gender,
  • every state and
  • every combination of state and gender.

The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]

year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2240337
2010 288 137822.2 female Burgenland 21.432598 2.0722169
2010 359 182732.9 male Vorarlberg 12.973259 3.0907951
2010 374 194622.1 female Vorarlberg 19.883637 3.7366664
2010 440 253143.7 male Salzburg 9.156964 1.8991413
2010 484 282307.3 female Salzburg 17.939382 2.5337406
2010 517 268581.4 male Carinthia 10.552148 2.0564400
2010 549 260564.0 NA Burgenland 19.539836 1.7962051
2010 561 295066.6 female Carinthia 15.392924 1.9761700
2010 650 339566.5 male Tyrol 12.857542 2.2576659
2010 667 362332.5 female Tyrol 17.604861 2.0523531
2010 733 377355.0 NA Vorarlberg 16.537310 3.2682348
2010 924 535451.0 NA Salzburg 13.787343 2.2561119
2010 1078 563648.0 NA Carinthia 13.086268 1.7480745
2010 1128 571011.7 male Styria 11.671247 1.5091092
2010 1132 774405.4 male Vienna 15.590616 1.3336286
2010 1167 596033.3 female Styria 16.964539 1.3880995
2010 1190 824525.6 female Vienna 18.778813 0.9322183
2010 1317 701899.0 NA Tyrol 15.308191 2.0314532
2010 1363 684272.5 male Upper Austria 9.074690 1.1708532
2010 1387 772593.2 female Lower Austria 16.372949 1.8277282
2010 1417 783115.8 male Lower Austria 11.348283 1.6402739
2010 1442 737347.5 female Upper Austria 12.574205 0.7678252
2010 2295 1167045.0 NA Styria 14.374637 1.3554432
2010 2322 1598931.0 NA Vienna 17.234683 1.0438717
2010 2804 1555709.0 NA Lower Austria 13.843623 1.6944545
2010 2805 1421620.0 NA Upper Austria 10.889773 0.8676276
2010 7267 3979571.7 male NA 12.026600 0.5814477
2010 7560 4202650.3 female NA 16.733508 0.4651848
2010 14827 8182222.0 NA NA 14.444218 0.4710230
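
The row counts given for the three options can be checked with a few lines of base R. The helper count_rows() is hypothetical and simply evaluates the three formulas above.

```r
# Evaluate the three row-count formulas for given numbers of periods,
# regions and genders (hypothetical helper, not part of surveysd).
count_rows <- function(n_periods, n_regions, n_genders) {
  c(
    option1 = n_periods * (n_regions + n_genders + 1),
    option2 = n_periods * (n_regions * n_genders + 1),
    option3 = n_periods * (n_regions * n_genders + n_regions + n_genders + 1)
  )
}

# One reference period, nine regions, two genders, as in the tables above.
rows_2010 <- count_rows(n_periods = 1, n_regions = 9, n_genders = 2)
```

Under the same formulas, keeping all eight reference periods instead of only 2010 would multiply each count by eight.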