This vignette documents the method for calculating DSRs using the PHEindicatormethods::phe_dsr function which calculates DSRs and their confidence limits using the Dobson method.
The function can be used to calculate DSRs for grouping single or multiple geographic areas/ genders/ timeperiods/ indicators in a single execution and takes the following arguments as inputs:
Argument | Type | Definition | Default value |
---|---|---|---|
data | data.frame | data.frame containing the data to be standardised | none |
x | unquoted string | field name from data containing the observed number of events for each standardisation category (eg ageband) within each grouping set (eg area or indicator) | none |
n | unquoted string | field name from data containing the populations for each standardisation category (eg ageband) within each grouping set (eg area or indicator) | none |
stdpop | unquoted string | standard populations for each standardisation category (eg age band) specified as a field name from data or a vector. | esp2013 |
stdpoptype | quoted string | whether the stdpop argument has been specified as a vactor or a field name | “vector” |
confidence | numeric value | the required level of confidence expressed as a number between 0.9 and 1 or 90 and 100 | 0.95 |
multiplier | numeric value | the multiplier used to express the final values (eg 100,000 = rate per 100,000 | 100,000 |
Note that the European Standard Population 2013 divided into 19 five-year agebands (0-4, 5-9, 10-14, …..90+) is provided in vector format within the package and will be used as the default for the stdpop argument
If multiple DSRs are required from a single data frame then the data frame must be grouped prior to inputting to the function - this is demonstrated below
In a real situation we’d most likely be sourcing our numerators and denominators from different places so let’s create them separately for now.
pops <- data.frame(indicator = rep(c("Ind1","Ind2","Ind3","Ind4"), each = 19 * 2 * 5),
period = rep(2012:2016, each = 19 * 2),
region = rep(rep(c("Area1","Area2"),each=19), times = 5),
ageband = rep(c(0,5,10,15,20,25,30,35,40,45,50,
55,60,65,70,75,80,85,90),times = 10),
pop = sample(10000:20000, 19 * 2 * 5 * 4, replace = TRUE))
head(pops)
#> indicator period region ageband pop
#> 1 Ind1 2012 Area1 0 13896
#> 2 Ind1 2012 Area1 5 17041
#> 3 Ind1 2012 Area1 10 13275
#> 4 Ind1 2012 Area1 15 18545
#> 5 Ind1 2012 Area1 20 11353
#> 6 Ind1 2012 Area1 25 18088
deaths <- data.frame(indicator = rep(c("Ind1","Ind2","Ind3","Ind4"), each = 19 * 2 * 5),
period = rep(2012:2016, each = 19 * 2),
region = rep(rep(c("Area1","Area2"),each=19), times = 5),
ageband = rep(c(0,5,10,15,20,25,30,35,40,45,50,
55,60,65,70,75,80,85,90),times = 10),
dths = sample(200, 19 * 2 * 5 * 4, replace=TRUE))
head(deaths)
#> indicator period region ageband dths
#> 1 Ind1 2012 Area1 0 1
#> 2 Ind1 2012 Area1 5 181
#> 3 Ind1 2012 Area1 10 121
#> 4 Ind1 2012 Area1 15 4
#> 5 Ind1 2012 Area1 20 11
#> 6 Ind1 2012 Area1 25 194
Our data contains records for 4 different indicators, 5 time periods and 2 geographies so let’s calculate a DSR for each combination - that’s 40 separate DSRs from a single execution of the phe_dsr function……
First we’ll need to join our datasets to create the input data.frame for the function and specify the grouping sets:
It is important that your data meets the following criteria in order for the phe_dsr function to work so it is wise to check this before we move on.
1. Each grouping set within your data must contain an equal number of records.
The phe_dsr function has built in error handling to check this requirement, or you can check your data manually using code like this:
# check equal number of records in each grouping set - eyeball check
summarise(df,n=n())
#> # A tibble: 40 x 4
#> # Groups: indicator, period [?]
#> indicator period region n
#> <fct> <int> <fct> <int>
#> 1 Ind1 2012 Area1 19
#> 2 Ind1 2012 Area2 19
#> 3 Ind1 2013 Area1 19
#> 4 Ind1 2013 Area2 19
#> 5 Ind1 2014 Area1 19
#> 6 Ind1 2014 Area2 19
#> 7 Ind1 2015 Area1 19
#> 8 Ind1 2015 Area2 19
#> 9 Ind1 2016 Area1 19
#> 10 Ind1 2016 Area2 19
#> # ... with 30 more rows
# or alternatively the following should return TRUE
n_distinct(select(ungroup(summarise(df,n=n())),n)) == 1
#> [1] TRUE
2. If you are supplying your standard population in vector format (stdpoptype=“vector”) then this vector must also contain the same number of records as each grouping set within your data.
In this example we’re going to use the default esp2013 vector that is provided with the PHEindicatormethods package for our standard population - it contains 19 ordered values representing the 5-year age bands 0-4, 5-9, 10-14….85-89, 90+.
The phe_dsr function has built in error handling to check this requirement, or you can check your data manually using code like this:
# check standard population has same number of records as in each grouping set of data in check 1 above - eyeball check
length(esp2013)
#> [1] 19
# or alternatively the following should return TRUE
pull(slice(select(ungroup(summarise(df,n=n())),n),1)) == length(esp2013)
#> [1] TRUE
3. If you are supplying your standard population in vector format (stdpoptype=“vector”) then the standard population and your data (for each grouping set) must be sorted in the same standardisation category order because the function will join these by position. This would normally mean sorting both the standard population vector and the records for each group within your data by age band from youngest to oldest.
The phe_dsr function does not have any built in error handling to check this requirement as the function does not require the standardisation category labels to be provided in your data. It is therefore the responsibility of the function user to ensure this requirement is met. If the standardisation category labels are included with your data (as in our example) then the following code can be used to check the requirment manually:
# check data is ordered by required agebands from youngest to oldest
all(df$ageband == rep(c(0,5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90)))
#> [1] TRUE
CAUTION: Failure to ensure that this 3rd requirement is met could result in apparent successful execution of the phe_dsr function even though the data may have been incorrectly standardised. This is demonstrated in the sample code below:
Now we’re ready to calculate the DSRs using our correctly ordered df data frame.
By default the function will apply 95% confidence, a 100,000 multiplier and will output just 3 fields against each grouping set - the dsr, the lower confidence limit and the upper confidence limit:
phe_dsr(df, dths, pop)
#> # A tibble: 40 x 6
#> # Groups: indicator, period [20]
#> indicator period region value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Ind1 2012 Area1 689. 654. 725.
#> 2 Ind1 2012 Area2 593. 563. 625.
#> 3 Ind1 2013 Area1 703. 670. 737.
#> 4 Ind1 2013 Area2 660. 625. 696.
#> 5 Ind1 2014 Area1 634. 605. 664.
#> 6 Ind1 2014 Area2 726. 692. 762.
#> 7 Ind1 2015 Area1 770. 734. 807.
#> 8 Ind1 2015 Area2 724. 691. 758.
#> 9 Ind1 2016 Area1 591. 563. 621.
#> 10 Ind1 2016 Area2 746. 713. 780.
#> # ... with 30 more rows
Alternatively, we can add further arguments to specify:
phe_dsr(df, dths, pop, type = "full", confidence = 99.8, multiplier = 10000)
#> # A tibble: 40 x 11
#> # Groups: indicator, period [20]
#> indicator period region total_count total_pop value lowercl uppercl
#> <fct> <int> <fct> <int> <int> <dbl> <dbl> <dbl>
#> 1 Ind1 2012 Area1 1775 274256 68.9 63.5 74.6
#> 2 Ind1 2012 Area2 1722 274842 59.3 54.6 64.3
#> 3 Ind1 2013 Area1 1971 286057 70.3 65.1 75.7
#> 4 Ind1 2013 Area2 1548 279184 66.0 60.6 71.7
#> 5 Ind1 2014 Area1 2116 306649 63.4 58.9 68.2
#> 6 Ind1 2014 Area2 1932 297426 72.6 67.3 78.2
#> 7 Ind1 2015 Area1 1991 291382 77.0 71.4 82.9
#> 8 Ind1 2015 Area2 2110 300204 72.4 67.3 77.8
#> 9 Ind1 2016 Area1 1843 289172 59.1 54.7 63.8
#> 10 Ind1 2016 Area2 2140 294442 74.6 69.4 80.0
#> # ... with 30 more rows, and 3 more variables: confidence <chr>,
#> # statistic <chr>, method <chr>
In some cases you may wish to standardise against a different population to the default esp2013 one provided - such as the 1976 European Standard Population or an age and sex standardised population. There are two ways to specify an alternative standard population:
In the example below, the 1976 European Standard Population (which has 18 age groups) is provided as a vector and then referenced in the function call. To ensure the function works we must also ensure that our data has been broken down into these same 18 age bands (for the purposes of this example I’ve just combined the 85-90 and 90+ age band data into a single 85+ age band from the data.frame we used earlier).
The phe_dsr function can then be executed using a user-defined standard population:
esp1976 <- c(8000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 7000, 6000, 5000, 4000, 3000, 2000, 1000, 1000)
df18 <- df
df18$dths[df18$ageband == 85] <- df18$dths[df18$ageband == 85] + df18$dths[df18$ageband == 90]
df18$pop[df18$ageband == 85] <- df18$pop[df18$ageband == 85] + df18$pop[df18$ageband == 90]
df18 <- filter(df18,ageband != 90)
phe_dsr(df18,dths,pop,stdpop = esp1976)
#> # A tibble: 40 x 6
#> # Groups: indicator, period [20]
#> indicator period region value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Ind1 2012 Area1 668. 632. 704.
#> 2 Ind1 2012 Area2 614. 581. 648.
#> 3 Ind1 2013 Area1 715. 681. 751.
#> 4 Ind1 2013 Area2 653. 617. 690.
#> 5 Ind1 2014 Area1 630. 599. 662.
#> 6 Ind1 2014 Area2 732. 696. 769.
#> 7 Ind1 2015 Area1 829. 789. 871.
#> 8 Ind1 2015 Area2 741. 705. 778.
#> 9 Ind1 2016 Area1 534. 506. 563.
#> 10 Ind1 2016 Area2 742. 707. 778.
#> # ... with 30 more rows
In the example below, the esp2013 standard population is appended to our data frame prior to calling the phe_dsr function. The field name can then be specified in the function call. If stdpop is specified as a field name we must also tell the function this by specifying stdpoptype = “field” as below:
df_with_stdpop <- df %>%
mutate(spop = esp2013)
names(df_with_stdpop)
#> [1] "indicator" "period" "region" "ageband" "pop" "dths"
#> [7] "spop"
phe_dsr(df_with_stdpop, dths, pop, stdpop = spop, stdpoptype = "field")
#> # A tibble: 40 x 6
#> # Groups: indicator, period [20]
#> indicator period region value lowercl uppercl
#> <fct> <int> <fct> <dbl> <dbl> <dbl>
#> 1 Ind1 2012 Area1 689. 654. 725.
#> 2 Ind1 2012 Area2 593. 563. 625.
#> 3 Ind1 2013 Area1 703. 670. 737.
#> 4 Ind1 2013 Area2 660. 625. 696.
#> 5 Ind1 2014 Area1 634. 605. 664.
#> 6 Ind1 2014 Area2 726. 692. 762.
#> 7 Ind1 2015 Area1 770. 734. 807.
#> 8 Ind1 2015 Area2 724. 691. 758.
#> 9 Ind1 2016 Area1 591. 563. 621.
#> 10 Ind1 2016 Area2 746. 713. 780.
#> # ... with 30 more rows
This would be a fairly common scenario - maybe you have Local Authority data and there are no deaths in some of the younger age groups for some of the smaller areas.
Let’s fudge a couple of data frames to represent this. In this example, there are no deaths in the 10-14, 15-20 and 20-14 age bands:
pops2 <- data.frame(ageband = c( 0, 5,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90),
pop = c(30,35,35,35,40,40,45,50,50,50,60,60,70,75,70,60,20,20,15))
deaths2 <- data.frame(ageband = c(0,5,25,30,35,40,45,50,55,60,65,70,75,80,85,90),
dths = c(1,1, 1, 1, 3, 3, 3, 3,10,10,10,10, 8, 8, 8, 8))
If we simply join these data frames to produce the input data frame required for the phe_dsr function then we get NA values in the Deaths column and the function will return an error:
df2 <- left_join(pops2, deaths2, by="ageband")
phe_dsr(df2, dths, pop)
#> Error in if (any(pull(data, !!x) < 0)) {: missing value where TRUE/FALSE needed
The NA values must be replaced with zeros before executing the function: