Welcome to SPEW (Synthetic Population and Ecosystems of the World). This guide is for use in for basic usage of the R
package spew. This vignette will walk through the set-up of the fictitious human ecosystem of Tartanville. We show how to quickly ‘spew’ a synthetic ecosystem; sample agents using different methods for location assignments, sample agents using different methods for population characteristic assignments, discuss the essential input data for SPEW, expand upon the supplementary data format required for SPEW, and briefly describe some of the more advanced functions of SPEW.
SPEW uses random sampling to sample both locations of residence for the synthetic individuals and individual records from the microdata. SPEW is modular in that it can use different methods of sampling to emphasize different aspects of the synthetic ecosystem. A few methods are demonstrated here. The methods in bold are the default options.
Open up an R
session and run the following commands.
install.packages("spew")
Next, load the spew library into the R-session with:
library(spew)
To plot with functions in SPEW, make sure ggplot2
is installed.
data(tartanville)
tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
tartanville$pums_h, tartanville$pums_p)
plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville")
out <- summarize_spew_out(tartanville_syneco, vars_to_sum_h = c("puma_id"),
vars_to_sum_p = c("SEX"),
vars_to_sum_env = NULL, top_region_id = "Tartanville")
print(out)
g <- plot_characteristic_proportions(feature_name = "Sex", legend_name = "Sex",
feature_df = out$SEX,
category_names = c("Male", "Female"),
text_size = 10,
region_colors = c("lightslateblue", "maroon1"))
g <- plot_pop_totals(feature_df = out$pop_totals, type = "n_people")
In SPEW, we assign locations to our synthetic agents. We currently support two methods of location assignment:
Uniform sampling of locations means sampling uniformly within the boundaries of a region from the inputted shapefile. Population density is maintained at a macro-level but may not accurately reflect the actual population density of a region. For instance, synthetic individuals may be placed in lakes by mistake under this method.
Ex. As shown in the ‘A true quickstart’ tab.
Roads-based sampling requires the shapefile
argument to be a list with two entries, the first as the boundaries
and the second as the roads
. The first entry should be a shapefile of polygons, and the second, a shapefile of lines. We see that in the resulting synthetic ecosystem, the agents are assigned locations closer to the roads than in uniform sampling. We can adjust the variability of how close the agents are to the roads by adjusting roads_noise
. Larger values result in larger variance.
Ex.
data(tartanville)
shapefile <- list(boundaries = tartanville$shapefile, roads = tartanville$roads)
tartanville_syneco <- spew(tartanville$pop_table, shapefile,
tartanville$pums_h, tartanville$pums_p,
locations_method = "roads", road_noise = .05)
## [1] "Place has 0 Households!"
plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville")
In SPEW, we assign population characteristics to our synthetic agents. We currently support three methods of population characteristics assignment:
Uniform sampling of population characteristics means sampling from the microdata where each record is given equal weight. Thus, if the microdata is thought to be representative of the regions, then the resulting synthetic individuals should have close to the correct distribution of population characteristics. If the microdata is not representative, then we must implement other sampling methods.
Ex. As shown in the ‘A true quickstart’ tab.
Assume we have the first moment(s) of a continuous or ordered variable(s)’s distribution for every sub-region in the region of interest (e.g. average household size for every block in Tartanville). Moment matching is a method to calculate weights for the microdata records so that after sampling, the sample average of the variable(s) will be matched to the given averages. We demonstrate with the average household size (NP) of Tartanville. The average household sizes have been exaggerated here as we are generating a small synthetic population.
Ex.
We first make the moments
object for use in SPEW. Here, the moments object is the average household size for each block. The function make_mm_obj
puts the data necessary for MM in the right format. The format is discussed more in another section. The supplementary_data
is thus a list including the moments
outputted from make_mm_obj
.
NP_avg <- c(3.2, 0, 6.0,
2.0, 3.2, 3.1,
4.0, 4.8, 3.9)
supplementary_data <- list(moments = make_mm_obj(moments_list =
list(mom1 = data.frame(place_id = paste0("T", 1:9),
puma_id = "T",
NP = NP_avg)),
assumption = "independence",
nMom = 1, type = "cont"))
And then we use SPEW, with the additional library quadprog
, which is required for MM. Note we have supplemented spew
with both a supplementary_data
argument and the sampling method of 'mm'
. The outputted synthetic ecosystem is in the same format. We see that MM matches the inputted average household sizes well, especially compared to uniform sampling.
data(tartanville)
tartanville_syneco_mm <- spew(pop_table = tartanville$pop_table,
shapefile = tartanville$shapefile,
pums_h = tartanville$pums_h,
pums_p = tartanville$pums_p,
marginals = supplementary_data$moments,
sampling_method = "mm")
## [1] "Place has 0 Households!"
non_empty_regions <- sapply(tartanville_syneco_mm, class) == "list"
synthetic_NP_avg <- sapply(tartanville_syneco_mm[non_empty_regions], function(ll){
mean(ll$households$NP, na.rm = TRUE)
})
sum(abs(synthetic_NP_avg - NP_avg[-2]))
## [1] 1.749603
# Comparing to uniform sampling
synthetic_NP_avg_unif<- sapply(tartanville_syneco[non_empty_regions], function(ll){
mean(ll$households$NP, na.rm = TRUE)
})
sum(abs(synthetic_NP_avg_unif- NP_avg[-2]))
## [1] 6.640278
Iterative Proportional Fitting is a method discovered by Deming and Stephan (1941) and implemented as a sampling scheme by Beckman et al. (1996). The idea behind IPF is filling in a contingency table where the marginal totals are known. As such, an important step is “cutting” our variables into discrete categories if this has not already been done.
In SPEW, we do this by creating a marginals
object which contains information both how to cut our variables into categories and how many of each category we expect in each region for each marginal variable. We demonstrate this below.
Ex. We must make the marginals
object for use of the IPF sampling method. We create two marginal objects, one for household income and the other for head of household race and combine them together for use in SPEW.
## Income
var_name <- "HHINC"
## How to cut the variable
type <- "ord"
bounds <- data.frame(lower = c(0, 50), upper = c(49, Inf))
category_name <- c("HHINC_0-49", "HHINC_50-Inf")
## How often we expect to see each category for each region.
df <- data.frame(place_id = paste0("T", 1:9), v1 = c(30, 0, 5, 10, 13, 9, 2, 1, 5))
df$v2 <- tartanville$pop_table$n_house - df$v1
#
ipf_obj_hhinc<- make_ipf_obj(var_name, type, bounds, category_name, df = df)
ipf_obj_hhinc
## $HHINC
## $HHINC$df
## place_id HHINC_0-49 HHINC_50-Inf
## 1 T1 30 2
## 2 T2 0 0
## 3 T3 5 4
## 4 T4 10 2
## 5 T5 13 5
## 6 T6 9 1
## 7 T7 2 3
## 8 T8 1 2
## 9 T9 5 2
##
## $HHINC$type
## [1] "ord"
##
## $HHINC$lookup
## marg_names lower upper
## 1 HHINC_0-49 0 49
## 2 HHINC_50-Inf 50 Inf
## Head of Household Race
var_name <- c("RAC1P")
type <- "cat"
bounds <- data.frame(lower = c(1, 2), upper = c(1, 2))
category_name <- c("Tartan", "Argyle")
df2 <- data.frame(place_id = paste0("T", 1:9), v1 = c(28, 0, 4, 1, 5, 8, 2, 1, 3))
df2$v2 <- tartanville$pop_table$n_house - df2$v1
ipf_obj_rac1p <- make_ipf_obj(var_name, type, bounds, category_name, df = df2)
# Combine both together
ipf_obj <- list(HHINC = ipf_obj_hhinc[[1]], RAC1P = ipf_obj_rac1p[[1]])
supplementary_data <- list(moments = ipf_obj)
As demonstration of how we cut our variables, we use a function called align_pums
.
pums_h <- align_pums(tartanville$pums_h, ipf_obj, suffix = "_ipf") # split into categories
knitr::kable(head(pums_h))
SERIALNO | HHINC | NP | RAC1P | puma_id | HHINC_ipf | RAC1P_ipf |
---|---|---|---|---|---|---|
1 | 20 | 2 | 1 | T | HHINC_0-49 | Tartan |
2 | 42 | 6 | 1 | T | HHINC_0-49 | Tartan |
3 | 42 | 4 | 2 | T | HHINC_0-49 | Argyle |
4 | 108 | 4 | 1 | T | HHINC_50-Inf | Tartan |
5 | 64 | 2 | 1 | T | HHINC_50-Inf | Tartan |
We are now ready to perform IPF-sampling and do so below. SPEW relies on the mipfp
package. We find that IPF matches the table we gave as expected number of households for each category in each marginal variable quite well. Note: IPF does not always converge, generally due to having a cell in the contingency with value 0.
tartanville_syneco_ipf <- spew(tartanville$pop_table, tartanville$shapefile,
pums_h, tartanville$pums_p,
marginals = supplementary_data$moments,
sampling_method = "ipf")
## [1] "Place has 0 Households!"
## Warning in mipfp::Ipfp(seed = seed, target.list = target_list, target.data = target_data): IPFP did not converged after 1000 iteration(s)!
## This migh be due to 0 cells in the seed, maximum number
## of iteration too low or tolerance too small
## Warning in mipfp::Ipfp(seed = seed, target.list = target_list, target.data = target_data): IPFP did not converged after 1000 iteration(s)!
## This migh be due to 0 cells in the seed, maximum number
## of iteration too low or tolerance too small
out <- summarize_spew_out(tartanville_syneco_ipf, vars_to_sum_h = c("HHINC_ipf", "RAC1P_ipf"),
vars_to_sum_p = c("SEX"),
vars_to_sum_env = NULL, top_region_id = "Tartanville",
marginals = supplementary_data$moments)
sum(abs(as.matrix(out$HHINC_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df[-2, -1])))
## [1] 6
sum(abs(as.matrix(out$RAC1P_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df2[-2, -1])))
## [1] 40
We see we almost match almost perfectly on household income and very closely on race.
We can also assign environments to the population of Tartanville. Environments are locations where the agents interact, both with one another, and possibly the environment itself.
Ex. We assign the children between ages 5 and 18 in T1 to one of the two schools in Tartanville.
knitr::kable(tartanville$environments)
ID | Type | longitude | latitude | capacity |
---|---|---|---|---|
Andrew High | School | 7.0 | 1.0 | 10 |
Maggie Mo High | School | 7.4 | 6.5 | 40 |
Steel Factory | Work | 1.4 | 4.0 | 100 |
tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
tartanville$pums_h, tartanville$pums_p)
## [1] "Place has 0 Households!"
plot_syneco(tartanville, tartanville_syneco,
region_name = "Tartanville")
We first subset the data frame of students in T1.
t1_people <- tartanville_syneco[[1]]$people
t1_students <- subset(t1_people, subset = (t1_people$AGEP >= 5 & t1_people$AGEP <= 18))
places <- subset(tartanville$environments, tartanville$environments$Type == "School")
We then assign the schools to the children, first without regard to capacity.
t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school")
table(t1_assignments$school)
##
## Andrew High Maggie Mo High
## 23 19
These values are consistent as Andrew High (T7) is slightly closer to T1 than Maggie Mo High (T9), but only by a small distance.
However, we note that Maggie Mo High has about 4 times the capacity than Andrew High. If we take capacity into account, then we have more students going to Maggie Mo High. The user is encouraged to provide their own distance and capacity functions. The details may be found in the documentation of assign_place_cords
.
t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school", method = "capacity")
table(t1_assignments$school)
##
## Andrew High Maggie Mo High
## 10 32
SPEW requires three essential data inputs connected through the variables place_id
and puma_id
.
Input Data | Description | Required Variables | Format |
---|---|---|---|
Population Counts | Table of number of individuals per region. | place_id , puma_id |
.csv |
Shapefile | Spatial boundaries of the regions in Population Counts. | place_id |
.shp |
Microdata | Table of individual records with individual-level characteristics. Also known as Public Use | ||
Microdata Samples (PUMS). | We need both household-level microdata (pums_h ) and individual-level microdata (pums_p ). |
puma_id |
.csv |
Given a row in the table of population counts with the variables place_id
and puma_id
, one should be able to find the unique region place_id
in the shapefile that matches that of the row along with the microdata records with the same puma_id
.
Additionally, the above may be combined into a list along with a roads
variable, environments
variable and a supplementary_data
variable to form an object used to spew the final synthetic ecosystem, e.g. the format of tartanville
data(tartanville)
names(tartanville)
## [1] "pop_table" "shapefile" "pums_h" "pums_p"
## [5] "roads" "environments"
class(tartanville)
## [1] "list"
lapply(tartanville, class)
## $pop_table
## [1] "data.frame"
##
## $shapefile
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
##
## $pums_h
## [1] "data.frame"
##
## $pums_p
## [1] "data.frame"
##
## $roads
## [1] "SpatialLinesDataFrame"
## attr(,"package")
## [1] "sp"
##
## $environments
## [1] "data.frame"
Due to the increased complexity of the data required for more sophisticated sampling methods, we describe here the form supplementary data required for SPEW to function.
roads
)The roads are in the form of spatial lines.
data(tartanville)
class(tartanville$roads)
## [1] "SpatialLinesDataFrame"
## attr(,"package")
## [1] "sp"
moments
)The moments object is created through the make_mm_obj
function. This function requires a moments_list
argument which has the following format.
mom1
a data frame with the columns place_id
, puma_id
and var1
through var{p}
which are the average for var1
through var{p}
respectively for the given place.
assumption
is either “independence” or “joint” which effects the sampling.
nMom
should be 1 at this moment in time.
type
should be “cont” for continuous variables.
moments_list <- list(mom1 = data.frame(place_id = {my.place.ids}, puma_id = {my.pumas},
var1 = {var1.avg}, var2 = {var2.avg}, ..., var{p} ={varp.avg}),
assumption = {my.assumption},
nMom = 1, type = "cont")
Finally, the moments
object is added to the list of supplementary data,
supplementary_data <- list(moments = moments)
marginals
)The marginals
requires the most data. The variable marginals
is itself a list such as
list(var1 = ipf_obj_var1[[1]], var2 = ipf_obj_var2[[1]], var{p} = ipf_obj_varp[[1]])
The names of the list must be names that are either found in resulting synthetic households or synthetic persons.
The individual IPF objects are created using the make_ipf_obj
function requiring the following arguments:
var_name
name of the variable that must match a column name of the resulting synthetic households or synthetic persons.
bounds
either “ord” for ordinal or “cat” for categorical, the type of variable to be converted to
category_name
short name of the category, will be visible to person. Either length one and the bounds will be pasted to it or length of the number of rows of the bounds with names of your choice.
df
data frame with place_id
and category counts as columns
Finally, the marginals
object is added to the list of supplementary data,
supplementary_data <- list(marginals = marginals)