# Introduction

Welcome to SPEW (Synthetic Population and Ecosystems of the World). This guide is for use in for basic usage of the R package spew. This vignette will walk through the set-up of the fictitious human ecosystem of Tartanville. We show how to quickly ‘spew’ a synthetic ecosystem; sample agents using different methods for location assignments, sample agents using different methods for population characteristic assignments, discuss the essential input data for SPEW, expand upon the supplementary data format required for SPEW, and briefly describe some of the more advanced functions of SPEW.

SPEW uses random sampling to sample both locations of residence for the synthetic individuals and individual records from the microdata. SPEW is modular in that it can use different methods of sampling to emphasize different aspects of the synthetic ecosystem. A few methods are demonstrated here. The methods in bold are the default options.

# A true quickstart

Open up an R session and run the following commands.

  install.packages("spew")

Next, load the spew library into the R-session with:

library(spew)

## Tartanville: SPEW basics

To plot with functions in SPEW, make sure ggplot2 is installed.

data(tartanville)

tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
tartanville$pums_h, tartanville$pums_p)

plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville")

out <-  summarize_spew_out(tartanville_syneco, vars_to_sum_h = c("puma_id"),
vars_to_sum_p = c("SEX"),
vars_to_sum_env = NULL, top_region_id = "Tartanville")
print(out)

g <- plot_characteristic_proportions(feature_name = "Sex", legend_name = "Sex",
feature_df = out$SEX, category_names = c("Male", "Female"), text_size = 10, region_colors = c("lightslateblue", "maroon1")) g <- plot_pop_totals(feature_df = out$pop_totals, type = "n_people")

# Sampling of locations

In SPEW, we assign locations to our synthetic agents. We currently support two methods of location assignment:

1. Uniform sampling

## Uniform sampling

Uniform sampling of locations means sampling uniformly within the boundaries of a region from the inputted shapefile. Population density is maintained at a macro-level but may not accurately reflect the actual population density of a region. For instance, synthetic individuals may be placed in lakes by mistake under this method.

Ex. As shown in the ‘A true quickstart’ tab.

Roads-based sampling requires the shapefile argument to be a list with two entries, the first as the boundaries and the second as the roads. The first entry should be a shapefile of polygons, and the second, a shapefile of lines. We see that in the resulting synthetic ecosystem, the agents are assigned locations closer to the roads than in uniform sampling. We can adjust the variability of how close the agents are to the roads by adjusting roads_noise. Larger values result in larger variance.

Ex.

data(tartanville)
shapefile <- list(boundaries = tartanville$shapefile, roads = tartanville$roads)

tartanville_syneco <- spew(tartanville$pop_table, shapefile, tartanville$pums_h, tartanville$pums_p, locations_method = "roads", road_noise = .05) ## [1] "Place has 0 Households!" plot_syneco(tartanville, tartanville_syneco, region_name = "Tartanville") # Sampling of characteristics In SPEW, we assign population characteristics to our synthetic agents. We currently support three methods of population characteristics assignment: 1. Uniform sampling 2. Moment Matching (MM) 3. Iterative Proportional Fitting (IPF) ## Uniform sampling Uniform sampling of population characteristics means sampling from the microdata where each record is given equal weight. Thus, if the microdata is thought to be representative of the regions, then the resulting synthetic individuals should have close to the correct distribution of population characteristics. If the microdata is not representative, then we must implement other sampling methods. Ex. As shown in the ‘A true quickstart’ tab. ## Moment Matching Assume we have the first moment(s) of a continuous or ordered variable(s)’s distribution for every sub-region in the region of interest (e.g. average household size for every block in Tartanville). Moment matching is a method to calculate weights for the microdata records so that after sampling, the sample average of the variable(s) will be matched to the given averages. We demonstrate with the average household size (NP) of Tartanville. The average household sizes have been exaggerated here as we are generating a small synthetic population. Ex. We first make the moments object for use in SPEW. Here, the moments object is the average household size for each block. The function make_mm_obj puts the data necessary for MM in the right format. The format is discussed more in another section. The supplementary_data is thus a list including the moments outputted from make_mm_obj. NP_avg <- c(3.2, 0, 6.0, 2.0, 3.2, 3.1, 4.0, 4.8, 3.9) supplementary_data <- list(moments = make_mm_obj(moments_list = list(mom1 = data.frame(place_id = paste0("T", 1:9), puma_id = "T", NP = NP_avg)), assumption = "independence", nMom = 1, type = "cont")) And then we use SPEW, with the additional library quadprog, which is required for MM. Note we have supplemented spew with both a supplementary_data argument and the sampling method of 'mm'. The outputted synthetic ecosystem is in the same format. We see that MM matches the inputted average household sizes well, especially compared to uniform sampling. data(tartanville) tartanville_syneco_mm <- spew(pop_table = tartanville$pop_table,
shapefile = tartanville$shapefile, pums_h = tartanville$pums_h,
pums_p = tartanville$pums_p, marginals = supplementary_data$moments,
sampling_method = "mm")
## [1] "Place has 0 Households!"
non_empty_regions <- sapply(tartanville_syneco_mm, class) == "list"
synthetic_NP_avg <- sapply(tartanville_syneco_mm[non_empty_regions], function(ll){
mean(ll$households$NP, na.rm = TRUE)
})
sum(abs(synthetic_NP_avg - NP_avg[-2]))
## [1] 1.749603
# Comparing to uniform sampling
synthetic_NP_avg_unif<- sapply(tartanville_syneco[non_empty_regions], function(ll){
mean(ll$households$NP, na.rm = TRUE)
})
sum(abs(synthetic_NP_avg_unif- NP_avg[-2]))
## [1] 6.640278

## Iterative Proportional Fitting

Iterative Proportional Fitting is a method discovered by Deming and Stephan (1941) and implemented as a sampling scheme by Beckman et al. (1996). The idea behind IPF is filling in a contingency table where the marginal totals are known. As such, an important step is “cutting” our variables into discrete categories if this has not already been done.

In SPEW, we do this by creating a marginals object which contains information both how to cut our variables into categories and how many of each category we expect in each region for each marginal variable. We demonstrate this below.

Ex. We must make the marginals object for use of the IPF sampling method. We create two marginal objects, one for household income and the other for head of household race and combine them together for use in SPEW.

## Income
var_name <- "HHINC"
## How to cut the variable
type <- "ord"
bounds <- data.frame(lower = c(0, 50), upper = c(49, Inf))
category_name <- c("HHINC_0-49", "HHINC_50-Inf")
## How often we expect to see each category for each region.
df <- data.frame(place_id = paste0("T", 1:9),  v1 = c(30, 0, 5, 10, 13, 9, 2, 1, 5))
df$v2 <- tartanville$pop_table$n_house - df$v1
#
ipf_obj_hhinc<- make_ipf_obj(var_name, type, bounds, category_name, df = df)

ipf_obj_hhinc
## $HHINC ##$HHINC$df ## place_id HHINC_0-49 HHINC_50-Inf ## 1 T1 30 2 ## 2 T2 0 0 ## 3 T3 5 4 ## 4 T4 10 2 ## 5 T5 13 5 ## 6 T6 9 1 ## 7 T7 2 3 ## 8 T8 1 2 ## 9 T9 5 2 ## ##$HHINC$type ## [1] "ord" ## ##$HHINC$lookup ## marg_names lower upper ## 1 HHINC_0-49 0 49 ## 2 HHINC_50-Inf 50 Inf ## Head of Household Race var_name <- c("RAC1P") type <- "cat" bounds <- data.frame(lower = c(1, 2), upper = c(1, 2)) category_name <- c("Tartan", "Argyle") df2 <- data.frame(place_id = paste0("T", 1:9), v1 = c(28, 0, 4, 1, 5, 8, 2, 1, 3)) df2$v2 <- tartanville$pop_table$n_house - df2v1 ipf_obj_rac1p <- make_ipf_obj(var_name, type, bounds, category_name, df = df2) # Combine both together ipf_obj <- list(HHINC = ipf_obj_hhinc[[1]], RAC1P = ipf_obj_rac1p[[1]]) supplementary_data <- list(moments = ipf_obj) As demonstration of how we cut our variables, we use a function called align_pums. pums_h <- align_pums(tartanvillepums_h, ipf_obj, suffix = "_ipf") # split into categories
knitr::kable(head(pums_h))
SERIALNO HHINC NP RAC1P puma_id HHINC_ipf RAC1P_ipf
1 20 2 1 T HHINC_0-49 Tartan
2 42 6 1 T HHINC_0-49 Tartan
3 42 4 2 T HHINC_0-49 Argyle
4 108 4 1 T HHINC_50-Inf Tartan
5 64 2 1 T HHINC_50-Inf Tartan

We are now ready to perform IPF-sampling and do so below. SPEW relies on the mipfp package. We find that IPF matches the table we gave as expected number of households for each category in each marginal variable quite well. Note: IPF does not always converge, generally due to having a cell in the contingency with value 0.

tartanville_syneco_ipf <- spew(tartanville$pop_table, tartanville$shapefile,
pums_h, tartanville$pums_p, marginals = supplementary_data$moments,
sampling_method = "ipf")
## [1] "Place has 0 Households!"
## Warning in mipfp::Ipfp(seed = seed, target.list = target_list, target.data = target_data): IPFP did not converged after 1000 iteration(s)!
##             This migh be due to 0 cells in the seed, maximum number
##             of iteration too low or tolerance too small

## Warning in mipfp::Ipfp(seed = seed, target.list = target_list, target.data = target_data): IPFP did not converged after 1000 iteration(s)!
##             This migh be due to 0 cells in the seed, maximum number
##             of iteration too low or tolerance too small
out <-  summarize_spew_out(tartanville_syneco_ipf, vars_to_sum_h = c("HHINC_ipf", "RAC1P_ipf"),
vars_to_sum_p = c("SEX"),
vars_to_sum_env = NULL, top_region_id = "Tartanville",
marginals = supplementary_data$moments) sum(abs(as.matrix(out$HHINC_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df[-2, -1])))
## [1] 6
sum(abs(as.matrix(out$RAC1P_ipf[order(out$HHINC_ipf$region), -3]) - as.matrix(df2[-2, -1]))) ## [1] 40 We see we almost match almost perfectly on household income and very closely on race. # Assigning environments ## General assignment function We can also assign environments to the population of Tartanville. Environments are locations where the agents interact, both with one another, and possibly the environment itself. Ex. We assign the children between ages 5 and 18 in T1 to one of the two schools in Tartanville. knitr::kable(tartanville$environments)
ID Type longitude latitude capacity
Andrew High School 7.0 1.0 10
Maggie Mo High School 7.4 6.5 40
Steel Factory Work 1.4 4.0 100
tartanville_syneco <- spew(tartanville$pop_table, tartanville$shapefile,
tartanville$pums_h, tartanville$pums_p)
## [1] "Place has 0 Households!"
plot_syneco(tartanville, tartanville_syneco,
region_name = "Tartanville")

We first subset the data frame of students in T1.

t1_people <- tartanville_syneco[[1]]$people t1_students <- subset(t1_people, subset = (t1_people$AGEP >= 5 & t1_people$AGEP <= 18)) places <- subset(tartanville$environments, tartanville$environments$Type == "School")

We then assign the schools to the children, first without regard to capacity.

t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school")
table(t1_assignments$school) ## ## Andrew High Maggie Mo High ## 23 19 These values are consistent as Andrew High (T7) is slightly closer to T1 than Maggie Mo High (T9), but only by a small distance. However, we note that Maggie Mo High has about 4 times the capacity than Andrew High. If we take capacity into account, then we have more students going to Maggie Mo High. The user is encouraged to provide their own distance and capacity functions. The details may be found in the documentation of assign_place_cords. t1_assignments <- assign_place_coords(t1_students, places = places, place_name = "school", method = "capacity") table(t1_assignments$school)
##
##    Andrew High Maggie Mo High
##             10             32

# Essential input data format

SPEW requires three essential data inputs connected through the variables place_id and puma_id.

Input Data Description Required Variables Format
Population Counts Table of number of individuals per region. place_id, puma_id .csv
Shapefile Spatial boundaries of the regions in Population Counts. place_id .shp
Microdata Table of individual records with individual-level characteristics. Also known as Public Use
Microdata Samples (PUMS). We need both household-level microdata (pums_h) and individual-level microdata (pums_p). puma_id .csv

Given a row in the table of population counts with the variables place_id and puma_id, one should be able to find the unique region place_id in the shapefile that matches that of the row along with the microdata records with the same puma_id.

Additionally, the above may be combined into a list along with a roads variable, environments variable and a supplementary_data variable to form an object used to spew the final synthetic ecosystem, e.g. the format of tartanville

data(tartanville)
names(tartanville)
## [1] "pop_table"    "shapefile"    "pums_h"       "pums_p"
## [5] "roads"        "environments"
class(tartanville)
## [1] "list"
lapply(tartanville, class)
## $pop_table ## [1] "data.frame" ## ##$shapefile
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
##
## $pums_h ## [1] "data.frame" ## ##$pums_p
## [1] "data.frame"
##
## $roads ## [1] "SpatialLinesDataFrame" ## attr(,"package") ## [1] "sp" ## ##$environments
## [1] "data.frame"

# Supplementary data format

## Supplementary data formats

Due to the increased complexity of the data required for more sophisticated sampling methods, we describe here the form supplementary data required for SPEW to function.

## Road-based sampling (roads)

The roads are in the form of spatial lines.

data(tartanville)
class(tartanville\$roads)
## [1] "SpatialLinesDataFrame"
## attr(,"package")
## [1] "sp"

## Moment Matching (MM) (moments)

The moments object is created through the make_mm_obj function. This function requires a moments_list argument which has the following format.

• mom1 a data frame with the columns place_id, puma_id and var1 through var{p} which are the average for var1 through var{p} respectively for the given place.

• assumption is either “independence” or “joint” which effects the sampling.

• nMom should be 1 at this moment in time.

• type should be “cont” for continuous variables.

moments_list <- list(mom1 = data.frame(place_id = {my.place.ids}, puma_id = {my.pumas},
var1 = {var1.avg}, var2 = {var2.avg}, ..., var{p} ={varp.avg}),
assumption = {my.assumption},
nMom = 1, type = "cont")

Finally, the moments object is added to the list of supplementary data,

supplementary_data <- list(moments = moments)

## Iterative Proportional Fitting (IPF) (marginals)

The marginals requires the most data. The variable marginals is itself a list such as

list(var1 = ipf_obj_var1[[1]], var2 = ipf_obj_var2[[1]], var{p} = ipf_obj_varp[[1]])

The names of the list must be names that are either found in resulting synthetic households or synthetic persons.

The individual IPF objects are created using the make_ipf_obj function requiring the following arguments:

• var_name name of the variable that must match a column name of the resulting synthetic households or synthetic persons.

• bounds either “ord” for ordinal or “cat” for categorical, the type of variable to be converted to

• category_name short name of the category, will be visible to person. Either length one and the bounds will be pasted to it or length of the number of rows of the bounds with names of your choice.

• df data frame with place_id and category counts as columns

Finally, the marginals object is added to the list of supplementary data,

supplementary_data <- list(marginals = marginals)