Big IPUMS data

Minnesota Population Center

2018-09-27

Browsing data on the IPUMS website can be a little like grocery shopping when you’re hungry — you show up to grab a couple things, but everything looks so good, and you end up with an overflowing cart.[1] Unfortunately, when you do this with IPUMS data, your extract may get so large that it doesn’t fit in your computer’s memory.

If you’ve got an extract that’s too big, both the IPUMS website and the ipumsr package have tools to help. There are four basic strategies:

  1. Get more memory.
  2. Reduce the size of your dataset.
  3. Use “chunked” reading.
  4. Use a database.

The IPUMS website has features to help with option 2, and the ipumsr package can help you with options 3 and 4 (option 1 depends on your wallet).

The examples in this vignette rely on the ipumsr, dplyr, and biglm packages (plus DBI, RSQLite, and dbplyr for the database section), and on the example CPS extract used in the ipums-cps vignette. If you want to follow along, follow the instructions in that vignette to make an extract.

library(ipumsr)
library(dplyr)

# To run the full vignette you'll also need the following packages:
installed_biglm <- requireNamespace("biglm")
installed_db_pkgs <- requireNamespace("DBI") &&
  requireNamespace("RSQLite") &&
  requireNamespace("dbplyr")

# Change these filepaths to the filepaths of your downloaded extract
cps_ddi_file <- "cps_00001.xml"
cps_data_file <- "cps_00001.dat"

Option 1: Trade money for convenience

If you’ve got a dataset that’s too big for your RAM, you could always get more. You could upgrade your current computer, buy a new one, or pay for a cloud service like Amazon Web Services or Microsoft Azure (or one of the many other similar services). Both Amazon and Microsoft publish guides for running R on their platforms.

Option 2: Do you really need all of that?

The IPUMS website has many features that will let you reduce the size of your extract. The easiest thing to do is to review your sample and variable selections to see if you can drop some.

If you do need every sample and variable, but your analysis focuses on a specific subset of the data, the IPUMS extract engine has a feature called “Select Cases” that lets you subset on an included variable (for example, you could subset on AGE so that your extract only includes people older than 65, or on EDUCATION to look at only college graduates). In most IPUMS microdata projects, the Select Cases feature is on the “Create Extract” page, as the last step before you submit the extract. If you’ve already submitted the extract, you can click the “revise” link on the “Download or Revise Extracts” page to access the Select Cases feature.

Or, if you would be happy with a random subsample of the data, the IPUMS extract engine has an option to “Customize Sample Size” that will take a random sample. This feature is also available on the “Create Extract” page, as the last step before you submit the extract. Again, if you’ve already submitted your extract, you can access this feature by clicking the “revise” link on the “Download or Revise Extracts” page.

Option 3: Work one chunk at a time

ipumsr has “chunked” versions of the microdata reading functions: read_ipums_micro_chunked() and read_ipums_micro_list_chunked(). These chunked functions let you specify a function that will be applied to each chunk and control how the results from the chunks are combined. This functionality is based on the chunked reading introduced by readr, so it is quite flexible. Below, we’ll outline solutions to three common use cases for IPUMS data: tabulation, regression, and selecting cases.

Chunked tabulation example

Let’s say you want to find the percent of people in the workforce by their self-reported health. Since this extract is small enough to fit in memory, we could just do the following:
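Here is a sketch of that in-memory version, assuming the extract includes the HEALTH and EMPSTAT variables from the ipums-cps vignette; the recoding of EMPSTAT into an “at work” indicator is illustrative.

cps <- read_ipums_micro(cps_ddi_file, data_file = cps_data_file, verbose = FALSE)

cps %>%
  mutate(
    HEALTH = as_factor(HEALTH),
    # Treat respondents whose EMPSTAT label is "At work" as working
    AT_WORK = as_factor(EMPSTAT) == "At work"
  ) %>%
  group_by(HEALTH, AT_WORK) %>%
  summarize(n = n())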

But let’s pretend that we can only store 1,000 rows in memory at a time. In that case, we need to use a chunked function, tabulate within each chunk, and then calculate the counts across all of the chunks.

First we’ll make the callback function, which will take two arguments: x (the data from a chunk) and pos (the position of the chunk, expressed as the line in the input file at which the chunk starts). We’ll only use x, but the callback function must always take both these arguments.
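One way to write it, mirroring the in-memory tabulation sketch above (with the same assumptions about the HEALTH and EMPSTAT variables):

cb_function <- function(x, pos) {
  x %>%
    mutate(
      HEALTH = as_factor(HEALTH),
      AT_WORK = as_factor(EMPSTAT) == "At work"
    ) %>%
    group_by(HEALTH, AT_WORK) %>%
    summarize(n = n())
}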

Next we need to create a callback object. The choice of a callback object depends mainly on how we want to combine the results from applying our callback function to each chunk. In this case, we want to row-bind the data.frames returned by cb_function(). If we didn’t care about attaching IPUMS value labels and other metadata, we could use readr::DataFrameCallback, but ipumsr includes the IpumsDataFrameCallback object that allows you to preserve this metadata.

Callback objects are R6 objects, but you don’t need to be familiar with R6 to use them.[2] For now, all we really need to know is that we create a callback object with the $new() method.
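Here, that looks like:

cb <- IpumsDataFrameCallback$new(cb_function)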

Next we read in the data with the read_ipums_micro_chunked() function, specifying our callback and a chunk_size of 1,000.
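Something like this, reusing the cb object created above:

chunked_tabulations <- read_ipums_micro_chunked(
  cps_ddi_file,
  data_file = cps_data_file,
  callback = cb,
  chunk_size = 1000,
  verbose = FALSE
)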

Now we have a data.frame with the counts by health and work status within each chunk. To get the full table, we just need to sum by health and work status one more time.
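Continuing the sketch:

chunked_tabulations %>%
  group_by(HEALTH, AT_WORK) %>%
  summarize(n = sum(n))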

Chunked regression example

With the biglm package, it is possible to use R to perform a regression on data that is too large to store in memory all at once. The ipumsr package provides a callback designed to make this simple: IpumsBiglmCallback.

Again we’ll use the CPS example, which is small enough that we could keep it in memory. Here’s an example of a regression looking at how hours worked, self-reported health, and age are related among those who are currently working. This is meant as a simple example that ignores many of the complexities in this relationship, so use caution when interpreting the results.
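A sketch of the in-memory version, assuming the extract includes AHRSWORKT (hours worked last week) along with AGE, HEALTH, and EMPSTAT:

cps <- read_ipums_micro(cps_ddi_file, data_file = cps_data_file, verbose = FALSE)

working <- cps %>%
  filter(as_factor(EMPSTAT) == "At work") %>%
  mutate(HEALTH = as_factor(HEALTH))

model <- lm(AHRSWORKT ~ AGE + HEALTH, data = working)
summary(model)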

To do the same regression, but with only 1000 rows loaded at a time, we work in a similar manner.

First we make the IpumsBiglmCallback callback object that specifies both the model and a function to prepare the data.
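Under the same assumptions about the variables as above, that might look like:

library(biglm)

biglm_cb <- IpumsBiglmCallback$new(
  model = AHRSWORKT ~ AGE + HEALTH,
  prep = function(x, pos) {
    # Same data preparation as the in-memory version, applied per chunk
    x %>%
      filter(as_factor(EMPSTAT) == "At work") %>%
      mutate(HEALTH = as_factor(HEALTH))
  }
)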

And then we read the data using read_ipums_micro_chunked(), passing the callback that we just made.
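Reusing the biglm_cb object from above:

chunked_model <- read_ipums_micro_chunked(
  cps_ddi_file,
  data_file = cps_data_file,
  callback = biglm_cb,
  chunk_size = 1000,
  verbose = FALSE
)

summary(chunked_model)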

Chunked “select cases” example

Sometimes you may want to select a subset of the data before reading it into R. The IPUMS website has this functionality built in, which will usually be faster (this “Select Cases” feature is described in Option 2 above). Unix commands like awk and sed will also generally be much faster than these R-based solutions. However, it is possible to use the chunked functions to create a subset, which can be convenient if you want to subset on logic that would be hard to express in the IPUMS extract system or Unix tools.
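For example, here’s a sketch that keeps only respondents older than 65 (assuming an AGE variable in the extract), discarding the rest of each chunk as it is read:

cps_older <- read_ipums_micro_chunked(
  cps_ddi_file,
  data_file = cps_data_file,
  callback = IpumsDataFrameCallback$new(function(x, pos) {
    filter(x, AGE > 65)
  }),
  chunk_size = 1000,
  verbose = FALSE
)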

Option 4: Use a database

Databases are another option for data that cannot fit in memory as an R data.frame. If you have access to a database on a remote machine, you can easily pull in just the parts of the data you need for your analysis. Even if you have to store the database on your own machine, it may store the data efficiently enough that it fits in your memory, or it may work off your hard drive rather than RAM.

R’s tools for integrating with databases are improving quickly. The DBI package has been updated, dplyr (through dbplyr) provides a frontend that allows you to write the same code for data in a database as you would for a local data.frame, and packages like sparklyr, sparkR, bigrquery and others provide access to the latest and greatest.

There are many different kinds of databases, each with its own strengths, weaknesses, and tradeoffs, so it’s hard to give concrete advice without knowing your specific use case. However, once you’ve chosen a database, there will generally be two steps: importing the data into the database and connecting to it from R.

As an example, we’ll use the RSQLite package to load the data into an in-memory database. RSQLite is great because it is easy to set up, but it is probably not efficient enough to help you if you need to use a database because your data doesn’t fit in memory.
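Setting up the connection is short; a minimal sketch:

library(DBI)
library(RSQLite)

# An in-memory SQLite database; pass a file path instead for on-disk storage
con <- dbConnect(SQLite(), ":memory:")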

Importing data into a database

When using rectangular extracts, your best bet to import IPUMS data into your database is probably going to be a csv file. Most databases support csv importing, and these implementations will generally be well supported since this is a common file format.

However, if you need a hierarchical extract, or your database software doesn’t support the csv format, then you can use the chunking functions to load the data into a database without storing the full data in R.
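For example, readr’s SideEffectChunkCallback can write each chunk to the database as it is read. This sketch uses the con connection created above and an illustrative table name, "cps"; it also assumes ipumsr’s zap_ipums_attributes() to strip the value and variable labels, which database tables can’t store.

read_ipums_micro_chunked(
  cps_ddi_file,
  data_file = cps_data_file,
  callback = readr::SideEffectChunkCallback$new(function(x, pos) {
    # Labels can't be stored in the database, so strip them before writing
    x <- zap_ipums_attributes(x)
    # Create the table on the first chunk, then append subsequent chunks
    DBI::dbWriteTable(con, "cps", x, append = pos != 1)
  }),
  chunk_size = 1000,
  verbose = FALSE
)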

Connecting to a database with dbplyr

The “dbplyr” vignette (which you can access with vignette("dbplyr", package = "dbplyr")) is a good place to start learning about how to connect to a database. Here I’ll just briefly show some examples.
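For instance, with the con connection and the "cps" table loaded above, dplyr verbs are translated to SQL and run inside the database:

library(dbplyr)

cps_db <- tbl(con, "cps")

cps_db %>%
  filter(AGE > 25)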

Though dbplyr shows us a nice preview of the first rows of the result of our query, the data still lives in the database. When using a regular database, in general you’d use the function dplyr::collect() to load in the full results of the query to your R session. However, the database has no concept of IPUMS attributes like value and variable labels, so if you want them, you can use ipums_collect() like so:
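A sketch, rereading the DDI so that the labels can be reattached to the collected data:

cps_ddi <- read_ipums_ddi(cps_ddi_file)

cps_over25 <- cps_db %>%
  filter(AGE > 25) %>%
  ipums_collect(cps_ddi)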

Learning more

Big data is a problem for lots of R users, not just IPUMS users, so there are plenty of resources beyond this vignette to help you out.


  1. Bonus joke: Why is the IPUMS website better than any grocery store? Answer: More free samples.

  2. If you’re interested in learning more about R6, the upcoming revision to Hadley Wickham’s Advanced R book includes a chapter on R6, available for free at https://adv-r.hadley.nz/r6.html.