Getting Started

Matthew Cornell

2019-05-28

Getting Started with zoltr

zoltr is an R package that simplifies access to the zoltardata.com API. This vignette takes you through the package’s main features. So that you can experiment without needing a Zoltar account, we use our “flagship” project for the CDC Flu challenge, which should always be available for public read-only access.

Connect to the host

The starting point for working with Zoltar’s API is a ZoltarConnection object, obtained via the new_connection() function. Most zoltr functions take a ZoltarConnection along with the ID of the resource of interest, e.g., a project, model, or forecast ID. You can obtain an ID from one of the *_info functions, or you can navigate to the resource in the web interface and read its ID from the browser’s address bar. For example, the URL of the CDC Flu challenge project is https://www.zoltardata.com/project/4, so its ID is 4.

(Note: To access protected resources, you need to call the zoltar_authenticate() function before calling the zoltr functions that access them. This is unnecessary in this vignette because the project, models, and forecasts we access are all public.)

library(zoltr)
conn <- new_connection()
conn
#> ZoltarConnection 'https://zoltardata.com' not authenticated
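
For protected resources, the authentication step mentioned in the note might look like the following minimal sketch. The environment variable names (Z_USERNAME, Z_PASSWORD), the helper read_zoltar_credentials(), and the argument order passed to zoltar_authenticate() are assumptions made for illustration, not something this vignette specifies; check the package help before relying on them.

```r
# Hypothetical helper: read credentials from environment variables so that
# they never appear in a script. The variable names are assumptions.
read_zoltar_credentials <- function(user_var = "Z_USERNAME",
                                    pass_var = "Z_PASSWORD") {
  user <- Sys.getenv(user_var, unset = "")
  pass <- Sys.getenv(pass_var, unset = "")
  if (!nzchar(user) || !nzchar(pass)) {
    return(NULL)  # no credentials available; stay unauthenticated
  }
  list(user_name = user, password = pass)
}

creds <- read_zoltar_credentials()
if (!is.null(creds) && requireNamespace("zoltr", quietly = TRUE)) {
  conn <- zoltr::new_connection()
  # Assumed argument order: connection, user name, password.
  zoltr::zoltar_authenticate(conn, creds$user_name, creds$password)
}
```

When no credentials are set, the sketch does nothing, which matches the public, read-only usage in the rest of this vignette.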

Get a list of all projects on the host

Use the projects() function to get a connection’s projects as a data.frame. Note that it lists only the projects that you have access to.

the_projects <- projects(conn)
str(the_projects)
#> 'data.frame':    1 obs. of  8 variables:
#>  $ id         : int 4
#>  $ url        : chr "https://zoltardata.com/api/project/4/"
#>  $ owner_id   : int 5
#>  $ public     : logi TRUE
#>  $ name       : chr "CDC Flu challenge"
#>  $ description: chr "Guidelines and forecasts for a collaborative U.S. influenza forecasting project."
#>  $ home_url   : chr "https://github.com/FluSightNetwork/cdc-flusight-ensemble"
#>  $ core_data  : chr "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models"

Get a project to work with, list its info and models, and download its scores

Let’s start by getting a public project to work with. We will search the projects list for it by name. Then we will pass its ID to the project_info() function to get details, and pass it to the models() function to get a data.frame of its models.

project_id <- the_projects[the_projects$name == "CDC Flu challenge", 'id']  # integer(0) if not found, which is an invalid project ID
the_project_info <- project_info(conn, project_id)
names(the_project_info)
#>  [1] "id"           "url"          "owner"        "is_public"   
#>  [5] "name"         "description"  "home_url"     "core_data"   
#>  [9] "config_dict"  "template"     "truth"        "model_owners"
#> [13] "score_data"   "models"       "locations"    "targets"     
#> [17] "timezeros"
the_project_info$description
#> [1] "Guidelines and forecasts for a collaborative U.S. influenza forecasting project."
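
The name lookup above yields integer(0) when nothing matches, and that value is not a valid ID. One way to fail early is a small guard like the sketch below. The helper find_id_by_name() is hypothetical (it is not part of zoltr), and mock_projects is a stand-in for the data.frame returned by projects(conn), so the example runs without a network connection.

```r
# Hypothetical helper: return the single ID whose name matches exactly,
# or stop with an informative error.
find_id_by_name <- function(df, name) {
  ids <- df$id[which(df$name == name)]
  if (length(ids) == 0) {
    stop("no row named '", name, "'", call. = FALSE)
  }
  if (length(ids) > 1) {
    stop("more than one row named '", name, "'", call. = FALSE)
  }
  ids
}

# Stand-in for the data.frame returned by projects(conn).
mock_projects <- data.frame(
  id = 4L,
  name = "CDC Flu challenge",
  stringsAsFactors = FALSE
)

project_id <- find_id_by_name(mock_projects, "CDC Flu challenge")
```

The same guard applies to any data.frame with id and name columns, so it could also be used when looking up a model by name.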

the_models <- models(conn, project_id)  # may take some time
str(the_models)
#> 'data.frame':    21 obs. of  8 variables:
#>  $ id          : int  12 26 25 24 23 11 10 18 20 21 ...
#>  $ url         : chr  "https://zoltardata.com/api/model/12/" "https://zoltardata.com/api/model/26/" "https://zoltardata.com/api/model/25/" "https://zoltardata.com/api/model/24/" ...
#>  $ project_id  : int  4 4 4 4 4 4 4 4 4 4 ...
#>  $ owner_id    : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ name        : chr  "Rank Histogram Filter SIRS" "SARIMA model with seasonal differencing" "SARIMA model without seasonal differencing" "Kernel Density Estimation" ...
#>  $ description : chr  "<em>Team name</em>: CU.\n        <em>Team members</em>: J Shaman, T Yamana, S Kandula, S Pei, W Yang, H Morita."| __truncated__ "<em>Team name</em>: ReichLab.\n        <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n        <em>Data"| __truncated__ "<em>Team name</em>: ReichLab.\n        <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n        <em>Data"| __truncated__ "<em>Team name</em>: ReichLab.\n        <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n        <em>Data"| __truncated__ ...
#>  $ home_url    : chr  "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/CU_RHF_SIRS" "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_"| __truncated__ "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_"| __truncated__ "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_kde" ...
#>  $ aux_data_url: logi  NA NA NA NA NA NA ...

score_data <- scores(conn, project_id)
score_data
#> # A tibble: 259,820 x 10
#>    model timezero season location target   error abs_error log_single_bin
#>    <chr>    <dbl> <chr>  <chr>    <chr>    <dbl>     <dbl>          <dbl>
#>  1 Baye… 20101024 2010-… HHS Reg… Seaso… -0.878     0.878           -3.62
#>  2 Baye… 20101024 2010-… HHS Reg… 1 wk … -0.196     0.196           -1.37
#>  3 Baye… 20101024 2010-… HHS Reg… 2 wk … -0.243     0.243           -2.39
#>  4 Baye… 20101024 2010-… HHS Reg… 3 wk … -0.0948    0.0948          -1.99
#>  5 Baye… 20101024 2010-… HHS Reg… 4 wk … -0.0791    0.0791          -2.35
#>  6 Baye… 20101024 2010-… HHS Reg… Seaso… -0.422     0.422           -3.47
#>  7 Baye… 20101024 2010-… HHS Reg… 1 wk … -0.278     0.278           -1.10
#>  8 Baye… 20101024 2010-… HHS Reg… 2 wk … -0.105     0.105           -4.77
#>  9 Baye… 20101024 2010-… HHS Reg… 3 wk … -0.270     0.270           -2.76
#> 10 Baye… 20101024 2010-… HHS Reg… 4 wk … -0.446     0.446           -4.40
#> # … with 259,810 more rows, and 2 more variables: log_multi_bin <dbl>,
#> #   pit <dbl>
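
Once the scores are downloaded, they can be summarized like any other data.frame. The sketch below averages two of the score columns per model using base R. The data in mock_scores is invented for illustration and only reuses column names shown in the output above; it is not real Zoltar output.

```r
# Invented stand-in for a few rows of the tibble returned by scores().
mock_scores <- data.frame(
  model = c("Model A", "Model A", "Model B", "Model B"),
  abs_error = c(0.2, 0.4, 0.1, 0.3),
  log_single_bin = c(-1.5, -2.5, -1.0, -2.0),
  stringsAsFactors = FALSE
)

# Mean of each score column per model.
mean_scores <- aggregate(
  cbind(abs_error, log_single_bin) ~ model,
  data = mock_scores,
  FUN = mean
)

# Sort so that the model with the smallest mean absolute error comes first.
mean_scores <- mean_scores[order(mean_scores$abs_error), ]
mean_scores
```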

Get a model to work with and then list its info and forecasts (if any)

Now let’s work with a particular model, getting its ID by name and then passing it to the model_info() function to get details. Then use the forecasts() function to get a data.frame of that model’s forecasts.

model_id <- the_models[the_models$name == "SARIMA model with seasonal differencing", 'id']  # integer(0) if not found
the_model_info <- model_info(conn, model_id)
names(the_model_info)
#>  [1] "id"           "url"          "project"      "owner"       
#>  [5] "name"         "abbreviation" "description"  "home_url"    
#>  [9] "aux_data_url" "forecasts"
the_model_info$description
#> [1] "<em>Team name</em>: ReichLab.\n        <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n        <em>Data source(s)</em>: ilinet.\n        <em>Methods</em>: A seasonal ARIMA model is fit using the auto.arima function in the forecast package for R.  The data are log-transformed, any infinite or missing values after the transformation are linearly imputed, and a first-order seasonal difference (at lag 52 weeks) is taken before fitting the model.  A separate model is fit for each region.  Through iterating the one-step-ahead predictions, this model fit yields a joint predictive distribution for incidence in all remaining weeks of the season.  Appropriate integrals of this joint distribution are calculated via Monte Carlo integration to obtain predictions for the seasonal quantities.  For making prospective predictions for each season, only data before the start of that season were used in fitting model parameters.  All code used in estimation and prediction is available at https://github.com/reichlab/2017-2018-cdc-flu-contest\n        "

the_forecasts <- forecasts(conn, model_id)
str(the_forecasts)
#> 'data.frame':    225 obs. of  4 variables:
#>  $ id               : int  4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 ...
#>  $ url              : chr  "https://zoltardata.com/api/forecast/4539/" "https://zoltardata.com/api/forecast/4540/" "https://zoltardata.com/api/forecast/4541/" "https://zoltardata.com/api/forecast/4542/" ...
#>  $ timezero_date    : Date, format: "2010-10-24" "2010-10-31" ...
#>  $ data_version_date: Date, format: "2010-11-08" "2010-11-15" ...
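
Because timezero_date is a Date column, a forecast can be picked by date rather than by row position. The following is a minimal sketch on a mock data.frame. The helper forecast_id_for() is hypothetical, and the third date in mock_forecasts is made up for the example.

```r
# Stand-in for the data.frame returned by forecasts(); the third row's
# date is invented for illustration.
mock_forecasts <- data.frame(
  id = c(4539L, 4540L, 4541L),
  timezero_date = as.Date(c("2010-10-24", "2010-10-31", "2010-11-07")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: ID of the forecast for a given timezero,
# or integer(0) if there is none.
forecast_id_for <- function(df, timezero) {
  df$id[which(df$timezero_date == as.Date(timezero))]
}

forecast_id_for(mock_forecasts, "2010-10-31")

# ID of the forecast with the most recent timezero.
latest_id <- mock_forecasts$id[which.max(as.numeric(mock_forecasts$timezero_date))]
```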

Finally, download the first forecast’s data in two different formats

You can get forecast data using the forecast_data() function, which returns the data either as a nested list or in a tabular (data.frame) format.

First as a list. (Please see the Zoltar documentation for details on the nested, i.e., JSON, format.)

first_forecast_id <- the_forecasts[1, 'id']  # assumes at least one exists

forecast_list <- forecast_data(conn, first_forecast_id, is_json=TRUE)  # named forecast_list so the result does not mask forecast_data()
forecast_list$locations[[1]]$name
#> [1] "HHS Region 1"
forecast_list$locations[[1]]$targets[[1]]$name
#> [1] "Season onset"
forecast_list$locations[[1]]$targets[[1]]$bins[[1]]
#> [[1]]
#> [1] 44
#> 
#> [[2]]
#> [1] 45
#> 
#> [[3]]
#> [1] 0.186
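
The nested format can be walked with ordinary loops. The sketch below flattens every bin into one row of a data.frame. It assumes the structure suggested by the output above (locations, then targets, then bins, where each bin is a list of start, end, and value). mock_nested is a hand-built stand-in whose second bin is invented, and flatten_bins() is a hypothetical helper, not a zoltr function.

```r
# Hand-built stand-in for the nested list; the second bin is invented.
mock_nested <- list(
  locations = list(
    list(
      name = "HHS Region 1",
      targets = list(
        list(
          name = "Season onset",
          bins = list(list(44, 45, 0.186), list(45, 46, 0.25))
        )
      )
    )
  )
)

# Hypothetical helper: one row per (location, target, bin).
flatten_bins <- function(nested) {
  rows <- list()
  for (loc in nested$locations) {
    for (tgt in loc$targets) {
      for (bin in tgt$bins) {
        rows[[length(rows) + 1]] <- data.frame(
          location = loc$name,
          target = tgt$name,
          bin_start = bin[[1]],
          bin_end = bin[[2]],
          value = bin[[3]],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)  # NULL if there were no bins at all
}

bins_df <- flatten_bins(mock_nested)
bins_df
```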

And as a data.frame:

forecast_table <- suppressMessages(forecast_data(conn, first_forecast_id, is_json=FALSE))  # named forecast_table so the result does not mask forecast_data()
forecast_table
#> # A tibble: 4,468 x 7
#>    location   target     type  unit  bin_start_incl bin_end_notincl   value
#>    <chr>      <chr>      <chr> <chr>          <dbl>           <dbl>   <dbl>
#>  1 US Nation… Season on… point week              NA              NA 5.00e+0
#>  2 US Nation… Season on… bin   week              45              46 3.00e-4
#>  3 US Nation… Season on… bin   week              46              47 8.00e-4
#>  4 US Nation… Season on… bin   week              47              48 7.00e-4
#>  5 US Nation… Season on… bin   week              48              49 1.00e-3
#>  6 US Nation… Season on… bin   week              49              50 3.00e-3
#>  7 US Nation… Season on… bin   week              50              51 9.60e-3
#>  8 US Nation… Season on… bin   week              51              52 1.15e-2
#>  9 US Nation… Season on… bin   week              52              53 5.10e-3
#> 10 US Nation… Season on… bin   week               1               2 5.00e-4
#> # … with 4,458 more rows
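
The tabular format lends itself to quick sanity checks. The sketch below separates bin rows from point rows, checks that the bin probabilities of each location and target sum to one, and finds the most probable bin for one target. The column names follow the output above, but the rows, location name, and target names in mock_table are invented placeholders.

```r
# Invented stand-in for the tabular forecast data.
mock_table <- data.frame(
  location = rep("Location A", 6),
  target = c(rep("Target X", 4), rep("Target Y", 2)),
  type = c("point", "bin", "bin", "bin", "bin", "bin"),
  unit = rep("week", 6),
  bin_start_incl = c(NA, 45, 46, 47, 45, 46),
  bin_end_notincl = c(NA, 46, 47, 48, 46, 47),
  value = c(46, 0.2, 0.5, 0.3, 0.6, 0.4),
  stringsAsFactors = FALSE
)

# Keep only the binned (probabilistic) rows.
bins <- mock_table[mock_table$type == "bin", ]

# Total probability per location and target; each should be close to 1.
prob_totals <- aggregate(value ~ location + target, data = bins, FUN = sum)

# Most probable bin for one target.
x_bins <- bins[bins$target == "Target X", ]
modal_bin <- x_bins[which.max(x_bins$value), c("bin_start_incl", "bin_end_notincl")]
```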