zoltr is an R package that simplifies access to the zoltardata.com API. This vignette takes you through the package’s main features. So that you can experiment without needing a Zoltar account, we use our “flagship” project for the CDC Flu challenge, which should always be available for public read-only access.
The starting point for working with Zoltar’s API is a ZoltarConnection object, obtained via the new_connection function. Most zoltr functions take a ZoltarConnection along with the ID of the thing of interest, e.g., a project, model, or forecast ID. You can obtain the ID using one of the *_info functions, or you can navigate to the resource in the web interface and read its ID from the browser’s address field. The ID should be self-evident for all resources. For example, the URL of the CDC Flu challenge project is https://www.zoltardata.com/project/4, so its ID is 4.
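If you already have a resource URL in hand, the ID convention described above can be applied in code. The following is a minimal sketch in base R; id_from_url is a hypothetical helper written for this illustration, not a zoltr function, and it assumes the ID is the last run of digits in the URL.

```r
# Hypothetical helper (not part of zoltr): pull the numeric ID out of a
# Zoltar resource URL, assuming the ID is the final run of digits,
# optionally followed by a trailing slash.
id_from_url <- function(url) {
  match <- regmatches(url, regexpr("[0-9]+(?=/?$)", url, perl = TRUE))
  as.integer(match)
}

web_id <- id_from_url("https://www.zoltardata.com/project/4")   # web page URL
api_id <- id_from_url("https://zoltardata.com/api/project/4/")  # API-style URL
```

Both calls should give the same integer, which can then be passed wherever a zoltr function expects a project ID.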
(Note: To access protected resources, you need to call the zoltar_authenticate() function before calling any other zoltr functions. This is unnecessary in this vignette because we’re accessing a public project, models, and forecasts.)
library(zoltr)
conn <- new_connection()
conn
#> ZoltarConnection 'https://zoltardata.com' not authenticated
Use the projects() function to get a connection’s projects as a data.frame. Note that it will only list those that you are authenticated to access.
the_projects <- projects(conn)
str(the_projects)
#> 'data.frame': 1 obs. of 8 variables:
#> $ id : int 4
#> $ url : chr "https://zoltardata.com/api/project/4/"
#> $ owner_id : int 5
#> $ public : logi TRUE
#> $ name : chr "CDC Flu challenge"
#> $ description: chr "Guidelines and forecasts for a collaborative U.S. influenza forecasting project."
#> $ home_url : chr "https://github.com/FluSightNetwork/cdc-flusight-ensemble"
#> $ core_data : chr "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models"
Let’s start by getting a public project to work with. We will search the projects list for it by name. Then we will pass its ID to the project_info() function to get details, and pass it to the models() function to get a data.frame of its models.
project_id <- the_projects[the_projects$name == "CDC Flu challenge", 'id'] # integer(0) if not found, which is an invalid project ID
the_project_info <- project_info(conn, project_id)
names(the_project_info)
#> [1] "id" "url" "owner" "is_public"
#> [5] "name" "description" "home_url" "core_data"
#> [9] "config_dict" "template" "truth" "model_owners"
#> [13] "score_data" "models" "locations" "targets"
#> [17] "timezeros"
the_project_info$description
#> [1] "Guidelines and forecasts for a collaborative U.S. influenza forecasting project."
the_models <- models(conn, project_id) # may take some time
str(the_models)
#> 'data.frame': 21 obs. of 8 variables:
#> $ id : int 12 26 25 24 23 11 10 18 20 21 ...
#> $ url : chr "https://zoltardata.com/api/model/12/" "https://zoltardata.com/api/model/26/" "https://zoltardata.com/api/model/25/" "https://zoltardata.com/api/model/24/" ...
#> $ project_id : int 4 4 4 4 4 4 4 4 4 4 ...
#> $ owner_id : int 5 5 5 5 5 5 5 5 5 5 ...
#> $ name : chr "Rank Histogram Filter SIRS" "SARIMA model with seasonal differencing" "SARIMA model without seasonal differencing" "Kernel Density Estimation" ...
#> $ description : chr "<em>Team name</em>: CU.\n <em>Team members</em>: J Shaman, T Yamana, S Kandula, S Pei, W Yang, H Morita."| __truncated__ "<em>Team name</em>: ReichLab.\n <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n <em>Data"| __truncated__ "<em>Team name</em>: ReichLab.\n <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n <em>Data"| __truncated__ "<em>Team name</em>: ReichLab.\n <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n <em>Data"| __truncated__ ...
#> $ home_url : chr "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/CU_RHF_SIRS" "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_"| __truncated__ "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_"| __truncated__ "https://github.com/FluSightNetwork/cdc-flusight-ensemble/tree/master/model-forecasts/component-models/ReichLab_kde" ...
#> $ aux_data_url: logi NA NA NA NA NA NA ...
score_data <- scores(conn, project_id)
score_data
#> # A tibble: 259,820 x 10
#> model timezero season location target error abs_error log_single_bin
#> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Baye… 20101024 2010-… HHS Reg… Seaso… -0.878 0.878 -3.62
#> 2 Baye… 20101024 2010-… HHS Reg… 1 wk … -0.196 0.196 -1.37
#> 3 Baye… 20101024 2010-… HHS Reg… 2 wk … -0.243 0.243 -2.39
#> 4 Baye… 20101024 2010-… HHS Reg… 3 wk … -0.0948 0.0948 -1.99
#> 5 Baye… 20101024 2010-… HHS Reg… 4 wk … -0.0791 0.0791 -2.35
#> 6 Baye… 20101024 2010-… HHS Reg… Seaso… -0.422 0.422 -3.47
#> 7 Baye… 20101024 2010-… HHS Reg… 1 wk … -0.278 0.278 -1.10
#> 8 Baye… 20101024 2010-… HHS Reg… 2 wk … -0.105 0.105 -4.77
#> 9 Baye… 20101024 2010-… HHS Reg… 3 wk … -0.270 0.270 -2.76
#> 10 Baye… 20101024 2010-… HHS Reg… 4 wk … -0.446 0.446 -4.40
#> # … with 259,810 more rows, and 2 more variables: log_multi_bin <dbl>,
#> # pit <dbl>
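The lookup by name used above returns integer(0) when nothing matches, and that empty value would then be passed on as an invalid ID. One way to fail early is a small guard like the sketch below. id_for_name is a hypothetical helper, not part of zoltr, and mock_projects is a stand-in for the data.frame returned by projects(), using the id and name shown in the output earlier.

```r
# Hypothetical helper (not part of zoltr): return the id of the single row
# whose name matches, and stop with a clear message otherwise.
id_for_name <- function(df, name) {
  ids <- df$id[which(df$name == name)]
  if (length(ids) != 1) {
    stop("expected exactly one row named '", name, "', found ", length(ids))
  }
  ids
}

# Stand-in for the result of projects(conn); only the columns we need.
mock_projects <- data.frame(id = 4L, name = "CDC Flu challenge",
                            stringsAsFactors = FALSE)

found_id <- id_for_name(mock_projects, "CDC Flu challenge")

# A name that does not exist raises an error instead of returning integer(0).
missing_result <- tryCatch(id_for_name(mock_projects, "No such project"),
                           error = function(e) "error")
```

The same helper works for any data.frame with id and name columns, so it could be reused for model lookups as well.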
Now let’s work with a particular model, getting its ID by name and then passing it to the model_info() function to get details. Then use the forecasts() function to get a data.frame of that model’s forecasts.
model_id <- the_models[the_models$name == "SARIMA model with seasonal differencing", 'id'] # integer(0) if not found
the_model_info <- model_info(conn, model_id)
names(the_model_info)
#> [1] "id" "url" "project" "owner"
#> [5] "name" "abbreviation" "description" "home_url"
#> [9] "aux_data_url" "forecasts"
the_model_info$description
#> [1] "<em>Team name</em>: ReichLab.\n <em>Team members</em>: Evan L. Ray, Nicholas G. Reich.\n <em>Data source(s)</em>: ilinet.\n <em>Methods</em>: A seasonal ARIMA model is fit using the auto.arima function in the forecast package for R. The data are log-transformed, any infinite or missing values after the transformation are linearly imputed, and a first-order seasonal difference (at lag 52 weeks) is taken before fitting the model. A separate model is fit for each region. Through iterating the one-step-ahead predictions, this model fit yields a joint predictive distribution for incidence in all remaining weeks of the season. Appropriate integrals of this joint distribution are calculated via Monte Carlo integration to obtain predictions for the seasonal quantities. For making prospective predictions for each season, only data before the start of that season were used in fitting model parameters. All code used in estimation and prediction is available at https://github.com/reichlab/2017-2018-cdc-flu-contest\n "
the_forecasts <- forecasts(conn, model_id)
str(the_forecasts)
#> 'data.frame': 225 obs. of 4 variables:
#> $ id : int 4539 4540 4541 4542 4543 4544 4545 4546 4547 4548 ...
#> $ url : chr "https://zoltardata.com/api/forecast/4539/" "https://zoltardata.com/api/forecast/4540/" "https://zoltardata.com/api/forecast/4541/" "https://zoltardata.com/api/forecast/4542/" ...
#> $ timezero_date : Date, format: "2010-10-24" "2010-10-31" ...
#> $ data_version_date: Date, format: "2010-11-08" "2010-11-15" ...
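Because the forecasts data.frame carries a timezero_date column, you can pick a forecast by date rather than by row position. The sketch below uses a small stand-in data.frame (mock_forecasts) built from the first two rows of the output above, so that it runs without a connection.

```r
# Stand-in for the result of forecasts(conn, model_id): the first two rows
# shown in the str() output above.
mock_forecasts <- data.frame(
  id = c(4539L, 4540L),
  timezero_date = as.Date(c("2010-10-24", "2010-10-31")),
  data_version_date = as.Date(c("2010-11-08", "2010-11-15"))
)

# Select the forecast for one particular timezero.
wanted_timezero <- as.Date("2010-10-31")
forecast_id_for_date <- mock_forecasts$id[mock_forecasts$timezero_date == wanted_timezero]

# Select the forecast with the most recent timezero.
latest_forecast_id <- mock_forecasts$id[which.max(as.numeric(mock_forecasts$timezero_date))]
```

As with the name lookups, the date lookup yields integer(0) if no forecast has that timezero, so check its length before using it.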
You can get forecast data using the forecast_data() function, which supports a nested list format or a data.frame (tabular) format.
First as a list. (Please see the Zoltar documentation for details of the nested format, also known as the JSON format.)
first_forecast_id <- the_forecasts[1, 'id'] # assumes at least one exists
forecast_list <- forecast_data(conn, first_forecast_id, is_json=TRUE) # named forecast_list so it does not shadow the forecast_data() function
forecast_list$locations[[1]]$name
#> [1] "HHS Region 1"
forecast_list$locations[[1]]$targets[[1]]$name
#> [1] "Season onset"
forecast_list$locations[[1]]$targets[[1]]$bins[[1]]
#> [[1]]
#> [1] 44
#>
#> [[2]]
#> [1] 45
#>
#> [[3]]
#> [1] 0.186
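The nested list can be awkward to work with directly, so it sometimes helps to flatten one part of it yourself. The sketch below builds a tiny mock list with the same shape as the output above (locations, then targets, then bins) and turns its bins into a data.frame. It assumes each bin is a triple of start, end, and value, which is our reading of the printed bin rather than something documented here. The second bin is made up for illustration, and bins_to_df is a hypothetical helper.

```r
# Mock nested forecast with the shape shown above. The first bin is copied
# from the printed output; the second bin is invented for illustration.
mock_forecast <- list(locations = list(
  list(name = "HHS Region 1",
       targets = list(
         list(name = "Season onset",
              bins = list(list(44, 45, 0.186),
                          list(45, 46, 0.25)))))))

# Hypothetical helper: one data.frame row per bin of a single target,
# assuming each bin is list(start, end, value).
bins_to_df <- function(location, target) {
  rows <- lapply(target$bins, function(bin) {
    data.frame(location = location$name, target = target$name,
               bin_start = bin[[1]], bin_end = bin[[2]], value = bin[[3]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Walk every location and target, then stack the per-target data.frames.
flat_bins <- do.call(rbind, lapply(mock_forecast$locations, function(loc) {
  do.call(rbind, lapply(loc$targets, function(tgt) bins_to_df(loc, tgt)))
}))
```

In practice, requesting the tabular format from forecast_data() is simpler; this sketch is mainly useful for seeing how the nested levels relate to each other.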
And as a data.frame:
forecast_df <- suppressMessages(forecast_data(conn, first_forecast_id, is_json=FALSE)) # named forecast_df so it does not shadow the forecast_data() function
forecast_df
#> # A tibble: 4,468 x 7
#> location target type unit bin_start_incl bin_end_notincl value
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 US Nation… Season on… point week NA NA 5.00e+0
#> 2 US Nation… Season on… bin week 45 46 3.00e-4
#> 3 US Nation… Season on… bin week 46 47 8.00e-4
#> 4 US Nation… Season on… bin week 47 48 7.00e-4
#> 5 US Nation… Season on… bin week 48 49 1.00e-3
#> 6 US Nation… Season on… bin week 49 50 3.00e-3
#> 7 US Nation… Season on… bin week 50 51 9.60e-3
#> 8 US Nation… Season on… bin week 51 52 1.15e-2
#> 9 US Nation… Season on… bin week 52 53 5.10e-3
#> 10 US Nation… Season on… bin week 1 2 5.00e-4
#> # … with 4,458 more rows
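Once the data is tabular, ordinary data.frame operations apply. As a minimal sketch, the code below separates the bin rows from the point row for one location and target, checks that the bin probabilities sum to one, and finds the most probable bin. The mock_rows data.frame reuses the column names from the output above, but its values are invented for illustration.

```r
# Stand-in for a few rows of the tabular forecast data. Column names follow
# the output above; the values are made up.
mock_rows <- data.frame(
  location = "HHS Region 1",
  target = "Season onset",
  type = c("point", "bin", "bin", "bin"),
  unit = "week",
  bin_start_incl = c(NA, 44, 45, 46),
  bin_end_notincl = c(NA, 45, 46, 47),
  value = c(45, 0.2, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# Keep only the binned distribution, dropping the point prediction.
bin_rows <- mock_rows[mock_rows$type == "bin", ]

# The bin values of one location and target should sum to (about) one.
total_probability <- sum(bin_rows$value)

# The bin with the highest predicted probability.
modal_bin <- bin_rows[which.max(bin_rows$value),
                      c("bin_start_incl", "bin_end_notincl", "value")]
```

For real forecast data you would first subset to a single location and target, since the bin values only sum to one within each such pair.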