Non-raster Data: Using Predictor Importance, Predictor Directionality,and Predictive Performance Metrics

Tom Auer, Daniel Fink



  1. Introduction
  2. Loading Centroids
  3. Selecting Region and Season
  4. Plotting Centroids and Extent of Analysis
  5. Plot Predictor Importance
  6. Plot Partial Dependencies
  7. Predictive Performance Metrics


Beyond estimates of occurrence and relative abundance, the eBird Status and Trends products contain information about predictor importance, predictor directionality (also known as partial dependencies), as well as predictive performance metrics (PPMs). The PPMs can be used to evaluate statistical performance of the models, either over the entire spatiotemporal extent of the model results, or for a specific region and season. Predictor Importances (PIs) and Partial Dependencies (PDs) can be used to understand relationships between occurrence and abundance and predictors, most notably the land cover variables used in the model. When PIs and PDs are combined, we can depict habitat association and avoidance, as well as the strength of the relationship with those habitats. The functions described in this section help load this data from the results packages, give tools for assessing predictive performance, and synthesize information about predictor importances and partial dependencies.

Data Structure

IMPORTANT. AFTER DOWNLOADING THE RESULTS, DO NOT CHANGE THE FILE STRUCTURE. All functionality in this package relies on the structure inherent in the delivered results. Changing the folder and file structure will cause errors with this package.

Data are stored in four files, found under \<run_name>\results\abund_preds\unpeeled_folds, as described below.

\<run_name>\results\abund_preds\unpeeled_folds\pd.txt \<run_name>\results\abund_preds\unpeeled_folds\pi.txt \<run_name>\results\abund_preds\unpeeled_folds\summary.txt \<run_name>\results\abund_preds\unpeeled_folds\test.pred.ave.txt

The ebirdst package provides functions for accessing these, such that you should never have to handle them manually, granted that the original file structure of the results is maintained. These data are stored at “stixel centroids,” which are the centers of the independent, partitioned, regional model extents (see Fink et al. 2010, Fink et al. 2013).

Loading Centroids

The first step when working with stixel centroid data is to load the Predictor Importances (PIs) and the Partial Dependencies (PDs). These files will be used for all of the functions in this vignette and are the input to many of the functions in ebirdst.


# Currently, example data is available on a public s3 bucket. The following 
# ebirdst_download() function copies the species results to a selected path and 
# returns the full path of the results. Please note that the example_data is
# for Yellow-bellied Sapsucker and has the same run code as the real data,
# so if you download both, make sure you put the example_data somewhere else.

# Because the non-raster data is large, there is a parameter on the
# ebirdst_download function that defaults to downloading only the raster data.
# To access the non-raster data, set tifs_only = FALSE.
sp_path <- ebirdst_download(species = "example_data", tifs_only = FALSE)

pis <- load_pis(sp_path)
pds <- load_pds(sp_path)

Selecting Region and Season

When working with Predictive Performance Metrics (PPMs), PIs, and/or PDs, it is very common to select a subset of space and time for analysis. In ebirdst this is done by creating a spatiotemporal extent object with ebirdst_extent(). These objects define the region and season for analysis and are passed to many functions in ebirdst. To review the available stixel centroids associated with both PIs and PDs and to see which have been selected by a spatiotemporal subset, use the map_centroids function, which will map and summarize this information.

lp_extent <- ebirdst_extent(c(xmin = -86, xmax = -83, ymin = 42, ymax = 45),
                            t = c(0.425, 0.475))
map_centroids(sp_path, ext = lp_extent)

Plotting Centroids and Extent of Analysis

Similarly, calc_effective_extent() will analyze a spatiotemporal subset of PIs or PDs and plot the selected stixel centroids, as well as a RasterLayer depicting where a majority of the information is coming from. The map ranges from 0 to 1, with pixels have a value of 1 meaning that 100% of the selected stixels are contributing information at that pixel. The function returns the RasterLayer in addition to mapping.

par(mfrow = c(1, 1), mar = c(0, 0, 0, 6))
calc_effective_extent(sp_path, ext = lp_extent)

Plot Predictor Importance

Once predictive performance has been evaluated, exploring information about habitat association and/or avoidance can be done using the PIs and PDs. The plot_pis() function generates a bar plot showing a rank of the most important predictors within a spatiotemporal subset. There is an option to show all predictors or to aggregate FRAGSTATS by the land cover types.

# with all classes
plot_pis(pis, ext = lp_extent, by_cover_class = FALSE, n_top_pred = 15)

# aggregating fragstats for cover classes
plot_pis(pis, ext = lp_extent, by_cover_class = TRUE, n_top_pred = 15)

Plot Partial Dependencies (predictor directionality)

Complementary to plot_pis(), the plot_pds() function plots the partial dependency curves for individual predictors, with various levels of detail and smoothing available.

mypds <- plot_pds(pds, predictor = "effort_hrs", 
                  ext = lp_extent, 
                  bootstrap_smooth = TRUE)

Predictive Performance Metrics

Beyond confidence intervals provided for the abundance estimates, the centroid data can also be used to calculate predictive performance metrics, to get an idea as to whether there is substantial statistical performance to evaluate information provided by PIs and PDs (as well as abundance and occurrence information).

Binary Metrics by Time

The plot_binary_by_time() function analyzes a species’ entire range of data and plots predictive performance metrics by a custom time interval (typically either 52 for weeks or 12 for months).

All Metrics for Spatiotemporal Extent

The plot_all_ppms() function provides all available predictive performance metrics and is important for determining predictive performance within a spatiotemporal subset region and season. This function is also useful in comparing the performance between subsets.


Fink, D., Damoulas, T., & Dave, J. (2013, July). Adaptive Spatio-Temporal Exploratory Models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In AAAI.

Fink, D., Hochachka, W. M., Zuckerberg, B., Winkler, D. W., Shaby, B., Munson, M. A., … & Kelling, S. (2010). Spatiotemporal exploratory models for broad‐scale survey data. Ecological Applications, 20(8), 2131-2147.