# MarineSPEED quickstart guide

#### 2017-02-17

The goal of MarineSPEED is to provide a benchmark data set for presence-only species distribution modeling (SDM) in order to facilitate reproducible and comparable SDM research. It contains species occurrences (coordinates) from a wide diversity of marine species and associated environmental data from Bio-ORACLE and MARSPEC. Some additional information about MarineSPEED can be found in the R Shiny viewer at http://marinespeed.org.

## Exploring the data

Three functions help with exploring the data.

library(marinespeed)

# set a data directory, preferably something different from tempdir to avoid
# downloading the data again in every R session

# list all species
species <- list_species()

The first 5 species and their aphia_id (WoRMS species id) are:

| species | aphia_id |
|---|---|
| Laternula elliptica | 197217 |
| Pseudosagitta gazellae | 266258 |
| Parasagitta elegans | 105440 |
| Parasagitta setosa | 105443 |
| Branchiostoma lanceolatum | 104906 |

The species information consists of species identifiers, taxonomic information from the World Register of Marine Species (WoRMS), a visual assessment score for the amount of sampling bias and the covered latitudinal zones.

# all species information
info <- species_info()
colnames(info)
##  [1] "species"        "aphia_id"       "kingdom"        "phylum"
##  [5] "class"          "order"          "family"         "genus"
##  [9] "sampling_bias"  "eco_polar"      "eco_temperate"  "eco_tropical"
## [13] "eco_open_ocean"
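
Because the species information is a table with the columns listed above, ordinary data frame subsetting is enough to pick species for a benchmark. The sketch below uses a small made-up stand-in for `info` so that it runs without downloading anything. The species names, the values and the column types (logical zone flags, numeric `sampling_bias`) are assumptions for illustration, not taken from the real data.

```r
# Hypothetical stand-in for the result of species_info(); names, values and
# column types are invented for illustration only.
info_example <- data.frame(
  species       = c("Species a", "Species b", "Species c"),
  phylum        = c("Mollusca", "Chaetognatha", "Chordata"),
  sampling_bias = c(1, 3, 2),
  eco_tropical  = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Number of species per phylum
phylum_counts <- table(info_example$phylum)
print(phylum_counts)

# Names of the species flagged as occurring in the tropical zone
tropical <- info_example[info_example$eco_tropical, "species"]
print(tropical)
```

With the real `info` object, check `str(info)` first to see how the zone and bias columns are actually encoded before filtering on them.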

## Looping over all species data

To loop over the occurrence data of all species, call the lapply_species function. For instance, to count the total number of records in MarineSPEED you could use the following code. As you can see, the function passed to lapply_species expects two parameters: one for the species name and one for the actual occurrences.

get_occ_count <- function(speciesname, occ) {
  nrow(occ)
}
record_counts <- lapply_species(get_occ_count)
sum(unlist(record_counts))
## [1] 868151
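
Any function with the same two parameters can be passed in the same way. As a minimal sketch, the callback below returns the latitudinal range of a species. It assumes the occurrence data frame has a numeric `latitude` column (the column name is the one used in the plotting code of this guide), and it is tried out on a few made-up records rather than on real MarineSPEED data.

```r
# Callback with the (speciesname, occ) parameters expected by lapply_species.
# Assumes occ has a numeric "latitude" column.
get_lat_range <- function(speciesname, occ) {
  range(occ$latitude)
}

# Made-up occurrence records standing in for real MarineSPEED data
occ_example <- data.frame(longitude = c(3.1, 4.5, -2.0),
                          latitude  = c(51.2, 53.8, 48.9))

lat_range <- get_lat_range("Example species", occ_example)
print(lat_range)

# With the package installed and the data available, the same callback could
# be applied to every species:
# lat_ranges <- lapply_species(get_lat_range)
```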

## Cross-validation

To make it possible to reuse the same k-fold cross-validation datasets, I split the species occurrence data up front into 5 folds (or 4 and 9 for grid). Four fold types are available, based on three different ways of partitioning the occurrences:

• disc: disc partitioning of occurrences with pairwise-distance-sampled and buffer-filtered random background points.
• grid: partitioning of the data by splitting the records along the x- and y-axis into groups containing equal numbers of records. Data was split into 4 (2 by 2) and 9 (3 by 3) folds.
• random: random partitioning of occurrences and random background points.
• targetgroup: same way of partitioning as the random folds, but instead of random background points a random subset of all occurrence points was used, creating a target-group background point set with the same sampling bias as the entire dataset.
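
To make the grid idea concrete, here is a minimal base R sketch of a 2 by 2 split into groups with equal numbers of records. It is one possible reading of the description above, not the package's own implementation: the helper names `split_equal` and `grid_folds` are hypothetical, the order of splitting (longitude first, then latitude within each longitude group) is an assumption, and the records are randomly generated.

```r
# Assign each value to one of n groups with (near-)equal numbers of records,
# ordered from the smallest to the largest values.
split_equal <- function(x, n) {
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

# Sketch of a grid split: cut along longitude first, then cut each longitude
# group along latitude. Returns a fold number between 1 and nx * ny.
grid_folds <- function(occ, nx = 2, ny = 2) {
  x_group <- split_equal(occ$longitude, nx)
  y_group <- ave(occ$latitude, x_group,
                 FUN = function(y) split_equal(y, ny))
  (x_group - 1) * ny + y_group
}

# Made-up occurrence records
set.seed(42)
occ_example <- data.frame(longitude = runif(100, -180, 180),
                          latitude  = runif(100, -90, 90))

fold <- grid_folds(occ_example)
print(table(fold))

# Fold 1 as test set, the remaining folds as training set
occ_test  <- occ_example[fold == 1, ]
occ_train <- occ_example[fold != 1, ]
```

Because the groups are defined by rank rather than by fixed coordinate breaks, every fold receives the same number of records even when the occurrences are spatially clustered.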

The code below plots the training (blue) and test (red) occurrences for the first two disc folds of the first two species.

## plot first 2 disc folds for the first 2 species (blue=training, red=test)
plot_occurrences <- function(speciesname, data, fold) {
  title <- paste0(speciesname, " (fold = ", fold, ")")
  # training occurrences define the plot extent
  plot(data$occurrence_train[,c("longitude", "latitude")], pch=20, col="blue", main = title)
  # test occurrences are added on top of the existing plot
  points(data$occurrence_test[,c("longitude", "latitude")], pch=20, col="red")
}

lapply_kfold_species(plot_occurrences, species = species[1:2, ],
                     fold_type = "disc", k = 1:2)
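
The same three-parameter callback shape works for non-graphical summaries too. As a minimal sketch, the function below reports how many training and test occurrences a fold contains. It assumes that `data$occurrence_train` and `data$occurrence_test` are data frames, as in the plotting code above; the function name and the example records are made up so that the code runs without the package data.

```r
# Callback with the (speciesname, data, fold) parameters used above.
# Assumes data$occurrence_train and data$occurrence_test are data frames.
count_fold_records <- function(speciesname, data, fold) {
  data.frame(species = speciesname,
             fold    = fold,
             n_train = nrow(data$occurrence_train),
             n_test  = nrow(data$occurrence_test),
             stringsAsFactors = FALSE)
}

# Made-up fold data standing in for what lapply_kfold_species passes in
data_example <- list(
  occurrence_train = data.frame(longitude = c(1, 2, 3, 4),
                                latitude  = c(50, 51, 52, 53)),
  occurrence_test  = data.frame(longitude = 5, latitude = 54)
)

result <- count_fold_records("Example species", data_example, fold = 1)
print(result)

# With the package installed and the data available:
# counts <- lapply_kfold_species(count_fold_records, species = species[1:2, ],
#                                fold_type = "disc", k = 1:2)
```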

## Lower level functions

• get_occurrences: get all occurrences for some or all species
• get_fold_data: get training and test folds for occurrences and background for a species
• kfold_occurrence_background: create k folds for other species records