Downloading and Preparing OISST Data

AJ Smit and Robert W Schlegel



In this vignette we will see how to retrieve and prepare Reynolds optimally interpolated sea surface temperature (OISST) data for calculating marine heatwaves (MHWs). The OISST product is a global 1/4 degree gridded dataset of Advanced Very High Resolution Radiometer (AVHRR) derived SSTs at a daily resolution, starting on 1 September 1981. The source of the data is currently the NOAA NCDC.

Each global, daily file is around 8.3 MB, so they add up to a large amount of data when a time series of at least 30 years duration is downloaded. Remember that a time series of at least 30 years is needed for heatwave detection. If one were to download all of the data currently available it would now be much larger than 100 GB of total disk space. So it is therefore advantageous to download only a a subset of data. Thanks to the rerddap package this is now easy to do.

Downloading data

Thanks to some recent developments in access capabilities to the data generated by the environmental observation organisations in the United States it is now much easier to access their data. For the purposes of this vignette we will be accessing the NOAA OISST data hosted on this ERDDAP web interface. One may download the data there manually using the remarkably user friendly web interface, or use the rerddap package to do the same through R.

# The three packages we will need

# The information for the NOAA OISST data
info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "")

With our packages loaded and our target dataset identified, we may now simply download it with griddap(). While putting this vignette together however I noticed one little hiccup in the work flow. It seems that the ERDDAP server does not like it when one tries to access more than nine consecutive years of data in one request, regardless of the spatial extent being requested. So before we download our data we are going to make a small wrapper function that we can tell the range of times we want to download. This will reduce the amount of redundant coding we would otherwise need to do.

# This function expects the user to provide it with two values 
# that match the time format of the target OISST dataset
OISST_sub <- function(times){
  oisst_res <- griddap(x = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", 
                       url = "", 
                        time = times, 
                        depth = c(0, 0),
                        latitude = c(-40, -35),
                        longitude = c(15, 21),
                        fields = "sst")

In the wrapper function above we see that we have chosen to download only the ‘sst’ data out of the several variables available to us. We also see that we have chosen the spatial extent of latitude -40 to -35 and longitude 15 to 21. This a small window over some of the Agulhas Retroflection to the south west of the coastline of South Africa. A larger area is not being chosen here simply due to the speed constraints of downloading the data and detecting the events therein. One may change the lon/lat values above as is necessary to match a desired study area.

With our wrapper function written we now run it a few times in order to grab all of the available OISST data as of the writing of this vignette. Each one of the files below is ~7.0 MB. Regardless of the size of the file, the overall download time doesn’t differ by much. It appears that the bottle neck is somewhere on the ERDDAP server.

OISST1 <- OISST_sub(c("1981-09-01T00:00:00Z", "1990-12-31T00:00:00Z"))
OISST2 <- OISST_sub(c("1991-01-01T00:00:00Z", "1999-12-31T00:00:00Z"))
OISST3 <- OISST_sub(c("2000-01-01T00:00:00Z", "2008-12-31T00:00:00Z"))
OISST4 <- OISST_sub(c("2009-01-01T00:00:00Z", "2013-12-03T00:00:00Z"))
OISST5 <- OISST_sub(c("2014-01-01T00:00:00Z", "2018-12-03T00:00:00Z"))

It is worth pointing out here that these data are downloaded as cached files on the users computer by using the hoardr package. This means that if one runs the same command again, it will not re-download the data again because it first looks in the folder where it has automatically cached the data for you and sees that it may simply draw the data from there. This is remarkably convenient as it means that re-running this same code will load the data that have already been downloaded. No need to change anything or write a second script for loading data.

Preparing data

With our data downloaded we now want to pull the data out of their native NetCDF format and save them as R data files for ease of use later on. One must note here that depending on the RAM available on ones machine, it may not be possible to access all of the data within a NetCDF file at once if it is very large (e.g. > 5 GB). The discussion on the limitations of the R language due to its dependence on virtual memory is beyond the scope of this vignette, but if one has chosen to only download the spatial extent laid out above one should not encounter any RAM issues.

As we have downloaded our data in four separate pieces we will want to create another wrapper function to easily pull out the data before stitching the dataframes together and saving them as a single file.

OISST_prep <- function(nc_file){
  # Open the NetCDF connection
  nc <- nc_open(nc_file$summary$filename)
  # Extract the SST values and add the lon/lat/time dimension names
  res <- ncvar_get(nc, varid = "sst")
  dimnames(res) <- list(lon = nc$dim$longitude$vals,
                        lat = nc$dim$latitude$vals,
                        t = nc$dim$time$vals)
  # Convert the data into a 'long' dataframe for use in the 'tidyverse' ecosystem
  res <-, = "temp"), row.names = NULL) %>% 
    mutate(t = as.Date(as.POSIXct(t, origin = "1970-01-01 00:00:00")),
           temp = round(temp, 2))
  # Close the NetCDF connection and finish

With our wrapper function for preparing our data written we must now run it on our downloaded data. Once that has been done we will simply rbind() the prepared data together and save them. Note that in the wrapper function above we will be re-labelling the ‘time’ column as ‘t’, and the ‘sst’ column as ‘temp’. We do this so that they match the default column names that are expected for calculating MHWs and we won’t have to do any extra work later on.

# Prep the data
OISST1_prep <- OISST_prep(OISST1)
OISST2_prep <- OISST_prep(OISST2)
OISST3_prep <- OISST_prep(OISST3)
OISST4_prep <- OISST_prep(OISST4)
OISST5_prep <- OISST_prep(OISST5)

# Bind them together
OISST_all <- rbind(OISST1_prep, OISST2_prep, OISST3_prep, OISST4_prep, OISST5_prep)

# Save the data as an .Rda file
saveRDS(OISST_all, file = "~/Desktop/OISST_vignette.Rda")

Note above that I have chosen to save the file to my desktop. This is not normally where one (hopefully!) would save such a file. Rather one would be saving these data into the project folder out of which one is working.

And that is all there is to it. Download, prep, save. In the next vignette we will see how to detect MHWs in gridded data.