Retrieval and processing of OISST data

AJ Smit

2018-06-22

Overview

In this vignette we shall look at retrieving and processing the Reynolds optimally interpolated sea surface temperature (OISST) data, a global data set of Advanced Very High Resolution Radiometer (AVHRR) derived SSTs at a daily resolution, starting on 1 September 1981. The source of the data is the Physical Oceanography Distributed Active Archive Center (PO.DAAC).

Each global, daily file is around 8.3 MB, so the files add up to a large amount of data when a time series of at least 30 years' duration is downloaded; a time series of at least 30 years is needed for heatwave detection. Currently I have 13,216 of these global files, which amounts to ~108 GB of disk space. Since not everyone will need all of these data, we shall subset the data on the server side using a Python script before downloading them.
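As a quick back-of-the-envelope check of those numbers, in R:

# 13,216 daily files at ~8.3 MB each, expressed in GB
13216 * 8.3 / 1024 # ~107 GB, in line with the ~108 GB reported above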

Subsetting using a Python script

To do the subsetting and bring the data to your local computer/server, you will need access to Python 2.7 with the numpy module. Make sure it is installed on your system and visible on the system PATH.
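A quick way to confirm this from within R is sketched below; note that the binary name python is an assumption, as some systems name it python2:

# Check that Python is visible on the PATH and that numpy imports cleanly
Sys.which("python") # should return a non-empty path
system('python -c "import numpy; print(numpy.version.version)"')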

Create a folder on your server to receive all the data; below, for example, I use /Users/ajsmit/spatial/test/netCDF.
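From within R, for example, the directory can be created like this:

# Create the receiving directory (adjust the path to taste)
dir.create("/Users/ajsmit/spatial/test/netCDF", recursive = TRUE)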

Into this directory, copy the Python script, subset_dataset.py (link). Remember to make the file executable by running chmod +x subset_dataset.py. I use macOS (or Linux), so I am not able to provide instructions for Windows. In a terminal, change to the directory that will receive the netCDF files, where the Python script now lives. If Python is on your system's PATH, you should be able to execute the following command at the terminal prompt (shown here as >):

> ./subset_dataset.py -s 19810901 -f 20171014 -b 6.25 45 -45 -20 -x AVHRR_OI-NCEI-L4-GLOB-v2.0

Encapsulated in the above command are the following parameters:

The -s and -f flags give the start and final dates (YYYYMMDD): we select files starting on 1981-09-01 and going up to 2017-10-14. The -b flag gives the bounding box as minimum and maximum longitude followed by minimum and maximum latitude; this spatial extent covers the Agulhas Current region around South Africa. The -x flag gives the data set's short name as it appears on the Reynolds OISST page at PO.DAAC; substituting this name for that of any of the other SST data sets on that website should then permit the retrieval of those data sets too (e.g. the MUR data's short name is MUR-JPL-L4-GLOB-v4.1).

Adjust any of these parameters to taste in order to define the spatial extent and the time period as required by your study.

If everything works according to plan, a large amount of data will now be downloaded; this might take several hours. There will be one netCDF file for each day of the study period. In later steps, in R, we shall combine them into one .csv file and then apply some further processing steps to extract the heatwaves.
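Once the download has finished, a quick sketch in R can confirm that one file arrived per day; the expected count is computed from the dates passed to -s and -f:

nc.dir <- "/Users/ajsmit/spatial/test/netCDF"
length(list.files(nc.dir, pattern = "\\.nc$")) # files received
as.numeric(as.Date("2017-10-14") - as.Date("1981-09-01")) + 1 # files expected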

Extracting the SST data

There are many ways to approach this problem; the one I take here is to produce an intermediate .csv file that holds all the data for the region/period of interest in one place. For the subregion/period defined by the Python subsetting script, above, this combined file weighs in at 4.88 GB. This might not be the most efficient way to deal with the problem, but I like having the intermediate .csv file on hand, as I subject these data to several other processing steps besides marine heatwave detection.
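Before running the full extraction, it may help to open a single subsetted file and confirm the variable and dimension names that the script below expects (analysed_sst, lon and lat). A minimal sketch, assuming at least one .nc file already resides in the netCDF directory:

library(ncdf4)
nc.dir <- "/Users/ajsmit/spatial/test/netCDF"
nc <- nc_open(list.files(nc.dir, pattern = "\\.nc$", full.names = TRUE)[1])
print(nc) # lists the variables (e.g. analysed_sst) and dims (lon, lat, time)
nc_close(nc)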

Here is a set of steps that does the job for me:

# generic_netCDF2csv.R


# NOTES ON USING THIS SCRIPT ----------------------------------------------

# 1. The Reynolds OISST v2 data processed by this script can be retrieved from:
# https://podaac.jpl.nasa.gov/dataset/AVHRR_OI-NCEI-L4-GLOB-v2.0
# 2. Subsetting and the selection of the time steps are done via the Python
# script `subset_dataset.py`.
# 3. This R script requires the already subsetted netCDFs to reside inside a
# directory whose path is specified by `nc.dir`, below.
# 4. The .csv files produced will be placed inside of the directory named by
# `csv.dir` (make sure this directory already exists).
# 5. The dates included in the final .csv file are extracted directly from the
# names of the daily netCDF files; please, therefore, never change those names
# by editing them manually.
# 6. The base name of the new .csv file will be partly based on the name of the
# input netCDF files, with the start and end dates appended at the end. These
# things are hard coded into the script below.
# 7. I am sure I have missed some things, or that some things may break somehow;
# please let me know if this happens and I shall fix it.
# 8. This file may take a while to run (10s of minutes to hours, depending on
# the amount of data processed); please be patient while it does its thing.

# Author: AJ Smit
# Date: 27 April 2018
# e-mail: ajsmit@uwc.ac.za


# CAUTION -----------------------------------------------------------------

# This function will append data to the end of any existing file previously
# produced by this script, which will result in duplicate data. If you need to
# rerun the script for some reason, please first delete the file created by the
# previous run from the `csv.dir`.
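# One defensive option (my addition, not part of the original script) is to
# delete any stale output up front, e.g.:
# file.remove(list.files(csv.dir, pattern = "\\.csv$", full.names = TRUE))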


# LOAD LIBRARIES ----------------------------------------------------------

library(ncdf4) # library for processing netCDFs
library(data.table) # for fast .csv write function, `fwrite()`
library(tidyverse) # misc. data processing conveniences
library(reshape2) # for making a long data format
library(plyr) # for `llply()`
library(lubridate) # for working with dates
library(stringr) # for working with strings
library(doMC); doMC::registerDoMC(cores = 4) # for multicore speed-ups


# SPECIFY FILE PATHS ------------------------------------------------------

# Setup OISST netCDF data path and csv file output directory
nc.dir <- "/Users/ajsmit/spatial/test/netCDF"
csv.dir <- "/Users/ajsmit/spatial/test/csv"


# PARSE FILE INFO (not used directly) -------------------------------------

# Use to determine the start/end points of the `name.stem` (see code below)
#          1         2         3         4         5         6         7
# 123456789012345678901234567890123456789012345678901234567890123456789012345
# 20091231120000-NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.0_subset.nc
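# For example, applied to the file name above (a quick check, not part of the
# processing itself):
# f <- "20091231120000-NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.0_subset.nc"
# substr(f, 1, 8)   # "20091231" -- becomes `date.stamp` below
# substr(f, 16, 72) # "NCEI-L4_GHRSST-...-fv02.0_subset" -- becomes `name.stem` below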


# OISST netCDF READ FUNCTION ----------------------------------------------

# Function to extract the dims and data from OISST netCDFs
ncRead <- function(nc.dir, csv.dir) {
  nc.list <- list.files(path = nc.dir, pattern = "\\.nc$", full.names = TRUE)
  strt.date <- str_sub(basename(nc.list[1]), start = 1, end = 8)
  end.date <- str_sub(basename(nc.list[length(nc.list)]), start = 1, end = 8)

  ncFun <- function(nc.file, csv.dir) {
    nc <- nc_open(nc.file)
    name.stem <- substr(basename(nc.file), 16, 72)
    date.stamp <- substr(basename(nc.file), 1, 8)
    sst <- round(ncvar_get(nc, varid = "analysed_sst"), 4)
    dimnames(sst) <- list(lon = nc$dim$lon$vals,
                          lat = nc$dim$lat$vals)
    nc_close(nc)
    sst <-
      as.data.frame(melt(sst, value.name = "temp"), row.names = NULL) %>%
      mutate(t = ymd(date.stamp)) %>%
      na.omit()
    # Use data.table's `fwrite()` because it is much faster than anything else
    # for large csv files
    fwrite(sst,
           file = paste0(csv.dir, "/", name.stem, "-", strt.date, "-", end.date, ".csv"),
           append = TRUE, col.names = FALSE)
    rm(sst)
  }
  # `.parallel = FALSE` allows for better error checking should something go
  # wrong; a helpful progress bar also becomes available
  llply(nc.list, ncFun, csv.dir = csv.dir, .parallel = FALSE)
}

Now I simply run this function:

# RUN THE FUNCTION --------------------------------------------------------

# If everything works according to plan, all that's required is to execute
# this line as is after specifying `nc.dir` and `csv.dir` paths, above
system.time(ncRead(nc.dir, csv.dir))

Event detection

Now we shall detect the extreme heat events using the RmarineHeatWaves package (soon to be replaced by heatwaveR).
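If the package is not already present, it can be installed from CRAN in the usual way:

install.packages("RmarineHeatWaves")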

# SET-UP SOME THINGS ------------------------------------------------------

csv.dir <- "/Users/ajsmit/spatial/test/csv"

# Specify the name of the file created by `ncRead()` inside the `csv.dir`
# named above
csv.file <- "NCEI-L4_GHRSST-SSTblend-AVHRR_OI-GLOB-v02.0-fv02.0_subset-19810901-20171014.csv"

# Read in the data in the combined .csv file and assign column names;
# again using the data.table way (`fread()`), which is FAST...
sst.dat <- fread(paste0(csv.dir, "/", csv.file))
colnames(sst.dat) <- c("lon", "lat", "temp", "t")
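# Optional sanity check (my addition): the first few rows should show the
# columns lon, lat, temp and t, in that order
head(sst.dat)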


# A FAST DATE FUNCTION ----------------------------------------------------

# This speeds up the creation of the date vector immensely
fastDate <- function(x, tz = NULL) {
  require(fasttime)
  as.Date(fastPOSIXct(x, tz = tz))
}
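# Example usage: fastDate("1981-09-01") returns as.Date("1981-09-01"); the
# speed advantage over as.Date() becomes apparent across millions of strings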


# APPLY FAST DATE FUNCTION ------------------------------------------------

sst.dat$t <- fastDate(sst.dat$t)


# DETECT FUNCTION FOR GRIDDED DATA ----------------------------------------

# The `clim` switch selects either the climatology or the events as output;
# please provide your own climatology date range here
OISST_detect <- function(dat, warm = TRUE, clim = FALSE,
                         climatology_start = "1983-01-01",
                         climatology_end = "2012-12-31") {
  require(RmarineHeatWaves)
  dat <- dat[, 3:4] # hard-coded columns containing temp and date
  whole <- make_whole(dat)
  if (warm) {
    out <- detect(whole, min_duration = 5, max_gap = 2, cold_spells = FALSE,
                  climatology_start = climatology_start,
                  climatology_end = climatology_end)
  } else {
    out <- detect(whole, min_duration = 5, max_gap = 2, cold_spells = TRUE,
                  climatology_start = climatology_start,
                  climatology_end = climatology_end)
  }
  if (clim) {
    clim <- out$clim
    return(clim)
  } else {
    event <- out$event
    return(event)
  }
}


# DETECT EVENTS! ----------------------------------------------------------

events <- as_tibble(ddply(sst.dat, .(lon, lat), OISST_detect,
                          warm = TRUE, clim = FALSE,
                          .parallel = FALSE, .progress = "text"))
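Before saving, a quick look at the output can flag obvious problems. The check below is my suggestion rather than part of the original workflow; the dplyr:: prefix avoids the function masking caused by loading plyr after the tidyverse:

# Count the number of events detected at each pixel
event.count <- dplyr::count(events, lon, lat)
head(event.count)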


# SAVE AND CLEAN UP -------------------------------------------------------

event.dir <- "/Users/ajsmit/spatial/test/events"

# Reuse the existing name; this ensures traceability of the files back to
# the input sources
event.file <- gsub(pattern = "\\.csv$", "", csv.file)

# Write the file out to disk
fwrite(events, file = paste0(event.dir, "/", event.file, "-events.csv"))
remove(events); remove(sst.dat)

Please let me know if there are issues with the scripts, or if you have suggestions about how to improve them.