dataone R package overview

2016-08-29

dataone R Package Overview

The dataone R package enables R scripts to search, download and upload science data and metadata to the DataONE Federation. This package calls DataONE web services that allow client programs to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN).

Searching for data in DataONE described in vignette searching-dataone.

Downloading data from DataONE described in vignette download-data

Uploading data to DataONE is described in vignette upload-data. This document also discusses maintenance operations that can be performed on datasets after they have been uploaded to a DataONE MN, such as updating the dataset or updating system information about the dataset.

If more detailed information is required than is provided by the dataone R package help, then in-depth documentation for the DataONE web services can be found at:

Note: In this R package documentation, dataone refers to the R package and DataONE refers to the Federation of Member Nodes and the computer infrastructure comprising these data repositories.

Please see the DataONE Glossary for definitions of some terms used in this document that are used to describe DataONE services and architecture.

For additional information about the DataONE Federation, please https://www.dataone.org.

New Features in dataone Version 2.0

1. Series Identifiers

Each data, metadata, and resource map object in DataONE has a unique identifier, refered to in DataONE documentation as a Persistent Identifier (PID). A PID is associated with one object in DataONE and always refers to the same object, the same set of bytes that are stored on the DataONE network.

A data or metadata object can be updated on a DataONE Member Node by using the R method dataone::updateObject(), so that a new version of the object becomes the active version that is discoverable in searches of DataONE. The older version is still available if the PID is known, but this version will not show up in DataONE searches. In order to download the new version, the new PID must be discovered and specified when the object is downloaded.

With dataone Version 2.0, an additional, optional identifier can be associated with an object, the Series Identifier (SID). Using SIDs the most current version of an object can be obtained, without the need to determine the PID of the latest version. For example, if the SID is specified when the object is downloaded, the most recent version of that object will be downloaded.

2. New Authentication Mechanism

Uploading data to DataONE requires that a DataONE user identity be provided. In DataONE Version 1.x, the identity of a user was provided by an X.509 client certificate. DataONE Version 2.0 adds an additional method to provide identity information - an authentication token. Authentication tokens can be used with Member Nodes that have been upgraded to the DataONE Version 2.0 Member Node API.

The process of providing user identity information either via an X.509 certificate or via an authentication token is refered to as authentication.

Authentication tokens can be obtained from a user’s DataONE account settings web page. To create an authentication token:

If the authentication token is defined as shown above, it will automatically be used when using methods from the dataone R client.

The authentication token must be safegaurded like a password. Anyone with access to it can access content in DataONE as the user identity contained in the token. Care should be taken to not add this code to any published scripts or documents. This code will expire after a certain time period after which it will be necessary to obtain a new one.

Note that the token shown above is for use in the DataONE production environment. If it is necessary to use authentication in a DataONE test environment, then the steps shown above should be followed by navigating to the following web page to generate the token: https://search.test.dataone.org. If you generate a token from this web page, then the R option name used should be “dataone_test_token”, not “dataone_token”, for example:

options(dataone_test_token = "eyJhbGciOiJSUzI1NiJ9...")

(For this example, the typically very long token value has been truncated for brevity, however the entire string must be used for the token to be valid.)

For an explanation of DataONE environments, please see DataONE environments.

Detailed, technical information about user identities and authentication in DataONE can be viewed at

DataONE Authentication

You can check your token by entering the following R commands:

libary(dataone)
am <- AuthenticationManager()
getTokenInfo(am)

The most important column of the information returned by getTokenInfo() is the expired column. If a token is expired, then it will not provide authentication and must be re-created from the user’s profile page as shown above.

3. Ability To Update System Metadata

Metadata is maintained by DataONE for each object that has been uploaded to it. This SystemMetadata for an object contains information such as the access policy that determines the users that can read or update the data, the data’s format type, how many replicated copies of the data to create, etc.

Member Nodes that have been upgraded to DataONE Version 2.0 Member Node API now the have the ability to update the system metadata of a data object without having to update (replace) the data object itself. So for example, an object can be uploaded to DataONE without having ‘public’ read enabled (the data creator or rightsholder and possibly a specified list of users could have access however). At a later date, the system metadata. could be updated to allow public read.

See help("updateSystemMetadata") for more information.

Known Issues with Version 2.0

Error Using X.509 Certificates

Using an X.509 certificate for DataONE authenticatin on certain versions of Mac OS X can cause the following error:

Error in curl::curl_fetch_memory(url, handle = handle) : 
Problem with the local SSL certificate 

Changes in the Mac OS X system libraries in OS X Mavericks have taken away support of these X.509 certificates. A workaround to make these certificates usable on Mac OS X with R is to install a version of the curl R package that supports these certificates.

On Mac OS X, the libcurl library can be installed with either Mac Ports package manager or the HomeBrew package manager. The HomeBrew package manager can be significantly faster to install but either one will work provided the directions shown below are followed.

You can check if you have MacPorts installed by entering the following command in a terminal window:

port version

Create new curl package using MacPorts

If MacPorts is being used on your system, the following commands can be entered to install a curl package that can read the certificate and allow them to be used by the dataone package for authentication to a DataONE node. In a terminal window enter the commands:

sudo port install curl
Sys.setenv(LIB_DIR="/opt/local/lib")
Sys.setenv(INCLUDE_DIR="/opt/local/include")
install.packages("curl", type="source")
library(curl)
library(dataone)

# Remove the environment variables as they are no longer needed.
Sys.setenv(LIB_DIR="")
Sys.setenv(INCLUDE_DIR="")

At this point you should be able to use X.509 Certificates.

Create new curl package using HomeBrew

The HomeBrew software can be installed with the following command entered at a terminal window:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Once HomeBrew has been installed, you can get the required curl libraries by entering the command:

brew install curl --with-openssl
brew link curl --force

In the R console enter the commands:

Sys.setenv(LIB_DIR="/usr/local/opt/curl/lib")
Sys.setenv(INCLUDE_DIR="/usr/local/opt/curl/include")
install.packages("curl", type="source")
library(curl)
library(dataone)

# Remove the environment variables as they are no longer needed.
Sys.setenv(LIB_DIR="")
Sys.setenv(INCLUDE_DIR="")

At this point you should be able to use X.509 Certificates.

DataONE Environments

DataONE nodes are separated into several networks or environments. Each environment provides an isolated installations of the DataONE services, such that nodes in one environment do not communicate with nodes in another.

Currently DataONE uses the following environments: production, staging, staging2, sandbox, sandbox2, dev and dev2. All environments except production are test environments that may have different version of software than the released version of the DataONE infrastructure, and may experience service outages as required by the DataONE development team.

Most users will only need to use the production and staging2 environments.

The production environment is the current working release of the DataONE infrastructure and is the environment that supports the operations necessary to fully implement the DataONE system. There is only one production environment.

The staging environment provides an installation of DataONE infrastructure that is a copy of the production environment. It is used to prepare for a new release of the infrastructure by testing the upgrade and software replacement procedures.

The staging2 is a copy of the production environment and can be used by DataONE Member Nodes that are preparing to join the production environment. Staging2 usually has the released version of the DataONE infrastructure, and is more stable that the other test environments.

The sandbox and sandbox2 environments offer more stable environments than the dev environments, and is intended to provide a more stable system where new features or alternative implementations may be evaluated within an environment that is close to a particular release of the DataONE infrastructure.

The dev and dev2 environments are intended for use by DataONE developers and are unstable, with software upgrades and service outages occurring as needed by the development staff.

The dataone R package uses the following abbreviations for the DataONE environments:

DataONE environment dataone R package abbreviation
production PROD
stage STAGING
stage 2 STAGING2
sandbox SANDBOX
sandbox2 SANDBOX2
dev DEV
dev2 DEV2

Each DataONE environment has a CN that maintains a registry of all MNs in that network. The CN can be queried for a list of all MNs in a network, for example, to see the MNs that are currently in the production environmemt:

cn <- CNode("PROD")
unlist(lapply(listNodes(cn), function(x) x@identifier))

To see the URL service endpoint for a CN, you can view the endpoint slot:

cn@endpoint

To see the member nodes that are currenly in the STAGE2 environment:

cn <- CNode("STAGING2")
unlist(lapply(listNodes(cn), function(x) x@identifier))

Using the node identifiers retrieved from these commands, the CN can be used to create a MNode object that can be used to interact with a DataONE MN. This example obtains an MNode object for the Knowledge Network for Biocomplexity Member Node and queries the node’s data holdings for datasets that mention “hydrocarbon” in the abstract and “aquaculture” in the keywords:

cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
result <- query(mn, searchTerms=list(abstract="hydrocarbon", keywords="aquaculture"), as="data.frame")

Acknowledgements

Work on this package was supported by:

Additional support was provided for working group collaboration by the National Center for Ecological Analysis and Synthesis, a Center funded by the University of California, Santa Barbara, and the State of California.