Uploading Data to DataONE

2016-08-29

Introduction

This document describes how to use the dataone R package to upload data to DataONE, and how to perform maintenance operations on the data after it has been uploaded.

The dataone package provides methods to allow R scripts to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN). The dataone R package takes care of the details of calling the corresponding DataONE web service on a DataONE node. For example, the dataone createObject R method calls the DataONE web service MNStorage.create() that uploads a dataset to a DataONE MN.

Before uploading any data to a DataONE MN, it is necessary to obtain a DataONE user identity, and the means to provide that user identity when data is uploaded. The method that DataONE uses to achieve this is known as user identity authentication, and requires that an authentication token, which is a character string, is provided during upload. The process to obtain this token is described in the dataone-overview vignette, in the section New Authentication Mechanism, which is viewable with the R command vignette("dataone-overview"). (Note: DataONE originally used X.509 certificates for authentication, which are still supported.)

Uploading a data package using uploadDataPackage

Datasets and metadata can be uploaded individually or as a data package. Uploading a data package will be described first and a workflow for preparing and uploading a data package using the uploadDataPackage method will be shown. A complete script that uses this workflow is shown here:

library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
sampleData <- system.file("extdata/sample.csv", package="dataone")
# Create a unique identifier string for the data object in a standard format.
dataId  <- paste("urn:uuid:", UUIDgenerate(), sep="")
dataObj <- new("DataObject", id=dataId, format="text/csv", file=sampleData)
dataObj <- setPublicAccess(dataObj)
sampleEML <- system.file("extdata/sample-eml.xml", package="dataone")
# Create a unique id string for the data object in a standard format
# Alternatively DOI string could used using "generateIdentifier(mn, scheme="DOI")"
metadataId <- paste("urn:uuid:", UUIDgenerate(), sep="")
metadataObj <- new("DataObject", id=metadataId, format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)
metadataObj <- setPublicAccess(metadataObj)
dp <- addData(dp, dataObj, metadataObj)
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, replicate=TRUE, public=TRUE, numberReplicas=2)

The following sections describe each line of this script in detail.

1. Create a DataPackage object.

In order to use uploadDataPackage, it is necessary to prepare an R DataPackage object that will serve as a container for the set of files that will be included in the data package. The following commands load the required libraries and creates an empty DataPackage object that will be added to later:

library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")

When using the uploadDataPackage method, data structures that are required by DataONE are created, configured and uploaded automatically with the data package. These data structures include a ResourceMap that describes the data package, and SystemMetadata objects that contain DataONE system information for each of the science datasets and associated science metadata.

2. Prepare a metadata file that will describe the files in the data package

The next step is to prepare a metadata file that will describe the science datasets in the data package. The most common metadata format used in the DataONE network is the Ecological Metadata Langauage (EML). Other supported formats include FGDC and ISO 19113. Additional information about EML is available at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html.

Detailed directions regarding authoring metadata documents are outside the scope of this document.

3. Determine what access your data and metadata should have

The levels of access available to objects in DataOne are “read”, “write”, and “changePermission”. The “read” permission allows a user the ability to view the content of a DataONE object. The “write” permission allows a user the ability to change the content of an object via update services.

Permissions are hierarchical, so write permission also includes read permission The “changePermission” permission allows the ability to change the access policy for an object and includes both read and write permissions.

Each of these permissions can be granted to a single user, a group of users, or the special public user which means all users.

Each object in DataONE can have one or more access rules that control the access of that object. The complete set of access rules for an object is refered to as its access policy.

The next section shows how to apply the desired access rules to items that will be added to a data package before upload.

4. Create a DataObject for each data file

A DataObject must be created for each metadata file and data file that will be included in the data package. The DataObject maintains information about an object that will be needed by DataONE.

A SystemMetadata object will be created automatically and stored in each DataObject. The SystemMetadata object will be used by DataONE to maintain low level information about the dataset, such as the access policy, the user identity of the rightsholder (the user identity that can modify access the dataset), which Member Nodes it can be replicated to, etc.

The example below creates a DataObject for a science dataset:

sampleData <- system.file("extdata/sample.csv", package="dataone")
dataId <- paste("urn:uuid:", UUIDgenerate(), sep="")
dataObj <- new("DataObject", id=dataId, format="text/csv", file=sampleData)

An optional user argument can be specified when creating a DataObject, which will be used to set the DataONE submitter and rightsholder of the dataset when it is uploaded. The rightsholder is granted all access priviledges to the object.

If user is not specified for a DataObject, then the submitter and rightsholder for an object will automatically be set, when the object is uploaded to DataONE, to the DataONE user that created the authentication token or X.509 certificate.

Note that if the id argument is not specified, a unique identifier will automatically be created and assigned to the DataObject.

Access rules can be added to each DataObject after it has been created. Access rules can be added to grant permissions to a single user. Access can also be granted to the public user, which means any and all users. For example, public read access can be set using the setPublicAccess method:

dataObj <- setPublicAccess(dataObj)

Individual access rules to be added for a DataONE user identity can also be added to the access policy.

Access rules are added to a DataObject using the addAccessRule method. The following access rule will grant user ‘Peter Smith’ changePermission access to the dataset, which will take effect after it is uploaded and available on a DataONE MN:

accessRules <- data.frame(subject="CN=Peter Smith A10499,O=Google,C=US,DC=cilogon,DC=org", permission="changePermission") dataObj <- addAccessRule(dataObj, accessRules)

The value of the subject argument in the above example (“CN=Peter Smith A10499,O=Google,C=US,DC=cilogon,DC=org”) is the string value of a typical DataONE user identity. DataONE user identities and user authentication are described in section A New Authentication Mechanism in the vignette dataone-overview (to view this vignette, type this command in the R console: vignette("dataone-overview"))

5. Create a DataObject for the metadata file

When a DataObject is created, a unique identifier is generated if one is not specified. This automatically generated identifier has the format “urn:uuid:”, for example “urn:uuid:c3443142-6260-4ea5-aaa1-1114981e04ad”.

The following command creates the DataObject for the science metadata, using an automatically generated identifier:

sampleEML <- system.file("extdata/sample-eml.xml", package="dataone")
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)

Alternatively, a Digital Object Identfier (DOI) may be assigned to the metadata DataObject, using the generateIdentifier method:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
doi <- generateIdentifier(mn, "DOI")
metadataObj <- new("DataObject", id=doi, format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)

(Note that the example uses a DataONE test environment STAGING, and not the production environment. In order to create an valid DOI, a production DataONE Member Node that supports creating DOIs must be used.)

The generateIdentifier method requests that the DataONE MN generate a properly formatted DOI.

6. Add each DataObject to the DataPackage

The DataPackage object serves as a container for a set of data objects that will be uploaded to DataONE. The metadata DataObject and all science data DataObjects must be added to the DataPackage before calling uploadDataPackage.

Relationships between the objects in a DataPackage are stored in the ResourceMap which is stored in and maintained by the DataPackage. One type of relationship that is stored is between the science metadata and the science datasets that are described by it. In the DataONE data package implementation, this relationship is the CITO documents relationship that links the metadata object to science objects.

This relationship between the science metadata and science data objects will be made automatically for each science data object as it is added to the DataPackage, if the metadata object is included when the science data object is added.

Now add the metadata object to the DataPackage:

dp <- addData(dp, metadataObj)

Then specify the metadata object when each science data object is added, associating the metadata object with the science object:

dp <- addData(dp, do = dataObj, mo = metadataObj)

If there were additional DataObjects to add to the package, they would be added to the DataPackage and associated with the metadata object as follows:

dp <- addData(dp, do = dataObj2, mo = metadataObj)
dp <- addData(dp, do = dataObj3, mo = metadataObj)
dp <- addData(dp, do = dataObj4, mo = metadataObj)

7. Upload the DataPackage

When all DataObjects have been added to the DataPackage, call the uploadDataPackage method to upload the entire DataPackage:

d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, replicate=TRUE, numberReplicas=2)
message(sprintf("Uploaded data package with identifier: %s", packageId))

(Note that the example uses a DataONE test environment STAGING, and not the production environment.)

After uploadDataPackage has been called sucessfully, the data package can be viewed on the member node, searched for using the DataONE search facility. Note that if objects in DataONE are not publicly readable, and the authenticated user performing the search isn’t granted access in an object’s access policy, then the objects will not be viewable or discoverable via the search facility for that user.

Uploading Individual Data And Metadata files

A single data or metadata file can be uploaded to a DataONE MN using the createObject method. When uploading a single file using this method, additional information must be supplied to DataONE that controls how DataONE interacts with the uploaded file. This additional information is stored in DataONE as a system metadata object and contains information such as who can access or update the file, how many copies of the file should be maintained, whether the file has been superseded by another object, etc. The system metadata information that will be uploaded to DataONE is collected and stored in an R object type datapack::SystemMetadata, as shown below:

library(digest)
# Create a system metadata object for a data file. 
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha1)
sysmeta <- addAccessRule(sysmeta, "public", "read")

Alternatively, the system metadata could have been created with a seriesId. The seriesId is explained in the dataone_overview vignette. The following example shows the creation of a SystemMetadata object using the optional seriesId:

# Create a system metadata object for a data file. 
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
# The seriesId can be any unique character string.
seriesId <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha1,  seriesId=seriesId)

A unique identifier must be specified for each system metadata, whether or not a seriesId is used.

The dataset can now be uploaded to DataONE with the associated system metadata:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
response <- createObject(mn, pid, csvfile, sysmeta)

Note that for this example, the DataONE test environment STAGING is used, and not the production environment.

Maintaining Uploaded Datasets

After data has been uploaded to DataONE, maintenance operations can be performed on these objects using the methods described in the following sections.

Update the DataONE system metadata for an object (MNode: updateSystemMetadata)

The system metadata can be updated for an object in DataONE without updating the data bytes of the object itself. For example, if an object was only readable by the data submitter, the access policy for an object can be updated to enable public access. System metadata is updated for an object using the updateSystemMetadata method:

The following exmaple first downloads the current system metadata for the pid from the previous example, then updates the access policy that will be applied to the object and uploads the new system metadata to DataONE so that the changes will be applied:

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
sysmeta <- getSystemMetadata(mn, pid)
sysmeta <- addAccessRule(sysmeta, "public", "read")
status <- updateSystemMetadata(mn, pid, sysmeta)

Note that updating an object in DataONE requires the proper access. For example, for the identifier shown above (urn:uuid:17d61d5a-061a-4778-9cdf-4e14751aaddc), only the DataONE user identity that is the rightsholder, or another user identity that has been granted write access by the rightsholder will be able to update the object on the DataONE member node. Running the previous example without the proper authentication will produce an error.

Replace an object with a newer version (MNode: updateObject)

The updateObject updates an existing object by creating a new object identified by a new PID on the Member Node. The new object replaces and obsoletes the old object. An obsoleted object in DataONE does not appear in search results, however it is still available for download if the identifier is known.

# Update object from previous example with a new version
updateid <- sprintf("urn:uuid:%s", UUIDgenerate())
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Start with the old object's sysmeta, then modify it to match
# the new object. We could have also created a sysmeta from scratch.
sysmeta <- getSystemMetadata(mn, pid)
sysmeta@identifier <- updateid
sysmeta@size <- size
sysmeta@checksum <- sha1
sysmeta@obsoletes <- pid
# Now update the object on the member node.
response <- updateObject(mn, pid, csvfile, updateid, sysmeta)
# Get the new, updated sysmeta and check it to ensure that the update
# worked, i.e. "obsoletes" is the old pid that was replaced by the update.
updsysmeta <- getSystemMetadata(mn, updateid)
updsysmeta@obsoletes

The Member Node will mark the object as being obsolete by setting a property in the system metadata on the object being replaced. An object marked as obsolete will not appear in search results, however, such an object is still available for download if the PID is known.