Introduction
This document describes how to use the dataone R package to upload data to DataONE, and how to perform maintenance operations on the data after it has been uploaded.
The dataone package provides methods to allow R scripts to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN). The dataone R package takes care of the details of calling the corresponding DataONE web service on a DataONE node. For example, the dataone createObject R method calls the DataONE web service MNStorage.create() that uploads a dataset to a DataONE MN.
Before uploading any data to a DataONE MN, it is necessary to obtain a DataONE user identity and a means of providing that identity when data is uploaded. DataONE handles this through user identity authentication, which requires that an authentication token (a character string) be provided during upload. The process for obtaining this token is described in the dataone-overview vignette, in the section New Authentication Mechanism, which is viewable with the R command vignette("dataone-overview"). (Note: DataONE originally used X.509 certificates for authentication, which are still supported.)
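A minimal sketch of making a token available to an R session follows. It assumes the option names described in the dataone-overview vignette (dataone_test_token for test environments such as STAGING, dataone_token for production); these should be checked against the installed package version. The environment variable name D1_TEST_TOKEN is hypothetical and chosen only for this illustration.

```r
# Read the token from an environment variable so it is not hard-coded in a script.
# D1_TEST_TOKEN is a hypothetical variable name used only in this sketch.
token <- Sys.getenv("D1_TEST_TOKEN", unset = NA_character_)

if (is.na(token) || !nzchar(token)) {
  message("No token found; authenticated uploads will fail until one is provided.")
} else {
  # Assumed option name for DataONE test environments;
  # production is assumed to use "dataone_token" instead.
  options(dataone_test_token = token)
}

# TRUE only if the option was set above (or earlier in the session).
token_is_set <- !is.null(getOption("dataone_test_token"))
```

Keeping the token out of the script itself avoids committing a credential to version control along with the code.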
Uploading a data package using uploadDataPackage
Datasets and metadata can be uploaded individually or as a data package. Uploading a data package is described first, with a workflow for preparing and uploading a data package using the uploadDataPackage method. A complete script that uses this workflow is shown here:
library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
sampleData <- system.file("extdata/sample.csv", package="dataone")
# Create a unique identifier string for the data object in a standard format.
dataId <- paste("urn:uuid:", UUIDgenerate(), sep="")
dataObj <- new("DataObject", id=dataId, format="text/csv", file=sampleData)
dataObj <- setPublicAccess(dataObj)
sampleEML <- system.file("extdata/sample-eml.xml", package="dataone")
# Create a unique identifier string for the metadata object in a standard format.
# Alternatively, a DOI could be used, generated with generateIdentifier(mn, scheme="DOI").
metadataId <- paste("urn:uuid:", UUIDgenerate(), sep="")
metadataObj <- new("DataObject", id=metadataId, format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)
metadataObj <- setPublicAccess(metadataObj)
dp <- addData(dp, dataObj, metadataObj)
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, replicate=TRUE, public=TRUE, numberReplicas=2)
The following sections describe each line of this script in detail.
1. Create a DataPackage object.
In order to use uploadDataPackage, it is necessary to prepare an R DataPackage object that will serve as a container for the set of files that will be included in the data package. The following commands load the required libraries and create an empty DataPackage object that will be added to later:
library(dataone)
library(datapack)
library(uuid)
dp <- new("DataPackage")
When using the uploadDataPackage method, data structures that are required by DataONE are created, configured and uploaded automatically with the data package. These data structures include a ResourceMap that describes the data package, and SystemMetadata objects that contain DataONE system information for each of the science datasets and associated science metadata.
2. Create a DataObject for each data file
A DataObject must be created for each metadata file and data file that will be included in the data package. The DataObject maintains information about an object that will be needed by DataONE.
A SystemMetadata object will be created automatically and stored in each DataObject. The SystemMetadata object will be used by DataONE to maintain low-level information about the dataset, such as the access policy, the user identity of the rightsholder (the user identity that can modify access to the dataset), which Member Nodes it can be replicated to, etc.
The example below creates a DataObject for a science dataset:
sampleData <- system.file("extdata/sample.csv", package="dataone")
dataId <- paste("urn:uuid:", UUIDgenerate(), sep="")
dataObj <- new("DataObject", id=dataId, format="text/csv", file=sampleData)
An optional user argument can be specified when creating a DataObject, which will be used to set the DataONE submitter and rightsholder of the dataset when it is uploaded. The rightsholder is granted all access privileges to the object.
If user is not specified for a DataObject, then the submitter and rightsholder for an object will automatically be set, when the object is uploaded to DataONE, to the DataONE user that created the authentication token or X.509 certificate.
Note that if the id argument is not specified, a unique identifier will automatically be created and assigned to the DataObject.
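The optional arguments described above can be sketched as follows. This is a minimal illustration that runs without contacting DataONE: the data file is a temporary CSV, and the subject string is a hypothetical user identity, not a real one. The id argument is omitted so that an identifier is generated automatically.

```r
library(datapack)

# Temporary CSV file standing in for a real dataset.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = 4:6), csvfile, row.names = FALSE)

# Hypothetical subject string, for illustration only.
owner <- "CN=Jane Example A00000,O=Example,C=US,DC=cilogon,DC=org"

# No id is supplied, so a unique identifier should be assigned automatically.
dataObj <- new("DataObject", format = "text/csv", user = owner, file = csvfile)

# Retrieve the identifier that was assigned to the object.
objId <- getIdentifier(dataObj)
print(objId)

# The SystemMetadata stored in the DataObject can be inspected directly,
# for example to see which rightsholder has been recorded.
print(dataObj@sysmeta@rightsHolder)
```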
Access rules can be added to each DataObject after it has been created. Access rules can be added to grant permissions to a single user. Access can also be granted to the public user, which means any and all users. For example, public read access can be set using the setPublicAccess method:
dataObj <- setPublicAccess(dataObj)
Access rules for an individual DataONE user identity can also be added to the access policy.
Access rules are added to a DataObject using the addAccessRule method. The following access rule grants user ‘Peter Smith’ changePermission access to the dataset, which takes effect after the dataset is uploaded and available on a DataONE MN:
accessRules <- data.frame(subject="CN=Peter Smith A10499,O=Google,C=US,DC=cilogon,DC=org", permission="changePermission")
dataObj <- addAccessRule(dataObj, accessRules)
The value of the subject argument in the above example (“CN=Peter Smith A10499,O=Google,C=US,DC=cilogon,DC=org”) is the string value of a typical DataONE user identity. DataONE user identities and user authentication are described in section A New Authentication Mechanism in the vignette dataone-overview (to view this vignette, type this command in the R console: vignette("dataone-overview")).
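Several access rules can be supplied at once by giving the data frame more than one row. The sketch below combines public read access with two rules for hypothetical user identities, then inspects the resulting access policy. It uses a temporary file and does not contact DataONE; the subject strings are illustrative only.

```r
library(datapack)

# Temporary CSV file standing in for a real dataset.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvfile, row.names = FALSE)
dataObj <- new("DataObject", format = "text/csv", file = csvfile)

# Grant read access to the public user.
dataObj <- setPublicAccess(dataObj)

# One row per rule; both subjects are hypothetical identities.
accessRules <- data.frame(
  subject = c("CN=Jane Example A00000,O=Example,C=US,DC=cilogon,DC=org",
              "CN=John Example A00001,O=Example,C=US,DC=cilogon,DC=org"),
  permission = c("write", "changePermission"),
  stringsAsFactors = FALSE
)
dataObj <- addAccessRule(dataObj, accessRules)

# The access policy is held as a data frame in the DataObject's SystemMetadata.
policy <- dataObj@sysmeta@accessPolicy
print(policy)
```

Printing the policy before upload is a quick way to confirm that the intended subjects and permissions are in place.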
3. Add each DataObject to the DataPackage
The DataPackage object serves as a container for a set of data objects that will be uploaded to DataONE. The metadata DataObject and all science data DataObjects must be added to the DataPackage before calling uploadDataPackage.
Relationships between the objects in a DataPackage are stored in the ResourceMap which is stored in and maintained by the DataPackage. One type of relationship that is stored is between the science metadata and the science datasets that are described by it. In the DataONE data package implementation, this relationship is the CITO documents relationship that links the metadata object to science objects.
This relationship between the science metadata and science data objects will be made automatically for each science data object as it is added to the DataPackage, if the metadata object is included when the science data object is added.
Now add the metadata object to the DataPackage:
dp <- addData(dp, metadataObj)
Then specify the metadata object when each science data object is added, associating the metadata object with the science object:
dp <- addData(dp, do = dataObj, mo = metadataObj)
If there were additional DataObjects to add to the package, they would be added to the DataPackage and associated with the metadata object as follows:
dp <- addData(dp, do = dataObj2, mo = metadataObj)
dp <- addData(dp, do = dataObj3, mo = metadataObj)
dp <- addData(dp, do = dataObj4, mo = metadataObj)
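The relationships recorded by these calls can be inspected before uploading. The following sketch builds a small package from temporary placeholder files (the XML file is not valid EML) and lists the package members and stored relationships. It assumes the datapack functions getIdentifiers and getRelationships; newer datapack releases may also report addData as deprecated in favor of addMember.

```r
library(datapack)

# Temporary files stand in for a real data file and its metadata.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvfile, row.names = FALSE)
emlfile <- tempfile(fileext = ".xml")
writeLines("<eml/>", emlfile)  # placeholder content, not valid EML

dataObj <- new("DataObject", format = "text/csv", file = csvfile)
metadataObj <- new("DataObject", format = "eml://ecoinformatics.org/eml-2.1.1", file = emlfile)

# Add the metadata object first, then the data object linked to it.
dp <- new("DataPackage")
dp <- addData(dp, metadataObj)
dp <- addData(dp, do = dataObj, mo = metadataObj)

# Identifiers of all members currently in the package.
memberIds <- getIdentifiers(dp)
print(memberIds)

# Relationships stored in the package, one row per statement.
relations <- getRelationships(dp)
print(relations[, c("subject", "predicate", "object")])
```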
4. Upload the DataPackage
When all DataObjects have been added to the DataPackage, call the uploadDataPackage method to upload the entire DataPackage:
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp, replicate=TRUE, numberReplicas=2)
message(sprintf("Uploaded data package with identifier: %s", packageId))
(Note that the example uses a DataONE test environment STAGING, and not the production environment.)
After uploadDataPackage has been called successfully, the data package can be viewed on the member node and searched for using the DataONE search facility. Note that if objects in DataONE are not publicly readable, and the authenticated user performing the search isn’t granted access in an object’s access policy, then the objects will not be viewable or discoverable via the search facility for that user.
Uploading Individual Data And Metadata files
A single data or metadata file can be uploaded to a DataONE MN using the createObject method. When uploading a single file using this method, additional information must be supplied to DataONE that controls how DataONE interacts with the uploaded file. This additional information is stored in DataONE as a system metadata object and contains information such as who can access or update the file, how many copies of the file should be maintained, whether the file has been superseded by another object, etc. The system metadata information that will be uploaded to DataONE is collected and stored in an R object of type datapack::SystemMetadata, as shown below:
library(digest)
# Create a system metadata object for a data file.
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha1)
sysmeta <- addAccessRule(sysmeta, "public", "read")
Alternatively, the system metadata could have been created with a seriesId. The seriesId is explained in the dataone-overview vignette. The following example shows the creation of a SystemMetadata object using the optional seriesId:
# Create a system metadata object for a data file.
# Just for demonstration purposes, create a temporary data file.
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
format <- "text/csv"
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Generate a unique identifier for the dataset
pid <- sprintf("urn:uuid:%s", UUIDgenerate())
# The seriesId can be any unique character string.
seriesId <- sprintf("urn:uuid:%s", UUIDgenerate())
sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha1, seriesId=seriesId)
A unique identifier must be specified for each SystemMetadata object, whether or not a seriesId is used.
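Because the checksum in these examples is computed with SHA-1, it can be useful to record the algorithm explicitly rather than rely on a default. The sketch below assumes that the SystemMetadata constructor accepts a checksumAlgorithm argument and that the default algorithm may differ between datapack versions; both points should be checked against the installed release.

```r
library(datapack)
library(digest)
library(uuid)

# Temporary CSV file standing in for a real dataset.
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:20, y = 11:30), csvfile, row.names = FALSE)

# Compute the SHA-1 checksum of the file contents.
sha1 <- digest(csvfile, algo = "sha1", serialize = FALSE, file = TRUE)
pid <- sprintf("urn:uuid:%s", UUIDgenerate())

# State the algorithm explicitly so it matches the checksum value supplied.
sysmeta <- new("SystemMetadata", identifier = pid, formatId = "text/csv",
               size = file.info(csvfile)$size, checksum = sha1,
               checksumAlgorithm = "SHA-1")
sysmeta <- addAccessRule(sysmeta, "public", "read")

# Confirm the values that will be sent with the upload.
print(sysmeta@checksum)
print(sysmeta@checksumAlgorithm)
```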
The dataset can now be uploaded to DataONE with the associated system metadata:
cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
response <- createObject(mn, pid, csvfile, sysmeta)
Note that for this example, the DataONE test environment STAGING is used, and not the production environment.
Maintaining Uploaded Datasets
After data has been uploaded to DataONE, maintenance operations can be performed on these objects using the methods described in the following sections.
Replace an object with a newer version (MNode: updateObject)
The updateObject method updates an existing object by creating a new object, identified by a new PID, on the Member Node. The new object replaces and obsoletes the old object. An obsoleted object in DataONE does not appear in search results; however, it is still available for download if the identifier is known.
# Update object from previous example with a new version
updateid <- sprintf("urn:uuid:%s", UUIDgenerate())
testdf <- data.frame(x=1:20,y=11:30)
csvfile <- paste(tempfile(), ".csv", sep="")
write.csv(testdf, csvfile, row.names=FALSE)
size <- file.info(csvfile)$size
sha1 <- digest(csvfile, algo="sha1", serialize=FALSE, file=TRUE)
# Start with the old object's sysmeta, then modify it to match
# the new object. We could have also created a sysmeta from scratch.
sysmeta <- getSystemMetadata(mn, pid)
sysmeta@identifier <- updateid
sysmeta@size <- size
sysmeta@checksum <- sha1
sysmeta@obsoletes <- pid
# Now update the object on the member node.
response <- updateObject(mn, pid, csvfile, updateid, sysmeta)
# Get the new, updated sysmeta and check it to ensure that the update
# worked, i.e. "obsoletes" is the old pid that was replaced by the update.
updsysmeta <- getSystemMetadata(mn, updateid)
updsysmeta@obsoletes
The Member Node marks the replaced object as obsolete by setting a property in that object's system metadata. An obsolete object does not appear in search results but remains available for download if its PID is known.
Remove an object from DataONE search
An object can be removed from searches done with the DataONE search mechanism by calling the archive method with the PID of the object. This operation does not delete the object bytes, but instead updates the system metadata for the object to set the archived flag to true. The object can still be referenced with its PID and downloaded, but it will not appear in any search results.
Objects that are archived cannot be updated using the updateObject method. Once an object is archived, it cannot be un-archived.
The following statement archives the object that was just created in the previous example with the updateObject method.
response <- archive(mn, updateid)
The following commands can be used to verify that the object was archived.
sysmeta <- getSystemMetadata(mn, updateid)
sysmeta@archived