One of the principal advantages of creating EML is that it makes the data you need easier to find, without having to standardize all of your data files into one giant database. Instead, the data files can be in whatever form you want, provided the relevant information you might search on to discover data of interest is listed in the metadata.
To do this, we will use the dataone package to upload a private copy of our data file to the central data repository.
#install.packages("dataone", repos= c("https://cran.rstudio.com", "http://nceas.github.io/drat"))
## or
devtools::install_github(c("ropensci/datapack", "DataONEorg/rdataone"))
library("dataone")
library("datapack")
Imagine we have the file paths to our .csv data file and an .xml EML file providing metadata for it:
sampleData <- system.file("extdata/sample.csv", package="dataone")
sampleEML <- system.file("extdata/sample-eml.xml", package="dataone")
To upload these to a metadata repository such as the KNB, we simply create DataObjects for both files:
dataObj <- new("DataObject", format="text/csv", file=sampleData)
metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)
Note that optionally, new("DataObject") could have been given an id argument, which could be a (namespaced) UUID from UUIDgenerate, or a DOI from a member node (see generateIdentifier()). Since no id has been given, a UUID is automatically generated for each.
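As a minimal sketch of supplying your own identifier instead of relying on the automatic one, the following assumes the uuid package is installed. It writes a small throwaway CSV (the file and its contents are hypothetical) so that it does not depend on the sample files, and it uses the same file= argument as the calls above.

```r
library("datapack")

# Hypothetical stand-in for a real data file
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(site = c("A", "B"), count = c(3, 5)), csvPath, row.names = FALSE)

# Build a namespaced UUID by hand; "urn:uuid:" is the usual URN prefix
myId <- paste0("urn:uuid:", uuid::UUIDgenerate())

# Pass the identifier explicitly instead of letting one be generated
customObj <- new("DataObject", id = myId, format = "text/csv", file = csvPath)

# The object should report the identifier we supplied
getIdentifier(customObj)
```

Choosing the identifier up front can be convenient when the same id also has to be written into the metadata before upload.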
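Next, we grant write permission on both objects to a specific user, identified by the subject (distinguished name) of their login certificate:
accessRules <- data.frame(subject="CN=Noam Ross A45991,O=Google,C=US,DC=cilogon,DC=org", permission="write")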
dataObj <- addAccessRule(dataObj, accessRules)
metadataObj <- addAccessRule(metadataObj, accessRules)
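Because the rules are an ordinary data frame, several can be applied in one call. The sketch below is only an illustration under stated assumptions: the subject "public" is the name DataONE uses for unauthenticated users, the distinguished name is a made-up placeholder, and the CSV is a throwaway file. A public read rule is shown purely as an example of a second row; leave it out if the copy should stay private.

```r
library("datapack")

# Hypothetical stand-in for a real data file
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvPath, row.names = FALSE)
obj <- new("DataObject", format = "text/csv", file = csvPath)

# One row per rule: who (subject) may do what (permission)
rules <- data.frame(
  subject = c("public", "CN=Jane Doe A00000,O=Example,C=US,DC=cilogon,DC=org"),
  permission = c("read", "write"),
  stringsAsFactors = FALSE
)
obj <- addAccessRule(obj, rules)

# Inspect the access policy stored in the object's system metadata
obj@sysmeta@accessPolicy
```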
We now want to bundle these two objects (data and metadata) into a single “data package” to be uploaded. To do so, we just create a new DataPackage object and then add the data and metadata using the addData function:
dp <- new("DataPackage")
dp <- addData(dp, dataObj, metadataObj)
This both adds the files and registers that the metadata object describes the data object.
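To check locally that the package holds both members and that the "describes" link was recorded, one can inspect it before uploading. This is a sketch using throwaway files (the XML here is a placeholder, not valid EML); recent datapack releases may emit a deprecation warning for addData that points to addMember.

```r
library("datapack")

# Hypothetical stand-ins for the real data and metadata files
dataPath <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), dataPath, row.names = FALSE)
metaPath <- tempfile(fileext = ".xml")
writeLines("<placeholder/>", metaPath)  # placeholder only, not valid EML

dataObj <- new("DataObject", format = "text/csv", file = dataPath)
metadataObj <- new("DataObject", format = "eml://ecoinformatics.org/eml-2.1.1", file = metaPath)

dp <- new("DataPackage")
dp <- addData(dp, dataObj, metadataObj)

# Identifiers of the package members (one per DataObject)
getIdentifiers(dp)

# Relationships recorded between members, including metadata-documents-data
getRelationships(dp)
```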
We can now upload the package to a member node in the STAGING environment:
d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp)
Let’s see if the ID returned for our package now appears in the DataONE index:
query(CNode("STAGING"), searchTerms = list(id = packageId))
Note that the following example, which requests a DOI for the package, would fail if run here, since only the Production (PROD) environment can provide DOIs (the STAGING environment is only for tests and training examples), and only then on member nodes that offer DOIs, such as the KNB (urn:node:KNB).
cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
newid <- generateIdentifier(mn, "DOI")
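Since a DOI request can fail for the reasons just described, one option is to wrap the call and fall back to a locally generated UUID. The helper below, mintIdentifier, is hypothetical (it is not part of dataone) and assumes the uuid package is installed; it is a sketch of the idea rather than a recommended workflow.

```r
# Hypothetical helper: try to mint an identifier on the member node,
# and fall back to a locally generated UUID if the request fails
mintIdentifier <- function(mn, scheme = "DOI") {
  tryCatch(
    dataone::generateIdentifier(mn, scheme),
    error = function(e) {
      warning("Could not mint a ", scheme, " (", conditionMessage(e),
              "); using a local UUID instead")
      paste0("urn:uuid:", uuid::UUIDgenerate())
    }
  )
}
```

With the mn object from the example above, newid <- mintIdentifier(mn) would return the DOI when the node can issue one and a UUID otherwise. The warning keeps the fallback visible, so a test identifier is not mistaken for a citable DOI.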
We also want to update the EML file itself to use the new id as the packageId:
library("EML")  # provides read_eml() and write_eml()
eml <- read_eml(sampleEML)
eml@packageId <- newid
write_eml(eml, sampleEML)
We can now use the DOI when creating the DataObject for the metadata:
metadataObj <- new("DataObject", id = newid, format = "eml://ecoinformatics.org/eml-2.1.1", file=sampleEML)