Reading and parsing EML

Carl Boettiger

2017-05-01

Crude view of an EML file

The function eml_view() gives a crude view of an EML file in the viewer, provided the listviewer package is installed. It can be useful for exploring the file.

library("EML")
f <- system.file("xsd/test/eml-i18n.xml", package = "EML")
eml_view(f)
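Since eml_view() depends on the optional listviewer package, a script can check for it first. The following is a minimal sketch: requireNamespace() and interactive() are base R functions, while the interactive() check and the wording of the message are assumptions added here, not behaviour documented for eml_view().

```r
# Check for the optional 'listviewer' package before calling eml_view().
has_listviewer <- requireNamespace("listviewer", quietly = TRUE)

if (interactive() && has_listviewer) {
  library("EML")
  f <- system.file("xsd/test/eml-i18n.xml", package = "EML")
  eml_view(f)
} else {
  # Assumption: skip the viewer outside an interactive session.
  message("Skipping eml_view(): it needs an interactive session and the 'listviewer' package.")
}
```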

Parsing an EML file

The function eml_get() extracts a list of all occurrences of the desired metadata elements from a given eml object or part thereof.

library("EML")
f <- system.file("xsd/test/eml-i18n.xml", package = "EML")
eml <- read_eml(f)

Here we request all coverage elements occurring anywhere in the eml document:

coverage <- eml_get(eml, "coverage")
coverage
## [[1]]
## <coverage system="uuid" scope="document">
##   <geographicCoverage scope="document">
##     <geographicDescription>The Geographic region of the kelp bed data
##                  extends along the California coast, down through the coast of Baja,
##                  Mexico: Central California (Halfmoon Bay to Purisima Point),
##                  Southern California (Point Arguello to the United States/Mexico
##                  border including the Channel Islands) and Baja California (points
##                  south of the United States/Mexico border including several offshore
##                  islands).</geographicDescription>
##     <boundingCoordinates>
##       <westBoundingCoordinate>-122.44</westBoundingCoordinate>
##       <eastBoundingCoordinate>-117.15</eastBoundingCoordinate>
##       <northBoundingCoordinate>37.38</northBoundingCoordinate>
##       <southBoundingCoordinate>30.00</southBoundingCoordinate>
##     </boundingCoordinates>
##   </geographicCoverage>
##   <temporalCoverage scope="document">
##     <rangeOfDates>
##       <beginDate>
##         <calendarDate>1957-08-13</calendarDate>
##       </beginDate>
##       <endDate>
##         <calendarDate>2006-02-18</calendarDate>
##       </endDate>
##     </rangeOfDates>
##   </temporalCoverage>
##   <taxonomicCoverage scope="document">
##     <taxonomicClassification>
##       <taxonRankName>KINGDOM</taxonRankName>
##       <taxonRankValue>Plantae</taxonRankValue>
##       <taxonomicClassification>
##         <taxonRankName>PHYLUM</taxonRankName>
##         <taxonRankValue>Phaeophyta</taxonRankValue>
##         <taxonomicClassification>
##           <taxonRankName>CLASS</taxonRankName>
##           <taxonRankValue>Phaeophyceae</taxonRankValue>
##           <taxonomicClassification>
##             <taxonRankName>ORDER</taxonRankName>
##             <taxonRankValue>Laminariales</taxonRankValue>
##             <taxonomicClassification>
##               <taxonRankName>FAMILY</taxonRankName>
##               <taxonRankValue>Lessoniaceae</taxonRankValue>
##               <taxonomicClassification>
##                 <taxonRankName>GENUS</taxonRankName>
##                 <taxonRankValue>Macrocystis</taxonRankValue>
##                 <taxonomicClassification>
##                   <taxonRankName>genusSpecies</taxonRankName>
##                   <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
##                   <taxonomicClassification>
##                     <taxonRankName>commonName</taxonRankName>
##                     <taxonRankValue>MAPY</taxonRankValue>
##                   </taxonomicClassification>
##                 </taxonomicClassification>
##               </taxonomicClassification>
##             </taxonomicClassification>
##           </taxonomicClassification>
##         </taxonomicClassification>
##       </taxonomicClassification>
##     </taxonomicClassification>
##   </taxonomicCoverage>
## </coverage>

The result is a list containing 1 coverage element. We can further subset this element directly using eml_get() on it, for instance, to extract just the temporalCoverage element:

eml_get(coverage, "temporalCoverage")
## An object of class "ListOftemporalCoverage"
## [[1]]
## <temporalCoverage system="uuid" scope="document">
##   <rangeOfDates>
##     <beginDate>
##       <calendarDate>1957-08-13</calendarDate>
##     </beginDate>
##     <endDate>
##       <calendarDate>2006-02-18</calendarDate>
##     </endDate>
##   </rangeOfDates>
## </temporalCoverage>

Any EML element can be extracted in this way. Let’s try an example metadata file for a dataset that documents 11 separate dataTables:

hf001 <- system.file("examples/hf001.xml", package="EML") 

eml_HARV <- read_eml(hf001)

How many dataTable entities are there in this dataset?

dt <- eml_get(eml_HARV, "dataTable")
length(dt)
## [1] 11

We can iterate over our list of dataTable elements to extract relevant metadata, such as the entityName or the download url:

entities <- sapply(dt, eml_get, "entityName")
urls <- sapply(dt, eml_get, "url")

Note that the latter example is equivalent to the more verbose call that specifies exactly where the url of interest is located:

urls <- sapply(dt, function(x) x@physical[[1]]@distribution[[1]]@online@url)

This verbose syntax can be useful if each dataTable contains multiple url elements and we want only certain ones. Specifying the exact path in this way can also improve the speed of the command. For these reasons, consider this form for programmatic use, while the much simpler eml_get example shown above is practical for most interactive applications.
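To make the mechanics of the explicit slot path concrete without depending on the package, here is a small sketch that uses toy S4 classes. The Toy* classes, the make_table() helper, and the example.org URLs are all hypothetical stand-ins that copy only the nesting used in the path above; they are not the EML package's own classes. The sketch also swaps sapply() for vapply(), which fails early if any element does not yield exactly one string.

```r
library(methods)

# Toy classes (hypothetical, NOT the EML package's classes) that copy
# only the nesting used in the explicit path above.
setClass("ToyOnline", representation(url = "character"))
setClass("ToyDistribution", representation(online = "ToyOnline"))
setClass("ToyPhysical", representation(distribution = "list"))
setClass("ToyDataTable",
         representation(entityName = "character", physical = "list"))

# Hypothetical helper: build one toy table with a single download url.
make_table <- function(name, url) {
  online <- new("ToyOnline", url = url)
  distribution <- new("ToyDistribution", online = online)
  physical <- new("ToyPhysical", distribution = list(distribution))
  new("ToyDataTable", entityName = name, physical = list(physical))
}

toy_dt <- list(
  make_table("table-a", "https://example.org/a.csv"),
  make_table("table-b", "https://example.org/b.csv")
)

# Follow the exact path: first physical, first distribution, online, url.
toy_urls <- vapply(
  toy_dt,
  function(x) x@physical[[1]]@distribution[[1]]@online@url,
  character(1)
)
toy_urls
```

Because the path names one specific slot at each level, other url elements elsewhere in the same table are never visited.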

Although the default return type for eml_get is just the S4 object (whose print method displays the corresponding XML structure used to represent that metadata), for a few commonly accessed complex elements, eml_get returns a more convenient data.frame. For instance, the attributeList describing the metadata for every column in an EML document is returned as a pair of data.frames: one for all the attributes, and a second, optional data.frame defining the levels for the factors, if any are used. Let’s take a look:

Here we get the attributeList for each dataTable in the dataset. We check the length to confirm that we get one attributeList for each dataTable:

attrs <- eml_get(dt, "attributeList") 
length(attrs)
## [1] 11
attrs[[1]]
## $attributes
##   attributeName     domain formatString precision minimum maximum
## 1          date       <NA>         <NA>        NA    <NA>    <NA>
## 2         notes textDomain         <NA>        NA    <NA>    <NA>
##   definition pattern source attributeLabel storageType missingValueCode
## 1       <NA>    <NA>   <NA>           <NA>        <NA>             <NA>
## 2      notes    <NA>   <NA>           <NA>        <NA>             <NA>
##   missingValueCodeExplanation measurementScale attributeDefinition
## 1                        <NA>         dateTime                date
## 2                        <NA>          nominal               notes
## 
## $factors
## NULL

(Note that we could have passed the original eml_HARV to eml_get() instead of dt here, since we know all attributeList elements are inside dataTable elements, but this is a bit more explicit and a bit faster.)

This returns the data.frame objects containing the attribute metadata for the first table (hence the [[1]], though attrs now contains this metadata for all 11 tables). This is the same result we would have gotten using the more explicit call to the helper function get_attributes():

get_attributes(eml_HARV@dataset@dataTable[[1]]@attributeList)
## $attributes
##   attributeName     domain formatString precision minimum maximum
## 1          date       <NA>         <NA>        NA    <NA>    <NA>
## 2         notes textDomain         <NA>        NA    <NA>    <NA>
##   definition pattern source attributeLabel storageType missingValueCode
## 1       <NA>    <NA>   <NA>           <NA>        <NA>             <NA>
## 2      notes    <NA>   <NA>           <NA>        <NA>             <NA>
##   missingValueCodeExplanation measurementScale attributeDefinition
## 1                        <NA>         dateTime                date
## 2                        <NA>          nominal               notes
## 
## $factors
## NULL
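Because each result is an ordinary list holding an attributes data.frame and a factors entry, the usual data.frame tools apply to it. The sketch below runs on a hand-built stand-in so that it works without the package: first_table copies a subset of the columns from the printed output above, and attrs_demo is a hypothetical one-element substitute for attrs.

```r
# Stand-in for attrs[[1]], rebuilt by hand from a subset of the
# columns printed above (not produced by the EML package).
first_table <- list(
  attributes = data.frame(
    attributeName = c("date", "notes"),
    domain = c(NA, "textDomain"),
    measurementScale = c("dateTime", "nominal"),
    attributeDefinition = c("date", "notes"),
    stringsAsFactors = FALSE
  ),
  factors = NULL
)

# Hypothetical one-table substitute for the full attrs list.
attrs_demo <- list(first_table)

# Column names documented for each table.
column_names <- lapply(attrs_demo, function(a) a$attributes$attributeName)
column_names

# Attributes of the first table measured on a nominal scale.
nominal <- subset(attrs_demo[[1]]$attributes, measurementScale == "nominal")
nominal$attributeName
```

With the real attrs object in place of attrs_demo, the same lapply() call would collect the column names for all 11 tables.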