# Introduction to the rcoreoa package

#### 2018-09-20

rcoreoa - Client for the CORE API (https://core.ac.uk/docs/). CORE (https://core.ac.uk) aggregates open access research outputs from repositories and journals worldwide and make them available to the public.

## Installation

The package can be installed directly from the CRAN repository:

install.packages("rcoreoa")

Or you can install the development version:

devtools::install_github("ropensci/rcoreoa")

Once the package is installed, simply run:

library("rcoreoa")

## Obtaining an API key

The Core API requires an API key, and, as such, requires you to register for the key on the Core Website. Once you register with your email address you will be sent an API key that looks a little like this:

thisISyourAPIkeyITlooksLIKEaLONGstringOFletters

## Using the API Key

Best practice is to set the API key as an environment variable for your system, and then call it in R using Sys.getenv(). If you set the parameter in .Renviron it is permanently available to your R sessions. There is a decent writeup of how to use the .Renviron file in Colin Gillespie & Robin Lovelace’s Efficient R Programming. Be aware that if you are using version control you do not want to commit the .Renviron file in your local directory. Either edit your global .Renviron file, or make sure that .Renviron is added to your .gitignore file.

Within the .Renviron file you will add:

CORE_KEY=thisISyourAPIkeyITlooksLIKEaLONGstringOFletters

The key may also be included in a file such as a .bash_profile file, or elsewhere. Users may decide which works best for them. Once you’ve added the API key, restart your R session and test to make sure the key has been added using the command:

Sys.getenv("CORE_KEY")

If you get this to work, you’re doing great and we can move on to the next section. If this is still not working for you, check to make sure you have saved the .Renviron file, that it is in the same directory as your current project’s working directory, and that the name you have given the variable in the .Renviron file is the same as the name you are calling in Sys.getenv().

## An Introduction to the Functions

The rcoreoa package accesses CORE’s API to facilitate open text searching. The API allows an individual to search for articles based on text string searches using core_search(). Given a set of article IDs from the core_search(), users can then find more bibliographic information on the article (core_articles()) and article publishing history (core_articles_history()), on the journals in which the article was published (core_journals()).

All of the functions return structured R objects, but can return JSON character strings by appending an unerscore (_) to the function name. We will illustrate the difference:

api_key <- Sys.getenv('CORE_KEY')
core_journals(id = '2167-8359', key = api_key)
#> $status #> [1] "OK" #> #>$data
#> $data$title
#> [1] "PeerJ"
#>
#> $data$identifiers
#> [1] "oai:doaj.org/journal:576e4d34b8bf461bb586f1e90d80d7cc"
#> [2] "issn:2167-8359"
#> [3] "url:https://doaj.org/toc/2167-8359"
#>
#> $data$subjects
#> [1] "biomedical" "health"     "genetics"   "ecology"    "biology"
#> [6] "Medicine"   "R"
#>
#> $data$language
#> [1] "EN"
#>
#> $data$rights
#> [1] "CC BY"
#>
#> $data$publisher
#> [1] "PeerJ Inc."

And with the underscore:

core_journals_(id = '2167-8359', key = api_key)
#> [1] "{\"status\":\"OK\",\"data\":{\"title\":\"PeerJ\",\"identifiers\":[\"oai:doaj.org\\/journal:576e4d34b8bf461bb586f1e90d80d7cc\",\"issn:2167-8359\",\"url:https:\\/\\/doaj.org\\/toc\\/2167-8359\"],\"subjects\":[\"biomedical\",\"health\",\"genetics\",\"ecology\",\"biology\",\"Medicine\",\"R\"],\"language\":\"EN\",\"rights\":\"CC BY\",\"publisher\":\"PeerJ Inc.\"}}"

Through this Vignette we will illustrate some of the tools available as part of the package within a workflow that seeks to perform some basic bibliometric analysis.

## A Research Workflow

We are interested in the way research within a particular field is communicated. Given this, we want to use CORE to examine trends in publishing for a set of key terms. To avoid being completely swamped by results we will pick a topic that is a relatively small subdiscipline, and, to ensure that we understand some of the broader patterns, we will pick a discipline that we have some knowledge about. We will also try to find a discipline that is linked to an exisitng rOpenSci package, so why not choose paleoecology, and, more specifically, palynology.

Palynology is the study of organic-walled microfossils. Of these microfossils, the most well known is pollen fossilized in lake or ocean sediments. Using relative abundances of pollen can give us an indication of the composition of ancient fossils, and, handily enough, there is an R package, neotoma that allows us to explore the pollen data.

### Finding articles

So, lets see how many articles there are in the CORE holdings that contain the term palynology:

palyn <- core_search(query = 'palynology', key = api_key)

The result is a list with three components: ['Status', 'totalHits', 'data']. In our case, the result for status is "OK" (as opposed to not found), the totalHits are 13175 and then there are 10 articles returned. The total number of articles is much higher than the number of articles returned because by default the limit flag for core_search is set to 10.

We can constrain our search, to look at the change in results over time by looking only for articles within certain specified time limits. For this we can use the core_advanced_search:


# Define a data.frame with a row per year bin.
query <- data.frame(
all_of_the_words = "palynology",
year_from = as.character(seq(1950, 2016)),
year_to   = as.character(seq(1951, 2017)),
stringsAsFactors = FALSE)

paly_bins <- core_advanced_search(query = query, key = api_key)

plot(seq(1951, 2017), paly_bins[[2]],
xlab = "Published Year", ylab = "Articles", type = 'l')

But this mapping of a single term likely reflects both an increase in the use of the term “palynology” (coined in 1947), but also in the number of records within the CORE repository. As such we need to use some sort of control, so let’s take a very common word, for example “and”:


query <- data.frame(
all_of_the_words = "and",
year_from = as.character(seq(1950, 2016)),
year_to   = as.character(seq(1951, 2017)),
stringsAsFactors = FALSE)

control_bins <- core_advanced_search(query = query, key = api_key)

plot(seq(1951, 2017), control_bins[[2]] / max(control_bins[[2]]),
xlab = "Published Year", ylab = "Articles", lty = 2, col = 'red', type = 'l')
lines(seq(1951, 2017), paly_bins[[2]] / max(paly_bins[[2]]))

Given that the number of papers for both "palynology" & "and" are well matched, it seems like this curve is largely a function of available publications. Given that "and" should represent the majority of all publications, let’s look at a set of search strings. Palynology is a branch of paleoecology. Europeans often spell paleoecology as palaeoecology. Let’s look at all three terms as a function of th "and" results. We’ll wrap it into a function to streamline things:


ann_query <- function(string, control = NULL) {

query <- data.frame(
all_of_the_words = string,
year_from = as.character(seq(1950, 2016)),
year_to   = as.character(seq(1951, 2017)),
stringsAsFactors = FALSE)

hit_bins <- core_advanced_search(query = query, key = api_key)

if (is.null(control)) control = hit_bins

return(data.frame(years = seq(1951, 2017),
hits = hit_bins[[2]],
transformed = hit_bins[[2]] / control[[2]],
string = string,
stringsAsFactors = FALSE))
}

string_sets <- do.call(rbind.data.frame,
lapply(c("and", "palynology", "paleoecology", "palaeoecology"),
ann_query, control = control_bins))

string_sets <- string_sets[!string_sets$string %in% 'and',] plot(data = string_sets, transformed ~ years, col = factor(string_sets$string),
pch = 19, cex = 0.5)

It’s interesting to note the overall changes in the rates of publication. In particular, the contined growth of the use of palaeoecology vs paleoecology. Given the results we’ve pulled, it’s possible to do further analysis of the publications, but this provides a simple overview of some of the capabilities of the package, and ways to generate simple analytics for search strings from the publically available papers within CORE.