UK REF Impact Case Studies

Perry Stephenson

2017-07-19

library(refimpact)

Introduction

This package is a wrapper around the REF Impact Case Studies database API. Chances are that if you’re looking at this package, you already know what this dataset is, and you probably know roughly what you’re looking for.

If, however, you have stumbled upon this package and want to know more about the dataset, you can head here to find out more. If you are thinking of using it as a toy dataset for learning, you might find it useful for text mining, amongst other things.

Core functions

The core function for this package is ref_get(), which takes an API method as the first argument, and some optional arguments depending on the method.

The available API methods are detailed below, but are listed here for quick reference: SearchCaseStudies, ListInstitutions, ListTagTypes, ListTagValues and ListUnitsOfAssessment.

SearchCaseStudies

This is the core method of the API, and the most important for users of this package. The search method requires one additional argument to the ref_get() function: query. This argument takes a list of query parameters, which can be as simple as a single Case Study ID, in which case a single record is returned. A query returning a single record is shown below to demonstrate the syntax and the returned data structure; more complex queries are shown later in the vignette.

results <- ref_get("SearchCaseStudies", query=list(ID=941))
print(results)
## # A tibble: 1 x 19
##   CaseStudyId            Continent              Country   Funders
## *       <chr>               <list>               <list>    <list>
## 1         941 <data.frame [2 x 2]> <data.frame [2 x 2]> <chr [3]>
## # ... with 15 more variables: ImpactDetails <chr>, ImpactSummary <chr>,
## #   ImpactType <chr>, Institution <chr>, Institutions <list>, Panel <chr>,
## #   PlaceName <list>, References <chr>, ResearchSubjectAreas <list>,
## #   Sources <chr>, Title <chr>, UKLocation <list>, UKRegion <list>,
## #   UOA <chr>, UnderpinningResearch <chr>

You will note that the function returns a nested tibble, that is, a tibble with other data frames inside it. This means that you can interrogate the tibble as usual:

cat(results[[1, "CaseStudyId"]])
## 941
cat(results[[1, "Title"]])
## 
## Novel models for advanced imaging of urinary system function in healthy and diseased tissue.
cat(strtrim(results[[1, "ImpactSummary"]], width = 200), "<truncated>")
## 
## Drs Peppiatt-Wildman &amp; Wildman have developed novel models to investigate kidney and bladder
## function and drug action, through visualisation of cellular events in live tissue. This has had an
##  <truncated>
cat(strtrim(results[[1, "ImpactDetails"]], width = 200), "<truncated>")
## 
## Impact on Commerce
## Peppiatt-Wildman and Wildman's unique approaches to investigate kidney and bladder function in
## a range of experimental models, including tissue slices, isolated and perfused org <truncated>
cat(results[[1, "Institution"]])
## 
## University of Kent and University of Greenwich

You can also interrogate the nested fields the same way, and even subset them:

print(results[[1, "Country"]])
##   GeoNamesId           Name
## 1    2635167 United Kingdom
## 2    6252001  United States
print(results[[1, "Institutions"]])
##             AlternativeName         InstitutionName PeerGroup     Region
## 1 Greenwich (University of) University of Greenwich         D     London
## 2      Kent (University of)      University of Kent         B South East
##      UKPRN
## 1 10007146
## 2 10007150
print(results[[1, "Institutions"]][,c("UKPRN", "InstitutionName")])
##      UKPRN         InstitutionName
## 1 10007146 University of Greenwich
## 2 10007150      University of Kent

In the opinion of the package author, the nested tibble offers many advantages over other data representations. It is a relatively straightforward exercise to transform the data into a set of wide or narrow tables if required.
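As a rough illustration of such a transformation, the sketch below unnests one list column into a narrow table keyed by CaseStudyId. To keep it runnable without calling the API, it uses a small mock data frame (mock_results, with the Institutions values for case study 941 copied from the output above and a made-up second row with an empty nested frame). The helper unnest_column() is a hypothetical name and is not part of refimpact.

```r
# Mock stand-in for a ref_get() result: one row per case study,
# with a list column holding one data frame per row.
mock_results <- data.frame(CaseStudyId = c("941", "mock-2"),
                           stringsAsFactors = FALSE)
mock_results$Institutions <- list(
  data.frame(UKPRN = c(10007146L, 10007150L),
             InstitutionName = c("University of Greenwich",
                                 "University of Kent"),
             stringsAsFactors = FALSE),
  data.frame()  # a case study with no nested rows
)

# Hypothetical helper: repeat the ID for each nested row and stack the pieces.
unnest_column <- function(df, id_col, nested_col) {
  pieces <- Map(function(id, inner) {
    if (NROW(inner) == 0) return(NULL)  # nothing to unnest for this row
    piece <- data.frame(rep(id, nrow(inner)), inner, stringsAsFactors = FALSE)
    names(piece)[1] <- id_col
    piece
  }, df[[id_col]], df[[nested_col]])
  pieces <- unname(Filter(Negate(is.null), pieces))
  if (length(pieces) == 0) return(NULL)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

narrow <- unnest_column(mock_results, "CaseStudyId", "Institutions")
print(narrow)
```

The same helper could be applied to any of the other nested data frame columns, such as Country, giving one narrow table per column that can be joined back on CaseStudyId.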

Returning a single case study based on the ID is obviously a niche use-case, so there are some other ways to search the database. But before getting to those, it is worth pointing out that you can select multiple case studies in a single query:

results <- ref_get("SearchCaseStudies", query=list(ID=c(941, 942, 1014)))
print(results)
## # A tibble: 3 x 19
##   CaseStudyId            Continent              Country   Funders
## *       <chr>               <list>               <list>    <list>
## 1         941 <data.frame [2 x 2]> <data.frame [2 x 2]> <chr [3]>
## 2         942 <data.frame [2 x 2]> <data.frame [2 x 2]> <chr [3]>
## 3        1014 <data.frame [0 x 0]> <data.frame [0 x 0]> <chr [0]>
## # ... with 15 more variables: ImpactDetails <chr>, ImpactSummary <chr>,
## #   ImpactType <chr>, Institution <chr>, Institutions <list>, Panel <chr>,
## #   PlaceName <list>, References <chr>, ResearchSubjectAreas <list>,
## #   Sources <chr>, Title <chr>, UKLocation <list>, UKRegion <list>,
## #   UOA <chr>, UnderpinningResearch <chr>

The ID parameter above is an exclusive parameter: if you provide one or more IDs, the function prints a warning to the console and removes all parameters except for the IDs. This is based on the API’s documented limitations.

The other parameters can all be combined for searching. Those parameters are UKPRN, UoA, tags and phrase.
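The effect of an exclusive parameter can be pictured with the sketch below. It is not the package's actual implementation: keep_only_ids() is a hypothetical helper, and the choice to warn only when other parameters accompany ID is an assumption made for illustration.

```r
# Hypothetical helper: if ID is present, drop every other query parameter.
keep_only_ids <- function(query) {
  if ("ID" %in% names(query) && length(query) > 1) {
    dropped <- setdiff(names(query), "ID")
    warning("ID is exclusive; ignoring: ", paste(dropped, collapse = ", "),
            call. = FALSE)
    query <- query["ID"]
  }
  query
}

# The UoA parameter is discarded because IDs were supplied.
cleaned <- suppressWarnings(keep_only_ids(list(ID = c(941, 942), UoA = 5)))
names(cleaned)
```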

Some examples are shown below.

results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007777))
dim(results)
## [1]  7 19
results <- ref_get("SearchCaseStudies", query=list(UoA = 5))
dim(results)
## [1] 257  19
results <- ref_get("SearchCaseStudies", query=list(tags = c(11280, 5085)))
dim(results)
## [1] 24 19
results <- ref_get("SearchCaseStudies", query=list(phrase = "hello"))
dim(results)
## [1]  7 19
results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007146,
                                                   UoA   = 3))
dim(results)
## [1]  2 19

Unfortunately, the API method requires at least one search parameter, which makes it more difficult to download the entire dataset. A short script for this purpose is included at the end of this vignette.

Useful values for the UKPRN, UoA and tags parameters can be found by querying the other four API methods; phrase is the only parameter that can be used without looking up a value first. Each of the four other API methods is outlined below.

ListInstitutions

This method lists all of the institutions which are included in the REF Impact Case Studies database. The UKPRN column in the resulting tibble can be used as a query parameter.

institutions <- ref_get("ListInstitutions")
print(institutions)
## # A tibble: 155 x 5
##                             AlternativeName
##  *                                    <chr>
##  1                          Open University
##  2                     Cranfield University
##  3                     Royal College of Art
##  4            Bishop Grosseteste University
##  5           Buckinghamshire New University
##  6 Royal Central School of Speech and Drama
##  7                  Chester (University of)
##  8      Canterbury Christ Church University
##  9                  York St John University
## 10                     Edge Hill University
## # ... with 145 more rows, and 4 more variables: InstitutionName <chr>,
## #   PeerGroup <chr>, Region <chr>, UKPRN <int>

ListTagTypes and ListTagValues

These methods provide tags which can be used as search parameters in the SearchCaseStudies method. The ListTagTypes method returns the types of tags available:

tag_types <- ref_get("ListTagTypes")
print(tag_types)
## # A tibble: 13 x 2
##       ID              TagType
##  * <int>                <chr>
##  1     1           ImpactType
##  2     3              Subject
##  3     4            PlaceName
##  4     5              Country
##  5     6            Continent
##  6     7    Interdisciplinary
##  7     8              Similar
##  8     9               Funder
##  9    10                Panel
## 10    11    InstitutionRegion
## 11    12 InstitutionPeerGroup
## 12    13            UK Region
## 13    15     Joint Submission

These tag types can then be used as an argument to the ListTagValues method, to get all tags for each type:

tag_values_5 <- ref_get("ListTagValues", tag_type = 5)
print(tag_values_5)
## # A tibble: 252 x 2
##       ID                         Name
##  * <int>                        <chr>
##  1 11280                  Afghanistan
##  2 11310                Aland Islands
##  3 11116                      Albania
##  4 11106                      Algeria
##  5 25129 American Samoa, Territory of
##  6 11221                      Andorra
##  7 11185                       Angola
##  8 11301                     Anguilla
##  9 11187          Antigua and Barbuda
## 10 11328                    Argentina
## # ... with 242 more rows
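To collect the values for every tag type, one option is to loop over the IDs returned by ListTagTypes and stack the results. The sketch below takes the fetching function as an argument so that it can be tried offline. The names all_tag_values(), stub_types and stub_tag_fetch() are hypothetical; in real use the fetcher would presumably wrap ref_get("ListTagValues", tag_type = id).

```r
# Hypothetical helper: fetch the values for each tag type and label them.
all_tag_values <- function(tag_types, fetch) {
  pieces <- lapply(seq_len(nrow(tag_types)), function(i) {
    values <- fetch(tag_types$ID[i])
    if (NROW(values) == 0) return(NULL)   # skip tag types with no values
    values$TypeID  <- tag_types$ID[i]
    values$TagType <- tag_types$TagType[i]
    values
  })
  do.call(rbind, Filter(Negate(is.null), pieces))
}

# Offline stand-ins, using two rows taken from the outputs in this vignette.
stub_types <- data.frame(ID = c(1L, 5L),
                         TagType = c("ImpactType", "Country"),
                         stringsAsFactors = FALSE)
stub_tag_fetch <- function(id) {
  if (id == 1L) {
    data.frame(ID = 5085L, Name = "Societal", stringsAsFactors = FALSE)
  } else {
    data.frame(ID = 11280L, Name = "Afghanistan", stringsAsFactors = FALSE)
  }
}

tags <- all_tag_values(stub_types, stub_tag_fetch)
print(tags)
```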

This can take some time to iterate through, so the full table is bundled with this package. You can access it via ref_tags:

print(ref_tags)
## # A tibble: 9,400 x 4
##       ID                                    Name TypeID    TagType
##  * <int>                                   <chr>  <int>      <chr>
##  1  5083                                Cultural      1 ImpactType
##  2  5086                                Economic      1 ImpactType
##  3  5087                           Environmental      1 ImpactType
##  4  5082                                  Health      1 ImpactType
##  5  5081                                   Legal      1 ImpactType
##  6  5080                               Political      1 ImpactType
##  7  5085                                Societal      1 ImpactType
##  8  5084                           Technological      1 ImpactType
##  9   911 Accounting, Auditing and Accountability      3    Subject
## 10  1022                   Aerospace Engineering      3    Subject
## # ... with 9,390 more rows
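A table shaped like ref_tags makes it easy to turn tag names into the IDs that the tags query parameter expects. The sketch below uses a three-row mock built from values shown in this vignette rather than the bundled ref_tags object, so that it runs without the package; tag_ids() is a hypothetical helper.

```r
# Mock with the same columns as ref_tags.
mock_tags <- data.frame(
  ID      = c(5085L, 5086L, 11280L),
  Name    = c("Societal", "Economic", "Afghanistan"),
  TypeID  = c(1L, 1L, 5L),
  TagType = c("ImpactType", "ImpactType", "Country"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up tag IDs by type and name, failing loudly
# if any requested name is not present.
tag_ids <- function(tags, type, tag_names) {
  hits <- tags[tags$TagType == type & tags$Name %in% tag_names, ]
  unknown <- setdiff(tag_names, hits$Name)
  if (length(unknown) > 0) {
    stop("Unknown tag name(s): ", paste(unknown, collapse = ", "))
  }
  hits$ID
}

# Build a query list for case studies tagged with both values.
query <- list(tags = c(tag_ids(mock_tags, "Country", "Afghanistan"),
                       tag_ids(mock_tags, "ImpactType", "Societal")))
str(query)
```

With refimpact loaded, a list built this way from ref_tags could be passed as the query argument of ref_get("SearchCaseStudies", ...), as in the tags example earlier in this vignette.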

ListUnitsOfAssessment

This method lists all of the units of assessment which the Impact Case Studies can be assessed against. The tibble also includes an ID column which can be used when querying the SearchCaseStudies method.

UoAs <- ref_get("ListUnitsOfAssessment")
print(UoAs)
## # A tibble: 36 x 3
##       ID      Panel
##  * <int>      <chr>
##  1     1 A         
##  2     2 A         
##  3     3 A         
##  4     4 A         
##  5     5 A         
##  6     6 A         
##  7     7 B         
##  8     8 B         
##  9     9 B         
## 10    10 B         
## # ... with 26 more rows, and 1 more variables: Subject <chr>

Extracting the entire dataset

As alluded to above, the API cannot be searched without parameters, which means that downloading the entire dataset is not a simple task. The code below can be used to extract all records from the database.

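# Get the ID of every unit of assessment
uoa_table <- ref_get("ListUnitsOfAssessment")
uoa_list <- uoa_table$ID

# Pre-allocate a list with one slot per unit of assessment
ref_corpus <- vector(length = length(uoa_list), mode = "list")

# Query the case studies for each unit of assessment in turn
for (i in seq_along(uoa_list)) {
  message("Retrieving data for UoA ", uoa_list[i])
  ref_corpus[[i]] <- ref_get("SearchCaseStudies", query = list(UoA = uoa_list[i]))
}

# Stack the per-UoA results into a single table
output <- do.call(rbind, ref_corpus)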
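
A single failed request would stop the loop above part-way through. One possible variation, sketched below, wraps each request in tryCatch() so that failures are reported and skipped, and records which units of assessment need to be retried. The function collect_by_uoa() and the stub fetcher are hypothetical and the stub's case study IDs are invented; in real use the fetcher would presumably call ref_get("SearchCaseStudies", query = list(UoA = id)).

```r
# Hypothetical helper: fetch each UoA, skipping (and recording) failures.
collect_by_uoa <- function(uoa_ids, fetch) {
  pieces <- lapply(uoa_ids, function(id) {
    tryCatch(fetch(id), error = function(e) {
      message("Skipping UoA ", id, ": ", conditionMessage(e))
      NULL
    })
  })
  failed <- uoa_ids[vapply(pieces, is.null, logical(1))]
  list(data   = do.call(rbind, Filter(Negate(is.null), pieces)),
       failed = failed)
}

# Offline stub that fails for one UoA to exercise the error path.
stub_case_fetch <- function(id) {
  if (id == 2L) stop("simulated network error")
  data.frame(CaseStudyId = as.character(100L + id), UoA = id,
             stringsAsFactors = FALSE)
}

result <- collect_by_uoa(1:3, stub_case_fetch)
print(result$data)
print(result$failed)
```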