
ctrdata for aggregating and analysing clinical trials

The package ctrdata provides functions for retrieving (downloading) information on clinical trials from public registers, and for aggregating and analysing such information. It can be used for the European Union Clinical Trials Register (“EUCTR”) and for ClinicalTrials.gov (“CTGOV”). Development of ctrdata started in 2015 and was motivated by the wish to understand trends in designs and conduct of trials and their availability for patients. The package is to be used within the R system.

Last reviewed on 2020-10-04 for version 1.3.2

Main features:

Remember to respect the registers’ copyrights and terms and conditions (see ctrOpenSearchPagesInBrowser(copyright = TRUE)). Please cite this package in any publication as follows: Ralf Herold (2020). ctrdata: Retrieve and Analyze Clinical Trials in Public Registers. R package version 1.3,

Package ctrdata has been used for example for:


1. Install package in R

Package ctrdata is available on CRAN and on GitHub. Within R, use the following commands to install package ctrdata:

# Install CRAN version:
install.packages("ctrdata")

# Alternatively, install development version: 
# - Preparation:
install.packages("devtools")
# - Install ctrdata:
devtools::install_github("rfhb/ctrdata")

These commands also install the package dependencies, which are nodbi, jsonlite, httr, curl, clipr, xml2, rvest.

2. Command line tools perl, sed, cat and php (5.2 or higher)

These command line tools are required for ctrLoadQueryIntoDb(), the main function of package ctrdata.

In Linux and macOS (including version 10.15 Catalina), these are usually already installed.

For MS Windows, install cygwin: In R, run ctrdata::installCygwinWindowsDoInstall() for an automated minimal installation into c:\cygwin (installations in folders matching c:\cygw* will also be recognised and used). Alternatively, manually install cygwin with packages perl, php-jsonc and php-simplexml into c:\cygwin. This installation uses about 160 MB of disk space; administrator credentials are not needed.


Once the package and the command line tools are installed, comprehensive testing can be run as follows (note that this takes several minutes):

tinytest::test_package("ctrdata", at_home = TRUE)

Overview of functions in ctrdata

The functions are listed in the approximate order of use.

Function name | Function purpose
ctrOpenSearchPagesInBrowser() | Open search pages of registers or execute search in web browser
ctrFindActiveSubstanceSynonyms() | Find synonyms and alternative names for an active substance
ctrGetQueryUrlFromBrowser() | Import from clipboard the URL of a search in one of the registers
ctrLoadQueryIntoDb() | Retrieve (download) or update, and annotate, information on clinical trials from a register and store in a database
dbQueryHistory() | Show the history of queries that were downloaded into the database collection
dbFindIdsUniqueTrials() | Produce a vector of de-duplicated identifiers of clinical trial records in the database
dbFindFields() | Find names of fields in the database
dbGetFieldsIntoDf() | Create a data.frame from records in the database with the specified fields
dfMergeTwoVariablesRelevel() | Merge two variables into a single variable, optionally map values to a new set of values
dfListExtractKey() | Extract an element based on its name (key) from a list in a complex data.frame such as obtained from dbGetFieldsIntoDf() for deeply nested fields
installCygwinWindowsDoInstall() | Convenience function to install a cygwin environment (MS Windows only)

Example workflow

The aim is to download protocol-related trial information and tabulate the trials’ status of conduct.


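# Attach the package:
library(ctrdata)

# Please review and respect register copyrights:
ctrOpenSearchPagesInBrowser(copyright = TRUE)

# Import from the clipboard the URL of a search 
# that was done in a register in the web browser:
q <- ctrGetQueryUrlFromBrowser()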
# * Found search query from EUCTR.

#                                  query-term query-register
# 1 query=cancer&age=under-18&phase=phase-one          EUCTR
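
To illustrate what the query data frame above contains, the following base R sketch splits a register search URL into a query term and a register name. The helper split_search_url() is hypothetical and is not how ctrdata implements ctrGetQueryUrlFromBrowser(); the example URL is assumed to have the form of an EUCTR search address.

```r
# Hypothetical helper, for illustration only (not part of ctrdata):
# split a register search URL into query term and register name
split_search_url <- function(url) {
  # Everything after the first "?" is taken as the query term
  term <- sub("^[^?]*\\?", "", url)
  # Guess the register from the host name in the URL
  register <- if (grepl("clinicaltrialsregister.eu", url, fixed = TRUE)) {
    "EUCTR"
  } else if (grepl("clinicaltrials.gov", url, fixed = TRUE)) {
    "CTGOV"
  } else {
    NA_character_
  }
  data.frame(
    `query-term` = term,
    `query-register` = register,
    check.names = FALSE,
    stringsAsFactors = FALSE)
}

# Assumed example URL of an EUCTR search
example_url <- paste0(
  "https://www.clinicaltrialsregister.eu/ctr-search/search?",
  "query=cancer&age=under-18&phase=phase-one")
q_example <- split_search_url(example_url)
print(q_example)
```

The result has the same two columns as the output shown above, so it can help to see which part of a search URL ends up as the query term.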

Under the hood, scripts and xml2json.php (in ctrdata/exec) transform EUCTR plain text files and CTGOV XML files to ndjson format, which is imported into the database. As a first step, the database is specified using nodbi (using RSQLite or MongoDB as backend). Second, trial information is retrieved and loaded into the database.

# Connect to (or newly create) a SQLite database 
# that is stored in a file on the local system:
db <- nodbi::src_sqlite(
  dbname = "some_database_name.sqlite_file", 
  collection = "some_collection_name")

# Alternative, for a MongoDB database:
# db <- nodbi::src_mongo(url = "mongodb://localhost", 
#                        db = "some_database_name",
#                        collection = "some_collection_name")

# Retrieve trials from public register:
ctrLoadQueryIntoDb(
  queryterm = q,
  con = db)
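
As an aside on the ndjson format mentioned above, here is a minimal sketch of what "one JSON document per line" looks like. It uses jsonlite (listed among the package dependencies) with made-up toy records; it does not reproduce the conversion scripts in ctrdata/exec.

```r
library(jsonlite)

# Toy trial records (made-up values, not register data)
toy_trials <- data.frame(
  `_id` = c("TRIAL-0001", "TRIAL-0002"),
  status = c("Ongoing", "Completed"),
  check.names = FALSE,
  stringsAsFactors = FALSE)

# Write one JSON document per line (ndjson) to a temporary file
ndjson_file <- tempfile(fileext = ".ndjson")
stream_out(toy_trials, con = file(ndjson_file), verbose = FALSE)

# Each line of the file is a complete JSON document
ndjson_lines <- readLines(ndjson_file)
print(ndjson_lines)

# Read the documents back into a data frame
restored <- stream_in(file(ndjson_file), verbose = FALSE)
```

Because every line stands on its own, such files can be appended to and imported record by record, which suits document databases.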

Tabulate the status of those trials that are recorded to be part of an agreed paediatric development program (paediatric investigation plan, PIP):

# Get all records that have values in the fields of interest:
result <- dbGetFieldsIntoDf(
  fields = c(
    "a7_trial_is_part_of_a_paediatric_investigation_plan", 
    "p_end_of_trial_status"),
  con = db)

# Find unique trial identifiers for trials that have more than one record, 
# for example for several EU Member States: 
uniqueids <- dbFindIdsUniqueTrials(con = db)
# * Total of 522 records in collection.
# Searching for duplicates, found 
#  - 340 EUCTR _id were not preferred EU Member State record of trial
# No CTGOV records found.
# = Returning keys (_id) of 182 out of total 522 records in collection

# Keep only unique / deduplicated records:
result <- result[ result[["_id"]] %in% uniqueids, ]

# Tabulate the clinical trial information:
with(result, table(
  p_end_of_trial_status, 
  a7_trial_is_part_of_a_paediatric_investigation_plan))
#                     a7_trial_is_part_of_a_paediatric_investigation_plan
# p_end_of_trial_status Information not present in EudraCT No Yes
#   Completed                                            6 25  14
#   Ongoing                                              5 60  19
#   Prematurely Ended                                    1  6   3
#   Restarted                                            0  1   0
#   Temporarily Halted                                   0  0   1
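
The de-duplication and tabulation steps above can be tried out without a database. The following base R sketch uses made-up toy records; the vector toy_uniqueids only stands in for what dbFindIdsUniqueTrials() would return and is an assumption for illustration.

```r
# Toy records (made-up identifiers and values, not register data):
# one trial can have several records, one per EU Member State
toy <- data.frame(
  `_id` = c("T1-DE", "T1-FR", "T2-GB", "T3-IT", "T3-ES"),
  status = c("Ongoing", "Ongoing", "Completed", "Ongoing", "Ongoing"),
  in_pip = c("Yes", "Yes", "No", "No", "No"),
  check.names = FALSE,
  stringsAsFactors = FALSE)

# Assumed identifiers of the preferred record of each trial,
# standing in for the vector returned by dbFindIdsUniqueTrials()
toy_uniqueids <- c("T1-DE", "T2-GB", "T3-IT")

# Keep only one record per trial
toy_unique <- toy[toy[["_id"]] %in% toy_uniqueids, ]

# Cross-tabulate trial status by membership in a PIP
toy_table <- with(toy_unique, table(status, in_pip))
print(toy_table)
```

Filtering before tabulating matters: without it, trials conducted in several Member States would be counted once per record.
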

Next, add records from another register (CTGOV) into the same database:

# Retrieve trials from public register:
ctrLoadQueryIntoDb(
  queryterm = "cond=neuroblastoma&rslt=With&recrs=e&age=0&intr=Drug", 
  register = "CTGOV",
  con = db)

Analyse some result details; note how information fields are used with slightly different approaches:

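# Get all records that have values in all specified fields. 
# Note the fields are specific to CTGOV, thus not in EUCTR,
# which results in a warning that not all records in the 
# database have information on the specified fields:  
result <- dbGetFieldsIntoDf(
  fields = c(
    "clinical_results.baseline.analyzed_list.analyzed.count_list.count",
    # assumed field name for the reporting groups:
    "clinical_results.baseline.group_list.group",
    "study_design_info.allocation",
    "location"),
  con = db)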

# - Count sites: location is a list of lists, 
#   hence the hierarchical extraction by
#   facility and then name of facility
result$number_sites <- sapply(
  result$location, function(x) length(x[["facility"]][["name"]]))

#   an alternative approach uses a function provided by
#   ctrdata to extract keys from a list in a data frame:
with(
  dfListExtractKey(
    df = result, 
    list.key = list(c("location", "facility.name"))), 
  by(item, `_id`, max))

# - Count total participant numbers, by summing the reporting groups
#   for which their description does not contain the word "total" 
#   (such as in "Total participants")
result$number_participants <- sapply(
  seq_len(nrow(result)), function(i) {
    # Participant counts are in a list of elements with attributes, 
    # where attribute value has a vector of numbers per reporting group
    tmp <- result$clinical_results.baseline.analyzed_list.analyzed.count_list.count[[i]]
    # Information on reporting groups is in a list with a subelement description
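    # (field name assumed here: reporting groups of the baseline results)
    tot <- result$clinical_results.baseline.group_list.group[[i]]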
    # see for example
    tmp <- tmp[["@attributes"]][["value"]]
    tmp <- tmp[ !grepl("(^| )[tT]otal( |$)", tot[["description"]])]
    # to sum up, change string into integer value.
    # note that e.g. sum(..., na.rm = TRUE) is not used
    # since there are no empty entries in these trials
    tmp <- sum(as.integer(tmp))
    # return the total for this trial
    tmp
  })

# Allocation is part of study design information and available
# as a simple character string, suitable for routine manipulation
result$is_controlled <- grepl(
  pattern = "^Random", 
  x = result$study_design_info.allocation)

# Example plot
library(ggplot2)
ggplot(data = result) + 
  labs(title = "Neuroblastoma trials with results",
       subtitle = "") +
  geom_point(mapping = aes(x = number_sites,
                           y = number_participants,
                           colour = is_controlled)) + 
  scale_x_log10()
# Save the plot that was just created
ggsave(filename = "inst/image/README-ctrdata_results_neuroblastoma.png",
       width = 4, height = 3, units = "in")

[Figure: Neuroblastoma trials]
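
The two nested-list extractions used above (counting sites, and summing participants while skipping "total" groups) can be rehearsed on toy data. This is a base R sketch with made-up values; the column names groups and counts are shortened stand-ins chosen for illustration, not field names from the registers.

```r
# Toy data (made up): two trials with list columns that mimic
# the nested structure of fields retrieved from the database
toy <- data.frame(
  `_id` = c("NCT-A", "NCT-B"),
  check.names = FALSE,
  stringsAsFactors = FALSE)
toy$location <- list(
  list(facility = list(name = c("Site 1", "Site 2", "Site 3"))),
  list(facility = list(name = "Site 4")))
toy$groups <- list(
  list(description = c("Arm A", "Arm B", "Total participants")),
  list(description = "Single arm"))
toy$counts <- list(
  list(`@attributes` = list(value = c("10", "12", "22"))),
  list(`@attributes` = list(value = "7")))

# Count sites by descending into facility, then name
toy$number_sites <- sapply(
  toy$location, function(x) length(x[["facility"]][["name"]]))

# Sum participants over reporting groups, leaving out groups
# whose description contains the word "total"
toy$number_participants <- sapply(
  seq_len(nrow(toy)), function(i) {
    values <- toy$counts[[i]][["@attributes"]][["value"]]
    descriptions <- toy$groups[[i]][["description"]]
    keep <- !grepl("(^| )[tT]otal( |$)", descriptions)
    sum(as.integer(values[keep]))
  })

print(toy[, c("_id", "number_sites", "number_participants")])
```

Excluding the "total" group avoids double counting, since that group already sums the individual arms.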


The database connection object con is created by calling nodbi::src_*(), with parameters that are specific to the database (e.g., url) and with a special parameter collection that is used by ctrdata to identify which table or collection in the database to use. Any such connection object can then be used by ctrdata and generic functions of nodbi in a consistent way, as shown in the table:

Purpose | SQLite | MongoDB
Create database connection | dbc <- nodbi::src_sqlite(dbname = ":memory:", collection = "name_of_my_collection") | dbc <- nodbi::src_mongo(db = "name_of_my_database", collection = "name_of_my_collection", url = "mongodb://localhost")
Use connection with any ctrdata function | ctrdata::{ctr,db}*(con = dbc) | ctrdata::{ctr,db}*(con = dbc)
Use connection with any nodbi function | nodbi::docdb_*(src = dbc, key = dbc$collection) | nodbi::docdb_*(src = dbc, key = dbc$collection)

Features in the works


Issues and notes

Annex: Representation of trial records’ JSON in databases


Example JSON representation in MongoDB


Example JSON representation in SQLite