Downloading and using data from clinicaltrials.gov

Michael C Sachs

2017-01-09

R interface to clinicaltrials.gov

ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. Users can search for information about and results from those trials. This package provides a set of functions to interact with the search and download features. Results are downloaded to temporary directories and returned as R objects.

Installation

The package is available on CRAN and can be installed as usual. To install the latest version from GitHub, use devtools::install_github(), as follows:

install.packages("devtools")
library(devtools)
install_github("sachsmc/rclinicaltrials")

Basic usage

The main function is clinicaltrials_search(). Here’s an example of its use:

library(rclinicaltrials)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
z <- clinicaltrials_search(query = 'lime+disease')
str(z)
## 'data.frame':    20 obs. of  8 variables:
##  $ score               : chr  "0.021201" "0.020411" "0.0074037" "0.0072417" ...
##  $ nct_id              : chr  "NCT01951924" "NCT01333202" "NCT01056133" "NCT01644682" ...
##  $ url                 : chr  "https://ClinicalTrials.gov/show/NCT01951924" "https://ClinicalTrials.gov/show/NCT01333202" "https://ClinicalTrials.gov/show/NCT01056133" "https://ClinicalTrials.gov/show/NCT01644682" ...
##  $ title               : chr  "LIME Study (LFB IVIg MMN Efficacy Study)" "Fresh Lime Alone for Smoking Cessation" "Effect of Fish-oil on Non-alcoholic Steatohepatitis (NASH)" "Replacement of Insecticides to Control Visceral Leishmaniasis (VL)" ...
##  $ status.text         : chr  "Completed" "Completed" "Completed" "Completed" ...
##  $ condition_summary   : chr  "Motor Neuron Disease" "Tobacco Use Disorder" "Non-alcoholic Fatty Liver Disease; Non-alcoholic Steatohepatitis" "Cost-effective and Sustainable Vector Control Methods Will be Established to Reduce VL in India, Bangladesh and Nepal" ...
##  $ intervention_summary: chr  "Drug: Biological : I10E (Human normal Immunoglobulin for intravenous administration 100mg/mL); Drug: Biological: Kiovig® (Human"| __truncated__ "Other: Fresh lime" "Other: Omega-3 capsules-Fish Oil" "Other: IWFPL; Other: IDWL; Other: ITN" ...
##  $ last_changed        : chr  "July 18, 2016" "April 8, 2011" "May 10, 2016" "February 16, 2015" ...

This gives you basic information about the trials. Before searching or downloading, you can determine how many results will be returned using the clinicaltrials_count() function:

clinicaltrials_count(query = "myeloma")
## [1] 2215
clinicaltrials_count(query = "29485tksrw@")
## [1] 0
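
One way to use the count is as a guard before a large download. The sketch below is not part of the package: guarded_download() and its max_n argument are hypothetical names, and the count and download functions are passed in as arguments so the pattern can be tried offline with stand-ins.

```r
# Hypothetical helper (not part of rclinicaltrials): refuse to download
# when the query matches more studies than max_n.
guarded_download <- function(query, count_fun, download_fun, max_n = 500) {
  n <- count_fun(query = query)
  if (n == 0) {
    message("No studies match the query.")
    return(NULL)
  }
  if (n > max_n) {
    stop("Query matches ", n, " studies; narrow it or raise max_n.")
  }
  download_fun(query = query, count = n)
}

# Offline stand-ins that mimic the shape of the real calls (made-up data).
fake_count <- function(query) if (identical(query, "myeloma")) 12 else 0
fake_download <- function(query, count) {
  data.frame(nct_id = sprintf("NCT%08d", seq_len(count)),
             stringsAsFactors = FALSE)
}

res <- guarded_download("myeloma", fake_count, fake_download)
nrow(res)
```

With the package loaded, clinicaltrials_count and clinicaltrials_download could presumably be passed in place of the stand-ins, since both accept a query argument and the download function accepts count.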

The query can be a single string, which is passed to the “search terms” field on clinicaltrials.gov. Terms can be combined using the logical operators AND, OR, and NOT. Advanced searches can be performed by passing a vector of key=value pairs as strings. For example, to search for interventional cancer studies:

clinicaltrials_count(query = c("type=Intr", "cond=cancer"))
## [1] 44894
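
When the search terms are already held in a named list, the key=value strings can be assembled with base R. This is a minimal sketch of the string format only; the variable names are illustrative, and it makes no claim about how the package handles queries internally.

```r
# Turn a named list of advanced-search terms into "key=value" strings.
terms <- list(type = "Intr", cond = "cancer")
query <- paste(names(terms), unlist(terms), sep = "=")
query
## [1] "type=Intr"   "cond=cancer"
```

The resulting character vector has the same form as the one passed to clinicaltrials_count() above.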

The possible advanced search terms are included in the advanced_search_terms data frame, which comes with the package. The data frame lists each key, its description, and a link to the help webpage that explains the possible values for that search term. To open the help page for cond, for instance, run browseURL(advanced_search_terms["cond", "help"]).

head(advanced_search_terms)
##      keys   description
## term term  Search Terms
## recr recr   Recruitment
## rslt rslt Study Results
## type type    Study Type
## cond cond    Conditions
## intr intr Interventions
##                                                        help
## term        http://clinicaltrials.gov/ct2/help/search_terms
## recr         http://clinicaltrials.gov/ct2/help/recruitment
## rslt       http://clinicaltrials.gov/ct2/help/study_results
## type          http://clinicaltrials.gov/ct2/help/study_type
## cond    http://clinicaltrials.gov/ct2/help/conditions_instr
## intr http://clinicaltrials.gov/ct2/help/interventions_instr

To download detailed study information, including results, use clinicaltrials_download():

y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y)
## List of 2
##  $ study_information:List of 6
##   ..$ study_info   :'data.frame':    10 obs. of  48 variables:
##   .. ..$ org_study_id                        : chr [1:10] "J0997" "MMRF-11-001" "101565" "IUCRO-0498" ...
##   .. ..$ nct_id                              : chr [1:10] "NCT01045460" "NCT01454297" "NCT01410981" "NCT02212262" ...
##   .. ..$ brief_title                         : chr [1:10] "Trial of Activated Marrow Infiltrating Lymphocytes Alone or in Conjunction With an Allogeneic Granulocyte Macrophage Colony-sti"| __truncated__ "Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile" "Prognostic Potential of Cell Surface Markers and Pim Kinases in Multiple Myeloma" "Role of Osteocytes in Myeloma Bone Disease" ...
##   .. ..$ official_title                      : chr [1:10] "Randomized Trial of Activated Marrow Infiltrating Lymphocytes Alone or in Conjunction With an Allogeneic GM-CSF-based Myeloma C"| __truncated__ "A Prospective, Longitudinal, Observational Study in Newly Diagnosed Multiple Myeloma (MM) Patients to Assess the Relationship B"| __truncated__ "Prognostic Potential of Cell Surface Markers and Pim Kinases in Multiple Myeloma" "Role of Osteocytes in Myeloma Bone Disease" ...
##   .. ..$ overall_status                      : chr [1:10] "Active, not recruiting" "Active, not recruiting" "Unknown status" "Recruiting" ...
##   .. ..$ start_date                          : chr [1:10] "December 2009" "July 2011" "July 2011" "July 2014" ...
##   .. ..$ completion_date.text                : chr [1:10] "July 2017" "September 2023" NA "December 2018" ...
##   .. ..$ completion_date..attrs              : chr [1:10] "Anticipated" "Anticipated" NA "Anticipated" ...
##   .. ..$ completion_date_type                : chr [1:10] "Anticipated" "Anticipated" NA "Anticipated" ...
##   .. ..$ lead_sponsor/agency                 : chr [1:10] "Sidney Kimmel Comprehensive Cancer Center" "Multiple Myeloma Research Foundation" "Medical University of South Carolina" "Attaya Suvannasankha" ...
##   .. ..$ phase                               : chr [1:10] "Phase 2" "N/A" "N/A" "N/A" ...
##   .. ..$ study_type                          : chr [1:10] "Interventional" "Observational" "Observational" "Observational" ...
##   .. ..$ study_design                        : chr [1:10] "Allocation: Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Single Group Assignment, Masking: O"| __truncated__ "Observational Model: Cohort, Time Perspective: Prospective" "Observational Model: Cohort, Time Perspective: Prospective" "Observational Model: Case Control, Time Perspective: Prospective" ...
##   .. ..$ enrollment.text                     : chr [1:10] "32" "1154" "130" "240" ...
##   .. ..$ enrollment..attrs                   : chr [1:10] "Anticipated" "Actual" "Anticipated" "Anticipated" ...
##   .. ..$ primary_condition                   : chr [1:10] "Multiple Myeloma" "Multiple Myeloma" "Multiple Myeloma" "Multiple Myeloma" ...
##   .. ..$ primary_outcome.measure             : chr [1:10] "Evaluate clinical efficacy of activated marrow infiltrating lymphocytes (aMILs) administered alone or in combination with allog"| __truncated__ "Molecular profiles and clinical characteristics that define subsets of myeloma patients at initial diagnosis and at relapse of "| __truncated__ "Measure the expression levels of CXCR4, CD47, and pS6 by flow cytometry in myeloma patient's marrow aspirate" "Molecular interactions between multiple myeloma and osteocytes" ...
##   .. ..$ primary_outcome.time_frame          : chr [1:10] "Days 60, 180, and 360" "Baseline to 8 years." "3 years" "Up to 4 years" ...
##   .. ..$ primary_outcome.safety_issue        : chr [1:10] "No" "No" "No" "No" ...
##   .. ..$ primary_outcome.description         : chr [1:10] "2.1.1 Evaluate Response Rates utilizing the Blade' criteria\nComplete Response (CR) rate\nNear Complete Response (nCR) rate\nVe"| __truncated__ "Standard clinical and laboratory assessments. Genomic tests (DNA and RNA sequencing, etc.) on bone marrow aspirates obtained at"| __truncated__ "Our Aim 1 is to perform a corrective study to measure the expression levels of CXCR4, CD47, and pS6 by flow cytometry in myelom"| __truncated__ "To determine FGF23 and heparanase, Dkk1 and plasma klotho levels increase in patients with newly diagnosed and relapsed myeloma"| __truncated__ ...
##   .. ..$ eligibility.gender                  : chr [1:10] "Both" "Both" "Both" "Both" ...
##   .. ..$ eligibility.minimum_age             : chr [1:10] "18 Years" "18 Years" "18 Years" "18 Years" ...
##   .. ..$ eligibility.maximum_age             : chr [1:10] "70 Years" "N/A" "N/A" "N/A" ...
##   .. ..$ eligibility.healthy_volunteers      : chr [1:10] "No" "No" "No" "Accepts Healthy Volunteers" ...
##   .. ..$ sponsors.lead_sponsor.agency        : chr [1:10] "Sidney Kimmel Comprehensive Cancer Center" "Multiple Myeloma Research Foundation" "Medical University of South Carolina" "Attaya Suvannasankha" ...
##   .. ..$ sponsors.lead_sponsor.agency_class  : chr [1:10] "Other" "Other" "Other" "Other" ...
##   .. ..$ date_disclaimer                     : chr [1:10] "ClinicalTrials.gov processed this data on January 06, 2017" "ClinicalTrials.gov processed this data on January 06, 2017" "ClinicalTrials.gov processed this data on January 06, 2017" "ClinicalTrials.gov processed this data on January 06, 2017" ...
##   .. ..$ overall_official.last_name          : chr [1:10] NA "Daniel Auclair" NA "Attaya Suvannasankha, M.D." ...
##   .. ..$ overall_official.role               : chr [1:10] NA "Study Director" NA "Principal Investigator" ...
##   .. ..$ overall_official.affiliation        : chr [1:10] NA "Multiple Myeloma Research Foundation" NA "Indiana University" ...
##   .. ..$ eligibility.sampling_method         : chr [1:10] NA "Non-Probability Sample" "Non-Probability Sample" "Non-Probability Sample" ...
##   .. ..$ eligibility.textblock.1             : chr [1:10] NA "\n        Inclusion Criteria:\n\n          -  Patient is at least 18 years old.\n\n          -  Patient has been diagnosed with"| __truncated__ "\n        Inclusion Criteria:\n\n          -  A diagnosis of multiple myeloma or possible multiple myeloma who will have a bone"| __truncated__ "\n        Inclusion Criteria:\n\n          1. Age > 18 years but = 95 years at the time of consent\n\n          2. Subjects mus"| __truncated__ ...
##   .. ..$ sponsors.collaborator.agency        : chr [1:10] NA "Translational Genomics Research Institute" "Genentech, Inc." NA ...
##   .. ..$ sponsors.collaborator.agency_class  : chr [1:10] NA "Other" "Industry" NA ...
##   .. ..$ sponsors.collaborator.agency.1      : chr [1:10] NA "Spectrum Health Hospitals" NA NA ...
##   .. ..$ sponsors.collaborator.agency_class.1: chr [1:10] NA "Other" NA NA ...
##   .. ..$ sponsors.collaborator.agency.2      : chr [1:10] NA "Van Andel Research Institute" NA NA ...
##   .. ..$ sponsors.collaborator.agency_class.2: chr [1:10] NA "Other" NA NA ...
##   .. ..$ primary_outcome.measure.1           : chr [1:10] NA NA NA NA ...
##   .. ..$ primary_outcome.time_frame.1        : chr [1:10] NA NA NA NA ...
##   .. ..$ primary_outcome.safety_issue.1      : chr [1:10] NA NA NA NA ...
##   .. ..$ primary_outcome.description.1       : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.last_name.1        : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.role.1             : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.affiliation.1      : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.last_name.2        : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.role.2             : chr [1:10] NA NA NA NA ...
##   .. ..$ overall_official.affiliation.2      : chr [1:10] NA NA NA NA ...
##   ..$ locations    :'data.frame':    82 obs. of  12 variables:
##   .. ..$ name                  : chr [1:82] "Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins" "Mayo Clinic Campus in Scottsdale, AZ" "UC San Diego Moores Cancer Center" "Sharp Health Care" ...
##   .. ..$ address.city          : chr [1:82] "Baltimore" "Scottsdale" "San Diego" "San Diego" ...
##   .. ..$ address.state         : chr [1:82] "Maryland" "Arizona" "California" "California" ...
##   .. ..$ address.zip           : chr [1:82] "21231" "85259" "92093" "92123" ...
##   .. ..$ address.country       : chr [1:82] "United States" "United States" "United States" "United States" ...
##   .. ..$ nct_id                : chr [1:82] "NCT01045460" "NCT01454297" "NCT01454297" "NCT01454297" ...
##   .. ..$ status                : chr [1:82] NA NA NA NA ...
##   .. ..$ contact.last_name     : chr [1:82] NA NA NA NA ...
##   .. ..$ contact.phone         : chr [1:82] NA NA NA NA ...
##   .. ..$ contact.email         : chr [1:82] NA NA NA NA ...
##   .. ..$ investigator.last_name: chr [1:82] NA NA NA NA ...
##   .. ..$ investigator.role     : chr [1:82] NA NA NA NA ...
##   ..$ arms         :'data.frame':    14 obs. of  4 variables:
##   .. ..$ arm_group_label: chr [1:14] "1" "2" "Newly diagnosed Multiple Myeloma" "Multiple Myeloma subjects with bone marrow aspirate/biopsy" ...
##   .. ..$ arm_group_type : chr [1:14] "Experimental" "Experimental" NA NA ...
##   .. ..$ description    : chr [1:14] "aMILs" "aMILs + allogeneic myeloma vaccine" "This is a prospective observational study in patients with symptomatic multiple myeloma who have not yet initiated therapy for "| __truncated__ "All patients seen at MUSC with a diagnosis of multiple myeloma or possible multiple myeloma who undergo bone marrow aspirate an"| __truncated__ ...
##   .. ..$ nct_id         : chr [1:14] "NCT01045460" "NCT01045460" "NCT01454297" "NCT01410981" ...
##   ..$ interventions:'data.frame':    18 obs. of  8 variables:
##   .. ..$ intervention_type: chr [1:18] "Biological" "Biological" "Other" "Drug" ...
##   .. ..$ intervention_name: chr [1:18] "aMILs" "Allogeneic Myeloma Vaccine" "oligosecretary" "Lenalidomide" ...
##   .. ..$ description      : chr [1:18] "Activated marrow infiltrating lymphocytes" "Allogeneic granulocyte macrophage colony-stimulating factor (GM-CSF)-based myeloma cellular vaccine" "not abvailable" "Dosage forms: 5, 10, 15 and 25 mg capsules. Patients will be continued on the same dose of lenalidomide as they were prior to b"| __truncated__ ...
##   .. ..$ arm_group_label  : chr [1:18] "1" "2" "oligosecretary" "Myeloma Vaccine, Prevnar-13 Vaccine, & Lenalidomide" ...
##   .. ..$ arm_group_label.1: chr [1:18] "2" NA NA NA ...
##   .. ..$ nct_id           : chr [1:18] "NCT01045460" "NCT01045460" "NCT02095379" "NCT01349569" ...
##   .. ..$ other_name       : chr [1:18] NA NA NA "Revlimid" ...
##   .. ..$ other_name.1     : chr [1:18] NA NA NA NA ...
##   ..$ outcomes     :'data.frame':    34 obs. of  6 variables:
##   .. ..$ measure     : chr [1:34] "Evaluate clinical efficacy of activated marrow infiltrating lymphocytes (aMILs) administered alone or in combination with allog"| __truncated__ "Evaluate Progression-free Survival and Overall Survival" "Anti-tumor immune response" "The effect of aMILs on osteoclastogenesis" ...
##   .. ..$ time_frame  : chr [1:34] "Days 60, 180, and 360" "Days 60, 180, and 360" "Days 60, 180, and 360" "Days 60, 180, and 360" ...
##   .. ..$ safety_issue: chr [1:34] "No" "Yes" "No" "No" ...
##   .. ..$ description : chr [1:34] "2.1.1 Evaluate Response Rates utilizing the Blade' criteria\nComplete Response (CR) rate\nNear Complete Response (nCR) rate\nVe"| __truncated__ "Patients will be monitored for progression/relapse on Days 60, 180, and 360, and as clinically indicated. Following one year fo"| __truncated__ "Evaluate tumor specific responses in blood and bone marrow\nExamine T cell responses to DC-pulsed myeloma cell lines\nExamine i"| __truncated__ "Parameters of bone turnover that will include:\nRANKL/OPG ratio\nSerum C Telopeptide levels\nbAlkaline phosphatase and osteocal"| __truncated__ ...
##   .. ..$ type        : chr [1:34] "primary_outcome" "secondary_outcome" "secondary_outcome" "secondary_outcome" ...
##   .. ..$ nct_id      : chr [1:34] "NCT01045460" "NCT01045460" "NCT01045460" "NCT01045460" ...
##   ..$ textblocks   : NULL
##  $ study_results    :List of 3
##   ..$ participant_flow: NULL
##   ..$ baseline_data   : NULL
##   ..$ outcome_data    : NULL

This returns a list of dataframes that have a common key variable: nct_id. Optionally, you can get the long text fields and/or study results (if available). Study results are also returned as a list of dataframes, contained within the list.

How to use the results

The data come from a relational database with lots of text fields, so it may take some effort to get the data into a flat format for analysis. For that reason, results come back from the clinicaltrials_download function as a list of dataframes. Each dataframe has a common key variable: nct_id. To merge dataframes, use this key. Otherwise, you can analyze the dataframes separately. They are organized into study information, locations, arms, interventions, outcomes, results, and textblocks. Results, where available, is itself a list with three dataframes: participant flow, baseline data, and outcome data.
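
As a minimal sketch of merging on the key, the example below uses two small made-up data frames that stand in for the study_info and locations elements; the column names follow the structure shown earlier, but the values are invented for illustration.

```r
# Toy stand-ins for two of the returned data frames (values are made up).
study_info <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002"),
  phase = c("Phase 2", "N/A"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000002"),
  address.country = c("United States", "United States", "Canada"),
  stringsAsFactors = FALSE
)

# One row per location, with the study-level columns repeated.
merged <- merge(study_info, locations, by = "nct_id")
merged

# Number of sites per study.
table(merged$nct_id)
```

Because one study can have many locations, arms, or outcomes, the merged data frame has one row per child record rather than one row per study.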

Results tables are stored in long format, so there are often multiple rows per study, each corresponding to a different group or outcome. Let’s look at an example: the cumulative enrollment of men and women in phase III interventional melanoma studies over time. The query can also be passed as a list of named items, as in the second search below.

melanom <- clinicaltrials_search(query = c("cond=melanoma", "phase=2", 
                                           "type=Intr", "rslt=With"), 
                                 count = 1e6)
nrow(melanom)
## [1] 27
table(melanom$status.text)
## 
## Active, not recruiting              Completed             Terminated 
##                      8                     17                      2
melanom2 <- clinicaltrials_search(query = list(cond = "melanoma", phase = "2", 
                                           type = "Intr", rslt = "With"), 
                                 count = 1e6)
nrow(melanom2)
## [1] 27

Now to download the data and summarize it:

melanom_information <- clinicaltrials_download(query = c("cond=melanoma", "phase=2", 
                                                         "type=Intr", "rslt=With"), 
                                               count = 1e6, include_results = TRUE)
summary(melanom_information$study_results$baseline_data)
##                         title              units    
##  Gender                    :140   years       : 52  
##  Age                       : 97   participants:461  
##  Race/Ethnicity, Customized: 97   Participants:225  
##  Region of Enrollment      : 81   Years       : 25  
##  Age, Customized           : 41                     
##  Site of lesion            : 27                     
##  (Other)                   :280                     
##                    param                  dispersion    subtitle        
##  Median               : 25   Full Range        : 35   Length:763        
##  Number               :674   Standard Deviation: 42   Class :character  
##  Mean                 : 52   NA's              :686   Mode  :character  
##  Count of Participants: 12                                              
##                                                                         
##                                                                         
##                                                                         
##    group_id            value           lower_limit       
##  Length:763         Length:763         Length:763        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  upper_limit       
##  Length:763        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##                    
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               description 
##  Stage IIIB: Ulcerated lesion and 1 lymph node or 2-3 nodes with micrometastasis, or any-depth lesion with no ulceration, and 1 lymph node or 2-3 nodes with macrometastasis; Stage IIIC: Ulcerated lesion and 1 lymph node with macrometastasis; 2-3 nodes with macrometastasis or =4 metastatic lymph nodes, matted lymph nodes, or in-transit met(s)/satellite(s); Stage IV: M1a: Spread to skin, subcutaneous tissue, or lymph nodes; normal lactate dehydrogenase (LDH) level; M1b: Spread to lungs, normal LDH; M1c: Spread to all other visceral organs, normal LDH or any distant disease with elevated LDH.: 18  
##  The "M" in the TNM (tumor, node, metastasis) system refers to distant metastases—whether, and how far, the cancer has spread outside the original site. M0: There is no evidence that the cancer has spread beyond the original site. M1: The cancer has spread beyond the original site. M1a: The cancer has spread to other areas of skin, underneath the epidermis to the dermis (subcutaneous), or to lymph node(s). M1b: The cancer has spread to the lung(s) only. M1c: The cancer has spread to other organs and/or locations in the body with or without elevated LDH.                                     : 16  
##  Breslow's Thickness is a measure of the vertical thickness of a cutaneous melanoma lesion and is reported in millimeters (mm).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     : 15  
##  Scale used to assess how a patient's disease is progressing, how the disease affects the daily living abilities of the patient: 0 = Fully active, able to carry on all pre-disease performance without restriction; 1 = Restricted in physically strenuous activity, ambulatory, able to carry out work of a light nature; 2 = Ambulatory and capable of all self-care but unable to carry out any work activities. Up and about > 50% of waking hours; 3 = Capable of only limited self care, confined to a bed or chair > 50% of waking hours; 4 = Completely disabled, confined to bed or chair; 5 = Dead.      : 15  
##  ECOG-Eastern Cooperative Oncology Group (ECOG) Performance Status is used by doctors and researchers to assess how a participant's disease is progressing, assess how the disease affects the daily living activities of the participant and determine appropriate treatment and prognosis. 0 = Fully Active (Most Favorable Activity); 1 = Restricted activity but ambulatory; 2 = Ambulatory but unable to carry out work activities; 3 = Limited Self-Care; 4 = Completely Disabled, No self-care (Least Favorable Activity)                                                                                    : 15  
##  (Other)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            :174  
##  NA's                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               :510  
##      arm               nct_id             spread         
##  Length:763         Length:763         Length:763        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 
gend_data <- subset(melanom_information$study_results$baseline_data, 
                    title == "Gender" & arm != "Total")

gender_counts <- gend_data %>% group_by(nct_id, subtitle) %>% 
  do( data.frame(
    count = sum(as.numeric(paste(.$value)), na.rm = TRUE)
    ))

dates <- melanom_information$study_information$study_info[, c("nct_id", "start_date")]
dates$year <- sapply(strsplit(paste(dates$start_date), " "), function(d) as.numeric(d[2]))

counts <- merge(gender_counts, dates, by = "nct_id")

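cts <- counts %>% group_by(year, subtitle) %>%
  summarize(count = sum(count))
colnames(cts)[2] <- "Gender"

# Accumulate within each gender, in year order, so each line is its own running total
cts <- cts %>% group_by(Gender) %>% arrange(year) %>%
  mutate(cum_count = cumsum(count))

ggplot(cts, aes(x = year, y = cum_count, color = Gender)) + 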
  geom_line() + geom_point() + 
  labs(title = "Cumulative enrollment into Phase III, \n interventional trials in Melanoma, by gender") + 
  scale_y_continuous("Cumulative Enrollment")
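
The year extraction above takes the second space-separated token of start_date, which works for values such as "December 2009" but not for a date that also includes a day. A slightly more defensive sketch, assuming only that the year is the last four digits of the string, is shown below; the example dates are made up.

```r
# Made-up start dates in the formats seen in the output above, plus a missing value.
start_date <- c("December 2009", "July 2011", "July 18, 2014", NA)

# Keep only the trailing four-digit year; NA stays NA.
year <- as.numeric(sub(".*(\\d{4})$", "\\1", start_date))
year
## [1] 2009 2011 2014   NA
```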