R interface to clinicaltrials.gov

ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. Users can search for information about and results from those trials. This package provides a set of functions to interact with the search and download features. Results are downloaded to temporary directories and returned as R objects.

Installation

The package is not currently available on CRAN. To install, use devtools::install_github(), as follows:

install.packages("devtools")
library(devtools)
install_github("sachsmc/rclinicaltrials")

Basic usage

The main function is clinicaltrials_search(). Here's an example of its use:

library(rclinicaltrials)
z <- clinicaltrials_search(query = 'lime+disease')
str(z)
## 'data.frame':    17 obs. of  7 variables:
##  $ score            : chr  "0.040881" "0.02838" "0.020476" "0.017" ...
##  $ nct_id           : chr  "NCT01333202" "NCT01951924" "NCT01056133" "NCT02156089" ...
##  $ url              : chr  "http://ClinicalTrials.gov/show/NCT01333202" "http://ClinicalTrials.gov/show/NCT01951924" "http://ClinicalTrials.gov/show/NCT01056133" "http://ClinicalTrials.gov/show/NCT02156089" ...
##  $ title            : chr  "Fresh Lime Alone for Smoking Cessation" "LIME Study (LFB IVIG MMN Efficacy Study)" "Effect of Fish-oil on Non-alcoholic Steatohepatitis (NASH)" "Neuroimaging, Omega-3 and Reward in Adults With ADHD (NORAA) Trial" ...
##  $ status.text      : chr  "Completed" "Recruiting" "Recruiting" "Recruiting" ...
##  $ condition_summary: chr  "Tobacco Use Disorder" "Motor Neuron Disease" "Non-alcoholic Fatty Liver Disease; Non-alcoholic Steatohepatitis" "Attention Deficit Disorder; Attention Deficit Hyerpactivity Disorder" ...
##  $ last_changed     : chr  "April 8, 2011" "September 24, 2013" "May 23, 2014" "August 12, 2014" ...

This gives you basic information about the trials. Before searching or downloading, you can determine how many results will be returned using the clinicaltrials_count() function:

clinicaltrials_count(query = "myeloma")
## [1] 1844
clinicaltrials_count(query = "29485tksrw@")
## [1] 0

The query can be a single string which will be passed to the “search terms” field on clinicaltrials.gov. Terms can be combined using the logical operators AND, OR, and NOT. Advanced searches can be performed by passing a vector of key=value pairs as strings. For example, to search for cancer interventional studies,

clinicaltrials_count(query = c("type=Intr", "cond=cancer"))
## [1] 35138
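
When a search has several advanced terms, it can be convenient to keep them as a named vector and generate the key=value strings from it. The sketch below is only an illustration: build_query is a hypothetical helper, not a function exported by rclinicaltrials, and it assumes the keys are valid advanced search keys.

```r
# Hypothetical helper (not part of rclinicaltrials): turn a named character
# vector into the "key=value" strings accepted by the query argument.
build_query <- function(terms) {
  stopifnot(is.character(terms),
            !is.null(names(terms)),
            all(names(terms) != ""))
  unname(paste(names(terms), terms, sep = "="))
}

query <- build_query(c(type = "Intr", cond = "cancer"))
print(query)
```

The resulting vector has the same form as the one passed to clinicaltrials_count() above, so it could be supplied as the query argument.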

The available advanced search terms are listed in the advanced_search_terms data frame that comes with the package. The data frame contains each key, a description, and a link to the help webpage that explains the possible values for that search term. To open the help page for cond, for instance, run browseURL(advanced_search_terms["cond", "help"]).

head(advanced_search_terms)
##      keys   description
## term term  Search Terms
## recr recr   Recruitment
## rslt rslt Study Results
## type type    Study Type
## cond cond    Conditions
## intr intr Interventions
##                                                        help
## term        http://clinicaltrials.gov/ct2/help/search_terms
## recr         http://clinicaltrials.gov/ct2/help/recruitment
## rslt       http://clinicaltrials.gov/ct2/help/study_results
## type          http://clinicaltrials.gov/ct2/help/study_type
## cond    http://clinicaltrials.gov/ct2/help/conditions_instr
## intr http://clinicaltrials.gov/ct2/help/interventions_instr
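
Because the row names are the keys, a help link can be looked up by key, and the descriptions can be searched to find a key. The sketch below uses a small stand-in data frame copied from the output above, so that it runs without the package; with the package loaded, the same indexing should apply to advanced_search_terms itself.

```r
# Stand-in for a few rows of advanced_search_terms, copied from the
# printed output above (assumption: same column names and row names).
terms <- data.frame(
  keys = c("term", "recr", "cond"),
  description = c("Search Terms", "Recruitment", "Conditions"),
  help = c("http://clinicaltrials.gov/ct2/help/search_terms",
           "http://clinicaltrials.gov/ct2/help/recruitment",
           "http://clinicaltrials.gov/ct2/help/conditions_instr"),
  stringsAsFactors = FALSE
)
rownames(terms) <- terms$keys

# Look up the help link for one key by row name.
cond_help <- terms["cond", "help"]

# Find the keys whose description mentions a word of interest.
matches <- terms$keys[grepl("recruit", terms$description, ignore.case = TRUE)]
```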

To download detailed study information, including results, use clinicaltrials_download():

y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y)
## List of 2
##  $ study_information:List of 5
##   ..$ study_info   :'data.frame':    10 obs. of  34 variables:
##   .. ..$ org_study_id                  : chr [1:10] "000201" "CDR0000597015" "MCC-15697" "HCI33979" ...
##   .. ..$ nct_id                        : chr [1:10] "NCT00006184" "NCT00897910" "NCT00948922" "NCT00983346" ...
##   .. ..$ brief_title                   : Factor w/ 10 levels "Chemotherapy, Stem Cell Transplantation and Donor and Patient Vaccination for Treatment of Multiple Myeloma",..: 1 2 3 4 5 6 7 8 9 10
##   .. ..$ official_title                : Factor w/ 10 levels "Active Immunization of Sibling Stem Cell Transplant Donors Against Purified Myeloma Protein of the Stem Cell Recipient With Mul"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10
##   .. ..$ overall_status                : Factor w/ 5 levels "Completed","Terminated",..: 1 2 3 4 4 3 3 3 5 3
##   .. ..$ start_date                    : Factor w/ 8 levels "August 2000",..: 1 2 3 4 5 6 6 7 8 8
##   .. ..$ completion_date.text          : Factor w/ 7 levels "July 2013","September 2012",..: 1 2 3 4 5 NA 6 7 NA NA
##   .. ..$ completion_date..attrs        : Factor w/ 2 levels "Actual","Anticipated": 1 1 2 2 2 NA 2 2 NA NA
##   .. ..$ completion_date_type          : chr [1:10] "Actual" "Actual" "Anticipated" "Anticipated" ...
##   .. ..$ lead_sponsor/agency           : Factor w/ 10 levels "National Cancer Institute (NCI)",..: 1 2 3 4 5 6 7 8 9 10
##   .. ..$ phase                         : Factor w/ 4 levels "Phase 2","N/A",..: 1 2 1 1 1 2 2 3 2 4
##   .. ..$ study_type                    : Factor w/ 2 levels "Interventional",..: 1 2 1 1 1 2 2 1 2 1
##   .. ..$ study_design                  : Factor w/ 8 levels "Allocation: Non-Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Parallel Assignment, Masking: O"| __truncated__,..: 1 2 3 4 5 6 6 1 7 8
##   .. ..$ enrollment.text               : Factor w/ 9 levels "20","6","150",..: 1 2 3 1 4 5 6 7 8 9
##   .. ..$ enrollment..attrs             : Factor w/ 2 levels "Actual","Anticipated": 1 1 2 2 2 2 2 2 2 2
##   .. ..$ primary_condition             : chr [1:10] "Multiple Myeloma" "Multiple Myeloma and Plasma Cell Neoplasm" "Multiple Myeloma" "Cancer" ...
##   .. ..$ primary_outcome.measure       : Factor w/ 10 levels "Immune Response",..: 1 2 3 4 5 6 7 8 9 10
##   .. ..$ primary_outcome.time_frame    : Factor w/ 9 levels "105 days","Collection of PBMCs over a period of 9-12 months, and the laboratory component will be performed over another year.",..: 1 2 3 4 5 6 7 3 8 9
##   .. ..$ primary_outcome.safety_issue  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 2
##   .. ..$ primary_outcome.description   : Factor w/ 6 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 NA 2 NA 3 4 5 NA 6 NA
##   .. ..$ primary_outcome.measure.1     : Factor w/ 4 levels "Number of Participants With Adverse Events",..: 1 NA 2 NA NA NA NA 3 NA 4
##   .. ..$ primary_outcome.time_frame.1  : Factor w/ 3 levels "9 years","2 years",..: 1 NA 2 NA NA NA NA 2 NA 3
##   .. ..$ primary_outcome.safety_issue.1: Factor w/ 2 levels "Yes","No": 1 NA 2 NA NA NA NA 1 NA 2
##   .. ..$ primary_outcome.description.1 : Factor w/ 2 levels "Here is the number of participants with adverse events. For a detailed list of adverse events see the adverse event module.",..: 1 NA 2 NA NA NA NA NA NA NA
##   .. ..$ eligibility.gender            : Factor w/ 1 level "Both": 1 1 1 1 1 1 1 1 1 1
##   .. ..$ eligibility.minimum_age       : Factor w/ 1 level "18 Years": 1 1 1 1 1 1 1 1 1 1
##   .. ..$ eligibility.maximum_age       : Factor w/ 3 levels "75 Years","N/A",..: 1 2 2 2 3 2 2 2 2 2
##   .. ..$ eligibility.healthy_volunteers: Factor w/ 2 levels "No","Accepts Healthy Volunteers": 1 1 1 1 1 1 1 1 2 1
##   .. ..$ date_disclaimer               : chr [1:10] "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" ...
##   .. ..$ eligibility.sampling_method   : Factor w/ 1 level "Non-Probability Sample": NA 1 NA NA NA 1 1 NA 1 NA
##   .. ..$ eligibility.textblock.1       : Factor w/ 4 levels "\n        DISEASE CHARACTERISTICS:\n\n          -  Diagnosis of multiple myeloma\n\n        PATIENT CHARACTERISTICS:\n\n       "| __truncated__,..: NA 1 NA NA NA 2 3 NA 4 NA
##   .. ..$ primary_outcome.measure.2     : Factor w/ 2 levels "recommended Phase 2 dose for TH-302 monotherapy and in combination with bortezomib in subjects with relapsed/refractory multipl"| __truncated__,..: NA NA NA NA NA NA NA 1 NA 2
##   .. ..$ primary_outcome.time_frame.2  : Factor w/ 2 levels "2 years","Dose Expansion Stage - from day 1 of Cycle 1 through 28 days after the patient's last cycle of treatment": NA NA NA NA NA NA NA 1 NA 2
##   .. ..$ primary_outcome.safety_issue.2: Factor w/ 1 level "Yes": NA NA NA NA NA NA NA 1 NA 1
##   ..$ locations    :'data.frame':    115 obs. of  6 variables:
##   .. ..$ name           : chr [1:115] "National Institutes of Health Clinical Center, 9000 Rockville Pike" "Barbara Ann Karmanos Cancer Institute" "H. Lee Moffitt Cancer Center" "Huntsman Cancer Institute" ...
##   .. ..$ address.city   : chr [1:115] "Bethesda" "Detroit" "Tampa" "Salt Lake City" ...
##   .. ..$ address.state  : chr [1:115] "Maryland" "Michigan" "Florida" "Utah" ...
##   .. ..$ address.zip    : chr [1:115] "20892" "48201-1379" "33612" "84112" ...
##   .. ..$ address.country: chr [1:115] "United States" "United States" "United States" "United States" ...
##   .. ..$ nct_id         : chr [1:115] "NCT00006184" "NCT00897910" "NCT00948922" "NCT00983346" ...
##   ..$ interventions:'data.frame':    29 obs. of  8 variables:
##   .. ..$ intervention_type: chr [1:29] "Drug" "Drug" "Drug" "Drug" ...
##   .. ..$ intervention_name: chr [1:29] "Myeloma Immunoglobulin Idiotype Vaccine" "Bortezomib" "Cyclophosphamide" "Cyclosporine" ...
##   .. ..$ description      : chr [1:29] "3 subcutaneous (SC) injections of myeloma protein within 10 weeks before stem cell collection the first (week 0), second (week "| __truncated__ "Induction chemotherapy: 1.3 mg/m^2 bolus intravenous injection twice weekly for 2 weeks (days 1, 4, 8, 11) followed by 10 day r"| __truncated__ "Induction chemotherapy: 600 mg/m^2 day 4 Transplant: 1200 mg/m^2 intravenous x 4 days (days -6, -5, -4, -3)" "Transplant: 2 mg/kg intravenous every 12 hours continuous intravenous or by mouth (PO) until day + 180" ...
##   .. ..$ arm_group_label  : chr [1:29] "Donor - Vaccine Generation Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" ...
##   .. ..$ arm_group_label.1: chr [1:29] NA "Donor - Vaccine Generation Group" NA NA ...
##   .. ..$ other_name       : chr [1:29] NA "Velcade" "Cytoxan" "Sandimmune" ...
##   .. ..$ other_name.1     : chr [1:29] NA "PS341" "CTX" "Cyclosporin A" ...
##   .. ..$ nct_id           : chr [1:29] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
##   ..$ outcomes     :'data.frame':    43 obs. of  6 variables:
##   .. ..$ measure     : Factor w/ 43 levels "Immune Response",..: 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ time_frame  : Factor w/ 21 levels "105 days","9 years",..: 1 2 3 3 4 4 4 4 4 5 ...
##   .. ..$ safety_issue: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 2 1 1 ...
##   .. ..$ description : Factor w/ 23 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 2 NA NA 3 4 5 6 7 NA ...
##   .. ..$ type        : chr [1:43] "primary_outcome" "primary_outcome" "primary_outcome" "secondary_outcome" ...
##   .. ..$ nct_id      : chr [1:43] "NCT00006184" "NCT00006184" "NCT00897910" "NCT00897910" ...
##   ..$ textblocks   : NULL
##  $ study_results    :List of 3
##   ..$ participant_flow:'data.frame': 9 obs. of  6 variables:
##   .. ..$ title   : Factor w/ 1 level "Overall Study": 1 1 1 1 1 1 1 1 1
##   .. ..$ status  : Factor w/ 3 levels "STARTED","COMPLETED",..: 1 1 2 2 3 3 1 2 3
##   .. ..$ group_id: chr [1:9] "P1" "P2" "P1" "P2" ...
##   .. ..$ count   : chr [1:9] "10" "10" "9" "10" ...
##   .. ..$ arm     : chr [1:9] "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" ...
##   .. ..$ nct_id  : chr [1:9] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
##   ..$ baseline_data   :'data.frame': 56 obs. of  10 variables:
##   .. ..$ title     : Factor w/ 5 levels "Number of Participants",..: 1 1 1 2 2 2 2 2 2 2 ...
##   .. ..$ units     : Factor w/ 3 levels "participants",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ param     : Factor w/ 2 levels "Number","Mean": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ subtitle  : Factor w/ 16 levels "","<=18 years",..: 1 1 1 2 2 2 3 3 3 4 ...
##   .. ..$ group_id  : Factor w/ 3 levels "B1","B2","B3": 1 2 3 1 2 3 1 2 3 1 ...
##   .. ..$ value     : Factor w/ 17 levels "10","20","0",..: 1 1 2 3 3 3 1 1 2 3 ...
##   .. ..$ dispersion: Factor w/ 1 level "Standard Deviation": NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ spread    : Factor w/ 4 levels "4.59","6.05",..: NA NA NA NA NA NA NA NA NA NA ...
##   .. ..$ arm       : chr [1:56] "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" "Total" "Recipient - Chemotherapy Group" ...
##   .. ..$ nct_id    : chr [1:56] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
##   ..$ outcome_data    :'data.frame': 9 obs. of  14 variables:
##   .. ..$ type                : Factor w/ 2 levels "Primary","Secondary": 1 1 1 1 1 1 1 1 2
##   .. ..$ title               : Factor w/ 4 levels "Immune Response",..: 1 1 2 2 2 2 3 3 4
##   .. ..$ description         : Factor w/ 2 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 1 2 2 2 2 NA NA NA
##   .. ..$ time_frame          : Factor w/ 3 levels "105 days","9 years",..: 1 1 2 2 2 2 3 3 3
##   .. ..$ safety_issue        : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 1
##   .. ..$ population          : Factor w/ 1 level "This outcome measure was only pre-specified to be measured in the recipient Arm/Group.": 1 1 NA NA NA NA NA NA NA
##   .. ..$ units               : Factor w/ 3 levels "participants",..: 1 2 1 1 3 3 1 NA NA
##   .. ..$ param               : Factor w/ 1 level "Number": 1 1 1 1 1 1 1 NA NA
##   .. ..$ subtitle            : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 NA
##   .. ..$ group_id            : Factor w/ 2 levels "O1","O2": 1 1 1 2 1 2 1 NA NA
##   .. ..$ value               : Factor w/ 3 levels "10","7","0": 1 2 1 1 1 1 3 NA NA
##   .. ..$ arm                 : chr [1:9] "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" ...
##   .. ..$ nct_id              : chr [1:9] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
##   .. ..$ measurement.group_id: Factor w/ 1 level "O1": NA NA NA NA NA NA NA 1 NA

This returns a list of dataframes that share a common key variable, nct_id. Optionally, you can also request the long text fields and, as with include_results = TRUE above, the study results where available. Study results are returned as a list of dataframes nested within the main list.

How to use the results

The data come from a relational database with lots of text fields, so it may take some effort to get the data into a flat format for analysis. For that reason, results come back from the clinicaltrials_download function as a list of dataframes. Each dataframe has a common key variable: nct_id. To merge dataframes, use this key. Otherwise, you can analyze the dataframes separately. They are organized into study information, locations, outcomes, interventions, results, and textblocks. Results, where available, is itself a list with three dataframes: participant flow, baseline data, and outcome data.
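
As a minimal sketch of merging on the key, the example below joins two made-up dataframes that mimic the shape of study_info and locations. The values are invented for illustration; only the nct_id, phase, and address.city column names are taken from the output shown earlier.

```r
# Made-up stand-ins for two of the returned dataframes.
study_info <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002"),
  phase = c("Phase 2", "Phase 3"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  nct_id = c("NCT00000001", "NCT00000001", "NCT00000002"),
  address.city = c("Bethesda", "Tampa", "Detroit"),
  stringsAsFactors = FALSE
)

# One row per location, with the study-level columns repeated for each.
merged <- merge(study_info, locations, by = "nct_id")

# Number of sites per study.
sites <- table(merged$nct_id)
```

Note that merging a one-row-per-study table with a many-rows-per-study table repeats the study-level values, so counts and sums should be computed with that duplication in mind.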

Results tables are stored in long format, so there are often multiple rows per study, each corresponding to a different group or outcome. Let's look at an example: the cumulative enrollment of men and women in phase III interventional melanoma studies over time.

melanom <- clinicaltrials_search(query = c("cond=melanoma", "phase=2", "type=Intr", "rslt=With"), count = 1e6)
nrow(melanom)
## [1] 18
table(melanom$status.text)
## 
## Active, not recruiting              Completed             Terminated 
##                      6                     11                      1

Now to download the data:

melanom_information <- clinicaltrials_download(frame = melanom, count = 1e6, include_results = TRUE)
summary(melanom_information$study_results$baseline_data)
##                         title              units        param    
##  Gender                    :104   participants:439   Number:524  
##  Region of Enrollment      : 83   years       : 41   Median: 16  
##  Age                       : 73   Participants: 85   Mean  : 34  
##  Number of Participants    : 52   Years       :  9               
##  Race/Ethnicity, Customized: 37                                  
##  Age, Customized           : 23                                  
##  (Other)                   :202                                  
##           subtitle   group_id     value                  dispersion 
##               :102   B1:197   0      : 40   Full Range        : 28  
##  Female       : 52   B2:168   1      : 18   Standard Deviation: 22  
##  Male         : 52   B3:168   2      : 11   NA's              :524  
##  >=65 years   : 16   B4: 29   5      : 10                           
##  United States: 13   B5:  4   11     :  9                           
##  White        : 10   B6:  4   (Other):478                           
##  (Other)      :329   B7:  4   NA's   :  8                           
##   lower_limit   upper_limit      arm               nct_id         
##  19     :  6   87     :  4   Length:574         Length:574        
##  23     :  3   88     :  3   Class :character   Class :character  
##  10     :  2   74     :  2   Mode  :character   Mode  :character  
##  -0.0   :  2   84     :  2                                        
##  18     :  2   90     :  2                                        
##  (Other): 11   (Other): 13                                        
##  NA's   :548   NA's   :548                                        
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          description 
##  The "M" in the TNM (tumor, node, metastasis) system refers to distant metastases—whether, and how far, the cancer has spread outside the original site. M0: There is no evidence that the cancer has spread beyond the original site. M1: The cancer has spread beyond the original site. M1a: The cancer has spread to other areas of skin, underneath the epidermis to the dermis (subcutaneous), or to lymph node(s). M1b: The cancer has spread to the lung(s) only. M1c: The cancer has spread to other organs and/or locations in the body with or without elevated LDH.: 16  
##  Breslow's Thickness is a measure of the vertical thickness of a cutaneous melanoma  lesion and is reported in millimeters (mm).                                                                                                                                                                                                                                                                                                                                                                                                                                               : 15  
##  ECOG-Eastern Cooperative Oncology Group (ECOG) Performance Status is used by doctors and researchers to assess how a participant's disease is progressing, assess how the disease affects the daily living activities of the participant and determine appropriate treatment and prognosis. 0 = Fully Active (Most Favorable Activity); 1 = Restricted activity but ambulatory; 2 = Ambulatory but unable to carry out work activities; 3 = Limited Self-Care; 4 = Completely Disabled, No self-care (Least Favorable Activity)                                               : 15  
##  One patient on arm V had missing data for gender. Hence, a total of 189 patients on arm V reported gender.                                                                                                                                                                                                                                                                                                                                                                                                                                                                    : 14  
##  Upper limit of normal (ULN) was 250 U/L for most assessments (some variation caused by tests performed at local laboratories).                                                                                                                                                                                                                                                                                                                                                                                                                                                : 12  
##  (Other)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       :115  
##  NA's                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          :387  
##      spread   
##  12.8   :  1  
##  13.0   :  1  
##  57.0   :  1  
##  13.51  :  1  
##  13.71  :  1  
##  (Other): 20  
##  NA's   :549
gend_data <- subset(melanom_information$study_results$baseline_data, title == "Gender" & arm != "Total")

library(plyr)

gender_counts <- ddply(gend_data, ~ nct_id + subtitle, function(df){

  data.frame(
    count = sum(as.numeric(paste(df$value)), na.rm = TRUE)
    )

})
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
dates <- melanom_information$study_information$study_info[, c("nct_id", "start_date")]
dates$year <- sapply(strsplit(paste(dates$start_date), " "), function(d) as.numeric(d[2]))

counts <- merge(gender_counts, dates, by = "nct_id")
library(ggplot2)
cts <- ddply(counts, ~ year + subtitle, summarize, count = sum(count))
colnames(cts)[2] <- "Gender"
# Accumulate within each gender; cumsum() inside aes() would run across all rows at once
cts <- ddply(cts, ~ Gender, transform, cumulative = cumsum(count))
ggplot(cts, aes(x = year, y = cumulative, color = Gender)) +
  geom_line() + geom_point() + labs(title = "Cumulative enrollment into Phase III, \n interventional trials in Melanoma, by gender") + scale_y_continuous("Cumulative Enrollment")

[Figure: cumulative enrollment into Phase III interventional trials in melanoma, by gender]
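
A running total by gender has to be accumulated within each gender group, in year order, rather than over all rows at once. The sketch below shows one way to do this in base R with ave(), using made-up counts in the same shape as cts; it is an illustration under those assumptions, not output from the package.

```r
# Made-up yearly counts in the same shape as cts (year, Gender, count).
toy <- data.frame(
  year = c(2005, 2005, 2006, 2006, 2007, 2007),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  count = c(10, 20, 5, 15, 30, 10),
  stringsAsFactors = FALSE
)

# Sort by year so the running total accumulates in time order, then
# compute the cumulative sum separately within each gender.
toy <- toy[order(toy$year), ]
toy$cumulative <- ave(toy$count, toy$Gender, FUN = cumsum)
print(toy)
```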