Clinicaltrials.gov
ClinicalTrials.gov is a registry and results database of publicly and privately supported clinical studies of human participants conducted around the world. Users can search for information about those trials and for their results. This package provides a set of functions to interact with the search and download features. Results are downloaded to temporary directories and returned as R objects.
The package is not currently available on CRAN. To install, use devtools::install_github(), as follows:
install.packages("devtools")
library(devtools)
install_github("sachsmc/rclinicaltrials")
The main function is clinicaltrials_search(). Here's an example of its use:
library(rclinicaltrials)
z <- clinicaltrials_search(query = 'lime+disease')
str(z)
## 'data.frame': 17 obs. of 7 variables:
## $ score : chr "0.040881" "0.02838" "0.020476" "0.017" ...
## $ nct_id : chr "NCT01333202" "NCT01951924" "NCT01056133" "NCT02156089" ...
## $ url : chr "http://ClinicalTrials.gov/show/NCT01333202" "http://ClinicalTrials.gov/show/NCT01951924" "http://ClinicalTrials.gov/show/NCT01056133" "http://ClinicalTrials.gov/show/NCT02156089" ...
## $ title : chr "Fresh Lime Alone for Smoking Cessation" "LIME Study (LFB IVIG MMN Efficacy Study)" "Effect of Fish-oil on Non-alcoholic Steatohepatitis (NASH)" "Neuroimaging, Omega-3 and Reward in Adults With ADHD (NORAA) Trial" ...
## $ status.text : chr "Completed" "Recruiting" "Recruiting" "Recruiting" ...
## $ condition_summary: chr "Tobacco Use Disorder" "Motor Neuron Disease" "Non-alcoholic Fatty Liver Disease; Non-alcoholic Steatohepatitis" "Attention Deficit Disorder; Attention Deficit Hyerpactivity Disorder" ...
## $ last_changed : chr "April 8, 2011" "September 24, 2013" "May 23, 2014" "August 12, 2014" ...
This gives you basic information about the trials. Before searching or downloading, you can determine how many results will be returned using the clinicaltrials_count() function:
clinicaltrials_count(query = "myeloma")
## [1] 1844
clinicaltrials_count(query = "29485tksrw@")
## [1] 0
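One way to use the count is as a guard before a large download. The sketch below is only an illustration: download_if_small(), fake_count() and fake_download() are hypothetical names that are not part of the package, and the stand-in functions exist so that the example runs offline. In practice you would presumably pass clinicaltrials_count and clinicaltrials_download instead.

```r
# Hypothetical helper: run `download_fun` only when the query matches
# between 1 and `max_n` studies, as reported by `count_fun`.
download_if_small <- function(query, count_fun, download_fun, max_n = 100) {
  n <- count_fun(query = query)
  if (n == 0) {
    message("No studies matched; nothing to download.")
    return(NULL)
  }
  if (n > max_n) {
    message(n, " studies matched; narrow the query or raise max_n.")
    return(NULL)
  }
  download_fun(query = query, count = n)
}

# Stand-in functions (assumptions) so the sketch runs without network access.
fake_count <- function(query) if (query == "myeloma") 1844 else 0
fake_download <- function(query, count) paste("downloaded", count, "studies")

res_big  <- download_if_small("myeloma", fake_count, fake_download)
res_ok   <- download_if_small("myeloma", fake_count, fake_download, max_n = 2000)
res_none <- download_if_small("29485tksrw@", fake_count, fake_download)
```

Here res_big and res_none are NULL because the guard declined to download, while res_ok holds the stand-in result.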
The query can be a single string, which is passed to the “search terms” field on ClinicalTrials.gov. Terms can be combined using the logical operators AND, OR, and NOT. Advanced searches can be performed by passing a vector of key=value pairs as strings. For example, to search for interventional studies of cancer:
clinicaltrials_count(query = c("type=Intr", "cond=cancer"))
## [1] 35138
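Both kinds of query are ordinary character vectors, so they can be assembled programmatically. The following is a minimal sketch; build_query() is a hypothetical convenience function and not something the package provides.

```r
# Hypothetical helper: turn named arguments into "key=value" strings
# suitable for an advanced search.
build_query <- function(...) {
  args <- list(...)
  paste0(names(args), "=", unlist(args))
}

adv_query <- build_query(type = "Intr", cond = "cancer")

# A simple search string with a logical operator is just pasted together.
term_query <- paste(c("myeloma", "lymphoma"), collapse = " OR ")
```

adv_query is the same vector as the one written by hand above, and term_query could be passed as a single-string query.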
The possible advanced search terms are listed in the advanced_search_terms data frame, which comes with the package. The data frame contains the keys, a description of each, and a link to the help page that explains the possible values of each search term. To open the help page for cond, for instance, run browseURL(advanced_search_terms["cond", "help"]).
head(advanced_search_terms)
## keys description
## term term Search Terms
## recr recr Recruitment
## rslt rslt Study Results
## type type Study Type
## cond cond Conditions
## intr intr Interventions
## help
## term http://clinicaltrials.gov/ct2/help/search_terms
## recr http://clinicaltrials.gov/ct2/help/recruitment
## rslt http://clinicaltrials.gov/ct2/help/study_results
## type http://clinicaltrials.gov/ct2/help/study_type
## cond http://clinicaltrials.gov/ct2/help/conditions_instr
## intr http://clinicaltrials.gov/ct2/help/interventions_instr
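Because the row names are the keys, rows can be looked up by key or filtered by description. The sketch below rebuilds only the six rows printed above as a stand-in data frame (named terms here, an assumption made so the example runs without the package); the real advanced_search_terms object has more rows.

```r
# Stand-in for the first six rows of advanced_search_terms shown above.
terms <- data.frame(
  keys = c("term", "recr", "rslt", "type", "cond", "intr"),
  description = c("Search Terms", "Recruitment", "Study Results",
                  "Study Type", "Conditions", "Interventions"),
  help = paste0("http://clinicaltrials.gov/ct2/help/",
                c("search_terms", "recruitment", "study_results",
                  "study_type", "conditions_instr", "interventions_instr")),
  stringsAsFactors = FALSE
)
rownames(terms) <- terms$keys

# Look up the help link for one key.
cond_help <- terms["cond", "help"]

# Find the keys whose description mentions "Study".
study_keys <- terms$keys[grepl("Study", terms$description)]
```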
To download detailed study information, including results, use clinicaltrials_download():
y <- clinicaltrials_download(query = 'myeloma', count = 10, include_results = TRUE)
str(y)
## List of 2
## $ study_information:List of 5
## ..$ study_info :'data.frame': 10 obs. of 34 variables:
## .. ..$ org_study_id : chr [1:10] "000201" "CDR0000597015" "MCC-15697" "HCI33979" ...
## .. ..$ nct_id : chr [1:10] "NCT00006184" "NCT00897910" "NCT00948922" "NCT00983346" ...
## .. ..$ brief_title : Factor w/ 10 levels "Chemotherapy, Stem Cell Transplantation and Donor and Patient Vaccination for Treatment of Multiple Myeloma",..: 1 2 3 4 5 6 7 8 9 10
## .. ..$ official_title : Factor w/ 10 levels "Active Immunization of Sibling Stem Cell Transplant Donors Against Purified Myeloma Protein of the Stem Cell Recipient With Mul"| __truncated__,..: 1 2 3 4 5 6 7 8 9 10
## .. ..$ overall_status : Factor w/ 5 levels "Completed","Terminated",..: 1 2 3 4 4 3 3 3 5 3
## .. ..$ start_date : Factor w/ 8 levels "August 2000",..: 1 2 3 4 5 6 6 7 8 8
## .. ..$ completion_date.text : Factor w/ 7 levels "July 2013","September 2012",..: 1 2 3 4 5 NA 6 7 NA NA
## .. ..$ completion_date..attrs : Factor w/ 2 levels "Actual","Anticipated": 1 1 2 2 2 NA 2 2 NA NA
## .. ..$ completion_date_type : chr [1:10] "Actual" "Actual" "Anticipated" "Anticipated" ...
## .. ..$ lead_sponsor/agency : Factor w/ 10 levels "National Cancer Institute (NCI)",..: 1 2 3 4 5 6 7 8 9 10
## .. ..$ phase : Factor w/ 4 levels "Phase 2","N/A",..: 1 2 1 1 1 2 2 3 2 4
## .. ..$ study_type : Factor w/ 2 levels "Interventional",..: 1 2 1 1 1 2 2 1 2 1
## .. ..$ study_design : Factor w/ 8 levels "Allocation: Non-Randomized, Endpoint Classification: Safety/Efficacy Study, Intervention Model: Parallel Assignment, Masking: O"| __truncated__,..: 1 2 3 4 5 6 6 1 7 8
## .. ..$ enrollment.text : Factor w/ 9 levels "20","6","150",..: 1 2 3 1 4 5 6 7 8 9
## .. ..$ enrollment..attrs : Factor w/ 2 levels "Actual","Anticipated": 1 1 2 2 2 2 2 2 2 2
## .. ..$ primary_condition : chr [1:10] "Multiple Myeloma" "Multiple Myeloma and Plasma Cell Neoplasm" "Multiple Myeloma" "Cancer" ...
## .. ..$ primary_outcome.measure : Factor w/ 10 levels "Immune Response",..: 1 2 3 4 5 6 7 8 9 10
## .. ..$ primary_outcome.time_frame : Factor w/ 9 levels "105 days","Collection of PBMCs over a period of 9-12 months, and the laboratory component will be performed over another year.",..: 1 2 3 4 5 6 7 3 8 9
## .. ..$ primary_outcome.safety_issue : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 2
## .. ..$ primary_outcome.description : Factor w/ 6 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 NA 2 NA 3 4 5 NA 6 NA
## .. ..$ primary_outcome.measure.1 : Factor w/ 4 levels "Number of Participants With Adverse Events",..: 1 NA 2 NA NA NA NA 3 NA 4
## .. ..$ primary_outcome.time_frame.1 : Factor w/ 3 levels "9 years","2 years",..: 1 NA 2 NA NA NA NA 2 NA 3
## .. ..$ primary_outcome.safety_issue.1: Factor w/ 2 levels "Yes","No": 1 NA 2 NA NA NA NA 1 NA 2
## .. ..$ primary_outcome.description.1 : Factor w/ 2 levels "Here is the number of participants with adverse events. For a detailed list of adverse events see the adverse event module.",..: 1 NA 2 NA NA NA NA NA NA NA
## .. ..$ eligibility.gender : Factor w/ 1 level "Both": 1 1 1 1 1 1 1 1 1 1
## .. ..$ eligibility.minimum_age : Factor w/ 1 level "18 Years": 1 1 1 1 1 1 1 1 1 1
## .. ..$ eligibility.maximum_age : Factor w/ 3 levels "75 Years","N/A",..: 1 2 2 2 3 2 2 2 2 2
## .. ..$ eligibility.healthy_volunteers: Factor w/ 2 levels "No","Accepts Healthy Volunteers": 1 1 1 1 1 1 1 1 2 1
## .. ..$ date_disclaimer : chr [1:10] "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" "ClinicalTrials.gov processed this data on September 18, 2014" ...
## .. ..$ eligibility.sampling_method : Factor w/ 1 level "Non-Probability Sample": NA 1 NA NA NA 1 1 NA 1 NA
## .. ..$ eligibility.textblock.1 : Factor w/ 4 levels "\n DISEASE CHARACTERISTICS:\n\n - Diagnosis of multiple myeloma\n\n PATIENT CHARACTERISTICS:\n\n "| __truncated__,..: NA 1 NA NA NA 2 3 NA 4 NA
## .. ..$ primary_outcome.measure.2 : Factor w/ 2 levels "recommended Phase 2 dose for TH-302 monotherapy and in combination with bortezomib in subjects with relapsed/refractory multipl"| __truncated__,..: NA NA NA NA NA NA NA 1 NA 2
## .. ..$ primary_outcome.time_frame.2 : Factor w/ 2 levels "2 years","Dose Expansion Stage - from day 1 of Cycle 1 through 28 days after the patient's last cycle of treatment": NA NA NA NA NA NA NA 1 NA 2
## .. ..$ primary_outcome.safety_issue.2: Factor w/ 1 level "Yes": NA NA NA NA NA NA NA 1 NA 1
## ..$ locations :'data.frame': 115 obs. of 6 variables:
## .. ..$ name : chr [1:115] "National Institutes of Health Clinical Center, 9000 Rockville Pike" "Barbara Ann Karmanos Cancer Institute" "H. Lee Moffitt Cancer Center" "Huntsman Cancer Institute" ...
## .. ..$ address.city : chr [1:115] "Bethesda" "Detroit" "Tampa" "Salt Lake City" ...
## .. ..$ address.state : chr [1:115] "Maryland" "Michigan" "Florida" "Utah" ...
## .. ..$ address.zip : chr [1:115] "20892" "48201-1379" "33612" "84112" ...
## .. ..$ address.country: chr [1:115] "United States" "United States" "United States" "United States" ...
## .. ..$ nct_id : chr [1:115] "NCT00006184" "NCT00897910" "NCT00948922" "NCT00983346" ...
## ..$ interventions:'data.frame': 29 obs. of 8 variables:
## .. ..$ intervention_type: chr [1:29] "Drug" "Drug" "Drug" "Drug" ...
## .. ..$ intervention_name: chr [1:29] "Myeloma Immunoglobulin Idiotype Vaccine" "Bortezomib" "Cyclophosphamide" "Cyclosporine" ...
## .. ..$ description : chr [1:29] "3 subcutaneous (SC) injections of myeloma protein within 10 weeks before stem cell collection the first (week 0), second (week "| __truncated__ "Induction chemotherapy: 1.3 mg/m^2 bolus intravenous injection twice weekly for 2 weeks (days 1, 4, 8, 11) followed by 10 day r"| __truncated__ "Induction chemotherapy: 600 mg/m^2 day 4 Transplant: 1200 mg/m^2 intravenous x 4 days (days -6, -5, -4, -3)" "Transplant: 2 mg/kg intravenous every 12 hours continuous intravenous or by mouth (PO) until day + 180" ...
## .. ..$ arm_group_label : chr [1:29] "Donor - Vaccine Generation Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" ...
## .. ..$ arm_group_label.1: chr [1:29] NA "Donor - Vaccine Generation Group" NA NA ...
## .. ..$ other_name : chr [1:29] NA "Velcade" "Cytoxan" "Sandimmune" ...
## .. ..$ other_name.1 : chr [1:29] NA "PS341" "CTX" "Cyclosporin A" ...
## .. ..$ nct_id : chr [1:29] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
## ..$ outcomes :'data.frame': 43 obs. of 6 variables:
## .. ..$ measure : Factor w/ 43 levels "Immune Response",..: 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ time_frame : Factor w/ 21 levels "105 days","9 years",..: 1 2 3 3 4 4 4 4 4 5 ...
## .. ..$ safety_issue: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 2 1 1 ...
## .. ..$ description : Factor w/ 23 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 2 NA NA 3 4 5 6 7 NA ...
## .. ..$ type : chr [1:43] "primary_outcome" "primary_outcome" "primary_outcome" "secondary_outcome" ...
## .. ..$ nct_id : chr [1:43] "NCT00006184" "NCT00006184" "NCT00897910" "NCT00897910" ...
## ..$ textblocks : NULL
## $ study_results :List of 3
## ..$ participant_flow:'data.frame': 9 obs. of 6 variables:
## .. ..$ title : Factor w/ 1 level "Overall Study": 1 1 1 1 1 1 1 1 1
## .. ..$ status : Factor w/ 3 levels "STARTED","COMPLETED",..: 1 1 2 2 3 3 1 2 3
## .. ..$ group_id: chr [1:9] "P1" "P2" "P1" "P2" ...
## .. ..$ count : chr [1:9] "10" "10" "9" "10" ...
## .. ..$ arm : chr [1:9] "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" ...
## .. ..$ nct_id : chr [1:9] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
## ..$ baseline_data :'data.frame': 56 obs. of 10 variables:
## .. ..$ title : Factor w/ 5 levels "Number of Participants",..: 1 1 1 2 2 2 2 2 2 2 ...
## .. ..$ units : Factor w/ 3 levels "participants",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ param : Factor w/ 2 levels "Number","Mean": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ subtitle : Factor w/ 16 levels "","<=18 years",..: 1 1 1 2 2 2 3 3 3 4 ...
## .. ..$ group_id : Factor w/ 3 levels "B1","B2","B3": 1 2 3 1 2 3 1 2 3 1 ...
## .. ..$ value : Factor w/ 17 levels "10","20","0",..: 1 1 2 3 3 3 1 1 2 3 ...
## .. ..$ dispersion: Factor w/ 1 level "Standard Deviation": NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ spread : Factor w/ 4 levels "4.59","6.05",..: NA NA NA NA NA NA NA NA NA NA ...
## .. ..$ arm : chr [1:56] "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" "Total" "Recipient - Chemotherapy Group" ...
## .. ..$ nct_id : chr [1:56] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
## ..$ outcome_data :'data.frame': 9 obs. of 14 variables:
## .. ..$ type : Factor w/ 2 levels "Primary","Secondary": 1 1 1 1 1 1 1 1 2
## .. ..$ title : Factor w/ 4 levels "Immune Response",..: 1 1 2 2 2 2 3 3 4
## .. ..$ description : Factor w/ 2 levels "Immune cell depletion is defined as immunosuppression of participants T cells prior to transplant measured by cluster of differ"| __truncated__,..: 1 1 2 2 2 2 NA NA NA
## .. ..$ time_frame : Factor w/ 3 levels "105 days","9 years",..: 1 1 2 2 2 2 3 3 3
## .. ..$ safety_issue : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 1 1 1
## .. ..$ population : Factor w/ 1 level "This outcome measure was only pre-specified to be measured in the recipient Arm/Group.": 1 1 NA NA NA NA NA NA NA
## .. ..$ units : Factor w/ 3 levels "participants",..: 1 2 1 1 3 3 1 NA NA
## .. ..$ param : Factor w/ 1 level "Number": 1 1 1 1 1 1 1 NA NA
## .. ..$ subtitle : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 NA
## .. ..$ group_id : Factor w/ 2 levels "O1","O2": 1 1 1 2 1 2 1 NA NA
## .. ..$ value : Factor w/ 3 levels "10","7","0": 1 2 1 1 1 1 3 NA NA
## .. ..$ arm : chr [1:9] "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Recipient - Chemotherapy Group" "Donor - Vaccination Generation Group" ...
## .. ..$ nct_id : chr [1:9] "NCT00006184" "NCT00006184" "NCT00006184" "NCT00006184" ...
## .. ..$ measurement.group_id: Factor w/ 1 level "O1": NA NA NA NA NA NA NA 1 NA
This returns a list of dataframes that have a common key variable: nct_id. Optionally, you can get the long text fields and/or study results (if available). Study results are also returned as a list of dataframes, contained within the list.
The data come from a relational database with many text fields, so it may take some effort to get them into a flat format for analysis. For that reason, the clinicaltrials_download function returns its results as a list of dataframes. To combine dataframes, merge them on the shared nct_id key; otherwise, you can analyze them separately. They are organized into study information, locations, outcomes, interventions, results, and textblocks. Results, where available, is itself a list with three dataframes: participant flow, baseline data, and outcome data.
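A minimal sketch of merging on the key, using made-up toy data rather than real registry records (the nct_id values and cities below are invented for illustration). With a real download, the same call would presumably be applied to elements such as study_info and locations.

```r
# Toy stand-ins for two of the returned dataframes (values are made up).
study_info <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000003"),
  overall_status = c("Completed", "Terminated", "Recruiting"),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  nct_id = c("NCT00000001", "NCT00000002", "NCT00000002"),
  address.city = c("Bethesda", "Detroit", "Tampa"),
  stringsAsFactors = FALSE
)

# Inner join: one row per study-location pair; studies without a
# location are dropped.
merged <- merge(study_info, locations, by = "nct_id")

# Left join: keep every study, with NA where no location is listed.
merged_all <- merge(study_info, locations, by = "nct_id", all.x = TRUE)
```

Since one study can have many locations, the merged result has more rows than study_info, which is worth keeping in mind before counting studies.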
Results tables are stored in long format, so there are often multiple rows per study, each corresponding to a different group or outcome. As an example, consider the cumulative enrollment of men and women in phase III interventional melanoma studies over time.
melanom <- clinicaltrials_search(query = c("cond=melanoma", "phase=2", "type=Intr", "rslt=With"), count = 1e6)
nrow(melanom)
## [1] 18
table(melanom$status.text)
##
## Active, not recruiting Completed Terminated
## 6 11 1
Now to download the data:
melanom_information <- clinicaltrials_download(frame = melanom, count = 1e6, include_results = TRUE)
summary(melanom_information$study_results$baseline_data)
## title units param
## Gender :104 participants:439 Number:524
## Region of Enrollment : 83 years : 41 Median: 16
## Age : 73 Participants: 85 Mean : 34
## Number of Participants : 52 Years : 9
## Race/Ethnicity, Customized: 37
## Age, Customized : 23
## (Other) :202
## subtitle group_id value dispersion
## :102 B1:197 0 : 40 Full Range : 28
## Female : 52 B2:168 1 : 18 Standard Deviation: 22
## Male : 52 B3:168 2 : 11 NA's :524
## >=65 years : 16 B4: 29 5 : 10
## United States: 13 B5: 4 11 : 9
## White : 10 B6: 4 (Other):478
## (Other) :329 B7: 4 NA's : 8
## lower_limit upper_limit arm nct_id
## 19 : 6 87 : 4 Length:574 Length:574
## 23 : 3 88 : 3 Class :character Class :character
## 10 : 2 74 : 2 Mode :character Mode :character
## -0.0 : 2 84 : 2
## 18 : 2 90 : 2
## (Other): 11 (Other): 13
## NA's :548 NA's :548
## description
## The "M" in the TNM (tumor, node, metastasis) system refers to distant metastases—whether, and how far, the cancer has spread outside the original site. M0: There is no evidence that the cancer has spread beyond the original site. M1: The cancer has spread beyond the original site. M1a: The cancer has spread to other areas of skin, underneath the epidermis to the dermis (subcutaneous), or to lymph node(s). M1b: The cancer has spread to the lung(s) only. M1c: The cancer has spread to other organs and/or locations in the body with or without elevated LDH.: 16
## Breslow's Thickness is a measure of the vertical thickness of a cutaneous melanoma lesion and is reported in millimeters (mm). : 15
## ECOG-Eastern Cooperative Oncology Group (ECOG) Performance Status is used by doctors and researchers to assess how a participant's disease is progressing, assess how the disease affects the daily living activities of the participant and determine appropriate treatment and prognosis. 0 = Fully Active (Most Favorable Activity); 1 = Restricted activity but ambulatory; 2 = Ambulatory but unable to carry out work activities; 3 = Limited Self-Care; 4 = Completely Disabled, No self-care (Least Favorable Activity) : 15
## One patient on arm V had missing data for gender. Hence, a total of 189 patients on arm V reported gender. : 14
## Upper limit of normal (ULN) was 250 U/L for most assessments (some variation caused by tests performed at local laboratories). : 12
## (Other) :115
## NA's :387
## spread
## 12.8 : 1
## 13.0 : 1
## 57.0 : 1
## 13.51 : 1
## 13.71 : 1
## (Other): 20
## NA's :549
gend_data <- subset(melanom_information$study_results$baseline_data, title == "Gender" & arm != "Total")
library(plyr)
gender_counts <- ddply(gend_data, ~ nct_id + subtitle, function(df){
  data.frame(
    count = sum(as.numeric(paste(df$value)), na.rm = TRUE)
  )
})
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
dates <- melanom_information$study_information$study_info[, c("nct_id", "start_date")]
dates$year <- sapply(strsplit(paste(dates$start_date), " "), function(d) as.numeric(d[2]))
counts <- merge(gender_counts, dates, by = "nct_id")
library(ggplot2)
cts <- ddply(counts, ~ year + subtitle, summarize, count = sum(count))
colnames(cts)[2] <- "Gender"
# Accumulate within each gender; cumsum(count) inside aes() would run across both genders at once
cts <- ddply(cts, ~ Gender, transform, cumulative = cumsum(count))
ggplot(cts, aes(x = year, y = cumulative, color = Gender)) +
  geom_line() + geom_point() +
  labs(title = "Cumulative enrollment into Phase III, \n interventional trials in Melanoma, by gender") +
  scale_y_continuous("Cumulative Enrollment")
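A cumulative total by gender has to be accumulated within each gender in year order. As a minimal sketch of that step without plyr, here is a base R version on made-up counts (the toy data frame below is invented and is not output from the registry):

```r
# Made-up yearly enrollment counts by gender.
toy <- data.frame(
  year = c(2005, 2005, 2007, 2007, 2009, 2009),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  count = c(10, 12, 5, 8, 20, 15),
  stringsAsFactors = FALSE
)

# Sort by gender and year, then take a running sum within each gender.
toy <- toy[order(toy$Gender, toy$year), ]
toy$cumulative <- ave(toy$count, toy$Gender, FUN = cumsum)
```

The cumulative column can then be mapped to the y axis directly, with one line per gender.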