This RMarkdown document demonstrates how key elements from the notebook for case study 3 in the EpiGraphDB paper can be achieved using the R package. For detailed explanations of the case study please refer to the paper or the case study notebook.
The biomedical literature contains a wealth of information than far exceeds our capacity for systematic manual extraction. For this reason, there are many existing literature mining methods to extract the key concepts and content. Here we use data from SemMedDB, a well established database that provides subject-predicate-object triples from all PubMed titles and abstracts. Using a subset of this data we created MELODI-presto (https://melodi-presto.mrcieu.ac.uk/), a method to assign triples to any given biomedical query via a PubMed search and some basic enrichment, and have applied this systematically to traits represented in EpiGraphDB. This allows us to identify overlapping terms connecting any set of GWAS traits, e.g. exposure and disease outcome. From here we can attempt to triangulate causal estimates, and conversely, check the mechanisms identified from the literature against the causal evidence.
This case study goes as follows:
library("magrittr")
library("dplyr")
library("purrr")
library("glue")
library("ggplot2")
library("igraph")
library("epigraphdb")
Here we set the starting trait, which we will use to explore associated disease traits.
"Sleep duration" STARTING_TRAIT <-
Given an exposure trait, find all traits with causal evidence. This method searches the causal evidence data for cases where our exposure trait has a potential casual effect on an outcome trait.
function(trait) {
get_mr <- "/mr"
endpoint <- list(
params <-exposure_trait = trait,
pval_threshold = 1e-10
) query_epigraphdb(route = endpoint, params = params, mode = "table")
mr_df <-
mr_df
}
get_mr(STARTING_TRAIT)
mr_df <-%>% glimpse()
mr_df #> Rows: 29
#> Columns: 10
#> $ exposure.id <chr> "ieu-a-1088", "ieu-a-1088", "ieu-a-1088", "ieu-a-1088"…
#> $ exposure.trait <chr> "Sleep duration", "Sleep duration", "Sleep duration", …
#> $ outcome.id <chr> "ukb-a-460", "ieu-a-118", "ieu-a-115", "ieu-a-1073", "…
#> $ outcome.trait <chr> "Vitamin and mineral supplements: Vitamin B", "Neuroti…
#> $ mr.b <dbl> -0.0100669162, -0.1818909943, 2.3293509483, 0.25011813…
#> $ mr.se <dbl> 0.0001310998, 0.0114868227, 0.1428329498, 0.0063949227…
#> $ mr.pval <dbl> 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00…
#> $ mr.method <chr> "FE IVW", "FE IVW", "FE IVW", "FE IVW", "FE IVW", "FE …
#> $ mr.selection <chr> "DF", "DF", "DF", "DF", "DF", "DF", "DF", "HF", "DF", …
#> $ mr.moescore <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 0.94, 1.00, …
For this example, we are interested in traits mapped to a disease node. To do this we utilise the mapping from GWAS trait to Disease via EFO term.
function(trait) {
trait_to_disease <- "/ontology/gwas-efo-disease"
endpoint <- list(trait = trait)
params <- query_epigraphdb(route = endpoint, params = params, mode = "table")
disease_df <-if (nrow(disease_df) > 0) {
disease_df %>% pull(`disease.label`)
res <-else {
} c()
res <-
}
res
}
mr_df %>%
disease_df <- mutate(disease = map(`outcome.trait`, trait_to_disease)) %>%
filter(map_lgl(`disease`, function(x) !is.null(x)))
disease_df#> # A tibble: 5 x 11
#> exposure.id exposure.trait outcome.id outcome.trait mr.b mr.se mr.pval
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ieu-a-1088 Sleep duration ukb-a-107 Non-cancer i… -0.00257 2.53e-4 3.84e-24
#> 2 ieu-a-1088 Sleep duration ieu-a-6 Coronary hea… -1.04 1.10e-1 2.32e-21
#> 3 ieu-a-1088 Sleep duration ukb-a-548 Diagnoses - … -0.00671 8.64e-4 8.01e-15
#> 4 ieu-a-1088 Sleep duration ukb-a-54 Cancer code … -0.00191 2.47e-4 1.07e-14
#> 5 ukb-a-9 Sleep duration ukb-a-13 Sleeplessnes… -0.322 3.45e-2 1.10e-11
#> # … with 4 more variables: mr.method <chr>, mr.selection <chr>,
#> # mr.moescore <dbl>, disease <list>
For the multiple exposure -> outcome
relationships as reported from the table above, here we look at the literature evidence for one pair in detail:
The following looks for enriched triples of information (Subject-Predicate-Object) associated with our two traits. These have been derived via PubMed searches and corresponding SemMedDB data.
function(gwas_id, assoc_gwas_id) {
get_gwas_pair_literature <- "/literature/gwas/pairwise"
endpoint <-# NOTE in this example we blacklist to semmentic types
list(
params <-gwas_id = gwas_id,
assoc_gwas_id = assoc_gwas_id,
by_gwas_id = TRUE,
pval_threshold = 1e-1,
semmantic_types = "nusq",
semmantic_types = "dsyn",
blacklist = TRUE,
limit = 1000
) query_epigraphdb(route = endpoint, params = params, mode = "table")
lit_df <-
lit_df
}
"ieu-a-1088"
GWAS_ID_X <- "ieu-a-6"
GWAS_ID_Y <- get_gwas_pair_literature(GWAS_ID_X, GWAS_ID_Y)
lit_df <-
glimpse(lit_df)
#> Rows: 839
#> Columns: 18
#> $ gwas.trait <chr> "Sleep duration", "Sleep duration", "Sleep duration"…
#> $ gwas.id <chr> "ieu-a-1088", "ieu-a-1088", "ieu-a-1088", "ieu-a-108…
#> $ gs1.localCount <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ gs1.pval <dbl> 7.851119e-06, 7.851119e-06, 7.851119e-06, 7.851119e-…
#> $ s1.subject_name <chr> "Caffeine", "Caffeine", "Caffeine", "Caffeine", "Caf…
#> $ s1.predicate <chr> "INTERACTS_WITH", "INTERACTS_WITH", "INTERACTS_WITH"…
#> $ s1.id <chr> "Caffeine:INTERACTS_WITH:Melatonin", "Caffeine:INTER…
#> $ s1.object_name <chr> "Melatonin", "Melatonin", "Melatonin", "Melatonin", …
#> $ st.name <chr> "Melatonin", "Melatonin", "Melatonin", "Melatonin", …
#> $ st.type <chr> "horm", "horm", "horm", "horm", "horm", "horm", "hor…
#> $ s2.subject_name <chr> "Melatonin", "Melatonin", "Melatonin", "Melatonin", …
#> $ s2.predicate <chr> "ASSOCIATED_WITH", "ASSOCIATED_WITH", "ASSOCIATED_WI…
#> $ s2.id <chr> "Melatonin:ASSOCIATED_WITH:Acute coronary syndrome",…
#> $ s2.object_name <chr> "Acute coronary syndrome", "Myocardial Infarction", …
#> $ gs2.localCount <int> 3, 2, 3, 9, 2, 6, 4, 2, 2, 2, 3, 3, 2, 2, 2, 5, 2, 4…
#> $ gs2.pval <dbl> 8.329046e-05, 1.094386e-02, 8.329046e-05, 3.435014e-…
#> $ assoc_gwas.trait <chr> "Coronary heart disease", "Coronary heart disease", …
#> $ assoc_gwas.id <chr> "ieu-a-6", "ieu-a-6", "ieu-a-6", "ieu-a-6", "ieu-a-6…
# Predicate counts for SemMed triples for trait X
%>%
lit_df count(`s1.predicate`) %>%
arrange(desc(n))
#> # A tibble: 8 x 2
#> s1.predicate n
#> <chr> <int>
#> 1 INTERACTS_WITH 643
#> 2 INHIBITS 80
#> 3 NEG_INTERACTS_WITH 49
#> 4 STIMULATES 47
#> 5 higher_than 10
#> 6 COEXISTS_WITH 7
#> 7 NEG_INHIBITS 2
#> 8 CONVERTS_TO 1
# Predicate counts for SemMed triples for trait Y
%>%
lit_df count(`s2.predicate`) %>%
arrange(desc(n))
#> # A tibble: 15 x 2
#> s2.predicate n
#> <chr> <int>
#> 1 CAUSES 154
#> 2 STIMULATES 132
#> 3 PREVENTS 123
#> 4 PREDISPOSES 107
#> 5 AFFECTS 95
#> 6 ASSOCIATED_WITH 50
#> 7 INTERACTS_WITH 50
#> 8 TREATS 31
#> 9 DISRUPTS 30
#> 10 INHIBITS 24
#> 11 COEXISTS_WITH 21
#> 12 NEG_PREVENTS 16
#> 13 higher_than 3
#> 14 AUGMENTS 2
#> 15 same_as 1
Sometimes it is preferable to filter the SemMedDB data, e.g. to remove less informative Predicates, such as COEXISTS_WITH and ASSOCIATED_WITH.
# Filter out some predicates that are not informative
c("COEXISTS_WITH", "ASSOCIATED_WITH")
pred_filter <- lit_df %>%
lit_df_filter <- filter(
!`s1.predicate` %in% pred_filter,
!`s2.predicate` %in% pred_filter
)%>%
lit_df_filter count(`s1.predicate`) %>%
arrange(desc(n))
#> # A tibble: 6 x 2
#> s1.predicate n
#> <chr> <int>
#> 1 INTERACTS_WITH 608
#> 2 INHIBITS 68
#> 3 NEG_INTERACTS_WITH 48
#> 4 STIMULATES 28
#> 5 higher_than 9
#> 6 CONVERTS_TO 1
%>%
lit_df_filter count(`s2.predicate`) %>%
arrange(desc(n))
#> # A tibble: 13 x 2
#> s2.predicate n
#> <chr> <int>
#> 1 CAUSES 154
#> 2 STIMULATES 132
#> 3 PREVENTS 123
#> 4 PREDISPOSES 107
#> 5 AFFECTS 93
#> 6 INTERACTS_WITH 50
#> 7 DISRUPTS 30
#> 8 TREATS 27
#> 9 INHIBITS 24
#> 10 NEG_PREVENTS 16
#> 11 higher_than 3
#> 12 AUGMENTS 2
#> 13 same_as 1
If we explore the full table in lit_df_filter
, we can see lots of links between the two traits, pinned on specific overlapping terms. For example:
Aspirin:INHIBITS:Oral anticoagulants
from Sleep duration (s1) and anticoagulants:INHIBITS:P2RY12
from CHD (s2).
We can summarise the SemMedDB semantic type and number of overlapping terms:
lit_df_filter %>%
lit_counts <- count(`st.type`, `st.name`) %>%
arrange(`st.type`, desc(`n`))
%>% print(n = 30)
lit_counts #> # A tibble: 21 x 3
#> st.type st.name n
#> <chr> <chr> <int>
#> 1 aapp Leptin 6
#> 2 aapp Somatotropin 5
#> 3 aapp Adrenergic Receptor 4
#> 4 aapp Glutathione 2
#> 5 aapp apolipoprotein E-4 2
#> 6 aapp Monoamine Oxidase 1
#> 7 gngm Cytochrome P450 10
#> 8 gngm Glycoproteins 2
#> 9 horm Melatonin 28
#> 10 horm Hormones 3
#> 11 orch Ethanol 672
#> 12 orch Oral anticoagulants 10
#> 13 orch Metoprolol 8
#> 14 orch Propofol 2
#> 15 orch Acetaldehyde 1
#> 16 orch Benzodiazepines 1
#> 17 orch Luteolin 1
#> 18 orch Morphine 1
#> 19 orch Neostigmine 1
#> 20 orch Thiopental 1
#> 21 orch gamma hydroxybutyrate 1
Note, the SemMedDB semantic types have been pre-filtered to only include a subset of possibilities.
Further examples of these term IDs and descriptions can be found here - https://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml
We can also visualise the above table as a bar chart. In this case we will remove Ethanol as it is an outlier.
%>%
lit_counts filter(n < 100) %>%
{
ggplot(.) +
aes(x = `st.name`, y = `n`, fill = `st.type`) +
geom_col() +
geom_text(
aes(label = `n`),
position = position_dodge(0.9),
hjust = 0
+
) coord_flip()
}
Here we look at cases where Leptin
is the central overlapping term.
"Leptin"
focus_term <- lit_df_filter %>% filter(`st.name` == focus_term)
lit_detail <-%>% head()
lit_detail #> # A tibble: 6 x 18
#> gwas.trait gwas.id gs1.localCount gs1.pval s1.subject_name s1.predicate s1.id
#> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
#> 1 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> 2 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> 3 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> 4 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> 5 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> 6 Sleep dur… ieu-a-… 5 1.77e-12 ghrelin INHIBITS ghre…
#> # … with 11 more variables: s1.object_name <chr>, st.name <chr>, st.type <chr>,
#> # s2.subject_name <chr>, s2.predicate <chr>, s2.id <chr>,
#> # s2.object_name <chr>, gs2.localCount <int>, gs2.pval <dbl>,
#> # assoc_gwas.trait <chr>, assoc_gwas.id <chr>
We can create a network diagram to visualise these relationships.
lit_detail %>%
lit_detail <- mutate_at(vars(`gwas.trait`, `assoc_gwas.trait`), stringr::str_to_upper)
# add node types: 1 - selected GWAS, 2 - traits from literature, 3 - current focus term connecting 1 and 2
bind_rows(
nodes <-%>% select(node = `gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
lit_detail %>% select(node = `assoc_gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
lit_detail %>% select(node = `s1.subject_name`) %>% distinct() %>% mutate(node_type = 2),
lit_detail %>% select(node = `s2.subject_name`) %>% distinct() %>% mutate(node_type = 2),
lit_detail %>% select(node = `s1.object_name`) %>% distinct() %>% mutate(node_type = 2),
lit_detail %>% select(node = `s2.object_name`) %>% distinct() %>% mutate(node_type = 2)
lit_detail %>%
) mutate(node_type = ifelse(node == focus_term, 3, node_type)) %>%
distinct()
nodes#> # A tibble: 10 x 2
#> node node_type
#> <chr> <dbl>
#> 1 SLEEP DURATION 1
#> 2 CORONARY HEART DISEASE 1
#> 3 ghrelin 2
#> 4 Leptin 3
#> 5 Coronary Arteriosclerosis 2
#> 6 Coronary heart disease 2
#> 7 Proteome 2
#> 8 Sleep Apnea, Obstructive 2
#> 9 Adiponectin 2
#> 10 Insulin 2
bind_rows(
edges <-# exposure -> s1 subject
%>%
lit_detail select(node = `gwas.trait`, assoc_node = `s1.subject_name`) %>%
distinct(),
# s2 object -> outcome
%>%
lit_detail select(node = `s2.object_name`, assoc_node = `assoc_gwas.trait`) %>%
distinct(),
# s1 subject - s1 predicate -> s1 object
%>%
lit_detail select(
node = `s1.subject_name`, assoc_node = `s1.object_name`,
label = `s1.predicate`
%>%
) distinct(),
# s2 subject - s2 predicate -> s2 object
%>%
lit_detail select(
node = `s2.subject_name`, assoc_node = `s2.object_name`,
label = `s2.predicate`
%>%
) distinct()
%>%
) distinct()
edges#> # A tibble: 14 x 3
#> node assoc_node label
#> <chr> <chr> <chr>
#> 1 SLEEP DURATION ghrelin <NA>
#> 2 Coronary Arteriosclerosis CORONARY HEART DISEASE <NA>
#> 3 Coronary heart disease CORONARY HEART DISEASE <NA>
#> 4 Proteome CORONARY HEART DISEASE <NA>
#> 5 Sleep Apnea, Obstructive CORONARY HEART DISEASE <NA>
#> 6 Adiponectin CORONARY HEART DISEASE <NA>
#> 7 Insulin CORONARY HEART DISEASE <NA>
#> 8 ghrelin Leptin INHIBITS
#> 9 Leptin Coronary Arteriosclerosis TREATS
#> 10 Leptin Coronary heart disease PREDISPOSES
#> 11 Leptin Proteome INTERACTS_WITH
#> 12 Leptin Sleep Apnea, Obstructive TREATS
#> 13 Leptin Adiponectin STIMULATES
#> 14 Leptin Insulin INTERACTS_WITH
function(edges, nodes, show_edge_labels = FALSE) {
plot_network <-
# default is to not display edge labels
if (!show_edge_labels) {
select(edges, -label)
edges <-
}
graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
graph <-$layout <- layout_with_kk
graph
# generate colors based on node type
c("tomato", "steelblue", "gold")
colrs <-V(graph)$color <- colrs[V(graph)$node_type]
# Configure canvas
par("mar")
default_mar <- c(0, 0, 0, 0)
new_mar <-par(mar = new_mar)
plot.igraph(graph,
vertex.size = 13,
vertex.label.color = "black",
vertex.label.family = "Helvetica",
vertex.label.cex = 0.8,
edge.arrow.size = 0.4,
edge.label.color = "black",
edge.label.family = "Helvetica",
edge.label.cex = 0.5
)par(mar = default_mar)
}
plot_network(edges, nodes, show_edge_labels = TRUE)
We can refer back to the articles to check the text that was used to derive the SemMedDB data. This is important due to the imperfect nature of the SemRep annotation process (https://semrep.nlm.nih.gov/).
function(gwas_id, semmed_triple_id) {
get_literature <- "/literature/gwas"
endpoint <- list(
params <-gwas_id = gwas_id,
semmed_triple_id = semmed_triple_id,
by_gwas_id = TRUE,
pval_threshold = 1e-1
) query_epigraphdb(route = endpoint, params = params, mode = "table")
df <-%>% select(`triple.id`, `lit.pubmed_id`)
df
}
bind_rows(
pub_df <-%>%
lit_detail select(gwas_id = `gwas.id`, semmed_triple_id = `s1.id`) %>%
distinct(),
%>%
lit_detail select(gwas_id = `assoc_gwas.id`, semmed_triple_id = `s2.id`) %>%
distinct()
%>%
) transpose() %>%
map_df(function(x) get_literature(x$gwas_id, x$semmed_triple_id))
pub_df#> # A tibble: 15 x 2
#> triple.id lit.pubmed_id
#> <chr> <chr>
#> 1 ghrelin:INHIBITS:Leptin 21659802
#> 2 ghrelin:INHIBITS:Leptin 22473743
#> 3 ghrelin:INHIBITS:Leptin 19955752
#> 4 ghrelin:INHIBITS:Leptin 18719052
#> 5 ghrelin:INHIBITS:Leptin 30364557
#> 6 Leptin:TREATS:Coronary Arteriosclerosis 26769430
#> 7 Leptin:TREATS:Coronary Arteriosclerosis 20691218
#> 8 Leptin:PREDISPOSES:Coronary heart disease 18723235
#> 9 Leptin:PREDISPOSES:Coronary heart disease 22412070
#> 10 Leptin:INTERACTS_WITH:Proteome 26975316
#> 11 Leptin:TREATS:Sleep Apnea, Obstructive 10449691
#> 12 Leptin:STIMULATES:Adiponectin 17403719
#> 13 Leptin:STIMULATES:Adiponectin 21481397
#> 14 Leptin:INTERACTS_WITH:Insulin 12724058
#> 15 Leptin:INTERACTS_WITH:Insulin 15648007
sessionInfo
sessionInfo()
#> R version 4.0.2 (2020-06-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Manjaro Linux
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas.so.3.9.0
#> LAPACK: /usr/lib/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.3.2 epigraphdb_0.2.1 igraph_1.2.5 glue_1.4.1
#> [5] purrr_0.3.4 dplyr_1.0.2 magrittr_1.5
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.4.6 pillar_1.4.6 compiler_4.0.2 tools_4.0.2
#> [5] digest_0.6.25 jsonlite_1.7.0 evaluate_0.14 lifecycle_0.2.0
#> [9] tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7
#> [13] cli_2.0.2 curl_4.3 yaml_2.2.1 xfun_0.14
#> [17] withr_2.2.0 httr_1.4.2 stringr_1.4.0 knitr_1.28
#> [21] generics_0.0.2 vctrs_0.3.2 gtools_3.8.2 grid_4.0.2
#> [25] tidyselect_1.1.0 R6_2.4.1 fansi_0.4.1 rmarkdown_2.1
#> [29] farver_2.0.3 scales_1.1.1 ellipsis_0.3.1 htmltools_0.4.0
#> [33] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3 utf8_1.1.4
#> [37] stringi_1.4.6 munsell_0.5.0 crayon_1.3.4