Using the Gene Ontology data objects

Daniel Greene

2016-09-09

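ontologySimilarity comes with two data objects encapsulating the GO (Gene Ontology) annotation of genes [1]: gene_GO_terms, a list of character vectors of the IDs of the GO terms annotating each gene, named by gene, and GO_IC, a numeric vector holding the information content of GO terms.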

These data objects can be loaded in an R session using data(gene_GO_terms) and data(GO_IC) respectively. To process these objects, one can load the ontologyIndex package and a data object encapsulating the Gene Ontology.

library(ontologyIndex)
data(go)

library(ontologySimilarity)
data(gene_GO_terms)
data(GO_IC)

Users can simply subset the gene_GO_terms object to obtain GO annotation for their genes of interest, using a character vector of gene names. In this example, we’ll use the BEACH domain containing gene family [2].

beach <- gene_GO_terms[c("LRBA", "LYST", "NBEA", "NBEAL1", "NBEAL2", "NSMAF", "WDFY3", "WDFY4", "WDR81")]
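Subsetting a list by a name it does not contain returns a NULL entry rather than an error, so it can be worth checking gene names before subsetting. The following is a minimal sketch using a small made-up list, toy_terms, as a stand-in for gene_GO_terms; the term sets and the name "NOT_A_GENE" are purely illustrative.

```r
# Toy annotation list standing in for gene_GO_terms (hypothetical term sets).
toy_terms <- list(
  LRBA = c("GO:0005764", "GO:0005783"),
  LYST = c("GO:0005829")
)
genes <- c("LRBA", "LYST", "NOT_A_GENE")

# Gene names with no entry in the annotation list.
missing_genes <- setdiff(genes, names(toy_terms))

# Keep only genes that are present, preserving the requested order.
found <- toy_terms[intersect(genes, names(toy_terms))]

print(missing_genes)
print(names(found))
```

The same two lines should work on the real object by replacing toy_terms with gene_GO_terms.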

To see the names of the terms annotating a particular gene, subset the name slot of the go ontology_index object by that gene's term IDs. For example, for "LRBA":

go$name[beach$LRBA]
##                       GO:0003674                       GO:0005764 
##             "molecular_function"                       "lysosome" 
##                       GO:0005783                       GO:0005794 
##          "endoplasmic reticulum"                "Golgi apparatus" 
##                       GO:0005886                       GO:0008150 
##                "plasma membrane"             "biological_process" 
##                       GO:0016020                       GO:0016021 
##                       "membrane" "integral component of membrane"
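The lookup can also be run in the other direction, from a term name to the genes annotated with it. The sketch below uses small made-up objects, toy_names and toy_beach, in place of go$name and beach, so the annotations shown are illustrative rather than real.

```r
# Toy stand-ins for go$name (IDs -> names) and beach (gene -> term IDs).
toy_names <- c("GO:0005764" = "lysosome", "GO:0005794" = "Golgi apparatus")
toy_beach <- list(
  LRBA = c("GO:0005764", "GO:0005794"),
  NBEA = c("GO:0005794")
)

# Find the ID of the term called "lysosome" ...
term_id <- names(toy_names)[toy_names == "lysosome"]

# ... then keep the genes whose term set contains that ID.
has_term <- sapply(toy_beach, function(x) term_id %in% x)
genes_with_term <- names(toy_beach)[has_term]

print(genes_with_term)
```

Note that this only finds genes annotated with exactly that term ID; whether ancestor or descendant terms should also count depends on how the annotation was prepared.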

The gene_GO_terms object contains annotation relating to all branches of the Gene Ontology, i.e. "cellular_component", "biological_process" and "molecular_function". If you are only interested in one branch, for example "cellular_component", you can use the ontologyIndex package’s function intersection_with_branches to subset the annotation.

cc <- go$id[go$name == "cellular_component"]
beach_cc <- lapply(beach, function(x) intersection_with_branches(go, branch_roots=cc, x)) 
data.frame(check.names=FALSE, `#terms`=sapply(beach, length), `#CC terms`=sapply(beach_cc, length))
##        #terms #CC terms
## LRBA        8         6
## LYST       32         2
## NBEA        6         4
## NBEAL1      3         2
## NBEAL2      4         2
## NSMAF      18         6
## WDFY3      29        12
## WDFY4       4         3
## WDR81      18         7

A pairwise gene semantic similarity matrix can be computed with the function get_sim_grid by passing an ontology_index object, information content and annotation list as parameters (see ?get_sim_grid for more details). Here we plot the resulting similarity matrix using the heatmap function.

sim_matrix <- get_sim_grid(
    ontology=go, 
    information_content=GO_IC,
    term_sets=beach)

heatmap(sim_matrix)
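Besides plotting, it can be useful to list the gene pairs ranked by similarity. The following is a minimal sketch on a small made-up matrix, toy_sim, assuming a symmetric matrix with gene names as row and column names; the values are illustrative and are not output of get_sim_grid.

```r
# Toy symmetric similarity matrix for three hypothetical genes.
ids <- c("A", "B", "C")
toy_sim <- matrix(
  c(1.0, 0.8, 0.2,
    0.8, 1.0, 0.4,
    0.2, 0.4, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(ids, ids)
)

# Row/column positions of each distinct pair (upper triangle only).
idx <- which(upper.tri(toy_sim), arr.ind = TRUE)

# One row per pair, sorted from most to least similar.
pairs <- data.frame(
  gene1 = rownames(toy_sim)[idx[, "row"]],
  gene2 = colnames(toy_sim)[idx[, "col"]],
  similarity = toy_sim[idx],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(-pairs$similarity), ]

print(pairs)
```

Using only the upper triangle avoids listing each pair twice and skips each gene's similarity to itself.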

One can test whether a subset of genes is significantly similar as a group in the context of a larger collection by using the function get_sim_p_from_ontology to compute a p-value of similarity. For example, here we assess the significance of the mean pairwise gene similarity within the BEACH group against randomly selected subsets of genes of the same size chosen from the gene_GO_terms set.

get_sim_p_from_ontology(
    ontology=go,
    information_content=GO_IC,
    term_sets=gene_GO_terms,
    group=names(beach)
)
## [1] 0.1478521
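The idea behind such a p-value can be illustrated with a simple permutation test in base R. This is a sketch of the general technique on made-up data, not the package's implementation: the matrix toy_sim, the group toy_group and the helper functions mean_pairwise and perm_p are all hypothetical names introduced here, and the number of permutations is an arbitrary choice.

```r
# Toy symmetric similarity matrix for 20 hypothetical genes (not real GO data).
set.seed(1)
n_genes <- 20
gene_ids <- paste0("g", seq_len(n_genes))
toy_sim <- matrix(runif(n_genes^2, 0, 0.3), n_genes, n_genes,
                  dimnames = list(gene_ids, gene_ids))
toy_sim <- (toy_sim + t(toy_sim)) / 2      # make it symmetric

# Hypothetical group of interest, made artificially similar to one another.
toy_group <- c("g1", "g2", "g3", "g4")
toy_sim[toy_group, toy_group] <- 0.9

# Mean similarity over all distinct pairs within a set of genes.
mean_pairwise <- function(m, members) {
  sub <- m[members, members]
  mean(sub[upper.tri(sub)])
}

# Permutation p-value: how often does a random subset of the same size
# score at least as high as the observed group?
perm_p <- function(m, members, n_perm = 999) {
  observed <- mean_pairwise(m, members)
  null <- replicate(
    n_perm,
    mean_pairwise(m, sample(rownames(m), length(members)))
  )
  (1 + sum(null >= observed)) / (1 + n_perm)
}

p <- perm_p(toy_sim, toy_group)
print(p)
```

Because the toy group was constructed to be highly similar, the resulting p-value is small. By comparison, the value of about 0.15 reported above for the BEACH group would not be considered significant at conventional thresholds such as 0.05.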

References

  1. Gene Ontology Consortium website, http://geneontology.org/, dated 7/7/2016.
  2. HUGO Gene Nomenclature Committee website, http://www.genenames.org/.