Functional Annotation with BioMart, GO, and KEGG

2016-08-07

Functional Annotation with BioMart

The BioMart project enables users to retrieve a vast diversity of annotation data for specific organisms. Steffen Durinck and Wolfgang Huber provide a powerful interface between the R language and BioMart through the R package biomaRt. The following sections introduce the functionality and data retrieval procedures of the biomaRt package and then introduce the interface functions biomart() and biomart_organisms() implemented in biomartr, which are based on the biomaRt methodology but aim to offer a more intuitive way of interacting with BioMart.

Getting Started with biomaRt

The best way to get started with the methodology presented by biomaRt is to understand the workflow of data retrieval. The database provided by BioMart is organized into so-called marts, datasets, and attributes. When users want to retrieve information for a specific organism of interest, they first need to specify the mart and dataset in which the information for that organism can be found; they can then specify the attributes that should be returned for that organism.

The availability of marts, datasets, and attributes can be checked by the following functions:

# install the biomaRt package from Bioconductor
# (biocLite() has been retired; current Bioconductor releases use BiocManager)
install.packages("BiocManager")
BiocManager::install("biomaRt")

# load biomaRt
library(biomaRt)

# look at top 10 databases
head(listMarts(host = "www.ensembl.org"), 10)
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 83
2     ENSEMBL_MART_SNP  Ensembl Variation 83
3 ENSEMBL_MART_FUNCGEN Ensembl Regulation 83
4    ENSEMBL_MART_VEGA               Vega 63
5                pride        PRIDE (EBI UK)

Users will observe that several marts providing annotation for specific classes of organisms or groups of organisms are available.

For our example, we will choose the ENSEMBL_MART_ENSEMBL mart and list the available datasets that belong to this mart.

head(listDatasets(useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10)
                          dataset                                description         version
1          oanatinus_gene_ensembl     Ornithorhynchus anatinus genes (OANA5)           OANA5
2         cporcellus_gene_ensembl            Cavia porcellus genes (cavPor3)         cavPor3
3         gaculeatus_gene_ensembl     Gasterosteus aculeatus genes (BROADS1)         BROADS1
4          lafricana_gene_ensembl         Loxodonta africana genes (loxAfr3)         loxAfr3
5  itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2)         spetri2
6         choffmanni_gene_ensembl        Choloepus hoffmanni genes (choHof1)         choHof1
7          csavignyi_gene_ensembl             Ciona savignyi genes (CSAV2.0)         CSAV2.0
8             fcatus_gene_ensembl        Felis catus genes (Felis_catus_6.2) Felis_catus_6.2
9        rnorvegicus_gene_ensembl         Rattus norvegicus genes (Rnor_6.0)        Rnor_6.0
10         psinensis_gene_ensembl     Pelodiscus sinensis genes (PelSin_1.0)      PelSin_1.0

The useMart() function is provided by biomaRt to connect to a selected BioMart database (mart) and, optionally, to a dataset stored within this mart.

We select the dataset hsapiens_gene_ensembl and now check for available attributes (annotation data) that can be accessed for Homo sapiens genes.

head(listAttributes(useDataset(dataset = "hsapiens_gene_ensembl", 
                               mart    = useMart("ENSEMBL_MART_ENSEMBL",host = "www.ensembl.org"))), 10)
                    name           description
1        ensembl_gene_id       Ensembl Gene ID
2  ensembl_transcript_id Ensembl Transcript ID
3     ensembl_peptide_id    Ensembl Protein ID
4        ensembl_exon_id       Ensembl Exon ID
5            description           Description
6        chromosome_name       Chromosome Name
7         start_position       Gene Start (bp)
8           end_position         Gene End (bp)
9                 strand                Strand
10                  band                  Band

Please note the nested structure of this attribute query. An attribute query requires an additional wrapper function, useDataset(), in which useMart() and the corresponding dataset need to be specified. The result is a table storing the names of the available attributes for Homo sapiens as well as a short description of each.

Furthermore, users can retrieve all filters for Homo sapiens that can be specified in the actual BioMart query.

head(listFilters(useDataset(dataset = "hsapiens_gene_ensembl", 
                            mart    = useMart("ENSEMBL_MART_ENSEMBL",
                                              host = "www.ensembl.org"))), 10)
                 name                                               description
1     chromosome_name                                           Chromosome name
2               start                                           Gene Start (bp)
3                 end                                             Gene End (bp)
4          band_start                                                Band Start
5            band_end                                                  Band End
6        marker_start                                              Marker Start
7          marker_end                                                Marker End
8       encode_region                                             Encode region
9              strand                                                    Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)

After accumulating all this information, it is now possible to perform an actual BioMart query by using the getBM() function.

In this example we will retrieve the attributes start_position, end_position, and description for the Homo sapiens gene "GUCA2A".

Since the input gene is given as an HGNC symbol, we need to specify the filters argument filters = "hgnc_symbol".

# 1) select a mart and data set
mart <- useDataset("hsapiens_gene_ensembl", 
                   mart = useMart("ENSEMBL_MART_ENSEMBL",
                                  host = "www.ensembl.org"))

# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- "GUCA2A"

resultTable <- getBM(attributes = c("start_position","end_position","description"),
                     filters = "hgnc_symbol", values = geneSet, mart = mart)

resultTable 
  start_position end_position
1       42162691     42164718
                                                                   description
1 guanylate cyclase activator 2A (guanylin) [Source:HGNC Symbol;Acc:HGNC:4682]

When using getBM() users can pass all attributes retrieved by listAttributes() to the attributes argument of the getBM() function.
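
Because listAttributes() returns an ordinary data.frame, candidate attribute names can be searched with base R before they are passed to getBM(). The following is a minimal offline sketch: the small table attr_table is copied by hand from rows of the listAttributes() output shown earlier, and the helper find_attributes() is a hypothetical name that is not part of biomaRt.

```r
# Illustrative copy of some rows returned by listAttributes()
attr_table <- data.frame(
  name = c("ensembl_gene_id", "description", "chromosome_name",
           "start_position", "end_position", "strand"),
  description = c("Ensembl Gene ID", "Description", "Chromosome Name",
                  "Gene Start (bp)", "Gene End (bp)", "Strand"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose name or description matches a pattern
find_attributes <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
    grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

# Attribute names that could be passed on to getBM(attributes = ...)
selected <- find_attributes(attr_table, "position")$name
selected
```

With a live connection, the same helper could be applied to the full table returned by listAttributes(mart).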

Getting Started with biomartr

The query methodology provided by BioMart and the biomaRt package is a well-defined approach for accurate annotation retrieval. Nevertheless, it can (subjectively) seem non-intuitive when learning it from the user perspective. Therefore, the biomartr package provides another query methodology that aims to be more organism centric.

Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart() function implemented in the biomartr package:

  1. get attributes, datasets, and marts via: organismAttributes()

  2. choose available filters via: organismFilters()

  3. specify a set of query genes

  4. specify all arguments of the biomart() function using steps 1) - 3) and perform a BioMart query
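
The four steps above can be sketched as follows. This is a hedged illustration, not a definitive call: the argument names of biomart() (genes, mart, dataset, attributes, filters) are assumed and should be checked against ?biomartr::biomart, the wrapper run_biomart_query() is a hypothetical name, and the actual query needs the biomartr package and network access, so it is left commented out.

```r
# Steps 1) to 3): mart, dataset, attributes, filter, and query gene,
# using the same values as the getBM() example earlier in this document
query_args <- list(
  genes      = c("GUCA2A"),
  mart       = "ENSEMBL_MART_ENSEMBL",
  dataset    = "hsapiens_gene_ensembl",
  attributes = c("start_position", "end_position", "description"),
  filters    = "hgnc_symbol"
)

# Step 4): hypothetical wrapper around biomart(); argument names are assumed
run_biomart_query <- function(args) {
  do.call(biomartr::biomart, args)
}

# result <- run_biomart_query(query_args)   # requires network access
```

Keeping the arguments in a named list makes it easy to rerun the same query after a dataset name has changed.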

Note that dataset names change frequently as dataset versions are updated. If a query function does not work properly, users should check with organismAttributes(update = TRUE) whether their dataset name has changed. For example, organismAttributes("Homo sapiens", topic = "id", update = TRUE) might reveal that the mart ENSEMBL_MART_ENSEMBL or the names of its datasets have changed.

Retrieve marts, datasets, attributes, and filters with biomartr

Retrieve Available Marts

The getMarts() function allows users to list all available databases that can be accessed through BioMart interfaces.

# load the biomartr package
library(biomartr)

# list all available databases
getMarts()
                   mart               version
1  ENSEMBL_MART_ENSEMBL      Ensembl Genes 83
2 ENSEMBL_MART_SEQUENCE              Sequence
3 ENSEMBL_MART_ONTOLOGY              Ontology
4  ENSEMBL_MART_GENOMIC   Genomic features 83
5      ENSEMBL_MART_SNP  Ensembl Variation 83
6  ENSEMBL_MART_FUNCGEN Ensembl Regulation 83
7     ENSEMBL_MART_VEGA               Vega 63
8                 pride        PRIDE (EBI UK)

Retrieve Available Datasets from a Specific Mart

Now users can select a specific database to list all available datasets that can be accessed through this database. In this example we choose the ENSEMBL_MART_ENSEMBL database.

head(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5)
                         dataset                                description version
1         oanatinus_gene_ensembl     Ornithorhynchus anatinus genes (OANA5)   OANA5
2        cporcellus_gene_ensembl            Cavia porcellus genes (cavPor3) cavPor3
3        gaculeatus_gene_ensembl     Gasterosteus aculeatus genes (BROADS1) BROADS1
4         lafricana_gene_ensembl         Loxodonta africana genes (loxAfr3) loxAfr3
5 itridecemlineatus_gene_ensembl Ictidomys tridecemlineatus genes (spetri2) spetri2

The dataset hsapiens_gene_ensembl is not among the first rows; it appears further down the list, here as row 32 in the tail of the table.

tail(getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38)
                       dataset                                 description      version
32       hsapiens_gene_ensembl              Homo sapiens genes (GRCh38.p5)    GRCh38.p5
33       pformosa_gene_ensembl       Poecilia formosa genes (PoeFor_5.1.2) PoeFor_5.1.2
34          mfuro_gene_ensembl  Mustela putorius furo genes (MusPutFur1.0) MusPutFur1.0
35     tbelangeri_gene_ensembl            Tupaia belangeri genes (tupBel1)      tupBel1
36        ggallus_gene_ensembl               Gallus gallus genes (Galgal4)      Galgal4
37    xtropicalis_gene_ensembl           Xenopus tropicalis genes (JGI4.2)       JGI4.2
38      ecaballus_gene_ensembl              Equus caballus genes (EquCab2)      EquCab2
39        pabelii_gene_ensembl                  Pongo abelii genes (PPYG2)        PPYG2
40     xmaculatus_gene_ensembl   Xiphophorus maculatus genes (Xipmac4.4.2)  Xipmac4.4.2
41         drerio_gene_ensembl                  Danio rerio genes (GRCz10)       GRCz10
42     lchalumnae_gene_ensembl         Latimeria chalumnae genes (LatCha1)      LatCha1
43  tnigroviridis_gene_ensembl Tetraodon nigroviridis genes (TETRAODON8.0) TETRAODON8.0
44   amelanoleuca_gene_ensembl      Ailuropoda melanoleuca genes (ailMel1)      ailMel1
45       mmulatta_gene_ensembl               Macaca mulatta genes (MMUL_1)       MMUL_1
46      pvampyrus_gene_ensembl           Pteropus vampyrus genes (pteVam1)      pteVam1
47        panubis_gene_ensembl              Papio anubis genes (PapAnu2.0)    PapAnu2.0
48     mdomestica_gene_ensembl       Monodelphis domestica genes (monDom5)      monDom5
49  acarolinensis_gene_ensembl       Anolis carolinensis genes (AnoCar2.0)    AnoCar2.0
50         vpacos_gene_ensembl               Vicugna pacos genes (vicPac1)      vicPac1
51      tsyrichta_gene_ensembl            Tarsius syrichta genes (tarSyr1)      tarSyr1
52     ogarnettii_gene_ensembl          Otolemur garnettii genes (OtoGar3)      OtoGar3
53  dmelanogaster_gene_ensembl       Drosophila melanogaster genes (BDGP6)        BDGP6
54       mmurinus_gene_ensembl          Microcebus murinus genes (micMur1)      micMur1
55      loculatus_gene_ensembl        Lepisosteus oculatus genes (LepOcu1)      LepOcu1
56       olatipes_gene_ensembl                Oryzias latipes genes (HdrR)         HdrR
57       ggorilla_gene_ensembl           Gorilla gorilla genes (gorGor3.1)    gorGor3.1
58      oprinceps_gene_ensembl         Ochotona princeps genes (OchPri2.0)    OchPri2.0
59         dordii_gene_ensembl             Dipodomys ordii genes (dipOrd1)      dipOrd1
60         oaries_gene_ensembl                 Ovis aries genes (Oar_v3.1)     Oar_v3.1
61      mmusculus_gene_ensembl              Mus musculus genes (GRCm38.p4)    GRCm38.p4
62     mgallopavo_gene_ensembl            Meleagris gallopavo genes (UMD2)         UMD2
63        gmorhua_gene_ensembl                Gadus morhua genes (gadMor1)      gadMor1
64 aplatyrhynchos_gene_ensembl     Anas platyrhynchos genes (BGI_duck_1.0) BGI_duck_1.0
65       saraneus_gene_ensembl               Sorex araneus genes (sorAra1)      sorAra1
66      sharrisii_gene_ensembl       Sarcophilus harrisii genes (DEVIL7.0)     DEVIL7.0
67       meugenii_gene_ensembl           Macropus eugenii genes (Meug_1.0)     Meug_1.0
68        btaurus_gene_ensembl                   Bos taurus genes (UMD3.1)       UMD3.1
69    cfamiliaris_gene_ensembl          Canis familiaris genes (CanFam3.1)    CanFam3.1

Retrieve Available Attributes from a Specific Dataset

Now that a database (ENSEMBL_MART_ENSEMBL) and a dataset (hsapiens_gene_ensembl) have been selected, users can list all available attributes for this dataset using the getAttributes() function.

# list all available attributes for dataset: hsapiens_gene_ensembl
head( getAttributes(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl"), 10 )
                    name           description
1        ensembl_gene_id       Ensembl Gene ID
2  ensembl_transcript_id Ensembl Transcript ID
3     ensembl_peptide_id    Ensembl Protein ID
4        ensembl_exon_id       Ensembl Exon ID
5            description           Description
6        chromosome_name       Chromosome Name
7         start_position       Gene Start (bp)
8           end_position         Gene End (bp)
9                 strand                Strand
10                  band                  Band

Retrieve Available Filters from a Specific Dataset

Finally, the getFilters() function allows users to list available filters for a specific dataset that can be used for a biomart() query.

# list all available filters for dataset: hsapiens_gene_ensembl
head( getFilters(mart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl"), 10 )
                 name                                               description
1     chromosome_name                                           Chromosome name
2               start                                           Gene Start (bp)
3                 end                                             Gene End (bp)
4          band_start                                                Band Start
5            band_end                                                  Band End
6        marker_start                                              Marker Start
7          marker_end                                                Marker End
8       encode_region                                             Encode region
9              strand                                                    Strand
10 chromosomal_region Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)

Organism Specific Retrieval of Information

In most use cases, users will work with a single model organism or a set of model organisms and will mostly be interested in annotations for these particular organisms. The organismBM() function addresses this need and provides users with an organism centric query to the marts and datasets that are available for a particular organism of interest.

Note that when running the following functions for the first time, the data retrieval procedure will take some time due to the remote access to BioMart. The corresponding result is then saved in a *.txt file named _biomart/listDatasets.txt within the tempdir() folder, allowing subsequent queries to run much faster. The tempdir() folder, however, is deleted when the R session ends, so in a new session the initial call of the following functions will again take time to retrieve all organism specific data from the BioMart API.
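
The caching behaviour described above follows a common pattern: look for a file in tempdir() first, and only fetch and write the data when the file is missing. The sketch below illustrates that pattern in base R; it is not the biomartr implementation. The names cached_table() and slow_fetch() and the file name demo_datasets.txt are hypothetical, and slow_fetch() merely stands in for a remote BioMart call.

```r
# Hypothetical helper: return a cached table if present, otherwise fetch once
cached_table <- function(file_name, fetch_fn,
                         cache_dir = file.path(tempdir(), "_biomart")) {
  cache_file <- file.path(cache_dir, file_name)
  if (!file.exists(cache_file)) {
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    write.table(fetch_fn(), cache_file, sep = "\t",
                quote = FALSE, row.names = FALSE)
  }
  read.table(cache_file, sep = "\t", header = TRUE, stringsAsFactors = FALSE)
}

# Stand-in for a slow remote call; counts how often it is actually executed
n_fetches <- 0
slow_fetch <- function() {
  n_fetches <<- n_fetches + 1
  data.frame(dataset = "hsapiens_gene_ensembl", version = "GRCh38.p5")
}

first  <- cached_table("demo_datasets.txt", slow_fetch)  # fetches, writes file
second <- cached_table("demo_datasets.txt", slow_fetch)  # reads the cache only
n_fetches
```

Because the cache lives in tempdir(), a fresh R session starts without the file and the first call fetches again.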

# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
organismBM(organism = "Homo sapiens")
   organism_name
1   Homo sapiens
2   Homo sapiens
3   Homo sapiens
4   Homo sapiens
5   Homo sapiens
6   Homo sapiens
7   Homo sapiens
8   Homo sapiens
9   Homo sapiens
10  Homo sapiens
11  Homo sapiens
12  Homo sapiens
                                                                                    description
1                                                                Homo sapiens genes (GRCh38.p5)
2          Homo sapiens Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p5)
3                                                  Homo sapiens Structural Variants (GRCh38.p5)
4                                          Homo sapiens Somatic Structural Variants (GRCh38.p5)
5  Homo sapiens Somatic Short Variants (SNPs and indels excluding flagged variants) (GRCh38.p5)
6                                                  Homo sapiens Regulatory Evidence (GRCh38.p5)
7                                                       Homo sapiens Binding Motifs (GRCh38.p5)
8                                                  Homo sapiens Regulatory Features (GRCh38.p5)
9                                                 Homo sapiens miRNA Target Regions (GRCh38.p5)
10                                                 Homo sapiens Regulatory Segments (GRCh38.p5)
11                                            Homo sapiens Other Regulatory Regions (GRCh38.p5)
12                                                               Homo sapiens genes (GRCh38.p5)
                   mart                       dataset   version
1  ENSEMBL_MART_ENSEMBL         hsapiens_gene_ensembl GRCh38.p5
2      ENSEMBL_MART_SNP                  hsapiens_snp GRCh38.p5
3      ENSEMBL_MART_SNP            hsapiens_structvar GRCh38.p5
4      ENSEMBL_MART_SNP        hsapiens_structvar_som GRCh38.p5
5      ENSEMBL_MART_SNP              hsapiens_snp_som GRCh38.p5
6  ENSEMBL_MART_FUNCGEN    hsapiens_annotated_feature GRCh38.p5
7  ENSEMBL_MART_FUNCGEN        hsapiens_motif_feature GRCh38.p5
8  ENSEMBL_MART_FUNCGEN   hsapiens_regulatory_feature GRCh38.p5
9  ENSEMBL_MART_FUNCGEN hsapiens_mirna_target_feature GRCh38.p5
10 ENSEMBL_MART_FUNCGEN hsapiens_segmentation_feature GRCh38.p5
11 ENSEMBL_MART_FUNCGEN     hsapiens_external_feature GRCh38.p5
12    ENSEMBL_MART_VEGA            hsapiens_gene_vega GRCh38.p5

The result is a table storing all marts and datasets from which annotations can be retrieved for Homo sapiens. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.
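
Since the result is a data.frame, the number of marts and datasets can be counted directly instead of being read off the printed table. A minimal offline sketch follows; the object bm is a hand-made copy of the mart and dataset columns printed above and stands in for the value returned by organismBM(organism = "Homo sapiens").

```r
# Illustrative copy of the mart and dataset columns shown above
bm <- data.frame(
  mart = c("ENSEMBL_MART_ENSEMBL",
           rep("ENSEMBL_MART_SNP", 4),
           rep("ENSEMBL_MART_FUNCGEN", 6),
           "ENSEMBL_MART_VEGA"),
  dataset = c("hsapiens_gene_ensembl",
              "hsapiens_snp", "hsapiens_structvar",
              "hsapiens_structvar_som", "hsapiens_snp_som",
              "hsapiens_annotated_feature", "hsapiens_motif_feature",
              "hsapiens_regulatory_feature", "hsapiens_mirna_target_feature",
              "hsapiens_segmentation_feature", "hsapiens_external_feature",
              "hsapiens_gene_vega"),
  stringsAsFactors = FALSE
)

# Number of datasets offered by each mart
datasets_per_mart <- table(bm$mart)
datasets_per_mart

# Total number of distinct marts and datasets
n_marts    <- length(unique(bm$mart))
n_datasets <- length(unique(bm$dataset))
```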

Users will observe that 4 different marts provide 12 different datasets storing annotation information for Homo sapiens.

Please note, however, that scientific names of organisms must be written correctly. For example, “Homo Sapiens” will not be recognized, whereas “Homo sapiens” will be.
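
One way to guard against such spelling differences is to normalize the name before passing it on. The following is a minimal sketch of that idea; normalize_organism_name() is a hypothetical helper, not a biomartr function, and it only fixes capitalization and whitespace, not misspellings.

```r
# Hypothetical helper: capitalised genus, lower-case remainder, single spaces
normalize_organism_name <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("\\s+", " ", x)
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

normalize_organism_name("Homo Sapiens")
normalize_organism_name("  homo   sapiens ")
```

The normalized string could then be used as the organism argument, for example organismBM(organism = normalize_organism_name("Homo Sapiens")).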

Similar to the biomaRt package query methodology, users need to specify attributes and filters to be able to perform accurate BioMart queries. Here the functions organismAttributes() and organismFilters() provide useful and intuitive concepts to obtain this information.

# return available attributes for "Homo sapiens"
head(organismAttributes("Homo sapiens"), 20)
                       name                                description               dataset
1           ensembl_gene_id                            Ensembl Gene ID hsapiens_gene_ensembl
2     ensembl_transcript_id                      Ensembl Transcript ID hsapiens_gene_ensembl
3        ensembl_peptide_id                         Ensembl Protein ID hsapiens_gene_ensembl
4           ensembl_exon_id                            Ensembl Exon ID hsapiens_gene_ensembl
5               description                                Description hsapiens_gene_ensembl
6           chromosome_name                            Chromosome Name hsapiens_gene_ensembl
7            start_position                            Gene Start (bp) hsapiens_gene_ensembl
8              end_position                              Gene End (bp) hsapiens_gene_ensembl
9                    strand                                     Strand hsapiens_gene_ensembl
10                     band                                       Band hsapiens_gene_ensembl
11         transcript_start                      Transcript Start (bp) hsapiens_gene_ensembl
12           transcript_end                        Transcript End (bp) hsapiens_gene_ensembl
13 transcription_start_site             Transcription Start Site (TSS) hsapiens_gene_ensembl
14        transcript_length Transcript length (including UTRs and CDS) hsapiens_gene_ensembl
15           transcript_tsl             Transcript Support Level (TSL) hsapiens_gene_ensembl
16 transcript_gencode_basic                   GENCODE basic annotation hsapiens_gene_ensembl
17        transcript_appris                          APPRIS annotation hsapiens_gene_ensembl
18       external_gene_name                       Associated Gene Name hsapiens_gene_ensembl
19     external_gene_source                     Associated Gene Source hsapiens_gene_ensembl
20 external_transcript_name                 Associated Transcript Name hsapiens_gene_ensembl
                   mart
1  ENSEMBL_MART_ENSEMBL
2  ENSEMBL_MART_ENSEMBL
3  ENSEMBL_MART_ENSEMBL
4  ENSEMBL_MART_ENSEMBL
5  ENSEMBL_MART_ENSEMBL
6  ENSEMBL_MART_ENSEMBL
7  ENSEMBL_MART_ENSEMBL
8  ENSEMBL_MART_ENSEMBL
9  ENSEMBL_MART_ENSEMBL
10 ENSEMBL_MART_ENSEMBL
11 ENSEMBL_MART_ENSEMBL
12 ENSEMBL_MART_ENSEMBL
13 ENSEMBL_MART_ENSEMBL
14 ENSEMBL_MART_ENSEMBL
15 ENSEMBL_MART_ENSEMBL
16 ENSEMBL_MART_ENSEMBL
17 ENSEMBL_MART_ENSEMBL
18 ENSEMBL_MART_ENSEMBL
19 ENSEMBL_MART_ENSEMBL
20 ENSEMBL_MART_ENSEMBL

Users will observe that the organismAttributes() function returns a data.frame storing the attribute names, descriptions, datasets, and marts that are available for Homo sapiens.

An additional feature provided by organismAttributes() is the topic argument. The topic argument allows users to search for specific attributes, topics, or categories for faster filtering.

# search for attribute topic "id"
head(organismAttributes("Homo sapiens", topic = "id"), 20)
                        name                                       description
1            ensembl_gene_id                                   Ensembl Gene ID
2      ensembl_transcript_id                             Ensembl Transcript ID
3         ensembl_peptide_id                                Ensembl Protein ID
4            ensembl_exon_id                                   Ensembl Exon ID
34         study_external_id                          Study External Reference
35                     go_id                                 GO Term Accession
49                 dbass3_id Database of Aberrant 3' Splice Sites (DBASS3) IDs
51                 dbass5_id Database of Aberrant 5' Splice Sites (DBASS5) IDs
64                   hgnc_id                                        HGNC ID(s)
68      mim_morbid_accession                              MIM Morbid Accession
69    mim_morbid_description                            MIM Morbid Description
73                mirbase_id                                     miRBase ID(s)
76                protein_id              Protein (Genbank) ID [e.g. AAA02487]
84            refseq_peptide             RefSeq Protein ID [e.g. NP_001005353]
85  refseq_peptide_predicted   RefSeq Predicted Protein ID [e.g. XP_001720922]
96               wikigene_id                                       WikiGene ID
182          ensembl_gene_id                                   Ensembl Gene ID
183    ensembl_transcript_id                             Ensembl Transcript ID
184       ensembl_peptide_id                                Ensembl Protein ID
213          ensembl_exon_id                                   Ensembl Exon ID
                  dataset                 mart
1   hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2   hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3   hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4   hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
34  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
35  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
49  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
51  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
64  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
68  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
69  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
73  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
76  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
84  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
85  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
96  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
182 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
183 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
184 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
213 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL

Now all attributes whose names contain id are returned.
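
This explains why entries such as mim_morbid_accession and refseq_peptide appear in the result: "morbid" and "peptide" both contain the letters id. The sketch below assumes that the topic argument performs plain substring matching on the name column, which is consistent with the output above but is not a statement about the package internals. The vector attr_names is copied by hand from attribute names shown in this document.

```r
attr_names <- c("ensembl_gene_id", "description", "chromosome_name",
                "go_id", "mim_morbid_accession", "refseq_peptide",
                "start_position")

# Assumed behaviour: keep every name that contains the topic string
topic_hits <- attr_names[grepl("id", attr_names, fixed = TRUE)]
topic_hits

# A stricter, hypothetical alternative: "id" must be its own "_"-separated token
token_hits <- attr_names[grepl("(^|_)id(_|$)", attr_names)]
token_hits
```

The stricter pattern can be applied to the name column of the returned data.frame when only true identifier attributes are wanted.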

Another example is topic = "homolog".

# search for attribute topic "homolog"
head(organismAttributes("Homo sapiens", topic = "homolog"), 20)
                                             name                            description
229                   vpacos_homolog_ensembl_gene                 Alpaca Ensembl Gene ID
230   vpacos_homolog_canonical_transcript_protein     Canonical Protein or Transcript ID
231                vpacos_homolog_ensembl_peptide              Alpaca Ensembl Protein ID
232                     vpacos_homolog_chromosome                 Alpaca Chromosome Name
233                    vpacos_homolog_chrom_start           Alpaca Chromosome Start (bp)
234                      vpacos_homolog_chrom_end             Alpaca Chromosome End (bp)
235                 vpacos_homolog_orthology_type                          Homology Type
236                        vpacos_homolog_subtype                               Ancestor
237           vpacos_homolog_orthology_confidence   Orthology confidence [0 low, 1 high]
238                        vpacos_homolog_perc_id  % Identity with respect to query gene
239                     vpacos_homolog_perc_id_r1 % Identity with respect to Alpaca gene
240                 pformosa_homolog_ensembl_gene           Amazon molly Ensembl Gene ID
241 pformosa_homolog_canonical_transcript_protein     Canonical Protein or Transcript ID
242              pformosa_homolog_ensembl_peptide        Amazon molly Ensembl Protein ID
243                   pformosa_homolog_chromosome           Amazon molly Chromosome Name
244                  pformosa_homolog_chrom_start     Amazon molly Chromosome Start (bp)
245                    pformosa_homolog_chrom_end       Amazon molly Chromosome End (bp)
246               pformosa_homolog_orthology_type                          Homology Type
247                      pformosa_homolog_subtype                               Ancestor
248         pformosa_homolog_orthology_confidence   Orthology confidence [0 low, 1 high]
                  dataset                 mart
229 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
230 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
231 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
232 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
233 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
234 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
235 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
236 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
237 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
238 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
239 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
240 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
241 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
242 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
243 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
244 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
245 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
246 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
247 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
248 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL

Likewise, topic = "dn" and topic = "ds" can be used to find the attributes for dN and dS value retrieval.

# search for attribute topic "dn"
head(organismAttributes("Homo sapiens", topic = "dn"))
                                                  name                        description
209                                  cdna_coding_start                  cDNA coding start
210                                    cdna_coding_end                    cDNA coding end
262                           acarolinensis_homolog_dn                                 dN
264                 dnovemcinctus_homolog_ensembl_gene          Armadillo Ensembl Gene ID
265 dnovemcinctus_homolog_canonical_transcript_protein Canonical Protein or Transcript ID
266              dnovemcinctus_homolog_ensembl_peptide       Armadillo Ensembl Protein ID
                  dataset                 mart
209 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
210 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
262 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
264 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
265 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
266 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL

# search for attribute topic "ds"
head(organismAttributes("Homo sapiens", topic = "ds"))
                        name description               dataset                 mart
48                      ccds     CCDS ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
199               cds_length  CDS Length hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
214                cds_start   CDS Start hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
215                  cds_end     CDS End hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
263 acarolinensis_homolog_ds          dS hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
276 dnovemcinctus_homolog_ds          dS hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
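
As the output shows, a search for "dn" or "ds" also returns unrelated attributes such as cdna_coding_start or cds_length. A minimal sketch of a post-filter follows, assuming the attributes of interest all end in _homolog_dn or _homolog_ds as in the rows above; the vector attr_names is copied by hand from those rows and stands in for the name column of the returned data.frame.

```r
attr_names <- c("cdna_coding_start", "cdna_coding_end",
                "acarolinensis_homolog_dn", "acarolinensis_homolog_ds",
                "dnovemcinctus_homolog_ensembl_gene",
                "dnovemcinctus_homolog_ds",
                "ccds", "cds_length")

# Keep only names that end in the dN or dS suffix
dn_attributes <- grep("_homolog_dn$", attr_names, value = TRUE)
ds_attributes <- grep("_homolog_ds$", attr_names, value = TRUE)

dn_attributes
ds_attributes
```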

Analogous to the organismAttributes() function, the organismFilters() function returns all filters that are available for a query organism of interest.

# return available filters for "Homo sapiens"
head(organismFilters("Homo sapiens"), 20)
                                     name
1                         chromosome_name
2                                   start
3                                     end
4                              band_start
5                                band_end
6                            marker_start
7                              marker_end
8                           encode_region
9                                  strand
10                     chromosomal_region
11                              with_hgnc
12              with_hgnc_transcript_name
13                   with_ox_arrayexpress
14                              with_ccds
15                            with_chembl
16       with_ox_clone_based_ensembl_gene
17 with_ox_clone_based_ensembl_transcript
18          with_ox_clone_based_vega_gene
19    with_ox_clone_based_vega_transcript
20                            with_dbass3
                                                 description               dataset
1                                            Chromosome name hsapiens_gene_ensembl
2                                            Gene Start (bp) hsapiens_gene_ensembl
3                                              Gene End (bp) hsapiens_gene_ensembl
4                                                 Band Start hsapiens_gene_ensembl
5                                                   Band End hsapiens_gene_ensembl
6                                               Marker Start hsapiens_gene_ensembl
7                                                 Marker End hsapiens_gene_ensembl
8                                              Encode region hsapiens_gene_ensembl
9                                                     Strand hsapiens_gene_ensembl
10 Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1) hsapiens_gene_ensembl
11                                           with HGNC ID(s) hsapiens_gene_ensembl
12                              with HGNC transcript name(s) hsapiens_gene_ensembl
13                                   with ArrayExpress ID(s) hsapiens_gene_ensembl
14                                           with CCDS ID(s) hsapiens_gene_ensembl
15                                         with ChEMBL ID(s) hsapiens_gene_ensembl
16                       with clone based Ensembl gene ID(s) hsapiens_gene_ensembl
17                 with clone based Ensembl transcript ID(s) hsapiens_gene_ensembl
18                          with clone based VEGA gene ID(s) hsapiens_gene_ensembl
19                    with clone based VEGA transcript ID(s) hsapiens_gene_ensembl
20                                         with DBASS3 ID(s) hsapiens_gene_ensembl
                   mart
1  ENSEMBL_MART_ENSEMBL
2  ENSEMBL_MART_ENSEMBL
3  ENSEMBL_MART_ENSEMBL
4  ENSEMBL_MART_ENSEMBL
5  ENSEMBL_MART_ENSEMBL
6  ENSEMBL_MART_ENSEMBL
7  ENSEMBL_MART_ENSEMBL
8  ENSEMBL_MART_ENSEMBL
9  ENSEMBL_MART_ENSEMBL
10 ENSEMBL_MART_ENSEMBL
11 ENSEMBL_MART_ENSEMBL
12 ENSEMBL_MART_ENSEMBL
13 ENSEMBL_MART_ENSEMBL
14 ENSEMBL_MART_ENSEMBL
15 ENSEMBL_MART_ENSEMBL
16 ENSEMBL_MART_ENSEMBL
17 ENSEMBL_MART_ENSEMBL
18 ENSEMBL_MART_ENSEMBL
19 ENSEMBL_MART_ENSEMBL
20 ENSEMBL_MART_ENSEMBL

The organismFilters() function also allows users to search for filters that correspond to a specific topic or category.

# search for filter topic "id"
head(organismFilters("Homo sapiens", topic = "id"), 20)
                             name                                        description
31                     with_go_id                          with GO Term Accession(s)
36                with_mim_morbid                             with MIM disease ID(s)
43                with_protein_id                       with protein (Genbank) ID(s)
53            with_refseq_peptide                          with RefSeq protein ID(s)
54  with_refseq_peptide_predicted                with RefSeq predicted protein ID(s)
63                ensembl_gene_id          Ensembl Gene ID(s) [e.g. ENSG00000139618]
64          ensembl_transcript_id    Ensembl Transcript ID(s) [e.g. ENST00000380152]
65             ensembl_peptide_id       Ensembl protein ID(s) [e.g. ENSP00000369497]
66                ensembl_exon_id          Ensembl exon ID(s) [e.g. ENSE00001508081]
67                        hgnc_id                        HGNC ID(s) [e.g. HGNC:8030]
87                          go_id             GO Term Accession(s) [e.g. GO:0005515]
92           mim_morbid_accession              MIM Morbid Accession(s) [e.g. 540000]
93                     mirbase_id                   miRBase ID(s) [e.g. hsa-mir-137]
97                     protein_id            Protein (Genbank) ID(s) [e.g. ACU09872]
105                refseq_peptide           RefSeq protein ID(s) [e.g. NP_001005353]
106      refseq_peptide_predicted RefSeq predicted protein ID(s) [e.g. XP_011520427]
119                   wikigene_id                       WikiGene ID(s) [e.g. 115286]
197              go_evidence_code                                   GO Evidence code
300            with_validated_snp                        Variant supporting evidence
325                with_validated                        Variant supporting evidence
                  dataset                 mart
31  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
36  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
43  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
53  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
54  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
63  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
64  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
65  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
66  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
67  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
87  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
92  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
93  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
97  hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
105 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
106 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
119 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
197 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
300 hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
325          hsapiens_snp ENSEMBL_MART_ENSEMBL
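
The result above is consistent with the topic being matched as a substring of the filter name, which would explain why entries such as go_evidence_code and with_validated_snp are returned for the topic "id". The following minimal sketch reproduces that behaviour on a small hand-built table. The helper search_topic() is hypothetical and not part of biomartr, and the actual matching rule used by organismFilters() may differ.

```r
# Toy filter table with the same column names as the organismFilters() output;
# the rows are copied from the listings shown above.
filters_tbl <- data.frame(
  name        = c("chromosome_name", "strand", "with_go_id",
                  "ensembl_gene_id", "go_evidence_code"),
  description = c("Chromosome name", "Strand", "with GO Term Accession(s)",
                  "Ensembl Gene ID(s) [e.g. ENSG00000139618]",
                  "GO Evidence code"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (an assumption, not a biomartr function):
# keep all rows whose name contains the topic as a literal substring.
search_topic <- function(tbl, topic) {
  tbl[grepl(topic, tbl$name, fixed = TRUE), , drop = FALSE]
}

id_filters <- search_topic(filters_tbl, "id")
id_filters$name
```

Because the match is a plain substring test, "go_evidence_code" is kept (it contains "id" inside "evidence"), while "chromosome_name" and "strand" are dropped.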

Performing BioMart queries with biomartr

The short introduction to organismBM(), organismAttributes(), and organismFilters() given above allows users to perform BioMart queries in an intuitive, organism-centric way. The main function for performing BioMart queries is biomart().

For the following examples we assume that we are interested in the annotation of specific genes from the Homo sapiens proteome. We want to map their RefSeq ids to the ids used in other databases (here, Ensembl gene and protein ids). For this purpose, we first need to consult the organismAttributes() function.

head(organismAttributes("Homo sapiens", topic = "id"))
                    name              description               dataset                 mart
1        ensembl_gene_id          Ensembl Gene ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
2  ensembl_transcript_id    Ensembl Transcript ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
3     ensembl_peptide_id       Ensembl Protein ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
4        ensembl_exon_id          Ensembl Exon ID hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
34     study_external_id Study External Reference hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL
35                 go_id        GO Term Accession hsapiens_gene_ensembl ENSEMBL_MART_ENSEMBL

# retrieve the proteome of Homo sapiens from refseq
getProteome( db       = "refseq",
             kingdom  = "vertebrate_mammalian",
             organism = "Homo sapiens",
             path     = file.path("_ncbi_downloads","proteomes") )


file_path <- file.path("_ncbi_downloads","proteomes","Homo_sapiens_protein.faa.gz")

Hsapiens_proteome <- read_proteome(file_path, format = "fasta")

# keep only the part of each id before the first "." (drops the trailing suffix, e.g. a version number)
gene_set <- unlist(sapply(strsplit(Hsapiens_proteome[1:5, geneids], ".", fixed = TRUE), function(x) x[1]))

result_BM <- biomart( genes      = gene_set,
                      mart       = "ENSEMBL_MART_ENSEMBL", 
                      dataset    = "hsapiens_gene_ensembl",
                      attributes = c("ensembl_gene_id","ensembl_peptide_id"),
                      filters    = "refseq_peptide")

result_BM 
  refseq_peptide ensembl_gene_id ensembl_peptide_id
1      NP_000005 ENSG00000175899    ENSP00000323929
2      NP_000006 ENSG00000156006    ENSP00000286479
3      NP_000007 ENSG00000117054    ENSP00000359878
4      NP_000008 ENSG00000122971    ENSP00000242592
5      NP_000009 ENSG00000072778    ENSP00000349297

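The biomart() function takes as arguments a set of gene ids (genes), the filter that names the id type of those genes (filters), the mart and dataset to query, and the attributes that shall be returned. As the output above shows, the result contains one column for the filter ids followed by one column per requested attribute.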
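
The id clean-up step in the example above splits each id at the dot and keeps the first part. A minimal, self-contained sketch of the same idea using a regular expression is shown below. The input ids are illustrative values typed in by hand (the suffixes are made up), and strip_suffix() is a hypothetical helper, not a biomartr function.

```r
# Illustrative RefSeq-style protein ids with a numeric suffix after the dot
# (hand-written example values, not read from a downloaded proteome).
refseq_ids <- c("NP_000005.2", "NP_000006.2", "NP_000007.1")

# Hypothetical helper: remove a trailing ".<digits>" suffix if present,
# leaving ids without such a suffix unchanged.
strip_suffix <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

gene_set <- strip_suffix(refseq_ids)
gene_set
```

Compared with strsplit(), this variant only touches a numeric suffix at the end of the id, so ids that contain no dot are passed through as they are.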

Gene Ontology

The biomartr package also enables fast and intuitive retrieval of GO terms and additional information via the getGO() function. The function is designed so that different databases can be selected as the source of GO annotation for a set of query genes; so far, getGO() supports GO information retrieval from the BioMart database.

In this example we will retrieve GO information for the Homo sapiens gene GUCA2A, specified by its HGNC symbol.

GO Annotation Retrieval via BioMart

The getGO() function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism of interest needs to be specified. Furthermore, a set of gene ids and the filter that describes their id type need to be specified (for example, the gene symbol GUCA2A corresponds to the filter hgnc_symbol; see organismFilters() for details). The database argument then defines the database from which GO information shall be retrieved.

# search for GO terms of an example Homo sapiens gene
GO_tbl <- getGO(organism = "Homo sapiens", 
                genes    = "GUCA2A",
                filters  = "hgnc_symbol")

GO_tbl
  hgnc_symbol                       goslim_goa_description goslim_goa_accession
1      GUCA2A                           biological_process           GO:0008150
2      GUCA2A                           molecular_function           GO:0003674
3      GUCA2A cellular nitrogen compound metabolic process           GO:0034641
4      GUCA2A                           cellular_component           GO:0005575
5      GUCA2A                         biosynthetic process           GO:0009058
6      GUCA2A             small molecule metabolic process           GO:0044281
7      GUCA2A                                    organelle           GO:0043226
8      GUCA2A                    enzyme regulator activity           GO:0030234
9      GUCA2A                         extracellular region           GO:0005576

Hence, for each gene id the resulting table stores all annotated GO terms found in BioMart.
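
Since the result is an ordinary table with one row per gene and GO term, it can be summarised with base R. The sketch below rebuilds a few rows of the output shown above by hand (so that it runs without a BioMart connection) and groups the GO accessions by gene. The column names are taken from the output above; whether getGO() returns exactly these columns in other versions is an assumption.

```r
# Hand-built excerpt of the getGO() output shown above
# (three of the nine rows, typed in manually).
GO_tbl <- data.frame(
  hgnc_symbol            = rep("GUCA2A", 3),
  goslim_goa_description = c("biological_process",
                             "molecular_function",
                             "extracellular region"),
  goslim_goa_accession   = c("GO:0008150", "GO:0003674", "GO:0005576"),
  stringsAsFactors = FALSE
)

# Group the GO accessions by gene symbol: a named list with one
# character vector of accessions per gene.
go_by_gene <- split(GO_tbl$goslim_goa_accession, GO_tbl$hgnc_symbol)

# Number of GO terms annotated per gene.
n_terms <- lengths(go_by_gene)
n_terms
```

With several query genes, go_by_gene would contain one entry per gene, which makes it easy to compare annotation depth across the gene set.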