Ensembl Biomart
The Ensembl Biomart database enables users to retrieve a vast diversity of annotation data for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between the R language and Ensembl Biomart by implementing the R package biomaRt.
The purpose of the biomaRt
package was to mimic the ENSEMBL BioMart database structure to construct queries that can be sent to the Application Programming Interface (API) of BioMart. Although, this procedure was very useful in the past, it seems not intuitive from an organism centric point of view. Usually, users wish to download functional annotation for a particular organism of interest. However, the BioMart and thus the biomaRt
package require that users already know in which mart
and dataset
the organism of interest will be found which requires significant efforts of searching and screening. In addition, once the mart
and dataset
of a particular organism of interest were found and specified the user must again learn which attribute
has to be specified to retrieve the functional annotation information of interest.
The new functionality implemented in the biomartr
package aims to overcome this search bottleneck by extending the functionality of the biomaRt package. The new biomartr
package introduces a more organism cantered annotation retrieval concept which does not require to screen for marts
, datasets
, and attributes
beforehand. With biomartr
users only need to specify the scientific name
of the organism of interest to then retrieve available marts
, datasets
, and attributes
for the corresponding organism of interest.
This paradigm shift enables users to quickly construct queries to the BioMart database without having to learn the particular database structure and organization of BioMart.
The following sections will introduce users to the functionality and data retrieval precedures of biomartr
and will show how biomartr
extends the functionality of the initial biomaRt package.
biomaRt
query methodologyThe best way to get started with the old methodology presented by the established biomaRt package is to understand the workflow of its data retrieval process. The query logic of the biomaRt
package derives from the database organization of Ensembl Biomart which stores a vast diversity of annotation data for specific organisms. In detail, the Ensembl Biomart database is organized into so called:
marts
, datasets
, and attributes
. Marts
denote a higher level category of functional annotation such as SNP
(e.g. for functional annotation of particular single nucleotide polymorphisms (SNPs)) or FUNCGEN
(e.g. for functional annotation of regulatory regions or relationsships of genes). Datasets
denote the particular species of interest for which functional annotation is available within this specific mart
. It can happen that datasets
(= particular species of interest) are available in one mart
(= higher category of functional annotation) but not in an other mart
. For the actual retrieval of functional annotation information users must then specify the type
of functional annotation information they wish to retrieve. These types
are called attributes
in the biomaRt
notation.
Hence, when users wish to retrieve information for a specific organism of interest, they first need to specify a particular mart
and dataset
in which the information of the corresponding organism of interest can be found. Subsequently they can specify the attributes
argument to retrieve a particular type of functional annotation (e.g. Gene Ontology terms).
The following section shall illustrate how marts
, datasets
, and attributes
could be explored using biomaRt
before the biomartr
package existed.
The availability of marts
, datasets
, and attributes
can be checked by the following functions:
# install the biomaRt package
# source("http://bioconductor.org/biocLite.R")
# biocLite("biomaRt")
# load biomaRt
library(biomaRt)
# look at top 10 databases
head(biomaRt::listMarts(host = "www.ensembl.org"), 10)
#> biomart version
#> 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 92
#> 2 ENSEMBL_MART_MOUSE Mouse strains 92
#> 3 ENSEMBL_MART_SNP Ensembl Variation 92
#> 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 92
Users will observe that several marts
providing annotation for specific classes of organisms or groups of organisms are available.
For our example, we will choose the hsapiens_gene_ensembl
mart
and list all available datasets that are element of this mart
.
head(biomaRt::listDatasets(biomaRt::useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org")), 10)
#> dataset description version
#> 1 acarolinensis_gene_ensembl Anole lizard genes (AnoCar2.0) AnoCar2.0
#> 2 amelanoleuca_gene_ensembl Panda genes (ailMel1) ailMel1
#> 3 amexicanus_gene_ensembl Cave fish genes (AstMex102) AstMex102
#> 4 anancymaae_gene_ensembl Ma's night monkey genes (Anan_2.0) Anan_2.0
#> 5 aplatyrhynchos_gene_ensembl Duck genes (BGI_duck_1.0) BGI_duck_1.0
#> 6 btaurus_gene_ensembl Cow genes (UMD3.1) UMD3.1
#> 7 caperea_gene_ensembl Brazilian guinea pig genes (CavAp1.0) CavAp1.0
#> 8 catys_gene_ensembl Sooty mangabey genes (Caty_1.0) Caty_1.0
#> 9 ccapucinus_gene_ensembl Capuchin genes (Cebus_imitator-1.0) Cebus_imitator-1.0
#> 10 cchok1gshd_gene_ensembl Chinese hamster CHOK1GS genes (CHOK1GS_HDv1) CHOK1GS_HDv1
The useMart()
function is a wrapper function provided by biomaRt
to connect a selected BioMart database (mart
) with a corresponding dataset stored within this mart
.
We select dataset hsapiens_gene_ensembl
and now check for available attributes (annotation data) that can be accessed for Homo sapiens
genes.
head(biomaRt::listAttributes(biomaRt::useDataset(
dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))), 10)
Please note the nested structure of this attribute query. For an attribute query procedure an additional wrapper function named useDataset()
is needed in which useMart()
and a corresponding dataset needs to be specified. The result is a table storing the name of available attributes for
Homo sapiens as well as a short description.
Furthermore, users can retrieve all filters for Homo sapiens that can be specified by the actual BioMart query process.
head(biomaRt::listFilters(biomaRt::useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))), 10)
After accumulating all this information, it is now possible to perform an actual BioMart query by using the getBM()
function.
In this example we will retrieve attributes: start_position
,end_position
and description
for the Homo sapiens gene "GUCA2A"
.
Since the input genes are ensembl gene ids
, we need to specify the filters
argument filters = "hgnc_symbol"
.
# 1) select a mart and data set
mart <- biomaRt::useDataset(dataset = "hsapiens_gene_ensembl",
mart = useMart("ENSEMBL_MART_ENSEMBL",
host = "www.ensembl.org"))
# 2) run a biomart query using the getBM() function
# and specify the attributes and filter arguments
geneSet <- "GUCA2A"
resultTable <- biomaRt::getBM(attributes = c("start_position","end_position","description"),
filters = "hgnc_symbol",
values = geneSet,
mart = mart)
resultTable
When using getBM()
users can pass all attributes retrieved by listAttributes()
to the attributes
argument of the getBM()
function.
biomaRt
using the new query system of the biomartr
packagebiomartr
This query methodology provided by Ensembl Biomart
and the biomaRt
package is a very well defined approach for accurate annotation retrieval. Nevertheless, when learning this query methodology it (subjectively) seems non-intuitive from the user perspective. Therefore, the biomartr
package provides another query methodology that aims to be more organism centric.
Taken together, the following workflow allows users to perform fast BioMart queries for attributes using the biomart()
function implemented in this biomartr
package:
get attributes, datasets, and marts via : organismAttributes()
choose available biological features (filters) via: organismFilters()
specify a set of query genes: e.g. retrieved with getGenome()
, getProteome()
or getCDS()
specify all arguments of the biomart()
function using steps 1) - 3) and perform a BioMart query
Note that dataset names change very frequently due to the update of dataset versions. So in case some query functions do not work properly, users should check with organismAttributes(update = TRUE)
whether or not their dataset name has been changed. For example, organismAttributes("Homo sapiens", topic = "id", update = TRUE)
might reveal that the dataset ENSEMBL_MART_ENSEMBL
has changed.
The getMarts()
function allows users to list all available databases that can be accessed through BioMart interfaces.
# load the biomartr package
library(biomartr)
# list all available databases
biomartr::getMarts()
Now users can select a specific database to list all available datasets that can be accessed through this database. In this example we choose the ENSEMBL_MART_ENSEMBL
database.
head(biomartr::getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 5)
Now you can select the dataset hsapiens_gene_ensembl
and list all available attributes that can be retrieved from this dataset.
tail(biomartr::getDatasets(mart = "ENSEMBL_MART_ENSEMBL") , 38)
Now that you have selected a database (hsapiens_gene_ensembl
) and a dataset (hsapiens_gene_ensembl
), users can list all available attributes for this dataset using the getAttributes()
function.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available attributes for dataset: hsapiens_gene_ensembl
head( biomartr::getAttributes(mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl"), 10 )
Finally, the getFilters()
function allows users to list available filters for a specific dataset that can be used for a biomart()
query.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# list all available filters for dataset: hsapiens_gene_ensembl
head( biomartr::getFilters(mart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl"), 10 )
In most use cases, users will work with a single or a set of model organisms. In this process they will mostly be interested in specific annotations for this particular model organism. The organismBM()
function addresses this issue and provides users with an organism centric query to marts
and datasets
which are available for a particular organism of interest.
Note that when running the following functions for the first time, the data retrieval procedure will take some time, due to the remote access to BioMart. The corresponding result is then saved in a *.txt
file named _biomart/listDatasets.txt
within the tempdir()
folder, allowing subsequent queries to be performed much faster. The tempdir()
folder, however, will be deleted after a new R session was established. In this case the inital call of the subsequent functions again will take time to retrieve all organism specific data from the BioMart database.
This concept of locally storing all organism specific database linking information available in BioMart into an internal file allows users to significantly speed up subsequent retrieval queries for that particular organism.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieving all available datasets and biomart connections for
# a specific query organism (scientific name)
biomartr::organismBM(organism = "Homo sapiens")
The result is a table storing all marts
and datasets
from which annotations can be retrieved for Homo sapiens. Furthermore, a short description as well as the version of the dataset being accessed (very useful for publications) is returned.
Users will observe that 3 different marts
provide 6 different datasets
storing annotation information for Homo sapiens.
Please note, however, that scientific names of organisms must be written correctly! For ex. “Homo Sapiens” will be treated differently (not recognized) than “Homo sapiens” (recognized).
Similar to the biomaRt
package query methodology, users need to specify attributes
and filters
to be able to perform accurate BioMart queries. Here the functions organismAttributes()
and organismFilters()
provide useful and intuitive concepts to obtain this information.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available attributes for "Homo sapiens"
head(biomartr::organismAttributes("Homo sapiens"), 20)
Users will observe that the organismAttributes()
function returns a data.frame storing attribute names, datasets, and marts which are available for Homo sapiens
. After the ENSEMBL release 87 the ENSEMBL_MART_SEQUENCE
service provided by Ensembl does not work properly and thus the organismAttributes()
function prints out warning messages to make the user aware when certain marts provided bt Ensembl do not work properly, yet.
An additional feature provided by organismAttributes()
is the topic
argument. The topic
argument allows users to to search for specific attributes, topics, or categories for faster filtering.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "id"
head(biomartr::organismAttributes("Homo sapiens", topic = "id"), 20)
Now, all attribute names
having id
as part of their name
are being returned.
Another example is topic = "homolog"
.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "homolog"
head(biomartr::organismAttributes("Homo sapiens", topic = "homolog"), 20)
Or topic = "dn"
and topic = "ds"
for dn
and ds
value retrieval.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "dn"
head(biomartr::organismAttributes("Homo sapiens", topic = "dn"))
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for attribute topic "ds"
head(biomartr::organismAttributes("Homo sapiens", topic = "ds"))
Analogous to the organismAttributes()
function, the organismFilters()
function returns all filters that are available for a query organism of interest.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# return available filters for "Homo sapiens"
head(biomartr::organismFilters("Homo sapiens"), 20)
The organismFilters()
function also allows users to search for filters that correspond to a specific topic or category.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for filter topic "id"
head(biomartr::organismFilters("Homo sapiens", topic = "id"), 20)
biomartr
The short introduction to the functionality of organismBM()
, organismAttributes()
, and organismFilters()
will allow users to perform BioMart queries in a very intuitive organism centric way. The main function to perform BioMart queries is biomart()
.
For the following examples we will assume that we are interested in the annotation of specific genes from the Homo sapiens proteome. We want to map the corresponding refseq gene id to a set of other gene ids used in other databases. For this purpose, first we need consult the organismAttributes()
function.
# show all elements of the data.frame
options(tibble.print_max = Inf)
head(biomartr::organismAttributes("Homo sapiens", topic = "id"))
# show all elements of the data.frame
options(tibble.print_max = Inf)
# retrieve the proteome of Homo sapiens from refseq
file_path <- biomartr::getProteome( db = "refseq",
organism = "Homo sapiens",
path = file.path("_ncbi_downloads","proteomes") )
Hsapiens_proteome <- biomartr::read_proteome(file_path, format = "fasta")
# remove splice variants from id
gene_set <- unlist(sapply(strsplit(Hsapiens_proteome@ranges@NAMES[1:5], ".",fixed = TRUE), function(x) x[1]))
result_BM <- biomartr::biomart( genes = gene_set, # genes were retrieved using biomartr::getGenome()
mart = "ENSEMBL_MART_ENSEMBL", # marts were selected with biomartr::getMarts()
dataset = "hsapiens_gene_ensembl", # datasets were selected with biomartr::getDatasets()
attributes = c("ensembl_gene_id","ensembl_peptide_id"), # attributes were selected with biomartr::getAttributes()
filters = "refseq_peptide") # specify what ID type was stored in the fasta file retrieved with biomartr::getGenome()
result_BM
The biomart()
function takes as arguments a set of genes (gene ids specified in the filter
argument), the corresponding mart
and dataset
, as well as the attributes
which shall be returned.
The biomartr
package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO()
function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO()
function allows GO information retrieval from the Ensembl Biomart database.
In this example we will retrieve GO information for a set of Homo sapiens genes stored as hgnc_symbol
.
The getGO()
function takes several arguments as input to retrieve GO information from BioMart. First, the scientific name of the organism
of interest needs to be specified. Furthermore, a set of gene ids
as well as their corresponding filter
notation (GUCA2A
gene ids have filter
notation hgnc_symbol
; see organismFilters()
for details) need to be specified. The database
argument then defines the database from which GO information shall be retrieved.
# show all elements of the data.frame
options(tibble.print_max = Inf)
# search for GO terms of an example Homo sapiens gene
GO_tbl <- biomartr::getGO(organism = "Homo sapiens",
genes = "GUCA2A",
filters = "hgnc_symbol")
GO_tbl
Hence, for each gene id the resulting table stores all annotated GO terms found in Ensembl Biomart.