aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. It contains functions for multiple and pairwise sequence alignment, model construction and parameter optimization, file import/export, implementation of the forward, backward and Viterbi algorithms for conditional sequence probabilities, tree-based sequence weighting, and sequence simulation. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.
Hidden Markov models (HMMs) underlie many of the most important tasks in computational biology, including multiple sequence alignment, genome annotation, and increasingly, sequence database searching. Originally developed for speech recognition algorithms, their application to the field of molecular biology has increased dramatically since advances in computational capacity have enabled full probabilistic analysis in place of heuristic approximations. Pioneering this transition are two groups lead by Anders Krogh and Sean Eddy, whose respective software packages SAM and HMMER have underpinned HMM-based bioinformatic analysis for over two decades.
Here, we present the aphid package for analysis with profile hidden Markov models in the R environment (R Core Team 2017). The package contains functions for developing, plotting, importing and exporting both standard and profile HMMs, as well as implementations of the forward, backward and Viterbi algorithms for computing full and optimal conditional sequence probabilities. The package also features a multiple sequence alignment tool that produces high quality alignments via profile HMM training.
The aphid package is designed to work in conjunction with the “DNAbin” and “AAbin” object types produced using the ape package (Paradis, Claude, and Strimmer 2004; Paradis 2012). These object types, in which sequences are represented in a bit-level coding scheme, are preferred over standard character-type sequences for maximizing memory and speed efficiency. While we recommend using ape alongside aphid, it is not a requisite and as such is listed in the “Suggests” rather than “Imports” section of the package description. Indeed, any sequence of standard ASCII characters is supported, making aphid suitable for other applications outside of biological sequence analysis. However, it should be noted that if DNA and/or amino acid sequences are input as character vectors, the functions may not recognize the ambiguity codes and therefore are not guaranteed to treat them appropriately.
To maximize speed, the low-level dynamic programming functions (including the forward, backward, Viterbi, and maximum a posteriori algorithms) are written in C++ linking to the Rcpp package (Eddelbuettel and Francois 2011). R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation. This package also relies on the openssl package (Ooms 2016) for sequence and alignment comparisons using the MD5 hash algorithm.
Two primary object classes, “HMM” (hidden Markov model) and “PHMM” (profile hidden Markov model) are generated using the aphid functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (elements named “E” and “A”), and non-mandatory elements that may include vectors of background emission and transition probabilities (“qe” and “qa”, respectively) and other model metadata including “name”, “description”, “size” (the number of modules in the model), and “alphabet” (the set of symbols/residues emitted by the model). Objects of class “DPA” (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are predominantly created for the purposes of succinct console-printing.
HMMs and PHMMs are explained in more detail throughout the following sections using aphid package functions to demonstrate their utility. The examples are borrowed from Durbin et al (1998), to which users are encouraged to refer for a more in-depth explanation on the theory and application of these models. Book chapter numbers are provided wherever possible for ease of reference.
The aphid package can be used to produce high-quality multiple sequence alignments using the iterative model training method outlined above. The function align
takes as its primary argument a list of sequences either as a “DNAbin” object, an “AAbin” object, or a list of character sequences. An object of class “PHMM” can be passed to the function as an optional secondary argument (“model”), in which case the sequences are simply aligned to the model to produce the alignment matrix. If “model” is NULL, a preliminary model is first derived using the ‘seed’ sequence method outlined above, after which the model is trained using either the Baum Welch or Viterbi training algorithm (specified via the “method” argument). The sequences are then aligned to the model in the usual fashion to produce the alignment. Note that if only two sequences are present inthe input list, the align
function will perform a pairwise alignment without a profile HMM (Smith-Waterman or Needleman-Wunch alignment).
In this final example, we will deconstruct the original globin alignment and re-align the sequences using the original PHMM as a guide.
globins <- unalign(globins)
align(globins, model = globins.PHMM, seqweights = NULL, residues = "AMINO")
## 1 2 3 I I 4 5 6 7 8
## HBA_HUMAN "V" "G" "A" "-" "-" "H" "A" "G" "E" "Y"
## HBB_HUMAN "V" "-" "-" "-" "-" "N" "V" "D" "E" "V"
## MYG_PHYCA "V" "E" "A" "-" "-" "D" "V" "A" "G" "H"
## GLB3_CHITP "V" "K" "G" "-" "-" "-" "-" "-" "-" "D"
## GLB5_PETMA "V" "Y" "S" "-" "-" "T" "Y" "E" "T" "S"
## LGB2_LUPLU "F" "N" "A" "-" "-" "N" "I" "P" "K" "H"
## GLB1_GLYDI "I" "A" "G" "A" "D" "N" "G" "A" "G" "V"
Note that the column names show the progressive positions along the model and where residues were predicted to have been emitted by insert states (e.g. the 4th and 5th residues of sequence 7).
This package was written based on the algorithms described in Durbin et al (1998). This book offers an in depth explanation of hidden Markov models and profile HMMs for users of all levels of familiarity. Many of the examples and datasets in the package are directly derived from the text, which serves as a useful primer for this package. There are also excellent resources available for those wishing to use profile HMMs outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM. Those interested are further encouraged to check out the SAM software package, which also features a comprehensive suite of functions and tutorials.
This software was developed at Victoria University of Wellington, NZ, with funding from a Rutherford Foundation Postdoctoral Research Fellowship award from the Royal Society of New Zealand.
Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press.
Eddelbuettel, Dirk, and Romain Francois. 2011. “Rcpp: seamless R and C++ integration.” Journal of Statistical Software 40 (8): 1–18. http://www.jstatsoft.org/v40/i08/.
Ooms, Jeroen. 2016. openssl: toolkit for encryption, signatures and certificates based on OpenSSL. R package. https://cran.r-project.org/package=openssl.
Paradis, Emmanuel. 2012. Analysis of Phylogenetics and Evolution with R. Second Edi. New York: Springer.
Paradis, Emmanuel, Julien Claude, and Korbinian Strimmer. 2004. “APE: analyses of phylogenetics and evolution in R language.” Bioinformatics 20: 289–90. doi:10.1093/bioinformatics/btg412.
R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://cran.r-project.org/.