Nucleic Acids Research Advance Access originally published online on June 3, 2008
Nucleic Acids Research 2008 36(Web Server issue):W65-W69; doi:10.1093/nar/gkn283
Nucleic Acids Research, 2008, Vol. 36, No. suppl_2 W65-W69
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
xREI: a phylo-grammar visualization webserver
Department of Bioengineering, University of California, Berkeley
*To whom correspondence should be addressed. Email: lbarquist{at}gmail.com
Received February 9, 2008. Revised April 16, 2008. Accepted April 26, 2008.
ABSTRACT
Phylo-grammars, probabilistic models combining Markov chain substitution models with stochastic grammars, are powerful models for annotating structured features in multiple sequence alignments and analyzing the evolution of those features. In the past, these methods have been cumbersome to implement and modify. xrate provides means for the rapid development of phylo-grammars (using a simple file format) and automated parameterization of those grammars from training data (via the Expectation Maximization algorithm). xREI (pron. X-ray) is an intuitive, flexible AJAX (Asynchronous Javascript And XML) web interface to xrate providing grammar visualization tools as well as access to xrate's training and annotation functionality. It is hoped that this application will serve as a valuable tool to those developing phylo-grammars, and as a means for the exploration and dissemination of such models. xREI is available at http://harmony.biowiki.org/xrei/
INTRODUCTION
Accurate automated annotation of biological sequences is an increasingly important problem in the biological sciences. Recent releases of high-quality multiple sequence alignment data, such as those from the Drosophila 12 Genomes Consortium (1,2), have only underscored this need. Phylo-grammars have had great success in this arena, with diverse applications such as the prediction of exons in DNA (3,4), prediction of secondary structure in proteins (5,6) and detection of noncoding RNA (7). However, despite this broad range of applications, implementations of phylo-grammars have often been limited to a single model and have lacked fast and accurate training algorithms, which has limited more widespread adoption.
xrate provides exactly this missing functionality (8). By allowing the nonexpert user to quickly and effectively implement, train and apply phylo-grammars, it has opened this powerful method of analysis to researchers who would otherwise lack the expertise or resources to implement such a tool. In this article we introduce xREI (pronounced X-ray), a web interface that allows researchers to explore a range of pre-defined phylo-grammars and provides tools to visualize their own.
THE XRATE PROGRAM
xrate is an extremely flexible software tool for modeling structural and phylogenetic variation in multiple sequence alignments. Users can design models for point substitution of nucleotides or amino acids, or for coordinated substitution among groups of residues (e.g. codons or RNA basepairs). The models can be parametric, fully unconstrained or lineage-specific (i.e. using different parameters on different branches of a tree). These substitution models can then be organized via a hidden Markov model or stochastic context-free grammar, allowing for structured variation due to localized rate heterogeneity, intron/exon structure, RNA secondary structure, mixture models, binding sites, etc. The software can be used to fit maximum-likelihood parameters from training alignments (annotated or unannotated), or to use previously-fit parameters to estimate phylogenetic trees or annotate alignments. The entire model is specified using a compact file format described at biowiki.org/XrateFormat.
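As a rough illustration of what a point-substitution model involves, the sketch below builds a Jukes-Cantor rate matrix and converts it into substitution probabilities over a branch of length t by approximating the matrix exponential exp(Qt). This is not xrate code: the function names, the parameter mu and the numerical scheme (scaling and squaring with a truncated Taylor series) are assumptions chosen only to keep the example self-contained.

```python
import math

ALPHABET = "ACGT"

def jukes_cantor_rate_matrix(mu=1.0):
    """Rate matrix with total exit rate mu, shared equally among the other residues."""
    n = len(ALPHABET)
    off_diagonal = mu / (n - 1)
    return [[-mu if i == j else off_diagonal for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate exp(Q * t): Taylor series on Q*t/2**squarings, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds a**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

if __name__ == "__main__":
    p = transition_probabilities(jukes_cantor_rate_matrix(1.0), 0.5)
    for residue, row in zip(ALPHABET, p):
        print(residue, " ".join("%.4f" % x for x in row))
```

Each row of the resulting matrix sums to one, and for this particular model the diagonal agrees with the known closed form 1/4 + 3/4 exp(-4μt/3).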
xrate's ability to specify arbitrary substitution rate matrices is similar to HyPhy (9), while its grammar-design functionality is most similar to HMMoC (10). xrate can reproduce the core functionality of a wide range of programs: molecular evolutionary measurement [PAML: (11)], annotation of regions where substitutions are suppressed [PhastCons: (12)] or recently accelerated [DLESS: (13)], protein-coding genefinding [Exoniphy: (14); EvoGene: (3)], RNA folding [PFOLD: (15)], gene discovery [EvoFold: (7)] and prediction of regulatory elements [RNA-DECODER: (16); MONKEY: (17)].
Building a web interface to such a generic tool is a challenge. We realized that we would need to combine various components: a grammar browser to show the model structure, a rate matrix browser to show relative substitution rates and an alignment browser for training and annotation. Processing would have to be split between client and server: the platform-independent web browser would handle layout and user interface components, with a Linux backend doing the heavy lifting. These considerations strongly pointed to the AJAX model (Asynchronous Javascript And XML) of client-server communication.
THE xREI WEBSERVER
The xREI webserver handles two file formats: xrate grammars and stockholm alignments with defined phylogenetic trees. The server makes available an arbitrary number of repositories of these files, which the end user can browse and select from using a preview window. Users can also upload their own files in the appropriate formats from their local machines. Once a grammar is loaded, a state diagram for it is automatically generated, and options for generating rate matrix displays and accessing xrate functionality are presented.
State transition diagrams
State transition diagrams are generated from the transformation rules within the xrate grammar file using GraphViz. If a start state is not explicitly defined, one is drawn preceding the first state in the first transformation rule. An end state is drawn, and all transformations to an empty state are treated as transitioning to it. Each emission state is shown as transitioning to its substitution chain. Bifurcations are drawn as rectangles, with dotted and dashed lines representing the right and left transformations.
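The drawing rules above can be sketched as a small function that turns a list of transformation rules into GraphViz DOT text. This is an illustrative reconstruction rather than the code used by xREI: the input representation (tuples of source, target and probability), the node shapes chosen for start, end and chain nodes, and the assignment of dashed lines to left and dotted lines to right bifurcation branches are all assumptions.

```python
def grammar_to_dot(transitions, emissions=None, bifurcations=None, start=None):
    """Build DOT text for a grammar.

    transitions:  list of (source, target, probability); target None means "empty"
    emissions:    dict mapping an emission state to its substitution chain name
    bifurcations: dict mapping a bifurcation state to (left_state, right_state)
    start:        name of an explicitly defined start state, if any
    """
    emissions = emissions or {}
    bifurcations = bifurcations or {}
    lines = ["digraph grammar {"]
    if start is None and transitions:
        # No explicit start state: draw one preceding the first state of the first rule.
        lines.append('  "Start" [shape=point];')
        lines.append('  "Start" -> "%s";' % transitions[0][0])
    lines.append('  "End" [shape=doublecircle];')
    for source, target, prob in transitions:
        # Transformations to the empty state are drawn as transitions to End.
        target = "End" if target is None else target
        lines.append('  "%s" -> "%s" [label="%.2f"];' % (source, target, prob))
    for state, chain in emissions.items():
        # Each emission state is linked to its substitution chain.
        lines.append('  "%s" [shape=plaintext];' % chain)
        lines.append('  "%s" -> "%s" [arrowhead=none];' % (state, chain))
    for state, (left, right) in bifurcations.items():
        # Bifurcations are rectangles with distinct line styles per branch.
        lines.append('  "%s" [shape=box];' % state)
        lines.append('  "%s" -> "%s" [style=dashed];' % (state, left))
        lines.append('  "%s" -> "%s" [style=dotted];' % (state, right))
    lines.append("}")
    return "\n".join(lines)

if __name__ == "__main__":
    rules = [("S", "S", 0.9), ("S", None, 0.1)]
    print(grammar_to_dot(rules, emissions={"S": "chain1"}))
```

The returned text can be saved to a file and passed to any GraphViz layout program.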
Diagrams can be rendered in one of two ways. The dot rendering produces a layered graph that avoids edge crossings and minimizes edge length. The neato rendering produces a spring-model ("ball and spring") graph in which edge weights reflect transition probabilities. The neato rendering is also randomly seeded, so repeated renderings will produce different layouts. Both renderings produce output in PNG format for viewing in-browser, as well as PostScript suitable for publication, which may be exported and saved locally. The raw GraphViz files can also be exported and modified by hand for clarity, which may be necessary for larger grammars.
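For readers working from an exported GraphViz file, the two renderings and two output formats correspond to four invocations of the standard GraphViz command-line programs. The sketch below only assembles the argument lists; the file names are hypothetical, and the exact options used on the xREI server are not stated in this article.

```python
def render_commands(graph_file, output_stem):
    """Return GraphViz command lines for both layout programs and both output formats."""
    commands = []
    for engine in ("dot", "neato"):      # layered layout, spring-model layout
        for fmt in ("png", "ps"):        # in-browser preview, PostScript for print
            output = "%s.%s.%s" % (output_stem, engine, fmt)
            commands.append([engine, "-T" + fmt, graph_file, "-o", output])
    return commands

if __name__ == "__main__":
    for argv in render_commands("grammar.gv", "grammar"):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run` on a machine where GraphViz is installed.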
Rate matrix visualization
The rate matrices of individual substitution chains are displayed as bubble plots. A grid is drawn with the grammar alphabet along both axes. At each grid intersection, a circle is drawn with size proportional to the substitution rate between the respective residues. The circles are initially scaled by an ad hoc function that generally produces good results; the user can also adjust this scale factor manually.
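One plausible way to size the circles is sketched below. The article does not specify the scaling function, so the choices here are assumptions: circle area (rather than radius) is made proportional to the rate, the largest rate fills a fixed fraction of a grid cell, and a user-adjustable `scale` multiplies the result.

```python
import math

def bubble_radii(rates, scale=1.0, max_radius=0.5):
    """Map substitution rates to circle radii, with area proportional to rate.

    rates: dict mapping (source_residue, target_residue) to a non-negative rate
    scale: user-adjustable multiplier applied to every radius
    max_radius: radius given to the largest rate when scale is 1, in grid-cell units
    """
    largest = max(rates.values(), default=0.0)
    if largest <= 0.0:
        return {pair: 0.0 for pair in rates}
    return {pair: scale * max_radius * math.sqrt(rate / largest)
            for pair, rate in rates.items()}

if __name__ == "__main__":
    example = {("A", "G"): 4.0, ("A", "C"): 1.0, ("A", "T"): 1.0}
    for pair, radius in bubble_radii(example).items():
        print(pair, radius)
```

Using the square root keeps a rate four times larger from appearing sixteen times larger in area, which is a common convention for bubble plots.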
xREI automatically detects the grammar alphabet (DNA, RNA, amino acids, codons) and presents a range of appropriate coloring options to the user. These include coloring by transition/transversion, the number of nucleotide differences, preservation of canonical pairing and synonymous/nonsynonymous codons. Output is made available in PNG and PostScript formats.
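Two of these coloring criteria are simple enough to sketch directly. The functions below classify a nucleotide substitution as a transition or a transversion and count the nucleotide differences between two codons; the function names are illustrative and are not taken from the xREI source.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}

def substitution_class(a, b):
    """Classify a substitution between two nucleotides."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identity"
    pair = {a, b}
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

def nucleotide_differences(codon_a, codon_b):
    """Count positions at which two equal-length codons differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have the same length")
    return sum(x != y for x, y in zip(codon_a.upper(), codon_b.upper()))

if __name__ == "__main__":
    print(substitution_class("A", "G"))
    print(substitution_class("A", "C"))
    print(nucleotide_differences("ATG", "ACG"))
```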
xrate Functions
The XRate menu item provides access to xrate's grammar training and annotation functions. The Use Grammar to Annotate Alignment option runs the currently loaded grammar over the alignment and loads a new alignment into memory containing the xrate annotations (predicted secondary structure, for instance) in a new labeled #=GC line. Similarly, the Use Alignment to Train Grammar option loads a new grammar into browser memory containing updated transformation and substitution parameters. We also provide an option to use xrate's neighbor-joining functionality to generate a Newick tree for a given alignment.
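In the Stockholm format, per-column annotations are stored on lines beginning with `#=GC`, followed by a label and one annotation character per alignment column. A minimal sketch of reading such lines back out of an annotated alignment is given below; the example alignment and the `SS_cons` label are illustrative, and the label names xrate actually writes depend on the grammar.

```python
def read_gc_annotations(stockholm_text):
    """Collect per-column (#=GC) annotations from Stockholm-format text.

    Returns a dict mapping each annotation label to its annotation string.
    Annotations split across several alignment blocks are concatenated.
    """
    annotations = {}
    for line in stockholm_text.splitlines():
        if not line.startswith("#=GC"):
            continue
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # malformed line: label or annotation missing
        _, label, value = fields
        annotations[label] = annotations.get(label, "") + value.strip()
    return annotations

EXAMPLE = """# STOCKHOLM 1.0
seq1          GGGAAACCC
seq2          GGCAAAGCC
#=GC SS_cons  <<<...>>>
//
"""

if __name__ == "__main__":
    print(read_gc_annotations(EXAMPLE))
```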
While there is no hard upper limit on the size of alignments submitted to xREI for training or annotation, large alignments should be avoided. Because requests are processed dynamically, a large alignment could cause the server to time out and return an error message, or leave the user waiting indefinitely for a response. Intensive training and annotation (e.g. whole genome analysis, training over stockholm alignment databases) should therefore still be performed using xrate at the command line. We have recently introduced an option in the Advanced Options menu to run annotation and training as a background process on the server, with results emailed to the end user, though this feature may need to be suspended in the event of high server traffic.
Available repositories
As part of the initial xREI package we have included a set of xrate phylo-grammar implementations of a variety of models. Grammars for DNA and RNA include: Jukes-Cantor (18), Kimura (19), HKY85 (20), the general reversible model REV (21), the general irreversible model IRREV, the general irreversible dinucleotide model (4), a general reversible codon model (22), a phylo-HMM for detecting local rate variation (23,12) and a phylo-SCFG for RNA folding (15). Grammars for proteins include: a general reversible amino acid substitution model (24) and several phylo-HMMs for protein secondary structure analysis (5).
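For orientation, the simplest of these nucleotide models can be written out explicitly. The block below gives the standard textbook forms of the Kimura two-parameter and HKY85 rate matrices; the symbols (α for the transition rate, β for the transversion rate, κ for the transition/transversion ratio, π for equilibrium frequencies) follow common usage and are not taken from the xrate grammar files, whose parameterization may differ.

```latex
% Nucleotide order: A, C, G, T. Off-diagonal entries are substitution rates;
% each diagonal entry is chosen so that its row sums to zero.
Q_{\mathrm{Kimura}} =
\begin{pmatrix}
-(\alpha + 2\beta) & \beta & \alpha & \beta \\
\beta & -(\alpha + 2\beta) & \beta & \alpha \\
\alpha & \beta & -(\alpha + 2\beta) & \beta \\
\beta & \alpha & \beta & -(\alpha + 2\beta)
\end{pmatrix}

% HKY85 additionally allows unequal equilibrium frequencies \pi_j.
q_{ij}^{\mathrm{HKY85}} =
\begin{cases}
\kappa \, \pi_j & \text{if } i \to j \text{ is a transition}, \\
\pi_j & \text{if } i \to j \text{ is a transversion},
\end{cases}
\qquad
q_{ii} = -\sum_{j \neq i} q_{ij}
```

Setting α = β in the Kimura matrix recovers the Jukes-Cantor model, and setting all π_j = 1/4 in HKY85 recovers the Kimura model up to a rate scaling.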
We have also included several alignment databases over which both these and user-supplied grammars may be trained and run. RFAM (25) is a collection of non-coding RNA multiple alignments. PANDIT (26) contains codon multiple sequence alignments covering many common protein-coding domains. TreeFam (27) is a database of protein alignments along with curated and semi-curated trees.
IMPLEMENTATION
The xREI interface is written in javascript with the Dojo Toolkit. The Dojo Toolkit provides a range of cross-browser functions for creation of interactive interface elements as well as AJAX client-server communication. xREI relies heavily on AJAX communication to provide a seamless user experience without page refreshes, more similar to a desktop application than a traditional web server.
xREI uses a set of server-side scripts written in Perl to process data. All rely heavily on the freely available DART Perl libraries (http://dart.sourceforge.net), which provide a collection of tools for manipulating xrate format grammars and stockholm alignments, among other related functions. xrei_load.pl provides basic utilities such as producing preview screens and formatting grammars and alignments to be loaded client-side. xrei_xrate.pl uses the DART Perl libraries to perform xrate operations such as training and annotation. xrei_statediag.pl uses the Perl GraphViz module to produce state diagrams. xrei_vizrates.pl is a modified version of the visualizeRates.pl script included with DART, which relies on LaTeX to produce its bubble plot graphs. In the interest of privacy, no temporary files are stored on the server except PostScript output files, which are deleted automatically every night.
Code reusability and flexibility were both major goals in implementation. As such any functionality currently in xREI can be easily incorporated in other web applications. All server-side scripts are written using CGI.pm, and are capable of being used in either an AJAX or standard CGI environment. Each is capable of running independently of the others or the xREI interface. Conversely, the xREI javascript interface has been designed for easy extension and compatibility with other web applications. Adding new functionality to the server requires only trivial modifications to the javascript code. In addition, the xREI server can pass grammar or alignment data to other web applications, as is currently done with the Raton RNA alignment viewer, with the caveat that both applications must be running on the same domain due to javascript security limitations.
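The same-domain caveat reflects the browser same-origin policy, under which two URLs share an origin only if their scheme, host and port all match. A minimal sketch of that comparison follows; the `/raton/` path in the usage example is hypothetical and is included only to illustrate two applications hosted together.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)

def same_origin(url_a, url_b):
    """True if a script loaded from url_a may exchange data directly with url_b."""
    return origin(url_a) == origin(url_b)

if __name__ == "__main__":
    xrei = "http://harmony.biowiki.org/xrei/"
    print(same_origin(xrei, "http://harmony.biowiki.org:80/raton/"))  # hypothetical path
    print(same_origin(xrei, "http://example.org/viewer/"))
```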
CONCLUSIONS
Phylo-grammars have a broad range of applications in the biological sciences. xREI provides a flexible and intuitive method for visualizing phylo-grammars developed within the xrate framework. Given an xrate grammar and an alignment in stockholm format, xREI can produce state transition graphs and substitution rate matrix plots for the grammar, and can also train the grammar and annotate the alignment. xREI can be easily expanded to incorporate new visualization or computational methods as needed. In addition, xREI functionality can be easily incorporated into other web servers. It is hoped that by increasing the availability and ease of use of these tools, xREI will accelerate their adoption as a standard analytical method.
ACKNOWLEDGEMENTS
The authors were funded by NIH/NHGRI grant 1R01GM076705-01 and by the 2007 Google Summer of Code [National Evolutionary Synthesis Center (NESCent) group, NSF #EF-0423641].
Conflict of interest statement. None declared.
REFERENCES
1. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer V, et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature (2007) 450:203–218.
2. Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras A, et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature (2007) 450:219–232.
3. Pedersen JS, Hein J. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics (2003) 19:219–227.
4. Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. (2004) 21:468–488.
5. Thorne JL, Goldman N, Jones DT. Combining protein evolution and secondary structure. Mol. Biol. Evol. (1996) 13:666–673.
6. Goldman N, Thorne JL, Jones DT. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. (1996) 263:196–208.
7. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput. Biol. (2006) 2:e33.
8. Klosterman PS, Uzilov AV, Bendana YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I. XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics (2006) 7:428.
9. Pond SLK, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics (2005) 21:676–679.
10. Lunter G. HMMoC–a compiler for hidden Markov models. Bioinformatics (2007) 23:2485–2487.
11. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. (2007) 24:1586–1591.
12. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. (2005) 15:1034–1050.
13. Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, et al. An RNA gene expressed during cortical development evolved rapidly in humans. Nature (2006) 443:167–172.
14. Siepel A, Haussler D. Combining phylogenetic and hidden Markov models in biosequence analysis. J. Comput. Biol. (2004) 11:413–428.
15. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. (2003) 31:3423–3428.
16. Pedersen JS, Meyer IM, Forsberg R, Simmonds P, Hein J. A comparative method for finding and folding RNA secondary structures within protein-coding regions. Nucleic Acids Res. (2004) 32:4925–4936.
17. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. (2004) 5:R98.
18. Jukes TH, Cantor C. Evolution of protein molecules. In: Mammalian Protein Metabolism (Munro HN, ed.) (1969) New York: Academic Press. 21–132.
19. Kimura M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. (1980) 16:111–120.
20. Hasegawa M, Kishino H, Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.
21. Yang Z. Estimating the pattern of nucleotide substitution. J. Mol. Evol. (1994) 39:105–111.
22. Kosiol C, Holmes I, Goldman N. An empirical codon model for protein sequence evolution. Mol. Biol. Evol. (2007) 24:1464–1479.
23. Felsenstein J, Churchill GA. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. (1996) 13:93–104.
24. Dayhoff MO, Eck RV, Park CM. A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure (Dayhoff MO, ed.) (1972) Vol. 5. Washington, DC: National Biomedical Research Foundation. 89–99.
25. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. (2005) 33:D121–D124.
26. Whelan S, de Bakker PI, Goldman N. Pandit: a database of protein and associated nucleotide domains with inferred trees. Bioinformatics (2003) 19:1556–1563.
27. Li H, Coghlan A, Ruan J, Coin LJ, Hrich JK, Osmotherly L, Li R, Liu T, Zhang Z, et al. Treefam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. (2006) 34:D572–D580.