AUGUSTUS: a web server for gene finding in eukaryotes
University of Göttingen, Institut für Mikrobiologie und Genetik, Goldschmidtstraße 1, 37077 Göttingen, Germany and 1 University of Göttingen, Institut für Numerische und Angewandte Mathematik, Lotzestraße 16–18, 37083 Göttingen, Germany
* To whom correspondence should be addressed. Tel: +49 551 3914926; Fax: +49 551 3914929; Email: mstanke{at}gwdg.de
Received February 14, 2004; Revised and Accepted March 15, 2004
| ABSTRACT |
|---|
We present a www server for AUGUSTUS, a novel software program for ab initio gene prediction in eukaryotic genomic sequences. Our method is based on a generalized Hidden Markov Model with a new method for modeling the intron length distribution. This method allows approximation of the true intron length distribution more accurately than do existing programs. For genomic sequence data from human and Drosophila melanogaster, the accuracy of AUGUSTUS is superior to existing gene-finding approaches. The advantage of our program becomes apparent especially for larger input sequences containing more than one gene. The server is available at http://augustus.gobics.de.
| INTRODUCTION |
|---|
The first step in genome annotation is to predict all gene structures in a given genomic sequence. The development of gene-finding methods is, therefore, an important field in biological sequence analysis. For eukaryotes this problem is far from trivial, since eukaryotic genes usually contain large introns, i.e. non-coding regions. Most gene-prediction programs are based on stochastic models such as Hidden Markov Models (HMMs). These models describe the statistical features of different regions and signals in genomic sequences, such as introns, coding exons, UTRs, promoters, etc. A large number of gene-finding programs have been proposed since the 1980s, e.g. GENIE (1), GENSCAN (2) and GENEID (3). GENSCAN is widely used and has been found in earlier studies (4,5) to be one of the most accurate gene-prediction programs. All these tools are routinely used for automatic genome annotation. Despite considerable efforts in the bioinformatics community, the performance of existing gene-prediction tools is still not satisfactory. A study by Guigó et al. (6) has shown that these tools are accurate if applied to rather short sequences that contain single genes together with short flanking intergenic regions. However, their performance drops dramatically if they are applied to long input sequences. Experiments with semi-artificial sequences showed that GENSCAN tends to predict many more genes than are actually present in genomic sequences.
A major problem in gene prediction is the correct modeling of the intron length distribution for a given organism. Other HMM-based gene-finding programs, such as GENSCAN (2), GENIE (1), DOUBLESCAN (7) and TWINSCAN (8), can only model a geometric intron length distribution, in which the probabilities decline exponentially with the length. This approach is computationally more efficient than explicitly modeling the actual non-geometric length distribution.
However, the assumed geometric intron length distribution is the reason why a single gene is often split into two or more predicted genes (1) and a reason why large introns are very unlikely to be correctly identified.
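To make the exponential decline concrete, the geometric case can be written out. This is a standard textbook sketch rather than a formula taken from the cited programs: it assumes an intron is modeled by a single state with self-transition probability $q$, where $q$ and $L$ are symbols introduced here for illustration.

```latex
% Length L of a segment emitted by one HMM state with self-transition probability q
P(L = \ell) = (1 - q)\, q^{\ell - 1}, \qquad \ell = 1, 2, \dots
% Mean length and tail probability
\mathbb{E}[L] = \frac{1}{1 - q}, \qquad P(L > \ell) = q^{\ell}
```

Under this assumption, once $q$ is fixed to match the mean intron length $m = 1/(1-q)$, the probability of an intron longer than $k$ times the mean is $q^{km} \approx e^{-k}$ for large $m$, which illustrates why very long introns are penalized so heavily.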
| AUGUSTUS: A NEW APPROACH TO HMM-BASED GENE PREDICTION |
|---|
AUGUSTUS is based on a generalized Hidden Markov Model (GHMM). This model defines probability distributions for the various sections of genomic sequences. Introns, exons, intergenic regions and so on correspond to states in the model, and each state is thought to create DNA sequences with certain pre-defined emission probabilities. Like other HMM-based gene finders, AUGUSTUS finds an optimal parse of a given genomic sequence, i.e. a segmentation of the sequence into states that is most likely according to the underlying statistical model. The default version of the model consists of 47 states, of which 23 model genes on the reverse strand and are symmetric copies of the corresponding states that model genes on the forward strand. We use separate probabilistic models for the sequence around the splice sites, the sequence of the branch point region, the bases before the translation start, the coding regions, the non-coding regions, the first coding bases of a gene, the length distributions of single exons, initial exons, internal exons, terminal exons and intergenic regions, the distribution of the number of exons per gene and the length distribution of introns.
In our intron length model, which is described in (9,10), we combine explicit length modeling with a geometric distribution. For introns shorter than a few hundred bases (human 584, Drosophila 929), we use explicit length modeling. Only for introns exceeding this length does the probability decline exponentially, but at a slower rate than if the whole distribution were geometric. In the explicitly modeled part of the distribution, intron lengths have probabilities that have been estimated from observed frequencies. In this way, our program remains computationally efficient but models intron lengths much more realistically than standard approaches do.
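The idea of an explicitly modeled head combined with a geometric tail can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the parametrization used by AUGUSTUS (which is given in (9,10)): the function name `make_intron_length_pmf`, the argument `tail_decay` and the toy probabilities are all hypothetical.

```python
from typing import Callable, Sequence


def make_intron_length_pmf(
    explicit_probs: Sequence[float], tail_decay: float
) -> Callable[[int], float]:
    """Build P(length) from an explicit head and a geometric tail.

    explicit_probs[i] is the (e.g. empirically estimated) probability of
    length i + 1; lengths beyond len(explicit_probs) share the remaining
    probability mass and decline geometrically with ratio tail_decay.
    Both arguments are illustrative assumptions, not AUGUSTUS parameters.
    """
    if not 0.0 < tail_decay < 1.0:
        raise ValueError("tail_decay must lie strictly between 0 and 1")
    head_mass = sum(explicit_probs)
    if head_mass > 1.0 + 1e-12 or any(p < 0.0 for p in explicit_probs):
        raise ValueError("explicit_probs must be non-negative and sum to at most 1")
    threshold = len(explicit_probs)          # last explicitly modeled length
    tail_mass = max(0.0, 1.0 - head_mass)    # mass left for longer introns

    def pmf(length: int) -> float:
        if length < 1:
            return 0.0
        if length <= threshold:
            return explicit_probs[length - 1]
        # Geometric tail over lengths threshold + 1, threshold + 2, ...
        return tail_mass * (1.0 - tail_decay) * tail_decay ** (length - threshold - 1)

    return pmf


# Toy example: three explicitly modeled lengths, then a geometric tail.
pmf = make_intron_length_pmf([0.1, 0.3, 0.2], tail_decay=0.5)
total = sum(pmf(n) for n in range(1, 200))
```

Because the tail only has to account for the mass not already assigned to short introns, its decay rate can be chosen to fit long introns alone, which is why it can decline more slowly than a single geometric distribution fitted to all lengths.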
Our model parameters have been estimated using training sequences with known genes. For the human version we used 1284 single-gene training sequences; for the Drosophila version we used 400 single-gene training sequences. For each species, we use one of 10 different sets of parameters according to the average GC content of the input sequence.
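Selecting a parameter set by GC content can be sketched as follows. The text states only that there are 10 sets per species chosen by average GC content; the class boundaries are not given, so the cut-offs in `HYPOTHETICAL_BOUNDARIES` and the function names below are assumptions made purely for illustration.

```python
from bisect import bisect_right
from typing import Sequence

# Nine ascending cut-offs define ten GC classes. These values are invented
# for illustration; the boundaries used by AUGUSTUS are not stated here.
HYPOTHETICAL_BOUNDARIES = [0.34, 0.38, 0.42, 0.46, 0.50, 0.54, 0.58, 0.62, 0.66]


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A, C, G, T bases."""
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    return sum(base in "GC" for base in bases) / len(bases)


def pick_parameter_set(
    sequence: str, boundaries: Sequence[float] = HYPOTHETICAL_BOUNDARIES
) -> int:
    """Return the index (0 to len(boundaries)) of the GC class of the sequence."""
    return bisect_right(boundaries, gc_content(sequence))
```

With nine boundaries the returned index ranges from 0 to 9, one index per parameter set.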
The performance of AUGUSTUS has been extensively evaluated on sequence data from human and Drosophila (9,10). These studies showed that, especially for long input sequences, our program is considerably more accurate than existing approaches. Table 1 shows the prediction accuracy of AUGUSTUS, GENEID and GENIE on the Drosophila Adh region, which has been carefully annotated and has been used in the Genome Annotation Assessment Project (11).
To make our tool available for the research community, we set up a www server at GOBICS (Göttingen Bioinformatics Compute Server), where AUGUSTUS is accessible through a user-friendly interface.
| WEB SERVER DESCRIPTION |
|---|
The AUGUSTUS web server accepts input in three ways: a single DNA sequence uploaded in FASTA format, several sequences uploaded in multiple FASTA format, or a sequence pasted into the web form. Because the program ignores spaces and digits, the sequence part of a GENBANK record (the part that follows the ORIGIN keyword) can also be pasted into the web form directly.
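A minimal sketch of this kind of input handling is shown below. It is not the server's actual code: the function name `parse_pasted_sequences` and the fallback record name `unnamed` are hypothetical, and the only behavior taken from the text is that FASTA headers separate sequences and that spaces and digits are ignored.

```python
from typing import Dict, List


def parse_pasted_sequences(text: str) -> Dict[str, str]:
    """Split (multi-)FASTA or raw pasted text into named sequences.

    Spaces and digits inside sequence lines are dropped, so the numbered,
    space-separated ORIGIN block of a GenBank record is accepted as well.
    Input without a FASTA header is stored under the name "unnamed".
    """
    parts: Dict[str, List[str]] = {}
    name = "unnamed"
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'.
            words = line[1:].split()
            name = words[0] if words else "unnamed"
            parts.setdefault(name, [])
            continue
        cleaned = "".join(
            ch for ch in line if not ch.isspace() and not ch.isdigit()
        )
        parts.setdefault(name, []).append(cleaned)
    return {record: "".join(chunks) for record, chunks in parts.items()}


# GenBank-style ORIGIN lines: position numbers and blanks are ignored.
origin_block = "        1 gatcctccat atacaacggt\n       21 ccatcaatac\n"
sequences = parse_pasted_sequences(origin_block)
```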
The maximal total length of the sequences submitted to the server is 3 million bp. Currently, AUGUSTUS has two specially trained parameter sets that can be chosen on the web site: human and Drosophila. We can generate parameter sets for other species automatically from annotated GENBANK files of these species and plan to add them to the web site. For the moment, we recommend using the human version also for other vertebrates.
AUGUSTUS reports predicted genes of the input DNA sequence on the forward strand, the reverse strand or on both strands, depending on the user's choice. Usually the default version of the program is the best choice, but in some cases additional evidence about the gene structure suggests deviating from the default program behavior. For these cases the user has two expert options.
The first expert option is a radio-button choice of exactly one of the following:
- predict any number of (possibly partial) genes,
- only predict complete genes,
- only predict complete genes, at least one,
- predict exactly one gene.
The other expert option is a checkbox that, if checked, tells AUGUSTUS to ignore conflicts between the gene structures of the two strands. By default this option is not chosen and AUGUSTUS assumes that genes do not overlap, whether they lie on the same strand or on opposite strands. This assumption is usually satisfied, and making it helps to avoid finding shadow genes, i.e. false positive genes predicted on one strand at a position where the true gene is actually on the other strand. In some cases the assumption is not satisfied, and a gene is contained in an intron of a gene on the other strand, as in Figure 1a. In this case the default setting cannot produce the correct prediction. For the particular pair of nested genes in Figure 1, the default version of AUGUSTUS correctly predicts the included gene but splits the including gene into two predicted genes, as shown in Figure 1b. When there is evidence of nested genes, e.g. derived from expressed sequence tag (EST) alignments, the 'ignore conflicts' option should be chosen. With this option the predictions are made independently on the two strands. In this example the two genes are then predicted almost correctly (Figure 1c).
When one of the expert options is changed from the default setting the maximal total sequence length is 400 kb. This limit will be suspended soon. The running time for a 200 kb input sequence is approximately 30 s when the server is otherwise idle.
| OUTPUT DESCRIPTION |
|---|
AUGUSTUS outputs its results in both graphical and text form. For each sequence, the results page of the web server shows a clickable thumbnail that links to a PostScript image similar to the one in Figure 1. The pictures are generated from the text output with the program gff2ps (13). The text output is in the General Feature Format (GFF) proposed by Richard Durbin and David Haussler. The Sanger Institute lists a large number of tools that work with GFF at http://www.sanger.ac.uk/Software/formats/GFF. In this format the results contain one line per exon, with data fields separated by a TAB character. These fields include the start and end positions of the exon, a name for the sequence, a name for the gene and whether it is on the forward or reverse strand. A detailed description of the output is in the Supplementary Material to this article.
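Reading such tab-separated output can be sketched as follows. The column order (sequence name, source, feature, start, end, score, strand, frame, group) follows the general GFF convention; exactly what AUGUSTUS writes into each column is described in the Supplementary Material, so the class name `GffFeature`, the function `parse_gff` and the example line are assumptions made for illustration.

```python
from typing import List, NamedTuple


class GffFeature(NamedTuple):
    seqname: str   # name of the input sequence
    source: str    # program that produced the feature
    feature: str   # feature type, e.g. an exon
    start: int     # 1-based start position
    end: int       # end position (inclusive)
    score: str     # kept as text because "." means "no score"
    strand: str    # "+" for forward, "-" for reverse
    frame: str     # reading frame or "."
    group: str     # grouping field, e.g. a gene name


def parse_gff(text: str) -> List[GffFeature]:
    """Parse GFF text into features, skipping blank and comment lines."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated fields: {line!r}")
        group = fields[8] if len(fields) > 8 else ""
        features.append(
            GffFeature(fields[0], fields[1], fields[2], int(fields[3]),
                       int(fields[4]), fields[5], fields[6], fields[7], group)
        )
    return features


# Invented example line, not real server output.
example = "seq1\tAUGUSTUS\texon\t100\t250\t.\t+\t0\tg1\n"
exons = parse_gff(example)
exon_length = exons[0].end - exons[0].start + 1
```

Keeping `score` and `frame` as strings avoids guessing how missing values should be converted.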
| FUTURE WORK |
|---|
Currently, the AUGUSTUS web server makes its predictions ab initio, i.e. without making use of external evidence about the gene structure of the input sequence. However, a natural and flexible generalization of the GHMM of AUGUSTUS that allows the integration of uncertain extrinsic information from various sources has already been developed (10). This has been tested with extrinsic information which the program AGRIPPA (14) has constructed from the results of searching the input DNA sequence against protein and EST databases. The approach also allows user constraints such as 'This interval of the sequence must be part of an exon' to be set. A publication presenting the promising results of the integration of EST and protein database search results is in preparation.
During recent years, a number of comparative gene-finding tools have been proposed (15–19). These tools work by comparing genomic sequences from related organisms to each other, e.g. human and mouse. They use the phylogenetic footprinting principle, i.e. they exploit the fact that functionally important parts of sequences are usually more conserved than non-functional parts of the genome. Comparative methods try to identify evolutionarily conserved parts of the sequences and then search for signals such as splice sites near these conserved sequences.
Some authors have combined intrinsic and comparative gene-finding approaches (7,8,20,21). We also plan to utilize the homology information produced by the alignment program DIALIGN (22) for the above-mentioned generalization of AUGUSTUS. DIALIGN has been used in the past for genome sequence analysis; it has been shown that local sequence similarities returned by DIALIGN are highly correlated to protein-coding exons (23). A new version of the program has been implemented that is considerably faster than the original version and can therefore be applied to larger sequence data (24).
| SUPPLEMENTARY MATERIAL |
|---|
Supplementary Material is available at NAR Online.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
| REFERENCES |
|---|
1. Reese,M.G., Kulp,D., Tammana,H. and Haussler,D. (2000) Gene finding in Drosophila melanogaster. Genome Res., 10, 529–538.
2. Burge,C.B. (1997) Identification of genes in human genomic DNA. Ph.D. Thesis, Stanford University, Stanford, CA, USA.
3. Parra,G., Blanco,E. and Guigó,R. (2000) GeneID in Drosophila. Genome Res., 10, 511–515.
4. Rogic,S., Mackworth,A.K. and Ouellette,F.B.F. (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res., 11, 817–832.
5. Claverie,J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
6. Guigó,R., Agarwal,P., Abril,J., Burset,M. and Fickett,J.W. (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Res., 10, 1631–1642.
7. Meyer,I.M. and Durbin,R. (2002) Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics, 18, 1309–1318.
8. Korf,I., Flicek,P., Duan,D. and Brent,M.R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 1 (Suppl. 1), S1–S9.
9. Stanke,M. and Waack,S. (2003) Gene prediction with a hidden Markov model and new intron submodel. Bioinformatics, 19 (Suppl. 2), ii215–ii225.
10. Stanke,M. (2004) Gene prediction with a hidden Markov model. Ph.D. Thesis, University of Göttingen, Germany.
11. Reese,M.G., Hartzell,G., Harris,N.L., Ohler,U., Abril,J.F. and Lewis,S.E. (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res., 10, 391–393.
12. The FlyBase Consortium (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res., 31, 172–175, http://flybase.org/.
13. Abril,J.F. and Guigó,R. (2000) gff2ps: visualizing genomic annotations. Bioinformatics, 16, 743–744.
14. Schöffmann,O. (2003) Gewinnung extrinsischer Informationen zur Genvorhersage und Einbindung in ein Hidden Markov Modell. Diploma thesis, University of Göttingen, Germany.
15. Bafna,V. and Huson,D.H. (2000) The conserved exon method for gene finding. Bioinformatics, 16, 190–202.
16. Batzoglou,S., Pachter,L., Mesirov,J.P., Berger,B. and Lander,E.S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res., 10, 950–958.
17. Rinner,O. and Morgenstern,B. (2002) AGenDA: gene prediction by comparative sequence analysis. In Silico Biol., 2, 195–205.
18. Blayo,P., Rouzé,P. and Sagot,M.-F. (2003) Orphan gene finding – an exon assembly approach. Theor. Comput. Sci., 290, 1407–1431.
19. Wiehe,T., Gebauer-Jung,S., Mitchell-Olds,T. and Guigó,R. (2001) SGP-1: Prediction and validation of homologous genes based on sequence alignments. Genome Res., 11, 1574–1583.
20. Cawley,S., Pachter,L. and Alexandersson,M. (2003) SLAM web server for comparative gene finding and alignment. Nucleic Acids Res., 31, 3507–3509.
21. Parra,G., Agarwal,P., Abril,J.F., Wiehe,T., Fickett,J.W. and Guigó,R. (2003) Comparative gene prediction in human and mouse. Genome Res., 13, 108–117.
22. Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15, 211–218.
23. Morgenstern,B., Rinner,O., Abdeddaïm,S., Haase,D., Mayer,K., Dress,A. and Mewes,H.-W. (2002) Exon discovery by genomic sequence alignment. Bioinformatics, 18, 777–787.
24. Brudno,M., Chapman,M., Göttgens,B., Batzoglou,S. and Morgenstern,B. (2003) BMC Bioinformatics, 4, 66.