
Nucleic Acids Research Pages 248-250  




INFOGENE: a database of known gene structures and predicted genes and proteins in sequences of genome sequencing projects


Victor V. Solovyev* and Asaf A. Salamov

The Sanger Centre, Hinxton, Cambridge CB10 1SA, UK

Received September 1, 1998; Revised September 30, 1998; Accepted November 3, 1998

ABSTRACT

INFOGENE is a database of known and predicted gene structures with descriptions of basic functional signals and gene components. It allows users to create compilations of sequences with a given gene feature, and to accumulate and analyze predicted genes in finished and unfinished sequences from genome sequencing projects. Protein sequence similarity searches in the database of predicted proteins are offered through the BLASTP program. INFOGENE is implemented under the Sequence Retrieval System, which provides useful links to other informational databases. The database is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/db.html

Large scale genome sequencing projects currently produce hundreds of megabases each year. The major sequencing centers are in the process of scaling up their throughput over the next few years. Shifting efforts toward sequencing gene-rich rather than random regions might provide the sequence of most human genes during the next 3 years. Moreover, the initiative to create a 'rough draft' of the human genome by 2001 may allow other scientists to proceed more rapidly with discovering disease genes (1). However, the sequence itself does not reveal the gene coding regions, which usually cover only a small fraction of genomic DNA. Nor can we expect these regions to be identified rapidly by purely experimental approaches for such an enormous volume of sequence data. The value of sequence information for the biomedical community will therefore depend strongly on the availability of candidate genes computationally predicted in these sequences.

The aim of this work was to create an information resource of known and predicted gene structures in major model organisms such as human, mouse, Drosophila and Arabidopsis. The structural components of the INFOGENE database are presented in Figure 1.


Figure 1. Structure of the INFOGENE database.

INFOGENE is implemented under the Sequence Retrieval System (SRS), developed at the European Bioinformatics Institute (2). This system makes it possible to connect the database with existing data resources (such as TRRD, Transfac, Swiss-Prot, GenBank, etc.) and to make complex queries over several databases using the WWW server. In SRS, any retrieval command, logical operation on sets obtained by previous queries, link between sets of different databanks, or combination of these can be easily expressed in the SRS query language.
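The idea of combining sets obtained by previous queries can be illustrated with ordinary Python sets. This is only a sketch of the set logic, not SRS query syntax: the three result sets and their entry identifiers are invented placeholders.

```python
# Hypothetical result sets from three earlier queries. The identifiers are
# placeholders and do not correspond to real INFOGENE or GenBank entries.
human_genes = {"ENTRY1", "ENTRY2", "ENTRY3", "ENTRY4"}
known_tss = {"ENTRY2", "ENTRY3", "ENTRY5"}
multi_gene_loci = {"ENTRY3", "ENTRY6"}

# Intersection: human genes with a known transcription start site.
human_with_tss = human_genes & known_tss

# Union: entries that satisfy either condition.
tss_or_multi = known_tss | multi_gene_loci

# Difference: human genes with a known start site, excluding multi-gene loci.
single_gene_human_tss = human_with_tss - multi_gene_loci

print(sorted(human_with_tss))
print(sorted(tss_or_multi))
print(sorted(single_gene_human_tss))
```

Each derived set can itself be reused in a later combination, which is the property that makes stepwise refinement of a query convenient.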

KNOWN GENE STRUCTURES DATABASE

The primary reasons for generating a database of known gene structures are to: (i) have a collection of known gene structures with their main features, including some functional characteristics, presented in a form convenient for retrieving entries; (ii) easily create subsets of genes or exons with a given set of features; (iii) check availability of genes with particular features; (iv) have links to different informational databases providing regulatory site locations or other information for a particular gene (polymorphism or mutations underlying inherited disease, for example); and (v) provide the possibility to make links between similar genes of different model organisms.

Today the problem of reliable gene prediction in human genomic DNA is still open. The best multiple gene prediction programs, such as GeneScan (3) (probabilistic approach) and Fgenes (http://genomic.sanger.ac.uk/gf/gf.html) [pattern recognition approach (4,6)], were tested mostly on short sequences containing one gene. A recent test of these programs on 660 human genes shows that they correctly predict ~80% of internal exons and just ~60% of 5′-exons. The prediction of multiple genes should be even less accurate. Therefore, for developing further gene prediction programs it is important to have as much information as possible about known genes and their functional signals, which will provide the learning and testing datasets.

We have developed a GenBank (5) parser, GeneParse, which produces a flat file describing genes and gene features, including terms corresponding to exon types, regulatory elements, processes and characteristics of genes in a given GenBank sequence. To add this information to SRS we created several files defining the logical structure of the INFOGENE database components and files defining the syntax of their entries. Using these files, the information about gene structure was written to SRS with indexing of specific words in the entries.


Figure 2. Example of an INFOGENE entry corresponding to the MMTNFAB GenBank locus. Description of the first gene of this locus is presented. Features of coding regions (CDS field) are: (i) start; (ii) end; (iii) start codon/acceptor splice site short consensus; (iv) stop codon/donor splice site short consensus; (v) type of exon: f, i, l and o denote the initial, internal, terminal and single CDS, respectively. A table of all codes and their explanations is available at the database main page http://genomic.sanger.ac.uk/db.html
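The five CDS features listed in the caption map naturally onto a small record type. The sketch below assumes, purely for illustration, that a CDS record is a single line of five whitespace-separated columns with 1-based inclusive coordinates; the actual flat-file layout is not specified here, and the example record is invented.

```python
from dataclasses import dataclass

# Exon type codes as listed in the Figure 2 caption.
EXON_TYPES = {"f": "initial", "i": "internal", "l": "terminal", "o": "single"}


@dataclass
class CdsFeature:
    start: int
    end: int
    start_signal: str  # start codon or acceptor splice site short consensus
    end_signal: str    # stop codon or donor splice site short consensus
    exon_type: str     # one of f, i, l, o

    @property
    def length(self) -> int:
        # Assumes 1-based inclusive coordinates (an assumption of this sketch).
        return self.end - self.start + 1

    @property
    def exon_type_name(self) -> str:
        return EXON_TYPES[self.exon_type]


def parse_cds(line: str) -> CdsFeature:
    """Parse one CDS record, ASSUMING five whitespace-separated columns."""
    start, end, start_signal, end_signal, exon_type = line.split()
    if exon_type not in EXON_TYPES:
        raise ValueError(f"unknown exon type code: {exon_type!r}")
    return CdsFeature(int(start), int(end), start_signal, end_signal, exon_type)


# Invented example record, not copied from the database.
exon = parse_cds("1201 1386 ag gt i")
print(exon.exon_type_name, exon.length)
```

Rejecting unknown exon type codes early keeps a malformed record from silently entering a training or test set.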

We can use the query language and search/retrieval software of SRS to quickly extract sets of sequences with particular biological features, for example genes where transcription start and stop sites are known, or entries with multiple genes. The query language makes effective use of database information in investigating significant characteristics of genes and their regulatory elements, and assists in developing methods for their recognition. Currently it might take days to collect such information from the literature and from visual analysis of GenBank entries. The current release of INFOGENE contains completely sequenced genes of the following model organisms: human (1835 genes), mouse (1038), Drosophila (970) and Arabidopsis (1726).

One example of an INFOGENE entry, corresponding to the MMTNFAB locus of GenBank, is presented in Figure 2. This locus includes two neighboring genes, whose exons and coding regions are characterized, as well as the locations of the start, TATA-box and stop of transcription. In the LFT (Locus Features) field we have described this sequence with special keywords: mang (locus includes many genes), nasp (no alternative splicing), nmts (no multiple starts of transcription), natp (no alternative promoters), yftr (yes full transcript), npse (no pseudogenes). For example, using the ytss keyword we can retrieve a set of genes with known start of transcription; this is described for 251 human genes with completely sequenced coding regions. The full description of all keywords is presented on the web pages of INFOGENE.
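Keyword-based retrieval of this kind can be sketched in a few lines of Python. The keyword meanings below are taken from the text; the locus names and their keyword lists are invented, and the in-memory dictionary merely stands in for the indexed database.

```python
# LFT keyword meanings as given in the text.
LFT_KEYWORDS = {
    "mang": "locus includes many genes",
    "nasp": "no alternative splicing",
    "nmts": "no multiple starts of transcription",
    "natp": "no alternative promoters",
    "yftr": "yes full transcript",
    "npse": "no pseudogenes",
    "ytss": "known start of transcription",
}


def describe(lft):
    """Expand a list of LFT keywords into readable descriptions."""
    return [LFT_KEYWORDS.get(keyword, "unknown keyword") for keyword in lft]


def select(entries, keyword):
    """Return the names of entries whose LFT field contains the keyword."""
    return [name for name, lft in entries.items() if keyword in lft]


# Invented entries; real loci carry their own keyword sets.
entries = {
    "LOCUS_A": ["mang", "nasp", "ytss"],
    "LOCUS_B": ["nasp", "npse"],
    "LOCUS_C": ["ytss", "yftr"],
}

print(select(entries, "ytss"))
print(describe(entries["LOCUS_A"]))
```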

DATABASE OF PREDICTED GENES


Figure 3. Example of an INFOGENE entry corresponding to a predicted gene in the HSCPH70 sequence. Description of the first gene of this locus is presented. Locus features (LFT field): (i) 'mang/oneg', if multiple (mang) or single (oneg) genes are predicted in the locus; (ii) 'ytss', if at least one transcription start site is predicted in the locus, otherwise 'ntss'; (iii) 'sexn', if at least one predicted gene consists of a single exon, otherwise 'mexn'. A table of all codes and their explanations is available at http://genomic.sanger.ac.uk/db.html

The primary reason for generating a database of predicted gene structures is to provide positional cloners, gene hunters and others with the gene candidates observed in finished and unfinished genomic sequences. Recently a broad agreement has been reached amongst major genome centers and funding agencies in the US, the Sanger Centre and the Wellcome Trust in the UK to go ahead with a plan that will deliver all of the human sequence, part finished and part in draft, into the public domain by the end of 2001. Using gene prediction, the scientific community can start experimental work with most human genes during the next 3 years, because gene finding programs usually correctly predict at least the major part of the exons in a gene sequence. Our experience shows that the accuracy of predictions is significantly lower for long genomic sequences than in the usually presented tests with single genes (decreasing by 10-20%, with a high rate of false positive predictions). However, exons predicted simultaneously by several programs based on different approaches correspond to the real ones much more often than those predicted by a single program. For example, Fgenes and GeneScan exactly predict ~80% of real exons in 38 long or multigene genomic sequences with a specificity of 65% (true predicted/all predicted). If we take the subset of exons predicted by both programs, then the observed specificity is 92% and this set includes ~70% of all real exons.
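The trade-off described above, where the consensus of two programs gains specificity at some cost in the fraction of real exons recovered, can be made concrete with a small calculation. The sketch below uses the definition of specificity given in the text (true predicted/all predicted) and treats an exon as correct only when both coordinates match exactly. All exon coordinates are invented toy data, and `program_a` and `program_b` are placeholders rather than output of Fgenes or GeneScan.

```python
def exon_accuracy(predicted, real):
    """Return (sensitivity, specificity) for exactly matching exons.

    sensitivity = true predicted / all real
    specificity = true predicted / all predicted
    """
    true_predicted = predicted & real
    sensitivity = len(true_predicted) / len(real) if real else 0.0
    specificity = len(true_predicted) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Exons as (start, end) pairs; every coordinate here is invented.
real = {(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)}
program_a = {(100, 200), (300, 400), (500, 600), (700, 800),
             (1100, 1200), (1300, 1400)}
program_b = {(100, 200), (300, 400), (500, 600), (905, 1000), (1500, 1600)}

# Exons predicted identically by both programs.
consensus = program_a & program_b

for name, exons in [("A", program_a), ("B", program_b), ("both", consensus)]:
    sensitivity, specificity = exon_accuracy(exons, real)
    print(f"{name}: sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

In this toy case the consensus set contains no false positives but misses exons that only one program found, which mirrors the direction of the effect reported in the text.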

We have used two of our programs, Fgenes-p (4,6) and Fgenes-h [HMM based approach similar to GeneScan (3)], to predict genes in genomic sequences. A Blast (7) search is used to check whether any of the predicted exons have similarity to known EST and protein sequences. Possible repeats in the sequence were annotated using the RepeatMasker program (Smit and Green, unpublished; http://genome.washington.edu/RM/RepeatMasker.html).

The current release of INFOGENE contains 768 finished and 3698 unfinished loci. These sequences were produced by The Sanger Centre Human Genome Project. We plan to include in the database predicted genes in finished and unfinished sequences from other sequencing centers.

An example of a description of a predicted gene is presented in Figure 3. Fgenes-h predicts four coding exons, three correct and one partially correct. All identical exons predicted by both programs are correct. The keyword 'both' in the CDS field marks such exons, and they often correspond to real ones, as discussed above. Other features that increase our confidence in predicted exons are produced by searching EST and protein databases. If any significant similarity is found, it is presented in the HOP (for protein homology) and HOE (for EST homology) database fields. Additional fields HPP and HPE provide information about the similarities found. Features of protein homology (HPP field) are: (i) and (ii) the first and the last aligned positions of the exon, respectively; (iii) and (iv) the first and the last aligned positions of the database protein, respectively; (v) the length of the database protein; (vi) the score of the alignment calculated by BLASTP; (vii) sequence identity; (viii) E-value from the BLASTP output. Similar features are presented for EST similarity (HPE field).
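The eight HPP features can be read into a typed record and used to derive simple quantities, such as how much of the database protein the alignment covers. The sketch below assumes that an HPP record is one line of eight whitespace-separated values in the order listed in the text and that identity is given as a percentage; both are assumptions of this example, and the sample record is invented.

```python
from typing import NamedTuple


class ProteinHit(NamedTuple):
    exon_first: int       # (i) first aligned position of the exon
    exon_last: int        # (ii) last aligned position of the exon
    protein_first: int    # (iii) first aligned position of the database protein
    protein_last: int     # (iv) last aligned position of the database protein
    protein_length: int   # (v) length of the database protein
    score: float          # (vi) alignment score calculated by BLASTP
    identity: float       # (vii) sequence identity (assumed to be a percentage)
    e_value: float        # (viii) E-value from the BLASTP output

    def protein_coverage(self) -> float:
        """Fraction of the database protein covered by the alignment."""
        aligned = self.protein_last - self.protein_first + 1
        return aligned / self.protein_length


def parse_hpp(line: str) -> ProteinHit:
    """Parse one HPP record, ASSUMING eight whitespace-separated values."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    integers = [int(value) for value in fields[:5]]
    floats = [float(value) for value in fields[5:]]
    return ProteinHit(*integers, *floats)


# Invented record for illustration only.
hit = parse_hpp("1 60 41 100 240 118 85.0 2e-27")
print(hit.protein_coverage(), hit.e_value)
```

A low coverage value together with a significant E-value is what one would expect when a single exon matches only part of a longer protein.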

The INFOGENE database is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/db.html . Users wishing to cite INFOGENE are asked to refer to this article.

REFERENCES

1. Wadman,M. (1998) Nature, 393, 399-400.

2. Etzold,T., Ulyanov,A. and Argos,P. (1996) In Doolittle,R. (ed.), Methods in Enzymology, vol. 266, pp. 114-128.

3. Burge,C. and Karlin,S. (1997) J. Mol. Biol., 268, 78-94.

4. Solovyev,V.V., Salamov,A.A. and Lawrence,C.B. (1994) Nucleic Acids Res., 22, 5156-5163.

5. Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J. and Ouellette,B.F.F. (1998) Nucleic Acids Res., 26, 1-7.

6. Solovyev,V.V. and Salamov,A.A. (1997) In Rawling,C., Clark,D., Altman,R., Hunter,L., Lengauer,T. and Wodak,S. (eds), Proceedings of ISMB. AAAI Press, Halkidiki, Greece, pp. 294-302.

7. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389-3402.


*To whom correspondence should be addressed. Tel: +44 1223 494799; Fax: +44 1223 494919; Email: solovyev@sanger.ac.uk


This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.




