| Nucleic Acids Research | Pages |
INFOGENE: a database of known gene structures and predicted genes and proteins in sequences of genome sequencing projects
Known Gene Structures Database
Database Of Predicted Genes
References
ABSTRACT
Large-scale genome sequencing projects currently produce hundreds of megabases each year, and the major sequencing centers are scaling up their throughput over the next few years. Shifting efforts toward sequencing gene-rich rather than random regions might provide the sequence of most human genes during the next 3 years. Moreover, the initiative to create a 'rough draft' of the human genome by 2001 may allow other scientists to proceed more rapidly with discovering disease genes (1). However, the sequence by itself does not always provide knowledge of the gene coding regions, which usually cover only a small fraction of genomic DNA. Nor can we expect these regions to be identified rapidly by purely experimental approaches, given the enormous volume of sequence data. The value of sequence information for the biomedical community will therefore depend strongly on the availability of candidate genes computationally predicted in these sequences.
The aim of this work was to create an information resource of known and predicted gene structures in major model organisms such as human, mouse, Drosophila and Arabidopsis. The structural components of the INFOGENE database are presented in Figure 1.
Figure 1. Structure of the INFOGENE database.
INFOGENE is implemented under the Sequence Retrieval System (SRS), developed at the European Bioinformatics Institute (2). This system makes it possible to connect the database with existing data resources (such as TRRD, Transfac, Swiss-Prot, GenBank, etc.) and to make complex queries over several databases using the WWW server. In SRS, any retrieval command, logical operation on sets obtained by previous queries, link between sets of different databanks, or combination of these can be easily expressed in the SRS query language.
KNOWN GENE STRUCTURES DATABASE
The primary reasons for generating known gene structure databases are to: (i) have a collection of known gene structures with their main features presented in a form convenient for retrieving entries, including some functional characteristics; (ii) easily create subsets of genes or exons with a given set of features; (iii) check the availability of genes with particular features; (iv) have links to different informational databases providing regulatory site locations or other information for a particular gene (polymorphisms or mutations underlying inherited disease, for example); and (v) provide the possibility of making links between similar genes of different model organisms.
Today the problem of reliable gene prediction in human genomic DNA is still open. The best multiple gene prediction programs, such as GeneScan (3) (probabilistic approach) and Fgenes (http://genomic.sanger.ac.uk/gf/gf.html) [pattern recognition approach (4,6)], were tested mostly on short sequences containing one gene. A recent test of these programs on 660 human genes shows that they correctly predict ~80% of internal exons and only ~60% of 5′-exons. The prediction of multiple genes should be even less accurate.
Therefore, to develop better gene prediction programs it is important to have as much information as possible about known genes and their functional signals, which will provide the learning and testing datasets. We have developed a GenBank (5) parser, GeneParse, which produces a flat file describing genes and gene features, including terms corresponding to exon types, regulatory elements, processes and characteristics of genes in a given GenBank sequence. To add this information to SRS we created several files defining the logical structure of the INFOGENE database components and files defining the syntax of their entries. Using these files, the information about gene structure was written to SRS with indexing of specific words in the entries.
Figure 2. Example of an INFOGENE entry corresponding to the MMTNFAB GenBank locus. The description of the first gene of this locus is presented. Features of coding regions (CDS field) are: (i) start; (ii) end; (iii) start codon/acceptor splice site short consensus; (iv) stop codon/donor splice site short consensus; (v) type of exon: f, i, l and o denote initial, internal, terminal and single CDS, respectively. A table of all codes and their explanations is available at the database main page, http://genomic.sanger.ac.uk/db.html.
The query language and search/retrieval software of SRS can quickly extract sets of sequences with particular biological features, for example, genes for which the transcription start and stop sites are known, or entries with multiple genes. The query language makes effective use of the database information in investigating significant characteristics of genes and their regulatory elements, and assists in developing methods for their recognition. Currently it might take days to collect such information from the literature and visual inspection of GenBank entries.
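The five CDS fields listed in the Figure 2 caption can be pictured as a small record type. The following is a minimal Python sketch, not the GeneParse or SRS implementation: the whitespace-separated layout, the names `CdsExon`, `parse_cds_line` and `exons_of_type`, and the sample records are all assumptions made for illustration; only the exon type codes f, i, l and o come from the caption.

```python
from dataclasses import dataclass

# Exon type codes as listed in the Figure 2 caption.
EXON_TYPES = {"f": "initial", "i": "internal", "l": "terminal", "o": "single CDS"}


@dataclass(frozen=True)
class CdsExon:
    start: int
    end: int
    left_signal: str   # start codon or acceptor splice site short consensus
    right_signal: str  # stop codon or donor splice site short consensus
    exon_type: str     # one of f, i, l, o


def parse_cds_line(line: str) -> CdsExon:
    """Parse one CDS record; the whitespace-separated layout is hypothetical."""
    start, end, left, right, code = line.split()
    if code not in EXON_TYPES:
        raise ValueError(f"unknown exon type code: {code!r}")
    return CdsExon(int(start), int(end), left, right, code)


def exons_of_type(exons, code):
    """Return the exons carrying the given type code, preserving order."""
    return [e for e in exons if e.exon_type == code]


# Invented records for illustration only; not taken from a real INFOGENE entry.
records = [
    "101 250 atg gt f",
    "400 520 ag gt i",
    "700 810 ag tga l",
]
exons = [parse_cds_line(r) for r in records]
internal = exons_of_type(exons, "i")
```

Selecting a subset of exons by type, as in `exons_of_type`, corresponds to the kind of feature-based subset creation described above; in INFOGENE itself this is done through SRS queries rather than ad hoc code.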
The current release of INFOGENE contains completely sequenced genes of the following model organisms: human (1835 genes), mouse (1038), Drosophila (970) and Arabidopsis (1726). An example of an INFOGENE entry corresponding to the MMTNFAB locus of GenBank is presented in Figure 2.
Figure 3. Example of an INFOGENE entry corresponding to a predicted gene in the HSCPH70 sequence. The description of the first gene of this locus is presented. Locus features (LFT field): (i) 'mang/oneg', if multiple genes (mang) or a single gene (oneg) are predicted in the locus; (ii) 'ytss', if at least one transcription start site is predicted in the locus, otherwise 'ntss'; (iii) 'sexn', if at least one predicted gene consists of a single exon, otherwise 'mexn'. A table of all codes and their explanations is available at http://genomic.sanger.ac.uk/db.html.
DATABASE OF PREDICTED GENES
The primary reason for generating predicted gene structure databases is to provide positional cloners, gene hunters and others with the gene candidates observed in finished and unfinished genomic sequences. Recently a broad agreement has been reached among the major genome centers and funding agencies in the US, and the Sanger Centre and the Wellcome Trust in the UK, to go ahead with a plan that will deliver all of the human sequence, part finished and part in draft, into the public domain by the end of 2001. Using gene prediction, the scientific community can start experimental work with most human genes during the next 3 years, because gene finding programs usually predict correctly at least the major part of the exons in a gene sequence. Our experience shows that the accuracy of predictions is significantly lower for long genomic sequences than in the usually presented tests with single genes (decreasing by roughly 10-20%, with a high rate of false positive predictions). However, exons predicted simultaneously by several programs based on different approaches correspond to the real ones much more often than those predicted by a single program.
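The effect of intersecting the outputs of two programs can be made concrete with a small calculation. The Python sketch below uses invented exon coordinates, not data from the tests reported in this article, and the names `program_a` and `program_b` are placeholders. It treats an exon as correctly predicted only when both boundaries match exactly, takes sensitivity as the fraction of real exons recovered, and takes specificity as true predicted divided by all predicted.

```python
def sensitivity(predicted: set, real: set) -> float:
    """Fraction of real exons that were predicted exactly."""
    return len(predicted & real) / len(real) if real else 0.0


def specificity(predicted: set, real: set) -> float:
    """Fraction of predicted exons that are real (true predicted / all predicted)."""
    return len(predicted & real) / len(predicted) if predicted else 0.0


# Exons as (start, end) pairs; all coordinates are invented for illustration.
real = {(100, 200), (300, 400), (500, 600), (700, 800), (900, 1000)}
program_a = {(100, 200), (300, 400), (500, 600), (700, 800),
             (1200, 1300), (1500, 1600)}
program_b = {(100, 200), (300, 400), (500, 600), (900, 1000),
             (2000, 2100), (2200, 2300)}

# Keep only the exons that both programs predict with identical boundaries.
consensus = program_a & program_b

for name, pred in [("A", program_a), ("B", program_b), ("A and B", consensus)]:
    print(f"{name}: sensitivity={sensitivity(pred, real):.2f} "
          f"specificity={specificity(pred, real):.2f}")
```

In this toy case the false positives of the two programs do not overlap, so the consensus set loses some sensitivity but gains specificity, which is the qualitative trade-off described above.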
For example, Fgenes and GeneScan exactly predict ~80% of real exons from 38 long or multigene genomic sequences, with a specificity of 65% (true predicted/all predicted). If we take the subset of exons predicted by both programs, the observed specificity is 92% and this set includes ~70% of all real exons.
We have used two of our programs, Fgenes-p (4,6) and Fgenes-h [an HMM-based approach similar to GeneScan (3)], to predict genes in genomic sequences. A Blast (7) search is used to check whether any of the predicted exons have similarity with known EST and protein sequences. Possible repeats in the sequence were annotated using the RepeatMasker program (Smit and Green, unpublished; http://genome.washington.edu/RM/RepeatMasker.html).
The current release of INFOGENE contains 768 finished and 3698 unfinished loci. These sequences were produced by the Sanger Centre Human Genome Project. We plan to include in the database predicted genes in finished and unfinished sequences from other sequencing centers. An example of a description of a predicted gene is presented in Figure 3.
The INFOGENE database is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/db.html. Users wishing to cite INFOGENE are asked to refer to this article.
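The locus feature codes given in the Figure 3 caption form three either/or pairs, which can be decoded into named flags. The Python sketch below is an illustration only: it assumes the LFT field holds whitespace-separated codes, and the flag names and the `decode_lft` function are hypothetical. Only the six codes from the caption are covered; the full code table is the one referenced at the database main page.

```python
# The six LFT codes from the Figure 3 caption, mapped to hypothetical flag names.
LFT_CODES = {
    "mang": ("multiple_genes", True),
    "oneg": ("multiple_genes", False),
    "ytss": ("has_predicted_tss", True),
    "ntss": ("has_predicted_tss", False),
    "sexn": ("has_single_exon_gene", True),
    "mexn": ("has_single_exon_gene", False),
}


def decode_lft(field: str) -> dict:
    """Decode an LFT field (assumed whitespace-separated) into boolean flags."""
    flags = {}
    for code in field.split():
        if code not in LFT_CODES:
            raise ValueError(f"unknown LFT code: {code!r}")
        name, value = LFT_CODES[code]
        if name in flags:
            # Two codes from the same pair (e.g. mang and oneg) cannot both apply.
            raise ValueError(f"conflicting LFT codes for {name}")
        flags[name] = value
    return flags


# A locus with several predicted genes, a predicted transcription start site
# and no single-exon gene (example value invented for illustration).
flags = decode_lft("mang ytss mexn")
```

A filter such as "loci with multiple predicted genes and at least one predicted transcription start site" then reduces to checking two flags in the decoded dictionary.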
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 9 Dec 1998
Copyright©Oxford University Press, 1998.
Articles by Solovyev, V. V.
Articles by Salamov, A. A.