Nucleic Acids Research, 2002, Vol. 30, No. 1 299-300
© 2002 Oxford University Press
SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein
Max-Planck-Institute for Molecular Genetics, Computational Molecular Biology, Ihnestrasse 73, 14195 Berlin, Germany
Received September 21, 2001; Accepted October 1, 2001.
| ABSTRACT |
|---|
We have integrated the protein families from SYSTERS and the expressed sequence tag (EST) clusters from our database GeneNest with SpliceNest, a new database mapping EST contigs into genomic DNA. The SYSTERS protein sequence cluster set provides an automatically generated classification of all sequences of the SWISS-PROT, TrEMBL and PIR databases into disjoint protein family and superfamily clusters. GeneNest is a database and software package for producing and visualizing gene indices from ESTs and mRNAs. Currently, the database comprises gene indices of human, mouse, Arabidopsis thaliana and zebrafish. SpliceNest is a web-based graphical tool to explore gene structure, including alternative splicing, based on a mapping of the EST consensus sequences from GeneNest to the complete human genome. The integration of SYSTERS, GeneNest and SpliceNest into one framework now permits an overall exploration of the whole sequence space covering protein, mRNA and EST sequences, as well as genomic DNA. The databases are available for querying and browsing at http://cmb.molgen.mpg.de.
| INTEGRATED DATABASES |
|---|
SYSTERS
The SYSTERS protein sequence cluster set (1) consists of the hierarchical classification of all known sequences from the SWISS-PROT (2), TrEMBL and PIR (3) sequence databases into disjoint protein family clusters and superfamilies. The classification is based on an all-against-all database search using gapped BLAST (4) with a subsequent hierarchical clustering. The sequences in every cluster have been multiply aligned using CLUSTALW (5) and for each cluster an unrooted phylogenetic tree is available. All multiple alignments are annotated with known domains from the Pfam database of protein domain families (6) and clusters can be selected directly from a list of Pfam domains. A new protein sequence can be searched against the database of multiple alignments using the similarity searching tool SSMAL (7). For each cluster, an MView (8) output is generated and from the resulting partial multiple alignment a majority consensus sequence is calculated. All consensus sequences together build a database searchable with BLAST. Precomputed BLAST searches of the GeneNest consensus sequences against the SYSTERS protein consensus sequences were evaluated to generate links from SYSTERS to GeneNest and vice versa.
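The majority consensus step can be illustrated with a minimal sketch. It assumes a simple column-wise rule (most frequent symbol per column, columns dominated by gaps dropped, `X` where no residue reaches a majority); the function name, the `min_fraction` threshold and the toy alignment are illustrative assumptions and do not reproduce the exact rule applied to the MView output.

```python
from collections import Counter


def majority_consensus(alignment, gap="-", min_fraction=0.5):
    """Column-wise majority consensus of equal-length aligned rows.

    Assumed rule (not taken from the paper): the most frequent symbol of a
    column wins; gap-dominated columns are skipped; 'X' marks columns where
    the winning residue covers less than min_fraction of the rows.
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        if symbol == gap:
            continue  # mostly gaps: column contributes nothing
        consensus.append(symbol if count / len(column) >= min_fraction else "X")
    return "".join(consensus)


# Toy alignment (invented sequences)
alignment = [
    "MKT-LLV",
    "MKTALLV",
    "MRT-LIV",
]
print(majority_consensus(alignment))
```

Collecting one such string per cluster would give a set of consensus sequences that can be formatted as a BLAST-searchable database.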
GeneNest
GeneNest (9) is a database and software package for the generation and visualization of gene indices based on EST and mRNA sequences. Currently, the database comprises gene indices of human (based on UniGene), mouse, Arabidopsis thaliana and zebrafish. All cDNA/mRNA sequences related to an organism are extracted either directly from the EMBL (10) database or from an already clustered UniGene (11) database. A preprocessing step includes vector clipping, repeat annotation and marking of regions of low sequence quality, so that processing is restricted to data of high quality. In further steps, these sequences are clustered and all members of each cluster are assembled into one or more contigs. Roughly speaking, each cluster represents a single gene, whereas contigs of a cluster reflect different transcripts of that gene. A schematic view of the assembled clusters is presented on the GeneNest web site. Detailed information about sequences and their preprocessing results, as well as information about open reading frames, similarities between clusters or protein homologies, can be accessed interactively. GeneNest can be queried using BLAST against the consensus sequences or by keyword search. GeneNest is tightly linked to SYSTERS and SpliceNest as well as to external resources such as EMBL.
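As a rough illustration of the preprocessing step, the sketch below masks annotated regions (repeats or low-quality stretches) and clips flanking vector sequence from a raw read. The function name, the half-open coordinate convention and the use of `N` as mask character are assumptions made for this example, not a description of the GeneNest software itself.

```python
def preprocess_read(sequence, insert_span, annotated_regions, mask_char="N"):
    """Mask annotated regions, then clip the read to its vector-free insert.

    sequence          -- raw EST/mRNA read
    insert_span       -- (start, end) of the vector-free part, half-open
    annotated_regions -- iterable of (start, end) intervals to mask
                         (e.g. repeats, low-quality stretches), half-open,
                         in coordinates of the raw read
    """
    masked = list(sequence)
    for start, end in annotated_regions:
        # Clamp each interval to the read so bad annotations cannot overflow
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char
    clip_start, clip_end = insert_span
    return "".join(masked[clip_start:clip_end])


# Invented read: 4 bases of vector on each side, one repeat at 12..16
raw = "GGGGACGTACGTTTTTACGTCCCC"
print(preprocess_read(raw, insert_span=(4, 20), annotated_regions=[(12, 16)]))
```

Masking rather than deleting keeps coordinates within the insert stable, which is convenient when the masked read is later clustered and assembled.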
SpliceNest
SpliceNest (12) is a web-based graphical tool to explore gene structure based on a mapping of the expressed sequence tag (EST) consensus sequences (contigs) from GeneNest to the complete human genome. Assuming that a cluster normally represents a single gene, every contig of a cluster is aligned separately to the same genomic region, using the spliced alignment program sim4 (13). Differences between the contigs may correspond to alternative splicing, but they can also be due to low sequence quality, genomic contamination or other artifacts. The alignments are visualized in a diagram showing the exon/intron structure of all contigs of a single cluster (i.e. gene) simultaneously, mapped on the common genomic sequence. Exons are represented as colored bars and introns as arrows. The visualization facilitates the identification of genuine splice variants. Furthermore, candidate loci of alternative splicing are automatically identified and highlighted. If a cluster has several matches in the genome, a ranked list of all matches is provided. Each contig is linked to the corresponding GeneNest assembly, giving easy access to information about individual EST and mRNA sequences. Other links point to detailed alignments, related entries in the EMBL database or raw sequences. A toolbar allows zooming into the alignment. The current version of SpliceNest uses the GeneNest assembly based on human UniGene and the Golden Path genomic sequence (14).
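One simple way to flag candidate loci of alternative splicing from such alignments is sketched below, under the assumption that each contig is given as a list of exon intervals on the common genomic sequence. An intron of one contig that overlaps an exon of another contig is reported as a candidate (for example exon skipping or intron retention). The criterion, function names and coordinates are illustrative assumptions; the actual SpliceNest rules may differ.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons (half-open intervals)."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def candidate_splice_loci(contigs):
    """contigs maps a contig name to its exon intervals on the genome.

    Yields (contig, intron, other_contig) whenever an intron of `contig`
    overlaps an exon of `other_contig`.
    """
    candidates = []
    for name, exons in contigs.items():
        for intron in introns_from_exons(exons):
            for other, other_exons in contigs.items():
                if other == name:
                    continue
                if any(overlaps(intron, exon) for exon in other_exons):
                    candidates.append((name, intron, other))
    return candidates


# Invented example: contig B lacks the middle exon present in contig A
contigs = {
    "A": [(100, 200), (300, 400), (500, 600)],
    "B": [(100, 200), (500, 600)],
}
print(candidate_splice_loci(contigs))
```

Such candidates still need inspection, since low sequence quality or genomic contamination can produce the same pattern as a genuine splice variant.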
| SUMMARY |
|---|
The three otherwise independent databases GeneNest, SpliceNest and SYSTERS are now fully linked with each other and to other major databases (Fig. 1). This allows navigating, e.g. from a protein to its UniGene cluster assembly and on to its genomic position and structure. Alternatively, one might enter via a sorted list of UniGene clusters on a chromosome and link from a particular cluster to its gene product in the context of a protein family. Thus, the linking of these databases facilitates navigation of sequence space between genomic DNA and protein sequences and families.
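The two navigation paths described here can be pictured as lookups across linked tables. The sketch below uses plain dictionaries with invented identifiers and coordinates; it is only meant to show the direction of the links, not the actual schema or interface of the databases.

```python
# All identifiers and coordinates are invented placeholders.
PROTEIN_TO_CLUSTER = {"PROT_A": "CLUSTER_1", "PROT_B": "CLUSTER_2"}
CLUSTER_TO_LOCUS = {
    "CLUSTER_1": ("chr1", 5000, 9000),
    "CLUSTER_2": ("chr1", 1000, 3000),
}
CLUSTER_TO_FAMILY = {"CLUSTER_1": "FAMILY_X", "CLUSTER_2": "FAMILY_Y"}


def locus_for_protein(protein_id):
    """Protein -> EST cluster -> genomic position (None if unlinked)."""
    cluster = PROTEIN_TO_CLUSTER.get(protein_id)
    if cluster is None:
        return None
    return CLUSTER_TO_LOCUS.get(cluster)


def families_along_chromosome(chromosome):
    """Clusters on a chromosome, sorted by start, with their protein family."""
    hits = [
        (start, cluster)
        for cluster, (chrom, start, _end) in CLUSTER_TO_LOCUS.items()
        if chrom == chromosome
    ]
    return [(cluster, CLUSTER_TO_FAMILY.get(cluster)) for _start, cluster in sorted(hits)]


print(locus_for_protein("PROT_A"))
print(families_along_chromosome("chr1"))
```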
| ACKNOWLEDGEMENTS |
|---|
We acknowledge financial support from Bundesministerium für Bildung und Forschung (BMBF) and Deutsches Human Genom Projekt (DHGP).
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel: +49 30 8413 1404; Fax: +49 30 8413 1152; Email: krause_a{at}molgen.mpg.de
| REFERENCES |
|---|
1 Krause,A. and Vingron,M. (1998) A set-theoretic approach to database searching and clustering. Bioinformatics, 14, 430–438.
2 Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
3 Barker,W.C., Garavelli,J.S., Hou,Z., Huang,H., Ledley,R.S., McGarvey,P.B., Mewes,H.W., Orcutt,B.C., Pfeiffer,F., Tsugita,A. et al. (2001) Protein Information Resource: a community resource for expert annotation of protein data. Nucleic Acids Res., 29, 29–32. Updated article in this issue: Nucleic Acids Res. (2002), 30, 35–37.
4 Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
5 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
6 Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Howe,K.L. and Sonnhammer,E.L.L. (2000) The Pfam protein families database. Nucleic Acids Res., 28, 263–266. Updated article in this issue: Nucleic Acids Res. (2002), 30, 276–280.
7 Nicodème,P. (1998) SSMAL: similarity searching with alignment graphs. Bioinformatics, 14, 508–515.
8 Brown,N.P., Leroy,C. and Sander,C. (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14, 380–381.
9 Haas,S.A., Beissbarth,T., Rivals,E., Krause,A. and Vingron,M. (2000) GeneNest: automated generation and visualization of gene indices. Trends Genet., 16, 521–523.
10 Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Lombard,V., Lopez,R., Parkinson,H. et al. (2001) The EMBL nucleotide sequence database. Nucleic Acids Res., 29, 17–21. Updated article in this issue: Nucleic Acids Res. (2002), 30, 21–26.
11 Schuler,G.D. (1997) Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med., 75, 694–698.
12 Coward,E., Haas,S.A. and Vingron,M. (2002) SpliceNest: visualization of gene structure and alternative splicing based on EST clusters. Trends Genet., 18, in press.
13 Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.
14 International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.