ABSTRACT
The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany,
collects, processes and distributes protein sequence data within the framework
of the tripartite association of the PIR-International Protein Sequence Database (1,2). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique,
unverified protein sequences from external sources compiled by MIPS. Through
its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and
to yeast genome information. (i) Sequence similarity results from the FASTA
program (3) are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and
permits instant access to FASTA results. (ii) Starting with FASTA database
queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure (4) developed at MIPS is a new approach for rapid sequence and pattern searching.
(iv) MIPS provides access to the sequence and annotation of the complete yeast
genome (5), the functional classification of yeast genes (FunCat) and its graphical
display, the `Genome Browser' (6). A CD-ROM based on the JAVA programming language providing dynamic interactive
access to the yeast genome and the related protein sequences has been compiled
and is available on request.
MIPS is responsible for collecting protein sequence data from European sources
for the common Protein Sequence Database of PIR-International (1,2) by scanning major European journals and by translation of nucleic acid
sequence data received on a day-by-day basis from the EBI (7). In addition, sequence data generated in the European sequencing projects for
Saccharomyces cerevisiae and Arabidopsis thaliana
are processed and analyzed. The resulting protein sequences are processed for
PIR-International and the nucleic acid sequence data are forwarded to the EBI.
As soon as they are captured, protein sequences are compared with the complete
set of published proteins and the results are incorporated into the FASTA
database (see below). After data verification, sequences are rapidly added to
the PIR-International Protein Sequence Database (1,2). In a second step, sequences are annotated. During this process they are
scrutinized for overlaps with existing database entries and merged if identity
with the matching database object has been confirmed. The established
nomenclature of PIR-International is rigorously applied for the annotation of protein names,
species, keywords and features including posttranslational protein
modifications and homology domains.
To allow for strong data typing, we have introduced the CO2 format (8) for data handling, supported by a commercial object-oriented database management system. Data forwarding to support
synchronized copies of the database over a wide area network is currently under
investigation.
As a supplement to the Protein Sequence Database, we have created PATCHX, a set
of unique, unverified protein sequences built from external sources, e.g.
automatic translations of nucleic acid sequences from the EBI Data Library (7), translations contained in GenBank (9), and sequences from SwissProt (10). Sequences that occur in the PIR-International Protein Sequence Database are excluded from PATCHX. A large
fraction of PATCHX consists of sequences with minor differences from entries in the
Protein Sequence Database. These differences are largely due to inconsistencies between the
published and the submitted version of a sequence. Current efforts are
dedicated to reducing the number of entries in PATCHX to a minimum and to parsing
the different entry formats into a homogeneous database.
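The construction rule described above (keep only unique external sequences that are absent from the verified database) can be sketched in a few lines. This is a minimal illustration, not the MIPS pipeline: the function name `build_supplement`, the tuple layout and the toy sequences are all assumptions, and matching is by exact string identity after case folding.

```python
def build_supplement(verified, external):
    """Return unique external sequences that are absent from the verified set.

    verified: iterable of sequence strings (stand-in for the verified database)
    external: iterable of (source, entry_id, sequence) tuples (hypothetical layout)
    """
    seen = {seq.upper() for seq in verified}
    supplement = []
    for source, entry_id, seq in external:
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of a verified or already kept sequence
        seen.add(key)
        supplement.append((source, entry_id, seq))
    return supplement


# Toy data, invented for illustration only.
verified = ["MKTAYIAKQR", "MSDNELLK"]
external = [
    ("EMBL-translation", "X1", "MKTAYIAKQR"),  # already in the verified set
    ("GenBank", "G7", "MKTAYIAKQK"),           # one-residue difference: kept
    ("SwissProt", "P9", "mktayiakqk"),         # same as G7 after case folding
]
patchx_like = build_supplement(verified, external)
print(patchx_like)
```

Because the filter tests exact identity only, a sequence that differs from a verified entry by a single residue is retained, which is consistent with the observation that many PATCHX entries are near-duplicates of database entries.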
Sequence similarity is the most powerful tool for sequence data analysis. To
support the annotation efforts of MIPS, an up-to-date database of sequence similarities has been built based on the
FASTA algorithm for sequence database searches (3). FASTA results can be retrieved within milliseconds. The FASTA database,
introduced in 1991, is updated with every new sequence added to the database in
a reciprocal manner. Owing to the symmetric relation of the one-to-one sequence comparison, all matches of the new sequence are
incorporated into the existing FASTA results of older entries. Thus, the
database is always up-to-date. To allow for a compact representation, the entry title is
stored separately. The output for queries to the FASTA database can be
customized according to the user's needs (e.g. cutoff, maximum number of hits,
database, etc.). FASTA results are raw data of sequence similarity. The
superfamily classification effort of PIR-International provides access to validated sequence homology information (11,12) at the level of the entire protein and at the level of the homology domain.
Classification results are displayed as multiple sequence alignments. PROT-FAM permits access to nearly 10 000 multiple alignments through the World
Wide Web (see below).
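The reciprocal update of the FASTA database can be illustrated with a small in-memory store. This is only a sketch under stated assumptions: the class `SimilarityStore`, its method names and the scores are hypothetical, the comparison score is treated as symmetric, and the real database is persistent rather than a Python dictionary.

```python
from collections import defaultdict


class SimilarityStore:
    """Toy stand-in for a precomputed similarity database."""

    def __init__(self):
        self.hits = defaultdict(dict)  # entry id -> {other id: score}
        self.titles = {}               # titles are stored once, separately

    def add_entry(self, entry_id, title, matches):
        """Register a new entry together with its matches {existing_id: score}."""
        self.titles[entry_id] = title
        self.hits[entry_id]  # touch the key so entries without matches are known
        for other_id, score in matches.items():
            self.hits[entry_id][other_id] = score
            self.hits[other_id][entry_id] = score  # reciprocal update of the older entry

    def query(self, entry_id, cutoff=0.0, max_hits=None):
        """Return (other id, score, title) tuples, best score first."""
        ranked = sorted(
            ((score, other) for other, score in self.hits[entry_id].items()
             if score >= cutoff),
            reverse=True,
        )
        if max_hits is not None:
            ranked = ranked[:max_hits]
        return [(other, score, self.titles.get(other, "")) for score, other in ranked]


store = SimilarityStore()
store.add_entry("A", "kinase A", {})
store.add_entry("B", "kinase B", {"A": 310.0})
store.add_entry("C", "kinase C", {"A": 120.0, "B": 95.0})
print(store.query("A", cutoff=100.0))
```

Entry "A" was added before "B" and "C" existed, yet its result list contains both, because each insertion also writes into the lists of the entries it matches. Queries are then plain lookups, which is what makes retrieval fast.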
The superfamily concept developed in the mid 1970s (13-15) states that homologous proteins can be assigned to protein superfamilies.
Members of a superfamily have diverged from a common ancestral form. The
original concept of the `protein superfamily' is not applicable to multidomain
proteins and thus was recently extended (11,12). The extended superfamily concept permits classification at two levels: the
level of the `complete protein' and the level of the `homology domain'.
Complete proteins are classified into `homeomorphic protein superfamilies'. Two
proteins belong to the same homeomorphic protein superfamily when they are
homologous over all of their sequence from the amino to the carboxyl end.
`Homology domains' are regions of local similarity contained in otherwise
unrelated proteins, e.g. `protein kinase homology' or `trypsin homology'.
Regions of local similarity repeated within a single protein are also
classified at the level of the `homology domain', e.g. `ADP,ATP carrier protein
repeat homology'.
Classification at the level of the complete protein permits partitioning of the entire
Protein Sequence Database into independent, nonoverlapping groups of entries.
Each completely sequenced protein belongs to exactly one homeomorphic protein
superfamily. The condition that members of a homeomorphic protein superfamily
must be homologous over their entire sequence length implies that all
members must contain the same homology domains in the same order. For practical
reasons we use `sequence similarity' as the main criterion to discriminate between
homologous and non-homologous sequences. To avoid false positive assignments we define
stringent conditions as criteria for sequence homology in routine work: (i) 30%
sequence identity; (ii) at least 100 residues in length; and (iii) free of
composition bias. More distantly related proteins may be clustered into the
same homeomorphic protein superfamily after detailed sequence analysis or when
homology is supported by non-sequence data, e.g. structural information.
`Protein families' are defined as sets of proteins under the even more stringent
condition that each member of the family has more than 50% identity to at least
one other member of the family. A superfamily is a union over families.
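The two rules above can be written down compactly. The sketch below is an illustration, not the MIPS classification software. It reads criterion (i) as "at least 30% identity", which is an assumption, and it implements the family definition as single-linkage clustering with a union-find structure, one straightforward way to guarantee that every member of a multi-member family has more than 50% identity to at least one other member. All names and identity values are invented.

```python
def passes_routine_criteria(identity, aligned_length, composition_biased):
    """Routine homology criteria: identity, length and absence of composition bias."""
    return identity >= 0.30 and aligned_length >= 100 and not composition_biased


def cluster_families(ids, pairwise_identity, threshold=0.50):
    """Group ids into families by single linkage on identity > threshold."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        if identity > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    families = {}
    for i in ids:
        families.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in families.values())


# Hypothetical pairwise identities (fractions).
ids = ["P1", "P2", "P3", "P4"]
pairwise = {
    ("P1", "P2"): 0.62,
    ("P2", "P3"): 0.55,
    ("P1", "P3"): 0.41,
    ("P3", "P4"): 0.33,
}
families = cluster_families(ids, pairwise)
print(families)
```

P1 and P3 share only 41% identity but land in the same family because both are linked to P2; P4 has no partner above 50% and forms a single-member family, although its 33% identity to P3 could still place it in the same superfamily.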
As of September 1996 the PIR-International Protein Sequence Database contains 90 000 entries. Of these,
68 000 (75%) entries have been classified into 26 000 protein families. 64 000
(71%) entries are finally classified. The remaining sequences are 4000 (4%)
fragments that cannot be unambiguously classified. Protein families vary in
size from one to several hundred members. Seventy-three percent (47 000 sequences) are present in the 8000 families with at
least two members. The residual 17 000 sequences (27%) represent distinct
protein families. Multiple sequence alignments and sequence profiles have been
computed using the GCG sequence analysis software (16) programs PILEUP and PROFILEMAKE.
Sixty percent (38 000 of 64 000) of the finally classified entries have been
grouped into 4100 superfamilies. Of these, 2800 (68%) are based on a single
protein family whereas 1300 (32%) contain sequences from more than one protein
family. Multiple alignments for the latter set are also precomputed for rapid
inspection.
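The percentages quoted in the two preceding paragraphs can be cross-checked against the counts. The short script below uses only figures stated in the text; the row labels are paraphrases and the helper name `deviations` is made up for the example. Since the counts are rounded, small deviations from the quoted percentages are to be expected.

```python
reported = [
    # (label, part, whole, percentage quoted in the text)
    ("classified into families", 68_000, 90_000, 75),
    ("finally classified", 64_000, 90_000, 71),
    ("unclassifiable fragments", 4_000, 90_000, 4),
    ("in families with >= 2 members", 47_000, 64_000, 73),
    ("single-member families", 17_000, 64_000, 27),
    ("grouped into superfamilies", 38_000, 64_000, 60),
    ("single-family superfamilies", 2_800, 4_100, 68),
    ("multi-family superfamilies", 1_300, 4_100, 32),
]


def deviations(rows):
    """Difference (in percentage points) between computed and quoted values."""
    return {
        label: round(100 * part / whole - quoted, 1)
        for label, part, whole, quoted in rows
    }


devs = deviations(reported)
for label, delta in devs.items():
    print(f"{label:32s} {delta:+.1f}")
```

With the rounded counts given in the text, every computed value lies within one percentage point of the quoted figure, and the two subsets of finally classified entries (47 000 and 17 000) add up to 64 000.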
Homology domains are annotated in the `superfamily' field and as `domain'
feature in the PIR-International Protein Sequence Database. Superfamily names for homology
domains contain the term `homology'. The `domain' feature annotation contains
exactly the same name and indicates sequence coordinates. Although different
authors usually agree that a sequence contains a certain homology domain, the
assignment of domain boundaries is rather subjective. To avoid inconsistencies,
we set boundaries close to well-conserved regions. MIPS extracts all homology domains annotated as domain
feature from the Protein Sequence Database into a specific homology domain
sequence database called HOMDOM. A total of 17 000 individual domain features are
annotated for the 285 distinct homology domains. MIPS screens for as yet
unannotated occurrences of homology domains and adds the corresponding domain
feature annotation to the database entries.
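Extracting annotated domain regions into a separate collection, as done for HOMDOM, amounts to slicing each entry at the coordinates of its `domain' features. The following sketch assumes a hypothetical record layout (dictionaries with `id`, `sequence` and `features`) and 1-based inclusive coordinates; neither is taken from the actual entry format.

```python
def extract_domains(entries):
    """Collect annotated domain regions, grouped by homology domain name.

    entries: list of dicts with 'id', 'sequence' and 'features'; each feature is
    (kind, name, start, end) with 1-based inclusive coordinates (assumed layout).
    """
    homdom = {}
    for entry in entries:
        for kind, name, start, end in entry["features"]:
            if kind != "domain":
                continue  # only domain features are extracted
            region = entry["sequence"][start - 1:end]
            homdom.setdefault(name, []).append((entry["id"], start, end, region))
    return homdom


# Invented entries for illustration.
entries = [
    {"id": "E1", "sequence": "MKTAYIAKQRQISFVKSHFSRQ",
     "features": [("domain", "protein kinase homology", 3, 10),
                  ("binding site", "ATP", 5, 5)]},
    {"id": "E2", "sequence": "GSHMAYIAKQRQLL",
     "features": [("domain", "protein kinase homology", 4, 11)]},
]
homdom = extract_domains(entries)
for name, regions in homdom.items():
    print(name, regions)
```

Keying the collection by the domain name relies on the convention stated above that the `domain' feature carries exactly the same name as the corresponding superfamily annotation.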
The HPT (hashed position tree) is an index data structure for improved
performance of sequence comparisons. The index is used to preprocess a data set,
which greatly reduces the computational complexity of subsequent sequence comparisons.
Various applications using the HPT can be formulated. (i) The all-against-all matching of the 12 million bases of the yeast genome on DNA and
protein level could be done in less than 48 h on a single DEC workstation. (ii)
A WWW interface is available that allows a query sequence to be compared with the
HPT-indexed dataset of the 6274 ORFs of the yeast proteome. (iii) The yeast
proteins have been compared against the more than 2.2 million translated human
ESTs. (iv) A prototype version of the PIR-International Protein Sequence Database indexed by HPT is available to
search for amino acid patterns and protein sequences.
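The general benefit of indexing a data set before searching it can be shown with a plain k-mer position index. Note that this is deliberately not the HPT, whose design is described in ref. 4 and not reproduced here; it is a simpler, well-known technique substituted for illustration. The function names, the choice k = 3 and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequences, k=3):
    """Map every k-mer to the list of (sequence id, 0-based offset) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_exact(index, sequences, query, k=3):
    """Return (sequence id, offset) pairs where the query occurs exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k residues long")
    hits = []
    # Only positions sharing the first k-mer of the query need to be verified.
    for seq_id, pos in index.get(query[:k], []):
        if sequences[seq_id].startswith(query, pos):
            hits.append((seq_id, pos))
    return hits


# Invented sequences for illustration.
sequences = {"seq1": "MKTAYIAKQR", "seq2": "GGAYIAKLLAYIA"}
index = build_kmer_index(sequences)
hits = find_exact(index, sequences, "AYIA")
print(hits)
```

The index is built once; each query then inspects only the candidate positions listed for its first k-mer instead of scanning every sequence, which is the kind of saving that makes repeated or all-against-all comparisons tractable.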
By April 24, 1996, the sequence of the yeast genome was completed as the result
of a world-wide collaboration among European, Swiss, UK, American, Canadian and
Japanese laboratories. Sixteen chromosomes, not including rDNA repeats, of 12
million bases of DNA code for 6274 proteins. MIPS has served as the informatics
centre for the European effort and assembled more than 6 million bases of data
submissions into contiguous chromosomal sequences. Data have been annotated and
organized in a database of the yeast genome accessible through the WWW,
including information from other, specialized yeast databases. In addition to
the information available in the PIR-International Protein Sequence Database the yeast database contains
detailed information on specific properties like codon adaptation bias,
disruptants and motifs. Links to the corresponding sequence databases, PROT-FAM, the FASTA database and YPD (17) are implemented.
Intuitive visual access to large volumes of data is indispensable for systematic
analysis of large scale genomic data, e.g. to inspect the 12 million bases of
the complete yeast genome. Correlation of independent findings becomes apparent
only if displayed in a coherent and well structured way. The WWW-based genome browser permits visualization of various aspects of the genome in
an interactive way, e.g. display of the set of all sequence similarities within
the whole genome as computed using HPT or display of all proteins that belong
to a selected functional category. The user may specify a specific view of the
genome as a declarative query. The result of the request is an image that can
be inspected at variable resolutions. Any detected genetic element can be used
as entry-point to the yeast information system provided by MIPS.
A version of the genome browser and the yeast sequence database will be
available on CD-ROM. This enables every scientist to access this tool independently of
network resources. The standard browser technology can be used for local data
processing, as the program was written in the novel internet programming
language JAVA and all documents are stored in HTML. The CD-ROM will contain the complete yeast genome and its annotation. The
appropriate system-independent software to navigate and query interactively will be provided.
The MIPS server attempts to integrate data and services to ease the access to
our resources. Database services vary widely in: (i) the type of data retrieved
or explored; (ii) their temporal behavior; (iii) their principal type of
operation as stateless or state dependent; and (iv) the platform on which they
reside. A layered software architecture was established to hide the heterogeneity
of services from the user. This architecture was implemented using
client/server communication to distribute and schedule tasks over the local
network of workstations in a way that is entirely transparent to the user (18).
Data and services offered by the WWW server focus on data and services uniquely
supplied by MIPS. These include a WWW interface to the multi-database/multifield query system ATLAS allowing format-independent access
to and retrieval from 74 indexed databases totaling more than 2 million entries. The
MIPS WWW site gives access to the PROT-FAM project with nearly 10 000 multiple sequence alignments at the level
of the protein family (8000 alignments), protein superfamily (1200 alignments),
or homology domain (285 alignments). It is possible to align a query sequence
against a sequence profile derived from the multiple alignment.
The yeast genome browser and several applications based on HPT are
also available through the WWW.
How to contact MIPS: Münchner Informationszentrum für Proteinsequenzen, Max-Planck-Institut für Biochemie, D-82152 Martinsried bei München, Germany; Tel +49 89 8578 2656;
Fax +49 89 8578 2655; Email mewes{at}mips.embnet.org
MIPS is supported by the Max-Planck-Gesellschaft, the Forschungszentrum f. Umwelt und Gesundheit (GSF)
and the European Commission BRIDGE Grants BIOT-CT-0167 and 0172.
REFERENCES