ABSTRACT
The PRINTS database of protein family `fingerprints' is a diagnostic resource that complements the PROSITE dictionary of sites and patterns. Unlike regular expressions, fingerprints exploit groups of conserved motifs within sequence alignments to build characteristic signatures of family membership. Thus fingerprints inherently offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 600 fingerprints have been constructed and stored in PRINTS, representing a 50% increase in the size of the database in the last year. The current version, 13.0, encodes ~3000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via UCL's Bioinformatics World Wide Web (WWW) server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/ . We describe here progress with the database, its Web interface, and a recent exciting development: the integration of a novel colour alignment editor (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA ), which allows visualisation and interactive manipulation of PRINTS alignments over the Internet.
In the analysis of novel protein sequences, in addition to routine searches of
the primary data sources, it is now customary to extend search strategies to
include a range of `secondary' databases. These distil sequence information in
the primary databanks into a variety of potent descriptors that aid family
diagnosis: for example, PROSITE houses regular expression patterns and a number
of profiles (1), and the BLOCKS database stores aligned, weighted motifs (2).
Of the available analysis methods, regular expressions are probably the easiest
to understand. Their derivation involves the reduction of conserved motifs
within sequence alignments into simple, single consensus expressions, in which
all but the most significant residue information is discarded. PROSITE is now
the most comprehensive and widely-used secondary database, version 13.0 containing descriptors for 889
families and functional sites.
In terms of their performance in pattern recognition, regular expressions have
certain limitations. Patterns may themselves encode flexibility, or fuzziness,
but require query sequences to match them exactly. Thus sequences that differ
only slightly from the definition will be missed. In view of this drawback, more powerful discriminators (i.e., profiles) have been incorporated
into PROSITE to provide an alternative means of diagnosis where patterns are
likely to fail. Profiles are highly complex descriptors, generally encoding the
full sequence length and allowing gap insertion in generating pairwise
alignments between profile and target sequence; their numbers in PROSITE are
therefore still relatively small.
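The exact-match limitation of regular expressions can be illustrated with a minimal sketch. The pattern, the two sequences and the helper `prosite_to_regex` are hypothetical, invented for illustration; the helper handles only a simplified subset of PROSITE-style syntax and is not PROSITE's own software.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Handles only: literal residues, x (any residue), [ABC] (allowed set),
    {ABC} (forbidden set) and repeat counts such as (2) or (2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an optional trailing repeat count from the element itself.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


PATTERN = "C-x(2)-[LIVM]-G-{P}-H"  # hypothetical pattern, not a real entry
regex = prosite_to_regex(PATTERN)

member = "AACKTLGAHEE"   # hypothetical sequence satisfying the pattern
variant = "AACKTFGAHEE"  # same sequence with one substitution (L -> F)

is_member_hit = re.search(regex, member) is not None
is_variant_hit = re.search(regex, variant) is not None
print(regex, is_member_hit, is_variant_hit)
```

Although the pattern tolerates any of four residues at its fourth position, the single substitution to a residue outside that set is enough for the variant to be missed entirely.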
We have developed a different approach to pattern recognition, which we term
`fingerprinting' (3,4). Within a sequence alignment, it is usual to find not one, but several motifs
that characterise the aligned family. Diagnostically, it makes sense to use
many, or all, of the conserved regions to build a family signature. In a
database search, there is then a greater chance of identifying a distant
relative, whether or not all parts of the signature are matched. Thus, for
example, a sequence that matches only four of seven motifs may still be
diagnosed as a true match if the motifs are matched in the correct order in the
sequence, and the distances between them are consistent with that expected of
true neighbouring motifs. The ability to tolerate mismatches, both at the level
of residues within individual motifs, and at the level of motifs within the
fingerprint as a whole, renders fingerprinting a very powerful diagnostic
technique.
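The order and distance criteria described above might be sketched as follows. This is an illustrative check under stated assumptions, not the procedure implemented in our software: the function name, the representation of hits as start positions, the interval ranges and the minimum number of motifs are all hypothetical.

```python
from typing import Dict, List, Tuple


def consistent_partial_match(
    hits: Dict[int, int],
    gap_ranges: List[Tuple[int, int]],
    min_motifs: int,
) -> bool:
    """Decide whether a partial set of motif hits looks like a true match.

    hits        maps motif index (0-based) to its start position in the query.
    gap_ranges  gap_ranges[i] is the assumed (min, max) distance between the
                starts of motif i and motif i + 1 in known family members.
    min_motifs  assumed minimum number of motifs that must be matched.
    """
    if len(hits) < min_motifs:
        return False
    matched = sorted(hits)
    for a, b in zip(matched, matched[1:]):
        distance = hits[b] - hits[a]
        # Motifs between a and b may be missing, so sum the intervening ranges.
        lowest = sum(g[0] for g in gap_ranges[a:b])
        highest = sum(g[1] for g in gap_ranges[a:b])
        # A negative distance (wrong order) always falls below `lowest`.
        if not lowest <= distance <= highest:
            return False
    return True


# Hypothetical seven-motif fingerprint with similar spacing between neighbours.
GAPS = [(30, 45)] * 6

four_of_seven = {0: 10, 2: 85, 3: 120, 6: 235}
wrong_order = {0: 200, 2: 85, 3: 120, 6: 235}

ordered_ok = consistent_partial_match(four_of_seven, GAPS, min_motifs=4)
shuffled_ok = consistent_partial_match(wrong_order, GAPS, min_motifs=4)
print(ordered_ok, shuffled_ok)
```

In this sketch, a sequence matching four of seven motifs is accepted only because the matched motifs occur in the right order and at plausible separations; the same four hits in the wrong order are rejected.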
To facilitate sequence analysis and complement other secondary resources, we
have made a range of unique protein fingerprints available in the PRINTS
database (5). This paper describes recent progress with the PRINTS system and its evolving
role as an information resource in computational molecular biology.
PRINTS' source database is OWL (6; http://www.biochem.ucl.ac.uk/bsm/dbbrowser/OWL/OWL.html), a non-redundant composite of the major publicly-available primary sources: SWISS-PROT (7), PIR (8), GenBank (translation) (9) and NRL-3D (10).
Fingerprinting commences with sequence alignment and excision of conserved
motifs using SOMAP (11). The motifs are used to dredge OWL independently using the ADSP sequence
analysis package, a suite of procedures for iterative database scanning and hit-list correlation (3,4). The scanning algorithm interprets the motifs essentially as a series of
frequency matrices, i.e., identity searches are made, with no mutation or other
similarity data to weight the results. The weighting scheme is thus based on
the calculation of residue frequencies for each position in the motifs, summing
the scores of identical residues for each position of the retrieved match.
Diagnostic performance is enhanced by iterative database scanning. The motifs
therefore grow, and become more mature, with each database pass, as more
sequences are matched and assimilated into the process. Full potency is gained
from the mutual context provided by motif neighbours, which allows sequence
identification even when some parts of the signature are absent.
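The frequency-based scoring and iterative scanning described above can be sketched for a single motif as follows. This is a simplified illustration, not the ADSP code: the function names, the toy sequences, the acceptance threshold (a fraction of the maximum attainable score) and the stopping rule are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def frequency_matrix(motif: List[str]) -> List[Counter]:
    """Count the residues observed at each position of an ungapped motif."""
    return [Counter(column) for column in zip(*motif)]


def score_window(matrix: List[Counter], window: str) -> int:
    """Sum, over positions, the frequency of the residue seen in the window.

    Only identities contribute; unseen residues score zero, since no
    mutation or similarity data are used.
    """
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix: List[Counter], sequence: str) -> Tuple[int, int]:
    """Return (position, score) of the best-scoring window in a sequence."""
    width = len(matrix)
    positions = range(len(sequence) - width + 1)
    best = max(positions, key=lambda i: score_window(matrix, sequence[i:i + width]))
    return best, score_window(matrix, sequence[best:best + width])


def iterate_motif(motif: List[str], database: List[str],
                  fraction: float = 0.8, max_passes: int = 10) -> List[str]:
    """Rescan the database, adding new hits to the motif, until none are found."""
    motif = list(motif)
    for _ in range(max_passes):
        matrix = frequency_matrix(motif)
        width = len(matrix)
        # Assumed threshold: a fraction of the best score the matrix can give.
        threshold = fraction * sum(max(col.values()) for col in matrix)
        new_segments = []
        for sequence in database:
            if len(sequence) < width:
                continue
            pos, score = best_hit(matrix, sequence)
            segment = sequence[pos:pos + width]
            if score >= threshold and segment not in motif + new_segments:
                new_segments.append(segment)
        if not new_segments:
            break
        motif.extend(new_segments)
    return motif


seed = ["GDSGG", "GDSGS", "GDAGG"]                     # toy seed motif
database = ["MKTGDSGAPLV", "AAGESGGKK", "PPPPPPPP"]    # toy sequences

seed_matrix = frequency_matrix(seed)
first_hit = best_hit(seed_matrix, database[0])
grown = iterate_motif(seed, database)
print(first_hit, grown)
```

Because the matrix is rebuilt from the enlarged motif on each pass, the scores change as new members are assimilated, which is why a relative rather than an absolute threshold is used in this sketch.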
PRINTS is built as a single ASCII (text) file. The contents are separated into
specific fields, relating to general information, bibliographic references,
text, lists of matches, and the motifs themselves. Each line of a field is assigned a distinct two-letter code, allowing us to index the database for fast querying of
its contents. In the general field, each entry is assigned an identification
code and an accession number (of the form PR00000), followed by an indication
of the number of constituent motifs in the fingerprint. Finally, where
relevant, the general field provides cross-references to corresponding entries in a variety of other bio-databanks, including PROSITE, ProDom (12), SBASE (13), NRL-3D, SWISS-3DIMAGE (14), scop (15), cath (16), etc. Such links are vital for efficient communication between related
databases and effectively broaden the scope of the resource. Similarly, the use
of static accession numbers itself facilitates cross-referencing by other databases: PRINTS is now cross-referenced by BLOCKS, SBASE and GCRDb (17), and is linked to by PROSITE.
The full format has been described previously (18,19), so will not be discussed further here.
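To indicate how line codes permit indexing of a flat file, a minimal sketch follows. The sample records, the `code; text` line layout, the particular two-letter codes and the rule that one code opens a new entry are assumptions made for illustration; the published format descriptions remain the authority on the real layout.

```python
from typing import Dict, List

Entry = Dict[str, List[str]]

# Hypothetical records; only the PR00000-style accession form is taken
# from the text. Codes and layout are assumed for illustration.
SAMPLE = """\
gc; EXAMPLEFAM
gx; PR00001
gn; COMPOUND(3)
gt; Example family signature
gc; OTHERFAM
gx; PR00002
gn; COMPOUND(5)
gt; Another example signature
"""


def parse_entries(text: str, start_code: str = "gc") -> List[Entry]:
    """Group coded lines into entries, keyed by their two-letter code."""
    entries: List[Entry] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(";")
        code, value = code.strip(), value.strip()
        if code == start_code:
            entries.append({})  # assumed: this code opens a new entry
        if entries:
            entries[-1].setdefault(code, []).append(value)
    return entries


def build_index(entries: List[Entry], code: str) -> Dict[str, Entry]:
    """Index entries by the value(s) found on lines carrying a given code."""
    return {value: entry for entry in entries for value in entry.get(code, [])}


entries = parse_entries(SAMPLE)
by_accession = build_index(entries, "gx")
print(by_accession["PR00002"]["gc"])
```

Once such an index exists, a query on accession number or identifier becomes a dictionary lookup rather than a scan of the whole file.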
Release 13.0 of PRINTS (September 1996) contains 600 entries, encoding ~3000 individual motifs. The complete contents list is available from the
distribution sites and on the PRINTS WWW page (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/printscontents.html ).
PRINTS is released in major and minor versions: major releases are database
expansions, i.e., they denote the addition of new material to the resource;
minor releases reflect updates of existing entries to bring the contents in
line with the current version of OWL. To date, there have been 17 releases of
the database: 13 major and four minor. We endeavour to make a major or minor
version available quarterly: in the last year, we have achieved four major and
two minor releases.
The principal obstacle to the frequency of expansions, and particularly of
updates, is the time-consuming nature of the approach. Deriving a fingerprint involves two
major threads: (i) a computational aspect, which involves initial alignment and
maximisation of sequence information through iterative scanning, with multiple
motifs, of a large composite database; and (ii) an annotation component, which
involves researching each family, and linking sequence conservation information
to known structural or functional data. This is a rigorous, exhaustive
technique. The precision of the results, coupled with the quality of
annotations, tends to justify the sacrifice of speed, and sets the database
apart from the growing number of automatically-derived pattern resources, for which there are no annotations, and hence
no appropriate mechanisms for result validation.
PRINTS is available for interactive use via the SEQNET service. It may be
retrieved directly from the anonymous-ftp servers at Daresbury (s-ind2.dl.ac.uk in pub/database/prints), NCBI (ncbi.nlm.nih.gov), EBI
(ftp.ebi.ac.uk in pub/databases), EMBL (ftp.embl-heidelberg.de) and UCL (ftp.biochem.ucl.ac.uk in pub/prints). In addition,
it is distributed on the EMBL suite of CD-ROMs. The database requires ~60 Mb of disk storage.
In addition, the database is accessible from UCL's DbBrowser Bioinformatics
Server, at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/ (20). The server primarily provides access to OWL, PRINTS and ALIGN, the compendium
of alignments used to create PRINTS entries
(http://www.biochem.ucl.ac.uk/bsm/dbbrowser/ALIGN/ALIGN.html ). Figure 1
shows part of the PRINTS home page
(http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html ), which allows
keyword searching of database code, accession number, text, sequence, etc. Such
queries are made possible by links to the query language, but are presented in
a manner that shields the user from its syntax, which is desirable for routine,
trivial queries. Where query results are of particular interest, the full entry
may be retrieved to discover more about the fingerprint, as shown in Figure 2.
The fingerprint technique has been used to study a wide range of globular and
membrane proteins, modular polypeptides, and so on (4,24-27). In recent database releases, particular emphasis has been placed on the
elucidation of discriminatory fingerprints for a range of G-protein-coupled receptor (GPCR) families and subfamilies (4,27). This has become increasingly important as the rhodopsin-like family has grown rapidly: there are now >1000 rhodopsin-like GPCRs known and diagnosis of certain family outliers has become
more and more difficult. By expanding the range of GPCR families covered in
PRINTS, the fingerprint facility on the Web now effectively provides an instant
diagnostic tool for putative GPCRs. This is illustrated in Figure 3, in which a hypothetical C. elegans
protein from SWISS-PROT (YMJC_CAEEL) is shown to make a partial match with the rhodopsin-like fingerprint, which encodes the seven transmembrane domains. The
sequence is not diagnosed by PROSITE because it contains changes in the third
transmembrane domain, which alone provides the basis for the PROSITE pattern.
Using the fingerprint approach, it is possible to detect such Twilight
relationships because of the diagnostic framework provided by neighbouring
motifs. Thus, in spite of the relative weakness of several peaks in the
fingerprint profile, the mutual context provided by the remaining fingerprint
elements allows us to make a reliable assessment of family membership.
In order to address more effectively the flood of information arising from the
various genome projects, it is essential to increase levels of automation, and
relieve many of the current manual burdens inherent in database maintenance.
This is already imperative, given the difficulties in attracting funding for
database curation. In the short term, emphasis will be placed on adding new
families to PRINTS, rather than on routinely updating existing ones. Attention
will then be given to developing more automated curation strategies. We will,
however, maintain a balance between manual input (especially at the stage of
annotation) and automatic processing. In practical applications, the power of
secondary databases derives not only from the reliability of their diagnostic
performance, but also from the extent and quality of their family
documentations. Annotated databases tend to be more reliable than their fully-automated counterparts, which are more error prone and provide little or
no validation either for the patterns they house or for matches to those
patterns.
In addition to addressing the practicalities of database maintenance, we also
aim to enhance the range of analysis tools available, to make the information
within PRINTS more readily accessible to users. For example, we are extending
the alignment applet to include a structural viewer so that, for families for
which coordinates are available, their fingerprints may be visualised in a 3D
context.
Bioinformatics is a technically-demanding discipline, in terms of both the nature and the scale of the
undertaking, and promises enormous practical dividends as it begins to reveal
the hidden jewels of the human genome. Secondary databases, such as PRINTS, are
an important part of this endeavour: their scope and subtlety make them
powerful tools for diagnosing the relationships between sequences that underlie
the identification of function. But none of these resources is sufficient in
itself. No single analysis method is yet infallible, and no pattern database
complete. Together with PROSITE, BLOCKS, the profile library, etc., PRINTS thus
provides one of several potent weapons in the sequence analyst's armoury.
We thank everyone who has contributed entries to the database. PRINTS is built
and maintained at UCL with support from the Royal Society (T.K.A. is a Royal
Society University Research Fellow). The project benefits from use of the BBSRC
SEQNET facility.
REFERENCES
