ABSTRACT
PRINTS is a compendium of protein motif `fingerprints' derived from the OWL
composite sequence database. Fingerprints are groups of motifs within sequence
alignments whose conserved nature allows them to be used as signatures of
family membership. To date, 400 fingerprints have been constructed and stored
in PRINTS, the size of which has doubled in the last year. The current version,
9.0, encodes ~2000 motifs, covering a range of globular and membrane proteins, modular
polypeptides, and so on. Fingerprints inherently offer improved diagnostic
reliability over single motif methods by virtue of the mutual context provided
by motif neighbours. PRINTS thus provides a useful adjunct to the widely used
PROSITE dictionary of patterns. The database is now accessible via the Database
Browser on the UCL Bioinformatics server at
http://www.biochem.ucl.ac.uk/bsm/dbbrowser.
A first step in the analysis of novel protein sequences is usually to scan the
full sequence against one of the many primary data sources [e.g., SWISS-PROT (1),
PIR (2), GenBank (3)] or against a composite resource [e.g., NRDb (4), MIPSX (5),
OWL (6)] using a pairwise similarity search algorithm such as that exploited in
BLAST (7). This will frequently allow outright identification of the query, or at
least classification into a broad protein family. Sometimes, however, such
diagnoses are not possible, either because there are no other related sequences
in the primary sources, or because the target sequences are only partially
similar and the relationship is lost in the `twilight zone' (8). In such
situations, it is important to bring a range of techniques to bear on
the analysis in order to improve the chances of meaningful identification.
To this end, it is becoming standard practice also to search novel sequences
against a variety of secondary `value added' databases, which distill sequence
information in primary sources into a variety of potent family descriptors,
including patterns and profiles [e.g., PROSITE (9)], motifs [e.g., BLOCKS (10)],
domains [e.g., SBASE (11) and ProDom (12)], and so on. Of these techniques,
regular expression patterns are probably the
easiest to derive, involving the reduction of conserved motifs within
alignments into single consensus expressions. PROSITE has thus become the most
comprehensive and widely used database of this type, and version 12.2 contains
785 documentation entries describing 1029 patterns, rules and profiles.
A recognised drawback of regular expression patterns is their essentially binary
`on/off' nature, i.e., a query sequence will either match the pattern or not,
regardless of how similar it may be. For this reason, more powerful
discriminators, i.e. profiles, have recently been incorporated into PROSITE in
order to provide an alternative means of diagnosis where patterns are likely to
fail. Profiles are highly complex descriptors, generally encoding the full
sequence length and allowing gap insertion in generating pairwise alignments
between profile and target sequence; their numbers in PROSITE are therefore
still relatively small.
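The on/off behaviour of a regular expression can be seen in a minimal Python sketch. The pattern below is a made-up PROSITE-style expression, translated by hand into Python regex syntax, and both sequences are invented for illustration; none of them is taken from PROSITE.

```python
import re

# Hypothetical PROSITE-style pattern G-x(2)-[ST]-C-[LIV], hand-translated to a regex.
# It is an illustrative assumption, not a real PROSITE entry.
PATTERN = re.compile(r"G.{2}[ST]C[LIV]")


def matches(sequence: str) -> bool:
    """Return True if the pattern occurs anywhere in the sequence."""
    return PATTERN.search(sequence) is not None


exact = "MKAGAYSCLPE"      # contains GAYSCL, which satisfies every position
near_miss = "MKAGAYSCMPE"  # identical except L -> M at the last pattern position

print(matches(exact))      # True
print(matches(near_miss))  # False
```

A single substitution at one constrained position turns the match off completely, even though the two sequences are otherwise identical.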
We have used a different approach to pattern recognition, which is simple to
apply. Groups of conserved motifs are excised from sequence alignments and used
as fingerprints of family membership. Sequence information is maximised through
iterative database scanning, so diagnostic performance increases with each
database pass. The advantage of this approach is, first, that residue
mismatches are tolerated within motifs, and second, where a motif is not
matched, the diagnostic framework provided by neighbouring motifs still allows
reliable identification. To facilitate sequence analysis and complement the
PROSITE pattern/profile resource, we have recently made a range of unique
protein fingerprints available in the PRINTS database (13). Here we describe a
number of improvements made to the resource in the last
year.
The database used to derive individual fingerprints is OWL (6), a non-redundant
composite of the major publicly available primary sources: SWISS-PROT (1), PIR (2),
GenBank (translation) (3) and NRL-3D (14). Although strict redundancy criteria are
applied to the amalgamation of the
primary databases, error checking of the sources themselves is not undertaken.
In its current form, OWL thus includes errors that derive directly from these
sources: results of database searches must therefore be viewed in this context.
Fingerprint construction commences with sequence alignment and excision of
conserved motifs using SOMAP (15). The individual motifs are used to dredge OWL
using the ADSP sequence analysis package, a suite of procedures for iterative
database scanning and hit list correlation (16,17). The scanning algorithm
interprets the aligned motifs essentially as a series
of frequency matrices, i.e., identity searches are made, with no mutation or
other similarity data to weight the results. Thus the weighting scheme is based
on the calculation of residue frequencies for each position in the motifs,
summing the scores of identical residues for each position of the retrieved
match (16,17).
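The identity-based scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ADSP implementation: the function names, the toy motif and the query sequence are all hypothetical, and the score is simply the sum, over motif positions, of how often the residue in the window was observed at that position in the aligned motif.

```python
from collections import Counter


def frequency_matrix(motif):
    """Build one Counter per alignment column, holding residue counts."""
    length = len(motif[0])
    if any(len(row) != length for row in motif):
        raise ValueError("aligned motif rows must have equal length")
    return [Counter(row[i] for row in motif) for i in range(length)]


def score_window(matrix, window):
    """Sum the observed count of each window residue at its position.

    Residues never seen at a position contribute zero, so only identities
    count; no substitution or mutation data are used.
    """
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix, sequence):
    """Slide the motif along the sequence; return (score, start) of the best window."""
    width = len(matrix)
    scores = [(score_window(matrix, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scores, key=lambda pair: pair[0])


# Toy aligned motif (three sequences, six positions); invented for illustration.
motif = ["GAYSCL", "GSYTCL", "GAFSCI"]
matrix = frequency_matrix(motif)
print(best_hit(matrix, "MKAGAYSCLPE"))  # (14, 3)
```

Because unseen residues score zero, a window can still score well with one or two mismatches, which is the tolerance that a regular expression lacks.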
The PRINTS database is currently generated in the form of a single ASCII (text)
file. The contents are divided into a number of specific fields, relating to
general information, bibliographic references, text, lists of matches and the
aligned motifs. In the general field at the top of the file, each entry is
assigned a code by which it can be identified. This is followed by a
description of the type of entry, which may be single (if the fingerprint has
only one element) or multi-component (if it contains several); in the latter
case, the number of motifs contained is also indicated. To
date we have included only two single-component entries: these have been
derived using a modification of the fingerprint technique and are thus best
regarded as special cases. Finally, the general field provides cross references
to corresponding PROSITE patterns, where relevant, together with entry creation
and latest update information.
The full format was illustrated in a previous article (18) and will not therefore
be shown here. However, some important format changes have been made to each entry
(see Fig. 3) and deserve special mention: first, where formerly only PROSITE was
cross-referenced, many additional links have been included, thereby improving
communication and coupling with other databases, and broadening the scope of
the resource; secondly, accession numbers (which take the form PR00000) have
been introduced. These will not change between releases and hence will facilitate
future cross-referencing by other databanks.
Release 9.0 of PRINTS (July 1995) contains 400 entries, encoding 1942 individual
motifs; a list of additional entries since release 5.0 (April 1994) is provided
in Appendix 1. The complete contents list is available from the distribution
sites and on the PRINTS World Wide Web (WWW) page (see later).
The fingerprint database is released in major and minor versions: major releases
are database expansions, i.e. they denote the addition of new material to the
resource; minor releases reflect updates of existing versions to bring the
contents in line with the current version of OWL. To date, there have been 11
releases of the database: nine major and two minor. We endeavour to make a
major or minor version available quarterly.
The principal obstacle to the frequency of expansions, and particularly of
updates, is the time-consuming nature of the approach. Deriving a fingerprint
for a given protein family involves initial alignment and maximisation of
sequence information through iterative scanning, with multiple motifs, of a
large composite database. This is an exhaustive technique, but is consequently
rigorous, and the precision of the resulting fingerprints tends to justify the
sacrifice of speed.
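The iterative character of the procedure can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the acceptance rule (a sequence is accepted when every motif reaches a fixed fraction of its maximum possible score), the seed motifs, the miniature database and all function names are hypothetical and do not reproduce the actual ADSP procedure.

```python
def score(motif, window):
    """Identity score: per-column count of the window's residue in the motif."""
    return sum(col.count(res) for col, res in zip(zip(*motif), window))


def best_window(motif, sequence):
    """Best-scoring window of the motif's width, returned as (score, text)."""
    width = len(motif[0])
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max(((score(motif, w), w) for w in windows),
               key=lambda pair: pair[0], default=(0, ""))


def iterate(seed_motifs, database, fraction, max_passes=10):
    """Rescan until no new sequence matches every motif.

    Returns {sequence name: pass on which it was first accepted}.
    """
    motifs = [list(m) for m in seed_motifs]
    found = {}
    for pass_number in range(1, max_passes + 1):
        new = []
        for name, seq in database.items():
            if name in found:
                continue
            hits = [best_window(m, seq) for m in motifs]
            # Assumed rule: each motif must reach `fraction` of its maximum score.
            if all(s >= fraction * len(m) * len(m[0])
                   for (s, _), m in zip(hits, motifs)):
                new.append((name, [w for _, w in hits]))
        if not new:
            break
        for name, windows in new:
            found[name] = pass_number
            for m, w in zip(motifs, windows):
                m.append(w)  # accepted matches enrich the motif for the next pass
    return found


seeds = [["GAYSCL", "GAYTCL"], ["WDKPR", "WDRPR"]]
toy_db = {
    "seqA": "MMGAFSCLTTWDKPRQQ",
    "seqB": "QQGSFSCLAAWDRPRTT",
    "seqC": "MKTLLPEQNVH",
}
print(iterate(seeds, toy_db, fraction=0.65))  # {'seqA': 1, 'seqB': 2}
```

In this toy run seqB is missed on the first pass and recovered on the second, once the residues contributed by seqA have been added to the motifs. Each pass rescans the whole database with every motif, which is why the cost grows quickly with database size.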
Interactive access to PRINTS can be achieved over the network via the SEQNET
facility at Daresbury, where, together with OWL, it is part of an integrated
database and software resource that also includes query languages for each of
the databases, and several other programs for sequence alignment (15), pattern
recognition (16) and similarity searching (19).
PRINTS is also available directly via the anonymous ftp servers at Daresbury (on
s-ind2.dl.ac.uk in pub/database/prints; this directory also supplies documentation
and other information files, which contain details of the database contents,
update statistics, references, and so on), and at NCBI (ncbi.nlm.nih.gov) and EMBL
(ftp.embl-heidelberg.de). In addition, it is available on the EMBL suite of
CD-ROMs. The
database requires ~9 Mb of disc storage.
More recently, a Database Browser has been launched on the WWW, as part of UCL's
Bioinformatics Server, at http://www.biochem.ucl.ac.uk/bsm/dbbrowser (Michie et al.,
in preparation). DbBrowser has much of the look and feel of the ExPASy server
to minimise learning curves for new users. It primarily provides access to OWL,
PRINTS and ALIGN (the compendium of alignments used to create PRINTS entries),
and one navigates through the facility by clicking on the appropriate hypertext
link. Figure 1 shows part of the PRINTS top page, which offers the means to interrogate the
database by keyword searching of database code, accession number, text,
sequence, etc. More complex queries can be made using the query language, which
allows the combination of simple queries using logical operators.
The fingerprint technique has been used to study a wide range of globular and
membrane proteins, modular polypeptides, and so on. Specific uses have included
the development of a fingerprint for the lipocalins and fatty acid binding
proteins (20,21), for the diacylglycerol/phorbol ester binding domain (22), and for
the five known families of G protein-coupled receptors (GPCRs) and some of their
many subfamilies (17,23). The latter is particularly important, as the growth of
the GPCR `clan' in general, and of the rhodopsin-like family in particular, has
been vast: there are now >700 rhodopsin-like GPCRs known, encompassing an enormously
diverse range of sequences, to an extent that diagnosis of some family members
has become difficult. The fingerprint facility on the Web now provides an
instant diagnostic tool for putative GPCRs. This is illustrated in Figure 2, in
which a hypothetical C. elegans protein in SWISS-PROT is shown to match well with
the rhodopsin-like fingerprint (it is not diagnosed by PROSITE because the sequence
shows subtle changes in the third transmembrane domain, hence the relative
weakness of the peak in the fingerprint profile at
this point).
Similarly, we were able to diagnose the human cytomegalovirus hypothetical
protein UL78 as a rhodopsin-like GPCR. This was not predicted at the time of
publication of the sequence (24), presumably because conventional database search
methods failed to find a significant match (BLAST, for example, retrieves no
significant scores). Nevertheless, our result has now been supported by Gompels
et al. (25).
Just as circumstances arise where regular expression patterns cannot
unambiguously detect a particular protein family (usually because of extreme
sequence divergence within the family), so fingerprints are not universally
applicable. Sequences that have diverged to such an extent that no similarity
remains will certainly escape detection by sequence-based methods of this type. We are
therefore comparing the effects of applying substitution and mutation data
matrices to investigate possible improvements in diagnostic performance.
However, this is a complex process, as the additional information provided by
such weighting schemes tends to compromise fingerprint potency by increasing
the level of background noise.
Another important avenue of research is currently the application of protein
fingerprints to the analysis of expressed sequence tag (EST) data. There are
now >278 000 ESTs in the gbEST section of GenBank release 9.0, 175 000 of which
have been provided as a result of the Washington University/Merck and Co.
collaboration (see http://genome.wustl.edu/est/esthmgp.html). This rich source
of information presents special problems when attempting to search for, and
assign, sequence homologues: the fragmentary nature of the sequence information
confounds regular expression and full-length profile pattern-matching techniques. However, since fingerprints have the inherent
ability to diagnose partial matches, they provide a relevant perspective from
which to begin to analyse data of this type. Of course, diagnosis is only
possible if the EST covers the part of the protein from which the fingerprint
is derived; but, unlike regular expressions, for example, fingerprints often
encode core conserved regions spanning up to 75% of a sequence, rendering the
chance of diagnosis using this method significantly greater.
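The way a fingerprint can still diagnose a fragment may be sketched as follows. This is a simplified illustration, assuming the EST has already been translated into protein sequence; the three toy motifs, the sequences, the 0.7 acceptance fraction, the greedy ordering check and all function names are hypothetical choices, not part of PRINTS or ADSP.

```python
def motif_score(motif, window):
    """Identity score: per-column count of the window's residue in the motif."""
    return sum(col.count(res) for col, res in zip(zip(*motif), window))


def locate(motif, sequence, min_fraction):
    """Start of the best window if it reaches min_fraction of the maximum score, else None."""
    width, depth = len(motif[0]), len(motif)
    best_score, best_start = 0, None
    for i in range(len(sequence) - width + 1):
        s = motif_score(motif, sequence[i:i + width])
        if s > best_score:
            best_score, best_start = s, i
    return best_start if best_score >= min_fraction * width * depth else None


def diagnose(fingerprint, sequence, min_fraction=0.7):
    """List (motif number, start) for motifs found in increasing positional order."""
    found = []
    for number, motif in enumerate(fingerprint, start=1):
        start = locate(motif, sequence, min_fraction)
        # Greedy simplification: keep a hit only if it lies after the previous one.
        if start is not None and (not found or start > found[-1][1]):
            found.append((number, start))
    return found


# Toy three-motif fingerprint; invented for illustration.
fingerprint = [["GAYSCL", "GAYTCL"], ["WDKPR", "WDRPR"], ["HNEFV", "HNDFV"]]

full_length = "MMGAYSCLTTWDKPRQQHNEFVAA"
fragment = "TTWDKPRQQHNEFVAA"  # N-terminal region, including motif 1, is missing

print(diagnose(fingerprint, full_length))  # [(1, 2), (2, 10), (3, 17)]
print(diagnose(fingerprint, fragment))     # [(2, 2), (3, 9)]
```

The fragment lacks the first motif entirely, yet the two remaining motifs are found in the expected order, giving a partial match that a single pattern spanning the missing region could not report.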
Fingerprinting offers a powerful approach to the analysis of protein sequences:
it inherently offers improved diagnostic reliability over single motif methods
by virtue of the mutual context provided by motif neighbours, and it allows
rapid and striking visual diagnosis. Modern predictive methods are increasingly
exploiting multiple alignments as input to prediction algorithms, since
multiple sequence information can strongly enhance the signal (depending on the
underlying structure of the data). In creating PRINTS, we recognised the
importance of multiple sequence information from the outset and, accordingly,
results are stored in the form of multiply aligned motifs; these can then be the subject of detailed structure/function analyses, in
a manner that is not possible with abstractions of sequence alignments such as
regular expressions, profiles and weight matrices.
We thank everyone who has contributed entries to the database. PRINTS is built
and maintained at UCL with support from the Royal Society (TKA is a Royal
Society University Research Fellow); it is compiled with assistance from a
BBSRC Link grant in Leeds (MEB, KD). The project benefits from use of the DRAL
SEQNET facility.
REFERENCES
