
© 1996 Oxford University Press 182-189


Progress with the PRINTS protein fingerprint database

T. K. Attwood*, M. E. Beck 1, A. J. Bleasby 2, K. Degtyarenko 1 and D. J. Parry-Smith 3

Departments of Biochemistry and Molecular Biology, University College London, London WC1E 6BT, UK, 1 The University of Leeds, Leeds LS2 9JT, UK, 2 DRAL, Warrington, Cheshire WA4 4AD, UK and 3 Department of Molecular Sciences, Pfizer Central Research, Sandwich, Kent CT13 9NJ, UK

Received October 2, 1995; Accepted October 4, 1995

ABSTRACT

PRINTS is a compendium of protein motif `fingerprints' derived from the OWL composite sequence database. Fingerprints are groups of motifs within sequence alignments whose conserved nature allows them to be used as signatures of family membership. To date, 400 fingerprints have been constructed and stored in PRINTS, the size of which has doubled in the last year. The current version, 9.0, encodes ~2000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. Fingerprints inherently offer improved diagnostic reliability over single-motif methods by virtue of the mutual context provided by motif neighbours. PRINTS thus provides a useful adjunct to the widely used PROSITE dictionary of patterns. The database is now accessible via the Database Browser on the UCL Bioinformatics server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser.

INTRODUCTION

A first step in the analysis of novel protein sequences is usually to scan the full sequence against one of the many primary data sources [e.g., SWISS-PROT (1), PIR (2), GenBank (3)] or against a composite resource [e.g., NRDb (4), MIPSX (5), OWL (6)] using a pairwise similarity search algorithm such as that exploited in BLAST (7). This will frequently allow outright identification of the query, or at least allow its classification into a broad protein family. Sometimes, however, such diagnoses are not possible, either because there are no other related sequences in the primary sources, or because the target sequences are only partially similar and the relationship is lost in the `twilight zone' (8). In such situations, it is important to bring a range of techniques to bear on the analysis in order to improve the chances of meaningful identification.

To this end, it is becoming standard practice also to search novel sequences against a variety of secondary `value-added' databases, which distill sequence information in primary sources into a variety of potent family descriptors, including patterns and profiles [e.g., PROSITE (9)], motifs [e.g., BLOCKS (10)], domains [e.g., SBASE (11) and ProDom (12)], and so on. Of these techniques, regular expression patterns are probably the easiest to derive, involving the reduction of conserved motifs within alignments into single consensus expressions. PROSITE has thus become the most comprehensive and widely used database of this type, and version 12.2 contains 785 documentation entries describing 1029 patterns, rules and profiles.

A recognised drawback of regular expression patterns is their essentially binary `on/off' nature: a query sequence either matches the pattern or it does not, regardless of how similar it may be. For this reason, more powerful discriminators, namely profiles, have recently been incorporated into PROSITE in order to provide an alternative means of diagnosis where patterns are likely to fail. Profiles are highly complex descriptors, generally encoding the full sequence length and allowing gap insertion in generating pairwise alignments between profile and target sequence; their numbers in PROSITE are therefore still relatively small.
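
The `on/off' behaviour can be illustrated with a minimal Python sketch. The pattern and the two sequences below are invented for illustration only; they are not PROSITE entries or real proteins. A single conservative substitution in the matched region is enough to turn a hit into a complete miss.

```python
import re

# Hypothetical consensus expression (not a real PROSITE pattern):
# G, any residue, one of L/I/V, S, then K or R.
PATTERN = re.compile(r"G.[LIV]S[KR]")


def matches(sequence: str) -> bool:
    """Return True only if the whole pattern occurs somewhere in the sequence."""
    return PATTERN.search(sequence) is not None


# Two invented sequences differing by one residue (L -> M) in the motif region.
exact = "MTAGALSKDE"
near_miss = "MTAGAMSKDE"

print(matches(exact))      # True
print(matches(near_miss))  # False: one mismatch loses the match entirely
```

The regular expression gives no indication that the second sequence is four-fifths identical to the consensus over the motif; it simply reports no match.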

We have used a different approach to pattern recognition, which is simple to apply. Groups of conserved motifs are excised from sequence alignments and used as fingerprints of family membership. Sequence information is maximised through iterative database scanning, so diagnostic performance increases with each database pass. The advantages of this approach are, first, that residue mismatches are tolerated within motifs and, second, that where a motif is not matched, the diagnostic framework provided by neighbouring motifs still allows reliable identification. To facilitate sequence analysis and complement the PROSITE pattern/profile resource, we have recently made a range of unique protein fingerprints available in the PRINTS database (13). Here we describe a number of improvements made to the resource in the last year.

SOURCE DATABASE AND METHODS

The database used to derive individual fingerprints is OWL (6), a non-redundant composite of the major publicly available primary sources: SWISS-PROT (1), PIR (2), GenBank (translation) (3) and NRL-3D (14). Although strict redundancy criteria are applied to the amalgamation of the primary databases, error checking of the sources themselves is not undertaken. In its current form, OWL thus includes errors that derive directly from these sources: results of database searches must therefore be viewed in this context.

Fingerprint construction commences with sequence alignment and excision of conserved motifs using SOMAP (15). The individual motifs are used to dredge OWL using the ADSP sequence analysis package, a suite of procedures for iterative database scanning and hit list correlation (16, 17). The scanning algorithm interprets the aligned motifs essentially as a series of frequency matrices, i.e., identity searches are made, with no mutation or other similarity data to weight the results. Thus the weighting scheme is based on the calculation of residue frequencies for each position in the motifs, summing the scores of identical residues for each position of the retrieved match (16, 17).
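
The weighting scheme described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ADSP implementation: the function names, the toy motif and the query sequence are all hypothetical, and details such as normalisation and hit list correlation are omitted.

```python
from collections import Counter


def frequency_matrix(aligned_motif):
    """Build one Counter of residue counts per column of an ungapped motif."""
    length = len(aligned_motif[0])
    if any(len(row) != length for row in aligned_motif):
        raise ValueError("motif rows must have equal length")
    return [Counter(row[i] for row in aligned_motif) for i in range(length)]


def window_score(matrix, window):
    """Sum, over positions, how often the window's residue was seen there.

    Residues never observed at a position contribute 0 (identity scoring only).
    """
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_match(matrix, sequence):
    """Return (score, start) of the best window, or None if the sequence is too short."""
    width = len(matrix)
    candidates = [
        (window_score(matrix, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])


# Invented three-sequence motif and query, for illustration only.
motif = ["GALSK", "GVLSR", "GAISK"]
matrix = frequency_matrix(motif)
print(best_match(matrix, "MTAGAMSKDE"))  # (10, 3)
```

In this toy case the window GAMSK scores 10 out of a possible 12, so the single mismatched position lowers the score without eliminating the match.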

Database format

The PRINTS database is currently generated in the form of a single ASCII (text) file. The contents are divided into a number of specific fields, relating to general information, bibliographic references, text, lists of matches and the aligned motifs. In the general field at the top of the file, each entry is assigned a code by which it can be identified. This is followed by a description of the type of entry, which may be single-component (if the fingerprint has only one element) or multi-component (if it contains several); in the latter case, the number of motifs contained is also indicated. To date we have included only two single-component entries: these have been derived using a modification of the fingerprint technique and are thus best regarded as special cases. Finally, the general field provides cross-references to corresponding PROSITE patterns, where relevant, together with entry creation and latest update information.

The full format was illustrated in a previous article (18) and will not therefore be shown here. However, some important format changes have been made to each entry (see Fig. 3) and deserve special mention. First, where formerly only PROSITE was cross-referenced, many additional links have been included, thereby improving communication and coupling with other databases, and broadening the scope of the resource. Secondly, accession numbers (which take the form PR00000) have been introduced; these will not change between releases and hence will facilitate future cross-referencing by other databanks.
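
For software that cross-references PRINTS, the accession number form can be checked mechanically. The sketch below assumes that `PR00000' means the literal prefix PR followed by exactly five digits; that reading, the function name and the example strings are assumptions made for illustration.

```python
import re

# Assumed format: the literal prefix "PR" followed by exactly five digits.
ACCESSION_PATTERN = re.compile(r"PR[0-9]{5}")


def is_prints_accession(text: str) -> bool:
    """Return True if the whole string has the assumed PRINTS accession form."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


print(is_prints_accession("PR00001"))  # True
print(is_prints_accession("PR001"))    # False: too few digits
print(is_prints_accession("PS00001"))  # False: wrong prefix
```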

Content of the current release

Release 9.0 of PRINTS (July 1995) contains 400 entries, encoding 1942 individual motifs; a list of additional entries since release 5.0 (April 1994) is provided in Appendix 1. The complete contents list is available from the distribution sites and on the PRINTS World Wide Web (WWW) page (see later).

Database update and growth

The fingerprint database is released in major and minor versions: major releases are database expansions, i.e. they denote the addition of new material to the resource; minor releases reflect updates of existing versions to bring the contents in line with the current version of OWL. To date, there have been 11 releases of the database: nine major and two minor. We endeavour to make a major or minor version available quarterly.

The principal obstacle to the frequency of expansions, and particularly of updates, is the time-consuming nature of the approach. Deriving a fingerprint for a given protein family involves initial alignment and maximisation of sequence information through iterative scanning, with multiple motifs, of a large composite database. This is an exhaustive technique, but it is consequently rigorous, and the precision of the resulting fingerprints tends to justify the sacrifice of speed.

Database distribution

Interactive access to PRINTS can be achieved over the network via the SEQNET facility at Daresbury, where, together with OWL, it is part of an integrated database and software resource that also includes query languages for each of the databases, and several other programs for sequence alignment (15), pattern recognition (16) and similarity searching (19).

PRINTS is also available directly via the anonymous ftp servers at Daresbury (on s-ind2.dl.ac.uk in pub/database/prints; this directory also supplies documentation and other information files, which contain details of the database contents, update statistics, references, and so on), and at NCBI (ncbi.nlm.nih.gov) and EMBL (ftp.embl-heidelberg.de). In addition, it is available on the EMBL suite of CD-ROMs. The database requires ~9 Mb of disc storage.

More recently, a Database Browser has been launched on the WWW, as part of UCL's Bioinformatics Server, at http://www.biochem.ucl.ac.uk/bsm/dbbrowser (Michie et al., in preparation). DbBrowser has much of the look and feel of the ExPASy server to minimise learning curves for new users. It primarily provides access to OWL, PRINTS and ALIGN (the compendium of alignments used to create PRINTS entries), and one navigates through the facility by clicking on the appropriate hypertext link. Figure 1 shows part of the PRINTS top page, which offers the means to interrogate the database by keyword searching of database code, accession number, text, sequence, etc. More complex queries can be made using the query language, which allows the combination of simple queries using logical operators.


Figure 1. PRINTS home page on the UCL Bioinformatics Server. The range of access points is shown, allowing simple queries by keyword searching, or more complex queries using the query language logicals. The PRINTS/PROSITE scanner takes submissions in the form of database codes or user-specified sequences.

Perhaps more importantly, the PRINTS home page provides facilities to search PRINTS in conjunction with other pattern databases. For example, there is a link to the new BLOCKS/PRINTS scanner (http://www.blocks.fhcrc.org/blocks_search.html), which searches PRINTS in blocks format, and there is an option to search PRINTS and PROSITE simultaneously, offering an instant diagnosis of any query sequence. The user either supplies a known database code or cuts and pastes a sequence from a file, and a fingerprint profile is returned in which the top scoring matches and/or any completely matching fingerprints are plotted, as shown in Figure 2. Where results are of particular interest, the full database entry may be retrieved from PRINTS to discover more about the matched fingerprint, as shown in Figure 3; the entry provides links to related databases, so further information can be retrieved at the click of a mouse button.


Figure 2. Fingerprint profile returned by the PRINTS/PROSITE scanner. The horizontal axis denotes the sequence, the vertical axis the percentage score of each fingerprint element (0-100 per element), and the peak a residue-by-residue match in the sequence, its leading edge marking the first position of the match. Sharp peaks appearing in a systematic order along the length of the sequence and above the level of noise indicate matches with a given fingerprint. Even when some elements of a fingerprint are not well matched, the context provided by their neighbours still allows diagnosis, as shown in this example, where the sequence YN84_CAEEL makes a match with the rhodopsin-like GPCR fingerprint even though two motifs match only at the level of noise.


Figure 3. Part of the full PRINTS entry for rhodopsin-like GPCRs. Cross-links to related databases are given at the top of the file, so related subfamily fingerprints can be retrieved, and/or the equivalent result can be consulted in PROSITE, BLOCKS, SBASE, etc.

Applications

The fingerprint technique has been used to study a wide range of globular and membrane proteins, modular polypeptides, and so on. Specific uses have included the development of a fingerprint for the lipocalins and fatty acid binding proteins (20, 21), for the diacylglycerol/phorbol ester binding domain (22), and for the five known families of G-protein-coupled receptors (GPCRs) and some of their many subfamilies (17, 23). This latter is particularly important as the growth of the GPCR `clan' in general, and of the rhodopsin-like family in particular, has been vast: there are now >700 rhodopsin-like GPCRs known, encompassing an enormously diverse range of sequences, to an extent that diagnosis of some family members has become difficult. The fingerprint facility on the Web now provides an instant diagnostic tool for putative GPCRs. This is illustrated in Figure 2, in which a hypothetical C. elegans protein in SWISS-PROT is shown to match well with the rhodopsin-like fingerprint (it is not diagnosed by PROSITE because the sequence shows subtle changes in the third transmembrane domain; hence the relative weakness of the peak in the fingerprint profile at this point).

Similarly, we were able to diagnose the human cytomegalovirus hypothetical protein UL78 as a rhodopsin-like GPCR. This was not predicted at the time of publication of the sequence (24), presumably because conventional database search methods failed to find a significant match (BLAST, for example, retrieves no significant scores). Nevertheless, our result has now been supported by Gompels et al. (25).

Future directions

Just as circumstances arise where regular expression patterns cannot unambiguously detect a particular protein family (usually because of extreme sequence divergence within the family), so fingerprints are not universally applicable. Sequences that have diverged to such an extent that no similarity remains will certainly escape detection by sequence-based methods of this type. We are therefore comparing the effects of applying substitution and mutation data matrices to investigate possible improvements in diagnostic performance. However, this is a complex process, as the additional information provided by such weighting schemes tends to compromise fingerprint potency by increasing the level of background noise.

Another important avenue of research is currently the application of protein fingerprints to the analysis of expressed sequence tag (EST) data. There are now >278 000 ESTs in the gbEST section of GenBank release 9.0, 175 000 of which have been provided as a result of the Washington University/Merck and Co. collaboration (see http://genome.wustl.edu/est/esthmgp.html). This rich source of information presents special problems when attempting to search for, and assign, sequence homologues: the fragmentary nature of the sequence information confounds regular expression and full-length profile pattern-matching techniques. However, since fingerprints have the inherent ability to diagnose partial matches, they provide a relevant perspective from which to begin to analyse data of this type. Of course, diagnosis is only possible if the EST covers the part of the protein from which the fingerprint is derived; but, unlike regular expressions, for example, fingerprints often encode core conserved regions spanning up to 75% of a sequence, rendering the chance of diagnosis using this method significantly greater.
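
The way a fragment can still be diagnosed from a subset of motifs can be sketched as follows. This is an illustrative simplification, not the PRINTS scanning software: the three-motif fingerprint, the fragment, the 70% threshold and all function names are hypothetical. Each motif is scored by identity against a residue frequency matrix, expressed as a percentage of the maximum attainable score, and the motifs that match are checked for correct order along the sequence.

```python
from collections import Counter


def frequency_matrix(rows):
    """One Counter of residue counts per column of an ungapped aligned motif."""
    return [Counter(column) for column in zip(*rows)]


def best_window(matrix, sequence):
    """Return (percentage score, start) of the best window, or None if too short."""
    width = len(matrix)
    top = sum(max(column.values()) for column in matrix)  # best attainable raw score
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        raw = sum(column[residue] for column, residue in zip(matrix, window))
        percent = 100.0 * raw / top
        if best is None or percent > best[0]:
            best = (percent, start)
    return best


def diagnose(fingerprint, sequence, threshold=70.0):
    """Return ([(motif index, start), ...] for motifs at or above threshold, in_order)."""
    hits = []
    for index, rows in enumerate(fingerprint):
        result = best_window(frequency_matrix(rows), sequence)
        if result is not None and result[0] >= threshold:
            hits.append((index, result[1]))
    starts = [start for _, start in hits]
    return hits, starts == sorted(starts)


# Invented three-motif fingerprint and a fragment covering only motifs 1 and 2.
FINGERPRINT = [
    ["GALSK", "GVLSR", "GAISK"],
    ["WDPCH", "WDPCH", "WEPCH"],
    ["FYNRT", "FYNKT", "FFNRT"],
]
FRAGMENT = "QQWDPCHAAAFYNRTLL"

hits, in_order = diagnose(FINGERPRINT, FRAGMENT)
print(hits, in_order)  # [(1, 2), (2, 10)] True
```

Here the first motif lies outside the fragment and scores only at noise level, yet two of the three motifs match in the expected order, which is the kind of partial evidence a regular expression or full-length profile cannot report.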

CONCLUSION

Fingerprinting offers a powerful approach to the analysis of protein sequences: it inherently offers improved diagnostic reliability over single-motif methods by virtue of the mutual context provided by motif neighbours, and it allows rapid and striking visual diagnosis. Modern predictive methods are increasingly exploiting multiple alignments as input to prediction algorithms, since multiple sequence information can strongly enhance the signal (depending on the underlying structure of the data). In creating PRINTS, we recognised the importance of multiple sequence information from the outset and, accordingly, results are stored in the form of multiply aligned motifs. These can then be the subject of detailed structure/function analyses, in a manner that is not possible with abstractions of sequence alignments such as regular expressions, profiles and weight matrices.

ACKNOWLEDGEMENTS

We thank everyone who has contributed entries to the database. PRINTS is built and maintained at UCL with support from the Royal Society (TKA is a Royal Society University Research Fellow); it is compiled with assistance from a BBSRC Link grant in Leeds (MEB, KD). The project benefits from use of the DRAL SEQNET facility.

REFERENCES

1 Bairoch, A. and Boeckmann, B. (1994) Nucleic Acids Res., 22, 3578-3580.

2 George, D.G., Barker, W.C., Mewes, H.W., Pfeiffer, F. and Tsugita, A. (1994) Nucleic Acids Res., 22, 3569-3573.

3 Benson, D.A., Boguski, M., Lipman, D.J. and Ostell, J. (1994) Nucleic Acids Res., 22, 3441-3444.

4 Gish, W. (1994) National Center for Biotechnology Information Server.

5 Barker, W.C., George, D.G., Mewes, H.W., Pfeiffer, F. and Tsugita, A. (1993) Nucleic Acids Res., 21, 3089-3092.

6 Bleasby, A.J., Akrigg, D. and Attwood, T.K. (1994) Nucleic Acids Res., 22, 3574-3577.

7 Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) J. Mol. Biol., 215, 403-410.

8 Doolittle, R.F. (1985) Proteins. Sci. Am., 253, 88-99.

9 Bairoch, A. and Bucher, P. (1994) Nucleic Acids Res., 22, 3583-3589.

10 Henikoff, S. and Henikoff, J.G. (1991) Nucleic Acids Res., 19, 6565-6572.

11 Pongor, S., Hatsagi, Z., Degtyarenko, K., Fabian, P., Skerl, V., Hegyi, H., Murvai, J. and Bevilacqua, V. (1994) Nucleic Acids Res., 22, 3610-3615.

12 Sonnhammer, E.L.L. and Kahn, D. (1994) Protein Science, 3, 482-492.

13 Attwood, T.K., Beck, M.E., Bleasby, A.J. and Parry-Smith, D.J. (1994) Nucleic Acids Res., 22, 3590-3596.

14 Pattabiraman, N., Namboodiri, K., Lowrey, A. and Gaber, B.P. (1990) Protein Seq. Data Anal., 3, 387-405.

15 Parry-Smith, D.J. and Attwood, T.K. (1991) CABIOS, 7, 233-235.

16 Parry-Smith, D.J. and Attwood, T.K. (1992) CABIOS, 8, 451-459.

17 Attwood, T.K. and Findlay, J.B.C. (1994) Protein Engineering, 7, 195-203.

18 Attwood, T.K. and Beck, M.E. (1994) Protein Engineering, 7, 841-848.

19 Akrigg, D., Attwood, T.K., Bleasby, A.J., Findlay, J.B.C., Maughan, N.A., North, A.C.T., Parry-Smith, D.J., Perkins, D.N. and Wootton, J.C. (1992) CABIOS, 8, 295-296.

20 Flower, D.R., North, A.C.T. and Attwood, T.K. (1993) Protein Science, 2, 753-761.

21 Flower, D.R., North, A.C.T. and Attwood, T.K. (1991) BBRC, 180, 69-74.

22 Boguski, M., Bairoch, A., Attwood, T.K. and Michaels, G.S. (1992) Nature, 358, 113.

23 Attwood, T.K. and Findlay, J.B.C. (1993) Protein Engineering, 6, 167-176.

24 Chee, M.S., Bankier, A.T., Beck, S., Bohni, R., Brown, C.M., Cerny, R., Horsnell, T., Hutchison, C.A., Kouzarides, T., Martignetti, J.A., Preddie, E., Satchwell, S.C., Tomlinson, P., Weston, K.M. and Barrell, B.G. (1990) Curr. Top. Microbiol. Immunol., 154, 125-169.

25 Gompels, U.A., Nicholas, J., Lawrence, G., Jones, M., Thomson, B.J., Martin, M.E.D., Efstathiou, S., Craxton, M. and Macaulay, H.A. (1995) Virology, 209, 29-51.



* To whom correspondence should be addressed