Nucleic Acids Research
The PRINTS protein fingerprint database in its fifth year
Introduction
Source Database And Methods
Database format
Content of the current release
Database update and growth
Database distribution
Derivative databases
New search software
Applications
Future directions
Conclusion
Acknowledgements
References
ABSTRACT
INTRODUCTION
The last two decades have seen remarkable advances in molecular biology: 20 years ago sequencing a single gene was considered a monumental technical achievement; today, the sequencing of whole genomes has become almost routine. Advances in the fundamental techniques of sequencing, in concert with advances in laboratory automation and robotics, have led to the rapid and unprecedented accumulation of macromolecular sequence data. The challenge resides not just in the management of this huge quantity of information, but also in its analysis. One of the main goals of bioinformatics is to uncover the knowledge implicit within the data.
The decisive step in this knowledge-discovery process is often the identification of the family to which a newly-identified gene belongs; from this devolves a wealth of insights into function. Because of its links to 3D structure and post-translational modifications, and thus to biological function, the amino acid sequence, rather than the nucleic acid sequence, is generally thought to be the most appropriate level at which to seek such relationships.
Secondary, so-called value-added, databases are now standard tools in sequence analysis strategies. Such resources distil sequence information from the primary databanks into a variety of potent descriptors that aid family diagnosis: PROSITE, for example, houses regular expression patterns and a small number of profiles (1); the BLOCKS database stores aligned, weighted motifs, or blocks (2); Pfam offers a range of hidden Markov models (HMMs) (3); and PRINTS provides groups of aligned, unweighted sequence motifs, or fingerprints (4). Diagnostically, each of these types of descriptor has different strengths and weaknesses and hence different areas for optimum application. In terms of family coverage, the databases tend to differ in content, and the most effective search strategies should ideally combine them all.
The technique of protein fingerprinting (5,6) arose largely because of the limitations of single-motif regular expression pattern-matching methods: these give binary 'hit or miss', 'match or no match' diagnoses that provide no biological context with which to assess the significance of a result. However, within a sequence alignment, it is usual to find not one, but several motifs that characterise the aligned family. Diagnostically, it makes sense to use many or all such conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. For example, a sequence that matches only three of six motifs may still be diagnosed as a true match if the motifs are matched in the correct order in the sequence, and the distances between them are consistent with those expected of true neighbouring motifs. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprints as a whole, renders fingerprinting a powerful diagnostic tool.
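The order-and-distance reasoning in the three-of-six example can be illustrated with a minimal sketch. It is not the PRINTS search software: the function name `diagnose`, the start-to-start gap ranges, and the default cut-off of three matched motifs are all assumptions made for illustration.

```python
def diagnose(hits, gap_ranges, min_motifs=3):
    """Decide whether a set of motif hits looks like a (partial) fingerprint match.

    hits       -- start position of each motif's hit in the query, in fingerprint
                  order; None where the motif was not matched
    gap_ranges -- (min_gap, max_gap) start-to-start distance allowed between
                  motif k and motif k+1 (hypothetical values, one pair per
                  neighbouring motif pair)
    min_motifs -- assumed minimum number of matched motifs
    """
    matched = [(k, pos) for k, pos in enumerate(hits) if pos is not None]
    if len(matched) < min_motifs:
        return False
    for (k1, p1), (k2, p2) in zip(matched, matched[1:]):
        # Missing motifs between k1 and k2 widen the allowed distance:
        # sum the ranges of every neighbouring pair that is spanned.
        lowest = sum(g[0] for g in gap_ranges[k1:k2])
        highest = sum(g[1] for g in gap_ranges[k1:k2])
        distance = p2 - p1
        if distance <= 0 or not lowest <= distance <= highest:
            return False  # wrong order, or spacing inconsistent with the family
    return True


# Six-motif fingerprint with toy spacing constraints.
GAPS = [(20, 40)] * 5
print(diagnose([10, None, 70, None, None, 160], GAPS))   # motifs 1, 3 and 6 matched
print(diagnose([160, None, 70, None, None, 10], GAPS))   # same motifs, reversed order
```

In this toy case the first query is accepted despite three missing motifs, whereas the second is rejected because the matched motifs occur in the wrong order.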
To facilitate sequence analysis and complement other secondary resources, we have made a range of protein fingerprints available in the PRINTS database (4). In this paper, we describe recent progress with the database, its new search software, and some applications.
SOURCE DATABASE AND METHODS
At present, the source database for PRINTS is OWL (7) (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/OWL/), a non-redundant composite of the major publicly-available primary sources: SWISS-PROT (8), PIR (9), GenBank (translation) (10) and NRL-3D (11).
Fingerprinting is an iterative procedure that commences with manual sequence alignment and excision of conserved motifs using SOMAP (12). The motifs are used to trawl OWL independently using the ADSP sequence analysis package (5,6). The scanning algorithm interprets the motifs essentially as a series of frequency matrices, i.e., identity searches are made, with no mutation or other similarity data to weight the results. The weighting scheme is thus based on the calculation of residue frequencies for each position in the motifs, summing the scores of identical residues for each position of the retrieved match. Diagnostic performance is enhanced by iterative database scanning. The motifs therefore grow and become more mature with each database pass, as more sequences are matched and assimilated into the process. Full potency is gained from the mutual context provided by motif neighbours, which allows sequence identification even when parts of the signature are absent.
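The identity-based scoring and iterative growth of a motif described above can be sketched as follows. This is an illustrative reading of the text, not the ADSP code: it assumes ungapped motifs of equal width, and the function names, the toy sequences and the acceptance threshold are all hypothetical.

```python
from collections import Counter


def frequency_matrix(motif):
    """Count residue occurrences at each column of an aligned, ungapped motif."""
    width = len(motif[0])
    if any(len(seq) != width for seq in motif):
        raise ValueError("all motif sequences must have the same length")
    return [Counter(seq[i] for seq in motif) for i in range(width)]


def score_window(matrix, window):
    """Sum, over positions, how often the window's residue occurs in the motif."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def best_hit(matrix, sequence):
    """Slide the matrix along the sequence; return (score, start) of the best window.

    start is -1 when no window shares any residue identity with the motif.
    """
    width = len(matrix)
    best_score, best_start = 0, -1
    for start in range(len(sequence) - width + 1):
        score = score_window(matrix, sequence[start:start + width])
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


def scan_once(motif, database, threshold):
    """One database pass: assimilate new best windows scoring >= threshold."""
    matrix = frequency_matrix(motif)
    grown = list(motif)
    for sequence in database:
        score, start = best_hit(matrix, sequence)
        if start >= 0 and score >= threshold:
            window = sequence[start:start + len(matrix)]
            if window not in grown:
                grown.append(window)
    return grown


motif = ["GLW", "GLY", "ALW"]
database = ["MKGLWT", "PPGLFQQ", "KKKKKK"]
print(scan_once(motif, database, threshold=5))
```

Repeating `scan_once` with the grown motif until no new windows are added mirrors the iterative scanning in spirit; the real procedure also relies on the context of neighbouring motifs, which this single-motif sketch omits.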
Database format
PRINTS is currently built as a single ASCII (text) file. The contents are separated into specific fields, relating to general information, bibliographic references, text, lists of matches, and the motifs themselves. Each line of a field is assigned a distinct two-letter code, allowing the database to be indexed for fast querying of its contents (13). Entries are assigned both an identification code and an accession number to facilitate cross-referencing by other databases. Conversely, where relevant, cross-references are provided to other databanks (e.g., PROSITE (1), SBASE (14), scop (15), CATH (16), etc.) in order to promote efficient communication between related bioinformatics resources and effectively broaden the scope of sequence analysis strategies. The full format has been described previously (13,17,18), so will not be discussed further here.
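To illustrate why line codes make a flat file easy to index, the following sketch parses a toy file in which every line begins with a two-letter code and entries end with a terminator line. The codes (`ID`, `AC`, `TX`), the `//` terminator and the entry names are invented for the example and are not the actual PRINTS field codes, which are given in the cited format descriptions.

```python
from collections import defaultdict


def parse_entries(text, terminator="//"):
    """Split flat-file text into entries; each maps a two-letter code to its lines."""
    entries, current = [], defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.strip() == terminator:
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
        else:
            code, value = line[:2], line[2:].strip()
            current[code].append(value)
    if current:  # tolerate a missing final terminator
        entries.append(dict(current))
    return entries


def build_index(entries, codes=("ID", "AC")):
    """Map every identifier and accession number to its entry for fast lookup."""
    index = {}
    for entry in entries:
        for code in codes:
            for key in entry.get(code, []):
                index[key] = entry
    return index


# Hypothetical two-entry file.
SAMPLE = """ID  FAMILYA
AC  X00001
TX  First line of text.
TX  Second line of text.
//
ID  FAMILYB
AC  X00002
//
"""

index = build_index(parse_entries(SAMPLE))
print(index["X00001"]["TX"])
```

Because each entry is reachable by either its identification code or its accession number, other databases can cross-reference it by whichever key they store.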
Content of the current release
Release 17.0 of PRINTS (September 1997) contains 800 entries, encoding 4460 individual motifs. The complete contents list is available from the distribution sites and on the PRINTS WWW page (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/printscontents.html).
Database update and growth
PRINTS is released in major and minor versions: major releases are database expansions, i.e., they denote the addition of new material to the resource; minor releases reflect updates of existing entries to bring the contents in line with the current version of OWL. To date, there have been 21 releases of the database: 17 major and four minor. We endeavour to make a major or minor version available quarterly; in the last year, we have achieved four major releases.
The principal obstacle to the frequency of expansions, and particularly of updates, is the time-consuming nature of the approach. Deriving a fingerprint involves two major threads: (i) a computational aspect, which involves initial alignment and maximisation of sequence information through iterative scanning, with multiple motifs, of a large composite database; and (ii) an annotation component, which involves researching each family and, where possible, linking sequence conservation information to known structural or functional data. This is a rigorous, exhaustive and thus time-consuming technique. But the precision of the results, coupled with the quality of annotations, has justified the sacrifice of speed, and sets the database apart from the growing number of automatically-derived pattern resources, for which there are no annotations, and hence no appropriate mechanisms for result validation.
Database distribution
PRINTS is available for interactive use via UCL's DbBrowser Bioinformatics Server, at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/ (19). The PRINTS home page (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/) allows keyword searching of database code, accession number, text, sequence, etc. Such queries are made possible by links to the query language (13), but are presented in a manner that shields the user from its syntax, which is desirable for routine queries. Where results are of particular interest, the full entry may be retrieved to discover more about the fingerprint. As shown in Figure 1, hyperlinks allow the user to retrieve related information from a variety of bioinformatics resources. In addition, the parent alignment from which the fingerprint was derived may be downloaded via a link to the CINEMA colour alignment editor (20), allowing visualisation and interactive manipulation of the alignment of interest.
For local installation, the database may be retrieved directly from the anonymous-ftp servers at UCL (ftp.biochem.ucl.ac.uk in pub/prints), Daresbury (s-ind2.dl.ac.uk in pub/database/prints), EBI (ftp.ebi.ac.uk in pub/databases), EMBL (ftp.embl-heidelberg.de) and NCBI (ncbi.nlm.nih.gov). In addition, it is distributed on the EMBL suite of CD-ROMs.

Derivative databases
A particular strength of the PRINTS database is that the underlying data are stored in the form of raw sequence alignments. This allows different implementations to be set up using a variety of alternative scoring methods and/or abstractions. For example, a BLOCKS-format version of the resource is available at the Fred Hutchinson Cancer Research Center (http://www.blocks.fhcrc.org/blocks_search.html); this exploits the powerful scoring method originally developed for the BLOCKS database (2). Alternatively, the protein function identification resource (IDENTIFY) at Stanford (http://dna.stanford.edu/identify/) overlays a fuzzy regular expression approach over the PRINTS multiply-aligned motifs and offers different levels of stringency from which to infer the significance of matches. Such derivative databases are useful for providing different perspectives on the same data set: they afford the opportunity to validate results, where there are corresponding matches in more than one resource; and they offer the chance to diagnose matches that may have been missed by the original implementation.
New search software
An important new facility has been added to the Web interface and deserves special mention. Secondary databases are of limited value without appropriate search tools. Our previous software (21) was limited to single sequence queries and could not differentiate between partial, but nevertheless true, fingerprint matches and random, high-scoring individual motif hits. We have addressed these problems with a new suite of programs, which provides facilities for: (i) interactive, individual query sequence submission against the full database; (ii) non-interactive, bulk query submission against the full database (with full genome analysis in mind); and (iii) interactive, individual sequence searching against a named fingerprint. Results from these programs are returned in distinct ways, with an attempt made to cater for both casual and expert users: the first offers an 'intelligent' best guess, based on the occurrence of the highest-scoring full or partial fingerprint match, but more detailed results are provided in different layers via an extended HTML table, as illustrated in Figure 2; the second facility provides only brief information, which is returned via email; and the third option provides a graphical cartoon view of a single fingerprint profile, offering an instant diagnosis of any query sequence, as shown in Figure 3.
Applications
The fingerprint technique has been used to study a wide range of globular, membrane, and modular proteins (6,22,23). In recent database releases, particular emphasis has been placed on the elucidation of discriminatory fingerprints for a range of G-protein-coupled receptor (GPCR) families and subfamilies (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/GPCR). This has become important as the growth of the rhodopsin-like family has soared; there are now >1000 rhodopsin-like GPCRs known and diagnosis of family outliers has become increasingly difficult. By expanding the range of GPCR families covered in PRINTS, the fingerprint facility on the Web effectively provides an instant diagnostic tool for putative GPCRs. This is illustrated in Figure 3, in which a Caenorhabditis elegans integral membrane protein from SWISS-PROT (SG12_CAEEL) is shown to make a partial match with the rhodopsin-like fingerprint, which encodes the seven transmembrane domains. The sequence is not diagnosed by PROSITE because it contains changes in the third transmembrane domain, which alone provides the basis for the PROSITE pattern; BLAST (24) also fails to return any significant scores, and no matches are reported from searches of resources such as BLOCKS and Pfam. Using the fingerprint approach, it is possible to detect such twilight relationships because of the diagnostic framework provided by neighbouring motifs. Thus, in spite of the relative weakness or absence of several peaks in the fingerprint profile, the mutual context provided by the remaining fingerprint elements allows us to infer a distant family relationship.
The ability to detect distant familial relationships is particularly important in the context of complete genome analysis. Protocols based, for example, on the combination of BLAST and PROSITE alone, are likely to miss significant matches. Preliminary results from the examination of the Saccharomyces cerevisiae genome (25) suggest that application of the PRINTS system has been able to make family assignments for ~300 sequences designated as hypothetical proteins, i.e., the method has assigned potential functions to ~10% of uncharacterised sequences. This figure has to be set in the context of the size of PRINTS, which is small in comparison with the primary databases; as PRINTS grows, inevitably its impact in such applications will increase. But still, this is an encouraging early result and is the focus of an ongoing investigation.
Future directions
In order to cope more effectively with the information arising from the various genome projects, it is essential to reduce the manual burden inherent in our current database curation strategies and, where possible, increase levels of automation. To this end, developments are planned in a number of areas: e.g., we aim to (i) implement automated strategies for fingerprint derivation; (ii) design methods for automatic extraction of low-level annotations from the primary database; and ultimately, (iii) pool high-level documentation with that from PROSITE and Pfam, creating a central compendium of domain and family descriptions. This last will help to reduce duplication of effort in the rate-determining step of annotation, and aims to provide a one-stop shop for analysis of newly-determined sequences.
In the meantime, while largely-manual approaches are still in place, emphasis will continue to be placed on adding new families to PRINTS, rather than on routinely updating existing ones. The underlying philosophy here is to try to provide a more comprehensive diagnostic resource, with high-quality annotations, rather than simply to focus on providing an up-to-date look-up table of family membership (an impossible individual human task against the swelling tide of primary data).
In addition to addressing the practicalities of database maintenance, we also aim to enhance the range of analysis tools available, to make the information within PRINTS more readily accessible to users.
CONCLUSION
Secondary databases are an important part of the endeavour to harvest the abundant fruits of the various genome projects. The scope and subtlety of such resources make them powerful tools for diagnosing the relationships between sequences that underpin the inference of function. But none of these databases is an end in itself: none of the underlying analysis methods is yet infallible, and none of the resources is complete. Nevertheless, coupled with PROSITE, BLOCKS, Pfam, etc., PRINTS adds an important piece to the jigsaw in the challenging puzzle of sequence analysis.
ACKNOWLEDGEMENTS
We thank the authors of the database software and everyone who has contributed entries to the resource. PRINTS is built and maintained at UCL with support from the Royal Society (TKA is a Royal Society University Research Fellow). PS is grateful to Astra Charnwood for a studentship. JS is grateful to Zeneca for a bioinformatics fellowship. The project benefits from use of the BBSRC SEQNET facility.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.