Nucleic Acids Research, 2000, Vol. 28, No. 1 41-44
© 2000 Oxford University Press
The Protein Information Resource (PIR)
Protein Information Resource, National Biomedical Research Foundation, 3900 Reservoir Road, NW, Washington, DC 20007, USA, 1GSF-Forschungszentrum für Umwelt und Gesundheit, Munich Information Center for Protein Sequences am Max-Planck-Institut für Biochemie, Am Klopferspitz 18, D-82152 Martinsried, Germany and 2Japan International Protein Information Database, Amakubo 1-16-1, Tsukuba 305-0005, Japan
Received October 1, 1999; Accepted October 4, 1999.
| ABSTRACT |
|---|
The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de . The PIR-International Protein Sequence Database and other files are also available by FTP.
| INTRODUCTION |
|---|
The accelerating pace of genome sequencing projects has greatly increased the volume and complexity of available molecular data. To realize the fullest possible value from the data and to gain a better understanding of the genome, databases and the computational tools for analyzing them are required to allow biologically relevant features in the sequences to be identified and to provide insight on their structure and function. For over 30 years, the Protein Information Resource (PIR) has been providing the scientific community with databases and tools for the organization and analysis of protein sequence data (1,2). Together with MIPS and JIPID, we have undertaken a major restructuring to meet the challenges presented by the rapid growth of largely uncharacterized sequence data and the opportunities provided by the nearly universal access of scientists to the resources available on the WWW. Among the key developments are complete protein family organization for the PIR-International Protein Sequence Database (PSD) and integrated WWW interfaces for user-friendly sequence analysis, database searching and information retrieval.
| THE PIR-INTERNATIONAL PROTEIN DATABASES |
|---|
PIR, MIPS and JIPID constitute the PIR-International consortium that maintains the PIR-International Protein Sequence Database (PSD), the largest publicly distributed and freely available protein sequence database. The database has the following distinguishing features.
It is a comprehensive, annotated, and non-redundant protein sequence database, containing over 142 000 sequences as of September 1999. Included are sequences from the completely sequenced genomes of 16 prokaryotes, six archaebacteria, 17 viruses and phages, >100 eukaryote organelles and Saccharomyces cerevisiae.
The collection is well organized with >99% of entries classified by protein family and >57% classified by protein superfamily.
PSD annotation includes concurrent cross-references to other sequence, structure, genomic and citation databases, including the public nucleic acid sequence databases, ENTREZ, MEDLINE, PDB, GDB, OMIM, FlyBase, MIPS/Yeast, SGD/Yeast, MIPS/Arabidopsis and TIGR. Where these databases are publicly and freely accessible and provide suitable WWW access, the cross-references presented on the PIR WWW site are hot-linked so that searchers can consult the most current data.
The PIR is the only sequence database to provide context cross-references between its own database entries. These cross-references assist searchers in exploring relationships such as subunit associations in molecular complexes, enzyme-substrate interactions, activation and regulation cascades, as well as in browsing entries with shared features and annotations.
Interim updates are made publicly available on a weekly basis, and full releases have been published quarterly since 1984.
In addition to the PSD, PIR-International distributes or provides WWW access to other sequence and auxiliary databases (Table 1), briefly described below, and maintains several internal data collections used for sequence annotation and integrity checks.
PATCHX (3) is a non-redundant database assembled by MIPS of publicly available protein sequences not yet in the PIR-International PSD. PIR+PATCHX, a combination of the PSD and PATCHX containing ~300 000 sequences available for similarity searches, is the most complete non-redundant collection of protein sequences available in the public domain.
ARCHIVE is a database of protein sequences as originally reported in a publication or submission, the only such collection of as-published, unmerged sequences.
NRL_3D (4) is a sequence-structure database produced from sequence and annotation in the Protein Data Bank (PDB) of three-dimensional structures (5).
FAMBASE is a collection of representative sequences from each protein family that can be used in a similarity search to reduce search time and improve sensitivity for identifying distant families.
PIR-ALN (6) is a curated database of sequence alignments of superfamilies, families and homology domains, with annotation information derived from PSD and consensus patterns calculated from the alignments.
RESID (7) is a database of post-translational modifications with descriptive, chemical, structural and bibliographic information based on feature information in the PSD.
ProClass (8) is a protein family database that organizes non-redundant PIR-International PSD and SWISS-PROT sequences according to PIR superfamilies and PROSITE patterns.
ProtFam (9) is a curated database of homology clusters with automatically generated multiple sequence alignments for families, superfamilies and homology domains.
To support both data management and data mining and assist knowledge discovery, the PIR databases are being migrated to an object-relational database management system. A three-tier network computing architecture provides a framework for distributed object computing, and Java-based WWW interfaces connect with the database server for database query and update tasks.
Superfamily and family classification
The pioneering work of Margaret Dayhoff on protein classification based on the superfamily concept (10,11) was refined by PIR-International (12) to assist database organization and molecular evolution studies. Central to the organization and annotation of the PIR-International sequence and auxiliary databases are the protein family relationships which are structured at three levels: (i) superfamilies and families (for full-length sequence similarity), (ii) homology domains (for local functional or structural units), and (iii) motifs (for functional or structural sites). PIR-International has maintained the highest classification rate and provided the most comprehensive classification and alignments of proteins among all major public domain databases.
To deal efficiently with the many new sequences from genome sequencing projects, procedures for family and superfamily classification have been automated. Over 99% of sequences are routinely classified shortly after entry into the database into protein families of sequences that are at least 45% identical. Subsequently, entries are further clustered into regular superfamilies of sequences that share end-to-end homology (but may be rather distantly related) and also into domain superfamilies of proteins sharing at least one common homology domain. There are currently >76 000 sequences in >8900 superfamilies, and 30 000 entries with 370 recognized homology domains in the PSD. Corresponding to the classification are 1500 superfamily, 2100 family and 400 domain alignments in the PIR-ALN database, and 15 000 family and 4500 superfamily alignments in the MIPS ProtFam database. Also available from PIR is the ProClass protein family database containing 92 000 classified entries as well as 1300 motif alignments of ~44 000 PIR-International PSD entries.
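The 45% identity criterion for family membership can be illustrated with a minimal sketch. It is not the PIR-International procedure: it assumes the sequences are already aligned to equal length (with `-` for gaps), measures identity over columns where neither sequence has a gap, and uses a simple greedy assignment to the first matching family representative. The function names and toy sequences are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def greedy_families(aligned: dict[str, str], threshold: float = 45.0) -> list[list[str]]:
    """Assign each sequence to the first family whose representative it matches.

    The representative of a family is its first member (an assumption made
    for this sketch); a sequence that matches no representative starts a
    new family.
    """
    families: list[list[str]] = []
    for name, seq in aligned.items():
        for family in families:
            representative = aligned[family[0]]
            if percent_identity(seq, representative) >= threshold:
                family.append(name)
                break
        else:
            families.append([name])
    return families


# Toy, made-up aligned sequences (not PSD entries).
toy = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYLAKQR",  # differs from seqA at one position
    "seqC": "GGSPWWHHND",  # shares no residues with seqA
}
families = greedy_families(toy)
print(families)
```

With a greedy scheme of this kind the result depends on input order; a production classifier would need an order-independent clustering rule and a real alignment step.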
| THE PIR SEARCH AND ANALYSIS SYSTEM |
|---|
The PIR search and analysis system provides search engines of three types (Table 2): (i) interactive text-based search engines, which allow Boolean queries of text fields; (ii) standard sequence similarity search engines, including Peptide Match, Pattern Match, BLAST, FASTA, Pairwise Alignment and Multiple Alignment; and (iii) advanced search engines that combine sequence similarity and annotation searches or evaluate gene family relationships, including Annotation-Sorted Similarity Search, Domain Search, Global and Domain Similarity Search and GeneFIND. Sequence searching can be performed against the PSD, NRL_3D, PATCHX, FAMBASE, ProClass and the combined PSD+PATCHX collections. Text and entry searching is provided for the PSD, NRL_3D, PIR-ALN and RESID databases.
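As an illustration of what a Boolean query over text fields involves, the following sketch composes AND, OR and NOT over per-field substring tests. The entry records, field names and helper functions are hypothetical and do not reflect the PIR query syntax or database schema.

```python
from typing import Callable

Entry = dict[str, str]
Predicate = Callable[[Entry], bool]


def field_contains(field: str, term: str) -> Predicate:
    """Case-insensitive substring test on a single text field."""
    return lambda entry: term.lower() in entry.get(field, "").lower()


def AND(*preds: Predicate) -> Predicate:
    return lambda entry: all(p(entry) for p in preds)


def OR(*preds: Predicate) -> Predicate:
    return lambda entry: any(p(entry) for p in preds)


def NOT(pred: Predicate) -> Predicate:
    return lambda entry: not pred(entry)


# Made-up records standing in for annotated database entries.
entries = [
    {"id": "E1", "title": "cytochrome c", "species": "Homo sapiens"},
    {"id": "E2", "title": "cytochrome b5", "species": "Saccharomyces cerevisiae"},
    {"id": "E3", "title": "hemoglobin alpha chain", "species": "Homo sapiens"},
]

# title contains "cytochrome" AND NOT species contains "homo"
query = AND(field_contains("title", "cytochrome"),
            NOT(field_contains("species", "homo")))
hits = [e["id"] for e in entries if query(e)]
print(hits)
```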
Sequence search and alignment
BLAST (13) and FASTA (14) searches for sequence similarity are available for all sequence databases. The output of these search engines employs a graphical interface showing location of hits within the query sequence and full-length alignments generated by SSEARCH (15). Multiple or pairwise alignments of PSD or user-supplied sequences can be done using CLUSTALW (16) or SSEARCH. PIR pattern or peptide matching programs can (i) match a query sequence against a database of regular expressions (i.e., patterns); (ii) search a user-specified regular expression against a sequence database; or (iii) find an exact match for a user-specified peptide sequence in one of the sequence databases, including the ARCHIVE database of as-published sequences.
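The three pattern and peptide matching modes can be sketched with Python's standard `re` module and plain string search. This is only an illustration of the idea, not the PIR programs: the function names, the toy sequences and the pattern collection are hypothetical, and the example pattern is the familiar N-glycosylation motif written as a regular expression.

```python
import re


def scan_patterns(sequence: str, patterns: dict[str, str]) -> list[str]:
    """Mode (i): names of the patterns that occur in one query sequence."""
    return [name for name, pat in patterns.items() if re.search(pat, sequence)]


def scan_sequences(pattern: str, sequences: dict[str, str]) -> dict[str, list[tuple[int, str]]]:
    """Mode (ii): non-overlapping matches of one regular expression per sequence.

    Positions are 1-based, as is usual for protein sequences.
    """
    regex = re.compile(pattern)
    hits: dict[str, list[tuple[int, str]]] = {}
    for name, seq in sequences.items():
        found = [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        if found:
            hits[name] = found
    return hits


def peptide_match(peptide: str, sequences: dict[str, str]) -> dict[str, list[int]]:
    """Mode (iii): 1-based start positions of exact peptide matches (overlaps allowed)."""
    hits: dict[str, list[int]] = {}
    for name, seq in sequences.items():
        positions = []
        start = seq.find(peptide)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(peptide, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Made-up sequences and patterns for illustration only.
sequences = {"s1": "MANGTSKLNRSA", "s2": "MKKLLPTAAG"}
patterns = {"n_glyc": r"N[^P][ST][^P]", "rgd": "RGD"}

print(scan_patterns(sequences["s1"], patterns))
print(scan_sequences(patterns["n_glyc"], sequences))
print(peptide_match("KL", sequences))
```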
PIR Similarity Search system
Combining sequence and annotation search, the Annotation-Sorted Similarity Search facility displays BLAST or FASTA matches along with the user-selected annotation (superfamily, family, species, taxonomic group, keyword or all five) in the annotation-sorted order. The matched entries can be selected for multiple alignments against the query sequence using CLUSTALW and displayed using MView (17).
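The idea of annotation-sorted output can be sketched as a grouping step applied to a list of similarity hits. The sketch assumes that hits are grouped by the selected annotation value in alphabetical order and ranked by score within each group; the text does not specify the exact ordering rule, and the `Hit` record and its values are hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Hit:
    entry_id: str
    score: float
    superfamily: str
    species: str


def annotation_sorted(hits: list[Hit], field: str) -> list[tuple[str, list[str]]]:
    """Group hits by one annotation field; best score first within each group."""
    ordered = sorted(hits, key=lambda h: (getattr(h, field), -h.score))
    return [
        (value, [h.entry_id for h in group])
        for value, group in groupby(ordered, key=lambda h: getattr(h, field))
    ]


# Made-up similarity hits, standing in for BLAST or FASTA output.
hits = [
    Hit("E1", 310.0, "globin", "Homo sapiens"),
    Hit("E2", 120.0, "cytochrome c", "Bos taurus"),
    Hit("E3", 450.0, "globin", "Mus musculus"),
    Hit("E4", 95.0, "cytochrome c", "Homo sapiens"),
]

by_superfamily = annotation_sorted(hits, "superfamily")
by_species = annotation_sorted(hits, "species")
print(by_superfamily)
print(by_species)
```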
The Domain Similarity Search engine uses FASTA to search against domain sequences compiled from the PIR-International PSD, and displays the PSD entry and domain annotation with a graphical representation of the matched region with links to domain alignments in PIR-ALN. The Global and Domain Similarity Search uses BLAST to search the PSD for global similarity and FASTA to search the domain sequence collection for local similarity. The results are ranked by the global score and show the extent of matches at both the global and domain levels. Any combination of complete sequences and domains can be selected and viewed in a multiple alignment.
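A combined global and domain report of this kind can be sketched as a merge of two result sets, ranked by the global score. The sketch assumes that entries matched only at the domain level are listed after all globally matched entries; the text does not say how such entries are handled. The function name, the input structures and the toy values are hypothetical.

```python
from typing import Optional

DomainMatch = tuple[str, int, int]  # (domain name, start, end) in the entry


def combine_global_and_domain(
    global_hits: dict[str, float],
    domain_hits: dict[str, list[DomainMatch]],
) -> list[tuple[str, Optional[float], list[DomainMatch]]]:
    """Rank entries by global score and attach their domain-level matches."""
    ids = set(global_hits) | set(domain_hits)
    # Entries with a global score come first, highest score first;
    # the entry identifier breaks ties so the order is deterministic.
    ordered = sorted(
        ids,
        key=lambda e: (e not in global_hits, -global_hits.get(e, 0.0), e),
    )
    return [(e, global_hits.get(e), domain_hits.get(e, [])) for e in ordered]


# Made-up scores and domain coordinates for illustration only.
global_hits = {"E1": 512.0, "E2": 88.0}
domain_hits = {
    "E2": [("kinase domain", 40, 290)],
    "E3": [("kinase domain", 12, 260)],
}

report = combine_global_and_domain(global_hits, domain_hits)
for row in report:
    print(row)
```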
The PIR Integrated Environment for Sequence Analysis brings together all of the above protein analysis tasks, including sequence similarity search, pattern and peptide match, multiple sequence alignment and advanced PIR similarity searches, as well as entry retrieval by unique superfamily, family, title, species, taxonomic group, domains or keywords.
GeneFIND (18) provides protein family classification and information retrieval by combining several search/alignment tools and the ProClass database in a multi-level filter system, including the MOTIFIND neural networks, BLAST search, SSEARCH sequence alignment, motif pattern matching, hidden Markov motif modeling and CLUSTALW multiple motif alignment.
| AVAILABILITY |
|---|
PIR provides free public access to value-added protein information through its WWW site at http://pir.georgetown.edu and direct file transfer at ftp://nbrfa.georgetown.edu/pir . In addition to the databases (Table 1) and search tools (Table 2), the PIR WWW site also provides associated metadata, including technical bulletins and documentation that serves as metadata dictionaries for the PIR-International PSD. Accessible from the PIR anonymous FTP site are PIR-International databases and many other documents, files and software tools, including the weekly interim updates of the PSD (in NBRF format) and the corresponding sequence file (in FASTA format). The PIR-International PSD quarterly releases (in both NBRF and CODATA formats) are also available at the NCBI FTP server. Other sites and data depositories do not always have the most recent quarterly release of the PSD.
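For readers working with the sequence file in FASTA format, a minimal reader can be sketched as follows. It relies only on the general FASTA convention of a `>` header line followed by one or more sequence lines; it makes no assumption about how PIR lays out its header lines, and the sample records are made up.

```python
from typing import Iterable, Iterator


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Made-up records; a real file would be opened and passed in line by line.
sample = """>entry1 example protein
MKTAYIAK
QRQISFVK
>entry2
GGSPWW
""".splitlines()

records = list(read_fasta(sample))
print(records)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which avoids loading a large release into memory at once.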
| ACKNOWLEDGEMENTS |
|---|
PIR is a registered mark of NBRF. The work at NBRF is supported by grant number P41 LM05978 from the National Library of Medicine and by gifts from COMPAQ, Pfizer and Dupont. The work at MIPS is supported by the Federal Ministry of Education, Science, Research and Technology (BMBF, FKZ 03311670, 01KW9703/7), the Max-Planck-Society and the European Commission (BIO4-CT96-0110, 0338, 0558).
| FOOTNOTES |
|---|
* To whom correspondence should be addressed. Tel. +1 202 687 2121; Fax: +1 202 687 1662; Email: pirmail@nbrf.georgetown.edu
| REFERENCES |
|---|
1 Dayhoff,M.O., Eck,R.V., Chang,M.A. and Sochard,M.R. (1965) Atlas of Protein Sequence and Structure, Vol. 1. National Biomedical Research Foundation, Silver Spring, MD.
2 Dayhoff,M.O. (1979) Atlas of Protein Sequence and Structure, Vol. 5, Supplement 3. National Biomedical Research Foundation, Washington, DC.
3 Barker,W.C., George,D.G., Mewes,H.-W., Pfeiffer,F. and Tsugita,A. (1993) Nucleic Acids Res., 21, 3089-3092.
4 Pattabiraman,N., Namboodiri,K., Lowrey,A. and Gaber,B.P. (1990) Protein Seq. Data Anal., 3, 387-405.
5 Abola,E.E., Manning,N.O., Prilusky,J., Stampf,D.R. and Sussman,J.L. (1996) Res. Natl Stand. Technol., 101, 231-241.
6 Srinivasarao,G.Y., Yeh,L.-S., Marzec,C.R., Orcutt,B.C. and Barker,W.C. (1999) Bioinformatics, 15, 382-390.
7 Garavelli,J.S. (2000) Nucleic Acids Res., 28, 209-211 (this issue).
8 Wu,C., Xiao,C. and Huang,H. (2000) Nucleic Acids Res., 28, 273-276 (this issue).
9 Mewes,H.W., Frishman,D., Haase,D., Kaps,A., Lemcke,K., Mannhaupt,G., Pfeiffer,F., Schüller,C., Stocker,S. and Weil,B. (2000) Nucleic Acids Res., 28, 37-40 (this issue).
10 Dayhoff,M.O. (1976) Fed. Proc., 35, 2132-2138.
11 Dayhoff,M.O., McLaughlin,P.J., Barker,W.C. and Hunt,L.T. (1975) Naturwissenschaften, 62, 154-161.
12 Barker,W.C., Pfeiffer,F. and George,D. (1996) Methods Enzymol., 266, 59-71.
13 Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389-3402.
14 Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85, 2444-2448.
15 Smith,T.F. and Waterman,M.S. (1981) Adv. Appl. Math., 2, 482-489.
16 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673-4680.
17 Brown,N.P., Leroy,C. and Sander,C. (1998) Bioinformatics, 14, 380-381.
18 Wu,C.H., Huang,H. and Shivakumar,S. (1999) Int. J. Artificial Intelligence Tools, in press.