Nucleic Acids Research, 2003, Vol. 31, No. 1 345-347
© 2003 Oxford University Press
The Protein Information Resource
Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA and ¹National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, Washington, DC 20057-1414, USA
*To whom correspondence should be addressed. Tel: +1 2026872121; Fax: +1 2026871662; Email: pirmail{at}georgetown.edu
Received September 15, 2002; Accepted September 27, 2002
ABSTRACT
The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotation, and detection of annotation errors. The superfamily curation defines signature domain architecture and categorizes memberships to improve automated classification. To increase the amount of experimental annotation, the PIR has developed a bibliography system for literature searching, mapping, and user submission, and has conducted retrospective attribution of citations for experimental features. PIR also maintains NREF, a non-redundant reference database, and iProClass, an integrated database of protein family, function, and structure information. PIR-NREF provides a timely and comprehensive collection of protein sequences, currently consisting of more than 1 000 000 entries from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB. The PIR web site (http://pir.georgetown.edu) connects data analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and text searches, and sorting and visual exploration of search results. The FTP site provides free download for PSD and NREF biweekly releases and auxiliary databases and files.
INTRODUCTION
In order to provide integrated and value-added protein information to the scientific community, the Protein Information Resource (PIR) continues to enhance its three major databases, the Protein Sequence Database (PSD), the Non-redundant REFerence (NREF) sequence database, and the integrated Protein Classification (iProClass) database (1). The sections below describe key developments in the past year.
PIR-PSD
The PIR-PSD is a public domain protein sequence database that currently contains over 283 000 annotated and classified entries covering the entire taxonomic range. Recent development and annotation efforts have focused on two areas: superfamily classification and curation, and bibliography mapping and attribution.
Superfamily classification and curation
A unique characteristic of the PIR-PSD is the superfamily classification (2) that provides comprehensive, non-overlapping, and hierarchical clustering of sequences to reflect their evolutionary relationships. To further improve the quality of automated classification, we have conducted systematic superfamily curation that: (i) defines the signature domain architecture (number, order, and types of domains) characteristic of the superfamily, (ii) categorizes regular and associate members to distinguish sequence entries sharing the signature features from outliers (such as fragments), and (iii) designates representative and seed members amongst regular members. Several thousand superfamilies have been manually curated. The seed members provide a basis for automatic placement of new sequences into existing superfamilies and for automatic generation of multiple sequence alignments and phylogenetic trees. Currently, over 99% of PSD sequences are classified into families of closely related sequences (at least 45% identical), and over two-thirds of sequences are classified into >36 000 superfamilies.
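The distinction between regular and associate members can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a signature domain architecture is represented as an ordered tuple of domain types and that an entry is a regular member when it is not a fragment and its architecture matches the signature exactly. The `Entry` type, the `categorize` function, and the domain names are hypothetical; the actual PIR curation is manual and its criteria are not specified beyond the description above.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical, simplified view of a sequence entry."""
    entry_id: str              # made-up identifier, not a real PIR ID
    domains: Tuple[str, ...]   # domain types in N- to C-terminal order
    is_fragment: bool = False  # fragments are treated as outliers


def categorize(entry: Entry, signature: Tuple[str, ...]) -> str:
    """Return 'regular' if the entry shares the signature architecture
    (same number, order, and types of domains), otherwise 'associate'."""
    if entry.is_fragment or entry.domains != signature:
        return "associate"
    return "regular"


# Illustrative superfamily signature: two domains in a fixed order.
signature = ("kinase", "SH2")
entries = [
    Entry("E1", ("kinase", "SH2")),                    # matches signature
    Entry("E2", ("SH2", "kinase")),                    # wrong domain order
    Entry("E3", ("kinase", "SH2"), is_fragment=True),  # fragment
]
labels = {e.entry_id: categorize(e, signature) for e in entries}
```

Under these assumptions only `E1` is labeled regular; the entry with a different domain order and the fragment are both labeled associate.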
Bibliography mapping and attribution
To improve the quality of protein annotation by increasing the amount of experimentally verified data with source attribution, the PIR has developed a bibliography information system and conducted retrospective attribution of literature data. The bibliography system allows browsing and searching of extensive literature collected for all protein entries from PubMed and other curated molecular databases, together with an interface for scientists to categorize and submit literature information for mapped proteins. In PIR-PSD, experimentally determined protein features, such as binding sites, structural motifs, and post-translational modifications, are tagged with an experimental status to distinguish them from computationally predicted features; until now, however, they had not been associated with literature citations. A systematic manual attribution of experimental features is being carried out with computer-assisted mapping to existing protein bibliographic information. So far, a few thousand experimental features have been associated with publications.
PIR-NREF DATABASE
The PIR-NREF provides a timely and comprehensive collection of protein sequence data, keeping pace with the genome sequencing projects and containing source attribution and minimal redundancy. The database contains all sequences in PIR-PSD, SWISS-PROT (3), TrEMBL (3), RefSeq (4), GenPept, and PDB (5), currently totaling more than 1 000 000 entries. Identical sequences from the same source organism (species) reported in different databases are presented as a single NREF entry with protein IDs, accession numbers, and protein names from each underlying database, as well as amino acid sequence, taxonomy, and composite bibliographic data. Also listed are related sequences identified by all-against-all FASTA search (6), including identical sequences from different organisms, identical subsequences, and highly similar sequences (≥95% identity). NREF can be used for sequence searching and protein identification against the entire sequence collection or a subset of one or more genomes. The collective protein names, including synonyms, and the bibliographic information can be used to develop a protein name ontology. The different protein names assigned by different databases may help detect annotation errors, especially those resulting from large-scale genomic annotation.
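The merging rule described above, one entry per identical sequence from the same source organism, can be sketched in Python as follows. This is an illustration under stated assumptions, not the PIR pipeline: the record layout (`db`, `id`, `name`, `organism`, `sequence`), the function names, and all identifiers in the sample data are hypothetical.

```python
from collections import defaultdict


def build_nref(records):
    """Collapse records that share the same sequence and organism into one
    non-redundant entry, keeping the ID and name from each source database."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["sequence"].upper(), rec["organism"])
        groups[key].append(rec)
    return [
        {
            "sequence": seq,
            "organism": organism,
            "sources": [(r["db"], r["id"], r["name"]) for r in recs],
        }
        for (seq, organism), recs in groups.items()
    ]


def distinct_names(entry):
    """Collect the protein names assigned by the source databases; more than
    one name may indicate a synonym or a possible annotation error."""
    return {name for _db, _id, name in entry["sources"]}


# Made-up records for illustration only.
records = [
    {"db": "PIR-PSD", "id": "X0001", "name": "alcohol dehydrogenase",
     "organism": "Homo sapiens", "sequence": "MKVLAAG"},
    {"db": "SWISS-PROT", "id": "Y0001", "name": "ADH1",
     "organism": "Homo sapiens", "sequence": "mkvlaag"},
    {"db": "RefSeq", "id": "Z0001", "name": "alcohol dehydrogenase",
     "organism": "Mus musculus", "sequence": "MKVLAAG"},
]
nref = build_nref(records)
```

In this sample the two human records collapse into a single entry with two sources, while the identical mouse sequence stays separate because it comes from a different organism; in NREF such a sequence would be listed as a related sequence rather than merged.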
AVAILABILITY
PIR web site
The PIR web site connects data mining and sequence analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and annotation text searches, and sorting and visual exploration of search results. The three major databases (PSD, NREF and iProClass) represent primary entry points in the PIR web site, all of which provide text search for entry and list retrieval as well as BLAST search and peptide match. Direct entry report retrieval is based on the unique sequence identifiers of all underlying databases, such as PIR, SWISS-PROT, or RefSeq. Basic and advanced text searches return protein entries listed in summary lines with information on protein IDs, matched fields, protein name, taxonomy, superfamily, domain, and motif, with hypertext links to the full entry report and to cross-referenced databases. More than 50 fields are searchable, including about 30 unique database identifiers (e.g., PDB ID, EC number, PubMed ID, and KEGG pathway number) and a wide range of annotation texts (e.g., protein name, organism name, sequence feature, and paper title). The BLAST search and peptide search likewise return lists of matched entries with summary lines that also contain search statistics and the matched sequence region. Protein entries returned from text and sequence searches can be selected for further analysis, including BLAST (7) and FASTA search, pattern match, hidden Markov model (HMMER) (8) domain search, ClustalW (9) multiple sequence alignment and Phylip (10) phylogenetic tree generation, and graphical display of superfamily, domain and motif relationships. Species-based browsing and searching are supported for about 100 organisms, including over 70 complete genomes. The related sequences in FASTA clusters are retrievable by unique sequence identifier; neighbors are listed with annotation information and a graphical display of the matched sequence region.
A list of the major PIR pages is shown in Table 1.
PIR FTP site
The three PIR databases, PSD, NREF and iProClass, are updated biweekly on the same release schedule and made immediately available from the PIR web site for searching and browsing, as well as from the FTP site for free downloading. PIR-PSD is distributed as flat files in NBRF, CODATA, and XML formats, and as a distribution for the open source relational database MySQL. The MySQL distribution file contains data files (in relational tables), SQL scripts for creating the database and a user's guide with the database schema. PIR-NREF is distributed as XML files. Both PSD and NREF XML distributions have an associated DTD (Document Type Definition) file. The sequence files of both databases are distributed in FASTA format.
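For readers working with the FASTA sequence files, a minimal Python sketch of a FASTA reader is given here. It assumes only the general FASTA convention (a header line starting with `>` followed by one or more sequence lines); the function name `read_fasta` and the sample headers are hypothetical, and the exact header layout of the PIR distribution files is not assumed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>'; sequence lines that
    belong to one record are joined into a single string.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data; real files would be opened with open(path).
sample = """>entry1 hypothetical protein
MKV
LLA
>entry2
GGH
""".splitlines()
fasta_records = list(read_fasta(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so large release files do not need to be loaded into memory at once.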
ACKNOWLEDGEMENT
The PIR is supported by grant P41 LM05978 from the National Library of Medicine, National Institutes of Health.
REFERENCES
- Wu,C.H., Xiao,C., Hou,Z., Huang,H. and Barker,W.C. (2001) iProClass: an integrated and comprehensive protein classification database. Nucleic Acids Res., 29, 52-54.
- Barker,W.C., Pfeiffer,F. and George,D.G. (1996) Superfamily classification in PIR-International Protein Sequence Database. Methods Enzymol., 266, 59-71.
- Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45-48.
- Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137-140.
- Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S., Bourne,P.E. and Berman,H.E. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245-248.
- Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
- Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
- Eddy,S.R., Mitchison,G. and Durbin,R. (1995) Maximum Discrimination hidden Markov models of sequence consensus. J. Comp. Biol., 2, 9-23.
- Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
- Felsenstein,J. (1989) PHYLIP - phylogeny inference package (Version 3.2). Cladistics, 5, 164-166.