Nucleic Acids Research, 2004, Vol. 32, Database issue D497-D501
© 2004 Oxford University Press
Human protein reference database as a discovery resource for proteomics
1 McKusick-Nathans Institute of Genetic Medicine and Department of Biological Chemistry, Johns Hopkins University, Baltimore, MD 21287, USA, 2 Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, Denmark, 3 Departamento de Automática y Computación, Área de Ciencias de la Computación e Inteligencia Artificial, Universidad Pública de Navarra, 31006 Pamplona, Spain and 4 Institute of Bioinformatics, Discoverer 7th Floor, International Technology Park Ltd, Bangalore 560 066, India
*To whom correspondence should be addressed. Tel: +1 410 502 6662; Fax: +1 410 502 7543; Email: pandey{at}jhmi.edu
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors
Received August 17, 2003; Revised and Accepted September 30, 2003
| ABSTRACT |
|---|
The rapid pace at which genomic and proteomic data are being generated necessitates tools and resources for data management that allow integration of information from disparate sources. The Human Protein Reference Database (http://www.hprd.org) is a web-based resource, built on open source technologies, that provides information about several aspects of human proteins, including protein-protein interactions, post-translational modifications, enzyme-substrate relationships and disease associations. This information was derived manually through critical reading of the published literature by expert biologists and through bioinformatics analyses of the protein sequences. The database will assist biomedical discovery by serving as a resource of genomic and proteomic information and by providing an integrated view of sequence, structure, function and protein networks in health and disease.
| INTRODUCTION |
|---|
Completion of sequencing of the human genome (1,2) has ushered in an era of characterizing genes and their gene products or proteins in greater detail. High-throughput technologies such as mass spectrometry and the yeast two-hybrid assay are being applied to generate proteomic data on an unprecedented scale (3,4). However, this wealth of data being generated can be fully harnessed only if it can be visualized and understood in the context of the existing information about proteins and their role in biology and disease.
The Human Protein Reference Database (HPRD) is a novel, comprehensive protein information resource that depicts various features of proteins such as domain architecture, post-translational modifications, tissue expression, molecular function, subcellular localization, enzyme-substrate relationships and protein-protein interactions (5). The database is fully object oriented and was developed using Zope and Python, both open source technologies. HPRD is web based and is freely available to the academic community at http://www.hprd.org.
| REPRESENTING COMPLEX PROTEIN DATA |
|---|
The complexity of protein data reflects the diverse functional roles of proteins in biological processes, and annotating and displaying such data is challenging. Although most types of data can be handled computationally, users are more likely to accept a graphical interface that is visually appealing, intuitive and easy to use. We have therefore designed HPRD to present as many features graphically as possible, along with links to more detailed text-based descriptions in research articles. Figure 1 shows a molecule page of the BRCA1 protein with a graphic showing the various protein domains and motifs along with sites of post-translational modification. Tabs allow simple navigation between different features of the protein.
| ANNOTATION |
|---|
The proteins in HPRD are annotated manually by reading the published literature as well as by bioinformatics analyses of the protein sequences. Figure 2 shows the different steps in the annotation process involving various features of proteins. Interpretive annotation is crucial for classifying types of protein-protein interaction, delineating regions of interaction, recording the type of experiment that shows modification of a substrate by an enzyme, and for domain and motif analysis and subcellular localization. Links to PubMed entries are provided in each case so that the user can access the primary data for more details.
| AN OBJECT-ORIENTED DATABASE ARCHITECTURE |
|---|
HPRD is an object-oriented database developed with Zope (http://www.zope.org), a leading open source web application server built using the programming language Python (http://www.python.org). Zope was especially suited for developing HPRD because it provides a powerful dynamic site generation system, a clustering system and a robust and transparent object database, which is ideal for storing hierarchical data such as protein interactions, PTMs and domains (11). The data in HPRD are stored in the Zope Object Database (ZODB), which transparently stores persistent objects. This allowed the programmers to develop the whole application without restrictions on the design of data structures. We also used ZCatalog, a Zope-based object that provides powerful indexing and searching on a Zope database and allows fast and robust searches. Since it catalogs objects and not file handles, all the contents of the database are easily searchable.
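The idea of storing hierarchical protein data as objects and indexing the objects themselves can be sketched in plain Python. This is only an illustration under stated assumptions: standard-library dataclasses stand in for ZODB persistent objects, a dictionary stands in for ZCatalog, and the class names, field names and example records (`PROT_A`, `PROT_B`) are hypothetical rather than the actual HPRD schema.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    start: int
    end: int


@dataclass
class PTM:
    kind: str        # e.g. "phosphorylation"
    position: int    # residue number within the protein
    pubmed_id: str   # pointer back to the primary literature


@dataclass
class Protein:
    # Hypothetical object model: one protein owns its nested features.
    name: str
    alt_names: list = field(default_factory=list)
    domains: list = field(default_factory=list)
    ptms: list = field(default_factory=list)
    interactors: list = field(default_factory=list)  # names of partner proteins


def build_catalog(proteins):
    """Map each lowercased name, alias, domain name and PTM type to protein names."""
    catalog = {}
    for protein in proteins:
        keys = [protein.name, *protein.alt_names]
        keys += [domain.name for domain in protein.domains]
        keys += [ptm.kind for ptm in protein.ptms]
        for key in keys:
            catalog.setdefault(key.lower(), set()).add(protein.name)
    return catalog


# Placeholder records, not real HPRD entries.
prot_a = Protein(
    name="PROT_A",
    alt_names=["Example kinase"],
    domains=[Domain("SH2", 10, 100)],
    ptms=[PTM("phosphorylation", 42, "PMID-PLACEHOLDER")],
    interactors=["PROT_B"],
)
prot_b = Protein(
    name="PROT_B",
    domains=[Domain("SH2", 5, 95), Domain("SH3", 120, 180)],
)

catalog = build_catalog([prot_a, prot_b])
print(sorted(catalog["sh2"]))
```

Because the index is built from attributes of the objects rather than from files, any feature attached to a protein becomes searchable as soon as it is added to the key list.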
In HPRD, proteins can be accessed through the query page, by browsing or by using BLAST. The query page is one of the most powerful tools in HPRD because it can search any field in the database. Multiple fields can also be queried simultaneously, as shown in Figure 3. Entering a protein name as a query automatically searches the main name as well as the alternative names of all entries in HPRD. The browse page allows users to access proteins according to their function, domains, motifs, post-translational modifications (PTMs) or cellular component.
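The query behaviour described above (a name query that also checks alternative names, and several fields combined in one search) might look roughly like the following. This is a minimal sketch, assuming records are plain dictionaries, that criteria are combined with AND and that matching is a case-insensitive substring test; the field names and records are hypothetical and do not reflect the real HPRD query page.

```python
def matches(record, **criteria):
    """Return True when the record satisfies every supplied criterion (AND)."""
    for field_name, wanted in criteria.items():
        wanted = wanted.lower()
        if field_name == "name":
            # A name query checks the main name and every alternative name.
            candidates = [record["name"], *record.get("alt_names", [])]
        else:
            value = record.get(field_name, [])
            candidates = [value] if isinstance(value, str) else list(value)
        if not any(wanted in candidate.lower() for candidate in candidates):
            return False
    return True


def search(records, **criteria):
    """Names of all records that match all criteria."""
    return [record["name"] for record in records if matches(record, **criteria)]


# Placeholder records, not real HPRD entries.
records = [
    {
        "name": "PROT_A",
        "alt_names": ["Alpha kinase"],
        "domains": ["SH2"],
        "ptms": ["phosphorylation"],
        "cellular_component": "nucleus",
    },
    {
        "name": "PROT_B",
        "alt_names": ["Beta adaptor"],
        "domains": ["SH2", "SH3"],
        "ptms": [],
        "cellular_component": "cytoplasm",
    },
]

print(search(records, name="alpha"))
print(search(records, domains="SH2", cellular_component="cytoplasm"))
```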
| DATA STANDARDIZATION |
|---|
The use of controlled vocabulary facilitates annotation efforts and promotes standardization and interoperability across different platforms and databases. Several annotation projects have already adopted the Gene Ontology (GO) Consortium's standardization framework in their annotations (12). To be compatible with such efforts, the vocabulary used in HPRD is compliant with GO vocabulary, describing proteins in terms of molecular function, biological process and cellular component. In addition to controlled vocabulary, standards that govern data format and transfer have made tremendous progress in unifying data across the World Wide Web. One such standard concerns the file format in which the data are stored. eXtensible Markup Language (XML) has already been adopted by the microarray community as a standard for representing gene expression microarray data (13) and will soon be adopted for proteomic data as well (14). The major advantage of XML is that it lends itself well to importing from, and exporting to, various database systems while preserving the hierarchical nature of the data. The data contained in HPRD are available as XML files in addition to other flat file formats. To standardize gene nomenclature, the Human Genome Organization (HUGO) has put forth officially approved symbols for human genes (15). HPRD provides HUGO-approved gene symbols linked to all the proteins, which should allow easy linking to other databases because these gene symbols are non-redundant.
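The point that XML preserves hierarchy on export and import can be illustrated with Python's standard `xml.etree.ElementTree` module. The element and attribute names below (`protein`, `hugo_symbol`, `gene_ontology`, `ptm`) and the record `GENE1` are assumptions made for this sketch; they are not the HPRD XML schema.

```python
import xml.etree.ElementTree as ET

GO_ASPECTS = ("molecular_function", "biological_process", "cellular_component")


def protein_to_xml(record):
    """Build a nested XML element from a protein record (hypothetical layout)."""
    root = ET.Element("protein", {"hugo_symbol": record["hugo_symbol"]})
    ET.SubElement(root, "name").text = record["name"]
    go_node = ET.SubElement(root, "gene_ontology")
    for aspect in GO_ASPECTS:
        for term in record["go"].get(aspect, []):
            ET.SubElement(go_node, aspect).text = term
    ptms_node = ET.SubElement(root, "ptms")
    for ptm in record["ptms"]:
        ET.SubElement(ptms_node, "ptm", {"type": ptm["type"], "site": str(ptm["site"])})
    return root


def xml_to_protein(root):
    """Rebuild the record from the element produced by protein_to_xml."""
    go = {}
    for child in root.find("gene_ontology"):
        go.setdefault(child.tag, []).append(child.text)
    return {
        "hugo_symbol": root.get("hugo_symbol"),
        "name": root.findtext("name"),
        "go": go,
        "ptms": [
            {"type": node.get("type"), "site": int(node.get("site"))}
            for node in root.find("ptms")
        ],
    }


# Placeholder record: the symbol, name and sites are invented for illustration.
record = {
    "hugo_symbol": "GENE1",
    "name": "Example protein 1",
    "go": {
        "molecular_function": ["protein binding"],
        "biological_process": ["signal transduction"],
        "cellular_component": ["nucleus"],
    },
    "ptms": [{"type": "phosphorylation", "site": 42}],
}

xml_text = ET.tostring(protein_to_xml(record), encoding="unicode")
restored = xml_to_protein(ET.fromstring(xml_text))
print(xml_text)
```

The round trip through text and back returns the same nested structure, which is the property that makes XML convenient for exchange between database systems.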
| VISUALIZATION OF INTERACTION NETWORKS |
|---|
The complexity and volume of protein data are difficult to convey without appropriate visualization methods. Ultimately, one would like an integrated view of genomic as well as proteomic networks, as has been demonstrated for certain metabolic pathways in yeast (16). Currently, HPRD contains protein interaction network diagrams for nine signal transduction pathways, and this number is expected to grow as more proteins are annotated (see Fig. 4 for the interleukin-2 receptor pathway diagram). These pathway diagrams were generated using Pajek, a program for large network analysis (17). The networks are available both in JPEG format and, by clicking the 'view SVG format' button, in Scalable Vector Graphics (SVG) format. SVG is a language for describing 2D graphics in XML and allows additional functionality such as zooming in without loss of resolution, search capability and the ability to link to the molecule page of any protein in the network by clicking on its name.
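As a rough illustration of how a list of interactions could be handed to Pajek, the sketch below writes pairs of proteins in the plain-text `.net` layout that Pajek reads (a `*Vertices` section with 1-based ids and quoted labels, followed by an `*Edges` section). The function name and the handful of interleukin-2 receptor pairs are chosen for illustration only; they were not exported from HPRD, and this is not the procedure the authors describe.

```python
def to_pajek(interactions):
    """Render undirected protein pairs as text in Pajek's .net layout."""
    names = sorted({name for pair in interactions for name in pair})
    index = {name: i for i, name in enumerate(names, start=1)}  # ids are 1-based
    lines = [f"*Vertices {len(names)}"]
    lines += [f'{index[name]} "{name}"' for name in names]
    lines.append("*Edges")
    seen = set()
    for a, b in interactions:
        edge = tuple(sorted((index[a], index[b])))
        if edge not in seen:  # skip duplicates such as (A, B) and (B, A)
            seen.add(edge)
            lines.append(f"{edge[0]} {edge[1]}")
    return "\n".join(lines) + "\n"


# Illustrative pairs only.
pairs = [
    ("IL2", "IL2RA"),
    ("IL2", "IL2RB"),
    ("IL2RB", "JAK1"),
    ("JAK1", "IL2RB"),
]
net_text = to_pajek(pairs)
print(net_text)
```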
| FUTURE DEVELOPMENTS |
|---|
Federated databases that integrate information from various high-throughput experiments are essential for turning the massive amount of available information into knowledge. In this respect, we are developing HPRD further so that mass spectrometric database searching algorithms can use certain annotated features of proteins, such as PTMs and processing events. This will allow better identification of proteins and their isoforms, including PTMs, by taking advantage of known information buried in the literature. Centralized efforts in genome and proteome annotation need to be supplemented by annotation from the entire biomedical community. Such a concerted effort will not only enrich the databases but also minimize the errors that abound in them. The Distributed Annotation System (DAS) provides a mechanism for multiple servers or providers to supply annotations for a common sequence database (18). We are developing specifications by which any third party's annotations can be viewed along with the data contained in HPRD. Finally, HPRD continues to evolve in the number of entries, in the depth of annotation for each entry and in the types of information displayed for each protein. With the active involvement of the biomedical community, we wish to make HPRD an evolving knowledge base that provides an integrated view of human proteins and networks.
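One way annotated PTMs could inform a mass spectrometric search is by restricting the candidate peptide masses to those consistent with known modification sites. The sketch below is an assumption-laden illustration rather than the authors' planned design: it uses standard monoisotopic residue masses, two common modification mass shifts, a hypothetical peptide `SAK` and a hypothetical annotated site.

```python
from itertools import combinations

WATER = 18.01056  # one water is added per peptide (free N- and C-termini)

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Mass added by each modification type.
PTM_SHIFT = {"phosphorylation": 79.96633, "acetylation": 42.01057}


def peptide_mass(sequence, ptms=()):
    """Monoisotopic mass of a peptide carrying the given (position, kind) PTMs."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return mass + sum(PTM_SHIFT[kind] for _position, kind in ptms)


def candidate_masses(sequence, annotated_sites):
    """Masses for every combination of the annotated sites being modified or not."""
    masses = set()
    for count in range(len(annotated_sites) + 1):
        for subset in combinations(annotated_sites, count):
            masses.add(round(peptide_mass(sequence, subset), 4))
    return sorted(masses)


# Hypothetical annotation: serine at position 1 is a known phosphorylation site.
sites = [(1, "phosphorylation")]
masses = candidate_masses("SAK", sites)
print(masses)
```

Limiting the search to annotated sites keeps the number of candidate masses small compared with allowing every possible modification on every residue.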
| ACKNOWLEDGEMENTS |
|---|
Akhilesh Pandey is a Sidney Kimmel Scholar of the Sidney Kimmel Foundation for Cancer Research. He serves as Chief Scientific Advisor to the Institute of Bioinformatics. The terms of this arrangement are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.
| REFERENCES |
|---|
1. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860-921.
2. Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J., Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304-1351.
3. Mann,M. and Pandey,A. (2001) Use of mass spectrometry-derived data to annotate nucleotide and protein sequence databases. Trends Biochem. Sci., 26, 54-61.
4. Tucker,C.L., Gera,J.F. and Uetz,P. (2001) Towards an understanding of complex protein networks. Trends Cell Biol., 11, 102-116.
5. Peri,S., Navarro,J.D., Amanchy,R., Kristiansen,T.Z., Jonnalagadda,C.K., Surendranath,V., Niranjan,V., Muthusamy,B., Gandhi,T.K.B., Gronborg,M. et al. (2003) Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res., 13, 2363-2371.
6. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
7. Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA, 95, 5857-5864.
8. Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 30, 276-278.
9. Hamosh,A., Scott,A.F., Amberger,J., Bocchini,C., Valle,D. and McKusick,V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52-55.
10. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res., 29, 137-140.
11. Navarro,J.D., Niranjan,V., Peri,S., Jonnalagadda,C.K. and Pandey,A. (2003) From biological databases to platforms for biomedical discovery. Trends Biotechnol., 21, 263-268.
12. Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425-1433.
13. Spellman,P.T., Miller,M., Stewart,J., Troup,C., Sarkans,U., Chervitz,S., Bernhart,D., Sherlock,G., Ball,C., Lepage,M. et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol., 3, RESEARCH0046.
14. Orchard,S., Hermjakob,H. and Apweiler,R. (2003) The proteomics standards initiative. Proteomics, 3, 1374-1376.
15. Wain,H.M., Bruford,E.A., Lovering,R.C., Lush,M.J., Wright,M.W. and Povey,S. (2002) Guidelines for human gene nomenclature. Genomics, 79, 464-470.
16. Ideker,T., Thorsson,V., Ranish,J.A., Christmas,R., Buhler,J., Eng,J.K., Bumgarner,R., Goodlett,D.R., Aebersold,R. and Hood,L. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929-934.
17. Batagelj,V. and Mrvar,A. (1998) Pajek - program for large network analysis. Connection, 21, 47-57.
18. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The Distributed Annotation System. BMC Bioinformatics, 2, 7.