Nucleic Acids Research, 2003, Vol. 31, No. 1 489-491
© 2003 Oxford University Press
The Protein Data Bank and structural genomics
Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
*To whom correspondence should be addressed. Email: berman{at}rcsb.rutgers.edu
Received September 16, 2002; Revised and Accepted October 15, 2002
ABSTRACT
The Protein Data Bank (PDB; http://www.pdb.org/) continues to be actively involved in various aspects of the informatics of structural genomics projects: developing and maintaining the Target Registration Database (TargetDB), organizing data dictionaries that will define the specification for the exchange and deposition of data with the structural genomics centers, and creating software tools to capture data from standard structure determination applications.
INTRODUCTION
Over the history of the Protein Data Bank (PDB; http://www.pdb.org/) (1,2), this archive of three-dimensional structural data has grown from 7 files in 1971 to a database containing over 18 800 structures as of October 2002. The archive's growth has been accompanied by increases in both data content and the structural complexity of individual entries. A further acceleration is anticipated owing to developments in high-throughput structure determination methodologies and worldwide structural genomics efforts.
The PDB has been actively involved in various aspects of structural genomics (3), ranging from target tracking to the automation of data deposition from the many steps involved in high-throughput structure determination (Fig. 1).
TARGET REGISTRATION
Efficient structure solution on a genomic scale requires a centralized coordination of effort, to which the timely availability of status information on the progress of protein production and structure solution is key. Building on the work of earlier target databases [Presage (4) and http://www.genome3d.org/], we have created a centralized target registration database for sequences from worldwide structural genomics projects (TargetDB; http://targetdb.pdb.org/). Target sequences are collected weekly from the P50 NIH structural genomics centers and other international projects (see http://www.rcsb.org/pdb/strucgen.html). Target data are organized following recommendations from the International Task Force on Target Tracking, which include the definitions for the states used to track target progress (5). These states span the details of protein production, structure solution and the ultimate deposition of experimental and structure data. Target data from all contributing structural genomics sites are combined into a single downloadable XML document following the document type definition at http://targetdb.pdb.org/apps/target.txt.
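As a rough illustration of how a combined XML target document could be summarized by status, the following Python sketch parses a small inline sample and tallies targets per status. The element names (`targets`, `target`, `id`, `site`, `status`, `sequence`), the identifiers and the status labels are assumptions made for this example; they are not taken from the TargetDB document type definition, which should be consulted for the actual structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document. Element names and values are illustrative
# only and do not reproduce the real TargetDB DTD.
SAMPLE_XML = """<targets>
  <target>
    <id>EX0001</id>
    <site>CenterA</site>
    <status>cloned</status>
    <sequence>MKTAYIAKQR</sequence>
  </target>
  <target>
    <id>EX0002</id>
    <site>CenterA</site>
    <status>crystallized</status>
    <sequence>MSLNELVKQA</sequence>
  </target>
  <target>
    <id>EX0003</id>
    <site>CenterB</site>
    <status>cloned</status>
    <sequence>MAHHGTREWQ</sequence>
  </target>
</targets>"""


def count_status(xml_text):
    """Count targets per status in an XML document of <target> records."""
    root = ET.fromstring(xml_text)
    return Counter(
        target.findtext("status", default="unknown")
        for target in root.iter("target")
    )


status_counts = count_status(SAMPLE_XML)
for status, n in sorted(status_counts.items()):
    print(f"{status}: {n}")
```

A tally of this kind is one simple way to produce a per-status summary from the combined download; a real client would read the downloaded file rather than an inline string.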
Target sequences are loaded into a relational database, along with the sequences from experimentally determined structures in the PDB (41 000 sequences), and with the sequences that data depositors have approved for pre-release. From this latter set, sequence information is currently available for about 45% of on-hold depositions.
All or subsets of these sequence data may be searched using the FASTA sequence comparison method (6). A simple search form is provided to permit queries of each target data element, including: contributing site, protein name, sequence, project tracking identifier, date of last modification, current status of the target and source organism. Search results may be viewed as HTML reports, FASTA files or XML documents. TargetDB also tracks target status temporally so reports of the evolution of target progress over a selected time interval can be created.
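The idea of reporting target progress over a selected time interval can be sketched as follows. This is a minimal Python illustration, not a description of how TargetDB stores or queries its data: the record layout, the target identifiers, the dates and the status labels are all hypothetical.

```python
from datetime import date

# Hypothetical status history: (target id, date recorded, status).
HISTORY = [
    ("EX0001", date(2002, 6, 3), "selected"),
    ("EX0001", date(2002, 7, 15), "cloned"),
    ("EX0001", date(2002, 9, 2), "expressed"),
    ("EX0002", date(2002, 8, 20), "selected"),
    ("EX0002", date(2002, 10, 1), "cloned"),
]


def progress_report(history, start, end):
    """Map each target to the statuses recorded between start and end
    (both inclusive), listed in chronological order."""
    report = {}
    for target, when, status in sorted(history, key=lambda r: (r[0], r[1])):
        if start <= when <= end:
            report.setdefault(target, []).append(status)
    return report


report = progress_report(HISTORY, date(2002, 7, 1), date(2002, 9, 30))
for target, statuses in report.items():
    print(target, " -> ".join(statuses))
```

Keeping every dated status change, rather than only the current status, is what makes an interval report like this possible.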
Table 1 summarizes the status of sequences under study by the structural genomics projects. It is expected that a significant fraction of the thousands of proteins currently in process will ultimately be solved and deposited in the PDB.
DATA DICTIONARIES
The International Task Force on Deposition, Archiving and Curation of the Primary Information has recommended that all depositions from structural genomics efforts include information that would normally be found in a Materials and Methods section of a journal article reporting structure determination (7). This has required the addition of new data items in a PDB file which are documented in the PDB exchange data dictionary (http://deposit.pdb.org/mmcif/). This data dictionary expands upon the macromolecular Crystallographic Information File (mmCIF) data dictionary (8), an ontology of data definitions that electronically encodes domain information in the form of precise definitions, examples and controlled vocabularies (9). In addition to domain information, data definitions also encode information such as data type, data relationships, range restrictions, controlled vocabularies and presentation units.
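To make the notion of software-accessible definitions concrete, the sketch below shows how a program might check data items against dictionary entries that carry a data type, a range restriction or a controlled vocabulary. It is a simplified Python illustration under stated assumptions: the in-memory `DICTIONARY` layout and the `validate` function are hypothetical, and the allowed values and ranges shown are illustrative rather than copied from the PDB exchange data dictionary.

```python
# Hypothetical, simplified dictionary entries keyed by mmCIF-style item name.
# The rules are illustrative and not taken from the actual dictionary.
DICTIONARY = {
    "_exptl.method": {
        "type": str,
        "allowed": {"X-RAY DIFFRACTION", "SOLUTION NMR"},
    },
    "_reflns.d_resolution_high": {"type": float, "min": 0.0},
    "_cell.angle_alpha": {"type": float, "min": 0.0, "max": 180.0},
}


def validate(record, dictionary):
    """Return a list of error messages for items that violate their rules."""
    errors = []
    for name, raw in record.items():
        rule = dictionary.get(name)
        if rule is None:
            errors.append(f"{name}: not defined in dictionary")
            continue
        try:
            value = rule["type"](raw)  # enforce the declared data type
        except ValueError:
            errors.append(f"{name}: {raw!r} is not a valid {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: {value!r} is not in the controlled vocabulary")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: {value} is below the minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: {value} is above the maximum {rule['max']}")
    return errors


good = {"_exptl.method": "X-RAY DIFFRACTION", "_cell.angle_alpha": "90.0"}
bad = {"_exptl.method": "GUESSWORK", "_cell.angle_alpha": "270"}
good_errors = validate(good, DICTIONARY)
bad_errors = validate(bad, DICTIONARY)
print(good_errors)
print(bad_errors)
```

Because the rules live in data rather than in code, new items can be validated by extending the dictionary alone, without modifying the checking function.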
The PDB exchange data dictionary is virtually complete for crystallographic and NMR structure determination and refinement; a protein production dictionary is under development and should be complete in 2003.
The use of software accessible data dictionaries is the key ingredient of the PDB informatics infrastructure (1). The dictionary provides the foundation for software tools used to exchange and validate data, create and load databases, translate data formats and serve application program interfaces.
SOFTWARE TOOLS FOR DATA EXTRACTION AND DEPOSITION
One goal of high-throughput structural genomics is the automatic capture of the important details of each step in the structure determination pipeline. Figure 1 shows the steps in a simplified structure determination data pipeline. At each step, essential details are captured and assembled to make a data file for PDB deposition. The status for each target sequence is updated at each step and forwarded to TargetDB. The PDB data processing system has been developed in anticipation of a structure determination data pipeline with automated deposition as a final step.
The AutoDep Input Tool (ADIT) was originally developed by the PDB to support the centralized data deposition and annotation of macromolecular structure data; however, this system can also support data from the structure determination pipeline. ADIT depends entirely on an underlying data dictionary to define the content and properties of information to be processed. This design permits the system to easily adapt to content extensions without software change. ADIT has been packaged in a workstation mode to provide single user data input and processing functionality tailored specifically for the content requirements of structural genomics applications.
ADIT can capture and edit data files stored in a standard form such as mmCIF and can be used to manually include details that are not captured automatically from a standard format. Although this is a common practice for current PDB depositions, it would obviously be more efficient if all of the information to be captured conformed to a standard data dictionary and format. This standardization is a key requirement for building a robust automated data pipeline.
The PDB exchange data dictionary definitions have been carefully developed to describe the information to be extracted from each step in the structure determination pipeline. The majority of these details are output by structure determination applications, but are not currently produced in a common format. Some applications export information directly in mmCIF format following the exchange dictionary; others produce output in program-specific formats or in a program log file. For the latter, the utility program PDB_EXTRACT was created to extract key data values from the output of common structure determination applications. PDB_EXTRACT also facilitates the merging of the incremental extractions of data from each program step. In the end, PDB_EXTRACT produces an integrated mmCIF data file that can be imported into ADIT to prepare and check the data file for PDB deposition.
Providing precise data specifications and software tools to depositors is already improving the efficiency of data deposition and annotation. In our first test of fully automated deposition with an NIH P50 structural genomics center (the Joint Center for Structural Genomics), we were able to reduce the total data processing and annotation time for a structure by a factor of 10. As automated deposition technology spreads to other projects inside and outside the structural genomics arena, the PDB will begin to realize the significant benefit of this investment in infrastructure.
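The merging of incremental extractions into a single data file can be pictured with the following Python sketch. It is not the PDB_EXTRACT implementation: the function names, the per-step values and the rule that later steps override earlier ones are assumptions made for illustration, and the output handles only simple `_category.item value` pairs with minimal quoting rather than the full mmCIF syntax.

```python
def merge_extractions(steps):
    """Merge per-step item dictionaries; a later step overrides an earlier one
    when both report the same item (an assumption of this sketch)."""
    merged = {}
    for step in steps:
        merged.update(step)
    return merged


def to_mmcif(block_name, items):
    """Write items as simple mmCIF-style name/value lines in one data block.
    Values containing spaces are single-quoted; embedded quotes and
    multi-line values are not handled in this simplified version."""
    lines = [f"data_{block_name}"]
    for name in sorted(items):
        value = str(items[name])
        if " " in value:
            value = f"'{value}'"
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical values extracted at three pipeline steps.
data_collection = {"_exptl.method": "X-RAY DIFFRACTION"}
data_reduction = {"_reflns.d_resolution_high": "1.85"}
refinement = {"_refine.ls_d_res_high": "1.80", "_reflns.d_resolution_high": "1.80"}

merged = merge_extractions([data_collection, data_reduction, refinement])
text = to_mmcif("EXAMPLE", merged)
print(text)
```

Treating each program step as a small set of named items keeps the merge independent of any one application's log format, which is the practical benefit of extracting into dictionary-defined items first.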
ADIT, PDB_EXTRACT and mmCIF parsing and data management tools are currently distributed by the PDB under an open-source license at http://deposit.pdb.org/software/.
FUTURE
The PDB will continue to work with the community to identify and define the required data items for structural genomics, and to work with software developers to export these data directly in a common form and integrate this output with the PDB deposition software. Our efforts to produce tools to facilitate the extraction and integration of the data from existing structure determination software will continue. We will further encourage the structural genomics centers to use PDB software tools in their respective data processing operations.
Questions and comments about the PDB should be sent to info{at}rcsb.org.
ACKNOWLEDGEMENTS
The PDB is operated by Rutgers, The State University of New Jersey; The San Diego Supercomputer Center at the University of California, San Diego; and the National Institute of Standards and Technology, three members of the Research Collaboratory for Structural Bioinformatics (RCSB). This work is supported by grants from the National Science Foundation, the Department of Energy, and two units of the National Institutes of Health: the National Institute of General Medical Sciences and the National Library of Medicine.
REFERENCES
- Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242.
- Bernstein,F.C., Koetzle,T.F., Williams,G.J., Meyer,E.E., Brice,M.D., Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977) Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol., 112, 535-542.
- Berman,H.M., Bhat,T.N., Bourne,P.E., Feng,Z., Gilliland,G., Weissig,H. and Westbrook,J. (2000) The Protein Data Bank and the challenge of structural genomics. Nature Struct. Biol., 7, 957-959.
- Brenner,S.E., Barken,D. and Levitt,M. (1999) The Presage database for structural genomics. Nucleic Acids Res., 27, 251-253.
- Task Force on Target Tracking (2001) Task Force Reports from the Second International Structural Genomics Meeting. Airlie, VA. http://www.nigms.nih.gov/news/reports/airlie_tasks.html.
- Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
- Task Force on the Deposition, Annotation, and Curation of the Primary Information (2001) Task Force Reports from the Second International Structural Genomics Meeting. Airlie, VA. http://www.nigms.nih.gov/news/reports/airlie_tasks.html.
- Bourne,P.E., Berman,H.M., Watenpaugh,K., Westbrook,J.D. and Fitzgerald,P.M.D. (1997) The macromolecular Crystallographic Information File (mmCIF). Methods Enzymol., 277, 571-590.
- Westbrook,J. and Bourne,P.E. (2000) STAR/mmCIF: An extensive ontology for macromolecular structure and beyond. Bioinformatics, 16, 159-168.