Nucleic Acids Research, 2002, Vol. 30, No. 1 260-263
© 2002 Oxford University Press
ASTRAL compendium enhancements
¹Berkeley Structural Genomics Center, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, ²Department of Plant and Microbial Biology, 111 Koshland Hall, University of California, Berkeley, CA 94720-3102, USA, ³MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK and ⁴Department of Structural Biology, D-109 Fairchild, Stanford University, Stanford, CA, USA
Received September 18, 2001; Accepted September 20, 2001.
| ABSTRACT |
|---|
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. It is partially derived from the SCOP database of protein domains, and it includes sequences for each domain as well as other resources useful for studying these sequences and domain structures. Several major improvements have been made to the ASTRAL compendium since its initial release 2 years ago. The number of protein domain sequences included has doubled from 15 190 to 30 867, and additional databases have been added. The Rapid Access Format (RAF) database contains manually curated mappings linking the biological amino acid sequences described in the SEQRES records of PDB entries to the amino acid sequences structurally observed (provided in the ATOM records) in a format designed for rapid access by automated tools. This information is used to derive sequences for protein domains in the SCOP database. In cases where a SCOP domain spans several protein chains, all of which can be traced back to a single genetic source, a genetic domain sequence is created by concatenating the sequences of each chain in the order found in the original gene sequence. Both the original-style library of SCOP sequences and a new library including genetic domain sequences are available. Selected representative subsets of each of these libraries, based on multiple criteria and degrees of similarity, are also included. ASTRAL may be accessed at http://astral.stanford.edu/.
| BACKGROUND |
|---|
The Protein Data Bank (PDB) is a centralized repository of protein structures (1,2) containing over 13 000 entries in March 2001. The SCOP database (3,4) provides a manually curated set of domains from all PDB entries, classified in a hierarchy indicating different levels of structural and evolutionary relationships between the domains. SCOP thus provides a broad survey of all known protein folds, detailed information about relatives of proteins of known structure, and a framework for classification of additional structures as they are solved.
Many tools for bioinformatic analysis rely on sequence information, but the nature of PDB files makes it challenging to accurately extract the sequence corresponding to a given domain definition. ASTRAL addresses this issue by providing an explicit mapping between the PDB ATOM and SEQRES records, which is used to derive databases of sequences corresponding to the SCOP domains, as described in the original paper (5). Also available are subsets of selected representative domains created using different thresholds and measures of similarity. To choose the highest quality representatives for these subsets, Summary PDB ASTRAL Check Index (SPACI) scores are used to provide a first-order guide to the resolution, R-factor and stereochemical accuracy of each crystallographically determined structure.
In the two years since ASTRAL was released, the number of domain sequences included has doubled from 15 190 to 30 867, and several additional databases that require manual curation have been included.
| RAPID ACCESS FORMAT (RAF) SEQUENCE MAPPINGS |
|---|
The original release of ASTRAL included mappings between ATOM and SEQRES records in PDB chains represented in ASTRAL. These CIFMAP files were automatically generated from the output of the pdb2cif program (6), which is now provided for every entry in the PDB. Due to errors and inconsistencies in the underlying data (7), fully automated mappings produce errors. Therefore, the CIFMAP files often contained bugs, some of which were described in the notes accompanying each ASTRAL release. In the latest release, we have manually corrected bugs in the mappings for 638 protein chains, eliminating all known bugs caused by errors in the automated mapping. We did not attempt to correct errors occurring in the original PDB files; in cases of discrepancy between the PDB and the pdb2cif output, the original PDB file served as the final arbiter for making manual corrections. In the curation process, some errors in the PDB files were discovered. For 26 domains, additional residues appear in the observed sequence (ATOM records) but were omitted from the record of the sequence being studied (SEQRES records). For an additional 83 domains, the sequences given in the SEQRES records and ATOM records do not match. While the differences are sometimes small (Asp/Asn), some are more significant (Glu/Pro). A file listing these differences is available on our web site.
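The discrepancy classes described above can be illustrated with a minimal sketch. It assumes a hypothetical per-residue map of (SEQRES letter, ATOM letter) pairs; this structure and the function name `classify_discrepancies` are illustrative only and do not reproduce the RAF or CIFMAP formats.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-residue map: (SEQRES letter, ATOM letter).
# None marks a residue absent from that record. This is NOT the RAF format.
ResidueMap = List[Tuple[Optional[str], Optional[str]]]


def classify_discrepancies(residue_map: ResidueMap) -> Dict[str, List[int]]:
    """Sort map positions into the discrepancy classes named in the text."""
    report: Dict[str, List[int]] = {"unobserved": [], "atom_only": [], "mismatch": []}
    for pos, (seqres, atom) in enumerate(residue_map):
        if seqres is not None and atom is None:
            # In SEQRES but not structurally observed (expected for disorder).
            report["unobserved"].append(pos)
        elif seqres is None and atom is not None:
            # Observed in ATOM records but omitted from SEQRES.
            report["atom_only"].append(pos)
        elif seqres != atom:
            # Both present but disagreeing, e.g. Asp (D) versus Asn (N).
            report["mismatch"].append(pos)
    return report


# Toy data, invented for illustration.
toy_map: ResidueMap = [("M", None), ("D", "N"), ("K", "K"), (None, "G")]
report = classify_discrepancies(toy_map)
print(report)
```

Only the `atom_only` and `mismatch` classes correspond to the PDB file errors reported above; unobserved residues are a normal consequence of disorder in the structure.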
As a robust alternative to the CIFMAP files, we now distribute the sequence mappings in RAF, a file format designed for rapid access in most computer languages. Details of the format are available on the web site. In addition to being the definitive set of maps used in SCOP and ASTRAL, the database of RAF maps is intended as a general purpose, manually curated resource for PDB users. It includes mappings for all PDB chains represented in the first seven classes in SCOP. The old format CIFMAP files are also available in the current release, but manual corrections are not included in these maps, and their use has been deprecated.
| GENETIC DOMAINS |
|---|
A SCOP domain may include fragments from different PDB chains. In most cases, these appear to be the product of a single gene. For example, insulin and many proteases are products of post-translational cleavage of single precursor chains; each chain is given a different identifier in the PDB file. It is not trivial to reassemble the fragments in the order found in the original gene sequence, as the order in which the chains are presented in the PDB file often has no correlation with the original genetic order. In previous versions of SCOP, multiple chains in each of these genetic domains have almost always been listed in the correct order in the SCOP domain definitions. In the current release, the correct order has been ensured through manual curation in collaboration with the SCOP authors. The order of chains in each genetic domain was determined by aligning the sequences against SWISS-PROT (8), followed by manual inspection.
In the original-style ASTRAL sequences, there is a separate entry for each chain included in a SCOP domain, with the initial d in the SCOP identifier replaced by an e. Figure 1 shows the example of d1cph.1, a multi-chain domain from a structure of cow insulin (9). This domain is split into one sequence for each chain (a and b), e1cph.1a and e1cph.1b. In the genetic domain sequence sets, the sequences are concatenated into a single entry with the initial d in the domain name replaced by a g. In the insulin example, the d1cph.1 domain is given as a single sequence, g1cph.1, with the chain sequences in the correct order (b before a) and separated by the letter X. In the current version of ASTRAL, all fragments in a single domain were found to be products of a single gene. However, this may not always be the case, so future versions of ASTRAL may include d, e and g type domains in a single sequence set.
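The naming and concatenation rules for multi-chain domains can be sketched as follows. The function names are hypothetical and the chain sequences are placeholders, not the real insulin sequences; only the d/e/g prefix rule, the appended chain letter and the X separator come from the text.

```python
from typing import Dict, List, Tuple


def original_style_entries(domain_id: str, chains: Dict[str, str]) -> Dict[str, str]:
    """One entry per chain: initial 'd' becomes 'e' and the chain letter is appended."""
    return {"e" + domain_id[1:] + chain: seq for chain, seq in chains.items()}


def genetic_domain_entry(
    domain_id: str, chains: Dict[str, str], gene_order: List[str]
) -> Tuple[str, str]:
    """Single entry: initial 'd' becomes 'g'; chains joined in gene order by 'X'."""
    name = "g" + domain_id[1:]
    sequence = "X".join(chains[chain] for chain in gene_order)
    return name, sequence


# Placeholder sequences for the two chains of d1cph.1 (not real insulin data).
chains = {"a": "AAAA", "b": "CCCCCC"}

split_entries = original_style_entries("d1cph.1", chains)
genetic_entry = genetic_domain_entry("d1cph.1", chains, gene_order=["b", "a"])

print(split_entries)
print(genetic_entry)
```

The gene order is passed in explicitly because, as noted above, it cannot be inferred from the order of chains in the PDB file.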
| REPRESENTATIVE SUBSETS |
|---|
Since the majority of the domains in the PDB are very similar to others, it is helpful to reduce the redundancy by selecting high quality representatives at different levels of similarity. The process of selecting these representative subsets is shown in Figure 2.
Using the RAF maps, four complete sequence sets are created for every domain in the first seven classes of the SCOP database. Two sets (the genetic domain sets) include the genetic domain sequences described above, and the other two (the original-style sequence sets) use the prior method of splitting each multi-chain domain into multiple sequences. For each of these methodologies, one complete sequence set is derived from sequences in the PDB ATOM records, and another from sequences in the SEQRES records. We expect the genetic domain sets, in particular the one derived from SEQRES records, to be more commonly used than sets created using the prior methodology, so the original-style sets may be deprecated in future versions of ASTRAL.
The SEQRES sets (for both genetic domain and original-style methods) are used to derive representative subsets. As shown in Figure 2, each set is fully compared against itself using BLAST (10), and subsets are created using the three similarity criteria (BLAST E-values, sequence identity and SCOP classification) described previously (5). Representatives are chosen according to SPACI scores, which are derived from the resolution and R-factor of crystallographically determined structures, as well as the output of the programs PROCHECK (11) and WHAT_CHECK (7). Future SPACI scores may incorporate updated programs and additional source data (15). This selection procedure will be enhanced in future versions of ASTRAL to reflect mutated or misfolded structures.
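One plausible way to combine a quality score with a similarity criterion is a greedy pass, sketched below. This is an assumption for illustration and not necessarily the procedure described in (5): it assumes that a higher SPACI score means higher quality, and it takes the similarity test (for example, a sequence identity or BLAST E-value threshold) as a caller-supplied function. All names and data are hypothetical.

```python
from typing import Callable, Dict, List


def select_representatives(
    spaci: Dict[str, float],
    is_similar: Callable[[str, str], bool],
) -> List[str]:
    """Greedy selection: visit domains from best to worst score and keep a
    domain only if it is not similar to any representative already kept."""
    # Ties are broken by identifier so the result is deterministic.
    ranked = sorted(spaci, key=lambda dom: (-spaci[dom], dom))
    representatives: List[str] = []
    for domain in ranked:
        if not any(is_similar(domain, rep) for rep in representatives):
            representatives.append(domain)
    return representatives


# Toy scores and one similar pair, invented for illustration.
toy_scores = {"dom1": 0.9, "dom2": 0.7, "dom3": 0.5}
similar_pairs = {frozenset(("dom1", "dom2"))}

reps = select_representatives(
    toy_scores, lambda x, y: frozenset((x, y)) in similar_pairs
)
print(reps)
```

Under these assumptions, every domain left out of the subset is similar to at least one retained domain with an equal or better score.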
The most frequently requested representative subsets are those filtered at a 40% level of sequence identity (ID) and a 95% ID level. These two subsets are highlighted on our web site. Since the original release of ASTRAL, the 40% ID subset has increased in size from 1947 domain sequences to 3613. The 95% ID subset has increased in size from 3285 sequences to 6146. Both of these represent an increase on the order of 85%, in contrast to the 103% increase in the total number of domains in the full sequence sets.
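The growth figures quoted above can be checked directly from the counts given in the text. The helper name `percent_increase` is introduced here for illustration only.

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, expressed as a percentage."""
    return 100.0 * (new - old) / old


# Counts taken from the text: original release versus current release.
growth = {
    "40% ID subset": percent_increase(1947, 3613),
    "95% ID subset": percent_increase(3285, 6146),
    "full sequence set": percent_increase(15190, 30867),
}

for label, value in growth.items():
    print(f"{label}: {value:.1f}%")
```

The two subsets grow by roughly 86% and 87%, while the full set grows by about 103%, so the representative subsets expand more slowly than the number of domains as a whole.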
| NEW TRANSLATION TABLE |
|---|
Chemically modified residues are now included in our translation table, which maps the three-letter codes found in PDB files to one-letter codes in our sequences. The translation table for modified residues is given in Table 1. This information was extracted from the Het group dictionary given on the PDB site, although modified amino acids are not always distinguished in this file from other types of prosthetic groups. Several PDB files used HETATM codes for modified amino acids that were inconsistent with the standard table; however, in these cases the notation was explained in the PDB headers and the correct sequences were manually entered into the RAF maps.
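A translation step of this kind might look like the sketch below. The modified-residue entries shown (MSE, SEP, TPO, PTR) are common PDB codes mapped to their parent amino acids for illustration; they are not copied from Table 1, and the use of X for unrecognized codes is an assumption.

```python
from typing import Iterable

# The 20 standard amino acids.
STANDARD = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Illustrative modified residues mapped to their parent amino acid.
# Assumed examples only; the ASTRAL table may differ.
MODIFIED = {
    "MSE": "M",  # selenomethionine
    "SEP": "S",  # phosphoserine
    "TPO": "T",  # phosphothreonine
    "PTR": "Y",  # phosphotyrosine
}

TRANSLATION = {**STANDARD, **MODIFIED}


def translate(residues: Iterable[str], unknown: str = "X") -> str:
    """Map three-letter residue codes to a one-letter sequence string."""
    return "".join(TRANSLATION.get(code.upper(), unknown) for code in residues)


sequence = translate(["ALA", "MSE", "SEP", "ZZZ"])
print(sequence)
```

Keeping the modified residues in a separate dictionary makes it easy to review them by hand, which matters because the Het group dictionary does not always separate modified amino acids from other prosthetic groups.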
| FORMAT CHANGES |
|---|
In order to facilitate parsing of the sequence headers, the FASTA header for each ASTRAL sequence has been modified to always include a region code, identifying a residue range and chain ID. If the domain spans an entire PDB entry with no chain ID, the region code is represented as (). The old classification page numbers in SCOP have been replaced by new SCOP Concise Classification String (sccs) identifiers (4), and this change is also reflected in the headers of FASTA files distributed in ASTRAL. Details on parsing the headers are available on our web site.
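A header parser along these lines is sketched below. It assumes a layout of identifier, sccs, parenthesized region code and free-text description, separated by whitespace; the authoritative layout is the one documented on the ASTRAL web site. The example header strings and the names `AstralHeader` and `parse_header` are hypothetical.

```python
import re
from typing import List, NamedTuple

# Assumed layout: ">identifier sccs (region,region,...) description"
HEADER_RE = re.compile(r"^>(\S+)\s+(\S+)\s+\(([^)]*)\)\s*(.*)$")


class AstralHeader(NamedTuple):
    identifier: str
    sccs: str
    regions: List[str]
    description: str


def parse_header(line: str) -> AstralHeader:
    """Split a FASTA header into its fields; raise ValueError if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    identifier, sccs, region_code, description = match.groups()
    # An empty region code "()" denotes a whole entry with no chain ID.
    regions = [part.strip() for part in region_code.split(",") if part.strip()]
    return AstralHeader(identifier, sccs, regions, description)


# Hypothetical header strings for illustration.
multi_chain = parse_header(">g1cph.1 g.1.1.1 (B:,A:) example description")
whole_entry = parse_header(">d0xyz__ a.1.1.1 () example description")

print(multi_chain)
print(whole_entry.regions)
```

Because the region code is always present, even as `()`, a single pattern can cover every header without special cases for missing fields.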
| ACKNOWLEDGEMENTS |
|---|
We thank J. Michael Sauder for suggesting the translation of non-standard amino acids. This work is supported by grants from the NIH (1-P50-GM62412, 1-K22-HG00056 and GM1455), the Department of Energy (DE-FG03-95ER62135) and the Searle Scholars Program (01-L-116).
| FOOTNOTES |
|---|
* To whom correspondence should be addressed at: Department of Plant and Microbial Biology, 111 Koshland Hall, University of California, Berkeley, CA 94720-3102, USA. Tel: +1 510 643 9131; Fax: +1 208 279 8978; Email: brenner@compbio.berkeley.edu
| REFERENCES |
|---|
1 Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T.F. and Weng,J. (1987) In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds), Crystallographic Databases: Information Content, Software Systems, Scientific Applications. Data Commission of the International Union of Crystallography, Cambridge, UK, pp. 107-132.
2 Berman,H.M., Westbrook,J., Feng,Z., Gilliland,G., Bhat,T.N., Weissig,H., Shindyalov,I.N. and Bourne,P.E. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235-242. Updated article in this issue: Nucleic Acids Res. (2002), 30, 245-248.
3 Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
4 Lo Conte,L., Brenner,S.E., Hubbard,T.J.P., Chothia,C. and Murzin,A.G. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res., 30, 264-267.
5 Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254-256.
6 Bernstein,H., Bernstein,F. and Bourne,P.E. (1998) pdb2cif: translating PDB entries into mmCIF format. J. Appl. Crystallogr., 31, 282-295.
7 Hooft,R.W.W., Vriend,G., Sander,C. and Abola,E.E. (1996) Errors in protein structures. Nature, 381, 272.
8 Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL. Nucleic Acids Res., 28, 45-48.
9 Gursky,O., Badger,J., Li,Y. and Caspar,D.L. (1992) Conformational changes in cubic insulin crystals in the pH range 7-11. Biophys. J., 63, 1210.
10 Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
11 Laskowski,R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr., 26, 283-291.
12 Sayle,R.A. and Milner-White,E.J. (1995) RasMol: biomolecular graphics for all. Trends Biochem. Sci., 20, 374-376.
13 Kraulis,P.J. (1991) MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallogr., 24, 946-950.
14 Merritt,E.A. and Bacon,D.J. (1997) Raster3D: photorealistic molecular graphics. Methods Enzymol., 277, 505-524.
15 Hooft,R.W.W., Sander,C., Scharf,M. and Vriend,G. (1996) The PDBFINDER database: a summary of PDB, DSSP and HSSP information with added value. Comput. Appl. Biosci., 12, 525-529.