| Nucleic Acids Research | Pages |
An Integrated Sequence-Structure Database incorporating matching mRNA sequence, amino acid sequence and protein three-dimensional structure data
ABSTRACT
INTRODUCTION AND RATIONALE
Since the genetic code is degenerate (in the universal genetic code the 61 sense codons are arranged into groups, or families, of synonyms, each coding for a single amino acid residue), no back-translation algorithm can, strictly speaking, yield the native coding sequence solely on the basis of amino acid sequence data. Knowledge of the exact coding sequence is therefore an important part of the general set of data describing the sequence-structure relationship in proteins. The degeneracy of the genetic code, primarily connected with the 'silent' nucleotide at the third codon position, gives the nucleotide sequence additional degrees of freedom, allowing it to carry extra information relevant to the encoded amino acid sequence. It has been found (1-3) that patterns of positive rare codon usage bias in mRNA sequences correlate with the segments linking domains and regular secondary structure blocks in proteins. The rates of translation can be lower for inter-domain regions (1,4) and can also vary for different secondary structures (5). A bias of nucleotides at specific codon positions, associated with the regions adjacent to α-helices and β-sheets, has been reported (6), although no correlation with rare codon usage was found in that study.
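The scale of this ambiguity can be illustrated with a minimal Python sketch that counts the coding sequences compatible with a given peptide under the universal genetic code. The function name `count_back_translations` and the example peptide are illustrative assumptions and are not part of ISSD.

```python
from math import prod

# Universal genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Group the sense codons into families of synonyms, one family per amino acid.
SYNONYMS = {}
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":  # skip the three stop codons
        SYNONYMS.setdefault(amino_acid, []).append(codon)


def count_back_translations(peptide):
    """Number of distinct coding sequences that translate to `peptide`."""
    return prod(len(SYNONYMS[residue]) for residue in peptide)


# Met (1 codon), Leu (6 codons) and Ser (6 codons) give 1 * 6 * 6 candidates.
n_candidates = count_back_translations("MLS")
```

Even for this tripeptide there are 36 candidate coding sequences, and the number grows multiplicatively with chain length, which is why the native sequence has to be taken from the nucleotide databases rather than inferred.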
These results are difficult to interpret, since different datasets were used for the analyses and none of them was published by the authors. Importantly, the organism specificity of codon usage should be taken into account when compiling a dataset. To separate sequence-structure effects from the raw genome codon bias, it is necessary to have a dataset compiled from gene and protein data for the same organisms and, whenever possible, matching tissue/cell types. Generally available databases (e.g. GenBank, PDB) were not suitable for such analysis, for several reasons. Firstly, their structural information is extremely redundant. For instance, the 'non-redundant' version of GenBank (available via NCBI Entrez) retains duplicate entries for all overlapping nucleotide sequences, even if they differ in only a few nucleotides. The Protein Data Bank (Brookhaven) contains numerous structures for identical proteins or for proteins with single-residue substitutions. Secondly, aligning the different levels of sequence and structure data presents a problem. Automatic generation of alignments directly from PDB or GenBank records is unreliable because of the relatively high level of sequence errors and the inconsistency of data formats, even within a single database. Hence the best solution was to compile a specialised database from the resources of the publicly available general-purpose databases, additionally applying strict manual data consistency checks. This approach was used to construct ISSD.
We used the initial version of ISSD to carry out a direct analysis of the synonymous codon distribution frequencies between protein secondary structure types. The analysis showed that the distribution is statistically non-random (7), with the codon structural preferences related to the nucleotide in the third, 'silent', codon position. Specific structural preferences for some synonymous codons at the N- or C-termini of α-helices and β-sheets were also observed.
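One way such non-randomness could be assessed is a chi-square test of independence on a table of synonymous codon counts per secondary structure type. The sketch below computes the statistic only. It is an assumption about the general approach, not the procedure used in (7), and the counts in the example are invented toy values, not ISSD data.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    table[i][j] is the number of times synonymous codon i was observed in
    secondary structure type j. Every row and column must have a non-zero
    total, otherwise an expected count of zero would be divided by.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if codon choice were independent of structure.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy example with invented counts: two synonymous codons, two structure types.
toy_counts = [[10, 20],
              [20, 10]]
statistic, dof = chi_square_independence(toy_counts)
```

The statistic would then be compared with the chi-square critical value for the given degrees of freedom (3.84 for one degree of freedom at the 0.05 level).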


DATABASE CONSTRUCTION
The Integrated Sequence-Structure Database was compiled according to the algorithm presented in Figure 1. The main objective of ISSD is to integrate the sequence and three-dimensional structure data for each given protein molecule, drawn from multiple entries in the source databases, into a single target ISSD entry. This provides fast and convenient access to the information describing all levels of protein structure, from the coding sequences of genes to three-dimensional coordinates.
The algorithm described in Figure 1 was used to establish a one-to-one correspondence between a specific protein and its entries in the different sequence and structure databases. It also allowed us to avoid ambiguities in the sequence data that can exist in the PDB [Protein Data Bank (8)]. A PDB entry contains SEQRES records listing the amino acid sequence of the molecule, but this sequence does not always correspond one-to-one with the sequence for which three-dimensional coordinates are given in the same entry. Older PDB entries also do not provide a reference to the relevant amino acid sequence databases. Even if such a reference is present, it is still necessary to find, for the same protein, an exact alignment of its PDB-sequence with the sequence deposited in an amino acid sequence database, because the sequence data may contain inaccuracies and sequence identification may differ between databases. The PDB-sequences processed by the ISSD compilation software were extracted from ATOM records and thus represent the exact sequences featured in the three-dimensional structure determination. At the next step, the nucleotide sequence exactly matching a given PDB-sequence has to be identified. In the ISSD compilation algorithm this is done by searching a translated version of the NCBI GenBank database, thus directly identifying all sequences with a near-100% match to the initial PDB-sequence. The corresponding nucleotide sequences are then extracted from the database and the best-matching sequence is identified on the basis of TFASTA alignments with the PDB-sequence. At this stage of the compilation process the sequences are checked manually against a number of selection criteria.
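The two automated steps described above can be sketched in Python as follows. This is a simplified illustration, not the ISSD compilation software: the function names are hypothetical, the residue sequence is read from the Cα atoms of ATOM records using the fixed PDB column layout, and an exact substring match over the three forward reading frames stands in for the TFASTA search.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_atom_records(pdb_lines, chain_id):
    """Residue sequence of one chain, taken from the CA atoms of ATOM records."""
    sequence, seen = [], set()
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # use only the first model of a multi-model entry
        if not line.startswith("ATOM  ") or len(line) < 27:
            continue
        if line[21] != chain_id or line[12:16].strip() != "CA":
            continue
        residue_key = line[22:27]  # residue number plus insertion code
        if residue_key in seen:
            continue  # alternate location of a residue already recorded
        seen.add(residue_key)
        sequence.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(sequence)


def translate(nucleotides):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(nucleotides[i:i + 3], "X")
                   for i in range(0, len(nucleotides) - 2, 3))


def find_coding_region(nucleotides, protein):
    """Return the stretch of `nucleotides` that translates exactly to `protein`.

    Only the three forward frames are searched; None means no exact match.
    """
    nucleotides = nucleotides.upper().replace("U", "T")
    for frame in range(3):
        position = translate(nucleotides[frame:]).find(protein)
        if position != -1:
            start = frame + 3 * position
            return nucleotides[start:start + 3 * len(protein)]
    return None
```

An exact match is stricter than the near-100% criterion in the text. Candidates that fail it would still need the alignment and manual checks described above.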
The criteria include final checks to ensure that (i) the selected nucleotide sequence comes from the same source organism as the PDB-sequence, (ii) the source tissue/organ/stage of development is the same or as close as possible, and (iii) the sequences do not represent mutants. Protein secondary structure assignments were made by the program DSSP (9) and include the following types: 'H', α-helix; 'B', β-bridge; 'E', β-ladder (participating in a β-sheet); 'G', 3₁₀-helix; 'T', turn with hydrogen bond; 'S', bend; 'I', π-helix (very rare). The structure type 'P', polyproline II-type helix, was assigned by the RSS program (10). The PDB coordinates are given as they appear in the corresponding PDB entries.
DATABASE DESCRIPTION
The currently available version (Alpha 1.0) of ISSD is a regularly updated, non-homologous database (identity <50% for any pair of aligned sequences in the database), and is restricted to mammalian genes/proteins (as shown in Fig. 4). The actual sequence identity levels in the non-homologous version of ISSD were calculated using the program ALIGN to generate a total non-redundant set of global pairwise alignments. For the amino acid sequences, 99.5% of the pairwise alignments in the database have identity <30%. Global pairwise alignments were also calculated for the codon sequences in the database and showed no pairs with identity >30%. Structures in the database have a resolution of 2.5 Å or better.
The database is accessible via the HTTP protocol on the WWW. Individual entries, indexed by organism name, and statistical data (Fig. 2) on codon and amino acid usage in the database can be viewed with a general-purpose Web browser. An entry (protein) in ISSD includes headers giving a short description of the protein (name, source organism, structure resolution) and hyperlinked references to the entries in the sequence and structural databases.
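The non-homology criterion can be illustrated with a short Python sketch that computes percentage identity from a global pairwise alignment and keeps only sequences below a threshold. It is not the ALIGN program: the unit match, mismatch and gap scores, the definition of identity as matches divided by alignment length, and the greedy selection order are all assumptions made for illustration.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percentage identity of a Needleman-Wunsch global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting identical aligned positions.
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * matches / length if length else 0.0


def select_non_homologous(sequences, threshold=50.0):
    """Greedily keep names whose identity to every kept sequence is below threshold."""
    kept = []
    for name, sequence in sequences.items():
        if all(global_identity(sequence, sequences[other]) < threshold
               for other in kept):
            kept.append(name)
    return kept
```

A greedy pass depends on input order, so a real compilation might first rank candidates, for example by structure resolution. The source does not state how such ties were resolved.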
The main data in an ISSD entry consist of the nucleotide coding sequence (codons) aligned with the amino acid sequence and, for each codon-amino acid pair, structural parameters of the backbone. Structural parameters include secondary structure assignments, accessibility, the φ, ψ angles of the peptide groups and the PDB coordinates of the polypeptide backbone. The ISSD format is designed specifically to facilitate computer processing of the database records; the full format description is given at the ISSD WWW site.
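The per-residue alignment described here can be pictured as one record per codon-amino acid pair. The Python sketch below is a hypothetical in-memory layout, not the ISSD file format (which is documented only at the ISSD site). The class name, field names and the `build_records` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))

# Secondary structure codes listed in the text (DSSP types plus 'P' from RSS).
STRUCTURE_TYPES = {
    "H": "alpha-helix", "B": "beta-bridge", "E": "beta-ladder",
    "G": "3-10 helix", "T": "turn with hydrogen bond", "S": "bend",
    "I": "pi-helix", "P": "polyproline II-type helix",
}


@dataclass
class ResidueRecord:
    codon: str
    amino_acid: str
    secondary_structure: str
    accessibility: Optional[float] = None
    phi: Optional[float] = None
    psi: Optional[float] = None
    # Backbone atom name -> (x, y, z), copied from the PDB entry.
    backbone: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)


def build_records(coding_sequence, protein_sequence, structure_string):
    """Align codons, residues and structure codes position by position."""
    if (len(coding_sequence) != 3 * len(protein_sequence)
            or len(structure_string) != len(protein_sequence)):
        raise ValueError("sequence lengths are inconsistent")
    records: List[ResidueRecord] = []
    for index, (residue, code) in enumerate(zip(protein_sequence, structure_string)):
        codon = coding_sequence[3 * index:3 * index + 3]
        if CODON_TABLE.get(codon) != residue:
            raise ValueError(f"codon {codon} does not encode {residue}")
        records.append(ResidueRecord(codon, residue, code))
    return records
```

Checking every codon against its residue while the records are built mirrors, in a small way, the consistency checks the text describes for the compilation process.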
The ISSD dataset looks small in comparison with the vast amount of nucleotide sequence data available and the rapidly growing number of experimentally solved protein structures. However, one should note the already mentioned excessive redundancy of the structural databases. Also, only non-homologous proteins with high-resolution structures and available corresponding nucleotide sequences of the genes from the same organism were selected. An upgraded version of ISSD is in preparation and will substantially increase the number of mammalian proteins included in the database. Datasets for other species are also being prepared, with priority given to Escherichia coli and Saccharomyces cerevisiae. Two other versions of ISSD are planned: (i) a non-homologous dataset that will combine both mammalian and other genes/proteins, and (ii) a dataset that will include all sequences for which three-dimensional structures are available, classified by species, with the first release covering human genes/proteins.
Codon and amino acid usage statistics are used to monitor the uniformity of the database. Statistics calculated for the version 1.0 ISSD dataset include relative synonymous codon usage (RSCU) distances (11) and amino acid usage distances between sequences (calculated in a similar manner). Cluster analysis shows a relatively high diversity in synonymous codon usage for individual genes, although no clusters separated by significantly high RSCU distances could be identified according to the criteria introduced earlier (11) (Fig. 3). Notably, the genes that belong to the same species do not form separate clusters but are distributed diffusely. A computer-generated listing of the non-redundant ISSD mean codon and amino acid frequencies is given in Figure 2. The ISSD mean synonymous codon usage reflects fairly well the codon usage in the source species as calculated from independent data of the Codon Usage Database (12) (Fig. 4).
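For readers unfamiliar with RSCU, the sketch below computes it for a single coding sequence using the usual definition: the observed count of a codon divided by the mean count over its synonymous family. The function name is an assumption, and the RSCU distance of (11) is not reproduced here because its exact definition is not given in the text.

```python
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))

# Families of synonymous sense codons, keyed by amino acid.
FAMILIES = {}
for codon, amino_acid in CODON_TABLE.items():
    if amino_acid != "*":
        FAMILIES.setdefault(amino_acid, []).append(codon)


def rscu(coding_sequence):
    """Map each codon to its RSCU value for one in-frame coding sequence.

    Amino acids that do not occur in the sequence are left out, since their
    RSCU values are undefined (zero divided by zero).
    """
    sequence = coding_sequence.upper().replace("U", "T")
    counts = Counter(sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3))
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue
        for codon in family:
            # Observed count divided by the family mean, total / len(family).
            values[codon] = counts[codon] * len(family) / total
    return values


# Three CTG and one CTA among six leucine codons, plus one ATG.
example = rscu("CTGCTGCTGCTAATG")
```

A value of 1 means a codon is used exactly as often as expected under uniform synonymous usage; in the example CTG scores 4.5 and the unused leucine codons score 0.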
Our results are in agreement with the previously reported observations of markedly different codon usage patterns in individual mammalian genes, which result in similar patterns for different species after averaging (13). The amino acid usage dendrogram shown in Figure 3 displays, as expected, lower distances between individual sequences in comparison with the RSCU.
At present, several possible applications of ISSD can be suggested. The data can be used for gene expression optimisation, including back-translation algorithms, total codon usage optimisation, optimisation of the synonymous codon distribution along mRNA, and codon usage optimisation relative to protein structure. The database can be used to explore ways of improving protein three-dimensional structure prediction and modelling. It will also be possible to analyse the evolutionary relationship between protein structure and gene sequence.
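As an illustration of the simplest of these applications, the sketch below back-translates a peptide by choosing, for each residue, the synonymous codon used most often in a reference coding sequence. The function name and the reference sequence are hypothetical. The result is one plausible candidate, not the native sequence, and the approach ignores the structure-dependent preferences that ISSD is intended to reveal.

```python
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))


def back_translate(peptide, reference_cds):
    """Back-translate `peptide` using the most frequent codons of a reference.

    Ties, including residues absent from the reference, are resolved by
    taking the alphabetically first synonymous codon.
    """
    reference = reference_cds.upper().replace("U", "T")
    usage = Counter(reference[i:i + 3] for i in range(0, len(reference) - 2, 3))
    codons = []
    for residue in peptide:
        synonyms = sorted(c for c, aa in CODON_TABLE.items() if aa == residue)
        if not synonyms or residue == "*":
            raise ValueError(f"not a standard amino acid: {residue!r}")
        # max() returns the first maximal item, so ties follow sorted order.
        codons.append(max(synonyms, key=lambda codon: usage[codon]))
    return "".join(codons)


# The reference uses CTG twice and CTA once for leucine, so CTG is chosen.
candidate = back_translate("MLK", "CTGCTGCTAATGAAA")
```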
DATABASE ACCESS
The Integrated Sequence-Structure Database can be accessed at http://www.protein.bio.msu.su/issd/
ACKNOWLEDGEMENTS
This work was supported by the Cancer Research Campaign (Programme Grant SP13848SN) and in part by the Russian Fund for Basic Research. IAA acknowledges support and thanks the Royal Society for a fellowship awarded under the 'Exchanges with the former FSU' scheme.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.