Nucleic Acids Research
A phylogenetic approach to target selection for structural genomics: solution structure of YciH
Received July 7, 1999; Revised and Accepted August 27, 1999
PDB accession no. 1D1R
ABSTRACT Structural genomics presents an enormous challenge with up to 100 000 protein targets in the human genome alone. At current rates of structure determination, judicious selection of targets is necessary. Here, a phylogenetic approach to target selection is described which makes use of the National Center for Biotechnology Information database of Clusters of Orthologous Groups (COGs). The strategy is designed so that each new protein structure is likely to provide novel sequence-fold information. To demonstrate this approach, the NMR solution structure of YciH (COG0023), a putative translation initiation factor from Escherichia coli, has been determined and its fold classified. YciH is an ortholog of eIF-1/SUI1, an integral component of the translation initiation complex in eukaryotes. The structure consists of two antiparallel α-helices packed against the same side of a five-stranded β-sheet. The first 31 residues of the 11.5 kDa protein are unstructured in solution. Comparative analysis indicates that the folded portion of YciH resembles a number of structures with the α-β plait topology, though its sequence is not homologous to any of them. Thus, the phylogenetic approach to target selection described here was used successfully to identify a new homologous superfamily within this topology.
INTRODUCTION
The Genome Project is expected to produce the detailed sequence of the 3 billion bases of the human genome by the year 2005 if not earlier (1). The knowledge of this DNA sequence, which is estimated to code for about 100 000 proteins (2), is necessary but insufficient for a complete understanding of human and other living systems. The next step is to determine the biochemical functions of these 100 000 proteins and their relationships to one another. This task will be significantly enhanced by determining the 3-dimensional structures of these gene products. However, to determine all 100 000 protein structures is a formidable task. Even when obvious multigene families are taken into account (3), which will probably reduce this number by an order of magnitude, we will still be left with about 10 000 distinct proteins, not all of which can realistically be structurally characterized in the short term.
A more tractable alternative would be to first identify, through comparative sequence analysis, families of homologous proteins that span a wide phylogenetic range, and then determine the 3-dimensional structure of at least one member of each family. Proteins of similar sequence generally share the same fold (4), so the determination of the structure of one member of a family provides a model for all of them. In favorable cases, knowing the structure may also provide considerable insight into the possible function of a protein, when the function is not already clearly identified from sequence comparisons to proteins with known functions.
A strategy of genome-directed structural biology in the context of structural genomics will also help populate the universe of protein folds in a systematic fashion. Current estimates of the number of fundamental folding motifs indicate that there may be about 1000 distinct folds (5). Several hundred are already characterized experimentally, and if these estimates are reasonable, determining representative structures for each existing fold seems to be a feasible goal.
Families of homologous proteins have been determined by comparative sequence analysis. The outcome of this analysis is particularly satisfactory in terms of the phylogenetic distribution of family members when completely sequenced genomes are involved. One such scheme (6,7) has identified about 864 clusters of orthologous groups of proteins (COGs) by analyzing genomes of eight microorganisms from the three domains of life (bacteria, archaea and eukaryotes). Such clusters of evolutionarily conserved proteins appear to constitute a basic set of core biochemical elements from which components of more complex organisms evolved. The determination of the structures and functions of these proteins could, therefore, have profound implications for the understanding of all living systems and their evolutionary relationships.
It is possible, using sequence analysis techniques, to rationally and systematically identify a set of proteins, from completely sequenced genomes of microorganisms, with wide phylogenetic distribution and no discernible similarity to proteins of known 3-dimensional structure. This kind of analysis delineates a set of proteins, hereafter referred to as targets, that have a high likelihood of either having a new fold or of structurally defining a new homologous superfamily, or group of proteins which share sequence and structure characteristics through an evolutionary relationship (8). Since it still requires considerable effort to determine the structure for any protein, it is essential to obtain a list of targets that maximize the potential for the discovery of novel structure, function and evolutionary information. This can be accomplished by using protein sequence analysis methods that identify both obvious pair-wise similarities and more subtle motifs discernible only through the application of multiple sequence comparison techniques.
Using such an analysis, protein folds have been assigned for a set of 14 complete proteomes (translated complete genomes) that included 13 microorganisms and the multicellular organism Caenorhabditis elegans (9). Since COGs were determined from a subset of these organisms (6), a consequence of this proteome fold analysis was the identification of about 200 COGs with protein sequences that had no discernible similarity to proteins with known structures. From this list, we have chosen several target proteins for our structural genomics work. In this report we describe the structure determination of one of these targets, namely Escherichia coli YciH from COG0023.
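The selection logic described above can be illustrated with a minimal sketch. It assumes each COG is summarized by the domains of life in which it has members and by a flag recording whether any member resembles a protein of known structure. The record layout, the function name `select_targets`, the two `HYPOTHETICAL_*` entries and the requirement that all three domains be represented are assumptions made for illustration; only the COG0023 entry reflects information given in the text.

```python
from dataclasses import dataclass

ALL_DOMAINS = frozenset({"bacteria", "archaea", "eukaryotes"})


@dataclass(frozen=True)
class Cog:
    cog_id: str
    domains: frozenset        # domains of life with at least one member
    has_structural_hit: bool  # True if any member resembles a known structure


def select_targets(cogs, required_domains=ALL_DOMAINS):
    """Keep COGs that are phylogenetically widespread and structurally uncharacterized."""
    return [
        c.cog_id
        for c in cogs
        if required_domains <= c.domains and not c.has_structural_hit
    ]


cogs = [
    # COG0023: members in all three domains, no structural hit when selected.
    Cog("COG0023", ALL_DOMAINS, False),
    # Hypothetical entries, included only to show what the filter rejects.
    Cog("HYPOTHETICAL_A", frozenset({"bacteria"}), False),
    Cog("HYPOTHETICAL_B", ALL_DOMAINS, True),
]

targets = select_targets(cogs)
print(targets)
```

In practice the structural-hit flag would come from the pairwise and multiple-sequence comparisons described above rather than from a hand-entered boolean.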
When COG0023 was selected, its members had no discernible sequence similarity to proteins with known structures. A multiple sequence alignment of the members of COG0023 is shown in Figure 1. One of the orthologs in COG0023, yeast SUI1, has plant, insect and human homologs. In mammals, SUI1 is known as eIF-1 (eukaryotic translation initiation factor 1; this protein is unrelated to bacterial IF-1). Mammalian eIF-1 is an essential component of the eukaryotic translation initiation complex, where it promotes formation of the complex and destabilizes aberrant complexes by an unknown mechanism (10). The solution structure of human eIF-1 has recently been reported (11). In yeast, SUI1 plays an essential role in recognition of the initiator codon (12). Many bacterial genomes lack a yciH homolog, however, and the role of YciH in bacteria remains an open question (13). In E.coli, the yciH gene is located in the same operon (pyrF) as the gene for orotidine 5′-monophosphate decarboxylase and a role in translational regulation of this protein has been suggested (14). Interestingly, YciH/SUI1 has been found in all completely sequenced archaeal genomes (E.V.Koonin, unpublished observations). This phyletic distribution suggests that in eukaryotes and archaea, SUI1 is indeed an essential translation factor, whereas in bacteria, its ortholog YciH may have a different, and probably non-essential function.
Figure 1. Multiple sequence alignment for YciH and orthologs from bacteria (sequences 1-4), archaea (sequences 5-8) and eukaryotes (sequences 9 and 10). The NMR-derived secondary structure of E.coli YciH appears at the top. Residues colored red or green are, respectively, identical or conserved with respect to the consensus sequence. Alignment was constructed using the CLUSTALW program (40). All sequences are either part of COG0023 or would be if their host organism's genome were completely sequenced. Not all known eukaryotic SUI1 sequences are shown. The full name of the host organism and the Swiss-Prot ID number for each protein are: YCIH_ECOLI (E.coli, P08245); YCIH_SALTY (Salmonella typhimurium, P20770); YCIH_HAEIN (Haemophilus influenzae, P45116); Y546_SYNY3 (Synechocystis sp., Q55397); SUI1_METJA (Methanococcus jannaschii, Q57902); SUI1_METTH (Methanobacterium thermoautotrophicum, O26118); SUI1_ARCFU (Archaeoglobus fulgidus, O29348); SUI1_YEAST (Saccharomyces cerevisiae, P32911); SUI1_HUMAN (Homo sapiens, P41567).
Three possible outcomes for this project were envisioned: (i) YciH would have a new fold and define a new homologous superfamily; (ii) YciH would have a known fold, yet still define a new homologous superfamily; or (iii) YciH would be a previously unrecognized member (due to very distant sequence relationships revealed only through structural comparison) of an existing homologous superfamily.
MATERIALS AND METHODS
Sample preparation
Unless specified otherwise, chemicals were purchased from Sigma. The yciH gene from E.coli was cloned, inserted into the pET-29b vector at the NdeI/XhoI site, and transformed into E.coli strain BL21(DE3) by ATG laboratories (Eden Prairie, MN). The construct included a C-terminal 6× histidine tag (LEHHHHHH). Cells were grown in minimal medium using ammonium chloride and glucose as nitrogen and carbon sources, respectively. 15NH4Cl and [U-13C]glucose (Cambridge Isotope Laboratories) were used for uniform labeling. Cultures (500 ml) were grown at 37°C to an OD600 of 0.6, induced for 3-5 h with 1.2 mM isopropylthiogalactoside, harvested by centrifugation at 6700 g, and resuspended in 50 mM Tris-HCl or sodium phosphate (pH 7.4) containing 200 mM NaCl, 10 mM imidazole and 1 mM β-mercaptoethanol. The cells were lysed by passage through a French press twice at 10 000 p.s.i. in the presence of 2.9 mM phenylmethylsulfonyl fluoride. Insoluble material was removed by centrifugation at 17 500 g, and protamine sulfate was added to a concentration of 0.5 mg/ml to precipitate DNA. The supernatant was ultracentrifuged for 1 h at 300 000 g to remove lipids and loaded onto a 1 × 9 cm Ni2+ chelation column (Novagen His-Bind or Qiagen Ni-NTA). After washing the bound protein with 12 column volumes of loading buffer, YciH was eluted with 100-150 mM imidazole. Fractions containing YciH (~8 ml total) were pooled and dithiothreitol (DTT) was added to a concentration of 10 mM. The solutions were concentrated to 1 ml in Centricon-3 ultrafiltration devices (Millipore). The concentrate was applied to a PD-10 desalting column (Pharmacia) equilibrated with NMR buffer (50 mM sodium phosphate or Tris-HCl pH 7.4, 200 mM NaCl, 10 mM DTT, 10% D2O). Fractions containing protein were concentrated again to the final NMR sample volume. Final concentrations were between 2 and 2.5 mM.
Purification was monitored by SDS-PAGE; gels (15% polyacrylamide ready gels; Bio-Rad) of the NMR samples showed a single dominant band at ~13 kDa.
NMR spectroscopy
NMR spectroscopy was carried out at 500, 600 and 750 MHz on Varian instruments equipped with Unity plus or Inova consoles and triple resonance probes at the Department of Energy's Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory (Richland, WA). All experiments were run at 25°C and referenced to external DSS (15). 2D NOESY (150 and 300 ms), DQF-COSY and TOCSY (50 ms) experiments were conducted on unlabeled YciH. 2D HSQC (16), HMQC-J (17) and 3D 15N-NOESY-HSQC (16) experiments were conducted on [U-15N]YciH. 3D HNCACB (18), HNCO (18), CBCA(CO)NNH (18), CBCA(CO)CAHA (19), HCCH-TOCSY (20), HCCH-TOCSY-NNH (21), CCC-TOCSY-NNH (21-23) and 4D CC-NOESY (24) experiments were conducted on [U-13C, U-15N]YciH. Data were processed with Felix (MSI, San Diego, CA).
Chemical shift assignments
Chemical shift assignments have been deposited in the BioMagResBank.
Structure calculations
X-PLOR 3.1 was used for structure calculations (25). Input files dg_sub_embed, dg_full_embed and dgsa were used as supplied except that in dgsa 30 000 high temperature (2000 K) and 200 000 cooling steps were used. Residues 1-28 and the histidines of the C-terminal His-tag were excluded from the calculations. Distance restraints of 1.8-2.5, 1.8-3.2 and 1.8-4.0 Å corresponding to strong, medium and weak NOEs were used for all restraints between backbone hydrogens and for backbone hydrogen-βH restraints within three residues of each other. All other restraints were 1.8-5.0 Å. A correction of 1 Å was added to the upper bound for restraints involving methyl groups. Pseudoatom corrections were employed throughout the structure calculations. No intraresidue restraints were used. Dihedral restraints for φ in non-glycine residues were derived from the HMQC-J experiment. For 3JHNHα values >8 Hz, φ was restrained to -120 ± 40°; for values <6 Hz in residues judged to be helical from NOE and chemical shift data, φ was restrained to -57 ± 20°. Hydrogen bond restraints (dO-HN = 1.8-2.2 Å; dO-N = 2.8-3.3 Å) for slowly exchanging amide protons were added once the correct acceptor had been established from NOESY data.
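The restraint rules in this section can be restated as a small sketch. It is not the X-PLOR input used for the calculations; the function names, the `calibrated` flag (true for backbone-backbone restraints and backbone hydrogen-βH restraints within three residues) and the choice to apply the 1 Å methyl correction once per restraint are assumptions made for illustration.

```python
# Distance bounds (in Å) for restraints that were calibrated by NOE intensity.
NOE_BOUNDS = {"strong": (1.8, 2.5), "medium": (1.8, 3.2), "weak": (1.8, 4.0)}
DEFAULT_BOUNDS = (1.8, 5.0)   # all other restraints
METHYL_CORRECTION = 1.0       # added to the upper bound (assumed once per restraint)


def noe_bounds(intensity, calibrated, involves_methyl=False):
    """Return (lower, upper) distance bounds in Å for one NOE restraint."""
    lower, upper = NOE_BOUNDS[intensity] if calibrated else DEFAULT_BOUNDS
    if involves_methyl:
        upper += METHYL_CORRECTION
    return lower, upper


def phi_restraint(j_hnha_hz, helical=False):
    """Return (target, tolerance) in degrees for phi, or None if unrestrained."""
    if j_hnha_hz > 8.0:
        return (-120.0, 40.0)
    if j_hnha_hz < 6.0 and helical:
        return (-57.0, 20.0)
    return None  # intermediate couplings, or small couplings outside helices


print(noe_bounds("medium", calibrated=True))
print(noe_bounds("weak", calibrated=False, involves_methyl=True))
print(phi_restraint(9.1), phi_restraint(5.0, helical=True), phi_restraint(6.5))
```

Glycine residues, which the text excludes from φ restraints, would need to be filtered out before calling `phi_restraint`.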
Coordinates
Coordinates for a set of 20 structures have been deposited in the Protein Data Bank (accession no. 1D1R).
RESULTS
Structure determination of YciH
YciH was soluble to >2 mM and yielded NMR data of high quality. The 1H-15N HSQC spectrum showed excellent dispersion. Following assignment, the secondary structure elements of YciH were identified easily from backbone chemical shifts, NOEs, and from D2O exchange data (Supplementary Material). The first 31 residues are unstructured, displaying little chemical shift deviation from random coil values, intraresidue and sequential NOEs only, no diagonal peaks and crosspeaks to the water resonance in a gradient water suppression 15N-edited NOESY, 6.5 Hz 3JHNHα values, and negative amide 1H-15N heteronuclear NOEs (Supplementary Material). Residues 32-106 form a folded α+β structure, with secondary structure elements ordered as follows: ββαββαβ (Fig. 1). It is interesting to note that these secondary structural elements correspond nearly precisely to those predicted previously for the YciH/SUI1 protein family (26).
A continuous network of NOEs shows there is a single five-stranded β-sheet. The strands are ordered β3-β4-β2-β1-β5 across the sheet. Strand β5 runs parallel to β1, and the others are antiparallel to their neighbors. Strands β1 and β2 are linked by a loop of eight residues (residues 40-47). Amide proton-nitrogen correlations are missing in the HSQC spectrum for six of these eight residues. The two loop residues that have HSQC crosspeaks, G42 and K44, have negative amide 1H-15N heteronuclear NOEs (Supplementary Material). This suggests that the loop exhibits intermediate conformational exchange and undergoes motion at a rate greater than the overall tumbling rate of the folded portion of the protein.
Interresidue NOE distance restraints within residues 32-107 were used to generate an ensemble of 25 structures. The poorest quintile of the ensemble was discarded (first using high Etotal, then high backbone r.m.s.d. to the average structure as elimination criteria) and a least squares superposition of the backbone atoms (excluding residues 1-31, 40-47 and 108-112) of the remaining 20 structures was generated (Fig. 2a). Structural statistics for the ensemble are tabulated (Table 1).
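The ensemble filtering step can be sketched as follows. The text does not state exactly how the two elimination criteria were combined, so this sketch assumes structures are ranked by total energy with backbone r.m.s.d. to the average structure as a tie-breaker, and that the worst fifth is dropped. The field names and the energy and r.m.s.d. values are synthetic placeholders, not data from the calculations.

```python
def discard_poorest_quintile(structures):
    """Drop the worst fifth of an ensemble, ranked by (total energy, backbone r.m.s.d.)."""
    n_discard = len(structures) // 5
    ranked = sorted(structures, key=lambda s: (s["e_total"], s["rmsd"]))
    return ranked[: len(structures) - n_discard]


# Synthetic ensemble of 25 models with steadily increasing energy (placeholder values).
ensemble = [
    {"name": f"model_{i:02d}", "e_total": 1.0 + 0.1 * i, "rmsd": 0.8}
    for i in range(25)
]

kept = discard_poorest_quintile(ensemble)
print(len(kept))
```

With 25 input structures this leaves the 20 that make up the final ensemble.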
Figure 2. (a) Stereoview of the backbone atom trace of the final ensemble of structures; the view is looking at the 'top' (helix side) of the open-faced β-sandwich. (b) MOLSCRIPT (41) rendition of the structure most similar to the average structure of YciH, from the same perspective. Regions of secondary structure are identified and correspond to the amino acid sequence as indicated in Figure 1.
Table 1. Statistics for the ensemble of 20 structures
| Distance restraints, total (residues 32-107) | 311 |
| Intraresidue | 0 |
| Sequential | 93 |
| Medium range (i+2, i+3, i+4) | 50 |
| Long range | 136 |
| Hydrogen bonds (two restraints per hydrogen bond) | 32 |
| Dihedral angle restraints | |
| φ | 42 |
| ψ | 0 |
| χᵃ | 5 |
| Mean r.m.s. deviations from the experimental restraints | |
| Distance (Å) | 0.001 ± 0.001 |
| Dihedral angle (°) | 0.020 ± 0.050 |
| Mean r.m.s. deviations from idealized covalent geometry | |
| Bonds (Å) | 0.0004 ± 0.0001 |
| Angles (°) | 0.08 ± 0.02 |
| Impropers (°) | 0.151 ± 0.003 |
| Mean energies (kcal/mol)ᵃ | |
| E(NOE) | 0.021 ± 0.032 |
| E(cdih) | 0.004 ± 0.016 |
| E(improper) | 2.01 ± 0.7 |
| E(VdW-repel) | 0.359 ± 0.746 |
| Ramachandran plot (residues 32-40 and 48-106)ᵇ | |
| Percent residues in the most favorable regionᶜ | 79 |
| Percent residues in the other favorable regions | 21 |
| Mean r.m.s.d. to average structure (residues 32-40 and 48-106) | |
| Backbone atoms: N, Cα, C′, O (Å) | 0.80 ± 0.11 |
| Heavy atoms (Å) | 1.13 ± 0.11 |
ᵇFor the average structure.
ᶜAs defined by PROCHECK (50).
Long-range NOEs between the helices and the sheet established their relative orientation; the handedness is that which is commonly observed (27). The helices pack against one another on one side of the sheet (Fig. 2b) to form a hydrophobic core (Fig. 3a). Many long-range side chain-to-side chain NOEs were observed between residues in the core (Fig. 3b).
Figure 3. (a) Wire frame model of YciH viewed from the α1-β3 side showing core hydrophobic residues in blue and surface-exposed hydrophobic residues in red (generated with INSIGHTII). (b and c) Methionine 102 εCH3 1H-13C planes from the 4D CC-NOESY. (b) 1H-13C plane, indirectly detected 1H. (c) Upfield portion of 1H-13C plane with direct detection of 1H, showing resolved crosspeaks to methyl groups of residues V34, V54, L56 and L64.
The structured portion of YciH contains an excess of basic residues. Several of the acidic residues (D55, D57, D58, E60 and D80) form a distinct cluster at one end of the structure, further increasing the basicity of the structured portion (Fig. 4). The residues in the acidic cluster are somewhat conserved among YciH orthologs (Fig. 1). The flexible eight-residue loop between β1 and β2 protrudes from the end of the structure opposite the acidic cluster. This loop contains three basic residues (R43, K44 and K46); K46 and residues in a nearby turn (residues 86-89) are conserved in all three kingdoms (Fig. 1). Most hydrophobic side chains in the structured portion of YciH are buried, with the notable exceptions of V48, L50 and V82 on the exposed surface of the sheet and L92 on the second helix (Fig. 3a). The locations of residues conserved among one, two or three kingdoms (Fig. 1) have been mapped onto the backbone structure of YciH (Fig. 5).
Figure 4. Surface features of YciH. (a) (top) Electrostatic potential surface for the top (helix side, right) and bottom (sheet side, left) of the open-faced β-sandwich structure. Positive potentials are indicated in blue, negative in red. (b) (bottom) Exposed hydrophobic side chains (yellow) from the same perspectives. Residues labeled in red correspond to the surface-exposed residues depicted in Figure 3a. The figure was generated with GRASP (42).
Figure 5. Backbone structure of YciH mapped according to residues conserved in bacteria, archaea and eukarya (red); bacteria and eukarya only (pink); bacteria and archaea only (blue); (assuming structural similarity to YciH) archaea and eukarya only (green). The figure was generated with INSIGHTII.
DISCUSSION
Classification of the YciH fold
The structural classification databases CATH (28), Dali/FSSP (29) and SCOP (30) were utilized for comparative analysis of the YciH structure. The fold of YciH clearly resembles many others with the open-faced (or 2-layer) β-sandwich form (31), a fold in the α+β class. A Dali search returned 10 unique domains (including eIF-1) with Z scores >3.0, indicating significant structural similarity. Thirty-two additional structures had scores between 2.0 and 3.0; in a Dali search scores >2.0 are considered evidence of structural similarity. Most have the same or a quite similar topology, or ordering of secondary structure elements within the sequence, as YciH. The 10 best hits are listed in Table 2 and illustrated in Figure 6. Three of these had characteristics which, in spite of the similarity according to Dali, qualitatively differentiated them from YciH. One of these is a domain (1bdf-A) that is discontiguous in the sequence, another (1aye) appears to have virtually no helices, while the third (1b24-A) contains an extra helix that is clearly part of the globular structure of the domain. Although their similarity according to Dali should not be ignored because of these differences, these three structures are not considered in the following qualitative discussion.
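As a small illustration of how the Z-score thresholds and the qualitative exclusions combine, the sketch below bins the pdb codes and Z scores that appear in Table 2. The dictionary, the function name `classify` and the category labels are illustrative assumptions, not part of the Dali output.

```python
# pdb code -> Dali Z score, as listed in Table 2.
HITS = {
    "1ab4": 4.5, "1xxa": 4.2, "1eay-C": 3.9, "2if1": 3.6, "1tcr-A": 3.4,
    "1bdf-A": 3.3, "1b24-A": 3.1, "1aye": 3.1, "1aw0": 3.1,
}
# Hits judged qualitatively different from YciH in the text.
QUALITATIVELY_DIFFERENT = {"1bdf-A", "1aye", "1b24-A"}


def classify(z_score):
    """Bin a Dali Z score using the thresholds quoted in the text."""
    if z_score > 3.0:
        return "significant"
    if z_score > 2.0:
        return "similar"
    return "not similar"


discussed = [
    code
    for code, z in HITS.items()
    if classify(z) == "significant" and code not in QUALITATIVELY_DIFFERENT
]
print(discussed)
```

The remaining codes are the structures carried forward into the qualitative comparison, including human eIF-1 (2if1).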
Figure 6. MOLSCRIPT (41) renditions of the best hits in a Dali structural database search with the YciH structure. The YciH-like domains are colored according to secondary structure for each protein; other portions are gray. Structures are drawn so that congruent secondary structures are oriented identically. Structures are labeled by pdb codes as indicated in Table 2. (a) Structures which are qualitatively similar to YciH (excluding eIF-1). (b) Structures which are qualitatively different from YciH.
Table 2. Protein structures similar to YciH, according to Dali
| Name and description | pdb code | Dali Z score | Topology | Reference |
| 59 kDa DNA gyrase A fragment, E.coli | 1ab4 | 4.5 | bbabba | 24 |
| C-terminus (Arg binding) domain of arginine repressor, E.coli | 1xxa | 4.2 | bbabba | 43 |
| CheY binding domain of CheA, E.coli | 1eay-C | 3.9 | babbab | 44 |
| eIF-1/SUI1, human | 2if1 | 3.6 | bbabbab | 11 |
| α,β T cell receptor, Mus musculus | 1tcr-A | 3.4 | babbab * | 45 |
| RNA polymerase α subunit, N-terminal domain, E.coli | 1bdf-A | 3.3 | bba...bba | 46 |
| Archaeal intron-encoded endonuclease, Desulfurococcus mobilis | 1b24-A | 3.1 | bbabba | 47 |
| Procarboxypeptidase a2 (pcpa2), human | 1aye | 3.1 | bb..bb.. | 48 |
| Menkes copper-transporting ATPase, Cu-binding domain, human | 1aw0 | 3.1 | babbab | 49 |
In the SCOP classification of protein structures, the structures found in the Dali search are members of one of two fold categories (similar to topology in CATH): the ferredoxin-like fold (23 superfamilies) and the DCoH-like fold (two superfamilies). In the CATH classification, all the similar structures fall into the α-β plait topology classification (32 homologous superfamilies), which encompasses both of the SCOP folds. The two proteins in SCOP with DCoH-like topology, the arginine binding domain of the arginine repressor, and dimerization cofactor of hepatocyte nuclear factor-1 (DCoH), have the same core topology as YciH: ββαββα. A domain of DNA gyrase that was located by Dali also has this topology but does not appear in CATH or SCOP. Structures with ferredoxin-like topology are slightly permuted: βαββαβ. Both folds contain the same placement of secondary structure elements: two helices running antiparallel to each other packed on the same side of a four-stranded antiparallel β-sheet. Furthermore, they have the same organization of the strands in the core of the sheet: β2, β3, and β4 in the DCoH topology correspond to β1, β2 and β3 in the ferredoxin topology. In other words, removal of the N-terminal and C-terminal β-strands, respectively, from the DCoH and ferredoxin topologies yields exactly the same structure. YciH has a fifth β-strand as well, which is parallel to β1, and thus its topology is ββαββα(β). In YciH (Fig. 2b), if β1 were removed and β5 moved sideways slightly to adjoin (antiparallel to) β2, the resulting topology would be βαββαβ, the ferredoxin-like fold.
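The relationship between the two topologies can be checked at the level of sequence order with a short sketch. It uses the letter notation of Table 2 (`a` for a helix, `b` for a strand); the helper names are illustrative, and the strings capture only the order of elements along the chain, not the position of each strand within the sheet.

```python
DCOH = "bbabba"        # DCoH-like fold
FERREDOXIN = "babbab"  # ferredoxin-like fold
YCIH = "bbabbab"       # YciH, including the fifth strand


def drop_n_terminal_strand(topology):
    """Remove the first element, which must be a strand."""
    assert topology[0] == "b"
    return topology[1:]


def drop_c_terminal_strand(topology):
    """Remove the last element, which must be a strand."""
    assert topology[-1] == "b"
    return topology[:-1]


# Trimming one terminal strand from each fold leaves the same five-element core.
shared_core = drop_n_terminal_strand(DCOH)
print(shared_core, shared_core == drop_c_terminal_strand(FERREDOXIN))

# YciH reduces to either fold depending on which terminal strand is removed.
print(drop_n_terminal_strand(YCIH) == FERREDOXIN)
print(drop_c_terminal_strand(YCIH) == DCOH)
```

The second comparison corresponds to the thought experiment in the text, with the caveat that β5 would also have to shift within the sheet to pair with β2.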
Both YciH [ββαββα(β)] and all βαββαβ structures contain two interlocking, sequential split β-α-β units. The split β-α-β motif occurs frequently in β-sandwich proteins, a large group of proteins with diverse functions (32). While functional diversity is apparent in the Dali hits for the YciH structure presented herein, these proteins comprise a tight structural subset of the (structurally) diverse set of open-faced β-sandwich proteins. As long as the chain reverses direction at the end of each strand or helix and there are no structure-spanning loops, nine permutations or arrangements of the secondary structures are possible within the sequence of a two-helix, four-stranded sheet β-sandwich protein with antiparallel helices and strands: ααββββ, αββαββ, αββββα, βααβββ, βαββαβ, ββααββ, ββαββα, βββααβ and ββββαα. Different arrangements of the β-strands within a sheet are possible as well (32), further increasing the complexity. Many of these permutations are found among compact open-faced β-sandwich domains, although the βαββαβ type (ferredoxin-like fold) predominates. Some of these arrangements are more similar to one another than others. For example, the βαββαβ and ββαββα topologies are quite similar to each other (vide supra) but much less so to ααββββ and ββααββ, which are themselves similar to each other due to their αα hairpin motif.
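The count of nine can be reproduced by enumeration. The sketch below rests on one reading of the stated constraints: if the chain reverses direction after every element, the element at position k runs in a direction set by the parity of k, so the two helices are antiparallel exactly when their positions differ by an odd number. The function name is illustrative, and the letters follow Table 2 (`a` for a helix, `b` for a strand).

```python
from itertools import combinations


def antiparallel_helix_arrangements(n_elements=6):
    """List orderings of two helices and (n_elements - 2) strands with antiparallel helices."""
    arrangements = []
    for i, j in combinations(range(n_elements), 2):
        # Opposite parity of positions means opposite chain direction.
        if (j - i) % 2 == 1:
            order = ["b"] * n_elements
            order[i] = order[j] = "a"
            arrangements.append("".join(order))
    return sorted(arrangements)


arrangements = antiparallel_helix_arrangements()
print(len(arrangements))
print(arrangements)
```

Of the 15 ways to place two helices among six elements, the six with both helices at positions of the same parity are rejected. The same condition leaves two strands running in each direction, which is what a fully antiparallel four-stranded sheet requires.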
Only βαββαβ and ββαββα topologies were found in the Dali search on YciH, however, and all had the same sheet structure, except for the first or last strand, as described above. Thus, based on the Dali search and the CATH and SCOP systems of structure classification, YciH can be assigned to the α-β plait topology (CATH) and the DCoH-like fold (SCOP).
Comparative analysis of the YciH structure
The identification of protein structures similar to YciH invites speculation about possible corresponding functional similarities. There are many examples of proteins which share fold and function characteristics but have no detectable sequence similarity (33). For SUI1 and YciH, likely homologs with subtle sequence similarity have been detected using new iterative database searching techniques, and it has been shown that a region homologous to these proteins is also present in several larger proteins implicated in translation, probably as a distinct domain (26). The systematic determination of protein folds in a structural genomics project will help clarify sequence-fold relationships in such cases, and may point toward homologs that are not readily detectable by other techniques. On the other hand, it is conceivable that proteins can evolve independently to the same fold (perhaps because the fold accommodates a functional feature they happen to share). This seems to be the case for the functionally diverse α-β plait topology, which has been described as a superfold (34) because it contains many homologous superfamilies with a wide range of functions.
Many proteins that are structurally similar to YciH have roles similar to the putative function of YciH. These proteins serve as starting points for understanding this function. For example, because of its possible role in translation regulation and/or initiation, it has been proposed that YciH might have an RNA-binding capacity (26). The structure with the best score in the Dali search (Z = 4.5), a domain of DNA gyrase believed to contact DNA, also has the ββαββα arrangement. This domain may contain a disordered loop between the first two strands as in YciH since this region displayed uniform electron density in the DNA gyrase crystal structure (35). This part of the sequence contains basic residues, again like YciH.
A class of proteins involved in pre-mRNA processing contains structures called RNP consensus-type RNA-binding domains which have the same fold and βαββαβ topology (36-38) seen in many of the structures similar to YciH. Two RNP domains were found in the search of the Dali structural database (pdb codes 1ha1 and 2sxl); both had Z scores between 2 and 3. These domains differ somewhat from YciH in that their helices are nearly perpendicular to each other. It was proposed that the four-stranded β-sheet and some adjacent flexible loops in these domains comprised an RNA-binding 'platform' (36) and chemical shift mapping experiments confirmed that the exposed surface of the β-sheet in one of these domains was the RNA contact surface (39).
Other oligonucleotide-binding proteins found to be structurally similar to YciH in the Dali search (with pdb codes) include ribosomal protein s6 (1ris), biotin operon repressor protein (1bia), spliceosomal protein fragment (1urn-A), a DNA polymerase fragment (1xwl), thermostable B DNA polymerase (1tgo-A), transcription factor mbp1 fragment (1bm8), DNA-binding domain from bovine papillomavirus (2bop-A), elongation factor G (1dar) and ebna-1 nuclear protein fragment (1b3t-A). All these similarities are compatible with the hypothesis that YciH is a nucleic acid-binding protein.
Comparison to human eIF-1 solution structure
The solution structure of human eIF-1 was recently reported (11). The structure is quite similar to the YciH structure (Fig. 7). A least squares superposition of the secondary structure elements of the two structures had a backbone r.m.s.d. of 2.6 Å. Conserved residues in the β-sheet occupy identical positions relative to neighboring strands. As in YciH, the first 30 residues of eIF-1 are unstructured. The most significant structural differences correlate with insertion/deletion points in the sequences (Fig. 1). An insertion of five residues forms a loop between β3 and β4 in eIF-1 where there is a β-hairpin in YciH. The large loop in YciH between β1 and β2 is also present in eIF-1, but is slightly shorter and does not give negative amide 1H-15N NOEs, suggesting that the human protein is less flexible in this region. Moreover, in YciH but not eIF-1, diminished NOE intensity was observed for the amides of residues G87, D88, K89 and R90 (Supplementary Material). The paper describing the structure of eIF-1 suggests a potential binding site that includes these residues, based on translational errors resulting from mutations at these sites in yeast and also on their conservation throughout the superfamily (11). Like YciH, the eIF-1 structure has a similar cluster of acidic residues on the surface, and three highly exposed basic residues (R38, R41 and K42) on the large loop between β1 and β2. Hydrophobic residues on the solvent-exposed surface of the β-sheet (L44, I50 and V82) are also conserved in eIF-1 (Fig. 5). It is reasonable to consider that these conserved features may also be important for interactions between YciH/eIF-1/SUI1 and other molecules.
Figure 7. Overlaid ribbon diagrams of YciH (green) and eIF-1 (blue). Multi-residue inserts are shown in red. Equivalent secondary structure elements were selected for the overlay. The figure was generated with INSIGHTII.
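As an illustration of how a least-squares superposition and backbone r.m.s.d. of the kind quoted above can be computed, the following Python sketch applies the Kabsch algorithm to two sets of equivalent atoms. It is not the procedure or software used in this work. The function name `kabsch_rmsd`, the use of NumPy, and the assumption that the two coordinate arrays are already paired atom by atom (for example, backbone atoms of matched secondary structure elements) are introduced here for illustration only.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """Return the minimum r.m.s.d. between two paired (N, 3) coordinate sets.

    Hypothetical helper: row i of coords_a is assumed to be the atom
    equivalent to row i of coords_b. Selecting the equivalent atoms
    (e.g. from a sequence alignment) is outside the scope of this sketch.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("both inputs must have the same shape (N, 3)")

    # Remove the translational component by centering both sets.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # 3x3 covariance matrix between the two centered sets.
    covariance = a_centered.T @ b_centered

    # The singular value decomposition yields the optimal rotation.
    u, _, vt = np.linalg.svd(covariance)

    # Guard against an improper rotation (a reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

    # Rotate set A onto set B and measure the residual deviation.
    a_rotated = a_centered @ rotation.T
    squared_deviation = np.sum((a_rotated - b_centered) ** 2)
    return float(np.sqrt(squared_deviation / a.shape[0]))
```

Given two arrays of paired backbone coordinates in Å, `kabsch_rmsd(a, b)` returns the r.m.s.d. after optimal rigid-body superposition. Restricting the input to secondary structure elements, as in the comparison above, excludes flexible loops and termini from the fit.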
CONCLUSION
In this study, a phylogenetic approach has been used to select a structural genomics target. This approach was designed to select targets from completely sequenced genomes that, firstly, represent evolutionarily conserved functions, and, secondly, are likely to contain new sequence-fold information. The experimental determination of the structure of one such target protein, YciH, has identified a novel sequence-fold relationship. YciH appears to be a member of a new homologous superfamily within a known fold. YciH is an [alpha]+[beta] protein organized into an open-faced [beta]-sandwich with a [beta][beta][alpha][beta][beta][alpha]([beta]) topology (the [alpha]-[beta] plait topology in CATH, or the DCoH-like fold in SCOP). Unstructured regions were identified in YciH (the N-terminal 31 residues and loop portion); in light of a potential role in translation initiation, these merit further investigation for interactions with proteins and RNA. Because of their sequence conservation, it was expected that YciH and its orthologs in COG0023 would share a common fold; the structure of human eIF-1/SUI1 confirms this. The functional diversity within the [alpha]-[beta] plait topology precludes an unequivocal determination of YciH function by structural comparisons alone, other than to confirm that an RNA-binding function is not unprecedented for such a fold. Comparisons to known structures provide a starting point for functional characterization of YciH and its eukaryotic and archaeal orthologs.
SUPPLEMENTARY MATERIAL
Additional NMR data are available in the Supplementary Material at NAR Online (56KB PDF File).
ACKNOWLEDGEMENTS
E.V.K. is grateful to L. Aravind for his help in the analysis of the COGs for target selection. We thank G. Buchko for helpful comments on the manuscript.
REFERENCES
*To whom correspondence should be addressed. Tel: +1 509 372 2168; Fax: +1 509 376 2303; Email: ma_kennedy{at}pnl.gov
Copyright© Oxford University Press, 1999.