Published online 21 December 2004
Nucleic Acids Research, Vol. 32 No. 22 © Oxford University Press 2004; all rights reserved
Looking into DNA recognition: zinc finger binding specificity
Laboratoire de Biochimie Théorique, CNRS UPR 9080, Institut de Biologie Physico-Chimique, 13 rue Pierre et Marie Curie, Paris 75005, France
* To whom correspondence should be addressed. Tel: +33 1 5841 5016; Fax: +33 1 5841 5026; Email: rlavery{at}ibpc.fr
Received August 17, 2004; Revised and Accepted November 27, 2004
| ABSTRACT |
|---|
We present a quantitative, theoretical analysis of the recognition mechanisms used by two zinc finger proteins: Zif268, which selectively binds to GC-rich sequences, and a Zif268 mutant, which binds to a TATA box site. This analysis is based on a recently developed method (ADAPT), which allows binding specificity to be analyzed via the calculation of complexation energies for all possible DNA target sequences. The results obtained with the zinc finger proteins show that, although both mainly select their targets using direct, pairwise protein–DNA interactions, they also use sequence-dependent DNA deformation to enhance their selectivity. A new extension of our methodology enables us to determine the quantitative contribution of these two components and also to measure the contributions of individual residues to overall specificity. The results show that indirect recognition is particularly important in the case of the TATA box binding mutant, accounting for 30% of the total selectivity. The residue-by-residue analysis of the protein–DNA interaction energy indicates that the existence of amino acid–base contacts does not necessarily imply sequence selectivity, and that side chains without contacts can nevertheless contribute to defining the protein's target sequence.
| INTRODUCTION |
|---|
Some DNA-binding proteins can target DNA sequences with remarkable specificity. In the case of the human genome, this implies locating one or a handful of sites among billions of base pairs. Understanding the mechanism of such recognition is important not only for identifying binding sites within genomic sequences, but also for understanding how point mutations within a protein, whether they occur naturally or are deliberately induced, will influence binding specificity. Such knowledge will be key to future protein design efforts.
In practice, despite early optimism, protein–DNA recognition has turned out to be much more complicated than expected. The first models of recognition assumed that a specific DNA target was selected as the result of a finite number of hydrogen bonds or steric interactions between amino acid side chains and bases (1). Although the analysis (2,3) and also the modeling (4–7) of complexes in these terms have been undertaken, it has been recognized that, at least in a number of cases, there are simply not enough direct interactions to explain specificity. This led to the idea that the sequence dependence of DNA deformation can also play a role in recognition. This so-called indirect component of recognition is naturally more difficult to quantify since its importance cannot be judged by simply looking at the conformation of a complex. Although indirect effects have been confirmed experimentally in cases of extreme deformation, such as that induced by the TATA box binding protein (8), molecular modeling has been the principal source of data on sequence-dependent DNA mechanics (9–11) and dynamics (12). Modeling studies are, however, generally too expensive to be used for a comprehensive study of selectivity, which, for typical protein targets, implies comparing millions of potential binding sequences.
To overcome this problem, we have recently developed an approach termed ADAPT. This method is based on an all-atom representation of a protein–DNA complex, derived from crystallographic data, coupled with the calculation of the principal sequence-dependent terms of the complexation energy, using a molecular mechanics force field (13–15). By using a multi-copy approach for the DNA base pairs, we are able to study the complexation energy for all possible DNA sequences of a given complex ($4^N$ for a target sequence of $N$ base pairs). ADAPT has already been used to successfully reproduce experimental consensus sequences for a variety of DNA-binding proteins, as well as the ordering of free energy changes for a number of distinct binding sequences (13). Since ADAPT calculates both the protein–DNA interaction energy (Eint) and the DNA deformation energy with respect to an unbound reference DNA (Edef), it is also possible to use the results to analyze recognition in terms of its direct component (involving hydrogen bonding and steric compatibility at the protein–DNA interface, contained within Eint) and its indirect component (sequence-dependent DNA deformation, contained within Edef).
In the present study, we apply ADAPT to the analysis of the binding specificity of zinc finger proteins. This important family of proteins (the largest family within eukaryotic organisms) is generally taken to be a textbook example of the classical type of recognition, based on an additive set of protein–DNA interactions. This idea is supported by the behavior of hybrid proteins constructed in so-called finger-swapping experiments (16,17). Here, we hope to delve deeper into the binding mechanisms by breaking recognition down into its components, looking at how individual amino acids contribute to specificity, whether indirect effects contribute and, if so, what the relative importance of the different structural components of DNA deformation is. For this analysis, we have also developed indices based on information theory, which enable us to make a fully quantitative analysis of terms that are generally only discussed qualitatively.
The zinc finger proteins that we have chosen belong to the Cys2-His2 family. Zif268, a well-studied member of this family, containing three zinc fingers, binds to a GC-rich target sequence. Each finger consists of a 24-residue ββα motif, which chelates a zinc ion. Recognition appears to involve four amino acids within the α-helix of this motif (see Figure 1A). The apparent modularity of recognition within a single finger led to the hope that mutations might again be used in a simple manner to modify specificity. Such studies have actually revealed a more complex picture of Zif268 recognition (18). As with other proteins, amino acid–base interactions do not turn out to be additive, and a single amino acid may be involved in the recognition of several DNA bases. Moreover, mutating the key residues within a finger may also lead to repositioning of the finger with respect to DNA and thus to a completely different recognition pattern. This effect is particularly clear in the work carried out by the group of Pabo (19), who designed mutants of Zif268 that showed selectivity for a TATA box sequence, GCTATAAAA, very different from the wild-type, GC-rich consensus sequence, GCGTGGGCG (see Figure 1B).
Here, we use our methodological tools to analyze the recognition of both Zif268 and the TATA binding mutant (hereafter termed TATAZif), with the specific aim of testing the two assumptions generally thought to apply to zinc finger binding, namely: (i) specificity relies entirely on amino acid–base interactions (with no so-called indirect contributions from sequence-dependent DNA deformation) and (ii) individual contributions of amino acid–base interactions can be identified visually from the structures of the corresponding protein–DNA complexes.
The results suggest that although zinc finger recognition is indeed dominated by direct interactions, DNA deformation can play a significant role. It is also found that structural data alone need not be a reliable guide to the role of individual amino acids in determining sequence specificity. Although these conclusions do not help to simplify an already complex story, they illustrate a new way to analyze recognition mechanisms, which can also be applied to understanding the impact of point mutations within chosen families of protein–DNA complexes.
| MATERIALS AND METHODS |
|---|
Before discussing the information theory indices used in this article to make a quantitative analysis of protein–DNA recognition, we present an overview of the ADAPT methodology. For complete details, the reader is referred to our earlier publication (13).
Energy calculations
All energy calculations are performed with a modified version of the JUMNA molecular mechanics program. JUMNA uses an all-atom representation of DNA combined with a mixture of helicoidal and internal variables for positioning individual nucleotides, and for modifying their conformations in terms of torsion angles and certain valence angles (20). Energy calculations were carried out with the AMBER PARM98 force field (21). Solvent and counter ion electrostatic damping effects, respectively, are included by using a sigmoidal distance-dependent dielectric function (with a slope of 0.356 and a plateau value of 80) and reduced net charges on the DNA phosphate groups (−0.5e) (20). For the present calculations, we have used a soft-core version of the standard Lennard–Jones potential for all protein–DNA interactions (13). This modification introduces a cubic potential at short range to smoothly damp repulsion and to limit its maximum value (chosen here to be 10 kcal mol⁻¹). The energy minimizations discussed below are carried out in the JUMNA coordinate system (mixing helicoidal and internal variables), using a quasi-Newton algorithm with analytically calculated energy derivatives and a convergence criterion of 10⁻⁴ on the predicted energy change at the last cycle of minimization.
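The electrostatic damping described above can be illustrated with a minimal sketch. The text gives only the slope (0.356) and the plateau (80) of the dielectric function; the functional form used below, the value of 1 at zero distance (`eps0`) and the conversion constant 332 kcal Å mol⁻¹ e⁻² are assumptions made for illustration, and the function names are hypothetical rather than taken from JUMNA.

```python
import math

def sigmoidal_dielectric(r, slope=0.356, plateau=80.0, eps0=1.0):
    """Distance-dependent dielectric rising from eps0 at r = 0 to `plateau`.

    Assumed form (not specified in the text):
    eps(r) = D - (D - eps0)/2 * ((s*r)**2 + 2*s*r + 2) * exp(-s*r)
    """
    sr = slope * r
    damping = (sr * sr + 2.0 * sr + 2.0) * math.exp(-sr)
    return plateau - 0.5 * (plateau - eps0) * damping

def damped_coulomb(q1, q2, r):
    """Coulomb energy (kcal/mol) for charges in units of e and r in angstroms.

    The factor 332.0 is the usual approximate conversion constant (assumption).
    """
    return 332.0 * q1 * q2 / (sigmoidal_dielectric(r) * r)

# Repulsion between two phosphates carrying the reduced charge of -0.5e
# at an illustrative separation of 7 angstroms.
phosphate_repulsion = damped_coulomb(-0.5, -0.5, 7.0)
```

With this form the screening is weak at contact distances and approaches the bulk value at long range, which is the qualitative behavior the sigmoidal function is meant to capture.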
Multi-copy bases
In order to be able to study all possible DNA sequences bound to a given protein, we have introduced the concept of nucleotides carrying multi-copy bases, termed lexides. Such nucleotides contain all four standard bases, adenine (A), cytosine (C), guanine (G) and thymine (T), superposed in space and linked to the same sugar C1' atom. All bases thus share a common C1'-(N1/N9) glycosidic torsion angle. In terms of energy, the individual bases within a given lexide do not interact with one another. The contributions of each of these bases to the total energy of the system are controlled by variable coefficients (normalized to 1.0 for each lexide). By setting all weights to 0.25, it is possible to carry out calculations on a DNA with an average sequence. This option has been used for creating unbound reference DNA structures, and for removing unnecessary sequence-dependent deformations from the crystallographic complex conformations (see below). Energy calculations involving lexides can be stored in a matrix, which groups together terms involving the interactions of each possible base at each nucleotide position with the DNA backbones and any bound protein (along the diagonal), and also the interactions between such bases (off-diagonal entries). This matrix can then be used to rapidly calculate the energy for any chosen sequence by simply summing the matrix elements corresponding to the appropriate bases. Note that since we are only interested in conventional Watson–Crick base pairs in the present study, base-paired lexides can be grouped together into multi-copy lexide pairs, thus reducing the size of the energy matrix and the number of additions to be carried out.
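The matrix summation described above can be sketched as follows. This is a toy illustration under stated assumptions, not the ADAPT implementation: the data layout (`diag`, `offdiag`), the function names and all energy values are hypothetical, and only two positions are used so that the full set of sequences can be listed.

```python
from itertools import product

BASE_PAIRS = ("AT", "TA", "CG", "GC")  # the four Watson-Crick lexide-pair states

def sequence_energy(diag, offdiag, sequence):
    """Sum the matrix elements that correspond to one chosen sequence.

    diag[i][b]            : base pair b at position i with backbones and protein
    offdiag[(i, j)][(b, c)]: base pair b at i interacting with base pair c at j
    """
    total = sum(diag[i][b] for i, b in enumerate(sequence))
    for (i, j), block in offdiag.items():
        total += block[(sequence[i], sequence[j])]
    return total

def all_sequence_energies(diag, offdiag):
    """Energies of all 4**N sequences for an N-position matrix."""
    n = len(diag)
    return {seq: sequence_energy(diag, offdiag, seq)
            for seq in product(BASE_PAIRS, repeat=n)}

# Made-up energies (kcal/mol) for a two-position example.
toy_diag = [
    {"AT": -1.0, "TA": -0.5, "CG": 0.0, "GC": -2.0},
    {"AT": 0.0, "TA": -1.5, "CG": -3.0, "GC": -0.5},
]
toy_offdiag = {(0, 1): {(b, c): 0.0 for b in BASE_PAIRS for c in BASE_PAIRS}}
toy_offdiag[(0, 1)][("GC", "CG")] = -0.25

energies = all_sequence_energies(toy_diag, toy_offdiag)
best = min(energies, key=energies.get)
```

Because each sequence costs only a handful of additions once the matrix is built, an exhaustive scan remains feasible even when the number of sequences reaches tens of millions.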
Constructing a protein–DNA complex and a free DNA oligomer
The starting point for our calculations is the crystallographic structure of an appropriate protein–DNA complex taken from the Protein Data Bank (PDB) (22). Before binding energy calculations are carried out, these data are processed in several ways. First, any unpaired, terminal DNA bases are removed. Next, the DNA fragment is rebuilt using lexides from the JUMNA library. Since crystallographic DNA fragments are often quite short, six flanking nucleotide pairs (in a canonical B-DNA conformation) are added to each end of the fragment to avoid possible end-effects. We also add missing hydrogen atoms to the protein residues.
We next carry out energy minimization under the following restraints: (i) the conformation of the protein backbone is fixed, (ii) an average sequence is imposed along the whole DNA fragment (all lexide coefficients set to 0.25) and (iii) the geometry of the protein–DNA interface is maintained by requiring that all DNA atoms belonging to successive nucleotide pairs contacted by the protein remain in the same positions relative to one another. In this context, contacts refer to all protein–DNA atom pairs separated by <4 Å. Relative atom positions are maintained using flat-well quadratic constraints allowing a 0.1 Å freedom of movement, and neither hydrogen atoms nor phosphate anionic oxygens are constrained. This procedure allows us to generate a protein–DNA conformation that respects the specific interactions characterizing the complex, but removes any fine sequence-dependent structural details related to the specific DNA sequence used for crystallization.
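A flat-well quadratic constraint of the kind mentioned above can be sketched as a penalty that vanishes inside a tolerance window and grows quadratically outside it. The sketch below assumes that the 0.1 Å freedom is a half-width applied to an interatomic distance; the force constant and the function name are hypothetical, since the text does not specify them.

```python
def flat_well_penalty(distance, reference, half_width=0.1, force_constant=100.0):
    """Penalty (arbitrary energy units) for a distance restrained to `reference`.

    Zero while |distance - reference| <= half_width (the flat well), then
    quadratic in the excess deviation. `force_constant` is an illustrative value.
    """
    excess = abs(distance - reference) - half_width
    if excess <= 0.0:
        return 0.0
    return force_constant * excess ** 2

inside = flat_well_penalty(3.05, 3.00)   # within the 0.1 angstrom window
outside = flat_well_penalty(3.30, 3.00)  # 0.2 angstrom beyond the window edge
```

The flat region lets the interface relax slightly during minimization without being pulled back towards the exact crystallographic geometry.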
Having obtained a protein–DNA complex conformation as described above, we also generate a conformation of the corresponding free DNA oligomer. This oligomer has the same length as that in the complex, but is again constructed using an average base sequence, where each base pair is present with a coefficient of 0.25. This oligomer is energy minimized without any structural constraints and is taken to represent an optimal sequence-averaged B-DNA conformation.
Complexation energy and its decomposition
Once conformations have been obtained for the protein–DNA complex and the free DNA oligomer, we can obtain the complexation energy for a given base sequence. This requires inserting the appropriate sequence in both conformations and calculating the interaction energy between the protein and the DNA (Eint) and the deformation energy necessary to pass from the free DNA oligomer to its complexed conformation (Edef).
As explained above, using multi-copy lexide pairs enables us to make these calculations very fast by using a pre-constructed energy matrix. This matrix groups together all components of the energy that depend on the given lexide coefficients. Thus, the protein–DNA interaction energy, when it concerns DNA bases, will be grouped base by base along the diagonal of the matrix. Similarly, the internal energy of the complexed DNA (minus the energy of the free oligomer), when it concerns the base interactions, will be placed in off-diagonal matrix elements, while base–backbone interactions will again lie along the diagonal. Obtaining the complexation energy of a given base sequence then only requires adding up the elements of the energy matrix that correspond to the presence of the desired base pair at each position along the DNA oligomer. Note that the internal energy of the protein is not required since its conformation, and thus its energy, is taken to be unchanged by modifications to the DNA sequence. Note also that the protein–DNA interaction energy can, if desired, be broken down into contributions from individual amino acids. This option is used in the present study to investigate how given residues contribute to sequence specificity.
Sequence selectivity in terms of information content
Once we have calculated the energies for all possible sequences, we consider that binding sites are those which fall below an energy cutoff with respect to the complexation energy of the optimal sequence. On the basis of our earlier studies (13), this cutoff is taken to be 5 kcal mol⁻¹. The group of M sequences selected in this way (where M = 677 for Zif268 and M = 776 for the mutant protein) can be described by a weight matrix, where each element $f_{ij}$ is the frequency with which a base $j$ (A, C, G or T) occurs at position $i$. In terms of information theory, this group of sequences exhibits an entropy $R_i$ at each position $i$:
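$$R_i = -\sum_{j \in \{A,C,G,T\}} f_{ij} \log_4 f_{ij}$$

where the logarithm is taken to base 4, so that the entropy lies between 0 and 1. This choice is used here; however, to return to traditional units of bits it is enough to multiply the results by two.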
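Note that entropy is the complement of information content, since $R_i = 1$ means that protein binding provides no sequence information at position $i$, while $R_i = 0$ means complete sequence information has been determined. The change of information content at position $i$ in passing between two states with entropies $R_i^{(1)}$ and $R_i^{(2)}$ is consequently:

$$\Delta I_i = R_i^{(1)} - R_i^{(2)}$$

In particular, taking a state without sequence information ($R_i = 1$) as the reference gives the information content at position $i$:

$$I_i = 1 - R_i$$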
To characterize the detailed base preferences at any position along the binding site, we can construct a weight matrix or represent the same information graphically with sequence logos (26), where the height of each base is determined by its frequency of occurrence in the set of M binding sequences and the sum of the heights of the letters corresponds to the total information content.
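Note that the local measure of information content can be generalized to the full binding site by a simple summation, if we assume, as we will do here, that there is no correlation between neighboring positions along the site:

$$I = \sum_{i=1}^{N} I_i$$

where $I_i$ is the information content at position $i$ and $N$ is the number of positions in the binding site.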
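The chain from energies to information content can be sketched as follows. This is a minimal illustration, assuming a base-4 logarithm for the entropy (consistent with the factor of two needed to return to bits) and taking the information content at a position as one minus its entropy; the function names and the toy energies are hypothetical.

```python
import math

BASES = "ACGT"

def select_binders(energies, cutoff=5.0):
    """Keep sequences within `cutoff` (kcal/mol) of the optimal sequence."""
    best = min(energies.values())
    return [seq for seq, e in energies.items() if e <= best + cutoff]

def weight_matrix(sequences):
    """Frequency of each base at each position of the selected sequences."""
    m = len(sequences)
    return [{b: sum(1 for s in sequences if s[i] == b) / m for b in BASES}
            for i in range(len(sequences[0]))]

def entropy(column):
    """Base-4 entropy of one weight-matrix column (0 = fixed base, 1 = random)."""
    return -sum(f * math.log(f, 4) for f in column.values() if f > 0.0)

def information_content(sequences):
    """Per-position information (in bp) and its sum over the site."""
    per_position = [1.0 - entropy(col) for col in weight_matrix(sequences)]
    return per_position, sum(per_position)

# Made-up complexation energies (kcal/mol) for a two-position site.
toy_energies = {"GC": 0.0, "GA": 2.0, "TC": 9.0, "AA": 12.0}
binders = select_binders(toy_energies)
per_position, total = information_content(binders)
```

In this toy case the first position is fully determined (G only), while the second is split evenly between two bases, so the site carries one and a half base pairs of information.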
| RESULTS AND DISCUSSION |
|---|
Validating ADAPT for Zif268
The wild-type Zif268 complex was built using the high-resolution crystallographic structure 1AAY (27). These coordinates were treated as described in Materials and Methods. Six nucleotide pair fragments were added to either end of the 9 bp DNA fragment within the complex, leading to a total of 21 bp with Zif268 bound at positions 7–16. Using the possibilities provided by the multi-copy base pairs in ADAPT, the 4 bp at the 5'- and 3'-termini were fixed at an average base composition (25% contributions from each standard base pair: AT, TA, CG and GC). The protein binding energies were then calculated for the full combinatorial set of sequences corresponding to the remaining 13 bp (which consequently covered a total of 4¹³, i.e. 6.7 × 10⁷, sequences). The bound proteins were found to induce selectivity at only 10 of these base pairs (positions 7–17), and further analysis was limited to these positions. In the following discussion, we will adopt a local numbering for the Zif268 DNA target site, where positions 1–9 constitute the canonically recognized bases and position 10 is the supplementary 3'-base pair discussed above (see Figure 1A). Note that the interaction patterns of Zif268 and its mutant with DNA displayed in Figure 1A and B were determined using HBplus (28). (For consistency, five amino acid residues are indicated for each finger in Figure 1A and B, although not all of these residues necessarily play a role in selective binding.) The geometrical criteria used by this program result in some minor differences with other descriptions of these complexes (19), for example, the absence of a hydrogen bond involving Glu 3 of finger 3 of the wild-type Zif268, but these differences have no impact on the conclusions drawn in the present study.
An experimental consensus weight matrix for wild-type Zif268 has been determined using a PCR-mediated random site selection protocol (29) and is shown in the lower part of Figure 1C. This consensus confirms that Zif268 selectivity indeed extends over 10 bp and not 9 bp as was initially reported (30). The strongest selectivity is for guanine at positions 1, 3, 6, 7 and 9, and for cytosine at position 2. In terms of information content (for definitions see Materials and Methods), the overall weight matrix represents a selectivity for 7.9 bp.
We can compare these experimental results with the Zif268 binding preference obtained with ADAPT. Using a 5 kcal mol⁻¹ cutoff with respect to the complexation energy of the optimal sequence, we obtain a group of 677 strongly binding sequences. The highest energy in this group is termed $E_{tot}^{max}$. The number of sequences in the group corresponds to an information content of 7.8 bp, which is very close to the experimental result. The corresponding theoretical sequence logo, shown in the upper part of Figure 1C, confirms that this good overall agreement also applies to each position within the binding site.
Since the consensus view of binding selectivity does not take into account the relative energy of binding sequences, we have also checked the ordering of calculated complexation energies for sequences where experimental results were available (31). The results, which concern nine variants of a GCGxxxGCGT target sequence, are shown in Figure 2. Despite the simplifications involved in the ADAPT approach, the two sets of data show a good correlation (with a correlation coefficient of 0.84). However, it should be noted that, assuming a linear correlation between these two sets of values, the theoretical consensus in Figure 1C shows that T is more weakly selected in the first position of the second finger than it is experimentally (implying that the sequences of the type Txx should be shifted to the left in Figure 2). Note also that we are using a single conformation of the complex for all these comparisons, which is unlikely to be optimal for each sequence of the set. We have tested this problem by energy minimizing the complex and free DNA oligomer conformations for the sequences given in Figure 2, and recalculating the complexation energies. This leads to a slightly better alignment, notably for the sequences Txx, and to an overall correlation coefficient of 0.86.
Analyzing recognition mechanisms
As a first step to understanding how wild-type Zif268 recognizes its target sequence, we look at the contributions coming from the direct (Eint) and indirect (Edef) terms of the calculated complexation energies. This is done by finding the maximum value of Eint for the set of binding sequences selected on the basis of the cutoff on the total complexation energy, $E_{tot}^{max}$. This value is termed $E_{int}^{max}$. We then look at the full set of sequences, ordered in terms of Eint, and count how many fall in the energy interval up to $E_{int}^{max}$. If Zif268 recognition relied entirely on direct (protein–DNA) interactions, one would expect the ordering of sequences in terms of Eint or Etot to be the same, since Edef would not show any selectivity. Therefore, we would expect to count the same number of sequences up to the respective energy cutoffs. In fact, there are more than four times as many sequences in the Eint list, 3201 compared with 677 selected on the basis of Etot. This implies that Eint has an information content of 6.9 bp, which is 89% of the total selectivity (7.8 bp) found for Zif268. Although direct interactions therefore dominate Zif268 recognition, a non-negligible part must be attributed to indirect contributions.
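The counting procedure described above can be sketched on a toy data set. The function name, the dictionary layout and the energy values are hypothetical; the sketch only illustrates how a set selected on the total energy is widened when sequences are re-selected on the interaction energy alone.

```python
def direct_selection(e_tot, e_int, cutoff=5.0):
    """Compare selection on the total energy with selection on Eint alone.

    e_tot, e_int: dicts mapping each sequence to its total and interaction
    energy (kcal/mol). Returns (binders, int_selected), where `binders` lie
    within `cutoff` of the best total energy and `int_selected` are all
    sequences whose Eint does not exceed the highest Eint found among binders.
    """
    best = min(e_tot.values())
    binders = [s for s, e in e_tot.items() if e <= best + cutoff]
    e_int_max = max(e_int[s] for s in binders)
    int_selected = [s for s, e in e_int.items() if e <= e_int_max]
    return binders, int_selected

# Made-up interaction and deformation energies for four two-base sequences.
toy_e_int = {"GC": -20.0, "GA": -19.0, "GT": -10.0, "AC": -19.5}
toy_e_def = {"GC": 2.0, "GA": 3.0, "GT": 3.0, "AC": 9.0}
toy_e_tot = {s: toy_e_int[s] + toy_e_def[s] for s in toy_e_int}

binders, int_selected = direct_selection(toy_e_tot, toy_e_int)
```

Here the sequence AC interacts well with the protein but is costly to deform, so it passes the Eint criterion while being excluded by the total energy. The larger the gap between the two counts, the larger the indirect contribution.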
If we now build a weight matrix on the basis of the sequences selected using Eint (Figure 3A), we can see that the only significant loss of selectivity involves positions 2 and 8. This change can be traced to the fact that steric repulsions contained within Eint effectively eliminate thymine at either of these positions, but Eint alone cannot select between the remaining bases (A, C and G). This selection requires a contribution from local DNA deformation. In passing, it is remarked that the steric repulsions at positions 2 and 8 involve Glu 3 of fingers 1 and 3, and it is known that mutating these residues to Ala indeed decreases the binding selectivity of Zif268 (18).
Although only a relatively small amount of selectivity remains to be accounted for, this must be due to indirect effects involving induced DNA deformation. This can be understood by noting that adding Edef to Eint for the 3201 sequences selected above and then reordering the energies will effectively bring us back to the set of 677 sequences below the 5 kcal mol⁻¹ cutoff on Etot. This can be quantified by calculating the increase in information content in passing between these two situations. As expected, the overall result is equivalent to 0.9 bp, that is to say 11% of the total selectivity. Once again this selectivity can be represented as a sequence logo (Figure 3A), which confirms that it is indeed DNA deformation that is responsible for the selection of cytosine at positions 2 and 8. We can therefore conclude that, although the DNA deformation induced by binding wild-type Zif268 is small, it still plays a significant role in selecting two positions within the target sequence of this protein.
Decomposing recognition by residues
In the same way that the complexation energy can be split into interaction and deformation components, we can also subdivide the interaction energy into individual residue contributions. Eint is simply the sum of the interaction energies, Eint(i), between DNA and each amino acid i composing the protein. Therefore, we can again calculate an information content associated with a chosen residue i by finding the number of sequences falling below $E_{int}^{max}(i)$, where this is the highest residue interaction energy for the set of sequences initially selected from Etot.
Only the amino acids at positions −1, 2, 3 and 6 of the recognition helix of each finger of Zif268 potentially contact DNA (see Figure 1A), so we will limit our detailed analysis to these residues, the remaining amino acids being considered as a single group. (It was confirmed that no single amino acid within this group makes a significant contribution to selectivity.)
For each of the four key amino acids, in each of the three fingers, the sequences selected on the basis of $E_{int}^{max}(i)$ give access to both the total information content for the binding site and to the information position by position along this site. The results are presented in Figure 3B. In the majority of cases, these results verify that individual amino acids lead to base selectivity at the positions they directly contact. Note that the residues Asp 2 of each finger show significant selectivity for the base pair adjacent to the 3' end of the base triplet nominally recognized by each finger (0.3, 1.0 and 0.6 bp for fingers 1–3, compared with 0.4, 1.0 and 0.5 bp for the entire protein). This confirms that each finger actually influences selectivity for a total of 4 bp (32).
The results for Arg 6 of finger 1 are particularly interesting since, as shown clearly in Figure 3B, this residue is unique in contributing significantly to the selectivity of two successive base pairs (in positions 6 and 7). This result confirms that even the direct component of recognition cannot be decomposed into binary amino acid–base contributions. The complex nature of these interactions can also be seen by noting that the sum of the information content related to the key amino acids at a given position within the target site is often significantly lower than the total information coming from the full protein–DNA interaction term. For example, at position 10, only Asp 2 of finger 1 shows a significant selectivity (equivalent to 0.23 bp), whereas total protein interaction has an information content equivalent to 0.44 bp, the remaining selectivity being the cumulative result of small contributions from many amino acids.
Finally, residue Glu 3 of fingers 1 and 3 has been described as being involved in selective contacts with DNA (33). However, as noted in the previous section, our analysis implies that it actually influences selectivity mainly by sterically hindering the presence of thymine in positions 2 and 8. Selecting cytosine at these positions then requires contributions from DNA deformation. Hence, some residue contacts lead to only partial selectivity, which needs to be complemented by other effects.
TATAZif recognition mechanism
We now turn to the mutant of Zif268 created by the group of Pabo to recognize a TATA box binding sequence, GCTATAAAA (19). We term this mutant TATAZif. We have constructed the corresponding complex using the PDB structure 1G2D. (We also carried out calculations for a related structure, 1G2F, but the results were very similar and will not be discussed here.) Overall, the TATAZif complex is very similar to that of Zif268: the Cα RMSD difference for the two proteins is only 1 Å. The bound DNA remains close to a canonical B-DNA conformation, with the exception of an enlarged major groove at the positions where the zinc fingers are bound. The DNA fragments corresponding to the canonical binding site (positions 1–9, reconstructed with an identical sequence, GCGTGGGCG) differ by an RMSD of only 1.8 Å.
Binding specificity for the TATA box was obtained by mutating Zif268 at positions −1, 1, 2, 3, 4, 5 and 6 of each finger (19). This leads to a protein–DNA interaction pattern that is very different from that of wild-type Zif268 (see Figure 1A and B). Residues that did not contact DNA in the case of the wild-type protein now enter into play, and several residues contact more than a single DNA base, leading to significant overlap between the binding sites of successive fingers.
No experimental consensus is available for TATAZif, but the consensus calculated using ADAPT (Figure 1D) confirms a strong dominance of thymine and adenine, in contrast to the GC-rich selectivity of wild-type Zif268. Again, using a 5 kcal mol⁻¹ cutoff on Etot, we obtain a total information content of 7.9 bp for TATAZif binding. We can analyze the origins of this recognition in the same way as for the wild-type protein, starting with the calculation of the information contained in the direct interaction term Eint. The results in Figure 4A show that TATAZif has a much weaker consensus based on direct interactions, with an information content of only 5.5 bp, corresponding to 70% of total information. Positions 2, 3, 5, 6 and 7 are all less well-defined than in the full consensus. This is particularly striking for positions 3, 5 and 6 (T/C, T/C and G/A, respectively), where the information content per base pair falls from 1 bp in the full consensus to 0.48, 0.37 and 0.52 bp, respectively, on the basis of the protein–DNA interaction energy.
We can again decompose the direct protein–DNA interactions into contributions from individual residues. Five amino acids at positions −1, 1, 2, 3 and 6 have been analyzed in detail, and it was again confirmed that the remaining residues made no significant contribution to selectivity. The results are shown in Figure 4B. Note that, in contrast to wild-type Zif268, residue 1 now plays a role in recognition for fingers 2 and 3, in line with the analysis of binding made on the basis of the crystallographic structure (19). In contrast, other nominally important residues show little selectivity, although they do make contacts with the DNA bases (and may be important for the stability of the complex). This is notably the case for Thr 2 in finger 2 and Thr 3 in finger 3. It is equally found that some residues that do not make contacts can nevertheless influence binding selectivity, as in the case of Thr 1 in finger 3. Finally, Figure 4B shows that many residues in TATAZif influence the selectivity of 2 or even 3 bp. This is strikingly different from the pattern of selectivity seen for the wild-type protein in Figure 3B.
We now turn to indirect contributions to recognition in the case of TATAZif. These effects amount to an information content of 2.4 bp, that is, 30% of the total selectivity (7.9 bp). This is surprising, given that zinc finger proteins induce only very small deformations upon binding. Figure 5A shows that deformation mainly contributes to selectivity at positions 3, 4 and 8, with smaller contributions at positions 1, 6 and 9. In the same way that we have analyzed the selectivity due to direct protein–DNA interactions in terms of individual residue contributions, we can also analyze the indirect DNA deformation contribution in terms of the various structural changes composing the overall deformation. Here, we distinguish base–base and base–backbone deformations (note that since we use a fixed average DNA conformation for all calculations, there are no intra-backbone terms to contribute to selectivity). Base–base deformations are further divided into pairing and stacking deformations. We can again calculate an information content associated with a given type of deformation by ordering all sequences in terms of the associated deformation energy, determining the highest energy within the set of sequences initially selected from Etot using the 5 kcal mol⁻¹ cutoff, and counting the number of sequences now lying below this energy. The results, shown in Figure 5B, indicate that base–base interactions are more important than base–backbone interactions and represent 57% of the total indirect selectivity. Within the base–base term, base pairing deformations play a significantly larger role in recognition than base stacking.
Although it is naturally more difficult to trace the structural origins of indirect selectivity, it can be noted that DNA bound to TATAZif shows significant base pair buckling (5°–12°) at positions 3–5 and significant base opening (7°–11°) at positions 4–8. The base pairs in the fragment of DNA bound to wild-type Zif268 are less distorted, with buckling between 1° and 4° and opening between 3° and 4° for the same groups of positions.
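The counting procedure described above for a single deformation term can be sketched as follows. This is a minimal illustration, not the ADAPT implementation: the function names and the toy energies are hypothetical, the 5 kcal mol⁻¹ cutoff is assumed to be measured from the lowest total energy, and the conversion of a sequence count into an information content in bp (log base 4 of the reduction in the number of sequences) is an assumption made here for illustration only.

```python
import math

def count_below_component_threshold(e_tot, e_comp, cutoff=5.0):
    """Count the sequences selected by one energy component.

    e_tot  : total energy of each sequence (kcal/mol)
    e_comp : energy of the component of interest for the same sequences
    cutoff : window above the lowest total energy (assumed reference point)
    """
    e_min = min(e_tot)
    # Step 1: sequences initially selected on the total energy.
    selected = [i for i, e in enumerate(e_tot) if e <= e_min + cutoff]
    # Step 2: highest component energy found within that selected set.
    threshold = max(e_comp[i] for i in selected)
    # Step 3: count all sequences at or below this component energy
    # ("<=" so that the selected sequences themselves are included).
    return sum(1 for e in e_comp if e <= threshold)

def information_bp(n_total, n_selected):
    """Assumed conversion of a sequence count into base pairs of information."""
    return math.log(n_total / n_selected, 4)

# Toy data for eight hypothetical sequences (values are invented).
e_tot = [-20.0, -18.0, -16.5, -12.0, -9.0, -8.0, -3.0, 0.0]
e_def = [1.0, 2.5, 2.0, 1.5, 4.0, 3.0, 6.0, 5.0]

n = count_below_component_threshold(e_tot, e_def)
info = information_bp(len(e_tot), n)
print(n, round(info, 3))
```

In this toy case three sequences fall within the cutoff on total energy, the highest deformation energy among them is 2.5, and four of the eight sequences lie at or below that value. A component that admits many sequences beyond those selected on total energy therefore carries little information, whereas one that admits few is strongly selective.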
| CONCLUSIONS |
|---|
Zinc finger proteins are of major interest as one of the largest families of eukaryotic DNA-binding proteins. This family of proteins is assumed to represent a canonical example of DNA recognition, where selectivity is achieved using a set of direct interactions between amino acid side chains and DNA bases. It is also generally assumed that these interactions can be identified visually from an experimental structure of the protein–DNA complex.
We have investigated the validity of these assumptions by carrying out a theoretical analysis of recognition in the case of the three-finger wild-type Zif268 protein and of a multiple mutant generated to recognize a TATA box binding site. The results, which are based on calculating protein–DNA interaction energies and DNA deformation energies for a full combinatorial set of sequences, enable both direct and indirect contributions to be quantified in terms of their information content.
Our findings agree with deductions made from mutation studies and from an analysis of the crystallographic data. Notably, recognition is dominated by protein–DNA contacts involving a limited number of key amino acids, although certain residues can influence selectivity at more than one site in the target DNA sequence, and the selectivity related to each finger actually overlaps, involving 4, rather than the canonical 3, base pairs. However, two results do not support the assumptions cited above. First, direct interactions alone cannot account for the observed binding specificity. Although zinc finger proteins do not cause major DNA deformation upon binding, these deformations still account for >10% of recognition in the case of wild-type Zif268 and for 30% of recognition in the case of the mutant protein. Second, a residue-by-residue analysis shows that the presence of direct amino acid–base contacts does not necessarily imply significant contributions to selectivity. Indeed, our study brings to light both examples where contacts exist without an impact on selectivity and examples where selectivity is found without direct molecular contacts.
These results are not very encouraging from the point of view of protein engineering. Although zinc finger proteins have a modular design, which allows individual fingers to be swapped between proteins, the complexity of the recognition process carried out by individual fingers makes structure-based re-engineering a daunting task. Part of this complexity was already visible from the modified pattern of contacts seen in the crystallographic complex of the mutated Zif268 protein (19). Our analysis suggests that this complexity is still greater.
From a methodological point of view, indices based on information theory have been developed and used in the present study to extend our ADAPT approach, so that overall and residue-by-residue contributions to protein–DNA selectivity can be described in a quantitative manner. This will hopefully be useful in studying, and in attempting to modify, other protein–DNA complexes.
| ACKNOWLEDGEMENTS |
|---|
The authors wish to thank the CNRS and the inter-organism Bioinformatics Program for funding this research.
| REFERENCES |
|---|
1. Seeman,N.C., Rosenberg,J.M. and Rich,A. (1976) Sequence-specific recognition of double helical nucleic acids by proteins. Proc. Natl Acad. Sci. USA, 73, 804–808.
2. Suzuki,M., Brenner,S.E., Gerstein,M. and Yagi,N. (1995) DNA recognition code of transcription factors. Protein Eng., 8, 319–328.
3. Mandel-Gutfreund,Y., Schueler,O. and Margalit,H. (1995) Comprehensive analysis of hydrogen bonds in regulatory protein–DNA complexes: in search of common principles. J. Mol. Biol., 253, 370–382.
4. Kono,H. and Sarai,A. (1999) Structure-based prediction of DNA target sites by regulatory proteins. Proteins, 35, 114–131.
5. Selvaraj,S., Kono,H. and Sarai,A. (2002) Specificity of protein–DNA recognition revealed by structure-based potentials: symmetric/asymmetric and cognate/non-cognate binding. J. Mol. Biol., 322, 907–915.
6. Yoshida,T., Nishimura,T., Aida,M., Pichierri,F., Gromiha,M.M. and Sarai,A. (2001) Evaluation of free energy landscape for base-amino acid interactions using ab initio force field and extensive sampling. Biopolymers, 61, 84–95.
7. Mandel-Gutfreund,Y. and Margalit,H. (1998) Quantitative parameters for amino acid–base interaction: implications for prediction of protein–DNA binding sites. Nucleic Acids Res., 26, 2306–2312.
8. Parvin,J.D., McCormick,R.J., Sharp,P.A. and Fisher,D.E. (1995) Pre-bending of a promoter sequence enhances affinity for the TATA-binding factor. Nature, 373, 724–727.
9. Sarai,A., Mazur,J., Nussinov,R. and Jernigan,R.L. (1989) Sequence dependence of DNA conformational flexibility. Biochemistry, 28, 7842–7849.
10. Olson,W.K., Gorin,A.A., Lu,X.J., Hock,L.M. and Zhurkin,V.B. (1998) DNA sequence-dependent deformability deduced from protein–DNA crystal complexes. Proc. Natl Acad. Sci. USA, 95, 11163–11168.
11. Steffen,N.R., Murphy,S.D., Tolleri,L., Hatfield,G.W. and Lathrop,R.H. (2002) DNA sequence and structure: direct and indirect recognition in protein–DNA binding. Bioinformatics, 18 (Suppl. 1), S22–S30.
12. Thayer,K.M. and Beveridge,D.L. (2002) Hidden Markov models from molecular dynamics simulations on DNA. Proc. Natl Acad. Sci. USA, 99, 8642–8647.
13. Paillard,G. and Lavery,R. (2004) Analyzing protein–DNA recognition mechanisms. Structure (Cambridge), 12, 113–122.
14. Lafontaine,I. and Lavery,R. (2000) ADAPT: a molecular mechanics approach for studying the structural properties of long DNA sequences. Biopolymers, 56, 292–310.
15. Lafontaine,I. and Lavery,R. (2000) Optimization of nucleic acid sequences. Biophys. J., 79, 680–685.
16. Beerli,R.R. and Barbas,C.F.,III (2002) Engineering polydactyl zinc-finger transcription factors. Nat. Biotechnol., 20, 135–141.
17. Beerli,R.R., Segal,D.J., Dreier,B. and Barbas,C.F.,III (1998) Toward controlling gene expression at will: specific regulation of the erbB-2/HER-2 promoter by using polydactyl zinc finger proteins constructed from modular building blocks. Proc. Natl Acad. Sci. USA, 95, 14628–14633.
18. Miller,J.C. and Pabo,C.O. (2001) Rearrangement of side-chains in a Zif268 mutant highlights the complexities of zinc finger-DNA recognition. J. Mol. Biol., 313, 309–315.
19. Wolfe,S.A., Grant,R.A., Elrod-Erickson,M. and Pabo,C.O. (2001) Beyond the recognition code: structures of two Cys2His2 zinc finger/TATA box complexes. Structure (Cambridge), 9, 717–723.
20. Lavery,R., Zakrzewska,K. and Sklenar,H. (1995) JUMNA (junction minimization of nucleic acids). Comput. Phys. Commun., 91, 135–158.
21. Cheatham,T.E.,III, Cieplak,P. and Kollman,P.A. (1999) A modified version of the Cornell et al. force field with improved sugar pucker phases and helical repeat. J. Biomol. Struct. Dyn., 16, 845–862.
22. Berman,H.M., Battistuz,T., Bhat,T.N., Bluhm,W.F., Bourne,P.E., Burkhardt,K., Feng,Z., Gilliland,G.L., Iype,L., Jain,S. et al. (2002) The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr., 58, 899–907.
23. Shannon,C.E. (1948) A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.
24. Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431.
25. Stephens,R.M. and Schneider,T.D. (1992) Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol., 228, 1124–1136.
26. Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18, 6097–6100.
27. Elrod-Erickson,M., Rould,M.A., Nekludova,L. and Pabo,C.O. (1996) Zif268 protein–DNA complex refined at 1.6 Å: a model system for understanding zinc finger–DNA interactions. Structure, 4, 1171–1180.
28. McDonald,I.K. and Thornton,J.M. (1994) Satisfying hydrogen bonding potential in proteins. J. Mol. Biol., 238, 777–793.
29. Swirnoff,A.H. and Milbrandt,J. (1995) DNA-binding specificity of NGFI-A and related zinc finger transcription factors. Mol. Cell. Biol., 15, 2275–2287.
30. Christy,B. and Nathans,D. (1989) DNA binding site of the growth factor-inducible protein Zif268. Proc. Natl Acad. Sci. USA, 86, 8737–8741.
31. Bulyk,M.L., Huang,X., Choo,Y. and Church,G.M. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA, 98, 7158–7163.
32. Isalan,M., Choo,Y. and Klug,A. (1997) Synergy between adjacent zinc fingers in sequence-specific DNA recognition. Proc. Natl Acad. Sci. USA, 94, 5617–5621.
33. Elrod-Erickson,M. and Pabo,C.O. (1999) Binding studies with mutants of Zif268. Contribution of individual side chains to binding affinity and specificity in the Zif268 zinc finger–DNA complex. J. Biol. Chem., 274, 19281–19285.







are selected from the full set of 413 sequences to define the direct recognition logo. Choosing the subset of these sequences for which
defines the overall consensus logo, the corresponding, additional recognition logo being shown on the far left of the figure. (B) Consensus sequence logos corresponding to the total binding energy, the proteinDNA interaction energy and the contributions from key amino acid residues. Gray shading indicates amino acidbase contacts (see 

are selected from the full set of 413 sequences to define the indirect recognition logo. Choosing the subset of these sequences for which 
