Nucleic Acids Research Advance Access published online on May 13, 2008

Nucleic Acids Research, doi:10.1093/nar/gkn260

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Structural Biology

DNA conformations and their sequence preferences

Daniel Svozil1, Jan Kalina2, Marek Omelka2 and Bohdan Schneider1,*

1Institute of Organic Chemistry and Biochemistry, Academy of Sciences of the Czech Republic and Center for Biomolecules and Complex Molecular Systems, Flemingovo nám. 2, CZ-166 10 Prague and 2Jaroslav Hájek Center for Theoretical and Applied Statistics, Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Sokolovská 83, CZ-186 75 Prague, Czech Republic

*To whom correspondence should be addressed. Tel: +420 728 303 566; Fax: +420 296 443 610; Email: bohdan{at}rcsb.rutgers.edu; bohdan.schneider{at}uochb.cas.cz

Received March 5, 2008. Revised April 17, 2008. Accepted April 18, 2008.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 METHODS
 RESULTS AND DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
The geometry of the phosphodiester backbone was analyzed for 7739 dinucleotides from 447 selected crystal structures of naked and complexed DNA. Ten torsion angles of a near-dinucleotide unit have been studied by combining Fourier averaging and clustering. Besides the known variants of the A-, B- and Z-DNA forms, we have also identified combined A + B backbone-deformed conformers, e.g. with α/γ switches, and a few conformers with a syn orientation of bases occurring e.g. in G-quadruplex structures. A plethora of A- and B-like conformers show a close relationship between the A- and B-form double helices. A comparison of the populations of the conformers occurring in naked and complexed DNA has revealed a significant broadening of the DNA conformational space in the complexes, but the conformers still remain within the limits defined by the A- and B- forms. Possible sequence preferences, important for sequence-dependent recognition, have been assessed for the main A and B conformers by means of statistical goodness-of-fit tests. The structural properties of the backbone in quadruplexes, junctions and histone-core particles are discussed in further detail.


    INTRODUCTION
The apparent simplicity of double-helical DNA, the icon of molecular biology, is deceiving. While the architecture of its antiparallel strands remains the same, subtle conformational variations suffice to guarantee its recognition by other molecules. The structural variations are critical especially for reliable recognition between DNA and proteins, which is the conditio sine qua non in the essential processes of replication, transcription and DNA chromatin compaction. Local conformational changes induced by interactions with other molecules can either leave the DNA structure unaltered (i.e. in the form of a straight double helix) or introduce bends and kinks within the double helix, as in sequence-dependent CAP/DNA complexes (1) or in DNA coiled around histone-core proteins (2).

The necessity of understanding DNA variability has become more urgent as the sequence-specific protein/DNA recognition required e.g. by transcription factors seems less likely to follow simple and generally applicable rules analogous to the rules governing DNA self-recognition by the complementarity of the Watson–Crick (W–C) paired bases (3). The idea of the general ‘code of recognition’ between amino acids and nucleotides (4) has not been confirmed despite extensive efforts. The lack of simple rules for general protein/DNA recognition has been explained by the existence of too many structural degrees of freedom at the protein/DNA interface (5), and so far only limited rules of recognition have been formulated within narrower groups of transcription factors with certain binding motifs, such as zinc fingers or helix–turn–helix (6–9).

Ultimately, the variability and plasticity of the local DNA structure, and thus its ability to recognize other molecules and be recognized by them, can be attributed to the properties of the bases and to their sequence-dependent arrangement. Base-pair and base-step morphology (10,11) has been widely analyzed to describe sequence-dependent deformability as observed in the crystal structures of DNA complexes with sequence-specific proteins (12,13) as well as in noncomplexed DNA (14). By combining descriptors of base morphology with constraints imposed by a simple model of the phosphodiester backbone, slide and shift have been suggested to describe the key sequence properties of dinucleotide steps (15). However, the backbone does not act as a passive link merely holding the bases at their positions, but its inherent flexibility contributes to, and limits, the base placement so that the local DNA structure results from the interplay between optimal base positions and preferred conformations of the sugar phosphate backbone. An analysis of the conformational space populated by the DNA backbone and the correlation between its conformation and the DNA sequence are therefore important for fully understanding DNA recognition.

The structural alphabet of the DNA double-helical A-, B- and Z-forms has been described in detail earlier (16,17). Nevertheless, DNA is also known to adopt other forms, such as triple (18) and quadruple helices (19), junction (cruciform) structures (20) and parallel helices (21). However unusual some of these DNA forms may be, their architecture is, in full analogy to the double helical DNA, almost completely based on the self-assembly of two or more DNA strands and does not form complicated folds analogous to RNA. The availability of some of these unusual DNA structures in well-refined crystal structures, as well as the growing number and quality of more conventional DNA crystal structures, presents a challenge to undertake an analysis of the DNA conformational space in much greater detail than was possible a few years ago (22).

This work presents a comprehensive analysis of the conformational space of the DNA backbone using a near-dinucleotide building block as a model. Dinucleotide conformations have been clustered as a local structural property without any consideration of the classification of the overall DNA architecture as, for instance, a B- or A-type double helix. The study has been performed on almost 8000 dinucleotide units from 447 crystal structures of DNA, alone or in complexes with other molecules, and has made use of a slightly modified procedure developed earlier for an analysis of RNA conformations (23). To assess the nature of the broadening of the DNA conformational space upon interacting with other molecules (mainly with proteins), the classified conformers of naked DNA have been compared to those of complexed DNA molecules. In addition, the structural properties of the backbone have been discussed in selected unusual structures like quadruplexes and histone-core particles. Because the possible sequence preferences of various conformers are important for sequence-dependent recognition, they have been assessed by means of rigorous nonparametric statistical testing within the group of naked B- and A-form double helices.


    METHODS
The selection of structures used for the analysis was limited to nucleic acid (NA) crystal structures containing only DNA (thus excluding hybrids with RNA) present in the Nucleic Acid Database (24) on 19 July 2005. Four hundred and fifteen structures with crystallographic resolution better than or equal to 1.9 Å were selected; this resolution had previously been identified as limiting the ambiguity of the statistical treatment of torsional distributions (22). Four hexa- and one hepta-nucleotide sequences, CGATCG, CGTACG, CGCGCG, CGCGAA, GCGCGCG, were overrepresented in the original compilation of structures and 26 structures containing them were therefore removed from the analyzed set. Since the initial set of structures limited to 1.9 Å resolution does not contain some a priori important classes, it was further augmented by 58 structures with unusual topologies, such as G-quadruplexes, i-motif, four-way and three-way junctions, as well as by important types of protein/DNA complexes, such as DNA complexed with TATA-box binding proteins or histone-core proteins so that 447 structures were selected for the analysis. All modified and incomplete nucleotides were removed so that the complete data set (further referred to as Dataset 1) contains 7739 dinucleotides (Table 1); PDB codes of the analyzed structures are listed in Table 2, and all dinucleotides are fully characterized in Supplementary Table T1.


Table 1. The datasets of the dinucleotides used in this study

 

Table 2. The PDB codes of the structures used in the analysis

 
The DNA conformational space was investigated at the level of a dinucleotide unit with its 5'-end phosphate group removed; it was described by eight backbone torsion angles between γ and δ + 1, plus two χ angles characterizing the glycosidic bond (Figure 1). This unit is identical to the 5'-end dinucleotide that naturally lacks the initial phosphate group, and similar to the ‘suite’ defined by Murray et al. (25), which covers the angles between δ and δ + 1. Two torsion angles at both 5'- and 3'-ends of the complete dinucleotide unit (α and β at the 5'-end and ε + 1 and ζ + 1 at the 3'-end) were not explicitly analyzed but they were monitored during the clustering process.


Figure 1. The analyzed unit is defined by ten torsion angles: eight from γ to δ + 1 along the backbone plus the torsions χ and χ + 1 at the glycosidic bonds. B0 and B1 symbolize the bases.

 
The presence of torsions α, β, and ε + 1, ζ + 1 in the analyzed nucleotides implies that neither 5'- nor 3'-end residues were among the 7739 analyzed dinucleotides of Dataset 1. All structural data are ‘crystallographically’ independent, i.e. all dinucleotide coordinates are gathered from the asymmetric units of the respective structures. However, information about symmetry-related strands was used when appropriate, e.g. when considering base-pairing patterns in double helical or quadruplex structures.

The analysis started by dividing the multidimensional torsional space into three-dimensional (3D) projections (maps). Based on a priori knowledge of the DNA (22) and RNA conformational spaces (23), the following nine 3D maps were selected: (ζ, α + 1, γ + 1), (ζ, α + 1, δ), (α + 1, γ + 1, δ + 1), (γ, δ, ζ), (γ, ζ, γ + 1), (ζ, γ + 1, δ + 1), (γ, ζ, β + 1), (α + 1, δ + 1, χ + 1) and (ζ, α + 1, χ). For each map, the Fourier transform of the torsional values (τ1, τ2, τ3) was calculated as described earlier (23). The only methodological difference from the previous work on RNA conformations was the treatment of regions with a high density of data points, which correspond to the prevalent double helical A, BI and BII conformations. The extreme density of the points in these areas strongly influenced the results of the Fourier transform in the whole 3D map. To eliminate this mathematical artifact, two 2D scattergrams were constructed for each map, the highest density region was manually selected in each scattergram, and the intersection of the selected points was randomly reduced by 95%. For example, the number of Fourier-transformed points in the (ζ, α + 1, γ + 1) map was reduced from 7739 to 1375. It should be emphasized that the reduced number of the data points was used only to improve the reliability of the Fourier averaging, whereas the full data set of 7739 points (dinucleotides) was utilized in all the subsequent analyses.

The distribution of the points (τ1, τ2, τ3)_i was then transformed into pseudo-electron densities using standard crystallographic procedures implemented in the program XtalView (26) with the same set of parameters as had been used in ref. (23). The sites with a high density of points were transformed into peaks in the maps. The peaks thus correspond to areas with a high concentration of torsion angles and represent conformationally favored (and therefore interesting) regions. Eight to twelve peaks were identified within each of the nine analyzed maps, and each peak was assigned a symbolic name in the form of a letter or a letter and a number. The peak names are mere labels and carry no particular meaning. Each peak was then approximated by a sphere with a radius, typically between 15° and 40°, estimated from the density contour. All data points lying inside the peak's sphere were labeled by the peak's name. If a data point lay within two or more peaks, it was assigned to the most intense one. The data points located outside the radii of all the peaks were not assigned to any peak. As nine maps were analyzed, each dinucleotide was characterized by a nine-letter string referred to as an imprint. The individual imprints were used to cluster dinucleotides with similar conformations by simple alphabetical sorting. The clusters were identified as sets of data points with (nearly) identical imprints. The sorting was based primarily on the imprints from the first four maps, (ζ, α + 1, γ + 1), (ζ, α + 1, δ), (α + 1, γ + 1, δ + 1) and (γ, δ, ζ), whereas the other maps were used mainly to verify the quality of the sorting process. The final data matrix consisting of 7739 data points (dinucleotides) sorted into clusters is presented as Supplementary Table T1.
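The labeling and sorting steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as implemented by the authors: the peak centers, radii and intensities would come from the Fourier-averaged maps, the dictionary keys (`label`, `center`, `radius`, `intensity`) and function names are hypothetical, distances are taken as periodic Euclidean distances in degrees, and clusters are formed from exactly identical imprints on the first four maps rather than "nearly" identical ones.

```python
import math
from collections import defaultdict

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def assign_peak(point, peaks):
    """Label a 3D torsion point with the most intense peak sphere containing it.

    point: (tau1, tau2, tau3) in degrees.
    peaks: list of dicts with hypothetical keys 'label', 'center', 'radius', 'intensity'.
    Returns '-' when the point lies outside every sphere (unassigned).
    """
    best = None
    for peak in peaks:
        dist = math.sqrt(sum(angular_diff(p, c) ** 2
                             for p, c in zip(point, peak["center"])))
        inside = dist <= peak["radius"]
        if inside and (best is None or peak["intensity"] > best["intensity"]):
            best = peak
    return best["label"] if best else "-"

def imprint(torsions, maps):
    """Build the imprint: one peak label per map, in map order.

    torsions: dict mapping a torsion name to its value in degrees.
    maps: list of (names, peaks), where names is a 3-tuple of torsion names.
    """
    return tuple(assign_peak(tuple(torsions[n] for n in names), peaks)
                 for names, peaks in maps)

def cluster_by_imprint(dinucleotides, maps, n_primary=4):
    """Group dinucleotide ids whose imprints agree on the first n_primary maps."""
    clusters = defaultdict(list)
    for ident, torsions in dinucleotides.items():
        clusters[imprint(torsions, maps)[:n_primary]].append(ident)
    return dict(clusters)
```

Imprints are kept as tuples of labels rather than a single string because a peak name may consist of a letter and a number.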

Within each cluster, the means and the standard deviations were calculated for all 14 dinucleotide torsions using the formulas for the circular mean and circular standard deviation (27). The outliers that inflated the standard deviations were removed so that the final standard deviations of the torsion angles between γ and δ + 1 are typically below 10°.
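Circular statistics are needed because torsion angles are periodic: the ordinary average of 350° and 10° is 180°, whereas the meaningful average is 0°. A minimal sketch follows; it assumes the common definition of the circular standard deviation, sqrt(-2 ln R) with R the mean resultant length, which may differ in detail from the formula in ref. (27). The function names are illustrative.

```python
import math

def _mean_resultant(angles_deg):
    """Return the mean sine and mean cosine of the angles."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return s, c

def circular_mean_deg(angles_deg):
    """Circular mean in degrees, reported in the range [0, 360)."""
    s, c = _mean_resultant(angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def circular_std_deg(angles_deg):
    """Circular standard deviation in degrees, sqrt(-2 ln R) (assumed definition)."""
    s, c = _mean_resultant(angles_deg)
    r = min(math.hypot(s, c), 1.0)  # guard against rounding slightly above 1
    if r == 0.0:
        return math.inf             # angles cancel completely; spread is undefined
    return math.degrees(math.sqrt(-2.0 * math.log(r)))
```

For tightly clustered angles this standard deviation is close to the ordinary one computed on unwrapped values.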

Dataset 1 was subdivided into two more data sets: Dataset 2 and Dataset 3 (Table 1). Dataset 2 was created by removing all structures of DNA/protein and DNA/drug complexes from Dataset 1 (Table 1) and was used to study the effect of complexation on dinucleotide conformation. To test the possible relationships between dinucleotide conformational classes and sequences in noncomplexed B- and A-form double helices, Dataset 2 was further modified by removing Z-DNA dinucleotides, quadruplexes (G-quadruplexes and i-motif structures), structure 1DC0 (BD0026) (28) and all the dinucleotides with non-W–C paired or non-paired bases, thus resulting in Dataset 3 (Table 1). The DNA dodecamer 1DC0 (28) was removed from Dataset 3 because of the uniqueness of its double helical architecture, which combines features typical of the B- and A-forms. The 1DC0 structure will be discussed in the Results and discussion section below. The aim of classifying dinucleotide steps by combining Fourier averaging with clustering analysis was to define unambiguously the conformational families with low variations of torsion angles. A consequence of such a strict requirement was the relatively large number of dinucleotides not assigned to any cluster. Therefore, to improve the statistical significance of the sequence analyses, an additional round of conformational assignment of unclassified dinucleotides in Dataset 3 was performed. Further classification was accomplished by calculating both the Euclidean and the Manhattan distance (the latter also known as the taxi-cab metric or L1 distance: the distance between two points measured along axes at right angles) between the torsion angles δ, ε, ζ, α1, β1, γ1 and δ1 (i.e. α + 1, β + 1, γ + 1 and δ + 1) of the unassigned dinucleotides and those of the conformational families. A dinucleotide was assigned to the cluster with the lowest distance provided that both the Euclidean and Manhattan distances were smaller than 35°. Approximately one-half of the originally unassigned dinucleotides were classified in this procedure, leaving roughly 1/8 of the total number of dinucleotides unclassified.
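The second-round assignment can be illustrated as follows. The sketch makes several assumptions that the text does not settle: angle differences are taken periodically, the "lowest distance" is judged by the Euclidean distance, and the family centers passed in are whatever average torsions one has for each class. The torsion key names and the function names are hypothetical.

```python
import math

# The seven torsions used for the second-round assignment (names are illustrative).
TORSIONS = ("delta", "epsilon", "zeta", "alpha1", "beta1", "gamma1", "delta1")

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def torsion_distances(dinucleotide, center):
    """Return (Euclidean, Manhattan) distances over the seven torsions, in degrees."""
    diffs = [angular_diff(dinucleotide[t], center[t]) for t in TORSIONS]
    euclidean = math.sqrt(sum(d * d for d in diffs))
    manhattan = sum(diffs)
    return euclidean, manhattan

def assign_family(dinucleotide, families, cutoff=35.0):
    """Return the nearest family name, or None if either distance reaches the cutoff.

    families: dict mapping a family name to a dict of its mean torsions.
    Nearest is judged by Euclidean distance here (an assumption).
    """
    best_name, best_dist = None, None
    for name, center in families.items():
        euclidean, manhattan = torsion_distances(dinucleotide, center)
        if euclidean < cutoff and manhattan < cutoff:
            if best_dist is None or euclidean < best_dist:
                best_name, best_dist = name, euclidean
    return best_name
```

Because the Manhattan distance is never smaller than the Euclidean one, the Manhattan condition is the stricter of the two in this formulation.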

The contingency tables of the counts of dinucleotide sequences (steps) were built for six broad conformational classes, BI, BII, AI, AII, B/A and A/B, whose detailed description can be found in the Results and discussion section. Those steps with unclassified conformations were attributed either to the RestB category if their parent structure was annotated as a B-type double helix by the NDB, or to the RestA category if they had originated from an A-type double helix. Only the assignment of these two categories, RestB and RestA, used an a priori classification of the double-helical architecture.

The sequence–conformation relationships of Dataset 3 were analyzed by means of statistical hypothesis testing. The contingency tables correspond to the product-multinomial model with fixed row margins. First, the χ² test of the homogeneity of the frequency distribution of the sequences of the individual conformational classes AI, AII, RestA, BI, BII, A/B, B/A and RestB (Table 5) was performed. This test compares the multinomial distributions between rows. Since this homogeneity was rejected (the Pearson χ² test statistic on 105 degrees of freedom is 996.8, with a P-value < 10⁻¹⁶), the sequences are not distributed homogeneously between conformational families and the sequence–conformation relationships were tested further.
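The 105 degrees of freedom follow from the table shape: eight conformational classes by 16 dinucleotide sequences gives (8 - 1) × (16 - 1) = 105. A minimal sketch of the Pearson statistic for such a table is shown below; the function name is illustrative, every row and column is assumed to have a nonzero total, and the P-value (which needs a χ² distribution function) is left out.

```python
def chi2_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table.

    table: list of rows (conformational classes), each a list of counts per sequence.
    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof
```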

The first statistical experiment, further referred to as the test of the ‘uniformity of dinucleotide representation’, is a χ² goodness-of-fit test of equality of the column margins. It compares the observed frequency of a given dinucleotide with a hypothetical frequency of 1/16 corresponding to the situation when all dinucleotides are distributed evenly. The uniformity of dinucleotide representation was measured for dinucleotides in A-like and B-like conformations from Dataset 3 including the unclassified ones (RestA was included in A-like, and RestB in B-like conformers). The test was performed for all 16 steps as well as for four pyrimidine/purine (Y/R) sequences. The actual frequencies of the palindromes (AT, GC, CG, TA) were counted twice. If the null hypothesis of the uniformity of dinucleotide representation in all sequences is rejected, the Pearson residuals provide evidence about a possible over- or underrepresentation of dinucleotide sequences with respect to the hypothetical equal frequencies. The critical values were calculated according to the rule of Bonferroni (29), which for the multiple-significance test ensures that the overall type I error is below 5%.
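As an illustration of this test, the sketch below computes the goodness-of-fit statistic against equal expected frequencies, the per-category Pearson residuals (observed minus expected, divided by the square root of the expected count), and a Bonferroni-adjusted critical value. Treating the residuals as approximately standard normal and using a two-sided quantile is an assumption made here; the double counting of palindromic steps is not reproduced, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def uniformity_test(counts):
    """Chi-square goodness-of-fit against equal frequencies.

    counts: observed counts per category (16 for dinucleotide steps).
    Returns (statistic, residuals), with residuals = (obs - exp) / sqrt(exp).
    """
    expected = sum(counts) / len(counts)
    residuals = [(obs - expected) / math.sqrt(expected) for obs in counts]
    statistic = sum(r * r for r in residuals)
    return statistic, residuals

def bonferroni_critical_z(n_tests, alpha=0.05):
    """Two-sided normal critical value at level alpha / n_tests (assumed form)."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))
```

A residual whose absolute value exceeds the critical value would then flag that sequence as over- or underrepresented.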

The second test, further referred to as the test of ‘dinucleotide homogeneity’, examined whether a particular sequence was under- or over-represented within a particular conformational class. The count of the sequence in any conformation was compared to the sum of the counts of this sequence in the remaining conformations considered. Like the previous test, the test of dinucleotide homogeneity was performed for all 16 sequences as well as four pyrimidine/purine (Y/R) sequences, this time employing standardized Pearson residuals (30), which are residuals adjusted to have an asymptotic standard normal distribution. A standardized Pearson residual exceeding the critical value indicates a significant overrepresentation in that cell as compared with the null hypothesis, whereas a negative value below the negative of the critical value indicates a significant underrepresentation. The Bonferroni correction gives a conservative critical value for these tests.
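A standardized Pearson residual divides the raw difference between observed and expected counts by an estimate of its standard error. The sketch below uses the usual textbook form for a two-way table, (n_ij - e_ij) / sqrt(e_ij (1 - p_i+) (1 - p_+j)), where p_i+ and p_+j are the row and column proportions; whether this matches ref. (30) in every detail is an assumption, and the function name is illustrative.

```python
import math

def standardized_pearson_residuals(table):
    """Standardized Pearson residual for every cell of an r x c count table.

    Assumes every row total and column total is nonzero and smaller than
    the grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append((observed - expected) / math.sqrt(variance))
        result.append(out_row)
    return result
```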

The described χ² test of dinucleotide homogeneity was supported by an additional statistical analysis. The so-called ‘odds ratio’ is a measure indicating the violation of homogeneity in the individual cells of the contingency table. The odds ratio for a particular cell was computed by reducing the contingency table to a two-by-two table composed of the cell being investigated, with the remaining rows merged together and the remaining columns merged together. The odds ratio then represents the ratio of the odds of an individual sequence occurring in an individual conformation to the odds of this sequence occurring in any other conformation. An odds ratio greater or smaller than one corresponds to over- or under-representation, respectively. Complete tables of odds ratios are shown in Supplementary Tables S3–S6.
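The collapse to a two-by-two table can be written out directly. In this sketch, rows are conformational classes and columns are sequences; the function name is illustrative, and no continuity correction is applied, so the result is undefined when one of the off-diagonal merged counts is zero.

```python
def cell_odds_ratio(table, i, j):
    """Odds ratio for cell (i, j) after merging all other rows and columns.

    a: count in the cell itself
    b: same conformation, any other sequence
    c: same sequence, any other conformation
    d: everything else
    Raises ZeroDivisionError when b * c is zero.
    """
    n = sum(sum(row) for row in table)
    a = table[i][j]
    b = sum(table[i]) - a
    c = sum(row[j] for row in table) - a
    d = n - a - b - c
    return (a * d) / (b * c)
```

For the table [[10, 20], [20, 10]], the cell (0, 0) gives (10 × 10) / (20 × 20) = 0.25, i.e. under-representation.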


    RESULTS AND DISCUSSION
This section characterizes DNA dinucleotide conformations (Table 4), compares them in structures of naked and of complexed oligonucleotides (Figure 2), and investigates sequence preferences in the naked A-DNA and B-DNA double helices (Tables 5–9). The differences between the dinucleotide conformers observed in DNA and RNA are also briefly discussed. Finally, the characteristic features of selected important classes of ‘untypical’ DNA structures, such as quadruplexes or histone-core particles, are annotated in the context of their dinucleotide conformations.


Figure 2. Two-dimensional scattergrams of torsion angles in naked DNA (Dataset 2 from Table 1, dark blue) and in DNA from complexes (Dataset 1, cyan). A, B and Z are the respective double-helical forms, r stands for purines and y for pyrimidines. The conformations of almost 4000 RNA dinucleotides are plotted as pink dots for comparison.

 
Overview of DNA conformations
Fourier averaging and clustering performed on the full data set of all available structures (Dataset 1 defined in Table 1) revealed a large number of conformational clusters (Supplementary Table T2), which may however be condensed into a much smaller number of families (Table 4). These main conformational families include all major well-characterized DNA double helical forms (BI, BII, AI, AII and Z) as well as less expected conformers, combining structural features typical of A- and B-forms. However, their structural diversity is far from matching that of RNA conformers (31).

A-DNA is a well-described conformation (32–34) characterized by the C3'-endo sugar pucker (δ ~ 80°), with ζ and α + 1 ~ 300° (gauche-), β + 1 ~ 180° (trans) and with its glycosidic torsion angle χ adopting a value near 200° (low anti). A-form conformers exhibit a relatively low dispersion of torsion angles and are sufficiently represented by two major conformations: the canonical A-form, represented by Cluster 8 and also labeled AI (Table 4), and the AII conformation (Cluster 19). AII is characterized by the α and γ torsions in the trans region. These values can be reached from the canonical α/γ values (300° and 60°, respectively) by the so-called ‘crankshaft motion’, which effectively compensates for the switch in torsion values in such a way that the overall course of the backbone does not alter dramatically. Both A-forms have torsion values close to those reported earlier (32,33) and are virtually the same as in their respective RNA conformations (31). Importantly, other A-like conformers exist with a sugar pucker in the O4'-endo region (δ ~ 100°, Cluster 25 in Table 4), observed both in noncomplexed and complexed DNA. Since the interconversion of C2'-endo to C3'-endo occurs preferentially via the O4'-endo state (35), dinucleotides forming Cluster 25 may be described as A-to-B transitional conformers; they are closer to the A-form, because their χ torsion is A-like.

Both major B-form conformers, BI (Cluster 54 in Table 4) and BII (Cluster 96), also have torsion values near those known from earlier studies (36,37). The canonical BI conformation is by far the most frequent conformer in both naked and complexed DNA. It is characterized by the C2'-endo sugar pucker with δ ~ 135°, ζ and α + 1 torsions in the gauche- range (however, its ζ value near 260° is lower than in A-DNA) and by χ adopting a much higher value than in A-DNA; the χ values typical of the B-form, close to 260°, are called high anti. The variations within the BI conformers (Supplementary Table T2) result mainly from the ε, ζ, α + 1 and β + 1 torsions, but the changes in these torsions mostly compensate each other. Only four or five of the many BI conformers that have been identified form larger clusters, and only three (Clusters 50, 54 and 58) have been observed in structures of naked DNA.

In naked DNA structures, the BI- and BII-forms are separated by a gap in the ζ and ε torsions, and to a lesser extent also in the β + 1 and χ torsions (37). However, such a distinction between these forms almost completely disappears in complexed DNA (Figure 2a and d). Despite the near-continuous BI-to-BII transition, the data from naked DNA (Dataset 3) clearly indicate that BII should be recognized as a distinct B-form characterized by ζ in the trans region, by a high value of ε and by a low β near 140°. Changes and mutual compensations of torsion angles in BI and BII are obvious from Table 4 by comparing the values for Clusters 54, 86 and 96 (all the clusters spanning the BI- and BII-forms are listed in the Supplementary Table T2): the BII-like conformers start at ε and ζ values close to the BI-form (illustrated by Cluster 86 with ε/ζ ~ 200°/215°), gradually pass to ‘typical BII’ values of Cluster 96 (ε/ζ ~ 245°/172°) and end with Cluster 105 having extreme values of ε and ζ (ε/ζ ~ 264°/149°). The almost continuous transition from BI to BII is best described by the linear anticorrelation of ε and ζ values (ε = –0.73 ζ + 367°, R² = 0.85, N = 2022, with the equation being valid within the limits of BI and BII conformations).
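The fitted line can be checked against the cluster values quoted above. The short sketch below simply evaluates ε = –0.73 ζ + 367° for a given ζ; the function name is illustrative, and the relation is only meant to hold within the BI to BII range stated in the text.

```python
def epsilon_from_zeta(zeta_deg):
    """Epsilon (degrees) predicted from zeta by the reported BI/BII linear fit."""
    return -0.73 * zeta_deg + 367.0

# Cluster values quoted in the text, as (zeta, observed epsilon) pairs.
quoted = {"Cluster 86": (215.0, 200.0),
          "Cluster 96": (172.0, 245.0),
          "Cluster 105": (149.0, 264.0)}
for name, (zeta, eps_observed) in quoted.items():
    print(name, round(epsilon_from_zeta(zeta), 1), "vs quoted", eps_observed)
```

For ζ = 172° the line gives ε ≈ 241°, within a few degrees of the ~245° quoted for Cluster 96, which is consistent with a fit of R² = 0.85.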

The occurrence of two consecutive BII conformers is infrequent, but it does occasionally occur both in naked and complexed structures, corroborating an earlier observation (22). In naked DNA, the BII–BII repetition has to be stabilized either by a crystal contact, or it is induced by specific structural features, such as the BII repetitions found in the four-way DNA junction 1L4J (38). The frequency of occurrence of BII–BII steps in protein complexes is similar to that in naked DNA, and BII–BII repetition is uncommon even in nucleosome or in strongly bent CAP/DNA structures. Three consecutive BII steps are rare but have been observed in four-way junctions (e.g. in the 1L4J structure) or in DNA/histone-core complexes. In summary, BII conformers tend to be isolated, with a typical pattern of BII-rich regions being the BI–BII–BI–BII repetition. Alternatively, BI may be repeated more than once while BII can be replaced by another deformed A–B conformer in this pattern.

The average values of the backbone torsions in a nucleotide (listed from α to ζ) were calculated for the most common A- (AI and AII) and B-forms (BI and BII) using only naked DNA structures (Dataset 3 in Table 1). The values listed in Table 3 allow for an easy comparison with other references (22,39). A careful selection of high-resolution DNA structures in Dataset 3, the numbers of observations for each structural class, and the intervals of reliability of all torsions imply that the torsion values in Table 3 are a source of reliable structural description of the sugar–phosphate backbone of the double helical forms.


Table 3. Torsion angles [°] in nucleotides of the major A- and B-forms

 
Several fairly populated conformers fall within neither the A- nor the B-form category. However, they can be characterized as conformations with one nucleotide of the B-type and the other of the A-type and having the sugar in one or both nucleotides in the transitional O4'-endo pucker. The first such group of conformers, exemplified by Cluster 41 (Table 4), can be described as AI–BI. The nucleotide at the 5'-end of the analyzed dinucleotide unit is in an A-like conformation (C3'-endo sugar pucker, low χ), whereas the 3'-end nucleotide is of a BI-type (C2'-endo sugar pucker, χ higher than 200°). In another similar cluster (Cluster 47, Table 4), the sugar of the 5'-end nucleotide adopts the A-to-B transitional O4'-endo pucker, and this nucleotide should therefore be considered as an intermediate between the A- and B- forms. AI–BI conformers occurring both in naked and complexed B-DNA double helices are characterized by a strong sequence bias toward the Y–R sequences (see below for a detailed discussion on sequence preferences within the individual conformer families). The occurrence of the A-to-B conformations seems to reflect the inherent flexibility of certain, preferably Y–R, sequences, as they can be explained neither by the presence of interacting species (e.g. ions) nor by the packing effects.


Table 4. The main DNA conformational classes identified in the present work

 
In analogy with the A–B conformers, combined B–A clusters were also identified (Table 4). The first such group of conformers is exemplified by Cluster 32 (Table 4); it has the 5'-end nucleotide in the BI-form while the 3'-end nucleotide adopts a transitional conformation between the B- and A-forms (O4'-endo sugar pucker, χ ~ 239°) and can be characterized as BI–AI. The BI–AI conformers occur both in naked and complexed DNA and the majority of their nucleotides are involved in W–C base pairing. Their sequence dependencies are more complex than those of the AI–BI conformations: the Y–R sequences are disfavored while the A–A and A–T sequences are preferred. A few small B–A clusters with high ε and low ζ can be characterized as BII–AI conformers (Supplementary Table T2) with the 5'-end residue in the BII-form (C2'-endo sugar pucker, high anti χ), and with the 3'-end residue in the A-form (C3'-endo, χ ~ 200°). Other torsions in BII–AI conformers may also adopt unusual values, such as α + 1 at 60° and γ + 1 at 200° in Cluster 110. Most of the BII–AI dinucleotides are found in the R–R sequences; some are involved in G–A or A–G mismatches adopting Hoogsteen base pairing (Cluster 107 in Supplementary Table T2). These clusters clearly show how a localized deformation of the regular B-form is sufficient to accommodate the G/A mismatch into the double helix.

Both B- and A-forms accommodate the crankshaft motion compensation between {alpha} and {gamma} (or {alpha} + 1 and {gamma} + 1) torsions but differ in its realization. The A-form has its important substate, AII (Cluster 19, Table 4), with trans/trans {alpha} and {gamma} torsions, observed in naked and complexed DNA, as well as in RNA. In contrast, trans/trans {alpha}/{gamma} combination is never observed in the B-form where {alpha} and {gamma} torsions may be flipped from their canonical g–/g+ values to the g+/g– combination (Cluster 116) virtually only by interactions with proteins.
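The g+/t/g– shorthand used for the {alpha}/{gamma} combinations can be made concrete with a minimal sketch. It assumes the common coarse partition of the torsion circle into three 120°-wide windows centered near 60° (g+), 180° (t) and 300° (g–); these window boundaries and the function names are illustrative assumptions, not the cluster boundaries used in this work.

```python
def torsion_region(angle_deg: float) -> str:
    """Label a torsion angle as 'g+', 't' or 'g-'.

    Assumption: three 120-degree windows, [0, 120) -> g+, [120, 240) -> t,
    [240, 360) -> g-. Real conformer boundaries are fuzzier than this.
    """
    a = angle_deg % 360.0  # fold any input (e.g. -60) into [0, 360)
    if a < 120.0:
        return "g+"
    if a < 240.0:
        return "t"
    return "g-"


def alpha_gamma_state(alpha_deg: float, gamma_deg: float) -> str:
    """Combine the labels of alpha and gamma into a string such as 'g-/g+'."""
    return f"{torsion_region(alpha_deg)}/{torsion_region(gamma_deg)}"
```

With these windows, a pair of torsions near 300°/50° is labeled g-/g+, a pair near 30°/300° is labeled g+/g-, and 180°/180° is labeled t/t.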

The B–A and A–B clusters described so far combine features typical of the B- and A-forms in such a way that each nucleotide within the analyzed unit adopts predominantly one or the other form. However, there are three or four small clusters (Clusters 24–26 and, to some extent, Cluster 3 in Supplementary Table T2; Cluster 25 is also listed in Table 4) which are examples of a true combination of the B- and A-forms. These dinucleotides are characterized by having both sugars in the O4'-endo pucker, a transitional form between the C2'- and C3'-endo puckers, and a {zeta} value near 280°. These unusual conformers can be found not only in highly deformed regions of DNA complexed with the TATA-box-binding protein (40,41) but also in naked A-DNA structures (42).

The combining of B- and A-type conformers has been described using base morphology and helical parameters such as groove width and helical twist in DNA complexed with proteins (43,44) as well as in naked DNA (45). B- and A-forms have also been observed to coexist in several crystal structures. Entire B- and A-DNA double helices have been located in a single-crystal structure (46), demonstrating similarity in their thermodynamic stability. The scaffolding of this crystal lattice is built by A-DNA helices with B-DNA helices being interspersed in crystallographically disordered positions in the lattice interstices. In several structures, oligonucleotides (28,47,48) capture DNA in various phases of the B-to-A transition. Perhaps the most direct observation of the B-to-A transition itself has been achieved by solving a series of structures (1IH1–6) with sequences containing guanines and a varying number of methylated and brominated cytosines (48). In another structure, 1DC0 (28), the whole double helix has features of both B- and A-forms: The bases are perpendicular to the helical axis (have an almost zero inclination), which is typical of the B-form, but most of the other parameters, such as twist, sugar pucker, minor-groove width, slide and the distance of the P atom from the base plane Zp (49), adopt A-like values. The backbone torsions of this structure are strictly A-like, in fact, all its residues were classified as canonical A-DNA. This structure demonstrates the limits of any analysis using torsion angles only. To capture all the details of such a conformation, torsion angles should be complemented with other parameters, such as inclination, slide or Zp (49). However, it should be emphasized that the 1DC0 structure is an exception and that the distinction between the B- and A-type was possible in the other cases by analyzing the torsional space.

The above observations support the view describing the right-handed double-helical forms as one broad conformational family with a strong preference for the BI-form connected by a nearly continuous set of conformers to the AI- and AII-forms on the one hand and to the BII-form on the other.

The glycosidic angle in four clusters (Clusters 119–122, Table 4 and Supplementary Table T2) characterizing the structure of non-W–C (mismatched) base pairs adopts a rare syn orientation ({chi} ~ 70°). Most dinucleotides in these clusters are of a GG or GA sequence, but there are several GT and GC exceptions. Clusters 121 and 122 with the 3'-end base in syn orientation are listed in Table 4, whereas their 3'-end continuation, Clusters 119 and 120, are shown in Supplementary Table T2. All nucleotides forming Cluster 122 come from G-quadruplexes. Cluster 121 contains mainly unusual non-W–C pairs between the W–C edge of cytosine and the Hoogsteen edge of guanine. These nonplanar G–C pairs from the TATA box bound to the TATA-box-binding protein [e.g. the PDB entry 1QN3 (50)] correspond to the class ‘IV trans’ of the Leontis–Westhof classification (51).

While the single building unit both for A- and B-DNA is a nucleotide, the left-handed double-helical Z-DNA is constructed from dinucleotide steps with distinct conformations consisting of alternating pyrimidine–purine (Y–R) or purine–pyrimidine (R–Y) steps (52). The Y–R steps are implemented by one geometry described by Cluster 123 (Table 4), whereas the R–Y steps may adopt two distinct conformations characterized either as ZI (Cluster 124) or as ZII (Cluster 126) (53,54).

A comparison of conformations of naked and complexed DNA
A brief inspection of 2D scattergrams in Figure 2 shows that distributions of all torsions in naked (noncomplexed) DNA are significantly broadened upon complexation with proteins and small ligands (e.g. drugs). DNA molecules in the crystal phase are obviously not ‘naked’ but immersed in solvent, mainly water molecules and metal cations. These small solvent particles are indispensable for structural integrity of nucleic acids but their influence is not considered explicitly here. In our opinion, the fact that DNA crystallized from pure solvent is conformationally more compact than DNA co-crystallized with drug molecules and especially polymeric proteins indicates that (i) small solvent particles impose the smallest conformational constraints, and (ii) DNA–DNA crystal contacts are rare and/or relatively nonspecific.

The merging of the BI- and BII-forms upon complexation with proteins, perhaps the most significant case of the conformational broadening caused by complexation, was discussed in the previous section. Four distinct regions of the {zeta}/{alpha} + 1 scattergram (Figure 2a) induced by complexation with proteins are discussed below:

  1. A fair number of conformations are present at very low {alpha} + 1 near 30° and ‘normal’ {zeta} at ~240°. These well-defined conformers also appear in the upper left corner of the {alpha} + 1/{gamma} + 1 scattergram near {alpha} + 1~30° and {gamma} + 1~300° (Figure 2b). They correspond to B-like families with {alpha} + 1 and {gamma} + 1 values flipped from their normal g–/g+ values to g+/g– conformation and are represented by Clusters 116 and 112 (Supplementary Table T2). These conformers occur at points of a substantial DNA bend, such as in complexes with DNA polymerase, histone-core proteins and transcription factors, or in ‘disordered’ regions of four-way junctions. However, not all conformations in this area originate from DNA complexes: The points pertaining to Cluster 122 come mostly from guanine quadruplexes.
  2. A small region of about 40 residues adopting the same {alpha} + 1 values (~30°) but lower {zeta} (~180°-230°) corresponds to nucleotides with higher β + 1 (~190°–240°) and belongs to Cluster 110 (Supplementary Table T2). The rest of dinucleotides in this region of the {zeta}/{alpha} + 1 scattergram were not assigned to any cluster and correspond either to DNA in protein complexes, especially with endonucleases, or to DNA intercalated with drug molecules.
  3. A rather diffuse region between {zeta} ~ 170°–250° and {alpha} + 1 ~90°–150° originates from nonclustered dinucleotides interacting strongly with histone-core proteins and with intercalated drugs. Interestingly, this conformation may be induced by the intercalation of a drug molecule to both the double helix and quadruplex and may thus reflect backbone adaptation to the intercalated drug.
  4. The residues in the region with a rare combination of {zeta}/{alpha} + 1, ~60°/~200°, are not clustered; some are from structures of single-stranded DNA and likely to be a real DNA conformer. However, most of these residues are likely to result from an incorrectly fitted sugar pucker, which forces the backbone into extreme values of {varepsilon} torsion (below 90°), subsequently implying the rare combination of {zeta}/{alpha} + 1.

The distribution of the β torsion is also significantly broadened in complexes. The {zeta}/β + 1 scattergram (Figure 2d) shows how conformations separated into distinct regions in naked DNA broaden upon complexation. While the most prominent case of merged BI- and BII-forms was already discussed above, two other diffuse areas (not assigned to any of the identified clusters), occurring exclusively in complexes, are found near {zeta}/β + 1 ~280°/80° and near {zeta}/β + 1 ~170°–230°/200°–240°. In the former group, {alpha} + 1/{gamma} + 1 torsions occupy the untypical g–/t region. Similar conformations exist also in RNA both for C3'/C3' and C2'/C2' sugar puckers as conformers 1e and 4s, respectively (31). In the latter group, approximately half of the residues can be mapped to the {zeta}/{alpha} + 1 group discussed above under Point 1, whereas the other half has no structural or functional characteristics in common.

Two groups of conformers can be observed in the {alpha} + 1/{gamma} + 1 scattergram (Figure 2b) for complexed DNA. One large group around the {alpha} + 1/{gamma} + 1 ~30°/300° region was already discussed above as the {zeta}/{alpha} + 1 Group 1. Another is located in a small region of {alpha} + 1 ~240°–270° and {gamma} + 1 ~170°. Although this area has not been identified as a distinct cluster, this conformation may be found either in the i-motif structures [e.g. 190D (55)] or in the noncanonical base pairs classified as WC/WC trans [Type 2 according to Leontis–Westhof classification (51)].

The two conformers described above (i-motif/noncanonical base pairs with {alpha} + 1 ~240°–270° and {gamma} + 1 ~170°, and dinucleotides extended by intercalated drugs with {zeta} ~ 170°–250° and {alpha} + 1 ~90°-150°) represent, in our opinion, new unique DNA conformations. However, they were not clustered by the analysis, and their unequivocal identification as novel backbone conformers requires an analysis of new crystal structures.

The main change occurring in the {delta}/{chi} distribution (Figure 2c) upon DNA complexation is the increase in the dispersion of {chi} values in the C3'-endo region, blurring the positive correlation between {delta} and {chi} torsions observed in noncomplexed structures. It should be emphasized that {delta} torsions near 100°, corresponding to the O4'-endo sugar pucker (as well as another C2'-to-C3'-endo transitional form with a sugar pucker in the C1'-exo region), are populated both in complexed and noncomplexed DNA. Apparently, the O4'-endo pucker is a distinct deoxyribose conformation and is highly unlikely to result from incorrect pucker assignment during the refinement process. Conformers with a sugar pucker between C3'-endo and O4'-endo ({delta} ~ 90°), occurring both in naked and complexed DNA, are mostly purine residues from Z-DNA and from guanine quadruplexes.

The syn orientation of the bases ({chi} ~ 70°) is rare. Syn conformers with the C2'-endo ({delta} ~ 140°) sugar pucker have been observed only in complexes with proteins; roughly 1/3 of them were classified as Clusters 121 and 122 (Table 4). This conformation is adopted by guanine in a syn orientation, either forming a Hoogsteen pair with cytosine in complexes with TATA-box-binding proteins, thus avoiding a possible steric clash (50), or forming G–G pairs of guanine quadruplexes (56). However, the majority of the C2'-endo syn residues did not form any distinct conformation. These originated either from the same structures as the classified ones or are found in single-stranded DNA.

To summarize, complexation with proteins and small ligands (‘drugs’) induces a widening of torsional distributions of the DNA backbone. Some selected protein/DNA complexes have crystallographic resolution worse than the target value of 1.9 Å (22) and these structures are likely to blur the distributions by noise. Assuming that the refinement protocols do not systematically bias torsion distributions, at least in such a large statistical sample, the resolution-related broadening represents white noise. Nonrandom widening of torsion distributions caused by interactions between DNA and ligands should then be reflected by new conformers not seen in the naked DNA. This is indeed the case: While over 120 clusters were localized in all analyzed dinucleotides (Dataset 1), only 28 of them were found in naked DNA (Dataset 2).

One important reason for the conformational widening is the stabilization of A-like or combined B/A conformers induced by interacting molecules (33,43,57). Although the majority (70%) of dinucleotides from protein/DNA complexes adopt BI and BII conformations, their significant portion (30%) may be found in AI- and AII-forms. Such a plasticity of DNA, when the conformation is changed locally into an A-form, is one of the ways in which DNA achieves specificity in protein/DNA binding (58–60). Remodeling from the B- to A-form also provides a mechanism for smooth bending of the double helix and for the controlling of widths of major and minor grooves. By changing the accessibility of the edges of the individual base pairs (43), the narrowing and deepening of the major groove in A-DNA enables the appearance of sequence-specific contacts. Quite a large degree of distortion of the double helix required to achieve a specific protein binding may be attained by its local ‘deformation’ into an A-like conformation (61,62). The narrowing and deepening of the major and the widening of the minor grooves is also a reason for the increased population of BII conformers and for a smooth transition between the BI- and BII-forms in protein/DNA complexes (44).

Sequence preferences in double-helical B- and A-forms
The preferences of different sequences for different conformations were tested for dinucleotide steps in naked (noncomplexed) right-handed double-helical structures involved in W–C base pairs (Dataset 3). The conformational plasticity of a sequence probed by the crystal forces, which is statistically tested here, is de facto a consequence of the general structure-correlation principle formulated by Burgi and Dunitz (63,64).

In order to maximize the statistical significance of the sequence comparisons, the number of the conformational classes being analyzed was reduced, leaving eight statistical categories: AI, AII, BI, BII, A/B, B/A, RestA and RestB. The B-like clusters were labeled as BI (Clusters 48–85), BII (Clusters 86–106), A/B (Clusters 22, 23, 38–47) or B/A (Clusters 27–37, 107–110). Similarly, the A-like clusters were assigned either to the AI (Clusters 1–21, 24–26) or AII (Cluster 19) category. For dinucleotides not assigned to any of these categories, an a priori NDB classification into the A- and B-form helices had to be used. If they appeared in A-form double helices, they were assigned to the RestA category; if they appeared in a B-form, they were assigned to the RestB category. The counts of all the 16 dinucleotide steps which were utilized in the statistical analyses are listed in Table 5.
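The reduction of clusters to eight categories can be written down as a small lookup. The sketch below is only an illustration of the ranges quoted in the paragraph above. The function name and the `ndb_form` argument are hypothetical. Because Cluster 19 lies inside the quoted AI range (1–21) but is named separately as AII, the sketch assumes that AII takes precedence.

```python
from typing import Optional

# (category, inclusive cluster ranges), checked in this order.
# Assumption: AII is listed first so that Cluster 19 is not absorbed by AI.
CATEGORY_RANGES = [
    ("AII", [(19, 19)]),
    ("AI", [(1, 21), (24, 26)]),
    ("A/B", [(22, 23), (38, 47)]),
    ("B/A", [(27, 37), (107, 110)]),
    ("BI", [(48, 85)]),
    ("BII", [(86, 106)]),
]


def category_for_cluster(cluster: Optional[int], ndb_form: str) -> str:
    """Map a cluster number to one of the eight statistical categories.

    cluster  -- cluster number, or None for an unclustered dinucleotide
    ndb_form -- 'A' or 'B', the a priori helix form (hypothetical argument)
    """
    if cluster is not None:
        for name, ranges in CATEGORY_RANGES:
            if any(low <= cluster <= high for low, high in ranges):
                return name
    # Anything not covered by the ranges falls back to RestA or RestB.
    if ndb_form not in ("A", "B"):
        raise ValueError("ndb_form must be 'A' or 'B'")
    return "Rest" + ndb_form
```

For example, Cluster 54 maps to BI, Cluster 41 to A/B, Cluster 32 to B/A, and an unclustered step from a B-form helix maps to RestB.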



 
Table 5. Counts of 16 dinucleotide steps in conformational families of noncomplexed A- and B-form double helices (Dataset 3)

 
It should be emphasized that the statistical tests performed are limited to sequences which were subjected to crystallization trials and succeeded in them. This fact must be borne in mind when interpreting the sequence preferences within our data sets. For instance, the underrepresentation of sequences with adenine and thymine in the A-form double helices may, and is likely to, reflect the thermodynamic preference of sequences containing these nucleotides. However, it may also reflect a lack of crystallization trials of such sequences after it was detected that they do not crystallize in A-DNA. Similarly, the preference of A-DNA for the GG sequences and of B-DNA for the AA sequences is likely to reflect real thermodynamic preferences, but we cannot completely exclude that certain sequences of a particular length were more popular in crystallization trials than others (here we allude to the known disposition of octameric sequences to crystallize in the A-form). A complicated interplay between sequences, their length, crystallization conditions and the resulting double-helical structure has been discussed since the early days of oligonucleotide crystallography (33,65,66). A seminal work by Dickerson et al. (67) shows that the crystal-packing forces probe the malleability of different sequences to adopt different conformations and that one sequence subjected to different environments may adopt several conformations.

The conformational space of dinucleotide sequences is mapped not only by crystal-packing forces. Another force probing the polymorphism of individual sequences is the interaction with co-crystallized molecules, where proteins, drugs and other small ligands impose different constraints on the DNA helices. However, complexed DNA structures have not been used to investigate sequence–structure correlations for two reasons: (i) their resolution is lower on average than that of naked DNA structures and (ii) the bias of the sequence space is likely to be even higher than in naked DNA, because sequences of DNA complexed with specific binders (e.g. transcription factors) need not be random in any way.

An analysis of the current data shows a strong general preference for the canonical BI conformer (Cluster 54) in all 16 dinucleotide steps. However, most steps are also capable of adopting a wide range of conformations, thus reflecting various local influences. Great differences have been found between counts in the A- and B-forms (Table 5, ‘Total in A’, ‘Total in B’ rows). The most apparent difference is the low number of adenosine and thymine residues in all A-like conformers (Table 5); while the AA step is highly populated in the B-form, it is completely missing in the A-form, and while GG is the most frequent step in the A-form, it is much less populated in the B-form. This observation was quantitatively confirmed by a test of uniformity of dinucleotide representation, performed for all 16 dinucleotide steps (Supplementary Table S2) between A-like (AI, AII, RestA) and B-like (BI, BII, A/B, B/A, RestB) dinucleotides. The qualitative difference between sequences of A-like and B-like conformers and the virtual lack of A/T nucleotides in the former one leads to the necessity of treating the A-like and B-like conformers separately in subsequent statistical tests (Table 6).



 
Table 6. The violation of the uniformity of dinucleotide representation for purines (R) and pyrimidines (Y) between A, B and combined conformational families as measured by the standardized Pearson residuals

 
Another statistical test, the dinucleotide homogeneity test, allows for a more detailed analysis of sequence preferences within A- and B-forms. The sequences for this test were categorized either as purines/pyrimidines, or as actual nucleotides.
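Tables 6–8 report standardized Pearson residuals. As a minimal sketch of what such a residual measures, the code below applies the textbook definition for a two-way contingency table (observed minus expected count, divided by its estimated standard error under independence). It is not taken from the authors' procedure, and the counts in the example are hypothetical, not the values of Table 5.

```python
import math


def standardized_pearson_residuals(table):
    """Return standardized Pearson residuals for a table of counts.

    table -- list of rows, e.g. rows = conformer families, columns = sequence
             classes. Assumes at least two rows and two columns and that no
             row or column total is zero or equal to the grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            # Estimated variance of (observed - expected) under independence.
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(out_row)
    return residuals


# Hypothetical counts: two families (rows) by two sequence classes (columns).
example = [[10, 20], [20, 10]]
example_residuals = standardized_pearson_residuals(example)
```

Cells with a large positive residual are over-represented relative to independence and cells with a large negative residual are under-represented; absolute values above roughly 2 to 3 are commonly read as noteworthy.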

Well-founded conclusions for A-type conformers can only be drawn at the purine/pyrimidine level. Table 7 and Supplementary Table S3 show that the minor AII family prefers YR sequences while being under-represented for RY and YY sequences. The typical feature of the AII conformation, torsions {alpha} + 1 and {gamma} + 1 near the trans region (Table 4), leading to an almost planar arrangement of six atoms O3'-P-O5'-C5'-C4'-C3' at the 3'-end nucleotide, can be adopted by the purine nucleotide but only with difficulty by the pyrimidine nucleotide.



 
Table 7. The violation of the homogeneity of purine (R) and pyrimidine (Y) dinucleotide steps between AI, AII and nonclassified (RestA) conformers in the A-form double helices as measured by the standardized Pearson residuals

 
The dinucleotide homogeneity test performed for the B-like conformational families reveals their intrinsic sequence preferences. When tested for the Y/R sequences, the homogeneous steps (RR and YY) show significant preference within the BI family (Supplementary Table S5), which is usually not the case with combined steps (YR and RY). On the other hand, the combined steps are preferred in less populated conformational families, namely YR is abundant in the BII and A/B families, and RY in B/A families (nevertheless, many RY steps remain unclassified as RestB). Such a variability of conformations, especially of the YR steps, corresponds to their known role in bending and kinking (68).
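The purine/pyrimidine grouping of steps used in this test can be sketched in a few lines. The helper names are hypothetical; the only facts relied upon are that A and G are purines (R), C and T are pyrimidines (Y), and that the step on the facing strand of a W–C paired duplex is the reverse complement of the original step.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def ry_class(step: str) -> str:
    """Translate a step such as 'CA' into its R/Y class, here 'YR'."""
    letters = []
    for base in step.upper():
        if base in PURINES:
            letters.append("R")
        elif base in PYRIMIDINES:
            letters.append("Y")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(letters)


def complementary_step(step: str) -> str:
    """Return the step on the facing strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(step.upper()))
```

The reverse complement always keeps a step in the same homogeneous or combined group: RR steps pair with YY steps, while YR and RY steps each pair with a step of their own class.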

The amount of data available for B-form DNA is sufficient to analyze all 16 dinucleotide steps separately. Both Pearson residuals (Table 8) and odds ratios (Supplementary Table S6) confirm the preference of YY or RR for adopting the BI conformation, with the only exception being the GG step (see below). The underrepresentation of combined Y/R sequences in BI is caused by the frequencies of the GC and CA steps being very low (Tables 5 and 8). The preference of the BII family for all four YR sequences can be inferred from the values of both Pearson residuals and odds ratios. The only two significantly populated YR sequences are TG and the complementary CA steps, which have high frequencies in the BII-form and low frequencies in other conformations. This indicates that the facing strands of the W–C paired tetranucleotide d(CA).d(TG) are likely to adopt the BII-form (69). Besides the CA step, the CG step is also often considered to be highly prone to adopting the BII-form (70), but the current data do not support this view; the CG count in the BII-form is not statistically significant. Instead of preferring BII, the CG step may be considered plastic: It can adopt BI, A/B, and BII with comparable counts and was also found in a number of unclassified conformers (RestB). Structural variability of the CG step has been observed previously; CG conformation has been shown to depend not only on the immediately flanking nucleotides (37,71,72) but also on the more distant ones (73). The TA step has a similar count in the BI- and BII-forms, which contradicts the earlier observation that TA displays a low propensity to undergo BI-to-BII transition (70).
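An odds ratio for one step in one conformational family can be illustrated with the generic 2×2 definition: the odds of the family among the chosen step divided by the odds of the family among all other steps. This is a sketch under that assumption only; how the odds ratios in Supplementary Table S6 were computed is not restated here, and the counts in the accompanying example are hypothetical.

```python
def odds_ratio(counts, step, family):
    """Odds ratio of `family` for `step` versus all other steps.

    counts -- dict mapping step -> dict mapping family -> count
    Assumes the off-diagonal cells b and c are nonzero; otherwise the
    ratio is undefined and a ZeroDivisionError is raised.
    """
    a = counts[step][family]                      # step, in family
    b = sum(counts[step].values()) - a            # step, in other families
    c = sum(fc[family] for s, fc in counts.items() if s != step)  # other steps, in family
    others_total = sum(sum(fc.values()) for s, fc in counts.items() if s != step)
    d = others_total - c                          # other steps, other families
    return (a * d) / (b * c)


# Hypothetical counts for three steps in two families.
example_counts = {
    "CA": {"BI": 5, "BII": 15},
    "AA": {"BI": 40, "BII": 2},
    "CG": {"BI": 20, "BII": 8},
}
```

A value above 1 indicates that the step is found in the family more often than the remaining steps are, and a value below 1 indicates the opposite.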



 
Table 8. The violation of the dinucleotide homogeneity for sequences between BI, BII, A/B, B/A and unclassified dinucleotides (RestB) in the B-form double helices measured by the standardized Pearson residuals

 
The YY steps disfavor the BII family to the extent that none of the CC, CT, TC, TT steps was identified as a BII conformer. The conformational preferences of the RR sequences are rather interesting: The three steps containing adenine (AA, AG, GA) are underrepresented while GG is significantly overrepresented in BII, which is the opposite of the situation in the BI family. Although the low number (23) of the GG steps in the B families calls for caution, their propensity to adopt the BII conformation (69) seems to be clearly pronounced.

The A/B and B/A families are less populated (Table 5), therefore any conclusion must be drawn carefully. Whereas the CG step can clearly adopt the A/B conformation, the GC step shows a high propensity for the B/A conformation. Some of these steps come from the CGC sequence, in which the central G nucleotide is responsible for the A-like features of the two consecutive B/A and A/B conformations. An analogous link can be made for T from the ATC or ATT sequences, where the AT step exists in the B/A conformation and TC or TT in the A/B one. On the other hand, several steps have seldom been observed in the combined B + A families. In particular, no YR step was classified as B/A, and only three RR and five RY steps adopted the A/B conformation, corroborating the general reluctance of purines to accept the C3'-endo sugar pucker in the B-like double helix.

Dinucleotide steps with an unclassified B-like conformation (RestB in Table 8) form about 14% of all steps in the B-form double helices, and the Pearson residuals of these unclassified dinucleotides are neutral for most sequences. An important exception is the significantly over-represented GC. The GC step, occurring with comparable counts in BII, B/A and RestB categories and underrepresented in the BI family, is the sequence with the most complicated conformational behavior. Multiple stable conformational states observed in this work for the GC step may be an indirect confirmation and generalization of its bistability in non-complexed DNA and ‘continuous flexibility’ in DNA/protein complexes (49).

The sequence preferences for the B-like conformation families are briefly summarized in Table 9. The BI conformation is numerically dominant in all the sequences (with the possible exception of CA) but it is significantly overrepresented in comparison with the other families only in some steps, notably in AA, TT and GA. Some steps, mainly GG, CA and TG show a propensity for the BII-form, whereas the CG step has a high propensity for the A/B conformation, and the AT and GC steps for the B/A conformation.



 
Table 9. A summary of the conformational preferences of dinucleotide steps in B-DNA helices

 
Annotation of selected DNA structures
Certain conformers occur mostly or exclusively in structurally and/or functionally distinct types of nondouble helical and deformed double-helical structures. The following paragraphs describe several such relationships in various DNA structures.

G-quadruplexes of the Oxytricha nova telomere are all conformationally similar structures which can be almost completely formed from clustered conformers. The central step of the quadruplex (Residues 2 and 3) in complexes with the telomere-end binding protein [structures 1JB7 (74), 1PH4, 1PH6 and 1PH8 (56)] as well as in the non-complexed quadruplex [1JPQ (75)] adopts the conformation of Cluster 122 (Supplementary Table T2), a B-like cluster with the canonical {alpha} + 1 and {gamma} + 1 values flipped (‘{alpha} + 1/{gamma} + 1 crank’) and with the syn orientation ({chi} ~ 70°) of the second guanine base enabling a non-W–C purine–purine base pair. The next step (Residues 3 and 4) adopts the conformation of Cluster 119, another B-like conformer with the first base in the syn orientation ({chi} ~ 70°). The GT steps joining the G-quadruplex with TTTT loops have the conformation of Cluster 120; their G nucleotides are again characterized by the syn orientation ({chi} ~ 70°) and by nontypical values of {alpha} and {gamma} torsions, namely g+ (~ 60°) and t (~ 180°), respectively. The second thymine residue from the TTTT loop stacks on top of the 5'-terminal guanine from the second strand. This TT step, like the subsequent one, has its backbone deformed both at the 5'-end ({alpha} = 150°) and at the 3'-end ({zeta} + 1 = 60°). Its central sugar-to-sugar ({delta}-to-{delta} + 1) part is classified as the BI Cluster 85. The other residues in the Oxytricha nova G-quadruplex adopt the conformations of clusters in BI- and BII-forms.

i-motif or cytosine quadruplex
The i-motif or cytosine quadruplex (55) consists of two interlocked pairs of parallel strands of the CCCC sequence. Unlike in the case of the Oxytricha nova G-quadruplexes, nucleotide conformations in the i-motif do not cluster into distinct conformers and most dinucleotide steps were actually not classified. For instance, only three steps in the d(ACCCCT) structure [1BQJ, (76)] were classified as Clusters 11 and 15 (Supplementary Table T2), containing conformers with C3'-endo sugar puckers but more B-like {zeta} and {chi} torsion values. No steps were classified in the 1V3N and 1V3O (77) structures, and only one step was assigned to the BI-to-A Cluster 32 (Table 4) in 1V3P (77). The limited success of clustering the i-motif dinucleotides can be partially attributed to the small amount of data available and partially also to the extreme, and most likely incorrect, values of some torsional angles (most notably to the {delta} values near 170°) pushing other torsions to rarely populated regions during the refinement and preventing these residues from being identified by their clustering.

Four- and three-way junctions
Junctions between DNA helices are important as intermediates in DNA rearrangements and as components in the secondary structure of single-stranded DNA molecules, such as certain viral genomes. The most important of these is undoubtedly the four-way junction, the Holliday junction of genetic recombination (78). It is formed by an incomplete exchange of strands between two double-stranded helices. However, other junctions are also possible, namely three-way junctions, the simplest and most commonly occurring branched structures in biologically active, single-stranded nucleic acids.

The arms of the junctions are formed by B-type double helices whose residues are classified either as BI or as BII, whereas the junction site itself is formed by a sharp turn in the phosphodiester backbone. This sharp turn is captured mainly by a change in the {varepsilon}, {zeta}, {alpha} + 1, β + 1, and {gamma} + 1 torsions, which adopt unusual values. Three conformationally distinct types of four-way junctions have been identified. However, the scarcity of structural data did not allow the junction-site step to be classified as a distinct conformation in any of these structures.

  1. Structures of Cre recombinase bound to a Holliday junction recombination intermediate [e.g. 2CRX (79), 4CRX (80)] contain DNA duplexes arranged in a nearly planar X-shaped structure. The junction is formed by a linkage between T and A nucleotides, which sharply bends DNA by an unusual combination of torsions {zeta}, {alpha} + 1, β + 1 and {gamma} + 1. The values of these torsions vary, however, from one structure to another, thus preventing this step from clustering. For example, the {zeta}, {alpha} + 1, β + 1 and {gamma} + 1 torsions of the junction in the 2CRX structure adopt a rare combination g+/g+/g+/t, which has not been observed among stable conformers even in the more variable RNA.
  2. The Holliday junction of the ‘inverted repeat sequence’ CCGGTACCGG [e.g. 1DCW (81), 1JUC (82)] is characterized by a high proportion of unclassified and BII-form residues, but only the residue joining two double-helical segments radically deviates from B-like torsion values, mainly in {zeta} and {alpha} + 1.
  3. The third distinct architecture of the four-way DNA junction is exemplified by the decamer structures 467D (83) and 1ZF2 (84) with a sharp bend between Residues A6 and C7; the bend can be characterized by a combination of unusually high {varepsilon} and {zeta} (~290° and ~260°, respectively).

Like the four-way junctions, a three-way junction in a complex with trimeric Cre recombinase [e.g. the 1F44 structure (85)] has only one phosphodiester linkage of the junction region in an unusual conformation, while the arms retain a near-perfect B-form. In analogy to the four-way junctions listed under Point 3 above, the only distinction between the junction site and the BI conformation is in the high values of both the {varepsilon} and {zeta} torsions; {varepsilon} adopts a value of 260°, typical of BII, and the value of {zeta} is higher (~210°) than that expected for a BII conformation (~150°).

DNA in the nucleosome-core particle (NCP)
The nucleosome-core particle consists of 146 or 147 bp of double-stranded DNA wrapped in 1.65 left-handed superhelical turns around four identical pairs of proteins individually known as histones and collectively known as the histone octamer. Nucleosomes, which are ubiquitous in eukaryotic DNA, have been shown to displa