Nucleic Acids Research
Proximal transcribed regions of bacterial promoters have a non-random distribution of A/T tracts
Introduction
Materials And Methods
Compilation procedure
Clustering procedure
Results
Transcribed regions display a non-random distribution of A/T tracts
Downstream periodicity is in phase with non-canonical promoter regions and the -10 element
Correlation of the distribution of A/T tracts along the promoter DNA
The relative disposition of A/T tracts supports the significance of the 17 bp period
Discussion
Acknowledgements
References
Received August 13, 1999; Revised and Accepted October 25, 1999
ABSTRACT Promoter sequences of Escherichia coli were compiled and their transcribed regions characterized by site-specific cluster analysis. Here we report that transcribed regions contain a non-random distribution of A/T tracts with strongly preferred positions at 6 ± 3, 23 ± 3, 40 ± 2 and 56 ± 2. The maxima of this distribution follow an unusual periodicity (~17 bp) and are in phase with important promoter elements involved in interaction with RNA polymerase, while the value of periodicity numerically fits the spacer length between the canonical -35 and -10 elements. The possible functional significance of this newly described feature is discussed in the context of promoter clearance and transcription pausing.
INTRODUCTION
Basic characteristics of nucleotide sequences that distinguish promoter regions from non-promoter DNA have been, and remain, the subject of intensive study. In addition to the canonical hexamers 35 and 12 bp upstream from the transcription start point (1-3), a number of non-canonical sequence elements have been proposed on the basis of more complicated statistical approaches (4-11) and experimental data (12-16). Additional regions important for promoter activity occur upstream of the canonical element -35 (4-6,14-16) and in the spacer region (12,13). A growing amount of experimental data indicate that sequence peculiarities in the initially transcribed regions may also be important for productive transcription, participating in the process of promoter clearance or other regulatory mechanisms (17-21). However, downstream promoter sequences have not been statistically characterized or even compiled. Here we present the first data on this subject.
The analysis was performed on a compilation containing 415 promoters of Escherichia coli and related phages, with sequences extending from -70 to +150 relative to the start point of transcription. We applied the method of cluster analysis used in a previous paper (4) as a sensitive tool to reveal specific patterns that may occur imperfectly in a set of nucleotide sequences.
No preferences in the distribution of particular base pairs were found in the transcribed part of the genes. The distribution of purines (R) and pyrimidines (Y) was also random except for a strong preference for the presence of YR around the start point of transcription. However, the distribution of A/T tracts was not random and showed a pronounced periodicity. We show that the presence of A/T tracts in the initially transcribed part of genes correlates with some sequence peculiarities of the upstream promoter region. A possible functional significance of this newly revealed feature is discussed in the context of this correlation.
MATERIALS AND METHODS
Compilation procedure
For the greater part our compilation comprises promoters published previously (4), with the downstream regions obtained from the DDBJ or EMBL databases (405 promoters), while 10 promoters (bioA, folA, T5J5, ilvIH, leuX, lrp, Mu I, Mu-mom, MuP and topA-P4) were added from original papers (22-30). Promoters were aligned according to the position of the start point of transcription. For continuity of positional coordinates we numbered the promoter positions from the start point, considered as 0, thus shifting the numbering in the transcribed region by 1 bp in comparison to the commonly used coordinates. To avoid misunderstanding, we use real numbering when we refer to the downstream positions in the text. The promoter compilation is available on request by Email (ozoline{at}venus.iteb.serpukhov.su).
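The numbering convention described above can be illustrated with a minimal sketch. The function names and the assumption that each aligned sequence is stored as a string beginning at -70 are ours, introduced only for illustration.

```python
def to_conventional(internal: int) -> int:
    """Map the compilation's numbering (start point = 0) to the commonly used
    numbering (start point = +1, no position 0)."""
    return internal + 1 if internal >= 0 else internal


def to_internal(conventional: int) -> int:
    """Inverse mapping: commonly used coordinate -> compilation coordinate."""
    if conventional == 0:
        raise ValueError("the commonly used numbering has no position 0")
    return conventional - 1 if conventional > 0 else conventional


def string_index(conventional: int, upstream_len: int = 70) -> int:
    """0-based index into an aligned sequence string that starts at -70
    (hypothetical storage layout, assumed for this sketch)."""
    return to_internal(conventional) + upstream_len
```

Under these assumptions the start point (+1 in the text) sits at string index 70, and upstream positions keep the same number in both schemes.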
To estimate the significance of the maxima observed we used a set of 100 bp long natural non-promoter fragments taken from well-characterized DNA sequences of phages T7 and λ, as well as from the plasmid pBR322. An additional set was used to characterize the distance distribution between different A/T tracts. This set was composed of 220 bp long fragments taken from the less well-characterized chromosome of E.coli. Care was taken to avoid the presence of any regulatory region in the selected sequences. Both sets contain 415 fragments.
Clustering procedure
Analysis was performed using the hierarchical clustering method described in a previous paper (4). In brief, promoter segments in fixed positions were compared. The length of the segment was varied from 2 to 10 bp in different clustering procedures. The value of resemblance for two promoters was determined as the number of coinciding nucleotides in the selected promoter segments, and sequential grouping of the promoters according to their scores generated hierarchical dendrograms (Fig. 1). In the first step the clustering procedure collected the promoters which had the maximum value of matching nucleotides. In the second and subsequent steps the groups with smaller values of resemblance were combined so as to get the highest value of overall similarity. The hierarchical dendrograms were sequentially generated at all promoter positions. An average size of promoter groups characterizing a degree of promoter homology at every position was used as a parameter to reveal regions with a non-random distribution of base pairs.
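The resemblance score and the first grouping step can be sketched as follows. This is a simplified illustration, not the procedure of ref. 4: it only groups promoters whose segments match at every position and does not build the full dendrogram. All function names are hypothetical.

```python
from collections import Counter


def resemblance(seg_a: str, seg_b: str) -> int:
    """Number of coinciding nucleotides between two equal-length segments."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have equal length")
    return sum(a == b for a, b in zip(seg_a, seg_b))


def first_level_groups(seqs, start, length):
    """Group aligned sequences whose segments at [start, start + length)
    are identical; returns (segment, group size) pairs, largest first."""
    counts = Counter(
        s[start:start + length] for s in seqs if len(s) >= start + length
    )
    return counts.most_common()


def mean_of_two_largest(seqs, start, length):
    """Average size of the two largest first-level groups at one position
    (a stand-in for the 'average size of promoter groups' parameter)."""
    sizes = [n for _, n in first_level_groups(seqs, start, length)[:2]]
    return sum(sizes) / len(sizes) if sizes else 0.0
```

Scanning `mean_of_two_largest` over every start index would give a position profile of the kind used to spot regions with a non-random distribution of base pairs.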
Figure 1. An example of a dendrogram obtained for position +56 ± 1 and a segment length of 5. Numbers on the right indicate the value of minimal resemblance in each group. The difference in the group size at the first level is schematically represented by the size of the half ellipses. Consensus sequences of the largest groups are indicated (W = A or T, S = G or C).
RESULTS
Transcribed regions display a non-random distribution of A/T tracts
When the clustering procedure was applied to the promoter sequences located upstream from the transcription start point, we observed at least five regions with a non-random distribution of base pairs (4) (Fig. 2a). They are located at -54 ± 4, -44 ± 3, -35 ± 3, -29 ± 2 and -11 ± 4. The canonical -35 and -10 regions show the highest clustering efficiency, displaying TTGACA and TATAAT as dominant elements. Non-canonical regions were characterized by a set of typical sequences, which are not similar to each other (4,31). No additional regions with an essential non-random distribution of base pairs were found when transcribed parts of the genes were subjected to the same clustering procedure (Fig. 2a). However, the distribution of purines (R) and pyrimidines (Y) showed a statistically relevant preference near the start point of transcription (Fig. 2b). The most typical dinucleotide in this region is YR at -1 (265 promoters). When the longer segments were analyzed, YYR at -2, YRYY at -1 and YYRYY at -2 were dominant, all derived from the presence of YR at -1. With respect to statistical significance, the presence of this dinucleotide near the start point of transcription was one of the strongest promoter features.
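The reduced alphabets and the YR count near the start point can be sketched as below. We assume here that "YR at -1" means a pyrimidine at -1 followed by a purine at the start point, and that `start_index` is the 0-based string index of the start point; both are assumptions of this sketch, as are the function names.

```python
# Translation tables for the two-letter subalphabets used in the analysis.
RY = str.maketrans("AGCT", "RRYY")  # R = A or G, Y = T or C
WS = str.maketrans("ATGC", "WWSS")  # W = A or T, S = G or C


def reduce_alphabet(seq: str, table) -> str:
    """Rewrite a nucleotide sequence in a reduced two-letter alphabet."""
    return seq.upper().translate(table)


def count_yr_at_minus1(seqs, start_index: int) -> int:
    """Count sequences with Y at -1 and R at the start point.
    start_index is the string index of the start point (assumed layout)."""
    return sum(
        reduce_alphabet(s[start_index - 1:start_index + 1], RY) == "YR"
        for s in seqs
    )
```

The same `reduce_alphabet` helper with the `WS` table yields the W/S strings on which A/T tracts can be searched.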
Figure 2. Position dependence for the average number of promoters in the two largest groups obtained by the clustering procedure for L = 5, V = 0. (a) Four letter alphabet; (b) purine-pyrimidine subalphabet (R = A or G, Y = T or C); (c) W = A or T, S = G or C. Main panels represent the data obtained for the promoter set (415 promoters). (Insets) The clustering procedure was applied to the control set containing 415 non-promoter fragments from phage and plasmid DNAs.
The distributions of A/T (W) and G/C (S) base pairs exhibit several peaks both in the region upstream from the start point of transcription (-62, -53/-54, -45/-44, -30, -19 and -12) and in the transcribed part of the genes (+5/+6, +23, +39/+40 and +56) (Fig. 2c). The relative amplitudes of these maxima depend on the length of the segment analyzed. Thus, canonical element -35 shows a random distribution of W and S pairs when segments of 5 bp or longer are compared but exhibits a detectable maximum for shorter segments. In contrast, periodic patterns in the downstream region are well defined when 4-6 bp segments are analyzed, but are less apparent when the segment length is 3 bp, while dinucleotides exhibit a random distribution. To determine the statistical significance of the maxima observed in the downstream region we applied the clustering procedure to the set of non-promoter sequences taken from the phage DNAs (inset in Fig. 2c) and found that, for 5 bp segments, the maxima exceed the control level by 5-7 standard deviations (SD), a very high value that is comparable in statistical significance to that of the canonical elements in the four-letter alphabet (~22 SD for the -10 region and ~9 SD for the -35 region).
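Expressing a peak height in SD units can be sketched as follows. The text does not spell out how the control mean and SD were obtained; here we assume they are computed over the per-position values of the control profile, and the function name is hypothetical.

```python
from statistics import mean, stdev


def sd_units(observed: float, control_values) -> float:
    """Height of an observed maximum above the control mean, in units of
    the control sample standard deviation (assumed definition)."""
    mu = mean(control_values)
    sigma = stdev(control_values)
    if sigma == 0:
        raise ValueError("control values show no variation")
    return (observed - mu) / sigma
```

A value of 5 or more from such a calculation would correspond to the 5-7 SD range quoted for the downstream maxima.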
The observed maxima are located at different distances from one another. However, starting from element -12, they follow a rather strong periodicity close to 17 bp, which is numerically identical to the optimal length of the spacer between the two major canonical elements and does not reflect the normal helical periodicity of the DNA double helix.
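The spacing between successive maxima can be checked with a few lines of arithmetic. Because the commonly used numbering has no position 0, the peaks are first converted to the compilation's numbering. We pick one representative value where the text gives a pair (+6 for +5/+6, +40 for +39/+40); that choice is ours.

```python
def to_internal(conventional: int) -> int:
    """Commonly used coordinate -> coordinate with the start point as 0."""
    return conventional - 1 if conventional > 0 else conventional


# Representative peak positions quoted in the text (commonly used numbering).
peaks = [-12, 6, 23, 40, 56]

internal = [to_internal(p) for p in peaks]
spacings = [b - a for a, b in zip(internal, internal[1:])]
mean_spacing = sum(spacings) / len(spacings)

print(spacings)       # distances between neighbouring maxima, in bp
print(mean_spacing)   # average period
```

With these representative positions the neighbouring maxima are 16 or 17 bp apart, in line with the period of about 17 bp.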
All critical positions are similar in terms of sequence preferences, exhibiting 3-5 consecutive A/T bp as a dominant motif (an example of a typical dendrogram is presented in Fig. 1). Thus, the downstream region of a promoter may be characterized by a non-random distribution of A/T tracts. In full agreement with this suggestion, the bar histogram displaying the presence of 5 consecutive A/T bp (Fig. 3a) practically reproduces Figure 2c.
Figure 3. The distribution of different sequence elements along the promoter length. (a) (W)5; (b) (W)3(N)8(W)3; (c) (W)3(N)30(W)3. Each bar represents the number of promoters bearing the indicated element at position X ± 1. Arrows indicate critical positions.
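A bar histogram of this kind can be sketched as below. We assume that "element at position X ± 1" means a tract whose 5′-end lies within one position of X, and that all sequences are aligned strings; the function names are hypothetical.

```python
def has_w_tract(seq: str, index: int, k: int) -> bool:
    """True if k consecutive A/T bases start at the given string index."""
    segment = seq[index:index + k].upper()
    return len(segment) == k and all(base in "AT" for base in segment)


def tract_histogram(seqs, k: int = 5, tol: int = 1):
    """For each position x, count the sequences bearing (W)k with its
    5'-end at x +/- tol (assumed reading of 'position X +/- 1')."""
    length = min(len(s) for s in seqs)
    histogram = []
    for x in range(length):
        # Candidate tract starts around x, clipped to the sequence bounds.
        window = range(max(0, x - tol), min(length - k, x + tol) + 1)
        histogram.append(
            sum(any(has_w_tract(s, i, k) for i in window) for s in seqs)
        )
    return histogram
```

Calling `tract_histogram(seqs, k=5)` on the aligned promoter set would give one bar per position; paired elements such as (W)3(N)8(W)3 need a different matcher.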
Approximately 15% of promoters have 5 consecutive A/T bp in at least one critical position (±1) in the downstream region; 24 and 39% possess (W)4 and (W)3, respectively. This exceeds average values typical for non-promoter DNA by at least 5 SD. If A/T tracts occur independently in these regions, one would expect ~15% (100 × 0.39 × 0.39) of promoters to have (W)3 in two positions. We noticed, however, that this value is sometimes higher: for example, 19.2% of promoters have A/T triplets at both 6 ± 1 and 40 ± 1, and 17.5% bear them at both 6 ± 1 and 23 ± 1. Thus, the maxima observed may be conditioned not only by the specific distribution of single A/T tracts, but also by the presence of coupled tracts separated by 17 or 34 nt.
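The independence expectation used above is a simple product of frequencies. A minimal sketch, with a hypothetical function name and the percentages quoted in the text:

```python
def expected_joint_percent(p_a: float, p_b: float) -> float:
    """Percentage of promoters expected to carry both elements if the two
    occur independently with frequencies p_a and p_b (as fractions)."""
    return 100.0 * p_a * p_b


expected = expected_joint_percent(0.39, 0.39)   # about 15%
observed = {"6 and 40": 19.2, "6 and 23": 17.5}  # percentages from the text

for pair, value in observed.items():
    print(pair, "observed minus expected:", round(value - expected, 2))
```

Both observed values lie above the independence expectation, which is the basis for considering coupled tracts.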
Downstream periodicity is in phase with non-canonical promoter regions and the -10 element
To test the latter possibility we analyzed the distribution of linked A/T triplets, separated by a spacer of different length. The length of the spacer (n) varied from 1 to 54. We found that promoters possess approximately equal numbers of A/T triplets separated by a spacer of different length (1 ≤ n ≤ 20). A very small preference (~10 A/T tracts/promoter, rather than 8-9) was only found for tracts separated by approximately one helix turn (n = 8 ± 1). The bar histograms representing the distribution of these and many other paired tracts have maxima near the A/T-rich sites (example in Fig. 3b), revealing no essential regularity in the downstream region. A well-pronounced periodicity was, however, observed for A/T tracts separated by 13/14, ~30 (bar histograms in Fig. 3c) and ~48 bp.
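Locating linked triplets of the form (W)3(N)n(W)3 can be sketched with a lookahead regular expression, so that overlapping occurrences are all reported. The function name is hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
import re


def linked_triplet_starts(seq: str, spacer: int):
    """String indices of the 5'-end of the first triplet in every
    (W)3(N)n(W)3 occurrence, where n = spacer."""
    pattern = re.compile(
        rf"(?=[AT]{{3}}[ACGT]{{{spacer}}}[AT]{{3}})"
    )
    return [m.start() for m in pattern.finditer(seq.upper())]


def end_to_end_distance(spacer: int) -> int:
    """Distance between the 5'-ends of the two triplets."""
    return spacer + 3
```

With this definition a spacer of 14 places the two 5′-ends 17 bp apart, and a spacer of 30 places them 33 bp apart.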
The distribution of the linked A/T triplets located at 17 bp distance (spacer 14) has the major maximum at -29 ± 1. When two (W)3 are distanced by 33 and 51 bp the major peaks occur near -44 and at -63, respectively. In all cases the canonical element -12 also exhibited a well-formed maximum. Thus, at least four sites in the upstream and core promoter regions are in phase with the periodicity found in the downstream areas. The presence of thermodynamically unstable elements in the coding DNA may therefore be conditioned by the same structural requirements of RNA polymerase, which favor its efficient docking at the surface of the promoter.
Correlation of the distribution of A/T tracts along the promoter DNA
If the specific distribution of A/T tracts is important for gene expression, it is reasonable to expect that their presence at critical positions may be of higher significance for some particular promoter groups and less significant for other subsets. To study this possibility we created a series of promoter subsets possessing a single (W)3 or (W)4 at each position from -69 ± 1 to +80 ± 1 and estimated the level of overlap of these groups with the subsets possessing A/T tracts at 6 ± 1, 23 ± 1, 40 ± 1 or 56 ± 1. The lengths of the segments and the deviations were chosen to obtain a statistically reliable pool of promoters in every group (58-282 promoters). Several hundred combinations were analyzed and in most cases the number of common promoters in two different groups was very close to the value theoretically expected for the size of these groups. However, in some cases statistically relevant correlations were found. Their significance (B) was estimated using a non-parametric statistical method (32):
$$B = \frac{n - Nq}{\sqrt{Nq(1 - q)}}$$
where n is the number of promoters common to the two groups analyzed, N is the size of one group and q is the fraction of the overall promoter compilation occupied by the other group. The correlation for the presence of A/T tracts at two different promoter positions was considered significant if |B| > 1.96.
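The significance estimate can be sketched as a short function. We take q as a fraction between 0 and 1, which the square-root term requires; the function and parameter names are hypothetical.

```python
from math import sqrt


def correlation_b(n_common: int, group_size: int, q: float) -> float:
    """B = (n - N*q) / sqrt(N*q*(1 - q)).
    n_common: promoters shared by both groups (n)
    group_size: size of one group (N)
    q: share of the other group in the whole compilation, as a fraction
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must be a fraction strictly between 0 and 1")
    return (n_common - group_size * q) / sqrt(group_size * q * (1.0 - q))


def is_significant(b: float, threshold: float = 1.96) -> bool:
    """Apply the |B| > 1.96 criterion from the text."""
    return abs(b) > threshold
```

For instance, a group of 100 promoters sharing 60 members with a group that makes up half of the compilation gives B = 2.0, just above the threshold.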
We found that the presence of single A/T tracts at critical positions in the transcribed region usually correlates with their appearance at approximately three helix turns upstream or downstream (circles in Fig. 4). A higher than expected frequency of the simultaneous presence of A/T tracts was also observed at shorter distances near neighboring critical positions. On the other hand, the presence of A/T tracts at +6 and +56 showed pronounced correlations with their presence at distal positions +76/+77 and -12/-10, respectively. Higher than expected frequencies of the simultaneous appearance of A/T tracts were also registered for positions +6, +22, +40 and +56, on the one hand, and regions -64/-62, -56/-53 and -43/-41, on the other, but the significance B in these cases was <1.96.
Figure 4. Correlation of distribution of single and linked A/T tracts along the promoter length. Schematic presentation of the data obtained for the subsets possessing single or linked A/T tracts near positions (a) +6, (b) +23, (c) +40 and (d) +56. Symbols indicate promoter positions that have higher (symbol above line) or lower (symbol below line) than expected frequencies of different elements. Circles indicate deviation from the expected value for the distribution of single (W)3 or 4; squares indicate positions with higher than expected values of (W)3(N)30(W)3 in the subsets with (W)3(N)14(W)3 at critical positions; nablas indicate deviation from the expected frequencies in the distribution (W)3(N)14(W)3 and (W)3(N)30(W)3, respectively; triangles indicate the deviation in the expected frequency distribution of (W)3(N)30(W)3. Dashed lines indicate a 33 bp offset from critical positions.
A similar analysis was performed for promoter subsets containing linked triplets separated by 14 or 30 bp in different positions. To avoid overlap and to simplify the calculations, the analysis was made only for regions from -67 ± 3 to X - 20 (if n = 14) or to X - 36 (if n = 30) for X = 6 ± 3, 23 ± 3, 40 ± 3 and 56 ± 3, respectively. A strong correlation for the presence of linked A/T tracts at positions +6, +23 and +56 was found with their presence in the upstream region (Fig. 4).
Thus, some sequence peculiarities in the upstream promoter DNA correlate with structural features of the downstream region. Since upstream DNA enriched with A/T base pairs has the capability to provide additional contact points for the polymerase-promoter complex mediated by the α-subunit (14-16), the proper disposition of A/T tracts in the transcribed region may be involved in modifying these contacts prior to elongation and/or escape.
The relative disposition of A/T tracts supports the significance of the 17 bp period
This newly discovered feature in the structural organization of promoters is characterized by an unusual periodicity (16/18 bp) which does not correspond to any regular helical feature of DNA. This value may be interpreted as half of a 33/34 bp periodicity, which corresponds to approximately three helix turns of DNA (an 11.2 bp periodicity was recently documented for bacterial DNA sequences; 33). In this case overlap of two periodic patterns shifted relative to each other by half of the period may yield the observed specific distribution of A/T tracts. If this is true, one could expect that the relative disposition of A/T tracts should have a preferred distance of ~33 bp. However, no maxima were observed when the distances between all possible pairs of A/T tracts were examined (Fig. 5a, curves 1-3). We found, nonetheless, that the first A/T triplet is most frequently located 6 bp from another (A/T)3 (Fig. 5b, curve 1) or (A/T)4. A longer distance (6-11 bp) separates two neighboring (A/T)4. A second A/T triplet appears at a preferred position of 17 bp from (A/T)3 (Fig. 5c, curve 1) or 18 bp from (A/T)4-6, and a third is most often located at position 28/29 (Fig. 5d, curve 1). The character of the distributions shown in Figure 5 remained the same when sequences upstream (-70 to 0 bp) or downstream (0 to 150 bp) from the start point of transcription were analyzed independently. However, control sets of non-promoter sequences exhibited different distributions of preferred distances (Fig. 5, curves 4), which roughly followed a one helix turn periodicity.
Figure 5. Typical distances between A/T tracts. (a) Distances between all pairs of A/T tracts of different length; (b-d) typical distances between (W)3 and the first, second and third S(W)3, respectively. The lengths of the A/T tracts were 3 (curves 1 and 4), 4 (curves 2) and 5 (curves 3). Curves 1-3 represent the data obtained for the promoter compilation; curves 4, the distance distribution for the control set composed of E.coli non-promoter sequences. S before (W)3 was added with the aim of avoiding uninformative background due to the presence of long tracts. Distance was estimated between 5′-ends of the tracts.
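The distance distributions of panels (b-d) can be approximated with the sketch below. Two points are assumptions of ours: the 5′-end of an S(W)3 element is taken as the first W after the S, and only elements downstream of the reference (W)3 are counted. The function name is hypothetical.

```python
import re
from collections import Counter

W3 = re.compile(r"(?=[AT]{3})")        # any three consecutive A/T bases
SW3 = re.compile(r"(?=[GC][AT]{3})")   # A/T triplet preceded by G or C


def kth_neighbor_distances(seq: str, k: int) -> Counter:
    """Histogram of distances from each (W)3 to the k-th following S(W)3,
    measured between the 5'-ends of the A/T tracts."""
    seq = seq.upper()
    w3_starts = [m.start() for m in W3.finditer(seq)]
    # Shift by 1 so the coordinate points at the first W, not at the S.
    sw3_starts = [m.start() + 1 for m in SW3.finditer(seq)]
    histogram = Counter()
    for a in w3_starts:
        following = [b for b in sw3_starts if b > a]
        if len(following) >= k:
            histogram[following[k - 1] - a] += 1
    return histogram
```

Summing the counters returned for every sequence in a set, with k = 1, 2 and 3, would give curves comparable in form to panels (b), (c) and (d).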
Thus, (i) the relative disposition of A/T tracts supports the significance of the 17 bp periodicity; (ii) promoter DNA, including proximal transcribed areas, is enriched in A/T tracts (compare absolute values in Fig. 5) which are disposed at closer preferred distances than in random DNA (compare the positions of peaks in Fig. 5). Hence, the specific disposition of A/T tracts provides an additional way to distinguish promoter DNA from other regions. This conclusion is supported by the fact that any alignment of non-promoter fragments according to the presence of A/T tracts does not increase the frequency of their appearance in other regions.
DISCUSSION
Our principal conclusion is that a non-random distribution of A/T tracts occurs in downstream promoter regions. The periodicity of this distribution is the same numerically as the optimal length of the spacer between the two canonical hexamers -35 and -10 and is supported by a specific disposition of every second A/T triplet. Promoter DNA as well as proximal transcribed regions exhibit a specific disposition of A/T tracts that is not typical of non-promoter DNA (Fig. 6). Thus, irrespective of the molecular mechanism of their functional role (structural perturbation or target site for interaction with RNA polymerase or some cellular factor), this peculiarity may potentially be used as an additional distinguishing parameter in promoter search algorithms.
Figure 6. Representative promoters with (a) and without (b) the described A/T periodicity in the downstream region. Set (a) contains all promoters from our compilation, which have at least three consecutive A/T base pairs with their 5′-ends at +6 ± 1, +23 ± 1, +40 ± 1 and +56 ± 1. Set (b) shows promoters which have no A/T tracts at the critical positions (±3). A/T base pairs which belong to the first A/T run in (W)3(N)n(W)3 are underlined (if n = 13-15) or double underlined (if n = 28-31). A/T runs near the peak positions in their distribution (-67/-50, -48/-40, -31/-25, -14/-7, +4/+11, +21/+28, +38/+45 and +54/+61) are in bold. The start point of transcription is designated by a lower case letter. YR dinucleotides at -1 are shown in bold italic. Genes where initiating codons for protein synthesis were identified are marked by asterisks. Initiating codons for translation if situated within the presented region are marked by bold lower case letters and shaded. The well-studied promoters rrnBP1 and cysG are added to the respective promoter groups. (c) Non-promoter fragments taken at random from E.coli non-promoter DNA. Designations as in (a) and (b).
Approximately 20% of promoters have at least three consecutive A/T base pairs in the vicinity (±3) of all critical positions in the transcribed region. The corresponding genes belong to different functional groups and are derived from different genomes (E.coli and different phages). Thus, this new sequence motif is not restricted to any particular category of genes. Promoters of this type with A/T runs at all critical positions (±1) are presented in Figure 6a. The representative promoter galP1 belongs to the small group of 'extended -10 region' promoters (12,13) that have an additional contact point at dinucleotide TG at -15 mediated by the σ-subunit of the polymerase and also interact with the enzyme in the upstream regions. Most of the promoters shown in Figure 6a have A/T tracts upstream of -60, in the regions -48/-40 and -31/-25 and near position -12, which are in phase with the downstream periodicity (see also Fig. 4). Another biochemically characterized 'extended -10 region' promoter, cysG, has downstream A/T tracts in close proximity to the preferred positions coupled to other tracts up to position -53. Taking into account the high affinity of RNA polymerase for A/T-rich DNA, it can be suggested that properly positioned A/T tracts may be used by the enzyme to modulate certain tight contacts during promoter clearance. The correlation that we observe between the presence of A/T tracts in the upstream and in the downstream promoter regions supports this possibility. However, rrnBP1, one of the strongest E.coli promoters, which forms additional contacts with the RNA polymerase α-subunit in a region of well-defined A/T-rich sequence around -50/-40 (34), is depleted of A/T tracts downstream from the transcription start point. This promoter has an A/T triplet at position +20 in phase with the -10 element and no other in the region from -8 to +66. Figure 6b shows the set of promoters having no A/T triplets at the preferred positions (±3). In these promoters the presence of both single and linked A/T runs upstream from -60 is clearly lower than in promoters with well-expressed downstream periodicity (set a). Nevertheless, enrichment of these promoters with linked A/T tracts is still higher than that of non-promoter DNA (Fig. 6c).
The functional significance of the revealed promoter feature remains to be elucidated by systematic analysis. Nevertheless, some speculations might be considered on the basis of existing experimental data. Thus, if one assumes that RNA polymerase may interact with A/T runs in the downstream region, it would be reasonable to expect some trapping of the enzyme at these sites and transcription pausing. One example of this type is RNA synthesis from promoter tyrT, whose expression may be partially limited at the level of transcription attenuation (35). This promoter has well-defined upstream periodical patterns and A/T tracts starting from +9 and +26. A mutant promoter with substitutions in the +26 A/T run and a G/C tract around -38 has elevated levels of expression in vivo (36). A more compelling argument that implicates A/T tracts in polymerase escape is found in the malT promoter. The downstream sequence of malT is extremely A/T rich and contains A/T tracts at essentially all the conserved positions (Fig. 6). Initiation at this promoter requires the transcription activator CRP to facilitate the promoter clearance step (37).
Transcription pausing often takes place in the proximal transcribed regions (20,21,38-41). In some cases this process is mediated by interactions between base-paired secondary structures formed in the nascent RNA and elongation complex (40,41), while many other cases require a different explanation(s). It was recently demonstrated, for example, that the major sequence element that induces pausing (at +16/+17 or +26/+27) in the initially transcribed region of promoter λPR′ is an A/T-rich sequence similar to the -10 consensus element (21). Another example, the promoter region for asnC, has properly positioned A/T tracts at +6, +23, +40 and +70, with a gap at +56, and has a strong pause site at +61 (38). Our preliminary attempts to analyze the environment of the pause sites in seven other promoters led to the observation that A/T tracts are usually present upstream or downstream from these sites, while their overall distribution along the length of the promoter does not follow the conserved periodicity. DNA transcription termination correlates with oligo(dA:dT) tracts in another enzymatic system, reverse transcriptase (42). These share with the downstream A/T tracts of promoters the property of adopting a preferred narrow minor groove but the latter, in which the preferred sequences are mixed rather than homopolymeric, would be expected to be more flexible.
It is clear that in some cases newly described elements in the transcribed parts of genes overlap with coding sequences. Variability in codon usage probably allows both the maintenance of transcription signals and the required amino acid sequence in the N-terminal part of the respective proteins in the same way that structural signals may be reconciled with the coding requirements of DNA in eukaryotic nucleosomes (43). However, taking into account that translation in bacterial systems is initiated on the nascent mRNA and takes place simultaneously with transcription, there are a variety of possibilities for interference between the two machineries. For example, transcription attenuation at particular sites may serve as a distinguishing factor for choosing alternative initiation codons by the ribosome at the beginning of the gene. On the other hand, paused RNA polymerase may provide steric hindrance for ribosome movement in some specific cases, thus putatively facilitating dynamic reprogramming of translation by frameshifting (44). It should be noted, however, that the length of the leader sequence is highly variable in different genes and may be as long as 500 nt (45). Initiating codons known to be used are marked in Figure 6; however, one can find other initiating codons coupled with Shine-Dalgarno-like elements situated closer to the promoter regions. Thus the proportion of genes which have overlapping transcription and translation signals is probably not very high.
Phasing of periodically distributed A/T tracts with canonical element -12 and non-canonical regions near -29 and -45 allows the possibility that these signals might be used by RNA polymerase for initial docking at the promoter region in the optimal orientation for specific interaction. In this case the newly described property may be one of a large spectrum of distributed promoter-specific features collectively recognized by the transcription machinery.
Periodic patterns found in the transcribed parts of the genes with respect to their statistical significance may be characterized as one of the strongest promoter-specific features. We have tried to analyze the nature of this peculiarity and described some general features. However, many aspects remain unclear and require further analysis and new ideas for full understanding.
ACKNOWLEDGEMENTS
We thank A. Ishihama, N. Fujita and T. Gaal for their comments and suggestions. These studies were supported by the Russian Foundation for Basic Research (O.N.O. and A.A.D.).
REFERENCES
*To whom correspondence should be addressed. Tel: +7 095 923 74 67; Fax: +7 096 779 05 09; Email: ozoline{at}venus.iteb.serpukhov.su
Copyright© Oxford University Press, 1999.