ABSTRACT
The basal elements of class II promoters are: (i) a -30 region, recognized by TATA binding protein (TBP); (ii) an initiator (Inr) surrounding the start site for transcription;
(iii) frequently a downstream (+10 to +35) element. To determine the sequences
that specify an Inr, we performed a saturation mutagenesis of the Inr of the
SV40 major late promoter (SV40-MLP). The transcriptional activity of each mutant was determined both in vivo and in vitro. An excellent correlation between transcriptional activity and closeness of fit to the optimal Inr sequence, 5'-CAG/TT-3', was found to exist both in vivo and in vitro. Employing a neural network technique we generated from these data a weight
matrix definition of an Inr that can be used to predict the activity of a given
sequence as an Inr. Using saturation mutagenesis data of TBP binding sites we
likewise generated a weight matrix definition of the -30 region element. We conclude the following: (i) Inrs are defined by the
nucleotides immediately surrounding the transcriptional start site; (ii) most, if not all, Inrs are recognized by
the same general transcription factor(s). We propose that the mechanism of transcription initiation is fundamentally conserved, with the formation of pre-initiation complexes involving the concurrent binding of general transcription factors to the -30, Inr and, possibly, downstream elements of class II promoters.
INTRODUCTION
Considerable progress has been made toward elucidating the mechanism of
transcription initiation by RNA polymerase II (1-5). For class II genes of higher eukaryotes containing an A+T-rich sequence (TATA box) ~30 bp upstream of the initiation site, formation of the pre-initiation complex is believed to occur in a stepwise fashion.
The first step is recognition of this element by TATA binding protein (TBP), a
DNA binding component of TFIID. This factor then recruits the other general
transcription factors to form the pre-initiation complex (PIC). Lastly, the addition of ATP promotes an open
structure in the promoter region of the DNA and synthesis of RNA begins (6-8).
However, many class II genes lack an obvious TATA box 30 bp upstream of their
transcription start site. It has been proposed that the formation of pre-initiation complexes on these TATA-less promoters may occur via alternative pathways involving other
sequence elements and factors that functionally replace the TATA box and TBP to
accurately select the initiation site (9-12).
One candidate for an alternative positioning element is the sequence immediately
surrounding the transcription initiation site, referred to as the initiator
(Inr). Numerous workers have shown that this element is genetically important
for efficient transcription from a variety of promoters (for example 13-18). Several groups have identified proteins that bind Inrs (19-23). However, Means et al. (24) and Javahery et al. (25) have reported that the binding of HIP1/E2F-1 and YY1 to the Inrs of the dihydrofolate reductase (dhfr) and adeno-associated virus (AAV) P5 promoters respectively does not correlate with the transcriptional activities of these promoters.
On the other hand, Usheva and Shenk (26) found that YY1, TFIIB and polymerase II were sufficient for basal transcription from the AAV P5 promoter. Thus it remains controversial whether any Inr binding
proteins are truly novel factors that direct the formation of initiation
complexes or simply regulatory factors that act by binding to sites situated at
or near Inrs.
We have been investigating the mechanism of transcription initiation from the
naturally occurring, TATA-less major late promoter of simian virus 40 (SV40-MLP). This promoter contains three genetically important proximal
sequence elements located at approximately -30, +1 and +30 relative to the start site of transcription (27 and citations therein). The -30 region element functions as a binding site for TBP despite lack of a consensus TATA box sequence and it cooperatively interacts with the Inr to determine the transcriptional start site (28). The +30 region element binds a cellular factor, DAP (29), that may play a role in anti-termination of transcription (T.E.Eisenbraun, F.Zuo, R.J.Kraus and J.E.Mertz, manuscript in preparation). The +1 region element binds several members of the steroid/thyroid hormone receptor superfamily (e.g. hERR1 and TRα1/RXRα; 30,31); however, these factors act as sequence-specific repressors, not activators, of late transcription when viral
template copy number is low.
To precisely determine the nucleotides that define an Inr we performed a
saturation mutagenesis of the bases surrounding the +1 site of the SV40-MLP. We found that an excellent correlation exists between transcriptional
activity and closeness of fit to the optimal Inr sequence, 5'-CAG/TT-3'. Using these data we derived a weight matrix that can
be used to predict the activity of a sequence as an Inr. We conclude that
transcription initiation at the SV40-MLP probably occurs via a mechanism similar, if not identical, to that
used by TATA box-containing promoters, rather than by a mechanism involving a novel
initiator factor. We propose that transcription initiation is fundamentally
conserved among class II promoters, with the functions of the Inr being to bind
TFIID/TFIIB/pol II/TFIIF and to serve as a site at which RNA polymerase II can
initiate transcription.
MATERIALS AND METHODS
Plasmid DNAs were constructed by standard recombinant DNA techniques (32). Plasmid pSV1773(WT) contains the pseudo-wild-type SV40 used in all in vivo experiments except where indicated otherwise. It is a variant of pSVS (33) lacking SV40 nucleotides (nt) 1629-1635 inclusive (34).
Plasmid pSV1790 is a derivative of pSV1773 in which SV40 nt 319-336 inclusive have been replaced with the sequence 5'-CTGGGCAGGTCTCGAGACCTGCCCAG-3' (28), containing two BspMI sites. Cleavage with BspMI yields a vector containing non-complementary single-stranded ends that can be used for cassette mutagenesis of SV40 nt 319-336. Plasmids pSV4501(-6/-4TCT), pSV4502(-3/-1TAA), pSV4503(+1/+3GGG), pSV4504(+4/+6GAC), pSV4505(+7/+9CCT), pSV4506(+10/+12GCG) and pSV4528(-1C,+2G) were generated by insertion of appropriate synthetic oligonucleotides into BspMI-cut pSV1790 DNA. Plasmids pSV4507(-3A), pSV4516(+1G), pSV4517(+1C), pSV4518(+1T), pSV4519(+2A), pSV4520(+2G), pSV4521(+2C), pSV4523(+3G), pSV4524(+3C), pSV4526(+4G) and pSV4527(+4C), each containing the indicated single base pair change, were generated likewise.
Plasmids pSV4508(-3C), pSV4509(-3T), pSV4510(-2A), pSV4511(-2G), pSV4512(-2C), pSV4513(-1A), pSV4514(-1G), pSV4515(-1C), pSV4522(+3A), pSV4525(+4A), pSV4529(-3C,+3G), pSV4530(+2C,+4A,+5A), pSV4531(-4A,-3C,+3A,+4A) and pSV4538(-4A,-3C,+3A) were generated by a variation of the mutagenesis procedure of Chen and Struhl (35), starting with a pair of 55 bp oligonucleotides synthesized to contain 2% random bases in each of the nucleotides corresponding to SV40 nt -4 to +4 relative to the major late initiation site. These oligonucleotides were annealed via their complementary 5'- and 3'-ends, end filled with Klenow polymerase, cleaved with KpnI and ligated into KpnI (SV40 nt 294)- and NaeI (SV40 nt 345)-cut pSV1773 DNA.
Plasmid pXS13 is a derivative of pSVS lacking the 72 bp repeat region (i.e. SV40 nt 115-272 inclusive) of the SV40 promoter (33). Plasmids pSV4532, pSV4533, pSV4536 and pSV4537 were constructed by substitution of the smaller KpnI-EcoRV (SV40 nt 294-770) fragment of plasmids pSV4503, pSV4516, pSV4518 and pSV4528 respectively for the corresponding fragment in pXS13.
Transcription reactions (50 µl) were performed at 26°C with 0.5 µg circular plasmid DNA as template and 16 µl HeLa cell nuclear extract (18-20 mg protein/ml). At these protein:DNA ratios repression of the SV40-MLP by initiator binding proteins (IBPs) does not occur (30). The quantities of the 5'-ends of the SV40 late and early RNAs were analyzed by primer extension as previously described (34). Synthetic oligonucleotides corresponding to SV40 nt 394-369 and nt 5136-5160 served as primers for detecting the late and early RNAs respectively. The relative amount of SV40 late RNA synthesized from the MLP of each mutant was determined by normalization to both: (i) the amount of SV40 early RNA synthesized from the same plasmid in the same reaction; (ii) the amount of SV40 late RNA synthesized in a parallel reaction from the MLP of WT SV40.
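The double normalization described above amounts to a ratio of ratios. The following is a minimal sketch of that arithmetic, assuming the primer extension bands have already been quantified as background-subtracted intensities; the function name, argument names and the numbers in the usage line are hypothetical and are not taken from the experiments.

```python
def relative_late_activity(mut_late, mut_early, wt_late, wt_early):
    """Return late RNA from a mutant MLP relative to WT.

    Each late signal is first divided by the early signal from the same
    reaction (internal control); the mutant ratio is then divided by the
    WT ratio from the parallel reaction. All inputs are assumed to be
    background-subtracted band intensities in arbitrary units.
    """
    for name, value in (("mut_early", mut_early),
                        ("wt_late", wt_late),
                        ("wt_early", wt_early)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return (mut_late / mut_early) / (wt_late / wt_early)


# Invented intensities, for illustration only.
print(relative_late_activity(mut_late=50.0, mut_early=100.0,
                             wt_late=100.0, wt_early=100.0))
```

Dividing by the early RNA signal first corrects for reaction-to-reaction differences in template recovery, so the final value depends only on the late promoter.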
All assays were performed utilizing CV-1PD cells grown in Dulbecco's modified Eagle's medium supplemented with 5% fetal bovine serum. Viral recombinant DNA (3.5 µg/10 cm dish) was excised from the vector sequences and ligated to form monomer circles prior to transfection. The cells were transfected and whole cell RNA was isolated 42 h post-transfection as described previously (34). By these post-transfection times expression of the SV40-MLP is unaffected by IBPs (30,31). The relative amount of steady-state SV40 late RNA synthesized from each mutant was determined by quantitative primer extension analysis as described above, with normalization to both: (i) the amount of SV40 late RNA present in cells transfected in parallel with WT SV40 DNA; (ii) the relative amount of replicated viral DNA present in these cells (determined by quantitative Southern blot analysis; 36).
We used a neural network learning algorithm as described by Rumelhart et al. (37,38), along with our experimentally determined data, to generate a weight matrix definition of an Inr. In brief, a neural network with 4 × 8 input units and 1 output unit was created to represent the possible nucleotides located at positions -4 to +4 relative to the major late transcription start site. Each nucleotide position was coded as four bits, with each bit representing the presence [1] or absence [0] of the nucleotide A, C, G or T at that position (39,40). The network was trained on our in vivo data of the 21 single base pair Inr mutants (Fig. 3B) and mutant SV4538(-4A,-3C,+3A) (relative activity 0.3). In mutant SV4525(+4A) two nearby start sites of similar intensity were used (+1 and +4); the output of each of these Inrs was doubled to compensate for apparent competition between them. Where transcription from a mutant promoter was observed to initiate predominantly at a position other than +1 [i.e. mutant SV4513(-1A), which initiates from -1; relative activity 1.0] the 8 bp input sequence was shifted such that the experimentally determined start site remained aligned. The output of the neural network was offset by 0.010 and normalized to a scale ranging from 0.010 to 0.869. In addition, 24 random nucleotide sequences, pre-screened to eliminate any sequences resembling a functional initiator, were included in the training set as examples of nulls. Back propagation was
performed without hidden units using the function $y = 1/[1 + e^{-2(x + b)}]$, a learning parameter of 0.2 and a momentum parameter of 0.8; b signifies a `bias' term representing the weight for an additional input unit which was always set to unity.
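The training scheme just described (four bits per position, no hidden units, a logistic output with gain 2, learning parameter 0.2, momentum 0.8) can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the original program: x is taken to be the sum of the weights of the active input units, updates are applied after each example, and the sequences and activities in `TOY_EXAMPLES` are invented placeholders rather than the measured mutant data. The offset and rescaling of the output and the random null sequences are omitted.

```python
import math
import random

BASES = "ACGT"


def encode(seq):
    """Code each position as four bits (A, C, G, T): 1 = present, 0 = absent."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]


def output(weights, bias, bits):
    """y = 1/[1 + e^(-2(x + b))], with x the weighted sum of the input bits."""
    x = sum(w * v for w, v in zip(weights, bits))
    return 1.0 / (1.0 + math.exp(-2.0 * (x + bias)))


def mean_squared_error(weights, bias, data):
    """Average squared difference between target and predicted activity."""
    return sum((t - output(weights, bias, bits)) ** 2 for bits, t in data) / len(data)


def train(examples, rate=0.2, momentum=0.8, epochs=2000, seed=0):
    """Back propagation for a net with no hidden units (delta rule with momentum)."""
    rng = random.Random(seed)
    data = [(encode(seq), target) for seq, target in examples]
    n_inputs = len(data[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    bias = 0.0
    step_w = [0.0] * n_inputs  # previous weight changes, reused by the momentum term
    step_b = 0.0
    for _ in range(epochs):
        for bits, target in data:
            y = output(weights, bias, bits)
            # Gradient of the squared error through the gain-2 logistic.
            delta = (target - y) * 2.0 * y * (1.0 - y)
            for i, v in enumerate(bits):
                step_w[i] = rate * delta * v + momentum * step_w[i]
                weights[i] += step_w[i]
            step_b = rate * delta + momentum * step_b  # the bias input is always 1
            bias += step_b
    return weights, bias


# Invented 8 bp windows (-4 to +4) and activities; NOT the measured data.
TOY_EXAMPLES = [("GGTCAGTT", 0.80), ("GGTAAGTT", 0.40),
                ("GGTCGGTT", 0.05), ("ACGTACGT", 0.01)]

weights, bias = train(TOY_EXAMPLES)
for seq, target in TOY_EXAMPLES:
    print(seq, target, round(output(weights, bias, encode(seq)), 3))
```

Because there are no hidden units, the 32 trained weights can be read directly as a 4 × 8 weight matrix, one column per position and one row per nucleotide.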
A similar analysis was performed for the TATA box. In this case we used a 4 × 7 input unit matrix (coded as above). Training was performed using the data sets of Wobbe and Struhl (41) and Mukumoto et al. (42) for cell-free transcription in HeLa cell nuclear extracts of mutants of the Saccharomyces cerevisiae his3 and Arabidopsis TC7 promoters respectively. To create a single training set these two sets of data were first aligned by eye and the alignments confirmed by the method of Bucher (43); their relative outputs were calibrated by first training on each dataset individually. Output was offset by 0.010 and normalized to a scale ranging from 0.10 to 0.70. In addition, 39 random nucleotide sequences, pre-screened to eliminate any sequences resembling a functional TATA box, were included as examples of nulls. Back propagation was performed using a hidden layer containing one neuron. A learning parameter of 0.2 and a momentum parameter of 0.8 were used for both the hidden and output layers. The hidden
layer employed the function $y = (e^{x/1.5} - e^{-x/1.5})/(e^{x/1.5} + e^{-x/1.5})$; the output layer used the function $y = 1/[1 + e^{-2(mx + b)}]$. The single hidden unit was used to appropriately adjust the gain and threshold parameters of the output function.
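The two activation functions above can be combined into a forward pass, sketched here. The hidden-layer function is algebraically tanh(x/1.5). Several details are assumptions made only for illustration: that the x fed to the output unit is the hidden unit's output, that the hidden unit has its own bias term (`hidden_bias`), and the weights in `TOY_WEIGHTS`, which are invented placeholders rather than trained values.

```python
import math

BASES = "ACGT"


def hidden_activation(x):
    """(e^(x/1.5) - e^(-x/1.5)) / (e^(x/1.5) + e^(-x/1.5)), i.e. tanh(x/1.5)."""
    return math.tanh(x / 1.5)


def output_activation(h, m, b):
    """y = 1/[1 + e^(-2(m*h + b))]; m and b act as gain and threshold."""
    return 1.0 / (1.0 + math.exp(-2.0 * (m * h + b)))


def predict(seq, weights, hidden_bias, m, b):
    """Forward pass: 4 x 7 one-hot inputs -> one hidden unit -> one output unit.

    `weights[i][base]` is the weight from position i, nucleotide `base`,
    to the hidden unit.
    """
    if len(seq) != len(weights):
        raise ValueError("sequence length must match the weight matrix")
    x = hidden_bias + sum(weights[i][base] for i, base in enumerate(seq))
    return output_activation(hidden_activation(x), m, b)


# Placeholder matrix: +0.5 for each match to TATAAAA, -0.5 otherwise.
# These values are invented and are NOT trained weights.
TOY_WEIGHTS = [{base: (0.5 if base == best else -0.5) for base in BASES}
               for best in "TATAAAA"]

for seq in ("TATAAAA", "TATAGAA", "GCGCGCG"):
    print(seq, round(predict(seq, TOY_WEIGHTS, hidden_bias=0.0, m=2.0, b=0.0), 3))
```

With a single hidden unit the network still ranks sequences by one weighted sum; the hidden unit only reshapes how that sum maps onto the output scale.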
RESULTS
Previously Ayer and Dynan (27) showed that a linker-scanning substitution mutant spanning nt -5 to +7 relative to the transcriptional initiation site of the SV40-MLP was defective in the synthesis of RNA from this promoter.
To identify more precisely the bases that define the Inr of this promoter we constructed a set of cluster point substitution mutants spanning this region of the SV40-MLP (Fig. 1B). Each mutant was assayed for transcriptional activity both in a cell-free transcription system (Fig. 1C) and in vivo (Fig. 1E). These data, summarized in Figure 1B, indicated that only mutations in the bases at or immediately surrounding the start site (i.e. nt -3 to +3) significantly affect transcription from nt +1. Taken together with the finding of Ayer and Dynan (27) that mutants spanning nt +10 to +19 and -25 to -3 are not defective in transcription from the SV40-MLP, we conclude that only the bases immediately surrounding
the initiation site are critical for defining initiator function in the context
of this promoter.
To further define the Inr motif of the SV40-MLP we performed a saturation mutagenesis of nt -3 to +4. In the cell-free transcription system mutations at either nt -3 or -2 had little effect on either the efficiency or location of the major start site of transcription (data not shown and Fig. 2A, lanes 2-5; summarized in Fig. 2B). On the other hand, the nucleotide in the -1 position had major effects on transcription: a T→A change resulted in a dramatic reduction in +1 initiated transcription, as well as an increase in the frequency of initiation from other nearby sites (Fig. 2A, lane 6 versus 2); substitution of a G led to a significant reduction in transcription initiation (Fig. 2A, lane 7); placement of a C in this position led to a 7-fold increase in +1 initiated transcription (Fig. 2A, lane 8 versus 2). The sequence of the initiating nucleotide (+1) was equally critical: alteration of the A to any other base resulted in significant reduction in transcription initiation from this site.
Except for a few quantitative differences, similar results were obtained for transcription of these mutants after transfection into monkey cells (Fig. 3). Again, changes in the nucleotides outside of nt -1 to +3 affected transcription at most 2-fold (summarized in Fig. 3B). Transcription from the +1 site was lower in the nt +4 T→A substitution mutant because of the creation of a new functional Inr (data not shown). Similarly, the trends and qualitative effects of the mutations in the -1 to +3 positions were nearly identical to those observed in vitro (Fig. 3B versus 2B). For example, the T→A change at nt -1 produced nearly wild-type levels of RNA, but starting from the -1 position (Fig. 3A, lane 6).
Three significant quantitative differences from the in vitro data were observed: (i) the WT genome produced amounts of +1 initiated RNA similar to the substitution mutant containing a G at the +2 position; (ii) the nt -1 T→C substitution resulted in only a 2-fold increase in transcription; (iii) the nt +1 A→C substitution decreased RNA synthesis only 20%.
Overall, while the trends were similar, the extents of the quantitative effects of the sequence alterations on transcription in vivo were less dramatic than they were in vitro.
We conclude from these experimental data that an Inr in the context of the SV40-MLP is largely defined by nt -1 to +3 relative to the transcription initiation site, with the nucleotides at positions -1, +1 and +3 having the most profound effects on determining the strength of the sequence as an Inr. If the Inr with maximal activity is the one in which each of these positions contains the nucleotide that individually yielded maximal activity, we predict that the functionally optimal Inr sequence should be 5'-CAG/TT-3'. This prediction was confirmed by the analysis of mutant pSV4528(-1C,+2G) in both our cell-free (Fig. 7B below; summarized in Fig. 7A) and in vivo assay systems (data not shown): 7- and 3-fold respectively more late RNA was synthesized from this mutant than from the wild-type promoter.
Genetic elements can also be identified by comparison of functionally similar sequences. By mathematically analyzing the sequences of the Inrs of 502 naturally existing polymerase II promoters Bucher (43) concluded that the Inr consensus sequence is 5'-TCAGT-3', with initiation occurring at the A and the dinucleotide CA being most prevalent. We also concluded that this consensus sequence is optimal for transcriptional activity.
Using the in vivo data on the transcriptional activities of our complete set of Inr point mutants and a neural network learning algorithm (37) we generated a weight matrix definition of an Inr that indicates true transcriptional activity, rather than frequency of occurrence in nature (Fig. 4A). This matrix indicates the weights of a representative net connecting the input units to the output unit. The columns correspond to the positions in the sequence relative to the transcription initiation site; the rows correspond to the nucleotide at each of these positions. By summing the weights shown in this matrix that correspond to the 8 bp of a given sequence and plugging this sum into the formula presented in the legend to Figure 4 one can calculate a predicted relative activity for that sequence as an Inr. The excellent convergence we obtained between predicted and actual transcriptional activities (open circles in Fig. 4B) indicates that this matrix successfully assimilated the SV40-MLP point mutant data set on which the net was trained.
Repeated runs of this algorithm with this data set always generated matrices whose general features were quite similar, but not identical, to that shown in Figure 4A (data not shown). Noteworthy is the fact that our weight matrix has a window of 8 nucleotide positions, even though we had not systematically mutagenized the base in the -4 position of the Inr. Because some of our mutants exhibited alterations in the initiating nucleotide [e.g. mutant pSV4513(-1A); Figs 2 and 3, lane 6], some input data were, nevertheless, available with alterations involving this position. We chose to employ an 8 bp window because we needed `hidden units' (i.e. additional units between the input and output units) to produce a good convergence with a 7 bp window (data not shown).
Having trained the neural net on experimentally derived data, we next used this weight matrix as described in the legend to Figure 4 to predict the relative Inr activity of any given nt -4 to +4 sequence. We initially compared the predicted and in vivo determined transcriptional activities of several SV40 mutants that contained multiple mutations in the MLP Inr (data not shown). These mutants, SV4529(-3C,+3G), SV4530(+2C,+4A,+5A) and SV4531(-4A,-3C,+3A,+4C), had not been part of the original data set used to train the net. Nevertheless, the relative experimentally determined activities of each of these mutant promoters (0.12, 0.09 and 0.3 respectively) correlated reasonably well with their predicted activities (0.11, 0.04 and 0.20 respectively).
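For readers who wish to quantify agreement of this kind, a Pearson correlation coefficient can be computed as in the short sketch below, which uses only the three measured and predicted activities quoted in the preceding paragraph. The helper `pearson_r` is written out here for illustration; with only three points the resulting coefficient carries little statistical weight and is not a value reported in this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values quoted in the text for SV4529, SV4530 and SV4531.
measured = [0.12, 0.09, 0.3]
predicted = [0.11, 0.04, 0.20]
print(round(pearson_r(predicted, measured), 2))
```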
To test more generally the value of our weight matrix we also examined the predicted versus experimentally determined transcriptional activities of mutants in the Inrs of several other polymerase II promoters (Fig. 4B). For the mdr-1 promoter a correlation of r = 0.68 was obtained. The mdr-1 promoter also has a genetically defined tripartite proximal sequence element structure, as well as a weak TATA box (44). In the case of the TATA-less TdT promoter a reasonable correlation (r = 0.57) between calculated and actual transcriptional activity was observed; however, for several of the TdT mutants the experimentally determined activity was somewhat lower than predicted.
To examine the predictive value of our Inr weight matrix for a promoter containing a strong TATA box we compared the predicted transcriptional activities of Inr mutants of a hybrid promoter consisting of the TATA box of the hsp70 promoter linked to the initiation site region of the Ad-MLP with the activities of these mutants in a cell-free transcription system (45). Once again, mutations in the nt -3 to +3 region had the qualitative effects predicted (r = 0.75; Fig. 4B). However, as expected, the quantitative effects on transcription were somewhat smaller than those observed with the promoters containing a weak TATA box.
Thus we conclude that the weight matrix presented here can be used to predict
the qualitative effects on transcriptional activity of alterations in the
sequence of the -4 to +4 region of a promoter when this region does not also contain an
overlapping binding site for a non-general transcription factor. When the -30 region of the promoter has a weak TATA box this matrix can also
be used to predict quantitative effects.
A similar neural network approach was employed to generate an experimentally determined weight matrix definition of the -30 region (TBP binding site) of polymerase II promoters (Fig. 5A). In this case we used as our data sets the in vitro transcriptional analyses performed by Wobbe and Struhl (41) and Mukumoto et al. (42) of -30 region mutants of the his3 and TC7 promoters respectively. The columns in Figure 5A correspond to the alignment of the given -30 region sequence with respect to the experimentally derived optimal -30 region sequence, 5'-TATAAAA-3'; the rows correspond to the nucleotides at these positions. The excellent correlation obtained between predicted and experimentally derived transcriptional activities for these mutants (depicted by open circles in Fig. 5B) indicates that the matrix successfully assimilated these training data.
Using this -30 region weight matrix as described in the legend to Figure 5A, we tested whether one could predict the ability of a given sequence to function as the -30 region element of a polymerase II promoter (Fig. 5B). Correlations of r = 0.74, 0.98 and 0.69 were obtained for the predicted versus experimentally determined transcriptional activities of mutants in the -30 region of the human β-globin (46), SV40 late (47) and SV40 early (48) promoters respectively. Remarkably, a correlation of 0.98 was observed between predicted transcriptional activity and experimentally determined binding of recombinant human TBP to a set of eight sequences that included the -30 regions of the promoters for Ad-ML, human dhfr, rpL32, SV40-ML, TdT and IRF-1 (28). Thus we conclude that the weight matrix presented in Figure 5A can be used to predict the ability of a given sequence both: (i) to bind TBP; (ii) to function as the -30 region element of a polymerase II promoter.
Figure 6 shows the relative activities predicted from our weight matrices of the -30 regions and Inrs of a variety of naturally occurring polymerase II promoters. A striking finding is that the activities predicted for either of these two basal elements vary greatly from one promoter to the next, spanning the complete range from very strong to quite inactive. In addition, the combined predicted activities of these two basal sequence elements on any given promoter also span the entire range from highly active to barely functional.
As a control, we evaluated the output of our Inr and -30 region matrices across all possible 8 and 7 bp sequences respectively (data not shown). These data indicated that substantial specificity is associated with these two sequence motifs: only 2.4% of random 7 bp sequences yielded scores with our -30 region matrix that exceeded 10% of a consensus -30 element; only 20% of random 8 bp sequences yielded scores with our Inr matrix that exceeded 10% of the output of a consensus Inr element. Thus, even in the absence of other cis-acting elements, the presence of the -30 region and Inr elements spaced appropriately apart is predicted to provide fairly good selectivity in defining a site for transcription initiation.
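An exhaustive scan of this kind is straightforward to sketch. The example below scores all 4^8 possible 8 bp windows with a weight matrix and reports the fraction whose output exceeds 10% of the best output. The weights and bias in `TOY_WEIGHTS` and `TOY_BIAS` are invented for illustration only, so the percentage it prints is not comparable to the 20% figure given above.

```python
import math
from itertools import product

BASES = "ACGT"
POSITIONS = (-4, -3, -2, -1, 1, 2, 3, 4)

# Invented weights, NOT the trained matrix; unlisted entries count as 0.
TOY_WEIGHTS = {(-1, "C"): 1.0, (1, "A"): 1.5,
               (2, "G"): 0.2, (2, "T"): 0.2, (3, "T"): 1.0}
TOY_BIAS = -3.0


def score(seq):
    """Sum the weights for each base in the window, then apply the logistic output."""
    x = sum(TOY_WEIGHTS.get((pos, base), 0.0) for pos, base in zip(POSITIONS, seq))
    return 1.0 / (1.0 + math.exp(-2.0 * (x + TOY_BIAS)))


def fraction_above(threshold_fraction=0.10):
    """Fraction of all possible windows scoring above a fraction of the best score."""
    scores = [score(seq) for seq in product(BASES, repeat=len(POSITIONS))]
    cutoff = threshold_fraction * max(scores)
    return sum(1 for y in scores if y > cutoff) / len(scores)


print(f"{100 * fraction_above():.1f}% of 8 bp sequences exceed 10% of the best score")
```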
What sequence(s) determines the transcription initiation site in promoters such as CAD in which both basal elements are weak? It has been suggested that upstream Sp1 binding sites may affect the location as well as the efficiency of transcription initiation from the human dhfr and CAD promoters (49,50).
To determine whether such upstream activating sequences can determine the initiation site when both the -30 region and Inr motifs are weak, we constructed mutants of the SV40-MLP in which the six Sp1 binding sites of the SV40 promoter region (Fig. 1A) were relocated to ~75 bp upstream of the major late start site of transcription (Fig. 7). As expected, substitution of the Sp1 binding sites for the wild-type activating sequences led to significant enhancement in the efficiency of transcription initiation regardless of the sequence of the Inr (Figs 1 and 2 and 7B versus 7D; summarized in Fig. 7A). For example, transcription from nt +1 in the cell-free system was at least 10- to 15-fold higher when the +1/+3GGG and +1G Inr mutant promoters were activated by Sp1 than when they were present in the WT background. However, the major site of initiation from these weak Inr promoters remained unchanged by the presence of the Sp1 binding sites (Fig. 7D, lanes 3 and 4), in agreement with the observations of Smale et al. (51).
Likewise, a low level of RNA was synthesized from the +1T Inr mutant in the wild-type background, with heterogeneous 5'-ends mapping to approximately nt -3 to +1, rather than +1 (Figs 2A and 3A, lane 11); when placed downstream of the Sp1 sites transcription increased 10- to 15-fold (Fig. 7D, lane 5; summarized in Fig. 7A), again starting at approximately nt -3 to +1. Therefore, the site of transcription initiation is predominantly determined by the Inr, not by its distance to a binding site for an upstream activator.
DISCUSSION
Using a natural, TATA-less promoter under assay conditions free of the effects of overlapping regulatory factors we investigated the role Inr sequences play in determining the efficiency and selection of the start site of transcription. Reported here is the first systematic mutagenesis of an Inr element (Figs 1-3). We found that only the region from -2 to +3 relative to the +1 start site is genetically important, with the nucleotides at positions -1, +1 and +3 being critical. We determined that the functionally optimal Inr sequence is 5'-(T/G)CA(G/T)T-3' (Figs 2, 3 and 7) and correlated transcriptional activity with similarity to this sequence (Fig. 4). This finding concurs with the partial genetic analyses of the Inrs of other promoters (52-57). Thus the optimal sequence of an Inr is probably universal, with non-general Inr binding factors functioning as transcriptional regulators, not as novel `selectors'.
Using a neural net algorithm (37) and our experimental mutagenesis data of the SV40-MLP Inr (Fig. 3) we generated a weight matrix that should be of general use to predict the relative strength of any given sequence as an Inr (Fig. 4A). For most non-training set sequences examined a good correlation was found to exist between predicted and actual transcriptional activity (Fig. 4B). Analysis of our data by the methodology of Stormo et al. (58) generated a weight matrix with comparable predictive value, i.e. r = 0.36, 0.55 and 0.78 for the mdr-1, Ad-MLP and TdT Inr data sets respectively. We prefer the neural net method, since it minimizes the differences between the predicted and actual activities, rather than their natural logs, and thus more likely reflects the biological systems being considered here.
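The distinction drawn here can be written as two least-squares objectives. This is a schematic restatement rather than a formula taken from either method's original description; the symbols $a_i$ (measured activity of sequence i) and $\hat{a}_i$ (predicted activity) are introduced only for illustration.

```latex
E_{\mathrm{linear}} = \sum_i \left( \hat{a}_i - a_i \right)^2
\qquad \text{versus} \qquad
E_{\log} = \sum_i \left( \ln \hat{a}_i - \ln a_i \right)^2
```

Under the second objective a 2-fold error on a weak Inr is penalized as heavily as a 2-fold error on a strong one, whereas the first penalizes errors according to their absolute size.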
Divergence between predicted and actual activity could be attributed to any of several factors. (i) The number of sequences we sampled in training the neural net may have been insufficient. (ii) The importance of the Inr is probably dependent upon its context with respect to the strengths of other basal and regulatory elements present in the promoter. (iii) Bases within the Inr may also be part of overlapping elements recognized by non-general regulatory factors (see for example 19,24,30). In fact, significant deviation between measured and predicted transcriptional activity might indicate the existence of a binding site for a regulatory factor.
Using mutagenesis data of others (41,42) we likewise generated a weight matrix for the -30 region of polymerase II promoters (Fig. 5A). This matrix was found to have excellent predictive value, not only for transcriptional activity (Fig. 5B), but also for binding of TBP to the -30 region (Fig. 5C). However, a major limitation of the usefulness of these matrices remains in that they fail to incorporate information concerning the effects of the distance and interactions between the -30 and Inr elements.
Several recent studies have indicated that the Inr is recognized by components of holoTFIID (59-66). Especially noteworthy is the finding of Purnell et al. (63) that the optimal sequence in the Inr region of the hsp70 promoter that binds Drosophila (d) TFIID is 5'-CA(T/G)T(T/G)-3', the same sequence we found to be optimal for transcriptional activity of the SV40-MLP (Figs 2, 3 and 7). This match between binding of TFIID and transcriptional activity provides compelling evidence that a component(s) of TFIID is a primary basal factor that functionally recognizes the Inr. Furthermore, Tjian and colleagues (64-66) have demonstrated that `TAFII150', a component of dTFIID, can specifically interact with and functionally recognize proximal elements, including the Inr.
The Inr likely also plays a direct role in promoter recognition and initiation of transcription by RNA polymerase. Carcamo et al. (67) reported that highly purified RNA polymerase II by itself can weakly bind to the Inr of the Ad-MLP and preferentially initiate transcription from sequences resembling Inrs. Escherichia coli RNA polymerase also recognizes sites of transcription initiation (68), with the optimal start site being 5'-CAGT-3'. Interestingly, sequence alterations in the Inr that were deleterious to transcriptional activity of the SV40-MLP are known to interfere with binding by prokaryotic RNA polymerase (68). Furthermore, as has been documented with E.coli polymerase (69), we also noted a phenomenon consistent with `slippage' by eukaryotic polymerase II when multiple T residues were present at the initiation site (Fig. 7D, lane 5). Thus we conclude that components of both RNA polymerase II and holoTFIID probably recognize the Inr in a sequence-specific manner. Most interestingly, recognition of the Inr by components of the basal machinery appears to be highly conserved from prokaryotes to higher eukaryotes.
To examine whether the location of binding sites for an activator protein can affect the site of transcription initiation we repositioned the multiple Sp1 binding sites naturally present in SV40 directly upstream of the basal elements of the SV40-MLP. Contrary to the conclusions reached with the dhfr (49) and CAD (50) promoters, we found that the locations of the transcriptional initiation sites were unaltered by the location of the Sp1 binding sites (Fig. 7). Thus we conclude that initiation site location is probably determined primarily by the binding of general transcription factors to multiple, appropriately spaced, basal elements of promoters.
Until recently a prevailing view has been that polymerase II promoters can be divided into two subclasses in which the early steps in the formation of a pre-initiation complex occur either by a TATA-dependent or an Inr-dependent mechanism (9,11,67). We consider this hypothesis unlikely to be valid. First, in higher eukaryotes the distance between the TBP binding site and the Inr is fixed to within a few bases (28,70). Second, the strengths of both the -30 region and Inr elements vary enormously in natural promoters, ranging from optimal to very weak (12; Fig. 6). Third, a single TBP binding site or Inr region does not provide sufficient sequence specificity to determine either the site or direction of transcription initiation. Fourth, most, if not all, class II promoters require TBP for transcription, whether they appear to contain a TBP binding site or not (3,15,51,71-73). Finally, cooperation between multiple elements is necessary for accurate start site selection (15,28,74,75).
As an alternative hypothesis we propose that the mechanism of transcription initiation is fundamentally conserved among class II promoters. The initial event is recognition by the holoTFIID complex of the multiple basal elements of the promoter existing in a fairly strict spatial alignment to each other. Next, TFIIB binds via protein-protein interactions with TBP and recruits polymerase II/TFIIF into the complex (71,73,76,77). Alternatively, these factors pre-exist and bind concurrently to the multiple basal elements in the form of a holocomplex (78,79). Finally, within a small window, sequence recognition of the Inr by RNA polymerase II (or another member of this complex) accurately defines the start site of transcription.
This genetic arrangement meets the requirement that start site sequences occur uniquely and rather infrequently in the genome. Within this spatial context each of the basal elements can possess individual binding capacities. Nevertheless, sufficient total binding capacity exists to permit the stable formation of pre-initiation complexes. Biochemical evidence supporting this hypothesis includes the recent studies of Aso et al. (73) showing that a variety of both TATA and TATA-less promoters require the same general factors for basal transcription and that basal transcriptional activity correlates with the binding affinities of these factors for the promoter.
ACKNOWLEDGEMENTS
We thank Peggy Farnham for HeLa S-3 cells and Dick Burgess, Peggy Farnham and members of their laboratories
and our laboratory for helpful discussions. We especially thank Paul Lambert,
Gary Stormo, Grace Wahba and Nancy Thompson for helpful comments on the
manuscript. This research was supported by US Public Health Service Research
grants CA-07175, CA-09075, CA-09135 and CA-22443 from the National Cancer Institute.
Present addresses: +Promega Corporation, Madison, WI 53711, USA and §Immunex Corporation, Seattle, WA 98101, USA.




REFERENCES

