Binding site analysis of c-Myb: screening of potential binding sites by using the mutation matrix
derived from systematic binding affinity measurements
Qiao-Lin Deng, Shunsuke Ishii and Akinori Sarai*
Tsukuba Life Science Center, The Institute of Physical and Chemical Research (RIKEN), 3-1-1 Koyadai, Tsukuba, Ibaraki 305, Japan
Received August 31, 1995;
Revised and Accepted December 20, 1995
ABSTRACT
The c-Myb oncoprotein is known to bind to multiple sites in the promoters of target genes. We have developed a protocol to screen for c-Myb binding sites using systematic binding data derived from measurements of the binding affinity for an oligonucleotide containing a known Myb-binding site and for its complete set of single mutants. We first applied the method to predict the binding affinity for the known binding sites and compared the predictions with available experimental data. The predicted binding sites agree with many putative binding sites of known target promoters. However, some binding sites are not predicted by the analysis. These sequences deviate from the consensus sequence derived from the binding analyses. In the light of the structure of the Myb-DNA complex, these results suggest that c-Myb may use different DNA-binding modes to recognize different classes of binding sites. We also screened the sequence database for potential Myb-binding sites and found sequences in several promoters that have not been identified experimentally but could be targets of c-Myb.
INTRODUCTION
The c-myb gene is a proto-oncogene encoding a nuclear protein (c-Myb) that binds to DNA (1,2). Its analog, v-myb, also encodes a nuclear protein, which is characterized by N- and C-terminal truncations compared with c-Myb. The c-Myb protein is thought to function as an activator as well as a repressor of transcription (reviewed in 3,4), regulating genes important for cell growth and development. Myb expression was initially thought to be largely restricted to the hematopoietic system, but it has recently also been shown to be involved in the regulation of proliferation and differentiation of various cell types other than hematopoietic cells (5,6).
The c-Myb protein comprises three domains responsible for DNA binding, transcriptional activation and negative regulation (7). The DNA-binding domain of Myb is located at the N-terminal side and consists of three homologous tandem repeats of 51 or 52 amino acids (designated R1, R2 and R3 from the N terminus). Each repeat has three conserved tryptophans spaced 18 or 19 residues apart. This repeat structure with conserved tryptophans seems to be a general motif used by many different transcription activators found in a wide spectrum of eukaryotes, including vertebrates, insects, yeast, cellular slime mold and higher plants (4,8,9). Among the three repeats of the DNA-binding domain, R1 can be deleted without significant loss of DNA-binding activity (7,10) and plays a minor role in sequence recognition. R2 and R3 are necessary and sufficient for the recognition of specific DNA sequences (2,7,10). Thus, the minimum DNA-binding domain is the R2R3 fragment.
NMR analysis of the Myb DNA-binding domain revealed that the three repeats have similar overall structures, each containing three helices (11-13). The second and third helices form a variant helix-turn-helix (HTH) motif, which contains a longer turn than that of the prototypical HTH motif. NMR analysis of the Myb-DNA complex showed that R2 and R3 are closely packed in the major groove, with the third helix of each repeat acting as the recognition helix (12). In contrast, R1 was shown to have no specific interactions with DNA (12,14).
The c-Myb protein can activate or repress the transcription of several potential target genes. The target genes known so far include the promyelocyte-specific mim-1 gene (15), the cdc2 gene (16), the gene encoding the PR264 splicing factor (17), the c-myc proto-oncogene (18,19) and c-myb itself (20), as well as the long terminal repeat (LTR) promoters of human immunodeficiency virus type 1 (HIV-1) (21) and human T-lymphotropic virus type I (HTLV-I) (22). Other genes trans-activated by c-Myb include CD4 (23), CD34 (24) and T cell receptor δ (25). In addition, c-Myb binds to c-erbB-2 (26), which is so far the only gene known to be repressed by c-Myb.
The Myb-recognition element (MRE) was originally defined as YAACKG, where K stands for G or T, as derived from comparisons of isolated chicken DNA-binding sites for v-Myb (27). Subsequently, two extended consensus sequences were obtained from binding-site selection protocols: the 9-bp YGRCVGTTR motif (28), where Y and R denote pyrimidines and purines, respectively, and V denotes A, C or G (i.e. not T), and the 8-bp YAACKGHH motif (29), where H denotes A, C or T (i.e. not G).
In order to characterize quantitatively the sequence specificity of c-Myb binding, we have carried out an extensive binding analysis using the Myb-binding site MBS-I in SV40 (30). Using a synthetic oligonucleotide containing this binding site, the binding affinities for the c-Myb DNA-binding domain were measured by filter-binding assay, and the effects of systematic base-pair substitutions on binding affinity were examined (14). Such analyses have provided valuable information about the location and energetic contributions of specific interactions (31,32). The mutational analyses have shown that the specific interactions are not uniformly distributed in the TAACTGAC region of MBS-I; the 2nd A, the 4th C and the 6th G are involved in very specific interactions with Myb, whereas the interactions at the 3rd A and the 8th C are less specific (14). A set of binding affinity changes, or binding free energy changes ΔΔG, due to base substitutions defines a consensus recognition sequence in a quantitative sense.
Although c-Myb is thought to regulate the transcription of multiple genes, there is no efficient method to identify its target genes. In this paper, we apply the above binding analysis to this problem, using the experimental ΔΔG data to predict putative binding sites. In order to test the predictive capacity of the method, we first examined the binding sites in the known target genes and compared the predictions with available experimental data. We have also applied the protocol to search the sequence database for potential Myb-binding sites, and found a number of binding sites in the promoters of genes that have not yet been verified experimentally as targets of Myb but may be important for the function of c-Myb. Finally, we discuss the multiple-binding mechanism of Myb by comparing the present results with available structural and functional information about Myb.
METHOD OF CALCULATIONS
Previously we carried out an extensive binding analysis on c-Myb (14). We introduced all possible single-base substitutions at each position of the 22mer oligonucleotide containing the MBS-I site of SV40 and measured binding affinities by filter-binding assays. The binding affinity ratio, K(wild)/K(mutant), or the binding free energy change, ΔΔG = RT ln[K(wild)/K(mutant)], between the original MBS-I and its mutants provides information about the location and magnitude of specific interactions within the binding region, as has been shown for Cro and λ repressor (31,32).
The ΔΔG values for each base position with respect to the three substituted bases define a matrix, as shown in Table 1, which can be used to calculate the binding strength of any sequence as long as the ΔΔG values are independent (see Discussion). The binding strength of a given sequence segment can be calculated simply by summing the individual ΔΔG contributions at each base position (the total free energy change is denoted ΔΔG_tot). Note that positive values of ΔΔG mean that a mutation weakens the binding, and every 10-fold change in binding affinity corresponds to a change of 1.3 kcal/mol in ΔΔG.
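The additive scoring described above can be restated compactly. The notation in this block is an assumption introduced for illustration: s_i is the base found at position i of an N-position window, b is the substituted base, and K is taken to be an association constant, so that a weaker mutant gives a positive ΔΔG. The temperature is not stated in this section, so the 10-fold figure is evaluated at two plausible temperatures; both are consistent with the quoted value of about 1.3 kcal/mol.

```latex
\[
\Delta\Delta G(i,b) = RT \ln\frac{K(\mathrm{wild})}{K(\mathrm{mutant}_{i \to b})},
\qquad
\Delta\Delta G_{\mathrm{tot}}(s) = \sum_{i=1}^{N} \Delta\Delta G(i, s_i)
\]
\[
RT \ln 10 \approx 2.303 \times \left(1.987\times10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}}\right) \times T
\approx 1.27\ \mathrm{kcal/mol}\ (T = 277\ \mathrm{K}),\quad
1.36\ \mathrm{kcal/mol}\ (T = 298\ \mathrm{K})
\]
```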
The ΔΔG_tot for each segment in a given region of DNA can be calculated by moving the window, base by base, along the sequence. In the calculations, both directions were considered, in order to include both strands. In screening for potential Myb-binding sites, all the human genes in the sequence database (primate subsection) were searched for potential Myb target sites in promoters and enhancers.
Table 1. Relative binding free energy changes ΔΔG (kcal/mol) for the binding of Myb R1R2R3 to MBS-I with base substitutions (14)

Position    G        A        T        C        Preference
1           0.67     0.46     0.00     0.10     y (T or C)
2           4.12     0.00     3.97     3.88     A
3           1.54     0.00     1.31     1.02     a
4           3.60     3.51     3.75     0.00     C
5          -0.04     0.57     0.00     0.27     k (G or T)
6           0.00     4.20     4.32     4.29     G
7           0.09     0.00     0.05    -0.29     N (any base)
8           4.20     0.22     0.54     0.00     H (not G)
9           0.25     0.00    -0.01     0.38
10          0.22    -0.01     0.12     0.00
11         -0.10     0.00     0.02    -0.02

The letters in the last column show the Myb binding-site motif derived from the ΔΔG values. Upper-case letters indicate higher specificity, while lower-case letters denote less specific preferences.
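The window scan described in the Method of Calculations can be sketched as follows. This is a minimal illustration, not the authors' screening program: the matrix values are copied from Table 1, while the function names, the 0-based window start used to report a hit, and the choice to accept a window when ΔΔG_tot is less than or equal to the threshold are assumptions made here.

```python
# ddG matrix from Table 1 (kcal/mol); one dict per position 1-11, keyed by base.
DDG = [
    {"G": 0.67, "A": 0.46, "T": 0.00, "C": 0.10},
    {"G": 4.12, "A": 0.00, "T": 3.97, "C": 3.88},
    {"G": 1.54, "A": 0.00, "T": 1.31, "C": 1.02},
    {"G": 3.60, "A": 3.51, "T": 3.75, "C": 0.00},
    {"G": -0.04, "A": 0.57, "T": 0.00, "C": 0.27},
    {"G": 0.00, "A": 4.20, "T": 4.32, "C": 4.29},
    {"G": 0.09, "A": 0.00, "T": 0.05, "C": -0.29},
    {"G": 4.20, "A": 0.22, "T": 0.54, "C": 0.00},
    {"G": 0.25, "A": 0.00, "T": -0.01, "C": 0.38},
    {"G": 0.22, "A": -0.01, "T": 0.12, "C": 0.00},
    {"G": -0.10, "A": 0.00, "T": 0.02, "C": -0.02},
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ddg_tot(window):
    """Sum the per-position ddG contributions for one 11-base window."""
    return sum(row[base] for row, base in zip(DDG, window))


def scan(seq, threshold):
    """Slide the window base by base and score both strands.

    Returns (start, strand, ddG_tot) for every window whose score is at or
    below the threshold; start is the 0-based left end on the given strand.
    """
    seq = seq.upper()
    width = len(DDG)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing ambiguous bases
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = ddg_tot(site)
            if score <= threshold:
                hits.append((start, strand, round(score, 2)))
    return hits
```

For example, `scan(sequence, 1.0)` would list every window on either strand predicted to bind within 1.0 kcal/mol of the reference site. Lower thresholds give fewer, stronger candidates.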
RESULTS
Screening of binding sites for known target genes
Myb protein has been shown to bind within the promoter regions of several potential target genes, including the promyelocyte-specific mim-1 gene (15), the cdc2 gene (16), the gene encoding the PR264 splicing factor (17), the c-myc proto-oncogene (18,19), c-myb itself (20), the LTR sequences of HIV-1 (21) and HTLV-I (22), and c-erbB-2 (26), and in the enhancer domain of SV40 (30). These known target genes were screened by the present protocol for putative binding sites, and the results were compared with experimental data. All sequence numbering follows the original GenBank entry, and the notation for the putative binding sites follows the corresponding references.
SV40. Nakagoshi et al. (30) found that binding of c-Myb to the simian virus 40 (SV40) enhancer stimulates transcription. There is a c-Myb-binding site, MBS-I, in the enhancer domain of SV40. We calculated ΔΔG_tot along the whole SV40 sequence. Figure 1A plots <ΔΔG_tot> - ΔΔG_tot, so that the larger the value, the stronger the affinity. The highest point is at position 256, corresponding to the binding site MBS-I in SV40 (30). Figure 1B shows the possible binding sites whose ΔΔG_tot is lower than the threshold value, ΔΔG_threshold. Here, binding sites on the opposite strand are also included. This figure shows the putative binding sites more clearly. Figure 1C shows the histogram of ΔΔG_tot for the entire SV40 sequence, in which the ΔΔG_tot of the MBS-I fragment (shown by an arrow) lies near the end of the tail of the distribution. This shows the high specificity of c-Myb binding for the MBS-I site.
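The quantities plotted in Figure 1 can be derived from a list of per-window ΔΔG_tot values, as in the sketch below. It assumes that the angle brackets denote the mean over all windows, which the text does not state explicitly. The function names, the bin width and the toy scores are illustrative inventions, not SV40 data.

```python
import math
from collections import Counter


def affinity_profile(scores):
    """Return <ddG_tot> - ddG_tot per window; larger means stronger predicted binding."""
    mean = sum(scores) / len(scores)
    return [mean - s for s in scores]


def histogram(scores, bin_width=1.0):
    """Count windows per ddG_tot bin, keyed by the lower edge of each bin."""
    counts = Counter(math.floor(s / bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


def tail_fraction(scores, site_score):
    """Fraction of windows scoring at least as low (as strong) as the given site."""
    return sum(s <= site_score for s in scores) / len(scores)


# Invented toy scores (kcal/mol): one strong site among weak background windows.
toy_scores = [0.0, 5.0, 6.0, 7.0, 7.0]
print(affinity_profile(toy_scores))
print(histogram(toy_scores))
print(tail_fraction(toy_scores, 0.0))
```

A small tail fraction for a site means that few windows in the sequence score as well as it does, which is the sense in which MBS-I sits in the tail of the distribution.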
Screening of potential Myb target genes from the sequence database
c-Myb is thought to be involved in the regulation of transcription of multiple genes. So far, only a limited number of target genes have been demonstrated (15-26). Identification of target genes is important for understanding the role of transcription factors in many biological events such as development. However, there has been no efficient method to identify them. Usually, differential screening using cells expressing or lacking the specific transcription factor is used to identify target genes. However, finding target genes in this way is difficult, because their expression may be rapidly induced or modulated only at a low level. Furthermore, this method is fairly laborious. A rapid screening procedure that identifies candidate target genes would therefore be very useful. Thus, we have developed and applied the present protocol to screen the sequence database for unknown but potentially important Myb-binding sites.
We searched the database (primate subsection) for human genes with putative Myb-binding sites in regulatory regions that have not yet been identified experimentally. We first examined different screening criteria, as shown in Table 3. Because Myb-binding sites are usually present in multiples, we also incorporated such a condition into the screening. The multi-site condition greatly reduces the number of hits, as shown in Table 3. We started with a large pool of the binding sites listed in Table 3 and narrowed it down to only those binding sites in regulatory regions. We then further selected only those binding sites that have not been verified experimentally but may be of interest for the function of c-Myb, as shown in Table 4 (the screening program and a library of potential Myb-binding sites listing all entries and binding sites in Table 3 are available upon request).
Table 3. Screening of c-Myb binding sites under various screening conditions

ΔΔG_threshold (kcal/mol)    No. of binding sites    L (bp) (a)    No. of hits (b)
0.0                         1                       -             2079
1.0                         1                       -             20 224
2.0                         1                       -             30 145
3.0                         1                       -             32 177
4.0                         1                       -             32 719
0.0                         2                       25            8
0.0                         2                       50            11
0.0                         2                       100           22
0.0                         2                       500           76
1.0                         2                       50            5265
1.0                         3                       50            446
1.0                         4                       50            24
1.0                         5                       50            1
2.0                         2                       50            22 521
3.0                         2                       50            28 690

(a) L represents the size of the region within which the specified number of binding sites are found.
(b) The search was carried out for all sequences in the primate section of GenBank (Rel. 91). Thus, the numbers shown here represent all hits, including the sites within coding regions.
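The multi-site condition of Table 3 can be expressed as a simple filter over the positions of individual hits. The sketch below is an assumed reading of that condition: a group qualifies when the specified number of consecutive sites (sorted by position) span no more than L bp. The function name and the example positions are hypothetical.

```python
def site_clusters(positions, min_sites, region_length):
    """Return groups of min_sites neighbouring sites whose span is <= region_length bp."""
    ordered = sorted(positions)
    groups = []
    for i in range(len(ordered) - min_sites + 1):
        group = ordered[i:i + min_sites]
        # Span is measured between the first and last site positions in the group.
        if group[-1] - group[0] <= region_length:
            groups.append(tuple(group))
    return groups


# Hypothetical site positions within one sequence entry.
example_positions = [100, 130, 400, 420, 445]
print(site_clusters(example_positions, 2, 50))
print(site_clusters(example_positions, 3, 50))
```

Raising `min_sites` or shrinking `region_length` makes the filter stricter, which mirrors the drop in the number of hits along the multi-site rows of Table 3.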
These candidate genes can be classified into seven groups based on the function of their gene products, as shown in Table 4. Mammalian c-Myb is required for the G1/S transition in the cell cycle, and Drosophila Myb has recently been speculated to be necessary for the G2/M transition (33). In this sense, the four candidates encoding cyclins and cdc25 phosphatase, together with cdc2 mentioned before, are interesting. Recently, a specific region containing two elements, called CDE (cell cycle-dependent element) and CHR (cell cycle homology region), in the cyclin A, cdc25C and cdc2 genes was demonstrated to be responsible for the cell cycle-dependent transcription of these genes (34). Interestingly, our search identified multiple Myb-binding sites adjacent to this region in these genes. Therefore, it is important to examine whether Myb is involved in the cell cycle-dependent transcriptional regulation of these genes by binding to them.

The level of c-Myb expression is high in immature hematopoietic cells and is downregulated during differentiation, indicating that c-Myb is important for maintaining the proliferative state of hematopoietic progenitor cells (35). The genes classified as proto-oncogenes and their related genes are important for supporting cellular proliferation. In particular, the c-kit gene is a good candidate Myb target gene, because it encodes the receptor for stem cell factor, which is necessary for the growth of immature hematopoietic cells. Among the genes encoding transcription factors, the ets-1, c-src, and NF-κB genes encode transcription factors that positively regulate cellular proliferation, implying that their expression could be activated by c-Myb. In contrast, IκB-α and Id2A are inhibitors of NF-κB and of helix-loop-helix-type factors, respectively. In addition, CP2 is an activator of the α-globin gene, a typical terminal differentiation marker. Therefore, these genes could be repressed by c-Myb. The genes encoding negative regulators of cell growth, such as tumor suppressor genes, are also candidate c-Myb targets whose expression may be repressed by c-Myb.

In addition to its role in maintaining the proliferative state of immature hematopoietic cells, c-Myb is also important for growth control of differentiated cells. For example, the level of c-myb mRNA is transiently increased after stimulation of resting mature T cells by IL-2 (36). A similar increase in the c-myb mRNA level is also found in serum-stimulated smooth muscle cells (5). In this respect, the genes encoding cytokines, their receptors, receptors involved in the immune response, and regulators of signal transduction are all candidate Myb target genes.
DISCUSSION
We have examined the binding sites of c-Myb using binding free energy data derived from extensive binding analyses. The ΔΔG values used here define a consensus sequence recognized by c-Myb in a quantitative manner, so that they can be used to predict other potential binding sites for c-Myb. However, such a prediction is valid only if the ΔΔG values are additive with respect to base substitutions. Although this condition holds in most cases for Cro and λ repressor (31,32), it has not been proven for c-Myb. Non-uniform ΔΔG values over the binding region usually indicate the presence of local contacts between amino acids and base pairs with different interaction strengths. A recent paper (37) compared such apparent strengths across several different molecular systems and experimental approaches, including sequence variability, and found strong correlations for the strongly interacting base-amino acid pairs. Thus, the individual base-amino acid strengths might be quite general across different molecular systems.
In order to test the predictive capacity of the method in the case of c-Myb, we first examined the binding sites for the known target genes and compared them with the available experimental data. The results showed that our protocol can identify many putative binding sites and potential target genes. On the other hand, there are several cases where binding sites were not predicted by the analysis. Interestingly, a close inspection of these binding sites shows that the second A and fourth C in the consensus sequence, yAaCkGNH, derived from the systematic binding analyses (14) are almost perfectly conserved, except for F4-L of c-myc, where the fourth C is instead a T. On the other hand, the sixth G is not conserved in several sites of c-myc, c-myb, PR264, cdc2 and c-erbB-2; there seems to be no specific base preference at this position for these sites. Also, the eighth position is occupied by G in a few cases.
The accession numbers and the numbering of the binding sites follow the GenBank entries. The number given for a binding site indicates the center of that site (the fifth position in Table 1).
*Binding sites where the ΔΔG_tot value is <1.0 kcal/mol (thus, stronger binding is expected). For multiple binding sites, in which there are more than two sites within a 50-bp region, the numbers are separated by `/'; otherwise they are separated by a comma. Where there is more than one such site, we list only the single binding site and the multiple binding site with the lowest ΔΔG_tot.
The structure of the Myb-DNA complex solved by NMR analysis (12) shows that Asn-183 and Lys-182 from R3 interact with the second A and fourth C, respectively, whereas the sixth G interacts with Lys-128 from R2. Thus, the interaction of R3 appears to be conserved throughout these binding sites. On the other hand, the interaction of R2 with these binding sites appears to differ from that with the other sites; R2 seems not to specify the base at the sixth position, so that this interaction may be absent in binding to these sites.
We cannot exclude the possibility that some of the above binding sites are not real targets for c-Myb, but the significant deviation from the consensus sequence for a number of sites suggests that the ΔΔG values may not be strictly additive. The breakdown of additivity may be mainly attributed to (i) cooperative interactions within a single binding site, or (ii) cooperative interactions among different binding sites. As an example of the former case, even small conformational changes in the DNA caused by a base substitution could affect the effect of substitutions at other positions in the same site. As for the latter, multiple Myb-binding sites are often present in promoters, and Myb molecules may interact with each other or with other molecules in the transcriptional machinery. Therefore, we can naturally expect a certain level of cooperativity in the binding of c-Myb to DNA. In fact, it has been reported that an oligonucleotide containing a duplicated Myb-binding motif showed a higher affinity than a sequence with only a single Myb-binding motif (10).
The good correlation between the interactions observed in the Myb-DNA complex structure and the base positions that the binding analysis identifies as specific suggests that cooperativity within a single binding site is unlikely to pose a major problem for the additivity of the individual ΔΔG values. Rather, we suggest that c-Myb may use different modes of binding in recognizing the multiple binding sites of the target promoters. As shown earlier, there are multiple potential binding sites, some of which are closely located or spaced with a period of 10 bp. The proximity and/or phasing of binding sites may cause cooperative interactions of multiple Myb molecules via direct contact, DNA bending, or other transcription factors. When c-Myb binds to multiple binding sites cooperatively, the binding mode of R2 could be different from that in the case of single-site binding.
The present protocol was also applied to the screening of potential c-Myb binding sites that have not been verified experimentally, and the screening identified some additional interesting binding sites that could be targets of c-Myb. It should be emphasized that the classes of sequences mentioned above escape the present screening. Thus, the screened results represent a subset of the potential Myb-binding sites and are by no means a complete library. The screening protocol is straightforward given the ΔΔG values, and parameters such as ΔΔG_threshold and the number of binding sites within a specified range can be varied to control the screening. The binding-site library will of course grow with the rapidly increasing sequence database. The present method may provide useful information on the function of c-Myb as a transcriptional regulator of multiple genes.
ACKNOWLEDGEMENTS
We thank Drs Kazuhiro Ogata and Robert L. Jernigan for helpful comments on the manuscript. This work was partly supported by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Science and Culture of Japan.