ABSTRACT
One, two or four copies of the `helix-hairpin-helix' (HhH) DNA-binding motif are predicted to occur in 14 homologous
families of proteins. The predicted DNA-binding function of this motif is shown to be consistent with the
crystallographic structure of rat polymerase [beta], complexed with DNA template-primer [Pelletier, H., Sawaya, M.R.,
Kumar, A., Wilson, S.H. and Kraut, J. (1994) Science 264, 1891-1903] and with biochemical data. Five crystal
structures of predicted HhH motifs are currently known: two from rat pol [beta] and one each in endonuclease III,
AlkA and the 5' nuclease domain of Taq pol I. These motifs are more structurally similar to each other than to any
other structure in current databases, including helix-turn-helix motifs. The clustering of the five HhH structures
separately from other bi-helical structures in searches indicates that all members of the 14 families of proteins
described herein possess similar HhH structures. By analogy with the rat pol [beta] structure, it is suggested that
each of these HhH motifs binds DNA in a non-sequence-specific manner, via the formation of hydrogen bonds between
protein backbone nitrogens and DNA phosphate groups. This type of interaction contrasts with the sequence-specific
interactions of other motifs, including helix-turn-helix structures. Additional evidence is provided that
alphaherpesvirus virion host shutoff proteins are members of the polymerase I 5'-nuclease and FEN1-like endonuclease
gene family, and that a novel HhH-containing DNA-binding domain occurs in the kinesin-like molecule nod, and in other
proteins such as cnjB, emb-5 and SPT6.
The interaction of proteins with DNA, in a sequence-dependent manner, is fundamental to DNA synthesis, repair and degradation, and to the regulation of gene transcription. Many of these proteins contain small, discrete structural motifs that utilize either [alpha]-helices or [beta]-strands to bind the phosphate backbone or the
grooves of DNA. Among these are the helix-turn-helix, zinc finger, leucine zipper and helix-loop-helix motifs (
1
). These may arise in different molecular contexts, which has been interpreted to
be due either to divergent evolution via gene duplication and insertion or to
structural convergence via the effects of selective pressures on protein
function. Thus the helix-turn-helix (HtH) motif is found both in molecules with similar folds
(e.g. homeodomain proteins) and in others with different folds (e.g. lambda
repressor and cro proteins). Unlike gene regulatory proteins, other molecules
that bind DNA do so in a manner that is non-sequence-specific. These proteins include nucleases,
N
-glycosylases, ligases, helicases, topoisomerases and polymerases that are
essential for the protein-mediated synthesis and repair of DNA structure. Much less is known about
how such proteins bind DNA or how this sequence-independent binding is coupled
to their function (
2
). These proteins, unlike sequence-dependent DNA binding proteins, have not previously been shown to possess a common structural motif.
Recently, the N-terminal region of an open reading frame (ORF) of the cyanobacterium
Synechocystis
sp. has been shown to be a member of a family of phosphodiesterases that
includes phospholipases D and endonucleases (
3
). Further investigation of this ORF indicated that the sequence of its C-terminal region is similar to a previously proposed family of DNA-binding proteins (
4
) that includes
Bacillus subtilis
comE ORF A and human OriP-binding protein (OriP-BP). This suggests that the
Synechocystis
sp. ORF encodes a nuclease with a C-terminal DNA-binding domain. The family of DNA-binding domains was suggested (
4
) to include regions of
Escherichia coli
uvrC (a subunit of the uvrABC DNA repair enzyme) and
Haloarcula marismortui
ribosomal protein HL5.
Further studies indicated that this family of DNA-binding domains was also significantly similar in sequence, over regions of ~20 amino acids, to a variety of other
molecules, each of which possessed a DNA-binding function. It was considered that these sequence similarities
arose either as a result of evolutionary divergence from a common ancestor (i.e. homology) or by localised convergence of sequence due to adaptive replacements that were positively
selected (cf.
5
). Here evidence from similarity searches of sequence and structural databases
is presented that suggests that a `helix-hairpin-helix' motif occurs in 14 families of proteins and that this mediates a non-sequence-dependent interaction with DNA.
Sequence searches were undertaken using a local similarity method of Barton (
6
) and Searchwise (
7
), a generalised profile method. Additional searches for homologues used BLAST (
8
), as implemented at the National Center for Biotechnology Information (NCBI) USA. Estimation of
p
-values for ungapped blocks within multiple alignments was provided by the
program MACAW (
9
). Calculations of
p
-values were overestimations due to the use of a maximal search space (
9
) equal to the product of proteins' sequence lengths. The BLOSUM62 amino acid
substitution matrix (
10
) was used for each of these computational methods. Secondary structure
predictions from multiple alignments were provided by the neural network method
(PHD) of Rost and Sander (
11
).
Comparison of individual HhH structures with current structure databases was
provided by the dynamic programming algorithm encoded within STAMP (
12
). The algorithm was used to calculate length-independent degrees of similarity (Sc-values) between a query structure and all other known structures, where
maximum values of Sc = 9.8 represent a comparison of any structure with itself. The putative HhH motifs in
E.coli
endonuclease III (residues 111-126), rat polymerase [beta] (residues 59-74 and 100-115) and
E.coli
AlkA (residues 209-224) were subjected to database searches each as the query structure.
The four bi-helical structures were superimposed using matrices generated by STAMP and
examined using RASMOL (
13
). Examination of the hierarchical lists of scores from searches and
visualisation of superpositions using RASMOL demonstrated that superpositions
with Sc scores >8.00 represented highly similar structures. Multiple alignment of the
four similar structures was performed using STAMP and values of Pij' were calculated; these represent the confidence
in the alignment at each C-[alpha] position. Crossing angles between the two helices of the HhH
structures were determined using the structural analysis program ACTIVE (Hyeon
Son, in preparation); helical regions were defined using Quanta (Molecular
Simulations).
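The crossing-angle measurement can be illustrated with a minimal sketch. It does not reproduce ACTIVE, whose algorithm is not described here. As assumptions of this sketch only, each helix axis is approximated by the vector joining the centroid of its first four C-[alpha] atoms to the centroid of its last four, and the axes are directed from N- to C-terminus, so that antiparallel helices give angles approaching 180 degrees. The function names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def helix_axis(ca_coords, k=4):
    """Approximate a helix axis from ordered C-alpha coordinates.

    Assumption of this sketch: the axis is the vector from the centroid
    of the first k C-alpha atoms to the centroid of the last k.
    """
    if len(ca_coords) <= k:
        raise ValueError("helix must contain more than k residues")
    start = centroid(ca_coords[:k])
    end = centroid(ca_coords[-k:])
    return tuple(e - s for s, e in zip(start, end))


def crossing_angle(helix1, helix2):
    """Angle in degrees between the directed axes of two helices."""
    a = helix_axis(helix1)
    b = helix_axis(helix2)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))
```

In practice the two coordinate lists would be the C-[alpha] positions of helix 1 and helix 2 of a single HhH motif, taken from a coordinate file in residue order.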
Regions of
comE
ORF A, OriP-BP, HL5 and uvrC have been suggested elsewhere to form a family of DNA-binding domain homologues (
4
). Results of preliminary database searches indicated that the sequences of several other DNA-binding proteins were apparently significantly similar to a 16 residue
motif of
Synechocystis
sp. ORF,
comE
ORF A, OriP-BP, HL5 and uvrC: for example, regions similar to the motif in
Synechocystis
sp. ORF, rat polymerase [beta],
E.coli
ruvA and
E.coli
recR, when aligned in pairs, yielded probabilities of being aligned by chance (p-values) of <5 * 10^-2, calculated using MACAW (
9
). In the absence of known structures for each of these proteins it was
considered prudent to search, using both a local similarity algorithm (
6
) and a generalised profile method (
7
), for sequences that possess statistically significant similarities to a 16 residue profile calculated from regions of
Synechocystis
sp. ORF,
comE
ORF A, OriP-BP, HL5 and uvrC sequences.
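The idea of scoring database sequences against a 16 residue profile can be sketched as follows. This is not the local similarity method of reference 6 or the generalised profile method of reference 7, both of which used the BLOSUM62 matrix. As a deliberately simpler stand-in, the sketch builds a log-odds profile from residue frequencies with a pseudocount and a uniform background, then slides it along a sequence. All function names and the example sequences in the usage note are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build a log-odds profile from equal-length, ungapped motif sequences.

    Assumptions of this sketch: uniform background frequencies and a
    simple additive pseudocount (not BLOSUM62-based scoring).
    """
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all motif sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(width):
        counts = Counter(seq[i] for seq in aligned)
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        profile.append(column)
    return profile


def score_window(profile, window):
    """Sum the per-position log-odds scores; unknown residues score 0."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, window))


def best_window(profile, sequence):
    """Return (1-based start, score) of the best-scoring window, or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(profile, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start + 1, score)
    return best
```

For example, a profile built from a few made-up 16-mers such as `LTKIPGIGPKTAEKIL` can be passed to `best_window` together with a longer sequence to locate the most motif-like segment. Ranking database sequences by that score mirrors, in a much reduced form, the search described above.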
The initial search (
6
) provided evidence that such sequences exist predominantly in DNA-interacting proteins. The 11 sequences that were most similar to the profile (i.e. the top-scoring sequences)
were three NAD+-dependent DNA ligases,
E.coli
recR and ruvA (both involved in DNA repair processes), rat DNA polymerase [beta],
Pseudomonas fluorescens
uvrC,
Drosophila melanogaster
nod (a DNA-binding kinesin homologue), human flap endonuclease-1 (FEN1),
Thermus aquaticus
DNA-binding protein 1a (DNAB1a), and an endonuclease from
Saccharomyces cerevisiae
(ORF YKL113c). Using MACAW (
9
), comparisons of these sequences in pairs and in groups demonstrated that the
probabilities that these similarities in sequence arose by chance were small.
For example, aligning 16 residue regions of
Rhodothermus marinus
DNA ligase, nod, FEN1 and DNAB1a yielded a p-value of 1 * 10^-7.
A new profile, containing both the original five sequences and 11 that were the
next most similar, was calculated and compared (
6
) once more with databases. The top-scoring 53 sequences in this iteration contained the 16 residue motif;
significantly, the motif was also conserved in their close homologues, as
defined by conservation at five or more of eight positions of hxxhxGhGxxxAxxhh,
where h is a hydrophobic residue (VILMWFYA). Four exceptions to this, where
homologues did not conserve the motif, were discarded; these four were:
Streptococcus pneumoniae
penicillin-binding protein, trout cellular tumour antigen p53, varicella-zoster virus gene 53 protein and
S.pombe
sexual differentiation process protein ISP7. A slight reduction of the
acceptance threshold suggested the presence of multiple copies of the motif in
NAD+-dependent ligases (four copies), ruvA, polymerase [beta] and HL5 (two copies each).
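The conservation criterion stated above (five or more of the eight defined positions of hxxhxGhGxxxAxxhh, with h drawn from VILMWFYA) can be written out as a short sketch. The function names and the example sequence in the usage note are hypothetical and are not taken from the alignment in Table 1.

```python
HYDROPHOBIC = set("VILMWFYA")
PATTERN = "hxxhxGhGxxxAxxhh"  # h = hydrophobic, x = any residue


def conserved_positions(window):
    """Count how many of the eight defined pattern positions a 16-mer matches."""
    if len(window) != len(PATTERN):
        raise ValueError("window must be 16 residues long")
    count = 0
    for residue, symbol in zip(window.upper(), PATTERN):
        if symbol == "h":
            count += residue in HYDROPHOBIC
        elif symbol != "x":
            # Upper-case pattern letters (G, A) require that exact residue.
            count += residue == symbol
    return count


def matches_motif(window, minimum=5):
    """Apply the 'five or more of eight positions' criterion."""
    return conserved_positions(window) >= minimum


def scan(sequence, minimum=5):
    """Return (1-based start, window, count) for every qualifying window."""
    width = len(PATTERN)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        count = conserved_positions(window)
        if count >= minimum:
            hits.append((start + 1, window, count))
    return hits
```

A made-up window such as `LTKIPGIGPKTAEKIL` satisfies all eight positions, whereas a poly-lysine window satisfies none. The `minimum` argument corresponds to the acceptance threshold, so lowering it imitates the slight relaxation that revealed additional motif copies.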
Comparison of a newly-generated profile with databases using Searchwise, a profile method (
7
), corroborated and extended these findings. In this iteration, the top scoring
68 sequences were all represented among previously-identified sequences, together with radC homologues, that also scored
highly. Collating these top-scoring sequences with their homologues yielded a total of 107 sequences.
In order to decrease redundancy, one of each pair of close homologues whose
motifs showed >60% sequence identity was removed from the list. Unexpectedly,
each of these 107 was found to represent a protein known to interact with DNA,
except for four ORFs whose functions are unknown; no protein whose known
function is unrelated to DNA was present among the 107. This procedure was
unlikely to have identified by chance such a set of proteins with functions so
disproportionately related to DNA (as an illustration of this it is noted that
only 2% of SwissProt database entries have the string `DNA' in their title);
therefore, this motif is likely to have arisen in each of these sequences as a
result of a common DNA-related function. From literature data (discussed below), this function is
likely to be a DNA-binding function. Given the stringent selection criteria for this motif it
is likely that many more examples, with slightly differing sequences, remain to
be identified in current sequence databases. Seven of the 107 sequences could
not be identified as close homologues of any other proteins (Table
1
).
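The redundancy filter mentioned above (removing one of each pair of close homologues whose motifs share >60% sequence identity) can be sketched as a greedy pass over the motif list. Keeping the first member of each similar pair is an assumption of this sketch, since the text does not say which member was retained. The names and example sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical residues between two equal-length motifs."""
    if len(a) != len(b):
        raise ValueError("motifs must have the same length")
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


def remove_redundant(motifs, threshold=60.0):
    """Greedily keep motifs that are not >threshold% identical to any kept one.

    motifs is a list of (name, sequence) pairs; order determines which
    member of a similar pair survives (an assumption of this sketch).
    """
    kept = []
    for name, seq in motifs:
        if all(percent_identity(seq, other) <= threshold for _, other in kept):
            kept.append((name, seq))
    return kept
```

With three made-up 16-mers, two of which differ at a single position, the filter keeps the first of the near-identical pair and the unrelated third sequence.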
Table 1
HhH sequence alignment
The 107 sequences could be clustered into 14 homologous families (Table
1
). Some care was taken to ensure that sequences were demonstrably homologous by
ensuring that significant sequence similarity existed outside of the motifs.
This precaution was warranted by the observation that the five known structures
that contain the motif (pol [beta], AlkA, endonuclease III and
Taq
polymerase I; see below) do not adopt a common fold and may be results of
localised sequence and structural convergence.
Secondary structure predictions were provided by the PHD server (
11
) using multiple alignments of all homologues in Table
1
whose tertiary structures remain unknown, as query information. At an expected
accuracy of prediction of 72% (
11
), eight out of 13 alignments produced a prediction of two [alpha]-helices, four (radC, uvrC, NAD+-dependent ligase motif 1 and ruvA motif 2) yielded an uncertain prediction
for the first half of the motif followed by prediction of an [alpha]-helix, and one (transposase homologues) yielded a [beta]-strand-[beta]-strand prediction. These predictions
are consistent with the proposal that the majority of these sequences
represent a bi-helical structure; the possibility that a minority of these do not contain
an N-terminal [alpha]-helix cannot be discounted.
These observations of sequence similarities could be correlated with structural similarities, given that the crystal structures of five of these
motifs have been determined. Rat polymerase [beta] (pol [beta]; containing two motifs) (
14
),
T.aquaticus
polymerase I (
Taq
pol I) (
15
),
E.coli
AlkA (T. Ellenberger, personal communication) and
E.coli
endonuclease III (endo III) (
16
) do not all adopt a common fold yet each contains a bi-helical structure with a short inter-helical loop that coincides with their sequence-similar motifs. Since tandem helices are a common occurrence
in protein structures (
17
) it was important to assess the significance of the perceived similarities
between the bi-helical motifs of pol [beta], pol I, endo III and AlkA. Henceforth these motifs shall be termed
`helix-hairpin-helix' (HhH) motifs in accordance with Thayer
et al
. (
18
) (see below).
A DNA-binding function has been proposed (
18
) for the putative HhH motif in endonuclease III, based on its identification as
the binding site of thymine glycol (
15
), a known inhibitor of the
N-
glycosylase activity of the enzyme. In this structure the electron density for
the inhibitor was weak and the authors could not identify unambiguously the
nature of its interactions with the HhH motif (
16
). Resolution of this question awaits the determination of the tertiary structure of the DNA-bound form of endonuclease III. The crystal structure of another DNA glycosylase, AlkA, has recently been determined (T. Ellenberger, personal
communication) and, as predicted, it also contains a HhH motif.
Escherichia coli
AlkA is involved in base excision repair; this work indicates that the AlkA HhH
motif is likely to mediate its affinity for DNA during repair processes.
The crystal structure of DNA polymerase I from
T.aquaticus
(
Taq
pol I), containing the first description of the structure of a 5'-nuclease domain, has been reported recently (
15
). A HhH motif, predicted to reside in the 5' nuclease domain (residues 191-211), does indeed adopt a helix-hairpin-helix-like structure although the structure of this
region could not be determined unambiguously due to high crystallographic B-factors and poor electron density, with no density present for residues
200 and 201 (
15
). However it was possible to superimpose accurately the helices of the pol I
HhH with other HhH structures; in addition it has a crossing angle similar to
other HhH structures (Table
3
). The
Taq
pol I structure does not contain bound DNA, but Kim
et al
. (
15
) have proposed three metal ion binding sites formed by conserved carboxylates
situated at the base of the major cleft in the 5' nuclease domain as constituting an active site. Interestingly, the HhH
protrudes into this cleft adjacent to the metal binding sites and it seems
plausible that the motif presents DNA to the nuclease active site.
The 5'-nuclease domain of polymerases I from a diverse range of organisms
is highly conserved (
22
). Analysis of mutations within the
E.coli
pol I 5' nuclease domain (
23
) reveals that two mutations (Gly184 -> Asp and Gly192 -> Asp), which result in defective 5'-3'-nuclease activity, occur in or near the
predicted HhH motif (Fig.
3
a). The Gly184 -> Asp variant (polA480ex) has a markedly reduced 5'-3'-nuclease activity with little effect on polymerase
activity, whereas the Gly192 -> Asp variant (polA214) has a more pronounced effect on both polymerase and nuclease activities.
In vitro
, both these activities are thermolabile in this mutant, and
in vivo
, the mutation is lethal at high temperatures suggesting an essential role for
this residue. It has recently been reported that substitution of the
corresponding glycine residues by aspartates, in
B.caldotenax
pol I, results in the abolition of the 5'-3' exonuclease activity (
24
). These data indicate that the pol I HhH motif is essential for 5'-3' nuclease activity.
Table 3
A second HhH, predicted by our search, but not others (
18
,
19
), is located in the 8 kDa domain of pol [beta] which appears to be responsible for the short-gap filling activity of the enzyme (
26
,
27
). The crystal structures of pol [beta] (
14
,
28
) do not demonstrate DNA-binding to the 8 kDa domain, which is assumed to be a result of it
adopting one of many non-productive conformations in the crystal. However, other experimental
evidence strongly implicates the 8 kDa domain, and its HhH motif, in binding
DNA. Kumar
et al
. (
29
) have demonstrated single-stranded (ss) DNA binding to the 8 kDa domain; this interaction is
mediated by the two helices of the HhH motif as shown by nuclear magnetic
resonance chemical shift data (
30
). Lys 72, located in the HhH motif, has been implicated in binding dNTP from
the results of pyridoxal phosphate modification studies (
31
). The first putative HhH motif of the pol [beta] homologue terminal deoxynucleotide transferase also appears to possess
affinity for ssDNA (
32
) and nucleotides (
33
).
Recently it was reported that pol [beta] can be specifically inhibited by its N-terminal 14 kDa domain (residues 1-140) (
34
) that contains both its HhH motifs. This domain, like the intact enzyme, binds
both ss and double-stranded (ds) DNA but is deficient in polymerase activity. A smaller 8 kDa
fragment (residues 3-75), encompassing its first HhH motif, binds ss DNA but not ds DNA, while
another fragment containing the second motif and the catalytic domain (residues 87-334) binds only ds DNA. This evidence supports the prediction that the first HhH is important in
ss DNA recognition and we propose that a minimal region containing both HhH
motifs (residues 50-120) would also inhibit pol [beta] activity by competing for the DNA substrate.
Sequence clustering also indicated that regions of
B.subtilis
comE
ORF A, the
Synechocystis
sp. phospholipase D homologue, human oriP-BP,
D.melanogaster
nod and other proteins are homologous (Fig.
5
) and are likely to possess DNA-binding functions. A subset of the putative nod-like DNA-binding domain (NDD) homologues possibly possess a second HhH
motif (positions 49-62 in Fig.
5
). The observation of one or two HhH motifs at the C-terminal end of nod is particularly intriguing. Nod is a kinesin-like molecule required for proper segregation of non-exchange chromosomes in female meiosis (
40
). Although the NDD sequence has been shown not to be essential for the binding
of nod to chromosomes (
41
,
42
), deletion of the C-terminal 12 residues of the NDD sequence at the nod C-terminus renders it non-functional (
43
).
Prior to recent advances in structural biology it was evident that [alpha]-helices could be accommodated in the major groove of B-DNA and therefore could mediate sequence-specific contacts with DNA bases (
44
). Elucidation of the structures of sequence-specific DNA-binding proteins has confirmed that [alpha]-helices of helix-turn-helix, zinc-finger, helix-loop-helix and leucine
zipper motifs play an important role in DNA-recognition. In addition, [alpha]-helices are used commonly to orientate recognition helices
enabling interaction with DNA.
In this paper we have presented evidence to support and extend the proposition
of Thayer
et al.
(
18
) that the helix-hairpin-helix motif is a distinct and novel class of DNA binding motif.
This highly conserved motif contains features characteristic of sequence-specific motifs such as helix-turn-helix (HtH) structures, namely the use of [alpha]-helices as a structural element for the correct
orientation of a DNA-recognition element. However, although both are bi-helical structures the HhH motif differs from the HtH motif in its
structure and its mode of recognition of DNA. By analogy with the DNA-bound structure of pol [beta] (
14
) the HhH motif represents a novel structural motif involved in the non-sequence-specific recognition of both ss and ds DNA via hydrogen bond-mediated interactions with the DNA-phosphate backbone. Such interactions appear to be
essential for the functions of many non-sequence-specific proteins, particularly those involved in base excision
repair processes. Interestingly, on the occasions that the HhH motif is found
in multiple copies, these are invariably separated by between 12 and 21
residues, suggesting that a particular spatial arrangement of HhH motifs may be
required for multiple sites of interaction with DNA. Future determination of further HhH-containing structures will allow examination of the proposition that this [alpha]-helical motif is prevalent among many base excision
repair enzymes, in which it adopts a common structure and fulfills a common
role.
We would like to thank Asim Siddiqui, Robert Russell and Geoff Barton for
assistance in the use of STAMP, Hyeon Son for assistance with ACTIVE, and Steve
Ashford for assistance with figures. We are indebted to Dr Tom Ellenberger for
sending the AlkA coordinates prior to publication. We are grateful to Drs Soo
Hyun Eom, Joe Jager and Tom Steitz for allowing access to the coordinates of
Taq
pol I. C.P.P. is an MRC Training Fellow, and a member of the Oxford Centre for
Molecular Sciences, which is supported by EPSRC, BBSRC and the MRC. C.P.P.
wishes to thank Dr C. M. Dobson for support and encouragement.

HhH structure    Helix 1 (residues)    Helix 2 (residues)    Crossing angle (degrees)
Pol [beta] 1     56-61                 67-75                 154
Pol [beta] 2     97-102                108-116               133
Endo III         108-113               119-127               130
AlkA             201-211               217-228               131
Taq Pol I        189-197               204-212               129

The crystallographic structure of pol [beta] bound to a DNA template-primer (Fig.
3
) (
14
) and the identification of putative HhH motifs has enabled us to predict the
mode of DNA-binding for each putative HhH sequence. Seeberg
et al
. (
19
) suggested that the central residues G[I/V]G of the HhH hairpin bind to DNA
through hydrophobic interactions with the bases in the grooves. However this
proposal is not supported by the pol [beta] structure. As discussed by Pelletier
et al
. (
14
), the pol [beta]-DNA template complex structure reveals that pol [beta] backbone nitrogens form non-specific hydrogen bonds with DNA phosphate oxygens. Of
two regions in pol [beta] involved in this interaction, four backbone nitrogens in the second HhH
motif, between Gly105-Ala110 (HhH 8-13) form hydrogen bonds with phosphates of the primer strand (Fig.
3
). It is suggested that all HhH motifs bind DNA in an analogous manner to that
of the second pol [beta] HhH motif. It is notable that there is a high propensity for glycine
residues at HhH positions 8 and 10 (Table
1
), which in pol [beta] are critical for DNA-recognition and provide an extended surface for DNA-protein recognition. A high propensity for lysine at HhH 12 and
threonine or serine at HhH 13 within a subset of proposed HhH sequences
suggests that these interact with DNA phosphate groups in a similar manner to
the same residues in P-loop structures (
20
). DNA recognition by HhH motifs, in the manner proposed above, would provide non-sequence-specific interactions of proteins with DNA. This type of interaction would contrast with
the sequence-specific interactions of other motifs such as helix-turn-helix motifs (
25
).

Structural and functional homology between the 5'-nuclease domain of pol I enzymes and a range of other nucleases has
been reported (
22
,
35
). During the clustering of sequences into homologous classes it became apparent
that these nuclease sequences show significant similarities to a family of
endonucleases (
36
) that includes mammalian FEN1 (or DNase IV) and ERCC-5 (or XPG) and yeast RAD2 (Fig.
4
). This observation corroborates the original finding of Robins
et al
. (
37
) that these two families are homologous. Scanning current sequence databases against a multiple alignment of FEN1- and pol I-like 5'-nuclease domain sequences using scanps (
6
), identified the family of alphaherpesvirus virion host shutoff (vhs) proteins
as candidate homologues. Sequence conservation, particularly in three alignment `blocks',
indicates that these three protein families are homologous (Fig.
4
). The alignment shows conservation of several Asp and/or Glu residues that have
been suggested to coordinate metal ions within the
Taq
polymerase I 5'-nuclease structure (
15
). The alphaherpesvirus vhs proteins are known to degrade host and viral mRNAs
during infection and therefore have been proposed, although not shown, to function as nucleases (
38
). The observation of their sequence similarities to exonucleases is consistent with this
proposal. Additional support is provided by a vhs mutant with defective
activity which contains a substitution of a threonine with an isoleucine in a
conserved tripeptide (Asp-Thr-Asp) containing two of the four proposed active site carboxylate
groups. Interestingly, both vhs and phage T4 RNase H (
39
) appear to lack the HhH DNA-binding motif present in their homologous counterparts (Fig.
4
).

