Frequent oligonucleotides and peptides of the Haemophilus influenzae genome
Samuel Karlin*, Jan Mrázek and Allan M. Campbell^1
Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA and ^1Department of Biological Sciences, Stanford University, Stanford, CA 94305-5020, USA
Received June 24, 1996; Revised and Accepted September 6, 1996
ABSTRACT
The complete Haemophilus influenzae genome (1.83 Mb, Rd strain) provides opportunities for characterizing global genomic inhomogeneities and for detecting important sequence signals. Along these lines, new methods for identifying frequent words (oligonucleotides and/or peptides) and their distributions are applied to the H.influenzae genome, with some comparisons and contrasts made with frequent words of other bacterial genomes. Three major classes of frequent oligonucleotides stand out: (i) oligos related to the familiar uptake signal sequences (USSs), AAGTGCGGT (USS+) and its inverted complement (USS-), (ii) multiple tetranucleotide iterations and (iii) intergenic dyad sequences (IDSs) found as AAGCCCACCCTAC and its dyad form. The USS+ and USS- occur in almost equal counts, are remarkably evenly spaced around the genome, and appear predominantly in the same reading frame of protein coding domains (USS+ translated to Ser-Ala-Val, USS- translated to Thr-Ala-Leu). These observations suggest that USSs contribute to global genomic functions, for example, in replication and/or repair processes, or as membrane attachment sites, or as sequences helping to pack DNA. The long tetranucleotide iterations, virtually unique to H.influenzae (i.e., unknown in other prokaryotes), may produce subpopulations expressing alternative proteins through polymerase slippage during replication and/or homologous recombination. The 13 bp frequent IDS words, invariably intergenic, occur mostly in clusters and provide potential for complex secondary structures, suggesting that these sequences may be important signals for regulating the activity of their flanking genes. The frequent oligopeptides of H.influenzae are principally of two kinds: those induced by oligonucleotide frequent words (USSs, tetranucleotide iterations), and those associated with ATP or GTP binding sites that are generally composed of three motifs: the A-box, which contributes to delineating the binding pocket; the B-box, which functions in hydrolysis; and the C-box, whose function is unknown. The A-box occurs fairly universally in prokaryotes and eukaryotes. The B- and C-motifs appear to be specialized to various functional groups (e.g., transport, recombination, chaperone activity). Other putative motifs correspond to homologs of Escherichia coli motifs that are associated, for example, with proteins of transcriptional processing, aminoacyl-tRNA synthetases and proteins functioning in electron transfer.
INTRODUCTION
The complete eubacterial genome (
Haemophilus influenzae
: 1.83 Mb, strain Rd) was recently published including identification of 1746 putative genes and concomitant database gene similarities (see refs
1
-
3
, and TIGR Web page http://www.tigr.org/ and compare with ref.
4
). The
H.influenzae
genome provides opportunities for characterizing genomic inhomogeneities and detecting distinctive sequence patterns. In this context, it is of
interest to determine which words (oligonucleotides and/or peptides) in the
sequence occur with unusually high or low frequencies and to identify anomalies
in their distribution over the genome. For DNA, rare words might be binding
sites for transcription control factors restricted to specific locations.
Alternatively, rare words may be discriminated against due to structural
incompatibilities, e.g., the tetranucleotide CTAG which is extremely rare in
most proteobacterial genomes (
5
). Frequent words often include repetitive structural, regulatory and
transposable elements, e.g., uptake signal sequences (USSs, see below) in
H.influenzae
(
6
), Chi sites (which in association with the RecBCD complex promote
recombination) and REP elements (repeated extragenic palindrome of unknown
function) the latter two in
Escherichia coli
(
7
,
8
). In proteins, frequent oligopeptides often reflect characteristic motifs
shared in certain protein functional families, e.g., the sequence environment
of the catalytic triad of serine proteases (
9
), the ATP-binding motif (Walker-box) of bacterial proteins (
10
). A comparison of texts or distributions of such words within sets of sequences
from different organisms may suggest important evolutionary tendencies or
constraints at work.
We introduce new statistics (see Methods) for identifying frequent words in the
H.influenzae
genome.
Haemophilus influenzae
readily undergoes transformation, facilitated by recognition of USSs, by
integrating free DNA fragments of its own genus. The active uptake of large
pieces of DNA from the environment is commonly called natural genetic
competence (for recent reviews see refs.
11
and
12
). The USSs are the aggregate of the 9 bp words AAGTGCGGT (+ orientation)
together with its inverted complement ACCGCACTT (- orientation). Many +/- pairs in the
H.influenzae
genome enable hairpin loops of up to ~20 nucleotides (nt) in length (128 such pairings) (
1
,
6
). These USSs are highly frequent in the
H.influenzae
genome, with almost identical counts on the Watson and Crick strands, respectively (
6
). The genome also contains several extensive tetranucleotide iterations which generate frequent words. We further identified a
frequent dyad pairing in
H.influenzae
, labelled IDS (intergenic dyad sequence), of quite long stem length which has
unusual properties, e.g., clusters of these dyads, all intergenic, flanked by
genes of related function and possessing the potential for complex secondary
structures. The counts and distribution of these frequent words are discussed
below. In
E.coli
, frequent words mostly correspond to parts of REP sequences (
8
,
13
). In
Neisseria gonorrhoeae
, constitutive natural uptake of DNA of its own genus is related to the
oligonucleotides TTCAGACGGC and its inverted complement GCCGTCTGAA, which are the most frequent words of size 10
in
N.gonorrhoeae
DNA. In the cyanobacterium
Synechocystis sp
., the 10 bp palindrome GGCGATCGCC is frequent to about the same extent as the
USSs of
H.influenzae
(
14
, and below). By contrast, examination of three long contigs in Bacillus subtilis totaling 0.51 Mb yielded no frequent oligonucleotides. This observation is essentially consistent with the natural history of B.subtilis, in which natural genetic transformation is not sequence-specific and engages ~10% of its cells (12).
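The inverted-complement relationships quoted above can be checked mechanically. The following is a minimal sketch, not part of the original analysis; the function names are hypothetical, and the complement map is extended with the ambiguity codes Y, R, K and M that appear in the consensus sequences of this paper.

```python
# Complement map for A, C, G, T plus the ambiguity codes Y/R and K/M
# (Y = C or T, R = A or G, K = T or G, M = A or C).
COMPLEMENT = str.maketrans("ACGTYRKM", "TGCARYMK")


def inverted_complement(word: str) -> str:
    """Return the inverted (reverse) complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def is_palindrome(word: str) -> bool:
    """A DNA palindrome equals its own inverted complement."""
    return word == inverted_complement(word)


if __name__ == "__main__":
    for word in ("AAGTGCGGT", "TTCAGACGGC", "GGCGATCGCC"):
        print(word, inverted_complement(word), is_palindrome(word))
```

Applied to the H.influenzae USS, the N.gonorrhoeae uptake word and the Synechocystis word, the sketch reproduces the pairs given in the text and confirms that only the last word is a palindrome.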
METHODS
Frequent words (oligonucleotides and peptides)
A classical approach for deciding whether a given word is frequent is to count the number of its occurrences $N(L)$ in a sequence of length $L$ and compare this count with the expected count, postulating independently or Markov generated sequences of letters and taking account of statistical variance. More precisely, let $\mu$ be the mean and $\sigma^2$ the variance of the length $l$ between successive occurrences of the word. For these models, the quantity
$$c(L) = \frac{\mu^{3/2}\,\bigl(N(L) - L/\mu\bigr)}{\sigma\sqrt{L}}$$
follows approximately the standard normal distribution for large $L$ (15). The tails of the normal distribution can be used to define rare and frequent cutoffs for the occurrence of each individual word (16-18). This method is difficult to implement, especially the computation of $\sigma$ for each word.
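For orientation, the classical statistic is easy to evaluate once the mean and standard deviation of the inter-occurrence length are in hand. The sketch below only illustrates the formula for $c(L)$; the function name and argument names are hypothetical, and it does not address the difficult step noted above, namely obtaining $\sigma$ for each word.

```python
import math


def renewal_z_score(count: int, seq_len: int, mean_gap: float, sd_gap: float) -> float:
    """Evaluate c(L) = mu^(3/2) * (N(L) - L/mu) / (sigma * sqrt(L)).

    count    -- observed number of occurrences N(L)
    seq_len  -- sequence length L
    mean_gap -- mean mu of the length between successive occurrences
    sd_gap   -- standard deviation sigma of that length
    """
    expected = seq_len / mean_gap
    return mean_gap ** 1.5 * (count - expected) / (sd_gap * math.sqrt(seq_len))


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the genome.
    print(renewal_z_score(110, 1000, 10.0, 10.0))
```

A value far in the upper tail of the standard normal distribution would mark the word as frequent, and a value far in the lower tail as rare.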
We introduce an intrinsically different formulation idealized into a ball-in-urn model [urns correspond to all nucleotide (or amino acid) words of a given size and balls refer to the observed words in a given sequence (19, 13)]. We apply Poisson distribution approximations associated with generalized occupancy problems of balls-in-urns (see ref. 19 for mathematical details). Previously we considered the question of frequent words in a sequence of letters drawn from an alphabet of size $A$ with equal frequency $1/A$. We show that for a sequence of length $L$, there is a natural word size $s$ determined by the inequalities
$$A^{s-1} \le L < A^{s} \qquad [1]$$
and a natural copy number $r$ determined by the inequalities
$$\frac{r-1}{r} < \frac{\log L}{\log A^{s}} \le \frac{r}{r+1}.$$
For large $L$, the number of $s$-words that occur more than $r$ times is approximately Poisson distributed with parameter
$$A^{s}\, e^{-L/A^{s}}\, \frac{(L/A^{s})^{r}}{r!},$$
which is <1. This technique has given interesting results for both nucleotide and amino
acid sequences (
13
). Although this model is fairly accurate for many DNA sequences with roughly
equal base frequencies, the uniform copy number cutoff cannot be applied to
compositionally biased sequences. In order to identify frequent words in the
globally A+T rich
H.influenzae
genome (or for peptides where different amino acids have very different
frequencies), the method is extended to the more general case of words $w$ with non-uniform frequencies $p_w$. For each word, we determine the copy threshold $r_w$ as the smallest integer satisfying the inequality
$$e^{-p_w L}\,\frac{(p_w L)^{r_w}}{r_w!} \le \frac{1}{L} \qquad [2]$$
where the word size $s$ is defined as in the uniform case and, for a word $w = i_1 i_2 \ldots i_s$ with independently generated letters,
$$p_w = f_{i_1} f_{i_2} \cdots f_{i_s},$$
and for a Markov model
$$p_w = f_{i_1} f_{i_1 i_2} \cdots f_{i_{s-1} i_s},$$
where $f_i$ is the frequency of letter $i$ and $f_{ij}$ is the transition frequency from letter $i$ to letter $j$. A general feature of this formulation is that the lower the expected frequency of a word, the lower the cutoff required for it to be frequent. By bounding the left side of the inequality [2] by $1/L$, at most one frequent word is to be expected in a random sequence.
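The word probabilities and the copy threshold of inequality [2] can be sketched as follows. This is an illustrative reading, not the authors' program: all function names are hypothetical, the frequencies are estimated from the sequence itself, and the search for $r_w$ is assumed to start at the expected count $p_w L$ (the upper tail of the Poisson distribution), since the inequality can also hold trivially for very small counts when $p_w L$ is large.

```python
import math
from collections import Counter


def letter_frequencies(seq):
    """Frequencies f_i of single letters."""
    counts = Counter(seq)
    return {a: c / len(seq) for a, c in counts.items()}


def transition_frequencies(seq):
    """Transition frequencies f_ij estimated from adjacent letter pairs."""
    pairs = Counter(zip(seq, seq[1:]))
    firsts = Counter(seq[:-1])
    return {(a, b): c / firsts[a] for (a, b), c in pairs.items()}


def word_probability(word, freqs, trans=None):
    """p_w under independence (trans=None) or a first-order Markov model."""
    p = freqs.get(word[0], 0.0)
    for prev, cur in zip(word, word[1:]):
        p *= freqs.get(cur, 0.0) if trans is None else trans.get((prev, cur), 0.0)
    return p


def copy_threshold(p_w, seq_len):
    """Smallest r >= p_w*L with exp(-p_w L) (p_w L)^r / r! <= 1/L (assumed upper tail)."""
    lam = p_w * seq_len
    if lam <= 0.0:
        return 1
    log_bound = -math.log(seq_len)
    r = max(1, math.ceil(lam))
    # Work with logarithms to avoid overflow in the factorial.
    while -lam + r * math.log(lam) - math.lgamma(r + 1) > log_bound:
        r += 1
    return r


if __name__ == "__main__":
    # Uniform base frequencies, 11-letter words, genome length from the text.
    print(copy_threshold(0.25 ** 11, 1830140))
```

Under these assumptions, a word with a smaller $p_w$ receives a threshold no larger than that of a more probable word, in line with the general feature stated above.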
Analysis of the distribution of frequent words
In probing for insights on the organization of a genome, the general problem
arises of how to characterize anomalies in the spacings of markers in a long
sequence of nucleotides or amino acids. These include properties of clumping
(too many neighboring short spacings), overdispersion (too many long gaps
between markers), and excessive regularity (too few short spacings and/or too
few long gaps). Questions about spacings of a general marker array and issues
of sequence heterogeneity can be approached by consideration of the cumulative
lengths of $r$ consecutive distances between the markers; the distance $R_i^{(r)}$ between marker $i$ and marker $i+r$ is called the $r$-scan length. Limiting distributions for the extremal statistics among $\{R_i^{(r)}\}$ were derived in ref. 20. The lengths of the longest and shortest $r$-scans are appropriate statistics for detecting cases of significant clumping, significant overdispersion, or excessive regularity in the spacings of the marker.
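A minimal sketch of the $r$-scan computation follows. The function names are hypothetical, and the marker array is treated as linear; any wrap-around on a circular chromosome is ignored here, which is an assumption of the sketch rather than a statement about the original analysis.

```python
def r_scan_lengths(positions, r):
    """Distances between marker i and marker i+r for all admissible i."""
    pts = sorted(positions)
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


def extreme_r_scans(positions, r):
    """Shortest and longest r-scan lengths (the extremal statistics)."""
    lengths = r_scan_lengths(positions, r)
    return min(lengths), max(lengths)


if __name__ == "__main__":
    markers = [0, 10, 30, 60]
    print(r_scan_lengths(markers, 1))   # successive spacings
    print(extreme_r_scans(markers, 2))  # extremes of the 2-scan lengths
```

An unusually short minimum indicates clumping, an unusually long maximum indicates overdispersion, and an unusually long minimum indicates excessive regularity.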
The case for $r = 1$ is classical. By varying $r$, organization on different scales can be detected, e.g., $r = 3$ can aptly detect near neighbor interactions while $r = 6$ can identify concentrations over a greater range. The $r$-scan process is a moving sum process derived from the original first order process and so tends to smooth fluctuations and presents quite sensitive statistics in discerning clustering and overdispersion.
To study the distribution of the marker in a sequence, we compare the distribution of $\{R_i^{(r)}\}$ calculated under a random model with the observed $r$-scan lengths. Moreover, $r$-scan statistics can be extended to a marker array distributed as an inhomogeneous Poisson process that easily accommodates alternating tracts of differing marker densities. Let
$$m_r^{*} = \min_i R_i^{(r)}, \qquad M_r^{*} = \max_i R_i^{(r)}.$$
The theoretical probabilities for a marker array of $n$ points obey the asymptotic relations
$$\mathrm{Prob}\left\{ m_r^{*} \ge \frac{x}{n^{1+1/r}} \right\} \approx \exp\left(-\frac{x^{r}}{r!}\right), \qquad [3]$$
$$\mathrm{Prob}\left\{ M_r^{*} \le \frac{1}{n}\bigl[\ln n + (r-1)\ln(\ln n) + x\bigr] \right\} \approx \exp\left[-\frac{e^{-x}}{(r-1)!}\right]. \qquad [4]$$
These formulas provide benchmarks as to whether the minimum and/or maximum spacing deviates significantly from randomness. For example, the probability in [3] (left side of the equation) is set to a desired significance level (generally 0.01), in order to determine $x_b$ from the equation $\exp(-x_b^{r}/r!) = 0.01$, and subsequently the threshold level $b_r^{*} = x_b / n^{1+1/r}$ is determined. All observed $r$-scan lengths less than $b_r^{*}L$ define $r$-scan clusters. A value of $a_r^{*}$ is calculated analogously setting the probability to 0.99; and a significantly even spacing is detected if $m_r^{*} \ge a_r^{*}L$. The appropriate range $(A_r^{*}, B_r^{*})$ of the maximal spacing $M_r^{*}$ can be delimited analogously using equation [4]. The markers are considered randomly distributed if all the $r$-scan lengths are within the ranges $(a_r^{*}L, b_r^{*}L)$ and $(A_r^{*}L, B_r^{*}L)$.
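Solving the asymptotic relations [3] and [4] for the spacing that corresponds to a chosen probability is a matter of inverting two exponentials. The sketch below does only that; the function names are hypothetical, the probability level is left as a parameter, and the returned values are fractions of the total sequence length, to be multiplied by $L$ as in the text.

```python
import math


def min_scan_quantile(n: int, r: int, prob: float) -> float:
    """Value v with Prob{m*_r >= v} ~ prob, from relation [3].

    Solves exp(-x^r / r!) = prob for x and returns x / n^(1 + 1/r).
    """
    x = (math.factorial(r) * math.log(1.0 / prob)) ** (1.0 / r)
    return x / n ** (1.0 + 1.0 / r)


def max_scan_quantile(n: int, r: int, prob: float) -> float:
    """Value v with Prob{M*_r <= v} ~ prob, from relation [4].

    Solves exp(-e^(-x) / (r-1)!) = prob for x and returns
    [ln n + (r-1) ln(ln n) + x] / n.
    """
    x = -math.log(math.factorial(r - 1) * math.log(1.0 / prob))
    return (math.log(n) + (r - 1) * math.log(math.log(n)) + x) / n


if __name__ == "__main__":
    n_markers = 1471  # total USS count reported in the Results
    for r in (1, 2, 4):
        print(r, min_scan_quantile(n_markers, r, 0.01), max_scan_quantile(n_markers, r, 0.99))
```

Because both relations are asymptotic in $n$, the values are benchmarks for large marker arrays rather than exact quantiles.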
RESULTS
Identification of frequent oligonucleotides
The frequent oligonucleotide length was determined as $s = 11$ bp obeying $A^{s-1} \le L < A^{s}$ ($L$ = 1 830 140 bp, $A = 4$). Applying the condition [2], the most outstanding frequent words contain the 9 bp uptake signal sequence AAGTGCGGT or its inverted complement ACCGCACTT. Another class of frequent words arises from the extensive tandemly iterated tetranucleotides present in the genome. Several among the frequent words overlap and can be combined into a longer frequent word. By these means the method of equation [2] detected in the genome the frequent 13 letter words AAGCCCACCCTAC (denoted as W13) and its dyad form GTAGGGTGGGCTT (W'13), in 14 and 9 exact copies, respectively. The pairs W'13 and W13 mostly occur as close dyads. We refer to the dyad pairings W'13-W13 as IDS. Various groups of frequent words are reported and some interpretations, comparisons and contrasts on their function are discussed.
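The word size quoted above follows directly from inequalities [1]. As a small worked check, the sketch below (hypothetical function names) computes the natural word size and, for the uniform model, the natural copy number from the sequence length and alphabet size.

```python
import math


def natural_word_size(seq_len: int, alphabet_size: int) -> int:
    """Smallest s with A^(s-1) <= L < A^s (inequalities [1])."""
    s = 1
    while alphabet_size ** s <= seq_len:
        s += 1
    return s


def natural_copy_number(seq_len: int, alphabet_size: int) -> int:
    """Integer r with (r-1)/r < log L / log A^s <= r/(r+1) (uniform model)."""
    s = natural_word_size(seq_len, alphabet_size)
    ratio = math.log(seq_len) / (s * math.log(alphabet_size))
    r = 1
    while ratio > r / (r + 1):
        r += 1
    return r


if __name__ == "__main__":
    print(natural_word_size(1830140, 4))    # nucleotide words in the genome
    print(natural_word_size(500000, 20))    # peptide words (~500 000 residues)
    print(natural_copy_number(1830140, 4))  # uniform-model copy number
```

For the genome length of 1 830 140 bp this gives $s = 11$, and for an aggregate of about 500 000 amino acids it gives $s = 5$, matching the sizes used in the Results.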
The unusual distribution of USSs
Smith
et al
. (
6
; see also ref.
21
) investigated many aspects of USSs. In general, our analysis confirmed and
augmented their findings including: (i) the equal counts of USS
+
and USS
-
words; (ii) USS occurrences as close dyad pairs in either +/- or -/+ orientation (87 and 39 cases, respectively) separated by a loop
of up to ~20 bp but predominantly of 8 bp loops, reminiscent of
E.coli
rho-independent terminator sequences. A cluster of four dyad pairs occurs in
the interval 1755-1756 kb; (iii) a significant overdispersion at position 1.56-1.59 Mb, a stretch of much sequence similarity to a Mu-like phage; (iv) mutated USS
+
and USS
-
(Table
1
).
Additional results were deduced with the aid of the
r
-scan statistics and frequent word analyses, emphasizing (v) significantly
even spacings of the USSs in each orientation; (vi) a significant
overdispersion of USSs in a region of a concentration of ribosomal protein
genes; (vii) reading frame preference of protein coding USSs for the
tripeptides SAV (USS
+
) and TAL (USS
-
) (one letter amino acid code used).
Clusters and overdispersions of USSs
. Four sets of USS markers were analyzed via
r
-scan statistics: the combined ensemble of USS
+
(AAGTGCGGT) and USS
-
(ACCGCACTT), the sets of USS
+
and USS
-
separately, and the collection of close USS dyad pairs. The
r
-scan statistics applied to the aggregate of USS
+
and USS
-
revealed three distributional anomalies significant at the 1% probability
level. An overdispersion (
r
= 2) occurs at coordinates 1.56-1.59 Mb in agreement with that reported by Smith
et al
. (
6
). This region is relatively G+C-rich showing some sequence similarity to Mu-like phage sequences. A second significant overdispersion was
detected at position 834-856 kb by the 4-scan statistical analysis. Thus, only three USSs occur in this
region of ~22 kb length (on a random basis we would expect ~18). This overdispersion centers on a region containing 25 ribosomal
protein genes (position 838-853 kb). Parenthetically, a USS never occurs within a ribosomal protein
gene or between two contiguous ribosomal protein genes.
Table 1. Frequent mutated USSs in complete genome and in intergenic regions (with ORFs, rRNA and tRNA genes removed)

| Word | Occurrences in complete genome (1.83 Mb) | Occurrences in intergenic regions (289 kb) |
|---|---|---|
| AAGTGCGGT | 737 ++ | 259 ++ |
| ACCGCACTT | 734 ++ | 241 ++ |
| AAG[A]GCGGT | 45 + | 13 (+) |
| ACCGC[T]CTT | 48 + | 17 (+) |
| AAGTGCGG[A] | 34 + | 7 0 |
| [T]CCGCACTT | 44 + | 9 0 |
| AAGTGCGG[C] | 36 + | 5 0 |
| [G]CCGCACTT | 37 + | 9 0 |
| AAGTGCGG[G] | 36 + | 5 0 |
| [C]CCGCACTT | 28 (+) | 4 0 |

Deviations from the core USS are enclosed in square brackets. All other words differing by 1 bp from the core USS did not qualify as frequent. Overrepresentation is indicated by symbols ++ (very frequent), + (significantly frequent), (+) (marginally frequent), and 0 (not frequent).
A cluster of eight USSs was detected at about position 1.76 Mb. The cluster
reflects four close dyad pairings of USSs, one in each of four 168 bp tandemly
repeated segments.
Equal counts of USS
+
and USS
-
.
USSs are equally partitioned between the two DNA strands with 737 occurrences of
USS
+
and 734 occurrences of USS
-
. The difference in counts over all 200 kb sections (averaging ~160) never exceeded 20 and mostly differed by <10. Moreover, the maximal run of USS
+
or of USS
-
observed among the aggregate USSs is of length 8 counts (
6
) whereas for a random permutation of 737 USS+ and 734 USS- at least one run of about $\tfrac{1}{2}\sqrt{1471} \approx 19$ would be expected.
Significantly even spacings of USSs in each orientation
. Another striking anomaly concerns the significantly even spacings of the USS
+
occurrences and the same for the USS
-
occurrences. Specifically, both USS
+
positions and USS
-
positions have respective minimum spacings significantly higher than expected
by chance for several
r
-scan statistics (
r
= 1,2,...,6) with the probability <= 0.001 to observe such an even distribution with the same numbers of randomly
distributed markers. See Discussion for possible interpretations.
Biases in mutated USSs.
Eight frequent words differ by a single base change from either USS
+
or USS
-
(Table
1
). Smith
et al
. (
6
) indicated an abundance of mutated USSs and proposed a dynamic equilibrium
model in which the drift away from USSs is judiciously repaired. However, our
analysis of frequent words revealed that only four among the 27 possible singly
mutated USSs are significantly frequent. Moreover, the mutated USSs at the 3' end base are not frequent in intergenic sequences and the mutated form
AAGAGCGGT (and its inverted complement) is only marginally of high count in
intergenic sequences (Table
1
). These observations portend a possible mutation bias for the 40% of USSs
occurring in intergenic regions. Equivalently, these relatively low counts of
noncoding mutated USSs suggest that they are not good enough for efficient USS
transformation properties and may be subject to some sort of selective
constraints in coding sequences (e.g., amino acid and/or codon requirements).
Reading frame preference of protein coding USSs.
934 of the 1471 USSs (63%) occur in protein coding genes or ORFs. About 64% of
all protein coding USS
-
(354 of 553) are translated to TAL, and the even higher fraction 72% of USS
+
(268 of 381) are translated to SAV. Both tripeptides TAL and SAV present
hydroxyl residues (T/S) on one side, hydrophobic aliphatics (L/V) on the other
side, and a central versatile small aliphatic A.
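The tripeptides associated with each reading frame of a USS can be enumerated directly. The sketch below is illustrative only: the names are hypothetical, the standard genetic code is assumed, and a trailing two-base partial codon is resolved only when all four completions encode the same amino acid (otherwise it is reported as X).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(word: str, frame: int) -> str:
    """Translate a DNA word starting at offset `frame` (0, 1 or 2)."""
    peptide = []
    for i in range(frame, len(word), 3):
        codon = word[i:i + 3]
        if len(codon) == 3:
            peptide.append(CODON_TABLE[codon])
        elif len(codon) == 2:
            # Partial codon: keep it only if the third base does not matter.
            options = {CODON_TABLE[codon + b] for b in BASES}
            peptide.append(options.pop() if len(options) == 1 else "X")
    return "".join(peptide)


if __name__ == "__main__":
    for name, word in (("USS+", "AAGTGCGGT"), ("USS-", "ACCGCACTT")):
        print(name, [translate_frame(word, f) for f in range(3)])
```

In this sketch SAV arises from USS+ read one base into the word (AGT-GCG-GTN) and TAL from USS- read from its first base (ACC-GCA-CTT).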
USS close dyad pairs
. The USSs often appear as close dyad pairs in either +/- or -/+ orientation (
6
). The +/- USS dyads predominantly show loops 8 or 19-22 bp long while the loop lengths of the -/+ dyads are not restricted to any narrow ranges. The
distribution of USS dyads around the genome appears to be random except for the
previously noted cluster of four +/- dyads at about position 1.76 Mb of the genome.
Intergenic dyad sequences (IDSs)
Several frequent words relate to IDSs of consensus sequence:
GTAGGGTGGGCTTYAGCCCACCA..........TGGTGGGCTRAAGCCCACCCTAC
(Y = C or T, and R = A or G) with a central A+T-rich loop of ~20 bp. The IDSs occur 14 times in the
H.influenzae
genome but only once (in
Neisseria meningitidis
) in all other GenBank sequences. The IDSs are often repeated at distances <300 bp thus having a potential for complex secondary structures. Their
distribution over the genome is depicted in Figure
1
.
Tandem tetranucleotide repeats
The
H.influenzae
genome contains 11 impressive microsatellites in the form of tandem
tetranucleotide repeats (
1
) each extending at least 15 iterations (i.e. >= 60 bp long): CCAA (5 distinct runs), TCAA (3 runs), GCAA (1 run), AGTC (1
run), and ACAG (1 run) (Table
2
). Six tetranucleotide tandem repeats appear in putative genes generally near
the 5' terminus. The other five are intergenic. All 11 were examined for
sequence similarity in their flanking sequences. A strong sequence similarity
was detected in the 3' flanks of four CCAA repeats (at positions 677215, 705969, 760672 and
1633204, denoted CCAA1, CCAA2, CCAA3 and CCAA5, respectively) extending ~2 kb downstream from the repeats, and to a lesser extent in their 5' flanks extending ~400 bp (Table
3
). No significant sequence similarity was detected in the flanks of the other
tetranucleotide iterations. The repeat CCAA3 containing 36 iterations appears in the putative gene
tbp1
(transferrin binding protein 1, >3000 bp in length) near its 5' end. CCAA1 (20 iterations) is in an intergenic region upstream of the
homologous gene
tbp2
. The other two CCAA repeats (CCAA2, 20 iterations, and CCAA5, 18 iterations)
are in short ORFs (Table
2
). The count of CCAA iterations in CCAA2 not being a multiple of three (in
contrast with the count of CCAA3 iterations) results in a stop codon proximal
downstream from the CCAA2 repeat. The ORFs containing CCAA5 and CCAA3 do not
continue in the same frame due to a single base insertion 5' to the CCAA iterations (Table
3
). All insertions/deletions for ~2 kb after the CCAA iterations are multiples of 3 bp suggesting a protein-coding character of the flanking sequences.
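A search for long tandem tetranucleotide iterations of the kind described above can be sketched with a backreference pattern. This is an assumption-laden illustration, not the procedure used for the paper: the function name is hypothetical, only the given strand is scanned, positions are 0-based, and the repeat unit is reported in whatever phase the scan first matches (so CCAA may also appear as a rotation such as AACC).

```python
import re


def find_tetranucleotide_runs(seq: str, min_iterations: int = 15):
    """Return (start, unit, iterations) for tandem tetranucleotide runs.

    A run is a 4-base unit followed by at least min_iterations - 1 exact
    copies of itself. Units that are themselves mono- or dinucleotide
    repeats (e.g. AAAA, ATAT) are skipped.
    """
    pattern = re.compile(r"([ACGT]{4})\1{%d,}" % (min_iterations - 1))
    runs = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if unit[:2] == unit[2:]:
            continue
        runs.append((match.start(), unit, len(match.group(0)) // 4))
    return runs


if __name__ == "__main__":
    toy = "GG" + "CCAA" * 20 + "TT"
    print(find_tetranucleotide_runs(toy))
```

Imperfect iterations, which interrupt an exact run, would need a more tolerant matcher than this exact-copy pattern.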
Table 2. Tandem tetranucleotide iterations in H.influenzae (a)

| Tetranucleotide/Gene (b) | 5' USS (c) | Start (5')-End (3') | 3' USS (c) |
|---|---|---|---|
| tbp2 | | 676 963-674 105 (d) | |
| (CCAA)20 | 1987 | 677 215-677 132 (d) | 3835 |
| orf (HI0636) | | 677 939-677 652 (d) | |
| (CCAA)20 | 766 | 705 969-705 896 (d) | 2855 |
| orf (HI0662) | | 706 065-705 874 (d) | |
| (CCAA)36 | 210 | 760 672-760 525 (d) | 4492 |
| tbp1 | | 760 746-757 495 (d) | |
| rsgA | | 1 480 514-1 481 008 | |
| (CCAA)16 | 2748 | 1 481 221-1 481 284 | 920 |
| orf (HI1386) | | 1 481 385-1 481 828 | |
| (CCAA)18 | 3488 | 1 633 204-1 633 279 | 2899 |
| orf (HI1566) | | 1 633 129-1 634 017 | |
| hypothetical protein (HI0352) | | 379 337-378 645 (d) | |
| (TCAA)33 | 2570 | 379 651-379 520 (d) | 2199 |
| nasD | | 380 053-380 772 | |
| (TCAA)23 | 1035 | 570 889-570 798 (d) | 865 |
| lipooligosaccharide biosynthesis (HI0550) | | 571 008-570 103 (d) | |
| orf (HI1536) | | 1 607 592-1 607 936 | |
| (TCAA)17 | 203 | 1 608 033-1 608 100 | 1064 |
| licA | | 1 608 191-1 608 991 | |
| yopA | | 1 542 870-1 542 832 (d) | |
| (GCAA)25 | 192 | 1 543 250-1 543 151 (d) | 3553 |
| nodT | | 1 543 493-1 544 854 | |
| (GACA)22 | 981 | 288 840-288 753 (d) | 2811 |
| glycosyl transferase (HI0259) | | 288 842-288 183 (d) | |
| (AGTC)32 | 913 | 1 123 049-1 122 922 (d) | 1873 |
| orf (HI1058) | | 1 123 460-1 122 879 (d) | |

(a) Positions of tetranucleotides iterated >= 15 times and overlapping or flanking genes are shown.
(b) Genes identified by sequence homology (7), ORFs predicted by GeneMark (35). Positions of two flanking genes are shown for intergenic tetranucleotide iterations. Only one gene is indicated if the iteration is included in this gene.
(c) Distance (in bp) to the nearest USS.
(d) Gene/iteration is read in the reverse orientation (on the complementary strand).
Strikingly, the tetranucleotide iterations (and the CCAA iterations in
particular) are associated with unusually long gaps between successive USSs,
mostly exceeding 3500 bp, compared with the average distance between two
successive USSs of 1235 bp (Table
2
). During a search for tetranucleotide iterations in the aggregate bacterial
sequences from GenBank (Release 94, April 1996), long tetranucleotide
iterations ( >= 10 iterations) were found exclusively in
H.influenzae
sequences. Long pentanucleotide iterations ( >= 7 iterations) were found in sequences from the genus
Neisseria
(
N.gonorrhoeae
,
N.meningitidis
and
N.flava
) and
H.influenzae
.
Table 3. Alignment of homologous flanking sequences of CCAA tandem repeats (a)
(a) Bold underlined letters locate the CCAA tandem repeats; initiation and termination codons of putative genes (ORFs) are also in bold. Conserved sites are indicated by uppercase letters. Significant sequence similarity extends beyond the displayed alignment (see Results for details).
Other frequent oligonucleotides
Some frequent words are substrings of the 12 bp word TTCGCCTTKTTC (K = T or G)
showing 17 occurrences in the
H.influenzae
genome. Its inverted complement GAAMAAGGCGAA (M = A or C) occurs nine times. Three of the 26 occurrences (the uppercase
letters) are in the sequence
...ttttcctcgttctccttgtccgccttgTTCGCCTTGTTCGCCTTGTTCGCCTTGTTCgccttt...
at coordinates 1788929-1788991, which is an imperfect sevenfold tandem repeat of the 9 bp word TTCGCCTTG. The tandem repeat is intergenic
and 4 bp downstream of a putative ORF. The other 23 occurrences are all in
genes (or ORFs) and mostly confined to the reading frame GAA-MAA-GGC-GAA translated as E(K/Q)GE.
The 12 bp words TCTAATTCTTCM and KGAAGAATTAGA occur 17 times in genes and
invariably code for the tetrapeptide EEL(E/D). Also the words AAGATGGTTTA and
TAAACGTTCTG, and their inverted complements are principally protein coding,
generally translated as KPS(F/T) and (S/P/A/T)ERL, respectively.
Frequent peptides among
H.influenzae
genes
The aggregate amino acid count over all known and putative proteins in H.influenzae is just over 500 000. By equation [1] with $A = 20$, the frequent word size is $s = 5$ and the minimum copy number for each word, depending on its composition, is determined from formula [2].
Table 4 lists all frequent pentapeptides, their observed copy number, and the cutoff level (in parentheses) required for the word to qualify as frequent. Frequent 5-peptides were, where possible, combined into groups of larger size words (6mers, 7mers, etc.) with their observed counts. Several groups of frequent words come from frequent oligonucleotides, especially USSs and tetranucleotide iterations (of CCAA, TCAA, AGTC, GACA).
Table 4. Frequent oligopeptides in the H.influenzae proteins (a)
(a) Significantly frequent pentapeptides were organized into several groups. They are listed together with their counts in the H.influenzae protein sequences (aggregate >500 000 aa). Counts of interesting longer oligopeptides combined from overlapping pentapeptides are indicated. The cutoffs (minimum count for each particular word to qualify as frequent, in parentheses) were determined only for pentapeptides.
(b) Tandem repeat of EGKCG in HI1601 (ORF of ~100 amino acids).
(c) For comparison, the E.coli frequent pentapeptides are those related to ATP-binding (same as the groups 4a, 4b and 4c), GHVDHGKT (group 6), and several oligopeptides with ~5 occurrences (HLYHCDHR, LCSHCR, NAWWV, HQLQQ, NRHRY, CPSCS, QLGFS, ELAKQ, GLYYN, CRKTW, TPDGR, WCAEY).
Frequent pentapeptides derived from USS sequences. The USS+ generate frequent peptides primarily centered on the tripeptide serine-alanine-valine (SAV) and the USS- yield frequent peptides primarily centered on the tripeptide TAL. Other groups of 5mers derived in different frames from USSs translate to PHF, including the pentapeptides N(Q/L)PHF, and to VR, including the pentapeptides KVRLK or EKVRS. The most frequent peptides have the triresidue flanked by the two charged amino acids K and E, showing 31 KSAVK, 20 KSAVE, 16 ESAVE and 18 ESAVK.
There are 13 protein coding USS +/- dyads with loop length 8 bp; 10 of the 13 are in the same frame and
translate to the consensus KSAVKNDRTF (SAV and RT are invariant in all 10). The
5mers SAVKN and KNDRT both qualify as frequent words. The 6mers ESAVKN and
KSAVKN have 14 and 11 occurrences, respectively.
Frequent pentapeptides derived from tetranucleotide iterations. (i) The exact tetranucleotide tandem repeat of CCAA translates to (PTNQ)n; (ii) of AGTC to (SQSV)n; (iii) of GACA to (QTDR)n; and (iv) of TCAA to (NQSI)n.
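Because 3 and 4 are coprime, a tandem tetranucleotide repeat read in any fixed frame yields a peptide of period four, and the three reading frames give cyclic rotations of the same four residues. The sketch below illustrates this with the standard genetic code; the function names are hypothetical. For TCAA, the repeat unit listed in Table 2, the first frame gives SINQ, a rotation of NQSI.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def repeat_peptide_unit(unit: str) -> str:
    """Translate three copies of a 4-base unit (12 nt = 4 codons)."""
    dna = unit * 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, 12, 3))


def is_rotation(a: str, b: str) -> bool:
    """True if string a is a cyclic rotation of string b."""
    return len(a) == len(b) and a in b + b


if __name__ == "__main__":
    for unit, reported in (("CCAA", "PTNQ"), ("AGTC", "SQSV"),
                           ("GACA", "QTDR"), ("TCAA", "NQSI")):
        observed = repeat_peptide_unit(unit)
        print(unit, observed, is_rotation(observed, reported))
```

Which rotation appears first in a given protein depends on the phase of the repeat relative to the reading frame of the gene.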
Frequent peptides connected with ATP-binding and hydrolysis
. Groups 4a, 4b, and 4c of Table
4
show significantly frequent peptides associated with ATP binding. Group 4a
frequent words are related to the motif GXXGXGK(S/T)TL (familiar as the Walker-box A-site;
23
). Group 4b frequent words are related to the motif LLDEPTN associated with the
ATP hydrolysis B-site. This B-site generally located 40-70 residues downstream of the A-site was characterized by Walker
et al
. (
10
) in the form [Phi][Phi][Phi][Phi]D representing four successive unspecified hydrophobic
residues culminating with the essential aspartate residue. As identified in X-ray crystal structures the Walker A and B boxes mostly contribute to the
ATP-binding pocket of ATP-dependent transport proteins. A third motif approximately of the
form LSGGQ(Q/R)Q ~20 amino acids upstream from the B-site was identified in ref.
8
. This is related to the frequent words of group 4c (Table
4
). We label this motif the C-box.
The most frequent pentapeptide words among
E.coli
proteins correspond to the A, B and C motifs described above and are found
mostly in ATP-dependent transport proteins (
8
,
13
).
In comparing
H.influenzae
with
E.coli
or closely related bacterial proteins, more than 35 significant alignments
substantially reflect common occurrence of the three ATP binding motifs.
Examples include the HI1272 and HI1470 ORFs presumably orthologous with ferric
enterobactin transport ATP-binding protein (
fepC
of
E.coli
); HI0036 with the multidrug resistance protein
mdl
of
E.coli
; HI0354 with the
nasD
protein (
Klebsiella pneumoniae
); and HI0664 with
mdrH
protein (
E.coli
).
The frequent motifs HVDHG, VDHGK and DHGKT which combine into HVDHGKT (five
occurrences) are notable since they occur in
tufB-A
(elongation factor EF-Tu, HI0578), in
tufB-B
(duplicate, HI0632), in
selB
(translation factor, HI0709), in
infB
(initiation factor IF-2, HI1284), and in two hypothetical proteins HI0864 (similar to GenBank
L19201 of
E.coli
) and HI1195 (similar to SwissProt P32066 of
E.coli
). Is it possible that HVDHGKT is a vital motif related to translation?
The frequent peptide HDHDH corresponds to a striking period two histidine run DHDHDHKHEHKHDHKHDHDHDHDHKHEHKHDHEHHDHDH, mostly alternating (HX)17 with predominantly X = aspartate, encoded from the gene fimA-A (HI0119). fimA-A and fimA-B (HI0362) are `duplicated' genes both similar to SwissProt P31305 (Streptococcus parasanguis) that is considered a cell envelope adhesin B precursor. The fimA-A, fimA-B and P31305 proteins align quite well but fimA-B and P31305 are totally missing the period-2 histidine run.
There is no obvious functional correlation among the proteins possessing the
frequent word LTEEQ (10 occurrences).
The KMSKS peptide is found exclusively in aminoacyl-tRNA synthetase class I proteins (Cys, Trp, Leu, Met, Val, Tyr, generally
aromatic or hydrophobic, see also ref.
24
). Strikingly, no other putative
H.influenzae
aminoacyl-tRNA synthetase contains a peptide with one amino acid change from the
motif KMSKS.
The CASCH peptide occurs once each in a cytochrome c-type protein (HI0644), in a denitrification system component (HI0348), and
in formate-dependent nitrite reductase (HI1068). These proteins are all related to
electron transfer. The motif is familiar in binding to a heme of a cytochrome
where H is the axial ligand and the two cysteines covalently attach at the heme
end. There are seven cytochromes in the
H.influenzae
genome but the pattern CXXCH exists only in three of these. For example, the
pattern is not found in cytochrome B562 or in the cytochrome P450 family.
DISCUSSION
Haemophilus influenzae
is a parasite of the upper respiratory mucosa in humans. The refined analysis of
Tatusov
et al
. (
4
) reported on 1703 genes of
H.influenzae
including more than 1000 genes significantly similar to
E.coli
genes. On the basis of these similarities, these authors propose shared and
disparate metabolic and other functional pathways of
H.influenzae
. In particular, they call attention to the absence of functional motility and
chemotaxis operons in
H.influenzae
.
Apart from gene similarities and differences of
H.influenzae
with other bacterial genomes, intriguing questions remain about genomic
organization in relation to repetitive structures (direct and inverted), short
oligonucleotide compositional biases, classification of control elements, codon
preferences, etc. In this paper, we concentrated on the identification and
analysis of the distributional properties of frequent words (oligonucleotides
and peptides) of the
H.influenzae
genome. Our method for characterizing frequent words is based on a Poisson
approximation with respect to word size, sequence length and composition (see
Methods). This analysis provides a novel perspective on sequence heterogeneity.
Four classes of frequent oligonucleotides stand out: (i) USSs, (ii) IDSs, (iii)
tetranucleotide iterations and (iv) special groups. Why are these frequent in
H.influenzae
? We venture some models and interpretations.
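As an illustration of what a Poisson-based frequent-word criterion can look like, the following Python sketch flags k-mers whose observed count is improbably high. It is not the procedure of the Methods section: it assumes independent bases drawn with the observed single-base frequencies, counts overlapping occurrences on one strand only, and uses a hypothetical cutoff `alpha`. The function names are likewise hypothetical.

```python
from collections import Counter
import math

def poisson_tail(lam, c):
    """P(X >= c) for X ~ Poisson(lam), by direct summation.

    Direct summation loses precision for extremely small tails;
    the result is clipped at 0.0 in that case.
    """
    term = math.exp(-lam)          # P(X = 0)
    cdf = 0.0
    for j in range(c):
        cdf += term
        term *= lam / (j + 1)      # P(X = j + 1) from P(X = j)
    return max(0.0, 1.0 - cdf)

def frequent_words(seq, k, alpha=0.001):
    """Return {word: (observed, expected, tail_probability)} for k-mers
    whose Poisson tail probability falls below alpha (assumed cutoff)."""
    n = len(seq) - k + 1
    base_freq = {b: cnt / len(seq) for b, cnt in Counter(seq).items()}
    counts = Counter(seq[i:i + k] for i in range(n))
    hits = {}
    for word, obs in counts.items():
        # Expected count under the independent-base assumption.
        lam = n * math.prod(base_freq[b] for b in word)
        p = poisson_tail(lam, obs)
        if p < alpha:
            hits[word] = (obs, lam, p)
    return hits
```

Because every k-mer is tested, a genome-scale application would need a cutoff that accounts for the number of words examined; the sketch leaves that choice open.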
(i) USSs (USS
+
and USS
-
).
USSs are highly frequent and virtually dense.
Haemophilus influenzae
is a naturally transformable organism which takes up double-stranded DNA of its own species, facilitated through recognition of USSs (
11
). Nearly all cells of
H.influenzae
become competent. The uptake mechanisms are largely unknown. Another bacterium
with correspondingly directed uptake is
Neisseria gonorrhoeae
. Generally, a small percentage (~10%) of a population of
B.subtilis
can become competent for uptake of non-specific DNA sequences (
11
,
12
). In
B.subtilis
and
S.pneumoniae
, competence is regulated by cell density, cell-cell signaling and nutritional signals that depend on growth
conditions. In
B.subtilis
more than 40 genes have been identified that are required for competence.
Natural genetic competence has also been reported in many other genera (
11
). Although DNA uptake is widespread in bacterial cells, non-specific integration into the chromosome seems to be rare (
11
).
The distributions of USS
+
and USS
-
in the genome are intriguing in several respects: (i) they occur in almost
equal counts in the two strands around the genome; (ii) the separate USS
+
and USS
-
sequences are significantly evenly spaced about the chromosome; and (iii)
protein coding USSs (~63% of all USSs) predominantly appear in the same reading frame where USS
+
s translate to the tripeptide SAV and USS
-
s translate to the tripeptide TAL. Their high density and significantly even
distribution around the genome suggest that they may contribute to global
genomic activities such as replication and repair (the DNA repair hypothesis,
e.g., ref.
25
), sites of membrane attachments in association with domain loops, sites of
nucleating Okazaki fragments or helix unwinding and/or sites contributing to
genome packaging. It is reasonably established that transforming DNA increases
the survivorship of cultures (
11
).
A major hypothesis concerning
H.influenzae
(and some other bacterial organisms) is that natural genetic competence
(transformation) evolved and is maintained for the function of acquiring
templates mediating repair of DNA lesions (
26
). One possibility is that the uptake of DNA followed by the production of
single-stranded tails could induce higher levels of Rec enzymes and concomitantly
increase the extent of DNA repair. In fact, single-stranded DNA is known to be an inducing signal of SOS repair, through RecA
polymerization on bound ssDNA, in
E.coli
. Other possible roles of natural genetic competence have been attributed to
benefits for horizontal gene transfer (e.g., transfer of antibiotic resistance
determinants), for the repair of damaged chromosomes that are rescued by
recombination with exogenous homologous DNA, for conversion of mutant alleles
to functional alleles, or simply as a good nutrient source (
11
,
25
).
As pointed out by Smith
et al
. (
6
), alignment of USS
+
sequences leads in more than 90% of cases to the consensus sequence
aAAGTGCGGT.rwwwww......rwwwww
of length 29 bp where w refers to weak bases (A or T). Thus, the recognition
part of USS appears to be at least 29 bp long with a highly conserved 9 bp
core. This fact implies that USS
+
sequences are displaced by >= 26 bp but generally much more. A similar accounting applies to the
distribution of USS
-
. When the 20 bases immediately 3' to all USS
+
core sequences were removed,
r
-scan statistics showed that the even distribution of USS
+
s persisted. This observation suggests that there is interference of some sort
between close USS
+
s of the same orientation.
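The 29 bp consensus quoted above can be turned into a simple pattern search. The Python sketch below is one possible reading of that consensus, under stated assumptions: r is taken as a purine (A or G), w as a weak base (A or T), a dot as any base, and the leading lowercase a is treated as a required A, which is stricter than a consensus position need be. The function names are hypothetical.

```python
import re

# Consensus as quoted in the text (29 positions).
CONSENSUS = "aAAGTGCGGT.rwwwww......rwwwww"

# Assumed interpretation of the ambiguity symbols.
SYMBOLS = {"r": "[AG]", "w": "[AT]", ".": "[ACGT]"}

def consensus_to_regex(consensus):
    """Compile the consensus into a regex; a lookahead wrapper lets
    overlapping sites be reported."""
    body = "".join(SYMBOLS.get(ch, ch.upper()) for ch in consensus)
    return re.compile("(?=(" + body + "))")

def find_sites(seq, consensus=CONSENSUS):
    """Return 0-based start positions of exact consensus matches."""
    regex = consensus_to_regex(consensus)
    return [m.start() for m in regex.finditer(seq)]
```

Exact matching is a deliberate simplification. Since the alignment supports the consensus in more than 90% of cases rather than all, a scoring scheme that tolerates mismatches outside the 9 bp core would be closer to the biology.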
The preponderance of USS
+
(USS
-
) encoded to SAV (TAL) might be explained as follows. The tripeptides SAV and
TAL are both small and have a hydroxyl residue at one end and a hydrophobic
residue at the other and are sufficiently spread out in order to have little
influence on protein conformation. Translation in the other two frames involves
the large residues arginine and histidine which could more easily disrupt
protein 3D structure. The close USS dyads are predominantly intergenic,
generally close downstream from a gene, and are rather similar to
E.coli
rho-independent transcription termination signals. The proliferation of USSs
for global genomic purposes allows also their facile conversion into requisite
terminator signals.
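The reading-frame argument can be checked directly by translating the 9 bp core and its inverted complement in all three frames. The Python sketch below uses the standard genetic code; the helper names are hypothetical, and a trailing two-base codon is resolved only when all four completions encode the same amino acid (otherwise it is shown as "?").

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Inverted complement of a DNA string over ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate_frame(seq, offset):
    """Translate seq starting at `offset` (0, 1 or 2)."""
    peptide = []
    for i in range(offset, len(seq), 3):
        codon = seq[i:i + 3]
        if len(codon) == 3:
            peptide.append(CODON_TABLE[codon])
        elif len(codon) == 2:
            # Resolve only if the third position is fourfold degenerate.
            options = {CODON_TABLE[codon + b] for b in BASES}
            peptide.append(options.pop() if len(options) == 1 else "?")
    return "".join(peptide)

USS_PLUS = "AAGTGCGGT"
USS_MINUS = reverse_complement(USS_PLUS)

for name, seq in (("USS+", USS_PLUS), ("USS-", USS_MINUS)):
    for offset in range(3):
        print(name, offset, translate_frame(seq, offset))
```

With these definitions USS+ gives KCG, SAV and VR at offsets 0, 1 and 2, and USS- gives TAL, PH? and RT, so the frames other than SAV and TAL do contain arginine or histidine (and, for USS+, cysteine).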
The palindrome GGCGATCGCC labeled HIP1 (highly iterated palindrome) is highly
frequent in the 1 Mb contig of
Synechocystis
(
14
). How is it distributed? The
r
-scan analysis shows in this case also a significantly even distribution.
In fact, for a random distribution of these words in the 1 Mb contig the chance
that all successive spacings (
r
= 1) exceed 9 bp has a probability <0.001. Impressively, the observed minimal spacing
m
1
* is 52 bp. Similar conclusions apply to the
r
-scan lengths for
r
= 2, 3, ..., 6. Thus, the even spacing of the HIP1 in
Synechocystis
is considerably more dramatic than the even spacings of the USSs in
H.influenzae
.
Synechocystis
, like
H.influenzae
, is known to be transformable (
11
,
27
). Whether the HIP1 sequences serve as recognition sites in this capacity is
unknown. The significance of its palindromic character is also intriguing.
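The spacing argument can be made concrete with a short Python sketch that computes r-scan lengths (sums of r consecutive spacings) and, for r = 1, the chance that every spacing exceeds a threshold when sites fall at random. The sketch rests on assumptions not taken from the text: sites are idealized as points placed independently and uniformly on a circle, for which the minimal-spacing probability has the closed form (1 - nx/L)^(n-1), and the function names are hypothetical. The number of HIP1 sites is not given in this section, so no attempt is made to reproduce the quoted probability.

```python
def r_scan_lengths(positions, genome_length, r=1):
    """Sums of r consecutive spacings between sites on a circular genome."""
    pos = sorted(positions)
    spacings = [pos[i + 1] - pos[i] for i in range(len(pos) - 1)]
    spacings.append(genome_length - pos[-1] + pos[0])  # wrap-around gap
    m = len(spacings)
    return [sum(spacings[(i + j) % m] for j in range(r)) for i in range(m)]

def prob_min_spacing_exceeds(n, genome_length, x):
    """P(all n spacings > x) for n uniform random points on a circle
    (continuous approximation)."""
    if n * x >= genome_length:
        return 0.0
    return (1.0 - n * x / genome_length) ** (n - 1)

# Toy usage with made-up positions on a circle of length 100.
scan1 = r_scan_lengths([10, 40, 90], 100, r=1)
scan2 = r_scan_lengths([10, 40, 90], 100, r=2)
print(min(scan1), min(scan2))
```

A small value of `prob_min_spacing_exceeds` for the observed minimal spacing indicates that the sites are more evenly spread than chance placement would predict, which is the sense of "significantly even" used here.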
Many bacteria can develop the state of physiological competence for natural DNA
uptake that is consistent with a bacterial gene transfer of free DNA.
Haemophilus influenzae
(and
N.gonorrhoeae
) can only bind and take up double-stranded and single-stranded DNA from the same or closely related species. This is
different from
B.subtilis
where the DNA uptake tends to be non-specific and most cells are not competent. By contrast, natural genetic
competence in
H.influenzae
and
A.vinelandii
can be attained by almost 100% of cells (
11
). As noted previously, the degree of bacterial cell competence appears to be
correlated with the presence of highly frequent words.
(ii) The role of IDSs.
The intergenic 13 bp stem dyad sequence (IDS) frequent words occur mostly in
clusters that provide the potential for variable secondary structure. This
suggests that these sequences may form protein binding sites which could be
important for regulating the activity of the flanking genes.
(iii) Tetranucleotide iterations
. Mechanisms promoting alterations in the frequency of gene expression include
introduction of frameshifts that affect transcription and/or translation (
28
). These frameshifts are caused by insertions and deletions that are more likely
within regions of reiterated short oligonucleotides. Moxon
et al
. (
28
) and Rainey
et al
. (
29
), for a number of pathogenic bacterial populations, highlight non-standard mutation mechanisms which occur at special loci, called
`contingency' genes (
28
). These authors explicitly discuss the case of the repeat tract, (CAAT)
16
, present in
H.influenzae
at the 5' end of the
lic2
gene essential for synthesis of digalactoside (
30
). In this context, the loss or gain of one or more CAAT unit(s) may alter the
reading frame, resulting in a change in the synthesis of the digalactoside. More
generally, in appropriate tandem repeats (in the coding region and/or in gene
regulatory regions) polymerase slippage, homologous recombination or mismatch
repair occurring during chromosomal replication can generate a heterogeneous
population of cells that can facilitate infection or can counter host defense
mechanisms (
28
). Thus, variation in the numbers of repeated segments can modulate alternative
gene expression patterns in a population. Other examples of variable gene
expression putatively controlled by repeat sequence tracts occur in
Bordetella pertussis
(
31
), in
Neisseria meningitidis
(
32
) and in
Neisseria gonorrhoeae
(
33
).
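The frame-switching arithmetic behind such repeat tracts is simple: each 4 bp unit gained or lost shifts the downstream reading frame by 4 mod 3 = 1, so a tract cycles through three phase states and returns to the original frame only after a net change of three units. The Python sketch below locates tandem tetranucleotide tracts and computes the resulting frame shift. The function names and the minimum tract length are assumptions chosen for illustration.

```python
import re

def find_tetra_tracts(seq, min_units=4):
    """Return (start, unit, n_units) for tandem tetranucleotide tracts
    with at least `min_units` copies. Homopolymer runs also qualify."""
    pattern = re.compile(r"([ACGT]{4})\1{%d,}" % (min_units - 1))
    return [(m.start(), m.group(1), len(m.group(0)) // 4)
            for m in pattern.finditer(seq)]

def frame_shift(units_before, units_after):
    """Downstream reading-frame shift (0, 1 or 2) caused by a change
    in the number of 4 bp units."""
    return (4 * (units_after - units_before)) % 3

# Toy usage: a 16-unit CAAT tract embedded in arbitrary flanking bases.
toy = "GG" + "CAAT" * 16 + "ATGC"
print(find_tetra_tracts(toy))
print(frame_shift(16, 17), frame_shift(16, 15), frame_shift(16, 19))
```

In this picture a population in which slippage adds or removes single units contains cells in all three phase states, only some of which keep a given downstream coding region in frame.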
The example of (AGTC)
32
in
H.influenzae
that offers at least two alternatively encoded genes is particularly interesting
(Fig.
2
). In frame 1, (AGTC)
32
is part of a 194 aa ORF. In frame 2, the gene encodes Mod (629 aa), similar to
the type III restriction-modification ECOP15 enzyme of
E.coli
. Frame 3 is `null' flanked by multiple termination codons. Both genes (ORF and
Mod) have strong Shine-Dalgarno signals and both were predicted independently as bona fide genes
by Fleischmann
et al
. (
1
) and Tatusov
et al
. (
4
). Thus, in a heterogeneous population two separate genes or a single complete
(fused) gene might be expressed, or the mod gene may be switched off depending
on the number of AGTC iterations. This sequence exhibits a rare example in a
bacterial genome of two genes encoded in the same orientation in different
reading frames overlapping more than 40 amino acids.
REFERENCES
1 Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.-F., Dougherty, B. A., Merrick, J. M., et al. (1995) Science, 269, 496-512.
3 Robison, K., Gilbert, W. and Church, G. M. (1996) Science, 271, 1302-1303.
4 Tatusov, R. L., Mushegian, A. R., Bork, P., Brown, N. P., Hayes, W. S., Borodovsky, M., Rudd, K. E. and Koonin, E. V. (1996) Curr. Biol., 6, 279-291.
5 Burge, C., Campbell, A. M. and Karlin, S. (1992) Proc. Natl. Acad. Sci. USA, 89, 1358-1362.
6 Smith, H. O., Tomb, J.-F., Dougherty, B. A., Fleischmann, R. D. and Venter, J. C. (1995) Science, 269, 538-540.
7 Krawiec, S. and Riley, M. (1990) Microbiol. Rev., 54, 502-539.
8 Blaisdell, B. E., Rudd, K. E., Matin, A. and Karlin, S. (1993) J. Mol. Biol., 229, 833-848.
9 Rawlings, N. D. and Barrett, A. J. (1994) Methods Enzymol., 244, 19-61.
10 Walker, J. E., Saraste, M., Runswick, M. J. and Gay, N. J. (1982) EMBO J., 1, 945-951.
11 Lorenz, M. G. and Wackernagel, W. (1994) Microbiol. Rev., 58, 563-602.
12 Solomon, J. M. and Grossman, A. D. (1996) Trends Genet., 12, 148-155.
13 Karlin, S. and Cardon, L. R. (1994) Annu. Rev. Microbiol., 48, 619-654.
14 Robinson, N. J., Robinson, P. J., Gupta, A., Bleasby, A. J., Whitton, B. A. and Morby, A. P. (1995) Nucleic Acids Res., 23, 729-735.
15 Karlin, S. and Taylor, H. M. (1975) A First Course in Stochastic Processes. Chapter 5. Academic Press, San Diego, CA, USA.
16 Kleffe, J. and Borodovsky, M. (1992) Comput. Appl. Biosci., 8, 433-441.
17 Schbath, S., Prum, B. and de Turckheim, E. (1995) J. Comput. Biol., 2, 417-437.
18 Karlin, S. and Brendel, V. (1996) In Arrow, K. J., Cottle, R. W., Eaves, B. C. and Olkin, I. (eds), Education in a Research University. Stanford University Press, Stanford, CA, USA pp. 407-427.
19 Karlin, S. and Leung, M.-Y. (1991) Ann. Appl. Prob., 1, 513-538.
20 Dembo, A. and Karlin, S. (1992) Ann. Appl. Prob., 2, 329-357.
21 Kahn, M. E. and Smith, H. O. (1984) J. Membrane Biol., 81, 89-103.
22 Mrázek, J. and Karlin, S. (1996) Trends Biochem. Sci., 21, 201-202.
23 Higgins, C. F., Gallagher, M. P., Mimmack, M. L. and Pearce, S. R. (1988) BioEssays, 8, 111-116.
24 Nagel, G. M. and Doolittle, R. F. (1995) J. Mol. Evol., 40, 487-498.