© 1996 Oxford University Press 4263-4272

Frequent oligonucleotides and peptides of the Haemophilus influenzae genome

Samuel Karlin*, Jan Mrázek and Allan M. Campbell 1

Department of Mathematics, Stanford University, Stanford, CA 94305-2125, USA and 1 Department of Biological Sciences, Stanford University, Stanford, CA 94305-5020, USA

Received June 24, 1996; Revised and Accepted September 6, 1996

ABSTRACT

The complete Haemophilus influenzae genome (1.83 Mb, Rd strain) provides opportunities for characterizing global genomic inhomogeneities and for detecting important sequence signals. Along these lines, new methods for identifying frequent words (oligonucleotides and/or peptides) and their distributions are applied to the H.influenzae genome with some comparisons and contrasts made with frequent words of other bacterial genomes. Three major classes of frequent oligonucleotides stand out: (i) oligos related to the familiar uptake signal sequences (USSs), AAGTGCGGT (USS+) and its inverted complement (USS-), (ii) multiple tetranucleotide iterations and (iii) intergenic dyad sequences (IDSs) found as AAGCCCACCCTAC and its dyad form. The USS+ and USS- occur in almost equal counts, are remarkably evenly spaced around the genome, and appear predominantly in the same reading frame of protein coding domains (USS+ translated to Ser-Ala-Val, USS- translated to Thr-Ala-Leu). These observations suggest that USSs contribute to global genomic functions, for example, in replication and/or repair processes, or as membrane attachment sites, or as sequences helping to pack DNA. The long tetranucleotide iterations, virtually unique to H.influenzae (i.e., unknown in other prokaryotes), may, through polymerase slippage during replication and/or homologous recombination, produce subpopulations expressing alternative proteins. The 13 bp frequent IDS words, invariably intergenic, occur mostly in clusters and provide potential for complex secondary structures, suggesting that these sequences may be important signals for regulating the activity of their flanking genes.
The frequent oligopeptides of H.influenzae are principally of two kinds: those induced by oligonucleotide frequent words (USSs, tetranucleotide iterations), and those associated with ATP or GTP binding sites that are generally composed of three motifs: the A-box which contributes to delineating the binding pocket; the B-box which functions in hydrolysis; and the C-box whose function is unknown. The A-box occurs fairly universally in prokaryotes and eukaryotes. The B- and C-motifs appear to be specialized to various functional groups (e.g., transport, recombination, chaperone activity). Other putative motifs correspond to homologs of Escherichia coli motifs, for example, those associated with proteins of transcriptional processing, aminoacyl-tRNA synthetases and proteins functioning in electron transfer.

INTRODUCTION

The complete eubacterial genome ( Haemophilus influenzae : 1.83 Mb, strain Rd) was recently published including identification of 1746 putative genes and concomitant database gene similarities (see refs 1 - 3 , and TIGR Web page http://www.tigr.org/ and compare with ref. 4 ). The H.influenzae genome provides opportunities for characterizing genomic inhomogeneities and detecting distinctive sequence patterns. In this context, it is of interest to determine which words (oligonucleotides and/or peptides) in the sequence occur with unusually high or low frequencies and to identify anomalies in their distribution over the genome. For DNA, rare words might be binding sites for transcription control factors restricted to specific locations. Alternatively, rare words may be discriminated against due to structural incompatibilities, e.g., the tetranucleotide CTAG which is extremely rare in most proteobacterial genomes ( 5 ). Frequent words often include repetitive structural, regulatory and transposable elements, e.g., uptake signal sequences (USSs, see below) in H.influenzae ( 6 ), Chi sites (which in association with the RecBCD complex promote recombination) and REP elements (repeated extragenic palindrome of unknown function) the latter two in Escherichia coli ( 7 , 8 ). In proteins, frequent oligopeptides often reflect characteristic motifs shared in certain protein functional families, e.g., the sequence environment of the catalytic triad of serine proteases ( 9 ), the ATP-binding motif (Walker-box) of bacterial proteins ( 10 ). A comparison of texts or distributions of such words within sets of sequences from different organisms may suggest important evolutionary tendencies or constraints at work.

We introduce new statistics (see Methods) for identifying frequent words in the H.influenzae genome. Haemophilus influenzae readily undergoes transformation, facilitated by recognition of USSs, by integrating free DNA fragments of its own genus. The active uptake of large pieces of DNA from the environment is commonly called natural genetic competence (for recent reviews see refs. 11 and 12 ). The USSs are the aggregate of the 9 bp words AAGTGCGGT (+ orientation) together with its inverted complement ACCGCACTT (- orientation). Many +/- pairs in the H.influenzae genome enable hairpin loops of up to ~20 nucleotides (nt) in length (128 such pairings) ( 1 , 6 ). These USSs are highly frequent in the H.influenzae genome, with almost identical counts on the Watson and Crick strands, respectively ( 6 ). The genome also contains several extensive tetranucleotide iterations which generate frequent words. We further identified a frequent dyad pairing in H.influenzae , labelled IDS (intergenic dyad sequence), of quite long stem length which has unusual properties, e.g., clusters of these dyads, all intergenic, flanked by genes of related function and possessing the potential for complex secondary structures. The counts and distribution of these frequent words are discussed below. In E.coli , frequent words mostly correspond to parts of REP sequences ( 8 , 13 ). In Neisseria gonorrhoeae , constitutive natural uptake of DNA of its own genus is related to the oligonucleotides TTCAGACGGC and its inverted complement GCCGTCTGAA, which are the most frequent words of size 10 in N.gonorrhoeae DNA. In the cyanobacterium Synechocystis sp ., the 10 bp palindrome GGCGATCGCC is frequent to about the same extent as the USSs of H.influenzae ( 14 , and below). By contrast, examination of three long contigs in Bacillus subtilis totaling 0.51 Mb yielded no frequent oligonucleotides. 
This observation is essentially consistent with the natural history of B.subtilis where naturally genetically transformable DNA is non-specific and engages ~10% of its cells ( 12 ).

METHODS

Frequent words (oligonucleotides and peptides)

A classical approach for deciding whether a given word is frequent is to count the number of its occurrences N(L) in a sequence of length L and to compare this count with the expected count, postulating sequences of letters generated independently or by a Markov chain and taking account of statistical variance. More precisely, let $\mu$ be the mean and $\sigma^2$ the variance of the length l between successive occurrences of the word. For these models, the quantity

$$c(L) = \frac{\mu^{3/2}\,\bigl(N(L) - L/\mu\bigr)}{\sigma\sqrt{L}}$$

follows approximately the standard normal distribution for large L (15). The tails of the normal distribution can be used to define rare and frequent cutoffs for the occurrence of each individual word (16-18). This method is difficult to implement, especially the computation of $\sigma$ for each word.

We introduce an intrinsically different formulation idealized into a ball-in-urn model [urns correspond to all nucleotide (or amino acid) words of a given size and balls refer to the observed words in a given sequence (19, 13)]. We apply Poisson distribution approximations associated with generalized occupancy problems of balls-in-urns (see ref. 19 for mathematical details). Previously we considered the question of frequent words in a sequence of letters drawn from an alphabet of size A with equal frequency 1/A. We show that for a sequence of length L, there is a natural word size s determined by the inequalities

$$A^{s-1} \le L < A^{s} \qquad [1]$$

and a natural copy number r determined by the inequalities

$$\frac{r-1}{r} < \frac{\log L}{\log A^{s}} \le \frac{r}{r+1}$$

For large L, the number of s-words that occur more than r times is approximately Poisson distributed with parameter $A^{s}\exp(-L/A^{s})\,(L/A^{s})^{r}/r!$, which is <1. This technique has given interesting results for both nucleotide and amino acid sequences (13). Although this model is fairly accurate for many DNA sequences with roughly equal base frequencies, the uniform copy number cutoff cannot be applied to compositionally biased sequences. In order to identify frequent words in the globally A+T rich H.influenzae genome (or for peptides where different amino acids have very different frequencies), the method is extended to the more general case of words w with non-uniform frequencies $p_w$. For each word, we determine the copy threshold $r_w$ as the smallest integer satisfying the inequality

$$\exp(-p_w L)\,\frac{(p_w L)^{r_w}}{r_w!} \le \frac{1}{L} \qquad [2]$$

where the word size s is defined as in the uniform case and $p_w = f_{i_1} f_{i_2} \cdots f_{i_s}$ for independently generated letters, while for a Markov model $p_w = f_{i_1} f_{i_1 i_2} \cdots f_{i_{s-1} i_s}$, where $f_i$ is the frequency of letter i and $f_{ij}$ is the transition frequency from letter i to letter j. A general feature of this formulation is that the lower the expected frequency of a word, the lower the cutoff required for it to be frequent. By setting the left side of the inequality [2] to 1/L, at most one frequent word is to be expected in a random sequence.
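The word size of inequality [1] and the copy threshold of inequality [2] can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the search for $r_w$ is assumed to start at the Poisson mean $p_w L$, so that only the upper tail is considered (the Poisson term is also small far below the mean, which is presumably not what is intended).

```python
import math


def natural_word_size(L, A):
    """Return the s satisfying A**(s-1) <= L < A**s (inequality [1])."""
    s = 1
    while A ** s <= L:
        s += 1
    return s


def word_probability(word, letter_freq):
    """p_w for independently generated letters: product of letter frequencies."""
    p = 1.0
    for letter in word:
        p *= letter_freq[letter]
    return p


def copy_threshold(p_w, L):
    """Smallest r_w at or above the mean p_w*L with
    exp(-p_w L) (p_w L)**r_w / r_w! <= 1/L (inequality [2]).
    Starting at the mean is an assumption of this sketch."""
    if p_w <= 0:
        raise ValueError("p_w must be positive")
    lam = p_w * L

    def log_term(k):
        # logarithm of the Poisson term, to avoid overflow in k!
        return -lam + k * math.log(lam) - math.lgamma(k + 1)

    r = int(lam)
    while log_term(r) > -math.log(L):
        r += 1
    return r


L_GENOME = 1_830_140                      # genome length quoted in the text
s_dna = natural_word_size(L_GENOME, 4)    # oligonucleotide word size
s_pep = natural_word_size(500_000, 20)    # peptide word size for ~500 000 aa
r_uniform = copy_threshold(4.0 ** -s_dna, L_GENOME)  # equal base frequencies
```

With equal base frequencies the threshold for an 11-mer comes out small; for a compositionally biased genome, `word_probability` would be fed the observed base frequencies, so A+T-rich words receive a higher cutoff than G+C-rich ones.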

Analysis of the distribution of frequent words

In probing for insights on the organization of a genome, the general problem arises of how to characterize anomalies in the spacings of markers in a long sequence of nucleotides or amino acids. These include properties of clumping (too many neighboring short spacings), overdispersion (too many long gaps between markers), and excessive regularity (too few short spacings and/or too few long gaps). Questions about spacings of a general marker array and issues of sequence heterogeneity can be approached by consideration of the cumulative lengths of r consecutive distances between the markers: $R_i^{(r)}$, the distance between marker i and marker i+r, is called the r-scan length. Limiting distributions for the extremal statistics among $\{R_i^{(r)}\}$ were derived in ref. 20. The lengths of the longest and shortest r-scans are appropriate statistics for detecting cases of significant clumping, significant overdispersion, or excessive regularity in the spacings of the marker.

The case for r = 1 is classical. By varying r , organization on different scales can be detected, e.g., r = 3 can aptly detect near neighbor interactions while r = 6 can identify concentrations over a greater range. The r -scan process is a moving sum process derived from the original first order process and so tends to smooth fluctuations and presents quite sensitive statistics in discerning clustering and overdispersion.

To study the distribution of the marker in a sequence, we compare the distribution of $\{R_i^{(r)}\}$ calculated under a random model with the observed r-scan lengths. Moreover, r-scan statistics can be extended to a marker array distributed as an inhomogeneous Poisson process that easily accommodates alternating tracts of differing marker densities. Let

$$m_r^{*} = \min_i R_i^{(r)}, \qquad M_r^{*} = \max_i R_i^{(r)}$$

The theoretical probabilities for a marker array of n points obey the asymptotic relations

$$\mathrm{Prob}\{\, m_r^{*} \ge x / n^{1+1/r} \,\} \approx \exp(-x^{r}/r!) \qquad [3]$$

$$\mathrm{Prob}\{\, M_r^{*} \le n^{-1}[\ln n + (r-1)\ln(\ln n) + x] \,\} \approx \exp[-e^{-x}/(r-1)!] \qquad [4]$$

These formulas provide benchmarks as to whether the minimum and/or maximum spacing deviates significantly from randomness. For example, the probability in [3] (left side of the equation) is set to a desired significance level (generally 0.01), in order to determine $x_b$ from the equation $\exp(-x_b^{r}/r!) = 0.01$, and subsequently the threshold level $b_r^{*} = x_b / n^{1+1/r}$ is determined. All observed r-scan lengths less than $b_r^{*}L$ define r-scan clusters. A value of $a_r^{*}$ is calculated analogously setting the probability to 0.99; and a significantly even spacing is detected if $m_r^{*} \ge a_r^{*}L$. The appropriate range $(A_r^{*}, B_r^{*})$ of the maximal spacing $M_r^{*}$ can be delimited analogously using equation [4]. The markers are considered randomly distributed if all the r-scan lengths are within the ranges $(a_r^{*}L, b_r^{*}L)$ and $(A_r^{*}L, B_r^{*}L)$.
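As a rough illustration of how the r-scan quantities above could be computed, the following Python sketch derives the r-scan lengths from a list of marker positions and inverts relations [3] and [4] for a chosen probability. The function names are hypothetical, the sequence is treated as linear (wrap-around on a circular genome is ignored), and the returned cutoffs are fractions of the sequence length, to be multiplied by L before comparison with observed lengths.

```python
import math


def r_scan_lengths(positions, r):
    """Distances between marker i and marker i+r for sorted marker positions."""
    pos = sorted(positions)
    return [pos[i + r] - pos[i] for i in range(len(pos) - r)]


def min_scan_cutoff(n, r, prob):
    """Solve exp(-x**r / r!) = prob for x and return x / n**(1 + 1/r),
    following relation [3]."""
    x = (-math.log(prob) * math.factorial(r)) ** (1.0 / r)
    return x / n ** (1.0 + 1.0 / r)


def max_scan_cutoff(n, r, prob):
    """Solve exp(-exp(-x) / (r-1)!) = prob for x and return
    (ln n + (r-1) ln ln n + x) / n, following relation [4]."""
    x = -math.log(-math.log(prob) * math.factorial(r - 1))
    return (math.log(n) + (r - 1) * math.log(math.log(n)) + x) / n


# Toy marker positions, invented for illustration only.
toy_positions = [0, 10, 30, 60]
toy_2scans = r_scan_lengths(toy_positions, 2)

# Cutoffs for n = 1471 markers and r = 2, at probabilities 0.01 and 0.99.
b_star = min_scan_cutoff(1471, 2, 0.01)
a_star = min_scan_cutoff(1471, 2, 0.99)
```

The observed extremes would then be `min(r_scan_lengths(...))` and `max(r_scan_lengths(...))`, compared against the cutoffs scaled by the sequence length.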

RESULTS

Identification of frequent oligonucleotides

The frequent oligonucleotide length was determined as s = 11 bp obeying $A^{s-1} \le L < A^{s}$ (L = 1 830 140 bp, A = 4). Applying the condition [2], the most outstanding frequent words contain the 9 bp uptake signal sequence AAGTGCGGT or its inverted complement ACCGCACTT. Another class of frequent words arises from the extensive tandemly iterated tetranucleotides present in the genome. Several among the frequent words overlap and can be combined into a longer frequent word. By these means the method of equation [2] detected in the genome the frequent 13 letter words AAGCCCACCCTAC (denoted as W13) and its dyad form GTAGGGTGGGCTT (W'13), in 14 and 9 exact copies, respectively. The pairs W'13 and W13 mostly occur as close dyads. We refer to the dyad pairings W'13-W13 as IDS. Various groups of frequent words are reported and some interpretations, comparisons and contrasts on their function are discussed.
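The relation between a word and its inverted complement (dyad form), and the counting of exact copies, can be checked with a short Python sketch. The helper names and the toy sequence are hypothetical; only the words themselves are taken from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def inverted_complement(word):
    """Complement each base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def count_occurrences(sequence, word):
    """Count exact occurrences of word in sequence, overlaps included."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)
    return count


USS_PLUS = "AAGTGCGGT"
W13 = "AAGCCCACCCTAC"

uss_minus = inverted_complement(USS_PLUS)
w13_dyad = inverted_complement(W13)

# Toy sequence (invented): one USS+ followed by one USS- as a close dyad pair.
toy = "TT" + USS_PLUS + "GGAT" + uss_minus + "CC"
toy_counts = (count_occurrences(toy, USS_PLUS), count_occurrences(toy, uss_minus))
```

Counting a word and its inverted complement on one strand is equivalent to counting the word on both strands, which is how the + and - orientations are distinguished here.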

The unusual distribution of USSs

Smith et al. (6; see also ref. 21) investigated many aspects of USSs. In general, our analysis confirmed and augmented their findings including: (i) the equal counts of USS+ and USS- words; (ii) USS occurrences as close dyad pairs in either +/- or -/+ orientation (87 and 39 cases, respectively) separated by a loop of up to ~20 bp but predominantly of 8 bp loops, reminiscent of E.coli rho-independent terminator sequences. A cluster of four dyad pairs occurs in the interval 1755-1756 kb; (iii) a significant overdispersion at position 1.56-1.59 Mb, a stretch of much sequence similarity to a Mu-like phage; (iv) mutated USS+ and USS- (Table 1).

Additional results were deduced with the aid of the r-scan statistics and frequent word analyses, emphasizing (v) significantly even spacings of the USSs in each orientation; (vi) a significant overdispersion of USSs in a region of a concentration of ribosomal protein genes; (vii) reading frame preference of protein coding USSs for the tripeptides SAV (USS+) and TAL (USS-) (one letter amino acid code used).

Clusters and overdispersions of USSs. Four sets of USS markers were analyzed via r-scan statistics: the combined ensemble of USS+ (AAGTGCGGT) and USS- (ACCGCACTT), the sets of USS+ and USS- separately, and the collection of close USS dyad pairs. The r-scan statistics applied to the aggregate of USS+ and USS- revealed three distributional anomalies significant at the 1% probability level. An overdispersion (r = 2) occurs at coordinates 1.56-1.59 Mb in agreement with that reported by Smith et al. (6). This region is relatively G+C-rich showing some sequence similarity to Mu-like phage sequences. A second significant overdispersion was detected at position 834-856 kb by the 4-scan statistical analysis. Thus, only three USSs occur in this region of ~22 kb length (on a random basis we would expect ~18). This overdispersion centers on a region containing 25 ribosomal protein genes (position 838-853 kb). Parenthetically, a USS never occurs within a ribosomal protein gene or between two contiguous ribosomal protein genes.
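The expectation of ~18 USSs in the 22 kb region follows from assuming a uniform marker density over the genome. A minimal Python check, using the counts quoted in the text (1471 USSs in 1 830 140 bp) and a hypothetical helper name:

```python
def expected_marker_count(n_markers, genome_length, window_length):
    """Expected number of markers in a window under a uniform density."""
    return n_markers * window_length / genome_length


# 1471 USSs (737 USS+ plus 734 USS-) over 1 830 140 bp, window of ~22 kb.
expected_uss = expected_marker_count(1471, 1_830_140, 22_000)
```

The result is close to 18, against the three USSs actually observed in the region.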

Table 1 Frequent mutated USSs in complete genome and in intergenic regions (with ORFs, rRNA and tRNA genes removed)

| Word | Occurrences in complete genome (1.83 Mb) | Overrepresentation | Occurrences in intergenic regions (289 kb) | Overrepresentation |
|---|---|---|---|---|
| AAGTGCGGT | 737 | ++ | 259 | ++ |
| ACCGCACTT | 734 | ++ | 241 | ++ |
| AAGaGCGGT | 45 | + | 13 | (+) |
| ACCGCtCTT | 48 | + | 17 | (+) |
| AAGTGCGGa | 34 | + | 7 | 0 |
| tCCGCACTT | 44 | + | 9 | 0 |
| AAGTGCGGc | 36 | + | 5 | 0 |
| gCCGCACTT | 37 | + | 9 | 0 |
| AAGTGCGGg | 36 | + | 5 | 0 |
| cCCGCACTT | 28 | (+) | 4 | 0 |

Deviations from the core USS are shown in lowercase. All other words differing by 1 bp from the core USS did not qualify as frequent. Overrepresentation is indicated by symbols ++ (very frequent), + (significantly frequent), (+) (marginally frequent), and 0 (not frequent).

A cluster of eight USSs was detected at about position 1.76 Mb. The cluster reflects four close dyad pairings of USSs, one in each of four 168 bp tandemly repeated segments.

Equal counts of USS+ and USS-. USSs are equally partitioned between the two DNA strands with 737 occurrences of USS+ and 734 occurrences of USS-. The difference in counts over all 200 kb sections (averaging ~160) never exceeded 20 and mostly differed by <10. Moreover, the maximal run of USS+ or of USS- observed among the aggregate USSs is of length 8 counts (6) whereas for a random permutation of 737 USS+ and 734 USS- at least one run of about $\tfrac{1}{2}\sqrt{1471} \approx 19$ would be expected.

Significantly even spacings of USSs in each orientation. Another striking anomaly concerns the significantly even spacings of the USS+ occurrences and the same for the USS- occurrences. Specifically, both USS+ positions and USS- positions have respective minimum spacings significantly higher than expected by chance for several r-scan statistics (r = 1,2,...,6) with the probability <= 0.001 to observe such an even distribution with the same numbers of randomly distributed markers. See Discussion for possible interpretations.

Biases in mutated USSs. Eight frequent words differ by a single base change from either USS+ or USS- (Table 1). Smith et al. (6) indicated an abundance of mutated USSs and proposed a dynamic equilibrium model in which the drift away from USSs is judiciously repaired. However, our analysis of frequent words revealed that only four among the 27 possible singly mutated USSs are significantly frequent. Moreover, the mutated USSs at the 3' end base are not frequent in intergenic sequences and the mutated form AAGAGCGGT (and its inverted complement) is only marginally of high count in intergenic sequences (Table 1). These observations portend a possible mutation bias for the 40% of USSs occurring in intergenic regions. Equivalently, these relatively low counts of noncoding mutated USSs suggest that they are not good enough for efficient USS transformation properties and may be subject to some sort of selective constraints in coding sequences (e.g., amino acid and/or codon requirements).

Reading frame preference of protein coding USSs. 934 of the 1471 USSs (63%) occur in protein coding genes or ORFs. About 64% of all protein coding USS- (354 of 553) are translated to TAL, and the even higher fraction 72% of USS+ (268 of 381) are translated to SAV. Both tripeptides TAL and SAV present hydroxyl residues (T/S) on one side, hydrophobic aliphatics (L/V) on the other side, and a central versatile small aliphatic A.

USS close dyad pairs. The USSs often appear as close dyad pairs in either +/- or -/+ orientation (6). The +/- USS dyads predominantly show loops 8 or 19-22 bp long while the loop lengths of the -/+ dyads are not restricted to any narrow ranges. The distribution of USS dyads around the genome appears to be random except for the previously noted cluster of four +/- dyads at about position 1.76 Mb of the genome.
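The reading frames behind the tripeptides SAV and TAL can be made explicit by translating the two 9 bp words with the standard genetic code. The Python sketch below is illustrative only; the `translate` helper is hypothetical, and a tenth base is appended to USS+ because the SAV frame begins at the second base of the word and its last codon (GTN) needs one more nucleotide.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, offset=0):
    """Translate complete codons of dna starting at the given offset."""
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3)
    )


USS_PLUS = "AAGTGCGGT"
USS_MINUS = "ACCGCACTT"

# USS-: ACC GCA CTT read from the first base.
tal = translate(USS_MINUS, 0)

# USS+: A AGT GCG GTN read from the second base; GTN is Val for any N.
sav_variants = {translate(USS_PLUS + n, 1) for n in "ACGT"}
```

Because GTN encodes valine whatever the fourth base is, the SAV reading of USS+ does not depend on the nucleotide following the word.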

Intergenic dyad sequences (IDSs)

Several frequent words relate to IDSs of consensus sequence:

GTAGGGTGGGCTTYAGCCCACCA..........TGGTGGGCTRAAGCCCACCCTAC

(Y = C or T, and R = A or G) with a central A+T-rich loop of ~20 bp. The IDSs occur 14 times in the H.influenzae genome but only once (in Neisseria meningitidis ) in all other GenBank sequences. The IDSs are often repeated at distances <300 bp thus having a potential for complex secondary structures. Their distribution over the genome is depicted in Figure 1 .


Figure 1 . Positions of IDSs in the H.influenzae genome. Partial IDSs (missing one stem sequence) are indicated by an asterisk. For explicit sequences see reference 22.

The genes flanking the IDSs are related in function and are invariably transcribed in the same orientation (22). For example, hsdM and hsdS are type I restriction enzymes, ureA and ureB are both subunits of urease, G6PD is glucose-6-phosphate dehydrogenase and devB is a putative isozyme, fmt (Met-tRNA formyltransferase) and def (f-Met deformylase) both function in translation, and lpdA (lipoamide dehydrogenase) and aceF (dihydrolipoamide acetyltransferase) participate in energy metabolism (1). The similar function of the flanking genes suggests that these IDSs could be important for regulating their activity.

Similarity of IDS to certain recurrent repeats of E.coli. Previous analysis of repeated DNA sequences in E.coli (8) revealed four distinct groups of intergenic approximately palindromic repeated sequences (REP elements plus three new groups denoted Group I, Group III, and Group IV). The sequences of Groups I, III and IV are not similar to IDS except for their dyad symmetry and the presence of six GGG/CCC triplets in the Group IV sequences. Interestingly, one Group I repeat and one Group III repeat appear between the E.coli genes aceF and lpd and an IDS is present between their counterparts aceF and lpdA in H.influenzae. One recurrent sequence of Group I is located between the E.coli genes hsdR and hsdM, upstream of hsdM, while an IDS appears between the H.influenzae genes hsdM and hsdS, downstream of hsdM. The similar counts of Groups I, III and IV repeats in E.coli and IDS in H.influenzae coupled with their similar dyad symmetry (stem-loop forming potential) suggest that they may serve similar functions in the two genomes.

Tandem tetranucleotide repeats

The H.influenzae genome contains 11 impressive microsatellites in the form of tandem tetranucleotide repeats ( 1 ) each extending at least 15 iterations (i.e. >= 60 bp long): CCAA (5 distinct runs), TCAA (3 runs), GCAA (1 run), AGTC (1 run), and ACAG (1 run) (Table 2 ). Six tetranucleotide tandem repeats appear in putative genes generally near the 5' terminus. The other five are intergenic. All 11 were examined for sequence similarity in their flanking sequences. A strong sequence similarity was detected in the 3' flanks of four CCAA repeats (at positions 677215, 705969, 760672 and 1633204, denoted CCAA1, CCAA2, CCAA3 and CCAA5, respectively) extending ~2 kb downstream from the repeats, and to a lesser extent in their 5' flanks extending ~400 bp (Table 3 ). No significant sequence similarity was detected in the flanks of the other tetranucleotide iterations. The repeat CCAA3 containing 36 iterations appears in the putative gene tbp1 (transferrin binding protein 1, >3000 bp in length) near its 5' end. CCAA1 (20 iterations) is in an intergenic region upstream of the homologous gene tbp2 . The other two CCAA repeats (CCAA2, 20 iterations, and CCAA5, 18 iterations) are in short ORFs (Table 2 ). The count of CCAA iterations in CCAA2 not being a multiple of three (in contrast with the count of CCAA3 iterations) results in a stop codon proximal downstream from the CCAA2 repeat. The ORFs containing CCAA5 and CCAA3 do not continue in the same frame due to a single base insertion 5' to the CCAA iterations (Table 3 ). All insertions/deletions for ~2 kb after the CCAA iterations are multiples of 3 bp suggesting a protein-coding character of the flanking sequences.
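A search for tandem tetranucleotide iterations, and the frame consequence of the iteration count, can be sketched as follows. This is a hypothetical illustration using a regular expression on a toy sequence, not the procedure used for the genome scan. Note that the reported unit depends on the phase at which the run is entered, so the same run could be labelled CCAA, CAAC, AACC or ACCA.

```python
import re


def find_tetranucleotide_repeats(sequence, min_iterations=15):
    """Return (start, unit, iterations) for exact tandem runs of a 4 bp unit."""
    pattern = re.compile(r"([ACGT]{4})\1{%d,}" % (min_iterations - 1))
    hits = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        if unit[:2] == unit[2:]:
            continue  # skip runs that are really mono- or dinucleotide repeats
        hits.append((match.start(), unit, len(match.group(0)) // 4))
    return hits


def frame_shift(iterations, unit_length=4):
    """Residue mod 3 of the repeat length; 0 means the reading frame is kept."""
    return (iterations * unit_length) % 3


# Toy sequence (invented): 16 iterations of CCAA with short flanks.
toy = "GG" + "CCAA" * 16 + "TT"
toy_hits = find_tetranucleotide_repeats(toy)

shift_20 = frame_shift(20)   # e.g. a run of 20 iterations (80 bp)
shift_36 = frame_shift(36)   # e.g. a run of 36 iterations (144 bp)
```

Since 4 and 3 share no common factor, a run of n tetranucleotides preserves the downstream reading frame only when n is a multiple of three, which is why gaining or losing a single unit by slippage changes the frame.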

Table 2 Tandem tetranucleotide iterations in H.influenzae a

| Tetranucleotide/Gene b | 5' USS c | Start (5')-End (3') | 3' USS c |
|---|---|---|---|
| tbp2 | | 676 963-674 105 d | |
| (CCAA)20 | 1987 | 677 215-677 132 d | 3835 |
| orf (HI0636) | | 677 939-677 652 d | |
| (CCAA)20 | 766 | 705 969-705 896 d | 2855 |
| orf (HI0662) | | 706 065-705 874 d | |
| (CCAA)36 | 210 | 760 672-760 525 d | 4492 |
| tbp1 | | 760 746-757 495 d | |
| rsgA | | 1 480 514-1 481 008 | |
| (CCAA)16 | 2748 | 1 481 221-1 481 284 | 920 |
| orf (HI1386) | | 1 481 385-1 481 828 | |
| (CCAA)18 | 3488 | 1 633 204-1 633 279 | 2899 |
| orf (HI1566) | | 1 633 129-1 634 017 | |
| hypothetical protein (HI0352) | | 379 337-378 645 d | |
| (TCAA)33 | 2570 | 379 651-379 520 d | 2199 |
| nasD | | 380 053-380 772 | |
| (TCAA)23 | 1035 | 570 889-570 798 d | 865 |
| lipooligosaccharide biosynthesis (HI0550) | | 571 008-570 103 d | |
| orf (HI1536) | | 1 607 592-1 607 936 | |
| (TCAA)17 | 203 | 1 608 033-1 608 100 | 1064 |
| licA | | 1 608 191-1 608 991 | |
| yopA | | 1 542 870-1 542 832 d | |
| (GCAA)25 | 192 | 1 543 250-1 543 151 d | 3553 |
| nodT | | 1 543 493-1 544 854 | |
| (GACA)22 | 981 | 288 840-288 753 d | 2811 |
| glycosyl transferase (HI0259) | | 288 842-288 183 d | |
| (AGTC)32 | 913 | 1 123 049-1 122 922 d | 1873 |
| orf (HI1058) | | 1 123 460-1 122 879 d | |

a Positions of tetranucleotides iterated >= 15 times and overlapping or flanking genes are shown.
b Genes identified by sequence homology (7), ORFs predicted by GeneMark (35). Positions of two flanking genes are shown for intergenic tetranucleotide iterations. Only one gene is indicated if the iteration is included in this gene.
c Distance (in bp) to the nearest USS.
d Gene/iteration is read in the reverse orientation (on the complementary strand).

Strikingly, the tetranucleotide iterations (and the CCAA iterations in particular) are associated with unusually long gaps between successive USSs, mostly exceeding 3500 bp, compared with the average distance between two successive USSs of 1235 bp (Table 2 ). During a search for tetranucleotide iterations in the aggregate bacterial sequences from GenBank (Release 94, April 1996), long tetranucleotide iterations ( >= 10 iterations) were found exclusively in H.influenzae sequences. Long pentanucleotide iterations ( >= 7 iterations) were found in sequences from the genus Neisseria ( N.gonorrhoeae , N.meningitidis and N.flava ) and H.influenzae .


Table 3. Alignment of homologous flanking sequences of CCAA tandem repeats a

a Bold underlined letters locate the CCAA tandem repeats; initiation and termination codons of putative genes (ORFs) are also in bold. Conserved sites are indicated by uppercase letters. Significant sequence similarity extends beyond the displayed alignment (see Results for details).

Other frequent oligonucleotides

Some frequent words are substrings of the 12 bp word TTCGCCTTKTTC (K = T or G) showing 17 occurrences in the H.influenzae genome. Its inverted complement GAAMAAGGCGAA (M = A or C) occurs nine times. Three of the 26 occurrences (the uppercase letters) are in the sequence

...ttttcctcgttctccttgtccgccttgTTCGCCTTGTTCGCCTTGTTCGCCTTGTTCgccttt...

at coordinates 1788929-1788991 which is an imperfect 7-fold tandem repeat of the 9 bp word TTCGCCTTG. The tandem repeat is intergenic and 4 bp downstream of a putative ORF. The other 23 occurrences are all in genes (or ORFs) and mostly confined to the reading frame GAA-MAA-GGC-GAA translated as E(K/Q)GE.

The 12 bp words TCTAATTCTTCM and KGAAGAATTAGA occur 17 times in genes and invariably code for the tetrapeptide EEL(E/D). Also the words AAGATGGTTTA and TAAACGTTCTG, and their inverted complements are principally protein coding, generally translated as KPS(F/T) and (S/P/A/T)ERL, respectively.

Frequent peptides among H.influenzae genes

The aggregate amino acid count over all known and putative proteins in H.influenzae is just over 500 000. By equation [ 1 ] with A = 20, the frequent word size is s = 5 and the minimum copy number for each word depending on its composition is determined from formula [ 2 ].

Table 4 lists all frequent pentapeptides, their observed copy number, and the cutoff level (in parentheses) required for the word to qualify as frequent. Frequent 5-peptides were, where possible, combined into groups of larger size words (6mers, 7mers, etc.) with their observed counts. Several groups of frequent words come from frequent oligonucleotides, especially USSs and tetranucleotide iterations (of CCAA, TCAA, AGTC, GACA).

Table 4 Frequent oligopeptides in the H.influenzae proteins a

a Significantly frequent pentapeptides were organized into several groups. They are listed together with their counts in the H.influenzae protein sequences (aggregate >500 000 aa). Counts of interesting longer oligopeptides combined from overlapping pentapeptides are indicated. The cutoffs (minimum count for each particular word to qualify as frequent, in parentheses) were determined only for pentapeptides.
b Tandem repeat of EGKCG in HI1601 (ORF of ~100 amino acids).
c For comparison, the E.coli frequent pentapeptides are those related to ATP-binding (same as the groups 4a, 4b and 4c), GHVDHGKT (group 6), and several oligopeptides with ~5 occurrences (HLYHCDHR, LCSHCR, NAWWV, HQLQQ, NRHRY, CPSCS, QLGFS, ELAKQ, GLYYN, CRKTW, TPDGR, WCAEY).

Frequent pentapeptides derived from USS sequences. The USS+ generate frequent peptides primarily centered on the tripeptide serine-alanine-valine (SAV) and the USS- yield frequent peptides primarily centered on the tripeptide TAL. Other groups of 5mers derived in different frames from USSs translate to PHF including the pentapeptides N(Q/L)PHF, and to VR, including the pentapeptides KVRLK or EKVRS. The most frequent peptides have the triresidue flanked by the two charged amino acids K and E, showing 31 KSAVK, 20 KSAVE, 16 ESAVE and 18 ESAVK.

There are 13 protein coding USS +/- dyads with loop length 8 bp; 10 of the 13 are in the same frame and translate to the consensus KSAVKNDRTF (SAV and RT are invariant in all 10). The 5mers SAVKN and KNDRT both qualify as frequent words. The 6mers ESAVKN and KSAVKN have 14 and 11 occurrences, respectively.

Frequent pentapeptides derived from tetranucleotide iterations. (i) The exact tetranucleotide tandem repeat of CCAA translates to (PTNQ)n; (ii) of AGTC to (SQSV)n; (iii) of GACA to (QTDR)n; and (iv) of TCAA to (NQSI)n.

Frequent peptides connected with ATP-binding and hydrolysis. Groups 4a, 4b, and 4c of Table 4 show significantly frequent peptides associated with ATP binding. Group 4a frequent words are related to the motif GXXGXGK(S/T)TL (familiar as the Walker-box A-site; 23). Group 4b frequent words are related to the motif LLDEPTN associated with the ATP hydrolysis B-site. This B-site, generally located 40-70 residues downstream of the A-site, was characterized by Walker et al. (10) in the form ΦΦΦΦD representing four successive unspecified hydrophobic residues culminating with the essential aspartate residue. As identified in X-ray crystal structures the Walker A and B boxes mostly contribute to the ATP-binding pocket of ATP-dependent transport proteins. A third motif approximately of the form LSGGQ(Q/R)Q ~20 amino acids upstream from the B-site was identified in ref. 8. This is related to the frequent words of group 4c (Table 4). We label this motif the C-box.
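A motif such as the Walker A-site can be located in a protein sequence with a regular expression in which X becomes a wildcard. The Python sketch below is a hypothetical illustration covering only the core GXXGXGK(S/T) of the motif; the toy peptide is invented and does not come from any H.influenzae protein.

```python
import re

# G-X-X-G-X-G-K-(S/T): "." stands for any residue X.
WALKER_A = re.compile(r"G..G.GK[ST]")


def find_walker_a(protein):
    """Return (start, matched segment) for non-overlapping Walker A hits."""
    return [(m.start(), m.group(0)) for m in WALKER_A.finditer(protein)]


# Toy peptide (invented) containing a single A-site-like segment.
toy_protein = "MKLLAIVGPNGAGKSTLLRAL"
toy_hits = find_walker_a(toy_protein)
```

A pattern this loose will also match unrelated segments by chance in a large protein set, so hits would normally be checked against the downstream B-site and the known function of the protein.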

The most frequent pentapeptide words among E.coli proteins correspond to the A, B and C motifs described above and are found mostly in ATP-dependent transport proteins ( 8 , 13 ).

In comparing H.influenzae with E.coli or closely related bacterial proteins, more than 35 significant alignments substantially reflect common occurrence of the three ATP binding motifs. Examples include the HI1272 and HI1470 ORFs presumably orthologous with ferric enterobactin transport ATP-binding protein ( fepC of E.coli ); HI0036 with the multidrug resistance protein mdl of E.coli ; HI0354 with the nasD protein ( Klebsiella pneumoniae ); and HI0664 with mdrH protein ( E.coli ).

The frequent motifs HVDHG, VDHGK and DHGKT which combine into HVDHGKT (five occurrences) are notable since they occur in tufB-A (elongation factor EF-Tu, HI0578), in tufB-B (duplicate, HI0632), in selB (translation factor, HI0709), in infB (initiation factor IF-2, HI1284), and in two hypothetical proteins HI0864 (similar to GenBank L19201 of E.coli ) and HI1195 (similar to SwissProt P32066 of E.coli ). Is it possible that HVDHGKT is a vital motif related to translation?

The frequent peptide HDHDH corresponds to a striking period-two histidine run

D HDHDH KHEHKHDHK HDHDHDHDH KHEHKHDHEH HDHDH

mostly alternating (HX) 17 , predominantly with X = aspartate, encoded by the gene fimA-A (HI0119). fimA-A and fimA-B (HI0362) are 'duplicated' genes, both similar to SwissProt P31305 ( Streptococcus parasanguis ), which is considered a cell envelope adhesin B precursor. The fimA-A, fimA-B and P31305 proteins align quite well, but fimA-B and P31305 are totally missing the period-2 histidine run.

There is no obvious functional correlation among the proteins possessing the frequent word LTEEQ (10 occurrences).

The KMSKS peptide is found exclusively in aminoacyl-tRNA synthetase class I proteins (Cys, Trp, Leu, Met, Val, Tyr, generally aromatic or hydrophobic, see also ref. 24 ). Strikingly, no other putative H.influenzae aminoacyl-tRNA synthetase contains a peptide with one amino acid change from the motif KMSKS.

The CASCH peptide occurs once each in a cytochrome c-type protein (HI0644), in a denitrification system component (HI0348), and in formate-dependent nitrite reductase (HI1068). These proteins are all related to electron transfer. The motif is familiar in binding to a heme of a cytochrome where H is the axial ligand and the two cysteines covalently attach at the heme end. There are seven cytochromes in the H.influenzae genome but the pattern CXXCH exists only in three of these. For example, the pattern is not found in cytochrome B562 or in the cytochrome P450 family.

DISCUSSION

Haemophilus influenzae is a parasite of the upper respiratory mucosa in humans. The refined analysis of Tatusov et al . ( 4 ) reported on 1703 genes of H.influenzae including more than 1000 genes significantly similar to E.coli genes. On the basis of these similarities, these authors propose shared and disparate metabolic and other functional pathways of H.influenzae . In particular, they call attention to the absence of functional motility and chemotaxis operons in H.influenzae .

Apart from gene similarities and differences of H.influenzae with other bacterial genomes, intriguing questions remain about genomic organization in relation to repetitive structures (direct and inverted), short oligonucleotide compositional biases, classification of control elements, codon preferences, etc. In this paper, we concentrated on the identification and analysis of the distributional properties of frequent words (oligonucleotides and peptides) of the H.influenzae genome. Our method for characterizing frequent words is based on a Poisson approximation with respect to word size, sequence length and composition (see Methods). This analysis provides a novel perspective on sequence heterogeneity. Four classes of frequent oligonucleotides stand out: (i) USSs, (ii) IDSs, (iii) tetranucleotide iterations and (iv) special groups. Why are these frequent in H.influenzae ? We venture some models and interpretations.

(i) USSs (USS + and USS - ). USSs are highly frequent and virtually dense. Haemophilus influenzae is a naturally transformable organism which takes up double-stranded DNA of its own species, facilitated through recognition of USSs ( 11 ). Nearly all cells of H.influenzae become competent. The uptake mechanisms are largely unknown. Another bacterium with correspondingly directed uptake is Neisseria gonorrhoeae . Generally, a small percentage (~10%) of a population of B.subtilis can become competent for uptake of non-specific DNA sequences ( 11 , 12 ). In B.subtilis and S.pneumoniae , competence is regulated by cell density, cell-cell signaling, and nutritional signaling dependence on growth conditions. In B.subtilis more than 40 genes have been identified that are required for competence. Natural genetic competence has also been reported in many other genera ( 11 ). Although DNA uptake is widespread in bacterial cells, non-specific integration into the chromosome seems to be rare ( 11 ).

The distributions of USS + and USS - in the genome are intriguing in several respects: (i) they occur in almost equal counts in the two strands around the genome; (ii) the separate USS + and USS - sequences are significantly evenly spaced about the chromosome; and (iii) protein coding USSs (~63% of all USSs) predominantly appear in the same reading frame where USS + s translate to the tripeptide SAV and USS - s translate to the tripeptide TAL. Their high density and significantly even distribution around the genome suggest that they may contribute to global genomic activities such as replication and repair (the DNA repair hypothesis, e.g., ref. 25 ), sites of membrane attachments in association with domain loops, sites of nucleating Okazaki fragments or helix unwinding and/or sites contributing to genome packaging. It is reasonably established that transforming DNA increases the survivorship of cultures ( 11 ).

A major hypothesis concerning H.influenzae (and some other bacterial organisms) is that natural genetic competence (transformation) evolved and is maintained for the function of acquiring templates mediating repair of DNA lesions ( 26 ). One possibility is that the uptake of DNA followed by the production of single-stranded tails could induce higher levels of Rec enzymes and concomitantly increase the extent of DNA repair. In fact, single-stranded DNA is known to be an inducing signal of SOS repair and RecA polymerization in binding ssDNA of E.coli . Other possible roles of natural genetic competence have been attributed to benefits for horizontal gene transfer (e.g., transfer of antibiotic resistance determinants), for the repair of damaged chromosomes that are rescued by recombination with exogenous homologous DNA, for conversion of mutant alleles to functional alleles, or simply as a good nutrient source ( 11 , 25 ).

As pointed out by Smith et al . ( 6 ), alignment of USS + sequences leads in more than 90% of cases to the consensus sequence

aAAGTGCGGT.rwwwww......rwwwww

of length 29 bp where w refers to weak bases (A or T). Thus, the recognition part of USS appears to be at least 29 bp long with a highly conserved 9 bp core. This fact implies that USS + sequences are displaced by at least 26 bp but generally much more. A similar accounting applies to the distribution of USS - . When the 20 bases immediately 3' to all USS + core sequences were removed, r -scan statistics showed that the even distribution of USS + s persisted. This observation suggests that there is interference of some sort between close USS + s of the same orientation.
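As an illustration of how the 9 bp core sites on the two strands could be located, here is a minimal Python sketch. It is an assumption-laden example, not the authors' procedure: it matches only the exact core AAGTGCGGT and its inverted complement, ignores the flanking weak-base runs of the 29 bp consensus, and uses hypothetical function names and a made-up toy sequence.

```python
import re

USS_PLUS = "AAGTGCGGT"  # 9 bp core quoted in the text


def reverse_complement(seq):
    """Inverted complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


USS_MINUS = reverse_complement(USS_PLUS)  # ACCGCACTT


def find_all(seq, word):
    """0-based start positions of word in seq (overlapping matches allowed)."""
    return [m.start() for m in re.finditer(f"(?={re.escape(word)})", seq)]


def uss_sites(seq):
    """Exact core matches for USS+ and USS- read along the given strand."""
    seq = seq.upper()
    return {"USS+": find_all(seq, USS_PLUS), "USS-": find_all(seq, USS_MINUS)}


# Toy sequence (made up for illustration): one USS+ core and one USS- core.
toy = "GG" + USS_PLUS + "TTTT" + USS_MINUS + "CC"
print(uss_sites(toy))
```

The sorted position lists returned by such a scan are the natural input to spacing analyses of the kind discussed in the text.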

The preponderance of USS + (USS - ) encoded to SAV (TAL) might be explained as follows. The tripeptides SAV and TAL are both small and have a hydroxyl residue at one end and a hydrophobic residue at the other and are sufficiently spread out in order to have little influence on protein conformation. Translation in the other two frames involves the large residues arginine and histidine which could more easily disrupt protein 3D structure. The close USS dyads are predominantly intergenic, generally close downstream from a gene, and are rather similar to E.coli rho-independent transcription termination signals. The proliferation of USSs for global genomic purposes allows also their facile conversion into requisite terminator signals.

The palindrome GGCGATCGCC labeled HIP1 (highly iterated palindrome) is highly frequent in the 1 Mb contig of Synechocystis ( 14 ). How is it distributed? The r -scan analysis shows in this case also a significantly even distribution. In fact, for a random distribution of these words in the 1 Mb contig the chance that all successive spacings ( r = 1) exceed 9 bp has a probability <0.001. Impressively, the observed minimal spacing m 1 * is 52 bp. Similar conclusions apply to the r -scan lengths for r = 2, 3, ..., 6. Thus, the even spacing of the HIP1 in Synechocystis is considerably more dramatic than the even spacings of the USSs in H.influenzae . Synechocystis , like H.influenzae , is known to be transformable ( 11 , 27 ). Whether the HIP1 sequences serve as recognition sites in this capacity is unknown. The significance of its palindromic character is also intriguing.
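The spacing argument can be made concrete with a short Python sketch. It is only an approximation of the r -scan analysis referred to above: `r_scan_extremes` computes the smallest and largest sums of r consecutive spacings between word positions, and `prob_min_spacing_exceeds` uses the classical result for n points placed uniformly at random on a circle, P(all spacings > x) = (1 - nx/L)^(n-1), as a stand-in for the random model. The function names are hypothetical, the circle model ignores the ends of a linear contig, and the number of HIP1 occurrences is not given in this section, so any n supplied is the reader's own input.

```python
def r_scan_extremes(positions, r=1):
    """Minimum and maximum of sums of r consecutive spacings between positions."""
    pts = sorted(positions)
    if len(pts) <= r:
        raise ValueError("need more than r positions")
    scans = [pts[i + r] - pts[i] for i in range(len(pts) - r)]
    return min(scans), max(scans)


def prob_min_spacing_exceeds(n, length, x):
    """P(every spacing > x) for n uniform random points on a circle of size length."""
    frac = 1.0 - n * x / length
    return frac ** (n - 1) if frac > 0 else 0.0


# Toy positions (made up): spacings are 30, 60 and 30.
print(r_scan_extremes([10, 40, 100, 130], r=1))
print(r_scan_extremes([10, 40, 100, 130], r=2))
```

An observed minimal spacing far larger than the value at which this probability becomes negligible is what indicates spacing that is more even than chance.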

Many bacteria can develop the state of physiological competence for natural DNA uptake that is consistent with a bacterial gene transfer of free DNA. Haemophilus influenzae (and N.gonorrhoeae ) can only bind and take up double-stranded and single-stranded DNA from the same or closely related species. This is different from B.subtilis where the DNA uptake tends to be non-specific and most cells are not competent. By contrast, natural genetic competence in H.influenzae and A.vinelandii can be attained by almost 100% of cells ( 11 ). As noted previously, the degree of bacterial cell competence appears to be correlated with the presence of highly frequent words.

(ii) The role of IDSs. The intergenic 14 bp stem dyad sequence (IDS) frequent words occur mostly in clusters that provide the potential for variable secondary structure. This suggests that these sequences may form protein binding sites which could be important for regulating the activity of the flanking genes.

(iii) Tetranucleotide iterations. Mechanisms promoting alterations in the frequency of gene expression include introduction of frameshifts that affect transcription and/or translation ( 28 ). These frameshifts are caused by insertions and deletions that are more likely within regions of reiterated short oligonucleotides. Moxon et al. ( 28 ) and Rainey et al. ( 29 ), for a number of pathogenic bacterial populations, highlight non-standard mutation mechanisms which occur at special loci, called 'contingency' genes ( 28 ). These authors explicitly discuss the case of the repeat tract, (TCAA) 16 , present in H.influenzae at the 5' end of the lic2 gene essential for synthesis of digalactoside ( 30 ). In this context, the loss or gain of one or more CAAT unit(s) may alter the reading frame, resulting in change in the synthesis of the digalactoside.
More generally, in appropriate tandem repeats (in the coding region and/or in gene regulatory regions) polymerase slippage, homologous recombination or mismatch repair occurring during chromosomal replication can generate a heterogenous population of cells that can facilitate infection or can counter host defense mechanisms ( 28 ). Thus, variation in the numbers of repeated segments can modulate alternative gene expression patterns in a population. Other examples of variable gene expression putatively controlled by repeat sequence tracts occur in Bordetella pertussis ( 31 ), in Neisseria meningitidis ( 32 ) and in Neisseria gonorrhoeae ( 33 ).

The example of (AGTC) 32 in H.influenzae that offers at least two alternatively encoded genes is particularly interesting (Fig. 2 ). In frame 1, (AGTC) 32 is part of a 194 aa ORF. In frame 2, the gene encodes Mod (629 aa), similar to the type III restriction-modification ECOP15 enzyme of E.coli . Frame 3 is 'null', flanked by multiple termination codons. Both genes (ORF and Mod) have strong Shine-Dalgarno signals and both were predicted independently as bona fide genes by Fleischmann et al. ( 1 ) and Tatusov et al. ( 4 ). Thus, in a heterogeneous population two separate genes or a single complete (fused) gene might be expressed, or the mod gene may be switched off depending on the number of AGTC iterations. This sequence exhibits a rare example in a bacterial genome of two genes encoded in the same orientation in different reading frames overlapping more than 40 amino acids.
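The dependence of the downstream reading frame on the number of repeat units is simple modular arithmetic, sketched below in Python. This is an illustrative calculation only: the function name is hypothetical, and the sketch does not attempt to say which of the frames in Figure 2 corresponds to which offset.

```python
def downstream_frame(n_units, unit_len=4):
    """Frame offset (0, 1 or 2) of sequence after the tract relative to before it."""
    return (n_units * unit_len) % 3


# Offsets for tracts of 30 to 33 tetranucleotide units.
frames = {n: downstream_frame(n) for n in range(30, 34)}
print(frames)  # {30: 0, 31: 1, 32: 2, 33: 0}
```

Gaining or losing a single tetranucleotide unit shifts the downstream frame by one position, whereas a change of three units restores the original frame, which is why such tracts can act as reversible switches.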


Figure 2 . The influence of the (AGTC) n iteration on the encoded genes. Termination codons are marked by <=> . See Discussion for details.

There are four occurrences of the tetranucleotide iterations (CCAA) 20 , (CCAA) 20 , (CCAA) 36 and (CCAA) 18 , at positions 677, 706, 761 and 1633 kb, respectively, whose flanking sequences extending ~2 kb downstream and ~400 bp upstream are substantially similar (~90% identical nucleotides) (Table 3 ). Reasons why the flanking sequences of the (CCAA) n repeats are largely congruent are unclear. Is it possible that their movement around the genome is channeled through transposon activity? By contrast with the different (CCAA) n tracts, the flanking sequences of (TCAA) 33 , (TCAA) 23 and (TCAA) 17 are not similar. As previously noted, variation in the number of (CCAA) n iterations is a strategy that can alter the translational frame and/or intensity of DNA supercoiling in regulation of gene expression and provide a repertoire of genetic polymorphism ( 28 ). In particular, these tetranucleotide iterations can turn cell lines sometimes on and sometimes off. Thus, they produce population heterogeneity related to mutations at specific loci in contrast with mechanisms that affect mutation rates throughout the genome. Mechanisms capable of generating such non-standard random variation putatively provide a solution to the problem of enabling rapid and reversible response to environmental changes that are frequently encountered in the bacterial natural habitat.

Generally, long oligonucleotide iterations are scarce in bacterial sequences or ordinarily emphasize trinucleotide repeats which obviously conserve the reading frame (e.g., for the current sequences available in Mycoplasma genitalium , 11 distinct trinucleotide iterations; Synechocystis sp ., four; Mycobacterium leprae , six; E.coli , two).

(iv) Frequent oligopeptides. The peptide frequent words are of two major kinds: those induced by oligonucleotide frequent words (USSs, tetranucleotide iterations) and those of bona fide pentapeptides that allow variable synonymous codons. The most frequent 5mers for both H.influenzae and E.coli are associated with ATP binding occurring in three forms: A-box (contributes to binding pocket), B-box (hydrolysis site) and C-box ( 8 ) of unknown function.

The A-box motif GXXGXGK(S/T)TL comprises, in H.influenzae , 31 copies of GKSTL, 19 of GAGKS, and of the combined 6mers 15 of GAGKST, 11 of AGKSTL, 12 of GKSTLL and many slight variations (Table 4 ). The B-box has predominantly 20 LDEPT and the 6mers 11 LLDEPT, 10 LLLDEP and their variations. The C-box includes 28 LSGG(E/Q) and six copies each of the 6mers LSGGER, QRVALA and RVALAR. The simultaneous occurrence of A, B and C boxes is often found in ATP-dependent transport proteins ( 8 , 13 ). However, the RecA lineage of proteins replaces the B-box with ΦΦΦΦDSA (Φ is a hydrophobic residue). Apparently, the ATP binding motifs divide into several functional groups, probably not homologous but analogous.
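The three motifs can be expressed as simple patterns for scanning a protein sequence. The Python sketch below is illustrative only: the regular expressions are direct transcriptions of the motif strings quoted above (with X as any residue), the function name is hypothetical, and the toy sequence is invented and much shorter than the spacings between boxes described in the text.

```python
import re

# Patterns transcribed from the motifs in the text; '.' stands for X (any residue).
MOTIFS = {
    "A-box": re.compile(r"G..G.GK[ST]TL"),  # GXXGXGK(S/T)TL
    "B-box": re.compile(r"LDEPT"),          # core of LLDEPTN
    "C-box": re.compile(r"LSGG[EQ]"),       # LSGG(E/Q)
}


def scan_motifs(protein):
    """Return 0-based start positions of each motif in a protein sequence."""
    protein = protein.upper()
    return {name: [m.start() for m in pattern.finditer(protein)]
            for name, pattern in MOTIFS.items()}


# Invented toy sequence containing one copy of each box.
toy_protein = "MKLGPNGAGKSTLLRAIAGLSGGQRQRVALARALLLDEPTNALD"
print(scan_motifs(toy_protein))
```

On real sequences one would additionally check the order and spacing of the hits (A-box first, C-box roughly 20 residues upstream of the B-box) before calling a protein a candidate ATP-dependent transporter.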

A strong motif associated with translation, HVDHGKT, occurs in five copies. These peptides appear to be frequent in the sequences of bacteria that infect human hosts.

The frequent amino acid words of H.influenzae can be compared with the frequent pentapeptides in E.coli . These feature: (i) ATP-binding motifs (sites A, B and C same as 4a, 4b and 4c above); (ii) GHVDHGKT; (iii) other frequent words with ~5 occurrences (see Table 4 footnote).

For comparative purposes it is useful to display the consensus ATP and GTP binding motifs among yeast and human proteins: A-site, GXXGXGK(S/T)ΦΦ; and generally 20-50 aa downstream follows the B-site, (i/l)(l/w)DTAGQE(E/D)y; and 10-30 aa further downstream follows the C-site: (Q/K)(I/L/S/A)D(L/M).

Frequent words in 2543 non-redundant human protein sequences of aggregate 1 182 870 residues correspond to pentapeptides GPPGP and PGPPG, mostly in collagens and very frequent are the homopeptides G5, L5, P5, A5, Q5 or almost homopeptides. These are generally part of long homopeptides in individual sequences ( 13 ). These results are based on the equal frequency model. When we used the frequency-dependent model, and even more so with the Markov model, many of these homopeptide frequent words disappeared, since they were actually composed of higher-frequency amino acids and because the frequency of an amino acid to follow itself in at least 12 of the 20 amino acid types is usually high ( 34 ). Similarly, words such as DVWSF and FPIKW, which are highly conserved in the proteins in which they occur (many protein kinases), were missed because their expected frequency was overestimated by assuming equal amino acid frequencies.
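The effect of the background model on what counts as frequent can be illustrated numerically. The Python sketch below is a simplified stand-in for the criteria described in Methods, which are not reproduced here: it computes the expected number of occurrences of a pentapeptide under an independence model and a Poisson tail probability for an observed count. The amino acid frequencies are illustrative assumptions, not values from the paper, and the function names are hypothetical.

```python
import math

N_RESIDUES = 1_182_870  # aggregate length quoted in the text
N_SEQUENCES = 2543      # number of human protein sequences quoted in the text
WORD_LEN = 5
# Each sequence of length n offers n - 4 pentapeptide start positions.
N_POSITIONS = N_RESIDUES - (WORD_LEN - 1) * N_SEQUENCES

# ASSUMPTION: rough, illustrative residue frequencies (not taken from the paper).
ILLUSTRATIVE_FREQS = {"G": 0.07, "F": 0.04, "P": 0.06,
                      "I": 0.045, "K": 0.057, "W": 0.013}


def expected_count(word, n_positions, freqs=None):
    """Expected occurrences of word; freqs=None means equal frequencies (1/20)."""
    p = 1.0
    for aa in word:
        p *= (1 / 20) if freqs is None else freqs[aa]
    return n_positions * p


def poisson_tail(count, lam):
    """P(X >= count) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(count))


for word in ("GGGGG", "FPIKW"):
    equal = expected_count(word, N_POSITIONS)
    weighted = expected_count(word, N_POSITIONS, ILLUSTRATIVE_FREQS)
    print(word, round(equal, 3), round(weighted, 3))
```

Under equal frequencies every pentapeptide has the same expectation (about 0.37 here), so a word built from common residues such as G5 looks more surprising than it is, while a word containing rare residues such as FPIKW looks less surprising than it is; weighting by composition reverses both effects, in line with the behavior described above.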

Results so far affirm that E.coli and H.influenzae proteins have many fewer frequent words than do eukaryotic proteins, regardless of what model we use. Most of the frequent words in higher eukaryotes have been characteristic of zinc fingers, chymotrypsin proteases, serine/threonine and tyrosine kinases (which have many words in common but also some that distinguish them), immunoglobulin heavy and light chains and homeobox proteins, among others. Most of the active sites in these classes have at least one frequent word associated with them.

REFERENCES

1 Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.-F., Dougherty, B. A., Merrick, J. M., et al. (1995) Science, 269, 496-512.

2 Brosius, J. (1996) Science, 271, 1302.

3 Robison, K., Gilbert, W. and Church, G. M. (1996) Science, 271, 1302-1303.

4 Tatusov, R. L., Mushegian, A. R., Bork, P., Brown, N. P., Hayes, W. S., Borodovsky, M., Rudd, K. E. and Koonin, E. V. (1996) Curr. Biol., 6, 279-291.

5 Burge, C., Campbell, A. M. and Karlin, S. (1992) Proc. Natl. Acad. Sci. USA, 89, 1358-1362.

6 Smith, H. O., Tomb, J.-F., Dougherty, B. A., Fleischmann, R. D. and Venter, J. C. (1995) Science, 269, 538-540.

7 Krawiec, S. and Riley, M. (1990) Microbiol. Rev., 54, 502-539.

8 Blaisdell, B. E., Rudd, K. E., Matin, A. and Karlin, S. (1993) J. Mol. Biol., 229, 833-848.

9 Rawlings, N. D. and Barrett, A. J. (1994) Methods Enzymol., 244, 19-61.

10 Walker, J. E., Saraste, M., Runswick, M. J. and Gay, N. J. (1982) EMBO J., 1, 945-951.

11 Lorenz, M. G. and Wackernagel, W. (1994) Microbiol. Rev., 58, 563-602.

12 Solomon, J. M. and Grossman, A. D. (1996) Trends Genet., 12, 148-155.

13 Karlin, S. and Cardon, L. R. (1994) Annu. Rev. Microbiol., 48, 619-654.

14 Robinson, N. J., Robinson, P. J., Gupta, A., Bleasby, A. J., Whitton, B. A. and Morby, A. P. (1995) Nucleic Acids Res., 23, 729-735.

15 Karlin, S. and Taylor, H. M. (1975) A First Course in Stochastic Processes. Chapter 5. Academic Press, San Diego, CA, USA.

16 Kleffe, J. and Borodovsky, M. (1992) Comput. Appl. Biosci., 8, 433-441.

17 Schbath, S., Prum, B. and de Turckheim, E. (1995) J. Comput. Biol., 2, 417-437.

18 Karlin, S. and Brendel, V. (1996) In Arrow, K. J., Cottle, R. W., Eaves, B. C. and Olkin, I. (eds), Education in a Research University. Stanford University Press, Stanford, CA, USA pp. 407-427.

19 Karlin, S. and Leung, M.-Y. (1991) Ann. Appl. Prob., 1, 513-538.

20 Dembo, A. and Karlin, S. (1992) Ann. Appl. Prob., 2, 329-357.

21 Kahn, M. E. and Smith, H. O. (1984) J. Membrane Biol., 81, 89-103.

22 Mrázek, J. and Karlin, S. (1996) Trends Biochem. Sci., 21, 201-202.

23 Higgins, C. F., Gallagher, M. P., Mimmack, M. L. and Pearce, S. R. (1988) BioEssays, 8, 111-116.

24 Nagel, G. M. and Doolittle, R. F. (1995) J. Mol. Evol., 40, 487-498.

25 Mongold, J. A. (1992) Genetics, 132, 893-898.

26 Bernstein, H., Byerly, H. C., Hopf, F. A. and Michod, R. E. (1985) Int. Rev. Cytol., 96, 1-24.

27 Grigorieva, G. and Shestakov, S. (1982) FEMS Microbiol. Lett., 13, 367-370.

28 Moxon, E. R., Rainey, P. B., Nowak, M. A. and Lenski, R. E. (1994) Curr. Biol., 4, 24-33.

29 Rainey, P. B., Moxon, E. R. and Thompson, I. P. (1993) Adv. Microbiol. Ecol., 13, 263-300.

30 High, N. J., Deadman, M. E. and Moxon, E. R. (1993) Mol. Microbiol., 9, 1275-1282.

31 Willems, R., Paul, A., van der Heide, H. G. J., Ter Avest, A. R. and Mooi, F. R. (1990) EMBO J., 9, 2803-2810.

32 Olyhoek, A. J. M., Sarkari, J., Bopp, M., Morelli, G. and Achtman, M. (1991) Microb. Pathog., 11, 249-257.

33 Stern, A., Brown, M., Nickel, P. and Meyer, T. F. (1986) Cell, 47, 61-67.

34 Karlin, S., Bucher, P., Brendel, V. and Altschul, S. (1991) Annu. Rev. Biophys. Biophys. Chem., 20, 175-203.

35 Borodovsky, M. and McIninch, J. (1993) Comput. Chem., 17, 123-133.



*To whom correspondence should be addressed. Tel: +1 415 723 2204; Fax: +1 415 725 2040; Email: fd.zgg@forsythe.stanford.edu