Sequence analysis of 56 kb from the genome of the bacterium Mycoplasma pneumoniae comprising the dnaA region, the atp operon and a cluster of ribosomal protein genes
Helmut Hilbert, Ralf Himmelreich, Helga Plagens and Richard Herrmann*
Zentrum für Molekulare Biologie Heidelberg, Mikrobiologie, Universität Heidelberg, 69120 Heidelberg, Germany
Received October 31, 1995; Revised and Accepted December 29, 1995
GenBank accession nos U34816, U43738 and U34795
ABSTRACT
To sequence the entire 800 kilobase pair genome of the bacterium Mycoplasma pneumoniae, a plasmid library was established which contained the majority of the EcoRI fragments from M.pneumoniae. The EcoRI fragments were subcloned from an ordered cosmid library comprising the complete M.pneumoniae genome. Individual plasmid clones were sequenced in an ordered fashion mainly by primer walking. We report here the initial results from the sequence analysis of ~56 kb comprising the dnaA region as a potential origin of replication, the ATPase operon and a region coding for a cluster of ribosomal protein genes. The data were compared with the corresponding genes/operons from Bacillus subtilis, Escherichia coli, Mycoplasma capricolum and Mycoplasma gallisepticum.
INTRODUCTION
The bacteria with the smallest known genomes are found among members of the class Mollicutes (1). This class presently comprises the six eubacterial genera Acholeplasma, Anaeroplasma, Asteroleplasma, Mycoplasma, Spiroplasma and Ureaplasma (however, the term mycoplasma has been frequently used to denote any species included in the class Mollicutes). The common characteristics are the complete lack of a bacterial cell wall, osmotic fragility, colony shape and filterability through 450 nm pore diameter membrane filters (2). The relatively close phylogenetic relationship of these genera was measured by comparative sequence analysis of the 5S and 16S ribosomal RNA (rRNA) (3,4). The rRNA sequence analyses also revealed that the Mollicutes are not at the root of the bacterial phylogenetic tree, but rather developed by degenerate evolution from gram-positive bacteria with a low mol% G+C (guanine plus cytosine) content of DNA, the Lactobacillus group containing Lactobacillus, Bacillus, Streptococcus and two Clostridium species (5). During the process of evolution the Mollicutes lost a substantial part of their genetic information. This is reflected by significantly smaller genome sizes, as low as 600 kb and extending to 2300 kb (6), as compared with the 2500-5700 kb long genomes of their ancestor bacteria (1). The loss of coding capacity could probably be tolerated because of the parasitic life style of the Mollicutes. They have never been found as freely living organisms. In nature, Mollicutes depend on a host cell or a host organism. For instance, Mycoplasmas and Ureaplasmas are parasites in different vertebrates, from which they obtain essential compounds such as fatty acids, amino acids, precursors for nucleic acid synthesis and cholesterol, a compound normally not found in bacteria. Only Acholeplasma and Asteroleplasma do not require cholesterol for growth (7).
Mycoplasma pneumoniae, the subject of this study, is a human pathogenic bacterium causing tracheobronchitis and primary atypical pneumonia. It colonizes the surface of human respiratory tract epithelial cells while remaining associated with the host cell (8). In the laboratory, M.pneumoniae can be grown without a host cell in rich medium supplemented with 10-20% horse serum. The lack of a cell wall most probably facilitates the close contact between M.pneumoniae and its host cell and guarantees the exchange of compounds which support the growth of the bacterium. As a consequence of this bacterial surface parasitism the host cell is severely damaged. The exchange of toxic metabolic compounds is discussed as a possible cause of cell damage (9); however, at this stage not a single toxic compound has been identified as a causative agent of cell damage.
Mycoplasma pneumoniae has an exceptional position among the Mollicutes since its DNA has the highest G+C content (41 mol%), whereas the genomes of most of the other Mollicutes have a G+C content <30 mol% (2,6). The genome size of M.pneumoniae is ~800 kb (10,11), corresponding to a coding capacity for 700 proteins assuming an average molecular mass of 40 000 Da. Hence M.pneumoniae is among the smallest self-replicating cells known today (12).
It was selected, mainly for this reason, as a model system for defining the minimal genetic requirements of an autonomously reproducing cell. This can be done by determining as many genes as possible and then classifying them into essential and non-essential ones. Based on these results we should be able to define a set of genes which are sufficient for the reproduction of M.pneumoniae under defined laboratory conditions. Morowitz proposed several years ago that a mycoplasma species would be a suitable candidate for defining the essentials of a self-replicating cell (13). Apart from this model character as a genetically reduced self-replicating cell, M.pneumoniae offers a number of interesting phenomena to analyze. For instance, studies on the interaction between this prokaryotic surface parasite and its eukaryotic host cell, including the host immune reaction, might help to reveal bacterial pathomechanisms (14). Another promising area of research concerns the bacterial cytoskeleton. Despite the lack of a cell wall and other cellular appendages, M.pneumoniae exhibits a characteristic cell shape and motility. Both might be correlated to a cytoskeleton-like structure (15). Last, but not least, the evolution of the Mollicutes is, despite considerable progress in this field, still left with many unanswered questions. The large body of DNA sequence data from bacteria which are phylogenetically related to M.pneumoniae, such as Bacillus subtilis, might allow the process of degenerate evolution to be reconstructed and help to explain how Mollicutes genomes with different G+C contents, between 25 and 41 mol%, developed.
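The coding-capacity estimate quoted above (~800 kb, 700 proteins of 40 000 Da on average) can be reproduced with simple arithmetic. The sketch below is only an illustration: the average residue mass of 110 Da and the assumption that the whole genome is coding are ours, not values stated in the text.

```python
# Back-of-the-envelope coding capacity of a small genome.
# Assumption (not from the paper): an average amino acid residue weighs ~110 Da,
# and the entire genome is treated as protein-coding.
AVG_RESIDUE_MASS_DA = 110.0


def coding_capacity(genome_bp, avg_protein_da, residue_mass_da=AVG_RESIDUE_MASS_DA):
    """Return the approximate number of average-sized proteins a genome can encode."""
    residues_per_protein = avg_protein_da / residue_mass_da
    bp_per_gene = residues_per_protein * 3  # three nucleotides per codon
    return genome_bp / bp_per_gene


estimate = coding_capacity(genome_bp=800_000, avg_protein_da=40_000)
print(f"about {estimate:.0f} proteins")
```

Under these assumptions the result falls slightly above 700, which is consistent with the rounded figure given in the text once non-coding DNA and RNA genes are allowed for.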
Little is known about the genetics, physiology and molecular biology of M.pneumoniae in comparison with the relatively well studied bacteria Escherichia coli and B.subtilis (12). An efficient transformation system for M.pneumoniae comparable with the ones for E.coli is missing (16-18); however, transposon mutagenesis has been successfully applied for the isolation of mutants (19). The dependence on rich medium for growth prevents the isolation of auxotrophic mutants and the efficient incorporation of labelled precursors. These disadvantages can be compensated to a large extent by the methods of molecular biology, for example DNA cloning techniques, expression of genes or parts of genes in E.coli, restriction analysis and the construction of physical genome maps. Furthermore, improved DNA sequencing techniques, computer-aided data collection and analysis, and a rapidly expanding source of information on genes and proteins in freely accessible data banks together allow genes to be proposed on the basis of DNA or protein sequence homology. At present ~50-70% of DNA sequences derived from open reading frames can be defined by significant sequence homology to known genes, gene products or conserved typical motifs in proteins or DNA sequences. DNA sequence analysis is therefore the fastest way to identify a large number of genes of a given genome (20,21).
This paper introduces the M.pneumoniae genome sequencing project and describes the overall strategies and first results.
Sequencing strategy
The general strategy is to sequence both strands in a directed fashion by primer walking and to limit random ('shotgun') sequencing to a minimum. DNA sequence data are being generated by the Sanger method (22) using fluorescent dye labelled primers or dideoxynucleotides in combination with a semi-automated DNA sequencing unit (23-26).
An ordered cosmid library containing the complete M.pneumoniae genome in 34 overlapping cosmids, two λ phages and one plasmid (10) was the starting point for the project. The cosmid library was constructed by partial digestion of the M.pneumoniae genome with the restriction endonuclease EcoRI (27). The individual EcoRI DNA fragments from the cosmids are being further subcloned into a plasmid vector, resulting in a plasmid library consisting of clones each carrying one individual EcoRI DNA fragment. These fragments are between 0.1 and 28 kb long and are sequenced individually. The following methods are applied depending on the insert size. (i) Inserts up to 3 kb long are sequenced by primer walking only. (ii) Sets of nested deletions are constructed by the exonuclease III method (28) from clones with inserts between 3 and 10 kb long. A set comprising 20 nested deletions is normally sufficient to obtain the sequence from one strand of a 6 kb long fragment. The complementary strand and possible gaps in the first strand are then sequenced by primer walking. (iii) For sequence analysis of all other plasmids with inserts between 10 and 28 kb we apply a limited 'shotgun cloning' sequencing strategy. Suitable frequently cutting restriction endonucleases like Sau3A, AluI or HaeIII are used to establish two or three different sets of ~20 subclones carrying fragments 100-500 bp long. Both ends of individual cloned fragments are sequenced and assembled into contigs. Gaps are filled by primer walking on plasmids or cosmids carrying the EcoRI fragment in question.
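The three size classes described above amount to a simple decision rule. The following sketch expresses it as a function; the function name and the handling of the exact boundary values (3 kb and 10 kb) are our assumptions, since the text does not say to which class a boundary-sized insert belongs.

```python
def sequencing_strategy(insert_kb):
    """Pick a sequencing approach from the insert size in kb.

    Hypothetical helper: thresholds follow the text, but assigning the
    boundary values 3 kb and 10 kb to the smaller class is an assumption.
    """
    if not 0.1 <= insert_kb <= 28:
        raise ValueError("EcoRI inserts in the library are 0.1-28 kb long")
    if insert_kb <= 3:
        return "primer walking"
    if insert_kb <= 10:
        return "nested deletions (exonuclease III) + primer walking"
    return "limited shotgun subcloning + primer walking"


for size in (0.43, 4.8, 25):
    print(size, "kb ->", sequencing_strategy(size))
```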
The project is organized in such a way that many different plasmids can be sequenced at the same time, so that waiting for the synthesis of new primers is not a limiting step. Furthermore, sequencing efforts may be shifted to any region of interest on the genome. Because the complete genome has already been cloned, the speed of sequencing can be calculated, and the frequently painful analysis of the last few percent of a genome, caused by missing or unclonable DNA regions, will not be an extra burden. The ordered cosmid library has been used to construct an EcoRI restriction map of the entire M.pneumoniae chromosome. Therefore, any DNA sequence can be attributed to a defined position on the physical map. This permits a detailed genetic map to be established in parallel with the sequencing project.
MATERIALS AND METHODS
Cloning of EcoRI fragments
The EcoRI fragments were subcloned from an ordered cosmid library comprising the complete M.pneumoniae genome (10). Standard procedures were applied (29) using as vector the plasmid pBC (Stratagene) and the E.coli strains HB101 (30) or XL1-Blue (Stratagene) for propagation of these plasmids. The plasmid clones containing the individual EcoRI fragments which were used for sequencing were purified by Qiagen column chromatography according to the protocol provided by the manufacturer (Qiagen). The cloning vector pBC was purified by centrifugation to equilibrium in two sequential cesium chloride-ethidium bromide gradients (29).
The following nomenclature was used for the plasmid clones as well as EcoRI fragments. The cosmid always had the prefix pcosMP and a letter and a number, e.g., pcosMPD2. The plasmid carrying an EcoRI fragment from this cosmid received a 'p' as prefix and the letter and number applied to the cosmid and additionally the size of the EcoRI fragment in kb, e.g., pD2/4.8. The EcoRI fragment alone was named D2/4.8.
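The naming scheme can be written down as a small helper, which is convenient when fragment names have to be generated or checked in bulk. The function names are hypothetical; only the naming rules themselves come from the text.

```python
COSMID_PREFIX = "pcosMP"


def fragment_name(cosmid, size_kb):
    """Name of an EcoRI fragment, e.g. ('pcosMPD2', 4.8) -> 'D2/4.8'."""
    if not cosmid.startswith(COSMID_PREFIX):
        raise ValueError(f"cosmid names must start with {COSMID_PREFIX!r}")
    label = cosmid[len(COSMID_PREFIX):]  # letter(s) and number, e.g. 'D2'
    return f"{label}/{size_kb:g}"        # ':g' drops a trailing '.0'


def plasmid_name(cosmid, size_kb):
    """Name of the plasmid carrying that fragment, e.g. 'pD2/4.8'."""
    return "p" + fragment_name(cosmid, size_kb)


print(fragment_name("pcosMPD2", 4.8))  # D2/4.8
print(plasmid_name("pcosMPD2", 4.8))   # pD2/4.8
```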
Long range PCR
To determine or to check the orientation of adjacent EcoRI fragments, the improved PCR method for the amplification of DNA fragments up to 45 kb long was used (31). The reactions were done with the GeneAmp XL PCR kit from Perkin Elmer according to the manufacturer's protocol. Genomic M.pneumoniae DNA used for amplification was purified as described (27). The primers for the reactions were designed as 22mers with a melting temperature of 68°C.
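The text does not say how the melting temperature was calculated. As a rough illustration only, the sketch below uses the simple Wallace rule (2°C per A/T, 4°C per G/C), which is a common first approximation for short oligonucleotides and is not necessarily the method the authors used. The primer sequence is invented for the example.

```python
def wallace_tm(primer):
    """Approximate melting temperature (degrees C) by the Wallace rule.

    Assumption: Tm = 2*(A+T) + 4*(G+C). This is a crude estimate; the paper
    does not state which formula was applied.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer contains characters other than A, C, G, T")
    return 2 * at + 4 * gc


# Hypothetical 22mer with 12 G/C and 10 A/T residues.
example_primer = "ATGCGTACCGGTTAGCCATGCA"
print(len(example_primer), wallace_tm(example_primer))
```

With this rule a 22mer reaches 68°C when it contains 12 G or C residues, so the stated design target corresponds to a G+C content of roughly 55% under this approximation.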
Synthesis of oligonucleotides
Oligonucleotides were synthesized (R. Frank et al., ZMBH) according to phosphoramidite chemistry using a solid carrier (32) on a model 394 DNA/RNA synthesizer from Applied Biosystems. Oligonucleotides between 17 and 20 nt long were specifically designed using the program OLIGO 4.0 (National Biosciences Inc.) and were used without further purification following a standard dilution for the sequencing reaction.
DNA sequencing
The sequence data for this study were generated exclusively by the enzymatic dideoxy chain-termination method described by Sanger et al. (22). The radioactive label was substituted by a fluorescent label and Taq polymerase was used in the reaction. The protocols were adopted from cycle sequencing protocols introduced by Craxton (33), in which the basic principle is the linear amplification of the target DNA with a single primer.
All data were generated on a fluorescence-based sequence-gel reader (Model 373A, Applied Biosystems). Either fluorescently labelled universal primers (-21M13, M13 RP, T3 and T7) or fluorescently labelled dideoxynucleotides were used as label. Taq dye primer cycle sequencing and Taq dye deoxy cycle sequencing were done as described in the manufacturer's protocol. In each sequencing reaction 1 µg plasmid DNA or 2.3 µg cosmid DNA and 10 pmol primer were used. In a typical sequence analysis ~500 nt were read. Primers for primer walking were selected between nucleotide 300 and 400 from such a sequence. All sequence chromatograms were visually inspected and edited with the SeqEd program (Version 1.03) from Applied Biosystems. Sequence assembly was performed using the Sequence Project Management program of the DNA* program package by Lasergene.
Computer assisted analysis
Computer analyses were performed with the program package HUSAR (Heidelberg Unix Sequence Analysis Resources) release 4.0 at the German Cancer Research Center, Heidelberg, Germany. This package is based on the GCG program package version Unix-8.01 of the Genetics Computer Group, Wisconsin. For searching the DNA and protein databases [SWISS-PROT (34) and PIR (35)] the FASTA (36) and BLAST (37) programs (BLASTX, BLASTN and BLASTP) were used. Conserved motifs in proteins and peptides were identified by running the program PROSITE (38). Open reading frames (ORFs) were calculated by the program FRAMES allowing AUG (or GUG, UUG) as start codons using the Mycoplasma translation table where UGA is coding for tryptophan (39). The G+C content was calculated by the program COMPOSITION. Protein sequences were aligned by using either the program GAP (pairwise alignment) based on the algorithm of Needleman and Wunsch (40) or CLUSTAL (41) for multiple alignments.
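The effect of the Mycoplasma translation table on ORF detection can be illustrated with a minimal forward-strand scanner. This is not the FRAMES program, only a sketch under stated assumptions: start codons ATG, GTG and TTG are accepted, TAA and TAG terminate, and TGA is read through because it codes for tryptophan. Function and variable names are ours.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG"}  # TGA is NOT a stop: it codes for Trp in Mycoplasma


def find_orfs(dna, min_codons=1):
    """Return (start, end, n_codons) for ORFs in the three forward frames.

    Coordinates are 0-based, end-exclusive and include the stop codon;
    n_codons counts sense codons only. An ORF opens at the first start
    codon after the previous stop. Reverse strand is not scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return orfs


# ATG TGA AAA TAA: with the standard code the ORF would end at TGA.
toy = "CC" + "ATG" + "TGA" + "AAA" + "TAA" + "G"
print(find_orfs(toy))
```

In the toy sequence the TGA codon lies inside the reported ORF, which is the behaviour that distinguishes this table from the standard genetic code.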
Subcloning of the EcoRI fragments from the cosmid library into the plasmid pBC
All but six of the 143 unique EcoRI fragments which have been identified in the cosmids (42) were subcloned into the plasmid pBC (43). Several attempts to subclone a 4.9 kb EcoRI fragment from the cosmid pcosMPD12 and a 10.8 kb EcoRI fragment from the cosmid pcosMPG7 failed. Since these and the other four EcoRI fragments (D12/4.9, G7/10.8, GT9/25, H8/8.2, H8/0.43B, K8/2.5) were present in the cosmids, which are also suitable for sequencing, no further attempts were made to subclone them. In addition, any region of interest can be amplified by polymerase chain reaction (PCR) if necessary. In our experience, some EcoRI fragments could be propagated better in E.coli as part of a cosmid than as an individual fragment cloned in a plasmid.
Revised version of the EcoRI restriction map
Initially, 300-400 nt were sequenced from each end of every cloned EcoRI fragment (one strand only). This required only the two primers T3 and T7, which hybridized to the left and to the right of the EcoRI cloning site of the vector pBC. Approximately 100 000 nt were gained from sequencing 274 ends from 137 fragments. The information was distributed over the genome in short mosaic-like stretches. Based on this information, reverse primers were synthesized which made it possible to sequence on the corresponding cosmid across the adjacent EcoRI site into the neighbouring EcoRI fragment and to align the correct ends of adjacent EcoRI fragments. Sixteen new EcoRI fragments between 15 and 193 bp long, which were not recognized by the restriction analysis of the cosmids, were also identified. Eight of these fragments were 87 bp long and were known to be part of the repetitive DNA sequence RepMP1 (44). The sizes of the other eight fragments were 15, 37, 42, 98, 128, 160, 168 and 193 bp.
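Very small EcoRI fragments such as those listed above are easy to miss on a gel but trivial to find once sequence is available. The sketch below performs an in silico EcoRI digest of a linear sequence; it assumes the well-known recognition site GAATTC with cleavage after the first G, and the toy sequence is invented. For the circular chromosome the first and last fragments would have to be joined, which is not done here.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cleaves between G and AATTC


def ecori_fragments(dna):
    """Return fragment lengths from a complete EcoRI digest of a linear sequence."""
    dna = dna.upper()
    cuts = []
    pos = dna.find(ECORI_SITE)
    while pos != -1:
        cuts.append(pos + CUT_OFFSET)
        pos = dna.find(ECORI_SITE, pos + 1)
    bounds = [0] + cuts + [len(dna)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Invented example: two sites that release a 15 bp internal fragment.
toy = "AAAA" + "GAATTC" + "C" * 9 + "GAATTC" + "TT"
print(ecori_fragments(toy))
```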
The sequence analysis also revealed two problem regions, the overlapping cosmids pcosMPD12-pcosMPK5 and pcosMPK4-pcosMPG7 (44). It was found that a 5.4 kb EcoRI fragment of pcosMPK4 was a cloning artefact and had to be replaced by a 2.5 and a 0.7 kb EcoRI fragment, which were both cloned in a phage M13 vector. The 0.4 kb fragment of pcosMPD12 could not be positioned unambiguously. Both regions were re-examined by long range amplification of genomic M.pneumoniae DNA by the polymerase chain reaction (PCR). Restriction fragment analysis of the amplified DNA with endonuclease EcoRI and DNA sequence analysis of the regions around the EcoRI sites confirmed number, size and position of the EcoRI fragments as shown in the map. In five other instances we had to exchange the positions of two fragments which were very similar in size in four cosmids (E7/1.85-E7/1.9; E7/0.7-E7/0.75; H8/0.43A-H8/0.43B; H91/1.9-H9/1.8; K4/8-G7/7.5). The order of all other fragments (44) was confirmed by the sequence analysis.
DNA sequence analysis of three selected regions
Based on previous experiments (42) and the results of data base homology searches with the first 100 000 nt of DNA sequence (one strand only), regions were selected for sequencing longer coherent stretches. Among others, the region around the dnaA gene (pcosMPK5) was selected, since we expected to find the origin of replication there, as well as two other regions coding for conserved genes, namely the F0F1 ATPase operon (pcosMPD2) and ribosomal protein genes (pcosMPGT9). Sequences from these genes were also available from other mycoplasma species and from the phylogenetically related B.subtilis, which could be used for comparative analyses. All DNA sequences published were generated from independent sequences of both strands of a given region. Discrepancies between two sequences were resolved by directly comparing the sequencing chromatograms and, if necessary, by repeating the sequence analysis of one or the other strand under different conditions, e.g., by selecting a new sequencing primer.
To estimate the sequencing error rate the following two types of experiments were done with a 12 kb long region comprising the EcoRI fragments D2/7.3 and D2/4.8 (Fig. 5A) using the same batch of DNA preparation: (i) two persons (H.H. and R.Hi.) edited the same sequence chromatograms independently; (ii) using different primers, the 12 kb region was sequenced independently by the same two persons. The reading of the same chromatograms revealed three discrepancies, but the two independently generated sequences agreed fully.
These data do not permit the calculation of an error rate for the complete project, but they hint at the fidelity of our sequencing analysis and underscore that sequencing by primer walking with low redundancy produces reliable results. The redundancy for one strand is 1.3 for a sequence generated by primer walking only, and ~1.9 for a sequence generated by primer walking and shotgun subcloning of a larger EcoRI fragment like GT9/25.
The dnaA region
Figure 2 shows the physical and genetic map of the dnaA region of M.pneumoniae and Table 1 summarizes the proposed open reading frames. The first obvious result is the uneven distribution of G and C in this area. Although the average G+C content of the M.pneumoniae genome is ~41 mol%, the region between nucleotide 4000 and 4850 is characterized by a G+C content well below 30 mol% and by the absence of open reading frames coding for proteins >40 amino acids. The orientation of genes flanking this untranslatable region is also striking, because the genes to the right and to the left are transcribed divergently (Fig. 3). The genes located in the dnaA region of M.pneumoniae are also unusual. Many of them are not found in the corresponding regions of other bacteria (Table 1 and Fig. 3) and the conserved genes, gyrB, dnaN and dnaA, are in different order or orientation with respect to each other.
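A local drop in G+C content like the one described for the dnaA region is usually found by scanning the sequence with a sliding window. The sketch below shows the idea; it is not the COMPOSITION program, and the window size, step and toy sequence are arbitrary choices for illustration.

```python
def gc_percent(seq):
    """G+C content of a sequence in percent."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def low_gc_windows(seq, window, step, threshold):
    """Start positions (0-based) of windows whose G+C content is below threshold."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        if gc_percent(seq[start:start + window]) < threshold:
            hits.append(start)
    return hits


# Toy sequence: an AT-only stretch flanked by GC-only stretches.
toy = "GC" * 10 + "AT" * 10 + "GC" * 10
print(low_gc_windows(toy, window=20, step=20, threshold=30.0))
```

On real data one would choose a window of a few hundred nucleotides and compare each window with the genome average of ~41 mol%.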
The ATPase operon
The number and order of genes of the atp operon coding for the F0F1 ATPase in M.pneumoniae are conserved; that is, they are identical with those determined for E.coli (60), B.subtilis (61) and Mycoplasma gallisepticum (62). In addition, our data suggest that the operon structure may be extended, as the proposed C-terminal section of a gamma enolase in front of the atpI gene and at least the two proposed genes orf569 and orf152 (lacA) (63), which immediately follow the atpC gene, could be part of the same operon. The space between the end of the preceding gene and the beginning of the following genes is at the most 8 nt long (Fig. 5A and Table 1).
Comparison of length and percent identity between subunits of the ATPase operon from selected bacteria with M.p.
By comparing the corresponding genes of atp operons from different bacteria, the following picture emerges (Table 2). The highest identities exist between the two Mycoplasma species, in particular for the atpA and atpD genes and, at a slightly lower level, for the atpE and atpB genes. The identity scores in all other instances are rather low. In some cases the scores between E.coli and M.pneumoniae are higher (atpH, atpG) than between M.pneumoniae and the phylogenetically more closely related B.subtilis. All comparable genes of the atp operon of the four bacteria are of similar length except atpE, atpF and atpI. Of interest is the structure of the b subunit (atpF). The proposed b subunits from M.pneumoniae (Fig. 6) and M.gallisepticum carry the characteristic features of a prokaryotic lipoprotein: a signal peptide with positively charged amino acids near the N-terminus, an accumulation of hydrophobic residues within the signal peptide and a cysteine between position 10 and 35 of the proposed preprotein. This cysteine will become the first amino acid after cleavage of the signal peptide. The processed protein is associated with the membrane via the diacyl-glycerol modified cysteine and a third acyl chain bound to the free amino group of the same cysteine (64,65). Downstream of the cysteine is a potential transmembrane section which would ensure that the C-terminal part of these subunits is facing the cytosol. The b subunits of E.coli and B.subtilis do not show the lipoprotein-specific structure; instead only one transmembrane section near the N-terminal end binds the b subunit to the cytoplasmic membrane in such a way that the residual protein is also orientated towards the cell interior, permitting interaction with the F1 portion of the enzyme. Typical signatures exist for the α and β subunits which are present in most of the F0F1 type ATPases (Pro-[Ser, Ala, Pro]-[Ile, Val]-[Asp, Asn]-X(3)-Ser-X-Ser). This signature is present in the β subunit of M.pneumoniae, whereas in the α subunit it is modified: Asp or Asn is replaced by His. This is also the case for M.gallisepticum, while in E.coli and B.subtilis Asp or Glu has been found. Both the α and β subunits also possess the expected ATP/GTP binding sites.
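The α/β subunit signature quoted above translates directly into a regular expression, which makes it easy to see why a His at the fourth position breaks the match. The sketch below is only an illustration of the pattern as printed in the text; the peptide sequences are invented, and the relaxed variant that also accepts His is our own addition.

```python
import re

# Pro-[Ser,Ala,Pro]-[Ile,Val]-[Asp,Asn]-X(3)-Ser-X-Ser in one-letter code.
STRICT_SIGNATURE = re.compile(r"P[SAP][IV][DN].{3}S.S")
# Hypothetical relaxed form that tolerates the His substitution.
RELAXED_SIGNATURE = re.compile(r"P[SAP][IV][DNH].{3}S.S")


def signature_start(protein, pattern=STRICT_SIGNATURE):
    """Return the 0-based start of the first match, or None if absent."""
    match = pattern.search(protein.upper())
    return match.start() if match else None


canonical = "MKPSIDAAASGSLL"    # invented peptide carrying the signature
his_variant = "MKPSIHAAASGSLL"  # same peptide with Asp replaced by His

print(signature_start(canonical))                       # 2
print(signature_start(his_variant))                     # None
print(signature_start(his_variant, RELAXED_SIGNATURE))  # 2
```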
Figure 5. Proposed lipoproteins. The boxed Cys is the first amino acid after cleavage. Hydrophobic amino acids from the signal peptide are in bold face. The proposed hydrophobic transmembrane segment downstream of the modified Cys in C12_orf207 (b subunit atpF) is in bold face and in italics.
The subunits a, b and c, which build the integral F0 membrane complex, have the required transmembrane segments: six for the a subunit, two for the c subunit and one for the b subunit. The typical signatures for subunits a and c, including the binding site for N,N'-dicyclohexylcarbodiimide (DCCD), are also conserved (66). DCCD inhibits proton transport across the membrane. Downstream of the atpC gene are nine proposed ORFs of which only one, Orf152, shows significant homology to a known protein, the galactose-6-phosphate isomerase, an enzyme involved in sugar metabolism. Of interest are orf521, orf217L and orf531, which code for proteins with the proposed features of prokaryotic lipoproteins as mentioned already in connection with the b subunit (atpF) of the F0F1 ATPase. In all three instances, the proposed cleavage sites are located between Ala and Cys in the preprotein, whereas in the ATPase b subunit the site lies between Ser and Cys. Two other proposed lipoproteins, which will be introduced in the next paragraph, also carry the proposed cleavage site between Ala and Cys (Fig. 6). In five of six instances the third amino acid in front of the proposed cleavage site is Leu.
Also remarkable is the high percentage of identity (46.6%) between the proposed lipoproteins from orf521 and orf531.
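The lipoprotein features listed in this section (positively charged residues near the N-terminus, a hydrophobic signal peptide, a Cys between positions 10 and 35 preceded by Ala or Ser, often with Leu three residues upstream) can be combined into a crude screening heuristic. The sketch below is our own simplification, not the authors' procedure: the residue sets, the eight-residue N-terminal region and the 50% hydrophobicity cut-off are all assumptions, and the test sequences are invented.

```python
HYDROPHOBIC = set("AVLIMFW")  # assumption: a simple hydrophobic residue set
POSITIVE = set("KR")


def lipoprotein_candidates(protein, n_region=8, min_hydrophobic=0.5):
    """Return (cys_position, residue_before_cys, leu_at_minus_3) tuples.

    cys_position is 1-based. A Cys qualifies if it lies at position 10-35,
    follows Ala or Ser, and the preceding peptide has a positive residue in
    its first n_region residues and is at least min_hydrophobic hydrophobic.
    """
    protein = protein.upper()
    hits = []
    for idx, residue in enumerate(protein):
        position = idx + 1
        if residue != "C" or not 10 <= position <= 35:
            continue
        signal = protein[:idx]
        has_positive = any(r in POSITIVE for r in signal[:n_region])
        hydrophobic = sum(r in HYDROPHOBIC for r in signal) / len(signal)
        if has_positive and hydrophobic >= min_hydrophobic and signal[-1] in "AS":
            hits.append((position, signal[-1], signal[-3] == "L"))
    return hits


candidate = "MKRKLLFAIVLLSLLTACGSNDK"  # invented lipoprotein-like N-terminus
control = "MDEDSGNQTSPEDGNACKKK"       # invented acidic, non-hydrophobic control
print(lipoprotein_candidates(candidate))
print(lipoprotein_candidates(control))
```

A screen of this kind only nominates candidates; confirmation of a lipoprotein would require experimental evidence such as N-terminal analysis of the processed protein.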
The ribosomal protein operons α, spc and S10
The last region analyzed comprises one of the largest genomic EcoRI fragments, the 25 kb long GT9/25. Approximately one third of the fragment codes for ribosomal proteins. As in E.coli (67-69) the α and spc operon and four proteins from the S10 operon are located consecutively (Fig. 5B and Table 1). The corresponding genes are in the same order, except for the genes coding for the S4 and L30 proteins. The transition between the spc operon and the α operon contains, as in B.subtilis (70), the genes secY, adh, ampM and infA (Fig. 7).
Figure 6. Comparison of the organization of a ribosomal protein gene cluster. In E.coli these genes are organized in three operons; α operon: S13(rpsM)-L17(rplQ); spc operon: L14(rplN)-L36(rpmJ); S10 operon: S10(rpsJ)-S17(rpsQ). The genes rpmD(L30) of M.pneumoniae and M.capricolum and the genes rpsD(S4) of M.pneumoniae and B.subtilis (88) are located in different positions on the genome. ns = not sequenced; g.d. = gene is deleted.
Codon usage for the eight ribosomal proteins rpmC, rpsQ, rplN, rplX, rplE, rpsN, rpsE, rplO
M.p. (1), M.pneumoniae; M.c., M.capricolum; B.s., B.subtilis; E.c., E.coli; M.p. (2), M.pneumoniae (P1 and ORF6 gene from P1 operon).
The amino acid sequences of eight ribosomal proteins, their length (Table 3) and the codon usage (Table 4) of the corresponding genes were compared with their counterparts in M.capricolum (71), B.subtilis (72,73) and E.coli. Protein sizes are similar except for S5, S14 and L29 (Fig. 8). Some uncertainty is caused by the lack of protein data for the N-termini of the ribosomal proteins from M.pneumoniae. The ribosomal proteins S5 in M.pneumoniae and M.capricolum have an elongated N-terminal region, and M.capricolum S5 in addition has an elongated C-terminal region, compared with E.coli and B.subtilis. The differences in length concerning L29 are caused by extension of the two mycoplasmal proteins at the C-terminal end. The larger size of S14 in E.coli results from an insertion in the middle of the gene/protein. We have no unifying explanation for the observed differences in size, as no systematic pattern emerges. For instance, the smaller proteins are not always found in the bacteria with the smaller genomes. It might be worthwhile to analyze those ribosomal proteins which interact in the ribosome with L29, S5 and S14 for possible modifications.
Figure 7. Comparison of ribosomal proteins with differences in length. (A) Ribosomal protein L29; (B) ribosomal protein S5; (C) ribosomal protein S14; bacsu: B.subtilis; ecoli: E.coli; mycca: M.capricolum; mycpn: M.pneumoniae.
Several conclusions can be drawn from comparing the codon usage in eight ribosomal protein genes between four different bacteria (Table 4). The codon usage in M.capricolum is strongly influenced by the low G+C content of the DNA, as was pointed out several years ago by Ohkubo (71). In all instances the codons with A or U in the third position are clearly preferred. The low frequency (<10%) of codons with G or C in the third position would indicate that ~20 codons are dispensable. In contrast, M.pneumoniae uses all codons except AUA for Ile. Bacillus subtilis and E.coli show more preferences. Seven codons in B.subtilis and three codons in E.coli are not used at all. This bias seems to be independent of the G+C content since it is similar at least in M.pneumoniae and B.subtilis. Included in the comparison of codon usage were the P1 and ORF6 protein from M.pneumoniae, because the G+C content of these two genes is clearly higher (74). In these two examples a bias in favour of codons with G and C in the third position is observed, but again, except for the codons for cysteine, all codons are used.
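The quantities discussed in this comparison (which codons are used, which are never used, and how often G or C occupies the third position) are straightforward to compute from a coding sequence. The sketch below shows one way to do so; the function names and the toy sequence are ours, and real input would be the concatenated coding regions of the eight ribosomal protein genes.

```python
from collections import Counter

ALL_CODONS = [a + b + c for a in "TCAG" for b in "TCAG" for c in "TCAG"]


def codon_counts(cds):
    """Count codons in an in-frame coding sequence (RNA or DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("coding sequence length must be a multiple of three")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def third_position_gc(counts):
    """Percentage of codons carrying G or C in the third position."""
    total = sum(counts.values())
    gc = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return 100.0 * gc / total


def unused_codons(counts):
    """Codons that never occur in the counted sequence."""
    return [codon for codon in ALL_CODONS if counts[codon] == 0]


toy_counts = codon_counts("ATGAAAAAGATTATATAA")  # ATG AAA AAG ATT ATA TAA
print(f"{third_position_gc(toy_counts):.1f}% G/C in third position")
print(len(unused_codons(toy_counts)), "codons unused")
```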
Orf274, orf303 and orf434 most likely code for components of an ABC transporter
system (
54
). The products of orf274 and orf303 exhibit the typical pattern of the ATP
binding domain (see orf284 from the dnaA region). The highest scores in protein
data bank searches are shared with the GlnQ protein from
B.subtilis
and the CysA protein from a
Synechococcus
strain. GlnQ and CysA are components of glutamine and sulphate-thiosulphate ABC transporters. But since the scores of quite a number of
other ATP binding domains of ABC transporters are in a similar range it is
impossible to determine the specificity by DNA sequence analysis alone. The
Orf434 protein shows no significant homologies to other proteins, but based on
computer prediction it resembles an integral membrane protein with at least six
transmembrane segments. Considering that these three ORFs are organized in an
operon-like structure it is likely that they code for one ABC transporter
consisting of two identical membrane domains and two different ATP binding
domains. Among the residual ORFs of the
Eco
RI fragment GT9/25, the Orf243 protein reveals significant homologies to a
hypothetical protein from a Bacillus species and Orf319 protein to a methylase
modifying the adenine in an
Eco
RI restriction site (GAATTC). This finding is in agreement with the observation
that in genomic
M.pneumoniae
DNA the
Eco
RI site between the
Eco
RI fragments E7/1.4 and E7/1.85 are protected from cleavage, until after cloning
and amplification in a methylase negative
E.coli
strain (Wenzel, unpublished). Orf611 has homology to the ligoendopeptidase F
from
Lactococcus lactis
and, in addition, a zinc protease pattern from amino acid 384 to 391 His-Glu-Leu-Gly-His-Ser-Val-His. This pattern is closest to the
short consensus sequence of the thermolysin family (
75
). Orf798 and orf760 code for two potential bacterial lipoproteins with the same
characteristic features mentioned already. The Orf798 and Orf760 protein have
almost 50% identity over the first 250 amino acids and 30% identity over the
entire protein but no significant homologies to the other four mentioned
proposed lipoproteins. The high degree of identity between the proteins Orf798
and Orf760 as well as between Orf521 and Orf531 may indicate that these
proteins arose from a common ancestor by gene duplication. Finally, orf313 and
orf127 code for internal peptides of the P1 protein, the main adhesin which
interacts with the receptor(s) of the host cell. It seems very unlikely that
these ORFs were expressed, since antibodies against the complete P1 protein did
not recognize proteins of the size of these peptides. The coding regions show a
very high identity of 84.9% in a 1703 bp overlap to the middle part of gene P1
[extending from nucleotide 2317 to 4097, taking the sequence and numbering from
Su et al. (76)]. At least seven very similar copies from this part of the P1 gene exist
dispersed on the bacterial genome, of which one has been located on the
Eco
RI fragment GT9/25. They were named RepMP2/3 (
77
) or just `multiple copies' from the P1 gene (
78
). Ruland et al. (77) calculated that RepMP2/3 would be ~1800 bp long, based on hybridization experiments with specific
oligonucleotides. The DNA sequence data confirm this. These repetitive DNA
elements may play a role in antigenic variation or in modifying the receptor
specificity, as binding sites for adherence inhibiting monoclonal antibodies
are encoded by the DNA stretch representing the RepMP2/3 region (
79
). Individual copies of RepMP2/3 may be exchanged by gene conversion, as
proposed for RepMP5, another repetitive DNA sequence (
80
) found in the
M.pneumoniae
genome.
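The zinc protease pattern reported above for Orf611 can be located with a simple motif scan. The following Python sketch is an illustration only: it searches for the minimal zinc-binding signature His-Glu-x-x-His (HEXXH) rather than the full thermolysin-family consensus, and the function name `find_hexxh` and the toy sequence are assumptions made for this example, not the actual Orf611 sequence.

```python
import re

# Simplified zinc-binding signature His-Glu-x-x-His (HEXXH). The lookahead
# allows overlapping matches; the longer family consensus is not modelled here.
HEXXH = re.compile(r"(?=(HE..H))")


def find_hexxh(protein):
    """Return (start, end, motif) tuples using 1-based inclusive positions."""
    return [(m.start() + 1, m.start() + 5, m.group(1))
            for m in HEXXH.finditer(protein.upper())]


# Hypothetical toy protein: 383 filler residues, then the octapeptide reported
# for amino acids 384-391 of Orf611, then more filler. NOT the real sequence.
octapeptide = "HELGHSVH"
toy_protein = "A" * 383 + octapeptide + "A" * 20

hits = find_hexxh(toy_protein)
for start, end, motif in hits:
    print(start, end, motif)
```

In this toy case the scan finds the His-Glu-Leu-Gly-His core at the start of the reported octapeptide. A match to HEXXH alone is only a first indication and would need to be checked against the full family consensus.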
One of the problems of DNA sequence analysis is the unambiguous assignment
of the first amino acid of a proposed protein. The following criteria were
used. First, all open reading frames starting with ATG and predicting a protein
>100 amino acids were considered. If the intergenic regions could be filled
with ORFs >100 amino acids using either of the codons GTG or TTG, these
triplets were also considered as start codons. Whenever the protein extension
caused by GTG or TTG showed significant homology to the same protein in the
database as the ORF starting with ATG, the start codon which gave the
longest protein was selected. ORFs <100 amino acids were considered when the
BLASTX program indicated homology to smaller proteins. To minimize the loss of
proteins <100 amino acids, the translation products of any sequenced DNA were
analyzed routinely by the BLASTX program, which screens for homologies at the
protein level, neglecting stop and start codons. This program picks up short
proteins, for example some of the ribosomal proteins; however, all peptides
without significant homology to other known proteins are excluded. Presently, no
reliable method to identify those small proteins is available. In other
bacteria, e.g.
B.subtilis
, gene expression signals are valuable tools for the assignment of protein start
sites. Since gene expression signals in M.pneumoniae have not been studied
extensively by experiment, this method cannot be applied. The conserved
sequence of the 3' end of the 16S rRNA from M.pneumoniae M129 (R. Himmelreich,
unpublished) suggests that a Shine-Dalgarno (SD)-like ribosomal binding site
(81) should promote translation initiation; however, unlike in B.subtilis, many
proposed genes do not have a Shine-Dalgarno site at the correct position.
Studies on the closely related
M.genitalium
indicate that signals other than an SD sequence may function as a ribosome binding site (
82
). Likewise, Sprengart
et al
. (
83
) proposed that a region downstream of the start codon specifies a stimulatory
interaction between the mRNA and 16S rRNA. Therefore, the best proof for a
protein start is the amino acid sequence of the N-terminal region of the protein in question. The small number of proteins
encoded by the
M.pneumoniae
genome and the improved methods in protein biochemistry make it likely that a
substantial part of the N-termini will be sequenced within a reasonable period of time.
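The ORF selection criteria described above can be approximated as a short scan of the three forward reading frames. The Python sketch below is an illustration under stated assumptions, not the procedure actually used. The function name `candidate_starts`, the output fields and the toy sequence are hypothetical. Only TAA and TAG are treated as stop codons, on the assumption that TGA is read as tryptophan, as is known for mycoplasmas. The BLASTX homology test that decides between ATG and GTG/TTG starts is not reproduced.

```python
STOPS = {"TAA", "TAG"}        # assumption: TGA encodes Trp, so it is not a stop
ALT_STARTS = {"GTG", "TTG"}   # alternative start codons considered in the text


def candidate_starts(dna, min_aa=100):
    """List candidate start codons for forward-strand ORFs longer than min_aa.

    Only stretches closed by a stop codon are reported; the reverse strand
    and ORFs running off the end of the sequence are ignored in this sketch.
    """
    dna = dna.upper()
    found = []
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        begin = 0  # codon index at which the current stop-free stretch starts
        for end, codon in enumerate(codons):
            if codon not in STOPS:
                continue
            stretch = codons[begin:end]
            atg = next((i for i, c in enumerate(stretch) if c == "ATG"), None)
            alt = next((i for i, c in enumerate(stretch) if c in ALT_STARTS), None)
            if atg is not None and alt is not None and alt > atg:
                alt = None  # a GTG/TTG downstream of the first ATG adds nothing
            for kind, pos in (("ATG", atg), ("GTG/TTG", alt)):
                # length counts the start codon as the first residue
                if pos is not None and len(stretch) - pos > min_aa:
                    found.append({
                        "frame": frame,
                        "start_type": kind,
                        "start_nt": frame + 3 * (begin + pos) + 1,  # 1-based
                        "length_aa": len(stretch) - pos,
                    })
            begin = end + 1
    return found


# Hypothetical toy sequence: stop, TTG, five Ala codons, ATG, 120 Ala codons, stop.
toy = "TAA" + "TTG" + "GCT" * 5 + "ATG" + "GCT" * 120 + "TAA"
for orf in candidate_starts(toy):
    print(orf)
```

For each stop-free stretch the sketch reports the first ATG and, where one lies upstream or no ATG is present, the first GTG or TTG. Choosing between the two candidates would still require the database comparison described in the text, and ultimately N-terminal protein sequence data.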
NOTE ADDED IN PROOF
Since this paper was submitted the complete nucleotide sequence of the
Mycoplasma genitalium genome has been published [Fraser et al. (1995) Science, 270, 397-403].
At this stage we could no longer consider all the data from this publication.
However, since there were some discrepancies with the data from Bailey and Bott (
50
) on the order and direction of genes of the dnaA region of
M.genitalium
, we included the results of Fraser
et al
. in our comparative analysis on bacterial dnaA regions. (Latest accessible
database version, January 1996.)
ACKNOWLEDGEMENTS
We thank E. Pirkl for excellent technical assistance, R. Frank and A. Bosserhoff
for synthesis of oligonucleotides, B. Reiner for her expertise in computer data
analysis, V. Wasinger for critical reading of the manuscript, I. Schmid for
preparing the manuscript, J. Pyrowolakis for subcloning the
Eco
RI fragment GT9/25 and last, but not least, H. Schaller for financial assistance
during a critical period and his encouragement throughout our work.
This research was supported by a grant from the Deutsche Forschungsgemeinschaft
(He 780/5-1).
REFERENCES
1 Krawiec, S. and Riley, M. (1990) Microbiol. Rev., 54, 502-539.
2 Freundt, E. A. and Razin, S. (1984) In Krieg, N. R. and Holt, J. G. (eds), Bergey's Manual of Systematic Bacteriology, vol. 1. Williams and Wilkins, Baltimore, pp. 742-770.
3 Rogers, M. J., Simmons, J., Walker, R. T., Weisburg, W. G., Woese, C. R., Tanner, R. S., Robinson, I. M., Stahl, D. A., Olsen, G., Leach, R. H. and Maniloff, J. (1985) Proc. Natl Acad. Sci. USA, 82, 1160-1164.
4 Weisburg, W. G., Tully, J. G., Rose, D. L., Petzel, J. P., Oyaizu, H., Yang, D., Mandelco, L., Sechrest, J., Lawrence, T. G., Van, E. J., Maniloff, J. and Woese, C. R. (1989) J. Bacteriol., 171, 6455-6467.
6 Herrmann, R. (1992) In Maniloff, J., McElhaney, R. N., Finch, L. R. and Baseman, J. B. (eds), Mycoplasmas-Molecular Biology and Pathogenesis. American Society for Microbiology, Washington, DC, pp. 157-168.
7 Miles, R. J. (1992) In Maniloff, J., McElhaney, R. N., Finch, L. R. and Baseman, J. B. (eds), Mycoplasmas-Molecular Biology and Pathogenesis. American Society for Microbiology, Washington, DC, pp. 23-40.
8 Chanock, R. M., Dienes, L., Eaton, M. D., Edward, D. G., Freundt, E. A., Hayflick, L., Hers, J. F. P., Jensen, K. E., Liu, C., Marmion, B. P., Morton, H. E., Mufson, M. A., Smith, P. F., Somerson, N. L. and Taylor-Robinson, D. (1963) Science, 140, 662.
9 Almagor, M., Yatziv, S. and Kahane, I. (1983) Infect. Immun., 41, 251-256.
10 Wenzel, R. and Herrmann, R. (1989) Nucleic Acids Res., 17, 7029-7043.
11 Krause, D. C. and Mawn, C. B. (1990) J. Bacteriol., 172, 4790-4797.
13 Morowitz, H. J. (1984) Isr. J. Med. Sci., 20, 750-753.
14 Jacobs, E. (1991) Rev. Med. Microbiol., 2, 83-90.
15 Meng, K. E. and Pfister, R. M. (1980) J. Bacteriol., 144, 390-399.
16 Liss, A. and Maniloff, J. (1972) Proc. Natl Acad. Sci. USA, 69, 3423-3427.
17 Mahairas, G. G. and Minion, F. C. (1989) Plasmid, 21, 43-47.
18 Dybvig, K. (1990) Annu. Rev. Microbiol., 44, 81-103.
19 Hedreyda, C. T. and Krause, D. C. (1995) Infect. Immun., 63, 3479-3483.
20 Bork, P., Ouzounis, C. and Sander, C. (1994) Curr. Opin. Struct. Biol., 4, 393-403.
21 Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J.-F., Dougherty, B. A. and Venter, J. C. (1995) Science, 269, 496-512.
22 Sanger, F., Nicklen, S. and Coulson, A. R. (1977) Proc. Natl Acad. Sci. USA, 74, 5463-5467.
23 Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Connell, C. R., Heiner, C., Kent, S. B. H. and Hood, L. E. (1986) Nature, 321, 674-679.
24 Prober, J. M., Trainor, G. L., Dam, R. J., Hobbs, F. W., Robertson, C. W., Zagursky, R. J., Cocuzza, A. J., Jensen, M. A. and Baumeister, K. (1987) Science, 238, 336-341.
25 Ansorge, W., Sproat, B., Stegemann, J., Schwager, C. and Zenke, M. (1987) Nucleic Acids Res., 15, 4593-4602.
26 Brumbaugh, J. A., Middendorf, L. R., Grone, D. L. and Ruth, J. L. (1988) Proc. Natl Acad. Sci. USA, 85, 5610-5614.
27 Wenzel, R. and Herrmann, R. (1988b) Nucleic Acids Res., 16, 8323-8336.
59 Ladefoged, S. A. and Christiansen, G. (1994) J. Bacteriol., 176, 5835-5842.
60 Walker, J. E., Gay, N. J., Saraste, M. and Eberle, A. N. (1984) Biochem. J., 224, 799-815.
61 Santana, M., Ionescu, M. S., Vertes, A., Longin, R., Kunst, F., Danchin, A. and Glaser, P. (1994) J. Bacteriol., 176, 6802-6811.
62 Rasmussen, O. F., Shirvan, M. H., Margalit, H., Christiansen, C. and Rottem, S. (1992) Biochem. J., 285, 881-888.
63 Rosey, E. L. and Stewart, G. C. (1992) J. Bacteriol., 174, 6159-6170.
64 Braun, V. and Wu, H. C. (1994) In Ghuysen, J.-M. and Hakenbeck, R. (eds), Bacterial Cell Wall. Elsevier Science B.V., Chapter 14, pp. 319-341.
65 Sutcliffe, I. C. and Russell, R. R. B. (1995) J. Bacteriol., 177, 1123-1128.
66 Sebald, W. and Hoppe, J. (1981) Curr. Top. Bioenerg., 12, 1-64.
67 Bedwell, D., Davis, G., Gosink, M., Post, L., Nomura, M., Kestler, H., Zengel, J. M. and Lindahl, L. (1985) Nucleic Acids Res., 13, 3891-3902.
68 Cerretti, D. P., Dean, D., Davis, G. R., Bedwell, D. M. and Nomura, M. (1983) Nucleic Acids Res., 11, 2599-2616.
69 Zurawski, G. and Zurawski, S. M. (1985) Nucleic Acids Res., 13, 4521-4526.
70 Boylan, S. A., Suh, J.-W., Thomas, S. M. and Price, C. W. (1989) J. Bacteriol., 171, 2553-2562.
71 Ohkubo, S., Muto, A., Kawauchi, Y., Yamao, F. and Osawa, S. (1987) Mol. Gen. Genet., 210, 314-322.