| Nucleic Acids Research | Pages |
A phylogenomic study of the MutS family of proteins
Introduction
Materials And Methods
Results And Discussion
Identification and alignment of MutS homologs
Phylogenetic trees of the MutS homologs
Beyond gene trees: identifying evolutionary events in the MutS family's history
Using the evolutionary information
Conclusions
Acknowledgements
References
ABSTRACT
INTRODUCTION
The ability to recognize and repair mismatches in DNA after replication has occurred has been well documented in many species. While some such mismatch repair (MMR) is carried out by pathways that repair only specific DNA replication errors, most is performed by broad specificity `general' MMR pathways. The most extensively studied general MMR system is the MutHLS pathway of the bacterium Escherichia coli (see 1,2 for review). In the first critical step in this pathway, the MutS protein (in the form of a dimer) binds to the site of a mismatch in double-stranded DNA. Through a complex interaction between MutS, MutL and MutH, a section of the newly replicated DNA strand (and thus the strand with the replication error) at the location of the mismatch bound by MutS is targeted for removal. Other proteins complete the repair process: the section of DNA that has been targeted is removed and degraded, a patch is synthesized using the complementary strand as a template and the patch is ligated into place resulting in a section of double-stranded DNA without mismatches.
The ability of the MutHLS pathway to repair many types of replication errors is due to the broad specificity of MutS recognition and binding. Since MutS binds to many types of base:base mismatches, the MutHLS pathway can repair many types of base misincorporation errors. Similarly, since MutS binds to heteroduplex loops (in which one strand contains extra-helical bases) the MutHLS pathway can repair frameshift replication errors. This ability to repair loops was somewhat surprising since this pathway was originally characterized as being involved in repairing mismatches. The repair of loops is particularly important in the regulation of the stability of microsatellites (loci that contain 1-10 bp tandem repeats). Microsatellites are prone to a special class of frameshift replication errors due to a process known as slip-strand mispairing (SSM). This process leads to the generation of loops of one or more copies of repeat unit (3,4). The MutHLS pathway helps keep microsatellite mutation rates in check by repairing many of the loops generated by SSM (5). While the specificity of MutS binding (and thus the MutHLS pathway) is quite broad, it is not uniform. For example, MutS does not bind C:C mismatches well and therefore the misincorporation of a C opposite a C will not be repaired well by the MutHLS pathway (6). Binding of MutS to heteroduplex loops is also not uniform. MutS only binds loops of up to four bases in size and only binds well to those up to three bases in size (7). Thus frameshift errors are only repaired if they produce loops of four bases or smaller. Since loops generated by SSM in microsatellites are usually one repeat unit in size, microsatellites with repeats >4 bp are highly unstable in E.coli. The non-uniformity of MutS recognition causes the MutHLS pathway to influence not only the mutation rate, but also the mutation spectrum.
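The size limits on loop recognition described above can be written as a simple rule. The following Python sketch is only an illustration of the thresholds stated in the text (loops of up to three bases bound well, loops of four bases bound less well, larger loops not bound); the function names and return labels are hypothetical, and the second function assumes, as the text notes is usually the case, that a slipped-strand loop is one repeat unit in size.

```python
def muts_loop_binding(loop_size_bases):
    """Classify E. coli MutS binding to a heteroduplex loop by loop size.

    Thresholds follow the text: binding is good up to three bases,
    still detectable at four bases, and absent for larger loops.
    """
    if loop_size_bases < 1:
        raise ValueError("loop size must be at least one base")
    if loop_size_bases <= 3:
        return "bound well"
    if loop_size_bases == 4:
        return "bound poorly"
    return "not bound"


def ssm_loop_is_recognized(repeat_unit_bp):
    """Return True if a one-repeat-unit SSM loop falls within the MutS size limit.

    Assumes the loop is exactly one repeat unit long (the usual case).
    """
    return muts_loop_binding(repeat_unit_bp) != "not bound"
```

Under these assumptions, a dinucleotide repeat yields loops that are recognized, whereas a microsatellite with a 5 bp repeat unit yields loops that escape repair, which is consistent with the instability of such loci in E.coli.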
The overall scheme of the MutHLS pathway (mismatch recognition, strand discrimination and excision and resynthesis) is conserved in the general MMR systems of other species (1). However, the degree of conservation of specific details varies greatly between the different steps in the process. Some steps (e.g. strand recognition) do not even use the same general mechanism between species. Others (e.g. exonucleolytic degradation) are similar in biochemical mechanism but make use of non-homologous proteins in different species. Nevertheless, some of the specific details of the MMR process are highly conserved. In particular, homologs of MutL and MutS are required for general MMR in all species examined and these proteins function in much the same way as the E.coli MutL and MutS (1). The conservation of MutS between species makes the specificity of MMR similar to that of E.coli. As with the E.coli MutHLS pathway, all characterized general MMR systems can repair both mismatches and loops. Incidentally, this is what led to the discovery that hereditary non-polyposis colon cancer (HNPCC) can be caused by defects in MMR (8). Cells from patients with HNPCC showed exceptionally high levels of microsatellite instability, due to defects in loop repair.
While the ability to repair both loops and mismatches is conserved, the specificity of other species MMR is not identical to that of E.coli. As with E.coli, dissecting the specificity of MMR in other species requires dissection of the binding preferences of MutS (or in these cases MutS homologs). However, in many cases the comparison to the E.coli MutS is complicated. For example, the best-studied eukaryotic MMR system is that of the yeast Saccharomyces cerevisiae. Unlike E.coli, S.cerevisiae encodes six MutS homologs, referred to as MutS Homolog (MSH) proteins (9). The best characterized of these are MSH2, MSH3 and MSH6 which are involved in MMR in the nucleus. These proteins are combined to create two distinct heterodimers; one for recognizing and repairing base:base mismatches and loops of one to two bases (composed of MSH2 and MSH6), and one for recognizing and repairing larger loops (composed of MSH2 and MSH3) (4,10). Thus, since MSH2 is in both heterodimers, it is required for all MMR in the nucleus, while MSH3 and MSH6 provide the specificity for the type of replication error recognized. The roles of the other MutS homologs in S.cerevisiae are not as well understood. MSH1 is involved in the repair of mismatches in mitochondrial DNA, although its exact function is not known (11-13). MSH4 and MSH5 do not even function in MMR, but instead are involved in meiotic crossing-over and chromosome segregation (14-16). The role of MutS homologs in processes other than correction of replication errors is not surprising since mismatches can arise in a variety of cellular circumstances. The proteins in the E.coli MutHLS pathway also have alternative cellular roles including the regulation of interspecies recombination and the repair of certain types of DNA damage (1,17). It may be that some of the multiple roles of the E.coli MutS have been divided up among the many S.cerevisiae MutS homologs.
Mismatch recognition and repair in humans and other animals is quite similar to that of S.cerevisiae (18,19; A.Villanuve, personal communication). Preliminary studies suggest that this is also true for plants (20). These similarities suggest that the complex MMR system of S.cerevisiae was established prior to the divergence of animal, fungal and plant ancestors. While studies of MMR in model species like humans, S.cerevisiae and E.coli are likely to continue, most new information about the MutS family of proteins is coming in the form of sequence data. Sequences of MutS homologs continue to pour into sequence databases, most without any accompanying functional information. An important new source of these sequences has been genome projects and the results coming out of these projects are somewhat surprising. For example, two MutS homologs have been found in many bacterial species as a result of bacterial genome projects (21,22), but it is not known if their functions are distinct. In addition, some species do not encode any MutS homologs, while others encode a MutS homolog but no MutL homolog (23).
How can one make sense out of the ever-expanding MutS family, the diversity of MutS proteins within particular species, and these unusual distribution patterns in complete genome sequences? In this paper, I describe a new type of analysis, which I refer to as phylogenomics, focused specifically on the MutS family of proteins. This analysis provides insight into the evolution of the MutS protein family and the diversity of functions within and between species. In addition, it allows improved predictions of the functions of uncharacterized genes in the MutS family, and the likely phenotypes of species for which complete genomes are available. Such a phylogenomic analysis can be useful to studies of any gene family.
MATERIALS AND METHODS
The sequences of previously characterized MutS-like proteins were downloaded from the National Center for Biotechnology Information (NCBI) databases (accession numbers are given in Table 1). Additional members of the MutS family were searched for using the blast (24), blast2 and PSI-blast (25) computer programs. Databases searched included the NCBI non-redundant database and unpublished, nearly complete genome sequences of Deinococcus radiodurans and Treponema pallidum from The Institute for Genomic Research (personal communication) and Streptococcus pyogenes and Neisseria gonorrhoeae from University of Oklahoma (B.A.Roe, S.Clifton and D.W.Dyer, personal communication).
Protein sequences were aligned using the clustalw (26) and clustalx (27) multiple sequence alignment programs with some manual adjustment using the GDE computer software package (28,29). Regions of ambiguity in this alignment were determined by comparison to alternative alignments generated using modifications of the alignment parameters (such as different gap penalties).
Phylogenetic trees were generated from the sequence alignments using the PAUP* program (30) on a PowerBook 3400/180. Parsimony analysis was conducted using the heuristic search algorithm. The total branch length of trees was quantified using either an identity matrix, a PAM250 matrix or a MutS-specific matrix (based on the frequency of particular amino acid substitutions in the evolution of the MutS protein family as estimated by the MacClade program; 31). Multiple runs searching for the shortest tree were conducted for each matrix. Distance-based phylogenetic trees were generated by the neighbor-joining (32) and UPGMA algorithms using estimated evolutionary distances calculated from the matrices described above. Bootstrap resampling was conducted by the method of Felsenstein (33). Character state analysis for the study of gene loss was conducted using the MacClade computer program (31).
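To make two of the steps above concrete, the following Python sketch shows an uncorrected pairwise distance (every amino acid difference costs 1, in the spirit of an identity matrix) and one round of bootstrap resampling of alignment columns. It is a minimal illustration and not the PAUP* procedure used in this study; the toy sequences, the gap handling and all function names are assumptions made for the example.

```python
import random

# Hypothetical toy alignment (not real MutS sequences); '-' marks a gap.
TOY_ALIGNMENT = {
    "seqA": "MKTLLDA-GE",
    "seqB": "MKSLLDAQGE",
    "seqC": "MRSLIDAQGD",
}


def p_distance(seq_a, seq_b):
    """Fraction of compared columns at which two aligned sequences differ.

    Columns with a gap in either sequence are skipped (an assumption).
    """
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(alignment):
    """Pairwise distances for every unordered pair of sequence names."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
```

In a full analysis, a tree would be built from each replicate and the support for a group taken as the fraction of replicate trees in which that group appears.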
RESULTS AND DISCUSSION
The publication in 1995 of the complete genome sequence of the bacterium Haemophilus influenzae (34) signaled the beginning of a new era in biological research. Genome sequences provide a wealth of information not only about a single organism but also about all of the genes that they encode. As genome and other sequence data continue to pour into databases at an amazing pace, we need to develop new methods to sort out this information. In developing such methods it is important to recognize that analysis of genomes can benefit from studies of individual gene families and analysis of genome sequences can provide a great deal of information about gene families. For example, many genomes encode dozens or even hundreds of members of some multigene families. Making accurate predictions of the phenotype of these species from the genome sequence requires making accurate predictions of the functions of genes in multigene families. Similarly, a simple analysis of the presence and absence of particular genes in a genome can reveal a great deal about different multigene families. Most methods currently being used to analyze gene and genome data rely on the identification and quantification of similarity between the gene or genome of interest and those of other species. While such methods are useful, they tend to ignore the fact that biological similarities have a historical component (i.e. evolution). It is well documented that the incorporation of an evolutionary perspective can greatly benefit any comparative biological study. The benefits of the evolutionary perspective come from focusing not just on similarities and differences, but on how and why such similarities and differences arose. Therefore, I believe that studies of genes and genomes can also benefit greatly from an evolutionary focus. I refer to the combined evolutionary study of genes and genomes as phylogenomics (35,36).
I report here a phylogenomic analysis that is focused on the MutS family of proteins. The MutS family is an ideal case study for phylogenomic analysis for a variety of reasons. First, there is a good deal of functional diversity within this gene family. Thus, classifying uncharacterized genes may help improve functional predictions. In addition, this diversity of functions may have major effects on species phenotypes, in particular any phenotype related to mutation rate and pattern. Thus identifying which genes are present in a particular genome may help improve predictions of that species phenotype. Finally, as mentioned in the Introduction, there are many unusual patterns of distribution of MutS homologs in currently available complete genome sequences. I have divided the phylogenomic analysis of the MutS family into multiple sections. In the first few sections, the evolutionary history of the MutS family is inferred by analysis of genes and genomes currently available. In the remaining sections, this evolutionary information is used to place some of the studies of the members of this gene family into a useful context and also to make predictions for uncharacterized genes and species.
Identification and alignment of MutS homologs
Multiple sequence searching algorithms were used to identify proteins with extensive amino acid sequence similarity to the previously characterized members of the MutS family. To increase the likelihood of identifying all available MutS homologs, highly divergent members of the MutS family and a MutS consensus sequence were used as query sequences. In addition, the PSI-blast program was used to identify any proteins with similar motifs to other MutS-like proteins. Proteins were considered to be members of the MutS family if they showed significant sequence similarity to any of the previously identified MutS proteins, and if this similarity extended throughout the protein. All identified complete or nearly complete MutS family members are listed in Table 1.
Table 1. Proteins in the MutS family
2Unnamed open reading frames are given a proposed name which is underlined.
3Determined by increased mutation rate in lines with defects in this gene.
4Genetic and biochemical studies suggest the MSH3 proteins are only involved in repair of large loops.
5Mutants show an increased rate of small duplications consistent with a possible role in loop repair.
6Genetic and biochemical studies suggest that MSH6 proteins are only involved in the repair of base:base mismatches and small loops.
7The last two of these may not be true orthologs of the others (see Discussion).
8I suggest changing the names of the sequences in this group to MutS2 to reflect their distinctness from the proteins in the MutS1 subgroup.
*Information not available.
The sequences of the proteins listed in Table 1 were aligned to each other using the clustalw multiple sequence alignment algorithm. This alignment was enhanced both manually and with the clustalx program, which allows local clustalw alignments to be performed within a larger alignment. (This complete alignment is available at http://www-leland.stanford.edu/~jeisen/MutS/MutS.html ). The alignment reveals that there are motifs that are highly conserved among all MutS-like proteins. Most of these conserved motifs are confined to one section that is on average ~260 amino acids in length. This section can be considered the core MutS-family domain. For most of the members of the MutS family, the MutS-family domain is near the C-terminal end of each protein. The alignment of this domain is shown for a representative sample of the proteins in the MutS family in Figure 1.
Figure 1. Alignment of a conserved region of the MutS proteins from representative members of the MutS family. The alignment was generated using the clustalw and clustalx programs and modified slightly manually. Shading was done based on degree of identity or conservation using the MacBoxshade program. Previously described MutS motifs are referred to by roman numerals. The beginning and ending amino acids for each protein are numbered.
Phylogenetic trees of the MutS homologs
Phylogenetic trees of the proteins in the MutS family were determined from the alignment using distance and parsimony methods, each with multiple parameters (see Materials and Methods). Since each alignment position is assumed to include residues that share a common ancestry among species, regions of ambiguous alignment were excluded from the phylogenetic analysis. Regions of particularly low sequence conservation were also excluded. In total, 313 amino acid alignment positions were used (available at the MutS web site). The trees generated with the different methods and parameters were very similar in topology to each other. Therefore only one tree (the neighbor-joining tree) is shown here (Fig. 2).
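The exclusion of ambiguous or poorly conserved regions amounts to masking alignment columns before tree building. A minimal Python sketch of that step follows; the column ranges and sequences are hypothetical placeholders (the actual excluded positions are not listed in this text), and the half-open, zero-based range convention is an assumption of the example.

```python
def mask_columns(alignment, excluded_ranges):
    """Return a copy of the alignment without the excluded columns.

    alignment: dict mapping sequence name to an aligned string.
    excluded_ranges: iterable of (start, end) pairs, zero-based and half-open.
    """
    length = len(next(iter(alignment.values())))
    excluded = set()
    for start, end in excluded_ranges:
        excluded.update(range(start, end))
    kept = [i for i in range(length) if i not in excluded]
    return {
        name: "".join(seq[i] for i in kept)
        for name, seq in alignment.items()
    }
```

Because the same columns are dropped from every sequence, the remaining positions stay aligned with one another, which is the property the phylogenetic methods rely on.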
Figure 2. Phylogenomic analysis of the MutS family of proteins. (A) Unrooted neighbor-joining tree of the proteins in the MutS family. The tree was generated from a clustalw based sequence alignment (with regions of ambiguous alignment excluded) with the PAUP* program. Some of the bacterial MutS1 proteins are left out for clarity. (B) Proposed subfamilies of orthologs are highlighted (see Discussion for details). (C) Known functions of genes are overlaid onto the tree. For simplicity, only two colors are used, red for mismatch repair and blue for meiotic-crossing over and chromosome segregation. (D) Prediction of functions of uncharacterized proteins based on position in the tree.

In addition to assessing the internal consistency of the results, it is also useful to compare the results presented here to those of other studies. Unfortunately, many previous studies of the evolution of the MutS family of proteins have not described the methods used to generate the trees and thus are not comparable to this study (e.g. 18). In addition, some studies have used multiple sequence alignment programs like clustalw and pileup to generate trees directly and thus cannot be considered reliable phylogenetic studies (37,38). There have been only two studies of the evolution of MutS homologs using standard phylogenetic methods (20,39). These studies should be considered limited because they did not include many of the more divergent members of the MutS family. Nevertheless, most of the results of these studies are similar to those reported here. Some specific differences and similarities are discussed below.
Beyond gene trees: identifying evolutionary events in the MutS family's history
As with any gene family, the phylogenetic tree of the MutS proteins simply shows the relationships among homologs. It is almost always useful to go beyond this gene tree to identify specific evolutionary events in a gene family's history. For example, identification of the types of homology (orthology, paralogy and xenology) in this tree allows the detection of the particular evolutionary event (speciation, gene duplication and lateral gene transfer, respectively) that led to the divergence of homologs. To identify these and other evolutionary events, it is necessary to integrate the gene tree with other information, such as gene function, species phenotype or species phylogeny.
Subfamilies of orthologs. As the first step in going beyond the MutS gene tree, I divided the MutS family into subfamilies that I propose represent distinct groups of orthologs (i.e. sets of genes that diverged from each other due to speciation events). Each subfamily has been given a name based on the name of one of the better-studied proteins in that group (italics are used to distinguish the subfamilies from individual proteins). The proposed subfamilies are highlighted in Figure 2B. Overall, eight orthologous subfamilies were identified; six that include only proteins from eukaryotes (corresponding to the six yeast MutS homologs) and two that include only proteins from bacteria. Most of these subfamilies correspond well to groups that have been suggested previously. For example, the animal and yeast proteins in each eukaryotic subfamily have been identified as likely orthologs of each other by standard sequence similarity searches and other non-phylogenetic methods. The phylogenetic analysis simply confirms that these are indeed orthologs. The identification of two distinct bacterial subfamilies represents a novel finding [although it was suggested by Eisen et al. (35)]. This finding shows one of the benefits of phylogenetic analysis over standard sequence-similarity searches. In addition to the subfamilies, two proteins (one from Methanobacterium thermoautotrophicum and one from the mitochondrial genome of Sarcophyton glaucum) are closely related to the MutS2 subfamily but they were not placed into this subfamily. Although these two genes group with the MutS2 subfamily in every tree, it is possible that they may have been involved in lateral transfer events and therefore may not be orthologs of the MutS2 proteins. Nevertheless, they are close relatives of the MutS2 subfamily.
Examination of the species represented in each orthologous group can help determine when that group originated. For example, all the eukaryotic subfamilies except MSH1 include proteins from yeast and humans suggesting that these subfamilies originated prior to the divergence of the common ancestor of fungi and animals. Similarly, the MutS1 and MutS2 subfamilies are composed of proteins from diverse bacterial species, including some of the deeper branching bacterial taxa (e.g., D.radiodurans and Aquifex aeolicus). Therefore the origin of these bacterial subfamilies probably predates the divergence of most of the bacterial phyla. While this type of analysis can help time the origin of the orthologous groups, it does not provide any information about how these groups originated. That is, did the orthologous groups originate by gene duplication or lateral transfer? Many other questions also cannot be answered by the simple division into groups of orthologs. Therefore additional analysis is required.
Unusual distributions of MutS orthologs help identify specific evolutionary events. One way to identify particular evolutionary events in the history of a gene family is to analyze unusual distribution patterns of the different orthologs. Such unusual distributions can be explained either by lateral transfer to the species with an `unexpected' presence of a gene, or by gene loss in the lineages with an unexpected absence of certain genes. These two possibilities can be distinguished by comparing the gene tree to the tree of the species from which these genes come. If an unusual distribution is caused by gene loss, then the gene and species trees should be congruent (as though the species which do not encode a particular gene were just cut out of a larger tree of life). If instead lateral transfer caused an unusual distribution, then the gene and species trees should be incongruent.

Analysis of the distribution of proteins used to be relatively haphazard. However, the availability of complete genome sequences allows for the first time the reliable determination (through sequence analysis) of what genes are present or absent in a species. This of course assumes that homologs can be detected by the sequence analysis methods used. Given the level of conservation among a diverse collection of MutS homologs (Fig.
Table 2. Properties of MutS subfamilies

Table 3. Presence of MutS homologs in complete genome sequences

Since most of the available complete genome sequences are from bacteria, I focused first on distribution patterns in the bacteria. Every possible pattern of presence and absence of the MutS1 and MutS2 proteins is found in the bacteria (Table 3); some species encode members of both subfamilies, while others encode only one or none. There are two reasonable explanations for this: either rampant gene loss after gene duplication or multiple lateral transfer events. As discussed above, one way of testing which occurred is to compare the phylogenetic trees of the two subfamilies. If there was an ancient duplication, then the branching patterns within the MutS1 and MutS2 subfamilies should be identical. However, it is not valid to simply extract the MutS1 and MutS2 evolutionary relationships from the gene tree shown in Figure

Figure 3. Gene duplication and gene loss in the history of the bacterial MutS homologs. (A) Neighbor-joining phylogenetic tree of the MutS1 and MutS2 subfamilies (using only those proteins from species with both). The identical topology of the tree in the two subfamilies suggests the occurrence of a duplication prior to the divergence of these bacteria. (B) Gene loss within the bacteria. Gene loss was determined by overlaying the presence and absence of MutS1 and MutS2 orthologs onto the tree of the species for which complete genomes are available (since only with a complete genome sequence can one be relatively certain that a gene is absent from a species). The thick gray lines represent the evolutionary history of the species based on a combination of the MutS and rRNA trees for these species. The thin colored lines represent the evolutionary history of the two MutS subfamilies (MutS1 in red and MutS2 in blue). Branch lengths do not correspond to evolutionary distance. Gene loss is indicated by a dashed line and each loss is labeled by a number: 1, MutS2 loss in enterobacteria; 2, MutS1 loss in H.pylori; 3, MutS2 loss in the mycoplasmas; 4, MutS1 loss in the mycoplasmas; and 5, MutS2 loss in T.pallidum.

The evidence presented above shows that the MutS1 and MutS2 subfamilies are most likely related by a gene duplication event. However, the evidence does not specify when this duplication occurred. Based on a variety of evidence, I propose that the duplication was ancient and that the root of the MutS tree is most accurately placed such that it divides the family into two main lineages which I refer to as MutS-I and MutS-II. MutS-I includes the MutS1, MSH1, MSH2, MSH3 and MSH6 subfamilies and MutS-II includes the MutS2, MSH4 and MSH5 subfamilies. Three pieces of information support the division into these two main lineages: (i) these two groups were found in all trees regardless of methods or parameters used; (ii) function is generally conserved within but not between lineages (the proteins involved in MMR are all in the MutS-I lineage and those involved in meiotic crossing-over are in the MutS-II lineage) (Table 1); and (iii) such an ancient duplication is consistent with the presence of bacterial and eukaryotic subfamilies in each lineage and is also consistent with the evidence for a duplication prior to the emergence of the major bacterial groups. Since these arguments are somewhat circumstantial and, since the bootstrap values defining the two supergroups are relatively low, this hypothesis should be considered highly tentative. A consensus tree, using the proposed rooting but in which those patterns that are not robust are collapsed, is shown in Figure 4.

Figure 4. Consensus phylogenetic tree of MutS family of proteins. Branches with low bootstrap values or that were not-identical in trees generated with different methods were collapsed. Only the proposed subfamilies are shown (sequences in each group are listed in Table 1). In addition, two proteins that are related to the MutS2 subfamily are grouped with it. The height of each subgroup corresponds to the number of sequences in that group and the width corresponds to the longest branch length within the group. Bootstrap values for specific nodes are listed when >40% (neighbor-joining on the top, parsimony on the bottom). The root of the tree was assigned as discussed in the text between the groups labeled MutS-I and MutS-II. Conserved functions for the different groups are listed.

The ancient duplication theory proposed above does not describe all of the unusual distribution patterns in the MutS family. One such pattern is the presence of only one MutS homolog among the three Archaea for which complete genomes are available. This is the MutS2-like protein of M.thermoautotrophicum. As discussed above, since the MutS proteins are highly conserved (including the one MutS homolog from Archaea) it is unlikely that other MutS homologs are present in these Archaeal species but were not identified. With the data currently available, it is not possible to resolve the origins of this gene. One reason for this is the lack of a consensus concerning the evolutionary history of the major domains of life. If the Archaea are a sister group to the eukaryotes (as suggested by some studies), then the distribution pattern is probably best explained by gene loss in the history of these Archaea. If instead the bacteria and eukaryotes are sister groups (or even just for the parts of the genome encoding the MutS proteins), then the MutS gene family may have evolved after the Archaea formed a separate lineage. Thus the distribution pattern could be explained simply by lateral transfer to M.thermoautotrophicum. Another reason for the difficulty in resolving this unusual distribution pattern is that these three species do not represent much of the Archaeal evolutionary diversity. It is likely that additional Archaeal genomes will help resolve the history of the Archaeal MutS homolog(s).

Another unusual distribution pattern is the presence of a MutS homolog (sgMutS) in the mitochondrial genome of the coral S.glaucum. Although this mitochondrial genome is not completely sequenced, many other mitochondrial genomes have been and none of these encodes a MutS homolog. In a detailed phylogenetic study, Pont-Kingdon et al. found that the sgMutS branched most closely to the yeast MSH1 (39). Since MSH1 is encoded by the nucleus but functions in the mitochondria, this seemed like a possible case of lateral transfer from the mitochondria to the nucleus. However, since the sgMutS did not branch within any bacterial group of proteins and since most mitochondria do not encode a MutS homolog, they concluded that the sgMutS represented a case of `reverse' lateral transfer from the nucleus to the mitochondria. Although their analysis was sound, it was not complete because they did not include proteins from all of the MutS subfamilies. With the more complete sample of MutS homologs, the sgMutS branches closely to the MutS2 subfamily and not with the MSH1 subfamily (Fig.
2Genome not yet complete.
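The congruence test used in this section (identical branching patterns in two subfamilies, or in a gene tree and a species tree) can be illustrated with a short Python sketch. Trees are written as nested tuples whose leaves are species labels, and two rooted trees are treated as congruent when they contain the same set of clades. The representation, the function names and the placeholder labels sp1 to sp4 are assumptions made for illustration; this is not the procedure or software used in the study.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree of nested tuples.

    A leaf is a string (for example, a species label); an internal node
    is a tuple of subtrees. Each clade is a frozenset of leaf labels.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def same_topology(tree_a, tree_b):
    """True if two rooted trees define exactly the same set of clades."""
    return clades(tree_a)[1] == clades(tree_b)[1]


# Hypothetical subfamily trees, each gene labeled by the species encoding it.
MUTS1_TREE = ((("sp1", "sp2"), "sp3"), "sp4")
MUTS2_TREE = ("sp4", ("sp3", ("sp2", "sp1")))   # same branching, drawn differently
CONFLICT_TREE = ((("sp1", "sp3"), "sp2"), "sp4")
```

Here `same_topology(MUTS1_TREE, MUTS2_TREE)` holds even though the tuples are ordered differently, whereas `CONFLICT_TREE` differs from both, the kind of incongruence that would point toward lateral transfer rather than duplication followed by loss.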
Using the evolutionary information
The benefits of using evolutionary analysis in molecular biology come from improving both our understanding of observed molecular characteristics and our ability to make biologically useful predictions. What are the particular uses of the evolutionary analysis of the MutS family described above? First, I used the phylogenetic information to infer likely functions for uncharacterized members of the MutS family (Fig. 2D).
The phylogenetic-functional analysis suggests not only that functions have been conserved within orthologous groups but also that the generation of the orthologous groups was accompanied by functional divergence. The evolutionary analysis on its own does not provide a complete explanation of the functions of the MutS genes. There must be some sequence patterns that explain the functional similarities and differences in the family. Since the MutS-family domain is highly conserved among all the MutS-like proteins, this domain is likely to provide some general activity to all the proteins in the family, such as the ability to recognize and bind to unusual double-stranded DNA structures. In addition, there must be some sequence patterns that are conserved within but not between subfamilies (either in these proteins or in regulatory regions) that provide specific functions to each subfamily. The phylogenetic analysis can help identify functionally important motifs because they can be searched for only within subfamilies (42). Thus the phylogenetic analysis can help understand the mechanism of the specificity of each subfamily.
The phylogenetic-functional analysis can be used in combination with gene presence and absence data to predict organismal phenotypes for those species for which complete genomes are available. For example, it is likely that the species that do not encode a protein in the MutS-I lineage do not have the MMR process as it has been found in other species. This inference is supported by the fact that all species that do not encode a protein in the MutS-I lineage also do not encode a MutL homolog (see above and 35). It is further supported by the fact that some of the species that do not encode a MutS1 ortholog also have a high mutation rate (e.g. the mycoplasmas), which is consistent with an absence of MMR. However, since it is possible that other enzymatic mechanisms could have evolved to deal with mismatches, it is not possible to know for certain whether these species have MMR without experimental verification. Since no function is known for the proteins in the MutS2 subfamily, it is difficult to determine the significance of the absence of orthologs of these genes from species like E.coli and H.influenzae.
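The reasoning in this paragraph can be expressed as a small rule applied to a gene presence/absence table. The sketch below is illustrative only: the function name and output labels are hypothetical, and the table holds just the entries that follow from statements in the text (E.coli has the MutHLS pathway; H.pylori and the mycoplasmas lack a MutS1 ortholog, and species lacking one also lack MutL). The output is a prediction that would still need experimental verification.

```python
def predict_general_mmr(has_muts1, has_mutl):
    """Hypothetical rule: predict general MMR status from the presence of
    a MutS1 ortholog and a MutL homolog in a complete genome."""
    if has_muts1 and has_mutl:
        return "general MMR likely present"
    if not has_muts1 and not has_mutl:
        return "general MMR likely absent"
    return "unclear"


# Presence/absence values as implied by the text; treat as illustrative.
presence = {
    "E.coli": {"MutS1": True, "MutL": True},
    "H.pylori": {"MutS1": False, "MutL": False},
    "mycoplasmas": {"MutS1": False, "MutL": False},
}

predictions = {
    species: predict_general_mmr(genes["MutS1"], genes["MutL"])
    for species, genes in presence.items()
}

for species, call in predictions.items():
    print(f"{species}: {call}")
```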
Combining functional predictions for genes with the gene loss analysis allows a better understanding of why the loss of these genes occurred. The gene loss data show that losses of MutS1 and MutS2 occurred in multiple lineages. Many theories have been put forward to explain gene loss during evolution (43,44). Many of these theories involve genome-level phenomena such as selection for reduced genome size, or Muller's ratchet destroying some genes. However, the loss of MutS homologs may be a more gene-specific event: there is likely to be a selective benefit for the loss of MutS genes in some lineages. Defects in MMR have been suggested to be beneficial in certain conditions such as under nutrient stress (45) and selection for pathogenesis (46,47). It is likely that many of these benefits are due to an increased mutation rate, although some may also be due to changes in other functions associated with MMR proteins. While these benefits have been shown by comparing different strains of the same species, it is possible that such benefits may also occur in comparisons between species. For example, it has been suggested that H.pylori varies its antigens through a microsatellite mutation process (23). Such mutations would occur at a much higher rate in an MMR-deficient strain, which could explain the loss of MutS1 from H.pylori sometime in the past.
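To make concrete what it means for losses to occur in multiple lineages, the sketch below counts independent losses on a species tree. It is a simplified illustration and not the procedure used in this study: it assumes the gene was present in the common ancestor and is never regained, so each maximal clade whose members all lack the gene counts as one loss. The tree, species labels and function names are hypothetical.

```python
def _scan(tree, present):
    """Return (clade_has_gene, losses_inside_clade) for a nested-tuple tree
    whose leaves are species names."""
    if isinstance(tree, str):
        return present[tree], 0
    results = [_scan(child, present) for child in tree]
    if not any(has for has, _ in results):
        # Entire clade lacks the gene; the loss is counted further up.
        return False, 0
    losses = sum(n for _, n in results)
    # Each child clade that entirely lacks the gene is one loss event.
    losses += sum(1 for has, _ in results if not has)
    return True, losses


def count_losses(tree, present):
    """Minimum number of losses, assuming ancestral presence and no regain."""
    has, losses = _scan(tree, present)
    return losses if has else 1


# Hypothetical rooted species tree and presence/absence data.
toy_tree = ((("sp_A", "sp_B"), "sp_C"), ("sp_D", "sp_E"))
toy_present = {"sp_A": True, "sp_B": False, "sp_C": True,
               "sp_D": False, "sp_E": False}

n_losses = count_losses(toy_tree, toy_present)
```

In the toy data, the absence in sp_B and the shared absence in sp_D and sp_E are counted as two separate events, because a single loss cannot explain both without also removing the gene from sp_A and sp_C.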
Conclusions
I have used a combination of phylogenetic reconstruction methods and analysis of complete genome sequences to better understand the MutS family of proteins. Since studies of multigene families and genomes are interdependent, it is useful to combine these analyses into one study. Phylogenomic methodology similar to that used here can be applied to any multigene family. First, molecular phylogenetic analysis should be used to determine the evolutionary relationships among the genes in the gene family. Then, integration of species information can be used to divide the family into subfamilies of orthologs and to infer evolutionary events such as gene duplications, lateral transfers and gene loss. This evolutionary information can be used in combination with genome information to improve functional predictions for uncharacterized genes. For example, the phylogenetic analysis shows that the proteins in the MutS2 subfamily are distant and distinct from those involved in mismatch repair, and genome analysis shows that many of the species that encode these genes do not encode other proteins required for mismatch repair. Thus these proteins are likely not involved in mismatch repair. The phylogenomic analysis can also be used to characterize functionally important sequence motifs, to predict the phenotypes of species for which complete genomes are available and to better understand why events such as gene loss and gene duplication may have occurred. In summary, since any comparative biological analysis benefits from an evolutionary perspective, the use of evolutionary methods can only serve to improve what can be learned from ever increasing amounts of gene and genome data.
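The functional prediction step of this methodology can be sketched as follows, assuming the orthologous subfamilies have already been defined from the gene tree. The gene names, the data structures and the function name are hypothetical; the rule shown (transfer a function only when all characterized members of a subfamily agree, otherwise make no prediction) is one conservative choice among several possible ones.

```python
def predict_functions(subfamily_of, known_function):
    """subfamily_of maps each gene to its orthologous subfamily.
    known_function maps experimentally characterized genes to a function.
    Returns a prediction (or None) for each uncharacterized gene."""
    functions_by_family = {}
    for gene, function in known_function.items():
        functions_by_family.setdefault(subfamily_of[gene], set()).add(function)

    predictions = {}
    for gene, family in subfamily_of.items():
        if gene in known_function:
            continue  # already characterized; nothing to predict
        functions = functions_by_family.get(family, set())
        # Predict only when the characterized members agree on one function.
        predictions[gene] = next(iter(functions)) if len(functions) == 1 else None
    return predictions


# Hypothetical genes; subfamily labels follow the names used in the text.
toy_subfamily_of = {
    "gene1": "MutS1", "gene2": "MutS1",
    "gene3": "MutS2", "gene4": "MutS2",
}
toy_known_function = {"gene1": "mismatch repair"}

toy_predictions = predict_functions(toy_subfamily_of, toy_known_function)
```

Here gene2 inherits the prediction of its characterized ortholog, while gene3 and gene4 receive no prediction because no member of their subfamily has a known function, mirroring the situation described for the MutS2 subfamily.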
ACKNOWLEDGEMENTS
I would like to thank P. Hanawalt for his continued support; J. Halliday, R. Myers and M.-I. Benito for helpful discussions; F. Taddei, D. Berg, K. Culligan and G. Meyer-Gauen for suggestions on earlier versions of this work; D. Swofford for making PAUP* available prior to release and The Institute for Genomic Research and Oklahoma University for making sequences available prior to publication. The comments of two anonymous reviewers were particularly helpful. I was supported by Outstanding Investigator Grant CA44349 from the National Cancer Institute to P. Hanawalt. I apologize to the authors of primary papers on the various MutS homologs for being unable to cite them all here. References relating to each can be found at the MutS web site: http://www-leland.stanford.edu/~jeisen/MutS/MutS.html
REFERENCES
This article has been cited by other articles:
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Last modification: 29 Aug 1998
Copyright © Oxford University Press, 1998.
K. M. Fisher. Bayesian reconstruction of ancestral expression of the LEA gene families reveals propagule-derived desiccation tolerance in resurrection plants. Am. J. Botany, April 1, 2008; 95(4): 506-515.
A. L. Seyfert, M. E. A. Cristescu, L. Frisse, S. Schaack, W. K. Thomas, and M. Lynch. The Rate and Spectrum of Microsatellite Mutation in Caenorhabditis elegans and Daphnia pulex. Genetics, April 1, 2008; 178(4): 2113-2121.
T. Snowden, K.-S. Shim, C. Schmutte, S. Acharya, and R. Fishel. hMSH4-hMSH5 Adenosine Nucleotide Processing and Interactions with Homologous Recombination Machinery. J. Biol. Chem., January 4, 2008; 283(1): 145-154.
Z. Lin, M. Nei, and H. Ma. The origins and early evolution of DNA mismatch repair genes: multiple horizontal gene transfers and co-evolution. Nucleic Acids Res., December 3, 2007; 35(22): 7591-7603.
K. Fukui, Y. Takahata, N. Nakagawa, S. Kuramitsu, and R. Masui. Analysis of a nuclease activity of catalytic domain of Thermus thermophilus MutS2 by high-accuracy mass spectrometry. Nucleic Acids Res., August 7, 2007; (2007) gkm575v1.
K. Fukui, H. Kosaka, S. Kuramitsu, and R. Masui. Nuclease activity of the MutS homologue MutS2 from Thermus thermophilus is confined to the Smr domain. Nucleic Acids Res., February 16, 2007; 35(3): 850-860.
D. Dailidiene, G. Dailide, D. Kersulyte, and D. E. Berg. Contraselectable Streptomycin Susceptibility Determinant for Genetic Manipulation and Analysis of Helicobacter pylori. Appl. Envir. Microbiol., September 1, 2006; 72(9): 5908-5914.
S. R. Schulze, B. F. McAllister, D. A. R. Sinclair, K. A. Fitzpatrick, M. Marchetti, S. Pimpinelli, and B. M. Honda. Heterochromatic Genes in Drosophila: A Comparative Analysis of Two Genes. Genetics, July 1, 2006; 173(3): 1433-1445.
E. C. Chao and S. M. Lipkin. Molecular models for the tissue specificity of DNA mismatch repair-deficient carcinogenesis. Nucleic Acids Res., February 6, 2006; 34(3): 840-852.
J. Kang, S. Huang, and M. J. Blaser. Structural and Functional Divergence of MutS2 from Bacterial MutS1 and Eukaryotic MSH4-MSH5 Homologs. J. Bacteriol., May 15, 2005; 187(10): 3528-3537.
D. R. Denver, S. Feinberg, S. Estes, W. K. Thomas, and M. Lynch. Mutation Rates, Spectra and Hotspots in Mismatch Repair-Deficient Caenorhabditis elegans. Genetics, May 1, 2005; 170(1): 107-113.
P. Meier and W. Wackernagel. Impact of mutS Inactivation on Foreign DNA Acquisition by Natural Transformation in Pseudomonas stutzeri. J. Bacteriol., January 1, 2005; 187(1): 143-154.
E. Sanchez-Moran, G. H. Jones, F. C. H. Franklin, and J. L. Santos. A Puromycin-Sensitive Aminopeptidase Is Essential for Meiosis in Arabidopsis thaliana. Plant Cell, November 1, 2004; 16(11): 2895-2909.
K. Fukui, R. Masui, and S. Kuramitsu. Thermus thermophilus MutS2, a MutS Paralogue, Possesses an Endonuclease Activity Promoted by MutL. J. Biochem., March 1, 2004; 135(3): 375-384.
J. S. Bell and R. McCulloch. Mismatch Repair Regulates Homologous Recombination, but Has Little Influence on Antigenic Variation, in Trypanosoma brucei. J. Biol. Chem., November 14, 2003; 278(46): 45182-45188.
R. J. Willems, J. Top, D. J. Smith, D. I. Roper, S. E. North, and N. Woodford. Mutations in the DNA Mismatch Repair Proteins MutS and MutL of Oxazolidinone-Resistant or -Susceptible Enterococcus faecium. Antimicrob. Agents Chemother., October 1, 2003; 47(10): 3061-3066.
J. A. Eisen and C. M. Fraser. Phylogenomics: Intersection of Evolution and Genomics. Science, June 13, 2003; 300(5626): 1706-1707.
C. L. Nesbo, K. E. Nelson, and W. F. Doolittle. Suppressive Subtractive Hybridization Detects Extensive Genomic Diversity in Thermotoga maritima. J. Bacteriol., August 15, 2002; 184(16): 4475-4488.
B. Binishofer, I. Moll, B. Henrich, and U. Blasi. Inducible Promoter-Repressor System from the Lactobacillus casei Phage φFSW. Appl. Envir. Microbiol., August 1, 2002; 68(8): 4132-4135.
W.-Y. Ku, Y.-W. Liu, Y.-C. Hsu, C.-C. Liao, P.-H. Liang, H. S. Yuan, and K.-F. Chak. The zinc ion in the HNH motif of the endonuclease domain of colicin E7 is not required for DNA binding but is essential for DNA hydrolysis. Nucleic Acids Res., April 1, 2002; 30(7): 1670-1678.