Nucleic Acids Research, 2003, Vol. 31, No. 17 e106
© 2003 Oxford University Press
MGView: an alignment and visualization tool to enhance gap closure of microbial genomes
Departments of Microbiology and Veterinary Pathobiology and Biomedical Genomics Center, University of Minnesota, St Paul, MN 55108, USA and 1 Center for Computational Genomics and Bioinformatics, University of Minnesota, Minneapolis, MN 55455, USA
*To whom correspondence should be addressed at 1971 Commonwealth Avenue, St Paul, MN 55108, USA. Tel: +1 612 625 7712; Fax: +1 612 625 5203; Email: vkapur@umn.edu
Received May 19, 2003; Revised and Accepted July 11, 2003
| ABSTRACT |
|---|
Gap closure is a challenging phase in microbial random shotgun genome sequencing projects, particularly since genome assemblies are often complicated by the presence of repeat elements, insertion sequences and other similar factors that contribute to sequence misassemblies. While it is well recognized that the conservation of genetic information between microbial genomes, combined with the exponential increase in available microbial sequences, can be exploited to increase the efficiency of gap closure, we lack the computational tools to aid in this process. We describe here a new tool, MGView, which was developed to create a graphical depiction of the alignment of a set of microbial contigs against a completed microbial genome. The results of our assembly of the Staphylococcus aureus RF122 genome show that MGView enables a considerable reduction in time and economic cost associated with closure. Together, the results also show that the application of MGView not only enables a reduction in fold-coverage requirements of the random shotgun sequence phase, but also provides interesting insights into differences in gene content and organization between finished and unfinished microbial genomes.
| INTRODUCTION |
|---|
The random shotgun genome sequencing approach, first demonstrated by the complete genome sequencing of Haemophilus influenzae (1), has become a standard for the complete genome sequencing of microbial and eukaryotic organisms. Whilst technological advances have considerably sped up the initial random shotgun phase, the gap closure phase remains amongst the most challenging and time-consuming steps in the process. This is particularly important, since sequencing microbial genomes to full closure results in a considerably more robust set of data for downstream analysis than partial or majority genome sequence assemblies (2). While numerous bioinformatics tools exist to aid in assembling and visualizing raw sequence data into contiguous sequence (contigs), gap closure remains a bottleneck for genome sequencing.
The relative orientation of the contigs immediately following raw sequence assembly using gold-standard sequence assembly programs such as phredPhrap (P. Green, http://genome.washington.edu) is not known. It is also well recognized that certain regions of the genome, due to physical and biological properties associated with the nucleotide sequence, are very difficult to clone. This results in the absence of high quality sequence covering these regions in the shotgun assembly. Similarly, the chance that a region will remain uncovered even after the numerical length of the genome has been sequenced multiple times, together with the presence of repeat sequences in microbial genomes, makes it virtually impossible to obtain a single contig by shotgun sequencing alone. The accuracy of sequence assembly algorithms is further challenged by the increased occurrence of base call errors near the ends of sequence reads. While mathematically an 8-fold coverage of a genome should allow well over 99% coverage of the physical genetic material, the actual coverage achieved at 8-fold sequencing varies widely (2,3). This is particularly problematic among Gram-positive bacterial genomes, for which random, well sampled libraries are difficult to make in the standard Gram-negative Escherichia coli host cell (2). Consequently, even a 10-fold coverage of the genome will leave gaps, that is, regions of the physical chromosome that are not yet represented in the sequencing data. Reducing gaps directly via additional random sequencing costs a great deal of money and time, and the difficult-to-clone regions will remain uncovered. The alternative is to halt the shotgun sequencing and begin closing contig gaps by more directed methods. However, without knowledge of the relationship of the contigs to one another, few truly directed resources exist.
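The "well over 99%" figure for 8-fold coverage can be reproduced from the idealized random-clone model of Lander and Waterman (3), in which reads are assumed to land uniformly at random along the genome. The following is a minimal sketch under that assumption; the function names are illustrative and not part of MGView, and the model deliberately ignores cloning bias, repeats and the minimum overlap needed to join two reads.

```python
import math


def expected_covered_fraction(fold_coverage: float) -> float:
    """Expected fraction of the genome covered by at least one read,
    assuming reads land uniformly at random (Poisson approximation)."""
    return 1.0 - math.exp(-fold_coverage)


def expected_gap_count(num_reads: int, fold_coverage: float) -> float:
    """Expected number of gaps under the same idealized model,
    ignoring the minimum overlap required to merge two reads."""
    return num_reads * math.exp(-fold_coverage)


# Print the idealized covered fraction for a few coverage depths.
for c in (2, 4, 8, 10):
    print(f"{c:>2}-fold: {expected_covered_fraction(c):.4%} of the genome covered")
```

Because real libraries are not perfectly random, these values are best read as an upper bound on what a given depth of shotgun sequencing can deliver.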
In many cases, the gaps must be closed through a combination of: (i) educated guesses that are later confirmed by a directed PCR amplification and sequencing approach; (ii) a random PCR approach that utilizes primers designed to amplify contig ends; and (iii) a primer-walking approach whereby additional sequence is obtained by the incremental acquisition of new sequence information from small- or large-insert clones. In theory and practice, an average 8-fold coverage assembly from a well formed random library of a 4 Mb microbial genome would likely result in at least 80 contigs, requiring the design of at least 160 primers and more than 6000 PCRs to randomly close the 80 gaps. When PCR failure and computational errors in the raw sequence assembly are taken into account, the resulting gap closure scenario is often both frustrating and time consuming. While more efficient multiplex PCR strategies are replacing random PCR, and single-primer PCR extensions are being investigated (4,5), the time and cost associated with gap closure remain a major challenge for genome sequencing projects.
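The text does not state how the "more than 6000 PCRs" estimate was derived. One plausible way to see the scale of the problem, offered here purely as an assumption, is to design one outward-facing primer per contig end and to count every pairing of primers that sit on different contigs. The function names below are hypothetical.

```python
from math import comb


def primers_needed(num_contigs: int) -> int:
    """One outward-facing primer per contig end."""
    return 2 * num_contigs


def exhaustive_pairwise_pcrs(num_contigs: int) -> int:
    """Number of primer pairs when every primer is tried against every
    primer on a different contig (assumed counting scheme)."""
    primers = primers_needed(num_contigs)
    # All primer pairs, minus the pairs formed by the two ends of one contig.
    return comb(primers, 2) - num_contigs


print(primers_needed(80), exhaustive_pairwise_pcrs(80))
```

Under this counting scheme the number of reactions grows roughly with the square of the number of contigs, which is why knowing contig order and orientation in advance pays off so quickly.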
While several whole-genome alignment and alignment/visualization tools exist for comparing completed sequences and even large segments of unfinished genomes, these are not designed to fully exploit comparative sequence alignment for genome closure purposes (6–8). For instance, MUMmer, a suffix tree algorithm for aligning unfinished contigs of microbial and other organisms, enables comparative sequence alignments, but the output is limited to text and hence is somewhat difficult to use during the process of genome closure even when the completed genome of a closely related strain or species is available (8).
In order to overcome the limitations of existing programs and make the closure phase of microbial genome sequencing more cost- and time-efficient, we set out to design a software tool that would enable visualization of the order and orientation of contigs in relation to one another when a finished sequence from a closely related organism is already available. To this end, we describe here a contig visualization tool, MGView (Microbial Genome View), which aligns finished microbial genomes with the contigs of unfinished genomes, generating several visualizations to facilitate gap closure. The program also enables visualization of assembly progress for evaluating the extent of contig coverage and the exit point from the random sequencing phase.
| MGVIEW PROGRAM COMPONENTS |
|---|
MGView consists of two program components, MGView and MGDirect, both written in PERL, and both of which produce PDF files and ASCII output files (Fig. 1). The PDF format was chosen because graphics and text can be intermixed, unlike HTML, yet PDF files can still be viewed on most platforms using various standard display tools, such as Acrobat Reader (http://www.adobe.com/products/acrobat/readstep.html). Pagination and zooming tools, which play a vital role in whole-genomic visualizations, are thus implemented via the display program and require no overhead effort in MGView. PERL was chosen because of its superiority as a parsing language, its widespread availability, ease of development, and the ease with which PDF files can be generated using existing tools such as pdflib (http://www.pdflib.com/).
MGView requires two input files: a BLAST result report and a nucleotide fasta file of a completed backbone genome. The BLAST result report contains alignments of the experimental contigs (queries) against the previously completed genome (target), which is presumed to share its overall genomic organization with the genome being assembled. Importantly, the BLAST file therefore contains information from only two organisms: the query contigs from the experimental organism and the template genome from a similar organism. MGView currently relies on the TimeLogic system to accelerate the alignment operation (www.timelogic.com). Together with the backbone genome nucleotide file, the TimeLogic BLAST report is used to produce two major PDF output files, MGMap and MGTrack, along with several text files that are described below. MGView is compatible with both circular and linear chromosomes, although multiple chromosomes and plasmids must be run as separate files.
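To make the two inputs concrete, the sketch below reads a single-sequence fasta file and a set of contig-versus-backbone hits. The exact layout of the TimeLogic report is not given in the text, so the code assumes the widely used 12-column tab-separated BLAST format (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The `Hit` class and both function names are hypothetical and are not taken from MGView, which is written in PERL.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    contig: str    # query (contig) identifier
    q_start: int   # start of the match on the contig
    q_end: int     # end of the match on the contig
    s_start: int   # start of the match on the backbone genome
    s_end: int     # end of the match on the backbone genome
    evalue: float  # BLAST expect value


def parse_tabular_blast(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated BLAST hits (assumed input format)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hits.append(Hit(f[0], int(f[6]), int(f[7]),
                        int(f[8]), int(f[9]), float(f[10])))
    return hits


def read_single_fasta(lines: Iterable[str]) -> str:
    """Return the sequence of a one-record fasta file as a single string."""
    return "".join(l.strip() for l in lines
                   if l.strip() and not l.startswith(">"))


example_report = [
    "# contig hits against the backbone",
    "contig12\tbackbone\t98.5\t1200\t18\t0\t1\t1200\t50001\t51200\t1e-150\t2100",
    "contig07\tbackbone\t97.0\t800\t24\t0\t1\t800\t90800\t90001\t3e-90\t1300",
]
hits = parse_tabular_blast(example_report)
backbone = read_single_fasta([">backbone", "ACGTAC", "GTTA"])
print(len(hits), hits[1].contig, len(backbone))
```

Keeping one record per aligned segment, rather than one per contig, matches the way a contig that spans a repeat or rearrangement can produce several separate matches on the backbone.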
MGMap allows the user to visualize contig-to-genome alignments at the nucleotide level, at a resolution of 1 Mb per page. A bacterial genome of 4 Mb, therefore, will fit on a mere four pages in the PDF file. The ability to zoom on these pages is important, and also allows the large amount of alignment information to be stored in a relatively small file. The backbone genome in this file is depicted as a series of nucleotide letters, with coordinates marked every 100 bases and margin coordinates also delineated to allow rapid navigation over several thousand bases (Fig. 2). The contigs are depicted as colored line segments along the region of the backbone where the BLAST alignment expect-value is <1e-40. Contigs aligned in the same direction as the backbone genome are solid lines; contigs whose directions are reversed relative to the backbone appear as broken lines. Beginning and ending coordinates of the matching portion of the contig are marked, along with corresponding starting and ending coordinates of the backbone genome. In addition to the visual alignments, MGMap contains a table of the contig segments listing the start and end positions of each query/contig segment and the start and end positions in the target/backbone genome.
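The display rules described for MGMap can be summarized in a few lines. This is an illustrative sketch only: it assumes 1-based backbone coordinates, that 1 Mb means 1 000 000 bases, and that a reversed match is reported with its backbone start greater than its backbone end. The `Segment` type and the function names are hypothetical.

```python
from typing import List, NamedTuple

EVALUE_CUTOFF = 1e-40       # expect-value threshold quoted in the text
BASES_PER_PAGE = 1_000_000  # assumed meaning of "1 Mb per page"


class Segment(NamedTuple):
    contig: str
    s_start: int   # match start on the backbone (1-based)
    s_end: int     # match end on the backbone (1-based)
    evalue: float


def drawable(segments: List[Segment]) -> List[Segment]:
    """Keep only the matches strong enough to be drawn."""
    return [s for s in segments if s.evalue < EVALUE_CUTOFF]


def line_style(seg: Segment) -> str:
    """Solid for same-direction matches, broken for reversed ones."""
    return "solid" if seg.s_start <= seg.s_end else "broken"


def page_number(backbone_pos: int) -> int:
    """1-based page on which a backbone coordinate falls."""
    return (backbone_pos - 1) // BASES_PER_PAGE + 1


segments = [
    Segment("contig12", 50_001, 51_200, 1e-150),
    Segment("contig07", 2_090_800, 2_090_001, 3e-90),
    Segment("contig33", 10, 60, 1e-5),  # too weak, filtered out
]
for seg in drawable(segments):
    print(seg.contig, line_style(seg), "page", page_number(min(seg.s_start, seg.s_end)))
```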
MGTrack depicts contig orientation on a more global scale. The genome backbone is a solid circular line that is stretched into a flattened oval so that the entire genome fits on a single page, and specific regions can be enlarged using the magnify tools provided by Acrobat Reader. In this way, the backbone resembles a racetrack circuit with an extremely long front and backstretch. Whole contigs or segments thereof are aligned next to the track, appearing as bold solid lines, with only the contig number identified for each bold segment (Fig. 3). Hence, it is easy to identify where the current contig ends and how it is oriented relative to the beginning of the next contig, and short segments of contigs that are artifacts of repeat or insertion sequences can be recognized, preventing the confusion generated by text-only output. Additionally, contig misassemblies are easily identifiable with this view because overlapping and redundant regions are colored red (Fig. 3), and the name of each contig significantly matching such a region is printed individually. Insertion sequences, ribosomal regions and other repeats are also easily discernible as points along the backbone to which a whole set of contig names corresponds.
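The red "overlapping and redundant" regions correspond to stretches of the backbone matched by more than one contig segment. How MGView computes them is not described, so the sketch below uses a standard sweep over interval end points as a stand-in. It treats the backbone as linear with inclusive 1-based coordinates and does not handle matches that wrap around the origin of a circular chromosome. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Match = Tuple[int, int, str]  # (backbone start, backbone end, contig name)


def overlapping_regions(matches: List[Match]) -> List[Tuple[int, int, List[str]]]:
    """Return backbone regions covered by two or more matches, together
    with the names of the contigs involved in each region."""
    events = []
    for idx, (start, end, _name) in enumerate(matches):
        lo, hi = min(start, end), max(start, end)
        events.append((lo, 1, idx))      # match starts covering at lo
        events.append((hi + 1, 0, idx))  # match stops covering after hi
    events.sort()                        # at equal positions, ends sort first

    active: Dict[int, str] = {}          # matches covering the current position
    regions: List[list] = []
    prev = None
    for pos, kind, idx in events:
        if prev is not None and pos > prev and len(active) >= 2:
            names = set(active.values())
            if regions and regions[-1][1] == prev - 1:
                regions[-1][1] = pos - 1  # extend the adjacent region
                regions[-1][2] |= names
            else:
                regions.append([prev, pos - 1, names])
        if kind == 1:
            active[idx] = matches[idx][2]
        else:
            active.pop(idx, None)
        prev = pos
    return [(s, e, sorted(n)) for s, e, n in regions]


demo = [(1, 1000, "c1"), (900, 2000, "c2"), (950, 1100, "c3"), (5000, 6000, "c4")]
print(overlapping_regions(demo))
```

A region listing many contig names at once is the pattern the text associates with insertion sequences, ribosomal regions and other repeats.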
Along with the MGMap and MGTrack PDF files, MGView generates a table of gaps and a table of overlaps, which are stored as plain ASCII files. These tables are useful for identifying those occurrences that are difficult for assembly algorithms to deal with and are consequently sources of error: repeat regions and redundancies. Assemblies can be manually checked and corrected as needed using the information from these tables. Additional useful tools are made available through MGDirect, which creates a chronological and cumulative map of the sequence reads mapped against the complete genome (Fig. 4). The chronological map will show which regions are over- or under-represented along the genome, making it easy to analyze library quality early in the assembly phase, as well as identify regions that are not easily cloned. Finally, the cumulative map, essentially a real-time visualization of the actual Poisson distribution for an individual assembly project, indicates when additional random shotgun sequencing is not likely to aid greatly in closing genome gaps. Based on several sequence assemblies tested using MGDirect, the time at which gap regions are consistently not shortened by the addition of raw sequence reads corresponds to the point at which sequencing of representative clones in the genome library is exhausted. The addition of raw sequence at this time is no longer economically effective, because the returns are almost entirely in regions that are already covered.
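The gap table and the cumulative map can both be derived from the positions at which sequences align to the backbone. The following sketch is a simplified stand-in rather than a description of MGDirect itself: it assumes reads are supplied in chronological order as inclusive 1-based backbone intervals, and it records the covered fraction after every batch of reads. Function names and the batch size are hypothetical.

```python
from typing import List, Sequence, Tuple


def cumulative_coverage(genome_length: int,
                        reads: Sequence[Tuple[int, int]],
                        batch_size: int):
    """Return (curve, covered): curve holds (reads so far, fraction covered)
    after each batch; covered flags every backbone position hit so far."""
    covered = bytearray(genome_length)
    total = 0
    curve = []
    for i, (start, end) in enumerate(reads, 1):
        lo, hi = sorted((start, end))
        for pos in range(lo - 1, hi):
            if not covered[pos]:
                covered[pos] = 1
                total += 1
        if i % batch_size == 0 or i == len(reads):
            curve.append((i, total / genome_length))
    return curve, covered


def gap_table(covered: bytearray) -> List[Tuple[int, int]]:
    """List uncovered stretches as inclusive 1-based (start, end) pairs."""
    gaps = []
    gap_start = None
    for pos, flag in enumerate(covered):
        if not flag and gap_start is None:
            gap_start = pos
        elif flag and gap_start is not None:
            gaps.append((gap_start + 1, pos))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start + 1, len(covered)))
    return gaps


reads = [(1, 30), (21, 50), (10, 40), (81, 100)]
curve, covered = cumulative_coverage(100, reads, batch_size=2)
print(curve)
print(gap_table(covered))
```

In this toy run the third read adds no new bases; a long run of such reads is the kind of plateau that, according to the text, signals the point at which further random sequencing is no longer economical.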
| APPLICATION OF MGVIEW TO STAPHYLOCOCCUS AUREUS GENOMIC SEQUENCING |
|---|
MGView was developed and tested on a random shotgun sequence project for Staphylococcus aureus strain RF122, a bovine isolate with significant homology to other completed strains of S.aureus isolated from humans. With the aid of MGView, the genome of RF122 is complete, with each nucleotide being covered by at least two sequence reads and an error rate of <1 error per 10 000 bases. The slightly smaller genome of the bovine isolate demonstrates similar genomic organization to completed S.aureus strains, although rearrangement of phage-related regions is apparent. Table 1 compares resource usage during the MGView-assisted closure of the RF122 genome with that of several other S.aureus sequencing projects, each within 100 000 bases of the same total length. It is important to note that the economic and time savings observed during downstream projects using MGView are dependent on the availability of at least one complete genome of an isolate from the same or a genetically closely related species; furthermore, subtleties of genome structure in any organism may lengthen the time to complete closure. The results of the genome–genome comparisons generated by MGView must be used carefully as a guide rather than an absolute reference. Genome rearrangements do occur, and they become apparent once all genome gaps are closed and verified by PCR. It is imperative that this verification takes place, particularly when using reference genomes, to avoid overlooking such rearrangements. Overall, we believe that the acceleration of whole genome sequencing, particularly in the microbial world, will supply the necessary templates for using comparative strategies such as MGView to accelerate gap closure.
| CONCLUSION |
|---|
In summary, the results of our studies show that MGView provides a set of useful visualization tools that increase the efficiency of the microbial genome closure process and accelerate the transition towards identifying specific features for comparative investigation. The program is currently freely web-accessible (http://ccgb.umn.edu/cgi-bin/swdownload/down.pl) and has been tested on several Gram-positive and Gram-negative microbial sequencing projects in addition to S.aureus as described above. A limitation of the program is the dependence on the availability of a finished genome sequence from another strain or a closely related bacterial species. Avenues for improvement include implementation with stand-alone or publicly accessible BLAST software and options for user-specified parameters for homology threshold levels as an aid to making comparisons of genetically distant organisms. Furthermore, expansion of the annotation currently provided in the output files, such as directional indications on the large-scale MGTrack output, and coordinates at the beginning and end of overlapping segments, will also be helpful and are currently being developed. Despite these limitations and opportunities for improvement, the results of our implementation suggest that MGView is a valuable visualization tool that can considerably enhance the efficiency of microbial genome closure.
| ACKNOWLEDGEMENTS |
|---|
Research in the laboratory of V.K. is funded by competitive awards from the National Institutes of Health, US Department of Agriculture and the Minnesota Agricultural Experiment Station. L.H.-O. is supported by NIH NIGMS Biological Process Technology Institute Fellowship GM08347.
| REFERENCES |
|---|
1. Fleischmann,R.D., Adams,M.D., White,O., Clayton,R.A., Kirkness,E.F., Kerlavage,A.R., Bult,C.J., Tomb,J.F., Dougherty,B.A., Merrick,J.M. et al. (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269, 496–512.
2. Fraser,C.M., Eisen,J.A., Nelson,K.E., Paulsen,I.T. and Salzberg,S.L. (2002) The value of complete microbial genome sequencing (you get what you pay for). J. Bacteriol., 184, 6403–6405.
3. Lander,E.S. and Waterman,M.S. (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2, 231–239.
4. Tettelin,H., Radune,D., Kasif,S., Khouri,H. and Salzberg,S.L. (1999) Optimized multiplex PCR: efficiently closing a whole-genome sequencing project. Genomics, 62, 500–507.
5. Carraro,D.M., Camargo,A.A., Salim,A.C., Grivet,M., Vasconcelos,A.T. and Simpson,A.J. (2003) PCR-assisted contig extension: stepwise strategy for bacterial genome closure. Biotechniques, 34, 626–632.
6. Delcher,A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376.
7. Florea,L., Riemer,C., Schwartz,S., Zhang,A., Stojanovic,N., Miller,W. and McClelland,M. (2000) Web-based visualization tools for bacterial genome alignments. Nucleic Acids Res., 28, 3486–3496.
8. Delcher,A.L., Phillippy,A., Carlton,J. and Salzberg,S.L. (2002) Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res., 30, 2478–2483.
9. Kuroda,M., Ohta,T., Uchiyama,I., Baba,T., Yuzawa,H., Kobayashi,I., Cui,L., Oguchi,A., Aoki,K., Nagai,Y. et al. (2001) Whole genome sequencing of meticillin resistant Staphylococcus aureus. Lancet, 357, 1225–1240.
10. Baba,T., Takeuchi,F., Kuroda,M., Yuzawa,H., Aoki,K., Oguchi,A., Nagai,Y., Iwama,N., Asano,K., Naimi,T. et al. (2002) Genome and virulence determinants of high-virulence community-acquired MRSA. Lancet, 359, 1819–1827.