Nucleic Acids Research Advance Access originally published online on October 14, 2008
Nucleic Acids Research 2009 37(Database issue):D494-D498; doi:10.1093/nar/gkn674
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article appears in the following Nucleic Acids Research issue: Database issue
Strepto-DB, a database for comparative genomics of group A (GAS) and B (GBS) streptococci, implemented with the novel database platform Open Genome Resource (OGeR)
Institute for Microbiology, Technische Universität Braunschweig, Spielmannstrasse 7, 38106 Braunschweig, Germany
*To whom correspondence should be addressed. Tel: +49 531 391 5810; Fax: +49 531 391 5854; Email: i.retter{at}tu-bs.de
Received August 15, 2008. Revised September 19, 2008. Accepted September 22, 2008.
| ABSTRACT |
|---|
Streptococci are the causative agents of many human infectious diseases, including bacterial pneumonia and meningitis. Here, we present Strepto-DB, a database for the comparative genome analysis of group A (GAS) and group B (GBS) streptococci. The known genomes of various GAS and GBS contain a large fraction of distributed genes that are absent from other strains or serotypes of the same species. Strepto-DB identifies the homologous proteins deduced from the genomes of interest and allows the GAS and GBS core- and pan-genomes to be elucidated via genome-wide comparisons. Moreover, an intergenic region analysis tool provides alignments and predictions of transcription factor binding sites in the non-coding sequences. An interactive genome browser visualizes functional annotations. Strepto-DB (http://oger.tu-bs.de/strepto_db) was created using OGeR, the Open Genome Resource for comparative analysis of prokaryotic genomes. OGeR is a newly developed open source database and tool platform for the web-based storage, distribution, visualization and comparison of prokaryotic genome data. The system automatically creates the dedicated relational database and web interface and imports an arbitrary number of genomes derived from standardized genome files. OGeR can be downloaded at http://oger.tu-bs.de.
| INTRODUCTION |
|---|
The development of cost-efficient DNA sequencing methods has caused an explosion of prokaryotic genome sequencing projects (1,2). The exploration of new genome sequences is strongly supported by the availability of related genomes that can be used as templates. Correspondingly, strain-specific properties can be traced back to differences in the genomes of compared strains. The comparison of the gene composition of several bacterial genomes from different strains of the same species revealed that only a fraction of genes is shared among the analyzed strains. This so-called core-genome is complemented by a fraction of distributed genes that are only present in some strains and absent in others (3). The supra- or pan-genome of a species is defined as the core-genome plus all distributed genes. It became clear that the pathogenicity of certain bacteria strongly depends on the fraction of distributed genes in the genome (4).
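The definitions above map directly onto set operations. The following is a minimal sketch, assuming each strain has already been reduced to a set of gene-family identifiers; the strain names and identifiers are made up for illustration and are not taken from Strepto-DB.

```python
def core_and_pan_genome(strain_genes):
    """Return (core, pan, distributed) gene-family sets.

    strain_genes maps a strain name to the set of gene-family identifiers
    found in that strain (hypothetical identifiers, for illustration only).
    """
    gene_sets = list(strain_genes.values())
    if not gene_sets:
        return set(), set(), set()
    core = set.intersection(*gene_sets)  # families present in every strain
    pan = set.union(*gene_sets)          # families present in at least one strain
    distributed = pan - core             # families present in some strains only
    return core, pan, distributed


# Toy data, not taken from Strepto-DB
strains = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2", "fam3", "fam5"},
}
core, pan, distributed = core_and_pan_genome(strains)
```

In this toy example the core-genome shrinks or stays constant as strains are added, while the pan-genome can only grow, which is the behaviour the pan-genome concept describes.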
Due to the medical impact of pathogenic Streptococcus pyogenes (GAS) and Streptococcus agalactiae (GBS) infections, several genome projects have focused on the elucidation of serotypic variants of these Gram-positive bacteria (5). Several streptococcal genomes are available at the NMPDR (6), a database that focuses on microbial pathogens. Moreover, comprehensive databases provide comparative analysis features for prokaryotic genomes. These include MicrobesOnline (7), IMG (8) and GenoList (9), amongst others. However, for S. agalactiae it was predicted that the available reservoir of distributed genes is so large that new genes will be discovered even after hundreds of genomes have been sequenced (10). Therefore, comparative analyses of GAS and GBS genomes require the incorporation of all available sequence data.
Sequencing projects usually require confidential data handling prior to publication of the results. For this purpose, local data storage and analysis are essential. Several software tools have recently been published that offer local solutions for the comparative analysis of prokaryotic genome data. PSAT (11) is a web tool that visualizes the conservation of gene order among a given set of organisms. Although PSAT supplies a very useful overview of the relatedness of different genomes, it does not offer the query functions of a typical genome database, i.e. direct gene and protein queries with detailed information on the results obtained. These features are provided, for example, by JCoast (12), a tool for the comparative analysis of prokaryotic genomes that is based on GenDB (13). However, JCoast is a local solution that does not support the distribution of data by a web server.
For this reason, we have developed Open Genome Resource (OGeR) as a generic web-accessible database and bioinformatics tool platform for the storage, visualization and comparative analysis of prokaryotic genome data. OGeR is designed to help biologists read and interpret genome files. The system is very flexible, as it supports the import of an arbitrary selection of prokaryotic genome DNA sequence flat files. After the initial installation, the system is generated automatically, so that updating to new genome releases is very simple. Thus, OGeR can aid annotation and controlled data distribution in sequencing projects that depend on confidential data handling.
In this article, the functionalities of Strepto-DB are introduced as an example application of OGeR. The database Strepto-DB provides an up-to-date resource for all GAS and GBS genomes that are currently publicly available, including unfinished WGS sequences. It supplies a convenient platform for the (pan-)genome analysis and interpretation of GAS and GBS. Strepto-DB was developed as part of the ERA-NET PathoGenoMics project that conducts a comprehensive comparative molecular analysis of GAS and GBS pathogenesis (http://www.pathogenomics-era.net).
| FEATURES OF STREPTO-DB |
|---|
Data content, exploration and visualization
The current Strepto-DB release 8.8 provides access to 13 GAS genomes, 8 GBS genomes and 7 plasmids. These comprise 41804 protein coding genes, including 902 unique genes for which no orthologs in any of the other strains could be detected (Table 1). To visualize the respective sizes of pan-genomes and core-genomes, Venn diagrams are provided as Supplementary Data.
The query options of the Strepto-DB web interface are summarized in Table 2. The database can be searched by gene and protein names, gene ontology (GO) terms and other functional annotation terms. Sequences can be searched either as strings and regular expressions or by BLAST. A genome viewer provides a scalable overview of the locus of the genes of interest on the chromosome. For each gene, Strepto-DB provides a gene entry and a corresponding protein entry that comprise functional annotation including GO terms and EC numbers, respectively. Furthermore, links to external data resources are provided. These include EMBL-Bank (14), UniProt (15), Integr8 (16), ExPASy (17), NCBI Gene and Protein (18), KEGG (19), BRENDA (20) and PRODORIC (21). For gene entries, the genomic context is visualized as a map in an interactive genome browser that centers on a gene when it is selected by a mouse click. The selected gene is marked in red. Below this genome map, the genome browser displays a frame plot of the GC content. The genome browser also displays the DNA sequence of the corresponding genome section with coding regions in color. At the bottom of the gene entry, the Genomic Data field provides the gene sequence in various formats and the option to download it in FASTA format.
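A GC content plot of the kind mentioned above is commonly computed over sliding windows. The sketch below illustrates that general idea only; the article does not state how the Strepto-DB genome browser computes its plot, and the function name, window size and step size are assumptions chosen for the example.

```python
def gc_content_windows(sequence, window=100, step=50):
    """Yield (start, gc_fraction) for sliding windows over a DNA sequence.

    window and step are illustrative defaults, not values from Strepto-DB.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    # Slide over all full-length windows; a sequence shorter than the
    # window is reported as a single, shorter window.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        if not chunk:
            break
        gc = chunk.count("G") + chunk.count("C")
        yield start, gc / len(chunk)


# Made-up sequence: a GC-rich stretch followed by an AT-rich stretch
profile = list(gc_content_windows("ATGCGCGCATATATAT", window=8, step=4))
```

Plotting the second element of each tuple against the first gives a simple GC profile along the sequence.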
Search for homologous proteins and intergenic region analysis
Strepto-DB allows the alignment of both coding and non-coding DNA sequences within the Streptococcus genomes of interest. Homologous proteins were pre-calculated by reciprocal BLAST searches. The proteome comparison query supplies an overview of the conservation of proteins between different strains. After the selection of a reference genome and one or more comparison genomes, this query returns lists of those proteins that are conserved between the selected strains. In addition, each protein entry provides a list of homologous proteins. On demand, the identified homologs are aligned with the MUSCLE alignment tool (22) and displayed with the Jalview visualization software (23). Furthermore, the genomic context of the various homologous genes can be displayed as genome maps. As an example, Figure 1 shows the genomic context of the cylE gene for β-hemolytic/cytolytic activity (24). The cylE gene is present in all sequenced strains of S. agalactiae but is absent from all S. pyogenes strains. The genome map shows differences in the annotation of the region of the cyl operon in S. agalactiae COH1, CJB111 and NEM316.
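The proteome comparison described above can be pictured as a lookup in a pre-calculated homolog table. The following is a minimal sketch under that assumption; the table layout, the function name and the genome and protein identifiers are hypothetical and do not reflect the actual OGeR database schema.

```python
def conserved_proteins(reference, comparisons, homologs):
    """List reference proteins that have a homolog in every comparison genome.

    homologs maps (genome, protein_id) to a set of (genome, protein_id)
    homologs; this layout is an assumption made for the example.
    """
    conserved = []
    ref_proteins = sorted(p for (g, p) in homologs if g == reference)
    for protein in ref_proteins:
        # Genomes in which this reference protein has at least one homolog
        hit_genomes = {g for (g, _) in homologs[(reference, protein)]}
        if all(c in hit_genomes for c in comparisons):
            conserved.append(protein)
    return conserved


# Toy homolog table with hypothetical identifiers
homologs = {
    ("GBS_1", "cylE"): {("GBS_2", "cylE_b")},
    ("GBS_1", "recA"): {("GBS_2", "recA_b"), ("GAS_1", "recA_c")},
    ("GBS_1", "orfX"): set(),
}
shared_all = conserved_proteins("GBS_1", ["GBS_2", "GAS_1"], homologs)
shared_gbs = conserved_proteins("GBS_1", ["GBS_2"], homologs)
```

In the toy table, the cylE entry is conserved only within the GBS genomes, mirroring the kind of result the cylE example in the text describes.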
In the intergenic regions, conserved DNA sequence motifs can function as regulator binding sites. Thus, an analysis of the intergenic DNA sequences might reveal information on the regulation of the respective downstream genes. Strepto-DB provides an intergenic region analysis that is composed of three tools. First, a BLAST search aligns the intergenic region DNA sequence of choice with the intergenic regions of the corresponding homologous genes of other Streptococcus strains. This similarity search can be started by a mouse click on the region of interest on the homologs' genome map. Second, selected intergenic regions can be analyzed for conserved sequence motifs with the MEME motif discovery tool (25). Third, each intergenic region entry includes a link to the Virtual Footprint analysis tool (21). Virtual Footprint uses position weight matrices from the PRODORIC database to predict transcription factor binding sites within the promoter region of a gene. Taken together, these methods provide supplementary evidence for potential regulator binding sites and generate hypotheses for experimental verification.
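To illustrate how a position weight matrix can be used to score candidate binding sites, the sketch below builds a generic log-odds matrix from aligned sites and scans a sequence with it. This is a textbook-style approach and not the Virtual Footprint algorithm; the example sites, pseudocount, uniform background and threshold are all assumptions made for illustration.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned binding sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (position, score) for windows scoring at or above threshold.

    The sequence is assumed to contain only the letters A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, score))
    return hits


sites = ["TTGACA", "TTGACT", "TTTACA", "TTGATA"]  # made-up aligned sites
pwm = build_pwm(sites)
hits = scan("GGCATTGACAGGC", pwm, threshold=5.0)
```

Each hit is a zero-based start position with its summed log-odds score; raising the threshold trades sensitivity for fewer false positives.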
| THE OPEN GENOME RESOURCE (OGeR) PLATFORM FOR THE COMPARATIVE ANALYSIS OF PROKARYOTIC GENOMES |
|---|
OGeR is generically applicable for the storage and comparison of related prokaryotic genomes. Strepto-DB was set up and is maintained with OGeR and therefore serves as an example of its functionalities: the Strepto-DB database and all features of the web interface were compiled automatically.
System architecture
OGeR consists of three components: a relational database, a setup that processes input data and imports them into the database, and a web interface that queries the database (Figure 2). By default, genome sequences are downloaded from the EBI Genome Reviews database. GenBank files and other local data sources can also be loaded. Additional supplementary data are automatically downloaded from the Gene Ontology, EBI and NCBI websites. The setup creates the database schema and processes sequences and other input files. This procedure includes the extraction of gene and protein annotations and the detection of homologous proteins. Finally, sequence data and corresponding annotations are imported into the database. The web interface presents data stored in the database and performs multiple alignments on demand. It links to various external databases, provided the corresponding database identifiers were included in the input files.
Implementation and local installation
OGeR is implemented as a PHP application that uses an Apache web server and operates on a PostgreSQL database. Local installation requires a Linux operating system and the installation of the corresponding PHP and Apache software packages. For the creation of a new OGeR-based database, the OGeR setup procedure requests the required information and imports the desired genomes into the system. Data download is performed by the wget program. Local genome sequences can be imported in EMBL or GenBank format. Subsequently, homologous proteins are determined by an all-against-all BLAST search (26) of the proteins that are annotated in the imported flat files. Because the running time of this all-against-all search grows quadratically with the number of proteins, this step limits the number of genomes that can be imported in a reasonable amount of time on given computing hardware. The BLAST results are evaluated to detect homologous proteins. Here, homology is defined as a double reciprocal BLAST hit with a given maximal E-value. For Strepto-DB, an E-value cutoff of 1e-5 was chosen. Finally, the setup finishes with the creation of a new web interface for the database. Detailed installation instructions facilitate the installation and setup procedure.
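The homology criterion can be sketched as a filter over BLAST results. The example below assumes the standard 12-column tabular BLAST output (query, subject, ..., E-value in column 11, bit score in column 12) and treats two proteins as homologous when each hits the other with an E-value at or below the cutoff. Whether OGeR additionally requires best hits is not stated in the text, so this sketch does not; the function names and protein identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # cutoff reported for Strepto-DB


def parse_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Collect (query, subject) pairs from 12-column tabular BLAST output."""
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= cutoff:
            hits.add((query, subject))
    return hits


def reciprocal_pairs(hits):
    """Keep pairs found in both search directions, each reported once."""
    return {tuple(sorted(pair)) for pair in hits if (pair[1], pair[0]) in hits}


# Made-up BLAST lines: a1/b1 hit each other strongly, a2/b2 only one way
example = "\n".join([
    "a1\tb1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "b1\ta1\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550",
    "a2\tb2\t35.0\t80\t52\t0\t1\t80\t1\t80\t2e-07\t48",
    "b2\ta2\t35.0\t80\t52\t0\t1\t80\t1\t80\t0.003\t33",
])
pairs = reciprocal_pairs(parse_hits(example))
```

Only the first pair survives, because the reverse hit of the second pair exceeds the cutoff. Holding the hits in a set keeps the reciprocity check to one lookup per hit, which matters when the all-against-all search produces a very large result table.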
The OGeR web interface uses CGView (27) for the genome viewer. Multiple alignments are performed with MUSCLE (22) and depicted with Jalview (23). As CGView and Jalview are implemented as Java applets, Java must be installed for the client web browser. However, multiple alignments can alternatively be shown in a simple view that does not depend on Java.
| CONCLUDING REMARKS |
|---|
We have implemented a simple integrated database and bioinformatics platform named OGeR for the comparative analysis of related genomes. This platform was subsequently used to establish Strepto-DB for the comparative genomic analysis of 21 Streptococcus genomes. Conserved and distributed genes were deduced for the analyzed strains and used for core- and pan-genome prediction.
| FUNDING |
|---|
German Bundesministerium für Bildung und Forschung (ERA-NET grant 0313936C to J.K. and R.M.) and Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 578 to I.B. and R.M.). Funding for open access charges: Deutsche Forschungsgemeinschaft (Sonderforschungsbereich 578).
Conflict of interest statement. None declared.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Bernd Hoppe for excellent technical assistance and financial management.
| REFERENCES |
|---|
1. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. (2008) 36:D475–479.
2. Medini D, Serruto D, Parkhill J, Relman DA, Donati C, Moxon R, Falkow S, Rappuoli R. Microbiology in the post-genomic era. Nat. Rev. Micro. (2008) 6:419–430.
3. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr. Opin. Genet. Dev. (2005) 15:589–594.
4. Ehrlich GD, Hiller NL, Hu FZ. What makes pathogens pathogenic. Genome Biol. (2008) 9:225.
5. Lefébure T, Stanhope MJ. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol. (2007) 8:R71.
6. McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards RA, Gerdes S, Hwang K, Kubal M, et al. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation. Nucleic Acids Res. (2007) 35:D347–353.
7. Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL, Arkin AP. The MicrobesOnline Web site for comparative genomics. Genome Res. (2005) 15:1015–1022.
8. Markowitz VM, Szeto E, Palaniappan K, Grechkin Y, Chu K, Chen IMA, Dubchak I, Anderson I, Lykidis A, Mavromatis K, et al. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. (2008) 36:D528–533.
9. Lechat P, Hummel L, Rousseau S, Moszer I. GenoList: an integrated environment for comparative analysis of microbial genomes. Nucleic Acids Res. (2008) 36:D469–474.
10. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc. Natl Acad. Sci. USA (2005) 102:13950–13955.
11. Fong C, Rohmer L, Radey M, Wasnick M, Brittnacher M. PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes. BMC Bioinformatics (2008) 9:170.
12. Richter M, Lombardot T, Kostadinov I, Kottmann R, Duhaime M, Peplies J, Glockner F. JCoast - a biologist-centric software tool for data mining and comparison of prokaryotic (meta)genomes. BMC Bioinformatics (2008) 9:177.
13. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, et al. GenDB–an open source genome annotation system for prokaryote genomes. Nucleic Acids Res. (2003) 31:2187–2195.
14. Cochrane G, Akhtar R, Aldebert P, Althorpe N, Baldwin A, Bates K, Bhattacharyya S, Bonfield J, Bower L, Browne P, et al. Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database. Nucleic Acids Res. (2008) 36:D5–12.
15. UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. (2008) 36:D190–195.
16. Mulder NJ, Kersey P, Pruess M, Apweiler R. In silico characterization of proteins: UniProt, InterPro and Integr8. Mol. Biotechnol. (2008) 38:165–177.
17. Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A. ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. (2003) 31:3784–3788.
18. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. (2008) 36:D13–21.
19. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. (2008) 36:D480–484.
20. Barthelmes J, Ebeling C, Chang A, Schomburg I, Schomburg D. BRENDA, AMENDA and FRENDA: the enzyme information system in 2007. Nucleic Acids Res. (2007) 35:D511–514.
21. Münch R, Hiller K, Grote A, Scheer M, Klein J, Schobert M, Jahn D. Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes. Bioinformatics (2005) 21:4187–4189.
22. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (2004) 5:113.
23. Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics (2004) 20:426–427.
24. Tettelin H, Masignani V, Cieslewicz MJ, Eisen JA, Peterson S, Wessels MR, Paulsen IT, Nelson KE, Margarit I, Read TD, et al. Complete genome sequence and comparative genomic analysis of an emerging human pathogen, serotype V Streptococcus agalactiae. Proc. Natl Acad. Sci. USA (2002) 99:12391–12396.
25. Bailey TL, Williams N, Misleh C, Li WW. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. (2006) 34:W369–373.
26. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
27. Grant JR, Stothard P. The CGView Server: a comparative genomics tool for circular genomes. Nucleic Acids Res. (2008) 36:W181–184.