

Nucleic Acids Research Advance Access originally published online on October 21, 2008
Nucleic Acids Research 2009 37(Database issue):D408-D411; doi:10.1093/nar/gkn749

Nucleic Acids Research, 2009, Vol. 37, Database issue D408-D411
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article appears in the following Nucleic Acids Research issue: Database issue


PEDANT covers all complete RefSeq genomes

Mathias C. Walter1, Thomas Rattei2, Roland Arnold2, Ulrich Güldener1, Martin Münsterkötter1, Karamfilka Nenova1, Gabi Kastenmüller1, Patrick Tischler2, Andreas Wölling3, Andreas Volz3, Norbert Pongratz3, Ralf Jost3, Hans-Werner Mewes1,2 and Dmitrij Frishman1,2,*

1Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Ingolstädter Landstrasse 1, 85764 Neuherberg, 2Department of Genome-oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Am Forum 1, 85350 Freising and 3Biomax Informatics AG, Lochhamer Strasse 9, 82152 Martinsried, Germany

*To whom correspondence should be addressed. Tel: +49 8161 712134; Fax: +49 8161 712186; Email: d.frishman{at}wzw.tum.de

Received September 15, 2008. Accepted October 3, 2008.


    ABSTRACT
 
The PEDANT genome database provides exhaustive annotation of nearly 3000 publicly available eukaryotic, eubacterial, archaeal and viral genomes with more than 4.5 million proteins by a broad set of bioinformatics algorithms. In particular, all completely sequenced genomes from the NCBI's Reference Sequence collection (RefSeq) are covered. The PEDANT processing pipeline has been sped up by an order of magnitude through the utilization of precalculated similarity information stored in the similarity matrix of proteins (SIMAP) database, making it possible to process newly sequenced genomes immediately as they become available. PEDANT is freely accessible to academic users at http://pedant.gsf.de. For programmatic access, Web Services are available at http://pedant.gsf.de/webservices.jsp.


    INTRODUCTION
 
Since its first announcement in 1997 (1), the PEDANT genome database has steadily grown to become one of the most comprehensive collections of automatically annotated genomes. As of September 2008, PEDANT covers all complete genomes as provided by the RefSeq (2) database. In total, 861 completely sequenced genomes from all three domains of life as well as 2081 complete viral genomes are available (Table 1). Here, we define a ‘complete genome’ as a genome whose chromosomal datasets exist as RefSeq records or Ensembl (3) entries and for which genes have been predicted. For those eukaryotic genomes (currently 33) that are available from both RefSeq and Ensembl, we provide the annotation of both versions. This results in a total of 2975 genome databases with 4.5 million proteins occupying 3.1 TB of storage. All PEDANT databases are continuously updated. For example, assignments of genes to the MIPS Functional Catalog (FunCat) (4) have recently been recalculated using the new 2.1 version of FunCat (http://mips.gsf.de/projects/funcat).


 
Table 1. The number of species from major taxonomic groups contained in the PEDANT genome database as of September 2008

 
The current version of the software driving the PEDANT web site, which we refer to as PEDANT3, represents an industry-strength Java workbench that supports large-scale grid computing and utilizes a work-flow-based processing engine (D. Frishman et al., manuscript in preparation). Dozens of custom workflows are available: generic workflows for eukaryotic, prokaryotic and viral genomes as well as more specialized workflows supporting specific genome groups (gram-positive versus gram-negative bacteria, fungi, plants), data types (EST collections, raw contigs without any predicted Open Reading Frames (ORFs), protein-only datasets, etc.) and bioinformatics methods (e.g. alternative gene prediction techniques). Advanced protein and DNA viewers implemented using server-side Java provide graphical representation of protein annotation features as well as genetic elements on chromosomes.


    NEW FEATURES AND IMPROVEMENTS
 
Genome import pipeline
Given the quick pace of genome sequencing, keeping track of currently available data and obtaining them from source databases for local processing is a time-consuming and technically challenging task. To organize a more efficient import of genomes into PEDANT from various sources, we set up a specialized processing pipeline (Figure 1). In the first step, we acquire a list of available genomes from each genome resource. We then determine the Entrez genome project ID by querying the NCBI databases (5) for genome project information through the Entrez Programming Utilities (eUtils, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html). If available, we use the genome project ID as the primary key for a given genome; otherwise the NCBI taxonomy ID is used. The advantage of genome project IDs is that they are stable, in contrast to taxonomy IDs, which may change, especially for the species/strains of newly sequenced genomes. The genome IDs are then stored in our local meta-database, which also serves as the basis for generating the full genome list for the PEDANT web page.
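The key-selection rule described above can be illustrated with a minimal Python sketch. It is not the PEDANT implementation: the `GenomeRecord` class, its field names and the `project:`/`taxon:` key prefixes are assumptions introduced here for illustration only, and the lookup of the project ID via eUtils is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomeRecord:
    """Hypothetical meta-database entry; the field names are illustrative."""
    name: str
    taxonomy_id: int
    project_id: Optional[int] = None  # Entrez genome project ID, if one was found


def primary_key(record: GenomeRecord) -> str:
    """Prefer the stable genome project ID; fall back to the taxonomy ID."""
    if record.project_id is not None:
        return f"project:{record.project_id}"
    return f"taxon:{record.taxonomy_id}"


def build_meta_database(records):
    """Index records by primary key, rejecting duplicate keys."""
    meta = {}
    for record in records:
        key = primary_key(record)
        if key in meta:
            raise ValueError(f"duplicate primary key {key!r}")
        meta[key] = record
    return meta
```

Prefixing the key with its origin keeps project IDs and taxonomy IDs from colliding when both are plain integers; whether the actual meta-database does something comparable is not stated in the text.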


 
Figure 1. UML activity model of the PEDANT genome import and processing pipeline. Symbols according to the UML 2.0 specification (http://www.uml.org) for activity diagrams.

 
Data retrieval procedures have been adapted to several different sources of genome information. For downloading RefSeq genomes, we use a patched version (retry on connection timeouts, improved error handling) of the NCBI ToolBox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox) program. For Ensembl genomes, we install the provided MySQL database dumps (ftp://ftp.ensembl.org/pub/current_mysql) on our local MySQL server and extract the genomic data directly.

Retrieval of genomes not contained in RefSeq and Ensembl can only be done in a semi-automatic fashion with manual verification. In many cases, RefSeq lists the sequencing centers involved, from which the original data can be obtained. Another useful resource for locating genomes is the Genomes OnLine Database (GOLD) (6). We then retrieve the assembly and annotation data directly from the sequencing centers and check them for missing sequences, nonunique identifiers and unusual formatting. If the gene annotation data are missing or only available in a draft version (especially for fungal genomes), gene predictions are carried out or existing models are improved, depending on the annotation project (7,8).
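As an illustration of the kind of consistency checks mentioned above, the following Python sketch scans FASTA-formatted input for empty sequences, repeated identifiers and unexpected characters. It assumes FASTA input and a nucleotide alphabet of `ACGTN`; the function names and the report layout are invented for this example, and the actual checks used for PEDANT are not described in detail.

```python
from collections import Counter


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            # The identifier is the first whitespace-delimited token of the header.
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


def check_records(records, allowed="ACGTN"):
    """Report empty sequences, repeated identifiers and unexpected characters."""
    counts = Counter(identifier for identifier, _ in records)
    allowed_set = set(allowed)
    return {
        "missing_sequence": [i for i, seq in records if not seq],
        "nonunique_ids": sorted(i for i, n in counts.items() if n > 1),
        "unusual_characters": [
            i for i, seq in records if set(seq.upper()) - allowed_set
        ],
    }
```

A report with three empty lists would indicate that none of these particular problems was found; manual verification, as described in the text, would still be needed for anything the checks do not cover.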

Integration of PEDANT and SIMAP
Calculating and updating protein similarities and domain assignments is the most time-consuming and computationally expensive task in our genome annotation pipeline. Previously, BLASTP (9) and InterProScan (10) searches required up to 80% of the total CPU time of the PEDANT genome annotation workflow. To cope with the large number of newly sequenced genomes and to keep the data in PEDANT up to date, a radical reduction of this computational effort has become necessary.

The most obvious answer to this problem is to utilize high-performance computing facilities and avoid redundant calculations. The similarity matrix of proteins (SIMAP) (11) provides precalculated and up-to-date all-against-all alignments as well as domain assignments for essentially all publicly available protein sequences (21 million as of this writing). Our recent efforts to integrate PEDANT with SIMAP made it possible to avoid computationally intensive BLASTP and InterProScan runs and have led to a dramatic acceleration of the genome annotation work. Compared with de novo calculations, retrieving similarities and domains from the SIMAP database reduces the required CPU time by factors between 5 and 60. A typical bacterial genome with 3000 predicted genes can be processed at MIPS in <40 min using 60 Sun Grid Engine (SGE, http://gridengine.sunsource.net) nodes.

To generate and obtain these data, we have developed a computational workflow that coordinates the tasks between PEDANT and SIMAP. The first step in this workflow involves the import and maintenance of genome sequences and primary annotation provided by the respective source databases in PEDANT. In a subsequent step, SIMAP automatically retrieves protein and sequence data from PEDANT. If novel protein sequences previously unknown to SIMAP have been imported, their similarities to all other protein sequences and their domain architecture are calculated in SIMAP by utilizing large public resource computing facilities (12). As soon as the precalculated data are completely available in SIMAP, a notification event is triggered to start the SIMAP-based methods in PEDANT. These methods have been implemented as remote Enterprise Java Bean (EJB) invocations, which allow for rapid and efficient retrieval of data from SIMAP. One method designed to replace BLASTP retrieves homologs from a composite nonredundant database that includes PDB, UniProt/Swissprot, UniProt/TrEMBL, as well as all protein sequences already present in PEDANT. The second method which serves as a substitute for InterProScan retrieves precalculated protein domain assignments considering all InterPro member databases according to the InterPro XML format specification, except for the TMHMM (13), SignalP (14) and TargetP (15) methods which are run by PEDANT itself considering the appropriate genomic context (i.e. gram stain for signal peptides).
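The coordination logic described above (identify sequences unknown to SIMAP, wait until they have been calculated, then trigger the SIMAP-based methods) can be sketched in Python as follows. This is a toy, in-memory model under stated assumptions: the `SimapCoordinator` class, its method names and the callback standing in for the notification event are all hypothetical, and the real system uses remote EJB invocations rather than local calls.

```python
def split_by_simap_status(genome_proteins, simap_known):
    """Partition a genome's protein sequences into precalculated and novel ones."""
    known = [p for p in genome_proteins if p in simap_known]
    novel = [p for p in genome_proteins if p not in simap_known]
    return known, novel


class SimapCoordinator:
    """Toy, in-memory stand-in for the PEDANT/SIMAP hand-off (hypothetical)."""

    def __init__(self, simap_known, on_ready):
        self.simap_known = set(simap_known)
        self.on_ready = on_ready  # callback standing in for the notification event
        self.pending = {}         # genome name -> sequences still to be calculated

    def import_genome(self, genome, proteins):
        """Register a genome; notify at once if nothing needs calculating."""
        _, novel = split_by_simap_status(proteins, self.simap_known)
        self.pending[genome] = set(novel)
        self._notify_if_complete(genome)

    def mark_calculated(self, sequence):
        """Record that one sequence is finished, then recheck all waiting genomes."""
        self.simap_known.add(sequence)
        for genome in list(self.pending):
            self.pending[genome].discard(sequence)
            self._notify_if_complete(genome)

    def _notify_if_complete(self, genome):
        if not self.pending[genome]:
            del self.pending[genome]
            self.on_ready(genome)
```

Tracking pending sequences per genome means that a genome consisting only of already known proteins is released immediately, which is one plausible reading of why precalculation pays off for closely related strains.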

Web Services
The comprehensive collection of nearly 3000 extensively annotated genomes provides a unique foundation for data mining and large-scale investigation of genome properties. While information on a limited number of genes of interest can be conveniently explored using the PEDANT web interface, any computational analysis of genomes at large necessitates local access to data. However, the large amount of annotation data computed for 4.5 million PEDANT proteins makes systematic dissemination of database dumps or flat files impractical (although we do provide them upon request). Instead, we offer simple, transparent and programming language-independent remote access based on Web Service technology. This service has been implemented as a document-style, SOAP-based Web Service (see http://www.w3.org/TR/soap12-part0). It can easily be integrated into users' own applications, since libraries for accessing this kind of service exist for most programming languages. The functions provided by the Web Service are described in a Web Services Description Language (WSDL) file (see http://www.w3.org/TR/wsdl), which allows for the automatic generation of a client program, e.g. by using the Perl SOAP::Lite (http://www.soaplite.com) or the Java Axis (http://ws.apache.org/axis/java/index.html) libraries.

The PEDANT3 WSDL file can be found at http://mips.gsf.de/webservice/pedant3/Pedant3AccessBeanService/Pedant3AccessWebService?wsdl. At present the service provides the following query types:

  1. return the list of organisms processed in PEDANT,
  2. return the computational methods used to annotate a particular organism,
  3. return a result overview (e.g. which functional category appears how many times) for a certain method in a certain organism,
  4. return the genetic elements of an organism,
  5. return the result of a certain method for a single genetic element or for a whole genome ordered by its genetic elements.
For the latter query type it is possible to search in both directions: the service can return all genetic elements having a certain property (e.g. a certain functional attribute), or all properties of a certain genetic element (e.g. all functional attributes of a protein). Furthermore, in the former case it is possible to query several genomes at once. For BLASTP- and SIMAP-based methods, it is possible to restrict the results by an E-value cutoff. A detailed overview of the Web Service functionality can be found at http://pedant.gsf.de/webservices.jsp.
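To make the notion of a document-style SOAP request more concrete, the following Python sketch assembles a SOAP 1.2 envelope using only the standard library. The envelope namespace is the one defined by the SOAP 1.2 specification; everything else is an assumption made for illustration: the service namespace `urn:example:pedant3`, the operation name `getResults` and the parameter names are invented and do not come from the PEDANT3 WSDL, which remains the authoritative description of the real operations.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SERVICE_NS = "urn:example:pedant3"  # placeholder; the real namespace is in the WSDL


def build_request(operation, parameters):
    """Serialize a document-style SOAP 1.2 request as bytes."""
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # In document style, the body carries a single XML document for the operation.
    payload = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in parameters.items():
        ET.SubElement(payload, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# "getResults", "organism", "method" and "evalueCutoff" are invented names.
request = build_request(
    "getResults",
    {"organism": "example_genome", "method": "blastp", "evalueCutoff": 1e-10},
)
```

In practice a client generated from the WSDL, for example with the SOAP::Lite or Axis libraries mentioned above, would construct and send such messages automatically, so hand-built envelopes are mainly useful for understanding or debugging the exchange.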

The PEDANT3 Web Service encapsulates the complicated internal data structures of the PEDANT database and returns the results in a generic format that consists of key-value pairs of properties assigned to a given genetic element. This generic format assures that the end-user client software will not have to be reprogrammed if new methods are introduced into the PEDANT system.
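A client working with such a generic key-value format might look roughly like the Python sketch below. The record layout (an `element` name plus a `properties` dictionary), the keys `method`, `category`, `hit` and `evalue`, and all values are invented for illustration; the actual keys depend on the annotation method and are not specified here. The sketch shows both query directions and an E-value cutoff applied on the client side, whereas the real service applies such restrictions remotely.

```python
def with_property(results, key, value):
    """Genetic elements that carry a given property (property -> elements)."""
    return sorted({r["element"] for r in results
                   if r["properties"].get(key) == value})


def properties_of(results, element):
    """All property sets recorded for one genetic element (element -> properties)."""
    return [r["properties"] for r in results if r["element"] == element]


def below_evalue(results, cutoff):
    """Keep only hits whose E-value does not exceed the cutoff."""
    return [r for r in results
            if "evalue" in r["properties"]
            and float(r["properties"]["evalue"]) <= cutoff]


# Invented example records; real keys depend on the annotation method.
results = [
    {"element": "gene_1", "properties": {"method": "funcat", "category": "01"}},
    {"element": "gene_2", "properties": {"method": "funcat", "category": "01"}},
    {"element": "gene_1",
     "properties": {"method": "blastp", "hit": "P1", "evalue": "1e-30"}},
    {"element": "gene_1",
     "properties": {"method": "blastp", "hit": "P2", "evalue": "0.5"}},
]
```

Because the helper functions only look up keys by name, records produced by a newly introduced method would pass through them unchanged, which mirrors the stated goal that client software need not be reprogrammed when methods are added.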


    DISCUSSION
 
There is no fixed release cycle for PEDANT. As soon as new genomes become available at RefSeq or any other listed genome resource, they will be imported, processed and made available via the web server. However, since SIMAP has a monthly release cycle, the computation of a genome by PEDANT is typically finished roughly 1 month after its import. Since the PEDANT3 software is now stable and all genomes from the previous version, PEDANT2, have been either migrated or reimported into PEDANT3, we took PEDANT2 and its Web Service offline. We also discarded all incomplete genomes previously available via PEDANT2 because the new high-throughput technologies now allow genome sequencing projects to be finished within a very short time frame.

In the future, genomes from further resources [i.e. UCSC Genome Browser Database (16), Vega (17)] will be imported and previously imported genomes will be kept up to date. We are also in the process of supplementing the PEDANT web site with multiple new features, including display of the genome project information [RefSeq status, source sequencing centers, whole-genome shotgun (WGS) (18) sequencing coverage, number of records, etc.], taxonomic selection of genomes and improved search capabilities. A cross-genome index for precomputed annotations is nearly finished and will be available online shortly. This will allow for comparison of genomes based on their annotated features, such as domain content, functional categories and structural folds.


    FUNDING
 
Funding for open access charge: Helmholtz Gemeinschaft.

Conflict of interest statement. None declared.


    ACKNOWLEDGEMENTS
 
We are grateful to Volker Stümpflen for assistance with the Web Services.


    REFERENCES
 

  1. Frishman D, Mewes H.-W. PEDANTic genome analysis. Trends Genet. (1997) 13:415–416.

  2. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. (2007) 35:D61–D65.

  3. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. Ensembl 2007. Nucleic Acids Res. (2007) 35:D610–D617.

  4. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. (2004) 32:5539–5545.

  5. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. (2008) 36:D13–D21.

  6. Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. (2008) 36:D475–D479.

  7. Güldener U, Mannhaupt G, Münsterkötter M, Haase D, Oesterheld M, Stümpflen V, Mewes H.-W, Adam G. FGDB: a comprehensive fungal genome resource on the plant pathogen Fusarium graminearum. Nucleic Acids Res. (2006) 34:D456–D458.

  8. Kämper J, Kahmann R, Bölker M, Ma L.-J, Brefort T, Saville BJ, Banuett F, Kronstad JW, Gold SE, Müller O, et al. Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature (2006) 444:97–101.

  9. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

  10. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. (2005) 33:W116–W120.

  11. Rattei T, Tischler P, Arnold R, Hamberger F, Krebs J, Krumsiek J, Wachinger B, Stümpflen V, Mewes H.-W. SIMAP–structuring the network of protein similarities. Nucleic Acids Res. (2008) 36:D289–D292.

  12. Rattei T, Walter M, Arnold R, Anderson D, Mewes W. Using public resource computing and systematic pre-calculation for large scale sequence analysis. Lect. Notes Bioinform. (2007) 4360:11–18.

  13. Kahsay RY, Gao G, Liao L. An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics (2005) 21:1853–1858.

  14. Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. (2004) 340:783–795.

  15. Emanuelsson O, Nielsen H, Brunak S, von Heijne G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. (2000) 300:1005–1016.

  16. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. (2008) 36:D773–D779.

  17. Wilming LG, Gilbert JGR, Howe K, Trevanion S, Hubbard T, Harrow JL. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. (2008) 36:D753–D760.

  18. Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. (1979) 6:2601–2610.

