© 1996 Oxford University Press 41-45

NRSub: a non-redundant database for Bacillus subtilis

Guy Perrière*, Ivan Moszer 1 and Takashi Gojobori 2

Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard-Lyon 1, 43 boulevard du 11 Novembre 1918, 69622 Villeurbanne Cedex, France, 1 Unité de Régulation de l'Expression Génétique, Institut Pasteur, 28 rue du Docteur Roux, 75724 Paris Cedex 15, France and 2 Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka-ken 411, Japan

Received August 21, 1995; Revised and Accepted October 17, 1995

ABSTRACT

In the context of the international project aimed at sequencing the whole genome of Bacillus subtilis we have developed a non-redundant, fully annotated database of sequences from this organism. Starting from the B.subtilis sequences available in the EMBL, GenBank and DDBJ collections we have removed all encountered duplications and then added extra annotations to the sequences (e.g. accession numbers for the genes, locations on the genetic map, codon usage, etc.). We have also added cross-references to the EMBL, MEDLINE, SWISS-PROT and ENZYME data banks. The present system results from the merging of the NRSub and SubtiList databases and the sequence contigs used in the two systems are identical. NRSub is distributed as a flatfile in EMBL format (which is supported by most sequence analysis software packages) and as an ACNUC database, while SubtiList is distributed as a relational database under 4th Dimension. The data can be accessed through two dedicated World Wide Web servers located in France and Japan.

INTRODUCTION

The Bacillus subtilis genome sequencing project started five years ago ( 1 ) and presently B.subtilis is the most extensively studied Gram-positive bacterium. About 900 genetic loci have been localized on its genetic map ( 2 ) and a long-range physical map has been published ( 3 ). The last reports from the Japanese and the European groups involved in the B.subtilis sequencing project were published at the beginning of 1995 ( 4 , 5 ). Precise DNA regions flanked by well-identified genetic markers have been assigned to 25 laboratories from 11 different countries. Among them, seven Japanese laboratories are in charge of sequencing a 1.3 Mb region. Up to now they have cloned ~1.1 Mb and have already sequenced ~800 kb. With a sequencing rate of >50 kb/year/group they expect to finish the region by mid-1996. For their part the 18 European sequencing groups have to each sequence at least 20 kb/year and have already completed ~1.1 Mb of sequence. They will attempt to accelerate the overall rate of sequencing, with the goal of determining the whole genome sequence of B.subtilis in the course of 1997.

Until the beginning of 1994 the only sequence compilation available for B.subtilis was a data library containing all published proteins ( 6 ). In 1994 the first release of NRSub was developed ( 7 ). Simultaneously a group at the Pasteur Institute had also developed its own system, SubtiList ( 8 ), which runs on Macintosh computers under the database management system 4th Dimension. Since then the content of the two databases has been merged and the contigs distributed in NRSub and SubtiList are now identical. Only a few differences persist between the annotations of the two databases. For instance, NRSub integrates data on codon usage and cross-references with MEDLINE. New releases of NRSub/SubtiList include a cleaner sequence data set; sequences have been renamed using a nomenclature based on their position on the chromosome and we have defined accession numbers for the genes. Moreover, it is now possible to access the database through two World Wide Web (WWW) servers in Japan and France.

DATABASE CONTENT

The sequence contigs are built from the B.subtilis 168 chromosomal sequences available in the EMBL ( 9 ), GenBank ( 10 ) and DDBJ ( 11 ) data banks by removing all the redundancy. Sequences from strains other than 168 and derivatives have been discarded. Release 5 of NRSub (July 1995) contains a total of 239 sequence contigs created using 577 EMBL/GenBank/DDBJ entries. These sequences total 1 408 645 bp, representing ~33.8% of the entire B.subtilis 168 chromosome, which consists of ~4165 kb ( 3 ).

To spread the information widely we use the EMBL format for NRSub entries (Fig. 1 ). This format is recognized by many programs and retrieval systems and is the standard for all EBI databases ( 9 ). Many additions and corrections have been made to the original annotations. First, new identifiers are given to each contig (ID field). They are based on the format SLxxx_z, where xxx corresponds to the position of the sequence on the chromosome in degrees and z is a rank number used if more than one sequence is found at the same position. So that other databases can establish cross-references with NRSub we define our own accession numbers for the contigs (AC field). The format for accession numbers is BSxxxx. When two (or more) pre-existing contigs are merged into a larger one we combine their accession numbers and a new primary accession number is assigned. The first line of the DT field gives the date of creation of the oldest EMBL/GenBank/DDBJ entry used to build the contig and the second line the date of the most recent modification. The DE field is completely rewritten and contains the name, the accession number, the position and the orientation of all the sequences used to build the contig (including duplicates). In the KW field we put together all the keywords associated with the original sequences. We remove from this field the keywords duplicated in the `/gene' and `/product' qualifiers of the features. We use the ACNUC format ( 12 ) to index NRSub and this database system creates its own keywords with the content of these qualifiers. In the case of composite contigs we add `composite' as a keyword. Bibliographic references for the sequences used to build the contigs are merged (RA, RT and RL fields). We also systematically add cross-references to the bibliographic data bank MEDLINE (RX field).
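The identifier conventions described above can be illustrated with a minimal Python sketch. It is not part of NRSub or ACNUC. The digit counts (three for xxx, four for xxxx), the assumption that the rank z is a decimal number and the function names are inferred for illustration only.

```python
import re
from typing import NamedTuple, Optional

# Patterns inferred from the text: SLxxx_z contig identifiers and BSxxxx
# contig accession numbers. Digit counts are an assumption (one per "x").
CONTIG_ID = re.compile(r"SL(\d{3})_(\d+)")
CONTIG_AC = re.compile(r"BS(\d{4})")


class ContigId(NamedTuple):
    degrees: int  # position of the contig on the chromosome, in degrees
    rank: int     # rank among contigs found at the same position


def parse_contig_id(identifier: str) -> Optional[ContigId]:
    """Split an SLxxx_z identifier into position and rank, or return None."""
    match = CONTIG_ID.fullmatch(identifier)
    if match is None:
        return None
    degrees, rank = int(match.group(1)), int(match.group(2))
    if degrees > 360:  # a circular map cannot exceed 360 degrees
        return None
    return ContigId(degrees, rank)


def is_contig_accession(accession: str) -> bool:
    """Check whether a string looks like a BSxxxx contig accession number."""
    return CONTIG_AC.fullmatch(accession) is not None
```

For example, under these assumptions `parse_contig_id("SL010_1")` yields a position of 10 degrees and a rank of 1, while a string such as `"SL999_1"` is rejected.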


Figure 1 . Example of an NRSub entry. New qualifiers have been defined so as to add information on gene accession number (`/acnum'), codon usage (`/CAI') and cross-references with other collections (`/xref'). Note that the bibliographic references, the features and the sequence are shown in part only.

The FT field required much modification. First, for each CDS or structural RNA molecule we add an accession number under the `/acnum' qualifier. BG1xxxx accession numbers are used for the CDS and BG0xxxx accession numbers are used for the structural RNAs. Due to this introduction of accession numbers for the genes it is possible to cross-reference the SWISS-PROT data bank ( 13 ) with NRSub (DR field in SWISS-PROT entries). When the feature is a CDS we add a codon adaptation index (CAI) ( 14 ) value under the `/CAI' qualifier. When information on locus name is available we add a `/gene' qualifier for the CDS or the structural RNA described. The gene names used are taken from the listing of Anagnostopoulos et al. ( 2 ) or from the original publication. In some cases we have corrected the names when they were inconsistent with Demerec's nomenclature ( 15 ). To name open reading frames (ORFs) we have used a nomenclature similar to that defined by K.Rudd for Escherichia coli ( 16 ): as there is no gene name in B.subtilis that begins with y, we have used this letter as the first letter in naming each ORF. The second letter is assigned according to the group in charge of sequencing the region of the chromosome in which the ORF is located. In the case of an unmapped gene the letter z is used. The third letter is freely chosen by the group sequencing the region. Preferably each different letter should correspond to a different operon. In the case of genes obtained by piecemeal sequencing the letters x or y are used. The fourth and last letter is always a capital letter. This letter is also freely chosen by the group sequencing the region. Preferably genes in the same operon should be ordered using the alphabet from A to Z. If alternative names are available for a given gene they are indicated under the `/alt_name' qualifier. In the case of protein genes we add a `/xref' qualifier for the cross-reference pointing to the corresponding entry in SWISS-PROT. 
We rewrite or complete the `/product' qualifier using data from SWISS-PROT. In the case of ORFs we use `hypothetical protein' for the product associated with these putative genes. When an encoded protein is an enzyme we add its EC number in an `/EC_number' qualifier. The EC numbers used are taken from the ENZYME collection ( 17 ), rather than from those provided by EMBL, GenBank or DDBJ. For each CDS or structural RNA gene we add, in a `/map' qualifier, its location on the genetic map of B.subtilis 168 when available.
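The gene accession numbers and the ORF naming rules described above can be sketched in Python as follows. This is an illustrative reading of the rules, not code distributed with NRSub. The four-digit length of xxxx and the function names are assumptions.

```python
import re

# BG1xxxx is used for CDS and BG0xxxx for structural RNAs (four digits assumed).
GENE_AC = re.compile(r"BG([01])(\d{4})")
# ORF names: "y", a letter for the sequencing group, a freely chosen
# lowercase letter, then a capital letter.
ORF_NAME = re.compile(r"y[a-z][a-z][A-Z]")


def gene_accession_kind(accession: str) -> str:
    """Return 'CDS', 'structural RNA' or 'unknown' for a gene accession number."""
    match = GENE_AC.fullmatch(accession)
    if match is None:
        return "unknown"
    return "CDS" if match.group(1) == "1" else "structural RNA"


def describe_orf_name(name: str) -> dict:
    """Decode the four letters of an ORF name following the rules in the text."""
    if ORF_NAME.fullmatch(name) is None:
        raise ValueError(f"not an ORF name of the form y??X: {name!r}")
    return {
        "group_letter": name[1],        # group sequencing the region
        "unmapped": name[1] == "z",     # z marks an unmapped gene
        "operon_letter": name[2],       # ideally one letter per operon
        "piecemeal": name[2] in "xy",   # x or y: piecemeal sequencing
        "rank_letter": name[3],         # A to Z order within an operon
    }
```

With these definitions a hypothetical name such as `yzxA` would be read as an unmapped gene obtained by piecemeal sequencing.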

Some elements of the original features are discarded. Thus we keep only the descriptions of signal, mature and leader peptides, CDS, tRNA, rRNA, -35 and -10 regions, promoters and terminators. Some mistakes are corrected, consisting mainly of signal, mature or leader peptides wrongly annotated as CDS, frame-shifts resulting from bad start points, CDS shortened due to bad end points and features described in the original publications but not inserted in the tables.
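The value stored under the `/CAI' qualifier follows the index of Sharp and Li ( 14 ), which is the geometric mean of the relative adaptiveness values w of the codons in a CDS. The Python sketch below shows that calculation under stated assumptions: the table of w values is supplied by the caller (the reference set used for NRSub is not given here), every supplied w is strictly positive, and codons absent from the table (typically ATG, TGG and stop codons) are skipped. The function name is hypothetical.

```python
import math
from typing import Dict


def codon_adaptation_index(cds: str, w: Dict[str, float]) -> float:
    """Geometric mean of relative adaptiveness values over the codons of a CDS.

    `w` maps codons to relative adaptiveness (0 < w <= 1). Codons that are
    not in `w` are ignored, so single-codon amino acids and stop codons can
    simply be left out of the table.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    log_sum = 0.0
    counted = 0
    for start in range(0, len(cds), 3):
        weight = w.get(cds[start:start + 3])
        if weight is None:
            continue  # codon excluded from the index
        log_sum += math.log(weight)
        counted += 1
    if counted == 0:
        raise ValueError("no codon of the CDS is present in the w table")
    return math.exp(log_sum / counted)
```

Summing logarithms rather than multiplying the weights directly avoids numerical underflow for long genes.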

USE WITH ACNUC

To make access to the data easier we have indexed the NRSub flatfile with ACNUC. This database format was primarily developed for general sequence collections, like GenBank, EMBL and NBRF/PIR ( 18 ). Later it was also used with more specialized systems, such as ColiGene ( 19 ), MultiMap ( 20 ) and Hovergen ( 21 ). ACNUC is provided with the retrieval program QUERY. Two versions of this application exist: a command line version (written mainly in Fortran) and a graphical version (written in C) ( 7 ). This graphical version was developed with the VIBRANT library, which is part of the NCBI TOOLBOX ( 22 ). At present QUERY runs on SunSparc (under SunOS or Solaris), Silicon Graphics, IBM RS/6000 and DEC Alpha (under OSF/1) systems. It may be installed on any UNIX computer that runs X Window and on which the VIBRANT and MOTIF libraries are available.

The ACNUC data structure associated with QUERY allows one to build complex queries to retrieve sequences. For example, it is possible to use sequence names, accession numbers, keywords, bibliographic references, dates of insertion in the bank or the nature of the molecule sequenced (e.g. DNA, mRNA, tRNA, etc.). It is also possible to combine many criteria using logical operators and to use the results of previous queries to build new ones. Each type of feature can be accessed and extracted, as well as its flanking regions, and CDS can be translated into proteins. Different formats are available for extracting sequences, including FASTA, GCG and EMBL.
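The idea of combining criteria with logical operators and reusing earlier result lists can be pictured with plain set operations. The Python sketch below is only an analogy: it does not reproduce the QUERY command syntax or the ACNUC index structure, and the toy index, its contig names and its criteria are all invented for illustration.

```python
from typing import Dict, Set

# Hypothetical toy index mapping a criterion to the contigs that satisfy it.
INDEX: Dict[str, Set[str]] = {
    "keyword=composite": {"SL010_1", "SL120_1"},
    "keyword=sporulation": {"SL120_1", "SL200_2"},
    "type=tRNA": {"SL010_1"},
}


def select(criterion: str) -> Set[str]:
    """Return a fresh list (set) of contigs matching a single criterion."""
    return set(INDEX.get(criterion, set()))


# Each query result is kept as a named list that later queries can reuse.
composite = select("keyword=composite")
sporulation = select("keyword=sporulation")

both = composite & sporulation          # logical AND
either = composite | sporulation        # logical OR
refined = either - select("type=tRNA")  # AND NOT, built on a previous result
```

Keeping every intermediate result as a named set mirrors the way earlier queries can serve as building blocks for new ones.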

DATA DISTRIBUTION

The NRSub flatfile, as well as ACNUC index tables and the binary code for the program QUERY, are available on the DDBJ anonymous ftp server (ftp.nig.ac.jp). The C and Fortran code for the command line version of QUERY are also furnished. These include a library allowing one to interface user-developed applications with any ACNUC database. The distribution is located in the directory `/pub/db/nrsub'. The relational database SubtiList is available for Macintosh computers (and soon for Windows-based computers) at the Pasteur Institute anonymous ftp server (ftp.pasteur.fr), in the directory `/pub/GenomeDB/SubtiList'. Note that it is also possible to access NRSub through a WWW server at URL (Uniform Resource Locator) http://ddbjs4h.genes.nig.ac.jp/. This server is mirrored in France and can be reached at URL http://acnuc.univ-lyon1.fr/nrsub/nrsub.html. The home page of the server gives access to different entry points allowing one to make simple or complex queries (Fig. 2 ). Simple access to the sequences is made by keyword, sequence name, accession number, gene accession number and full text search. More elaborate access uses WWW-QUERY, a WWW version of QUERY. WWW-QUERY is a general interface and any ACNUC database could be accessed by this system. The last entry point allows one to query NRSub with the Sequence Retrieval System (SRS) ( 23 , 24 ) set up at EBI. Our server also gives access to hypertext documents such as release notes, a version history of the database, on-line documentation, a list of keywords and a table containing cross-references with SWISS-PROT. Any questions and comments related to NRSub can be sent to the corresponding author at perriere@biomserv.univ-lyon.fr. Any comments on SubtiList can be sent to moszer@pasteur.fr.


Figure 2 . Home page of the NRSub WWW server. This page gives access to entry points to query the database, to associated hypertext documents and to the ftp sites for downloading NRSub and SubtiList.

PERSPECTIVES

In the future we would like to add accession numbers not only for genes, but also for all kinds of genomic fragments of biological interest (e.g. ribosome binding sites, promoters, terminators, etc.). When longer sequences are available it will be possible to add physical mapping data, but at present the map is not detailed enough relative to the length of existing sequence contigs. As an important body of data has become available for both B.subtilis and E.coli, comparative studies between these two bacterial species have already produced some interesting results ( 25 ) and it would be worthwhile to integrate these results in NRSub. Lastly we want to hunt for genes that have possibly escaped detection so far, using CDS prediction methods such as GeneMark ( 26 ) or Recsta ( 27 ). Indeed, in the case of regions obtained by piecemeal sequencing it is likely that some genes have not been detected due to sequencing errors or because they are interrupted at one extremity of the sequence (see 28 for a good example of this in E.coli ). By contrast, the probability of such events is lower in the case of regions obtained during the systematic sequencing effort, as these sequences are always scanned using this kind of computer program.

ACKNOWLEDGEMENTS

We would like to thank Amos Bairoch (University of Geneva) for his help on the protein gene annotations and Jean Lobry (University of Lyon) for his careful testing of the first releases of NRSub.

REFERENCES

1 Kunst,F. and Devine,K. (1991) Res. Microbiol., 142, 905-912.

2 Anagnostopoulos,C., Piggot,P.J. and Hoch,J.A. (1993) In Sonenshein,A.L., Hoch,J.A. and Losick,R. (eds), Bacillus subtilis and Other Gram-positive Bacteria: Biochemistry, Physiology and Molecular Genetics. American Society for Microbiology, Washington, DC, pp. 425-461.

3 Itaya,M. and Tanaka,T. (1991) J. Mol. Biol., 220, 631-648.

4 Ogasawara,N., Fujita,Y., Kobayashi,Y., Sadaie,Y., Tanaka,T., Takahashi,H., Yamane,K. and Yoshikawa,H. (1995) Microbiology, 141, 257-259.

5 Kunst,F., Vassarotti,A. and Danchin,A. (1995) Microbiology, 141, 249-255.

6 Sharp,P.M., Higgins,D.G., Shields,D.C. and Devine,K.M. (1990) In Harwood,C.R. and Cutting,S.M. (eds), Molecular Biological Methods for Bacillus. John Wiley & Sons, Chichester, UK, pp. 557-569.

7 Perrière,G., Gouy,M. and Gojobori,T. (1994) Nucleic Acids Res., 22, 5525-5529.

8 Moszer,I., Glaser,P. and Danchin,A. (1995) Microbiology, 141, 261-268.

9 Emmert,D.B., Stoehr,P.J., Stoesser,G. and Cameron,G.N. (1994) Nucleic Acids Res., 22, 3445-3449.

10 Benson,D.A., Boguski,M., Lipman,D.J. and Ostell,J. (1994) Nucleic Acids Res., 22, 3441-3444.

11 DDBJ (1995) DDBJ Newslett., 15, 63-66.

12 Gouy,M., Gautier,C., Attimonelli,M., Lanave,C. and di Paola,G. (1985) Comput. Appl. Biosci., 1, 167-172.

13 Bairoch,A. and Boeckmann,B. (1994) Nucleic Acids Res., 22, 3578-3580.

14 Sharp,P.M. and Li,W.-H. (1987) Nucleic Acids Res., 15, 1281-1295.

15 Demerec,M., Adelberg,E.A., Clark,A.J. and Hartman,P.E. (1966) Genetics, 54, 61-76.

16 Rudd,K.E. (1993) ASM News, 59, 335-341.

17 Bairoch,A. (1994) Nucleic Acids Res., 22, 3626-3627.

18 George,D.G., Barker,W.C., Mewes,H.-W., Pfeiffer,F. and Tsugita,A. (1994) Nucleic Acids Res., 22, 3569-3573.

19 Perrière,G. and Gautier,C. (1993) Biochimie, 75, 415-422.

20 Perrière,G., Dorkeld,F., Rechenmann,F. and Gautier,C. (1993) In Hunter,L., Searls,D. and Shavlik,J. (eds), Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. AAAI/MIT Press, Menlo Park, CA, pp. 319-327.

21 Duret,L., Mouchiroud,D. and Gouy,M. (1994) Nucleic Acids Res., 22, 2360-2365.

22 National Center for Biotechnology Information (1993) NCBI Software Development Toolkit, Version 1.8. NCBI, National Library of Medicine, Bethesda, MD.

23 Etzold,T. and Argos,P. (1993) Comput. Appl. Biosci., 9, 49-57.

24 Etzold,T. and Argos,P. (1993) Comput. Appl. Biosci., 9, 59-64.

25 Kunisawa,T. (1995) J. Mol. Evol., 40, 585-593.

26 Borodovsky,M. and McIninch,J. (1993) Comput. Chem., 17, 123-133.

27 Fichant,G. and Gautier,C. (1987) Comput. Appl. Biosci., 3, 287-295.

28 Borodovsky,M., Rudd,K.E. and Koonin,E.V. (1994) Nucleic Acids Res., 22, 4756-4767.


* To whom correspondence should be addressed