© 1997 Oxford University Press 53-56

The NRSub database: update 1997

Guy Perrière*, Ivan Moszer 1 and Takashi Gojobori 2

Laboratoire de Biométrie, Génétique et Biologie des Populations, Université Claude Bernard, Lyon 1, 43, bd. du 11 Novembre 1918, 69622 Villeurbanne Cedex, France, 1 Unité de Régulation de l'Expression Génétique, Institut Pasteur, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France and 2 Center for Information Biology, National Institute of Genetics, Mishima, Shizuoka-ken 411, Japan

Received September 23, 1996; Accepted September 24, 1996

ABSTRACT

In the context of the international project aiming at sequencing the whole genome of Bacillus subtilis we have developed NRSub, a non-redundant database of sequences from this organism. Starting from the B.subtilis sequences available in the repository collections, we have removed all the duplications encountered and then added extra annotations to the sequences (e.g. accession numbers for the genes, locations on the genetic map, codon usage index). We have also added cross-references to the EMBL/GenBank/DDBJ, MEDLINE, SWISS-PROT and ENZYME databases. NRSub is distributed through anonymous FTP as a text file in EMBL format and as an ACNUC database. It is also possible to access the database through two dedicated World Wide Web servers located in France (http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and in Japan (http://ddbjs4h.genes.nig.ac.jp/).

INTRODUCTION

The Bacillus subtilis genome sequencing project is now entering what will be its last year, as the complete sequence of the chromosome is expected to be published in July 1997 (1, 2). At present (September 1996) the 25 laboratories from Europe and Japan involved in the project have sequenced a total of 2879 kb. If we add the sequences from the repository collections that were obtained by piecemeal sequencing before the project started, the total reaches 3310 kb (after removal of redundancies). This represents approximately 79% of the entire B.subtilis 168 chromosome, which consists of 4165 kb (3). However, due to contractual obligations, only 1983 kb (approximately 47% of the chromosome) had been made publicly available when this paper was written.

In the context of this sequencing project, the first public release of the non-redundant B.subtilis database (NRSub) was developed at the National Institute of Genetics (Mishima, Japan) in July 1994 (4). In January 1995, the content of NRSub was merged with that of the SubtiList database (5) developed at the Institut Pasteur (Paris, France). Although the contigs provided by the two databases are identical, the databases are still distributed under their own names because many differences persist in their annotations. Among the specific features provided by NRSub are a measure of codon usage bias in protein genes, the systematic use of data from the SWISS-PROT database (6) and references to orthologous genes in Escherichia coli.

DATABASE CONTENT

The sequence contigs are built from the B.subtilis 168 (and derivatives) chromosomal sequences available in the EMBL (7), GenBank (8) and DDBJ (9) collections by removing all redundancy. Sequences from strains other than 168 and plasmid sequences have been discarded. Release 9 of NRSub (September 1996) contains 214 contigs created using 654 EMBL/GenBank/DDBJ entries (Fig. 1). These sequences total 1 983 145 bp and give access to 1852 protein genes (95 partial), 78 tRNA genes and 31 rRNA genes.

To spread the information widely, we distribute NRSub primarily as a text file in EMBL format. This format is recognized by many sequence analysis packages and retrieval systems, and is the standard for all the European Bioinformatics Institute (EBI) databases (7). Many additions and corrections are made to the original annotations. First, new identifiers are given to each contig (ID field). They are based on the format SLxxx_z, where xxx corresponds to the position of the sequence on the chromosome, in degrees, and z corresponds to a rank number used if more than one sequence is found at the same position. So that other databases can establish cross-references with NRSub, we define our own accession numbers for the contigs (AC field). The format for accession numbers is BSxxxx. When two (or more) pre-existing contigs are merged we combine their accession numbers and assign a new primary accession number. The first line of the DT field gives the creation date of the oldest EMBL/GenBank/DDBJ entry used to build the contig and the second line gives the date of the most recent modification. The DE field is completely rewritten and contains the name, the accession number, the position and the orientation of all the sequences used to build the contig (including duplicates). In the KW field we gather all the keywords associated with the original sequences. We remove from this field the keywords duplicated in the '/gene' and '/product' qualifiers of the features, because we use the ACNUC format (10) to index NRSub and this database management system creates its own keywords from the content of these qualifiers. For contigs built by merging overlapping sequences we add 'composite' as a keyword. Bibliographic references of the sequences used to build the contigs are merged (RA, RT and RL fields). We also systematically add cross-references to the bibliographic database MEDLINE (RX field).
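The contig identifier and accession number formats described above lend themselves to simple validation. The following is a minimal sketch, assuming that each x and the rank z are decimal digits (the text gives only the shapes SLxxx_z and BSxxxx, not the exact character classes). The function names and the sample values are hypothetical and are not taken from NRSub.

```python
import re
from typing import NamedTuple

# Assumed patterns: digits for every "x" and for the rank "z".
CONTIG_ID = re.compile(r"SL(\d{3})_(\d+)")
CONTIG_AC = re.compile(r"BS\d{4}")


class ContigId(NamedTuple):
    degrees: int  # position of the contig on the chromosome, in degrees
    rank: int     # rank among contigs found at the same position


def parse_contig_id(identifier: str) -> ContigId:
    """Split an identifier of the form SLxxx_z into position and rank."""
    match = CONTIG_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a contig identifier: {identifier!r}")
    degrees, rank = int(match.group(1)), int(match.group(2))
    if degrees > 360:
        raise ValueError(f"position out of range: {degrees} degrees")
    return ContigId(degrees, rank)


def is_contig_accession(value: str) -> bool:
    """Check that a string has the shape BSxxxx."""
    return CONTIG_AC.fullmatch(value) is not None


# Made-up values, for illustration only.
print(parse_contig_id("SL180_1"))
print(is_contig_accession("BS0042"))
```

Keeping the position and the rank as separate integers makes it straightforward to sort contigs along the chromosome.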


Figure 1. Contig map of NRSub release 9 (September 1996).

The FT field required extensive modification. First, for each CDS or structural RNA molecule we add an accession number under the '/acnum' qualifier. BG1xxxx accession numbers are used for the CDS, and BG0xxxx accession numbers are used for the structural RNAs. Thanks to these gene accession numbers it is possible to cross-reference SWISS-PROT with NRSub (DR field in SWISS-PROT entries). When the feature is a CDS we add a Codon Adaptation Index (CAI) (11) value under the '/CAI' qualifier. When information on the locus name is available we add a '/gene' qualifier for the CDS or the structural RNA described. The gene names used are taken from the listing of Anagnostopoulos et al. (12) or from the original publication. In some cases we have corrected the names when they were inconsistent with the nomenclature established by Demerec et al. (13). To name Open Reading Frames (ORFs) we have used a nomenclature similar to that defined by Rudd for E.coli (14): as no gene name in B.subtilis begins with y, we use this letter as the first letter of each ORF name. The second letter is assigned according to the group in charge of sequencing the region of the chromosome in which the ORF is located. In the case of an unmapped gene the letter z is used. The third letter is freely chosen by the group sequencing the region. Preferably, each different letter should correspond to a different operon. In the case of genes obtained by piecemeal sequencing the letters x or y are used. The fourth and last letter is always a capital letter. This letter is also freely chosen by the group sequencing the region. Preferably, genes in the same operon should be ordered using the alphabet from A to Z. If alternative names are available for a given gene they are indicated under the '/alt_name' qualifier. In the case of protein genes we add a '/db_xref' qualifier for the cross-reference pointing to the corresponding entry in SWISS-PROT.
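The gene accession numbers and the four-letter ORF names follow rules that can be checked mechanically. The sketch below is one possible reading of those rules, assuming that each x in BG1xxxx and BG0xxxx is a decimal digit and that the second and third letters of an ORF name are lower case. The helper names and the sample names are hypothetical.

```python
import re

# y + group letter (z when unmapped) + freely chosen letter + capital letter.
ORF_NAME = re.compile(r"y[a-z][a-z][A-Z]")
# Assumption: the four "x" characters are digits.
GENE_AC = re.compile(r"BG([01])\d{4}")


def is_orf_name(name: str) -> bool:
    """Check that a name has the four-letter shape used for ORFs."""
    return ORF_NAME.fullmatch(name) is not None


def is_unmapped_orf(name: str) -> bool:
    """An ORF whose second letter is z has not been placed on the map."""
    return is_orf_name(name) and name[1] == "z"


def gene_kind(accession: str) -> str:
    """Return the feature type encoded by a gene accession number."""
    match = GENE_AC.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a gene accession number: {accession!r}")
    return "CDS" if match.group(1) == "1" else "structural RNA"


# Made-up names and accession numbers, for illustration only.
print(is_orf_name("yabC"), is_unmapped_orf("yzaB"))
print(gene_kind("BG10001"), gene_kind("BG00001"))
```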
We rewrite or complete the '/product' qualifier using data from SWISS-PROT. For ORFs, we use 'hypothetical protein' as the product associated with these putative genes. When an encoded protein is an enzyme, we add its EC number in an '/EC_number' qualifier. The EC numbers used are taken from the ENZYME database (15). For each CDS or structural RNA gene we add, in a '/map' qualifier, its location on the genetic map of B.subtilis 168. Since release 8 we have started to introduce cross-references to E.coli orthologous protein genes in SWISS-PROT. These cross-references are introduced under a '/note' qualifier.
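The Codon Adaptation Index stored under the '/CAI' qualifier is, following the definition of Sharp and Li (11), the geometric mean of the relative adaptiveness of the codons of a gene, where the relative adaptiveness of a codon is its frequency in a reference set of highly expressed genes divided by the frequency of the most used synonymous codon. The sketch below illustrates that definition only. The reference set used for NRSub is not given here, so the codon counts are hypothetical, the synonymous-codon table is a toy one covering three amino acids, and replacing zero counts by 0.5 is assumed as the convention for unused codons.

```python
import math

# Toy table: a full implementation would list every amino acid.
SYNONYMOUS = {"K": ["AAA", "AAG"], "E": ["GAA", "GAG"], "M": ["ATG"]}
# Hypothetical counts standing in for a reference set of highly expressed genes.
REFERENCE_COUNTS = {"AAA": 80, "AAG": 20, "GAA": 90, "GAG": 30, "ATG": 50}


def relative_adaptiveness(counts, synonymous):
    """Weight of each codon: its count divided by the best synonymous count."""
    weights = {}
    for codons in synonymous.values():
        if len(codons) < 2:
            continue  # single-codon amino acids carry no usage information
        # Assumed convention: a codon never seen in the reference set counts 0.5.
        adjusted = {codon: max(counts.get(codon, 0), 0.5) for codon in codons}
        best = max(adjusted.values())
        for codon, count in adjusted.items():
            weights[codon] = count / best
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of the informative codons of a CDS."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(weights[codon]) for codon in codons if codon in weights]
    if not logs:
        raise ValueError("no informative codons in sequence")
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMOUS)
print(round(cai("ATGAAAGAA", WEIGHTS), 3))  # only preferred codons
print(round(cai("ATGAAGGAA", WEIGHTS), 3))  # one rarely used codon
```

With these toy counts the first sequence scores 1.0 and the second 0.5, because AAG is used four times less often than AAA in the hypothetical reference set.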

Some elements of the original features are discarded: we keep only the descriptions of signal, mature and leader peptides, CDS, tRNA, rRNA, -35 and -10 regions, promoters and terminators. Some mistakes are corrected, mainly signal, mature or leader peptides wrongly annotated as CDS, frameshifts resulting from incorrect start points, CDS shortened by incorrect end points, and features described in the original publications but not inserted in the feature tables.

DATABASE USE

To make access to NRSub easier, we have indexed the text file with the ACNUC database management system. The ACNUC data structure can index any collection in EMBL, GenBank/DDBJ or NBRF/PIR (16) format. The retrieval system provided with ACNUC is called Query. Two versions of this program exist: a line mode version (10) and a graphical version (4). The line mode version is written in Fortran and runs on almost any computer under the UNIX or VMS operating systems. The graphical version is written in C and uses the Vibrant library, which is part of the toolbox distributed by the National Center for Biotechnology Information (NCBI) (17). Binaries of this program are available for Sun (under Solaris or SunOS), DEC Alpha, IBM RS/6000, Silicon Graphics, HP/UX and Macintosh (680x0 or PowerPC) computers.

With NRSub indexed with ACNUC it is possible to build complex queries to retrieve sequences. Queries can be made on sequence names, accession numbers, keywords, bibliographic references, dates of insertion in the bank, etc. It is possible to combine many criteria using logical operators and to use the results of previous queries to build new ones. Each kind of feature can be accessed and extracted, together with its flanking regions, and CDS can be translated into proteins. Several formats are available for extracting sequences, including FASTA, GCG and EMBL.
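The idea of combining criteria with logical operators and reusing earlier results can be illustrated with ordinary set operations. The sketch below is not the syntax of Query and does not use ACNUC. It is a generic illustration in which the entries, their keywords and their dates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    keywords: frozenset
    year: int  # year of insertion in the bank


# Made-up entries, for illustration only.
ENTRIES = [
    Entry("SL010_1", frozenset({"composite", "sporulation"}), 1994),
    Entry("SL120_1", frozenset({"sporulation"}), 1996),
    Entry("SL250_1", frozenset({"composite"}), 1995),
]


def select(predicate):
    """Return the names of the entries that satisfy one criterion."""
    return {entry.name for entry in ENTRIES if predicate(entry)}


# Each simple query yields a named result set that later queries can reuse.
composite = select(lambda e: "composite" in e.keywords)
sporulation = select(lambda e: "sporulation" in e.keywords)
recent = select(lambda e: e.year >= 1995)

both = composite & sporulation                # logical AND
either_recent = (composite | sporulation) & recent  # OR, then AND
not_composite = sporulation - composite       # AND NOT

print(sorted(both), sorted(either_recent), sorted(not_composite))
```

Storing each result as a set of entry names is what allows a previous result to be used as an operand of a new query.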

A World Wide Web (WWW) server allowing access to NRSub has been installed in France (http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and a mirror has been installed in Japan (http://ddbjs4h.genes.nig.ac.jp/). The home page of the server gives access to entry points allowing one to make simple or complex queries (Fig. 2). Simple queries are made by keyword, sequence name, accession number, gene accession number and full text search. More sophisticated access is possible through WWW-Query, a WWW version of Query (18). WWW-Query is a general interface and any database indexed with ACNUC can be accessed through this system. The server gives access to various documents such as release notes, history of the database, on-line documentation, a list of keywords and a list of all the protein genes accessible in NRSub. This list includes the accession numbers of these genes in NRSub and the accession numbers of their corresponding proteins in SWISS-PROT.


Figure 2. Home page of the NRSub World Wide Web server.

NRSub has also been indexed with the Sequence Retrieval System (SRS) (19, 20) and it is possible to query the database with this program on six WWW servers. These servers can be reached through the home page of the NRSub server.

AVAILABILITY

The NRSub text file, as well as the ACNUC index tables, the sources and the binaries of Query are available through two anonymous FTP servers: one in France (ftp://biom3.univ-lyon1.fr/pub/nrsub) and one in Japan (ftp://ftp.nig.ac.jp/pub/db/nrsub). The sources of Query include a Fortran and a C library allowing one to interface user-developed applications with any ACNUC database. The text file is mirrored at the EBI (ftp://ftp.ebi.ac.uk/pub/databases/nrsub) and InfoBioGen (ftp://ftp.infobiogen.fr/pub/db/nrsub) FTP servers. Any questions and comments related to NRSub can be sent by e-mail to the corresponding author (perriere@biomserv.univ-lyon1.fr).

PERSPECTIVES

We have started to search for orthologous genes in species other than E.coli. In particular, we plan to locate orthologs in all bacterial species for which the complete genome has been sequenced (e.g. Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Synechocystis sp.). As strong asymmetries have been observed in the nucleotide composition of the CDS belonging to the leading and the lagging strands of the B.subtilis chromosome (21), we want to integrate the CDS orientation in the sequence features. Indeed, this information can be used to orient the contigs themselves (22). When the complete genome of B.subtilis is available we would like to add accession numbers for all the genomic fragments of biological interest (ribosome binding sites, promoters and terminators). Finally, we hope to provide a Windows 95 version of Query soon.

ACKNOWLEDGEMENTS

Many thanks to Manolo Gouy who developed ACNUC and Query and to Franck Samson who provided the contig map drawing.

REFERENCES

1 Devine,K.M. (1995) Trends Biotechnol., 13, 210-216.

2 Harwood,C.R. and Wipat,A. (1996) FEBS Lett., 389, 84-87.

3 Itaya,M. and Tanaka,T. (1991) J. Mol. Biol., 220, 631-648.

4 Perrière,G., Gouy,M. and Gojobori,T. (1994) Nucleic Acids Res., 22, 5525-5529.

5 Moszer,I., Glaser,P. and Danchin,A. (1995) Microbiology, 141, 261-268.

6 Bairoch,A. and Apweiler,R. (1996) Nucleic Acids Res., 24, 21-25.

7 Rodriguez-Tomé,P., Stoehr,P.J., Cameron,G.N. and Flores,T.P. (1996) Nucleic Acids Res., 24, 6-12.

8 Benson,D.A., Boguski,M.S., Lipman,D.J. and Ostell,J. (1996) Nucleic Acids Res., 24, 1-5.

9 Shin-I,T., Ikeo,K., Tateno,Y. and Gojobori,T. (1994) Biochemist, 16, 18-21.

10 Gouy,M., Gautier,C., Attimonelli,M., Lanave,C. and di Paola,G. (1985) Comput. Applic. Biosci., 1, 167-172.

11 Sharp,P.M. and Li,W-H. (1987) Nucleic Acids Res., 15, 1281-1295.

12 Anagnostopoulos,C., Piggot,P.J. and Hoch,J.A. (1993) In Sonenshein,A.L., Hoch,J.A. and Losick,R. (eds), Bacillus subtilis and Other Gram-positive Bacteria: Biochemistry, Physiology, and Molecular Genetics. American Society for Microbiology, Washington DC, pp. 425-461.

13 Demerec,M., Adelberg,E.A., Clark,A.J. and Hartman,P.E. (1966) Genetics, 54, 61-76.

14 Rudd,K.E. (1993) ASM News, 59, 335-341.

15 Bairoch,A. (1996) Nucleic Acids Res., 24, 221-222.

16 George,D.G., Barker,W.C., Mewes,H-W., Pfeiffer,F. and Tsugita,A. (1996) Nucleic Acids Res., 24, 17-20.

17 NCBI (1993) NCBI Software Development Toolkit, Version 1.8. National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD.

18 Perrière,G. and Thioulouse,J. (1996) Comput. Applic. Biosci., 12, 63-69.

19 Etzold,T. and Argos,P. (1993) Comput. Applic. Biosci., 9, 49-57.

20 Etzold,T. and Argos,P. (1993) Comput. Applic. Biosci., 9, 59-64.

21 Lobry,J.R. (1996) Mol. Biol. Evol., 13, 660-665.

22 Perrière,G., Lobry,J.R. and Thioulouse,J. (1996) Comput. Applic. Biosci., 12.


*To whom correspondence should be addressed. Tel: +33 472 44 80 00; Fax: +33 478 89 27 19; Email: perriere@biomserv.univ-lyon1.fr