ABSTRACT
In the context of the international project aiming at sequencing the whole
genome of Bacillus subtilis, we have developed NRSub, a non-redundant database
of sequences from this organism. Starting from the B.subtilis sequences
available in the repository collections, we have removed all encountered
duplications and then added extra annotations to the sequences (e.g. accession
numbers for the genes, locations on the genetic map, codon usage index). We
have also added cross-references to the EMBL/GenBank/DDBJ, MEDLINE, SWISS-PROT
and ENZYME databases. NRSub is distributed through anonymous FTP as a text file
in EMBL format and as an ACNUC database. It is also possible to access the
database through two dedicated World Wide Web servers located in France
(http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and in Japan
(http://ddbjs4h.genes.nig.ac.jp/).
The Bacillus subtilis genome sequencing project is now entering what should be
its last year, as the complete sequence of the chromosome is expected to be
published in July 1997 (1,2). At present (September 1996) the 25 laboratories
from Europe and Japan involved in the project have sequenced a total of
2879 kb. If we add the sequences from the repository collections that were
obtained by piecemeal sequencing before the project started, the total reaches
3310 kb (after removal of redundancies). This represents ~79% of the entire
B.subtilis 168 chromosome, which consists of 4165 kb (3). However, due to
contractual obligations, only 1983 kb (~47% of the chromosome) had been made
publicly available when this paper was written.
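The two coverage figures above follow directly from the quoted sizes. The short
Python sketch below reproduces the arithmetic; the variable and function names
are ours and are not part of NRSub.

```python
# Sizes quoted in the text, in kilobases (kb).
CHROMOSOME_KB = 4165   # B.subtilis 168 chromosome
SEQUENCED_KB = 3310    # non-redundant total sequenced, September 1996
PUBLIC_KB = 1983       # publicly available when the paper was written


def coverage_percent(part_kb: float, total_kb: float) -> float:
    """Return part_kb expressed as a percentage of total_kb."""
    return 100.0 * part_kb / total_kb


sequenced_pct = coverage_percent(SEQUENCED_KB, CHROMOSOME_KB)
public_pct = coverage_percent(PUBLIC_KB, CHROMOSOME_KB)

print(f"sequenced: {sequenced_pct:.1f}% of the chromosome")
print(f"public:    {public_pct:.1f}% of the chromosome")
```

The computed values fall between 79 and 80% and between 47 and 48%
respectively, consistent with the approximate figures given in the text.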
In the context of this sequencing project, the first public release of the
non-redundant B.subtilis database (NRSub) was developed at the National
Institute of Genetics (Mishima, Japan) in July 1994 (4). In January 1995, the
content of NRSub was merged with that of the SubtiList database (5) developed
at the Institut Pasteur (Paris, France). Although the contigs provided by the
two databases are identical, the databases are still distributed under their
own names because many differences persist in their annotations. Among the
specific features provided by NRSub are a measure of codon usage bias in
protein genes, the systematic use of data from the SWISS-PROT database (6) and
references to orthologous genes in Escherichia coli.
The sequence contigs are built from the B.subtilis 168 (and derivatives)
chromosomal sequences available in the EMBL (7), GenBank (8) and DDBJ (9)
collections by removing all redundant sequences. Sequences from strains other
than 168 and plasmid sequences have been discarded. Release 9 of NRSub
(September 1996) contains 214 contigs created using 654 EMBL/GenBank/DDBJ
entries (Fig. 1). These sequences total 1 983 145 bp and give access to 1852
protein genes (95 partial), 78 tRNA genes and 31 rRNA genes.
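The text does not describe the procedure used to detect redundancy and merge
overlapping entries. Purely as an illustration of the idea, the Python sketch
below merges two sequences when a suffix of one exactly matches a prefix of the
other, and treats a fully contained sequence as a duplicate. The function name,
the exact-match criterion and the min_overlap threshold are all assumptions; a
real pipeline would also have to handle sequencing discrepancies and
reverse-complemented entries.

```python
from typing import Optional


def merge_overlapping(left: str, right: str, min_overlap: int = 20) -> Optional[str]:
    """Merge two sequences if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, `left` unchanged if `right` is entirely
    contained in it, or None when no overlap of at least `min_overlap`
    bases is found. Only exact matches are considered (an assumption).
    """
    if right in left:
        return left  # `right` is a duplicate of part of `left`
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


# Toy sequences, invented for the example.
merged = merge_overlapping("ACGTACGTTT", "CGTTTGGA", min_overlap=4)
print(merged)
```

Here the two toy sequences share the five bases CGTTT, so the overlap is
written only once in the merged result.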
To spread the information widely, we primarily distribute NRSub as a text file
in EMBL format. This format is recognized by many sequence analysis packages
and retrieval systems, and is the standard for all the European Bioinformatics
Institute (EBI) databases (7). Many additions and corrections are made to the
original annotations. First, new identifiers are given to each contig (ID
field). They are based on the format SLxxx_z, where xxx corresponds to the
position of the sequence on the chromosome in degrees and z corresponds to a
rank number used if more than one sequence is found at the same position. So
that other databases can establish cross-references with NRSub, we define our
own accession numbers for the contigs (AC field). The format for accession
numbers is BSxxxx. When two (or more) pre-existing contigs are merged we
combine their accession numbers and a new primary accession number is assigned.
The first line of the DT field gives the date of creation of the oldest
EMBL/GenBank/DDBJ entry used to build the contig and the second line the date
of the most recent modification. The DE field is completely rewritten and
contains the name, the accession number, the position and the orientation of
all the sequences used to build the contig (including duplicates). In the KW
field we gather all the keywords associated with the original sequences. We
remove from this field the keywords duplicated in the `/gene' and `/product'
qualifiers of the features, because we use the ACNUC format (10) to index NRSub
and this database management system creates its own keywords from the content
of these qualifiers. In the case of contigs built by merging overlapping
sequences we add `composite' as a keyword. Bibliographic references of the
sequences used to build the contigs are merged (RA, RT and RL fields). We also
systematically add cross-references to the bibliographic database MEDLINE (RX
field).
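The field conventions described above can be illustrated with a short Python
sketch that groups the lines of an EMBL-style header by their two-letter line
code and splits a contig identifier into map position and rank. The sample
header is invented for the example and is not a real NRSub entry. The regular
expressions assume that xxx is three digits, that z is a number and that xxxx
is four digits, which the text does not state explicitly.

```python
import re
from collections import defaultdict

# Hypothetical header lines: all values are invented for illustration.
SAMPLE = """\
ID   SL123_1
AC   BS0042; BS0007;
DT   12-MAR-1990 (created)
DT   05-SEP-1996 (last modified)
KW   composite.
"""

# Assumed shapes of the SLxxx_z and BSxxxx formats.
ID_PATTERN = re.compile(r"^SL(\d{3})_(\d+)$")
AC_PATTERN = re.compile(r"^BS\d{4}$")


def parse_header(text):
    """Group EMBL-style lines by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or line.startswith("//"):
            continue  # skip blank lines and the entry terminator
        fields[line[:2]].append(line[2:].strip())
    return dict(fields)


def split_identifier(identifier):
    """Return (map position in degrees, rank) for an SLxxx_z identifier."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not an SLxxx_z identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


header = parse_header(SAMPLE)
position, rank = split_identifier(header["ID"][0].split()[0])
# The first accession number listed is taken as the primary one.
accessions = [a.strip() for a in header["AC"][0].split(";") if a.strip()]
print(position, rank, accessions)
```

Listing the primary accession number first on the AC line follows the usual
EMBL convention; whether NRSub orders merged accession numbers in exactly this
way is an assumption of the sketch.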
To make access to NRSub easier, we have indexed the text file with the ACNUC
database management system. The ACNUC data structure makes it possible to index
any collection in EMBL, GenBank/DDBJ or NBRF/PIR (16) format. The retrieval
system provided with ACNUC is called Query. Two versions of this program exist:
a line mode version (10) and a graphical version (4). The line mode version is
written in Fortran and runs on almost any computer under the UNIX or VMS
operating systems. The graphical version is written in C and uses the Vibrant
library, which is part of the toolbox distributed by the National Center for
Biotechnology Information (NCBI) (17). Binaries of this program are available
for Sun (under Solaris or SunOS), DEC Alpha, IBM RS/6000, Silicon Graphics,
HP/UX and Macintosh (680x0 or PowerPC) computers.
With NRSub indexed with ACNUC it is possible to build complex queries to
retrieve sequences. Queries can be made on sequence names, accession numbers,
keywords, bibliographic references, dates of insertion in the bank, etc. It is
possible to combine many criteria using logical operators and to use the
results of previous queries to build new ones. Each kind of feature can be
accessed and extracted, together with its flanking regions, and CDS can be
translated into proteins. Different formats are available for extracting
sequences, including FASTA, GCG and EMBL.
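The extraction of a feature together with its flanking regions can be sketched
in a few lines of Python. This is not the code of Query, whose implementation
is not described here. The function names are hypothetical, and the sketch
assumes 1-based inclusive coordinates on the direct strand, as used in EMBL
feature tables.

```python
def extract_with_flanks(sequence: str, start: int, end: int, flank: int = 0) -> str:
    """Return the region start..end (1-based, inclusive) plus `flank` bases
    on each side, clipped at the ends of the sequence."""
    lo = max(0, start - 1 - flank)
    hi = min(len(sequence), end + flank)
    return sequence[lo:hi]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with lines of `width` residues."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Toy contig, invented for the example: the feature covers positions 4 to 6.
contig = "AAACCCGGGTTT"
region = extract_with_flanks(contig, start=4, end=6, flank=2)
print(to_fasta("toy_feature_with_flanks", region), end="")
```

Clipping at the contig boundaries means that a feature close to an end simply
returns a shorter flank rather than raising an error.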
A World Wide Web (WWW) server allowing access to NRSub has been installed in
France (http://acnuc.univ-lyon1.fr/nrsub/nrsub.html) and a mirror has been
installed in Japan (http://ddbjs4h.genes.nig.ac.jp/). The home page of the
server gives access to entry points allowing one to make simple or complex
queries (Fig. 2). Simple queries are made by keyword, sequence name, accession
number, gene accession number and full text search. More sophisticated access
is possible through WWW-Query, a WWW version of Query (18). WWW-Query is a
general interface, and any database indexed with ACNUC can be accessed through
it. The server also gives access to various documents such as release notes,
the history of the database, on-line documentation, a list of keywords and a
list of all the protein genes accessible in NRSub. This list includes the
accession numbers of these genes in NRSub and the accession numbers of their
corresponding proteins in SWISS-PROT.
The NRSub text file, as well as the ACNUC index tables, the sources and the
binaries of Query are available through two anonymous FTP servers: one in
France (ftp://biom3.univ-lyon1.fr/pub/nrsub) and one in Japan
(ftp://ftp.nig.ac.jp/pub/db/nrsub). The sources of Query include a Fortran and
a C library allowing one to interface user-developed applications with any
ACNUC database. The text file is mirrored at the EBI
(ftp://ftp.ebi.ac.uk/pub/databases/nrsub) and InfoBioGen
(ftp://ftp.infobiogen.fr/pub/db/nrsub) FTP servers. Any questions and comments
related to NRSub can be sent by e-mail to the corresponding author
(perriere{at}biomserv.univ-lyon1.fr).
We have started to search for orthologous genes in species other than E.coli.
In particular, we plan to locate orthologs in all bacterial species for which
the complete genome has been sequenced (e.g. Haemophilus influenzae, Mycoplasma
genitalium, Methanococcus jannaschii, Synechocystis sp.). As strong asymmetries
have been observed in the nucleotide composition of the CDS belonging to the
leading and the lagging strands of the B.subtilis chromosome (21), we want to
integrate the CDS orientation in the sequence features. Indeed, this
information can be used to orient the contigs themselves (22). When the
complete genome of B.subtilis is available we would like to add accession
numbers for all the genomic fragments of biological interest (ribosome binding
sites, promoters and terminators). Finally, we hope to provide a Windows 95
version of Query soon.
Many thanks to Manolo Gouy who developed ACNUC and Query and to Franck Samson
who provided the contig map drawing.
REFERENCES

