ABSTRACT
In the context of the international project aimed at sequencing the whole genome
of Bacillus subtilis we have developed a non-redundant, fully annotated database
of sequences from this organism. Starting from the B.subtilis sequences available
in the EMBL, GenBank and DDBJ collections we have removed all encountered
duplications and then added extra annotations to the sequences (e.g. accession
numbers for the genes, locations on the genetic map, codon usage, etc.). We have
also added cross-references to the EMBL, MEDLINE, SWISS-PROT and ENZYME data
banks. The present system results from the merging of the NRSub and SubtiList
databases, and the sequence contigs used in the two systems are identical. NRSub
is distributed as a flatfile in EMBL format (which is supported by most sequence
analysis software packages) and as an ACNUC database, while SubtiList is
distributed as a relational database under 4th Dimension. It is possible to
access the data through two dedicated World Wide Web servers located in France
and Japan.
The Bacillus subtilis genome sequencing project started five years ago (1) and
B.subtilis is at present the most extensively studied Gram-positive bacterium.
About 900 genetic loci have been localized on its genetic map (2) and a
long-range physical map has been published (3). The last reports from the
Japanese and the European groups involved in the B.subtilis sequencing project
were published at the beginning of 1995 (4,5). Precise DNA regions flanked by
well-identified genetic markers have been assigned to 25 laboratories from 11
different countries. Among them, seven Japanese laboratories are in charge of
sequencing a 1.3 Mb region. Up to now they have cloned ~1.1 Mb and have already
sequenced ~800 kb. With a sequencing rate of >50 kb/year/group they expect to
finish the region by mid-1996. For their part, the 18 European sequencing groups
each have to sequence at least 20 kb/year and have already completed ~1.1 Mb of
sequence. They will attempt to accelerate the overall rate of sequencing, with
the goal of determining the whole genome sequence of B.subtilis in the course of
1997.
Until the beginning of 1994 the only sequence compilation available for
B.subtilis was a data library containing all published proteins (6). In 1994 the
first release of NRSub was developed (7). At the same time a group at the Pasteur
Institute developed its own system, SubtiList (8), which runs on Macintosh
computers under the database management system 4th Dimension. Since then the
content of the two databases has been merged and the contigs distributed in
NRSub and SubtiList are now identical. Only a few differences persist between
the annotations of the two databases. For instance, NRSub integrates data on
codon usage and cross-references to MEDLINE. New releases of NRSub/SubtiList
include a cleaner sequence data set; sequences have been renamed using a
nomenclature based on their position on the chromosome and we have defined
accession numbers for the genes. Moreover, it is now possible to access the
database through two World Wide Web (WWW) servers in Japan and France.
The sequence contigs are built from the B.subtilis 168 chromosomal sequences
available in the EMBL (9), GenBank (10) and DDBJ (11) data banks by removing all
the redundancy. Sequences from strains other than 168 and derivatives have been
discarded. Release 5 of NRSub (July 1995) contains a total of 239 sequence
contigs created using 577 EMBL/GenBank/DDBJ entries. These sequences total
1 408 645 bp, representing ~33.8% of the entire B.subtilis 168 chromosome, which
consists of ~4165 kb (3).
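
The coverage figure quoted for release 5 follows directly from the two totals
given in the text. As a quick check, here is a minimal Python sketch; the
variable names are ours, and the chromosome size is the approximate value of
~4165 kb taken from the physical map.

```python
# Totals quoted in the text for release 5 of NRSub (July 1995).
contig_total_bp = 1_408_645      # summed length of the 239 contigs
chromosome_bp = 4_165 * 1_000    # ~4165 kb, approximate chromosome size

# Percentage of the chromosome covered by non-redundant sequence.
coverage_percent = 100 * contig_total_bp / chromosome_bp
print(f"{coverage_percent:.1f}% of the chromosome")  # prints "33.8% of the chromosome"
```
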
To spread the information widely we use the EMBL format for NRSub entries
(Fig. 1). This format is recognized by many programs and retrieval systems and is
the standard for all EBI databases (9). Many additions and corrections have been
made to the original annotations. First, new identifiers are given to each
contig (ID field). They are based on the format SLxxx_z, where xxx corresponds to
the position of the sequence on the chromosome in degrees and z corresponds to a
rank number used if more than one sequence is found at the same position. So
that other databases can establish cross-references with NRSub, we define our own
accession numbers for the contigs (AC field). The format for accession numbers is
BSxxxx. When two (or more) pre-existing contigs are merged into a larger one we
combine their accession numbers and a new primary accession number is assigned.
The first line of the DT field gives the date of creation of the oldest
EMBL/GenBank/DDBJ entry used to build the contig and the second line the date of
the most recent modification. The DE field is completely rewritten and contains
the name, the accession number, the position and the orientation of all the
sequences used to build the contig (including duplicates). In the KW field we
put together all the keywords associated with the original sequences. We remove
from this field the keywords duplicated in the `/gene' and `/product' qualifiers
of the features, because the ACNUC system (12) that we use to index NRSub
creates its own keywords from the content of these qualifiers. In the case of
composite contigs we add `composite' as a keyword. Bibliographic references for
the sequences used to build the contigs are merged (RA, RT and RL fields). We
also systematically add cross-references to the bibliographic data bank MEDLINE
(RX field).
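
Because the entries follow the EMBL line-code layout, the fields described above
can be read with very little code. The following is a minimal sketch, not the
software used to build NRSub. The text does not give the exact widths of the
SLxxx_z and BSxxxx formats, so three and four digits (and a numeric rank) are
assumptions, and the sample entry is an invented placeholder rather than a real
NRSub record.

```python
import re
from collections import defaultdict

# Assumed patterns: the text gives the formats SLxxx_z and BSxxxx but not the
# exact widths, so three and four digits and a numeric rank are guesses.
CONTIG_ID = re.compile(r"^SL(\d{3})_(\d+)$")
CONTIG_AC = re.compile(r"^BS\d{4}$")


def parse_entry(text):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker in EMBL format
            break
        code, _, value = line.partition(" ")
        if len(code) == 2 and value.strip():
            fields[code].append(value.strip())
    return fields


def contig_position(identifier):
    """Return (map position in degrees, rank) from an SLxxx_z identifier."""
    match = CONTIG_ID.match(identifier)
    if match is None:
        raise ValueError(f"not a contig identifier: {identifier!r}")
    return int(match.group(1)), int(match.group(2))


# Invented placeholder entry, not a real NRSub record.
SAMPLE = """\
ID   SL123_1    standard; DNA; PRO; 1000 BP.
AC   BS0001; BS0002;
KW   composite.
//
"""

fields = parse_entry(SAMPLE)
identifier = fields["ID"][0].split()[0]
# The first accession number on the AC line is taken as the primary one.
accessions = fields["AC"][0].replace(";", " ").split()

print(contig_position(identifier))                    # (123, 1)
print(all(CONTIG_AC.match(ac) for ac in accessions))  # True
```

Keeping the identifier pattern in one place makes it easy to adjust if the real
identifiers turn out to use a different number of digits.
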
To make access to the data easier we have indexed the NRSub flatfile with ACNUC.
This database format was originally developed for general sequence collections,
like GenBank, EMBL and NBRF/PIR (18). Later it was also used with more
specialized systems, such as ColiGene (19), MultiMap (20) and Hovergen (21).
ACNUC is provided with the retrieval program QUERY. Two versions of this
application exist: a command line version (written mainly in Fortran) and a
graphical version (written in C) (7). The graphical version was developed with
the VIBRANT library, which is part of the NCBI TOOLBOX (22). At present QUERY
runs on SunSparc (under SunOS or Solaris), Silicon Graphics, IBM RS/6000 and DEC
Alpha (under OSF/1) systems. It may be installed on any UNIX computer running
X Window on which the VIBRANT and MOTIF libraries are available.
The ACNUC data structure associated with QUERY allows one to build complex
queries to retrieve sequences. For example, it is possible to use sequence
names, accession numbers, keywords, bibliographic references, dates of
insertion in the bank or the nature of the molecule sequenced (e.g. DNA, mRNA,
tRNA, etc.). It is also possible to combine many criteria using logical
operators and to use the results of previous queries to build new ones. Each
type of feature can be accessed and extracted, as well as its flanking regions,
and CDS can be translated into proteins. Different formats are available for
extracting sequences, including FASTA, GCG and EMBL.
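
The idea of combining criteria with logical operators, and of reusing earlier
results, can be illustrated with ordinary set operations. This is only a
conceptual sketch in Python: it does not reproduce QUERY's command syntax or the
ACNUC index structure, and the contig names and keywords are invented.

```python
# Hypothetical index: each criterion maps to the set of contig names it selects.
# The names and keywords are invented; QUERY's real syntax is not shown here.
by_keyword = {
    "composite": {"SL010_1", "SL123_1"},
    "sporulation": {"SL123_1", "SL200_1"},
}
by_molecule = {
    "DNA": {"SL010_1", "SL123_1", "SL200_1"},
}

# AND, OR and NOT correspond to set intersection, union and difference.
composite_and_sporulation = by_keyword["composite"] & by_keyword["sporulation"]
either_keyword = by_keyword["composite"] | by_keyword["sporulation"]

# A previous result can be reused as the starting list of a new query.
dna_not_composite = by_molecule["DNA"] - by_keyword["composite"]
refined = dna_not_composite & either_keyword

print(sorted(composite_and_sporulation))  # ['SL123_1']
print(sorted(refined))                    # ['SL200_1']
```
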
The NRSub flatfile, as well as the ACNUC index tables and the binary code for
the program QUERY, are available on the DDBJ anonymous ftp server
(ftp.nig.ac.jp). The C and Fortran code for the command line version of QUERY is
also provided. It includes a library allowing one to interface user-developed
applications with any ACNUC database. The distribution is located in the
directory `/pub/db/nrsub'. The relational database SubtiList is available for
Macintosh computers (and soon for Windows-based computers) on the Pasteur
Institute anonymous ftp server (ftp.pasteur.fr), in the directory
`/pub/GenomeDB/SubtiList'. It is also possible to access NRSub through a WWW
server at URL (Uniform Resource Locator) http://ddbjs4h.genes.nig.ac.jp/. This
server is mirrored in France and can be reached at URL
http://acnuc.univ-lyon1.fr/nrsub/nrsub.html. The home page of the server gives
access to different entry points allowing one to make simple or complex queries
(Fig. 2). Simple access to the sequences is by keyword, sequence name, accession
number, gene accession number and full text search. More elaborate access uses
WWW-QUERY, a WWW version of QUERY. WWW-QUERY is a general interface and any ACNUC
database could be accessed through it. The last entry point allows one to query
NRSub with the Sequence Retrieval System (SRS) (23,24) set up at EBI. Our server
also gives access to hypertext documents such as release notes, a version
history of the database, on-line documentation, a list of keywords and a table
containing cross-references with SWISS-PROT. Any questions and comments related
to NRSub can be sent to the corresponding author at
perriere@biomserv.univ-lyon.fr. Any comments on SubtiList can be sent to
moszer@pasteur.fr.
In the future we would like to add accession numbers not only for genes, but
also for all kinds of genomic fragments of biological interest (e.g. ribosome
binding sites, promoters, terminators, etc.). When longer sequences are
available it will be possible to add physical mapping data, but at present the
map is not detailed enough relative to the length of existing sequence contigs.
As an important body of data has become available for both B.subtilis and
E.coli, comparative studies between these two bacterial species have already
produced some interesting results (25) and it would be worthwhile to integrate
these results in NRSub. Lastly, we want to search for genes that may have
escaped detection so far, using CDS prediction methods such as GeneMark (26) or
Recsta (27). Indeed, in the case of regions obtained by piecemeal sequencing it
is likely that some genes have not been detected due to sequencing errors or
because they are interrupted at one extremity of the sequence (see 28 for a good
example of this in E.coli). By contrast, the probability of such events is lower
in the case of regions obtained during the systematic sequencing effort, as
sequences are always scanned using this kind of computer program.
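
To see why a gene interrupted at the extremity of a contig can go unnoticed,
consider a deliberately naive scan that reports only complete open reading
frames (a start codon followed in frame by a stop codon). The sketch below is
not GeneMark or Recsta, which rely on statistical models of coding sequence. It
looks at the forward strand only, and the toy sequences are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complete_orfs(sequence, min_codons=3):
    """Return (start, end) of ATG...stop ORFs found on the forward strand."""
    found = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    found.append((start, pos + 3))
                start = None                     # close it at the stop codon
    return found


# Invented toy sequences: the second one ends before the stop codon,
# as if the contig stopped in the middle of the gene.
full = "CCATGAAACCCGGGTTTTAACC"
truncated = full[:15]

print(complete_orfs(full))       # [(2, 20)]
print(complete_orfs(truncated))  # []
```

The truncated sequence still contains the start of the gene, but a scan that
requires a stop codon reports nothing, which is the situation described above
for genes interrupted at one extremity of a sequence.
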
We would like to thank Amos Bairoch (University of Geneva) for his help on the
protein gene annotations and Jean Lobry (University of Lyon) for his careful
testing of the first releases of NRSub.
REFERENCES