GIF-DB, a WWW database on gene interactions involved in Drosophila melanogaster development
Bernard Jacq*, Florence Horn, Florence Janody, Nicolas Gompel, Olivier Serralbo, Elodie Mohr, Christine Leroy, Bernard Bellon1, Laurent Fasano, Patrick Laurenti2 and Laurence Röder
Laboratoire de Génétique et Physiologie du Développement, IBDM, Parc Scientifique de Luminy, CNRS Case 907, 13288 Marseille Cedex 09, France, 1Atelier de Bio-Informatique, Case 13, Université de Provence, 3 Place Victor Hugo, 13331 Marseille Cedex 03, France and 2Laboratoire de Biologie du Développement (Anatomie Comparée), Université de Paris VII, case 7077, 75251 Paris Cedex 05, France
Received September 23, 1996; Accepted September 24, 1996
ABSTRACT
GIF-DB (Gene Interactions in the Fly Database) is a new WWW database (http://www-biol.univ-mrs.fr/~lgpd/GIFTS_home_page.html) describing gene molecular interactions involved in the process of embryonic pattern formation in the fly Drosophila melanogaster. The detailed information is distributed in specific lines arranged into an EMBL- (or SWISS-PROT-) like format. GIF-DB achieves a high level of integration with other databases such as FlyBase, EMBL and SWISS-PROT through numerous hyperlinks. The original concept of interaction databases exemplified by GIF-DB could be extended to other biological subjects and organisms so as to study gene regulatory networks in an evolutionary perspective.
INTRODUCTION
Databases are now widely used in biological research, and this issue of Nucleic Acids Research provides the reader with an up-to-date collection of biological databases with various scientific purposes and contents. The majority of these databases can be classified as mainly structural, in that the core of their informational content is based on various aspects of DNA, RNA or protein sequence and/or structure. Relatively few databases have a content and an organization oriented towards the biological function of genes and the relationships between structure and function. EcoCyc, an encyclopedia of Escherichia coli genes and metabolism (1), is an example of a database which integrates functional aspects: it contains data both on gene structure and on gene function in the regulation of biochemical pathways. In the field of genetic diseases, OMIM, a catalog of human genes and genetic disorders (2), provides the user with both structural (molecular genetics, biochemistry, genetic mapping) and functional data (clinical features, diagnosis, inheritance, etc.) on human genes.
We are interested in the biological process of pattern formation in Drosophila and in understanding how the different body parts acquire their specific identities (3-7). In Drosophila, different classes of genes involved in the segmentation processes (maternal, gap, pair-rule and segment polarity genes) divide the embryo along the antero-posterior axis into repeated homologous units (8,9), which then develop specific identities and morphogenetic features under the control of homeotic genes (10). Specific interactions within and between these gene families are essential for the establishment of a correct body pattern. Being able to access, query and manipulate the data on these developmental genes and their functional interactions within specific regulatory networks is now an important need for developmental and molecular biologists studying gene regulation.
Gene molecular interactions, i.e. direct molecular interactions involving DNA, RNA and proteins, play an essential part in all known biological processes. Although different databases exist for each of these three types of macromolecules, data concerning precise molecular interactions between them are underrepresented in these databases. Considering protein/DNA interactions, for instance, only four such co-crystals (of 15 homeodomain structures) are found in PDB, an archive of experimentally determined three-dimensional structures of biological macromolecules (11). In addition, it is extremely difficult to extract from the GenBank (12), EMBL (13), PIR-international (14) or SWISS-PROT (15) databases a list of proteins which interact with a given gene, or a list of target genes for a given DNA-binding protein. The same is true for protein/RNA and protein-protein interactions. The Transfac database (16) gives some precise structural data for transcription factors and their known binding sites. Even in this case, however, data essential for understanding transcription factor function in its specific biological context are missing: the developmental stage at which the interaction occurs, the phenotype of animals in which the transcription factor is absent or mutated, the biological result of the interaction, the organisation of the cis-regulatory region and the experimental evidence for the interaction.
In this paper, we describe the concepts, organization, content and use of GIF-DB, the Gene Interactions in the Fly Database, a new WWW database which aims to provide a repository for data on gene interactions involved in Drosophila embryonic development and on the regulatory networks in which they are involved.
LEADING CONCEPTS OF GIF-DB
Four main concepts guided the design of GIF-DB.
Detailed and structured description of interactions
Our aim was to find a relatively simple, but well defined, way to represent the varied and complex knowledge we presently have on gene molecular interactions during embryonic development of Drosophila. This led to the conception of a structured entry format, which is described in the Database Organization section.
Integration of GIF-DB data with that of other WWW databases
As GIF-DB was designed to be accessible on the web, it was of great importance that it essentially include original data and rely on other databases for access to related data already described elsewhere. This goal is achieved through hyperlinks pointing towards external molecular and genetic databases. At the moment, hyperlinks towards three different databases have been introduced: EMBL, SWISS-PROT and FlyBase, the genetic and molecular Drosophila database (17). In the case of FlyBase, links point either towards the gene entry (FBgn in FlyBase) or towards the bibliographic reference (FBrf in FlyBase) (Fig. 1).
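As an illustration of how such links could be routed by identifier prefix, here is a minimal Python sketch. It is not the GIF-DB implementation: the URL templates are placeholders on example.org, the function name `flybase_link` is hypothetical, and the identifier shape (prefix followed by digits) is an assumption.

```python
import re

# Hypothetical URL templates: the paper does not give the link targets,
# so example.org placeholders stand in for the real FlyBase addresses.
LINK_TEMPLATES = {
    "FBgn": "https://example.org/flybase/gene/{id}",
    "FBrf": "https://example.org/flybase/reference/{id}",
}

# Assumed identifier shape: a four-letter prefix followed by digits.
ID_PATTERN = re.compile(r"^(FBgn|FBrf)(\d+)$")


def flybase_link(identifier: str) -> str:
    """Return an HTML anchor for a gene (FBgn) or reference (FBrf) identifier."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised FlyBase identifier: {identifier!r}")
    url = LINK_TEMPLATES[match.group(1)].format(id=identifier)
    return f'<a href="{url}">{identifier}</a>'
```

Keeping the templates in a single table means that a further external database could be supported by adding one entry and extending the pattern.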
Classification of all interactions in one of three major interaction types
Gene molecular interactions should not be mistaken for genetic interactions. The latter are more general and include both indirect and direct interactions. Our working definition is: there is a direct molecular interaction between gene A and gene B if gene A or one of its products (i.e. mRNA or protein) physically interacts at the molecular level with gene B or one of its products (mRNA or protein). In GIF-DB, we have focused on direct gene interactions, for which six different molecular types of interaction could theoretically be considered: DNA-DNA, DNA-RNA, RNA-RNA, DNA-protein, RNA-protein and protein-protein interactions. Among these possibilities, we will only consider further the three major types of interaction, which are by far the most documented ones, whatever the organism considered: (i) protein-DNA interactions (type I); (ii) protein-RNA interactions (type II); and (iii) protein-protein interactions (type III).
A practical consequence of the above definition of interaction types is that we will only take into account binary interactions (i.e. interactions occurring between two molecular partners). This could be viewed as a limitation if one considers what is already known about the complexity of gene interactions. However, and within certain limits, any complex interaction which involves more than two partners (interaction between a DNA sequence and several proteins, or between several proteins within a multimeric complex, for instance) could be split up into several binary interactions in order to be described.
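The classification and the binary decomposition can be sketched in Python as follows. This is only an illustrative sketch: the function names and the partner names in the usage example are hypothetical, and the decomposition treats every pair of partners as a candidate interaction, whereas a curator would retain only the pairs supported by experimental evidence.

```python
from itertools import combinations, combinations_with_replacement

MOLECULES = ("DNA", "RNA", "protein")

# The six theoretical pairings of the three macromolecule classes.
ALL_TYPES = list(combinations_with_replacement(MOLECULES, 2))

# The three types retained in GIF-DB, keyed by the unordered pair of classes.
GIF_TYPES = {
    frozenset(("protein", "DNA")): "I",
    frozenset(("protein", "RNA")): "II",
    frozenset(("protein",)): "III",  # protein-protein
}


def interaction_type(kind_a, kind_b):
    """Return 'I', 'II' or 'III', or None for a pairing that is not retained."""
    return GIF_TYPES.get(frozenset((kind_a, kind_b)))


def split_into_binary(partners):
    """Split a multi-partner interaction into candidate binary interactions.

    `partners` is a list of (name, kind) tuples. Every pair is considered;
    pairs whose type is not one of the three retained types are dropped.
    """
    result = []
    for (name_a, kind_a), (name_b, kind_b) in combinations(partners, 2):
        itype = interaction_type(kind_a, kind_b)
        if itype is not None:
            result.append((name_a, name_b, itype))
    return result
```

For example, `split_into_binary([("enhancer X", "DNA"), ("protein A", "protein"), ("protein B", "protein")])` yields two type I candidates and one type III candidate.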
Generic mode of interaction representation
Although our purpose in GIF-DB is to focus on interactions involved in the biological process of Drosophila pattern formation, we designed the file structure of GIF-DB so that it could have a generic value. We therefore propose herein a multipurpose tool for the representation of interaction knowledge. Our aim was to create a general format which could be used for the description of nearly any gene interaction, whatever the biological process and the organism in which it occurs.
DATABASE ORGANIZATION
The GIF-DB interaction database is a collection of hypertext files, each of them describing an interaction between two partners as discussed above. Each entry contains biological information which has been arranged into an 'EMBL-like' or 'SWISS-PROT-like' model format. Several reasons have dictated such a choice. (i) The EMBL and SWISS-PROT formats, which are quite similar, combine simplicity, logic and power of data representation. (ii) The complementarity between the descriptions of nucleic acid and protein data found in these two European databases is a useful concept for a new database in which protein, DNA and RNA data will be found together. (iii) Finally, adhering closely to an already existing data structure model will allow future users to find themselves in a relatively familiar environment, even if some unavoidable differences will exist due to the different nature of our database.
The first column (line code) lists the two-letter codes indicating the type of data contained in the corresponding line of each entry. They are listed in the order in which they appear and are grouped according to the five zones which make up every entry. Line codes marked with an asterisk are used for the same purpose in the EMBL and SWISS-PROT databases. The second column (line content) lists the type of data corresponding to each line code.
As is the case for EMBL and SWISS-PROT entries, GIF-DB entries are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols and nomenclature expected to be familiar to drosophilists, geneticists, biochemists and molecular biologists are used to describe the interactions, and some conventions used in FlyBase have been followed.
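To illustrate how a program might read such a line-coded entry, here is a minimal Python sketch of a parser for an EMBL-like flat file. It assumes that each line starts with a two-letter code followed by its content and that a line consisting of `//` closes an entry, as in EMBL and SWISS-PROT. The sample entry and the use of the `ID`, `DE` and `RN` codes are illustrative assumptions, not the actual GIF-DB line codes.

```python
def parse_entries(text):
    """Parse EMBL-like flat-file text into a list of entries.

    Each entry is a dict mapping a two-letter line code to the list of
    content strings found on lines carrying that code. A line consisting
    of '//' closes the current entry.
    """
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line == "//":
            entries.append(current)
            current = {}
            continue
        code, content = line[:2], line[2:].strip()
        current.setdefault(code, []).append(content)
    if current:  # tolerate a missing final terminator
        entries.append(current)
    return entries


# Hypothetical sample entry, for illustration only.
SAMPLE = """\
ID   EXAMPLE0001
DE   Hypothetical interaction between the gene A product and gene B.
DE   Second description line.
RN   [1]
//
"""
```

Calling `parse_entries(SAMPLE)` returns a single entry whose `DE` key holds both description lines in their original order.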
All data found in GIF-DB come from the literature. The information coming from different papers is compiled (and synthesized if necessary), verified and entered in DEXIFLY, a relational Drosophila database (Horn et al., in preparation). The HTML files constituting GIF-DB are then automatically generated from this database.
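The generation step can be pictured with the following Python sketch, which renders one HTML page per row of a relational table. The DEXIFLY schema is not described here, so the table, its columns, the sample row and the page layout are all hypothetical, and SQLite is used only as a convenient stand-in for the relational database.

```python
import sqlite3
from html import escape


def build_demo_db():
    """Create an in-memory database with a hypothetical interaction table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE interaction ("
        "id TEXT PRIMARY KEY, partner_a TEXT, partner_b TEXT, itype TEXT)"
    )
    conn.execute(
        "INSERT INTO interaction VALUES (?, ?, ?, ?)",
        ("EX0001", "gene A protein", "gene B <enhancer>", "I"),
    )
    conn.commit()
    return conn


def render_pages(conn):
    """Return a dict mapping file names to HTML text, one page per entry."""
    pages = {}
    rows = conn.execute(
        "SELECT id, partner_a, partner_b, itype FROM interaction ORDER BY id"
    )
    for entry_id, a, b, itype in rows:
        # Escape every database value so that it cannot break the markup.
        body = (
            f"ID   {escape(entry_id)}\n"
            f"DE   {escape(a)} / {escape(b)} (type {escape(itype)})\n"
            "//"
        )
        pages[f"{entry_id}.html"] = (
            f"<html><head><title>{escape(entry_id)}</title></head>"
            f"<body><pre>\n{body}\n</pre></body></html>\n"
        )
    return pages
```

Generating the pages from a single source in this way keeps the hypertext files consistent with the curated data, since no page is edited by hand.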
2 On-line Mendelian Inheritance in Man (OMIM), a catalog of human genes and genetic disorders. McKusick,V.A. et al., Johns Hopkins University. URL: http://www3.ncbi.nlm.nih.gov/omim/.
11 Abola,E.E., Bernstein,F.C. and Koetzle,T.F. (1988) In: Computational molecular biology. Sources and methods for sequence analysis (Lesk A.M., ed.), pp. 69-81, Oxford University Press, Oxford.