Nucleic Acids Research
FlyNets and GIF-DB, two Internet databases for molecular interactions in Drosophila melanogaster
ABSTRACT
INTRODUCTION
Direct and specific molecular interactions involving DNA, RNA and proteins play an essential role in all known biological processes. Three major types of interactions, i.e., protein-DNA, protein-RNA and protein-protein interactions, account for the great majority of known biological macromolecular interactions. Several general databases exist for each of the three types of informational macromolecules, such as the GenBank (1), EMBL (2) and DDBJ (3) databases for DNA and RNA sequences, the SWISS-PROT (4) and PIR (5) databases for protein sequences and the PDB database (6) for molecular 3D structures. Many more specialised databases exist for specific families of genes, RNAs and proteins, and this issue of Nucleic Acids Research provides the reader with an up-to-date collection of such databases. Despite this abundance of information sources, data describing known specific molecular interactions between genes, RNA and proteins are very often underrepresented in these databases and difficult to query. If one considers protein-DNA interactions, for instance, it is in principle possible to extract specific information about them from the major nucleic acid databases: in the features table of GenBank, EMBL and DDBJ database entries, the 'protein_bind' feature was designed to localise the regions of DNA or RNA sequences which specifically interact with proteins and to identify these proteins. Using the SRS 5.05 retrieval system (7) on GenBank release 102 (August 1997), we found only 18 Drosophila melanogaster sequences (out of 25 315 sequences from this species) in which the 'protein_bind' feature was present, with a total of 59 corresponding DNA or RNA binding sites. Although the number of Drosophila genes and RNAs for which specific interactions with proteins have been published is difficult to estimate, the above numbers clearly underrepresent our present knowledge.
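As an illustration of the kind of count reported above, the following Python sketch tallies 'protein_bind' feature keys in GenBank flat-file text. It is not the SRS query used for the paper; the function name `count_feature` and the two-record sample are invented for illustration, and the only assumption about the format is the standard flat-file layout, in which a feature key follows five leading spaces and each record ends with a `//` line.

```python
import re

# A feature key sits after exactly five leading spaces in a GenBank flat file;
# qualifier lines are indented further, so they do not match.
FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")


def count_feature(flatfile_text, key="protein_bind"):
    """Return (records containing `key`, total occurrences of `key`)."""
    records_with_key = 0
    total = 0
    # Records in a flat file are terminated by a line holding only "//".
    for record in flatfile_text.split("\n//"):
        hits = 0
        for line in record.splitlines():
            match = FEATURE_KEY.match(line)
            if match and match.group(1) == key:
                hits += 1
        total += hits
        if hits:
            records_with_key += 1
    return records_with_key, total


# Invented two-record sample; locus names and coordinates are placeholders.
SAMPLE = """LOCUS       TOYREC1
FEATURES             Location/Qualifiers
     source          1..300
     protein_bind    120..135
                     /bound_moiety="hypothetical factor A"
     protein_bind    201..214
                     /bound_moiety="hypothetical factor B"
//
LOCUS       TOYREC2
FEATURES             Location/Qualifiers
     source          1..150
//
"""

print(count_feature(SAMPLE))
```

On the invented sample, one record out of two carries the feature, with two annotated sites in total.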
Conversely, on the protein database side, it is extremely difficult to extract from the SWISS-PROT (4) or PIR (5) databases either a list of proteins which interact with a given gene or a list of genes controlled by a given regulatory protein. The TRANSFAC database (8) gives some precise structural data for transcription factors and their known binding sites. Using Drosophila again as an example, we found in TRANSFAC 40 target genes for transcription factors and a total of 322 corresponding DNA-binding sites from this species, a better result than the one obtained with GenBank. Even in this case, however, data essential for understanding transcription factor function in its specific biological context are missing, such as: the developmental stage at which the interaction occurs, the phenotype of animals in which the transcription factor is absent or mutated, the biological result of the interaction (gene activation or repression), the organisation of the cis-regulatory region and the experimental evidence for the interaction. As far as protein-RNA and protein-protein interactions are concerned, and although some very specific databases do exist, such as the MHCPEP database of MHC binding peptides (9), data on these interactions are also not easy to extract from existing databases.
We are interested in the process of pattern formation in Drosophila and in understanding the basis of specific identity acquisition by the different body parts (10-12). Different classes of genes involved in the segmentation process (maternal, gap, pair-rule and segment polarity genes) divide the embryo along the antero-posterior axis into repeated homologous units (13,14) which will develop specific identities and morphogenetic features under the control of homeotic genes (15). Specific interactions within and between these gene families are essential for the establishment of a correct body pattern. Being able to access, query and manipulate the data on these developmental genes and their functional interactions within specific regulatory networks is now recognised as an important need for developmental and molecular biologists studying gene regulation. From a more general point of view, a basic knowledge of all other known macromolecular interactions in Drosophila would certainly help to integrate our knowledge on the structure and function of the individual genes into a unified and physiological view of the organism. To achieve these goals, the development of new computer and information science tools is needed, among which that of interaction databases would probably be one of the most useful.
In this paper, we describe the concepts, organisation, content and use of two Drosophila interaction databases: GIF-DB, which focuses on developmental molecular interactions, and FlyNets, a new general Drosophila interaction database. Finally, we present SOS-DGDB, a new collection of Drosophila DNA sequences in which binding sites for regulatory proteins are annotated on the primary sequence and hyperlinked to GIF-DB and TRANSFAC database entries.
INTERACTION DEFINITIONS AND CLASSIFICATION
Gene molecular interactions should not be mistaken for genetic interactions. The latter are more general and include both indirect and direct interactions. Our working definition of a gene molecular interaction is the following: there is a direct molecular interaction between gene A and gene B if gene A or one of its products (i.e., mRNA or protein) physically interacts at the molecular level with gene B or one of its products (mRNA or protein). Among the six different molecular interaction types which could then theoretically be considered, we have focused on only three major types, which are by far the best documented, whatever the organism considered: protein-DNA interactions (type I), protein-RNA interactions (type II) and protein-protein interactions (type III).
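One way to arrive at the figure of six theoretical interaction types is to count unordered pairs of the three kinds of macromolecule, allowing both partners to be of the same kind. The Python sketch below enumerates them under that reading, which is our interpretation and is not spelled out in the text; the names `MOLECULES`, `RETAINED` and `classify` are illustrative only.

```python
from itertools import combinations_with_replacement

MOLECULES = ("DNA", "RNA", "protein")

# Unordered pairs, partners of the same kind allowed: six pair types in all.
ALL_TYPES = list(combinations_with_replacement(MOLECULES, 2))

# The three types retained in the text, with their Roman-numeral labels.
RETAINED = {
    ("DNA", "protein"): "I",
    ("RNA", "protein"): "II",
    ("protein", "protein"): "III",
}


def classify(partner_a, partner_b):
    """Return 'I', 'II' or 'III' for a retained type, or None otherwise."""
    # Sort the two partners into the canonical order used by ALL_TYPES.
    pair = tuple(sorted((partner_a, partner_b), key=MOLECULES.index))
    return RETAINED.get(pair)


for pair in ALL_TYPES:
    print(pair, classify(*pair))
```

The three pairs that return `None` (DNA-DNA, DNA-RNA and RNA-RNA) are the theoretical types left aside.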
In order to simplify the management of data on interactions, and in accordance with the above definition of interaction types, all molecular interactions in the GIF-DB and FlyNets databases are described as binary interactions (i.e., interactions occurring between two molecular partners). This could be viewed as a limitation if one considers what is already known about the complexity of gene interactions. However, within certain limits, any complex interaction which involves more than two partners (an interaction between a DNA sequence and several different transcription factors, or between several proteins within a multimeric complex, for instance) can be split up into several binary interactions in order to be described. This binary point of view is also well adapted to the current experimental approach to genetic and molecular interactions, in which only two entities are usually studied at a time. Note that the binding of several molecules of the same transcription factor to different cis-regulatory binding sites of a given gene is considered as one interaction only, in which the individual molecular interaction events represent sub-interaction components. The same is true for the interaction between an RNA molecule and several identical RNA-binding protein molecules, and for the interaction of several copies of two different proteins within a multimeric complex.
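The decomposition rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to build the databases: the function names and the placeholder identifiers (`factorA`, `geneX`, `P1`, ...) are invented, and splitting a multimeric complex into all pairs of distinct subunits is a simplification, since not every pair in a real complex need be in direct contact.

```python
from collections import defaultdict
from itertools import combinations


def dna_complex_to_binary(target_gene, bound_sites):
    """Group (factor, site) observations into one interaction per factor.

    Several sites bound by the same factor on the same gene become
    sub-interactions of a single binary interaction.
    """
    interactions = defaultdict(list)
    for factor, site in bound_sites:
        interactions[(factor, target_gene)].append(site)
    return dict(interactions)


def protein_complex_to_binary(subunits):
    """Split a complex into binary interactions between distinct proteins.

    Multiple copies of the same protein collapse to one partner, so several
    copies of two different proteins still give a single interaction.
    """
    distinct = sorted(set(subunits))
    return list(combinations(distinct, 2))


# Placeholder data, not taken from any database entry.
print(dna_complex_to_binary(
    "geneX",
    [("factorA", "site1"), ("factorA", "site2"), ("factorB", "site3")],
))
print(protein_complex_to_binary(["P1", "P1", "P2", "P2"]))
```

In this sketch a complex made only of copies of one protein yields no binary interaction; handling homomeric complexes would need an explicit choice that the text does not specify.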
THE GIF-DB DATABASE
Purpose and leading concepts of GIF-DB
GIF-DB, the Gene Interactions in the Fly Database, is a WWW database which aims at providing a repository for data on gene interactions involved in Drosophila embryonic development and the regulatory networks in which they are involved. The first version of the database appeared on the WWW in October 1995 and the concepts, database organisation and entry format of GIF-DB have previously been described (16). Briefly, four main leading concepts and goals were considered to elaborate GIF-DB.
Database organisation and entry format
In order to fulfil the above requirements, we have developed a generic structured hypertext format. The GIF-DB interaction database, which makes use of this format, is a collection of hypertext files, each of them describing an interaction between two molecular partners as discussed above. Each entry contains biological information which has been arranged into an 'EMBL-like' or 'SWISS-PROT-like' model format. Wherever possible, symbols and nomenclature expected to be familiar to drosophilists, geneticists, biochemists and molecular biologists are used to describe the interactions, and some conventions used in FlyBase, the genetic and molecular Drosophila database (17), have been followed. All scientific data found in GIF-DB come from the literature. The information extracted from different articles is compiled (and synthesised if necessary), verified and entered in DEXIFLY (Horn et al., manuscript submitted), a Drosophila relational database built using the 4th Dimension™ program (ACI, Inc.) on a Macintosh computer. The HTML files constituting GIF-DB are then automatically generated from this database.
Each entry in the GIF-DB database is composed of lines and different types of lines (each having its own format) are used to record the various types of gene interaction information which make up the entry. As is the case for the EMBL and SWISS-PROT databases, each line in a GIF-DB entry begins with a two-character line code indicating the type of information contained in the line. Wherever possible, we have tried to use the linetypes already established by the EMBL and SWISS-PROT databases, but due to the specificity of the GIF-DB database, many new linetypes had to be introduced. Some of the original linetypes deserve a few comments. In particular, information about the cis-regulatory regions in protein-DNA interaction entries is given with an increasingly detailed view by the group of RR (Regulatory Region location), RS (number and strength of Regulatory Sites) and SS (regulatory Sites Sequence) lines. Several lines contain hypertext pointers towards other databases and at the moment, links towards FlyBase, GenBank, EMBL and SWISS-PROT databases are supported. For the sake of clarity, the 40 different linetypes in a GIF-DB entry have been arranged into five zones: the ENTRY zone, the EFFECTOR zone, the TARGET zone, the INTERACTION zone and the
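A line-oriented format of this kind is straightforward to read programmatically. The Python sketch below groups the lines of an entry by their two-character code. It assumes only that the code occupies the first two characters of each line, as stated above; the exact spacing after the code and the sample line contents are invented placeholders, and `parse_entry` is a hypothetical helper, not part of GIF-DB.

```python
def parse_entry(text):
    """Group the lines of an entry by their two-character line code."""
    fields = {}
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        code, content = raw[:2], raw[2:].strip()
        # A line type may occur several times, so keep a list per code.
        fields.setdefault(code, []).append(content)
    return fields


# Invented sample using the RR, RS and SS line codes named in the text;
# the contents after each code are placeholders, not real GIF-DB data.
SAMPLE_ENTRY = """RR   upstream region (placeholder)
RS   2 sites (placeholder)
SS   site1 ACGTACGT
SS   site2 TTGACCAA
"""

entry = parse_entry(SAMPLE_ENTRY)
print(entry["SS"])
```

Keeping a list per code preserves the order of repeated lines, which matters for the SS lines where each line describes one site.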
Recent developments
Version 2.0 of GIF-DB (January 1997) contains 25 entries and each of them is ~4-6 pages long, with an average of 6-10 associated bibliographic references. Ten new entries have been added and the majority of Version 1.2 entries have been updated. The major change in Version 2.0 is the creation of a collection of annotated DNA sequences linked to GIF-DB (Fig. 1). This collection of sequences has been named SOS-DGDB (Sites On Sequences Drosophila Gene DataBase). A click on any site in the SS lines of a given GIF-DB entry opens the corresponding SOS-DGDB sequence entry and points directly at the position of the selected site. This possibility allows the user to obtain an integrated view of the sites for all different trans-acting factors acting upon a cis-regulatory region within the context of the sequence of the target gene. From each binding site, two hyperlinks are available which point either back to the GIF-DB database or to the corresponding site in the TRANSFAC (8) database (if available). At present, our SOS-DGDB collection (Version 1.0) contains sequences for 16 genes and ~200 binding sites have been color-highlighted on the sequences. Different colors are used to discriminate between the several families of transcription factors. As has been extensively discussed in Arnone and Davidson (18), identification and analysis of many cis-regulatory elements is of central importance for understanding the function of regulatory networks and of the genomic program for development. The development of databases like SOS-DGDB is a step in that direction, but new specific tools devoted to the analysis and comparison of regulatory sequences have still to be created.
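The idea of marking binding sites directly on a sequence, with one colour per transcription factor family and a named anchor that a hyperlink can point at, can be sketched as follows. This is not the SOS-DGDB generator: the colour table, the site identifiers, the family names and the HTML markup chosen here are all assumptions made for illustration, and overlapping sites are deliberately not handled.

```python
from html import escape

# Hypothetical colour assignment per transcription factor family.
FAMILY_COLORS = {"family1": "red", "family2": "blue"}


def annotate(sequence, sites):
    """Return HTML in which each site is a coloured, named anchor.

    `sites` is a list of dicts with keys `start`, `end` (0-based, end
    exclusive), `id` and `family`. Sites must not overlap in this sketch.
    """
    out = []
    pos = 0
    for site in sorted(sites, key=lambda s: s["start"]):
        start, end = site["start"], site["end"]
        if start < pos:
            raise ValueError("overlapping sites are not handled in this sketch")
        out.append(escape(sequence[pos:start]))  # unannotated stretch
        color = FAMILY_COLORS.get(site["family"], "black")
        anchor = escape(site["id"], quote=True)
        bases = escape(sequence[start:end])
        out.append(f'<a name="{anchor}"><font color="{color}">{bases}</font></a>')
        pos = end
    out.append(escape(sequence[pos:]))  # trailing unannotated stretch
    return "".join(out)


print(annotate(
    "ACGTACGT",
    [{"start": 2, "end": 4, "id": "site1", "family": "family1"}],
))
```

A link of the form `entry.html#site1` (file name hypothetical) would then open the sequence page at the position of the selected site.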
Figure 1

THE FlyNets DATABASE
Purpose and leading concepts
A database like GIF-DB could gain added value if the developmental networks that it contains could also be viewed and analysed in the physiological context of all other regulatory networks occurring in the organism. Although a large number of the molecular interactions constituting these networks have been studied, no database or even a simple list of these interactions is as yet available. FlyBase (17) recently started to compile genetic interactions (M. Ashburner, personal communication). Collecting and organising data about all direct molecular interactions in Drosophila for which published experimental evidence is available would therefore represent a major advance in our knowledge of interaction networks. However, describing all these interactions (which we estimate to number a few thousand) with the level of precision that GIF-DB entries have presently reached is currently not feasible, since preparing a GIF-DB entry represents, on average, 20-30 hours of work (including critical reading and analysis of the original papers). We therefore decided to build another interaction database with fewer information fields than GIF-DB, and for which part of the preparation work could be automated in the future. This database is called FlyNets, and our goal is to progressively integrate in it data about the majority of known molecular interactions in Drosophila.
Database organisation and entry format
The FlyNets database shares many features with the GIF-DB database: the definition of three major interaction classes and the general organisation of entries into five zones have been conserved. This is also true for almost all FlyNets linetypes, which are the same as in GIF-DB, for the conventions adopted in the line contents (described in the on-line FlyNets-primer document), for the line length (80 characters) and for the format of the references. The two main differences are the number of linetypes supported and the format of the line headers. A total of 31 linetypes (instead of 40 in GIF-DB) are presently supported. The comments line in FlyNets now groups together data which were present in several different comment lines in GIF-DB. This line is organised in a way similar to that of the SWISS-PROT CC line, in which different types of comments are arranged in as many sub-comments. The second difference from GIF-DB is the format of the linetype headers: each line in a GIF-DB entry begins with a two-character line code indicating the type of information contained in the line, as is the case for the EMBL and SWISS-PROT databases. Since several users of GIF-DB found it difficult to memorise the meaning of a two-letter code for 40 linetypes, we have decided to adopt a different convention in FlyNets. The line headers now have a GenBank-like format and are explicit words or groups of words with a maximum length of 20 characters (e.g. identificator, creation date, target function, authors, etc.). As is the case for GIF-DB, the information necessary to build FlyNets entries is extracted from different scientific articles, compiled, verified and entered into a relational database (F.Horn, unpublished) from which the HTML files constituting FlyNets are then automatically generated. Version 1.0 of FlyNets (January 1997) contains ~70 interaction entries. A typical example of a FlyNets entry is shown in Figure 2.
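The two layout constraints given above (explicit headers of at most 20 characters and lines of at most 80 characters) can be illustrated with a small Python formatter. The column layout used here, a header padded to 20 characters followed by one space and with continuation lines indented to the same column, is an assumption made for the sketch; the actual FlyNets layout is defined in the FlyNets-primer document, and `format_field` is a hypothetical helper.

```python
import textwrap

HEADER_WIDTH = 20   # maximum header length stated in the text
LINE_WIDTH = 80     # line length stated in the text


def format_field(header, content):
    """Return the lines of one field, each at most LINE_WIDTH characters."""
    if len(header) > HEADER_WIDTH:
        raise ValueError(f"header longer than {HEADER_WIDTH} characters")
    body_width = LINE_WIDTH - HEADER_WIDTH - 1
    chunks = textwrap.wrap(content, width=body_width) or [""]
    # First line carries the header; continuation lines are indented.
    lines = [f"{header:<{HEADER_WIDTH}} {chunks[0]}"]
    lines += [" " * (HEADER_WIDTH + 1) + chunk for chunk in chunks[1:]]
    return lines


# Placeholder content; only the header name comes from the text.
for line in format_field("target function", "placeholder text " * 12):
    print(line)
```

Explicit headers cost more of the 80-character line than a two-letter code does, which is the trade-off accepted in exchange for readability.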
Figure 2

INTERACTIVE WWW ACCESS TO THE INTERACTION DATABASES
The GIF-DB and FlyNets databases can be accessed using the World Wide Web through the GIFTS (Gene Interaction in the Fly Transworld Server) WWW server in Marseille. To access the WWW, one needs a WWW browser such as Netscape Navigator™ (from Netscape Communications Corp.) and a link to the Internet. The URL (Uniform Resource Locator, the addressing system used in the WWW) of the GIFTS server is http://gifts.univ-mrs.fr/GIFTS_home_page.html . In addition to giving access to these databases, the GIFTS server also provides services such as GIN, a series of annotated pages to help navigate the Internet, and BLASTula, a specialised service giving integrated access to more than 40 different Blast analysis sites around the world. Many of these Blast servers operate on collections of new DNA sequences from the different genome projects which are not yet integrated in the EMBL, GenBank and DDBJ general databases.
There are two different ways to access GIF-DB data once the connection with the server is established: either through a hypertext list of all available entries accessible from the GIF-DB home page or through a query search program. The search can be performed either on the entire database or in any one of the 40 different data lines of all entries. At the moment, FlyNets data is accessible through the hypertext list of entries only, and a query program will soon be available.
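The two search modes described above, over whole entries or restricted to one line type, can be sketched in Python as follows. This is not the GIFTS query program, whose implementation is not described here; the in-memory representation (a dictionary of entries, each mapping a line code to a list of lines), the entry identifiers and the line contents are all invented for illustration.

```python
def search(entries, term, line_code=None):
    """Return ids of entries containing `term` (case-insensitive).

    If `line_code` is given, only lines of that type are searched;
    otherwise every line of every entry is searched.
    """
    term = term.lower()
    hits = []
    for entry_id, fields in entries.items():
        codes = [line_code] if line_code else list(fields)
        if any(
            term in line.lower()
            for code in codes
            for line in fields.get(code, [])
        ):
            hits.append(entry_id)
    return hits


# Placeholder entries; identifiers and contents are not real GIF-DB data.
ENTRIES = {
    "entry1": {"RR": ["upstream region"], "SS": ["site1 ACGTACGT"]},
    "entry2": {"RR": ["intron region"], "SS": ["site2 TTGACCAA"]},
}

print(search(ENTRIES, "region"))              # whole-database search
print(search(ENTRIES, "acgt", line_code="SS"))  # restricted to SS lines
```

Restricting the search to one line type avoids hits where the term appears only incidentally, for example in a comment or a reference.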
CONCLUSIONS AND FUTURE PROSPECTS
Interaction databases such as GIF-DB and FlyNets provide a simple and straightforward way to make functional links between specific entries from different molecular databases. Such functional links are a useful complement to the structural links (present as database cross-references) existing between many EMBL (or GenBank) entries and their corresponding translation products in SWISS-PROT (or PIR-International).
Once the different genome projects have provided us with extensive catalogues of genes and proteins for several organisms, it will be essential to describe how and with which other molecules these components establish specific interactions, knowledge which cannot be deduced from their sequences or structures. In what is now called the 'post-genome' era, both new experimental methods and new bioinformatics concepts and tools are needed to gradually paint the complicated picture of biological pathways. In this respect, on the informatics side, interaction databases such as GIF-DB and FlyNets, as well as metabolic databases such as EcoCyc (19) or KEGG (20), are likely to play increasingly important roles in the near future. We have recently become aware of the existence of GeNet, a gene networks database (21) in which Drosophila segmentation networks are also described. Some concepts originally developed in the GIF-DB database have been introduced in the GeNet database. A nice feature of this database is the presence of interactive schematic representations of regulatory pathways linked to gene entries. From an evolutionary perspective, building interaction databases for different organisms would be extremely interesting, since it would provide a means to see to what extent homologous genes work through homologous regulatory pathways.
Within the next few years, we plan to offer new possibilities within GIF-DB and FlyNets through the addition of a few new linetypes and of hyperlinks towards other databases. Among the scheduled improvements are links towards Medline PubMed abstracts, the Interactive Fly, a cyberspace guide to Drosophila genes and their roles in development (22), and Flyview (23), a database on expression patterns of Drosophila genes. Part of our data on interactions has now been included in KNIFE, a knowledge base presently under development (24), within which graphical representations of interactions and regulatory networks are automatically generated. This represents a first step towards the simulation of some aspects of the dynamic behaviour of developmental genetic regulatory networks.
Finally, many efforts will also be devoted to the problem of interaction data acquisition. Recent results (Pillet et al., manuscript in preparation) have shown that it is possible, using textual statistics techniques, to retrieve in a semi-automatic way a list of interactions from a collection of article mini-abstracts found in the FlyBase database. We are presently working to improve the efficiency of our methodology and plan to apply it to extract information on molecular interactions from Medline abstracts.
CITING AND CONTACTING GIF-DB, FlyNets OR SOS-DGDB
If you use GIF-DB, FlyNets or SOS-DGDB as tools in your published research work, please cite this paper. Comments and inquiries about GIF-DB, FlyNets or SOS-DGDB are welcome and should be sent to Bernard Jacq (e-mail: jacq{at}lgpd.univ-mrs.fr).
ACKNOWLEDGEMENTS
We would like to thank J. Euzenat, F. Rechenmann, L. Quoniam, L. Fasano and M. Djabali for helpful discussions on computer and biological issues about interactions. This work has been supported by an ACC-SV grant from the MESR (Ministère de l'enseignement et de la recherche) to B.J. and F.Rechenmann and by the CNRS.
REFERENCES
Last modification: 17 Dec 1997
Copyright© Oxford University Press, 1998.