Nucleic Acids Research
Virgil: a database of rich links between GDB and GenBank
ABSTRACT
BACKGROUND
Links between biological objects are frequently used both by individuals (e.g., for data browsing) and by software (e.g., for data analysis or database interconnection). However, the links found in the major databases are too often difficult to retrieve, inconsistent, or insufficiently documented and maintained. To address these problems, we propose Virgil, a database dedicated to the management and distribution of rich links.
RICH LINKS
Virgil focuses on storing rich links between GDB genes (1) and GenBank human sequences (2).
Virgil links are imported from GDB (10 155 links) and GenBank (3433 links). Virgil also contains 10 677 links that were automatically generated by the genXref system (3).
This yields 18 667 links, each referred to by a unique Virgil identifier. From a random sample of 170 links, we estimate that 86% (±5%) of the links in Virgil are relevant.
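The figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the ±5% margin is a 95% normal-approximation interval for a proportion, which the text does not state, and it assumes that the gap between the per-source counts and the total corresponds to links reported by more than one source. The function and variable names are hypothetical.

```python
import math


def proportion_margin(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumption: z = 1.96 (95% confidence); the source does not name its method.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Figures quoted in the text.
links_by_source = {"GDB": 10_155, "GenBank": 3_433, "genXref": 10_677}
total_links = 18_667
sample_size = 170
relevant_fraction = 0.86

# Links counted once per source minus distinct links in Virgil.
# Assumption: the difference is links found by more than one source.
overlap = sum(links_by_source.values()) - total_links

margin = proportion_margin(relevant_fraction, sample_size)
low = (relevant_fraction - margin) * total_links
high = (relevant_fraction + margin) * total_links

print(f"overlap between sources: {overlap}")
print(f"margin of error: {margin:.3f}")
print(f"estimated relevant links: {low:.0f} to {high:.0f}")
```

Under these assumptions the computed margin rounds to the ±5% quoted in the text.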
IMPLEMENTATION
Virgil uses an object-oriented engine, EYEDB, to model and manage the data. EYEDB is developed by Sysra informatique; technical documents are available from http://www.infobiogen.fr/services/eyedb .
Simple Virgil searches are available from a Web form, as shown in Figure 1a. The form retrieves all the links attached to a remote biological object when its unique identifier (such as GDB:128600 or GENBANK:M61764) is entered. One can also enter a Virgil link identifier (such as VIRGIL:6661) to retrieve a rich link. The ad hoc query form returns a list of hyperlinks to EYEDB link objects. Such a link is shown in Figure 1b. Navigating through the generic EYEDB Web interface gives access to the data that constitute rich links, such as the objects shown in Figure 1c, d and e.
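As an illustration of the identifier convention used by the search form, the following sketch splits an identifier into its database prefix and accession and distinguishes link identifiers from remote object identifiers. It is not part of Virgil: the names `ObjectId`, `parse_identifier` and `is_link_id` are hypothetical, and the set of accepted prefixes is limited to those appearing in the examples above.

```python
from typing import NamedTuple

# Database prefixes that appear in the examples in the text.
KNOWN_PREFIXES = {"GDB", "GENBANK", "VIRGIL"}


class ObjectId(NamedTuple):
    database: str
    accession: str


def parse_identifier(text: str) -> ObjectId:
    """Split 'DATABASE:ACCESSION' into its two parts and validate the prefix."""
    database, sep, accession = text.strip().partition(":")
    database = database.upper()
    if not sep or not accession:
        raise ValueError(f"expected DATABASE:ACCESSION, got {text!r}")
    if database not in KNOWN_PREFIXES:
        raise ValueError(f"unknown database prefix {database!r}")
    return ObjectId(database, accession)


def is_link_id(oid: ObjectId) -> bool:
    """True for a Virgil link identifier, False for a remote object identifier."""
    return oid.database == "VIRGIL"
```

A client could use such a check to decide whether to look up a single rich link or all links attached to a remote object.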
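[Figure 1: (a) Virgil Web search form; (b) an EYEDB link object; (c, d, e) objects that constitute a rich link.]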
Expert queries to Virgil can be performed directly by means of a generic Web interface. It allows one to enter a query in EYEDB OQL (Object Query Language); facilities for building such queries are provided. We also provide a prototype CORBA server for programmed access. The services delivered to a client by CORBA objects are publicly described by means of an IDL (Interface Definition Language). As an illustration, we implemented two CORBA clients for querying Virgil. We refer the reader to 'Ubiquitous Distributed Objects with CORBA' (4) for an overview of CORBA.

DATA MODEL
The Virgil schema was designed to comprehensively describe a link between two biological objects.
A Virgil link is a bi-directional connection between two database objects. Each of the two objects is referred to by a unique identifier, prefixed with the database name.
Links are characterized by means of annotations. An annotation contains the name of the author (an individual, a database or a program), the method supporting the creation of the link, the status (VALIDATED, PUTATIVE or DELETED) and the belief value (a normalized value between 0 and 1). The latter two attributes give the author's judgment on the link quality.
The global status of a link is inferred from the statuses given by all of its annotators. For example, if some annotators are known for the quality of their annotations, their annotation status is passed on as the global status of the link.
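The link, annotation and global-status notions described above can be sketched as a small data structure. This is a minimal illustration, not the EYEDB schema: the class and field names are hypothetical, and the rule in `global_status` (the first annotation from a trusted author wins, otherwise PUTATIVE) is only one possible reading of the inference described in the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Status(Enum):
    VALIDATED = "VALIDATED"
    PUTATIVE = "PUTATIVE"
    DELETED = "DELETED"


@dataclass(frozen=True)
class Annotation:
    author: str     # an individual, a database or a program
    method: str     # method supporting the creation of the link
    status: Status  # the author's judgment on the link
    belief: float   # normalized value between 0 and 1

    def __post_init__(self) -> None:
        if not 0.0 <= self.belief <= 1.0:
            raise ValueError("belief must lie between 0 and 1")


@dataclass
class Link:
    link_id: str    # identifier prefixed with VIRGIL:
    left: str       # identifier prefixed with a database name, e.g. GDB:
    right: str      # identifier prefixed with a database name, e.g. GENBANK:
    annotations: List[Annotation] = field(default_factory=list)


def global_status(link: Link, trusted_authors: Set[str]) -> Status:
    """Hypothetical rule: the first annotation by a trusted author decides.

    Assumption: without any trusted annotation the link stays PUTATIVE.
    """
    for annotation in link.annotations:
        if annotation.author in trusted_authors:
            return annotation.status
    return Status.PUTATIVE
```

Keeping the per-author annotations and deriving the global status on demand means that the set of trusted annotators can change without rewriting the stored links.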
In Virgil, much attention was given to describing the meaning of the objects. A controlled vocabulary (as opposed to free text) is used to facilitate programmed access. At present Virgil contains three object types (new object types can be created on demand). VIRGIL.UNION specifies a union link between two parts of the same biological object (this terminology is imported from ref. 5); GENBANK.NUCLEIC_SEQUENCE and GDB.GENE specify two types of objects from remote databases.
DATA CURATION
The default status of a link is PUTATIVE. A link is VALIDATED when it is imported from a population where >95% of links are relevant. This is the case for links imported from GDB and GenBank. Links generated by genXref are VALIDATED when the belief value is >0.90 (this threshold corresponds to a population where 95% of the links are relevant). Virgil contains 11 195 VALIDATED links.
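The curation rule above can be written as a short function. This is a sketch under stated assumptions: the source labels ("GDB", "GenBank", "genXref") and the function name `initial_status` are hypothetical, and only the 0.90 threshold and the choice of trusted sources come from the text.

```python
# Sources whose imported links come from a population where >95% are relevant.
TRUSTED_SOURCES = {"GDB", "GenBank"}

# Belief threshold above which a genXref link is VALIDATED.
GENXREF_THRESHOLD = 0.90


def initial_status(source: str, belief: float) -> str:
    """Return the status assigned to a new link from the given source."""
    if source in TRUSTED_SOURCES:
        return "VALIDATED"
    if source == "genXref" and belief > GENXREF_THRESHOLD:
        return "VALIDATED"
    # Default status for every other link.
    return "PUTATIVE"
```

Note that the comparison is strict, so in this sketch a genXref link with a belief value of exactly 0.90 remains PUTATIVE.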
Methods to check data integrity are attached to the link objects. Virgil data will be updated on a bi-monthly basis with every new major GenBank release.
DISCUSSION
A few works make extensive use of links to build complex information systems dedicated to biological data. The getDB system (6) achieves integration of information via linkDB, a bank of the links explicitly specified within any of the 16 molecular databases that compose getDB. Similarly, SRS (7) creates a virtual federation of genome databases. A language allows one to describe the structure of a flat-file library and to define means to extract links between libraries. The program processes indices to allow navigation through all the libraries. A limitation of the SRS system is that it applies only to flat-file libraries, not to relational or object-oriented systems.
In parallel to these efforts, we propose with Virgil a service to distribute exhaustive collections of richer links between GDB genes and GenBank sequences. This is a necessary step to allow a seamless integration of data between a biomolecular and a genomic database.
ACKNOWLEDGEMENTS
Virgil benefitted from the work of Ken Fasman and Stan Letovsky on the Biolinks project; we thank them for this contribution. We are also grateful to the Infobiogen staff for excellent computing support and valuable discussions.
REFERENCES
Last modification: 17 Dec 1997
Copyright © Oxford University Press, 1998.