GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases
Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen, PO Box 9010, 6500GL Nijmegen, The Netherlands; 1Dalicon BV, PO Box 354, 6700AJ Wageningen, The Netherlands; 2Genomics Laboratory, University Medical Centre Utrecht, PO Box 85060, 3508AB Utrecht, The Netherlands; 3Wageningen University and Research Centre, Wageningen, The Netherlands; 4Department of Human Genetics, University Medical Centre Nijmegen, PO Box 9101, 6500HB Nijmegen, The Netherlands
*To whom correspondence should be addressed. Tel: +31 24 36 53391; Fax: +31 24 36 52977; Email: G.Vriend{at}cmbi.ru.nl
Received February 12, 2005. Revised March 25, 2005. Accepted March 25, 2005.
| ABSTRACT |
|---|
The identification of genes underlying human genetic disorders requires the combination of data related to cytogenetic localization, phenotypes and expression patterns, to generate a list of candidate genes. In the field of human genetics, it is normal to perform this combination analysis by hand. We report on GeneSeeker (http://www.cmbi.ru.nl/GeneSeeker/), a web server that gathers and combines data from a series of databases. All database searches are performed via the web interfaces provided with the original databases, guaranteeing that the most recent data are queried, and obviating data warehousing. GeneSeeker makes the same selection of candidate genes as a human geneticist would have made, thus reducing this time-consuming process to a few minutes. GeneSeeker is particularly well suited for syndromes in which the disease gene displays altered expression patterns in the affected tissue(s).
| INTRODUCTION |
|---|
The identification of causative genes in human genetic disorders will be accelerated by the wealth of omics information being generated. Geneticists consult a number of databases to search for these genes. Each database concentrates on a different (molecular) aspect. In addition, databases have their own user interfaces, different formats to present the data and sometimes even their own ontologies. Data, such as gene localization and expression patterns, may be distributed over multiple databases.
Geneticists normally collect phenotypic and/or expression data and the genes in the chromosomal region(s) of interest, and combine these to get a list of candidate genes. The rationale for this is that the gene that causes a disease is most probably expressed in the tissues affected by that disease (1–3). Using model organisms, such as the mouse, it is often possible to obtain information on genes, proteins, protein interactions and other functional attributes that can be transferred to Homo sapiens by means of synteny and protein homology relationships. The use of data from other species (such as mouse) often proves helpful in identifying the location or function of the equivalent human gene (4). GeneSeeker mimics this multi-species identification strategy (5).
| MATERIALS AND METHODS |
|---|
Databases used
Table 1 lists the databases that GeneSeeker queries. These are divided over database groups (DB-groups). All databases are accessed through their standard WWW interfaces except MIMMAP and OXFORD. MIMMAP is a reformatted version of the OMIM (6) gene mapping information. OXFORD is used to translate human to mouse chromosomal locations, and is described in more detail in the pre-processing section. We use SRS (Sequence Retrieval System, Lion Biosciences, Cambridge, UK) to access these two databases (7). The SRS parser was modified to allow searches for chromosomal ranges.
Data processing
The layout of the GeneSeeker web server is shown in Figure 1. The user query consists of a chromosomal band range using standard nomenclature (e.g. 7p15–p21). This cytogenetic localization is passed through DB-group 1. Syntenic regions in the mouse are sought in DB-group 2 using an Oxford grid. Tissues of interest or phenotypic features of a syndrome can be specified by the user as a Boolean expression that is split up and processed by DB-group 3. This modular set-up makes it easy to add extra DB-groups in the future. For every database, a plug-in was designed to perform all tasks from user-query pre-processing to query-result post-processing. These plug-ins deal with a series of technical topics, such as query reformatting, generating the correct URL, filling in the form on that database's web interface, requesting all hits at once rather than in chunks, parsing the database HTML output and so on.
The name of a gene can vary from database to database. The gene for the multi-drug resistance-associated protein 1, for example, is stored as ABCC1, MRP or MRP1, depending on the database used. These gene nomenclature problems have to be solved because GeneSeeker depends on the gene names in the combination steps. For each DB-group the results are integrated with a Boolean OR. The resulting gene lists of the three DB-groups are combined according to the Boolean logic specified in the user query.
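The combination step described above can be sketched in a few lines of Python. This is a minimal illustration, not GeneSeeker's actual code: it assumes each plug-in has already returned a set of normalized gene names, the function names and placeholder gene names are hypothetical, and the plain AND across the three DB-groups stands in for whatever Boolean logic the user query specifies.

```python
from functools import reduce


def union_within_group(database_hits):
    """OR-combine the gene sets returned by every database in one DB-group."""
    return reduce(set.union, database_hits, set())


def combine_groups(group_results):
    """AND-combine the per-group gene sets (assumed logic for this sketch)."""
    if not group_results:
        return set()
    return reduce(set.intersection, group_results)


# Hypothetical plug-in results; one inner set per database in the group.
localization_hits = [{"GENE_A", "GENE_B", "GENE_C"}, {"GENE_B", "GENE_D"}]
synteny_hits = [{"GENE_A", "GENE_B", "GENE_E"}]
expression_hits = [{"GENE_B"}, {"GENE_A", "GENE_F"}]

groups = [
    union_within_group(hits)
    for hits in (localization_hits, synteny_hits, expression_hits)
]
candidates = sorted(combine_groups(groups))
```

Only genes that appear in at least one database of every DB-group survive, which is why consistent gene names across databases matter for this step.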
Implementation issues
Parallelization
The database plug-ins run in parallel to minimize the waiting time. A queuing system prevents excessive loads on remote servers. The plug-ins return the results of the queries to GeneSeeker as a list containing the gene names and corresponding database hyperlinks.
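One way to picture this arrangement is the sketch below. It is an assumption-laden illustration: the paper does not describe how the queuing system works, so a simple semaphore stands in for it here, and the cap of two concurrent requests, the stub plug-ins and the example.org hyperlinks are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT_REQUESTS = 2  # assumed cap; the paper gives no figure
_throttle = threading.Semaphore(MAX_CONCURRENT_REQUESTS)


def run_plugin(plugin, query):
    """Run one plug-in while holding a slot in the shared throttle."""
    with _throttle:
        return plugin(query)


def run_all(plugins, query):
    """Run every plug-in in parallel and collect its hits per database name."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(plugins))) as pool:
        futures = {
            pool.submit(run_plugin, fn, query): name
            for name, fn in plugins.items()
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Stub plug-ins standing in for real database wrappers (no network access).
def stub_db_one(query):
    return [("GENE_A", "http://example.org/db1/GENE_A")]


def stub_db_two(query):
    return [
        ("GENE_A", "http://example.org/db2/GENE_A"),
        ("GENE_B", "http://example.org/db2/GENE_B"),
    ]


hits = run_all({"db_one": stub_db_one, "db_two": stub_db_two}, "7p15-p21")
```

Each plug-in returns a list of (gene name, hyperlink) pairs, mirroring the result format described in the text.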
Mouse–human synteny
An Oxford grid (8) is used to find the homologous genes and gene regions in the mouse genome for all human chromosome locations entered by the user. A human chromosomal band range is translated into the corresponding mouse chromosome locations. Two mouse locations are combined if the genetic distance is shorter than a user-specified value (defaults to 10 cM). We regenerate this Oxford grid weekly to ensure that the latest synteny information is used in each query.
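The merging rule for nearby mouse locations can be illustrated as follows. This is a sketch under stated assumptions: each location is represented as a hypothetical (chromosome, start_cM, end_cM) tuple, the example coordinates are invented, and "genetic distance" is taken to mean the gap between the end of one region and the start of the next on the same chromosome.

```python
def merge_mouse_regions(regions, max_gap_cm=10.0):
    """Merge regions on the same chromosome whose gap is below max_gap_cm.

    Each region is a (chromosome, start_cM, end_cM) tuple.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        same_chrom = bool(merged) and merged[-1][0] == chrom
        if same_chrom and start - merged[-1][2] < max_gap_cm:
            # Close enough to the previous region: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Invented coordinates, for illustration only.
regions = [
    ("6", 20.0, 25.0),
    ("6", 31.0, 33.0),
    ("6", 50.0, 52.0),
    ("12", 8.0, 9.0),
]
result = merge_mouse_regions(regions)
```

With the default of 10 cM, the first two regions on chromosome 6 (6 cM apart) are combined, while the third (17 cM further on) stays separate.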
Gene nomenclature
Inconsistent gene nomenclature is resolved using gene synonym information from the UniProt database (9). We use the MGD human homologues information to interconvert mouse and human gene names. We maintain local copies of these conversion tables because nearly all queries require that gene nomenclature problems be solved.
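A local conversion table of this kind might be applied as in the sketch below. The table layout, the function names, the choice of ABCC1 as the canonical symbol and the case-insensitive matching are assumptions made for illustration; only the ABCC1/MRP/MRP1 synonym example comes from the text.

```python
# Hypothetical synonym table: canonical symbol -> known aliases.
SYNONYMS = {"ABCC1": {"MRP", "MRP1"}}


def build_alias_index(synonyms):
    """Build a lookup from every upper-cased name to its canonical symbol."""
    index = {}
    for canonical, aliases in synonyms.items():
        index[canonical.upper()] = canonical
        for alias in aliases:
            index[alias.upper()] = canonical
    return index


def normalize(names, alias_index):
    """Map each name to its canonical symbol; unknown names pass through."""
    return {alias_index.get(name.upper(), name) for name in names}


ALIAS_INDEX = build_alias_index(SYNONYMS)
normalized = normalize(["MRP", "mrp1", "ABCC1", "GENE_X"], ALIAS_INDEX)
```

After normalization the three spellings collapse into one entry, so set operations across databases treat them as the same gene.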
User interface
The GeneSeeker interface consists of the query form shown in Figure 2 and an options form that usually requires no user input. A genetic localization and the phenotypic/expression terms should be entered for a meaningful search. Databases that generate more noise than signal can be removed from the query. The user can also suppress the display of housekeeping genes or a specified list of genes. The options form contains a thesaurus (10) that can help the user to select the correct expression terms: for example, when the user is interested in a genetic trait that results in abnormalities in the brain, selection of the brain category returns the hints 'brain or hindbrain or forebrain...'. Hints for the genetic localization data can be found in a table containing frequently aberrant chromosomal bands in specific disorders taken from the literature (11,12). On request, the user can be notified by email when a GeneSeeker search has completed. All parameters are linked to help screens. The results are presented in four tables (Figure 3).
| RESULTS AND DISCUSSION |
|---|
GeneSeeker offers a user-friendly quick scan of several databases that are commonly used by geneticists to identify candidate genes for specific Mendelian diseases. As such, GeneSeeker uses those databases that are most appropriate for the questions asked. Several aspects are likely to change in the near future as genomics and genetics develop. For example, our usage of an Oxford grid can be improved or replaced as soon as the various databases reach consensus about the localization of genes on the mouse and human genomes. Expression pattern information (e.g. microarray data) is growing rapidly, and is expected to become useful for GeneSeeker in the near future. At the moment, publicly available expression information is still sparse, scattered and not yet standardized.
In its present form, GeneSeeker is best suited for syndromes in which one can assume aberrant or absent gene expression in the affected tissues. GeneSeeker allows the user to query heterogeneous databases and obtain good candidate genes for the disease of interest based on positional, expression and model data (5). With the present hardware set-up GeneSeeker can perform 1000 searches per day.
| ACKNOWLEDGEMENTS |
|---|
We thank David Thomas for helpful corrections to the manuscript. This work was supported by NWO/Unilever, the Irene Kinderziekenhuis Foundation and the EU FP6 Programme (LHSG-CT-2003-503265). Funding to pay the Open Access publication charges for this article was provided by Radboud University Nijmegen.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. Blackshaw, S., Fraioli, R.E., Furukawa, T., Cepko, C.L. (2001) Comprehensive analysis of photoreceptor gene expression and the identification of candidate retinal disease genes. Cell, 107, 579–589.
2. den Hollander, A.I., van Driel, M.A., de Kok, Y.J., van de Pol, D.J., Hoyng, C.B., Brunner, H.G., Deutman, A.F., Cremers, F.P. (1999) Isolation and mapping of novel candidate genes for retinal disorders using suppression subtractive hybridization. Genomics, 58, 240–249.
3. Dryja, T.P. (1997) Gene-based approach to human gene–phenotype correlations. Proc. Natl Acad. Sci. USA, 94, 12117–12121.
4. Chiang, A.P., Nishimura, D., Searby, C., Elbedour, K., Carmi, R., Ferguson, A.L., Secrist, J., Braun, T., Casavant, T., Stone, E.M., et al. (2004) Comparative genomic analysis identifies an ADP-ribosylation factor-like gene as the cause of Bardet–Biedl syndrome (BBS3). Am. J. Hum. Genet., 75, 475–484.
5. van Driel, M.A., Cuelenaere, K., Kemmeren, P.P., Leunissen, J.A., Brunner, H.G. (2003) A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet., 11, 57–63.
6. Hamosh, A., Scott, A.F., Amberger, J., Bocchini, C., Valle, D., McKusick, V.A. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52–55.
7. Etzold, T., Ulyanov, A., Argos, P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.
8. Edwards, J.H. (1991) The Oxford Grid. Ann. Hum. Genet., 55, 17–31.
9. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32, D115–D119.
10. van Steensel, M.A., Celli, J., van Bokhoven, J.H., Brunner, H.G. (1999) Probing the gene expression database for candidate genes. Eur. J. Hum. Genet., 7, 910–919.
11. Brewer, C., Holloway, S., Zawalnyski, P., Schinzel, A., FitzPatrick, D. (1998) A chromosomal deletion map of human malformations. Am. J. Hum. Genet., 63, 1153–1159.
12. Brewer, C., Holloway, S., Zawalnyski, P., Schinzel, A., FitzPatrick, D. (1999) A chromosomal duplication map of malformations: regions of suspected haplo- and triplolethality--and tolerance of segmental aneuploidy--in humans. Am. J. Hum. Genet., 64, 1702–1708.
13. Veugelers, M., Bressan, M., McDermott, D.A., Weremowicz, S., Morton, C.C., Mabry, C.C., Lefaivre, J.F., Zunamon, A., Destree, A., Chaudron, J.M., et al. (2004) Mutation of perinatal myosin heavy chain associated with a Carney complex variant. N. Engl. J. Med., 351, 460–469.
14. Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., et al. (2003) Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res., 31, 142–146.
15. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res., 31, 193–195.
16. Letovsky, S.I., Cottingham, R.W., Porter, C.J., Li, P.W. (1998) GDB: the Human Genome Database. Nucleic Acids Res., 26, 94–99.
17. Ringwald, M., Eppig, J.T., Begley, D.A., Corradi, J.P., McCright, I.J., Hayamizu, T.F., Hill, D.P., Kadin, J.A., Richardson, J.E. (2001) The Mouse Gene Expression Database (GXD). Nucleic Acids Res., 29, 98–101.
18. Woychik, R.P., Wassom, J.S., Kingsbury, D., Jacobson, D.A. (1993) TBASE: a computerized database for transgenic animals and targeted mutations. Nature, 363, 375–376.