Nucleic Acids Research Advance Access originally published online on May 3, 2007
Nucleic Acids Research 2007 35(Web Server issue):W212-W216; doi:10.1093/nar/gkm223
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W212-W216
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Articles |
Update of the G2D tool for prioritization of gene candidates to inherited diseases
1Ontario Genomics Innovation Centre, Ottawa Health Research Institute, 501 Smyth, Ottawa, ON, Canada K1H 8L6, 2European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany and 3Department of Cellular and Molecular Medicine, Faculty of Medicine, University of Ottawa, Canada
*To whom correspondence should be addressed. Tel: +1 613 737 8899 Ext 73255; Fax: +1 613 739 6294; Email: cperez-iratxeta{at}ohri.ca
Received January 29, 2007. Revised March 20, 2007. Accepted March 28, 2007.
| ABSTRACT |
|---|
|
|
|---|
G2D (genes to diseases) is a web resource for prioritizing genes as candidates for inherited diseases. It uses three algorithms based on different prioritization strategies. The input to the server is the genomic region where the user is looking for the disease-causing mutation, plus an additional piece of information depending on the algorithm used. This information can either be the disease phenotype (described as an online Mendelian inheritance in man (OMIM) identifier), one or several genes known or suspected to be associated with the disease (defined by their Entrez Gene identifiers), or a second genomic region that has been linked as well to the disease. In the latter case, the tool uses known or predicted interactions between genes in the two regions extracted from the STRING database. The output in every case is an ordered list of candidate genes in the region of interest. For the first two of the three methods, the candidate genes are first retrieved through sequence homology search, then scored accordingly to the corresponding method. This means that some of them will correspond to well-known characterized genes, and others will overlap with predicted genes, thus providing a wider analysis. G2D is publicly available at http://www.ogic.ca/projects/g2d_2/
| BACKGROUND |
|---|
|
|
|---|
In the effort to identify which gene or genes produce what disease phenotype, geneticists are producing an increasing number of genome-wide linkage analyses and gene association studies that are generating too many to test candidate genes. For this reason, during the past five years, the problem of automating the prioritization of candidate genes to inherited diseases has received increasing attention from the bioinformatics community. Computational approaches were made possible due to the availability of the complete human genome sequence and to considerable developments on database annotation and data integration for molecular biology databases. As a result, a number of methods that address this problem have been published (112). These methods apply a variety of approaches exploiting known or deduced pieces of information that range from using only the genomic sequence of the target region to data mining analyses that include literature and different annotation systems. For details about these methods see the Introduction sections of (13) and (9). A recent work where many of these methods were applied to the prioritization of gene candidates for obesity and type 2 diabetes, shows that there are many discrepancies between the produced candidate gene lists (13). This study suggests the concurrent use of as many approaches as possible when working with particular diseases, especially as there are large differences in the kind and amount of useful information that it is available for each disease. Following this idea, we are developing and maintaining G2D (genes to diseases), a resource that seeks to encompass an increasing number of methodologies to obtain disease-related genes.
G2D is a web resource for prioritizing genes as candidates to inherited diseases, which started as a public server in 2002. At the time, we provided the users with pre-analyses of hundreds of inherited diseases by applying a literature and data mining algorithm based on fuzzy binary relations (2) using the disease's phenotype as input. In 2005, we made the algorithm available as a web tool (14). Now, we have included two more ways of prioritization that apply other principles. One is the use of genes in a different genomic location already known or suspected to produce a similar variant of the disease of interest. The expectation is that the disease-responsible gene will have a similar function to those. This is a more stringent approximation and requires more information about the disease, but it is obviously correct in some cases. The SUSPECTS method (7) has already exploited the idea of gene similarity. The second approach added to G2D, uses as input a second locus where the phenotype has also been mapped in addition to the one where the user is looking for the mutation (as in complex diseases). The expectation is that the products of the genes in several of the loci may be functionally associated, e.g. by participating in the same pathway, or even by being part of a protein complex (9,11). The system will prioritize the candidates according to whether functional interactions (both known and predicted) between proteins encoded in genes of both loci are detected. We used the human protein interactions from the STRING database (15) as our source database.
Besides this web implementation, G2D has been applied by us to the prioritization of genes for complex diseases in two different studies so far: the aforementioned collective computational study to suggest candidates for obesity and type 2 diabetes (13), and to the selection of asthma candidates for genotyping in two linkage regions in a French Canadian founder population (manuscript in preparation).
| OUTLINE OF THE G2D SERVER |
|---|
|
|
|---|
Given a chromosomal region where the user is looking for candidate genes, there are three ways of using the G2D server depending on the algorithm that will be applied. The first option takes as input the phenotype of the disease of interest described by means of an online Mendelian inferitance in man (OMIM) disease entry (16). The system will prioritize the genes according to the description of the phenotype as provided by the MeSH disease annotations (http://www.nlm.nih.gov/mesh/) to the linked bibliography in the corresponding OMIM entry. For details of this method see (2). A second option is to input one or several human or mouse genes that are already known or are suspected to be involved in the disease. The system will prioritize the genes in the target region according to their similarity to the known gene(s) as given by their GO annotations and high sequence homology. In order to do this, we compute scores for all GO terms according to their similarity to the GO terms of the known genes (measured in terms of their Resnik similarity). The measure favors rare GO annotations. Next, we apply this GO scoring system to the annotated genes in Entrez Gene. Their corresponding RefSeq proteins are compared through BLASTX to the chromosomal target region, and the scoring is transferred to the hits that are below the selected E-value. The third option can be used when another chromosomal region has been also linked to the disease of interest. The system will look for proteinprotein interactions in the STRING database, both known and predicted that may be occurring between a gene in the region of interest and a gene in the second region. Since many interactions in STRING are predicted, every interaction has associated a score (STRING score) that reflects how likely is to be a true interaction. We use the STRING score to sort the candidates within the region of interest. Candidates that are more likely to interact with a gene in the second region are then prioritized. In our benchmark (see Conclusion and Supplementary page in our web site), we observed higher precision when STRING scores are very high. Additionally, it is possible to access from our server a database of pre-calculated results for more than 550 monogenic diseases on published linkage regions using the phenotype method.
In the next sections, we describe the different use options through examples. We have prepared step-by-step web tutorials that contain very detailed information. They can be accessed from our web server page (see Tutorial section at our web site).
| USE OF PHENOTYPE |
|---|
|
|
|---|
The system will prioritize the genes according to the description of the phenotype and its precomputed associations to gene features as extracted from the literature and the Entrez Gene database. The input consists simply of the region where the user is looking for the mutation, and the phenotype of interest, given as an OMIM identifier. For example, suppose that you are interested in candidate genes to Hirschsprung disease in 22q13.2. You have to enter in the appropriate boxes a MIM number that defines the disease and the target region. In this particular case, you would enter 131224 (the ID of the OMIM entry for Hirschsprung disease) in the PHENOTYPE BOX, and q13.2 in the LOCATION BOX. You would also select the chromosome, 22 and Bands in the LOCATION BOX as you have entered the genomic region in the form of a cytogenetic band. You can refer to our web tutorial for more input and output options. The output is an ordered list of candidate genes, both known and predicted, ordered according to their susceptibility for producing the phenotype. For each candidate, you can explore any overlapping with ESTs and pseudogenes, as well as trace back the reasoning the system followed to associate the candidates to the disease.
| USE OF KNOWN GENES |
|---|
|
|
|---|
This method can be used when one or more genes are already known or are suspected to be related to the disease. In order to use it, you have to provide your target region in the LOCATION BOX in the same way as for the phenotype method, plus one or several human genes associated to the disease or related mouse genes. Following with our example, Hirschsprung disease is suspected to be a multigenic disease, and it is known that mutations in the endothelin-B receptor (ETRB) gene cause one of the variants of this disease. In order to find similar or functionally related genes to ETRB in 22q13.2, you could input the Entrez Gene IDs of human and mouse ETRBs in the KNOWN GENES BOX. These are 2861 and 13 618, respectively. Like for the phenotype method, the output is an ordered list of candidate genes that you may explore in a similar manner. Candidates that overlap with genes that have (STRING) interactions with any of the known genes input by the user are flagged.
| USE OF STRING PROTEINPROTEIN INTERACTIONS |
|---|
|
|
|---|
The third method for finding candidate genes for multiple gene phenotypes relies on proteinprotein functional interactions, either known or predicted. The rational is that mutations on two proteins that participate on the same pathway, or are directly interacting, will produce the same or very similar phenotypes. The input for this method consists of the target region, entered in the LOCATION BOX like in the previous two methods, and a second region where the phenotype of interest has been also mapped to. The second region is entered in the SECOND LOCUS BOX in the same manner, specifying format and chromosome. The output is a list of genes in the target region that may interact, according to STRING (15), with any genes(s) in the SECOND BOX locus. Candidates are sorted by how likely are their corresponding interactions to be true indicated by their associated STRING scores with 0.99 as maximum value. As an illustration, suppose that you are interested in candidates for myocardial infarction susceptibility (MIS) on 1p36-p34, one of the loci where the condition has been linked to (17). The OMIM entry for MIS (608446 [OMIM] ) shows that it is linked as well to regions 1q25, 6q25.1 and 1p22.1. By entering 1q25 as the second locus, the output page displays 20 candidates corresponding to proteins in 1p36-p34 that interact, according to the STRING database, with at least one protein in 1q25. The first candidate (see Figure 1) is the tumor necrosis factor receptor superfamily, member 4 TNFRSF4 (Entrez Gene ID 7293, located on 1p36). This gene comes out as the top candidate because of its direct interaction (STRING score = 0.999) with TNFSF4, the tumor necrosis factor (ligand) superfamily, member 4 (tax-transcriptionally activated glycoprotein 1, 34 kDa) (Entrez Gene ID 7292). The latter has been recently associated to atherosclerosis susceptibility increasing the risk for myocardial infarction (18) making TNFRSF4 a very good candidate. TNFRSF4 is therefore a quite attractive candidate from a biological point of view, not yet tested but within the linkage peak identified by the group that linked MIS to 1p36-p34 (E.J. Topol, personal communication).
|
| CONCLUSION |
|---|
|
|
|---|
The G2D web server allows users to prioritize candidate genes for practically any disease phenotype in OMIM through three different methods. The results of benchmarking the methods on 227 diseases that have been shown to be caused by more than one gene, are given in detail in our web server (see Supplementary section at our web site), and they can serve as a guidance on the expected performance of each method. The phenotype and the known-genes methods show a similar performance prioritizing the responsible gene in average among the top 25 and 27% of all genes in the band, respectively, when considering target bands of 30 MB (averaging a content of 300 genes). It must be taken into account, however, that more input information about the molecular cause of the disease is required by the known-genes method. The proteinprotein interactions method has a very low recall but shows high precision when the STRING scores associated to the candidates are high and very high (0.9 score value and higher). This method could be applied to a proportion of one in four cases, and in such situations the responsible gene was among the top 10 candidates, 70% of the times for bands of size 30 MB.
Running times for the algorithms vary from very few seconds to almost immediate yielding of results. Users can examine, for all three procedures, the rational used by the system to make the prioritization, keeping the process transparent. We support this by including extensive hyperlinks to related resources such as NCBI sequence databases, Gene Ontology and the UCSC Genome Browser.
| SUPPLEMENTARY DATA |
|---|
|
|
|---|
Supplementary Data are available at G2D web site.
| ACKNOWLEDEGMENTS |
|---|
|
|
|---|
We thank Christian von Mering for his assistance with STRING data. This work was supported by funding from the Canadian Institutes of Health Research, the Ontario Research Development Challenge Fund, the Canadian Foundation for Innovation, and the Canada Research Chair Program. Funding to pay the Open Acess publication charges for this article was provided by CIHR grant MOP-77640.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
|
|
|---|
- Freudenberg J, Propping P. A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics (2002) 18(Suppl. 2):S110S115.[Abstract]
- Perez-Iratxeta C, Bork P, Andrade MA. Association of genes to genetically inherited diseases using data mining. Nat. Genet (2002) 31:316319.[Web of Science][Medline]
- Turner FS, Clutterbuck DR, Semple CA. POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol (2003) 4:R75.[CrossRef][Medline]
- Lopez-Bigas N, Ouzounis CA. Genome-wide identification of genes likely to be involved in human genetic disease. Nucle Acids Res (2004) 32:31083114.[CrossRef]
- Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA. Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res (2005) 33:15441552.
[Abstract/Free Full Text] - van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, Vriend G. GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Res (2005) 33:W758W761.
[Abstract/Free Full Text] - Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS. SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics (2006) 22:773774.
[Abstract/Free Full Text] - Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, et al. Gene prioritization through genomic data fusion. Nat. Biotechnol (2006) 24:537544.[CrossRef][Web of Science][Medline]
- George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA. Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res (2006) 34:e130.
[Abstract/Free Full Text] - Ma X, Lee H, Wang L, Sun F. CGI: a new approach for prioritizing genes by Combining Gene expression and protein-protein Interaction data. Bioinformatics (2006) 23:21521.[CrossRef][Web of Science][Medline]
- Oti M, Snel B, Huynen MA, Brunner HG. Predicting disease genes using protein-protein interactions. J. Med. Genet (2006) 43:691698.
[Abstract/Free Full Text] - Rossi S, Masotti D, Nardini C, Bonora E, Romeo G, Macii E, Benini L, Volinia S. TOM: a web-based integrated approach for identification of candidate disease genes. Nucleis Acids Res (2006) 34:W285W292.[CrossRef]
- Tiffin N, Adie E, Turner F, Brunner HG, van Driel MA, Oti M, Lopez-Bigas N, Ouzounis C, et al. Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes. Nucleic Acids Res (2006) 34:30673081.
[Abstract/Free Full Text] - Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. G2D: a tool for mining genes associated with disease. BMC Genet (2005) 6:45.[CrossRef][Medline]
- von Mering C, Jensen JL, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P. STRING 7recent developments in the integration and prediction of protein interactions. Nucleic Acids Res (2007) 35:D358D362.
[Abstract/Free Full Text] - Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res (2007) 35:D5D12.
[Abstract/Free Full Text] - Wang Q, Rao S, Shen GQ, Li L, Moliterno DJ, Newby LK, Rogers WJ, Cannata R, Zirzow E, et al. Premature myocardial infarction novel susceptibility locus on chromosome 1P34-36 identified by genomewide linkage analysis. Am. J. Hum. Genet (2004) 74:262271.[CrossRef][Web of Science][Medline]
- Wang X, Ria M, Kelmenson PM, Eriksson P, Higgins DC, Samnegard A, Petros C, Rollins J, Bennet AM, et al. Positional identification of TNFSF4, encoding OX40 ligand, as a gene that influences atherosclerosis susceptibility. Nat. Genet (2005) 37:365372.[CrossRef][Web of Science][Medline]
This article has been cited by other articles:
![]() |
S. Nam, M. Li, K. Choi, C. Balch, S. Kim, and K. P. Nephew MicroRNA and mRNA integrated analysis (MMIA): a web tool for examining biological functions of microRNA expression Nucleic Acids Res., July 1, 2009; 37(suppl_2): W356 - W362. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Ortutay and M. Vihinen Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies Nucleic Acids Res., February 1, 2009; 37(2): 622 - 628. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Tiffin, I. Okpechi, C. Perez-Iratxeta, M. A. Andrade-Navarro, and R. Ramesar Prioritization of candidate disease genes for metabolic syndrome by computational analysis of its defining phenotypes Physiol Genomics, September 17, 2008; 35(1): 55 - 64. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


