ConSite: web-based prediction of regulatory elements using cross-species comparison
Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, s-17177 Stockholm, Sweden and 1 Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, University of British Columbia, Vancouver BC, Canada
* To whom correspondence should be addressed. Tel: +46 8 5248 6391; Fax: +46 8 32 48 26; Email: Boris.Lenhard{at}cgb.ki.se
Received February 16, 2004; Revised and Accepted March 10, 2004
| ABSTRACT |
|---|
ConSite is a user-friendly, web-based tool for finding cis-regulatory elements in genomic sequences. Predictions integrate binding site detection using high-quality transcription factor models with cross-species comparison filtering (phylogenetic footprinting). By incorporating evolutionary constraints, selectivity is increased by an order of magnitude as compared to single-sequence analysis. ConSite offers several unique features, including an interactive expert system for retrieving orthologous regulatory sequences. Programming modules and biological databases that form the foundation of the ConSite service are freely available to the research community. ConSite is available at http://www.phylofoot.org/consite.
| INTRODUCTION |
|---|
Understanding the mechanisms of coordinated regulation of gene activity is one of the primary goals of the post-genomic era of biology. Gene regulation at the transcriptional level is an ancient and central control mechanism present in all forms of life. RNA-polymerase II-mediated transcription is activated or repressed by sequence-specific DNA binding proteins called transcription factors (TFs). Transcription factor binding sites (TFBS) are typically short (5–12 bp), and considerable sequence variation between functional binding sites is tolerated by most TFs. While the laboratory elucidation of TFBS within genes is feasible, the process is arduous and time-consuming if no prior information is available. As regulatory elements are often scattered over regions spanning thousands of base pairs around the targeted gene in multi-cellular eukaryotes, additional methodology is required. Computational predictions have been successfully utilized for suggesting potential regulatory regions for further experimental analysis, in effect enabling researchers to determine key regulatory elements more efficiently (1–4). Transcriptional regulation, in particular modeling and prediction of TFBS, is one of the most studied problems in computational biology (5,6). Reliable profile-based methods and model frameworks that accurately describe the DNA-binding specificity of a TF have been developed over the years (6). While these models can accurately identify sites bound in vitro (7), they are insufficiently selective for finding functional elements in vivo (8): the information contained in the interface between TFBS and TF is in itself not enough to discriminate between functional and non-functional sites in the complex cellular environment. The in vivo specificity of a factor depends on additional properties, for instance interacting proteins, DNA accessibility and effective concentrations. With rare exceptions, our understanding of these properties is insufficient to enable the creation of effective computational methods.
Phylogenetic footprinting
Cross-species sequence conservation can be used as an effective filter for improving selectivity of detection of functional elements in DNA sequences. This approach is known as phylogenetic footprinting: due to selective pressure, functional regulatory regions in non-coding sequence should be preferentially conserved compared to regions with no sequence-specific function (9,10). Two assumptions are implicit: the analyzed regions must be orthologous, and the selective pressure on the gene from each respective organism must be similar. A change of gene function will alter the functional constraints imposed on the regulation of the gene.
A further consideration is the evolutionary distance that is ideal for relevant filtering, i.e. what pair of organisms should be analyzed. The sequence difference between closely related species, for instance human and chimpanzee, is generally insufficient to confer any meaningful filtering in pairwise analysis. Conversely, relatively long evolutionary distances, such as between human and fish, often render similarities in promoters all but undetectable with current methodology (11).
Here we describe a web resource for TFBS prediction in genomic sequence using phylogenetic footprinting, available at www.phylofoot.org/consite. The web service is a user-friendly tool with a high degree of user interactivity and optional customization.
| IMPLEMENTATION |
|---|
General schema
The ConSite program executes a number of analysis steps, each in interaction with the user (Figure 1). In brief, the program (i) aligns input promoter sequences, (ii) calculates the degree of conservation in the alignment, (iii) scans the sequences with a set of TF binding profile models, (iv) filters the initial sets of sites using phylogenetic footprinting and (v) presents the results in user-selected output formats.
Selection of regulatory sequences
The success of phylogenetic footprinting methods is critically dependent on the selection of regulatory sequences. As discussed above, only the corresponding regulatory regions of orthologous pairs of genes are appropriate. In ConSite, the user has the choice between manually locating promoter pairs of interest [e.g. by using genome browsers such as UCSC (12) and EnsEMBL (13)] and semi-automatically retrieving target mouse–human genomic sequences based on an accession number, keyword or sequence. In this process, users are aided by an intuitive graphical interface. This ortholog finder interface is unique among phylogenetic footprinting services. Its search engine is powered by GeneLynx (14), a gene index and catalog. As GeneLynx expands to include additional species, the automated orthologous gene selection service will have increasing utility for promoter analysis of model organisms.
Sequence alignment
Once submitted or selected, sequences are aligned preferentially using the ORCA aligner (Arenillas and Wasserman, unpublished), a progressive global alignment program optimized for non-coding genomic sequences. For pairs of particularly long sequences, ConSite accepts pre-computed alignments in a variety of standard formats. For convenience, users can specify a cDNA sequence that will be used to identify and highlight coding regions and/or exons in the output.
Conservation calculation
The degree of conservation is calculated by sliding a window of a user-defined width over the alignment. In each window location, the percentage of identical nucleotides is calculated. A potential problem with this approach is that short, highly conserved regions adjacent to large gaps or insertions will be assigned low identity scores. To avoid this, for each input sequence we collapse the gaps in the alignment for the purpose of calculating nucleotide identity. Thus, the analysis results are displayed as a pair of conservation plots, where the first (second) input sequence is continuous in the first (second) plot. For each analysis, we obtain a set of window identity scores W_i, corresponding to the percentage of identical nucleotides in the window starting at position i. For filtering purposes, only those windows with W_i exceeding an identity threshold I (typically 70–80%) are retained for further TFBS analysis.
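The windowed identity calculation can be sketched as follows. This is a minimal illustration, not the ConSite implementation: the function names are hypothetical, gaps are assumed to be written as `-`, and a position where the other sequence has a gap is assumed to count as a mismatch.

```python
def window_identities(aln_a, aln_b, width, reference="a"):
    """Percent identity for each window, with gaps in the reference collapsed.

    aln_a, aln_b: the two rows of a pairwise alignment (equal length, '-' = gap).
    reference: "a" or "b", the sequence that is kept continuous.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    ref, other = (aln_a, aln_b) if reference == "a" else (aln_b, aln_a)
    # Drop alignment columns where the reference has a gap, so that the
    # reference sequence is continuous (assumption: a gap in the other
    # sequence simply counts as a non-identical position).
    matches = [r.upper() == o.upper() for r, o in zip(ref, other) if r != "-"]
    return [
        100.0 * sum(matches[i:i + width]) / width
        for i in range(len(matches) - width + 1)
    ]


def retained_windows(scores, identity_threshold):
    """Start positions i of windows whose W_i exceeds the threshold I."""
    return [i for i, w in enumerate(scores) if w > identity_threshold]


# One set of scores per input sequence, matching the pair of conservation plots.
w_a = window_identities("ACGT--ACGT", "ACGTTTACGA", width=4, reference="a")
w_b = window_identities("ACGT--ACGT", "ACGTTTACGA", width=4, reference="b")
kept_a = retained_windows(w_a, 80.0)
```

Computing the scores once per reference sequence mirrors the two plots described above: an insertion in one sequence does not dilute the identity of flanking conserved windows when the other sequence is used as the reference.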
TF binding profile collection and scoring
The mathematical background of profile models used for describing TF binding properties has been extensively reviewed elsewhere (5,6). In brief, a profile consists of a matrix tabulating observed nucleotides in each position of the protein–DNA interface, typically counted from an alignment of known sites. The profile collection in ConSite is drawn from the JASPAR database, an open-access, non-redundant collection of curated profiles (15). Profiles are converted to log-scaled position weight matrices (PWMs) in order to evaluate possible binding sites in an input sequence, as reviewed elsewhere (6). As score ranges are unique for each model, scores are normalized according to
[Equation (1) is not reproduced in this copy.]
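As an illustration of profile scoring, the sketch below converts a count matrix to a log-odds PWM and rescales site scores to a 0–100 range. The min-max normalization used here is an assumption (a common convention for relative PWM scores), not a transcription of Equation (1); the pseudocount, the uniform background and all function names are likewise hypothetical.

```python
import math

BASES = "ACGT"


def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert {base: [count per position]} to a list of log2-odds columns."""
    length = len(counts["A"])
    pwm = []
    for pos in range(length):
        total = sum(counts[b][pos] for b in BASES) + pseudocount
        column = {}
        for b in BASES:
            # Pseudocount is distributed according to the background frequency.
            freq = (counts[b][pos] + pseudocount * background) / total
            column[b] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def relative_score(pwm, site):
    """Min-max normalized score in [0, 100] (assumed form of normalization)."""
    raw = sum(col[base] for col, base in zip(pwm, site))
    s_min = sum(min(col.values()) for col in pwm)
    s_max = sum(max(col.values()) for col in pwm)
    return 100.0 * (raw - s_min) / (s_max - s_min)


def scan(pwm, sequence, cutoff):
    """Return (start, site, score) for windows scoring at or above the cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width].upper()
        if any(base not in BASES for base in site):
            continue  # skip windows containing ambiguous symbols such as N
        score = relative_score(pwm, site)
        if score >= cutoff:
            hits.append((start, site, score))
    return hits


# Toy profile built from eight identical "ACGT" sites.
toy_counts = {"A": [8, 0, 0, 0], "C": [0, 8, 0, 0],
              "G": [0, 0, 8, 0], "T": [0, 0, 0, 8]}
toy_pwm = counts_to_pwm(toy_counts)
toy_hits = scan(toy_pwm, "TTACGTTT", cutoff=80.0)
```

Normalizing against each model's own minimum and maximum makes a single cutoff comparable across profiles of different lengths and information content, which is the motivation stated in the text.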
Predicted sites are retained only if they:
- (i) have a site score S that passes the cutoff c (a user-adjustable TFBS detection threshold),
- (ii) are found in a window where W_i exceeds I (as defined above),
- (iii) and have a predicted site in the other input sequence in the corresponding position, subject to constraints (i) and (ii).
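The three constraints can be combined into a single filter, sketched below under stated assumptions: hits are hypothetical `(tf_name, start, score)` tuples that already satisfy the score cutoff, "corresponding position" is interpreted as the same alignment column for the site start, and a site counts as lying in a conserved window if its start is covered by any retained window. None of these details are confirmed by the text.

```python
def seq_to_aln_columns(aligned):
    """Map each ungapped sequence index to its alignment column ('-' = gap)."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def covered_positions(window_starts, width):
    """Sequence positions covered by at least one retained window."""
    covered = set()
    for i in window_starts:
        covered.update(range(i, i + width))
    return covered


def footprint_filter(hits_a, hits_b, aln_a, aln_b, covered_a, covered_b):
    """Keep hits in sequence A that have a matching conserved hit in sequence B.

    hits_a, hits_b: (tf_name, start, score) tuples already passing the
    score cutoff c, i.e. constraint (i).
    """
    col_a = seq_to_aln_columns(aln_a)
    col_b = seq_to_aln_columns(aln_b)
    # Constraint (ii) for sequence B, expressed in alignment coordinates.
    kept_b = {(tf, col_b[start]) for tf, start, _ in hits_b if start in covered_b}
    conserved = []
    for tf, start, score in hits_a:
        in_window = start in covered_a                # constraint (ii)
        has_partner = (tf, col_a[start]) in kept_b    # constraint (iii)
        if in_window and has_partner:
            conserved.append((tf, start, score))
    return conserved


aln_a, aln_b = "ACGT--ACGT", "ACGTTTACGA"
hits_a = [("TF1", 4, 90.0), ("TF2", 0, 85.0)]
hits_b = [("TF1", 6, 88.0), ("TF2", 5, 95.0)]
result = footprint_filter(hits_a, hits_b, aln_a, aln_b,
                          covered_positions([0, 4], 4), set(range(10)))
```

In this toy case TF1 is kept because its sites fall in the same alignment column in both sequences, while TF2 is dropped because its two sites do not align.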
Graphical interpretation of analysis results
For the evaluation of results, researchers are given a choice of several distinct output formats (Figure 1), including:
- Graphical view: an alignment overview and conservation plots, with conserved TFBS shown as intuitive flags with mouse-over functionality;
- Alignment view: detailed alignments labeled with detected conserved sites;
- Table view: a tabular view of all detected sites with supplementary data.
All predicted sites are hyperlinked to pop-up summary pages describing TF binding models, including a sequence logo for graphical representation of the TF's binding specificity.
| PERFORMANCE |
|---|
For many individual genes, phylogenetic footprinting has been shown to be a highly useful method. Until recently, however, there was no confirmation that the concept holds for larger gene sets. In a recent study (11), we tested the concept with two separate test sets: sites resulting from detailed literature analysis (40 sites in 14 promoters) and sites from the TRANSFAC database mapped onto genome assemblies (110 sites in 40 promoters). The latter set is the largest reference set to date for phylogenetic footprinting tests, available at http://www.phylofoot.org/consite/testset. In brief, the ConSite set of methods could, in both test cases, reduce the noise level by roughly 85% compared to single-sequence analysis while retaining high sensitivity (11).
| ACCESS TO UNDERLYING SOFTWARE |
|---|
The ConSite integration of TFBS prediction and cross-species comparison is based on the open-access TFBS Perl modules (15,16) (http://forkhead.cgb.ki.se/TFBS), which support a wide variety of analysis modes, including pattern-finding and pattern similarity analysis. TFBS enables automated analysis on genome-scale data sets for power users.
| CONCLUSION |
|---|
We have presented a graphical, web-based interface for computer-assisted prediction of regulatory regions in higher eukaryotes, powered by cross-species comparison. Besides the intuitive interface design, ConSite integrates several features not found in other TFBS prediction services such as TESS (17) or rVista (18). These features include a curated model dataset, computer-assisted input sequence selection and a powerful underlying set of programming modules for genome-scale analysis.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
| REFERENCES |
|---|
1. Aparicio,S., Morrison,A., Gould,A., Gilthorpe,J., Chaudhuri,C., Rigby,P., Krumlauf,R. and Brenner,S. (1995) Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc. Natl Acad. Sci. USA, 92, 1684–1688.
2. Bagheri-Fam,S., Ferraz,C., Demaille,J., Scherer,G. and Pfeifer,D. (2001) Comparative genomics of the SOX9 region in human and Fugu rubripes: conservation of short regulatory sequence elements within large intergenic regions. Genomics, 78, 73–82.
3. Christensen,T.H., Prentice,H., Gahlmann,R. and Kedes,L. (1993) Regulation of the human cardiac/slow-twitch troponin C gene by multiple, cooperative, cell-type-specific, and MyoD-responsive elements. Mol. Cell Biol., 13, 6752–6765.
4. Gumucio,D.L., Heilstedt-Williamson,H., Gray,T.A., Tarle,S.A., Shelton,D.A., Tagle,D.A., Slightom,J.L., Goodman,M. and Collins,F.S. (1992) Phylogenetic footprinting reveals a nuclear protein which binds to silencer sequences in the human gamma and epsilon globin genes. Mol. Cell Biol., 12, 4919–4929.
5. Wasserman,W.W. and Krivan,W. (2003) In silico identification of metazoan transcriptional regulatory regions. Naturwissenschaften, 90, 156–166.
6. Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23.
7. Tronche,F., Ringeisen,F., Blumenfeld,M., Yaniv,M. and Pontoglio,M. (1997) Analysis of the distribution of binding sites for a tissue-specific transcription factor in the vertebrate genome. J. Mol. Biol., 266, 231–245.
8. Fickett,J.W. (1996) Quantitative discrimination of MEF2 sites. Mol. Cell Biol., 16, 437–441.
9. Tagle,D.A., Koop,B.F., Goodman,M., Slightom,J.L., Hess,D.L. and Jones,R.T. (1988) Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol., 203, 439–455.
10. Wasserman,W.W., Palumbo,M., Thompson,W., Fickett,J.W. and Lawrence,C.E. (2000) Human–mouse genome comparisons to locate regulatory sites. Nat. Genet., 26, 225–228.
11. Lenhard,B., Sandelin,A., Mendoza,L., Engstrom,P., Jareborg,N. and Wasserman,W.W. (2003) Identification of conserved regulatory elements by comparative genome analysis. J. Biol., 2, 13.
12. Karolchik,D., Baertsch,R., Diekhans,M., Furey,T.S., Hinrichs,A., Lu,Y.T., Roskin,K.M., Schwartz,M., Sugnet,C.W., Thomas,D.J. et al. (2003) The UCSC Genome Browser Database. Nucleic Acids Res., 31, 51–54.
13. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) EnsEMBL 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.
14. Lenhard,B., Hayes,W.S. and Wasserman,W.W. (2001) GeneLynx: a gene-centric portal to the human genome. Genome Res., 11, 2151–2157.
15. Sandelin,A., Alkema,W., Engstrom,P., Wasserman,W.W. and Lenhard,B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res., 32, D91–94.
16. Lenhard,B. and Wasserman,W.W. (2002) TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics, 18, 1135–1136.
17. Schug,J. and Overton,G.C. (1997) TESS: Transcription Element Search Software on the WWW (report CBIL-TR-1997-1001-v0.0). Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania.
18. Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832–839.