Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes
D.I.Co. and 1 Department of Biomolecular Science and Biotechnology, University of Milan, Milan, Italy and 2 Department of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy
* To whom correspondence should be addressed. Tel: +39 02 50314915; Fax: +39 02 50314912; Email: graziano.pesole{at}unimi.it
Received February 14, 2004; Revised and Accepted April 28, 2004
| ABSTRACT |
|---|
|
|
|---|
One of the greatest challenges that modern molecular biology is facing is the understanding of the complex mechanisms regulating gene expression. A fundamental step in this process requires the characterization of regulatory motifs playing key roles in the regulation of gene expression at transcriptional and post-transcriptional levels. In particular, transcription is modulated by the interaction of transcription factors with their corresponding binding sites. Weeder Web is a web interface to Weeder, an algorithm for the automatic discovery of conserved motifs in a set of related regulatory DNA sequences. The motifs found are in turn likely to be instances of binding sites for some transcription factor. Other than providing access to the program, the interface has been designed so to make usage of the program itself as simple as possible, and to require very little prior knowledge about the length and the conservation of the motifs to be found. In fact, the interface automatically starts different runs of the program, each one with different parameters, and provides the user with an overall summary of the results as well as some advice on which motifs look more interesting according to their statistical significance and some simple considerations. The web interface is available at the address www.pesolelab.it by following the Tools link.
| INTRODUCTION |
|---|
|
|
|---|
Understanding the complex mechanisms governing basic biological processes requires the characterization of regulatory motifs modulating gene expression at transcriptional and post-transcriptional levels. In particular extent, chronology and cell-specificity of transcription are modulated by the interaction of transcription factors (TFs) with their corresponding binding sites (TFBSs) (1), mostly located nearby the transcription start site (TSS) of the gene (i.e. proximal promoter region) or far away (i.e. enhancers, silencers, etc.). The ever growing amount of genomic data (complemented by other sources of information such as full-length cDNA sequencing projects that permit the precise mapping of the TSS on the genome sequence) and expression data derived from microarray and other experiments open new opportunities to researchers.
The fact that transcription factor binding sites are generally short (<1214 bp long) and degenerate oligonucleotides makes their computational discovery and large-scale annotation significantly hardhence the need for efficient and reliable methods for detecting novel motifs significantly over-represented in the regulatory regions of sets of genes sharing common properties (e.g. similar expression profile, biological function, product cellular localization), which in turn could correspond to binding sites for common TFs regulating the genes.
We present here a web server that provides access to a previously developed enumerative pattern discovery method (2) that is able to carry out an (almost) exhaustive search of significantly conserved degenerate oligonucleotide patterns with remarkable computational efficiency.
| METHODS |
|---|
|
|
|---|
Nearly all the computational methods for the discovery of novel motifs in a set of sequences of co-regulated genes are based on two steps. First, one or more groups of oligonucleotides similar enough to each other (i.e. differing in some nucleotide substitutions) are detected in the sequences. Second, their presence is evaluated from a statistical point of view; that is, algorithms estimate how likely each group would be to appear in a set of sequences either picked at random from the same organism (thus very unlikely to be coregulated) or built randomly with the same nucleotide composition as the input sequences (thus very likely to present a different oligo composition). The best groups of oligos found are in turn likely to be instances of binding sites for some TF [see (3,4) for reviews on different methods applied to the problem, as well as programs and interfaces implementing them].
For the first step, two main approaches have been proposed so far: consensus-based (or pattern-driven) and profile-based (or alignment-driven) methods. In the former case, the different oligonucleotides recognized by a given TF are described by their consensus, representing, for each position, the nucleotide that appears most frequently in the binding sites. All oligos that differ from the consensus in no more than a given number of positions (usually depending on the oligo length) can in turn be supposed to be recognized by the same TF. The debate on which philosophy is more suitable for representing and capturing TFBSs is nearly as old as the problem itself (57). However, comparative tests (8) have shown that the two approaches seem to be complementary, with no definite prevalence of either one. Some instances are correctly solved by representing motifs with their consensus, some others by using alignment profiles, and some by both (or neither). Thus, given a set of sequences, the best way to proceed is perhaps to try different methods based on wholly different principles, such as the one presented here and traditional alignment-driven methods such as Consensus (9), MEME (10) and the Gibbs sampler (11), and to compare the results obtained.
The key idea of consensus-based methods is to enumerate all the oligos of (or up to) a given length, in order to determine which ones appear, with possible substitutions, in a significant fraction of the input sequences, and finally to rank them according to statistical measures of significance. At first sight, this approach seems to suffer from different drawbacks. First of all, if the length of the motifs sought is m, then there are 4m candidate oligos to enumerate, with an exponential growth of the execution time according to the motif length. Then, many additional parameters are usually required by these algorithms, such as the length of the motif itself, the number of mutations allowed for its occurrences and a minimum number q of sequences the motif has to appear in (this parameter is usually called quorum). Moreover, a suitable significance measure, able to discriminate real TF binding sites from uninteresting motifs, has to be introduced, since often there are hundreds of candidate motifs satisfying the input parameters. All these factors have led to the impression that consensus-based methods are too slow (given the high number of candidates) and too difficult to use, since many different parameter combinations have to be tried.
However, as demonstrated in (2,12), the exhaustive search for motifs can be significantly accelerated if the input sequences are preprocessed and organized in a suitable indexing structure, such as a suffix tree (13), especially when their lengths fall within the usual range of TFBSs. We refer the reader to reference (2) for further details on how the exhaustive search is implemented in the Weeder algorithm. Instead, in the implementation of the interface presented here we focused on the other points just mentioned. The first task is to determine automatically which values of all the parameters needed by the algorithm are suitable for TFBSs, in order to reduce to the minimum possible the user input and any prior knowledge about the motifs to be found, as well as to make the program as simple to use as alignment-based algorithms. Then, we also propose a significance measure especially fine-tuned for TFBSs, which is used to rank the results and to find the best motifs.
User input
The Weeder Web interface (an example is shown in the online Supplementary Material) requires users to input their email address, and one or more sequences in FASTA format either by cutting and pasting the sequences or by uploading a file. Sequences can be in uppercase or lowercase letters, and can contain ambiguous IUPAC symbols. A checkbox is available in order to specify if the complementary strand of the input sequences has also to be examined by the program. Then, users must provide values of a few intuitive parameters. The first one specifies only whether the motif has to appear in all the input sequences or in some of them (analogous to the zoops mode of alignment-based methods). The user can also select a mode in which the input is processed as a single sequence, looking for repeated oligonucleotides regardless of their distribution throughout the sequences. The second parameter needed, describing the type of analysis desired (quick, normal or thorough), simply influences the time required to obtain the results: clearly, the shorter the time, the less accurate is the analysis performed. Also, users must specify which organism their sequences come from, by selecting it from a list. This choice is fundamental for the computation of the significance of the results (see Supplementary Material), since this is based on organism-specific expected values. If the organism is not included in the list provided, users can contact the page administrators. The list of available organisms will be constantly updated and enlarged. Since the computation time in some cases might exceed an hour, the results are sent by email. The results are accessible also on a dedicated web page, whose address is communicated in the email as well. The web page will also provide a link to the original input sequences provided by the user.
In addition, if the user wants to set manually all or some of the input parameters (motif length, quorum, error and so on), an extended input form is available, which can be reached by following a link from the main page. Also, this page permits the user to submit nucleotide sequences other than upstream regulatory regions (for which the default parameter values for TFBSs might not be appropriate). At the present time, it includes human 5' and 3' mRNA untranslated regions. Further types of nucleotide sequences will be included in the future to broaden the applicability of the method to signals other than TFBSs.
Program runs
Once the Submit button is clicked, if all the fields are filled in correctly the web interface automatically starts a series of runs of the Weeder algorithm, looking for motifs of lengths 6 and 8 (if launched in quick mode), or from length 6 up to 12 (in normal mode and thorough mode). The number of sequences a motif has to appear in is determined according to the user's choice (if some is selected, the threshold is set to half of the sequences). The number of mutations allowed is one for motifs of length 6, two for length 8, three for length 10 and four for length 12. The thorough mode performs an additional scan for motifs of length 8 with three mutations and length 10 with four mutations, and lowers the quorum choice to one-third of the input sequences. Clearly, the last parameter setting is useful when no significant motif is reported by quick and normal analyses. Experiments performed by us and by other groups [see e.g. (14,15)] have shown that these values are suitable for capturing a large class of TFBSs, even in the case of corrupted datasets including several sequences not containing instances of the motif.
The output
For each run, all the motifs satisfying the input constraints (length, error and quorum) are scored according to a statistical measure of significance (see Supplementary Material) that takes into account the number of sequences a motif appears in, how much it is conserved and the overall number of its occurrences in the input set. Then, the five most significant motifs of each run are reported to the user. However, it is sometimes difficult to judge which ones, among motifs of different lengths obtained allowing different degrees of approximation, could be more interesting and worth further investigation, especially (as in real case studies) when the motif length is not known in advance. Clearly, if a motif reported has a score significantly higher than the others it is less likely to be the effect of random similarities. However, for this task, we have included an additional post-processing stage. The idea is best explained by an example. Suppose that the consensus for the binding sites of a given TF is A[C/G]GTAC, admitting with equal frequency either a C or a G in the second position. If all went well in the program runs, we might expect to find both ACGTAC and AGGTAC in the list of the best motifs of length 6 with one mutation. And we also expect this phenomenon to be quite frequent in real binding sites, since the different positions of the sites usually exhibit different degrees of variability, and mutations seem to appear more frequently at some positions than at others. Thus, first of all the algorithm scans each list of best motifs of length m with e mutations (the result of each single run) to see whether in each one there are motifs that differ in no more than e positions. These motifs could be two alternative consensuses for the same set of binding sites, and the higher scoring one is reported as interesting. Then, the results of different runs are also compared to each other. If a short motif is found to be a part of a longer one, then the latter might have a conserved core, another feature often encountered in real instances. Also in this case, the longer one is added to the interesting motif list. Several experiments that we have performed on real case studies support the feasibility of these simple criteria.
Finally, all the interesting motifs are again listed at the bottom of the output file, under the heading My Advice. For each of these, the interface also reports a frequency matrix built by aligning all the instances of the motif found as well as a list of its best occurrences, collected from the input sequences by using the frequency matrix (see Supplementary Material). Another frequency matrix, obtained by aligning only the best occurrences of the motif, is also listed. Although motifs are discovered by allowing a predefined number of mutations e in their occurrences, the additional frequency matrix scan makes it possible to pick new instances presenting more than e mutations with respect to the motif consensus but still fitting the motif profile well, and at the same time to single out which instances are more likely to be real TFBSs. Also, even if the motif has been detected in q of the input sequences, the best instances might not be present in each of the q sequences (or else the motif might be found to appear in additional sequences). An example of the output file sent to users is shown in Figure 1. Notice that the best motif of length 6 differs significantly from the best 8mer. However, the advice of the program is to pay attention to the latter (the correct one), since there is another motif differing from it in a single position among the highest scoring motifs of length 8.
|
| DISCUSSION |
|---|
|
|
|---|
The web interface to the Weeder program permits the analysis of a set of regulatory sequences looking for conserved motifs that in turn could represent instances of binding sites for some common TF. The interface has been designed with the 2-fold purpose of being as user-friendly as possible and of making the interpretation of the results easier. All the actual parameters that are needed by the Weeder algorithm are kept hidden and automatically set to values suitable for the discovery of TFBSs according to some intuitive indications provided by the user, who can, however, change the default values by using an extended input form. Some hints and comments on the results obtained are also output.
In the future, the interface will be constantly updated and enhanced with new features, including the graphical representation of the output and the possibility of applying the algorithm to other types of nucleotide sequences. On the algorithmic side, we plan to include the comparative analysis of regulatory sequences of orthologous genes from different species, and the possibility of detecting motifs composed of two conserved parts interrupted by a non-conserved region, as well as correlations among different motifs (i.e. combinations of two or more elements showing a conserved order, strand orientation and spacing). To this end, we have developed a suitable algorithm (16) that will be included in a further version of the server. Indeed, the search for complex regulatory modules, greatly reducing false positive rates, would make possible genome-wide promoter analyses. Finally, we are currently working on the integration of microarray expression values with sequence data, which could represent another big leap forward in promoter annotation and the discovery of novel TF binding sites. Each update will be highlighted and reported on the web page.
| SUPPLEMENTARY MATERIAL |
|---|
|
|
|---|
Supplementary Material is available at NAR Online.
| Notes |
|---|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
| REFERENCES |
|---|
|
|
|---|
- Levine,M. and Tjian,R. ( (2003) ) Transcription regulation and animal diversity. Nature, , 424, , 147151.[CrossRef][Medline]
- Pavesi,G., Mauri,G. and Pesole,G. ( (2001) ) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, , 17, (Suppl. 1), S207S214.[Abstract]
- Stormo,G.D. ( (2000) ) DNA binding sites: representation and discovery. Bioinformatics, , 16, , 1623.
[Abstract/Free Full Text] - Pavesi,G., Mauri,G. and Pesole,G. ( (2001) ) Methods for pattern discovery in unaligned biological sequences. Brief. Bioinformatics, , 2, , 417430.
- Berg,O.G. and von Hippel,P.H. ( (1988) ) Selection of DNA binding sites by regulatory proteins. Trends Biochem. Sci., , 13, , 207211.[CrossRef][ISI][Medline]
- Frech,K., Quandt,K. and Werner,T. ( (1997) ) Finding protein-binding sites in DNA sequences: the next generation. Trends Biochem. Sci., , 22, , 103104.[CrossRef][ISI][Medline]
- Frech,K., Quandt,K. and Werner,T. ( (1997) ) Software for the analysis of DNA sequence elements of transcription. Comput. Appl. Biosci., , 13, , 8997.
[Abstract/Free Full Text] - Sinha,S. and Tompa,M. ( (2003) ) Performance comparison of algorithms for finding transcription factor binding sites, Third IEEE Symposium on Bioinformatics and Bioengineering. Washington DC, pp. 6978.
- Hertz,G.Z. and Stormo,G.D. ( (1999) ) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, , 15, , 563577.
[Abstract/Free Full Text] - Bailey,T.L. and Elkan,C. ( (1994) ) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., , 2, , 2836.[Medline]
- Thompson,W., Rouchka,E.C. and Lawrence,C.E. ( (2003) ) Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res., , 31, , 35803585.
[Abstract/Free Full Text] - Marsan,L. and Sagot,M.F. ( (2000) ) Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol., , 7, , 345362.[CrossRef][ISI][Medline]
- Gusfield,D. ( (1997) ) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK and New York.
- Narasimhan,C., LoCascio,P. and Uberbacher,E. ( (2003) ) Background rareness-based iterative multiple sequence alignment algorithm for regulatory element detection. Bioinformatics, , 19, , 19521963.
[Abstract/Free Full Text] - Sinha,S. and Tompa,M. ( (2002) ) Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res., , 30, , 55495560.
[Abstract/Free Full Text] - Pavesi,G., Mauri,G., Iannelli,F., Gissi,C. and Pesole,G. ( (2004) ) GeneSyn: a tool for detecting conserved gene syntenies across genomes. Bioinformatics, , 10.1093/bioinformatics/bth1102
- Zhu,J. and Zhang,M.Q. ( (1999) ) SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, , 15, , 607611.
[Abstract/Free Full Text]
This article has been cited by other articles:
![]() |
H. M. Petrykowska, C. M. Vockley, and L. Elnitski Detection and characterization of silencers and enhancer-blockers in the greater CFTR locus Genome Res., August 1, 2008; 18(8): 1238 - 1246. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. C. Pennycooke, H. Cheng, and E. J. Stockinger Comparative Genomic Sequence and Expression Analyses of Medicago truncatula and Alfalfa Subspecies falcata COLD-ACCLIMATION-SPECIFIC Genes Plant Physiology, March 1, 2008; 146(3): 1242 - 1254. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Menges, G. Pavesi, P. Morandini, L. Bogre, and J. A.H. Murray Genomic Organization and Evolutionary Conservation of Plant D-Type Cyclins Plant Physiology, December 1, 2007; 145(4): 1558 - 1576. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Xu, M. Bieda, V. X. Jin, A. Rabinovich, M. J. Oberley, R. Green, and P. J. Farnham A comprehensive ChIP chip analysis of E2F1, E2F4, and E2F6 in normal and tumor cells reveals interchangeable roles of E2F family members Genome Res., November 1, 2007; 17(11): 1550 - 1561. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Mahony and P. V. Benos STAMP: a web tool for exploring DNA-binding motif similarities Nucleic Acids Res., July 13, 2007; 35(suppl_2): W253 - W258. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. A. Romer, G.-R. Kayombya, and E. Fraenkel WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches Nucleic Acids Res., July 13, 2007; 35(suppl_2): W217 - W220. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Segal, M. Lapidot, Z. Solan, E. Ruppin, Y. Pilpel, and D. Horn Nucleotide variation of regulatory motifs may lead to distinct expression patterns Bioinformatics, July 1, 2007; 23(13): i440 - i449. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Matsumoto, M. Ukai-Tadenuma, R. G. Yamada, J. Houl, K. D. Uno, T. Kasukawa, B. Dauwalder, T. Q. Itoh, K. Takahashi, R. Ueda, et al. A functional genomics strategy reveals clockwork orange as a transcriptional regulator in the Drosophila circadian clock Genes & Dev., July 1, 2007; 21(13): 1687 - 1700. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. X. Jin, H. O'Geen, S. Iyengar, R. Green, and P. J. Farnham Identification of an OCT4 and SRY regulatory module using integrated computational and experimental genomics approaches Genome Res., June 1, 2007; 17(6): 807 - 817. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Li, Y. Liang, and R. L. Bass GAPWM: a genetic algorithm method for optimizing a position weight matrix Bioinformatics, May 15, 2007; 23(10): 1188 - 1194. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. R. Krig, V. X. Jin, M. C. Bieda, H. O'Geen, P. Yaswen, R. Green, and P. J. Farnham Identification of Genes Directly Regulated by the Oncogene ZNF217 Using Chromatin Immunoprecipitation (ChIP)-Chip Assays J. Biol. Chem., March 30, 2007; 282(13): 9703 - 9712. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Lee, K. He, V. Stolc, H. Lee, P. Figueroa, Y. Gao, W. Tongprasit, H. Zhao, I. Lee, and X. W. Deng Analysis of Transcription Factor HY5 Genomic Binding Sites Revealed Its Hierarchical Role in Light Regulation of Development PLANT CELL, March 1, 2007; 19(3): 731 - 749. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. J. Donaldson and B. Gottgens CoMoDis: composite motif discovery in mammalian genomes Nucleic Acids Res., January 12, 2007; 35(1): e1 - e1. [Abstract] [Full Text] [PDF] |
||||
![]() |
C.-H. Peng, J.-T. Hsu, Y.-S. Chung, Y.-J. Lin, W.-Y. Chow, D. F. Hsu, and C. Y. Tang Identification of degenerate motifs using position restricted selection and hybrid ranking combination Nucleic Acids Res., December 2, 2006; 34(22): 6379 - 6391. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Elnitski, V. X. Jin, P. J. Farnham, and S. J.M. Jones Locating mammalian transcription factor binding sites: A survey of computational and experimental techniques Genome Res., December 1, 2006; 16(12): 1455 - 1464. [Abstract] [Full Text] [PDF] |
||||
![]() |
N.-K. Kim, K. Tharakaraman, and J. L. Spouge Adding sequence context to a Markov background model improves the identification of regulatory elements Bioinformatics, December 1, 2006; 22(23): 2870 - 2875. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. T. Naughton, E. Fratkin, S. Batzoglou, and D. L. Brutlag A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites Nucleic Acids Res., November 6, 2006; 34(20): 5730 - 5739. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. S. Michaloski, P. A.F. Galante, and B. Malnic Identification of potential regulatory motifs in odorant receptor genes by analysis of promoter sequences Genome Res., September 1, 2006; 16(9): 1091 - 1098. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Pavesi, P. Mereghetti, F. Zambelli, M. Stefani, G. Mauri, and G. Pesole MoD Tools: regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W566 - W570. [Abstract] [Full Text] [PDF] |
||||
![]() |
C.-C. Liu, C.-C. Lin, W.-S. E. Chen, H.-Y. Chen, P.-C. Chang, J. J.W. Chen, and P.-C. Yang CRSD: a comprehensive web server for composite regulatory signature discovery. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W571 - W577. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. MOHANTY, S. P. T. KRISHNAN, S. SWARUP, and V. B. BAJIC Detection and Preliminary Analysis of Motifs in Promoters of Anaerobically Induced Genes of Different Plant Species Ann. Bot., September 1, 2005; 96(4): 669 - 681. [Abstract] [Full Text] [PDF] |
||||
![]() |
W.J. Kent, F. Hsu, D. Karolchik, R. M. Kuhn, H. Clawson, H. Trumbower, and D. Haussler Exploring relationships and mining data with the UCSC Gene Sorter Genome Res., May 1, 2005; 15(5): 737 - 741. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. A. Down and T. J. P. Hubbard NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence Nucleic Acids Res., March 10, 2005; 33(5): 1445 - 1453. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Tang, S. L. Tan, S. K. Ramadoss, A. P. Kumar, M.-H. E. Tang, and V. B. Bajic Computational method for discovery of estrogen responsive genes Nucleic Acids Res., December 1, 2004; 32(21): 6212 - 6217. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








