Nucleic Acids Research, 2003, Vol. 31, No. 13 3666-3668
© 2003 Oxford University Press
Cluster-Buster: finding dense clusters of motifs in DNA sequences
1 Bioinformatics Program, Boston University, 44 Cummington Street, Boston, MA 02215, USA 2 Biomedical Engineering Department, Boston University, 44 Cummington Street, Boston, MA 02215, USA
*To whom correspondence should be addressed at Biomedical Engineering Department, Boston University, 44 Cummington Street, Boston, MA 02215, USA. Tel: +1 6173533509; Fax: +1 6173536766; Email: zhiping{at}bu.edu
Received February 13, 2003; Revised March 15, 2003. Accepted March 26, 2003
| ABSTRACT |
|---|
|
|
|---|
The signals that determine activation and repression of specific genes in response to appropriate stimuli are one of the most important, but least understood, types of information encoded in genomic DNA. The nucleotide sequence patterns, or motifs, preferentially bound by various transcription factors have been collected in databases. However, these motifs appear to be individually too short and degenerate to enable detection of functional enhancer and silencer elements within a large genome. Several groups have proposed that dense clusters of motifs may diagnose regulatory regions more accurately. Cluster-Buster is the third incarnation of our software for finding clusters of pre-specified motifs in DNA sequences. We offer a Cluster-Buster web server at http://zlab.bu.edu/cluster-buster/.
| INTRODUCTION |
|---|
|
|
|---|
Enhancers and silencers of transcription consist of clusters of transcription factor binding sites (1,2). A number of publications have proposed to detect transcription regulatory regions by searching for clusters of the sequence motifs preferentially bound by a set of transcription factors (214). It remains to be seen whether this approach will be generally successful for large eukaryotic genomes. Most recent methods for finding motif clusters fall into two categories: those that count motif hits occurring within a sequence window of some size (2,12,13), and those that employ probabilistic models (5,10,11,14). The advantage of the former methods is the intuitive clarity of what they do. Advantages of the latter are avoidance of arbitrary thresholds and the ability to integrate contributions from indefinitely many indefinitely weak motif hits. By using log likelihood ratios, model-based approaches can also claim to discriminate motif clusters from background DNA in a mathematically optimal way (the NeymanPearson Lemma).
We have taken the modeling approach, searching for regions of the sequence that resemble a statistical model of a motif cluster more than they resemble a model of background DNA. Our motif cluster model is for motifs to occur randomly with a uniform distribution across the region and the background model consists of independent, random nucleotides with probabilities estimated from their local abundances in the query sequence. We wish to identify subsequences whose log likelihood ratios, In [Prob (subsequence|cluster model)/Prob (subsequence|background model)], are maximal (i.e. they do not overlap subsequences with higher log likelihood ratios). Unfortunately, the algorithm for finding these subsequences requires time proportional to the sequence length squared and is not feasible for sequences longer than a few kb (15).
We have developed three ways of circumventing this problem. Our first program, Cister, does not directly predict motif clusters, but returns a probability curve indicating the probability that each basepair in the sequence lies within a cluster, using the linear-time ForwardBackward algorithm (10). Cister pays the price of using a slightly more complex probabilistic model with more nuisance parameters. Comet finds motif clusters in linear time with the Viterbi algorithm, but it does not calculate the full log likelihood ratio, considering only the most likely arrangement of motifs within the subsequence (11). An advantage of Comet is that it calculates E-values to indicate the statistical significance of its predictions. Cluster-Buster tackles the problem head-on, employing a linear-time heuristic which attempts to return the same cluster predictions as the full quadratic-time algorithm. As a test we applied Cluster-Buster and an implementation of the quadratic-time algorithm to a set of 27 short sequences. The two programs returned the exact same 19 clusters. So Cluster-Buster appears to be extremely successful at emulating the exact algorithm.
In constructing the Cluster-Buster web server, we have gone to unusual lengths to make it convenient to use. For example, every option on the input form is linked to a pop-up help box which describes its purpose. The output provides an overview figure depicting the locations of motif clusters and annotated protein-coding regions in the sequence, followed by graphics and tables that detail the motifs within each cluster. We also provide a Linux executable of Cluster-Buster for download, which would be necessary for large-scale or highly customized use. Cluster-Buster is extremely fast, requiring
5 s to analyze a megabase of sequence with five motifs on a 1.6 GHz Athlon processor. That extrapolates to about 4 h for the whole human genome.
| INPUT |
|---|
|
|
|---|
Cluster-Buster takes two required items of input: a DNA sequence and a selection of motifs, and optionally a small number of parameters may be varied to tune its behavior. The sequence may be in raw, Fasta or GenBank format. If GenBank format is used, any annotated protein-coding regions (CDS) will be drawn in the results overview figure along with the motif clusters. Alternatively, if the user simply enters a GenBank identifier, the web server will automatically fetch the sequence. The web server, but not the downloadable program, limits the sequence length to 100 kb. Motifs are entered as 4xN matrices, where rows correspond to sequential positions in the motif and columns indicate abundances of A, C, G and T at each position. A limited number of built-in motifs are provided as checkboxes and matrices can be copied and pasted directly from the TRANSFAC website (16). A user of Cluster-Buster must begin with a hypothesis that a particular set of motifs may occur in clusters.
The tunable parameters include a gap parameter, residue abundance range and pseudocount. The gap parameter indicates the average distance between motifs within a cluster in Cluster-Buster's internal model of motif clusters. Low values enhance the program's sensitivity for tight clusters of weak motifs and high values enhance its sensitivity for loose clusters of strong motifs. Cluster-Buster's model of background DNA varies across the sequence to take account of fluctuations in nucleotide abundances. The residue abundance range specifies how far either side of each point in the sequence to count residue frequencies. The pseudocount is added to all entries in all motif matrices. Pseudocounts are a widely used technique, with a theoretical underpinning in Bayesian statistics, for estimating underlying frequencies from a limited number of counts. The web server also allows low-complexity regions, such as microsatellites or poly(A) tracts, to be filtered from the sequence using the program dust (R. Tatusov and D. Lipman, unpublished). Another option is to filter sequence regions written with lowercase letters, which is becoming a standard way of indicating repetitive elements. The repetitive nature of such regions may lead to artefactually strong motif clusters, but in some cases they may contain genuine regulatory elements.
| OUTPUT |
|---|
|
|
|---|
Cluster-Buster produces an overview diagram indicating the locations of motif clusters with score higher than a user-specified threshold and any annotated coding regions (Fig. 1A), followed by details of the motifs within each cluster (Fig. 1B). The overview represents motif clusters as green rectangles whose horizontal position indicates their position in the sequence and whose height is proportional to their log likelihood ratio score. Protein-coding regions on the forward strand are drawn as purple rectangles above the central line and those on the reverse strand are drawn below the line.
|
In the details figure, motifs are drawn as color-coded rectangles above or below the line according to the strand they are on (relative to the motif matrix; the ERE appears on only one strand in Fig. 1B because we used a slightly non-palindromic matrix to represent this motif ). The heights of the motif rectangles represent their log likelihood ratio scores using the standard weight matrix technique (17). A table below this figure lists the position, strand, score and sequence of each motif.
| METHOD |
|---|
|
|
|---|
The Cluster-Buster algorithm, in outline, consists of three steps:
- Perform one pass of the Forward algorithm to obtain the log likelihood score s[i] for each subsequence beginning at nucleotide 1 and ending at nucleotide i. Keep track of the subsequences (a,b) where the score increase s[b]-s[a] is maximal (i.e. they do not overlap another subsequence with a larger score increase).
- For each of these subsequences, we consider the end-point b to be reliable, but the start-point a to be unreliable. Perform the Backward algorithm beginning at b and continuing until slightly before a, to refine the optimal start-point.
- Remove subsequences that overlap higher scoring subsequences with a greedy algorithm.
This method bears some resemblance to one that has been used earlier for protein fold recognition (18).
| ACKNOWLEDGEMENTS |
|---|
M.F. is a Howard Hughes Medical Institute Predoctoral Fellow. This project was partially funded by NSF grants DBI-0078194 and DBI-0116574 and NIH grant 1P20GM066401-01.
| REFERENCES |
|---|
|
|
|---|
- Arnone,M.I. and Davidson,E.H. (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development, 124, 18511864.[Abstract]
- Berman,B.P., Nibu,Y., Pfeiffer,B.D., Tomancak,P., Celniker,S.E., Levine,M., Rubin,G.M. and Eisen,M.B. (2002) Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl Acad. Sci. USA, 99, 757762.
[Abstract/Free Full Text] - Prestridge,D.S. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol., 249, 923932.[CrossRef][Web of Science][Medline]
- Kondrakhin,Y.V., Kel,A.E., Kolchanov,N.A., Romashchenko,A.G. and Milanesi,L. (1995) Eukaryotic promoter recognition by binding sites for transcription factors. Comput. Appl. Biosci., 11, 477488.
[Abstract/Free Full Text] - Crowley,E.M., Roeder,K. and Bina,M. (1997) A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol., 268, 814.[CrossRef][Web of Science][Medline]
- Frech,K., Danescu-Mayer,J. and Werner,T. (1997) A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol., 270, 674687.[CrossRef][Web of Science][Medline]
- Wasserman,W.W. and Fickett,J.W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol., 278, 167181.[CrossRef][Web of Science][Medline]
- Krivan,W. and Wasserman,W.W. (2001) A predictive model for regulatory sequences directing liver-specific transcription. Genome Res., 11, 15591566.
[Abstract/Free Full Text] - Wagner,A. (1999) Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics, 15, 776784.
[Abstract/Free Full Text] - Frith,M.C., Hansen,U. and Weng,Z. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878889.
[Abstract/Free Full Text] - Frith,M.C., Spouge,J.L., Hansen,U. and Weng,Z. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res., 30, 32143224.
[Abstract/Free Full Text] - Markstein,M., Markstein,P., Markstein,V. and Levine,M.S. (2002) Genome-wide analysis of clustered dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, 99, 763768.
[Abstract/Free Full Text] - Rebeiz,M., Reeves,N.L. and Posakony,J.W. (2002) SCORE: a computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering over random expectation. Proc. Natl Acad. Sci. USA, 99, 98889893.
[Abstract/Free Full Text] - Rajewsky,N., Vergassola,M., Gaul,U. and Siggia,E.D. (2002) Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, 3, 30.[CrossRef][Medline]
- Rivas,E. and Eddy,S.R. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2, 8.[CrossRef][Medline]
- Wingender,E., Chen,X., Hehl,R., Karas,H., Liebich,I., Matys,V., Meinhardt,T., Pruss,M., Reuter,I. and Schacherer,F. (2000) TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res., 28, 316319.
[Abstract/Free Full Text] - Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 1623.
[Abstract/Free Full Text] - He,H. (2002) Protein domain dissection using stochastic modeling. PhD Thesis, Boston University, Boston, USA.
This article has been cited by other articles:
![]() |
P. Van Loo and P. Marynen Computational methods for the detection of cis-regulatory modules Brief Bioinform, September 1, 2009; 10(5): 509 - 524. [Abstract] [Full Text] [PDF] |
||||
![]() |
U. J. Pape, H. Klein, and M. Vingron Statistical detection of cooperative transcription factors with similarity adjustment Bioinformatics, August 15, 2009; 25(16): 2103 - 2109. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Narlikar and I. Ovcharenko Identifying regulatory elements in eukaryotic genomes Brief Funct Genomic Proteomic, July 1, 2009; 8(4): 215 - 230. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Fu, P. Ray, and E. P. Xing DISCOVER: a feature-based discriminative method for motif search in complex genomes Bioinformatics, June 15, 2009; 25(12): i321 - i329. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Whitington, A. C. Perkins, and T. L. Bailey High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites Nucleic Acids Res., January 1, 2009; 37(1): 14 - 25. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Hu, H. Hu, and X. Li MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs Nucleic Acids Res., August 1, 2008; 36(13): 4488 - 4497. [Abstract] [Full Text] [PDF] |
||||
![]() |
K.-J. Won, A. Sandelin, T. T. Marstrand, and A. Krogh Modeling promoter grammars with evolving hidden Markov models Bioinformatics, August 1, 2008; 24(15): 1669 - 1675. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Thomas-Chollier, O. Sand, J.-V. Turatsinze, R. Janky, M. Defrance, E. Vervisch, S. Brohee, and J. van Helden RSAT: regulatory sequence analysis tools Nucleic Acids Res., July 1, 2008; 36(suppl_2): W119 - W127. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. van Oeffelen, P. Cornelis, W. Van Delm, F. De Ridder, B. De Moor, and Y. Moreau Detecting cis-regulatory binding sites for cooperatively binding proteins Nucleic Acids Res., May 1, 2008; 36(8): e46 - e46. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Vandenbon, Y. Miyamoto, N. Takimoto, T. Kusakabe, and K. Nakai Markov Chain-based Promoter Structure Modeling for Tissue-specific Expression Pattern Prediction DNA Res, February 7, 2008; (2008) dsm034v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Chiranand, I. McLeod, H. Zhou, J. J. Lynn, L. A. Vega, H. Myers, J. R. Yates III, M. C. Lorenz, and M. C. Gustin CTA4 Transcription Factor Mediates Induction of Nitrosative Stress Response in Candida albicans Eukaryot. Cell, February 1, 2008; 7(2): 268 - 278. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Jiang, M. Q. Zhang, and X. Zhang OSCAR: One-class SVM for accurate recognition of cis-elements Bioinformatics, November 1, 2007; 23(21): 2823 - 2828. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Dai, J. He, and X. Zhao A new systematic computational approach to predicting target genes of transcription factors Nucleic Acids Res., July 26, 2007; 35(13): 4433 - 4440. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. S. Rozowsky, D. Newburger, F. Sayward, J. Wu, G. Jordan, J. O. Korbel, U. Nagalakshmi, J. Yang, D. Zheng, R. Guigo, et al. The DART classification of unannotated transcription within the ENCODE regions: Associating transcription with known and novel loci Genome Res., June 1, 2007; 17(6): 732 - 745. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Papatsenko ClusterDraw web server: a tool to identify and visualize clusters of binding motifs for transcription factors Bioinformatics, April 15, 2007; 23(8): 1032 - 1034. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Ferretti, C. Poitras, D. Bergeron, B. Coulombe, F. Robert, and M. Blanchette PReMod: a database of genome-wide mammalian cis-regulatory module predictions Nucleic Acids Res., January 12, 2007; 35(suppl_1): D122 - D126. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Pierstorff, C. M. Bergman, and T. Wiehe Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA Bioinformatics, December 1, 2006; 22(23): 2858 - 2864. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Chowdhary, S. L. Tan, R. A. Ali, B. Boerlage, L. Wong, and V. B Bajic Dragon Promoter Mapper (DPM): a Bayesian framework for modelling promoter structures Bioinformatics, September 15, 2006; 22(18): 2310 - 2312. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. GuhaThakurta Computational identification of transcriptional regulatory elements in DNA sequence Nucleic Acids Res., July 19, 2006; 34(12): 3585 - 3598. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. J. Donaldson and B. Gottgens TFBScluster web server for the identification of mammalian composite regulatory elements. Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W524 - W528. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Blanchette, A. R. Bataille, X. Chen, C. Poitras, J. Laganiere, C. Lefebvre, G. Deblois, V. Giguere, V. Ferretti, D. Bergeron, et al. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression Genome Res., May 1, 2006; 16(5): 656 - 668. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. J. Donaldson, M. Chapman, and B. Gottgens TFBScluster: a resource for the characterization of transcriptional regulatory networks Bioinformatics, July 1, 2005; 21(13): 3058 - 3059. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. V. Favorov, M. S. Gelfand, A. V. Gerasimova, D. A. Ravcheev, A. A. Mironov, and V. J. Makeev A Gibbs sampler for identification of symmetrically structured, spaced DNA motifs with improved estimation of the signal length Bioinformatics, May 15, 2005; 21(10): 2240 - 2245. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. A. Hutcheson, M. I. Hanson, K. B. Moore, T. T. Le, N. L. Brown, and M. L. Vetter bHLH-dependent and -independent modes of Ath5 gene regulation during retinal development Development, February 15, 2005; 132(4): 829 - 839. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. A. Wiwi and D. J. Waxman Role of Hepatocyte Nuclear Factors in Transcriptional Regulation of Male-specific CYP2A2 J. Biol. Chem., February 4, 2005; 280(5): 3259 - 3268. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Blüthgen, S. M. Kielbasa, and H. Herzel Inferring combinatorial regulation of transcription in silico Nucleic Acids Res., January 12, 2005; 33(1): 272 - 279. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. O'Lone, M. C. Frith, E. K. Karlsson, and U. Hansen Genomic Targets of Nuclear Estrogen Receptors Mol. Endocrinol., August 1, 2004; 18(8): 1859 - 1875. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. A. Wiwi, M. Gupte, and D. J. Waxman Sexually Dimorphic P450 Gene Expression in Liver-Specific Hepatocyte Nuclear Factor 4{alpha}-Deficient Mice Mol. Endocrinol., August 1, 2004; 18(8): 1975 - 1987. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. B.L. Alkema, O. Johansson, J. Lagergren, and W. W. Wasserman MSCAN: identification of functional clusters of transcription factor binding sites Nucleic Acids Res., July 1, 2004; 32(suppl_2): W195 - W198. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Hu, Y. Fu, A. S. Halees, S. M. Kielbasa, and Z. Weng SeqVISTA: a new module of integrated computational tools for studying transcriptional regulation Nucleic Acids Res., July 1, 2004; 32(suppl_2): W235 - W241. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. C. Frith, Y. Fu, L. Yu, J.-F. Chen, U. Hansen, and Z. Weng Detection of functional DNA motifs via statistical over-representation Nucleic Acids Res., February 26, 2004; 32(4): 1372 - 1381. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Sandelin, W. Alkema, P. Engstrom, W. W. Wasserman, and B. Lenhard JASPAR: an open-access database for eukaryotic transcription factor binding profiles Nucleic Acids Res., January 1, 2004; 32(90001): D91 - 94. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











