Published online 12 February 2004
Nucleic Acids Research, 2004, Vol. 32, No. 3 949-958
© 2004 Oxford University Press
Statistical analysis of over-represented words in human promoter sequences
Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, MSC 6075 Bethesda, MD 20894-6075, USA
*To whom correspondence should be addressed. Tel: +1 301 435 5981; Fax: +1 301 480 2288; Email: landsman{at}ncbi.nlm.nih.gov
The identification and characterization of regulatory sequence elements in the proximal promoter region of a gene can be facilitated by knowing the precise location of the transcriptional start site (TSS). Using known TSSs from over 5700 different human full-length cDNAs, this study extracted a set of 4737 distinct putative promoter regions (PPRs) from the human genome. Each PPR consisted of nucleotides from -2000 to +1000 bp, relative to the corresponding TSS. Since many regulatory regions contain short, highly conserved strings of fewer than 10 nucleotides, we counted eight-letter words within the PPRs, using z-scores and other related statistics to evaluate their over- and under-representation. Several over-represented eight-letter words have known biological functions described in the eukaryotic transcription factor database TRANSFAC; however, many did not. Besides calculating a P-value with the standard normal approximation associated with z-scores, we used two extra statistical controls to evaluate the significance of over-represented words. These controls have important implications for evaluating over- and under-represented words with z-scores.
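The word-counting and z-score step described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the abstract does not specify the background model, so the sketch assumes independent bases (a zero-order model) and a binomial variance that ignores word self-overlap. The function names `count_words`, `base_freqs`, `z_score`, and `upper_tail_p` are hypothetical.

```python
from collections import Counter
from math import erfc, sqrt


def count_words(seqs, k=8):
    """Count every overlapping k-letter word made only of A, C, G, T."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            w = s[i:i + k]
            if set(w) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[w] += 1
    return counts


def base_freqs(seqs):
    """Estimate single-base frequencies from the sequences themselves."""
    c = Counter()
    for s in seqs:
        c.update(ch for ch in s.upper() if ch in "ACGT")
    total = sum(c.values())
    return {b: c[b] / total for b in "ACGT"}


def z_score(word, observed, n_positions, freqs):
    """z-score of an observed count under the assumed zero-order background.

    Assumption: each of the n_positions windows is treated as an independent
    Bernoulli trial, which ignores overlap between neighbouring windows.
    """
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    expected = n_positions * p
    var = n_positions * p * (1.0 - p)
    if var == 0:
        raise ValueError("variance is zero under this background model")
    return (observed - expected) / sqrt(var)


def upper_tail_p(z):
    """One-sided P-value from the standard normal approximation."""
    return 0.5 * erfc(z / sqrt(2))
```

Given a list of promoter sequences `seqs`, one would call `count_words(seqs)`, sum the counts to obtain `n_positions`, and pass each word's count to `z_score` with `base_freqs(seqs)`. Because the independence assumption can misstate the variance, the resulting P-values are only a rough guide, which is in keeping with the abstract's point that z-scores benefit from additional statistical controls.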