Nucleic Acids Research Advance Access originally published online on June 13, 2008
Nucleic Acids Research 2008 36(12):4137-4148; doi:10.1093/nar/gkn361

Nucleic Acids Research, 2008, Vol. 36, No. 12 4137-4148
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Computational Biology

Extracting sequence features to predict protein–DNA interactions: a comparative study

Qing Zhou¹,* and Jun S. Liu²

¹Department of Statistics, University of California, Los Angeles, CA 90095 and ²Department of Statistics, Harvard University, Cambridge, MA 02138, USA

*To whom correspondence should be addressed. Tel: +1 310 794 7563; Fax: +1 310 206 5658; Email: zhou{at}stat.ucla.edu

Correspondence may also be addressed to Jun S. Liu. Tel: +1 617 495 1600; Fax: +1 617 496 8057; Email: jliu{at}stat.harvard.edu

Received February 25, 2008. Revised May 16, 2008. Accepted May 21, 2008.

Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present a systematic study of predictive modeling approaches to the TF–DNA binding problem; these approaches have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods, including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets, on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power over the PWM approach and identify more biologically interesting features, such as TF–TF interactions. In particular, BART and boosting show the best and most robust overall performance among all the methods.
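As a rough illustration of the regression framework described in the abstract, the sketch below extracts a few sequence features (a PWM log-odds score and two k-mer counts), simulates a binding intensity, and selects influential features with componentwise least-squares boosting. Everything in it is an assumption made for illustration: the motif consensus "ATGC", the k-mers, the simulated data, the function names, and the choice of componentwise boosting as a simple stand-in for the learning methods compared in the article. It is not the authors' implementation.

```python
import math
import random


def make_pwm(consensus, p_match=0.85):
    """Hypothetical PWM: each position strongly prefers the consensus base."""
    p_other = (1.0 - p_match) / 3.0
    return [{b: (p_match if b == c else p_other) for b in "ACGT"} for c in consensus]


def pwm_max_score(seq, pwm, background=0.25):
    """Best log-odds score of the PWM over all windows of seq (one strand only)."""
    w = len(pwm)
    best = -math.inf
    for start in range(len(seq) - w + 1):
        score = sum(math.log(pwm[j][seq[start + j]] / background) for j in range(w))
        best = max(best, score)
    return best


def kmer_count(seq, kmer):
    """Number of (possibly overlapping) occurrences of kmer in seq."""
    return sum(seq.startswith(kmer, i) for i in range(len(seq) - len(kmer) + 1))


def componentwise_boost(X, y, n_steps=200, shrinkage=0.1):
    """Componentwise least-squares boosting on centered features.

    At each step, every feature is fitted alone to the current residuals and
    only the feature giving the largest drop in residual sum of squares is
    updated. Returns (intercept, coefficients) on the original feature scale.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    cols = [[row[j] - means[j] for row in X] for j in range(p)]
    norms = [sum(v * v for v in col) for col in cols]
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    coef = [0.0] * p
    for _ in range(n_steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            if norms[j] == 0.0:
                continue  # constant feature carries no information
            dot = sum(c * r for c, r in zip(cols[j], resid))
            gain = dot * dot / norms[j]  # RSS reduction from feature j alone
            if gain > best_gain:
                best_j, best_beta, best_gain = j, dot / norms[j], gain
        if best_j is None:
            break
        step = shrinkage * best_beta
        coef[best_j] += step
        resid = [r - step * c for r, c in zip(resid, cols[best_j])]
    intercept = y_mean - sum(b * m for b, m in zip(coef, means))
    return intercept, coef


# Simulated data (an assumption): intensity depends only on the PWM score.
rng = random.Random(0)
pwm = make_pwm("ATGC")
seqs = ["".join(rng.choice("ACGT") for _ in range(50)) for _ in range(200)]
feature_names = ["pwm_score", "count_GG", "count_TA"]
X = [[pwm_max_score(s, pwm), kmer_count(s, "GG"), kmer_count(s, "TA")] for s in seqs]
y = [1.0 * row[0] + rng.gauss(0.0, 0.5) for row in X]

intercept, coef = componentwise_boost(X, y)
for name, b in zip(feature_names, coef):
    print(name, round(b, 3))
```

In this toy setting, the coefficient on the PWM score should dominate, while the two irrelevant k-mer counts stay near zero. This is the sense in which variable selection identifies influential sequence features.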

