Nucleic Acids Research Advance Access originally published online on June 13, 2008
Nucleic Acids Research 2008 36(12):4137-4148; doi:10.1093/nar/gkn361
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Computational Biology
Extracting sequence features to predict protein–DNA interactions: a comparative study
¹Department of Statistics, University of California, Los Angeles, CA 90095 and ²Department of Statistics, Harvard University, Cambridge, MA 02138, USA
*To whom correspondence should be addressed. Tel: +1 310 794 7563; Fax: +1 310 206 5658; Email: zhou{at}stat.ucla.edu
Correspondence may also be addressed to Jun S. Liu. Tel: +1 617 495 1600; Fax: +1 617 496 8057; Email: jliu{at}stat.harvard.edu
Received February 25, 2008. Revised May 16, 2008. Accepted May 21, 2008.
Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present a systematic study of predictive modeling approaches to the TF–DNA binding problem, approaches that have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods: stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets in human embryonic stem cells, one for each of the TFs Oct4 and Sox2. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power over the PWM approach and identify more biologically interesting features, such as TF–TF interactions. In particular, BART and boosting show the best and most robust overall performance among all the methods.
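To make the general idea concrete, the following is a minimal Python sketch of turning sequences into numeric features (a PWM match score and k-mer counts) and relating them to a measured response. It is not the authors' pipeline. The aligned sites, the simulated sequences and response, and all function names are hypothetical and chosen only for illustration. Ranking features by absolute Pearson correlation is a deliberately crude stand-in for the variable selection step; it is not one of the learning methods compared in the study.

```python
import math
import random

BASES = "ACGT"


def counts_from_sites(sites):
    """Per-position base counts from equal-length aligned sites."""
    width = len(sites[0])
    return [{b: sum(site[i] == b for site in sites) for b in BASES}
            for i in range(width)]


def pwm_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to a log-odds PWM against a uniform background."""
    pwm = []
    for col in counts:
        total = sum(col[b] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log((col[b] + pseudocount) / total / background)
                    for b in BASES})
    return pwm


def max_pwm_score(seq, pwm):
    """Best log-odds score over all windows of seq (forward strand only)."""
    w = len(pwm)
    return max(sum(pwm[i][seq[s + i]] for i in range(w))
               for s in range(len(seq) - w + 1))


def kmer_count(seq, kmer):
    """Number of (possibly overlapping) occurrences of kmer in seq."""
    return sum(seq.startswith(kmer, i)
               for i in range(len(seq) - len(kmer) + 1))


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features(features, response):
    """Feature names sorted by |correlation with response|, strongest first."""
    return sorted(features,
                  key=lambda name: abs(pearson(features[name], response)),
                  reverse=True)


# Hypothetical aligned binding sites (toy data, not a real motif).
sites = ["TGACGT", "TGACGT", "TGACGA", "TGATGT"]
pwm = pwm_log_odds(counts_from_sites(sites))

# Simulated 60-bp sequences; the consensus site is planted in every other one.
rng = random.Random(0)
seqs = ["".join(rng.choice(BASES) for _ in range(60)) for _ in range(40)]
seqs = [s[:20] + "TGACGT" + s[26:] if i % 2 == 0 else s
        for i, s in enumerate(seqs)]

features = {
    "pwm_max": [max_pwm_score(s, pwm) for s in seqs],
    "count_TGAC": [kmer_count(s, "TGAC") for s in seqs],
    "count_AAAA": [kmer_count(s, "AAAA") for s in seqs],
}

# Simulated "binding intensity": PWM score plus Gaussian noise (an assumption).
response = [score + rng.gauss(0, 0.1) for score in features["pwm_max"]]
ranking = rank_features(features, response)
```

In a fuller analysis, the feature table built here would be passed to a regression learner with built-in or explicit variable selection, such as the stepwise regression, boosting or BART methods named in the abstract, rather than ranked one feature at a time.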