Nucleic Acids Research Advance Access originally published online on May 3, 2007
Nucleic Acids Research 2007 35(Web Server issue):W47-W51; doi:10.1093/nar/gkm217
Nucleic Acids Research, 2007, Vol. 35, No. suppl_2 W47-W51
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features
State Key Laboratory of Bioelectronics, Department of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, P. R. China
*To whom correspondence should be addressed: Tel: +86 25 83793779; Fax: +86 25 83793779; Email: zhlu{at}seu.edu.cn
Received January 27, 2007. Revised March 20, 2007. Accepted March 28, 2007.
| ABSTRACT |
|---|
In yeast, meiotic recombination is initiated by double-strand DNA breaks (DSBs), which occur at relatively high frequencies in some genomic regions (hotspots) and at relatively low frequencies in others (coldspots). Although observations of individual hot/cold spots have given clues to the mechanism of recombination initiation, predicting hot/cold spots from DNA sequence information remains a challenging task. In this article, we introduce a random forest (RF) prediction model to detect recombination hot/cold spots in the yeast genome. The out-of-bag (OOB) estimate indicated that the RF classifier achieved high prediction performance, with 82.05% total accuracy and a Matthews correlation coefficient (MCC) of 0.638. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performed better in both sensitivity and specificity. The prediction model is implemented as a web server (RF-DYMHC) that is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome sequence and prediction parameters (reliability index (RI) value and non-overlapping scan window size), the program reports the predicted hot/cold spots and marks them in color.
| INTRODUCTION |
|---|
In yeast, meiotic recombination is initiated by double-strand DNA breaks (DSBs). Genomic regions in which meiotic DSBs occur at relatively high frequencies are called hotspots, while regions with low DSB frequencies are called coldspots (1). Several studies have examined whether hot/cold spots share common DNA sequences and/or structural elements (2,3). Hotspots were found to be non-randomly associated with regions of high G + C base composition and with certain transcriptional profiles, while coldspots were non-randomly associated with centromeres and telomeres.
Although observations of individual hot/cold spots have given clues to the mechanism of recombination initiation, predicting hot/cold spots from DNA sequence information is still a challenging task. So far, nearly all methods for finding recombination hot/cold spots are based on population-genetic data (4–6), and no software or web server has been reported that predicts hot/cold spots from a single DNA sequence.
In this study, we present a machine-learning method, the random forest (RF) model, to detect yeast meiotic recombination hotspots and coldspots from genome sequences. Although several studies have demonstrated a correlation between the synonymous codon usage pattern and the recombination rate in Caenorhabditis elegans, mouse, human and other species (7–13), most hotspots are intergenic rather than intragenic, so attributes based on gene codon usage patterns may not be applicable to non-coding regions. For that reason, an ORF (open reading frame)-independent feature, the gapped dinucleotide composition, was used in our study. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performed better in both sensitivity and specificity. The prediction model is implemented as a web server (RF-DYMHC) that is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast DNA sequence and prediction parameters (RI value and non-overlapping scan window size), the program reports the predicted hot/cold spots and marks them in color.
| MATERIALS AND METHODS |
|---|
Data sets
Gerton et al. (14) estimated the relative recombination rates of loci in the yeast Saccharomyces cerevisiae using DNA microarrays at single-gene resolution. To estimate DSB formation adjacent to each ORF, they measured the ratio of hybridization of a DSB-enriched probe (P2) to that of a total genomic probe (P1). The relative strength of the recombination rate was estimated by the P2/P1 hybridization ratio. The experiments were repeated seven times for each of the 6200 genes. In this article, we take the median value as the relative recombination rate of each sequence. If any repeated array value was missing, the sequence was excluded. In total, 5266 sequences were collected. Sequences with a relative hybridization ratio ≥1.5 are defined as hotspots, while those with a relative hybridization ratio <0.82 are defined as coldspots. Thus, we obtained 490 hotspots and 591 coldspots, which together make up the training data set. The yeast S. cerevisiae mitochondrial DNA sequence, which served as a negative control for our method, was downloaded from the Saccharomyces Genome Database (15) at http://www.yeastgenome.org/. All the data sets used in this article can be downloaded from http://www.bioinf.seu.edu.cn/Recombination/datasets.htm
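The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function name `label_locus`, the use of `None` for a missing array value, the `"other"` label for sequences that are neither hot nor cold, and the inclusive treatment of the 1.5 cutoff are all assumptions.

```python
from statistics import median

HOT_THRESHOLD = 1.5    # hotspot cutoff on the P2/P1 ratio (from the text)
COLD_THRESHOLD = 0.82  # coldspot cutoff on the P2/P1 ratio (from the text)


def label_locus(ratios, n_repeats=7):
    """Return 'hot', 'cold', 'other', or None if the locus is excluded.

    `ratios` holds the repeated P2/P1 values for one locus; None marks a
    missing array value (a hypothetical encoding chosen for this sketch).
    """
    if len(ratios) != n_repeats or any(r is None for r in ratios):
        return None  # excluded: at least one repeated value is missing
    m = median(ratios)  # the median is used as the relative recombination rate
    if m >= HOT_THRESHOLD:  # assumption: the hotspot cutoff is inclusive
        return "hot"
    if m < COLD_THRESHOLD:
        return "cold"
    return "other"  # neither hot nor cold; not part of the training set


print(label_locus([1.6, 1.7, 1.5, 2.0, 1.4, 1.9, 1.8]))
print(label_locus([1.0, None, 1.0, 1.0, 1.0, 1.0, 1.0]))
```

Only loci labeled hot or cold would enter the two-class training set; the rest correspond to the remaining sequences used later for genome-wide testing.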
Gapped dinucleotide composition features
The gapped dinucleotide composition is the fraction of each two nucleotides with k intervening bases within a sequence. It can be defined as:
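$$f_k(XY) = \frac{N_k(XY)}{\sum_{X',Y' \in \{A,C,G,T\}} N_k(X'Y')} \tag{1}$$

where $N_k(XY)$ is the number of times nucleotide X is followed by nucleotide Y with k intervening bases in the sequence, and the sum runs over all 16 nucleotide pairs.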
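As an illustration of the definition above, the following Python sketch counts nucleotide pairs separated by k intervening bases and normalizes the counts to fractions. The function names and the decision to skip pairs containing ambiguous bases (such as N) are assumptions made for this example; concatenating the gap {0} and gap {1} compositions gives the 32 features used later in the article.

```python
from itertools import product

# The 16 ordered nucleotide pairs, in a fixed order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def gapped_dinucleotide_composition(seq, k):
    """Fraction of each nucleotide pair separated by k intervening bases."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # assumption: skip pairs with ambiguous bases
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def feature_vector(seq, gaps=(0, 1)):
    """Concatenate the compositions for each gap (16 values per gap)."""
    features = []
    for k in gaps:
        features.extend(gapped_dinucleotide_composition(seq, k))
    return features


vec = feature_vector("ACGTACGTTTGACA")
print(len(vec))  # 16 features for gap 0 plus 16 for gap 1
```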
Random forest
RF is a classifier consisting of an ensemble of tree-structured classifiers (17). RF takes advantage of two powerful machine-learning techniques: bagging (18) and random feature selection. In bagging, each tree is trained on a bootstrap sample of the training data, and predictions are made by majority vote of the trees. RF is a further development of bagging. Instead of using all features, RF randomly selects a subset of features to split on at each node when growing a tree. To assess the prediction performance of the algorithm, RF performs a type of cross-validation in parallel with the training step by using the so-called out-of-bag (OOB) samples. Specifically, in the process of training, each tree is grown using a particular bootstrap sample. Since bootstrapping is sampling with replacement from the training data, some of the sequences will be left out of the sample, while others will be repeated in the sample. The left-out sequences constitute the OOB sample. On average, each tree is grown using about $1 - e^{-1} \approx 2/3$ of the training sequences, leaving $e^{-1} \approx 1/3$ as OOB. Because OOB sequences have not been used in the tree construction, one can use them to estimate the prediction performance (19,20). The RF algorithm was implemented with the randomForest R package (21).
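The out-of-bag fraction quoted above can be checked numerically. The sketch below uses only the Python standard library to draw bootstrap samples and measure how many items are left out; it illustrates the bootstrap step alone, not the randomForest package, and the function name and default parameters are assumptions.

```python
import math
import random


def oob_fraction(n_samples, n_trees=200, seed=0):
    """Average fraction of samples left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Bootstrap: draw n_samples indices with replacement.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += (n_samples - len(in_bag)) / n_samples
    return total / n_trees


n = 1081  # 490 hotspots + 591 coldspots in the training set
print(oob_fraction(n))    # empirical out-of-bag fraction
print((1 - 1 / n) ** n)   # probability that one given sample is never drawn
print(math.exp(-1))       # large-n limit, about 0.368
```

Each sequence is therefore out of bag for roughly a third of the trees, and its OOB prediction is the majority vote of only those trees.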
Support vector machine
SVM is a supervised machine-learning technique for data classification based on statistical learning theory (22). SVM seeks an optimal hyperplane to separate two classes of samples. It uses kernel functions to map the original data to a feature space of higher dimension and locates an optimal separating hyperplane there. The SVM algorithm was implemented with the e1071 (version 1.5-12) R package (23). We tried different kernels (linear, RBF, second- and third-order polynomial), and the RBF kernel performed best (data not shown). We therefore used the SVM with the RBF kernel as a competing machine-learning method for comparison with the RF algorithm. The parameters C and γ of the RBF kernel were optimized by the standard grid search (24).
Prediction system assessment
For a two-class prediction problem, the outcome for each individual instance falls into one of four categories: false positive (FP), true positive (TP), false negative (FN) or true negative (TN). The total prediction accuracy (ACC), specificity (Sp), sensitivity (Se) and Matthews correlation coefficient (MCC) (25) used to assess the prediction system are given by
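$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{3}$$

$$\mathrm{Se} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$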
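A short Python sketch of these four measures follows, using their standard textbook definitions (accuracy as the fraction of correct calls, specificity as TN/(TN + FP), sensitivity as TP/(TP + FN), and the usual MCC formula). The function name and the example counts are hypothetical; the counts are chosen only so that they sum to the 490 hotspots and 591 coldspots of the training set, and they are not results from the article.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, Sp, Se and MCC from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # fraction of negatives kept
    se = tp / (tp + fn) if (tp + fn) else 0.0  # fraction of positives found
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sp": sp, "Se": se, "MCC": mcc}


# Hypothetical counts with hotspots as the positive class.
for name, value in classification_metrics(tp=400, tn=480, fp=111, fn=90).items():
    print(f"{name}: {value:.3f}")
```

MCC is the more informative single number when the two classes differ in size, because it stays near zero for a classifier that simply favors the larger class.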
Reliability index
Here, the reliability index (RI) was used to indicate how reliable each hotspot or coldspot prediction is. For the RF algorithm, an intuitive RI can be derived from the fractions of votes for the positive and negative classes of each sample. We define RI as:
| (6) |
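The defining equation is not reproduced in this copy of the text, so the following Python sketch uses an assumed form: the absolute difference between the two vote fractions, scaled to the range 0 to 10. That scale is chosen only because it is compatible with the RI cutoffs between 2 and 7 quoted later in the article; the exact formula used by RF-DYMHC may differ (for example, it may round to an integer), and all names here are hypothetical.

```python
def reliability_index(votes_hot, votes_cold, scale=10):
    """Hypothetical RI: scaled absolute difference of the vote fractions."""
    total = votes_hot + votes_cold
    if total <= 0:
        raise ValueError("at least one tree must vote")
    return scale * abs(votes_hot - votes_cold) / total


def predict_with_ri(votes_hot, votes_cold):
    """Majority-vote label together with its (assumed) reliability index."""
    label = "hot" if votes_hot >= votes_cold else "cold"
    return label, reliability_index(votes_hot, votes_cold)


print(predict_with_ri(90, 10))  # a strongly one-sided vote gives a high RI
print(predict_with_ri(52, 48))  # a near tie gives an RI close to zero
```

Under this assumption, raising the RI cutoff keeps only predictions on which the trees largely agree, which is how RI trades sensitivity for specificity.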
| RESULTS |
|---|
Constructing the RF prediction model with gapped dinucleotide composition features
The prediction results of the RF classifiers are shown in Table 1. Performance was evaluated by OOB estimation on the training data set. The RF prediction models based on the gap {0} and the gap {1} dinucleotide compositions achieved total accuracies of 80.94 and 81.12%, respectively. The prediction performance can be improved by combining the two composition features: the gap {0, 1} based RF model achieved 82.05% total accuracy and an MCC value of 0.638.
Reliability index of the RF model
The reliability of a prediction is an important factor that gives users more information about its quality. We adopted RI to indicate the level of certainty of the prediction model. The results, shown in Figure 1, were obtained through OOB estimation. They indicate that the higher the RI, the more reliable the prediction. When RI > 6, the total prediction accuracy is >90%. Approximately 78.1% of the predicted sequences had RI > 2, which indicates that the RF prediction model is reliable.
Comparison with the SVM prediction model
SVMs have been reported to outperform other machine-learning methods in many areas of pattern recognition (24,26–31). We therefore chose the SVM prediction model as an alternative algorithm for comparison with the RF prediction model. To make the comparison impartial, a two-fold cross-validation was carried out. We randomly divided the training data set into two independent data sets (data set 1 and data set 2) of approximately equal size. We then used one data set for parameter tuning (the parameters were optimized by the standard grid search (24)) and training, and the other data set for evaluating the prediction performance. As shown in Table 2, the RF classifier outperformed the SVM classifier in both sensitivity and specificity.
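The splitting step of this comparison can be sketched as follows. The code only shows how a data set might be divided into two random halves and how the two halves can swap roles; the function names and the fixed random seed are assumptions, and the classifier training itself is left out.

```python
import random


def two_fold_split(items, seed=0):
    """Randomly divide items into two halves of approximately equal size."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_fold_rounds(items, seed=0):
    """Yield (tuning and training set, evaluation set) for both rounds."""
    set_1, set_2 = two_fold_split(items, seed)
    yield set_1, set_2  # tune and train on data set 1, evaluate on data set 2
    yield set_2, set_1  # assumption: the roles are then swapped


sequence_ids = range(1081)  # stand-ins for the 490 hotspots and 591 coldspots
for train, test in two_fold_rounds(sequence_ids):
    print(len(train), len(test))
```

Giving both classifiers exactly the same two halves is what keeps the comparison fair, since neither method sees the evaluation half during tuning.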
Applying the RF model to full genome analysis
To evaluate the sensitivity and specificity of the RF model in detecting hotspots and coldspots across the full genome, we trained the RF model on the training data set and tested it on the remaining 4185 sequences. The distribution of recombination rates of the predicted hot/cold spots at different RI values is shown in Figure 2. As the RI value increases, the recombination rates of the predicted hotspots tend to increase and those of the predicted coldspots tend to decrease. Predictions with a higher RI value are thus more likely to be true hotspots or coldspots. RI therefore acts as a regulating parameter that controls the trade-off between sensitivity and specificity. We set a cutoff of RI > 7. Out of the 4185 sequences, a total of 195 sequences were predicted as hotspots and 591 sequences were predicted as coldspots. Approximately 81.0% of the predicted hotspots had relative recombination ratios >1.09 and ∼80.0% of the predicted coldspots had relative recombination ratios <1.07.
Since it would be surprising to find meiotic recombination hot/cold spots in mtDNA, the yeast S. cerevisiae mitochondrial data can serve as a negative control for our method. We used the RF model to scan the S. cerevisiae mitochondrial DNA with a non-overlapping window (window size: 0.5 kb). The results showed that all RI values were ≤5 and ∼98.8% of the RI values were ≤3, which is consistent with current knowledge.
Web server
The prediction model is implemented as a web server named RF-DYMHC, available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome sequence and prediction parameters (RI value and non-overlapping scan window size), the program breaks the input sequence into subsequences. Each subsequence constitutes a sample, and each sample is mapped into a 32-dimensional feature space reflecting the gap {0} and gap {1} dinucleotide compositions. The web server returns the predicted hotspots and coldspots and marks them in color. More details about the input and output formats are available at http://www.bioinf.seu.edu.cn/Recombination/Manual.htm
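The window-scanning step can be illustrated with a small Python generator. This is a sketch of the general idea, not the server's code: the function name is hypothetical, the 500-base default mirrors the 0.5 kb window used for the mitochondrial control, and keeping a shorter final window is an assumption, since the text does not say how a trailing partial window is handled.

```python
def non_overlapping_windows(seq, window_size=500):
    """Yield (start, end, subsequence) for consecutive non-overlapping windows."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    for start in range(0, len(seq), window_size):
        end = min(start + window_size, len(seq))
        # Assumption: a shorter final window is kept rather than discarded.
        yield start, end, seq[start:end]


genome = "ACGT" * 300  # a toy 1200-base sequence
for start, end, window in non_overlapping_windows(genome):
    print(start, end, len(window))
```

Each window would then be converted to its 32 composition features and passed to the classifier, and only windows whose RI exceeds the user-chosen cutoff would be reported.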
| DISCUSSION |
|---|
Detecting meiotic recombination hotspots and coldspots in eukaryotic genomes by computational techniques is a challenging problem. In this article, we have introduced an RF-based method to detect recombination hot/cold spots in the yeast genome. The OOB estimate indicated that the RF classifier achieved high prediction accuracy. The RF model was also compared with an alternative machine-learning algorithm, the SVM, and was found to outperform it in both sensitivity and specificity. We used the RF model to test the remaining 4185 sequences. The results indicated that the RI controls the trade-off between sensitivity and specificity.
Although the prediction model was built as a two-class model, we also attempted to construct a three-class RF prediction model. We ranked the Gerton et al. data set (5266 sequences) by the median value of the seven microarrays. The top one-third of the sequences were marked as hotspots, the bottom one-third as coldspots and the rest as neutral sequences. The total accuracy of the OOB estimation was 51.22%, which was only 17.89 percentage points higher than that of a random classifier. Approximately 65.60% of the misclassified coldspots were predicted as neutral, while ∼67.23% of the misclassified neutral sequences were predicted as coldspots. These results indicate that the three-class RF model failed to separate the coldspots from the neutral sequences.
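The margin over a random classifier follows from the equal class sizes: with three classes of one-third each, random guessing is expected to be correct one-third of the time, so

$$51.22\% - \tfrac{1}{3} \times 100\% \approx 51.22\% - 33.33\% = 17.89\%.$$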
Since the experimental identification of recombination hot/cold spots is time-consuming and costly, it is infeasible for large numbers of genomic sequences. Detecting them efficiently and reliably by computational approaches is therefore important. Further improvement of our model will focus on incorporating more attributes. Our prediction system will also be refined as more experimentally validated data sets become available.
| ACKNOWLEDGEMENT |
|---|
Funding to pay the Open Access publication charges for this article was provided by National Natural Science Foundation of China (No. 60121101).
Conflict of interest statement. None declared.
| REFERENCES |
|---|
1. Lichten M, Goldman AS. Meiotic recombination hotspots. Annu. Rev. Genet (1995) 29:423–444.
2. Zenvirth D, Arbel T, Sherman A, Goldway M, Klein S, Simchen G. Multiple sites for double-strand breaks in whole meiotic chromosomes of Saccharomyces cerevisiae. EMBO J (1992) 11:3441–3447.
3. Klein S, Zenvirth D, Dror V, Barton AB, Kaback DB, Simchen G. Patterns of meiotic double-strand breakage on native and artificial yeast chromosomes. Chromosoma (1996) 105:276–284.
4. Fearnhead P, Smith NG. A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet (2005) 77:781–794.
5. Fearnhead P, Donnelly P. Estimating recombination rates from population genetic data. Genetics (2001) 159:1299–1318.
6. Stumpf MP, McVean GA. Estimating recombination rates from population-genetic data. Nat. Rev. Genet (2003) 4:959–968.
7. Fullerton SM, Bernardo Carvalho A, Clark AG. Local rates of recombination are positively correlated with GC content in the human genome. Mol. Biol. Evol (2001) 18:1139–1142.
8. Kliman RM, Hey J. Reduced natural selection associated with low recombination in Drosophila melanogaster. Mol. Biol. Evol (1993) 10:1239–1258.
9. Kliman RM, Irving N, Santiago M. Selection conflicts, gene expression, and codon usage trends in yeast. J. Mol. Evol (2003) 57:98–109.
10. Marais G, Mouchiroud D, Duret L. Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc. Natl Acad. Sci. USA (2001) 98:5688–5692.
11. Marais G, Piganeau G. Hill-Robertson interference is a minor determinant of variations in codon bias across Drosophila melanogaster and Caenorhabditis elegans genomes. Mol. Biol. Evol (2002) 19:1399–1406.
12. Perry J, Ashworth A. Evolutionary rate of a gene affected by chromosomal position. Curr. Biol (1999) 9:987–989.
13. Zhou T, Weng J, Sun X, Lu Z. Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition. BMC Bioinformatics (2006) 7:223.
14. Gerton JL, DeRisi J, Shroff R, Lichten M, Brown PO, Petes TD. Inaugural article: global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA (2000) 97:11383–11390.
15. Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, et al. Genetic and physical maps of Saccharomyces cerevisiae. Nature (1997) 387:67–73.
16. Park KJ, Kanehisa M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics (2003) 19:1656–1663.
17. Breiman L. Random forest. Mach. Learning (2001) 45:5–32.
18. Breiman L. Bagging predictors. Mach. Learning (1996) 24:123–140.
19. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci (2003) 43:1947–1958.
20. Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics (2006) 7:3.
21. Liaw A, Wiener M. Classification and regression by randomForest. R News (2002) 2:18–22.
22. Vapnik V. Statistical Learning Theory (1998) NY, USA: Wiley.
23. Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. (2006).
24. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics (2001) 17:721–728.
25. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975) 405:442–451.
26. Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res (2004) 32:W414–W419.
27. Bhasin M, Reinherz EL, Reche PA. Recognition and classification of histones using support vector machine. J. Comput. Biol (2006) 13:102–112.
28. Lin HH, Han LY, Cai CZ, Ji ZL, Chen YZ. Prediction of transporter family from protein sequence by support vector machine approach. Proteins (2006) 62:218–231.
29. Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J. Theor. Biol (2006) 240:175–184.
30. Cai CZ, Wang WL, Sun LZ, Chen YZ. Protein function classification via support vector machine approach. Math. Biosci (2003) 185:111–122.
31. Cai YD, Liu XJ, Li YX, Xu XB, Chou KC. Prediction of beta-turns with learning machines. Peptides (2003) 24:665–669.

