Nucleic Acids Research Advance Access originally published online on June 13, 2007
Nucleic Acids Research 2007 35(12):4195-4202; doi:10.1093/nar/gkm338
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Nucleic Acids Research, 2007, Vol. 35, No. 12 4195-4202
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Computational Biology |
A statistical learning approach to the modeling of chromatographic retention of oligonucleotides incorporating sequence and secondary structure data
1Simulation of Biological Systems, Eberhard Karls University Tübingen and 2Department of Chemistry, Instrumental Analysis and Bioanalysis, Saarland University, Saarbrücken, Germany
*To whom correspondence should be addressed. Tel: 0049 7071 29 70462; Fax: 0049 7071 29 5152; Email: sturm{at}informatik.uni-tuebingen.de
Received March 7, 2007. Revised April 17, 2007. Accepted April 19, 2007.
| ABSTRACT |
|---|
|
|
|---|
We propose a new model for predicting the retention time of oligonucleotides. The model is based on
support vector regression using features derived from base sequence and predicted secondary structure of oligonucleotides. Because of the secondary structure information, the model is applicable even at relatively low temperatures where the secondary structure is not suppressed by thermal denaturing. This makes the prediction of oligonucleotide retention time for arbitrary temperatures possible, provided that the target temperature lies within the temperature range of the training data. We describe different possibilities of feature calculation from base sequence and secondary structure, present the results and compare our model to existing models.
| INTRODUCTION |
|---|
|
|
|---|
The structure of biological macromolecules like polypeptides and nucleic acids comprises linear polymers of a limited number of small building blocks such as amino acids or nucleotides. Even so, because of their large size and a certain degree of flexibility of the polymer chains, biopolymers are usually folded, forming a distinct 3D structure which is generally essential for the biological function of the molecules. Studies of the 3D structure of nucleic acids go back to the seminal work of Watson and Crick, who discovered the DNA double helix. Since then, a large number of 3D structures of single- and double- stranded nucleic acids have been solved by X-ray crystallography and nuclear magnetic resonance spectroscopy, which allow a deep insight into the structurefunction relationships of nucleic acids.
In spite of the broad availability of structural data, investigations into the relationship between the 3D structure of nucleic acids and their interaction with stationary phases in chromatographic separation systems have been quite limited (1,2), mainly because of the highly dynamic nature of the process, and the involvement of both a liquid and a solid phase in the phase transfer. Since DNA and RNA molecules are employed in many biochemical applications, such as polymerase chain reaction, genotyping (3), DNA sequencing (4), hybridization assays and gene therapy (5), understanding these phase transfer processes would not only help to improve our selection and optimization of chromatographic separation systems in order to purify and analyze them, but also to gain valuable information about the behavior of nucleic acid molecules in multi-phase systems and at interphases.
Commonly used models describing the behavior of solute molecules at the liquidsolid interface, such as linear free enthalpy relationships (6), are based on the fundamental Hammet equation and knowledge of principal structural descriptors, such as intrinsic molar volume, Hildebrand-solubility parameters, polarizability and proton-donor/acceptor parameters. However, these structural descriptors are difficult and tedious to determine experimentally, especially for biological macromolecules. Other models rely on learning structureretention relationships from a set of standard analytes. By doing so, Gilar et al. (7) have developed a model for the retention of oligonucleotides in ion-pair reversed-phase chromatography by adding up the retention contributions of the individual nucleotides, which have been determined experimentally from analyzing homo-oligonucleotides. Since the input structural parameters for the model are only oligonucleotide length and base composition, it is only applicable at relatively high temperatures of 6080° C where secondary structures are usually suppressed because of thermal denaturing. A similar model was used by Otvos et al. (8) for peptide nucleic acids. The model was based upon retention contribution of monomer building units, which were studied by Meek (9) and Guo et al. (10).
We propose a new model for retention time prediction which differs in two ways from the Gilar approach. First, we incorporate several additional features of the oligonucleotides into the model, the most important being predicted secondary structure information. The temperature dependency of the secondary structure is captured by adding information similar to a melting curve. The second difference lies in model creation. Support vector regression (SVR) is used instead of simple linear or logarithmic models. SVR can model non-linear relationships while optimizing both the model performance and the model complexity. These improvements lead to a reliable model with a significantly increased prediction performance.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Chemicals and oligonucleotides
Acetonitrile (far UV HPLC grade) and acetic acid (analytical reagent grade) were obtained from Riedel-de-Haën (Seelze, Germany), triethylamine (analytical reagent grade) and ethylenediamine tetraacetate (EDTA, analytical reagent grade) were purchased from Fluka (Buchs, Switzerland). Triethylammonium acetate was prepared by mixing equal amounts of triethylamine and acetic acid. Water was purified by means of a purification system (Purelab Ultra Genetic, Elga, Siershahn, Germany). Oligonucleotides were synthesized by MWG-Biotech AG (Ebersberg, Germany) or Biospring (Frankfurt, Germany). The sequences of the 72 different oligonucleotides that were used to create training and evaluation datasets for the retention model are collected in Table S1 of the Supplementry Data.
Ion-pair reversed-phase high-performance liquid chromatography
Ion-pair reversed-phase high-performance liquid chromatography (IP-RP-HPLC) was performed with a fully automated capillary/nano HPLC system (Model Ultimate 3000, Dionex, Amsterdam, The Netherlands) equipped with a low-pressure gradient micropump (Model LPG-3600) with a vacuum degasser (Model SRD-3600), an integrated column oven, a microcolumn switching unit and flow-splitting device (Model FLM-3100), a micro-autosampler (Model WPS-3000) and a UV-detector (Model UVD 3000) with a 3-nL Z-shaped detector cell (Model Ultimate). The system was controlled by a personal computer with Chromeleon Version 6.60 SP2 (Dionex). The 60 x 0.20 mm i.d. poly-(styrene/divinylbenzene) monolithic column was prepared according to the published protocol (11) and is available from Dionex.
Generation of experimental retention time datasets
One to five nanograms of oligonucleotides, dissolved in 1 µl water, were injected and chromatographed with a 30-min linear gradient from 016% acetonitrile in 100 mmol/l triethylammonium acetate, 0.5 mmol/l EDTA, pH 7.0. The flow rate was adjusted to 2 µl/min. The gradient delay volume of the used LC system was 5.94 µl. Although retention times of oligonucleotides were highly reproducible (0.26% relative SD in absolute retention time from three repetitive injections), (dC14 and (dT)26 homo-oligonucleotides were coinjected as internal standards to normalize retention. Normalization was performed using Equation (1) (t and
representing the retention time and average retention time, respectively, of the oligonucleotides)
|
| (1) |
|
Support vector regression
A number of methods can be used to derive models from training data. In this study, SVR is used for model generation and predicting retention times. Squared Pearson correlation coefficients were used to evaluate the quality of the models (R2) and the predictions (Q2).
SVR is a machine-learning technique that uses a training dataset to derive a model for the prediction of a quantitative response. Just like in linear regression methods, the training data consists of m pairs of feature vector x and quantitative response y:
|
| (2) |
Features of oligonucleotides for retention time prediction might be: overall length, number of adenines, number of thymines, etc. The quantitative response is the retention time.
For these training data,
-SVR (12) determines the optimal parameters w and b of the function
|
| (3) |
|
| (4) |
In this equation, the first addend minimizes the model complexity while the second minimizes the
-insensitive training error i.e. training errors below a fixed
are not penalized. C is a constant which defines the trade-off between these two objectives. In 2001, Schölkopf et al. (13) proposed
-SVR, a modified version of
-SVR which minimizes
along with the model complexity and the training error. This simplifies the use of SVR, as
no longer has to be chosen a priori.
SVR, as so far described here, cannot model non-linear relationships between input features and the quantitative response. Non-linearity is achieved by transforming the input features into a higher-dimensional space using so-called kernel functions. A linear model in the transformed feature space corresponds to a non-linear model in the original feature space. Details about kernel functions can be found in (12).
In this study, the libSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/libsvm) of
-SVR was used with radial basis function kernel. The kernel parameter gamma and the trade-off C have been optimized using grid search and 3-fold cross-validation.
Oligonucleotide sequence selection
Our dataset consists of retention times measured for 72 oligonucleotides ranging from 15 to 48 bases. As one focus of this study is the influence of secondary structure on the retention time, the dataset consists of oligonucleotides that contain little to no secondary structure and others where nearly all bases form a hairpin. Four sequences that form stable hairpin structures even at elevated temperatures were included. Figure 2 shows the average fraction of paired bases plotted against the measurement temperature.
|
A second point of interest is the influence of the base sequence on the retention time. That is why the dataset contains two groups of oligonucleotides that have the same overall base composition, but slight differences in the base sequence. Both groups are derived from the 24mer GTGCTCAGTGTAGCCCAGGATGCC. In the first group, one guanine is exchanged for an adenine; in the second, one guanine is exchanged for a cytosine.
Features and models
Secondary structure prediction
All secondary structures used in our models are not experimental but predicted. For this purpose, the tool RNAFold of the Vienna RNA Package (14) version 1.4 was used.
Input features
The selection of input features for performing SVR is a critical step, as the features determine the performance of the prediction. Leaving out essential features or adding unnecessary features both lead to a drop in prediction accuracy. We thus tested several different models, i.e. input feature sets. Each model consists of several model components which group closely related features together. Besides these model components, each model implicitly contains the temperature and the length of the oligonucleotide. The model components can be grouped into composition components, structural components and energy components. Table 1 contains an overview of all components.
|
Sequence components are calculated from the oligonucleotide sequence and describe the base composition or sequence details of the oligonucleotide. COUNT reflects the base composition, while CONTACT and SCONTACT contain dinucleotide frequencies, which contain information on the adjacency of bases. The key idea here is that stacking bases will influence the secondary structure and the interaction with the stationary phase.
Since retention time behavior strongly depends on the secondary structure, sequence-based features alone are suitable only for high temperatures. At lower temperatures, the secondary structure requires the inclusion of structure-based components in the prediction. These components can be subdivided into two groups. The first group considers only the secondary structure of the temperature at which the measurement was performed, whereas the second group contains information on secondary structures for temperatures of 30, 40, 50, 60, 70 and 80° C.
Figure 3 shows the predicted secondary structure of the oligonucleotide GTGCTCAGTGTAGCCCAGGATGGG at 40° C. Examples of features calculated from the predicted structure are listed in Table 2. For the simple structural components, only the measurement temperature is considered. For the multi-temperature structural components, the secondary structure and the resulting features are calculated for different temperatures.
|
|
The single temperature components PAIRED and UNPAIRED have four features each reflecting the fraction of A, T, C and G inside stems and outside stems, respectively. The STRUCTURE component contains more detailed information, as it considers the fraction of A, T, C and G inside stems and inside loops and unpaired nucleotides.
The multi-temperature structural components reflect secondary structure information as well. In contrast to the single temperature components, they do not only consider the measurement temperature but a range of temperatures (30, 40, 50, 60, 70 and 80° C). Adding this gradient of secondary structure information for different temperatures provides information similar to a melting curve.
The simplest component in this group is MULTISTRUCT, which, for each of the six temperatures, contains the fraction of all bases that are inside stems. MULTITWO contains the fraction of bases in stems and loops as well as the fraction of unpaired bases. The most detailed representation of the secondary structure is MULTIDETAIL. It contains the fraction of bases in stems, the fraction of bases in loops, and the fractions of A, C, G and T that are unpaired.
Energy components describe the ring stacking effects between adjacent bases. These energies might influence retention time as they have to be overcome in order for the aromatic ring system of the base to interact with the column packing. The EMBOSS suite (15) was the source of the stacking energies of base pairs. The thermodynamic parameters used for the STACKING component were taken from (16).
Single temperature and combined temperatures models
The standard approach in predicting retention time is to create temperature-specific models, i.e. each model is based on data for a single temperature. Thus, if predictions for several temperatures are needed, one model has to be created for each temperature.
The model components that consider multiple temperatures allow for a more general approach. The data points of one dataset may differ in temperature, which leads to a model that can predict retention times over the whole temperature range of the training data. The advantage of this approach is that all available data can be integrated into a single, more general model. This reduces the amount of data needed for the individual temperatures.
Combining datasets of all temperatures leads to a dataset with 432 data points (72 data points from six temperatures). To avoid overfitting, we included data from only one randomly chosen temperature for each oligonucleotide. The resulting model was then used to predict the retention times of the remaining measurements.
| RESULTS |
|---|
|
|
|---|
Model creation
From the 11 model components presented in Table 1, we created models containing one or more of these components. As the number of models that can be built out of 11 components is huge, we focused on models built out of two or three components, which mostly contain one composition component and one structural component. These models will subsequently be referred to by the names of the components they contain (e.g. count_sesum for a model that consists of the COUNT component and the SESUM component).
For each temperature, the models were trained and the results evaluated. Additionally, the models were trained on the combined temperature dataset as described in Single Temperature and Combined Temperatures Models section.
Prediction accuracy in cross-validation
For each model, the prediction correlation Q2 in 3-fold cross-validation was determined for all temperatures. Table 3 shows the prediction performance for a selection of models:
|
Among the composition components SCONTACT and CONTACT did not show much difference, so only results for SCONTACT are shown. In the group of single temperature secondary structure components, the combination of PAIRED and UNPAIRED was the only one that performed at a similar level to the multi-temperature structural components, where MULTISTRUCT and MULTITWO were the best components. Out of the energy components, STACKING always performed better than SESUM, so the models containing SESUM are not shown. The results of the omitted models can be found in Table S2 of the Supplementary Data.
Single temperatures datasets
All models show a good prediction correlation on the 80° C dataset but the prediction performance averaged over all models drops with the temperature. The models that are exclusively based on sequence information (count, scountact and count_scontact) cannot compete with those models that use secondary structure information at lower temperatures.
Combined temperatures dataset
The models with multi-temperature structural components outperform all other models on the combined temperature dataset. Out of these components, multistruct performs best.
For all further evaluations, we chose the model count_multistruct_stacking. It was preferred over the model count_scountact_multistruct, which shows a slightly better average performance, because it has far fewer features and thus has less tendency to overfit.
Homology reduction
In order to rule out the possibility that the good prediction performance stems from overfitting caused by sequence homology present in the dataset, three homology-reduced datasets were created. We simply removed oligonucleotides from the original dataset until the remaining sequences differed in at least two, three or five bases, respectively. Table 4 shows that there is almost no drop in the prediction performance for homology-reduced datasets. Only the dataset with oligonucleotides that differ at least in five bases actually performs a little worse than the rest, which is probably caused by the small number of training data points.
|
Comparison to the Gilar model
In 2002, Gilar et al. (7) proposed a simple mathematical model for predicting the retention time of oligonucleotides. The model takes only the overall length and the base composition of oligonucleotides into account. As Equation (5)
|
| (5) |
The Gilar model fits very well at 80° C but drops to R2 = 0.58 at 30° C (Figure 4). In contrast, the count_multistruct_stacking model maintains a performance of R2
0.95 for all temperatures. Figure 5 shows the retention times predicted by the Gilar model at 30 and 80° C. At 80° C, the prediction performance is equal to the performance of our models. However, at 30° , the hairpin structures (marked with circles) cannot be predicted correctly any more and generally, prediction errors increase significantly (Figure 4).
|
|
Influence of secondary structure
In order to demonstrate the effect of adding secondary structure information to a sequence-based model, we compared the model error of the models count and count_multistruct_stacking. Table 5 shows the average relative model error of oligonucleotides with a similar fraction of bases involved in their secondary structure. The count_multistruct_stacking model can compensate for the influence of the secondary structure. However, the relative model error of the count model rises up to nearly 38 and 23% for several oligonucleotides at 30 and 50° C, respectively. It is simply impossible for the count model to derive a reasonable model from the training data at lower temperatures as, it does not incorporate secondary structure information. At 80° C, the count_multistruct_stacking model still has lower relative model errors than the count model.
|
The amount of data required for training
For practical use, the number of training data points that are required for a reasonable model is very important. So we determined an average prediction performance for the range of 1060 training data points. We used the following procedure:
- Repeat the following steps 200 times:
- Randomly choose 10 data points as a test set.
- Randomly choose 10, 20, 30, 40, 50 and 60 of the data points not contained in the test set as training set.
- For each of the training sets, train a model, predict retention times for the test set and store the squared correlation coefficient.
- Randomly choose 10 data points as a test set.
- Calculate the average squared correlation for 10, 20, 30, 40, 50 and 60 training data points.
Figure 6 shows the results for the count_multistruct_stacking model. From the figure, one can see that even for a temperature of 30° C, 40 training sequences (or more) are sufficient to construct a model describing oligonucleotide retention with acceptable accuracy. No significant improvement is observed for more than 50 data points.
|
| DISCUSSION |
|---|
|
|
|---|
The methods, experiments and results presented here clearly demonstrate that predicting oligonucleotide retention time can be significantly improved using secondary structure information and SVR.
The use of secondary structure information improves the prediction performance, especially at low temperatures and for oligonucleotides that form highly stable secondary structures. The second point of interest in this study was the effect that the base sequence had on the retention time. The fact that our best model contains no explicit sequence information shows that it plays only a subordinate role in this process. Rather, it is the influence of the base sequence on the secondary structure that seems to be the base sequences' main contribution. Explicitly modeling the base sequence is therefore not necessary. However, a closer examination of the influence of the base sequence remains to be done.
The second performance boost came from the use of SVR, probably because of its ability to model non-linear relations and because it handles outliers better. The essential step, and limiting factor, when working with a SVR model is the selection of the training data. A good prediction performance for oligonucleotides that lie far outside the training feature space is very unlikely. Thus, it is crucial to ensure a broad coverage of feature space, both in terms of secondary structure and base composition.
We tested 11 model components that cover base composition, base sequence, secondary structure and base stacking. Although our model performed very well, there is still room for improvements. New features that model additional chemical properties can be easily added to our model by appending them to the feature vector. We expect that this kind of model extensibility can further help to improve the oligonucleotide retention time prediction model in the future.
| SUPPLEMENTARY DATA |
|---|
|
|
|---|
Supplementary Data are available at NAR Online.
Conflict of interest statement. None declared.
| REFERENCES |
|---|
|
|
|---|
- Gelhaus S, LaCourse WR, Hagan NA, Amarasinghe GK, Fabris D. Rapid purification of RNA secondary structures. Nucleic Acids Res (2003) 31:e135.
[Abstract/Free Full Text] - Huber CG, Berti GN. Detection of partial denaturation in AT-rich DNA fragments by ion-pair reversed-phase chromatography. Anal. Chem (1996) 68:29592965.
- Braun A, Little DP, Koster H. Detecting CFTR gene mutations by using primer oligo base extension and mass spectrometry. Clin. Chem (1997) 43:11511158.
[Abstract/Free Full Text] - Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc. Natl Acad. Sci. USA (1977) 74:54635467.
[Abstract/Free Full Text] - Bumcrot D, Manoharan M, Koteliansky V, Sah DWY. RNAi therapeutics: a potential new class of pharmaceutical drugs. Nat. Chem. Biol (2006) 2:711719.[CrossRef][Web of Science][Medline]
- Tan LC, Carr PW, Abraham MH. (1996) 752:118. Study of Retention in reversed-phase liquid chromatography using linear solvation energy relationships I. The Stationary Phase. J. Chromatogr. A.
- Gilar M, Fountain KJ, Budman Y, Neue UD, Yardley KR, Rainville PD, Russell RJ, Gebler JC. Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: retention prediction. J. Chromatogr. A (2002) 958:16782.[CrossRef][Web of Science][Medline]
- Hoffmann R, Bril G, Yoneyama M, Otvos L. Prediction of retention times of peptide nucleic acids during reversed-phase high-performance liquid chromatography. J. Chromatogr. A (1998) 814:111119.[CrossRef][Web of Science][Medline]
- Meek JL. Prediction of peptide retention times in high-pressure liquid chromatography on the basis of amino acid composition. Proc. Natl Acad. Sci. USA (1980) 77:16321636.
[Abstract/Free Full Text] - Guo D, Mant CT, Taneja AK, Parker JMR, Hodges RS. Prediction of peptide retention times in reversed phase HPLC: determination of retention coefficients of amino acid residues of model synthetic peptides. J. Chromatogr (1986) 359:499518.[CrossRef][Web of Science]
- Huber CG, Premstaller A, Oberacher H. Method and apparatus for separating polynucleotides using monolithic capillary columns. (2001) Patent application U.S. 09/770410.
- Vapnik VN. The Nature of Statistical Learning Theory. (2001) New York, USA: Springer-Verlag New York Inc.
- Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Computation (2000) 12:12071245.[CrossRef][Web of Science][Medline]
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh. Chem (1994) 125:167188.[CrossRef]
- Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet (2000) 16:276277.[CrossRef][Web of Science][Medline]
- Sugimoto N, Nakano S, Yoneyama M, Honda K. Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res (1996) 24:45014505.
[Abstract/Free Full Text]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





