Nucleic Acids Research Advance Access published online on October 18, 2008
Nucleic Acids Research, doi:10.1093/nar/gkn740
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Computational Biology |
A competitive hybridization model predicts probe signal intensity on high density DNA microarrays
Gulf Coast Research Lab, Department of Coastal Sciences, University of Southern Mississippi, Ocean Springs, MS 39564, USA
*To whom correspondence should be addressed: Tel: +1 228 872 4278; Fax: +1 228 872 4204; Email: shuzhao.li{at}gmail.com
Received August 20, 2008. Revised October 1, 2008. Accepted October 2, 2008.
| ABSTRACT |
|---|
|
|
|---|
A central, unresolved problem of DNA microarray technology is the interpretation of different signal intensities from multiple probes targeting the same transcript. We propose a competitive hybridization model for DNA microarray hybridization. Our model uses a probe-specific dissociation constant that is computed with current nearest neighbor model and existing parameters, and only four global parameters that are fitted to Affymetrix Latin Square data. This model can successfully predict signal intensities of individual probes, therefore makes it possible to quantify the absolute concentration of targets. Our results offer critical insights into the design and data interpretation of DNA microarrays.
| INTRODUCTION |
|---|
|
|
|---|
Current DNA microarray technology utilizes multiple oligonucleotide probes to detect the concentration of target molecules. These probes, even though interrogating the same target, often yield very different signal intensities. Without understanding the physicochemistry underlying this problem, the quantification of absolute gene abundance is unattainable and inter-probe comparison is unjustified, leaving DNA microarray technology severely compromised.
A number of physical models have been proposed to address this problem, mostly in the form of Langmuir derivatives (1–10). The Langmuir model is a generic mathematical form that also fits the description of first-order chemical reactions, which is frequently used for probe/target binding on DNA microarrays:
|
| (1) |
is the fraction of occupied probes, T free target concentration, K dissociation constant.
According to the Langmuir model, all probes should saturate at the same level, which is clearly not the case in microarray hybridizations. Various modifications were proposed to accommodate this difference in saturation levels. A generic version may be written as
|
| (2) |
is a probe-specific factor. While a physical meaning of
is difficult to obtain, some (7,8) tried to explain
through the washing step in microarray experiments. That is, all probes reach the same saturation level in the end of hybridization, but they lose the bound targets to different extents during the washing step. This washing model suggests a significant loss of signals upon each washing cycle. In experimental observations, the first washing cycle usually removes a considerable amount of partially bound targets, but it is clear that signal intensities do not decrease dramatically after extra washing cycles (11). This contradicts the above washing model. Furthermore, the Langmuir derivatives predict that, in response to increasing target concentrations, probes with higher binding affinities saturate first. In experimental observations, on the contrary, low-affinity probes generally saturate first. Although Langmuir models seem to work well on simple surface hybridizations, no Langmuir derivative has adequately predicted probe signals in real experimental settings, such as those in the Affymetrix Latin Square data with complex backgrounds (12).
The best prediction of probe signals to date was reported by Zhang et al. (13). They accounted for both specific binding and nonspecific binding in the form of
, where
is total target concentration, while fitting 83 parameters to the data. Mei et al. (14) also sought a linear composition of binding energy, where the single base energy contribution alone used 75 parameters. Over-parameterization has been a concern in all these previous studies and invited criticism on their general applicability (15).
After all, a valid physical model of microarray hybridization will have to explain the probe difference through sequence-specific thermodynamics, as its oligonucleotide sequence is the defining property of a microarray probe. The free energy of polynucleotide hybridization in bulk solution has been successfully described by a nearest neighbor (NN) model (16,17). However, this NN model is widely regarded as not applicable to high-density microarray hybridization, as it was either modified and re-parameterized (5,7,9,13,18), or abandoned (1,14,19).
We will first demonstrate that the NN model is applicable to probes free of secondary structures. With the thermodynamic component calculated from the NN model, we then propose a new competitive hybridization model to describe the kinetics. Our model, using only four global parameters that are fitted to Affymetrix Latin Square data, can successfully predict signal intensities of individual probes, and therefore, achieve the absolute quantification of target concentrations.
| METHODS |
|---|
|
|
|---|
The Affymetrix Latin Square spike-in data U133A were retrieved from (12). They contained 14x3 hybridizations where spike-in targets were added at various concentrations from 0 pM to 512 pM. Probe information was obtained through (20), where only 30 of the 42 probesets were found. A total of 365 probes matched to target sequences. Among them, 10 probes with very low signal intensities (under 900 at highest target concentration) were removed. In total, 355 probes are included in this study. Background was taken as the signal intensity at zero spike-in concentration, and subtracted from data at other concentrations. No normalization was performed on these data. The probe self-folding energy,
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
Thermodynamic predictability in DNA microarrays
Microarray probes and targets may form secondary structures by intramolecular self-folding. These structural effects are not accounted for in the NN model, posing a problem to the thermodynamic calculation. As a first step, we investigated the structural effects through the self-folding energy of probes,
|
A competitive hybridization model
We treat DNA microarray hybridization as two subprocesses, the binding of targets to probes and the dissociation of target/probe duplexes. Assuming that equilibrium is reached at the end of hybridization and the binding rate is the same for all target molecules (see below), the dissociation rate is governed by the duplexing energy between paired target/probe. A kinetic equilibrium between binding and dissociation should be observed.
Two types of targets are explicitly modeled: specific targets (perfect match) with probe-specific dissociation rate kd, and cross-hybridizing targets with dissociation rate kn. These cross-hybridizing targets are present in large quantities because partially matching sequences are abundant in a transcriptome. For the moment, we simplify them as a uniform mixture with a probe-nonspecific kn.
The target/probe duplex formation is commonly believed to start with an initiation step, the base-pairing between a small number of nucleotide bases, and then extend to the rest of complementary regions (22,23). If the initiation step sets the rate limit, the binding rate should be hardly specific to probe sequences. We therefore assume a single binding rate, kb, for all target molecules. How the specific factors (24,25), including adsorption and electrostatics (26), steric and brush effects (27) and labeling (19,28), come into play is not yet entirely clear. In this study, we postulate that the available area of probe spots is the limiting factor in adsorption, so that the binding is described as
|
| (3) |
is the fraction of probes bound to specific targets, β the fraction of probes bound to cross-hybridizing targets. p is the total number of probes in unit of molar concentration (for simplicity, as if they were dissolved in the hybridization solution).
On the other hand, the dissociation is described by
|
| (4) |
At equilibrium between binding and dissociation,
|
| (5) |
|
| (6) |
|
| (7) |
Equations (6) and (7) can be combined to express β as:
|
| (8) |
|
| (9) |
|
| (10) |
as the nominal spike-in concentration (total amount).
We assume the concentration of cross-hybridizing targets,
, is large and can be treated as constant in this model. Let
|
| (11) |
|
| (12) |
|
| (13) |
|
| (14) |
It can be shown that the other analytical solution of Equation (12), which bears a plus sign before the square root, has no valid physical meaning and merits no further discussion.
So
, the fraction of probes bound to specific targets, is described by three global parameters: p, kb and
, one probe-specific parameter kd and one variable
. kd can be expressed as:
|
| (15) |
as a scaling factor to account for binding to immobilized probes.
The physical meaning of our model is clear. Both specific binding
and cross-hybridization β compete for the same probe sites. As a result, high affinity probes (small kd) can achieve a higher fraction of specific binding, while low-affinity probes (large kd) saturate at a lower fraction.
serves as a cross-hybridization factor. We made assumptions that are important to real experimental settings: a large quantity of cross-hybridizing targets are present; kb is uniform for all targets and the adsorption is limited by the available area of probe spots. These assumptions make our model fundamentally different from previous competitive kinetic models (29,30).
Experimentally, signal intensity is what is observed after washing, where most of cross-hybridized targets have been washed off:
|
| (16) |
the residual intensity from cross-hybridized targets,
scanner bias, A the detection coefficient of fluorescence. As the unit of signal intensities is arbitrarily digitized, it only comes to a physical meaning through A.
Explanation to the
correlation
First of all, we shall demonstrate that our model is capable of explaining the
correlation at high target concentration in Figure 1.
Equation (12) can be rearranged to a logarithmic form:
|
| (17) |
Note that the second item on the right side still contains the probe-dependent variable
. However, at high target concentration, the bound targets are minor comparing to free targets. This means,
, and
. Hence, Equation (17) at high target concentration is approximated as:
|
| (18) |
In a long range for
, a linear approximation can be drawn between
and
. As in Figure 2,
|
| (19) |
|
Combining Equations (15), (18) and (19), we get
|
| (20) |
At high target concentration, both cross-hybridization and scanner bias can be neglected. Therefore Equation (16) can be simplified to
. We substitute the
in Equation (20) with
:
|
| (21) |
. Thus,
, the premise
A bonus here is the determination of
. Since the coefficient for
in Equation (21) should equate the slope in Figure 1, we get
.
Procedure of fitting model to Latin Square data
In DNA microarray experiments, signal intensities are measured in place of fluorescent densities of bound targets. However, common photomultiplier tube scanners usually carry a significant nonlinearity for low signal intensities (31). This means, the lower end of these Affymetrix data may deviate from the true fluorescent densities, a problem difficult to correct without knowledge of the specific instrument calibration data. And the signals from targets below 1 pM are hardly distinguishable from backgrounds, therefore, data from spike-in concentration 1 pM and above are used for our modeling.
The model fitting is to match the theoretical calculation of signal intensity,
, to the experimentally observed counterpart
is defined from Equation (16):
|
| (22) |
are observed values in these Affymetrix data (signal intensities at zero spike-in concentration). With background
subtracted, the signal intensity should approach zero when the target concentration approaches zero. However, there is usually a deviation from zero that is known as scanner bias,
, which can be therefore estimated by extrapolating the signal intensities at low target concentrations. For the data in this study,
is relatively small and has no significant effect on our model parameters. Though it is useful for stabilizing the small numbers in the fitting process. |
| (23) |
Equation (13) can be written as
|
| (24) |
is defined in Equation (14). In this equation,
.
We use a fitness function of weighted squares [similar to (1)]. For a probe i, the fitting error is calculated as
|
| (25) |
A useful constraint to the fitting is the value of
. This is the signal intensity in Equation (23) when
, often referred as the saturation level of hybridization. It is obvious that P0 should be larger but not infinitely larger than the highest signal intensity observed in the experiment. When varying P0 is used, as shown in Figure 3, the overall fitting results are not very sensitive to P0 beyond a certain value. So we choose
here.
|
Probe signal intensities can be successfully modeled
Figure 1 shows that
. The evaluation is then performed on the rest of probes.
The results indicate that our model captures the probe properties well. Figure 4A shows the modeling of individual probe signals on both the training data and evaluation data. Overall, the prediction on training data has
(Figure 4B), and
on evaluation data (Figure 4C). If we relax the probe selection criterion to
, about 75% of total probes are included, with prediction
(Figure 4D). The rest 25% of probes, which are presumably under stronger influence of secondary structures, can still be modeled with the same parameters but less accuracy at
.
|
In the previous, heavily parameterized models, the best prediction on
Prediction of target concentrations
With the four global parameters, target concentration can be calculated from Equation (12):
|
| (26) |
|
| (27) |
Since kd calculation is more accurate for probes free of secondary structures, we focus on 19 out of the 30 probesets (transcripts) in this study that have five or more probes with
. For these transcripts, Equation (27) is applied to calculate a target concentration from each probe. And the final concentration of a transcript is taken as the median of the data from its probes (Figure 5A). Figure 5B shows the prediction at gene level for all 19 transcripts. In fact, comparable results can be obtained by using the few probes with
alone. At low concentrations, the predicted values in Figure 5B tend to be higher than the nominal concentrations. We think this is likely to be a reflection of scanner nonlinearity in the low signal range, which can be corrected by an instrument calibration.
|
DISCUSSION
In DNA microarray experiments, systematic variations stem from sample preparation and instrument operations. They are likely reflected in the global parameters of our model, A, kb and
. Therefore, batch variations can be expected in these parameters. The highest signal intensity in the Latin Square data was about 16 000. Comparing with the saturation level
We would like to emphasize that kd is the only probe-specific factor in our model, and therefore plays a pivotal role in model accuracy. The accuracy of kd or
in this article is limited by the NN model, which is only a coarse approximation and affected by probe/target secondary structures. This can be improved but beyond the scope of this current study.
We assumed a constant cross-hybridization factor
for all probes, which may not be the case. Further research on
may improve the accuracy of our model. We did not deal with the background levels in this study, which are not important to signals at high target concentration but will affect signals at low concentrations. Background levels have a clear dependency on
, and are well addressed in other studies (32,33).
CONCLUSION
Our study presents the first model of DNA microarray hybridization that explains probe signal intensities through sequence-based thermodynamic properties without excessive parameter fitting. This fills in the long standing knowledge gap in DNA microarray hybridization. Our model provides a mechanism of absolute quantification, and shall improve the quality control and reproducibility of the technology. With only four global parameters, this model can be easily calibrated through control features that are built into microarrays, and adopted in practice. We expect new design and quantification algorithms to take advantage of our results.
| FUNDING |
|---|
|
|
|---|
National Oceanic and Atmospheric Administration (NA05NOS4261163 and NA06NOS42600117).
Conflict of interest statement: None declared.
| REFERENCES |
|---|
|
|
|---|
- Hekstra D, Taussig AR, Magnasco M, Naef F. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. (2003) 31:1962–1968.
[Abstract/Free Full Text] - Held GA, Grinstein G, Tu Y. Modeling of DNA microarray data by using physical properties of hybridization. Proc Nat Acad Sci. USA (2003) 100:7575.
[Abstract/Free Full Text] - Halperin,A. Buhot A, Zhulina EB. Specificity, sensitivity and the hybridization Isotherms of DNA chips. Biophys. J (2004) 86:718–730.[Web of Science][Medline]
- Abdueva D, Skvortsov D, Tavare S. Non-linear analysis of GeneChip arrays. Nucleic Acids Res. (2006) 34:e105.
[Abstract/Free Full Text] - Binder H, Preibisch S. GeneChip microarrays-signal intensities, RNA concentrations and probe sequences. J.Phys. Condens. Matter (2006) 18:S537–S566.[CrossRef]
- Heim T, Tranchevent LC, Carlon E, Barkema GT. Physical-chemistry-based analysis of affymetrix microarray data. J. phys. chem. B (2006) 110:22786–22795.[Medline]
- Held GA, Grinstein G, Tu Y. Relationship between gene expression and observed intensities in DNA microarrays–a modeling study. Nucleic Acids Res. (2006) 34:e70.
[Abstract/Free Full Text] - Burden CJ, Pittelkow Y, Wilson SR. Adsorption models of hybridization and post-hybridization behaviour on oligonucleotide microarrays. J. Phys. Condensed Matter (2006) 18:5545–5565.[CrossRef]
- Bruun GM, Wernersson R, Juncker AS, Willenbrock H, Nielsen HB. Improving comparability between microarray probe signals by thermodynamic intensity correction. Nucleic Acids Res. (2007) 35:e48.
[Abstract/Free Full Text] - Burden CJ. Understanding the physics of oligonucleotide microarrays: the Affymetrix spike-in data reanalysed. Phys. Biol. (2008) 5:16004.[CrossRef][Medline]
- Skvortsov D, Abdueva D, Curtis C, Schaub B, Tavare S. Explaining differences in saturation levels for Affymetrix GeneChip (R) arrays. Nucleic Acids Res. (2007) 35:4154–4163.
[Abstract/Free Full Text] - Affymetrix Latin Square data. http://www.affymetrix.com/support/datasets.affx (6 May 2008, date last accessed).
- Zhang L, Miles MF, Aldape KD. A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol (2003) 21:818–821.[CrossRef][Web of Science][Medline]
- Mei,R. Hubbell,E. Bekiranov,S. Mittmann,M. Christians,F.C. Shen,M.M. Lu,G. Fang,J. Liu,W.M. Ryder T, et al. Probe selection for high-density oligonucleotide arrays. Proc. Nat. Acad. Sci. (2003) 100:11237.
[Abstract/Free Full Text] - Wu Z, Irizarry RA. Preprocessing of oligonucleotide array data. Nat. Biotechnol. (2004) 22:656–658.[Web of Science][Medline]
- Lucia Santa Jr. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Nat. Acad. Sci. USA (1998) 95:1460.
[Abstract/Free Full Text] - Wu P, Nakano S, Sugimoto N. Temperature dependence of thermodynamic properties for DNA/DNA and RNA/DNA duplex formation. FEBS J. (2002) 269:2821–2830.[Web of Science][Medline]
- Ono N, Suzuki S, Furusawa C, Agata T, Kashiwagi A, Shimizu H, Yomo T. An improved physico-chemical model of hybridization on high-density oligonucleotide microarrays. Bioinformatics (2008) 24:1278–1285.
[Abstract/Free Full Text] - Naef F, Magnasco MO. Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E (2003) 68:11906.[CrossRef]
- NetAffx Analysis Center. http://www.affymetrix.com/analysis/index.affx (6 May 2008, date last accessed).
- Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Nat. Acad. Sci. USA (2004) 101:7287–7292.
[Abstract/Free Full Text] - Bloomfield VA, Crothers DM, Tinoco Jr I. Nucleic Acids: Structures, Properties, and Functions. Sausalito University Science Books, Sausalito, CA (2000) 259–334.
- Christensen U, Jacobsen N, Rajwanshi VK, Wengel J, Koch T. Stopped-flow kinetics of locked nucleic acid (LNA)-oligonucleotide duplex formation: studies of LNA-DNA and DNA-DNA interactions. Biochem. J. (2001) 354:481–484.[CrossRef][Web of Science][Medline]
- Levicky R, Horgan A. Physicochemical perspectives on DNA microarray and biosensor technologies. Trends Biotechnol. (2005) 23:143–149.[CrossRef][Web of Science][Medline]
- Halperin A, Buhot A, Zhulina EB. On the hybridization isotherms of DNA microarrays: the Langmuir model and its extensions. J. Phys. Condens. Matter (2006) 18:S463–S490.[CrossRef]
- Vainrub A, Pettitt BM. Thermodynamics of association to a molecule immobilized in an electric double layer. Chem. Phys. Lett. (2000) 323:160–166.[CrossRef][Web of Science]
- Halperin A, Buhot A, Zhulina EB. Brush Effects on DNA Chips: Thermodynamics, Kinetics, and Design Guidelines. Biophys. J. (2005) 89:796–811.[CrossRef][Web of Science][Medline]
- Zhang L, Hurek T, Reinhold-Hurek B. Position of the fluorescent label is a crucial factor determining signal intensity in microarray hybridizations. Nucleic Acids Res. (2005) 33:e166.
[Abstract/Free Full Text] - Zhang Y, Hammer DA, Graves DJ. Competitive hybridization kinetics reveals unexpected behavior patterns. Biophys. J. (2005) 89:2950–2959.[CrossRef][Web of Science][Medline]
- Bishop,J. Blair S, Chagovetz AM. A competitive kinetic model of nucleic acid surface hybridization in the presence of point mutants. Biophys. J (2006) 90:831–840.[CrossRef][Web of Science][Medline]
- Shi,L. Tong,W. Su,Z. Han,T. Han,J. Puri,R.K. Fang,H. Frueh,F.W. Goodsaid,F.M. Guo L, et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics (2005) 6:S11.
- Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. (2004) 99:909–917.[CrossRef][Web of Science]
- Schuster EF, Blanc E, Partridge L, Thornton JM. Estimation and correction of non-specific binding in a large-scale spike-in experiment. Genome Biol. (2007) 8:R126.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

G0 > –1, 160 probes from Affymetrix U133A data). Each dot represents a probe.



