Nucleic Acids Research Advance Access originally published online on October 18, 2008
Nucleic Acids Research 2008 36(20):6585-6591; doi:10.1093/nar/gkn740
Nucleic Acids Research, 2008, Vol. 36, No. 20 6585-6591
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
A competitive hybridization model predicts probe signal intensity on high density DNA microarrays
Shuzhao Li*,
Alex Pozhitkov and
Marius Brouwer
Gulf Coast Research Lab, Department of Coastal Sciences, University of Southern Mississippi, Ocean Springs, MS 39564, USA
*To whom correspondence should be addressed. Tel: +1 228 872 4278; Fax: +1 228 872 4204; Email: shuzhao.li{at}gmail.com
Received August 20, 2008. Revised October 1, 2008. Accepted October 2, 2008.
 |
ABSTRACT
|
|---|
A central, unresolved problem of DNA microarray technology is
the interpretation of different signal intensities from multiple
probes targeting the same transcript. We propose a competitive
hybridization model for DNA microarray hybridization. Our model
uses a probe-specific dissociation constant that is computed
with current nearest neighbor model and existing parameters,
and only four global parameters that are fitted to Affymetrix
Latin Square data. This model can successfully predict signal
intensities of individual probes, therefore makes it possible
to quantify the absolute concentration of targets. Our results
offer critical insights into the design and data interpretation
of DNA microarrays.
 |
INTRODUCTION
|
|---|
Current DNA microarray technology utilizes multiple oligonucleotide
probes to detect the concentration of target molecules. These
probes, even though interrogating the same target, often yield
very different signal intensities. Without understanding the
physicochemistry underlying this problem, the quantification
of absolute gene abundance is unattainable and inter-probe comparison
is unjustified, leaving DNA microarray technology severely compromised.
A number of physical models have been proposed to address this problem, mostly in the form of Langmuir derivatives (1–10). The Langmuir model is a generic mathematical form that also fits the description of first-order chemical reactions, which is frequently used for probe/target binding on DNA microarrays:
| (1) |
where

is the fraction
of occupied probes,
T free target concentration,
K dissociation
constant.
According to the Langmuir model, all probes should saturate at the same level, which is clearly not the case in microarray hybridizations. Various modifications were proposed to accommodate this difference in saturation levels. A generic version may be written as
| (2) |
where

is a probe-specific factor. While a physical meaning of

is
difficult to obtain, some (
7,
8) tried to explain

through the
washing step in microarray experiments. That is, all probes
reach the same saturation level in the end of hybridization,
but they lose the bound targets to different extents during
the washing step. This washing model suggests
a significant loss of signals upon each washing cycle. In experimental
observations, the first washing cycle usually removes a considerable
amount of partially bound targets, but it is clear that signal
intensities do not decrease dramatically after extra washing
cycles (
11). This contradicts the above washing model.
Furthermore, the Langmuir derivatives predict that, in response
to increasing target concentrations, probes with higher binding
affinities saturate first. In experimental observations, on
the contrary, low-affinity probes generally saturate first.
Although Langmuir models seem to work well on simple surface
hybridizations, no Langmuir derivative has adequately predicted
probe signals in real experimental settings, such
as those in the Affymetrix Latin Square data with complex backgrounds
(
12).
The best prediction of probe signals to date was reported by Zhang et al. (13). They accounted for both specific binding and nonspecific binding in the form of
, where
is total target concentration, while fitting 83 parameters to the data. Mei et al. (14) also sought a linear composition of binding energy, where the single base energy contribution alone used 75 parameters. Over-parameterization has been a concern in all these previous studies and invited criticism on their general applicability (15).
After all, a valid physical model of microarray hybridization will have to explain the probe difference through sequence-specific thermodynamics, as its oligonucleotide sequence is the defining property of a microarray probe. The free energy of polynucleotide hybridization in bulk solution has been successfully described by a nearest neighbor (NN) model (16,17). However, this NN model is widely regarded as not applicable to high-density microarray hybridization, as it was either modified and re-parameterized (5,7,9,13,18), or abandoned (1,14,19).
We will first demonstrate that the NN model is applicable to probes free of secondary structures. With the thermodynamic component calculated from the NN model, we then propose a new competitive hybridization model to describe the kinetics. Our model, using only four global parameters that are fitted to Affymetrix Latin Square data, can successfully predict signal intensities of individual probes, and therefore, achieve the absolute quantification of target concentrations.
 |
METHODS
|
|---|
The Affymetrix Latin Square spike-in data U133A were retrieved
from (
12). They contained 14
x3 hybridizations where spike-in
targets were added at various concentrations from 0 pM to 512
pM. Probe information was obtained through (
20), where only
30 of the 42 probesets were found. A total of 365 probes matched
to target sequences. Among them, 10 probes with very low signal
intensities (under 900 at highest target concentration) were
removed. In total, 355 probes are included in this study. Background
was taken as the signal intensity at zero spike-in concentration,
and subtracted from data at other concentrations. No normalization
was performed on these data. The probe self-folding energy,

, was computed by RNAStructure
[version 4.5, function OligoWalk (
21)]. Duplexing energy,

, was computed by the current NN model with the
parameters from Ref. (
17). Compiled data and computational scripts
used in this study are available upon request.
 |
RESULTS AND DISCUSSION
|
|---|
Thermodynamic predictability in DNA microarrays
Microarray probes and targets may form secondary structures
by intramolecular self-folding. These structural effects are
not accounted for in the NN model, posing a problem to the thermodynamic
calculation. As a first step, we investigated the structural
effects through the self-folding energy of probes,

. In the Affymetrix Latin Square data (see Methods
Section for details), about 45% of the probes can be selected
by the criterion

. For
these probes, a clear correlation appears between log signal
intensities (

) at the highest
target concentration and the duplexing energy

that are computed by the current NN model with
existing parameters (
Figure 1,

). If the selection criterion is relaxed to

, 75% probes are included and the

correlation has

(data not shown). However, the

correlation diminishes at lower target concentrations (data
not shown). These observations suggest that the current NN model
offers a certain degree of predictability, but they cannot be
accommodated by previous, Langmuir-like models. A new kinetic
model is needed.

View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
|
Figure 1. Duplexing energy calculated by NN model correlates with signal intensity at 512 pM, the highest spike-in target concentration, for probes free of secondary structures ( G0 > –1, 160 probes from Affymetrix U133A data). Each dot represents a probe.
|
|
A competitive hybridization model
We treat DNA microarray hybridization as two subprocesses, the
binding of targets to probes and the dissociation of target/probe
duplexes. Assuming that equilibrium is reached at the end of
hybridization and the binding rate is the same for all target
molecules (see below), the dissociation rate is governed by
the duplexing energy between paired target/probe. A kinetic
equilibrium between binding and dissociation should be observed.
Two types of targets are explicitly modeled: specific targets (perfect match) with probe-specific dissociation rate kd, and cross-hybridizing targets with dissociation rate kn. These cross-hybridizing targets are present in large quantities because partially matching sequences are abundant in a transcriptome. For the moment, we simplify them as a uniform mixture with a probe-nonspecific kn.
The target/probe duplex formation is commonly believed to start with an initiation step, the base-pairing between a small number of nucleotide bases, and then extend to the rest of complementary regions (22,23). If the initiation step sets the rate limit, the binding rate should be hardly specific to probe sequences. We therefore assume a single binding rate, kb, for all target molecules. How the specific factors (24,25), including adsorption and electrostatics (26), steric and brush effects (27) and labeling (19,28), come into play is not yet entirely clear. In this study, we postulate that the available area of probe spots is the limiting factor in adsorption, so that the binding is described as
| (3) |
where

is the number of target molecules going into the exposed
probes over a unit of time,
NA the Avogadro constant,
V the
volume of hybridization solution. On the right side,

is the
fraction of probes bound to specific targets, β the fraction
of probes bound to cross-hybridizing targets.
p is the total
number of probes in unit of molar concentration (for simplicity,
as if they were dissolved in the hybridization solution).
On the other hand, the dissociation is described by
| (4) |
where

is the number of target molecules leaving target/probe duplexes
over a unit of time;
kd and
kn are dissociation rates for specific
targets and cross-hybridizing targets, respectively.
At equilibrium between binding and dissociation,
| (5) |
Equilibrium is established for both specific
and cross-hybridizing targets. The proportions of specific targets
and cross-hybridizing targets are determined by their concentrations:
| (6) |
| (7) |
where

is the concentration of free specific targets,

the concentration of free cross-hybridizing
targets.
Equations (6) and (7) can be combined to express β as:
| (8) |
Then, Equations
(
5) and (
8) give the fraction of specific binding
| (9) |
Here

, the concentration of free specific target molecules, is
less than nominal spike-in concentration by the amount of probe
binding:
| (10) |
with

as the nominal spike-in concentration
(total amount).
We assume the concentration of cross-hybridizing targets,
, is large and can be treated as constant in this model. Let
| (11) |
then Equation (
9) becomes
| (12) |
An analytical solution of Equation (
12) is
| (13) |
where
| (14) |
It can be shown that the other analytical solution of Equation (12), which bears a plus sign before the square root, has no valid physical meaning and merits no further discussion.
So
, the fraction of probes bound to specific targets, is described by three global parameters: p, kb and
, one probe-specific parameter kd and one variable
. kd can be expressed as:
| (15) |
where
R is
the molar gas constant,
T absolute temperature (318 K in the
Affymetrix hybridization experiments),

the energy computed from NN model,

as a scaling
factor to account for binding to immobilized probes.
The physical meaning of our model is clear. Both specific binding
and cross-hybridization β compete for the same probe sites. As a result, high affinity probes (small kd) can achieve a higher fraction of specific binding, while low-affinity probes (large kd) saturate at a lower fraction.
serves as a cross-hybridization factor. We made assumptions that are important to real experimental settings: a large quantity of cross-hybridizing targets are present; kb is uniform for all targets and the adsorption is limited by the available area of probe spots. These assumptions make our model fundamentally different from previous competitive kinetic models (29,30).
Experimentally, signal intensity is what is observed after washing, where most of cross-hybridized targets have been washed off:
| (16) |
where

is the observed signal intensity,

the residual
intensity from cross-hybridized targets,

scanner bias,
A the
detection coefficient of fluorescence. As the unit of signal
intensities is arbitrarily digitized, it only comes to a physical
meaning through
A.
Explanation to the
correlation
First of all, we shall demonstrate that our model is capable of explaining the
correlation at high target concentration in Figure 1.
Equation (12) can be rearranged to a logarithmic form:
| (17) |
Note that the second item on the right side still contains the probe-dependent variable
. However, at high target concentration, the bound targets are minor comparing to free targets. This means,
, and
. Hence, Equation (17) at high target concentration is approximated as:
| (18) |
In a long range for
, a linear approximation can be drawn between
and
. As in Figure 2,
| (19) |
Combining Equations (
15), (
18) and (
19), we get
| (20) |
At high target concentration, both cross-hybridization and scanner bias can be neglected. Therefore Equation (16) can be simplified to
. We substitute the
in Equation (20) with
:
| (21) |
where

, a constant for fixed

.
Thus,

is inversely correlated
to

. The observed

correlation is explained by our competitive
hybridization model. At low

, the
premise

is less valid;
as a result,

is less
correlated to

. A similar
effect may be created by a very low

, where a large fraction of targets is bound to probes and
taken out of solution.
A bonus here is the determination of
. Since the coefficient for
in Equation (21) should equate the slope in Figure 1, we get
.
Procedure of fitting model to Latin Square data
In DNA microarray experiments, signal intensities are measured in place of fluorescent densities of bound targets. However, common photomultiplier tube scanners usually carry a significant nonlinearity for low signal intensities (31). This means, the lower end of these Affymetrix data may deviate from the true fluorescent densities, a problem difficult to correct without knowledge of the specific instrument calibration data. And the signals from targets below 1 pM are hardly distinguishable from backgrounds, therefore, data from spike-in concentration 1 pM and above are used for our modeling.
The model fitting is to match the theoretical calculation of signal intensity,
, to the experimentally observed counterpart
is defined from Equation (16):
| (22) |
Here, the background levels

are observed
values in these Affymetrix data (signal intensities at zero
spike-in concentration). With background

subtracted, the signal
intensity should approach zero when the target concentration
approaches zero. However, there is usually a deviation from
zero that is known as scanner bias,

, which can be therefore
estimated by extrapolating the signal intensities at low target
concentrations. For the data in this study,

is taken. This value of

is relatively small
and has no significant effect on our model parameters. Though
it is useful for stabilizing the small numbers in the fitting
process.
With the theoretical value
| (23) |
Equation (13) can be written as
| (24) |
where

is defined in Equation (
14). In this equation,

is known,

the observed value for

, and
kd can be calculated from Equation (
15).
So we only need to fit four global parameters:
A,
p,
kb and

.
We use a fitness function of weighted squares [similar to (1)]. For a probe i, the fitting error is calculated as
| (25) |
where

is observed signal intensity,

the calculated value by Equation (
24),
t one of the nominal
target concentrations
(1–512 pM). Our model in Equation (
24) is fitted to the
training data by minimizing the sum of
Ei through brute-force
searches as heuristic ranges of the four parameters can be obtained
based on their physical meanings.
A useful constraint to the fitting is the value of
. This is the signal intensity in Equation (23) when
, often referred as the saturation level of hybridization. It is obvious that P0 should be larger but not infinitely larger than the highest signal intensity observed in the experiment. When varying P0 is used, as shown in Figure 3, the overall fitting results are not very sensitive to P0 beyond a certain value. So we choose
here.
Probe signal intensities can be successfully modeled
Figure 1 shows that

(hence
kd) can be reasonably approximated by the current NN model for
the probes free of secondary structures. We use half of these
probes to fit our competitive hybridization model, and determine
the four global parameters,
A,
p,
kb and

. The evaluation is
then performed on the rest of probes.
The results indicate that our model captures the probe properties well. Figure 4A shows the modeling of individual probe signals on both the training data and evaluation data. Overall, the prediction on training data has
(Figure 4B), and
on evaluation data (Figure 4C). If we relax the probe selection criterion to
, about 75% of total probes are included, with prediction
(Figure 4D). The rest 25% of probes, which are presumably under stronger influence of secondary structures, can still be modeled with the same parameters but less accuracy at
.
In the previous, heavily parameterized models, the best prediction
on

was correlation coefficient

in Ref. (
14) and

in Ref. (
13). In comparison, our model of four
parameters produces

for
all probes, and

for 75%
probes after a preliminary selection by secondary structures
(i.e.
Figure 4D). In conclusion, our competitive hybridization
model can not only predict probe signals successfully, but also
opens up paths to future improvements.
Prediction of target concentrations
With the four global parameters, target concentration can be calculated from Equation (12):
| (26) |
If we substitute
,
| (27) |
Since kd calculation is more accurate for probes free of secondary structures, we focus on 19 out of the 30 probesets (transcripts) in this study that have five or more probes with
. For these transcripts, Equation (27) is applied to calculate a target concentration from each probe. And the final concentration of a transcript is taken as the median of the data from its probes (Figure 5A). Figure 5B shows the prediction at gene level for all 19 transcripts. In fact, comparable results can be obtained by using the few probes with
alone. At low concentrations, the predicted values in Figure 5B tend to be higher than the nominal concentrations. We think this is likely to be a reflection of scanner nonlinearity in the low signal range, which can be corrected by an instrument calibration.
DISCUSSION
In DNA microarray experiments, systematic variations stem from
sample preparation and instrument operations. They are likely
reflected in the global parameters of our model,
A,
kb and

.
Therefore, batch variations can be expected in these parameters.
The highest signal intensity in the Latin Square data was about
16 000. Comparing with the saturation level

, this means about half of those probe sites
are bound to specific targets at

Thus, the fitted value of

(as in
Figure 4) seems to be reasonable. Since our model
has only four global parameters, they can be easily calibrated
if control probes are built into array design. For instance,
a set of targets complementary to the control probes can be
spiked into the hybridization at various concentrations. Signal
intensities of the control probes along with the known target
concentrations can then be used to calibrate our model every
time a hybridization experiment is performed.
We would like to emphasize that kd is the only probe-specific factor in our model, and therefore plays a pivotal role in model accuracy. The accuracy of kd or
in this article is limited by the NN model, which is only a coarse approximation and affected by probe/target secondary structures. This can be improved but beyond the scope of this current study.
We assumed a constant cross-hybridization factor
for all probes, which may not be the case. Further research on
may improve the accuracy of our model. We did not deal with the background levels in this study, which are not important to signals at high target concentration but will affect signals at low concentrations. Background levels have a clear dependency on
, and are well addressed in other studies (32,33).
CONCLUSION
Our study presents the first model of DNA microarray hybridization that explains probe signal intensities through sequence-based thermodynamic properties without excessive parameter fitting. This fills in the long standing knowledge gap in DNA microarray hybridization. Our model provides a mechanism of absolute quantification, and shall improve the quality control and reproducibility of the technology. With only four global parameters, this model can be easily calibrated through control features that are built into microarrays, and adopted in practice. We expect new design and quantification algorithms to take advantage of our results.
 |
FUNDING
|
|---|
National Oceanic and Atmospheric Administration (NA05NOS4261163
and NA06NOS42600117).
Conflict of interest statement: None declared.
 |
REFERENCES
|
|---|
- Hekstra D, Taussig AR, Magnasco M, Naef F. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. (2003) 31:1962–1968.[Abstract/Free Full Text]
- Held GA, Grinstein G, Tu Y. Modeling of DNA microarray data by using physical properties of hybridization. Proc Nat Acad Sci. USA (2003) 100:7575.[Abstract/Free Full Text]
- Halperin A, Buhot A, Zhulina EB. Specificity, sensitivity and the hybridization Isotherms of DNA chips. Biophys. J (2004) 86:718–730.[Web of Science][Medline]
- Abdueva D, Skvortsov D, Tavare S. Non-linear analysis of GeneChip arrays. Nucleic Acids Res. (2006) 34:e105.[Abstract/Free Full Text]
- Binder H, Preibisch S. GeneChip microarrays-signal intensities, RNA concentrations and probe sequences. J.Phys. Condens. Matter (2006) 18:S537–S566.[CrossRef]
- Heim T, Tranchevent LC, Carlon E, Barkema GT. Physical-chemistry-based analysis of affymetrix microarray data. J. phys. chem. B (2006) 110:22786–22795.[Medline]
- Held GA, Grinstein G, Tu Y. Relationship between gene expression and observed intensities in DNA microarrays–a modeling study. Nucleic Acids Res. (2006) 34:e70.[Abstract/Free Full Text]
- Burden CJ, Pittelkow Y, Wilson SR. Adsorption models of hybridization and post-hybridization behaviour on oligonucleotide microarrays. J. Phys. Condensed Matter (2006) 18:5545–5565.[CrossRef]
- Bruun GM, Wernersson R, Juncker AS, Willenbrock H, Nielsen HB. Improving comparability between microarray probe signals by thermodynamic intensity correction. Nucleic Acids Res. (2007) 35:e48.[Abstract/Free Full Text]
- Burden CJ. Understanding the physics of oligonucleotide microarrays: the Affymetrix spike-in data reanalysed. Phys. Biol. (2008) 5:16004.[CrossRef][Medline]
- Skvortsov D, Abdueva D, Curtis C, Schaub B, Tavare S. Explaining differences in saturation levels for Affymetrix GeneChip (R) arrays. Nucleic Acids Res. (2007) 35:4154–4163.[Abstract/Free Full Text]
- Affymetrix Latin Square data. http://www.affymetrix.com/support/datasets.affx (6 May 2008, date last accessed).
- Zhang L, Miles MF, Aldape KD. A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol (2003) 21:818–821.[CrossRef][Web of Science][Medline]
- Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, et al. Probe selection for high-density oligonucleotide arrays. Proc. Nat. Acad. Sci. (2003) 100:11237.[Abstract/Free Full Text]
- Wu Z, Irizarry RA. Preprocessing of oligonucleotide array data. Nat. Biotechnol. (2004) 22:656–658.[Web of Science][Medline]
- Lucia Santa Jr. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Nat. Acad. Sci. USA (1998) 95:1460.[Abstract/Free Full Text]
- Wu P, Nakano S, Sugimoto N. Temperature dependence of thermodynamic properties for DNA/DNA and RNA/DNA duplex formation. FEBS J. (2002) 269:2821–2830.[Web of Science][Medline]
- Ono N, Suzuki S, Furusawa C, Agata T, Kashiwagi A, Shimizu H, Yomo T. An improved physico-chemical model of hybridization on high-density oligonucleotide microarrays. Bioinformatics (2008) 24:1278–1285.[Abstract/Free Full Text]
- Naef F, Magnasco MO. Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E (2003) 68:11906.[CrossRef]
- NetAffx Analysis Center. http://www.affymetrix.com/analysis/index.affx (6 May 2008, date last accessed).
- Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Nat. Acad. Sci. USA (2004) 101:7287–7292.[Abstract/Free Full Text]
- Bloomfield VA, Crothers DM, Tinoco I Jr. Nucleic Acids: Structures, Properties, and Functions. Sausalito University Science Books, Sausalito, CA (2000) 259–334.
- Christensen U, Jacobsen N, Rajwanshi VK, Wengel J, Koch T. Stopped-flow kinetics of locked nucleic acid (LNA)-oligonucleotide duplex formation: studies of LNA-DNA and DNA-DNA interactions. Biochem. J. (2001) 354:481–484.[CrossRef][Web of Science][Medline]
- Levicky R, Horgan A. Physicochemical perspectives on DNA microarray and biosensor technologies. Trends Biotechnol. (2005) 23:143–149.[CrossRef][Web of Science][Medline]
- Halperin A, Buhot A, Zhulina EB. On the hybridization isotherms of DNA microarrays: the Langmuir model and its extensions. J. Phys. Condens. Matter (2006) 18:S463–S490.[CrossRef]
- Vainrub A, Pettitt BM. Thermodynamics of association to a molecule immobilized in an electric double layer. Chem. Phys. Lett. (2000) 323:160–166.[CrossRef][Web of Science]
- Halperin A, Buhot A, Zhulina EB. Brush Effects on DNA Chips: Thermodynamics, Kinetics, and Design Guidelines. Biophys. J. (2005) 89:796–811.[CrossRef][Web of Science][Medline]
- Zhang L, Hurek T, Reinhold-Hurek B. Position of the fluorescent label is a crucial factor determining signal intensity in microarray hybridizations. Nucleic Acids Res. (2005) 33:e166.[Abstract/Free Full Text]
- Zhang Y, Hammer DA, Graves DJ. Competitive hybridization kinetics reveals unexpected behavior patterns. Biophys. J. (2005) 89:2950–2959.[CrossRef][Web of Science][Medline]
- Bishop J, Blair S, Chagovetz AM. A competitive kinetic model of nucleic acid surface hybridization in the presence of point mutants. Biophys. J (2006) 90:831–840.[CrossRef][Web of Science][Medline]
- Shi L, Tong W, Su Z, Han T, Han J, Puri RK, Fang H, Frueh FW, Goodsaid FM, Guo L, et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics (2005) 6:S11.
- Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. (2004) 99:909–917.[CrossRef][Web of Science]
- Schuster EF, Blanc E, Partridge L, Thornton JM. Estimation and correction of non-specific binding in a large-scale spike-in experiment. Genome Biol. (2007) 8:R126.[CrossRef][Medline]

CiteULike
Connotea
Del.icio.us What's this?