A mathematical model and a computerized simulation of PCR using complex
templates
Eitan Rubin and Avraham A. Levy*
Department of Plant Genetics, Weizmann Institute of Science, Rehovot 76100, Israel
Received June 14, 1996; Revised and Accepted July 31, 1996
ABSTRACT
A mathematical model and a computer simulation were used to study PCR
specificity. The model describes the occurrences of non-targeted PCR products formed through random primer-template interactions. The PCR simulation scans DNA sequence
databases with primer pairs. According to the model prediction, PCR with
complex templates should rarely yield non-targeted products under typical reaction conditions. This is surprising as
such products are often amplified in real PCR under conditions optimized for
stringency. The causes for this `PCR paradox' were investigated by comparing
the model predictions with simulation results. We found that deviations from
randomness in sequences from real genomes could not explain the frequent
occurrence of non-targeted products in real PCR. The most likely explanation for the `PCR
paradox' is a relatively high tolerance of PCR to mismatches. The model also
predicts that mismatch tolerance has the strongest effect on the number of non-targeted products, followed by primer length, template size and product
size limit. The model and the simulation can be utilized for PCR studies,
primer design and probing DNA uniqueness and randomness.
INTRODUCTION
The polymerase chain reaction (PCR) allows the amplification of a specific region (target) from a DNA template, using two oligonucleotides (primers) that anneal to opposite strands (1,2). The reaction is based on multiple cycles of DNA synthesis; each includes denaturation of the template, annealing of the primers to complementary sites in the template and primer extension. The high sensitivity of the reaction and its low cost in time and reagents make PCR one of the most significant innovations in molecular biology during the past decade (3). Nevertheless, for any new primer-template combination, the behavior of the reaction is not completely predictable; non-targeted products are often amplified (4), particularly when complex templates, such as genomic DNA, are involved in the reaction. This problem has been addressed by empirically optimizing the components of the reaction (5), or by experimentally investigating the specificity of the priming process (6-12). Despite some progress made through these approaches, we still have a limited understanding of the factors that govern PCR specificity.
Several computer programs have been used to predict the formation of non-targeted products (12-18). Such products may occur when two opposite regions in the template, situated within a certain size limit, are similar enough to the primer to serve as annealing sites (19). The currently available primer design programs cannot handle complex templates and, therefore, have a limited prediction capability. This is unfortunate, as DNA sequence databases (DBases) such as GenBank (NCBI, 8600 Rockville Pike, Bethesda, MD 20894, USA) and EMBL (EBI, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK) contain an increasing amount of information which could be used to more accurately predict the amplification of non-targeted products. These DBases contain a very large collection of >2.5 × 10^8 nucleotides (nt) from numerous species, including both a biased sample of the genome and a rapidly growing assembly of unbiased sequences in the form of complete chromosomes (20-24). Thus, sequence DBases are the best approximation of complex templates such as genomic DNA or cDNA libraries.
We combined two approaches to study the amplification of non-targeted PCR products as a result of random primer-template interactions: first, through a mathematical model, and
second, through a computerized simulation of PCR (simPCR). The model was
developed to assess the relative effect(s) of various parameters on the
amplification of non-targeted products. It was also used to determine the expected frequency of
non-targeted products when the template is a random sequence. This frequency
was then compared with that of non-targeted products obtained by scanning real sequence databases with the
computerized PCR simulation. We reached the following conclusions: (i) the
expected probability of obtaining a non-targeted PCR product under stringent annealing conditions is extremely
low; (ii) the frequent amplification of non-targeted products in real PCR is not caused by deviations from randomness
in nucleotide order or composition, but rather by the tolerance of PCR to
mismatches; and (iii) based on the model predictions, mismatch tolerance is the
most significant factor affecting PCR specificity, followed by primer length,
template size and product size limit. We discuss how the predictions based on
the model and the simulation could be useful for future improvement in PCR
specificity and primer design. In addition, the equations and simulation that
we developed can be used to study many other biological processes which,
similarly to PCR, involve the recognition of sequences along the DNA in complex
genomes.
MATERIALS AND METHODS
Definitions
Mismatch tolerance: the maximal number of mismatches allowed between the primer and a sequence in the template.
Find: a sequence in a database entry identified with Findpatterns as similar to the primer within mismatch tolerance.
simPCR product: the sequence between two opposing primers. This includes solo simPCR products: the sequence defined by a single primer repeated in the template in an inverted orientation; or XY-simPCR products: the sequence defined by two different opposing primers.
Degenerate primers: a mix of oligonucleotides, each containing alternative bases at specific sites. All possible combinations are equally represented in the mix. A degeneracy is mismatched if none of the possible bases at a location is identical to the target.
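The definitions of a find and of a mismatched degeneracy can be illustrated with a short Python sketch. It is not the Findpatterns implementation: the function names `count_mismatches` and `find_sites`, the '+'/'-' strand labels and the use of IUPAC ambiguity codes to write degenerate positions are assumptions made for this example, and the template is assumed to contain only A, C, G and T.

```python
# Hypothetical helper names; a sketch of the definitions, not Findpatterns itself.
# Each IUPAC code maps to the set of bases it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a non-degenerate DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(primer, site):
    """A position is mismatched if none of the primer's allowed bases equals the site base."""
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))


def find_sites(primer, template, tolerance):
    """Return (position, strand, mismatches) for every find in one template entry.

    '+' means the primer sequence occurs in the given strand,
    '-' means it occurs in the complementary strand at that position.
    """
    n = len(primer)
    finds = []
    for pos in range(len(template) - n + 1):
        window = template[pos:pos + n]
        mm = count_mismatches(primer, window)
        if mm <= tolerance:
            finds.append((pos, "+", mm))
        mm = count_mismatches(primer, reverse_complement(window))
        if mm <= tolerance:
            finds.append((pos, "-", mm))
    return finds
```

A palindromic site would be reported on both strands at the same position, which is the behavior one would want when looking for solo products.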
Computer algorithm for simPCR
The Findpatterns program from the GCG package is first used to search for
annealing sites in the DBases (mismatches are allowed by using the appropriate
option). SimPCR then creates for each entry an array of finds, each represented
by the annealed primer number, position, number of mismatches and orientation.
All the pairs in the array are examined, and a `product' is reported when the
two sites are opposite and satisfy all search parameters. The user defines
mismatch tolerance and product size limit. The limitations of this simulation
as a representation of PCR are described in the discussion. Flowcharts of the
program are available through the WWW (see below).
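The pairing step described above can be sketched as follows. This is a minimal Python illustration rather than the authors' C source: the `Find` record, its field names, the '+'/'-' strand convention and the way product size is measured (from the start of the upstream site to the end of the downstream site) are all assumptions made for the example.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record for one find: which primer annealed, where, with how many
# mismatches, on which strand, and the primer length.
Find = namedtuple("Find", "primer position mismatches strand length")


def sim_products(finds, size_limit):
    """Examine all pairs of finds in one entry and report putative products.

    Two sites are treated as opposing when the upstream one is on the '+' strand
    and the downstream one is on the '-' strand, so both primers extend inward.
    """
    products = []
    ordered = sorted(finds, key=lambda f: f.position)
    for upstream, downstream in combinations(ordered, 2):
        if upstream.strand != "+" or downstream.strand != "-":
            continue
        size = downstream.position + downstream.length - upstream.position
        if size <= size_limit:
            kind = "solo" if upstream.primer == downstream.primer else "XY"
            products.append((kind, upstream.position,
                             downstream.position + downstream.length, size))
    return products
```

Examining every pair is quadratic in the number of finds per entry, which stays manageable because the mismatch tolerance has already been applied when the finds were collected.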
Databases
Two databases were used in all simPCR searches. The first contained selected subdivisions from the GenBank database (release 90), omitting subdivisions whose entries are, on average, shorter than 1000 bp (EST, STS, UNA, PAT, RNA and SYN). The remaining subdivisions (BCT, INV, MAM, VRT, PHG, PLN, PRI, ROD and VRL) contain 2.38 × 10^8 bp arranged in 168 434 entries. The second DBase contained random sequences totalling 2.5 × 10^6 bp organized in 250 entries. Each entry of this DBase was created by concatenating 2500 ACGT repeats and randomizing nucleotide order, using the SHUFFLE routine from GCG. Randomness was confirmed by checking the score distribution of FASTA comparisons of random sequences against the complete database, all of which showed a uniform distribution of scores (data not shown).
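The construction of the random DBase can be mimicked with the sketch below. It substitutes Python's `random.shuffle` for the GCG SHUFFLE routine, so it reproduces the idea (equal base composition, randomized order) but not the original tool; the function names and the fixed seed are assumptions made for the example.

```python
import random


def random_entry(repeats=2500, rng=random):
    """Concatenate `repeats` copies of ACGT and randomize the nucleotide order."""
    bases = list("ACGT" * repeats)
    rng.shuffle(bases)  # stand-in for the GCG SHUFFLE routine
    return "".join(bases)


def random_database(entries=250, repeats=2500, seed=0):
    """Build a list of shuffled entries; the seed only makes runs repeatable."""
    rng = random.Random(seed)
    return [random_entry(repeats, rng) for _ in range(entries)]
```

With the defaults, each entry is 2500 × 4 = 10 000 bp long with exactly 2500 copies of each base, and 250 entries give the 2.5 × 10^6 bp stated in the text.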
Primers
The sequences of the primers used in this work and a number of PCR-related parameters are described in Table 1. The performance of the simple primers in pair S and the degenerate primers in pair D was tested in real PCR reactions. Primers of type R were randomly generated and were never used in real PCR. Primers in pair S were designed based on the nucleotide sequence of their target. Pair D is degenerate, as its sequence was deduced from the amino acid sequence of its target. In real PCR, pair S was successful in specifically amplifying its target (13), but pair D was not (G. Benet, personal communication).
Availability
The source code of the simPCR program, written in the C programming language, together with related materials, can be obtained directly from the authors, by anonymous FTP to bioinformatics.weizmann.ac.il in the directory /pub/software, or through the World Wide Web at http://dapsas1.weizmann.ac.il/~bcrubin/simPCR/simPCR.html. It is also possible to run a demo version of the program through the WWW server.
RESULTS
A model of the PCR
A model was developed to describe the formation of unspecific PCR products as a function of several parameters (Fig. 1). The following conditions were used: (i) the template is a double-stranded DNA sequence composed of the four nucleotides (A, C, G, T) in an equal ratio and a random order; (ii) annealing may occur at any site of the template which is similar to the primer within mismatch tolerance, with successful annealing always leading to priming; and (iii) any two opposing sites within product size limit give a PCR product. These conditions were chosen for the sake of simplicity. The limitations of this model, together with possible improvements, are discussed below (see Discussion).
\[
N_{pairs} = \sum_{Q} \left[ L_i L_s - \frac{1}{2} L_s (L_s - 1) \right] + \sum_{R} \left( \frac{L_i^2}{2} + \frac{L_i}{2} \right) \qquad (11)
\]
Variables affecting the occurrence of unspecific PCR products according to the
mathematical model
Using equations 6 and 9 from the model, the number of non-targeted PCR products or of annealing sites was calculated as a function of several parameters (Fig. 2): mismatch tolerance (Fig. 2A); primer length (Fig. 2B); maximal product length (Fig. 2C); and template length (Fig. 2D). Unless otherwise specified, reaction parameters were set to be characteristic of real PCR: maximal product length, L_s, of 3000 bp and template size similar to the human genome (L_t = 3 × 10^9 bp). These calculations indicate that mismatch tolerance is the variable with the strongest effect on PCR specificity. At low mismatch tolerance values, the increase in the number of PCR products is nearly exponential. It becomes more moderate with increasing mismatch tolerance, until a maximal value of 4 L_t L_s is reached (see equation 6). In the example shown in Figure 2A, the number of products reaches 4 L_t L_s = 3.6 × 10^13. Primer pairs of different length give shifted curves: shorter primer pairs give more products at 0 mismatches, and reach their maximal value at lower mismatch tolerance (Fig. 2A). Increasing primer length causes a nearly exponential reduction in the number of PCR products (Fig. 2B): 20 base primer pairs give five products with four mismatches, whereas 11 base primer pairs give 2 × 10^9 products with the same number of mismatches. Nevertheless, reducing primer length has a smaller effect than increasing mismatch tolerance. For example, reducing the length by two bases has a very similar effect to increasing the mismatch tolerance by one base (Fig. 2A, B and D). Other variables, such as maximal product length (Fig. 2C) and template length (Fig. 2D), have a linear effect on the number of products.
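The figures quoted in this paragraph can be reproduced approximately with the sketch below. Equation 6 itself is not shown in this section, so the formula used here is an assumption: the chance that a random site matches a primer within the mismatch tolerance is taken as a cumulative binomial probability (each position matching with probability 1/4), and the expected number of products for a pair of equal-length primers is taken as 4 L_t L_s times the square of that probability. The function names are hypothetical.

```python
from math import comb


def site_probability(primer_length, tolerance):
    """P(a random site has at most `tolerance` mismatches with the primer),
    assuming four equally frequent bases in random order."""
    tolerance = min(tolerance, primer_length)
    hits = sum(comb(primer_length, k) * 3**k for k in range(tolerance + 1))
    return hits / 4**primer_length


def expected_products(primer_length, tolerance,
                      template_length=3e9, size_limit=3000):
    """Assumed form of the model: 4 * L_t * L_s * p**2 for two primers of equal length."""
    p = site_probability(primer_length, tolerance)
    return 4 * template_length * size_limit * p * p


if __name__ == "__main__":
    for length, tol in [(20, 0), (20, 4), (11, 4), (18, 4), (20, 5), (20, 20)]:
        print(length, tol, f"{expected_products(length, tol):.3g}")
```

Under these assumptions the sketch gives about five products for 20 base primers and about 2 × 10^9 for 11 base primers at four mismatches, and it saturates at 4 L_t L_s = 3.6 × 10^13 once every position may mismatch, in line with the values in the text. It also gives counts of a few hundred both for 18 base primers at four mismatches and for 20 base primers at five, consistent with the remark that two bases of length trade against roughly one mismatch.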
Simulation of PCR with a complex template
The PCR paradox can be accounted for by the model provided a large number of
mismatches is tolerated in real PCR. For example, a tolerance of four to five
mismatches with a 20 nt primer could give one or a few non-targeted products (Fig. 2A),
a value often observed in real PCR. Alternatively, it is possible that the
frequent occurrence of non-targeted products stems from deviations from the model's conditions in
real PCR. Two important assumptions of the model, namely, the randomness in
nucleotide order and the equal representation of the four nucleotides, do not
reflect real genomes. Possibly, annealing sites occur more frequently in real
sequences than expected on a chance basis, reflecting biases in nucleotide
composition and order. One way to test this possibility empirically is to
simulate PCR with natural genomic sequences and examine whether the frequency
of the obtained non-targeted products is higher than expected with a random genome. For that
purpose, a program was written, simPCR, which can handle large templates such
as GenBank or EMBL DBases (see Materials and Methods).
First, as a control, simPCR was run with random primers (Table 1) and a random database (Fig. 3A). The average number of solo simPCR products obtained with five different primers is shown in comparison with model predictions calculated from equation 4. An excellent fit between the model and simPCR results is observed (Fig. 3A). The same random primer set was used in a simPCR run with natural template sequences from GenBank (see Materials and Methods). In this case, the template is fragmented, therefore the total number of annealing sites, N_pairs, as calculated from equation 11, was used in equations 2-5. The good fit between the model and the simulation (Fig. 3B) suggests that for random primers, natural and random templates are similar with respect to non-targeted product formation.
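The correction for a fragmented template in equation 11 can be checked numerically. The sketch below rests on an interpretation that the text does not spell out: Q is taken to be the set of entries at least as long as the product size limit L_s, R the set of shorter entries, and each term is read as the number of position pairs within one entry that lie fewer than L_s positions apart. The function names are hypothetical, and the brute-force counter exists only to confirm the closed forms.

```python
def pairs_in_entry(entry_length, size_limit):
    """Closed-form pair count for one entry of length L_i (terms of equation 11)."""
    if entry_length >= size_limit:
        # Assumed set Q: L_i * L_s - L_s * (L_s - 1) / 2
        return entry_length * size_limit - size_limit * (size_limit - 1) // 2
    # Assumed set R: L_i**2 / 2 + L_i / 2
    return entry_length * (entry_length + 1) // 2


def total_pairs(entry_lengths, size_limit):
    """N_pairs summed over all entries of a fragmented template."""
    return sum(pairs_in_entry(n, size_limit) for n in entry_lengths)


def brute_force_pairs(entry_length, size_limit):
    """Directly count position pairs (i <= j) with j - i < size_limit."""
    return sum(1 for i in range(entry_length)
               for j in range(i, entry_length) if j - i < size_limit)
```

The two branches agree when an entry is exactly L_s long, which is one reason to think this reading of Q and R is the intended one. Because pairs cannot span two entries, a database of many short entries offers fewer pairs than a single contiguous template of the same total length.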
Limitations of the model and possible improvements
Non-randomness of the DNA cannot explain the PCR paradox
According to the PCR model presented here, unspecific amplification of PCR products should virtually never occur in reactions with no mismatches and with typical primers (Fig. 2A). This result is surprising as under such conditions, real PCR often gives unspecific products, even when reactions are optimized for stringency. The great discrepancy between real PCR behavior and the model predictions, the so-called PCR paradox, can be explained in two ways which were tested in this work. First, real PCR primers and templates might share non-random features which cause the occurrence of more annealing sites than expected (27). Second, PCR can tolerate several mismatches, even under presumably stringent conditions. simPCR output, using template sequences from natural genes and real primers, allowed us to assess the effect of the non-randomness of genomic sequences. A good fit was found between simulation results and model predictions for the two primer pairs presented here (Fig. 4 and Table 2), although the fit is better for the non-degenerate pair. This may result from the use of the effective length approximation, which does not fully represent the effect of degenerate bases on the probability of chance annealing. Analysis of 20 additional real primers also showed good agreement between model predictions and simulation results (data available through WWW; see Materials and Methods). As the model predictions are based on the assumption that DNA is a random sequence, the good fit between the model and the simulation rules out the possibility that non-randomness of the genome accounts for the PCR paradox. Interestingly, the fact that real DNA behaves almost as a random template suggests that the model, despite its over-simplified assumptions, is adequate for the prediction of non-targeted primer-template interactions. The non-randomness of the genome probably has some effect on the amplification of non-targeted products (25). This effect, however, must be minor compared with the deviations referred to as the `PCR paradox'. In summary, the most likely explanation for the PCR paradox is a high tolerance of the reaction to mismatches. These conclusions are supported by experimental data indicating that mismatches occur frequently in PCR (see below).
Relative weight of factors affecting PCR specificity: importance of mismatches
According to the model, the effect of template length on specificity is linear (Fig. 2D). This is in agreement with data from real PCR, as the problem of non-targeted product amplification is less frequent with short templates than with larger ones. Maximal product length is also expected to have a linear effect on specificity (Fig. 2C). Currently, there are no good experimental data on the relationship between reaction conditions and product length that would allow the model predictions to be confirmed. The length of the primers has an exponential effect on the number of PCR products expected from the model (Fig. 2B). From these predictions, it could be assumed that any increase in primer length improves specificity. In real PCR, short primers, such as the 10- or 11-mers used in RAPDs, are known to give several products (28,29); using longer primers indeed increases specificity. However, real PCR data suggest that specificity cannot be increased indefinitely by using longer primers: a 30 base primer was shown to amplify its target with eight mismatches at an annealing temperature 10°C above the calculated T_m (9). This unexpectedly low specificity requires further experimental research, but might be explained if increased primer length is accompanied by increased mismatch tolerance even under stringent conditions.
Of all the factors considered, the number of mismatches tolerated in the reaction had the strongest effect on amplification of non-targeted products (Fig. 2A). In real PCR, factors that reduce mismatch tolerance, like increasing annealing temperature, were found to improve specificity (6), in agreement with the model. This raises the question of the extent of mismatch tolerance in real PCR. Experimental work has shown that mismatches were tolerated under supposedly stringent conditions with 30 base primers, as described above (9). Similarly, using a 20 base primer, a PCR product was amplified with only 13 bp shared between the primer and the template (7). Under less stringent conditions, at 37°C, a 17 base primer was found to amplify a product with nine mismatches distributed throughout the primer (11). These experimental data suggest that in many reactions mismatches cannot be prevented, further supporting the above proposal that mismatch tolerance can resolve the PCR paradox. Reducing mismatch tolerance might therefore be the most significant means to improve PCR specificity. This might become possible by stabilizing perfect matches with chemical components added to the reaction, or with heat-stable enzymes (30).
Utilization of simPCR for primer design
An important aspect of primer design is to identify unwanted annealing sites that might give rise to a non-targeted PCR product, prior to primer synthesis. Several programs can handle this task (13-15,18,31-34), but not with complex templates. Therefore, DBase screening for `suspicious' homologies to individual primers is sometimes performed using Findpatterns, or more sophisticated programs which monitor single annealing sites under various T_m conditions (17). DBase screening for single annealing sites becomes impractical with mismatch levels tolerated in PCR, as hundreds or thousands of entries are detected (see example in Fig. 4). Compared with Findpatterns, the simPCR output has the advantage of reporting only putative PCR products, and thus being more compact and informative. The utility of simPCR will be enhanced when the sequence of complete genomes becomes available and a better understanding of the reaction is gained.
Probing uniqueness and randomness of DNA sequences using mismatch response
curves
PCR can be considered a special case of reactions involving recognition of specific sites along the DNA. A wide range of biological reactions that involve such recognition sites can be studied using the approach presented in this work. Initiation of transcription, processing of introns, and several other processes require at least two different motifs positioned within a certain distance and in a specific orientation. As in PCR, the recognition of each motif tolerates mismatches, and the distance between the motifs may vary. Each motif is analogous to an annealing site, whose chance occurrence can be described by equation 8. A composite structure is analogous to the formation of non-targeted PCR products, and can be mathematically described, with minor modifications, by equation 3. Thus, equations 8 and 3 allow the uniqueness of a recognition site or of a composite structure in the genome to be predicted. Consider a transcription unit composed of a 19 base promoter and a 20 base enhancer, both occurring on the same strand and within 500 bp distance. This structure has the same mismatch response curve as shown for XY-products of pair S in a 2.2 × 10^8 bp genome (Fig. 4A, bold dashed line). From this theoretical curve, it can be predicted that such a structure will be unique only if each motif tolerates no more than four mismatches. Furthermore, when comparing the theoretical curve with simulation results (Fig. 4A), deviations from the model predictions indicate the extent of non-randomness of the DNA studied. In the future, when DBases contain complete genomic sequences, these comparisons will make it possible to probe DNA non-randomness more accurately. In summary, the combined use of response curves for mismatches, distance limitations and motif length, obtained from mathematical modeling and DBase scanning, is a new approach to probe uniqueness and randomness of DNA sequences in complex genomes.
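The promoter and enhancer example can be worked through numerically under stated assumptions. Equations 3 and 8 are not reproduced in this section, so the sketch below uses a stand-in: each motif's chance occurrence is a cumulative binomial probability for a random genome with equal base frequencies, and the expected number of chance structures is the product of the two probabilities, the genome length, the distance limit and a hypothetical `orientation_factor` that accounts for the strands on which the structure may lie. The function names and that factor are illustrative, not taken from the model.

```python
from math import comb


def motif_probability(length, tolerance):
    """P(a random site matches a motif with at most `tolerance` mismatches)."""
    tolerance = min(tolerance, length)
    hits = sum(comb(length, k) * 3**k for k in range(tolerance + 1))
    return hits / 4**length


def expected_structures(len_a, len_b, tolerance, genome=2.2e8,
                        max_distance=500, orientation_factor=2):
    """Expected chance occurrences of two motifs within `max_distance` of each other.
    `orientation_factor` is an assumed constant, not a value from the paper."""
    return (orientation_factor * genome * max_distance
            * motif_probability(len_a, tolerance)
            * motif_probability(len_b, tolerance))


def max_unique_tolerance(len_a, len_b, **kwargs):
    """Largest per-motif tolerance for which fewer than one chance structure is expected."""
    best = None
    for tol in range(min(len_a, len_b) + 1):
        if expected_structures(len_a, len_b, tol, **kwargs) >= 1:
            break
        best = tol
    return best
```

For a 19 base and a 20 base motif within 500 bp in a 2.2 × 10^8 bp genome, this stand-in returns a threshold of four mismatches for any orientation factor between 1 and 4, which agrees with the prediction read from the theoretical curve.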
ACKNOWLEDGEMENTS
We are thankful to D. Lancet, S. Pietrokovski, Y. Elkind, L. Segal, M. Edelman
and O. Yarden for fruitful discussions and critical reading; to G. Benet for
providing unpublished results; to A. Rubin and O. Asor for help in developing
the model; to the bioinformatics unit for technical support in programming and
database handling; and to Y. Avivi and V. Levy for carefully editing the
manuscript. This work was supported by a doctoral fellowship from the Feinberg
Graduate School to E.R. and a Yigal Alon Fellowship to A.A.L.
4 Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G. and Erlich, H. (1986) Cold Spring Harbor Symp. Quant. Biol., 51, 263-273.
5 Innis, M.A. and Davis, H.G. (1990) In Innis, M.A., Gelfand, D.H., Sninsky, J.J. and White, T.J. (eds), PCR protocols. Academic Press Inc., San Diego, pp 3-12.
6 Rychlik, W., Spencer, W.J. and Rhoads, R.E. (1990) Nucleic Acids Res., 18, 6409-6412.
15 Hillier, L. and Green, P. (1991) PCR Methods Appl., 1, 124-128.
16 Lincoln, S.E., Daly, M.J. and Lander, E.S., (1991) PRIMER: A Computer Program for Automatically Selecting PCR Primers (0.5)-program and manuals, MIT Center for Genome Research, Nine Cambridge Center, Cambridge, MA 02142, USA.
17 Mitsuhashi, M., Cooper, A., Ogura, M., Shinagawa, T., Yano, K. and Hosokawa, T. (1994) Nature, 367, 759-761.
18 Montpetit, M.L., Cassol, S., Salas, T. and O'Shaughnessy, M.V. (1992) J. Virol. Methods, 36, 119-128.
20 Dujon, B., Alexandraki, D., Andre, B., Ansorge, W., Baladron, V., Ballesta, J., Banrevi, A., Bolle, P.A., Bolotinfukuhara, M., Bossier, P. et al. (1994) Nature, 369, 371-378.
21 Feldmann, H., Aigle, M., Aljinovic, G., Andre, B., Baclet, M.C., Barthe, C., Baur, A., Becam, A.M., Biteau, N., Boles, E. et al. (1994) Embo J., 13, 5795-5809.
22 Johnston, M., Andrews, S., Brinkman, R., Cooper, J., Ding, H., Dover, J., Kucaba, T., Hillier, L., Jier, M., Johnston, L. et al. (1994) Science, 265, 2077-2082.
23 Oliver, S.G., Vanderaart, Q., Agostonicarbone, M.L., Aigle, M., Alberghina, L., Alexandraki, D., Antoine, G., Anwar, R., Ballesta, J., Benit, P. et al. (1992) Nature, 357, 38-46.
24 Wilson, R., Ainscough, R., Anderson, K., Baynes, C., Berks, M., Bonfield, J., Burton, J., C, o.M., Copsey, T., Cooper, J. et al. (1994) Nature, 368, 32-38.
25 Griffais, R., Andre, P.M. and Thibon, M. (1991) Nucleic Acids Res., 19, 3887-3891.
26 He, Q., Marjamaki, M., Soini, H., Mertsola, J. and Viljanen, M.K. (1994) Biotechniques, 17, 82.
27 Karlin, S. and Cardon, L.R. (1994) Annu. Rev. Micro., 48, 619-654.
28 Welsh, J., Chada, K., Dalal, S.S., Cheng, R., Ralph, D. and McClelland, M. (1992) Nucleic Acids Res., 20, 4965-4970.
29 Birkenmeier, E.H., Schneider, U. and Thurston, S.J. (1992) Mamm. Genome, 3, 537-545.
30 Angov, E. and Cameriniotero, R.D. (1994) J. Bacteriol., 176, 1405-1412.
31 Lucas, K., Busch, M., Mossinger, S. and Thompson, J.A. (1991) Comput. Appl. Biosci., 7, 525-529.
32 Dopazo, J., Rodriguez, A., Saiz, J.C. and Sobrino, F. (1993) Comput. Appl. Biosci., 9, 123-125.