ABSTRACT
From a library of nucleic acid molecules, which are randomized in parts of their
sequence, unique sequence variants can be selected for specific properties. The
planning of such an in vitro selection experiment requires some consideration regarding how much DNA
template or RNA transcript should be used initially. The amount applied depends
on the number of randomized nucleotides and on the expectations of how often
each conceivable and unique sequence combination should be represented in the
experimental pool. We display graphs describing the probability for the
representation of unique nucleic acid molecules in a randomized pool as a
function of the mean representation $k$, defined by the ratio of sampled nucleic
acid molecules to conceivable sequence combinations, and we summarize the
amounts required to represent unique sequences with 99% likelihood. The
probability of representation, $P = 1 - e^{-k}$, can also be applied to
'sub-saturated' pools ($k < 1$) of nucleic acids with long randomized domains,
where it is impossible to provide sufficient material for full sequence
representation.
An RNA molecule provides sequence information but it may also have functional
properties, for example the capability of self-cleavage or being a ligand. These
RNAs can be identified by in vitro selection procedures (reviewed in 1-3). The
RNA of interest is selected from a pool (library) of RNA molecules that
differ in sequence. For selection, a DNA molecule is synthesized containing, at
defined positions, a completely random mixture of all four bases, A, C, G, T.
Fixed sequences at the termini allow in vitro transcription, reverse
transcription and PCR amplification. The pool of RNA transcripts is subjected to
selection, for example for binding to an immobilized ligand (SELEX) (4) or for
different catalytic properties (5-7). Selected RNAs are reverse transcribed, and
the resulting cDNAs are PCR-amplified and used as templates for RNA synthesis in
a new round of selection.
The length of the randomized sequence determines the conceivable number of
unique sequence variants in a pool of nucleic acids. However, for practical
reasons, the amount of RNA (or DNA) that can be actually synthesized and
subjected to a selection experiment is limited. In view of the enrichment
during the selection procedure, it does not matter, at least in theory, whether
the pool contains a given unique sequence several-fold or just once. It is
therefore of interest to determine the likelihood that a particular sequence is
not represented in a library of nucleic acids. This probability, $P_{0,n}$,
approximates $e^{-k}$, where the representation factor $k$, given as
$k = n/4^{L}$, is the ratio of the number of molecules, $n$, in a pool to the
number of conceivable sequence combinations, $4^{L}$, which depends on the
number of randomized nucleotides, $L$.
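The relation between $n$, $L$, $k$ and $P_{0,n}$ can be illustrated with a minimal Python sketch. It assumes that every molecule in the pool is an independent, uniform draw from the $4^{L}$ conceivable sequences, so that the exact probability of missing one particular sequence is $(1 - 1/4^{L})^{n}$. This sampling model and the function names are assumptions made for illustration; the text itself states only that $P_{0,n}$ approximates $e^{-k}$.

```python
import math


def representation_factor(n_molecules: float, randomized_len: int) -> float:
    """Return k = n / 4**L, the mean representation of each sequence."""
    return n_molecules / 4.0 ** randomized_len


def p_missing_exact(n_molecules: float, randomized_len: int) -> float:
    """Probability that one given sequence is absent, (1 - 1/4**L)**n.

    Assumes independent, uniform sampling of the 4**L sequences
    (an illustrative model, not stated explicitly in the text).
    """
    n_sequences = 4.0 ** randomized_len
    # log1p keeps precision when 1/n_sequences is very small
    return math.exp(n_molecules * math.log1p(-1.0 / n_sequences))


def p_missing_approx(n_molecules: float, randomized_len: int) -> float:
    """Approximation P_0,n ~ exp(-k) used in the text."""
    return math.exp(-representation_factor(n_molecules, randomized_len))


if __name__ == "__main__":
    L = 10
    n = 4 ** L  # pool with as many molecules as sequence combinations (k = 1)
    print("k      =", representation_factor(n, L))
    print("exact  =", p_missing_exact(n, L))
    print("approx =", p_missing_approx(n, L))
```

For pools of realistic size the two values are practically indistinguishable, which is why the exponential form is convenient for planning.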
In a pool with $k = 1$, which contains as many molecules as there are possible
unique sequence combinations, a given sequence is absent with a probability of
36.8%, while 63.2% of all sequence combinations are represented at least once.
Each further increase of $k$ by about 2.3 (ln 10 ≈ 2.303), that is, each
addition of about 2.3 × $4^{L}$ molecules to the pool, reduces the fraction of
unrepresented sequences by a factor of 10. Figure 1A demonstrates the
relationship between the representation factor and the probability that an RNA
sequence is not included in an experimental pool.
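The numbers quoted for $k = 1$ follow directly from $P_{0,n} \approx e^{-k}$. The short Python sketch below, with illustrative function names, evaluates the expression at $k = 1$ and then at successive increments of ln 10 in $k$, each of which lowers the unrepresented fraction tenfold.

```python
import math


def p_missing(k: float) -> float:
    """Fraction of sequences expected to be absent, exp(-k)."""
    return math.exp(-k)


def p_represented(k: float) -> float:
    """Fraction of sequences expected to be present at least once."""
    # -expm1(-k) equals 1 - exp(-k) but stays accurate for small k
    return -math.expm1(-k)


if __name__ == "__main__":
    for steps in range(4):
        k = 1.0 + steps * math.log(10.0)  # add ln 10 to k at each step
        print(f"k = {k:6.3f}  missing = {p_missing(k):.5f}  "
              f"represented = {p_represented(k):.5f}")
```

The same two functions apply unchanged to sub-saturated pools with $k < 1$, where `p_represented` gives the fraction of sequence space that the pool is expected to cover.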
The probability of representation is a function of $k$ only and is independent
of the number of randomized nucleotides, $L$, as long as the number of molecules
sampled in a pool increases in proportion to the number of conceivable sequence
combinations. For example, each extension of the randomized sequence by one
nucleotide (increase of $L$ by 1) requires a 4-fold increase in the number of
nucleic acid molecules in the library to maintain the same likelihood of
sequence representation. Such pools are characterized by the same $k$ factor.
The amounts of nucleic acids required to achieve 99% sequence representation,
depending on the number of randomized nucleotides, are summarized in Table 1.
Table 1. Nucleic acids required for 99% likelihood of sequence representation ($P_{0,n} = 0.01$)

Randomized nucleotides   Sequence combinations   Size of library^a
(L)                      (4^L)                   M (mol)
 6                       4.10 × 10^3             3.13 × 10^-20
 7                       1.64 × 10^4             1.25 × 10^-19
 8                       6.55 × 10^4             5.01 × 10^-19
 9                       2.62 × 10^5             2.00 × 10^-18
10                       1.05 × 10^6             8.02 × 10^-18
11                       4.19 × 10^6             3.21 × 10^-17
12                       1.68 × 10^7             1.28 × 10^-16
13                       6.71 × 10^7             5.13 × 10^-16
14                       2.68 × 10^8             2.05 × 10^-15
15                       1.07 × 10^9             8.21 × 10^-15
16                       4.29 × 10^9             3.28 × 10^-14
17                       1.72 × 10^10            1.31 × 10^-13
18                       6.87 × 10^10            5.25 × 10^-13
19                       2.75 × 10^11            2.10 × 10^-12
20                       1.10 × 10^12            8.41 × 10^-12
21                       4.40 × 10^12            3.36 × 10^-11
22                       1.76 × 10^13            1.35 × 10^-10
23                       7.04 × 10^13            5.38 × 10^-10
24                       2.81 × 10^14            2.15 × 10^-9
25                       1.13 × 10^15            8.61 × 10^-9
26                       4.50 × 10^15            3.44 × 10^-8
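The entries of Table 1 can be recomputed with a few lines of Python. The sketch below assumes that the library size is the amount of substance in moles, obtained by solving $P = 1 - e^{-k}$ for $k$, multiplying by $4^{L}$ to get the number of molecules, and dividing by the Avogadro constant. The function name and this reading of the third column are assumptions, although the values it produces agree with the tabulated ones.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def moles_for_representation(randomized_len: int,
                             p_represented: float = 0.99) -> float:
    """Moles of nucleic acid needed so that a given sequence is present
    with probability p_represented (assumes P = 1 - exp(-k), k = n / 4**L)."""
    k = -math.log(1.0 - p_represented)   # k = ln 100 for 99%
    n_molecules = k * 4.0 ** randomized_len
    return n_molecules / AVOGADRO


if __name__ == "__main__":
    for L in range(6, 27):
        print(f"L = {L:2d}   4^L = {4.0 ** L:.2e}   "
              f"library = {moles_for_representation(L):.2e} mol")
```

Because $k$ is fixed by the target probability, each additional randomized nucleotide multiplies the required amount by exactly four, which is the pattern visible down the last column of the table.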
REFERENCES
