
© 1996 Oxford University Press 3437-3438


Representation of unique sequences in libraries of randomized nucleic acids

Martin Tabler 1,*, Panayiotis Benos 1,3 and Martin Dörr 2

1 Institute of Molecular Biology and Biotechnology, 2 Institute of Computer Science, Foundation for Research and Technology, Hellas and 3 Department of Biology, University of Crete, PO Box 1527, GR-71110 Heraklion, Crete, Greece

Received April 16, 1996; Revised and Accepted July 12, 1996

ABSTRACT

From a library of nucleic acid molecules, which are randomized in parts of their sequence, unique sequence variants can be selected for specific properties. The planning of such an in vitro selection experiment requires some consideration regarding how much DNA template or RNA transcript should be used initially. The amount applied depends on the number of randomized nucleotides and on the expectations of how often each conceivable and unique sequence combination should be represented in the experimental pool. We display graphs describing the probability for the representation of unique nucleic acid molecules in a randomized pool as a function of the mean representation $k$, defined by the ratio of sampled nucleic acid molecules to conceivable sequence combinations, and we summarize the amounts required to represent unique sequences with 99% likelihood. The probability of representation, $P = 1 - e^{-k}$, can be applied also to 'sub-saturated' pools ($k < 1$) of nucleic acids with long randomized domains, where it is impossible to provide sufficient material for full sequence representation.

An RNA molecule provides sequence information but it may also have functional properties, for example the capability of self-cleavage or of acting as a ligand. Such RNAs can be identified by in vitro selection procedures (reviewed in 1-3). The RNA of interest is selected from a pool (library) of RNA molecules that differ in sequence. For selection, a DNA molecule is synthesized containing, at defined positions, a completely random mixture of all four bases, A, C, G and T. Fixed sequences at the termini allow in vitro transcription, reverse transcription and PCR amplification. The pool of RNA transcripts is subjected to selection, for example for binding to an immobilized ligand (SELEX) (4) or for different catalytic properties (5-7). Selected RNAs are reverse transcribed, and the resulting cDNAs are PCR-amplified and used as template for RNA synthesis in a new round of selection.

The length of the randomized sequence determines the conceivable number of unique sequence variants in a pool of nucleic acids. However, for practical reasons, the amount of RNA (or DNA) that can actually be synthesized and subjected to a selection experiment is limited. In view of the enrichment during the selection procedure, it does not matter, at least in theory, whether the pool contains a given unique sequence several-fold or just once. It is therefore of interest to determine the likelihood that a particular sequence is not represented in a library of nucleic acids. This probability, $P_{0,n}$, approximates $e^{-k}$, where the representation factor $k = n/4^{L}$ is the ratio of the number of molecules in the pool, $n$, to the number of conceivable sequence combinations, $4^{L}$, which depends on the number of randomized nucleotides, $L$.
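
The approximation can be made explicit with a short derivation. This is only a sketch, assuming (the text does not state it) that each of the $n$ molecules is drawn independently and uniformly from the conceivable sequences; the symbol $N = 4^{L}$ is introduced here purely for convenience.

```latex
\begin{align*}
P_{0,n} &= \left(1 - \frac{1}{N}\right)^{n}, \qquad N = 4^{L} \\
        &= \left(1 - \frac{1}{N}\right)^{kN} \qquad (n = kN) \\
        &\approx e^{-k} \qquad \text{for large } N, \\
P &= 1 - P_{0,n} \approx 1 - e^{-k}.
\end{align*}
```

Under these assumptions the number of randomized nucleotides enters only through $k$, and the approximation is already very close for the smallest libraries considered here ($N = 4^{6} = 4096$).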

In a pool with $k = 1$, which contains as many molecules as there are possible unique sequence combinations, a given sequence has a 36.8% chance of not being represented, so that 63.2% of all sequence combinations are represented at least once. Each further increase of $k$ by 2.3 (≈ ln 10), that is, each addition of about 2.3 times as many molecules as there are sequence combinations, reduces the fraction of unrepresented sequences by a factor of 10. Figure 1A demonstrates the relationship between the representation factor and the probability that an RNA sequence is not included in an experimental pool.
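
The numbers in this paragraph can be checked with a few lines of Python. This is a minimal sketch using only the approximation $e^{-k}$; the function names are hypothetical and chosen for illustration. It evaluates the probabilities at $k = 1$ and after adding ln 10 to $k$ once and twice.

```python
import math


def prob_unrepresented(k):
    """Approximate probability e^(-k) that a given sequence is absent from the pool."""
    return math.exp(-k)


def prob_represented(k):
    """Approximate probability 1 - e^(-k) that a given sequence is present.

    math.expm1 is used so that the result stays accurate for very small k.
    """
    return -math.expm1(-k)


# k = 1, then k raised by ln 10 once and twice
for k in (1.0, 1.0 + math.log(10), 1.0 + 2 * math.log(10)):
    print(f"k = {k:5.2f}  absent = {prob_unrepresented(k):.4f}  "
          f"present = {prob_represented(k):.4f}")
```

Each printed line should show the absent fraction dropping tenfold relative to the previous one, since $e^{-(k + \ln 10)} = e^{-k}/10$.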

The probability of representation is a function of $k$ only and is independent of the number of randomized nucleotides, $L$, as long as the number of molecules sampled in a pool increases with the number of conceivable sequence combinations. For example, each extension of the randomized sequence by one nucleotide (increase of $L$ by 1) requires a 4-fold increase in the number of nucleic acid molecules in the library to maintain the same likelihood of sequence representation. Such pools are characterized by the same $k$ factor. The amounts of nucleic acids required to achieve 99% sequence representation, depending on the number of randomized nucleotides, are summarized in Table 1.


Figure 1. Relationship between the representation factor $k$ and the probability $P_{0,n}$, which indicates the likelihood that a nucleic acid is not represented in a sequence-randomized library. (A) The relationship for 'saturated' libraries, in which the number of sampled molecules is greater than the number of conceivable sequence combinations ($k \geq 1$). (B) The relationship for 'sub-saturated' libraries, in which the number of sampled molecules is smaller than the number of conceivable sequence combinations ($k \leq 1$).


$P_{0,n}$ can also be calculated for 'sub-saturated' pools ($k < 1$; Fig. 1B), for example if the randomized sequence is long ($L > 25$), where it is impossible to provide sufficient material for full sequence representation. Here, $k$ also represents the upper limit of the fraction of conceivable sequence combinations that can be present in the pool.
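
For a sub-saturated pool the calculation runs in the other direction: from the amount of material at hand to $k$ and the probability of representation. The following is a minimal sketch; the function name and the example values (1 nmol of a pool with 40 randomized nucleotides) are hypothetical and not taken from the paper.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def representation_from_amount(moles, L):
    """Return (k, P) for a pool of `moles` of nucleic acid with L randomized positions.

    k = n / 4**L is the representation factor, and P = 1 - e^(-k) is the
    approximate probability that one particular sequence is present.
    """
    n = moles * AVOGADRO          # number of molecules in the pool
    k = n / 4 ** L                # representation factor
    p = -math.expm1(-k)           # 1 - e^(-k), accurate even when k is tiny
    return k, p


# Hypothetical example: 1 nmol of a library with 40 randomized nucleotides
k, p = representation_from_amount(1e-9, 40)
print(f"k = {k:.3e}, P = {p:.3e}")
```

For $k$ much smaller than 1, $P$ is practically equal to $k$ and never exceeds it.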

Some researchers prefer pool sizes with simultaneous representation of almost all conceivable sequences to ensure that all sequence variants are subjected to the selection process (8). A formula to calculate the required pool sizes is provided at http://www.imbb.forth.gr/jol/sel.html . However, the probability that the best performing sequence is present in the pool is described solely by $P = 1 - e^{-k}$ and is independent of whether the remaining sequences are present or not. Therefore, simultaneous representation of all sequences is not relevant.
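
To illustrate how much more material simultaneous representation demands, the sketch below uses a standard Poisson approximation from occupancy theory: the probability that none of the $N = 4^{L}$ sequences is missing is roughly $\exp(-N e^{-k})$. This is an assumption made for illustration and is not necessarily the formula given at the URL above; the function names are hypothetical.

```python
import math


def prob_all_represented(k, L):
    """Approximate probability that every one of the 4**L sequences is present.

    Assumes the number of missing sequences is roughly Poisson distributed
    with mean N * e^(-k), where N = 4**L.
    """
    N = 4 ** L
    return math.exp(-N * math.exp(-k))


def k_for_all_represented(L, p_all=0.99):
    """Representation factor k at which prob_all_represented equals p_all."""
    N = 4 ** L
    return math.log(N) - math.log(-math.log(p_all))


L = 10
k_single = math.log(100)              # 99% chance that ONE given sequence is present
k_all = k_for_all_represented(L)      # 99% chance that ALL sequences are present
print(f"L = {L}: k (single sequence) = {k_single:.2f}, k (all sequences) = {k_all:.2f}")
```

Under this approximation the $k$ needed for complete coverage grows with ln $4^{L}$, whereas the $k$ needed to represent any one sequence with a fixed probability does not depend on $L$.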

Table 1. Nucleic acids required for 99% likelihood of sequence representation ($P_{0,n} = 0.01$)

Randomized nucleotides    Sequence combinations    Size of library^a
(L)                       (4^L)                    (mol)

 6                        4.10 × 10^3              3.13 × 10^-20
 7                        1.64 × 10^4              1.25 × 10^-19
 8                        6.55 × 10^4              5.01 × 10^-19
 9                        2.62 × 10^5              2.00 × 10^-18
10                        1.05 × 10^6              8.02 × 10^-18
11                        4.19 × 10^6              3.21 × 10^-17
12                        1.68 × 10^7              1.28 × 10^-16
13                        6.71 × 10^7              5.13 × 10^-16
14                        2.68 × 10^8              2.05 × 10^-15
15                        1.07 × 10^9              8.21 × 10^-15
16                        4.29 × 10^9              3.28 × 10^-14
17                        1.72 × 10^10             1.31 × 10^-13
18                        6.87 × 10^10             5.25 × 10^-13
19                        2.75 × 10^11             2.10 × 10^-12
20                        1.10 × 10^12             8.41 × 10^-12
21                        4.40 × 10^12             3.36 × 10^-11
22                        1.76 × 10^13             1.35 × 10^-10
23                        7.04 × 10^13             5.38 × 10^-10
24                        2.81 × 10^14             2.15 × 10^-9
25                        1.13 × 10^15             8.61 × 10^-9
26                        4.50 × 10^15             3.44 × 10^-8

a For a pool size of $n = \ln 100 \times 4^{L}$ molecules ($k = \ln 100 \approx 4.6$); doubling this pool size reduces the fraction of sequences that are not represented by a factor of 100.
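
The entries of Table 1 can be recomputed from $k = \ln 100$. The sketch below assumes that the library size is expressed in moles, i.e. the number of molecules divided by Avogadro's number; the function name is hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole


def library_size(L, p_absent=0.01):
    """Return (sequence combinations, molecules needed, moles needed).

    The pool is sized so that a given sequence is absent with probability
    p_absent, i.e. k = ln(1 / p_absent), which is ln 100 for 99% representation.
    """
    combinations = 4 ** L
    k = math.log(1.0 / p_absent)
    molecules = k * combinations
    return combinations, molecules, molecules / AVOGADRO


for L in range(6, 27):
    combinations, molecules, moles = library_size(L)
    print(f"{L:2d}  {combinations:.2e}  {moles:.2e}")
```

Because $k$ is fixed, each additional randomized nucleotide multiplies the required amount by 4.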

REFERENCES

1 Gold,L., Polisky,B., Uhlenbeck,O. and Yarus,M. (1995) Annu. Rev. Biochem., 64, 763-797.

2 Kumar,P.K. and Ellington,A.D. (1995) FASEB J., 9, 1183-1195.

3 Burgstaller,P. and Famulok,M. (1995) Angew. Chem. Int. Ed. Engl., 34, 1189-1192.

4 Tuerk,C. and Gold,L. (1990) Science, 249, 505-510.

5 Robertson,D.L. and Joyce,G.F. (1990) Nature, 344, 467-468.

6 Beaudry,A.A. and Joyce,G.F. (1992) Science, 257, 635-641.

7 Berzal-Herranz,A., Joseph,S. and Burke,J.M. (1992) Genes Dev., 6, 129-134.



* To whom correspondence should be addressed