ABSTRACT
An increasing number of proteins are being identified that regulate gene expression by binding specific nucleic acids in vivo. A method termed genomic SELEX facilitates the rapid identification of networks of protein-nucleic acid interactions by identifying within the genomic sequences of an organism the highest affinity sites for any protein of the organism. As with its progenitor, SELEX of random-sequence nucleic acids, genomic SELEX involves iterative binding, partitioning, and amplification of nucleic acids. The two methods differ in that the variable region of the nucleic acid library for genomic SELEX is derived from the genome of an organism. We have used a quick and simple method to construct Escherichia coli, Saccharomyces cerevisiae, and human genomic DNA PCR libraries that can be transcribed with T7 RNA polymerase. We present evidence that the libraries contain overlapping inserts starting at most of the positions within the genome, making these libraries suitable for genomic SELEX.
Interactions of proteins with DNA and RNA are at the heart of gene expression regulation. It has become clear that this regulatory network is intricate, and that we are only starting to understand its full scope. In addition to proteins for which binding nucleic acids seems to be the primary function in vivo, other proteins have dual functions of which one is the capacity to bind to nucleic acid [for instance, (1-3) and see the lists in (4,5)]. Some of these proteins are involved in gene regulation, while the function of nucleic acid binding in others remains unknown. Several proteins that bind two or more different nucleic acids are involved in gene regulation [e.g., (3,6,7)]. Undoubtedly, there are other protein-nucleic acid interactions that have yet to be identified. For most proteins, RNA ligands can be selected that bind with nanomolar affinities (affinities that are certainly high enough to elicit a response in vivo). Among these proteins are many not thought of as RNA- or DNA-binders (8). Based on this evidence, we have suggested (9) that a wide range of proteins affect gene expression by interacting with nucleic acids in vivo; indeed, we hypothesized that a complete description of the workings of the cell must include a `linkage map' that describes the interactions between proteins and nucleic acids in the life of the cell. The discovery of interactions like these requires a global search method, a method such as genomic SELEX.
Genomic SELEX is an extension of SELEX (8,10,11). In SELEX, nucleic acids that bind tightly to a protein of interest are identified through successive rounds of binding, partitioning and amplification. In SELEX as originally developed, the library contains 10^14-10^15 random sequences. PCR amplification requires that the nucleic acid sequences of interest be flanked by fixed sequence primer annealing sites. A T7 promoter is included in one of the primer annealing sites so that the library can be expressed as RNA.
In genomic SELEX, the libraries contain sequences derived from the genome of the organism of interest flanked by fixed regions that allow PCR amplification and transcription. The success of genomic SELEX is critically dependent on the quality of the starting library. Ideally, the library should be fully representative of the genome of interest and the various genomic inserts should be equally represented. In this article we present a method of library construction, and two independent methods to test library quality. We also present the results of these tests on genomic libraries that we constructed from human, Saccharomyces cerevisiae, and Escherichia coli DNA.
DNAs from human placenta (Type XIII) and E.coli B were purchased from Sigma. Saccharomyces cerevisiae DNA was purified from strain S288C using equilibrium density gradients in CsCl (12).
Figure 1 provides an overview of library construction. We used random priming on denatured genomic DNA that had been sheared by sonication only enough to reduce viscosity. Primer Bran (12 µM final concentration) and genomic DNA (0.17 mg/ml final concentration, 25 mg total) were mixed and incubated at 93°C for 3 min, then quickly chilled on ice. Klenow (0.9 U per ml final concentration) and 300 µM dNTPs (final concentration) were added and the reaction was incubated on ice for 5 min, at 25°C for 25 min and at 50°C for 5 min. The low temperature step facilitates annealing of the primer's random nine nucleotides, while the 50°C step allows Klenow extension through hairpins in single-stranded DNA. Four successive spins through Microcon-10 filters (Amicon, MA) removed ~60% of the primers from the first step. [Kirk Jensen, personal communication, reported that Microspin S-400 HR Columns (Pharmacia) removed 95% of the primers during construction of a library using the method described here.] Second strand synthesis with primer Aran was the same as first strand synthesis. The reaction products were separated on a denaturing gel and fractions of various sizes (with genomic inserts ranging from about 40-700 nucleotides) were electro-eluted according to the protocol provided by Isco, Inc. (Lincoln, NE). The yield was ≥1 pmol of each fraction. Since the human genome has about 6 × 10^9 nucleotides, each position potentially served as the starting nucleotide of an insert of every given size in ≥100 molecules of the human library (and correspondingly more for the less complex genomes of E.coli and S.cerevisiae). Next we amplified the library by PCR using primers of completely fixed sequences, one of which adds a T7 promoter (primer A); thus the library can be expressed as either RNA or DNA.
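The estimate of ≥100 molecules per starting position follows from dividing the number of molecules in 1 pmol by the number of nucleotides in the genome. A minimal sketch of that arithmetic is given below; the function name and its parameters are illustrative and are not part of the protocol.

```python
# Sketch of the coverage arithmetic: molecules of a given insert size
# available per possible starting nucleotide in the genome.
AVOGADRO = 6.022e23  # molecules per mole


def molecules_per_start_position(pmol: float, genome_size_nt: float) -> float:
    """Average number of library molecules per genomic start position.

    Assumes, as an idealization, that priming is spread evenly over
    every nucleotide of the genome.
    """
    molecules = pmol * 1e-12 * AVOGADRO
    return molecules / genome_size_nt


if __name__ == "__main__":
    # 1 pmol of a size fraction, human genome taken as 6e9 nucleotides
    print(round(molecules_per_start_position(1.0, 6e9)))
```

Passing a smaller genome size for the same yield gives a proportionally larger number of molecules per position, which is the sense in which the less complex genomes are covered "correspondingly more".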
Primers.
Single copy gene primer sequences were selected from GenBank version 90.0 (8/95). We chose primer sequences with predicted annealing temperatures close to 72°C, and whenever possible, within 1°C of each other. We used 72°C as the PCR extension temperature. Thus we were often able to eliminate the annealing step, and we may have additionally gained a measure of specificity of priming. We calculated annealing temperatures in degrees centigrade (T_A) by the formula of Wu et al. (13):
T_A = 1.46(A + T + 2C + 2G) + 22,
where A, T, C and G are the numbers of the corresponding nucleotides in the primer.
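The annealing temperature formula can be applied to a primer sequence by counting bases. The following is a minimal sketch under that reading of the formula; the function name and the example primer are hypothetical and do not correspond to any primer used in this work.

```python
def annealing_temperature(primer: str) -> float:
    """Predicted annealing temperature (degrees C) from base counts.

    Implements T_A = 1.46(A + T + 2C + 2G) + 22, where A, T, C and G
    are the numbers of each nucleotide in the primer.
    """
    seq = primer.upper()
    unexpected = set(seq) - set("ACGT")
    if unexpected:
        raise ValueError(f"unexpected characters in primer: {sorted(unexpected)}")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 1.46 * (at + 2 * gc) + 22


if __name__ == "__main__":
    # Hypothetical primer, for illustration only
    example_primer = "ATGCATGCATGCATGCATGCGGCC"
    print(f"{annealing_temperature(example_primer):.2f}")
```

Candidate primers for a single copy gene could be screened by keeping those whose computed values fall close to 72°C and within 1°C of each other.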
Primers were obtained from Operon (CA). Biotinylated primers were synthesized
incorporating three `biotin-ON' phosphoramidites at the 5' end.
We used a biotinylated genome-specific primer and a library primer (A or B) to generate by PCR a set of overlapping `sub-fragments' that contain only one fixed library primer annealing site and variable extents of genomic insert (Fig. 2B).
We chose primers predicted to anneal only to single copy genes and carried out PCR as described in the section `PCR' above. We typically used 3 pmol library DNA and a corresponding amount of genomic DNA. For instance, in a library with genomic insert DNA of length 60, 3 pmol of library molecules contain 120 ng of insert. Hence for the control we used 120 ng of the genomic DNA from which the library was made.
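The 120 ng figure can be checked with a short calculation. The sketch below assumes an average mass of about 660 g/mol per base pair of double-stranded DNA; that value is a common approximation and is not stated in the text, and the function name is illustrative.

```python
# Assumed average molar mass of one base pair of double-stranded DNA.
AVG_MASS_PER_BP = 660.0  # g/mol, approximation (not from the source text)


def insert_mass_ng(pmol: float, insert_length_bp: int,
                   mass_per_bp: float = AVG_MASS_PER_BP) -> float:
    """Mass in ng of genomic insert carried by `pmol` of library molecules.

    pmol x (g/mol) gives picograms; dividing by 1000 converts to ng.
    """
    picograms = pmol * insert_length_bp * mass_per_bp
    return picograms / 1000.0


if __name__ == "__main__":
    # 3 pmol of a library with 60 bp inserts: about 119 ng, i.e. roughly 120 ng
    print(f"{insert_mass_ng(3, 60):.1f}")
```

The same function gives the matching amount of genomic DNA for control reactions with libraries of other insert lengths.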
Genomic libraries have been generated previously using a variety of methods, including restriction digestion and ligation (14), mechanical fragmentation and blunt-end ligation (15), mechanical fragmentation and enzymatic `tailing' (He et al., in preparation), and PCR amplification using a single primer with a fixed 5' end and random bases at the 3' end (16), the method most similar to that reported here.
Figure 1 shows the approach we used to construct our libraries. Human, yeast, or E.coli genomic DNA was denatured and annealed to an oligo with nine random nucleotides at the 3' end and a fixed sequence at the 5' end. After annealing, which ideally is distributed randomly, the oligo was extended with Klenow. Another randomized oligo with a different 5' fixed sequence was added to the products of the first reaction, annealed and extended in the same way. We ran the extended reaction products on a denaturing gel to fractionate by size. Each fraction became the basis of a library with a different length of genomic insert. The library was completed by PCR amplification that added a T7 promoter to one of the primer annealing sites.
The library should contain an overlapping set of inserts for every segment of the genome (Fig. 2A). In order to test this notion, we developed a novel technique that allows us to examine in detail the sequences of such overlapping inserts. This method shows us the distribution of end-points in a specific region of genomic sequence, and allows us to determine the sequence fidelity of the library, both within and outside of the nine base pair region derived from random priming during library construction.
Figure 2B shows a hypothetical set of overlapping sub-fragments that would be generated during this analysis. They are called `sub-fragments' because they include only one end of the corresponding library fragments (Fig. 2A). Each sub-fragment has one library and one genome-specific primer annealing site. During the PCR that generates this set of sub-fragments, the library primer amplifies (linearly) every molecule in the library. Thus, it is necessary to biotinylate the genome-specific primer so that the desired sub-fragments can be isolated by binding to streptavidin-coated beads. Additional cycles of PCR using a `nested' genome-specific primer eliminate any remaining background.
Table 1
.
The end-points of genomic insert sequences in sub-fragments isolated from single copy genes in four libraries are shown in Figure 3. While we did not find an end-point at every possible position, well over half of them are represented. The largest region in which we found no end-points is only nine nucleotides long. This expanse is small relative to the insert size that would be chosen for most purposes.
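The two summary statistics used here, the fraction of positions at which an end-point was found and the longest stretch with no end-point, can be computed from a list of observed end-point positions. The sketch below is illustrative only: the function name is hypothetical and the positions in the usage example are invented, not taken from Figure 3.

```python
from typing import Iterable, Tuple


def endpoint_coverage(endpoints: Iterable[int],
                      region_start: int,
                      region_end: int) -> Tuple[float, int]:
    """Summarize end-point coverage of an inclusive genomic region.

    Returns (fraction of positions with at least one observed end-point,
    length of the longest run of consecutive positions with none).
    """
    if region_end < region_start:
        raise ValueError("region_end must not be smaller than region_start")
    observed = {p for p in endpoints if region_start <= p <= region_end}
    total = region_end - region_start + 1
    longest_gap = current_gap = 0
    for pos in range(region_start, region_end + 1):
        if pos in observed:
            current_gap = 0
        else:
            current_gap += 1
            longest_gap = max(longest_gap, current_gap)
    return len(observed) / total, longest_gap


if __name__ == "__main__":
    # Invented positions, for illustration only
    fraction, gap = endpoint_coverage([1, 2, 4, 8], 1, 10)
    print(fraction, gap)
```

Duplicate end-points from independently sequenced sub-fragments are counted once, since the question is only whether a position is represented.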
We examined the nine nucleotides adjacent to primer A in libraries from all three organisms. These nine nucleotides are involved in annealing to the randomized sequence of the primer during library construction. As primer annealing at low temperature is imprecise, we expected misannealing to generate mutations in this region. The results are shown in Table 3. The human library was generated using the Klenow fragment of E.coli DNA polymerase I with intact 3' exonuclease, the so-called proofreading exonuclease (exo+). The yeast and E.coli libraries were made with Klenow lacking that exonuclease (exo-). As was expected, the library generated with exo+ Klenow is more accurate in these positions. The downside of the exo+ polymerase is that it might yield a greater over-representation of molecules with genomic inserts adjacent to regions that are similar in sequence to the fixed regions of the library primers, since the entire random region could be removed by this enzyme, leaving only annealing of the fixed region to genomic DNA.
Incorrect bases in the priming region reduce the effective size of the genomic insert. (By `insert' we mean the part of the library molecule between the two fixed sequences.) The effective size of the insert should be reduced by 4 ± 4 nucleotides in the human library that we made, and by 6 ± 6 nucleotides in the E.coli library. In practice, it is wise to work with a library that has inserts long enough to make mistakes in the priming regions irrelevant.
This method of testing the library also shows how well the sequences of library inserts match the published genomic sequence. In the region excluding the nine nucleotide stretches adjacent to the library primers, two of the 60 E.coli sub-fragments sequenced had one point mutation each, whereas in the human library four out of 86 sub-fragments sequenced had one point mutation each. Thus the libraries that we have generated are sufficiently accurate for use in genomic SELEX.
We also tested the library quality with a more traditional method, using PCR to amplify various genomic segments. Each amplified segment spans almost the entire genomic insert length of the library. If some genomic segment is missing from the library because of the way the library was constructed, it will be amplified from the original genomic DNA, but not from the library DNA. With all four libraries, the observed size of the PCR products was as predicted from the GenBank sequences (Fig. 4). Because the library contains overlapping inserts, and because the size of genomic segments amplified in these experiments approximates the size of the insert, most of the molecules with one genome-specific priming site do not contain the other genome-specific priming site. Thus, for a given genomic region, most library molecules are not PCR-amplifiable. As expected, the yield was lower with the library DNA as a template than with the genomic DNA (from which the library was made), under otherwise identical conditions.
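The reason most library molecules are not amplifiable can be illustrated with a simple counting model. The sketch below assumes inserts of one fixed length that start with equal probability at every genomic position, and it ignores mismatches in the randomly primed ends; these are simplifications introduced here, and the function name and the 20 nucleotide primer length in the example are hypothetical.

```python
def amplifiable_fraction(insert_len: int, segment_len: int, primer_len: int) -> float:
    """Fraction of inserts carrying one priming site that also carry the other.

    An insert contains a whole primer site of length `primer_len` for
    (insert_len - primer_len + 1) start positions, but contains the whole
    segment (both sites) for only (insert_len - segment_len + 1) of them.
    """
    if not 0 < primer_len <= segment_len:
        raise ValueError("primer_len must be positive and no longer than segment_len")
    if primer_len > insert_len:
        raise ValueError("primer_len must not exceed insert_len")
    with_one_site = insert_len - primer_len + 1
    with_both_sites = max(0, insert_len - segment_len + 1)
    return with_both_sites / with_one_site


if __name__ == "__main__":
    # 60 nt inserts, 60 bp segment, hypothetical 20 nt primers
    print(f"{amplifiable_fraction(60, 60, 20):.3f}")
```

Under these assumptions only a small minority of the molecules that can anneal to one genome-specific primer can serve as a template for exponential amplification, which is consistent with the lower yield from library DNA.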
Escherichia coli library.
A total of 13 tested segments, 60 base pairs each, were all amplified both from the E.coli genomic DNA and from the library. Five segments were amplified from the dam (DNA adenine methylase) gene, four segments from the bgl operon (involved in utilization of sugars, beta-glucosides), and one segment each from metB (involved in methionine biosynthesis), the ilvGMEDA operon (involved in isoleucine/valine biosynthesis), corA (Mg2+ transport protein), and the ribosomal RNA gene (this segment is the only one tested that is not from a single copy gene).
We were concerned that sequences predicted to form stable hairpins in ssDNA
might result in the under-representation in the completed library of molecules that include those
sequences or that are adjacent to them. Such a hairpin could obstruct primer
annealing or extension during library construction. To address this question,
we amplified an rRNA gene segment inside a region that is predicted to form a
long hairpin, thus possibly preventing random primer annealing during library
construction. Based on the number of cycles it takes to amplify the segment
from the library DNA to approximately the same level as from the genomic DNA,
the sequence does appear to be somewhat under-represented in the library (data not shown).
Libraries such as the ones we describe in this paper may be useful for a variety of genomic SELEX applications, e.g., to find RNA or DNA that binds a particular protein or another biologically important molecule of interest, such as an antibiotic, a cofactor, a mono- or polysaccharide, or to find RNA or DNA that is cleaved or otherwise covalently modified by a protein, a metal ion or any other molecule. The described method of library construction by randomized primer extension is easy and robust (it has worked for human, S.cerevisiae, and E.coli DNA). Although we have constructed only genomic DNA libraries, the method is adaptable to cDNA as well.
By PCR with genomic primers, we amplified from the library DNA template all segments that we amplified from genomic DNA. No specific loss of segments can be attributed to the library construction process.
A `perfect' library has molecules with inserts of a given size that begin at every nucleotide of the genome. Because of this, analysis of the end-point distribution provides a more sensitive and rigorous test of library quality than amplification of genomic segments. We have shown that the libraries reported here are virtually complete. If we had sequenced additional sub-fragments, it is likely that we would have discovered additional end-points. However, even if inserts with the end-points that we failed to find are indeed missing in their respective libraries, the libraries are sufficiently comprehensive to contain every genomic binding site represented by many distinct inserts in an appropriately long library. Not all positions have equal fractions of molecules that start there, but the level of the imperfection is insufficient to affect the outcome of an in vitro selection (19; Vant-Hull et al., in preparation).
Sequencing showed that there are relatively few mutations in the library except in the nine nucleotide region immediately adjacent to the library primer annealing sites. Errors in this region make the library somewhat shorter by reducing the portion of the library molecules that is identical to the genomic sequences. These errors, as well as the adjacent fixed sequences, may affect binding of genomic inserts during the subsequent SELEX, and this remains a shortcoming in this method of library construction. However, all other published libraries have fixed sequences too. We are currently developing methods to overcome this limitation (Shtatland et al., in preparation).
We are not aware of any other published methods for rigorously testing the quality of a genomic library. Our methods may thus be a useful experimental tool in assessing the quality of a library. So far, only one other library was tested in our lab, and was found to be comparable in quality (He et al., in preparation).
If our libraries are used for RNA SELEX, one should keep in mind that not all genomic DNA is transcribed into RNA. Any RNAs not expressed in vivo will have little or no biological relevance; however, very tight binding to an RNA sequence not thought to be transcribed may indicate that the sequence is transcribed after all. Binding sites that are present only in spliced or edited RNA do not exist in our libraries. On the other hand, libraries made from cDNA do not include introns and intron-exon boundaries, sites that may be important in regulation of splicing [reviewed in (20)]. Moreover, cDNA libraries reflect transcription in some particular stage of development, and may thus yield incomplete answers for certain biological questions.
We have discussed both our expectations from genomic SELEX and the early results from some of the genomic SELEX experiments underway in our lab (9). We have used genomic SELEX to discover binding sites for the bacteriophage MS2 coat protein in the E.coli genome (Shtatland et al., in preparation) and binding sites for human U1A protein within human RNA (Singer et al., in preparation). We have also performed genomic SELEX using human basic fibroblast growth factor (a protein not known to bind RNA in vivo) and human genomic RNA; this SELEX yielded a single RNA winner that has a nM Kd (He et al., in preparation).
Genomic SELEX is conceived to be analogous to the yeast two-hybrid system (21) as a rapid screen for any protein-nucleic acid or metabolite-nucleic acid interaction that occurs in vivo; in short, we expect genomic SELEX to provide a nucleic acid `linkage map' for such interactions and note that a nucleic acid `linkage' made plausible through genomic SELEX can be tested directly in organisms using the familiar research tools of molecular biology.
We thank D. Burke, R. Gutell, K. Jensen, K. Krauter, A. Payano-Baez, and B.Vant-Hull for thoughtful comments and useful discussions. We thank A.
Wright, C. Nislow, H. Chial, F. Luca, A. Monterrosa, S. Jacobsen, and A. Schutz
for oligonucleotides, D. Smith for suggesting the primer sequences, S.
Creighton for suggesting conditions for Klenow, and D. Lorenz for help with the
figures. This work was supported by NIH grant GM19963 to L.G., and by funds
from NeXstar Pharmaceuticals, Inc., for which we are grateful.
*To whom correspondence should be addressed. Tel: +1 303 546 7605; Fax: +1 303 546 7603; Email: gold@nexstar.com
+Present address: Ambion, Inc., 2130 Woodward St., Suite 200, Austin, TX 78744, USA
Organism  exo  Gene    No.     %        Percent incorrect at each position
                       tested  correct    9    8    7    6    5    4    3    2    1
E.coli    -    metB    60       8         0   12   13   13   13   21   28   46   43
Yeast     -    NDC1    44      18         2    5    5    7   12   21   24   30   39
Yeast     -    NDC1*   49      32         2    4    8    6   16   24   20   36   32
Human     +    ada     43      49         0    0    9   12   10   15    0   14   33
Human     +    U1A     43      44         0    0   14    3    6    0   10   23   37
REFERENCES