PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing
PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencingDeborah A. Nickerson*, Vincent O. Tobe and Scott L. Taylor
Department of Molecular Biotechnology, Box 357730, University of Washington, Seattle, WA 98195-7730, USA
Received April 22, 1997;Revised and Accepted May 27, 1997
ABSTRACT
Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms and mutations of biological and medical interest. The application of this technology in generating the reference sequence of simple and complex genomes is also driving the development of new computer programs to automate base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in high throughput settings. In this report we describe a new computer program known as PolyPhred that automatically detects the presence of heterozygous single nucleotide substitutions by fluorescencebased sequencing of PCR products. Its operations are integrated with the use of the Phred, Phrap and Consed programs and together these tools generate a high throughput system for detecting DNA polymorphisms and mutations by large scale fluorescence-based resequencing. Analysis of sequences containing known DNA variants demonstrates that the accuracy of PolyPhred with single pass data is >99% when the sequences are generated with fluorescent dye-labeled primers and ~90% for those prepared with dye-labeled terminators.
INTRODUCTION
Single base substitutions are the most frequent form of DNA sequence variation in the human genome (1 ,2 ). Identification of these variations plays an important role in detailing the evolutionary history of human populations (3 ,4 ) and in exploring the relationships between genome structure and function (genotype-phenotype correlations) through genetic and disequilibrium mapping (5 ,6 ). Furthermore, many diagnostic applications depend on accurate identification of single nucleotide substitutions for finding mutated genes (7 ,8 ), matching tissues prior to transplantation (9 ) and analyzing samples in forensic situations (10 ).
Amplification of specific genomic regions using PCR has greatly simplified the process of comparing sequences to identify DNA variations by eliminating the need for genomic cloning from multiple individuals (11 ). Once a region has been amplified, a number of techniques can be employed to comparatively scan it for sequence variants. These include denaturing gradient gel electrophoresis (12 ), chemical or enzymatic cleavage (13 -15 ), heteroduplex analysis (16 ), the analysis of single-stranded DNA conformations (17 ), hybridization to oligonucleotide arrays (18 ,19 ) and DNA sequencing (20 -22 ). Among these approaches, DNA sequencing offers several advantages, including its ease of application (use of a single set of reagents and assay conditions), its automation with fluorescence-based methods and its ability to provide complete information about the location and nature of the sequence variant(s) in a single pass.
Despite the advantages of detecting DNA variations by sequence analysis, it is difficult to accurately identify heterozygous sites (two bases at the same location in a sequence) because of the variability in fluorescence signals and the inconsistency of base calling at these sites. Recently, several approaches have been taken to improve identification of heterozygous sites using automated sequence analysis (22 -25 ). In one approach, heterozygotes are found by comparing the pattern of fluorescence dye incorporation between the sequence traces (22 ). Since this pattern is faithfully reproduced every time the same sequence is generated, heterozygous positions in a trace can be accurately identified based on a predictable reduction (~50%) in normalized peak area when compared with homozygous positions. In this report we present a computer program known as PolyPhred that automatically finds potential heterozygotes in a sequence using this comparative approach. We also compare program performance with sequencing chemistries that produce highly variable patterns of dye incorporation (sequences generated with dye-labeled terminators; 26 ) and those that produce more uniform fluorescence incorporation (dye-labeled primer sequencing) in terms of their accuracy and efficiency in heterozgyote detection. Lastly, we report the discovery of new DNA variations using comparative sequencing and PolyPhred.
MATERIALS AND METHODS
PCR primers
Primers for PCR amplification of genomic DNA were assembled using standard phosphoramidite chemistry on an Applied Biosystems 394 DNA synthesizer (Foster City, CA). Primers were prepared and used to amplify 11 genomic regions containing single nucleotide substitutions representing all potential nucleotide changes (A <-> T, A <-> C, G <-> T, C <-> G, A <-> G, C <-> T). The regions examined were: (i) exon 2 of the human steroid 5[alpha]-reductase gene (SRD5A1, A <-> G, GDB:193189, CCCAAATCATTTAAGATAGGATTAC, ATGATGTGAACAAGGCGGAGTTCAC, 60oC); (ii) intron 8 of the human lipoprotein lipase gene (LPL, A <-> C, GDB:191079, TACACTAGCAATGTCTAGCTGA, TCAGCTTTAGCCCAGAATGC, 60oC); (iii) exon 28 of the von Willebrand factor pseudogene (VWFP, A <-> T, GDB:194282, TGTAAAACGACGGCCAGT(-21M13)AGCCGTCGTGGTACTCCACCACA, CAGGAAACAGCTATGACC(M13Rev)AGATTCTGTGGGAATATGGAAGTAGTCA, 55oC); (iv) exon 5 of the guanine nucleotide binding protein (GNAS, A <-> G, GDB:203981, TCTTGTAGCGCCCTCCCA, TGCCCATGTGCAGGGCTGTCACTCATGTT, 60oC); (v) a segment from the 3'-untranslated region of [beta]2-integrin (ITGB2, C <-> T, GDB:185175, GAGCACTTGGTGAAGACAAG, GGATGTCATTTTATACCCTG, 51oC); (vi) intron 3 of adenine nucleotide translocator 1 (ANT1, C <-> T, GDB:201792, ACAGGGCTCCTTTCAGTCTTCC, CAAATGCTGGTGAGGGCTCCG, 57oC); (vii) exon 4a of solute carrier family 2, member 4 (SLC2A4, C <-> T, GDB:180271, CAGGAAGGGAGCCACTGCTG, ATCTGAAAGCCCAGGCATGG, 63oC); (viii) a segment from the 3'-untranslated region of the tyrosinase-related protein 1 gene (TYRP1, A <-> C, GDB:555709, GTCGGGAGTTTAGTGTACCT, TCTGAAAGGGTCTTCCCAGC, 60oC); (ix) intron 4 of the constant region of the human T cell receptor (TCR) [alpha] locus (TCRCA, C <-> T, G <-> T and C <-> G, TGTAAAACGACGGCCAGT(-21M13)GAGCTAAGAGAGCCGTACTGG, CAGGAAACAGCTATGACC(M13Rev)CTTGAAGCTGGGAGTGG, 55oC) (27 ); (x) a variable gene segment from the human TCR [alpha] locus (TCRVA23, C <-> T and C <-> G, TGTAAAACGACGGCCAGT- (-21M13)GTCTAAGTGACAGAAGGAATG, AATGTATAAAGTACTACGTCCTGA, 55oC) (28 ); (xi) a variable gene segment from the human TCR [beta] locus (TCRVB23, A <-> G and G <-> T, GenBank accession no. U96844, TGTAAAACGACGGCCAGT- (-21M13)GGAAAGCCTGAGTTAGCTGAGC, CAGGAAACAGCTATGACC(M13Rev)AGAATAGAAGCATCTCTGGG, 55oC).
DNA amplification
DNA samples from the parents of the 40 families available through the Centre d'Etude du Polymorphisme Humaine (CEPH) were used for PCR amplification of the target loci. All amplification reactions were performed in a 96-well microtiter plate thermal cycler (PTC 100; MJ Research, Watertown, MA). The reactions were assembled (20 [mu]l total volume) and contained a standard PCR buffer (10 mM Tris-HCl, pH 8.3, 50 mM KCl, 1.5 mM MgCl2 and 0.001% gelatin), the four deoxynucleotide triphosphates at 40 [mu]M each, 0.5 [mu]M each primer, 0.5 U Taq polymerase (Perkin Elmer Cetus, Norwalk, CT) and 20 ng genomic DNA. Following assembly, the reactions were covered with 50 [mu]l mineral oil. Thermal cycling was performed with an initial denaturation at 94oC for 1 min followed by 35 cycles of denaturation at 95oC for 20 s, primer annealing for 30 s (temperatures specified above with primer sequences) and primer extension at 72oC for 2 min. After 35 cycles, a final extension was carried out at 72oC for 5 min. Individuals were selected for DNA amplification based on genotypes previously established in these loci using PCR combined with an oligonucleotide ligation assay (OLA; 29 ).
DNA sequencing
Following DNA amplification, unincorporated PCR primers and deoxynucleotide triphosphates in the samples were inactivated prior to sequencing by enzymatic treatment. This was accomplished by mixing 6 [mu]l PCR product with 1 [mu]l exonuclease I (10 U/[mu]l; Amersham Life Science Inc., Arlington Heights, IL) and 1 [mu]l shrimp alkaline phosphatase (2 U/[mu]l; Amersham) and incubating at 37oC for 15 min followed by 80oC for 15 min to inactivate the exonuclease and alkaline phosphatase enzymes prior to sequencing. In our hands PCR products treated with these enzymes sequence as well in terms of quality and read length as those isolated by agarose gel electrophoresis coupled with column purification (26 ). Cycle sequencing was performed according to the manufacturer's instructions using ABI PRISM Dye Terminator or Dye Primer Sequencing Kits with Amplitaq DNA polymerase FS (Perkin Elmer Corp., Foster City, CA). For dye-terminator cycle sequencing the entire enzyme-treated PCR sample (8 [mu]l total following treatment) was used as the sequencing template. The sequencing primer (3.2 pmol, same as PCR primer) and 8 [mu]l Dye Terminator Ready-Reaction sequencing premix were added to the template. Following a denaturation step at 96oC for 2 min, dye-terminator reactions were incubated at 96oC for 15 s, 50oC for 1 s and 60oC for 4 min for 25 cycles. Excess dye-terminators were removed by ethanol precipitation. In the case of dye-primer sequencing, PCR products were generated using locus-specific primers containing either the -21M13 or M13Rev primer sequences at their 5'-end. For sequencing the enzyme-treated PCR sample was subdivided into four separate reactions as follows: 1 [mu]l each of the PCR sample mixed with 4 [mu]l PRISM ready premix for the A and C reactions and 2 [mu]l each of the PCR sample mixed with 8 [mu]l PRISM ready premix for the G and T reactions. Sequencing reactions were denatured for 1 min at 96oC and subjected to 15 cycles at 96oC for 10 s, 55oC for 5 s and 70oC for 1 min and 15 cyles at 96oC for 10 s and 70oC for 1 min. Then, the A, C, G and T reactions were pooled and subjected to ethanol precipitation. The extension products obtained with either chemistry were evaporated to dryness under pressure (Savant Instruments, Farmingdale, NY), resuspended in 3 [mu]l loading buffer (5:1, 1% deionized formamide, 50 mM EDTA, pH 8.0), heated for 2 min at 90oC and loaded onto an Applied Biosystems 373 sequencer according to the manufacturer's directions.
Sequence analysis
The ABI sequence software (version 2.1.2) was used for lane tracking and first pass base calling (Perkin Elmer). Chromatograms were transferred to a Unix workstation (Sun Microsystems Inc., Mountain View, CA), base called with Phred (version 0.961028), assembled with Phrap (version 0.960731), scanned by PolyPhred (version 0.970312) and the results viewed with the Consed program (version 4.0). Specific descriptions and documentation on Phred, Phrap and Consed are available at http://www.genome. washington.edu (P.Green, personal communication). PolyPhred has been designed to parse information from Phred and Phrap output files and via a flat file provide input to Consed to aid in identification of heterozygous single nucleotide substitutions by color coding potential sites. All data presented in this report were generated using command line parameters requiring a peak drop ratio of 0.55, a second peak ratio of 0.15 and, unless noted otherwise, an average sequence quality setting of 30. PolyPhred is available via Email from debnick@u.washington.edu and more documentation is available at http://droog.mbt.washington.edu.
RESULTS
In comparing sequence traces of homozygotes with those for heterozygotes two changes are usually present: (i) a significant drop in normalized peak height at a polymorphic site when traces from homozygous and heterozygous individuals are compared; (ii) a second underlying peak at the position in question (22 ,26 ). To automate identification of substitution variations using these criteria, we created a program known as PolyPhred. Its functions are fully integrated with three software packages currently applied in large scale sequence analysis: Phred, Phrap and Consed (P.Green, B.Ewing and D.Gordon, personal communication). PolyPhred reads the normalized peak areas and quality values obtained by Phred for each position in a sequence. It then searches for reductions in peak areas at each position across the sequence alignment obtained from the Phrap assembly program. If the required drop in peak area is found at a position and a second base is detected by Phred, PolyPhred calls the site a potential heterozygote and information on this position is stored in the program's output file.
By interfacing the information obtained by PolyPhred with the `quality means color and tags view' in the Consed program, potential heterozygotes become color coded, as shown in Figure 1 (position 214, Fig. 1 a, and position 351, Fig. 1 b). In these examples Consed views of sequences from 10 individuals are shown. Heterozygotes at two positions in the coding region of a TCR gene were automatically identified by PolyPhred (91 bp of assembly sequence are shown in each window and, altogether for the 10 individuals, 1820 bp of sequence are displayed). When a common polymorphism is identified (position 214 in Fig. 1 a, His -> Arg substitution) homozygotes for each of the alternative alleles are detected (e.g. in this instance four individuals homozygous G and one individual homozygous A), in addition to heterozygotes containing the two alternative alleles (e.g. sequences from the five heterozygous individuals color coded blue). However, less common alleles will typically be identified just as heterozygotes among the homozygotes (e.g. the three heterozygotes color coded blue at position 351 in Fig. 1 b). It is worth noting that the variant at position 351 (Val -> Gly) would have been missed if identification was based solely on the results of sequence alignment, since neither the ABI nor the Phred program called these positions as Ns: in all cases the G peak was sufficiently dominant even in a heterozygote to meet the ABI and Phred criteria for a G.
aIncludes variations previously identified in target loci and typed by PCR/OLA in addition to new variations detected and confirmed by sample resequencing.
bThe ratio of true positives to false positives rounded to the nearest whole number.
The signal-to-noise ratio (true positive to false positives) improves greatly as the scanning window for PolyPhred is set to analyze data at increasing quality thresholds. The total number of base pairs that can be scanned at higher quality settings differs significantly for the two chemistries examined. With dye-primer sequencing and a quality setting of 40, an average of 250 bp are scanned in 94% of the available sequence traces (143 traces altogether) and the signal-to-noise ratio is 2:1 (Table 1 ). At higher qualities, false positives are easily distinguished by an analyst and are usually caused by a fluctuation in peak height from a homozgyote combined with some increase in the background noise that led PolyPhred to detect a second base. Examples of these types of false positives are shown in Figure 3 for sequences produced with dye-labeled primer (Fig. 3 a and b) or dye-labeled terminator (Fig. 3 c and d) sequencing.
ACKNOWLEDGEMENTS
We thank Drs Phil Green and Brent Ewing and Mr David Gordon for sharing their insights and programs (Phred, Phrap and Consed) with us. We also thank Drs Maynard Olson and Mark Rieder and Ms Ursula Petralia for their helpful comments. This work was supported in part by the National Science Foundation (DIR 8809710), the National Institute for Human Genome Research (HG01436) and the Department of Energy (DE-FG03-97ER- 62385).