| Nucleic Acids Research | Pages |
Automated detection of point mutations using fluorescent sequence trace subtraction
Introduction
Methods
Methods background
Mutation detection criteria
Automatic mutation detection
Checking for mutations with the GAP4 contig editor
Obtaining an overview using the GAP4 template display
Choosing a representative wild-type sequence
Results
Discussion
Acknowledgements
References
Automated detection of point mutations using fluorescent sequence trace subtraction
ABSTRACT
INTRODUCTION
Mutation detection is becoming increasingly important as a clinical diagnostic tool and as part of genome research and several techniques are currently in use. These include single-stranded conformational polymorphism analysis (1), denaturing gradient gel electrophoresis (2), heteroduplex analysis (3,4), chemical mismatch cleavage (5) and direct sequencing. A review of these techniques has been given by Grompe (6). Ultimately, direct sequencing is required to define the precise location and nature of any change. For this final step fluorescence-based sequencing instruments are now extensively used.
Because no single sequence reading obtained from a fluorescence-based sequencing instrument is 100% reliable, the search for mutations is not simply a matter of comparing a suspected divergent sequence against that of the wild-type: differences between two sequences can be due to mutations or to base calling errors and therefore the problem is to recognize real mutations within a background of base calling errors.
To eliminate base calling errors by visual comparison of wild-type and mutant traces, Tamary (7) used the ABI SeqEd program and Jonsson (8) used an early version of our sequence assembly software. Automated detection of heterozygotes using improvements to version 1.2 of the PE/ABI Factura program has been described by Phelps (9). In this case traces were analysed to identify base positions that exhibited a secondary signal strength greater than a selected percentage threshold of the primary peak. A further method of this type was described by Kwok (10) and a more recent method from Nickerson (11) has the advantage of being used in conjunction with a sequence alignment editor.
Our main interest is in automating the accurate location of base changes. The method we have devised employs sequences determined by a fluorescence-based sequencing instrument. The new software then normalizes and subtracts the pairs of wild-type and mutant traces to produce a new set of traces which represent their differences. The value of these calculations is two-fold: first, they enable reliable automatic detection of mutations; second, they greatly simplify visual inspection of traces because the software can display the traces of the two original sequences plus the trace of their differences. To illustrate the basic idea, in Figure
Figure 1. This shows the traces as displayed by the GAP4 contig editor and illustrates the features of the trace differences as analysed by TRACE_DIFF. The top trace is a consensus trace, the middle one is from an individual reading and the bottom one the difference between the other two. In general the difference trace contains very few features, however, three base changes are shown with their associated peaks in the differences plot. The left hand mutation (C->T) causes a pair of strong peaks in opposite directions with a small context effect. The next mutation (T->G) has strong opposite peaks and a strong context effect peak. The third mutation has quite strong opposite peaks and a single context effect peak. The new routines we describe are additions to the extensive set of methods that we have developed for use in large scale sequencing projects. Of these existing methods, the following components are used in conjunction with the new routines described in this paper. SCF files (13) are used for storing trace and sequence data generated by sequencing instruments; Experiment files (14) are used for transferring data about sequences between programs; the program GAP4 (12) is used for aligning and editing sequences. GAP4 obtains its data from Experiment files and stores the assembled sequences in its own database. The GAP4 contig editor can display and scroll (in register) sets of traces from aligned readings. Segments of sequences can be annotated and the annotated segments are marked with coloured boxes, called tags. Each tag type has its own identifying colour and the contig editor's cursor can jump from one tag to the next. During conventional sequencing projects each base is generally determined from several independent readings and so the accuracy of each reading is less important than it is for mutation detection studies. Here, we cannot rely on base calling alone, but instead use it in conjunction with numerical analysis of the trace data. Our assumption is that the traces are a more reliable picture of the data than the base calls, and that, aside from sequence context effects (15,16), the shapes of the traces are reproducible. By context effects we mean the differences in the heights of the peaks are a function of more than just a single base. Hence, a single base difference between two segments of DNA can result in trace differences, not only for the divergent base, but also for its immediate neighbours (see the bottom trace of Fig. For the purpose of mutation detection a false positive will be a difference between two sequences which is the result of a base calling error. A false negative will be caused by the rare event of a base calling error disguising a real base difference. For automatic detection of mutations, we can search for trace differences above a specified minimum value that are accompanied by a change to the called base, although this will necessarily miss the rare cases where a base called in error disguises a real mutation. Context effect changes to the peak heights of bases adjacent to a mutated base will not be accompanied by a base change, and therefore will not be reported as mutations. So, our first criterion (I) for detection of a mutation is that the called base should be different from that of the wild-type. In order to compare the traces from a pair of sequences we need to normalize the data. We have tried several methods of scaling the amplitudes of the traces (the y direction), including expressing the values as proportions of the total signal at each point, but at present we find that we obtain the best results by leaving the data unscaled in y. For scaling along the traces (the x direction), we need to get the peaks for the equivalent bases from the two sets of traces in register. If there are no insertions or deletions this is straightforward, otherwise there are a number of possible cases. The first case is that the base calling software has called an extra base but there is no extra peak and the base calling is incorrect. The second is that the base calling software has missed a base for which there is a peak. The other cases are not base calling errors but genuine insertion and deletion differences between wild-type and mutant. The software uses a dynamic programming algorithm to align the sequences and then the traces are re-sampled to produce two new sets of traces. Let tiw(x) and tim(x) be the heights of the wild-type and mutant traces at trace sample position x for each base type i = A, G, C or T. (In our data, for each peak/base there are typically around 12 sample points for each base type.) The differences between the wild-type and mutant traces are It is these values that we assess and, for example, where a single base is changed, say A to T, we would expect to observe abnormally large magnitudes for dA(x) and dT(x), one positive and the other negative. The significance of the values di(x) is assessed by first calculating their mean m and standard deviation s over the length of the reading and then locating all positions for which the observed value exceeds a threshold defined by the user. The values of m and s are the combined values for all di(x). We chose to use the mean and standard deviation because we needed a value that depended on the sequence being analysed (rather than use a fixed value for all sequences), but, as described above, this calculation is dependent on the number of mutations. Therefore, the calculation of the mean and standard deviation is restricted to the segments of the trace differences that correspond to the troughs between the peaks of the original data. We also restrict the search of di(x) to the segments that correspond to the positions of the peaks in the original data. Context effects may result in changes to the heights of peaks, but are not expected to produce a positive change for one base type and a negative change for another. Referring again to Figure So our second criterion (II) for a mutation at position x is that there exist two base types i and j such that and where n is a value set by the user. We assume that if criterion II alone is satisfied then it corresponds to a context effect, and if criterion I alone is satisfied then it corresponds to a base calling error. If they are both satisfied at the same position then a mutation is reported. For any individual prospective mutant sequence the processing is generally performed in two steps. First, a new program, TRACE_DIFF, is used to align and compare the mutant and wild-type sequences and traces and to locate possible mutations. Second, the sequence is assembled into a GAP4 database from where users can visually check the differences between the wild-type and mutant traces. As explained below, using one of the search modes in GAP4, the visual inspection can be performed very rapidly. To facilitate batch processing of sequence readings for mutation detection we have added a new type of Experiment file record (WT) to contain the name of each reading's wild-type trace file. Then TRACE_DIFF can get all the information it needs from the readings' Experiment file. Also, all mutations detected by TRACE_DIFF are written back to the reading's Experiment file as mutation tags and the peak height from the difference trace is recorded in the tags' comment field. As all the information needed for processing (apart from user-supplied parameters that are common to all readings) are in each reading's Experiment file, any number of readings can be processed in a single batch. The mutation detection program TRACE_DIFF performs the following steps. Get the options and parameters selected by the user such as the threshold n used in equations 1 and 2 and the name(s) of input file(s) and possibly an output file. Because, in general, the traces are poor at both ends of readings, the user selects a start and end point in base positions to define the range over which the peak detection routine should be applied.
Read-the wild type and mutant traces and sequences.
Use dynamic programming to align the two sequences.
Use the sequence alignment to align the two traces.
Compare the two sets of traces to create a new set that represents the differences between them.
Calculate the mean and standard deviation of the difference traces.
Scan the difference traces and locate all peaks where criterion II is satisfied. If such a peak corresponds to a base difference (i.e. criterion I is also satisfied) report it. Note that TRACE_DIFF can also be used in a mode which reports all positions with significant trace differences, even though the called base is identical to that of the wild-type.
If requested, write the trace differences as a new SCF format file. The sequence of this file denotes all the suspected mutations discovered by TRACE_DIFF using the IUBC (17) redundant base symbols in upper and lower case, e.g. Y denotes a C->T change and y a T->C change. This SCF file can be viewed just like a normal trace file. If the mutant sequence's trace file was accessed indirectly by using the data on the WT line of its Experiment file, then TRACE_DIFF can write annotation data into this Experiment file to denote the positions of all the detected mutations. If this file is assembled into a GAP4 database these annotated bases will appear as `mutation tags'. Since the original version of the GAP4 contig editor was described many new features have been added. A view of the editor exhibiting some of the modes relevant to mutation detection is shown in Figure Figure 2. An example of the GAP4 contig editor showing modes of operation useful for mutation detection. A vertical scroll bar enables alignments of any depth to be viewed. Individual readings are numbered, named and written left to right. Accuracy estimates or confidence values for each base are shown using grey scales: the darker the background the poorer the data. Mutation tags are shown in dark green (for example there are three visible in the top sequence). Changes to the original base calls or accuracy estimates are also colour coded by changes to the background colour: deletions are shown in red, base changes in pink, padding characters in light green and modified confidence values in blue. Within the GAP4 contig editor, to facilitate the use of trace comparisons for checking and detecting mutations the following new features have been added. Although, for the convenience of users, there are several methods of selection, there are, in essence, two new modes of operation. The first, like TRACE_DIFF, compares the traces from any pair of overlapping readings and creates a new trace display of their differences. This display, as seen in Figure The second mode is more complicated in that the trace of a selected reading is compared with a `consensus trace' calculated from the readings it overlaps. The consensus trace and its differences from the traces of the selected reading are displayed as shown in Figure If, prior to assembly into the GAP4 database, the readings have been analysed with TRACE_DIFF to produce mutation tags, the contig editor search function can step from tag to tag whilst simultaneously scrolling traces and their difference trace. In this way, the user can check each possible mutation assignment and delete the tags for those that look false. The GAP4 template display can provide an overview of all the mutations in a set of readings. Figure Figure 3. Here the GAP4 template display shows readings as red arrows and tags as small coloured rectangles. Along the base of the display is a scale marked in 100s of bases and the vertical black line is a cross-hair. In this example tags automatically generated by TRACE_DIFF are shown in green, false negatives have been manually edited to red and false positives to yellow. A suitable wild-type trace for use by TRACE_DIFF and GAP4 can be a single high quality reading or (and this has given the best results for our data) the consensus from several less good readings processed by GAP4. As mentioned above, GAP4 can even create one by, at each point, only including the components of traces that produce the consensus base. Alternatively, it may be beneficial to always include the wild-type DNA on the gel used to determine the suspected mutant sequences. For consistent results it is probably best to use GAP4 to create a single consensus trace for use in all experiments. The mutation detection program TRACE_DIFF is being used to study somatic hypermutation in immunoglobulin genes. In Table 1, column A, we show the results of applying TRACE_DIFF to three sets of readings (214 in total) determined as part of the somatic hypermutation study. The sequences were obtained from mice containing immunoglobulin [kappa] light chain transgenes. The transgenic lines were Lk6 (19), Lk[[Delta]Li] (20) and LK045 (C.Rada, unpublished results). The sequences were obtained by PCR and cloning in M13mp18 as previously described (21). Sequencing was performed using fluorescent dye terminators (Thermosequenase cycle sequencing pre-mix kit from Amersham Life or ABI Prism Dye terminator cycle sequencing ready reaction kit with AmpliTaq DNA polymerase FS from Applied Biosystems) in an ABI 377 sequencer. Table 1. For each of the three sets of sequences the consensus trace used by TRACE_DIFF was generated from five of the better quality readings. Prior to mutation analysis the readings were automatically clipped of poor quality data at either end using TRACE_CLIP (R.Staden, unpublished results). The threshold n was set at 4.0 and searches were performed for bases 10-600 within the clipped data. The value (4.0) used for n was derived from work on a separate set of data (not included in the analysis) and was not tuned to give the best results for the three test sets. The results from TRACE_DIFF were compared with those obtained from scanning the complete traces by eye after the readings had been assembled into a GAP4 database. The test data consisted of 108 497 bases called using the standard ABI software. After the readings had been aligned with their consensus sequences they contained 1232 differences, of which 392 were bases called as unknown (N) and a further 166 were padding characters introduced during alignment. Visual inspection showed that there were 353 real mutations, and with n = 4.0 TRACE_DIFF missed 36 of them and found 28 false positives. The false positives tended to be at the two ends of the readings, where the data were less reliable, and the false negatives were almost entirely due to the weak G after A problem that is found in the chemistry used. While this manuscript was in preparation we received the materials to use the new ABI BigDye terminators and found a marked improvement in the sequences obtained: their lengths were increased and the weak G problem was almost non-existent. The results from one batch of data are shown in Table 1, columns B and C. Column B contains results from sequences that were loosely clipped for quality, giving an average analysis length per reading of 673 bases, and column C has the results when the readings were clipped more severely, to leave only high quality data of an average analysed length of 560 bases. As can be seen, there are far fewer base calling errors or uncertainties for both ranges. Using the threshold parameter n set to 3.0 for the extended range set, TRACE_DIFF missed five mutations and found 15 false positives and for the narrower range, it missed no mutations and gave no false positives.
METHODS
Methods background
Mutation detection criteria
di(x) = tiw(x) - tim(x)
di(x) [ge] m + n.s
1
dj(x) [le] m - n.s
2
Automatic mutation detection
Checking for mutations with the GAP4 contig editor
Obtaining an overview using the GAP4 template display
Choosing a representative wild-type sequence
RESULTS
A
B
C
Number of readings
214
65
65
Number of bases
108 497
43 741
36 400
Average analysed length
503
673
560
Base differences
1 232
274
162
Real mutations
353
165
132
TRACE_DIFF false positives
28
15
0
TRACE_DIFF false negatives
36
5
0
DISCUSSION
We have described simple yet very useful methods for the detection and identification of mutations when sequences are determined using fluorescence-based sequencing instruments. As we have demonstrated, numerical treatment of the peaks is an effective method of automatic mutation detection and, second, the way that the differences are displayed within GAP4 makes it very much easier for the results to be checked visually.
We have demonstrated the reliability of automatic mutation detection for dye terminators and, for a more limited dataset, the new BigDye terminators. Given the wide choice of instruments and protocols in use it is not possible for us to cover them all. Nevertheless, we believe that those using the programs will quickly be able to establish suitable threshold values for TRACE_DIFF appropriate to the sequencing method of their choice. Obviously, the choice of threshold value n also depends on the type of project being undertaken: for some work an error rate similar to that obtained for our test data would be acceptable and no visual checking within GAP4 would be required, but for other projects the threshold would need to be set low enough to give a high chance of finding all possible mutations and visual inspection using the tag search routine would be essential in order to rule out false positives. For this latter type of project we have considered a modification of the tag searching routines that would locate the mutation tags in descending order of trace difference peak height.
We would like to emphasize the convenience of the overall environment provided by our methods. The readings can be automatically processed and searched for mutations in batches of any size and then assembled into a GAP4 database. With the contig editor, its search facilities and trace difference displays the data can then be checked rapidly by eye. The methods have been in routine use here for 1 year.
We have concentrated on the use of the new methods for mutation detection, but believe that they may have a role to play in genome sequencing projects. In very repetitive regions of genomes, where very long repeats differ at only a few positions, assembly engines can easily create contigs that contain segments in which almost identical but unrelated readings have been aligned together. These repetitive regions are problematical, as it is difficult to determine how many copies of each repeated element there are and then to fit them together in the right order. The new facility in the GAP4 contig editor to compare a trace against the consensus trace can be used to find misplaced readings and this procedure could be automated if it proved useful.
Information about obtaining the programs (for UNIX systems only) is available from the WWW address given in the abstract.
ACKNOWLEDGEMENTS
We thank Cesar Milstein for suggestions, Caroline Napper for help with sequencing and Kathryn Beal and Tony Crowther for critical reading of the manuscript. The work was supported by the UK Medical Research Council.
REFERENCES
This article has been cited by other articles:
This page is run by Oxford University Press, Great Clarendon Street, Oxford OX2 6DP, as part of the OUP Journals
Comments and feedback: www-admin{at}oup.co.uk
Last modification: 2 Jul 1998
Copyright©Oxford University Press, 1998.
![]()
CiteULike
Connotea
Del.icio.us What's this?
![]()
![]()

![]()
![]()
![]()
G. Denisov, B. Walenz, A. L. Halpern, J. Miller, N. Axelrod, S. Levy, and G. Sutton
Consensus generation and variant detection by Celera Assembler
Bioinformatics,
April 15, 2008;
24(8):
1035 - 1040.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
E. Cuppen
Genotyping by Dideoxy Resequencing
CSH Protocols,
September 1, 2007;
2007(18):
pdb.prot4842 - pdb.prot4842.
[Abstract]
[Full Text]
![]()
![]()
![]()

![]()
![]()
![]()
D. Huntley, A. Baldo, S. Johri, and M. Sergot
SEAN: SNP prediction and display program utilizing EST sequence clusters
Bioinformatics,
February 15, 2006;
22(4):
495 - 496.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
A. Bart, M. W. J. van Passel, K. van Amsterdam, and A. van der Ende
Direct detection of methylation in genomic DNA
Nucleic Acids Res.,
August 9, 2005;
33(14):
e124 - e124.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
C. Bredrup, P. M. Knappskog, J. Majewski, E. Rodahl, and H. Boman
Congenital Stromal Dystrophy of the Cornea Caused by a Mutation in the Decorin Gene
Invest. Ophthalmol. Vis. Sci.,
February 1, 2005;
46(2):
420 - 426.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
B. Chevreux, T. Pfisterer, B. Drescher, A. J. Driesel, W. E.G. Muller, T. Wetter, and S. Suhai
Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs
Genome Res.,
June 1, 2004;
14(6):
1147 - 1159.
[Abstract]
[Full Text]
[PDF]
![]()
![]()
![]()

![]()
![]()
![]()
D. D. Pollock, J. A. Eisen, N. A. Doggett, and M. P. Cummings
A Case for Evolutionary Genomics and the Comprehensive Examination of Sequence Biodiversity
Mol. Biol. Evol.,
December 1, 2000;
17(12):
1776 - 1788.
[Abstract]
[Full Text]
[PDF]
![]()
This Article ![]()
![]()
Abstract
![]()
Print PDF (292K)
![]()
Alert me when this article is cited
![]()
Alert me if a correction is posted
![]()
Services ![]()
![]()
Email this article to a friend
![]()
Similar articles in this journal
![]()
Similar articles in ISI Web of Science
![]()
Similar articles in PubMed
![]()
Alert me to new issues of the journal
![]()
Add to My Personal Archive
![]()
Download to citation manager
![]()
Search for citing articles in:
ISI Web of Science (35)
![]()
Request Permissions ![]()
Commercial Re-use Guidelines
for Open Access NAR Content
![]()
Google Scholar ![]()
![]()
Articles by Bonfield, J. K.
![]()
Articles by Staden, R.
![]()
Search for Related Content
![]()
PubMed ![]()
![]()
PubMed Citation
![]()
Articles by Bonfield, J. K.
![]()
Articles by Staden, R.
![]()
Social Bookmarking ![]()
![]()
What's this?