ABSTRACT
Partial restriction digestion is used to map restriction sites and the location
of genes within yeast artificial chromosomes (YACs). Locus-specific probes are hybridised to the partially digested YAC DNA and the fragments to which they hybridise are compared with the pattern of partial
digestion products that include each map region. A least squares criterion is
presented which allows for error in fragment length determination. This rapidly defines the most likely location of a marker within the restriction map
and permits the combination of results from digestions with different restriction enzymes. Approximate confidence intervals may be assigned to gene locations, and tests
of goodness-of-fit of the data may be performed. Since the number of erroneously matched fragments increases in proportion to the square of the number of sites, denser maps are not
necessarily more informative. Simulations indicate that the optimal number of internal restriction sites given typical experimental error (1% of YAC length) is about five sites;
the associated broad support interval (on average one third of YAC length) may be reduced by combining results from different enzyme digestions. Application of a computer
implementation of this model to experimental data showed that the model fitted
well, and estimates of location were found to be consistent with other
evidence.
The physical mapping of large segments of genomic DNA has been greatly
facilitated by their manipulation in yeast artificial chromosomes, or YACs (1).
While the long-range and medium-range connectivity is provided by linkage and radiation hybrid maps,
and the highest mapping resolution is obtained by sequence determination, it is
often of interest to map markers more precisely within a 1000 kb interval. A
variety of approaches are possible, including fibre-FISH (2), analysis of
overlapping YACs and restriction mapping within YACs.
Restriction mapping of YACs presents a number of difficulties. Large DNA
fragments are much less prone to shearing when protected in agarose, but this
inhibits restriction enzyme activity, so that it is rarely possible to obtain a
complete digestion without partially digested products. An established method
(3-5) which is appropriate for the analysis of an individual YAC is partial
restriction mapping using pulsed field gel electrophoresis (PFGE). A given
restriction map is constructed by probing partially digested YAC DNA with DNA
markers from the ends of the YAC (which contain yeast marker genes from the
original YAC vector), thus identifying ladders of fragments whose sizes
correspond to the distance of each restriction site from the respective ends of
the YAC (1,6). A gene contained within the YAC can then be hybridised to the partial
digestion, yielding a characteristic pattern of signals. The position of the
gene within the partial restriction map may then be determined by relating the
probe hybridisation pattern to the fragments mandated by the partial
restriction map. However, the number of potential partial restriction products
increases in proportion to the square of the number of restriction sites,
making manual comparison of the observed fragments with expectations arduous,
and open to subjective interpretation. For this reason, the approach has been
mainly used with rare cutting enzymes which only cut a few times in a given
YAC. Here, we show how computer-assisted comparison of observed and expected fragment sizes can greatly
speed up analysis of restriction data by allowing rapid interpretation of
digestions with a number of sites, by combining information from a number of
different enzymes, and by assigning approximate confidence intervals.
A number of groups have presented elaborate computational approaches to the
analysis of partial restriction data. However, they have in common the goal of
assigning the marker locations from the fragment sizes generated by
hybridisation of the unknown marker on its own, without consideration of the
restriction map data provided by the end-markers. This is computationally intensive since, if there are
$r$ restriction sites, there are $(r - 1)!/2$ possible orders (assuming that all
completely digested fragments are identified). While simulation has suggested
that it may be possible to construct maps in the presence of realistic levels
of error and with missing fragments when there are <10 restriction sites (7),
this approach has yet to be applied in practice. Other groups have presented
simulations (8) with 11 sites where the number of orders is reduced to
~$r^{2}/4$ by hybridising with a probe which is located within the region (9).
Their results indicate that this approach is quite sensitive to error, e.g. a
1% error rate resulted in only 60% correct recovery of the true restriction map
(8). The method discussed here depends on the availability of end-markers to the YAC, which are usually readily available in the form of YAC
vector specific sequences contained in the YAC construct. The practical
computational problem is then reduced to choosing among the regions within a
known restriction map; this has, up to now, been solved manually by
investigators.
YAC DNA is partially digested by a restriction enzyme. The fragments are
separated by PFGE, transferred to membranes and hybridised with a DNA probe for
one end (e.g. the left end) of the YAC. The sizes of the observed fragments are
estimated by direct comparison with a size standard. For each enzyme site, a
fragment is observed whose length ($L$) corresponds to the length from the left
end of the YAC to the position of the restriction site. The same procedure is
usually repeated using a right hand end probe. The location of the site
determined using the right end probe is placed on a similar scale by simply
calculating $R$ as the difference between the total YAC length and the observed
fragment size. For each of the $r$ restriction sites there are then two
individually generated estimates ($L$ and $R$) of the location of the site
within the YAC. PFGE experimental conditions are
usually chosen such that distance travelled on the gel is approximately
linearly proportional to fragment length. Thus, it is assumed that large and
small fragments have a similar error, unlike analysis of standard
polyacrylamide gel electrophoresis, where log-transformation of fragment size is typically performed to allow for the
dependence of error on fragment size. Assuming that error in measuring fragment
size follows a normal distribution, the variance of fragment size determination
may be estimated as
$$\sigma^{2} = \frac{\sum (L - R)^{2}}{2r} \tag{1}$$
The restriction map is established from the mean locations based on the left and
right marker information.
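The calculation in formula 1 and the averaging of the left and right estimates can be sketched as follows. This is a minimal illustration rather than the authors' program: the function name and the example fragment sizes (a hypothetical 360 kb YAC with three sites) are assumptions made for the example.

```python
def estimate_sites_and_variance(left_sizes, right_sizes, yac_length):
    """Return (site positions, sigma^2) from paired end-probe ladders.

    left_sizes[k] and right_sizes[k] are the fragment sizes observed for the
    same restriction site with the left-end and right-end probes.
    """
    if not left_sizes or len(left_sizes) != len(right_sizes):
        raise ValueError("need one left and one right fragment per site")
    L = list(left_sizes)
    # Put right-end measurements on the left-end scale: R = total length - size.
    R = [yac_length - size for size in right_sizes]
    r = len(L)
    # Formula 1: sigma^2 = sum((L - R)^2) / (2r)
    sigma2 = sum((l - rr) ** 2 for l, rr in zip(L, R)) / (2 * r)
    # The map position of each site is the mean of its two estimates.
    sites = [(l + rr) / 2 for l, rr in zip(L, R)]
    return sites, sigma2


# Hypothetical data (kb): three sites seen from both ends of a 360 kb YAC.
sites, sigma2 = estimate_sites_and_variance([50, 120, 250], [312, 236, 108], 360)
print(sites, sigma2)  # [49.0, 122.0, 251.0] 4.0
```

The divisor is 2r rather than r because each difference L - R carries the error of two independent measurements.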
DNA specific for a gene whose location within the established restriction map is
unknown is probed against the partial digestion, and hybridises only to the
fragments containing that gene. A set of $n$ observed fragments are thus
identified by experimentation: it is not necessary,
and in some cases may be unlikely, that all fragments are observed.
For each interval within the established map, there is a set of expected
fragments which would be associated with a probe which hybridised to that
interval: essentially, these are the possible fragments between all possible
sites which overlap this interval. A full likelihood comparison between each
interval is not desirable for two reasons: first, the number of expected
fragments differs between intervals, so that the likelihood is conditioned on a
different set of data in each case. Second, if one expected fragment
encompassing a particular map region matches the observed fragment well, it is
not biologically meaningful to consider the matches of the other expected
fragments, since only one is expected to match. For these reasons, an
approximate likelihood is calculated by only comparing the observed fragments
with the closest expected fragments in each interval. If experimental
conditions were such that all expected fragments were observed, it would be of
use to take into account the number of observed fragments, which would be
greater for central regions. However, as the number of observed fragments is
dominated by experimental conditions rather than by the expectations, this
information is ignored (leaving aside the question of how this information
might best be utilised). The objective is then to determine which set of
predicted fragments is closest to the set of experimentally observed fragments,
and how well they match each other. The variance of a single observed fragment
is taken to be $\sigma^{2}$. Since the location of a restriction site is
typically derived as the mean of two measurements from left and right end
probes, its variance is $\sigma^{2}/2$. Each predicted fragment within a map is
calculated as the difference of two restriction sites, and has the variance
$\sigma^{2}$. For each interval, the closest predicted fragment to the observed
fragment is taken, and the difference in their sizes ($d_i$) is calculated,
which has a variance of $2\sigma^{2}$. The approximate likelihood of this interval is
$$L = \prod_{i=1}^{n} \exp\left(-\frac{d_i^{2}}{4\sigma^{2}}\right)$$
The $\log_{10}$ likelihood ratio comparing region $a$ to the region with the
highest likelihood ($b$) is:
$$\mathrm{LOGLR} = 0.25\,(\log_{10} e)\,\frac{\sum_{i=1}^{n} d_{ib}^{2} - \sum_{i=1}^{n} d_{ia}^{2}}{\sigma^{2}} \tag{2}$$
The choice of $\log_{10}$ scale facilitates easy understanding of the likelihood differences (e.g. a difference of 2 corresponds to 100:1 odds), and
follows the convention established for human linkage maps (
8
). Where more than one enzyme is used, LOGLR is summed across various enzymes
for each map interval. The output for analysis with one enzyme constitutes a LOGLR value of 0 for the best fitting region, and negative values for the other regions.
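The least squares comparison of formula 2 can be illustrated with a short sketch. It is not the authors' implementation: the function names, the four-point map (both YAC ends plus two internal sites) and the three observed fragment sizes are hypothetical values chosen for the example.

```python
import math


def expected_fragments(positions, k):
    """Sizes of all partial digestion products that span interval k.

    positions is the sorted list of map positions, including both YAC ends;
    interval k lies between positions[k] and positions[k + 1].
    """
    return [positions[j] - positions[i]
            for i in range(k + 1)
            for j in range(k + 1, len(positions))]


def loglr_by_interval(positions, observed, sigma):
    """Log10 likelihood ratio of each interval against the best one (formula 2)."""
    sums_of_squares = []
    for k in range(len(positions) - 1):
        predicted = expected_fragments(positions, k)
        # Only the closest predicted fragment is compared with each observation.
        sums_of_squares.append(
            sum(min((o - p) ** 2 for p in predicted) for o in observed))
    best = min(sums_of_squares)
    scale = 0.25 * math.log10(math.e) / sigma ** 2
    return [scale * (best - ss) for ss in sums_of_squares]


# Hypothetical map (kb) and observed fragments for a marker in the middle interval.
scores = loglr_by_interval([0, 100, 250, 360], [151, 258, 360], sigma=2.5)
print(scores)  # the middle interval scores 0; the others are negative
```

Because every interval is scored on the same n observed fragments, the scores are directly comparable even though the number of predicted fragments differs between intervals.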
The confidence interval (CI) of gene location may be approximated as the map
regions whose $\log_{10}$ likelihood is within 1.3 of that of the best fitting region, corresponding to
20:1 odds against the gene being found outside this interval. This is a
heuristic cut-off established during the analysis of data (see below). Frequently, the
support interval may include a number of non-continuous regions within the map, but for clarity a single interval
encompassing such intervals is reported. The confidence interval is sensitive
to the estimation of measurement error: if the error is over-estimated, the interval will be too wide; if it is under-estimated, the interval will be narrower than the experimental data can
support. For this reason, it is important to consider the goodness-of-fit.
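The 1.3 cut-off and the reporting of a single encompassing interval can be sketched as below. The function name, map positions and LOGLR values are hypothetical, and the sketch assumes that LOGLR values have already been computed for every interval of the map.

```python
def support_interval(positions, loglr, cutoff=1.3):
    """Return (start, end) of the single interval spanning all supported regions.

    positions: sorted map positions including both YAC ends.
    loglr: one LOGLR value per interval (0 for the best, negative otherwise).
    """
    best = max(loglr)
    supported = [k for k, score in enumerate(loglr) if score >= best - cutoff]
    # Supported regions may be non-contiguous; report one interval covering them.
    return positions[supported[0]], positions[supported[-1] + 1]


# Hypothetical map (kb) in which the first and third intervals are both supported.
print(support_interval([0, 100, 250, 300, 360], [-0.9, -4.0, 0.0, -2.5]))  # (0, 300)
```

In this example the second interval is itself poorly supported, but it is included in the reported range because it lies between two supported regions.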
The goodness-of-fit of the observed data to the predicted interval is calculated as
$$\chi^{2}_{n} = \sum_{i=1}^{n} \frac{d_i^{2}}{2\sigma^{2}} \tag{3}$$
which may be interpreted as a chi-square with $n$ degrees of freedom. If the fit of the best region is significantly poor, then the measurement error may be
under-specified, or there may be specific errors in the data.
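A sketch of the goodness-of-fit test in formula 3 is given below. The chi-square tail probability is computed with a standard series for the incomplete gamma function so that the example needs only the standard library; the helper names and the example differences are assumptions for illustration and do not describe the authors' program.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with df degrees of freedom."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    # Series for the lower incomplete gamma function:
    # gamma(a, z) = z^a e^-z * sum_k z^k / (a (a+1) ... (a+k))
    term = 1.0 / a
    total = term
    for k in range(1, 1000):
        term *= z / (a + k)
        total += term
        if term < total * 1e-15:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def goodness_of_fit(differences, sigma):
    """Formula 3: chi-square statistic and P value for the best fitting interval."""
    stat = sum(d ** 2 for d in differences) / (2 * sigma ** 2)
    return stat, chi2_sf(stat, len(differences))


# Hypothetical differences (kb) between observed and closest predicted fragments.
stat, p_value = goodness_of_fit([1.0, -2.0, 0.0], sigma=2.5)
print(stat, p_value)
```

A small P value would suggest that sigma is under-specified or that individual fragments are in error, as described above.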
Where more than one enzyme is used, LOGLR is summed across various enzymes for
each map interval, and the confidence interval for the location of a marker is
again estimated as the map regions whose summed LOGLR is within 1.3 of the best
fitting region. A test of heterogeneity among enzymes may be carried out to
detect whether they differ significantly in their favoured location of the
marker, as
$$\chi^{2}_{e-1} = \sum_{j=1}^{e} \sum_{i=1}^{n} \left( \frac{d_{ijy}^{2}}{2\sigma^{2}} - \frac{d_{ijx}^{2}}{2\sigma^{2}} \right) \tag{4}$$
where $e$ is the number of enzymes, $x$ is the location favoured by the
particular enzyme $j$, $y$ is the location favoured by the combined enzymes and
$n$ is the number of observed fragments for the marker for enzyme $j$. There are
$e - 1$ degrees of freedom. Again, significant heterogeneity may indicate that
the measurement error is underestimated, or that there are specific errors or
omissions in the data. When there is a poor fit, the possible source of the
error can be investigated by seeing which combinations of markers and enzymes
make the largest contribution to the chi-square.
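The combination of enzymes and the heterogeneity statistic of formula 4 can be sketched as follows. The sketch assumes, for simplicity, that the sum of squared differences for every enzyme has already been tabulated on one common set of map regions; that data layout, the function name and the numbers are hypothetical.

```python
def combine_enzymes(ss_by_enzyme, sigma):
    """Return (combined best region, heterogeneity chi-square, degrees of freedom).

    ss_by_enzyme[j][k] is the sum of squared differences d^2 between observed
    and closest predicted fragments for enzyme j in map region k.
    """
    n_regions = len(ss_by_enzyme[0])
    # Summing LOGLR across enzymes favours the region with the smallest
    # total sum of squares, since LOGLR is linear in the sum of squares.
    totals = [sum(ss[k] for ss in ss_by_enzyme) for k in range(n_regions)]
    y = min(range(n_regions), key=totals.__getitem__)
    chi2 = 0.0
    for ss in ss_by_enzyme:
        x = min(range(n_regions), key=ss.__getitem__)  # region favoured by this enzyme
        chi2 += (ss[y] - ss[x]) / (2 * sigma ** 2)     # formula 4, one enzyme's term
    return y, chi2, len(ss_by_enzyme) - 1


# Two hypothetical enzymes scored over three regions, sigma = 1 kb.
print(combine_enzymes([[40, 5, 30], [6, 10, 50]], sigma=1.0))  # (1, 2.0, 1)
```

The per-enzyme terms of the sum show which enzyme contributes most to any heterogeneity.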
If the number of observed fragments is large, then bootstrap analysis (11) can
provide more robust estimates of the confidence interval, as the
difference between observed and predicted fragments may be considered to be
approximately independent (bootstrapping establishes the confidence interval by
analysing many re-sampled subsets of the data). However, when there are only a few fragments
which are critically informative, then the bootstrap results in excessively
large confidence intervals. For this reason, we chose to calculate the CI using
the likelihood approach, when interpreted in conjunction with the information
on the goodness-of-fit of the data.
As the number of predicted fragments is greater in the centre of a restriction
map than at the two ends, the likelihood of fragments matching by chance is
greater in the centre. Assuming that expected fragments have a uniform size
distribution (in fact there is an excess of intermediate sized fragments
expected for a region in the centre of the map), the probability that one or
more expected fragments in a region is by chance closer to the observed
fragment than the expected value of the error of the observed fragment ($\sigma$) is
$$1 - \left(1 - \frac{2\sigma}{L}\right)^{p} \tag{5}$$
where $L$ is the total YAC length and $p$ is the number of predicted fragments.
When there are twenty sites, there are 19 expected fragments for an end
interval, and 100 for the central interval. If the standard error is 1% of YAC
length, according to formula 5 the central region is, by chance, 2.7 times more
likely to match an observed fragment than the end interval.
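The figure of 2.7 can be checked directly from formula 5. The short sketch below uses the fragment counts given in the text (19 and 100) and a standard error of 1% of YAC length; the function name is an assumption for the example.

```python
def chance_match_probability(sigma_fraction, n_predicted):
    """Formula 5 with sigma expressed as a fraction of total YAC length."""
    return 1 - (1 - 2 * sigma_fraction) ** n_predicted


end_interval = chance_match_probability(0.01, 19)    # about 0.32
central = chance_match_probability(0.01, 100)        # about 0.87
print(round(central / end_interval, 1))  # 2.7
```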
While the confidence interval around each marker location is probably sufficient
information for most purposes, it is possible to infer the significance of the
relative ordering of two genes mapped to the same YAC which lie within
different intervals. The probability that they both lie between the same two
restriction sites may be calculated by finding the region with the minimum
summed log-likelihood: the magnitude of this log-likelihood represents the evidence in favour of the gene order
suggested by their most likely positions.
The statistical approach should allow certain data which would be too difficult
to analyse manually to be used in producing reliable maps. However, there is a
limit to the number of restriction sites that may cut in a particular
experiment, as the number of predicted fragments in the centre of the map is
proportional to the square of the number of sites, so that even when
measurement error is minimised, the chance matching of predicted fragments will
increasingly obscure true matches. Simulations were performed to determine the
density of restriction maps that are most likely to be informative in defining
marker location. The number of restriction sites ranged between 2 and 30
internal sites, and the experimental error between 0.1 and 3% of total YAC
length. A map of 1000 kb was simulated 500 times for each condition, and a
separate set of simulations was carried out for each internal interval within
the map. Restriction sites were randomly selected, and 10 observed fragments
were randomly drawn from each interval (with replacement, so that there were in
some cases fewer observed fragments, depending on the number of predicted
fragments). Error was then drawn from a normal distribution, and added to the
observed fragments, and also to the restriction sites.
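One replicate of such a simulation might be sketched as below. This is a simplified illustration under stated assumptions, not the simulation code used for the results: sites are drawn uniformly along the YAC, the error added to each mapped site has variance sigma squared over two (the variance given above for the mean of two end-probe measurements), the YAC ends are treated as known, and a replicate counts as a success when the best fitting interval is the true one. All names are hypothetical.

```python
import random


def expected_fragments(positions, k):
    """Sizes of all partial digestion products spanning interval k."""
    return [positions[j] - positions[i]
            for i in range(k + 1)
            for j in range(k + 1, len(positions))]


def simulate_once(n_internal, sigma, true_interval, rng, yac_length=1000.0, n_obs=10):
    """Return True if the least squares criterion recovers the true interval."""
    true_map = ([0.0]
                + sorted(rng.uniform(0, yac_length) for _ in range(n_internal))
                + [yac_length])
    # Draw fragments with replacement; repeated draws collapse to one fragment.
    candidates = expected_fragments(true_map, true_interval)
    drawn = {rng.choice(candidates) for _ in range(n_obs)}
    observed = [size + rng.gauss(0, sigma) for size in drawn]
    # The map available to the analysis also carries measurement error.
    noisy_map = ([0.0]
                 + sorted(s + rng.gauss(0, sigma / 2 ** 0.5) for s in true_map[1:-1])
                 + [yac_length])
    sums_of_squares = [
        sum(min((o - p) ** 2 for p in expected_fragments(noisy_map, k))
            for o in observed)
        for k in range(len(noisy_map) - 1)
    ]
    return sums_of_squares.index(min(sums_of_squares)) == true_interval


# 500 replicates: five internal sites, 1% error, marker in the third interval.
rng = random.Random(1)
hits = sum(simulate_once(5, 10.0, 2, rng) for _ in range(500))
print(hits / 500)
```

Repeating the call over a grid of site numbers, error levels and true intervals gives the kind of summary described in the text, although the support interval rather than the single best interval would be needed to reproduce the reported CI widths.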
The experimental data analysed comprised previously published data mapping the
CRP, H4F2 and IFI-16 genes within the human YAC 28A,B5 (5), as well as newly
generated data using frequent cutting enzymes which could only be effectively
analysed with computer assistance. This data was generated for the mouse YAC
KB8, which was isolated from the combined ICRF and St Mary's YAC libraries
(12), and determined by polymerase chain reaction (PCR) to contain all five
members of the mouse Saa gene cluster (12,13). Probes specific for the mouse
Saa1, Saa2, Saa3, Saa4 and Saa5 genes were generated by PCR across
sequence-specific regions of these genes (14,15), and partial restriction
mapping was carried out following the previously described experimental
approach (5,16).
Statistics were calculated with the assistance of a computer program, which is
available free of charge for non-profit use (see world-wide-web site http://biotech.bio.tcd.ie/partial.html).
Simulations were performed to assess the performance of the statistical model,
and its sensitivity to the numbers of restriction sites and the measurement
error. The mean 95% confidence interval (CI) over all intervals and over all
500 simulations for each interval is presented for each point in Figure 1. A
narrower CI indicates a higher resolution in defining the location of the
marker. While an increased number of sites reduces the CI at low error, at typical
experimental error (1% of YAC length) the optimum number of internal sites is
around four or five. If that experimental error is halved (to 0.5%), four sites
are still approximately as efficient as 10. Thus, optimum results are obtained
with rare cutters in the presence of realistic experimental error, where a
single enzyme will define the marker's location within one fifth of YAC length.
In spite of this, useful information is still provided by more frequently
cutting enzymes. Even 20 internal sites will, on average, limit the CI to three-quarters of the YAC length (assuming 1% error). This information, in
combination with results from other enzymes, may be of some value in defining
gene location.
The value of the 95% CI as a measure of location support was also assessed by
these simulations. The number of times a marker was located outside its CI
increased as the number of sites and the measurement error increased: these
erroneously placed markers were mainly those whose correct location was at the
ends of the YAC. However, the number of incorrect assignments was still
relatively few: when the marker was simulated at the end of the YAC with 1%
measurement error, the program correctly recovered the marker within the CI 88%
of the time when there were 10 internal restriction sites, and 78% when there
were 20 sites. However, only 6% of markers fell outside the CI when they were
simulated from the intervals next to those at the end of the map, in a map with
20 internal sites. On average, when the true location was simulated as being
derived from each of the 21 intervals in turn, only 3% of simulated markers
were placed outside the CI. Thus, the 95% criterion for the confidence interval
is generally well justified by these simulations.
A straightforward example of the experimental application of this analytical
technique is illustrated with previously published data (Walsh et al., 1996).
The positions of the restriction sites for the enzymes XhoI and SfiI (Table 1;
Fig. 2) were inferred by probing with left and right end probes to the human
YAC 28A,B5 which was known to contain the genes for CRP, H3F2 and IFI-16. The
standard error of fragment size was estimated according to formula 1 as 2.5,
which is ~1% of the YAC length (360 kb). The fragments for the CRP gene (Table
1) are illustrated in Figure 2 in alignment with the nearest predicted
fragments for the best fitting
interval. The best fitting interval and the associated 95% support interval for
This work was supported by grants from the Wellcome Trust 039618 and 034345.
A.B. was supported by a project grant from the Health Research Board of
Ireland; M.T.W. was supported by a Department of Education (Northern Ireland) research studentship and a FORBAIRT (Ireland) studentship. We thank an anonymous referee for
suggestions which improved the method.
*To whom correspondence should be addressed. Tel: +353 1 608 2390; Fax: +353 1
679 8558; Email: dshields@biotech.bio.tcd.ie
REFERENCES

