Nucleic Acids Research Advance Access originally published online on August 28, 2007
Nucleic Acids Research 2007 35(17):e113; doi:10.1093/nar/gkm621
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Efficacy assessment of SNP sets for genome-wide disease association studies
Andreas Wollstein1,2,
Alexander Herrmann2,3,
Michael Wittig2,
Michael Nothnagel4,
Andre Franke2,
Peter Nürnberg1,
Stefan Schreiber2,
Michael Krawczak4 and
Jochen Hampe2,3,*
1Cologne Center for Genomics, Cologne, 2Institute of Clinical Molecular Biology, Christian-Albrechts University, 3Ist Department of Medicine and 4Institute of Medical Informatics and Statistics, Christian-Albrechts University, University Hospital Schleswig-Holstein Campus Kiel, Kiel, Germany
*To whom correspondence should be addressed. Tel: +49 431 597 1246; Fax: +49 431 597 1842; Email: J.Hampe@1med.uni-kiel.de
Received May 31, 2007. Revised July 30, 2007. Accepted July 30, 2007.
ABSTRACT
The power of a genome-wide disease association study depends
critically upon the properties of the marker set used, particularly
the number and physical spacing of markers, and the level of
inter-marker association due to linkage disequilibrium. Extending
our previously devised theoretical framework for the entropy-based
selection of genetic markers, we have developed a local measure
of the efficacy of a marker set, relative to including a maximally
polymorphic single nucleotide polymorphism (SNP) at the map
position of interest. Using this quantitative criterion, we
evaluated five currently available SNP sets, namely Affymetrix
100K and 500K, and Illumina 100K, 300K and 550K in the CEU,
YRI and JPT + CHB HapMap populations. At 50% relative efficacy,
the commercial marker sets cover between 19 and 68% of the human
genome, depending upon the population under study. An optimal
technology-independent 500K marker set constructed from HapMap
for Caucasians, in contrast, would achieve 73% coverage at the
same relative efficacy.
INTRODUCTION
Genome-wide association studies with large sets of single nucleotide polymorphisms (SNPs) (1) are a new option for mapping the genetic variants underlying complex human diseases. However, the power and cost-effectiveness of such studies depend critically upon the properties of the SNP sets used. Consequently, the choice between one of the commercially available marker panels and the construction of a new set is of strong practical significance. No objective criteria other than descriptive measures (e.g. marker number) have so far been used to compare the utility of genome-wide marker sets. More importantly, any sensible assessment of a marker panel requires that recent discoveries about the biology of meiotic recombination are appropriately taken into account (2–4). For example, it has been shown (2) that the geodesy of the human genetic map is fairly homogenous above the centi-Morgan level, but that the correlation between physical and genetic distance is weak at a finer scale, due to rapidly evolving recombination hotspots. Consequently, SNP selection strategies that are based upon the assumption of static linkage disequilibrium (LD) blocks, or that merely employ pairwise LD, may result in sub-optimal marker sets.
The utility of a marker set for disease association analysis is determined by a number of factors, including marker number, informativity and spacing, in addition to the local level of LD. In practice, genotyping technologies may pose serious restrictions upon the usability of an individual SNP, irrespective of whether its inclusion might be desirable or not. If such limitations can be ignored, however, then the utility of a marker set should ideally be evaluated by a criterion that:
- allows the assessment of the coverage of a genomic region in a single quantity,
- is computationally practicable,
- is applicable to the limited genotype information typically available for large marker sets and
- draws upon a theoretical framework that allows meaningful interpretation of the numerical results.
Shannon entropy (5) is a well-established mathematical concept for assessing the utility of genetic markers. We have recently devised an entropy-based SNP selection approach (6) that can in principle be adapted to a genome-wide setting. Furthermore, the methodology facilitates estimation of the relative, region-specific efficacy of a given marker set by η, a quantity that approximates the inverse of the relative sample size required to map a causative variant at a given map position, compared to including a maximally polymorphic SNP at the same position (see Methods section). We calculated η across the genome using publicly available genotype data for HapMap (Phase 2, build 35) (7) and for the five commercial marker sets of Affymetrix (8) (100K and 500K) and Illumina (9) (100K, 300K and 550K). The results were compared to an ideal SNP set constructed from HapMap via entropy-based marker selection.
METHODS
Estimation of inverse swept radii
Parameter ε, which denotes the inverse of the swept radius, was used as a local measure of LD strength (10,11) and was estimated from HapMap genotype data on the basis of all markers with a minor allele frequency of at least 10%. To this end, pairwise haplotype frequencies were estimated from the genotype data using an EM algorithm. Then, the pairwise allelic association was quantified as
$$\rho = \frac{\det|P|}{Q\,(1-R)} \tag{1}$$
where P is the haplotype frequency matrix (pij), i,j = 1, 2, Q = p11 + p12 and R = p11 + p21 (10), and where det|P| denotes the determinant of P. Marker-specific ε values were estimated by a log-linear regression analysis of ρ and the physical distance to all other markers Xi in a 500 kb window surrounding the marker Y of interest (12), i.e. by fitting model log(ρ) = –ε·|xi – y| to marker locations xi and y.
Here and in the following, we assumed that the population of interest was characterized by monophyletic inheritance and by a lack of association between unlinked loci, a simplification of the original model of LD decay that was justified by empirical observations made for autosomal markers in Europe and the US (11).
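The estimation step described above can be illustrated with a minimal Python sketch. It assumes that the association measure has the form ρ = det(P)/[Q(1 – R)] with alleles already labelled so that the determinant is positive, writes ε for the inverse swept radius, and fits log(ρ) = –ε·d by least squares through the origin. The function names, the choice of an unweighted fit and the toy numbers are illustrative assumptions and are not taken from the authors' software.

```python
import math

def allelic_association(p11, p12, p21, p22):
    """rho = det(P) / (Q * (1 - R)) with Q = p11 + p12 and R = p11 + p21.

    Assumes the alleles are labelled so that the determinant is positive.
    """
    det = p11 * p22 - p12 * p21
    q_marg = p11 + p12
    r_marg = p11 + p21
    return det / (q_marg * (1.0 - r_marg))

def fit_epsilon(distances_kb, rhos):
    """Least-squares slope through the origin for log(rho) = -epsilon * d."""
    num = sum(d * math.log(r) for d, r in zip(distances_kb, rhos))
    den = sum(d * d for d in distances_kb)
    return -num / den

# Toy data generated from a hypothetical swept radius of 400 kb.
true_eps = 1.0 / 400.0
dists = [50.0, 120.0, 200.0, 350.0, 480.0]
rhos = [math.exp(-true_eps * d) for d in dists]
eps_hat = fit_epsilon(dists, rhos)
print(round(1.0 / eps_hat, 1))  # estimated swept radius in kb
```

With noise-free toy data the fit simply recovers the swept radius used to generate it; real pairwise estimates would scatter around the fitted curve.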
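At inter-marker positions z1 < z < z2, ε(z) was estimated by linear interpolation, i.e.
$$\varepsilon(z) = \varepsilon(z_1) + \frac{z - z_1}{z_2 - z_1}\,\bigl[\varepsilon(z_2) - \varepsilon(z_1)\bigr] \tag{2}$$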
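As a small illustration of the interpolation step, the following Python sketch evaluates the inverse swept radius between two flanking markers. The function name and the example values are hypothetical.

```python
def interpolate_epsilon(z, z1, eps1, z2, eps2):
    """Linear interpolation of epsilon at position z between markers at z1 and z2."""
    if not z1 <= z <= z2 or z1 == z2:
        raise ValueError("z must lie between two distinct marker positions")
    weight = (z - z1) / (z2 - z1)
    return eps1 + weight * (eps2 - eps1)

# Midway between two markers, the result is the mean of the flanking values.
print(interpolate_epsilon(150.0, 100.0, 0.002, 200.0, 0.004))
```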
Entropy-based SNP selection
We have previously devised a method for assessing the utility of marker sets for disease association studies (6), based upon Shannon entropy (5). In brief, for a locus X with k alleles of frequency pi (i = 1...k), entropy H(X) is defined as
$$H(X) = -\sum_{i=1}^{k} p_i \log p_i \tag{3}$$
For the purposes of disease association analysis, a genomic region is assumed to be covered by markers X1, ... , Xn at map positions x1 < ... < xn. Then, the problem of SNP selection reduces to deciding, on the basis of existing genotype or haplotype data, which single marker out of some additional markers Y1, ... , Ym to include in order to maximize the mapping utility of the extended panel. Without loss of generality, it can be assumed that this choice is confined to maximizing the utility of the marker set in a given interval, centred at map position z. A utility score
(Y:X, z) is then constructed that reflects the benefit, with respect to mapping a disease gene at position z, of adding Y to a single marker X,
| (4) |
Here, H(Y|X) = H(X,Y) – H(X) denotes the conditional entropy of Y given X. The quantity in formula (4) can be calculated directly from pairwise haplotype frequencies, known swept radii and known marker locations. The best marker to include into the existing marker panel X1, ... , Xn is then chosen according to
| (5) |
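The entropy terms that enter the utility score can be computed directly from pairwise haplotype frequencies. The Python sketch below shows the Shannon entropy of a locus and the conditional entropy H(Y|X) = H(X,Y) – H(X) for a 2×2 haplotype table. The use of base-2 logarithms and the table layout are assumptions made for illustration; the distance weighting of formula (4) is not reproduced here.

```python
import math

def entropy(freqs):
    """Shannon entropy in bits (base-2 logarithm is an assumption)."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

def conditional_entropy(hap):
    """H(Y|X) = H(X,Y) - H(X) for a 2x2 haplotype frequency table.

    hap[i][j] is the frequency of allele i at X together with allele j at Y.
    """
    joint = [p for row in hap for p in row]
    marginal_x = [sum(row) for row in hap]
    return entropy(joint) - entropy(marginal_x)

independent = [[0.25, 0.25], [0.25, 0.25]]  # Y adds a full bit of information
redundant = [[0.5, 0.0], [0.0, 0.5]]        # Y is completely predicted by X
print(conditional_entropy(independent))  # 1.0
print(conditional_entropy(redundant))    # 0.0
```

A candidate marker that is in complete association with an existing marker therefore contributes no additional entropy, whereas an unassociated, maximally polymorphic marker contributes the maximum.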
Application to genome-wide marker panels
Application of the above-mentioned framework to large-scale genome-wide data sets poses additional computational problems since the comprehensive evaluation of all pairwise haplotype frequencies, as required by formulas (4) and (5), is no longer feasible. Thus, the utility score (Y:X, z) was replaced by
| (6) |
when the distance between Y and z exceeded 3/ε(z) (11). In this way, the number of pairwise haplotype frequency estimations was limited and the computing time scaled linearly (instead of quadratically) with marker number. Formula (6) was also used for selecting the first few markers on a given chromosome, successively breaking the chromosome down into shorter intervals by applying formula (6) to the corresponding interval centers. Marker selection according to formula (4) commenced for an interval when it was shorter than three times the internal median swept radius.
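To give a feeling for the distance cutoff of three swept radii, the following Python sketch evaluates the association predicted by the exponential decay model log(ρ) = –ε·d at that distance. That the cutoff was chosen because the predicted association is then close to zero is an interpretation offered here, not a statement from the text; the function names and the 500 kb example are hypothetical.

```python
import math

def cutoff_distance_kb(swept_radius_kb):
    """Distance 3/epsilon, expressed through the swept radius 1/epsilon."""
    return 3.0 * swept_radius_kb

def predicted_association(distance_kb, swept_radius_kb):
    """rho = exp(-epsilon * d) under the exponential decay model."""
    return math.exp(-distance_kb / swept_radius_kb)

radius = 500.0  # hypothetical local swept radius in kb
d_cut = cutoff_distance_kb(radius)
print(d_cut)                                           # 1500.0
print(round(predicted_association(d_cut, radius), 4))  # 0.0498
```

Under this model, a marker beyond the cutoff is predicted to retain less than 5% of its association with the position of interest, whatever the local swept radius.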
Evaluation of genome-wide marker sets
Following Hampe et al. (6), we define criterion η(z) for the local evaluation of a marker set around map position z as
| (7) |
where qX(z) is the minor allele frequency of that marker, X(z), that maximizes the right-hand side of formula (7) [note that η(z) is similar, but not equivalent, to 1 – min(z) as defined in the original paper (6)]. Since
| (8) |
equals the predicted allelic association (11) between X and a maximally informative biallelic marker Z at map position z, it follows that
| (9) |
On the other hand, the number n of individuals required to detect association ρ between X and Z at significance level α and with power 1 – β is approximately equal to
| (10) |
where z1–α/2 and z1–β are the respective quantiles of the Gaussian distribution (for a detailed derivation of formula (10), see Appendix). For any two marker sets A and B, let ηA(z) and ηB(z) be the η values obtained with respect to the same location z. Then,
| (11) |
which implies that η(z) is a good approximation of the relative efficacy of a marker set, measured by the inverse of the sample size required to map a maximally informative SNP at position z.
Computer implementation
The methodology described above has been implemented as a suite of Java programs interacting with a MySQL relational database for the storage of genotypes and intermediate results. Since the HapMap data set was the most exhaustive one, calculation of swept radii was based upon these markers and genotypes. The software is available as a web service at http://www.ikmb.uni-kiel.de/snpselection/.
SNP data sources and genotyping
Caucasian genotype data for HapMap (Phase II, build 35), Affymetrix 100K and 500K were retrieved from the respective web sites (www.hapmap.org, www.affymetrix.com). The marker identities of the Illumina 100K, 300K and 550K sets were retrieved from the Illumina website (www.illumina.com); the corresponding genotypes were taken from HapMap or from the Illumina website.
RESULTS
Quantity η measures the relative efficacy of a given marker set to map a causal variant at a specified map position, compared to including a maximally polymorphic SNP at the very same position (see Methods section). Therefore, η = 1 corresponds to full local efficacy of a marker panel whereas η = 0 indicates that no information can be extracted locally. For the purpose of comparing different marker sets, η was calculated here at 10 kb intervals along the human genome (NCBI build 34), except for annotated gaps, heterochromatic, telomeric and centromeric regions. Y chromosomal SNPs were also excluded. Variation of the interval size between 5 and 10 kb for chromosomes 3 and 19 did not yield notably different results (data not shown). It may be argued that, in many instances, only markers located in gene-coding regions are of practical interest for genome-wide disease association studies. In order to take this issue into account, coding regions were defined here as all sequences containing one of the RefSeq genes of the Golden Path (http://genome.ucsc.edu), including exons, introns and 10 kb of flanking sequence. Marker sets were evaluated on the basis of publicly available genotype data (Table 1). Our analyses included CEPH samples from Northern and Western Europe (CEU), from Yoruba in Nigeria (YRI) and from Japanese and Han Chinese people (JPT + CHB).
Swept radii 1/ε were estimated for different genomic regions on the basis of the available HapMap genotype data. As is exemplified by chromosomes 12 and 19 in the CEU population (Figures 1A and 2A), the distribution of 1/ε was found to vary considerably along chromosomes and therefore resembled recently published recombination plots in this respect (2). The median 1/ε of 500 kb corresponds closely to previous estimates (11). A graphical representation of all swept radii and η values obtained in the present study is available at http://www.ikmb.uni-kiel.de/snpselection. In the following, our results will be exemplified by a more detailed consideration of chromosomes 12 and 19, which are typical in terms of their size and gene density.
When all 180 613 HapMap SNPs on chromosome 12 were included in the analysis, η values larger than 0.5 were obtained for most of the chromosome (Figure 1C). By contrast, the 5253 chromosome 12 markers of the Affymetrix 100K set left many intervals with η close to 0, indicating low efficacy (Figure 1B). Similar results were obtained for chromosome 19 (Figure 2).
Figures 3 and 4 provide an overview of the distribution of η along the coding regions and the full genomic sequences of the two chromosomes. When all HapMap SNPs were included, the median η values obtained were 0.70 (interquartile range: 0.56–0.82) for chromosome 12 and 0.66 (interquartile range: 0.52–0.78) for chromosome 19. By contrast, the best commercial marker sets yielded lower values. With Illumina 550K, the median η was 0.59 (interquartile range: 0.45–0.73) for chromosome 12 and 0.56 (interquartile range: 0.41–0.70) for chromosome 19. With the Affymetrix 500K set, it was 0.52 (interquartile range: 0.36–0.67) for chromosome 12 and 0.41 (interquartile range: 0.26–0.58) for chromosome 19.
A comparison of the two commercially available 100K sets revealed the impact of both the genotyping technology and the selection strategy upon the mapping efficacy. If only the coding sequence was considered on chromosome 19, the median η for Affymetrix 100K was 0.21 (interquartile range: 0.08–0.41), as compared to 0.44 (interquartile range: 0.27–0.61) for Illumina 100K (Figure 4). The Illumina 100K set, designed primarily for a good coverage of sequences containing annotated transcripts, provides essentially the same efficacy for the coding sequence on this gene-rich chromosome as the Affymetrix 500K set (median η: 0.41, interquartile range: 0.26–0.58). Similar, albeit less pronounced, results were obtained for chromosome 12 (Figure 3).
A genome-wide overview of the efficacy of all SNP sets is given in Table 2 and, on a chromosome-wise basis, in Figure 5.

Figure 5. Chromosome-specific estimates of relative SNP set efficacy in full genomic (Panel A) and coding (Panel B) sequences. Chromosome-wide median values and interquartile ranges obtained for the CEU population are plotted in chromosomal order.
Table 2. Median estimated efficacy and coverage of the human genome (excluding the Y chromosome) in different populations, provided by different marker sets.
Let Cx denote the local coverage of a chromosome or chromosomal region at relative efficacy x, achieved by a particular marker set (i.e. Cx equals the proportion of a given genomic region for which η ≥ x). For the coding regions of chromosome 12, for example, C0.5 = 0.16 for the Affymetrix 100K set and C0.5 = 0.48 for Affymetrix 500K (Figure 3). This means that the two sets cover 16 and 48% of the gene-containing sequence, respectively, at 50% or higher relative efficacy. At 80% relative efficacy, the figures decrease to 2 and 8%, respectively. A genome-wide overview of the coverage of the different marker sets at 50 and 80% efficacy is given in Table 2 and, on a chromosome-wise basis, in Figures 6 and 7.
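The coverage statistic can be sketched in a few lines of Python: given efficacy values evaluated on a regular grid of map positions, Cx is the fraction of grid points whose value reaches the threshold x. The grid values below are invented for illustration, and the quartile convention used by `statistics.quantiles` is an assumption that may differ from the one used for the reported interquartile ranges.

```python
import statistics

def coverage(eta_values, x):
    """C_x: fraction of evaluated grid positions with efficacy >= x."""
    return sum(1 for e in eta_values if e >= x) / len(eta_values)

def median_and_iqr(eta_values):
    """Median and interquartile range (default 'exclusive' quartile method)."""
    q1, q2, q3 = statistics.quantiles(eta_values, n=4)
    return q2, (q1, q3)

# Hypothetical efficacy values at ten consecutive 10 kb grid positions.
eta_grid = [0.05, 0.20, 0.35, 0.50, 0.55, 0.62, 0.71, 0.80, 0.86, 0.93]
print(coverage(eta_grid, 0.5))  # 0.7
print(coverage(eta_grid, 0.8))  # 0.3
print(median_and_iqr(eta_grid))
```

Because positions are evaluated at equal spacing, the fraction of grid points is equivalent to the fraction of sequence length covered, provided excluded regions such as gaps are removed from the grid beforehand.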

Figure 6. SNP set coverage of full genomic (Panel A) and coding (Panel B) sequences at 50% relative efficacy. The chromosome-wide coverage C0.5 is plotted in chromosomal order. HYP 500K: hypothetical, optimal marker set constructed from HapMap so as to include the same number of SNPs per chromosome as the Affymetrix 500K set.
Figure 7. SNP set coverage of full genomic (Panel A) and coding (Panel B) sequences at 80% relative efficacy. The chromosome-wide coverage C0.8 is plotted in chromosomal order. HYP 500K: hypothetical, optimal marker set constructed from HapMap so as to include the same number of SNPs per chromosome as the Affymetrix 500K set.
The HapMap markers provide the gold standard for
the currently achievable coverage of the human genome with informative
SNPs. If a fully flexible genotyping technology were available,
optimal SNP sets could thus be constructed from HapMap using,
for example, entropy-based marker selection. As exemplified
for chromosomes 12 (
Figure 3) and 19 (
Figure 4), such customized
panels would significantly improve the coverage provided by
a given number of markers. With 5253 SNPs on chromosome 12,
which corresponds to the size of the respective Affymetrix 100K
set, HapMap would yield
C0.5 = 0.81, i.e. a more than four times
higher coverage than the commercial product. Replacing the Affymetrix
500K set by a similarly sized HapMap set would increase
C0.5 from 0.48 to 0.81 whereas
C0.8 would increase from 0.07 to 0.21.
More detailed information about the present study can be found on our web server at http://www.ikmb.uni-kiel.de/snpselection. The same site also provides routines for the customized selection of optimal SNP sets from HapMap build 19, using the available Caucasian, Asian and Yoruba genotype data.
DISCUSSION
Justification of an entropy-based SNP selection framework
Currently available technologies do not allow full re-sequencing
of the human genome in samples that are appropriately sized
for mapping complex disease genes. Instead, the success of genome-wide
association studies depends heavily upon the presence of sufficient
LD between the causal variant(s) and at least one marker in
the study panel. Whilst the level of inter-marker LD may indeed
be fully known, LD is inherently unknown in relation
to the causal variant itself and therefore has to be extrapolated.
This implies that the markers of an ideal study panel should
be selected in such a way as to maximize the information extracted
about any possible location of a disease variant in the genome.
Under a model of spatially homogenous LD, with constant recombination and mutation rates and a common evolutionary history shared by all chromosomal regions, disease association markers would ideally be spread evenly along the genome. However, the systematic evaluation of both LD and local recombination rates has revealed an inherent non-uniformity of these characteristics (2,13,14). Thus, recombination rates differ between chromosomal segments and between populations, which implies that even closely linked genomic regions may be of substantially different ancestry in individuals from one and the same population (15). Consequently, the relationship between LD and physical distance is complex, and combinations of unevenly spaced SNPs may prove more informative than equally spaced markers, depending upon the genomic region of interest (16).
Previous studies have suggested the existence of haplotype blocks, i.e. clearly identifiable chromosomal segments that are characterized by a reduced rate of recombination, low haplotype diversity and a high level of internal LD (2–4). In addition, haplotype-tagging SNPs (htSNPs) have been proposed to be capable of identifying haplotypes for substantially larger marker sets from within these blocks (17–19). The practical relevance of this block concept arises from the expectation that htSNPs extract sufficient information from an LD block with respect to co-ancestry while, at the same time, reducing genotyping costs (2–4). A number of computational methods for the construction of htSNP sets have been developed (16,19,20) but for these techniques to be efficient, detailed knowledge of the extended haplotype frequency distribution in the population of interest is required. Moreover, the size and location of haplotype blocks depend critically upon the SNP density and the method of marker selection (13,14,21–24). Therefore, haplotype tagging appears feasible only when large samples and appropriate family structures are available for the necessary (deterministic or probabilistic) haplotype assignments, the reliability of which decreases with the number and complexity of the haplotypes present (25–27).
The idealized picture of static LD blocks, separated by hot spots of recombination (28), has recently been challenged by new insights into the biology of meiotic recombination (2–4). The correlation between physical and genetic distance is weak below the centi-Morgan level so that the inference of marker genotypes from htSNP haplotypes is far from being reliable (24). Moreover, block-like structures may even occur merely because of genetic drift (29). It thus appears as if the tacit assumption underlying the use of the haplotype block concept for disease association mapping, namely that all genetic variation in a block follows the same hierarchical pattern, is often not fulfilled. As a consequence, the usefulness of htSNPs for such studies has generally been questioned (30–32).
SNP selection based upon pairwise LD alone has been suggested to avoid the conceptual and computational problems of extended haplotype (or block) approaches. The use of some SNPs as proxies for other SNPs that are in high LD with the former (2–4), measured by r2, reduces the redundancy of a SNP set. Thresholds for r2 of at least 0.8 are generally regarded as sufficient to provide good marker coverage for association studies (21,33–38). The rationale underlying the pairwise approach is the expectation that high inter-marker LD translates into high LD between some of the markers and potentially causative variants, an assumption that is however unlikely to hold true in general (2–4). Selection of SNPs based upon pairwise LD alone is therefore likely to perform well only with a particularly high and uniform SNP density (6). Irrespective of the approach taken, the inherently unknown LD between markers and unknown causal variants has to be extrapolated in one way or another from both physical distance and the local strength of LD. However, marker selection based upon pairwise LD alone does not take distance or individual marker informativity into account. As a consequence, simple pairwise haplotype tagging potentially leads to inhomogeneous marker spacing with less than maximum efficacy.
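For comparison with the entropy-based approach, the pairwise criterion discussed above can be written down in a few lines. The Python sketch below computes r2 from a 2×2 haplotype frequency table and applies the commonly used threshold of 0.8; the function names and example frequencies are hypothetical. Note that neither physical distance nor the local swept radius enters this calculation, which is the limitation pointed out in the text.

```python
def r_squared(p11, p12, p21, p22):
    """Squared allelic correlation between two biallelic SNPs."""
    p_a = p11 + p12  # allele frequency at the first SNP
    p_b = p11 + p21  # allele frequency at the second SNP
    d = p11 * p22 - p12 * p21
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def is_tagged(p11, p12, p21, p22, threshold=0.8):
    """True if one SNP would be accepted as a proxy for the other."""
    return r_squared(p11, p12, p21, p22) >= threshold

print(is_tagged(0.5, 0.0, 0.0, 0.5))      # True: complete association
print(is_tagged(0.45, 0.05, 0.05, 0.45))  # False: r2 is about 0.64
```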
Here, we have adapted a recently proposed method for selecting maximally informative marker sets for association studies (6) to a genome-wide comparison of marker sets. The original approach combines the information content, physical spacing and pairwise LD of individual markers with information on the local LD structure, extracted from available data in the form of swept radii (10,11). All of these determinants are included in a single, position-specific utility measure that corresponds to the distance-weighted haplotype entropy of the marker set, approximated however by a pairwise score of the same form (see Methods section). The approach is therefore not affected by the computational and conceptual problems of block-based methods and, at the same time, takes physical distance and local LD structure into account when extrapolating LD between markers and causal variants from pairwise inter-marker LD. An extension of the approach has led to the development of a quantitative criterion (η) that approximates the efficacy of a given marker set to map a disease-causing variant at a position of interest. It should be emphasized that the interpretation of η as a measure of efficacy is only valid in relative terms, i.e. by comparison to the inclusion of a maximally polymorphic SNP at the site of the causal variant. In general, since the properties of the underlying disease model are unknown, no marker-based quantity can on its own provide information about the absolute power of a marker set to map genetic variants underlying a given phenotype.
Quality of currently available marker sets
Owing to recent successes (39) and its theoretical appeal (1), significant funds have been allocated to the concept of genome-wide association analysis in the context of various phenotypes. Researchers are, however, facing the practical problem of choosing the right genotyping technology. In many countries, universal control genotype pools are in the process of being established, and these pools will pre-determine the choice of technology for future studies. Of the currently available marker sets, the Affymetrix 500K (C0.5 = 0.68, C0.8 = 0.19) and Illumina 550K (C0.5 = 0.79, C0.8 = 0.29) products provide the best genomic coverage in Caucasians. The Illumina 550K marker set provides a higher coverage than the 500K Affymetrix set, probably because of the higher flexibility of the Illumina genotyping technology. Pronounced differences between full genomic and coding region coverage were only observed for the 100K sets, probably because of the relatively small marker numbers. The good coding region coverage provided by the Illumina 100K set highlights the fact that this panel was primarily designed for gene-based association mapping. It should be emphasized, however, that all of the above conclusions were based upon the assumption that all markers were callable, and that practical factors such as genotyping quality, departure from Hardy–Weinberg equilibrium and DNA requirements could be neglected. Furthermore, interesting differences became apparent in terms of η in different ethnic groups. Whilst their relative efficacy was approximately the same in the Caucasian and African populations, SNP coverage was notably poorer for all products for the East Asian populations.
The analytical method used here to compare the utility of different marker sets provides a means to weigh the costs and benefits of closing gaps in a given marker set. Additional genotyping costs incurred by a flexible (and thus more expensive) genotyping method can be contrasted directly with the relative efficacy gained from using additional, customized SNPs. If genotyping costs were negligible, the complete current HapMap set would provide 90% coverage of the genome with at least 50% relative efficacy, and 47% coverage with at least 80% relative efficacy. These figures represent the gold standard with which all other marker panels have to be compared. Interestingly, when our entropy-based SNP selection approach was used to construct from HapMap an optimum SNP set of the same size as the Affymetrix 500K product, this technology-independent, hypothetical set would nearly double the coverage at 80% relative efficacy.
In summary, we have devised a methodology that helps researchers to make rational choices between different marker sets for genome-wide disease association studies and to assess the trade-off between genotyping costs and gain in power when expanding existing marker sets. Furthermore, use of the η criterion facilitates judging the position-specific completeness of a genome-wide association study and may thus help to improve the practicability of complex disease gene mapping.
APPENDIX
In general, the sample size n required to detect the difference between proportions π1 and π2 by means of a χ2 test can be approximated by
| (A.1) |
where π̄ = (π1 + π2)/2, α and 1 – β are the significance level and power of the applied test, respectively, and z1–α/2 and z1–β denote the corresponding quantiles of the Gaussian distribution (40). If q is the minor allele frequency of marker X, and if the two alleles of marker Z are equally frequent, then the corresponding haplotype frequency matrix equals
$$P = \frac{1}{2}\begin{pmatrix} \pi_1 & 1-\pi_1 \\ \pi_2 & 1-\pi_2 \end{pmatrix} \tag{A.2}$$
with
| (A.3) |
Furthermore, since Q = p11 + p12 = 0.5π1 + 0.5(1 – π1) = 0.5 and R = p11 + p21 = 0.5π1 + 0.5π2, it follows that
| (A.4) |
Solving Equations (A.3) and (A.4) for π1 and π2 yields π1 = 1 – q(1 – ρ) and π2 = 1 – q(1 + ρ), so that π̄ = 1 – q and π1 – π2 = 2qρ. Replacing π1, π2 and π̄ by these expressions in formula (A.1) yields
| (A.5) |
for sufficiently small ρ. This proves formula (10) of the main text.
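The argument of this Appendix can be checked numerically with a short Python sketch. It assumes that formula (A.1) is the usual normal-approximation sample size for comparing two proportions, n = [z1–α/2·√(2π̄(1 – π̄)) + z1–β·√(π1(1 – π1) + π2(1 – π2))]²/(π1 – π2)², where π1, π2 and π̄ denote the two proportions and their mean. Inserting π1 = 1 – q(1 – ρ) and π2 = 1 – q(1 + ρ) into that expression gives, for small ρ, approximately (z1–α/2 + z1–β)²(1 – q)/(2qρ²). Both the assumed form of (A.1) and this limit are a reconstruction for illustration and are not quoted from the article; the function names and default parameters are likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def sample_size(q, rho, alpha=0.05, power=0.8):
    """Two-proportion sample size (assumed form of A.1) for given q and rho."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pi1 = 1 - q * (1 - rho)
    pi2 = 1 - q * (1 + rho)
    pi_bar = (pi1 + pi2) / 2
    num = (z_a * sqrt(2 * pi_bar * (1 - pi_bar))
           + z_b * sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2))) ** 2
    return num / (pi1 - pi2) ** 2

def small_rho_limit(q, rho, alpha=0.05, power=0.8):
    """Limit of the expression above for small rho (derived under the same assumption)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a + z_b) ** 2 * (1 - q) / (2 * q * rho ** 2)

# The two expressions agree closely when rho is small.
print(sample_size(0.3, 0.05) / small_rho_limit(0.3, 0.05))
# In the limit, halving rho quadruples the required sample size.
print(small_rho_limit(0.3, 0.1) / small_rho_limit(0.3, 0.2))
```

The inverse-square dependence on ρ is what links the decay of association with distance to the relative sample size, and hence to the relative efficacy of a marker set.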
ACKNOWLEDGEMENTS
This study was supported by the German Federal Ministry of Education
and Research as part of the National Genome Research Network
(01GS02105, 0313437A) and the MediGrid project (01AK803G), and
by the German Research Council (Ha 3091/1-2). We are most grateful
to Ulf Leser, Humboldt-University, Berlin, for helpful discussions
and to Uwe Mordhorst and Marcus Will, Christian-Albrechts-University,
Kiel, for computing support. Funding to pay the Open Access
publication charges for this article was provided by the Medical
Faculty of the University of Kiel.
Conflict of interest statement. None declared.
REFERENCES
- Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–1517.[Abstract/Free Full Text]
- Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science (2005) 310:321–324.[Abstract/Free Full Text]
- Jeffreys AJ, Neumann R. Factors influencing recombination frequency and distribution in a human meiotic crossover hotspot. Hum. Mol. Genet. (2005) 14:2277–2287.[Abstract/Free Full Text]
- Jeffreys AJ, Neumann R, Panayi M, Myers S, Donnelly P. Human recombination hot spots hidden in regions of strong marker association. Nat. Genet. (2005) 37:601–606.[CrossRef][Web of Science][Medline]
- Shannon CE. A mathematical theory of communication. Bell Syst. Tech.l J. (1948) 27:379–423.
- Hampe J, Schreiber S, Krawczak M. Entropy-based SNP selection for genetic association studies. Hum. Genet. (2003) 114:36–43.
- Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
- Matsuzaki H, Dong S, Loi H, Di X, Liu G, Hubbell E, Law J, Berntsen T, Chadha M, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods (2004) 1:109–111.
- Murray SS, Oliphant A, Shen R, McBride C, Steeke RJ, Shannon SG, Rubano T, Kermani BG, Fan JB, et al. A highly informative SNP linkage panel for human genetic studies. Nat. Methods (2004) 1:113–117.
- Morton NE, Collins A. Tests and estimates of allelic association in complex inheritance. Proc. Natl Acad. Sci. USA (1998) 95:11389–11393.
- Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok PY, Collins A. The optimal measure of allelic association. Proc. Natl Acad. Sci. USA (2001) 98:5217–5221.
- Collins A, Lonjou C, Morton NE. Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl Acad. Sci. USA (1999) 96:15173–15177.
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, et al. The structure of haplotype blocks in the human genome. Science (2002) 296:2225–2229.
- Sawyer SL, Mukherjee N, Pakstis AJ, Feuk L, Kidd JR, Brookes AJ, Kidd KK. Linkage disequilibrium patterns vary substantially among populations. Eur. J. Hum. Genet. (2005) 13:677–686.
- Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW. Genetic structure of human populations. Science (2002) 298:2381–2385.
- Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G, Ueda H, Cordell HJ, Eaves IA, et al. Haplotype tagging for the identification of common disease genes. Nat. Genet. (2001) 29:233–237.
- Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science (2001) 294:1719–1723.
- Zhang K, Calabrese P, Nordborg M, Sun F. Haplotype block structure and its applications to association studies: power and study designs. Am. J. Hum. Genet. (2002) 71:1386–1394.
- Zhang K, Deng M, Chen T, Waterman MS, Sun F. A dynamic programming algorithm for haplotype block partitioning. Proc. Natl Acad. Sci. USA (2002) 99:7335–7339.
- Halperin E, Kimmel G, Shamir R. Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics (2005) 21(Suppl. 1):i195–i203.
- Ke X, Durrant C, Morris AP, Hunt S, Bentley DR, Deloukas P, Cardon LR. Efficiency and consistency of haplotype tagging of dense SNP maps in multiple samples. Hum. Mol. Genet. (2004) 13:2557–2565.
- Kidd JR, Pakstis AJ, Zhao H, Lu RB, Okonofua FE, Odunsi A, Grigorenko E, Tamir BB, Friedlaender J, et al. Haplotypes and linkage disequilibrium at the phenylalanine hydroxylase locus, PAH, in a global representation of populations. Am. J. Hum. Genet. (2000) 66:1882–1899.
- Sun X, Stephens JC, Zhao H. The impact of sample size and marker selection on the study of haplotype structures. Hum. Genomics (2004) 1:179–193.
- Nothnagel M, Rohde K. The effect of SNP marker selection on patterns of haplotype blocks and haplotype frequency estimates. Am. J. Hum. Genet. (2005) 77:988–998.
- Becker T, Knapp M. Efficiency of haplotype frequency estimation when nuclear family information is included. Hum. Hered. (2002) 54:45–53.
- Douglas JA, Boehnke M, Gillanders E, Trent JM, Gruber SB. Experimentally-derived haplotypes substantially increase the efficiency of linkage disequilibrium studies. Nat. Genet. (2001) 28:361–364.
- Schaid DJ. Relative efficiency of ambiguous vs. directly measured haplotype frequencies. Genet. Epidemiol. (2002) 23:426–443.
- Goldstein DB. Islands of linkage disequilibrium. Nat. Genet. (2001) 29:109–111.
- Wang N, Akey JM, Zhang K, Chakraborty R, Jin L. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am. J. Hum. Genet. (2002) 71:1227–1234.
- Crawford DC, Carlson CS, Rieder MJ, Carrington DP, Yi Q, Smith JD, Eberle MA, Kruglyak L, Nickerson DA. Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am. J. Hum. Genet. (2004) 74:610–622.
- Zhai W, Todd MJ, Nielsen R. Is haplotype block identification useful for association mapping studies? Genet. Epidemiol. (2004) 27:80–83.
- Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common disease-common variant ... or not? Hum. Mol. Genet. (2002) 11:2417–2423.
- Wang WY, Barratt BJ, Clayton DG, Todd JA. Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet. (2005) 6:109–118.
- Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. (2004) 74:106–120.
- Carlson CS, Eberle MA, Rieder MJ, Smith JD, Kruglyak L, Nickerson DA. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet. (2003) 33:518–521.
- Lowe CE, Cooper JD, Chapman JM, Barratt BJ, Twells RC, Green EA, Savage DA, Guja C, Ionescu-Tirgoviste C, et al. Cost-effective analysis of candidate genes using htSNPs: a staged approach. Genes Immun. (2004) 5:301–305.
- Chapman JM, Cooper JD, Todd JA, Clayton DG. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum. Hered. (2003) 56:18–31.
- Wang WY, Todd JA. The usefulness of different density SNP maps for disease association studies of common variants. Hum. Mol. Genet. (2003) 12:3145–3149.
- Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, et al. Complement factor H polymorphism in age-related macular degeneration. Science (2005) 308:385–389.
- Fleiss JL. Statistical Methods for Rates and Proportions (2003) 3rd edn. New York, USA: John Wiley & Sons.
