Mapping genes using spurious disequilibrium
Jules Hernández-Sánchez1*
1Genètica i Millora Animal, IRTA, 25198 Lleida, Spain
Abstract. Spurious disequilibrium (SD) is the statistical association between
unlinked loci. Several phenomena cause SD, e.g. population structure, and it often
leads to false positive results in genome-wide association (GWA) studies, where one
can choose between tests that are powerful but not robust and tests that are robust
but less powerful. If there is SD, the latter ought to be used; otherwise the former
can be used. We propose a new test to detect SD and show how to use it advantageously
in GWA studies.
Keywords: Spurious disequilibrium, QTN mapping, robustness, power.
1 Introduction
GWA analyses consist of testing a massive number of single nucleotide
polymorphisms (SNPs) for phenotypic causality, i.e. whether a SNP could be a
Quantitative Trait Nucleotide (QTN). GWA studies promised to discover a large number
of QTN [1]. However, recent large studies mapped QTN that explained, at most, 11% of
the phenotypic variance in human height, despite height being a highly heritable trait
(h2~0.8) and despite testing >500K SNPs, sampling ~63,000 individuals and spending
~$30 million [2,3,4,5]. The overly stringent thresholds imposed by multiple testing
corrections are partly responsible for this high rate of false negative results.
The other problem in GWA studies is false positive results caused by SD. SD is
the statistical association between alleles at unlinked loci. SD arises, for example, by
chance, due to population structure or small population size [e.g. 6]. In case-control
studies, individuals are assumed to be unrelated. However, many false positive results
arise when cases are more closely related to each other than controls are. Several
methods have been used to deal with hidden population structure: 1) robust tests [7],
2) detection of population structure [8], 3) principal components adjustments [9], and
4) genomic control [10].
Ideally, false positive results should be prevented without unnecessarily reducing
the power of QTN detection. To achieve this, I show how an extension of robust tests,
known as transmission-disequilibrium tests (TDT), can be used to detect SD. If SD
exists, the TDT should be used at all SNPs. Otherwise, a more powerful but not robust
test can be used.
* Corresponding author: [email protected]
2 Methods
In a study in preparation, I showed that when family trios are sampled, two alternative
phenotype-genotype models arise naturally: 1) $Y = \mu + b_A X_A + b_S X_S$, and 2)
$Y = \mu + b_B X_B + b_W X_W$. Here, Y is a vector of phenotypes, μ a population mean,
Xi is a vector of functions of SNP genotypes and bi the corresponding regression
coefficient, where i = A, S, W or B (Table 1). The explanatory variables in XA, XB, XW
and XS score, respectively, progeny genotypes, average parental genotype, progeny
deviations from the average parental genotype and differences between XB and XW.
The latter, XS, contains a novel variable that will be used to detect SD.
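The four explanatory variables can be computed directly from the genotypes of a trio. The following is a minimal Python sketch, assuming that genotypes 11, 12 and 22 are coded as 1, 0 and -1 (the coding implied by the values in Table 1). The function and variable names are illustrative and are not taken from the original study.

```python
# Assumed genotype coding, inferred from Table 1: 11 -> 1, 12 -> 0, 22 -> -1.
CODE = {"11": 1, "12": 0, "22": -1}


def trio_variables(p1, p2, o):
    """Return (xA, xS, xW, xB) for parental genotypes p1, p2 and progeny o."""
    x_a = float(CODE[o])                  # progeny genotype
    x_b = (CODE[p1] + CODE[p2]) / 2.0     # average parental genotype
    x_w = x_a - x_b                       # progeny deviation from parental average
    x_s = x_b - x_w                       # difference between xB and xW
    return x_a, x_s, x_w, x_b


# A 12 progeny from 11 x 12 parents.
print(trio_variables("11", "12", "12"))
```

Under this coding, xA = xB + xW for every trio type, which is why the two models describe the same phenotype with different pairs of regressors.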
Assuming a closed and randomly mating population, the regression coefficients are:
$b_B = aD/(pq)$, $b_A = (1-c)\,b_B$, $b_W = (1-2c)\,b_B$ and $b_S = c\,b_B$, where 'a' is
the additive QTN effect, c and D are the recombination rate and disequilibrium between
SNP and QTN, and p=1-q are SNP allele frequencies. The pairs of vectors XA and XS, or XW
and XB, are orthogonal, and therefore the corresponding regression coefficients are
independent. If there is no additive QTN effect (a=0) or no disequilibrium (D=0) then
bi=0 for all i. However, assuming a≠0 and D≠0, if c=0 then bB=bA=bW and bS=0, or
alternatively, if c=½ then bA=bS=bB/2 and bW=0.
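The two limiting cases can be checked numerically. The sketch below reads the expressions above as bB = aD/(pq), bA = (1-c)bB, bW = (1-2c)bB and bS = c·bB. That reading is an assumption, chosen because it reproduces the stated limits at c=0 and c=½. The function name and the parameter values are hypothetical.

```python
def regression_coefficients(a, D, p, c):
    """Expected bB, bA, bW and bS for additive effect a, disequilibrium D,
    SNP allele frequency p and recombination rate c (0 <= c <= 0.5).

    Assumed forms: bB = aD/(pq), bA = (1-c)bB, bW = (1-2c)bB, bS = c*bB.
    """
    q = 1.0 - p
    b_b = a * D / (p * q)
    return {
        "B": b_b,
        "A": (1.0 - c) * b_b,
        "W": (1.0 - 2.0 * c) * b_b,
        "S": c * b_b,
    }


# SNP completely linked to the QTN: bA = bW = bB and bS = 0.
linked = regression_coefficients(a=1.0, D=0.1, p=0.5, c=0.0)

# SNP unlinked to the QTN: bA = bS = bB/2 and bW = 0.
unlinked = regression_coefficients(a=1.0, D=0.1, p=0.5, c=0.5)

print(linked)
print(unlinked)
```

In the unlinked case bA stays at half of bB whenever a and D are non-zero, whereas bW vanishes.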
In samples of unrelated individuals, only bA is estimable. Family information
allows estimation of the other three regression coefficients. A standard t-test,
$t_i = b_i / se_{b_i}$, where $se_{b_i}$ is the standard error of bi, can be used to test
the null hypothesis (H0) of bi=0. Only bA or bW can be used to locate a QTN because bB
does not depend on c and bS approaches 0 as c decreases. At the QTN, |tA| is on average
greater than |tW|. However, false positive results can arise with tA because bA≠0 even
if c=½ when D≠0 and a≠0. In contrast, bW=0 when c=½ irrespective of D.
The tS test detects SD. The coefficient bS tends to 0 as c decreases and tends to
bB/2 as c increases. This suggests the following GWA approach. Call z the number of
SNPs significant for tS (0 ≤ z ≤ m, where m is the total number of SNPs). With m
unlinked SNPs and no QTN, z follows a Poisson distribution with parameter λ=mα
when there is no SD (α is the error rate per SNP, typically 0.05). Linkage and the
presence of a QTN change the distribution of z, although the effect does not seem to
be important (not shown). If z is significantly large then SD is detected and all
SNPs must be tested with tW; otherwise tA is used to boost power. It is not correct to
make this choice at each marker as it would lead to a non-convex solution space (not
shown). This is a flexible testing procedure because it prioritises power over
robustness in the absence of SD and vice versa when it is present. I call it tAW|S.
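The decision rule can be sketched in Python as follows. The text does not state the cut-off at which z counts as significantly large, so the level `alpha_z` below is an assumption, as are the function names. The input is taken to be one tS p-value per SNP.

```python
import math


def poisson_upper_tail(z, lam):
    """P(Z >= z) for Z ~ Poisson(lam)."""
    term = math.exp(-lam)      # P(Z = 0)
    below = 0.0
    for k in range(z):
        below += term          # accumulate P(Z = k)
        term *= lam / (k + 1)  # step to P(Z = k + 1)
    return max(0.0, 1.0 - below)


def choose_test(ts_pvalues, alpha=0.05, alpha_z=0.05):
    """Pick one test for all SNPs from the tS p-values.

    alpha   : per-SNP error rate of the tS test (from the text).
    alpha_z : assumed level for judging z "significantly large".
    """
    m = len(ts_pvalues)
    z = sum(p < alpha for p in ts_pvalues)
    tail = poisson_upper_tail(z, m * alpha)
    test = "tW" if tail < alpha_z else "tA"
    return test, z, tail


# Hypothetical p-values for m = 100 SNPs.
few_hits = [0.01] * 3 + [0.5] * 97     # z = 3, compatible with no SD
many_hits = [0.01] * 20 + [0.5] * 80   # z = 20, far above lambda = 5

print(choose_test(few_hits))
print(choose_test(many_hits))
```

The choice is made once for the whole SNP panel, not marker by marker, in line with the procedure described above.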
Computer simulations showed the benefits of tAW|S compared to tA and tW. First, a
founder population of unrelated individuals was created in Hardy-Weinberg and
linkage equilibria. Second, the population evolved according to the Fisher-Wright
model for T=30 discrete generations with a constant effective population size
(Ne=100). These Ne and T lead to Fst~0.15 as reported in human populations [11].
There were two chromosomes with m=50 SNPs each, equally spaced over ½ Morgan.
The 25th SNP on chromosome 1 was an additive QTN of variable size. One hundred
trios without common parents were sampled. The ith progeny phenotype (yi) was
yi=gi+ei, where ei is a standardized normal error and gi is a, 0 or −a if the genotype is
11, 12 or 22 at the QTN ($a = \sqrt{h^2 / [2uv(1-h^2)]}$, where u=1-v are the allele
frequencies at the QTN). Population structure consisted of two subpopulations at t=0
that never interbred, and it was ignored in the final analysis. There were 1000 replicates.
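The stated Fst of about 0.15 can be checked with the standard drift recursion for an idealized Wright-Fisher population, F_T = 1 - (1 - 1/(2Ne))^T. This is a sketch that assumes Ne is the diploid size of each isolated subpopulation. The function name is illustrative.

```python
def expected_fst(ne, t):
    """Expected fixation index after t generations of pure drift in an
    idealized diploid population of effective size ne."""
    return 1.0 - (1.0 - 1.0 / (2.0 * ne)) ** t


fst = expected_fst(ne=100, t=30)
print(round(fst, 3))
```

With Ne=100 and T=30 this gives about 0.14, close to the value quoted for the simulations.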
3 Results and Conclusions
Table 1 shows the explanatory variables used to model a QTN given family data.
P1   P2   O     xA    xS    xW    xB
11   11   11     1     1     0     1
11   12   11     1     0     ½     ½
11   12   12     0     1    -½     ½
11   22   12     0     0     0     0
12   12   11     1    -1     1     0
12   12   12     0     0     0     0
12   12   22    -1     1    -1     0
12   22   12     0    -1     ½    -½
12   22   22    -1     0    -½    -½
22   22   22    -1    -1     0    -1
Table 1. Family trio genotypes and explanatory variables used in modeling. Pi is the ith parent,
O the progeny and xi the explanatory variable corresponding to each trio type for i=A, S, W, B.
Fig. 1. Power (QTN) and error rate (unlinked SNP) for tA, tW and tAW|S. Panels a) and c):
no structure (no SD). Panels b) and d): structure (SD). See text for simulation details.
Figure 1a shows that tA is always more powerful than tW when testing a QTN.
Therefore, tA is preferable to tW when power is low, i.e. when the QTN effect and/or
the sample size are small. However, tA is not robust to SD (Fig. 1d), and this problem
is exacerbated when power increases. One way to prevent the rise of false positive
results in GWA studies is to test for SD and decide whether to apply tA or tW based on
the result. Figure 1a shows that tAW|S can be almost as powerful as tA and more powerful
than tW in the absence of SD, i.e. no structure. Figure 1d shows that tAW|S can be almost
as robust as tW and more robust than tA in the presence of SD, i.e. structure.
In conclusion, our tAW|S test consists of applying either tA or tW at all SNPs in the
sample when the number z of significant tS tests is low or high, respectively (z is
assumed to follow a Poisson distribution with parameter 2.5 if there is no SD). The tS
test is designed to detect SD. In the presence of SD, many SNPs are expected to be
significant for tS and robust tests such as tW are more desirable than non-robust tests
such as tA. In the absence of SD, few SNPs are expected to be significant for tS and
powerful tests such as tA are more desirable than less powerful ones such as tW.
Figures 1a,d show that tAW|S is more powerful than tW in the absence of SD and more
robust than tA in its presence. Thus, we advise choosing the type of test in GWA studies
objectively rather than subjectively. In this way, the data tell us what is appropriate,
which obviates the need for suboptimal adjustments to account for SD caused, for
example, by hidden population structure.
References
1. Chakravarti, A.: It's raining SNPs, hallelujah? Nat. Genet. 19, 233-240 (1998)
2. Weedon, M.N. et al.: Genome-wide association analysis identifies 20 loci that influence adult
height. Nat. Genet. 40, 575-583 (2008)
3. Lettre, G. et al.: Identification of ten loci associated with height highlights new biological
pathways in human growth. Nat. Genet. 40, 584-591 (2008)
4. Gudbjartsson, D.F. et al.: Many sequence variants affecting diversity of adult human height.
Nat. Genet. 40, 609-615 (2008)
5. Visscher, P.M.: Sizing up human height variation. Nat. Genet. 40, 489-490 (2008)
6. Farnir, F. et al.: Extensive genome-wide linkage disequilibrium in cattle. Genome Res. 10,
220-227 (2000)
7. Spielman, R.S., McGinnis, R.E., Ewens, W.J.: Transmission test for linkage disequilibrium:
the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52,
506-516 (1993)
8. Pritchard, J.K., Stephens, M., Rosenberg, N.A., Donnelly, P.: Association mapping in
structured populations. Am. J. Hum. Genet. 67, 170-181 (2000)
9. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal
components analysis corrects for stratification in genome-wide association studies. Nat.
Genet. 38, 904-909 (2006)
10. Devlin, B., Roeder, K.: Genomic control for association studies. Biometrics 55, 997-1004
(1999)
11. Holsinger, K. E., Weir, B. S.: Genetics in geographically structured populations: defining,
estimating and interpreting FST. Nat. Rev. Genet. 10, 639-650 (2009)