Population Stratification
Qunyuan Zhang
Division of Statistical Genomics
Course Title: M21-621
Computational Statistical Genetics
https://dsgweb.wustl.edu/qunyuan/presentations/PopStratGEMS.pptx
1
What is Population Stratification (PS)?
In the narrow sense
PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense
PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.
2
PS & False Positives
False Positives (inflation)
An association can arise from the underlying structure of the population even if there is no disease-locus association.
3
An Example of PS-caused False Positive
Sub-population 1
         case  control  total  risk (case/control)
A          72        8     80  9/1
a          18        2     20  9/1
total      90       10    100  9/1

Sub-population 2
         case  control  total  risk (case/control)
A           3       27     30  1/9
a           7       63     70  1/9
total      10       90    100  1/9

Mixed population
         case  control  total  risk (case/control)
A          75       35    110  2.14
a          25       65     90  0.38
total     100      100    200  1.00
• No disease-locus association within either sub-population.
• Risk differs between sub-populations.
• Allele frequency differs between sub-populations.
• False disease-locus association in the mixed population (any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele).
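The effect can be checked numerically. The sketch below recomputes the allelic odds ratio within each sub-population and in the mixed population from the counts in this example; the function name `odds_ratio` is only illustrative.

```python
def odds_ratio(a_case, a_ctrl, b_case, b_ctrl):
    """Odds ratio of being a case for allele A versus allele a."""
    return (a_case * b_ctrl) / (a_ctrl * b_case)

# Counts from the example: (A case, A control, a case, a control)
sub1 = (72, 8, 18, 2)
sub2 = (3, 27, 7, 63)
mixed = tuple(x + y for x, y in zip(sub1, sub2))  # pooling the two sub-populations

or_sub1 = odds_ratio(*sub1)    # no association inside sub-population 1
or_sub2 = odds_ratio(*sub2)    # no association inside sub-population 2
or_mixed = odds_ratio(*mixed)  # spurious association after pooling

print(or_sub1, or_sub2, round(or_mixed, 2))
```

Within each stratum the odds ratio is 1, while pooling produces an odds ratio well above 1 even though allele A has no effect in either sub-population.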
4
Diagnosis of Inflation
Expected: uniform distribution [0,1] of p-values under the null
[Figure: histograms of p-values and Q-Q plots of -log10(p), shown for the no-inflation case and the inflation case]
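A minimal sketch of how the Q-Q points can be computed, assuming the expected quantile of the i-th smallest of n null p-values is taken as (i - 0.5)/n (one common plotting convention). The p-value lists are synthetic and only illustrate the two cases.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot."""
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    # Under the null, the i-th smallest of n uniform p-values is about (i - 0.5) / n.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

# Synthetic p-values: an evenly spread (uniform-like) set and an inflated set
uniform_like = [(i - 0.5) / 1000 for i in range(1, 1001)]
inflated = [p ** 2 for p in uniform_like]   # p-values pushed toward 0

null_pts = qq_points(uniform_like)
infl_pts = qq_points(inflated)

max_null_gap = max(abs(obs - exp) for exp, obs in null_pts)  # points sit on the diagonal
infl_top = infl_pts[-1]                                      # most significant point
print(max_null_gap, infl_top)
```

With no inflation the points follow the diagonal; with inflation the observed -log10(p) values rise above the expected ones.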
5
Inflation Rate (IR)
Devlin et al. 2004
For Binary Trait
For Continuous Trait
Amin, Duijn, Aulchenko, 2007
6
Inflation of False Positives
• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (but does not necessarily) indicate PS
7
Genomic Control (by IR)
For Binary Trait
$Y_i^2 = \chi_i^2$
For Continuous Trait
$Y_i^2 = (t_i)^2$
Or based on p-value
$Y_i^2 = \chi^2_{(1-p_i,\ df=1)}$
$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$
$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$
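A minimal sketch of the correction, assuming the inflation rate is estimated in the usual genomic-control way as the median of the test statistics divided by the median of a chi-square distribution with 1 df (about 0.455). The input statistics are hypothetical, and flooring the estimate at 1 is a common convention rather than something stated on the slide.

```python
import math
import statistics

CHI2_DF1_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def chi2_df1_pvalue(x):
    """Upper-tail probability Prob(chi2_df1 > x)."""
    return math.erfc(math.sqrt(x / 2.0))

def genomic_control(chi2_stats):
    """Return (lambda_hat, corrected statistics, corrected p-values)."""
    lam = statistics.median(chi2_stats) / CHI2_DF1_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: never inflate the statistics
    corrected = [y / lam for y in chi2_stats]
    return lam, corrected, [chi2_df1_pvalue(y) for y in corrected]

stats = [0.1, 0.5, 1.0, 2.0, 4.0]  # hypothetical test statistics Y_i^2
lam_hat, y_tilde, p_tilde = genomic_control(stats)
print(round(lam_hat, 3), [round(p, 3) for p in p_tilde])
```

After correction the median statistic sits at the null median, so its p-value is 0.5.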
8
Mantel-Haenszel Test for Stratification
Adjusted RR
An Example
Standard error
Chi-square test
(1)
(2)
(3)
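As an illustration, the sketch below applies the standard Mantel-Haenszel estimators to the two sub-populations of the earlier false-positive example. The formulas are the textbook ones and may differ in notation from equations (1) to (3) on the slide; the chi-square is computed without a continuity correction, and the rows are treated as exposure groups as in that example.

```python
def mantel_haenszel(strata):
    """strata: list of (a, b, c, d) = (A case, A control, a case, a control)."""
    rr_num = rr_den = or_num = or_den = 0.0
    dev = var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        rr_num += a * (c + d) / n
        rr_den += c * (a + b) / n
        or_num += a * d / n
        or_den += b * c / n
        dev += a - (a + b) * (a + c) / n  # observed minus expected count
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    return rr_num / rr_den, or_num / or_den, dev * dev / var

strata = [(72, 8, 18, 2), (3, 27, 7, 63)]  # sub-populations 1 and 2
rr_mh, or_mh, chi2_mh = mantel_haenszel(strata)
crude_rr = (75 / 110) / (25 / 90)          # mixed population, ignoring strata

print(rr_mh, or_mh, chi2_mh, round(crude_rr, 2))
```

The stratified estimates equal 1 and the chi-square is 0, whereas the crude ratio from the mixed table is well above 1.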
9
Linear Model
Marker data
Population structure variable
Genetic background variable
Membership variable
Subgroup/sub-population variable
Ancestry/admixture proportion variable
Usually Q is unknown and needs to be estimated
10
Estimating Q by Eigen-analysis
X = U S V^T

X (rows: SNPs, columns: individuals)
        idv1   idv2   idv3
snp1       0      2      1
snp2       1      2      2
snp3       0      0      1
snp4       0      1      0
snp5       2      0      0

U
snp1   -0.55   0.33   0.34
snp2   -0.78  -0.10  -0.27
snp3   -0.16   0.04  -0.71
snp4   -0.20   0.14   0.52
snp5   -0.15  -0.93   0.20

S (singular values)
        3.81   0.00   0.00
        0.00   2.05   0.00
        0.00   0.00   1.13

S^2 (eigenvalues)
       14.51   0.00   0.00
        0.00   4.21   0.00
        0.00   0.00   1.28

V^T
        idv1   idv2   idv3
Q1     -0.28  -0.75  -0.60
Q2     -0.95   0.29   0.08
Q3      0.11   0.59  -0.80

Q1, Q2, Q3: Eigenvectors of COV(X)
References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)
Or SAS Proc PRINCOMP; R svd() and eigen()
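A dependency-free sketch of the same eigen-analysis for the 5 x 3 genotype matrix on this slide. It uses power iteration with deflation on X^T X instead of a library SVD; that is a choice made here for illustration, not the method used by EIGENSTRAT. The sign of an eigenvector is arbitrary, so the result may differ in sign from the slide.

```python
import math

# Genotype matrix from the slide: 5 SNPs (rows) x 3 individuals (columns)
X = [[0, 2, 1], [1, 2, 2], [0, 0, 1], [0, 1, 0], [2, 0, 0]]

def gram(X):
    """X^T X: individual-by-individual cross-product matrix."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_eigenpair(M, iters=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix (power iteration)."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(M, v)))
    return lam, v

def deflate(M, lam, v):
    """Subtract lam * v v^T so the next eigenpair becomes the largest one."""
    n = len(M)
    return [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]

M = gram(X)
lam1, q1 = top_eigenpair(M)                     # first eigenvalue and Q1
lam2, q2 = top_eigenpair(deflate(M, lam1, q1))  # second eigenvalue and Q2
s1, s2 = math.sqrt(lam1), math.sqrt(lam2)       # singular values of X

print(round(lam1, 2), round(lam2, 2), [round(x, 2) for x in q1])
```

The eigenvalues of X^T X are the squared singular values, and each eigenvector holds one coordinate per individual, which is what is used as a Q variable.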
11
Eigen-analysis of HapMap Populations
[Figure: HapMap individuals plotted on the first two eigenvectors, Q1 and Q2]
12
Estimating Q by MLE
(for admixed population)
G: Observed genotypes of admixed [and parental] populations
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated
Goal: obtain P that maximizes Pr(G|P,Q)
1. Assign initial values for Q (randomly or estimated from parental population genotype data) & P (randomly)
2. Compute P(i) by solving $\frac{\partial L(G \mid Q, P)}{\partial P} = 0$, where L is the likelihood Pr(G|P,Q)
3. Compute Q(i) by solving $\frac{\partial L(G \mid Q, P)}{\partial Q} = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301
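A deliberately simplified sketch of the membership step: one individual, two parental populations, parental allele frequencies held fixed, and a grid search in place of solving the score equation. It is not the algorithm of Tang et al.; the frequencies and genotypes are hypothetical.

```python
import math

def log_likelihood(p, genotypes, freq1, freq2):
    """log Pr(G | p) up to a constant; genotypes are copies (0, 1, 2) of the reference allele."""
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq1, freq2):
        h = p * f1 + (1.0 - p) * f2  # allele frequency expected for this individual
        total += g * math.log(h) + (2 - g) * math.log(1.0 - h)
    return total

def mle_membership(genotypes, freq1, freq2, steps=1000):
    """Grid search for the proportion p of ancestry from parental population 1."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, genotypes, freq1, freq2))

# Hypothetical reference-allele frequencies at four loci in the two parental populations
freq1 = [0.9, 0.8, 0.1, 0.2]
freq2 = [0.1, 0.2, 0.9, 0.8]

p_like_pop1 = mle_membership([2, 2, 0, 0], freq1, freq2)
p_like_pop2 = mle_membership([0, 0, 2, 2], freq1, freq2)
p_mixed = mle_membership([1, 1, 1, 1], freq1, freq2)
print(p_like_pop1, p_like_pop2, p_mixed)
```

In the full procedure this step alternates with re-estimating the allele frequencies until both stop changing.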
13
Estimating Q by MCMC
(for admixed population)
Observed G : genotypes of admixed [and parental populations]
Unknown Z : admixed individuals’ membership from ancestral populations
Problem: How to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
4. Average over a large number of MCMC samples to obtain an estimate of Z
Falush et al. Genetics, 2003(164):1567–1587
Software : STRUCTURE
14
Infer Population Number (K)
15
Linear Model
(an example including m Q-variables)
$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$
$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$
SAS Proc REG, Proc GENMOD; R lm(), glm()
The generalized versions (Proc GENMOD, glm()) can also fit binary/categorical y
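A toy sketch of why the Q covariate matters, using ordinary least squares through the normal equations. The data are constructed by hand so that the trait depends only on sub-population membership Q while the genotype x is more common in one sub-population; all values are hypothetical and the helper `ols` is illustrative, not a replacement for lm() or Proc REG.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns, via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        A[i] = [v / A[i][i] for v in A[i]]
        for r in range(k):
            if r != i:
                factor = A[r][i]
                A[r] = [vr - factor * vi for vr, vi in zip(A[r], A[i])]
    return [A[i][k] for i in range(k)]

Q = [0] * 6 + [1] * 6                      # sub-population membership
x = [0, 0, 0, 1, 0, 1, 2, 1, 2, 2, 1, 2]  # genotype, more copies in sub-population 1
y = [1.0 + 2.0 * q for q in Q]             # trait driven by Q only, no SNP effect

naive = ols([x], y)        # y = a + b x + e
adjusted = ols([x, Q], y)  # y = a + b x + b1 Q + e
print([round(c, 3) for c in naive], [round(c, 3) for c in adjusted])
```

Without Q the SNP coefficient b is clearly non-zero; once Q is in the model, b drops to zero and the effect is attributed to Q.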
16
Unified Mixed Model
(more general)
Inferred population membership
SNP(s)
Covariate(s)
ID matrix
Modeling the resemblance among individuals
V=ZGZ'+R
17
Multi-Variate Normal Distribution (MVN)
& Likelihood of Mixed Model
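Based on MVN, the likelihood of trait (y) in a matrix form is:
$L = (2\pi)^{-n/2}\,|V|^{-1/2}\exp\left[-\tfrac{1}{2}(y-\mu)' V^{-1} (y-\mu)\right]$
n: no. of individuals (in a pedigree)
V: n×n variance-covariance matrix
y: phenotype vector
μ: mean phenotype vector
Φ: Kinship (IBD) matrix (n×n)
V=ZGZ'+R
$V = 2\Phi\sigma_a^2 + I\sigma_e^2$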
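A small sketch that builds V from a kinship matrix and evaluates the multivariate normal log-likelihood through a Cholesky factorization. The trio kinship matrix is the standard one for unrelated, non-inbred parents and their child; the variance components and phenotypes are hypothetical.

```python
import math

def cholesky(V):
    """Lower-triangular L with L L' = V (V symmetric positive definite)."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = V[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def mvn_loglik(y, mu, V):
    """log of (2 pi)^(-n/2) |V|^(-1/2) exp(-(y - mu)' V^-1 (y - mu) / 2)."""
    n = len(y)
    L = cholesky(V)
    z = []  # solve L z = (y - mu) by forward substitution
    for i in range(n):
        z.append((y[i] - mu[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

# Kinship matrix for father, mother, child
phi = [[0.50, 0.00, 0.25],
       [0.00, 0.50, 0.25],
       [0.25, 0.25, 0.50]]
sigma_a2, sigma_e2 = 0.6, 0.4  # hypothetical additive and residual variances
n = len(phi)
V = [[2.0 * phi[i][j] * sigma_a2 + (sigma_e2 if i == j else 0.0) for j in range(n)]
     for i in range(n)]

loglik = mvn_loglik([0.3, -0.1, 0.2], [0.0, 0.0, 0.0], V)
print(V, round(loglik, 4))
```

In practice the variance components are the unknowns: they are chosen to maximize this log-likelihood (or its REML version).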
18
Kinship
Inbreeding Coefficient
The inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD)
Two alleles are IBD if they are copies of the same ancestral allele.
Kinship/Coancestry
The kinship (coancestry) coefficient between two individuals is the probability that an allele drawn at random from one is IBD to an allele drawn at random from the other at the same locus. The inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then
inbreeding coefficient of Z = coancestry between X and Y
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc.
(need pedigree and/or marker data)
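The pedigree-based calculation can be sketched with the usual recursion on parents. The pedigree below is hypothetical (two founders, two full sibs, and a child of the sib mating), and the code assumes individuals are numbered so that parents always have smaller ids than their children.

```python
from functools import lru_cache

# Hypothetical pedigree: id -> (father, mother); founders have (None, None).
PEDIGREE = {
    1: (None, None), 2: (None, None),  # founders
    3: (1, 2), 4: (1, 2),              # full sibs
    5: (3, 4),                         # child of a full-sib mating
}

@lru_cache(maxsize=None)
def kinship(i, j):
    """Probability that randomly drawn alleles from i and j are IBD."""
    if i is None or j is None:
        return 0.0  # unknown parents are treated as unrelated
    if i == j:
        father, mother = PEDIGREE[i]
        return 0.5 * (1.0 + kinship(father, mother))
    if i < j:
        i, j = j, i  # always recurse through the younger individual
    father, mother = PEDIGREE[i]
    return 0.5 * (kinship(father, j) + kinship(mother, j))

def inbreeding(i):
    """Inbreeding coefficient = coancestry between the parents."""
    father, mother = PEDIGREE[i]
    return kinship(father, mother)

print(kinship(1, 3), kinship(3, 4), inbreeding(5), kinship(5, 5))
```

Parent-offspring and full-sib kinship both come out as 1/4, so the child of a full-sib mating has inbreeding coefficient 1/4.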
19
Kinship Matrix
(expected probability of allele sharing among relatives)
20
Resources for Mixed Model with
Kinship Matrix
Software                       Kinship                    Mixed Model         Data
SAS                            Proc INBREED               Proc MIXED          Quantitative trait; pedigree data
SAS                            Proc INBREED               Proc GLIMMIX        Quantitative/qualitative trait; pedigree data
R: kinship, kinship2/coxme     makekinship(), kinship()   lmekin(), coxme()   Quantitative and survival trait; pedigree data
R: emma                        emma.kinship()             emma.REML.t()       Quantitative trait; using marker data to calculate kinship
EMMAX                          Emmax-kin                  Emmax               Quantitative trait; using marker data to calculate kinship
MMAP                           mmap                       mmap                Quantitative trait; using marker data to calculate kinship
21
Admixture Mapping
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621
Computational Statistical Genetics
22
Three Mapping Strategies
Linkage Analysis (linkage): genotype &
phenotype data from family (or families)
Association Scan (LD): genotype & phenotype
data from population(s) or families
Admixture Mapping (LD): genotype data from
admixed and ancestral populations, phenotype
data from admixed populations
(1) Ancestry-phenotype association mapping
(2) Ancestry info for population structure
control
23
Genetic Admixture
Ancestral Population 1 (Africans) + Ancestral Population 2 (Caucasians) → Admixed Population (African Americans)
Admixture Information (Ancestry Analysis) → Admixture Mapping
24
Rationale of Admixture Mapping
If a disease has genetic factors and the disease gene frequency is higher in pop 2 than in pop 1, then after the admixture of pop 1 and pop 2 the diseased individuals in the admixed generations will carry disease genes/alleles with more ancestry from pop 2 than from pop 1.
If a marker is linked with the disease genes, then because of linkage disequilibrium the diseased individuals will also carry marker copies with more ancestry from pop 2 than from pop 1.
Conversely, if we find a marker/locus whose ancestry from pop 2 in the diseased group is significantly different from that in the non-diseased group, we consider this marker/locus to be linked with (or a part of) a disease gene.
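The comparison in the last step can be sketched as a two-sample test on locus-specific ancestry. The Welch t statistic used here is only one simple choice for illustration, not the test implemented in any particular admixture-mapping package, and the ancestry values are hypothetical.

```python
import math
import statistics

def ancestry_difference(case_lsa, control_lsa):
    """Difference in mean locus-specific ancestry (cases minus controls) and its Welch t."""
    diff = statistics.mean(case_lsa) - statistics.mean(control_lsa)
    se = math.sqrt(statistics.variance(case_lsa) / len(case_lsa)
                   + statistics.variance(control_lsa) / len(control_lsa))
    return diff, diff / se

# Hypothetical pop-2 ancestry at one locus: share of the two allele copies (0, 0.5 or 1)
cases = [1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 0.5]
controls = [0.5, 0.0, 0.5, 0.0, 0.5, 0.5, 0.0, 1.0]

diff, t_stat = ancestry_difference(cases, controls)
print(round(diff, 3), round(t_stat, 2))
```

A large positive statistic at a locus means cases carry more pop-2 ancestry there than controls, which is the signal admixture mapping looks for.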
25
Advantages of Admixture Mapping
An admixed population has more genetic variation and polymorphism than relatively pure ancestral populations.
Admixture produces new LD in the admixed population. Compared with the ancestral populations, the shorter genetic history of the admixed population preserves more LD (a long genetic history breaks LD down), so in an admixed population LD can be detected even for relatively loose linkage.
Ancestry information can be used to control population stratification caused by genetic admixture.
According to simulation, admixture mapping demonstrates higher power than regular methods and needs a smaller sample size.
Flexible design: case-control or case-only, qualitative or quantitative traits, no need for pedigree information.
26
Ancestry
Proportion of genetic material descending from each founding population
Population level : population admixture proportion
Individual level: individual admixture proportion
Individual-locus level: locus-specific ancestry
27
Two Ways of Using Ancestral Info.
Individual Ancestry (IA) can be used as a genetic background covariate for population structure control
Phenotype = a + b * Genotype + c * IA + Error
Locus-specific Ancestry (LSA) can be directly used to detect association (admixture mapping)
Phenotype = a + b * LSA + Error
28
Individual Ancestry (IA) Estimation
using MLE (1)
G: Observed genotypes of admixed and ancestral populations
P: Allelic frequencies in ancestral populations
Q: Individual Ancestry to be estimated
Goal: obtain Q that maximizes Pr(G|P,Q)
1. Assign initial values for P (randomly or estimated from ancestral population genotype data) & Q (randomly)
2. Compute P(i) by solving $\frac{\partial L(G \mid Q, P)}{\partial P} = 0$, where L is the likelihood Pr(G|P,Q)
3. Compute Q(i) by solving $\frac{\partial L(G \mid Q, P)}{\partial Q} = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301
29
Individual Ancestry (IA) Estimation
using MLE (2)
30
Individual Ancestry (IA) Estimation
using MLE (3)
31
Locus-specific Ancestry Estimation
using MCMC
Observed G : genotypes of admixed and ancestral populations
Unknown Z : admixed individuals’ locus specific ancestries from ancestral populations
Problem: How to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K
2. Define prior distribution Pr(Z) under K
3. Sample P, Q from Pr(P, Q | G, Z)
4. Sample Z from Pr(Z | G, P, Q)
5. Repeat Steps 3 and 4, then average over a large number of MCMC samples to obtain an estimate of Z
Pritchard et al. Genetics, 2000(155): 945–959
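The alternation between Steps 3 and 4 can be sketched as a toy Gibbs sampler. This is a strong simplification and not the STRUCTURE implementation: haploid data, K = 2, no admixture (each individual belongs wholly to one population, so Q is dropped), uniform Beta(1, 1) priors on allele frequencies, and a uniform prior on Z. The genotype data, seed, and iteration counts are hypothetical. Because cluster labels can swap between runs, the summary reported is how often two individuals share a cluster.

```python
import random

def gibbs_clusters(G, K=2, iters=600, burn_in=200, seed=1):
    """Toy Gibbs sampler; returns pairwise co-assignment frequencies after burn-in."""
    rng = random.Random(seed)
    n, n_loci = len(G), len(G[0])
    Z = [rng.randrange(K) for _ in range(n)]  # random initial memberships
    same = [[0] * n for _ in range(n)]
    for it in range(iters):
        # Sample P | G, Z: Beta(1 + count of allele 1, 1 + count of allele 0)
        P = []
        for k in range(K):
            members = [G[i] for i in range(n) if Z[i] == k]
            P.append([rng.betavariate(1 + sum(g[l] for g in members),
                                      1 + sum(1 - g[l] for g in members))
                      for l in range(n_loci)])
        # Sample Z | G, P: weight of cluster k is the likelihood of individual i under P[k]
        for i in range(n):
            weights = []
            for k in range(K):
                w = 1.0
                for l in range(n_loci):
                    w *= P[k][l] if G[i][l] == 1 else 1.0 - P[k][l]
                weights.append(w + 1e-300)  # guard against all-zero weights
            Z[i] = rng.choices(range(K), weights=weights)[0]
        if it >= burn_in:
            for i in range(n):
                for j in range(n):
                    same[i][j] += Z[i] == Z[j]
    kept = iters - burn_in
    return [[c / kept for c in row] for row in same]

# Hypothetical data: 6 individuals carrying mostly allele 1, 6 carrying mostly allele 0,
# typed at 8 loci, with one discordant locus per individual.
group_a = [[0 if l == i else 1 for l in range(8)] for i in range(6)]
group_b = [[1 if l == i else 0 for l in range(8)] for i in range(6)]
G = group_a + group_b

coassign = gibbs_clusters(G)
print(round(coassign[0][1], 2), round(coassign[0][6], 2))
```

Averaging indicator variables over the retained samples is the "average over MCMC samples" step; with admixture, Z would be sampled per allele copy and Q would be sampled alongside P.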
32
Distribution of Locus-specific Ancestries from Africans
(An Example of African American)
[Figure: ancestry from Africans plotted against marker position]
33
Software
STRUCTURE
Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587.
ADMIXMAP
Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM (2003) Control of confounding of genetic associations in stratified populations. Am J Hum Genet 72:1492–1504.
ANCESTRYMAP
Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, Daly MJ, Reich D (2004) Methods for high-density admixture mapping of disease genes. Am J Hum Genet 74:979–1000.
34