Population Stratification

Qunyuan Zhang
Division of Statistical Genomics
Course Title: M21-621 Computational Statistical Genetics
https://dsgweb.wustl.edu/qunyuan/presentations/PopStratGEMS.pptx

What is Population Stratification (PS)?

In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.

In the broad sense, PS can be regarded as the presence of a difference in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.

PS & False Positives

False positives (inflation): an association could be due to the underlying structure of the population, even when there is no disease-locus association.

An Example of PS-caused False Positive

Sub-population 1
         case   control   total   risk (case/control)
A          72         8      80   9/1
a          18         2      20   9/1
total      90        10     100   9/1

Sub-population 2
         case   control   total   risk (case/control)
A           3        27      30   1/9
a           7        63      70   1/9
total      10        90     100   1/9

Mixed population
         case   control   total   risk (case/control)
A          75        35     110   2.14
a          25        65      90   0.38
total     100       100     200   1.00

• No disease-locus association within either sub-population.
• Risk difference between sub-populations.
• Allele frequency difference between sub-populations.
• False disease-locus association in the mixed population (any allele with higher frequency in the higher-risk sub-population seems to be a risk allele).

Diagnosis of Inflation

Expected: uniform distribution on [0,1] of p-values under the null.
Figure panels: no inflation vs. inflation; histogram; Q-Q plot of -log10(p).

Inflation Rate (IR)

• For binary trait
• For continuous trait
References: Devlin et al. 2004; Amin, Duijn, Aulchenko, 2007

Inflation of False Positives

• Inflation: more false positives than expected under the null
• In GWAS, usually due to PS
• Can be caused by inappropriate statistical methods even with no PS
• May (but does not necessarily) indicate PS

Genomic Control (by IR)

For binary trait: $Y_i^2 = \chi_i^2$
For continuous trait: $Y_i^2 = (t_i)^2$
Or based on p-value: $Y_i^2 = \chi^2_{(1 - p_i,\, df=1)}$

Adjusted statistic and p-value, where $\hat{\lambda}$ is the estimated inflation rate:
$\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$
$\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$

Mantel-Haenszel Test for Stratification

(1) Adjusted RR
(2) Standard error
(3) Chi-square test
An example

Linear Model

• Marker data (x)
• Population structure variable (Q), also called the genetic background variable, membership variable, subgroup/sub-population variable, or ancestry/admixture proportion variable
Usually Q is unknown and needs to be estimated.

Estimating Q by Eigen-analysis

X = U S V^T

X (genotypes)
        snp1  snp2  snp3  snp4  snp5
idv1       0     1     0     0     2
idv2       2     2     0     1     0
idv3       1     2     1     0     0

U (eigenvectors of COV(X); columns Q1, Q2, Q3)
           Q1     Q2     Q3
idv1    -0.28  -0.95   0.11
idv2    -0.75   0.29   0.59
idv3    -0.60   0.08  -0.80

S (singular values)
 3.81   0.00   0.00
 0.00   2.05   0.00
 0.00   0.00   1.13

V^T
-0.55  -0.78  -0.16  -0.20  -0.15
 0.33  -0.10   0.04   0.14  -0.93
 0.34  -0.27  -0.71   0.52   0.20

S^2 (eigenvalues)
14.51   0.00   0.00
 0.00   4.21   0.00
 0.00   0.00   1.28

References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT)
Or SAS Proc PRINCOMP; R svd() and eigen()

Eigen-analysis of HapMap Populations

Figure: Q2 plotted against Q1.

Estimating Q by MLE (for admixed population)

G: Observed genotypes of admixed [and parental] populations
Q: Allelic frequencies in parental populations
P: Individual membership to be estimated
Goal: obtain P that maximizes Pr(G|P,Q)

With L(G | Q, P) denoting the likelihood Pr(G|P,Q):
1. Assign initial values for Q (randomly or estimated from parental population genotype data) and P (randomly)
2. Compute P(i) by solving $\partial L(G \mid Q, P) / \partial P = 0$
3. Compute Q(i) by solving $\partial L(G \mid Q, P) / \partial Q = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al.
Genetic Epidemiology, 2005(28): 289–301

Estimating Q by MCMC (for admixed population)

Observed G: genotypes of admixed [and parental] populations
Unknown Z: admixed individuals' membership from ancestral populations
Problem: how to estimate Z?

Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K (see next slide)
2. Define prior distribution Pr(Z) under K
3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z)
4. Average over a large number of MCMC samples to obtain an estimate of Z
Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE

Infer Population Number (K)

Linear Model (an example including m Q-variables)

$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$
$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$

SAS Proc REG, Proc GENMOD; R lm(), glm()
The generalized versions (Proc GENMOD, glm()) can fit binary/categorical y.

Unified Mixed Model (more general)

Model terms: inferred population membership, SNP(s), covariate(s), ID matrix
Modeling the resemblance among individuals: V = ZGZ' + R

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model

Based on MVN, the likelihood of trait (y) in matrix form is:
$L = (2\pi)^{-n/2} \, |V|^{-1/2} \exp\left\{ -\tfrac{1}{2} (y - \mu)' V^{-1} (y - \mu) \right\}$

n: number of individuals (in a pedigree)
V: n×n variance-covariance matrix
y: phenotype vector
μ: mean phenotype vector
Φ: kinship (IBD) matrix (n×n)

V = ZGZ' + R
$V = 2\Phi\sigma_a^2 + I\sigma_e^2$

Kinship

Inbreeding coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are identical by descent (IBD).

Identical by descent (IBD): two alleles descend from the same ancestral allele.

Kinship/coancestry: the inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then inbreeding coefficient of Z = coancestry between X and Y.

Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc. (need pedigree and/or marker data)

Kinship Matrix (expected probability of allele sharing among relatives)

Resources for Mixed Model with Kinship Matrix

Software | Kinship | Mixed Model | Data
SAS | Proc INBREED | Proc MIXED | Quantitative trait; pedigree data
SAS | Proc INBREED | Proc GLIMMIX | Quantitative/qualitative trait; pedigree data
R: kinship; R: kinship2/coxme | makekinship(), kinship() | lmekin(), coxme() | Quantitative and survival trait; pedigree data
R: emma | emma.kinship() | emma.REML.t() | Quantitative trait; marker data used to calculate kinship
EMMAX | Emmax-kin | Emmax | Quantitative trait; marker data used to calculate kinship
MMAP | mmap | mmap | Quantitative trait; marker data used to calculate kinship

Admixture Mapping

Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics

Three Mapping Strategies

• Linkage Analysis (linkage): genotype & phenotype data from family (or families)
• Association Scan (LD): genotype & phenotype data from population(s) or families
• Admixture Mapping (LD): genotype data from admixed and ancestral populations, phenotype data from admixed populations
  (1) Ancestry-phenotype association mapping
  (2) Ancestry info for population structure control

Genetic Admixture

Ancestral Population 1 (Africans) + Ancestral Population 2 (Caucasians) -> Admixed Population (African Americans)
Admixture Information (Ancestry Analysis) -> Admixture Mapping

Rationale of Admixture Mapping

Suppose a disease has some genetic factors and the disease gene frequency in pop 2 is higher than in pop 1. After the admixture of pop 1 and pop 2, the diseased individuals in admixed generations will carry disease genes/alleles that have more ancestry from pop 2 than from pop 1.

If a marker is linked with the disease genes then, because of linkage disequilibrium, the diseased individuals will also carry marker copies that have more ancestry from pop 2 than from pop 1.

Conversely, if we find a marker/locus whose ancestry from pop 2 in the diseased group is significantly different from that in the non-diseased group, we consider this marker/locus to be linked with (or a part of) the disease gene.
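
The last step of the rationale, comparing pop 2 ancestry at a locus between diseased and non-diseased groups, can be sketched as a two-proportion z-test. This is only a minimal illustration, not the method used by any of the packages named in these slides: the function name `ancestry_z_test` and the counts are hypothetical, allele copies are treated as independent, and locus ancestry is assumed to be known exactly rather than estimated.

```python
import math

def ancestry_z_test(case_pop2, case_total, ctrl_pop2, ctrl_total):
    """Two-proportion z-test on pop 2 ancestry at one locus.

    Inputs are counts of allele copies: how many copies at the locus
    trace to pop 2, out of all copies, in cases and in controls.
    Returns (difference in proportions, z statistic, two-sided p-value).
    """
    p_case = case_pop2 / case_total
    p_ctrl = ctrl_pop2 / ctrl_total
    # Pooled proportion under the null of equal ancestry in both groups.
    pooled = (case_pop2 + ctrl_pop2) / (case_total + ctrl_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / case_total + 1 / ctrl_total))
    z = (p_case - p_ctrl) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_case - p_ctrl, z, p_value

# Hypothetical counts, for illustration only.
diff, z, p = ancestry_z_test(130, 200, 100, 200)
print(f"difference={diff:.3f}, z={z:.2f}, p={p:.4f}")
```

In practice locus-specific ancestry is itself estimated with uncertainty, so a real analysis would carry that uncertainty into the test rather than use hard counts.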
Advantages of Admixture Mapping

• An admixed population has more genetic variation and polymorphism than relatively pure ancestral populations.
• Admixture produces new LD in the admixed population.
• Compared with ancestral populations, the shorter genetic history of an admixed population keeps more LD (a long genetic history destroys LD). In an admixed population, LD can be detected even for relatively loose linkage.
• Ancestry information can be used to control population stratification caused by genetic admixture.
• According to simulation, admixture mapping demonstrates higher power than regular methods and needs a smaller sample size.
• Flexible design: case-control or case-only, qualitative or quantitative traits, no need for pedigree information.

Ancestry

Proportion of genetic material descending from each founding population
• Population level: population admixture proportion
• Individual level: individual admixture proportion
• Individual-locus level: locus-specific ancestry

Two Ways of Using Ancestral Info.

Individual Ancestry (IA) can be used as a genetic background covariate for population structure control:
Phenotype = a + b * Genotype + c * IA + Error

Locus-specific Ancestry (LSA) can be directly used to detect association (admixture mapping):
Phenotype = a + b * LSA

Individual Ancestry (IA) Estimation using MLE (1)

G: Observed genotypes of admixed and ancestral populations
P: Allelic frequencies in ancestral populations
Q: Individual Ancestry to be estimated
Goal: obtain Q that maximizes Pr(G|P,Q)

With L(G | Q, P) denoting the likelihood Pr(G|P,Q):
1. Assign initial values for P (randomly or estimated from ancestral population genotype data) and Q (randomly)
2. Compute P(i) by solving $\partial L(G \mid Q, P) / \partial P = 0$
3. Compute Q(i) by solving $\partial L(G \mid Q, P) / \partial Q = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301

Individual Ancestry (IA) Estimation using MLE (2)

Individual Ancestry (IA) Estimation using MLE (3)

Locus-specific Ancestry Estimation using MCMC

Observed G: genotypes of admixed and ancestral populations
Unknown Z: admixed individuals' locus-specific ancestries from ancestral populations
Problem: how to estimate Z?

Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume ancestral population number K
2. Define prior distribution Pr(Z) under K
3. Sample P, Q from Pr(P, Q | G, Z)
4. Sample Z from Pr(Z | G, P, Q)
5. Average over a large number of MCMC samples to obtain an estimate of Z
Jonathan K. Pritchard et al., Genetics 155: 945–959

Distribution of Locus-specific Ancestries from Africans (An Example of African American)

Figure: ancestry from Africans plotted against marker position.

Software

STRUCTURE
Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587.

ADMIXMAP
Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM (2003) Control of confounding of genetic associations in stratified populations. Am J Hum Genet 72:1492–1504.

ANCESTRYMAP
Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O'Brien SJ, Altshuler D, Daly MJ, Reich D (2004) Methods for high-density admixture mapping of disease genes. Am J Hum Genet 74:979–1000.
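
As an illustration of the covariate model Phenotype = a + b * Genotype + c * IA + Error, the sketch below simulates a stratified sample in which the phenotype depends on ancestry only, then estimates b with and without IA in the model. Everything here is an assumption made for illustration: the helper names `ols` and `simulate`, the allele-frequency range, the effect size and the sample size are hypothetical, and IA is taken as known rather than estimated by the software listed above.

```python
import random

def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns (normal equations)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def simulate(n, seed=1):
    """Simulate IA, genotype and phenotype with no true genotype effect."""
    rng = random.Random(seed)
    ia, geno, pheno = [], [], []
    for _ in range(n):
        q = rng.random()                    # individual ancestry from pop 2
        freq = 0.2 + 0.6 * q                # allele frequency rises with ancestry
        g = (rng.random() < freq) + (rng.random() < freq)  # 0, 1 or 2 copies
        y = 2.0 * q + rng.gauss(0.0, 1.0)   # phenotype depends on ancestry only
        ia.append(q)
        geno.append(float(g))
        pheno.append(y)
    return ia, geno, pheno

ia, geno, pheno = simulate(5000)
b_naive = ols([geno], pheno)[1]         # Phenotype = a + b * Genotype
b_adjusted = ols([geno, ia], pheno)[1]  # Phenotype = a + b * Genotype + c * IA
print(f"b without IA: {b_naive:.3f}")
print(f"b with IA:    {b_adjusted:.3f}")
```

Under these assumptions the unadjusted b should come out clearly positive even though the genotype has no effect, while the IA-adjusted b should sit near zero, which is the population structure control the covariate is meant to provide.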