Population Structure, Association Studies, and QTLs
Stat 115/215

Structure Algorithm
• One of the most widely used programs in population genetics (original paper cited >9,000 times since 2000)
  – Pritchard, Stephens and Donnelly (2000). Inference of Population Structure Using Multilocus Genotype Data. Genetics 155:945-959.
• Very flexible model that can determine:
  – The most likely number of uniform groups (populations, K)
  – The genomic composition of each individual (admixture coefficients)
  – Possible population of origin

A simple model of population structure
• Individuals in our sample represent a mixture of K (unknown) ancestral populations.
• Each population is characterized by (unknown) allele frequencies at each locus.
• Within populations, markers are in Hardy-Weinberg equilibrium (HWE) and linkage equilibrium (LE).

The model
• Let $A_1, A_2, \dots, A_K$ represent the (unknown) allele frequencies in each subpopulation.
• Let $Z_1, Z_2, \dots, Z_m$ represent the (unknown) subpopulation of origin of the sampled individuals (these are indicator variables).
• Assuming HWE and LE within subpopulations, the likelihood of an individual's genotypes at various loci in subpopulation k is given by the product of the relevant allele frequencies.

More details
• The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies:
  – $P_l = p_i^2$ for homozygous loci
  – $P_l = 2 p_i p_j$ for heterozygous loci
• Assuming no linkage among the markers, we have the product form as on the previous page.

Heuristics
• If we knew the population allele frequencies in advance, then it would be easy to assign individuals (using Bayes rule):
\[
\Pr(Z_i = k \mid G_i, A_1, \dots, A_K) = \frac{P(G_i \mid Z_i = k, A)\, P(Z_i = k \mid A)}{\sum_{j} P(G_i \mid Z_i = j, A)\, P(Z_i = j \mid A)}
\]
• If we knew the individual assignments, it would be easy to estimate frequencies.
• In practice, we don't know either of these, but we have the Gibbs sampler!
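The Bayes-rule assignment described in the heuristics above can be sketched in a few lines. This is a minimal illustration, not the Structure program's implementation: it assumes biallelic loci, genotypes coded as the number of copies of one allele (0, 1 or 2), known allele frequencies strictly between 0 and 1, and a uniform prior over populations unless one is supplied. The function names `hwe_genotype_prob` and `posterior_assignment` are hypothetical.

```python
import math

def hwe_genotype_prob(g, p):
    """Genotype probability at one biallelic locus under HWE.

    g is the number of copies of allele A (0, 1 or 2); p is the
    frequency of allele A in the population.
    """
    if g == 2:
        return p * p              # homozygous AA: p^2
    if g == 1:
        return 2 * p * (1 - p)    # heterozygous Aa: 2pq
    return (1 - p) * (1 - p)      # homozygous aa: q^2

def posterior_assignment(genotypes, freqs, prior=None):
    """Posterior Pr(Z = k | G, A) for one individual.

    genotypes: list of allele-A counts, one per locus.
    freqs:     freqs[k][l] = frequency of allele A in population k at locus l.
    prior:     Pr(Z = k); uniform if omitted (an assumption of this sketch).
    Loci are multiplied together, which assumes linkage equilibrium.
    """
    K = len(freqs)
    prior = [1.0 / K] * K if prior is None else prior
    log_post = []
    for k in range(K):
        log_lik = sum(math.log(hwe_genotype_prob(g, p))
                      for g, p in zip(genotypes, freqs[k]))
        log_post.append(log_lik + math.log(prior[k]))
    # Normalize on the log scale to avoid underflow with many loci.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Two populations, three loci; the individual carries mostly allele A.
example_freqs = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]
example_post = posterior_assignment([2, 2, 1], example_freqs)
print(example_post)
```

In a Gibbs sampler, an assignment would be drawn from these posterior probabilities rather than taking the largest one, and the allele frequencies would then be re-estimated from the current assignments.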
MCMC algorithm (for fixed K)
• Start with a random assignment of individuals to populations
  – Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it.
  – Step 2: Individuals are assigned to populations based on gene frequencies in each population.
• These two steps are repeated.
• Estimation of K is performed separately.

Admixed individuals are mosaics of ancestral populations

Two basic models

Inferred from human populations

More details

Alternative approach
• Structure is very computationally intensive
• Often no clear best-supported K value
• Alternative is to use traditional multivariate statistics to find uniform groups
• Principal Components Analysis is the most commonly used algorithm
• EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190)

Principal Component Analysis
• Efficient way to summarize multivariate data like genotypes
• Each axis passes through the maximum variation in the data and explains a component of the variation

Human population assignment with SNPs
• Assayed 500,000 SNP genotypes for 3,192 Europeans
• Used Principal Components Analysis to ordinate samples in space
• High correspondence between sample ordination and geographic origin of samples
• Individuals assigned to populations of origin with high accuracy

Genetic Association Tests
• Review of typical approach: chi-square test
  – 2x3 table (or 2x2 table)

            AA     Aa     aa     Total
  Cases     n11    n12    n13    n1.
  Controls  n21    n22    n23    n2.
  Total     n.1    n.2    n.3    n..

            A      a      Total
  Cases     n11    n12    n1.
  Controls  n21    n22    n2.
  Total     n.1    n.2    n..

  – Alternatively, we can do a logistic regression:
\[
\log \frac{P(Y = 1)}{P(Y = 0)} = a + bX
\]

Genetic Models and Underlying Hypotheses: Genotypic Model

  Genotype          AA            Aa            aa
  Genotypic value   $\mu_{AA}$    $\mu_{Aa}$    $\mu_{aa}$

• Genotypic value is the expected phenotypic value of a particular genotype.
• Hypothesis: all 3 genotypes have different effects (AA vs. Aa vs. aa)

Genetic Models and Underlying Hypotheses: Dominant Model

  Genotype          AA            Aa            aa
  Genotypic value   $\mu_{A-}$    $\mu_{A-}$    $\mu_{aa}$

• Hypothesis: the genetic effects of AA and Aa are the same, assuming A is the minor allele (AA and Aa vs. aa)

Genetic Models and Underlying Hypotheses: Recessive Model

  Genotype          AA            Aa            aa
  Genotypic value   $\mu_{AA}$    $\mu_{a-}$    $\mu_{a-}$

• Hypothesis: the genetic effects of Aa and aa are the same, where A is the minor allele (AA vs. Aa and aa)

Genetic Models and Underlying Hypotheses: Allelic Model

  Genotype          AA            Aa                aa
  Genotypic value   $2\mu_A$      $\mu_A + \mu_a$   $2\mu_a$

• Hypothesis: the genetic effects of allele A and allele a are different (A vs. a)

Pearson's Chi-squared Test
In each model the null hypothesis is independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$.

• Genotypic Model (df = 2)

           cases    controls
  AA       nAA      mAA
  Aa       nAa      mAa
  aa       naa      maa

• Dominant Model (df = 1)

           cases        controls
  AA+Aa    nAA + nAa    mAA + mAa
  aa       naa          maa

• Recessive Model (df = 1)

           cases        controls
  AA       nAA          mAA
  Aa+aa    nAa + naa    mAa + maa

• Allelic Model (df = 1)

           cases         controls
  A        2nAA + nAa    2mAA + mAa
  a        nAa + 2naa    mAa + 2maa

Test Statistic
• Chi-squared test statistic:
\[
\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}
\]
• O is the observed cell count.
• E is the expected cell count under the null hypothesis of independence:
\[
E = \frac{\text{row total} \times \text{column total}}{N}
\]

Other Options
• Fisher's Exact Test: when the sample size is small, the asymptotic approximation of the null distribution is no longer valid. Fisher's exact test calculates the exact significance of the deviation from the null hypothesis.
• For a 2 by 2 table with cell counts a, b (first row) and c, d (second row), the probability of the observed table under the null is
\[
p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}, \qquad n = a + b + c + d,
\]
and the exact p-value sums this probability over all tables at least as extreme as the one observed.

Association Tool
• PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
• Case-control, TDT, quantitative traits.
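As a worked illustration of the four Pearson chi-squared tests above, the sketch below builds the genotypic, dominant, recessive and allelic tables from case and control genotype counts and computes the statistic from observed and expected counts. It is not PLINK's implementation; the function names are hypothetical, the counts in the usage lines are made up for illustration, and the closed-form p-value helper is only valid for the 1 and 2 degrees of freedom that arise here. It also assumes every expected cell count is nonzero.

```python
import math

def model_tables(cases, controls):
    """Build the four contingency tables from genotype counts.

    cases and controls are (n_AA, n_Aa, n_aa) tuples. Rows are genotype
    (or allele) classes, columns are cases and controls.
    """
    nAA, nAa, naa = cases
    mAA, mAa, maa = controls
    return {
        "genotypic": [[nAA, mAA], [nAa, mAa], [naa, maa]],
        "dominant":  [[nAA + nAa, mAA + mAa], [naa, maa]],
        "recessive": [[nAA, mAA], [nAa + naa, mAa + maa]],
        "allelic":   [[2 * nAA + nAa, 2 * mAA + mAa],
                      [nAa + 2 * naa, mAa + 2 * maa]],
    }

def pearson_chi2(table):
    """Return (statistic, df) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_sf(stat, df):
    """Upper-tail chi-squared probability, closed forms for df 1 and 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or 2")

# Illustrative counts (not real data): (AA, Aa, aa) in cases and controls.
tables = model_tables(cases=(30, 50, 20), controls=(20, 50, 30))
for name, table in tables.items():
    stat, df = pearson_chi2(table)
    print(name, round(stat, 3), df, round(chi2_sf(stat, df), 4))
```

Collapsing genotypes into two classes trades generality for power: the 1-df tests are more sensitive when the assumed model is right, while the 2-df genotypic test makes no assumption about the mode of inheritance.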
Mapping Quantitative Traits
• Examples: weight, height, blood pressure, BMI, mRNA expression of a gene, etc.
• Example: F2 intercross mice

Quantitative traits (phenotypes)
133 females from our earlier (NOD × B6) × (NOD × B6) cross.
Trait 4 is the log count of a particular white blood cell type.

Another representation of a trait distribution
Note the equivalent of dominance in our trait distributions.

A second example
Note the approximate additivity in our trait distributions here.

Trait distributions: a classical view
In general we seek a difference in the phenotype distributions of the parental strains before we think seeking genes associated with a trait is worthwhile. But even if there is little difference, there may be many such genes. Our trait 4 is a case like this.

Data and goals
Data
• Phenotypes: $y_i$ = trait value for mouse i
• Genotype: $x_{ij}$ = 1/0 if mouse i is A/H at marker j (backcross); two dummy variables are needed for an intercross
• Genetic map: locations of markers
Goals
• Identify the (or at least one) genomic region, called a quantitative trait locus (QTL), that contributes to variation in the trait
• Form confidence intervals for the QTL location
• Estimate QTL effects

Models: Genotype → Phenotype
• Let y = phenotype, g = whole-genome genotype
• Imagine a small number of QTLs with genotypes $g_1, \dots, g_p$ ($2^p$ or $3^p$ distinct genotypes for backcross (BC) and intercross (IC), respectively).
• We assume $E(y \mid g) = \mu(g_1, \dots, g_p)$ and $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$

Models: Genotype → Phenotype, ctd
• Homoscedasticity (constant variance): $\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant)
• Normality of residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$
• Additivity: $\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j$ = 0/1 for BC)
• Epistasis: any deviation from additivity.
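The additivity assumption above can be made concrete for a backcross with two QTLs. The sketch below is illustrative only: it writes the additive mean as a baseline plus a per-QTL effect (called `deltas` here) times the 0/1 genotype, and checks additivity through the interaction contrast μ(1,1) − μ(1,0) − μ(0,1) + μ(0,0), which is zero exactly when the two effects add. The function names and effect sizes are hypothetical.

```python
def additive_mean(mu, deltas, g):
    """Mean phenotype under additivity: mu + sum_j delta_j * g_j.

    g holds the 0/1 genotype at each QTL (backcross coding).
    """
    return mu + sum(d * gj for d, gj in zip(deltas, g))

def interaction_contrast(mean_fn):
    """mu(1,1) - mu(1,0) - mu(0,1) + mu(0,0) for a two-QTL backcross.

    Zero under additivity; any nonzero value indicates epistasis.
    """
    return (mean_fn((1, 1)) - mean_fn((1, 0))
            - mean_fn((0, 1)) + mean_fn((0, 0)))

# Purely additive model with made-up effects.
def additive(g):
    return additive_mean(10, (2, -1), g)

# Same model plus an interaction term of size 3 (epistasis).
def epistatic(g):
    return additive(g) + 3 * g[0] * g[1]

print(interaction_contrast(additive))   # contrast under additivity
print(interaction_contrast(epistatic))  # contrast with the interaction term
```

With p QTLs in a backcross, the additive model needs only p + 1 mean parameters instead of one per genotype combination, which is why additivity is the usual starting point.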
Additivity, or non-additivity (BC)

Additivity or non-additivity: F2

The simplest method: ANOVA
• Split mice into groups according to genotype at a marker
• Do a t-test/ANOVA
• Repeat for each marker
• Adjust for multiplicity
LOD score = log10 likelihood ratio, comparing the single-QTL model to the "no QTL anywhere" model.

Interval mapping (IM)
• Lander & Botstein (1989)
• Takes account of missing genotype data (uses the HMM)
• Interpolates between markers
• Maximum likelihood under a mixture model

Interval mapping, cont
• Imagine that there is a single QTL, at position z between two (flanking) markers
• Let $q_i$ = genotype of mouse i at the QTL, and assume $y_i \mid q_i \sim \mathrm{Normal}(\mu_{q_i}, \sigma^2)$
• We won't know $q_i$, but we can calculate $p_{ig} = \Pr(q_i = g \mid \text{marker data})$
• Then $y_i$, given the marker data, follows a mixture of normal distributions with known mixing proportions (the $p_{ig}$).
• Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$.
• Measure the evidence for a QTL via the LOD score, which is the log10 likelihood ratio comparing the hypothesis of a single QTL at position z to the hypothesis of no QTL anywhere.

Epistasis, interactions, etc
• How to find interactions?
  – Stepwise regression
  – BEAM (Zhang and Liu 2007)

Naïve Bayes model
[Diagram: response Y with predictor nodes X1, X2, X3, …, Xm]

Augmented Naïve Bayes
[Diagram: response Y and predictors X01, X02, X11, X12, X13, X2.11, X2.12, X2.13, X2.21, X2.22, arranged in Group 0, Group 1, Group 21, and Group 22]

Variable Selection with Interaction
Let $Y \in \mathbb{R}$ be a univariate response variable and $X \in \mathbb{R}^p$ be a vector of p continuous predictor variables.
\[
Y = X_1 \times X_2 + \epsilon, \qquad \epsilon \sim N(0, \sigma^2), \qquad X \sim \mathrm{MVN}(0, I_p)
\]
Suppose p = 1000. How to find $X_1$ and $X_2$?
One-step forward selection: ~500,000 interaction terms
Is there any marginal relationship between Y and $X_1$?
[Figure: plots of y against x1 and x2, annotated with $\hat{\sigma}^2_{(1)} = 2.24$, $\hat{\sigma}^2_{(2)} = 0.97$, $\hat{\sigma}^2_{(3)} = 0.42$]

Acknowledgment
• Terry Speed (some of the slides)
• Karl Broman (U of Wisconsin)
• Steven P. DiFazio (West Virginia U)
