Analysis of whole genome association studies in pedigreed populations
Goutam Sahana
Genetics and Biotechnology, Faculty of Agricultural Sciences
Aarhus University, 8830 Tjele, Denmark

Concept of mapping
• Identification of genetic variant underlying disease susceptibility or a trait value
• Evidence for the location of the gene = causal variant

Approaches to mapping
1. Candidate gene studies
   • Association
   • Resequencing approaches
2. Genome-wide studies
   • Linkage analysis
   • Genome-wide association studies (linkage disequilibrium, LD mapping)

Linkage mapping
• Look for marker alleles that are correlated with the phenotype within a pedigree
• Different alleles can be connected with the trait in the different pedigrees

Association mapping
• Marker alleles are correlated with a trait on a population level
• Can detect association by looking at unrelated individuals from a population
• Does not necessarily imply that markers are linked to (are close to) genes influencing the trait

Linkage vs. association
[Figure: effect size plotted against frequency of the causal variant, with regions labelled "Unlikely to exist", "Linkage analysis", "Association study" and "Very difficult". Modified from D. Altschuler.]

Linkage vs. association

Potential advantage                                                      Linkage   Association
No prior information regarding gene function required                    +         +
Localization to small genomic region                                     -         +
Not susceptible to effects of stratification                             +         -/+
Sufficient power to detect common alleles of modest effect (MAFs > 5%)   -/+       +
Ability to detect rare allele (MAFs < 1%)                                +         -
Tools for analysis available                                             +         +/-

Hirschhorn & Daly, Nature Rev. Genet. 2005

Allelic association
• Direct association: allele of interest is itself involved in phenotype
• Indirect association: allele itself is not involved, but is in LD with the functional variant
• Spurious association: confounding factors (e.g., population stratification)

Linkage disequilibrium
Non-random association between alleles at different loci.
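Non-random association between alleles at two loci is commonly quantified by the coefficient D = p_AB - p_A p_B, the difference between the observed frequency of haplotype AB and the frequency expected if the alleles were independent. The following is a minimal sketch of that calculation; the function name `ld_coefficient` and the haplotype counts are hypothetical values chosen for illustration, not data from the presentation.

```python
def ld_coefficient(hap_counts):
    """Return D = p_AB - p_A * p_B from counts of the four haplotypes.

    hap_counts maps the haplotype labels "AB", "Ab", "aB", "ab" to counts.
    """
    total = sum(hap_counts[h] for h in ("AB", "Ab", "aB", "ab"))
    p_AB = hap_counts["AB"] / total                        # observed haplotype frequency
    p_A = (hap_counts["AB"] + hap_counts["Ab"]) / total    # frequency of allele A
    p_B = (hap_counts["AB"] + hap_counts["aB"]) / total    # frequency of allele B
    return p_AB - p_A * p_B


# Hypothetical counts: A and B travel together more often than expected by chance.
associated = {"AB": 40, "Ab": 10, "aB": 10, "ab": 40}
# Hypothetical counts: every haplotype is equally frequent, so the loci are independent.
independent = {"AB": 25, "Ab": 25, "aB": 25, "ab": 25}

d_associated = ld_coefficient(associated)
d_independent = ld_coefficient(independent)
print(d_associated, d_independent)
```

With the first set of counts p_AB is 0.40 while p_A p_B is 0.25, so D is positive; with the second set the observed and expected frequencies coincide and D is zero.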
• Loci are in LD if alleles are present on haplotypes in different proportions than expected based on allele frequencies
• Two alleles that are in LD occur together more often than would be expected by chance

Linkage disequilibrium
Locus A: alleles A and a, with frequencies $p_A$ and $p_a$
Locus B: alleles B and b, with frequencies $p_B$ and $p_b$

Possible haplotypes     AB          Ab          aB          ab
Expected frequencies    $p_A p_B$   $p_A p_b$   $p_a p_B$   $p_a p_b$
Observed frequencies    $p_{AB}$    $p_{Ab}$    $p_{aB}$    $p_{ab}$

$D = p_{AB} - p_A p_B \neq 0$

LD variation across genome
• The extent of LD is highly variable across the genome
• The determinants of LD are not fully understood
• Factors that are believed to influence LD:
   • Genetic drift
   • Population growth
   • Admixture or migration
   • Selection
   • Variable recombination rates

Haplotype

Genotypes (unphased)
Locus1   Locus2   Locus3   Locus4   Locus5   Locus6
2        1        3        4        2        1
4        3        2        1        3        2

Identification of phase: PHASE, BEAGLE

Haplotypes (phased)
Locus1   Locus2   Locus3   Locus4   Locus5   Locus6
4        1        3        1        2        2
2        3        2        4        3        1

Haplotype-based analysis
• Increased ability to identify regions that are shared identical by descent among affected individuals
• Haplotypes may be the causative 'composite allele' rather than a particular nucleotide at a particular SNP
• Haplotype analysis is meaningful only if the SNPs are themselves in LD

Monogenic versus complex traits

Monogenic trait
• Mutation in a single gene is both necessary and sufficient to produce the phenotype or to cause the disease
• The impact of the gene on genetic risk is the same in all families
• Follows a clear segregation pattern in families
• Typically rare in the population

Complex trait
• Multiple genes lead to genetic predisposition to a phenotype
• Pedigree reveals no Mendelian pattern
• Any particular gene mutation is neither sufficient nor necessary to explain the phenotype
• Environment has a major contribution
• We study the relative impact of individual genes on the phenotype

Some examples

Disease               Mendelian/Complex   No. of genes   Incidence (in 100,000)
Cystic fibrosis       M                   1              40
Huntington disease    M                   1              5-10
Diabetes, type 2      C                   ?              10,000 – 20,000
Alzheimer             C                   ?              20,000
Schizophrenia         C                   ?              1000

Quantitative trait
• A biological trait that shows continuous variation rather than falling into distinct categories
• Quantitative trait locus (QTL): genetic locus that is associated with variation in such a quantitative trait

Assessing genetic contributions to complex traits
• Continuous characters (wt, blood pressure)
   • Heritability: proportion of observed variance in phenotype explained by genetic factors
• Discrete characters (disease)
   • Relative risk ratio: λ = risk to relative of an affected individual / risk in general population
   • λ encompasses all genetic and environmental effects, not just those due to any single locus

Factors that influence identification of allelic association
• Effect size
• Linkage disequilibrium
• Disease and marker allele frequencies
• Sample size
Reviewed by Zondervan & Cardon, Nature Rev. Genet. 2004

Sample size by odds ratio

Disease allele freq.   Marker allele freq.   Odds ratio 3.0   Odds ratio 2.0   Odds ratio 1.3
0.2                    0.2                   150              360              2,900
0.2                    0.5                   430              1,250            11,000
0.05                   0.2                   1,170            4,150            40,000
0.05                   0.5                   4,200            15,000           160,000

No. of cases = no. of controls; D' = 0.7; power 80%; α = 0.001
Zondervan & Cardon (Nature Rev. Genet. 2004)

Population stratification
Consider two case/control samples, genotyped at a marker with alleles M and m.

Sample A
             M      m      Freq.
Affected     50     50     0.10
Unaffected   450    450    0.90
Freq.        0.50   0.50
χ²: NS

Sample B
             M      m      Freq.
Affected     1      9      0.01
Unaffected   99     891    0.99
Freq.        0.10   0.90
χ²: NS

Samples A and B pooled
             M      m      Freq.
Affected     51     59     0.055
Unaffected   549    1341   0.945
Freq.        0.30   0.70
χ² = 14.8, P < 0.001

Dealing with population structure
Genomic control (Devlin and Roeder, 1999): inflate the distribution of the test statistic by λ.
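The population stratification example above can be checked numerically. The sketch below computes the Pearson chi-square statistic (1 df, no continuity correction) for Sample A, Sample B, and the pooled table; the helper names `chi_square_2x2` and `p_value_1df` are hypothetical, and the choice to omit a continuity correction is an assumption made so that the result matches the value quoted on the slide.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table ((a, b), (c, d)), no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: affected, unaffected. Columns: allele M, allele m.
sample_a = ((50, 50), (450, 450))
sample_b = ((1, 9), (99, 891))
pooled = tuple(
    tuple(x + y for x, y in zip(row_a, row_b))
    for row_a, row_b in zip(sample_a, sample_b)
)

chi2_a = chi_square_2x2(sample_a)
chi2_b = chi_square_2x2(sample_b)
chi2_pooled = chi_square_2x2(pooled)
p_pooled = p_value_1df(chi2_pooled)
print(chi2_a, chi2_b, round(chi2_pooled, 1), p_pooled)
```

Within each sample the allele frequencies are identical in affected and unaffected individuals, so the statistic is zero. Pooling two samples that differ in both disease prevalence and allele frequency produces an association even though none exists in either sample.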
λ is estimated from the data.
[Figure: χ² at the test locus compared with E(χ²) from unlinked 'null' markers, shown with no stratification and with stratification; under stratification the test statistics are adjusted.]

Dealing with population structure
Structured association (Pritchard et al., 2000)
• Discover structure from a set of unlinked markers, i.e. assign probabilities of ancestry from k populations to each individual, and then control for it

Association analysis approaches
• Case–control studies: marker allele frequencies are determined in a group of affected individuals and compared with allele frequencies in a control population
• Family-based methods: based on unequal transmission of alleles from parents to a single affected child in each family. Associations are summed over many unrelated families

Case-control studies: χ² test

Genotypes (2x3 contingency table)
         11     12     22     Total
Case     n11    n12    n22    N
Ctrl     m11    m12    m22    M
Total    T11    T12    T22    N+M

Alleles (2x2 contingency table)
         1      2      Total
Case     n1     n2     2N
Ctrl     m1     m2     2M
Total    T1     T2     2(N+M)

Test of independence: $\chi^2 = \sum (O - E)^2 / E$ with 2 or 1 df

Family-based tests
• Genotypes from independent family trios where the child is affected
• Use the non-transmitted genotypes or alleles as internal controls to the transmitted ones

Family-based association studies
[Figure: trio pedigree. Parents with genotypes 12 and 34 and an affected child with genotype 14; alleles 1 and 4 are transmitted, and the non-transmitted alleles 2 and 3 form the control.]
Is an allele transmitted more often than it is not transmitted to affected offspring?

TDT: Transmission Disequilibrium Test
Example trio genotypes: G/g, G/G, G/g

               Non-transmitted
Transmitted    G      g
G              a      b
g              c      d

$\mathrm{TDT}_G = (T_G - NT_G)^2 / (T_G + NT_G) = (b - c)^2 / (b + c) \sim \chi^2_1$

TDT: Transmission Disequilibrium Test
• Multiallelic markers: ETDT (Sham & Curtis, 1995)
• Missing parent genotypes: TRANSMIT (Clayton, 1999)
• Haplotypes: TDTHAP (Clayton & Jones, 1999)
• Sibs: TDT/STDT (Spielman & Ewens, 1998)
• Pedigrees: PBAT (Martin et al., 2000)
• Quantitative traits: QTDT (Abecasis et al., 2000)

Some limitations
• Subjects – random or structured family
• Parents not available
• Difficult when there are very many genes individually of small effect
• Environmental influence may obscure genetic effects
• Genetic heterogeneity underlying disease phenotype
• Hidden (unaccounted) relationship
• Rare allele

Single family is segregating
[Figure: a single family segregating alleles A/a and B/b, with offspring group I and offspring group II.]

Complex pedigree & quantitative traits
Complex pedigree
• Non-independence among pedigree members
• Only polygenic relationship is not sufficient
• Association analysis should account for the point-wise relationship among individuals
• Identical-by-descent probabilities

Methods
• Combined linkage and LD
• Generalized linear models
• Mixed-model (Yu et al. 2006)
• Bayesian approach

Combined linkage and LD
Phenotype = Fixed factors + Polygene + Haplotype
• Polygene – the whole relationship in pedigree is used
• Identical-by-descent coefficients were estimated for point-wise relationship
Phase determination - GDQTL
QTL mapping - DMU

QTL for clinical mastitis in cattle
[Figure: likelihood ratio test (LRT) statistic (0 to 16) against map position in Morgan (0.0 to 1.0), with curves for linkage analysis (LA), LD, and combined LD/LA.]

Simulation
• 100 half-sib families (dairy cattle pedigree)
• 2000 progeny
• 5 chromosomes – 100 cM each
• SNP – 5000
• 15 QTL (1 QTL – 10%, 4 QTL – 5%, 10 QTL – 2%), together 50% of the genetic variance
• Heritability – 30%

Generalized linear models
Phenotype = Sire-family + genotype
Software – TASSEL http://www2.maizegenetics.net/index.php?page=bioinformatics/tassel

Generalized linear models
[Figure: -ln(p) (0 to 120) against position (0 to 5).]

Mixed-model (Yu et al. 2006)
Phenotype = Fixed factors + SNP + Population + polygene
• SNP: coded 0, 1, 2
• Population: STRUCTURE
• Polygene: relationship
Software – SAS mixed model (Gael Pressoir)

Mixed-model
[Figure: -ln(p) (0 to 120) against position (0 to 5).]

Bayesian approach
Phenotype = Fixed factors + Polygene + Allele or Haplotype
• All markers are fitted simultaneously; search for the marker combination that explains the trait variation
• Avoids multiple testing
Software – iBays (Janss LLG, 2007)

Bayesian approach
[Figure: posterior probability (0 to 1) against position (0 to 5).]

Multiple testing
• Performing one test at an alpha level of 0.05 implies a 5% chance of rejecting a true null hypothesis (false positive, FP)
• Performing 100 tests at α = 0.05 when all 100 H0 are true, we expect 5 of the tests to give FP results
• $\Pr(\text{at least one FP}) = 1 - \Pr(\text{no FP}) = 1 - (0.95)^{100} = 0.994$ (if the tests are independent)

Multiple testing
• Bonferroni correction: rejection level of each test is $\alpha / m$
• Permutation test
• False discovery rate (FDR)
   • What proportion of rejections occur when H0 is true?
   • Of all the times you reject H0, how often is H0 true?
   • q value (Storey et al. PNAS 2003)

Summary
4 methods
• LD and linkage
• GLM
• Mixed-model
• Bayesian approach

Project team
Goutam Sahana
Bernt Guldbrandtsen
Luc Janss
Mogens Sandø Lund
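The arithmetic on the multiple-testing slides can be reproduced with a few lines. This is a minimal sketch assuming independent tests, as the slide itself states; the function names `familywise_error_rate` and `bonferroni_threshold` are hypothetical and not taken from any of the software named in the presentation.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive in m independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test rejection level alpha / m under the Bonferroni correction."""
    return alpha / m


alpha = 0.05
m = 100

expected_false_positives = alpha * m                        # expected number of FP results
fwer_uncorrected = familywise_error_rate(alpha, m)          # 1 - 0.95**100
per_test_level = bonferroni_threshold(alpha, m)             # alpha / m
fwer_bonferroni = familywise_error_rate(per_test_level, m)  # family-wise rate after correction

print(expected_false_positives, round(fwer_uncorrected, 3), per_test_level, fwer_bonferroni)
```

Testing each hypothesis at α/m brings the family-wise error rate back below the nominal 0.05, at the cost of a much stricter threshold per test.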