QTL mapping 1

Linear regression for GWAS, statistical testing and “big data statistics”

Genome-wide association studies
• Given:
  • Genetic variants for multiple samples
    • E.g. single nucleotide polymorphisms (SNPs), microsatellite markers, etc.
  • Phenotypes for the same samples
    • E.g. disease, height, gene expression
• Goal:
  • Find causal variants
• Practical goal:
  • Find markers that explain variance in the phenotype due to linkage to causal variants
• Use linear regression!

The basic principles of GWAS: Genetic variation
• SNP: single nucleotide polymorphism
• Indel: insertion/deletion
• CNV: copy number variation

The basic principles of GWAS: Genetic linkage
• Meiotic recombination
• Genetic linkage: tendency of certain alleles to be inherited together
• Correlation between polymorphisms
• Haplotype blocks
• Marker: a polymorphic position used to trace a genomic region
[Figure labels: Yoruba, European; adapted from J. Gagneur]

The basic principles of GWAS: Complex traits
• Simple (or Mendelian) trait:
  • A single polymorphism involved
  • Trait distributes around discrete values (genotypes AA, AG, GG)
  • ...TCATACGCTCATATC.... vs. ...TCATACACTCATATC....
  • Huntington disease and ~2,000 more at the OMIM database
• Complex (or quantitative) trait:
  • Multiple polymorphisms contribute
  • Trait distributes across a continuum of values
  • Height, weight, onset and intensity of most diseases...

The basic principles of GWAS: Complex traits
• Mendelian traits are the exception rather than the rule!
• Straight thumb vs. hitchhiker’s thumb: Mendelian? http://udel.edu/~mcdonald/myththumb.html

What is association?
In statistics, association is any relationship between two measured quantities that renders them statistically dependent.*
• Direct association
• Indirect association
  • Can be beneficial
    • E.g. linkage
  • Can be harmful
    • E.g. population structure
*Oxford Dictionary of Statistics

The basic principle of GWAS: QTL mapping
[Figure: genotype data (SNP coded 1 1 1 0 0 0) and a quantitative phenotype plotted by allele (C vs. T); the phenotype is modelled as SNP effect plus noise $\epsilon$. Null model: no association, $\beta = 0$. Full model: association, $\beta \neq 0$. Slide adapted from Paolo Casale.]

Linear regression
• Model phenotype $y_n$ as a linear function of the SNP $x_{ns}$:
  $$y_n = x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta} + \epsilon_n$$
• $\beta_s$: effect size
  • Quantifies the change in $y_n$ as a function of $x_{ns}$.
• $\boldsymbol{x}_n$: covariates with effect sizes $\boldsymbol{\beta}$
  • Race, known background SNPs, environment, etc.
• $\epsilon_n \sim \mathcal{N}(0, \sigma_e^2)$: noise
  • Everything else we don’t account for…
  • (independent, normally distributed with variance $\sigma_e^2$)

Linear regression
• Likelihood of $\beta_s$:
  $$L(\beta_s) = p(\boldsymbol{y} \mid \beta_s) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta},\ \sigma_e^2\right)$$
  • Probability of the observed data given the parameter, $p(\boldsymbol{y} \mid \beta_s)$
  • Scoring function for $\beta_s$ (the higher the likelihood, the better the parameter)

Parameter estimation
• Maximum likelihood estimator $\beta_s^{ML}$
• Maximum likelihood is equivalent to least squares, i.e. to minimizing the sum of squared residuals:
  $$\beta_s^{ML} = \arg\min_{\beta_s} \sum_{n=1}^{N} \left(y_n - \left(x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta}\right)\right)^2$$
• Unbiased: $E[\beta_s^{ML}] = \beta_s$
• Standard error $\sigma_{\beta_s}$:
  • Standard deviation of the estimate $\beta_s^{ML}$
  • Goes to 0 in proportion to $1/\sqrt{N}$ and grows, for rare alleles, roughly as $1/\sqrt{f_s}$, where $f_s$ is the allele frequency
  • Allows power calculations: what is the required sample size for finding SNPs with allele frequency $f_s \geq f_{threshold}$ and effect size $\beta_s \geq \beta_{threshold}$?
• For large $N$ we obtain the “true” $\beta_s$

Hypothesis testing
• Given a sample
  • $D = x_1, \ldots, x_N$
• Test whether $H_0$ or $H_1$ is true.
  • $H_0$: $\beta_s = 0$ (null hypothesis)
  • $H_1$: $\beta_s \neq 0$ (alternative hypothesis)
• To show that $\beta_s \neq 0$ we perform a statistical test that tries to reject $H_0$
• Type 1 error (false positive):
  • $H_0$ rejected, although it is true.
• Type 2 error (false negative):
  • $H_0$ not rejected, although it does not hold.
[Figure: quantitative phenotype by allele (C vs. T) under $\beta = 0$ and $\beta \neq 0$]

Hypothesis testing
• Significance level $\alpha$ sets the stringency of the test by bounding the probability of a type 1 error.
• Decision based on a test statistic $\gamma$
• Critical region $R_\alpha$
  • Area of rejected values of the test statistic
  • Under $H_0$: $P(\gamma \in R_\alpha) = \alpha$
[Figure: critical regions $R_\alpha$ for a one-sided test and for a two-sided test]

P-values
• Probability, if $H_0$ is true, of a test statistic at least as extreme as the observed $\gamma$
• Uniformly distributed under $H_0$
• Formal: smallest significance level $\alpha'$ at which the null hypothesis is rejected:
  $$P(\gamma) = \inf\{\alpha' : \gamma \in R_{\alpha'}\}$$
• All $P$-values smaller than $\alpha$ are significant

Testing in linear regression
  $$L(\beta_s) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta},\ \sigma_e^2\right)$$
• $H_0$: effect size $\beta_s = 0$
  • ($\beta_s$ unknown)
• Use the maximum likelihood estimate $\beta_s^{ML}$ divided by its standard error $\sigma_{\beta_s}$ as test statistic:
  $$z_s = \frac{\beta_s^{ML}}{\sigma_{\beta_s}} \quad \text{(z-score)}$$
  distributed $\mathcal{N}(0, 1)$ under $H_0$.
• Intuition: the larger the absolute value of $\beta_s^{ML}$ relative to its standard error, the less likely is $H_0$: $\beta_s = 0$.

How to assess significance? Permutation
• Recipe to obtain an approximate distribution of the test statistic under $H_0$
• Repeat $M$ times:
  • Permute the phenotype together with all covariates
  • Compute the test statistic on the permuted data
[Figure: distribution of unpermuted statistics vs. distribution of permuted statistics]

Multiple hypothesis testing
• $P$-value: probability of a type-1 error when $\alpha = P(\gamma)$
• In GWAS we test $S = 10^5$ to $10^7$ hypotheses
• When testing $S = 10^6$ SNPs at $\alpha = 0.01$ we would expect $10^4$ type-1 errors under $H_0$!
• Probability of at least one type-1 error: $1 - (1 - \alpha)^{S_0} \to 1$, where $S_0 \leq S$ is the number of tests for which $H_0$ holds
• Individual $P$-values smaller than $\alpha$ are not significant on their own.
• Need to correct for multiple hypothesis testing!
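
The z-score test and the multiple-testing arithmetic above can be tried out on simulated data. The following is a minimal sketch, not the lecture's own code: it assumes a model with an intercept as the only covariate, genotypes coded 0/1/2 and drawn independently, and a phenotype that is pure noise, so every SNP satisfies $H_0$. All function names and parameter values are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1): approximate null distribution of the z-score


def z_score(x, y):
    """Return beta_ML / standard error for the model y = mean + x * beta + noise."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [xi - mean_x for xi in x]  # centering accounts for the intercept
    yc = [yi - mean_y for yi in y]
    sxx = sum(v * v for v in xc)
    if sxx == 0.0:
        raise ValueError("monomorphic SNP: genotype has no variance")
    beta_ml = sum(a * b for a, b in zip(xc, yc)) / sxx  # least-squares estimate
    rss = sum((b - beta_ml * a) ** 2 for a, b in zip(xc, yc))
    sigma2_e = rss / (n - 2)  # residual variance estimate
    std_err = math.sqrt(sigma2_e / sxx)
    return beta_ml / std_err


def two_sided_p_value(z):
    """Two-sided P-value, treating z as N(0, 1) (a large-sample approximation)."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def simulate_null_scan(n_snps, n_samples, allele_freq, seed):
    """P-values for n_snps independent SNPs, none of which affects the phenotype."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    p_values = []
    for _ in range(n_snps):
        # Genotype = number of copies of the allele on two chromosomes (0, 1 or 2).
        x = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
             for _ in range(n_samples)]
        p_values.append(two_sided_p_value(z_score(x, y)))
    return p_values


def prob_at_least_one_error(alpha, n_null_tests):
    """1 - (1 - alpha)^S0 for S0 independent tests in which H0 holds."""
    return 1.0 - (1.0 - alpha) ** n_null_tests


alpha = 0.01
p_values = simulate_null_scan(n_snps=1000, n_samples=100, allele_freq=0.3, seed=0)
n_false_positives = sum(p < alpha for p in p_values)
print("expected type-1 errors:", alpha * len(p_values))
print("observed type-1 errors:", n_false_positives)
print("P(at least one type-1 error):", prob_at_least_one_error(alpha, len(p_values)))
```

Because the residual variance is estimated from the data, the statistic strictly follows a t-distribution; the normal approximation used here is reasonable only for moderately large sample sizes. The scan is kept small (1,000 SNPs) so that it runs quickly in pure Python.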
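Family-wise error rate: Bonferroni correction
  $$FWER \leq \sum_{\gamma_i \in H_0} p\left(P(\gamma_i) \leq \alpha\right) = \alpha \cdot S_0 \leq \alpha \cdot S$$
  where $S$ is the number of tests and $S_0$ the number of tests for which $H_0$ holds
• Bound the probability of at least one type-1 error in all tests:
  $$\alpha_{Bonf} = \frac{\alpha}{S}$$
• Leads to very stringent cutoffs
• For human GWAS typically $\alpha_{GWAS} = 5 \cdot 10^{-8}$

False discovery rates: q-values
• FWER is very conservative
• We might be willing to accept a few type-1 errors
• Bound the rate of type-1 errors among the discoveries, FP / (FP + TP), instead of the absolute number of errors

Quantile-quantile plot
• $P$-values are uniformly distributed under the null distribution
• Plot quantiles of the theoretical $-\log_{10}(P\text{-values})$ under $H_0$ against quantiles of the observed $-\log_{10}(P\text{-values})$

Genomic control
  $$\lambda_{GC} = \frac{\mathrm{median}(\text{observed})}{\mathrm{median}(H_0)}$$
• Comparison of observed and theoretical $P$-values under $H_0$
• $\lambda_{GC} = 1$: calibrated $P$-values
• $\lambda_{GC} > 1$: inflation of $P$-values
• $\lambda_{GC} < 1$: deflation of $P$-values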
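
The Bonferroni cutoff and the genomic control factor above are short computations. The sketch below is illustrative only. It assumes that the medians in $\lambda_{GC}$ are taken over chi-square test statistics with one degree of freedom (a common convention, not stated on the slide) and that those statistics are recovered from two-sided $P$-values. The function names are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha / S that bounds the FWER by alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha):
    """Flag each P-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


def chi2_from_p_value(p):
    """1-df chi-square statistic matching a two-sided P-value (requires 0 < p <= 1)."""
    return STD_NORMAL.inv_cdf(p / 2.0) ** 2


def genomic_control_lambda(p_values):
    """Median observed chi-square statistic divided by its median under H0."""
    observed_median = median(chi2_from_p_value(p) for p in p_values)
    null_median = chi2_from_p_value(0.5)  # median P-value under H0 is 0.5
    return observed_median / null_median


n = 1000
calibrated = [(i + 0.5) / n for i in range(n)]  # evenly spread, as under H0
inflated = [p ** 2 for p in calibrated]         # systematically too small

print("Bonferroni threshold for 10^6 tests at alpha = 0.05:",
      bonferroni_threshold(0.05, 1_000_000))
print("lambda_GC, calibrated P-values:", genomic_control_lambda(calibrated))
print("lambda_GC, inflated P-values:", genomic_control_lambda(inflated))
```

With $\alpha = 0.05$ and $10^6$ tests the Bonferroni threshold works out to $5 \cdot 10^{-8}$, the same value as the human GWAS cutoff quoted above. Evenly spread $P$-values give $\lambda_{GC}$ close to 1, while $P$-values that are systematically too small give $\lambda_{GC} > 1$.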