QTL mapping 1
Linear regression for GWAS, statistical testing
and “big data statistics”
Genome-wide association studies
• Given:
• Genetic variants for multiple samples
• E.g. Single nucleotide polymorphisms (SNPs),
microsatellite markers, etc.
• Phenotypes for same samples
• E.g. disease, height, gene-expression
• Goal:
• Find causal variants
• Practical goal:
• Find markers that explain variance in the
phenotype due to linkage to causal variants
• Use linear regression!
The basic principles of GWAS
Genetic variation
SNP: single nucleotide polymorphism
Indel: insertion/deletion
CNV: copy number variation
The basic principles of GWAS
Genetic linkage
Meiotic recombination
Genetic linkage: tendency of certain alleles to be
inherited together
Correlation between polymorphisms
Haplotype blocks
[Figure: haplotype blocks, panels: Yoruba, European]
Marker: a polymorphic position used to trace a genomic region
adapted from J. Gagneur
The basic principles of GWAS
Complex traits
Simple (or Mendelian) trait:
A single polymorphism involved
Trait distributes around discrete values
[Figure: genotypes AA, AG and GG at a single SNP: ...TCATACGCTCATATC.... / ...TCATACACTCATATC....]
Huntington disease and ~2,000 more at the OMIM
database
Complex (or Quantitative) trait:
Multiple polymorphisms contribute
Trait distributes across a continuum of values
Height, weight, onset and intensity of most diseases...
The basic principles of GWAS
Complex traits
Mendelian traits are the exception rather than the rule!
straight thumb
hitchhiker’s thumb
Mendelian?
http://udel.edu/~mcdonald/myththumb.html
What is association?
In statistics, association is any
relationship between two measured
quantities that renders them
statistically dependent.*
• Direct association
• Indirect association
• Can be beneficial
• e.g.: Linkage
• Can be harmful
• e.g.: Population structure
*Oxford Dictionary of Statistics
The basic principle of GWAS
QTL mapping
[Figure: genotype data (SNP coded 0/1) and a quantitative phenotype; pheno = SNP effect + noise $\epsilon$]
Slide adapted from Paolo Casale
The basic principle of GWAS
QTL mapping
Null model: no association ($\beta = 0$), pheno = noise $\epsilon$
Full model: association ($\beta \neq 0$), pheno = SNP effect + noise $\epsilon$
[Figure: quantitative phenotype by genotype (C, T) under each model]
Slide adapted from Paolo Casale
Linear regression
• Model phenotype $y_n$ as a linear function of the SNP $x_{ns}$
$$y_n = x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta} + \epsilon_n$$
• $\beta_s$: effect size
• quantifies the change in $y_n$ as a function of $x_{ns}$.
• $\boldsymbol{x}_n$: covariates with effect sizes $\boldsymbol{\beta}$
• Race, known background SNPs, environment, etc.
• $\epsilon_n \sim N(0, \sigma_e^2)$: noise
• Everything else we don’t account for…
• (independent, normally distributed with variance $\sigma_e^2$)
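To make the model concrete, here is a minimal sketch that simulates data from it. The function name `simulate_phenotypes`, the 0/1 genotype coding, and the use of a single intercept as the only covariate are assumptions for illustration and do not come from the slides.

```python
import random

def simulate_phenotypes(n_samples, beta_s, allele_freq, sigma_e,
                        intercept=0.0, seed=0):
    """Draw (genotypes, phenotypes) from y_n = x_ns * beta_s + intercept + eps_n.

    Assumptions (illustrative only): genotypes are coded 0/1 and drawn
    independently with probability `allele_freq`; the covariate term
    x_n * beta is reduced to a constant intercept.
    """
    rng = random.Random(seed)
    genotypes, phenotypes = [], []
    for _ in range(n_samples):
        x = 1 if rng.random() < allele_freq else 0
        eps = rng.gauss(0.0, sigma_e)          # noise term eps_n ~ N(0, sigma_e^2)
        genotypes.append(x)
        phenotypes.append(x * beta_s + intercept + eps)
    return genotypes, phenotypes

x, y = simulate_phenotypes(n_samples=500, beta_s=0.5, allele_freq=0.3, sigma_e=1.0)
print(len(x), sum(x) / len(x))   # sample size and observed allele frequency
```

Setting `beta_s=0.0` gives data from the null model, which is useful for checking how a test behaves when there is no association.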
Linear regression
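• Model phenotype $y_n$ as a linear function of the SNP $x_{ns}$
$$y_n = x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta} + \epsilon_n$$
• Likelihood of $\beta_s$
$$L(\beta_s) = p(\boldsymbol{y} \mid \beta_s) = \prod_{n=1}^{N} N\left(y_n \mid x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta},\ \sigma_e^2\right)$$
• Probability of the observed data given the parameter, $p(\boldsymbol{y} \mid \beta_s)$
• Scoring function for $\beta_s$
(The higher the likelihood, the better the parameter)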
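The likelihood can be evaluated directly as a scoring function. The sketch below works on the log scale, which avoids numerical underflow of the product over samples. The name `log_likelihood` and the reduction of the covariate term to a fixed intercept are assumptions for illustration.

```python
import math

def log_likelihood(beta_s, y, x, sigma_e, intercept=0.0):
    """log L(beta_s) = sum_n log N(y_n | x_ns * beta_s + intercept, sigma_e^2).

    Assumption (illustrative only): the covariate term is a fixed intercept.
    """
    var = sigma_e ** 2
    total = 0.0
    for y_n, x_n in zip(y, x):
        resid = y_n - (x_n * beta_s + intercept)
        total += -0.5 * math.log(2 * math.pi * var) - resid ** 2 / (2 * var)
    return total

# Toy data in which carriers (x = 1) have a phenotype about 2 units higher.
x = [0, 1, 0, 1]
y = [0.1, 1.9, -0.1, 2.1]
print(log_likelihood(2.0, y, x, sigma_e=1.0) > log_likelihood(0.0, y, x, sigma_e=1.0))
```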
Parameter estimation
• Maximum likelihood estimator $\beta_s^{ML}$
• Maximum likelihood is equivalent to least squares
$$\beta_s^{ML} = \arg\min_{\beta_s} \sum_{n=1}^{N} \left(y_n - \left(x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta}\right)\right)^2$$
• Unbiased: $E[\beta_s^{ML}] = \beta_s$
• Standard error: $\sigma_{\beta_s}$
• Standard deviation of the estimate $\beta_s^{ML}$
• Goes to 0 as $1/\sqrt{N}$ and $1/\sqrt{f_s}$, where $f_s$ is the allele frequency
• Allows power calculations: what is the required sample size for finding SNPs with allele frequency $f_s \geq f_{threshold}$ and effect size $\beta_s \geq \beta_{threshold}$?
• For large $N$ we obtain the “true” $\beta_s$
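A minimal sketch of the least-squares estimate and its standard error, for the special case where the only covariate is an intercept. The names `fit_snp_effect` and `simulate`, and all simulation settings, are assumptions for illustration. The loop at the end shows the standard error shrinking as the sample size grows.

```python
import math
import random

def fit_snp_effect(x, y):
    """Least-squares slope of y on x (with intercept) and its standard error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_ml = sxy / sxx
    intercept = mean_y - beta_ml * mean_x
    rss = sum((yi - (intercept + beta_ml * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)                 # residual variance estimate
    return beta_ml, math.sqrt(sigma2 / sxx)

def simulate(n, beta_s=0.5, allele_freq=0.3, seed=0):
    """Hypothetical data generator: 0/1 genotypes, unit-variance noise."""
    rng = random.Random(seed)
    x = [1 if rng.random() < allele_freq else 0 for _ in range(n)]
    y = [beta_s * xi + rng.gauss(0.0, 1.0) for xi in x]
    return x, y

for n in (200, 20000):
    beta_ml, se = fit_snp_effect(*simulate(n))
    print(n, round(beta_ml, 3), round(se, 4))
```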
Hypothesis testing
• Given a sample $D = \{x_1, \dots, x_N\}$
• Test whether $H_0$ or $H_1$ is true.
• $H_0$: $\beta_s = 0$ (null hypothesis)
• $H_1$: $\beta_s \neq 0$ (alternative hypothesis)
• To show that $\beta_s \neq 0$ we perform a statistical test that tries to reject $H_0$
• Type 1 error (false positive): $H_0$ rejected, but true.
• Type 2 error (false negative): $H_0$ accepted, but does not hold.
[Figure: quantitative phenotype by genotype (C, T) for $\beta \neq 0$ and $\beta = 0$]
Hypothesis testing
• Given a sample $D = \{x_1, \dots, x_N\}$
• Test whether $H_0$ or $H_1$ is true.
• $H_0$: $\beta_s = 0$ (null hypothesis)
• $H_1$: $\beta_s \neq 0$ (alternative hypothesis)
• Significance level $\alpha$ defines the sensitivity of the test by bounding the probability of a type 1 error.
• Decision based on a test statistic $\gamma$
• Critical region $R_\alpha$
• Area of rejected values of test statistic
• Under $H_0$: $P(\gamma \in R_\alpha) = \alpha$
[Figure: critical region $R_\alpha$ for a one-sided test (one tail) and a two-sided test (both tails)]
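As an illustration of a critical region, the sketch below computes the rejection threshold for a test statistic that is standard normal under $H_0$. That distributional assumption and the function names are illustrative choices and are not specified on this slide.

```python
from statistics import NormalDist

def critical_value(alpha, two_sided=True):
    """Threshold c such that the critical region has probability alpha under H0.

    Assumption (illustrative only): the test statistic is N(0, 1) under H0.
    """
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

def in_critical_region(stat, alpha, two_sided=True):
    """True if H0 is rejected at level alpha for the observed statistic."""
    c = critical_value(alpha, two_sided)
    return abs(stat) > c if two_sided else stat > c

print(round(critical_value(0.05, two_sided=True), 3))    # two-sided threshold
print(round(critical_value(0.05, two_sided=False), 3))   # one-sided threshold
```

The two-sided region splits $\alpha$ across both tails, so its threshold is further from zero than the one-sided threshold at the same level.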
P-values
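• Probability of a type-1 error if $H_0$ is true
• Uniformly distributed under $H_0$
Formal:
• Smallest significance level $\alpha$ such that the null hypothesis is rejected
$$P(\gamma) = \inf \{\alpha' : \gamma \in R_{\alpha'}\}$$
• All $P$-values smaller than $\alpha$ are significant
[Figure: null distribution of the test statistic; the tail area beyond the observed $\gamma$ equals $P(\gamma)$]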
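The uniformity of $P$-values under $H_0$ can be checked by simulation. The sketch below assumes a standard normal test statistic and a two-sided test. The names and the simulation size are illustrative choices.

```python
import random
from statistics import NormalDist

def two_sided_p_value(stat):
    """P-value of a two-sided test, assuming stat ~ N(0, 1) under H0."""
    return 2 * (1 - NormalDist().cdf(abs(stat)))

# Draw test statistics under H0 and convert them to P-values.
rng = random.Random(0)
null_p_values = [two_sided_p_value(rng.gauss(0.0, 1.0)) for _ in range(20000)]

# If the P-values are uniform, about a fraction alpha of them falls below alpha.
for alpha in (0.01, 0.05, 0.5):
    fraction = sum(p < alpha for p in null_p_values) / len(null_p_values)
    print(alpha, round(fraction, 3))
```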
Testing in Linear regression
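$$L(\beta_s) = \prod_{n=1}^{N} N\left(y_n \mid x_{ns}\beta_s + \boldsymbol{x}_n\boldsymbol{\beta},\ \sigma_e^2\right)$$
• $H_0$: effect size $\beta_s = 0$
• ($\beta_s$ unknown)
• Use maximum likelihood estimate $\beta_s^{ML}$ divided by its standard error $\sigma_{\beta_s}$ as test statistic
$$z_s = \frac{\beta_s^{ML}}{\sigma_{\beta_s}} \quad \text{(z-score)}$$
distributed $N(0, 1)$ under $H_0$.
• Intuition: The larger the absolute value of $\beta_s^{ML}$, the less likely is $H_0$: $\beta_s = 0$.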
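A short sketch of the z-test, taking an effect estimate and its standard error as inputs. The function name `z_test` is hypothetical. The two-sided $P$-value is computed with the complementary error function, which stays accurate for the very small $P$-values that matter in GWAS.

```python
import math

def z_test(beta_ml, standard_error):
    """Return (z-score, two-sided P-value), assuming z ~ N(0, 1) under H0."""
    z = beta_ml / standard_error
    # 2 * (1 - Phi(|z|)) written as erfc(|z| / sqrt(2)) for numerical accuracy.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = z_test(beta_ml=0.5, standard_error=0.1)
print(z, p)
```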
How to assess significance?
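Permutation
• Recipe to obtain an approximate null distribution of the test statistic under $H_0$
• Repeat $M$ times:
• Permute the phenotype and all covariates together, so that only their link to the genotypes is broken
• Compute the test statistic on the permuted data
[Figure: distribution of unpermuted statistics vs. distribution of permuted statistics]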
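The permutation recipe can be sketched as follows for a model without covariates other than an intercept. The absolute regression slope as test statistic, the function names, and the "+1" in the empirical $P$-value (a common convention that avoids reporting exactly zero) are assumptions for illustration.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx

def permutation_p_value(x, y, n_permutations=1000, seed=0):
    """Empirical P-value of |slope| from permuting phenotypes against genotypes."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    y_perm = list(y)
    n_exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(y_perm)                      # break the genotype-phenotype link
        if abs(slope(x, y_perm)) >= observed:
            n_exceed += 1
    return (n_exceed + 1) / (n_permutations + 1)

# Toy data with a strong genotype effect.
x = [0] * 20 + [1] * 20
y = [0.01 * i for i in range(20)] + [5.0 + 0.01 * i for i in range(20)]
print(permutation_p_value(x, y))
```

The smallest attainable empirical $P$-value is $1/(M+1)$, so reaching genome-wide thresholds by permutation alone requires a very large $M$.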
Multiple hypothesis testing
• $P$-value: probability of a type-1 error when $\alpha = P(\gamma)$
• In GWAS we test $S = 10^5$ to $10^7$ hypotheses
• When testing $S = 10^6$ SNPs at $\alpha = 0.01$ we would expect $10^4$ type-1 errors under $H_0$!
• Probability of at least one type-1 error: $1 - (1 - \alpha)^{S_0} \to 1$, where $S_0$ is the number of tests for which $H_0$ holds
• Individual $P$-values smaller than $\alpha$ are not necessarily significant.
• Need to correct for multiple hypothesis testing!
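The two quantities on this slide are simple to compute. The sketch below uses hypothetical function names and assumes independent tests for the "at least one error" probability.

```python
def expected_false_positives(n_null_tests, alpha):
    """Expected number of type-1 errors among tests for which H0 holds."""
    return n_null_tests * alpha

def prob_at_least_one_false_positive(n_null_tests, alpha):
    """1 - (1 - alpha)^S0, assuming the S0 null tests are independent."""
    return 1 - (1 - alpha) ** n_null_tests

print(expected_false_positives(10**6, 0.01))
for s0 in (1, 10, 100, 10**6):
    print(s0, prob_at_least_one_false_positive(s0, 0.01))
```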
Family-wise error rate
Bonferroni correction
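$$FWER \leq \sum_{\gamma_i \in H_0} p\left(P(\gamma_i) \leq \alpha\right) = \alpha \cdot S_0 \leq \alpha \cdot S$$
• Bound probability of at least 1 type-1 error in all tests
$$\alpha_{Bonf} = \frac{\alpha}{S}$$
• Leads to very stringent cutoffs
• For human GWAS typically $\alpha_{GWAS} = 5 \cdot 10^{-8}$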
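A minimal sketch of the Bonferroni correction. The function names are hypothetical. The example values $\alpha = 0.05$ and $S = 10^6$ are an assumption chosen because they reproduce the $5 \cdot 10^{-8}$ cutoff quoted above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level alpha / S that bounds the FWER by alpha."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each P-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

print(bonferroni_threshold(0.05, 10**6))
print(bonferroni_significant([0.001, 0.02, 0.04], alpha=0.05))
```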
False discovery rates
q-values
• FWER is very conservative
• We might be willing to accept a few type-1 errors
• Bound the proportion of type-1 errors among the rejected hypotheses instead of the absolute number of errors
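The slides do not say how the q-values are computed. As one common way to control the false discovery rate, the sketch below uses the Benjamini-Hochberg step-up procedure, which is swapped in here for illustration and is not necessarily the method the author had in mind. The function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order.

    Rejecting all hypotheses with adjusted value <= q controls the FDR at
    level q for independent tests.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([a <= 0.05 for a in adjusted])
```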
Quantile-Quantile plot
• $P$-values are uniformly distributed under the null distribution
• Plot quantiles of theoretical $-\log_{10}(P\text{-values})$ under $H_0$ against quantiles of the observed $-\log_{10}(P\text{-values})$
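The coordinates of a Q-Q plot can be computed without any plotting library. In the sketch below, the expected $P$-value for the $i$-th smallest of $m$ observations is taken as $(i - 0.5)/m$. This plotting position is one common convention and an assumption here, as is the function name.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(P), sorted ascending.

    Assumes every P-value is strictly greater than 0.
    """
    m = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / m) for i in range(1, m + 1))
    return list(zip(expected, observed))

# Perfectly uniform P-values fall on the diagonal.
points = qq_points([(i - 0.5) / 100 for i in range(1, 101)])
print(points[-1])
```

Points above the diagonal at the upper end indicate more small $P$-values than expected under $H_0$.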
Genomic control
$$\lambda_{GC} = \frac{\text{median}(\text{observed})}{\text{median}(H_0)}$$
Comparison of observed and theoretical $P$-values under $H_0$
• $\lambda_{GC} = 1$: calibrated $P$-values
• $\lambda_{GC} > 1$: inflation of $P$-values
• $\lambda_{GC} < 1$: deflation of $P$-values
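A sketch of how $\lambda_{GC}$ might be computed from a list of $P$-values. The slide does not say which quantity the medians are taken over. The code assumes the common convention of converting two-sided $P$-values to 1-degree-of-freedom chi-square statistics and comparing their median with the theoretical median under $H_0$. The function name is hypothetical.

```python
from statistics import NormalDist, median

def genomic_inflation_factor(p_values):
    """lambda_GC = median observed chi-square / median chi-square under H0.

    Assumptions (illustrative only): P-values are two-sided, lie in (0, 1],
    and correspond to squared N(0, 1) statistics (chi-square, 1 d.o.f.).
    """
    std_normal = NormalDist()
    chi2_observed = [std_normal.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    chi2_null_median = std_normal.inv_cdf(0.75) ** 2     # about 0.455
    return median(chi2_observed) / chi2_null_median

uniform_p = [(i - 0.5) / 1001 for i in range(1, 1002)]   # calibrated P-values
too_small_p = [p ** 2 for p in uniform_p]                # systematically too small
print(genomic_inflation_factor(uniform_p))
print(genomic_inflation_factor(too_small_p))
```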