Statistical Genetics
Matthew Stephens
Statistics Retreat, October 26th 2012
Matthew Stephens
Retreat Talk 2012
Two stories
- The two most influential statistical ideas in analysis of genetic
  association studies. [1]
- Sequence, sequence, everywhere.
[1] With apologies to Steve Stigler
Story I: Genetic Association Studies
Genetic association studies aim to identify genetic variants that
modify risk of common diseases or affect other phenotypes
(e.g. Type I Diabetes, height, LDL cholesterol).
The idea is absurdly simple: measure genetic variants (usually
SNPs), and phenotypes in randomly-sampled individuals, and see
which SNPs are correlated with phenotypes.
Story I: Genetic Association Studies
- Typical recent genome-wide studies have typed 500K-1M
  SNPs in thousands of (unrelated) phenotyped individuals.
- Basic Analysis: test each SNP, one-by-one, for statistical
  association with each phenotype.
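The one-SNP-at-a-time scan can be sketched in a few lines. This is a minimal illustration, not the analysis used in any particular study: it assumes genotypes coded 0/1/2, a quantitative phenotype, a simple linear regression per SNP, and a normal approximation for the p-value. The function names and the simulated data are hypothetical.

```python
import math
import random


def snp_association(genotypes, phenotype):
    """Regress phenotype on one SNP (coded 0/1/2); return (beta, se, p)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 0.0, float("inf"), 1.0  # monomorphic SNP: nothing to test
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    beta = sxy / sxx
    resid_ss = sum((y - mean_y - beta * (g - mean_g)) ** 2
                   for g, y in zip(genotypes, phenotype))
    se = math.sqrt(resid_ss / (n - 2) / sxx)
    z = beta / se
    # Two-sided p-value from a normal approximation (reasonable for large n).
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p


def scan(genotype_matrix, phenotype):
    """Test every SNP (one row per SNP) independently of the others."""
    return [snp_association(row, phenotype) for row in genotype_matrix]


# Simulated data, for illustration only: 200 SNPs, 1000 people, SNP 0 is causal.
rng = random.Random(1)
n_people, n_snps = 1000, 200
G = [[sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n_people)]
     for _ in range(n_snps)]
y = [0.5 * G[0][i] + rng.gauss(0, 1) for i in range(n_people)]

results = scan(G, y)
best = min(range(n_snps), key=lambda j: results[j][2])  # index of smallest p-value
```

In a real genome-wide scan the same loop runs over hundreds of thousands of SNPs, which is why a stringent threshold such as p ≤ 5×10^-8 is applied to the resulting p-values.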
Progress identifying variants underlying common disease
[Figure: Published Genome-Wide Associations through 09/2011:
1,617 published GWA at p ≤ 5×10^-8 for 249 traits.
Source: NHGRI GWA Catalog, www.genome.gov/GWAStudies]
Credit:
The two most influential statistical ideas in GWAS
- Correction for unmeasured confounding (population
  structure).
- Imputation to combine studies.
Population Structure and Unmeasured Confounding
The Problem in a nutshell: What would happen if you conducted a
Genetic Association study for “Chopstick Use” in San Francisco?
Population Structure and Unmeasured Confounding
If you know the “genetic background” of the individuals in your
study (e.g. which continent they inherited their genes from), then
you can correct for it.
What if you don’t know it?
Principal Components Analysis to the rescue!
Novembre et al, Nature, 2008
Principal Components Analysis to the rescue!
Test for significance of genetic effect $\beta$, controlling for effects of
genetic background ($\alpha$):
$y = v\alpha + x\beta + \epsilon$
Price et al, Nature Genetics, 2006
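A rough sketch of this adjustment, under stated assumptions: v is taken to be a single leading principal component of the centered genotype matrix (practical analyses typically use several), it is estimated here by plain power iteration, and the test of the SNP effect is done by regressing v out of both the phenotype y and the genotype x before relating the residuals. The simulated "two populations" data and all function names are hypothetical.

```python
import math
import random


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def residualize(values, covariate):
    """Residuals of values after least-squares regression on covariate (with intercept)."""
    vc, cc = center(values), center(covariate)
    slope = sum(a * b for a, b in zip(vc, cc)) / sum(b * b for b in cc)
    return [a - slope * b for a, b in zip(vc, cc)]


def z_score(x, y, n_covariates=0):
    """z statistic for the slope of y on x (both already centered)."""
    n = len(x)
    sxx = sum(a * a for a in x)
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2 - n_covariates) / sxx)
    return beta / se


def leading_pc(columns, n_iter=30, seed=0):
    """Leading principal component across individuals, by power iteration.

    columns holds one centered genotype vector per SNP.
    """
    rng = random.Random(seed)
    n = len(columns[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        # loadings = X^T v, then v = X loadings, then renormalise.
        loadings = [sum(c * a for c, a in zip(col, v)) for col in columns]
        v = [sum(col[i] * w for col, w in zip(columns, loadings)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v


# Two populations with different allele frequencies (illustrative values only).
rng = random.Random(7)
n_per_pop, n_snps = 150, 100
pop = [0] * n_per_pop + [1] * n_per_pop
freqs = [(rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)) for _ in range(n_snps)]
freqs[0] = (0.2, 0.8)  # test SNP: frequency differs by population, no causal effect
geno = [[sum(rng.random() < f[k] for _ in range(2)) for k in pop] for f in freqs]
y = [2.0 * k + rng.gauss(0, 1) for k in pop]  # phenotype depends on ancestry only

cols = [center(g) for g in geno]
v = leading_pc(cols)
x = cols[0]
z_naive = z_score(x, center(y))
z_adjusted = z_score(residualize(x, v), residualize(y, v), n_covariates=1)
```

In this simulation the unadjusted statistic is large even though the SNP has no effect, which is the "Chopstick Use" problem; after regressing out the leading component the statistic should fall back toward the range expected under no association.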
The two most influential statistical ideas in GWAS
- Correction for unmeasured confounding (population
  structure).
- Imputation to combine studies.
Credit: Bryan Howie
Genotype imputation background
[Figure: grid of allele and genotype codes (0/1/2, with "?" marking missing entries).
Labels: Reference haplotypes; Phenotyped GWAS samples; SNPs genotyped on an array.]
Genotype imputation background
[Figure: the same grid, with columns of "?" entries in the GWAS samples.
Labels: Reference haplotypes; Phenotyped GWAS samples; Untyped SNPs.]
Genotype imputation background
[Figure: the same grid, with the "?" entries in the GWAS samples replaced by 0/1/2 values.
Labels: Reference haplotypes; Phenotyped GWAS samples; Association signal.]
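To make the idea concrete, here is a deliberately crude sketch of imputation: each study haplotype copies its untyped alleles from the reference haplotype that agrees best at the typed SNPs. This nearest-haplotype rule is a stand-in chosen for brevity, not the method behind the slides; practical imputation software uses probabilistic models that allow the copied haplotype to switch along the chromosome and that work from unphased genotypes. The data and the function name are hypothetical.

```python
def impute_haplotype(study_hap, reference_haps):
    """Fill None entries in study_hap by copying the closest reference haplotype."""
    typed = [j for j, allele in enumerate(study_hap) if allele is not None]

    def mismatches(ref):
        # Number of typed SNPs at which the reference disagrees with the study haplotype.
        return sum(ref[j] != study_hap[j] for j in typed)

    best = min(reference_haps, key=mismatches)
    return [best[j] if allele is None else allele
            for j, allele in enumerate(study_hap)]


# Toy reference panel: 3 haplotypes at 6 SNPs (alleles coded 0/1).
reference = [
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
]

# One study individual, two haplotypes; SNPs 1, 3 and 5 were not on the array.
hap_a = [1, None, 0, None, 1, None]
hap_b = [1, None, 0, None, 0, None]

imputed_a = impute_haplotype(hap_a, reference)
imputed_b = impute_haplotype(hap_b, reference)

# Genotype at each SNP is the sum of the two alleles, giving the 0/1/2 coding.
genotype = [a + b for a, b in zip(imputed_a, imputed_b)]
```

Once the untyped SNPs are filled in, they can be tested for association in the same way as the typed ones.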
Imputation facilitates meta-analysis
[Figure: grid of allele and genotype codes (0/1/2).
Labels: Reference haplotypes; GWAS 1; GWAS 2.]
Imputation facilitates meta-analysis
[Figure: grid of allele and genotype codes (0/1/2).
Labels: Reference haplotypes; GWAS 1; GWAS 2; Association signal.]
Imputation facilitates meta-analysis
[Figure: grid of allele and genotype codes (0/1/2).
Labels: Reference haplotypes; GWAS 1; GWAS 2.]
Type 1 diabetes: Cooper et al., Nov 2008 (Nature Genetics)
Type 2 diabetes: Zeggini et al., May 2008 (Nature Genetics)
Crohn’s disease: Barrett et al., Aug 2008 (Nature Genetics)
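Once two studies have been imputed to a common set of SNPs, their per-SNP results can be combined. The sketch below uses fixed-effect inverse-variance weighting, which is one common choice and an assumption here rather than something stated on the slides. The SNP identifiers, effect estimates and standard errors are made up for illustration.

```python
import math


def fixed_effect_meta(estimates):
    """Combine (beta, se) pairs for one SNP by inverse-variance weighting."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return beta, se, p


# Hypothetical per-SNP results after imputation: SNP id -> (beta, se).
gwas1 = {"snp_a": (0.10, 0.04), "snp_b": (0.01, 0.05)}
gwas2 = {"snp_a": (0.14, 0.04), "snp_b": (-0.02, 0.05)}

# Only SNPs present in both studies can be combined, which is what
# imputation to a shared reference panel makes possible.
shared = sorted(set(gwas1) & set(gwas2))
combined = {snp: fixed_effect_meta([gwas1[snp], gwas2[snp]]) for snp in shared}
```

With equal standard errors the combined estimate is the simple average of the two, and its standard error shrinks by a factor of sqrt(2), which is where the gain in power comes from.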
Story II: Sequence, Sequence, Everywhere
Sequencing Assays, and Statistical Challenges
Although DNA sequencing is best known for obtaining “genome
sequences”, it is now routinely used for measuring cellular
processes to try to understand how cells operate.
For example:
- Gene expression (RNA-seq).
- Chromatin openness (DNase-seq).
- Transcription Factor Binding (ChIP-seq).
- Histone modifications (ChIP-seq).
A key question is how/why cells differ from one another (they
share the same DNA!).
Chromatin and DNA structure
Figure from Felsenfeld and Groudine. Nature, 2003
The Data
The basic structure of these assays is the same:
- Do something clever to get bits of the DNA that you want
  (e.g. the bits that contact a modified histone, or the bits that
  are bound by a particular transcription factor).
- Sequence these bits (producing millions of little sequences).
- Work out where in the genome each sequence came from.
- The number of sequences coming from each location (usually
  0 or 1) is a measure of the “intensity” of the process at that
  location.
- Basic model: an inhomogeneous Poisson process,
  $x_{ib} \sim \mathrm{Poi}(\lambda_{ib})$.
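A small simulation may help fix ideas about the basic model. The sketch assumes a single sample, so the first index is dropped and b is read as a position along the genome; that reading of the indices is an assumption, as is the made-up intensity profile with one enriched region. Counts are drawn independently at each position and the intensity is then estimated by simple averaging.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def poisson_loglik(counts, rates):
    """Log-likelihood of independent counts x_b ~ Poi(lambda_b)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1)
               for x, lam in zip(counts, rates))


rng = random.Random(3)
n_bases = 2000
# Hypothetical intensity: low background with one enriched region.
rates = [0.5 if 800 <= b < 1000 else 0.02 for b in range(n_bases)]
counts = [sample_poisson(lam, rng) for lam in rates]

# Within a region of constant intensity, the maximum-likelihood
# estimate of lambda is the mean count over that region.
peak_hat = sum(counts[800:1000]) / 200
background_hat = (sum(counts[:800]) + sum(counts[1000:])) / 1800

# A single flat rate for the whole stretch, for comparison.
flat_rate = sum(counts) / n_bases
ll_true = poisson_loglik(counts, rates)
ll_flat = poisson_loglik(counts, [flat_rate] * n_bases)
```

Comparing ll_true with ll_flat shows how much better a position-dependent intensity explains the counts than a constant one. In real data the region boundaries are unknown, which is where smoothing methods come in.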
Example: Histone Modification H3K4me1
Can you spot the difference?
[Figure: two panels, "Left Ventricle, H3K4me1" and "Right Ventricle, H3K4me1";
horizontal axis from 32230000 to 32290000, vertical axis from 0.00 to 0.08.]
Data from Scott Smemo, Nobrega lab
Advertisement: STAT 45800
We have preliminary ideas and methods for dealing with these
data, based on wavelets for count data (work with H. Shim).
In STAT 45800 we will try “crowd-sourcing” these ideas, to see
how much further progress we can make.
Aim: to combine expertise in Bioinformatics, Computing, Biology
and Statistics, to make more progress together than any of us
could do alone!
Acknowledgements
- Bryan Howie, Heejung Shim.
- Funding: NHGRI, NIH GTEX project, and NIH ENDGAME
  consortium.