Download Association mapping

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Neurogenomics wikipedia , lookup

Twin study wikipedia , lookup

Genetic algorithm wikipedia , lookup

Behavioural genetics wikipedia , lookup

Heritability of IQ wikipedia , lookup

Transcript
How could you test with a statistical test if there is an association between Gcn4p
upstream vs. ORF binding and gene expression?
1. T-test to see if mean fold-change upon amino acid starvation is different for genes
bound upstream vs. in ORF by Gcn4p
2. Hypergeometric test to see if upstream-bound genes are enriched for ‘induced’ genes
(would require pre-determining which genes are ‘induced’)
Genetic mapping: linking phenotype to genotype
Lab studies of gene knockouts, RNAi, and mutagenesis screens can reveal phenotypes
A major goal is to do the reverse: use natural phenotypic variation to
identify causal variants
 reveals phenotypic and genetic variation relevant in ‘the wild’
Goal: Identify QTL (Quantitative Trait Locus) or QTN (Quantitative Trait Nucleotide)
that significantly correlate with (and there likely explain) the phenotype
Two strategies for genetic mapping
Linkage Mapping
Mate two parents with opposite
phenotypes and score progeny
Association Mapping (GWAS)
Score many individuals from
natural (randomly mating) populations
Ability to identify QTL/QTLs requires recombination to mix and match the genomes.
Linkage mapping: generate recombination through crosses. Generally need many individuals
(or many generations: increased recombination frequency = smaller regions)
Association mapping: uses historical recombination events between individuals. Generally
requires fewer individuals for the same statistical power.
Goals of QTL linkage mapping
To identify the loci that
contribute to phenotypic
variation
1. Cross two parents with
extreme phenotypes
2. Phenotype all the progeny
3. Genotype the progeny at
markers across the genome
4. Associate the observed
phenotypic variation with the
underlying genetic variation
Ades 2008, NHGRI
5. Ultimate goal: identify causal
polymorphisms that explain the
phenotypic variation
4
Backcross
Phenotype:
Drug tolerance
80%
20% viability
Usually have at least 100 individuals
Broman and Sen5 2009
Intercross
Phenotype:
Drug tolerance
80%
20% viability
Can reveal AA, BB, and AB
genotypes. Takes more individuals
to map, due to more intricate
genotypes generated
Broman and Sen6 2009
Genetic map: specific markers spaced across the genome
Markers can be:
• SNPs at particular loci
• Variable-length repeats
e.g. ALU repeats
• ALL polymorphisms
(if have whole genomes)
Ideally, markers should
be spaced every 10-20 cM
and span the whole genome
7
Genotype data: Determine allele at all markers in each F2
Phenotype:
e.g. drug
tolerance score
8
Statistical framework
1. Missing Data Problem
Use marker data to infer intervening genotypes
2. Model Selection Problem
How do the QTL across the genome combine with the covariates to
generate the phenotype?
9
Broman and Sen 2009
Marker regression: simple T-test (or ANOVA) at each marker
Marker 1: no QTL
Marker 2: significant QTL
(population means are different)
10
Marker regression
Advantages:
• Simple test – standard T-test/ANOVA
• Covariates (e.g. Gender, Environment) are to incorporate
• No genetic map necessary, since test is done separately on each marker
Disadvantages:
• Any individuals with missing marker data must be omitted from analysis
• Does not effectively consider positions between markers
• Does not test for genetic interactions (e.g. epistasis)
• The effect size of the QTL (i.e. power to detect QTL) is reduced by incomplete
linkage to the marker
11
• Difficult to pinpoint QTL position, since only the marker positions are considered
Interval mapping
• Lander and Botstein 1989
• In addition to examining phenotype-genotype associations at markers, look for
associations between makers by inferring the genotype
Q
• The methods for calculating genotype probabilities between markers typically use
hidden Markov models to account for additional factors, such as genotyping errors
12
Interval mapping
Advantages:
• Takes account of missing genotype information – all individuals are included
• Can scan for QTL at locations in between markers
• QTL effects are better estimated
Disadvantages:
• More computation time required
• Still only a single-QTL model – cannot separate linked QTL or examine for
interactions among QTL
13
LOD scores
• Measure of the strength of evidence for the presence of a QTL
at each marker location
LOD(λ) = log10 likelihood ratio comparing the hypothesis of a QTL at position λ
versus that of no QTL
Phenotype
log10
{
Pr(y|QTL at λ, µAAλ, µABλ, σλ)
Pr(y|no QTL, µ, σ)
}
LOD 3 means that the TOP model is
103 times more likely than
the BOTTOM model
14
LOD curves
Chromosome
How do you know which peaks are really significant?
15
LOD threshold
•Consider the null hypothesis that there are no QTLs genome-wide
one location
genome-wide
1. Randomize the phenotype labels on the relative to the genotypes
2. Conduct interval mapping and determine what the maximum LOD score is
genome-wide
3. Repeat a large number of times (1000-10,000) to generate a null distribution
of maximum LOD scores
16
Broman and Sen 2009
LOD threshold
• 1000 permutations
10% False Discovery Rate = LOD 3.19
(means that at this LOD cutoff 10% of peaks could be random chance)
5% FDR = LOD 3.52
• Boundary of the peak is often taken as points that cross (Max LOD – 1.5)
(or - 1.8 for an intercross)
17
Association Mapping
Haplotype blocks:
linked alleles
segregating together
means that only
subsets of SNPs
need to be genotyped.
Relies on historic matings in “randomly mating” populations.
Most populations are not randomly mating – therefore need to consider
population structure.
TASSEL:
Trait Analysis by aSSociation, Evolution and Linkage
Bradbury et al. (Buckler Lab)
Genotypes
for 65 strains
Phenotypes
for 65 strains
Phylogenetic
Relatedness
Population
Structure
1.0
0.8
0.6
0.4
0.2
0.0
Laboratory
BY4741
S288c
W303
FL100
SK1
Y55
YJM975
YJM981
YJM978
322134S
273614N
YJM789
378604X
YJM326
YJM428
YJM653
YJM320
YJM421
YJM451
YS9
YS2
YS4
CLIB215
CLIB324
JAY291
CBS7960
DBVPG1788
DBVPG1106
DBVPG1373
DBVPG6765
L-1374
L-1528
RM11_1A
BC187
YIIc17_E5
WE372
T73
NCYC110
DBVPG6044
Y12
K11
Y9
DBVPG6040
NCYC361
DBVPG1853
CLIB382
UC5
PW5
YPS163
YPS606
YPS128
NC-02
YPS1009
T7
UWOPS05-227.2
UWOPS05-217.3
UWOPS03-461.4
Y10
IL-01
YJM269
M22
I14
MUSH
LEP
CRB
UWOPS87-2421
UWOPS83-787.3
Random
Error
Clinical
Wine
Strains
Bio
Baking Fuel
Other fermentation Oak
Nature
0.0000
Dana Wohlbach
0.2000
0.4000
Random
Error
Association Mapping
-log p-value
FDR threshold set by permutation analysis or q value correction
Meta-analysis of 15 GWAS studies of IBD = 75,000 people.
- Imputation-based GWAS: imputed SNPs where there was missing data
(using known haplotypes and human HapMap3 reference data)
- Identified 25,075 SNPs that were significantly associated with IBD (p < 0.01)
… collapsed these into 163 IBD-associated loci
* 71 of these are new, due to increased statistical power
* 163 loci is “far more” than associated with any other complex disease
* More SNPs are linked to non-coding/regulatory variation than missense
- Significant overlap with SNPs linked to immunodeficiences & bacterial infection
- 13.6% of the phenotypic variance is explained by all these loci together
The challenge with human GWAS: missing heritability
>1200 variants associated with ~165 complex human diseases.
In most cases, known loci account for only 20-30% of the heritable phenotypic variance.
PNAS 2011
They argue much of the “missing” heritability is not really missing: additive interactions
(without considering epistasis) can only account for so much …
Investigating Epistasis: Genetic Interactions?
Significant SNP (FDR 1%)
Insignificant SNP
1
16
2
3
15
5
14
6
13
Linear model for
each pair of
significant SNPs
7
12
11
8
10
Dana Wohlbach
4
How much of ‘missing’
heritability is explained
by epistasis?
9
Many significant interactions between SNPs
Significant SNP (FDR 1%)
Insignificant SNP
1
16
2
3
15
5
14
4
6
13
0.1% FDR (82 pairs)
1.0% FDR (413 pairs)
Genetic or Physical (SGD)
7
12
Significant interaction at:
11
8
10
Dana Wohlbach
9