HAPRAP: Haplotype-based iterative
method for fine mapping using GWAS
summary data
Zheng et al., Sept 2016
Questions: (i) “Can we narrow down the list of all SNPs at an
associated (GWAS) region/locus to a smaller ‘credible’ set?”
(ii) “Are there any other independent signals?”
Nice reviews:
(i) Strategies for fine-mapping complex traits. Spain and Barrett, 2015, Hum
Mol Genet
(ii) Fine Mapping Causal Variants with an Approximate Bayesian Method Using
Marginal Test Statistics. Chen et al, 2015. Genetics
Journal club: 21/09/16
Mesut Erzurumluoglu
Population Genetics 101

Linkage disequilibrium (LD)
◦ Non-random association of alleles at two or more loci (under random association, alleles at two unlinked loci are co-inherited 50% of the time)
Haploview software
LD example
Locus 1: major allele A, minor allele a
Locus 2: major allele B, minor allele b
Observed haplotype frequencies: AB=0.2, Ab=0.5, aB=0.3, ab=0
D = ? (unstandardised measure of how far the association between two alleles differs from that expected by chance)
D’ = ? (D standardised to the maximum possible value it can take)
r2 = ? (squared correlation coefficient)
Another example
Observed haplotypes: AB=0.8, Ab=0, aB=0, ab=0.2
=> Alleles: A=0.8, a=0.2, B=0.8, b=0.2
=> Expected haplotypes: AB=0.64, Ab=0.16, aB=0.16, ab=0.04
D = 0.16
D’ = 1
r2 = 1
Calculations:
D = 0.8-(0.8x0.8)=0.8-0.64=0.16
D’ = 0.16/0.16=1
r^2 = (0.16)^2/(0.8x0.2x0.8x0.2)=1 (great for imputation!)
Possible explanations:
i) Combinations A and b, and a and B are highly disadvantageous to the organism
ii) Could be a (small and) highly consanguineous/endogamous population
iii) Very small and isolated population
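The D, D’ and r2 calculations in the two LD examples can be sketched in a few lines of Python. The function name `ld_stats` and its argument names are illustrative assumptions, not part of Haploview or any other tool. D’ is returned with the sign of D, so its absolute value is the usual 0 to 1 measure.

```python
def ld_stats(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, D', r2) from the four haplotype frequencies of two biallelic loci."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Allele frequencies are the row/column sums of the haplotype table.
    p_A = f_AB + f_Ab
    p_a = 1.0 - p_A
    p_B = f_AB + f_aB
    p_b = 1.0 - p_B

    # D: observed AB frequency minus the frequency expected under independence.
    D = f_AB - p_A * p_B

    # D': D divided by the largest magnitude D can reach given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = D / d_max if d_max > 0 else 0.0

    # r2: squared correlation between the two loci.
    denom = p_A * p_a * p_B * p_b
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


first = ld_stats(0.2, 0.5, 0.3, 0.0)    # the "LD example" slide
second = ld_stats(0.8, 0.0, 0.0, 0.2)   # the "Another example" slide
print([round(v, 3) for v in first])
print([round(v, 3) for v in second])
```

For the second example this reproduces D = 0.16, D’ = 1 and r2 = 1. For the first example it gives D = -0.15, D’ = -1 and r2 of roughly 0.43: the pair is in complete LD by D’ (one haplotype is missing) but only moderately correlated by r2.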
LD (2)

LD between alleles can be influenced by:
◦ Selection
▪ Lack of sunlight: white skin and blue eyes
◦ Rate of recombination
◦ Rate of mutation
◦ Genetic drift
◦ System of mating
◦ Population structure
◦ Genetic linkage
Haplotype

A set of variants that are inherited
together – found on the same
chromosome
◦ The closer the variants, (usually) the smaller the probability of recombination between them
▪ Therefore fewer haplotypes are found than the maximum possible (as in previous slides, i.e. 3/4)
◦ Higher LD between variants means the respective haplotype is inherited more often
Introduction to study

GWASs provide a powerful approach for identifying
variants associated with complex human diseases/traits
◦ However, identifying the ‘causal’ variant(s) is challenging due
to LD between SNPs that are close

Fine mapping (in a GWAS setting) is the process of
narrowing-down a list of associated variants to a
‘credible’ set of most likely causal variants
◦ Prioritizes most-informative variants

Many tools/methods out there: Multiple regression,
GCTA-COJO, CAVIAR-BF, BIMBAM, PAINTOR,
Wakefield 2007 (Bayesian), SSSRAP
◦ Multiple regression is best if individual level (and dense SNP) data are available
▪ But this is rarely the case for large GWASs
▪ Time consuming
Introduction to study (2)
• Using the ‘top SNP’ to represent a region is problematic as there may be several causal SNPs
• Existing state-of-the-art methods (e.g. GCTA-COJO) use r2 between SNPs to represent LD structure
◦ Problematic when there are more than two causal variants in a region as LD information may be lost
▪ May introduce constraints on the max/min values for pairwise LD
Existing Fine-mapping methods
Adapted from: S.L. Spain and J.C. Barrett, Hum Mol Genet, 2015
Existing methods (continued)
• HAPRAP: approximate conditional analysis using haplotypes
• GCTA-COJO: approximate conditional analysis using pair-wise LD
• PAINTOR: integrates association strength with functional genomic annotation data
• CAVIARBF: Bayesian method with good accuracy
• FINEMAP: CAVIARBF + shotgun stochastic search algorithm, much faster and allows genome-wide fine mapping
• RiVIERA: Bayesian method + transcription and cis-regulatory element annotations
• JAM: Bayesian penalized regression with variable selection, designed for the analysis of quantitative traits only
Theory
• Traditional fine-mapping methods, such as conditional analysis, need genotype and phenotype data for each individual
• More and more fine-mapping methods, such as GCTA-COJO, CAVIAR-BF and FINEMAP, use ‘GWAS summary results + LD reference panel’ to identify causal variants
• These methods use pair-wise LD + MAF information to represent LD between SNPs. In regions with three or more causal variants, such settings may lose LD information
• Haplotypes, which represent combinations of co-inherited alleles within the same chromosome, are a more biologically plausible way of representing LD among multiple loci. Fine mapping using haplotypes will pick up the LD information that is not detected using pairwise LD measures

Theory

HAPlotype Regional Association analysis
Program (HAPRAP)
◦ Rationale behind developing HAPRAP: Using
haplotypes is biologically a better
representation of LD compared to r2 (i.e.
pairwise correlation between SNPs)
◦ Iterative method – haplotype effects updated
based on haplotype frequencies and observed
(single) SNP effects from meta-analyses to
estimate joint SNP effect
Methods – HAPRAP overview
Input
• GWAS summary results
• Haplotypes
HAPRAP
• Joint effect analysis
• Conditional analysis
Output
• Identify possible causal SNPs
• Independent effect of each SNP
Methods

Extend a single-SNP based linear regression to a multi-locus based regression by:
i. Dichotomising haplotypes into two groups (by the effect allele of SNP j)
ii. Treating each group as a bivariate allele
Methods (continued)
• For each SNP, haplotypes were split into two groups:
1) HEj is the set of haplotypes containing the effect allele of SNP j;
2) HBj is the set of haplotypes containing the baseline allele of SNP j.
• Marginal SNP effect of j = St(Effect of HEj - Effect of HBj)
Estimated β in the gth iteration = Sum of differences between the two groups
standardised by the haplotype frequencies
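A minimal sketch of the haplotype split described above, assuming that "standardised by the haplotype frequencies" means taking the frequency-weighted mean effect within each group. That reading, the function name `marginal_snp_effect`, the 0/1 allele coding and the example numbers are all assumptions for illustration, not taken from the HAPRAP software.

```python
def marginal_snp_effect(haplotypes, freqs, hap_effects, j):
    """Frequency-weighted mean effect of HEj minus that of HBj for SNP j.

    haplotypes  : list of tuples, 1 = effect allele, 0 = baseline allele
    freqs       : haplotype frequencies (same order as haplotypes)
    hap_effects : effect assigned to each haplotype
    """
    freq_e = sum_e = 0.0   # HEj: haplotypes carrying the effect allele of SNP j
    freq_b = sum_b = 0.0   # HBj: haplotypes carrying the baseline allele of SNP j
    for hap, f, eff in zip(haplotypes, freqs, hap_effects):
        if hap[j] == 1:
            freq_e += f
            sum_e += f * eff
        else:
            freq_b += f
            sum_b += f * eff
    if freq_e == 0 or freq_b == 0:
        raise ValueError("SNP j is monomorphic in the reference haplotypes")
    return sum_e / freq_e - sum_b / freq_b


# Hypothetical two-SNP reference with three haplotypes.
haplotypes = [(1, 1), (1, 0), (0, 1)]
freqs = [0.2, 0.5, 0.3]
hap_effects = [0.3, 0.1, 0.2]
for j in range(2):
    print(j, round(marginal_snp_effect(haplotypes, freqs, hap_effects, j), 4))
```

With these made-up numbers the marginal effect of the first SNP is about -0.043 and that of the second is 0.14, even though every haplotype has a positive effect: the marginal effect only reflects the difference between the two groups.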
Methods (continued)
Iterative method used to estimate
haplotype effects from single SNP based
linear regression (GWAS) results:
Step (i) Randomly assign an effect to
each SNP (seed: between 10 and -10)
Step (ii) Parse these effects into
haplotype reference and estimate effect
for each (haplotype) group
Step (iii) Estimate β for each SNP and
cross-check against meta-analysis results
Step (iv) If different, repeat after
adjusting the β of SNP with the greatest
deviation – iterate until estimated
haplotype effects agree with observed
single SNP meta-analysis results
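Steps (i) to (iv) can be sketched as a greedy coordinate update. This is an illustration under stated assumptions rather than the HAPRAP implementation: haplotype effects are taken to be the sum of the joint effects of the effect alleles carried (additive model), the adjustment in step (iv) simply adds the deviation to the worst-fitting SNP, and all names and numbers are hypothetical.

```python
import random


def implied_marginals(haplotypes, freqs, joint):
    """Marginal effect of each SNP implied by additive joint SNP effects."""
    # Step (ii): effect of each haplotype = sum of joint effects of alleles it carries.
    hap_eff = [sum(b for a, b in zip(h, joint) if a) for h in haplotypes]
    out = []
    for j in range(len(joint)):
        fe = sum(f for h, f in zip(haplotypes, freqs) if h[j])
        fb = sum(f for h, f in zip(haplotypes, freqs) if not h[j])
        me = sum(f * e for h, f, e in zip(haplotypes, freqs, hap_eff) if h[j]) / fe
        mb = sum(f * e for h, f, e in zip(haplotypes, freqs, hap_eff) if not h[j]) / fb
        out.append(me - mb)   # weighted mean of HEj minus weighted mean of HBj
    return out


def fit_joint_effects(haplotypes, freqs, observed, tol=1e-10, max_iter=10000, seed=1):
    rng = random.Random(seed)
    # Step (i): random starting effect for each SNP, between -10 and 10.
    joint = [rng.uniform(-10, 10) for _ in observed]
    for _ in range(max_iter):
        # Step (iii): compare implied marginal effects with the observed ones.
        implied = implied_marginals(haplotypes, freqs, joint)
        devs = [o - m for o, m in zip(observed, implied)]
        worst = max(range(len(devs)), key=lambda j: abs(devs[j]))
        if abs(devs[worst]) < tol:
            break
        # Step (iv): adjust the SNP with the greatest deviation and repeat.
        joint[worst] += devs[worst]
    return joint


# Hypothetical reference panel: 6 haplotypes over 3 SNPs (1 = effect allele).
haplotypes = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (0, 0, 0), (1, 1, 1)]
freqs = [0.3, 0.2, 0.15, 0.15, 0.1, 0.1]
true_joint = [0.5, -0.2, 0.1]                      # made-up joint effects
observed = implied_marginals(haplotypes, freqs, true_joint)
estimated = fit_joint_effects(haplotypes, freqs, observed)
print([round(b, 4) for b in estimated])
```

In this toy setting the loop recovers the joint effects used to generate the marginal effects, because no pair of SNPs is in perfect LD. With SNPs in perfect LD the joint effects are not identifiable and this simple update would not settle on a unique answer.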
Methods - simHAPRAP
• A bootstrap method used to calculate central estimates and standard errors of the joint SNP effects.
• Simulate a population with genotypes and phenotypes:
◦ Sample size equal to the total number of participants in the meta-analysis
◦ Genotypes from the haplotype reference panel
◦ Phenotypes from a normal distribution with mean equal to zero and SD equal to the observed standard deviation of the phenotype
• Run the simulation 2000 times, derive mean and standard deviation (SD).
• The SD is the standard error of the joint SNP effect
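The simulation idea can be illustrated with a deliberately simplified sketch: one SNP and an ordinary single-SNP regression slope, rather than the joint multi-SNP effects that simHAPRAP estimates. The function name, the toy reference panel and the sample sizes are assumptions chosen so the example runs quickly.

```python
import random
import statistics


def simulated_se(haplotype_alleles, n_samples, pheno_sd, n_reps=2000, seed=1):
    """Mean and SD of null single-SNP slope estimates across simulated datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        # Genotype = sum of two alleles drawn from the reference haplotypes.
        g = [rng.choice(haplotype_alleles) + rng.choice(haplotype_alleles)
             for _ in range(n_samples)]
        # Phenotype: normal with mean 0 and SD equal to the observed phenotype SD.
        y = [rng.gauss(0.0, pheno_sd) for _ in range(n_samples)]
        g_bar = sum(g) / n_samples
        y_bar = sum(y) / n_samples
        sxx = sum((gi - g_bar) ** 2 for gi in g)
        if sxx == 0:
            continue  # no genotype variation in this replicate
        sxy = sum((gi - g_bar) * (yi - y_bar) for gi, yi in zip(g, y))
        estimates.append(sxy / sxx)   # least-squares slope of y on g
    # The SD across replicates serves as the standard error of the effect.
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical panel of 100 haplotypes with effect-allele frequency 0.3.
panel = [1] * 30 + [0] * 70
mean_beta, se_beta = simulated_se(panel, n_samples=200, pheno_sd=1.0, n_reps=300)
print(round(mean_beta, 3), round(se_beta, 3))
```

For a single SNP the result can be checked against the textbook standard error, SD / sqrt(N x 2p(1-p)), which is about 0.11 for these inputs. The simulation becomes useful when no such closed form is available, as for the joint effects, at the cost of run time.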
Datasets

• ALSPAC (n=8363) data used as haplotype reference panel
◦ SHAPE-IT used to phase haplotypes
• BWHHS cohort data (n=5425)
• UCLEB (QTc interval, n=7106)
• GIANT consortium (height, n=253288)
◦ Three regions: ACAN, ADAMTS17, PTCH1
• Gall bladder disease (n=15213)
• 1000 Genomes – simHAPRAP

Results – HAPRAP and GCTA-COJO (simulation)
[Figure: x-axis is sample size (N) on a log10 scale]
Results – HAPRAP and GCTA-COJO (simulation, continued)
Results – Real case example:
GIANT (height) data
• Total of 4195 SNPs in three
genes, 782 SNPs for ACAN,
1477 SNPs for ADAMTS17
and 1936 SNPs for PTCH1
• Using 8263 unrelated
ALSPAC children as
reference panel.
• Found two additional SNPs
independently associated
with height:
1) rs357564: a missense
variant in PTCH1 with
joint effect of -0.034
2) rs1529889: an intronic
variant in ADAMTS17
with joint effect of
0.019
Conclusions

HAPRAP uses GWAS summary data to
carry out an ‘approximate’ conditional
analysis and narrow down list of SNPs to
a credible set
◦ Also selects independent SNPs

Relatively better accuracy compared to
GCTA-COJO when the sample size of
the meta-analysis is limited (N ≤5000)
Discussion

HAPRAP’s advantages over other methods
◦ Biologically makes more sense
◦ Makes better use of summary level data
▪ Compared to GCTA-COJO and SSSRAP
▪ Especially if there are several causal SNPs at a locus
▪ Can handle rare variants better than others
▪ Demonstrated using BWHHS data (APOB SNP, MAF=0.18%)
◦ Considers all loci simultaneously rather than pairwise
▪ More immune to poor LD estimates due to mismatches between reference panel and meta-analysis samples
Discussion (continued)

• Reference panel was derived from 1000GP (older version)
◦ Would be interesting to see how results would change/improve with the new updated reference panel (McCarthy et al, 2016)
• HLA regions are being investigated (as an addition in next version)
• Easily applied to different ethnicities – with a matching reference panel
• Will be interesting to see whether rs357564 and rs1529889 are true signals

Limitations of HAPRAP

• Assumptions
◦ Haplotypes and frequencies in the reference panel and the meta-analysis set are the same
◦ Additive model
• Standard error estimation can be time consuming
• Limited number of SNPs for each analysis (ideally <20 SNPs in a region)

Appendices
Fine-mapping software download URLs
• HAPRAP: http://apps.biocompute.org.uk/haprap/
• GCTA-COJO: http://cnsgenomics.com/software/gcta/cojo.html
• PAINTOR: http://bogdan.bioinformatics.ucla.edu/software/paintor/
• CAVIARBF: https://bitbucket.org/Wenan/caviarbf
• JAM: https://github.com/pjnewcombe/R2BGLiMS
• FINEMAP: http://www.christianbenner.com/
• RiVIERA: https://github.com/yueli-compbio/RiVIERA-MT
LD-Hub for LD Score Regression
The basic idea is that the more genetic variation a marker tags, the higher the probability that it will tag a causal variant, so association test statistics should increase with linkage disequilibrium score (LD score). In contrast, inflation due to population stratification should not be correlated with LD score
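A toy sketch of this idea: the LD score of a SNP is the sum of its r2 with the SNPs around it, and regressing chi-square statistics on LD score separates a slope (polygenic signal) from an intercept (confounding such as stratification). The r2 matrix and chi-square values below are made up, and plain unweighted least squares is used for brevity, whereas the LD Score Regression software uses a weighted regression.

```python
def ld_scores(r2_matrix):
    """LD score of each SNP: sum of r2 with every SNP in the window (itself included)."""
    return [sum(row) for row in r2_matrix]


def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical pairwise r2 values for four SNPs.
r2_matrix = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
scores = ld_scores(r2_matrix)

# Made-up chi-square statistics that rise linearly with LD score.
chi2 = [1.0 + 0.05 * s for s in scores]
intercept, slope = ols(scores, chi2)
print(round(intercept, 3), round(slope, 3))
```

Here the fit returns the intercept of 1 and slope of 0.05 used to build the data. In real data an intercept above 1 points to inflation that does not track LD, which is how stratification is told apart from polygenicity.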
[Figure: two panels, “Univariate analysis” and “Bivariate analysis”, each plotting chi-square (y-axis, 0 to 10) against LD Score (x-axis, 0 to 100)]
LD-Hub: a database of harmonized GWAS summary data
GWAS studies that have released all their results
• 65 studies + (36 consortia)
• 1000 trait analyses
• >2 billion SNP-phenotype associations
• >1.5 million individuals
• 89 diseases
◦ 9 cancers
◦ 19 psychiatric/neurological
◦ 45 auto./inflammatory
◦ 6 cardiovascular
◦ 4 diabetic
◦ 6 other
• 154 risk factors
◦ 75 anthropometric
◦ 6 behavioural
◦ 24 glycaemic
◦ 9 lipids
◦ 4 blood pressure
◦ 6 hematological
◦ 30 other
• Other
◦ 576 metabolites
◦ 151 immune traits
GWAS studies that have released subsets of results
• 2414 GWAS studies†
• ~80,000 traits
• ~70,000 SNP-trait associations
eQTL/pQTLs
• 12,000 gene expression
• 146 protein expression
LD-Hub web interface
Link to web interface: ldsc.broadinstitute.org/ldhub
BioRxiv link: http://biorxiv.org/content/early/2016/05/03/051094