A Markov Chain Monte Carlo Application: Identification of
Allelic Pattern Sampler: Genetic Combinations Underlying Complex Diseases
Polygenic diseases (traits)
• Susceptibility to a polygenic disease arises from the contribution of a set of genes.
• Heterogeneity: different genetic backgrounds give rise to the same disease.
• The disease outcome is correlated with the genetic background rather than determined by it.
Environmental effect or heterogeneity: gang-specific eyebrows. A common signature is improbable.
Polygenic contribution
The genes can contribute independently, in an additive way.
The genes can interact (epistasis).
The genes can behave as interacting only with respect to the disease.
• Complementary alleles. The manifestation of an allele's trait requires an allele of another gene.
• Alternative pathways.
The pattern concept.
An example: image recognition
[Figure: image-recognition example with panels labeled (1,0), (1,1), (1,1/2), (1/2,1/2), (1/2,1), (0,1)]
Allelic (genetic) pattern
We know the levels of a trait (i.e. disease) for a group of persons and we know the alleles of candidate genes that these persons carry.
A pattern is a set of alleles of the genes whose presence in a genome as a whole is associated with the trait.
• Any subset of the pattern is associated less reliably than the whole pattern is. So is any superset. Thus, a pattern is a locally minimal set satisfying the statements above.
• A pattern may contain only one allele.
Example of a genetic pattern for a complex polygenic disease.
Cross-sectional comparison of MS patients and controls among carriers and non-carriers of alleles of the DRB1 HLA gene, the CCR5 chemokine receptor gene deletion, and their combination.
[Figure: three bar charts, 0% to 100%. The solid line marks the ratio expected for an independent combination.]

                          Controls   Patients
DR4                           48         49
non-DR4                      183        163
CCR5 Del                      40         52
non-(CCR5 Del)               191        160
DR4 + CCR5 Del                 1         17
non-(DR4 + CCR5 Del)         230        195

DR4 + CCR5 Del: OR 20.1, p<0.0001
Favorova OO, Andreewski TV, Boiko AN, Sudomoina MA, Alekseenkov AD, Kulakova OG,
Slanova AV, Gusev EI. 2002. The chemokine receptor CCR5 deletion mutation is
associated with MS in HLA-DR4-positive Russians. Neurology 59(10):1652-5.
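The odds ratio quoted for the DR4 + CCR5 Del combination can be checked from the counts on the slide (17 carriers among 212 patients, 1 carrier among 231 controls). The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, and the two-sided Fisher exact test is one common convention (sum of all table probabilities not exceeding that of the observed table), which may differ from the test variant used for the slide.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: patients, controls. Columns: carriers, non-carriers (counts from the slide).
patients = (17, 195)
controls = (1, 230)

or_value = odds_ratio(*patients, *controls)
p_value = fisher_exact_two_sided(*patients, *controls)
print(f"OR = {or_value:.1f}, p = {p_value:.2g}")
```

The odds ratio is (17 × 230) / (195 × 1) ≈ 20.05, which rounds to the 20.1 shown on the slide.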
Patterns hide each other
The union of two combinations may require more than two alleles in a locus:
....|0 0 | a b | 0 0 |....
....|0 0 | c 0 | 0 0 |....
The strongest association (not necessarily the most reliable one) statistically shadows all the others.
[Figure: disease level]
Independence question
We cannot define a proper space of patterns, because the operation of addition (as a union of allelic sets) is not defined for every pair of patterns. Thus, we cannot apply a component analysis technique.
Set of patterns
Since we cannot consider one pattern apart from the others, we consider a set of patterns simultaneously.
Mutual isolation of patterns
We say that a pattern is considered isolated from a set of other patterns if we remove the influence of all the other patterns before we consider our pattern's association with the trait.
• This is analogous to an adjustment procedure.
Data
• We have genotypic data and phenotypic trait level data for some individuals.
• The trait levels are comparative characteristics. They cannot be measured; they can only be compared.
• We want to obtain the allelic patterns that best characterize the relation between genotypic and phenotypic data.
We will look for a whole set of patterns that maximises the probability that all the patterns are associated with the disease in the mutually isolated manner.
• A good pattern set forms a kind of “gradient basis” in the genome-trait association.
Data structures
The set of patterns is a variable to be optimized.
The correspondence of the two matrices below (trait level and incidence matrix) shows the quality of the set of patterns.

Set of patterns
0 0 | d 0 | 0 0 |....
0 0 | a 0 | 0 0 |....
0 f | 0 0 | b 0 |....

Trait level   Incidence matrix   Gene data
0.1           1 0 0              a c | d d | f s |....
0.4           0 1 1              c f | a b | b a |....
0.7           0 0 0              a a | c b | a c |....
0.9           0 0 1              c f | f b | b s |....
0.2           1 1 1              a f | a d | b c |....
…             .......            ........................
The incidence classification
All the cases are classified into 2^n possible classes, where n is the number of patterns, based on the row in the incidence matrix.
Incidence matrix
1 0 0
0 1 1
0 0 0
0 0 1
1 1 1
1 0 1
0 1 0
1 0 1
0 0 1
.......
The classes can be represented by the vertices of a hypercube. A set of parallel edges of the cube corresponds to a pattern.
[Figure: cube with vertices 000, 001, 010, 011, 100, 101, 110, 111; one set of parallel edges is marked as the direction of the second pattern.]
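As an illustration of how an incidence matrix follows from the gene data and a set of patterns, here is a minimal Python sketch using the five example genotypes and the three example patterns from the data structures slide. The helper names (`parse`, `carries`, `incidence_matrix`) are hypothetical, and the carriage rule (every non-zero allele of the pattern must be present in the same locus of the genotype) is an assumption read off the example, not a statement of APSampler's internals.

```python
def parse(row):
    """Turn 'a c | d d | f s' into a list of loci, each a tuple of two alleles."""
    return [tuple(cell.split()) for cell in row.split("|")]


def carries(genotype, pattern):
    """True if every non-zero allele of the pattern is present in the same locus."""
    for locus_alleles, pattern_alleles in zip(genotype, pattern):
        available = list(locus_alleles)
        for allele in pattern_alleles:
            if allele == "0":          # "0" marks an empty slot of the pattern
                continue
            if allele not in available:
                return False
            available.remove(allele)   # a pattern asking for "a a" needs two copies
    return True


def incidence_matrix(genotypes, patterns):
    """One row per individual, one 0/1 column per pattern."""
    return [[int(carries(g, p)) for p in patterns] for g in genotypes]


patterns = [parse(r) for r in ("0 0 | d 0 | 0 0", "0 0 | a 0 | 0 0", "0 f | 0 0 | b 0")]
genotypes = [parse(r) for r in (
    "a c | d d | f s",
    "c f | a b | b a",
    "a a | c b | a c",
    "c f | f b | b s",
    "a f | a d | b c",
)]

matrix = incidence_matrix(genotypes, patterns)
# Each row, read as a binary string, names a vertex of the hypercube.
classes = ["".join(map(str, row)) for row in matrix]
print(classes)
```

With three patterns there are 2^3 = 8 possible classes; the five example individuals fall into the classes 100, 011, 000, 001 and 111, matching the first five rows of the incidence matrix on the slide.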
A pair of classes comparison
Two classes of trait levels that lie on the same edge differ due to the “isolated” influence of the edge's pattern. So, we base the consideration of the pattern set on such pairwise comparisons.
[Figure: cube with vertices 000 to 111; two adjacent classes on one edge are marked x and y.]
We can only compare the disease (trait) levels, so the appropriate statistic for the comparison is the number of inversions.
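A minimal sketch of an inversion count for two classes of trait levels follows. It assumes that an inversion is a pair (x, y), with x from class X and y from class Y, in which x exceeds y, i.e. the pair is ordered against the expectation that Y carries the higher levels. The slides do not spell out the treatment of ties, so ties are simply not counted here; the function name is hypothetical.

```python
def count_inversions(x_levels, y_levels):
    """Number of pairs (x, y) with x > y; uses only comparisons, never differences."""
    return sum(1 for x in x_levels for y in y_levels if x > y)


# Hypothetical trait levels for two adjacent classes.
x_class = [0.1, 0.4, 0.2]
y_class = [0.7, 0.9, 0.3]

inversions = count_inversions(x_class, y_class)
total_pairs = len(x_class) * len(y_class)
print(inversions, "inversions out of", total_pairs, "pairs")
```

Because only comparisons are used, the statistic stays meaningful when the trait levels are merely ordinal, which is the situation described in the Data slide.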
A pair of classes. Alternative hypotheses.
To test a pair of adjacent classes, we formulate three hypotheses about the corresponding pattern:
• null hypothesis: X and Y have the same median, e.g. X≡Y
• “positive” hypothesis: median (Y) > median (X) (predisposing pattern)
• “negative” hypothesis: median (Y) < median (X) (protective pattern).
We compare the hypotheses in a Bayesian paradigm.
The likelihoods for a pair: example
[Figure: likelihood p (axis up to 0.25) versus the number of inversions (inv#, 0 to 8); four lines labeled +, -, null and const.]
The larger the minor class is, the sharper all the likelihoods are. If its size is 1 or 0, all 4 lines coincide.
The null-hypothesis posterior for a pattern
\[
P(H_0 \text{ for a pattern} \mid \text{data}) =
\frac{P(\text{data} \mid H_0)\,P(H_0)}
     {P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_+)\,P(H_+) + P(\text{data} \mid H_-)\,P(H_-)}
\]
• A pattern's likelihood for a hypothesis is a product of the likelihoods of all corresponding class pairs.
• If a pattern is carried by all the genomes in the data or is not carried by any (it is uninformative), the null-hypothesis prior for the pattern is 1. For informative patterns, we use a uniform prior.
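The posterior above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the per-pair likelihood values are hypothetical inputs (the slides show them only as curves), the hypothesis keys "0", "+" and "-" are naming choices made here, and the uniform prior is taken as 1/3 for each of the three hypotheses.

```python
from math import prod

HYPOTHESES = ("0", "+", "-")   # null, positive (predisposing), negative (protective)


def null_posterior(pair_likelihoods, informative=True):
    """Posterior of the null hypothesis for one pattern.

    pair_likelihoods: one dict per class pair, mapping each hypothesis
    to the likelihood of that pair's data under the hypothesis.
    """
    if not informative:
        # Uninformative pattern: the null prior is 1, so the posterior is 1 too.
        return 1.0
    prior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}
    # A pattern's likelihood is the product over all of its class pairs.
    likelihood = {h: prod(pair[h] for pair in pair_likelihoods) for h in HYPOTHESES}
    evidence = sum(likelihood[h] * prior[h] for h in HYPOTHESES)
    return likelihood["0"] * prior["0"] / evidence


# Hypothetical likelihoods for a pattern with two class pairs.
pairs = [
    {"0": 0.10, "+": 0.30, "-": 0.10},
    {"0": 0.20, "+": 0.40, "-": 0.10},
]
posterior = null_posterior(pairs)
print(round(posterior, 4))
```

With a uniform prior the priors cancel, so the result is the null likelihood divided by the sum of the three likelihoods: 0.02 / (0.02 + 0.12 + 0.01).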
The quality of a set of patterns
• The pairwise comparisons for all classes that correspond to parallel edges together qualify a pattern.
• All patterns together qualify a set of patterns.
• A good pattern set is one without bad patterns.
[Figure: cube with vertices 000, 001, 010, 011, 100, 101, 110, 111.]
\[
P(\overline{H_0}) = \prod_{i=1}^{p} \left( 1 - P(H_0^{(i)} \mid \text{data}) \right)
\]
is the quality of a set of patterns, where p is the number of patterns in the set.
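The quality of a pattern set, read as the product over patterns of one minus each pattern's null-hypothesis posterior, can be illustrated as follows. The posterior values are hypothetical and the function name is an assumption made for this sketch.

```python
from math import prod


def patternset_quality(null_posteriors):
    """Probability that no pattern in the set is null: product of (1 - P(H0_i | data))."""
    return prod(1.0 - p for p in null_posteriors)


# Hypothetical null posteriors for a set of three patterns.
good_set = [0.01, 0.02, 0.05]
set_with_bad_pattern = [0.01, 0.02, 0.95]

print(patternset_quality(good_set))
print(patternset_quality(set_with_bad_pattern))
```

A single pattern with a high null posterior pulls the whole product down, which is the sense in which a good pattern set is one without bad patterns.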
Optimization of the pattern set quality
• Direct enumeration is inefficient.
• A kind of gradient maximisation is prone to getting locked in local maxima.
Thus, we use the Markov chain Monte Carlo (MCMC) method.
More precisely, it is a hybrid Metropolis-Hastings-Gibbs sampler with a random choice of updates.
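To make the idea of sampling pattern sets concrete, here is a deliberately simplified Metropolis sketch. It is not the hybrid Metropolis-Hastings-Gibbs scheme of APSampler: the state is a single set of (locus, allele) pairs, the proposal is a symmetric "toggle one allele" move, and `toy_log_score` is a hypothetical stand-in for the logarithm of the pattern-set quality. All names are assumptions made for illustration.

```python
import math
import random


def metropolis(initial, propose, log_score, steps, rng):
    """Metropolis sampler with a symmetric proposal; returns visit counts per state."""
    state, current = initial, log_score(initial)
    visits = {}
    for _ in range(steps):
        candidate = propose(state, rng)
        candidate_score = log_score(candidate)
        # Accept with probability min(1, exp(candidate_score - current)).
        if candidate_score >= current or rng.random() < math.exp(candidate_score - current):
            state, current = candidate, candidate_score
        visits[state] = visits.get(state, 0) + 1
    return visits


# Hypothetical toy problem: four candidate alleles, two of them form the "true" pattern.
ALLELES = [("locus1", "a"), ("locus1", "b"), ("locus2", "c"), ("locus2", "d")]
TARGET = frozenset({("locus1", "a"), ("locus2", "d")})


def toggle_one_allele(state, rng):
    """Symmetric proposal: add or remove one randomly chosen allele."""
    return state ^ frozenset({rng.choice(ALLELES)})


def toy_log_score(state):
    """Stand-in for log quality: highest at TARGET, lower for each differing allele."""
    return -3.0 * len(state ^ TARGET)


visits = metropolis(frozenset(), toggle_one_allele, toy_log_score, 5000, random.Random(0))
best = max(visits, key=visits.get)
print(sorted(best), visits[best])
```

Unlike a greedy climb, the sampler occasionally accepts a worse state, which is what lets it leave local maxima; the visit counts then play the role of the "registered N times" figures in the output statistics.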
Possible updating steps
A mutation:
0 0 | d 0 | 0 0        0 0 | d 0 | 0 0
0 0 | a 0 | 0 0   →    0 0 | a 0 | 0 0
0 f | 0 0 | b 0        0 f | c 0 | b 0
A recombination:
0 0 | d 0 | 0 0        0 0 | d 0 | 0 0
0 0 | a 0 | 0 0   →    0 0 | a 0 | b 0
0 f | 0 0 | b 0        0 f | 0 0 | 0 0
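The two update types can be sketched on the three-pattern example used in the slides (patterns "0 0 | d 0 | 0 0", "0 0 | a 0 | 0 0" and "0 f | 0 0 | b 0"): a mutation changes one allele slot of one pattern, and a recombination exchanges one locus between two patterns. The representation (a tuple of loci, each a pair of allele slots with "0" for an empty slot) and the function names are assumptions of this sketch.

```python
def mutate(patterns, row, locus, slot, allele):
    """Return a copy of the pattern set with one allele slot replaced."""
    new = [[list(cell) for cell in pattern] for pattern in patterns]
    new[row][locus][slot] = allele
    return [tuple(tuple(cell) for cell in pattern) for pattern in new]


def recombine(patterns, row_a, row_b, locus):
    """Return a copy of the pattern set with one locus swapped between two patterns."""
    new = [list(pattern) for pattern in patterns]
    new[row_a][locus], new[row_b][locus] = new[row_b][locus], new[row_a][locus]
    return [tuple(pattern) for pattern in new]


before = [
    (("0", "0"), ("d", "0"), ("0", "0")),
    (("0", "0"), ("a", "0"), ("0", "0")),
    (("0", "f"), ("0", "0"), ("b", "0")),
]

# Mutation: allele c appears in the second locus of the third pattern.
mutated = mutate(before, row=2, locus=1, slot=0, allele="c")
# Recombination: the third locus is exchanged between the second and third patterns.
recombined = recombine(before, row_a=1, row_b=2, locus=2)

for pattern in recombined:
    print(" | ".join(" ".join(cell) for cell in pattern))
```

Both functions return new pattern sets and leave the input untouched, so a rejected proposal in a sampler can simply be discarded.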
Output statistics
*** Patternsets statistics: ***
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+------------+----------+----------+----------+----------+--------------+----------+---------------+-----+-------+--------+
| 0 0        | 0 0      | 0 0      | T 0      | 0 0      | 0 0          | 0 0      | 0 0           | 0 0 | 0 0   | 0 0    |
| 0 0        | 0 0      | 0 0      | 0 0      | C T      | 0 0          | 0 0      | 0 0           | 0 0 | 0 0   | 0 0    |
Registered 64 times.
Pattern posteriors to be positive: 3.709e-10 7.143e-11
Pattern posteriors to be negative: 0.001556 0.03835
Point reliability = 5.9658e-05
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Patterns statistics:
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+------------+----------+----------+----------+----------+--------------+----------+---------------+-----+-------+--------+
| 0 0        | 0 0      | 0 0      | 0 0      | C 0      | 0 0          | 0 0      | 0 0           | 0 0 | 0 0   | 0 0    |
Occured 5927 times. +/- : 0/5927 (Mentioned 41 times. +/- : 0/41 )
maximal reliabilities as + and - are 4.81058e-10 and 0.0172151 .
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+------------+----------+----------+----------+----------+--------------+----------+---------------+-----+-------+--------+
| 0 0        | 0 0      | 0 0      | T 0      | 0 0      | 0 0          | 0 0      | 0 0           | 0 0 | 0 0   | 0 0    |
Occured 3022 times. +/- : 0/3022 (Mentioned 19 times. +/- : 0/19 )
maximal reliabilities as + and - are 4.74783e-06 and 0.00205254 .
A(llelic) P(attern) Sampler
APSampler software was developed …
Favorov AV, Andreewski TV, Sudomoina MA, Favorova OO, Parmigiani G, Ochs MF. 2005. A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics 171(4):2113-2121.
… and applied to real data
Favorova OO, Favorov AV, Boiko AN, Andreewski TV, Sudomoina MA, Alekseenkov AD, Kulakova OG, Gusev EI, Parmigiani G, Ochs MF. 2006. Three allele combinations associated with multiple sclerosis. BMC Med Genet 7:63.
Sudomoina MA, Nikolaeva TY, Parfenov MG, Alekseenkov AD, Favorov AV, Gekht AB, Gusev EI, Favorova OO. 2006. Genetic risk factors of arterial hypertension: analysis of ischemic stroke patients from the Yakut ethnic group. Dokl Biochem Biophys Sep-Oct;410:324-6 (Rus).
Chikhladze NM, Samedova KhF, Sudomoina MA, Thant M, Htut ZM, Litonova GN, Favorov AV, Chazova IE, Favorova OO. 2008. Contribution of CYP11B2, REN and AGT genes in genetic predisposition to arterial hypertension associated with hyperaldosteronism. Kardiologiia 48(1):37-42 (Rus).
Validation I: Exact Fisher
pattern → p (pattern)

                Patients   Controls
Carriers          PC         CC
Non-carriers      PNC        CNC
Validation II: permutation
Disease data → permutation → permuted disease data (N copies)
Genetic data + each copy of permuted disease data → 1-st, 2-nd, 3-rd, ....., N-th null distribution of p → null distribution
Pfail [pattern] = Pfail [p (pattern)]
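The permutation idea can be illustrated with a much simpler stand-in than the full procedure on the slide, which reruns the pattern search on each permuted copy of the disease data. In the sketch below the statistic is just the absolute difference in mean trait level between carriers and non-carriers of one pattern, the data are hypothetical, and the function names are assumptions.

```python
import random


def mean_difference(carriers, trait):
    """Absolute difference in mean trait level between carriers and non-carriers."""
    with_pattern = [t for c, t in zip(carriers, trait) if c]
    without_pattern = [t for c, t in zip(carriers, trait) if not c]
    return abs(sum(with_pattern) / len(with_pattern)
               - sum(without_pattern) / len(without_pattern))


def permutation_p_value(carriers, trait, statistic, n_permutations, rng):
    """Share of label permutations whose statistic is at least the observed one."""
    observed = statistic(carriers, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)          # break the genotype-trait link
        if statistic(carriers, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself as one permutation.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: 10 carriers who are all patients, 10 non-carriers who are all controls.
carriers = [1] * 10 + [0] * 10
trait = [1] * 10 + [0] * 10

p_value = permutation_p_value(carriers, trait, mean_difference, 999, random.Random(1))
print(p_value)
```

Shuffling only the trait labels keeps the genetic data intact, so the null distribution reflects the same allele frequencies and linkage structure as the real data.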
Validation III: FDR

          Test passed   Test failed
True         TP            FN
False        FP            TN

p ≈ FP/(FP+TN)
FDR ≈ FP/(FP+TP)
Validation III: FDR: evaluation
Validation III: FDR: calculation
Disease data → permutation → permuted disease data (N copies)
Genetic data + each copy of permuted disease data → 1-st, 2-nd, 3-rd, ....., N-th null distribution of p → null distribution
Genetic data + disease data → original distribution of p
Validation III: FDR: evaluation II
[Figure: FDR as a function of the threshold T, approximated and evaluated directly; FDR(T1) > FDR(T2).]
Validation: FDR: example
• 61 markers and gender
• 120 controls and 255 MS patients
• Among the 255 patients, 155 respond to a medication

Pattern contains 3 informative alleles:
Gender:1; 27:T; 42:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3011.
Occured in 1 patternsets 1 times.
Mentioned in patternsets at lines: 731.
Fisher 4-pole table:
levels   carriers   noncarriers
1        51         60
2        51         171
p-value = 1.98632243779503e-05
FDR=0.00179340028694405 (2.5e-06/0.001394)

Pattern contains 3 informative alleles:
21:G; 37:T; 53:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3227.
Occurred in 1 patternsets 1 times.
Mentioned in patternsets at lines: 427.
Fisher 4-pole table:
levels   carriers   noncarriers
0        1          89
1        19         118
p-value = 0.000368247913041713
FDR <=1 (0.0067765/1e-06)
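The two FDR figures in this example are consistent with a ratio capped at 1: 2.5e-06/0.001394 ≈ 0.0017934, while 0.0067765/1e-06 exceeds 1 and is reported as "FDR <=1". The sketch below reproduces that arithmetic. Reading the numerator as the share of the null distribution passing the threshold and the denominator as the share of the original distribution passing it is an assumption based on the FDR ≈ FP/(FP+TP) slide, and the function names are hypothetical.

```python
def passing_fraction(p_values, threshold):
    """Share of p-values that are at or below the threshold."""
    return sum(p <= threshold for p in p_values) / len(p_values)


def estimate_fdr(null_fraction, original_fraction):
    """FDR estimate at a threshold: null share over original share, capped at 1."""
    if original_fraction <= 0:
        raise ValueError("no pattern of the original run passes the threshold")
    return min(1.0, null_fraction / original_fraction)


# The two pairs of fractions shown on the slide.
fdr_first = estimate_fdr(2.5e-06, 0.001394)
fdr_second = estimate_fdr(0.0067765, 1e-06)
print(fdr_first, fdr_second)

# Hypothetical p-value lists, to show where such fractions could come from.
null_p = [0.5, 0.2, 0.9, 0.04, 0.7, 0.6, 0.3, 0.8]
original_p = [0.001, 0.03, 0.4, 0.02]
print(estimate_fdr(passing_fraction(null_p, 0.05), passing_fraction(original_p, 0.05)))
```

The cap matters for the second pattern: when the null runs pass the threshold more often than the original run, the ratio carries no information beyond "FDR is at most 1".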
Acknowledgements
Authors
• Alexander Favorov 1,3
• Olga Favorova 2
• Marina Sudomoina 2
• Giovanni Parmigiani 3
• Michael Ochs 3

• Alexey Alexeenkov 2
• Alexey Boiko 2
• Evgeniy Gusev 2
• Mikhail Parfenov 2
• Tatiana Nikolaeva 5
• Mikhail Gelfand 6
• Vsevolod Makeev 1
• Andrew Mironov 4
• Koen Vanderbroek 7

1. State Scientific Centre “GosNIIGenetica”, Moscow, Russia.
2. Russian State Medical University, Moscow, Russia.
3. The Sidney Kimmel Cancer Center at Johns Hopkins, Baltimore, MD, USA
4. Faculty of Bioinformatics and Biotechnology, MSU, Moscow
5. Yakut Research Center, Russian Academy of Medical Sciences and Government of the Sakha Republic (Yakutia), Yakutsk
6. Institute of Information Transmission Problems RAS, Moscow, Russia
7. School of Pharmacy - CCRCB – QUB, Belfast, UK
Thank you for your attention.
MS case-control study
• The method was applied to a database that contains the results of genotyping DNA from 237 unrelated patients with clinically defined MS and from 358 healthy unrelated controls (all of them Russians).
• 15 polymorphic sites of candidate loci for MS development were analyzed.
• The phenotypic trait (i.e. the MS susceptibility) levels were 1 for patients and 0 for controls.
• There were two runs: one for two patterns, one for three.
APSampler identified the following patterns as MS-associated:
• DRB1*15(2)
• TNFa9
• CCR5Δ32 + DRB1*04
• TGFβ1 -509*C + DRB1*18 + +49CTLA4*G (trio 1)
• -238 TNF*B1 + -308 TNF*A2 + +49CTLA4*G (trio 2)
The Fisher's 4-pole association test results for the trios and their 2-element subsets

Combinations                               Patients, N (%)   Controls, N (%)   p Value
–509TGFβ1*C,DRB1*18(3),CTLA4*G (trio 1)    5 (5)             0 (0)             0.009
–509TGFβ1*C,DRB1*18(3)                     5 (5)             2 (1)             0.114
–509TGFβ1*C,CTLA4*G                        60 (61)           88 (57)           0.603
DRB1*18(3),CTLA4*G                         5 (5)             1 (1)             0.035
–238TNF*B1,–308TNF*A2,CTLA4*G (trio 2)     11 (9)            0 (0)             0.003
–238TNF*B1,–308TNF*A2                      13 (10)           4 (5)             0.198
–238TNF*B1,CTLA4*G                         38 (30)           15 (17)           0.037
–308TNF*A2,CTLA4*G                         23 (18)           13 (15)           0.580
The permutation test values for the trios were less than 0.3%.
Analysis of genetic background of ischemic stroke (IS) patients of Yakut descent

Total, n (mean age ± SD)   Men, n (mean age ± SD)   Women, n (mean age ± SD)
115 (58.1 ± 11.5)          75 (55.9 ± 12.3)         40 (62.2 ± 8.4)
108 (57.7 ± 11.3)          64 (55.9 ± 12.1)         44 (60.3 ± 9.6)
Examined polymorphic loci

Gene   Chromosome   Coding region                               Regulatory regions
FGA    4q28         A4266G (Thr312Ala)
FGB    4q28                                                     C-249T; C-148T
APOE   19q13.2      T3937C + C4075T (Cys112Arg + Arg158Cys)     A-491T; T-427C
LPL    8p22         C1595G (Ser447Ter)                          T495G
ACE    17q23        I/D
CMA    14q11.2                                                  G-1903A
IS genetic background analysis
Associations identified

Allele or allelic combination   p(pcorr)   OR    CI (95%)
APOE
-427C                           0.001      0.3   0.1-0.6
-427T/C                         0.0003     0.2   0.08-0.5
-427T/T                         0.001      3.8   1.6-8.9
ε2                              0.01*      0.3   0.1-0.8
ε2/ε3                           0.03*      0.3   0.09-0.7
APOE -491T + FGB -249T          0.02       0.3   0.1-0.9
APOE -491T + LPL 495T/T         0.01       0.3   0.08-0.8
Allele 495T LPL carriership
[Figure: bar chart of carriership, %, axis 0 to 100, groups labeled 0, 1, 2, 3; p<0.0001*]
*p-value is computed by the Fisher criterion in an 8-pole table
3-allelic pattern: -249C FGB, ε4 APOE and -1903A CMA carriership
[Figure: bar chart of carriership, %, axis 0 to 50, groups labeled 1, 2, 3. Labels shown on the chart, in order: -249C FGB + -1903A CMA, p=0.017; ε4 APOE + -1903A CMA; p=0.0003*; p=0.023.]