Allelic Pattern Sampler: Genetic Combinations Underlying Complex Diseases

Polygenic diseases (traits)
• Susceptibility to a polygenic disease arises from the contribution of a set of genes.
• Heterogeneity: different genetic backgrounds give rise to the same disease.
• The disease outcome is correlated with the genetic background rather than determined by it.
• Environmental effect or heterogeneity: gang-specific eyebrows.
• A common signature is improbable.

Polygenic contribution
• The genes can contribute independently, in an additive way.
• The genes can interact (epistasis).
• The genes can behave as interacting only with respect to the disease:
• Complementary alleles. The manifestation of an allele's trait requires an allele of another gene.
• Alternative pathways.

The pattern concept. An example: image recognition
[Figure: points labeled (1,0), (1,1), (1,1/2), (1/2,1/2), (1/2,1), (0,1)]

Allelic (genetic) pattern
• We know the levels of a trait (i.e. the disease) for a group of persons, and we know which alleles of the candidate genes these persons carry.
• A pattern is a set of alleles of the genes whose presence in a genome as a whole is associated with the trait.
• Any subset of the pattern is associated with the trait less reliably than the whole pattern is, and so is any superset. So, a pattern is a locally minimal subset satisfying the statements above.
• A pattern may contain only one allele.

Example of a genetic pattern for a complex polygenic disease
Cross-sectional comparison of MS patients and controls among carriers and non-carriers of alleles of the DRB1 HLA gene, of the CCR5 chemokine receptor gene deletion, and of their combination.
[Figure: three bar charts, 0% to 100%, one per table below. The solid line marks the ratio expected for an independent combination.]

            DR4    non-DR4
controls     48      183
patients     49      163

            CCR5 Del    non-(CCR5 Del)
controls       40           191
patients       52           160

            DR4 + CCR5 Del    non-(DR4 + CCR5 Del)
controls          1                  230
patients         17                  195
OR 20.1, p<0.0001

Favorova OO, Andreewski TV, Boiko AN, Sudomoina MA, Alekseenkov AD, Kulakova OG, Slanova AV, Gusev EI. 2002.
The chemokine receptor CCR5 deletion mutation is associated with MS in HLA-DR4-positive Russians. Neurology 59(10):1652-5.

Patterns hide each other
• The union of two combinations may require more than 2 alleles in a locus:
....| 0 0 | a b | 0 0 |....
....| 0 0 | c 0 | 0 0 |....
• The strongest association (not necessarily the most reliable one) statistically shadows all the other ones.
[Figure; axis: disease level]

Independence question
We cannot define a proper space of patterns, because the operation of addition (as a union of allelic sets) is not defined for every pair of patterns. Thus, we cannot apply a component analysis technique.

Set of patterns
Since we cannot consider one pattern apart from the others, we consider a set of patterns simultaneously.

Mutual isolation of patterns
We say that a pattern is isolated from a set of other patterns if we remove the influence of all the other patterns before we consider this pattern's association with the trait.
• It is an analog of the adjustment procedure.

Data
• We have genotypic data and phenotypic trait level data for some individuals.
• The trait levels are comparative characteristics. They cannot be measured, they can only be compared.
• We want to obtain the allelic patterns that best characterize the relation between the genotypic and phenotypic data. We look for a whole set of patterns that maximises the probability that all the patterns are associated with the disease in a mutually isolated manner.
• A good pattern set forms a kind of "gradient basis" in the genome-trait association.

Data structures
The set of patterns is the variable to be optimized. The correspondence between the trait levels and the incidence matrix shows the quality of the set of patterns.

Set of patterns
0 0 | d 0 | 0 0 |....
0 f | a 0 | 0 0 |....
0 0 | 0 0 | b 0 |....

Gene data                    Incidence matrix    Trait level
a c | d d | f s |....        1 0 0               0.1
c f | a b | b a |....        0 1 1               0.4
a a | c b | a c |....        0 0 0               0.7
c f | f b | b s |....        0 0 1               0.9
a f | a d | b c |....        1 1 1               0.2
.....................        .......             …
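A minimal sketch of how the incidence matrix of the "Data structures" slide could be computed, assuming that patterns and genotypes are stored as lists of two-allele loci and that "0" marks an unconstrained position. The names `carries` and `incidence_matrix` are illustrative and are not taken from APSampler; the three pattern rows are one reading of the slide's pattern set, chosen because it reproduces the slide's incidence matrix.

```python
from collections import Counter

# Gene data from the slide: one row per individual, two alleles per locus.
GENOTYPES = [
    [("a", "c"), ("d", "d"), ("f", "s")],
    [("c", "f"), ("a", "b"), ("b", "a")],
    [("a", "a"), ("c", "b"), ("a", "c")],
    [("c", "f"), ("f", "b"), ("b", "s")],
    [("a", "f"), ("a", "d"), ("b", "c")],
]

# Assumed reading of the slide's pattern set; "0" means "any allele".
PATTERNS = [
    [("0", "0"), ("d", "0"), ("0", "0")],
    [("0", "f"), ("a", "0"), ("0", "0")],
    [("0", "0"), ("0", "0"), ("b", "0")],
]


def carries(genotype, pattern):
    """True if the genotype holds every allele the pattern requires, locus by locus."""
    for have, need in zip(genotype, pattern):
        required = Counter(a for a in need if a != "0")
        present = Counter(have)
        # A pattern entry such as ("a", "a") is treated as requiring a homozygote.
        if any(present[a] < k for a, k in required.items()):
            return False
    return True


def incidence_matrix(genotypes, patterns):
    """One row per individual, one 0/1 column per pattern."""
    return [[int(carries(g, p)) for p in patterns] for g in genotypes]


for row in incidence_matrix(GENOTYPES, PATTERNS):
    print(*row)
```

Each row of the result is the individual's class label (for example, the row 1 0 0 is class 100), so individuals can be grouped by row before their trait levels are compared.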
The incidence classification
All the cases are classified into 2^n possible classes (n is the number of patterns) based on their rows in the incidence matrix.

Incidence matrix
1 0 0
0 1 1
0 0 0
0 0 1
1 1 1
1 0 1
0 1 0
1 0 1
0 0 1
.......

The classes can be represented by the vertices of a hypercube. A set of parallel edges of the cube corresponds to a pattern.
[Figure: cube with vertices 000, 001, 010, 011, 100, 101, 110, 111. One set of parallel edges is marked: "It is the direction of the second pattern."]

Comparison of a pair of classes
Two classes of trait levels that lie on the same edge differ due to the "isolated" influence of the edge's pattern. So, we base the evaluation of a pattern set on such pairwise comparisons.
[Figure: the same cube, with the two classes on one edge labeled x and y.]
We can only compare the disease (trait) levels, so the appropriate statistic for the comparison is the number of inversions.

A pair of classes. Alternative hypotheses.
To test a pair of adjacent classes X and Y, we formulate three hypotheses about the corresponding pattern:
• null hypothesis: X and Y have the same median, e.g. X≡Y
• "positive" hypothesis: median(Y) > median(X) (predisposing pattern)
• "negative" hypothesis: median(Y) < median(X) (protecting pattern)
We compare the hypotheses in a Bayesian paradigm.

The likelihoods for a pair: example
[Figure: likelihood p (scale up to 0.25) versus the number of inversions (inv#); curves: "+", "-", "null", "const".]
The larger the minor class is, the sharper all the likelihoods are. If its size is 1 or 0, all 4 lines are equal.

The null-hypothesis posterior for a pattern
$$P(H_0 \text{ for a pattern} \mid \text{data}) = \frac{P(\text{data} \mid H_0)\,P(H_0)}{P(\text{data} \mid H_0)\,P(H_0) + P(\text{data} \mid H_+)\,P(H_+) + P(\text{data} \mid H_-)\,P(H_-)}$$
• A pattern's likelihood for a hypothesis is the product of the likelihoods of all the corresponding class pairs.
• If a pattern is carried by all the genomes in the data or by none of them (it is uninformative), the null-hypothesis prior for the pattern is 1. For informative patterns, we use a uniform prior.

The quality of a set of patterns
• The pairwise comparisons for all classes that correspond to parallel edges together qualify a pattern.
• All patterns together qualify a set of patterns.
• A good pattern set is one without bad patterns.
[Figure: the cube of classes 000 to 111]
$$P(\overline{H_0}) = \prod_{i=1}^{p} \left[ 1 - P(H_0^{i} \mid \text{data}) \right]$$
is the quality of a set of p patterns.

Optimization of the pattern set quality
• Direct enumeration is inefficient.
• Gradient-like maximisation is prone to getting locked in local maxima.
Thus, we use the Markov chain Monte Carlo (MCMC) method. More precisely, it is a hybrid Metropolis-Hastings/Gibbs sampler with a random choice of updates.

Possible updating steps
A mutation:
before: 0 0 0 0 | d 0 | a f | 0 0 | 0 0 | 0 0 | b 0 0 0 0 0
after:  0 0 0 0 | d 0 | a f | c 0 | 0 0 | 0 0 | b 0 0 0 0 0
A recombination:
before: 0 0 0 0 | d 0 | a f | 0 0 | 0 0 | 0 0 | b 0 0 0 0 0
after:  0 0 0 0 | d 0 | a f | 0 0 | 0 0 | b 0 | 0 0 0 0

Output statistics
*** Patternsets statistics: ***
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
| 0 0 | 0 0 | 0 0 | T 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |
| 0 0 | 0 0 | 0 0 | 0 0 | C T | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |
Registered 64 times.
Pattern posteriors to be positive: 3.709e-10 7.143e-11
Pattern posteriors to be negative: 0.001556 0.03835
Point reliability = 5.9658e-05
. . . . . . . . . . . .
Patterns statistics:
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
| 0 0 | 0 0 | 0 0 | 0 0 | C 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |
Occured 5927 times. +/- : 0/5927 (Mentioned 41 times.
+/- : 0/41 ) maximal reliabilities as + and - are 4.81058e-10 and 0.0172151 .
| alpha-fibr | fibr-249 | fibr-148 | ApoE-491 | Apoe-427 | ApoE-epsilon | Hind_III | LPL-Ser447Ter | ACE | CMA/B | gender |
+-------------+-----------+-----------+-----------+-----------+---------------+-----------+---------------+-----+-------+---------+
| 0 0 | 0 0 | 0 0 | T 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 | 0 0 |
Occured 3022 times. +/- : 0/3022 (Mentioned 19 times. +/- : 0/19 ) maximal reliabilities as + and - are 4.74783e-06 and 0.00205254 .

A(llelic) P(attern) Sampler
The APSampler software was developed …
Favorov AV, Andreewski TV, Sudomoina MA, Favorova OO, Parmigiani G, Ochs MF: A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics 2005, 171(4):2113-2121.
… and applied to real data
Favorova OO, Favorov AV, Boiko AN, Andreewski TV, Sudomoina MA, Alekseenkov AD, Kulakova OG, Gusev EI, Parmigiani G, Ochs MF: Three allele combinations associated with multiple sclerosis. BMC Med Genet 2006, 7:63.
Sudomoina MA, Nikolaeva TY, Parfenov MG, Alekseenkov AD, Favorov AV, Gekht AB, Gusev EI, Favorova OO: Genetic risk factors of arterial hypertension: analysis of ischemic stroke patients from the Yakut ethnic group. Dokl Biochem Biophys 2006 Sep-Oct, 410:324-6 (Rus).
Chikhladze NM, Samedova KhF, Sudomoina MA, Thant M, Htut ZM, Litonova GN, Favorov AV, Chazova IE, Favorova OO: Contribution of CYP11B2, REN and AGT genes in genetic predisposition to arterial hypertension associated with hyperaldosteronism. Kardiologiia 2008, 48(1):37-42 (Rus).

Validation I: Fisher exact test
Each pattern gets a p-value, p(pattern), from its carriership table:

                Patients    Controls
Carriers           PC          CC
Non-carriers       PNC         CNC

Validation II: permutation
Diagram: the genetic data are combined with each permuted copy of the disease data, and every permuted copy yields its own null distribution:
1-st null distribution, 2-nd null distribution, 3-rd null distribution, .....
N-th null distribution -> Null distribution -> Pfail[pattern] = Pfail[p(pattern)]
Diagram labels: Disease data -> Permutation -> Permuted disease data; axis: p.

Validation III: FDR

            Test passed    Test failed
True            TP             FN
False           FP             TN

p ≈ FP/(FP+TN)
FDR ≈ FP/(FP+TP)

Validation III: FDR: evaluation
[Figure]

Validation III: FDR: calculation
[Diagram: as in the permutation test, the genetic data with each permuted copy of the disease data give the 1-st to N-th null distributions, which form the null distribution; it is compared with the original distribution along the p axis.]

Validation III: FDR: evaluation II
[Figure labels: Approximated; Evaluated directly; FDR(T1) > FDR(T2); axis: T]

Validation: FDR: example
• 61 markers and gender
• 120 controls and 255 MS patients
• Among the 255 patients, 155 respond to a medication

Pattern contains 3 informative alleles: Gender:1; 27:T; 42:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3011.
Occured in 1 patternsets 1 times. Mentioned in patternsets at lines: 731.
Fisher 4-pole table:
levels         1     2
carriers      51    51
noncarriers   60   171
p-value = 1.98632243779503e-05
FDR=0.00179340028694405 (2.5e-06/0.001394)

Pattern contains 3 informative alleles: 21:G; 37:T; 53:C.
The pattern is mentioned in statistics as occurred 1 times at line: 3227.
Occurred in 1 patternsets 1 times. Mentioned in patternsets at lines: 427.
Fisher 4-pole table:
levels         0     1
carriers       1    19
noncarriers   89   118
p-value = 0.000368247913041713
FDR <=1 (0.0067765/1e-06)

Authors and acknowledgements
Alexander Favorov 1,3
Olga Favorova 2
Marina Sudomoina 2
Giovanni Parmigiani 3
Michael Ochs 3
Alexey Alexeenkov 2
Alexey Boiko 2
Evgeniy Gusev 2
Mikhail Parfenov 2
Tatiana Nikolaeva 5
Mikhail Gelfand 6
Vsevolod Makeev 1
Andrew Mironov 4
Koen Vanderbroek 7

1. State Scientific Centre "GosNIIGenetica", Moscow, Russia.
2. Russian State Medical University, Moscow, Russia.
3. The Sidney Kimmel Cancer Center at Johns Hopkins, Baltimore, MD, USA
4. Faculty of Bioinformatics and Biotechnology, MSU, Moscow
5. Yakut Research Center, Russian Academy of Medical Sciences and Government of the Sakha Republic (Yakutia), Yakutsk
6. Institute of Information Transmission Problems RAS, Moscow, Russia
7. School of Pharmacy - CCRCB - QUB, Belfast, UK

Thank you for your attention.

MS case-control study
• The method was applied to a database containing the genotyping results for DNA from 237 unrelated patients with clinically defined MS and from 358 healthy unrelated controls (all of them Russians).
• 15 polymorphic sites of candidate loci for MS development were analyzed.
• The phenotypic trait (i.e. the MS susceptibility) levels were 1 for patients and 0 for controls.
• There were two runs: one for 2 patterns, one for 3.

APSampler identified the following patterns as MS-associated:
• DRB1*15(2)
• TNFa9
• CCR5Δ32 + DRB1*04
• -509 TGFβ1*C + DRB1*18 + +49 CTLA4*G (trio 1)
• -238 TNF*B1 + -308 TNF*A2 + +49 CTLA4*G (trio 2)

Results of Fisher's 4-pole association test for the trios and their 2-element subsets

Combinations                                  Patients, N (%)   Controls, N (%)   p Value
-509TGFβ1*C, DRB1*18(3), CTLA4*G (trio 1)         5 (5)             0 (0)          0.009
-509TGFβ1*C, DRB1*18(3)                           5 (5)             2 (1)          0.114
-509TGFβ1*C, CTLA4*G                             60 (61)           88 (57)         0.603
DRB1*18(3), CTLA4*G                               5 (5)             1 (1)          0.035
-238TNF*B1, -308TNF*A2, CTLA4*G (trio 2)         11 (9)             0 (0)          0.003
-238TNF*B1, -308TNF*A2                           13 (10)            4 (5)          0.198
-238TNF*B1, CTLA4*G                              38 (30)           15 (17)         0.037
-308TNF*A2, CTLA4*G                              23 (18)           13 (15)         0.580

The permutation test p-values for the trios were less than 0.3%.

Analysis of genetic background of ischemic stroke (IS) patients of Yakut descent

Total (n) (mean age ± SD)   Men (n) (mean age ± SD)   Women (n) (mean age ± SD)
115 (58.1 ± 11.5)           75 (55.9 ± 12.3)          40 (62.2 ± 8.4)
108 (57.7 ± 11.3)           64 (55.9 ± 12.1)          44 (60.3 ± 9.6)

Examined polymorphic loci

Gene   Chromosome   Coding region                                Regulatory regions
FGA    4q28         A4266G (Thr312Ala)
FGB                                                              C-249T; C-148T
APOE   19q13.2      T3937C + C4075T (Cys112Arg + Arg158Cys)      A-491T; T-427C
LPL    8p22         C1595G (Ser447Ter)                           T495G
ACE    17q23                                                     I/D
CMA    14q11.2                                                   G-1903A

IS genetic background analysis
Associations identified

Allele or allelic combination   p (pcorr)   OR    CI (95%)
APOE -427C                      0.001       0.3   0.1-0.6
APOE -427T/C                    0.0003      0.2   0.08-0.5
APOE -427T/T                    0.001       3.8   1.6-8.9
APOE ε2                         0.01*       0.3   0.1-0.8
APOE ε2/ε3                      0.03*       0.3   0.09-0.7
APOE -491T + FGB -249T          0.02        0.3   0.1-0.9
APOE -491T + LPL 495T/T         0.01        0.3   0.08-0.8

[Figure: carriership (%) of the LPL 495T allele; categories 0, 1, 2, 3; p<0.0001*]
*The p-value is calculated by the Fisher test for an 8-pole table.

[Figure: carriership (%) of the 3-allelic pattern -249C FGB, ε4 APOE and -1903A CMA; categories 0, 1, 2, 3; annotations in slide order: -249C FGB + -1903A CMA, p=0.017; ε4 APOE + -1903A CMA; p=0.0003*; p=0.023]
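The association tables in this talk report odds ratios and Fisher p-values. As a minimal sketch (not APSampler's own code), the 2x2 ("4-pole") case can be computed directly from the hypergeometric distribution. The counts are the DR4 + CCR5 Del carriers from the MS example at the start of the talk, as read from that slide's table (17 of 212 patients, 1 of 231 controls); the function names are illustrative, and the two-sided rule used here (summing all tables no more probable than the observed one) is an assumption about which variant of the test was meant.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the table [[a, b], [c, d]] (rows: patients, controls)."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as probable as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Carriers / non-carriers of DR4 + CCR5 Del: patients 17 / 195, controls 1 / 230.
print(f"OR = {odds_ratio(17, 195, 1, 230):.1f}")  # 20.1, as on the slide
print(f"p  = {fisher_exact_two_sided(17, 195, 1, 230):.2e}")
```

Python's exact integer arithmetic in `math.comb` keeps the sketch free of overflow for tables of this size; for larger tables or for the 8-pole case a statistics library would be the more practical choice.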