COMBINATORIAL SEARCH METHODS FOR MULTI-SNP DISEASE ASSOCIATION
Dumitru Brinza, Jingwu He and Alexander Zelikovsky

Human Genome and SNP
Length of the human genome: 3 × 10^9 base pairs.
Difference between any two people: 0.1% of the genome.
Total number of single nucleotide polymorphisms (SNPs): 3 × 10^6.
SNP: a single nucleotide site where two or more different nucleotides occur in a large percentage of the population.
0 = wild type/major (frequency) allele; 1 = mutation/minor (frequency) allele.

International HapMap project: SNP maps are constructed across the human genome with a density of about one SNP per thousand nucleotides.
HapMap tries to identify 1 million tag SNPs that provide almost as much mapping information as the entire set of 10 million SNPs.
Unfortunately, much less is known about SNP combinations.
The initial HapMap budget was 100 million dollars.
To date, around 1.5 million SNPs have been typed.
Most of the data are trios.

Analysis of variation in suspected genes in disease and nondisease individuals aims to identify SNPs with considerably higher frequencies among the disease individuals than among the nondisease individuals.
Most searches are done on a SNP-by-SNP basis.
Recently, two-SNP analysis has shown promising results (Marchini et al., 2005).
Multi-SNP analyses are expected to find even stronger disease associations.
Common diseases can be caused by combinations of several unlinked gene (SNP) variations.
We address the computational challenge of searching for such multi-gene causal combinations.

Unadjusted p-value: the probability of the case/control distribution in the set defined by a multi-SNP combination (MSC), computed with the binomial distribution.
Multiple-testing adjusted p-value (randomization):
Randomly permute the disease status of the population to generate 1000 instances.
Apply the searching methods to each instance to get MSCs.
Compute the fraction of instances in which the MSCs found are at least as significant (unadjusted p-value no larger) as the observed p-value.
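The two p-values described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the unadjusted p-value is taken here as the upper binomial tail (the chance of seeing at least the observed number of disease individuals in the set if each member were diseased at the population rate), and `best_p_value` is a hypothetical callable that runs a search on a permuted status vector and returns the smallest unadjusted p-value it finds.

```python
import random
from math import comb


def binomial_tail(cases, controls, case_rate):
    """Assumed form of the unadjusted p-value: P(X >= cases) for
    X ~ Binomial(cases + controls, case_rate)."""
    n = cases + controls
    return sum(
        comb(n, k) * case_rate**k * (1 - case_rate) ** (n - k)
        for k in range(cases, n + 1)
    )


def adjusted_p_value(statuses, best_p_value, observed_p,
                     n_permutations=1000, seed=0):
    """Randomization adjustment: the fraction of permuted instances in
    which the search finds a p-value at least as small as observed_p.

    best_p_value is a hypothetical callable (not from the poster) that
    takes a permuted list of statuses and returns the best unadjusted
    p-value found by the search on that instance.
    """
    rng = random.Random(seed)
    shuffled = list(statuses)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # random permutation of the disease status
        if best_p_value(shuffled) <= observed_p:
            hits += 1
    return hits / n_permutations
```

An MSC would then be reported only when the value returned by `adjusted_p_value` is below 0.05.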
In our search we report only MSCs with adjusted p-value < 0.05.

Disease association analysis searches for SNPs or multi-SNP combinations whose frequency among disease individuals is considerably higher than among nondisease individuals.

High-throughput genotyping technology
Affymetrix GeneChip for genotyping (500k microarray chip).

Genetic epidemiology
Searching for genetic risk factors for diseases.
Monogenic diseases: a mutated gene is entirely responsible for the disease. Typically rare in the population: < 0.1%. Practically all cases are already reported.

[Figure: example genotype matrix of 7 individuals (5 sick, 2 healthy) by 9 SNPs with values in {0, 1, 2}. The example MSC "x x 1 x x 2 x x x" (x = any value) selects 4 sick : 1 healthy.]

Bonferroni is too crude (e.g., for 3-SNP combinations among 100 SNPs, p < 0.05 × 10^-6).
We adjust the resulting p-values via randomization.
The number of multi-SNP combinations is infeasibly high (3^100 for 100 SNPs).
How can associated multi-SNP combinations be found without checking all of them?

Our contributions
A novel combinatorial method for finding disease-associated multi-SNP combinations was developed.
Multi-SNP combinations significantly associated with diseases were found.
For Crohn's disease data (Daly et al., 2001), a few associated multi-SNP combinations with multiple-testing-adjusted p < 0.05 were found, while no single SNP or pair of SNPs showed significant association.
For a dataset for an autoimmune disorder (Ueda et al., 2003), a few previously unknown associated multi-SNP combinations were found.
For tick-borne encephalitis virus-induced disease, a multi-SNP combination significantly associated with the severity of the disease was found within a group of genes showing a high degree of linkage disequilibrium.
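To make the MSC notation concrete, here is a minimal sketch of how a pattern such as "x x 1 x x 2 x x x" can be matched against genotypes. The representation is an assumption made for illustration: an MSC is a dict from zero-based SNP index to required value (so that pattern becomes `{2: 1, 5: 2}`), and the small genotype matrix below is invented toy data, not taken from the poster.

```python
def matches(genotype, msc):
    """True if the genotype has the required value at every SNP fixed by the MSC."""
    return all(genotype[snp] == value for snp, value in msc.items())


def count_matches(genotypes, statuses, msc):
    """Return (sick, healthy) counts of the individuals selected by the MSC."""
    sick = healthy = 0
    for genotype, status in zip(genotypes, statuses):
        if matches(genotype, msc):
            if status == "sick":
                sick += 1
            else:
                healthy += 1
    return sick, healthy


# Invented toy data: 4 individuals, 3 SNPs with values in {0, 1, 2}.
toy_genotypes = [
    [0, 1, 2],
    [1, 1, 2],
    [0, 1, 0],
    [2, 1, 2],
]
toy_statuses = ["sick", "sick", "healthy", "healthy"]

# MSC "x 1 2": SNP 1 must be 1 and SNP 2 must be 2.
print(count_matches(toy_genotypes, toy_statuses, {1: 1, 2: 2}))  # (2, 1)
```

The (sick, healthy) pair is the frequency distribution whose p-value decides whether the MSC is reported.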
Complex diseases
Affected by the interaction of multiple genes.

Statistical significance
The significance of a risk factor is usually measured by Risk Rate or Odds Ratio.
We measure significance by the p-value of the set of genotypes defined by the risk factor.

Disease association analysis
If the reported SNP is found among 100 SNPs, then the probability that the SNP is associated with a disease by mere chance becomes 100 times larger (Bonferroni).
Many reported findings are not reproducible on different populations. It is believed that this happens because the p-values are not adjusted for multiple testing.

A multi-SNP combination (MSC) defines a set of disease and nondisease individuals.
An MSC is considered statistically significant if the frequency distribution of disease and nondisease individuals has p-value < 0.05.

Proposed searching methods

Exhaustive Search (ES): To find a multi-SNP combination with p-value of the frequency distribution below 0.05, ES checks all one-SNP, two-SNP, ..., m-SNP combinations.
Runtime is O(n 3^m), making a complete search infeasible even for small numbers of SNPs m.
We restrict the search to 1, 2, 3, 4 and 5 SNPs.
Searching level: the number of SNPs that participate in an MSC.

Combinatorial search with disease-closure:
Disease-closure allows statistically significant MSCs to be found at an earlier stage of the search.
Trivial MSCs and MSCs that coincide after disease-closure are avoided, which significantly speeds up the search.
Faster than ES.
Finds more significant associations at an early stage of the search.
Still slow for genome-wide studies.
Searching level: the number of SNPs that define an MSC before disease-closure.

Indexed Exhaustive Search (IES): Exhaustive search on the indexed datasets obtained by extracting k index SNPs with the MLR-based tagging method.
MLR: multiple linear regression based tagging method (He and Zelikovsky, 2006).
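The exhaustive search (ES) described above can be sketched as a plain enumeration. This is a minimal illustration under assumptions, not the authors' implementation: the p-value is taken as an upper binomial tail against the overall disease rate, MSCs are dicts from SNP index to value, and the 0.05 threshold is applied to the unadjusted p-value (the poster reports only MSCs that pass after the randomization adjustment).

```python
from itertools import combinations, product
from math import comb


def binomial_tail(cases, controls, case_rate):
    """Assumed unadjusted p-value: P(X >= cases), X ~ Binomial(n, case_rate)."""
    n = cases + controls
    return sum(
        comb(n, k) * case_rate**k * (1 - case_rate) ** (n - k)
        for k in range(cases, n + 1)
    )


def exhaustive_search(genotypes, is_case, max_level, threshold=0.05):
    """Check every MSC with 1..max_level SNPs and values in {0, 1, 2}.

    Returns a list of (msc, cases, controls, p_value) tuples with
    p_value < threshold. At level k there are C(m, k) * 3^k candidates,
    each scanned against all n genotypes.
    """
    n_snps = len(genotypes[0])
    case_rate = sum(is_case) / len(is_case)
    found = []
    for level in range(1, max_level + 1):
        for snps in combinations(range(n_snps), level):
            for values in product((0, 1, 2), repeat=level):
                cases = controls = 0
                for genotype, case in zip(genotypes, is_case):
                    if all(genotype[i] == v for i, v in zip(snps, values)):
                        if case:
                            cases += 1
                        else:
                            controls += 1
                if cases + controls == 0:
                    continue  # the MSC selects nobody
                p = binomial_tail(cases, controls, case_rate)
                if p < threshold:
                    found.append((dict(zip(snps, values)), cases, controls, p))
    return found
```

The three nested loops make the cost visible: the candidate count grows as 3^k times the number of k-SNP subsets, which is why the search level is capped at a handful of SNPs.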
Indexed Combinatorial Search (ICS): Combinatorial search on the indexed datasets obtained by extracting k index SNPs with the MLR-based tagging method.
Can perform a complete search on larger datasets.

Notes on indexed exhaustive search (IES):
The tradeoff between the number of chosen index SNPs and the quality of reconstruction requires choosing the maximum number of index SNPs that ES can handle in a reasonable computational time.
Can perform a complete search on larger datasets.
For genome-wide studies the number of tags cannot be reduced to 5-10 tags. Therefore, IES will not be able to perform a complete search.

Combinatorial Search (CS): Similar to ES, CS checks all one-SNP, two-SNP, ..., m-SNP disease-closed combinations.
The disease-closure of a multi-SNP combination C is the multi-SNP combination C' with the maximum number of SNPs that selects the same set of disease individuals and the minimum number of nondisease individuals.

Data Sets
Crohn's disease: 387 genotypes with 103 SNPs derived from the 616 kb region of human chromosome 5q31; 144 disease genotypes and 243 nondisease genotypes (Daly et al., 2001).
Autoimmune disorder: 1024 genotypes with 108 SNPs containing the genes CD28, CTLA4 and ICOS; 378 disease genotypes and 646 nondisease genotypes (Ueda et al., 2003).
Tick-borne encephalitis: 75 genotypes with 41 SNPs containing the genes TLR3, PKR, OAS1, OAS2 and OAS3; 21 disease genotypes and 54 nondisease genotypes (Barkash et al., 2006).

Discussion
Comparing the indexed searches with ES and CS shows that indexing is quite successful. The indexed searches found the same multi-SNP combinations as the non-indexed searches but were much faster, and the multiple-testing adjusted 0.05 threshold was higher and easier to meet. The comparison of CS with ES favors CS. For the Crohn's disease data (Daly et al., 2001), ES is unsuccessful on the first and second search levels, while CS finds several statistically significant multi-SNP combinations.
Similarly, for the tick-borne encephalitis virus-induced disease data, CS and ICS(20) found a significant association on the first level, while no association was found by ES or IES(20). For the autoimmune disorder data (Ueda et al., 2003), CS found many more statistically significant multi-SNP combinations than ES. We conclude that the proposed indexing approach and the combinatorial search method are very promising techniques for finding statistically significant disease-associated multi-SNP combinations and for disease susceptibility prediction.

Disease-Associated Multi-SNP Combinations Search
Given: a population of n genotypes (or haplotypes), each containing the values of m SNPs from {0, 1, 2} and a disease status (disease or nondisease).
Find: all multi-SNP combinations with multiple-testing adjusted p-value of the frequency distribution below 0.05.

Results/comparison of searching methods
The relative quality of the searching methods is compared using the number of statistically significant multi-SNP combinations found. The statistical significance was adjusted for multiple testing, and the adjusted 0.05 threshold is shown (third column). The 4th, 5th and 6th columns give the frequencies of the best multi-SNP combination among the disease and nondisease populations and the unadjusted p-value, respectively.
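The disease-closure used by the combinatorial search (CS) can be illustrated with a short sketch. It is one natural way to compute a closure that fits the definition given for CS, offered as an assumption rather than as the authors' procedure: take the disease individuals selected by C and fix every SNP on which all of them agree. Adding those SNPs cannot drop any selected disease individual, and it can only shrink the set of selected nondisease individuals. The function name, the dict representation of an MSC and the toy data are all hypothetical.

```python
def disease_closure(genotypes, is_case, msc):
    """Return a closed MSC C' for the MSC C = msc.

    C' fixes every SNP whose value is shared by all disease individuals
    that match C, so it keeps the same disease individuals while matching
    as few nondisease individuals as possible.
    """
    matched_cases = [
        genotype
        for genotype, case in zip(genotypes, is_case)
        if case and all(genotype[snp] == value for snp, value in msc.items())
    ]
    if not matched_cases:
        # Assumption: with no disease individuals selected, return C unchanged.
        return dict(msc)
    closure = {}
    for snp in range(len(matched_cases[0])):
        values = {genotype[snp] for genotype in matched_cases}
        if len(values) == 1:  # all selected disease individuals agree here
            closure[snp] = values.pop()
    return closure


# Invented toy data: 2 disease and 2 nondisease individuals, 3 SNPs.
toy_genotypes = [[1, 2, 0], [1, 2, 1], [1, 0, 0], [0, 2, 0]]
toy_is_case = [True, True, False, False]

# C = {SNP 0 = 1} selects both disease individuals and one nondisease individual.
print(disease_closure(toy_genotypes, toy_is_case, {0: 1}))  # {0: 1, 1: 2}
```

In the toy data the closed combination still selects both disease individuals but no longer selects the nondisease individual, which shows how a one-SNP starting point can reach a stronger association at the first search level.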