Standards for SNPs Analysis with Decision Trees Tools
Linda Fiaschi
Supervisors: Jon Garibaldi, Natalio Krasnogor
IMA Seminar, 24/02/2009

Outline
• Genetic background and clinical objectives
• Disease: Pre-eclampsia
• Method of analysis
• My Methodology: ADTree, C4.5, ID3
• Results
• Conclusions
• Future Work

Genetics: SNPs
• The DNA of most people is 99.9 percent the same.
• Single Nucleotide Polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) is changed; they occur approximately once every 100 to 300 bases.
• The resulting different forms of the same gene are called alleles. People can have two identical or two different alleles for a particular gene.

Clinical objectives on SNPs
• The majority of SNPs have no effect; others cause subtle differences in countless characteristics, such as appearance.
• Genetic factors may also confer susceptibility or resistance to a disease and determine the severity or progression of disease.
• Genetic factors also affect a person's response to drug therapy.

Disease: Pre-eclampsia
• It occurs during pregnancy and the postpartum period and affects both the mother and the unborn baby.
• Affecting at least 5-8% of all pregnancies, it is a rapidly progressive condition characterized by high blood pressure and the presence of protein in the urine.
• Pre-eclampsia and other hypertensive disorders of pregnancy are responsible for 76,000 deaths globally each year.

Case-Control Analysis
Case-control studies use patients who already have a disease or other condition and look back to see if there are characteristics of these patients that differ from those who don't have the disease.
[Figure: comparison of cases (sick) and controls (healthy), leading to classification rules.]

Decision Tree Analysis
• One of the most widely used and practical forms of machine learning and data mining
• It assigns a class to an input pattern through tests
• Test: has mutually exclusive and exhaustive outcomes
• Test: is either multivariate or univariate
• Attributes: categorical or numeric
• Tree: 2 classes (Boolean) or more

ADTree Algorithm
• Alternating decision trees (ADTrees) are a natural generalization of decision trees
• They are competitive with other boosted decision tree algorithms
• The rules are usually smaller in size and easier to interpret
• In addition to classification they give a measure of confidence
• For each instance there is a multi-path: the sum of all the prediction nodes gives the classification

ID3 Algorithm
Gain measures how well a given attribute separates training examples into targeted classes.

Gain(S, A) = Entropy(S) - Σ_v ((|Sv| / |S|) * Entropy(Sv))
- Σ is over each value v of all possible values of attribute A
- Sv = subset of S for which attribute A has value v
- |Sv| = number of elements in Sv
- |S| = number of elements in S

Entropy(S) = Σ_I (-p(I) log2 p(I))
- S is a collection of c outcomes
- Σ is over the c classes
- p(I) is the proportion of S belonging to class I

ID3 Algorithm Example
[Figure: example decision tree. Split nodes: Delivery week (< 35.5 / >= 35.5), Liver measures (< 94 / >= 94), Systolic Pressure (< 152.5 / >= 152.5), Age (< 36.3 / >= 36.3). Leaves: 1(15\4), 0(25\0), 1(9\1), 1(26\2), 0(31\0).]

From ID3 to C4.5 Algorithm
• Handling both continuous and discrete attributes
• Handling training data with missing attribute values
• Pruning trees after creation

Methodology
A progressive analysis: significant results are detected, then deepened and confirmed in the subsequent analysis.
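Returning to the gain criterion on the ID3 Algorithm slide, the two formulas can be sketched in a few lines of Python. This is only an illustration of the definitions, not the implementation used in the study. The function names (entropy, gain), the attribute names ("snp", "sex") and the four toy rows are hypothetical and were chosen so the result is easy to check by hand.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S): sum over classes I of -p(I) * log2 p(I)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    # Group the class labels by the value v that the attribute takes in each row.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(sv) / len(labels) * entropy(sv) for sv in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data (not from the study): "snp" separates the two classes
# perfectly, while "sex" carries no information about the class.
rows = [
    {"snp": 1, "sex": "M"},
    {"snp": 1, "sex": "F"},
    {"snp": 2, "sex": "M"},
    {"snp": 2, "sex": "F"},
]
labels = [1, 1, 0, 0]

print(gain(rows, labels, "snp"))  # 1.0
print(gain(rows, labels, "sex"))  # 0.0
```

ID3 would place the attribute with the highest gain ("snp" in this toy case) at the root and repeat the calculation on each resulting subset.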
[Figure: methodology stages: Pre-processing of the Data; Data Analysis.]

Pre-processing
[Figure: pre-processing diagram.]

Data Analysis
Statistical Significance
Kappa Value: proportion of agreement corrected for chance between two judges assigning cases to a set of categories

Kappa [8]    Agreement
< 0          No agreement
0.0-0.2      Slight
0.2-0.4      Fair
0.4-0.6      Moderate
0.6-0.8      Substantial
0.8-1.0      Almost perfect

Experimental Dataset
4529 Patients

Genotype: 52 SNP attributes
• AGT gene: SNPs 1-8, alleles 1 and 2
• AGTR1 gene: SNPs 9-12, alleles 1 and 2
• TNF gene: SNPs 13-16, alleles 1 and 2
• F5 gene: SNP 17, alleles 1 and 2
• NOS3 gene: SNPs 18-22 and 24, alleles 1 and 2
• MTHFR gene: SNPs 25, 26, alleles 1 and 2
• AGTR2 gene: SNP 27

Phenotype: 53 clinical attributes
• 5 individual's identity data
• 34 maternal data: physical and physiological parameters, pregnancy details and current treatments
• 6 fetal data: weight and gestational age at birth
• 8 medical history data of parents, partners or siblings

Results: Pre-processing I
Babies dataset (372 x 58)
1. Attributes: Gestation at birth (day and week), weight, disease status, live at birth
2. Class: CBC - birth-weight centile corrected for gestation at birth, baby sex, ethnicity, mother's height and weight and number of pregnancies. 50 is normal weight, below 50 is underweight.
3. Missing Values: we retain missing values using the appropriate codification for the chosen algorithm.
4. Data Balancing: the case-control ratio depends on the CBC threshold chosen to transform the class from numeric to Boolean.
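The thresholding step in item 4 and the kappa statistic from the Data Analysis slide can be sketched together as follows. This is a minimal illustration under stated assumptions, not the code used in the study. The function names, the centile values and the "predicted" labels are hypothetical. Treating a CBC strictly below the threshold as a case, and counting a kappa that falls exactly on a table boundary in the lower band, are also assumptions.

```python
from collections import Counter


def binarize_cbc(centiles, threshold):
    """Turn numeric CBC values into Boolean classes: 1 = case (CBC below threshold), 0 = control."""
    return [1 if c < threshold else 0 for c in centiles]


def case_fraction(labels):
    """Share of cases; this is the quantity that changes with the chosen CBC threshold."""
    return sum(labels) / len(labels)


def cohen_kappa(rater_a, rater_b):
    """Agreement between two label sequences, corrected for chance agreement.

    Undefined (division by zero) when both sequences use a single identical label.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Map a kappa value onto the agreement bands of the table in [8]."""
    if kappa < 0:
        return "No agreement"
    bands = [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"), (0.8, "Substantial")]
    for upper, label in bands:
        if kappa <= upper:  # assumption: a boundary value belongs to the lower band
            return label
    return "Almost perfect"


# Hypothetical centiles and classifier output, for illustration only.
centiles = [3, 8, 12, 27, 45, 60, 5, 30]
actual = binarize_cbc(centiles, threshold=10)
predicted = [1, 0, 0, 0, 0, 0, 1, 1]

kappa = cohen_kappa(actual, predicted)
print(actual, case_fraction(actual))
print(round(kappa, 3), landis_koch(kappa))
```

Raising the threshold moves more babies into the case class, which is why the case-control ratio has to be checked for each candidate CBC value before the trees are trained.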
Data Analysis I
Kappa Analysis:
[Figure: kappa analysis.]

Results: Data Analysis II
Balancing of the data:
• CBC = 6: 147 cases (39.5%) and 225 controls
• CBC = 10: 177 cases (47.6%) and 195 controls
• CBC = 28: 243 cases (65.3%) and 129 controls
> 33%
ADTree Results Analysis:
[Figure: ADTree results.]

Results: Data Analysis III
C4.5 Results Analysis:
[Figure: C4.5 results.]

Results: Data Analysis IV
Cross Analysis: common attributes between ADTree and C4.5
[Figure: cross analysis.]

Results: Data Analysis V
Analysis with common attributes for CBC = 28 (ADTree Kappa = 0.41, C4.5 Kappa = 0.38):
Male babies, born after the 35th week of gestation and with:
• AGT SNP3 allele2 = 1 (CBC > 28)
• AGT SNP3 allele2 = 2 & AGTR1 SNP11 allele2 = 1 (CBC < 28)
Analysis with only Gestational week and CBC = 10 (Kappa value = 0.42 for both the ADTree and C4.5):
Babies delivered before week 35 or 35.5 of gestation are likely to be underweight (CBC < 10).

Conclusions
• Guideline for data mining in the specific application of case-control analysis for SNPs.
• Methodological point of view: attributes are rejected, instances are decreased (screening stage).
• Clinical perspective: significance of threshold CBC = 10 and dependency of CBC on the "week of delivery".

Future Work
• Genotype of the mothers rather than the babies
• Recoding of the SNPs
• Redundant interaction between attributes
• Non-linear interaction between attributes
• Heritable trend can be detected across the two generations

References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[2] N. M. Laird and C. Lange, "Family-based designs in the age of large-scale gene-association studies," Nature Reviews Genetics, pp. 385–394, 2006.
[3] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81–106, 1986.
[4] J. R. Quinlan, "C4.5: Programs for machine learning," Machine Learning, vol. 16, no. 3, pp. 235–240, 1994.
[5] Y. Freund and L. Mason, "The alternating decision tree learning algorithm," Proceedings of the Sixteenth International Conference on Machine Learning, pp. 124–133, 1999.
[6] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[7] D. G. Altman, Practical Statistics for Medical Research. Chapman and Hall/CRC Press, 1991.
[8] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, pp. 159–174, 1977.