Data Mining Techniques in the Diagnosis of Coronary Artery Disease (CAD)
Steve Iduye, Xiaoqing Zhuang
HINF 6210 Data Mining

Contents
❖ Coronary Heart Disease in a Nutshell
❖ Description of the Datasets
❖ Case 1
❖ Case 2
❖ Case 3
❖ Discussion
❖ Conclusion

Heart Disease in a Nutshell
● Coronary Artery Disease (CAD) occurs when the arteries that supply blood to the heart muscle become hardened and narrowed.
● As a result, the heart muscle cannot get the blood or oxygen it needs, which can lead to chest pain (angina) or a heart attack.
● Current research has established that heart disease is not a single condition, but refers to any condition in which the heart and blood vessels are injured and do not function properly, resulting in serious and fatal health problems (Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).

Heart Disease in a Nutshell
● The causes of heart disease are unclear, but age, gender, family history, and ethnic background are all considered major causes in different investigations (Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
● Other factors such as eating habits, fatty foods, lack of exercise, high cholesterol, hypertension, pollution, lifestyle factors, obesity, high blood pressure, stress, diabetes, and lack of awareness have also been claimed to increase the chance of developing heart disease (Chilnick, 2008; HEALTHS, 2010).
● Heart research has further found that the majority of disease occurrence is seen in people between the ages of 50 and 60 (Chilnick, 2008; HEALTHS, 2010).

Case 1
● This case study investigates the risk factors that contribute to Coronary Artery Disease in males and females.
● (Article published by Jesmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping Phoebe Chen.)
● UCI Cleveland dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)
● Predictive Apriori (association rules) was used to identify those risk factors.

Apriori Algorithm (Case 1)
The learning process looks for the following:
– Support and confidence greater than or equal to the minimum threshold
– List all possible association rules that meet these requirements
– Confidence and support are used in this study because of their accuracy in Apriori for ranking the rules (Agrawal et al., 1993; Mutter, Hall, & Frank, 2005; Taihua & Fan, 2010)

Attributes of Interest in the Dataset
● These attributes are a combination of symptoms, characteristics of heart disease, diagnostic techniques, and probable causes.
● Let X represent all the attributes.
● Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy).
[Attribute table shown on slide.]

Prior Setting
● Rules with confidence levels above 90%, accuracy levels above 99%, and confirmation levels above 79% were selected for Predictive Apriori.
● As there can be many such rules, only rules containing the 'sick' or 'healthy' class on the right-hand side (RHS) were considered.
● If no such rules were available, rules containing the 'sick' or 'healthy' class on the left-hand side (LHS) were reported.

Apriori Rules
[Tables of mined Apriori rules shown on two slides.]

Summary: Case 1
● Four of the five rules attributed to the 'healthy' class indicate that, on this particular dataset, females have a better chance of being free from coronary heart disease.
● The results also show that exercise-induced angina (chest pain) being false was a good indicator of a person being healthy, irrespective of gender (exercise_induced_angina = false appears in the LHS of all the high-confidence rules).
● The number of coloured vessels being zero and thal (heart status) being normal were also shown to be good indicators of health.
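The support/confidence filtering described under the Apriori algorithm above can be sketched in a few lines. The records and attribute names below are toy examples, not the UCI data:

```python
# Sketch of the support/confidence scoring used by Apriori-style rule
# mining. Records and thresholds are illustrative, not the UCI dataset.

def support(records, items):
    """Fraction of records containing every attribute=value pair in `items`."""
    hits = sum(1 for r in records if all(r.get(k) == v for k, v in items.items()))
    return hits / len(records)

def confidence(records, lhs, rhs):
    """confidence(LHS => RHS) = support(LHS and RHS) / support(LHS)."""
    both = support(records, {**lhs, **rhs})
    lhs_only = support(records, lhs)
    return both / lhs_only if lhs_only else 0.0

records = [
    {"sex": "female", "exang": "false", "class": "healthy"},
    {"sex": "female", "exang": "false", "class": "healthy"},
    {"sex": "male",   "exang": "true",  "class": "sick"},
    {"sex": "female", "exang": "true",  "class": "sick"},
]

lhs = {"sex": "female", "exang": "false"}
rhs = {"class": "healthy"}
print(support(records, {**lhs, **rhs}))  # 0.5
print(confidence(records, lhs, rhs))     # 1.0
```

A mining run would enumerate candidate LHS/RHS combinations and keep only rules whose support and confidence clear the minimum thresholds.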
Case 1 Summary
● Rules mined for the 'sick' class, on the other hand, showed that chest pain type being asymptomatic and thal being a reversible defect were probable indicators of a person being sick (both high-confidence rules have these two factors in the LHS).

Building Classification Rules: Objectives
● Build classification rules from the previous association-rule attributes.
● Training data are analyzed by a classification algorithm.
● The learned attributes, or classifier, become the rules.
● Test data are used to estimate the accuracy of the rules.
● The rules can then be applied to the classification of new data tuples (Han, Kamber, & Pei, 2012).

Step 1: Training Data – Healthy Class

| SEX    | FASTING_BLOOD_SUGAR | EXERCISE_INDUCED_ANGINA | NO_VESSELS_COLORED | THAL (HEART STATUS) | CLASS            |
|--------|---------------------|-------------------------|--------------------|---------------------|------------------|
| Female | –                   | False                   | 0                  | Normal              | Healthy (no_CAD) |
| Female | False               | False                   | 0                  | –                   | Healthy (no_CAD) |
| Female | False               | False                   | –                  | Normal              | Healthy (no_CAD) |
| M or F | –                   | False                   | 0                  | Normal              | Healthy (no_CAD) |
| Female | –                   | False                   | 0                  | –                   | Healthy (no_CAD) |

Step 1: Training Data – Unhealthy Class

| CHEST_PAIN_TYPE | SLOPE | EXERCISE_INDUCED_ANGINA | THAL (HEART STATUS) | CLASS           |
|-----------------|-------|-------------------------|---------------------|-----------------|
| asymptomatic    | flat  | –                       | reversible defect   | Unhealthy (CAD) |
| asymptomatic    | –     | true                    | reversible defect   | Unhealthy (CAD) |

Step 2: Create Classification Rules
● The learned attributes, or classifier, become the rules:
● If {sex = female ∧ exercise_induced_angina = false ∧ number_of_vessels_colored = 0 ∧ thal = normal} ⇒ no CAD
● If {sex = female ∧ fasting_blood_sugar = false ∧ exercise_induced_angina = false ∧ number_of_vessels_colored = 0} ⇒ no CAD
● If {sex = female ∧ fasting_blood_sugar = false ∧ exercise_induced_angina = false ∧ thal = normal} ⇒ no CAD
● If {resting_blood_pressure ∈ (115.2, 136.4] ∧ exercise_induced_angina = false ∧ number_of_vessels_colored = 0 ∧ thal = normal} ⇒ no CAD
● If {sex = female ∧ exercise_induced_angina = false ∧ number_of_vessels_colored = 0} ⇒ no CAD
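The Step 2 if-then rules translate directly into predicate functions. The short attribute names below (exang, ca, cp, thal) are the UCI Cleveland column names, used here for brevity:

```python
# The healthy-class rules from Step 2 and the sick-class rules that follow,
# transcribed as plain predicates. A record is a dict of attribute -> value;
# short names follow the UCI Cleveland columns (exang, ca, cp, thal).

HEALTHY_RULES = [
    lambda r: r.get("sex") == "female" and r.get("exang") == "false"
              and r.get("ca") == 0 and r.get("thal") == "normal",
    lambda r: r.get("sex") == "female" and r.get("exang") == "false"
              and r.get("ca") == 0,
]

SICK_RULES = [
    lambda r: r.get("cp") == "asympt" and r.get("slope") == "flat"
              and r.get("thal") == "rev",
    lambda r: r.get("cp") == "asympt" and r.get("exang") == "true"
              and r.get("thal") == "rev",
]

def predict(record):
    """Return 'no CAD', 'CAD', or None when no rule fires."""
    if any(rule(record) for rule in HEALTHY_RULES):
        return "no CAD"
    if any(rule(record) for rule in SICK_RULES):
        return "CAD"
    return None

print(predict({"sex": "female", "exang": "false", "ca": 0, "thal": "normal"}))  # no CAD
print(predict({"cp": "asympt", "slope": "flat", "thal": "rev"}))                # CAD
```

Only two of the five healthy rules are shown; the remaining ones follow the same pattern.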
Classification Rules
● If {chest_pain_type = asymptomatic ∧ slope = flat ∧ thal = reversible defect} ⇒ CAD is present
● If {chest_pain_type = asymptomatic ∧ exercise_induced_angina = true ∧ thal = reversible defect} ⇒ CAD is present

Step 3: Estimating the Accuracy of the Rules Using a Decision Tree
● Compute each attribute's information gain over the 7 training records (5 healthy, 2 unhealthy):

info(D) = -(5/7)·log2(5/7) - (2/7)·log2(2/7) ≈ 0.8631 bits (A)
info_sex(D) = (4/7)·(-(4/5)·log2(4/5) - (1/5)·log2(1/5)) ≈ 0.4125 bits (B)
info_exercise_induced_angina(D) = (6/7)·(-(5/6)·log2(5/6) - (1/6)·log2(1/6)) ≈ 0.5572 bits (C)
info_heart_status(D) = (5/7)·(-(3/5)·log2(3/5) - (2/5)·log2(2/5)) ≈ 0.6935 bits (D)

Gain: A - B ≈ 0.4506 bits (sex), A - C ≈ 0.3060 bits, A - D ≈ 0.1696 bits; sex gives the largest gain.

Case 2: Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering Laboratory and Echocardiography Features
Case 3: A Data Mining Approach for Diagnosis of Coronary Artery Disease

Dataset
● Case 2: Z-Alizadeh Sani dataset, 303 patients, 54 features each
● Case 3: Z-Alizadeh Sani dataset, 303 patients, 54 features each

Objective
● Case 2: Using non-invasive, less costly methods, apply various data mining algorithms to predict the stenosis of each artery separately.
● Case 3: Using affordable costs and affordable feature measurements, apply the proposed approach to identify the probability of the CAD state.
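The entropy and information-gain arithmetic in Step 3 can be checked in a few lines of Python; note that info(D) for 5 healthy and 2 unhealthy records evaluates to about 0.863 bits. The two-partition split at the end is an illustrative example, not one of the slide's attributes:

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Step 3 base case: 7 training records, 5 healthy and 2 unhealthy.
info_D = entropy([5, 2])
print(round(info_D, 4))  # 0.8631

def split_info(partitions, total):
    """Weighted entropy after a split: sum of |partition|/total * entropy."""
    return sum(sum(p) / total * entropy(p) for p in partitions)

# Illustrative perfect split: one partition holds the 5 healthy records,
# the other the 2 unhealthy records, so the gain recovers all the entropy.
gain = info_D - split_info([[5, 0], [0, 2]], 7)
print(round(gain, 4))  # 0.8631
```

Information gain chosen this way is bounded above by info(D) and is never negative, which is a quick sanity check on any hand computation.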
Features
● Case 2: demographic features, laboratory and echo features.
● Case 3: four groups: demographic; symptom and examination; ECG; laboratory and echo features. Two possible categories: CAD or Normal (if the patient's diameter narrowing is ≥ 50%, then CAD; else Normal).

Methods
● Case 2: classification algorithms C4.5 and bagging; information gain, Gini index, ten-fold cross-validation, confusion matrix, performance measures; RapidMiner software.
● Case 3: classification algorithms SMO, Naïve Bayes, bagging, and neural network; feature selection and feature creation, information gain, Gini index, association rule mining, performance measures, confusion matrix; RapidMiner version 5.2.003.

Results
● Case 2: reports the highest accuracy value (79.54%) for diagnosing LAD stenosis.
● Case 3: 94.08% accuracy is achieved, which is higher than the known approaches in the available literature.

Case 2 (METHODS)
C4.5 classification algorithm
• Based on decision trees, augmenting their performance.
• Handles continuous values by breaking them down into sub-intervals.
• Uses pruning methods to improve accuracy.

Case 2 (METHODS)
Bagging algorithm
• Classifies each sample based on the output of a set of diverse base classifiers.
• Base classifiers can be selected from C4.5, Naïve Bayes, ID3, and other data mining algorithms.

Case 3 (METHODS)
● Sequential Minimal Optimization (SMO): an algorithm for efficiently solving the optimization problem that arises during the training of Support Vector Machines (SVMs).
● Naïve Bayes classifier: a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions.
● Bagging algorithm (as in Case 2).
● Neural network algorithm: an Artificial Neural Network (ANN) is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionist approach; it can model complex relationships between inputs and outputs or find patterns in data.
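The bagging idea used in both case studies (bootstrap samples, diverse base classifiers, majority vote) can be sketched with a deliberately simple base learner. A one-feature threshold stump stands in for C4.5 here, and the toy one-dimensional data are illustrative:

```python
import random

# Minimal bagging sketch: train base classifiers on bootstrap samples and
# combine them by majority vote. The base learner is a one-feature threshold
# stump (a stand-in for C4.5); data are toy (x, label) pairs.

def train_stump(sample):
    """Pick the threshold/polarity minimizing training error on `sample`."""
    best = None
    for thr in {x for x, _ in sample}:
        for hi in (0, 1):  # hi = class predicted when x >= thr
            err = sum((hi if x >= thr else 1 - hi) != y for x, y in sample)
            if best is None or err < best[0]:
                best = (err, thr, hi)
    _, thr, hi = best
    return lambda x: hi if x >= thr else 1 - hi

def bagging(data, n_models=25, seed=0):
    """Train `n_models` stumps on bootstrap samples; vote at prediction time."""
    rng = random.Random(seed)
    models = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    return lambda x: int(sum(m(x) for m in models) > n_models / 2)

data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
clf = bagging(data)
print(clf(2), clf(8))  # majority vote: low x -> 0, high x -> 1
```

Real bagging implementations differ mainly in the base learner and in using out-of-bag samples for error estimation; the vote-combination step is exactly as above.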
Case 3 (METHODS)
Feature Selection
• Uses the coefficients of the normal vector of a linear SVM as feature weights.
• The attribute values still have to be numerical.
• 34 features had a weight > 0.6; these were selected, and the algorithms were applied to them.

Case 3 (METHODS)
Feature Creation
• Three new features: LAD (Left Anterior Descending) recognizer, LCX (Left Circumflex) recognizer, and RCA (Right Coronary Artery) recognizer, used to recognize whether the LAD, LCX, or RCA is blocked. The higher the value, the higher the risk.
• The available features of the dataset are first discretized into binary variables: a value of 1 for a feature indicates a higher probability of the record being in the CAD class, while a value of 0 indicates otherwise.

Case 3 (METHODS)
Association rule mining (see Case 1)
• Support
• Confidence

Case 2 and Case 3
Information gain
• Measures the reduction in the entropy of the data records caused by a single split over a given attribute.
• The entropy before and after the split is computed, where c is the class value (CAD or Normal) and P(c) is the probability of a record being in class c.
• If a feature separates the two classes completely, it has the highest information gain and is the best feature for classification.

Case 2 and Case 3
Gini Index
• A measure of how often a randomly chosen element from a set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset.
• The probability of correctly labeling an item is equal to the probability of choosing that item's label.
• Higher values of the Gini Index for a feature indicate its prevalence in causing the disease.
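The Gini Index definition above reduces to one line of arithmetic: one minus the sum of squared class probabilities. The class counts below are illustrative:

```python
# Gini impurity: probability that a randomly drawn element is mislabeled
# when labels are assigned according to the subset's class distribution.
# Class counts below are illustrative, not from the Z-Alizadeh Sani data.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5  (maximally mixed two-class subset)
print(gini([10, 0]))  # 0.0  (pure subset)
```

As with information gain, a split is scored by the weighted average impurity of its partitions, and a pure subset scores zero.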
Case 2 and Case 3
● Performance measures: accuracy, sensitivity, and specificity are the most important performance measures in the medical field.
● Confusion matrix: a table that allows visualization of the performance of an algorithm.

Discussion (Improving the Accuracy of CAD Diagnosis by Using Data Mining Techniques)
[Concept-map slide linking: understanding CAD; CAD risk features; rules extracted (support, confidence); feature selection and feature creation; information gain; Gini index; dataset with effective features; data mining methods (C4.5, bagging, SMO, Naïve Bayes, neural network, association rule mining, RapidMiner); performance measurement results (confusion matrix, sensitivity, specificity, accuracy).]

Conclusion
– Using feature selection methods can increase the accuracy of CAD diagnosis (though it may sometimes decrease the accuracy of LAD and RCA stenosis diagnosis).
– To enrich the dataset, we may need to create new features, which have a vital influence on the accuracy of CAD diagnosis.
– Rules extracted with association rule mining methods may not be 100% correct; we need more test data to test the rules.
– We still need the results of the standard angiographic method, used as the basis of comparison, to assess the prediction capability of the classification algorithms.
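As a worked example of the performance measures discussed in the case studies, sensitivity, specificity, and accuracy all fall out of the four cells of a 2x2 confusion matrix. The counts below are illustrative, not results from either study:

```python
# Sensitivity, specificity, and accuracy from a 2x2 confusion matrix.
# tp/fn/fp/tn counts below are illustrative, not from the case studies.

def metrics(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

m = metrics(tp=80, fn=20, fp=10, tn=90)
print(m)  # {'sensitivity': 0.8, 'specificity': 0.9, 'accuracy': 0.85}
```

Reporting all three, rather than accuracy alone, matters in diagnosis because a classifier that misses sick patients (low sensitivity) can still show high accuracy on an unbalanced dataset.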