Data Mining Techniques in
The Diagnosis of
Coronary Artery Disease (CAD)
Steve Iduye
Xiaoqing Zhuang
HINF 6210 Data Mining
Contents
❖Coronary Heart Disease in a Nutshell
❖Description of the Datasets
❖Case 1
❖Case 2
❖Case 3
❖Discussion
❖Conclusion
Heart Disease in a Nutshell
● Coronary Artery Disease (CAD) occurs when the arteries that supply blood to the heart
muscle become hardened and narrowed.
● As a result, the heart muscle cannot get the blood or oxygen it needs and this can
lead to chest pain (angina) or a heart attack.
● Current research on heart disease has established that it is not a single
condition, but refers to any condition in which the heart and blood vessels are
injured and do not function properly, resulting in serious and fatal health problems
(Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
Heart Disease in a Nutshell
● The causes of heart disease are unclear, but age, gender, family history, and ethnic
background are all considered major risk factors in different investigations
(Chilnick, 2008; HEALTHS, 2010; King, 2004; Silverstein et al., 2006).
● Other factors such as eating habits, fatty foods, lack of exercise, high cholesterol,
hypertension, pollution, lifestyle factors, obesity, stress, diabetes and lack of
awareness have also been claimed to increase the chance of developing heart disease
(Chilnick, 2008; HEALTHS, 2010).
● Research has further found that most occurrences of the disease are in people
between the ages of 50 and 60 (Chilnick, 2008; HEALTHS, 2010).
Case 1
● The case study investigates the risk factors which contribute to Coronary Artery
Disease in males and females
● (Article by Jesmin Nahar, Tasadduq Imam, Kevin S. Tickle, and Yi-Ping Phoebe Chen)
● UCI Cleveland Dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/)
● Predictive Apriori (Association Rules) was used to identify those risk factors
Apriori Algorithm (Case 1)
The learning process looks for the following:
– Support and Confidence greater than or equal to the min threshold
– List all possible association rules that meet these requirements
– Confidence and support are used in this study because of their accuracy in ranking
Apriori rules (Agrawal et al., 1993; Mutter, Hall, & Frank, 2005; Taihua & Fan, 2010)
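A minimal sketch of how support and confidence can be computed for one candidate rule. The records below are hypothetical toy data, not rows from the Cleveland dataset, and the helper names `support` and `confidence` are illustrative. This shows the plain support/confidence measures only, not the Predictive Apriori ranking itself.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimated P(rhs | lhs) = support(lhs and rhs together) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical toy records (assumption: one set of attribute=value items per patient).
records = [
    {"sex=female", "angina=false", "class=healthy"},
    {"sex=female", "angina=false", "class=healthy"},
    {"sex=male", "angina=true", "class=sick"},
    {"sex=female", "angina=true", "class=sick"},
    {"sex=male", "angina=false", "class=healthy"},
]

# Candidate rule: {sex=female, angina=false} => {class=healthy}
rule_lhs = {"sex=female", "angina=false"}
rule_rhs = {"class=healthy"}
rule_support = support(records, rule_lhs | rule_rhs)
rule_confidence = confidence(records, rule_lhs, rule_rhs)
print(rule_support, rule_confidence)

# A rule is kept only if both values reach the chosen minimum thresholds.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.9  # illustrative thresholds
keep_rule = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```

Here the rule covers 2 of the 5 toy records, and every record matching the left-hand side is healthy, so the rule passes both illustrative thresholds.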
Attributes of Interest in the Dataset
● These attributes are the combination of symptoms, characteristics of heart disease,
diagnostic techniques and probable causes.
● Let X represent all the attributes
● Let Y represent the class vector (CAD = unhealthy, No_CAD = healthy)
Attributes of Interest in the Dataset
Prior Setting
● Rules with confidence levels above 90%, with accuracy levels above 99% and
confirmation levels above 79% were selected respectively for Predictive Apriori.
● As there can be many such rules, only the rules containing the ‘sick’ or ‘healthy’
class in the right-hand side (RHS) were considered.
● If no such rules were available, rules containing the ‘sick’ or ‘healthy’ class in the
left-hand side (LHS) were reported.
Apriori Rules
Summary: Case 1
● Four of the five rules for the ‘healthy’ class indicate that, in this particular
dataset, females are more likely to be free from coronary heart disease.
● The results also show that exercise induced angina (chest pain) being false was a
good indicator of a person being healthy, irrespective of gender (exercise induced
angina = false appears in the LHS of all the high confidence rules).
● The number of coloured vessels being zero and thal (heart status) being normal
were also shown to be good indicators of health.
Case 1 Summary
● Rules mined for the ‘sick’ class, on the other hand, showed that chest pain type
being asymptomatic and thal being a reversible defect were probable indicators of a
person being sick (both high confidence rules have these two factors in the LHS).
Building Classification Rules
Objectives
● Build classification rules from the attributes found by association rule (A.R.) mining
● Training data are analyzed by a classification algorithm
● The learned classifier is expressed as a set of rules
● Training data are used to estimate the accuracy of the rules
● The rules can be applied to the classification of new data tuples (Han, Kamber, &
Pei, 2012)
Step 1: Training Data
Healthy Class
SEX | EXERCISE_INDUCED_ANGINA | NO_VESSEL_COLORED | THAL (HEART STATUS) | FASTING BLOOD SUGAR | CLASS
Female | False | 0 | Normal | - | Healthy (no_CAD)
Female | False | 0 | - | False | Healthy (no_CAD)
Female | False | 0 | - | - | Healthy (no_CAD)
Female | False | - | Normal | False | Healthy (no_CAD)
M or F | False | 0 | Normal | - | Healthy (no_CAD)
Step 1: Training Data
Unhealthy Class
CHEST_PAIN_TYPE | SLOPE | EXERCISE INDUCED ANGINA | THAL (HEART STATUS) | CLASS
asymptomatic | flat | - | reversible defect | Unhealthy (CAD)
asymptomatic | - | true | reversible defect | Unhealthy (CAD)
Step 2: Create Classification Rules
● The learned classifier is expressed as a set of rules
● If {Sex = female AND exercise_induced_angina = false AND
number_of_vessels_colored = 0 AND thal = normal} => Then, no CAD.
● If {Sex = female AND fasting_blood_sugar = false AND exercise_induced_angina = false
AND number_of_vessels_colored = 0} => Then, no CAD.
Classification Rules (continued)
● If {Sex = female AND fasting_blood_sugar = false AND exercise_induced_angina = false
AND thal = normal} => Then, no CAD.
● If {Resting_blood_pres in ‘(115.2, 136.4]’ AND exercise_induced_angina = false AND
number_of_vessels_colored = 0 AND thal = normal} => Then, no CAD.
● If {Sex = female AND exercise_induced_angina = false AND
number_of_vessels_colored = 0} => Then, no CAD.
Classification Rules (continued)
● If {Chest_pain_type = asymptomatic AND slope = flat AND thal = reversible defect}
=> Then, CAD is present.
● If {Chest_pain_type = asymptomatic AND exercise_induced_angina = true AND
thal = reversible defect} => Then, CAD is present.
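The rules above can be applied to a new patient record with a simple first-match lookup. This is an illustrative sketch only: the dictionary keys, the normalised value spellings, and the `"unknown"` fallback are assumptions made for this example. The resting blood pressure rule is left out to keep the matching to plain equality tests.

```python
# Each rule is (conditions, label); a rule fires when every condition matches.
# Key names and value spellings are normalised for illustration (assumption).
RULES = [
    ({"sex": "female", "exercise_induced_angina": False,
      "number_of_vessels_colored": 0, "thal": "normal"}, "no CAD"),
    ({"sex": "female", "fasting_blood_sugar": False,
      "exercise_induced_angina": False, "number_of_vessels_colored": 0}, "no CAD"),
    ({"sex": "female", "fasting_blood_sugar": False,
      "exercise_induced_angina": False, "thal": "normal"}, "no CAD"),
    ({"sex": "female", "exercise_induced_angina": False,
      "number_of_vessels_colored": 0}, "no CAD"),
    ({"chest_pain_type": "asymptomatic", "slope": "flat",
      "thal": "reversible defect"}, "CAD"),
    ({"chest_pain_type": "asymptomatic", "exercise_induced_angina": True,
      "thal": "reversible defect"}, "CAD"),
]


def classify(patient, rules=RULES, default="unknown"):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(patient.get(key) == value for key, value in conditions.items()):
            return label
    return default  # no rule covers this record


# Hypothetical patient record, not taken from the dataset.
patient = {
    "sex": "male",
    "chest_pain_type": "asymptomatic",
    "exercise_induced_angina": True,
    "thal": "reversible defect",
}
print(classify(patient))
```

Records that match none of the rules fall through to the default label, which makes it visible how much of a dataset the mined rules actually cover.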
Step 3: To Estimate the Accuracy of the Rules Using Decision Tree
● Find each attribute's information gain
Info(D) = -(5/7)log2(5/7) - (2/7)log2(2/7) ≈ 0.8631 bits (A)
Info_sex(D) = (4/7) × (-(4/5)log2(4/5) - (1/5)log2(1/5)) ≈ 0.4125 bits (B)
Info_exercise_induced_angina(D) = (6/7) × (-(5/6)log2(5/6) - (1/6)log2(1/6)) ≈ 0.5572 bits (C)
Info_heart_status(D) = (5/7) × (-(3/5)log2(3/5) - (2/5)log2(2/5)) ≈ 0.6935 bits (D)
Gain(sex) = A - B ≈ 0.4506 bits, Gain(exercise_induced_angina) = A - C ≈ 0.3060 bits,
Gain(heart status) = A - D ≈ 0.1696 bits
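The arithmetic in this step can be checked with a few lines of Python. The sketch below evaluates the four expressions with the class counts and weights exactly as they are written on the slide (5 vs 2, then 4/7, 6/7 and 5/7 times a two-class entropy). The function name `entropy` and the variable names are illustrative.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Expressions as written on the slide: 5 healthy and 2 unhealthy rules overall.
info_d = entropy([5, 2])                       # (A)
info_sex = (4 / 7) * entropy([4, 1])           # (B)
info_angina = (6 / 7) * entropy([5, 1])        # (C)
info_heart_status = (5 / 7) * entropy([3, 2])  # (D)

gains = {
    "sex": info_d - info_sex,
    "exercise_induced_angina": info_d - info_angina,
    "heart_status": info_d - info_heart_status,
}

print(f"Info(D) = {info_d:.4f} bits")
for attribute, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Gain({attribute}) = {gain:.4f} bits")
```

A two-class entropy can never exceed 1 bit, which gives a quick sanity check on any hand-computed value.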
Case 2: Diagnosing Coronary Artery Disease via Data Mining Algorithms by Considering
Laboratory and Echocardiography Features
Case 3: A data mining approach for diagnosis of coronary artery disease
Dataset
● Case 2: Z-Alizadeh Sani dataset: 303 patients (54 features each)
● Case 3: Z-Alizadeh Sani dataset: 303 patients (54 features each)
Objective
● Case 2: Use a non-invasive, less costly method and various data mining algorithms to
predict stenosis of each artery separately.
● Case 3: Use affordable feature measurements and apply the proposed approaches to
identify the probability of the CAD state.
Features
● Case 2: Demographic features, laboratory and echo features
● Case 3: 4 groups of features: demographic, symptom and examination, ECG, laboratory
and echo features
● 2 possible categories: CAD or Normal (IF patient's diameter narrowing is >= 50% THEN
CAD, ELSE Normal)
Methods
● Case 2: Classification algorithms: C4.5, Bagging algorithm. Information gain, Gini
Index, ten-fold cross-validation, confusion matrix, performance measure. RapidMiner
software.
● Case 3: Classification algorithms: SMO, Naïve Bayes classifier, Bagging algorithm,
Neural Network algorithm. Feature selection and feature creation, information gain,
Gini Index, association rule mining, performance measure, confusion matrix.
Version 5.2.003 of RapidMiner.
Results
● Case 2: This study presents the highest accuracy value (79.54%) for diagnosing the
LAD stenosis in the ...
● Case 3: 94.08% accuracy is achieved, which is higher than the known approaches in
the available literature.
Case 2 (METHODS)
● C4.5 classification algorithm
• Based on decision trees; extends ID3 to improve its performance
• Handles continuous values by breaking them down into sub-intervals
• Uses pruning methods to improve accuracy
Case 2 (METHODS)
● Bagging Algorithm
• Classifies each sample based on the output of a set of diverse base classifiers.
• Base classifiers can be selected from C4.5, Naïve Bayes, ID3, and other data mining
algorithms.
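The idea behind bagging can be sketched in plain Python: train several base classifiers on bootstrap samples of the training data and let them vote. Everything below is an assumption made for illustration: the base classifier is a one-feature threshold stump (a stand-in, not C4.5), the data are hypothetical narrowing percentages, and the names `train_stump`, `bagging_fit` and `bagging_predict` are not from the studies.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-feature threshold classifier on (value, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x <= threshold]
        right = [y for x, y in samples if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(samples, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Majority vote over the base classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (feature value, class label).
training = [(10, "Normal"), (20, "Normal"), (30, "Normal"), (40, "Normal"),
            (60, "CAD"), (70, "CAD"), (80, "CAD"), (90, "CAD")]
models = bagging_fit(training)
print(bagging_predict(models, 5), bagging_predict(models, 95))
```

Because each stump sees a slightly different resample, individual stumps place their thresholds differently; the vote smooths out those differences.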
Case 3 (METHODS)
● Sequential Minimal Optimization (SMO): an algorithm for efficiently solving the
optimization problem which arises during the training of Support Vector Machines
(SVMs)
● Naïve Bayes classifier: a simple probabilistic classifier based on applying Bayes’
theorem with strong independence assumptions
● Bagging algorithm
● Neural Network algorithm: an Artificial Neural Network (ANN) is an interconnected
group of artificial neurons that uses a mathematical or computational model for
information processing based on a connectionist approach. It is used to model complex
relationships between inputs and outputs or to find patterns in data.
Case 3 (METHODS)
● Feature Selection
• Uses the coefficients of the normal vector of a linear SVM as feature weights
• The attribute values still have to be numerical.
• 34 features had a weight above 0.6; these were selected and the algorithms were
applied to them.
Case 3 (METHODS)
● Feature creation
• 3 new features: LAD (Left Anterior Descending) recognizer, LCX (Left Circumflex)
recognizer, and RCA (Right Coronary Artery) recognizer are used to recognize whether
the LAD, LCX, or RCA is blocked. The higher the value, the higher the risk.
• Available features of the dataset are first discretized into binary variables: a
value of 1 for a feature indicates a higher probability of the record being in the CAD
class, while a value of 0 indicates otherwise.
Case 3 (METHODS)
● Association rule mining (mentioned in Case 1)
• Support
• Confidence
Case 2 and Case 3
● Information gain
• Measures the reduction in entropy of the data records because of a single split over
a given attribute.
• The entropy before and after the split is computed.
• c is the class value, which can be CAD or Normal.
• P(c) is the probability of a record being in class c.
• If a feature separates the two classes completely, it has the most Information Gain
and is the best feature for classification.
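For reference, the standard textbook definitions that match this description are given below. The symbols D (the set of records), A (the attribute used for the split) and D_v (the records with A = v) are notation assumed here, since the slide's own formula is not available.

```latex
H(D) = -\sum_{c \in \{\mathrm{CAD},\,\mathrm{Normal}\}} P(c)\,\log_2 P(c)

\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
```

When a feature separates the two classes completely, every subset D_v contains a single class, so each H(D_v) is 0 and the gain equals H(D), its largest possible value.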
Case 2 and Case 3
● Gini Index
• A measure of how often a randomly chosen element from a set would be incorrectly
labeled if it were randomly labeled according to the distribution of labels in the
subset.
• The probability of correctly labeling an item is equal to the probability of choosing
an item of that class.
• Higher values of Gini Index for a feature indicate its prevalence in causing the
disease.
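The first two bullets correspond to the standard Gini impurity, 1 minus the sum of squared class proportions. The sketch below computes it for a label list and for a candidate split. It illustrates the standard definition only and is not necessarily the exact feature score used in the studies; the function names and toy labels are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_c^2): chance that a random item receives a wrong random label."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_after_split(groups):
    """Size-weighted average impurity of the groups produced by a split."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group) for group in groups)


# Hypothetical labels: an evenly mixed node and a split that separates it perfectly.
mixed = ["CAD", "CAD", "Normal", "Normal"]
perfect_split = [["CAD", "CAD"], ["Normal", "Normal"]]

impurity_before = gini_impurity(mixed)
impurity_after = gini_after_split(perfect_split)
print(impurity_before, impurity_after)
```

The drop from the impurity before the split to the weighted impurity after it is what a tree learner compares across candidate features.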
Case 2 and Case 3
● Performance measure: accuracy, sensitivity, and specificity are the most important
performance measures in the medical field
● Confusion matrix: a table that allows visualization of the performance of an algorithm
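As a small illustration of how these measures follow from a confusion matrix, the sketch below counts true and false positives and negatives and derives accuracy, sensitivity and specificity. The labels are hypothetical, CAD is treated as the positive class, and the function names are illustrative. It assumes both classes occur in the actual labels, otherwise a division by zero would occur.

```python
def confusion_counts(actual, predicted, positive="CAD"):
    """Return (tp, fp, tn, fn) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical labels for eight patients.
actual = ["CAD", "CAD", "CAD", "CAD", "CAD", "Normal", "Normal", "Normal"]
predicted = ["CAD", "CAD", "CAD", "CAD", "Normal", "Normal", "Normal", "CAD"]

counts = confusion_counts(actual, predicted)
scores = performance(*counts)
print(counts, scores)
```

In this toy case one CAD patient is missed and one normal patient is flagged, so sensitivity and specificity differ even though they are derived from the same table.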
Discussion (Improve Accuracy of CAD Diagnosis by Using Data Mining Techniques)
[Diagram: workflow linking the following elements]
● Understand CAD: CAD Risk Features
● Rules Extracted: Confidence, Support
● Dataset with Effective Features: Feature Selection, Feature Creation, Information
Gain, Gini Index
● Data Mining Methods: C4.5, Bagging Algorithm, SMO Algorithm, Naïve Bayes algorithm,
Neural Network algorithm, Association Rule Mining, RapidMiner
● Performance Measurement Results: Confusion Matrix, Sensitivity, Specificity, Accuracy
Conclusion
– Using feature selection methods can increase the accuracy of CAD diagnosis (though it may sometimes
decrease the accuracy of the LAD and RCA stenosis diagnosis).
– To enrich the dataset, it may be necessary to create new features that have a vital influence on the
accuracy of the CAD diagnosis.
– Rules extracted by association rule mining may not be 100% correct; more test data are needed to
validate the rules.
– The results of the standard angiographic method are still needed as the basis of comparison to
assess the prediction capability of the classification algorithms.