Data Mining: A Closer Look
Chapter 2
2.1 Data Mining Strategies
Figure 2.1 A hierarchy of data mining strategies
Data Mining Strategies
• Unsupervised Clustering
• Supervised Learning: Classification, Estimation, Prediction
• Market Basket Analysis
Classification
• Learning is supervised.
• The dependent variable is categorical.
• Well-defined classes.
• Current rather than future behavior.
Estimation
• Learning is supervised.
• The dependent variable is numeric.
• Well-defined classes.
• Current rather than future behavior.
Prediction
• The emphasis is on predicting future rather than current outcomes.
• The output attribute may be categorical or numeric.
The Cardiology Patient Dataset
Table 2.1 • Cardiology Patient Data
Attribute Name | Mixed Values | Numeric Values | Comments
Age | Numeric | Numeric | Age in years
Sex | Male, Female | 1, 0 | Patient gender
Chest Pain Type | Angina, Abnormal Angina, NoTang, Asymptomatic | 1–4 | NoTang = Nonanginal pain
Blood Pressure | Numeric | Numeric | Resting blood pressure upon hospital admission
Cholesterol | Numeric | Numeric | Serum cholesterol
Fasting Blood Sugar < 120 | True, False | 1, 0 | Is fasting blood sugar less than 120?
Resting ECG | Normal, Abnormal, Hyp | 0, 1, 2 | Hyp = Left ventricular hypertrophy
Maximum Heart Rate | Numeric | Numeric | Maximum heart rate achieved
Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope | Up, Flat, Down | 1–3 | Slope of the peak exercise ST segment
Number Colored Vessels | 0, 1, 2, 3 | 0, 1, 2, 3 | Number of major vessels colored by fluoroscopy
Thal | Normal, Fix, Rev | 3, 6, 7 | Normal, fixed defect, reversible defect
Concept Class | Healthy, Sick | 1, 0 | Angiographic disease status
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
Attribute Name | Most Typical Healthy Class | Least Typical Healthy Class | Most Typical Sick Class | Least Typical Sick Class
Age | 52 | 63 | 60 | 62
Sex | Male | Male | Male | Female
Chest Pain Type | NoTang | Angina | Asymptomatic | Asymptomatic
Blood Pressure | 138 | 145 | 125 | 160
Cholesterol | 223 | 233 | 258 | 164
Fasting Blood Sugar < 120 | False | True | False | False
Resting ECG | Normal | Hyp | Hyp | Hyp
Maximum Heart Rate | 169 | 150 | 141 | 145
Induced Angina? | False | False | True | False
Old Peak | 0 | 2.3 | 2.8 | 6.2
Slope | Up | Down | Flat | Down
Number of Colored Vessels | 0 | 0 | 1 | 3
Thal | Normal | Fix | Rev | Rev
A Healthy Class Rule for the Cardiology Patient Dataset
• IF 169 <= Maximum Heart Rate <= 202
• THEN Concept Class = Healthy
• Rule accuracy: 85.07%
• Rule coverage: 34.55%
A Sick Class Rule for the Cardiology Patient Dataset
• IF Thal = Rev & Chest Pain Type = Asymptomatic
• THEN Concept Class = Sick
• Rule accuracy: 91.14%
• Rule coverage: 52.17%
Unsupervised Clustering
• Determine if concepts can be found in the data.
• Evaluate the likely performance of a supervised model.
• Determine a best set of input attributes for supervised learning.
• Detect outliers.
Market Basket Analysis
• Find interesting relationships among retail products.
• Uses association rule algorithms.
2.2 Supervised Data Mining Techniques
The Credit Card Promotion Database
Table 2.3 • The Credit Card Promotion Database
Income Range ($) | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex | Age
40–50K | Yes | No | No | No | Male | 45
30–40K | Yes | Yes | Yes | No | Female | 40
40–50K | No | No | No | No | Male | 42
30–40K | Yes | Yes | Yes | Yes | Male | 43
50–60K | Yes | No | Yes | No | Female | 38
20–30K | No | No | No | No | Female | 55
30–40K | Yes | No | Yes | Yes | Male | 35
20–30K | No | Yes | No | No | Male | 27
30–40K | Yes | No | No | No | Male | 43
30–40K | Yes | Yes | Yes | No | Female | 41
40–50K | No | Yes | Yes | No | Female | 43
20–30K | No | Yes | Yes | No | Male | 29
50–60K | Yes | Yes | Yes | No | Female | 39
40–50K | No | Yes | No | No | Male | 55
20–30K | No | No | Yes | Yes | Female | 19
A Hypothesis for the Credit Card Promotion Database
• A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.
A Production Rule for the Credit Card Promotion Database
• IF Sex = Female & 19 <= Age <= 43
• THEN Life Insurance Promotion = Yes
• Rule accuracy: 100.00%
• Rule coverage: 66.67%
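Production Rules
• Rule accuracy is a between-class measure: of the instances that satisfy the rule's conditions, the fraction that belong to the rule's class.
• Rule coverage is a within-class measure: of the instances in the rule's class, the fraction that satisfy the rule's conditions.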
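To make the two measures concrete, the following minimal sketch recomputes the accuracy and coverage of the production rule for the credit card promotion database. The Sex, Age, and Life Insurance Promotion columns are copied from Table 2.3; the function names are illustrative and not part of the slides.

```python
# Columns copied from Table 2.3, one entry per instance (rows 1 to 15).
SEX = ["Male", "Female", "Male", "Male", "Female", "Female", "Male", "Male",
       "Male", "Female", "Female", "Male", "Female", "Male", "Female"]
AGE = [45, 40, 42, 43, 38, 55, 35, 27, 43, 41, 43, 29, 39, 55, 19]
LIFE_INS = ["No", "Yes", "No", "Yes", "Yes", "No", "Yes", "No",
            "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"]


def rule_fires(sex, age):
    """Antecedent of the rule: Sex = Female & 19 <= Age <= 43."""
    return sex == "Female" and 19 <= age <= 43


def rule_accuracy_and_coverage(target="Yes"):
    """Return (accuracy, coverage) of the rule for the given target class."""
    # Class labels of every instance that satisfies the antecedent.
    fired = [life for sex, age, life in zip(SEX, AGE, LIFE_INS)
             if rule_fires(sex, age)]
    correct = sum(1 for life in fired if life == target)
    in_class = sum(1 for life in LIFE_INS if life == target)
    accuracy = correct / len(fired)   # between-class measure
    coverage = correct / in_class     # within-class measure
    return accuracy, coverage


accuracy, coverage = rule_accuracy_and_coverage()
print(f"Rule accuracy: {accuracy:.2%}")  # Rule accuracy: 100.00%
print(f"Rule coverage: {coverage:.2%}")  # Rule coverage: 66.67%
```

Six instances satisfy the antecedent and all six accepted the promotion, while nine instances accepted it overall, which gives the 100.00% and 66.67% figures quoted for the rule.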
Neural Networks
Figure 2.2 A multilayer fully connected neural network (input layer, hidden layer, output layer)
Table 2.4 • Neural Network Training: Actual and Computed Output
Instance Number | Life Insurance Promotion | Computed Output
1 | 0 | 0.024
2 | 1 | 0.998
3 | 0 | 0.023
4 | 1 | 0.986
5 | 1 | 0.999
6 | 0 | 0.050
7 | 1 | 0.999
8 | 0 | 0.262
9 | 0 | 0.060
10 | 1 | 0.997
11 | 1 | 0.999
12 | 1 | 0.776
13 | 1 | 0.999
14 | 0 | 0.023
15 | 1 | 0.999
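One way to read Table 2.4 is to turn each computed output into a yes/no decision and compare it with the actual value. The sketch below does that with a cutoff of 0.5, which is an assumption made here for illustration; the slides do not state a threshold.

```python
# Actual and computed outputs copied from Table 2.4 (instances 1 to 15).
ACTUAL = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
COMPUTED = [0.024, 0.998, 0.023, 0.986, 0.999, 0.050, 0.999, 0.262,
            0.060, 0.997, 0.999, 0.776, 0.999, 0.023, 0.999]

THRESHOLD = 0.5  # assumed cutoff, not taken from the slides


def to_decision(output, threshold=THRESHOLD):
    """Map a network output in [0, 1] to a class decision (1 = Yes, 0 = No)."""
    return 1 if output >= threshold else 0


decisions = [to_decision(output) for output in COMPUTED]
num_correct = sum(d == a for d, a in zip(decisions, ACTUAL))
print(f"{num_correct} of {len(ACTUAL)} training instances classified correctly")
```

Under this cutoff every training instance lands on the correct side, although instances 8 and 12 (outputs 0.262 and 0.776) are noticeably less certain than the rest.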
Statistical Regression
• Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727
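The regression equation can be evaluated directly once the categorical attributes are coded as numbers. The sketch below assumes Yes = 1 and No = 0 for credit card insurance, and Male = 1 and Female = 0 for sex. This coding is an assumption, borrowed from the numeric coding shown in Table 2.1. The sketch takes the sex term as subtracted, which is consistent with a least-squares fit to Table 2.3 under this coding.

```python
def life_insurance_score(credit_card_insurance, sex):
    """Evaluate the regression equation from the slide.

    Assumed coding (not stated on the slide):
    credit_card_insurance: 1 = Yes, 0 = No; sex: 1 = Male, 0 = Female.
    """
    return 0.5909 * credit_card_insurance - 0.5455 * sex + 0.7727


# Score each of the four possible input combinations.
for cci in (0, 1):
    for sex in (0, 1):
        score = life_insurance_score(cci, sex)
        print(f"credit card insurance={cci}, sex={sex}: {score:.4f}")
```

Scores closer to 1 suggest acceptance of the life insurance promotion. Note that a linear equation is not confined to the range 0 to 1, so a score such as the one for a female card holder with credit card insurance can exceed 1.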
2.3 Association Rules
An Association Rule for the Credit Card Promotion Database
• IF Sex = Female & Age = over40 & Credit Card Insurance = No
• THEN Life Insurance Promotion = Yes
2.4 Clustering Techniques
Cluster 1
# Instances: 3
Sex: Male => 3, Female => 0
Age: 43.3
Credit Card Insurance: Yes => 0, No => 3
Life Insurance Promotion: Yes => 0, No => 3

Cluster 2
# Instances: 5
Sex: Male => 3, Female => 2
Age: 37.0
Credit Card Insurance: Yes => 1, No => 4
Life Insurance Promotion: Yes => 2, No => 3

Cluster 3
# Instances: 7
Sex: Male => 2, Female => 5
Age: 39.9
Credit Card Insurance: Yes => 2, No => 5
Life Insurance Promotion: Yes => 7, No => 0

Figure 2.3 An unsupervised clustering of the credit card database
2.5 Evaluating Performance
Evaluating Supervised Learner Models
Confusion Matrix
• A matrix used to summarize the results of a supervised classification.
• Entries along the main diagonal are correct classifications.
• Entries other than those on the main diagonal are classification errors.
Table 2.5 • A Three-Class Confusion Matrix
Rows give the actual class; columns give the computed decision.
Actual | Computed C1 | Computed C2 | Computed C3
C1 | C11 | C12 | C13
C2 | C21 | C22 | C23
C3 | C31 | C32 | C33
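As an illustration of how such a matrix is filled in, the sketch below tallies a confusion matrix from paired actual and computed labels and reads the accuracy off the main diagonal. The eight labels are made-up sample data, and the function names are hypothetical.

```python
def confusion_matrix(actual, computed, classes):
    """Count instances by (actual class, computed class).

    Row i is the actual class, column j the computed decision.
    """
    index = {label: position for position, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, c in zip(actual, computed):
        matrix[index[a]][index[c]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the main diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Made-up labels for eight instances, for illustration only.
actual = ["C1", "C1", "C2", "C2", "C2", "C3", "C3", "C3"]
computed = ["C1", "C2", "C2", "C2", "C3", "C3", "C3", "C1"]

matrix = confusion_matrix(actual, computed, ["C1", "C2", "C3"])
for row in matrix:
    print(row)
print("accuracy:", accuracy(matrix))
```

Every off-diagonal count is a classification error, so the error rate is simply one minus the accuracy.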
Two-Class Error Analysis
Table 2.6 • A Simple Confusion Matrix
Actual | Computed Accept | Computed Reject
Accept | True Accept | False Reject
Reject | False Accept | True Reject
Table 2.7 • Two Confusion Matrices Each Showing a 10% Error Rate
Model A | Computed Accept | Computed Reject
Accept | 600 | 25
Reject | 75 | 300

Model B | Computed Accept | Computed Reject
Accept | 600 | 75
Reject | 25 | 300
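The two models in Table 2.7 share an error rate but distribute their errors differently. The minimal sketch below checks this, assuming the row order Accept then Reject and the column order Computed Accept then Computed Reject used in Table 2.6; the function name is illustrative.

```python
# Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
MODEL_A = [[600, 25], [75, 300]]
MODEL_B = [[600, 75], [25, 300]]


def error_summary(matrix):
    """Return the overall error rate and the two kinds of error counts."""
    (true_accept, false_reject), (false_accept, true_reject) = matrix
    total = true_accept + false_reject + false_accept + true_reject
    return {
        "error_rate": (false_accept + false_reject) / total,
        "false_accept": false_accept,
        "false_reject": false_reject,
    }


summary_a = error_summary(MODEL_A)
summary_b = error_summary(MODEL_B)
print("Model A:", summary_a)
print("Model B:", summary_b)
```

Both models misclassify 100 of 1,000 instances, but Model A makes three times as many false accepts as Model B. Which model is preferable depends on whether a false accept or a false reject is the more costly mistake.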
Evaluating Numeric Output
• Mean absolute error
• Mean squared error
• Root mean squared error
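As a worked illustration, the three measures can be computed for the neural network outputs listed in Table 2.4. The sketch below uses the standard definitions; the function names are chosen here and do not come from the slides.

```python
import math

# Actual and computed outputs copied from Table 2.4 (instances 1 to 15).
ACTUAL = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1]
COMPUTED = [0.024, 0.998, 0.023, 0.986, 0.999, 0.050, 0.999, 0.262,
            0.060, 0.997, 0.999, 0.776, 0.999, 0.023, 0.999]


def mean_absolute_error(actual, computed):
    """Average of |actual - computed| over all instances."""
    return sum(abs(a - c) for a, c in zip(actual, computed)) / len(actual)


def mean_squared_error(actual, computed):
    """Average of (actual - computed)^2; penalizes large errors more heavily."""
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)


def root_mean_squared_error(actual, computed):
    """Square root of the mean squared error, in the units of the output."""
    return math.sqrt(mean_squared_error(actual, computed))


mae = mean_absolute_error(ACTUAL, COMPUTED)
mse = mean_squared_error(ACTUAL, COMPUTED)
rmse = root_mean_squared_error(ACTUAL, COMPUTED)
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

Most of the squared error comes from instances 8 and 12, whose computed outputs are furthest from the actual values; this is why the root mean squared error is roughly twice the mean absolute error here.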
Comparing Models by Measuring Lift
Figure 2.4 Targeted vs. mass mailing (x-axis: % Sampled, 0 to 100; y-axis: Number Responding, 0 to 1200)
Computing Lift
Lift = P(Ci | Sample) / P(Ci | Population)
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
No Model | Computed Accept | Computed Reject
Accept | 1,000 | 0
Reject | 99,000 | 0

Ideal Model | Computed Accept | Computed Reject
Accept | 1,000 | 0
Reject | 0 | 99,000
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift Equal to 2.25
Model X | Computed Accept | Computed Reject
Accept | 540 | 460
Reject | 23,460 | 75,540

Model Y | Computed Accept | Computed Reject
Accept | 450 | 550
Reject | 19,550 | 79,450
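The lift values for Tables 2.8 and 2.9 can be checked with a short calculation. The sketch below assumes that the "sample" is the set of instances a model computes as Accept and that Ci is the Accept class; this reading is an interpretation of the lift formula, not something the slides spell out.

```python
# Rows: actual Accept, actual Reject. Columns: computed Accept, computed Reject.
NO_MODEL = [[1000, 0], [99000, 0]]
IDEAL_MODEL = [[1000, 0], [0, 99000]]
MODEL_X = [[540, 460], [23460, 75540]]
MODEL_Y = [[450, 550], [19550, 79450]]


def lift(matrix):
    """Lift = P(Accept | Sample) / P(Accept | Population).

    Assumes the sample is every instance the model computes as Accept.
    """
    (true_accept, false_reject), (false_accept, true_reject) = matrix
    population = true_accept + false_reject + false_accept + true_reject
    sample = true_accept + false_accept
    p_sample = true_accept / sample
    p_population = (true_accept + false_reject) / population
    return p_sample / p_population


for name, matrix in [("No model", NO_MODEL), ("Ideal model", IDEAL_MODEL),
                     ("Model X", MODEL_X), ("Model Y", MODEL_Y)]:
    print(f"{name}: lift = {lift(matrix):.2f}")
```

Models X and Y reach the same lift with different sample sizes: X selects 24,000 instances and captures 540 responders, while Y selects 20,000 and captures 450. Lift alone therefore does not decide between them.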
Unsupervised Model Evaluation