Download Chapter 3 Effects of IT on Strategy and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Chapter 2
Data Mining: A Closer Look
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]
A/W & Dr. Chen, Data Mining
2.1 Data Mining Strategies
A/W & Dr. Chen, Data Mining
Figure 2.1 - A hierarchy of data mining strategies
Data Mining
Strategies
Unsupervised
Clustering
Market Basket
Analysis
Supervised
Learning
No output attributes
Classification
Categorical/discrete
(current behavior)
A/W & Dr. Chen, Data Mining
Prediction
Estimation
Numeric
Future outcome
(categorical/numeric) 3
Classification
• Learning is supervised.
• The dependent variable is
categorical.
• Well-defined classes.
• Current rather than future
behavior.
Estimation
• Learning is supervised.
• The dependent variable is
numeric.
• Well-defined classes.
• Current rather than future
behavior.
Prediction
• The emphasis is on predicting future rather than current
outcomes.
• The output attribute may be categorical or numeric.
A/W & Dr. Chen, Data Mining
The Cardiology Patient Dataset
5
A/W & Dr. Chen, Data Mining
Table 2.1 • Cardiology Patient Data
Attribute
Name
Mixed
Values
Numeric
Values
Comments
Age
Numeric
Numeric
Age in years
Sex
Male, Female
1, 0
Patient gender
Chest Pain Type
Angina, Abnormal Angina,
1–4
NoTang =Nonanginal
NoTang, Asymptomatic
Blood Pressure
Numeric
pain
Numeric
Resting blood pressure
upon hospital admission
Cholesterol
Numeric
Numeric
Serum cholesterol
Fasting Blood
True, False
1, 0
Is f asting blood sugar less
Sugar < 120
Resting ECG
than 120?
Normal, Abnormal, Hyp
0, 1, 2
Hyp = Lef t ventricular
hypertrophy
Maximum Heart
Numeric
Numeric
Rate
Induced Angina?
Maximum heart rate
achieved
True, False
1, 0
Does the patient experience angina
as a result of exercise?
Old Peak
Numeric
Numeric
ST depression induced by exercise
relative to rest
Slope
Up, f lat, dow n
1–3
Slope of the peak exercise ST
segment
Number Colored
0, 1, 2, 3
0, 1, 2, 3
Vessels
Thal
Number of major vessels
colored by f luorosopy
Normal f ix, rev
3, 6, 7
Normal, f ixed def ect,
reversible def ect
Concept Class
A/W & Dr. Chen, Data Mining
Healthy, Sick
1, 0
Angiographic disease status
6
Table 2.2 • Most and Least Typical Instances from the Cardiology Domain
Attribute
Name
Most Typical Least Typical Most Typical Least Typical
Healthy Class Healthy Class Sick Class
Sick Class
Age
52
63
60
62
Sex
Male
Male
Male
Female
Chest Pain Type
NoTang
Angina
Asymptomatic
Asymptomatic
Blood Pressure
138
145
125
160
Cholesterol
223
233
258
164
Fasting Blood Sugar < 120
FALSE
TRUE
FALSE
FALSE
Resting ECG
Normal
Hyp
Hyp
Hyp
169
150
141
145
FALSE
FALSE
TRUE
FALSE
0
2.3
2.8
6.2
Up
Down
Flat
Down
0
0
1
3
Normal
Fix
Rev
Rev
Maximum Heart Rate
Induced Angina?
Old Peak
Slope
Number of Colored Vessels
Thal
A/W & Dr. Chen, Data Mining
7
A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <=202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%
A/W & Dr. Chen, Data Mining
Unsupervised Clustering
• Determine if concepts can be found in
the data.
• Evaluate the likely performance of a
supervised model.
• Determine a best set of input attributes
for supervised learning.
• Detect Outliers.
A/W & Dr. Chen, Data Mining
Market Basket Analysis
• Find interesting
relationships among
retail products.
• Uses association rule
algorithms.
• The results of a market
basket analysis help
retailers
– Design promotions,
– Arrange shelf or
catalog items, and
– Develop crossmarketing strategies
10
A/W & Dr. Chen, Data Mining
2.2 Supervised Data Mining
Techniques
11
A/W & Dr. Chen, Data Mining
The Credit Card Promotion
Database
12
A/W & Dr. Chen, Data Mining
Table 2.3 • The Credit Card Promotion Database
Income Magazine
Watch
Life Insurance Credit Card
Range ($) Promotion Promotion Promotion
Insurance
40–50K
30–40K
40–50K
30–40K
50–60K
20–30K
30–40K
20–30K
30–40K
30–40K
40–50K
20–30K
50–60K
40–50K
20–30K
Yes
Yes
No
Yes
Yes
No
Yes
No
Yes
Yes
No
No
Yes
No
No
No
Yes
No
Yes
No
No
No
Yes
No
Yes
Yes
Yes
Yes
Yes
No
No
Yes
No
Yes
Yes
No
Yes
No
No
Yes
Yes
Yes
Yes
No
Yes
No
No
No
Yes
No
No
Yes
No
No
No
No
No
No
No
Yes
Sex
Age
Male
Female
Male
Male
Female
Female
Male
Male
Male
Female
Female
Male
Female
Male
Female
45
40
42
43
38
55
35
27
43
41
43
29
39
55
19
13
A/W & Dr. Chen, Data Mining
Data file: CreditCardPromotion.xls
Income Range
Magazine
C
C
I
I
40-50,000 Yes
30-40,000 Yes
40-50,000 No
30-40,000 Yes
50-60,000 Yes
20-30,000 No
30-40,000 Yes
20-30,000 No
30-40,000 Yes
30-40,000 Yes
40-50,000 No
20-30,000 No
50-60,000 Yes
40-50,000 No
20-30,000 No
Promo
Watch Promo
Life Ins Promo
Credit CardSex
Ins.
C
C
C
C
I
I
I
I
No
No
No
Male
Yes
Yes
No
Female
No
No
No
Male
Yes
Yes
Yes
Male
No
Yes
No
Female
No
No
No
Female
No
Yes
Yes
Male
Yes
No
No
Male
No
No
No
Male
Yes
Yes
No
Female
Yes
Yes
No
Female
Yes
Yes
No
Male
Yes
Yes
No
Female
Yes
No
No
Male
No
Yes
Yes
Female
Age
R
I
45
40
42
43
38
55
35
27
43
41
43
29
39
55
19
14
A/W & Dr. Chen, Data Mining
A Hypothesis for the Credit
Card Promotion Database
A combination of one or more of the dataset attributes
differentiate Acme Credit Card Company card holders
who have taken advantage of the life insurance
promotion and those card holders who have chosen not
to participate in the promotional offer.
A/W & Dr. Chen, Data Mining
A Production Rule for the
Credit Card Promotion Database
IF Sex = Female & 19 <=Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%
Question: Can we assume that two-thirds of all females in
the specified age range will take advantage of the
promotion?
• Rule accuracy is a between-class measure.
• Rule coverage is a within-class measure.
A/W & Dr. Chen, Data Mining
Neural Networks
• A neural network is a set of interconnected nodes
designed to imitate the functioning of the human
brain.
• Two phases of operations
– Learning phase at the input layer until it reaches a
predetermined minimum error rate
– Fixing weights and recompute output values for
new instances
• A major shortcoming of the neural network
approach is a lack of explanation about what has
been learned.
17
A/W & Dr. Chen, Data Mining
Input
Layer
Hidden
Layer
Output
Layer
18
A/W & Dr. Chen, Data Mining
Table 2.3 • The Credit Card Promotion Database
Income
Range ($)
Magazine
Watch
Promotion Promotion
40–50K
30–40K
Yes
Yes
No
Yes
Life Insurance Credit Card
Promotion
Insurance
Sex
Age
No
Yes
No
No
Male
Female
45
40
(Note that blue: input attributes, red: output attributes)
Therefore, there four input nodes and one output node and chose
five hidden-layer nodes.
19
A/W & Dr. Chen, Data Mining
Table 2.4 - Neural Network Training: Actual and Computed Output
Instance Number Life Insurance Promotion Computed Output
1
0
0.024
2
1
0.998
3
0
0.023
4
1
0.986
5
1
0.999
6
0
0.05
7
1
0.999
8
0
0.262
9
0
0.06
10
1
0.997
11
1
0.999
12
1
0.776
13
1
0.999
14
0
0.023
15
1
0.999
20
A/W & Dr. Chen, Data Mining
Table 2.4 - Neural Network Training: Actual and Computed Output
Instance Number
Life Insurance Promotion
Computed Output
1
0
0.024
2
1
0.998
3
0
0.023
4
1
0.986
5
1
0.999
6
0
0.05
7
1
0.999
8
0
0.262
9
0
0.06
10
1
0.997
11
1
0.999
12
1
0.776
13
1
0.999
14
0
0.023
15
1
0.999
21
A/W & Dr. Chen, Data Mining
Statistical Regression
• Statistical regression is a supervised learning
technique that generalizes a set of numeric data by
creating a mathematical equation relating one or
more input attributes to a single numeric output
attributes.
• Linear regression model is characterized by an
output attribute whose value is determined by a
linear sum of weighted input attribute values.
22
A/W & Dr. Chen, Data Mining
Statistical Regression
Life insurance promotion =
0.5909 (credit card insurance) 0.5455 (sex) + 0.7727
Example, a female who does not have credit card insurance is a
likely candidate for the life insurance promotion.
Life insurance promotion = 0.5909 (0) –
0.5455 (0) + 0.7727 = 0.7727
Because the value 0.7727 is close to 1.0, we conclude that the
individual is likely to take advantage of the promotional offer.
A/W & Dr. Chen, Data Mining
2.3 Association Rules
• Association rule mining techniques are used to
discover interesting associations between
attributes contained in a database.
• Unlike traditional association rules, association
rules can have one or several output attributes.
• An output attribute for one rule can be an input
attribute for another rule.
• A popular technique for ‘market basket’
analysis.
• Problem: we may have several rules with little
value.
24
A/W & Dr. Chen, Data Mining
An Association Rule for the
Credit Card Promotion
Database
IF Sex = Female & Age = over40 &
Credit Card Insurance = No
THEN Life Insurance Promotion = Yes
A/W & Dr. Chen, Data Mining
2.4 Clustering Techniques
• By applying unsupervised clustering to the
instances of the Acme Credit Card
Company database, we will find a subset of
input attributes that differentiate card
holders who have taken advantage of the
life insurance promotion from those
cardholders who have not accepted the
promotion offer.
26
A/W & Dr. Chen, Data Mining
Cluster 1
# Instances: 3
Sex: Male => 3
Female => 0
Age: 43.3
Credit Card Insurance:
Yes => 0
No => 3
Life Insurance Promotion:
Yes => 0
No => 3
Cluster 2
# Instances: 5
Sex: Male => 3
Female => 2
Age: 37.0
Credit Card Insurance:
Yes => 1
No => 4
Life Insurance Promotion:
Yes => 2
No => 3
IF Sex=Female & 43>=Age>=35
& Credit Card Insurance=NO
THEN Class = 3
Rule Accuracy: 100%
Rule Coverage: 66.67%
Cluster 3
# Instances: 7
Sex: Male => 2
Female => 5
Age: 39.9
Credit Card Insurance:
Yes => 2
No => 5
Life Insurance Promotion:
Yes => 7
No => 0
27
A/W & Dr. Chen, Data Mining
End here for now!
28
A/W & Dr. Chen, Data Mining
2.5 Evaluating Performance
• Performance evaluation is probably the
most critical of all the steps in the data
mining process. Three general questions:
– 1. Will the benefits received from a data mining
project more than offset the cost of the data mining
process?
– 2. How do we interpret the results of a data mining
session?
– 3. Can we use the results of a data mining process
with confidence?
29
A/W & Dr. Chen, Data Mining
Evaluating Supervised Learner
Models
• Supervised learner models are designed to
classify, estimate, and/or predict future outcome.
• Applications on classification correctness:
– Develop a model to accept to reject credit card
applications
– Develop a model to accept or reject home mortgage
applicants
– Develop a model to decide whether or not to drill for
oil
30
A/W & Dr. Chen, Data Mining
Confusion Matrix
• A matrix used to summarize the results of a
supervised classification.
• Entries along the main diagonal are correct
classifications.
• Entries other than those on the main diagonal
are classification errors.
Classification correctness is best calculated by
presenting previously unseen data in the form of a test
to the model being evaluated.
A confusion matrix is of little use for evaluating
supervised learner models offering numeric output.
A/W & Dr. Chen, Data Mining
Table 2.5 • A Three-Class Confusion Matrix
C1
C2
C3
Computed Decision
C1
C2
C3
C11
C12
C13
C21
C22
C23
C31
C32
C33
Rule 1: Values along the main diagonal represent correct classification
e.g., C11 represents the total number of class C1 instance correctly
classified by the model
Rule 2: for C2, C21,C22,C23 are all actually members of C2; but
C21
and C23 are incorrectly classified as members of
another class.
Rule 3: for C2, C12 and C32 are instances are incorrectly
C2.
classified as members of class Computation
questions: #1
32
A/W & Dr. Chen, Data Mining
Two-Class Error Analysis
Table 2.6 • A Simple Confusion Matrix
Computed
Accept
Accept
True
Accept
Reject
False
Accept
Computed
Reject
False
Reject
The more the better
True
Reject
The less the better
33
A/W & Dr. Chen, Data Mining
Table 2.7 • Two Confusion Matrices Each Showing a 10%
Error Rate
Model
A
Computed
Accept
Accept
Reject
600
75
Computed
Reject
Model
B
Computed
Accept
Computed
Reject
25
300
Accept
Reject
600
25
75
300
Which model is better?
34
A/W & Dr. Chen, Data Mining
Evaluating Numeric
• Mean absolute error
• Mean squared error
• Root mean squared error
A/W & Dr. Chen, Data Mining
Comparing Models by
Measuring Lift
• Marketing applications that focus on response rates
from mass mailings are less concerned with test set
classification error and more interested in building
models able to extract bias samples from large
populations.
• Supervised learner models designed for extracting
bias samples from a general population are often
evaluated by a measure that comes directly from
marketing known as lift.
36
A/W & Dr. Chen, Data Mining
Computing Lift
Lift 
P (Ci | Sample)
P (Ci | Population)
Ci is the class of al zero-balance customers who, given the
opportunity, will take advantage of the promotional offer.
A/W & Dr. Chen, Data Mining
Figure 2.4 – A lift chart (Targeted vs. mass mailing)
1200
1000
Number
Responding
800
600
400
200
0
0
10
20
30
40
50
60
70
80
90
100
% Sampled
38
A/W & Dr. Chen, Data Mining
Table 2.8 • Two Confusion Matrices: No Model and an Ideal Model
No
Model
Computed
Accept
Computed
Reject
Ideal
Model
Computed
Accept
Computed
Reject
Accept
Reject
1,000
99,000
0
0
Accept
Reject
1,000
0
0
99,000
Table 2.9 • Two Confusion Matrices for Alternative Models with Lift
Equal to 2.25
Model
X
Computed
Accept
Computed
Reject
Model
Y
Computed
Accept
Computed
Reject
Accept
Reject
540
23,460
460
75,540
Accept
Reject
450
19,550
550
79,450
39
A/W & Dr. Chen, Data Mining
Unsupervised Model Evaluation
• Evaluating unsupervised data mining is, in
general, a more difficult task than
supervised evaluation. This is true because
the goals of an unsupervised data mining
session are frequently not as clear as the
goals for supervised learning.
40
A/W & Dr. Chen, Data Mining
Data Mining
Strategies
Unsupervised
Clustering
Supervised
Learning
Market Basket
Analysis
Classification
Estimation
Prediction
41
A/W & Dr. Chen, Data Mining