Basic Data Mining Techniques
Chapter 3
3.1 Decision Trees
An Algorithm for Building
Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node, where each link represents a unique value of the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria (i.e., accuracy), or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute left to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
A minimal code sketch of this procedure follows below.
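The steps above describe a generic greedy, recursive procedure. The following is only an illustrative Python sketch: the dict-based instance representation, the pluggable `score` function, and the `min_accuracy` stopping threshold are assumptions, not part of the algorithm as stated.

```python
# A minimal sketch of the generic tree-building procedure above.
# Assumptions (not from the slides): instances are dicts, the target
# attribute is categorical, and "best" is decided by a caller-supplied
# scoring function.
from collections import Counter

def majority_class(instances, target):
    """Most common value of the target attribute."""
    return Counter(i[target] for i in instances).most_common(1)[0][0]

def build_tree(instances, attributes, target, score, min_accuracy=0.9):
    """Steps 2-4: pick the best attribute, split, and recurse."""
    labels = [i[target] for i in instances]
    accuracy = labels.count(majority_class(instances, target)) / len(labels)

    # Step 4: stop if the predefined criterion (accuracy) is met or no
    # attributes remain; the node then stores a classification.
    if accuracy >= min_accuracy or not attributes:
        return majority_class(instances, target)

    # Step 2: choose the attribute that best differentiates the instances.
    best = max(attributes, key=lambda a: score(instances, a, target))

    # Step 3: one child link per unique value of the chosen attribute.
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in {i[best] for i in instances}:
        subset = [i for i in instances if i[best] == value]
        node["children"][value] = build_tree(subset, remaining, target,
                                             score, min_accuracy)
    return node
```

Any attribute-scoring function can be plugged in as `score`: the accuracy-per-branch index illustrated in Figures 3.1 through 3.3, information gain (ID3), or the gini index (CART).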
Table 3.1 • The Credit Card Promotion Database

Income    Life Insurance   Credit Card
Range     Promotion        Insurance     Sex      Age
40–50K    No               No            Male     45
30–40K    Yes              No            Female   40
40–50K    No               No            Male     42
30–40K    Yes              Yes           Male     43
50–60K    Yes              No            Female   38
20–30K    No               No            Female   55
30–40K    Yes              Yes           Male     35
20–30K    No               No            Male     27
30–40K    No               No            Male     43
30–40K    Yes              No            Female   41
40–50K    Yes              No            Female   43
20–30K    Yes              No            Male     29
50–60K    Yes              No            Female   39
40–50K    No               No            Male     55
20–30K    Yes              Yes           Female   19
Target: life insurance promotion
Using a single node for classification: how should we index (score) the candidate attributes?
Figure 3.1 A partial decision tree with root node = income range
  20–30K: 2 Yes, 2 No
  30–40K: 4 Yes, 1 No
  40–50K: 1 Yes, 3 No
  50–60K: 2 Yes, 0 No
Accuracy = 11/15 = 0.733; index for this choice = 0.733 / 4 branches = 0.183
Target: life insurance promotion
Figure 3.2 A partial decision tree with root node = credit card insurance
  No:  6 Yes, 6 No
  Yes: 3 Yes, 0 No
Accuracy = 9/15 = 0.6; index for this choice = 0.6 / 2 branches = 0.3
Target: life insurance promotion
Figure 3.3 A partial decision tree with root node = age
  Age <= 43: 9 Yes, 3 No
  Age > 43:  0 Yes, 3 No
Accuracy = 12/15 = 0.8; index for this choice = 0.8 / 2 branches = 0.4
A partial decision tree with root node = sex
  Male:   3 Yes, 5 No
  Female: 6 Yes, 1 No
Accuracy = 11/15 = 0.733; index for this choice = 0.733 / 2 branches = 0.367
Age gives the highest index, so we choose age as the root attribute.
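The four index values above can be reproduced from Table 3.1 with a few lines of Python. This is only an illustrative sketch: the binary split on age at 43 follows Figure 3.3, and the tied 20–30K branch (2 Yes, 2 No) counts as two correct, which is how the slide reaches 11/15.

```python
# Reproduces the "accuracy / number of branches" index used in
# Figures 3.1-3.3 for the data of Table 3.1 (life insurance promotion
# is the target attribute).
from collections import Counter

# (income range, life insurance promotion, credit card insurance, sex, age)
data = [
    ("40-50K", "No",  "No",  "Male",   45), ("30-40K", "Yes", "No",  "Female", 40),
    ("40-50K", "No",  "No",  "Male",   42), ("30-40K", "Yes", "Yes", "Male",   43),
    ("50-60K", "Yes", "No",  "Female", 38), ("20-30K", "No",  "No",  "Female", 55),
    ("30-40K", "Yes", "Yes", "Male",   35), ("20-30K", "No",  "No",  "Male",   27),
    ("30-40K", "No",  "No",  "Male",   43), ("30-40K", "Yes", "No",  "Female", 41),
    ("40-50K", "Yes", "No",  "Female", 43), ("20-30K", "Yes", "No",  "Male",   29),
    ("50-60K", "Yes", "No",  "Female", 39), ("40-50K", "No",  "No",  "Male",   55),
    ("20-30K", "Yes", "Yes", "Female", 19),
]

def index_score(branch_of, rows):
    """Accuracy of majority-class predictions divided by the branch count."""
    branches = {}
    for row in rows:
        branches.setdefault(branch_of(row), []).append(row[1])  # target values
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in branches.values())
    return (correct / len(rows)) / len(branches)

print(index_score(lambda r: r[0], data))        # income range   -> 0.183
print(index_score(lambda r: r[2], data))        # credit card    -> 0.300
print(index_score(lambda r: r[4] <= 43, data))  # age split @ 43 -> 0.400
print(index_score(lambda r: r[3], data))        # sex            -> 0.367
```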
ID3
• See homework
Decision Trees for the Credit Card Promotion Database
Target: life insurance promotion, using three nodes for classification
Figure 3.4 A three-node decision tree for the credit card database (output: life insurance promotion)
  Age > 43: No (3/0)
  Age <= 43:
    Sex = Female: Yes (6/0)
    Sex = Male:
      Credit Card Insurance = Yes: Yes (2/0)
      Credit Card Insurance = No:  No (4/1)
Figure 3.5 A two-node decision tree for the credit card database (output: life insurance promotion)
  Credit Card Insurance = Yes: Yes (3/0)
  Credit Card Insurance = No:
    Sex = Female: Yes (6/1)
    Sex = Male:   No (6/1)
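Read as nested conditionals, the two-node tree of Figure 3.5 amounts to the small function below (a sketch; the function name and the dict keys are assumed for illustration).

```python
def predict_life_insurance(instance):
    """The two-node tree of Figure 3.5 expressed as nested conditionals."""
    if instance["credit_card_insurance"] == "Yes":
        return "Yes"          # Yes (3/0)
    if instance["sex"] == "Female":
        return "Yes"          # Yes (6/1)
    return "No"               # No (6/1)

# Example: a male customer without credit card insurance
print(predict_life_insurance({"credit_card_insurance": "No", "sex": "Male"}))  # No
```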
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No (Age <= 43, Sex = Male)

Income    Credit Card
Range     Insurance     Sex    Age   Life Insurance Promotion
40–50K    No            Male   42    No
20–30K    No            Male   27    No
30–40K    No            Male   43    No
20–30K    No            Male   29    Yes

The (4/1) label on this node means that four training instances follow the path, one of which (the last instance, with Life Insurance Promotion = Yes) is an error.
Decision Tree Rules

A rule for the tree in Figure 3.4:
IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A simplified rule obtained by removing the attribute Age:
IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
Other Methods for Building
Decision Trees
• CART (Classification and Regression Tree)
• CHAID (Chi-Square Automatic Interaction
Detector)
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Have been successfully applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
Disadvantages of Decision Trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.
3.2 Generating Association Rules
Confidence and Support
Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
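In probability terms, for a rule "If A then B" evaluated over N instances, confidence = P(B | A) = count(A and B) / count(A), while the support definition above corresponds to the count of instances containing the rule's items divided by N. Note that the worked examples that follow report support as the fraction of instances covered by the rule's IF part.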
Mining Association Rules:
An Example
Table 3.3 • A Subset of the Credit Card Promotion Database

Magazine    Watch       Life Insurance   Credit Card
Promotion   Promotion   Promotion        Insurance     Sex
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female
No          No          No               No            Male
Yes         Yes         Yes              Yes           Male
Yes         No          Yes              No            Female
No          No          No               No            Female
Yes         No          Yes              Yes           Male
No          Yes         No               No            Male
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female

Note: coverage level ≥ 4
Table 3.4 • Single-Item Sets

Single-Item Set                     Number of Items
Magazine Promotion = Yes            7
Watch Promotion = Yes               4
Watch Promotion = No                6
Life Insurance Promotion = Yes      5
Life Insurance Promotion = No       5
Credit Card Insurance = No          8
Sex = Male                          6
Sex = Female                        4
Table 3.5 • Two-Item Sets

Two-Item Set                                                   Number of Items
Magazine Promotion = Yes & Watch Promotion = No                4
Magazine Promotion = Yes & Life Insurance Promotion = Yes      5
Magazine Promotion = Yes & Credit Card Insurance = No          5
Magazine Promotion = Yes & Sex = Male                          4
Watch Promotion = No & Life Insurance Promotion = No           4
Watch Promotion = No & Credit Card Insurance = No              5
Watch Promotion = No & Sex = Male                              4
Life Insurance Promotion = No & Credit Card Insurance = No     5
Life Insurance Promotion = No & Sex = Male                     4
Credit Card Insurance = No & Sex = Male                        4
Credit Card Insurance = No & Sex = Female                      4
Three-Item Sets (coverage level ≥ 4)
• Watch Promotion = No & Life Insurance Promotion = No & Credit Card Insurance = No
Generating rules: using a two-item set
• Consider the set {Magazine Promotion = Yes, Life Insurance Promotion = Yes}, which covers 5 instances.
• Of the 7 instances with Magazine Promotion = Yes, 5 have Life Insurance Promotion = Yes and 2 have Life Insurance Promotion = No.
• Rule: IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
  Accuracy = 5/7; support = 7/10 (10 instances in total)
• How about the other rules?
Generating rules: using the three-item set
• {Watch Promotion = No, Life Insurance Promotion = No, Credit Card Insurance = No}
• Rule: IF Watch Promotion = No & Life Insurance Promotion = No THEN Credit Card Insurance = No (4/4)
• Accuracy = 100%; support = 4/10
• How about the other rules?
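The counts behind both rules can be checked against Table 3.3 with a short script. This is an illustrative sketch; as noted earlier, "support" here follows the slides' worked usage, i.e. the fraction of the ten instances matching the rule's IF part.

```python
# Reproduces the rule accuracy/support figures above from Table 3.3.

# (magazine, watch, life insurance, credit card insurance, sex)
rows = [
    ("Yes", "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "No",  "Female"),
    ("No",  "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No",  "Yes", "No",  "Female"), ("No",  "No",  "No",  "No",  "Female"),
    ("Yes", "No",  "Yes", "Yes", "Male"),   ("No",  "Yes", "No",  "No",  "Male"),
    ("Yes", "No",  "No",  "No",  "Male"),   ("Yes", "Yes", "Yes", "No",  "Female"),
]
COLS = {"Magazine": 0, "Watch": 1, "LifeIns": 2, "CardIns": 3, "Sex": 4}

def count(conditions):
    """Number of instances satisfying every (attribute, value) condition."""
    return sum(all(r[COLS[a]] == v for a, v in conditions) for r in rows)

# IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes
if_part = [("Magazine", "Yes")]
both = if_part + [("LifeIns", "Yes")]
print(count(both), "/", count(if_part))   # accuracy  5 / 7
print(count(if_part), "/", len(rows))     # support   7 / 10

# IF Watch = No & Life Insurance = No THEN Credit Card Insurance = No
if_part = [("Watch", "No"), ("LifeIns", "No")]
both = if_part + [("CardIns", "No")]
print(count(both), "/", count(if_part))   # accuracy  4 / 4
print(count(if_part), "/", len(rows))     # support   4 / 10
```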
General Considerations
• We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
• We are also interested in association rules that show a lower-than-expected confidence for a particular association.
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest
cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-5 until the cluster centers do not
change.
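A minimal Python sketch of the assignment/update loop (steps 3 through 5), run on the data of Table 3.6; the caller supplies K and the initial centers (steps 1 and 2). The centers used below are the ones chosen in the worked example that follows.

```python
# A minimal K-means sketch on the Table 3.6 data.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def kmeans(points, centers):
    """Steps 3-5: assign points to the nearest center, recompute, repeat.
    (Empty clusters are not handled in this sketch.)"""
    while True:
        clusters = [[] for _ in centers]
        for p in points:                           # step 3: assignment
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        new_centers = [                            # step 4: new cluster centers
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            for c in clusters
        ]
        if new_centers == centers:                 # step 5: centers unchanged
            return centers, clusters
        centers = new_centers

# Start from the centers chosen in the worked example: (1.0, 1.5) and (2.0, 1.5)
centers, clusters = kmeans(points, [(1.0, 1.5), (2.0, 1.5)])
print(centers)   # final cluster centers
print(clusters)  # final cluster memberships
```

Started from (1.0, 1.5) and (2.0, 1.5), this run converges to the centers (1.8, 2.7) and (5.0, 6.0), i.e., outcome 3 of Table 3.7.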
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0
Figure 3.6 A coordinate mapping of the data in Table 3.6 (x from 0 to 6, f(x) from 0 to 7)
Iteration 1: choose two cluster centers randomly
• C1 = (1.0, 1.5), C2 = (2.0, 1.5)
• d(C1, point 1) = 0,    d(C2, point 1) = 1
• d(C1, point 2) = 3,    d(C2, point 2) = 3.16
• d(C1, point 3) = 1,    d(C2, point 3) = 0
• d(C1, point 4) = 2.24, d(C2, point 4) = 2
• d(C1, point 5) = 2.24, d(C2, point 5) = 1.41
• d(C1, point 6) = 6.02, d(C2, point 6) = 5.41
Result of the first iteration
• Cluster 1 (C1): points 1, 2
• Cluster 2 (C2): points 3, 4, 5, 6
• New center C1 = [(1.0 + 1.0)/2, (1.5 + 4.5)/2] = (1.0, 3.0)
• New center C2 = [(2.0 + 2.0 + 3.0 + 5.0)/4, (1.5 + 3.5 + 2.5 + 6.0)/4] = (3.0, 3.375)
2nd iteration
• C1=(1.33,2.5)
• C2=(3.33,4)
• …..
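The second-iteration centers quoted above can be verified directly with a standalone check that reassigns the points to the first-iteration centers and averages each cluster.

```python
# Check of the second-iteration centers for the Table 3.6 data.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers = [(1.0, 3.0), (3.0, 3.375)]   # centers after the first iteration

clusters = [[], []]
for p in points:
    i = min((0, 1), key=lambda i: math.dist(p, centers[i]))
    clusters[i].append(p)

for c in clusters:
    print(round(sum(x for x, _ in c) / len(c), 2),
          round(sum(y for _, y in c) / len(c), 2))
# -> 1.33 2.5   and   3.33 4.0
```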
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2); different outcomes may result from different initial cluster centers

Outcome   Cluster Centers   Cluster Points    Squared Distance Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94 (poor)
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60 (good)
          (5, 6)            6
Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2): a poor clustering
Practice
• Choose an acceptable sum of squared distance error.
• SPSS uses a two-stage approach: first apply hierarchical clustering to determine K, then run K-means.
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the
data.
• Works best when the clusters in the data are of
approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Genetic Learning Operators
• Crossover
• Mutation
• Selection
Genetic Algorithms and Supervised
Learning
Figure 3.8 Supervised genetic learning: population elements are scored by a fitness function (e.g., a yes/no ratio) against the training data (targets); elements are either kept or thrown out, and the surviving elements become candidates for crossover and mutation.
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population   Income   Life Insurance   Credit Card
Element      Range    Promotion        Insurance     Sex      Age
1            20–30K   No               Yes           Male     30–39
2            30–40K   Yes              No            Female   50–59
3            ?        No               No            Male     40–49
4            30–40K   Yes              Yes           Male     40–49
Table 3.9 • Training Data for Genetic Learning

Training   Income   Life Insurance   Credit Card
Instance   Range    Promotion        Insurance     Sex      Age
1          30–40K   Yes              Yes           Male     30–39
2          30–40K   Yes              No            Female   40–49
3          50–60K   Yes              No            Female   30–39
4          20–30K   No               No            Female   50–59
5          20–30K   No               No            Male     20–29
6          30–40K   No               No            Male     40–49
Figure 3.9 A crossover operation
(attribute order: Income Range, Life Insurance Promotion, Credit Card Insurance, Sex, Age)

Parents (elements 1 and 2 of Table 3.8):
#1: 20–30K, No, Yes, Male, 30–39
#2: 30–40K, Yes, No, Female, 50–59

The parents exchange their Credit Card Insurance, Sex, and Age values, producing elements 1 and 2 of Table 3.10:
#1: 20–30K, No, No, Female, 50–59
#2: 30–40K, Yes, Yes, Male, 30–39
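The swap in Figure 3.9 is a single-point crossover over the list of attribute values. A small sketch follows; the list encoding and the fixed crossover point after the second attribute are taken from the figure, and the function names are illustrative.

```python
# Single-point crossover of two population elements, as in Figure 3.9.
# Attribute order: income range, life insurance promo, credit card ins., sex, age.

def crossover(parent1, parent2, point):
    """Swap everything after the crossover point between the two parents."""
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

elem1 = ["20-30K", "No",  "Yes", "Male",   "30-39"]   # element 1, Table 3.8
elem2 = ["30-40K", "Yes", "No",  "Female", "50-59"]   # element 2, Table 3.8

# Crossover point after the life insurance promotion attribute (Figure 3.9)
child1, child2 = crossover(elem1, elem2, 2)
print(child1)   # ['20-30K', 'No', 'No', 'Female', '50-59'] -> element 1, Table 3.10
print(child2)   # ['30-40K', 'Yes', 'Yes', 'Male', '30-39'] -> element 2, Table 3.10
```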
Table 3.10 • A Second-Generation Population

Population   Income   Life Insurance   Credit Card
Element      Range    Promotion        Insurance     Sex      Age
1            20–30K   No               No            Female   50–59
2            30–40K   Yes              Yes           Male     30–39
3            ?        No               No            Male     40–49
4            30–40K   Yes              Yes           Male     40–49
• Test:
  – A new instance is compared with the elements of the final population and assigned the same class as the most similar element.
  – Alternatively, randomly choose one element of the final population and assign the new instance that element's class.
Genetic Algorithms and
Unsupervised Clustering
• Agglomerative hierarchical clustering
• Partitional clustering
• Incremental clustering
Figure 3.10 Unsupervised genetic clustering: P instances I1, I2, ..., Ip, each described by attributes a1, a2, ..., an, are evaluated against K candidate solutions S1, S2, ..., SK; each solution holds elements (E11 and E12 for S1, E21 and E22 for S2, ..., Ek1 and Ek2 for SK) that act as cluster centers for the points assigned to that solution (e.g., the points in cluster S1).
Table 3.11 • A First-Generation Population for Unsupervised Clustering

                                   S1            S2            S3
Solution elements                  (1.0, 1.0)    (3.0, 2.0)    (4.0, 3.0)
(initial population)               (5.0, 5.0)    (3.0, 5.0)    (5.0, 1.0)
Fitness score                      11.31         9.78          15.55

Solution elements                  (5.0, 1.0)    (3.0, 2.0)    (4.0, 3.0)
(second generation)                (5.0, 5.0)    (3.0, 5.0)    (1.0, 1.0)
Fitness score                      17.96         9.78          11.34

Solution elements                  (5.0, 5.0)    (3.0, 2.0)    (4.0, 3.0)
(third generation)                 (1.0, 5.0)    (3.0, 5.0)    (1.0, 1.0)
Fitness score                      13.64         9.78          11.34

Notes: the two elements of each solution are the centers of group 1 and group 2; second- and third-generation elements are produced by crossover and mutation. S2 is the best solution at iteration 3. Final solution? Its center (3.0, 5.0) groups points 2, 4, and 6, and its center (3.0, 2.0) groups points 1, 3, and 5.
Homework
• Demonstrate Table 3.11
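As a starting point: the fitness scores in Table 3.11 can be reproduced by summing, for each solution, the Euclidean distance from every Table 3.6 point to the nearer of the solution's two centers. This interpretation of the fitness function is an assumption on my part, but it matches the tabled values.

```python
# Computes the fitness scores of Table 3.11: for each candidate solution
# (a pair of cluster centers), sum each Table 3.6 point's Euclidean
# distance to its nearer center. Smaller is fitter.
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

def fitness(centers):
    return sum(min(math.dist(p, c) for c in centers) for p in points)

initial_population = {
    "S1": [(1.0, 1.0), (5.0, 5.0)],
    "S2": [(3.0, 2.0), (3.0, 5.0)],
    "S3": [(4.0, 3.0), (5.0, 1.0)],
}
for name, centers in initial_population.items():
    print(name, round(fitness(centers), 2))   # S1 11.31, S2 9.78, S3 15.55
```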
General Considerations
• Global optimization is not guaranteed.
• The fitness function determines the computational complexity of the algorithm.
• Genetic algorithms can explain their results, provided the fitness function is understandable.
• Transforming the data to a form suitable for genetic learning can be a challenge.
3.5 Choosing a Data Mining
Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output attributes?
• What are the data types of the input and output attributes?
Further Considerations
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?
• Does the Data Contain Missing Values?
• Is Time an Issue?
• Which Technique Is Most Likely to Give the Best Test Set Accuracy?