Basic Data Mining Techniques
Chapter 3
3.1 Decision Trees
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
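A minimal Python sketch of these steps, assuming a simple stand-in scoring function for step 2 (a gain-style "best differentiates" measure would slot into the same place):

```python
from collections import Counter

def split_score(instances, attr, target):
    """Stand-in goodness measure for step 2: total majority-class
    count across the branches a split on attr would create."""
    score = 0
    for value in {inst[attr] for inst in instances}:
        branch = [inst[target] for inst in instances if inst[attr] == value]
        score += Counter(branch).most_common(1)[0][1]
    return score

def build_tree(instances, attributes, target):
    """Steps 1-4: instances is T (a list of attribute->value dicts),
    target names the output attribute."""
    labels = [inst[target] for inst in instances]
    # Step 4 stopping criteria: a pure subclass, or no attributes left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: the classification
    # Step 2: choose the attribute that best differentiates the instances.
    best = max(attributes, key=lambda a: split_score(instances, a, target))
    # Step 3: one child link per unique value of the chosen attribute.
    node = {best: {}}
    for value in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == value]
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, target)  # back to step 2
    return node
```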
Table 3.1 • The Credit Card Promotion Database

Income    Life Insurance    Credit Card
Range     Promotion         Insurance      Sex       Age
40–50K    No                No             Male      45
30–40K    Yes               No             Female    40
40–50K    No                No             Male      42
30–40K    Yes               Yes            Male      43
50–60K    Yes               No             Female    38
20–30K    No                No             Female    55
30–40K    Yes               Yes            Male      35
20–30K    No                No             Male      27
30–40K    No                No             Male      43
30–40K    Yes               No             Female    41
40–50K    Yes               No             Female    43
20–30K    Yes               No             Male      29
50–60K    Yes               No             Female    39
40–50K    Yes               No             Male      55
20–30K    No                Yes            Female    19
Figure 3.1 A partial decision tree with root node = income range
(20–30K: 2 Yes, 2 No; 30–40K: 4 Yes, 1 No; 40–50K: 1 Yes, 3 No; 50–60K: 2 Yes)
Figure 3.2 A partial decision tree with root node = credit card insurance
(No: 6 Yes, 6 No; Yes: 3 Yes, 0 No)
Figure 3.3 A partial decision tree with root node = age
(Age <= 43: 9 Yes, 3 No; Age > 43: 0 Yes, 3 No)
Decision Trees for the Credit Card Promotion Database
Figure 3.4 A three-node decision tree for the credit card database
(Age > 43: No (3/0); Age <= 43, Sex = Female: Yes (6/0); Age <= 43, Sex = Male, Credit Card Insurance = No: No (4/1); Age <= 43, Sex = Male, Credit Card Insurance = Yes: Yes (2/0))
Figure 3.5 A two-node decision tree for the credit card database
(Credit Card Insurance = Yes: Yes (3/0); Credit Card Insurance = No, Sex = Female: Yes (6/1); Credit Card Insurance = No, Sex = Male: No (6/1))
Table 3.2 • Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income    Life Insurance    Credit Card
Range     Promotion         Insurance      Sex     Age
40–50K    No                No             Male    42
20–30K    No                No             Male    27
30–40K    No                No             Male    43
20–30K    Yes               No             Male    29
Decision Tree Rules

A Rule for the Tree in Figure 3.4:

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age:

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No
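Both rules can be checked against Table 3.1 by counting how many instances each covers and how many it classifies correctly; a small sketch (each tuple is age, sex, credit card insurance, life insurance promotion):

```python
# The fifteen instances of Table 3.1.
rows = [
    (45, "Male", "No", "No"),    (40, "Female", "No", "Yes"),
    (42, "Male", "No", "No"),    (43, "Male", "Yes", "Yes"),
    (38, "Female", "No", "Yes"), (55, "Female", "No", "No"),
    (35, "Male", "Yes", "Yes"),  (27, "Male", "No", "No"),
    (43, "Male", "No", "No"),    (41, "Female", "No", "Yes"),
    (43, "Female", "No", "Yes"), (29, "Male", "No", "Yes"),
    (39, "Female", "No", "Yes"), (55, "Male", "No", "Yes"),
    (19, "Female", "Yes", "No"),
]

def rule_stats(condition):
    """(instances covered, instances where the rule's 'No' conclusion holds)."""
    covered = [r for r in rows if condition(r)]
    correct = [r for r in covered if r[3] == "No"]
    return len(covered), len(correct)

print(rule_stats(lambda r: r[0] <= 43 and r[1] == "Male" and r[2] == "No"))  # (4, 3)
print(rule_stats(lambda r: r[1] == "Male" and r[2] == "No"))                 # (6, 4)
```

The first result matches the No (4/1) leaf in Figure 3.4; dropping the Age test widens coverage from four instances to six while keeping the same "No" conclusion.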
Other Methods for Building
Decision Trees
• CART
• CHAID
Advantages of Decision Trees
• Easy to understand.
• Map nicely to a set of production rules.
• Applied to real problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
Disadvantages of Decision Trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.
3.2 Generating Association Rules

Confidence and Support
Rule Confidence
Given a rule of the form "If A then B", rule confidence is the conditional probability that B is true when A is known to be true.
Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
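As a sketch, both measures reduce to counting over the instance list; the helper names are illustrative, with the rule's antecedent and consequent supplied as predicates:

```python
def confidence(instances, antecedent, consequent):
    """Conditional probability that the consequent holds
    given that the antecedent holds."""
    covered = [i for i in instances if antecedent(i)]
    if not covered:
        return 0.0
    return sum(1 for i in covered if consequent(i)) / len(covered)

def support(instances, antecedent, consequent):
    """Percentage of all instances that contain every item in the rule."""
    both = sum(1 for i in instances if antecedent(i) and consequent(i))
    return 100.0 * both / len(instances)
```

For the rule "IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes" over Table 3.3 below, confidence is 5/7 and support is 5 of 10 instances, or 50%.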
Mining Association Rules: An Example

Table 3.3 • A Subset of the Credit Card Promotion Database
Magazine     Watch        Life Insurance    Credit Card
Promotion    Promotion    Promotion         Insurance      Sex
Yes          No           No                No             Male
Yes          Yes          Yes               No             Female
No           No           No                No             Male
Yes          Yes          Yes               Yes            Male
Yes          No           Yes               No             Female
No           No           No                No             Female
Yes          No           Yes               Yes            Male
No           Yes          No                No             Male
Yes          No           No                No             Male
Yes          Yes          Yes               No             Female
Table 3.4 • Single-Item Sets

Single-Item Sets                  Number of Items
Magazine Promotion = Yes          7
Watch Promotion = Yes             4
Watch Promotion = No              6
Life Insurance Promotion = Yes    5
Life Insurance Promotion = No     5
Credit Card Insurance = No        8
Sex = Male                        6
Sex = Female                      4
Table 3.5 • Two-Item Sets

Two-Item Sets                                                 Number of Items
Magazine Promotion = Yes & Watch Promotion = No               4
Magazine Promotion = Yes & Life Insurance Promotion = Yes     5
Magazine Promotion = Yes & Credit Card Insurance = No         5
Magazine Promotion = Yes & Sex = Male                         4
Watch Promotion = No & Life Insurance Promotion = No          4
Watch Promotion = No & Credit Card Insurance = No             5
Watch Promotion = No & Sex = Male                             4
Life Insurance Promotion = No & Credit Card Insurance = No    5
Life Insurance Promotion = No & Sex = Male                    4
Credit Card Insurance = No & Sex = Male                       4
Credit Card Insurance = No & Sex = Female                     4
Two Possible Two-Item Set Rules

IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)
Three-Item Set Rules

IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)
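The item sets behind these rules can be generated level-wise, Apriori style: count all item sets of a given size and keep those meeting the minimum coverage. A sketch over the Table 3.3 data (attribute names abbreviated; the coverage threshold of four instances is the one implied by Tables 3.4 and 3.5):

```python
from itertools import combinations

# Table 3.3, each instance a set of (attribute, value) items.
data = [
    {("Mag", "Yes"), ("Watch", "No"),  ("Life", "No"),  ("CCIns", "No"),  ("Sex", "Male")},
    {("Mag", "Yes"), ("Watch", "Yes"), ("Life", "Yes"), ("CCIns", "No"),  ("Sex", "Female")},
    {("Mag", "No"),  ("Watch", "No"),  ("Life", "No"),  ("CCIns", "No"),  ("Sex", "Male")},
    {("Mag", "Yes"), ("Watch", "Yes"), ("Life", "Yes"), ("CCIns", "Yes"), ("Sex", "Male")},
    {("Mag", "Yes"), ("Watch", "No"),  ("Life", "Yes"), ("CCIns", "No"),  ("Sex", "Female")},
    {("Mag", "No"),  ("Watch", "No"),  ("Life", "No"),  ("CCIns", "No"),  ("Sex", "Female")},
    {("Mag", "Yes"), ("Watch", "No"),  ("Life", "Yes"), ("CCIns", "Yes"), ("Sex", "Male")},
    {("Mag", "No"),  ("Watch", "Yes"), ("Life", "No"),  ("CCIns", "No"),  ("Sex", "Male")},
    {("Mag", "Yes"), ("Watch", "No"),  ("Life", "No"),  ("CCIns", "No"),  ("Sex", "Male")},
    {("Mag", "Yes"), ("Watch", "Yes"), ("Life", "Yes"), ("CCIns", "No"),  ("Sex", "Female")},
]

MIN_COVERAGE = 4  # minimum number of instances an item set must cover

def frequent_item_sets(size):
    """Count every item set of the given size; keep the frequent ones."""
    counts = {}
    for items in data:
        for combo in combinations(sorted(items), size):
            counts[combo] = counts.get(combo, 0) + 1
    return {s: n for s, n in counts.items() if n >= MIN_COVERAGE}

print(frequent_item_sets(1))  # the eight single-item sets of Table 3.4
print(frequent_item_sets(2))  # the eleven two-item sets of Table 3.5
```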
General Considerations

• We are interested in association rules that show a lift in product sales, where the lift is the result of the product's association with one or more other products.
• We are also interested in association rules that show a lower-than-expected confidence for a particular association.
3.3 The K-Means Algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest
cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3–5 until the cluster centers do not change.
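A compact sketch of these steps in Python, run on the six points of Table 3.6 below (Euclidean distance; the random choice in step 2 is seeded so the run repeats):

```python
import math
import random

def k_means(points, k, seed=0):
    """Steps 1-5: pick k random centers, then alternate assignment
    and center recomputation until the centers stop changing."""
    random.seed(seed)
    centers = random.sample(points, k)                         # step 2
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                       # step 3
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [                                        # step 4
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                             # step 5
            return centers, clusters
        centers = new_centers

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
          (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]  # Table 3.6
centers, clusters = k_means(points, k=2)
print(centers, clusters)
```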
An Example Using K-Means
Table 3.6 • K-Means Input Values

Instance    X      Y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0
Figure 3.6 A coordinate mapping of the data in Table 3.6
Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome    Cluster Centers    Cluster Points     Squared Error
1          (2.67, 4.67)       2, 4, 6            14.50
           (2.00, 1.83)       1, 3, 5
2          (1.5, 1.5)         1, 3               15.94
           (2.75, 4.125)      2, 4, 5, 6
3          (1.8, 2.7)         1, 2, 3, 4, 5      9.60
           (5, 6)             6
Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)
General Considerations
• Requires real-valued data.
• We must select the number of clusters present in the
data.
• Works best when the clusters in the data are of
approximately equal size.
• Attribute significance cannot be determined.
• Lacks explanation capabilities.
3.4 Genetic Learning
Genetic Learning Operators
• Crossover
• Mutation
• Selection
Genetic Algorithms and Supervised Learning
Figure 3.8 Supervised genetic learning
(population elements are scored by a fitness function against the training data; elements marked "keep" become candidates for crossover and mutation, the rest are thrown out)
Table 3.8 • An Initial Population for Supervised Genetic Learning

Population    Income    Life Insurance    Credit Card
Element       Range     Promotion         Insurance      Sex       Age
1             20–30K    No                Yes            Male      30–39
2             30–40K    Yes               No             Female    50–59
3             ?         No                No             Male      40–49
4             30–40K    Yes               Yes            Male      40–49
Table 3.9 • Training Data for Genetic Learning

Training      Income    Life Insurance    Credit Card
Instance      Range     Promotion         Insurance      Sex       Age
1             30–40K    Yes               Yes            Male      30–39
2             30–40K    Yes               No             Female    40–49
3             50–60K    Yes               No             Female    30–39
4             20–30K    No                No             Female    50–59
5             20–30K    No                No             Male      20–29
6             30–40K    No                No             Male      40–49
Figure 3.9 A crossover operation
(population element #1: 20–30K, No, Yes, Male, 30–39 and element #2: 30–40K, Yes, No, Female, 50–59 exchange attribute values, yielding new element #1: 20–30K, No, No, Female, 50–59 and new element #2: 30–40K, Yes, Yes, Male, 30–39)
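A sketch of crossover and mutation on attribute-value lists like the population elements above; the single crossover point and the mutation rate are illustrative choices, and setting the point after the second attribute reproduces the swap shown in Figure 3.9:

```python
import random

def crossover(parent_a, parent_b, point=None):
    """Exchange the attribute values after a crossover point,
    producing two new population elements."""
    if point is None:
        point = random.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(element, domains, rate=0.1):
    """With probability rate, replace an attribute value with a random
    value drawn from that attribute's domain."""
    return [random.choice(domains[i]) if random.random() < rate else v
            for i, v in enumerate(element)]

# Population elements #1 and #2 from Table 3.8:
# income range, life insurance promotion, credit card insurance, sex, age.
e1 = ["20-30K", "No", "Yes", "Male", "30-39"]
e2 = ["30-40K", "Yes", "No", "Female", "50-59"]
print(crossover(e1, e2, point=2))
# (['20-30K', 'No', 'No', 'Female', '50-59'],
#  ['30-40K', 'Yes', 'Yes', 'Male', '30-39'])  -- elements 1 and 2 of Table 3.10
```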
Table 3.10 • A Second-Generation Population

Population    Income    Life Insurance    Credit Card
Element       Range     Promotion         Insurance      Sex       Age
1             20–30K    No                No             Female    50–59
2             30–40K    Yes               Yes            Male      30–39
3             ?         No                No             Male      40–49
4             30–40K    Yes               Yes            Male      40–49
Genetic Algorithms and Unsupervised Clustering
Figure 3.10 Unsupervised genetic clustering
(P instances I1, I2, ..., Ip with attributes a1, a2, ..., an are evaluated against K solutions S1, S2, ..., SK, each solution holding cluster elements E11, E12 through Ek1, Ek2)
Table 3.11 • A First-Generation Population for Unsupervised Clustering

      Solution elements       Fitness    Solution elements       Fitness    Solution elements       Fitness
      (initial population)    score      (second generation)     score      (third generation)      score
S1    (1.0,1.0) (5.0,5.0)     11.31      (5.0,1.0) (5.0,5.0)     17.96      (5.0,5.0) (1.0,5.0)     13.64
S2    (3.0,2.0) (3.0,5.0)     9.78       (3.0,2.0) (3.0,5.0)     9.78       (3.0,2.0) (3.0,5.0)     9.78
S3    (4.0,3.0) (5.0,1.0)     15.55      (4.0,3.0) (1.0,1.0)     11.34      (4.0,3.0) (1.0,1.0)     11.34
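The printed fitness scores are consistent with a simple measure: sum, over the Table 3.6 instances, each instance's Euclidean distance to its closest element of the solution (lower is better). A sketch that reproduces the initial-population column:

```python
import math

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5),
          (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]   # Table 3.6

def fitness(solution):
    """Sum of each instance's distance to its nearest solution element."""
    return sum(min(math.dist(p, e) for e in solution) for p in points)

for name, sol in [("S1", [(1.0, 1.0), (5.0, 5.0)]),
                  ("S2", [(3.0, 2.0), (3.0, 5.0)]),
                  ("S3", [(4.0, 3.0), (5.0, 1.0)])]:
    print(name, round(fitness(sol), 2))   # S1 11.31, S2 9.78, S3 15.55
```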
General Considerations
• Global optimization is not a guarantee.
• The fitness function determines the
complexity of the algorithm.
• Can explain their results, provided the fitness function is understandable.
• Transforming the data to a form suitable for
genetic learning can be a challenge.
3.5 Choosing a Data Mining Technique
Initial Considerations
• Is learning supervised or unsupervised?
• Is explanation required?
• What is the interaction between input and output
attributes?
• What are the data types of the input and output
attributes?
Further Considerations
• Do We Know the Distribution of the Data?
• Do We Know Which Attributes Best Define the Data?
• Does the Data Contain Missing Values?
• Is Time an Issue?
• Which Technique Is Most Likely to Give the Best Test Set Accuracy?