Knowledge Discovery and Data Mining
Unit # 3
Sajjad Haider
Fall 2013
Acknowledgement
• Most of the slides in this presentation are taken from course slides provided by
  – Han and Kamber (Data Mining: Concepts and Techniques) and
  – Tan, Steinbach and Kumar (Introduction to Data Mining)
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
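A minimal sketch of this build-and-validate workflow, assuming scikit-learn is available; the dataset, split ratio and variable names are illustrative and not part of the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Any table of attribute values plus a class label works the same way;
# the iris data is used here only as a stand-in.
X, y = load_iris(return_X_y=True)

# Divide the records into a training set (to build the model) and a test set (to validate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn a model for the class attribute as a function of the other attributes.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Estimate accuracy on previously unseen records.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))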
Classification: Motivation

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Decision/Classification Tree

age?
  <=30   → student?
             no  → no
             yes → yes
  31..40 → yes
  >40    → credit rating?
             excellent → no
             fair      → yes
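The tree above can be read as nested conditionals; a small sketch (the function name buys_computer is illustrative, and the logic is simply transcribed from the tree, not induced by any algorithm):

def buys_computer(age, student, credit_rating):
    """Classify a record by walking the decision tree above."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31..40":
        return "yes"
    else:  # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("<=30", "yes", "fair"))  # young student with a fair rating -> "yes"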
Illustrating Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Training Set → Learning algorithm (Induction) → Learn Model → Model
Test Set + Model → Apply Model (Deduction) → Class
Example of a Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund
  Yes → NO
  No  → MarSt
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
          Married → NO
Another Example of Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

MarSt
  Married          → NO
  Single, Divorced → Refund
                       Yes → NO
                       No  → TaxInc
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
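To make the point concrete, here is a sketch that hand-codes both trees as Python functions (illustrative names) and checks them against the ten training records above; both reproduce every Cheat label:

# The ten training records: (Refund, Marital Status, Taxable Income in K, Cheat).
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def tree_refund_first(refund, marital, income):
    # Tree 1: split on Refund, then MarSt, then TaxInc.
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return "No" if income < 80 else "Yes"

def tree_marst_first(refund, marital, income):
    # Tree 2: split on MarSt, then Refund, then TaxInc.
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income < 80 else "Yes"

# Both trees label every training record correctly, so both "fit the same data".
for refund, marital, income, cheat in records:
    assert tree_refund_first(refund, marital, income) == cheat
    assert tree_marst_first(refund, marital, income) == cheat
print("both trees agree with all 10 training labels")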
Decision Tree Classification Task
(Same training and test sets as in "Illustrating Classification Task"; the learning algorithm is now a tree induction algorithm and the model is a decision tree.)

Training Set → Tree Induction algorithm (Induction) → Learn Model → Decision Tree
Test Set + Decision Tree → Apply Model (Deduction) → Class
Apply Model to Test Data
Start from the root of the tree.

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund
  Yes → NO
  No  → MarSt
          Single, Divorced → TaxInc
                               < 80K → NO
                               > 80K → YES
          Married → NO
Apply Model to Test Data
Following the branches Refund = No → MarSt = Married, the record reaches the leaf NO.
Assign Cheat to "No"
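A compact sketch of the same walkthrough in code, mirroring the tree from the earlier slides (the function name is illustrative):

def classify_cheat(refund, marital, income):
    """Walk the decision tree: Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"                         # Refund = Yes -> leaf NO
    if marital == "Married":
        return "No"                         # Married -> leaf NO
    return "No" if income < 80 else "Yes"   # Single/Divorced -> split on TaxInc at 80K

# Test record from the slide: Refund = No, Marital Status = Married, Taxable Income = 80K.
print(classify_cheat("No", "Married", 80))  # -> "No", i.e. assign Cheat = "No"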
Decision Tree Classification Task
(This slide repeats the earlier "Decision Tree Classification Task" diagram: Training Set → Tree Induction algorithm → Decision Tree; Test Set + Decision Tree → Apply Model → Class.)
Tree Induction
• Greedy strategy
  – Split the records based on an attribute test that optimizes a certain criterion.
• Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting
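A sketch of this greedy strategy, assuming multi-way splits on categorical attributes and the GINI index (introduced later in this unit) as the splitting criterion; function names and the stopping rule are illustrative, not taken from the slides:

from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for the class labels reaching a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels, attributes):
    """Pick the attribute whose multi-way split gives the lowest weighted GINI."""
    best = None
    for a in attributes:
        weighted = 0.0
        for v in set(r[a] for r in records):
            subset = [lab for r, lab in zip(records, labels) if r[a] == v]
            weighted += len(subset) / len(labels) * gini(subset)
        if best is None or weighted < best[1]:
            best = (a, weighted)
    return best[0]

def grow_tree(records, labels, attributes):
    """Greedy recursive induction: stop when a node is pure or attributes run out."""
    if gini(labels) == 0.0 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    a = best_split(records, labels, attributes)
    node = {"attribute": a, "branches": {}}
    for v in set(r[a] for r in records):
        idx = [i for i, r in enumerate(records) if r[a] == v]
        node["branches"][v] = grow_tree([records[i] for i in idx],
                                        [labels[i] for i in idx],
                                        [b for b in attributes if b != a])
    return node

# Tiny usage example on four records with two attributes.
recs = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}, {"age": ">40", "student": "yes"}]
labs = ["no", "yes", "yes", "yes"]
print(grow_tree(recs, labs, ["age", "student"]))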
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
• Depends on number of ways to split
– 2-way split
– Multi-way split
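A small sketch contrasting the two split styles on the nominal attribute age from the buys_computer table; the particular binary grouping is chosen only for illustration:

from collections import defaultdict

ages = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
        "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]

# Multi-way split: one partition per distinct attribute value.
multiway = defaultdict(list)
for i, a in enumerate(ages):
    multiway[a].append(i)

# 2-way (binary) split: group the attribute values into two subsets.
left_values = {"<=30", "31..40"}    # illustrative grouping
two_way = {
    "left":  [i for i, a in enumerate(ages) if a in left_values],
    "right": [i for i, a in enumerate(ages) if a not in left_values],
}

print({v: len(ids) for v, ids in multiway.items()})   # sizes of the three partitions
print({s: len(ids) for s, ids in two_way.items()})    # sizes of the two partitions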
How to Determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:

  C0: 5, C1: 5 → Non-homogeneous, high degree of impurity
  C0: 9, C1: 1 → Homogeneous, low degree of impurity
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
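For reference, the three measures applied to a node given its class counts, in a short sketch (function names are illustrative; the entropy and misclassification-error formulas are the standard definitions, stated here rather than on this slide):

import math

def gini(counts):
    """GINI = 1 - sum_j p_j^2, where p_j is the fraction of records in class j."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy = -sum_j p_j * log2(p_j), with 0*log(0) treated as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    """Error = 1 - max_j p_j."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(gini([5, 5]), entropy([5, 5]), misclassification_error([5, 5]))   # maximally impure node
print(gini([9, 1]), entropy([9, 1]), misclassification_error([9, 1]))   # nearly pure node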
Measure of Impurity: GINI
• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 − 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information

  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI

  GINI(t) = 1 − Σ_j [p(j | t)]²

  C1: 0, C2: 6    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

  C1: 1, C2: 5    P(C1) = 1/6, P(C2) = 5/6
                  Gini = 1 − (1/6)² − (5/6)² = 0.278

  C1: 2, C2: 4    P(C1) = 2/6, P(C2) = 4/6
                  Gini = 1 − (2/6)² − (4/6)² = 0.444
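The worked values can be double-checked with a few lines of Python:

for c1, c2 in [(0, 6), (1, 5), (2, 4)]:
    n = c1 + c2
    gini = 1 - (c1 / n) ** 2 - (c2 / n) ** 2
    print(f"C1={c1}, C2={c2}: Gini = {gini:.3f}")   # 0.000, 0.278, 0.444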
Classification: Motivation
(The buys_computer training data shown earlier is repeated here.)
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions:
  – Larger and purer partitions are sought for.

  Student?
    Yes → Node N1
    No  → Node N2

  Gini(N1) = 1 − (6/7)² − (1/7)² = 0.24
  Gini(N2) = 1 − (3/7)² − (4/7)² = 0.49
  Gini(Student) = 7/14 × 0.24 + 7/14 × 0.49 = ??
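The "??" is left as an exercise on the slide; a short sketch computes the weighted value, using the class counts (6 : 1 and 3 : 4) implied by the fractions above:

# Class counts in each partition of the Student? split (from the buys_computer table):
# N1 (student = yes): 6 "yes", 1 "no";  N2 (student = no): 3 "yes", 4 "no".
gini_n1 = 1 - (6/7) ** 2 - (1/7) ** 2     # ≈ 0.245, shown rounded to 0.24 on the slide
gini_n2 = 1 - (3/7) ** 2 - (4/7) ** 2     # ≈ 0.490

# Weighted by partition size (7 of 14 records in each node):
gini_student = 7/14 * gini_n1 + 7/14 * gini_n2
print(round(gini_student, 3))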
GINI Index for Buy Computer Example
• Gini (Income):
• Gini (Credit_Rating):
• Gini (Age):
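These values are left blank on the slide; the sketch below computes the weighted GINI index of each candidate split from the buys_computer table shown earlier (helper names are illustrative):

from collections import Counter

# buys_computer training data: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
ATTR = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(attribute):
    """Weighted GINI of the multi-way split on the given attribute."""
    col = ATTR[attribute]
    total = 0.0
    for value in set(row[col] for row in data):
        labels = [row[-1] for row in data if row[col] == value]
        total += len(labels) / len(data) * gini(labels)
    return total

for attribute in ("income", "credit_rating", "age", "student"):
    print(attribute, round(gini_split(attribute), 3))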