Decision Tree
Prem Junsawang
Department of Statistics
Faculty of Science
Khon Kaen University
Overview
 Predictive Models
 The process by which a model is created or chosen to best predict the probability of an outcome (classification)
Overview
 Classification
Overview
 Classification Concept
• Given a training set: each record contains a set of attributes (features, parameters, variables) and a class label (target)
• Find a model that expresses the class label as a function of the set of attributes
• The goal is to predict the class labels of unseen records as accurately as possible
Overview
 Test set: used to determine the accuracy of a model
 Examples of classification techniques
• Decision tree
• Genetic Algorithm
• Neural Network
• Bayesian Classifier
• K-nearest neighbor
Overview
 Decision Tree
 A tree-shaped structure that represents a set of decisions and generates rules for classifying a dataset
 Genetic Algorithms
 Optimization techniques that use concepts of evolution such as selection, crossover and mutation
 Neural Network
 Nonlinear predictive models that learn through training and resemble biological neural networks
Overview
 Bayesian Classifier
 Based on Bayes' theorem
 K-nearest neighbor
 Performs prediction by finding the values of the records most similar to the record to be predicted
 Example: classifying an email as spam or not spam
Overview
 Learning algorithm
• Identifies a model that best fits the relationship between the set of attributes and the class label
 Examples:
• Classifying credit card transactions as legitimate or fraudulent
Overview
 Classification Process
Training set (Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
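The learn/apply workflow in the tables above can be reproduced with any off-the-shelf decision tree implementation. Below is a minimal sketch, assuming scikit-learn is available; the library choice and the encoding step are illustrative, not part of the slides.

```python
# Minimal sketch of the "learn model / apply model" workflow shown above,
# using scikit-learn (an assumed library, not prescribed by the slides).
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OrdinalEncoder

# Training records (Tid 1-10): Attrib1, Attrib2, Attrib3 (in thousands), Class
train_X = [["Yes", "Large", 125], ["No", "Medium", 100], ["No", "Small", 70],
           ["Yes", "Medium", 120], ["No", "Large", 95], ["No", "Medium", 60],
           ["Yes", "Large", 220], ["No", "Small", 85], ["No", "Medium", 75],
           ["No", "Small", 90]]
train_y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Test records (Tid 11-15) whose class labels are unknown ("?")
test_X = [["No", "Small", 55], ["Yes", "Medium", 80], ["Yes", "Large", 110],
          ["No", "Small", 95], ["No", "Large", 67]]

# Encode the two categorical attributes as numbers; the income column is numeric already.
enc = OrdinalEncoder()
train_cat = enc.fit_transform([row[:2] for row in train_X])
test_cat = enc.transform([row[:2] for row in test_X])
train_num = [list(c) + [row[2]] for c, row in zip(train_cat, train_X)]
test_num = [list(c) + [row[2]] for c, row in zip(test_cat, test_X)]

model = DecisionTreeClassifier(criterion="gini").fit(train_num, train_y)  # learn model
print(model.predict(test_num))                                            # apply model
```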
Decision Tree Induction
 Example of a Decision Tree
Decision Tree Induction
 Another Example of a Decision Tree
Decision Tree Induction
 Decision Tree Classification Task
(The training and test tables are the same as in the Classification Process slide above.)
Decision Tree Induction
 Apply Model to Test Data
Decision Tree Induction
 Decision Tree Classification Task (same training and test tables as above)
Decision Tree Induction
 Many Algorithms:
• Hunt's Algorithm
• CART
• ID3, C4.5
 How to build a Decision Tree
• Hunt's Algorithm: the basis of many existing decision tree induction algorithms, including ID3, C4.5 and CART
Decision Tree Induction
 Hunt's Algorithm
• Let Dt be the set of training records associated with node t and y = {y1, y2, …, yc} be the set of class labels
1. If all records in Dt belong to the same class yt, then t is a leaf node labeled as yt
2. If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets (see the sketch below)
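A minimal Python sketch of the two cases of Hunt's algorithm described above. The `choose_attribute` argument is a placeholder for an attribute-selection rule (for example, the Gini-based gain computed later in these slides); all names are illustrative.

```python
from collections import Counter

def hunt(records, labels, attributes,
         choose_attribute=lambda recs, labs, attrs: attrs[0]):
    """Recursive sketch of Hunt's algorithm for categorical attributes.

    records: list of dicts mapping attribute name -> value
    labels:  list of class labels, one per record
    choose_attribute: attribute-selection rule; here a trivial placeholder,
    in practice e.g. the Gini-based gain shown later in the slides.
    """
    # Case 1: all records belong to the same class -> leaf node with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test -> leaf node with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: select an attribute test condition and partition the records.
    attr = choose_attribute(records, labels, attributes)
    node = {"split_on": attr, "children": {}}
    for value in sorted(set(r[attr] for r in records)):
        idx = [i for i, r in enumerate(records) if r[attr] == value]
        node["children"][value] = hunt(
            [records[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != attr],
            choose_attribute,
        )
    return node

# Tiny usage example with rows 1, 2 and 8 of the example data in these slides
records = [{"HomeOwner": "Yes", "Marital": "Single"},
           {"HomeOwner": "No",  "Marital": "Married"},
           {"HomeOwner": "No",  "Marital": "Single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["HomeOwner", "Marital"]))
```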
Decision Tree Induction
 How the algorithm works (illustrated step by step in the figures)
Design Issues of DT Induction
 How should the training records be split?
• A measure is used to evaluate the goodness of each test condition
 When should the splitting procedure stop?
• When all records belong to the same class
• When the records have identical attribute values
Decision Tree Induction
 Methods for Expressing an Attribute Test Condition
1. Binary Attributes
2. Nominal Attributes
3. Ordinal Attributes
4. Continuous Attributes
Decision Tree Induction
 Binary Attributes
Decision Tree Induction
 Nominal Attributes
Decision Tree Induction
 Ordinal Attributes
Decision Tree Induction
 Continuous Attributes
Decision Tree Induction
 Splitting Based on Continuous Attributes
 Discretization – form an ordinal attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be determined by equal-interval bucketing, equal-frequency bucketing or clustering
 Binary decision – (A < v) or (A >= v)
• Consider all possible splits and find the best cut (see the sketch below)
• More computationally expensive
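A minimal sketch of the binary-decision approach for a continuous attribute, assuming Gini impurity as the goodness measure: every midpoint between consecutive sorted values is tried as a cut v, and the cut with the lowest weighted impurity is kept. Function names are illustrative; the example values are the annual incomes and defaulted-borrower labels from the example table in these slides.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Try every candidate cut v between consecutive sorted values of a
    continuous attribute and return the cut with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_v, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x < v]     # records with A < v
        right = [y for x, y in pairs if x >= v]   # records with A >= v
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best_impurity:
            best_v, best_impurity = v, weighted
    return best_v, best_impurity

# Example: annual income (in thousands) vs. defaulted borrower from the slides
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
default = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_binary_split(income, default))
```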
Decision Tree Induction
 Continuous Attributes
Decision Tree Induction
 How to Determine the Best Split
• Nodes with a homogeneous class distribution are preferred
• A measure of node impurity is needed
(Figure: a non-homogeneous class distribution gives a high degree of impurity; a homogeneous one gives a low degree of impurity)
Decision Tree Induction
Decision Tree Induction
 Measure of Impurity
Let p(i | t) denote the fraction of records belonging to class i at a given node t.

Entropy(t) = - Σ_{i=1}^{C} p(i|t) log2 p(i|t)            (1)

Gini(t) = 1 - Σ_{i=1}^{C} [p(i|t)]^2                     (2)

Classification error(t) = 1 - max_i [p(i|t)]             (3)

where C is the number of classes and 0 log2 0 = 0 in the entropy calculation.
Decision Tree Induction
 Measure of Impurity
p(0|gender) = 10/17, p(1|gender) = 7/17
• Gini index(Gender) = 1 - [(10/17)^2 + (7/17)^2] = 0.4844
• Entropy(Gender) = -[(10/17) log2(10/17) + (7/17) log2(7/17)] = 0.9774
• Error(Gender) = 1 - max{(10/17), (7/17)} = 1 - (10/17) = 0.4118
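These three measures are easy to check numerically; a short sketch (function names are illustrative) reproduces the Gender example above.

```python
import math

def entropy(probs):
    # 0 * log2(0) is defined as 0 in the entropy calculation
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    return 1.0 - sum(p ** 2 for p in probs)

def classification_error(probs):
    return 1.0 - max(probs)

p = [10 / 17, 7 / 17]                       # p(0|gender), p(1|gender)
print(round(gini(p), 4))                    # 0.4844
print(round(entropy(p), 4))                 # 0.9774
print(round(classification_error(p), 4))    # 0.4118
```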
Decision Tree Induction
 Exercise: compute the impurity measures for a split on Car Type, with values Family, Luxury and Sport, where the three child nodes have class counts (C0: 1, C1: 3), (C0: 8, C1: 0) and (C0: 1, C1: 7)
Decision Tree Induction
 Gini Index
Decision Tree Induction
Decision Tree Induction
Table: Training data tuples from the AllElectronics customer database
Decision Tree Induction
 Gini Index is used as the impurity measure

Gini(root) = 1 - [p(0|root)^2 + p(1|root)^2]
           = 1 - [(5/14)^2 + (9/14)^2] = 0.4592

Gini(root) is the parent impurity used when computing the gain of age, income, student and credit_rating.
Decision Tree Induction
 Age: <=30 (v1), 31-40 (v2) and >40 (v3)

Gini(v1) = 1 - [p(0|v1)^2 + p(1|v1)^2] = 1 - [(3/5)^2 + (2/5)^2] = 0.48
Gini(v2) = 1 - [p(0|v2)^2 + p(1|v2)^2] = 1 - [(0/4)^2 + (4/4)^2] = 0
Gini(v3) = 1 - [p(0|v3)^2 + p(1|v3)^2] = 1 - [(2/5)^2 + (3/5)^2] = 0.48
Decision Tree Induction
 Gain(age)

Gain(age) = Gini(root) - [ (N(v1)/N)·Gini(v1) + (N(v2)/N)·Gini(v2) + (N(v3)/N)·Gini(v3) ]
          = 0.4592 - [ (5/14)(0.48) + (4/14)(0) + (5/14)(0.48) ]
          = 0.1163
Decision Tree Induction
 Income: high, medium and low

Gini(h) = 1 - [p(0|h)^2 + p(1|h)^2] = 1 - [(2/4)^2 + (2/4)^2] = 0.5
Gini(m) = 1 - [p(0|m)^2 + p(1|m)^2] = 1 - [(2/6)^2 + (4/6)^2] = 0.44
Gini(l) = 1 - [p(0|l)^2 + p(1|l)^2] = 1 - [(1/4)^2 + (3/4)^2] = 0.38
Decision Tree Induction
 Gain(income)

Gain(income) = Gini(root) - [ (N(h)/N)·Gini(h) + (N(m)/N)·Gini(m) + (N(l)/N)·Gini(l) ]
             = 0.4592 - [ (4/14)(0.5) + (6/14)(0.44) + (4/14)(0.38) ]
             = 0.0192
Decision Tree Induction
 Student: No and Yes

Gini(No)  = 1 - [p(0|No)^2 + p(1|No)^2]   = 1 - [(4/7)^2 + (3/7)^2] = 0.49
Gini(Yes) = 1 - [p(0|Yes)^2 + p(1|Yes)^2] = 1 - [(1/7)^2 + (6/7)^2] = 0.24
Decision Tree Induction
 Gain(student)

Gain(student) = Gini(root) - [ (N(No)/N)·Gini(No) + (N(Yes)/N)·Gini(Yes) ]
              = 0.4592 - [ (7/14)(0.49) + (7/14)(0.24) ]
              = 0.0942
Decision Tree Induction
 Credit rating: fair and excellent

Gini(f) = 1 - [p(0|f)^2 + p(1|f)^2] = 1 - [(2/8)^2 + (6/8)^2] = 0.38
Gini(e) = 1 - [p(0|e)^2 + p(1|e)^2] = 1 - [(3/6)^2 + (3/6)^2] = 0.5
Decision Tree Induction
 Gain(credit_rating)

Gain(credit_rating) = Gini(root) - [ (N(f)/N)·Gini(f) + (N(e)/N)·Gini(e) ]
                    = 0.4592 - [ (8/14)(0.38) + (6/14)(0.5) ]
                    = 0.0278
Decision Tree Induction
 Gain of each attribute
Gain(age) = 0.1163
Gain(income) = 0.0192
Gain(student) = 0.0942
Gain(credit_rating) = 0.0278
Age has the largest gain, so age is selected as the splitting attribute at the root (verified numerically in the sketch below).
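The whole attribute-selection step can be reproduced from the class counts used in the worked example above. The sketch below (names are illustrative) computes the gains with exact arithmetic, so some values differ in the third decimal place from the slide figures, which plug in rounded Gini values; the ranking, with age on top, is unchanged.

```python
def gini(counts):
    """Gini impurity from a list of class counts, e.g. [no, yes]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, partitions):
    """Gain of a split = Gini(parent) - weighted Gini of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(part) / n * gini(part) for part in partitions)
    return gini(parent_counts) - weighted

parent = [5, 9]                                   # [no, yes] counts for buys_computer
splits = {
    "age":           [[3, 2], [0, 4], [2, 3]],    # <=30, 31-40, >40
    "income":        [[2, 2], [2, 4], [1, 3]],    # high, medium, low
    "student":       [[4, 3], [1, 6]],            # no, yes
    "credit_rating": [[2, 6], [3, 3]],            # fair, excellent
}
for attr, parts in splits.items():
    print(attr, round(gini_gain(parent, parts), 4))
# age has the largest gain, so it is chosen as the splitting attribute at the root
```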
Decision Tree Induction
Decision Tree Induction
 Final Decision Tree
Extract Rules
 IF age = "<=30" AND student = "no" THEN buys_computer = "no"
 IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
 IF age = "31-40" THEN buys_computer = "yes"
 IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
 IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
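The five extracted rules translate directly into code; a minimal sketch, with an illustrative function name and argument names.

```python
def buys_computer(age, student, credit_rating):
    """Classify a customer with the five rules extracted from the final tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31-40":
        return "yes"
    # remaining case: age == ">40"
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer(age="<=30", student="yes", credit_rating="fair"))     # yes
print(buys_computer(age=">40", student="no", credit_rating="excellent"))  # no
```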
Decision Tree Induction
 Approaches to pruning a decision tree
1. Prepruning approach
2. Postpruning approach
Prepruning
 Measures such as information gain can be
used to assess the goodness of a split
 If partitioning the samples at a node would result in a split that falls below a prespecified threshold, then further partitioning of the given subset is halted (a minimal threshold check is sketched below)
 Difficulties in choosing an appropriate threshold:
• High thresholds => oversimplified trees
• Low thresholds => overly complicated trees
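In code, prepruning amounts to one extra check before a node is partitioned; a minimal sketch, with an illustrative threshold value that is not taken from the slides.

```python
MIN_GAIN = 0.01   # illustrative prespecified threshold, not from the slides

def should_split(parent_gini, best_weighted_child_gini):
    """Prepruning test: halt partitioning when the best split's impurity
    reduction (here, Gini gain) falls below the prespecified threshold."""
    gain = parent_gini - best_weighted_child_gini
    return gain >= MIN_GAIN

# Example: parent impurity 0.4592, best split leaves weighted impurity 0.4550
print(should_split(0.4592, 0.4550))   # False -> stop growing, make a leaf node
```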
Postpruning
 Some branches are removed from a fully grown tree
 The cost-complexity pruning algorithm is an example of the postpruning approach
 Alternatively, prepruning and postpruning may be combined
Characteristics of DT
 Nonparametric approach for building classification models
 Robust to the presence of noise in the data set
 The presence of redundant attributes does not adversely affect the accuracy of decision trees
Evaluating the perf. of a Classifier
 Holdout Method
 Random Subsampling
 Cross-validation
Holdout Method
 The original data with labeled examples is partitioned into two disjoint sets, called the training and the test sets, respectively (e.g., 50-50, or two-thirds for training and one-third for testing).
Random Subsampling
 The holdout method can be repeated several times to improve the estimate of a classifier's performance; this is known as random subsampling.
 The overall accuracy is given by the average accuracy over all iterations.
K-Fold Cross-validation
 The dataset is partitioned into k
equal-sized parts.
 One of the parts is used for testing, while the rest are used for training.
 This procedure is repeated k times so that each partition is used for testing exactly once; the total error is given by summing up the errors of all k runs (see the sketch below).
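A minimal sketch of the three evaluation schemes, assuming scikit-learn and using its bundled iris data purely as a placeholder dataset; any labeled data and any classifier with fit/score would do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # placeholder labeled dataset
clf = DecisionTreeClassifier()

# Holdout: two disjoint sets, e.g. two-thirds for training, one-third for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# Random subsampling: repeat the holdout several times and average the accuracy
accs = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    accs.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
print("random subsampling accuracy:", sum(accs) / len(accs))

# k-fold cross-validation: each of the k partitions is used for testing exactly once
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())
```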
References
 P. N. Tan, M. Steinbach, V. Kumar, "Introduction to Data Mining", Pearson Addison Wesley.
 Course lecture notes for KNOWLEDGE / DATA MINING by Asst. Prof. Dr. จันทรเจา มงคลนาวิน.
Decision Tree Induction
 How the algorithm works

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125k           No
2    No          Married         100k           No
3    No          Single          70k            No
4    Yes         Married         120k           No
5    No          Divorced        95k            Yes
6    No          Married         60k            No
7    Yes         Divorced        220k           No
8    No          Single          85k            Yes
9    No          Married         75k            No
10   No          Single          90k            Yes
Decision Tree Induction
 How the algorithm works
Decision Tree Induction
 Exercise (solution): impurity measures over all 20 records at the parent node of the Car Type split from the earlier exercise (child class counts (C0: 1, C1: 3), (C0: 8, C1: 0) and (C0: 1, C1: 7))

p(0|Car) = 10/20, p(1|Car) = 10/20
• Gini index(Car) = 1 - [(10/20)^2 + (10/20)^2] = 0.5
• Entropy(Car) = -[(10/20) log2(10/20) + (10/20) log2(10/20)] = 1
• Error(Car) = 1 - max{(10/20), (10/20)} = 1 - (10/20) = 0.5