DATA MINING
DECISION TREE INDUCTION
1
Classification Techniques
 Linear Models
 Decision Tree based Methods
 Rule-based Methods
 Memory based reasoning
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks
 Support Vector Machines
2
Example of a Decision Tree
Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
- Refund = Yes → NO
- Refund = No → MarSt
  - MarSt = Married → NO
  - MarSt = Single or Divorced → TaxInc
    - TaxInc < 80K → NO
    - TaxInc > 80K → YES
3
Another Decision Tree Example
(The training data is the same as on the previous slide.)

Model: a different decision tree, splitting on MarSt first
- MarSt = Married → NO
- MarSt = Single or Divorced → Refund
  - Refund = Yes → NO
  - Refund = No → TaxInc
    - TaxInc < 80K → NO
    - TaxInc > 80K → YES

More than one tree may perfectly fit the data.
4
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: a tree induction algorithm learns a model (the decision tree) from the training set.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to predict the unknown class labels.
5
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branch that matches the record at each test:
- Refund = No → go to the MarSt node
- MarSt = Married → reach the leaf NO
(The Single/Divorced branch would instead lead to the TaxInc test: < 80K → NO, > 80K → YES.)
6
Since the record (Refund = No, MarSt = Married) ends at the leaf labelled NO, assign Cheat to "No".
11
Decision Tree Terminology
12
Decision Tree Induction
 Many Algorithms:
 Hunt's Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
 John Ross Quinlan is a computer science researcher in data
mining and decision theory. He has contributed extensively to
the development of decision tree algorithms, including
inventing the canonical C4.5 and ID3 algorithms.
13
Decision Tree Classifier
[Figure: Grasshoppers and Katydids plotted by Abdomen Length (x-axis, 1-10) and Antenna Length (y-axis, 1-10); photo of Ross Quinlan.]

The corresponding decision tree:
- Abdomen Length > 7.1?  yes → Katydid
- no → Antenna Length > 6.0?  yes → Katydid;  no → Grasshopper
14
Decision trees predate computers.

[Figure: a dichotomous key for identifying insects (Grasshopper, Cricket, Katydid, Camel Cricket) using tests such as "Antennae shorter than body?", "3 Tarsi?", and "Foretiba has ears?".]
15
Definition
 A decision tree is a classifier in the form of a tree structure
– Decision node: specifies a test on a single attribute
– Leaf node: indicates the value of the target attribute
– Arc/edge: one outcome of the split on an attribute
– Path: a conjunction of tests that leads to the final decision
 Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.
16
Decision Tree Classification
• Decision tree generation consists of two phases
– Tree construction
• At start, all the training examples are at the root
• Partition examples recursively based on selected
attributes
• This can also be called supervised segmentation
• This emphasizes that we are segmenting the
instance space
– Tree pruning
• Identify and remove branches that reflect noise or
outliers
17
Decision Tree Representation
 Each internal node tests an attribute
 Each branch corresponds to attribute value
 Each leaf node assigns a classification
Example tree:
- outlook = sunny → humidity:  high → no;  normal → yes
- outlook = overcast → yes
- outlook = rain → wind:  strong → no;  weak → yes
18
How do we Construct a Decision Tree?
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-
and-conquer manner
 At start, all the training examples are at the root
 Examples are partitioned recursively based on
selected attributes.
 Test attributes are selected on the basis of a
heuristic or statistical measure (e.g., info. gain)
 Why do we call this a greedy algorithm?
 Because it makes locally optimal decisions (at
each node).
19
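As a rough sketch of the greedy, top-down, divide-and-conquer procedure described above (using information gain as the heuristic; the data layout, function names, and toy records below are assumptions, not part of the slides):

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy of the parent minus the weighted entropy of its children."""
    parent = entropy([ex[target] for ex in examples])
    n = len(examples)
    weighted_children = 0.0
    for value in set(ex[attribute] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        weighted_children += len(subset) / n * entropy(subset)
    return parent - weighted_children

def build_tree(examples, attributes, target):
    """Greedy top-down induction: pick the attribute with the highest
    information gain, partition the examples on it, and recurse."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1 or not attributes:        # pure node, or nothing left to test
        return Counter(labels).most_common(1)[0][0]    # return the majority class
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target)
    return tree

# Tiny usage example with made-up records
data = [
    {"Refund": "Yes", "Married": "No",  "Cheat": "No"},
    {"Refund": "No",  "Married": "Yes", "Cheat": "No"},
    {"Refund": "No",  "Married": "No",  "Cheat": "Yes"},
]
print(build_tree(data, ["Refund", "Married"], "Cheat"))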
When Do we Stop Partitioning?
 All samples for a node belong to same class
 No remaining attributes
 majority voting used to assign class
 No samples left
20
How to Pick Locally Optimal Split
 Hunt’s algorithm: recursively partition
training records into successively purer
subsets.
 How to measure purity/impurity?
 Entropy and associated information gain
 Gini
 Classification error rate
 Never used in practice but good for understanding and
simple exercises
21
How to Determine Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

Three candidate test conditions:
- Own Car? (Yes / No):  Yes → C0: 6, C1: 4;  No → C0: 4, C1: 6
- Car Type? (Family / Sports / Luxury):  Family → C0: 1, C1: 3;  Sports → C0: 8, C1: 0;  Luxury → C0: 1, C1: 7
- Student ID? (c1 ... c20): each child contains a single record (c1-c10 → C0: 1, C1: 0;  c11-c20 → C0: 0, C1: 1)

Which test condition is the best?
Why is Student ID a bad feature to use?
22
How to Determine Best Split
 Greedy approach:
 Nodes with homogeneous class distribution are preferred
 Need a measure of node impurity:
Example: a node with C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity).
23
Information Theory
 Think of playing "20 questions": I am thinking of an
integer between 1 and 1,000 -- what is it? What is the first
question you would ask?
 What question will you ask?
 Why?
 Entropy measures how much more information you need
before you can identify the integer.
 Initially, there are 1000 possible values, which we assume
are equally likely.
 What is the maximum number of questions you need to ask?
24
Entropy
 Entropy (disorder, impurity) of a set of examples, S, relative to a
binary classification is:
Entropy(S) = - p1 log2(p1) - p0 log2(p0)
where p1 is the fraction of positive examples in S and p0 is fraction
of negatives.
 If all examples are in one category, entropy is zero (we define
0log(0)=0)
 If examples are equally mixed (p1=p0=0.5), entropy is a maximum of 1.
 For multi-class problems with c categories, entropy generalizes to:
Entropy(S) = - Σ (i = 1 to c) pi log2(pi)
25
Entropy for Binary Classification
 The entropy is 0 if the outcome is certain.
 The entropy is maximum if we have no knowledge
of the system (or any outcome is equally possible).
[Figure: entropy of a 2-class problem plotted against the proportion of one of the two classes.]
26
Information Gain in Decision
Tree Induction
• Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
• Assume that using attribute A, the current set will be partitioned into some number of child sets.
• The encoding information that would be gained by branching on A is:
Gain(A) = E(current set) - Σ E(all child sets)
The summation in the above formula is a bit misleading since when doing
the summation we weight each entropy by the fraction of total examples in
the particular child set. This applies to GINI and error rate also.
27
Examples for Computing Entropy
Entropy(t) = - Σ_j p(j | t) log2 p(j | t)

NOTE: p(j | t) is computed as the relative frequency of class j at node t.

Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Node with C1 = 3, C2 = 3:
P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2
Entropy = – (1/2) log2 (1/2) – (1/2) log2 (1/2) = ½ + ½ = 1
28
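A quick check of these four entropies (a minimal sketch; the helper name is mine, not from the slides):

import math

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log2(0) taken as 0."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(entropy(counts), 2))   # 0.0, 0.65, 0.92, 1.0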
How to Calculate log2x
 Many calculators only have a button for log10x
and logex (“log” typically means log10)
 You can calculate the log for any base b as
follows:
 logb(x) = logk(x) / logk(b)
 Thus log2(x) = log10(x) / log10(2)
 Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get log base 2.
 You can use this for HW if needed
29
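For instance, in Python the two routes give the same answer (just an illustration of the change-of-base formula; not part of the slides):

import math

x = 5 / 6
print(math.log10(x) / math.log10(2))   # log2(x) via change of base: about -0.263
print(math.log2(x))                    # the same value from the built-in log2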
Splitting Based on INFO...
 Information Gain:
GAIN_split = Entropy(p) - Σ (i = 1 to k) (ni / n) Entropy(i)
where the parent node p is split into k partitions, ni is the number of records in partition i, and n is the total number of records in p.
 Uses a weighted average of the child nodes, where weight
is based on number of examples
 Used in ID3 and C4.5 decision tree learners
 WEKA’s J48 is a Java version of C4.5
 Disadvantage: Tends to prefer splits that result in large
number of partitions, each being small but pure.
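Tying this back to the earlier "Which test condition is the best?" slide, the sketch below (class counts read off that slide; the helper names are mine) computes GAIN_split for the Own Car and Car Type tests, showing why Car Type is preferred:

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """GAIN_split: parent entropy minus the weighted average of child entropies."""
    n = sum(sum(c) for c in child_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted

# (C0, C1) counts from the 'How to Determine Best Split' slide
print(gain((10, 10), [(6, 4), (4, 6)]))           # Own Car?   about 0.029
print(gain((10, 10), [(1, 3), (8, 0), (1, 7)]))   # Car Type?  about 0.620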
How Split on Continuous Attributes?
 For continuous attributes
 Partition the continuous value of attribute A into
a discrete set of intervals
 Create a new boolean attribute Ac , looking for a
threshold c
 One method is to try all possible splits
Ac = true if A < c, false otherwise
How to choose c ?
31
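One possible way to implement "try all possible splits" (a sketch under that exhaustive method; the function and variable names are mine, and the Weight values and classes are taken from the table on the next slide):

import math

def entropy(labels):
    """H = -sum p_i log2 p_i over the class proportions in the list of labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_threshold(values, labels):
    """Try a cut point between every pair of adjacent sorted values and return
    the one whose children have the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    best_c, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        c = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate threshold
        left = [y for x, y in pairs if x <= c]
        right = [y for x, y in pairs if x > c]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_c, best_score = c, score
    return best_c

# Weight values and classes from the table on the next slide
weights = [250, 150, 90, 78, 20, 170, 160, 180, 200]
classes = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]
print(best_threshold(weights, classes))   # 165.0 -- the same partition as Weight <= 160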
Person   Hair Length  Weight  Age  Class
Homer    0"           250     36   M
Marge    10"          150     34   F
Bart     2"           90      10   M
Lisa     6"           78      8    F
Maggie   4"           20      1    F
Abe      1"           170     70   M
Selma    8"           160     41   F
Otto     10"          180     38   M
Krusty   6"           200     45   M

Comic    8"           290     38   ?
32

Entropy(S) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Hair Length (Hair Length <= 5?  yes / no):
Gain(A) = E(current set) - Σ E(all child sets)
Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
33
Entropy(S) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Weight (Weight <= 160?  yes / no):
Gain(A) = E(current set) - Σ E(all child sets)
Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
34
Entropy(S) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

Entropy(4F, 5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Let us try splitting on Age (Age <= 40?  yes / no):
Gain(A) = E(current set) - Σ E(all child sets)
Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
35
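A small sketch to verify the three gains computed above (the values are read from the table; the helper names are mine):

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gain_for_split(rows, test):
    """Gain of a boolean split: parent entropy minus the weighted child entropies."""
    parent = [cls for _, cls in rows]
    yes = [cls for v, cls in rows if test(v)]
    no = [cls for v, cls in rows if not test(v)]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
    return entropy(parent) - weighted

# (value, class) pairs read from the slide's table
hair   = [(0, "M"), (10, "F"), (2, "M"), (6, "F"), (4, "F"), (1, "M"), (8, "F"), (10, "M"), (6, "M")]
weight = [(250, "M"), (150, "F"), (90, "M"), (78, "F"), (20, "F"), (170, "M"), (160, "F"), (180, "M"), (200, "M")]
age    = [(36, "M"), (34, "F"), (10, "M"), (8, "F"), (1, "F"), (70, "M"), (41, "F"), (38, "M"), (45, "M")]

print(round(gain_for_split(hair,   lambda v: v <= 5), 4))    # 0.0911
print(round(gain_for_split(weight, lambda v: v <= 160), 4))  # 0.59
print(round(gain_for_split(age,    lambda v: v <= 40), 4))   # 0.0183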
Of the 3 features we had, Weight was best.
But while people who weigh over 160 are
perfectly classified (as males), the under 160
people are not perfectly classified… So we
simply recurse!
Weight <= 160?  no → Male (the over-160 group);  yes → needs another split.
This time we find that we can split the under-160 group on Hair Length (Hair Length <= 2?), and we are done!
36
We don’t need to keep the data around, just the
test conditions.
Weight <= 160?
- no → Male
- yes → Hair Length <= 2?
  - yes → Male
  - no → Female

How would these people be classified?
37
It is trivial to convert Decision Trees
to rules…
Weight <= 160?
- no → Male
- yes → Hair Length <= 2?
  - yes → Male
  - no → Female
Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
Note: could avoid use of “elseif” by specifying all test conditions from root
to corresponding leaf.
38
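The same rules written as straight-line code (a sketch; the function and argument names are mine):

def classify(weight, hair_length):
    """Direct if/elif translation of the rules above."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

# e.g. the unlabeled row from the earlier table (hair 8", weight 290)
print(classify(weight=290, hair_length=8))   # -> "Male" under these rules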
Once we have learned the decision tree, we don’t even need a computer!
This decision tree is attached to a medical machine, and is designed to help
nurses make decisions about what type of doctor to call.
Decision tree for a typical shared-care setting applying the system for
the diagnosis of prostatic obstructions.
39
The worked examples we have seen were
performed on small datasets. However with
small datasets there is a great danger of
overfitting the data…
When you have few datapoints, there are
many possible splitting rules that perfectly
classify the data, but will not generalize to
future datasets.
Wears green?
- Yes → Female
- No → Male

For example, the rule “Wears green?” perfectly classifies the data; so does “Mother’s name is Jacqueline?”; so does “Has blue shoes”…
40
GINI is Another Measure of Impurity
 Gini for a given node t with classes j
GINI(t) = 1 - Σ_j [p(j | t)]^2

NOTE: p(j | t) is again computed as the relative frequency of class j at node t.
Compute the best split by finding the partition that yields the lowest GINI, where we again take the weighted average of the children’s GINI.
Worst GINI = 0.5 (for a 2-class problem); best GINI = 0.0.

C1 = 0, C2 = 6:  Gini = 0.000
C1 = 1, C2 = 5:  Gini = 0.278
C1 = 2, C2 = 4:  Gini = 0.444
C1 = 3, C2 = 3:  Gini = 0.500
41
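The same four nodes, checked with a minimal sketch (the helper name is mine):

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts at node t."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))   # 0.0, 0.278, 0.444, 0.5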
Splitting Criteria based on
Classification Error
 Classification error at a node t :
Error(t) = 1 - max_i P(i | t)
 Measures misclassification error made by a node.
 Maximum (1 - 1/nc) when records are equally distributed
among all classes, implying least interesting
information. This is ½ for 2-class problems
 Minimum (0.0) when all records belong to one class,
implying most interesting information
42
Examples for Computing Error
Error(t) = 1 - max_i P(i | t)

Node with C1 = 0, C2 = 6:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max(0, 1) = 1 – 1 = 0

Node with C1 = 1, C2 = 5:
P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

Node with C1 = 2, C2 = 4:
P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3

Equivalently, predict the majority class and determine the fraction of errors.
43
Complete Example using Error Rate
 Initial sample has 3 C1 and 15 C2
 Based on one 3-way split you get these 3 child nodes:
Child 1:  C1 = 0, C2 = 6
Child 2:  C1 = 1, C2 = 5
Child 3:  C1 = 2, C2 = 4
 What is the decrease in error rate?
 What is the error rate initially?
 What is it afterwards?
 As usual you need to take the
weighted average (but there is a
shortcut)
44
Error Rate Example Continued
 Error rate before: 3/18
(Child node class counts, as above: 0/6, 1/5, 2/4)
 Error rate after:
 Shortcut:
 Number of errors = 0 + 1 + 2 = 3
 Out of 18 examples
 Error rate = 3/18
 Weighted average method:
 6/18 x 0 + 6/18 x 1/6 + 6/18 x 2/6
 Simplifies to 1/18 + 2/18 = 3/18
45
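The weighted-average computation, as a minimal sketch (the helper name is mine):

def error_rate(counts):
    """Misclassification error at a node: 1 - (largest class proportion)."""
    return 1 - max(counts) / sum(counts)

children = [(0, 6), (1, 5), (2, 4)]                 # (C1, C2) counts of the child nodes
n = sum(sum(c) for c in children)                   # 18 examples in total
after = sum(sum(c) / n * error_rate(c) for c in children)
print(after)                                        # 0.1666... = 3/18, same as before the split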
Comparison among Splitting Criteria
For a 2-class problem: [figure comparing the splitting criteria (entropy, Gini, misclassification error) as a function of the proportion of one class]
46
Discussion
 Error rate is often the metric used to evaluate a
classifier (but not always)
 So it seems reasonable to use error rate to determine
the best split
 That is, why not just use a splitting metric that
matches the ultimate evaluation metric?
 But this is wrong!
 The reason is related to the fact that decision trees use a
greedy strategy, so we need to use a splitting metric that
leads to globally better results
 The other metrics will empirically outperform error rate,
although there is no proof for this.
47
How to Specify Test Condition?
 Depends on attribute types
 Nominal
 Ordinal
 Continuous
 Depends on number of ways to split
 2-way split
 Multi-way split
48
Splitting Based on Nominal Attributes
 Multi-way split: Use as many partitions as distinct
values.
Example: CarType → Family | Sports | Luxury
 Binary split: Divides values into two subsets.
Need to find optimal partitioning.
Example: CarType → {Sports, Luxury} vs {Family}   OR   CarType → {Family, Luxury} vs {Sports}
49
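Finding the optimal binary partitioning means searching over the 2^(k-1) - 1 ways to divide k attribute values into two non-empty subsets. The sketch below simply lists those candidates for CarType (the value names come from the slide; the code itself is an illustration, not part of it):

from itertools import combinations

values = ["Family", "Sports", "Luxury"]

# Fix the first value on the left side so each two-subset partition is listed once.
rest = values[1:]
for r in range(len(rest)):
    for combo in combinations(rest, r):
        left = {values[0], *combo}
        right = set(values) - left
        print(sorted(left), "vs", sorted(right))
# -> ['Family'] vs ['Luxury', 'Sports']
#    ['Family', 'Sports'] vs ['Luxury']
#    ['Family', 'Luxury'] vs ['Sports']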
Splitting Based on Ordinal Attributes
 Multi-way split: Use as many partitions as distinct
values.
Example: Size → Small | Medium | Large
 Binary split: Divides values into two subsets.
Need to find optimal partitioning.
Example: Size → {Small, Medium} vs {Large}   OR   Size → {Medium, Large} vs {Small}
 What about this split? Size → {Small, Large} vs {Medium}
(This grouping does not respect the ordering Small < Medium < Large, so it is not used for ordinal attributes.)
50
Splitting Based on Continuous Attributes
 Different ways of handling
 Discretization to form an ordinal categorical attribute
 Static – discretize once at the beginning
 Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
 Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut
 can be more compute intensive
51
Splitting Based on Continuous Attributes
(i) Binary split:  Taxable Income > 80K?  → Yes / No
(ii) Multi-way split:  Taxable Income?  → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
52
Data Fragmentation
 Number of instances gets smaller as you traverse
down the tree
 Number of instances at the leaf nodes could be too small to make a statistically significant decision
 Decision trees can suffer from data fragmentation
 Especially true if there are many features and not too many
examples
 True or False: All classification methods may suffer
data fragmentation.
 False: not logistic regression or instance-based learning.
Only applies to divide-and-conquer methods
53
Expressiveness
 Expressiveness relates to flexibility of the classifier in forming
decision boundaries
 Linear models are not that expressive since they can only form linear
boundaries
 Decision tree models can form rectangular regions
 Which is more expressive and why?
 Decision trees because they can form many regions, but DTs do have the
limitation of only forming axis-parallel boundaries.
 Decision trees do not generalize well to certain types of functions (like parity, which depends on all features)
 For accurate modeling, we must have a complete tree
 Not expressive enough for modeling continuous variables
especially when more than one variable at a time is involved
54
Decision Boundary
[Figure: a 2-D dataset over attributes x and y (both in [0, 1]), split by a tree with x < 0.43? at the root and y < 0.47? and y < 0.33? below it; each of the four leaves is pure (class counts 4:0, 0:4, 0:3, and 4:0), and the splits appear as axis-parallel lines in the scatter plot.]
• Border line between two neighboring regions of different classes is known as
decision boundary
• Decision boundary is parallel to axes because test condition involves a single
attribute at-a-time
55
Oblique Decision Trees
[Figure: a two-class dataset separated by the oblique split x + y < 1, with Class = + on one side and Class = – on the other.]
This special type of decision tree avoids some weaknesses and increases the
expressiveness of decision trees
This is not what we mean when we refer to decision trees (e.g., on an exam)
56
Tree Replication
[Figure: a tree with root P in which the same subtree (a test on Q followed by a test on S) appears in more than one branch, i.e. the subtree is replicated.]
This can be viewed as a weakness of decision trees, but this is really a minor issue
57
Pros and Cons of Decision Trees
 Advantages:
 Easy to understand
 Can get a global view of what is going on and also explain
individual decisions
 Can generate rules from them
 Fast to build and apply
 Can handle redundant and irrelevant features and
missing values
 Disadvantages:
 Limited expressive power
 May suffer from overfitting and validation set may be
necessary to avoid overfitting
58
More to Come on Decision Trees
 We have covered most of the essential
aspects of decision trees except pruning
 We will cover pruning next and, more
generally, overfitting avoidance
 We will also cover evaluation, which applies
to decision trees but also to all predictive
models
59