Decision Tree Learning
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
August 25, 2014
Example: Age, Income and Owning a flat
[Figure: training set – scatter plot of Monthly income (thousand rupees) vs. Age; points marked "Owns a house" and "Does not own a house"; two lines L1 and L2 (referred to in the rules below) roughly separate the two classes]
• If the training data was as above
– Could we define some simple rules by observation?
• Any point above the line L1 → Owns a house
• Any point to the right of L2 → Owns a house
• Any other point → Does not own a house
Example: Age, Income and Owning a flat
[Figure: the same training-set scatter plot (Monthly income in thousand rupees vs. Age), repeated alongside the decision-tree splits described below]
Root node: Split at Income = 101
– Income ≥ 101: Label = Yes
– Income < 101: Split at Age = 54
  – Age ≥ 54: Label = Yes
  – Age < 54: Label = No
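A minimal sketch of this hand-built tree as code (the thresholds 101 and 54 come from the splits above; the function name and interface are illustrative):

```python
def owns_house(age, income):
    """Hand-built two-level decision tree; income is monthly income
    in thousand rupees."""
    if income >= 101:       # root split on Income
        return "Yes"
    if age >= 54:           # second-level split on Age
        return "Yes"
    return "No"

print(owns_house(age=30, income=120))  # -> Yes
print(owns_house(age=40, income=60))   # -> No
print(owns_house(age=60, income=80))   # -> Yes
```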
Example: Age, Income and Owning a flat
[Figure: the same training-set scatter plot (Monthly income in thousand rupees vs. Age)]
• Approach: recursively split the data into partitions so that each partition becomes purer till …
– How to decide the split?
– How to measure purity?
– When to stop?
Approach for splitting
• What are the possible lines for splitting?
– For each variable, the midpoints between pairs of consecutive values of that variable
– How many? If N = number of points in the training set and m = number of variables, about O(N × m)
• How to choose which line to use for splitting?
– The line which reduces impurity (~ heterogeneity of composition) the most (see the candidate-split sketch below)
• How to measure impurity?
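A minimal sketch of candidate-split generation under the scheme above (the function name and data layout are illustrative): each variable contributes the midpoints between its consecutive distinct values, roughly N − 1 candidates per variable and about N × m in total.

```python
def candidate_splits(points):
    """points: list of tuples, one entry per variable.
    Returns {variable index: [candidate thresholds]}."""
    m = len(points[0])
    splits = {}
    for j in range(m):
        values = sorted(set(p[j] for p in points))
        # midpoints between consecutive distinct values of variable j
        splits[j] = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return splits

# Toy example with two variables (age, income in thousand rupees)
data = [(25, 40), (30, 120), (45, 60), (60, 80)]
print(candidate_splits(data))
# {0: [27.5, 37.5, 52.5], 1: [50.0, 70.0, 100.0]}
```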
Gini Index for Measuring Impurity
• Suppose there are C classes
• Let p(i|t) = fraction of observations belonging to class i in rectangle (node) t
• Gini index:
  Gini(t) = 1 − Σ_{i=1}^{C} p(i|t)²
• If all observations in t belong to one single class, Gini(t) = 0
• When is Gini(t) maximum?
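As a quick illustration (not from the slides – the function name and input format are assumptions), the Gini index can be computed from the class labels falling into a node:

```python
from collections import Counter

def gini(labels):
    """Gini index of a node, given the class labels of the
    observations that fall into it."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["Y", "Y", "Y"]))        # pure node -> 0.0
print(gini(["Y", "Y", "N", "N"]))   # evenly mixed, two classes -> 0.5
```

For C classes the index is maximal, 1 − 1/C, when the classes are equally represented in the node.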
Entropy
• Average amount of information contained
• From another point of view – average amount of information expected – hence amount of uncertainty
– We will study this in more detail later
• Entropy:
  Entropy(t) = − Σ_{i=1}^{C} p(i|t) log₂ p(i|t)
  where 0 log₂ 0 is defined to be 0
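A minimal sketch of the entropy computation (the function name and input format are illustrative); since absent classes contribute no terms, the 0 log₂ 0 = 0 convention is handled implicitly:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node, given the class labels of its observations."""
    n = len(labels)
    # classes with zero count never appear in the Counter, so the
    # 0 * log2(0) = 0 convention needs no special casing
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Y", "Y", "Y"]))        # pure node -> -0.0 (i.e. zero)
print(entropy(["Y", "Y", "N", "N"]))   # evenly mixed, two classes -> 1.0
```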
Classification Error
• What if we stop the tree building at a node?
– That is, do not create any further branches for that node
– Make that node a leaf
– Classify the node with the most frequent class present in the node
• Classification error as a measure of impurity:
  ClassificationError(t) = 1 − max_i p(i|t)
• Intuitively – the fraction of observations in the node that do not belong to its most frequent class
[Figure annotation on the scatter plot: this rectangle (node) is still impure]
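And a corresponding sketch of the classification-error measure (again, the function name and input format are assumptions):

```python
from collections import Counter

def classification_error(labels):
    """1 - p(most frequent class) for the observations in a node."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - most_common_count / len(labels)

print(classification_error(["Y", "Y", "Y", "N"]))  # -> 0.25
```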
The Full Blown Tree
• Recursive splitting
• Suppose we don’t stop until all nodes are pure
• A large decision tree with leaf nodes having very few data points
– Does not represent classes well
– Overfitting
[Figure: a fully grown tree, each node annotated with its number of points (root: 1000, counts shrinking down the tree); the smallest leaves hold only 2, 1 and 5 points – statistically not significant]
• Solution:
– Stop earlier, or
– Prune back the tree
Prune back
• Pruning step: collapse leaf nodes and make the immediate parent a leaf node (a small sketch follows after the figure)
• Effect of pruning
– Lose purity of nodes
– But were they really pure, or was that just noise?
– Too many nodes ≈ noise
• Trade-off between loss of purity and reduction in complexity (a smaller tree)
[Figure: pruning example – a decision node (Freq = 7) with two leaf children, label = Y (Freq = 5) and label = B (Freq = 2), is collapsed into a single leaf node with label = Y (Freq = 7)]
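A minimal sketch of the collapse step shown in the figure, under the assumption that a leaf is represented as a Counter of class frequencies (this representation and the function name are illustrative):

```python
from collections import Counter

def prune_to_leaf(node):
    """Collapse a (sub)tree into a single leaf by merging the class
    frequencies of all leaves below it."""
    if isinstance(node, Counter):                      # already a leaf
        return node
    return prune_to_leaf(node["left"]) + prune_to_leaf(node["right"])

# The example from the figure: leaves Y (Freq = 5) and B (Freq = 2)
subtree = {"left": Counter({"Y": 5}), "right": Counter({"B": 2})}
leaf = prune_to_leaf(subtree)
print(leaf.most_common(1)[0][0], sum(leaf.values()))   # Y 7 -> label Y, Freq 7
```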
Prune back: cost complexity
• Cost complexity of a (sub)tree: the classification error (based on training data) plus a penalty for the size of the tree (a small numeric sketch follows after the figure):
  tradeoff(T) = Err(T) + α · L(T)
• Err(T) is the classification error
• L(T) = number of leaves in T
• Penalty factor α is between 0 and 1
– If α = 0, no penalty for a bigger tree
[Figure: the same pruning example – a decision node (Freq = 7) with leaves Y (Freq = 5) and B (Freq = 2) collapsed into a single leaf Y (Freq = 7)]
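A small numeric sketch of the prune-or-keep comparison (the numbers are hypothetical, loosely based on the figure; Err is taken here as the training classification error expressed as a fraction):

```python
def cost_complexity(err, n_leaves, alpha):
    """tradeoff(T) = Err(T) + alpha * L(T)"""
    return err + alpha * n_leaves

# Suppose the two-leaf subtree classifies all 7 points correctly, while the
# pruned single leaf (label Y) would misclassify the 2 points labelled B.
for alpha in (0.1, 0.3):
    keep = cost_complexity(err=0.0, n_leaves=2, alpha=alpha)
    prune = cost_complexity(err=2 / 7, n_leaves=1, alpha=alpha)
    print(alpha, "prune" if prune <= keep else "keep subtree")
# alpha = 0.1 -> keep subtree; alpha = 0.3 -> prune
```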
Different Decision Tree Algorithms
• Chi-square Automatic Interaction Detector (CHAID)
– Gordon Kass (1980)
– Stop subtree creation if not statistically significant by the chi-square test
• Classification and Regression Trees (CART)
– Breiman et al.
– Decision tree building by the Gini index
• Iterative Dichotomizer 3 (ID3)
– Ross Quinlan (1986)
– Splitting by information gain (difference in entropy)
• C4.5
– Quinlan’s next algorithm, improved over ID3
– Bottom-up pruning, both categorical and continuous variables
– Handling of incomplete data points
• C5.0
– Ross Quinlan’s commercial version
Properties of Decision Trees
• Non-parametric approach
– Does not require any prior assumptions regarding the probability distribution of the class and attributes
• Finding an optimal decision tree is an NP-complete problem
– Heuristics used: greedy, top-down recursive partitioning, bottom-up pruning
• Fast to generate, fast to classify
• Easy to interpret or visualize
• Error propagation
– An error at the top of the tree propagates all the way down
References
• Introduction to Data Mining, by Tan, Steinbach, Kumar
– Chapter 4 is available online: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf