Decision Trees
An Algorithm for Building Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the
instances contained in T.
3. Create a tree node whose value is the chosen
attribute. Create child links from this node where
each link represents a unique value for the chosen
attribute. Use the child link values to further
subdivide the instances into subclasses.
Decision Trees
An Algorithm for Building Decision Trees
4. For each subclass created in step 3:
a. If the instances in the subclass satisfy
predefined criteria or if the set of remaining attribute
choices for this path of the tree is null, specify the
classification for new instances following this
decision path.
b. If the subclass does not satisfy the
predefined criteria and there is at least one attribute
to further subdivide the path of the tree, let T be the
current set of subclass instances and return to step 2.
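
A minimal Python sketch of steps 1-4 above, assuming each training instance is a dictionary of categorical attribute values with a "class" key; the entropy-based attribute chooser and the purity threshold stand in for the unspecified "best differentiates" heuristic and "predefined criteria":

from collections import Counter
from math import log2

def entropy(instances):
    """Shannon entropy of the class labels in a set of instances."""
    counts = Counter(inst["class"] for inst in instances)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def choose_attribute(T, attributes):
    """Step 2: pick the attribute that best differentiates T
    (here: the one giving the lowest weighted entropy of its subclasses)."""
    def weighted_entropy(attr):
        total = 0.0
        for value in {inst[attr] for inst in T}:
            subset = [inst for inst in T if inst[attr] == value]
            total += len(subset) / len(T) * entropy(subset)
        return total
    return min(attributes, key=weighted_entropy)

def build_tree(T, attributes, min_purity=1.0):
    """Steps 1-4: recursively grow a decision tree from training instances T."""
    classes = [inst["class"] for inst in T]
    majority, count = Counter(classes).most_common(1)[0]

    # Step 4a: stop when the subclass satisfies the criteria (here, purity)
    # or when no attributes remain; the leaf stores the classification.
    if count / len(classes) >= min_purity or not attributes:
        return majority                                  # leaf node

    # Steps 2-3: choose an attribute and create one child link per value.
    best = choose_attribute(T, attributes)
    node = {best: {}}
    for value in {inst[best] for inst in T}:
        subset = [inst for inst in T if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        # Step 4b: let T be the subclass instances and return to step 2.
        node[best][value] = build_tree(subset, remaining, min_purity)
    return node

Applied to the buys_computer table shown later in the chapter (with the class label stored under "class"), this sketch should reproduce the tree given on that slide.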
Decision Trees
An Algorithm for Building Decision Trees

[Slides 4-7: figure-only slides that step through building a decision tree with this algorithm; no transcript text was captured for these slides.]
Decision Trees
Decision Trees for the Credit Card Promotion Database

[Slides 8-10: figure-only slides showing decision trees built from the credit card promotion database; no transcript text was captured for these slides.]
Decision Trees
Decision Tree Rules

IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

(The second rule is a simplified form of the first, with the test on Age removed.)
Decision Trees
General Considerations

Here is a list of a few of the many advantages decision trees have to offer.

• Decision trees are easy to understand and map nicely to a set of production rules.
• Decision trees have been successfully applied to real problems.
• Decision trees make no prior assumptions about the nature of the data.
• Decision trees are able to build models with datasets containing numerical as well as categorical data.
Decision Trees
General Considerations

As with all data mining algorithms, there are several issues surrounding decision tree usage. Specifically:

• Output attributes must be categorical, and multiple output attributes are not allowed.
• Decision tree algorithms are unstable, in that slight variations in the training data can result in different attribute selections at each choice point within the tree. The effect can be significant, as attribute choices affect all descendant subtrees.
• Trees created from numeric datasets can be quite complex, as attribute splits for numeric data are typically binary.

• Decision tree
  ◦ A flow-chart-like tree structure
  ◦ Internal node denotes a test on an attribute
  ◦ Branch represents an outcome of the test
  ◦ Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases
  ◦ Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  ◦ Tree pruning
    - Identify and remove branches that reflect noise or outliers
• Use of decision tree: classifying an unknown sample
  ◦ Test the attribute values of the sample against the decision tree
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

This follows an example from Quinlan’s ID3.
[Figure: the decision tree induced for buys_computer. The root node tests age?. The <=30 branch leads to a student? node (student = no gives the leaf no; student = yes gives the leaf yes). The 31..40 branch leads directly to the leaf yes. The >40 branch leads to a credit rating? node (credit rating = excellent gives the leaf no; credit rating = fair gives the leaf yes).]
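
To classify an unknown sample, its attribute values are tested against the tree from the root down, as noted on the earlier slide. A minimal Python sketch, assuming the tree above is represented as nested dictionaries of the form {attribute: {value: subtree or leaf}}; the representation is chosen for illustration, not prescribed by the chapter:

# The buys_computer tree from the figure above, as nested dictionaries.
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels.
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, sample):
    """Walk the tree, testing the sample's attribute values, until a leaf."""
    while isinstance(node, dict):
        attribute = next(iter(node))                # attribute tested at this node
        node = node[attribute][sample[attribute]]   # follow the matching branch
    return node

# Example: a young student with medium income and fair credit.
print(classify(tree, {"age": "<=30", "income": "medium",
                      "student": "yes", "credit_rating": "fair"}))   # prints: yes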
• Basic algorithm (a greedy algorithm)
  ◦ Tree is constructed in a top-down, recursive, divide-and-conquer manner
  ◦ At start, all the training examples are at the root
  ◦ Attributes are categorical (if continuous-valued, they are discretized in advance)
  ◦ Examples are partitioned recursively based on selected attributes
  ◦ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  ◦ All samples for a given node belong to the same class
  ◦ There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  ◦ There are no samples left
[Figure: an informal decision tree for choosing dinner plans. The root node tests weather. The sunny branch leads to a Temp > 75 test with the leaves BBQ and Eat in; the rainy branch leads to the leaf Eat in; the cloudy branch leads to a windy test (no / yes) with the leaves BBQ and Eat in.]
• Information gain (ID3/C4.5)
  ◦ All attributes are assumed to be categorical
  ◦ Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
  ◦ All attributes are assumed continuous-valued
  ◦ Assume there exist several possible split values for each attribute
  ◦ May need other tools, such as clustering, to get the possible split values
  ◦ Can be modified for categorical attributes
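
The Gini index is easy to state in code. A minimal sketch for scoring one candidate split of a numeric attribute; the interface, the data, and the split value below are made up for illustration:

# Gini index of a set of class labels: 1 minus the sum over classes of p_j^2.
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

# Size-weighted Gini index of splitting rows on "attribute value <= threshold".
def gini_split(rows, attr_index, threshold, class_index=-1):
    left  = [r[class_index] for r in rows if r[attr_index] <= threshold]
    right = [r[class_index] for r in rows if r[attr_index] >  threshold]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical (age, class) records; score a split at age <= 40.
rows = [(25, "no"), (28, "no"), (35, "yes"), (45, "yes"), (52, "no")]
print(round(gini_split(rows, 0, 40), 3))   # prints: 0.467

The split value with the lowest Gini index would be kept; clustering, as noted above, is one way to generate the candidate split values.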
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
  ◦ Let the set of examples S contain p elements of class P and n elements of class N
  ◦ The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

• Assume that, using attribute A, a set S will be partitioned into sets {S_1, S_2, ..., S_v}
  ◦ If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

    E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

• The encoding information that would be gained by branching on A is

    Gain(A) = I(p, n) - E(A)
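
A direct transcription of I(p, n), E(A), and Gain(A) into Python; a small sketch to make the definitions concrete (the function names are my own):

from math import log2

def info(p, n):
    """I(p, n): information needed to classify an example of S as P or N."""
    if p == 0 or n == 0:
        return 0.0                    # a pure set needs no further information
    return (-p / (p + n) * log2(p / (p + n))
            - n / (p + n) * log2(n / (p + n)))

def expected_info(subsets):
    """E(A): weighted information over the subsets {S_i} produced by attribute A,
    where each subset is given as a (p_i, n_i) pair."""
    p = sum(pi for pi, ni in subsets)
    n = sum(ni for pi, ni in subsets)
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in subsets)

def gain(p, n, subsets):
    """Gain(A) = I(p, n) - E(A)."""
    return info(p, n) - expected_info(subsets)

print(round(info(9, 5), 3))                            # 0.94  (the I(9, 5) = 0.940 below)
print(round(gain(9, 5, [(2, 3), (4, 0), (3, 2)]), 3))  # 0.247 (Gain(age); the slide's
                                                       # 0.246 rounds I and E first)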
[The buys_computer training table from the earlier slide, repeated here for the worked example.]

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971
• Class P: buys_computer = “yes” (p = 9)
• Class N: buys_computer = “no” (n = 5)

Gain(A) = I(p, n) - E(A)

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

I(p, n) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

• Compute the entropy for age:

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971
I(p, n) = I(9, 5) = 0.940

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)

E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694

Gain(age) = I(p, n) - E(age) = 0.940 - 0.694 = 0.246

age     p_i   n_i   I(p_i, n_i)
<=30    2     3     0.971
31…40   4     0     0
>40     3     2     0.971

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
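
A short Python check of these figures against the buys_computer table; the tuple encoding of the table is my own. The printed gains agree with 0.246, 0.029, 0.151, and 0.048 up to rounding of intermediate values:

from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer): the 14 training rows.
rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
attributes = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(labels):
    """I(p, n) computed from a list of class labels."""
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def gain(attr):
    """Gain(A) = I(p, n) - E(A) for attribute A over the full table."""
    idx = attributes[attr]
    expected = 0.0
    for value in {r[idx] for r in rows}:
        subset = [r[-1] for r in rows if r[idx] == value]
        expected += len(subset) / len(rows) * info(subset)
    return info([r[-1] for r in rows]) - expected

for attr in attributes:
    print(attr, round(gain(attr), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 and 0.151 round I and E before subtracting)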
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
[Figure: the buys_computer decision tree from the earlier slide, shown alongside these rules: age? at the root, with the <=30 branch testing student?, the 31..40 branch the leaf yes, and the >40 branch testing credit rating?.]
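
A minimal sketch of this path-to-rule conversion, assuming the nested-dictionary tree representation used in the earlier classification sketch; one rule is produced per root-to-leaf path, with the attribute-value pairs along the path joined by AND:

def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule string per path from the root to a leaf."""
    if not isinstance(node, dict):               # leaf: emit the finished rule
        yield "IF " + " AND ".join(conditions) + f" THEN buys_computer = {node}"
        return
    attribute = next(iter(node))                 # attribute tested at this node
    for value, child in node[attribute].items():
        yield from extract_rules(child, conditions + (f"{attribute} = {value}",))

tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}
for rule in extract_rules(tree):
    print(rule)                                  # the five rules listed above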
• The generated tree may overfit the training data
  ◦ Too many branches, some of which may reflect anomalies due to noise or outliers
  ◦ The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
  ◦ Prepruning: halt tree construction early; do not split a node if doing so would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  ◦ Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
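
As one concrete illustration of postpruning against held-out data (a library-based sketch rather than the chapter's own procedure, and it assumes scikit-learn is installed): the ccp_alpha parameter controls cost-complexity pruning, and a validation set picks the best pruned tree.

# Grow a full tree, generate a sequence of progressively pruned trees, and keep
# the one that scores best on data held out from training.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)         # accuracy on the held-out set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha = {best_alpha:.5f}, validation accuracy = {best_score:.3f}")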
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training
  ◦ but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle:
  ◦ halt growth of the tree when the encoding is minimized
• Allow for continuous-valued attributes
  ◦ Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values
  ◦ Assign the most common value of the attribute
  ◦ Assign a probability to each of the possible values
• Attribute construction
  ◦ Create new attributes based on existing ones that are sparsely represented
  ◦ This reduces fragmentation, repetition, and replication
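
For the first point above, a minimal sketch of one common way to derive a binary split from a continuous attribute: candidate thresholds are placed midway between adjacent sorted values and scored by information gain (the data here is made up for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def best_threshold(values, labels):
    """Pick the binary split "value <= t" on a continuous attribute that gives
    the highest information gain, trying midpoints between adjacent values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left  = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        split_gain = base - (len(left) / len(pairs) * entropy(left)
                             + len(right) / len(pairs) * entropy(right))
        if split_gain > best_gain:
            best_t, best_gain = t, split_gain
    return best_t, best_gain

# Hypothetical ages and class labels.
ages   = [25, 28, 35, 38, 45, 52, 60]
labels = ["no", "no", "yes", "yes", "yes", "no", "no"]
print(best_threshold(ages, labels))   # a threshold where the class changes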
• Classification is a classical problem, extensively studied by statisticians and machine learning researchers
• Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
• Why decision tree induction in data mining?
  ◦ relatively fast learning speed (compared with other classification methods)
  ◦ convertible to simple and easy-to-understand classification rules
  ◦ can use SQL queries for accessing databases
  ◦ comparable classification accuracy with other methods
Chapter Summary

• Decision trees are probably the most popular structure for supervised data mining.
• A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree.
• The remaining training instances are then used to test the accuracy of the tree.
• If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated.
Chapter Summary

• A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization.
• Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules.
Source: Data Mining: Concepts and Techniques (slides for Chapter 7 of the textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada.