Data Mining
Eibe Frank and Ian H. Witten
Wishlist for a purity measure

• Properties we require from a purity measure:
  - When node is pure, measure should be zero
  - When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  - Measure should obey multistage property (i.e. decisions can be made in several stages):
      entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q+r), r/(q+r))
• Entropy is the only function that satisfies all three properties!

Properties of the entropy

• The multistage property:
    measure([2,3,4]) = measure([2,7]) + (7/9) × measure([3,4])
• Simplification of computation:
    info([2,3,4]) = -2/9 × log(2/9) - 3/9 × log(3/9) - 4/9 × log(4/9)
                  = [-2 log 2 - 3 log 3 - 4 log 4 + 9 log 9] / 9
• Note: instead of maximizing info gain we could just minimize information
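These properties are easy to check numerically. The following is a minimal Python sketch (not part of the original slides) that computes info() as the entropy of a list of class counts, in bits, and verifies the multistage identity for [2,3,4]:

```python
from math import log2

def info(counts):
    """Entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Pure node -> zero; maximal impurity -> maximal value
print(info([9, 0]))   # 0.0
print(info([5, 5]))   # 1.0 (two equally likely classes)

# Multistage property: measure([2,3,4]) = measure([2,7]) + (7/9) * measure([3,4])
lhs = info([2, 3, 4])
rhs = info([2, 7]) + 7 / 9 * info([3, 4])
print(lhs, rhs)       # both approx. 1.53 bits
```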
Gain ratio

• Gain ratio: a modification of the information gain that reduces its bias
• Gain ratio takes number and size of branches into account when choosing an attribute
  - It corrects the information gain by taking the intrinsic information of a split into account
• Intrinsic information: entropy of the distribution of instances into branches
  (i.e. how much info do we need to tell which branch an instance belongs to)

Computing the gain ratio

• Example: intrinsic information for ID code
    info([1,1,...,1]) = 14 × (-1/14 × log 1/14) = 3.807 bits
• Value of attribute decreases as intrinsic information gets larger
• Definition of gain ratio:
    gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
• Example:
    gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
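As a quick check of these numbers, here is a small Python sketch (assuming the standard 14-instance weather data with 9 yes / 5 no, as in the slides):

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# ID code splits the 14 weather instances into 14 singleton branches
intrinsic_info = info([1] * 14)   # 3.807 bits
gain_id_code = info([9, 5])       # 0.940 bits: each branch is pure, so all info is gained
print(round(intrinsic_info, 3), round(gain_id_code, 3))
print(round(gain_id_code / intrinsic_info, 3))   # ~0.247 (0.246 on the slide, rounding)
```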
Gain ratios for weather data

Outlook:      Info: 0.693   Gain: 0.940 - 0.693 = 0.247   Split info: info([5,4,5]) = 1.577   Gain ratio: 0.247/1.577 = 0.156
Temperature:  Info: 0.911   Gain: 0.940 - 0.911 = 0.029   Split info: info([4,6,4]) = 1.557   Gain ratio: 0.029/1.557 = 0.019
Humidity:     Info: 0.788   Gain: 0.940 - 0.788 = 0.152   Split info: info([7,7])   = 1.000   Gain ratio: 0.152/1.000 = 0.152
Windy:        Info: 0.892   Gain: 0.940 - 0.892 = 0.048   Split info: info([8,6])   = 0.985   Gain ratio: 0.048/0.985 = 0.049

More on the gain ratio

• "Outlook" still comes out top
• However: "ID code" has greater gain ratio
  - Standard fix: ad hoc test to prevent splitting on that type of attribute
• Problem with gain ratio: it may overcompensate
  - May choose an attribute just because its intrinsic information is very low
  - Standard fix: only consider attributes with greater than average information gain
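The table can be reproduced (up to rounding in the last digit) with a short sketch; the per-branch [yes, no] counts below are taken from the standard 14-instance weather data:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# [yes, no] counts per branch for each attribute
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # sunny, overcast, rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
    "Humidity":    [[3, 4], [6, 1]],           # high, normal
    "Windy":       [[6, 2], [3, 3]],           # false, true
}

before = info([9, 5])   # 0.940 bits
for name, branches in splits.items():
    n = sum(sum(b) for b in branches)
    after = sum(sum(b) / n * info(b) for b in branches)
    gain = before - after
    split_info = info([sum(b) for b in branches])
    print(f"{name:12s} gain={gain:.3f} split_info={split_info:.3f} "
          f"gain_ratio={gain / split_info:.3f}")
```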
Industrial-strength algorithms

• For an algorithm to be useful in a wide range of real-world applications it must:
  - Permit numeric attributes
  - Allow missing values
  - Be robust in the presence of noise
  - Be able to approximate arbitrary concept descriptions (at least in principle)
• Basic schemes need to be extended to fulfill these requirements

Decision trees

• Extending ID3:
  - to permit numeric attributes: straightforward
  - to deal sensibly with missing values: trickier
  - stability for noisy data: requires pruning mechanism
• End result: C4.5 (Quinlan)
  - Best-known and (probably) most widely-used learning algorithm
  - Commercial successor: C5.0
Numeric attributes

• Standard method: binary splits
  - E.g. temp < 45
• Unlike nominal attributes, every attribute has many possible split points
• Solution is straightforward extension:
  - Evaluate info gain (or other measure) for every possible split point of the attribute
  - Choose "best" split point
  - Info gain for best split point is info gain for attribute
• Computationally more demanding

Weather data (again!)

Nominal version:

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         Normal    False  Yes
…          …            …         …      …

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

With numeric temperature and humidity:

Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      75           80        False  Yes
…          …            …         …      …

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Example

• Split on temperature attribute:

    64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

• E.g. temperature < 71.5: yes/4, no/2
       temperature ≥ 71.5: yes/5, no/3
• Info([4,2],[5,3]) = 6/14 × info([4,2]) + 8/14 × info([5,3]) = 0.939 bits
• Place split points halfway between values
• Can evaluate all split points in one pass!

Avoid repeated sorting!

• Sort instances by the values of the numeric attribute
  - Time complexity for sorting: O(n log n)
• Does this have to be repeated at each node of the tree?
• No! Sort order for children can be derived from sort order for parent
  - Time complexity of derivation: O(n)
  - Drawback: need to create and store an array of sorted indices for each numeric attribute
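A sketch of the one-pass evaluation for the temperature values above: the class counts to the left of the candidate split point are updated incrementally while scanning the sorted values:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# temperature values with their Play class, already sorted
data = [(64, "Y"), (65, "N"), (68, "Y"), (69, "Y"), (70, "Y"), (71, "N"),
        (72, "N"), (72, "Y"), (75, "Y"), (75, "Y"), (80, "N"), (81, "Y"),
        (83, "Y"), (85, "N")]

total = {"Y": 9, "N": 5}
left = {"Y": 0, "N": 0}
n = len(data)
best = None
for i in range(n - 1):
    left[data[i][1]] += 1
    if data[i][0] == data[i + 1][0]:
        continue                                  # split only between distinct values
    split = (data[i][0] + data[i + 1][0]) / 2     # halfway between values
    right = {c: total[c] - left[c] for c in total}
    lc, rc = list(left.values()), list(right.values())
    expected = (sum(lc) / n) * info(lc) + (sum(rc) / n) * info(rc)
    if best is None or expected < best[1]:
        best = (split, expected)
    if split == 71.5:
        print("temp < 71.5:", dict(left), " temp >= 71.5:", right,
              " info =", round(expected, 3))      # 0.939 bits

print("best split point:", best[0], "info =", round(best[1], 3))
```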
Binary vs. multi-way splits

• Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
  - Nominal attribute is tested (at most) once on any path in the tree
• Not so for binary splits on numeric attributes!
  - Numeric attribute may be tested several times along a path in the tree
• Disadvantage: tree is hard to read
• Remedy:
  - pre-discretize numeric attributes, or
  - use multi-way splits instead of binary ones

Computing multi-way splits

• Simple and efficient way of generating multi-way splits: greedy algorithm
• Dynamic programming can find the optimum multi-way split in O(n²) time
  - imp(k, i, j) is the impurity of the best split of values x_i … x_j into k sub-intervals
  - imp(k, 1, i) = min_{0 < j < i} [ imp(k-1, 1, j) + imp(1, j+1, i) ]
  - imp(k, 1, N) gives us the best k-way split
• In practice, greedy algorithm works as well
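A minimal sketch of the dynamic-programming recurrence, assuming the attribute values are already sorted and taking the impurity of a sub-interval to be its size-weighted class entropy (the slides leave the impurity measure unspecified):

```python
from math import log2
from functools import lru_cache

# class labels of sorted attribute values (toy data, for illustration only)
labels = ["Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"]
N = len(labels)

def interval_impurity(i, j):
    """imp(1, i, j): class entropy of labels[i..j], weighted by interval size."""
    seg = labels[i:j + 1]
    counts = [seg.count(c) for c in set(seg)]
    total = sum(counts)
    ent = -sum(c / total * log2(c / total) for c in counts if c > 0)
    return total / N * ent

@lru_cache(maxsize=None)
def imp(k, i, j):
    """Impurity of the best split of values i..j (0-based) into k sub-intervals."""
    if k == 1:
        return interval_impurity(i, j)
    # imp(k, i, j) = min over cut points m of imp(k-1, i, m) + imp(1, m+1, j)
    return min(imp(k - 1, i, m) + interval_impurity(m + 1, j)
               for m in range(i + k - 2, j))

print(round(imp(3, 0, N - 1), 3))   # impurity of the best 3-way split of the whole range
```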
Missing values

• Split instances with missing values into pieces
  - A piece going down a branch receives a weight proportional to the popularity of the branch
  - Weights sum to 1
• Info gain works with fractional instances
  - Use sums of weights instead of counts
• During classification, split the instance into pieces in the same way
  - Merge probability distributions using weights

Pruning

• Prevent overfitting to noise in the data
  - "Prune" the decision tree
• Two strategies:
  - Postpruning: take a fully-grown decision tree and discard unreliable parts
  - Prepruning: stop growing a branch when information becomes unreliable
• Postpruning preferred in practice: prepruning can "stop early"
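A toy sketch of the classification step with a missing value: the instance is split into pieces, one per branch, weighted by branch popularity, and the leaf class distributions are merged. The node layout below is illustrative (a single split on outlook, with the branch sizes and class proportions of the weather data):

```python
# Toy one-level tree splitting on "outlook"; each branch stores how many
# training instances it received and the class distribution at its leaf.
branches = {
    "sunny":    {"n_train": 5, "dist": {"yes": 0.4, "no": 0.6}},
    "overcast": {"n_train": 4, "dist": {"yes": 1.0, "no": 0.0}},
    "rainy":    {"n_train": 5, "dist": {"yes": 0.6, "no": 0.4}},
}

def classify(instance):
    value = instance.get("outlook")
    if value is not None:
        return branches[value]["dist"]
    # Missing value: split the instance into pieces, one per branch,
    # weighted by the popularity of the branch; weights sum to 1.
    total = sum(b["n_train"] for b in branches.values())
    merged = {"yes": 0.0, "no": 0.0}
    for b in branches.values():
        w = b["n_train"] / total
        for cls, p in b["dist"].items():
            merged[cls] += w * p
    return merged

print(classify({"outlook": "sunny"}))   # {'yes': 0.4, 'no': 0.6}
print(classify({}))                     # weighted mix of the three leaves
```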
Prepruning

• Based on statistical significance test
  - Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
• Most popular test: chi-squared test
• ID3 used chi-squared test in addition to information gain
  - Only statistically significant attributes were allowed to be selected by the information gain procedure

Early stopping

• Pre-pruning may stop the growth process prematurely: early stopping
• Classic example: XOR/parity problem

    #   a   b   class
    1   0   0   0
    2   0   1   1
    3   1   0   1
    4   1   1   0

  - No individual attribute exhibits any significant association to the class
  - Structure is only visible in fully expanded tree
  - Prepruning won't expand the root node
• But: XOR-type problems rare in practice
• And: prepruning faster than postpruning
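For the XOR table above, a quick chi-squared check (computed directly, without a statistics library) confirms that neither a nor b on its own shows any association with the class:

```python
def chi_squared(rows):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_sums = [sum(r) for r in rows]
    col_sums = [sum(c) for c in zip(*rows)]
    n = sum(row_sums)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, observed in enumerate(r):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # (a, b, class)

for attr, name in [(0, "a"), (1, "b")]:
    # contingency table: rows = attribute value 0/1, columns = class 0/1
    table = [[sum(1 for row in data if row[attr] == v and row[2] == c)
              for c in (0, 1)] for v in (0, 1)]
    print(name, table, "chi2 =", chi_squared(table))   # chi2 = 0.0 for both
```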
Postpruning

• First, build full tree
• Then, prune it
  - Fully-grown tree shows all attribute interactions
  - Problem: some subtrees might be due to chance effects
• Two pruning operations:
  - Subtree replacement
  - Subtree raising
• Possible strategies:
  - error estimation
  - significance testing
  - MDL principle

Subtree replacement

• Bottom-up
• Consider replacing a tree only after considering all its subtrees
(The slides illustrate subtree replacement on the labor negotiations data: attributes such as duration (number of years), wage increase in the first/second/third year (percentage), cost of living adjustment, working hours per week, pension, standby pay, shift-work supplement, education allowance, statutory holidays (number of days), vacation, long-term disability assistance, dental plan contribution, bereavement assistance, and health plan contribution, with the class "acceptability of contract" (good/bad).)

Subtree raising

• Delete node
• Redistribute instances
• Slower than subtree replacement
  (Worthwhile?)
Estimating error rates

• Prune only if it reduces the estimated error
• Error on the training data is NOT a useful estimator
  (would result in almost no pruning)
• Use hold-out set for pruning ("reduced-error pruning")
• C4.5's method:
  - Derive confidence interval from training data
  - Use a heuristic limit, derived from this, for pruning
  - Standard Bernoulli-process-based method
  - Shaky statistical assumptions (based on training data)

C4.5's method

• Error estimate for subtree is weighted sum of error estimates for all its leaves
• Error estimate for a node:

    e = ( f + z²/(2N) + z·√( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

• If c = 25% then z = 0.69 (from the normal distribution)
• f is the error on the training data
• N is the number of instances covered by the leaf
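The estimate is easy to package as a function (a sketch; z = 0.69 for c = 25%, as above). Applied to the leaves of the pruning example that follows it reproduces the quoted values to within rounding:

```python
from math import sqrt

def pessimistic_error(f, n, z=0.69):
    """Upper confidence limit on a leaf's error rate.
    f: observed error rate on training data, n: instances covered, z: from c = 25%."""
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

print(round(pessimistic_error(2 / 6, 6), 2))    # ~0.47
print(round(pessimistic_error(1 / 2, 2), 2))    # ~0.72
print(round(pessimistic_error(5 / 14, 14), 2))  # ~0.45 (0.46 on the slides)
```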
Example

• Leaf error estimates: f = 0.33, e = 0.47;  f = 0.5, e = 0.72;  f = 0.33, e = 0.47
• Combined using ratios 6:2:6 gives 0.51
• Parent node: f = 5/14, e = 0.46
• e = 0.46 < 0.51, so prune!

Complexity of tree induction

• Assume
  - m attributes
  - n training instances
  - tree depth O(log n)
• Building a tree: O(m n log n)
• Subtree replacement: O(n)
• Subtree raising: O(n (log n)²)
  - Every instance may have to be redistributed at every node between its leaf and the root
  - Cost for redistribution (on average): O(log n)
• Total cost: O(m n log n) + O(n (log n)²)
From trees to rules

• Simple way: one rule for each leaf
• C4.5rules: greedily prune conditions from each rule if this reduces its estimated error
  - Can produce duplicate rules
  - Check for this at the end
• Then
  - look at each class in turn
  - consider the rules for that class
  - find a "good" subset (guided by MDL)
• Then rank the subsets to avoid conflicts
• Finally, remove rules (greedily) if this decreases error on the training data

C4.5: choices and options

• C4.5rules slow for large and noisy datasets
• Commercial version C5.0rules uses a different technique
  - Much faster and a bit more accurate
• C4.5 has two parameters
  - Confidence value (default 25%): lower values incur heavier pruning
  - Minimum number of instances in the two most popular branches (default 2)
Discussion

• TDIDT: Top-Down Induction of Decision Trees
  - The most extensively studied method of machine learning used in data mining
• Different criteria for attribute/test selection rarely make a large difference
• Different pruning methods mainly change the size of the resulting pruned tree
• C4.5 builds univariate decision trees
• Some TDIDT systems can build multivariate trees (e.g. CART)
Selecting a test

• Goal: maximize accuracy
  - t: total number of instances covered by rule
  - p: positive examples of the class covered by rule
  - t - p: number of errors made by rule
  - Select test that maximizes the ratio p/t
• We are finished when p/t = 1 or the set of instances can't be split any further

Example: contact lens data

• Rule we seek:
    If ?
    then recommendation = hard
• Possible tests:
    Age = Young                             2/8
    Age = Pre-presbyopic                    1/8
    Age = Presbyopic                        1/8
    Spectacle prescription = Myope          3/12
    Spectacle prescription = Hypermetrope   1/12
    Astigmatism = no                        0/12
    Astigmatism = yes                       4/12
    Tear production rate = Reduced          0/12
    Tear production rate = Normal           4/12
Modified rule and resulting data

• Rule with best test added:
    If astigmatism = yes
    then recommendation = hard
• Instances covered by modified rule:

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   Yes          Reduced               None
    Young           Myope                   Yes          Normal                Hard
    Young           Hypermetrope            Yes          Reduced               None
    Young           Hypermetrope            Yes          Normal                Hard
    Pre-presbyopic  Myope                   Yes          Reduced               None
    Pre-presbyopic  Myope                   Yes          Normal                Hard
    Pre-presbyopic  Hypermetrope            Yes          Reduced               None
    Pre-presbyopic  Hypermetrope            Yes          Normal                None
    Presbyopic      Myope                   Yes          Reduced               None
    Presbyopic      Myope                   Yes          Normal                Hard
    Presbyopic      Hypermetrope            Yes          Reduced               None
    Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

• Current state:
    If astigmatism = yes
    and ?
    then recommendation = hard
• Possible tests:
    Age = Young                             2/4
    Age = Pre-presbyopic                    1/4
    Age = Presbyopic                        1/4
    Spectacle prescription = Myope          3/6
    Spectacle prescription = Hypermetrope   1/6
    Tear production rate = Reduced          0/6
    Tear production rate = Normal           4/6

Modified rule and resulting data

• Rule with best test added:
    If astigmatism = yes
    and tear production rate = normal
    then recommendation = hard
• Instances covered by modified rule:

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   Yes          Normal                Hard
    Young           Hypermetrope            Yes          Normal                Hard
    Pre-presbyopic  Myope                   Yes          Normal                Hard
    Pre-presbyopic  Hypermetrope            Yes          Normal                None
    Presbyopic      Myope                   Yes          Normal                Hard
    Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

• Current state:
    If astigmatism = yes
    and tear production rate = normal
    and ?
    then recommendation = hard
• Possible tests:
    Age = Young                             2/2
    Age = Pre-presbyopic                    1/2
    Age = Presbyopic                        1/2
    Spectacle prescription = Myope          3/3
    Spectacle prescription = Hypermetrope   1/3
• Tie between the first and the fourth test
  - We choose the one with greater coverage
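A small sketch of the p/t test selection, applied to the twelve instances covered by "astigmatism = yes" (the first table above); it reproduces the possible-test counts of the first "Further refinement" slide and picks tear production rate = normal:

```python
# The 12 instances covered by "astigmatism = yes"
# (age, spectacle prescription, tear production rate, recommendation)
covered = [
    ("young", "myope", "reduced", "none"),
    ("young", "myope", "normal", "hard"),
    ("young", "hypermetrope", "reduced", "none"),
    ("young", "hypermetrope", "normal", "hard"),
    ("pre-presbyopic", "myope", "reduced", "none"),
    ("pre-presbyopic", "myope", "normal", "hard"),
    ("pre-presbyopic", "hypermetrope", "reduced", "none"),
    ("pre-presbyopic", "hypermetrope", "normal", "none"),
    ("presbyopic", "myope", "reduced", "none"),
    ("presbyopic", "myope", "normal", "hard"),
    ("presbyopic", "hypermetrope", "reduced", "none"),
    ("presbyopic", "hypermetrope", "normal", "none"),
]
attributes = {"age": 0, "spectacle prescription": 1, "tear production rate": 2}

def candidate_tests(instances, target="hard"):
    """Yield (attribute, value, p, t) for every possible test on the instances."""
    for attr, idx in attributes.items():
        for value in sorted({row[idx] for row in instances}):
            matching = [row for row in instances if row[idx] == value]
            t = len(matching)
            p = sum(1 for row in matching if row[-1] == target)
            yield attr, value, p, t

# Pick the test that maximizes p/t (covering-style selection)
tests = list(candidate_tests(covered))
for attr, value, p, t in tests:
    print(f"{attr} = {value}: {p}/{t}")
best = max(tests, key=lambda x: (x[2] / x[3], x[3]))   # break ties by coverage
print("best test:", best)   # tear production rate = normal: 4/6
```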
The result

• Final rule:
    If astigmatism = yes
    and tear production rate = normal
    and spectacle prescription = myope
    then recommendation = hard
• Second rule for recommending "hard lenses"
  (built from instances not covered by first rule):
    If age = young and astigmatism = yes
    and tear production rate = normal
    then recommendation = hard
• These two rules cover all "hard lenses"
• Process is repeated with other two classes