Supervised Learning
Fall 2004
Introduction

- Key idea
  - Known target concept (predict a certain attribute)
  - Find out how the other attributes can be used
- Algorithms
  - Rudimentary Rules (e.g., 1R)
  - Statistical Modeling (e.g., Naïve Bayes)
  - Divide and Conquer: Decision Trees
  - Instance-Based Learning
  - Neural Networks
  - Support Vector Machines

1-Rule

- Generate a one-level decision tree on one attribute
- Performs quite well!
- Basic idea:
  - Rules test a single attribute
  - Classify according to frequency in the training data
  - Evaluate the error rate for each attribute
  - Choose the best attribute
  - That's all folks!

The Weather Data (again)

Outlook    Temp.  Humidity  Windy  Play
Sunny      Hot    High      FALSE  No
Sunny      Hot    High      TRUE   No
Overcast   Hot    High      FALSE  Yes
Rainy      Mild   High      FALSE  Yes
Rainy      Cool   Normal    FALSE  Yes
Rainy      Cool   Normal    TRUE   No
Overcast   Cool   Normal    TRUE   Yes
Sunny      Mild   High      FALSE  No
Sunny      Cool   Normal    FALSE  Yes
Rainy      Mild   Normal    FALSE  Yes
Sunny      Mild   Normal    TRUE   Yes
Overcast   Mild   High      TRUE   Yes
Overcast   Hot    Normal    FALSE  Yes
Rainy      Mild   High      TRUE   No

Apply 1R

Attribute       Rules              Errors  Total errors
1 outlook       sunny -> no        2/5
                overcast -> yes    0/4
                rainy -> yes       2/5     4/14
2 temperature   hot -> no          2/4
                mild -> yes        2/6
                cool -> yes        1/4     5/14
3 humidity      high -> no         3/7
                normal -> yes      1/7     4/14
4 windy         false -> yes       2/8
                true -> no         3/6     5/14

(hot and true are ties, broken arbitrarily; outlook or humidity, with 4/14 errors, is the best single attribute)

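To make the procedure concrete, here is a minimal sketch of 1R in plain Python; the data and attribute names mirror the table above, and the helper name `one_r` is just illustrative:

```python
from collections import Counter, defaultdict

# Weather data from the table above: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
attributes = ["outlook", "temp", "humidity", "windy"]

def one_r(data, attr_index):
    """Build one rule per attribute value; return (rules, total errors)."""
    counts = defaultdict(Counter)          # attribute value -> class counts
    for row in data:
        counts[row[attr_index]][row[-1]] += 1
    rules, errors = {}, 0
    for value, cls_counts in counts.items():
        majority, hits = cls_counts.most_common(1)[0]
        rules[value] = majority            # classify by the most frequent class
        errors += sum(cls_counts.values()) - hits
    return rules, errors

# Pick the attribute with the smallest total number of errors
best = min(range(len(attributes)), key=lambda i: one_r(data, i)[1])
print(attributes[best], one_r(data, best))   # -> outlook, 4 errors out of 14
```
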
Other Features

- Numeric values
  - Discretization:
    - Sort the training data
    - Split the range into categories

    Temp:  64 65 68 69 70 71 72 72 75 75 80 81 83 85
    Play:   Y  N  Y  Y  Y  N  N  Y  Y  Y  N  Y  Y  N

- Missing values
  - "Dummy" attribute

Naïve Bayes Classifier

- Allow all attributes to contribute equally
- Assumes
  - All attributes equally important
  - All attributes independent
- Realistic?
- Selection of attributes

Bayes Theorem

P[H | E] = \frac{P[E | H] \cdot P[H]}{P[E]}

- P[H | E]: posterior probability (conditional probability of hypothesis H given evidence E)
- P[H]: prior
- P[E]: evidence

Maximum a Posteriori (MAP)

H_{MAP} = \arg\max_H P[H | E]
        = \arg\max_H \frac{P[E | H] \cdot P[H]}{P[E]}
        = \arg\max_H P[E | H] \cdot P[H]

Maximum Likelihood (ML):

H_{ML} = \arg\max_H P[E | H]

Classification

- Want to classify a new instance (a_1, a_2, ..., a_n) into a finite number of categories from the set V.
- Bayesian approach: assign the most probable category v_{MAP} given (a_1, a_2, ..., a_n):

  v_{MAP} = \arg\max_{v \in V} P[v | a_1, a_2, ..., a_n]
          = \arg\max_{v \in V} \frac{P[a_1, a_2, ..., a_n | v] \cdot P[v]}{P[a_1, a_2, ..., a_n]}
          = \arg\max_{v \in V} P[a_1, a_2, ..., a_n | v] \cdot P[v]

- Can we estimate the probabilities from the training data?

Naïve Bayes Classifier

- The second probability, P[v], is easy to estimate. How?
- The first probability, P[a_1, a_2, ..., a_n | v], is difficult to estimate. Why?
- Assume independence (this is the naïve bit):

  v_{MAP} = \arg\max_{v \in V} P[v] \prod_i P[a_i | v]

The Weather Data (yet again)

Counts (Yes / No):
  Outlook:      Sunny 2/3,  Overcast 4/0,  Rainy 3/2
  Temperature:  Hot 2/2,    Mild 4/2,      Cool 3/1
  Humidity:     High 3/4,   Normal 6/1
  Windy:        FALSE 6/2,  TRUE 3/3
  Play:         Yes 9,      No 5

Estimated probabilities:

\hat{P}[Play = yes] = 9/14
\hat{P}[Outlook = sunny | Play = yes] = 2/9
\hat{P}[Temperature = cool | Play = yes] = 3/9
\hat{P}[Humidity = high | Play = yes] = 3/9
\hat{P}[Windy = true | Play = yes] = 3/9

Estimation

- Given a new instance with
  - outlook = sunny
  - temperature = cool
  - humidity = high
  - windy = true

  P[Play = yes] \prod_i P[a_i | Play = yes]
    = \frac{9}{14} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \approx 0.0053

Calculations continued …

- Similarly,

  P[Play = no] \prod_i P[a_i | Play = no]
    = \frac{5}{14} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \approx 0.0206

- Thus

  v_{MAP} = \arg\max_{v \in \{Play=yes,\, Play=no\}} P[v] \prod_i P[a_i | v] = \{Play = no\}

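As a concrete check of this calculation, here is a minimal sketch in plain Python; the hard-coded fractions are the counts tabulated two slides earlier:

```python
# Conditional estimates from the weather counts: P[attribute value | class]
likelihood_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
likelihood_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}
prior = {"yes": 9/14, "no": 5/14}

def score(prior_v, likelihoods):
    """Unnormalized Naive Bayes score: P[v] times the product of P[a_i | v]."""
    s = prior_v
    for p in likelihoods.values():
        s *= p
    return s

s_yes = score(prior["yes"], likelihood_yes)   # ~0.0053
s_no  = score(prior["no"], likelihood_no)     # ~0.0206
print("MAP class:", "yes" if s_yes > s_no else "no")
print("P[yes | instance] =", s_yes / (s_yes + s_no))   # ~0.205 after normalization
```
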
Normalization

- Note that we can normalize to get the probabilities:

  P[v | a_1, a_2, ..., a_n] = \frac{P[a_1, a_2, ..., a_n | v] \cdot P[v]}{P[a_1, a_2, ..., a_n]}
    = \frac{0.0053}{0.0053 + 0.0206} \approx 0.205   for v = \{Play = yes\}
    = \frac{0.0206}{0.0053 + 0.0206} \approx 0.795   for v = \{Play = no\}

Problems …

- Suppose we had the following training data:

Counts (Yes / No):
  Outlook:      Sunny 0/5,  Overcast 4/0,  Rainy 3/2
  Temperature:  Hot 2/2,    Mild 4/2,      Cool 3/1
  Humidity:     High 3/4,   Normal 6/1
  Windy:        FALSE 6/2,  TRUE 3/3
  Play:         Yes 9,      No 5

\hat{P}[Play = yes] = 9/14
\hat{P}[Outlook = sunny | Play = yes] = 0/9
\hat{P}[Temperature = cool | Play = yes] = 3/9
\hat{P}[Humidity = high | Play = yes] = 3/9
\hat{P}[Windy = true | Play = yes] = 3/9

Now what?

Laplace Estimator

- Replace the estimates

  \hat{P}[Outlook = sunny | Play = yes] = \frac{2}{9}
  \hat{P}[Outlook = overcast | Play = yes] = \frac{4}{9}
  \hat{P}[Outlook = rainy | Play = yes] = \frac{3}{9}

  with

  \hat{P}[Outlook = sunny | Play = yes] = \frac{2 + \mu p_1}{9 + \mu}
  \hat{P}[Outlook = overcast | Play = yes] = \frac{4 + \mu p_2}{9 + \mu}
  \hat{P}[Outlook = rainy | Play = yes] = \frac{3 + \mu p_3}{9 + \mu}

  where the prior weights satisfy p_1 + p_2 + p_3 = 1 and \mu is a small constant (the Laplace estimator itself uses \mu = 3 and p_i = 1/3, i.e., add one to every count).

Numeric Values

- Assume a probability distribution for the numeric attributes → density f(x)
  - normal:

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

  - fit a distribution (better)
- Similarly as before:

  v_{MAP} = \arg\max_{v \in V} P[v] \prod_i f(a_i | v)

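A small sketch of how a numeric attribute could enter the product, assuming a normal density fitted per class (the mean and standard deviation below are illustrative, not fitted from the table):

```python
import math

def gaussian(x, mu, sigma):
    """Normal density f(x), used in place of a conditional probability for a numeric attribute."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical fit for temperature given Play = yes (mean and std would come from the training data)
print(gaussian(66.0, mu=73.0, sigma=6.2))
```
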
Discussion

- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
- Solutions:
  - Preselect which attributes to use
  - Non-naïve Bayesian methods: Bayesian networks

Decision Tree Learning

- Basic algorithm:
  - Select an attribute to be tested
  - If classification achieved, return the classification
  - Otherwise, branch by setting the attribute to each of the possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes

Deciding on Branching

- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best: all instances in the same class
- Maximum diversity (impurity) is worst: all classes equally likely
- Goal: select attributes to reduce impurity

Measuring Impurity/Diversity

- Let's say we only have two classes:
- Minimum

  \min(p_1, p_2)

- Gini index / Simpson diversity index

  2 p_1 p_2 = 2 p_1 (1 - p_1)

- Entropy

  -p_1 \log_2(p_1) - p_2 \log_2(p_2)

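A minimal sketch of the three impurity measures for the two-class case (plain Python; the function names are just illustrative):

```python
import math

def minimum(p1):           # min(p1, p2) with p2 = 1 - p1
    return min(p1, 1 - p1)

def gini(p1):              # 2*p1*p2 = 2*p1*(1 - p1)
    return 2 * p1 * (1 - p1)

def entropy(p1):           # -p1*log2(p1) - p2*log2(p2), with 0*log 0 = 0
    terms = [p for p in (p1, 1 - p1) if p > 0]
    return -sum(p * math.log2(p) for p in terms)

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, minimum(p), gini(p), round(entropy(p), 3))
```
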
Impurity Functions

[Plot: minimum, Gini index, and entropy as functions of p_1 on [0, 1]. All three are zero at p_1 = 0 and p_1 = 1 and peak at p_1 = 0.5, where entropy reaches 1 bit, the Gini index 0.5, and the minimum 0.5.]

Entropy

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

where c is the number of classes, S is the training data (instances), and p_i is the proportion of S classified as i.

- Entropy is a measure of impurity in the training data S
- Measured in bits of information needed to encode a member of S
- Extreme cases:
  - All members have the same classification (Note: 0·log 0 = 0)
  - All classifications equally frequent

Expected Information Gain

Gain(S, a) = Entropy(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} Entropy(S_v),   where S_v = \{s \in S : a(s) = v\}

Values(a) is the set of all possible values for attribute a. Gain(S, a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed).

The Weather Data (yet again)

Outlook    Temp.  Humidity  Windy  Play
Sunny      Hot    High      FALSE  No
Sunny      Hot    High      TRUE   No
Overcast   Hot    High      FALSE  Yes
Rainy      Mild   High      FALSE  Yes
Rainy      Cool   Normal    FALSE  Yes
Rainy      Cool   Normal    TRUE   No
Overcast   Cool   Normal    TRUE   Yes
Sunny      Mild   High      FALSE  No
Sunny      Cool   Normal    FALSE  Yes
Rainy      Mild   Normal    FALSE  Yes
Sunny      Mild   Normal    TRUE   Yes
Overcast   Mild   High      TRUE   Yes
Overcast   Hot    Normal    FALSE  Yes
Rainy      Mild   High      TRUE   No

Decision Tree: Root Node

Outlook
  Sunny:    Yes, Yes, No, No, No
  Overcast: Yes, Yes, Yes, Yes
  Rainy:    Yes, Yes, Yes, No, No

Calculating the Entropy

Entropy(S) = -\sum_{i=1}^{2} p_i \log_2 p_i = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94

Entropy(S_{sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.97

Entropy(S_{overcast}) = -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4} = 0.00

Entropy(S_{rainy}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.97

Calculating the Gain

Gain(S, outlook) = Entropy(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} Entropy(S_v)
  = 0.94 - \left(\frac{5}{14} \cdot 0.97 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.97\right)
  = 0.94 - 0.69 = 0.25

Gain(S, temp) = 0.03
Gain(S, humidity) = 0.15
Gain(S, windy) = 0.05

Select outlook!

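To tie the formulas together, here is a minimal sketch that recomputes the root-split gain for outlook from the weather data (plain Python):

```python
import math
from collections import Counter

# (outlook, play) pairs from the weather table
rows = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values() if c)

def gain(rows):
    labels = [play for _, play in rows]
    total = entropy(labels)                          # ~0.94 for 9 yes / 5 no
    for v in set(v for v, _ in rows):                # subtract weighted branch entropies
        subset = [play for value, play in rows if value == v]
        total -= len(subset) / len(rows) * entropy(subset)
    return total

print(round(gain(rows), 2))   # ~0.25 for the outlook split
```
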
Next Level

Outlook = Sunny branch, testing Temperature:
  Hot:  No, No
  Mild: Yes, No
  Cool: Yes

Calculating the Entropy

Entropy(S) = 0.97    (S = the Sunny branch)

Entropy(S_{hot}) = -\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2} = 0

Entropy(S_{mild}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1

Entropy(S_{cool}) = -\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1} = 0

Calculating the Gain

Gain(S, temp) = Entropy(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} Entropy(S_v)
  = 0.97 - \left(\frac{2}{5} \cdot 0 + \frac{2}{5} \cdot 1 + \frac{1}{5} \cdot 0\right)
  = 0.97 - 0.40 = 0.57

Gain(S, humidity) = 0.97    ← Select
Gain(S, windy) = 0.02

Final Tree

Outlook
  Sunny → Humidity
    High:   No
    Normal: Yes
  Overcast: Yes
  Rainy → Windy
    True:  No
    False: Yes

What’s in a Tree?

- Our final decision tree correctly classifies every instance
- Is this good?
- Two important concepts:
  - Overfitting
  - Pruning

Overfitting

- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree

Pruning

- Prepruning
  - Halt construction of the decision tree early
  - Use the same measure as in determining attributes, e.g., halt if InfoGain < K
  - The most frequent class becomes the leaf node
- Postpruning
  - Construct the complete decision tree
  - Prune it back
  - Prune to minimize expected error rates
  - Prune to minimize bits of encoding (Minimum Description Length principle)

Scalability

- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident

Discussion: Decision Trees

- The most popular methods
  - Quite effective
  - Relatively simple
- Have discussed in detail the ID3 algorithm:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes

Selecting Split Attributes

- Other univariate splits
  - Gain Ratio: C4.5 algorithm (J48 in Weka)
  - CART (not in Weka)
- Multivariate splits
  - May be possible to obtain better splits by considering two or more attributes simultaneously

Instance-Based Learning

- Classification
- Do not construct an explicit description of how to classify
- Store all training data (learning)
- New example: find the most similar instance
  - Computing is done at the time of classification
  - k-nearest neighbor

K-Nearest Neighbor

- Each instance lives in n-dimensional space: a_1(x), a_2(x), ..., a_n(x)
- Distance between instances:

  d(x_1, x_2) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_1) - a_r(x_2)\right)^2}

Example: nearest neighbor

[Figure: a query point x_q surrounded by training instances labeled + and -.]

- 1-nearest neighbor?
- 6-nearest neighbor?

Normalizing

- Some attributes may take large values and others small
- Normalize so that all attributes are on an equal footing:

  a_i = \frac{v_i - \min v_i}{\max v_i - \min v_i}

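A minimal sketch of min-max normalization and k-nearest-neighbor voting (plain Python; the numeric data at the bottom is purely illustrative):

```python
import math

def normalize(dataset):
    """Rescale each numeric attribute to [0, 1]: (v - min) / (max - min)."""
    cols = list(zip(*dataset))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in dataset]

def distance(x1, x2):
    """Euclidean distance between two instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, labels, query, k=3):
    """Vote among the k training instances closest to the query."""
    ranked = sorted(range(len(train)), key=lambda i: distance(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)

X = normalize([[180, 70000], [160, 20000], [175, 65000], [158, 22000]])
print(knn_classify(X, ["+", "-", "+", "-"], X[0], k=3))
```
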
Other Methods for Supervised Learning

- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach

Evaluating the Learning

- Measure of performance
  - Classification: error rate
- Resubstitution error
  - Performance on the training set
  - Poor predictor of future performance
  - Overfitting
  - Useless for evaluation

Test Set

- Need a set of test instances
  - Independent of training set instances
  - Representative of the underlying structure
- Sometimes: validation data
  - Fine-tune parameters
  - Independent of training and test data
- Plentiful data: no problem!

Holdout Procedures

- Common case: data set large but limited
- Usual procedure:
  - Reserve some data for testing
  - Use the remaining data for training
- Problems:
  - Want both sets as large as possible
  - Want both sets to be representative

"Smart" Holdout


Simple check: Are the proportions of classes
about the same in each data set?
Stratified holdout


Guarantee that classes are (approximately)
proportionally represented
Repeated holdout

Fall 2004
Randomly select holdout set several times and average
the error rate estimates
48
Holdout w/ Cross-Validation

- Cross-validation
  - Fixed number of partitions of the data (folds)
  - In turn: each partition used for testing and the remaining instances for training
- May use stratification and randomization
- Standard practice:
  - Stratified tenfold cross-validation
  - Instances divided randomly into the ten partitions

Cross Validation

Fold 1: train on 90% of the data → model → test on the remaining 10% → error rate e_1
Fold 2: train on 90% of the data → model → test on the remaining 10% → error rate e_2
...

Cross-Validation

- Final estimate of the error:

  \bar{e} = \frac{1}{k} \sum_{i=1}^{k} e_i

- Quality of the estimate (confidence interval):

  \bar{e} \pm t_{1-\alpha,\, k-1} \, s,   where   s = \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (e_i - \bar{e})^2}

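A minimal sketch of the k-fold estimate and its standard error, assuming a `fit` function that builds a model and an `error_rate` function that scores it (the toy majority-class learner at the bottom is illustrative only):

```python
import math, random

def k_fold_error(data, k, fit, error_rate, seed=0):
    """Average the per-fold test error rates; also return the standard error s."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal partitions
    errors = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        errors.append(error_rate(fit(train), test))
    mean = sum(errors) / k
    s = math.sqrt(sum((e - mean) ** 2 for e in errors) / (k * (k - 1)))
    return mean, s

# Toy example: "majority class" model on labeled instances (value, label)
fit = lambda rows: max(set(l for _, l in rows), key=[l for _, l in rows].count)
err = lambda model, rows: sum(l != model for _, l in rows) / len(rows)
data = [(i, "yes") for i in range(9)] + [(i, "no") for i in range(5)]
print(k_fold_error(data, k=7, fit=fit, error_rate=err))
```
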
Leave-One-Out Holdout

- n-fold cross-validation (n = number of instances)
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample

Bootstrap

- Sample with replacement n times
  - Use as training data
  - Use the instances not in the training data for testing
- How many test instances are there?

  \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368

0.632 Bootstrap

- On average, e^{-1} n ≈ 0.368 n instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate the error rate:

  e = 0.632 · e_test + 0.368 · e_train

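A minimal sketch of the bootstrap split and the 0.632 error combination (plain Python):

```python
import random

def bootstrap_split(data, seed=0):
    """Sample n instances with replacement for training; unused instances form the test set."""
    rng = random.Random(seed)
    n = len(data)
    train_idx = [rng.randrange(n) for _ in range(n)]
    test_idx = set(range(n)) - set(train_idx)        # ~36.8% of instances on average
    return [data[i] for i in train_idx], [data[i] for i in test_idx]

def bootstrap_632_error(e_test, e_train):
    """Combine the two error rates as on the slide."""
    return 0.632 * e_test + 0.368 * e_train

train, test = bootstrap_split(list(range(1000)))
print(len(test) / 1000)            # close to 1 - 1/e ~ 0.368
print(bootstrap_632_error(0.30, 0.10))
```
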
Accuracy of our Estimate?

- Suppose we observe s successes in a testing set of n_test instances
- We then estimate the success rate R_success = s / n_test
- Each instance is either a success or a failure (Bernoulli trial with success probability p)
  - Mean p
  - Variance p(1 - p)

Properties of Estimate

- We have
  E[R_success] = p
  Var[R_success] = p(1 - p) / n_test
- If n_test is large enough, the Central Limit Theorem (CLT) states that, approximately,
  R_success ~ Normal(p, p(1 - p) / n_test)

Confidence Interval

- CI for the normal approximation:

  P\left[-z \le \frac{R_{success} - p}{\sqrt{p(1-p)/n_{test}}} \le z\right] = c

  (look up z for confidence level c in a table)

- Solving for p gives the CI:

  p = \frac{R_{success} + \frac{z^2}{2 n_{test}} \pm z \sqrt{\frac{R_{success}}{n_{test}} - \frac{R_{success}^2}{n_{test}} + \frac{z^2}{4 n_{test}^2}}}{1 + \frac{z^2}{n_{test}}}

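A minimal sketch of the interval above (plain Python; z = 1.96 corresponds to an approximately 95% confidence level):

```python
import math

def wilson_interval(successes, n_test, z=1.96):
    """Confidence interval for the true success rate p, per the formula above."""
    r = successes / n_test                     # observed success rate
    center = r + z * z / (2 * n_test)
    spread = z * math.sqrt(r / n_test - r * r / n_test + z * z / (4 * n_test * n_test))
    denom = 1 + z * z / n_test
    return (center - spread) / denom, (center + spread) / denom

print(wilson_interval(successes=75, n_test=100))   # approximately (0.66, 0.82)
```
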
Comparing Algorithms

- Know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
  - Evaluate each algorithm
  - Rank
  - Select the best one
- Don't know if this ranking is reliable

Assessing Other Learning

- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles
    - Use an independent test set and hold-out procedures
    - Cross-validation or bootstrap

Measures of Effectiveness

- Need to compare:
  - Predicted values p_1, p_2, ..., p_n
  - Actual values a_1, a_2, ..., a_n
- Most common measure: mean-squared error

  \frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2

Other Measures

- Mean absolute error:  \frac{1}{n} \sum_{i=1}^{n} |p_i - a_i|

- Relative squared error:  \frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}

- Relative absolute error:  \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}

- Correlation

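A minimal sketch of these measures (plain Python; the predicted and actual values at the bottom are illustrative only):

```python
def error_measures(p, a):
    """p = predicted values, a = actual values (same length)."""
    n = len(p)
    a_bar = sum(a) / n
    mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
    mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
    rse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / sum((ai - a_bar) ** 2 for ai in a)
    rae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / sum(abs(ai - a_bar) for ai in a)
    return {"MSE": mse, "MAE": mae, "RSE": rse, "RAE": rae}

print(error_measures(p=[2.5, 0.0, 2.1, 7.8], a=[3.0, -0.5, 2.0, 7.0]))
```
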
What to Do?

- "Large" amounts of data
  - Hold out 1/3 of the data for testing
  - Train a model on 2/3 of the data
  - Estimate the error (or success) rate and calculate a CI
- "Moderate" amounts of data
  - Estimate the error rate:
    - Use 10-fold cross-validation with stratification, or use the bootstrap
  - Train the model on the entire data set

Predicting Probabilities

- Classification into k classes
  - Predict probabilities p_1, p_2, ..., p_k for each class
  - Actual values a_1, a_2, ..., a_k (1 for the correct class, 0 otherwise)
  - No longer 0-1 error
- Quadratic loss function (i = correct class):

  \sum_{j=1}^{k} (p_j - a_j)^2 = (1 - p_i)^2 + \sum_{j \ne i} p_j^2 = 1 - 2 p_i + \sum_j p_j^2

Information Loss Function

- Instead of the quadratic function:

  -\log_2 p_j,   where the j-th prediction is the correct one

- Information required to communicate which class is correct
  - In bits
  - With respect to the probability distribution

Occam's Razor

- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining:
  - theory = predictive model
  - good theory = good prediction
  - What is good? Do we minimize the error rate?

Minimum Description Length

- MDL principle:
  - Minimize
    size of theory + info needed to specify exceptions
- Suppose training set E is mined, resulting in a theory T
- Want to minimize

  L[T] + L[E | T]

Most Likely Theory

- Suppose we want to maximize P[T | E]
- Bayes' rule:

  P[T | E] = \frac{P[E | T] \, P[T]}{P[E]}

- Take logarithms:

  \log P[T | E] = \log P[E | T] + \log P[T] - \log P[E]

Information Function

- Maximizing P[T | E] is equivalent to minimizing

  -\log P[T | E] = -\log P[E | T] - \log P[T] + \log P[E]

  where -\log P[E | T] is the number of bits it takes to transmit the exceptions and -\log P[T] is the number of bits it takes to transmit the theory.

- That is, the MDL principle!

Applications to Learning

- Classification, association, numeric prediction
  - Several predictive models with 'similar' error rates (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use the MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use the MDL principle

Comparing Mining Algorithms

- Know how to evaluate the results
- Suppose we have two algorithms
  - Obtain two different models
  - Estimate the error rates ê(1) and ê(2)
  - Compare the estimates: ê(1) < ê(2)?
  - Select the better one
- Problem?

Weather Data Example

- Suppose we learn the rule
    If outlook = rainy then play = yes
    Otherwise play = no
- Test it on the following test set:

  Outlook  Temp.  Humidity  Windy  Play
  Sunny    Hot    High      FALSE  No
  Sunny    Hot    High      TRUE   No
  Rainy    Mild   High      FALSE  Yes
  Rainy    Cool   Normal    FALSE  Yes
  Sunny    Mild   High      FALSE  No

- We have a zero error rate

Different Test Set 2

- Again, suppose we learn the rule
    If outlook = rainy then play = yes
    Otherwise play = no
- Test it on a different test set:

  Outlook   Temp.  Humidity  Windy  Play
  Overcast  Hot    High      FALSE  Yes
  Rainy     Cool   Normal    TRUE   No
  Overcast  Cool   Normal    TRUE   Yes
  Sunny     Cool   Normal    FALSE  Yes
  Rainy     Mild   High      TRUE   No

- We have a 100% error rate!

Comparing Random Estimates

- The estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic from the differences in error rates:

  t = \frac{\bar{d}}{\sqrt{s^2 / k}}

  where \bar{d} is the average of the differences in error rates over the k folds and s is their estimated standard deviation.

- H0: Difference = 0

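A minimal sketch of the paired t statistic, assuming the two algorithms were evaluated on the same k folds (the per-fold error rates below are illustrative only):

```python
import math

def paired_t_statistic(errors_a, errors_b):
    """t statistic for H0: the mean difference in error rates is 0 (paired over k folds)."""
    d = [x - y for x, y in zip(errors_a, errors_b)]
    k = len(d)
    d_bar = sum(d) / k
    s2 = sum((di - d_bar) ** 2 for di in d) / (k - 1)   # sample variance of the differences
    return d_bar / math.sqrt(s2 / k)

# Hypothetical per-fold error rates of two algorithms on the same 5 folds
print(paired_t_statistic([0.20, 0.22, 0.18, 0.25, 0.21], [0.24, 0.23, 0.22, 0.28, 0.26]))
# Compare |t| against a t table with k-1 degrees of freedom.
```
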
Discussion

- Now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has a 'comparable' error rate
- Is it really better?
- Minimizing the error rate can be misleading

Examples of 'Good Models'

- Application: loan approval
  - Model: no applicants default on loans
  - Evaluation: simple, low error rate
- Application: cancer diagnosis
  - Model: all tumors are benign
  - Evaluation: simple, low error rate
- Application: information assurance
  - Model: all visitors to the network are well intentioned
  - Evaluation: simple, low error rate

What's Going On?

- Many (most) data mining applications can be thought about as detecting exceptions
  - Ignoring the exceptions does not significantly increase the error rate!
  - Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but that completely misses the point
- Need to account for the cost of different error types

Accounting for Cost of Errors

- Explicit modeling of the cost of each error
  - Costs may not be known
  - Often not practical
- Look at trade-offs
  - Visual inspection
  - Semi-automated learning
- Cost-sensitive learning
  - Assign costs to classes a priori

Explicit Modeling of Cost

Confusion matrix (displayed in Weka):

                      Predicted class
                      Yes              No
Actual class   Yes    True positive    False negative
               No     False positive   True negative

Cost Sensitive Learning

- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea:
  - Increase the weight of instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies to any learning algorithm

Discussion

- Evaluate learning
  - Estimate the error rate
  - Minimum description length principle / Occam's razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure the difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate costs into learning

Engineering the Output

- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available → new model
- Combine models
  - Bagging
  - Boosting
  - Stacking
- Improve prediction but complicate structure

Bagging

- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging:
  - Assume we have several data sets
  - Apply the learning algorithm to each set
  - Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?

Bootstrap Aggregating

- In practice: only one training data set
- Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work?
  - Often gives improvements in predictive performance
  - Never a degeneration in performance

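A minimal sketch of bagging by bootstrap resampling, assuming a `learn` function that returns a model given a training set (the toy 1-nearest-neighbor learner is illustrative only):

```python
import random
from collections import Counter

def bag(train_set, learn, n_models=10, seed=0):
    """Train one model per bootstrap replicate of the training set."""
    rng = random.Random(seed)
    n = len(train_set)
    return [learn([train_set[rng.randrange(n)] for _ in range(n)]) for _ in range(n_models)]

def bagged_predict(models, x):
    """Vote: the class predicted by the most models wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy learner: 1-nearest neighbour on (value, label) pairs
learn = lambda rows: (lambda x: min(rows, key=lambda r: abs(r[0] - x))[1])
data = [(1, "no"), (2, "no"), (3, "yes"), (4, "yes"), (5, "yes")]
print(bagged_predict(bag(data, learn), x=3.4))
```
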
Boosting

- Assume a stable learning procedure
  - Low variance
  - Bagging does very little
- Combine structurally different models
- Intuitive motivation:
  - Any given model may be good for a subset of the training data
  - Encourage models to explain part of the data

AdaBoost.M1

- Generate models:
  - Assign equal weight to each training instance
  - Iterate:
    - Apply the learning algorithm and store the model
    - e ← error
    - If e = 0 or e > 0.5, terminate
    - For every instance:
      - If classified correctly, multiply its weight by e/(1-e)
    - Normalize the weights
  - Until STOP

AdaBoost.M1

- Classification:
  - Assign zero weight to each class
  - For every model:
    - Add -log(e/(1-e)) = log((1-e)/e) to the class predicted by that model
  - Return the class with the highest weight

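A minimal sketch of AdaBoost.M1 as outlined on the last two slides, assuming a learner that accepts instance weights (the toy threshold "stump" learner is illustrative only):

```python
import math

def adaboost_m1(data, learn, n_rounds=10):
    """data: list of (x, y); learn(data, weights) returns a model, i.e., a function x -> y."""
    weights = [1.0 / len(data)] * len(data)
    models, betas = [], []
    for _ in range(n_rounds):
        model = learn(data, weights)
        wrong = [model(x) != y for x, y in data]
        e = sum(w for w, bad in zip(weights, wrong) if bad)
        if e == 0 or e >= 0.5:
            break
        beta = e / (1 - e)
        # Down-weight correctly classified instances, then renormalize
        weights = [w * (1.0 if bad else beta) for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
        models.append(model)
        betas.append(beta)
    return models, betas

def classify(models, betas, x):
    """Each model votes for its predicted class with weight log(1/beta) = log((1-e)/e)."""
    votes = {}
    for model, beta in zip(models, betas):
        votes[model(x)] = votes.get(model(x), 0.0) + math.log(1.0 / beta)
    return max(votes, key=votes.get)

# Toy weighted learner: a one-threshold "stump" chosen to minimize the weighted error
def learn(data, weights):
    best = None
    for t in sorted(set(x for x, _ in data)):
        for lo, hi in (("no", "yes"), ("yes", "no")):
            model = lambda x, t=t, lo=lo, hi=hi: lo if x <= t else hi
            err = sum(w for (x, y), w in zip(data, weights) if model(x) != y)
            if best is None or err < best[0]:
                best = (err, model)
    return best[1]

data = [(1, "no"), (2, "no"), (3, "yes"), (4, "no"), (5, "yes"), (6, "yes")]
models, betas = adaboost_m1(data, learn)
print(classify(models, betas, 4.5))
```
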
Performance Analysis

- Error of the combined classifier converges to zero at an exponential rate (very fast)
  - Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - The classifier is more complex than the training data justifies
  - The training error becomes too large too quickly
- Must achieve a balance between model complexity and the fit to the data

Fitting versus Overfitting

- Overfitting very difficult to assess here
  - Assume we have reached zero error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than with bagging
  - Can degenerate performance
    - Never happens with bagging

Stacking

- Models of different types
- Meta learner:
  - Learn which learning algorithms are good
  - Combine the learning algorithms intelligently

Level-0 models (decision tree, Naïve Bayes, instance-based) feed into a level-1 model (the meta learner).

Meta Learning

- Hold out part of the training set
- Use the remaining data for training the level-0 methods
- Use the holdout data to train the level-1 learner
- Retrain the level-0 algorithms with all the data
- Comments:
  - Level-1 learning: use a very simple algorithm (e.g., a linear model)
  - Can use cross-validation to allow the level-1 algorithm to train on all the data

Supervised Learning

- Two types of learning
  - Classification
  - Numerical prediction
- Classification learning algorithms
  - Decision trees
  - Naïve Bayes
  - Instance-based learning
  - Many others are part of Weka, browse!

Other Issues in Supervised Learning

- Evaluation
  - Accuracy: hold-out, bootstrap, cross-validation
  - Simplicity: MDL principle
  - Usefulness: cost-sensitive learning
- Metalearning
  - Bagging, boosting, stacking