Learning as Prediction:
Batch Learning Model
Empirical Risk Minimization,
Model Selection, and
Model Assessment
CS6780 – Advanced Machine Learning
Spring 2015
Thorsten Joachims
Cornell University
Reading:
Murphy 5.7-5.7.2.4, 6.5-6.5.3.1
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924. (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf)
Learning as Prediction: Overview
[Diagram: Real-world Process P(X,Y) → (drawn i.i.d.) Train Sample Strain = (x1,y1), …, (xn,yn) → Learner → h; (drawn i.i.d.) Test Sample Stest = (xn+1,yn+1), …]
• Goal: Find h with small prediction error ErrP(h) with respect to P(X,Y).
• Training Error: Error ErrStrain(h) on the training sample.
• Test Error: Error ErrStest(h) on the test sample is an estimate of ErrP(h).
• Definition: A particular Instance of a Learning Problem is described by a probability distribution P(X,Y).
• Definition: Any Example (Xi, Yi) is a random variable that is independently and identically distributed according to P(X,Y).

Training / Validation / Test Sample
• Definition: A Training / Validation / Test Sample S = ((x1,y1), …, (xn,yn)) is drawn i.i.d. from P(X,Y), so that
  P(S = ((x1,y1), …, (xn,yn))) = ∏_{i=1}^{n} P(Xi=xi, Yi=yi)

Risk
• Definition: The Loss Function Ξ”(y, Ε·) ∈ ℝ measures the quality of prediction Ε· if the true label is y.
• Definition: The Risk / Prediction Error / True Error / Generalization Error of a hypothesis h for a learning task P(X,Y) is
  ErrP(h) = Ξ£_{(x,y)} Ξ”(y, h(x)) P(X=x, Y=y)

Empirical Risk
• Definition: The Empirical Risk / Error of hypothesis h on a sample S = ((x1,y1), …, (xn,yn)) is
  ErrS(h) = (1/n) Ξ£_{i=1}^{n} Ξ”(yi, h(xi))
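To make the definitions concrete, here is a minimal Python sketch that computes the empirical risk ErrS(h) under 0/1 loss; the hypothesis h and both samples are invented for illustration, not taken from the lecture:

    def empirical_risk(h, sample, loss):
        # ErrS(h) = (1/n) * sum of loss(yi, h(xi)) over the n examples in S
        return sum(loss(y, h(x)) for x, y in sample) / len(sample)

    def zero_one(y, y_hat):
        # 0/1 loss: Delta(y, y_hat) = 0 if the prediction is correct, else 1
        return 0.0 if y == y_hat else 1.0

    h = lambda x: 1 if x >= 0.5 else -1                  # some fixed hypothesis
    s_train = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
    s_test = [(0.2, -1), (0.7, 1), (0.8, -1)]

    print(empirical_risk(h, s_train, zero_one))          # training error: 0.0
    print(empirical_risk(h, s_test, zero_one))           # test error, estimates ErrP(h): 0.33...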
Three Roadmaps for Designing ML Methods
• Generative Model:
  → Learn P(X,Y) from training sample.
• Discriminative Conditional Model:
  → Learn P(Y|X) from training sample.
• Discriminative ERM Model:
  → Learn h directly from training sample.

Bayes Risk
• Given knowledge of P(X,Y), the true error of the best possible h is
  ErrP(h_bayes) = E_{x~P(X)} [ min_{y∈Y} (1 βˆ’ P(Y=y | X=x)) ]
  for the 0/1 loss.
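As a sanity check on the Bayes risk formula, the following sketch evaluates it on a small made-up discrete joint distribution P(X,Y); the numbers are hypothetical, not from the lecture:

    # P(X=x, Y=y), invented values that sum to 1
    P = {
        ('a', +1): 0.30, ('a', -1): 0.10,
        ('b', +1): 0.15, ('b', -1): 0.45,
    }

    xs = {x for x, _ in P}
    bayes_risk = 0.0
    for x in xs:
        p_x = sum(p for (xi, _), p in P.items() if xi == x)    # P(X=x)
        p_best = max(P[(x, y)] / p_x for y in (+1, -1))        # max_y P(Y=y|X=x)
        bayes_risk += p_x * (1.0 - p_best)                     # E_x[min_y (1 - P(y|x))]

    print(bayes_risk)   # 0.25: error of the best possible classifier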
Empirical Risk Minimization
• Definition [ERM Principle]: Given a training sample S = ((x1,y1), …, (xn,yn)) and a hypothesis space H, select the rule h_ERM ∈ H that minimizes the empirical risk (i.e. training error) on S:
  h_ERM = argmin_{h∈H} Ξ£_{i=1}^{n} Ξ”(yi, h(xi))
MODEL SELECTION
Overfitting
Decision Tree Example: revisited
[Figure: the original and the complete decision tree for the course-evaluation example, splitting on attributes such as "correct" (complete / partial / guessing) and "presentation" (clear / unclear), with yes/no leaves; from Mitchell. Note: Accuracy = 1.0 βˆ’ Error.]
Controlling Overfitting in Decision Trees
• Early Stopping: Stop growing the tree and introduce a leaf when splitting is no longer "reliable".
  – Restrict size of tree (e.g., number of nodes, depth)
  – Minimum number of examples in node
  – Threshold on splitting criterion
• Post Pruning: Grow full tree, then simplify.
  – Reduced-error tree pruning
  – Rule post-pruning
Reduced-Error Pruning
• Split the training sample randomly into a growing set Strain' and a validation sample Sval.
• Grow the full tree on Strain'; then repeatedly remove the node whose removal most reduces the error on Sval, until no removal helps.

Model Selection via Validation Sample
[Diagram: Real-world Process P(X,Y) → (drawn i.i.d.) Train Sample Strain, split randomly into Train Sample Strain' and Val. Sample Sval; Learner 1 … Learner m trained on Strain' produce Δ₯1 … Δ₯m; the selected Δ₯ is applied to the Test Sample Stest (drawn i.i.d.)]
• Training: Run the learning algorithm m times (e.g. with different parameters).
• Validation Error: Error ErrSval(Δ₯i) is an estimate of ErrP(Δ₯i) for each Δ₯i.
• Selection: Use the Δ₯i with minimal ErrSval(Δ₯i) for prediction on test examples.
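The selection step can be written in a few lines. This sketch assumes a trivial "learner" whose parameter is just a decision threshold; the data and the learner are illustrative stand-ins:

    import random

    random.seed(0)
    data = [(x, 1 if x >= 0.6 else -1) for x in (random.random() for _ in range(200))]
    s_train, s_val = data[:150], data[150:]       # split randomly (data is already shuffled)

    def learner(sample, theta):
        # Toy learning algorithm: ignores the sample, returns a threshold rule
        return lambda x: 1 if x >= theta else -1

    def error(h, sample):
        return sum(1 for x, y in sample if h(x) != y) / len(sample)

    hypotheses = [learner(s_train, t) for t in (0.2, 0.4, 0.6, 0.8)]   # m = 4 runs
    h_selected = min(hypotheses, key=lambda h: error(h, s_val))        # min validation error
    print(error(h_selected, s_val))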
Text Classification Example: "Corporate Acquisitions" Results
• Unpruned Tree (ID3 Algorithm):
  – Size: 437 nodes   Training Error: 0.0%   Test Error: 11.0%
• Early Stopping Tree (ID3 Algorithm):
  – Size: 299 nodes   Training Error: 2.6%   Test Error: 9.8%
• Reduced-Error Pruning (C4.5 Algorithm):
  – Size: 167 nodes   Training Error: 4.0%   Test Error: 10.8%
• Rule Post-Pruning (C4.5 Algorithm):
  – Size: 164 tests   Training Error: 3.1%   Test Error: 10.3%
  – Examples of rules:
    • IF vs = 1 THEN βˆ’ [99.4%]
    • IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]
MODEL ASSESSMENT

Evaluating Learned Hypotheses
[Diagram: Real-world Process P(X,Y) → (drawn i.i.d.) Sample S, split randomly into Training Sample Strain = (x1,y1), …, (xn,yn) and Test Sample Stest = (x1,y1), …, (xk,yk); Learner (incl. model selection) trained on Strain outputs Δ₯]
• Goal: Find h with small prediction error ErrP(h) over P(X,Y).
• Question: How good is ErrP(Δ₯) for the Δ₯ found on training sample Strain?
• Training Error: Error ErrStrain(Δ₯) on training sample.
• Test Error: Error ErrStest(Δ₯) is an estimate of ErrP(Δ₯).
What is the True Error of a Hypothesis?
• Given
  – Sample of labeled instances S
  – Learning Algorithm A
• Setup
  – Partition S randomly into Strain and Stest
  – Train learning algorithm A on Strain; the result is Δ₯.
  – Apply Δ₯ to Stest and compare predictions against true labels.
• Test
  – Error on test sample ErrStest(Δ₯) is an estimate of true error ErrP(Δ₯).
  – Compute a confidence interval.
[Diagram: Training Sample Strain = (x1,y1), …, (xn,yn) → Learner → Δ₯; Δ₯ is applied to Test Sample Stest = (x1,y1), …, (xk,yk)]
Binomial Distribution
• The probability of observing x heads (i.e. errors) in a sample of n independent coin tosses (i.e. examples), where in each toss the probability of heads (i.e. making an error) is p, is
  P(X=x | p, n) = (n choose x) p^x (1βˆ’p)^(nβˆ’x)
• Normal approximation: For np(1βˆ’p) >= 5, the binomial can be approximated by the normal distribution with
  – Expected value: E(X) = np   Variance: Var(X) = np(1βˆ’p)
  – With probability Ξ», the observation x falls in the interval E(X) Β± z_Ξ» √Var(X):

  Ξ»:   50%   68%   80%   90%   95%   98%   99%
  z_Ξ»:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
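Applied to a test error, the approximation above yields a simple confidence interval: with x errors on n test examples, divide the interval for x by n. A minimal sketch (z = 1.96 corresponds to the 95% row of the table):

    from math import sqrt

    def error_interval(x, n, z=1.96):
        # x errors on n test examples; interval is p +/- z * sqrt(p(1-p)/n)
        p = x / n
        margin = z * sqrt(p * (1 - p) / n)
        return p - margin, p + margin

    print(error_interval(13, 100))   # 13 errors on 100 examples -> roughly (0.064, 0.196)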
Is Rule h1 More Accurate than h2?
• Given
  – Sample of labeled instances S
  – Learning Algorithms A1 and A2
• Setup
  – Partition S randomly into Strain and Stest
  – Train learning algorithms A1 and A2 on Strain; the results are Δ₯1 and Δ₯2.
  – Apply Δ₯1 and Δ₯2 to Stest and compute ErrStest(Δ₯1) and ErrStest(Δ₯2).
• Test
  – Decide: is ErrP(Δ₯1) β‰  ErrP(Δ₯2)?
  – Null Hypothesis: ErrStest(Δ₯1) and ErrStest(Δ₯2) come from binomial distributions with the same p.
  → Binomial Sign Test (McNemar's Test)
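A sketch of the binomial sign test on a shared test sample: only examples where Δ₯1 and Δ₯2 disagree carry information, and under the null hypothesis each disagreement favors either rule with probability 0.5. The predictions below are invented, and scipy is assumed to be available:

    from scipy.stats import binomtest

    def sign_test(y_true, pred1, pred2):
        # n01: h1 wrong where h2 is right; n10: h1 right where h2 is wrong
        n01 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a != y and b == y)
        n10 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a == y and b != y)
        # Two-sided test: n10 successes in n01 + n10 tosses of a fair coin
        return binomtest(n10, n01 + n10, p=0.5).pvalue

    y  = [1, 1, 0, 0, 1, 0, 1, 1]
    p1 = [1, 0, 0, 0, 1, 1, 1, 1]
    p2 = [1, 1, 0, 1, 0, 0, 1, 1]
    print(sign_test(y, p1, p2))   # here n01 = n10 = 2, so the p-value is 1.0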
Is Learning Algorithm A1 Better than A2?
• Given
  – k samples S1 … Sk of labeled instances, all i.i.d. from P(X,Y).
  – Learning Algorithms A1 and A2
• Setup
  – For i from 1 to k:
    • Partition Si randomly into Strain and Stest
    • Train learning algorithms A1 and A2 on Strain; the results are Δ₯1 and Δ₯2.
    • Apply Δ₯1 and Δ₯2 to Stest and compute ErrStest(Δ₯1) and ErrStest(Δ₯2).
• Test
  – Decide: is ES(ErrP(A1(Strain))) β‰  ES(ErrP(A2(Strain)))?
  – Null Hypothesis: ErrStest(A1(Strain)) and ErrStest(A2(Strain)) come from the same distribution over samples S.
  → t-Test or Wilcoxon Signed-Rank Test
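A sketch of the two paired tests named above, applied to hypothetical per-sample test errors of A1 and A2 over k = 7 samples (the error values are invented; scipy assumed):

    from scipy.stats import ttest_rel, wilcoxon

    err_a1 = [0.11, 0.09, 0.13, 0.10, 0.12, 0.08, 0.14]   # ErrStest(A1(Strain)) on S1 ... Sk
    err_a2 = [0.10, 0.08, 0.11, 0.09, 0.09, 0.07, 0.12]   # ErrStest(A2(Strain)) on S1 ... Sk

    print(ttest_rel(err_a1, err_a2).pvalue)   # paired t-test
    print(wilcoxon(err_a1, err_a2).pvalue)    # Wilcoxon signed-rank test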
Approximation via K-fold Cross Validation
• Given
  – Sample of labeled instances S
  – Learning Algorithms A1 and A2
• Compute
  – Randomly partition S into k equally sized subsets S1 … Sk
  – For i from 1 to k:
    • Train A1 and A2 on S1 … Si-1, Si+1 … Sk and get Δ₯1 and Δ₯2.
    • Apply Δ₯1 and Δ₯2 to Si and compute ErrSi(Δ₯1) and ErrSi(Δ₯2).
• Estimate
  – Average ErrSi(Δ₯1) is an estimate of ES(ErrP(A1(Strain)))
  – Average ErrSi(Δ₯2) is an estimate of ES(ErrP(A2(Strain)))
  – Count how often ErrSi(Δ₯1) > ErrSi(Δ₯2) and ErrSi(Δ₯1) < ErrSi(Δ₯2)
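A sketch of the whole k-fold procedure with toy stand-ins for A1 and A2 (fixed threshold rules on invented data); the per-fold errors could then be fed to the paired tests above:

    import random

    random.seed(1)
    S = [(x, 1 if x >= 0.55 else -1) for x in (random.random() for _ in range(100))]
    random.shuffle(S)
    k = 10
    folds = [S[i::k] for i in range(k)]            # k equally sized subsets

    def A1(train):                                 # toy learner: fixed threshold rule
        return lambda x: 1 if x >= 0.5 else -1

    def A2(train):                                 # toy learner: a different threshold
        return lambda x: 1 if x >= 0.7 else -1

    def err(h, sample):
        return sum(1 for x, y in sample if h(x) != y) / len(sample)

    errs1, errs2 = [], []
    for i in range(k):
        train = [ex for j in range(k) if j != i for ex in folds[j]]
        errs1.append(err(A1(train), folds[i]))     # ErrSi(h1)
        errs2.append(err(A2(train), folds[i]))     # ErrSi(h2)

    print(sum(errs1) / k, sum(errs2) / k)                      # average errors
    print(sum(e1 < e2 for e1, e2 in zip(errs1, errs2)),
          sum(e2 < e1 for e1, e2 in zip(errs1, errs2)))        # win counts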