Learning as Prediction: Batch Learning Model
Empirical Risk Minimization, Model Selection, and Model Assessment

CS6780 – Advanced Machine Learning, Spring 2015
Thorsten Joachims, Cornell University

Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1;
Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7), 1895-1924. (http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf)

Learning as Prediction Overview

[Diagram: the real-world process P(X,Y) generates, drawn i.i.d., a train sample Strain = (x1,y1), …, (xn,yn) that is fed to the learner, which outputs h; a test sample Stest = (xn+1,yn+1), … is drawn i.i.d. as well.]

• Goal: Find h with small prediction error ErrP(h) with respect to P(X,Y).

• Definition: A particular Instance of a Learning Problem is described by a probability distribution P(X,Y).
• Definition: Any Example (xi, yi) is a random variable that is independently identically distributed according to P(X,Y).
• Definition: A Training / Validation / Test Sample S = ((x1,y1), …, (xn,yn)) is drawn i.i.d. from P(X,Y).
• Definition: The Loss Function Δ(ŷ, y) ∈ ℝ measures the quality of prediction ŷ if the true label is y.
• Definition: The Risk / Prediction Error / True Error / Generalization Error of a hypothesis h for a learning task P(X,Y) is
      ErrP(h) = Σ(x,y) Δ(y, h(x)) · P(X = x, Y = y)
• Definition: The Empirical Risk / Error of hypothesis h on sample S = ((x1,y1), …, (xn,yn)) is
      ErrS(h) = (1/n) Σi=1..n Δ(yi, h(xi))
• Training Error: Error ErrStrain(h) on the training sample.
• Test Error: Error ErrStest(h) on the test sample is an estimate of ErrP(h).
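The definitions above can be sketched in a few lines of code. This is a minimal illustration, not part of the lecture: the hypothesis `h` and the hand-made samples are toy stand-ins. The empirical risk is simply the average loss over a sample, and computing it on a held-out test sample gives the estimate of ErrP(h).

```python
def zero_one_loss(y_true, y_pred):
    # Delta(y, h(x)) for classification: 1 on a mistake, 0 otherwise.
    return 0 if y_true == y_pred else 1

def empirical_risk(h, sample, loss=zero_one_loss):
    # Err_S(h): average loss of hypothesis h over S = ((x1,y1), ..., (xn,yn)).
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)

# Toy hypothesis and samples (illustrative only): predict the sign of x.
h = lambda x: 1 if x >= 0 else -1
s_train = [(-2, -1), (-1, -1), (1, 1), (3, 1)]
s_test = [(-1, -1), (2, -1), (4, 1)]

train_error = empirical_risk(h, s_train)  # training error Err_Strain(h)
test_error = empirical_risk(h, s_test)    # estimate of the true error Err_P(h)
```

Note that the training error of a learned hypothesis is generally an optimistic estimate of ErrP(h), which is why the estimate comes from a separate test sample.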
Three Roadmaps for Designing ML Methods

Bayes Risk
• Given knowledge of P(X,Y), the true error of the best possible h for the 0/1 loss is
      ErrP(hbayes) = Ex~P(X) [ miny∈Y (1 − P(Y = y | X = x)) ]
• Generative Model: learn P(X,Y) from the training sample.
• Discriminative Conditional Model: learn P(Y|X) from the training sample.
• Discriminative ERM Model: learn h directly from the training sample.

Empirical Risk Minimization
• Definition [ERM Principle]: Given a training sample S = ((x1,y1), …, (xn,yn)) and a hypothesis space H, select the rule hERM ∈ H that minimizes the empirical risk (i.e. training error) on S:
      hERM = argminh∈H Σi=1..n Δ(yi, h(xi))

MODEL SELECTION

Overfitting
• Decision tree example revisited [Mitchell]. (Figure omitted: a decision tree splitting on attributes such as complete/partial original, correct/guessing, and unclear/clear presentation.)
• Note: Accuracy = 1.0 − Error.

Controlling Overfitting in Decision Trees
• Early Stopping: Stop growing the tree and introduce a leaf when splitting is no longer "reliable".
  – Restrict the size of the tree (e.g., number of nodes, depth)
  – Minimum number of examples in a node
  – Threshold on the splitting criterion
• Post Pruning: Grow the full tree, then simplify.
  – Reduced-error tree pruning
  – Rule post-pruning

Model Selection via Validation Sample

[Diagram: the sample drawn i.i.d. from the real-world process P(X,Y) is split randomly into a train sample Strain′ and a validation sample Sval; learners 1 … m are trained on Strain′, yielding ĥ1 … ĥm; a separate test sample Stest is held out.]

• Training: Run the learning algorithm m times (e.g. with different parameters).
• Validation Error: Error ErrSval(ĥi) is an estimate of ErrP(ĥi) for each ĥi.
• Selection: Use the ĥi with minimum ErrSval(ĥi) for prediction on test examples.
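As a concrete sketch of selection via a validation sample: train a set of candidate hypotheses on the training split, then keep the one with minimum validation error. Everything here is a made-up illustration, not from the lecture; a family of threshold rules stands in for the m runs of a learning algorithm with different parameters.

```python
import random

def err(h, sample):
    # Empirical 0/1 error of hypothesis h on a sample of (x, y) pairs.
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def threshold_rule(t):
    # One "learner run" with parameter t: h(x) = +1 iff x >= t.
    return lambda x, t=t: 1 if x >= t else -1

# Labeled data from a process whose true threshold is 0 (illustrative).
data = [(x / 10.0, 1 if x >= 0 else -1) for x in range(-50, 50)]
random.Random(0).shuffle(data)
s_train, s_val = data[:60], data[60:]   # split randomly

# Training: run the learning algorithm m times (here: m parameter values).
candidates = [threshold_rule(t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]

# Selection: use the hypothesis with minimum validation error.
h_best = min(candidates, key=lambda h: err(h, s_val))
```

Because the selection itself uses Sval, ErrSval(h_best) is no longer an unbiased estimate of ErrP(h_best); that is the role of the held-out test sample in the diagram above.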
Text Classification Example: "Corporate Acquisitions" Results
• Unpruned Tree (ID3 Algorithm):
  – Size: 437 nodes   Training Error: 0.0%   Test Error: 11.0%
• Early Stopping Tree (ID3 Algorithm):
  – Size: 299 nodes   Training Error: 2.6%   Test Error: 9.8%
• Reduced-Error Pruning (C4.5 Algorithm):
  – Size: 167 nodes   Training Error: 4.0%   Test Error: 10.8%
• Rule Post-Pruning (C4.5 Algorithm):
  – Size: 164 tests   Training Error: 3.1%   Test Error: 10.3%
  – Example rules:
      IF vs = 1 THEN − [99.4%]
      IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]

MODEL ASSESSMENT

Evaluating Learned Hypotheses

[Diagram: a sample S drawn i.i.d. from the real-world process P(X,Y) is split randomly into a training sample Strain = (x1,y1), …, (xn,yn), fed to the learner (incl. model selection) to produce ĥ, and a test sample Stest = (x1,y1), …, (xk,yk).]

• Goal: Find h with small prediction error ErrP(h) over P(X,Y).
• Question: How good is ErrP(ĥ) for the ĥ found on training sample Strain?
• Training Error: Error ErrStrain(ĥ) on the training sample.
• Test Error: Error ErrStest(ĥ) is an estimate of ErrP(ĥ).

What is the True Error of a Hypothesis?
• Given
  – Sample of labeled instances S
  – Learning Algorithm A
• Setup
  – Partition S randomly into Strain and Stest.
  – Train learning algorithm A on Strain; the result is ĥ.
  – Apply ĥ to Stest and compare predictions against the true labels.
• Test
  – The error on the test sample, ErrStest(ĥ), is an estimate of the true error ErrP(ĥ).
  – Compute a confidence interval.

Binomial Distribution
• The probability of observing x heads (i.e. errors) in a sample of n independent coin tosses (i.e. examples), where in each toss the probability of heads (i.e. making an error) is p, is
      P(X = x) = (n choose x) p^x (1 − p)^(n−x)
• Normal approximation: For np(1−p) ≥ 5 the binomial can be approximated by the normal distribution with
  – Expected value: E(X) = np   Variance: Var(X) = np(1−p)
  – With probability δ, the observation x falls in the interval
      np − z_δ √(np(1−p)) ≤ x ≤ np + z_δ √(np(1−p))

      δ    50%   68%   80%   90%   95%   98%   99%
      z_δ  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Is Rule ĥ1 More Accurate than ĥ2?
• Given
  – Sample of labeled instances S
  – Learning Algorithms A1 and A2
• Setup
  – Partition S randomly into Strain and Stest.
  – Train learning algorithms A1 and A2 on Strain; the results are ĥ1 and ĥ2.
  – Apply ĥ1 and ĥ2 to Stest and compute ErrStest(ĥ1) and ErrStest(ĥ2).
• Test
  – Decide: is ErrP(ĥ1) ≠ ErrP(ĥ2)?
  – Null hypothesis: ErrStest(ĥ1) and ErrStest(ĥ2) come from binomial distributions with the same p.
    → Binomial Sign Test (McNemar's Test)

Is Learning Algorithm A1 Better than A2?
• Given
  – k samples S1 … Sk of labeled instances, all i.i.d. from P(X,Y)
  – Learning Algorithms A1 and A2
• Setup
  – For i from 1 to k:
    • Partition Si randomly into Strain and Stest.
    • Train learning algorithms A1 and A2 on Strain; the results are ĥ1 and ĥ2.
    • Apply ĥ1 and ĥ2 to Stest and compute ErrStest(ĥ1) and ErrStest(ĥ2).
• Test
  – Decide: is ES(ErrP(A1(Strain))) ≠ ES(ErrP(A2(Strain)))?
  – Null hypothesis: ErrStest(A1(Strain)) and ErrStest(A2(Strain)) come from the same distribution over samples S.
    → t-Test or Wilcoxon Signed-Rank Test

Approximation via k-fold Cross Validation
• Given
  – Sample of labeled instances S
  – Learning Algorithms A1 and A2
• Compute
  – Randomly partition S into k equally sized subsets S1 … Sk.
  – For i from 1 to k:
    • Train A1 and A2 on S1 … Si−1, Si+1 … Sk and get ĥ1 and ĥ2.
    • Apply ĥ1 and ĥ2 to Si and compute ErrSi(ĥ1) and ErrSi(ĥ2).
• Estimate
  – The average of the ErrSi(ĥ1) is an estimate of ES(ErrP(A1(Strain))).
  – The average of the ErrSi(ĥ2) is an estimate of ES(ErrP(A2(Strain))).
  – Count how often ErrSi(ĥ1) > ErrSi(ĥ2) and how often ErrSi(ĥ1) < ErrSi(ĥ2).
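The k-fold comparison procedure can be sketched as follows. The two toy learners are illustrative stand-ins, not from the lecture: both algorithms are trained and evaluated on the same random splits, producing paired per-fold errors that can then be fed into a t-test, a Wilcoxon signed-rank test, or the sign counts above.

```python
import random

def err(h, sample):
    # Empirical 0/1 error of hypothesis h on a sample of (x, y) pairs.
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def kfold_compare(A1, A2, S, k=10, seed=0):
    # Paired k-fold CV: for each fold i, train both learners on S \ Si
    # and record Err_Si(h1) and Err_Si(h2) on the held-out fold Si.
    S = list(S)
    random.Random(seed).shuffle(S)
    folds = [S[i::k] for i in range(k)]        # k roughly equal subsets
    e1, e2 = [], []
    for i in range(k):
        held_out = folds[i]
        train = [ex for j in range(k) if j != i for ex in folds[j]]
        e1.append(err(A1(train), held_out))
        e2.append(err(A2(train), held_out))
    return e1, e2

# Toy learners: a constant majority-class rule vs. a sign rule that
# happens to match the data-generating process (both made up here).
def majority_learner(train):
    pos = sum(1 for _, y in train if y == 1)
    label = 1 if pos >= len(train) - pos else -1
    return lambda x: label

def sign_learner(train):
    return lambda x: 1 if x >= 0 else -1

data = [(x / 10.0, 1 if x >= 0 else -1) for x in range(-50, 50)]
e1, e2 = kfold_compare(majority_learner, sign_learner, data, k=5)
avg1, avg2 = sum(e1) / len(e1), sum(e2) / len(e2)  # estimates of E_S(Err_P(Ai))
wins = sum(1 for a, b in zip(e1, e2) if b < a)     # folds where A2 beats A1
```

Because every fold reuses overlapping training data, the k error estimates are not independent, which is why cross-validation only approximates the idealized setup with k independent samples S1 … Sk.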