Machine Learning: Lecture 5
Experimental Evaluation of Learning Algorithms
(Based on Chapter 5 of Mitchell, T., Machine Learning, 1997)
Motivation
• Evaluating the performance of learning systems is important because:
  - Learning systems are usually designed to predict the class of “future” unlabeled data points.
  - In some cases, evaluating hypotheses is an integral part of the learning process (for example, when pruning a decision tree).
Difficulties in Evaluating Hypotheses When Only Limited Data Are Available
• Bias in the estimate: The observed accuracy of the learned hypothesis over the training examples is a poor estimator of its accuracy over future examples ==> we test the hypothesis on a test set chosen independently of the training set and the hypothesis.
• Variance in the estimate: Even with a separate test set, the measured accuracy can vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the test set, the greater the expected variance.
Questions Considered
• Given the observed accuracy of a hypothesis over a limited sample of data, how well does this estimate its accuracy over additional examples?
• Given that one hypothesis outperforms another over some sample data, how probable is it that this hypothesis is more accurate in general?
• When data is limited, what is the best way to use this data to both learn a hypothesis and estimate its accuracy?
Variance in Test Accuracy
• Let errorS(h) denote the proportion of examples in an independently sampled test set S of size n that are incorrectly classified by hypothesis h.
• Let errorD(h) denote the true error rate for the overall data distribution D.
• When n is at least 30, the central limit theorem ensures that the distribution of errorS(h) over different random samples will be closely approximated by a normal (Gaussian) distribution.
[Figure: the sampling distribution P(errorS(h)) as a bell curve over errorS(h), centered near errorD(h).]
Comparing Two Learned Hypotheses
• When evaluating two hypotheses, their observed ordering with respect to accuracy may or may not reflect the ordering of their true accuracies.
• Assume h1 is tested on test set S1 of size n1, and h2 is tested on test set S2 of size n2.
[Figure: two sampling distributions P(errorS(h)) over errorS(h) in which errorS1(h1) falls below errorS2(h2): we observe h1 more accurate than h2.]
[Figure: the same two distributions with errorS1(h1) above errorS2(h2): we observe h1 less accurate than h2.]
Estimating Hypothesis Accuracy
Two questions of interest:
• Given a hypothesis h and a data sample containing n examples drawn at random according to distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution? ==> sample vs. true error
• What is the probable error in this accuracy estimate? ==> confidence intervals
Sample Error and True Error
• Definition 1: The sample error (denoted errorS(h)) of hypothesis h with respect to target function f and data sample S is:
    errorS(h) = (1/n) Σx∈S δ(f(x), h(x))
  where n is the number of examples in S, and the quantity δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
• Definition 2: The true error (denoted errorD(h)) of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
    errorD(h) = Prx∈D[f(x) ≠ h(x)]
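As a concrete illustration of Definition 1, here is a minimal Python sketch (not part of the original slides; the names sample_error, hypothesis, test_examples, and test_labels are mine):

# Minimal sketch: sample error of a hypothesis h on a labeled test set S.
def sample_error(hypothesis, test_examples, test_labels):
    """errorS(h) = fraction of examples in S that h misclassifies."""
    n = len(test_examples)
    mistakes = sum(1 for x, y in zip(test_examples, test_labels)
                   if hypothesis(x) != y)   # delta(f(x), h(x)) summed over S
    return mistakes / n

# Example with a toy hypothesis that thresholds a single feature:
h = lambda x: 1 if x >= 0.5 else 0
X = [0.1, 0.4, 0.6, 0.9]
y = [0, 1, 1, 1]
print(sample_error(h, X, y))   # 0.25 (one of four examples is misclassified)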
Confidence Intervals for Discrete-Valued Hypotheses
• The general expression for an approximate N% confidence interval for errorD(h) is:
    errorS(h) ± zN √(errorS(h)(1 − errorS(h))/n)
  where zN is given in [Mitchell, Table 5.1].
• This approximation is quite good when n · errorS(h)(1 − errorS(h)) ≥ 5.
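A small Python sketch of this interval (my illustration, not from the slides; z_n = 1.96 is the Table 5.1 value for N = 95%):

import math

def confidence_interval(error_s, n, z_n=1.96):
    """Approximate N% confidence interval for errorD(h); z_n = 1.96 gives 95%."""
    half_width = z_n * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# e.g. h misclassifies 12 of 40 test examples, so errorS(h) = 0.30 and n = 40:
print(confidence_interval(0.30, 40))   # roughly (0.16, 0.44)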
Mean and Variance
• Definition 1: Consider a random variable Y that takes on possible values y1, …, yn. The expected value (or mean value) of Y, E[Y], is:
    E[Y] = Σi=1..n yi Pr(Y = yi)
• Definition 2: The variance of a random variable Y, Var[Y], is:
    Var[Y] = E[(Y − E[Y])²]
• Definition 3: The standard deviation of a random variable Y is the square root of the variance.
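A quick worked example of these definitions (not on the original slide): for a Bernoulli variable Y that is 1 with probability p and 0 with probability 1 − p, E[Y] = 1·p + 0·(1 − p) = p and Var[Y] = E[(Y − p)²] = p·(1 − p)² + (1 − p)·p² = p(1 − p). This is exactly the distribution of a single classification error, and it underlies the standard deviation formula on the next slide.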
Estimators, Bias and Variance
• Since errorS(h) (an estimator for the true error) obeys a Binomial distribution (see [Mitchell, Section 5.3]), we have:
    errorS(h) = r/n and errorD(h) = p
  where n is the number of instances in the sample S, r is the number of instances from S misclassified by h, and p is the probability of misclassifying a single instance drawn from D.
• Definition: The estimation bias (not to be confused with the inductive bias) of an estimator Y for an arbitrary parameter p is E[Y] − p.
• Y is an unbiased estimator if E[Y] = p.
• The standard deviation of errorS(h) is given by:
    σerrorS(h) = √(p(1 − p)/n) ≈ √(errorS(h)(1 − errorS(h))/n)
Confidence Intervals
• An N% confidence interval for some parameter p is an interval that is expected, with probability N%, to contain p.
• Examples: pages 132 and 138 of [Mitchell].
• By the central limit theorem, the Binomial distribution can be approximated by a Normal distribution when n · errorS(h)(1 − errorS(h)) ≥ 5.
Two-sided and One-sided Bounds
• One-sided bound: what is the probability that errorD(h) is at most some upper bound U?
• A two-sided bound with confidence 100(1 − α)% yields one-sided bounds, (−∞, U) or (L, +∞), each with confidence 100(1 − α/2)%.
• Example: page 141 of [Mitchell].
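A tiny sketch of this relationship (my illustration; it uses SciPy's standard normal CDF and the z value for a 90% two-sided interval):

from scipy.stats import norm

z = 1.64                            # zN for N = 90% in [Mitchell, Table 5.1]
two_sided = 2 * norm.cdf(z) - 1     # ~0.90: P(L <= errorD(h) <= U)
one_sided = norm.cdf(z)             # ~0.95: P(errorD(h) <= U)
print(round(two_sided, 2), round(one_sided, 2))   # 0.9 0.95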
Difference in Error of Two Hypotheses
• Let h1 and h2 be two hypotheses for some discrete-valued target function. h1 has been tested on a sample S1 containing n1 (≥ 30) randomly drawn examples, and h2 has been tested on an independent sample S2 containing n2 (≥ 30) examples drawn from the same distribution.
• We estimate the difference between the true errors of these two hypotheses, d, by computing the difference between the sample errors:
    d̂ = errorS1(h1) − errorS2(h2)   (or d̂ = errorS(h1) − errorS(h2) when a single test set S is used)
• The approximate N% confidence interval for d is:
    d̂ ± zN √(errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2)
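A minimal Python sketch of this interval (my illustration, not from the slides):

import math

def error_difference_interval(e1, n1, e2, n2, z_n=1.96):
    """Approximate N% confidence interval for d = errorD(h1) - errorD(h2),
    given sample errors e1 = errorS1(h1) and e2 = errorS2(h2); z_n = 1.96 gives 95%."""
    d_hat = e1 - e2
    sigma_d = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z_n * sigma_d, d_hat + z_n * sigma_d

# e.g. errorS1(h1) = 0.20 and errorS2(h2) = 0.30, each on 100 test examples:
print(error_difference_interval(0.20, 100, 0.30, 100))   # roughly (-0.22, 0.02)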
Sample Z-Score Test 1
Assume we test two hypotheses on different test sets of size 100 and observe:
    errorS1(h1) = 0.20,  errorS2(h2) = 0.30
    d̂ = errorS1(h1) − errorS2(h2) = 0.2 − 0.3 = −0.1
    σ_d̂ = √(errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2)
        = √(0.2 × 0.8/100 + 0.3 × 0.7/100) ≈ 0.0608
    z = |d̂| / σ_d̂ = 0.1 / 0.0608 ≈ 1.644
Confidence for the two-tailed test: 90%
Confidence for the one-tailed test: 100 − (100 − 90)/2 = 95%
So h1 has better accuracy than h2 with approximately 95% confidence (about 95 times out of 100 such trials).
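A Python sketch of this test (my illustration; it reproduces the numbers above and uses SciPy's standard normal CDF):

import math
from scipy.stats import norm

def z_score_test(e1, n1, e2, n2):
    """Z-score for the observed difference between two sample errors,
    with the corresponding two- and one-tailed confidence levels."""
    sigma_d = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = abs(e1 - e2) / sigma_d
    two_tailed = 2 * norm.cdf(z) - 1   # P(|Z| < z)
    one_tailed = norm.cdf(z)           # P(Z < z)
    return z, two_tailed, one_tailed

z, two_t, one_t = z_score_test(0.20, 100, 0.30, 100)
print(round(z, 3), round(two_t, 2), round(one_t, 2))   # 1.644 0.9 0.95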
Sample Z-Score Test 2
Assume we test two hypotheses on different test sets of size 100 and observe:
    errorS1(h1) = 0.20,  errorS2(h2) = 0.25
    d̂ = errorS1(h1) − errorS2(h2) = 0.2 − 0.25 = −0.05
    σ_d̂ = √(0.2 × 0.8/100 + 0.25 × 0.75/100) ≈ 0.0589
    z = |d̂| / σ_d̂ = 0.05 / 0.0589 ≈ 0.848
Confidence for the two-tailed test: 50% (from [Mitchell, Table 5.1]: z = 0.848 exceeds z50% = 0.67 but not z68% = 1.00)
Confidence for the one-tailed test: 100 − (100 − 50)/2 = 75%
Hypothesis Testing
• What is Pr[errorD(h1) > errorD(h2)]?
• Test sets of size 100; errorS1(h1) = 0.3 and errorS2(h2) = 0.2.
• Observed difference d̂ = 0.1, with σ_d̂ ≈ 0.0608.
• Pr[errorD(h1) > errorD(h2)] = Pr[d > 0]
  - Hypothesis H: errorD(h1) > errorD(h2)
  - Null hypothesis: the opposite hypothesis
• z = d̂ / σ_d̂ = 0.1 / 0.0608 ≈ 1.644
• Confidence for the two-tailed test: 90%; for the one-tailed test: 100 − (100 − 90)/2 = 95%.
• So Pr[errorD(h1) > errorD(h2)] ≈ 0.95; that is, h2 is the more accurate hypothesis in roughly 95 out of 100 such trials.
Comparing Learning Algorithms
• Which of LA and LB is the better learning method on average for learning some particular target function f?
• To answer this question, we wish to estimate the expected value of the difference in their errors:
    ES⊂D [errorD(LA(S)) − errorD(LB(S))]
• Of course, since we have only a limited sample D0, we estimate this quantity by dividing D0 into a training set S0 and a test set T0 and measuring:
    errorT0(LA(S0)) − errorT0(LB(S0))
• Problem: we are only measuring the difference in errors for one training set S0, rather than the expected value of this difference over all samples S drawn from D.
Solution: k-Fold Cross-Validation
1. Partition the available data D0 into k disjoint subsets T1, T2, …, Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do:
   use Ti for the test set, and the remaining data for the training set Si
   • Si <- {D0 - Ti}
   • hA <- LA(Si)
   • hB <- LB(Si)
   • δi <- errorTi(hA) − errorTi(hB)
3. Return the value avg(δ), where
    avg(δ) = (1/k) Σi=1..k δi
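A minimal Python sketch of this procedure (my illustration; it assumes X and y are NumPy arrays and that each learner is a function mapping a training set to a hypothesis that predicts label arrays):

import numpy as np

def k_fold_comparison(learner_a, learner_b, X, y, k=10, seed=0):
    """Paired k-fold estimate of errorD(LA(S)) - errorD(LB(S))."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)         # T1, ..., Tk
    deltas = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h_a = learner_a(X[train_idx], y[train_idx])            # hA <- LA(Si)
        h_b = learner_b(X[train_idx], y[train_idx])            # hB <- LB(Si)
        err_a = np.mean(h_a(X[test_idx]) != y[test_idx])       # errorTi(hA)
        err_b = np.mean(h_b(X[test_idx]) != y[test_idx])       # errorTi(hB)
        deltas.append(err_a - err_b)                           # delta_i
    return float(np.mean(deltas)), deltas                      # avg(delta), and each delta_i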
k-Fold Cross-Validation: Comments
• Every example gets used as a test example once and as a training example k − 1 times.
• All test sets are independent; however, training sets overlap significantly.
• Measures the accuracy of the hypothesis generated from [(k − 1)/k]·|D| training examples.
• The standard method is 10-fold.
• If k is too low, there are not enough train/test trials; if k is too high, the test set is small, test variance is high, and run time increases.
• If k = |D|, the method is called leave-one-out cross-validation.
Significance Testing
• Typically k < 30, so there are not enough trials for a z test.
• Can use (Student's) t-test, which is more accurate when the number of trials is low.
• Can use a paired t-test, which can determine smaller differences to be significant when the training/test sets are the same for both systems.
• However, both the z and t tests assume the trials are independent. This is not quite true for k-fold cross-validation:
  - test sets are independent;
  - training sets are not independent.
Confidence of the k-Fold Estimate
• The approximate N% confidence interval for estimating ES⊂D0 [errorD(LA(S)) − errorD(LB(S))] using avg(δ) is given by:
    avg(δ) ± tN,k−1 · s_avg(δ)
  where tN,k−1 is a constant similar to zN (see [Mitchell, Table 5.6]) and s_avg(δ) is an estimate of the standard deviation of the distribution governing avg(δ):
    s_avg(δ) = √( (1/(k(k − 1))) Σi=1..k (δi − avg(δ))² )
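A Python sketch of this interval (my illustration; it uses SciPy's Student's t quantile and the per-fold differences δ1, …, δk, e.g. as returned by the k_fold_comparison sketch above):

import math
from scipy.stats import t

def k_fold_confidence_interval(deltas, confidence=0.95):
    """Approximate confidence interval for the expected error difference."""
    k = len(deltas)
    avg = sum(deltas) / k
    s_avg = math.sqrt(sum((d - avg) ** 2 for d in deltas) / (k * (k - 1)))
    t_n = t.ppf(1 - (1 - confidence) / 2, df=k - 1)   # two-sided t_{N, k-1}
    return avg - t_n * s_avg, avg + t_n * s_avg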
Experimental Evaluation: Conclusions
• Good experimental methodology is important for evaluating learning methods.
• It is important to test on a variety of domains to demonstrate a general bias that is useful for a variety of problems. Testing on 20+ data sets is common.
• A variety of data sources is freely available:
  - UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html
  - KDD Cup (large data sets for data mining): http://www.kdnuggets.com/datasets/kddcup.html
  - CoNLL Shared Task (natural language problems): http://www.ifarm.nl/signll/conll/
• Data for real problems is preferable to artificial problems for demonstrating a useful bias for real-world problems.
• Many available datasets have been subjected to significant feature engineering to make them learnable.