Evaluation of Learning Models

Literature:
- T. Mitchell, Machine Learning, chapter 5
- I.H. Witten and E. Frank, Data Mining, chapter 5

Fayyad’s KDD Methodology

[Figure: Fayyad’s KDD process — data → Selection → Target data → Preprocessing & cleaning → Processed data → Transformation & feature selection → Transformed data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge]

Contents

- Estimation of errors for one hypothesis
- Comparison of hypotheses
- Comparison of learning models
- Practical aspects

Two definitions of error

- The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

  $\mathrm{error}_D(h) \equiv \Pr_{x \in D}[f(x) \neq h(x)]$

Two definitions of error (2)

- The sample error of hypothesis h with respect to target function f and data sample S is the proportion of examples h misclassifies:

  $\mathrm{error}_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$

  where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$ and 0 otherwise.

Two definitions of error (3)

- How well does errorS(h) estimate errorD(h)?

Problems estimating error

1. Bias: If S is the training set, errorS(h) is optimistically biased:

   $\mathrm{bias} \equiv E[\mathrm{error}_S(h)] - \mathrm{error}_D(h)$

   For an unbiased estimate, h and S must be chosen independently.

2. Variance: Even with an unbiased S, errorS(h) may still vary from errorD(h).

Example

- Hypothesis h misclassifies 12 of the 40 examples in S:

  $\mathrm{error}_S(h) = \frac{12}{40} = 0.30$

- What is errorD(h)?

Estimators

- Experiment:
  1. choose sample S of size n according to distribution D
  2. measure errorS(h)
- errorS(h) is a random variable (i.e., the result of an experiment)
- errorS(h) is an unbiased estimator for errorD(h)
- Given an observed errorS(h), what can we conclude about errorD(h)?

Confidence intervals

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately 95% probability, errorD(h) lies in the interval

    $\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

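A minimal Python sketch of this interval, applied to the 12-out-of-40 example above; the helper name confidence_interval and its defaults are illustrative choices, not from the slides.

    import math

    # Approximate N% confidence interval for errorD(h), given errorS(h) measured
    # on n examples; z = 1.96 gives ~95% confidence (other z_N values follow).
    def confidence_interval(error_s, n, z=1.96):
        margin = z * math.sqrt(error_s * (1 - error_s) / n)
        return error_s - margin, error_s + margin

    error_s = 12 / 40                          # observed sample error = 0.30
    print(confidence_interval(error_s, n=40))  # roughly (0.158, 0.442)
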
Confidence intervals (2)

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately N% probability, errorD(h) lies in the interval

    $\mathrm{error}_S(h) \pm z_N \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

    where

    N%:  50%   68%   80%   90%   95%   98%   99%
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

errorS(h) is a random variable

- Rerun the experiment with a different randomly drawn S (of size n)
- Probability of observing r misclassified examples:

  $P(r) = \frac{n!}{r!\,(n-r)!}\,\mathrm{error}_D(h)^{r}\,(1-\mathrm{error}_D(h))^{\,n-r}$

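A small sketch of this binomial probability in Python; binomial_prob is a hypothetical helper, and the numbers reuse the 12-out-of-40 example from earlier.

    from math import comb

    # Probability of observing exactly r misclassified examples out of n,
    # when the true error rate is p = errorD(h).
    def binomial_prob(r, n, p):
        return comb(n, r) * p**r * (1 - p)**(n - r)

    # E.g. true error 0.30 and a sample of 40: chance of seeing exactly 12 errors.
    print(binomial_prob(12, 40, 0.30))   # ~0.14
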
Binomial probability distribution

- Probability P(r) of r heads in n coin flips, if p = Pr(heads):

  $P(r) = \frac{n!}{r!\,(n-r)!}\, p^{r} (1-p)^{\,n-r}$

  [Figure: binomial distribution for n = 10 and p = 0.3]

Binomial probability distribution (2)

- Expected, or mean, value of X:

  $E[X] = \sum_{i=0}^{n} i\,P(i) = np$

- Variance of X:

  $\mathrm{Var}(X) = E[(X - E[X])^2] = np(1-p)$

- Standard deviation of X:

  $\sigma_X = \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$

Normal distribution approximates binomial

- errorS(h) follows a binomial distribution, with
  - mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
  - standard deviation $\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$
- Approximate this by a Normal distribution with
  - mean $\mu_{\mathrm{error}_S(h)} = \mathrm{error}_D(h)$
  - standard deviation $\sigma_{\mathrm{error}_S(h)} = \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$

Normal probability distribution

- Density:

  $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

- The probability that X will fall into the interval (a, b) is given by $\int_a^b p(x)\,dx$
- Expected, or mean, value of X: $E[X] = \mu$
- Variance of X: $\mathrm{Var}(X) = \sigma^2$
- Standard deviation of X: $\sigma_X = \sigma$

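A minimal sketch of the density and of an interval probability in Python, using the standard library's error function instead of numerical integration; the function names are illustrative.

    import math

    # Normal density and the probability of falling in (a, b).
    def normal_pdf(x, mu=0.0, sigma=1.0):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def normal_prob(a, b, mu=0.0, sigma=1.0):
        cdf = lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))
        return cdf(b) - cdf(a)

    print(normal_prob(-1.28, 1.28))   # ~0.80, matching the z_N table on the next slide
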
Normal probability distribution (2)

- 80% of the area (probability) lies in $\mu \pm 1.28\sigma$
- N% of the area (probability) lies in $\mu \pm z_N \sigma$

  N%:  50%   68%   80%   90%   95%   98%   99%
  zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Confidence intervals, more correctly

- If
  - S contains n examples, drawn independently of h and of each other
  - n ≥ 30
- Then
  - with approximately 95% probability, errorS(h) lies in the interval

    $\mathrm{error}_D(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_D(h)\,(1-\mathrm{error}_D(h))}{n}}$

  - and errorD(h) approximately lies in the interval

    $\mathrm{error}_S(h) \pm 1.96 \sqrt{\frac{\mathrm{error}_S(h)\,(1-\mathrm{error}_S(h))}{n}}$

Central Limit Theorem

- Consider a set of independent, identically distributed random variables Y1 ... Yn, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean

  $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$

- Central Limit Theorem: As $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean $\mu$ and variance $\sigma^2 / n$.

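A quick simulation sketch of the theorem in Python; the choice of Uniform(0, 1) variables, the sample size n = 50, and the number of trials are illustrative only.

    import random
    import statistics

    # Sample means of n Uniform(0,1) variables: by the CLT they are approximately
    # Normal with mean 0.5 and variance (1/12)/n.
    random.seed(0)
    n, trials = 50, 10_000
    means = [statistics.mean(random.random() for _ in range(n)) for _ in range(trials)]
    print(statistics.mean(means))      # ~0.5
    print(statistics.variance(means))  # ~0.00167, i.e. (1/12)/50
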
Difference between hypotheses

- Test h1 on sample S1, test h2 on S2
1. Pick the parameter to estimate:

   $d \equiv \mathrm{error}_D(h_1) - \mathrm{error}_D(h_2)$

2. Choose an estimator:

   $\hat{d} \equiv \mathrm{error}_{S_1}(h_1) - \mathrm{error}_{S_2}(h_2)$

3. Determine the probability distribution that governs the estimator:

   $\sigma_{\hat{d}} \approx \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$

Difference between hypotheses (2)

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

   $\hat{d} \pm z_N \sqrt{\frac{\mathrm{error}_{S_1}(h_1)\,(1-\mathrm{error}_{S_1}(h_1))}{n_1} + \frac{\mathrm{error}_{S_2}(h_2)\,(1-\mathrm{error}_{S_2}(h_2))}{n_2}}$

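A minimal sketch of this interval in Python; the function name and the example error rates and sample sizes are made up for illustration.

    import math

    # N% confidence interval for d = errorD(h1) - errorD(h2), from sample errors
    # measured on independent test sets of sizes n1 and n2.
    def diff_interval(err1, n1, err2, n2, z=1.96):
        d_hat = err1 - err2
        sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
        return d_hat - z * sigma, d_hat + z * sigma

    print(diff_interval(0.30, 100, 0.20, 100))   # interval centred on d_hat = 0.10
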
Paired t test to compare hA, hB

1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
2. For i from 1 to k, do

   $\delta_i \leftarrow \mathrm{error}_{T_i}(h_A) - \mathrm{error}_{T_i}(h_B)$

3. Return the value $\bar{\delta}$, where

   $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$

Paired t test to compare hA, hB (2)

- N% confidence interval estimate for d:

  $\bar{\delta} \pm t_{N,\,k-1}\, s_{\bar{\delta}}$

  where

  $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} \left(\delta_i - \bar{\delta}\right)^2}$

- Note that $\delta_i$ is approximately Normally distributed

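A small Python sketch of this paired procedure; the t value is hard-coded for the assumed case k = 10 folds at 95% confidence (t ≈ 2.262), and the delta values are made-up example numbers.

    import math

    # delta_i is the per-fold error difference from the previous slide; the
    # interval uses the t value for k-1 degrees of freedom.
    def paired_interval(deltas, t_value=2.262):
        k = len(deltas)
        mean = sum(deltas) / k
        s = math.sqrt(sum((d - mean) ** 2 for d in deltas) / (k * (k - 1)))
        return mean - t_value * s, mean + t_value * s

    deltas = [0.02, 0.01, 0.03, -0.01, 0.02, 0.00, 0.04, 0.01, 0.02, 0.01]  # example values
    print(paired_interval(deltas))
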
Comparing learning algorithms LA and LB

- What we’d like to estimate:

  $E_{S \subset D}\left[\mathrm{error}_D(L_A(S)) - \mathrm{error}_D(L_B(S))\right]$

  where L(S) is the hypothesis output by learner L using training set S
- I.e., the expected difference in true error between the hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.

Comparing learning algorithms LA and LB (2)

- But, given limited data D0, what is a good estimator?
  - We could partition D0 into training set S0 and test set T0, and measure

    $\mathrm{error}_{T_0}(L_A(S_0)) - \mathrm{error}_{T_0}(L_B(S_0))$

  - Even better: repeat this many times and average the results

Comparing learning algorithms LA and LB (3): k-fold cross-validation

1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
2. For i from 1 to k: use Ti for the test set, and the remaining data for the training set Si.
3. Return the average of the errors on the test sets.

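A minimal sketch of this k-fold comparison in Python; learner_a, learner_b, and error are placeholder functions (train a classifier, measure its error on a test set), not part of the slides.

    import random

    # k-fold comparison of two learners; returns the average per-fold
    # error difference between learner_a and learner_b.
    def k_fold_difference(data, k, learner_a, learner_b, error):
        data = list(data)
        random.shuffle(data)
        folds = [data[i::k] for i in range(k)]                 # k disjoint test sets
        deltas = []
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            deltas.append(error(learner_a(train), test) - error(learner_b(train), test))
        return sum(deltas) / k                                 # average error difference
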
Practical Aspects

A note on parameter tuning

- It is important that the test data is not used in any way to create the classifier
- Some learning schemes operate in two stages:
  - Stage 1: builds the basic structure
  - Stage 2: optimizes parameter settings
- The test data can’t be used for parameter tuning!
- The proper procedure uses three sets: training data, validation data, and test data
- The validation data is used to optimize parameters

Holdout estimation, stratification

- What shall we do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
  - Example: a class might be missing in the test data
- An advanced version uses stratification
  - Ensures that each class is represented with approximately equal proportions in both subsets

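A small sketch of a stratified holdout split in Python; it assumes data is a list of (features, label) pairs, and the function name and defaults are illustrative.

    import random
    from collections import defaultdict

    # Sample the same fraction from every class so that both subsets keep
    # roughly the class proportions of the full data set.
    def stratified_holdout(data, test_fraction=1/3, seed=0):
        random.seed(seed)
        by_class = defaultdict(list)
        for features, label in data:
            by_class[label].append((features, label))
        train, test = [], []
        for examples in by_class.values():
            random.shuffle(examples)
            cut = int(len(examples) * test_fraction)
            test.extend(examples[:cut])
            train.extend(examples[cut:])
        return train, test
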
More on cross-validation

- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
  - There is also some theoretical evidence for this
- Stratification reduces the estimate’s variance
- Even better: repeated stratified cross-validation
  - E.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)

Issues in evaluation

- Statistical reliability of estimated differences in performance
- Choice of performance measure
  - Number of correct classifications
  - Accuracy of probability estimates
  - Error in numeric predictions
- Costs assigned to different types of errors
  - Many practical applications involve costs

Counting the costs

- In practice, different types of classification errors often incur different costs
- Examples:
  - Predicting when cows are in heat (“in estrus”)
    - “Not in estrus” is correct 97% of the time
  - Loan decisions
  - Oil-slick detection
  - Fault diagnosis
  - Promotional mailing

Taking costs into account

- The confusion matrix:

                       predicted class
                       yes               no
  actual      yes      true positive     false negative
  class       no       false positive    true negative

- There are many other types of costs!
  - E.g.: costs of collecting training data

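A minimal Python sketch that tallies the four cells of such a matrix from "yes"/"no" labels; confusion_counts is a hypothetical helper name.

    # Count true/false positives and negatives from actual and predicted labels.
    def confusion_counts(actual, predicted):
        cells = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
        for a, p in zip(actual, predicted):
            if a == "yes":
                cells["TP" if p == "yes" else "FN"] += 1
            else:
                cells["FP" if p == "yes" else "TN"] += 1
        return cells

    print(confusion_counts(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"]))
    # {'TP': 1, 'FN': 1, 'FP': 1, 'TN': 1}
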
Lift charts

- In practice, costs are rarely known
- Decisions are usually made by comparing possible scenarios
- Example: promotional mailout
  - Situation 1: classifier predicts that 0.1% of all households will respond
  - Situation 2: classifier predicts that 0.4% of the 10000 most promising households will respond
- A lift chart allows for a visual comparison

Generating a lift chart

- Instances are sorted according to their predicted probability of being a true positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  ...    ...                     ...

- In the lift chart, the x axis is the sample size and the y axis is the number of true positives

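A small sketch of how the chart's points could be computed in Python, using the four ranked instances from the table above; lift_points is an illustrative helper name.

    # ranked is a list of (predicted_probability, actual_class) pairs;
    # x is the number of instances taken from the top of the ranking,
    # y the cumulative number of true positives.
    def lift_points(ranked):
        ranked = sorted(ranked, key=lambda pair: pair[0], reverse=True)
        points, positives = [], 0
        for size, (_, actual) in enumerate(ranked, start=1):
            positives += (actual == "Yes")
            points.append((size, positives))
        return points

    print(lift_points([(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]))
    # [(1, 1), (2, 2), (3, 2), (4, 3)]
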
A hypothetical lift chart
Summary of measures
Model selection criteria

- Model selection criteria attempt to find a good compromise between:
  A. The complexity of a model
  B. Its prediction accuracy on the training data
- Reasoning: a good model is a simple model that achieves high accuracy on the given data
- Also known as Occam’s Razor: the best theory is the smallest one that describes all the facts

Warning

- Suppose you are gathering hypotheses that each have a probability of 95% of having an error level below 10%
- What if you have found 100 hypotheses satisfying this condition?
- Then the probability that all of them have an error below 10% is equal to (0.95)^100 ≈ 0.006, i.e. about 0.6%. So, the probability of having at least one hypothesis with an error above 10% is about 99.4%!

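The arithmetic behind this warning, as a one-line Python check:

    p_all_good = 0.95 ** 100
    print(p_all_good)       # ~0.006: chance that all 100 hypotheses really have error below 10%
    print(1 - p_all_good)   # ~0.994: chance that at least one has error above 10%
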