Evaluation of Classifiers

Evaluating classifiers
How well can a classifier be expected to perform on novel data?
Choice of performance measure.
How close is the estimated performance to the true performance?
Comparing classifiers.

Measuring classifier performance
The natural performance measure for classification problems is error rate or accuracy.
Higher accuracy does not necessarily imply better performance on the target task.
Implicit assumption: the class distribution among examples is relatively balanced.
Accuracy is biased in favor of the majority class, so it should be used with caution!

Measuring classifier performance
Consider two-category classification.
One method for handling a c-class problem is to consider c two-class problems: ω_i / not ω_i.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for Researchers.

Measuring classifier performance
Confusion matrix (also called a contingency table):
TP: number of true positives
FP: number of false positives
TN: number of true negatives
FN: number of false negatives
Accuracy = (TP + TN) / n
(With c classes the confusion matrix becomes a c × c matrix containing the c correct classifications on the major diagonal and the c² − c possible errors in the off-diagonal entries.)

Measuring classifier performance
True positive rate (TPR), also called sensitivity, hit rate, and recall:
sensitivity = TP / (TP + FN)
A statistical measure of how well a binary classification test correctly identifies a condition: the probability of correctly labeling members of the target class.

Measuring classifier performance
False positive rate (FPR), also called false alarm rate:
FPR = FP / (TN + FP)
Specificity is a statistical measure of how well a binary classification test correctly identifies the negative cases:
specificity = TN / (TN + FP) = 1 − FPR

Measuring classifier performance
Precision (also called positive predictive value):
precision = TP / (TP + FP)
The probability that a positive prediction is correct.
The F-measure (from information retrieval) can be used as a single measure of performance; it is the harmonic mean of precision and recall:
F-measure = (2 × precision × recall) / (precision + recall)

Measuring classifier performance
TP, FP, TN, and FN provide the relevant information; no single measure tells the whole story.
A classifier with 90% accuracy can be useless if 90 percent of the population does not have cancer and the 10% that do are misclassified by the classifier.
Use of multiple measures is recommended.
Beware of terminological confusion in the literature! E.g., "specificity" sometimes refers to precision.
When you write: provide the formula in terms of TP, FP, TN, FN. When you read: check the formula in terms of TP, FP, TN, FN.

ROC Curve
Receiver Operating Characteristic (ROC) graphs have long been used in signal detection theory to depict the tradeoff between hit rates and false alarm rates over a noisy channel.
Recent years have seen an increase in the use of ROC graphs in the machine learning community.
A useful technique for organizing classifiers and visualizing their performance; especially useful for domains with skewed class distributions and unequal classification error costs.

ROC Space
An ROC curve is a plot of TPR against FPR, which depicts the relative trade-off between benefits (true positives) and costs (false positives).

ROC Curve
A discrete classifier produces an (FPR, TPR) pair corresponding to a single point in ROC space.
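To make these measures concrete, here is a minimal sketch (not from the slides; the function and variable names are illustrative) that computes the confusion-matrix counts and derived measures for a discrete binary classifier, whose (FPR, TPR) pair is the single ROC point just described. Labels are assumed to be encoded as 0/1.

```python
# Sketch: confusion-matrix counts and derived measures for a discrete
# binary classifier. Labels are assumed to be encoded as 0/1.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + fp + tn + fn
    tpr = tp / (tp + fn)        # sensitivity / recall
    fpr = fp / (tn + fp)        # false alarm rate = 1 - specificity
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / n,
        "TPR": tpr,
        "FPR": fpr,
        "precision": precision,
        "F-measure": 2 * precision * tpr / (precision + tpr),
    }

# The pair (FPR, TPR) is the classifier's single point in ROC space.
print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```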
Some classifiers, such as a naive Bayes classifier or a neural network, naturally yield an instance probability or score: a numeric value that represents the degree to which an instance is a member of a class.
Such a ranking or scoring classifier can be used with a threshold to produce a discrete classifier.
Plotting the ROC point for each possible threshold value results in a curve.

ROC Curve
[Figure: two ROC plots showing classifiers A and B; axes are false positive rate (x) vs. true positive rate (y).]

Measuring performance: ROC Curve
ROC curves offer a more complete picture of the performance of a classifier.
We often do not know costs, or class distributions may change.
To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance.
A common method is to calculate the area under the ROC curve, abbreviated AUC.
AUC(h) > AUC(g): classifier h has better average performance.

Evaluating A Classifier
How well will the classifier we learned perform on novel data?
We can estimate the performance (e.g., accuracy, sensitivity) of the classifier using a test data set.
Performance on the training data is not a good indicator of performance on future data.
Test set: independent instances that have not been used in any way to create the classifier.
Assumption: both training data and test data are representative samples of the underlying problem.

Evaluating A Classifier
How close is the estimated performance to the true performance?
Assume the estimated error rate is 25%. How close is this to the true error rate? It depends on the amount of test data.
There is extensive literature on how to estimate the performance of a hypothesis from limited samples and how to assign confidence to estimates (take Stat 430x).
See "Evaluating Hypotheses," Chapter 5 of Tom Mitchell, Machine Learning.

Estimating Hypothesis Accuracy
General setting:
There is some space of possible instances X, with an unknown distribution D that defines the probability of encountering each instance.
Learning task: learn the target function f by considering a space H of possible hypotheses.
Each training example is drawn independently according to D.
Choose a test sample S of size n according to distribution D.
What is the best estimate of the accuracy of a hypothesis h? What is the error in this estimate?

Estimating Hypothesis Accuracy
The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:
error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
error_D(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]
How well does error_S(h) estimate error_D(h)?

Example
Hypothesis h misclassifies 12 of the 40 examples in S:
error_S(h) = 12/40 = .30
What is error_D(h)?

error_S(h) is a Random Variable
error_S(h) is a random variable (i.e., the outcome of an experiment).
Compare: toss a coin n times; it turns up heads r times, with probability p = P(heads) of each toss turning up heads (a Bernoulli trial).
Randomly draw n samples; h misclassifies r samples, with probability of misclassification p = error_D(h) (a Bernoulli trial).
error_S(h) = r/n
Probability of observing r misclassified examples:
P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^{n−r}
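As a quick numeric check of this formula, here is a sketch (not from the slides) that evaluates P(r) for the running example's numbers; the assumption p = 0.30, n = 40 echoes the example above.

```python
# Sketch: probability of observing r misclassifications out of n test
# examples when the true error rate is p, using the binomial formula above.
from math import comb

def p_of_r(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 40, 0.30
print(p_of_r(12, n, p))                            # most likely count: n*p = 12
print(sum(p_of_r(r, n, p) for r in range(n + 1)))  # sanity check: sums to 1.0
```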
Binomial Probability Distribution
P(X = r) = (n! / (r!(n − r)!)) p^r (1 − p)^{n−r}
Expected, or mean, value of X:
E[X] ≡ Σ_{i=0}^{n} i P(i) = np
Variance of X:
Var(X) ≡ E[(X − E[X])²] = np(1 − p)
Standard deviation of X:
σ_X ≡ √(E[(X − E[X])²]) = √(np(1 − p))

Bias, Variance
error_S(h) follows a binomial distribution, with
mean µ_{error_S(h)} = error_D(h)
standard deviation σ_{error_S(h)} = √( error_D(h)(1 − error_D(h)) / n ) ≈ √( error_S(h)(1 − error_S(h)) / n )
error_S(h) is an unbiased estimator for error_D(h):
bias ≡ E[error_S(h)] − error_D(h) = 0

Confidence Interval
A confidence interval is one common way to describe the uncertainty associated with an estimate.
An N% confidence interval for some parameter p is an interval that is expected with probability N% to contain p.
Exact computation can be quite tedious.
The binomial distribution can be closely approximated by a Normal distribution with the same mean and variance when np(1 − p) ≥ 5.

Normal Probability Distribution
[Figure: Normal density with mean 0 and standard deviation 1.]
p(x) = (1/√(2πσ²)) e^{−(1/2)((x−µ)/σ)²}
Expected, or mean, value of X: E[X] = µ
Variance of X: Var(X) = σ²
Standard deviation of X: σ_X = σ

Normal Probability Distribution
[Figure: standard Normal density with the central mass shaded.]
80% of the area (probability) lies in µ ± 1.28σ.
N% of the area (probability) lies in µ ± z_N σ:
N%:   50%   68%   80%   90%   95%   98%   99%
z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Confidence Intervals
If S contains n examples, drawn independently of h and of each other, and n ≥ 30 (with p not too close to 0 or 1) or np(1 − p) ≥ 5, then:
With approximately N% probability, error_S(h) lies in the interval
error_D(h) ± z_N √( error_D(h)(1 − error_D(h)) / n )
Equivalently, error_D(h) lies in the interval
error_S(h) ± z_N √( error_D(h)(1 − error_D(h)) / n )

Confidence Intervals
Approximately:
error_S(h) ± z_N √( error_S(h)(1 − error_S(h)) / n )
Or solve for p = error_D(h) exactly (with p̂ = error_S(h)):
p ∈ ( p̂ + z²/(2n) ± z √( p̂(1 − p̂)/n + z²/(4n²) ) ) / ( 1 + z²/n )

Confidence Intervals
Examples:
p̂ = 0.75, n = 1000, c = 80% (z = 1.28): with 80% confidence the true value p lies in [.732, .767].
p̂ = 0.75, n = 100, c = 80% (z = 1.28): with 80% confidence p lies in [.691, .801].

Central Limit Theorem
The Central Limit Theorem simplifies confidence interval calculations.
Consider a set of independent, identically distributed random variables Y_1 … Y_n, all governed by an arbitrary distribution with mean µ and finite variance σ².
Define the sample mean
Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i
Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.

Central Limit Theorem
The Central Limit Theorem implies that whenever we define an estimator that is the mean of some sample (e.g., error_S(h)), the distribution governing this estimator can be approximated by a Normal distribution for sufficiently large n.
A common rule of thumb is that we can use the Normal approximation when n ≥ 30.

Estimating Classifier Performance
Holdout method: use part of the data for training and the rest for testing.
Usually one third for testing, the rest for training.
Ideally both the training set and the test set should be large.
We may be unlucky: the training data or the test data may not be representative.
Use stratification: it ensures that each class is represented in roughly the same proportion as in the entire data set.
What to do if the amount of data is limited? Solution: run multiple experiments with disjoint training and test data sets.
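A minimal sketch of a stratified holdout split, using scikit-learn; the dataset and classifier here are placeholders chosen for illustration, and any estimator with fit/predict would do.

```python
# Sketch: stratified holdout evaluation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# One third for testing; stratify=y keeps the class proportions of the
# full dataset in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```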
Cross-validation
K-fold cross-validation: the training set is randomly divided into K disjoint sets of equal size, where each part has roughly the same class distribution.
The classifier is trained K times, each time with a different set held out as a test set.
The estimated error p̂_CV is the mean of these K errors.
Ron Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," IJCAI 1995.

Cross-validation
The standard deviation can be estimated as
σ_CV = √( p̂_CV (1 − p̂_CV) / n )
where n is the total number of instances in the dataset.
A confidence interval can be computed as before, with n being the total number of instances in the dataset.

Leave-one-out approach
K-fold cross-validation with K = n, where n is the total number of samples available.
n experiments, using n − 1 samples for training and the remaining sample for testing.
Computationally expensive.
Leave-one-out cross-validation does not guarantee the same class distribution in training and test data!
Extreme case: 50% class 1, 50% class 2; predict the majority class label in the training data. True error: 50%; leave-one-out error estimate: 100%!

The bootstrap
The bootstrap uses sampling with replacement to form the training set:
Sample a dataset of n instances uniformly, n times, with replacement, to form a new dataset of n instances used as the training set.
Use the instances from the original dataset that do not occur in the new training set for testing.
Repeat the process several times; average the results.

Estimating Classifier Performance
Recommended procedure:
Use (stratified) K-fold cross-validation (K = 5 or 10) to obtain performance estimates (accuracy, etc.).
Compute the mean value of each performance estimate, its standard deviation, and confidence intervals.
Report mean values of performance estimates and their standard deviations or 95% confidence intervals around the mean.

Difference Between Hypotheses
We wish to estimate
d ≡ error_D(h1) − error_D(h2)
Suppose h1 has been tested on a sample S1 of size n1 drawn according to D, and h2 has been tested on a sample S2 of size n2 drawn according to D.
An unbiased estimator:
d̂ ≡ error_S1(h1) − error_S2(h2)

Difference Between Hypotheses
For large n1 and n2, the corresponding error estimates follow Normal distributions.
The difference of two Normal distributions is again a Normal distribution, with variance equal to the sum of the variances of the individual distributions when the two variables are independent.

Difference Between Hypotheses
σ_d̂ ≈ √( error_S1(h1)(1 − error_S1(h1))/n1 + error_S2(h2)(1 − error_S2(h2))/n2 )
N% of the probability mass falls in the interval
d̂ ± z_N √( error_S1(h1)(1 − error_S1(h1))/n1 + error_S2(h2)(1 − error_S2(h2))/n2 )
When S1 = S2, the variance is usually smaller than the above, and the confidence interval is still applicable but overly conservative.

Hypothesis testing
Is one hypothesis likely to be better than another?
Suppose n1 = n2 = 100, error_S1(h1) = .30, error_S2(h2) = .20, so d̂ = .1 and σ_d̂ = .061.
Is the error difference statistically significant?

Hypothesis testing
The null hypothesis: the two hypotheses have the same error rate, d = 0.
Significance level of a test (α): the test's probability of incorrectly rejecting the null hypothesis.
Pr(|d̂| > z_N σ_d̂) = α for a two-sided test with confidence N = 1 − α.
Test statistic:
z = |d̂| / σ_d̂
Reject the null hypothesis if the test statistic z > z_N.
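A minimal sketch of this two-sample z test, using the numbers from the example above (the function name is illustrative); the conclusion is spelled out on the next slide.

```python
# Sketch: two-sided z test for the difference of two error rates,
# using the running example's numbers.
from math import sqrt

def z_statistic(err1, n1, err2, n2):
    d_hat = err1 - err2
    sigma = sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return abs(d_hat) / sigma

z = z_statistic(0.30, 100, 0.20, 100)  # d_hat = .1, sigma ~= .061
print(z)                               # ~1.64: below z_95 = 1.96,
                                       # so do not reject at alpha = 0.05
```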
Hypothesis testing
E.g., n1 = n2 = 100, error_S1(h1) = .30, error_S2(h2) = .20, d̂ = .1, σ_d̂ = .061; the test statistic is z = 1.64.
For significance level α = 0.05, N = 95%, z_N = 1.96, so there is not enough evidence to reject the null hypothesis.

Comparing learning algorithms
Which learning algorithm is better at learning target function f?
What we would like to estimate:
E_{S⊂D}[ error_D(L_A(S)) − error_D(L_B(S)) ]
where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between hypotheses output by learners L_A and L_B, when trained using randomly selected training sets S drawn according to distribution D.

Comparing learning algorithms
But, given limited data D0, what is a good estimator?
We could partition D0 into training set S0 and test set T0, and measure
error_T0(L_A(S0)) − error_T0(L_B(S0))
Even better: repeat this many times and average the results.

Cross-validated Paired t Test
1. Partition the data D0 into k disjoint test sets T1, T2, …, Tk of equal size, where this size is at least 30.
2. For i from 1 to k:
   Use Ti as the test set and the remaining data as the training set Si:
   Si ← {D0 − Ti}
   hA ← L_A(Si), hB ← L_B(Si)
   δi ← error_Ti(hA) − error_Ti(hB)
3. Return the value
   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi
(A runnable sketch of this procedure appears at the end of this section.)

Paired t Test
δ̄ is our estimate of the error difference.
For large test sets, δi has a Normal distribution with mean d and a variance σ² that we do not know.
δ̄ has a Normal distribution with mean d and variance σ²/k, if we pretend the δi are independent.
If we estimate σ² with the sample variance
s²_δ = (1/(k − 1)) Σ_{i=1}^{k} (δi − δ̄)²
then the distribution of (δ̄ − d)/s_δ̄ is no longer Normal unless k is large.

Paired t Test
(δ̄ − d)/s_δ̄ follows a Student's t-distribution with k − 1 degrees of freedom, where
s_δ̄ ≡ √( (1/(k(k − 1))) Σ_{i=1}^{k} (δi − δ̄)² )
N% confidence interval estimate: δ̄ ± t_{N,k−1} s_δ̄
The t-statistic:
t = δ̄ / s_δ̄

Performing the test
The null hypothesis: the two learning algorithms have the same error rate.
Fix a significance level α, the test's probability of incorrectly rejecting the null hypothesis.
For a two-sided test with confidence N = 1 − α, look up t_{N,k−1}.
Reject the null hypothesis if |t| > t_{N,k−1}.
E.g., with the typical k = 30 trials and 0.05 significance level, reject if |t| > 2.045.

Performance evaluation summary
Rigorous statistical evaluation is extremely important in experimental computer science in general, and machine learning in particular:
How good is a learned hypothesis?
How close is the estimated performance to the true performance?
Is one hypothesis better than another?
Is one learning algorithm better than another on a particular learning task?
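As promised above, a minimal sketch of the k-fold cross-validated paired t test; the dataset, the two learners, and k are placeholder assumptions, and any pair of scikit-learn-style estimators would do.

```python
# Sketch: k-fold cross-validated paired t test for comparing two learners,
# following the procedure on the "Cross-validated Paired t Test" slide.
from math import sqrt
import numpy as np
from scipy.stats import t as t_dist
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
k = 10
deltas = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    hA = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    hB = GaussianNB().fit(X[train_idx], y[train_idx])
    err_A = 1 - hA.score(X[test_idx], y[test_idx])
    err_B = 1 - hB.score(X[test_idx], y[test_idx])
    deltas.append(err_A - err_B)            # delta_i = error(hA) - error(hB)

delta_bar = np.mean(deltas)
s_delta_bar = sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
t_stat = delta_bar / s_delta_bar
t_crit = t_dist.ppf(1 - 0.05 / 2, df=k - 1)  # two-sided test, alpha = 0.05
print(f"t = {t_stat:.3f}, reject null: {abs(t_stat) > t_crit}")
```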