Chapter 15: Assessment of Data Models
Cios / Pedrycz / Swiniarski / Kurgan (© 2007)

Outline
• Introduction
• Models, Their Assessment and Selection - The Bias-Variance Dilemma
• Simple Split and Cross-Validation
• Bootstrap
• Occam's Razor Heuristic
• Minimum Description Length Principle
• Akaike's Information Criterion and Bayesian Information Criterion
• Sensitivity, Specificity and ROC Analyses
• Interestingness Criteria

Introduction
Assessment of a data model is done by a data miner before selecting the "best" model and presenting it to the user/owner/expert of the data. The user then makes the final decision whether to accept the model in its current form and use it for some purpose, or to ask for a new one to be generated (a frequent outcome). The user's expectation of a DM project is to find new information/knowledge hidden in the data, so that it can be used to some advantage.

Problem: A data miner often assesses the quality of the generated model using the same data that were used to generate the model itself (albeit divided into training and test data sets), while the owner/user relies not only on the data and DM results but also on deep (expert) domain knowledge. In spite of the KDP requirement that the data miner learn as much as possible about the domain and the data, his or her knowledge obviously constitutes only a subset of the knowledge of the experts (data owners).

In practice:
• A data miner generates several models of the data and must choose the one "best" model, in terms of how well it explains the training data and of its predictive power, before presenting it to the data owner. We will discuss methods for selecting such "best" models.
• The process of selecting the "best" model by a data miner is called model selection.
• The owner of the data then uses deep domain knowledge for model assessment. Data miners attempt to mimic experts by designing artificial interestingness measures.
• In what follows, we focus on heuristic, data-reuse (data re-sampling), and analytic methods for model selection and validation.

The terms model, classifier, and estimator will be used interchangeably:
• A model can be defined as a description of the causal relationships between input and output variables.
• A classifier is a model of data used for classification: given a new input, it assigns it to one of the classes it was designed/trained to recognize.
• An estimator is a method used to calculate a parameter; it is a variable defined as a function of the sample values.
• The number of independent pieces of information required for estimating the model is called the model's degrees of freedom.

• One of the simplest heuristics for model selection is to choose a parsimonious model: one that uses the fewest parameters from among several acceptably well-performing models.
• There is always a model error associated with any model. It is calculated as the difference between the true value and the model output value, and is expressed as either the absolute or the squared error between the observed/true and model output values (see the sketch below).
• Model error can be calculated only if training data, meaning known inputs corresponding to known outputs, are available.
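As an illustration, a minimal sketch of the absolute and squared fit errors mentioned above (assuming NumPy; the arrays y_true and y_pred are made-up values, not from the book):

```python
import numpy as np

# Hypothetical true outputs and corresponding model outputs
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

abs_error = np.mean(np.abs(y_true - y_pred))   # mean absolute error
sq_error  = np.mean((y_true - y_pred) ** 2)    # mean squared error

print(f"mean absolute error: {abs_error:.3f}")
print(f"mean squared error:  {sq_error:.3f}")
```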
Introduction
• When we generate a model of the data, we fit the model to the data.
• In addition to fitting the model to the data, we are also interested in using the model for prediction.
• Once we have generated several models and have selected the "best" one, we need to validate it for its goodness of fit (fit error) and its goodness of prediction (prediction error).

• In the neural network and ML literature, goodness of prediction is often referred to as the generalization error. The latter term ties it to the concepts of overfitting and underfitting the data.
• Overfitting means an unnecessary increase of model complexity, for example, increasing the number of parameters and the model's degrees of freedom beyond what is necessary.
• Underfitting is the opposite of overfitting: too simple a model will not fit the data well.

Model assessment techniques can be divided based on the nature of the methods:
• data-reuse / re-sampling (simple split, cross-validation, bootstrap)
• heuristic (parsimonious model, Occam's razor)
• analytical (Minimum Description Length, Akaike's Information Criterion, Bayesian Information Criterion)
• interestingness measures

Bias-Variance Dilemma
Bias is defined as the error that cannot be reduced by increasing the sample size:
• it is present even if an infinite sample is available
• it is a systematic error.
Some example sources of bias:
- measurement error (experimental error that cannot be removed)
- sample error (the sample may not be correctly drawn from the population, and thus may not represent the data correctly)
- error associated with a particular form of an estimator, etc.

Bias is calculated as the difference between the estimated expected value and the true value of some parameter p:
B = E(p̂) − p
Its squared value, B², is one of the two components (the other is the variance, S²) of the mean square error, MSE, which is the mean squared difference between the true value of the parameter p and its estimate p̂:
MSE(p̂) = E[(p̂ − p)²] = S²(p̂) + B²(p̂)
where the variance is estimated from the sample values pi and the sample mean p̄ as
S² = Σi (pi − p̄)² / (N − 1)   (unbiased estimate)
or
S² = Σi (pi − p̄)² / N   (biased, maximum-likelihood estimate)

Variance is an additional (to bias) error that is incurred with a finite sample, because of sensitivity to random fluctuations.
An estimator is a method used to calculate a parameter. Some examples:
- the histogram density estimator, which estimates a density from counts per bin
- the Bayesian estimator, which estimates the a posteriori probability from the a priori probability via Bayes' rule.
The simplest nonparametric estimator (meaning one that does not depend on complete knowledge of the underlying distribution) is the sample mean, which estimates the population mean; it constitutes the simplest model of the data.

Biased estimators have non-zero bias (the estimated expected value differs from the true value), while unbiased estimators have zero bias of the estimate.

There exists a trade-off between bias and variance, known as the bias-variance dilemma.
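A minimal simulation sketch of the definitions above (assuming NumPy; the Gaussian population and the choice of statistic are illustrative, not the book's example). It estimates the bias, variance, and MSE of the two sample-variance estimators, with divisors N and N−1, over repeated small samples:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                 # population variance of N(0, 2^2)
n, trials = 10, 20000          # small sample size, many repetitions

est_ml, est_unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, 2.0, size=n)
    est_ml.append(np.var(x))                 # divisor N   (biased / ML estimate)
    est_unbiased.append(np.var(x, ddof=1))   # divisor N-1 (unbiased estimate)

for name, est in [("divisor N", est_ml), ("divisor N-1", est_unbiased)]:
    est = np.asarray(est)
    bias = est.mean() - true_var              # B = E(p_hat) - p
    var  = est.var(ddof=1)                    # S^2 of the estimator
    mse  = np.mean((est - true_var) ** 2)     # MSE ~= S^2 + B^2
    print(f"{name:12s}  bias={bias:+.3f}  variance={var:.3f}  MSE={mse:.3f}")
```

The run illustrates the dilemma: the N-divisor estimator has larger bias but slightly smaller variance than the N−1 version.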
For a given MSE, if the bias is large then the accompanying variance is small, and vice versa. We are interested in finding a model/estimator that is neither too complex (it may overfit the data) nor too simple (it may underfit the data). Such a model can be found by minimizing the MSE value, with acceptable bias and variance.

[Figure: three scenarios - high bias & low variance, low bias & high variance, and medium bias & medium variance.]
[Figure: illustration of the bias-variance dilemma - bias, variance, and MSE as functions of model complexity and data size; the optimal complexity minimizes the MSE.]

A model should be chosen in such a way that it does not overfit the data, which means that it should perform acceptably well on both training and test data, as measured by the error between the true/desired value and the model output value.
High dimensionality increases variance (the curse of dimensionality); removing irrelevant features (feature selection) can improve accuracy.

[Figure: fit/training and prediction/test errors versus model complexity and data size; high bias & low variance corresponds to underfitting, low bias & high variance to overfitting. Caption: choosing an optimally complex model with acceptable fit and prediction errors.]

Simple Split
Given a large data set composed of known inputs corresponding to known outputs, we can evaluate the model by splitting the available data into two parts: the training part, used for fitting the model, and the test part, used for evaluating its goodness of prediction. Holding back 1/3 (or 1/2) of the data for testing is typical.

Simple Split and Cross-Validation
If the results are unsatisfactory, a computationally more expensive method, called cross-validation, is used to estimate the goodness of prediction of the model. Informally, cross-validation should be used in situations where the data set is relatively small but "difficult" (meaning that splitting it into just two parts does not result in good prediction).

Cross-Validation
Let n be the number of data points in the training data set and let k be an integer index much smaller than n. In k-fold cross-validation, we divide the entire data set into k equal-size subsets, use k−1 parts for training, and use the remaining part for testing and calculation of the prediction error (goodness of prediction). We repeat the procedure k times and report the average error over the k runs (see the sketch below). In the extreme situation, when the data set is very small, n-fold cross-validation, also known as the leave-one-out method, is used.
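A minimal k-fold cross-validation sketch (assuming NumPy; the noisy quadratic toy data and the polynomial models fitted with np.polyfit are illustrative assumptions, not the book's example):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 5
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.5, size=n)   # noisy quadratic

indices = rng.permutation(n)
folds = np.array_split(indices, k)              # k roughly equal parts

for degree in (1, 2, 8):                        # under-, well-, and over-parameterized models
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)   # fit on k-1 folds
        pred = np.polyval(coeffs, x[test_idx])                        # predict the held-out fold
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    print(f"degree {degree}: CV prediction error = {np.mean(fold_errors):.3f}")
```

The averaged held-out error is the cross-validated estimate of the goodness of prediction; setting k = n would give the leave-one-out variant.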
Nested Cross-Validation
[Figure: repeated double (nested) cross-validation scheme, from "Repeated double cross validation" by Filzmoser et al.]

Cross-Validation
Advantages:
• All data are used for training
• All data are used for testing (once)
• No data are wasted
Disadvantages:
• Computationally more expensive than the simple split (brute force)
• Good bias performance can come with high variance
  - Increasing k gives better bias but worse variance

Bootstrap
The bootstrap gives a nonparametric estimate of the error of a model in terms of its bias and variance. It works as follows. We draw x samples of equal size from a population consisting of n samples, with the purpose of calculating confidence intervals for the estimates. A strong assumption is made that the available data set of n samples constitutes the population itself.

First, from this "population", x samples of size n (notice that each is of the same size as the original data set), called bootstrap samples, are drawn with replacement. Next, we fit a model to each of the x bootstrap samples, assess the goodness of fit (error) for each of them, and average it over the x bootstrap samples. Denoting by xi the ith realization of the bootstrap sample (i = 1, ..., x), by t(xi) the estimate calculated from it, and by t the estimate calculated from the original data, the bootstrap estimates of the bias and variance are:
B(t) = (1/x) Σi t(xi) − t = t_ave − t
S²(t) = Σi (t(xi) − t_ave)² / (x − 1)
where
t_ave = (1/x) Σi t(xi)

In summary:
• Assume the available data set of n samples is the population itself.
• Draw x bootstrap samples, with replacement, from this population of size n (each of the same size as the original data).
  - With replacement means that points are not removed from the population when selected, so a point may appear multiple times in a single sample.
• Fit the model to each bootstrap sample and test it against the original data.
• Average the error over the x bootstrap samples.

Example:
Original data: a b c d e f g h i j
Bootstrap sample 1: f g e a j g b d d i  (train)  ->  test on the original data  ->  Err1
Bootstrap sample 2: d f i e g j a e j h  (train)  ->  test on the original data  ->  Err2
Bootstrap sample 3: b d j e b g h j d f  (train)  ->  test on the original data  ->  Err3
The bootstrap estimates of bias and variance are then calculated from the formulas above, where t is the estimate calculated from the original data.

To use the bootstrap for calculating goodness of prediction, we treat the set of bootstrap samples as the training data and the original data set as the test set. We thus fit the model to each bootstrap sample and calculate the prediction error on the original data. The problem with this approach is that the prediction error is optimistic (too good), since the bootstrap samples are drawn with replacement. In other words, we calculate the prediction/generalization error on highly overlapping data (many items in the test set are the same as in the training data) - a kind of "cheating" (note that CV is less optimistic).

Bootstrap Cross-Validation
To address the optimism of the bootstrap:
• Generate x bootstrap samples of size n
• Perform leave-one-out CV on each bootstrap sample
• Average the CV estimates over the x bootstrap samples
This reduces the problem of testing on the same data as used for training, but the same instance can still appear multiple times in one bootstrap sample; thus bootstrap CV still underestimates the true error. (Fu, Carroll and Wang 2005. Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, 21(9):1979-1986.)
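A minimal sketch of the bootstrap bias and variance estimates above (assuming NumPy; the statistic t is taken to be the sample mean, and the exponential "population" is made up, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=30)    # the n available samples, treated as the population
n, x = len(data), 1000                        # draw x bootstrap samples of size n

t = data.mean()                               # estimate t from the original data
t_boot = np.empty(x)
for i in range(x):
    sample = rng.choice(data, size=n, replace=True)   # bootstrap sample, drawn with replacement
    t_boot[i] = sample.mean()                          # estimate on the i-th bootstrap sample

t_ave = t_boot.mean()
bias = t_ave - t                                   # B(t) = t_ave - t
var  = np.sum((t_boot - t_ave) ** 2) / (x - 1)     # S^2(t)
print(f"bootstrap bias = {bias:+.4f}, bootstrap variance = {var:.4f}")
```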
.632 Bootstrap
The bootstrap is an optimistically biased estimator, due to the test/training data overlap. The .632 bootstrap corrects this bias by combining two biased estimators: the zero bootstrap estimator and the resubstitution estimator.
Zero bootstrap: use a bootstrap sample for training, and use only the instances not present in that bootstrap sample as test data (this eliminates the test/training overlap). It is a very pessimistic estimator. (Efron 1983. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316-331.)

Resubstitution is the error obtained when using the full data set for both training and testing (a very optimistic estimator - "true cheating").
The probability that any given instance is not chosen from the original data/population after n samples are drawn is
(1 − 1/n)^n ≈ e^(−1) ≈ 0.368
thus the expected proportion of distinct instances in a (training) bootstrap sample is 0.632.
We then combine the two biased estimators to correct the bias:
err = (1/x) Σi=1..x (0.632 · c0i + 0.368 · errs)
where c0i is the zero-bootstrap error estimate on bootstrap sample i and errs is the resubstitution error.

Example:
Data: a b c d e f g h i j          -> train and test on the full data  -> errs
Sample 1: f g e a j g b d d i      -> train; test on {c, h}            -> c01
Sample 2: d f i e g j a e j h      -> train; test on {b, c}            -> c02
Sample 3: b d i e b g h i d f      -> train; test on {a, c, j}         -> c03
err = 1/3 [(0.632 c01 + 0.368 errs) + (0.632 c02 + 0.368 errs) + (0.632 c03 + 0.368 errs)]
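A minimal sketch of the .632 bootstrap estimate (assuming NumPy; the nearest-centroid classifier and the synthetic two-class data are illustrative stand-ins for whatever model is being assessed):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two-class synthetic data (illustrative)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
n, x = len(y), 50

def nearest_centroid_error(X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid classifier and return its error rate on the test set."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    return np.mean(dists.argmin(axis=1) != y_te)

err_s = nearest_centroid_error(X, y, X, y)        # resubstitution error (optimistic)

c0 = []
for _ in range(x):
    idx = rng.integers(0, n, size=n)              # bootstrap sample, drawn with replacement
    out = np.setdiff1d(np.arange(n), idx)         # instances not in the bootstrap sample
    if len(out) == 0 or len(np.unique(y[idx])) < 2:
        continue                                  # skip degenerate samples
    c0.append(nearest_centroid_error(X[idx], y[idx], X[out], y[out]))  # zero-bootstrap error

err_632 = np.mean([0.632 * c + 0.368 * err_s for c in c0])
print(f"resubstitution={err_s:.3f}  zero bootstrap={np.mean(c0):.3f}  .632 estimate={err_632:.3f}")
```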
Occam's Razor Heuristic
Occam's razor states that the simplest explanation (model) of the observed phenomena, in a given domain, is the one most likely to be correct. Although we do not know how to determine "the simplest explanation", we intuitively agree with William of Ockham that given several models (specified, for example, in terms of production IF...THEN... rules), a more "compact" model (composed of fewer rules, especially if on average the rules in this model are shorter than in the other models) should be chosen.

The Occam's razor heuristic is built into most machine learning algorithms: a simpler/shorter/more compact description of the data is preferred over more complex ones. Thus, a problem with this heuristic is that we plan to use it for model assessment while we have already used it for generating the model itself. Another issue with Occam's razor is that some researchers argue it may be entirely incorrect: if we model some data (a process that generated the data) known to be very complex, why should a simple model of the data be preferred at all?

Minimum Description Length Principle
The MDL principle was designed to be general and, importantly, independent of any underlying probability distribution. Its "father", Rissanen, wrote: "We never want to make a false assumption that the observed data actually were generated by a distribution of some kind, say, Gaussian, and then go on and analyze the consequences and make further deductions. Our deductions can be entertaining but quite irrelevant...". This statement is in stark contrast to statistical methods, because the MDL provides a clear interpretation regardless of whether some underlying "true" model of the data exists or not.

The basic idea of the MDL is connected with a specific understanding of learning (model building). Namely, MDL is about finding regularities in the data, where regularity is understood as the ability to compress the data. Learning can thus also be understood as the ability to compress the data, i.e., we want to come up with a compact description of the data (the model), for example one with fewer rules.

In the parlance of machine learning, we want to select the most general model that does not overfit the data. In the parlance of MDL, having a set of models (M) about the data (D), we want to select the model that most compresses the data. Both formulations specify the same goal, stated in different terminology.

If a system can be defined in terms of inputs and corresponding outputs, then in the worst case (the longest description) it can be described by supplying the entire data set, which constitutes the longest (least compressed) model of the data. However, if regularities can be discovered in the data, say, expressed in the form of production rules, then a much shorter description is possible, and it can be measured by the MDL principle. The MDL principle says that the complexity of a model/hypothesis is measured by the number of bits needed to encode the model itself, plus the number of bits needed to encode the data using the model.

Formally, from a set of models we choose as the "best" the model that minimizes the sum
L(M) + L(D|M)
where L(M) is the length (in bits) of the description of the model, and L(D|M) is the length of the description of the data encoded using model M. The basic idea behind this definition can also be explained using the notions of underfitting and overfitting. It is easy to find a complex model, one with a large L(M) value, that overfits the data (i.e., has a small L(D|M) value). It is equally easy to find a simple model, with a small L(M) value, that underfits the data but has a large L(D|M) value. Notice the similarity of the MDL principle to the bias-variance dilemma.

What we are looking for is a model that constitutes the best compromise between the two cases (small L(M) with large L(D|M) versus large L(M) with small L(D|M)). Suppose we have generated two models that explain/fit the data equally well; the MDL principle then tells us to choose the simpler one (for instance the one with the smaller number of parameters; recall that a model using the smallest number of parameters is called parsimonious), which allows for the most compact description of the data. In that sense, the MDL can be seen as a formalization of the Occam's razor heuristic. A small worked sketch follows.
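As a rough illustration of the L(M) + L(D|M) trade-off (not the book's formulation), the sketch below compares two models of a binary data set by a two-part code length, under the crude assumptions that each real-valued parameter costs a fixed 32 bits and that L(D|M) is the negative base-2 log-likelihood of the data under the model:

```python
import numpy as np

rng = np.random.default_rng(4)
# Binary outcomes from two groups with different success rates (illustrative data)
y_a = rng.random(200) < 0.2
y_b = rng.random(200) < 0.7
data = [y_a, y_b]

BITS_PER_PARAM = 32   # crude assumption for L(M): cost of encoding one parameter

def data_bits(groups, probs):
    """L(D|M): negative base-2 log-likelihood of all outcomes under the model."""
    bits = 0.0
    for y, p in zip(groups, probs):
        k, n = y.sum(), len(y)
        bits += -(k * np.log2(p) + (n - k) * np.log2(1 - p))
    return bits

# Model 1: one shared Bernoulli parameter (simple, may underfit)
p_all = np.concatenate(data).mean()
L1 = 1 * BITS_PER_PARAM + data_bits(data, [p_all, p_all])

# Model 2: one Bernoulli parameter per group (more complex, fits better)
p_each = [y.mean() for y in data]
L2 = 2 * BITS_PER_PARAM + data_bits(data, p_each)

print(f"Model 1: L(M)+L(D|M) = {L1:.1f} bits")
print(f"Model 2: L(M)+L(D|M) = {L2:.1f} bits  -> prefer the smaller total")
```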
Akaike's Information Criterion and Bayesian Information Criterion
Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are statistical measures used for choosing between models that use different numbers of parameters, d. They are closely related to each other, as well as to the MDL.
The idea behind both is as follows. We want to estimate the prediction error (E) and use it for model selection. We can easily calculate the training error (TrE); however, the test instances most often do not coincide with the training instances, so the TrE is too optimistic.

To remedy this, we estimate the error of optimism (Eop) and calculate the in-sample (given the sample) error as
E = TrE + Eop
AIC is defined as:
AIC = −2 logL + 2d
where logL is the maximized log-likelihood of the model given the data (it reflects the overall fit of the model):
logL = Σi=1..N log P_θ̂(yi)
Here Pθ(Y) is a family of densities containing the "true" density, θ̂ is the maximum-likelihood estimate of θ, and d is the number of parameters in the model.

Statement of the Maximum Likelihood Method (MLM)
Assume we have made N measurements of x, {x1, x2, ..., xN}, and that we know the probability distribution function that describes x: f(x, θ). We want to determine the parameter θ. The MLM says that we pick the θ that maximizes the probability of obtaining the measurements (the xi's) that we actually got.
How do we use the MLM? The probability of measuring x1 is f(x1, θ)dx, the probability of measuring x2 is f(x2, θ)dx, ..., and the probability of measuring xN is f(xN, θ)dx. If the measurements are independent, the probability of getting our measurements is
L = f(x1, θ) · f(x2, θ) · ... · f(xN, θ) · (dx)^N
L is called the likelihood function. It is convenient to write it as
L(θ) = Πi=1..N f(xi, θ)
We pick the θ that maximizes L, i.e., we solve ∂L/∂θ = 0. (From Richard Kass)

Information Criteria
AIC and BIC share the common form
−2 logL + k·p   (a training-error term plus an error-of-optimism penalty)
where L is the likelihood function, p is the number of parameters in the model, and k is 2 for AIC and log(n) for BIC.

AIC
AIC = −2 logL + 2p
Goal: evaluate a set of models for a given data set and determine which model best fits the data with the least amount of information loss, i.e., the model with the smallest AIC value.
Akaike weights (wi) provide another measure of the strength of each model; they are computed from the delta-AIC (Δi) values of each model relative to the set of R candidate models. (Mazerolle 2004)
[Table 1 (Mazerolle 2004): Akaike's second-order information criterion (AICc) of the regression models of mass lost by frogs after 2 h across three substrate types, in or out of the shade; 121 observations were retained for analysis.]

Bayesian Information Criterion
BIC is defined as:
BIC = −2 logL + d logN
The AIC and BIC definitions are equivalent in the sense that the 2 in the second term of the AIC is replaced by logN in the BIC. For a Gaussian model with noise variance S², the BIC can be written as
BIC = (N/S²) (TrE + (d/N) S² logN)
We choose as optimal the model corresponding to the minimum value of the BIC. The BIC favors simpler models, since it penalizes more complex models heavily.
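A minimal sketch computing AIC and BIC for competing polynomial regression models under a Gaussian error assumption (NumPy only; the synthetic data, the maximum-likelihood variance estimate, and counting d as the polynomial coefficients plus the noise variance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 80
x = rng.uniform(-3, 3, size=N)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.7, size=N)   # noisy quadratic

for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, deg=degree)
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.mean(resid ** 2)                    # ML estimate of the noise variance
    logL = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1.0)   # maximized Gaussian log-likelihood
    d = (degree + 1) + 1                            # polynomial coefficients + noise variance
    aic = -2 * logL + 2 * d
    bic = -2 * logL + d * np.log(N)
    print(f"degree {degree}:  AIC = {aic:7.2f}   BIC = {bic:7.2f}")
```

Both criteria typically pick the quadratic model here; the BIC penalty (d·logN) grows with N, so it punishes the degree-6 model more strongly than the AIC does.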
Sensitivity and Specificity
Let us assume that some underlying "truth" (also known as a gold standard) exists; this is in contrast to the MDL principle, which does not require such an assumption. This means that training data must be available, i.e., known inputs corresponding to known outputs. It further implies that we know the total number of positive examples, P, and the total number of negative examples, N. We are then able to form a confusion matrix (also called a misclassification matrix or contingency table):

Truth (gold standard) \ Test result | Positive | Negative | Row total
Positive | True Positive (TP) (no error) | False Negative (FN) (rejection error, Type II error) | P (total of true positives)
Negative | False Positive (FP) (acceptance error, Type I error) | True Negative (TN) (no error) | N (total of true negatives)
Column total | total of test results recognized as positive | total of test results recognized as negative | total population

TP is the case in which the test result and the gold standard/truth are both positive; FP is the case in which the test result is positive but the gold standard is negative; TN is the case where both are negative; and FN is the case where the test result is negative but the gold standard is positive.

Sensitivity = TP / P = TP / (TP + FN) = Recall = TP rate

Sensitivity measures how often we find what we are looking for (say, a certain disease); in other words, sensitivity is 1 if all instances of the True class are classified to the True class. What we are looking for is P = (TP + FN), i.e., all the cases in which the gold standard (disease) is actually positive, but we found only TP of them - hence the ratio. Sensitivity is also known in the literature under near-synonyms such as TP rate and hit rate. In the information extraction / text mining literature, sensitivity is known as recall.

Specificity = TN / N = TN / (FP + TN)

Specificity measures how well we avoid flagging what we are not looking for; in other words, it measures the ability of a test to be negative when the disease is not present. Specificity is 1 if only instances of the True class are classified to the True class. What we are evaluating against is N = (FP + TN), i.e., all the cases in which the gold standard is actually negative, but we correctly identified only TN of them.

People often report only the accuracy of their results instead of calculating both sensitivity and specificity - this is not acceptable in medical data analyses. In a situation where, say, sensitivity is high (over 0.9) while specificity is low (about 0.3), the accuracy may still look acceptable (over 0.6, or 60%), as can be seen from its definition:
Accuracy = (TP + TN) / (P + N)
Sometimes a different definition of accuracy is used, one that ignores the number of TN:
Accuracy2 = TP / (P + N)

In text mining a measure called precision is used:
Precision = TP / (TP + FP)
where TP + FN = P are called the relevant documents and TP + FP are called the retrieved documents.
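A minimal sketch of these measures computed from the four confusion-matrix counts (plain Python; the counts are made up for illustration and chosen to match the high-sensitivity, low-specificity scenario mentioned above):

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, accuracy, precision and F-measure from confusion-matrix counts."""
    p, n = tp + fn, fp + tn                    # totals of true positives and true negatives
    sensitivity = tp / p                        # recall / TP rate
    specificity = tn / n
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(sensitivity=sensitivity, specificity=specificity,
                accuracy=accuracy, precision=precision, f1=f1)

# High sensitivity (0.9) with low specificity (0.3) still yields accuracy 0.6
print(binary_metrics(tp=90, fn=10, fp=70, tn=30))
```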
Sensitivity and Specificity: A Multi-Class Example
Confusion matrix (rows: true/gold-standard diagnosis; columns: test result):

True class | Classified as A | Classified as B | Classified as C | Classified as D | Sum
A          | 4               | 1               | 1               | 1               | 7
B          | 0               | 7               | 0               | 0               | 7
C          | 0               | 0               | 2               | 1               | 3
D          | 0               | 0               | 0               | 10              | 10
Sum        | 4               | 8               | 3               | 12              | 27

Treating each class in turn as the positive class, with Specificity = TN / N = TN / (FP + TN) and Accuracy = (TP + TN) / (P + N), gives (a computational version of this example is sketched at the end of this subsection):

            | Class A         | Class B         | Class C         | Class D
Sensitivity | .57 (4/7)       | 1.0 (7/7)       | .67 (2/3)       | 1.0 (10/10)
Specificity | 1.0 (20/20)     | .95 (19/20)     | .96 (23/24)     | .88 (15/17)
Accuracy    | .89 (4+20)/27   | .96 (7+19)/27   | .93 (2+23)/27   | .93 (10+15)/27

A second example uses a rule-based classifier for two classes, Abnormal (A) and Normal (N), on 24 examples: 15 Abnormal and 9 Normal; 6 examples are left unclassified (U), since 24 − (13 + 5) = 2 + 1 + 1 + 2.

Class A:
                       | Test result positive (A) | Test result negative (N or U) | Sum
positive rules (for A) | 13                       | 0 + 2 = 2                     | 15
negative rules (for N) | 3                        | 5 + 1 = 6                     | 9
Sensitivity = TP / (TP + FN) = 13 / (13 + 2) = .867
Specificity = TN / (FP + TN) = 6 / (3 + (5 + 1)) = .667
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (13 + 6) / (13 + 2 + 3 + 6) = .792

Class N:
                       | Test result positive (N) | Test result negative (A or U) | Sum
negative rules (for N) | 5                        | 3 + 1 = 4                     | 9
positive rules (for A) | 0                        | 13 + 2 = 15                   | 15
Sensitivity = TP / (TP + FN) = 5 / (5 + 4) = .556
Specificity = TN / (FP + TN) = 15 / (15 + 0) = 1.0
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (5 + 15) / (5 + 4 + 15 + 0) = .833

Results for class | Sensitivity | Specificity | Accuracy
A                 | .867        | .667        | .792
N                 | .556        | 1.0         | .833
MEAN              | .712        | .834        | .813

Other measures calculated from the misclassification matrix:
False Discovery Rate (FDR) = FP / (TP + FP)

The F-measure is calculated as the harmonic mean of recall/sensitivity and precision:
F-measure = (2 × Precision × Recall) / (Precision + Recall)

Fβ Measure
The Fβ measure is an F-measure that places β times more weight on recall than on precision:
Fβ = (1 + β²) × (Precision × Recall) / ((β² × Precision) + Recall)
The F-measure is also known as the F1 measure (β = 1), while the F0.5 measure places less emphasis on recall and the F2 measure places more emphasis on recall.
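A minimal sketch reproducing the per-class sensitivity, specificity, and accuracy of the multi-class example above by treating each class in turn as the positive class (NumPy only):

```python
import numpy as np

# Rows: true classes A-D; columns: classified as A-D (the example confusion matrix)
cm = np.array([[4, 1, 1, 1],
               [0, 7, 0, 0],
               [0, 0, 2, 1],
               [0, 0, 0, 10]])
classes = "ABCD"
total = cm.sum()

for i, c in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp            # true class c, classified as something else
    fp = cm[:, i].sum() - tp            # other classes classified as c
    tn = total - tp - fn - fp
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    acc  = (tp + tn) / total
    print(f"Class {c}: sensitivity={sens:.2f}  specificity={spec:.2f}  accuracy={acc:.2f}")
```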
ROC Analyses
Receiver Operating Characteristic (ROC) analysis is performed by drawing curves in a two-dimensional space with axes defined by the TP rate and the FP rate or, equivalently, in terms of sensitivity and specificity. Let us define the FP rate as FP/N; this is analogous (but using "negative" logic) to the definition of the TP rate, which is TP/P and is better known as sensitivity. We can then rewrite specificity as
Specificity = 1 − FP rate, i.e., FP rate = 1 − Specificity
[Figure: a ROC graph with axes 1 − Specificity (FP rate) and Sensitivity (TP rate), showing the diagonal, the point W = (0.5, 0.5), and a classifier point P.]

ROC plots allow for the visual comparison of several models (classifiers). For each model, we calculate its sensitivity and specificity and draw it as a point on the ROC graph. A confusion matrix represents one evaluation of a model, which, when drawn on the ROC graph, corresponds to a single point with coordinates (1 − specificity, sensitivity), denoted as point P. The ideal model would be represented by the location (0, 1) on the graph, corresponding to 100% specificity and 100% sensitivity.

Point (0, 0) represents 100% specificity and 0% sensitivity, and point (1, 1) represents 0% specificity and 100% sensitivity; neither corresponds to an acceptable model. All points lying on the line connecting (0, 0) and (1, 1) represent random guessing of the classes (equal values of 1 − specificity and sensitivity, or in other words, equal TP and FP rates). Such models would recognize equal proportions of TP and FP, with point W = (0.5, 0.5) representing 50% specificity and 50% sensitivity. These observations suggest the strategy of always "operating" in the region above the diagonal line (y = x), since in this region the TP rate is higher than the FP rate.

How can we obtain a curve on the ROC plot corresponding to a classifier, say A? Let us assume that for a (large) class called Abnormal (say, people with some disease) we have drawn the distribution of the examples over some symptom (feature), and that we have done the same for a (large) class Normal (say, people without the disease) over the same symptom. Note that such distributions very often overlap. Each choice of a decision threshold on the symptom then yields one (FP rate, TP rate) point, and sweeping the threshold traces out the ROC curve.
[Figure: distributions of Normal (healthy) and Abnormal (sick) patients over the symptom value, with their means marked.]
[Figure: division into Normal and Abnormal patients using a threshold of 4.5.]
[Figure: ROC curves for two classifiers, A and B.]

How do we decide which of two classifiers is the better model of the data? By visual analysis: the curve lying more towards the upper left indicates the better classifier. However, the curves often overlap, and the decision may not be easy to make. A popular method for resolving this is to calculate the Area Under the Curve (AUC). The area under the diagonal is 0.5; thus we are interested in choosing the classifier with the maximum area under its ROC curve: the larger the area, the better the model performs (a small sketch follows). A measure similar to the AUC for assessing the goodness of a model, known as the Gini coefficient, is defined as twice the area between the diagonal and the ROC curve; the two measures are equivalent, since Gini + 1 = 2 AUC.
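A minimal sketch that traces a ROC curve by sweeping a decision threshold over a single symptom/score and computes the AUC by trapezoidal integration (NumPy only; the overlapping Gaussian class distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
# Symptom values for Normal (class 0) and Abnormal (class 1) patients (illustrative, overlapping)
scores = np.concatenate([rng.normal(3.0, 1.0, 500), rng.normal(5.5, 1.0, 500)])
labels = np.concatenate([np.zeros(500), np.ones(500)])
P, N = labels.sum(), (1 - labels).sum()

# Sweep the threshold from strictest to most lenient
thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1]))
tpr, fpr = [], []
for thr in thresholds:
    pred = scores >= thr                               # predict Abnormal above the threshold
    tpr.append(((pred == 1) & (labels == 1)).sum() / P)   # sensitivity (TP rate)
    fpr.append(((pred == 1) & (labels == 0)).sum() / N)   # 1 - specificity (FP rate)

auc = np.trapz(tpr, fpr)                               # area under the ROC curve
print(f"AUC = {auc:.3f}, Gini = {2 * auc - 1:.3f}")
```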
Interestingness Measures
Similarly to research carried out in the area of expert systems, where computer scientists mimicked the human decision-making process by interviewing experts and codifying the rules they use (e.g., how to diagnose a certain disease) in terms of production rules, data scientists have undertaken a similar effort to formalize the measures (supposedly) used by domain experts to evaluate models of data. Such criteria can be roughly divided into those assessing the interestingness of association rules, which are generated from unsupervised data, and those assessing the interestingness of rules generated by inductive machine learning (e.g., decision trees / rule learners) from supervised data.

Rule refinement is used to reduce the number of rules and to focus only on the "more important" (according to some criterion) rules. First, we identify potentially interesting rules: those that satisfy user-specified criteria such as strength, complexity, etc., or that are similar to rules that already satisfy such criteria. Second, a subset of these rules, known as the technically interesting rules, is selected using methods such as the chi-square test or AIC/BIC. Third, we remove all but the technically interesting rules. Notice that the rule-refinement process just described is "equivalent" to the selection of the "best" rules done by a user.

An interestingness measure is used to assess the interestingness of generated classification rules, one rule at a time. Classification rules are divided into two types:
• Discriminant - when the evidence implies the hypothesis (e -> h); such a rule specifies conditions sufficient for distinguishing between the classes (these are the most frequently used in practice).
• Characteristic - when the hypothesis implies the evidence (h -> e); such a rule specifies conditions necessary for membership in a class.
Given a rule of type e -> h or h -> e:
• The rule is X% COMPLETE if e is satisfied by (covers) X% of the tuples satisfying h.
• The rule is Y% DISCRIMINANT if e is satisfied by (100 − Y)% of the ¬h tuples (i.e., the tuples for which h is false).

To assess the interestingness of a characteristic rule, measures of sufficiency and necessity are calculated:
S(e -> h) = p(e|h) / p(e|¬h)
N(e -> h) = p(¬e|h) / p(¬e|¬h)
where -> stands for implication and ¬ stands for negation. These two measures are used to determine the interestingness of the different forms of characteristic rules using the following relations:
IC(h -> e)   = (1 − N(e -> h)) · p(h),    if 0 ≤ N(e -> h) < 1; 0 otherwise
IC(¬h -> ¬e) = (1 − S(e -> h)) · p(¬h),   if 0 ≤ S(e -> h) < 1; 0 otherwise
IC(h -> ¬e)  = (1 − 1/N(e -> h)) · p(h),  if 1 < N(e -> h); 0 otherwise
IC(¬h -> e)  = (1 − 1/S(e -> h)) · p(¬h), if 1 < S(e -> h); 0 otherwise
The calculated interestingness values lie in the range from 0 to 1, and the owner of the data then uses some threshold (say, 0.5) to decide whether to retain or remove the rule. (A computational sketch of these measures follows the IC++ summary below.)

IC++
Results were obtained for characteristic rules mined from a synthetic database of 4,000 tuples described by 10 attributes. IC++ was compared against the RI algorithm (Piatetsky-Shapiro 1991), the J algorithm (Smyth & Goodman 1992), and the Certainty algorithm (Hong & Mao 1991). Findings (Kamber & Shinghal 1996):
• IC++ chose rule 2 as the most interesting.
• For rule 1, IC++ also showed that S and N are statistically independent, which the Certainty measure (CE) did not.
• For rule 4, IC+− shows that there is interestingness in that h -> ¬e and e -> ¬h.
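A minimal sketch of the sufficiency, necessity, and IC(h -> e) computations above from raw counts (plain Python; the counts are made up for illustration, the IC form used is the h -> e case listed above, and a tiny constant guards against division by zero):

```python
def characteristic_rule_interest(n_e_h, n_note_h, n_e_noth, n_note_noth):
    """Sufficiency S, necessity N, and IC(h -> e) from a 2x2 table of counts.

    n_e_h      : tuples with evidence e and hypothesis h
    n_note_h   : tuples with not-e and h
    n_e_noth   : tuples with e and not-h
    n_note_noth: tuples with not-e and not-h
    """
    eps = 1e-12
    total = n_e_h + n_note_h + n_e_noth + n_note_noth
    p_h = (n_e_h + n_note_h) / total
    p_e_given_h = n_e_h / (n_e_h + n_note_h)
    p_e_given_noth = n_e_noth / (n_e_noth + n_note_noth)

    S = p_e_given_h / max(p_e_given_noth, eps)              # sufficiency S(e -> h)
    N = (1 - p_e_given_h) / max(1 - p_e_given_noth, eps)    # necessity  N(e -> h)
    IC_h_e = (1 - N) * p_h if 0 <= N < 1 else 0.0           # interestingness of rule h -> e
    return S, N, IC_h_e

S, N, IC = characteristic_rule_interest(n_e_h=80, n_note_h=20, n_e_noth=10, n_note_noth=90)
print(f"S = {S:.2f}, N = {N:.2f}, IC(h -> e) = {IC:.3f}")
```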
The distance measure is another criterion; it uses the distance between any two generated rules to find the strongest rules (those with the highest coverage):
D(Ri, Rj) = (DA(Ri, Rj) + 2·DV(Ri, Rj) − 2·EV(Ri, Rj)) / (N(Ri) + N(Rj)),   if NO(Ri, Rj) = 0
D(Ri, Rj) = 0,   otherwise
where Ri and Rj are rules; DA(Ri, Rj) is the number of attributes present in Ri but not in Rj plus the number of attributes present in Rj but not in Ri; DV(Ri, Rj) is the number of attributes in both Ri and Rj with slightly overlapping values (less than 2/3 of the range); EV(Ri, Rj) is the number of attributes in both rules with strongly overlapping values (more than 2/3 of the range); N(Ri) and N(Rj) are the numbers of attributes in each rule; and NO(Ri, Rj) is the number of attributes in Ri and Rj with non-overlapping values.

The distance criterion takes values in the range from −1 to 1, indicating strong and slight overlap, respectively; the value of 1 means no overlap. The most interesting rules are those with the highest average distance to the other rules.

References
Cavanaugh J. 2012. Lecture V: The Bayesian Information Criterion. Department of Biostatistics, Department of Statistics and Actuarial Science, University of Iowa.
Cios KJ, Pedrycz W and Swiniarski R. 1998. Data Mining Methods for Knowledge Discovery. Kluwer.
Efron B. 1983. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78(382):316-331.
Fu WJ, Carroll RJ and Wang S. 2005. Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, 21(9):1979-1986.
Grunwald PD, Myung IJ and Pitt MA (eds.). 2005. Advances in Minimum Description Length Theory and Applications. MIT Press.
Hastie T, Tibshirani R and Friedman J. 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer.
Hilderman RJ and Hamilton HJ. 1999. Knowledge Discovery and Interestingness Measures: A Survey. Technical Report CS 99-04, University of Regina, Regina, Saskatchewan, Canada.
Kamber M and Shinghal R. 1996. Evaluating the Interestingness of Characteristic Rules. AAAI (KDD-96).
Kecman V. 2001. Learning and Soft Computing. MIT Press.
Mazerolle M. 2007. Making sense out of Akaike's Information Criterion (AIC). http://avesbiodiv.mncn.csic.es/estadistica/senseaic.pdf
Mehta M, Agrawal R and Rissanen J. SLIQ: A Fast Scalable Classifier for Data Mining.
Moore GW and Hutchins GM. 1983. Consistency versus completeness in medical decision-making: exemplar of 155 patients autopsied after coronary artery bypass graft surgery. Med Inform (London), Jul-Sep, 8(3):197-207.
Rissanen J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific.
van Rijsbergen CJ. 1979. Information Retrieval. London: Butterworths.