Data Mining – Classification
Keith E Emmert, Tarleton State University ([email protected])
October 23, 2013

Topics: Sub-setting Data; Classification; Decision Trees; Bayes Classification; Support Vector Machines; Receiver Operating Characteristics (ROC) and Precision-Recall Curves; k-Nearest Neighbors.

Sub-Setting Data

One must create training and test sets from a data set or data frame. The following code performs that task, leaving two-thirds of the data for the training set and one-third for the test set. Let's look at how to split the iris data set.

> set.seed(100)
> index = 1:nrow(iris)
> trainindex <- sample(index, trunc(2*length(index)/3))
> trainSet <- iris[trainindex, ]
> testSet <- iris[-trainindex, ]

For a much fancier way of performing splits (and much more), consider the caret package in R.

Classification – Definition

Classification is the task of learning (training) a target function f that maps each attribute set x to one of the predefined class labels y. The target function is also known as a classification model. A descriptive model is a classification model that distinguishes between objects of different classes; note that the class labels are known.

Classification – Examples

Decision trees, rule-based classifiers, neural networks, support vector machines, and Bayes classifiers.

Classification – General Process

Learn (train) the classifier with a training set of data (with known labels), then evaluate the classifier with a test set (also with known labels). A predictive model is a classification model used to predict the class label of unknown records. Learning (training) the model is called induction; deduction of class labels occurs when the model is applied to a data set.

Classification Evaluation – The Confusion Matrix, or 2 × 2 Contingency Table

                      Predicted Class = 1    Predicted Class = 0
Actual Class = 1             f_{11}                 f_{10}
Actual Class = 0             f_{01}                 f_{00}

Here f_{ij} is the number of records from class i predicted to be of class j; thus when i = j we have a correct prediction.

Accuracy is a performance metric defined as
\[ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}}. \]

Error rate is a performance metric defined as
\[ \text{Error rate} = \frac{\text{Wrong Predictions}}{\text{Total Predictions}} = \frac{f_{01} + f_{10}}{f_{00} + f_{01} + f_{10} + f_{11}}. \]
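As a small illustration (not part of the original slides), the accuracy and error rate can be computed in R from a confusion matrix built with table(); the rpart model below is only an assumed stand-in for whatever classifier is being evaluated.

library(rpart)
set.seed(100)
index      <- 1:nrow(iris)
trainindex <- sample(index, trunc(2*length(index)/3))
trainSet   <- iris[trainindex, ]
testSet    <- iris[-trainindex, ]

fit  <- rpart(Species ~ ., data = trainSet)               # any classifier works here
pred <- predict(fit, newdata = testSet, type = "class")   # predicted class labels

confusion <- table(Actual = testSet$Species, Predicted = pred)
confusion
sum(diag(confusion)) / sum(confusion)       # accuracy   = correct / total
1 - sum(diag(confusion)) / sum(confusion)   # error rate = wrong / total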
Often, training a model consists of minimizing the error rate or maximizing the accuracy using a test set.

Decision Trees – The Basics

Definition. A decision tree is a (directed) graph consisting of
- a root node with no incoming edges and zero or more outgoing edges,
- internal nodes with exactly one incoming edge and two or more outgoing edges, and
- leaf nodes with exactly one incoming edge and no outgoing edges.

Each leaf node is assigned exactly one class label. The root and internal nodes are assigned attribute test conditions that separate records. A decision tree consisting of only the root node (which is then a leaf) is an indication that the training process has failed spectacularly. To classify an unknown record, follow the nodes down to a leaf.

Decision Trees – Example

Example. Fisher's iris data is used, with a training set and R's party package. This data set consists of the measurements petal width, petal length, sepal width, and sepal length; three species (setosa, versicolor, and virginica) are recorded. A decision tree to predict species from the four measurements is constructed. At each node, a split variable (petal width, petal length, sepal width, or sepal length) is chosen, along with a p-value used to reject the null hypothesis of independence between that variable and Species. Note that the sepal measurements turn out not to be needed.

Example – Conditional Inference Trees

Root and internal nodes list the split variable; H0 is independence between the input variables and the response variable, and the smallest p-value determines the split. The fitted tree is:

Node 1: Petal.Length, p < 0.001
  ≤ 1.9  → Node 2: n = 35, y = (1, 0, 0)
  > 1.9  → Node 3: Petal.Width, p < 0.001
             ≤ 1.7 → Node 4: n = 35, y = (0, 0.971, 0.029)
             > 1.7 → Node 5: n = 39, y = (0, 0.026, 0.974)

Here n is the number of records classified (correctly or incorrectly) into the node, and the probabilities are listed as y = (a, b, c), where a is for setosa, b for versicolor, and c for virginica. Node 2 is perfect at classifying setosa; nodes 4 and 5 have issues.
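A minimal sketch of fitting such a conditional inference tree with the party package, reusing the iris training/test split from earlier; the exact nodes and p-values will vary slightly with the split.

library(party)
set.seed(100)
trainindex <- sample(1:nrow(iris), trunc(2*nrow(iris)/3))
trainSet   <- iris[trainindex, ]
testSet    <- iris[-trainindex, ]

irisCt <- ctree(Species ~ ., data = trainSet)   # conditional inference tree
print(irisCt)                                   # split variables and p-values per node
plot(irisCt)                                    # tree diagram as on the slide
table(Actual = testSet$Species,
      Predicted = predict(irisCt, newdata = testSet))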
Building Decision Trees – Hunt's Algorithm

The set-up: the tree is grown in a recursive fashion by partitioning the training records into successively "purer" subsets. D_t is the set of training records associated with node t, and y = {y_1, y_2, ..., y_c} is the set of class labels.

Hunt's Algorithm:
1. If all records in D_t belong to the same class y_t, then t is a leaf node with label y_t.
2. If D_t contains records that belong to more than one class, then
   1. an attribute test condition is selected to partition D_t into smaller subsets,
   2. a child node is created for each outcome of the test condition,
   3. the records in D_t are distributed to the children based upon the outcomes, and
   4. the algorithm (steps 1 and 2) is applied recursively to each child node.

Building Decision Trees – Problems with Hunt's Algorithm

It is possible to generate an empty leaf with a training sample if none of the training records have the combination of attribute values associated with that node. In this case, a leaf is created with the same class label as the majority class of training records in its parent node.

Home Owner   Married   Income    Predict: Loan Default
Yes          Yes       $50,000   No
Yes          Yes       $25,000   Yes
No           Yes       $35,000   Yes

Split: Married? Problem: an empty node for Married = No. Fix: the current node is a leaf; Default is "Yes."

If all records associated with D_t have the same attribute values (except the class label), then no split is possible. In that case, declare the node a leaf with the same class label as the majority class of training records associated with it.

Home Owner   Married   Income    Predict: Loan Default
Yes          Yes       $50,000   No
Yes          Yes       $50,000   Yes
Yes          Yes       $50,000   Yes

Problem: no attribute allows a split! Fix: the current node is a leaf; Default is "Yes."

Decision Trees – Splitting Notes

- Binary data is obvious.
- Ordinal data is the first problem child. Small, Medium, and Large could generate three categories (obviously) or two, say Small & Medium vs. Large. Don't group the beginning and the end (i.e., Small & Large vs. Medium).
- Numeric data is the second problem child. It can be split by a single number (< 10 vs. ≥ 10) or by bins/ranges [10, 20), [20, 30), etc., and then treated as ordinal data.

Picking the numeric cut-off/best bin, or deciding how to group ordinal data, is "interesting."
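A small illustration (not from the slides) of both numeric splitting styles in R: cut() bins a numeric attribute into ordered ranges, and a single-number split is just a comparison. The income values are the ones from the loan table used below; the breakpoints are arbitrary.

income <- c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90)   # annual income in $1000s
bins <- cut(income,
            breaks = c(-Inf, 85, 100, Inf),
            labels = c("<85K", "[85K,100K)", ">=100K"),
            right = FALSE, ordered_result = TRUE)          # bins treated as ordinal data
table(bins)
table(income >= 100)                                        # a single-number (binary) split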
Decision Tree Example

Example. Apply Hunt's Algorithm to the following data. Split order: Home Owner, Marital Status (married vs. not married), Annual Income ≥ 100K. (The table is changed slightly from the book.)

Tid   Home Owner (binary)   Marital Status (categorical)   Annual Income (continuous)   Defaulted (class)
1     Yes                   Single                         125K                         Yes
2     No                    Married                        100K                         No
3     No                    Single                         70K                          No
4     Yes                   Married                        120K                         No
5     No                    Divorced                       95K                          Yes
6     No                    Married                        60K                          No
7     Yes                   Divorced                       220K                         No
8     No                    Single                         85K                          Yes
9     No                    Married                        75K                          No
10    No                    Single                         90K                          Yes

Decision Trees – Impurity Measures

Definition. A partition is impure if it contains a collection of tuples from different classes rather than from a single class; otherwise it is pure. Let p(i | t) be the fraction of records belonging to class i at a given node t (when t is fixed or clear, write p_i = p(i | t)), and let c be the number of classes at node t, labeled 0, 1, ..., c − 1. Using the above, the following impurity measures are defined:

\[ \text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t)\,\log_2 p(i \mid t) \]
\[ \text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \left[p(i \mid t)\right]^2 \]
\[ \text{Classification Error}(t) = 1 - \max_i\, p(i \mid t) \]

For empty nodes, assume that entropy, Gini, and classification error are all zero.

Decision Trees – Gain

Definition. The gain, Δ, is defined to be
\[ \Delta = \underbrace{I(\text{parent})}_{\text{parent impurity}} \;-\; \underbrace{\sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j)}_{\text{weighted average impurity of the children}}, \]
where I(·) is any (fixed) impurity measure (e.g., Entropy or Gini), N is the total number of records at the parent node, N(v_j) is the number of records at a proposed child node v_j, and k is the number of attribute values, i.e., of proposed child nodes.

Remark. Gain compares the degree of impurity of the parent node before splitting to the degree of impurity of the child nodes after splitting. Maximizing the gain Δ is equivalent to minimizing the weighted average impurity of the proposed child nodes,
\[ \sum_{j=1}^{k} \frac{N(v_j)}{N}\, I(v_j), \]
since I(parent) is constant. When I(·) = Entropy(·), Δ_info ≡ Δ is called the information gain.
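A minimal R sketch (an assumption, not slide code) of the Gini index and the weighted child impurity that the gain Δ subtracts from the parent impurity; gini and weighted_gini are hypothetical helper names, and the vectors reproduce the loan table above.

gini <- function(labels) {
  p <- prop.table(table(labels))   # class proportions p(i | t)
  1 - sum(p^2)
}

weighted_gini <- function(labels, split_var) {
  groups <- split(labels, split_var)
  sum(sapply(groups, function(g) length(g) / length(labels) * gini(g)))
}

defaulted <- c("Yes","No","No","No","Yes","No","No","Yes","No","Yes")
marital   <- c("Single","Married","Single","Married","Divorced",
               "Married","Divorced","Single","Married","Single")

gini(defaulted)                                  # parent impurity: 0.48
weighted_gini(defaulted, marital)                # S vs D vs M:   0.25
weighted_gini(defaulted, marital == "Married")   # S&D vs M:      4/15, about 0.2667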
Decision Trees – Splitting Nodes

Example. Use the loan-default table above (Tids 1–10) to determine the best split in the examples that follow.

Splitting Categorical Data

Here we investigate the root split on marital status by determining whether Single/Divorced vs. Married or Single vs. Divorced vs. Married is better.

Defaulted   Single/Divorced   Married
Yes         4                 0
No          2                 4

\[ \text{Gini}(S\&D) = 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2 = \tfrac{4}{9}, \qquad \text{Gini}(M) = 1 - \left(\tfrac{0}{4}\right)^2 - \left(\tfrac{4}{4}\right)^2 = 0 \]
\[ \text{Wt Gini} = \sum_{j = S\&D,\, M} \frac{N(v_j)}{N}\,\text{Gini}(v_j) = \frac{6}{10}\cdot\frac{4}{9} + \frac{4}{10}\cdot 0 = \frac{4}{15} \]

Splitting Categorical Data – Continued

Defaulted   Single   Divorced   Married
Yes         3        1          0
No          1        1          4

\[ \text{Gini}(S) = 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = \tfrac{3}{8}, \qquad \text{Gini}(D) = 1 - \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{2}, \qquad \text{Gini}(M) = 0 \]
\[ \text{Wt Gini} = \frac{4}{10}\cdot\frac{3}{8} + \frac{2}{10}\cdot\frac{1}{2} + \frac{4}{10}\cdot 0 = \frac{1}{4} \]

Since Wt Gini(S&D, M) = 4/15 ≈ 0.2667 and Wt Gini(S, D, M) = 1/4 = 0.25, a better split when considering only marital status would be the three categories in a new decision tree.

Splitting Continuous Data

The following table is based upon making a new decision tree using Income from the table above. Candidate splits are the midpoints of intervals between consecutive sorted incomes; e.g., [60K, 70K) has midpoint 65K.

Split     ≤ split (Yes, No)   > split (Yes, No)   Wt. Gini
55K       (0, 0)              (4, 6)              0.480
65K       (0, 1)              (4, 5)              0.444
72K       (0, 2)              (4, 4)              0.400
80K       (0, 3)              (4, 3)              0.343
87K       (1, 3)              (3, 3)              0.450
92K       (2, 3)              (2, 3)              0.480
97K       (3, 3)              (1, 3)              0.450
110K      (3, 4)              (1, 2)              0.476
122K      (3, 5)              (1, 1)              0.475
172K      (4, 5)              (0, 1)              0.444
230K      (4, 6)              (0, 0)              0.480

Splitting Continuous Data – Continued

Continuous data can also be split into bins and treated as nominal (categorical) data. For example, with Income binned as I: < 85K, II: [85K, 100K), III: ≥ 100K:

Defaulted   I: < 85K   II: [85K, 100K)   III: ≥ 100K
Yes         0          2                 1
No          3          1                 3
Gini        0          4/9               3/8

Wt Gini for I vs II vs III: 0.2833
Wt Gini for I & II vs III: 0.4167
Wt Gini for I vs II & III: 0.3429

So, the three categories appear to be better.
Of course, there are many ways to build such categories.

Decision Trees – Final Thoughts About the First Node

Recall:

Using Marital Status
- Wt Gini for S&D vs M: 0.2667
- Wt Gini for S vs D vs M: 0.25

Using Income
- Wt Gini for the single split ≤ 80K vs > 80K: 0.343
- Wt Gini for categories I vs II vs III: 0.2833
- Wt Gini for categories I & II vs III: 0.4167
- Wt Gini for categories I vs II & III: 0.3429

For completeness, Wt Gini for Home Owner: 0.4190.

Conclusion: the first node should be split using the Marital Status variable with the three categories S vs D vs M. Next steps: for each of the three nodes, S, D, and M, determine the optimal split using the remaining attributes, Income and Home Owner.

Decision Trees – Characteristics

- This is a non-parametric approach: no underlying assumptions about the distribution are made.
- Building a decision tree is easy; finding the optimal one is an NP-complete problem.
- Classification time is linear in the height of the tree.
- Sub-trees can be replicated throughout the decision tree, making it more complicated than necessary.
- The decision boundaries are rectilinear; non-rectilinear boundaries in the data need other methods.
- The choice of impurity measure (e.g., Gini) has little impact on the tree.

Decision Trees – Types of Errors

- Training error is the number of misclassification errors committed on the training records.
- Generalization error is the expected error of the model on previously unseen records.
- Training error (from training data) is a possibly poor estimate of the generalization error.
- Test error (from test data) is a possibly better estimate of the generalization error, since the trained model is applied to previously unseen data.
- There are bounds on the generalization error, but they are implementation specific.
- Cross-validation, random sub-sampling, and the bootstrap will improve your estimates of the generalization error.

Goal: low training and low generalization errors.

Estimating Generalization Error: Random Sub-Sampling

- Split the data set randomly into 70% training and 30% testing.
- Train the model.
- Test the model and compute the test error.
- Repeat k times.
- Average all of the test errors.
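A minimal R sketch of random sub-sampling as just described, assuming an rpart tree on the iris data as a stand-in for the slides' example.

library(rpart)
set.seed(1)
k <- 10
testErrors <- replicate(k, {
  idx   <- sample(nrow(iris), size = round(0.7 * nrow(iris)))   # 70% training split
  train <- iris[idx, ]
  test  <- iris[-idx, ]
  fit   <- rpart(Species ~ ., data = train)
  mean(predict(fit, test, type = "class") != test$Species)      # test error for this split
})
mean(testErrors)   # averaged estimate of the generalization error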
Estimating Generalization Error: Cross-Validation

- Split the data set into k disjoint groups C_1, ..., C_k of equal size.
- Train the model using C_1, ..., C_{i−1}, C_{i+1}, ..., C_k.
- Test the model using C_i and compute the test error.
- Repeat for i = 1, 2, ..., k.
- Average all of the test errors.

If k is the size of the data set, this is called leave-one-out cross-validation.

Estimating Generalization Error: Bootstrap

- Generate a training set of size N by sampling the data with replacement; the rest is the test set.
- Train the model.
- Use the test set to calculate the test error.
- Repeat k times.
- Average all of the test errors.

Approximately 63.2% of the records are sampled, since the probability a record is chosen is
\[ 1 - (1 - 1/N)^N \xrightarrow{\;N \to \infty\;} 1 - e^{-1} \approx 0.632. \]

Types of Errors – Example, 400 Ordered Pairs

(Figure: scatter plot of 400 ordered pairs in the (x, y) plane, with positive and negative classes.)

Training Errors – Example: A (Default) Decision Tree Using a Training Set of 300 Pairs, Leaving 100 Pairs for Testing

(Figure: the default decision tree; the root split is x >= 0.7277, followed by splits on y >= 1.496, x < 4.496, y >= 1.049, y < 2.63, y < 5.506, and x >= 1.814, with leaves labeled −1 and 1.)

Example – Training and Test Error for the Default Tree

The training (or test) error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError):
labelTrain    -1     1
-1           123    28
1             16   133

Test confusion matrix (myPosNegTestError):
labelTest    -1    1
-1           35   14
1             6   45

The training error is 0.146667 and the test error is 0.2.
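A minimal sketch of the k-fold cross-validation procedure described above, again using rpart on iris as an assumed stand-in for the slides' data.

library(rpart)
set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))   # assign each record to one of k folds
cvErrors <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]                        # train on every fold except C_i
  test  <- iris[folds == i, ]                        # test on C_i
  fit   <- rpart(Species ~ ., data = train)
  mean(predict(fit, test, type = "class") != test$Species)
})
mean(cvErrors)   # cross-validation estimate of the generalization error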
Example – Training Error for the Default Tree, Visually

(Figure: scatter plot of the training data; the positive class is shown as black circles and the negative class as red triangles, so misclassified items appear as black triangles and red circles.)

Example – Test Error for the Default Tree, Visually

(Figure: the corresponding scatter plot for the test data, with the same color and symbol conventions.)

Example – Complexity Parameter (CP) for RPART

- The root-node training error is 149/300 ≈ 0.496667.
- A split that does not decrease the overall lack of fit by a factor of CP is not attempted.
- nsplit is the number of splits (0 for none, etc.).
- rel error is the training (resubstitution) error, scaled so that the root node has error 1.
- xerror is the error estimated by cross-validation (10-fold by default), on the same relative scale.
- xstd is the standard error from the (10-fold) cross-validation.

      CP          nsplit   rel error   xerror      xstd
1     0.28859060  0        1.0000000   1.1744966   0.05730957
2     0.16778523  1        0.7114094   0.9463087   0.05801780
3     0.08053691  2        0.5436242   0.6711409   0.05479843
4     0.06711409  3        0.4630872   0.5771812   0.05256652
5     0.04026846  4        0.3959732   0.5234899   0.05098905
6     0.02013423  6        0.3154362   0.4429530   0.04815407
7     0.01000000  7        0.2953020   0.4228188   0.04734755

Example – Picking the "Best"

Referring to the CP table above:
- Use the "best tree," the one with the lowest cross-validation error: that is the tree in row 7.
- Or use the "smallest tree" within one standard error of the best tree: 0.4228188 + 0.04734755 = 0.47016635, so choose the tree in row 6, because it is simpler and within 1 SE of the "best" tree.

Example – Picking the "Best" Visually

The horizontal line is 1 SE above the minimum of the curve. The "ideal" tree lies below this line, so choose cp so that it generates an appropriate tree.

(Figure: plotcp output, X-val relative error versus cp, with tree sizes 1–8 along the top axis and the 1-SE reference line drawn horizontally.)
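A minimal sketch of inspecting the CP table and applying the 1-SE rule with rpart; the data set and formula here are illustrative assumptions, not the slides' 400-pair example.

library(rpart)
set.seed(1)
fit <- rpart(Species ~ ., data = iris, control = rpart.control(cp = 0))

printcp(fit)   # CP table: CP, nsplit, rel error, xerror, xstd
plotcp(fit)    # X-val relative error versus cp, with the 1-SE line

cpTab     <- fit$cptable
best      <- which.min(cpTab[, "xerror"])                    # row with lowest xerror
threshold <- cpTab[best, "xerror"] + cpTab[best, "xstd"]     # 1-SE threshold
cpOneSE   <- cpTab[which(cpTab[, "xerror"] <= threshold)[1], "CP"]
cpOneSE    # cp of the smallest tree within 1 SE of the best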
Remark. plotcp and printcp generate cp values using different algorithms!

Types of Errors – Under-fitting and Over-fitting

Under-fitting occurs at the beginning, when both the training and generalization errors are high (the model has not learned much of the structure of the data). Over-fitting occurs when a model fits the training data too well and has a poorer generalization error than a model with a higher training error rate (the model fits the training data, which may contain noise, too closely, leading to higher generalization error).

Example – Under-fitting: cp = 0.25 and minimum split 2

(Figure: the under-fit tree consists of the single root split x >= 0.7277 with leaves labeled −1 and 1.)

Example – Training and Test Error for the Under-fit Tree

The training (or test) error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError3):
labelTrain    -1    1
-1           137   14
1             92   57

Test confusion matrix (myPosNegTestError3):
labelTest    -1    1
-1           43    6
1            34   17

The training error is 0.353333 and the test error is 0.4.

Over-fitting Causes

- Noise: some training data is mislabeled.
- Lack of representative samples: the training data has too few records of one or more categories.
- Multiple comparison procedures.
Example – Over-fitting: cp = 0 and minimum split 2

(Figure: the fully grown tree with cp = 0 and minimum split 2; the root split is x >= 0.7277, followed by dozens of further splits on x and y, far too many to read.)

Example – Training and Test Error for the Over-fit Tree

The training (or test) error is given by Incorrect / Total.

Training confusion matrix (myPosNegTrainError2):
labelTrain    -1    1
-1           151    0
1              0  149

Test confusion matrix (myPosNegTestError2):
labelTest    -1    1
-1           33   16
1            11   40

The training error is 0 and the test error is 0.27.

Correcting Over-fitting

Prepruning: stop growing the tree when the improvement in impurity (or the reduction in training error) is minimal. CP is an example of this. It is hard to find the best threshold.

Post-pruning: grow the tree to maximum size, then trim by replacing a subtree with
- a new leaf node whose class label is the majority class of the records, or
- the most frequently used branch of the subtree.
Pruning stops when no further improvement is observed.
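A minimal sketch of growing an over-fit rpart tree and post-pruning it with prune(); the simulated two-class data below is only an assumed stand-in for the slides' 400 ordered pairs.

library(rpart)
set.seed(1)
n   <- 300
dat <- data.frame(x = runif(n, -2, 8), y = runif(n, -2, 8))
dat$label <- factor(ifelse(dat$x + dat$y > 6, 1, -1))            # synthetic class labels

overfit <- rpart(label ~ x + y, data = dat,
                 control = rpart.control(cp = 0, minsplit = 2))  # grow to maximum size
printcp(overfit)

cpTab  <- overfit$cptable
bestCp <- cpTab[which.min(cpTab[, "xerror"]), "CP"]              # cp with the lowest xerror
pruned <- prune(overfit, cp = bestCp)                            # post-prune the large tree
pruned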
Correcting Over-fitting: Example – Over-fit, cp = 0, minimum split 2

Example. Recall the poorly generated tree shown above. Now, we determine cp:

(Figure: plotcp output for the over-fit tree, X-val relative error versus cp, with tree sizes 1–8 along the top axis.)

Now we can prune the tree; the result is much more readable:

(Figure: the pruned tree, with root split x >= 0.7277 followed by splits on y >= 1.496, x < 4.496, y >= 1.049, y < 2.63, and x >= 1.814, and leaves labeled −1 and 1.)

Example – Training and Test Error for the Pruned Tree

The pruned tree:

Training confusion matrix (prunedTrainError2):
labelTrain    -1    1
-1           131   20
1             27  122

Test confusion matrix (prunedTestError2):
labelTest    -1    1
-1           37   12
1             9   42

The training error is 0.156667 and the test error is 0.21. For comparison, the over-fit tree above had a training error of 0 and a test error of 0.27.

Bayes Classification – Bayesian vs Non-Bayesian

Remark.
- Non-Bayesian modeling: Pr(X | θ).
- Bayesian modeling: Pr(X, θ) = Pr(θ) Pr(X | θ).
Bayes Classification – Conditional Probability

Remark. For random variables X and Y with joint probability mass function Pr(X, Y) = Pr(X = x, Y = y), we have
\[ Pr(X, Y) = Pr(X \mid Y)\,Pr(Y). \]
In particular, for Pr(Y) ≠ 0,
\[ Pr(X \mid Y) = \frac{Pr(X, Y)}{Pr(Y)}. \]

Law of Total Probability

Theorem (Law of Total Probability). If {Y_1, ..., Y_k} is the set of mutually exclusive and exhaustive outcomes of Y, then
\[ Pr(X) = \sum_{i=1}^{k} Pr(X \mid Y_i)\,Pr(Y_i). \]

Remark. Suppose X is an attribute set and Y is the class variable. Pr(X | Y) is the class-conditional probability, Pr(Y) is the prior (the initial degree of belief in Y), and Pr(Y | X) is the posterior (the degree of belief having accounted for X).

Bayes Theorem

Theorem (Bayes Theorem). If {Y_1, ..., Y_k} is the set of mutually exclusive and exhaustive outcomes of Y, then
\[ Pr(Y \mid X) = \frac{Pr(X \mid Y)\,Pr(Y)}{Pr(X)} = \frac{Pr(X \mid Y)\,Pr(Y)}{\sum_{i=1}^{k} Pr(X \mid Y_i)\,Pr(Y_i)}. \]

Bayes Classification – Goal

Given an unknown record X = (x_1, ..., x_n), classify it as one of y_1, ..., y_k by computing Pr(Y = y_i | X = (x_1, ..., x_n)) for i = 1, ..., k. The class label for X is the y_i that maximizes this conditional probability. Hence, we need:
- Pr(Y): estimated from the training set as the fraction of training records in each class.
- Pr(X | Y): harder to find; use a naive Bayes classifier or a Bayesian belief network.
Naive Bayes Classification

Remark. We will assume that the attributes X = (X_1, ..., X_n) are conditionally independent given the class label Y = y, that is,
\[ Pr(X \mid Y = y) = \prod_{j=1}^{n} Pr(X_j \mid Y = y). \]
Hence, by Bayes Theorem, for 1 ≤ i ≤ k and fixed X = (X_1 = x_1, ..., X_n = x_n),
\[ Pr(Y = y_i \mid X) = \frac{Pr(X \mid Y = y_i)\,Pr(Y = y_i)}{Pr(X)} = \frac{Pr(Y = y_i)\prod_{j=1}^{n} Pr(X_j = x_j \mid Y = y_i)}{Pr(X)}. \]

Naive Bayes Classification – Goal

Find the y_i that maximizes Pr(Y = y_i | X) for a fixed X. Since
\[ Pr(Y = y_i \mid X) = \frac{Pr(Y = y_i)\prod_{j=1}^{n} Pr(X_j = x_j \mid Y = y_i)}{Pr(X)}, \]
we only need to maximize the numerator,
\[ Pr(Y = y_i)\prod_{j=1}^{n} Pr(X_j = x_j \mid Y = y_i). \]

Naive Bayes Classification – Handling Categorical and Binary Attributes

Note that Pr(X_j = x_j | Y = y_i) represents the fraction of training instances of class y_i that take on the particular attribute value x_j.

Example. Suppose we wish to classify X = (Red, Domestic, SUV) as stolen or not, given the following data set.

No.   Color    Type     Origin     Stolen?
1     Red      Sports   Domestic   Yes
2     Red      Sports   Domestic   No
3     Red      Sports   Domestic   Yes
4     Yellow   Sports   Domestic   No
5     Yellow   Sports   Imported   Yes
6     Yellow   SUV      Imported   No
7     Yellow   SUV      Imported   Yes
8     Yellow   SUV      Domestic   No
9     Red      SUV      Imported   No
10    Red      Sports   Imported   Yes

Naive Bayes Classification – Categorical Example Continued

Example. Each class has five records, so
\[ Pr(\text{Red} \mid \text{Yes}) = \tfrac{3}{5}, \quad Pr(\text{Red} \mid \text{No}) = \tfrac{2}{5}, \quad Pr(\text{SUV} \mid \text{Yes}) = \tfrac{1}{5}, \quad Pr(\text{SUV} \mid \text{No}) = \tfrac{3}{5}, \]
\[ Pr(\text{Domestic} \mid \text{Yes}) = \tfrac{2}{5}, \quad Pr(\text{Domestic} \mid \text{No}) = \tfrac{3}{5}, \quad Pr(\text{Yes}) = Pr(\text{No}) = \tfrac{5}{10}. \]
So, we maximize Pr(Y = y_i) ∏_j Pr(X_j = x_j | Y = y_i):
\[ Pr(\text{Yes} \mid X)\,Pr(X) = \frac{5}{10}\cdot\frac{3}{5}\cdot\frac{2}{5}\cdot\frac{1}{5} = 0.024, \qquad Pr(\text{No} \mid X)\,Pr(X) = \frac{5}{10}\cdot\frac{2}{5}\cdot\frac{3}{5}\cdot\frac{3}{5} = 0.072. \]
X is assigned the label "No," since Pr(No | X) > Pr(Yes | X).

Naive Bayes Classification – Continuous Attributes

For continuous attributes, you have a choice:
- Discretize X_j by using "appropriate" intervals or bins; Pr(X_j = x_j | Y = y_i) is then the fraction of training records belonging to class y_i that fall within the interval containing x_j. Too many or too few bins can cause issues.
- Assume a distribution and estimate the parameters of the distribution using the training data.
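A minimal R sketch (not part of the slides) that carries out the stolen-car computation above by hand; condProb is a hypothetical helper name.

cars <- data.frame(
  Color  = c("Red","Red","Red","Yellow","Yellow","Yellow","Yellow","Yellow","Red","Red"),
  Type   = c("Sports","Sports","Sports","Sports","Sports","SUV","SUV","SUV","SUV","Sports"),
  Origin = c("Domestic","Domestic","Domestic","Domestic","Imported",
             "Imported","Imported","Domestic","Imported","Imported"),
  Stolen = c("Yes","No","Yes","No","Yes","No","Yes","No","No","Yes"))

# Pr(X_j = x_j | Y = class): fraction of class records with the given attribute value
condProb <- function(attr, value, class) {
  mean(cars[cars$Stolen == class, attr] == value)
}

newX   <- c(Color = "Red", Type = "SUV", Origin = "Domestic")
priors <- prop.table(table(cars$Stolen))                    # Pr(Yes), Pr(No)
scores <- sapply(c("Yes", "No"), function(cl) {
  as.numeric(priors[cl]) * prod(mapply(condProb, names(newX), newX, cl))
})
scores                      # numerators of the posteriors for Yes and No
names(which.max(scores))    # predicted label: "No"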
Naive Bayes Classification – Continuous Attributes Continued

Remark. For a density f(X; θ) with parameter vector θ and ε > 0,
\[ Pr(x_j \le X_j \le x_j + \epsilon) = \int_{x_j}^{x_j + \epsilon} f(X_j; \theta)\,dX_j \approx \epsilon\, f(x_j; \theta), \]
so
\[ Pr(Y = y_i \mid X) = \frac{Pr(Y = y_i)\prod_{j=1}^{n} Pr(X_j = x_j \mid Y = y_i)}{Pr(X)} \approx \frac{\epsilon^{n}\, Pr(Y = y_i)\prod_{j=1}^{n} f(x_j; \theta)}{Pr(X)}. \]
Since the factor ε^n is the same for every class, we seek to maximize Pr(Y = y_i) ∏_{j=1}^{n} f(x_j; θ).

Naive Bayes Classification – Example

Example. Use the loan-default table above (Tids 1–10) to classify X = (Home Owner = No, Marital Status = Married, Income = $120K).

Example Continued

Example. Assume a Gaussian for Annual Income,
\[ f(x; \bar{x}, s) = \frac{1}{\sqrt{2\pi}\,s}\, e^{-(x - \bar{x})^2/(2s^2)}. \]

For the class No: x̄_No = (100 + 70 + 120 + 60 + 220 + 75)/6 = 107.5 and s_No = sqrt(Σ(x_i − x̄_No)²/(n − 1)) = 59.31, so
Pr(Income = 120K | No) ≈ f(120; 107.5, 59.31) = 0.0507.

For the class Yes: x̄_Yes = 98.75 and s_Yes = 17.97, so
Pr(Income = 120K | Yes) ≈ f(120; 98.75, 17.97) = 0.0468.

Priors: Pr(No) = 6/10 = 3/5 and Pr(Yes) = 4/10 = 2/5.

Example Continued

The class-conditional probabilities can now be easily computed:
\[ Pr(X \mid \text{No}) = \prod_{j=1}^{3} Pr(X_j = x_j \mid \text{No}) = Pr(\text{Home Owner} = \text{No} \mid \text{No}) \times Pr(\text{Marital Status} = \text{Married} \mid \text{No}) \times Pr(\text{Income} = 120K \mid \text{No}) = \frac{4}{6}\cdot\frac{4}{6}\cdot 0.0507 = 0.0225, \]
\[ Pr(X \mid \text{Yes}) = \prod_{j=1}^{3} Pr(X_j = x_j \mid \text{Yes}) = \frac{3}{4}\cdot\frac{0}{4}\cdot 0.0468 = 0. \]
Example Continued

Recall that the goal is to find the y_i maximizing Pr(Y = y_i | X) for a fixed X; since
\[ Pr(Y = y_i \mid X) = \frac{Pr(Y = y_i)\prod_{j=1}^{n} Pr(X_j = x_j \mid Y = y_i)}{Pr(X)}, \]
we only need to maximize the numerator. Hence, we have
\[ Pr(Y = \text{No})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{No}) = \frac{6}{10}\cdot 0.0225 = 0.0135, \]
\[ Pr(Y = \text{Yes})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{Yes}) = \frac{4}{10}\cdot 0 = 0. \]
Therefore, X is classified as "No."

Naive Bayes Classification – m-Estimate of Conditional Probability

Remark. Recall from the previous example that
\[ Pr(Y = \text{Yes})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{Yes}) = \frac{4}{10}\cdot 0 = 0. \]
This is bad, since classification may not work very well. Worse, if Pr(X | Y = y_i) = 0 for all i, then classification cannot occur at all. The fix is m-estimation.

Remark. The idea is to assume we have m extra training records of class y_i, of which a fraction p have the attribute value x_j.

Definition. The m-estimate approach approximates the conditional probabilities as
\[ Pr(X_j = x_j \mid Y = y_i) = \frac{n_c + m p}{n + m}, \]
where p is the prior estimate of the probability (absent other information, assume a uniform prior p = 1/k, where k is the number of values the attribute X_j can take), m is the equivalent sample size (a constant), n_c is the number of training records of class y_i that take the value x_j, and n is the number of records of class y_i.

Naive Bayes Classification – m-Estimate Example

Example. Use the loan-default table above to classify X = (Home Owner = No, Marital Status = Married, Income = $120K), this time using m-estimates.

Example Continued

Recall that Pr(Y = No) = 6/10, Pr(Y = Yes) = 4/10, Pr(Income = 120K | No) ≈ f(120; 107.5, 59.31) = 0.0507, and Pr(Income = 120K | Yes) ≈ f(120; 98.75, 17.97) = 0.0468.
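A one-function R sketch (an assumption, not slide code) of the m-estimate; mEstimate is a hypothetical helper name, and the two calls reproduce values used in the example that follows.

mEstimate <- function(nc, n, m, p) (nc + m * p) / (n + m)   # (n_c + m p) / (n + m)

mEstimate(nc = 4, n = 6, m = 6, p = 1/2)   # Home Owner = No given No: 7/12
mEstimate(nc = 0, n = 4, m = 6, p = 1/3)   # Married given Yes: 2/10, instead of the raw 0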
m-Estimate Example Continued

Example. Assume m = 6, for convenience.

For Y = No:
- Home Owner = No: n_c = 4, n = 6, p = 1/2, so Pr(Home Owner = No | No) = (n_c + mp)/(n + m) = (4 + 6·(1/2))/(6 + 6) = 7/12.
- Marital Status = Married: n_c = 4, n = 6, p = 1/3, so Pr(Marital Status = Married | No) = (4 + 6·(1/3))/(6 + 6) = 6/12.
Hence, we have
\[ Pr(Y = \text{No})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{No}) = \frac{6}{10}\cdot\frac{7}{12}\cdot\frac{6}{12}\cdot 0.0507 = 0.0089. \]

For Y = Yes:
- Home Owner = No: n_c = 3, n = 4, p = 1/2, so Pr(Home Owner = No | Yes) = (3 + 6·(1/2))/(4 + 6) = 6/10.
- Marital Status = Married: n_c = 0, n = 4, p = 1/3, so Pr(Marital Status = Married | Yes) = (0 + 6·(1/3))/(4 + 6) = 2/10.
Hence, we have
\[ Pr(Y = \text{Yes})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{Yes}) = \frac{4}{10}\cdot\frac{6}{10}\cdot\frac{2}{10}\cdot 0.0468 \approx 0.0022. \]

Example. Again, since
\[ Pr(\text{No} \mid X)\,Pr(X) = Pr(Y = \text{No})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{No}) = 0.0089 \]
exceeds
\[ Pr(\text{Yes} \mid X)\,Pr(X) = Pr(Y = \text{Yes})\cdot\prod_{j=1}^{3} Pr(X_j = x_j \mid \text{Yes}) \approx 0.0022, \]
we can safely classify X as No.

Naive Bayes Classification – Bayes Error Rate

Suppose that the true probability distribution Pr(X | Y) is known, where Y is the class label and X is a vector of attributes. We seek to classify alligators and crocodiles based upon their length by finding the ideal decision boundary.

Assume the length of an adult crocodile is N(15 ft, 2² ft²), so we can write the class-conditional probability as
\[ Pr(X \mid \text{Crocodile}) = \frac{1}{\sqrt{2\pi}\cdot 2}\exp\!\left[-\frac{1}{2}\left(\frac{X - 15}{2}\right)^2\right]. \]
Assume the length of an adult alligator is N(12 ft, 2² ft²), so
\[ Pr(X \mid \text{Alligator}) = \frac{1}{\sqrt{2\pi}\cdot 2}\exp\!\left[-\frac{1}{2}\left(\frac{X - 12}{2}\right)^2\right]. \]
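A small R check (not from the slides) of the equal-prior decision boundary derived next: the two class-conditional densities cross where dnorm(x, 15, 2) equals dnorm(x, 12, 2), which uniroot locates at x = 13.5.

crocodile <- function(x) dnorm(x, mean = 15, sd = 2)   # N(15, 2^2)
alligator <- function(x) dnorm(x, mean = 12, sd = 2)   # N(12, 2^2)

boundary <- uniroot(function(x) crocodile(x) - alligator(x),
                    interval = c(12, 15))$root
boundary   # 13.5

curve(alligator(x), from = 5, to = 22, xlab = "Length", ylab = "Density")
curve(crocodile(x), add = TRUE, lty = 2)
abline(v = boundary, lty = 3)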
Naive Bayes Classification – Bayes Error Rate Continued

Assuming that the prior probabilities are the same, that is, Pr(Alligator) = Pr(Crocodile), the figure illustrates the ideal decision boundary, where Pr(X = x̂ | Crocodile) = Pr(X = x̂ | Alligator). Hence we have
\[ \left(\frac{\hat{x} - 15}{2}\right)^2 = \left(\frac{\hat{x} - 12}{2}\right)^2 \implies \hat{x} = \frac{27}{2} = 13.5, \]
so lengths to the left of x̂ = 13.5 should be classified as alligator and lengths to the right of x̂ = 13.5 as crocodile.

(Figure: the two class-conditional densities over lengths of roughly 5 to 20 ft, alligators centered at 12 and crocodiles at 15.)

Assuming that the prior probabilities are different, that is, Pr(Alligator) ≠ Pr(Crocodile), the decision boundary shifts towards the class with the lower prior probability. For example, if Pr(Alligator) = 2 Pr(Crocodile), then
\[ Pr(\text{Alligator} \mid X) = Pr(\text{Crocodile} \mid X) \]
\[ \iff \frac{Pr(\text{Alligator})\,Pr(X \mid \text{Alligator})}{Pr(X)} = \frac{Pr(\text{Crocodile})\,Pr(X \mid \text{Crocodile})}{Pr(X)} \]
\[ \iff 2\,Pr(\text{Crocodile})\,Pr(X \mid \text{Alligator}) = Pr(\text{Crocodile})\,Pr(X \mid \text{Crocodile}) \]
\[ \iff 2\,Pr(X \mid \text{Alligator}) = Pr(X \mid \text{Crocodile}) \]
\[ \iff 2\cdot\frac{1}{\sqrt{2\pi}\cdot 2}\exp\!\left[-\frac{1}{2}\left(\frac{\hat{x} - 12}{2}\right)^2\right] = \frac{1}{\sqrt{2\pi}\cdot 2}\exp\!\left[-\frac{1}{2}\left(\frac{\hat{x} - 15}{2}\right)^2\right] \]
\[ \iff 8\ln(2) - (\hat{x} - 12)^2 = -(\hat{x} - 15)^2 \implies \hat{x} = \frac{1}{6}\left(81 + 8\ln 2\right) \approx 14.4242. \]
Hence, the decision boundary shifted towards the crocodile.

The Bayes error rate is the probability of incorrectly classifying X and is nonzero if the distributions of the classes overlap. For this example,
\[ \text{Bayes Error} = \int_{0}^{\hat{x}} Pr(\text{Crocodile} \mid X)\,dX + \int_{\hat{x}}^{\infty} Pr(\text{Alligator} \mid X)\,dX = \begin{cases} 0.453255, & \hat{x} = 13.5,\\ 0.499449, & \hat{x} = 14.4242, \end{cases} \]
where the lower bound is 0 because it would be weird to consider a crocodile of negative length.

Naive Bayes Classification – Iris Example

Here, we use the package "klaR". A confusion matrix (one should not test with the training set, but this is just a tiny example) is generated using the function NaiveBayes():

              setosa   versicolor   virginica
setosa            50            0           0
versicolor         0           47           3
virginica          0            3          47

Iris Continued – visualizing the marginal probabilities of the predictor variables given the class:

(Figure: Naive Bayes plot of the estimated density of Petal.Length for setosa, versicolor, and virginica.)
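A minimal sketch using klaR's NaiveBayes() on iris, as named above; fitting and predicting on the full data set simply mirrors the slide's "tiny example" caveat.

library(klaR)                                          # provides NaiveBayes()
nbFit <- NaiveBayes(Species ~ ., data = iris)          # Gaussian class-conditionals by default
pred  <- predict(nbFit, iris)                          # $class and $posterior
table(Actual = iris$Species, Predicted = pred$class)   # confusion matrix as on the slide
plot(nbFit, vars = "Petal.Length")                     # marginal densities given the class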
Naive Bayes Classification – The 400 Ordered Pairs

(Figure: scatter plot of the 400 ordered pairs, positive and negative classes.)

The training set is 300 pairs and the test set is 100 pairs. How well did we do with the training data?

labelTrain    -1    1
-1           111   67
1             40   82

Predictions based upon the test data:

labelTest    -1    1
-1           34   26
1            15   25

(Figure: Naive Bayes plot of the estimated density of x given the class, for classes −1 and 1.)

(Figures: scatter plots of the training data and of the test data, positive and negative classes, showing the naive Bayes classifications.)

Bayes Classification – Bayesian Belief Networks

See http://www.bnlearn.com/ for very detailed information. This requires the package "bnlearn." Other packages that might prove useful include "deal," "catnet/mugnet," and "pcalg."

Note that when loading the package "bnlearn" with library("bnlearn"), I received an error that the package "graph" was not installed. An attempt to install it yielded the message that it is no longer supported with this version of R. So, we need to run the following three commands to install "graph." For the first, you need to select a mirror; I chose "1. Seattle (USA)." For the second command, you must choose a repository; I chose "2: Bio software." Finally, install the package as usual.
> chooseBioCmirror(graphics = getOption("menu.graphics")) > setRepositories(addURLs = c(CRANxtras = "http://www.bioconductor.org")) > install.packages("graph") Everything should now be OK. Keith E Emmert (TSU) Data Mining October 23, 2013 85 / 222 Bayes Classification Bayesian Belief Networks Data Mining Remark Keith E Emmert Sub-setting Data Classification For naive Bayes, recall that the attributes, X = (X1 , . . . , Xn ), are conditionally independent given class label Y = y , that is Decision Trees Pr (X | Y = y ) = Bayes Classification k-Nearest Neighbors Pr (Xj | Y = y ). j=1 Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves n Y Hence, by Bayes Theorem, for 1 ≤ i ≤ k and fixed X = (X1 = x1 , . . . , Xn = xn ), Pr (Y = yi | X ) = = Pr (X | Y )Pr (Y = yi ) Pr (X ) Qn Pr (Y = yi ) j=1 Pr (Xj = xj | Y = yi ) Pr (X ) What if this assumption is relaxed? Keith E Emmert (TSU) Data Mining October 23, 2013 86 / 222 Bayes Classification Bayesian Belief Networks Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Definition A Bayesian belief network is a graphical representation of the probabilistic relationships among a set of random variables and has the following elements: A directed acyclic graph encoding the dependence relationships among a set of variables A probability table associating each node to its immediate parent nodes. Remark For nodes A, B, if there is a direct arc from A to B, then A is the parent of B and B is the child of A. if there is a path from A to B, then A is an ancestor of B and B is a descendant of A. None of these relationships (parent, etc) are unique. Keith E Emmert (TSU) Data Mining October 23, 2013 87 / 222 Bayes Classification Bayesian Belief Networks - Properties Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Remark Causal Sufficiency Assumption There exist no common unobserved (or hidden or latent) variables in the domain that are parent of one or more observed variables in the domain. Definition Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves A and B are independent if Pr (A, B) = Pr (A)Pr (B). A and B are conditionally independent given C if Pr (A, B | C ) = Pr (A | C )Pr (B | C ) Pr (A | B, C ) = Pr (A | C ). k-Nearest Neighbors Remark If A and B are conditionally independent given C , then B and A are conditionally independent given C . Keith E Emmert (TSU) Data Mining October 23, 2013 88 / 222 Bayes Classification Bayesian Belief Networks - Properties Data Mining Keith E Emmert Sub-setting Data Classification Remark Markov Assumption A node in a Bayesian network is conditionally independent of its non-descendants, if its parents are known. Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Remark If a node X does not have any parents, then the table contains only the prior probability, Pr (X ). If a node X has only one parent, Y , then the table contains the conditional probability Pr (X | Y ). If a node X has multiple parents, {Y1 , . . . , Yk }, then the table contains the conditional probability Pr (X | Y1 , Y2 , . . . , Yk ). 
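bnlearn stores tables of exactly this form for a fitted discrete network. A minimal sketch (assuming bnlearn's bundled learning.test data, whose discrete variables are named A through F):

> library(bnlearn)
> data(learning.test)
> dag <- model2network("[A][B][C|A:B]")            # C has the two parents A and B
> fit <- bn.fit(dag, learning.test[, c("A", "B", "C")])
> fit$A                                            # no parents: the prior table Pr(A)
> fit$C                                            # two parents: the table Pr(C | A, B)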
Keith E Emmert (TSU) Data Mining October 23, 2013 89 / 222 Bayes Classification Bayesian Belief Networks - The Full Joint Distribution Data Mining Keith E Emmert Sub-setting Data Classification Definition The full joint distribution is defined in terms of local conditional distributions: Decision Trees Pr (X1 , X2 , . . . , Xd ) = Bayes Classification Pr (Xi | π(Xi )) i=1 Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves d Y where π(Xi ) represents the parents of Xi . If Xi has no parents, then this is simply the prior distribution of Xi . k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 90 / 222 Bayes Classification Bayesian Belief Networks - Building a Model Data Mining Keith E Emmert It’s a two step process: 1 Create the structure of the network 2 Estimate the probability values in the tables associated with each node. Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 91 / 222 Bayes Classification Bayesian Belief Networks - Learning of Bayesian Belief Networks Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Data: D = {D1 , . . . , Dn }, Di = xi ,a vector of variable values. Discrete Variables: X = {X1 , . . . , Xd } True probability distribution: p(X ). Goal: estimate the true distribution p(X ) over variables X using examples in D. That is, find the “best” parameters T such that Bayes Classification p̃(X | T ) ≈ p(X ). Support Vector Machines Parameter Estimation Criteria: Maximum likelihood estimation (MLE): maximize Receiver Operating Characteristics (ROC) and Precision - Recall Curves Pr (D | Θ) = n Y f (xi | Θ), i=1 k-Nearest Neighbors where f is a probability function. Maximum a posteriori probability (MAP): maximize Pr (Θ | D) = Pr (D | Θ)Pr (Θ) Pr (D | Θ)Pr (Θ) = R . Pr (D) Pr (D | Θ)Pr (Θ)dΘ That is, find the mode. ] Keith E Emmert (TSU) Data Mining October 23, 2013 92 / 222 Bayes Classification Bayesian Belief Networks - Example - Maximum Likelihood Estimation Data Mining Example Keith E Emmert Sub-setting Data Suppose we have a biased coin marked with Heads or Tails. Pr (Heads) = θ. Classification ( Decision Trees Let D be the sequence of outcomes xi = Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors 1, Heads 0, Tails. Clearly, Pr (x | θ) = f (x | θ) = θx (1 − θ)1−x . (i.e. our friend Bernoulli!) n Y Likelihood Function: L(θ | D) = θxi (1 − θ)1−xi . i=1 Using the log-likelihood log(L(θ | D)) and a gentle partial derivative, n n ∂ log(L(θ)) 1X 1X = 0 =⇒ θ = xi = x̄ =⇒ θ̂ = Xi = X̄ . ∂θ n n i=1 Keith E Emmert (TSU) Data Mining i=1 October 23, 2013 93 / 222 Bayes Classification Bayesian Belief Networks - Conjugate Distributions Data Mining Definition Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification If the posterior distributions p(θ | x) are in the same family as the prior probability distribution p(θ), then the prior and the posterior are called conjugate distributions and the prior is called a conjugate prior for the likelihood. The parameters for the conjugate distribution are called hyper-parameters. Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Remark All members of the exponential family have conjugate priors. 
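As a small numerical illustration for the coin example, the Beta prior is the conjugate for the Bernoulli likelihood; the prior hyper-parameters and simulated data below are assumed purely for illustration:

> set.seed(1)
> x <- rbinom(50, size = 1, prob = 0.7)   # 50 simulated coin flips
> N1 <- sum(x); N2 <- length(x) - N1      # counts of heads and tails
> a1 <- 2; a2 <- 2                        # assumed Beta(a1, a2) prior hyper-parameters
> c(shape1 = a1 + N1, shape2 = a2 + N2)   # the posterior is again a Beta (conjugacy)
> N1 / length(x)                          # compare with the MLE, the sample mean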
See: Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis, 2nd edition. CRC Press, 2003. ISBN 1-58488-388-X. Keith E Emmert (TSU) Data Mining October 23, 2013 94 / 222 Bayes Classification Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP) Data Mining Example Keith E Emmert Sub-setting Data Suppose we have a biased coin marked with Heads or Tails. Pr (Heads) = θ. Classification ( Decision Trees Let D be the sequence of outcomes xi = Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves 1, Heads 0, Tails. Clearly, Pr (x | θ) = f (x | θ) = θx (1 − θ)1−x . (i.e. our friend Bernoulli!) MAP seeks to maximize the prior probability on θ: k-Nearest Neighbors p̃(θ | D) = Keith E Emmert (TSU) Pr (D | θ)p̃(θ) Pr (D | θ)p̃(θ) =R . Pr (D) Pr (D | θ)p̃(θ)dθ Data Mining October 23, 2013 95 / 222 Bayes Classification Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP) Data Mining Example Keith E Emmert Sub-setting Data We wish to maximize the prior probability on θ: Classification p̃(θ | D) = Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Pr (D | θ)p̃(θ) , Pr (D) where Pr (D | θ) is the likelihood of data. That is, Pr (D | θ) = N Y θxi (1 − θ)1−xi = θN1 (1 − θ)N2 i=1 k-Nearest Neighbors where N1 is the number of success, N2 the number of failures. p̃(θ) is the prior probability on θ. Problem: How to choose the prior probability, p̃(θ)? Keith E Emmert (TSU) Data Mining October 23, 2013 96 / 222 Bayes Classification Bayesian Belief Networks - Example - Maximum a Posteriori Probability (MAP) Data Mining Example Keith E Emmert Sub-setting Data Classification Decision Trees Suppose we have a biased coin marked with Heads or Tails. Problem: How to choose the prior probability, p̃(θ)? Use the conjugate distribution for the binomial: Bayes Classification Pr (D | θ) = θN1 (1 − θ)N2 Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves =⇒ Pr (θ) = Γ(α1 + α2 ) α1 −1 θ (1 − θ)α2 −1 . Γ(α1 )Γ(α2 ) Hence, we know k-Nearest Neighbors p̃(θ | D) = R Pr (D | θ)Beta(θ | α1 , α2 ) . Pr (D | θ)Beta(θ | α1 , α2 )dθ = Beta(θ | α1 + N1 , α2 + N2 ) The mode is: θMAP = Keith E Emmert (TSU) α1 +N1 −1 α1 +α2 +N1 +N2 −2 . Data Mining October 23, 2013 97 / 222 Bayes Classification Bayesian Belief Networks - Bayesian Learning Data Mining Both MLE and MAP pick one parameter value. Is it always the best solution? If we have two different parameter setting that are close in terms of probability (i.e. MLE or MAP), then using only one may introduce a strong bias. Keith E Emmert Sub-setting Data Classification Decision Trees Bayesian approach Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Remedies the limitation of one choice Considers all parameter settings and averages the result: Z (∆ | D) = Pr (∆ | θ)p̃(θ | D)dθ. k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 98 / 222 Bayes Classification Bayesian Belief Networks - Example Continued Data Mining Example Keith E Emmert Sub-setting Data Predict the next coin flip. Recall we have Classification p̃(θ | D) = Beta(θ | α1 + N1 , α2 + N2 ) Decision Trees Bayes Classification So, maximize Pr (X = 1 | D) and Pr (X = 0 | D). 
Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Z 1 Pr (X = 1 | θ)p̃(θ | D)dθ Pr (X = 1 | D) = 0 Z 1 θ1 (1 − θ)1−1 Beta(θ | α1 + N1 , α2 + N2 )dθ = k-Nearest Neighbors 0 Z 1 θBeta(θ | α1 + N1 , α2 + N2 )dθ = 0 = E (θ) = Keith E Emmert (TSU) α1 + N1 α1 + α2 + N1 + N2 Data Mining October 23, 2013 99 / 222 Bayes Classification Bayesian Belief Networks - Example Continued Data Mining Example Keith E Emmert Sub-setting Data Z 1 Pr (X = 0 | D) = Classification Decision Trees Bayes Classification Pr (X = 0 | θ)p̃(θ | D)dθ 0 Z 1 θ0 (1 − θ)1−0 Beta(θ | α1 + N1 , α2 + N2 )dθ = Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves 0 Z 0 = E (1 − θ) = 1 − k-Nearest Neighbors = Keith E Emmert (TSU) 1 (1 − θ)Beta(θ | α1 + N1 , α2 + N2 )dθ = α1 + N1 α1 + α2 + N1 + N2 α2 + N2 α1 + α2 + N1 + N2 Data Mining October 23, 2013 100 / 222 Bayes Classification Bayesian Belief Networks - Example Continued Data Mining Example Keith E Emmert Sub-setting Data So, we predict heads if Classification Decision Trees Bayes Classification Support Vector Machines Pr (X = 1 | D) > Pr (X = 0 | D) α2 + N2 α1 + N1 > α1 + α2 + N1 + N2 α1 + α2 + N1 + N2 α1 + N1 > α2 + N2 . Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 101 / 222 Bayes Classification Bayesian Belief Networks - Multinomial Example Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Roll k dice. Data: D = {D1 , . . . , DN }, Di = xi , a vector of k variable values, Ni represents the number of times i occurs. P Model Parameters: θ~ = (θ1 , . . . , θk ), i θi = 1, θi the probability of an outcome i. Probability of Data (Likelihood function) N! θN1 · · · θkNk . N1 · · · Nk ! 1 This is the multinomial distribution. The MLE estimate is Ni , i = 1, 2, . . . , k. θ̂i = N Keith E Emmert (TSU) ~ = Pr (N1 , . . . , Nk | θ) Data Mining October 23, 2013 102 / 222 Bayes Classification Bayesian Belief Networks - Multinomial Example Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Roll k dice. The Conjugate The Dirichlet function: P k Γ i=1 αi θiα1 −1 · · · θkαk −1 , Dir (θ~ | α1 , . . . , αk ) = Qk Γ(α ) i i=1 where Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Z Γ(α) = ∞ x α−1 e −x dx, α > 0. 0 The Posterior ~ Pr (D | θ)Dir (θ~ | α1 , . . . , αk ) Pr (D) = Dir (θ~ | α1 + N1 , . . . , αk + Nk ). p̃(θ~ | D) = k-Nearest Neighbors The MAP estimate (the mode) is θi,MAP = Pk αi + N1 − 1 i=1 (αi Keith E Emmert (TSU) Data Mining + Ni ) − k . October 23, 2013 103 / 222 Bayes Classification Bayesian Belief Networks - Example Using Package “bnlearn” Data Mining Using the Hill Climbing Algorithm – A Score-Based Algorithm Keith E Emmert Sub-setting Data F Classification Decision Trees Bayes Classification A E B D Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors C Keith E Emmert (TSU) Data Mining October 23, 2013 104 / 222 Bayes Classification Bayesian Belief Networks - Example Using Package “bnlearn” Data Mining Using the Grow Shrink Algorithm – A Constraint-Based Algorithm Keith E Emmert Sub-setting Data Classification Note that the arc A − B has no arrow. 
This indicates that A → B and B → A generate networks with the same score.

[Figure: network learned by Grow-Shrink over the nodes A through F; the arc between A and B is undirected.]

Bayesian Belief Networks – Comparing the Models
We have different results...

[Figure: the Hill-Climbing and Grow-Shrink networks over A through F, side by side.]

Bayesian Belief Networks - Rgraphviz package
Note that the package Rgraphviz helps with plots. It is available via
source("http://bioconductor.org/biocLite.R")
biocLite("Rgraphviz")

Bayesian Belief Networks - A Prettier Graph via Rgraphviz
So, here is a better comparison plot...

[Figure: the same two networks, laid out with Rgraphviz.]

Bayesian Belief Networks
Small synthetic data set from Lauritzen and Spiegelhalter (1988) about lung diseases (tuberculosis, lung cancer or bronchitis) and visits to Asia. The data set has the following two-level variables, with levels yes and no.
D (dyspnoea) – shortness of breath
T (tuberculosis)
L (lung cancer)
B (bronchitis)
A (visit to Asia)
S (smoking)
X (chest X-ray)
E (tuberculosis versus cancer/bronchitis)

Bayesian Belief Networks - Asia Example

[Figure: the learned network over the Asia variables A, S, T, L, B, E, X, D.]

Bayesian Belief Networks - Asia Example - Fitting with MLE
We can find the conditional probability tables of the variables.
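A minimal sketch of how such a fit might be produced (assuming the asia data frame bundled with bnlearn; here the structure is learned by hill climbing, so the resulting tables may differ slightly from the ones printed below):

> library(bnlearn)
> data(asia)
> dagAsia <- hc(asia)                                  # learn a structure by hill climbing
> fitMLE <- bn.fit(dagAsia, asia, method = "mle")      # maximum likelihood estimates
> fitBayes <- bn.fit(dagAsia, asia, method = "bayes")  # expected values of the posterior
> fitMLE$A                                             # conditional probability table of node A
> fitMLE$T                                             # conditional probability table of node T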
For example, node A: Parameters of node A (multinomial distribution) Classification Decision Trees Conditional probability table: Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors no yes 0.9916 0.0084 and node T : Parameters of node T (multinomial distribution) Conditional probability table: A T no yes no 0.991528842 0.952380952 yes 0.008471158 0.047619048 Keith E Emmert (TSU) Data Mining October 23, 2013 111 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Fitting with MLE Data Mining Keith E Emmert Node D Parameters of node D (multinomial distribution) Sub-setting Data Conditional probability table: Classification Decision Trees Bayes Classification , , E = no Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors B D no yes no 0.90017286 0.21373057 yes 0.09982714 0.78626943 , , E = yes B D no yes no 0.27737226 0.14592275 yes 0.72262774 0.85407725 Keith E Emmert (TSU) Data Mining October 23, 2013 112 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Fitting with MLE Asia Node D Conditional Probabilities using MLE Data Mining 0.0 Keith E Emmert Sub-setting Data 0.2 0.4 no no no yes yes no yes yes 0.6 0.8 Classification Decision Trees yes Bayes Classification Support Vector Machines k-Nearest Neighbors no Levels Receiver Operating Characteristics (ROC) and Precision - Recall Curves yes no 0.0 0.2 0.4 0.6 0.8 Probabilities Keith E Emmert (TSU) Data Mining October 23, 2013 113 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Fitting with Bayes Data Mining Keith E Emmert Sub-setting Data Classification We can find the conditional probability tables of the variables using the expected value of the posterior distribution. For example, node A: Parameters of node A (multinomial distribution) Decision Trees Bayes Classification Conditional probability table: Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves no yes 0.990618762 0.009381238 and node T : Parameters of node T (multinomial distribution) k-Nearest Neighbors Conditional probability table: A T no yes no 0.991033649 0.904255319 yes 0.008966351 0.095744681 Keith E Emmert (TSU) Data Mining October 23, 2013 114 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Fitting with Bayes Data Mining Keith E Emmert Node D Parameters of node D (multinomial distribution) Sub-setting Data Conditional probability table: Classification Decision Trees Bayes Classification , , E = no Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors B D no yes no 0.8997410 0.2140392 yes 0.1002590 0.7859608 , , E = yes B D no yes no 0.2813620 0.1496815 yes 0.7186380 0.8503185 Keith E Emmert (TSU) Data Mining October 23, 2013 115 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Fitting with Bayes Asia Node D Conditional Probabilities using Bayes Data Mining 0.0 Keith E Emmert Sub-setting Data 0.2 0.4 no no no yes yes no yes yes 0.6 0.8 Classification Decision Trees yes Bayes Classification Support Vector Machines k-Nearest Neighbors no Levels Receiver Operating Characteristics (ROC) and Precision - Recall Curves yes no 0.0 0.2 0.4 0.6 0.8 Probabilities Keith E Emmert (TSU) Data Mining October 23, 2013 116 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? 
No Prior Information Data Mining Keith E Emmert D A Sub-setting Data Remark Recall: A node in a Bayesian network is conditionally independent of its non descendants, if its parents are known. X Classification Decision Trees S E Bayes Classification Support Vector Machines T k-Nearest Neighbors X is conditionally independent to D and also L given E . The answer is “Yes” if Pr (D = Y ) > Pr (D = N). B Receiver Operating Characteristics (ROC) and Precision - Recall Curves L Compute Pr (D = Yes). (We get Pr (D = No) for free.) XX Pr (D = Yes) = Pr (D = Yes | E = α, B = β)Pr (E = α, B = β) α = XX α Keith E Emmert (TSU) β Pr (D = Yes | E = α, B = β)Pr (E = α)Pr (B = β) β Data Mining October 23, 2013 117 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? Using Prior Information Data Mining Keith E Emmert D A Sub-setting Data X Classification Decision Trees S E Bayes Classification Support Vector Machines T B Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Suppose we have some background on the patient: they have bronchitis. This, of course, changes the chance of dyspnoea. The answer is “Yes” if Pr (D = Y | B = Y ) > Pr (D = N | B = Y ). L Compute Pr (D = Yes | B = Yes). (Pr (D = No | B = Yes) is free.) Pr (B = Y | D = Y )Pr (D = Y ) Pr (B = Y ) Pr (B = Y | D = Y )Pr (D = Y ) =P α Pr (B = Y | D = α)Pr (D = α) Pr (D = Y | B = Y ) = Keith E Emmert (TSU) Data Mining October 23, 2013 118 / 222 Bayes Classification Bayesian Belief Networks - Asia Example - Computations: Do you have dyspnoea? Using Prior Information Data Mining Keith E Emmert D A Sub-setting Data X Classification Decision Trees S E Bayes Classification Support Vector Machines T Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors B L Suppose we have some background on the patient: they have bronchitis, a yes for E (tb vs lung cancer/bronchitis), and no for x-ray. This, of course, changes the chance of dyspnoea. Compare Pr (D = Yes | B = Yes, E = Yes, X = No) and Pr (D = No | B = Yes, E = Yes, X = No). D is conditionally independent from X given E , B! So, Pr (D = Y | B = Y , E = Y , X = N) = Pr (D = Y | B = Y , E = Y ), which is the conditional probability table for D. (Thus, the x-ray is not necessary.) Compare to Pr (D = N | B = Y , E = Y ) to make the call for dyspnoea. Keith E Emmert (TSU) Data Mining October 23, 2013 119 / 222 Bayes Classification Bayesian Belief Networks - ALARM Data Mining Keith E Emmert Sub-setting Data The ALARM (”A Logical Alarm Reduction Mechanism”) is a Bayesian network designed to provide an alarm message system for patient monitoring. Classification MVS Decision Trees DISC Bayes Classification PMB Support Vector Machines PAP Receiver Operating Characteristics (ROC) and Precision - Recall Curves SHNT FIO2 APL MINV ANES VMCH KINK VTUB VLNG PRSS VALV PVS TPR k-Nearest Neighbors INT ACO2 SAO2 ECO2 HYP LVF CCHL LVV STKV HIST ERLO HR ERCA PCWP CVP CO HRBP HRSA HREK BP Keith E Emmert (TSU) Data Mining October 23, 2013 120 / 222 Bayes Classification Bayesian Belief Networks - Continuous Data Data Mining Keith E Emmert Sub-setting Data Univariate case: use a Gaussian p(x) = √ 1 e− 2 ( 1 x−µ σ 2 ) 2πσ 2 Multivariate case: multivariate Gaussian over X1 , . . . , Xn with Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Mean vector µ, µi = E (Xi ) for 1 ≤ i ≤ n. 
n × n co-variance matrix Σ, with Σii = Var(Xi) and Σij = Cov(Xi, Xj) = E(Xi Xj) − E(Xi)E(Xj) for i ≠ j.
Joint density function
p(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) exp(−(x − µ)ᵀ Σ⁻¹ (x − µ) / 2).

Bayesian Belief Networks - Continuous Data
If Pr(X, Y) ∼ N( (µ_X, µ_Y), [Σ_XX Σ_XY; Σ_YX Σ_YY] ), then the marginal for X is Pr(X) ∼ N(µ_X, Σ_XX).
If X = (X1, ..., Xn), then Xi and Xj are independent if and only if Σij = 0.

Bayesian Belief Networks - Continuous Data
Definition. Suppose Y is a continuous variable with parents X1, ..., Xn. Then Y has a linear Gaussian model if it can be described using parameters β0, β1, ..., βn and σ² such that
Pr(Y | x1, ..., xn) ∼ N(β0 + β1 x1 + ··· + βn xn, σ²),
or, in vector notation, Pr(Y | x) ∼ N(β0 + βᵀ x, σ²).

Bayesian Belief Networks - Iris Data Set - A Continuous Example

[Figure: networks learned by Hill-Climbing and Grow-Shrink over Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.]

Bayesian Belief Networks – QQ Plots on Iris

[Figure: normal QQ plots of the residuals for each node, Hill-Climbing fit on the left and Grow-Shrink fit on the right.]

Bayesian Belief Networks – Residuals Vs Fitted on Iris

[Figure: residuals versus fitted values for each node under the two fits.]

Bayesian Belief Networks – Histogram of the Residuals on Iris

[Figure: histograms of the residuals for each node under the two fits.]

Bayesian Belief Networks – Predictions for Iris Data - GS
This gives surprisingly horrible predictions – everything classifies as Species #2.

irisPredictGS   1   2   3
            2  50  50  50

Bayesian Belief Networks – Predictions for Iris Data - HC
This one gives what we might expect – after all, I've used the entire data set to train the network and then predicted on the set I used to train! (Bad idea, by the way...)

irisPredictHC   1   2   3
            1  50   0   0
            2   0  48   2
            3   0   2  48

Support Vector Machines: Linearly Separable

[Figure: scatter plot of two linearly separable classes, Class A and Class B.]

Linearly Separable – The Margins

[Figure: the same two classes with a separating hyper-plane and its two parallel margin hyper-planes.]

Linearly Separable
The decision boundary, Bi, is a hyper-plane which separates the two classes. Suppose bi1 and bi2 are hyper-planes parallel to Bi which each touch one data point but still separate the two classes. The margin is the distance between bi1 and bi2. A linear support vector machine is a maximal margin classifier: it searches for the hyper-plane with the largest margin.

[Figure: the boundary Bi with the parallel hyper-planes bi1 and bi2.]

Linearly Separable – Hyper-planes
Let N be the number of training records of the form (xi, yi), for i = 1, 2, ..., N. Let xi = (xi1, ..., xid) be the attributes and let yi ∈ {−1, 1} be the class labels. The decision boundary is a hyper-plane and can be written as w · x + b = 0,
where w Fact: Hyper-planes of this type separate a space into two connected components with ~ · ~x + b > 0 w and ~ · ~x + b < 0. w For Rd , a hyper-plane contains d − 1 independent variables (and is a d − 1 dimensional subspace if the origin is included, otherwise it is just a set.) Keith E Emmert (TSU) Data Mining October 23, 2013 133 / 222 Support Vector Machines Linearly Separable Data Mining Example Keith E Emmert 3 Sub-setting Data Classification Note that the line y = 2 − x is a hyper-plane in R2 . Decision Trees w.x + d > 0 2 Bayes Classification Clearly, w.x + d < 0 0 y k-Nearest Neighbors y =2−x x +y −2=0 [1 1] · [x y ] − 2 = 0 ~ · ~x + d = 0 w −1 Receiver Operating Characteristics (ROC) and Precision - Recall Curves 1 Support Vector Machines −1 0 1 2 3 x Keith E Emmert (TSU) Data Mining October 23, 2013 134 / 222 Support Vector Machines ~ and b Linearly Separable – Finding w 8 Data Mining Keith E Emmert For a Class A point, ~x◦ , we ~ · ~x◦ + b = k~x◦ < 0. have w For a Class B point, ~x∆ , we ~ · ~x∆ + b = k~x∆ > 0. have w 6 Sub-setting Data Decision Trees 4 Classification Class A Class B For a ◦ point which is closest to Bi , we have a parallel hyper-plane bi1 . 2 Bayes Classification Support Vector Machines 0 −2 Bi b i2 k-Nearest Neighbors b i1 Receiver Operating Characteristics (ROC) and Precision - Recall Curves −2 0 2 4 6 8 For a ∆ point which is closest to Bi , we have a parallel hyper-plane bi2 . ~ and b so that Rescale the parameters w ~ · ~x + b = −1 bi1 : w Keith E Emmert (TSU) Data Mining and ~ · ~x + b = 1. bi2 : w October 23, 2013 135 / 222 Support Vector Machines Linearly Separable – Classifying New Vectors ~ and b so Suppose we have w that the hyper-planes bi1 , Bi , bi2 are determined, that is Data Mining 8 Keith E Emmert Sub-setting Data 6 Classification Decision Trees 0 Bi b i2 b i1 Receiver Operating Characteristics (ROC) and Precision - Recall Curves −2 Support Vector Machines bi1 :~ w · ~x + b = −1 Bi :~ w · ~x + b = 0 bi2 :~ w · ~x + b = 1 Class A Class B 2 4 Bayes Classification k-Nearest Neighbors −2 Keith E Emmert (TSU) 0 2 4 6 8 Data Mining Then, to classify the attribute vector ~z , ( ~ · ~z + b > 0 1, w y= ~ · ~z + b < 0. −1, w October 23, 2013 136 / 222 Support Vector Machines ~ and b Linearly Separable – Finding w Let d be the margin, the distance between bi2 and bi1 . Choose a point ~u on bi1 and a point ~v on bi2 . Then 8 Data Mining Sub-setting Data 6 Keith E Emmert Decision Trees 4 Classification Class A Class B ~ · ~u = −1 w ~ · ~v = 1. and w 2 Bayes Classification Support Vector Machines Bi 0 Subtracting, we have b i2 k-Nearest Neighbors b i1 ~ (~v − ~u ) = 2, w −2 Receiver Operating Characteristics (ROC) and Precision - Recall Curves −2 0 ||~ w || 2 4 6 which, using a famous property of norms, dot products, and cosine, yields 2 ~ · (~v − ~u ) = 2 =⇒ d = =w ||~ w || 8 ||~v − ~u || cos(θ) | {z } ~ ⇔θ=0 =d⇔~ v −~ u is || to w in order to maximize the margin between the two classes. Keith E Emmert (TSU) Data Mining October 23, 2013 137 / 222 Support Vector Machines Linearly Separable – Learning a Linear SVM Model Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification ~ and b are to be estimated so w that ~ · ~xi + b ≥ 0, yi = 1 w 1 =⇒ yi (~ w · ~xi + b) ≥ 0 for all ~ · ~xi + b ≤ 0, yi = −1 w i = 1, . . . , N. 
2 The margin of the decision boundary is maximized if and only if d = ||~w2 || is maximized iff the objective function Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves f (~ w) = ||~ w ||2 2 is minimized. So, we have a constrained optimization problem, k-Nearest Neighbors min ~ ∈Rd w ||~ w ||2 subject to yi (~ w · ~xi + b) ≥ 1 2 for all i = 1, . . . , N. Thus we have a quadratic objective function ~ and b. The constraints are linear in w Keith E Emmert (TSU) Data Mining October 23, 2013 138 / 222 Support Vector Machines Linearly Separable – Lagrange Multipliers – Equality Constraints Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves For ~x ∈ Rd , minimize f (~x ) subject to gi (~x ) = 0 for i = 1, 2, . . . , p. Pp 1 The Lagrangian: L(~ x , ~λ) = f (~x ) + i=1 λi gi (~x ), where the λi are dummy variables called Lagrange multipliers. 2 Compute ∂L ∂L , i = 1, . . . , d, and , i = 1, . . . , p. ∂xi ∂λi 3 Solve the d + p equations for the stationary point ~ x ∗ and ~λ. Example Minimize f (x, y ) = x + 2y subject to x 2 + y 2 − 4 = 0. k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 139 / 222 Support Vector Machines Linearly Separable – Inequality Constraints Data Mining Keith E Emmert For ~x ∈ Rd , minimize f (~x ) subject to hi (~x ) ≤ 0 for i = 1, . . . , q. Then the Lagrangian Sub-setting Data L(~x , ~λ) = f (~x ) + Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves p X λi hi (~x ) i=1 is generalized and solved using the Karush-Kuhn-Tucker conditions ∂L = 0, i = 1, . . . , d ∂xi hi (~x ) ≤ 0, i = 1, . . . , q λi ≥ 0 i = 1, . . . , q k-Nearest Neighbors λi hi (~x ) = 0, i = 1, . . . , q. Example x +y y That is, subject to h1 (~x ) = x + y − 2 and h2 (~x ) = x − y . 2 2 Minimize f (x, y ) = (x − 1) + (y − 3) subject to Keith E Emmert (TSU) Data Mining ≤2 ≥x October 23, 2013 140 / 222 Support Vector Machines A Little Math Data Mining Example Keith E Emmert Sub-setting Data Classification Decision Trees Consider f (~x ) = ||~x ||2 and g (~x ) = ~x · ~a, for ~x ,~a ∈ Rn . Recall p Pn ||~x || = h~x , ~x i =⇒ ||~x ||2 = h~x , ~x i = i=1 xi2 . ∂f ∂x 1 ∂f = . . . . ∂~x ∂f Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves ∂xn Hence, n ∂f ∂ X 2 ∂f ∂ = xi = 2xi =⇒ = ||~x ||2 = 2~x ∂xi ∂xi ∂~x ∂~x k-Nearest Neighbors i=1 and n ∂ X ∂g ∂ ∂g = xi ai = ai =⇒ = (~x · ~a) = ~a. ∂xi ∂xi ∂~x ∂~x i=1 Keith E Emmert (TSU) Data Mining October 23, 2013 141 / 222 Support Vector Machines Linearly Separable - The Primal Problem Data Mining Recall, we have a constrained optimization problem, Keith E Emmert min Sub-setting Data ~ ∈Rd w ||~ w ||2 subject to yi (~ w · ~xi + b) ≥ 1 2 Classification Decision Trees Bayes Classification for all i = 1, . . . , N. Lagrange multipliers yields the primal N X 1 2 ~ ~ , b) = ||~ w || − λi yi (~ w · ~xi + b) − 1 . Lp (λ, w 2 Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors i=1 To minimize the Lagrangian, N X ∂Lp ~ = = 0 =⇒ w λi yi ~xi , ~ ∂w i=1 N X ∂Lp = 0 =⇒ λi yi = 0. ∂b i=1 The Karush-Kahn-Tucker Conditions: λi ≥ 0, λi [1 − yi (~ w · ~xi + b)] = 0, for i = 1, 2, . . . , N. 
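For concreteness, here is a quick worked solution of the small equality-constraint example posed above (minimize f(x, y) = x + 2y subject to x² + y² − 4 = 0); it is a side calculation, separate from the SVM derivation itself.

L(x, y, λ) = x + 2y + λ(x² + y² − 4).
∂L/∂x = 1 + 2λx = 0,  ∂L/∂y = 2 + 2λy = 0,  ∂L/∂λ = x² + y² − 4 = 0.
The first two equations give x = −1/(2λ) and y = −1/λ; substituting into the constraint gives 1/(4λ²) + 1/λ² = 4, so λ = ±√5/4.
Taking λ = √5/4 gives (x, y) = (−2/√5, −4/√5) with f = −2√5, the minimum; λ = −√5/4 gives (2/√5, 4/√5) with f = 2√5, the maximum.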
Keith E Emmert (TSU) Data Mining October 23, 2013 142 / 222 Support Vector Machines Linearly Separable - Support Vectors Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Inferences from the Karush-Kahn-Tucker Conditions: ( λi = 0, 1 − yi (~ w · ~xi + b) 6= 0 λi [1 − yi (~ w · ~xi + b)] = 0 =⇒ λi > 0, 1 − yi (~ w · ~xi + b) = 0. Note that yi (~ w · ~xi + b) = 1 Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors if and only if the vector ~xi lies along one of the hyper-planes bi1 or bi2 . Such a vector is called a support vector (hence the name, Support Vector Machines). Definition In the above context, ~xi is a support vector if and only if the corresponding λi > 0. Keith E Emmert (TSU) Data Mining October 23, 2013 143 / 222 Support Vector Machines Linearly Separable - The Dual Lagrangian Problem Data Mining Recall: the primal is Keith E Emmert N X 1 2 ~ ~ , b) = ||~ Lp (λ, w w || − λi yi (~ w · ~xi + b) − 1 . 2 Sub-setting Data Classification i=1 Decision Trees Bayes Classification where we know ~ = w Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves N X N X λi yi ~xi , i=1 λi yi = 0. i=1 ~ , b), we obtain the Dual problem: Substituting the above into Lp (~λ, w LD = k-Nearest Neighbors N X N λi − i=1 subject to N X N 1 XX λi λj yi yj h~xj , ~xi i, 2 i=1 j=1 λi yi = 0 and λi ≥ 0. Note that this removes i=1 ~ and b! dependence upon w Keith E Emmert (TSU) Data Mining October 23, 2013 144 / 222 Support Vector Machines Linearly Separable - The Dual Lagrangian Problem Data Mining Keith E Emmert The Dual problem: LD = N X N λi − i=1 Sub-setting Data to Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors N 1 XX λi λj yi yj h~xj , ~xi i, subject 2 i=1 j=1 PN i=1 λi yi = 0 and λi ≥ 0. Since the “quadratic” term is negative, this is now a maximization problem rather than a minimization problem. N X ~ = Of course w λi yi ~xi . i=1 b is found via λi [1 − yi (~ w · ~xi + b)] = 0, i = 1, . . . , N using the support vectors (when λi > 0) ~xi . (Just average all the b’s together...) ! N X The decision boundary is λi yi ~xi · ~x + b = 0. i=1 Classification: f (~z ) = sign(~ w ·~z + b) = sign N X ! λi yi ~xi · z + b . i=1 Keith E Emmert (TSU) Data Mining October 23, 2013 145 / 222 Support Vector Machines ~ and b: Back to the Old Example... Linearly Separable – Finding w ~ is The vector w Keith E Emmert 8 Data Mining ~ = w 6 Sub-setting Data Classification N X λi yi ~xi Decision Trees 4 i=1 = (−0.7071, −0.6797). Class A Class B 2 Bayes Classification 0 −2 Bi b i1 Receiver Operating Characteristics (ROC) and Precision - Recall Curves b = 3.9408 (R reports the negative of our b, which I’ve corrected for.) b i2 Support Vector Machines −2 0 2 4 6 8 k-Nearest Neighbors Decision function is given by The equation of the decision ~ · ~z + b = 0 or boundary is w −0.7071z1 +−0.6797z2 +3.9408 = 0. f (~z ) = sign(~ w · ~z + b) = sign(−0.7071z1 + −0.6797z2 + 3.9408). Keith E Emmert (TSU) Data Mining October 23, 2013 146 / 222 Support Vector Machines Using R Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification There are several packages for using Support Vector Machines. 
A few of them include kernlab: The one we’ll use e1071: Perhaps the first implementation of SVM klaR svmpath Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 147 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel Data Mining Keith E Emmert Generate 200 ordered pairs, there are 2 positive classes with different means, and one negative class Sub-setting Data Classification 8 Decision Trees Positive Negative 6 Bayes Classification y 2 Receiver Operating Characteristics (ROC) and Precision - Recall Curves 4 Support Vector Machines −2 0 k-Nearest Neighbors −2 Keith E Emmert (TSU) 0 2 4 Data x Mining 6 8 October 23, 2013 148 / 222 Support Vector Machines Linear Non-Separable: The Idea Data Mining Slack variables: ξi > 0, ∀i: Keith E Emmert ~ · ~xi + b ≥ 1 − ξi , w yi = 1 =⇒ yi [~ w · ~xi + b] ≥ 1 − ξi . ~ · ~xi + b ≤ −1 + ξi , yi = −1 w Sub-setting Data Classification 3.5 Decision Trees Support Vector Machines 3.0 Receiver Operating Characteristics (ROC) and Precision - Recall Curves 2.5 Bayes Classification ξi 1.5 ξi > 0 is a penalty for misclassification of pi . 1.0 2.0 ||w|| ξi ||~ w || y pi 2 0.5 0.0 <w, x> + b = 1 1 <w, x> + b = 0 <w, x> + b = −1 0 <w, x> + b = −1 + ξi k-Nearest Neighbors Clearly not linearly separable. pi will not be correctly classified. 3 4 5 6 is the distance from ~ · ~x + b = −1 to the “noisy” w point pi . This is an estimate of the error of the decision boundary on pi . x Keith E Emmert (TSU) Data Mining October 23, 2013 149 / 222 Support Vector Machines Linear Non-Separable: The Objective Function Data Mining The New Objective Function Keith E Emmert ||~ w ||2 f (~ w) = +C 2 Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors N X !k ξi . i=1 Seek to minimize the objective function. !k N X The term C ξi represents a penalty for a decision i=1 boundary with large values of slack variables which misclassify many training examples. Simplification: k = 1. Keith E Emmert (TSU) Data Mining October 23, 2013 150 / 222 Support Vector Machines Linear Non-Separable: The Primal Data Mining N Keith E Emmert Sub-setting Data Classification Lp : X ||~ w ||2 ξi +C 2 i=1 | {z } objective function Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves − N X λi [yi (~ w · ~xi + b) − 1 + ξi ] − i=1 | N X µi ξ i i=1 {z inequality constraints + slack variables } | {z } ξi s non-negativity requirements The Karush-Kahn-Tucker Constraints: (to transform into equality constraints for optimization) ξi ≥ 0, λi ≥ 0, µi ≥ 0, µi ξi = 0, λi [yi (~ w · ~xi + b) − 1 + ξi ] = 0. k-Nearest Neighbors N X ∂Lp ~ = 0 =⇒ w λi yi ~xi , ~ ∂w i=1 ∂Lp = 0 =⇒ ∂b N X λi yi = 0, i=1 ∂Lp = 0 =⇒ λi + µi = C . ∂ξi Keith E Emmert (TSU) Data Mining October 23, 2013 151 / 222 Support Vector Machines Linear Non-Separable: The Dual Data Mining Gentle mathematics yields Keith E Emmert Sub-setting Data LD : Classification N X N λi − i=1 Decision Trees N 1 XX λi λj yi yj ~xi · ~xj . 2 i=1 j=1 Note that the following restrictions hold: Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves 0 ≤ λi ≤ C , µi ≥ 0, N X λi yi = 0. 
i=1 The solution is ~ w k-Nearest Neighbors N X λi yi ~xi i=1 with b an average of the b’s obtained by solving yi (~ w · ~xi + b) − 1 + ξi = 0 when λi > 0 (i.e. using support vector ~xi ). Keith E Emmert (TSU) Data Mining October 23, 2013 152 / 222 Support Vector Machines Linear Non-Separable: Classification Data Mining Keith E Emmert The decision boundary is Decision Trees ! λi yi ~xi · ~x + b = 0. i=1 Sub-setting Data Classification N X Classification: f (~z ) = sign(~ w ·~z + b) = sign N X ! λi yi ~xi · z + b . i=1 Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 153 / 222 Support Vector Machines Linear Non-Separable: Hard Margin vs Soft Margin Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors A hard-margin SVM does not allow slack variables, ξi s. Hence, one constraint is simply λi ≥ 0. A soft-margin SVM does allow slack variables, ξi s. Hence, one constraint is simply 0 ≤ λi ≤ C . The Role of C : If C → ∞, then the soft-margin SVM becomes a hard-margin SVM because N X ||~ w ||2 1 In Lp , try to minimize: +C ξi . Hence, to minimize Lp 2 i=1 as C → ∞, simply take ξi = 0 for all i. 2 In Ld , the dual, the constraint for the soft-margin 0 ≤ λi ≤ C becomes λi ≥ 0 (the constraint for the hard-margin) when C → ∞. Finding C : Use a grid search. Keith E Emmert (TSU) Data Mining October 23, 2013 154 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Training Data is 75% Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors > > > > + + > library(kernlab) myC = 1 # The parameter C for most of this section # Do some training mySVM = ksvm(trainData, labelTrain, type="C-svc", kernel='vanilladot', C=myC, scaled=c(), kpar=list()) mySVM # Look at the summary Support Vector Machine object of class "ksvm" SV type: C-svc (classification) parameter : cost C = 1 Linear (vanilla) kernel function. Number of Support Vectors : 271 Objective Function Value : -269.9585 Training error : 0.35 Keith E Emmert (TSU) Data Mining October 23, 2013 155 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Visual Results SVM classification plot Data Mining Keith E Emmert 2 8 Sub-setting Data 6 Classification 1 Decision Trees 4 Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors 0 2 −1 0 −2 −2 −2 0 Keith E Emmert (TSU) 2 4 6 8 Data Mining In the figure, we have the training results when C is 1. for a (linear since we’ve used ”vanilladot’ for a kernel) SVM. The filled triangles and circles are the support vectors. Values near zero are close to the decision boundary. What a mess. Looks like most vectors are training vectors. 
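The plot above and the test-set predictions used later can be reproduced with something along these lines (kernlab's plot method for a two-dimensional ksvm model; testData and labelTest come from the earlier 75/25 split):

> plot(mySVM, data = trainData)       # contour plot; filled symbols are the support vectors
> myPred <- predict(mySVM, testData)  # predicted labels for the held-out 25%
> table(labelTest, myPred)            # confusion matrix on the test set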
October 23, 2013 156 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding λi s, b, and Support Vectors Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Note that alpha(mySVM) returns the positive λi s and the corresponding indices with alphaindex(mySVM) which can be used to find the corresponding support vectors. coef(mySVM) returns yi λi s. The negative intercept is given by b(mySVM). Let’s just look at a few of them. > coef(mySVM)[[1]][1:20] [1] -1 1 -1 -1 1 -1 -1 1 1 -1 -1 1 1 -1 1 1 1 [18] 1 1 -1 > alpha(mySVM)[[1]][1:20] [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > alphaindex(mySVM)[[1]][1:20] [1] 1 2 3 4 5 6 7 8 9 10 12 14 15 17 18 19 20 [18] 22 23 24 > b(mySVM) [1] -0.7650215 Keith E Emmert (TSU) Data Mining October 23, 2013 157 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Do-It-Yourself Decision Boundary Data Mining Keith E Emmert Sub-setting Data To generate the same picture as the previous contour plot, one must swap the columns around as follows. We’ll also fill in the training vectors as solid circles and triangles. Decision Trees 8 Classification 4 Receiver Operating Characteristics (ROC) and Precision - Recall Curves 2 Support Vector Machines 6 Bayes Classification −2 0 k-Nearest Neighbors −2 Keith E Emmert (TSU) 0 2 4 6 Data Mining 8 October 23, 2013 158 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Do-It-Yourself Decision Boundary and Contour Plot Now we can compare the contour-plot with the plot showing the boundaries side-by-side. It does not look like we did a very good job with this classifier – we’ve got a lot of circles and triangles on both sides of the decision boundary – which shouldn’t be a big surprise! Data Mining Keith E Emmert Sub-setting Data Classification SVM classification plot Decision Trees 2 4 0 4 1 2 Receiver Operating Characteristics (ROC) and Precision - Recall Curves 6 2 −1 0 Support Vector Machines 6 8 8 Bayes Classification 0 k-Nearest Neighbors −2 −2 −2 −2 0 Keith E Emmert (TSU) 2 4 6 8 −2 Data Mining 0 2 4 6 8 October 23, 2013 159 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Have We Done a Good Job? Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Recall that we trained our toy data with C = 1 and used the “vanniladot” kernel, indicating a linear SVM. The training error was reported and is 0.35. Have we done a good job? Let’s look at the predictions on the labels first using the test set. myPred labelTest -1 1 -1 35 14 1 29 22 Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 160 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Have We Done a Good Job? Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves Next, we can compute the accuracy of the model. This is computed in a fairly straightforward manner. P δ(myPred, labelTest) Accuracy = , |labelTest| ( 1, x = y where δ(x, y ) = is the Kronecker delta function. 
0, x 6= y [1] 0.57 The test-error (or training error when using training data) is found in a similar manner where we count the differences. [1] 0.43 k-Nearest Neighbors Keith E Emmert (TSU) Data Mining October 23, 2013 161 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding a Better Model Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification > myNewSVM # Look at the summary Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors However, we might be able to do better by modifying the parameter C . Recall that we set C = 1. Most people recommend using a grid search to find the “best” value for C . So, let’s assume that 10−k ≤ C ≤ 10k for some k ∈ N. Remember that smaller C s indicate that we should not penalize misclassifications too much while larger C s indicate greater penalties. Let’s use C = 105 . Support Vector Machine object of class "ksvm" SV type: C-svc (classification) parameter : cost C = 1e+05 Linear (vanilla) kernel function. Number of Support Vectors : 191 Objective Function Value : -22054852 Training error : 0.303333 Keith E Emmert (TSU) Data Mining October 23, 2013 162 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding a Better Model SVM classification plot Data Mining Keith E Emmert 8 4 Sub-setting Data 6 Classification 2 Decision Trees Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors 0 4 Bayes Classification −2 2 −4 0 −6 −2 −8 −2 0 Keith E Emmert (TSU) 2 4 6 In the figure, we have the training results when C is 1e + 05. for a (linear since we’ve used ”vanilladot’ for a kernel) SVM. The filled triangles and circles are the support vectors. Values near zero are close to the decision boundary. 8 Data Mining October 23, 2013 163 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Compare the Two Models It is interesting to compare the two models together to see how radically things have changed. (The old model is on the left, the new on the right.) Solid circles/triangles represent support vectors. Data Mining Keith E Emmert Sub-setting Data SVM classification plot Classification SVM classification plot 2 8 8 Decision Trees Bayes Classification 6 Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors 4 1 4 0 6 2 0 4 −2 2 2 −1 −4 0 0 −6 −2 −2 −2 −2 0 Keith E Emmert (TSU) 2 4 6 8 −8 −2 Data Mining 0 2 4 6 8 October 23, 2013 164 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Compare the Two Models Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Let’s look at the predictions on the labels of the old model. Here are the predictions for the new model. myPred labelTest -1 1 -1 35 14 1 29 22 myNewPred labelTest -1 1 -1 41 8 1 29 22 Next, we can compute the accuracy of the old model. Next, we can compute the accuracy of the new model. [1] 0.57 [1] 0.63 The test-error of the old model. The test-error of the new model. 
[1] 0.43 [1] 0.37 Keith E Emmert (TSU) Data Mining October 23, 2013 165 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding C Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors For the linear SVM example, we can tune C as follows. Let’s try k = 10. Then we have 2k + 1 = 21 SVMs to consider. Note that the option cross=10 indicates that 10-fold cross validation should be used on the full data set. > k=10 > mySVMVec = NULL > for (i in -k:k) + { + mySVMVec = c(mySVMVec, + ksvm(rbind(posClass, negClass), labels, + type="C-svc", kernel='vanilladot', + C=10^i, scaled=c(), kpar=list(), + cross=10) + ) + + } Keith E Emmert (TSU) Data Mining October 23, 2013 166 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding C Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Receiver Operating Characteristics (ROC) and Precision - Recall Curves k-Nearest Neighbors Now, let’s take a look at the cross-validation errors generated by various C s. > xCross=-k:k > yCross=NULL > for (i in 1:(2*k+1)) + { + yCross = c(yCross, cross(mySVMVec[[i]])) + } > yCross [1] 0.5600 0.5775 0.5750 0.5750 0.5825 0.5725 0.5850 [8] 0.5625 0.4225 0.4275 0.4325 0.4200 0.4350 0.4375 [15] 0.4525 0.3775 0.3675 0.4150 0.4725 0.4275 0.5100 Keith E Emmert (TSU) Data Mining October 23, 2013 167 / 222 Support Vector Machines Linear Non-Separable: Simple Example Using vanilladot (Linear) Kernel – Finding C Data Mining Keith E Emmert We might also generate a simple plot as C varies. Choose the smallest C that minimizes the CV-Error. log10(C) vs CV Error Sub-setting Data Classification 0.55 Decision Trees Bayes Classification 0.50 0.45 k-Nearest Neighbors 0.40 Receiver Operating Characteristics (ROC) and Precision - Recall Curves CV−Error Support Vector Machines −10 Keith E Emmert (TSU) −5 0 log10(C) Data Mining 5 10 October 23, 2013 168 / 222 Support Vector Machines Non-Linearly Separable 4 Data Mining Keith E Emmert Sub-setting Data 2 Classification Decision Trees Receiver Operating Characteristics (ROC) and Precision - Recall Curves 0 −2 Support Vector Machines y Bayes Classification −4 k-Nearest Neighbors −4 −2 0 2 4 x Keith E Emmert (TSU) Data Mining October 23, 2013 169 / 222 Support Vector Machines Non-Linearly Separable Using (x1 , x2 ) 7→ (x12 , x22 ) Data Mining 15 Keith E Emmert Sub-setting Data Classification Decision Trees Receiver Operating Characteristics (ROC) and Precision - Recall Curves 5 Support Vector Machines y 10 Bayes Classification 0 k-Nearest Neighbors 0 5 10 15 x Keith E Emmert (TSU) Data Mining October 23, 2013 170 / 222 Support Vector Machines Non-Linearly Separable – The Idea Data Mining Keith E Emmert Sub-setting Data Classification Decision Trees Bayes Classification Support Vector Machines Transform the input space – non-linearly separable – using a function, φ, into what we will call the feature space – linearly separable. In the feature space, we have ~ · φ(~xi ) + b = 0. The linear decision boundary given by w ||~ w || subject to yi [~ w · φ(~xi ) + b] ≥ 1, for Learning: min ~ w 2 i = 1, 2, . . . , N where yi = ±1. 
Support Vector Machines – Non-Linearly Separable – The Idea

Transform the input space (non-linearly separable) using a function φ into what we will call the feature space (linearly separable). In the feature space, the linear decision boundary is given by w · φ(x_i) + b = 0.

Learning: minimize ||w||²/2 over w, subject to y_i [w · φ(x_i) + b] ≥ 1 for i = 1, 2, ..., N, where y_i = ±1.

The dual Lagrangian is

  L_D = Σ_{i=1}^N λ_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j φ(x_i) · φ(x_j).

Then w = Σ_{i=1}^N λ_i y_i φ(x_i), and the optimality conditions give λ_i [ y_i (w · φ(x_i) + b) − 1 ] = 0.

Classification:

  f(z) = sign[ w · φ(z) + b ] = sign[ Σ_{j=1}^N λ_j y_j φ(x_j) · φ(z) + b ].

Non-Linearly Separable – Problems

What mapping function should be used to generate a linear decision boundary? The transformed space is likely to be a very high (infinite?) dimensional space. Finding the λ_i in the transformed space is likely to be computationally expensive, and classification involves lots of dot products.

Non-Linearly Separable – The Kernel Trick

Theorem (Mercer's Theorem). A kernel function K can be expressed as K(u, v) = φ(u) · φ(v) if and only if, for all functions g such that ∫ [g(x)]² dx < ∞, we have

  ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.

Such kernel functions are called positive definite kernel functions.

Non-Linearly Separable – The Application of Mercer's Theorem

With such a positive definite kernel function K, we compute b via (for λ_i > 0)

  y_i ( Σ_{j=1}^N λ_j y_j K(x_i, x_j) + b ) − 1 = 0,

and classify via

  f(z) = sign[ Σ_{i=1}^N λ_i y_i φ(x_i) · φ(z) + b ] = sign[ Σ_{i=1}^N λ_i y_i K(x_i, z) + b ].

Notes: We no longer need the exact form of the mapping function φ. We don't need to compute dot products in a transformed space, and we avoid high-dimensional spaces since we remain in the original space.

Non-Linearly Separable – Kernel Facts

If K1 and K2 are kernels, then the following functions are kernels:
K(x, y) = K1(x, y) + K2(x, y);
K(x, y) = αK1(x, y), for α ∈ R⁺ = (0, ∞);
K(x, y) = K1(x, y) K2(x, y).
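As a quick numeric illustration of the first closure property, the sketch below builds the Gram matrix of a sum of two kernels on a few iris rows and checks that its smallest eigenvalue is (numerically) non-negative, i.e. that the sum is still positive semi-definite. This is only a spot check on one data set, not a proof, and the kernel parameters here are arbitrary.

> k1 = function(u, v) exp(-0.5 * sum((u - v)^2))    # a Gaussian RBF kernel
> k2 = function(u, v) (sum(u * v) + 1)^2            # a degree-2 polynomial kernel
> gram = function(k, X) outer(1:nrow(X), 1:nrow(X),
+                             Vectorize(function(i, j) k(X[i, ], X[j, ])))
> X = as.matrix(iris[1:10, 1:4])
> G = gram(function(u, v) k1(u, v) + k2(u, v), X)   # Gram matrix of K1 + K2
> min(eigen(G, symmetric = TRUE)$values) >= -1e-8   # TRUE: positive semi-definite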
Support Vector Machines – Kernels in R

Let u, v ∈ Rⁿ and let ⟨u, v⟩ be any inner product on Rⁿ.

Polynomial: k(u, v) = (α⟨u, v⟩ + β)^d, where α is a scale parameter, β is an offset, and d is the degree. Used in image processing.
Linear: the special case α = d = 1, β = 0. Useful when dealing with large sparse data vectors, as can be found in text mining.
Gaussian Radial Basis Function (RBF): k(u, v) = exp(−||u − v||² / (2σ²)). A general purpose kernel, used when there is no pressing reason to pick another kernel.
Hyperbolic Tangent: k(u, v) = tanh(α⟨u, v⟩ + β), where α is a scale parameter and β is an offset. Note that not every parameter choice generates a valid kernel!
Bessel function of the first kind: k(u, v) = Bessel^n_{ν+1}(σ||u − v||) / ||u − v||^(−n(ν+1)). Another general purpose kernel. Here n ∈ N is the degree of the Bessel function, σ is the inverse kernel width, and ν comes from the second-order differential equation which gives rise to the Bessel function.
Laplace Radial Basis Function (RBF): k(u, v) = exp(−σ||u − v||). Another general purpose kernel.

Non-Linearly Separable – Soft Margins

In a similar manner, include slack variables ξ_i to deal with the case when the transformed space is not linearly separable.

Non-Linearly Separable

Everything that we've talked about (with the obvious changes, since we now have a non-linear rather than a linear boundary) applies. However, there might be more than the one parameter, C, to consider. For example, the Gaussian kernel allows both C and σ to vary.

> mySVMGauss = ksvm(trainData, labelTrain, type="C-svc",
+                   kernel='rbf', kpar=list(sigma=1),
+                   C=myC, scaled=c())

We can compare the linear model (left) and the Gaussian model (right). [Figure: two SVM classification plots, linear on the left and Gaussian on the right.] It is pretty clear that the decision boundary curves quite a bit in the Gaussian case (and cannot curve in the linear case).
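Before comparing the two models numerically, the Gaussian model's test-set predictions and confusion matrix can be obtained as below. This is a hedged sketch assuming the trainTest and labelTest objects from the earlier linear example; myPredGauss is the name used on the following slides.

> myPredGauss = predict(mySVMGauss, trainTest)   # predictions of the sigma = 1 Gaussian SVM
> table(labelTest, myPredGauss)                  # rows: actual labels, columns: predictions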
Support Vector Machines – Non-Linearly Separable

Have we done a good job? Let's look at the predictions on the labels from both methods.

         myNewPred               myPredGauss
labelTest -1  1        labelTest -1  1
       -1 41  8               -1 39 10
       1  29 22               1   5 46

Model      Training Error   Accuracy   Test Error
Linear     0.303333         0.63       0.37
Gaussian   0.123333         0.85       0.15

It looks like the nonlinear model is a better classifier. However, we need to remember that there are now two parameters to tune, C and σ!

Non-Linearly Separable

This time, let's let R choose σ for us. For now, let's keep C = 1.

Using automatic sigma estimation (sigest) for RBF or laplace kernel

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 1

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.35671985783464

Number of Support Vectors : 139

Objective Function Value : -109.8726
Training error : 0.133333

Non-Linearly Separable – R Chooses σ

[Figure: SVM classification plot for the automatic-σ Gaussian model.] Notice that some of the support vectors have changed.

Non-Linearly Separable

         myNewPred               myPredGauss             myPredGauss2
labelTest -1  1        labelTest -1  1        labelTest -1  1
       -1 41  8               -1 39 10               -1 38 11
       1  29 22               1   5 46               1   6 45

Model               Training Error   Accuracy   Test Error
Linear              0.303333         0.63       0.37
Gaussian            0.123333         0.85       0.15
Gaussian w/Auto σ   0.133333         0.83       0.17
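The training errors and support-vector counts in these summaries can also be read straight off the fitted objects with kernlab's accessor functions. A brief sketch follows; mySVMGaussAuto is an assumed name for the automatic-σ model, since the slides do not name that object.

> error(mySVMGauss)         # training error of the sigma = 1 Gaussian model
> nSV(mySVMGauss)           # number of support vectors
> kernelf(mySVMGaussAuto)   # kernel function, including the sigma chosen by sigest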
Support Vector Machines – Non-Linearly Separable

However, we might be able to do better by modifying the parameter C. Recall that we set C = 1. Most people recommend using a grid search to find the "best" value of C, so let's assume that 10^(-k) ≤ C ≤ 10^k. We now try C = 5.

Using automatic sigma estimation (sigest) for RBF or laplace kernel

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 5

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.320905142072364

Number of Support Vectors : 123

Objective Function Value : -470.6818
Training error : 0.133333

Non-Linearly Separable

[Figure: SVM classification plot for the automatic-σ Gaussian model with C = 5.] Notice that some of the support vectors have changed.

Non-Linearly Separable

         myNewPred               myPredGauss             myPredGauss2            myPredGauss3
labelTest -1  1        labelTest -1  1        labelTest -1  1        labelTest -1  1
       -1 41  8               -1 39 10               -1 38 11               -1 38 11
       1  29 22               1   5 46               1   6 45               1   5 46

Model                      Training Error   Accuracy   Test Error
Linear                     0.303333         0.63       0.37
Gaussian                   0.123333         0.85       0.15
Gaussian w/Auto σ          0.133333         0.83       0.17
Gaussian w/Auto σ, C = 5   0.133333         0.84       0.16

Non-Linearly Separable

For ease of comparison, here are the four contour plots. [Figure: four SVM classification plots – Linear, Basic Gaussian, Basic Gaussian with automatic σ, and Gaussian with C = 5 and automatic σ.]

Non Linear: Simple Example Using a Gaussian Kernel – Finding C

For the Gaussian SVM example, we can tune C as follows. We'll let R guess the best σ. Let's try k = 10 and search for 10^(-k) ≤ C ≤ 10^k. Then we have 2k + 1 = 21 SVMs to consider.

> k=10
> myNonLinearSVMVec = NULL
> for (i in -k:k)
+ {
+   myNonLinearSVMVec = c(myNonLinearSVMVec,
+                         ksvm(trainData, labelTrain, type="C-svc",
+                              kernel='rbf', C=10^i, scaled=c())
+                        )
+ }
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel
Using automatic sigma estimation (sigest) for RBF or laplace kernel

Now, let's take a look at the test errors generated by the various values of C.
> xNonLinear=-k:k
> yNonLinear=NULL
> for (i in 1:(2*k+1))
+ {
+   myPredNonLinGauss = predict(myNonLinearSVMVec[[i]],
+                               trainTest)
+   contTable = table(labelTest, myPredNonLinGauss)
+   testError = (sum(contTable) - sum(diag(contTable)))/sum(contTable)
+   yNonLinear = c(yNonLinear, testError)
+ }
> yNonLinear
 [1] 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.18
[11] 0.17 0.16 0.16 0.16 0.19 0.18 0.25 0.39 0.46 0.42
[21] 0.36

We might also generate a simple plot as C varies; choose the smallest C that minimizes the test error. [Figure: log10(C) vs Test Error.]

Non Linear: Simple Example Using a Gaussian Kernel – The Best C

The "best" C = 10^1. The confusion matrix for this C is

         myPredNonLinGauss
labelTest -1  1
       -1 38 11
       1   5 46

Non-Linearly Separable – The Best C

[Figure: SVM classification plot for the best-C Gaussian model.] Notice that some of the support vectors have changed.

Non-Linearly Separable – Summaries

         myNewPred               myPredGauss
labelTest -1  1        labelTest -1  1
       -1 41  8               -1 39 10
       1  29 22               1   5 46

         myPredGauss2            myPredGauss3            myPredNonLinGauss
labelTest -1  1        labelTest -1  1        labelTest -1  1
       -1 38 11               -1 38 11               -1 38 11
       1   6 45               1   5 46               1   5 46

Model                         Training Error   Accuracy   Test Error
Linear                        0.303333         0.63       0.37
Gaussian                      0.123333         0.85       0.15
Gaussian w/Auto σ             0.133333         0.83       0.17
Gaussian w/Auto σ, C = 5      0.133333         0.84       0.16
Gauss. w/Auto σ, C = 10^1     0.123333         0.84       0.16
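Since the test errors are already stored in yNonLinear, the index of the "best" model and its value of C can be pulled out directly; this mirrors the which.min(yNonLinear) call used a few slides later for the ROC curves.

> bestIdx = which.min(yNonLinear)    # position of the smallest test error
> 10^xNonLinear[bestIdx]             # the corresponding C; 10 in the run shown above
> myNonLinearSVMVec[[bestIdx]]       # the fitted "best" Gaussian SVM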
ROC and Precision-Recall Curves – Class Imbalance Problem

For inspections (e.g., fraud detection), you hope the number of problems is small relative to the number of successes. This gives rise to unbalanced data sets. The rare instances are susceptible to noisy data, and accuracy is a bad measure for rare cases.

Class Imbalance Problem

                      Predicted Class
                      +      −
Actual Class   +      TP     FN      P = TP + FN
               −      FP     TN      N = FP + TN

True Positive TP: # of positive examples correctly predicted by a classification model.
False Negative FN: # of positive examples wrongly predicted as negative.
False Positive FP: # of negative examples wrongly predicted as positive.
True Negative TN: # of negative examples correctly predicted.

Accuracy = Correct/Total = (TP + TN)/(TP + FP + FN + TN).

Class Imbalance Problem

Sensitivity or True Positive Rate: TPR = TP/(TP + FN) is the fraction of positive examples predicted correctly.
Specificity or True Negative Rate: TNR = TN/(TN + FP) is the fraction of negative examples predicted correctly.
False Positive Rate: FPR = FP/(TN + FP) is the fraction of negative examples predicted as the positive class.
False Negative Rate: FNR = FN/(TP + FN) is the fraction of positive examples predicted as the negative class.

Class Imbalance Problem

Recall or Sensitivity: r = TPR = TP/(TP + FN) is the fraction of positive examples predicted correctly.
Precision: p = TP/(TP + FP) is the fraction of records that actually turn out to be positive in the group the classifier has declared to be the positive class.

Class Imbalance Problem

Recall r = TP/(TP + FN) = 1/(1 + FN/TP), so recall increases exactly when FN decreases or TP increases.
Precision p = TP/(TP + FP) = 1/(1 + FP/TP), so precision increases exactly when FP decreases or TP increases.

A model that declares every record to be the positive class has perfect recall r = 1, but poor precision p < 1 due to a high FP. A model that assigns the positive class only to test records that match one of the positive training set records has high precision p ≈ 1 but low recall r < 1, since the many positive examples in the test set that fail to appear in the training set can be classified as negative, i.e. FN > 0.

Goal: maximize both precision and recall.
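As a small check against the SVM example from earlier, the sketch below computes recall and precision from a 2 × 2 table, treating class 1 as the positive class; with the Gaussian model's table (labelTest vs. myPredGauss) this gives roughly r ≈ 0.90 and p ≈ 0.82. The object names are the ones used on the SVM slides.

> prFromTable = function(ct) {               # ct: rows = actual, columns = predicted
+   TP = ct["1", "1"];  FN = ct["1", "-1"]
+   FP = ct["-1", "1"]; TN = ct["-1", "-1"]
+   c(recall = TP/(TP + FN), precision = TP/(TP + FP))
+ }
> prFromTable(table(labelTest, myPredGauss))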
ROC and Precision-Recall Curves – Class Imbalance Problem

Recall r = TP/(TP + FN) and Precision p = TP/(TP + FP).

F1 Measure: F1 = 2/(1/r + 1/p) = 2rp/(r + p) = 2TP/(2TP + FP + FN). This is the harmonic mean of r and p; the harmonic mean tends to be closer to the smaller number, so F1 is large exactly when both r and p are large.

Fβ Measure: Fβ = (β² + 1)rp/(r + β²p) = (β² + 1)TP/((β² + 1)TP + FP + β²FN). Note that F0 = p and F∞ = r, so Fβ measures a trade-off between precision and recall.

Class Imbalance Problem

Weighted accuracy = (w1·TP + w4·TN)/(w1·TP + w2·FP + w3·FN + w4·TN).

Measure     w1       w2   w3   w4
Recall      1        0    1    0
Precision   1        1    0    0
Fβ          β² + 1   1    β²   0
Accuracy    1        1    1    1

Thus, weighted accuracy captures the trade-offs between recall, precision, Fβ, and accuracy.

ROC

Definition. A receiver operating characteristic (ROC) curve is a plot of the sensitivity (true positive rate) on the vertical axis versus the false positive rate on the horizontal axis. Each point on the curve represents a model induced by the classifier. [Figure: an ROC curve, True Positive Rate vs False Positive Rate.]

ROC Curve

(0, 0) represents a model which predicts every instance to be the negative class.
(1, 1) represents a model which predicts every instance to be the positive class.
(0, 1) represents a perfect model: TPR = 1 and FPR = 0.
The dashed line y = x represents a model which classifies a record as positive with some fixed probability, i.e. random guessing!
[Figure: ROC space with these reference points and the diagonal marked.]

ROC Curve

Towards the bottom left corner are "conservative" classifiers, which make positive classifications only with strong evidence; hence few false positives, but a low TP count as well. Towards the top right corner are "liberal" classifiers, which make positive classifications with weak evidence; they classify most positives correctly but have a high FP count.
If your classifier falls below the diagonal, this is bad. However, you can swap your prediction labels (i.e. swap TP and FN, and swap FP and TN) to reflect the classifier above the diagonal. The area under the ROC curve, AUC, can be used to tell which classifier, on average, is better; an AUC ≤ 0.5 is bad!

ROC Curve

The prediction scores are computed for the two "best" models, one linear and one based upon a Gaussian. Note the use of the type="decision" option in the predict function.

> myNewPredLinear = predict(myNewSVM, trainTest,
+                           type="decision")
> myNewPredNonLinGauss = predict(
+     myNonLinearSVMVec[[which.min(yNonLinear)]],
+     trainTest, type="decision")

We can also compute some ROC curves.

> library(ROCR)
> predLinearROC = prediction(myNewPredLinear, labelTest)
> predGaussROC = prediction(myNewPredNonLinGauss,
+                           labelTest)
> perLinearROC = performance(predLinearROC,
+                            measure = "tpr", x.measure = "fpr")
> perGaussROC = performance(predGaussROC,
+                           measure = "tpr", x.measure = "fpr")

ROC Curve

Now, plot the ROC curves. [Figure: two ROC curves, one for the linear model and one for the Gaussian model.]

ROC Curve

Of course, we can plot multiple ROC curves on the same axes. [Figure: both ROC curves on one set of axes; linear is blue, Gaussian is red.]

ROC Curve – Problems

ROC curves are not very sensitive to changes in the class distributions (few − and many + versus equal numbers of − and +). So, if the proportion of positive to negative instances changes, the ROC curve will not detect this. This is true because ROC curves use ratios taken within the rows of the confusion matrix and therefore do not depend upon the class distribution:

  TPR = TP/(TP + FN) and FPR = FP/(TN + FP).

Other measures, such as precision p = TP/(TP + FP), use values from both rows and are sensitive to class skews. (Accuracy and Fβ are also sensitive to class skews.)
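Although the slides do not show it, ROCR can also report the AUC discussed a few slides back directly, using the prediction objects created above; the numeric results are not given on the slides.

> aucLinear = performance(predLinearROC, measure = "auc")@y.values[[1]]
> aucGauss  = performance(predGaussROC,  measure = "auc")@y.values[[1]]
> c(linear = aucLinear, gaussian = aucGauss)   # closer to 1 is better; 0.5 is chance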
ROC and Precision-Recall Curves – ROC Curve – Problems

                      Predicted Class
                      +           −
Actual Class   +      TP = 50     FN = 25     P = TP + FN
               −      FP = 20     TN = 55     N = FP + TN

The ratio of + to − is 75/75 = 1. Here TPR = TP/(TP + FN) = 50/75 = 2/3, FPR = FP/(FP + TN) = 20/75 = 4/15, and Precision = p = TP/(TP + FP) = 50/70 = 5/7.

                      Predicted Class
                      +           −
Actual Class   +      TP = 50     FN = 25     P = TP + FN
               −      FP = 200    TN = 550    N = FP + TN

Now the ratio of + to − is 75/750 = 1/10. Again TPR = 50/75 = 2/3 and FPR = 200/750 = 4/15, but Precision = p = 50/250 = 1/5.

Precision-Recall Graphs

Definition. A precision-recall graph uses precision for the vertical axis and recall for the horizontal axis. (0, 0) is bad; (1, 1) is the goal. When recall is zero, it is possible for precision to be undefined, so when creating your own precision-recall graphs, never let recall become zero.

Connections between ROC Curves and Precision-Recall Graphs

Theorem. For a fixed data set, Model 1 dominates Model 2 in ROC space if and only if Model 1 dominates Model 2 in precision-recall space.

Precision-Recall Graphs

Next, plot the precision/recall curves. [Figure: precision-recall curves for the linear and Gaussian models.]

Precision-Recall Graphs

Again, let's stack the plots. [Figure: both precision-recall curves on one set of axes; linear is blue, Gaussian is red.]
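The stacked plot could be produced along the following lines with the same ROCR prediction objects; this is a hedged sketch, since the slides do not show this code.

> perLinearPR = performance(predLinearROC, measure = "prec", x.measure = "rec")
> perGaussPR  = performance(predGaussROC,  measure = "prec", x.measure = "rec")
> plot(perLinearPR, col = "blue")             # linear model
> plot(perGaussPR,  col = "red", add = TRUE)  # Gaussian model on the same axes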
k-Nearest Neighbors – Eager vs Lazy Learners

An eager learner maps the input attributes to the class label as soon as training data are available; decision trees are examples of this learning style. A lazy learner delays the process of modeling the training data until it is needed to classify the test data; k-nearest neighbors is a lazy learner.

k-Nearest Neighbors – Definition

Definition. The k-nearest neighbors strategy determines the k points closest to an unknown point z. The test example is classified based upon the majority of the class labels of its k nearest neighbors.

[Figure: a small two-class example.] Goal: classify the green circle as a blue triangle or a red square. 1-NN: the point is a triangle. 2-NN: we can't classify the point. 3-NN: the point is a square. Note that "close" means we're using some measure of similarity or dissimilarity based upon the attribute(s) being measured.

k-Nearest Neighbors – Problems

[Figure: another small two-class example.] If k is too small, then the nearest-neighbor algorithm is susceptible to over-fitting due to noise. If k is too large, then it may misclassify because it includes data too far away from the point's neighborhood.

One way to minimize the contributions of distant points is to use distance-weighted voting. Given the k nearest neighbors and their associated labels as a set D_k, let δ(v, y_i) be 1 if the label y_i equals v and 0 otherwise; the predicted label y is the one receiving the largest weighted vote:

  y = argmax_{v ∈ Labels} Σ_{(x_i, y_i) ∈ D_k} [ 1 / d(x_i, z)² ] · δ(v, y_i).

k-Nearest Neighbors – Example

Here, we will use the package "class" and the R function "knn" for classification. Function "knn" takes several arguments.

train     matrix or data frame of training set cases.
test      matrix or data frame of test set cases. A vector will be interpreted as a row vector for a single case.
cl        factor of true classifications of the training set.
k         number of neighbors considered.
l         minimum vote for a definite decision, otherwise doubt. (More precisely, less than k − l dissenting votes are allowed, even if k is increased by ties.)
prob      if true, the proportion of the votes for the winning class is returned as attribute prob.
use.all   controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbors.
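Before the iris grid example that follows, here is a minimal sketch of calling knn on a simple train/test split of iris; the split and object names here are illustrative, not the ones used on the slides.

> library(class)
> set.seed(100)
> idx = sample(1:nrow(iris), 100)                      # 100 training rows, 50 test rows
> knnPred = knn(train = iris[idx, 3:4], test = iris[-idx, 3:4],
+               cl = iris$Species[idx], k = 3, prob = TRUE)
> table(iris$Species[-idx], knnPred)                   # rows: true species, columns: 3-NN predictions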
k-Nearest Neighbors – Example: 3-Nearest Neighbors

The iris data set is labeled Setosa, Versicolor, and Virginica. In this example, we will classify a rectangular grid rather than holding back some data. Only two columns are chosen so that a contour plot can be constructed. Here is the decision boundary. [Figure: "Iris Data" – the 3-NN decision regions over Petal.Length (horizontal axis) and Petal.Width (vertical axis).]

Example: 3-Nearest Neighbors

Now, we utilize the full iris data set. How did we do?

            testLabels
irisClass2   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         17         0
  virginica       0          3        20

Next, we can compute the accuracy of the model.
[1] 0.95
Of course, the test error is
[1] 0.05

Example – Finding the Best k

We compute the test error for a variety of k. Now, let's plot our values. [Figure: test error versus k, for k from 1 to 100.]

Example – Finding the Best k

We can find the k that minimizes the test error by using the which.min(vector) command. So, in this region, our best model corresponds to k = 10, which gives a test error of 0.033333. If we train this model,

            testLabels
bestKNN      setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         18         0
  virginica       0          2        20

Thus, our accuracy is 0.966667. Of course, we already knew this from the test error of 0.033333.
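The search over k is not shown on the slides; a self-contained sketch along these lines would produce a comparable error curve. The split here is illustrative, so the exact errors and the winning k may differ from the 0.033333 and k = 10 reported above.

> library(class)
> set.seed(100)
> idx = sample(1:nrow(iris), 120)                 # illustrative train/test split
> ks = 1:100
> testErr = sapply(ks, function(k) {
+   pred = knn(train = iris[idx, 3:4], test = iris[-idx, 3:4],
+              cl = iris$Species[idx], k = k)
+   mean(pred != iris$Species[-idx])              # test error for this k
+ })
> plot(ks, testErr, type = "l", xlab = "k", ylab = "Test Error")
> ks[which.min(testErr)]                          # k with the smallest test error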