Classification fundamentals
Elena Baralis, Tania Cerquitelli
Politecnico di Torino

Classification: objectives
- prediction of a class label
- definition of an interpretable model of a given phenomenon
(A model is built from training data; it is then applied to unclassified data to obtain classified data.)
Classification: definition
Given
- a collection of class labels
- a collection of data objects labelled with a class label
find a descriptive profile of each class, which will allow the assignment of unlabeled objects to the appropriate class.

Definitions
- Training set: collection of labeled data objects used to learn the classification model
- Test set: collection of labeled data objects used to validate the classification model

Classification techniques
- Decision trees
- Classification rules
- Association rules
- Neural Networks
- Naïve Bayes and Bayesian Networks
- k-Nearest Neighbours (k-NN)
- Support Vector Machines (SVM)
- ...

Evaluation of classification techniques
- Accuracy: quality of the prediction
- Efficiency: model building time, classification time
- Scalability: training set size, attribute number
- Robustness: noise, missing data
- Interpretability: model interpretability, model compactness
Decision trees

Example of decision tree
Training data (From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: decision tree with splitting attributes Refund, MarSt, TaxInc
- Refund = Yes: leaf NO
- Refund = No: test MarSt
  - MarSt = Married: leaf NO
  - MarSt = Single, Divorced: test TaxInc
    - TaxInc < 80K: leaf NO
    - TaxInc > 80K: leaf YES

Another example of decision tree
There could be more than one tree that fits the same data!
- MarSt = Married: leaf NO
- MarSt = Single, Divorced: test Refund
  - Refund = Yes: leaf NO
  - Refund = No: test TaxInc
    - TaxInc < 80K: leaf NO
    - TaxInc > 80K: leaf YES

Apply model to test data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
- Refund = No: follow the No branch to MarSt
- MarSt = Married: reach the leaf labeled NO
- Assign Cheat to "No"
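A minimal Python sketch (not part of the original slides) that encodes the first example tree and applies it to the test record above; the dictionary keys mirror the attribute names used in the slides.

    def classify(record):
        """Walk the decision tree from the root and return the predicted Cheat label."""
        if record["Refund"] == "Yes":
            return "No"                      # Refund = Yes: leaf NO
        if record["MarSt"] == "Married":
            return "No"                      # Married: leaf NO
        # Single or Divorced: test Taxable Income (expressed in thousands, K)
        if record["TaxInc"] < 80:
            return "No"
        return "Yes"

    test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
    print(classify(test_record))             # -> "No": Cheat is assigned to "No"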
Decision tree induction
Many algorithms to build a decision tree:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5, C5.0
- SLIQ, SPRINT

General structure of Hunt's algorithm
Let Dt be the set of training records that reach a node t.
Basic steps:
- If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt is an empty set, then t is a leaf node labeled as the default (majority) class yd.
- If Dt contains records that belong to more than one class, select the "best" attribute A on which to split Dt and label node t as A; split Dt into smaller subsets and recursively apply the procedure to each subset.
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)

Decision tree induction
- Adopts a greedy strategy
  - the "best" attribute for the split is selected locally at each step
  - not a global optimum
- Issues
  - structure of the test condition (binary split versus multiway split)
  - selection of the best attribute for the split
  - stopping condition for the algorithm
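The following Python sketch mirrors Hunt's scheme above under simplifying assumptions: attributes are categorical, the locally "best" attribute is supplied by a caller-provided choose_best_attribute function (hypothetical), and a stopping case for exhausted attributes is added for completeness.

    from collections import Counter

    def hunt(records, attributes, default_class, choose_best_attribute):
        """records: list of (attribute-value dict, class label) pairs reaching the current node (Dt)."""
        if not records:                                    # Dt is empty: leaf labeled with the default class yd
            return {"leaf": default_class}
        labels = [y for _, y in records]
        if len(set(labels)) == 1:                          # all records in the same class yt: leaf labeled yt
            return {"leaf": labels[0]}
        if not attributes:                                 # no attribute left (added stopping case): majority class
            return {"leaf": Counter(labels).most_common(1)[0][0]}
        A = choose_best_attribute(records, attributes)     # locally "best" attribute (greedy choice)
        majority = Counter(labels).most_common(1)[0][0]
        children = {}
        remaining = [a for a in attributes if a != A]
        for value in {x[A] for x, _ in records}:           # split Dt into smaller subsets, one per value of A
            subset = [(x, y) for x, y in records if x[A] == value]
            children[value] = hunt(subset, remaining, majority, choose_best_attribute)
        return {"attribute": A, "children": children}

    # Tiny usage example with a trivial chooser that just picks the first attribute (hypothetical):
    records = [({"Refund": "Yes"}, "No"), ({"Refund": "No"}, "Yes")]
    print(hunt(records, ["Refund"], "No", lambda recs, attrs: attrs[0]))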
Structure of test condition
- Depends on attribute type
  - nominal
  - ordinal
  - continuous
- Depends on number of outgoing edges
  - 2-way split
  - multi-way split

Splitting on nominal attributes
- Multi-way split: use as many partitions as distinct values
  (e.g., CarType: Family, Sports, Luxury)
- Binary split: divides values into two subsets; need to find the optimal partitioning
  (e.g., CarType: {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports})

Splitting on ordinal attributes
- Multi-way split: use as many partitions as distinct values
  (e.g., Size: Small, Medium, Large)
- Binary split: divides values into two subsets; need to find the optimal partitioning
  (e.g., Size: {Small, Medium} vs {Large}, or {Medium, Large} vs {Small})
- What about a split such as Size: {Small, Large} vs {Medium}?

Splitting on continuous attributes
Different techniques:
- Discretization to form an ordinal categorical attribute
  - static: discretize once at the beginning
  - dynamic: discretize during tree induction; ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
- Binary decision: (A < v) or (A ≥ v)
  - consider all possible splits and find the best cut
  - more computationally intensive
Examples: (i) binary split: Taxable Income > 80K?; (ii) multi-way split: Taxable Income in < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K

Selection of the best attribute
Before splitting: 10 records of class 0, 10 records of class 1.
Candidate test conditions:
- Own Car?   Yes: C0: 6, C1: 4;  No: C0: 4, C1: 6
- Car Type?  Family: C0: 1, C1: 3;  Sports: C0: 8, C1: 0;  Luxury: C0: 1, C1: 7
- Student ID?  c1: C0: 1, C1: 0; ...; c10: C0: 1, C1: 0; c11: C0: 0, C1: 1; ...; c20: C0: 0, C1: 1
Which attribute (test condition) is the best?
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
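As an illustration of the binary decision (A < v) search on a continuous attribute, the Python sketch below evaluates every candidate cut on the Taxable Income values of the training table above (with their Cheat labels) and keeps the one with the lowest weighted Gini index, one of the impurity measures introduced in the next section; the function names are illustrative.

    def gini(labels):
        """Gini index of a set of class labels: 1 - sum of squared class frequencies."""
        total = len(labels)
        return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

    def best_binary_cut(values, labels):
        """Try every candidate cut v and return the one minimizing the weighted Gini of (A < v) vs (A >= v)."""
        pairs = sorted(zip(values, labels))
        best_v, best_score = None, float("inf")
        for i in range(1, len(pairs)):
            v = (pairs[i - 1][0] + pairs[i][0]) / 2          # midpoint between consecutive sorted values
            left = [y for x, y in pairs if x < v]
            right = [y for x, y in pairs if x >= v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    # Taxable Income (in K) and Cheat labels from the training table above
    incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_binary_cut(incomes, cheat))   # prints the best cut and its weighted Gini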
Selection of the best attribute
- Attributes with a homogeneous class distribution are preferred
- Need a measure of node impurity
  - e.g., C0: 5, C1: 5: non-homogeneous, high degree of impurity
  - e.g., C0: 9, C1: 1: homogeneous, low degree of impurity

Measures of node impurity
- Many different measures available
  - Gini index
  - Entropy
  - Misclassification error
- Different algorithms rely on different measures
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
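A short sketch, assuming the standard definitions of these measures (Gini = 1 - Σ pi², Entropy = -Σ pi log2 pi, Misclassification error = 1 - max pi), applied to the two node distributions above:

    import math

    def impurity(counts):
        """Compute Gini index, entropy and misclassification error for the class counts at a node."""
        total = sum(counts)
        probs = [c / total for c in counts]
        gini = 1.0 - sum(p * p for p in probs)
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        error = 1.0 - max(probs)
        return gini, entropy, error

    print(impurity([5, 5]))   # high impurity:  (0.5, 1.0, 0.5)
    print(impurity([9, 1]))   # low impurity:   (0.18, 0.469, 0.1)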
Decision tree based classification
- Advantages
  - inexpensive to construct
  - extremely fast at classifying unknown records
  - easy to interpret for small-sized trees
  - accuracy is comparable to other classification techniques for many simple data sets
- Disadvantages
  - accuracy may be affected by missing data
Associative classification
- The classification model is defined by means of association rules
  (Condition) → y
  - the rule body is an itemset
- Model generation
  - rule selection & sorting: based on support, confidence and correlation thresholds
  - rule pruning: database coverage, i.e. the training set is covered by selecting the topmost rules according to the previous sort
- Strong points
  - interpretable model
  - higher accuracy than decision trees, since correlation among attributes is considered
  - efficient classification
  - unaffected by missing data
  - good scalability in the training set size
- Weak points
  - rule generation may be slow (it depends on the support threshold)
  - reduced scalability in the number of attributes: rule generation may become unfeasible
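A minimal sketch of how a sorted rule list of the form (Condition) → y could be applied at classification time; the rules, attribute names and default class below are hypothetical examples, not taken from the slides.

    # Hypothetical rule list, already sorted by support/confidence as described above.
    rules = [
        ({"Refund": "No", "MarSt": "Single"}, "Yes"),
        ({"Refund": "Yes"}, "No"),
    ]
    default_class = "No"   # fallback label when no rule matches

    def classify_with_rules(record, rules, default_class):
        """Return the label of the first (topmost) rule whose body matches the record."""
        for condition, label in rules:
            if all(record.get(attr) == value for attr, value in condition.items()):
                return label
        return default_class

    print(classify_with_rules({"Refund": "No", "MarSt": "Single"}, rules, default_class))   # -> "Yes"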
Neural networks
- Inspired by the structure of the human brain
  - neurons as elaboration units
  - synapses as connection network

Structure of a neural network
(Figure: input nodes receive the input vector xi; hidden nodes and output nodes are connected through weights wij; the output nodes produce the output vector.)
(From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006)

Structure of a neuron
(Figure: the input vector x = (x0, ..., xn) is combined with the weight vector w = (w0, ..., wn) in a weighted sum, an offset mk is subtracted, and an activation function f produces the output y.)

Construction of the neural network
- For each node, definition of
  - set of weights
  - offset value
  providing the highest accuracy on the training data
- Iterative approach on training data instances

Neural networks
- Strong points
  - high accuracy
  - robust to noise and outliers
  - supports both discrete and continuous output
  - efficient during classification
- Weak points
  - long training time
    - weakly scalable in training data size
    - complex configuration
  - model is not interpretable
    - application domain knowledge cannot be exploited in the model
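A minimal sketch of the single-neuron computation described above (weighted sum of the inputs minus the offset mk, passed through the activation function f); the sigmoid used for f is an illustrative choice, not prescribed by the slides.

    import math

    def neuron_output(x, w, mk):
        """Weighted sum of inputs minus the offset, passed through the activation function f."""
        weighted_sum = sum(xi * wi for xi, wi in zip(x, w)) - mk
        return 1.0 / (1.0 + math.exp(-weighted_sum))   # f: sigmoid (illustrative choice)

    # Example: 3 inputs with their weights and an offset value
    print(neuron_output([1.0, 0.5, -0.2], [0.4, 0.3, 0.9], mk=0.1))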
Bayesian classification

Bayes theorem
- Let C and X be random variables:
  P(C,X) = P(C|X) P(X)
  P(C,X) = P(X|C) P(C)
- Hence
  P(C|X) P(X) = P(X|C) P(C)
  and also
  P(C|X) = P(X|C) P(C) / P(X)

Bayesian classification
- Let the class attribute and all data attributes be random variables
  - C = any class label
  - X = <x1, …, xk> record to be classified
- Applying the Bayes theorem
  P(C|X) = P(X|C)·P(C) / P(X)
- Compute P(C|X) for all classes
  - probability that record X belongs to C
- Assign X to the class with maximal P(C|X)
- Note
  - P(X) is constant for all C, so it is disregarded for the maximum computation
  - P(C) is the a priori probability of C: P(C) = Nc/N

Bayesian classification
- How to estimate P(X|C), i.e. P(x1,…,xk|C)?
- Naïve hypothesis
  P(x1,…,xk|C) = P(x1|C) P(x2|C) … P(xk|C)
  - statistical independence of attributes x1,…,xk
  - not always true: model quality may be affected
- Computing P(xk|C)
  - for discrete attributes: P(xk|C) = |xkC| / Nc, where |xkC| is the number of instances having value xk for attribute k and belonging to class C
  - for continuous attributes, use a probability distribution
- Bayesian networks
  - allow specifying a subset of dependencies among attributes

Bayesian classification: example
Weather dataset (From: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann 2006):

Outlook  | Temperature | Humidity | Windy | Class
sunny    | hot         | high     | false | N
sunny    | hot         | high     | true  | N
overcast | hot         | high     | false | P
rain     | mild        | high     | false | P
rain     | cool        | normal   | false | P
rain     | cool        | normal   | true  | N
overcast | cool        | normal   | true  | P
sunny    | mild        | high     | false | N
sunny    | cool        | normal   | false | P
rain     | mild        | normal   | false | P
sunny    | mild        | normal   | true  | P
overcast | mild        | high     | true  | P
overcast | hot         | normal   | false | P
rain     | mild        | high     | true  | N

Estimated probabilities:
- P(p) = 9/14, P(n) = 5/14
- outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
- temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
- humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 1/5
- windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
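A small sketch, assuming the naïve independence hypothesis above, that estimates P(C) and P(xk|C) by counting over the 14 records (the data literal below re-encodes the table); it reproduces, for instance, P(sunny|p) = 2/9.

    from collections import Counter, defaultdict

    # The 14 training records of the weather example: (outlook, temperature, humidity, windy, class)
    data = [
        ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
        ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
        ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
        ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
        ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
        ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
        ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    # P(C) = Nc / N
    class_counts = Counter(row[-1] for row in data)
    priors = {c: n / len(data) for c, n in class_counts.items()}

    # P(xk|C) = |xkC| / Nc for each discrete attribute value
    cond_counts = defaultdict(Counter)
    for *values, c in data:
        for attr, value in zip(attributes, values):
            cond_counts[(attr, c)][value] += 1

    def p_cond(attr, value, c):
        return cond_counts[(attr, c)][value] / class_counts[c]

    print(priors)                           # P(n) = 5/14 ≈ 0.357, P(p) = 9/14 ≈ 0.643
    print(p_cond("outlook", "sunny", "P"))  # 2/9 ≈ 0.222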
Bayesian classification: example
- Data to be labeled: X = <rain, hot, high, false>
- For class p:
  P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
              = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- For class n:
  P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
              = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Since P(X|n)·P(n) > P(X|p)·P(p), X is assigned to class n
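The same comparison as a few lines of Python, using the fractions listed above:

    # Score each class as P(x1|C)·...·P(xk|C)·P(C) and pick the maximum.
    score_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)    # P(X|p)·P(p)
    score_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)    # P(X|n)·P(n)
    print(round(score_p, 6), round(score_n, 6))          # 0.010582 0.018286
    print("n" if score_n > score_p else "p")             # X is assigned to class n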
Model evaluation
- Methods for performance evaluation
  - partitioning techniques for training and test sets
- Metrics for performance evaluation
  - accuracy, other measures
- Techniques for model comparison
  - ROC curve

Methods of estimation
- Partitioning labeled data in
  - training set for model building
  - test set for model evaluation
- Several partitioning techniques
  - holdout
  - cross validation
- Stratified sampling to generate partitions
  - without replacement
- Bootstrap
  - sampling with replacement

Holdout
- Fixed partitioning
  - reserve 2/3 for training and 1/3 for testing
- Appropriate for large datasets
  - may be repeated several times (repeated holdout)

Cross validation
- Partition data into k disjoint subsets (i.e., folds)
- k-fold: train on k-1 partitions, test on the remaining one
  - repeat for all folds
- Reliable accuracy estimation, not appropriate for very large datasets
- Leave-one-out
  - cross validation for k = n
  - only appropriate for very small datasets
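A bare-bones sketch of the k-fold partitioning described above; it only splits record indices into folds (shuffling and stratification are omitted), and the function names are illustrative.

    def k_fold_indices(n, k):
        """Partition record indices 0..n-1 into k disjoint folds of (almost) equal size."""
        return [list(range(i, n, k)) for i in range(k)]

    def cross_validation_splits(n, k):
        """For each fold, yield (training indices, test indices): train on k-1 folds, test on the remaining one."""
        folds = k_fold_indices(n, k)
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test

    # Example with n = 10 records and k = 5 folds; leave-one-out corresponds to k = n.
    for train, test in cross_validation_splits(n=10, k=5):
        print(len(train), len(test))   # 8 2, repeated for each of the 5 folds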
Metrics for model evaluation
- Evaluate the predictive accuracy of a model
- Confusion matrix for a binary classifier:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL    Class=Yes   a (TP)      b (FN)
  CLASS     Class=No    c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Accuracy
- Most widely-used metric for model evaluation
  Accuracy = Number of correctly classified objects / Number of classified objects
- For a binary classifier
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
- Not always a reliable metric

Limitations of accuracy
- Consider a binary problem
  - cardinality of class 0 = 9900
  - cardinality of class 1 = 100
- A model that predicts everything to be class 0 has accuracy 9900/10000 = 99.0%
- Accuracy is misleading because the model does not detect any class 1 object

Limitations of accuracy
- Classes may have different importance
  - misclassification of objects of a given class is more important
  - e.g., ill patients erroneously assigned to the healthy patients class
- Accuracy is not appropriate for
  - unbalanced class label distribution
  - different class relevance

Class specific measures
- Evaluate separately for each class C
  Recall (r) = Number of objects correctly assigned to C / Number of objects belonging to C
  Precision (p) = Number of objects correctly assigned to C / Number of objects assigned to C
- Maximize the F-measure
  F = 2rp / (r + p)

Class specific measures
- For a binary classification problem, on the confusion matrix, for the positive class:
  Precision (p) = a / (a + c)
  Recall (r) = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
(From: Tan, Steinbach, Kumar, Introduction to Data Mining, McGraw Hill 2006)
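A small sketch computing the measures above from the four confusion matrix entries a, b, c, d; the example counts are made up for illustration.

    def binary_metrics(a, b, c, d):
        """a = TP, b = FN, c = FP, d = TN (confusion matrix of a binary classifier)."""
        accuracy = (a + d) / (a + b + c + d)
        precision = a / (a + c)
        recall = a / (a + b)
        f_measure = 2 * recall * precision / (recall + precision)   # equals 2a / (2a + b + c)
        return accuracy, precision, recall, f_measure

    # Hypothetical confusion matrix: 70 TP, 30 FN, 10 FP, 890 TN
    print(binary_metrics(a=70, b=30, c=10, d=890))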