Classification Part I

Applications of Classification
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification: Definition
• Given a collection of records (the training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task
A learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set.

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification Techniques
• Decision tree based methods
• Naïve Bayes and Bayesian belief networks
• Rule-based methods
• Memory based reasoning
• Neural networks
• Support vector machines

Example of a Decision Tree

Training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (a decision tree, with Refund, MarSt, and TaxInc as the splitting attributes):
• Refund = Yes → NO
• Refund = No → test MarSt
  – MarSt = Married → NO
  – MarSt = Single or Divorced → test TaxInc
    · TaxInc < 80K → NO
    · TaxInc > 80K → YES

Another Example of a Decision Tree
The same training data also fits a different tree, rooted at MarSt:
• MarSt = Married → NO
• MarSt = Single or Divorced → test Refund
  – Refund = Yes → NO
  – Refund = No → test TaxInc
    · TaxInc < 80K → NO
    · TaxInc > 80K → YES
There could be more than one tree that fits the same data! (A sketch of the first tree as code follows below.)
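To make the model concrete, here is a minimal sketch of the first tree as a plain Python function. The record format and the field names (refund, marital_status, taxable_income) are our assumptions for illustration, not part of the slides.

```python
def classify_cheat(record):
    """Apply the first decision tree: Refund -> MarSt -> TaxInc.
    Field names are hypothetical; the branch logic matches the slides."""
    if record["refund"] == "Yes":
        return "No"                      # Refund = Yes -> NO
    if record["marital_status"] == "Married":
        return "No"                      # Married -> NO
    # Single or Divorced: split on taxable income at 80K
    if record["taxable_income"] < 80_000:
        return "No"                      # TaxInc < 80K -> NO
    return "Yes"                         # TaxInc > 80K -> YES

# Example: training record 5 (No, Divorced, 95K) is classified Yes
print(classify_cheat({"refund": "No",
                      "marital_status": "Divorced",
                      "taxable_income": 95_000}))
```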
Apply Model to Test Data
Start from the root of the tree and follow the branches that match the test record. For the test record (Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?):
• At the root, Refund = No, so take the No branch to MarSt.
• MarSt = Married, so take the Married branch.
• The Married branch is a leaf labeled NO, so assign Cheat to “No”.

Pigeon Problem 1
Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is the object (8, 1.5)? It is a B!
Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Two query objects are shown: one with clearly unequal bars (even I know this one) and one that is hard! The rule is as follows: if the two bars are of equal size, it is an A; otherwise it is a B. So the hard one is an A.

Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is (6, 6), A or B? It is a B! The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

Pigeon Problem 4
[Figure: the class A and class B examples from Problem 1 plotted as points, Right Bar (x) vs. Left Bar (y), both axes 1 to 10.]
Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 5
[Figure: the class A and class B examples from Problem 2 plotted in the same Right Bar vs. Left Bar space.]
Let me look it up… here it is: the rule is, if the two bars are of equal size, it is an A; otherwise it is a B.

Pigeon Problem 6
[Figure: the class A and class B examples from Problem 3 plotted in the Right Bar vs. Left Bar space, axes running to 100.]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
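The three pigeon rules are just classifiers: functions from the two bar heights to a class label. Here is a minimal sketch in Python (the function names are ours); note that rule 1 is a linear decision boundary while rule 3 is not.

```python
def rule1(left, right):
    """Pigeon Problem 1: a linear boundary (the diagonal left == right)."""
    return "A" if left < right else "B"

def rule2(left, right):
    """Pigeon Problem 2: class A is exactly the diagonal left == right."""
    return "A" if left == right else "B"

def rule3(left, right):
    """Pigeon Problem 3: a nonlinear boundary, (left + right)^2 <= 100."""
    return "A" if (left + right) ** 2 <= 100 else "B"

print(rule1(8, 1.5))   # B: the left bar is larger
print(rule3(6, 6))     # B: (6 + 6)^2 = 144 > 100
```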
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
• Goodness of rules
  – decision tree size
  – compactness of classification rules

Training Dataset
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Output: A Decision Tree for “buys_computer”
• age <=30 → test student?
  – student = no → no
  – student = yes → yes
• age 31..40 → yes
• age >40 → test credit_rating?
  – credit_rating = excellent → no
  – credit_rating = fair → yes

Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm); a runnable skeleton appears after the notes below
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Partitioning the training examples on age (columns: income, student, credit_rating, class):
age <=30:   high, no, fair, no
            high, no, excellent, no
            medium, no, fair, no
            low, yes, fair, yes
            medium, yes, excellent, yes
age 31…40:  high, no, fair, yes
            low, yes, excellent, yes
            medium, no, excellent, yes
            high, yes, fair, yes
age >40:    medium, no, fair, yes
            low, yes, fair, yes
            low, yes, excellent, no
            medium, yes, fair, yes
            medium, no, excellent, no

Algorithm for Decision Tree Induction (cont’d)
• Conditions for stopping partitioning:
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  – There are no samples left

Notes on Decision Trees
• Training data
• Choosing splitting attributes
• Ordering of splitting attributes
• Splits
• Tree structure
• Stopping criteria
• Pruning
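The greedy algorithm above can be written as a short recursive skeleton. This is a minimal sketch, not the exact pseudocode from the slides: the attribute-selection heuristic is passed in as a parameter (information gain, defined in the next slides, is the usual choice), and the toy heuristic in the demo simply picks the first attribute.

```python
from collections import Counter

def induce_tree(examples, attributes, choose_attribute):
    """Greedy top-down induction. `examples` is a list of (record, label)
    pairs; `choose_attribute` is the selection heuristic."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:          # all samples in one class: leaf
        return labels[0]
    if not attributes:                 # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)
    branches = {}
    # Branch only on values that actually occur, so no branch is empty
    # (this covers the "no samples left" stopping condition implicitly).
    for value in {record[best] for record, _ in examples}:
        subset = [(r, l) for r, l in examples if r[best] == value]
        branches[value] = induce_tree(
            subset, [a for a in attributes if a != best], choose_attribute)
    return (best, branches)            # internal node: attribute + branches

# Toy run with a deliberately naive heuristic (always split the first attribute):
data = [({"refund": "Yes", "marital": "Single"}, "No"),
        ({"refund": "No", "marital": "Married"}, "No"),
        ({"refund": "No", "marital": "Single"}, "Yes")]
print(induce_tree(data, ["refund", "marital"], lambda ex, atts: atts[0]))
```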
Grasshoppers vs. Katydids
[Figure: scatter plot of the two classes, Abdomen Length (x) vs. Antenna Length (y), both axes 1 to 10; Katydids cluster at larger values, Grasshoppers at smaller ones.]
A previously unseen instance (No. 11): Abdomen Length = 5.1, Antenna Length = 7.0, class = ?

• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

Simple Linear Classifier
[Figure: the same scatter plot with a straight line separating the two classes.]
• If the previously unseen instance is above the line, the class is Katydid; else the class is Grasshopper.

Information/Entropy
• Given probabilities p1, p2, ..., ps whose sum is 1, entropy is defined as:
  H(p_1, \dots, p_s) = -\sum_{i=1}^{s} p_i \log_2 p_i
• Entropy measures the amount of randomness, surprise, or uncertainty.
• Goal in classification: no surprise (entropy = 0)

[Figure: the curve log(1/p) and the binary entropy curve H(p, 1-p).]

Example of Entropy
Entropy is a measure of ‘uncertainty’ in a probability distribution.
• Probability(event 1) = 0.5, Probability(event 2) = 0.5 → Entropy = 1.0
• Probability(event 1) = 0.1, Probability(event 2) = 0.9 → Entropy = 0.469

Example of Entropy (cont’d)
• Probability(event 1) = 0, Probability(event 2) = 1 → Entropy = 0.0
  This is zero entropy, i.e., zero uncertainty about which event is the ‘true’ one.

Example of Entropy (cont’d)
Entropy can be measured for a set, e.g.:
S = {a, a, a, a, a, a, a, a, b, b, b, b, b} (8 a’s and 5 b’s, 13 total)
entropy(S) = -\frac{8}{13}\log_2\frac{8}{13} - \frac{5}{13}\log_2\frac{5}{13} = 0.96124
(remember the negative sign: one term for the a’s, one for the b’s)

Attribute Selection Measure: Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• S contains s_i tuples of class C_i for i = {1, …, m}
• Information required to classify an arbitrary tuple:
  I(s_1, s_2, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
• Entropy of attribute A with values {a_1, a_2, …, a_v}:
  E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})
• Information gained by branching on attribute A:
  Gain(A) = I(s_1, s_2, \dots, s_m) - E(A)

Example (Step by Step)
Using the 14-example buys_computer training set above (9 yes, 5 no):
Entropy = I(9, 5) = 0.940 (from 14 examples)
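To check these definitions numerically, here is a minimal Python sketch (the function names entropy and gain are ours, not from the slides). It reproduces I(9, 5) = 0.940, the 0.961 set entropy above, and the Gain(age) value worked out in the next slide, whose per-branch counts it borrows.

```python
import math

def entropy(counts):
    """I(s1,...,sm) = -sum (si/s) log2(si/s), with 0 log 0 taken as 0."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

def gain(class_counts, partitions):
    """Gain(A) = I(s1,...,sm) - E(A): E(A) weights each branch's entropy
    by the fraction of tuples falling into that branch."""
    s = sum(class_counts)
    e_a = sum(sum(p) / s * entropy(p) for p in partitions)
    return entropy(class_counts) - e_a

print(entropy([9, 5]))                         # ~0.940 -> I(9, 5)
print(entropy([8, 5]))                         # ~0.961 -> the {a's, b's} set
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.2467, i.e. Gain(age) ~ 0.246
```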
Splitting on age:
• age <=30: entropy 0.971 (5 examples: 2 yes, 3 no)
• age 31…40: entropy 0 (4 examples: 4 yes, 0 no)
• age >40: entropy 0.971 (5 examples: 3 yes, 2 no)
E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
(5 tuples with ‘<=30’, 4 tuples with ‘31..40’, 5 tuples with ‘>40’)
Gain(age) = I(9,5) - E(age) = 0.246

Attribute Selection by Information Gain Computation
• Class P: buys_computer = “yes”; Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age    pi  ni  I(pi, ni)
<=30   2   3   0.971
31…40  4   0   0
>40    3   2   0.971

E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
Here \frac{5}{14} I(2,3) means “age <=30” covers 5 out of the 14 samples, with 2 yes’es and 3 no’s.
Hence Gain(age) = I(p, n) - E(age) = 0.246
Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

Advantages/Disadvantages of Decision Trees
• Advantages:
  – Easy to understand
  – Easy to generate rules
• Disadvantages:
  – May suffer from overfitting
  – Classifies by rectangular partitioning (so does not handle correlated features very well)
  – Can be quite large; pruning is necessary
  – Does not handle streaming data easily

Overfitting Problem
• With few data points, a decision tree may perfectly classify the training data
• The model built does not generalize to future datasets
[Figure: a two-leaf tree splitting on “wears green?” (yes → female, no → male), a rule that fits the few training points but is questionable: “Wears green -> female ??”]
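To see the overfitting problem concretely, here is a small demonstration, assuming scikit-learn and NumPy are available (neither is part of the slides). The features are pure noise, so any pattern the unpruned tree “finds” in the 20 training points cannot generalize.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Random features and random labels: there is nothing real to learn.
X_train, y_train = rng.random((20, 5)), rng.integers(0, 2, 20)
X_test, y_test = rng.random((200, 5)), rng.integers(0, 2, 200)

# An unpruned tree grows until every training point is classified correctly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train))   # 1.0: training data memorized perfectly
print(tree.score(X_test, y_test))     # ~0.5: no better than chance on new data
```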