CI6227: Data Mining Introduction (2nd half)
Sinno Jialin PAN, School of Computer Engineering, NTU, Singapore
Homepage: http://www3.ntu.edu.sg/home/sinnopan/

General Information
* Office hours/consultations: after class or during breaks; Q&A via email ([email protected]). Please email me to make an appointment. My office location can be found on my homepage.
* Course webpage: NTULearn and www3.ntu.edu.sg/home/sinnopan/courses/ntu/CI6227.htm

Content
* Data mining tasks
  - Descriptive: Association Rule Mining, Clustering, Sequence Pattern Mining, Outlier Detection
  - Predictive: Classification, Regression

Breadth and Depth
* Classification algorithms (through lectures): Decision Tree, Rule-based Classifier, Nearest-Neighbor Classifier, Bayesian Classifiers (Naïve Bayes & Bayesian Networks), Artificial Neural Networks, Support Vector Machines
* Real-world applications (through course projects): one course project on data mining applications
* Focus on introducing the basic concepts, motivations, and algorithms of classification approaches, at a level most students can follow.
* For those who want to learn more, some up-to-date techniques and advanced issues will be mentioned. Since their details cannot be covered in lectures, additional reading materials will be suggested (optional).

Course Evaluation
* Two assignments/projects (40%)
  - Assignment/Project 1 (1st half of semester): 20%
  - Assignment/Project 2 (2nd half of semester): 20%
* Final exam (60%, closed book)
  - Content taught in the first half of the semester: 30%
  - Content taught in the second half of the semester: 30%

Pre-lecture & Post-lecture Slides
* On pre-lecture slides (file names starting with "PreLecture"), I may pose some questions and ask you to figure out the answers.
* On post-lecture slides (file names starting with "ci6227"), the answers will be released.
* To avoid confusion, when post-lecture slides are uploaded to NTULearn after each lecture, the corresponding pre-lecture slides will be deleted.
Outline
* Overview of classification
* Project description
* Classification I: Decision Tree

Classification
* The task of assigning objects to one of several predefined categories.
* Can an object be assigned to more than one category? Yes: multi-label classification.

Classification via Data Mining
* Given a collection of records (training set); each record contains a set of attributes, one of which is the class.
* Goal: find a model for the class attribute as a function of the values of the other attributes, such that previously unseen records are assigned a class as accurately as possible.
* A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.

Classification (in Mathematics)
* Given a set of pairs (x_i, y_i) for i = 1, …, N, where x_i = [x_{i1}, x_{i2}, …, x_{id}], the goal is to learn a mapping f: x → y by requiring f(x_i) = y_i.
* The learned mapping f is expected to make precise predictions on any unseen x* as f(x*).
* A classifier maps a set of attributes (x) to a class (y), and is learned from the training set.

Classification vs. Regression
* For classification, y is discrete:
  - If y is binary: binary classification.
  - If y is nominal but not binary: multi-class classification.
  - If y is ordinal: ordinal classification.
* For regression, y is continuous.

Evaluation of Performance
* Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc.
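The train/test workflow above can be sketched in Python. This is a minimal sketch on made-up records; the trivial "majority class" model stands in for a real learner such as a decision tree:

```python
from collections import Counter

# Hypothetical labeled pairs (x_i, y_i): x is an attribute tuple, y the class.
records = [
    ((1, 125), "No"), ((0, 100), "No"), ((0, 70), "No"), ((1, 120), "No"),
    ((0, 95), "Yes"), ((0, 60), "No"), ((1, 220), "No"), ((0, 85), "Yes"),
    ((0, 75), "No"), ((0, 90), "Yes"),
]

# Divide the given data into a training set (to build the model)
# and a test set (to validate it).
train, test = records[:7], records[7:]

# Stand-in "model" f: always predict the majority class seen in training.
majority = Counter(y for _, y in train).most_common(1)[0][0]

def f(x):
    return majority

# Accuracy = fraction of previously unseen test records predicted correctly.
correct = sum(f(x) == y for x, y in test)
accuracy = correct / len(test)
```

Any real classifier slots into the place of `f`; the split-train-evaluate loop stays the same.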
Confusion matrix for a binary-class problem:

                          Predicted Class
                          Class=1    Class=0
  Actual    Class=1       f11 (TP)   f10 (FN)
  Class     Class=0       f01 (FP)   f00 (TN)

f11: TP (true positive); f10: FN (false negative); f01: FP (false positive); f00: TN (true negative)

Evaluation of Performance …
* Most widely used metric:
  Accuracy = (number of correct predictions) / (total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
* Error rate = (number of wrong predictions) / (total number of predictions) = 1 − Accuracy

An Illustrating Classification Task
* Consider the problem of predicting whether a loan applicant will repay the loan obligation (no cheat) or become delinquent (cheat).
* Predefined categories: Cheat = Yes / No; object: a loan applicant's record.

Training set: constructed by examining the records of previous borrowers.

  Tid | Home Owner | Marital Status | Taxable Income | Cheat
   1  | Yes        | Single         | 125K           | No
   2  | No         | Married        | 100K           | No
   3  | No         | Single         | 70K            | No
   4  | Yes        | Married        | 120K           | No
   5  | No         | Divorced       | 95K            | Yes
   6  | No         | Married        | 60K            | No
   7  | Yes        | Divorced       | 220K           | No
   8  | No         | Single         | 85K            | Yes
   9  | No         | Married        | 75K            | No
  10  | No         | Single         | 90K            | Yes

The training set is fed to a classification algorithm, which learns (induction) a model; the model is then applied to the test set:

  Tid | Home Owner | Marital Status | Taxable Income | Cheat
  11  | No         | Single         | 55K            | ?
  12  | Yes        | Divorced       | 80K            | ?
  13  | Yes        | Married        | 110K           | ?
  14  | No         | Single         | 95K            | ?
  15  | No         | Married        | 67K            | ?
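The metrics above can be computed directly from the confusion-matrix counts. A minimal sketch with illustrative counts (the values are made up, not from the slides):

```python
# Confusion-matrix counts for a binary-class problem (illustrative values).
f11, f10, f01, f00 = 40, 10, 5, 45  # TP, FN, FP, TN

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total    # correct predictions / all predictions
error_rate = (f10 + f01) / total  # wrong predictions / all predictions

# Sanity check: error rate = 1 - accuracy.
assert abs(accuracy + error_rate - 1.0) < 1e-12
```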
Deduction: applying the learned model to the test set. Example models follow.

Decision Tree (example model for the loan data):
* Refund?
  - Yes → NO
  - No → MarSt?
    - Married → NO
    - Single, Divorced → TaxInc?
      - < 80K → NO
      - > 80K → YES

Rule-based Classifier:
* R1: (Home Owner = Yes) ∧ (Taxable Income > 100K) → Cheat = No
* R2: (Home Owner = No) ∧ (Marital Status = Divorced) → Cheat = Yes
* R3: (Home Owner = No) ∧ (Marital Status = Single) ∧ (Taxable Income < 10K) → Cheat = Yes
* …

Nearest-Neighbor Classifier:
* Given a test record (e.g., Home Owner = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?), seek the most similar record(s) in the training set and make a prediction based on the majority of their classes.

Bayesian Classifiers: Naïve Bayes & Bayesian Belief Networks
(example network over Marital Status, Home Owner, Taxable Income → Cheat)
* P(MS=Married) = 0.5, P(MS=Single) = 0.4
* P(HO=Yes) = 0.4
* P(TI > 80K) = 0.4
* P(C=Yes | MS=Married, HO=Yes, TI>80K) = 0.8
* P(C=Yes | MS=Married, HO=No, TI>80K) = 0.6
* P(C=Yes | MS=Married, HO=Yes, TI≤80K) = 0.7
* P(C=Yes | MS=Married, HO=No, TI≤80K) = 0.5
* …

Artificial Neural Networks:
(figure: input layer x1–x5, hidden layer, output layer y; neuron i receives inputs I1, I2, I3 with weights wi1, wi2, wi3, computes the weighted sum Si, and applies an activation function g(Si) with threshold t to produce output Oi)

Support Vector Machines:
(figure: a candidate decision boundary B2 separating the two classes)

Ensemble Learning:
(figure omitted)

Advanced Classification Issues
* Class imbalance problem: data sets with imbalanced class distributions. E.g., the number of loan applicants who repaid their loan obligations is much larger than the number of those who were delinquent.
* Multi-class problem: some classification techniques are designed for binary classification problems. How can they be extended to multi-class problems?
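The nearest-neighbor idea can be sketched on the slide's loan records. The mixed distance function and the choice k = 3 are assumptions of this sketch, not part of the slides:

```python
from collections import Counter

# Training records from the slide:
# (Home Owner, Marital Status, Taxable Income in K, Cheat)
train = [
    ("Yes", "Single",   125, "No"),  ("No",  "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No",  "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No",  "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No",  "Single",   90, "Yes"),
]

def distance(a, b):
    """Mixed distance: 1 per categorical mismatch, plus a scaled income gap.
    (The /100 income scaling is an arbitrary choice for this sketch.)"""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100

def knn_predict(x, k=3):
    # Seek the k most similar training records, then vote by majority class.
    neighbors = sorted(train, key=lambda r: distance(x, r[:3]))[:k]
    return Counter(r[3] for r in neighbors).most_common(1)[0][0]

# Test record from the slide: Home Owner = No, Married, income 80K.
prediction = knn_predict(("No", "Married", 80))
```

Here the closest records are the married non-owners with nearby incomes, so the majority vote follows their classes.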
Course Schedule (Tentative)

  Date            | Topics                                                                                   | Note
  Week 8  (7/10)  | Introduction, Project Description, Classification I: Decision Tree (a)                  | Chapter 4
  Week 9  (14/10) | Classification II: Decision Tree (b), Rule-based Classifier                              | Chapter 4 & 5
  Week 10 (21/10) | Classification III: Nearest-Neighbor Classifier, Naïve Bayes Classifier, Bayesian Belief Networks | Chapter 5
  Week 11 (28/10) | Classification IV: Artificial Neural Networks, Ensemble Learning                         | Chapter 5
  Week 12 (4/11)  | Classification V: Support Vector Machines, Advanced Classification Issues, Review (2nd Half Semester) | Chapter 5
  Week 13 (11/11) | No Lecture                                                                               | I will be at LT20 for Q&A and project discussion

Textbook: Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley, 2005.