Information Retrieval and Data Mining, Summer Semester 2015, TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel, Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/

Chapter VI: Classification
1. Motivation and Definitions
2. Decision Trees
3. Bayes Classifier
4. Support Vector Machines (only as a teaser)
Reading: Tan, Steinbach & Kumar, Chapter 8

1. Classification: Example
A decision tree for the concept buys_computer, indicating whether a customer at an electronics shop is likely to purchase a computer (source: Han & Kamber):
  age = youth      -> student? (no -> no, yes -> yes)
  age = middle_age -> yes
  age = senior     -> credit_rating? (fair -> no, excellent -> yes)

Classification: Definition (© Tan, Steinbach, Kumar)
• Given a collection of records (training set): each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set, used to build the model, and a test set, used to validate it.

Examples of Classification Tasks
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Classifying persons into tax evaders and tax payers
Illustrating Classification Task

Training Set (Tid, Attrib1, Attrib2, Attrib3, Class):
   1  Yes  Large   125K  No
   2  No   Medium  100K  No
   3  No   Small    70K  No
   4  Yes  Medium  120K  No
   5  No   Large    95K  Yes
   6  No   Medium   60K  No
   7  Yes  Large   220K  No
   8  No   Small    85K  Yes
   9  No   Medium   75K  No
  10  No   Small    90K  Yes

Test Set (class unknown):
  11  No   Small    55K  ?
  12  Yes  Medium   80K  ?
  13  Yes  Large   110K  ?
  14  No   Small    95K  ?
  15  No   Large    67K  ?

Induction: a learning algorithm learns a model from the training set. Deduction: the model is applied to the test set.

Classification Model Evaluation
• Much the same measures as with IR methods: focus on accuracy and error rate, but also precision, recall, F-scores, …
• Confusion matrix (actual class in rows, predicted class in columns):
                Predicted = 1   Predicted = 0
  Actual = 1        f11             f10
  Actual = 0        f01             f00

Overview of Classification Techniques
• Decision-tree-based methods
• Rule-based methods
• Naïve Bayes
• Support Vector Machines
• …

Example of a Decision Tree

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
   1  Yes  Single    125K  No
   2  No   Married   100K  No
   3  No   Single     70K  No
   4  Yes  Married   120K  No
   5  No   Divorced   95K  Yes
   6  No   Married    60K  No
   7  Yes  Divorced  220K  No
   8  No   Single     85K  Yes
   9  No   Married    75K  No
  10  No   Single     90K  Yes

Model (splitting attributes Refund, MarSt, TaxInc):
  Refund = Yes -> NO
  Refund = No  -> MarSt = Married          -> NO
                  MarSt = Single, Divorced -> TaxInc < 80K -> NO
                                              TaxInc > 80K -> YES

2. Decision Trees
An alternative tree for the same training data:
  MarSt = Married          -> NO
  MarSt = Single, Divorced -> Refund = Yes -> NO
                              Refund = No  -> TaxInc < 80K -> NO
                                              TaxInc > 80K -> YES
There could be more than one tree that fits the same data!
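The first tree above can be written down directly as nested conditionals. The following sketch is illustrative only (function and field names are assumptions, not from the slides):

```python
# The Refund/MarSt/TaxInc decision tree written as nested conditionals.
def classify(refund, marital_status, taxable_income):
    """Predict the 'Cheat' label for one record (income in K)."""
    if refund == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # MarSt = Married -> leaf NO
    # MarSt = Single or Divorced: split on taxable income
    return "No" if taxable_income < 80 else "Yes"

# Replaying a training record reproduces its Cheat label,
# e.g. record 5 (Refund = No, Divorced, 95K):
print(classify("No", "Divorced", 95))    # -> Yes
```

Replaying all ten training records this way reproduces every Cheat label, which is exactly what "the tree fits the data" means.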
Decision Tree Classification Task
The same workflow as above: a tree-induction algorithm learns a decision tree from the training set (induction); the tree is then applied to the test set (deduction).

Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree: Refund = No -> follow the No branch to MarSt; Marital Status = Married -> follow the Married branch to a leaf. Assign Cheat = "No".

Classifying a Record with a Decision Tree
• Given a decision tree, how do we classify a test record?
• Start at the root node, apply the test condition to the record, and follow the appropriate branch.
• If this leads to an internal node, again apply its test condition and follow the branch.
• Otherwise, if at a leaf node, assign the class of the leaf node to the record.
• Repeat until a leaf node is reached.

Constructing a Decision Tree
• There are exponentially many decision trees for the training data.
• Finding the optimal tree is computationally infeasible.
• Instead, use greedy algorithms: a series of local split operations grows the tree. Not optimal, but there are efficient algorithms that create sufficiently accurate trees.
General Structure of Hunt's Algorithm
• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(The following steps run Hunt's algorithm on the Refund / Marital Status / Taxable Income training data from above.)

Hunt's Algorithm
• Start with a single node predicting the most frequent class as default class: Don't Cheat.
Hunt's Algorithm (2)
• Split on Refund:
  Refund = Yes -> Don't Cheat
  Refund = No  -> Don't Cheat

Hunt's Algorithm (3)
• Refine the Refund = No branch by splitting on Marital Status:
  Refund = Yes -> Don't Cheat
  Refund = No  -> Marital Status = Single, Divorced -> Cheat
                  Marital Status = Married          -> Don't Cheat

Hunt's Algorithm (4)
• Refine the Single/Divorced branch by splitting on Taxable Income:
  Refund = Yes -> Don't Cheat
  Refund = No  -> Marital Status = Married          -> Don't Cheat
                  Marital Status = Single, Divorced -> Taxable Income < 80K  -> Don't Cheat
                                                       Taxable Income >= 80K -> Cheat

Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Issues:
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting

How to Specify the Test Condition?
• Depends on the attribute type: nominal, ordinal, continuous
• Depends on the number of ways to split: 2-way split, multi-way split

Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}.
• Binary split: divides the values into two subsets. Need to find the optimal partitioning.
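A nominal attribute with k distinct values admits 2^(k-1) − 1 candidate binary splits. A small sketch (illustrative, not from the slides) that enumerates them:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield all ways to split a set of nominal values into two
    non-empty subsets, listing each unordered pair only once."""
    values = sorted(values)
    for r in range(1, len(values) // 2 + 1):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            # skip the mirrored duplicate when both halves are equal-sized
            if len(left) == len(right) and left > right:
                continue
            yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "|", right)
```

For CarType with three values this prints the three candidate splits; finding the optimal one means scoring each with an impurity measure such as the ones introduced below.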
  For example, for CarType: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

Splitting Based on Continuous Attributes
• Different ways of handling continuous attributes:
  – Discretization to form an ordinal categorical attribute
    • Static: discretize once at the beginning
    • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A >= v)
    • consider all possible splits and find the best cut
    • can be more compute-intensive
• Examples for Taxable Income:
  (i) Binary split: Taxable Income > 80K? (Yes / No)
  (ii) Multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

How to Determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1. Candidate test conditions:
• Own Car?   Yes: C0: 6, C1: 4;  No: C0: 4, C1: 6
• Car Type?  Family: C0: 1, C1: 3;  Sports: C0: 8, C1: 0;  Luxury: C0: 1, C1: 7
• Student ID?  c1: C0: 1, C1: 0;  …;  c10: C0: 1, C1: 0;  c11: C0: 0, C1: 1;  …;  c20: C0: 0, C1: 1
Which test condition is the best?

How to Determine the Best Split (2)
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – C0: 5, C1: 5 — non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 — homogeneous, low degree of impurity
Selecting the Best Split
• Let p(i | t) be the fraction of records belonging to class i at node t.
• The best split is selected based on the degree of impurity of the child nodes:
  – p(0 | t) = 0 and p(1 | t) = 1 has high purity.
  – p(0 | t) = 1/2 and p(1 | t) = 1/2 has the smallest purity (highest impurity).
• Intuition: high purity ⇒ small value of the impurity measure ⇒ better split.

Impurity Measures
For a node t with class fractions p(i | t), the two measures used below are:
  Entropy(t) = − Σ_i p(i | t) log2 p(i | t)
  Gini(t) = 1 − Σ_i p(i | t)²

Examples for Computing Entropy
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Entropy = − 0 log2 0 − 1 log2 1 = − 0 − 0 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
    Entropy = − (1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
    Entropy = − (2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92

Examples for Computing Gini
  C1 = 0, C2 = 6:  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
  C1 = 1, C2 = 5:  Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1 = 2, C2 = 4:  Gini = 1 − (2/6)² − (4/6)² = 0.444

Comparing Conditions
• The quality of a split is the change in impurity, called the gain Δ of the test condition:
    Δ = I(p) − Σ_{j=1}^{k} (N(v_j) / N) · I(v_j)
  where I(·) is the impurity measure, k is the number of attribute values, p is the parent node, v_j is the j-th child node, N is the total number of records at the parent node, and N(v_j) is the number of records associated with child v_j.
• Maximizing the gain ⇔ minimizing the weighted average impurity of the child nodes.
• If I(·) = Entropy(·), then Δ = Δ_info is called the information gain.

How to Find the Best Split
Before splitting: class counts C0: N00, C1: N01, with impurity M0. Two candidate attributes:
  Attribute A: child N1 (C0: N10, C1: N11, impurity M1) and child N2 (C0: N20, C1: N21, impurity M2); weighted impurity M12.
  Attribute B: child N3 (C0: N30, C1: N31, impurity M3) and child N4 (C0: N40, C1: N41, impurity M4); weighted impurity M34.
Compare Gain = M0 − M12 vs. M0 − M34.

Problems of Maximizing Δ
• Impurity measures favor attributes with a large number of values (many small, purer children).
• A test condition with a large number of outcomes might not be desirable: the number of records in each partition becomes too small to make reliable predictions.
• Solution 1: use the gain ratio = Δ_info / SplitInfo, with
    SplitInfo = − Σ_{i=1}^{k} P(v_i) log2 P(v_i)
  where P(v_i) is the fraction of records at child v_i and k is the total number of splits.
• Solution 2: restrict the splits to binary splits.

Stopping Criteria for Tree Induction
• Stop expanding a node when all its records belong to the same class.
• Stop expanding a node when all its records have the same or similar attribute values; in this case the majority class wins.

Overfitting and Tree Pruning
• A common problem with decision trees is that the tree might be too tightly tailored to the training data (and thus possibly to noise in the data).
  – Good: the error on the training data might be very low.
  – But what about previously unseen test data?
• Idea: avoid the tree becoming too fine-grained.
• Solution 1: stop splitting nodes early (preprocessing).
• Solution 2: build the tree regularly and then prune parts of it (postprocessing).
Example: Overfitting
(Figures omitted: a training data set containing some noisy records with wrong class labels, and two decision trees M1 and M2 built from it; table source: Tan, Steinbach, Kumar.)
Let's see how the trees M1 and M2 perform on training and test data:
• M1: 0% error on the training data, but 30% error on the test data!
• M2: 20% error on the training data, but only 10% error on the test data!

3. (Naïve) Bayes Classifier
• A probabilistic framework for solving classification problems.
• Conditional probability:
    P(C | A) = P(A, C) / P(A)
    P(A | C) = P(A, C) / P(C)
• Bayes' theorem:
    P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes' Theorem
• Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time.
  – The prior probability of any patient having meningitis is 1/50,000.
  – The prior probability of any patient having a stiff neck is 1/20.
• If a patient has a stiff neck, what is the probability that he/she has meningitis?
    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Bayesian Classifiers
• Consider each attribute and the class label as random variables.
• Given a record with attributes (A1, A2, …, An):
  – The goal is to predict class C; specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
• Can we estimate P(C | A1, A2, …, An) directly from data?
Bayesian Classifiers (2)
• Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:
    P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
• Choose the value of C that maximizes P(C | A1, A2, …, An). This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since the denominator does not depend on C.
• How to estimate P(A1, A2, …, An | C)?

Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
    P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
• We can estimate P(Ai | Cj) for all Ai and Cj from the data.
• A new point is classified as Cj if P(Cj) Π_i P(Ai | Cj) is maximal.

How to Estimate Probabilities from Data?
(Training data as before: Refund and Marital Status are categorical, Taxable Income is continuous, and the class is Evade.)
• Class prior: P(C) = Nc / N, e.g., P(No) = 7/10, P(Yes) = 3/10.
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc, where |Aik| is the number of instances having attribute value Ai and belonging to class Ck.
  – Examples: P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0.
How to Estimate Probabilities from Data? (continuous attributes)
• Discretize the range into bins:
  – one ordinal attribute per bin
  – violates the independence assumption
• Two-way split: (A < v) or (A > v)
  – choose only one of the two splits as the new attribute
• Probability density estimation:
  – Assume the attribute follows a normal distribution.
  – Use the data to estimate the parameters of the distribution (e.g., mean and standard deviation).
  – Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c).

Normal distribution:
• One distribution is estimated per (attribute, class) pair (Ai, ci).
• For (Income, Class = No): sample mean = 110, sample variance = 2975.
Example of Naïve Bayes Classifier
Given a test record X = (Refund = No, Married, Income = 120K).

Naïve Bayes estimates from the training data:
  P(Refund = Yes | No) = 3/7          P(Refund = No | No) = 4/7
  P(Refund = Yes | Yes) = 0           P(Refund = No | Yes) = 1
  P(Marital Status = Single | No) = 2/7
  P(Marital Status = Divorced | No) = 1/7
  P(Marital Status = Married | No) = 4/7
  P(Marital Status = Single | Yes) = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes) = 0
  For taxable income: if class = No, sample mean = 110 and sample variance = 2975; if class = Yes, sample mean = 90 and sample variance = 25.

  P(X | Class = No) = P(Refund = No | No) × P(Married | No) × P(Income = 120K | No)
                    = 4/7 × 4/7 × 0.0072 = 0.0024
  P(X | Class = Yes) = P(Refund = No | Yes) × P(Married | Yes) × P(Income = 120K | Yes)
                     = 1 × 0 × 1.2·10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), we have P(No | X) > P(Yes | X), so the predicted class is No.

4. Support Vector Machines
• Idea: find a linear hyperplane (decision boundary) that separates the data.
• (Figures omitted: several possible separating hyperplanes B1, B2, … for the same data.)
• Which one is better, B1 or B2? How do you define "better"?
• Find the hyperplane that maximizes the margin, i.e., the distance between the parallel boundaries b11/b12 and b21/b22 that just touch the closest records on either side ⇒ B1 is better than B2.
Support Vector Machines (2)
• Decision boundary: w · x + b = 0, with margin hyperplanes w · x + b = 1 and w · x + b = −1 (touching b11 and b12).
• Classification rule:
    f(x) =  1  if w · x + b ≥ 1
    f(x) = −1  if w · x + b ≤ −1
• Margin = 2 / ||w||²

Summary Data Mining
• Frequent itemset and association rule mining:
  – Apriori principle and algorithm
• Clustering:
  – k-means
  – hierarchical clustering
  – DBSCAN (density-based clustering)
• Classification:
  – decision trees
  – Naïve Bayes classifier
  – Support Vector Machines (SVMs)
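As a closing sketch (illustrative, not part of the slides), the entropy and Gini examples from the classification chapter can be checked numerically in a few lines:

```python
import math

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini index of a class distribution given as raw counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# The class-count examples from the entropy/Gini slides
for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2), round(gini(counts), 3))
```

This reproduces the slide values: entropy 0, 0.65, 0.92 and Gini 0, 0.278, 0.444 for the three distributions.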