Classification: Naïve Bayes Classifier
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
(Probability review slides © 2005 Brooks/Cole, a division of Thomson Learning, Inc.)

Joint, Marginal, Conditional Probability
We want to determine the probabilities of events that result from combining other events in various ways. There are several types of combinations and relationships between events:
• Complement of an event [everything other than that event]
• Intersection of two events [event A and event B] or [A*B]
• Union of two events [event A or event B] or [A+B]

6.2 Example of Joint Probability
Why are some mutual fund managers more successful than others? One possible factor is where the manager earned his or her MBA. The following table compares mutual fund performance against the ranking of the school where the fund manager earned their MBA:

                           Fund outperforms    Fund doesn't outperform
                           the market          the market
Top-20 MBA program         .11                 .29
Not top-20 MBA program     .06                 .54

Venn diagram: .11 is the probability that a mutual fund outperforms the market AND the manager was in a top-20 MBA program; it is a joint probability [intersection].

6.3 Example of Joint Probability
Alternatively, we can introduce shorthand notation to represent the events:
A1 = Fund manager graduated from a top-20 MBA program
A2 = Fund manager did not graduate from a top-20 MBA program
B1 = Fund outperforms the market
B2 = Fund does not outperform the market

        B1      B2
A1      .11     .29
A2      .06     .54

E.g. P(A2 and B1) = .06 = the probability that a fund outperforms the market and the manager isn't from a top-20 school.

6.4 Marginal Probabilities
Marginal probabilities are computed by adding across rows and down columns; that is, they are calculated in the margins of the table:

        B1      B2      P(Ai)
A1      .11     .29     .40
A2      .06     .54     .60
P(Bj)   .17     .83     1.00

P(A2) = .06 + .54 = .60  ("What's the probability a fund manager isn't from a top school?")
P(B1) = .11 + .06 = .17  ("What's the probability a fund outperforms the market?")
Both margins must add to 1 (a useful error check).

6.5 Conditional Probability
Conditional probability is used to determine how two events are related; that is, we can determine the probability of one event given the occurrence of another related event.
Experiment: randomly select one student in class. Compare P(randomly selected student is male) with P(randomly selected student is male | student is in the 3rd row).
Conditional probabilities are written as P(A | B), read "the probability of A given B", and calculated as:
P(A | B) = P(A and B) / P(B)

6.6 Conditional Probability
Again, the probability of an event given that another event has occurred is called a conditional probability. Note that
P(A and B) = P(A) * P(B | A) = P(B) * P(A | B)
Both forms are true. Keep this in mind!
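The table manipulations above are easy to mirror in code. The following is a minimal Python sketch (not part of the slides; the dictionary layout and helper names are illustrative assumptions) that derives marginal and conditional probabilities from the joint probabilities:

# Minimal sketch: marginal and conditional probabilities from the joint table
# of the mutual-fund example. Names and data layout are illustrative assumptions.

joint = {
    frozenset({"A1", "B1"}): 0.11,  # top-20 MBA and fund outperforms
    frozenset({"A1", "B2"}): 0.29,  # top-20 MBA and fund does not outperform
    frozenset({"A2", "B1"}): 0.06,  # not top-20 and fund outperforms
    frozenset({"A2", "B2"}): 0.54,  # not top-20 and fund does not outperform
}

def marginal(event):
    """P(event): sum the joint probabilities of every cell containing the event."""
    return sum(p for cell, p in joint.items() if event in cell)

def conditional(a, given):
    """P(a | given) = P(a and given) / P(given)."""
    return joint[frozenset({a, given})] / marginal(given)

print(marginal("A2"))            # ~0.60 -> manager is not from a top-20 school
print(marginal("B1"))            # ~0.17 -> fund outperforms the market
print(conditional("B1", "A1"))   # ~0.275 -> previews Example 6.2 below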
6.7 Conditional Probability
Example 6.2: What's the probability that a fund will outperform the market given that the manager graduated from a top-20 MBA program?
Recall:
A1 = Fund manager graduated from a top-20 MBA program
A2 = Fund manager did not graduate from a top-20 MBA program
B1 = Fund outperforms the market
B2 = Fund does not outperform the market
Thus, we want to know "what is P(B1 | A1)?"

6.8 Conditional Probability
We want to calculate P(B1 | A1):

        B1      B2      P(Ai)
A1      .11     .29     .40
A2      .06     .54     .60
P(Bj)   .17     .83     1.00

P(B1 | A1) = P(A1 and B1) / P(A1) = .11 / .40 = .275
Thus, there is a 27.5% chance that a fund will outperform the market given that the manager graduated from a top-20 MBA program.

6.9 Independence
One of the objectives of calculating conditional probability is to determine whether two events are related. In particular, we would like to know whether they are independent, that is, whether the probability of one event is unaffected by the occurrence of the other event.
Two events A and B are said to be independent if
P(A | B) = P(A)  and  P(B | A) = P(B)
E.g. P(you have a flat tire going home | your radio quits working).

6.10 Are B1 and A1 independent?

6.11 Independence
We saw that P(B1 | A1) = .275, while the marginal probability of B1 is P(B1) = .17.
Since P(B1 | A1) ≠ P(B1), B1 and A1 are not independent events. Stated another way, they are dependent: the probability of one event (B1) is affected by the occurrence of the other event (A1).

6.12 Union
Determine the probability that a fund outperforms (B1) or the manager graduated from a top-20 MBA program (A1).

6.13 Union
A1 or B1 occurs whenever A1 and B1 occurs, A1 and B2 occurs, or A2 and B1 occurs:

        B1      B2      P(Ai)
A1      .11     .29     .40
A2      .06     .54     .60
P(Bj)   .17     .83     1.00

P(A1 or B1) = .11 + .29 + .06 = .46

Data Mining Classification: Naïve Bayes Classifier
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Classification: Definition
Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class:
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task
A learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to assign classes to the test set.

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
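As an aside (not part of the slides), the induction/deduction workflow above can be sketched with standard tooling. The following hedged Python example assumes pandas and scikit-learn are available and uses a decision tree purely as a placeholder learning algorithm on the toy data above:

# Illustrative sketch of the induction/deduction workflow on the toy data.
# Library choice, encoding, and variable names are assumptions, not from the slides.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # income in K
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the learner can use them.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X_train, train["Class"])  # induction: learn model
print(model.predict(X_test))                                   # deduction: label Tid 11-15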
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines

Example of a Decision Tree

Training Data
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
• Refund = Yes -> NO
• Refund = No:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced:
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES

Another Example of Decision Tree
The same training data is also fit by a tree that splits on MarSt first:
• MarSt = Married -> NO
• MarSt = Single or Divorced:
  - Refund = Yes -> NO
  - Refund = No:
      TaxInc < 80K -> NO
      TaxInc > 80K -> YES
There could be more than one tree that fits the same data!

Decision Tree Classification Task
The same induction/deduction workflow as before: a tree induction algorithm learns the model from the training set (Tid 1-10), and the model is then applied to the test set (Tid 11-15).

Apply Model to Test Data
Start from the root of the tree and, at each node, follow the branch that matches the test record's attribute value.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund = No -> MarSt = Married -> leaf NO, so assign Cheat = "No".
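To make the tree-walking concrete, here is a short Python sketch (not from the slides; the function name and signature are hypothetical) of the first decision tree above:

# Hypothetical sketch: the first decision tree above, hand-coded as nested tests.
def predict_cheat(refund: str, marital_status: str, taxable_income_k: float) -> str:
    """Return 'Yes'/'No' by walking the Refund -> MarSt -> TaxInc splits."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "No" if taxable_income_k < 80 else "Yes"

# Test record from the slides: Refund = No, Married, Taxable Income = 80K
print(predict_cheat("No", "Married", 80))  # -> 'No'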
Bayes Classifier
A probabilistic framework for solving classification problems.
Conditional probability:
P(C | A) = P(A, C) / P(A)
P(A | C) = P(A, C) / P(C)
Bayes theorem:
P(C | A) = P(A | C) P(C) / P(A)

Bayes Theorem
Given a hypothesis h and data D which bears on the hypothesis:
P(h | D) = P(D | h) P(h) / P(D)
P(h): independent probability of h (the prior probability)
P(D): independent probability of D
P(D | h): conditional probability of D given h (the likelihood)
P(h | D): conditional probability of h given D (the posterior probability)

Example of Bayes Theorem
Given:
• A doctor knows that meningitis causes a stiff neck 50% of the time.
• The prior probability of any patient having meningitis is 1/50,000.
• The prior probability of any patient having a stiff neck is 1/20.
If a patient has a stiff neck, what's the probability he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Bayesian Classifiers
• Consider each attribute and the class label as random variables.
• Given a record with attributes (A1, A2, …, An), the goal is to predict the class C.
• Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
• Can we estimate P(C | A1, A2, …, An) directly from data?

Bayesian Classifiers
Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
• Choose the value of C that maximizes P(C | A1, A2, …, An).
• This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), because the denominator does not depend on C.
• How do we estimate P(A1, A2, …, An | C)?

The Bayes Classifier
In Bayes' rule, P(h | D) = P(D | h) P(h) / P(D), the factor P(D | h) is the likelihood, P(h) is the prior, P(D) is the normalization constant, and P(h | D) is the posterior.

Maximum A Posteriori
Based on Bayes theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data. We are interested in the best hypothesis from some space H given the observed training data D:
h_MAP = argmax_{h in H} P(h | D)
      = argmax_{h in H} P(D | h) P(h) / P(D)
      = argmax_{h in H} P(D | h) P(h)
H: the set of all hypotheses. Note that we can drop P(D) because the probability of the data is constant (and independent of the hypothesis).

Naïve Bayes Classifier
Assume independence among the attributes Ai when the class is given:
P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
• We can estimate P(Ai | Cj) for all Ai and Cj.
• A new point is classified as Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
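The MAP decision rule of the naïve Bayes classifier is short enough to sketch directly. The following Python example is not from the slides: the function and table names are assumptions, and the 0.05 value for P(stiff neck | no meningitis) is an approximation justified only because meningitis is rare, so P(S) ≈ P(S | no M).

# Illustrative sketch of the naive Bayes MAP rule: argmax_c P(c) * prod_i P(a_i | c).
# Names, data layout, and the 0.05 approximation are assumptions, not from the slides.
from math import prod

def naive_bayes_classify(record, priors, cond_prob):
    """record: {attribute: value}; priors: {class: P(class)};
    cond_prob: {class: {attribute: {value: P(value | class)}}}."""
    scores = {
        c: priors[c] * prod(cond_prob[c][a][v] for a, v in record.items())
        for c in priors
    }
    return max(scores, key=scores.get), scores

# Tiny example with one attribute, using the meningitis numbers above.
priors = {"meningitis": 1 / 50000, "no meningitis": 1 - 1 / 50000}
cond = {
    "meningitis":    {"neck": {"stiff": 0.5}},
    "no meningitis": {"neck": {"stiff": 0.05}},  # assumed: P(S) ~ P(S | no M) = 1/20
}
print(naive_bayes_classify({"neck": "stiff"}, priors, cond))
# -> 'no meningitis', consistent with the posterior P(M | S) = 0.0002 above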
Example: Play Tennis

Learning Phase
P(Play=Yes) = 9/14    P(Play=No) = 5/14

Outlook       Play=Yes   Play=No
Sunny         2/9        3/5
Overcast      4/9        0/5
Rain          3/9        2/5

Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5

Humidity      Play=Yes   Play=No
High          3/9        4/5
Normal        6/9        1/5

Wind          Play=Yes   Play=No
Strong        3/9        3/5
Weak          6/9        2/5

Test Phase
Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the tables:
P(Outlook=Sunny | Play=Yes) = 2/9         P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9      P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9         P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9           P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                        P(Play=No) = 5/14

MAP rule:
P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x')  ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since P(Yes | x') < P(No | x'), we label x' as "No".

Naïve Bayes Classifier
If one of the conditional probabilities is zero, the entire expression becomes zero.
Probability estimation (Nc: number of training instances in class C; Nic: number of those instances with attribute value Ai):
Original:    P(Ai | C) = Nic / Nc
Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)
c: number of classes; p: prior probability; m: smoothing parameter.
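The test-phase computation above can be reproduced in a few lines. This Python sketch is not part of the slides: the structure and names are illustrative assumptions, but the fractions come directly from the learning-phase tables.

# Sketch reproducing the Play Tennis test-phase computation.
from math import prod

prior = {"Yes": 9/14, "No": 5/14}
cond = {  # cond[class][attribute][value] = P(value | class), from the tables above
    "Yes": {"Outlook": {"Sunny": 2/9}, "Temperature": {"Cool": 3/9},
            "Humidity": {"High": 3/9}, "Wind": {"Strong": 3/9}},
    "No":  {"Outlook": {"Sunny": 3/5}, "Temperature": {"Cool": 1/5},
            "Humidity": {"High": 4/5}, "Wind": {"Strong": 3/5}},
}
x = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}

# Unnormalized MAP scores: P(class) * product of the per-attribute conditionals.
scores = {c: prior[c] * prod(cond[c][a][v] for a, v in x.items()) for c in prior}
print(scores)                          # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))     # -> 'No'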
Problem Solving
Training data (A: attributes, M: mammals, N: non-mammals):

Name            Give Birth   Can Fly   Live in Water   Have Legs   Class
human           yes          no        no              yes         mammals
python          no           no        no              no          non-mammals
salmon          no           no        yes             no          non-mammals
whale           yes          no        yes             no          mammals
frog            no           no        sometimes       yes         non-mammals
komodo          no           no        no              yes         non-mammals
bat             yes          yes       no              yes         mammals
pigeon          no           yes       no              yes         non-mammals
cat             yes          no        no              yes         mammals
leopard shark   yes          no        yes             no          non-mammals
turtle          no           no        sometimes       yes         non-mammals
penguin         no           no        sometimes       yes         non-mammals
porcupine       yes          no        no              yes         mammals
eel             no           no        yes             no          non-mammals
salamander      no           no        sometimes       yes         non-mammals
gila monster    no           no        no              yes         non-mammals
platypus        no           no        no              yes         mammals
owl             no           yes       no              yes         non-mammals
dolphin         yes          no        yes             no          mammals
eagle           no           yes       no              yes         non-mammals

Classify the new record A = (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no), Class = ?

Solution
P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
P(A | M) P(M) > P(A | N) P(N)  =>  Mammals

Classifier Evaluation Metrics: Confusion Matrix

Confusion matrix:
Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of a confusion matrix:
Actual \ Predicted       buy_computer = yes   buy_computer = no   Total
buy_computer = yes       6954                 46                  7000
buy_computer = no        412                  2588                3000
Total                    7366                 2634                10000

• Given m classes, the entry CMi,j of a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j.
• The matrix may have extra rows/columns to provide totals.

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P    C     ¬C
C        TP    FN    P
¬C       FP    TN    N
         P'    N'    All

• Classifier accuracy (recognition rate): the percentage of test-set tuples that are correctly classified. Accuracy = (TP + TN) / All
• Error rate: 1 - accuracy, i.e. Error rate = (FP + FN) / All
• Class Imbalance Problem: one class may be rare, e.g. fraud or HIV-positive, so there is a significant majority of the negative class and a minority of the positive class.
• Sensitivity (true positive recognition rate): Sensitivity = TP / P
• Specificity (true negative recognition rate): Specificity = TN / N

Measuring Error
Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of found positives / # of found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - Specificity

(Lecture Notes for E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press, V1.1)
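The evaluation metrics above follow mechanically from the four confusion-matrix counts. This Python sketch (not from the slides; it assumes "buy_computer = yes" is the positive class) applies the formulas to the buy_computer example:

# Sketch: evaluation metrics from the buy_computer confusion matrix above.
TP, FN = 6954, 46     # actual yes, predicted yes / predicted no
FP, TN = 412, 2588    # actual no,  predicted yes / predicted no

P, N = TP + FN, FP + TN          # actual positives and actual negatives
ALL = P + N                      # total number of test-set tuples

accuracy    = (TP + TN) / ALL    # ~0.9542
error_rate  = (FP + FN) / ALL    # ~0.0458
sensitivity = TP / P             # recall / hit rate, ~0.9934
specificity = TN / N             # ~0.8627
precision   = TP / (TP + FP)     # ~0.9441
false_alarm = FP / (FP + TN)     # 1 - specificity, ~0.1373

print(f"accuracy={accuracy:.4f} error={error_rate:.4f} recall={sensitivity:.4f} "
      f"precision={precision:.4f} specificity={specificity:.4f} "
      f"false_alarm={false_alarm:.4f}")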