Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Classification CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu The UNIVERSITY of Mining, KENTUCKY CS685 : Special Topics in Data UKY Bayesian Classification: Why? • Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. • Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured CS685 : Special Topics in Data Mining, UKY Bayesian Theorem: Basics • Let X be a data sample whose class label is unknown • Let H be a hypothesis that X belongs to class C • For classification problems, determine P(H/X): the probability that the hypothesis holds given the observed data sample X • P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data, reflects the background knowledge) • P(X): probability that sample data is observed • P(X|H) : probability of observing the sample X, given that the hypothesis holds CS685 : Special Topics in Data Mining, UKY Bayesian Theorem • Given training data X, posteriori probability of a hypothesis H, P(H|X) follows the Bayes theorem P(H | X ) P( X | H )P(H ) P( X ) • Informally, this can be written as posterior =likelihood x prior / evidence • MAP (maximum posteriori) hypothesis h arg max P(h | D) arg max P(D | h)P(h). MAP hH hH • Practical difficulty: require initial knowledge of many probabilities, significant computational cost CS685 : Special Topics in Data Mining, UKY Naïve Bayes Classifier • A simplified assumption: attributes are conditionally independent: n P( X | C i) P( x k | C i ) k 1 • The product of occurrence of say 2 elements x1 and x2, given the current class is C, is the product of the probabilities of each element taken separately, given the same class P([y1,y2],C) = P(y1,C) * P(y2,C) • No dependence relation between attributes • Greatly reduces the computation cost, only count the class distribution. • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci) CS685 : Special Topics in Data Mining, UKY Training dataset age Class: <=30 C1:buys_computer= <=30 ‘yes’ 30…40 C2:buys_computer= >40 >40 ‘no’ >40 31…40 Data sample <=30 X =(age<=30, <=30 Income=medium, >40 Student=yes <=30 Credit_rating= 31…40 Fair) 31…40 >40 income student credit_rating high no fair high no excellent high no fair medium no fair low yes fair low yes excellent low yes excellent medium no fair low yes fair medium yes fair medium yes excellent medium no excellent high yes fair medium no excellent buys_computer no no yes yes yes no yes no yes yes yes yes yes no CS685 : Special Topics in Data Mining, UKY Naïve Bayesian Classifier: Example • Compute P(X/Ci) for each class P(age=“<30” | buys_computer=“yes”) = 2/9=0.222 P(age=“<30” | buys_computer=“no”) = 3/5 =0.6 P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444 P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4 P(student=“yes” | buys_computer=“yes”)= 6/9 =0.667 P(student=“yes” | buys_computer=“no”)= 1/5=0.2 P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667 P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4 X=(age<=30 ,income =medium, student=yes,credit_rating=fair) P(X|Ci) : P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 =0.044 P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019 P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028 P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007 X belongs to class “buys_computer=yes” CS685 : Special Topics in Data Mining, UKY Naïve Bayesian Classifier: Comments • Advantages : – Easy to implement – Good results obtained in most of the cases • Disadvantages – Assumption: class conditional independence , therefore loss of accuracy – Practically, dependencies exist among variables – E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc – Dependencies among these cannot be modeled by Naïve Bayesian Classifier • How to deal with these dependencies? – Bayesian Belief Networks CS685 : Special Topics in Data Mining, UKY Bayesian Networks • Bayesian belief network allows a subset of the variables conditionally independent • A graphical model of causal relationships – Represents dependency among the variables – Gives a specification of joint probability distribution Y X Z P Nodes: random variables Links: dependency X,Y are the parents of Z, and Y is the parent of P No dependency between Z and P Has no loops or cycles CS685 : Special Topics in Data Mining, UKY Bayesian Belief Network: An Example Family History Smoker (FH, S) LungCancer PositiveXRay Emphysema Dyspnea Bayesian Belief Networks (FH, ~S) (~FH, S) (~FH, ~S) LC 0.8 0.5 0.7 0.1 ~LC 0.2 0.5 0.3 0.9 The conditional probability table for the variable LungCancer: Shows the conditional probability for each possible combination of its parents n P( z1,..., zn ) P( z i | Parents( Z i )) i 1 CS685 : Special Topics in Data Mining, UKY Learning Bayesian Networks • Several cases – Given both the network structure and all variables observable: learn only the CPTs – Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning – Network structure unknown, all variables observable: search through the model space to reconstruct graph topology – Unknown structure, all hidden variables: no good algorithms known for this purpose • D. Heckerman, Bayesian networks for data mining CS685 : Special Topics in Data Mining, UKY Association Rules Transactionid Items bought 100 f, a, c, d, g, I, m, p 200 a, b, c, f, l,m, o 300 b, f, h, j, o 400 b, c, k, s, p 500 a, f, c, e, l, p, m, n Customer buys both Customer buys beer Customer buys diaper Itemset X = {x1, …, xk} Find all the rules X Y with minimum support and confidence support, s, is the probability that a transaction contains X Y confidence, c, is the conditional probability that a transaction having X also contains Y Let supmin = 50%, confmin = 50% Association rules: A C (60%, 100%) C A (60%, 75%) CS685 : Special Topics in Data Mining, UKY Classification based on Association • Classification rule mining versus Association rule mining • Aim – A small set of rules as classifier – All rules according to minsup and minconf • Syntax – Xy – X Y CS685 : Special Topics in Data Mining, UKY Why & How to Integrate • Both classification rule mining and association rule mining are indispensable to practical applications. • The integration is done by focusing on a special subset of association rules whose right-hand-side are restricted to the classification class attribute. – CARs: class association rules CS685 : Special Topics in Data Mining, UKY CBA: Three Steps • Discretize continuous attributes, if any • Generate all class association rules (CARs) • Build a classifier based on the generated CARs. CS685 : Special Topics in Data Mining, UKY Our Objectives • To generate the complete set of CARs that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. • To build a classifier from the CARs. CS685 : Special Topics in Data Mining, UKY Rule Generator: Basic Concepts • Ruleitem <condset, y> :condset is a set of items, y is a class label Each ruleitem represents a rule: condset->y • condsupCount • The number of cases in D that contain condset • rulesupCount • The number of cases in D that contain the condset and are labeled with class y • Support=(rulesupCount/|D|)*100% • Confidence=(rulesupCount/condsupCount)*100% CS685 : Special Topics in Data Mining, UKY RG: Basic Concepts (Cont.) • Frequent ruleitems – A ruleitem is frequent if its support is above minsup • Accurate rule – A rule is accurate if its confidence is above minconf • Possible rule – For all ruleitems that have the same condset, the ruleitem with the highest confidence is the possible rule of this set of ruleitems. • The set of class association rules (CARs) consists of all the possible rules (PRs) that are both frequent and accurate. CS685 : Special Topics in Data Mining, UKY RG: An Example • A ruleitem:<{(A,1),(B,1)},(class,1)> – assume that • the support count of the condset (condsupCount) is 3, • the support of this ruleitem (rulesupCount) is 2, and • |D|=10 – then (A,1),(B,1) -> (class,1) • supt=20% (rulesupCount/|D|)*100% • confd=66.7% (rulesupCount/condsupCount)*100% CS685 : Special Topics in Data Mining, UKY RG: The Algorithm 1 F 1 = {large 1-ruleitems}; 2 CAR 1 = genRules (F 1 ); 3 prCAR 1 = pruneRules (CAR 1 ); //count the item and class occurrences to determine the frequent 1-ruleitems and prune it 4 for (k = 2; F k-1 Ø; k++) do 5 C k = candidateGen (F k-1 ); //generate the candidate ruleitems Ck 6 7 8 9 10 11 12 using the frequent ruleitems Fk-1 for each data case d D do //scan the database C d = ruleSubset (C k , d); //find all the ruleitems in Ck whose condsets are supported by d for each candidate c C d do c.condsupCount++; if d.class = c.class then c.rulesupCount++; //update various support counts of the candidates in Ck end end CS685 : Special Topics in Data Mining, UKY RG: The Algorithm(cont.) 13 F k = {c C k | c.rulesupCount minsup}; //select those new frequent ruleitems to form Fk 14 CAR k = genRules(F k ); //select the ruleitems both accurate and frequent 15 prCAR k = pruneRules(CAR k ); 16 end 17 CARs = k CAR k ; 18 prCARs = k prCAR k ; CS685 : Special Topics in Data Mining, UKY SVM – Support Vector Machines Small Margin Large Margin Support Vectors CS685 : Special Topics in Data Mining, UKY SVM – Cont. • Linear Support Vector Machine Given a set of points xi n with label yi {1,1} The SVM finds a hyperplane defined by the pair (w,b) (where w is the normal to the plane and b is the distance from the origin) s.t. yi ( xi w b) 1 i 1,..., n x – feature vector, b- bias, y- class label, 2/||w|| - margin CS685 : Special Topics in Data Mining, UKY SVM – Cont. CS685 : Special Topics in Data Mining, UKY SVM – Cont. • What if the data is not linearly separable? • Project the data to high dimensional space where it is linearly separable and then we can use linear SVM – (Using Kernels) (0,1) + + - + -1 0 +1 (0,0) + (1,0) CS685 : Special Topics in Data Mining, UKY Non-Linear SVM Classification using SVM (w,b) ? xi w b 0 In non linear case we can see this as ? K ( xi , w) b 0 Kernel – Can be thought of as doing dot product in some high dimensional space CS685 : Special Topics in Data Mining, UKY CS685 : Special Topics in Data Mining, UKY Results CS685 : Special Topics in Data Mining, UKY SVM Related Links • http://svm.dcs.rhbnc.ac.uk/ • http://www.kernel-machines.org/ • C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. • SVMlight – Software (in C) http://ais.gmd.de/~thorsten/svm_light • BOOK: An Introduction to Support Vector Machines N. Cristianini and J. Shawe-Taylor Cambridge University Press CS685 : Special Topics in Data Mining, UKY