Classification
COMP 790-90 Seminar / BCB 713 Module, Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Bayesian Classification: Why?
* Probabilistic learning: Calculates explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems
* Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
* Probabilistic prediction: Predicts multiple hypotheses, weighted by their probabilities
* Standard: Even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
* Let X be a data sample whose class label is unknown
* Let H be the hypothesis that X belongs to class C
* For classification problems, we want to determine P(H|X): the probability that the hypothesis holds given the observed data sample X
* P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
* P(X): probability that the sample data is observed
* P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
* Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

* Informally, this can be written as: posterior = likelihood x prior / evidence
* MAP (maximum a posteriori) hypothesis:

    h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h)

* Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost

Naïve Bayes Classifier
* A simplifying assumption: attributes are conditionally independent given the class:

    P(X|Ci) = prod_{k=1}^{n} P(x_k|Ci)

* That is, the probability of jointly observing, say, two elements y1 and y2 given that the class is C is the product of the probabilities of each element taken separately, given the same class:

    P([y1, y2] | C) = P(y1|C) x P(y2|C)

* No dependence relation between attributes is assumed
* Greatly reduces the computation cost: only the class distributions need to be counted (a minimal sketch of the resulting classifier follows below)
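As a concrete illustration of this decision rule, here is a minimal Python sketch. The helper names (train_naive_bayes, classify) are our own, not from the slides, and the estimator uses plain counting with no smoothing, exactly as in the slides' example:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate class priors P(Ci) and attribute conditionals P(x_k|Ci) by counting."""
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    class_sizes = Counter(labels)
    counts = defaultdict(int)  # counts[(k, value, c)] = rows of class c with k-th attribute == value
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            counts[(k, value, c)] += 1
    cond = {key: n / class_sizes[key[2]] for key, n in counts.items()}
    return priors, cond

def classify(x, priors, cond):
    """MAP rule: argmax_c P(c) * prod_k P(x_k|c), under the naive independence assumption."""
    best, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= cond.get((k, value, c), 0.0)  # unseen value -> probability 0 (no smoothing)
        if score > best_score:
            best, best_score = c, score
    return best
```

In practice a Laplace correction is usually added to the counts so that a single unseen attribute value does not zero out an entire class score.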
* Once the probability P(X|Ci) is known, assign X to the class with the maximum P(X|Ci) P(Ci)

Training dataset
* Class C1: buys_computer = "yes"; class C2: buys_computer = "no"
* Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

    age     income   student   credit_rating   buys_computer
    <=30    high     no        fair            no
    <=30    high     no        excellent       no
    31…40   high     no        fair            yes
    >40     medium   no        fair            yes
    >40     low      yes       fair            yes
    >40     low      yes       excellent       no
    31…40   low      yes       excellent       yes
    <=30    medium   no        fair            no
    <=30    low      yes       fair            yes
    >40     medium   yes       fair            yes
    <=30    medium   yes       excellent       yes
    31…40   medium   no        excellent       yes
    31…40   high     yes       fair            yes
    >40     medium   no        excellent       no

Naïve Bayesian Classifier: Example
* Compute P(X|Ci) for each class (a sketch reproducing this computation follows at the end of this section):

    P(age = "<=30"           | buys_computer = "yes") = 2/9 = 0.222
    P(age = "<=30"           | buys_computer = "no")  = 3/5 = 0.6
    P(income = "medium"      | buys_computer = "yes") = 4/9 = 0.444
    P(income = "medium"      | buys_computer = "no")  = 2/5 = 0.4
    P(student = "yes"        | buys_computer = "yes") = 6/9 = 0.667
    P(student = "yes"        | buys_computer = "no")  = 1/5 = 0.2
    P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
    P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.4

* X = (age <= 30, income = medium, student = yes, credit_rating = fair)
* P(X|Ci):

    P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
    P(X | buys_computer = "no")  = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

* P(X|Ci) P(Ci), with priors P(yes) = 9/14 and P(no) = 5/14:

    P(X | buys_computer = "yes") P(buys_computer = "yes") = 0.028
    P(X | buys_computer = "no")  P(buys_computer = "no")  = 0.007

* X therefore belongs to class buys_computer = "yes"

Naïve Bayesian Classifier: Comments
* Advantages:
    * Easy to implement
    * Good results obtained in most cases
* Disadvantages:
    * The class-conditional independence assumption rarely holds exactly, so accuracy is lost
    * In practice, dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    * Dependencies among these cannot be modeled by a naïve Bayesian classifier
* How to deal with these dependencies?
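Before turning to that question (answered by Bayesian belief networks below), here is a self-contained sketch that checks the arithmetic of the worked example, with the training table transcribed from the slide above:

```python
# The 14-row buys_computer training table from the slide:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")  # the query sample X from the slide

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)                       # P(Ci): 9/14 or 5/14
    likelihood = 1.0
    for k, value in enumerate(x):                       # P(X|Ci) = prod_k P(x_k|Ci)
        likelihood *= sum(r[k] == value for r in rows) / len(rows)
    print(c, round(likelihood, 3), round(likelihood * prior, 3))
# Prints "yes 0.044 0.028" and "no 0.019 0.007", matching the slide's
# hand computation, so X is classified as buys_computer = "yes".
```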
Bayesian Belief Networks

Bayesian Networks
* A Bayesian belief network allows a subset of the variables to be conditionally independent
* It is a graphical model of causal relationships:
    * Represents dependencies among the variables
    * Gives a specification of the joint probability distribution
* Nodes: random variables; links: dependencies
* In the example graph, X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P
* The graph has no loops or cycles

Bayesian Belief Network: An Example
* [Figure: a network over the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea]
* The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents (FamilyHistory, Smoker):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC    0.8       0.5        0.7        0.1
    ~LC   0.2       0.5        0.3        0.9

* The network specifies the joint distribution via

    P(z1, ..., zn) = prod_{i=1}^{n} P(zi | Parents(Zi))

Learning Bayesian Networks
* Several cases:
    * Network structure known and all variables observable: learn only the CPTs
    * Network structure known, some variables hidden: gradient-descent methods, analogous to neural-network learning
    * Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
    * Structure unknown, all variables hidden: no good algorithms are known for this case
* Reference: D. Heckerman, Bayesian networks for data mining

SVM – Support Vector Machines
* [Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the points lying on the margin boundaries]

SVM – Cont.
* Linear Support Vector Machine
* Given a set of points xi in R^n with labels yi in {-1, +1}
* The SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is its offset from the origin, such that

    yi (xi · w + b) >= 1,  i = 1, ..., n

* xi – feature vector, b – bias, yi – class label, 2/||w|| – margin
* [Figure: the separating hyperplane, its margin, and the support vectors]

SVM – Cont.
* What if the data is not linearly separable?
* Project the data into a high-dimensional space where it is linearly separable, then apply a linear SVM there (using kernels)
* [Figure: the points -1, 0, +1 on a line are not linearly separable; mapped into the plane as (0,1), (0,0), (1,0), they are]

Non-Linear SVM
* Classification using an SVM (w, b): test on which side of the hyperplane a point falls, i.e. the sign of xi · w + b
* In the non-linear case, this test becomes the sign of K(xi, w) + b
* A kernel can be thought of as computing a dot product in some high-dimensional space

Results
* [Figure: experimental results, omitted in this text version]

SVM Related Links
* http://svm.dcs.rhbnc.ac.uk/
* http://www.kernel-machines.org/
* C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
* SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
* Book: N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press.
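As a closing illustration of the SVM decision rules above, here is a minimal Python sketch (ours, not from the slides) of the linear test sign(x · w + b) and a kernelized counterpart. The kernel version is written in the standard dual form sign(sum_i alpha_i yi K(xi, x) + b), which is what the slides' K(xi, w) + b shorthand corresponds to in practice; the numbers in the usage lines are made up for illustration, not a trained model:

```python
import math

def linear_decision(w, b, x):
    """Linear SVM rule: classify x by the sign of x.w + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel: a dot product in an implicit high-dimensional feature space."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def kernel_decision(support_vectors, alphas, labels, b, x, kernel=rbf_kernel):
    """Dual-form rule: sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if score >= 0 else -1

# Toy usage with made-up parameters:
print(linear_decision(w=[1.0, -1.0], b=0.0, x=[2.0, 0.5]))  # +1, since x1 > x2
print(kernel_decision([[0.0], [1.0]], [1.0, 1.0], [-1, 1], 0.0, [0.9]))  # +1: closer to the sv at 1.0
```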