Bayesian Classification
CS690L Data Mining: Classification (2)
Reference: J. Han and M. Kamber, Data Mining: Concepts and Techniques

Why Bayesian Classification?
• Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
• Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
• Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be the hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects the background knowledge)
• P(X): probability that the sample data is observed
• P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) · P(H) / P(X)
• Informally, this can be written as: posterior = likelihood × prior / evidence
• MAP (maximum a posteriori) hypothesis: choose the hypothesis h that maximizes P(h|X), equivalently P(X|h) · P(h)
• Practical difficulty: requires initial knowledge of many probabilities; significant computational cost

Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent given the class:
  P(X|Ci) = ∏ (k = 1..n) P(xk|Ci)
• The probability of observing, say, two attribute values y1 and y2 given class C is the product of the probabilities of each value taken separately, given the same class: P([y1, y2] | C) = P(y1|C) · P(y2|C)
• No dependence relation between attributes
• Greatly reduces the computation cost: only the class distribution needs to be counted
• Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) · P(Ci)

Training Dataset
Class:
C1: buys_computer = "yes"
C2: buys_computer = "no"
Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no

Naïve Bayesian Classifier: Example
• Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
• For X = (age<=30, income=medium, student=yes, credit_rating=fair):
P(X|Ci):
P(X | buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci) · P(Ci):
P(X | buys_computer="yes") · P(buys_computer="yes") = 0.044 × 9/14 = 0.028
P(X | buys_computer="no") · P(buys_computer="no") = 0.019 × 5/14 = 0.007
• X belongs to class "buys_computer = yes"

Naïve Bayes: Continuous Value (1)
Outlook   Temp(ºF)  Humidity  Windy  Class
sunny     85        85        false  Don't Play
sunny     80        90        true   Don't Play
overcast  83        86        false  Play
rainy     70        96        false  Play
rainy     68        80        false  Play
rainy     65        70        true   Don't Play
overcast  64        65        true   Play
sunny     72        95        false  Don't Play
sunny     69        70        false  Play
rainy     75        80        false  Play
sunny     75        70        true   Play
overcast  72        90        true   Play
overcast  81        75        false  Play
rainy     71        91        true   Don't Play

Naïve Bayes: Continuous Value (2)
• Counts and raw values grouped by class (play: 9 yes, 5 no):
Outlook: sunny (yes: 2, no: 3), overcast (yes: 4, no: 0), rainy (yes: 3, no: 2)
Temperature (yes): 83, 70, 68, 64, 69, 75, 75, 72, 81; (no): 85, 80, 65, 72, 71
Humidity (yes): 86, 96, 80, 65, 70, 80, 70, 90, 75; (no): 85, 90, 70, 95, 91
Windy: false (yes: 6, no: 2), true (yes: 3, no: 3)

Naïve Bayes: Continuous Value (3)
• Relative frequencies for the discrete attributes; mean µ and standard deviation σ for the continuous ones:
Outlook:     sunny 2/9 | 3/5, overcast 4/9 | 0/5, rainy 3/9 | 2/5   (yes | no)
Temperature: µ = 73,   σ = 6.2  (yes); µ = 74.6, σ = 7.9 (no)
Humidity:    µ = 79.1, σ = 10.2 (yes); µ = 86.2, σ = 9.7 (no)
Windy:       false 6/9 | 2/5, true 3/9 | 3/5
Play:        P(yes) = 9/14, P(no) = 5/14

Naïve Bayes: Continuous Value (4)
• Classify a new day:
Outlook  Temp  Humidity  Windy  Class
sunny    66    90        true   ?
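Before working through the hand computation on the next slides, the classification of the query day (sunny, 66, 90, true) can be sketched in Python, using the class statistics from the table above. This is a minimal sketch, not library code; exact arithmetic reproduces the slides' rounded figures to within small rounding differences.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian (normal) density: f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Statistics from the slides: P(sunny|yes)=2/9, P(windy=true|yes)=3/9, P(yes)=9/14, etc.
# Discrete attributes use relative frequencies; continuous ones use the Gaussian density.
like_yes = (2 / 9) * gaussian(66, 73.0, 6.2) * gaussian(90, 79.1, 10.2) * (3 / 9) * (9 / 14)
like_no = (3 / 5) * gaussian(66, 74.6, 7.9) * gaussian(90, 86.2, 9.7) * (3 / 5) * (5 / 14)

# Normalize: the evidence term P(sunny, 66, 90, true) cancels out.
p_yes = like_yes / (like_yes + like_no)
print(like_yes)  # ≈ 0.000036
print(p_yes)     # ≈ 0.21 (the slides round to 20.9%)
```

The Gaussian factors come out as f(Temp=66 | yes) ≈ 0.0340 and f(Humidity=90 | yes) ≈ 0.0221, matching the next slide.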
Naïve Bayes: Continuous Value (5)
• For continuous attributes, use the Gaussian (normal) density function:
  f(x) = (1 / (√(2π) · σ)) · e^(−(x − µ)² / (2σ²))
• For class "yes":
  f(Temp = 66 | Class = yes) = (1 / (√(2π) · 6.2)) · e^(−(66 − 73)² / (2 · 6.2²)) = 0.0340
  f(Humidity = 90 | Class = yes) = 0.0221
• P(Class = yes | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true):
  P_yes = (2/9 · 0.0340 · 0.0221 · 3/9 · 9/14) / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
        = 0.000036 / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)

Naïve Bayes: Continuous Value (6)
• P(Class = no | Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true):
  P_no = (3/5 · 0.0291 · 0.0380 · 3/5 · 5/14) / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
       = 0.000136 / P(Outlook = sunny ∧ Temp = 66 ∧ Humidity = 90 ∧ Windy = true)
• Normalizing (the evidence term cancels): P_yes = 20.9%, P_no = 79.1%

Naïve Bayesian Classifier: Evaluation
• Advantages:
– Easy to implement
– Good results obtained in most cases
• Disadvantages:
– The class-conditional independence assumption causes a loss of accuracy
– In practice, dependencies exist among variables
– E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
– Dependencies among these cannot be modeled by the Naïve Bayesian classifier
• How to deal with these dependencies?
– Bayesian belief networks

Bayesian Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships:
– Represents dependency among the variables
– Gives a specification of the joint probability distribution
• Nodes: random variables; links: dependency
• E.g., for nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
• The graph has no loops or cycles

Bayesian Belief Network: An Example
• Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
• The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents (FamilyHistory, Smoker):

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

• The joint probability of any tuple (z1, …, zn) is
  P(z1, …, zn) = ∏ (i = 1..n) P(zi | Parents(Zi))

Learning Bayesian Networks
• Several cases:
– Given both the network structure and all variables observable: learn only the CPTs
– Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning
– Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
– Unknown structure, all hidden variables: no good algorithms known for this purpose

Prediction
• Prediction is similar to classification:
– First, construct a model
– Second, use the model to predict unknown values
• The major method for prediction is regression:
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification:
– Classification predicts categorical class labels
– Prediction models continuous-valued functions

Regression Analysis and Log-Linear Models in Prediction
• Linear regression: Y = α + β X
– The two parameters, α and β, specify the line and are to be estimated from the data at hand
– Apply the least squares criterion to the known values Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– The multi-way table of joint probabilities is approximated by a product of lower-order tables
– Probability: p(a, b, c, d) = αab βac χad δbcd

Predictive Modeling in Databases
• Predictive modeling: predict data values or construct generalized linear models based on the database data
• One can only predict value ranges or category distributions
• Method outline:
– Minimal generalization
– Attribute relevance analysis
– Generalized linear model construction
– Prediction
• Determine the major factors which influence the prediction:
– Data relevance analysis: uncertainty measurement, entropy analysis, expert judgment, etc.
• Multi-level prediction: drill-down and roll-up analysis

Regression
• Method of least squares
• Assume a linear relationship: Ŷ(Xi) = a + b·Xi
• Principle: minimize Σ [Ŷ(Xi) − Yi]² = Σ [Yi − (a + b·Xi)]²

Linear Regression
• Minimizing w.r.t. a gives: Σ Yi = n·a + b·Σ Xi   (1)
• Minimizing w.r.t. b gives: Σ Xi·Yi = a·Σ Xi + b·Σ Xi²   (2)
• Solving (1) and (2) we get:
  b = (n·Σ XY − Σ X · Σ Y) / (n·Σ X² − (Σ X)²)
  a = (1/n)(Σ Y − b·Σ X)

Linear Regression: Example
• Determine the regression line, with Y (salary) linearly dependent on X (years):

X (years)  Y (salary)
3          30
8          57
9          64
13         72
3          36
6          43
11         59
21         90
1          20
16         83

• X̄ = (1/n) Σ X = 9.1, Ȳ = (1/n) Σ Y = 55.4
• b = Σ (X − X̄)(Y − Ȳ) / Σ (X − X̄)²
    = [(3 − 9.1)(30 − 55.4) + (8 − 9.1)(57 − 55.4) + … + (16 − 9.1)(83 − 55.4)]
      / [(3 − 9.1)² + (8 − 9.1)² + … + (16 − 9.1)²] = 3.5
• a = Ȳ − b·X̄ = 55.4 − (3.5)(9.1) = 23.6
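The least-squares computation in the example above can be checked numerically. A minimal Python sketch fitting Y = a + b·X to the slide's salary data follows; note that exact arithmetic gives b ≈ 3.54 and a ≈ 23.2, while the slide rounds b to 3.5, which yields a ≈ 23.6.

```python
# Least-squares fit of Y = a + b*X for the salary example in the slides.
X = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years
Y = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary

n = len(X)
x_bar = sum(X) / n   # 9.1
y_bar = sum(Y) / n   # 55.4

# b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2), then a = Ybar - b * Xbar
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum((x - x_bar) ** 2 for x in X)
a = y_bar - b * x_bar

predicted = a + b * 10   # predicted salary for a hypothetical X = 10 years
```

The same coefficients come out of the normal equations (1) and (2); the deviation form used here is just an algebraic rearrangement of b = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²).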