Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Bayesian Classification Bayesian Classification: Why? A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities It is based on the famous Bayes theorem. A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured Bayes Theorem • • • Let X be the data record (case) whose class label is unknown. Let H be some hypothesis, such as "data record X belongs to a specified class C.“ For classification, we want to determine P (H|X) – the probability that the hypothesis H holds, given the observed data record X. P (H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round. • In contrast, P(H) is the prior probability, or a priori probability, of H. In this example P(H) is the probability that any given data record is an apple, regardless of how the data record looks. • The posterior probability, P (H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X. Bayesian Classification: Simple introduction… Similarly, P (X|H) is posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know that it is true that X is an apple. P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round. Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X), and P(X|H). Bayes theorem is P (H|X) = P(X|H) P(H) / P(X) Bayes Classifier A probabilistic framework for solving classification problems P ( A, C ) Conditional Probability: P (C | A) P ( A) P ( A, C ) P( A | C ) P (C ) Bayes theorem: P( A | C ) P(C ) P(C | A) P( A) Example of Bayes Theorem Given: A doctor knows that meningitis causes stiff neck 50% of the time Prior probability of any patient having meningitis is 1/50,000 Prior probability of any patient having stiff neck is 1/20 If a patient has stiff neck, what’s the probability he/she has meningitis? P( S | M ) P( M ) 0.5 1 / 50000 P( M | S ) 0.0002 P( S ) 1 / 20 Bayesian Classifiers Consider each attribute and class label as random variables Given a record with attributes (A1, A2,…,An) Goal is to predict class C Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) Can we estimate P(C| A1, A2,…,An ) directly from data? Bayesian Classifiers Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem P( A A A | C ) P(C ) P(C | A A A ) P( A A A ) 1 1 2 2 n n 1 2 Choose value of C that maximizes P(C | A1, A2, …, An) Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C) How to estimate P(A1, A2, …, An | C ) ? n Naïve Bayes Classifier Assume independence among attributes Ai when class is given: P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj) Can estimate P(Ai| Cj) for all Ai and Cj. New point is classified to Cj if P(Cj) P(Ai| Cj) is maximal. How to Estimate Probabilities from Data? For continuous attributes: Discretize the range into bins k Two-way split: (A < v) or (A > v) one ordinal attribute per bin violates independence assumption choose only one of the two splits as new attribute Probability density estimation: Assume attribute follows a normal distribution Use data to estimate parameters of distribution (e.g., mean and standard deviation) Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c) Naïve Bayes Classifier: Example1 Name human python salmon whale frog komodo bat pigeon cat leopard shark turtle penguin porcupine eel salamander gila monster platypus owl dolphin eagle Give Birth yes Give Birth yes no no yes no no yes no yes yes no no yes no no no no no yes no Can Fly no no no no no no yes yes no no no no no no no no no yes no yes Can Fly no Live in Water Have Legs no no yes yes sometimes no no no no yes sometimes sometimes no yes sometimes no no no yes no Class yes no no no yes yes yes yes yes no yes yes yes no yes yes yes yes no yes mammals non-mammals non-mammals mammals non-mammals non-mammals mammals non-mammals mammals non-mammals non-mammals non-mammals mammals non-mammals non-mammals non-mammals mammals non-mammals mammals non-mammals Live in Water Have Legs yes no Class ? A: attributes M: mammals N: non-mammals 6 6 2 2 P ( A | M ) 0.06 7 7 7 7 1 10 3 4 P ( A | N ) 0.0042 13 13 13 13 7 P ( A | M ) P( M ) 0.06 0.021 20 13 P ( A | N ) P ( N ) 0.004 0.0027 20 P(A|M)P(M) > P(A|N)P(N) => Mammals Naïve Bayes (Summary) Robust to isolated noise points Handle missing values by ignoring the instance during probability estimate calculations Robust to irrelevant attributes Independence assumption may not hold for some attributes It makes computation possible It yields optimal classifiers when satisfied But this is seldom satisfied in practice, as attributes (variables) are often correlated.