PGM Tirgul 11: Naïve Bayesian Classifier + Tree Augmented Naïve Bayes
(adapted from a tutorial by Nir Friedman and Moises Goldszmidt)

The Classification Problem
From a data set describing objects by vectors of features and a class, e.g.:
  Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3>    Presence
  Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3>    Presence
  Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3>     Presence
  Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6>    Absence
  Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6>  Absence
  Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3>    Presence
  Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6>  Absence
find a function F: features -> class that classifies a new object.

Examples
* Predicting heart disease
  Features: cholesterol, chest pain, angina, age, etc.
  Class: {present, absent}
* Digit recognition
  Features: matrix of pixel descriptors
  Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
* Finding lemons in cars
  Features: make, brand, miles per gallon, acceleration, etc.
  Class: {normal, lemon}
* Speech recognition
  Features: signal characteristics, language model
  Class: {pause/hesitation, retraction}

Approaches
* Memory based: define a distance between samples (nearest neighbor, support vector machines).
* Decision surface: find the best partition of the space (CART, decision trees).
* Generative models: induce a model and impose a decision rule (Bayesian networks).

Generative Models: Bayesian Classifiers
* Induce a probability distribution describing the data, P(A_1, ..., A_n, C).
* Impose a decision rule: given a new object <a_1, ..., a_n>, return
    c = \arg\max_c P(C = c \mid a_1, \ldots, a_n)
* We have shifted the problem to learning P(A_1, ..., A_n, C).
* We are learning how to do this efficiently: learn a Bayesian network representation of P(A_1, ..., A_n, C).

Optimality of the Decision Rule
Minimizing the error rate: let c_i be the true class and l_j the class returned by the classifier. A decision is correct if c_i = l_j and in error if c_i != l_j. The error incurred by choosing label l_i is
    E(l_i \mid a) = \sum_{j \neq i} P(c_j \mid a) = 1 - P(c_i \mid a)
Thus, had we had access to P, we would minimize the error rate by choosing l_i whenever
    P(l_i \mid a) \geq P(l_j \mid a) \quad \text{for all } j \neq i,
which is exactly the decision rule of the Bayesian classifier.
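To make this rule concrete, here is a minimal Python sketch of how a generative classifier turns a learned joint distribution into a classification decision. The helper name `joint_prob` is a hypothetical placeholder for whatever model (e.g. a Bayesian network) supplies P(a_1, ..., a_n, c); it is not part of the tutorial.

```python
# Minimal sketch of the Bayesian decision rule.  `joint_prob(instance, c)` is a
# hypothetical callable returning P(a_1, ..., a_n, c) under the learned model.

def classify(instance, classes, joint_prob):
    """Return argmax_c P(C = c | instance).

    Since P(c | a) = P(a, c) / P(a) and P(a) does not depend on c,
    comparing the joint probabilities P(a, c) across classes is enough.
    """
    return max(classes, key=lambda c: joint_prob(instance, c))
```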
Advantages of the Generative Model Approach
* Output: a ranking over the outcomes (the likelihood of present vs. absent).
* Explanation: what is the profile of a "typical" person with heart disease?
* Missing values: handled both in training and in testing.
* Value of information: if the person has high cholesterol and high blood sugar, which other test should be conducted?
* Validation: confidence measures over the model and its parameters.
* Background knowledge: priors and structure.

Evaluating the Performance of a Classifier: n-fold Cross-Validation
* Partition the data set into n segments D_1, D_2, ..., D_n.
* Do n times (Run 1, ..., Run n): train the classifier on n-1 segments and test its accuracy on the held-out segment.
* Compute statistics on the n runs: mean accuracy and variance.
* Accuracy on test data of size m:
    \mathrm{Acc} = \frac{1}{m} \sum_{k=1}^{m} \delta(c_k, l_k)
  where \delta(c_k, l_k) = 1 if the predicted label l_k equals the true class c_k and 0 otherwise.
(A code sketch of this protocol appears at the end of these notes.)

Advantages of Using a Bayesian Network
* Efficiency in learning and query answering.
* Combines knowledge engineering and statistical induction.
* Algorithms for decision making, value of information, diagnosis and repair.
Example: heart disease, accuracy = 85%, data source: UCI repository.
[Figure: Bayesian network over Outcome, Age, Vessels, MaxHeartRate, STSlope, Angina, ChestPain, BloodSugar, Cholesterol, RestBP, Sex, ECG, OldPeak, Thal.]

Problems with BNs as Classifiers
When evaluating a Bayesian network, we examine the log-likelihood of the model B given the data D and try to maximize it:
    LL(B \mid D) = \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i, c^i)
When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:
* A Bayesian network minimizes the error over all the variables in the domain, not necessarily the local error of the class given the attributes (OK with enough data).
* Because of the penalty, a Bayesian network in effect looks only at a small subset of the variables that affect a given node (its Markov blanket).

Problems with BNs as Classifiers (cont.)
Let's look closely at the likelihood term:
    LL(B \mid D) = \sum_{i=1}^{N} \log P_B(c^i \mid a_1^i, \ldots, a_n^i) + \sum_{i=1}^{N} \log P_B(a_1^i, \ldots, a_n^i)
* The first term estimates just what we want: the probability of the class given the attributes.
* The second term estimates the joint probability of the attributes. When there are many attributes, the second term starts to dominate (the magnitude of the log is large for small probabilities).
* Why not use just the first term? Because then we can no longer factorize, and the calculations become much harder.

The Naïve Bayesian Classifier
Example: diabetes in Pima Indians (from the UCI repository).
[Figure: naïve Bayes structure in which the class C is the single parent of each feature F_1 = pregnant, F_2 = age, F_3 = insulin, F_4 = dpf, F_5 = mass, F_6 = glucose.]
* A fixed structure encoding the assumption that the features are independent of each other given the class:
    P(C \mid F_1, \ldots, F_6) \propto P(F_1 \mid C)\, P(F_2 \mid C) \cdots P(F_6 \mid C)\, P(C)
* Learning amounts to estimating the parameters of P(F_i | C) for each F_i.

The Naïve Bayesian Classifier (cont.)
What do we gain?
* We ensure that in the learned network the probability P(C | A_1, ..., A_n) takes every attribute into account.
* We will show a polynomial-time algorithm for learning the network.
* The estimates are robust, consisting of low-order statistics that require few instances.
* It has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.

The Naïve Bayesian Classifier (cont.)
Common practice is to estimate
    \hat{\theta}_{a_i \mid c} = \frac{N(a_i, c)}{N(c)}
These estimates are identical to the MLE for multinomials.
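A minimal sketch of how these count-based estimates turn into a working naïve Bayes classifier, assuming discrete features and a plain Python representation (the class and method names are illustrative, not from the tutorial):

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayes:
    """Naïve Bayes for discrete features, trained by counting (MLE)."""

    def fit(self, X, y):
        # N(c): how often each class value appears.
        self.class_counts = Counter(y)
        # N(a_i, c): how often feature i takes value a_i together with class c.
        self.feature_counts = defaultdict(Counter)
        for features, c in zip(X, y):
            for i, a in enumerate(features):
                self.feature_counts[(i, c)][a] += 1
        self.n = len(y)
        return self

    def _log_score(self, features, c):
        # log P(c) + sum_i log P(a_i | c), with theta_{a_i|c} = N(a_i, c) / N(c).
        score = log(self.class_counts[c] / self.n)
        for i, a in enumerate(features):
            count = self.feature_counts[(i, c)][a]
            if count == 0:            # unseen value: the MLE gives probability 0
                return float("-inf")  # (smoothing, discussed later, avoids this)
            score += log(count / self.class_counts[c])
        return score

    def predict(self, features):
        # Decision rule: argmax_c P(c | a_1, ..., a_n).
        return max(self.class_counts, key=lambda c: self._log_score(features, c))
```

For example, `NaiveBayes().fit(X, y).predict(x)` would classify a new patient vector x on data such as the Pima Indians set.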
Improving Naïve Bayes
* Naïve Bayes encodes independence assumptions that may be unreasonable: are pregnancy and age really independent given diabetes?
* Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level over-penalize the class variable).
* The success of naïve Bayes is attributed to:
  - Robust estimation.
  - The decision may be correct even if the probabilities are inaccurate.
* Idea: improve on naïve Bayes by weakening the independence assumptions.
* Bayesian networks provide the appropriate mathematical language for this task.

Tree Augmented Naïve Bayes (TAN)
[Figure: TAN structure for the diabetes example, in which C is a parent of every feature and the features pregnant, age, insulin, dpf, mass and glucose are additionally connected by tree edges.]
    P(C \mid F_1, \ldots, F_6) \propto P(F_1 \mid C)\, P(F_2 \mid F_1, C) \cdots P(F_6 \mid F_3, C)\, P(C)
* Approximate the dependence among features with a tree Bayes net.
* Tree induction algorithm:
  - Optimality: maximum likelihood tree.
  - Efficiency: polynomial algorithm.
* Robust parameter estimation.

Optimal Tree Construction Algorithm
The procedure of Chow and Liu constructs a tree structure B_T that maximizes LL(B_T | D):
1. Compute the mutual information between every pair of attributes:
    I(A_i; A_j) = \sum_{a_i, a_j} P_D(a_i, a_j) \log \frac{P_D(a_i, a_j)}{P_D(a_i)\, P_D(a_j)}
2. Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as its weight.
3. Build a maximum weighted spanning tree of this graph.
Complexity: O(n^2 N) + O(n^2) + O(n^2 \log n) = O(n^2 N), where n is the number of attributes and N is the sample size.

Tree Construction Algorithm (cont.)
It is easy to "plant" the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the class into account:
    I(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P_D(a_i, a_j, c) \log \frac{P_D(a_i, a_j \mid c)}{P_D(a_i \mid c)\, P_D(a_j \mid c)}
This measures the gain in log-likelihood of adding A_i as a parent of A_j when C is already a parent. (A code sketch of this procedure appears after the Beyond TAN slide below.)

Problem with TAN
* When estimating parameters we estimate the conditional probability P(A_i | Parents(A_i)). This is done by partitioning the data according to the possible values of Parents(A_i). When a partition contains just a few instances, we get an unreliable estimate.
* In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate).
* In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.
* Solution: smooth the estimates,
    \theta^*(x \mid \mathrm{Pa}_x) = \alpha\, P_D(x \mid \mathrm{Pa}_x) + (1 - \alpha)\, P_D(x), \quad \alpha = \frac{N\, P_D(\mathrm{Pa}_x)}{N\, P_D(\mathrm{Pa}_x) + s},
  where s is the smoothing bias, typically small.

Performance: TAN vs. Naïve Bayes
[Figure: scatter plot of Naïve Bayes accuracy (y-axis) vs. TAN accuracy (x-axis) on 25 data sets from the UCI repository (medical, signal processing, financial, games); accuracy based on 5-fold cross-validation, no parameter tuning.]

Performance: TAN vs. C4.5
[Figure: scatter plot of C4.5 accuracy (y-axis) vs. TAN accuracy (x-axis) on the same 25 UCI data sets; accuracy based on 5-fold cross-validation, no parameter tuning.]

Beyond TAN
Can we do better by learning a more flexible structure?
Experiment: learn a Bayesian network without restrictions on the structure.
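As referenced in the tree-construction slides above, here is a minimal sketch of the TAN tree-building step (Python, standard library only; discrete features, with probabilities estimated directly from counts). It computes the class-conditional mutual information I(A_i; A_j | C) for every pair of attributes and extracts a maximum weighted spanning tree with a simple Prim-style loop. The function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log

def conditional_mutual_information(data, i, j, labels):
    """Empirical I(A_i; A_j | C) from feature vectors `data` and class `labels`."""
    n = len(data)
    pair_c = Counter((x[i], x[j], c) for x, c in zip(data, labels))
    i_c = Counter((x[i], c) for x, c in zip(data, labels))
    j_c = Counter((x[j], c) for x, c in zip(data, labels))
    c_counts = Counter(labels)
    mi = 0.0
    for (ai, aj, c), n_ijc in pair_c.items():
        p_ijc = n_ijc / n                          # P_D(a_i, a_j, c)
        p_ij_given_c = n_ijc / c_counts[c]         # P_D(a_i, a_j | c)
        p_i_given_c = i_c[(ai, c)] / c_counts[c]   # P_D(a_i | c)
        p_j_given_c = j_c[(aj, c)] / c_counts[c]   # P_D(a_j | c)
        mi += p_ijc * log(p_ij_given_c / (p_i_given_c * p_j_given_c))
    return mi

def chow_liu_tree(data, labels, n_attrs):
    """Maximum weighted spanning tree over attributes, weights = I(A_i; A_j | C)."""
    weight = {}
    for i, j in combinations(range(n_attrs), 2):
        weight[(i, j)] = conditional_mutual_information(data, i, j, labels)
    # Prim's algorithm: grow the tree from attribute 0.
    in_tree = {0}
    edges = []
    while len(in_tree) < n_attrs:
        best = max(((i, j) for i in in_tree
                    for j in range(n_attrs) if j not in in_tree),
                   key=lambda e: weight[(min(e), max(e))])
        edges.append(best)      # best[0] (already in the tree) becomes the parent of best[1]
        in_tree.add(best[1])
    return edges
```

In the full TAN construction one would then pick a root attribute, direct all tree edges away from it, add the class C as an additional parent of every attribute, and estimate the parameters with the smoothed estimator \theta^* described above.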
Performance: TAN vs. Bayesian Networks
[Figure: scatter plot of unrestricted Bayesian network accuracy (y-axis) vs. TAN accuracy (x-axis) on 25 data sets from the UCI repository (medical, signal processing, financial, games); accuracy based on 5-fold cross-validation, no parameter tuning.]

Classification: Summary
* Bayesian networks provide a useful language for improving Bayesian classifiers.
* Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.
* Additional benefits:
  - Handling missing values.
  - Computing the tradeoffs involved in finding out feature values.
  - Computing misclassification costs.
* Recent progress: combine generative probabilistic models, such as Bayesian networks, with decision-surface approaches such as support vector machines.
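The accuracy figures quoted in the performance comparisons above are based on n-fold cross-validation as described earlier. A minimal sketch of that protocol (Python; the `make_classifier` factory and the `fit`/`predict` interface are assumptions that match the naïve Bayes sketch above, not part of the tutorial):

```python
from statistics import mean, variance

def cross_validate(X, y, make_classifier, n_folds=5):
    """n-fold cross-validation: train on n-1 segments, test on the held-out one."""
    # Partition the index set into n_folds segments D_1, ..., D_n (by stride).
    folds = [list(range(k, len(X), n_folds)) for k in range(n_folds)]
    accuracies = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in test_set]
        train_y = [y[i] for i in range(len(X)) if i not in test_set]
        clf = make_classifier().fit(train_X, train_y)
        # Acc = (1/m) * sum_k delta(c_k, l_k) over the m held-out instances.
        hits = sum(1 for i in test_idx if clf.predict(X[i]) == y[i])
        accuracies.append(hits / len(test_idx))
    return mean(accuracies), variance(accuracies)
```

For example, `cross_validate(X, y, NaiveBayes, n_folds=5)` follows the 5-fold protocol used in the scatter plots, returning the mean accuracy and its variance over the runs.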