* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Statistical Inference, Multiple Comparisons, Random Field Theory
Survey
Document related concepts
Transcript
Study of Bayesian network classifier Huang Kaizhu Supervisors: Prof. Irwin King Prof. Lyu Rung Tsong Michael Markers: Prof. Chan Lai Wan Prof. Wong Kin Hong Outline • Background – – – – What is Bayesian network? How Bayesian networks can be used as classifiers? Why choose Bayesian network? What is problem of Learning Bayesian network ? • My main works – Large-Node Chow-Liu-tree – Maximum likelihood Large-Node-Bounded semi-Naïve BN • Future work • Conclusion Background • What is Bayesian Network(BN)? – Composed of a “structure” component G and a “parameter” component . – G=(V,E) is a directed acyclic graph. nodes set :V and its edge set is E. And the nodes represent the attributes, the edges between the nodes represent the dependence relationship between the nodes. – is a conditional probability table. – It encodes the following joint probability among the nodes (X1,X2,…,Xn): n P ( X 1,..., Xn) P ( Xi | Xi )...(1) i 1 Xi is the parents of Xi Background(con’t) Example of Bayesian network: Structure component Parameter component The Bayesian network above encodes the following probability relationship. P(F, B, L, D , H) =P(F) P(B)P(L |F) P(D |F,B)P(H|D) Background (con’t) • How can BN be a classifier? – Firstly use BN to model the dataset – Then use the distribution encoded in BN to do classification c( x1 , x2 ,..., xn ) arg max P(c | x1 , x2 ,..., xn ) c arg max P( x1 , x2 ,..., xn , c) c Background (con’t) • Why choose Bayesian network? – Bayesian network represents some inner relationship between the attributes – The joint probability based on BN can be written as a decomposable form n P ( X 1,..., Xn) P ( Xi | Xi )...(1) i 1 Xi is the parents of Xi Background (con’t) • What is the problem of learning Bayesian network? – Given a training data set D={u1, u2 , u3 …uN } of instances of attributes U, find a network B that best matches D. • What’s the difficulty in learning BN? – Generally speaking, BN optimization problem is intractable. • Two Approaches – Either we constrain the searching in a certain restricted class of networks (Naïve BN, Semi-Naïve BN, CL-tree etc) Q1: Are these restricted class enough to represent the data? – Or we adopt some heuristic methods on general networks (K2 etc) • Q2: Are the heuristic methods on general network efficient ? Q3: Are the heuristic methods on general network redundant to represent the data? Background (con’t) • Problems in two approaches – Q1: Are these restricted class enough to represent the data? • No, in many cases, they are really too limited in expression ability to model the data. – Q2: Are the heuristic methods on general network efficient ? • No, they have a big search space, which will be greatly timeconsuming – Q3: is it possible that general networks obtained by the heuristic methods are redundant to represent the data? • Yes, sometimes, these methods favors more complex structure, which will really increase the risk of over-fitting problem. Possible solutions – Upgrading Solution • How about we obtain a restricted BN firstly, then we aim at solving the shortcomings of this network caused by the restriction and upgrade it into a not so simple structure? – Bound Solution • Can we take some strategies to bound the complexity of networks, then we find the best structure in this bound. The final network can be controlled by a bound parameter. Work1:Large node Chow-Liu tree • Upgrade Chow-Liu tree(CLT) into Large node ChowLiu tree(LNCLT) – What is the restriction of CLT? • CLT restricts the network in a tree structure among the variables – Shortcomings caused by the restriction • CLT can not represent many dataset with a non-tree underlying structure . – Observations: • A “large node tree” may partially solve this shortcoming. – Example: • Right figure P( A, B, C , D, E ) P( A) P( BC | A) P( DE | BC ) Work1:Large node Chow-Liu tree • Large-node, which is a combination of several nodes, may partially relax the tree restriction. In forming a large node,There are two requirements. – Requirement 1 • Large-node must be really like a single node which means the nodes in a Large node are really more dependent on each other. – Requirement 2 • Large-node can not be too “large” or the probability estimation of this large node will not be not reliable – An extreme situation is that we combine all the nodes into a large node. This situation will lost all the advantages of Bayesian network. Upgrade CLT into Large-Node-CLT • A bounded Frequent itemset can satisfy the Requirement 1 & 2 – What is Frequent itemset? • It is the set of attributes that come together with each other frequently. • Example: Food store---{bread}, {button}, {bread, button} – An frequent itemset with high frequency is more like a “large node”. ---Requirement 1 – We restrict that the the number of nodes involved in a large node is no greater than a K threshold ---Requirement 2 – Frequent itemset can be obtained according to the algorithm Apriori in [AS1994] Upgrade CLT into Large-Node-CLT • The construction algorithm 1. Call Apriori[AS94] to generate the frequent itemsets, which have the size less than k. Record all the frequent itemsets together with their supports into list L. 2. Draft the CL-tree of the dataset according to the CLT algorithm 3. Until L is null 4. Iteratively combine the frequent itemsets which satisfy the combination conditions: father-son or sibling relationship 1.{A,C} does not satisfy the 2.f{B,C} is the condition, biggest andfilter satisfies combination out 3..Filter the frequent itemsets which P( A, B, C , D, E ) P( A) P( BC | A) P( DE | BC ) combination condition, {A,C} 4..{D, Ecoverage } is frequent have them with {B,C}itemset , combine intothe (c) satisfies the and {D,E} is left. the combination Example: condition, combine them into (d) the k is 2, after step 1, we get the We assume frequent itemsets {A, B} {A, C},{B, C}, {B, E}, {B, D}, {D, E}. And f({B, C})>f({A, B})> f({B, E}) >f({B, D})>f({D, E}) (s represents the frequency of frequent itemsets). (b) is the CLT in step2. Experiment • Database – The experiments are conducted on MNIST handwritten digit database. – MNIST consists of : • a 60000-digit training dataset • a 10000-digit testing dataset. • Both are 28*28 gray-level digit datasets Experiment • Preprocessing of MNIST database – Binarization :We use a global binarization method to binarize MNIST datasets. – Feature Extraction[Bakis68]: 4*4*6=96 dimension feature Experiment • We build 10 LNCLTs for 10 digits, we give out the classification result by selecting the LNCLT which has a maximum probability output. • We compare LNCLT with CLT in : – Data fitness---log likelihood – Recognition rate Experimental results Data fitness---Log likelihood testing Experimental results Recognition rate We randomly selected 1000 digits as test datasets from the 10000-digit testing dataset. We do the testing 10 times Work2: Bound approach in semi-Naïve Bayesian network 1.A bounded Semi-Naïve Bayesian network(SNB). 2. We reduced the SNB into a network which has the same number K of nodes in every subset, K is the bound parameter. 3. We use Linear programming to do the optimization. 4. Our solution is shown sub-optimal Comparison between our model and traditional SNB • Time cost – Our model can be solved in a polynomial time – Traditional SNB has an exponential time cost • Structure – Each large node in our model has the same number of nodes K, K is a bound parameter – The number of nodes in subsets of traditional SNB are not same and some values of this number may be very large. • Performance – Our model is shown to be a sub-optimal in the bound restriction – In traditional SNB , there is no evidence that show it is optimal or suboptimal Experimental results • We evaluate our approach on Tic and vote dataset from UCI machine learning repository Future work • Evaluate our approaches based on a large number of datasets in Machine Learning repository from UCI • Build a Bayesian network which combine the upgrading strategy and bound strategy – In fact we are considering if we can upgrade our boundedSNB into a mixture model of bounded-SNB. Conclusion • A dilemma between simple structure and complex structure seems to exist in learning Bayesian network classifiers . • In this presentation, we test two approaches to deal with this problem. One is the Large-node Chow-Liu tree approach which is based on upgrading idea and the other is bounded semi-Naïve Bayesian network. • The experimental results show that these two approaches are promising and encouraging. Main Reference • • • • • • • • • • • [AS1994] R. Agrawal, R. Srikant, 1994,“Fast algorithms for mining association rules”, Proc. VLDB-94 1994. [Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Information Theory, 14,(pp462-467) [Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29,(pp.131-161) [Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of sixth European working session on learning (pp.206-219). Springer-Verlag [Maxwell1995] Learning Bayesian Networks is NP-Complete [Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: networks of plausible inference, Morgan Kaufmann. [Cheng1997] Cheng, J. Bell, D.A. Liu, W. 1997, Learning Belief Networks from Data: An Information Theory Based Approach. In Proceedings of ACM CIKM’97 [Cheng2001] Cheng, J. and Greiner, R. 2001, Learning Bayesian Belief Network Classifiers: Algorithms and System, E.Stroulia and S. Matwin(Eds.): AI 2001, LNAI 2056, (pp.141-151), [Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. Extending Naive Bayes Classifiers using long Itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, (pp. 165—174) [Srebro2000] Artificial Intelligence Laboratory, Massachusetts Institute of Technology Cambridge, Massachusetts 02139, http://www.ai.mit.edu Q& A Thanks! Work2: Bound strategy on Semi-Naïve BN • We restrict the semi-naïve network into not too complex structure. Large Node Bounded semi-Naïve BN Model Bounded-SNB MODEL DEFINITION Reduce Bounded-SNB MODEL According to Lemma 1, given a bound K, we should not separate the variables set into too many small subsets. Or it is more possible that we can combine some of the subsets into a new subset whose cardinality is no greater than K, thus the new SNB will be coarser than the old one. From this viewpoint, we reduce the searching space of BLN-SNB into a K-regular SNB space since there are no possibility that a SNB coarser than K-regular SNB exists in the K-bound. Even though it is reasonable to search the maximum likelihood SNB in the K-regular-SNB space, we won't say that: a K-regular SNB is absolutely better than a non-K-regular SNB with the biggest cardinality no more than K . It is obvious some non-K-regular SNBs can not be combined into a K-regular SNB. Thus in such a way, we reduce the searching space into a sub-space of K-bound SNB. Difference between our model & traditional SNB 1. Different approach • • Traditional SNB employs independence testing to find the semi structure,which will cause an exponential computational cost. Our approach employs the linear programming method to find the semi structure, which is polynomial in computational complexity. 2. Different performance • • There are no evidence that shows traditional SNB can find an optimal or sub-optimal structure. Our approach can maintain a sub-optimal structure. K-Bounded-SNB Problem K-Bounded-SNB Problem: Finding the m= [n/K ] K-cardinality subsets from attributes set which satisfy the SNB conditions to maximize the log likelihood (3). [x] means rounding the x to the nearest integer Transforming into Integer Programming Problem Model definition If we relax the (6) into 0x 1, IP is transformed into a Linear Programming problem which can be solved in a polynomial time. Computational complexity analysis Traditional SNB time cost is exponential cost Our model is polynomial time cost