Download ppt - hkust cse

L11: Uses of Bayesian Networks Nevin L. Zhang Department of Computer Science & Engineering The Hong Kong University of Science & Technology http://www.cse.ust.hk/~lzhang/ Page 2 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 3 Traditional Uses   Probabilistic Expert Systems  Diagnostic  Prediction Example: BN for diagnosing “blue baby” over phone in a London Hospital Comparable to specialist, Better than others Page 4 Traditional Uses  Language for describing probabilistic models in Science & Engineering  Example: BN for turbo code Page 5 Traditional Uses  Language for describing probabilistic models in Science & Engineering  Example: BN from Bioinformatics Page 6 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 7 BN for Structure Discovery  Given: Data set D on variables X1, X2, …, Xn  Discover dependence, independence, and even causal relationship among the variable.  Example: Evolution trees Page 8 Phylogenetic Trees   Assumption  All organisms on Earth have a common ancestor  This implies that any set of species is related. Phylogeny   The relationship between any set of species. Phylogenetic tree  Usually, the relationship can be represented by a tree which is called a phylogenetic (evolution) tree  this is not always true Page 9 Phylogenetic Trees  Phylogenetic trees Time giant panda lesser panda moose goshawk duck vulture alligator Current-day species at bottom Page 10 Phylogenetic Trees  TAXA (sequences) identify species  Edge lengths represent evolution time  Assumption: bifurcating tree topology AAGGCCT AAGACTT AGCACTT AAGGCAT AGGGCAT AGCACAA TAGACTT TAGCCCA AGCGCTT Time Page 11 Probabilistic Models of Evolution  Characterize relationship between taxa using substitution probability: P(x | y, t): probability that ancestral sequence y evolves into sequence x along an edge of length t t5 x5 t1 s1 x7 t2 s2 t3 s3 x6 t6 t4 s4 P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), …. Page 12 Probabilistic Models of Evolution  What should P(x|y, t) be?  Two assumptions of commonly used models  There are only substitutions, no insertions/deletions (aligned)  One-to-one correspondence between sites in different sequences   Each site evolves independently and identically P(x|y, t) = ∏i=1 to m P(x(i) | y(i), t)  m is sequence length AAGGCCT AAGACTT AGCACTT AAGGCAT AGGGCAT TAGACTT TAGCCCA AGCACAA AGCGCTT Page 13 Probabilistic Models of Evolution  What should P(x(i)|y(i), t) be?  Jukes-Cantor (Character Evolution) Model [1969]  Rate of substitution a (Constant or parameter?)  A A rt C st G st T st C st rt st st G st st rt st T st st st rt Multiplicativity (lack of memory) rt = 1/4 (1 + 3e-4at) st = 1/4 (1 - e-4at) Limit values when t = 0 or t = infinity? P(c | a, t1  t2 )   P(b | a, t1 ) P(c | b, t2 ) b Page 14 Tree Reconstruction    Given: collection of currentday taxa AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT Find: tree  Tree topology: T  Edge lengths: t Maximum likelihood  Find tree to maximize P(data | tree) AGGGCAT AGCACAA TAGACTT TAGCCCA AGCGCTT Page 15 Tree Reconstruction  When restricted to one particular site, a phylogenetic tree is an LT model where  The structure is a binary tree and variables share the same state space.  The conditional probabilities are from the character evolution model, parameterized by edge lengths instead of usual parameterization.  The model is the same for different sites AAGGCCT AAGACTT AGCACTT AGGGCAT AGCACAA TAGACTT TAGCCCA AGCGCTT Page 16 Tree Reconstruction  Current-day Taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT  Samples for LT model. One Sample per site. The samples are i.i.d.  1st site: (A, T, T, A, A),  2nd site: (G, A, A, G, G),  3rd site: (G, G, G, C, C), AAGGCCT AAGACTT AGCACTT AGGGCAT AGCACAA TAGACTT TAGCCCA AGCGCTT Page 17 Tree Reconstruction  Finding ML phylogenetic tree == Finding ML LT model  Model space:  Model structures: binary tree where all variables share the same state space, which is known.  Parameterization: one parameter for each edge. (In general, P(x|y) has |x||y|-1 parameters).  The objective is to find relationships among variables.  Applying new LTM algorithms to Phylogenetic tree reconstruction? Page 18 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 19 BN for Density Estimation  Given: Data set D on variables X1, X2, …, Xn  Estimate: P(X1, X2, …, Xn) under some constraints  ..  Uses of the estimate:  Inference  Classification Page 20 BN Methods for Density Estimation   Chow-Liu tree with X1, X2, …, Xn as nodes  Easy to compute  Easy to use  Might not be good estimation of “true” distribution BN with X1, X2, …, Xn as nodes  Can be good estimation of “true” distribution.  Might be difficult to find  Might be complex to use Page 21 BN Methods for Density Estimation  LC model with X1, X2, …, Xn as manifest variables (Lowd and Domingos 2005)  Determine the cardinality of the latent variable using hold-out validation,  Optimize the parameters using EM.  ..  Easy to compute  Can be good estimation of “true” distribution  Might be complex to use (cardinality of latent variable might be very large) Page 22 BN Methods for Density Estimation  LT model for density estimation  Pearl 1988: As model over manifest variables, LTMs  Are computationally very simple to work with.  Can represent complex relationships among manifest variables. Page 23 BN Methods for Density Estimation  New approximate inference algorithm for Bayesian networks (Wang, Zhang and Chen, AAAI 08, JAIR 32: 879-900, 08 ) Sample sparse sparse Learn dense dense Page 24 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 25 Bayesian Networks for Classification  The problem:   Given data: Find mapping  (A1, A2, …, An) |- C  Possible solutions  ANN  Decision tree (Quinlan)  …  (SVM: Continuous data) A1 A2 … An C 0 1 … 0 T 1 0 … 1 F .. .. .. .. .. Page 26 Bayesian Networks for Classification  Naïve Bayes model  From data, learn  P(C), P(Ai|C)  Classification  arg max_c P(C=c|A1=a1, …, An=an)  Very good in practice Page 27 Bayesian Networks for Classification   Drawback of NB:  Attributes mutually independent given class variable  Often violated, leading to double counting. Fixes:  General BN classifiers  Tree augmented Naïve Bayes (TAN) models  Hierarchical NB  Bayes rule + Density Estimation  … Page 28 Bayesian Networks for Classification  General BN classifier  Treat class variable just as another variable  Learn a BN.  Classify the next instance based on values of variables in the Markov blanket of the class variable.  Pretty bad because it does not utilize all available information because of Markov boundary Page 29 Bayesian Networks for Classification  TAN model  Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian networks classifiers. Machine Learning, 29:131-163.  Capture dependence among attributes using a tree structure.  During learning,  First learn a tree among attributes: use Chow-Liu algorithm  Add class variable and estimate parameters  Classification  arg max_c P(C=c|A1=a1, …, An=an) Page 30 Bayesian Networks for Classification  Hierarchical Naïve Bayes models  N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.  Capture dependence among attributes using latent variables  Detect interesting latent structures besides classification  Algorithm in the step of DHC..  … Page 31 Bayesian Networks for Classification  Bayes Rule  .   Chow-Liu  LC model  LT Model Wang Yi: Bayes rule + LT model is for superior Page 32 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 33 BN for Clustering    Latent class (LC) model  One latent variable  A set of manifest variables Conditional Independence Assumption:  Xi’s mutually independent given Y.  Also known as Local Independence Assumption Used for cluster analysis of categorical data  Determine cardinality of Y: number of clusters  Determine P(Xi|Y): characteristics of clusters Page 34 BN for Clustering Clustering Criteria  Distance based clustering:   Minimizes intra-cluster variation and/or maximizes inter-cluster variation LC Model-based clustering:  The criterion follows from the conditional independence assumption  Divide data into clusters such that, in each cluster, manifest variables are mutually independent under the empirical distribution. Page 35 BN for Clustering  Local independence assumption often not true  LT models generalize LC models  Relax the independence assumption  Each latent variable gives a way to partition data… multidimensional clustering Page 36 ICAC Data // 31 variables, 1200 samples C_City: s0 s1 s2 s3 // very common, quit common, uncommon, .. C_Gov: s0 s1 s2 s3 C_Bus: s0 s1 s2 s3 Tolerance_C_Gov: s0 s1 s2 s3 Tolerance_C_Bus: s0 s1 s2 s3 WillingReport_C: s0 s1 s2 // yes, no, depends LeaveContactInfo: s0 s1 // yes, no I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ... I_Effectiveness: s0 s1 s2 s3 s4 //very e, e, a, in-e, very in-e I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ... //totally intolerable, intolerable, tolerable,... ….. -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0 -1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0 -1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0 …. Page 37 Latent Structure Discovery Y2: Demographic info; Y3: Tolerance toward corruption Y4: ICAC performance; Y7: ICAC accountability Y5: Change in level of corruption; Y6: Level of corruption (Zhang, Poon, Wang and Chen 2008) Page 38 Interpreting Partition  Y2 partition the population into 4 clusters  What is the partition about? What is “criterion”?  On what manifest variables do the clusters differ the most?  Mutual information:  The larger I(Y2, X), the more the 4 clusters differ on X Page 39 Interpreting Partition  Information curves:  Partition of Y2 is based on Income, Age, Education, Sex  Interpretation: Y2 --- Represents a partition of the population based on demographic information  Y3 --- Represents a partition based on Tolerance toward Corruption Page 40 Interpreting Clusters Y2=s0: Low income youngsters; Y2=s1: Women with no/low income Y2=s2: people with good education and good income; Y2=s3: people with poor education and average income Page 41 Interpreting Clustering Y3=s0: people who find corruption totally intolerable; 57% Y3=s1: people who find corruption intolerable; 27% Y3=s2: people who find corruption tolerable; 15% Interesting finding: Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus Y3=s0: Same attitude toward C-Gov and C-Bus People who are touch on corruption are equally tough toward C-Gov and C-Bus. People who are relaxed about corruption are more relaxed toward C-Bus than C-GOv Page 42 Relationship Between Dimensions Interesting finding: Relationship btw background and tolerance toward corruption Y2=s2: ( good education and good income) the least tolerant. 4% tolerable Y2=s3: (poor education and average income) the most tolerant. 32% tolerable The other two classes are in between. Page 43 Result of LCA  Partition not meaningful  Reason:   Local Independence not true Another way to look at it  LCA assumes that all the manifest variables joint defines a meaningful way to cluster data  Obviously not true for ICAC data  Instead, one should look for subsets that do define meaningful partition and perform cluster analysis on them  This is what we do with LTA Page 44 Finite Mixture Models  Y: discrete latent variable  Xi: continuous  P(X1, X2, …, Xn|Y): Usually multivariate Gaussian  No independence assumption  Assume states of Y: 1, 2, …, k P(X1, X2, …, Xn) = P(Y=i)P(X1, X2, …, Xn|Y=i): Mixture of k Gaussian components Page 45 Finite Mixture Models  Used to cluster continuous data  Learning  Determine  k: number of clusters  P(Y)  P(X1, …, Xn|Y)  Also assume: All attributes define coherent partition  Not realistic  LT models are a natural framework for clustering high dimensional data Page 46 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project Page 47 Observation on How Human Brain Does Thinking  Human beings often invoke latent variables to explain regularities that we observe.  Example 1  Observe Regularity:  Beers, Diapers often bought together in early evening  Hypothesize (latent) cause:  There must be a common (latent) cause  Identify the cause and explain regularity  Shopping by Father of Babies on the way home from work  Based on our understanding of the world Page 48 Observation on How Human Brain Does Thinking  Example 2  Background: At night, watch lighting throw windows of apartments in big buildings  Observe Regularity:  Lighting from several apartments were changing in brightness and color at the same times and in perfect synchrony.  Hypothesize common (latent) cause:  There must be a (late) common cause  Identify the cause and explain the phenomenon:  People watching the same TV channel.  Based on understanding of the world Page 49 Back to Ancient Time  Observe Regularity  Several symptoms often occur together  ‘intolerance to cold’, ‘cold limbs’, and ‘cold lumbus and back’  Hypothesize common latent cause:   There must be a common latent cause Identify the cause  Answer based on understanding of world at that time, primitive  Conclusion: Yang deficiency (阳虚)  Explanation: Yang is like the sun, it warms your body. If you don’t have enough of it, feel cold. Page 50 Back to Ancient Time  Regularity observed:  Several symptoms often occur together  Tidal fever (潮热)，heat sensation in palm and feet (手足心热), palpitation (心慌心跳), thready and rapid pulse (脉细数)  Hypothesize common latent cause:   There must be a common latent cause Identify the cause and explain the regularirt  Yin deficiency causing internal heart (阴虚内热)  Yin and Yang should be in balance. If Yin is in deficiency, Yang will be in excess relatively, and hence causes heat. Page 51 Traditional Chinese Medicine (TCM)  Claim   TCM Theories = Statistical Regularities + Subjective Interpretations How to justify the claim Page 52 A Case Study  We collected a data set about kidney deficiency (肾虚)  35 symptom variables, 2600 records Result of Data Analysis  Y0-Y34: manifest variables from data  X0-X13: latent variables introduced by data analysis  Structure interesting, supports TCM’s theories about various symptoms. Page 53 Page 54 Other TCM Data Sets  From Beijing U of TCM, 973 project       Depression Hepatitis B Chronic Renal Failure COPD Menopause China Academy of TCM  Subhealth  Diabetes  In all cases, results of LT analysis match relevant TCM Theories Page 55 Result on the Depression Data Page 56 Significance  Conclusion   TCM Theories = Statistical Regularities + Subjective Interpretations Significance  TCM theories are partially based on objective facts  Boast user confidence  Can help to lay a modern statistical foundation for TCM  Systematically identify statistical regularities about occurrence of symptoms, find natural partitions  Establish objective and quantitative diagnosis standards  Assist in double-blind experiments for evaluate and improve the efficacy of TCM treatment Page 57 Outline  Traditional Uses  Structure Discovery  Density Estimation  Classification  Clustering  An HKUST Project

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt - hkust cse