Graphical Models
David Heckerman, Microsoft Research

Overview
- Intro to graphical models
  - Application: data exploration
  - Dependency networks, undirected graphs
  - Directed acyclic graphs ("Bayes nets")
- Applications
  - Clustering
  - Evolutionary history/phylogeny

Using classification/regression for data exploration
Models of p(target | inputs):
- Decision tree: [Figure: tree splitting on gender and then age, with leaf probabilities p(cust)=0.2, p(cust)=0.8, and p(cust)=0.7]
- Logistic regression: log p(cust)/(1-p(cust)) = 0.2 - 0.3*male + 0.4*old
- Neural network

Conditional independence
- In the decision tree above, p(cust | gender, age, month born) = p(cust | gender, age)
- More generally: p(target | all inputs) = p(target | some inputs)

Learning conditional independence from data: model selection
- Cross validation
- Bayesian methods
- Penalized likelihood
- Minimum description length

Using classification/regression for data exploration
- Suppose you have thousands of variables and you're not sure about the interactions among those variables
- Build a classification/regression model for each variable, using the rest of the variables as inputs

Example with three variables X, Y, and Z
[Figure: three decision trees, one per target:
- Target X, inputs Y,Z: split on Y, giving p(x|y=0) and p(x|y=1)
- Target Y, inputs X,Z: split on X, then on Z when X=0, giving p(y|x=1), p(y|x=0,z=0), and p(y|x=0,z=1)
- Target Z, inputs X,Y: split on Y, giving p(z|y=0) and p(z|y=1)]

Summarize the trees with a single graph
[Figure: the three trees above summarized as a graph over X, Y, Z, with arcs between X and Y and between Y and Z, and no arc between X and Z]

Dependency network
- Build a classification/regression model for every variable given the other variables as inputs
- Construct a graph where
  - Nodes correspond to variables
  - There is an arc from X to Y if X helps to predict Y
- The graph along with the individual classification/regression models is a "dependency network" (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)

Example: TV viewing
Nielsen data, 2/6/95-2/19/95:

            Age  Show1  Show2  Show3  ...
  viewer 1   73    y      n      n
  viewer 2   16    n      y      y
  viewer 3   35    n      n      n
  etc.

~400 shows, ~3000 viewers
Goal: exploratory data analysis (acausal)
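To make the construction concrete, here is a minimal sketch of learning a dependency network, not the implementation used for the Nielsen analysis: it fits one scikit-learn decision tree per variable on made-up binary data and adds an arc input -> target whenever the input's importance in the target's tree exceeds a small threshold. The synthetic data, tree settings, and threshold are all assumptions for illustration.

```python
# Sketch: learn a dependency network by fitting one decision tree per variable
# and adding an arc input -> target when the input helps to predict the target.
# The synthetic data, tree settings, and importance threshold are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
x = rng.integers(0, 2, n)
y = x ^ (rng.random(n) < 0.1).astype(int)   # Y depends on X (10% noise)
z = y ^ (rng.random(n) < 0.1).astype(int)   # Z depends on Y, not directly on X
data = {"X": x, "Y": y, "Z": z}

arcs = []
for target in data:
    inputs = [v for v in data if v != target]
    features = np.column_stack([data[v] for v in inputs])
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=100)
    tree.fit(features, data[target])
    for name, importance in zip(inputs, tree.feature_importances_):
        if importance > 0.01:                # small threshold to skip spurious splits
            arcs.append((name, target))      # "name helps to predict target"

print(arcs)  # typically arcs between X and Y and between Y and Z, none between X and Z
```

On data generated as X -> Y -> Z, the recovered arcs typically link X with Y and Y with Z but not X with Z, matching the three-variable example above.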
A bit of history
- Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
- But they didn't like them, because they could be inconsistent

A consistent dependency network
[Figure: the three trees for X, Y, Z and their summary graph, with conditional distributions that agree with a single joint distribution]

An inconsistent dependency network
[Figure: the same graph over X, Y, Z, but with conditional distributions that are not the conditionals of any single joint distribution]

A bit of history (continued)
- So they used a property of consistent dependency networks to develop a new characterization of them

Conditional independence
[Figure: the three trees for X, Y, Z again; reading off which inputs each tree uses gives X ⊥ Z | Y]

Conditional independence in a dependency network
- Each variable is independent of all other variables given its immediate neighbors

Hammersley-Clifford theorem (Besag 1974)
- Given a set of variables with a positive joint distribution,
- where each variable is independent of all other variables given its immediate neighbors in some graph G,
- it follows that the joint is a product of "clique potentials":
    p(x) = prod_{i=1..N} f_i(c_i)
  where c_1, c_2, ..., c_N are the maximal cliques in the graph G.

Example
X - Y - Z
p(x,y,z) = f1(x,y) f2(y,z)

Consistent dependency networks: directed arcs not needed
[Figure: the dependency network X <-> Y <-> Z and the undirected graph X - Y - Z describe the same distribution]
p(x,y,z) = f1(x,y) f2(y,z)

A bit of history (continued)
- "Markov random fields", aka "undirected graphs", were born

Inconsistent dependency networks aren't that bad
- They are *almost consistent*, because each classification/regression model is learned from the same data set (this can be formalized)
- They are easy to learn from data (build a separate classification/regression model for each variable)
- Conditional distributions (e.g., trees) are easier to understand than clique potentials
- Over the last decade, they have proven to be a very useful tool for data exploration

Shortcomings of undirected graphs
- They lack a generative story (e.g., Latent Dirichlet Allocation)
- They lack a causal story
  [Figure: causal graph with cold -> sore throat, cold -> cough, lung cancer -> cough, lung cancer -> weight loss]

Solution: build trees in some order
1. Target: X, inputs: none  ->  p(x)
2. Target: Y, inputs: X     ->  tree splitting on X: p(y|x=0), p(y|x=1)
3. Target: Z, inputs: X, Y  ->  tree splitting on Y: p(z|y=0), p(z|y=1)
[Figure: the resulting directed graph X -> Y -> Z]

Some orders are better than others
- Different orders can give different graphs (e.g., the order X, Y, Z versus X, Z, Y)
- Ways to choose an order: random orders, greedy search, Monte-Carlo methods

Joint distribution is easy to obtain
p(x,y,z) = p(x) p(y|x) p(z|x,y)
(here the tree for Z uses only Y, so p(z|x,y) = p(z|y))
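A minimal numeric sketch of this factorization (the probability tables below are made-up values, not from the talk): it assembles p(x,y,z) = p(x) p(y|x) p(z|y) for binary variables, checks that it is a proper joint, and verifies the conditional independence X ⊥ Z | Y that the ordered trees imply.

```python
# Sketch: the joint implied by trees built in the order X, Y, Z, where Z's tree
# splits only on Y, so p(x,y,z) = p(x) p(y|x) p(z|y).
# The probability tables are made-up numbers for illustration.
import numpy as np

p_x = np.array([0.4, 0.6])                   # p(x)
p_y_given_x = np.array([[0.8, 0.2],          # p(y | x=0)
                        [0.3, 0.7]])         # p(y | x=1)
p_z_given_y = np.array([[0.9, 0.1],          # p(z | y=0)
                        [0.5, 0.5]])         # p(z | y=1)

# joint[x, y, z] = p(x) p(y|x) p(z|y)
joint = np.einsum("x,xy,yz->xyz", p_x, p_y_given_x, p_z_given_y)
print("sums to one:", np.isclose(joint.sum(), 1.0))

# Because Z's tree uses only Y, X is independent of Z given Y in this joint.
for y in (0, 1):
    p_xz_given_y = joint[:, y, :] / joint[:, y, :].sum()
    p_x_given_y = p_xz_given_y.sum(axis=1, keepdims=True)
    p_z_only = p_xz_given_y.sum(axis=0, keepdims=True)
    assert np.allclose(p_xz_given_y, p_x_given_y * p_z_only)
```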
Directed acyclic graphs (aka Bayes nets)
p(x1, ..., xn) = prod_i p(xi | x1, ..., x_{i-1}) = prod_i p(xi | parents(xi))
Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982

The power of graphical models
- Easy to understand
- Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
- The conditional independencies they express make inference more computationally efficient

Inference
[Figure: the model X -> Y -> Z built above]
Example query: what is p(z | x=1)?

Inference: example
[Figure: the chain W -> X -> Y -> Z]
p(z) = sum_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)

Inference: example ("elimination algorithm")
p(z) = sum_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = sum_y p(z|y) sum_{w,x} p(w) p(x|w) p(y|x)
     = sum_y p(z|y) sum_x p(y|x) sum_w p(w) p(x|w)

Inference
- Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
- Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter; Lauritzen & Spiegelhalter; Dechter)
- Exact methods fail when there are many cycles in the graph; approximate alternatives include:
  - MCMC (e.g., Geman and Geman 1984)
  - Loopy propagation (e.g., Murphy et al. 1999)
  - Variational methods (e.g., Jordan et al. 1999)
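A minimal sketch of the elimination computation above for the chain W -> X -> Y -> Z (the conditional probability tables are made-up numbers for illustration): it sums out w, then x, then y, keeping only small intermediate factors, and checks the result against the brute-force sum over the full joint.

```python
# Sketch: the elimination algorithm on the chain W -> X -> Y -> Z, computing
# p(z) = sum_y p(z|y) sum_x p(y|x) sum_w p(w) p(x|w).
# The conditional probability tables are made-up numbers for illustration.
import numpy as np

p_w = np.array([0.6, 0.4])                   # p(w)
p_x_given_w = np.array([[0.7, 0.3],          # p(x | w=0)
                        [0.2, 0.8]])         # p(x | w=1)
p_y_given_x = np.array([[0.9, 0.1],          # p(y | x=0)
                        [0.4, 0.6]])         # p(y | x=1)
p_z_given_y = np.array([[0.5, 0.5],          # p(z | y=0)
                        [0.1, 0.9]])         # p(z | y=1)

# Eliminate w, then x, then y; each intermediate factor stays small.
p_x = p_w @ p_x_given_w                      # sum_w p(w) p(x|w)
p_y = p_x @ p_y_given_x                      # sum_x p(x) p(y|x)
p_z = p_y @ p_z_given_y                      # sum_y p(y) p(z|y)

# Brute-force check: sum the full joint over w, x, y.
brute = np.einsum("w,wx,xy,yz->z", p_w, p_x_given_w, p_y_given_x, p_z_given_y)
assert np.allclose(p_z, brute)
print("p(z) =", p_z)
```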
Applications of graphical models
- DAGs and UGs: data exploration, density estimation, clustering
- UGs: spatial processes
- DAGs: expert systems, causal discovery

Applications in this talk
- Clustering
- Evolutionary history/phylogeny

Clustering
Example: msnbc.com
[Table: one row per user (1, 2, 3, ..., 8, etc.), each with a variable-length sequence of page categories visited, e.g. "frontpage news news news", "frontpage news news weather", "news health", "frontpage sports travel travel", "weather business sports"]
- Millions of users per day
- Goal: understand what is and isn't working on the site

Solution
- Cluster users based on their behavior on the site
- Display the clusters somehow

Generative model for clustering (e.g., AutoClass; Cheeseman & Stutz 1995)
[Figure: a discrete, hidden Cluster node with arcs to 1st page, 2nd page, 3rd page, ...]

Sequence clustering (Cadez, Heckerman, Meek, and Smyth 2000)
[Figure: a discrete, hidden Cluster node; given the cluster, the pages form a Markov chain: 1st page -> 2nd page -> 3rd page -> ...]

Learning parameters (with missing data)
Principles:
- Find the parameters that maximize the (log) likelihood of the data
- Find the parameters whose posterior probability is a maximum
- Find distributions for quantities of interest by averaging over the unknown parameters
Gradient methods or the EM algorithm are typically used for the first two.

Expectation-Maximization (EM) algorithm (Dempster, Laird, Rubin 1977)
- Initialize the parameters (e.g., at random)
- Expectation step: compute probabilities for the values of the unobserved variable using the current parameters and the incomplete data [this is inference]; reinterpret the data as a set of fractional cases based on these probabilities
- Maximization step: choose parameters that maximize the log likelihood of the fractional data
- The parameters converge to a local maximum of log p(data)

E-step
- Suppose the cluster model has 2 clusters, and that
  p(cluster=1 | case, current params) = 0.7
  p(cluster=2 | case, current params) = 0.3
- Then write
  q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)
- Do this for each case and then find the parameters that maximize Q = sum_case q(case)
- Maximizing Q is guaranteed to increase (or leave unchanged) the log likelihood

Demo: SQL Server 2005

Sequence clustering: other applications at Microsoft
- Analyze how people use programs (e.g., Office)
- Analyze web traffic for intruders (anomaly detection)
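As a concrete illustration of the E and M steps described above, here is a minimal EM sketch for a two-cluster mixture of independent binary features, a toy stand-in for the "cluster -> pages" models in the talk (the synthetic data, cluster count, and iteration count are all assumptions for illustration).

```python
# Sketch: EM for a two-cluster mixture of independent binary features, a toy
# stand-in for the "cluster -> pages" model above.
# The synthetic data, cluster count, and iteration count are assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([[0.9, 0.8, 0.1],      # p(feature=1 | cluster 0)
                       [0.1, 0.2, 0.9]])     # p(feature=1 | cluster 1)
true_cluster = rng.integers(0, 2, size=500)
X = (rng.random((500, 3)) < true_theta[true_cluster]).astype(float)

K, eps = 2, 1e-9
pi = np.full(K, 1.0 / K)                     # mixture weights
theta = rng.uniform(0.3, 0.7, size=(K, 3))   # per-cluster Bernoulli parameters

for _ in range(50):
    # E step (this is inference): responsibility of each cluster for each case
    log_post = (X @ np.log(theta + eps).T
                + (1 - X) @ np.log(1 - theta + eps).T
                + np.log(pi + eps))
    log_post -= log_post.max(axis=1, keepdims=True)
    resp = np.exp(log_post)
    resp /= resp.sum(axis=1, keepdims=True)  # the "fractional cases"

    # M step: maximize the expected complete-data log likelihood
    Nk = resp.sum(axis=0)
    pi = Nk / Nk.sum()
    theta = (resp.T @ X) / Nk[:, None]

print("mixture weights:", pi)
print("estimated p(feature=1 | cluster):\n", theta)
```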
Computational biology applications
- Evolutionary history/phylogeny
- Vaccine for AIDS

Evolutionary history/phylogeny (Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)
[Figure: estimated phylogeny over several dozen mammal species (donkey, horse, rhinos, seals, whales, bats, primates, rodents, marsupials, platypus, ...), with clades labeled Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, and Hedgehogs]

Probabilistic model of evolution
[Figure: a tree-structured graphical model with hidden ancestral species as internal nodes and the observed species 1, 2, 3, ... at the leaves]

Learning phylogeny from data
- For a given tree, find the maximum likelihood parameters
- Search over structures to find the best likelihood (penalized to avoid overfitting)

Strong simplifying assumption
- Evolution at each DNA nucleotide is independent
- EM is computationally efficient
[Figure: one copy of the hidden/observed tree per nucleotide position, 1 through N, with no arcs between positions]

Relaxing the assumption
- Each substitution depends on the substitution at the previous position
- This structure captures context-specific effects during evolution
- EM is computationally intractable
[Figure: the per-position trees with additional arcs linking each position to the previous one]

Variational approximation for inference
ln p(o | theta) = ln sum_h p(h, o | theta)
               = ln sum_h q(h | o, phi) [ p(h, o | theta) / q(h | o, phi) ]
               >= sum_h q(h | o, phi) ln [ p(h, o | theta) / q(h | o, phi) ]
The lower bound is good enough for an EM-like algorithm (a small numeric sketch of this bound is given at the end of this document).

Two simple q distributions
[Figure: two factorized families for q(h | o, phi): a product of trees (one per position) and a product of chains (one per row along the sequence)]

Things I didn't have time to talk about
- Factor graphs, mixed graphs, etc.
- Relational learning: PRMs, plates, PERs
- Bayesian methods for learning
- Scalability
- Causal modeling
- Variational methods
- Non-parametric distributions

To learn more
Main conferences:
- Uncertainty in Artificial Intelligence (UAI)
- Neural Information Processing Systems (NIPS)
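Appendix: numeric check of the variational bound
The sketch promised above: it evaluates the lower bound ln p(o | theta) >= sum_h q(h) ln[ p(h, o | theta) / q(h) ] on a tiny discrete model with one hidden variable (the probability tables are made-up numbers). With q equal to the exact posterior the bound is tight, as in standard EM; a restricted q family gives a looser bound that is still good enough for an EM-like algorithm.

```python
# Sketch: check the bound ln p(o) >= sum_h q(h) ln[ p(h,o) / q(h) ] numerically
# for a tiny discrete model with one hidden variable h and one observation o.
# The probability tables are made-up numbers for illustration.
import numpy as np

p_h = np.array([0.3, 0.7])                   # p(h)
p_o_given_h = np.array([[0.9, 0.1],          # p(o | h=0)
                        [0.2, 0.8]])         # p(o | h=1)
o = 1                                        # the observed value

p_joint = p_h * p_o_given_h[:, o]            # p(h, o) as a function of h
log_evidence = np.log(p_joint.sum())         # exact ln p(o)

def lower_bound(q):
    """Jensen's inequality: sum_h q(h) ln[ p(h,o) / q(h) ]."""
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

q_uniform = np.array([0.5, 0.5])
q_posterior = p_joint / p_joint.sum()        # the exact posterior p(h | o)

print("ln p(o)                  =", log_evidence)
print("bound with a uniform q   =", lower_bound(q_uniform))    # strictly below
print("bound with the posterior =", lower_bound(q_posterior))  # tight (equal)
```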