* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Why Probability?
Concept learning wikipedia , lookup
Catastrophic interference wikipedia , lookup
Multi-armed bandit wikipedia , lookup
Time series wikipedia , lookup
Existential risk from artificial general intelligence wikipedia , lookup
Machine learning wikipedia , lookup
Knowledge representation and reasoning wikipedia , lookup
History of artificial intelligence wikipedia , lookup
Mixture model wikipedia , lookup
Particle filter wikipedia , lookup
Linear belief function wikipedia , lookup
Neural modeling fields wikipedia , lookup
Mathematical model wikipedia , lookup
Inductive probability wikipedia , lookup
Hierarchical temporal memory wikipedia , lookup
Expectation–maximization algorithm wikipedia , lookup
Probabilistic Plan Recognition Kathryn Blackmond Laskey Department of Systems Engineering and Operations Research George Mason University Dagstuhl Seminar April 2011 The problem of plan recognition is to take as input a sequence of actions performed by an actor and to infer the goal pursued by the actor and also to organize the action sequence in terms of a plan structure Schmidt, Sridharan and Goodson, 1978 …the problem of plan recognition is largely a problem of inference under conditions of uncertainty. Charniak and Goldman, 1993 1 PPR in a Nutshell P( plan | obs) P(obs | plan)P( plan) P(obs) Thomas Bayes (1702-1762) Bayes, Thomas. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions 2 of the Royal Society of London, 53:370- 418, 1763. Represent • set of possible plans • anticipated evidence for each plan Specify …or just directly specify P(plan|obs) • prior probabilities for plans • likelihood for evidence given plans Infer plans using Bayes Rule Why Probability? • Theoretically well-founded representation for relative plausibility of competing explanations • Unified approach to inference and learning • Combine engineered and learned knowledge • Many general-purpose exact and approximate algorithms with strong theoretical justification and practical success • Good results on many interesting problems • But… – Inference and learning (exact and approximate) are NP hard – Balancing tractability and expressiveness is a major research and engineering challenge 3 Representing Plans and Observations • Plan recognition requires a computational representation of possible plans and observable evidence – Goals – Actions • When executed in combination, actions are expected (with high probability) to achieve the goal – Preconditions / postconditions of actions – Constraints • Most notably, temporal ordering – Observables • Actions may or may not be directly observable • Sometimes we observe effects of actions – Hierarchical decomposition of the above • For probabilistic plan recognition, we need to assign probabilities to these elements 4 – Balance expressivity against tractability of inference & learning Some Representations for PPR • Bayesian networks • Hidden Markov Models / Dynamic Bayesian Networks • Plan Recognition Bayesian Networks / Probabilistic Relational Models / Multi-Entity Bayesian Networks • Bayesian Abductive Logic Programs • Stochastic Grammars • Conditional Random Fields • Markov Logic Networks Each of these formalisms can be thought of as a way of representing a set of “possible worlds” and defining a probability measure on an algebra of subsets 5 Graphical Probability Models • Factorize joint distribution into factors involving only a few variables each – Graph represents conditional independence assumptions – Local distributions specify probability information for small groups of related variables – Factors are combined into joint distribution G • Drastically simplifies specification, inference and learning • 20 possible goals, 100 possible actions A5 A1 A4 A2 A3 Naïve Bayes Model – Fully general model 2.5x1031 probabilities – “Naïve Bayes model” 19x20x100=38,000 probabilities – If each goal has only 10 associated actions then “naïve Bayes model” 19x10 = 190 probabilities – Naïve Bayes inference scales as #variables x #states/variable 6 Bayesian Network (BN) • Directed graph represents dependencies n • Joint distribution factors as Pr(X1,K , X n ) Pr(X i | X pa(i) ) i1 • Factored representation makes specification, inference and learning tractable for interesting classes of problems • Directed graph naturally represents causality via “do” operator – Effects of intervention – Explaining away Regional Domination (R) Exercise (E) Troops Moving (T) 7 Invasion (I) Pr(R,E,I,W,T,B,S) = Pr(R)Pr(E)Pr(I|R)Pr(W|R)Pr(T|E,I)Pr(B|W)Pr(S|W) Nuclear Weapons (W) Buy Reactors (B) Steal Materials (S) 127 probabilities 14 probabilities Possible and Probable Worlds • “Traditional or deductive logic admits only three attitudes to any proposition: definite proof, disproof, or blank ignorance.” (Jeffreys) • Semantics of classical logic is based on possible worlds – Set of possible worlds defined by language, domain, and axioms – In propositional logic, possible worlds assign truth values to atoms (e.g., R T; W T; E F) • Probability theory – Set of possible worlds is called the sample space – Probability measure maps subsets to real numbers – Probability axioms are a natural extension of classical propositional logic to likelihood • BN combines propositional logic with probability Regional Domination (R) Exercise (E) 8 Troops Moving (T) Invasion (I) Nuclear Weapons (W) Buy Reactors (B) Steal Materials (S) R E I W T B S Pr T T T T T T T p1 T T T T T T F p2 T T T T T F T p3 … Other Factored Representations • Markov network: factorization specified by undirected graph – More natural for domains without natural causal direction C indexes cliques in the graph – Joint distribution factorizes as: 1 P(X1,K , X n ) C (X1C ,K X kC C ) Z C xiC is ith variable in clique C kC is size of clique C Z is a normalization constant • Chain graph: factorization specified by graph with both directed and undirected edges • Representations to exploit context-specific independence – Probability trees – Tree-structured parameterization for local distributions in a BN 9 Conditional Random Fields • Bayesian networks are generative models – Represent joint probability over plans and observations – Realistic dependence models often yield intractable inference • Conditional (or discriminative) model directly represents probability of plans given observations – Can allow some dependencies to be relaxed • CRFs are discriminative – Undirected graph represents local dependencies – Potential function represents strength of dependence • A CRF is a family of MRFs (a mapping from observations to potentials) 10 Inference in Graphical Models • Exact inference – E.g., Belief propagation, junction tree, bucket elimination, symbolic probabilistic inference, cutset conditioning – Exploit graph structure / factorization to simplify computation – Infeasible for complex problems • Approximate (deterministic) – E.g., Loopy BP, variational Bayes • Approximate (stochastic) – E.g., Gibbs sampling, Metropolis-Hastings sampling, likelihood weighting • Combinations – E.g., Bidyuk and Dechter (2007) – cutset sampling 11 Belief Propagation for Singly Connected BNs • • • Goal: compute probability distribution of random variable B given evidence (assume B itself is not known) Random variables “above” B A1 Key idea: impact of belief in B from evidence "above" B and evidence "below" B can be processed separately A2 A3 A4 A5 p A6 ? Justification: B d-separates “above” random variables from “below” random variables D5 D6 Random variables “below” B • This picture depicts the updating process for one node. • The algorithm simultaneously updates beliefs for all the nodes. • Loopy BP applies BP to network with loops; often results in good approximation 12 p D7 B l D1 D2 D3 D4 = evidence random variable Likelihood Weighting (for BNs) 1. Proceed through non-evidence variables in order consistent with partial ordering induced by graph A C – Sample variable according to its local probability distribution – Calculate weight proportional to Pr(evidence | sampled values) 2. Repeat Step 1 until done 3. Estimate Pr(Variable=value) by weighted sample frequency B D G E H F K J L 13 Junction Tree Algorithm 1. Compile BN into junction tree – Tree of clusters of nodes – Has JT property: variable belonging to 2 clusters must belong to all clusters along path connecting them – Becomes part of the knowledge representation – Changes only if the graph changes 2. Use local local messagepassing algorithm to propagate beliefs in the junction tree 3. Query on any node or any set of nodes in same cluster can be computed from cluster joint distribution 14 ABC A BCDEH C B CDGEH D DEGHJ G E H DFGHJ F K J L FGJK JKL Gibbs Sampling A 1. Initialize – Evidence variables assigned to observed values – Arbitrary value for other variables C 2. Sample non-evidence nodes one at a time: – Sample with probability Pr(variable | Markov blanket) – Replace with newly sampled value D G 3. Repeat Step 2 until done K 4. Estimate Pr(Variable=value) by sample frequency • Markov blanket 15 B E H F J L - In BN: parents, children, co-parents - In MN: neighbors • Variable is conditionally independent of rest of network given its Markov blanket Cutset Sampling (for BNs) A • Find a loop cutset • Initialize cutset variables • Do until done C – Propagate beliefs on non-cutset variables – Do Gibbs iteration on cutset • Estimate P(Variable=value) by averaging probability over samples • This is a kind of “RaoBlackwellization” – Reduce variance of Monte Carlo estimator by replacing a sampling step with an exact computation with same expected value 16 B D=d D G E H F K J=j J L Variational Inference • Method for approximating posterior distribution of unobserved variables given observed variables • Approximation finds distribution in family with simpler functional form (e.g., remove some arcs in graph) by minimizing a measure of distance from true posterior • Estimation via “variational EM” – Alternate between “expectation” and “maximization” steps – Converges to local minimum of distance function – Yields lower bound for marginal likelihood • Often faster but less accurate than MC 17 Extending Expressive Power of BNs • Propositional logic + probability is insufficiently expressive for requirements of plan recognition – – – – Repeated structure Multiple interrelated entities (e.g., plans, actors, actions) Type hierarchy and inheritance Unbounded number of potentially relevant variables • Some formalisms with greater expressive power: 18 – PBN (Plan recognition BN) – PRM (Probabilistic Relational Models) – OOBN (Object-Oriented Bayesian Networks) – MEBN (Multi-Entity BN) – Plates – BALP (Bayesian Abductive Logic Programs) Charniak and Goldman (1993) Example: Maritime Domain Awareness Entities, attributes and relations 19 MDA Probabilistic Ontology 20 Built in UnBBayes-MEBN MDA SSBN Screenshot of situation-specific BN in UnBBayes-MEBN 21 (open-source tool for building & reasoning with PR-OWL ontologies) Protégé Plugin for UnBBayes 22 Drag-and-Drop Mapping 23 Markov Logic Networks • First-order knowledge base with weight attached to formulas and clauses • KB + individual constants ground Markov network containing variable for each grounding of a formula in the KB • Compact language for specifying large Markov networks 24 MLN Example (Richardson and Domingos, 2006) 25 CRFs for Chat Recognition (Hsu, Lian and Jih, 2011) • • • • Subscript indexes pairs of individuals Yit represents chatting activity of pair Xit represents observed acoustic features Dependence structure: – Within-pair temporal dependence – Between-pair concurrent dependence • Can be represented as MLN 26 Possible and Probable FO Worlds • In first-order logic, a possible world (aka “structure”) assigns: – Each constant symbol to a domain element (e.g., go3 obj23) – Each n-ary function symbol to a function on n-tuples of domain elements (e.g., (go-stp pln1) obj23 – Each n-ary relation symbol to a set of n-tuples of domain elements (e.g., inst {(obj23, go-), (obj78, liquor-store), (obj78, store) … } • A first-order probabilistic logic assigns a probability measure to first-order structures – This is called “measure model” semantics (Gaifman,1964) 27 FOL + Probability: Issues • Probability zero ≠ unsatisfiable – E.g., every possible value of a continuous distribution has probability zero • FOL is undecidable; FOL + probability is not even semidecidable – Example: IID sequence of coin tosses, 0 < P(H) < 1 • Given any finite sequence of prior tosses, both H and T are possible • We cannot disprove any non-extreme probability distribution from a finite sequence of tosses • Wrong solution: “We will prevent you from expressing this query because we cannot tractably compute the answer.” • Better solution: “Represent the problem you really want to solve, and then figure out a way to approximate the answer.” – Think carefully about what the real problem is! 28 Knowledge Based Model Construction • KBMC system contains : – Base representation that represents goals, plans, actions, actors, observables, constraints, etc. – Model construction procedure that maps a context and/or query into a target model • At problem solving time – Construct a problem-specific Bayesian network – Process queries on constructed model using general-purpose BN algorithm • Advantages of expressive representation 29 – – – – – Understandability Maintainability Knowledge reuse Exploit repeated structure (representation, inference, learning) Construct only as much of model as needed for query Hypothesis Management • Constructed BN rapidly becomes intractable, especially in presence of existence and association uncertainty • What do we really need to represent? • Heuristics help to avoid constructing (or prune) very unlikely hypotheses (or variables with very weak impact on conclusions) – E.g., from only “John went to the airport” do not nominate hypothesis that John intends to set off a bomb – But a security system needs to be on the alert for prospective bombers! 30 Lifted Inference • Constructed BN (propositionalized theory) typically contains repeated structure • Applying standard BN inference often results in many repetitions of the identical computation • Lifted inference algorithms detect such repetitions – “Lift” problem from ground to first-order level – Perform computation only once • Very active area of research (Braz, et al., 2005) 31 Learning = Inference … in theory, at least (inst xi liquor-shopping) (inst yj liquor-store) (= (store-of xi) yj) i=1,…,N j=1,…,M Plate model for parameter learning of store-of local distribution 32 Representing Temporal Evolution • Plans evolve in time • HMM / DBN / PDBN replicate variables describing temporally evolving situation Position(k-1) Position(k) Mission Destination Report(k-1) Shape Report(k) Hidden Markov Model (HMM) unobservable evolving state + observable indicator Position(k-1) Position(k-1) Position(k) Position(k) ApparentShape(k-1) ApparentShape(k-1) ApparentShape(k) PosReport(k-1) PosReport(k-1) PosReport(k) PosReport(k) ShpReport(k-1) ShpReport(k-1) 33 ApparentShape(k) ShpReport(k-1) ShpReport(k-1) Dynamic Bayesian Network (DBN) factored representation of state / observable Partially Dynamic Bayesian Network (PDBN) some variables not time-dependent DBN Inference • Any BN inference algorithm can be applied to a finitehorizon DBN • Special-case inference algorithms exploit DBN structure – “Rollup” algorithm marginalizes out past hidden states given past observations to explicitly represent only a sliding window – Viterbi algorithm finds most probable values of hidden states given observations – Forward-backward algorithm estimates marginal distributions for hidden states given observations • Exact inference is generally intractable – Factored frontier algorithm approximates marginalization of past hidden state for intractable DBNs – Particle filter is a temporal variant of likelihood weighting with resampling 34 • Beware of static nodes! Particle Filter initialization Likelihood weighting Resampling Resampling Evolution Likelihood weighting • • • 35 • Maintains sample of weighted particles Each particle is a single realization of all non-evidence nodes Particle is weighted by likelihood of observation given particle Particles are resampled with probability proportional to weight From van der Merwe et al. (undated) Particle Impoverishment • Particles with large weights are sampled more often, leading to low particle diversity • This effect is counteracted by “spreading” effects of process noise • Impoverishment is very serious when: – Observations are extremely unlikely – Low “process noise” leads to long dwell times in widely separated basins of attraction • “In fact, for the case of very small process noise, all particles will collapse to a single point within a few iterations.” • “If the process noise is zero, then using a particle filter is not entirely appropriate.” (Arulampulam et al., 2002) 36 Particle Filter with Static Nodes • PF cannot recover from impoverishment of static node • Some approaches: – Estimate separate PF for each combination of static node • Only if static node has small state space – Regularized PF - artificial evolution of static node • Ad hoc; no justification for amount of perturbation; information loss over time – Shrinkage (Liu & West) • Combines ideas from artificial evolution & kernel smoothing • Perturbation “shrinks” static node for each particle toward weighted sample mean – Perturbation holds variance of set of particles constant – Correlation in disturbances compensates for information loss X – Resample-Move (Gilks & Berzuini) 37 • Metropolis-Hastings step corrects for particle impoverishment • MH sampling of static node involves entire trajectory but is performed less frequently as runs become longer Stochastic Grammars • Motivation: find representation that is sufficiently expressive for plan recognition but more tractable than general DBN inference • A stochastic grammar is a set of stochastic production rules for generating sequences of actions (terminal symbols in the grammar) • Modularity of production rules yields factored joint distribution 38 Stochastic Grammar - Example • Taken from Geib and Goldman (2009) • Plans are represented as and/or tree with temporal constraints 39 Stochastic Grammar - Inference • Parsing algorithms can be applied to compute restricted class of queries – If plans can be represented in a given formalism then that formalism’s inference algorithms can be applied to process queries – We are often interested in a broader class of queries than traditional parsing algorithms can handle (e.g., we usually have not observed all actions) • Parse tree can be converted to DBN – Enables answering a broader class of queries – Can exploit structure of grammar to improve tractability of inference • Special-purpose algorithms exploit grammar structure 40 Where Do We Stand? • Contributions of probabilistic methods – – – – – Useful way of thinking about problems Unified approach to reasoning, parameter learning, structure learning Principled combination of KE with learning Can learn from small, moderate and large samples Many general-purpose exact and approximate algorithms with strong theoretical justification and practical success – Good results (better than previous state of the art) on many interesting problems • Many challenging problems remain – Exact learning and inference are intractable – High-dimensional multi-modal distributions are just plain ugly • All inference algorithms break down on the toughest cases – Asymptotics doesn’t mean much when the long run is millions of years! – With good engineering backed by solid theory, we will continue to make progress 41 Bibliography (1 of 2) Arulampalam, M. Maskell, S., Gordon, N. and Clapp, T. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking, IEEE Transactions on Signal Processing, 50 , pp. 174–188, 2002. Bidyuk, B. and Dechter, R. "Cutset Sampling for Bayesian Networks", Journal of Artificial Intelligence Research 28, pages 148, 2007. Braz, R., Amir, E. and Roth, D. Lifted First-Order Probabilistic Inference. Proceedings of the International Joint Conference on Artificial Intelligence, 2005. Bui, H., Venkatesh, S. and West, G. "Policy Recognition in the Abstract Hidden Markov Model", Artificial Intelligence, Journal of Artificial Intelligence Research, Volume 17, pages 451-499, 2002. Charniak, E. and Goldman, R. A Bayesian Model of Plan Recognition. Artificial Intelligence, 64: 53-79, 1993. Charniak, E. and Goldman, R. A Probabilistic Model of Plan Recognition. Proceedings of the Ninth Conference on Artificial Intelligence 1991. Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press. 2009. Gaifman, H. Concerning measures in First-Order calculi. Israel Journal of Mathematics, 2, 1–18, 1964. Geib, C.W. and Goldman, R.P. A Probabilistic Plan Recognition Algorithm Based on Plan Tree Grammars. Artificial Intelligence 173, pp. 1101–1132, 2009. Gilks, W.R. and Berzuini, C. Following a Moving Target—Monte Carlo Inference for Dynamic Bayesian Models,” Journal of the Royal Statistical Society B, 63, pp. 127–146, 2001. Hsu, J., Lian, C., and Jih, W. Probabilistic Models for Concurrent Chatting Activity Recognition. ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 1, 2011. Jensen, F., Bayesian Networks and Decision Graphs (2nd edition). Springer, 2007. Korb, K. and Nicholson, A. Bayesian Artificial Intelligence. Chapman and Hall, 2003. Koller, D., Friedman, N. Probabilistic Graphical Models. MIT Press, 2009. Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001. Laskey, K.B., MEBN: A Language for First-Order Bayesian Knowledge Bases, Artificial Intelligence, 172(2-3): 140-178, 2008 42 Bibliography (2 of 2) Liao, L. Patterson, D. J. Fox, D. and Kautz, H. Learning and Inferring Transportation Routines. Artificial Intelligence, 2007. Liao, L., Fox, D., AND Kautz, H. Hierarchical Conditional Random Fields for GPS-based Activity Recognition. In Springer Tracts in Advanced Robotics. Springer, 2007. Liu, J. and West, M., Combined Parameter and State Estimation in Simulation-Based Filtering,” in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001. Musso, C. Oudjane, N and LeGland, F. Improving Regularised Particle Filters, in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001. Neapolitan, R. Learning Bayesian Networks. Prentice Hall, 2003. Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. Pynadath, D.V. and Wellman, M.P. Probabilistic State-Dependent Grammars for Plan Recognition. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000. Pynadath, D. V. and Wellman, M. P. Generalized queries on probabilistic context-free grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):65–77, 1998. Richardson, M. and Domingos, P., Markov Logic Networks. Machine Learning, 62, 107-136, 2006. Schmidt, C., Sridharan, N., and Goodson, J., The plan recognition problem: An Intersection of psychology and Artificial Intelligence, Artificial Intelligence 11 pp. 45–83, 1978. van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. The Unscented Particle Filter, Adv. Neural Inform. Process. Syst. 2000. van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. (undated) “The unscented particle filter,” http://cslu.cse.ogi.edu/publications/ps/UPF_CSLU_talk.pdf Wellman, M.P., J.S. Breese, and R.P. Goldman (1992) From knowledge bases to decision models. The Knowledge Engineering Review, 7(1):35-53. 43