Inference on Relational Models
Using Markov Chain Monte Carlo
Brian Milch
Massachusetts Institute of Technology
UAI Tutorial
July 19, 2007
Example 1: Bibliographies
Stuart Russell
Peter Norvig
Artificial Intelligence: A Modern Approach
Russell, Stuart and Norvig, Peter. Articial Intelligence. Prentice-Hall, 1995.
S. Russel and P. Norvig (1995). Artificial Intelligence: A Modern
Approach. Upper Saddle River, NJ: Prentice Hall.
Example 2: Aircraft Tracking
[Figure: radar blips at time steps t = 1, t = 2, t = 3, each with an estimated 3-D position such as (1.9, 9.0, 2.1) or (0.7, 5.1, 3.2)]
Inference on Relational Structures
[Figure: candidate relational structures for a set of citation strings, each hypothesis grouping name strings such as "Russell", "Norvig", "Seuss", "Shakespeare" and title strings such as "AI: A Modern Approach", "Hamlet" into underlying researchers and papers, with posterior probabilities ranging from roughly 2.3 × 10⁻¹² down to 5.0 × 10⁻²⁰]
Markov Chain Monte Carlo
(MCMC)
• Markov chain s1, s2, ... over worlds where evidence E is true
• Approximate P(Q | E) as the fraction of s1, s2, ... that satisfy query Q
[Figure: sampled worlds inside the evidence event E, some of which also fall in the query event Q]
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
Simple Example: Clustering
[Figure: bird wingspan measurements (cm) along an axis from 10 to 100, forming three clusters with means μ = 22, μ = 49, and μ = 80]
Simple Bayesian Mixture Model
• Number of latent objects is known to be k
• For each latent object i, have parameter:
  μi ~ Uniform[0, 100]
• For each data point j, have object selector
  Cj ~ Uniform({1, ..., k})
  and observable value
  Xj ~ Normal(μCj, 5²)
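A minimal Python sketch of sampling a data set from this generative model (standard library only; the values of k and n and the 0-based labels are illustrative choices, not from the tutorial):

    import random

    k, n = 3, 10                                      # latent objects and data points (illustrative)
    mu = [random.uniform(0, 100) for _ in range(k)]   # mu_i ~ Uniform[0, 100]
    C = [random.randrange(k) for _ in range(n)]       # C_j ~ Uniform({1, ..., k}), 0-based here
    X = [random.gauss(mu[c], 5) for c in C]           # X_j ~ Normal(mu_{C_j}, 5^2)

    print(list(zip(C, [round(x, 1) for x in X])))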
BN for Mixture Model
[Figure: Bayesian network with parameter nodes μ1, μ2, ..., μk, selector nodes C1, C2, C3, ..., Cn, and observation nodes X1, X2, X3, ..., Xn; each Xj has its selector Cj and all of the μi as parents]
Context-Specific Dependencies
[Figure: the same network with the selectors instantiated (C1 = 2, C2 = 1, C3 = 2, ...); given its selector's value, each Xj depends only on the single μi that was selected]
Extensions to Mixture Model
• Random number of latent objects k, with
distribution p(k) such as:
– Uniform({1, …, 100})
– Geometric(0.1)
– Poisson(10)
(the Geometric and Poisson choices make k unbounded!)
• Random distribution θ for selecting objects
– p(θ | k) ~ Dirichlet(α1, ..., αk)
(Dirichlet: distribution over probability vectors)
– Still symmetric: each αi = α/k
Existence versus Observation
• A latent object can exist even if no observations
correspond to it
– Bird species may not be observed yet
– Aircraft may fly over without yielding any blips
• Two questions:
– How many objects correspond to observations?
– How many objects are there in total?
• Observed 3 species, each 100 times: probably no more
• Observed 200 species, each 1 or 2 times: probably more exist
Expecting Additional Objects
[Figure: r species observed so far; will we ever observe more?]
• P(ever observe a new species | seen r so far) is bounded by P(k > r)
• So as the number of species observed → ∞, the probability of ever seeing more → 0
• What if we don't want this?
Dirichlet Process Mixtures
• Set k = , let  be infinite-dimensional
probability vector with stick-breaking prior
1
2
3
4
5 …
• Another view: Define prior directly on
partitions of data points, allowing
unbounded number of blocks
• Drawback: Can’t ask about number of
unobserved latent objects (always infinite)
[Ferguson 1983; Sethuraman 1994]
[tutorials: Jordan 2005; Sudderth 2006]
14
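A minimal sketch of the stick-breaking construction in Python; the concentration parameter alpha and the truncation level are my own illustrative choices, not values from the tutorial:

    import random

    def stick_breaking(alpha, truncate=20):
        """Approximate the infinite probability vector theta_1, theta_2, ... by truncation."""
        thetas, remaining = [], 1.0
        for _ in range(truncate):
            b = random.betavariate(1, alpha)   # break off a Beta(1, alpha) fraction of the stick
            thetas.append(b * remaining)       # theta_i = b_i * prod_{j < i} (1 - b_j)
            remaining *= 1 - b
        return thetas                          # sums to just under 1; the tail is the leftover stick

    print([round(t, 3) for t in stick_breaking(alpha=1.0)])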
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
Mistake 1: Ignoring
Interchangeability
• Which birds are in species S1?
[Figure: five birds B1, ..., B5 grouped into clusters]
• Latent object indices are interchangeable
– Posterior on selector variable CB1 is uniform
– Posterior on μS1 has a peak for each cluster of birds
• Really care about partition of observations
{{1, 3}, {2}, {4, 5}}
• Partition with r blocks corresponds to k! / (k−r)!
instantiations of the Cj variables
(1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), …
Ignoring Interchangeability, Cont’d
• Say k = 4. What's the prior probability that B1,
B3 are in one species and B2 in another?
• Multiply probabilities for CB1, CB2, CB3:
(1/4) × (1/4) × (1/4)
• Not enough! Partition {{B1, B3}, {B2}}
corresponds to 12 instantiations of the C's:
(S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2),
(S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
so the prior probability is 12 × (1/4)³ = 3/16
• In general, a partition with r blocks corresponds to
kPr = k!/(k−r)! instantiations
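The counting can be checked by brute force; this small Python sketch enumerates all 64 equally likely assignments of (CB1, CB2, CB3) for k = 4 and confirms the 12 instantiations and the 3/16 prior probability:

    from itertools import product

    k = 4
    matching = [c for c in product(range(k), repeat=3)
                if c[0] == c[2] and c[1] != c[0]]   # B1, B3 together; B2 in a different species
    print(len(matching))                            # 12 = 4P2 = k!/(k-r)! with r = 2 blocks
    print(len(matching) / k**3)                     # 0.1875 = 3/16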
Mistake 2: Underestimating the
Bayesian Ockham’s Razor Effect
• Say k = 4. Are B1 and B2 in same species?
[Figure: wingspan axis (cm) from 10 to 100 with two observations, XB1 = 50 and XB2 = 52]
• Maximum-likelihood estimation would yield one
species with μ = 50 and another with μ = 52
• But the Bayesian model trades off likelihood against
the prior probability of getting those μ values
Bayesian Ockham’s Razor
[Figure: wingspan axis (cm) with XB1 = 50 and XB2 = 52]

H1: Partition is {{B1, B2}}
p(H1, data) = 4P1 × (1/4)² × ∫₀¹⁰⁰ p(μ1) p(x1 | μ1) p(x2 | μ1) dμ1 ≈ 1.3 × 10⁻⁴
(here p(μ1) = 0.01, the density of the Uniform[0, 100] prior)

H2: Partition is {{B1}, {B2}}
p(H2, data) = 4P2 × (1/4)² × ∫₀¹⁰⁰ p(μ1) p(x1 | μ1) dμ1 × ∫₀¹⁰⁰ p(μ2) p(x2 | μ2) dμ2 ≈ 7.5 × 10⁻⁵

Don't use more latent objects than necessary to explain your data
[MacKay 1992]
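These marginal probabilities can be reproduced numerically; a rough sketch using a midpoint Riemann sum over μ (the helper names and grid size are mine):

    import math

    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    def integrate(f, lo=0.0, hi=100.0, steps=20000):
        dx = (hi - lo) / steps
        return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

    p_mu = 0.01                     # density of the Uniform[0, 100] prior
    x1, x2 = 50.0, 52.0

    h1 = 4 * (1/4)**2 * integrate(lambda m: p_mu * normal_pdf(x1, m, 5) * normal_pdf(x2, m, 5))
    h2 = 12 * (1/4)**2 * (integrate(lambda m: p_mu * normal_pdf(x1, m, 5)) *
                          integrate(lambda m: p_mu * normal_pdf(x2, m, 5)))
    print(h1)                       # ~1.3e-4: one shared species (4P1 = 4 labelings)
    print(h2)                       # ~7.5e-5: two separate species (4P2 = 12 labelings)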
Mistake 3: Comparing Densities
Across Dimensions
[Figure: wingspan axis (cm) with XB1 = 50 and XB2 = 52]

H1: Partition is {{B1, B2}}, μ = 51
p(H1, data) = 4P1 × (1/4)² × 0.01 × N(50; 51, 5²) × N(52; 51, 5²) ≈ 1.5 × 10⁻⁵

H2: Partition is {{B1}, {B2}}, μB1 = 50, μB2 = 52
p(H2, data) = 4P2 × (1/4)² × 0.01 × N(50; 50, 5²) × 0.01 × N(52; 52, 5²) ≈ 4.8 × 10⁻⁷

H1 wins by a greater margin than before
What If We Change the Units?
[Figure: wingspan axis (m) from 0.1 to 1.0 with XB1 = 0.50 and XB2 = 0.52]

H1: Partition is {{B1, B2}}, μ = 0.51
p(H1, data) = 4P1 × (1/4)² × 1 × N(0.50; 0.51, 0.05²) × N(0.52; 0.51, 0.05²) ≈ 15
(the density of the Uniform(0, 1) prior is 1!)

H2: Partition is {{B1}, {B2}}, μB1 = 0.50, μB2 = 0.52
p(H2, data) = 4P2 × (1/4)² × 1 × N(0.50; 0.50, 0.05²) × 1 × N(0.52; 0.52, 0.05²) ≈ 48

Now H2 wins by a landslide
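The unit dependence is easy to reproduce; a sketch comparing the two plugged-in density scores in centimeters and in meters (the helper function and its name are mine):

    import math

    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    def mistake3_scores(x1, x2, sigma, prior_density):
        # Plug the most likely parameter values into the joint density, as in "Mistake 3"
        h1 = 4 * (1/4)**2 * prior_density * normal_pdf(x1, (x1 + x2) / 2, sigma) \
                                          * normal_pdf(x2, (x1 + x2) / 2, sigma)
        h2 = 12 * (1/4)**2 * prior_density**2 * normal_pdf(x1, x1, sigma) * normal_pdf(x2, x2, sigma)
        return h1, h2

    print(mistake3_scores(50, 52, 5, 0.01))      # cm: ~(1.5e-5, 4.8e-7), so H1 "wins"
    print(mistake3_scores(0.50, 0.52, 0.05, 1))  # m:  ~(15, 48),         so H2 "wins"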
Lesson: Comparing Densities
Across Dimensions
• Densities don’t behave like probabilities
(e.g., they can be greater than 1)
• Heights of density peaks in spaces of
different dimension are not comparable
• Work-arounds:
– Find most likely partition first, then most likely
parameters given that partition
– Find region in parameter space where most of
the posterior probability mass lies
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
Why Not Exact Inference?
• Number of possible partitions is
superexponential in n
• Variable elimination?
– Summing out μi couples all the Cj's
– Summing out Cj couples all the μi's
[Figure: the mixture-model Bayesian network with nodes μ1, μ2, ..., μk, X1, ..., Xn, and C1, ..., Cn]
Markov Chain Monte Carlo
(MCMC)
• Start in arbitrary state
(possible world) s1
satisfying evidence E
• Sample s2, s3, ...
according to transition
kernel T(si, si+1), yielding a
Markov chain
• Approximate p(Q | E) by the
fraction of s1, s2, …, sL
that are in Q
[Figure: a chain of states wandering inside the evidence event E, some of them falling in the query event Q]
Why a Markov Chain?
• Why use Markov chain rather than
sampling independently?
– Stochastic local search for high-probability s
– Once we find such s, explore around it
Convergence
• Stationary distribution π is such that
Σs π(s) T(s, s′) = π(s′)
• If the chain is ergodic (can get to anywhere
from anywhere*), then:
– It has a unique stationary distribution π
– The fraction of s1, s2, ..., sL in Q converges to
π(Q) as L → ∞
• We'll design T so that π(s) = p(s | E)
* and it's aperiodic
Gibbs Sampling
• Order the non-evidence variables V1, V2, ..., Vm
• Given state s, sample from T as follows:
– Let s′ = s
– For i = 1 to m:
• Sample vi from p(Vi | s′−i), the conditional for Vi given the other variables in s′
• Let s′ = (s′−i, Vi = vi)
– Return s′
• Theorem: the stationary distribution is p(s | E)
[Geman & Geman 1984]
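A minimal generic Gibbs transition in Python, following the pseudocode above; full_conditionals is a hypothetical mapping from each non-evidence variable to a function that samples it given all the other variables:

    def gibbs_step(state, full_conditionals):
        """One Gibbs transition: resample each variable from its full conditional, in order."""
        new_state = dict(state)                                # s' = s
        for var, sample_given_rest in full_conditionals.items():
            new_state[var] = sample_given_rest(new_state)      # v_i ~ p(V_i | s'_-i)
        return new_state

    def gibbs_chain(state, full_conditionals, num_steps):
        samples = []
        for _ in range(num_steps):
            state = gibbs_step(state, full_conditionals)
            samples.append(state)
        return samples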
Gibbs on Bayesian Network
• The conditional for V depends only on the factors
that contain V:
p(v | s−V) ∝ p(v | s[Pa(V)]) × ∏Y∈ch(V) p(s[Y] | v, s[Pa−V(Y)])
• So condition on V's Markov blanket mb(V):
its parents, children, and co-parents
[Figure: node V with its Markov blanket highlighted]
Gibbs on Bayesian Mixture Model
• Given the current state s:
– Resample each μi given the prior and
{Xj : Cj = i in s}
– Resample each Cj given Xj and μ1:k
[Figure: the mixture-model Bayesian network; the context-specific Markov blanket of Cj is just Xj and μ1, ..., μk]
[Neal 2000]
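A sketch of one such sweep in Python (standard library only). Because the Uniform[0, 100] prior on μi is flat, its posterior given the assigned points is a truncated Normal, which is sampled here by simple rejection; that implementation detail is mine, not the tutorial's:

    import math, random

    SIGMA = 5.0

    def normal_pdf(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    def gibbs_sweep(mu, C, X):
        k, n = len(mu), len(X)
        # Resample each mu_i given the prior and {X_j : C_j = i}
        for i in range(k):
            assigned = [X[j] for j in range(n) if C[j] == i]
            if not assigned:
                mu[i] = random.uniform(0, 100)                 # no data assigned: sample from the prior
            else:
                mean = sum(assigned) / len(assigned)
                sd = SIGMA / math.sqrt(len(assigned))
                while True:                                    # Normal(mean, sd^2) truncated to [0, 100]
                    draw = random.gauss(mean, sd)
                    if 0 <= draw <= 100:
                        mu[i] = draw
                        break
        # Resample each C_j given X_j and mu_1:k (its context-specific Markov blanket)
        for j in range(n):
            weights = [normal_pdf(X[j], mu[i], SIGMA) + 1e-300 for i in range(k)]
            C[j] = random.choices(range(k), weights=weights)[0]
        return mu, C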
Sampling Given Markov Blanket
p(v | s−V) ∝ p(v | s[Pa(V)]) × ∏Y∈ch(V) p(s[Y] | v, s[Pa−V(Y)])
• If V is discrete, just iterate over its values,
normalize, and sample from the resulting discrete distribution
• If V is continuous:
– Simple if the child distributions are conjugate to
V's prior: the posterior has the same form as the prior
with different parameters
– In general, even sampling from p(v | s−V) can
be hard
[See the BUGS software: http://www.mrc-bsu.cam.ac.uk/bugs]
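For the conjugate case, the updated parameters have a closed form. A sketch for the standard Normal-Normal pairing (a Normal prior on a mean whose children are Normal observations); this illustrates conjugacy in general rather than the uniform-prior model used elsewhere in the tutorial, and the numbers are illustrative:

    import random

    def sample_normal_mean_posterior(children, m0, s0, sigma):
        """Sample V given prior V ~ Normal(m0, s0^2) and children x ~ Normal(V, sigma^2)."""
        n = len(children)
        precision = 1 / s0**2 + n / sigma**2                 # posterior is Normal again
        mean = (m0 / s0**2 + sum(children) / sigma**2) / precision
        return random.gauss(mean, precision ** -0.5)

    print(sample_normal_mean_posterior([50.0, 52.0], m0=50.0, s0=30.0, sigma=5.0))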
Convergence Can Be Slow
[Figure: wingspan axis (cm); the observations form what should be two clusters, but in the current state μ1 = 20 sits near the data while μ2 = 90 (species 2) is far away from all of it]
• The Cj's won't change until μ2 is in the right area
• μ2 does an unguided random walk as long as no
observations are associated with it
– Especially bad in high dimensions
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
Metropolis-Hastings
[Metropolis et al. 1953; Hastings 1970]
• Define T(si, si+1) as follows:
– Sample s′ from a proposal distribution q(s′ | si)
– Compute the acceptance probability
α = min{ 1, [ p(s′ | E) q(si | s′) ] / [ p(si | E) q(s′ | si) ] }
(relative posterior probabilities × backward / forward proposal probabilities)
– With probability α, let si+1 = s′;
else let si+1 = si
Can show that p(s | E) is the stationary distribution for T
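A generic MH step in Python, directly following the rule above; propose(s) is assumed to return a proposed state s′ together with the forward density q(s′ | s) and the backward density q(s | s′), and p may be unnormalized:

    import random

    def mh_step(s, p, propose):
        """One Metropolis-Hastings transition; p(s | E) may be known only up to a constant."""
        s_new, q_forward, q_backward = propose(s)
        alpha = min(1.0, (p(s_new) * q_backward) / (p(s) * q_forward))
        return s_new if random.random() < alpha else s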
Metropolis-Hastings
• Benefits
– The proposal distribution can propose big steps
involving several variables
– Only need to compute the ratio p(s′ | E) / p(si | E),
ignoring normalization factors
– Don't need to sample from conditional distributions
• Limitations
– Proposals must be reversible: if q(si | s′) = 0, the move is never accepted
– Need to be able to compute q(si | s′) / q(s′ | si)
Split-Merge Proposals
• Choose two observations i, j
• If Ci = Cj = c, then split cluster c
– Get an unused latent object c′
– For each observation m such that Cm = c,
change Cm to c′ with probability 0.5
– Propose new values for μc, μc′
• Else merge clusters ci and cj
– For each m such that Cm = cj, set Cm = ci
– Propose a new value for μci
[Jain & Neal 2004]
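A sketch of generating such a proposal for the mixture model (the acceptance computation, which needs both the forward and backward proposal probabilities, is omitted). The state layout (C as a list of cluster labels, mu as a dict from label to parameter) and the choice to propose new parameters from the Uniform[0, 100] prior are my own simplifications:

    import random

    def propose_split_or_merge(C, mu, i, j):
        """Return a proposed (C, mu) from a split-merge move on observations i and j."""
        C, mu = list(C), dict(mu)
        if C[i] == C[j]:                                   # split cluster c
            c = C[i]
            c_new = max(mu) + 1                            # an unused latent object label
            for m in range(len(C)):
                if C[m] == c and m != i and random.random() < 0.5:
                    C[m] = c_new                           # reassign with probability 0.5
            C[j] = c_new                                   # make sure i and j end up apart
            mu[c] = random.uniform(0, 100)                 # propose new parameter values
            mu[c_new] = random.uniform(0, 100)
        else:                                              # merge the clusters of i and j
            c_i, c_j = C[i], C[j]
            for m in range(len(C)):
                if C[m] == c_j:
                    C[m] = c_i
            del mu[c_j]
            mu[c_i] = random.uniform(0, 100)               # propose a parameter for the merged cluster
        return C, mu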
Split-Merge Example
[Figure: wingspan axis (cm); the current state has μ1 = 20 and μ2 = 90, and the proposal moves μ2 to 27]
• Split two birds off from species 1
• Resample μ2 to match these two birds
• The move is likely to be accepted
Mixtures of Kernels
• If T1, …, Tm all have stationary distribution
π, then so does the mixture
T(s, s′) = Σi=1..m wi Ti(s, s′)
(with nonnegative weights wi summing to 1)
• Example: a mixture of split-merge and Gibbs
moves
• Point: faster convergence
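In code, a mixture of kernels just means choosing which transition to apply at each step with fixed probabilities; a short sketch (the kernel names in the comment are hypothetical):

    import random

    def mixture_kernel(state, kernels, weights):
        """Apply one of several transition kernels, chosen with fixed probabilities w_i."""
        kernel = random.choices(kernels, weights=weights)[0]
        return kernel(state)

    # e.g. 30% split-merge moves and 70% Gibbs sweeps (weights are illustrative):
    # state = mixture_kernel(state, [split_merge_move, gibbs_sweep_move], [0.3, 0.7])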
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
MCMC States in Split-Merge
• Not complete instantiations!
– No parameters for unobserved species
• States are partial instantiations of random
variables
k = 12, CB1 = S2, CB2 = S8, μS2 = 31, μS8 = 84
– Each state corresponds to an event: the set of
outcomes satisfying the description
MCMC over Events
[Milch & Russell 2006]
• Markov chain over
events σ, with stationary
distribution proportional to p(σ)
• Theorem: the fraction of
visited events in Q
converges to p(Q | E) if:
– each σ is either a subset of Q
or disjoint from Q
– the events form a partition of E
[Figure: the evidence event E partitioned into events, with the query event Q a union of some of them]
Computing Probabilities of Events
• The engine needs to compute p(σ) / p(σn)
efficiently (without summations)
• Use instantiations that
include all active parents
of the variables they
instantiate
• Then the probability is a product of CPDs:
p(σ) = ∏X∈vars(σ) pX( σ(X) | σ(Paσ(X)) )
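A sketch of that product in Python, assuming the partial instantiation is a dict from variable names to values, cpds maps each variable to a function of (value, parent values), and parents(X, sigma) returns X's active parents under sigma; this representation is my own, not the engine's actual data structures:

    def event_probability(sigma, cpds, parents):
        """p(sigma) = product over instantiated X of p_X(sigma(X) | sigma(Pa_sigma(X)))."""
        prob = 1.0
        for X, value in sigma.items():
            parent_values = {P: sigma[P] for P in parents(X, sigma)}   # active parents are in sigma
            prob *= cpds[X](value, parent_values)
        return prob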
States That Are Even More Abstract
• Typical partial instantiation:
k = 12, CB1 = S2, CB2 = S8, μS2 = 31, μS8 = 84
– Specifies particular species numbers, even though
species are interchangeable
• Let states be abstract partial instantiations:
∃x ∃y≠x [k = 12, CB1 = x, CB2 = y, μx = 31, μy = 84]
• See [Milch & Russell 2006] for conditions under
which we can compute probabilities of such
events
Outline
• Probabilistic models for relational structures
– Modeling the number of objects
– Three mistakes that are easy to make
• Markov chain Monte Carlo (MCMC)
– Gibbs sampling
– Metropolis-Hastings
– MCMC over events
• Case studies
– Citation matching
– Multi-target tracking
Representative Applications
• Tracking cars with cameras [Pasula et al. 1999]
• Segmentation in computer vision [Tu & Zhu 2002]
• Citation matching [Pasula et al. 2003]
• Multi-target tracking with radar [Oh et al. 2004]
Citation Matching Model
[Pasula et al. 2003; Milch & Russell 2006]
// Unknown numbers of researchers and papers, each with a name or title
#Researcher ~ NumResearchersPrior();
Name(r) ~ NamePrior();
#Paper ~ NumPapersPrior();
FirstAuthor(p) ~ Uniform({Researcher r});
Title(p) ~ TitlePrior();
// Each citation refers to some paper, and its text is a noisy rendering
// of that paper's first-author name and title
PubCited(c) ~ Uniform({Paper p});
Text(c) ~ NoisyCitationGrammar
  (Name(FirstAuthor(PubCited(c))), Title(PubCited(c)));
Citation Matching
• Elaboration of generative model shown earlier
• Parameter estimation
– Priors for names, titles, citation formats learned
offline from labeled data
– String corruption parameters learned with Monte
Carlo EM
• Inference
– MCMC with split-merge proposals
– Guided by “canopies” of similar citations
– Accuracy stabilizes after ~20 minutes
[Pasula et al., NIPS 2002]
Citation Matching Results
[Chart: error (fraction of clusters not recovered correctly, on a scale from 0 to 0.25) on four data sets (Reinforce, Face, Reason, Constraint), comparing Phrase Matching [Lawrence et al. 1999], Generative Model + MCMC [Pasula et al. 2002], and Conditional Random Field [Wellner et al. 2004]]
Four data sets of ~300-500 citations, referring to ~150-300 papers
Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language
Input with a Graphical User Interface. NRL Report
NRL/FR/5510-94-9711 (1994).
Is "Eucalyptus" part of the title, or is the author
named K. Eucalyptus Wauchope?
Kenneth Wauchope (1994). Eucalyptus: Integrating
natural language input with a graphical user
interface. NRL Report NRL/FR/5510-94-9711, Naval
Research Laboratory, Washington, DC, 39pp.
Second citation makes it clear how to parse the first one
Preliminary Experiments:
Information Extraction
• P(citation text | title, author names)
modeled with simple HMM
• For each paper: recover title, author
surnames and given names
• Fraction whose attributes are recovered
perfectly in last MCMC state:
– among papers with one citation: 36.1%
– among papers with multiple citations: 62.6%
Can use inferred knowledge for disambiguation
Multi-Object Tracking
[Figure: observations linked into object tracks over time; some objects go unobserved at some time steps, and some observations are false detections]
State Estimation for “Aircraft”
#Aircraft ~ NumAircraftPrior();

State(a, t)
  if t = 0 then ~ InitState()
  else ~ StateTransition(State(a, Pred(t)));

#Blip(Source = a, Time = t)
  ~ NumDetectionsCPD(State(a, t));

#Blip(Time = t)
  ~ NumFalseAlarmsPrior();

ApparentPos(r)
  if (Source(r) = null) then ~ FalseAlarmDistrib()
  else ~ ObsCPD(State(Source(r), Time(r)));
Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior();

Exits(a, t)
  if InFlight(a, t) then ~ Bernoulli(0.1);

InFlight(a, t)
  if t < EntryTime(a) then = false
  elseif t = EntryTime(a) then = true
  else = (InFlight(a, Pred(t)) & !Exits(a, Pred(t)));

State(a, t)
  if t = EntryTime(a) then ~ InitState()
  elseif InFlight(a, t) then
    ~ StateTransition(State(a, Pred(t)));

#Blip(Source = a, Time = t)
  if InFlight(a, t) then
    ~ NumDetectionsCPD(State(a, t));

…plus the last two statements from the previous slide
MCMC for Aircraft Tracking
• Uses the generative model from the previous slide
(although not with BLOG syntax)
• Examples of Metropolis-Hastings proposals:
[Figures illustrating the proposals, by Songhwai Oh]
[Oh et al., CDC 2004]
Aircraft Tracking Results
• Estimation error: MCMC has the smallest error, and it hardly degrades at all as tracks get dense
• Running time: MCMC is nearly as fast as the greedy algorithm, and much faster than MHT (multiple hypothesis tracking)
[Figures by Songhwai Oh]
[Oh et al., CDC 2004]
Toward General-Purpose Inference
• Currently, each new application requires
new code for:
– Proposing moves
– Representing MCMC states
– Computing acceptance probabilities
• Goal:
– User specifies model and proposal distribution
– General-purpose code does the rest
General MCMC Engine
[Milch & Russell 2006]
• Model (in declarative language): defines p(s)
• Custom proposal distribution (Java class): proposes an MCMC state s′ given sn; computes the ratio q(sn | s′) / q(s′ | sn)
• General-purpose engine (Java code): computes the acceptance probability based on the model; sets sn+1
• MCMC states are partial worlds
• Handles arbitrary proposals efficiently using context-specific structure
Summary
• Models for relational structures go beyond
standard probabilistic inference settings
• MCMC provides a feasible path for
inference
• Open problems
– More general inference
– Adaptive MCMC
– Integrating discriminative methods
References
• Blei, D. M. and Jordan, M. I. (2005) “Variational inference for Dirichlet process mixtures”. J. Bayesian Analysis 1(1):121-144.
• Casella, G. and Robert, C. P. (1996) “Rao-Blackwellisation of sampling schemes”. Biometrika 83(1):81-94.
• Ferguson, T. S. (1983) “Bayesian density estimation by mixtures of normal distributions”. In Rizvi, M. H. et al., eds., Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pages 287-302.
• Geman, S. and Geman, D. (1984) “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images”. IEEE Trans. on Pattern Analysis and Machine Intelligence 6:721-741.
• Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) “A language and program for complex Bayesian modelling”. The Statistician 43(1):169-177.
• Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall.
• Green, P. J. (1995) “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination”. Biometrika 82(4):711-732.
• Hastings, W. K. (1970) “Monte Carlo sampling methods using Markov chains and their applications”. Biometrika 57:97-109.
• Jain, S. and Neal, R. M. (2004) “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model”. J. Computational and Graphical Statistics 13(1):158-182.
• Jordan, M. I. (2005) “Dirichlet processes, Chinese restaurant processes, and all that”. Tutorial at the NIPS Conference, available at http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
• MacKay, D. J. C. (1992) “Bayesian interpolation”. Neural Computation 4(3):414-447.
• MacEachern, S. N. (1994) “Estimating normal means with a conjugate style Dirichlet process prior”. Communications in Statistics: Simulation and Computation 23:727-741.
• Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) “Equations of state calculations by fast computing machines”. J. Chemical Physics 21:1087-1092.
• Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005) “BLOG: Probabilistic models with unknown objects”. In Proc. 19th Int’l Joint Conf. on AI, pages 1352-1359.
• Milch, B. and Russell, S. (2006) “General-purpose MCMC inference over relational structures”. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.
• Neal, R. M. (2000) “Markov chain sampling methods for Dirichlet process mixture models”. J. Computational and Graphical Statistics 9:249-265.
• Oh, S., Russell, S. and Sastry, S. (2004) “Markov chain Monte Carlo data association for general multi-target tracking problems”. In Proc. 43rd IEEE Conf. on Decision and Control, pages 734-742.
• Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) “Tracking many objects with many sensors”. In Proc. 16th Int’l Joint Conf. on AI, pages 1160-1171.
• Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) “Identity uncertainty and citation matching”. In Advances in Neural Information Processing Systems 15, MIT Press, pages 1401-1408.
• Richardson, S. and Green, P. J. (1997) “On Bayesian analysis of mixtures with an unknown number of components”. J. Royal Statistical Society B 59:731-792.
• Sethuraman, J. (1994) “A constructive definition of Dirichlet priors”. Statistica Sinica 4:639-650.
• Sudderth, E. (2006) “Graphical models for visual object recognition and tracking”. Ph.D. thesis, Dept. of EECS, Massachusetts Institute of Technology, Cambridge, MA.
• Tu, Z. and Zhu, S.-C. (2002) “Image segmentation by data-driven Markov chain Monte Carlo”. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.