Notes on Graphical Models
Padhraic Smyth
Department of Computer Science
University of California, Irvine
[Figure: the modeling loop. A probabilistic model generates real-world data via P(Data | Parameters): the generative model, probability. Observed data are used to reason back about the model via P(Parameters | Data): inference, statistics.]
Part 1: Review of Probability
Notation and Definitions
• X is a random variable
– Lower-case x is some possible value for X
– “X = x” is a logical proposition: that X takes value x
– There is uncertainty about the value of X
• e.g., X is the Dow Jones index at 5pm tomorrow
• p(X = x) is the probability that proposition X=x is true
– often shortened to p(x)
• If the set of possible x’s is finite, we have a probability
distribution and Σx p(x) = 1
• If the set of possible x’s is continuous, p(x) is a density
function, and p(x) integrates to 1 over the range of X
Example
• Let X be the Dow Jones Index (DJI) at 5pm Monday
August 22nd (tomorrow)
• X can take real values from 0 to some large number
• p(x) is a density representing our uncertainty about X
– This density could be constructed from historical data
– After 5pm, once we hear from Wall Street what x is, there is
no uncertainty: the probability concentrates entirely on that value
Probability as Degree of Belief
• Different agents can have different p(x)’s
– Your p(x) and the p(x) of a Wall Street expert might be
quite different
– OR: if we were on vacation we might not have access to
stock market information
• we would still be uncertain about X after 5pm
• So we should really think of p(x) as p(x | BI)
– Where BI is background information available to agent I
– (will drop explicit conditioning on BI in notation)
• Thus, p(x) represents the degree of belief that agent I
has in the proposition X = x, conditioned on the available
background information
Comments on Degree of Belief
• Different agents can have different probability models
– There is not necessarily a “correct” p(x)
– Why? Because p(x) is a model built on whatever assumptions or
background information we use
– Naturally leads to the notion of updating
• p(x | BI) -> p(x | BI, CI)
• This is the subjective Bayesian interpretation of probability
– Generalizes other interpretations (such as frequentist)
– Can be used in cases where frequentist reasoning is not applicable
– We will use “degree of belief” as our interpretation of p(x) in this
tutorial
• Note!
– Degree of belief is just our semantic interpretation of p(x)
– The mathematics of probability (e.g., Bayes rule) remain the
same regardless of our semantic interpretation
Multiple Variables
• p(x, y, z)
– Probability that X=x AND Y=y AND Z =z
– Possible values: cross-product of X Y Z
– e.g., X, Y, Z each take 10 possible values
• (x, y, z) can take 10^3 = 1000 possible values
• p(x,y,z) is a 3-dimensional array/table
– Defines 10^3 probabilities
• Note the exponential increase as we add more
variables
– e.g., X, Y, Z are all real-valued
• x,y,z live in a 3-dimensional vector space
• p(x,y,z) is a positive function defined over this space,
integrates to 1
Conditional Probability
• p(x | y, z)
– Probability of x given that Y=y and Z = z
– Could be
• hypothetical, e.g., “if Y=y and if Z = z”
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc
– “all probabilities are conditional probabilities”
• Computing conditional probabilities is the basis of many
prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of DJI tomorrow given the DJI index last week
– most likely value of a parameter given observed data
Computing Conditional Probabilities
• Variables A, B, C, D
– All distributions of interest related to A,B,C,D can be computed
from the full joint distribution p(a,b,c,d)
• Examples, using the Law of Total Probability
– p(a) = Σ{b,c,d} p(a, b, c, d)
– p(c,d) = Σ{a,b} p(a, b, c, d)
– p(a,c | d) = Σ{b} p(a, b, c | d),
where p(a, b, c | d) = p(a,b,c,d) / p(d)
• These are standard probability manipulations: however, we
will see how to use these to make inferences about
parameters and unobserved variables, given data
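As a concrete illustration, here is a minimal Python/NumPy sketch (the joint table is random and purely illustrative, and all variable names are invented) of the manipulations above: computing p(a), p(c,d), and p(a,c | d) by brute-force summation over a full joint table p(a,b,c,d).

```python
import numpy as np

# Toy joint distribution p(a, b, c, d): each variable takes K values.
# The entries are arbitrary positive numbers normalized to sum to 1.
K = 3
rng = np.random.default_rng(0)
joint = rng.random((K, K, K, K))
joint /= joint.sum()                     # sums to 1 over all (a, b, c, d)

# Law of Total Probability: marginalize out the unwanted variables.
p_a  = joint.sum(axis=(1, 2, 3))         # p(a)   = sum over b, c, d
p_cd = joint.sum(axis=(0, 1))            # p(c,d) = sum over a, b
p_d  = joint.sum(axis=(0, 1, 2))         # p(d)   = sum over a, b, c

# Conditional: p(a, c | d) = [ sum_b p(a,b,c,d) ] / p(d)
p_acd = joint.sum(axis=1)                # p(a, c, d), shape (K, K, K)
p_ac_given_d = p_acd / p_d               # broadcasting divides each d-slice by p(d)

# Sanity check: for every value of d, p(a, c | d) sums to 1 over (a, c).
assert np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0)
```

The catch is that the table `joint` already has K^4 entries, which is exactly the problem taken up next.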
Two Practical Problems
(Assume for simplicity each variable takes K values)
• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over
• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
Two Key Ideas
• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference
• Problem 2: Model Specification
– Idea: Probabilistic learning
• General principles for learning from data
Part 2: Graphical Models
“…probability theory is more fundamentally
concerned with the structure of reasoning and
causation than with numbers.”
Glenn Shafer and Judea Pearl
Introduction to Readings in Uncertain Reasoning,
Morgan Kaufmann, 1990
Conditional Independence
• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)
(also implies that B is conditionally independent of A given C)
• In words, B provides no information about A if the value of C is
known
• Example:
– a = “reading ability”
– b = “height”
– c = “age”
• Note that conditional independence does not imply marginal
independence
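To make the example concrete, here is a minimal sketch with made-up numbers for binary variables: reading ability A and height B both depend on age group C, and the code checks that A is conditionally independent of B given C even though A and B are marginally dependent.

```python
import numpy as np

# Toy model (numbers made up): C = age group, A = reading ability, B = height.
p_c = np.array([0.5, 0.5])                  # p(c)
p_a_given_c = np.array([[0.9, 0.2],         # p(a | c), rows a, columns c
                        [0.1, 0.8]])
p_b_given_c = np.array([[0.8, 0.3],         # p(b | c), rows b, columns c
                        [0.2, 0.7]])

# Joint p(a, b, c) = p(a | c) p(b | c) p(c), indexed [a, b, c]
joint = p_a_given_c[:, None, :] * p_b_given_c[None, :, :] * p_c[None, None, :]

# Conditional independence: p(a | b, c) does not depend on b.
p_bc = joint.sum(axis=0)                    # p(b, c)
p_a_given_bc = joint / p_bc                 # p(a | b, c)
print(np.allclose(p_a_given_bc[:, 0, :], p_a_given_bc[:, 1, :]))   # True

# But A and B are NOT marginally independent: p(a, b) != p(a) p(b).
p_ab = joint.sum(axis=2)                    # p(a, b)
p_a = p_ab.sum(axis=1, keepdims=True)       # p(a)
p_b = p_ab.sum(axis=0, keepdims=True)       # p(b)
print(np.allclose(p_ab, p_a * p_b))         # False
```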
Graphical Models
• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of edge -> conditional independence
– Directed and undirected versions
• Why is this useful?
– A language for communication
– A language for computation
• Origins:
– Wright 1920’s
– Independently developed by Spiegelhalter and Lauritzen in
statistics and Pearl in computer science in the late 1980’s
Examples of 3-way Graphical Models
[Graph: three nodes A, B, C with no edges]
Marginal Independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Graphical Models
[Graph: node A with directed edges to B and to C]
Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A
e.g., A is a disease, and we model B and C as
conditionally independent symptoms given A
Examples of 3-way Graphical Models
[Graph: nodes A and B each with a directed edge to C]
Independent Causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
Examples of 3-way Graphical Models
[Graph: chain A -> B -> C]
Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
Real-World Example
Monitoring Intensive-Care Patients (the ALARM network)
• 37 variables
• 509 parameters …instead of 2^37
[Figure: the ALARM network, a directed graph over 37 clinical
variables such as PULMEMBOLUS, INTUBATION, VENTLUNG, SAO2,
CATECHOL, HR, and BP]
(figure courtesy of Kevin Murphy/Nir Friedman)
Directed Graphical Models
[Graph: nodes A and B each with a directed edge to C]
p(A,B,C) = p(C|A,B) p(A) p(B)
In general,
p(X1, X2, ..., XN) = Π p(Xi | parents(Xi) )
• Probability model has a simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, Bayesian networks, causal networks
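A minimal sketch (binary variables, made-up conditional probability tables) of this factored form for the graph above, p(A,B,C) = p(C|A,B) p(A) p(B): each factor involves only a node and its parents.

```python
import numpy as np

# Made-up tables for binary A, B, C with edges A -> C and B -> C.
p_a = np.array([0.6, 0.4])                    # p(A)
p_b = np.array([0.7, 0.3])                    # p(B)
p_c_given_ab = np.array([[[0.9, 0.1],         # p(C | A, B), indexed [a, b, c];
                          [0.5, 0.5]],        # each row over c sums to 1
                         [[0.4, 0.6],
                          [0.1, 0.9]]])

# Joint via the factored form: p(a, b, c) = p(c | a, b) p(a) p(b)
joint = p_c_given_ab * p_a[:, None, None] * p_b[None, :, None]
assert np.isclose(joint.sum(), 1.0)
```

For this tiny graph the factored form saves nothing, but for a sparse graph like the 37-variable network above it is the difference between 509 numbers and a table of size 2^37.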
Reminders from Probability….
• Law of Total Probability
P(a) = Σb P(a, b) = Σb P(a | b) P(b)
– Conditional version:
P(a|c) = Σb P(a, b | c) = Σb P(a | b, c) P(b | c)
• Factorization or Chain Rule
– P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
= P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
= …..
Graphical Models for Computation
[Figure: the ALARM network from the previous slide]
• Say we want to compute P(BP|Press)
• Law of total probability:
-> must sum over all other variables
-> exponential in # variables
• Factorization:
-> joint distribution factors into smaller
tables
• Summing over these smaller tables can reduce the
complexity dramatically
Example
[Graph: a tree with root D; D has children B and E;
B has children A and C; E has children F and G]
p(A, B, C, D, E, F, G) = Π p( variable | parents )
= p(A|B) p(C|B) p(B|D) p(F|E) p(G|E) p(E|D) p(D)
Example
[Same graph, with C = c and G = g observed]
Say we want to compute p(a | c, g)
Example
Direct calculation: p(a|c,g) = Σ{b,d,e,f} p(a,b,d,e,f | c,g)
Complexity of the sum is O(K^4)
Example
Reordering (using factorization):
Σb p(a|b) Σd p(b|d,c) Σe p(d|e) Σf p(e,f | g)
Summing out the innermost variable at each step:
Σf p(e,f | g) = p(e | g)
Σe p(d|e) p(e | g) = p(d | g)
Σd p(b|d,c) p(d | g) = p(b | c, g)
Σb p(a|b) p(b | c, g) = p(a | c, g)
Complexity is O(K), compared to O(K^4)
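Below is a minimal sketch of this computation (binary variables, random conditional probability tables, everything invented for illustration). It builds the full joint by brute force and also runs the reordered elimination, summing out one variable at a time over small factors, and confirms that both give the same p(a | c, g).

```python
import numpy as np

rng = np.random.default_rng(1)
K = 2                                   # arity of each variable

def cpt(*shape):
    """Random conditional probability table, normalized over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Tree from the example: D -> B, D -> E, B -> A, B -> C, E -> F, E -> G.
p_d = cpt(K)                  # p(d)
p_b_d = cpt(K, K)             # p(b | d), indexed [d, b]
p_e_d = cpt(K, K)             # p(e | d)
p_a_b = cpt(K, K)             # p(a | b)
p_c_b = cpt(K, K)             # p(c | b)
p_f_e = cpt(K, K)             # p(f | e)
p_g_e = cpt(K, K)             # p(g | e)

# Brute force: build the full joint p(a,b,c,d,e,f,g), then condition.
joint = np.einsum('d,db,de,ba,bc,ef,eg->abcdefg',
                  p_d, p_b_d, p_e_d, p_a_b, p_c_b, p_f_e, p_g_e)
c_obs, g_obs = 0, 1                          # observed values (arbitrary)
slice_ = joint[:, :, c_obs, :, :, :, g_obs]  # fix C = c, G = g
brute = slice_.sum(axis=(1, 2, 3, 4))        # sum over b, d, e, f
brute /= brute.sum()                         # p(a | c, g)

# Elimination: sum out F, E, D, B in turn, always over small factors.
m_f = p_f_e.sum(axis=1)                                     # sum over f
m_e = np.einsum('e,de,e->d', m_f, p_e_d, p_g_e[:, g_obs])   # sum over e
m_d = np.einsum('d,d,db->b', p_d, m_e, p_b_d)               # sum over d
m_b = np.einsum('b,b,ba->a', m_d, p_c_b[:, c_obs], p_a_b)   # sum over b
elim = m_b / m_b.sum()                                      # p(a | c, g)

print(np.allclose(brute, elim))                             # True
```

Each elimination step touches only factors over one or two variables, whereas the brute-force approach materializes the full K^7 joint.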
Graphs with “loops”
[Graph: a directed graph over nodes A–G in which some pairs of
nodes are connected by more than one path]
Message passing algorithm does not work when
there are multiple paths between 2 nodes
Graphs with “loops”
General approach: “cluster” variables
together to convert the graph to a tree
Reduce to a Tree
[Graph: the same graph with B and E merged into a single
cluster node (B, E), so the result is a tree]
Probability Calculations on Graphs
• General algorithms exist - beyond trees
– Complexity is typically O(m^(number of parents))
(where m = arity of each node)
– If each node has a single parent (e.g., a tree), -> O(m)
– The sparser the graph the lower the complexity
• Technique can be “automated”
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables:
• replace sum with integral
– For identification of most likely values
• Replace sum with max operator (see the sketch below)
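A minimal sketch of the sum-versus-max idea on a tiny two-variable chain A -> B (tables made up): the same elimination pattern yields either a marginal distribution (sum) or the probability of the most likely joint configuration (max).

```python
import numpy as np

# Tiny chain A -> B with made-up tables.
p_a = np.array([0.3, 0.7])             # p(a)
p_b_given_a = np.array([[0.9, 0.1],    # p(b | a), rows a, columns b
                        [0.4, 0.6]])

factors = p_a[:, None] * p_b_given_a   # joint p(a, b)

# Sum-elimination: marginal p(b).
p_b = factors.sum(axis=0)

# Max-elimination: probability and identity of the most likely (a, b) pair.
best_prob = factors.max()
best_a, best_b = np.unravel_index(factors.argmax(), factors.shape)

print(p_b)                         # [0.55 0.45]
print(best_prob, best_a, best_b)   # 0.42, a = 1, b = 1
```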
Part 3: Learning with Graphical
Models
Further Reading:
M. Jordan, Graphical models,
Statistical Science: Special Issue on Bayesian Statistics, vol. 19, no. 1, pp. 140-155, Feb. 2004
A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin,
Bayesian Data Analysis (2nd ed), Chapman and Hall, 2004
[Figure, repeated from Part 1: the modeling loop. Generative model,
probability: P(Data | Parameters). Inference, statistics:
P(Parameters | Data).]
The Likelihood Function
• Likelihood = p(data | parameters)
= p( D | φ )
= L(φ)
• The likelihood tells us how likely the observed data are,
conditioned on a particular setting of the parameters
• Details
– Constants that do not involve φ can be dropped in defining L(φ)
– Often easier to work with log L(φ)
Comments on the Likelihood Function
• Constructing a likelihood function L(φ) is the first step in
probabilistic modeling
• The likelihood function implicitly assumes an underlying
probabilistic model M with parameters φ
• L(φ) connects the model to the observed data
• Graphical models provide a useful language for constructing
likelihoods
Binomial Likelihood
• Binomial model
– n memoryless trials, 2 outcomes
– probability φ of success at each trial
• Observed data
– r successes in n trials
– Defines a likelihood:
L(φ) = p(D | φ)
= p(successes) p(non-successes)
= φ^r (1 − φ)^(n−r)
Binomial Likelihood Examples
[Figure: plots of the binomial likelihood L(φ) for example values of r and n]
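As a small worked example (counts made up), the sketch below evaluates the binomial log-likelihood on a grid of φ values and locates its maximum, which lands at the familiar estimate r/n.

```python
import numpy as np

r, n = 7, 20                                  # made-up data: 7 successes in 20 trials
phi = np.linspace(0.001, 0.999, 999)          # grid of candidate parameter values

# Log-likelihood log L(phi) = r log(phi) + (n - r) log(1 - phi)
# (constants not involving phi are dropped, as noted above).
log_lik = r * np.log(phi) + (n - r) * np.log(1 - phi)

phi_ml = phi[np.argmax(log_lik)]
print(phi_ml, r / n)                          # both approximately 0.35
```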
Multinomial Likelihood
• Multinomial model
– n memoryless trials, K outcomes
– Probability vector Φ for the outcomes at each trial
• Observed data
– nj outcomes of type j in n trials, j = 1, ..., K
– Defines a likelihood:
L(Φ) ∝ Πj φj^nj
– Maximum likelihood estimates: φj = nj / n
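A minimal sketch (made-up counts) of the multinomial likelihood and its maximum likelihood estimates φj = nj / n.

```python
import numpy as np

counts = np.array([12, 5, 3])        # made-up data: counts n_j for K = 3 outcomes
n = counts.sum()

phi_ml = counts / n                  # maximum likelihood estimates n_j / n
print(phi_ml)                        # [0.6  0.25 0.15]

# Log-likelihood at the estimate (dropping the multinomial coefficient,
# which does not depend on phi): sum_j n_j log(phi_j)
log_lik = (counts * np.log(phi_ml)).sum()
```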
Graphical Model for Multinomial
[Graph: parameter node φ with directed edges to each observed
data node w1, w2, ..., wn]
Parameters: φ = [ p(w1), p(w2), ..., p(wK) ]
Observed data: w1, ..., wn
“Plate” Notation
Model parameters: φ
[Graph: node φ with an edge to wi, where wi sits inside a plate i = 1:n]
Data = D = {w1, ..., wn}
A plate (rectangle) indicates replicated nodes in a graphical model
Variables within a plate are conditionally independent given their parent
Learning in Graphical Models
Model parameters: φ
[Same plate model: φ -> wi, i = 1:n]
Data = D = {w1, ..., wn}
Can view learning in a graphical model as computing the
most likely value of the parameter node given the data nodes
Maximum Likelihood (ML) Principle
(R. Fisher ~ 1922)
[Plate model: φ -> wi, i = 1:n]
Model parameters: φ
Data = {w1, ..., wn}
L(φ) = p(Data | φ) = Π p(wi | φ)
Maximum Likelihood: φML = arg max{ Likelihood(φ) }
Select the parameters that make the observed data most likely
The Bayesian Approach to Learning
Prior(φ) = p( φ | α )
[Graph: hyperparameter node α -> φ -> wi, with wi in a plate i = 1:n]
Fully Bayesian:
p( φ | Data ) = p( Data | φ ) p( φ ) / p( Data )
Maximum A Posteriori:
φMAP = arg max{ Likelihood(φ) x Prior(φ) }
Learning a Multinomial
• Likelihood: same as before, L(Φ) ∝ Πj φj^nj
• Prior: p( Φ | α ) = Dirichlet(α1, ..., αK)
proportional to Πj φj^(αj − 1)
– Has mean αj / Σk αk
– αj acts as a prior weight for outcome j
– Can set all αj = α for a “uniform” prior
Dirichlet Shapes
From: http://en.wikipedia.org/wiki/Dirichlet_distribution
Bayesian Learning
• P(Φ | D, α) is proportional to p(data | Φ) p(Φ | α)
∝ Πj φj^nj × Πj φj^(αj − 1)
= Πj φj^(nj + αj − 1)
= Dirichlet(n1 + α1, ..., nK + αK)
• Posterior mean estimate: (nj + αj) / (n + Σk αk)
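A minimal sketch (made-up counts, symmetric prior) of this conjugate update: adding the counts nj to the Dirichlet parameters αj gives the posterior, whose mean and mode are simple smoothed versions of the maximum likelihood estimates.

```python
import numpy as np

counts = np.array([12, 5, 3])             # made-up data counts n_j
alpha = np.array([1.0, 1.0, 1.0])         # symmetric ("uniform") Dirichlet prior

post_alpha = counts + alpha               # posterior is Dirichlet(n_j + alpha_j)

post_mean = post_alpha / post_alpha.sum()            # (n_j + a_j) / (n + sum_k a_k)
# MAP estimate (posterior mode); valid when all posterior parameters exceed 1.
post_mode = (post_alpha - 1) / (post_alpha.sum() - len(alpha))

print(post_mean)    # smoothed version of the ML estimates counts / counts.sum()
print(post_mode)    # with alpha_j = 1 the mode coincides with the ML estimates
```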
Summary of Bayesian Learning
• Can use graphical models to describe relationships between
parameters and data
• P(data | parameters) = likelihood function
• P(parameters) = prior
– In applications such as text mining, the prior can be “uninformative”, i.e., flat
– The prior can also be optimized for prediction (e.g., on validation data)
• We can compute P(parameters | data, prior),
or a “point estimate” (e.g., posterior mode or mean)
• Computation of posterior estimates can be computationally intractable
– Monte Carlo techniques are often used