Bayesian approaches to knowledge
representation and reasoning
Part 1
(Chapter 13)
Bayesianism vs. Frequentism
• Classical probability: Frequentists
– Probability of a particular event is defined
relative to its frequency in a sample space of
events.
– E.g., probability of “the coin will come up
heads on the next trial” is defined relative to the
frequency of heads in a sample space of coin
tosses.
• Bayesian probability:
– Combine measure of “prior” belief you have in
a proposition with your subsequent
observations of events.
• Example: A Bayesian can assign a probability to the
statement “The first e-mail message ever written
was not spam,” but a frequentist cannot.
Bayesian Knowledge Representation and Reasoning
• Question: Given the data D and our prior beliefs, what is
the probability that h is the correct hypothesis? (spam
example)
• Bayesian terminology (example -- spam recognition)
– Random variable X: returns one of a set of values
{x1, x2, ...,xm},
or a continuous value in interval [a,b] with probability
distribution D(X).
– Data D: {v1, v2, v3, ...} Set of observed values of
random variables X1, X2, X3, ...
– Hypothesis h: Function taking instance j and returning
classification of j (e.g., “spam” or “not spam”).
– Space of hypotheses H: Set of all possible hypotheses
– Prior probability of h:
• P(h): Probability that hypothesis h is true given our
prior knowledge
• If no prior knowledge, all h ∈ H are equally
probable
– Posterior probability of h:
• P(h|D): Probability that hypothesis h is true, given
the data D.
– Likelihood of D:
• P(D|h): Probability that we will see data D, given
hypothesis h is true.
Recall definition of conditional probability:

    P(X | Y) = P(X ∧ Y) / P(Y)

[Venn diagram: regions X and Y overlapping inside the event space]

Example:
Event space = all e-mail messages
X = all spam messages
Y = all messages containing word “v1agra”
Bayes Rule:

    P(X | Y) = P(Y | X) P(X) / P(Y)

Proof:

    P(X | Y) P(Y) = P(X ∧ Y) = P(Y | X) P(X)
Example: Using Bayes Rule

Hypotheses:
h = “message m is spam”
¬h = “message m is not spam”

Data:
+ = message m contains “viagra”
– = message m does not contain “viagra”

Prior probability:
P(h) = 0.1
P(¬h) = 0.9

Likelihood:
P(+ | h) = 0.6, P(– | h) = 0.4
P(+ | ¬h) = 0.03, P(– | ¬h) = 0.97

P(+) = P(+ | h) P(h) + P(+ | ¬h) P(¬h)
     = 0.6 × 0.1 + 0.03 × 0.9 = 0.087
P(–) = 1 – P(+) = 0.913

P(h | +) = P(+ | h) P(h) / P(+)
         = (0.6 × 0.1) / 0.087 ≈ 0.69
How would we learn these prior probabilities and
likelihoods from past examples of spam and not spam?
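One straightforward answer is counting. Below is a minimal sketch (not from the slides): the toy corpus and all counts are invented for illustration. Priors and likelihoods are estimated as relative frequencies, then Bayes rule is applied.

```python
# Minimal sketch: estimate P(h), P(+|h), P(+|¬h) by counting over a labeled
# corpus, then apply Bayes rule. The toy corpus below is made up.
corpus = [
    # (contains "viagra", is spam)
    (True, True), (False, True), (True, False), (False, False),
    (False, False), (True, True), (False, True), (False, False),
]

n = len(corpus)
n_spam = sum(1 for _, spam in corpus if spam)
n_plus_spam = sum(1 for plus, spam in corpus if spam and plus)
n_plus_not = sum(1 for plus, spam in corpus if not spam and plus)

p_h = n_spam / n                            # prior P(h)
p_plus_h = n_plus_spam / n_spam             # likelihood P(+ | h)
p_plus_not_h = n_plus_not / (n - n_spam)    # likelihood P(+ | ¬h)

# Bayes rule: P(h | +) = P(+ | h) P(h) / P(+)
p_plus = p_plus_h * p_h + p_plus_not_h * (1 - p_h)
print("P(h | +) =", p_plus_h * p_h / p_plus)
```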
Full joint probability distribution

Notation: P(h, D) ≡ P(h ∧ D)

            “viagra”   ¬“viagra”
  Spam        .06         .04
  ¬Spam       .027        .873

P(h ∧ +) = P(h | +) P(+)
P(h ∧ –) = P(h | –) P(–)
etc.
Now suppose there is a second feature examined: does
message contain the word “offer”?

P(m = spam, viagra, offer)

[Figure: the joint distribution table now has 2 × 2 × 2 entries, one for each combination of spam / ¬spam, “viagra” / ¬“viagra”, and offer / ¬offer]
Full joint distribution scales exponentially with
number of features
• Bayes optimal classifier for spam:

    argmax_{h ∈ {spam, ¬spam}} P(h | f1, f2, ..., fn)

  where fi is a feature (here, could be a “keyword”)
• In general, intractable.
• Classification using “naive Bayes”:

    P(h | f1, f2, ..., fn) ∝ P(h) ∏i P(fi | h)

• Assumes that all features are independent of one another.
• How do we learn the naive Bayes model from data?
• How do we apply the naive Bayes model to a new
instance?
Example: Training and Using Naive Bayes for
Classification
• Features:
  – CAPS: Boolean (longest contiguous string of capitalized letters in message is longer than 3)
  – URL: Boolean (0 if no URL in message, 1 if at least one URL in message)
  – $: Boolean (0 if $ does not appear at least once in message; 1 otherwise)
• Training data:
  M1: “DON’T MISS THIS AMAZING OFFER $$$!” → spam
  M2: “Dear mm, for more $$, check this out: http://www.spam.com” → spam
  M3: “I plan to offer two sections of CS 250 next year” → not spam
  M4: “Hi Mom, I am a bit short on $$ right now, can you send some ASAP? Love, me” → not spam
Training a Naive Bayes Classifier

    P(h, f1, f2, ..., fn) = P(h) ∏i P(fi | h)

• Two hypotheses: spam or not spam
• Estimate:

  P(spam) = .5
  P(CAPS | spam) = .5     P(URL | spam) = .5      P($ | spam) = .75
  P(CAPS | ¬spam) = .5    P(URL | ¬spam) = .25    P($ | ¬spam) = .5

  P(¬spam) = .5
  P(¬CAPS | spam) = .5    P(¬URL | spam) = .5     P(¬$ | spam) = .25
  P(¬CAPS | ¬spam) = .5   P(¬URL | ¬spam) = .75   P(¬$ | ¬spam) = .5

• m-estimate of probability (to fix cases where one of the terms in the
product is 0):

    P(fi | h) = (nc + m·p) / (n + m)

  where n is the number of training examples for which h is true,
  nc is the number of h = true training examples for which fi is true,
  p is a prior estimate for P(fi | h), and
  m is a constant that determines how heavily to weight p.
  For Boolean features, typically we set p = 1/2 and m = 2.
• Now classify a new message:
  M5: “This is a ONE-TIME-ONLY offer that will get you BIG
  $$$, just click on http://www.spammers.com”
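A sketch (not from the slides) that trains this classifier with m-estimates (p = 1/2, m = 2) and classifies the new message; the feature vectors are hand-extracted from M1-M4.

```python
# Training data: (CAPS, URL, $) feature vectors, hand-extracted from M1-M4.
train = [
    ((True,  False, True ), "spam"),      # M1: CAPS, no URL, $
    ((False, True,  True ), "spam"),      # M2: no CAPS, URL, $
    ((False, False, False), "not spam"),  # M3
    ((True,  False, True ), "not spam"),  # M4: "ASAP" counts as CAPS
]
classes = ["spam", "not spam"]
n_features = 3

def m_estimate(nc, n, p=0.5, m=2):
    """m-estimate of probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# P(fi = true | h), smoothed; P(¬fi | h) is just 1 - P(fi | h).
likelihood = {}
for h in classes:
    examples = [f for f, label in train if label == h]
    likelihood[h] = [m_estimate(sum(f[i] for f in examples), len(examples))
                     for i in range(n_features)]

prior = {h: sum(1 for _, label in train if label == h) / len(train)
         for h in classes}

def classify(features):
    """Return argmax_h P(h) * prod_i P(fi | h), plus the scores."""
    scores = {}
    for h in classes:
        score = prior[h]
        for p, present in zip(likelihood[h], features):
            score *= p if present else 1 - p
        scores[h] = score
    return max(scores, key=scores.get), scores

# M5 has CAPS ("TIME"), a URL, and $:
print(classify((True, True, True)))
# spam: .5 * .5 * .5 * .75 = .094 beats not spam: .5 * .5 * .25 * .5 = .031
```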
Information Retrieval
• Most important concepts:
– Defining features of a document
– Indexing documents according to features
– Retrieving documents in response to a query
– Ordering retrieved documents by relevance
• Early search engines:
– Features: List of all terms (keywords) in document
(minus “a”, “the”, etc.)
– Indexing: by keyword
– Retrieval: by keyword match with query
– Ordering: by number of keywords matched
• Problems with this approach
Naive Bayesian Document retrieval
• Let D be a document (“bag of words”), Q be a query (“bag
of words”), and r be the event that D is relevant to Q.
• In document retrieval, we want to compute:

    P(r | D, Q)

• Or, the “odds ratio”:

    P(r | D, Q) / P(¬r | D, Q)

• In the book, they show (via a lot of algebra) that

    P(r | D, Q) / P(¬r | D, Q) = P(Q | D, r) × [P(r | D) / P(¬r | D)]
                               = P(Q | D, r) × (“intrinsic relevance” of the document)
Naive Bayesian Document retrieval

    P(Q | D, r) = ∏j P(Qj | D, r)

• where Qj is the jth keyword in the query.
• The probability of a query given a relevant document D is
estimated as the product of the probabilities of each
keyword in the query, given the relevant document.
• How to learn these probabilities?
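One possible answer, sketched in code. The slide leaves the estimator open; the smoothed per-document counts below are an invented illustration, not the book's method.

```python
# Sketch: score documents by the keyword product P(Q | D, r) = prod_j P(Qj | D, r).
# P(Qj | D, r) is estimated here with a made-up m-estimate over the
# document's own word counts (p = .5, m = 1).

def keyword_prob(keyword, doc_words, p=0.5, m=1):
    """Smoothed estimate of P(Qj | D, r) from word counts in D."""
    return (doc_words.count(keyword) + m * p) / (len(doc_words) + m)

def query_likelihood(query, doc):
    """P(Q | D, r) as a product over the query's keywords."""
    doc_words = doc.lower().split()
    score = 1.0
    for q in query.lower().split():
        score *= keyword_prob(q, doc_words)
    return score

docs = ["the cat sat on the mat", "big stock offer for cat lovers"]
for d in docs:
    print(query_likelihood("cat offer", d), "<-", d)
```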
Evaluating Information Retrieval Systems
• Precision and Recall
• Example: Out of a corpus of 100 documents, a query has the
following results:

                   In results set   Not in results set
  Relevant               30                 20
  Not relevant           10                 40

• Precision: fraction of the results set that is relevant =
30/40 = .75 (“How precise is the results set?”)
• Recall: fraction of the relevant documents in the whole corpus
that are in the results set = 30/50 = .60 (“How many relevant
documents were recalled?”)
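For concreteness, the same arithmetic as a tiny sketch in code:

```python
# Compute precision and recall from the 2x2 table above.
relevant_retrieved = 30     # relevant and in results set
irrelevant_retrieved = 10   # not relevant but in results set
relevant_missed = 20        # relevant but not in results set

precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)
recall = relevant_retrieved / (relevant_retrieved + relevant_missed)
print(precision, recall)    # 0.75 0.6
```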
• Tradeoff between recall and precision:
  If we want to ensure that recall is high, just return a lot of
documents. Then precision may be low. If we return all 100
documents, every relevant document is retrieved, so recall is 1,
but only 50% of the results are relevant, so precision is 0.5.
  If we want a high chance that precision is high, just return
the single document judged most relevant (“I’m feeling
lucky” in Google). Then precision will (likely) be 1.0, but
recall will be low.
When do you want high precision? When do you want
high recall?
Bayesian approaches to knowledge
representation and reasoning
Part 2
(Chapter 14, sections 1-4)
• Recall the naive Bayes method:

    P(h | f1, ..., fn) ∝ P(h) ∏i P(fi | h)

• This can also be written in terms of “cause” and “effect”:

    P(Cause | effect1, ..., effectn) ∝ P(Cause) ∏i P(effecti | Cause)
Naive Bayes

[Diagram: Spam (the cause) with arrows to v1agra, stock, and offer (the effects)]

Bayesian network

[Diagram: Spam with arrows to v1agra and stock, and stock with an arrow to offer]

Each node has a “conditional probability table”
that gives its dependencies on its parents.
  P(Spam)
    0.1

  Spam    P(v1agra)
   t        0.6
   f        0.03

  Spam    P(stock)
   t        0.2
   f        0.3

  stock   P(offer)
   t        0.6
   f        0.1
Semantics of Bayesian networks
• If the network is correct, we can calculate the full joint probability
distribution from the network:

    P(X1 = x1 ∧ ... ∧ Xn = xn) = P(x1, ..., xn) = ∏_{i=1..n} P(xi | parents(Xi))

  where parents(Xi) denotes specific values of the parents of Xi.
                stock             ¬stock
            offer   ¬offer    offer   ¬offer
  Spam      .012     .008     .008     .072
  ¬Spam     .162     .108     .063     .567

Sum of all boxes is 1.
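A minimal sketch (not from the slides) that rebuilds this table from the CPTs above, using the network semantics P(x1, ..., xn) = ∏i P(xi | parents(Xi)).

```python
# Rebuild the joint table over (Spam, stock, offer) from the CPTs.
from itertools import product

P_spam = 0.1
P_stock = {True: 0.2, False: 0.3}    # P(stock = true | Spam)
P_offer = {True: 0.6, False: 0.1}    # P(offer = true | stock)

total = 0.0
for spam, stock, offer in product([True, False], repeat=3):
    p = P_spam if spam else 1 - P_spam
    p *= P_stock[spam] if stock else 1 - P_stock[spam]
    p *= P_offer[stock] if offer else 1 - P_offer[stock]
    print(f"spam={spam!s:5} stock={stock!s:5} offer={offer!s:5}  P={p:.3f}")
    total += p
print("sum =", total)   # 1.0
```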
Example from textbook
• I'm at work, neighbor John calls to say my alarm is
ringing, but neighbor Mary doesn't call. Sometimes it's set
off by minor earthquakes. Is there a burglar?
• Variables: Burglary, Earthquake, Alarm, JohnCalls,
MaryCalls
• Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example continued
Complexity of Bayesian Networks
• For n random Boolean variables:
  – Full joint probability distribution: 2^n entries
  – Bayesian network with at most k parents per node:
    • Each conditional probability table: at most 2^k entries
    • Entire network: at most n·2^k entries
Exact inference in Bayesian networks

Query:
What is P(Burglary | JohnCalls = true ∧ MaryCalls = true)?

Notation: capital letters are distributions; lower-case
letters are values or variables, depending on context.

We have:

    P(B | j, m) = P(B ∧ j ∧ m) / P(j ∧ m)
                = α P(B ∧ j ∧ m)
                = α Σ_{e ∈ {true,false}} Σ_{a ∈ {true,false}} P(B ∧ e ∧ a ∧ j ∧ m)
Let’s calculate this for b = “Burglary = true”:

    P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)

Worst-case complexity: O(n 2^n), where n is the number of Boolean
variables.

We can simplify, pulling constant factors out of the sums:

    P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)
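As a concrete illustration, a sketch of this enumeration in code. The CPT values come from the textbook's burglary figure, which is not reproduced in this transcript, so treat them as assumed.

```python
# Inference by enumeration for P(Burglary | JohnCalls=true, MaryCalls=true).
P_B, P_E = 0.001, 0.002                  # P(Burglary), P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}          # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}          # P(MaryCalls | Alarm)

def term(b, e, a):
    """One term P(b) P(e) P(a | b, e) P(j | a) P(m | a), with j = m = true."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return p * P_J[a] * P_M[a]

unnorm = {b: sum(term(b, e, a) for e in (True, False) for a in (True, False))
          for b in (True, False)}
alpha = 1 / sum(unnorm.values())
print("P(Burglary | j, m) =", alpha * unnorm[True])   # roughly 0.284
```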
Can speed up further via “variable elimination”.
However, bottom line on exact inference:
In general, it’s intractable. (Exponential in n.)
Solution:
Approximate inference, by sampling.
Bayesian approaches to knowledge
representation and reasoning
Part 3
(Chapter 14, section 5)
What are the advantages of Bayesian
networks?
• Intuitive, concise representation of joint probability
distribution (i.e., conditional dependencies) of a set of
random variables.
• Represents “beliefs and knowledge” about a particular
class of situations.
• Efficient (?) (approximate) inference algorithms
• Efficient, effective learning algorithms
Review of exact inference in Bayesian networks

General question: What is P(x | e)?

Example question: What is P(c | r, w)?

    P(x | e) = P(x, e) / P(e)            (definition of conditional probability)
             = α P(x, e)                 (α is a normalization factor)
             = α Σ_y P(x, e, y)          (where Y are the non-evidence variables)
             = α Σ_y ∏_{z ∈ {x, e, y}} P(z | parents(Z))
                                         (semantics of Bayesian networks)
[Figure, built up over several slides: the event space with the sprinkler network drawn over it. Cloudy has arrows to Sprinkler and Rain, both of which have arrows to Wet Grass.]
    P(c | r, w) = α Σ_s P(c, r, w, s)
                = α Σ_s ∏_{z ∈ {c, r, w, s}} P(z | parents(Z))
                = α Σ_s P(c) P(r | c) P(s | c) P(w | s, r)
                = α [(.5 × .8 × .1 × .99) + (.5 × .8 × .9 × .9)]
                = α (.3636)

Similarly, P(¬c | r, w) = α (.0945), so:

    P(C | r, w) = α ⟨.3636, .0945⟩
                = ⟨.3636 / (.3636 + .0945), .0945 / (.3636 + .0945)⟩
                ≈ ⟨.794, .206⟩
• Draw the expression tree for

    α Σ_s P(c) P(r | c) P(s | c) P(w | s, r)
• Worst-case complexity is exponential in n (number of
nodes)
• Problem is having to enumerate all possibilities for many
variables.
Issues in Bayesian Networks
• Building / learning network topology
• Assigning / learning conditional probability tables
• Approximate inference via sampling
Real-World Example 1:
The Lumière Project at Microsoft Research
• Bayesian network approach to answering user queries
about Microsoft Office.
• “At the time we initiated our project in Bayesian
information retrieval, managers in the Office division were
finding that users were having difficulty finding assistance
efficiently.”
• “As an example, users working with the Excel spreadsheet
might have required assistance with formatting “a graph”.
Unfortunately, Excel has no knowledge about the common
term, “graph,” and only considered in its keyword
indexing the term “chart”.
• Networks were developed by experts from user modeling
studies.
• Offspring of project was Office Assistant in Office 97.
Real-World Example 2:
Diagnosing liver disorders with Bayesian
networks
• Variables: “disorder class” (16 possibilities) plus 93
features from an existing database of patient records.
• Data: 600 patient records, described by those features.
• Network structure: designed by “domain experts” (30
hours)
A. Onisko et al., A Bayesian network model for diagnosis of liver disorders
• Prior and conditional probability distributions were learned
from data in liver-disorders database.
• Problem: Data doesn’t give enough samples for good
conditional probability estimates.
• For combinations of parent values that are not adequately
sampled, assume uniform distribution over those values.
Results
[Results table not reproduced.] “Number of observations” = number of
evidence variables in the query. “Window = n” means that a classification
is counted as correct if it is among the n most probable diagnoses given
by the network for the given evidence values.
Approximate inference in Bayesian networks

Instead of enumerating all possibilities, sample to estimate
probabilities.

[Diagram: a network over nodes X1, X2, X3, ..., Xn]
Direct Sampling
• Suppose we have no evidence, but we want to determine
P(c, s, r, w) for all c, s, r, w.
• Direct sampling:
  – Sample each variable in topological order, conditioned
  on the values of its parents.
  – I.e., always sample from P(Xi | parents(Xi))
Example
1. Sample from P(Cloudy). Suppose returns true.
2. Sample from P(Sprinkler | Cloudy = true). Suppose
returns false.
3. Sample from P(Rain | Cloudy = true). Suppose returns
true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true).
Suppose returns true.
Here is the sampled event: [true, false, true, true]
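A sketch of direct sampling in code. The transcript only shows some CPT entries; the rest (e.g., P(WetGrass | ¬Sprinkler, ¬Rain)) are the textbook's standard values and should be treated as assumptions here.

```python
# Direct (prior) sampling for the sprinkler network.
import random
from collections import Counter

P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}  # P(WetGrass=true | S, R)

def prior_sample():
    """Sample each variable in topological order from P(Xi | parents(Xi))."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    w = random.random() < P_WET[(s, r)]
    return (c, s, r, w)

N = 100_000
counts = Counter(prior_sample() for _ in range(N))
event = (True, False, True, True)   # the event sampled in the example above
print(counts[event] / N)            # estimate; exact value is
                                    # .5 * .9 * .8 * .9 = .324
```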
• Suppose there are N total samples, and let NS(x1, ..., xn) be
the observed frequency of the specific event x1, ..., xn:

    lim_{N→∞} NS(x1, ..., xn) / N = P(x1, ..., xn)

  so for large N,

    NS(x1, ..., xn) / N ≈ P(x1, ..., xn)
• Suppose N samples, n nodes. Complexity O(Nn).
• Problem 1: Need lots of samples to get good probability
estimates.
• Problem 2: Many samples are not realistic; low likelihood.
Likelihood weighting
• Now suppose we have evidence e. Thus values for the
evidence variables E are fixed.
• We want to estimate P(X | e).
• Need to sample X and Y, where Y is the set of
non-evidence variables.
• Each sampled event is weighted by the likelihood that it
accords with the evidence.
• I.e., events in which the actual evidence appears unlikely
should be given less weight.
Example:
Estimate P(Rain | Sprinkler = true, WetGrass = true).

WeightedSample algorithm:
1. Set weight w = 1.0
2. Sample from Cloudy. Suppose it returns true.
3. Sprinkler is an evidence variable with value true. Update the
likelihood weight:

     w ← w × P(Sprinkler = true | Cloudy = true) = 0.1

   Low likelihood for sprinkler if cloudy is true, so this
   sample gets lower weight.
4. Sample from P(Rain | Cloudy = true). Suppose this
returns true.
5. WetGrass is an evidence variable with value true. Update the
likelihood weight:

     w ← w × P(WetGrass = true | Sprinkler = true, Rain = true)
       = 0.1 × 0.99 = 0.099

6. Return event [true, true, true, true] with weight 0.099.
The weight is low because Cloudy = true, so Sprinkler is
unlikely to be true.
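A sketch of likelihood weighting in code (not from the slides; CPT entries not shown in the transcript are assumed textbook values).

```python
# Likelihood weighting for P(Rain | Sprinkler = true, WetGrass = true).
import random

def weighted_sample():
    """One sample of the non-evidence variables plus its likelihood weight."""
    w = 1.0
    c = random.random() < 0.5
    w *= 0.1 if c else 0.5           # evidence: Sprinkler = true
    r = random.random() < (0.8 if c else 0.2)
    w *= 0.99 if r else 0.90         # evidence: WetGrass = true, given s = true
    return r, w

N = 100_000
num = den = 0.0
for _ in range(N):
    r, w = weighted_sample()
    den += w
    if r:
        num += w
print("P(Rain | s, w) ~", num / den)   # roughly 0.32 with these CPTs
```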
Problem with likelihood weighting
• As the number of evidence variables increases, performance
degrades. This is because most samples will have very low
weights, so the weighted estimate will be dominated by the
small fraction of samples that accord more than an infinitesimal
likelihood to the evidence.
Markov Chain Monte Carlo Sampling
• One of most common methods used in real applications.
• Uses idea of “Markov blanket” of a variable Xi:
– parents, children, children’s parents
• Recall that, by construction of a Bayesian network, a node
is conditionally independent of its non-descendants, given
its parents.
• Proposition: A node Xi is conditionally independent of all
other nodes in the network, given its Markov blanket.
– Example.
– Need to show that Xi is conditionally independent of
nodes outside its Markov blanket.
– Need to show that Xi can be conditionally dependent on
children’s parents.
Example:

[Diagram: network with A → B, C → B, B → E, E → F]

The proposition says: B is conditionally
independent of F given A, C, E.
This can only be true if

    P(B | A, C, E, F) = P(B | A, C, E)
Prove: P(B | A, C, E, F) = P(B | A, C, E)

We know, by definition of conditional probability:

    P(B | A, C, E, F) = P(B, A, C, E, F) / P(A, C, E, F)

From the network we have:

    (1) P(B, A, C, E, F) = P(A) P(C) P(B | A, C) P(E | B) P(F | E)

    (2) P(A, C, E, F) = Σ_b P(A) P(C) P(b | A, C) P(E | b) P(F | E)
                      = P(A) P(C) P(F | E) Σ_b P(b | A, C) P(E | b)

Thus:

    P(B | A, C, E, F) = P(A) P(C) P(B | A, C) P(E | B) P(F | E)
                        / [P(A) P(C) P(F | E) Σ_b P(b | A, C) P(E | b)]
                      = P(B | A, C) P(E | B) / Σ_b P(b | A, C) P(E | b)

Now compute P(B | A, C, E):

    P(B | A, C, E) = P(B, A, C, E) / P(A, C, E)

    (1) P(B, A, C, E) = P(A) P(C) P(B | A, C) P(E | B)
    (2) P(A, C, E) = Σ_b P(A) P(C) P(b | A, C) P(E | b)

Thus:

    P(B | A, C, E) = P(B | A, C) P(E | B) / Σ_b P(b | A, C) P(E | b)
                   = P(B | A, C, E, F).

Q.E.D.
Markov Chain Monte Carlo Sampling
• Start with random sample from variables: (x1, ..., xn). This
is the current “state” of the algorithm.
• Next state: Randomly sample value for one non-evidence
variable Xi , conditioned on current values in “Markov
Blanket” of Xi.
Example
• Query: What is P(Rain | Sprinkler = true, WetGrass = true)?
• MCMC:
  – Random sample, with evidence variables fixed:
    [true, true, false, true]
  – Repeat:
    1. Sample Cloudy, given the current values of its Markov blanket:
       Sprinkler = true, Rain = false. Suppose the result is false. New
       state: [false, true, false, true]
    2. Sample Rain, given the current values of its Markov blanket:
       Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the
       result is true. New state: [false, true, true, true]
• Each sample contributes to the estimate for the query
P(Rain | Sprinkler = true, WetGrass = true).
• Suppose we perform 100 such samples, 20 with Rain =
true and 80 with Rain = false.
• Then the answer to the query is
Normalize(⟨20, 80⟩) = ⟨.20, .80⟩
• Claim: “The sampling process settles into a dynamic
equilibrium in which the long-run fraction of time spent in
each state is exactly proportional to its posterior
probability, given the evidence.”
• Proof of claim is on pp. 517-518.
Claim (again)
• Claim: MCMC settles into behavior in which each state is
sampled exactly according to its posterior probability,
given the evidence.
• That is: for all variables Xi, the probability of the value xi
of Xi appearing in a sample is equal to P(xi | e).
Proof of Claim (outline)

First, give an example of a Markov chain.

Now:
Let x be a state, with x = (x1, ..., xn).
Let q(x → x') be the transition probability from state x to
state x'.
Let π_t(x) be the probability that the system will be in state
x after t time steps, starting from state x0.
Let π_{t+1}(x') be the probability that the system will be in
state x' after t+1 time steps, starting from state x0.

We have:

    π_{t+1}(x') = Σ_x π_t(x) q(x → x')

Definition:
π is called the Markov process’s stationary distribution if
π_t = π_{t+1} for all x.

Defining equation for the stationary distribution:

    π(x') = Σ_x π(x) q(x → x')   for all x'        (1)

Result from Markov chain theory: given q, there is exactly
one such stationary distribution π (assuming q is
“ergodic”).

One way to satisfy equation (1) is:

    π(x) q(x → x') = π(x') q(x' → x)   for all x, x'

called the property of detailed balance.

Detailed balance implies stationarity:

    Σ_x π(x) q(x → x') = Σ_x π(x') q(x' → x)
                       = π(x') Σ_x q(x' → x)
                       = π(x')
Proof of claim:

Show that the transition probability q(x → x') defined by
MCMC sampling satisfies the detailed balance equation with
stationary distribution π(x) = P(x | e).

Let Xi be the variable to be sampled. Let e be the values of
the evidence variables and let Y be the other non-evidence
variables.

Current sample: x = (xi, y), with fixed evidence
variable values e.

We have, by definition of the MCMC algorithm:

    q(x → x') = q((xi, y) → (xi', y)) = P(xi' | y, e)
    q(x' → x) = q((xi', y) → (xi, y)) = P(xi | y, e)

Now show that this transition probability produces detailed
balance with π(x) = P(x | e):

    π(x) q(x → x') = P(x | e) P(xi' | y, e)
                   = P(xi, y | e) P(xi' | y, e)
                   = P(xi | y, e) P(y | e) P(xi' | y, e)   (chain rule on the first term)
                   = P(xi | y, e) P(xi', y | e)            (backwards chain rule on terms 2 and 3)
                   = q(x' → x) π(x')

Q.E.D.
Speech Recognition
(Section 15.6)
• Task: Identify sequence of words uttered by speaker, given
acoustic signal.
• Uncertainty introduced by noise, speaker error, variation in
pronunciation, homonyms, etc.
• Thus speech recognition is viewed as problem of
probabilistic inference.
Speech Recognition
• So far, we’ve looked at probabilistic reasoning in static
environments.
• Speech: a time sequence of “static environments”.
  – Let X be the “state variables” (i.e., the set of non-evidence
  variables) describing the environment (e.g., words said
  during time step t).
  – Let E be the set of evidence variables (e.g., S =
  features of the acoustic signal).
  – The E values and the X joint probability distribution
  change over time:
    t1: X1, e1
    t2: X2, e2
    etc.
• At each t, we want to compute P(Words | S).
• We know from Bayes rule:

    P(Words | S) = α P(S | Words) P(Words)
• P(S | Words), for all words, is a previously learned
“acoustic model”.
– E.g. For each word, probability distribution over phones, and for
each phone, probability distribution over acoustic signals (which
can vary in pitch, speed, volume).
• P(Words), for all words, is the “language model”, which
specifies prior probability of each utterance.
– E.g. “bigram model”: probability of each word following each
other word.
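A small sketch (not from the slides) of what a bigram language model looks like; the toy corpus and the add-one smoothing choice are invented for illustration.

```python
# Bigram language model: P(Words) approximated as a product of
# P(word_i | word_{i-1}) estimated from counts.
from collections import Counter

corpus = "can i have something to drink can i have water".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(prev, word):
    """P(word | prev), with add-one smoothing over the vocabulary."""
    V = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

def p_words(words):
    """Prior probability of an utterance under the bigram model."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_words("can i have water".split()))
```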
• Speech recognition typically makes three assumptions:
  1. The process underlying the change is itself “stationary”,
     i.e., the state transition probabilities don’t change.
  2. The current state X depends on only a finite history of
     previous states (the “Markov assumption”).
     – A Markov process of order n: the current state depends
       only on the n previous states.
  3. The values et of the evidence variables depend only on the
     current state Xt (the “sensor model”).
Example: “I’m firsty, um, can I have something to dwink?”