Bayesian Learning
Computer Science Department
CS 9633 Machine Learning
Bayesian Learning
• Probabilistic approach to inference
• Assumption
– Quantities of interest are governed by probability distributions
– Optimal decisions can be made by reasoning
about probabilities and observations
• Provides quantitative approach to weighing
how evidence supports alternative
hypotheses
Why is Bayesian Learning
Important?
• Some Bayesian approaches (such as naive Bayes) are very practical learning methods and are competitive with other approaches
• Provides a useful perspective for
understanding many learning algorithms
that do not explicitly manipulate
probabilities
Important Features
• Model is incrementally updated with training
examples
• Prior knowledge can be combined with observed data
to determine the final probability of the hypothesis
– Asserting prior probability of candidate hypotheses
– Asserting a probability distribution over observations for
each hypothesis
• Can accommodate methods that make probabilistic
predictions
• New instances can be classified by combining
predictions of multiple hypotheses
• Can provide a gold standard for evaluating
hypotheses
Practical Problems
• Typically require initial knowledge of many probabilities, which can be estimated from:
– Background knowledge
– Previously available data
– Assumptions about the form of the distributions
• Significant computational cost of determining the Bayes optimal hypothesis
– Linear in the number of hypotheses in the general case
– Significantly lower in certain special cases
Bayes Theorem
• Goal: learn the “best” hypothesis
• Assumption in Bayes learning: the “best”
hypothesis is the most probable hypothesis
• Bayes theorem allows computation of most
probable hypothesis based on
– Prior probability of hypothesis
– Probability of observing certain data given the
hypothesis
– Observed data itself
Notation
P(h)     Prior probability of hypothesis h
P(D)     Prior probability of the data D
P(D|h)   Probability of observing D given h (the likelihood of the data given h)
P(h|D)   Probability that h holds given the data (the posterior probability of h)
Bayes Theorem
• Based on definitions
of P(D|h) and P(h|D)
P(h|D) = P(D|h) P(h) / P(D)
Maximum A Posteriori
Hypothesis
• Many learning algorithms try to identify
the most probable hypothesis h ∈ H
given observations D
• This is the maximum a posteriori
hypothesis (MAP hypothesis)
Identifying the MAP Hypothesis
using Bayes Theorem
hMAP ≡ argmax_{h ∈ H} P(h|D)
     = argmax_{h ∈ H} P(D|h) P(h) / P(D)
     = argmax_{h ∈ H} P(D|h) P(h)
Equally Probable
Hypotheses
When all hypotheses are equally probable a priori (P(hi) = P(hj) for all i and j):
hMAP = argmax_{h ∈ H} P(D|h) P(h) = argmax_{h ∈ H} P(D|h)
Any hypothesis that maximizes P(D|h) is a maximum likelihood (ML) hypothesis:
hML ≡ argmax_{h ∈ H} P(D|h)
Bayes Theorem and
Concept Learning
• Concept Learning Task
H: hypothesis space
X: instance space
Target concept c: X → {0,1}
Brute-Force MAP Learning
Algorithm
• For each hypothesis h in H, calculate the
posterior probability
P(h|D) = P(D|h) P(h) / P(D)
• Output the hypothesis with the highest
posterior probability
hMAP = argmax_{h ∈ H} P(h|D)
To Apply Brute Force MAP
Learning
• Specify P(h)
• Specify P(D|h)
An Example
• Assume
– Training data D is noise free (di = c(xi))
– The target concept is contained in H
– We have no a priori reason to believe one
hypothesis is more likely than any other
P(h) = 1 / |H|   for all h ∈ H
Probability of Data Given
Hypothesis
P(D|h) = 1  if di = h(xi) for all di in D
         0  otherwise
Apply the algorithm
• Step 1 (two cases):
P(h|D) = P(D|h) P(h) / P(D)
– Case 1 (h is inconsistent with D):
P(h|D) = (0 · P(h)) / P(D) = 0
– Case 2 (h is consistent with D):
P(h|D) = (1 · 1/|H|) / P(D) = (1 · 1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D is the version space of H with respect to D (the subset of hypotheses in H consistent with D).
Step 2
• Every consistent hypothesis has posterior probability 1 / |VS_H,D|
• Every inconsistent hypothesis has posterior probability 0
MAP Hypotheses and Consistent Learners
• Under the assumptions above (noise-free data, uniform prior), every learner that outputs a hypothesis consistent with the training data outputs a MAP hypothesis, for example:
• FIND-S (finds the maximally specific consistent hypothesis)
• Candidate-Elimination (finds all consistent hypotheses)
Maximum Likelihood and
Least-Squared Error Learning
• New problem: learning a continuous-valued target function
• We will show that, under certain assumptions, any learning algorithm that minimizes the squared error between its hypothesis predictions and the training data will output a maximum likelihood hypothesis.
Problem Setting
• Learner L
• Instance space X
• Hypothesis space H, where each h: X → R
• The task of L is to learn an unknown target function f: X → R
• We have m training examples
• The target value of each example is corrupted by random noise drawn from a Normal distribution
Work Through Derivation
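A compact sketch of the derivation, under the stated assumptions (each observed target value is di = f(xi) + ei, with ei drawn independently from a zero-mean Normal distribution with variance σ²):

hML = argmax_{h ∈ H} p(D|h)
    = argmax_{h ∈ H} ∏_{i=1..m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))
    = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))² / (2σ²)      (take ln, drop constants)
    = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

So, under these assumptions, the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the training data.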
Why Normal Distribution for
Noise?
• It's easy to work with
• Good approximation of many physical
processes
• Important point: we are only dealing
with noise in the target function—not the
attribute values.
Bayes Optimal Classifier
• Two Questions:
– What is the most probable hypothesis
given the training data?
» Find MAP hypothesis
– What is the most probable classification
given the training data?
Example
• Three hypotheses:
P(h1|D) = 0.35
P(h2|D) = 0.45
P(h3|D) = 0.20
• New instance x
h1 predicts negative
h2 predicts positive
h3 predicts negative
• What is the predicted class using hMAP?
• What is the predicted class using all hypotheses?
Bayes Optimal Classification
• The most probable classification of a new instance is
obtained by combining the predictions of all
hypotheses, weighted by their posterior probabilities.
• Suppose set of values for classification is from set V
(each possible value is vj)
• Probability that vj is the correct classification for new
instance is:
P(vj | D) = Σ_{hi ∈ H} P(vj | hi) P(hi | D)
• Pick the vj with the max probability as the predicted
class
Bayes Optimal Classifier
argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)
Apply this to the previous example:
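Worked out with the numbers from the previous example (writing the two class values as + and −):

P(+|D) = P(+|h1)·P(h1|D) + P(+|h2)·P(h2|D) + P(+|h3)·P(h3|D) = 0·0.35 + 1·0.45 + 0·0.20 = 0.45
P(−|D) = 1·0.35 + 0·0.45 + 1·0.20 = 0.55

So the Bayes optimal classification is negative, even though hMAP = h2 predicts positive.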
Bayes Optimal Classification
• Gives the optimal error-minimizing solution to
prediction and classification problems.
• Requires probability of exact combination of
evidence
• All classification methods can be viewed as
approximations of Bayes rule with varying
assumptions about conditional probabilities
– Assume they come from some distribution
– Assume conditional independence
– Assume underlying model of specific format
(linear combination of evidence, decision tree)
Simplifications of Bayes
Rule
• Given observations of attribute values a1, a2, …, an, compute the most probable target value vMAP
vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
• Use Bayes theorem to rewrite this as:
vMAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayes
• The most usual simplification of Bayes Rule is to assume
conditional independence of the observations
– Because it is approximately true
– Because it is computationally convenient
• Assume the probability of observing the conjunction a1,
a2, …an is the product of the probabilities of the individual
attributes
vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
• Learning consists of estimating probabilities
Simple Example
• Two classes C1 and C2.
• Two features
– a1: Male, Female
– a2: Blue eyes, Brown eyes
• Instance (Male with blue eyes) What is the class?
Probability       C1     C2
P(Ci)             0.4    0.6
P(Male|Ci)        0.1    0.2
P(BlueEyes|Ci)    0.3    0.2
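A quick check of this example against the naive Bayes rule above; the short Python sketch below just multiplies the numbers from the table (the variable names are illustrative):

```python
# Naive Bayes score for each class: P(Ci) * P(Male|Ci) * P(BlueEyes|Ci)
priors = {"C1": 0.4, "C2": 0.6}
p_male = {"C1": 0.1, "C2": 0.2}
p_blue = {"C1": 0.3, "C2": 0.2}

scores = {c: priors[c] * p_male[c] * p_blue[c] for c in priors}
print(scores)                       # {'C1': 0.012, 'C2': 0.024}
print(max(scores, key=scores.get))  # C2 is the naive Bayes prediction
```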
Estimating Probabilities
(Classifying Executables)
• Two Classes (Malicious, Benign)
• Features
– a1: GUI present (yes/no)
– a2: Deletes files (yes/no)
– a3: Allocates memory (yes/no)
– a4: Length (<1K, 1-10K, >10K)
Instance   a1    a2    a3    a4    Class
1          Yes   No    No    Yes   B
2          Yes   No    No    No    B
3          No    Yes   Yes   No    M
4          No    No    Yes   Yes   M
5          Yes   No    No    Yes   B
6          Yes   No    No    No    M
7          Yes   Yes   Yes   No    M
8          Yes   Yes   No    Yes   M
9          No    No    No    Yes   B
10         No    No    Yes   No    M
Classify the Following
Instance
• <Yes, No, Yes, Yes>
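A sketch of the full calculation in Python, estimating every probability by its relative frequency nc/n in the ten training instances above (the data structures and names are illustrative):

```python
# Training data from the table: (a1, a2, a3, a4) -> class
data = [
    (("Yes", "No",  "No",  "Yes"), "B"),
    (("Yes", "No",  "No",  "No"),  "B"),
    (("No",  "Yes", "Yes", "No"),  "M"),
    (("No",  "No",  "Yes", "Yes"), "M"),
    (("Yes", "No",  "No",  "Yes"), "B"),
    (("Yes", "No",  "No",  "No"),  "M"),
    (("Yes", "Yes", "Yes", "No"),  "M"),
    (("Yes", "Yes", "No",  "Yes"), "M"),
    (("No",  "No",  "No",  "Yes"), "B"),
    (("No",  "No",  "Yes", "No"),  "M"),
]
query = ("Yes", "No", "Yes", "Yes")

def score(cls):
    """P(cls) times the product of P(ai = query[i] | cls), each estimated as nc/n."""
    rows = [x for x, c in data if c == cls]
    result = len(rows) / len(data)            # prior P(cls)
    for i, value in enumerate(query):
        result *= sum(1 for x in rows if x[i] == value) / len(rows)
    return result

print({c: score(c) for c in ("B", "M")})
# {'B': 0.0, 'M': 0.0333...}  -> predict M.  B collapses to zero because
# P(a3=Yes|B) = 0/4, which is exactly the problem addressed on the next slides.
```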
Estimating Probabilities
• To estimate P(C|D)
• Let n be the number of training examples
labeled D
• Let nc be the number labeled D that are also
labeled C
• P(C|D) was estimated as nc/n
• Problems
– This is a biased underestimate of the probability
– When the term is 0, it dominates all others
Use m-estimate of
probability
(nc + m·p) / (n + m)
• p is a prior estimate of the probability we are trying to determine (we often assume attribute values are equally probable)
• m is a constant called the equivalent sample size; it can be viewed as augmenting the n actual observations with m virtual samples distributed according to p
Repeat Estimates
• Use equal priors for attribute values
• Use m value of 1
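Repeating the calculation with the m-estimate, reusing data and query from the sketch above (m = 1, and p = 1/k for an attribute with k possible values):

```python
n_values = (2, 2, 2, 3)   # yes/no for a1-a3, three length bins for a4

def m_estimate_score(cls, m=1):
    rows = [x for x, c in data if c == cls]
    result = len(rows) / len(data)            # prior P(cls)
    for i, value in enumerate(query):
        nc = sum(1 for x in rows if x[i] == value)
        p = 1 / n_values[i]                   # equal prior over the attribute's values
        result *= (nc + m * p) / (len(rows) + m)
    return result

print({c: m_estimate_score(c) for c in ("B", "M")})
# B is no longer exactly zero, though M still has the higher score here.
```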
Bayesian Belief Networks
• Naïve Bayes is based on the assumption of conditional independence
• Bayesian belief networks provide a tractable method for specifying dependencies among variables
Terminology
• A Bayesian Belief Network describes the probability distribution
over a set of random variables Y1, Y2, …Yn
• Each variable Yi can take on the set of values V(Yi)
• The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × … × V(Yn)
• Each item in the joint space corresponds to one possible
assignment of values to the tuple of variables <Y1, …Yn>
• Joint probability distribution: specifies the probabilities of the
items in the joint space
• A Bayesian Network provides a way to describe the joint
probability distribution in a compact manner.
Conditional Independence
• Let X, Y, and Z be three discrete-valued
random variables.
• We say that X is conditionally independent of
Y given Z if the probability distribution
governing X is independent of the value of Y
given a value for Z
(∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
which is abbreviated as P(X | Y, Z) = P(X | Z)
Bayesian Belief Network
– A set of random variables makes up the nodes of
the network
– A set of directed links or arrows connects pairs of
nodes. The intuitive meaning of an arrow from X to
Y is that X has a direct influence on Y.
– Each node has a conditional probability table that
quantifies the effects that the parents have on the
node. The parents of a node are all those nodes
that have arrows pointing to it.
– The graph has no directed cycles (it is a DAG)
Example (from Judea
Pearl)
You have a new burglar alarm installed at home. It is
fairly reliable at detecting a burglary, but also responds
on occasion to minor earthquakes. You also have two
neighbors, John and Mary, who have promised to call
you at work when they hear the alarm. John always
calls when he hears the alarm, but sometimes confuses
the telephone ringing with the alarm and calls then, too.
Mary, on the other hand, likes rather loud music and
sometimes misses the alarm altogether. Given the
evidence of who has or has not called, we would like to
estimate the probability of a burglary.
Step 1
• Determine what the propositional
(random) variables should be
• Determine causal (or another type of
influence) relationships and develop the
topology of the network
Topology of Belief
Network
[Network topology: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls.]
Step 2
• Specify a conditional probability table or CPT for
each node.
• Each row in the table contains the conditional
probability of each node value for a conditioning
case (possible combinations of values for parent
nodes).
• In the example, the possible values for each
node are true/false.
• The sum of the probabilities over the values of a node, given a particular conditioning case, is 1.
Example:
CPT for Alarm Node
Burglary   Earthquake   P(Alarm=True)   P(Alarm=False)
True       True         0.950           0.050
True       False        0.940           0.060
False      True         0.290           0.710
False      False        0.001           0.999
Complete Belief Network
Burglary → Alarm ← Earthquake;  Alarm → JohnCalls;  Alarm → MaryCalls

P(B) = 0.001      P(E) = 0.002

B      E      P(A|B,E)
T      T      0.95
T      F      0.94
F      T      0.29
F      F      0.001

A      P(J|A)
T      0.90
F      0.05

A      P(M|A)
T      0.70
F      0.01
Semantics of Belief
Networks
• View 1: A belief network is a
representation of the joint probability
distribution (“joint”) of a domain.
• The joint completely specifies an
agent’s probability assignments to all
propositions in the domain (both simple
and complex.)
Network as
representation of joint
• A generic entry in the joint probability
distribution is the probability of a conjunction
of particular assignments to each variable,
such as:
P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))
Each entry in the joint is represented by the
product of appropriate elements of the CPTs
in the belief network.
Example Calculation
Calculate the probability of the event that
the alarm has sounded but neither a
burglary nor an earthquake has occurred,
and both John and Mary call.
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.00063
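A small sketch of this factorization in code, using the CPT values from the complete network above (the dictionary layout and helper function are illustrative assumptions):

```python
# CPTs from the slides: each table stores P(node = True | parent values)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls=True | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls=True | Alarm)

def cond(p_true, value):
    """Turn P(X=True | parents) into P(X=value | parents)."""
    return p_true if value else 1 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of CPT entries."""
    return (P_B[b] * P_E[e] * cond(P_A[(b, e)], a)
            * cond(P_J[a], j) * cond(P_M[a], m))

print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.00063, as above
```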
Semantics
• View 2: Encoding of a collection of
conditional independence statements.
– JohnCalls is conditionally independent of
other variables in the network given the
value of Alarm
• This view is useful for understanding
inference procedures for the networks.
Inference Methods for
Bayesian Networks
• We may want to infer the value of some target
variable (Burglary) given observed values for other
variables.
• What we generally want is the probability distribution over the possible values of the target variable
• Inference straightforward if all other values in network
known
• More general case, if we know a subset of the values
of variables, we can infer a probability distribution
over other variables.
• Exact inference in Bayesian networks is NP-hard in general
• But approximate inference methods work well in practice
Learning Bayesian Belief
Networks
• Focus of a great deal of research
• Several situations of varying complexity
– Network structure may be given or not
– All variables may be observable or you may have
some variables that cannot be observed
• If the network structure is known and all variables can be observed, the CPTs can be estimated from the training data just as the conditional probabilities were for Naïve Bayes
Gradient Ascent Training of
Bayesian Networks
• Method developed by Russell
• Maximizes P(D|h) by following the
gradient of
ln P(D|h)
• Let wijk be a single CPT entry: the probability that variable Yi takes on the value yij given that its immediate parents Ui take on the values given by uik
Illustration
[Diagram: parent node Ui = uik → child node Yi = yij, with CPT entry wijk = P(Yi = yij | Ui = uik).]
Result
∂ ln P(D|h) / ∂wijk = Σ_{d ∈ D} P(Yi = yij, Ui = uik | d) / wijk
Example
[Figure: the Burglary → Alarm ← Earthquake network, with Alarm → JohnCalls and Alarm → MaryCalls.]
To compute the CPT entries P(A|B,E) by gradient ascent, we would need P(A, B, E | d) for each training example d.
EM Algorithm
• The EM algorithm is a general purpose
algorithm that is used in many settings
including
– Unsupervised learning
– Learning CPT’s for Bayesian networks
– Learning Hidden Markov models
• Two-step algorithm for learning hidden
variables
Two Step Process
• For a specific problem we have three quantities:
– X: the observed data for the instances
– Z: the unobserved data for the instances (this is usually what we are trying to learn)
– Y: the full data (X together with Z)
• General approach
– Determine initial hypothesis for values for Z
– Step 1: Estimation
» Compute a function Q(h’|h) using current hypothesis h and the
observed data X to estimate the probability distribution over Y.
– Step 2: Maximization
» Revise hypothesis h with h’ that maximizes the Q function
K-means algorithm
Assume the data comes from a mixture of two Gaussian distributions whose means (μ1, μ2) are unknown.
[Figure: the resulting density P(x) plotted against x.]
Generation of data
• Select one of the normal distributions at
random
• Generate a single random instance xi using
this distribution
p(x) = (1 / √(2πσ²)) · e^( −(1/2)·((x − μ)/σ)² )
E[X] = μ
Example
Select initial values for the hypothesis h = <μ1, μ2>
[Figure: the data points with the initial mean estimates μ1 and μ2.]
E-step: Compute the probability that each datum xi was generated by each component
h = <μ1, μ2>
[Figure: the data with the current mean estimates μ1 and μ2.]
M-step: Replace hypothesis h
with h’ that maximizes Q
h′ = <μ1′, μ2′>
[Figure: the data with the updated mean estimates μ1′ and μ2′.]
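A minimal sketch of this two-Gaussian EM loop in Python, assuming a known shared σ and equal mixing weights (the data generation and initialization are illustrative, not from the slides):

```python
import math
import random

def em_two_gaussians(xs, sigma=1.0, iters=50):
    mu = [min(xs), max(xs)]          # crude initial hypothesis h = <mu1, mu2>
    for _ in range(iters):
        # E-step: P(component j | x_i) for every point, with equal mixing weights
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: replace h with the h' that maximizes Q (responsibility-weighted means)
        for j in range(2):
            weight = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / weight
    return mu

random.seed(0)
xs = [random.gauss(-2, 1) for _ in range(100)] + [random.gauss(3, 1) for _ in range(100)]
print(em_two_gaussians(xs))          # should land near the true means, -2 and 3
```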