Bayes Classification
Uncertainty
• Our main tool is probability theory, which assigns to each sentence a numerical degree of belief between 0 and 1
• It provides a way of summarizing uncertainty
Variables
• Boolean random variables: cavity might be true or false
• Discrete random variables: weather might be sunny, rainy,
cloudy, snow
• P(Weather=sunny)
• P(Weather=rainy)
• P(Weather=cloudy)
• P(Weather=snow)
• Continuous random variables: the temperature has
continuous values
Where do probabilities come from?
• Frequentist:
• From experiments: from any finite sample, we can estimate the true fraction and also calculate how accurate our estimate is likely to be
• Subjective:
• Agent’s belief
• Objectivist:
• Probabilities are real aspects of the universe; e.g., that a coin comes up heads with probability 0.5 is a property of the coin itself
• Before the evidence is obtained; prior probability
• P(a) is the prior probability that the proposition a is true
• P(cavity)=0.1
• After the evidence is obtained; posterior probability
• P(a|b)
• The probability of a given that all we know is b
• P(cavity|toothache)=0.8
Axioms of Probability
(Kolmogorov’s axioms, first published in German 1933)
• All probabilities are between 0 and 1. For any proposition a:
0 ≤ P(a) ≤ 1
• P(true)=1, P(false)=0
• The probability of a disjunction is given by
P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
• Product rule
P(a ∧ b) = P(a | b) P(b)
P(a ∧ b) = P(b | a) P(a)
Theorem of total probability
If events A1, ..., An are mutually exclusive with Σi P(Ai) = 1, then
P(B) = Σi P(B | Ai) P(Ai)
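A tiny numeric check of this identity in Python; the three-event partition and the numbers below are made up purely for illustration:

```python
# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
# Hypothetical partition A1, A2, A3 with made-up numbers.
p_a = [0.2, 0.5, 0.3]            # P(A_i); mutually exclusive events summing to 1
p_b_given_a = [0.9, 0.4, 0.1]    # P(B | A_i)

p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))
print(f"{p_b:.2f}")              # 0.18 + 0.20 + 0.03 = 0.41
```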
Bayes’s rule
• (Reverend Thomas Bayes, 1702-1761)
• He set down his findings on probability in
"Essay Towards Solving a Problem in the
Doctrine of Chances" (1763), published
posthumously in the Philosophical
Transactions of the Royal Society of
London
P(b | a) = P(a | b) P(b) / P(a)
Diagnosis
• What is the probability of meningitis in a patient with a stiff neck?
• A doctor knows that the disease meningitis causes the patient to have a stiff neck 50% of the time -> P(s|m)
• Prior probabilities:
• That the patient has meningitis is 1/50,000 -> P(m)
• That the patient has a stiff neck is 1/20 -> P(s)
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 0.00002) / 0.05 = 0.0002
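A minimal Python sketch of this calculation (variable names are illustrative, not from the slides):

```python
# Bayes' rule for the meningitis example: P(m | s) = P(s | m) P(m) / P(s)
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(f"P(m | s) = {p_m_given_s:.4f}")   # 0.0002
```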
Bayes Theorem
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h
Choosing Hypotheses
• Generally want the most probable hypothesis given
the training data
• Maximum a posteriori hypothesis hMAP:
hMAP = argmax_{h∈H} P(h | D)
     = argmax_{h∈H} P(D | h) P(h) / P(D)
     = argmax_{h∈H} P(D | h) P(h)      (P(D) is constant across hypotheses)
• If we assume P(hi) = P(hj) for all hi and hj, we can simplify further and choose the
• Maximum likelihood (ML) hypothesis:
hML = argmax_{h∈H} P(D | h)
Example
• Does patient have cancer or not?
A patient takes a lab test and the result comes back
positive. The test returns a correct positive result
(+) in only 98% of the cases in which the disease is
actually present, and a correct negative result (-) in
only 97% of the cases in which the disease is not
present
Furthermore, 0.008 of the entire population has this cancer
Suppose a positive result (+) is
returned...
P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078
P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298
hMAP = ¬cancer
Normalization
P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.20745
P(¬cancer | +) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79255
• The result of Bayesian inference depends strongly on the prior probabilities, which must be available in order to apply the method
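A short Python sketch of the arithmetic on this slide (variable names are illustrative; the slide rounds the intermediate products to 0.0078 and 0.0298 before normalizing, which the round() calls reproduce):

```python
p_cancer = 0.008                  # prior P(cancer)
p_pos_given_cancer = 0.98         # P(+ | cancer), test sensitivity
p_pos_given_no_cancer = 0.03      # P(+ | ¬cancer), false positive rate

# Unnormalized posteriors P(+ | h) P(h), rounded as on the slide
cancer_score = round(p_pos_given_cancer * p_cancer, 4)              # 0.0078
no_cancer_score = round(p_pos_given_no_cancer * (1 - p_cancer), 4)  # 0.0298

total = cancer_score + no_cancer_score
print(f"P(cancer | +)    = {cancer_score / total:.5f}")      # ≈ 0.20745
print(f"P(no cancer | +) = {no_cancer_score / total:.5f}")   # ≈ 0.79255
```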
Brute-Force
Bayes Concept Learning
• For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h) P(h) / P(D)
• Output the hypothesis hMAP with the highest
posterior probability
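A minimal sketch of this brute-force procedure, assuming the caller supplies the hypothesis space along with prior and likelihood functions (all names here are hypothetical placeholders):

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis: argmax_h P(D | h) P(h).

    `prior(h)` gives P(h) and `likelihood(data, h)` gives P(D | h);
    P(D) is the same for every h, so it can be ignored in the argmax.
    """
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```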
Bayes optimal Classifier
A weighted majority classifier
• What is the most probable classification of the new instance given the training data?
• The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
• If the classification of the new example can take any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is just:
P(vj | D) = Σ_{hi∈H} P(vj | hi) P(hi | D)
Bayes optimal classification:
argmax_{vj∈V} Σ_{hi∈H} P(vj | hi) P(hi | D)
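A sketch of this weighted vote in Python, assuming functions for the posterior P(h|D) and the per-hypothesis prediction P(v|h) are available (names are hypothetical):

```python
def bayes_optimal_classify(values, hypotheses, posterior, predict):
    """Bayes optimal classification: argmax_v sum_h P(v | h) P(h | D).

    `posterior(h)` returns P(h | D) and `predict(v, h)` returns P(v | h).
    """
    return max(values,
               key=lambda v: sum(predict(v, h) * posterior(h) for h in hypotheses))
```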
Gibbs Algorithm
• The Bayes optimal classifier provides the best result, but can be expensive when there are many hypotheses
• Gibbs algorithm:
• Choose one hypothesis at random, according to P(h|D)
• Use this to classify the new instance
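A sketch of the Gibbs algorithm under the same hypothetical interface as above: sample one hypothesis according to P(h|D), then classify with it.

```python
import random

def gibbs_classify(hypotheses, posterior, classify, instance):
    """Gibbs algorithm: draw h ~ P(h | D), then use h to classify the instance.

    `posterior(h)` returns P(h | D); `classify(h, instance)` applies hypothesis h.
    """
    weights = [posterior(h) for h in hypotheses]
    h = random.choices(hypotheses, weights=weights, k=1)[0]
    return classify(h, instance)
```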
• Suppose a correct, uniform prior distribution over H; then
• Pick any hypothesis at random according to the posterior...
• Its expected error is no worse than twice that of the Bayes optimal classifier
A single cause directly influences a number of effects, all of which are conditionally independent:
P(cause, effect1, effect2, ..., effectn) = P(cause) Π_{i=1..n} P(effecti | cause)
Naive Bayes Classifier
• Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods
• When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally
independent given classification
• Successful applications:
• Diagnosis
• Classifying text documents
Naive Bayes Classifier
• Assume target function f: X → V, where each instance x is described by attributes a1, a2, ..., an
• Most probable value of f(x) is:
vMAP = argmax_{vj∈V} P(vj | a1, a2, ..., an) = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj)
vNB
• Naive Bayes assumption:
P(a1, a2, ..., an | vj) = Π_i P(ai | vj)
• which gives
vNB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
Naive Bayes Algorithm
• For each target value vj:
• estimate P(vj)
• For each attribute value ai of each attribute a:
• estimate P(ai | vj)
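A compact Python sketch of this training and classification procedure for discrete attributes, using plain relative-frequency estimates as on the slide (no smoothing; all names are illustrative):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Estimate P(v) and P(ai | v) from (attributes, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    attr_counts = defaultdict(Counter)          # per label: counts of (position, value)
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            attr_counts[label][(i, a)] += 1

    priors = {v: c / len(examples) for v, c in label_counts.items()}
    def likelihood(i, a, v):                    # P(ai = a | v)
        return attr_counts[v][(i, a)] / label_counts[v]
    return priors, likelihood

def classify_naive_bayes(priors, likelihood, attrs):
    """vNB = argmax_v P(v) * prod_i P(ai | v)."""
    def score(v):
        p = priors[v]
        for i, a in enumerate(attrs):
            p *= likelihood(i, a, v)
        return p
    return max(priors, key=score)
```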
Example: Play Tennis
Example: Learning Phase
Outlook:
              Play=Yes   Play=No
  Sunny         2/9        3/5
  Overcast      4/9        0/5
  Rain          3/9        2/5

Temperature:
              Play=Yes   Play=No
  Hot           2/9        2/5
  Mild          4/9        2/5
  Cool          3/9        1/5

Humidity:
              Play=Yes   Play=No
  High          3/9        4/5
  Normal        6/9        1/5

Wind:
              Play=Yes   Play=No
  Strong        3/9        3/5
  Weak          6/9        2/5

P(Play=Yes) = 9/14
P(Play=No) = 5/14
Example: Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9
P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14
P(Play=No) = 5/14
– MAP rule
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
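The same test-phase arithmetic as a quick Python check, with the conditional probabilities copied from the tables above:

```python
# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

print(f"score(Yes) = {score_yes:.4f}, score(No) = {score_no:.4f}")
print("prediction:", "Yes" if score_yes > score_no else "No")   # -> No
```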
Naïve Bayesian Classifier: Comments
• Advantages :
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption of class-conditional independence, which causes a loss of accuracy
• In practice, dependencies exist among variables
• E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
• How to deal with these dependencies?
• Bayesian Belief Networks