Lecture 3, Jan-26

Machine Learning
Mehdi Ghayoumi
MSB rm 132
[email protected]
Office hours: Thu, 11-12 AM
• Overfitting: the model is too “complex” and fits irrelevant characteristics (noise) in the data
  – Low bias and high variance
  – Low training error and high test error
• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Bias-Variance Trade-off
E(MSE) = noise² + bias² + variance
  – noise²: unavoidable error
  – bias²: error due to incorrect assumptions
  – variance: error due to the variance of the training samples
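A minimal numpy sketch of this decomposition (an added illustration, not from the slides; the true function, noise level, training-set size, and degree-3 polynomial model are arbitrary assumptions). Repeatedly refitting on fresh noisy samples lets each term be estimated directly:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # true function (assumed for the demo)
sigma = 0.2                                   # noise standard deviation -> noise^2 = 0.04
x_test = np.linspace(0, 1, 50)
degree, n_train, n_repeats = 3, 30, 500

preds = np.empty((n_repeats, x_test.size))
for i in range(n_repeats):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)  # fresh noisy training sample
    coeffs = np.polyfit(x, y, degree)         # fit a degree-3 polynomial
    preds[i] = np.polyval(coeffs, x_test)

bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # (mean prediction - truth)^2
variance = np.mean(preds.var(axis=0))                     # spread of predictions across samples
expected_mse = sigma ** 2 + bias2 + variance              # noise^2 + bias^2 + variance
print(f"noise^2={sigma**2:.3f}  bias^2={bias2:.3f}  variance={variance:.3f}  E(MSE)={expected_mse:.3f}")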
Probabilities
• We write P(A) as “the fraction of possible worlds in which A is true”
[Venn diagram: the event space of all possible worlds has total area 1; the worlds in which A is true form a red rectangle whose area is P(A), and the remaining worlds are those in which A is false.]
Axioms of Probability Theory
1. All probabilities between 0 and 1
0 <= P(A) <= 1
2. True has probability 1, false has probability 0.
P(true) = 1
P(false) = 0
P(not A) = P(~A) = 1-P(A)
3. The probability of disjunction is:
P( A or B) = P(A) + P(B) – P (A and B)
Sometimes it is written as:
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
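A small enumeration check of axiom 3 (an added illustration; the fair-die sample space and the events A and B below are my own choices, not from the slides):

from fractions import Fraction

worlds = set(range(1, 7))                  # six equally likely worlds: one fair die roll
A = {w for w in worlds if w % 2 == 0}      # A = "roll is even"
B = {w for w in worlds if w >= 4}          # B = "roll is at least 4"

P = lambda event: Fraction(len(event), len(worlds))
print(P(A | B), P(A) + P(B) - P(A & B))    # both are 2/3, as the axiom states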
Interpretation of the Axioms
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
[Venn diagram: the areas of A, B, A or B, and A and B show the rule as simple addition and subtraction.]
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
The Chain Rule:
P(A ^ B) = P(A|B) P(B)
Conditional Probability
P(A|B) = Fraction of worlds in which B is true that also have A true
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = ½
P(H|F) = Fraction of flu-inflicted worlds in which you have a Headache
       = #worlds with flu and headache / #worlds with flu
       = Area of “H and F” region / Area of “F” region
       = P(H ^ F) / P(F)
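Plugging the numbers into the chain rule from the previous slide (an added worked step): P(H ^ F) = P(H|F) P(F) = 1/2 × 1/40 = 1/80.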
Probabilistic Inference
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
Area-wise we have:
P(H|F) = ½
[Venn diagram: the F and H circles overlap; the regions are labeled A, B, and C.]
P(F) = ?
P(H) = ?
P(H|F) = ?
P(F|H) = ?
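The last prompt can be filled in from the numbers above (an added worked step, not on the slide): P(F|H) = P(H ^ F) / P(H) = (P(H|F) P(F)) / P(H) = (1/80) / (1/10) = 1/8. So although a headache is very likely given flu, only 1 in 8 headache worlds are flu worlds.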
Independence
2 blue and 3 red marbles are in a bag. What are the chances of getting a blue marble?
The chance is 2 in 5. But after taking one out, the chances change!
So the next time?
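Finishing the arithmetic the slide leaves open (an added worked step): the first draw is blue with probability 2/5. Without replacement the second draw depends on the first: if the first was blue, P(blue) = 1/4; if the first was red, P(blue) = 2/4 = 1/2. The two draws are therefore not independent.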
Independence
• A and B are independent iff:
P(A|B) = P(A)
P(B|A) = P(B)
• Therefore, if A and B are independent:
P(A|B) = P(A ∩ B) / P(B) = P(A)
P(A ∩ B) = P(A) P(B)
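A short enumeration testing this definition on the marble example above (an added illustrative sketch; the encoding is my own, not from the slides):

from fractions import Fraction
from itertools import permutations

marbles = ["blue", "blue", "red", "red", "red"]
draws = list(permutations(marbles, 2))            # all equally likely ordered pairs, no replacement

A = [d for d in draws if d[1] == "blue"]          # A = "second draw is blue"
B = [d for d in draws if d[0] == "blue"]          # B = "first draw is blue"
A_and_B = [d for d in A if d in B]

P_A = Fraction(len(A), len(draws))                # 2/5
P_A_given_B = Fraction(len(A_and_B), len(B))      # 1/4
print(P_A, P_A_given_B, P_A == P_A_given_B)       # 2/5 1/4 False -> the draws are not independent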
Example: Ice Cream
70% of your friends like Chocolate, and 35% like Chocolate AND like Strawberry.
What percent of those who like Chocolate also like Strawberry?
P(Strawberry | Chocolate) = P(Chocolate and Strawberry) / P(Chocolate) = 0.35 / 0.7 = 50%
It means: 50% of your friends who like Chocolate also like Strawberry.
• The joint probability distribution for a set of random variables, X1,…,Xn gives the probability of every combination of values (an n-dimensional array with v^n values if all variables are discrete with v values; all v^n values must sum to 1):
P(X1,…,Xn)

positive:
          circle   square
red       0.20     0.02
blue      0.02     0.01

negative:
          circle   square
red       0.05     0.30
blue      0.20     0.20

P(positive | red ∧ circle) = ?
• The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.
P(red ∧ circle) = 0.20 + 0.05 = 0.25
• Therefore, all conditional probabilities can also be calculated.
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
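A short sketch of these two computations, reading marginals and conditionals off the joint table (an added illustration; the dictionary encoding and the prob helper are my own, not from the slides):

joint = {  # P(class, color, shape), copied from the table above
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

def prob(cls=None, color=None, shape=None):
    """Sum the joint over all entries consistent with the given values (None = unconstrained)."""
    return sum(p for (c, col, sh), p in joint.items()
               if (cls is None or c == cls)
               and (color is None or col == color)
               and (shape is None or sh == shape))

p_red_circle = prob(color="red", shape="circle")                       # 0.20 + 0.05 = 0.25
p_pos_given = prob(cls="positive", color="red", shape="circle") / p_red_circle
print(round(p_red_circle, 2), round(p_pos_given, 2))                   # 0.25 0.8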
Bayes Rule
• Thomas Bayes (c. 1701 – 7 April 1761) was an English statistician, philosopher and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem. Bayes never published what would eventually become his most famous accomplishment; his notes were edited and published after his death by Richard Price.
P(A|B) = P(A ∧ B) / P(B) = P(B|A) P(A) / P(B)
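The rule is just the chain rule written in both orders (a one-line derivation added for clarity): P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A); dividing both sides by P(B) gives Bayes' rule.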
Bayesian Learning
P(h|x) = P(x|h) P(h) / P(x)
P(h): prior belief (probability of hypothesis h before seeing any data)
P(x|h): likelihood (probability of the data if the hypothesis h is true)
P(x) = Σ_h P(x|h) P(h): data evidence (marginal probability of the data)
P(h|x): posterior (probability of hypothesis h after having seen the data x)
An Illustrating Example
A patient takes a lab test and the result comes back positive. It is
known that the test returns a correct positive result in only 98% of the
cases and a correct negative result in only 97% of the cases.
Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the disease?
An Illustrating Example
• The available data has two possible outcomes:
Positive (+) and Negative (-)
• Various probabilities are
P(cancer) = 0.008
P(~cancer) = 0.992
P(+|cancer) = 0.98
P(-|cancer) = 0.02
P(+|~cancer) = 0.03
P(-|~cancer) = 0.97
• Now a new patient's test result comes back positive. Should we diagnose the patient with cancer or not?
Choosing Hypotheses
• Generally, we want the most probable hypothesis given the observed
data:
– Maximum a posteriori (MAP) hypothesis
– Maximum likelihood (ML) hypothesis
Definition:
Arg max stands for the argument of the maximum, that is to say, the set of points of the domain at which the given function attains its maximum value.
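A tiny illustration of arg max over a finite hypothesis set (an added sketch; the scores below are made-up numbers):

scores = {"h1": 0.2, "h2": 0.5, "h3": 0.3}   # f(h) for each hypothesis h
h_best = max(scores, key=scores.get)          # arg max_h f(h)
print(h_best)                                 # -> h2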
Maximum a posteriori (MAP)
• Maximum a posteriori (MAP) hypothesis:
P(h|x) = P(x|h) P(h) / P(x)
h_MAP = arg max_{h∈H} P(h|x) = arg max_{h∈H} P(x|h) P(h) / P(x) = arg max_{h∈H} P(x|h) P(h)
Note P(x) is independent of h, hence can be ignored.
Does patient have cancer or not?
P(+|cancer)P(cancer) = 0.98*0.008 = 0.0078
P(+|~cancer)P(~cancer) = 0.03*0.992 = 0.0298
MAP:
P(+|cancer)P(cancer) < P(+|~cancer)P(~cancer)
Diagnosis: ~cancer
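A short sketch reproducing the MAP calculation and normalizing to get the actual posterior (an added illustration; the variable names are my own):

p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer                    # P(+|cancer) P(cancer)   = 0.0078
score_not    = p_pos_given_not * p_not_cancer                   # P(+|~cancer) P(~cancer) = 0.0298

h_map = "cancer" if score_cancer > score_not else "~cancer"     # pick the larger MAP score
posterior_cancer = score_cancer / (score_cancer + score_not)    # P(cancer|+) ≈ 0.21
print(h_map, round(posterior_cancer, 2))                        # ~cancer 0.21

Even with a positive test, the posterior probability of cancer is only about 21%, because the disease is so rare; this is consistent with the ~cancer diagnosis.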
Thank you!