CS 188: Artificial Intelligence
Spring 2006
Lecture 8: Probability
2/9/2006
Dan Klein – UC Berkeley
Many slides from either Stuart Russell or Andrew Moore

Today
- Uncertainty
- Probability Basics
- Joint and Conditional Distributions
- Models and Independence
- Bayes Rule
- Estimation
- Utility Basics
- Value Functions
- Expectations
Uncertainty
- Let action At = leave for airport t minutes before flight
- Will At get me there on time?
- Problems:
  - partial observability (road state, other drivers' plans, etc.)
  - noisy sensors (KCBS traffic reports)
  - uncertainty in action outcomes (flat tire, etc.)
  - immense complexity of modeling and predicting traffic
- A purely logical approach either
  - Risks falsehood: "A25 will get me there on time" or
  - Leads to conclusions that are too weak for decision making:
    "A25 will get me there on time if there's no accident on the bridge, and it doesn't rain, and my tires remain intact, etc., etc."
- A1440 might reasonably be said to get me there on time, but I'd have to stay overnight in the airport…

Probabilities
- Probabilistic approach
  - Given the available evidence, A25 will get me there on time with probability 0.04
  - P(A25 | no reported accidents) = 0.04
- Probabilities change with new evidence:
  - P(A25 | no reported accidents, 5 a.m.) = 0.15
  - P(A25 | no reported accidents, 5 a.m., raining) = 0.08
  - i.e., observing evidence causes beliefs to be updated
Probabilistic Models
- CSPs:
  - Variables with domains
  - Constraints: map from assignments to true/false
  - Ideally: only certain variables directly interact

  A    | B    | P
  warm | sun  | T
  warm | rain | F
  cold | sun  | F
  cold | rain | T

- Probabilistic models:
  - (Random) variables with domains
  - Joint distributions: map from assignments (or outcomes) to positive numbers
  - Normalized: sum to 1.0
  - Ideally: only certain variables are directly correlated

  A    | B    | P
  warm | sun  | 0.4
  warm | rain | 0.1
  cold | sun  | 0.2
  cold | rain | 0.3

What Are Probabilities?
- Objectivist / frequentist answer:
  - Averages over repeated experiments
  - E.g. empirically estimating P(rain) from historical observation
  - Assertion about how future experiments will go (in the limit)
  - New evidence changes the reference class
  - Makes one think of inherently random events, like rolling dice
- Subjectivist / Bayesian answer:
  - Degrees of belief about unobserved variables
  - E.g. an agent's belief that it's raining, given the temperature
  - Often estimate probabilities from past experience
  - New evidence updates beliefs
- Unobserved variables still have fixed assignments (we just don't know what they are)
Probabilities Everywhere?
- Not just for games of chance!
  - I'm snuffling: am I sick?
  - Email contains "FREE!": is it spam?
  - Tooth hurts: have cavity?
  - Safe to cross street?
  - 60 min enough to get to the airport?
  - Robot rotated wheel three times, how far did it advance?
- Why can a random variable have uncertainty?
  - Inherently random process (dice, etc.)
  - Insufficient or weak evidence
  - Unmodeled variables
  - Ignorance of underlying processes
  - The world's just noisy!
- Compare to fuzzy logic, which has degrees of truth, or soft assignments

Distributions on Random Vars
- A joint distribution over a set of random variables X1, …, Xn is a map from assignments (or outcomes, or atomic events) to reals: P(X1 = x1, …, Xn = xn)
- Size of distribution if n variables with domain sizes d? d^n entries
- Must obey: P(x1, …, xn) ≥ 0 for every assignment, and the sum over all assignments of P(x1, …, xn) = 1
- For all but the smallest distributions, impractical to write out
Examples
- An event is a set E of assignments (or outcomes)
- From a joint distribution, we can calculate the probability of any event
  - Probability that it's warm AND sunny?
  - Probability that it's warm?
  - Probability that it's warm OR sunny?

  T    | S    | P
  warm | sun  | 0.4
  warm | rain | 0.1
  cold | sun  | 0.2
  cold | rain | 0.3

Marginalization
- Marginalization (or summing out) is projecting a joint distribution to a sub-distribution over a subset of variables. From the joint P(T, S) above:

  T    | P
  warm | 0.5
  cold | 0.5

  S    | P
  sun  | 0.6
  rain | 0.4

Conditional Probabilities
- Conditional or posterior probabilities:
  - E.g., P(cavity | toothache) = 0.8
  - Given that toothache is all I know…
- Notation for conditional distributions:
  - P(cavity | toothache) = a single number
  - P(Cavity, Toothache) = 4-element vector summing to 1
  - P(Cavity | Toothache) = two 2-element vectors, each summing to 1
- If we know more:
  - P(cavity | toothache, catch) = 0.9
  - P(cavity | toothache, cavity) = 1
- Note: the less specific belief remains valid after more evidence arrives, but is not always useful
- New evidence may be irrelevant, allowing simplification:
  - P(cavity | toothache, traffic) = P(cavity | toothache) = 0.8
- This kind of inference, sanctioned by domain knowledge, is crucial

Conditioning
- Conditioning is fixing some variables and renormalizing over the rest:

  T    | S    | P
  warm | sun  | 0.4
  warm | rain | 0.1
  cold | sun  | 0.2
  cold | rain | 0.3

  Select (S = rain):
  T    | P
  warm | 0.1
  cold | 0.3

  Normalize:
  T    | P
  warm | 0.25
  cold | 0.75
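Marginalization and conditioning are easy to mechanize. Below is a minimal Python sketch over the warm/cold and sun/rain joint from the slides; the helper names (`marginalize`, `condition`) are my own, not from the course.

```python
from collections import defaultdict

# Joint distribution P(T, S) from the slides: (t, s) -> probability
joint = {
    ("warm", "sun"): 0.4,
    ("warm", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

def marginalize(joint, index):
    """Sum out every variable except the one at tuple position `index`."""
    marginal = defaultdict(float)
    for assignment, p in joint.items():
        marginal[assignment[index]] += p
    return dict(marginal)

def condition(joint, index, value):
    """Fix the variable at `index` to `value`, then renormalize the rest."""
    selected = {a: p for a, p in joint.items() if a[index] == value}
    z = sum(selected.values())  # normalization constant
    return {a: p / z for a, p in selected.items()}

print(marginalize(joint, 0))        # P(T): warm 0.5, cold 0.5
print(condition(joint, 1, "rain"))  # P(T | rain): warm 0.25, cold 0.75
```

Conditioning here is exactly the select-then-normalize picture from the slides: keep the rows consistent with the evidence and divide by their total mass.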
Inference by Enumeration
- P(R)?
- P(R | winter)?
- P(R | winter, warm)?

  S      | T    | R    | P
  summer | warm | sun  | 0.30
  summer | warm | rain | 0.05
  summer | cold | sun  | 0.10
  summer | cold | rain | 0.05
  winter | warm | sun  | 0.10
  winter | warm | rain | 0.05
  winter | cold | sun  | 0.15
  winter | cold | rain | 0.20

Inference by Enumeration
- General case:
  - Evidence variables: E1 = e1, …, Ek = ek
  - Query variable: Q
  - Hidden variables: H1, …, Hr
  (together: all variables)
- We want: P(Q | e1, …, ek)
- The required summation of joint entries is done by summing out H:
  P(Q, e1, …, ek) = Σ over h1, …, hr of P(Q, h1, …, hr, e1, …, ek)
- Then renormalizing:
  P(Q | e1, …, ek) = P(Q, e1, …, ek) / Σq P(q, e1, …, ek)
- Obvious problems:
  - Worst-case time complexity O(d^n)
  - Space complexity O(d^n) to store the joint distribution
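The procedure above (restrict to the evidence, sum out the hidden variables, renormalize) can be sketched directly in Python. The joint table is the season/temperature/rain distribution from the slides; `enumerate_query` is a name I chose for illustration.

```python
from collections import defaultdict

# Joint P(S, T, R) from the slides; each key is an (S, T, R) assignment
joint = {
    ("summer", "warm", "sun"): 0.30, ("summer", "warm", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "warm", "sun"): 0.10, ("winter", "warm", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}
VARS = ("S", "T", "R")

def enumerate_query(joint, query_var, evidence):
    """P(query_var | evidence): sum out hidden vars, then renormalize."""
    qi = VARS.index(query_var)
    unnormalized = defaultdict(float)
    for assignment, p in joint.items():
        # Keep only entries consistent with the evidence
        if all(assignment[VARS.index(v)] == val for v, val in evidence.items()):
            unnormalized[assignment[qi]] += p  # summing out the hidden vars
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

print(enumerate_query(joint, "R", {}))                           # P(R)
print(enumerate_query(joint, "R", {"S": "winter"}))              # P(R | winter)
print(enumerate_query(joint, "R", {"S": "winter", "T": "warm"})) # P(R | winter, warm)
```

The loop touches every entry of the joint, which is exactly the O(d^n) cost noted above.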
The Chain Rule I
- Sometimes joint P(X,Y) is easy to get
- Sometimes easier to get conditional P(X|Y)
- Example: P(Sun, Dry)?

  P(S):
  S    | P
  sun  | 0.8
  rain | 0.2

  P(D | S):
  D   | S    | P
  wet | sun  | 0.1
  dry | sun  | 0.9
  wet | rain | 0.7
  dry | rain | 0.3

  P(D, S) = P(D | S) P(S):
  D   | S    | P
  wet | sun  | 0.08
  dry | sun  | 0.72
  wet | rain | 0.14
  dry | rain | 0.06

Lewis Carroll's Sack Problem
- Sack contains a red or blue ball, 50/50
- We add a red ball
- If we draw a red ball, what's the chance of drawing a second red ball?
- Variables:
  - F = {r, b} is the original ball
  - D = {r, b} is the ball we draw
- Query: P(F = r | D = r)

  P(F):
  F | P
  r | 0.5
  b | 0.5

  P(D | F):
  F | D | P
  r | r | 1.0
  r | b | 0.0
  b | r | 0.5
  b | b | 0.5
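The chain-rule construction of the joint from a marginal and a conditional is a one-line product. A minimal sketch, assuming the sun/rain and wet/dry numbers from the slides:

```python
# P(S) and P(D | S), with tables keyed as in the slides
p_s = {"sun": 0.8, "rain": 0.2}
p_d_given_s = {
    ("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
    ("wet", "rain"): 0.7, ("dry", "rain"): 0.3,
}

# Chain rule: P(d, s) = P(d | s) * P(s)
joint = {(d, s): p_d_given_s[(d, s)] * p_s[s] for (d, s) in p_d_given_s}

print(joint)  # matches the joint table on the slide, e.g. (dry, sun) -> 0.72
```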
Lewis Carroll's Sack Problem
- Now we have P(F, D)
- Want P(F | D = r)

  F | D | P
  r | r | 0.5
  r | b | 0.0
  b | r | 0.25
  b | b | 0.25

- Conditioning on D = r: P(F = r | D = r) = 0.5 / (0.5 + 0.25) = 2/3

Independence
- Two variables are independent if:
  ∀ x, y: P(x, y) = P(x) P(y)
- This says that their joint distribution factors into a product of two simpler distributions
- Independence is a modeling assumption
  - Empirical joint distributions: at best "close" to independent
  - What could we assume for {Sun, Dry, Toothache, Cavity}?
- How many parameters in the full joint model?
- How many parameters in the independent model?
- Independence is like something from CSPs: what?
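The sack query P(F | D = r) is a conditioning step on the joint P(F, D). A quick check in Python (joint values from the slides; this reproduces the classic 2/3 answer):

```python
# Joint P(F, D) over (original ball, drawn ball)
joint = {
    ("r", "r"): 0.5, ("r", "b"): 0.0,
    ("b", "r"): 0.25, ("b", "b"): 0.25,
}

# Condition on D = r: select consistent rows, then renormalize
selected = {f: p for (f, d), p in joint.items() if d == "r"}
z = sum(selected.values())
posterior = {f: p / z for f, p in selected.items()}

print(posterior["r"])  # P(F = r | D = r) = 0.5 / 0.75 = 2/3
```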
Example: Independence
- N fair, independent coins:
  P(X1, …, Xn) = P(X1) … P(Xn)

  X1 | P       X2 | P       Xn | P
  H  | 0.5     H  | 0.5     H  | 0.5
  T  | 0.5     T  | 0.5     T  | 0.5

Example: Independence?
- Arbitrary joint distributions can be (poorly) modeled by independent factors

  T    | P        S    | P
  warm | 0.5      sun  | 0.6
  cold | 0.5      rain | 0.4

  Actual joint P(T, S):      Product of marginals P(T) P(S):
  T    | S    | P            T    | S    | P
  warm | sun  | 0.4          warm | sun  | 0.3
  warm | rain | 0.1          warm | rain | 0.2
  cold | sun  | 0.2          cold | sun  | 0.3
  cold | rain | 0.3          cold | rain | 0.2

Conditional Independence
- P(Toothache, Cavity, Catch) has 2^3 = 8 entries (7 independent entries)
- If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  - P(catch | toothache, cavity) = P(catch | cavity)
- The same independence holds if I haven't got a cavity:
  - P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
- Catch is conditionally independent of Toothache given Cavity:
  - P(Catch | Toothache, Cavity) = P(Catch | Cavity)
- Equivalent statements:
  - P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  - P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)

Conditional Independence
- Unconditional independence is very rare (two reasons: why?)
- Conditional independence is our most basic and robust form of knowledge about uncertain environments
- What about this domain:
  - Traffic
  - Umbrella
  - Raining
- What about fire, smoke, alarm?
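Whether a joint factors into independent marginals can be checked numerically. A sketch, assuming the warm/cold and sun/rain joint used in the slides (the helper name is my own):

```python
from collections import defaultdict

joint = {
    ("warm", "sun"): 0.4, ("warm", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def marginals(joint):
    """Marginal distributions over each of the two variables."""
    t, s = defaultdict(float), defaultdict(float)
    for (tv, sv), p in joint.items():
        t[tv] += p
        s[sv] += p
    return dict(t), dict(s)

p_t, p_s = marginals(joint)
# Product-of-marginals approximation; independence holds iff it matches the joint
approx = {(tv, sv): p_t[tv] * p_s[sv] for (tv, sv) in joint}

independent = all(abs(joint[k] - approx[k]) < 1e-9 for k in joint)
print(approx[("warm", "sun")])  # 0.5 * 0.6 = 0.3, not the joint's 0.4
print(independent)              # False: T and S are correlated here
```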
The Chain Rule II
- Can always factor any joint distribution as a product of incremental conditional distributions:
  P(X1, X2, …, Xn) = Π over i of P(Xi | X1, …, Xi-1)
- This actually claims nothing…

The Chain Rule III
- Write out full joint distribution using chain rule:
  P(Toothache, Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
  = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
- Why? (the last step uses the conditional independence of Toothache and Catch given Cavity)
- What are the sizes of the tables we supply?
- Graphical model notation:
  - Each variable is a node
  - The parents of a node are the other variables which the decomposed joint conditions on
  - MUCH more on this to come!

  [Diagram: node Cav with children T and Cat, annotated P(Cavity), P(Toothache | Cavity), P(Catch | Cavity)]
Bayes’ Rule
- Two ways to factor a joint distribution over two variables:
  P(x, y) = P(x | y) P(y) = P(y | x) P(x)
- Dividing, we get:
  P(x | y) = P(y | x) P(x) / P(y)
  [Portrait of Bayes: "That's my rule!"]
- Why is this at all helpful?
  - Lets us invert a conditional distribution
  - Often the one conditional is tricky but the other simple
  - Foundation of many systems we'll see later (e.g. ASR, MT)
- In the running for most important AI equation!

More Bayes’ Rule
- Diagnostic probability from causal probability:
  P(cause | effect) = P(effect | cause) P(cause) / P(effect)
- Example:
  - m is meningitis, s is stiff neck
  - P(m | s) = P(s | m) P(m) / P(s)
- Note: posterior probability of meningitis still very small
- Note: you should still get stiff necks checked out! Why?
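Bayes' rule itself is a one-liner. The meningitis numbers below are placeholders I made up for illustration (the slides do not give values), chosen only to show that the posterior can stay small even when P(effect | cause) is large:

```python
def bayes_rule(p_effect_given_cause, p_cause, p_effect):
    """P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    return p_effect_given_cause * p_cause / p_effect

# Hypothetical values: P(s | m) = 0.8, P(m) = 0.0001, P(s) = 0.1
p_m_given_s = bayes_rule(0.8, 0.0001, 0.1)
print(p_m_given_s)  # about 0.0008: a very small posterior despite P(s | m) = 0.8
```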
Combining Evidence
- P(Cavity | toothache, catch)
  = α P(toothache, catch | Cavity) P(Cavity)
  = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
- This is an example of a naive Bayes model:

  [Diagram: class node C with children E1, E2, …, En]

- Total number of parameters is linear in n!
- We'll see much more of naïve Bayes next week

Expectations
- Real-valued functions of random variables: f : X → ℝ
- Expectation of a function of a random variable:
  E[f(X)] = Σx f(x) P(x)
- Example: Expected value of a fair die roll

  X | P   | f
  1 | 1/6 | 1
  2 | 1/6 | 2
  3 | 1/6 | 3
  4 | 1/6 | 4
  5 | 1/6 | 5
  6 | 1/6 | 6

  E[X] = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
Expectations
- Expected seconds wasted because of spam filter

  Lax Filter:
  S    | B     | P    | f
  spam | block | 0.45 | 0
  spam | allow | 0.10 | 10
  ham  | block | 0.05 | 100
  ham  | allow | 0.40 | 0

  Strict Filter:
  S    | B     | P    | f
  spam | block | 0.35 | 0
  spam | allow | 0.20 | 10
  ham  | block | 0.02 | 100
  ham  | allow | 0.43 | 0

  E[f] for the lax filter: 0.10 × 10 + 0.05 × 100 = 6 seconds
  E[f] for the strict filter: 0.20 × 10 + 0.02 × 100 = 4 seconds

Utilities
- Preview of utility theory (later)
- Utilities:
  - Function from events to real numbers (payoffs)
  - E.g. spam
  - E.g. airport
- We'll use the expected cost of actions to drive classification, decision networks, and reinforcement learning…
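The expected-cost comparison for the two filters is just E[f] = Σx P(x) f(x) applied to each table. A minimal sketch with the slide's numbers:

```python
def expected_cost(table):
    """table maps (spam_status, action) -> (probability, cost in seconds)."""
    return sum(p * f for p, f in table.values())

lax = {
    ("spam", "block"): (0.45, 0), ("spam", "allow"): (0.10, 10),
    ("ham", "block"): (0.05, 100), ("ham", "allow"): (0.40, 0),
}
strict = {
    ("spam", "block"): (0.35, 0), ("spam", "allow"): (0.20, 10),
    ("ham", "block"): (0.02, 100), ("ham", "allow"): (0.43, 0),
}

print(expected_cost(lax))     # 0.10*10 + 0.05*100 = 6.0 seconds
print(expected_cost(strict))  # 0.20*10 + 0.02*100 = 4.0 seconds
```

By this measure the strict filter wastes fewer expected seconds, which is the kind of comparison utility theory will formalize.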
Estimation
- How to estimate the distribution of a random variable X?
- Maximum likelihood:
  - Collect observations from the world
  - For each value x, look at the empirical rate of that value:
    P_ML(x) = count(x) / (number of samples)
  - This estimate is the one which maximizes the likelihood of the data
- Elicitation: ask a human!
  - Harder than it sounds
  - E.g. what's P(raining | cold)?
  - Usually need domain experts, and sophisticated ways of eliciting probabilities (e.g. betting games)

Estimation
- Problems with maximum likelihood estimates:
  - If I flip a coin once, and it's heads, what's the estimate for P(heads)?
  - What if I flip it 50 times with 27 heads?
  - What if I flip 10M times with 8M heads?
- Basic idea:
  - We have some prior expectation about parameters (here, the probability of heads)
  - Given little evidence, we should skew towards our prior
  - Given a lot of evidence, we should listen to the data
- How can we accomplish this? Stay tuned!
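The maximum-likelihood estimate is just the empirical frequency. A short sketch (the `Counter`-based helper is mine; the coin examples echo the slide's questions):

```python
from collections import Counter

def ml_estimate(samples):
    """P_ML(x) = count(x) / total number of samples."""
    counts = Counter(samples)
    n = len(samples)
    return {x: c / n for x, c in counts.items()}

print(ml_estimate(["H"]))                    # one flip: P(heads) = 1.0
print(ml_estimate(["H"] * 27 + ["T"] * 23))  # 50 flips: P(heads) = 0.54
```

The single-flip case gives the degenerate estimate P(heads) = 1.0, which is exactly the problem that mixing in a prior (as previewed above) is meant to fix.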