Review of probability
Nuno Vasconcelos
UCSD
Probability
• probability is the language to deal with processes that
are non-deterministic
• examples:
– if I flip a coin 100 times, how many heads can I expect to see?
– what is the weather going to be like tomorrow?
– are my stocks going to be up or down?
– am I in front of a classroom or is this just a picture of it?
Sample space
• the most important concept is that of a sample space
• our process defines a set of events
– these are the outcomes or states of the process
• example:
– we roll a pair of dice
– call xn the value on the up face at the nth toss
– note that possible events such as
  odd number on second throw
  two sixes
  x1 = 2 and x2 = 6
– can all be expressed as combinations of the sample space events
[figure: 6×6 grid of outcomes (x1, x2), with x1 and x2 each ranging from 1 to 6]
Sample space
• is the list of possible events that satisfies the following
properties:
– finest grain: all possible distinguishable
events are listed separately
– mutually exclusive: if one event happens
the other does not (if x1 = 5 it cannot be
anything else)
– collectively exhaustive: any possible
outcome can be expressed as unions of
sample space events
• mutually exclusive property simplifies the calculation of
the probability of complex events
• collectively exhaustive means that there is no possible
outcome to which we cannot assign a probability
Probability measure
• probability of an event:
– number expressing the chance that the event will be the outcome
of the process
• probability measure: satisfies three axioms
– P(A) ≥ 0 for any event A
– P(universal event) = 1
– if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
• e.g.
– P(x1 ≥ 0) = 1
– P(x1 even ∪ x1 odd) = P(x1 even) + P(x1 odd)
Probability measure
• the last axiom
– combined with the mutually exclusive property of the sample set
– allows us to easily assign probabilities to all possible events
• back to our dice example:
– suppose that the probability of any pair (x1,x2) is 1/36
– we can compute probabilities of all “union” events (see the sketch below)
– P(x2 odd) = 18 × 1/36 = 1/2
– P(U) = 36 × 1/36 = 1
– P(two sixes) = 1/36
– P(x1 = 2 and x2 = 6) = 1/36
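To make this concrete, here is a minimal Python sketch (not part of the original slides) that enumerates the 36 equally likely pairs and adds up their 1/36 masses; the event predicates are just illustrative choices.

```python
from itertools import product
from fractions import Fraction

# The sample space: all 36 equally likely pairs (x1, x2).
sample_space = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, 36)   # probability mass of each sample space event

def prob(event):
    """Probability of a 'union' event, given as a predicate on (x1, x2)."""
    return sum(p_each for x1, x2 in sample_space if event(x1, x2))

print(prob(lambda x1, x2: x2 % 2 == 1))          # P(x2 odd)           -> 1/2
print(prob(lambda x1, x2: True))                 # P(U)                -> 1
print(prob(lambda x1, x2: x1 == 6 and x2 == 6))  # P(two sixes)        -> 1/36
print(prob(lambda x1, x2: x1 == 2 and x2 == 6))  # P(x1 = 2, x2 = 6)   -> 1/36
```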
Probability measure
• note that there are many ways to define the universal event U
– e.g. A = {x2 odd}, B = {x2 even}, U = A ∪ B
– on the other hand, U = (1,1) ∪ (1,2) ∪ (1,3) ∪ … ∪ (6,6)
– the fact that the sample space is finest grain, exhaustive, and mutually exclusive, together with the measure axioms, makes the whole procedure consistent
Random variables
• random variable X
– is a function that assigns a real value to each sample space event
– we have already seen one such function: PX(x1,x2) = 1/36 for all
(x1,x2)
• notation:
– specify both the random variable and the value that it takes in
your probability statements
– we do this by specifying the random variable as subscript PX and
the value as argument
PX(x1,x2) = 1/36 means Prob[X = (x1,x2)] = 1/36
– without this, probability statements can be hopelessly confusing
Random variables
• two types of random variables:
– discrete and continuous
– this really refers to the type of values the RV can take
• if it can take only one of a finite set of possibilities, we call
it discrete
– this is the dice example we saw, there are only 36 possibilities
Random variables
• if it can take values in a real interval we say that the
random variable is continuous
• e.g. consider the sample space of weather temperature
– we know that it could be any number between -50 and 150 degrees
– random variable T ∈ [-50,150]
– note that the extremes do not have to be very precise, we can just say that P(T < -45°) = 0
• most probability notions apply equally well to discrete and continuous random variables
Discrete RV
• for a discrete RV the probability assignments are given by a probability mass function (PMF)
– this can be thought of as a normalized histogram
– satisfies the following properties
0 ≤ PX(a) ≤ 1, ∀a
∑_a PX(a) = 1
• example for the random variable
– X ∈ {1,2,3, …, 20} where X = i if the grade of student z in the class is between 5i and 5(i+1)
– we see that PX(14) = α, the height of the corresponding histogram bar
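To see the “normalized histogram” view in code, here is a small Python sketch; the grade list is made up purely for illustration, only the normalization step matters.

```python
import numpy as np

# Hypothetical grades (0-100) for a class; these numbers are invented.
grades = np.array([62, 71, 88, 73, 95, 67, 74, 81, 70, 73])

# X = i if the grade falls between 5i and 5(i+1).
x = grades // 5

# The PMF is the normalized histogram of the values of X.
values, counts = np.unique(x, return_counts=True)
pmf = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

print(pmf)                  # e.g. PX(14) is the fraction of grades in [70, 75)
print(sum(pmf.values()))    # a valid PMF sums to 1
```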
Continuous RV
• for a continuous RV the probability assignments are given
by a probability density function (PDF)
– this is just a continuous
function
– satisfies the following
properties
0 ≤ PX(a), ∀a
∫ PX(a) da = 1
• example for the Gaussian random variable of mean µ and variance σ²
PX(a) = 1/(√(2π) σ) exp{ −(a − µ)² / (2σ²) }
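As a sanity check of the density view, the sketch below evaluates this Gaussian on a grid and verifies numerically that it integrates to (approximately) 1; the grid range and step are arbitrary choices.

```python
import numpy as np

def gaussian_pdf(a, mu=0.0, sigma=1.0):
    """Gaussian PDF with mean mu and standard deviation sigma."""
    return np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Riemann-sum check that the density integrates to ~1.
t = np.linspace(-10.0, 10.0, 20001)
dt = t[1] - t[0]
print(np.sum(gaussian_pdf(t)) * dt)          # ~1.0

# Probabilities of intervals come from integrating the density:
mask = (t >= -1.0) & (t <= 1.0)
print(np.sum(gaussian_pdf(t[mask])) * dt)    # Pr(-1 <= X <= 1) ~ 0.68
```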
Discrete vs continuous RVs
• in general the same, up to replacing summations by
integrals
• note that PDF means “density of probability”,
– this is probability per unit
– the probability of a particular event
is always zero (unless there is a
discontinuity)
– we can only talk about
Pr(a ≤ X ≤ b) = ∫_a^b PX(t) dt
– note also that PDFs are not
upper bounded
– e.g. Gaussian goes to Dirac when variance goes to zero
Multiple random variables
• frequently we have problems with multiple random
variables
– e.g. when at the doctor, you are mostly a collection of random variables
x1: temperature
x2: blood pressure
x3: weight
x4: cough
…
• we can summarize this as
– a vector X = (x1, …, xn) of n random variables
– PX(x1, …, xn) is the joint probability distribution
Marginalization
P(cold)?
• important notion for multiple random
variables is marginalization
– e.g. having a cold does not depend on
blood pressure and weight
– all that matters are fever and cough
– that is, we need to know PX1,X4(a,b)
• we marginalize with respect to a subset of variables
– (in this case X1 and X4)
– this is done by summing (or integrating) the others out
PX1,X4(x1, x4) = ∑_{x2,x3} PX1,X2,X3,X4(x1, x2, x3, x4)
PX1,X4(x1, x4) = ∫∫ PX1,X2,X3,X4(x1, x2, x3, x4) dx2 dx3
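In code, marginalization is just a sum over the axes of the variables we do not care about; the joint table below is randomly generated with made-up dimensions, purely to illustrate the operation.

```python
import numpy as np

# A hypothetical joint PMF over (x1, x2, x3, x4); shape and values are made up.
rng = np.random.default_rng(0)
joint = rng.random((3, 4, 5, 2))
joint /= joint.sum()            # normalize so it is a valid joint PMF

# Marginalize with respect to (X1, X4): sum X2 and X3 out (axes 1 and 2).
marginal_14 = joint.sum(axis=(1, 2))

print(marginal_14.shape)        # (3, 2): one entry per (x1, x4) pair
print(marginal_14.sum())        # still 1.0
```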
Conditional probability
PY|X(sick | cough)?
• another very important notion:
– so far, doctor has PX1,X4(fever,cough)
– still does not allow a diagnosis
– for this we need a new variable Y with
two states Y ∈ {sick, not sick}
– doctor measures fever and cough levels,
these are no longer unknowns, or even
random quantities
– the question of interest is “what is the probability that patient is
sick given the measured values of fever and cough?”
• this is exactly the definition of conditional probability
– what is the probability that Y takes a given value given
observations for X
PY|X1,X4(sick | 98, high)
Conditional probability
• note the very important difference between conditional
and joint probability
• joint probability is a hypothetical question with respect to
all variables
– what is the probability that you will be sick and cough a lot?
PY,X(sick, cough)?
Conditional probability
• conditional probability means that you know the values of
some variables
– what is the probability that you are sick given that you cough a
lot?
PY|X(sick | cough)?
– “given” is the key word here
– conditional probability is very important because it allows us to
structure our thinking
– shows up again and again in design of intelligent systems
Conditional probability
• fortunately it is easy to compute
– we simply normalize the joint by the probability of what we know
PY|X1(sick | 98) = PY,X1(sick, 98) / PX1(98)
– note that this makes sense since
PY|X1(sick | 98) + PY|X1(not sick | 98) = 1
– and, by the marginalization equation,
PY,X1(sick, 98) + PY,X1(not sick, 98) = PX1(98)
– the definition of conditional probability just makes these two statements coherent
  it simply says that, given what we know, we still have a valid probability measure
  the universal event {sick} ∪ {not sick} still has probability 1 after the observation
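In code, conditioning is exactly this normalization of the joint by a marginal; the temperatures and joint values below are invented for illustration.

```python
import numpy as np

# Hypothetical joint PMF PY,X1(y, t) for Y in {sick, not sick} and a few temperatures.
temps = np.array([97, 98, 99, 104])
joint = np.array([
    [0.02, 0.05, 0.10, 0.08],   # row 0: PY,X1(sick, t)
    [0.30, 0.35, 0.09, 0.01],   # row 1: PY,X1(not sick, t)
])

# Condition on X1 = 98: divide the joint column by the marginal PX1(98).
col = int(np.where(temps == 98)[0][0])
p_x1 = joint[:, col].sum()                # PX1(98) by marginalization
p_y_given_x1 = joint[:, col] / p_x1       # [PY|X1(sick|98), PY|X1(not sick|98)]

print(p_y_given_x1, p_y_given_x1.sum())   # the conditional sums to 1
```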
The chain rule of probability
• is an important consequence of the definition of
conditional probability
– note that, from this definition,
PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)
– more generally, it has the form
PX1,X2,...,Xn(x1, x2, ..., xn) = PX1|X2,...,Xn(x1 | x2, ..., xn)
  × PX2|X3,...,Xn(x2 | x3, ..., xn) × ...
  × PXn−1|Xn(xn−1 | xn) PXn(xn)
• combination with marginalization allows us to make hard
probability questions simple
The chain rule of probability
• e.g. what is the probability that you will be sick and have 104° of fever?
PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
– breaks down a hard question (prob of sick and 104) into two
easier questions
– Prob (sick|104): everyone knows that this is close to one
[cartoon: “You have a cold!”]
PY|X(sick | 104) = 1!
The chain rule of probability
• e.g. what is the probability that you will be sick and have 104° of fever?
PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
– Prob(104): still hard, but easier than P(sick,104) since we now only have one random variable (temperature)
– does not depend on sickness, it is just the question “what is the probability that someone will have 104°?”
  gather a number of people, measure their temperatures and make a histogram that everyone can use after that
The chain rule of probability
• in fact, the chain rule is so handy, that most times we use
it to compute probabilities
– e.g. PY(sick) = ∫ PY,X1(sick, t) dt    (marginalization)
               = ∫ PY|X1(sick | t) PX1(t) dt    (chain rule)
– in this way we can get away with knowing
  PX1(t), which we may know because it was needed for some other problem
  PY|X1(sick|t), which we can ask a doctor for or approximate with a rule of thumb
PY|X1(sick | t) ≈ 1 if t > 102; 0.5 if 98 < t < 102; 0 if t < 98
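Putting the pieces together in a short sketch: the rule-of-thumb conditional above is combined with an assumed Gaussian density PX1(t) for the population temperature (the Gaussian and its parameters are my assumption, not from the slides) to get PY(sick) by numerical integration.

```python
import numpy as np

def p_sick_given_t(t):
    """Rule-of-thumb conditional PY|X1(sick | t) from the text."""
    return np.where(t > 102, 1.0, np.where(t > 98, 0.5, 0.0))

# Assumed temperature density PX1(t): Gaussian around 98.6 F (illustrative only).
mu, sigma = 98.6, 1.0
t = np.linspace(90.0, 110.0, 20001)
dt = t[1] - t[0]
p_t = np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# PY(sick) = integral of PY|X1(sick | t) PX1(t) dt  (chain rule + marginalization)
p_sick = np.sum(p_sick_given_t(t) * p_t) * dt
print(p_sick)
```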
Independence
• another fundamental concept for multiple variables
– two variables are independent if the joint is the product of the
marginals
PX1,X2(a, b) = PX1(a) PX2(b)
– note: implies that
PX1|X2(a | b) = PX1,X2(a, b) / PX2(b) = PX1(a)
– which captures the intuitive notion:
– “if X1 is independent of X2, knowing X2 does not change the
probability of X1”
e.g. knowing that it is sunny does not change the probability that it
will rain in three months
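A quick numerical check of the definition: build a joint as the outer product of two made-up marginals (so X1 and X2 are independent by construction) and verify that the joint factors and that conditioning does not change PX1.

```python
import numpy as np

# Made-up marginals for X1 and X2.
p_x1 = np.array([0.2, 0.5, 0.3])
p_x2 = np.array([0.6, 0.4])

# Independence: the joint is the product of the marginals.
joint = np.outer(p_x1, p_x2)

# Recover the marginals from the joint and check the factorization.
m1 = joint.sum(axis=1)
m2 = joint.sum(axis=0)
print(np.allclose(joint, np.outer(m1, m2)))    # True

# Conditioning on any value of X2 leaves the distribution of X1 unchanged.
print(joint[:, 0] / m2[0])                     # equals p_x1
```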
Independence
• extremely useful in the design of intelligent
systems
– frequently, knowing X makes Y independent of Z
– e.g. consider the shivering symptom:
if you have a temperature you sometimes shiver
it is a symptom of having a cold
but once you measure the temperature, the two become independent
PY,X1,S(sick, 98, shiver) = PY|X1,S(sick | 98, shiver) × PS|X1(shiver | 98) PX1(98)
                          = PY|X1(sick | 98) × PS|X1(shiver | 98) PX1(98)
• simplifies considerably the estimation of the probabilities
Independence
• useful property: if you add two independent random
variables their probability distributions convolve
– i.e. if z = x + y and x,y are independent then
PZ(z) = PX(z) * PY(z)
where * is the convolution operator
– for discrete random variables
PZ(z) = ∑_k PX(k) PY(z − k)
– for continuous random variables
PZ(z) = ∫ PX(t) PY(z − t) dt
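For example, the PMF of the sum of two independent fair dice is the convolution of two single-die PMFs; a minimal sketch:

```python
import numpy as np

# PMF of a single fair die over the values 1..6.
die = np.full(6, 1 / 6)

# Z = X + Y with X, Y independent: PZ is the convolution of the two PMFs.
sum_pmf = np.convolve(die, die)        # 11 values, supported on 2..12

for value, p in zip(range(2, 13), sum_pmf):
    print(value, round(float(p), 4))   # peaks at 7 with probability 6/36
```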
Moments
• important properties of
random variables
– summarize the distribution
• important moments
– mean: µ = E[x]
– variance: σ² = E[(x−µ)²]
– various distributions are completely specified by a small number
of moments
            discrete                        continuous
mean        µ = ∑_k PX(k) k                 µ = ∫ PX(k) k dk
variance    σ² = ∑_k PX(k) (k − µ)²         σ² = ∫ PX(k) (k − µ)² dk
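Applying the discrete formulas to a single fair die (an illustrative choice):

```python
import numpy as np

# PMF of a fair die.
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

mu = np.sum(pmf * values)                  # µ  = sum_k PX(k) k          -> 3.5
var = np.sum(pmf * (values - mu) ** 2)     # σ² = sum_k PX(k) (k - µ)²   -> ~2.9167

print(mu, var)
```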
Mean
• µ = E[x] is the center of mass of the distribution
            discrete                continuous
mean        µ = ∑_k PX(k) k         µ = ∫ PX(k) k dk
• is a linear quantity
– if Z = X + Y, then E[Z] = E[X] + E[Y]
– this does not require any special
relation between X and Y
– always holds
• other moments are the mean of powers of X
– nth order moment is E[Xⁿ]
– nth central moment is E[(X−µ)ⁿ]
Variance
• σ² = E[(x−µ)²] measures the dispersion around the mean
– it is the second central moment
            discrete                        continuous
variance    σ² = ∑_k PX(k) (k − µ)²         σ² = ∫ PX(k) (k − µ)² dk
• in general, not linear
– if Z = X + Y, then Var[Z] = Var[X] + Var[Y]
– only holds if X and Y are independent
• it is related to 2nd order moment by
σ² = E[(x − µ)²] = E[x² − 2xµ + µ²] = E[x²] − 2E[x]µ + µ² = E[x²] − µ²
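An empirical check of these identities, using samples of two independent dice (an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=200_000).astype(float)
y = rng.integers(1, 7, size=200_000).astype(float)

# sigma^2 = E[x^2] - mu^2
print(np.var(x), np.mean(x ** 2) - np.mean(x) ** 2)   # both ~2.92

# Var[X + Y] = Var[X] + Var[Y] holds here because X and Y are independent.
print(np.var(x + y), np.var(x) + np.var(y))           # both ~5.83
```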