Review of probability
Nuno Vasconcelos
UCSD
Probability
• probability is the language to deal with processes that
are non-deterministic
• examples:
– if I flip a coin 100 times, how many heads can I expect to see?
– what is the weather going to be like tomorrow?
– are my stocks going to be up or down?
– am I in front of a classroom or is this just a picture of it?
Sample space
• the most important concept is that of a sample space
• our process defines a set of events
– these are the outcomes or states of the process
• example:
– we roll a pair of dice
– call the value on the up face at
the nth toss xn
– note that possible events such as
▪ odd number on second throw
▪ two sixes
▪ x1 = 2 and x2 = 6
– can all be expressed as combinations
of the sample space events
[figure: sample space of the two dice shown as a 6×6 grid, with axes x1 (1 to 6) and x2 (1 to 6)]
Sample space
• is the list of possible events that satisfies the following
properties:
– finest grain: all possible distinguishable
events are listed separately
– mutually exclusive: if one event happens
the other does not (if x1 = 5 it cannot be
anything else)
– collectively exhaustive: any possible
outcome can be expressed as unions of
sample space events
• mutually exclusive property simplifies the calculation of
the probability of complex events
• collectively exhaustive means that there is no possible
outcome to which we cannot assign a probability
Probability measure
• probability of an event:
– number expressing the chance that the event will be the outcome
of the process
• probability measure: satisfies three axioms
– P(A) ≥ 0 for any event A
– P(universal event) = 1
– if A ∩ B = ∅, then P(A U B) = P(A) + P(B)
• e.g.
– P(x1 ≥ 0) = 1
– P(x1 even U x1 odd) = P(x1 even) + P(x1 odd)
Probability measure
• the last axiom
– combined with the mutually exclusive property of the sample space
– allows us to easily assign probabilities to all possible events
• back to our dice example:
– suppose that the probability of any pair
(x1,x2) is 1/36
– we can compute probabilities of
all “union” events
– P(x2 odd) = 18 × 1/36 = 1/2
– P(U) = 36 × 1/36 = 1
– P(two sixes) = 1/36
– P(x1 = 2 and x2 = 6) = 1/36
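A minimal Python sketch (not from the slides) that enumerates the 36 equally likely pairs and recovers these probabilities by counting:

```python
from fractions import Fraction

# enumerate the 36 equally likely outcomes (x1, x2) of two dice
outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]
p = Fraction(1, 36)  # probability of each individual pair

def prob(event):
    """Probability of a 'union' event, given as a predicate over (x1, x2)."""
    return p * sum(1 for o in outcomes if event(o))

print(prob(lambda o: o[1] % 2 == 1))            # P(x2 odd) = 1/2
print(prob(lambda o: True))                     # P(U) = 1
print(prob(lambda o: o == (6, 6)))              # P(two sixes) = 1/36
print(prob(lambda o: o[0] == 2 and o[1] == 6))  # P(x1 = 2 and x2 = 6) = 1/36
```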
Probability measure
• note that there are many ways to
define the universal event U
– e.g. A = {x2 odd}, B = {x2 even}, U = A U B
– on the other hand, U = (1,1) U (1,2) U (1,3) U … U (6,6)
– the fact that the sample space is finest grain, mutually exclusive, and collectively exhaustive, together with the measure axioms, makes the whole procedure consistent
Random variables
• random variable X
– is a function that assigns a real value to each sample space event
– we have already seen one such function: PX(x1,x2) = 1/36 for all
(x1,x2)
• notation:
– specify both the random variable and the value that it takes in
your probability statements
– we do this by specifying the random variable as subscript PX and
the value as argument
PX (x1,x2) = 1/36
means Prob[X=(x1,x2)] = 1/36
– without this, probability statements can be hopelessly confusing
Random variables
• two types of random variables:
– discrete and continuous
– really means what types of values the RV can take
• if it can take only one of a finite set of possibilities, we call
it discrete
– this is the dice example we saw, there are only 36 possibilities
Random variables
• if it can take values in a real interval we say that the
random variable is continuous
• e.g. consider the sample space of weather temperature
– we know that it could be any
number between -50 and
150 degrees
– random variable T ∈ [-50,150]
– note that the extremes do not have to be very precise, we can just say that P(T < -45°) = 0
• most probability notions apply equally well to discrete and continuous random variables
Discrete RV
• for a discrete RV the probability assignments are given by a probability mass function (PMF)
– this can be thought of as a
normalized histogram
– satisfies the following
properties
0 ≤ PX(a) ≤ 1, ∀ a
∑_a PX(a) = 1
[figure: normalized histogram of PX, with the bar at a = 14 having height α]
• example for the random variable
– X ∈ {1, 2, 3, …, 20}, where X = i if the grade of student z in the class is between 5i and 5(i+1)
– reading off the histogram, we see that PX(14) = α
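A small Python sketch (not from the slides) of a PMF built as a normalized histogram; the grade list is invented purely for illustration:

```python
from collections import Counter

# hypothetical grades on a 0-100 scale (made-up data, for illustration only)
grades = [72, 68, 91, 55, 70, 74, 88, 63, 70, 95, 71, 73]

# X = i if the grade falls in [5i, 5(i+1))
X = [g // 5 for g in grades]

counts = Counter(X)
n = len(X)
pmf = {i: c / n for i, c in counts.items()}   # normalized histogram

# check the PMF properties: 0 <= PX(a) <= 1 and sum_a PX(a) = 1
assert all(0 <= p <= 1 for p in pmf.values())
assert abs(sum(pmf.values()) - 1) < 1e-12

print(pmf.get(14, 0.0))   # PX(14): fraction of grades in [70, 75)
```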
Continuous RV
• for a continuous RV the probability assignments are given
by a probability density function (PDF)
– this is just a continuous
function
– satisfies the following
properties
0 ≤ PX(a), ∀ a
∫ PX(a) da = 1
• example for the Gaussian random variable of mean µ and
variance σ2
PX(a) = 1/(√(2π) σ) exp{ -(a - µ)² / (2σ²) }
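A quick Python sketch (not from the slides) of this density and a numerical check that it integrates to 1; the Riemann sum over a wide grid stands in for the integral over the real line:

```python
import numpy as np

def gaussian_pdf(a, mu=0.0, sigma=1.0):
    """Gaussian PDF with mean mu and standard deviation sigma."""
    return np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# check numerically that the density integrates to 1 (Riemann sum on a wide grid)
a = np.linspace(-10, 10, 200001)
da = a[1] - a[0]
print(np.sum(gaussian_pdf(a)) * da)   # ~1.0
```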
Discrete vs continuous RVs
• in general the same, up to replacing summations by
integrals
• note that PDF means “density of probability”
– this is probability per unit length
– the probability of any particular value is always zero (unless there is a discontinuity, i.e. a point mass)
– we can only talk about
Pr(a ≤ X ≤ b) = ∫_a^b PX(t) dt
– note also that PDFs are not upper bounded
– e.g. the Gaussian approaches a Dirac delta as the variance goes to zero
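A short Python sketch (not from the slides) of the interval probability above, using the standard Gaussian as a convenient example and a Riemann sum for the integral:

```python
import numpy as np

def gaussian_pdf(a, mu=0.0, sigma=1.0):
    return np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def prob_interval(lo, hi, mu=0.0, sigma=1.0, n=100001):
    """Pr(lo <= X <= hi) as a Riemann sum of the PDF over [lo, hi]."""
    t = np.linspace(lo, hi, n)
    return np.sum(gaussian_pdf(t, mu, sigma)) * (t[1] - t[0])

print(prob_interval(-1, 1))       # ~0.683 for the standard Gaussian
print(prob_interval(0.5, 0.5))    # a single point carries zero probability
```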
Multiple random variables
• frequently we have problems with multiple random
variables
– e.g. at the doctor’s, you are mostly a collection of random variables
▪ x1: temperature
▪ x2: blood pressure
▪ x3: weight
▪ x4: cough
▪ …
• we can summarize this as
– a vector X = (x1, …, xn) of n random variables
– PX(x1, …, xn) is the joint probability distribution
Marginalization
P(cold)?
• important notion for multiple random
variables is marginalization
– e.g. having a cold does not depend on
blood pressure and weight
– all that matters are fever and cough
– that is, we need to know PX1,X4(a,b)
• we marginalize with respect to a subset of variables
– (in this case X1 and X4)
– this is done by summing (or integrating) the others out
PX1,X4(x1, x4) = ∑_{x2,x3} PX1,X2,X3,X4(x1, x2, x3, x4)            (discrete)
PX1,X4(x1, x4) = ∫∫ PX1,X2,X3,X4(x1, x2, x3, x4) dx2 dx3           (continuous)
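A minimal Python/NumPy sketch (not from the slides) of marginalization over a joint table; the 4-D array of random numbers is only a stand-in for a real joint distribution:

```python
import numpy as np

# stand-in joint PX1,X2,X3,X4 as a 4-D table (axes: x1, x2, x3, x4);
# the entries are arbitrary, only the normalization matters here
rng = np.random.default_rng(0)
joint = rng.random((3, 4, 5, 2))
joint /= joint.sum()              # make it a valid joint: entries sum to 1

# marginalize: sum x2 and x3 out to get PX1,X4(x1, x4)
p_x1_x4 = joint.sum(axis=(1, 2))

print(p_x1_x4.shape)              # (3, 2): one entry per (x1, x4) pair
print(p_x1_x4.sum())              # still 1.0: a valid distribution
```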
Conditional probability
PY|X(sick | cough)?
• another very important notion:
– so far, doctor has PX1,X4(fever,cough)
– still does not allow a diagnosis
– for this we need a new variable Y with
two states Y ∈ {sick, not sick}
– doctor measures fever and cough levels,
these are no longer unknowns, or even
random quantities
– the question of interest is “what is the probability that patient is
sick given the measured values of fever and cough?”
• this is exactly the definition of conditional probability
– what is the probability that Y takes a given value given
observations for X
PY|X1,X4(sick | 98, high)
Conditional probability
• note the very important difference between conditional
and joint probability
• joint probability is a hypothetical question with respect to
all variables
– what is the probability that you will be sick and cough a lot?
PY,X(sick, cough)?
Conditional probability
• conditional probability means that you know the values of
some variables
– what is the probability that you are sick given that you cough a
lot?
PY|X(sick | cough)?
– “given” is the key word here
– conditional probability is very important because it allows us to
structure our thinking
– shows up again and again in design of intelligent systems
Conditional probability
• fortunately it is easy to compute
– we simply normalize the joint by the probability of what we know
PY|X1(sick | 98) = PY,X1(sick, 98) / PX1(98)
– note that this makes sense since
PY|X1(sick | 98) + PY|X1(not sick | 98) = 1
– and, by the marginalization equation,
PY,X1(sick, 98) + PY,X1(not sick, 98) = PX1(98)
– the definition of conditional probability
▪ just makes these two statements coherent
▪ simply says that, given what we know, we still have a valid probability measure
▪ the universal event {sick} U {not sick} still has probability 1 after the observation
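A small Python sketch (not from the slides) of this normalization; the joint table over Y and a few temperature values is invented for illustration:

```python
import numpy as np

# stand-in joint PY,X1 over Y in {sick, not sick} and a few temperature values;
# the numbers are invented for illustration only
temps = [97, 98, 99, 104]
joint = np.array([[0.01, 0.02, 0.05, 0.10],    # row 0: P(Y = sick, X1 = t)
                  [0.20, 0.40, 0.20, 0.02]])   # row 1: P(Y = not sick, X1 = t)

j = temps.index(98)
p_x1 = joint[:, j].sum()           # PX1(98), by marginalization
cond = joint[:, j] / p_x1          # PY|X1(. | 98) = PY,X1(., 98) / PX1(98)

print(cond)          # [P(sick | 98), P(not sick | 98)]
print(cond.sum())    # 1.0: still a valid probability measure after observation
```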
The chain rule of probability
• is an important consequence of the definition of
conditional probability
– note that, from this definition,
PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)
– more generally, it has the form
PX1,X2,...,Xn(x1, x2, ..., xn) = PX1|X2,...,Xn(x1 | x2, ..., xn) × PX2|X3,...,Xn(x2 | x3, ..., xn) × ... × PXn-1|Xn(xn-1 | xn) PXn(xn)
• combination with marginalization allows us to make hard
probability questions simple
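A tiny Python sketch (not from the slides) checking the two-variable chain rule on a small table; the joint numbers are invented:

```python
import numpy as np

# a small invented joint PY,X1 (rows: y, columns: x1), used to check
# the two-variable chain rule PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)
joint = np.array([[0.01, 0.02, 0.05, 0.10],
                  [0.20, 0.40, 0.20, 0.02]])

p_x1 = joint.sum(axis=0)              # PX1(x1), marginalizing y out
cond = joint / p_x1                   # PY|X1(y | x1), column-wise normalization
rebuilt = cond * p_x1                 # chain rule: conditional times marginal

print(np.allclose(rebuilt, joint))    # True
```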
The chain rule of probability
• e.g. what is the probability that you will be sick and have
104° of fever?
PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
– breaks down a hard question (prob of sick and 104) into two
easier questions
– Prob (sick|104): everyone knows that this is close to one
You have a cold!
PY|X(sick | 104) = 1!
The chain rule of probability
• e.g. what is the probability that you will be sick and have
104° of fever?
PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
– Prob(104): still hard, but easier than P(sick, 104) since we now only have one random variable (temperature)
– does not depend on sickness, it is just the question “what is the probability that someone will have 104°?”
▪ gather a number of people, measure their temperatures, and make a histogram that everyone can use after that
The chain rule of probability
• in fact, the chain rule is so handy, that most times we use
it to compute probabilities
– e.g. PY(sick) = ∫ PY,X1(sick, t) dt            (marginalization)
= ∫ PY|X1(sick | t) PX1(t) dt
– in this way we can get away with knowing
▪ PX1(t), which we may know because it was needed for some other problem
▪ PY|X1(sick | t), which we can ask a doctor for or approximate with a rule of thumb
PY|X1(sick | t) ≈ 1 if t > 102,  0.5 if 98 < t < 102,  0 if t < 98
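A Python sketch (not from the slides) of this computation: it uses the rule of thumb above, together with an assumed Gaussian PX1(t) centered at 98.6 — an invented choice, purely for illustration:

```python
import numpy as np

def p_sick_given_t(t):
    """Rule of thumb from the slide: 1 above 102, 0.5 between 98 and 102, 0 below 98."""
    return np.where(t > 102, 1.0, np.where(t > 98, 0.5, 0.0))

def p_t(t, mu=98.6, sigma=1.0):
    """Assumed PX1(t): a Gaussian around 98.6 (an invented stand-in distribution)."""
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# PY(sick) = integral of PY|X1(sick | t) PX1(t) dt, approximated by a Riemann sum
t = np.linspace(90, 110, 200001)
dt = t[1] - t[0]
print(np.sum(p_sick_given_t(t) * p_t(t)) * dt)
```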
Independence
• another fundamental concept for multiple variables
– two variables are independent if the joint is the product of the
marginals
PX1,X2(a, b) = PX1(a) PX2(b)
– note: implies that
PX1|X2(a | b) = PX1,X2(a, b) / PX2(b) = PX1(a)
– which captures the intuitive notion:
– “if X1 is independent of X2, knowing X2 does not change the
probability of X1”
▪ e.g. knowing that it is sunny does not change the probability that it will rain in three months
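A short Python sketch (not from the slides) checking independence for the two dice from the earlier slides:

```python
import numpy as np

# two fair dice: uniform marginals and a uniform joint of 1/36 per pair
p_x1 = np.full(6, 1 / 6)
p_x2 = np.full(6, 1 / 6)
joint = np.full((6, 6), 1 / 36)

# independence: the joint equals the product of the marginals
print(np.allclose(joint, np.outer(p_x1, p_x2)))   # True

# and the conditional equals the marginal: PX1|X2(a | b) = PX1(a)
cond = joint[:, 0] / joint[:, 0].sum()            # condition on x2 = 1
print(np.allclose(cond, p_x1))                    # True
```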
Independence
• extremely useful in the design of intelligent
systems
– frequently, knowing X makes Y independent of Z
– e.g. consider the shivering symptom:
▪ if you have a fever you sometimes shiver
▪ it is a symptom of having a cold
▪ but once you measure the temperature, the two become independent
PY,X1,S(sick, 98, shiver) = PY|X1,S(sick | 98, shiver) PS|X1(shiver | 98) PX1(98)
= PY|X1(sick | 98) PS|X1(shiver | 98) PX1(98)
• simplifies considerably the estimation of the probabilities
Independence
• useful property: if you add two independent random
variables their probability distributions convolve
– i.e. if z = x + y and x,y are independent then
PZ(z) = PX(z) * PY(z)
where * is the convolution operator
– for discrete random variables
PZ(z) = ∑_k PX(k) PY(z - k)
– for continuous random variables
PZ(z) = ∫ PX(t) PY(z - t) dt
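A minimal Python sketch (not from the slides) of the discrete case, using the sum of two fair dice:

```python
import numpy as np

# PMF of one fair die over the values 1..6
die = np.full(6, 1 / 6)

# sum of two independent dice: the PMFs convolve
p_sum = np.convolve(die, die)      # index i corresponds to the sum i + 2

for s, p in zip(range(2, 13), p_sum):
    print(s, round(p, 4))          # e.g. P(sum = 7) = 6/36 ≈ 0.1667
```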
Moments
• important properties of
random variables
– summarize the distribution
• important moments
– mean: µ = E[x]
– variance: σ2 = E[(x-µ)2]
– various distributions are completely specified by a small number
of moments
             discrete                        continuous
mean         µ = ∑_k PX(k) k                 µ = ∫ PX(k) k dk
variance     σ² = ∑_k PX(k) (k - µ)²         σ² = ∫ PX(k) (k - µ)² dk
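A short Python sketch (not from the slides) applying the discrete formulas to a fair die:

```python
import numpy as np

# fair die: PX(k) = 1/6 for k = 1..6
k = np.arange(1, 7)
p = np.full(6, 1 / 6)

mu = np.sum(p * k)                 # mean: sum_k PX(k) k
var = np.sum(p * (k - mu) ** 2)    # variance: sum_k PX(k) (k - mu)^2

print(mu)    # 3.5
print(var)   # 35/12 ≈ 2.9167
```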
Mean
• µ = E[x] is the center of mass of the distribution

             discrete                  continuous
mean         µ = ∑_k PX(k) k           µ = ∫ PX(k) k dk
• is a linear quantity
– if Z = X + Y, then E[Z] = E[X] + E[Y]
– this does not require any special
relation between X and Y
– always holds
• other moments are the mean of powers of X
– the nth order moment is E[X^n]
– the nth central moment is E[(X - µ)^n]
Variance
• σ2 = E[(x-µ)2] measures the dispersion around the mean
– it is the second central moment
             discrete                        continuous
variance     σ² = ∑_k PX(k) (k - µ)²         σ² = ∫ PX(k) (k - µ)² dk
• in general, not linear
– if Z = X + Y, then Var[Z] = Var[X] + Var[Y]
– only holds if X and Y are independent
• it is related to the 2nd order moment by
σ² = E[(x - µ)²] = E[x² - 2xµ + µ²] = E[x²] - 2E[x]µ + µ² = E[x²] - µ²
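A final Python sketch (not from the slides) verifying this identity for the fair die used above:

```python
import numpy as np

# check sigma^2 = E[x^2] - mu^2 for the fair die
k = np.arange(1, 7)
p = np.full(6, 1 / 6)

mu = np.sum(p * k)
var_central = np.sum(p * (k - mu) ** 2)            # E[(x - mu)^2]
var_from_moments = np.sum(p * k ** 2) - mu ** 2    # E[x^2] - mu^2

print(np.isclose(var_central, var_from_moments))   # True
```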