EE 278
Lecture Notes #2
Winter 2010–2011

Probability

Review and elaboration of basic probability + simple examples of random variables, vectors, and processes.
Topics: probability spaces, fair spinner, one coin flip, multiple coin flips, a Bernoulli random process; pdfs and pmfs.

Probability Space

Probability assigns a measure (like length, area, volume, weight, or mass) to events = sets in some space.
Usually involves sums (discrete probability) or integrals (continuous probability).
Basic construct is a probability space or experiment (Ω, F, P), which consists of three items:
1. Sample space Ω = an abstract space of elements called points
2. Event space F = collection of subsets of Ω called events, to which probabilities are assigned
3. Probability measure P = assignment of real numbers to events consistent with a set of axioms
Consider each component in order.

Sample space Ω: An abstract space of elements called (sample) points.
Intuition: contains all distinguishable elementary outcomes or finest grain results of an experiment.
E.g., {H, T}, {0, 1}, [0, 1), R^k
Event space F (sigma-field): Collection of subsets of Ω such that
a) If F ∈ F, then also F^c ∈ F
b) If Fi ∈ F, i = 1, 2, ..., then also ∪i Fi ∈ F.
Intuition: Algebraic structure. a)-b) + set theory ⇒ countable set-theoretic operations (union, intersection, complementation, difference) on events produce new events. Ω ∈ F, ∅ ∈ F.
Note F ⊂ Ω, but F ∈ F (set inclusion vs. element inclusion).
E.g., all subsets of Ω.
Event spaces are not an issue in the elementary case where Ω is discrete: use F = all subsets of Ω (the power set of Ω).
Event spaces are an issue in continuous spaces, where the power set is too large for a useful theory. If Ω = R, use the Borel field B(R) (the smallest event space containing the intervals).

Probability measure P: An assignment of a number P(F) to every event F ∈ F in a way that satisfies Kolmogorov's axioms of probability:
1. P(F) ≥ 0 for all F ∈ F
2. P(Ω) = 1
3. If Fi ∈ F, i = 1, 2, ... are disjoint or mutually exclusive (Fi ∩ Fj = ∅ if i ≠ j), then
P(∪i Fi) = Σi P(Fi)
The axioms are enough to get a useful calculus of probability + useful mathematical models of random processes with predictable long-term behavior (laws of large numbers or ergodic theorems, central limit theorem).
Example: Spinning pointer

Introduce several fundamental ideas in the context of two simple examples: a fair spinning pointer (or wheel) and a single coin flip. Then consider many coin flips.
Spin a fair pointer in a circle (picture a wheel marked 0.0, 0.25, 0.5, 0.75). When the pointer stops it can point to any number in the unit interval
Ω = [0, 1) = {r : 0 ≤ r < 1}
Describe (Ω, F, P):
Sample space = Ω = [0, 1)
Event space = F = the smallest event space containing all of the intervals, called B([0, 1)), the Borel field of [0, 1).
Probability measure: For the fair spinner, the probability that the outcome is a point in F ∈ B([0, 1)) is
P(F) = ∫_F f(x) dx
where
f(x) = 1, x ∈ [0, 1),
is a uniform probability density function (pdf).
E.g., if F = [a, b] = {r : a ≤ r ≤ b} with 0 ≤ a ≤ b < 1, then P(F) = b − a.
The probability of the pointer landing in an interval of length b − a is b − a, the fraction of the sample space corresponding to the event. Intuitive!
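The uniform spinner is easy to sanity check numerically. The following sketch (not part of the original notes; it assumes Python with NumPy is available) estimates P([a, b]) by simulating spins and compares it with the length b − a.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the fair spinner: outcomes uniformly distributed on [0, 1).
spins = rng.random(1_000_000)

# Event F = [a, b] with 0 <= a <= b < 1; its probability should be b - a.
a, b = 0.25, 0.70
p_hat = np.mean((spins >= a) & (spins <= b))   # empirical fraction of spins landing in F
print(f"estimated P([{a}, {b}]) = {p_hat:.4f}, theoretical b - a = {b - a:.4f}")
```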
Same result if instead we set Ω = R, F = B(R), and pdf
f(r) = 1 if r ∈ [0, 1), 0 otherwise.

In general, f(x), x ∈ Ω is a pdf if
1. f(x) ≥ 0, all x,
2. ∫_Ω f(x) dx = 1.
pdf ⇒ a probability measure by integration:
P(F) = ∫_F f(x) dx
Can also write
P(F) = ∫ 1_F(x) f(x) dx,
where the indicator function of F is given by
1_F(r) = 1 if r ∈ F, 0 otherwise.

Comments:
• The integrals in practice are Riemann; in theory they are Lebesgue (better limiting properties). In most cases the two integrals are the same. Important in research, not so much in practice, but good to know the language when reading the literature.
• Event space details ⇒ the integrals make sense.
• The axioms of probability are properties of integration in disguise. See the next slide.
PDFs and the axioms of probability

Suppose f is a pdf (nonnegative, ∫_Ω f(x) dx = 1) and
P(F) = ∫_F f(x) dx = ∫ 1_F(x) f(x) dx.
Then
• Probabilities are nonnegative, since integrating a nonnegative argument yields a nonnegative result. (Axiom 1)
• The probability of the entire sample space is 1, since integrating the pdf over the entire sample space yields 1 (for the uniform spinner, integrating 1 over the unit interval). (Axiom 2)
• The probability of a finite union of disjoint regions is the sum of the probabilities of the individual events, since integration is linear:
P(F ∪ G) = ∫ 1_{F∪G}(r) f(r) dr = ∫ (1_F(r) + 1_G(r)) f(r) dr
         = ∫ 1_F(r) f(r) dr + ∫ 1_G(r) f(r) dr = P(F) + P(G).
(Part of Axiom 3, from linearity of integration.)
Need to show this for a countable number of disjoint events to get Axiom 3. Is the above enough? Unfortunately, no for the Riemann integral. True if we use the Lebesgue integral. (We do not pursue the details.)
Example Probability Spaces: Single coin flip

Develop in two ways:
Model I: Direct description. Simplest way, fine if all you care about is one coin flip.
Model II: As a random variable (or signal processing) defined on the fair spinner. Will define lots of other random variables on the same space.

Coin Flip Model I: Direct description

Sample space Ω = {0, 1}; using numbers instead of {head, tail} will prove convenient.
Event space F = {{0}, {1}, Ω = {0, 1}, ∅}, all subsets of Ω (the power set of Ω).
Probability measure P: define in terms of a sum of a probability mass function, analogous to the integral of a pdf for the spinner.
Given a discrete sample space (= countable number of elements) Ω, a probability mass function (pmf) p(ω) is a nonnegative function such that Σ_{ω∈Ω} p(ω) = 1.
Given a pmf p, define P by P({ω}) = p(ω) plus Axiom 3:
P(F) = Σ_{x∈F} P({x}) = Σ_{x∈F} p(x) = Σ_x 1_F(x) p(x)
Again: the theory of sums ⇒ the axioms of probability are satisfied; obvious if Ω is finite.
For the fair coin, set p(1) = p(0) = 1/2. For a biased coin, it is common to use p(1) = p, p(0) = 1 − p for a parameter p ∈ (0, 1).

Notes:
• Probabilities are defined on sets: P(F). pdfs and pmfs are defined on points!
• In the discrete case, one-point (singleton) sets have possibly nonzero probability, e.g.,
P({0}) = p(0) = 1/2
A pmf gives the probability of something. A pdf is not the probability of anything; e.g., in our uniform spinner case
P({1/2}) = ∫_{{1/2}} 1 dx = 0.
If P is determined by a pdf, the probability of an individual point (e.g., 1/π) is 0. Must integrate a pdf to get a probability.
Coin Flip Model II: Random variable on spinner

Simple example of signal processing: the fair spinner produces a signal or input r, and a quantizer operates on the signal to produce a value q(r). Think A/D conversion or a threshold decision rule based on a noisy observation.
Suppose (Ω, F, P) describes the uniform spinner: Ω = [0, 1), P described by the pdf f(r) = 1 for r ∈ [0, 1).
Define a measurement q made on the outcome of the spin by a simple quantizer:
q(r) = 1_{[0.5,1)}(r) = 1 if r ∈ [0.5, 1), 0 if r ∈ [0, 0.5)    (1)
Simple example of a random variable = a real-valued function defined on the sample space, q(r), r ∈ Ω.
Original probability space + function ⇒ new probability space with binary sample space Ωq = {0, 1} and event space Fq = the power set of {0, 1}. The new space inherits a probability measure, say Pq, from the old one.
The output space is discrete ⇒ only need a pmf to characterize the probability measure:
pq(0) = P({r : q(r) = 0}) = ∫_0^0.5 1 dr = 1/2
Similarly pq(1) = P({r : q(r) = 1}) = ∫_0.5^1 1 dr = 1/2
Often written informally as Pr(q = 0) or P{q = 0}, shorthand for "the probability that the random variable q takes on the value 0."
Can use the pmf to find any probability involving q:
Pq(F) = Σ_{x∈F} pq(x)
Notes:
• Multiple ways to arrive at the same model: equivalent models. Directly in terms of P, or indirectly in terms of a probability space + random variable.
• Derived a new probability measure from the old one + the function q. The probability measure Pq describing the output of the random variable q is called the distribution of q.
In general,
Pq(F) = P({ω : q(ω) ∈ F}) = P(q^−1(F))
relates probability in the new space to probability in the old space, where q^−1(F) = the inverse image of the set F under the mapping q.
Basic example of a derived distribution: (Ω, F, P) + q ⇒ (Ωq, Fq, Pq)
• This basic idea will be seen often: probability space + function (random variable) ⇒ new probability space with probability measure = distribution of the random variable, given by the inverse image formula. General solution: the inverse image method Pq(F) = P(q^−1(F)). Will see many tools and tricks for doing the actual calculus.
• Using Model II, we can define more random variables on a common experiment, e.g., two coin flips or even an infinite number. Model I is only good for a single coin flip.
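As a rough illustration of the inverse image idea (my own sketch in Python/NumPy, not part of the original notes), the quantizer q of Eq. (1) can be applied to simulated spins; the empirical pmf of the output matches pq(0) = pq(1) = 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.random(500_000)          # spins from the uniform spinner on [0, 1)

q = (r >= 0.5).astype(int)       # the quantizer q(r) = 1_[0.5,1)(r) of Eq. (1)

# Derived (output) pmf: P_q({x}) = P(q^{-1}({x})) = fraction of spins mapped to x
for x in (0, 1):
    print(f"p_q({x}) ~= {np.mean(q == x):.4f}   (exact value 0.5)")
```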
Two coin flips

Let (Ω, F, P) be the fair spinner experiment. Rename the quantizer q as X (common to use upper case for random variables). Define another random variable Y on the same space:
X(ω) = 0 if ω ∈ [0, 0.5), 1 if ω ∈ [0.5, 1.0)
Y(ω) = 0 if ω ∈ [0, 0.25) ∪ [0.5, 0.75), 1 if ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)
(Picture the unit interval split into quarters: X = 0 on the first half and 1 on the second, while Y alternates 0, 1, 0, 1 across the quarters.)
A single experiment ⇒ values of two random variables X and Y, or a single 2D random vector (X, Y).
Easy to compute the pmfs for the individual random variables X and Y (marginal pmfs):
pX(k) = pY(k) = 1/2; k = 0, 1
Equivalent random variables, same pmf.
Now we also have the joint pmf of the two rvs together: the inverse image formula ⇒
pXY(x, y) = Pr(X = x, Y = y) = P({ω : (X, Y)(ω) = (x, y)})
Do the math: pXY(x, y) = 1/4; x = 0, 1; y = 0, 1. E.g.,
pXY(0, 1) = P({ω : ω ∈ [0, 0.5)} ∩ {ω : ω ∈ [0.25, 0.5) ∪ [0.75, 1.0)}) = P({ω : ω ∈ [0.25, 0.5)}) = 1/4
Notes:
• For this example pXY(x, y) = pX(x) pY(y), a product pmf. If two discrete random variables satisfy this, they are said to be independent (will discuss more later).
• Here we separately derived the joint pmf pXY and the marginal pmfs pX, pY. Alternatively, could compute the marginals from the joint using total probability:
pX(x) = P({ω : X(ω) = x}) = Σ_{y∈ΩY} P({ω : X(ω) = x, Y(ω) = y}) = Σ_{y∈ΩY} pXY(x, y)
pY(y) = Σ_{x∈ΩX} pXY(x, y)
Joint and marginal pmfs are consistent: can get the marginals either from the original P or from the joint pmf pXY.
Can define any number of random variables on a common probability space. An infinite collection of random variables such as {Xn; n = 0, 1, 2, ...} defined on a common probability space is called a random process.
Extend the example: a Bernoulli random process.
A Bernoulli Random Process: Fair Coin Flips

Again let Ω = [0, 1) and P be determined by the uniform pdf.
Every number u ∈ [0, 1) has a binary representation
u = Σ_{n=0}^∞ bn(u) 2^(−n−1),
the binary analog of the decimal representation. It is unique if we choose the representation {bn} not having a finite number of 0s, e.g., choose 1/2 → 1000···, not 0111···.
E.g., if u = 3/4, then b0(u) = b1(u) = 1 and bn(u) = 0 for all n ≥ 2.
Xn(u) = bn(u) defines a discrete-time random process Xn; n = 0, 1, 2, ...: one experiment ⇒ an infinite number of rvs!
In the earlier 2D example, X = X0, Y = X1.
Similar computations to the 2D example (inverse image formula) ⇒
• marginal pmfs pXn(x) = 1/2, x = 0, 1 (all equivalent, a fair coin flip)
• for any k = 1, 2, ..., the joint pmf describing the random vector X^k = (X0, ..., X_{k−1}) is
pX^k(x^k) = Pr(X^k = x^k) = 2^(−k); x^k ∈ {0, 1}^k
Hence
pX^k(x^k) = Π_{n=0}^{k−1} pXn(xn); x^k ∈ {0, 1}^k,
a product pmf.
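A small sketch of this construction (my own Python example, not from the notes): draw u uniformly on [0, 1), read off its binary digits bn(u), and check that the first k digits behave like fair coin flips with the product pmf 2^(−k).

```python
from collections import Counter
import numpy as np

def binary_digits(u, k):
    """Return the first k binary digits b0(u), ..., b_{k-1}(u) of u in [0, 1)."""
    digits = []
    for _ in range(k):
        u *= 2
        bit = int(u)      # the integer part is the next digit
        digits.append(bit)
        u -= bit
    return tuple(digits)

rng = np.random.default_rng(2)
k, trials = 3, 200_000
counts = Counter(binary_digits(u, k) for u in rng.random(trials))

# Each of the 2^k binary k-tuples should appear with relative frequency ~ 2^(-k) = 1/8.
for pattern in sorted(counts):
    print(pattern, round(counts[pattern] / trials, 4))
```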
A collection of discrete random variables is said to be (mutually) independent if the joint pmf = the product of the marginal pmfs.
A random vector is said to be independent identically distributed or iid (or i.i.d. or IID) if its component random variables are independent and identically distributed.
A random process {Xn} is iid if any finite collection of random variables in the collection is iid.
An iid process is also called a Bernoulli process (sometimes the name is reserved for binary processes). The classic example is an infinite sequence of fair coin flips.

End of the extended example of the uniform spinner and Bernoulli process; back to general material. Elaborate on Ω, F, P.

Probability spaces: sample space Ω

Common examples:
• {0, 1}: coin flip, value in a data stream at time t
• [0, 1): fair spin, analog sensor output
• Zk = {0, 1, ..., k − 1}: die roll, ASCII character, card drawn from a deck
• [0, ∞): time to arrival of the first packet, bus, or customer
• Z+ = {0, 1, 2, ...}: number of packets/buses/customers arriving in [0, T)
• R = (−∞, ∞): voltage at a sensor (without known bound)
• {0, 1}^k = all binary k-tuples (a product space): flip one coin k times, flip k coins once, sample k successive values in a data stream
• [0, 1)^k = the k-dimensional unit cube: measurements from k identical sensors at different locations
• R^k = k-dimensional Euclidean space: voltages of an array of k sensors
• etc., e.g., Z^k, sequence spaces such as all binary sequences, waveform spaces such as all continuous, differentiable waveforms

Probability spaces: event space F

Given Ω, the smallest event space is {∅, Ω}.
The biggest event space is the power set. (Too big to be useful for continuous spaces.)
A useful event space in R and R^k is the Borel field = the smallest event space containing all of the intervals (k = 1) and rectangles (k > 1).
(Not needed for HW or exams; useful to have an idea when you encounter it in books and papers.)
A measurable space (Ω, F) = a sample space + an event space of subsets of the sample space.
Probability spaces: probability measure

Key point: A set function P defined on an event space F of subsets of a sample space Ω is a probability measure if and only if (iff) it satisfies the axioms of probability.
When dealing with finite sample spaces, we only need Axiom 3 to hold for finite collections of disjoint events.

A trivial example

The simplest possible example is useless except for providing a trivial example.
Ω is any abstract space
F = {Ω, ∅}
P defined by P(Ω) = 1, P(∅) = 0
The axioms of a probability measure are satisfied.

Simple example: biased coin

The simplest nontrivial example:
Ω = {0, 1}
F = {{0}, {1}, Ω = {0, 1}, ∅}: the axioms of an event space are satisfied.
P defined by
P(F) = 1 − p if F = {0}, p if F = {1}, 0 if F = ∅, 1 if F = Ω,
where p ∈ (0, 1) is a fixed parameter (p = 0 or 1 is a variation on the trivial probability space).
The axioms can be verified in a straightforward manner.

In general we cannot do it this way and list the probabilities of every event: there are too many events. Instead provide a formula for computing the probabilities of events as integrals (of a pdf) or sums (of a pmf).
Will see many common examples; most have names (uniform, binomial, geometric, Poisson, Gaussian, Bernoulli, exponential, Laplacian, ...).
Before more examples, derive several fundamental properties of probability: several elementary and one advanced.
Elementary properties of probability
(Ω, F, P)
(a) For all events F, P(F^c) = 1 − P(F).
(b) For all events F, P(F) ≤ 1.
(c) Let ∅ be the null or empty set; then P(∅) = 0.
(d) Total Probability: If events {Fi; i = 1, 2, ...} form a partition of Ω, i.e., if Fi ∩ Fk = ∅ when i ≠ k and ∪i Fi = Ω, then for any event G
P(G) = Σi P(G ∩ Fi).
(e) If G ⊂ F for events G, F, then P(F − G) = P(F) − P(G)
(F − G ≜ F ∩ G^c, also written F\G).

Proof
(a) F ∪ F^c = Ω ⇒ P(F ∪ F^c) = 1 (Axiom 2). F ∩ F^c = ∅ ⇒ 1 = P(F ∪ F^c) = P(F) + P(F^c) (Axiom 3).
(b) P(F) = 1 − P(F^c) ≤ 1 (Axiom 1 and (a) above).
(c) By Axiom 2 and (a) above, P(Ω^c) = P(∅) = 1 − P(Ω) = 0.
Note: The empty set ∅ has probability 0, but P(F) = 0 does not mean F = ∅. E.g., for the uniform spinner, F = {1/n : n = 1, 2, 3, ...} has probability 0.
(d) Using set theory, Axiom 3, and the fact that the sets G ∩ Fi are disjoint:
P(G) = P(G ∩ Ω) = P(G ∩ (∪i Fi)) = P(∪i (G ∩ Fi)) = Σi P(G ∩ Fi).
(e) F − G = F ∩ G^c and G are disjoint, so Axiom 3 ⇒
P((F − G) ∪ G) = P(F − G) + P(G).
Since G ⊂ F, G = F ∩ G ⇒
(F − G) ∪ G = (F ∩ G^c) ∪ (F ∩ G) = F ∩ (G^c ∪ G) = F.
Thus P(F) = P(F − G) + P(G).
Note (e) ⇒ that if G ⊂ F, then P(G) ≤ P(F).

An advanced property of probability: Continuity

A sequence of sets Fn, n = 0, 1, 2, ... is increasing if Fn−1 ⊂ Fn for all n (also called nested), and decreasing if Fn ⊂ Fn−1 for all n.
[Figure: (a) increasing sets F1 ⊂ F2 ⊂ F3 ⊂ F4 in Ω, (b) decreasing sets F4 ⊂ F3 ⊂ F2 ⊂ F1.]
E.g., increasing: Fn = [0, n), Fn = [1, 2 − 1/n), Fn = (−n, a).
Decreasing: Fn = [1, 3 + 1/n), Fn = (1 − 1/n, 1 + 1/n).

Natural definition of the limit of increasing sets:
lim_{n→∞} Fn ≜ ∪_{n=1}^∞ Fn
E.g.,
lim_{n→∞} [0, n) = [0, ∞)
lim_{n→∞} [1, 2 − 1/n) = [1, 2)
lim_{n→∞} (−n, a) = (−∞, a)

Natural definition of the limit of decreasing sets:
lim_{n→∞} Fn ≜ ∩_{n=1}^∞ Fn
E.g.,
lim_{n→∞} [1, 3 + 1/n) = [1, 3]
lim_{n→∞} (1 − 1/n, 1 + 1/n) = {1}

There is no natural definition of a limit of an arbitrary sequence of sets, only for increasing or decreasing sequences.
Continuity of probability: If Fn is an increasing or decreasing sequence of events, then
P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)

Prove it for increasing sets {Fn; n = 0, 1, 2, ...}.
Recall the set theory difference A − B = A ∩ B^c = the points in A that are not in B.
Define G0 = F0 and Gn = Fn − Fn−1 for n = 1, 2, ...
The Gn are disjoint, ∪_{k=0}^n Gk = Fn, and ∪_{k=0}^∞ Fk = ∪_{k=0}^∞ Gk.
Gn = Fn − Fn−1 and Fn−1 ⊂ Fn ⇒ P(Gn) = P(Fn) − P(Fn−1).
Then
P(lim_{n→∞} Fn) = P(∪_{k=0}^∞ Fk) = P(∪_{k=0}^∞ Gk)
= Σ_{k=0}^∞ P(Gk)    (Axiom 3)
= lim_{n→∞} Σ_{k=0}^n P(Gk)    (definition of an infinite sum)
and the partial sum telescopes, all terms cancelling but the last:
Σ_{k=0}^n P(Gk) = P(F0) + Σ_{k=1}^n (P(Fk) − P(Fk−1)) = P(Fn).
Written out, the "telescoping sum" is
P(Fn) = P(Fn) − P(Fn−1)
      + P(Fn−1) − P(Fn−2)
      + P(Fn−2) − P(Fn−3)
      ...
      + P(F1) − P(F0)
      + P(F0).
⇒ If Fn is a sequence of increasing events, then
P(lim_{n→∞} Fn) = lim_{n→∞} P(Fn)
A similar proof works for decreasing events.
E.g., P((−∞, a]) = lim_{n→∞} P((−n, a]).
P({a}) = lim_{n→∞} P((a − 1/n, a + 1/n)).
If P is described by a pdf, then the probability of a point is 0.
Can show that Axioms 1, 2, and 3 for finite collections of disjoint events + continuity of probability ⇔ Axioms 1–3.
Kolmogorov's countable additivity axiom ensures good limiting behavior.
Back to more concrete issues.
Discrete probability spaces

Introduce several common examples.
Recall the basic construction: Ω = a discrete (countable) set, F = the power set of Ω.
Given a probability measure P on (Ω, F) ⇒ pmf
p(ω) ≜ P({ω}); ω ∈ Ω
Conversely, given a pmf p (nonnegative, sums to 1) ⇒ P via
P(F) = Σ_{ω∈F} p(ω).
Calculus (properties of sums) implies the axioms of probability are satisfied.

Common examples of pmfs

Binary (Bernoulli) pmf. Ω = {0, 1}; p(0) = 1 − p, p(1) = p, where p is a parameter in (0, 1).
Uniform pmf. Ω = Zn = {0, 1, ..., n − 1} and p(k) = 1/n; k ∈ Zn.
Binomial pmf. Ω = Zn+1 = {0, 1, ..., n} and
p(k) = C(n, k) p^k (1 − p)^(n−k); k ∈ Zn+1,
where
C(n, k) = n!/(k!(n − k)!)
is the binomial coefficient (read as "n choose k").
Geometric pmf. Ω = {1, 2, 3, ...} and p(k) = (1 − p)^(k−1) p; k = 1, 2, ..., where p ∈ (0, 1) is a parameter.
Poisson pmf. Ω = Z+ = {0, 1, 2, ...} and p(k) = λ^k e^(−λ)/k!, where λ is a parameter in (0, ∞). (Keep in mind that 0! ≜ 1.)
These are all obviously nonnegative, hence to verify they are pmfs we need only show Σ_{ω∈Ω} p(ω) = 1.
Obvious for the Bernoulli and uniform. For the binomial, use the binomial theorem; for the geometric, use the geometric progression formula; for the Poisson, use the Taylor series expansion for e^x (details later; try it on your own).
These are exercises in the calculus of sums.
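A quick numerical check of these normalizations (my own sketch in plain Python, not from the notes; the geometric and Poisson sums are truncated at a large K, so the totals are only approximately 1):

```python
from math import comb, exp, factorial

n, p, lam, K = 10, 0.3, 2.5, 200   # example parameter values (arbitrary choices)

bernoulli = (1 - p) + p
uniform   = sum(1 / n for _ in range(n))
binomial  = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
geometric = sum((1 - p)**(k - 1) * p for k in range(1, K))            # truncated sum
poisson   = sum(lam**k * exp(-lam) / factorial(k) for k in range(K))  # truncated sum

print(bernoulli, uniform, binomial, geometric, poisson)   # all ~= 1
```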
Before continuing to continuous examples (pdfs), consider more general weighted sums called expectations.

Discrete expectations

Given a pmf p on a discrete sample space Ω.
Suppose that g is a real-valued function defined on Ω: g : Ω → R.
Recall: A real-valued function defined on a probability space is called a random variable.¹
Define the expectation of g (with respect to p) by
E(g) = Σ_{ω∈Ω} g(ω) p(ω).
(Treated in depth later; useful to introduce the basic idea early.)
¹ There is a required technical condition we will see later, but it is automatic for discrete probability spaces with the power set as event space.
Example: If g(ω) = 1F(ω), then
P(F) = E(1F),
⇒ probability can be viewed as a special case of expectation.

Some other important examples: Suppose that Ω ⊂ R, e.g., R, [0, 1), Z or Zn. Fix a pmf p.
If g(ω) = ω: the mean or first moment
m = Σ ω p(ω)
kth moment:
m^(k) = Σ ω^k p(ω),
e.g., m = m^(1).
The second moment is often of interest:
m^(2) = Σ ω² p(ω).
Centralized moments:
Σ (ω − m)^k p(ω).
Most important is the variance
σ² = Σ (ω − m)² p(ω).
Note:
σ² = Σ (ω − m)² p(ω) = Σ (ω² − 2ωm + m²) p(ω)
   = Σ ω² p(ω) − 2m Σ ω p(ω) + m² Σ p(ω)
   = m^(2) − 2m² + m² = m^(2) − m²
variance = second moment − mean².
Will see a similar definition for the continuous case with a pdf, with integrals instead of sums.
For now moments can be viewed as simply attributes of pmfs and pdfs. For many pmfs and pdfs, knowing certain moments completely describes the pmf/pdf.

Computational examples

Discrete uniform pmf on Zn:
P(F) = (1/n) Σ_ω 1F(ω) = #(F)/n,
where #(F) = the number of elements in F.
mean:
m = (1/n) Σ_{k=0}^{n−1} k = (n − 1)/2
second moment:
m^(2) = (1/n) Σ_{k=0}^{n−1} k² = (n − 1)(2n − 1)/6
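A direct brute-force check of these uniform-pmf moments (a sketch in plain Python, not part of the original notes):

```python
n = 12
pmf = {k: 1 / n for k in range(n)}           # uniform pmf on Z_n = {0, ..., n-1}

mean = sum(k * pmf[k] for k in pmf)          # should equal (n - 1)/2
m2   = sum(k**2 * pmf[k] for k in pmf)       # should equal (n - 1)(2n - 1)/6
var  = m2 - mean**2

print(mean, (n - 1) / 2)
print(m2, (n - 1) * (2 * n - 1) / 6)
print(var)
```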
Binomial pmf. To show it is a valid pmf, use the binomial theorem:
(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^(n−k)
Set a = p, b = 1 − p:
Σ_{k=0}^{n} p(k) = Σ_{k=0}^{n} C(n, k) p^k (1 − p)^(n−k) = (p + 1 − p)^n = 1.
mean:
m = Σ_{k=0}^{n} k C(n, k) p^k (1 − p)^(n−k)
Trick: this resembles terms in the binomial theorem, so massage it into that form:
m = Σ_{k=1}^{n} n!/((n − k)!(k − 1)!) p^k (1 − p)^(n−k)
Change variables l = k − 1:
m = Σ_{l=0}^{n−1} n!/((n − l − 1)! l!) p^(l+1) (1 − p)^(n−l−1)
  = np Σ_{l=0}^{n−1} (n − 1)!/((n − 1 − l)! l!) p^l (1 − p)^(n−1−l)
  = np (p + 1 − p)^(n−1) = np.
Finding the mean is messy, but good practice. Later we find shortcuts.
Postpone the second moment until we have a better method.

Geometric pmf. Use the geometric progression: if |a| < 1,
Σ_{k=0}^{∞} a^k = 1/(1 − a).
Set a = 1 − p ⇒ the geometric pmf indeed sums to 1.
mean:
m = Σ_{k=1}^{∞} k p(k) = Σ_{k=1}^{∞} k p (1 − p)^(k−1).
How to evaluate? Can look it up, or use a trick: differentiate the geometric progression formula to obtain
(d/da) Σ_{k=0}^{∞} a^k = Σ_{k=0}^{∞} k a^(k−1) = (d/da) 1/(1 − a) = 1/(1 − a)²
hence
m = 1/p for the geometric pmf.
A similar idea works for the second moment:
m^(2) = Σ_{k=1}^{∞} k² p (1 − p)^(k−1) = p (2/p³ − 1/p²) = (2 − p)/p²
hence
σ² = m^(2) − m² = (1 − p)/p².
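These closed forms are easy to confirm numerically (a sketch assuming Python; the infinite geometric sums are truncated at a large K):

```python
from math import comb

n, p = 20, 0.3

# Binomial mean: sum of k * p(k) should equal n*p.
binom_mean = sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))
print(binom_mean, n * p)

# Geometric moments (support k = 1, 2, ...), truncated at K terms.
K = 2000
geo_mean = sum(k * p * (1 - p)**(k - 1) for k in range(1, K))
geo_m2   = sum(k**2 * p * (1 - p)**(k - 1) for k in range(1, K))
print(geo_mean, 1 / p)                        # mean = 1/p
print(geo_m2 - geo_mean**2, (1 - p) / p**2)   # variance = (1-p)/p^2
```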
Example of probability calculus using the geometric pmf: Find the probabilities of the events F = {k : k ≥ 10} and G = {k : k is odd}.
Note that F = {10, 11, 12, ...} and G = {1, 3, 5, 7, ...}.
Solutions:
P(F) = Σ_{k∈F} p(k) = Σ_{k=10}^{∞} p (1 − p)^(k−1)
     = (p/(1 − p)) Σ_{k=10}^{∞} (1 − p)^k
     = (p/(1 − p)) (1 − p)^10 Σ_{k=10}^{∞} (1 − p)^(k−10)
     = p (1 − p)^9 Σ_{k=0}^{∞} (1 − p)^k = (1 − p)^9,
P(G) = Σ_{k∈G} p(k) = Σ_{k=1,3,...} p (1 − p)^(k−1)
     = p Σ_{k=0,2,4,...} (1 − p)^k = p Σ_{k=0}^{∞} [(1 − p)²]^k
     = p/(1 − (1 − p)²) = 1/(2 − p).
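The same kind of truncated sum confirms these two event probabilities (a sketch, assuming Python):

```python
p, K = 0.3, 5000
pmf = lambda k: p * (1 - p)**(k - 1)          # geometric pmf on {1, 2, ...}

P_F = sum(pmf(k) for k in range(10, K))             # F = {k : k >= 10}
P_G = sum(pmf(k) for k in range(1, K) if k % 2)     # G = {k : k odd}

print(P_F, (1 - p)**9)      # matches (1 - p)^9
print(P_G, 1 / (2 - p))     # matches 1/(2 - p)
```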
Poisson pmf

Show it sums to 1:
Σ_{k=0}^{∞} p(k) = Σ_{k=0}^{∞} λ^k e^(−λ)/k! = e^(−λ) Σ_{k=0}^{∞} λ^k/k! = e^(−λ) × e^λ = 1
(using the Taylor series for e^λ).
mean:
m = Σ_{k=0}^{∞} k p(k) = Σ_{k=1}^{∞} k λ^k e^(−λ)/k! = e^(−λ) Σ_{k=1}^{∞} λ^k/(k − 1)!
Change variables l = k − 1:
m = λ e^(−λ) Σ_{l=0}^{∞} λ^l/l! = λ.
second moment:
m^(2) = Σ_{k=0}^{∞} k² λ^k e^(−λ)/k! = Σ_{k=2}^{∞} k(k − 1) λ^k e^(−λ)/k! + m
Change variables l = k − 2:
m^(2) = λ² Σ_{l=0}^{∞} λ^l e^(−λ)/l! + m = λ² + λ,
so σ² = λ.

Moments crop up in many signal processing applications (and statistical analysis of many kinds), so it is useful to have a table of moment formulas for important distributions handy and an appreciation of the underlying calculus.
Multidimensional pmf's

Vector spaces with discrete coordinates, like Zn^k, are also discrete. The same ideas of describing probabilities by pmfs work.
Suppose A is a discrete space. Then
A^k = {all vectors x = (x0, ..., x_{k−1}) with xi ∈ A, i = 0, 1, ..., k − 1}
is also discrete.
If we have a pmf p on A^k, i.e., p(x) ≥ 0 and Σ_{x∈A^k} p(x) = 1, then P(F) = Σ_{x∈F} p(x) is a probability measure.
A common example of a pmf on vectors is a product pmf:
Product pmf. Suppose {pi; i = 0, 1, ..., k − 1} are one-dimensional pmf's on A (for each i, pi(x) ≥ 0 for all x ∈ A and Σ_{x∈A} pi(x) = 1; each is nonnegative and sums to 1).
Define the product k-dimensional pmf p on A^k by
p(x) = p(x0, x1, ..., x_{k−1}) = Π_{i=0}^{k−1} pi(xi).
Easily seen to be a pmf (nonnegative, sums to 1).
Example: product of Bernoulli pmfs:
pi(x) = p^x (1 − p)^(1−x); x = 0, 1, all i
⇒ product pmf
p(x0, x1, ..., x_{k−1}) = Π_{i=0}^{k−1} p^(xi) (1 − p)^(1−xi) = p^w(x0,x1,...,x_{k−1}) (1 − p)^(k − w(x0,x1,...,x_{k−1})),
where w(x0, x1, ..., x_{k−1}) = the number of ones in the binary k-tuple x0, x1, ..., x_{k−1}, the Hamming weight of the vector.
Will see that when the probability of a bunch of things factors into a bunch of probabilities in this way, things simplify significantly.

Continuous probability spaces

Recall the basic construction using a pdf on the real line R:
(Ω, F) = (R, B(R))
f is a pdf on R if
f(r) ≥ 0, all r ∈ Ω, and ∫_Ω f(r) dr = 1
Define the set function P by
P(F) = ∫_F f(r) dr = ∫ 1F(r) f(r) dr, F ∈ B(R)
Unlike a pmf, a pdf is not the probability of something; it is a density of probability. To get a probability, must integrate a pdf over a set.
Common approximation: Suppose Ω = R and f is a pdf. Consider the event F = [x, x + ∆x), ∆x small.
The mean value theorem of calculus ⇒
P([x, x + ∆x)) = ∫_x^(x+∆x) f(α) dα ≈ f(x) ∆x
Theoretical issue: Does P(F) = ∫_F f(r) dr = ∫ 1F(r) f(r) dr, F ∈ B(R), in fact define a probability measure? That is, does the set function P satisfy Axioms 1–3?
Answer: Yes, with proper care to use the correct event space and notion of integration. As usual, rarely a problem in practice. Need the details in research.
Usually continuous probability just substitutes integration for summation, but pmfs and pdfs are inherently different.

Common examples of pdfs

pdfs are assumed to be 0 outside the specified domain and are given in terms of real-valued parameters b, a, λ > 0, m, and σ > 0.
Uniform pdf. Given b > a, f(r) = 1/(b − a) for r ∈ [a, b].
Exponential pdf. f(r) = λ e^(−λr); r ≥ 0.
Doubly exponential (or Laplacian) pdf. f(r) = (λ/2) e^(−λ|r|); r ∈ R.
Gaussian (or Normal) pdf. f(r) = (2πσ²)^(−1/2) exp(−(r − m)²/2σ²); r ∈ R. Commonly denoted by N(m, σ²).
Continuous expectation

Analogous to discrete expectation.
Have Ω, a pdf f, and a function g : Ω → R. Define the expectation of g (with respect to f) by
E(g) = ∫ g(r) f(r) dr.
As in the discrete case,
P(F) = E(1F).
Important examples if Ω ⊂ R:
mean or first moment: m = ∫ r f(r) dr
kth moment: m^(k) = ∫ r^k f(r) dr
second moment: m^(2) = ∫ r² f(r) dr
centralized moments: ∫ (r − m)^k f(r) dr
including the variance: σ² = ∫ (r − m)² f(r) dr
As in the discrete case,
σ² = m^(2) − m².
When complex-valued functions are allowed, often the kth absolute moment is used:
m_a^(k) = ∫ |r|^k f(r) dr.
Computational examples

Continuous uniform pdf on [a, b). Consider a = 0 and b = 1.
mean:
m = ∫_0^1 r dr = r²/2 |_0^1 = 1/2
second moment:
m^(2) = ∫_0^1 r² dr = r³/3 |_0^1 = 1/3
variance:
σ² = 1/3 − (1/2)² = 1/12

Exponential pdf. The validation of the pdf (integrates to 1) and the mean, second moment, and variance of the exponential pdf can be found from integral tables, or by the integral analogs of the corresponding computations for the geometric pmf, or from integration by parts:
∫_0^∞ λ e^(−λr) dr = 1
m = ∫_0^∞ r λ e^(−λr) dr = 1/λ
m^(2) = ∫_0^∞ r² λ e^(−λr) dr = 2/λ²
σ² = 2/λ² − 1/λ² = 1/λ²
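For readers who prefer to check integrals numerically, here is a sketch (my own, assuming NumPy; not part of the notes) that approximates these exponential-pdf integrals with a simple Riemann sum on a fine grid:

```python
import numpy as np

lam = 1.7
r = np.linspace(0.0, 50.0 / lam, 1_000_000)    # grid extending far into the tail
dr = r[1] - r[0]
f = lam * np.exp(-lam * r)                     # exponential pdf

total = np.sum(f) * dr             # ~ 1
mean  = np.sum(r * f) * dr         # ~ 1/lam
m2    = np.sum(r**2 * f) * dr      # ~ 2/lam^2

print(total)
print(mean, 1 / lam)
print(m2, 2 / lam**2)
print(m2 - mean**2, 1 / lam**2)    # variance ~ 1/lam^2
```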
Laplacian pdf = a mixture of the exponential and its reverse; left as an exercise.
Gaussian pdf. Moment computation is more trouble than it is worth; will find an easier method later. For now just state:
∫_{−∞}^{∞} (2πσ²)^(−1/2) e^(−(x−m)²/2σ²) dx = 1
∫_{−∞}^{∞} x (2πσ²)^(−1/2) e^(−(x−m)²/2σ²) dx = m
∫_{−∞}^{∞} (x − m)² (2πσ²)^(−1/2) e^(−(x−m)²/2σ²) dx = σ²
i.e., mean = m, variance = σ², as the notation suggests.

Computing probabilities is sometimes easy, sometimes not.
With a uniform pdf on [a, b], for a ≤ c < d ≤ b,
P([c, d]) = (d − c)/(b − a)
For the exponential pdf and [c, d], 0 ≤ c < d,
P([c, d]) = ∫_c^d λ e^(−λx) dx = e^(−λc) − e^(−λd).
No such nice form for the Gaussian, but it is well tabulated in terms of
Φ(α) = (1/√(2π)) ∫_{−∞}^{α} e^(−u²/2) du
Q(α) = (1/√(2π)) ∫_{α}^{∞} e^(−u²/2) du = 1 − Φ(α)
erf(α) = (2/√π) ∫_0^{α} e^(−u²) du
Note: Φ(α) = P((−∞, α]) = P({x : x ≤ α}) for N(0, 1).
For example, find P((−∞, α]) for N(m, σ²). Change variables u = (x − m)/σ ⇒
P({x : x ≤ α}) = ∫_{−∞}^{α} (2πσ²)^(−1/2) e^(−(x−m)²/2σ²) dx
              = ∫_{−∞}^{(α−m)/σ} (2π)^(−1/2) e^(−u²/2) du
              = Φ((α − m)/σ) = 1 − Q((α − m)/σ).
Q function = the complementary function Q(α) = 1 − Φ(α). Common in communications systems analysis.
Q(α) = (1/2)[1 − erf(α/√2)] = 1 − Φ(α).
Change variables to find probabilities with tables (or numerically from these functions):
P((a, b]) = P((−∞, b]) − P((−∞, a]) = Φ((b − m)/σ) − Φ((a − m)/σ).
Symmetry of the Gaussian density ⇒ 1 − Φ(a) = Φ(−a).
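Numerically, Φ and Q are available through the error function. The sketch below (standard-library Python, not part of the notes) computes P((a, b]) for an N(m, σ²) pdf this way:

```python
from math import erf, sqrt

def Phi(alpha):
    """Standard Gaussian cdf via the error function: Phi(a) = (1 + erf(a/sqrt(2)))/2."""
    return 0.5 * (1.0 + erf(alpha / sqrt(2.0)))

def Q(alpha):
    return 1.0 - Phi(alpha)

m, sigma = 1.0, 2.0
a, b = 0.0, 3.0
prob = Phi((b - m) / sigma) - Phi((a - m) / sigma)    # P((a, b]) for N(m, sigma^2)
print(prob, Q((a - m) / sigma) - Q((b - m) / sigma))  # same number computed via Q
```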
Multidimensional pdf example

Ω = R²
Event space: the two-dimensional Borel field B(R)² (to give it a name; the main point is that there is a standard, useful definition which you can learn about in advanced probability).
Probability measure described by a 2D pdf:
f(x, y) = λµ e^(−λx−µy) for x ∈ [0, ∞), y ∈ [0, ∞); 0 otherwise.
Note: Analogous to a product pmf, this is a product pdf.
Interpretation: Sample points (x, y) = first arrival times of two distinct types of particle (type A and type B) at a sensor (or packets at a router, or buses at a bus stop) after time 0.
F = the event that a particle of type A arrives at the sensor before one of type B.
What is the probability of the event F = {(x, y) : x < y}?
As with 1D, probability = the integral of the pdf over the event:
P(F) = ∫∫_{(x,y)∈F} f(x, y) dx dy = ∫∫_{x≥0, y≥0, x<y} λµ e^(−λx−µy) dx dy.
Now it is just calculus:
P(F) = ∫∫_{x≥0, y≥0, x<y} λµ e^(−λx−µy) dx dy
     = λµ ∫_0^∞ dy e^(−µy) ∫_0^y dx e^(−λx)
     = λµ ∫_0^∞ dy e^(−µy) (1/λ)(1 − e^(−λy))
     = µ (∫_0^∞ dy e^(−µy) − ∫_0^∞ dy e^(−(µ+λ)y))
     = 1 − µ/(µ + λ) = λ/(µ + λ).
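A Monte Carlo sanity check of this answer (my own sketch, assuming NumPy): under the product pdf the two coordinates have exponential densities with rates λ and µ, so we can sample the two arrival times and count how often x < y.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, mu, trials = 2.0, 0.5, 1_000_000

# Sample the two arrival times from the product pdf f(x, y) = lam*mu*exp(-lam*x - mu*y).
x = rng.exponential(scale=1.0 / lam, size=trials)
y = rng.exponential(scale=1.0 / mu, size=trials)

print(np.mean(x < y), lam / (lam + mu))   # empirical vs. lambda/(lambda + mu)
```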
Mass functions as densities

Can use continuous ideas in discrete problems with Dirac deltas, but it is usually clumsy and adds complication. Describe briefly.
The Dirac delta is defined implicitly by its behavior inside an integral: if {g(r); r ∈ R} is continuous at a ∈ R, then
∫ g(r) δ(r − a) dr = g(a)
(no ordinary function does this; Dirac deltas are generalized functions).
Given a pmf p defined on Ω ⊂ R, can define a pdf f by
f(r) = Σ_ω p(ω) δ(r − ω).
Then f(r) ≥ 0 and
∫ f(r) dr = ∫ (Σ_ω p(ω) δ(r − ω)) dr = Σ_ω p(ω) ∫ δ(r − ω) dr = Σ_ω p(ω) = 1
∫ 1F(r) f(r) dr = ∫ 1F(r) (Σ_ω p(ω) δ(r − ω)) dr = Σ_ω p(ω) ∫ 1F(r) δ(r − ω) dr = Σ_ω p(ω) 1F(ω) = P(F).
But the pmf sum is simpler; deltas usually make things harder.

Multidimensional pdfs: General Case

Multidimensional integrals allow the construction of probabilities on R^k.
Given the measurable space (R^k, B(R)^k), a real-valued function f on R^k is a pdf if
f(x) ≥ 0, all x = (x0, x1, ..., x_{k−1}) ∈ R^k, and ∫_{R^k} f(x) dx = 1
Define a set function P by
P(F) = ∫_F f(x) dx, all F ∈ B(R)^k,
where the vector integral is shorthand for the k-dimensional integral
P(F) = ∫_{(x0,x1,...,x_{k−1})∈F} f(x0, x1, ..., x_{k−1}) dx0 dx1 ... dx_{k−1}.
As with multidimensional pmf's, a pdf is not itself the probability of anything.
As with the 1D case, subject to appropriate assumptions (event space, integral), this does define a probability measure, i.e., it satisfies the axioms.
Two common and very important examples of k-dimensional pdfs:
Product pdf. A product pdf has the form
f(x) = f(x0, x1, ..., x_{k−1}) = Π_{i=0}^{k−1} fi(xi)
where fi; i = 0, 1, ..., k − 1 are one-dimensional pdfs on the real line.
Most important special case: all fi(r) are the same (identically distributed).
So all pdfs on R ⇒ product pdfs on R^k.
Multidimensional Gaussian pdf

m = (m0, m1, ..., m_{k−1})^t, a column vector.
Λ = a k by k square matrix with entries {λ_{i,j}; i = 0, 1, ..., k − 1; j = 0, 1, ..., k − 1}. Assume
1. Λ is symmetric (Λ = Λ^t or, equivalently, λ_{i,j} = λ_{j,i}, all i, j)
2. Λ is positive definite; i.e., for any nonzero vector y ∈ R^k
y^t Λ y = Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} yi λ_{i,j} yj > 0
A multidimensional pdf is Gaussian if it has the form
f(x) = (2π)^(−k/2) (det Λ)^(−1/2) e^(−(x−m)^t Λ^(−1) (x−m)/2); x ∈ R^k,
where det Λ is the determinant of the matrix Λ.
Notes:
• Λ positive definite ⇒ det Λ > 0 and Λ^(−1) exists.
• If Λ is a diagonal matrix, this becomes a product pdf.
• There is a more general definition of a Gaussian random vector that only requires Λ to be nonnegative definite.
• The Gaussian is important because it crops up often as a good approximation (central limit theorem) and sometimes is a "worst case."
• Hard to show it integrates to 1.
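As a small illustration (my own sketch, assuming NumPy; not from the notes), the formula can be evaluated directly, and with a diagonal Λ it reduces to a product of one-dimensional N(mi, λ_{i,i}) pdfs:

```python
import numpy as np

def gauss_pdf(x, m, Lam):
    """Evaluate (2*pi)^(-k/2) det(Lam)^(-1/2) exp(-(x-m)^t Lam^(-1) (x-m)/2) at x."""
    k = len(m)
    d = x - m
    quad = d @ np.linalg.inv(Lam) @ d
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Lam) ** (-0.5) * np.exp(-quad / 2)

m = np.array([0.0, 1.0, -1.0])
Lam = np.diag([1.0, 4.0, 0.25])         # diagonal, positive definite
x = np.array([0.5, 2.0, -1.5])

joint = gauss_pdf(x, m, Lam)
product = np.prod([np.exp(-(x[i] - m[i])**2 / (2 * Lam[i, i])) / np.sqrt(2 * np.pi * Lam[i, i])
                   for i in range(3)])
print(joint, product)                   # equal when Lam is diagonal
```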
Mixtures

Example of constructing new probability measures from old ones.
Suppose Pi, i = 1, 2, ..., ∞ is a collection of probability measures on a common measurable space (Ω, F), and ai ≥ 0, i = 1, 2, ..., with Σ_i ai = 1. Then the mixture
P(F) = Σ_{i=1}^{∞} ai Pi(F)
is also a probability measure on (Ω, F). Abbreviation: P = Σ_{i=1}^{∞} ai Pi.
Useful for constructing probability measures mixing continuous and discrete:
Ω = R, f a pdf and p a pmf, λ ∈ (0, 1). Then
P(F) = λ Σ_{x∈F} p(x) + (1 − λ) ∫_F f(x) dx
(Almost) the most general model. Find expectations in such cases in the natural way: given a function g,
E(g) = λ Σ_{x∈Ω} g(x) p(x) + (1 − λ) ∫ g(x) f(x) dx.
Works for scalar and vector sample spaces.
E.g., experiment: First spin a fair wheel. If the pointer lands in [0, λ), then roll a die described by p. If the pointer lands in [λ, 1), then choose ω using a Gaussian.
Another use of mixtures: First choose parameters at random (e.g., the bias of a coin), then use the probability measure described by the parameter (e.g., Bernoulli(p)). Will see examples.
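A sketch of the spin-the-wheel interpretation (my own Python example; the fair-die pmf and standard Gaussian are arbitrary choices): with probability λ draw from the pmf, otherwise from the pdf, and compare the empirical mean of g with the mixture-expectation formula.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 0.3                                  # mixture weight
die = np.arange(1, 7)                      # discrete part: fair die pmf p(k) = 1/6
m, sigma = 0.0, 1.0                        # continuous part: N(0, 1) pdf

g = lambda x: x**2                         # function whose expectation we want

# Sample from the mixture: spin the wheel, then use the discrete or the continuous part.
n = 500_000
use_pmf = rng.random(n) < lam
samples = np.where(use_pmf, rng.choice(die, size=n), rng.normal(m, sigma, size=n))

emp = np.mean(g(samples))
exact = lam * np.mean(g(die)) + (1 - lam) * (sigma**2 + m**2)   # lam*sum g*p + (1-lam)*int g*f
print(emp, exact)
```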
Event Independence

Given (Ω, F, P), two events F and G are independent if
P(F ∩ G) = P(F) P(G)
A collection of events {Fi; i = 0, 1, ..., k − 1} is independent or mutually independent if for any distinct subcollection {F_{li}; i = 0, 1, ..., m − 1}, li < k,
P(∩_{i=0}^{m−1} F_{li}) = Π_{i=0}^{m−1} P(F_{li}).
Note: The requirement is on all subcollections! Not enough to just say
P(∩_{i=0}^{k−1} Fi) = Π_{i=0}^{k−1} P(Fi)
Example of what can go wrong: Suppose that
P(F) = P(G) = P(H) = 1/3
P(F ∩ G ∩ H) = 1/27 = P(F)P(G)P(H)
P(F ∩ G) = P(G ∩ H) = P(F ∩ H) = 1/27 ≠ P(F)P(G).
Here there is zero probability on the overlap F ∩ G except where it also overlaps H, i.e., P(F ∩ G ∩ H^c) = 0. Thus P(F ∩ G ∩ H) = P(F)P(G)P(H) = 1/27, but P(F ∩ G) = 1/27 ≠ P(F)P(G) = 1/9.
So P(F ∩ G ∩ H) = P(F)P(G)P(H) for the three events F, G, and H, yet it is not true that P(F ∩ G) = P(F)P(G). So here it is not true that the three events are mutually independent!
Probabilistic independence has an intuitive interpretation in terms of the next topic, elementary conditional probability. But the definition does not require conditional probability.
Elementary conditional probability

Intuitively, independence of two events means that the occurrence of one event should not affect the probability of occurrence of the other. E.g., the outcome of one roll of a die does not change the probability of the next roll (or of rolling another die).
Need a definition of the probability of one event conditioned on another.
Motivation: Given (Ω, F, P), suppose we know that event G has occurred. What is a reasonable definition for the probability that F will occur given (conditioned on) G?
P(F | G)
What properties should it have? Clearly P(G^c | G) = 0 and P(G | G) = 1.
For a fixed G, P(F | G) should be defined for all events F; it should be a probability measure on (Ω, F).
Since P(· | G) must be a probability measure, ⇒
P(F | G) = P(F ∩ (G ∪ G^c) | G) = P(F ∩ G | G) + P(F ∩ G^c | G) = P(F ∩ G | G),
since P(F ∩ G^c | G) = 0. So
P(F | G) = P(F ∩ G | G)    (*1)
Next, there is no reason to suspect that the relative probabilities within G should change because of the knowledge that G occurred.
E.g., if F ⊂ G is twice as probable as an event H ⊂ G with respect to P, then the same should be true with respect to P(· | G).
For arbitrary events F and H, F ∩ G, H ∩ G ⊂ G, ⇒
P(F ∩ G | G)/P(H ∩ G | G) = P(F ∩ G)/P(H ∩ G).
Set H = Ω and use (*1) ⇒
P(F | G) = P(F ∩ G | G) = P(F ∩ G)/P(G)    (*2)
This motivates using (*2) as the definition of conditional probability.
It only works if P(G) > 0, and is called elementary conditional probability. (Will later see nonelementary conditional probability; it is more complicated.)
Easy to prove that since P is a probability measure, so is P(· | G).

Independence redux

Suppose that F and G are independent events and that P(G) > 0. Then
P(F | G) = P(F ∩ G)/P(G) = P(F);
the occurrence of G does not affect the probability of F: the a posteriori probability P(F | G) = the a priori probability P(F).
This is not used as the definition of independence since it is less general.
Conditional probability provides a means of constructing new probability spaces from old ones.
Example: Given a discrete (Ω, F, P) described by a pmf p, and an event A with P(A) > 0, define the pmf pA:
pA(ω) = p(ω)/P(A) = P({ω} | A) if ω ∈ A, 0 if ω ∉ A
⇒ (Ω, F, PA), where
PA(F) = Σ_{ω∈F} pA(ω) = P(F | A).
pA is a conditional pmf.
E.g., p = the geometric pmf and A = {ω : ω ≥ K} = {K, K + 1, ...}:
pA(k) = (1 − p)^(k−1) p / Σ_{l=K}^{∞} (1 − p)^(l−1) p = (1 − p)^(k−1) p / (1 − p)^(K−1)
      = (1 − p)^(k−K) p; k = K, K + 1, K + 2, ...,
a shifted geometric pmf that begins at k = K.
Example: Continuous (Ω, F, P) with pdf f, and P(A) > 0. Define the conditional pdf fA:
fA(ω) = f(ω)/P(A) if ω ∈ A, 0 if ω ∉ A
and the conditional probability
PA(F) = ∫_{ω∈F} fA(ω) dω = P(F | A).
E.g., f = the exponential pdf, A = {r : r ≥ c}:
fA(x) = λ e^(−λx) / ∫_c^∞ λ e^(−λy) dy = λ e^(−λx)/e^(−λc) = λ e^(−λ(x−c)); x ≥ c.
Like the geometric, the exponential conditioned on x ≥ c has the same form, just shifted.
Interpretation: The exponential models bus arrival times. The pdf for the next arrival given that you have waited an hour is the same as when you got to the bus stop.
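A quick numerical illustration of this memoryless property (a sketch, assuming NumPy; not from the notes): among simulated exponential arrival times that exceed c, the excess over c looks like a fresh exponential with the same λ.

```python
import numpy as np

rng = np.random.default_rng(5)
lam, c = 0.8, 1.0
t = rng.exponential(scale=1.0 / lam, size=2_000_000)

excess = t[t >= c] - c                  # arrival times conditioned on A = {r >= c}, shifted back by c
print(np.mean(excess), 1.0 / lam)       # conditional mean of the excess ~ 1/lam
print(np.mean(excess > 0.5), np.exp(-lam * 0.5))   # P(excess > 0.5) ~ exp(-lam * 0.5)
```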
Bayes Rule

Total probability + conditional probability.
Recall total probability: If {Fi; i = 1, 2, ...} is a partition of Ω, then for any event G
P(G) = Σ_i P(G ∩ Fi).
Suppose we know P(Fi), all i: the a priori probabilities of the collection of events.
Suppose we also know P(G | Fi).
Find the a posteriori probabilities P(Fi | G). Given that you observe G, how does that change the probabilities of the Fi?
P(Fi | G) = P(Fi ∩ G)/P(G) = P(G | Fi) P(Fi) / Σ_j P(G ∩ Fj) = P(G | Fi) P(Fi) / Σ_j P(G | Fj) P(Fj)

Example of Bayes' rule: Binary communication channel

Noisy binary communication channel: 0 or 1 is sent and 0 or 1 is received. Assume that 0 is sent with probability 0.2 (⇒ 1 is sent with probability 0.8).
The channel is noisy. If a 0 is sent, a 0 is received with probability 0.9, and if a 1 is sent, a 1 is received with probability 0.975.
Can represent this channel model by a probability transition diagram:
[Figure: transition diagram with inputs 0 (P({0}) = 0.2) and 1 (P({1}) = 0.8), and transition probabilities P({0} | {0}) = 0.9, P({1} | {0}) = 0.1, P({0} | {1}) = 0.025, P({1} | {1}) = 0.975.]
Sample space Ω = {(0, 0), (0, 1), (1, 0), (1, 1)}, points: (bit sent, bit received).
Define
A = {0 is sent} = {(0, 0), (0, 1)},
B = {0 is received} = {(0, 0), (1, 0)}
The probability measure is defined via the P(A), P(B|A), and P(B^c|A^c) provided on the probability transition diagram of the channel.
Problem: Given that 0 is received, find the probability that 0 was sent.
To find P(A|B), use Bayes rule:
P(A|B) = P(B|A)P(A) / [P(A)P(B|A) + P(A^c)P(B|A^c)]
⇒
P(A|B) = (0.9 × 0.2)/(0.2 × 0.9 + 0.8 × 0.025) = 0.18/0.2 = 0.9,
much higher than the a priori probability of A (= 0.2).
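The arithmetic is a one-liner to verify (a sketch in plain Python, mirroring the numbers in the example):

```python
P_A = 0.2                 # P(0 sent)
P_B_given_A = 0.9         # P(0 received | 0 sent)
P_B_given_notA = 0.025    # P(0 received | 1 sent)

P_B = P_A * P_B_given_A + (1 - P_A) * P_B_given_notA   # total probability
P_A_given_B = P_B_given_A * P_A / P_B                  # Bayes rule
print(P_A_given_B)        # 0.9, much higher than the prior P(A) = 0.2
```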
End of basic probability. Next: random variables, vectors, and processes.