EE7101: Introduction to Information and Coding Theory
Handout 1 – Probability Theory Review
Tay Wee Peng
This is a very brief review of probability theory. For more details, please consult EE7401 or
MAS713. A good reference at the undergraduate level is “Introduction to Probability” by Dimitri
Bertsekas and John Tsitsiklis. A good introduction to measure theoretic probability is “Probability
with Martingales” by David Williams.
1 Probability Space
A probability space is represented by a tuple (Ω, F, P), where Ω is the sample space, and F is a
σ-algebra (think of it as a collection of events or subsets of Ω), with the following properties:
(i) Ω ∈ F
(ii) If A ∈ F, then A^c ∈ F.
(iii) If A_1, A_2, . . . ∈ F, then ∪_{i=1}^∞ A_i ∈ F.
Think of F as the “information structure” of Ω. For example, let Ω = [0, 1], and suppose we are interested in the probabilities of subsets of Ω that are intervals of the form [a, b], where 0 ≤ a < b ≤ 1, but not of individual points in Ω. Then we should also be able to say something about the probabilities of unions, intersections, complements, and so on, of such intervals. This is captured by the definition of a σ-algebra.
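To make the closure properties concrete, here is a minimal Python sketch (my own illustration, not part of the handout; the function name generate_sigma_algebra is invented) that computes the smallest σ-algebra on a finite sample space containing a given collection of events, by repeatedly closing under complements and pairwise unions. On a finite Ω, closure under pairwise unions suffices for countable unions.

    from itertools import combinations

    def generate_sigma_algebra(omega, generators):
        # Smallest sigma-algebra on the finite set omega containing generators.
        omega = frozenset(omega)
        F = {frozenset(), omega} | {frozenset(g) for g in generators}
        while True:
            new = {omega - A for A in F}                   # complements
            new |= {A | B for A, B in combinations(F, 2)}  # pairwise unions
            if new <= F:                                   # closed: done
                return F
            F |= new

    F = generate_sigma_algebra({1, 2, 3, 4}, [{1, 2}])
    print(sorted(sorted(A) for A in F))
    # [[], [1, 2], [1, 2, 3, 4], [3, 4]]: this information structure cannot
    # distinguish outcome 1 from 2, or 3 from 4.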
The probability measure P is a function P : F → [0, 1] defined on the measurable space (Ω, F), and represents your “belief” about the events in F. In order for P to be called a probability measure, it must satisfy the following properties:
(i) P(Ω) = 1
(ii) For A_1, A_2, . . . such that A_i ∩ A_j = ∅ for all i ≠ j, we have

P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Some basic properties that can be derived from the above definition:
(a) P(A^c) = 1 − P(A)
(b) If A ⊂ B, then P(A) ≤ P(B).
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B); the inequality is called the union bound. Similarly,

P(∪_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).
(How would you prove this? Note that induction alone does not work here; why not?)
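For intuition about property (c), here is a tiny Python check (my own, not from the handout) of inclusion-exclusion and the union bound on a fair die, computing probabilities exactly with fractions:

    from fractions import Fraction

    omega = set(range(1, 7))            # fair six-sided die, P uniform

    def P(A):
        return Fraction(len(A), len(omega))

    A = {1, 2, 3}                       # "at most 3"
    B = {2, 4, 6}                       # "even"

    assert P(A | B) == P(A) + P(B) - P(A & B)   # inclusion-exclusion
    assert P(A | B) <= P(A) + P(B)              # union bound
    print(P(A | B))                             # 5/6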
The conditional probability of A given B is defined as

P(A | B) = P(A ∩ B) / P(B),

if P(B) ≠ 0.
Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B); equivalently, P(A | B) = P(A) when P(B) > 0, or P(B | A) = P(B) when P(A) > 0. (This definition sets probability theory apart from measure theory.)
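A small Python sketch (mine, with events of my own choosing) illustrating both definitions on two fair dice: conditioning on an overlapping event changes probabilities, while the two dice themselves are independent.

    from fractions import Fraction

    omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}  # two dice

    def P(A):
        return Fraction(len(A), len(omega))

    A = {w for w in omega if w[0] == 6}          # first die shows 6
    B = {w for w in omega if w[1] == 6}          # second die shows 6
    C = {w for w in omega if w[0] + w[1] >= 11}  # sum at least 11

    print(P(A & C) / P(C))          # P(A | C) = 2/3, not P(A) = 1/6
    assert P(A & B) == P(A) * P(B)  # A and B are independent
    assert P(A & C) != P(A) * P(C)  # A and C are not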
2 Random Variables
A random variable X : Ω → 𝒳 is a mapping from the space (Ω, F) to a measurable space (𝒳, B). For example, for a real-valued random variable, the space that X takes values in is typically chosen to be (R, B), where B is the Borel σ-algebra.¹ In order for X to make any sense, the mapping has to ensure that {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B, because we are restricted to the information structure imposed by F. Such a mapping is called a measurable function. A random variable is then more formally defined as a measurable mapping from (Ω, F) to (𝒳, B). The probability measure P induces a probability measure on (𝒳, B), given by

P_X(B) = P({ω ∈ Ω : X(ω) ∈ B}),

for all B ∈ B. We will often write P_X(B) as P(X ∈ B); the measure P_X is called the distribution of X.

¹This is the smallest σ-algebra containing all open intervals in R.
If 𝒳 = {x_1, . . . , x_n}, then we say that X is a discrete random variable. The distribution of X is then also known as the probability mass function (pmf) of X and is fully defined by (p_X(x_1), . . . , p_X(x_n)), where p_X(·) = P(X = ·).
For a real-valued random variable X, if there exists a function f_X : R → [0, ∞) such that for all A ∈ B, we have

P(X ∈ A) = ∫_A f_X(x) dx,

then we say that X is a continuous random variable. The function f_X is called the probability density function (pdf).
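As a sanity check on this definition, the sketch below (my own; the helper name prob is invented) numerically integrates the exponential pdf f_X(x) = λe^{−λx} over an interval and compares with the closed-form answer:

    import math

    lam = 2.0

    def f(x):
        return lam * math.exp(-lam * x)          # exponential pdf on [0, inf)

    def prob(a, b, n=100_000):
        # P(X in [a, b]) by the definition: integrate the pdf (midpoint rule).
        h = (b - a) / n
        return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

    a, b = 0.5, 1.5
    exact = math.exp(-lam * a) - math.exp(-lam * b)  # closed form for this pdf
    print(prob(a, b), exact)                         # agree to several decimals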
In this class, we deal mainly with discrete random variables, although we will also encounter Gaussian random variables, which are continuous, later on. Some additional notation and definitions for discrete random variables are given below. The counterparts for continuous random variables can be obtained by simply replacing pmfs with pdfs. Assume that X and Y are discrete random variables.
(a) Joint pmf: p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω : X(ω) = x, Y(ω) = y}).
(b) Conditional pmf: p_{X|Y}(x | y) = p_{X,Y}(x, y)/p_Y(y) for p_Y(y) ≠ 0.
Bayes rule:

p_{X|Y}(x | y) = [p_{Y|X}(y | x) p_X(x)] / [Σ_{x′} p_{Y|X}(y | x′) p_X(x′)].
(c) We write X ⊥⊥ Y if X is independent of Y, i.e., p_{X,Y}(x, y) = p_X(x)p_Y(y).
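As a quick illustration of Bayes rule, here is a small Python sketch (my own, with made-up numbers for p_X and p_{Y|X}) that computes the posterior pmf p_{X|Y} by normalizing over x′:

    # Bayes rule for a binary X observed through a noisy channel (toy numbers).
    p_X = {0: 0.9, 1: 0.1}                      # prior pmf of X
    p_Y_given_X = {(0, 0): 0.8, (1, 0): 0.2,    # p_{Y|X}(y | x), keyed (y, x)
                   (0, 1): 0.3, (1, 1): 0.7}

    def posterior(x, y):
        # p_{X|Y}(x | y) via Bayes rule.
        num = p_Y_given_X[(y, x)] * p_X[x]
        den = sum(p_Y_given_X[(y, xp)] * p_X[xp] for xp in p_X)
        return num / den

    print(posterior(1, 1))  # 0.28: even after observing y = 1, x = 0 is likelier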
Example 1. Binary symmetric channel. X ∼ Bern(p) is sent over a channel and corrupted by additive noise Z ∼ Bern(ε), with X ⊥⊥ Z. The output of the channel is Y = X ⊕ Z. Then

p_{Y|X}(y | x) = P(X ⊕ Z = y | X = x)
             = P(Z = y ⊕ x | X = x)
             = P(Z = y ⊕ x)
             = p_Z(y ⊕ x),

where the third equality uses X ⊥⊥ Z.
x   y   p_{Y|X}(y | x)
0   0   1 − ε
1   0   ε
0   1   ε
1   1   1 − ε

Table 1: Conditional probabilities for the binary symmetric channel.
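To connect Table 1 with simulation, the following Python sketch (my own, not from the handout) passes Bernoulli inputs through a binary symmetric channel with an assumed crossover probability ε = 0.1 and estimates p_{Y|X}(y | x) empirically; the frequencies should match Table 1.

    import random

    random.seed(0)
    p, eps, n = 0.5, 0.1, 200_000

    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    sent = {0: 0, 1: 0}
    for _ in range(n):
        x = 1 if random.random() < p else 0      # X ~ Bern(p)
        z = 1 if random.random() < eps else 0    # Z ~ Bern(eps), independent of X
        y = x ^ z                                # Y = X xor Z
        counts[(x, y)] += 1
        sent[x] += 1

    for (x, y), c in sorted(counts.items()):
        expected = 1 - eps if x == y else eps    # Table 1
        print(f"p(y={y} | x={x}) ~ {c / sent[x]:.3f}   (exact: {expected})")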
3 Expectation and Variance
The expectation of a random variable X is defined to be

E[X] = ∫_Ω X(ω) dP(ω).
This definition has a very precise mathematical meaning, which unfortunately is out of the scope of this review. Roughly speaking, we think of P(ω) as a weight that we impose on X(ω) for each value of ω ∈ Ω, and we are taking the weighted sum of X(ω). If X is a discrete random variable, this reduces to the familiar formula

E[X] = Σ_{x∈𝒳} x p_X(x).
If X is a continuous random variable, we have “dP(ω) = f_X(X(ω)) dx”, and

E[X] = ∫_R x f_X(x) dx.
Note that the expectation is a statistical summary of the distribution of X, rather than a quantity that depends on the realized value of X. Perhaps it would be more fitting to write it as E[p_X], but for legacy reasons we use the notation E[X] instead.
For a function g, g(X) is again a random variable, so we can obtain the expectation of g(X) in the same way. For a continuous random variable X, it can be shown that

E[g(X)] = ∫_R g(x) f_X(x) dx.
In particular, if g(X) = aX + b is a linear function of X, then E[g(X)] = aE[X] + b = g(E[X]). The variance of X is just the expectation of g(X) = (X − E[X])².
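The following sketch (mine, with a made-up pmf) checks the discrete expectation formula, the linearity property E[aX + b] = aE[X] + b, and the variance computed as E[(X − E[X])²]:

    pmf = {0: 0.2, 1: 0.5, 2: 0.3}   # a made-up pmf on {0, 1, 2}

    def E(g=lambda x: x):
        # E[g(X)] = sum over x of g(x) p_X(x).
        return sum(g(x) * px for x, px in pmf.items())

    mean = E()
    var = E(lambda x: (x - mean) ** 2)     # Var(X) = E[(X - E[X])^2]
    a, b = 3.0, -1.0
    assert abs(E(lambda x: a * x + b) - (a * mean + b)) < 1e-12  # linearity
    print(mean, var)   # 1.1 0.49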
The conditional expectation E[X | Y] can be defined with respect to p_{X|Y}, but take note that this is actually a random variable, as it depends on the value of Y. To get some intuition, suppose that Y ∈ {y_1, . . . , y_n}. Then E[X | Y] is a random variable taking values E[X | Y = y_i], i = 1, . . . , n, with corresponding probabilities P(Y = y_i). We see that Y “partitions” the sample space Ω into different regions, and E[X | Y = y_i] is the normalized expectation of X on the region corresponding to Y = y_i.² We have

E[X] = Σ_i E[X | Y = y_i] P(Y = y_i) = E[E[X | Y]].

²For general random variables, E[X | Y] should be viewed as E[X | σ(Y)], where the conditioning is on the σ-algebra σ(Y) generated by Y.
In some sense, E[X | Y] is the “best” estimator of X you can have, given knowledge of Y. Indeed, recall from your statistics classes that under the squared error loss criterion, E[X | Y] is the optimal estimator of X, i.e., it minimizes E[(X − X̂(Y))²] over all functions X̂(Y).
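The sketch below (my own illustration, with a made-up joint pmf) computes E[X | Y = y] by normalizing X over each region {Y = y} and verifies the tower property E[E[X | Y]] = E[X]:

    # A made-up joint pmf p_{X,Y}(x, y), keyed by (x, y).
    p_XY = {(0, 0): 0.1, (1, 0): 0.3,
            (0, 1): 0.4, (1, 1): 0.2}

    p_Y = {y: sum(p for (_, y2), p in p_XY.items() if y2 == y) for y in (0, 1)}

    def E_X_given(y):
        # E[X | Y = y]: expectation of X restricted to the region {Y = y}.
        return sum(x * p for (x, y2), p in p_XY.items() if y2 == y) / p_Y[y]

    EX = sum(x * p for (x, _), p in p_XY.items())
    tower = sum(E_X_given(y) * p_Y[y] for y in p_Y)   # E[E[X | Y]]
    assert abs(EX - tower) < 1e-12
    print(EX)   # 0.5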
4 Jensen’s Inequality
A function f(x) is said to be convex if for all x, y and λ ∈ [0, 1], we have

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).    (1)

The function f is strictly convex if equality in (1) holds iff λ = 0 or 1, or x = y.
Proposition 1. Let f be convex. Then f(x) = sup_{l∈L} l(x), where L = {l : l(u) = au + b ≤ f(u) for all u, for some a, b}.
Proof. It suffices to show that for any given x, we can find a linear function l(·) such that l(x) = f(x) and l(u) ≤ f(u) for all u.
For any h > 0, convexity of f (with λ = 1/2, applied to the points x + h and x − h) implies that 2f(x) ≤ f(x + h) + f(x − h), or equivalently,

(f(x) − f(x − h))/h ≤ (f(x + h) − f(x))/h,    (2)
so that by letting h ↓ 0, we have

lim_{h↓0} (f(x) − f(x − h))/h ≤ lim_{h↓0} (f(x + h) − f(x))/h,
which means that the left derivative of f at x is no greater than the right derivative (both one-sided limits exist because convexity makes the difference quotients monotone in h). We can then choose a constant a between these two limits and let l(u) = a(u − x) + f(x). We claim that this linear function is the one we are looking for. Indeed, note that l(x) = f(x), and for any u > x, letting h = u − x, we have

l(u) = a(u − x) + f(x)
     ≤ ((f(x + h) − f(x))/h)(u − x) + f(x)
     = f(x + h) = f(u),

where the inequality follows from (2) and our choice of a. A similar argument holds for u < x, and the proof is complete.
Proposition 2. Jensen’s Inequality. If f is convex, E[|X|] < ∞, and E[|f(X)|] < ∞, then

E[f(X)] ≥ f(E[X]).

Furthermore, if f is strictly convex, then equality holds iff X is constant (almost surely).
Proof.

E[f(X)] = E[sup_{l∈L} l(X)]      (from Proposition 1)
        ≥ sup_{l∈L} E[l(X)]      (can you see why?)
        = sup_{l∈L} l(E[X])      (because l is linear)
        = f(E[X]).
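As a numerical sanity check (mine, not part of the handout), the snippet below verifies E[f(X)] ≥ f(E[X]) for the strictly convex function f(x) = x² and a made-up pmf; for this particular f, the gap E[f(X)] − f(E[X]) is exactly Var(X):

    pmf = {-1: 0.3, 0: 0.2, 2: 0.5}    # a made-up pmf

    def f(x):
        return x ** 2                   # strictly convex

    def E(g):
        return sum(g(x) * p for x, p in pmf.items())

    lhs, rhs = E(f), f(E(lambda x: x))
    assert lhs >= rhs                   # Jensen: E[f(X)] >= f(E[X])
    print(lhs, rhs, lhs - rhs)          # the gap equals Var(X) for f(x) = x^2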