EE5139R Lecture 1: Review of Necessary Mathematical Preliminaries for EE5139R
Vincent Y. F. Tan
August 4, 2016
This is a very brief review of probability theory adapted from the notes of EE7101 (by Prof. Wee Peng Tay of NTU). For more details, please consult the undergraduate-level text "Introduction to Probability" by Dimitri Bertsekas and John Tsitsiklis [1].
1 Probability Space
A probability space is represented by a tuple (Ω, F, P) where Ω is the sample space, and F is a σ-algebra
(think of it as a collection of events or legitimate subsets of Ω), with the following properties:
• Ω ∈ F
• If A ∈ F, then A^c = Ω \ A ∈ F
• If A1, A2, . . . , An, . . . ∈ F, then ∪_{i=1}^∞ Ai ∈ F
Think of F as the "information structure" of Ω. For example, let Ω = [0, 1], and suppose we are interested in the probability of subsets of Ω that are intervals of the form [a, b] where 0 ≤ a < b ≤ 1, but not of individual points in Ω. Then we should also be able to say something about the probability of unions, intersections, complements and so on of such intervals. This is captured by the definition of a σ-algebra. The probability
measure P is a function P : F → [0, 1] defined on the measurable space (Ω, F), and represents your “belief”
about the events in F. In order for P to be called a probability measure, it must satisfy the following
properties:
• P(Ω) = 1
• For A1, A2, . . . such that Ai ∩ Aj = ∅ for all i ≠ j, we have

  P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)
Some basic properties that can be derived from the above definition:
• P(A^c) = 1 − P(A)
• If A ⊂ B, then P(A) ≤ P(B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B); the inequality is called the union bound. Clearly, by induction, the union bound works for finitely many sets Ai, i = 1, . . . , k. It is also true that
  P(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ P(Ai).
This is not straightforward. Can you prove it using only the axioms above?
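As a quick numerical illustration of the union bound, here is a minimal Python sketch (the die-roll events A and B are arbitrary illustrative choices) that estimates P(A ∪ B) and P(A) + P(B) by Monte Carlo; the slack between the two is exactly P(A ∩ B).

```python
import random

random.seed(0)
n = 100_000
count_A = count_B = count_union = 0
for _ in range(n):
    roll = random.randint(1, 6)   # one roll of a fair die
    in_A = roll % 2 == 0          # A: the roll is even
    in_B = roll >= 4              # B: the roll is at least 4
    count_A += in_A
    count_B += in_B
    count_union += in_A or in_B

# Union bound: P(A ∪ B) <= P(A) + P(B); here 2/3 <= 1.
print(count_union / n, "<=", (count_A + count_B) / n)
```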
The conditional probability of A given B is defined as

  P(A|B) = P(A ∩ B) / P(B)

if P(B) ≠ 0. It is possible to define conditional probability even if P(B) = 0, but we will not need this in this course.
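To make the definition concrete, the following minimal Python sketch estimates a conditional probability by Monte Carlo; the two-dice events are illustrative choices, and the exact answer is 1/6.

```python
import random

random.seed(1)
n = 200_000
count_B = count_AB = 0
for _ in range(n):
    d1 = random.randint(1, 6)
    d2 = random.randint(1, 6)
    B = d1 == 3              # conditioning event: the first die shows 3
    A = d1 + d2 == 7         # event of interest: the sum is 7
    count_B += B
    count_AB += A and B

# P(A|B) = P(A ∩ B)/P(B); exact value 1/6 ≈ 0.167
print(count_AB / count_B)
```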
Two events A, B ∈ F are said to be independent if
P(A ∩ B) = P(A)P(B)
In other words, P(A|B) = P(A), or P(B|A) = P(B).

2 Random Variables
A random variable X : Ω → X is a mapping (function) from the space (Ω, F) to a measurable space (X, B). For example, for a real-valued random variable, the space that X takes values in is typically chosen to be (R, B), where B is the Borel σ-algebra (the smallest σ-algebra containing all open intervals in R). In order for X to make any sense, the mapping has to ensure that {ω ∈ Ω : X(ω) ∈ B} ∈ F for all B ∈ B, because we are restricted to the information structure imposed by F. Such a mapping is called a measurable function. A random variable is then more formally defined as a measurable mapping from (Ω, F) to (X, B). The probability measure P induces a probability measure PX on (R, B), given by

  PX(B) = P({ω ∈ Ω : X(ω) ∈ B})

for all B ∈ B. We often write PX(B) as Pr(X ∈ B) and PX is called the distribution of the random variable X.
If X = {a1, . . . , ad}, then we say that X is a discrete random variable. The distribution of X is then also known as the probability mass function (pmf) of X and is fully defined by (PX(a1), . . . , PX(ad)), where PX(a) = Pr(X = a).
For a real-valued random variable X, if there exists a function fX : R → [0, ∞) such that for all A ∈ B we have

  Pr(X ∈ A) = ∫_A fX(x) dx,

then we say that X is a continuous random variable. The function fX is called the probability density function (pdf).
In this class, we deal mainly with discrete random variables, although we will also encounter Gaussian
random variables, which are continuous, later on. Some additional notations and definitions for discrete
random variables are given below. The counterparts for continuous random variables can be obtained by
simply replacing pmfs with pdfs. Assume that X and Y are discrete random variables taking on values in
X and Y respectively.
• Joint pmf: PX,Y(x, y) = Pr(X = x, Y = y)
• Conditional pmf: PX|Y(x|y) = PX,Y(x, y) / PY(y), for PY(y) > 0
• Bayes' rule (illustrated in the sketch at the end of this section):

  PX|Y(x|y) = PY|X(y|x) PX(x) / Σ_{x'} PY|X(y|x') PX(x')
• If X and Y are independent random variables, then PX,Y(x, y) = PX(x) PY(y)
Furthermore, for any set A ⊂ X, we have two different ways of denoting the probability that X belongs to A, namely

  PX(A) := Pr(X ∈ A) = Σ_{x∈A} PX(x),
and similarly for the conditional distribution and the joint distribution.
Note that it is incorrect to write Pr(A) for a set A ⊂ X. What one can do is to define an event 𝒜 := {X ∈ A} ∈ F and then write

  Pr(𝒜) = Pr(X ∈ A).
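To illustrate Bayes' rule from the list above, here is a minimal Python sketch that computes the posterior pmf PX|Y(·|y) from a prior PX and a conditional pmf PY|X; all numerical values are arbitrary illustrative choices.

```python
# Bayes' rule for discrete pmfs, with illustrative numbers.
PX = {0: 0.7, 1: 0.3}                      # prior pmf of X
PY_given_X = {0: {0: 0.9, 1: 0.1},         # PY|X(y|x), indexed as [x][y]
              1: {0: 0.2, 1: 0.8}}

def posterior(y):
    """Return the pmf PX|Y(.|y) via Bayes' rule."""
    norm = sum(PY_given_X[xp][y] * PX[xp] for xp in PX)   # denominator
    return {x: PY_given_X[x][y] * PX[x] / norm for x in PX}

print(posterior(1))   # posterior pmf of X after observing Y = 1
```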
3 Expectation and Variance
The expectation of a random variable X is defined to be

  E[X] = ∫_Ω X(ω) dP(ω).
This definition has a very precise mathematical meaning, which unfortunately is out of the scope of this
review. Roughly speaking, we think of P(ω) as a “weight” that we impose on X(ω) for each value of
ω ∈ Ω, and we are taking the weighted sum of X(ω). If X is a discrete random variable with alphabet
X = {a1, . . . , ad}, this reduces to the familiar formula

  E[X] = Σ_{x∈X} x PX(x).
If X is a continuous random variable with pdf fX(x), we have

  E[X] = ∫_R x fX(x) dx.
Note that the expectation is a statistical summary of the distribution of X, rather than depending on the
realized value of X. Perhaps it is more fitting to write it as E[PX ] but for legacy reasons, we use the notation
E[X] instead.
If g is a function, we can obtain the expectation of g(X) in the same way. It can be shown that

  E[g(X)] = ∫_R g(x) fX(x) dx.
In particular if g(X) = aX + b, then E[g(X)] = aE[X] + b = g(E[X]).
The variance of X is the expectation of g(X) = (X − E[X])^2. Thus,

  Var(X) = E[(X − E[X])^2] = ∫_R (x − E[X])^2 fX(x) dx.
Check from the above definition that the variance can also be expressed as
  Var(X) = E[X^2] − E[X]^2.
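The following Python sketch checks this identity empirically on Gaussian samples (the mean 1 and standard deviation 2 are arbitrary choices); both expressions approach the true variance 4.

```python
import random

random.seed(2)
xs = [random.gauss(1.0, 2.0) for _ in range(100_000)]     # samples of X ~ N(1, 4)

mean = sum(xs) / len(xs)
var_def = sum((x - mean) ** 2 for x in xs) / len(xs)      # E[(X − E[X])^2]
var_alt = sum(x * x for x in xs) / len(xs) - mean ** 2    # E[X^2] − E[X]^2

print(var_def, var_alt)   # both close to 4
```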
The conditional expectation E[X|Y ] can be defined with respect to PX|Y , but take note that this is
actually a random variable as it depends on the value of Y (a random variable). To get some intuition,
suppose that Y = {b1, . . . , bm}. Then E[X|Y] is a random variable taking values E[X|Y = bi], where i = 1, . . . , m, with corresponding probability Pr(Y = bi). We see that Y "partitions" the sample space Ω into different regions, and E[X|Y = bi] is the normalized expectation of X on the region corresponding to Y = bi. We have

  E[X] = Σ_{i=1}^m E[X|Y = bi] Pr(Y = bi) = E_Y[E_X[X|Y]].
In some sense, E[X|Y ] is the “best” estimator of X you can have, given knowledge of Y .
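The identity E[X] = E_Y[E_X[X|Y]] can be verified exactly on a small joint pmf, as in the following Python sketch (the joint pmf values are illustrative).

```python
# Exact check of E[X] = E[E[X|Y]] for a small joint pmf.
PXY = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}   # PX,Y(x, y)

PY = {y: sum(p for (x, yy), p in PXY.items() if yy == y) for y in (0, 1)}

def E_X_given_Y(y):
    """E[X | Y = y], computed from the conditional pmf PX|Y."""
    return sum(x * PXY[(x, y)] / PY[y] for x in (0, 1))

lhs = sum(x * p for (x, y), p in PXY.items())         # E[X]
rhs = sum(E_X_given_Y(y) * PY[y] for y in (0, 1))     # E_Y[E_X[X|Y]]
print(lhs, rhs)                                       # both 0.5
```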
Example 1 (Binary symmetric channel). X ∼ Bern(p) is sent over a channel and corrupted by additive noise Z ∼ Bern(ε), with X ⊥ Z. The output of the channel is Y = X ⊕ Z. Then

  PY|X(y|x) = Pr(X ⊕ Z = y | X = x)
            = Pr(Z = y ⊕ x | X = x)
            = Pr(Z = y ⊕ x)
            = PZ(y ⊕ x).

  x   y   PY|X(y|x)
  0   0   1 − ε
  1   0   ε
  0   1   ε
  1   1   1 − ε

Table 1: Conditional probabilities for the binary symmetric channel.
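A short Monte Carlo simulation of the binary symmetric channel (with illustrative parameters p = 0.5 and ε = 0.1) reproduces the entries of Table 1.

```python
import random

random.seed(3)
p, eps = 0.5, 0.1       # Bern(p) input, Bern(eps) noise; illustrative values
n = 200_000
counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
count_x = {0: 0, 1: 0}
for _ in range(n):
    x = int(random.random() < p)
    z = int(random.random() < eps)
    y = x ^ z           # channel output Y = X XOR Z
    counts[(x, y)] += 1
    count_x[x] += 1

# Empirical PY|X(y|x): 1 − eps when y = x, and eps when y ≠ x.
for x in (0, 1):
    for y in (0, 1):
        print(f"PY|X({y}|{x}) ~ {counts[(x, y)] / count_x[x]:.3f}")
```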
4 Independence and Markov Chains
Recall that two random variables X and Z with joint distribution PXZ are said to be independent if

  PXZ(x, z) = PX(x) PZ(z)   for all (x, z) ∈ X × Z.

Another way of saying this is that the conditional distribution PX|Z(x|z) does not depend on z, i.e.,

  PX|Z(x|z) = PX(x)   for all (x, z) ∈ X × Z.
The notion of Markov chains is very similar to independence but we need three random variables (instead of two). Let's start with three random variables X, Y and Z. They are said to form a Markov chain in the order

  X − Y − Z

if their joint distribution PXYZ satisfies

  PXYZ(x, y, z) = PX(x) PY|X(y|x) PZ|Y(z|y)   for all (x, y, z) ∈ X × Y × Z.
This is the same as saying that X and Z are conditionally independent given Y. Notice that if we do not assume anything about the joint distribution PXYZ, then it factorizes (by repeated applications of Bayes' rule) as

  PXYZ(x, y, z) = PX(x) PY|X(y|x) PZ|XY(z|x, y)   for all (x, y, z) ∈ X × Y × Z,

so what Markovity in the order X − Y − Z buys us is that PZ|XY(z|x, y) = PZ|Y(z|y) (i.e., we can drop the conditioning on X). In essence, all the information that we can learn about Z from X is already contained in Y. No other information about Z can be gleaned from knowing X if we already know Y. Another way of saying this is that the conditional distribution of X and Z given Y = y can be factorized as

  PXZ|Y(x, z|y) = PX|Y(x|y) PZ|Y(z|y)   for all (x, y, z) ∈ X × Y × Z.
Notice that this is in direct analogy to the situation where X and Z are (marginally) independent. Simply
set Y to be a deterministic random variable to recover the definition of independence.
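The following Python sketch builds a joint pmf with the Markov factorization PX PY|X PZ|Y for binary X, Y, Z (the numerical values are arbitrary illustrative choices) and verifies that PZ|XY(z|x, y) = PZ|Y(z|y).

```python
# Construct PXYZ = PX * PY|X * PZ|Y and check that conditioning on X is
# redundant once Y is known. All numbers are illustrative.
PX = {0: 0.6, 1: 0.4}
PY_X = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # PY|X(y|x), indexed [x][y]
PZ_Y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # PZ|Y(z|y), indexed [y][z]

PXYZ = {(x, y, z): PX[x] * PY_X[x][y] * PZ_Y[y][z]
        for x in (0, 1) for y in (0, 1) for z in (0, 1)}

for x in (0, 1):
    for y in (0, 1):
        pxy = sum(PXYZ[(x, y, z)] for z in (0, 1))   # PXY(x, y)
        for z in (0, 1):
            # PZ|XY(z|x, y) equals PZ|Y(z|y) for every (x, y, z)
            print(x, y, z, PXYZ[(x, y, z)] / pxy, PZ_Y[y][z])
```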
Some exercises:
1. Assume X − Y − Z. Show that it is also true that Z − Y − X.
2. If Z is a deterministic function of Y , show that X − Y − Z is true.
3. If X and Z are conditionally independent given Y , this does not imply that X and Z are marginally
independent (in general). Construct a counterexample.
5 Probability Bounds
In this section, we summarize some bounds on probabilities that we use extensively in the sequel. For a
random variable X, we let E[X] and Var(X) be its expectation and variance respectively. To emphasize that
the expectation is taken with respect to a random variable X with distribution P , we sometimes make this
explicit by using a subscript, i.e., EX or EP .
5.1 Basic Bounds
We start with the familiar Markov and Chebyshev inequalities.
Proposition 1 (Markov’s inequality). Let X be a real-valued non-negative random variable. Then for any
a > 0, we have
  Pr(X > a) ≤ E[X]/a.
Proof. By the definition of the expectation, we have

  E[X] = ∫_0^∞ x fX(x) dx ≥ ∫_a^∞ x fX(x) dx ≥ a ∫_a^∞ fX(x) dx = a Pr(X > a),

and we are done.

In which step is non-negativity of X used?
If we let X above be the non-negative random variable (X − E[X])^2, we obtain Chebyshev's inequality.
Proposition 2 (Chebyshev's inequality). Let X be a real-valued random variable with mean µ and variance σ^2. Then for any a > 0, we have

  Pr(|X − µ| > aσ) ≤ 1/a^2.   (1)
Proof. Let X in Markov's inequality be the random variable g(X) := (X − E[X])^2. This is clearly non-negative and the expectation of g(X) is Var(X) = σ^2. Thus, by Markov's inequality, we have

  Pr(g(X) > a^2 σ^2) ≤ σ^2/(a^2 σ^2) = 1/a^2.

Now, g(X) > a^2 σ^2 if and only if |X − µ| > aσ, so the claim is proved.
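Both bounds are easy to check numerically. The Python sketch below uses Exp(1) samples (mean 1, variance 1; an illustrative choice) and a = 3; the true probabilities e^{−3} ≈ 0.05 and e^{−4} ≈ 0.02 sit comfortably below the bounds 1/3 and 1/9.

```python
import random

random.seed(4)
xs = [random.expovariate(1.0) for _ in range(200_000)]    # X ~ Exp(1): non-negative

a = 3.0
markov_lhs = sum(x > a for x in xs) / len(xs)             # Pr(X > a)
print(markov_lhs, "<=", 1.0 / a)                          # Markov bound E[X]/a

mu, sigma = 1.0, 1.0                                      # Exp(1): mean 1, sd 1
cheb_lhs = sum(abs(x - mu) > a * sigma for x in xs) / len(xs)
print(cheb_lhs, "<=", 1.0 / a ** 2)                       # Chebyshev bound 1/a^2
```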
We now consider a collection of real-valued random variables that are independent and identically distributed (i.i.d.). In particular, let X^n = (X1, . . . , Xn) be a collection of independent random variables where each Xi has distribution P with zero mean and finite variance σ^2.
Proposition 3 (Weak Law of Large Numbers). For every ε > 0, we have

  lim_{n→∞} Pr( |(1/n) Σ_{i=1}^n Xi| > ε ) = 0.

Consequently, the average (1/n) Σ_{i=1}^n Xi converges to 0 in probability.
Note that for a sequence of random variables {Sn}_{n=1}^∞, we say that this sequence converges to a number b ∈ R in probability if for all ε > 0,

  lim_{n→∞} Pr(|Sn − b| > ε) = 0.

We also write this as Sn →_p b.
Contrast this to the convergence of numbers: we say that a sequence of numbers {sn}_{n=1}^∞ converges to a number b ∈ R if for all ε > 0 there exists N such that |sn − b| ≤ ε for all n ≥ N, i.e.,

  lim_{n→∞} |sn − b| = 0.
Proof. Let (1/n) Σ_{i=1}^n Xi take the role of X in Chebyshev's inequality. Clearly, the mean is zero. The variance of X is

  Var( (1/n) Σ_{i=1}^n Xi ) = (1/n^2) Var( Σ_{i=1}^n Xi ) = (1/n^2) Σ_{i=1}^n Var(Xi) = σ^2/n.
Thus, we have

  Pr( |(1/n) Σ_{i=1}^n Xi| > ε ) ≤ σ^2/(n ε^2) → 0,

which proves the claim.
In fact, under mild conditions, the convergence to zero occurs exponentially fast. See, for example, Cramér's theorem in [2, Thm. 2.2.3].
There is also a theorem known as the "strong law of large numbers", but we will not need it for the purposes of this course.
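The weak law is easy to observe in simulation. The Python sketch below uses Xi uniform on [−1, 1] (zero mean; an illustrative choice) and shows the fraction of trials with |(1/n) Σ_{i=1}^n Xi| > ε shrinking as n grows.

```python
import random

random.seed(5)
eps, trials = 0.05, 2000
for n in (10, 100, 1000):
    exceed = 0
    for _ in range(trials):
        sample_mean = sum(random.uniform(-1, 1) for _ in range(n)) / n
        exceed += abs(sample_mean) > eps
    print(n, exceed / trials)   # fraction of trials with |sample mean| > eps
```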
5.2 Central Limit-Type Bounds
In preparation for the next result, we denote the probability density function (pdf) of a univariate Gaussian as

  N(x; µ, σ^2) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}.   (2)

We will also denote this as N(µ, σ^2) if the argument is unnecessary.
For the univariate case, the cumulative distribution function (cdf) of a standard Gaussian is denoted as

  Φ(y) := ∫_{−∞}^y N(x; 0, 1) dx.   (3)
If the scaling in front of the sum in the statement of the law of large numbers (Proposition 3) is 1/√n instead of 1/n, the resultant random variable (1/√n) Σ_{i=1}^n Xi converges in distribution to a Gaussian random variable. As in Proposition 3, let X^n be a collection of i.i.d. random variables where each Xi is zero mean with finite variance σ^2.
Proposition 4 (Central Limit Theorem). For any a ∈ R, we have

  lim_{n→∞} Pr( (1/(σ√n)) Σ_{i=1}^n Xi < a ) = Φ(a).   (4)

In other words,

  (1/(σ√n)) Σ_{i=1}^n Xi →_d Z,   (5)

where →_d means convergence in distribution and Z is a standard Gaussian random variable.
For a sequence of random variables {Sn}_{n=1}^∞, we say that this sequence converges in distribution to another random variable S̄ if

  lim_{n→∞} Pr(Sn < a) = Pr(S̄ < a)

for all a ∈ R. So the distribution functions converge at all points a. Actually this is not quite true (the convergence is only required at points a where the cdf of S̄ is continuous), but this is enough for our purposes in this course.
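The central limit theorem can likewise be observed numerically. The following Python sketch (with Xi uniform on [−1, 1], so σ^2 = 1/3; illustrative choices throughout) compares the empirical value of Pr((1/(σ√n)) Σ_{i=1}^n Xi < a) with Φ(a).

```python
import math
import random

random.seed(6)
sigma = math.sqrt(1 / 3)          # variance of Uniform[-1, 1] is 1/3
n, trials, a = 1000, 5000, 1.0
hits = 0
for _ in range(trials):
    s = sum(random.uniform(-1, 1) for _ in range(n))
    hits += s / (sigma * math.sqrt(n)) < a

Phi_a = 0.5 * (1 + math.erf(a / math.sqrt(2)))   # standard Gaussian cdf at a
print(hits / trials, Phi_a)                       # both close to 0.841
```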
6 Jensen's Inequality and Convexity
A function f (x) is said to be convex if for all x, y and λ ∈ [0, 1], we have
  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).   (6)
The function f is strictly convex if equality in (6) holds iff λ = 0 or 1, or x = y.
Proposition 5. Let f be convex. Then f(x) = sup_{l∈L} l(x), where L = {l : l(u) = au + b ≤ f(u) for all u, and for some a, b}.
Proof. It suffices to show that for any given x, we can find a linear function l(·) such that l(x) = f (x) and
l(u) ≤ f (u) for all u.
For any h > 0, convexity of f implies that 2f(x) ≤ f(x + h) + f(x − h), i.e.,

  (f(x) − f(x − h))/h ≤ (f(x + h) − f(x))/h,   (7)
so that by letting h ↓ 0, we have

  lim_{h↓0} (f(x) − f(x − h))/h ≤ lim_{h↓0} (f(x + h) − f(x))/h,

which means that the left derivative of f at x is no greater than the right derivative. We can then choose a constant a between these two limits and let l(u) = a(u − x) + f(x). We claim that this linear function is the one we are looking for. Indeed, note that l(x) = f(x), and for any u > x, letting h = u − x, we have

  l(u) = a(u − x) + f(x)
       ≤ ((f(x + h) − f(x))/h)(u − x) + f(x)
       = f(x + h) = f(u),

where the inequality follows from (7) and our choice of a. A similar argument holds for u < x, and the proof is complete.
Often it is hard to check convexity directly. But for twice differentiable functions, this is easy.
Proposition 6. Let f : [a, b] → R be twice differentiable. Then f is convex if and only if f''(x) ≥ 0 for all x ∈ (a, b).
Proof. Assume f''(x) ≥ 0 for all x ∈ [a, b]. By a Taylor expansion of f around x0 ∈ (a, b), we have

  f(x) = f(x0) + f'(x0)(x − x0) + (f''(x*)/2)(x − x0)^2

where x* lies between x0 and x. By assumption f''(x*) ≥ 0, so the quadratic term is non-negative. Now let x0 = λx1 + (1 − λ)x2. First let x = x1. Then we have

  f(x1) ≥ f(x0) + f'(x0)(1 − λ)(x1 − x2).

Now let x = x2. Then we have

  f(x2) ≥ f(x0) + f'(x0)λ(x2 − x1).

Multiplying the first inequality by λ and the second by 1 − λ and adding them up, we recover the definition of convexity.
In the other direction, let f be convex and twice differentiable on [a, b]. Choose a < x1 < x2 < x3 < x4 < b. By the definition of convexity (check!),

  (f(x2) − f(x1))/(x2 − x1) ≤ (f(x4) − f(x3))/(x4 − x3).

Now let x2 ↓ x1 and x3 ↑ x4. We see that f'(x1) ≤ f'(x4), and since these were arbitrary points, f' is increasing on (a, b). So f''(x) ≥ 0 for all x ∈ (a, b).
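As a numerical sanity check of Proposition 6, the sketch below compares a finite-difference estimate of f''(x) with the convexity inequality (6) for f(x) = x log x, an illustrative choice with f''(x) = 1/x > 0 on (0, ∞).

```python
import math
import random

def f(x):
    return x * math.log(x)    # strictly convex on (0, ∞); f''(x) = 1/x

h = 1e-4
random.seed(7)
for _ in range(5):
    x, y, lam = random.uniform(0.1, 5), random.uniform(0.1, 5), random.random()
    f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2    # finite-difference f''(x)
    gap = lam * f(x) + (1 - lam) * f(y) - f(lam * x + (1 - lam) * y)
    print(f"f''(x) ~ {f2:.4f} >= 0, convexity gap = {gap:.6f} >= 0")
```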
Proposition 7 (Jensen’s Inequality). If f is convex, E[|X|] < ∞, and E[|f (X)|] < ∞, then
E[f (X)] ≥ f (E[X]).
Furthermore, if f is strictly convex, then equality holds iff X is a constant.
Proof.

  E[f(X)] = E[ sup_{l∈L} l(X) ]    (from Proposition 5)
          ≥ sup_{l∈L} E[l(X)]      (can you see why?)
          = sup_{l∈L} l(E[X])      (because l is linear)
          = f(E[X]).
For a simpler proof of Jensen’s inequality for discrete distributions, we may use induction.
Proof. By convexity, we have

  p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2).

Suppose the statement E[f(X)] ≥ f(E[X]) is true for discrete distributions with k − 1 ≥ 2 mass points. Then consider a distribution with k mass points {p1, p2, . . . , pk}. Define

  p'i := pi/(1 − pk),   i = 1, . . . , k − 1.
We then have

  Σ_{i=1}^k pi f(xi) = pk f(xk) + (1 − pk) Σ_{i=1}^{k−1} p'i f(xi)
                     ≥ pk f(xk) + (1 − pk) f( Σ_{i=1}^{k−1} p'i xi )
                     ≥ f( pk xk + (1 − pk) Σ_{i=1}^{k−1} p'i xi )
                     = f( Σ_{i=1}^k pi xi ),

where the first inequality is from the induction hypothesis and the second by convexity (of two points). By the definition of expectation, we have E[f(X)] ≥ f(E[X]).
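Jensen's inequality is also easy to check by simulation. The Python sketch below uses the strictly convex f(x) = x^2 and Gaussian samples (illustrative choices); the exact values here are E[X^2] = 1.25 and (E[X])^2 = 0.25.

```python
import random

random.seed(8)
xs = [random.gauss(0.5, 1.0) for _ in range(100_000)]   # X ~ N(0.5, 1)

def f(x):
    return x * x          # f(x) = x^2, strictly convex

E_fX = sum(f(x) for x in xs) / len(xs)    # E[f(X)]
f_EX = f(sum(xs) / len(xs))               # f(E[X])
print(E_fX, ">=", f_EX)                   # Jensen: about 1.25 >= 0.25
```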
References
[1] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.
[2] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer, 2nd edition, 1998.