Lecture 1
Basic probability refresher
1.1 Characterizations of random variables
Let (Ω, F, P ) be a probability space where Ω is a general set, F is a σ-algebra and P is a
probability measure on Ω. A random variable (r.v.) X is a measurable function X : (Ω, F) →
(R, B) where B is a Borel σ-algebra. We will also write X(ω) to stress the fact that it is a
function of ω ∈ Ω.
The cumulative distribution function (c.d.f.) of a random variable X is the function F : R → [0, 1],

F(x) = P(X ≤ x) = P(ω : X(ω) ≤ x).
F is monotone nondecreasing, right-continuous and such that F (−∞) = 0 and F (∞) = 1. We
also refer to F as the probability law (distribution) of X.
We distinguish 2 types of random variables: discrete variables and continuous variables.
Discrete variable. X takes values in a finite or countable set. A Poisson random variable X is an example of a discrete variable with a countable value set: for λ > 0 the distribution of X satisfies

Pλ(X = k) = (λ^k/k!) e^{−λ},   k = 0, 1, 2, ...
We will see in the sequel the importance of this law and how it is linked to the Poisson point process.
We denote X ∼ P(λ) and say that X is distributed according to the Poisson distribution with
parameter λ. The c.d.f. of X is
[Figure: the c.d.f. of the Poisson distribution, plotted for x from −1 to 6.]
The c.d.f. of a discrete random variable is a step function.
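As a quick numerical illustration (not part of the original notes), here is a minimal Python sketch that evaluates the Poisson c.d.f. on a grid; the jumps at k = 0, 1, 2, ... make the step structure visible. The function name and the choice λ = 1 are illustrative.

import math

def poisson_cdf(x, lam):
    # P(X <= x) for X ~ P(lam): sum of the pmf over k = 0, 1, ..., floor(x)
    if x < 0:
        return 0.0
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(int(math.floor(x)) + 1))

# The c.d.f. is constant between integers and jumps at each integer k.
for x in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"F({x:4.1f}) = {poisson_cdf(x, lam=1.0):.4f}")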
Continuous variable. X is a continuous variable if its distribution admits a density with
respect to the Lebesgue measure on R. In this case the c.d.f. F of X is differentiable almost
everywhere on R and its derivative
f(x) = F′(x)

is called the probability density of X. Note that f(x) ≥ 0 for all x ∈ R and

∫_{−∞}^{∞} f(x) dx = 1.
Example 1.1
a) Normal distribution N (µ, σ 2 ) with density
f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},   x ∈ R,
where µ ∈ R and σ > 0. If µ = 0, σ 2 = 1, the distribution N (0, 1) is referred to as standard
normal distribution.
b) Uniform distribution U [0, θ] with density
f(x) = (1/θ) I{x ∈ [0, θ]},   x ∈ R,

where θ > 0 and I{·} stands for the indicator function: for a set A,

I{x ∈ A} = 1 if x ∈ A, and 0 otherwise.
c) Exponential distribution E(λ) with density
f(x) = λe^{−λx} for x ≥ 0 and f(x) = 0 for x < 0,
where λ > 0. The c.d.f. of E(λ) is given by
F (x) = (1 − e−λx ) for x ≥ 0 and F (x) = 0 for x < 0.
Discrete distributions are entirely determined by the probabilities {P(X = k)}_k, while continuous distributions are determined by their density f(·). However, some scalar functionals of the distribution may be useful to characterize the behavior of the corresponding random variables. Examples of such functionals are the moments and the quantiles.
1.1.1 Moments of random variables
Mean (expectation) of a random variable X:

µ = E(X) = ∫_{−∞}^{∞} x dF(x) = Σ_i i P(X = i) in the discrete case, and ∫ x f(x) dx in the continuous case.

Moment of order k (k = 1, 2, ...):

µ_k = E(X^k) = ∫_{−∞}^{∞} x^k dF(x),

as well as the central moment of order k:

µ′_k = E((X − µ)^k) = ∫_{−∞}^{∞} (x − µ)^k dF(x).
A special case is the variance σ² (= µ′_2, the central moment of order 2):

σ² = Var(X) = E((X − E(X))²) = E(X²) − (E(X))².

The square root of the variance is called the standard deviation (s.d. or st.d.) of X: σ = √Var(X).
Absolute moment µ̄_k of order k:

µ̄_k = E(|X|^k),

and the central absolute moment of order k:

µ̄′_k = E(|X − µ|^k).
Clearly, these definitions assume the existence of the respective integrals, and not all distributions
possess moments.
Example 1.2
Let X be a random variable with probability density

f(x) = c/(1 + |x| log²|x|),   x ∈ R,

where the constant c > 0 is such that ∫ f = 1. Then E(|X|^a) = ∞ for all a > 0.
The mean is used to characterize the location (position) of a random variable. The variance
characterizes the scale (dispersion) of the distribution.
The normal distribution N (µ, σ 2 ) with mean µ and variance σ 2 :
[Figure: normal densities N(µ, σ²) with the same mean – “large” σ (large dispersion) vs. “small” σ (little dispersion).]
Let F be the c.d.f. of the random variable X with mean µ and variance σ². By an affine transformation we obtain the variable X₀ = (X − µ)/σ, such that E(X₀) = 0, E(X₀²) = 1 (the standardized variable). If F₀ is the c.d.f. of X₀ then F(x) = F₀((x − µ)/σ). In the continuous case, the density of X satisfies

f(x) = (1/σ) f₀((x − µ)/σ),

where f₀ is the density of X₀.
Note that it is not necessary to assume that the mean and the variance exist in order to define the standardized distribution F₀ and the representation F(x) = F₀((x − µ)/σ). Typically, this is done to underline that F depends on a location parameter µ and a scale parameter σ. E.g., for the family of Cauchy densities parameterized by µ and σ,

f(x) = 1/(πσ(1 + [(x − µ)/σ]²)),

the standardized density is f₀(x) = 1/(π(1 + x²)). Meanwhile, the expectation and the variance do not exist for the Cauchy distribution.
An interesting problem of calculus is related to the notion of moments µ_k: let F be a c.d.f. such that all its moments are finite. Given the sequence {µ_k}, k = 1, 2, ..., of moments of F, is it possible to recover F? The general answer to this question is negative. Nevertheless, there exist particular cases where the recovery is possible, namely, under the hypothesis that

lim sup_{k→∞} µ̄_k^{1/k}/k < ∞

(µ̄_k being the k-th absolute moment). This hypothesis holds true, for instance, for densities with bounded support. To the best of our knowledge, necessary and sufficient conditions for the existence of a solution to the problem of moments are currently unknown.
1.1.2 Probability quantiles
Let X be a random variable with continuous and strictly increasing c.d.f. F . The quantile of
order p, 0 < p < 1, of the distribution F is the solution qp of the equation
F (qp ) = p.
Observe that if F is strictly increasing and continuous, the solution exists and is unique, thus the quantile qp is well defined. If F has “flat zones” or is not continuous we can modify the definition, for instance, as follows:
Definition 1.1 Let F be a c.d.f. The quantile qp of order p of F is the value
qp = inf{q : F (q) ≥ p}.
The median M of the c.d.f. F is the quantile of order 1/2,
M = q1/2 .
Note that if F is continuous F (M ) = 1/2.
The quartiles are the quantiles q1/4 and q3/4 of order 1/4 and 3/4.
The l% percentiles of F are the quantiles qp of order p = l/100, 0 < l < 100.
We note that the median characterizes the location of the probability distribution, while the
difference q3/4 − q1/4 (referred to as the interquartile interval) can be interpreted as a characteristic of scale. These quantities are analogues of the mean µ and standard deviation σ. However,
unlike the mean and the standard deviation, the median and the interquartile interval are well
defined for all probability distributions.
1.1.3 Other characterizations
The mode. For a discrete distribution F, we call the mode of F the value k* such that

P(X = k*) = max_k P(X = k).

In the continuous case, the mode x* is defined as a local maximum of the density f:

f(x*) = max_x f(x).
A density f is said to be unimodal if x* is the unique local maximum of f (one can also speak of bi-modal or multi-modal densities). This characteristic is rather imprecise, because even when the density has a unique global maximum, we will call it multimodal if it has other local maxima. The mode is a characteristic of location which can be of interest in the case of a unimodal density.
[Figure: a skewed unimodal density with the mode, the median and the mean marked.]
Skewness and kurtosis
Definition 1.2 The distribution of X (the c.d.f. F) is said to be symmetric with respect to zero (or “simply” symmetric) if for all x ∈ R, F(x) = 1 − F(−x) (f(x) = f(−x) in the continuous case).
Definition 1.3 The distribution of X (the c.d.f. F) is called symmetric with respect to µ ∈ R if

F(x + µ) = 1 − F(µ − x)

(f(x + µ) = f(µ − x) in the continuous case).
In other words, the c.d.f. F(· + µ), i.e. the c.d.f. of X − µ, is symmetric (with respect to zero).
Exercise 1.1
a) Show that if F is symmetric with respect to µ, and E(|X|) < ∞, then E(X) = µ. Moreover, if F admits a unimodal density, then the mean = median = mode.
b) If F is symmetric and all absolute moments µ̄_k exist, then the moments µ_k = 0 for all odd k. If F is symmetric with respect to µ and all the moments µ̄_k exist, then µ′_k = 0 for all odd k (e.g., µ′_3 = 0).
We can qualify the “asymmetry” of distributions (for which E(|X|³) < ∞) using the skewness parameter

α = µ′_3/σ³.

Note that α = 0 for a symmetric c.d.f. such that E(|X|³) < ∞. Note that the converse is not true: the condition α = 0 does not imply that the distribution is symmetric.
Exercise 1.2
Provide an example of an asymmetric density with α = 0.
Observe the role of σ in the definition of α: suppose, for instance, that the density f₀(x) of X satisfies ∫ x f₀(x) dx = 0 and ∫ x² f₀(x) dx = 1, and set α₀ = µ′₃,₀ = ∫ x³ f₀(x) dx. For σ > 0, µ ∈ R, the function

f(x) = (1/σ) f₀((x − µ)/σ)

is the density of the random variable σX + µ, and thus Var(σX + µ) = σ² and µ′₃ = ∫ (x − µ)³ f(x) dx = σ³ α₀. When computing α = µ′₃/σ³ we see that α = α₀. Thus the skewness α is invariant with respect to affine transformations (of scale and position).
Note that one cannot say that α > 0 for distributions which are “asymmetric on the right”, or α < 0 for “asymmetric on the left” distributions. The notions of left or right asymmetry are not properly defined.
Kurtosis coefficient β is defined as follows: if the 4th central moment µ′₄ of X exists then

β = µ′₄/σ⁴ − 3.
Exercise 1.3
Show that µ′₄/σ⁴ = 3 and β = 0 for the normal distribution N(µ, σ²).
We note that, just like the asymmetry coefficient α, the kurtosis β is invariant with respect
to affine transformations.
The coefficient β is often used to roughly qualify the tails of the distribution of X. One uses the following vocabulary: a distribution F has “heavy tails” if

Q(b) = ∫_{|x|≥b} dF(x)  (= ∫_{|x|≥b} f(x) dx in the continuous case)

decreases slowly when b → ∞; for instance, polynomially (as 1/b^r where r > 0). Conversely, we say that F has “light tails” if Q(b) decreases fast (for example, exponentially).
We may use the following heuristics: if β > 0 we may consider that the distribution tails are heavier than those of the normal distribution (Q(b) = O(e^{−b²/2}) for N(0, 1)); such a distribution is said to be leptokurtic. If β < 0 (the distribution is then said to be platykurtic), we assume that its tails are lighter than those of the normal distribution (β = 0 for the normal distribution).
Note also that β ≥ −2 for all distributions (see the next section).
Example 1.3
a) The kurtosis β of the uniform distribution U [0, 1] is equal to −1.2 (ultra-light tails).
b) If f(x) ∼ |x|⁻⁵ when |x| tends to ∞, σ² is finite but µ′₄ = ∞, implying that β = ∞ (heavy tails).
1.2 Some useful inequalities
Proposition 1.1 (Markov inequality) Let h(·) be a positive nondecreasing function, and
E(h(X)) < ∞. Then for all a > 0 such that h(a) > 0,
P(X ≥ a) ≤ E(h(X))/h(a).    (1.1)
Proof : Let a > 0 be such that h(a) > 0. Since h(·) is a nondecreasing function,
P(X ≥ a) ≤ P(h(X) ≥ h(a)) = ∫ I{h(x) ≥ h(a)} dF(x) = E(I{h(X) ≥ h(a)}) ≤ E( (h(X)/h(a)) I{h(X) ≥ h(a)} ) ≤ E(h(X))/h(a).
Corollary 1.1 (Chebyshev inequality) Let X be a random variable such that E(X 2 ) < ∞.
Then for all a > 0
P(|X| ≥ a) ≤ E(X²)/a²,   P(|X − E(X)| ≥ a) ≤ Var(X)/a².

Proof: To show the first inequality it suffices to apply (1.1) with h(t) = t² to the variable Y = |X| (or to Y = |X − E(X)| for the second one).
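As a quick sanity check (not from the original notes), a minimal Python sketch comparing the Chebyshev bound Var(X)/a² with the empirical tail probability for a standard normal sample; the bound is valid but typically loose. The sample size and seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)    # X ~ N(0, 1), so Var(X) = 1

for a in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - x.mean()) >= a)
    chebyshev = x.var() / a**2      # bound on P(|X - E(X)| >= a)
    print(f"a = {a}: empirical tail = {empirical:.4f}, Chebyshev bound = {chebyshev:.4f}")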
Proposition 1.2 (Hölder inequality) Let r > 1, with 1/r + 1/s = 1. Let ξ and η be two random variables such that E(|ξ|^r) < ∞ and E(|η|^s) < ∞. Then E(|ξη|) < ∞ and

E(|ξη|) ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s}.

Proof: We first note that for all a > 0, b > 0, by concavity of log t,

(1/r) log a + (1/s) log b ≤ log(a/r + b/s),

which is equivalent to

a^{1/r} b^{1/s} ≤ a/r + b/s.

Let us set a = |ξ|^r/E(|ξ|^r) and b = |η|^s/E(|η|^s) (we suppose for a moment that E(|ξ|^r) ≠ 0 and E(|η|^s) ≠ 0), which results in

|ξη| ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s} ( |ξ|^r/(r E(|ξ|^r)) + |η|^s/(s E(|η|^s)) ),

and we conclude by taking the expectation. If E(|ξ|^r) = 0 or E(|η|^s) = 0, then ξ = 0 (a.s.) or η = 0 (a.s.), and the inequality is trivial.
Corollary 1.2 (Lyapunov inequality) Let 0 < v < t and let X be a random variable such that E(|X|^t) < ∞. Then E(|X|^v) < ∞ and

[E(|X|^v)]^{1/v} ≤ [E(|X|^t)]^{1/t}.    (1.2)

To show the corollary it suffices to apply the Hölder inequality with ξ = X^v, η = 1, r = t/v.
Using the inequality (1.2) with v = 2, t = 4 and |X − E(X)| instead of |X| we get µ′₄/σ⁴ ≥ 1. Thus the kurtosis coefficient β satisfies the inequality β ≥ −2.
The Lyapunov inequality implies the chain of inequalities

E(|X|) ≤ [E(|X|²)]^{1/2} ≤ ... ≤ [E(|X|^k)]^{1/k}.
Corollary 1.3 (Cauchy-Schwarz inequality) Let ξ and η be two random variables such that
E(ξ²) < ∞ and E(η²) < ∞. Then E(|ξη|) < ∞ and

(E(|ξη|))² ≤ E(ξ²)E(η²).
(A particular case of the Hölder inequality with r = s = 2.)
Proposition 1.3 (Jensen inequality) Let g(·) be a convex function, X be a random variable
such that E(|X|) < ∞. Then
g(E(X)) ≤ E(g(X)).
Proof: By convexity of g, there exists a function g₁(·) such that for all x, x₀ ∈ R,

g(x) ≥ g(x₀) + (x − x₀) g₁(x₀).

We put x₀ = E(X). Then

g(X) ≥ g(E(X)) + (X − E(X)) g₁(E(X)).

Taking the expectation we obtain E(g(X)) ≥ g(E(X)).
We have the following simple example of application of the Jensen inequality:

|E(X)| ≤ E(|X|).    (1.3)
Proposition 1.4 (Cauchy-Schwarz inequality, a modification) Let ξ and η be two random variables such that E(ξ²) < ∞ and E(η²) < ∞. Then

(E(ξη))² ≤ E(ξ²)E(η²).    (1.4)

Moreover, the equality is attained if and only if (iff) there are a₁, a₂ ∈ R such that a₁ ≠ 0 or a₂ ≠ 0, and

a₁ξ + a₂η = 0 (a.s.)    (1.5)
Proof: The inequality (1.4) is a consequence of Corollary 1.3 and of (1.3). If (1.5) is true, the equality

(E(ξη))² − E(ξ²)E(η²) = 0    (1.6)

is obvious. On the other hand, if we have (1.6) and E(η²) ≠ 0, then E((ξ − aη)²) = 0 with a = E(ξη)/E(η²), which implies that ξ = aη a.s. The case E(η²) = 0 is trivial.
1.3 Sequences of random variables
Let ξ1 , ξ2 ..., and ξ be random variables (r.v.) on (Ω, F, P ).
Definition 1.4 The sequence (ξn) converges to a random variable ξ in probability (denoted ξn →^P ξ) when n → ∞ if

lim_{n→∞} P{|ξn − ξ| ≥ ε} = 0

for any ε > 0.
Definition 1.5 The sequence (ξn) converges to ξ in quadratic mean (or “in L²”) if E(ξ²) < ∞ and

lim_{n→∞} E(|ξn − ξ|²) = 0.
Definition 1.6 The sequence (ξn) converges to ξ almost surely (denoted ξn → ξ (a.s.), n → ∞) if

P{ω : ξn(ω) does not converge to ξ(ω)} = 0.

Remark. It can be shown that this definition is equivalent to the following one: for all ε > 0,

lim_{n→∞} P{sup_{k≥n} |ξk − ξ| ≥ ε} = 0.
Definition 1.7 The sequence (ξn) converges to a random variable ξ in distribution (we denote ξn →^D ξ, n → ∞) if

P{ξn ≤ t} → P{ξ ≤ t} as n → ∞

at all points of continuity of the c.d.f. F(t) = P{ξ ≤ t}.
Remark. The latter definition is equivalent to the convergence

E(f(ξn)) → E(f(ξ)) as n → ∞

for all continuous and bounded f (weak convergence).
Relationships between different types of convergence:

L²-convergence ⟹ convergence in probability ⟹ convergence in distribution,
a.s. convergence ⟹ convergence in probability.

Exercise 1.4
Let (ξn) and (ηn) be two sequences of r.v. Prove the following statements:
1°. If a ∈ R is a constant then

ξn →^D a  ⇔  ξn →^P a,  when n → ∞.

2°. (Slutsky's theorem) If ξn →^D a and ηn →^D η when n → ∞ and a ∈ R is a constant, then

ξn + ηn →^D a + η,  as n → ∞.

Show that if a is a general r.v., these two relations do not hold (construct a counterexample).
3°. Let ξn →^P a and let ηn →^D η when n → ∞, where a ∈ R is a constant and η is a random variable. Then

ξn ηn →^D aη,  as n → ∞.

Would this result continue to hold if we suppose that a is a general random variable?
1.4 Independence and limit theorems
Definition 1.8 Let X and Y be two random variables. The variable X is said to be independent of
Y if
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B)
for all A ∈ B and B ∈ B (Borel A and B), denoted X⊥⊥Y .
If E(|X|) < ∞, E(|Y |) < ∞ then the independence implies
E(XY ) = E(X)E(Y )
(the converse does not hold!).
Definition 1.9 Let X1 , ..., Xn be random variables, we say that X1 , ..., Xn are (mutually) independent if for all A1 , ..., An ∈ B
P (X1 ∈ A1 , ..., Xn ∈ An ) = P (X1 ∈ A1 ) · · · P (Xn ∈ An ).
Remark. The fact that Xi, i = 1, ..., n, are pairwise independent, i.e. Xi⊥⊥Xj for i ≠ j, does not imply that X1, ..., Xn are mutually independent. On the other hand, mutual independence implies
pairwise independence. In particular, if X1 , ..., Xn are independent and E(|Xi |) < ∞, i = 1, ..., n,
E(Xi Xj) = E(Xi)E(Xj),   i ≠ j.
1.4.1 Sums of independent random variables
Let us consider the sum Σ_{i=1}^n Xi, where X1, ..., Xn are independent. If E(Xi²) < ∞, i = 1, ..., n (by the Lyapunov inequality this implies E(|Xi|) < ∞), then

E(Σ_{i=1}^n Xi) = Σ_{i=1}^n E(Xi)  (true without the independence hypothesis)

and, moreover,

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).
Definition 1.10 We say that the variables X1 , ..., Xn are i.i.d. (independent and identically
distributed) if they are mutually independent and Xi obeys the same distribution as Xj for all
1 ≤ i, j ≤ n.
Proposition 1.5 Let X1 , ..., Xn be i.i.d. r.v. such that E(X1 ) = µ and Var(X1 ) = σ 2 < ∞.
Then the arithmetic mean

X̄ = (1/n) Σ_{i=1}^n Xi

satisfies

E(X̄) = µ  and  Var(X̄) = (1/n) Var(X1) = σ²/n.
Proposition 1.6 (Kolmogorov’s strong law of large numbers) Let X1 , ..., Xn be i.i.d. r.v.
such that E(|X1 |) < ∞, and µ = E(X1 ). We have
X̄ → µ (a.s.) when n → ∞.
Counterexample. Let Xi be i.i.d. r.v. with the Cauchy distribution with density

f(x) = 1/(π(1 + x²)),   x ∈ R.

Then E(|X1|) = ∞, E(X1) is not defined and the mean X̄ does not converge (we observe that the Cauchy distribution has “heavy tails”).
Proposition 1.7 (Central Limit Theorem (CLT)) Let X1 , ..., Xn be i.i.d. r.v. such that
E(X12 ) < ∞ and σ 2 = Var(X1 ) > 0. Then
√n (X̄ − µ)/σ →^D η,  as n → ∞,

where µ = E(X1) and η ∼ N(0, 1).
1.4.2 Asymptotic approximations of probability distributions
The CLT (Proposition 1.7) can be rewritten in the equivalent form:

P( √n (X̄ − µ)/σ ≤ t ) → P(η ≤ t),  as n → ∞,

for all t ∈ R, where η ∼ N(0, 1). Let us denote by

Φ(t) = P(η ≤ t)

the standard normal c.d.f. Then

P(X̄ ≤ x) = P( √n (X̄ − µ)/σ ≤ √n (x − µ)/σ ) ≈ Φ( √n (x − µ)/σ )

when n → ∞. In other words, for sufficiently large n, the c.d.f. P(X̄ ≤ x) of X̄ can be approximated by the normal c.d.f.:

P(X̄ ≤ x) ≈ Φ( √n (x − µ)/σ ).
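As an illustration (not in the original notes), a small Python sketch comparing the empirical distribution of X̄ with the normal approximation above, for exponential samples; the sample size n = 30 and the other constants are illustrative assumptions. The approximation improves as n grows.

import numpy as np
from math import erf, sqrt

def Phi(t):
    # standard normal c.d.f.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(1)
n, n_rep, lam = 30, 20_000, 1.0       # sample size, Monte Carlo replications, rate of E(lam)
mu, sigma = 1.0 / lam, 1.0 / lam      # mean and s.d. of the exponential E(lam)

xbar = rng.exponential(scale=1.0 / lam, size=(n_rep, n)).mean(axis=1)

for x in (0.9, 1.0, 1.1):
    empirical = np.mean(xbar <= x)               # P(X̄ <= x) estimated by simulation
    approx = Phi(sqrt(n) * (x - mu) / sigma)     # normal approximation from the CLT
    print(f"x = {x}: empirical = {empirical:.3f}, CLT approximation = {approx:.3f}")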
1.5 Continuity theorems
Proposition 1.8 (The first continuity theorem) Let g(·) be a continuous function, and let
ξ1 , ξ2 , ... and ξ be random variables on (Ω, F, P ). Then
(i) ξn → ξ (a.s.)  ⇒  g(ξn) → g(ξ) (a.s.);
(ii) ξn →^P ξ  ⇒  g(ξn) →^P g(ξ);
(iii) ξn →^D ξ  ⇒  g(ξn) →^D g(ξ).
Proof: (i) is evident. We prove (ii) in the particular case where ξ = a (a is fixed, nonrandom), the only case of interest in the sequel. The continuity of g implies that for any ε > 0 there exists δ > 0 such that

|ξn − a| ≤ δ  ⇒  |g(ξn) − g(a)| < ε.

Since ξn →^P a as n → ∞, we have

lim_{n→∞} P(|ξn − a| < δ) = 1 for all δ > 0.

Thus

lim_{n→∞} P(|g(ξn) − g(a)| < ε) = 1 for any ε > 0.

(iii) It suffices to prove (see the comment after Definition 1.7) that for any continuous and bounded function h(x),

E(h(g(ξn))) → E(h(g(ξ))),  n → ∞.

Since g is continuous, f = h ∘ g is also continuous and bounded, and we arrive at (iii) because ξn →^D ξ implies that

E(f(ξn)) → E(f(ξ)),  n → ∞,

for any continuous and bounded function f.
Proposition 1.9 (Second continuity theorem) Suppose that g(·) is continuously differentiable, and let X1, ..., Xn be i.i.d. random variables such that E(X1²) < ∞ and σ² = Var(X1) > 0. Then

√n (g(X̄) − g(µ))/σ →^D η g′(µ),  n → ∞,

where X̄ = (1/n) Σ_{i=1}^n Xi, µ = E(X1), and η ∼ N(0, 1).
Proof: Under the premise of the proposition the function

h(x) = (g(x) − g(µ))/(x − µ) if x ≠ µ,  and  h(x) = g′(µ) if x = µ,

is continuous. Because X̄ →^P µ (due to Proposition 1.6) and h is continuous, we conclude, due to the first continuity theorem, that

h(X̄) →^P h(µ) = g′(µ),  n → ∞.    (1.7)

However,

√n (g(X̄) − g(µ))/σ = (√n/σ) h(X̄)(X̄ − µ) = h(X̄)ηn,

where ηn = (√n/σ)(X̄ − µ). Now Proposition 1.7 implies that ηn →^D η ∼ N(0, 1) when n → ∞. Using this fact along with (1.7) and the result 3° of Exercise 1.4 we obtain the desired statement.
1.6 Simulation of random variables
In applications we often need to “generate” (build) a computer simulated sequence X1, ..., Xn of i.i.d. random values following a given distribution F (we call it a sample). Of course, computer simulation only allows us to build pseudo-random variables (not the “true” random ones). That means that the simulated values X1, ..., Xn are deterministic – they are obtained by a deterministic algorithm – but the properties of the sequence X1, ..., Xn are “analogous” to those of a random i.i.d. sequence. For example, for the pseudo-random variables one has

sup_x |F̂n(x) − F(x)| → 0,  n → ∞,

where F̂n(x) = µn/n and µn is the number of X1, ..., Xn which satisfy Xk < x. We call F̂n(x) the empirical distribution function computed from the sequence X1, ..., Xn (here we consider deterministic convergence, cf. Exercise 1.14). The strong law of large numbers and the central limit theorem also hold for pseudo-random variables, etc.
1.6.1 Simulation of uniformly distributed random variables
A generation routine is available in (essentially) all programming languages. How does it work? The c.d.f. F(x) of the distribution U[0, 1] satisfies

F(x) = 0 for x < 0,  F(x) = x for x ∈ [0, 1],  F(x) = 1 for x > 1.
Congruential algorithm. We fix a real number a > 1 and an integer m (usually a and m are “very large” numbers). We start with a fixed value z0. For 1 ≤ i ≤ n we define

zi = the remainder of the division of a·z_{i−1} by m = a·z_{i−1} − [a·z_{i−1}/m]·m,

where [·] is the integer part. We always have 0 ≤ zi < m. Thus, if we set

Ui = zi/m = a·z_{i−1}/m − [a·z_{i−1}/m],

then 0 ≤ Ui < 1. The sequence U1, ..., Un is considered a sample from the uniform distribution U[0, 1]. Even if this is not a random sequence, the empirical c.d.f.

Fn^U(x) = (1/n) Σ_{i=1}^n I{Ui ≤ x}

satisfies sup_{0≤x≤1} |Fn^U(x) − x| ≤ ε(m), n → ∞, with ε(m) converging rapidly to 0 when m → ∞. A well developed mathematical theory allows one to justify “good” choices of z0, a and m. For instance, the following values are often used:

a = 16807 (= 7⁵),  m = 2147483647 (= 2³¹ − 1).
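To make the recursion concrete, here is a minimal Python sketch of such a congruential generator using the constants quoted above (a = 7⁵, m = 2³¹ − 1); this is an illustrative toy, not the generator actually used by any particular library, and the seed z0 is arbitrary.

def congruential_uniforms(n, z0=12345, a=16807, m=2147483647):
    # Generate n pseudo-uniform values U_i = z_i / m with z_i = a*z_{i-1} mod m.
    us = []
    z = z0
    for _ in range(n):
        z = (a * z) % m          # z_i = a*z_{i-1} - [a*z_{i-1}/m]*m
        us.append(z / m)         # U_i in [0, 1)
    return us

print(congruential_uniforms(5))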
[Figure: the step-function empirical c.d.f. of U1, ..., Un against the theoretical U[0, 1] c.d.f.]
Recently, new pseudo-random generators with improved properties have become available, and congruential generators are now rarely used.
1.6.2 Simulation of a general pseudo-random variable
Given an i.i.d. sample U1, ..., Un from the uniform distribution, we can obtain a sample from a general distribution F(·) using the inversion algorithm. It may be used when an explicit expression for F(·) is available. This technique is based on the following statement:
Proposition 1.10 Let F be a continuous and strictly monotone c.d.f., and let U be a random
variable uniformly distributed on [0, 1]. Then the c.d.f. of the r.v.
X = F −1 (U )
is exactly F (·).
Proof : We observe that
F (x) = P (U ≤ F (x)) = P (F −1 (U ) ≤ x) = P (X ≤ x).
Consider the following algorithm for simulating a sample X1, ..., Xn from the distribution F: if F(x) is continuous and strictly increasing, we take

Xi = F⁻¹(Ui),

where Ui are pseudo-random variables uniformly distributed on [0, 1], i = 1, ..., n. This way we get a sample from F.
If F is not continuous or strictly monotone, we need to modify the definition of the “inverse” F⁻¹. We set

F⁻¹(y) := sup{t : F(t) < y}.

Then,

P(Xi ≤ x) = P(sup{t : F(t) < Ui} ≤ x) = P(Ui ≤ F(x)) = F(x).
Example 1.4 Exponential distribution:

f(x) = e⁻ˣ I{x > 0},  F(x) = (1 − e⁻ˣ) I{x > 0}.

We compute F⁻¹(y) = − ln(1 − y) for y ∈ (0, 1), and take Xi = − ln(1 − Ui), i = 1, ..., n, where Ui ∼ U[0, 1].
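A minimal Python sketch of this inverse-transform step, written for the general rate λ of E(λ) (the function name, the use of numpy and the choice λ = 2 are illustrative assumptions, not part of the notes):

import numpy as np

def exponential_by_inversion(n, lam=1.0, rng=None):
    # Simulate n draws from E(lam) via X = -ln(1 - U)/lam with U ~ U[0, 1].
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=n)
    return -np.log(1.0 - u) / lam

x = exponential_by_inversion(100_000, lam=2.0)
print(x.mean())    # should be close to 1/lam = 0.5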
Example 1.5 Bernoulli distribution:

P(X = 1) = p,  P(X = 0) = 1 − p,  0 < p < 1.

We use the modified algorithm:

F⁻¹(y) = sup{t : F(t) < y} = 0 for y ∈ [0, 1 − p],  and 1 for y ∈ (1 − p, 1].

If Ui is a uniform r.v. then Xi = F⁻¹(Ui) is a Bernoulli r.v., and we have

Xi = 0 if Ui ∈ [0, 1 − p],  Xi = 1 if Ui ∈ (1 − p, 1].
Exercise 1.5
A r.v. Y takes the values 1, 3 and 4 with probabilities P(Y = 1) = 3/5, P(Y = 3) = 1/5 and P(Y = 4) = 1/5. How would you generate Y given a r.v. U ∼ U(0, 1)?
Exercise 1.6
Let U ∼ U (0, 1).
1. Explain how to simulate a dice with 6 faces given U .
2. Let Y = [6U + 1], where [a] is the integer part of a. What are possible values of Y and
the corresponding probabilities?
Simulating transformed variables. How do we simulate a sample Y1, ..., Yn from the distribution F((x − µ)/σ), given a sample X1, ..., Xn from F(·)? We suppose that σ > 0 and µ ∈ R. We should take Yi = σXi + µ, i = 1, ..., n.
1.6.3 Simulating normal N(0, 1) r.v.
Note that while the normal c.d.f. F is continuous and strictly increasing, the explicit expression
for F is not available. Thus, one can hardly apply the inversion algorithm. Nevertheless, there
are other techniques of simulating normal r.v. which are very efficient from the numerical point
of view.
Using the CLT. If U ∼ U[0, 1] then E(U) = 1/2 and Var(U) = 1/12. This implies by the Central Limit Theorem that

(U1 + ... + U_N − N/2)/√(N/12) →^D N(0, 1),  N → ∞,

for an i.i.d. sample U1, ..., U_N with uniform distribution on [0, 1] (N = 12 is usually sufficient to obtain a “good” approximation!). Thus, one can consider the following simulation algorithm: let U1, U2, ..., U_{nN} be a pseudo-random sequence from the uniform distribution U[0, 1]; we take

Xi = (U_{(i−1)N+1} + ... + U_{iN} − N/2)/√(N/12),  i = 1, ..., n.
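A minimal Python sketch of this sum-of-uniforms generator with N = 12 (so that √(N/12) = 1); illustrative only.

import numpy as np

def normals_by_clt(n, N=12, rng=None):
    # Approximate N(0,1) draws: sum N uniforms, subtract N/2, divide by sqrt(N/12).
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=(n, N))
    return (u.sum(axis=1) - N / 2.0) / np.sqrt(N / 12.0)

x = normals_by_clt(100_000)
print(x.mean(), x.var())    # close to 0 and 1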
Box–Müller algorithm. The algorithm is based on the following result:
Proposition 1.11 Let ξ and η be independent U[0, 1] random variables. Then the r.v.

X = √(−2 ln ξ) cos(2πη)  and  Y = √(−2 ln ξ) sin(2πη)

are standard normal and independent.
We prove this statement in Lecture 3.
This relation provides us with an efficient simulation technique: let U1, ..., U_{2n} be i.i.d. r.v. with U1 ∼ U[0, 1]. We set

X_{2i−1} = √(−2 ln U_{2i}) cos(2πU_{2i−1}),
X_{2i} = √(−2 ln U_{2i}) sin(2πU_{2i−1}),

for i = 1, ..., n.
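A minimal Python sketch of the Box–Müller transform as described above; the pairing of the uniforms follows the formulas in the notes, and the helper name is an illustrative assumption.

import numpy as np

def normals_box_muller(n, rng=None):
    # Generate 2n standard normal values from 2n uniforms via the Box-Muller transform.
    rng = rng or np.random.default_rng()
    u_odd = rng.uniform(size=n)     # plays the role of U_{2i-1} (the angle)
    u_even = rng.uniform(size=n)    # plays the role of U_{2i} (the radius)
    r = np.sqrt(-2.0 * np.log(1.0 - u_even))   # 1 - U is also U[0,1]; avoids log(0)
    x1 = r * np.cos(2.0 * np.pi * u_odd)
    x2 = r * np.sin(2.0 * np.pi * u_odd)
    return np.concatenate([x1, x2])

x = normals_box_muller(50_000)
print(x.mean(), x.std())    # close to 0 and 1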
1.7 Exercises
Exercise 1.7
Suppose two balanced dice are rolled. Find the joint probability distribution of X and Y if:
1. X is the maximum of the obtained values and Y is their sum;
2. X is the value of the first die and Y is the maximum of the two;
3. X and Y are, respectively, the smallest and the largest value.
Exercise 1.8
Suppose that X and Y are two independent Bernoulli B(1/2) random variables. Let U = X + Y and V = |X − Y|.
1. What are the joint probability distribution and the marginal probability distributions of U and V, and the conditional distributions of U given V = 0 and V = 1?
2. Are the r.v. U and V independent?
Exercise 1.9
Let ξ1 , ..., ξn be independent r.v., and let
ξmin = min(ξ1 , ..., ξn ),
ξmax = max(ξ1 , ..., ξn ).
1) Show that

P(ξ_min ≥ x) = Π_{i=1}^n P(ξi ≥ x),   P(ξ_max < x) = Π_{i=1}^n P(ξi < x).

2) Suppose, furthermore, that ξ1, ..., ξn are identically distributed with the uniform distribution U[0, a]. Compute E(ξ_min), E(ξ_max), Var(ξ_min) and Var(ξ_max).
Exercise 1.10
Let ξ1, ..., ξn be independent Bernoulli r.v. with

P(ξi = 0) = 1 − λi∆,  P(ξi = 1) = λi∆,

where λi > 0 and ∆ > 0 is small. Show that

P(Σ_{i=1}^n ξi = 1) = (Σ_{i=1}^n λi) ∆ + O(∆²),   P(Σ_{i=1}^n ξi > 1) = O(∆²).

Exercise 1.11
1) Prove that inf_{−∞<a<∞} E((ξ − a)²) is attained at a = E(ξ), and so

inf_{−∞<a<∞} E((ξ − a)²) = Var(ξ).

2) Let ξ be a nonnegative r.v. with c.d.f. F and finite expectation. Prove that

E(ξ) = ∫₀^∞ (1 − F(x)) dx.

3) Show, using the result of 2), that if M is the median of the c.d.f. F of ξ,

inf_{−∞<a<∞} E(|ξ − a|) = E(|ξ − M|).
Exercise 1.12
Let X1 and X2 be two independent r.v. with the exponential distribution E(λ). Show that
min(X1 , X2 ) and |X1 − X2 | are r.v. with distributions, respectively, E(2λ) and E(λ).
Exercise 1.13
Let X be the number of “6”s in 12000 independent rolls of a die. Using the Central Limit Theorem, estimate the probability that 1800 < X ≤ 2100 (Φ(√6) ≈ 0.9928, Φ(2√6) ≈ 0.999999518). Compare this approximation to that obtained using the Chebyshev inequality.
Exercise 1.14
Suppose that the r.v. ξ1, ..., ξn are mutually independent and identically distributed with c.d.f. F. For x ∈ R, let us define the random variable F̂n(x) = µn/n, where µn is the number of ξ1, ..., ξn which satisfy ξk < x. Show that for any x,

F̂n(x) →^P F(x)

(the function F̂n(x) is called the empirical distribution function).
Exercise 1.15
[Monte Carlo method] We want to compute the integral I = ∫₀¹ f(x) dx. Let X be a U[0, 1] random variable; then

E(f(X)) = ∫₀¹ f(x) dx = I.

Let X1, ..., Xn be i.i.d. r.v. uniformly distributed on [0, 1]. Let us consider the quantity

f̄n = (1/n) Σ_{i=1}^n f(Xi)

and let us suppose that σ² = Var(f(X)) < ∞. Prove that E(f̄n) = I and f̄n →^P I as n → ∞. Estimate P(|f̄n − I| < ε) using the CLT.
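A minimal Python sketch of this Monte Carlo estimator for a concrete choice f(x) = x² (so that I = 1/3); the function, seed and sample size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return x**2            # I = integral of x^2 over [0, 1] = 1/3

n = 100_000
x = rng.uniform(size=n)
fbar = f(x).mean()                         # Monte Carlo estimate of I
stderr = f(x).std(ddof=1) / np.sqrt(n)     # CLT-based standard error of the estimate
print(f"estimate = {fbar:.5f}, exact = {1/3:.5f}, std. error = {stderr:.5f}")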
Exercise 1.16
Weibull distributions are often used in survival and reliability analysis. An example of a distribution from this family is given by the c.d.f.

F(x) = 0 for x < 0,  F(x) = 1 − e^{−5x²} for x ≥ 0.

Explain how to generate a r.v. Z ∼ F given a uniform r.v. U.
Exercise 1.17
Write down an algorithm for simulating a Poisson r.v. by inversion.
Hint: there is no simple expression for the Poisson c.d.f. and the set of values is infinite. However, the Poisson c.d.f. can easily be computed recursively. Observe that if X is a Poisson r.v.,

P(X = k) = e^{−λ} λ^k/k! = (λ/k) P(X = k − 1).
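One possible Python sketch of the recursion suggested by the hint (an illustrative solution outline, not the notes' official answer): accumulate the c.d.f. term by term until it first exceeds U.

import numpy as np

def poisson_by_inversion(lam, rng=None):
    # Return one Poisson(lam) draw: the smallest k with F(k) >= U.
    rng = rng or np.random.default_rng()
    u = rng.uniform()
    k, p = 0, np.exp(-lam)     # p = P(X = 0)
    cdf = p
    while cdf < u:
        k += 1
        p *= lam / k           # P(X = k) = (lam/k) * P(X = k-1)
        cdf += p
    return k

draws = [poisson_by_inversion(3.0) for _ in range(10_000)]
print(sum(draws) / len(draws))    # close to lam = 3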
Lecture 2
Regression and correlation
2.1 Couples of random variables. Joint and marginal distributions
Let (X, Y) be a couple of r.v. The joint c.d.f. of (X, Y) is given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y),   x, y ∈ R.

The marginal c.d.f.'s are given by

F_X(x) = F_{X,Y}(x, ∞) = P(X ≤ x);   F_Y(y) = F_{X,Y}(∞, y) = P(Y ≤ y).

In the continuous case we suppose that the derivative

∂²F_{X,Y}(x, y)/∂x∂y = f_{X,Y}(x, y)    (2.1)

exists a.e. The function f_{X,Y}(x, y) is called the density of F_{X,Y}(x, y).
Marginal densities f_X and f_Y are defined according to

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.
In the discrete case X and Y take values in a finite or countable set. The joint distribution of the couple X, Y is defined by the probabilities {P(X = k, Y = m)}_{k,m}. The marginal laws are defined by the probabilities

P(X = k) = Σ_m P(X = k, Y = m),   P(Y = m) = Σ_k P(X = k, Y = m).

If X and Y are independent then

F_{X,Y}(x, y) = F_X(x)F_Y(y) for all (x, y) ∈ R².
The converse is also true. In the continuous case the independence is equivalent to the decomposition

f_{X,Y}(x, y) = f_X(x)f_Y(y) for all (x, y) ∈ R²,
and in the discrete case,
P (X = k, Y = m) = P (X = k)P (Y = m).
2.2 Conditioning (discrete case)
Let A and B be two random events (A, B ∈ F) such that P(B) ≠ 0. The conditional probability P(A|B) of A given B is defined as

P(A|B) = P(AB)/P(B).
Let X and Y be two discrete r.v. According to this definition,

P(Y = k|X = m) = P(Y = k, X = m)/P(X = m),

for all k, m such that P(X = m) ≠ 0. We suppose that P(X = m) ≠ 0 for all admissible m. Then

Σ_k P(Y = k|X = m) = Σ_k P(Y = k, X = m)/P(X = m) = 1.

As a result, the probabilities {P(Y = k|X = m)}_k define a discrete probability distribution. If X and Y are independent,

P(Y = k|X = m) = P(Y = k)P(X = m)/P(X = m) = P(Y = k).    (2.2)
The conditional expectation of Y given that X = m is the numerical function of m given by

E(Y|X = m) = Σ_k k P(Y = k|X = m).
The conditional variance is defined by
Var(Y |X = m) = E(Y 2 |X = m) − [E(Y |X = m)]2 .
In a similar way we define the conditional moments, the conditional quantiles and other characteristics of the conditional distribution.
Definition 2.1 The conditional expectation E(Y|X) of Y given X, where X and Y are discrete r.v. such that E(|Y|) < ∞, is the discrete random variable which depends only on X and takes the values

{E(Y|X = m)}_m

with probabilities P(X = m).
It is important not to confuse the random variable E(Y |X) with the (deterministic) numeric
function E(Y |X = m) (function of m).
Note that the condition E(|Y|) < ∞ guarantees the existence of the conditional expectation E(Y|X).
2.2.1 Properties of the conditional expectation (discrete case)
1°. (Linearity) Let E(|Y1|) < ∞, E(|Y2|) < ∞; then for all a ∈ R, b ∈ R,

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)

2°. If X and Y are independent and E(|Y|) < ∞, then E(Y|X) = E(Y) (a.s.) (cf. (2.2)).
3°. E(h(X)|X) = h(X) (a.s.) for all Borel h.
4°. (Substitution theorem.) If E(|h(Y, X)|) < ∞ then

E(h(Y, X)|X = m) = E(h(Y, m)|X = m).
Proof: Let Y′ = h(Y, X); this is a discrete r.v. taking values h(k, m). Thus, the conditional distribution of Y′ given X is given by the probabilities

P(Y′ = a|X = m) = P(h(Y, X) = a|X = m) = P(h(Y, X) = a, X = m)/P(X = m)
= P(h(Y, m) = a, X = m)/P(X = m) = P(h(Y, m) = a|X = m).

Therefore, for all m,

E(Y′|X = m) = Σ_a a P(Y′ = a|X = m) = Σ_a a P(h(Y, m) = a|X = m) = E(h(Y, m)|X = m).

As a result, if h(x, y) = h1(y)h2(x), we have

E(h1(Y)h2(X)|X = m) = h2(m)E(h1(Y)|X = m),

and

E(h1(Y)h2(X)|X) = h2(X)E(h1(Y)|X) (a.s.)
5o . (Double expectation theorem) Let E(|Y |) < ∞, then E(E(Y |X)) = E(Y ).
Proof: We write

E(E(Y|X)) = Σ_m E(Y|X = m)P(X = m) = Σ_m Σ_k k P(Y = k|X = m)P(X = m)
= Σ_{m,k} k P(Y = k, X = m) = Σ_k k Σ_m P(Y = k, X = m) = Σ_k k P(Y = k) = E(Y).
Example 2.1 Let ξ and η be two independent Bernoulli r.v. taking values 1 and 0 with probabilities, respectively, p and 1 − p. What are the conditional expectations E(ξ + η|η) and E(η|ξ + η)?
Using the properties 2° and 3° we obtain

E(ξ + η|η) = E(ξ) + η = p + η.

Furthermore, by the definition, for k = 0, 1, 2,

E(η|ξ + η = k) = 1 · P(η = 1|ξ + η = k) = 0 for k = 0,  1/2 for k = 1,  1 for k = 2.

Thus, E(η|ξ + η) = (ξ + η)/2 (a.s.).

2.3 Conditioning as a projection
Conditioning as a projection
Let us consider the set of random variables ξ on (Ω, F, P ) such that E(ξ 2 ) < ∞. We will say
that ξ ∼ ξ 0 (is equivalent) if ξ = ξ 0 (a.s.) with respect to the measure P . This relation defines a
family of equivalence classes over random variables ξ such that E(ξ 2 ) < ∞.
Definition 2.2 We denote L2 (P ) the space of (equivalence classes of ) square-integrable r.v. ξ
(E(ξ 2 ) < ∞).
The space L2 (P ) we have just defined is a Hilbert space equipped with the scalar product
hX, Y i = E(XY ),
X, Y ∈ L2 (P ),
and the corresponding norm kXk = [E(X 2 )]1/2 , X ∈ L2 (P ).
Indeed, h·, ·i verifies all the axioms of the scalar product: for all X, ξ, η ∈ L2 (P ) and a, b ∈ R
haξ + bη, Xi = E([aξ + bη]X) = aE(ξX) + bE(ηX) = ahξ, Xi + bhη, Xi,
and hX, Xi ≥ 0; hX, Xi = 0 implies X = 0 (a.s.).
2.3.1 The best prediction
If the r.v. X and Y are independent, the knowledge of X does not supply any information about Y. However, when X and Y are dependent and we know the realization of X, it does provide some information about Y. We define the problem of the best prediction of Y given X as follows:
Let Y ∈ L₂(P) and let X be a r.v. on (Ω, F, P). Find a Borel measurable g(·) such that

‖Y − g(X)‖ = min_{h(·)} ‖Y − h(X)‖,    (2.3)

where the minimum is taken over all Borel measurable functions h(·) and ‖·‖ is the norm of L₂(P). The random variable Ŷ = g(X) is referred to as the best prediction of Y given X.
We use the following (statistical) vocabulary: X is the explanatory variable or predictor, Y is the explained variable.
We can write (2.3) in the equivalent form:

E((Y − g(X))²) = min_{h(·)} E((Y − h(X))²) = min_{h(X)∈L₂^X(P)} E((Y − h(X))²).

It suffices to consider the case h(X) ∈ L₂(P), because the solution g(·) to (2.3) is “automatically” in L₂(P).
We can consider (2.3) as the problem of orthogonal projection of Y on the linear subspace L₂^X(P) of L₂(P) defined as

L₂^X(P) = {ξ = h(X) : E(h²(X)) < ∞}.

By the properties of the orthogonal projection, g(X) ∈ L₂^X(P) is the solution to (2.3) if and only if

⟨Y − g(X), h(X)⟩ = 0 for all h(X) ∈ L₂^X(P),    (2.4)

[Figure: orthogonal projection of Y onto the subspace L₂^X(P), with g(X) the foot of the projection.]

and the orthogonal projection g(X) is unique (a.s.). Using notation with expectations instead, (2.4) can be equivalently rewritten as

E((Y − g(X))h(X)) = 0 for all h(X) ∈ L₂^X(P),

or

E(Y h(X)) = E(g(X)h(X)) for all h(X) ∈ L₂^X(P).    (2.5)

In particular,

E(Y I{X ∈ A}) = E(g(X)I{X ∈ A}) for all A ∈ B (all Borel measurable sets).    (2.6)

Remark. Indeed, (2.6) implies (2.5), thus (2.5) and (2.6) are equivalent – recall that all functions in L₂ can be approximated by sums of step functions Σ_i c_i I{x ∈ A_i} (piecewise-constant functions).
Let us show that in the discrete case the only r.v. g(X) which verifies (2.5) (and (2.6)),
and thus solves the problem of the best prediction (2.3), is the conditional expectation of Y
given X.
Proposition 2.1 Let X and Y be discrete r.v. with Y ∈ L₂(P). Then the best prediction Ŷ of Y given X is unique (a.s.) and given by

Ŷ = g(X) = E(Y|X).

Proof:

E(E(Y|X)h(X)) = Σ_k E(Y|X = k)h(k)P(X = k)
= Σ_k [ Σ_m m P(Y = m|X = k) ] h(k)P(X = k)
= Σ_{k,m} m h(k)P(Y = m, X = k) = E(Y h(X)).

Thus (2.5) is verified with g(X) = E(Y|X). Due to the (a.s.) uniqueness of the orthogonal projection, the best prediction is also unique (a.s.).
2.4 Probability and conditional expectation (the general case)
We extend the definition of the conditional expectation E(Y |X) to the general case of 2 r.v. X
and Y . We use the following definition:
Definition 2.3 Let Y and X be r.v. such that E(|Y|) < ∞. The conditional expectation g(X) = E(Y|X) is a r.v., measurable with respect to X, which verifies

E(Y I{X ∈ A}) = E(g(X)I{X ∈ A})    (2.7)

for all Borel sets A.
Remark. Here we replace the condition Y ∈ L2 (P ) (≡ E(Y 2 ) < ∞) with a weaker condition
E(|Y |) < ∞. One can show (see the course of Probability Theory) that the function g(X) which
verifies (2.7) exists and is unique (a.s.) (a consequence of the Radon-Nikodym theorem).
If Y ∈ L2 (P ), the existence and the a.s. uniqueness of the function g(X) verifying (2.7), as
we have already seen, is a consequence of the properties of the orthogonal projection in L2 .
Theorem 2.1 (Best prediction) Let X and Y be 2 r.v., Y ∈ L2 (P ). Then the best prediction
of Y given X is unique (a.s.) and coincides with
Ŷ = g(X) = E(Y|X).
2.4.1 Conditional probability
Let us consider the following special case: we replace Y with Y′ = I{Y ∈ B}. Note that the r.v. Y′ is bounded (|Y′| ≤ 1), and thus E(|Y′|²) < ∞. We can define the conditional expectation g(X) = E(Y′|X) by the relationship (cf. (2.7))

E(I{Y ∈ B}I{X ∈ A}) = E(g(X)I{X ∈ A}) for all A, B ∈ B.

Definition 2.4 The conditional probability P(Y ∈ B|X) is a random variable which verifies

P(Y ∈ B, X ∈ A) = E[P(Y ∈ B|X)I{X ∈ A}] for all A ∈ B.
As in the discrete case, we also define a numeric function.
Definition 2.5 The function of two variables P(Y ∈ B|X = x), B ∈ B (a Borel set) and x ∈ R, is referred to as the conditional probability of Y given X = x if
(i) for all fixed B, P(Y ∈ B|X = x) verifies

P(Y ∈ B, X ∈ A) = ∫_A P(Y ∈ B|X = x) dF_X(x);    (2.8)

(ii) for all fixed x, P(Y ∈ B|X = x) defines a probability distribution as a function of B.
Remark. We already know that for all B ∈ B there is a function

g_B(x) = P(Y ∈ B|X = x)

such that (i) is valid. However, this function is defined only up to its values on a set N_B of zero measure. It is important to note that, in general, this set depends on B. Therefore, it may happen that N = ∪_{B∈B} N_B is of positive measure. This could do serious damage – for example, the additivity of the probability measure could be destroyed, etc. Fortunately, in our case (real r.v. and Borel σ-algebra) there is a result due to Kolmogorov which says that one can always choose a version of the function g_B(·) such that P(Y ∈ B|X = x) is a probability measure for all fixed x ∈ R. We will suppose in the sequel that this version is chosen in every particular case.
We can also define a real-valued function of x:

E(Y|X = x) = ∫ y P(dy|X = x),

such that

E(Y I{X ∈ A}) = ∫_A E(Y|X = x) dF_X(x) for all A ∈ B.
2.4.2 Properties of conditional expectation, general case
1°. (Linearity.) Let E(|Y1|) < ∞, E(|Y2|) < ∞; then

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)
2°. If X and Y are independent and E(|Y|) < ∞, then E(Y|X) = E(Y) (a.s.). In view of the definition (2.7) it suffices to prove that

E(Y I{X ∈ A}) = E(E(Y)I{X ∈ A}) for all A ∈ B.    (2.9)

But

E(E(Y)I{X ∈ A}) = E(Y)P(X ∈ A),

and so (2.9) is a consequence of the independence of X and Y.
3°. E(h(X)|X) = h(X) (a.s.) for all Borel functions h.
4°. (Substitution theorem.) If E(|h(Y, X)|) < ∞, then

E(h(Y, X)|X = x) = E(h(Y, x)|X = x).

5°. (Double expectation theorem)

E(E(Y|X)) = E(Y).
Proof : Let us set A = R in the definition (2.7), then I(X ∈ A) = 1, and we obtain the desired
result.
2.5 Conditioning: continuous case
We suppose that there exists a joint density f_{X,Y}(x, y) of the couple (X, Y). Let us set

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x) if f_X(x) > 0,  and 0 if f_X(x) = 0.

Proposition 2.2 If the joint density of (X, Y) exists then

P(Y ∈ B|X = x) = ∫_B f_{Y|X}(y|x) dy for all B ∈ B.
Proof: It suffices to show (cf. (2.8)) that for all A, B ∈ B,

P(Y ∈ B, X ∈ A) = ∫_A ( ∫_B f_{Y|X}(y|x) dy ) dF_X(x).

Since X has a density, dF_X(x) = f_X(x)dx. By the Fubini theorem,

∫_A ∫_B f_{Y|X}(y|x) dy f_X(x) dx = ∫_B ∫_A f_{Y|X}(y|x) f_X(x) dx dy.

But f_{Y|X}(y|x)f_X(x) = f_{X,Y}(x, y) a.e. (if f_X(x) = 0, then f_{X,Y}(x, y) = 0 a fortiori). Therefore, the last integral is equal to

∫_B ∫_A f_{X,Y}(x, y) dx dy = P(X ∈ A, Y ∈ B).
The result of Proposition 2.2 provides a direct way to compute the conditional expectation:
Corollary 2.1
1. E(Y|X = x) = ∫ y f_{Y|X}(y|x) dy,
2. ∫_{−∞}^{∞} f_{Y|X}(y|x) dy = 1,
3. Y⊥⊥X ⇒ f_{Y|X}(y|x) = f_Y(y).
We can define, as in the discrete case, the conditional variance:

Var(Y|X = x) = E(Y²|X = x) − (E(Y|X = x))²
= ∫_{−∞}^{∞} y² f_{Y|X}(y|x) dy − ( ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy )².
Example 2.2 Let X and Y be i.i.d. r.v. with exponential distribution. Let us compute the conditional density f(x|z) = f_{X|X+Y}(x|z) and E(X|X + Y).
Let f(u) = λe^{−λu} I{u > 0} be the density of X and Y. If z < x,

P(X + Y < z, X < x) = P(X + Y < z, X < z) = ∫₀^z f(u) ( ∫₀^{z−u} f(v) dv ) du,

and if z ≥ x,

P(X + Y < z, X < x) = ∫₀^x f(u) ( ∫₀^{z−u} f(v) dv ) du.

As a result, for z ≥ x the joint density of the couple (X + Y, X) is (cf. (2.1))

f(z, x) = ∂²P(X + Y < z, X < x)/∂x∂z = f(z − x)f(x) = λ²e^{−λz}.

Besides, the density of X + Y is the convolution of two exponential densities, i.e.

f_{X+Y}(z) = λ²ze^{−λz}.

We obtain

f_{X|X+Y}(x|z) = f(z, x)/f_{X+Y}(z) = 1/z

for 0 ≤ x ≤ z, and f_{X|X+Y}(x|z) = 0 for x > z. In other words, the conditional density is the density of the uniform distribution on [0, z]. Thus we conclude that E(X|X + Y) = (X + Y)/2 (a.s.).
This example is related to the model of requests arriving at a service system. Let X be the instant when the 1st request arrives (the instant t = 0 is the instant when request zero arrives), and let Y be the interval of time between the arrivals of the 1st and the 2nd requests. We are looking for the probability density of the instant of the 1st arrival, given that the second request arrived at time z.
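A small Python sketch (illustrative, not from the notes) checking this conclusion by simulation: conditioning on X + Y falling in a narrow bin around z, the conditional mean of X should be close to z/2 and the spread of X should match that of a uniform variable on [0, z]. The bin width and sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
lam, n = 1.0, 1_000_000
x = rng.exponential(1.0 / lam, n)
y = rng.exponential(1.0 / lam, n)
z_target, half_width = 2.0, 0.02

mask = np.abs((x + y) - z_target) < half_width    # condition on X + Y close to z_target
print("E(X | X+Y ~ 2) ~", x[mask].mean(), " (theory: 1.0)")
print("std of X given X+Y ~ 2 ~", x[mask].std(), " (uniform[0,2] theory:", 2 / np.sqrt(12), ")")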
2.6 Covariance and correlation
Let X and Y be square-integrable r.v., i.e. E(X²) < ∞ and E(Y²) < ∞. We denote

σ²_X = Var(X),  σ²_Y = Var(Y).

Definition 2.6 The covariance of X and Y is the value

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).

If Cov(X, Y) = 0, we say that X and Y are orthogonal, and we denote X ⊥ Y.
Definition 2.7 Let σ²_X > 0 and σ²_Y > 0. The correlation between X and Y is the value

Corr(X, Y) = ρ_{XY} = Cov(X, Y)/(σ_X σ_Y).

2.6.1 Properties of covariance and correlation
The relationships below are immediate consequences of Definition 2.6.
1. Cov(X, X) = Var(X).
2. Cov(aX, bY ) = abCov(X, Y ), a, b ∈ R.
3. Cov(X + a, Y ) = Cov(X, Y ), a ∈ R.
4. Cov(X, Y ) = Cov(Y, X).
5. Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(Y, X).
Indeed,
Var(X + Y ) = E((X + Y )2 ) − (E(X) + E(Y ))2
= E(X 2 ) + E(Y 2 ) + 2E(XY ) − E 2 (X) − E 2 (Y ) − 2E(X)E(Y ).
6. If X and Y are independent, Cov(X, Y ) = 0.
Important note: the converse is not true. For instance, if X ∼ N(0, 1) and Y = X², then
Cov(X, Y ) = E(X 3 ) − E(X)E(X 2 ) = E(X 3 ) = 0
(recall that N (0, 1) is symmetric with respect to 0).
Let us now consider the properties of correlation:
1. −1 ≤ ρ_{XY} ≤ 1 (by the Cauchy-Schwarz inequality):

|Cov(X, Y)| = |E((X − E(X))(Y − E(Y)))| ≤ √(E((X − E(X))²)) √(E((Y − E(Y))²)) = σ_X σ_Y.

2. If X and Y are independent, ρ_{XY} = 0.
3. |ρ_{XY}| = 1 if and only if X and Y are linearly dependent: there exist a ≠ 0, b ∈ R such that Y = aX + b.
Proof: Note that |ρ_{XY}| = 1 iff the equality is attained in the Cauchy-Schwarz inequality. By Proposition 1.4, this is only possible if there are α, β ∈ R such that

α(X − E(X)) + β(Y − E(Y)) = 0 (a.s.),

and α ≠ 0 or β ≠ 0. This is equivalent to the existence of α, β and γ ∈ R such that

αX + βY + γ = 0 (a.s.),

with α ≠ 0 or β ≠ 0. If α ≠ 0 and β ≠ 0 one has

Y = −(α/β)X − γ/β,   X = −(β/α)Y − γ/α.

The case with α = 0 or β = 0 is impossible, because this would mean that one of the variables Y or X is constant (a.s.), but we have assumed that σ_X and σ_Y are positive.
Observe that if Y = aX + b, a, b ∈ R, a ≠ 0, then

σ²_Y = E((Y − E(Y))²) = a²E((X − E(X))²) = a²σ²_X,

and the covariance is

Cov(X, Y) = E((X − E(X)) a (X − E(X))) = aσ²_X,

so that ρ_{XY} = aσ²_X/(σ_X |a|σ_X) = a/|a|. We say that the correlation between X and Y is positive if ρ_{XY} > 0 and negative if ρ_{XY} < 0. The correlation above is thus positive (= 1) if a > 0 and negative (= −1) if a < 0.
Geometric interpretation of the correlation. Let ⟨·, ·⟩ be the scalar product and ‖·‖ the norm of L₂(P). Then

Cov(X, Y) = ⟨X − E(X), Y − E(Y)⟩

and

ρ_{XY} = ⟨X − E(X), Y − E(Y)⟩ / (‖X − E(X)‖ ‖Y − E(Y)‖).

In other words, ρ_{XY} is the “cosine of the angle” between X − E(X) and Y − E(Y). Thus, ρ_{XY} = ±1 means that X − E(X) and Y − E(Y) are collinear: Y − E(Y) = a(X − E(X)) for some a ≠ 0.
2.7 Regression
Definition 2.8 Let X and Y be two r.v. such that E(|Y|) < ∞. The function g : R → R defined by

g(x) = E(Y|X = x)

is called the regression of Y on X (of Y in X).
We also refer to this regression as simple (the word means that X and Y are univariate). If X
or Y are multi-dimensional, we refer to the regression as multiple.
Geometric interpretation. Let us recall the construction of Section 2.3. Suppose that Y is an element of the Hilbert space L₂(P) (i.e. E(Y²) < ∞) and let, as before, L₂^X(P) be the linear subspace of L₂(P) of all functions h(X) which are measurable with respect to X and such that E(h²(X)) < ∞. Then g(X) is the orthogonal projection of Y on L₂^X(P).

[Figure: Y projected orthogonally onto the subspace L₂^X(P), with E(Y|X) the projection.]

The r.v. ξ = Y − g(X) is referred to as the stochastic error (or residual). We have

Y = g(X) + ξ.    (2.10)

By definition of the conditional expectation, E(ξ|X) = 0 (a.s.), and so E(ξ) = 0.
Example 2.3 Let the joint density of X and Y be

f(x, y) = (x + y)I{0 < x < 1, 0 < y < 1}.

What is the regression function g(x) = E(Y|X = x)?
We use Corollary 2.1:

f_{Y|X}(y|x) = f(x, y)/f_X(x),  where  f_X(x) = ∫₀¹ f(x, y) dy = (x + 1/2)I{0 < x < 1}.

We conclude that

f_{Y|X}(y|x) = ((x + y)/(x + 1/2)) I{0 < x < 1, 0 < y < 1},

and

g(x) = E(Y|X = x) = ∫₀¹ y f_{Y|X}(y|x) dy = ∫₀¹ y(x + y)/(x + 1/2) dy = (x/2 + 1/3)/(x + 1/2)

for 0 < x < 1. Observe that g(x) is a nonlinear function of x.
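A small Python sketch (illustrative, not from the notes) checking this formula by simulation: draw (X, Y) from the density f(x, y) = x + y on the unit square by rejection sampling and compare the conditional mean of Y near a given x with g(x). The sample size, bin width and the point x = 0.3 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(4)

def sample_xy(n):
    # Rejection sampling from f(x, y) = x + y on (0,1)^2, whose maximum is 2.
    xs, ys = [], []
    while len(xs) < n:
        x = rng.uniform(size=n)
        y = rng.uniform(size=n)
        u = rng.uniform(size=n)
        keep = u * 2.0 <= x + y       # accept with probability (x + y)/2
        xs.extend(x[keep]); ys.extend(y[keep])
    return np.array(xs[:n]), np.array(ys[:n])

x, y = sample_xy(200_000)
x0 = 0.3
mask = np.abs(x - x0) < 0.02                   # points with X close to x0
g_exact = (x0 / 2 + 1 / 3) / (x0 + 1 / 2)      # regression function from the example
print("simulated E(Y | X ~ 0.3) =", y[mask].mean(), " exact g(0.3) =", g_exact)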
2.7.1 Residual variance
The quadratic (mean square) error of the approximation of Y by g(X) is the value

∆ = E((Y − g(X))²) = E((Y − E(Y|X))²) = E(ξ²) = Var(ξ).
We call ∆ the residual variance. The residual variance is smaller than the variance of Y. Indeed, let h(X) = E(Y) = const. By the best prediction theorem,

∆ = E((Y − g(X))²) ≤ E((Y − h(X))²) = E((Y − E(Y))²) = Var(Y).

Because E(Y) is an element of L₂^X(P), this means, geometrically, that a leg is smaller than the hypotenuse:

[Figure: right triangle in L₂^X(P) with vertices Y, E(Y|X) and E(Y); the line L of constants contains E(Y).]

Observe that the space L of “constant r.v.” is also a linear subspace of L₂(P). Moreover, it is exactly the intersection of the subspaces L₂^X(P) over all X. But we already know that E(Y) is the projection of Y on L: indeed, for any constant a,

E((Y − a)²) ≥ E((Y − E(Y))²)

(cf. Exercise 1.11).
By the Pythagoras theorem,

‖Y − E(Y)‖² = ‖E(Y|X) − E(Y)‖² + ‖Y − E(Y|X)‖²,

or

Var(Y) = E((Y − E(Y))²) = E((E(Y|X) − E(Y))²) + E((Y − E(Y|X))²)
= Var(E(Y|X)) + E(Var(Y|X))
= “variance explained by X” + “residual variance”
= Var(g(X)) + Var(ξ)
= Var(g(X)) + ∆.
Definition 2.9 Let Var(Y) > 0. We call the correlation ratio of Y to X the nonnegative value η² = η²_{Y|X} given by

η²_{Y|X} = E((E(Y) − E(Y|X))²)/Var(Y) = Var(g(X))/Var(Y).

Note that, by the Pythagoras theorem,

η²_{Y|X} = 1 − E((Y − g(X))²)/Var(Y) = 1 − ∆/Var(Y).
Geometric interpretation. The correlation ratio η²_{Y|X} is the squared cosine of the angle θ between Y − E(Y) and E(Y|X) − E(Y), thus 0 ≤ η²_{Y|X} ≤ 1.
Remarks.
1. Generally, η²_{X|Y} ≠ η²_{Y|X} (absence of symmetry).
2. The values η² = 0 and η² = 1 are special: η² = 1 implies that E((Y − E(Y|X))²) = 0, thus Y = g(X) (a.s.); in other words, Y is a function of X. On the other hand, η² = 0 means that E((E(Y) − E(Y|X))²) = 0, and E(Y|X) = E(Y) (a.s.), so the regression is constant.
It is useful to note that g(X) = const implies the orthogonality of X and Y (i.e. Cov(X, Y) = 0).
Proposition 2.3 Let E(X²) < ∞, E(Y²) < ∞ and σ²_X > 0, σ²_Y > 0. Then

η²_{Y|X} ≥ ρ²_{XY}.
Proof: By the definition of η²_{Y|X}, it suffices to show that

E((E(Y) − E(Y|X))²) Var(X) ≥ [E((X − E(X))(Y − E(Y)))]².

Yet, by the double expectation theorem,

E((X − E(X))(Y − E(Y))) = E((X − E(X))E(Y − E(Y)|X)) = E((X − E(X))(E(Y|X) − E(Y))).

Now, by applying the Cauchy-Schwarz inequality we arrive at

[E((X − E(X))(Y − E(Y)))]² ≤ E((X − E(X))²) E((E(Y|X) − E(Y))²) = Var(X) E((E(Y|X) − E(Y))²).    (2.11)
Remarks.
• η²_{Y|X} = 0 implies that ρ_{XY} = 0.
• The residual variance can be expressed in terms of the correlation ratio:

∆ = (1 − η²_{Y|X})Var(Y).    (2.12)

2.8 Linear regression
The particular case E(Y|X = x) = a + bx is called linear regression. Using (2.10), we can write
Y = a + bX + ξ
where ξ is the residual, E(ξ|X) = 0 (a.s.) (⇒ E(ξ) = 0).
Let ρ = ρ_{XY} be the correlation coefficient between X and Y, and let σ_X > 0, σ_Y > 0 be the standard deviations of X and Y. One can express the coefficients a and b of the linear regression in terms of ρ, σ_X and σ_Y. Indeed,

Y − E(Y) = b(X − E(X)) + ξ.

Multiplying this equation by X − E(X) and taking the expectation, we obtain

Cov(X, Y) = b Var(X) = bσ²_X,

so that

b = Cov(X, Y)/σ²_X = ρ σ_Y/σ_X.

Then,

Y = a + ρ (σ_Y/σ_X) X + ξ.

On the other hand,

E(Y) = a + ρ (σ_Y/σ_X) E(X),

and so

a = E(Y) − ρ (σ_Y/σ_X) E(X).

Finally,

Y = E(Y) + ρ (σ_Y/σ_X)(X − E(X)) + ξ.    (2.13)
Proposition 2.4 If E(X²) < ∞ and E(Y²) < ∞, Var(X) = σ²_X > 0, Var(Y) = σ²_Y > 0, and the regression function g(x) = E(Y|X = x) is linear, then it may be written in the form

E(Y|X = x) = E(Y) + ρ (σ_Y/σ_X)(x − E(X)).    (2.14)

The residual variance is

∆ = (1 − ρ²)σ²_Y,    (2.15)

where ρ is the correlation coefficient between X and Y.
Proof: The equality (2.14) is an immediate consequence of (2.13) along with the fact that E(ξ|X = x) = 0. Let us prove (2.15). We can write (2.13) in the form

ξ = (Y − E(Y)) − ρ (σ_Y/σ_X)(X − E(X)).

Taking the square and the expectation on both sides, we come to

∆ = E(ξ²) = E[ (Y − E(Y))² − 2ρ (σ_Y/σ_X)(X − E(X))(Y − E(Y)) + ρ² (σ_Y/σ_X)² (X − E(X))² ]
= ρ² (σ²_Y/σ²_X) Var(X) − 2ρ (σ_Y/σ_X) Cov(X, Y) + Var(Y) = (1 − ρ²)σ²_Y.
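A short Python sketch (with an illustrative, assumed model Y = 2 + 3X + noise) checking numerically that the slope Cov(X, Y)/σ²_X equals ρ σ_Y/σ_X and that the residual variance is (1 − ρ²)σ²_Y.

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(1.0, 2.0, n)                    # X with mean 1 and s.d. 2
y = 2.0 + 3.0 * x + rng.normal(0.0, 4.0, n)    # linear regression plus noise of s.d. 4

rho = np.corrcoef(x, y)[0, 1]
b_cov = np.cov(x, y)[0, 1] / x.var()           # Cov(X, Y)/Var(X)
b_rho = rho * y.std() / x.std()                # rho * sigma_Y / sigma_X
resid_var = (y - (y.mean() + b_cov * (x - x.mean()))).var()
print(b_cov, b_rho)                            # both close to 3
print(resid_var, (1 - rho**2) * y.var())       # both close to 16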
Corollary 2.2 If the regression of Y on X is linear, then under the premise of Proposition 2.4 we have

η²_{Y|X} = ρ²_{XY}.

In other words, in the case of linear regression, the correlation ratio coincides with the squared correlation between X and Y. (In particular, this implies that ρ_{XY} = 0 ⇔ η²_{Y|X} = 0, and then η²_{Y|X} = η²_{X|Y} = 0.)
The converse is also true: if ρ²_{XY} = η²_{Y|X}, then the regression is linear.
Proof: Due to (2.12) one has ∆ = (1 − η²_{Y|X})Var(Y), but in the linear case, moreover, ∆ = (1 − ρ²)Var(Y) due to (2.15). To show the converse, note that if the equality is attained in the Cauchy-Schwarz inequality (2.11), then there exists α ≠ 0 such that

α(X − E(X)) = E(Y|X) − E(Y),

and thus

E(Y|X) = E(Y) + α(X − E(X)).

Remark. The fact that the regression of Y on X is linear does not, in general, imply that the regression of X on Y is linear too.
Exercise 2.1
Let X and Z be two independent r.v. with exponential distributions, X ∼ E(λ), Z ∼ E(1). Let Y = X + Z. Compute the regression function g(y) = E(X|Y = y).
2.9 Exercises
Exercise 2.2
Suppose that the joint distribution of X and Y is given by
F(x, y) = 1 − e^{−2x} − e^{−y} + e^{−(2x+y)} if x > 0, y > 0, and 0 otherwise.

1. Find the marginal distributions of X and Y.
2. Find the joint density of X and Y .
3. Compute the marginal densities of X and Y , conditional density of X given Y = y.
4. Are X and Y independent?
Exercise 2.3
Consider the joint density function of X and Y given by:
f(x, y) = (6/7)(x² + xy/2),  0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

1. Verify that f is a joint density.
2. Find the density of X and the conditional density f_{Y|X}(y|x).
3. Compute P(Y > 1/2 | X < 1/2).
Exercise 2.4
The joint density of X and Y is given by:

f(x, y) = e^{−(x+y)},  0 ≤ x < ∞, 0 ≤ y < ∞.
Compute
1. P (X < Y );
2. P (X < a).
Exercise 2.5
Two points are chosen at random on opposite sides of the middle point of an interval of length L. In other words, the two points X and Y are independent random variables such that X
is uniformly distributed over [0, L/2[, and Y is uniformly distributed over [L/2, L]. Find the
probability that the distance |X − Y | is larger than L/3.
Exercise 2.6
Let U1 and U2 be two independent r.v., both uniformly distributed on [0, a]. Let V = min{U1, U2} and Z = max{U1, U2}. Show that the joint c.d.f. F of V and Z is given by

F(s, t) = P(V ≤ s, Z ≤ t) = (t² − (t − s)²)/a²  for 0 ≤ s ≤ t ≤ a.

Hint: note that V ≤ s and Z ≤ t iff both U1 ≤ t and U2 ≤ t, but not both s < U1 ≤ t and s < U2 ≤ t.
Exercise 2.7
Given two independent r.v. X1 and X2 with exponential distributions with parameters λ1 and λ2,
find the distribution of Z = X1 /X2 . Compute P (X1 < X2 ).
Exercise 2.8
Let X and Y be i.i.d. r.v. Use the definition of the conditional expectation to show that E(X|X + Y) = E(Y|X + Y) (a.s.), and thus E(X|X + Y) = E(Y|X + Y) = (X + Y)/2 (a.s.).
Exercise 2.9
Let X, Y1 and Y2 be independent r.v., let Y1 and Y2 be normal N(0, 1), and let

Z = (Y1 + XY2)/√(1 + X²).

Using the conditional distribution P(Z < u|X = x), show that Z ∼ N(0, 1).
Exercise 2.10
Let X and Y be 2 square integrable r.v. on (Ω, F, P ). Prove that
Var(Y ) = E(Var(Y |X)) + Var(E(Y |X)).
Exercise 2.11
Let X1, ..., Xn be independent r.v. such that Xi ∼ P(λi) (Poisson distribution with parameter λi, i.e. P(Xi = k) = e^{−λi} λi^k / k!).
1o. Find the distribution of X = Σ_{i=1}^n Xi.
2o. Show that the conditional distribution of (X1, ..., Xn) given X = r is multinomial M(r, p1, ..., pn) (you will compute the corresponding parameters).
Recall that r.v. (X1, ..., Xk), integer valued in {0, ..., r}, have multinomial distribution M(r, p1, ..., pk) if
P(X1 = n1, ..., Xk = nk) = (r! / (n1! ... nk!)) p1^{n1} ... pk^{nk},
with Σ_{i=1}^k ni = r. This is the distribution of (X1, ..., Xk) where
Xi = “number of Y's which are equal to i”
in r independent trials Y1, ..., Yr with probabilities P(Y1 = i) = pi, i = 1, ..., k. Note that if k = 2,
P(X1 = n1, X2 = r − n1) = P(X1 = n1),
and the distribution is denoted M(r, p) = B(r, p).
3o . Compute E(X1 |X1 + X2 ).
4o. Show that if Xn is binomially distributed, Xn ∼ B(n, λ/n), then for all k, P(Xn = k) tends to e^{−λ} λ^k / k! when n → ∞.
Recall that the binomial distribution describes the number X of wins in n independent Bernoulli trials,
P(X = k) = C_n^k p^k (1 − p)^{n−k}.
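The following short numerical sketch (not part of the exercise) illustrates point 4o: it compares the B(n, λ/n) probabilities with the Poisson(λ) probabilities for a few values of n; the value λ = 3 and the range of k are arbitrary choices.

```python
# Sketch: compare the B(n, lambda/n) pmf with the Poisson(lambda) pmf as n grows,
# illustrating point 4o above.
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    # maximal absolute difference of the two pmf's over k = 0, ..., 20
    diff = max(abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam))
               for k in range(21))
    print(f"n = {n:5d}:  max_k |P(X_n = k) - e^(-lam) lam^k / k!| = {diff:.2e}")
```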
Exercise 2.12
Show that
1. Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z),
2. Cov(Σ_{i=1}^n Xi, Σ_{j=1}^n Yj) = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Yj).
3. Prove that if Var(Xi ) = σ 2 and Cov(Xi , Xj ) = γ for all 1 ≤ i, j ≤ n then
Var(X1 + ... + Xn ) = nσ 2 + n(n − 1)γ.
4. Let ξ1 and ξ2 be i.i.d. random variables such that 0 < Var(ξ1 ) < ∞. Show that r.v.
η1 = ξ1 − ξ2 and η2 = ξ1 + ξ2 are uncorrelated.
Exercise 2.13
Let X be the number of “1” and Y the number of “2” in n throws of a fair (balanced) die. Compute Cov(X, Y).
Before computing this quantity, can you predict whether Cov(X, Y) ≥ 0 or Cov(X, Y) < 0?
Hint: use the relationship 2) of Exercise 2.12.
Exercise 2.14
1o . Let ξ and η be r.v. with E(ξ) = E(η) = 0, Var(ξ) = Var(η) = 1 and the correlation
coefficient ρ. Show that
E(max(ξ², η²)) ≤ 1 + √(1 − ρ²).
Hint: observe that
max(ξ², η²) = (|ξ² + η²| + |ξ² − η²|) / 2.
2o. Let ρ be the correlation of η and ξ. Show the inequality
P(|ξ − E(ξ)| ≥ √Var(ξ) or |η − E(η)| ≥ √Var(η)) ≤ (1 + √(1 − ρ²)) / 2.
Exercise 2.15
Let (X, Y ) be a random vector of dimension 2. Suppose that Y ∼ N (m, τ 2 ) and that the
distribution of X given Y = y is N (y, σ 2 ).
1o . What is the distribution of Y given X = x?
2o . What is the distribution of X?
3o. What is the distribution of E(Y|X)?
Exercise 2.16
Let X and N be r.v. such that N is valued in {1, 2, ...} and E(|X|) < ∞, E(N) < ∞. Consider the sequence X1, X2, ... of independent r.v. with the same distribution as X. Show the Wald identity: if N is independent of the Xi then
E(Σ_{i=1}^N Xi) = E(N) E(X).
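A small Monte Carlo sketch of the Wald identity (illustration only; the choices N ∼ Geometric(0.3) and Xi ∼ E(2) are arbitrary):

```python
# Monte Carlo check of E(sum_{i=1}^N X_i) = E(N) E(X) for N independent of the X_i.
import numpy as np

rng = np.random.default_rng(0)
n_rep = 200_000
N = rng.geometric(p=0.3, size=n_rep)      # values in {1, 2, ...}, E(N) = 1/0.3
# the sum of N i.i.d. Exp(rate 2) variables has a Gamma(N, scale 1/2) distribution
S = rng.gamma(shape=N, scale=0.5)
print(S.mean())                           # empirical E(sum X_i)
print((1 / 0.3) * 0.5)                    # E(N) * E(X)
```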
Exercise 2.17
Suppose that the salary of an individual satisfies Y* = Xb + σε, where σ > 0, b ∈ R, X is a r.v. with bounded second order moments corresponding to the abilities of the individual, and ε is a standard normal variable, ε ∼ N(0, 1), independent of X. If Y* is larger than the SMIC value S, the received salary Y is Y*; otherwise it is equal to S. Compute E(Y|X). Is this expectation linear?
Exercise 2.18
Show that if φ is a characteristic function of some r.v. then φ∗ , |φ|2 and Re(φ), are also
characteristic functions (of certain r.v.).
Hint: for Re(φ) consider 2 independent random variables X and Y , where Y takes values −1
and 1 with probabilities 1/2, X has characteristic function φ, then compute the characteristic
function of XY .
Lecture 3
Random vectors. Multivariate normal distribution
3.1 Random vectors
Let X = (ξ1, ..., ξp)^T be a random vector¹, where ξ1, ..., ξp are random (univariate) variables. In the same way we produce random matrices: Ξ = (ξij), 1 ≤ i ≤ p, 1 ≤ j ≤ q, where the ξij are (univariate) r.v.. The cumulative distribution function of the random vector X is
F(t) = P(ξ1 ≤ t1, ..., ξp ≤ tp),   t = (t1, ..., tp)^T ∈ R^p.
If F(t) is differentiable with respect to t, the density of X (the joint density of ξ1, ..., ξp) exists and is equal to the mixed derivative
f(t) = f(t1, ..., tp) = ∂^p F(t) / (∂t1 ... ∂tp).
In this case
F(t) = ∫_{−∞}^{t1} ... ∫_{−∞}^{tp} f(u1, ..., up) du1 ... dup.
3.1.1 Properties of a multivariate density
We have: f(t) ≥ 0 and ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t1, ..., tp) dt1 ... dtp = 1. The marginal density of ξ1, ..., ξk, k < p, is (we use the symbol f(·) as a generic notation for densities)
f(t1, ..., tk) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t1, ..., tp) dtk+1 ... dtp.
¹ By convention, vector X ∈ R^{p×1} is a column vector.
Note that
two different random vectors may have the same marginal distributions.
Example 3.1 Consider the densities
f1(t1, t2) = 1, and f2(t1, t2) = 1 + (2t1 − 1)(2t2 − 1),  0 < t1, t2 < 1.
In both cases, f(t1) = ∫_0^1 f(t1, t2) dt2 = 1.
As in the case p = 2, the conditional density of ξ1, ..., ξk given ξk+1, ..., ξp is
f(t1, ..., tk | tk+1, ..., tp) = f(t1, ..., tp) / f(tk+1, ..., tp).
If X1 and X2 are two random vectors, then
f_{X2|X1}(x2|x1) = f(x1, x2) / f(x1).
Independence. Suppose that two random vectors X1 and X2 have a joint density f (x1 , x2 ).
They are independent iff
f (x1 , x2 ) = f1 (x1 )f2 (x2 ),
where f1 and f2 are probability densities. In other words, the conditional density fX2 |X1 (x2 |x1 )
does not depend on x1 . The independence is preserved by measurable transformations of vectors
X1 and X2 .
3.1.2 Moments of random vectors
Vector µ = (µ1, ..., µp)^T ∈ R^p is the mean (expectation) of the random vector X = (ξ1, ..., ξp)^T if
µj = E(ξj) = ∫ ... ∫ tj f(t1, ..., tp) dt1 ... dtp,  j = 1, ..., p
(we suppose that the corresponding integrals exist); we write µ = E(X). In the same way we define the expectation of a random matrix. As in the scalar case, the expectation is a linear functional: for any matrix A ∈ R^{q×p} and b ∈ R^q,
E(AX + b) = AE(X) + b = Aµ + b.
This property is still valid for random matrices: if Ξ is a p × q random matrix, A ∈ R^{q×p}, then E(AΞ) = AE(Ξ).
Covariance matrix Σ of the random vector X is given by
Σ = V(X) := E((X − µ)(X − µ)^T) = (σij),
a p × p matrix, where
σij = E((ξi − µi)(ξj − µj)) = ∫ ... ∫ (ti − µi)(tj − µj) f(t1, ..., tp) dt1 ... dtp
(we note that in this case the σij are not always positive). Because σij = σji, Σ is a symmetric matrix. We can also define the covariance matrix of random vectors X (p × 1) and Y (q × 1):
C(X, Y) = E((X − E(X))(Y − E(Y))^T),  C ∈ R^{p×q}.
The covariance matrix possesses the following properties:
1o . Σ = E(XX T ) − µµT , where µ = E(X).
2o . For any a ∈ Rp , Var(aT X) = aT V (X)a.
Proof : Observe that by linearity of the expectation,
Var(a^T X) = E((a^T X − E(a^T X))²) = E((a^T(X − E(X)))²) = E(a^T(X − µ)(X − µ)^T a) = a^T E((X − µ)(X − µ)^T) a = a^T V(X) a.
Thus, we have Var(a^T X) ≥ 0, implying that the matrix V(X) is positive semidefinite; we write V(X) ⪰ 0.
3o. Σ ⪰ 0.
4o. Let A be a q × p matrix and b ∈ R^q. Then V(AX + b) = A V(X) A^T.
Proof : Let Y = AX + b; then by linearity of the expectation,
ν = E(Y) = E(AX + b) = Aµ + b  and  Y − E(Y) = A(X − µ).
Now we have:
V(Y) = E(A(X − µ)(X − µ)^T A^T) = A V(X) A^T (linearity again).
5o. C(X, X) = V(X). In this case C = C^T ⪰ 0 (positive semidefinite matrix).
6o. C(X, Y) = C(Y, X)^T.
7o. C(X1 + X2, Y) = C(X1, Y) + C(X2, Y).
8o. If X and Y are two p-dimensional random vectors,
V(X + Y) = V(X) + C(X, Y) + C(Y, X) + V(Y) = V(X) + C(X, Y) + C(X, Y)^T + V(Y).
9o. If X ⊥⊥ Y, then C(X, Y) = 0 (null matrix); the converse is not true. This can be proved exactly the same way as in the case of the covariance of r.v..
Correlation matrix P of X is given by P = (ρij), 1 ≤ i, j ≤ p, where
ρij = σij / (√σii √σjj).
We note that the diagonal entries ρii = 1, i = 1, ..., p. If ∆ is the diagonal matrix with ∆ii = √σii, then P = ∆^{-1} Σ ∆^{-1}, and the positivity of Σ implies the positivity of P, i.e. P ⪰ 0.
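A short numerical sketch of the properties above (sample versions of Σ and P for an affine transform of a standard normal vector; the matrix A, the vector b and the sample size are arbitrary choices):

```python
# Sample covariance/correlation of X = A Y + b, illustrating properties 2o-4o
# and the relation P = Delta^{-1} Sigma Delta^{-1}.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
Y = rng.standard_normal((100_000, 2))            # V(Y) = I
X = Y @ A.T + b                                  # X = A Y + b, row-wise
Sigma = np.cov(X, rowvar=False)                  # close to A A^T = A V(Y) A^T
D_inv = np.linalg.inv(np.diag(np.sqrt(np.diag(Sigma))))
P = D_inv @ Sigma @ D_inv                        # correlation matrix, unit diagonal
a = np.array([1.0, -1.0])
print(Sigma, A @ A.T)                            # V(AX + b) = A V(Y) A^T
print(a @ Sigma @ a, np.var(X @ a))              # Var(a^T X) = a^T Sigma a >= 0
print(P)
```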
3.1.3 Characteristic function of a random vector
Definition 3.1 Let X ∈ R^p be a random vector. Its characteristic function for t ∈ R^p is given by
φX(t) = E(exp(i t^T X)).
Exercise 3.1
Two random vectors X ∈ R^p and Y ∈ R^q are independent iff the characteristic function φZ(u) of the vector Z = (X^T, Y^T)^T can be represented, for any u = (a^T, b^T)^T, a ∈ R^p and b ∈ R^q, as
φZ(u) = φX(a) φY(b).
Verify this characterization in the continuous case.
3.1.4 Transformations of random vectors
Let h = (h1, ..., hp)^T be a transformation, i.e. a function from R^p to R^p,
h(t1, ..., tp) = (h1(t1, ..., tp), ..., hp(t1, ..., tp))^T,  t = (t1, ..., tp)^T ∈ R^p.
The Jacobian of the transformation is defined by
Jh(t) = Det( (∂hi/∂tj)(t) )_{i,j}.
Proposition 3.1 (Calculus reminder) Suppose that
(i) partial derivatives of hi (·) are continuous on Rp , i = 1, ..., p,
(ii) h is a bijection,
(iii) Jh(t) ≠ 0 for any t ∈ R^p.
Then, for any function f(t) such that
∫_{R^p} |f(t)| dt < ∞
and any Borel set K ⊆ R^p we have
∫_K f(t) dt = ∫_{h^{-1}(K)} f(h(u)) |Jh(u)| du.
Remark: by the inverse function theorem, under the conditions of Proposition 3.1, the inverse function g(·) = h^{-1}(·) exists everywhere on R^p and
J_{h^{-1}}(h(u)) = 1 / Jh(u),  same as  J_{h^{-1}}(t) = 1 / Jh(h^{-1}(t)).
We conclude that h satisfies the conditions (i) − (iii) of Proposition 3.1 iff g = h−1 satisfies the
same conditions.
We have the following corollary of Proposition 3.1:
Proposition 3.2 Let Y be a random vector with the density fY(t), t ∈ R^p. Let g : R^p → R^p
be a transformation satisfying the premise of Proposition 3.1. Then, the density of the random
vector X = g(Y ) exists and is given by
fX (u) = fY (h(u))|Jh (u)|, for any u ∈ Rp ,
where h = g −1 .
Proof : Let X = (ξ1 , ..., ξp )T , v = (v1 , ..., vp )T , and Av = {t ∈ Rp : gi (t) ≤ vi , i = 1, ..., p}.
Then, by Proposition 3.1 with h = g^{-1} and f = fY, the c.d.f. of X is
FX(v) = P(ξi ≤ vi, i = 1, ..., p) = P(gi(Y) ≤ vi, i = 1, ..., p)
= ∫_{Av} fY(t) dt = ∫_{g(Av)} fY(h(u)) |Jh(u)| du.
On the other hand,
g(Av) = {u = g(t) ∈ R^p : t ∈ Av} = {u = g(t) ∈ R^p : gi(t) ≤ vi, i = 1, ..., p}
= {u = (u1, ..., up)^T ∈ R^p : ui ≤ vi, i = 1, ..., p}.
We conclude that
FX(v) = ∫_{−∞}^{v1} ... ∫_{−∞}^{vp} fY(h(u)) |Jh(u)| du
for any v = (v1, ..., vp)^T ∈ R^p. This implies that the density of X is fY(h(u)) |Jh(u)|.
Corollary 3.1 If X = AY + b where Y is a random vector on Rp with the density fY and A is
an invertible p × p matrix then
fX(u) = fY(A^{-1}(u − b)) |Det(A^{-1})| = fY(A^{-1}(u − b)) / |Det(A)|.
To verify the result it suffices to use Proposition 3.2 with u = g(t) = At + b and thus t =
g −1 (u) = h(u) = A−1 (u − b).
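A minimal numerical check of Corollary 3.1 in the Gaussian case (the matrix A, the vector b and the evaluation point u are arbitrary choices; comparing with scipy's N2(b, AA^T) density is legitimate because an affine image of a normal vector is again normal, cf. Section 3.3 below):

```python
# Corollary 3.1: f_X(u) = f_Y(A^{-1}(u - b)) / |det A| for X = A Y + b.
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[1.0, 0.5], [0.0, 2.0]])
b = np.array([1.0, -1.0])
u = np.array([0.3, 0.7])

f_Y = multivariate_normal(mean=np.zeros(2), cov=np.eye(2)).pdf
lhs = f_Y(np.linalg.solve(A, u - b)) / abs(np.linalg.det(A))
rhs = multivariate_normal(mean=b, cov=A @ A.T).pdf(u)
print(lhs, rhs)   # equal up to rounding
```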
3.1.5 Reminder of the properties of symmetric matrices
Recall that p × p matrix A, A = (aij ), i, j = 1, ..., p is called symmetric if aij = aji , i, j = 1, ..., p.
The p × p matrix Γ is said to be orthogonal if
Γ^{-1} = Γ^T (or equivalently Γ Γ^T = Γ^T Γ = I)
(where I is the p × p identity matrix). In other words, the columns γ·j of Γ are orthogonal vectors of length 1 (the same is true for the rows γi· of Γ). Of course, |Det(Γ)| = 1. We have
the spectral decomposition theorem (Jordan):
Let A ∈ Rp×p be a symmetric matrix. Then there exists an orthogonal matrix Γ and the
diagonal matrix
Λ = Diag(λi) = diag(λ1, ..., λp),
such that
A = Γ Λ Γ^T = Σ_{i=1}^p λi γ·i γ·i^T,   (3.1)
where γ·i are the orthonormal eigenvectors of A:²
γ·i^T γ·j = δij,  i, j = 1, ..., p,   Γ = (γ·1, ..., γ·p).
Comments.
1) Though a symmetric matrix may have multiple (repeated) eigenvalues, one can always choose p orthonormal eigenvectors of such a matrix, which are then all different.
2) We assume in the sequel that the eigenvalues λi, i = 1, ..., p, are sorted in decreasing order:
λ1 ≥ λ2 ≥ ... ≥ λp.
We say that γ·1 is the first (or principal) eigenvector of A, i.e. the eigenvector corresponding to the maximal eigenvalue; γ·2 is the second eigenvector, and so on.
If all eigenvalues λi, i = 1, ..., p, are nonnegative, matrix A is positive semidefinite (and positive definite if all λi > 0).
Other useful properties of square matrices
1o. Det(A) = Π_{i=1}^p λi,  Tr(A) = Σ_{i=1}^p λi.
2o. Det(AB) = Det(A) Det(B), Det(A^T) = Det(A).
3o. The calculus of matrix functions simplifies for symmetric matrices: for example, a positive integer power A^s, s ∈ N+, of a symmetric matrix satisfies A^s = Γ Λ^s Γ^T (if the matrix A is positive definite, this is true for any real s).
4o. Det(A^{-1}) = Det(A)^{-1} for nondegenerate A.
5o. For any s ∈ R and any matrix A = A^T ≻ 0, Det(A^s) = Det(A)^s (a simple consequence of |Det Γ| = 1 for an orthogonal matrix Γ).
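A short numpy sketch of the spectral decomposition and of properties 1o-2o for one particular symmetric matrix A (chosen arbitrarily; note that numpy's eigh returns the eigenvalues in increasing order, while the text sorts them in decreasing order):

```python
# Eigendecomposition A = Gamma Lambda Gamma^T of a symmetric matrix.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, Gamma = np.linalg.eigh(A)                            # eigenvalues (ascending), orthonormal columns
print(np.allclose(Gamma @ np.diag(lam) @ Gamma.T, A))     # A = Gamma Lambda Gamma^T
print(np.allclose(Gamma.T @ Gamma, np.eye(3)))            # Gamma is orthogonal
print(np.linalg.det(A), lam.prod())                       # Det(A) = prod of eigenvalues
print(np.trace(A), lam.sum())                             # Tr(A)  = sum of eigenvalues
```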
Projectors. A symmetric matrix P such that
P² = P (idempotent matrix)
is called a projection matrix (or projector).
All the eigenvalues of P are 0 or 1, and Rank(P) is the number of eigenvalues equal to 1. In other words, there is an orthogonal matrix Γ such that
Γ^T P Γ = [ I  0 ; 0  0 ],
where I is the Rank(P) × Rank(P) identity matrix.
Indeed, let v be an eigenvector of P; then Pv = λv, where λ is an eigenvalue of P. Due to P² = P,
(λ² − λ)v = (λP − P)v = (P² − P)v = 0.
This is equivalent to λ = 1 if Pv ≠ 0.
² Here δij stands for the Kronecker index: δij = 1 if i = j, otherwise δij = 0.
3.2 Conditional expectation of a random vector
Let X = (ξ1 , ..., ξp )T and Y = (η1 , ..., ηq )T be two random vectors. Here we consider only
continuous case, i.e. we assume that the joint density fX,Y (x, y) = fX,Y (t1 , ..., tp , s1 , ..., sq )
exists.
The conditional expectation E(Y |X) is the q-random vector with the components
E(η1 |X), ..., E(ηq |X);
here E(ηj |X) = gj (X) (a measurable function of X), and
gj(t) = E(ηj|X = t) = ∫ sj f_{ηj|X=t}(sj|t) dsj = ∫ sj f_{ηj|ξ1=t1,...,ξp=tp}(sj|t1, ..., tp) dsj.
We can verify that the latter quantity is well defined if, for instance, E(|ηj |) < ∞, j = 1, ..., q.
All the properties of conditional expectation, established in Lecture 2 hold true in the vector
case.
Same as in the scalar case, we can define the conditional covariance matrix:
V (Y |X) = E(Y Y T |X) − E(Y |X)E(Y |X)T .
3.2.1 Best prediction theorem
Let |a| = √(a1² + ... + ap²) stand for the Euclidean norm on R^p.
Definition 3.2 Let X ∈ Rp and Y ∈ Rq be two random vectors, and G : Rp → Rq . We say
that G(X) is the best prediction of Y given X (in the mean square sense) if
E((Y − G(X))(Y − G(X))^T) ⪯ E((Y − H(X))(Y − H(X))^T)   (3.2)
(we say that A ⪯ B if the difference B − A is positive semidefinite) for any measurable function H : R^p → R^q.
Clearly, (3.2) implies (please, verify this!)
E(|Y − G(X)|²) = inf_{H(·)} E(|Y − H(X)|²),
where the infimum is taken over all measurable functions H(·) : R^p → R^q.
Same as in the case of p = q = 1, we have
Theorem 3.1 If E(|Y |2 ) < ∞, then the best prediction of Y given X is unique a.s. and satisfies
G(X) = E(Y |X) (a.s.).
Proof : Of course, it is sufficient to look for the minimum among functions H(·) such that
E(|H(X)|2 ) < ∞. For any such H(X)
E((H(X) − Y)(H(X) − Y)^T)
= E([(H(X) − G(X)) + (G(X) − Y)][(H(X) − G(X)) + (G(X) − Y)]^T)
= E((H(X) − G(X))(H(X) − G(X))^T) + E((H(X) − G(X))(G(X) − Y)^T)
+ E((G(X) − Y)(H(X) − G(X))^T) + E((G(X) − Y)(G(X) − Y)^T).
But, using the properties of conditional expectation, we obtain:
E((H(X) − G(X))(G(X) − Y)^T) = E[ E((H(X) − G(X))(G(X) − Y)^T | X) ]
= E[ (H(X) − G(X)) E((G(X) − Y)^T | X) ] = 0.
The statement of the theorem follows.
3.3 Multivariate normal distribution
Normal distribution on R: recall that the normal distribution N(µ, σ²) on R is the probability distribution with density
f(x) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)).
Here µ is the mean and σ² is the variance. The characteristic function of the normal distribution N(µ, σ²) is given by
φ(t) = exp(iµt − σ²t²/2).
In particular, for N(0, 1) we have φ(t) = e^{−t²/2}.
3.3.1 The distribution Np(0, I)
Np (0, I) is the distribution of the random vector X = (ξ1 , ..., ξp )T where ξi , i = 1, ..., p are i.i.d.
random variables with distribution N (0, 1).
Properties of Np (0, I):
1o . The mean and the covariance matrix of X are E(X) = 0 and V (X) = I.
2o. Distribution Np(0, I) is absolutely continuous with density
f(u) = (2π)^{−p/2} exp(−u^T u / 2) = Π_{i=1}^p (2π)^{−1/2} exp(−ui²/2) = Π_{i=1}^p f0(ui),
where u = (u1, ..., up)^T and f0(t) = (1/√(2π)) e^{−t²/2} is the density of N(0, 1).
3o. The characteristic function of Np(0, I) is, by definition,
φX(a) = E(e^{i a^T X}) = E(Π_{j=1}^p e^{i aj ξj}) = Π_{j=1}^p E(e^{i aj ξj}) = Π_{j=1}^p e^{−aj²/2} = exp(−a^T a / 2),
where a = (a1, ..., ap)^T ∈ R^p.
3.3.2 Normal distribution on R^p
Definition 3.3 The random vector X is normally distributed on R^p if and only if there exist a
p × p matrix A and a vector µ ∈ Rp such that
X = AY + µ, where Y ∼ Np (0, I).
Properties:
1o. E(X) = µ since E(Y) = 0.
2o. V(X) = A V(Y) A^T = A A^T. We put Σ = A A^T.
3o. The characteristic function
φX(a) = E(e^{i a^T X}) = E(e^{i a^T (AY + µ)}) = e^{i a^T µ} E(e^{i b^T Y})   (with b = A^T a)
= e^{i a^T µ − b^T b / 2} = e^{i a^T µ − a^T Σ a / 2}.   (3.3)
We have the following characterization:
Theorem 3.2 Let φ : Rp → C (a complex-valued function). Then, φ is a characteristic function
of a normal distribution if and only if there exist µ ∈ Rp and a positive semidefinite matrix
Σ ∈ Rp×p such that
φ(a) = e^{i a^T µ − a^T Σ a / 2},  a ∈ R^p.   (3.4)
Remark: in this case µ is the mean and Σ is the covariance matrix of the normal distribution
in question.
Proof : The “only if” part has been already proved. To show the “if” part, starting from (3.4)
we have to prove that there exists a normal vector X ∈ Rp such that φ(·) is its characteristic
function.
1st step: by the spectral decomposition theorem, there exists an orthogonal matrix Γ such that Γ^T Σ Γ = Λ, where Λ is a diagonal matrix of rank k ≤ p with strictly positive eigenvalues λj, 1 ≤ j ≤ k. Then (cf. (3.1)),
Σ = Σ_{j=1}^k λj γ·j γ·j^T = Σ_{j=1}^k a·j a·j^T,
where γ·j are the columns of Γ, and a·j = √λj γ·j. Observe that a·j ⊥ a·l for l ≠ j (recall that the γ·j are orthonormal).
2nd step: Let Y ∼ N (0, I). Let us denote ηj the components of Y (Y = (η1 , ..., ηp )T ). We
consider the random vector
X = η1 a·1 + ... + ηk a·k + µ,
so that X = AY +µ, where A is a p×p matrix with columns aj , j = 1, ..., k: A = (a·1 , ..., a·k , 0, ..., 0).
Thus X is a normal vector. What is the characteristic function of X? We have E(X) = µ and
V(X) = E((η1 a·1 + ... + ηk a·k)(η1 a·1 + ... + ηk a·k)^T) = Σ_{j=1}^k a·j a·j^T = Σ,
due to E(ηl ηj) = δjl, where δjl is the Kronecker symbol. Thus, by (3.3), the characteristic function of X coincides with φ(u) in (3.4).
The result of Theorem 3.2 implies the following important corollary: any normal distribution
on Rp is entirely defined by the vector of means and its covariance matrix. This explains the
notation
X ∼ N (µ, Σ)
for the random vector X which is normally distributed with mean µ and covariance matrix
Σ = Σ^T ⪰ 0.
We will distinguish two situations, those of nondegenerate normal distribution and degenerate normal distribution.
3.3.3 Nondegenerate normal distribution
This is a normal distribution on R^p with positive definite covariance matrix Σ, i.e. Σ ≻ 0 (⇔ Det(Σ) > 0). Moreover, because Σ is symmetric and Σ ≻ 0, there exists a symmetric matrix A1 = Σ^{1/2} (a symmetric square root of Σ) such that Σ = A1² = A1^T A1 = A1 A1^T. As Det(Σ) = [Det(A1)]² > 0, we have Det(A1) > 0 and A1 is invertible. By (3.3), if X ∼ N(µ, Σ), its characteristic function is
φX(a) = e^{i a^T µ − a^T Σ a / 2}
for any a ∈ R^p, and due to Σ = A1 A1^T, we have
φX(a) = e^{i a^T µ − a^T Σ a / 2} = E(e^{i a^T (A1 Y + µ)}) = φ_{A1 Y + µ}(a),
where Y ∼ Np(0, I). Therefore,
X = A1 Y + µ
and, due to the invertibility of A1,
Y = A1^{-1}(X − µ).
The Jacobian of this linear transformation is Det(A1^{-1}), and by Corollary 3.1 the density of X is, for any u ∈ R^p,
fX(u) = Det(A1^{-1}) fY(A1^{-1}(u − µ)) = (1/Det(A1)) fY(A1^{-1}(u − µ))
= (1 / ((2π)^{p/2} √Det(Σ))) exp(−(u − µ)^T Σ^{-1} (u − µ) / 2).
Definition 3.4 We say that X has a nondegenerate normal distribution Np(µ, Σ) (with a positive definite covariance matrix Σ) iff X is a random vector with density
f(t) = (1 / ((2π)^{p/2} √Det(Σ))) exp(−(t − µ)^T Σ^{-1} (t − µ) / 2).
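A minimal sketch of Section 3.3.3: sampling X = µ + Σ^{1/2} Y with a symmetric square root of Σ, and evaluating the density of Definition 3.4 against scipy's implementation; µ, Σ and the evaluation point are arbitrary choices.

```python
# Sample X = mu + Sigma^{1/2} Y, Y ~ N_p(0, I), and check the density formula.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import sqrtm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
A1 = np.real(sqrtm(Sigma))                        # symmetric square root Sigma^{1/2}

rng = np.random.default_rng(2)
Y = rng.standard_normal((200_000, 2))
X = Y @ A1 + mu                                   # A1 is symmetric, so A1^T = A1
print(X.mean(axis=0), np.cov(X, rowvar=False))    # approximately mu and Sigma

t = np.array([0.5, -1.0])
d = t - mu
f = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (
        (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))
print(f, multivariate_normal(mu, Sigma).pdf(t))   # same value
```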
3.3.4 Degenerate normal distribution
This is a normal distribution on Rp with singular covariance matrix Σ, i.e. Det(Σ) = 0 (in other
words, Rank(Σ) = k < p). For instance, consider Σ = 0; then the characteristic function of X ∼ N(µ, 0) is φX(a) = e^{i a^T µ} (by Property 3o) and the distribution of X is the Dirac function at µ.
More generally, if Rank(Σ) = k ≥ 1, we obtain (cf. the proof of Theorem 3.2) that a vector
X ∼ Np (µ, Σ) can be represented as
X = AY + µ,
where Y ∼ N (0, I), A = (a·1 , ..., a·k , 0, ..., 0) and AAT = Σ, with Rank(A) = k. Thus, any
component of X is either normally distributed (nondegenerate) or its distribution is a Dirac
function. This is a consequence of the following statement:
Proposition 3.3 Let X ∼ Np (µ, Σ) and Rank(Σ) = k < p. Then there exists a linear subspace
H ⊂ Rp of dimension p − k such that the projection aT X of X on any vector a ∈ H is a “Dirac
random variable.”
Proof : We have X = AY + µ where A A^T = Σ, Rank(A) = k. Let H = Ker(A^T), of dimension dim(H) = p − k. If a ∈ H, then A^T a = 0 and Σa = 0.
Now, let a ∈ H; the characteristic function of the r.v. a^T X is
φ(u) = E(e^{i(a^T X)u}) = E(e^{i(ua)^T X}) = e^{i(ua)^T µ − (ua)^T Σ (ua) / 2} = e^{i(ua)^T µ}.
Therefore, the distribution of a^T X is a (scalar) Dirac function at a^T µ.
Theorem 3.3 (Equivalent definition of the multivariate normal distribution) A random vector
X ∈ Rp is normally distributed iff all its scalar projections aT X for any a ∈ Rp are normal
random variables.
Remark: we include the Dirac distribution as a special case of normal distributions (corresponding to σ 2 = 0).
Proof : Observe first that for any a ∈ R^p and any u ∈ R the characteristic function φ_{a^T X}(u) of the r.v. a^T X is related to that of X:
φ_{a^T X}(u) = E(e^{i a^T X u}) = φX(ua).   (3.5)
“only if” part: Let X be a normal vector in R^p. We are to show that a^T X is a normal random variable for any a ∈ R^p. We use (3.5) to obtain, for any u ∈ R,
φ_{a^T X}(u) = e^{i u a^T µ − u² a^T Σ a / 2},
where µ and Σ are the mean and the covariance matrix of X. Thus,
φ_{a^T X}(u) = e^{i µ0 u − u² σ0² / 2}
with µ0 = a^T µ and σ0² = a^T Σ a. As a result,
a^T X ∼ N(µ0, σ0²) = N(a^T µ, a^T Σ a).
“if” part: We are to prove that if a^T X is a normal random variable for any a ∈ R^p, then X is a normal vector. To this end, note that if a^T X is a normal r.v. for any a ∈ R^p, then E(|X|²) < ∞ (to see why this is true, it suffices to choose for a the vectors of an orthonormal basis of R^p). Therefore, the mean µ = E(X) and the covariance matrix Σ = V(X) are well defined.
Let us fix a ∈ R^p. By hypothesis, there exist m ∈ R and s² ≥ 0 such that a^T X ∼ N(m, s²). However, this immediately implies that
m = E(a^T X) = a^T µ,   s² = Var(a^T X) = a^T Σ a.
Moreover, the characteristic function of a^T X is given by
φ_{a^T X}(u) = e^{i m u − s² u² / 2} = e^{i u a^T µ − u² a^T Σ a / 2}.
Now, using (3.5) we obtain
φX(a) = φ_{a^T X}(1) = e^{i a^T µ − a^T Σ a / 2}.
Because a ∈ Rp is arbitrary, we conclude by Theorem 3.2 that X is a normal random vector in
Rp with mean µ and covariance matrix Σ.
3.3.5 Properties of the multivariate normal distribution
Here we consider X ∼ Np(µ, Σ), where µ ∈ R^p and Σ ∈ R^{p×p} is a symmetric matrix, Σ ⪰ 0. The following properties are consequences of the results of the preceding section:
(N1) Let Σ ≻ 0; then the random vector Y = Σ^{-1/2}(X − µ) satisfies
Y ∼ Np(0, I).
(N2) One-dimensional projections aT X of X for any a ∈ Rp are normal random variables:
aT X ∼ N (aT µ, aT Σa).
In particular, the marginal densities of the distribution Np (µ, Σ) are also normal. The
inverse statement is not true!
Exercise 3.2
Let the joint density of r.v.'s X and Y satisfy
f(x, y) = (1/(2π)) e^{−x²/2} e^{−y²/2} [1 + xy I{−1 ≤ x, y ≤ 1}].
What is the distribution of X? of Y?
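A numerical hint for Exercise 3.2 (not a solution): integrate the joint density over y for a few fixed x and compare with the N(0, 1) density; the chosen values of x are arbitrary.

```python
# Numerical marginalization of the joint density of Exercise 3.2.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f(x, y):
    extra = x * y if (-1 <= x <= 1 and -1 <= y <= 1) else 0.0
    return norm.pdf(x) * norm.pdf(y) * (1 + extra)

for x in (-0.5, 0.3, 2.0):
    # integrate piecewise to handle the kinks of the indicator at y = -1 and y = 1
    marginal = sum(quad(lambda y: f(x, y), a, b)[0]
                   for a, b in [(-np.inf, -1), (-1, 1), (1, np.inf)])
    print(x, marginal, norm.pdf(x))   # the marginal coincides with the N(0,1) density
```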
(N3) Any linear transformation of a normal vector is again a normal vector: if Y = AX + c
where A ∈ Rq×p and c ∈ Rq are some fixed matrix and vector (non-random),
Y ∼ Nq (Aµ + c, AΣAT ).
Exercise 3.3
Prove this claim.
(N4) Let σ 2 > 0. The distribution of X ∼ Np (0, σ 2 I) is invariant with respect to orthogonal
transformations: if Γ is an orthogonal matrix, then ΓX ∼ Np (0, σ 2 I). (The proof is
evident: it suffices to use (N3) with A = Γ.)
(N5) All subsets of components of a normal vector are normal vectors: Let X = (X1T , X2T )T
where X1 ∈ Rk and X2 ∈ Rp−k , then X1 and X2 are (k- and p − k-variate) normal vectors.
Proof : We use (N3) with c = 0 and A ∈ Rk×p , A = (Ik , 0) where Ik is the k × k identity
matrix to conclude that X1 is normal. For X2 we take A ∈ R(p−k)×p = (0, Ip−k ).
(N6) Two jointly normal vectors are independent if and only if they are non-correlated.
Proof : Only the sufficiency (“if”) claim is of interest. Let Z = (X^T, Y^T)^T, where X ∈ R^p and Y ∈ R^q, Z is a normal vector in R^{p+q}, and C(X, Y) = 0 (the covariance matrix of X and Y). To prove that X and Y are independent it suffices to verify (cf. Exercise 3.1) that the characteristic function φZ(u), u = (a^T, b^T)^T, a ∈ R^p and b ∈ R^q, can be decomposed as
φZ(u) = φX(a) φY(b).
Indeed, we have
E(Z) = [ E(X) ; E(Y) ],   V(Z) = [ V(X)  C(X, Y) ; C(Y, X)  V(Y) ] = [ V(X)  0 ; 0  V(Y) ],
where V(X) ∈ R^{p×p} and V(Y) ∈ R^{q×q} are the covariance matrices of X and of Y, respectively. Therefore, the characteristic function φZ(u) of Z is given by
φZ(u) = φZ(a, b) = exp( i(a^T E(X) + b^T E(Y)) − (1/2)(a^T, b^T) V(Z) (a^T, b^T)^T )
= exp( i a^T E(X) − a^T V(X) a / 2 ) exp( i b^T E(Y) − b^T V(Y) b / 2 ) = φX(a) φY(b)
for any u = (a^T, b^T)^T.
3.3.6 Geometry of the multivariate normal distribution
Let Σ ≻ 0. Note that the density of Np(µ, Σ) is constant on the surfaces
EC = {x : (x − µ)^T Σ^{-1} (x − µ) = C²}.
We call these level sets the “contours” of the distribution. In the case of interest, the EC are ellipsoids, which we refer to as concentration ellipsoids.
Concentration ellipsoids: X = (ξ1, ξ2), Y = (η1, η2), where Y = Σ^{-1/2} X, Σ = [ 1  3/4 ; 3/4  1 ] (here ρ = 0.75).
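A short numerical sketch for the figure above: the principal axes of EC are the eigenvectors of Σ and the semi-axes have length C√λi (here C = 1 is an arbitrary choice).

```python
# Principal axes of the concentration ellipsoid for Sigma = [[1, 3/4], [3/4, 1]].
import numpy as np

Sigma = np.array([[1.0, 0.75], [0.75, 1.0]])
lam, Gamma = np.linalg.eigh(Sigma)
C = 1.0
print(lam)                 # 0.25 and 1.75, i.e. 1 - rho and 1 + rho
print(Gamma)               # directions (1,-1)/sqrt(2) and (1,1)/sqrt(2), up to sign
print(C * np.sqrt(lam))    # semi-axes of {x : x^T Sigma^{-1} x = C^2}
```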
3.4 Distributions derived from the normal
3.4.1 Pearson's χ² distribution
This is the distribution of the sum
Y = η1² + ... + ηp²,
where η1, ..., ηp are i.i.d. N(0, 1) random variables. We denote it Y ∼ χ²_p and pronounce “Y follows the chi-square distribution with p degrees of freedom (d.f.)”. The density of the χ²_p distribution is given by
f_{χ²_p}(y) = C(p) y^{p/2−1} e^{−y/2} I{0 < y < ∞},   (3.6)
where
C(p) = (2^{p/2} Γ(p/2))^{-1},
and Γ(·) is the gamma function:
Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du,  x > 0.
We have E(Y ) = p, Var(Y ) = 2p if Y ∼ χ2p .
Density of the χ² distribution for p = 1, 2, 3, 6.
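A quick Monte Carlo sketch of the construction of χ²_p above (the choice p = 6, the sample size and the threshold 10 are arbitrary):

```python
# Sum of p squared independent N(0,1) variables ~ chi^2_p: moments and one tail.
import numpy as np
from scipy.stats import chi2

p = 6
rng = np.random.default_rng(3)
Y = (rng.standard_normal((200_000, p)) ** 2).sum(axis=1)
print(Y.mean(), p)                       # E(Y) = p
print(Y.var(), 2 * p)                    # Var(Y) = 2p
print((Y > 10).mean(), chi2.sf(10, p))   # empirical vs exact tail P(Y > 10)
```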
Exercise 3.4
Obtain the expression (3.6) for the density of χ2p .
3.4.2 Fisher-Snedecor distribution
Let U ∼ χ²_p, V ∼ χ²_q be two independent r.v.. The Fisher distribution with degrees of freedom p and q is the distribution of the random variable
Y = (U/p) / (V/q).
We write Y ∼ F_{p,q}. The density of F_{p,q} is given by
f_{F_{p,q}}(y) = C(p, q) y^{p/2−1} / (q + py)^{(p+q)/2} I{0 < y < ∞},   (3.7)
where
C(p, q) = p^{p/2} q^{q/2} / B(p/2, q/2),  with B(p, q) = Γ(p)Γ(q) / Γ(p + q).
One can show that the distribution of pY approaches the χ²_p distribution in the limit q → ∞.
Density of the Fisher distribution for (p, q) = (10, 4), (10, 10), (10, 100).
Exercise 3.5
Verify the expression (3.7) for the density of the Fisher distribution.
3.4.3 Student (W. Gosset) t distribution
Let η ∼ N(0, 1) and ξ ∼ χ²_q be independent. The Student distribution with q degrees of freedom is that of the r.v.
Y = η / √(ξ/q).
We write Y ∼ t_q. The density of t_q is
f_{t_q}(y) = C(q)(1 + y²/q)^{−(q+1)/2},  y ∈ R,   (3.8)
where
C(q) = 1 / (√q B(1/2, q/2)).
Note that t_1 is the Cauchy distribution, and t_q “approaches” N(0, 1) when q → ∞. We note that the t_q distribution is symmetric. The tails of t_q are heavier than normal tails.
Exercise 3.6
Verify the expression (3.8) for the Student distribution density.
Densities of the Student t_4 distribution and of N(0, 1).
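A small numerical illustration of the last remark, comparing the tail probabilities P(|Y| > u) under t_4 and under N(0, 1) (the thresholds and the value q = 200 are arbitrary choices):

```python
# Heavier tails of t_q compared with N(0,1); t_q -> N(0,1) for large q.
from scipy.stats import t, norm

for u in (2.0, 3.0, 5.0):
    print(u, 2 * t.sf(u, df=4), 2 * norm.sf(u))      # P(|t_4| > u) vs P(|N(0,1)| > u)
print(2 * t.sf(3.0, df=200), 2 * norm.sf(3.0))       # nearly equal for q = 200
```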
3.5 Cochran's theorem
Theorem 3.4 Let X ∼ Np(0, I) and let A1, ..., AJ, J < p, be p × p matrices such that
(1) Aj² = Aj,
(2) Aj is symmetric, Rank(Aj) = nj,
(3) Aj Ak = 0 when j ≠ k and Σ_{j=1}^J nj ≤ p.³
Then,
(i) the vectors Aj X are independent with normal distributions Np(0, Aj), j = 1, ..., J, respectively;
(ii) the random variables |Aj X|², j = 1, ..., J, are independent with χ²_{nj} distributions, j = 1, ..., J.
Proof :
(i) Observe that E(Aj X) = 0 and
V(Aj X) = Aj V(X) Aj^T = Aj Aj^T = Aj² = Aj.
Moreover, the joint distribution of Ak X and Aj X is clearly normal. However,
C(Ak X, Aj X) = E(Ak X X^T Aj^T) = Ak V(X) Aj^T = Ak Aj^T = Ak Aj = 0
for j ≠ k. By property (N6) of the normal distribution, this implies that Ak X and Aj X are independent for all k ≠ j.
³ Some presentations of this result in the statistical literature also assume that A1 + ... + AJ = I.
(ii) Since Aj is a projector, there exists an orthogonal matrix Γ such that
Γ Aj Γ^T = [ I_{nj}  0 ; 0  0 ] =: Λ,
the diagonal matrix of eigenvalues of Aj (because the rank of Aj is equal to nj, the identity block has size nj). Therefore
|Aj X|² = X^T Aj^T Aj X = X^T Aj X = (X^T Γ^T) Λ (ΓX) = Y^T Λ Y = Σ_{i=1}^{nj} ηi²,
where Y = (η1, ..., ηp)^T is a normal vector, Y = ΓX ∼ Np(0, I) (due to property (N4) of the normal distribution). We conclude that |Aj X|² ∼ χ²_{nj}. Since independence is preserved by measurable transformations, |Aj X|² and |Ak X|² are independent for j ≠ k.
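A classical illustration of Cochran's theorem (a sketch, with arbitrary n and sample size): take J = 2, A1 = (1/n)𝟙𝟙^T (rank 1) and A2 = I − A1 (rank n − 1); conditions (1)-(3) hold, so |A1X|² = n X̄² ∼ χ²_1 and |A2X|² = Σ(Xi − X̄)² ∼ χ²_{n−1} are independent (the independence of sample mean and sample variance).

```python
# Cochran's theorem with the two projectors A1 = (1/n) 1 1^T and A2 = I - A1.
import numpy as np

n, n_rep = 5, 100_000
rng = np.random.default_rng(4)
X = rng.standard_normal((n_rep, n))
xbar = X.mean(axis=1)
q1 = n * xbar ** 2                               # |A1 X|^2
q2 = ((X - xbar[:, None]) ** 2).sum(axis=1)      # |A2 X|^2
print(q1.mean(), 1, q2.mean(), n - 1)            # chi^2 means equal the degrees of freedom
print(np.corrcoef(q1, q2)[0, 1])                 # approximately 0: q1 and q2 are independent
```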
3.6 Normal correlation theorem and Kalman-Bucy filter
Another important consequence of the results of Section 3.3.5 is the following statement:
Theorem 3.5 Let X^T = (ξ^T, θ^T), ξ ∈ R^k, θ ∈ R^l, p = k + l, be a normal vector, X ∼ Np(µ, Σ), where
µ^T = (µξ^T, µθ^T),   Σ = [ Σξξ  Σξθ ; Σθξ  Σθθ ],
Σξξ ∈ R^{k×k}, Σθθ ∈ R^{l×l}, Σθξ^T = Σξθ ∈ R^{k×l}. We suppose that Σξξ ≻ 0.
Then
m := E(θ|ξ) = µθ + Σθξ Σξξ^{-1} (ξ − µξ)  (a.s.),
γ := V(θ|ξ) = Σθθ − Σθξ Σξξ^{-1} Σξθ  (a.s.),   (3.9)
and the conditional distribution of θ given ξ is normal: for any s ∈ R^l, P(θ ≤ s|ξ) is (a.s.) the c.d.f. of an l-variate normal distribution with the vector of means m and the covariance matrix γ (we write a ≤ b for two vectors a, b ∈ R^l for the system of inequalities a1 ≤ b1, ..., al ≤ bl).
Moreover, the random vectors ξ and
η = θ − Σθξ Σξξ^{-1} ξ
are independent.
Remarks:
1. The theorem provides an explicit expression for the regression function m = E(θ|ξ) (regression of θ on ξ) and the conditional covariance matrix
γ = V(θ|ξ) = E((θ − m)(θ − m)^T).
This regression is linear in the case of a Gaussian couple (ξ, θ).
2. If we assume, in addition, that Σ ≻ 0, then the matrix γ is also ≻ 0. Indeed, let a ∈ R^k, b ∈ R^l, not both zero; then
(a^T b^T) Σ (a^T b^T)^T = (a^T b^T) [ Σξξ  Σξθ ; Σθξ  Σθθ ] (a^T b^T)^T > 0,
same as
a^T Σξξ a + a^T Σξθ b + b^T Σθξ a + b^T Σθθ b > 0.   (3.10)
If we choose
a = −Σξξ^{-1} Σξθ b,
then (3.10) can be rewritten as
−b^T Σθξ Σξξ^{-1} Σξθ b + b^T Σθθ b > 0
for any b ∈ R^l, b ≠ 0. Thus,
Σθθ − Σθξ Σξξ^{-1} Σξθ ≻ 0.
3. The normal correlation theorem allows for the following geometric interpretation: assume that E(ξ) = 0 and E(θ) = 0, and let L²ξ(P) be the subspace of random vectors with finite covariance matrix which are measurable with respect to ξ. Then Σθξ Σξξ^{-1} ξ is the orthogonal projection of θ on L²ξ(P), and the vector η = θ − Σθξ Σξξ^{-1} ξ is orthogonal to L²ξ(P).
4. It is worth mentioning that one can also prove a “conditional” version of Theorem 3.5 if we assume that the conditional distribution of the couple (ξ, θ) given another r.v., say Z, is normal (a.s.). Indeed, let X = (ξ, θ)^T = ((ξ1, ..., ξk), (θ1, ..., θl))^T be a random vector and Z some other random vector defined on the same probability space (Ω, F, P). Assume that the conditional distribution of X given Z is normal (a.s.) with vector of means
E(X|Z)^T = (E(ξ|Z)^T, E(θ|Z)^T) = (µξ|Z^T, µθ|Z^T),
and covariance matrix
ΣX|Z = [ V(ξ|Z)  C(ξ, θ|Z) ; C(θ, ξ|Z)  V(θ|Z) ] =: [ Σξξ|Z  Σξθ|Z ; Σθξ|Z  Σθθ|Z ].
Then the conditional expectation m = E(θ|ξ, Z) and the conditional covariance matrix γ = V(θ|ξ, Z) are given by
m = µθ|Z + Σθξ|Z Σξξ|Z^{-1} (ξ − µξ|Z),   γ = Σθθ|Z − Σθξ|Z Σξξ|Z^{-1} Σξθ|Z;   (3.11)
and the conditional distribution of θ given ξ and Z is normal: for any s ∈ R^l, P(θ ≤ s|ξ, Z) is (a.s.) the c.d.f. of an l-variate normal distribution with the mean m and the covariance matrix γ. Moreover, the random vectors ξ and
η = θ − Σθξ|Z Σξξ|Z^{-1} ξ
are conditionally independent given Z.
This statement can be proved in the exactly same way as Theorem 3.5 and will be used
in the next section.
Proof of the normal correlation theorem.
Step 1. Let us compute E(η) and V(η):
E(η) = E(θ − Σθξ Σξξ^{-1} ξ) = µθ − Σθξ Σξξ^{-1} µξ,
and
V(η) = E([(θ − µθ) − Σθξ Σξξ^{-1}(ξ − µξ)][(θ − µθ) − Σθξ Σξξ^{-1}(ξ − µξ)]^T)
= Σθθ − Σθξ Σξξ^{-1} E((ξ − µξ)(θ − µθ)^T) − E((θ − µθ)(ξ − µξ)^T) Σξξ^{-1} Σθξ^T + Σθξ Σξξ^{-1} E((ξ − µξ)(ξ − µξ)^T) Σξξ^{-1} Σθξ^T
= Σθθ − Σθξ Σξξ^{-1} Σξθ.
Step 2. Let us verify the orthogonality of η and ξ:
C(η, ξ) = C(θ, ξ) − Σθξ Σξξ^{-1} C(ξ, ξ) = Σθξ − Σθξ Σξξ^{-1} Σξξ = 0,
thus η ⊥ ξ.
Step 3. We show that the couple (ξ, η) is normal. Indeed,
(ξ^T, η^T)^T = AX = A (ξ^T, θ^T)^T,
where
A = [ Ik  0 ; −Σθξ Σξξ^{-1}  Il ],
with identity matrices Ik ∈ R^{k×k} and Il ∈ R^{l×l}. By property (N3) of Section 3.3.5, (ξ^T, η^T)^T is a normal vector.
Its covariance matrix is given by
V((ξ^T, η^T)^T) = [ V(ξ)  C(ξ, η) ; C(η, ξ)  V(η) ] = [ Σξξ  0 ; 0  Σθθ − Σθξ Σξξ^{-1} Σξθ ].
Because Σξξ ≻ 0 and Σθθ − Σθξ Σξξ^{-1} Σξθ ⪰ 0 (by the Cauchy-Schwarz inequality), we have V((ξ^T, η^T)^T) ⪰ 0. Besides this, V((ξ^T, η^T)^T) = A V(X) A^T ⪰ 0.
Step 4. Now property (N6) implies that η and ξ are independent. On the other hand, the result of Step 3, along with (N5), allows us to conclude that η is a normal vector. Using the above expressions for E(η) and V(η) we obtain
η ∼ Nl(µθ − Σθξ Σξξ^{-1} µξ, Σθθ − Σθξ Σξξ^{-1} Σξθ).
Now it suffices to note that
θ = η + Σθξ Σξξ^{-1} ξ,
where η is independent of ξ. Therefore, the conditional distribution of θ given ξ is the distribution of η translated by Σθξ Σξξ^{-1} ξ, i.e.
E(θ|ξ) = E(η) + Σθξ Σξξ^{-1} ξ,   V(θ|ξ) = V(η).
The linearity of the best prediction m = E(θ|ξ) of the vector θ given ξ is, of course, tightly linked to the normality of the couple (ξ, θ), which allows for a simple calculation of m. It is interesting to see what the best linear prediction is in the case where the joint distribution of ξ and θ is not normal. In other words, we may be interested in finding a matrix A* ∈ R^{l×k} and a vector b* ∈ R^l such that θ̂ = b* + A*ξ satisfies
E((θ − θ̂)(θ − θ̂)^T) = inf_{A ∈ R^{l×k}, b ∈ R^l} E((θ − Aξ − b)(θ − Aξ − b)^T).
The answer is given by the following lemma which stresses the importance of the Gaussian case
when looking for the best linear predictors:
Lemma 3.1 Suppose that (X, Y) is a random vector, X ∈ R^k, Y ∈ R^l, such that E(|X|² + |Y|²) < ∞, V(X) ≻ 0. Let (ξ, θ) be a normal vector with the same mean and covariance matrix, i.e.
i.e.
E(ξ) = E(X), E(θ) = E(Y ),
V (ξ) = V (X), V (θ) = V (Y ), C(X, Y ) = C(ξ, θ).
Let now λ(b) : Rk → Rl be a linear function such that
λ(b) = E(θ|ξ = b).
Then λ(X) is the best linear prediction of Y given X. Besides this, E(λ(X)) = E(Y ).
Proof : Observe that the existence of a linear function λ(b) which coincides with E(θ|ξ = b) is a consequence of the normal correlation theorem. Let η(b) be another linear prediction of θ given ξ; then
E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T),
and by linearity of the predictions λ(·) and η(·), under the premise of the lemma, we get
E((Y − λ(X))(Y − λ(X))^T) = E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T) = E((Y − η(X))(Y − η(X))^T),
which proves the optimality of λ(X). Finally,
E(λ(X)) = E(λ(ξ)) = E(E(θ|ξ)) = E(θ) = E(Y).
Let us consider the following example (cf. Exercise 2.15):
Example 3.2 Let X and Y be r.v. such that the couple (X, Y) is normally distributed with means µX = E(X), µY = E(Y), variances σX² = Var(X) > 0, σY² = Var(Y) > 0 and correlation ρ = ρXY, |ρ| < 1.
Putting Σ = Var((X, Y)^T), we get
Σ = [ σX²  ρσXσY ; ρσXσY  σY² ]
with Det(Σ) = σX² σY² (1 − ρ²) > 0. Observe that if in Theorem 3.5 ξ = X and θ = Y, then
Σθξ = Σξθ = ρσXσY,   Σθξ Σξξ^{-1} = ρ σY / σX.
So the regression function satisfies
m(x) = E(Y|X = x) = µY + ρ (σY/σX)(x − µX),   γ = γ(x) = V(Y|X = x) = σY²(1 − ρ²),
and the conditional density of Y given X is
f_{Y|X}(y|x) = (1/√(2πγ)) exp(−(y − m(x))² / (2γ))
(the density of the distribution N(m(x), γ(x))).
Let us consider the particular case µX = µY = 0 and σX = σY = 1. Then
Σ = [ 1  ρ ; ρ  1 ],   Σ^{-1} = (1 − ρ²)^{-1} [ 1  −ρ ; −ρ  1 ].
The eigenvectors of Σ (and of Σ^{-1}) are
(1, 1)^T and (−1, 1)^T,
corresponding to the eigenvalues, respectively,
λ1 = 1 + ρ and λ2 = 1 − ρ.
The normalized eigenvectors are γ1 = 2^{-1/2}(1, 1)^T and γ2 = 2^{-1/2}(−1, 1)^T. If we put Γ = (γ1, γ2), then we have the eigenvalue decomposition:
Σ = Γ Λ Γ^T = Γ [ 1+ρ  0 ; 0  1−ρ ] Γ^T.
Let us consider the concentration ellipsoids of the joint density of (X, Y). Let, for C > 0,
EC = {x ∈ R² : x^T Σ^{-1} x ≤ C²} = {x ∈ R² : |y|² ≤ C²},
where y = Σ^{-1/2} x. We set y = (y1, y2)^T, x = (x1, x2)^T; then
y1 = (x1 + x2) / √(2(1 + ρ)),   y2 = (x1 − x2) / √(2(1 − ρ)).
In this case the concentration ellipse becomes
EC = {x^T Σ^{-1} x ≤ C²} = { (x1 + x2)² / (2(1 + ρ)) + (x1 − x2)² / (2(1 − ρ)) ≤ C² }.
Concentration ellipses: X = (ξ1, ξ2), Y = (η1, η2), where Y = Σ^{-1/2} X (left: ρ = 0.75, right: ρ = −0.5).
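A Monte Carlo sketch of Example 3.2 in the standard case µX = µY = 0, σX = σY = 1 (the values ρ = 0.75, x0 = 1 and the width of the conditioning window are arbitrary choices):

```python
# Check E(Y | X = x) = rho * x and V(Y | X = x) = 1 - rho^2 by conditioning on a small window.
import numpy as np

rho, x0 = 0.75, 1.0
rng = np.random.default_rng(5)
X = rng.standard_normal(2_000_000)
Y = rho * X + np.sqrt(1 - rho ** 2) * rng.standard_normal(2_000_000)
sel = np.abs(X - x0) < 0.02               # condition on X being close to x0
print(Y[sel].mean(), rho * x0)            # regression function m(x0)
print(Y[sel].var(), 1 - rho ** 2)         # conditional variance gamma
```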
3.6.1 The Kalman-Bucy filter
Suppose that the sequence of (couples of) random vectors (θ, ξ) = ((θn), (ξn)), n = 0, 1, 2, ..., θn = (θ1(n), ..., θl(n))^T ∈ R^l and ξn = (ξ1(n), ..., ξk(n))^T ∈ R^k, is generated by the recursive equations
θ_{n+1} = a_{n+1} θn + b_{n+1} ε^{(0)}_{n+1},   ξ_{n+1} = A_{n+1} θn + B_{n+1} ε^{(1)}_{n+1},   (3.12)
with initial conditions (θ0, ξ0).
Here ε^{(0)}_n ∈ R^l and ε^{(1)}_n ∈ R^k are independent normal vectors, ε^{(0)}_1 ∼ Nl(0, I) and ε^{(1)}_1 ∼ Nk(0, I); an, bn, An and Bn are deterministic matrices of size, respectively, l × l, l × l, k × l and k × k. We suppose that the matrices Bn are of full rank, and that the initial conditions (θ0, ξ0) are independent of the sequences (ε^{(0)}_n) and (ε^{(1)}_n).
In the sequel we use the notation ξ0^n for the “long” vector ξ0^n = (ξ0^T, ..., ξn^T)^T.
First, observe that if E(|θ0|² + |ξ0|²) < ∞, then for all n ≥ 0, E(|θn|² + |ξn|²) < ∞. If we assume, in addition, that the couple (θ0, ξ0) is a normal vector, then we can easily verify (all θn and ξn are linear functions of the Gaussian vectors (θ0, ξ0), ε^{(0)}_i and ε^{(1)}_i, i = 1, ..., n) that the “long” vector Z^T = (θ0^T, ξ0^T, ..., θn^T, ξn^T) is normal for each n ≥ 0. We can thus apply the normal correlation theorem to compute the best prediction of the sequence (θi), 0 ≤ i ≤ n, given (ξi), 0 ≤ i ≤ n.
This computation may become rather expensive if we want to build the prediction for large n. This observation is not quite valid today, but in the 50-60's, memory and processing power requirements were important factors, especially for “onboard” computations. This motivated the search for “cheap” algorithms for computing best predictions, which resulted in 1960 in the discovery of the Kalman-Bucy filter, which computes the prediction in a fully recursive way. The aim of the next exercises is to obtain the recursive equations of the Kalman filter – recursive formulas for
mn = E(θn|ξ0^n),   γn = V(θn|ξ0^n).
This problem, extremely complicated in the general setting, allows for a simple solution if we suppose that the conditional distribution P(θ0 ≤ a|ξ0) of the vector θ0 given ξ0 is normal (a.s.), which we will assume in this section. Our first objective is to show that, under the conditions above, the sequence (θ, ξ) is conditionally Gaussian; in other words, the conditional c.d.f.
P(ξ_{n+1} ≤ x, θ_{n+1} ≤ a | ξ0^n)
coincides (a.s.) with the c.d.f. of an (l + k)-dimensional normal distribution with mean and covariance matrix which depend on ξ0^n.
Exercise 3.7
Let ζn = (ξn^T, θn^T)^T, t ∈ R^{k+l}. Verify that the conditional c.d.f.
P(ζ_{n+1} ≤ t | ξ0^n, θn = u)
is (a.s.) normal with mean Mu, where M is a (k + l) × l matrix, and a (k + l) × (k + l) covariance matrix Σ, both to be determined.
Let us suppose now that for n ≥ 0 the conditional c.d.f.
P(ζn ≤ t | ξ0^{n−1})
is (a.s.) that of an (l + k)-dimensional normal distribution with mean and covariance matrix depending on ξ0^{n−1}.
Exercise 3.8
Use the “conditional version” of the normal correlation theorem (cf Remark 4 and display (3.11))
to show that the conditional c.d.f.
P(ζ_{n+1} ≤ t | ξ0^n),  n ≥ 0,
are (a.s.) normal with
E(ζ_{n+1}|ξ0^n) = [ A_{n+1} mn ; a_{n+1} mn ],
V(ζ_{n+1}|ξ0^n) = [ B_{n+1}B_{n+1}^T + A_{n+1}γn A_{n+1}^T   A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T   b_{n+1}b_{n+1}^T + a_{n+1}γn a_{n+1}^T ],
where mn = E(θn|ξ0^n) and γn = V(θn|ξ0^n).
Hint: compute the conditional characteristic function
E(exp(i t^T ζ_{n+1}) | ξ0^n, θn),  t ∈ R^{l+k},
then use the fact that, in the premises of the exercise, the distribution of θn given ξ0^{n−1} and ξn is conditionally normal with parameters mn and γn.
Exercise 3.9
Apply the (conditional) normal correlation theorem to obtain the recursive relations:
m_{n+1} = a_{n+1} mn + a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} (ξ_{n+1} − A_{n+1} mn),   (3.13)
γ_{n+1} = a_{n+1} γn a_{n+1}^T + b_{n+1} b_{n+1}^T − a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} A_{n+1} γn a_{n+1}^T
(since B_{n+1} is of full rank, so is the matrix B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T, which is therefore invertible).
Show that ξ_{n+1} and
η = θ_{n+1} − a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} (ξ_{n+1} − A_{n+1} mn)
are independent given ξ0^n.
Example 3.3 Let X = (Xn) and Y = (Yn) be random sequences such that
X_{n+1} = c Xn + b ε^{(0)}_{n+1},   Y_{n+1} = Xn + B ε^{(1)}_{n+1},   (3.14)
where c, b and B are reals, (ε^{(0)}) and (ε^{(1)}) are sequences of i.i.d. N(0, 1) r.v. which are mutually independent. Let us compute mn = E(Xn|Y0^n).
We can think of Xn as the “useful signal” and of B ε^{(1)}_{n+1} as the observation noise, and we want to recover Xn given the observations Y0, ..., Yn. The Kalman equations (3.13) allow us to easily obtain the recursive prediction:
mn = c m_{n−1} + (c γ_{n−1} / (B² + γ_{n−1})) (Yn − m_{n−1}),
γn = c² γ_{n−1} + b² − c² γ_{n−1}² / (B² + γ_{n−1}).
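A minimal sketch of these recursions (not a full treatment): the constants c, b, B, the horizon and the initialization m0 = 0, γ0 = 1 (i.e. X0 ∼ N(0, 1) with Y0 carrying no information) are assumptions made only for the illustration; the limiting γ is compared with the Riccati equation of Exercise 3.10 below.

```python
# Scalar Kalman filter of Example 3.3.
import numpy as np

c, b, B = 0.9, 0.5, 1.0          # model constants (arbitrary choice)
n_steps = 2000
rng = np.random.default_rng(6)

# simulate X_{n+1} = c X_n + b eps0,  Y_{n+1} = X_n + B eps1
X = np.zeros(n_steps + 1)
Y = np.zeros(n_steps + 1)
X[0] = rng.standard_normal()
for n in range(n_steps):
    X[n + 1] = c * X[n] + b * rng.standard_normal()
    Y[n + 1] = X[n] + B * rng.standard_normal()

# filtering recursions; m0 = 0, gamma0 = 1 assumed
m, g = 0.0, 1.0
for n in range(1, n_steps + 1):
    gain = c * g / (B ** 2 + g)
    m = c * m + gain * (Y[n] - m)                               # m_n
    g = c ** 2 * g + b ** 2 - c ** 2 * g ** 2 / (B ** 2 + g)    # gamma_n

# gamma_n converges to the positive root of the Riccati equation of Exercise 3.10
p_ = B ** 2 * (1 - c ** 2) - b ** 2
gamma_lim = (-p_ + np.sqrt(p_ ** 2 + 4 * (b * B) ** 2)) / 2
print(g, gamma_lim)
print(X[-1], m)                  # last state and its filtered estimate
```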
Exercise 3.10
Show that if b ≠ 0, B ≠ 0 and |c| < 1, then the “limit error” γ = lim_{n→∞} γn of the Kalman filter exists and is a positive solution of the quadratic (Riccati) equation
γ² + (B²(1 − c²) − b²) γ − b² B² = 0.
Example 3.4 Let θ ∈ R^l be a normal vector with E(θ) = 0 and V(θ) = γ (we assume that γ is known). We look for the best prediction of θ given observations of the k-variate sequence (ξn):
ξ_{n+1} = A_{n+1} θ + B_{n+1} ε^{(1)}_{n+1},   ξ0 = 0,
where A_{n+1}, B_{n+1} and ε^{(1)}_{n+1} satisfy the same hypotheses as in (3.12).
From (3.13) we obtain
m_{n+1} = mn + γn A_{n+1}^T [B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T]^{-1} (ξ_{n+1} − A_{n+1} mn),
γ_{n+1} = γn − γn A_{n+1}^T [B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T]^{-1} A_{n+1} γn.   (3.15)
The solutions to (3.15) are given by
m_{n+1} = [I + γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} A_{m+1}]^{-1} γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} ξ_{m+1},   (3.16)
γ_{n+1} = [I + γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} A_{m+1}]^{-1} γ,
where I is the l × l identity matrix.
Exercise 3.11
Derive the formula (3.16).
Hint: do Google search for matrix inversion lemma, then apply the lemma to
T
[γn−1 + An+1 (Bn+1 Bn+1
)−1 ATn+1 ]−1 .
3.6.2 Solutions to exercises of Section 3.6.1
Exercise 3.7
One can easily verify that (a.s.)
E(θ_{n+1}|ξ0^n, θn = u) = a_{n+1} u,   E(ξ_{n+1}|ξ0^n, θn = u) = A_{n+1} u,
V(θ_{n+1}|ξ0^n, θn = u) = b_{n+1} b_{n+1}^T,   V(ξ_{n+1}|ξ0^n, θn = u) = B_{n+1} B_{n+1}^T,
and
C(θ_{n+1}, ξ_{n+1}|ξ0^n, θn = u) = 0,
thus the conditional distribution of ζ_{n+1} is (a.s.) normal with
E(ζ_{n+1}|ξ0^n, θn = u) = [ A_{n+1} u ; a_{n+1} u ],   V(ζ_{n+1}|ξ0^n, θn = u) = [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ].
Exercise 3.8
In the premises of the exercise, by the normal correlation theorem, the distribution of θn given ξ0^n is normal with parameters mn = E(θn|ξ0^n) and γn = V(θn|ξ0^n) (the latter not depending on ξ0^n). We observe that (a.s.)
E(exp(i t^T ζ_{n+1}) | ξ0^n, θn) = exp( i t^T [ A_{n+1} θn ; a_{n+1} θn ] − (1/2) t^T [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ] t ),
and, because
E( exp( i t^T [ A_{n+1} θn ; a_{n+1} θn ] ) | ξ0^n ) = exp( i t^T [ A_{n+1} mn ; a_{n+1} mn ] − (1/2) t^T [ A_{n+1}γn A_{n+1}^T  A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T  a_{n+1}γn a_{n+1}^T ] t ),
we conclude that
E(exp(i t^T ζ_{n+1}) | ξ0^n) = exp( i t^T [ A_{n+1} mn ; a_{n+1} mn ] − (1/2) t^T [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ] t − (1/2) t^T [ A_{n+1}γn A_{n+1}^T  A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T  a_{n+1}γn a_{n+1}^T ] t ).
Exercise 3.9
Just apply the (conditional) normal correlation theorem.
3.7 Exercises
Exercise 3.12
Let Q be a q × p matrix with q > p of rank p.
1o . Show that P = Q(QT Q)−1 QT is a projector.
2o . On what subspace L does P project?
Exercise 3.13
Let (X, Y ) be a random vector with density
f (x, y) = C exp(−x2 + xy − y 2 /2).
1o . Show that (X, Y ) is a normal vector. Compute the expectation, the covariance matrix and
the characteristic function of (X, Y ). Compute the correlation coefficient ρXY of X and Y .
2o . What is the distribution of X? of Y ? of 2X − Y ?
3o . Show that X and Y − X are independent random variables with the same distribution.
Exercise 3.14
Let X ∼ N(0, 1) and let Z be a r.v. taking values −1 and 1 with probability 1/2 each. We suppose
that X and Z are independent, we set Y = ZX.
1o . Show that the distribution of Y is N (0, 1).
2o . Compute the covariance and the correlation of X and Y .
3o . Compute P (X + Y = 0).
4o . Is (X, Y ) a normal vector?
Exercise 3.15
Let ξ and η be independent r.v. with uniform distribution U[0, 1]. Prove that
X = √(−2 ln ξ) cos(2πη),   Y = √(−2 ln ξ) sin(2πη)
satisfy Z = (X, Y)^T ∼ N2(0, I).
Hint: let (X, Y) ∼ N2(0, I). Change to polar coordinates.
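A short numerical sketch of the transformation of Exercise 3.15 (sample size and seed are arbitrary):

```python
# Box-Muller: map two independent U[0,1] variables to two independent N(0,1) variables.
import numpy as np

rng = np.random.default_rng(7)
xi, eta = rng.random(500_000), rng.random(500_000)
X = np.sqrt(-2 * np.log(xi)) * np.cos(2 * np.pi * eta)
Y = np.sqrt(-2 * np.log(xi)) * np.sin(2 * np.pi * eta)
print(X.mean(), X.var(), Y.mean(), Y.var())   # approximately 0, 1, 0, 1
print(np.corrcoef(X, Y)[0, 1])                # approximately 0 (X and Y are independent)
```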
Exercise 3.16
Let Z = (Z1, Z2, Z3)^T be a normal vector with density f,
f(z1, z2, z3) = (1 / (4(2π)^{3/2})) exp( −(6z1² + 6z2² + 8z3² + 4z1z2) / 32 ).
1o. What is the distribution of (Z2, Z3) given Z1 = z1?
Let X and Y be random vectors defined by
X = [ 2  2  2 ; 0  2  5 ; 0  4  10 ; 1  2  4 ] Z   and   Y = [ 1  1  1 ; 1  0  0 ] Z.
2o. Is the vector (X, Y), of dimension 6, Gaussian? Does the vector X have a density? Does the vector Y have a density?
3o. Are the vectors X and Y independent?
4o. What are the distributions of the components of Z?
Exercise 3.17
Let (X, Y, Z)^T be a normal vector with zero mean and covariance matrix
Σ = [ 2  1  1 ; 1  2  1 ; 1  1  2 ].
1o . We set U = −X + Y + Z, V = X − Y + Z, W = X + Y − Z. What is the distribution of
the vector (U, V, W )T ?
2o . What is the density of the random variable T = U 2 + V 2 + W 2 ?
Exercise 3.18
Let the vector (X, Y) be normal N2(µ, Σ) with mean and covariance matrix:
µ = (0, 2)^T,   Σ = [ 4  1 ; 1  8 ].
1o. What is the distribution of X + 4Y?
2o. What is the joint distribution of Y − 2X and X + 4Y?
Exercise 3.19
Let X be a zero mean normal vector of dimension n with covariance matrix Σ > 0. What is the
distribution of the r.v. X T Σ−1 X?
Exercise 3.20
We model the height H of a male person in population P by the Gaussian distribution N (172, 49)
(units: cm). In this model:
1o . What is the probability for a man to be of height ≤160cm?
2o. We assume that there are approximately 15 million adult men in P; provide an estimate of the number of men of height ≥ 200 cm.
3o. What is the probability for 10 men randomly drawn from P to all have heights in the interval [168, 188] cm?
The height H′ of females of P is modeled by the Gaussian distribution N(162, 49) (units: cm).
4o. What is the probability for a male chosen at random to be taller than a randomly chosen female?
We model the heights (H, H′) of a man and a woman in a couple by a normal vector, the correlation coefficient ρ between the heights of a woman and a man being 0.4 (respectively, −0.4).
5o. Compute the probability p (respectively, p′) that the height of a man in a couple is larger than that of a woman (before making the computation, what would be your guess: in which order should one rank p and p′?).
Exercise 3.21
Let Y = (η1, ..., ηn)^T be a normal vector, Y ∼ Nn(µ, σ²I), let Hn−J be a subspace of R^n of dimension n − J, J > 0, and let Hn−J−M be a subspace of Hn−J of dimension n − J − M, M > 0.
We set
dJ = min_{y ∈ Hn−J} |Y − y|   and   dJ+M = min_{y ∈ Hn−J−M} |Y − y|.
Verify that
1. if µ ∈ Hn−J then the random variable dJ²/σ² follows the χ²_J distribution (with J degrees of freedom);
2. if, in addition, µ ∈ Hn−J−M, then
(J/M) (d²_{J+M} − d²_J) / d²_J ∼ F_{M,J}
(Fisher distribution with (M, J) degrees of freedom).
(Fisher distribution with (M, J) degrees of freedom).