Lecture 1
Basic probability refresher
1.1 Characterizations of random variables
Let (Ω, F, P ) be a probability space where Ω is a general set, F is a σ-algebra and P is a
probability measure on Ω. A random variable (r.v.) X is a measurable function X : (Ω, F) →
(R, B) where B is a Borel σ-algebra. We will also write X(ω) to stress the fact that it is a
function of ω ∈ Ω.
The cumulative distribution function (c.d.f.) of a random variable X is the function F : R → [0, 1],

F(x) = P(X ≤ x) = P(ω : X(ω) ≤ x).
F is monotone nondecreasing, right-continuous and such that F (−∞) = 0 and F (∞) = 1. We
also refer to F as the probability law (distribution) of X.
We distinguish 2 types of random variables: discrete variables and continuous variables.
Discrete variable. X takes values in a finite or countable set. A Poisson random variable X is an example of a discrete variable with a countable value set: for λ > 0 the distribution of X satisfies

Pλ(X = k) = (λ^k/k!) e^{−λ},   k = 0, 1, 2, ...
We will see in the sequel the importance of this law and how it is linked to the Poisson point process.
We denote X ∼ P(λ) and say that X is distributed according to the Poisson distribution with
parameter λ. The c.d.f. of X is
[Figure: the c.d.f. of the Poisson distribution, plotted for x from −1 to 6.]
The c.d.f. of a discrete random variable is a step function.
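As a quick numerical illustration (not part of the original notes), here is a minimal Python sketch that evaluates the Poisson c.d.f. on a grid; the jumps at k = 0, 1, 2, ... make the step structure visible. The function name and the choice λ = 1 are illustrative.

import math

def poisson_cdf(x, lam):
    # P(X <= x) for X ~ P(lam): sum of the pmf over k = 0, 1, ..., floor(x)
    if x < 0:
        return 0.0
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(int(math.floor(x)) + 1))

# The c.d.f. is constant between integers and jumps at each integer k.
for x in [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]:
    print(f"F({x:4.1f}) = {poisson_cdf(x, lam=1.0):.4f}")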
Continuous variable. X is a continuous variable if its distribution admits a density with
respect to the Lebesgue measure on R. In this case the c.d.f. F of X is differentiable almost
everywhere on R and its derivative
f(x) = F′(x)

is called the probability density of X. Note that f(x) ≥ 0 for all x ∈ R and

∫_{−∞}^{∞} f(x) dx = 1.
Example 1.1
a) Normal distribution N (µ, σ 2 ) with density
f(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)},   x ∈ R,
where µ ∈ R and σ > 0. If µ = 0, σ 2 = 1, the distribution N (0, 1) is referred to as standard
normal distribution.
b) Uniform distribution U [0, θ] with density
f(x) = (1/θ) I{x ∈ [0, θ]},   x ∈ R,

where θ > 0 and I{·} stands for the indicator function: for a set A,

I{x ∈ A} = 1 if x ∈ A, and 0 otherwise.
c) Exponential distribution E(λ) with density
f(x) = λe^{−λx} for x ≥ 0 and f(x) = 0 for x < 0,
where λ > 0. The c.d.f. of E(λ) is given by
F (x) = (1 − e−λx ) for x ≥ 0 and F (x) = 0 for x < 0.
Discrete distributions are entirely determined by the probabilities {P(X = k)}_k, while continuous distributions are determined by their density f(·). However, some scalar functionals of the distribution may be useful to characterize the behavior of the corresponding random variables. Examples of such functionals are the moments and the quantiles.
1.1.1 Moments of random variables
Mean (expectation) of a random variable X:

µ = E(X) = ∫_{−∞}^{∞} x dF(x) = Σ_i i P(X = i) in the discrete case, and ∫ x f(x) dx in the continuous case.

Moment of order k (k = 1, 2, ...):

µ_k = E(X^k) = ∫_{−∞}^{∞} x^k dF(x),

as well as the central moment of order k:

µ′_k = E((X − µ)^k) = ∫_{−∞}^{∞} (x − µ)^k dF(x).
A special case is the variance σ² (= µ′_2, the central moment of order 2):

σ² = Var(X) = E((X − E(X))²) = E(X²) − (E(X))².

The square root of the variance is called the standard deviation (s.d. or st.d.) of X: σ = √Var(X).
Absolute moment µ̄_k of order k:

µ̄_k = E(|X|^k),

and the central absolute moment of order k:

µ̄′_k = E(|X − µ|^k).
Clearly, these definitions assume the existence of the respective integrals, and not all distributions
possess moments.
Example 1.2
Let X be a random variable with probability density

f(x) = c/(1 + |x| log²|x|),   x ∈ R,

where the constant c > 0 is such that ∫ f = 1. Then E(|X|^a) = ∞ for all a > 0.
The mean is used to characterize the location (position) of a random variable. The variance
characterizes the scale (dispersion) of the distribution.
The normal distribution N (µ, σ 2 ) with mean µ and variance σ 2 :
[Figure: normal densities N(µ, σ²) with the same mean – “large” σ (large dispersion) vs. “small” σ (little dispersion).]
Let F be the c.d.f. of the random variable X with mean µ and variance σ². By an affine transformation we obtain the variable X₀ = (X − µ)/σ, such that E(X₀) = 0, E(X₀²) = 1 (the standardized variable). If F₀ is the c.d.f. of X₀ then F(x) = F₀((x − µ)/σ). In the continuous case, the density of X satisfies

f(x) = (1/σ) f₀((x − µ)/σ),

where f₀ is the density of X₀.
Note that it is not necessary to assume that the mean and the variance exist in order to define the standardized distribution F₀ and the representation F(x) = F₀((x − µ)/σ). Typically, this is done to underline that F depends on a location parameter µ and a scale parameter σ. E.g., for the family of Cauchy densities parameterized by µ and σ,

f(x) = 1/(πσ(1 + [(x − µ)/σ]²)),

the standardized density is f₀(x) = 1/(π(1 + x²)). Meanwhile, the expectation and the variance do not exist for the Cauchy distribution.
An interesting problem of calculus is related to the notion of moments µ_k: let F be a c.d.f. such that all its moments are finite. Given the sequence {µ_k}, k = 1, 2, ..., of moments of F, is it possible to recover F? The general answer to this question is negative. Nevertheless, there exist particular cases where the recovery is possible, namely, under the hypothesis that

lim sup_{k→∞} µ̄_k^{1/k}/k < ∞

(µ̄_k being the k-th absolute moment). This hypothesis holds true, for instance, for densities with bounded support. To the best of our knowledge, necessary and sufficient conditions for the existence of a solution to the problem of moments are currently unknown.
1.1.2 Probability quantiles
Let X be a random variable with continuous and strictly increasing c.d.f. F . The quantile of
order p, 0 < p < 1, of the distribution F is the solution qp of the equation
F (qp ) = p.
Observe that if F is strictly increasing and continuous, the solution exists and is unique, thus the quantile qp is well defined. If F has “flat zones” or is not continuous we can modify the definition, for instance, as follows:
Definition 1.1 Let F be a c.d.f. The quantile qp of order p of F is the value
qp = inf{q : F (q) ≥ p}.
The median M of the c.d.f. F is the quantile of order 1/2,
M = q1/2 .
Note that if F is continuous F (M ) = 1/2.
The quartiles are the quantiles q1/4 and q3/4 of order 1/4 and 3/4.
The l% percentiles of F are the quantiles qp of order p = l/100, 0 < l < 100.
We note that the median characterizes the location of the probability distribution, while the
difference q3/4 − q1/4 (referred to as the interquartile interval) can be interpreted as a characteristic of scale. These quantities are analogues of the mean µ and standard deviation σ. However,
unlike the mean and the standard deviation, the median and the interquartile interval are well
defined for all probability distributions.
1.1.3 Other characterizations
The mode. For a discrete distribution F, we call the mode of F the value k* such that

P(X = k*) = max_k P(X = k).

In the continuous case, the mode x* is defined as a local maximum of the density f:

f(x*) = max_x f(x).
A density f is said to be unimodal if x* is the unique local maximum of f (one can also speak of bi-modal or multi-modal densities). This characteristic is rather imprecise, because even when the density has a unique global maximum, we will call it multimodal if it has other local maxima. The mode is a characteristic of location which can be of interest in the case of a unimodal density.
[Figure: a skewed unimodal density with the mode, the median and the mean marked.]
Skewness and kurtosis
Definition 1.2 The distribution of X (the c.d.f. F) is said to be symmetric with respect to zero (or “simply” symmetric) if for all x ∈ R, F(x) = 1 − F(−x) (f(x) = f(−x) in the continuous case).
Definition 1.3 The distribution of X (the c.d.f. F) is called symmetric with respect to µ ∈ R if

F(x + µ) = 1 − F(µ − x)

(f(x + µ) = f(µ − x) in the continuous case).
In other words, the c.d.f. F(· + µ), i.e. the c.d.f. of X − µ, is symmetric (with respect to zero).
Exercise 1.1
a) Show that if F is symmetric with respect to µ, and E(|X|) < ∞, then E(X) = µ. Moreover, if F admits a unimodal density, then the mean = median = mode.
b) If F is symmetric and all absolute moments µ̄_k exist, then the moments µ_k = 0 for all odd k. If F is symmetric with respect to µ and all the moments µ̄_k exist, then µ′_k = 0 for all odd k (e.g., µ′_3 = 0).
We can qualify the “asymmetry” of distributions (for which E(|X|³) < ∞) using the skewness parameter

α = µ′_3/σ³.

Note that α = 0 for a symmetric c.d.f. such that E(|X|³) < ∞. Note that the converse is not true: the condition α = 0 does not imply that the distribution is symmetric.
Exercise 1.2
Provide an example of an asymmetric density with α = 0.
Observe the role of σ in the definition of α: suppose, for instance, that the density f₀(x) of X satisfies ∫ x f₀(x) dx = 0 and ∫ x² f₀(x) dx = 1, and set α₀ = µ′₃,₀ = ∫ x³ f₀(x) dx. For σ > 0, µ ∈ R, the function

f(x) = (1/σ) f₀((x − µ)/σ)

is the density of the random variable σX + µ, and thus Var(σX + µ) = σ² and µ′₃ = ∫ (x − µ)³ f(x) dx = σ³ α₀. When computing α = µ′₃/σ³ we see that α = α₀. Thus the skewness α is invariant with respect to affine transformations (of scale and position).
Note that one cannot say that α > 0 for distributions which are “asymmetric on the right”, or α < 0 for “asymmetric on the left” distributions. The notions of left or right asymmetry are not properly defined.
Kurtosis coefficient β is defined as follows: if the 4th central moment µ′₄ of X exists then

β = µ′₄/σ⁴ − 3.
Exercise 1.3
Show that µ′₄/σ⁴ = 3 and β = 0 for the normal distribution N(µ, σ²).
We note that, just like the asymmetry coefficient α, the kurtosis β is invariant with respect
to affine transformations.
The coefficient β is often used to roughly qualify the tails of the distribution of X. One uses the following vocabulary: a distribution F has “heavy tails” if

Q(b) = ∫_{|x|≥b} dF(x)  (= ∫_{|x|≥b} f(x) dx in the continuous case)

decreases slowly when b → ∞; for instance, polynomially (as 1/b^r where r > 0). Conversely, we say that F has “light tails” if Q(b) decreases fast (for example, exponentially).
We may use the following heuristics: if β > 0 we may consider that the distribution tails are heavier than those of the normal distribution (Q(b) = O(e^{−b²/2}) for N(0, 1)); such a distribution is said to be leptokurtic. If β < 0 (the distribution is then said to be platykurtic), we assume that its tails are lighter than those of the normal distribution (β = 0 for the normal distribution).
Note also that β ≥ −2 for all distributions (see the next section).
Example 1.3
a) The kurtosis β of the uniform distribution U [0, 1] is equal to −1.2 (ultra-light tails).
b) If f(x) ∼ |x|⁻⁵ when |x| tends to ∞, σ² is finite but µ′₄ = ∞, implying that β = ∞ (heavy tails).
1.2 Some useful inequalities
Proposition 1.1 (Markov inequality) Let h(·) be a positive nondecreasing function, and
E(h(X)) < ∞. Then for all a > 0 such that h(a) > 0,
P(X ≥ a) ≤ E(h(X))/h(a).    (1.1)
Proof : Let a > 0 be such that h(a) > 0. Since h(·) is a nondecreasing function,
P(X ≥ a) ≤ P(h(X) ≥ h(a)) = ∫ I{h(x) ≥ h(a)} dF(x) = E(I{h(X) ≥ h(a)}) ≤ E( (h(X)/h(a)) I{h(X) ≥ h(a)} ) ≤ E(h(X))/h(a).
Corollary 1.1 (Chebyshev inequality) Let X be a random variable such that E(X 2 ) < ∞.
Then for all a > 0
P(|X| ≥ a) ≤ E(X²)/a²,   P(|X − E(X)| ≥ a) ≤ Var(X)/a².

Proof: To show the first inequality it suffices to apply (1.1) with h(t) = t² to the variable Y = |X| (or to Y = |X − E(X)| for the second one).
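As a quick sanity check (not from the original notes), a minimal Python sketch comparing the Chebyshev bound Var(X)/a² with the empirical tail probability for a standard normal sample; the bound is valid but typically loose. The sample size and seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)    # X ~ N(0, 1), so Var(X) = 1

for a in (1.0, 2.0, 3.0):
    empirical = np.mean(np.abs(x - x.mean()) >= a)
    chebyshev = x.var() / a**2      # bound on P(|X - E(X)| >= a)
    print(f"a = {a}: empirical tail = {empirical:.4f}, Chebyshev bound = {chebyshev:.4f}")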
Proposition 1.2 (Hölder inequality) Let r > 1, with 1/r + 1/s = 1. Let ξ and η be two random variables such that E(|ξ|^r) < ∞ and E(|η|^s) < ∞. Then E(|ξη|) < ∞ and

E(|ξη|) ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s}.

Proof: We first note that for all a > 0, b > 0, by concavity of log t,

(1/r) log a + (1/s) log b ≤ log(a/r + b/s),

which is equivalent to

a^{1/r} b^{1/s} ≤ a/r + b/s.

Let us set a = |ξ|^r/E(|ξ|^r) and b = |η|^s/E(|η|^s) (we suppose for a moment that E(|ξ|^r) ≠ 0 and E(|η|^s) ≠ 0), which results in

|ξη| ≤ [E(|ξ|^r)]^{1/r} [E(|η|^s)]^{1/s} ( |ξ|^r/(r E(|ξ|^r)) + |η|^s/(s E(|η|^s)) ),

and we conclude by taking the expectation. If E(|ξ|^r) = 0 or E(|η|^s) = 0, then ξ = 0 (a.s.) or η = 0 (a.s.), and the inequality is trivial.
Corollary 1.2 (Lyapunov inequality) Let 0 < v < t and let X be a random variable such that E(|X|^t) < ∞. Then E(|X|^v) < ∞ and

[E(|X|^v)]^{1/v} ≤ [E(|X|^t)]^{1/t}.    (1.2)

To show the corollary it suffices to apply the Hölder inequality with ξ = X^v, η = 1, r = t/v.
Using the inequality (1.2) with v = 2, t = 4 and |X − E(X)| instead of |X| we get µ′₄/σ⁴ ≥ 1. Thus the kurtosis coefficient β satisfies the inequality β ≥ −2.
The Lyapunov inequality implies the chain of inequalities

E(|X|) ≤ [E(|X|²)]^{1/2} ≤ ... ≤ [E(|X|^k)]^{1/k}.
Corollary 1.3 (Cauchy-Schwarz inequality) Let ξ and η be two random variables such that
E(ξ²) < ∞ and E(η²) < ∞. Then E(|ξη|) < ∞ and

(E(|ξη|))² ≤ E(ξ²)E(η²).
(A particular case of the Hölder inequality with r = s = 2.)
Proposition 1.3 (Jensen inequality) Let g(·) be a convex function, X be a random variable
such that E(|X|) < ∞. Then
g(E(X)) ≤ E(g(X)).
Proof: By convexity of g, there exists a function g₁(·) such that for all x, x₀ ∈ R,

g(x) ≥ g(x₀) + (x − x₀) g₁(x₀).

We put x₀ = E(X). Then

g(X) ≥ g(E(X)) + (X − E(X)) g₁(E(X)).

Taking the expectation we obtain E(g(X)) ≥ g(E(X)).
We have the following simple example of application of the Jensen inequality:

|E(X)| ≤ E(|X|).    (1.3)
Proposition 1.4 (Cauchy-Schwarz inequality, a modification) Let ξ and η be two random variables such that E(ξ²) < ∞ and E(η²) < ∞. Then

(E(ξη))² ≤ E(ξ²)E(η²).    (1.4)

Moreover, the equality is attained if and only if (iff) there are a₁, a₂ ∈ R such that a₁ ≠ 0 or a₂ ≠ 0, and

a₁ξ + a₂η = 0 (a.s.)    (1.5)
Proof: The inequality (1.4) is a consequence of Corollary 1.3 and of (1.3). If (1.5) is true, the equality

(E(ξη))² − E(ξ²)E(η²) = 0    (1.6)

is obvious. On the other hand, if we have (1.6) and E(η²) ≠ 0, then E((ξ − aη)²) = 0 with a = E(ξη)/E(η²), which implies that ξ = aη a.s. The case E(η²) = 0 is trivial.
1.3 Sequences of random variables
Let ξ1 , ξ2 ..., and ξ be random variables (r.v.) on (Ω, F, P ).
Definition 1.4 The sequence (ξn) converges to a random variable ξ in probability (denoted ξn →^P ξ) when n → ∞ if

lim_{n→∞} P{|ξn − ξ| ≥ ε} = 0

for any ε > 0.
Definition 1.5 The sequence (ξn) converges to ξ in quadratic mean (or “in L²”) if E(ξ²) < ∞ and

lim_{n→∞} E(|ξn − ξ|²) = 0.
Definition 1.6 The sequence (ξn) converges to ξ almost surely (denoted ξn → ξ (a.s.), n → ∞) if

P{ω : ξn(ω) does not converge to ξ(ω)} = 0.

Remark. It can be shown that this definition is equivalent to the following one: for all ε > 0,

lim_{n→∞} P{sup_{k≥n} |ξk − ξ| ≥ ε} = 0.
Definition 1.7 The sequence (ξn) converges to a random variable ξ in distribution (we denote ξn →^D ξ, n → ∞) if

P{ξn ≤ t} → P{ξ ≤ t} as n → ∞

at all points of continuity of the c.d.f. F(t) = P{ξ ≤ t}.
Remark. The latter definition is equivalent to the convergence

E(f(ξn)) → E(f(ξ)) as n → ∞

for all continuous and bounded f (weak convergence).
Relationships between different types of convergence:

L²-convergence ⟹ convergence in probability ⟹ convergence in distribution,
a.s. convergence ⟹ convergence in probability.

Exercise 1.4
Let (ξn) and (ηn) be two sequences of r.v. Prove the following statements:
1°. If a ∈ R is a constant then

ξn →^D a  ⇔  ξn →^P a,  when n → ∞.

2°. (Slutsky's theorem) If ξn →^D a and ηn →^D η when n → ∞ and a ∈ R is a constant, then

ξn + ηn →^D a + η,  as n → ∞.

Show that if a is a general r.v., these two relations do not hold (construct a counterexample).
3°. Let ξn →^P a and let ηn →^D η when n → ∞, where a ∈ R is a constant and η is a random variable. Then

ξn ηn →^D aη,  as n → ∞.

Would this result continue to hold if we suppose that a is a general random variable?
1.4 Independence and limit theorems
Definition 1.8 Let X and Y be two random variables. The variable X is said to be independent of
Y if
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B)
for all A ∈ B and B ∈ B (Borel A and B), denoted X⊥⊥Y .
If E(|X|) < ∞, E(|Y |) < ∞ then the independence implies
E(XY ) = E(X)E(Y )
(the converse does not hold!).
Definition 1.9 Let X1 , ..., Xn be random variables, we say that X1 , ..., Xn are (mutually) independent if for all A1 , ..., An ∈ B
P (X1 ∈ A1 , ..., Xn ∈ An ) = P (X1 ∈ A1 ) · · · P (Xn ∈ An ).
Remark. The fact that Xi, i = 1, ..., n, are pairwise independent, i.e. Xi⊥⊥Xj for i ≠ j, does not imply that X1, ..., Xn are mutually independent. On the other hand, mutual independence implies
pairwise independence. In particular, if X1 , ..., Xn are independent and E(|Xi |) < ∞, i = 1, ..., n,
E(Xi Xj) = E(Xi)E(Xj),   i ≠ j.
1.4.1 Sums of independent random variables
Let us consider the sum Σ_{i=1}^n Xi, where X1, ..., Xn are independent. If E(Xi²) < ∞, i = 1, ..., n (by the Lyapunov inequality this implies E(|Xi|) < ∞), then

E(Σ_{i=1}^n Xi) = Σ_{i=1}^n E(Xi)  (true without the independence hypothesis)

and, moreover,

Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).
Definition 1.10 We say that the variables X1 , ..., Xn are i.i.d. (independent and identically
distributed) if they are mutually independent and Xi obeys the same distribution as Xj for all
1 ≤ i, j ≤ n.
Proposition 1.5 Let X1 , ..., Xn be i.i.d. r.v. such that E(X1 ) = µ and Var(X1 ) = σ 2 < ∞.
Then the arithmetic mean

X̄ = (1/n) Σ_{i=1}^n Xi

satisfies

E(X̄) = µ  and  Var(X̄) = (1/n) Var(X1) = σ²/n.
Proposition 1.6 (Kolmogorov’s strong law of large numbers) Let X1 , ..., Xn be i.i.d. r.v.
such that E(|X1 |) < ∞, and µ = E(X1 ). We have
X̄ → µ (a.s.) when n → ∞.
Counterexample. Let Xi be i.i.d. r.v. with the Cauchy distribution with density

f(x) = 1/(π(1 + x²)),   x ∈ R.

Then E(|X1|) = ∞, E(X1) is not defined and the mean X̄ does not converge (we observe that the Cauchy distribution has “heavy tails”).
Proposition 1.7 (Central Limit Theorem (CLT)) Let X1 , ..., Xn be i.i.d. r.v. such that
E(X12 ) < ∞ and σ 2 = Var(X1 ) > 0. Then
√n (X̄ − µ)/σ →^D η,  as n → ∞,

where µ = E(X1) and η ∼ N(0, 1).
1.4.2 Asymptotic approximations of probability distributions
The CLT (Proposition 1.7) can be rewritten in the equivalent form:

P( √n (X̄ − µ)/σ ≤ t ) → P(η ≤ t),  as n → ∞,

for all t ∈ R, where η ∼ N(0, 1). Let us denote by

Φ(t) = P(η ≤ t)

the standard normal c.d.f. Then

P(X̄ ≤ x) = P( √n (X̄ − µ)/σ ≤ √n (x − µ)/σ ) ≈ Φ( √n (x − µ)/σ )

when n → ∞. In other words, for sufficiently large n, the c.d.f. P(X̄ ≤ x) of X̄ can be approximated by the normal c.d.f.:

P(X̄ ≤ x) ≈ Φ( √n (x − µ)/σ ).
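As an illustration (not in the original notes), a small Python sketch comparing the empirical distribution of X̄ with the normal approximation above, for exponential samples; the sample size n = 30 and the other constants are illustrative assumptions. The approximation improves as n grows.

import numpy as np
from math import erf, sqrt

def Phi(t):
    # standard normal c.d.f.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

rng = np.random.default_rng(1)
n, n_rep, lam = 30, 20_000, 1.0       # sample size, Monte Carlo replications, rate of E(lam)
mu, sigma = 1.0 / lam, 1.0 / lam      # mean and s.d. of the exponential E(lam)

xbar = rng.exponential(scale=1.0 / lam, size=(n_rep, n)).mean(axis=1)

for x in (0.9, 1.0, 1.1):
    empirical = np.mean(xbar <= x)               # P(X̄ <= x) estimated by simulation
    approx = Phi(sqrt(n) * (x - mu) / sigma)     # normal approximation from the CLT
    print(f"x = {x}: empirical = {empirical:.3f}, CLT approximation = {approx:.3f}")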
1.5 Continuity theorems
Proposition 1.8 (The first continuity theorem) Let g(·) be a continuous function, and let
ξ1 , ξ2 , ... and ξ be random variables on (Ω, F, P ). Then
(i) ξn → ξ (a.s.)  ⇒  g(ξn) → g(ξ) (a.s.);
(ii) ξn →^P ξ  ⇒  g(ξn) →^P g(ξ);
(iii) ξn →^D ξ  ⇒  g(ξn) →^D g(ξ).
Proof: (i) is evident. We prove (ii) in the particular case where ξ = a (a is fixed, nonrandom), the only case of interest in the sequel. The continuity of g implies that for any ε > 0 there exists δ > 0 such that

|ξn − a| ≤ δ  ⇒  |g(ξn) − g(a)| < ε.

Since ξn →^P a as n → ∞, we have

lim_{n→∞} P(|ξn − a| < δ) = 1 for all δ > 0.

Thus

lim_{n→∞} P(|g(ξn) − g(a)| < ε) = 1 for any ε > 0.

(iii) It suffices to prove (see the comment after Definition 1.7) that for any continuous and bounded function h(x),

E(h(g(ξn))) → E(h(g(ξ))),  n → ∞.

Since g is continuous, f = h ∘ g is also continuous and bounded, and we arrive at (iii) because ξn →^D ξ implies that

E(f(ξn)) → E(f(ξ)),  n → ∞,

for any continuous and bounded function f.
Proposition 1.9 (Second continuity theorem) Suppose that g(·) is continuously differentiable, and let X1, ..., Xn be i.i.d. random variables such that E(X1²) < ∞ and σ² = Var(X1) > 0. Then

√n (g(X̄) − g(µ))/σ →^D η g′(µ),  n → ∞,

where X̄ = (1/n) Σ_{i=1}^n Xi, µ = E(X1), and η ∼ N(0, 1).
Proof: Under the premise of the proposition the function

h(x) = (g(x) − g(µ))/(x − µ) if x ≠ µ,  and  h(x) = g′(µ) if x = µ,

is continuous. Because X̄ →^P µ (due to Proposition 1.6) and h is continuous, we conclude, due to the first continuity theorem, that

h(X̄) →^P h(µ) = g′(µ),  n → ∞.    (1.7)

However,

√n (g(X̄) − g(µ))/σ = (√n/σ) h(X̄)(X̄ − µ) = h(X̄)ηn,

where ηn = (√n/σ)(X̄ − µ). Now Proposition 1.7 implies that ηn →^D η ∼ N(0, 1) when n → ∞. Using this fact along with (1.7) and the result 3° of Exercise 1.4 we obtain the desired statement.
1.6 Simulation of random variables
In applications we often need to “generate” (build) a computer simulated sequence X1, ..., Xn of i.i.d. random values following a given distribution F (we call it a sample). Of course, computer simulation only allows us to build pseudo-random variables (not the “true” random ones). That means that the simulated values X1, ..., Xn are deterministic – they are obtained by a deterministic algorithm – but the properties of the sequence X1, ..., Xn are “analogous” to those of a random i.i.d. sequence. For example, for the pseudo-random variables one has

sup_x |F̂n(x) − F(x)| → 0,  n → ∞,

where F̂n(x) = µn/n and µn is the number of X1, ..., Xn which satisfy Xk < x. We call F̂n(x) the empirical distribution function computed from the sequence X1, ..., Xn (here we consider deterministic convergence, cf. Exercise 1.14). The strong law of large numbers and the central limit theorem also hold for pseudo-random variables, etc.
1.6.1 Simulation of uniformly distributed random variables
A generation routine is available in (essentially) all programming languages. How does it work? The c.d.f. F(x) of the distribution U[0, 1] satisfies

F(x) = 0 for x < 0,  F(x) = x for x ∈ [0, 1],  F(x) = 1 for x > 1.
Congruential algorithm. We fix a real number a > 1 and an integer m (usually a and m are “very large” numbers). We start with a fixed value z0. For 1 ≤ i ≤ n we define

zi = the remainder of the division of a·z_{i−1} by m = a·z_{i−1} − [a·z_{i−1}/m]·m,

where [·] is the integer part. We always have 0 ≤ zi < m. Thus, if we set

Ui = zi/m = a·z_{i−1}/m − [a·z_{i−1}/m],

then 0 ≤ Ui < 1. The sequence U1, ..., Un is considered a sample from the uniform distribution U[0, 1]. Even if this is not a random sequence, the empirical c.d.f.

Fn^U(x) = (1/n) Σ_{i=1}^n I{Ui ≤ x}

satisfies sup_{0≤x≤1} |Fn^U(x) − x| ≤ ε(m), n → ∞, with ε(m) converging rapidly to 0 when m → ∞. A well developed mathematical theory allows one to justify “good” choices of z0, a and m. For instance, the following values are often used:

a = 16807 (= 7⁵),  m = 2147483647 (= 2³¹ − 1).
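To make the recursion concrete, here is a minimal Python sketch of such a congruential generator using the constants quoted above (a = 7⁵, m = 2³¹ − 1); this is an illustrative toy, not the generator actually used by any particular library, and the seed z0 is arbitrary.

def congruential_uniforms(n, z0=12345, a=16807, m=2147483647):
    # Generate n pseudo-uniform values U_i = z_i / m with z_i = a*z_{i-1} mod m.
    us = []
    z = z0
    for _ in range(n):
        z = (a * z) % m          # z_i = a*z_{i-1} - [a*z_{i-1}/m]*m
        us.append(z / m)         # U_i in [0, 1)
    return us

print(congruential_uniforms(5))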
[Figure: the step-function empirical c.d.f. of U1, ..., Un against the theoretical U[0, 1] c.d.f.]
Recently, new pseudo-random generators with improved properties have become available, and congruential generators are now rarely used.
1.6.2 Simulation of a general pseudo-random variable
Given an i.i.d. sample U1, ..., Un from the uniform distribution, we can obtain a sample from a general distribution F(·) using the inversion algorithm. It may be used when an explicit expression for F(·) is available. This technique is based on the following statement:
Proposition 1.10 Let F be a continuous and strictly monotone c.d.f., and let U be a random
variable uniformly distributed on [0, 1]. Then the c.d.f. of the r.v.
X = F −1 (U )
is exactly F (·).
Proof : We observe that
F (x) = P (U ≤ F (x)) = P (F −1 (U ) ≤ x) = P (X ≤ x).
Consider the following algorithm for simulating a sample X1, ..., Xn from the distribution F: if F(x) is continuous and strictly increasing, we take

Xi = F⁻¹(Ui),

where Ui are pseudo-random variables uniformly distributed on [0, 1], i = 1, ..., n. This way we get a sample from F.
If F is not continuous or strictly monotone, we need to modify the definition of the “inverse” F⁻¹. We set

F⁻¹(y) := sup{t : F(t) < y}.

Then,

P(Xi ≤ x) = P(sup{t : F(t) < Ui} ≤ x) = P(Ui ≤ F(x)) = F(x).
Example 1.4 Exponential distribution:

f(x) = e⁻ˣ I{x > 0},  F(x) = (1 − e⁻ˣ) I{x > 0}.

We compute F⁻¹(y) = − ln(1 − y) for y ∈ (0, 1), and take Xi = − ln(1 − Ui), i = 1, ..., n, where Ui ∼ U[0, 1].
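A minimal Python sketch of this inverse-transform step, written for the general rate λ of E(λ) (the function name, the use of numpy and the choice λ = 2 are illustrative assumptions, not part of the notes):

import numpy as np

def exponential_by_inversion(n, lam=1.0, rng=None):
    # Simulate n draws from E(lam) via X = -ln(1 - U)/lam with U ~ U[0, 1].
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=n)
    return -np.log(1.0 - u) / lam

x = exponential_by_inversion(100_000, lam=2.0)
print(x.mean())    # should be close to 1/lam = 0.5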
Example 1.5 Bernoulli distribution:

P(X = 1) = p,  P(X = 0) = 1 − p,  0 < p < 1.

We use the modified algorithm:

F⁻¹(y) = sup{t : F(t) < y} = 0 for y ∈ [0, 1 − p],  and 1 for y ∈ (1 − p, 1].

If Ui is a uniform r.v. then Xi = F⁻¹(Ui) is a Bernoulli r.v., and we have

Xi = 0 if Ui ∈ [0, 1 − p],  Xi = 1 if Ui ∈ (1 − p, 1].
Exercise 1.5
A r.v. Y takes the values 1, 3 and 4 with probabilities P(Y = 1) = 3/5, P(Y = 3) = 1/5 and P(Y = 4) = 1/5. How would you generate Y given a r.v. U ∼ U(0, 1)?
Exercise 1.6
Let U ∼ U (0, 1).
1. Explain how to simulate a dice with 6 faces given U .
2. Let Y = [6U + 1], where [a] is the integer part of a. What are possible values of Y and
the corresponding probabilities?
Simulating transformed variables. How do we simulate a sample Y1, ..., Yn from the distribution F((x − µ)/σ), given a sample X1, ..., Xn from F(·)? We suppose that σ > 0 and µ ∈ R. We should take Yi = σXi + µ, i = 1, ..., n.
1.6.3 Simulating normal N(0, 1) r.v.
Note that while the normal c.d.f. F is continuous and strictly increasing, the explicit expression
for F is not available. Thus, one can hardly apply the inversion algorithm. Nevertheless, there
are other techniques of simulating normal r.v. which are very efficient from the numerical point
of view.
Using the CLT. If U ∼ U[0, 1] then E(U) = 1/2 and Var(U) = 1/12. This implies by the Central Limit Theorem that

(U1 + ... + U_N − N/2)/√(N/12) →^D N(0, 1),  N → ∞,

for an i.i.d. sample U1, ..., U_N with uniform distribution on [0, 1] (N = 12 is usually sufficient to obtain a “good” approximation!). Thus, one can consider the following simulation algorithm: let U1, U2, ..., U_{nN} be a pseudo-random sequence from the uniform distribution U[0, 1]; we take

Xi = (U_{(i−1)N+1} + ... + U_{iN} − N/2)/√(N/12),  i = 1, ..., n.
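A minimal Python sketch of this sum-of-uniforms generator with N = 12 (so that √(N/12) = 1); illustrative only.

import numpy as np

def normals_by_clt(n, N=12, rng=None):
    # Approximate N(0,1) draws: sum N uniforms, subtract N/2, divide by sqrt(N/12).
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=(n, N))
    return (u.sum(axis=1) - N / 2.0) / np.sqrt(N / 12.0)

x = normals_by_clt(100_000)
print(x.mean(), x.var())    # close to 0 and 1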
Box–Müller algorithm. The algorithm is based on the following result:
Proposition 1.11 Let ξ and η be independent U[0, 1] random variables. Then the r.v.

X = √(−2 ln ξ) cos(2πη)  and  Y = √(−2 ln ξ) sin(2πη)

are standard normal and independent.
We prove this statement in Lecture 3.
This relation provides us with an efficient simulation technique: let U1, ..., U_{2n} be i.i.d. r.v. with U1 ∼ U[0, 1]. We set

X_{2i−1} = √(−2 ln U_{2i}) cos(2πU_{2i−1}),
X_{2i} = √(−2 ln U_{2i}) sin(2πU_{2i−1}),

for i = 1, ..., n.
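A minimal Python sketch of the Box–Müller transform as described above; the pairing of the uniforms follows the formulas in the notes, and the helper name is an illustrative assumption.

import numpy as np

def normals_box_muller(n, rng=None):
    # Generate 2n standard normal values from 2n uniforms via the Box-Muller transform.
    rng = rng or np.random.default_rng()
    u_odd = rng.uniform(size=n)     # plays the role of U_{2i-1} (the angle)
    u_even = rng.uniform(size=n)    # plays the role of U_{2i} (the radius)
    r = np.sqrt(-2.0 * np.log(1.0 - u_even))   # 1 - U is also U[0,1]; avoids log(0)
    x1 = r * np.cos(2.0 * np.pi * u_odd)
    x2 = r * np.sin(2.0 * np.pi * u_odd)
    return np.concatenate([x1, x2])

x = normals_box_muller(50_000)
print(x.mean(), x.std())    # close to 0 and 1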
1.7 Exercises
Exercise 1.7
Suppose two balanced dice are rolled. Find the joint probability distribution of X and Y if:
1. X is the maximum of the obtained values and Y is their sum;
2. X is the value of the first die and Y is the maximum of the two;
3. X and Y are, respectively, the smallest and the largest value.
Exercise 1.8
Suppose that X and Y are two independent Bernoulli B(1/2) random variables. Let U = X + Y and V = |X − Y|.
1. What are the joint probability distribution and the marginal probability distributions of U and V, and the conditional distributions of U given V = 0 and V = 1?
2. Are the r.v. U and V independent?
Exercise 1.9
Let ξ1 , ..., ξn be independent r.v., and let
ξmin = min(ξ1 , ..., ξn ),
ξmax = max(ξ1 , ..., ξn ).
1) Show that

P(ξ_min ≥ x) = Π_{i=1}^n P(ξi ≥ x),   P(ξ_max < x) = Π_{i=1}^n P(ξi < x).

2) Suppose, furthermore, that ξ1, ..., ξn are identically distributed with the uniform distribution U[0, a]. Compute E(ξ_min), E(ξ_max), Var(ξ_min) and Var(ξ_max).
Exercise 1.10
Let ξ1, ..., ξn be independent Bernoulli r.v. with

P(ξi = 0) = 1 − λi∆,  P(ξi = 1) = λi∆,

where λi > 0 and ∆ > 0 is small. Show that

P(Σ_{i=1}^n ξi = 1) = (Σ_{i=1}^n λi) ∆ + O(∆²),   P(Σ_{i=1}^n ξi > 1) = O(∆²).

Exercise 1.11
1) Prove that inf_{−∞<a<∞} E((ξ − a)²) is attained at a = E(ξ), and so

inf_{−∞<a<∞} E((ξ − a)²) = Var(ξ).

2) Let ξ be a nonnegative r.v. with c.d.f. F and finite expectation. Prove that

E(ξ) = ∫₀^∞ (1 − F(x)) dx.

3) Show, using the result of 2), that if M is the median of the c.d.f. F of ξ,

inf_{−∞<a<∞} E(|ξ − a|) = E(|ξ − M|).
Exercise 1.12
Let X1 and X2 be two independent r.v. with the exponential distribution E(λ). Show that
min(X1 , X2 ) and |X1 − X2 | are r.v. with distributions, respectively, E(2λ) and E(λ).
Exercise 1.13
Let X be the number of “6”s in 12000 independent rolls of a die. Using the Central Limit Theorem, estimate the probability that 1800 < X ≤ 2100 (Φ(√6) ≈ 0.9928, Φ(2√6) ≈ 0.999999518). Compare this approximation to that obtained using the Chebyshev inequality.
Exercise 1.14
Suppose that the r.v. ξ1, ..., ξn are mutually independent and identically distributed with c.d.f. F. For x ∈ R, let us define the random variable F̂n(x) = µn/n, where µn is the number of ξ1, ..., ξn which satisfy ξk < x. Show that for any x,

F̂n(x) →^P F(x)

(the function F̂n(x) is called the empirical distribution function).
Exercise 1.15
[Monte Carlo method] We want to compute the integral I = ∫₀¹ f(x) dx. Let X be a U[0, 1] random variable; then

E(f(X)) = ∫₀¹ f(x) dx = I.

Let X1, ..., Xn be i.i.d. r.v. uniformly distributed on [0, 1]. Let us consider the quantity

f̄n = (1/n) Σ_{i=1}^n f(Xi)

and let us suppose that σ² = Var(f(X)) < ∞. Prove that E(f̄n) = I and f̄n →^P I as n → ∞. Estimate P(|f̄n − I| < ε) using the CLT.
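A minimal Python sketch of this Monte Carlo estimator for a concrete choice f(x) = x² (so that I = 1/3); the function, seed and sample size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return x**2            # I = integral of x^2 over [0, 1] = 1/3

n = 100_000
x = rng.uniform(size=n)
fbar = f(x).mean()                         # Monte Carlo estimate of I
stderr = f(x).std(ddof=1) / np.sqrt(n)     # CLT-based standard error of the estimate
print(f"estimate = {fbar:.5f}, exact = {1/3:.5f}, std. error = {stderr:.5f}")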
Exercise 1.16
Weibull distributions are often used in survival and reliability analysis. An example of a distribution from this family is given by the c.d.f.

F(x) = 0 for x < 0,  F(x) = 1 − e^{−5x²} for x ≥ 0.

Explain how to generate a r.v. Z ∼ F given a uniform r.v. U.
Exercise 1.17
Write down an algorithm for simulating a Poisson r.v. by inversion.
Hint: there is no simple expression for the Poisson c.d.f. and the set of values is infinite. However, the Poisson c.d.f. can easily be computed recursively. Observe that if X is a Poisson r.v.,

P(X = k) = e^{−λ} λ^k/k! = (λ/k) P(X = k − 1).
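One possible Python sketch of the recursion suggested by the hint (an illustrative solution outline, not the notes' official answer): accumulate the c.d.f. term by term until it first exceeds U.

import numpy as np

def poisson_by_inversion(lam, rng=None):
    # Return one Poisson(lam) draw: the smallest k with F(k) >= U.
    rng = rng or np.random.default_rng()
    u = rng.uniform()
    k, p = 0, np.exp(-lam)     # p = P(X = 0)
    cdf = p
    while cdf < u:
        k += 1
        p *= lam / k           # P(X = k) = (lam/k) * P(X = k-1)
        cdf += p
    return k

draws = [poisson_by_inversion(3.0) for _ in range(10_000)]
print(sum(draws) / len(draws))    # close to lam = 3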
Lecture 2
Regression and correlation
2.1 Couples of random variables. Joint and marginal distributions
Let (X, Y) be a couple of r.v. The joint c.d.f. of (X, Y) is given by

F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y),   x, y ∈ R.

The marginal c.d.f.'s are given by

F_X(x) = F_{X,Y}(x, ∞) = P(X ≤ x);   F_Y(y) = F_{X,Y}(∞, y) = P(Y ≤ y).

In the continuous case we suppose that the derivative

∂²F_{X,Y}(x, y)/∂x∂y = f_{X,Y}(x, y)    (2.1)

exists a.e. The function f_{X,Y}(x, y) is called the density of F_{X,Y}(x, y).
Marginal densities f_X and f_Y are defined according to

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,   f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.
In the discrete case X and Y take values in a finite or countable set. The joint distribution of the couple X, Y is defined by the probabilities {P(X = k, Y = m)}_{k,m}. The marginal laws are defined by the probabilities

P(X = k) = Σ_m P(X = k, Y = m),   P(Y = m) = Σ_k P(X = k, Y = m).

If X and Y are independent then

F_{X,Y}(x, y) = F_X(x)F_Y(y) for all (x, y) ∈ R².
The converse is also true. In the continuous case the independence is equivalent to the decomposition

f_{X,Y}(x, y) = f_X(x)f_Y(y) for all (x, y) ∈ R²,
and in the discrete case,
P (X = k, Y = m) = P (X = k)P (Y = m).
2.2 Conditioning (discrete case)
Let A and B be two random events (A, B ∈ F) such that P(B) ≠ 0. The conditional probability P(A|B) of A given B is defined as

P(A|B) = P(AB)/P(B).
Let X and Y be two discrete r.v. According to this definition,

P(Y = k|X = m) = P(Y = k, X = m)/P(X = m),

for all k, m such that P(X = m) ≠ 0. We suppose that P(X = m) ≠ 0 for all admissible m. Then

Σ_k P(Y = k|X = m) = Σ_k P(Y = k, X = m)/P(X = m) = 1.

As a result, the probabilities {P(Y = k|X = m)}_k define a discrete probability distribution. If X and Y are independent,

P(Y = k|X = m) = P(Y = k)P(X = m)/P(X = m) = P(Y = k).    (2.2)
The conditional expectation of Y given that X = m is the numerical function of m given by

E(Y|X = m) = Σ_k k P(Y = k|X = m).
The conditional variance is defined by
Var(Y |X = m) = E(Y 2 |X = m) − [E(Y |X = m)]2 .
In a similar way we define the conditional moments, the conditional quantiles and other characteristics of the conditional distribution.
Definition 2.1 The conditional expectation E(Y|X) of Y given X, where X and Y are discrete r.v. such that E(|Y|) < ∞, is the discrete random variable which depends only on X and takes the values

{E(Y|X = m)}_m

with probabilities P(X = m).
It is important not to confuse the random variable E(Y |X) with the (deterministic) numeric
function E(Y |X = m) (function of m).
Note that the condition E(|Y|) < ∞ guarantees the existence of the conditional expectation E(Y|X).
2.2.1 Properties of the conditional expectation (discrete case)
1°. (Linearity) Let E(|Y1|) < ∞, E(|Y2|) < ∞; then for all a ∈ R, b ∈ R,

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)

2°. If X and Y are independent and E(|Y|) < ∞, then E(Y|X) = E(Y) (a.s.) (cf. (2.2)).
3°. E(h(X)|X) = h(X) (a.s.) for all Borel h.
4°. (Substitution theorem.) If E(|h(Y, X)|) < ∞ then

E(h(Y, X)|X = m) = E(h(Y, m)|X = m).
Proof: Let Y′ = h(Y, X); this is a discrete r.v. taking values h(k, m). Thus, the conditional distribution of Y′ given X is given by the probabilities

P(Y′ = a|X = m) = P(h(Y, X) = a|X = m) = P(h(Y, X) = a, X = m)/P(X = m)
= P(h(Y, m) = a, X = m)/P(X = m) = P(h(Y, m) = a|X = m).

Therefore, for all m,

E(Y′|X = m) = Σ_a a P(Y′ = a|X = m) = Σ_a a P(h(Y, m) = a|X = m) = E(h(Y, m)|X = m).

As a result, if h(x, y) = h1(y)h2(x), we have

E(h1(Y)h2(X)|X = m) = h2(m)E(h1(Y)|X = m),

and

E(h1(Y)h2(X)|X) = h2(X)E(h1(Y)|X) (a.s.)
5o . (Double expectation theorem) Let E(|Y |) < ∞, then E(E(Y |X)) = E(Y ).
Proof: We write

E(E(Y|X)) = Σ_m E(Y|X = m)P(X = m) = Σ_m Σ_k k P(Y = k|X = m)P(X = m)
= Σ_{m,k} k P(Y = k, X = m) = Σ_k k Σ_m P(Y = k, X = m) = Σ_k k P(Y = k) = E(Y).
Example 2.1 Let ξ and η be two independent Bernoulli r.v. taking values 1 and 0 with probabilities, respectively, p and 1 − p. What are the conditional expectations E(ξ + η|η) and E(η|ξ + η)?
Using the properties 2° and 3° we obtain

E(ξ + η|η) = E(ξ) + η = p + η.

Furthermore, by the definition, for k = 0, 1, 2,

E(η|ξ + η = k) = 1 · P(η = 1|ξ + η = k) = 0 for k = 0,  1/2 for k = 1,  1 for k = 2.

Thus, E(η|ξ + η) = (ξ + η)/2 (a.s.).

2.3 Conditioning as a projection
Conditioning as a projection
Let us consider the set of random variables ξ on (Ω, F, P ) such that E(ξ 2 ) < ∞. We will say
that ξ ∼ ξ 0 (is equivalent) if ξ = ξ 0 (a.s.) with respect to the measure P . This relation defines a
family of equivalence classes over random variables ξ such that E(ξ 2 ) < ∞.
Definition 2.2 We denote L2 (P ) the space of (equivalence classes of ) square-integrable r.v. ξ
(E(ξ 2 ) < ∞).
The space L2 (P ) we have just defined is a Hilbert space equipped with the scalar product
hX, Y i = E(XY ),
X, Y ∈ L2 (P ),
and the corresponding norm kXk = [E(X 2 )]1/2 , X ∈ L2 (P ).
Indeed, h·, ·i verifies all the axioms of the scalar product: for all X, ξ, η ∈ L2 (P ) and a, b ∈ R
haξ + bη, Xi = E([aξ + bη]X) = aE(ξX) + bE(ηX) = ahξ, Xi + bhη, Xi,
and hX, Xi ≥ 0; hX, Xi = 0 implies X = 0 (a.s.).
2.3.1 The best prediction
If the r.v. X and Y are independent, the knowledge of X does not supply any information about Y. However, when X and Y are dependent and we know the realization of X, it does provide some information about Y. We define the problem of the best prediction of Y given X as follows:
Let Y ∈ L₂(P) and let X be a r.v. on (Ω, F, P). Find a Borel measurable g(·) such that

‖Y − g(X)‖ = min_{h(·)} ‖Y − h(X)‖,    (2.3)

where the minimum is taken over all Borel measurable functions h(·) and ‖·‖ is the norm of L₂(P). The random variable Ŷ = g(X) is referred to as the best prediction of Y given X.
We use the following (statistical) vocabulary: X is the explanatory variable or predictor, Y is the explained variable.
We can write (2.3) in the equivalent form:

E((Y − g(X))²) = min_{h(·)} E((Y − h(X))²) = min_{h(X)∈L₂^X(P)} E((Y − h(X))²).

It suffices to consider the case h(X) ∈ L₂(P), because the solution g(·) to (2.3) is “automatically” in L₂(P).
We can consider (2.3) as the problem of orthogonal projection of Y on the linear subspace L₂^X(P) of L₂(P) defined as

L₂^X(P) = {ξ = h(X) : E(h²(X)) < ∞}.

By the properties of the orthogonal projection, g(X) ∈ L₂^X(P) is the solution to (2.3) if and only if

⟨Y − g(X), h(X)⟩ = 0 for all h(X) ∈ L₂^X(P),    (2.4)

[Figure: orthogonal projection of Y onto the subspace L₂^X(P), with g(X) the foot of the projection.]

and the orthogonal projection g(X) is unique (a.s.). Using notation with expectations instead, (2.4) can be equivalently rewritten as

E((Y − g(X))h(X)) = 0 for all h(X) ∈ L₂^X(P),

or

E(Y h(X)) = E(g(X)h(X)) for all h(X) ∈ L₂^X(P).    (2.5)

In particular,

E(Y I{X ∈ A}) = E(g(X)I{X ∈ A}) for all A ∈ B (all Borel measurable sets).    (2.6)

Remark. Indeed, (2.6) implies (2.5), thus (2.5) and (2.6) are equivalent – recall that all functions in L₂ can be approximated by sums of step functions Σ_i c_i I{x ∈ A_i} (piecewise-constant functions).
Let us show that in the discrete case the only r.v. g(X) which verifies (2.5) (and (2.6)),
and thus solves the problem of the best prediction (2.3), is the conditional expectation of Y
given X.
Proposition 2.1 Let X and Y be discrete r.v. with Y ∈ L₂(P). Then the best prediction Ŷ of Y given X is unique (a.s.) and given by

Ŷ = g(X) = E(Y|X).

Proof:

E(E(Y|X)h(X)) = Σ_k E(Y|X = k)h(k)P(X = k)
= Σ_k [ Σ_m m P(Y = m|X = k) ] h(k)P(X = k)
= Σ_{k,m} m h(k)P(Y = m, X = k) = E(Y h(X)).

Thus (2.5) is verified with g(X) = E(Y|X). Due to the (a.s.) uniqueness of the orthogonal projection, the best prediction is also unique (a.s.).
2.4 Probability and conditional expectation (the general case)
We extend the definition of the conditional expectation E(Y |X) to the general case of 2 r.v. X
and Y . We use the following definition:
Definition 2.3 Let Y and X be r.v. such that E(|Y|) < ∞. The conditional expectation g(X) = E(Y|X) is a r.v., measurable with respect to X, which verifies

E(Y I{X ∈ A}) = E(g(X)I{X ∈ A})    (2.7)

for all Borel sets A.
Remark. Here we replace the condition Y ∈ L2 (P ) (≡ E(Y 2 ) < ∞) with a weaker condition
E(|Y |) < ∞. One can show (see the course of Probability Theory) that the function g(X) which
verifies (2.7) exists and is unique (a.s.) (a consequence of the Radon-Nikodym theorem).
If Y ∈ L2 (P ), the existence and the a.s. uniqueness of the function g(X) verifying (2.7), as
we have already seen, is a consequence of the properties of the orthogonal projection in L2 .
Theorem 2.1 (Best prediction) Let X and Y be 2 r.v., Y ∈ L2 (P ). Then the best prediction
of Y given X is unique (a.s.) and coincides with
Ŷ = g(X) = E(Y|X).
2.4.1 Conditional probability
Let us consider the following special case: we replace Y with Y′ = I{Y ∈ B}. Note that the r.v. Y′ is bounded (|Y′| ≤ 1), and thus E(|Y′|²) < ∞. We can define the conditional expectation g(X) = E(Y′|X) by the relationship (cf. (2.7))

E(I{Y ∈ B}I{X ∈ A}) = E(g(X)I{X ∈ A}) for all A, B ∈ B.

Definition 2.4 The conditional probability P(Y ∈ B|X) is a random variable which verifies

P(Y ∈ B, X ∈ A) = E[P(Y ∈ B|X)I{X ∈ A}] for all A ∈ B.
As in the discrete case, we also define a numeric function.
Definition 2.5 The function of two variables P(Y ∈ B|X = x), B ∈ B (a Borel set) and x ∈ R, is referred to as the conditional probability of Y given X = x if
(i) for all fixed B, P(Y ∈ B|X = x) verifies

P(Y ∈ B, X ∈ A) = ∫_A P(Y ∈ B|X = x) dF_X(x);    (2.8)

(ii) for all fixed x, P(Y ∈ B|X = x) defines a probability distribution as a function of B.
Remark. We already know that for all B ∈ B there is a function

g_B(x) = P(Y ∈ B|X = x)

such that (i) is valid. However, this function is defined only up to its values on a set N_B of zero measure. It is important to note that, in general, this set depends on B. Therefore, it may happen that N = ∪_{B∈B} N_B is of positive measure. This could do serious damage – for example, the additivity of the probability measure could be destroyed, etc. Fortunately, in our case (real r.v. and Borel σ-algebra) there is a result due to Kolmogorov which says that one can always choose a version of the function g_B(·) such that P(Y ∈ B|X = x) is a probability measure for all fixed x ∈ R. We will suppose in the sequel that this version is chosen in every particular case.
We can also define a real-valued function of x:

E(Y|X = x) = ∫ y P(dy|X = x),

such that

E(Y I{X ∈ A}) = ∫_A E(Y|X = x) dF_X(x) for all A ∈ B.
2.4.2 Properties of conditional expectation, general case
1°. (Linearity.) Let E(|Y1|) < ∞, E(|Y2|) < ∞; then

E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X) (a.s.)
2°. If X and Y are independent and E(|Y|) < ∞, then E(Y|X) = E(Y) (a.s.). In view of the definition (2.7) it suffices to prove that

E(Y I{X ∈ A}) = E(E(Y)I{X ∈ A}) for all A ∈ B.    (2.9)

But

E(E(Y)I{X ∈ A}) = E(Y)P(X ∈ A),

and so (2.9) is a consequence of the independence of X and Y.
3°. E(h(X)|X) = h(X) (a.s.) for all Borel functions h.
4°. (Substitution theorem.) If E(|h(Y, X)|) < ∞, then

E(h(Y, X)|X = x) = E(h(Y, x)|X = x).

5°. (Double expectation theorem)

E(E(Y|X)) = E(Y).
Proof : Let us set A = R in the definition (2.7), then I(X ∈ A) = 1, and we obtain the desired
result.
2.5 Conditioning: continuous case
We suppose that there exists a joint density f_{X,Y}(x, y) of the couple (X, Y). Let us set

f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x) if f_X(x) > 0,  and 0 if f_X(x) = 0.

Proposition 2.2 If the joint density of (X, Y) exists then

P(Y ∈ B|X = x) = ∫_B f_{Y|X}(y|x) dy for all B ∈ B.
Proof: It suffices to show (cf. (2.8)) that for all A, B ∈ B,

P(Y ∈ B, X ∈ A) = ∫_A ( ∫_B f_{Y|X}(y|x) dy ) dF_X(x).

Since X has a density, dF_X(x) = f_X(x)dx. By the Fubini theorem,

∫_A ∫_B f_{Y|X}(y|x) dy f_X(x) dx = ∫_B ∫_A f_{Y|X}(y|x) f_X(x) dx dy.

But f_{Y|X}(y|x)f_X(x) = f_{X,Y}(x, y) a.e. (if f_X(x) = 0, then f_{X,Y}(x, y) = 0 a fortiori). Therefore, the last integral is equal to

∫_B ∫_A f_{X,Y}(x, y) dx dy = P(X ∈ A, Y ∈ B).
The result of Proposition 2.2 provides a direct way to compute the conditional expectation:
Corollary 2.1
1. E(Y|X = x) = ∫ y f_{Y|X}(y|x) dy,
2. ∫_{−∞}^{∞} f_{Y|X}(y|x) dy = 1,
3. Y⊥⊥X ⇒ f_{Y|X}(y|x) = f_Y(y).
We can define, as in the discrete case, the conditional variance:

Var(Y|X = x) = E(Y²|X = x) − (E(Y|X = x))²
= ∫_{−∞}^{∞} y² f_{Y|X}(y|x) dy − ( ∫_{−∞}^{∞} y f_{Y|X}(y|x) dy )².
Example 2.2 Let X and Y be i.i.d. r.v. with exponential distribution. Let us compute the conditional density f(x|z) = f_{X|X+Y}(x|z) and E(X|X + Y).
Let f(u) = λe^{−λu} I{u > 0} be the density of X and Y. If z < x,

P(X + Y < z, X < x) = P(X + Y < z, X < z) = ∫₀^z f(u) ( ∫₀^{z−u} f(v) dv ) du,

and if z ≥ x,

P(X + Y < z, X < x) = ∫₀^x f(u) ( ∫₀^{z−u} f(v) dv ) du.

As a result, for z ≥ x the joint density of the couple (X + Y, X) is (cf. (2.1))

f(z, x) = ∂²P(X + Y < z, X < x)/∂x∂z = f(z − x)f(x) = λ²e^{−λz}.

Besides, the density of X + Y is the convolution of two exponential densities, i.e.

f_{X+Y}(z) = λ²ze^{−λz}.

We obtain

f_{X|X+Y}(x|z) = f(z, x)/f_{X+Y}(z) = 1/z

for 0 ≤ x ≤ z, and f_{X|X+Y}(x|z) = 0 for x > z. In other words, the conditional density is the density of the uniform distribution on [0, z]. Thus we conclude that E(X|X + Y) = (X + Y)/2 (a.s.).
This example is related to the model of requests arriving at a service system. Let X be the instant when the 1st request arrives (the instant t = 0 is the instant when request zero arrives), and let Y be the interval of time between the arrivals of the 1st and the 2nd requests. We are looking for the probability density of the instant of the 1st arrival, given that the second request arrived at time z.
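A small Python sketch (illustrative, not from the notes) checking this conclusion by simulation: conditioning on X + Y falling in a narrow bin around z, the conditional mean of X should be close to z/2 and the spread of X should match that of a uniform variable on [0, z]. The bin width and sample size are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
lam, n = 1.0, 1_000_000
x = rng.exponential(1.0 / lam, n)
y = rng.exponential(1.0 / lam, n)
z_target, half_width = 2.0, 0.02

mask = np.abs((x + y) - z_target) < half_width    # condition on X + Y close to z_target
print("E(X | X+Y ~ 2) ~", x[mask].mean(), " (theory: 1.0)")
print("std of X given X+Y ~ 2 ~", x[mask].std(), " (uniform[0,2] theory:", 2 / np.sqrt(12), ")")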
2.6 Covariance and correlation
Let X and Y be square-integrable r.v., i.e. E(X²) < ∞ and E(Y²) < ∞. We denote

σ²_X = Var(X),  σ²_Y = Var(Y).

Definition 2.6 The covariance of X and Y is the value

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).

If Cov(X, Y) = 0, we say that X and Y are orthogonal, and we denote X ⊥ Y.
Definition 2.7 Let σ²_X > 0 and σ²_Y > 0. The correlation between X and Y is the value

Corr(X, Y) = ρ_{XY} = Cov(X, Y)/(σ_X σ_Y).

2.6.1 Properties of covariance and correlation
The relationships below are immediate consequences of Definition 2.6.
1. Cov(X, X) = Var(X).
2. Cov(aX, bY ) = abCov(X, Y ), a, b ∈ R.
3. Cov(X + a, Y ) = Cov(X, Y ), a ∈ R.
4. Cov(X, Y ) = Cov(Y, X).
5. Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(Y, X).
Indeed,
Var(X + Y ) = E((X + Y )2 ) − (E(X) + E(Y ))2
= E(X 2 ) + E(Y 2 ) + 2E(XY ) − E 2 (X) − E 2 (Y ) − 2E(X)E(Y ).
6. If X and Y are independent, Cov(X, Y ) = 0.
Important note: the converse is not true. For instance, if X ∼ N(0, 1) and Y = X², then
Cov(X, Y ) = E(X 3 ) − E(X)E(X 2 ) = E(X 3 ) = 0
(recall that N (0, 1) is symmetric with respect to 0).
Let us now consider the properties of correlation:
1. −1 ≤ ρ_{XY} ≤ 1 (by the Cauchy-Schwarz inequality):

|Cov(X, Y)| = |E((X − E(X))(Y − E(Y)))| ≤ √(E((X − E(X))²)) √(E((Y − E(Y))²)) = σ_X σ_Y.

2. If X and Y are independent, ρ_{XY} = 0.
3. |ρ_{XY}| = 1 if and only if X and Y are linearly dependent: there exist a ≠ 0, b ∈ R such that Y = aX + b.
Proof: Note that |ρ_{XY}| = 1 iff the equality is attained in the Cauchy-Schwarz inequality. By Proposition 1.4, this is only possible if there are α, β ∈ R such that

α(X − E(X)) + β(Y − E(Y)) = 0 (a.s.),

and α ≠ 0 or β ≠ 0. This is equivalent to the existence of α, β and γ ∈ R such that

αX + βY + γ = 0 (a.s.),

with α ≠ 0 or β ≠ 0. If α ≠ 0 and β ≠ 0 one has

Y = −(α/β)X − γ/β,   X = −(β/α)Y − γ/α.

The case with α = 0 or β = 0 is impossible, because this would mean that one of the variables Y or X is constant (a.s.), but we have assumed that σ_X and σ_Y are positive.
Observe that if Y = aX + b, a, b ∈ R, a ≠ 0, then

σ²_Y = E((Y − E(Y))²) = a²E((X − E(X))²) = a²σ²_X,

and the covariance is

Cov(X, Y) = E((X − E(X)) a (X − E(X))) = aσ²_X,

so that ρ_{XY} = aσ²_X/(σ_X |a|σ_X) = a/|a|. We say that the correlation between X and Y is positive if ρ_{XY} > 0 and negative if ρ_{XY} < 0. The correlation above is thus positive (= 1) if a > 0 and negative (= −1) if a < 0.
Geometric interpretation of the correlation. Let ⟨·, ·⟩ be the scalar product and ‖·‖ the norm of L₂(P). Then

Cov(X, Y) = ⟨X − E(X), Y − E(Y)⟩

and

ρ_{XY} = ⟨X − E(X), Y − E(Y)⟩ / (‖X − E(X)‖ ‖Y − E(Y)‖).

In other words, ρ_{XY} is the “cosine of the angle” between X − E(X) and Y − E(Y). Thus, ρ_{XY} = ±1 means that X − E(X) and Y − E(Y) are collinear: Y − E(Y) = a(X − E(X)) for some a ≠ 0.
2.7 Regression
Definition 2.8 Let X and Y be two r.v. such that E(|Y|) < ∞. The function g : R → R defined by

g(x) = E(Y|X = x)

is called the regression of Y on X (of Y in X).
We also refer to this regression as simple (the word means that X and Y are univariate). If X
or Y are multi-dimensional, we refer to the regression as multiple.
Geometric interpretation. Let us recall the construction of Section 2.3. Suppose that Y is an element of the Hilbert space L₂(P) (i.e. E(Y²) < ∞) and let, as before, L₂^X(P) be the linear subspace of L₂(P) of all functions h(X) which are measurable with respect to X and such that E(h²(X)) < ∞. Then g(X) is the orthogonal projection of Y on L₂^X(P).

[Figure: Y projected orthogonally onto the subspace L₂^X(P), with E(Y|X) the projection.]

The r.v. ξ = Y − g(X) is referred to as the stochastic error (or residual). We have

Y = g(X) + ξ.    (2.10)

By definition of the conditional expectation, E(ξ|X) = 0 (a.s.), and so E(ξ) = 0.
Example 2.3 Let the joint density of X and Y be

f(x, y) = (x + y)I{0 < x < 1, 0 < y < 1}.

What is the regression function g(x) = E(Y|X = x)?
We use Corollary 2.1:

f_{Y|X}(y|x) = f(x, y)/f_X(x),  where  f_X(x) = ∫₀¹ f(x, y) dy = (x + 1/2)I{0 < x < 1}.

We conclude that

f_{Y|X}(y|x) = ((x + y)/(x + 1/2)) I{0 < x < 1, 0 < y < 1},

and

g(x) = E(Y|X = x) = ∫₀¹ y f_{Y|X}(y|x) dy = ∫₀¹ y(x + y)/(x + 1/2) dy = (x/2 + 1/3)/(x + 1/2)

for 0 < x < 1. Observe that g(x) is a nonlinear function of x.
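A small Python sketch (illustrative, not from the notes) checking this formula by simulation: draw (X, Y) from the density f(x, y) = x + y on the unit square by rejection sampling and compare the conditional mean of Y near a given x with g(x). The sample size, bin width and the point x = 0.3 are arbitrary choices.

import numpy as np

rng = np.random.default_rng(4)

def sample_xy(n):
    # Rejection sampling from f(x, y) = x + y on (0,1)^2, whose maximum is 2.
    xs, ys = [], []
    while len(xs) < n:
        x = rng.uniform(size=n)
        y = rng.uniform(size=n)
        u = rng.uniform(size=n)
        keep = u * 2.0 <= x + y       # accept with probability (x + y)/2
        xs.extend(x[keep]); ys.extend(y[keep])
    return np.array(xs[:n]), np.array(ys[:n])

x, y = sample_xy(200_000)
x0 = 0.3
mask = np.abs(x - x0) < 0.02                   # points with X close to x0
g_exact = (x0 / 2 + 1 / 3) / (x0 + 1 / 2)      # regression function from the example
print("simulated E(Y | X ~ 0.3) =", y[mask].mean(), " exact g(0.3) =", g_exact)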
2.7.1 Residual variance
The quadratic (mean square) error of the approximation of Y by g(X) is the value

∆ = E((Y − g(X))²) = E((Y − E(Y|X))²) = E(ξ²) = Var(ξ).
We call ∆ the residual variance. The residual variance is smaller than the variance of Y. Indeed, let h(X) = E(Y) = const. By the best prediction theorem,

∆ = E((Y − g(X))²) ≤ E((Y − h(X))²) = E((Y − E(Y))²) = Var(Y).

Because E(Y) is an element of L₂^X(P), this means, geometrically, that a leg is smaller than the hypotenuse:

[Figure: right triangle in L₂^X(P) with vertices Y, E(Y|X) and E(Y); the line L of constants contains E(Y).]

Observe that the space L of “constant r.v.” is also a linear subspace of L₂(P). Moreover, it is exactly the intersection of the subspaces L₂^X(P) over all X. But we already know that E(Y) is the projection of Y on L: indeed, for any constant a,

E((Y − a)²) ≥ E((Y − E(Y))²)

(cf. Exercise 1.11).
By the Pythagoras theorem,

‖Y − E(Y)‖² = ‖E(Y|X) − E(Y)‖² + ‖Y − E(Y|X)‖²,

or

Var(Y) = E((Y − E(Y))²) = E((E(Y|X) − E(Y))²) + E((Y − E(Y|X))²)
= Var(E(Y|X)) + E(Var(Y|X))
= “variance explained by X” + “residual variance”
= Var(g(X)) + Var(ξ)
= Var(g(X)) + ∆.
Definition 2.9 Let Var(Y) > 0. We call the correlation ratio of Y to X the nonnegative value η² = η²_{Y|X} given by

η²_{Y|X} = E((E(Y) − E(Y|X))²)/Var(Y) = Var(g(X))/Var(Y).

Note that, by the Pythagoras theorem,

η²_{Y|X} = 1 − E((Y − g(X))²)/Var(Y) = 1 − ∆/Var(Y).
Geometric interpretation. The correlation ratio η²_{Y|X} is the squared cosine of the angle θ between Y − E(Y) and E(Y|X) − E(Y), thus 0 ≤ η²_{Y|X} ≤ 1.
Remarks.
1. Generally, η²_{X|Y} ≠ η²_{Y|X} (absence of symmetry).
2. The values η² = 0 and η² = 1 are special: η² = 1 implies that E((Y − E(Y|X))²) = 0, thus Y = g(X) (a.s.); in other words, Y is a function of X. On the other hand, η² = 0 means that E((E(Y) − E(Y|X))²) = 0, and E(Y|X) = E(Y) (a.s.), so the regression is constant.
It is useful to note that g(X) = const implies the orthogonality of X and Y (i.e. Cov(X, Y) = 0).
Proposition 2.3 Let E(X²) < ∞, E(Y²) < ∞ and σ²_X > 0, σ²_Y > 0. Then

η²_{Y|X} ≥ ρ²_{XY}.
Proof: By the definition of η²_{Y|X}, it suffices to show that

E((E(Y) − E(Y|X))²) Var(X) ≥ [E((X − E(X))(Y − E(Y)))]².

Yet, by the double expectation theorem,

E((X − E(X))(Y − E(Y))) = E((X − E(X))E(Y − E(Y)|X)) = E((X − E(X))(E(Y|X) − E(Y))).

Now, by applying the Cauchy-Schwarz inequality we arrive at

[E((X − E(X))(Y − E(Y)))]² ≤ E((X − E(X))²) E((E(Y|X) − E(Y))²) = Var(X) E((E(Y|X) − E(Y))²).    (2.11)
Remarks.
• η²_{Y|X} = 0 implies that ρ_{XY} = 0.
• The residual variance can be expressed in terms of the correlation ratio:

∆ = (1 − η²_{Y|X})Var(Y).    (2.12)

2.8 Linear regression
The particular case E(Y|X = x) = a + bx is called linear regression. Using (2.10), we can write
Y = a + bX + ξ
where ξ is the residual, E(ξ|X) = 0 (a.s.) (⇒ E(ξ) = 0).
Let ρ = ρ_{XY} be the correlation coefficient between X and Y, and let σ_X > 0, σ_Y > 0 be the standard deviations of X and Y. One can express the coefficients a and b of the linear regression in terms of ρ, σ_X and σ_Y. Indeed,

Y − E(Y) = b(X − E(X)) + ξ.

Multiplying this equation by X − E(X) and taking the expectation, we obtain

Cov(X, Y) = b Var(X) = bσ²_X,

so that

b = Cov(X, Y)/σ²_X = ρ σ_Y/σ_X.

Then,

Y = a + ρ (σ_Y/σ_X) X + ξ.

On the other hand,

E(Y) = a + ρ (σ_Y/σ_X) E(X),

and so

a = E(Y) − ρ (σ_Y/σ_X) E(X).

Finally,

Y = E(Y) + ρ (σ_Y/σ_X)(X − E(X)) + ξ.    (2.13)
Proposition 2.4 If E(X²) < ∞ and E(Y²) < ∞, Var(X) = σ²_X > 0, Var(Y) = σ²_Y > 0, and the regression function g(x) = E(Y|X = x) is linear, then it may be written in the form

E(Y|X = x) = E(Y) + ρ (σ_Y/σ_X)(x − E(X)).    (2.14)

The residual variance is

∆ = (1 − ρ²)σ²_Y,    (2.15)

where ρ is the correlation coefficient between X and Y.
Proof: The equality (2.14) is an immediate consequence of (2.13) along with the fact that E(ξ|X = x) = 0. Let us prove (2.15). We can write (2.13) in the form

ξ = (Y − E(Y)) − ρ (σ_Y/σ_X)(X − E(X)).

Taking the square and the expectation on both sides, we come to

∆ = E(ξ²) = E[ (Y − E(Y))² − 2ρ (σ_Y/σ_X)(X − E(X))(Y − E(Y)) + ρ² (σ_Y/σ_X)² (X − E(X))² ]
= ρ² (σ²_Y/σ²_X) Var(X) − 2ρ (σ_Y/σ_X) Cov(X, Y) + Var(Y) = (1 − ρ²)σ²_Y.
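A short Python sketch (with an illustrative, assumed model Y = 2 + 3X + noise) checking numerically that the slope Cov(X, Y)/σ²_X equals ρ σ_Y/σ_X and that the residual variance is (1 − ρ²)σ²_Y.

import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(1.0, 2.0, n)                    # X with mean 1 and s.d. 2
y = 2.0 + 3.0 * x + rng.normal(0.0, 4.0, n)    # linear regression plus noise of s.d. 4

rho = np.corrcoef(x, y)[0, 1]
b_cov = np.cov(x, y)[0, 1] / x.var()           # Cov(X, Y)/Var(X)
b_rho = rho * y.std() / x.std()                # rho * sigma_Y / sigma_X
resid_var = (y - (y.mean() + b_cov * (x - x.mean()))).var()
print(b_cov, b_rho)                            # both close to 3
print(resid_var, (1 - rho**2) * y.var())       # both close to 16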
Corollary 2.2 If the regression of Y on X is linear, then under the premise of Proposition 2.4 we have

η²_{Y|X} = ρ²_{XY}.

In other words, in the case of linear regression, the correlation ratio coincides with the squared correlation between X and Y. (In particular, this implies that ρ_{XY} = 0 ⇔ η²_{Y|X} = 0, and then η²_{Y|X} = η²_{X|Y} = 0.)
The converse is also true: if ρ²_{XY} = η²_{Y|X}, then the regression is linear.
Proof: Due to (2.12) one has ∆ = (1 − η²_{Y|X})Var(Y), but in the linear case, moreover, ∆ = (1 − ρ²)Var(Y) due to (2.15). To show the converse, note that if the equality is attained in the Cauchy-Schwarz inequality (2.11), then there exists α ≠ 0 such that

α(X − E(X)) = E(Y|X) − E(Y),

and thus

E(Y|X) = E(Y) + α(X − E(X)).

Remark. The fact that the regression of Y on X is linear does not, in general, imply that the regression of X on Y is linear too.
Exercise 2.1
Let X and Z be two independent r.v. with exponential distributions, X ∼ E(λ), Z ∼ E(1). Let Y = X + Z. Compute the regression function g(y) = E(X|Y = y).
2.9 Exercises
Exercise 2.2
Suppose that the joint distribution of X and Y is given by
F(x, y) = 1 − e^{−2x} − e^{−y} + e^{−(2x+y)} if x > 0, y > 0, and 0 otherwise.

1. Find the marginal distributions of X and Y.
2. Find the joint density of X and Y .
3. Compute the marginal densities of X and Y , conditional density of X given Y = y.
4. Are X and Y independent?
Exercise 2.3
Consider the joint density function of X and Y given by:
f(x, y) = (6/7)(x² + xy/2),  0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

1. Verify that f is a joint density.
2. Find the density of X and the conditional density f_{Y|X}(y|x).
3. Compute P(Y > 1/2 | X < 1/2).
Exercise 2.4
The joint density of X and Y is given by:

f(x, y) = e^{−(x+y)},  0 ≤ x < ∞, 0 ≤ y < ∞.
Compute
1. P (X < Y );
2. P (X < a).
Exercise 2.5
Two points are chosen at random on opposite sides of the middle point of an interval of length L. In other words, the two points X and Y are independent random variables such that X
is uniformly distributed over [0, L/2[, and Y is uniformly distributed over [L/2, L]. Find the
probability that the distance |X − Y | is larger than L/3.
Exercise 2.6
Let U1 and U2 be two independent r.v., both uniformly distributed on [0, a]. Let V = min{U1, U2} and Z = max{U1, U2}. Show that the joint c.d.f. F of V and Z is given by

F(s, t) = P(V ≤ s, Z ≤ t) = (t² − (t − s)²)/a²  for 0 ≤ s ≤ t ≤ a.

Hint: note that V ≤ s and Z ≤ t iff both U1 ≤ t and U2 ≤ t, but not both s < U1 ≤ t and s < U2 ≤ t.
Exercise 2.7
Given two independent r.v. X1 and X2 with exponential distributions with parameters λ1 and λ2,
find the distribution of Z = X1 /X2 . Compute P (X1 < X2 ).
Exercise 2.8
Let X and Y be i.i.d. r.v. Use the definition of the conditional expectation to show that E(X|X + Y) = E(Y|X + Y) (a.s.), and thus E(X|X + Y) = E(Y|X + Y) = (X + Y)/2 (a.s.).
Exercise 2.9
Let X, Y1 and Y2 be independent r.v., let Y1 and Y2 be normal N(0, 1), and let

Z = (Y1 + XY2)/√(1 + X²).

Using the conditional distribution P(Z < u|X = x), show that Z ∼ N(0, 1).
Exercise 2.10
Let X and Y be 2 square integrable r.v. on (Ω, F, P ). Prove that
Var(Y ) = E(Var(Y |X)) + Var(E(Y |X)).
Exercise 2.11
Let X1, ..., Xn be independent r.v. such that Xi ∼ P(λi) (Poisson distribution with parameter λi, i.e. P(Xi = k) = e^{−λi} λi^k / k!).
1o. Find the distribution of X = Σ_{i=1}^n Xi.
2o. Show that the conditional distribution of (X1, ..., Xn) given X = r is multinomial M(r, p1, ..., pn) (you will compute the corresponding parameters).
Recall that r.v. (X1, ..., Xk), integer valued in {0, ..., r}, have multinomial distribution M(r, p1, ..., pk) if
P(X1 = n1, ..., Xk = nk) = (r! / (n1! ... nk!)) p1^{n1} ... pk^{nk},
with Σ_{i=1}^k ni = r. This is the distribution of (X1, ..., Xk) where
Xi = “number of Y's which are equal to i”
in r independent trials Y1, ..., Yr with probabilities P(Y1 = i) = pi, i = 1, ..., k. Note that if k = 2,
P(X1 = n1, X2 = r − n1) = P(X1 = n1),
and the distribution is denoted M(r, p) = B(r, p).
3o . Compute E(X1 |X1 + X2 ).
4o. Show that if Xn is binomially distributed, Xn ∼ B(n, λ/n), then for all k, P(Xn = k) tends to e^{−λ} λ^k / k! when n → ∞.
Recall that the binomial distribution describes the number X of wins in n independent Bernoulli trials,
P(X = k) = C_n^k p^k (1 − p)^{n−k}.
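The following short numerical sketch (not part of the exercise) illustrates point 4o: it compares the B(n, λ/n) probabilities with the Poisson(λ) probabilities for a few values of n; the value λ = 3 and the range of k are arbitrary choices.

```python
# Sketch: compare the B(n, lambda/n) pmf with the Poisson(lambda) pmf as n grows,
# illustrating point 4o above.
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    # maximal absolute difference of the two pmf's over k = 0, ..., 20
    diff = max(abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam))
               for k in range(21))
    print(f"n = {n:5d}:  max_k |P(X_n = k) - e^(-lam) lam^k / k!| = {diff:.2e}")
```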
Exercise 2.12
Show that
1. Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z),
2. Cov(Σ_{i=1}^n Xi, Σ_{j=1}^n Yj) = Σ_{i=1}^n Σ_{j=1}^n Cov(Xi, Yj).
3. Prove that if Var(Xi ) = σ 2 and Cov(Xi , Xj ) = γ for all 1 ≤ i, j ≤ n then
Var(X1 + ... + Xn ) = nσ 2 + n(n − 1)γ.
4. Let ξ1 and ξ2 be i.i.d. random variables such that 0 < Var(ξ1 ) < ∞. Show that r.v.
η1 = ξ1 − ξ2 and η2 = ξ1 + ξ2 are uncorrelated.
Exercise 2.13
Let X be the number of “1” and Y the number of “2” in n throws of a fair (balanced) die. Compute Cov(X, Y).
Before computing this quantity, can you predict whether Cov(X, Y) ≥ 0 or Cov(X, Y) < 0?
Hint: use the relationship 2) of Exercise 2.12.
Exercise 2.14
1o . Let ξ and η be r.v. with E(ξ) = E(η) = 0, Var(ξ) = Var(η) = 1 and the correlation
coefficient ρ. Show that
E(max(ξ², η²)) ≤ 1 + √(1 − ρ²).
Hint: observe that
max(ξ², η²) = (|ξ² + η²| + |ξ² − η²|) / 2.
2o. Let ρ be the correlation of η and ξ. Show the inequality
P(|ξ − E(ξ)| ≥ √Var(ξ) or |η − E(η)| ≥ √Var(η)) ≤ (1 + √(1 − ρ²)) / 2.
Exercise 2.15
Let (X, Y ) be a random vector of dimension 2. Suppose that Y ∼ N (m, τ 2 ) and that the
distribution of X given Y = y is N (y, σ 2 ).
1o . What is the distribution of Y given X = x?
2o . What is the distribution of X?
3o. What is the distribution of E(Y|X)?
Exercise 2.16
Let X and N be r.v. such that N is valued in {1, 2, ...} and E(|X|) < ∞, E(N) < ∞. Consider the sequence X1, X2, ... of independent r.v. with the same distribution as X. Show the Wald identity: if N is independent of the Xi then
E(Σ_{i=1}^N Xi) = E(N) E(X).
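A small Monte Carlo sketch of the Wald identity (illustration only; the choices N ∼ Geometric(0.3) and Xi ∼ E(2) are arbitrary):

```python
# Monte Carlo check of E(sum_{i=1}^N X_i) = E(N) E(X) for N independent of the X_i.
import numpy as np

rng = np.random.default_rng(0)
n_rep = 200_000
N = rng.geometric(p=0.3, size=n_rep)      # values in {1, 2, ...}, E(N) = 1/0.3
# the sum of N i.i.d. Exp(rate 2) variables has a Gamma(N, scale 1/2) distribution
S = rng.gamma(shape=N, scale=0.5)
print(S.mean())                           # empirical E(sum X_i)
print((1 / 0.3) * 0.5)                    # E(N) * E(X)
```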
Exercise 2.17
Suppose that the salary of an individual satisfies Y* = Xb + σε, where σ > 0, b ∈ R, X is a r.v. with bounded second order moments corresponding to the abilities of the individual, and ε is a standard normal variable, ε ∼ N(0, 1), independent of X. If Y* is larger than the SMIC value S, the received salary Y is Y*; otherwise it is equal to S. Compute E(Y|X). Is this expectation linear?
Exercise 2.18
Show that if φ is a characteristic function of some r.v. then φ∗ , |φ|2 and Re(φ), are also
characteristic functions (of certain r.v.).
Hint: for Re(φ) consider 2 independent random variables X and Y , where Y takes values −1
and 1 with probabilities 1/2, X has characteristic function φ, then compute the characteristic
function of XY .
Lecture 3
Random vectors. Multivariate normal distribution
3.1 Random vectors
Let X = (ξ1, ..., ξp)^T be a random vector¹, where ξ1, ..., ξp are random (univariate) variables. In the same way we produce random matrices: Ξ = (ξij), 1 ≤ i ≤ p, 1 ≤ j ≤ q, where the ξij are (univariate) r.v.. The cumulative distribution function of the random vector X is
F(t) = P(ξ1 ≤ t1, ..., ξp ≤ tp),   t = (t1, ..., tp)^T ∈ R^p.
If F(t) is differentiable with respect to t, the density of X (the joint density of ξ1, ..., ξp) exists and is equal to the mixed derivative
f(t) = f(t1, ..., tp) = ∂^p F(t) / (∂t1 ... ∂tp).
In this case
F(t) = ∫_{−∞}^{t1} ... ∫_{−∞}^{tp} f(u1, ..., up) du1 ... dup.
3.1.1 Properties of a multivariate density
We have: f(t) ≥ 0 and ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t1, ..., tp) dt1 ... dtp = 1. The marginal density of ξ1, ..., ξk, k < p, is (we use the symbol f(·) as a generic notation for densities)
f(t1, ..., tk) = ∫_{−∞}^{∞} ... ∫_{−∞}^{∞} f(t1, ..., tp) dtk+1 ... dtp.
¹ By convention, vector X ∈ R^{p×1} is a column vector.
Note that
two different random vectors may have the same marginal distributions.
Example 3.1 Consider the densities
f1(t1, t2) = 1, and f2(t1, t2) = 1 + (2t1 − 1)(2t2 − 1),  0 < t1, t2 < 1.
In both cases, f(t1) = ∫_0^1 f(t1, t2) dt2 = 1.
As in the case p = 2, the conditional density of ξ1, ..., ξk given ξk+1, ..., ξp is
f(t1, ..., tk | tk+1, ..., tp) = f(t1, ..., tp) / f(tk+1, ..., tp).
If X1 and X2 are two random vectors, then
f_{X2|X1}(x2|x1) = f(x1, x2) / f(x1).
Independence. Suppose that two random vectors X1 and X2 have a joint density f (x1 , x2 ).
They are independent iff
f (x1 , x2 ) = f1 (x1 )f2 (x2 ),
where f1 and f2 are probability densities. In other words, the conditional density fX2 |X1 (x2 |x1 )
does not depend on x1 . The independence is preserved by measurable transformations of vectors
X1 and X2 .
3.1.2 Moments of random vectors
Vector µ = (µ1, ..., µp)^T ∈ R^p is the mean (expectation) of the random vector X = (ξ1, ..., ξp)^T if
µj = E(ξj) = ∫ ... ∫ tj f(t1, ..., tp) dt1 ... dtp,  j = 1, ..., p
(we suppose that the corresponding integrals exist); we write µ = E(X). In the same way we define the expectation of a random matrix. As in the scalar case, the expectation is a linear functional: for any matrix A ∈ R^{q×p} and b ∈ R^q,
E(AX + b) = AE(X) + b = Aµ + b.
This property is still valid for random matrices: if Ξ is a p × q random matrix, A ∈ R^{q×p}, then E(AΞ) = AE(Ξ).
Covariance matrix Σ of the random vector X is given by
Σ = V(X) := E((X − µ)(X − µ)^T) = (σij),
a p × p matrix, where
σij = E((ξi − µi)(ξj − µj)) = ∫ ... ∫ (ti − µi)(tj − µj) f(t1, ..., tp) dt1 ... dtp
(we note that in this case the σij are not always positive). Because σij = σji, Σ is a symmetric matrix. We can also define the covariance matrix of random vectors X (p × 1) and Y (q × 1):
C(X, Y) = E((X − E(X))(Y − E(Y))^T),  C ∈ R^{p×q}.
The covariance matrix possesses the following properties:
1o . Σ = E(XX T ) − µµT , where µ = E(X).
2o . For any a ∈ Rp , Var(aT X) = aT V (X)a.
Proof : Observe that by linearity of the expectation,
Var(a^T X) = E((a^T X − E(a^T X))²) = E((a^T(X − E(X)))²) = E(a^T(X − µ)(X − µ)^T a) = a^T E((X − µ)(X − µ)^T) a = a^T V(X) a.
Thus, we have Var(a^T X) ≥ 0, implying that the matrix V(X) is positive semidefinite; we write V(X) ⪰ 0.
3o. Σ ⪰ 0.
4o. Let A be a q × p matrix and b ∈ R^q. Then V(AX + b) = A V(X) A^T.
Proof : Let Y = AX + b; then by linearity of the expectation,
ν = E(Y) = E(AX + b) = Aµ + b  and  Y − E(Y) = A(X − µ).
Now we have:
V(Y) = E(A(X − µ)(X − µ)^T A^T) = A V(X) A^T (linearity again).
5o. C(X, X) = V(X). In this case C = C^T ⪰ 0 (positive semidefinite matrix).
6o. C(X, Y) = C(Y, X)^T.
7o. C(X1 + X2, Y) = C(X1, Y) + C(X2, Y).
8o. If X and Y are two p-dimensional random vectors,
V(X + Y) = V(X) + C(X, Y) + C(Y, X) + V(Y) = V(X) + C(X, Y) + C(X, Y)^T + V(Y).
9o. If X ⊥⊥ Y, then C(X, Y) = 0 (null matrix); the converse is not true. This can be proved exactly the same way as in the case of the covariance of r.v..
Correlation matrix P of X is given by P = (ρij), 1 ≤ i, j ≤ p, where
ρij = σij / (√σii √σjj).
We note that the diagonal entries ρii = 1, i = 1, ..., p. If ∆ is the diagonal matrix with ∆ii = √σii, then P = ∆^{-1} Σ ∆^{-1}, and the positivity of Σ implies the positivity of P, i.e. P ⪰ 0.
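A short numerical sketch of the properties above (sample versions of Σ and P for an affine transform of a standard normal vector; the matrix A, the vector b and the sample size are arbitrary choices):

```python
# Sample covariance/correlation of X = A Y + b, illustrating properties 2o-4o
# and the relation P = Delta^{-1} Sigma Delta^{-1}.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
Y = rng.standard_normal((100_000, 2))            # V(Y) = I
X = Y @ A.T + b                                  # X = A Y + b, row-wise
Sigma = np.cov(X, rowvar=False)                  # close to A A^T = A V(Y) A^T
D_inv = np.linalg.inv(np.diag(np.sqrt(np.diag(Sigma))))
P = D_inv @ Sigma @ D_inv                        # correlation matrix, unit diagonal
a = np.array([1.0, -1.0])
print(Sigma, A @ A.T)                            # V(AX + b) = A V(Y) A^T
print(a @ Sigma @ a, np.var(X @ a))              # Var(a^T X) = a^T Sigma a >= 0
print(P)
```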
3.1.3 Characteristic function of a random vector
Definition 3.1 Let X ∈ R^p be a random vector. Its characteristic function for t ∈ R^p is given by
φX(t) = E(exp(i t^T X)).
Exercise 3.1
Two random vectors X ∈ R^p and Y ∈ R^q are independent iff the characteristic function φZ(u) of the vector Z = (X^T, Y^T)^T can be represented, for any u = (a^T, b^T)^T, a ∈ R^p and b ∈ R^q, as
φZ(u) = φX(a) φY(b).
Verify this characterization in the continuous case.
3.1.4 Transformations of random vectors
Let h = (h1, ..., hp)^T be a transformation, i.e. a function from R^p to R^p,
h(t1, ..., tp) = (h1(t1, ..., tp), ..., hp(t1, ..., tp))^T,  t = (t1, ..., tp)^T ∈ R^p.
The Jacobian of the transformation is defined by
Jh(t) = Det( (∂hi/∂tj)(t) )_{i,j}.
Proposition 3.1 (Calculus reminder) Suppose that
(i) partial derivatives of hi (·) are continuous on Rp , i = 1, ..., p,
(ii) h is a bijection,
(iii) Jh(t) ≠ 0 for any t ∈ R^p.
Then, for any function f(t) such that
∫_{R^p} |f(t)| dt < ∞
and any Borel set K ⊆ R^p we have
∫_K f(t) dt = ∫_{h^{-1}(K)} f(h(u)) |Jh(u)| du.
Remark: by the inverse function theorem, under the conditions of Proposition 3.1, the inverse function g(·) = h^{-1}(·) exists everywhere on R^p and
J_{h^{-1}}(h(u)) = 1 / Jh(u),  same as  J_{h^{-1}}(t) = 1 / Jh(h^{-1}(t)).
We conclude that h satisfies the conditions (i) − (iii) of Proposition 3.1 iff g = h−1 satisfies the
same conditions.
We have the following corollary of Proposition 3.1:
Proposition 3.2 Let Y be a random vector with the density fY(t), t ∈ R^p. Let g : R^p → R^p
be a transformation satisfying the premise of Proposition 3.1. Then, the density of the random
vector X = g(Y ) exists and is given by
fX (u) = fY (h(u))|Jh (u)|, for any u ∈ Rp ,
where h = g −1 .
Proof : Let X = (ξ1 , ..., ξp )T , v = (v1 , ..., vp )T , and Av = {t ∈ Rp : gi (t) ≤ vi , i = 1, ..., p}.
Then, by Proposition 3.1 with h = g^{-1} and f = fY, the c.d.f. of X is
FX(v) = P(ξi ≤ vi, i = 1, ..., p) = P(gi(Y) ≤ vi, i = 1, ..., p)
= ∫_{Av} fY(t) dt = ∫_{g(Av)} fY(h(u)) |Jh(u)| du.
On the other hand,
g(Av) = {u = g(t) ∈ R^p : t ∈ Av} = {u = g(t) ∈ R^p : gi(t) ≤ vi, i = 1, ..., p}
= {u = (u1, ..., up)^T ∈ R^p : ui ≤ vi, i = 1, ..., p}.
We conclude that
FX(v) = ∫_{−∞}^{v1} ... ∫_{−∞}^{vp} fY(h(u)) |Jh(u)| du
for any v = (v1, ..., vp)^T ∈ R^p. This implies that the density of X is fY(h(u)) |Jh(u)|.
Corollary 3.1 If X = AY + b where Y is a random vector on Rp with the density fY and A is
an invertible p × p matrix then
fX(u) = fY(A^{-1}(u − b)) |Det(A^{-1})| = fY(A^{-1}(u − b)) / |Det(A)|.
To verify the result it suffices to use Proposition 3.2 with u = g(t) = At + b and thus t =
g −1 (u) = h(u) = A−1 (u − b).
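A minimal numerical check of Corollary 3.1 in the Gaussian case (the matrix A, the vector b and the evaluation point u are arbitrary choices; comparing with scipy's N2(b, AA^T) density is legitimate because an affine image of a normal vector is again normal, cf. Section 3.3 below):

```python
# Corollary 3.1: f_X(u) = f_Y(A^{-1}(u - b)) / |det A| for X = A Y + b.
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[1.0, 0.5], [0.0, 2.0]])
b = np.array([1.0, -1.0])
u = np.array([0.3, 0.7])

f_Y = multivariate_normal(mean=np.zeros(2), cov=np.eye(2)).pdf
lhs = f_Y(np.linalg.solve(A, u - b)) / abs(np.linalg.det(A))
rhs = multivariate_normal(mean=b, cov=A @ A.T).pdf(u)
print(lhs, rhs)   # equal up to rounding
```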
3.1.5 Reminder of the properties of symmetric matrices
Recall that p × p matrix A, A = (aij ), i, j = 1, ..., p is called symmetric if aij = aji , i, j = 1, ..., p.
The p × p matrix Γ is said to be orthogonal if
Γ^{-1} = Γ^T (or equivalently Γ Γ^T = Γ^T Γ = I)
(where I is the p × p identity matrix). In other words, the columns γ·j of Γ are orthogonal vectors of length 1 (the same is true for the rows γi· of Γ). Of course, |Det(Γ)| = 1. We have
the spectral decomposition theorem (Jordan):
Let A ∈ Rp×p be a symmetric matrix. Then there exists an orthogonal matrix Γ and the
diagonal matrix
Λ = Diag(λi) = diag(λ1, ..., λp),
such that
A = Γ Λ Γ^T = Σ_{i=1}^p λi γ·i γ·i^T,   (3.1)
where γ·i are the orthonormal eigenvectors of A:²
γ·i^T γ·j = δij,  i, j = 1, ..., p,   Γ = (γ·1, ..., γ·p).
Comments.
1) Though a symmetric matrix may have multiple (repeated) eigenvalues, one can always choose p orthonormal eigenvectors of such a matrix, which are then all different.
2) We assume in the sequel that the eigenvalues λi, i = 1, ..., p, are sorted in decreasing order:
λ1 ≥ λ2 ≥ ... ≥ λp.
We say that γ·1 is the first (or principal) eigenvector of A, i.e. the eigenvector corresponding to the maximal eigenvalue; γ·2 is the second eigenvector, and so on.
If all eigenvalues λi, i = 1, ..., p, are nonnegative, matrix A is positive semidefinite (and positive definite if all λi > 0).
Other useful properties of square matrices
1o. Det(A) = Π_{i=1}^p λi,  Tr(A) = Σ_{i=1}^p λi.
2o. Det(AB) = Det(A) Det(B), Det(A^T) = Det(A).
3o. The calculus of matrix functions simplifies for symmetric matrices: for example, a positive integer power A^s, s ∈ N+, of a symmetric matrix satisfies A^s = Γ Λ^s Γ^T (if the matrix A is positive definite, this is true for any real s).
4o. Det(A^{-1}) = Det(A)^{-1} for nondegenerate A.
5o. For any s ∈ R and any matrix A = A^T ≻ 0, Det(A^s) = Det(A)^s (a simple consequence of |Det Γ| = 1 for an orthogonal matrix Γ).
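A short numpy sketch of the spectral decomposition and of properties 1o-2o for one particular symmetric matrix A (chosen arbitrarily; note that numpy's eigh returns the eigenvalues in increasing order, while the text sorts them in decreasing order):

```python
# Eigendecomposition A = Gamma Lambda Gamma^T of a symmetric matrix.
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, Gamma = np.linalg.eigh(A)                            # eigenvalues (ascending), orthonormal columns
print(np.allclose(Gamma @ np.diag(lam) @ Gamma.T, A))     # A = Gamma Lambda Gamma^T
print(np.allclose(Gamma.T @ Gamma, np.eye(3)))            # Gamma is orthogonal
print(np.linalg.det(A), lam.prod())                       # Det(A) = prod of eigenvalues
print(np.trace(A), lam.sum())                             # Tr(A)  = sum of eigenvalues
```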
Projectors. A symmetric matrix P such that
P² = P (idempotent matrix)
is called a projection matrix (or projector).
All the eigenvalues of P are 0 or 1, and Rank(P) is the number of eigenvalues equal to 1. In other words, there is an orthogonal matrix Γ such that
Γ^T P Γ = [ I  0 ; 0  0 ],
where I is the Rank(P) × Rank(P) identity matrix.
Indeed, let v be an eigenvector of P; then Pv = λv, where λ is an eigenvalue of P. Due to P² = P,
(λ² − λ)v = (λP − P)v = (P² − P)v = 0.
This is equivalent to λ = 1 if Pv ≠ 0.
² Here δij stands for the Kronecker index: δij = 1 if i = j, otherwise δij = 0.
3.2 Conditional expectation of a random vector
Let X = (ξ1 , ..., ξp )T and Y = (η1 , ..., ηq )T be two random vectors. Here we consider only
continuous case, i.e. we assume that the joint density fX,Y (x, y) = fX,Y (t1 , ..., tp , s1 , ..., sq )
exists.
The conditional expectation E(Y |X) is the q-random vector with the components
E(η1 |X), ..., E(ηq |X);
here E(ηj |X) = gj (X) (a measurable function of X), and
gj(t) = E(ηj|X = t) = ∫ sj f_{ηj|X=t}(sj|t) dsj = ∫ sj f_{ηj|ξ1=t1,...,ξp=tp}(sj|t1, ..., tp) dsj.
We can verify that the latter quantity is well defined if, for instance, E(|ηj |) < ∞, j = 1, ..., q.
All the properties of conditional expectation, established in Lecture 2 hold true in the vector
case.
Same as in the scalar case, we can define the conditional covariance matrix:
V (Y |X) = E(Y Y T |X) − E(Y |X)E(Y |X)T .
3.2.1 Best prediction theorem
Let |a| = √(a1² + ... + ap²) stand for the Euclidean norm on R^p.
Definition 3.2 Let X ∈ Rp and Y ∈ Rq be two random vectors, and G : Rp → Rq . We say
that G(X) is the best prediction of Y given X (in the mean square sense) if
E((Y − G(X))(Y − G(X))^T) ⪯ E((Y − H(X))(Y − H(X))^T)   (3.2)
(we say that A ⪯ B if the difference B − A is positive semidefinite) for any measurable function H : R^p → R^q.
Clearly, (3.2) implies (please, verify this!)
E(|Y − G(X)|²) = inf_{H(·)} E(|Y − H(X)|²),
where the infimum is taken over all measurable functions H(·) : R^p → R^q.
Same as in the case of p = q = 1, we have
Theorem 3.1 If E(|Y |2 ) < ∞, then the best prediction of Y given X is unique a.s. and satisfies
G(X) = E(Y |X) (a.s.).
Proof : Of course, it is sufficient to look for the minimum among functions H(·) such that
E(|H(X)|2 ) < ∞. For any such H(X)
E((H(X) − Y)(H(X) − Y)^T)
= E([(H(X) − G(X)) + (G(X) − Y)][(H(X) − G(X)) + (G(X) − Y)]^T)
= E((H(X) − G(X))(H(X) − G(X))^T) + E((H(X) − G(X))(G(X) − Y)^T)
+ E((G(X) − Y)(H(X) − G(X))^T) + E((G(X) − Y)(G(X) − Y)^T).
But, using the properties of conditional expectation, we obtain:
E((H(X) − G(X))(G(X) − Y)^T) = E[ E((H(X) − G(X))(G(X) − Y)^T | X) ]
= E[ (H(X) − G(X)) E((G(X) − Y)^T | X) ] = 0.
The statement of the theorem follows.
3.3 Multivariate normal distribution
Normal distribution on R: recall that the normal distribution N(µ, σ²) on R is the probability distribution with density
f(x) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)).
Here µ is the mean and σ² is the variance. The characteristic function of the normal distribution N(µ, σ²) is given by
φ(t) = exp(iµt − σ²t²/2).
In particular, for N(0, 1) we have φ(t) = e^{−t²/2}.
3.3.1 The distribution Np(0, I)
Np (0, I) is the distribution of the random vector X = (ξ1 , ..., ξp )T where ξi , i = 1, ..., p are i.i.d.
random variables with distribution N (0, 1).
Properties of Np (0, I):
1o . The mean and the covariance matrix of X are E(X) = 0 and V (X) = I.
2o. Distribution Np(0, I) is absolutely continuous with density
f(u) = (2π)^{−p/2} exp(−u^T u / 2) = Π_{i=1}^p (2π)^{−1/2} exp(−ui²/2) = Π_{i=1}^p f0(ui),
where u = (u1, ..., up)^T and f0(t) = (1/√(2π)) e^{−t²/2} is the density of N(0, 1).
3o. The characteristic function of Np(0, I) is, by definition,
φX(a) = E(e^{i a^T X}) = E(Π_{j=1}^p e^{i aj ξj}) = Π_{j=1}^p E(e^{i aj ξj}) = Π_{j=1}^p e^{−aj²/2} = exp(−a^T a / 2),
where a = (a1, ..., ap)^T ∈ R^p.
3.3.2 Normal distribution on R^p
Definition 3.3 The random vector X is normally distributed on R^p if and only if there exist a
p × p matrix A and a vector µ ∈ Rp such that
X = AY + µ, where Y ∼ Np (0, I).
Properties:
1o. E(X) = µ since E(Y) = 0.
2o. V(X) = A V(Y) A^T = A A^T. We put Σ = A A^T.
3o. The characteristic function
φX(a) = E(e^{i a^T X}) = E(e^{i a^T (AY + µ)}) = e^{i a^T µ} E(e^{i b^T Y})   (with b = A^T a)
= e^{i a^T µ − b^T b / 2} = e^{i a^T µ − a^T Σ a / 2}.   (3.3)
We have the following characterization:
Theorem 3.2 Let φ : Rp → C (a complex-valued function). Then, φ is a characteristic function
of a normal distribution if and only if there exist µ ∈ Rp and a positive semidefinite matrix
Σ ∈ Rp×p such that
φ(a) = e^{i a^T µ − a^T Σ a / 2},  a ∈ R^p.   (3.4)
Remark: in this case µ is the mean and Σ is the covariance matrix of the normal distribution
in question.
Proof : The “only if” part has been already proved. To show the “if” part, starting from (3.4)
we have to prove that there exists a normal vector X ∈ Rp such that φ(·) is its characteristic
function.
1st step: by the spectral decomposition theorem, there exists an orthogonal matrix Γ such that Γ^T Σ Γ = Λ, where Λ is a diagonal matrix of rank k ≤ p with strictly positive eigenvalues λj, 1 ≤ j ≤ k. Then (cf. (3.1)),
Σ = Σ_{j=1}^k λj γ·j γ·j^T = Σ_{j=1}^k a·j a·j^T,
where γ·j are the columns of Γ, and a·j = √λj γ·j. Observe that a·j ⊥ a·l for l ≠ j (recall that the γ·j are orthonormal).
2nd step: Let Y ∼ N (0, I). Let us denote ηj the components of Y (Y = (η1 , ..., ηp )T ). We
consider the random vector
X = η1 a·1 + ... + ηk a·k + µ,
so that X = AY +µ, where A is a p×p matrix with columns aj , j = 1, ..., k: A = (a·1 , ..., a·k , 0, ..., 0).
Thus X is a normal vector. What is the characteristic function of X? We have E(X) = µ and
V(X) = E((η1 a·1 + ... + ηk a·k)(η1 a·1 + ... + ηk a·k)^T) = Σ_{j=1}^k a·j a·j^T = Σ,
due to E(ηl ηj) = δjl, where δjl is the Kronecker symbol. Thus, by (3.3), the characteristic function of X coincides with φ(u) in (3.4).
The result of Theorem 3.2 implies the following important corollary: any normal distribution
on Rp is entirely defined by the vector of means and its covariance matrix. This explains the
notation
X ∼ N (µ, Σ)
for the random vector X which is normally distributed with mean µ and covariance matrix
Σ = Σ^T ⪰ 0.
We will distinguish two situations, those of nondegenerate normal distribution and degenerate normal distribution.
3.3.3 Nondegenerate normal distribution
This is a normal distribution on R^p with positive definite covariance matrix Σ, i.e. Σ ≻ 0 (⇔ Det(Σ) > 0). Moreover, because Σ is symmetric and Σ ≻ 0, there exists a symmetric matrix A1 = Σ^{1/2} (a symmetric square root of Σ) such that Σ = A1² = A1^T A1 = A1 A1^T. As Det(Σ) = [Det(A1)]² > 0, we have Det(A1) > 0 and A1 is invertible. By (3.3), if X ∼ N(µ, Σ), its characteristic function is
φX(a) = e^{i a^T µ − a^T Σ a / 2}
for any a ∈ R^p, and due to Σ = A1 A1^T, we have
φX(a) = e^{i a^T µ − a^T Σ a / 2} = E(e^{i a^T (A1 Y + µ)}) = φ_{A1 Y + µ}(a),
where Y ∼ Np(0, I). Therefore,
X = A1 Y + µ
and, due to the invertibility of A1,
Y = A1^{-1}(X − µ).
The Jacobian of this linear transformation is Det(A1^{-1}), and by Corollary 3.1 the density of X is, for any u ∈ R^p,
fX(u) = Det(A1^{-1}) fY(A1^{-1}(u − µ)) = (1/Det(A1)) fY(A1^{-1}(u − µ))
= (1 / ((2π)^{p/2} √Det(Σ))) exp(−(u − µ)^T Σ^{-1} (u − µ) / 2).
Definition 3.4 We say that X has a nondegenerate normal distribution Np(µ, Σ) (with a positive definite covariance matrix Σ) iff X is a random vector with density
f(t) = (1 / ((2π)^{p/2} √Det(Σ))) exp(−(t − µ)^T Σ^{-1} (t − µ) / 2).
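A minimal sketch of Section 3.3.3: sampling X = µ + Σ^{1/2} Y with a symmetric square root of Σ, and evaluating the density of Definition 3.4 against scipy's implementation; µ, Σ and the evaluation point are arbitrary choices.

```python
# Sample X = mu + Sigma^{1/2} Y, Y ~ N_p(0, I), and check the density formula.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import sqrtm

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
A1 = np.real(sqrtm(Sigma))                        # symmetric square root Sigma^{1/2}

rng = np.random.default_rng(2)
Y = rng.standard_normal((200_000, 2))
X = Y @ A1 + mu                                   # A1 is symmetric, so A1^T = A1
print(X.mean(axis=0), np.cov(X, rowvar=False))    # approximately mu and Sigma

t = np.array([0.5, -1.0])
d = t - mu
f = np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (
        (2 * np.pi) ** (len(mu) / 2) * np.sqrt(np.linalg.det(Sigma)))
print(f, multivariate_normal(mu, Sigma).pdf(t))   # same value
```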
3.3.4 Degenerate normal distribution
This is a normal distribution on Rp with singular covariance matrix Σ, i.e. Det(Σ) = 0 (in other
words, Rank(Σ) = k < p). For instance, consider Σ = 0; then the characteristic function of X ∼ N(µ, 0) is φX(a) = e^{i a^T µ} (by Property 3o) and the distribution of X is the Dirac function at µ.
More generally, if Rank(Σ) = k ≥ 1, we obtain (cf. the proof of Theorem 3.2) that a vector
X ∼ Np (µ, Σ) can be represented as
X = AY + µ,
where Y ∼ N (0, I), A = (a·1 , ..., a·k , 0, ..., 0) and AAT = Σ, with Rank(A) = k. Thus, any
component of X is either normally distributed (nondegenerate) or its distribution is a Dirac
function. This is a consequence of the following statement:
Proposition 3.3 Let X ∼ Np (µ, Σ) and Rank(Σ) = k < p. Then there exists a linear subspace
H ⊂ Rp of dimension p − k such that the projection aT X of X on any vector a ∈ H is a “Dirac
random variable.”
Proof : We have X = AY + µ where A A^T = Σ, Rank(A) = k. Let H = Ker(A^T), of dimension dim(H) = p − k. If a ∈ H, then A^T a = 0 and Σa = 0.
Now, let a ∈ H; the characteristic function of the r.v. a^T X is
φ(u) = E(e^{i(a^T X)u}) = E(e^{i(ua)^T X}) = e^{i(ua)^T µ − (ua)^T Σ (ua) / 2} = e^{i(ua)^T µ}.
Therefore, the distribution of a^T X is a (scalar) Dirac function at a^T µ.
Theorem 3.3 (Equivalent definition of the multivariate normal distribution) A random vector
X ∈ Rp is normally distributed iff all its scalar projections aT X for any a ∈ Rp are normal
random variables.
Remark: we include the Dirac distribution as a special case of normal distributions (corresponding to σ 2 = 0).
Proof : Observe first that for any a ∈ R^p and any u ∈ R the characteristic function φ_{a^T X}(u) of the r.v. a^T X is related to that of X:
φ_{a^T X}(u) = E(e^{i a^T X u}) = φX(ua).   (3.5)
“only if” part: Let X be a normal vector in R^p. We are to show that a^T X is a normal random variable for any a ∈ R^p. We use (3.5) to obtain, for any u ∈ R,
φ_{a^T X}(u) = e^{i u a^T µ − u² a^T Σ a / 2},
where µ and Σ are the mean and the covariance matrix of X. Thus,
φ_{a^T X}(u) = e^{i µ0 u − u² σ0² / 2}
with µ0 = a^T µ and σ0² = a^T Σ a. As a result,
a^T X ∼ N(µ0, σ0²) = N(a^T µ, a^T Σ a).
“if” part: We are to prove that if a^T X is a normal random variable for any a ∈ R^p, then X is a normal vector. To this end, note that if a^T X is a normal r.v. for any a ∈ R^p, then E(|X|²) < ∞ (to see why this is true, it suffices to choose for a the vectors of an orthonormal basis of R^p). Therefore, the mean µ = E(X) and the covariance matrix Σ = V(X) are well defined.
Let us fix a ∈ R^p. By hypothesis, there exist m ∈ R and s² ≥ 0 such that a^T X ∼ N(m, s²). However, this immediately implies that
m = E(a^T X) = a^T µ,   s² = Var(a^T X) = a^T Σ a.
Moreover, the characteristic function of a^T X is given by
φ_{a^T X}(u) = e^{i m u − s² u² / 2} = e^{i u a^T µ − u² a^T Σ a / 2}.
Now, using (3.5) we obtain
φX(a) = φ_{a^T X}(1) = e^{i a^T µ − a^T Σ a / 2}.
Because a ∈ Rp is arbitrary, we conclude by Theorem 3.2 that X is a normal random vector in
Rp with mean µ and covariance matrix Σ.
3.3.5 Properties of the multivariate normal distribution
Here we consider X ∼ Np(µ, Σ), where µ ∈ R^p and Σ ∈ R^{p×p} is a symmetric matrix, Σ ⪰ 0. The following properties are consequences of the results of the preceding section:
(N1) Let Σ ≻ 0; then the random vector Y = Σ^{-1/2}(X − µ) satisfies
Y ∼ Np(0, I).
(N2) One-dimensional projections aT X of X for any a ∈ Rp are normal random variables:
aT X ∼ N (aT µ, aT Σa).
In particular, the marginal densities of the distribution Np (µ, Σ) are also normal. The
inverse statement is not true!
Exercise 3.2
Let the joint density of r.v.'s X and Y satisfy
f(x, y) = (1/(2π)) e^{−x²/2} e^{−y²/2} [1 + xy I{−1 ≤ x, y ≤ 1}].
What is the distribution of X? of Y?
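A numerical hint for Exercise 3.2 (not a solution): integrate the joint density over y for a few fixed x and compare with the N(0, 1) density; the chosen values of x are arbitrary.

```python
# Numerical marginalization of the joint density of Exercise 3.2.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f(x, y):
    extra = x * y if (-1 <= x <= 1 and -1 <= y <= 1) else 0.0
    return norm.pdf(x) * norm.pdf(y) * (1 + extra)

for x in (-0.5, 0.3, 2.0):
    # integrate piecewise to handle the kinks of the indicator at y = -1 and y = 1
    marginal = sum(quad(lambda y: f(x, y), a, b)[0]
                   for a, b in [(-np.inf, -1), (-1, 1), (1, np.inf)])
    print(x, marginal, norm.pdf(x))   # the marginal coincides with the N(0,1) density
```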
(N3) Any linear transformation of a normal vector is again a normal vector: if Y = AX + c
where A ∈ Rq×p and c ∈ Rq are some fixed matrix and vector (non-random),
Y ∼ Nq (Aµ + c, AΣAT ).
Exercise 3.3
Prove this claim.
(N4) Let σ 2 > 0. The distribution of X ∼ Np (0, σ 2 I) is invariant with respect to orthogonal
transformations: if Γ is an orthogonal matrix, then ΓX ∼ Np (0, σ 2 I). (The proof is
evident: it suffices to use (N3) with A = Γ.)
(N5) All subsets of components of a normal vector are normal vectors: Let X = (X1T , X2T )T
where X1 ∈ Rk and X2 ∈ Rp−k , then X1 and X2 are (k- and p − k-variate) normal vectors.
Proof : We use (N3) with c = 0 and A ∈ Rk×p , A = (Ik , 0) where Ik is the k × k identity
matrix to conclude that X1 is normal. For X2 we take A ∈ R(p−k)×p = (0, Ip−k ).
(N6) Two jointly normal vectors are independent if and only if they are non-correlated.
Proof : Only the sufficiency (“if”) claim is of interest. Let Z = (X^T, Y^T)^T, where X ∈ R^p and Y ∈ R^q, Z is a normal vector in R^{p+q}, and C(X, Y) = 0 (the covariance matrix of X and Y). To prove that X and Y are independent it suffices to verify (cf. Exercise 3.1) that the characteristic function φZ(u), u = (a^T, b^T)^T, a ∈ R^p and b ∈ R^q, can be decomposed as
φZ(u) = φX(a) φY(b).
Indeed, we have
E(Z) = [ E(X) ; E(Y) ],   V(Z) = [ V(X)  C(X, Y) ; C(Y, X)  V(Y) ] = [ V(X)  0 ; 0  V(Y) ],
where V(X) ∈ R^{p×p} and V(Y) ∈ R^{q×q} are the covariance matrices of X and of Y, respectively. Therefore, the characteristic function φZ(u) of Z is given by
φZ(u) = φZ(a, b) = exp( i(a^T E(X) + b^T E(Y)) − (1/2)(a^T, b^T) V(Z) (a^T, b^T)^T )
= exp( i a^T E(X) − a^T V(X) a / 2 ) exp( i b^T E(Y) − b^T V(Y) b / 2 ) = φX(a) φY(b)
for any u = (a^T, b^T)^T.
3.3.6 Geometry of the multivariate normal distribution
Let Σ ≻ 0. Note that the density of Np(µ, Σ) is constant on the surfaces
EC = {x : (x − µ)^T Σ^{-1} (x − µ) = C²}.
We call these level sets the “contours” of the distribution. In the case of interest, the EC are ellipsoids, which we refer to as concentration ellipsoids.
Concentration ellipsoids: X = (ξ1, ξ2), Y = (η1, η2), where Y = Σ^{-1/2} X, Σ = [ 1  3/4 ; 3/4  1 ] (here ρ = 0.75).
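A short numerical sketch for the figure above: the principal axes of EC are the eigenvectors of Σ and the semi-axes have length C√λi (here C = 1 is an arbitrary choice).

```python
# Principal axes of the concentration ellipsoid for Sigma = [[1, 3/4], [3/4, 1]].
import numpy as np

Sigma = np.array([[1.0, 0.75], [0.75, 1.0]])
lam, Gamma = np.linalg.eigh(Sigma)
C = 1.0
print(lam)                 # 0.25 and 1.75, i.e. 1 - rho and 1 + rho
print(Gamma)               # directions (1,-1)/sqrt(2) and (1,1)/sqrt(2), up to sign
print(C * np.sqrt(lam))    # semi-axes of {x : x^T Sigma^{-1} x = C^2}
```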
3.4 Distributions derived from the normal
3.4.1 Pearson's χ² distribution
This is the distribution of the sum
Y = η1² + ... + ηp²,
where η1, ..., ηp are i.i.d. N(0, 1) random variables. We denote it Y ∼ χ²_p and pronounce “Y follows the chi-square distribution with p degrees of freedom (d.f.)”. The density of the χ²_p distribution is given by
f_{χ²_p}(y) = C(p) y^{p/2−1} e^{−y/2} I{0 < y < ∞},   (3.6)
where
C(p) = (2^{p/2} Γ(p/2))^{-1},
and Γ(·) is the gamma function:
Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du,  x > 0.
We have E(Y ) = p, Var(Y ) = 2p if Y ∼ χ2p .
Density of the χ² distribution for p = 1, 2, 3, 6.
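A quick Monte Carlo sketch of the construction of χ²_p above (the choice p = 6, the sample size and the threshold 10 are arbitrary):

```python
# Sum of p squared independent N(0,1) variables ~ chi^2_p: moments and one tail.
import numpy as np
from scipy.stats import chi2

p = 6
rng = np.random.default_rng(3)
Y = (rng.standard_normal((200_000, p)) ** 2).sum(axis=1)
print(Y.mean(), p)                       # E(Y) = p
print(Y.var(), 2 * p)                    # Var(Y) = 2p
print((Y > 10).mean(), chi2.sf(10, p))   # empirical vs exact tail P(Y > 10)
```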
Exercise 3.4
Obtain the expression (3.6) for the density of χ2p .
3.4.2 Fisher-Snedecor distribution
Let U ∼ χ²_p, V ∼ χ²_q be two independent r.v.. The Fisher distribution with degrees of freedom p and q is the distribution of the random variable
Y = (U/p) / (V/q).
We write Y ∼ F_{p,q}. The density of F_{p,q} is given by
f_{F_{p,q}}(y) = C(p, q) y^{p/2−1} / (q + py)^{(p+q)/2} I{0 < y < ∞},   (3.7)
where
C(p, q) = p^{p/2} q^{q/2} / B(p/2, q/2),  with B(p, q) = Γ(p)Γ(q) / Γ(p + q).
One can show that the distribution of pY approaches the χ²_p distribution in the limit q → ∞.
Density of the Fisher distribution for (p, q) = (10, 4), (10, 10), (10, 100).
Exercise 3.5
Verify the expression (3.7) for the density of the Fisher distribution.
3.4.3 Student (W. Gosset) t distribution
Let η ∼ N(0, 1) and ξ ∼ χ²_q be independent. The Student distribution with q degrees of freedom is that of the r.v.
Y = η / √(ξ/q).
We write Y ∼ t_q. The density of t_q is
f_{t_q}(y) = C(q)(1 + y²/q)^{−(q+1)/2},  y ∈ R,   (3.8)
where
C(q) = 1 / (√q B(1/2, q/2)).
Note that t_1 is the Cauchy distribution, and t_q “approaches” N(0, 1) when q → ∞. We note that the t_q distribution is symmetric. The tails of t_q are heavier than normal tails.
Exercise 3.6
Verify the expression (3.8) for the Student distribution density.
Densities of the Student t_4 distribution and of N(0, 1).
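A small numerical illustration of the last remark, comparing the tail probabilities P(|Y| > u) under t_4 and under N(0, 1) (the thresholds and the value q = 200 are arbitrary choices):

```python
# Heavier tails of t_q compared with N(0,1); t_q -> N(0,1) for large q.
from scipy.stats import t, norm

for u in (2.0, 3.0, 5.0):
    print(u, 2 * t.sf(u, df=4), 2 * norm.sf(u))      # P(|t_4| > u) vs P(|N(0,1)| > u)
print(2 * t.sf(3.0, df=200), 2 * norm.sf(3.0))       # nearly equal for q = 200
```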
3.5 Cochran's theorem
Theorem 3.4 Let X ∼ Np(0, I) and let A1, ..., AJ, J < p, be p × p matrices such that
(1) Aj² = Aj,
(2) Aj is symmetric, Rank(Aj) = nj,
(3) Aj Ak = 0 when j ≠ k and Σ_{j=1}^J nj ≤ p.³
Then,
(i) the vectors Aj X are independent with normal distributions Np(0, Aj), j = 1, ..., J, respectively;
(ii) the random variables |Aj X|², j = 1, ..., J, are independent with χ²_{nj} distributions, j = 1, ..., J.
Proof :
(i) Observe that E(Aj X) = 0 and
V(Aj X) = Aj V(X) Aj^T = Aj Aj^T = Aj² = Aj.
Moreover, the joint distribution of Ak X and Aj X is clearly normal. However,
C(Ak X, Aj X) = E(Ak X X^T Aj^T) = Ak V(X) Aj^T = Ak Aj^T = Ak Aj = 0
for j ≠ k. By property (N6) of the normal distribution, this implies that Ak X and Aj X are independent for all k ≠ j.
³ Some presentations of this result in the statistical literature also assume that A1 + ... + AJ = I.
(ii) Since Aj is a projector, there exists an orthogonal matrix Γ such that
Γ Aj Γ^T = [ I_{nj}  0 ; 0  0 ] =: Λ,
the diagonal matrix of eigenvalues of Aj (because the rank of Aj is equal to nj, the identity block has size nj). Therefore
|Aj X|² = X^T Aj^T Aj X = X^T Aj X = (X^T Γ^T) Λ (ΓX) = Y^T Λ Y = Σ_{i=1}^{nj} ηi²,
where Y = (η1, ..., ηp)^T is a normal vector, Y = ΓX ∼ Np(0, I) (due to property (N4) of the normal distribution). We conclude that |Aj X|² ∼ χ²_{nj}. Since independence is preserved by measurable transformations, |Aj X|² and |Ak X|² are independent for j ≠ k.
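A classical illustration of Cochran's theorem (a sketch, with arbitrary n and sample size): take J = 2, A1 = (1/n)𝟙𝟙^T (rank 1) and A2 = I − A1 (rank n − 1); conditions (1)-(3) hold, so |A1X|² = n X̄² ∼ χ²_1 and |A2X|² = Σ(Xi − X̄)² ∼ χ²_{n−1} are independent (the independence of sample mean and sample variance).

```python
# Cochran's theorem with the two projectors A1 = (1/n) 1 1^T and A2 = I - A1.
import numpy as np

n, n_rep = 5, 100_000
rng = np.random.default_rng(4)
X = rng.standard_normal((n_rep, n))
xbar = X.mean(axis=1)
q1 = n * xbar ** 2                               # |A1 X|^2
q2 = ((X - xbar[:, None]) ** 2).sum(axis=1)      # |A2 X|^2
print(q1.mean(), 1, q2.mean(), n - 1)            # chi^2 means equal the degrees of freedom
print(np.corrcoef(q1, q2)[0, 1])                 # approximately 0: q1 and q2 are independent
```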
3.6 Normal correlation theorem and Kalman-Bucy filter
Another important consequence of the results of Section 3.3.5 is the following statement:
Theorem 3.5 Let X^T = (ξ^T, θ^T), ξ ∈ R^k, θ ∈ R^l, p = k + l, be a normal vector, X ∼ Np(µ, Σ), where
µ^T = (µξ^T, µθ^T),   Σ = [ Σξξ  Σξθ ; Σθξ  Σθθ ],
Σξξ ∈ R^{k×k}, Σθθ ∈ R^{l×l}, Σθξ^T = Σξθ ∈ R^{k×l}. We suppose that Σξξ ≻ 0.
Then
m := E(θ|ξ) = µθ + Σθξ Σξξ^{-1} (ξ − µξ)  (a.s.),
γ := V(θ|ξ) = Σθθ − Σθξ Σξξ^{-1} Σξθ  (a.s.),   (3.9)
and the conditional distribution of θ given ξ is normal: for any s ∈ R^l, P(θ ≤ s|ξ) is (a.s.) the c.d.f. of an l-variate normal distribution with the vector of means m and the covariance matrix γ (we write a ≤ b for two vectors a, b ∈ R^l for the system of inequalities a1 ≤ b1, ..., al ≤ bl).
Moreover, the random vectors ξ and
η = θ − Σθξ Σξξ^{-1} ξ
are independent.
Remarks:
1. The theorem provides an explicit expression for the regression function m = E(θ|ξ) (regression of θ on ξ) and the conditional covariance matrix
γ = V(θ|ξ) = E((θ − m)(θ − m)^T).
This regression is linear in the case of a Gaussian couple (ξ, θ).
2. If we assume, in addition, that Σ ≻ 0, then the matrix γ is also ≻ 0. Indeed, let a ∈ R^k, b ∈ R^l, not both zero; then
(a^T b^T) Σ (a^T b^T)^T = (a^T b^T) [ Σξξ  Σξθ ; Σθξ  Σθθ ] (a^T b^T)^T > 0,
same as
a^T Σξξ a + a^T Σξθ b + b^T Σθξ a + b^T Σθθ b > 0.   (3.10)
If we choose
a = −Σξξ^{-1} Σξθ b,
then (3.10) can be rewritten as
−b^T Σθξ Σξξ^{-1} Σξθ b + b^T Σθθ b > 0
for any b ∈ R^l, b ≠ 0. Thus,
Σθθ − Σθξ Σξξ^{-1} Σξθ ≻ 0.
3. The normal correlation theorem allows for the following geometric interpretation: assume that E(ξ) = 0 and E(θ) = 0, and let L²ξ(P) be the subspace of random vectors with finite covariance matrix which are measurable with respect to ξ. Then Σθξ Σξξ^{-1} ξ is the orthogonal projection of θ on L²ξ(P), and the vector η = θ − Σθξ Σξξ^{-1} ξ is orthogonal to L²ξ(P).
4. It is worth mentioning that one can also prove a “conditional” version of Theorem 3.5 if we assume that the conditional distribution of the couple (ξ, θ) given another r.v., say Z, is normal (a.s.). Indeed, let X = (ξ, θ)^T = ((ξ1, ..., ξk), (θ1, ..., θl))^T be a random vector and Z some other random vector defined on the same probability space (Ω, F, P). Assume that the conditional distribution of X given Z is normal (a.s.) with vector of means
E(X|Z)^T = (E(ξ|Z)^T, E(θ|Z)^T) = (µξ|Z^T, µθ|Z^T),
and covariance matrix
ΣX|Z = [ V(ξ|Z)  C(ξ, θ|Z) ; C(θ, ξ|Z)  V(θ|Z) ] =: [ Σξξ|Z  Σξθ|Z ; Σθξ|Z  Σθθ|Z ].
Then the conditional expectation m = E(θ|ξ, Z) and the conditional covariance matrix γ = V(θ|ξ, Z) are given by
m = µθ|Z + Σθξ|Z Σξξ|Z^{-1} (ξ − µξ|Z),   γ = Σθθ|Z − Σθξ|Z Σξξ|Z^{-1} Σξθ|Z;   (3.11)
and the conditional distribution of θ given ξ and Z is normal: for any s ∈ R^l, P(θ ≤ s|ξ, Z) is (a.s.) the c.d.f. of an l-variate normal distribution with the mean m and the covariance matrix γ. Moreover, the random vectors ξ and
η = θ − Σθξ|Z Σξξ|Z^{-1} ξ
are conditionally independent given Z.
This statement can be proved in the exactly same way as Theorem 3.5 and will be used
in the next section.
Proof of the normal correlation theorem.
Step 1. Let us compute E(η) and V(η):
E(η) = E(θ − Σθξ Σξξ^{-1} ξ) = µθ − Σθξ Σξξ^{-1} µξ,
and
V(η) = E([(θ − µθ) − Σθξ Σξξ^{-1}(ξ − µξ)][(θ − µθ) − Σθξ Σξξ^{-1}(ξ − µξ)]^T)
= Σθθ − Σθξ Σξξ^{-1} E((ξ − µξ)(θ − µθ)^T) − E((θ − µθ)(ξ − µξ)^T) Σξξ^{-1} Σθξ^T + Σθξ Σξξ^{-1} E((ξ − µξ)(ξ − µξ)^T) Σξξ^{-1} Σθξ^T
= Σθθ − Σθξ Σξξ^{-1} Σξθ.
Step 2. Let us verify the orthogonality of η and ξ:
C(η, ξ) = C(θ, ξ) − Σθξ Σξξ^{-1} C(ξ, ξ) = Σθξ − Σθξ Σξξ^{-1} Σξξ = 0,
thus η ⊥ ξ.
Step 3. We show that the couple (ξ, η) is normal. Indeed,
(ξ^T, η^T)^T = AX = A (ξ^T, θ^T)^T,
where
A = [ Ik  0 ; −Σθξ Σξξ^{-1}  Il ],
with identity matrices Ik ∈ R^{k×k} and Il ∈ R^{l×l}. By property (N3) of Section 3.3.5, (ξ^T, η^T)^T is a normal vector.
Its covariance matrix is given by
V((ξ^T, η^T)^T) = [ V(ξ)  C(ξ, η) ; C(η, ξ)  V(η) ] = [ Σξξ  0 ; 0  Σθθ − Σθξ Σξξ^{-1} Σξθ ].
Because Σξξ ≻ 0 and Σθθ − Σθξ Σξξ^{-1} Σξθ ⪰ 0 (by the Cauchy-Schwarz inequality), we have V((ξ^T, η^T)^T) ⪰ 0. Besides this, V((ξ^T, η^T)^T) = A V(X) A^T ⪰ 0.
Step 4. Now property (N6) implies that η and ξ are independent. On the other hand, the result of Step 3, along with (N5), allows us to conclude that η is a normal vector. Using the above expressions for E(η) and V(η) we obtain
η ∼ Nl(µθ − Σθξ Σξξ^{-1} µξ, Σθθ − Σθξ Σξξ^{-1} Σξθ).
Now it suffices to note that
θ = η + Σθξ Σξξ^{-1} ξ,
where η is independent of ξ. Therefore, the conditional distribution of θ given ξ is the distribution of η translated by Σθξ Σξξ^{-1} ξ, i.e.
E(θ|ξ) = E(η) + Σθξ Σξξ^{-1} ξ,   V(θ|ξ) = V(η).
The linearity of the best prediction m = E(θ|ξ) of the vector θ given ξ is, of course, tightly linked to the normality of the couple (ξ, θ), which allows for a simple calculation of m. It is interesting to see what the best linear prediction is in the case where the joint distribution of ξ and θ is not normal. In other words, we may be interested in finding a matrix A* ∈ R^{l×k} and a vector b* ∈ R^l such that θ̂ = b* + A*ξ satisfies
E((θ − θ̂)(θ − θ̂)^T) = inf_{A ∈ R^{l×k}, b ∈ R^l} E((θ − Aξ − b)(θ − Aξ − b)^T).
The answer is given by the following lemma which stresses the importance of the Gaussian case
when looking for the best linear predictors:
Lemma 3.1 Suppose that (X, Y) is a random vector, X ∈ R^k, Y ∈ R^l, such that E(|X|² + |Y|²) < ∞, V(X) ≻ 0. Let (ξ, θ) be a normal vector with the same mean and covariance matrix, i.e.
i.e.
E(ξ) = E(X), E(θ) = E(Y ),
V (ξ) = V (X), V (θ) = V (Y ), C(X, Y ) = C(ξ, θ).
Let now λ(b) : Rk → Rl be a linear function such that
λ(b) = E(θ|ξ = b).
Then λ(X) is the best linear prediction of Y given X. Besides this, E(λ(X)) = E(Y ).
Proof : Observe that the existence of a linear function λ(b) which coincides with E(θ|ξ = b) is a consequence of the normal correlation theorem. Let η(b) be another linear prediction of θ given ξ; then
E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T),
and by linearity of the predictions λ(·) and η(·), under the premise of the lemma, we get
E((Y − λ(X))(Y − λ(X))^T) = E((θ − λ(ξ))(θ − λ(ξ))^T) ⪯ E((θ − η(ξ))(θ − η(ξ))^T) = E((Y − η(X))(Y − η(X))^T),
which proves the optimality of λ(X). Finally,
E(λ(X)) = E(λ(ξ)) = E(E(θ|ξ)) = E(θ) = E(Y).
Let us consider the following example (cf. Exercise 2.15):
Example 3.2 Let X and Y be r.v. such that the couple (X, Y) is normally distributed with means µX = E(X), µY = E(Y), variances σX² = Var(X) > 0, σY² = Var(Y) > 0 and correlation ρ = ρXY, |ρ| < 1.
Putting Σ = Var((X, Y)^T), we get
Σ = [ σX²  ρσXσY ; ρσXσY  σY² ]
with Det(Σ) = σX² σY² (1 − ρ²) > 0. Observe that if in Theorem 3.5 ξ = X and θ = Y, then
Σθξ = Σξθ = ρσXσY,   Σθξ Σξξ^{-1} = ρ σY / σX.
So the regression function satisfies
m(x) = E(Y|X = x) = µY + ρ (σY/σX)(x − µX),   γ = γ(x) = V(Y|X = x) = σY²(1 − ρ²),
and the conditional density of Y given X is
f_{Y|X}(y|x) = (1/√(2πγ)) exp(−(y − m(x))² / (2γ))
(the density of the distribution N(m(x), γ(x))).
Let us consider the particular case µX = µY = 0 and σX = σY = 1. Then
Σ = [ 1  ρ ; ρ  1 ],   Σ^{-1} = (1 − ρ²)^{-1} [ 1  −ρ ; −ρ  1 ].
The eigenvectors of Σ (and of Σ^{-1}) are
(1, 1)^T and (−1, 1)^T,
corresponding to the eigenvalues, respectively,
λ1 = 1 + ρ and λ2 = 1 − ρ.
The normalized eigenvectors are γ1 = 2^{-1/2}(1, 1)^T and γ2 = 2^{-1/2}(−1, 1)^T. If we put Γ = (γ1, γ2), then we have the eigenvalue decomposition:
Σ = Γ Λ Γ^T = Γ [ 1+ρ  0 ; 0  1−ρ ] Γ^T.
Let us consider the concentration ellipsoids of the joint density of (X, Y). Let, for C > 0,
EC = {x ∈ R² : x^T Σ^{-1} x ≤ C²} = {x ∈ R² : |y|² ≤ C²},
where y = Σ^{-1/2} x. We set y = (y1, y2)^T, x = (x1, x2)^T; then
y1 = (x1 + x2) / √(2(1 + ρ)),   y2 = (x1 − x2) / √(2(1 − ρ)).
In this case the concentration ellipse becomes
EC = {x^T Σ^{-1} x ≤ C²} = { (x1 + x2)² / (2(1 + ρ)) + (x1 − x2)² / (2(1 − ρ)) ≤ C² }.
Concentration ellipses: X = (ξ1, ξ2), Y = (η1, η2), where Y = Σ^{-1/2} X (left: ρ = 0.75, right: ρ = −0.5).
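A Monte Carlo sketch of Example 3.2 in the standard case µX = µY = 0, σX = σY = 1 (the values ρ = 0.75, x0 = 1 and the width of the conditioning window are arbitrary choices):

```python
# Check E(Y | X = x) = rho * x and V(Y | X = x) = 1 - rho^2 by conditioning on a small window.
import numpy as np

rho, x0 = 0.75, 1.0
rng = np.random.default_rng(5)
X = rng.standard_normal(2_000_000)
Y = rho * X + np.sqrt(1 - rho ** 2) * rng.standard_normal(2_000_000)
sel = np.abs(X - x0) < 0.02               # condition on X being close to x0
print(Y[sel].mean(), rho * x0)            # regression function m(x0)
print(Y[sel].var(), 1 - rho ** 2)         # conditional variance gamma
```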
3.6.1 The Kalman-Bucy filter
Suppose that the sequence of (couples of) random vectors (θ, ξ) = ((θn), (ξn)), n = 0, 1, 2, ..., θn = (θ1(n), ..., θl(n))^T ∈ R^l and ξn = (ξ1(n), ..., ξk(n))^T ∈ R^k, is generated by the recursive equations
θ_{n+1} = a_{n+1} θn + b_{n+1} ε^{(0)}_{n+1},   ξ_{n+1} = A_{n+1} θn + B_{n+1} ε^{(1)}_{n+1},   (3.12)
with initial conditions (θ0, ξ0).
Here ε^{(0)}_n ∈ R^l and ε^{(1)}_n ∈ R^k are independent normal vectors, ε^{(0)}_1 ∼ Nl(0, I) and ε^{(1)}_1 ∼ Nk(0, I); an, bn, An and Bn are deterministic matrices of size, respectively, l × l, l × l, k × l and k × k. We suppose that the matrices Bn are of full rank, and that the initial conditions (θ0, ξ0) are independent of the sequences (ε^{(0)}_n) and (ε^{(1)}_n).
In the sequel we use the notation ξ0^n for the “long” vector ξ0^n = (ξ0^T, ..., ξn^T)^T.
First, observe that if E(|θ0|² + |ξ0|²) < ∞, then for all n ≥ 0, E(|θn|² + |ξn|²) < ∞. If we assume, in addition, that the couple (θ0, ξ0) is a normal vector, then we can easily verify (all θn and ξn are linear functions of the Gaussian vectors (θ0, ξ0), ε^{(0)}_i and ε^{(1)}_i, i = 1, ..., n) that the “long” vector Z^T = (θ0^T, ξ0^T, ..., θn^T, ξn^T) is normal for each n ≥ 0. We can thus apply the normal correlation theorem to compute the best prediction of the sequence (θi), 0 ≤ i ≤ n, given (ξi), 0 ≤ i ≤ n.
This computation may become rather expensive if we want to build the prediction for large n. This observation is not quite valid today, but in the 50-60's, memory and processing power requirements were important factors, especially for “onboard” computations. This motivated the search for “cheap” algorithms for computing best predictions, which resulted in 1960 in the discovery of the Kalman-Bucy filter, which computes the prediction in a fully recursive way. The aim of the next exercises is to obtain the recursive equations of the Kalman filter – recursive formulas for
mn = E(θn|ξ0^n),   γn = V(θn|ξ0^n).
This problem, extremely complicated in the general setting, allows for a simple solution if we suppose that the conditional distribution P(θ0 ≤ a|ξ0) of the vector θ0 given ξ0 is normal (a.s.), which we will assume in this section. Our first objective is to show that, under the conditions above, the sequence (θ, ξ) is conditionally Gaussian; in other words, the conditional c.d.f.
P(ξ_{n+1} ≤ x, θ_{n+1} ≤ a | ξ0^n)
coincides (a.s.) with the c.d.f. of an (l + k)-dimensional normal distribution with mean and covariance matrix which depend on ξ0^n.
Exercise 3.7
Let ζn = (ξn^T, θn^T)^T, t ∈ R^{k+l}. Verify that the conditional c.d.f.
P(ζ_{n+1} ≤ t | ξ0^n, θn = u)
is (a.s.) normal with mean Mu, where M is a (k + l) × l matrix, and a (k + l) × (k + l) covariance matrix Σ, both to be determined.
Let us suppose now that for n ≥ 0 the conditional c.d.f.
P(ζn ≤ t | ξ0^{n−1})
is (a.s.) that of an (l + k)-dimensional normal distribution with mean and covariance matrix depending on ξ0^{n−1}.
Exercise 3.8
Use the “conditional version” of the normal correlation theorem (cf Remark 4 and display (3.11))
to show that the conditional c.d.f.
P(ζ_{n+1} ≤ t | ξ0^n),  n ≥ 0,
are (a.s.) normal with
E(ζ_{n+1}|ξ0^n) = [ A_{n+1} mn ; a_{n+1} mn ],
V(ζ_{n+1}|ξ0^n) = [ B_{n+1}B_{n+1}^T + A_{n+1}γn A_{n+1}^T   A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T   b_{n+1}b_{n+1}^T + a_{n+1}γn a_{n+1}^T ],
where mn = E(θn|ξ0^n) and γn = V(θn|ξ0^n).
Hint: compute the conditional characteristic function
E(exp(i t^T ζ_{n+1}) | ξ0^n, θn),  t ∈ R^{l+k},
then use the fact that, in the premises of the exercise, the distribution of θn given ξ0^{n−1} and ξn is conditionally normal with parameters mn and γn.
Exercise 3.9
Apply the (conditional) normal correlation theorem to obtain the recursive relations:
m_{n+1} = a_{n+1} mn + a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} (ξ_{n+1} − A_{n+1} mn),   (3.13)
γ_{n+1} = a_{n+1} γn a_{n+1}^T + b_{n+1} b_{n+1}^T − a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} A_{n+1} γn a_{n+1}^T
(since B_{n+1} is of full rank, so is the matrix B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T, which is therefore invertible).
Show that ξ_{n+1} and
η = θ_{n+1} − a_{n+1} γn A_{n+1}^T (B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T)^{-1} (ξ_{n+1} − A_{n+1} mn)
are independent given ξ0^n.
Example 3.3 Let X = (Xn) and Y = (Yn) be random sequences such that
X_{n+1} = c Xn + b ε^{(0)}_{n+1},   Y_{n+1} = Xn + B ε^{(1)}_{n+1},   (3.14)
where c, b and B are reals, (ε^{(0)}) and (ε^{(1)}) are sequences of i.i.d. N(0, 1) r.v. which are mutually independent. Let us compute mn = E(Xn|Y0^n).
We can think of Xn as the “useful signal” and of B ε^{(1)}_{n+1} as the observation noise, and we want to recover Xn given the observations Y0, ..., Yn. The Kalman equations (3.13) allow us to easily obtain the recursive prediction:
mn = c m_{n−1} + (c γ_{n−1} / (B² + γ_{n−1})) (Yn − m_{n−1}),
γn = c² γ_{n−1} + b² − c² γ_{n−1}² / (B² + γ_{n−1}).
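A minimal sketch of these recursions (not a full treatment): the constants c, b, B, the horizon and the initialization m0 = 0, γ0 = 1 (i.e. X0 ∼ N(0, 1) with Y0 carrying no information) are assumptions made only for the illustration; the limiting γ is compared with the Riccati equation of Exercise 3.10 below.

```python
# Scalar Kalman filter of Example 3.3.
import numpy as np

c, b, B = 0.9, 0.5, 1.0          # model constants (arbitrary choice)
n_steps = 2000
rng = np.random.default_rng(6)

# simulate X_{n+1} = c X_n + b eps0,  Y_{n+1} = X_n + B eps1
X = np.zeros(n_steps + 1)
Y = np.zeros(n_steps + 1)
X[0] = rng.standard_normal()
for n in range(n_steps):
    X[n + 1] = c * X[n] + b * rng.standard_normal()
    Y[n + 1] = X[n] + B * rng.standard_normal()

# filtering recursions; m0 = 0, gamma0 = 1 assumed
m, g = 0.0, 1.0
for n in range(1, n_steps + 1):
    gain = c * g / (B ** 2 + g)
    m = c * m + gain * (Y[n] - m)                               # m_n
    g = c ** 2 * g + b ** 2 - c ** 2 * g ** 2 / (B ** 2 + g)    # gamma_n

# gamma_n converges to the positive root of the Riccati equation of Exercise 3.10
p_ = B ** 2 * (1 - c ** 2) - b ** 2
gamma_lim = (-p_ + np.sqrt(p_ ** 2 + 4 * (b * B) ** 2)) / 2
print(g, gamma_lim)
print(X[-1], m)                  # last state and its filtered estimate
```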
Exercise 3.10
Show that if b ≠ 0, B ≠ 0 and |c| < 1, then the “limit error” γ = lim_{n→∞} γn of the Kalman filter exists and is a positive solution of the quadratic (Riccati) equation
γ² + (B²(1 − c²) − b²) γ − b² B² = 0.
Example 3.4 Let θ ∈ R^l be a normal vector with E(θ) = 0 and V(θ) = γ (we assume that γ is known). We look for the best prediction of θ given observations of the k-variate sequence (ξn):
ξ_{n+1} = A_{n+1} θ + B_{n+1} ε^{(1)}_{n+1},   ξ0 = 0,
where A_{n+1}, B_{n+1} and ε^{(1)}_{n+1} satisfy the same hypotheses as in (3.12).
From (3.13) we obtain
m_{n+1} = mn + γn A_{n+1}^T [B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T]^{-1} (ξ_{n+1} − A_{n+1} mn),
γ_{n+1} = γn − γn A_{n+1}^T [B_{n+1}B_{n+1}^T + A_{n+1} γn A_{n+1}^T]^{-1} A_{n+1} γn.   (3.15)
The solutions to (3.15) are given by
m_{n+1} = [I + γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} A_{m+1}]^{-1} γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} ξ_{m+1},   (3.16)
γ_{n+1} = [I + γ Σ_{m=0}^n A_{m+1}^T (B_{m+1}B_{m+1}^T)^{-1} A_{m+1}]^{-1} γ,
where I is the l × l identity matrix.
Exercise 3.11
Derive the formula (3.16).
Hint: do Google search for matrix inversion lemma, then apply the lemma to
T
[γn−1 + An+1 (Bn+1 Bn+1
)−1 ATn+1 ]−1 .
3.6.2 Solutions to exercises of Section 3.6.1
Exercise 3.7
One can easily verify that (a.s.)
E(θ_{n+1}|ξ0^n, θn = u) = a_{n+1} u,   E(ξ_{n+1}|ξ0^n, θn = u) = A_{n+1} u,
V(θ_{n+1}|ξ0^n, θn = u) = b_{n+1} b_{n+1}^T,   V(ξ_{n+1}|ξ0^n, θn = u) = B_{n+1} B_{n+1}^T,
and
C(θ_{n+1}, ξ_{n+1}|ξ0^n, θn = u) = 0,
thus the conditional distribution of ζ_{n+1} is (a.s.) normal with
E(ζ_{n+1}|ξ0^n, θn = u) = [ A_{n+1} u ; a_{n+1} u ],   V(ζ_{n+1}|ξ0^n, θn = u) = [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ].
Exercise 3.8
In the premises of the exercise, by the normal correlation theorem, the distribution of θn given ξ0^n is normal with parameters mn = E(θn|ξ0^n) and γn = V(θn|ξ0^n) (the latter not depending on ξ0^n). We observe that (a.s.)
E(exp(i t^T ζ_{n+1}) | ξ0^n, θn) = exp( i t^T [ A_{n+1} θn ; a_{n+1} θn ] − (1/2) t^T [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ] t ),
and, because
E( exp( i t^T [ A_{n+1} θn ; a_{n+1} θn ] ) | ξ0^n ) = exp( i t^T [ A_{n+1} mn ; a_{n+1} mn ] − (1/2) t^T [ A_{n+1}γn A_{n+1}^T  A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T  a_{n+1}γn a_{n+1}^T ] t ),
we conclude that
E(exp(i t^T ζ_{n+1}) | ξ0^n) = exp( i t^T [ A_{n+1} mn ; a_{n+1} mn ] − (1/2) t^T [ B_{n+1}B_{n+1}^T  0 ; 0  b_{n+1}b_{n+1}^T ] t − (1/2) t^T [ A_{n+1}γn A_{n+1}^T  A_{n+1}γn a_{n+1}^T ; a_{n+1}γn A_{n+1}^T  a_{n+1}γn a_{n+1}^T ] t ).
Exercise 3.9
Just apply the (conditional) normal correlation theorem.
3.7 Exercises
Exercise 3.12
Let Q be a q × p matrix with q > p of rank p.
1o . Show that P = Q(QT Q)−1 QT is a projector.
2o . On what subspace L does P project?
Exercise 3.13
Let (X, Y ) be a random vector with density
f (x, y) = C exp(−x2 + xy − y 2 /2).
1o . Show that (X, Y ) is a normal vector. Compute the expectation, the covariance matrix and
the characteristic function of (X, Y ). Compute the correlation coefficient ρXY of X and Y .
2o . What is the distribution of X? of Y ? of 2X − Y ?
3o . Show that X and Y − X are independent random variables with the same distribution.
Exercise 3.14
Let X ∼ N(0, 1) and let Z be a r.v. taking values −1 and 1 with probability 1/2 each. We suppose
that X and Z are independent, we set Y = ZX.
1o . Show that the distribution of Y is N (0, 1).
2o . Compute the covariance and the correlation of X and Y .
3o . Compute P (X + Y = 0).
4o . Is (X, Y ) a normal vector?
Exercise 3.15
Let ξ and η be independent r.v. with uniform distribution U[0, 1]. Prove that
X = √(−2 ln ξ) cos(2πη),   Y = √(−2 ln ξ) sin(2πη)
satisfy Z = (X, Y)^T ∼ N2(0, I).
Hint: let (X, Y) ∼ N2(0, I). Change to polar coordinates.
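A short numerical sketch of the transformation of Exercise 3.15 (sample size and seed are arbitrary):

```python
# Box-Muller: map two independent U[0,1] variables to two independent N(0,1) variables.
import numpy as np

rng = np.random.default_rng(7)
xi, eta = rng.random(500_000), rng.random(500_000)
X = np.sqrt(-2 * np.log(xi)) * np.cos(2 * np.pi * eta)
Y = np.sqrt(-2 * np.log(xi)) * np.sin(2 * np.pi * eta)
print(X.mean(), X.var(), Y.mean(), Y.var())   # approximately 0, 1, 0, 1
print(np.corrcoef(X, Y)[0, 1])                # approximately 0 (X and Y are independent)
```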
Exercise 3.16
Let Z = (Z1, Z2, Z3)^T be a normal vector with density f,
f(z1, z2, z3) = (1 / (4(2π)^{3/2})) exp( −(6z1² + 6z2² + 8z3² + 4z1z2) / 32 ).
1o. What is the distribution of (Z2, Z3) given Z1 = z1?
Let X and Y be random vectors defined by
X = [ 2  2  2 ; 0  2  5 ; 0  4  10 ; 1  2  4 ] Z   and   Y = [ 1  1  1 ; 1  0  0 ] Z.
2o. Is the vector (X, Y), of dimension 6, Gaussian? Does the vector X have a density? Does the vector Y have a density?
3o. Are the vectors X and Y independent?
4o. What are the distributions of the components of Z?
Exercise 3.17
Let (X, Y, Z)^T be a normal vector with zero mean and covariance matrix
Σ = [ 2  1  1 ; 1  2  1 ; 1  1  2 ].
1o . We set U = −X + Y + Z, V = X − Y + Z, W = X + Y − Z. What is the distribution of
the vector (U, V, W )T ?
2o . What is the density of the random variable T = U 2 + V 2 + W 2 ?
Exercise 3.18
Let the vector (X, Y) be normal N2(µ, Σ) with mean and covariance matrix:
µ = (0, 2)^T,   Σ = [ 4  1 ; 1  8 ].
1o. What is the distribution of X + 4Y?
2o. What is the joint distribution of Y − 2X and X + 4Y?
Exercise 3.19
Let X be a zero mean normal vector of dimension n with covariance matrix Σ > 0. What is the
distribution of the r.v. X T Σ−1 X?
Exercise 3.20
We model the height H of a male person in population P by the Gaussian distribution N (172, 49)
(units: cm). In this model:
1o . What is the probability for a man to be of height ≤160cm?
2o. We assume that there are approximately 15 million adult men in P; provide an estimate of the number of men of height ≥ 200 cm.
3o. What is the probability for 10 men randomly drawn from P to all have heights in the interval [168, 188] cm?
The height H′ of females of P is modeled by the Gaussian distribution N(162, 49) (units: cm).
4o. What is the probability for a male chosen at random to be taller than a randomly chosen female?
We model the heights (H, H′) of a man and a woman in a couple by a normal vector, the correlation coefficient ρ between the heights of a woman and a man being 0.4 (respectively, −0.4).
5o. Compute the probability p (respectively, p′) that the height of a man in a couple is larger than that of a woman (before making the computation, what would be your guess: in which order should one rank p and p′?).
Exercise 3.21
Let Y = (η1, ..., ηn)^T be a normal vector, Y ∼ Nn(µ, σ²I), let Hn−J be a subspace of R^n of dimension n − J, J > 0, and let Hn−J−M be a subspace of Hn−J of dimension n − J − M, M > 0.
We set
dJ = min_{y ∈ Hn−J} |Y − y|   and   dJ+M = min_{y ∈ Hn−J−M} |Y − y|.
Verify that
1. if µ ∈ Hn−J then the random variable dJ²/σ² follows the χ²_J distribution (with J degrees of freedom);
2. if, in addition, µ ∈ Hn−J−M, then
(J/M) (d²_{J+M} − d²_J) / d²_J ∼ F_{M,J}
(Fisher distribution with (M, J) degrees of freedom).
(Fisher distribution with (M, J) degrees of freedom).