Econ 514: Probability and Statistics
Lecture 6: Special probability distributions

Summarizing probability distributions
• Let X be a random variable with probability distribution $P_X$.
• We consider two types of probability distributions:
– Discrete distributions: $P_X$ is absolutely continuous with respect to the counting measure.
– Continuous distributions: $P_X$ is absolutely continuous with respect to the Lebesgue measure.
• In both cases there is a density $f_X$.
• Initially we consider scalar X and later random vectors X.

How do we summarize the probability distribution $P_X$?
• Obvious method: make a graph of the density.
• Figure 1 for the discrete case and figures 2, 3 for the continuous case.
• The graph can be used to visualize the support and to compute probabilities.
• Intervals where $f_X$ is large have a high probability.

Summarizing using moments
• We can also try to summarize $P_X$ by numbers.
• This never gives a complete picture, because we summarize a function $f_X$ by a few numbers.
• The obvious choice is $E(X)$, the expected value of X and the mean of the distribution $P_X$.
• Interpretation: average over repetitions.
– Repeat the random experiment N times and call the outcomes $X_1, X_2, X_3, \ldots, X_N$.
– If N is large then $\frac{1}{N}\sum_{i=1}^N X_i \approx E(X)$, i.e. the mean is the average over repetitions.
• Interpretation: optimal prediction.
– Consider a predictor m of the outcome X.
– The prediction error of this predictor is $X - m$.
– Assume that the loss function is proportional to $(X - m)^2$, i.e. proportional to the squared prediction error.
– The optimal predictor minimizes the expected loss
$$E((X-m)^2) = E\left[\left((X - E(X)) + (E(X) - m)\right)^2\right] = E\left[(X - E(X))^2 + 2(X - E(X))(E(X) - m) + (E(X) - m)^2\right] = E\left[(X - E(X))^2\right] + (E(X) - m)^2,$$
which is minimal if $m = E(X)$.
• Special case: if $f_X$ is symmetric around $\mu$, i.e. $f_X(\mu + x) = f_X(\mu - x)$, then, provided $E(|X|) < \infty$, we have $E(X) = \mu$.
• $E(X)$ can be outside the support of X: see figure 3. Implication for prediction?
• The mean $E(X)$ is a measure of location of the distribution $P_X$.
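The optimal-prediction argument can be checked numerically; a minimal sketch, where the discrete distribution below is an illustrative choice, not from the lecture:

```python
# Check numerically that m = E(X) minimizes the expected squared
# prediction error E[(X - m)^2]. Support and probabilities are
# illustrative choices.
values = [0.0, 1.0, 3.0]
probs = [0.5, 0.3, 0.2]

mean = sum(x * p for x, p in zip(values, probs))  # E(X) = 0.9

def expected_loss(m):
    """Expected squared prediction error E[(X - m)^2]."""
    return sum((x - m) ** 2 * p for x, p in zip(values, probs))

# Scan a grid of candidate predictors; none should beat m = E(X).
grid = [i / 100 for i in range(-100, 401)]
best = min(grid, key=expected_loss)

assert abs(best - mean) < 0.01
assert all(expected_loss(m) >= expected_loss(mean) - 1e-12 for m in grid)
```

The grid search confirms that the quadratic expected loss is minimized at the mean, as the algebra above shows.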
• A measure of dispersion is the variance of X, defined by $\mathrm{Var}(X) = E\left[(X - E(X))^2\right]$.
• The interpretation is clear in the discrete case:
$$\mathrm{Var}(X) = \sum_i (x_i - E(X))^2 f_X(x_i)$$
with $x_i - E(X)$ the deviation from the mean and $f_X(x_i)$ the weight, i.e. the probability of that deviation.
• We have
$$\mathrm{Var}(X) = E\left[(X - E(X))^2\right] = E\left[X^2 - 2XE(X) + E(X)^2\right] = E(X^2) - 2E(X)^2 + E(X)^2 = E(X^2) - E(X)^2,$$
which is useful in computations.
• Often we use $\mu$ or $\mu_X$ for $E(X)$ and $\sigma^2$ or $\sigma^2_X$ for $\mathrm{Var}(X)$.
• The standard deviation of (the distribution of) X, often denoted by $\sigma_X$, is defined by $\sigma_X = \sqrt{\mathrm{Var}(X)}$.
• Example: picking a number at random from $[-1, 1]$.
– $f_X(x) = \frac{1}{2} I_{[-1,1]}(x)$
– By symmetry $E(X) = 0$.
– The variance is equal to $E(X^2)$:
$$\mathrm{Var}(X) = \sigma^2_X = \int_{-1}^{1} \frac{x^2}{2}\,dx = \frac{1}{6}x^3 \Big|_{-1}^{1} = \frac{1}{3}$$
– Standard deviation: $\sigma_X = \frac{1}{\sqrt{3}}$.
• Mean and variance are determined by $E(X)$ and $E(X^2)$.
• These are the first two moments of the distribution of X.
• In general the k-th moment, often denoted by $\mu_k$, is $\mu_k = E(X^k)$.
• We can also define the central moments, i.e. the moments around $E(X) = \mu$: $m_k = E\left[(X - \mu)^k\right]$.
• The standardized third central moment is called the skewness and the standardized fourth the kurtosis.
• If the distribution is symmetric then the skewness is 0 (see exercise). Kurtosis is a measure of peakedness (useful if the distribution is symmetric).
• More moments means more knowledge about the distribution. What if we know all moments $\mu_k$, $k = 1, 2, \ldots$?
• A useful tool to obtain moments is the moment generating function (mgf) of X, denoted by $M_X(t)$ and defined by $M_X(t) = E(e^{tX})$, if this expectation exists for $-h < t < h$, some $h > 0$.
• Obviously $M_X(0) = 1$.
• Take the derivative with respect to t and interchange integration and differentiation:
$$\frac{dM_X}{dt}(t) = \int_{-\infty}^{\infty} x e^{tx} f_X(x)\,dx$$
• For a non-negative random variable this interchange is allowed if $E(Xe^{hX}) < \infty$. Why?
• Hence $\frac{dM_X}{dt}(0) = E(X)$.
• In general
$$\frac{d^k M_X}{dt^k}(t) = \int_{-\infty}^{\infty} x^k e^{tx} f_X(x)\,dx$$
so that $\frac{d^k M_X}{dt^k}(0) = E(X^k)$.
• The moments $E(X^k)$ do not uniquely determine the distribution of X. Casella and Berger give a counterexample.
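The link between mgf derivatives at 0 and moments can be illustrated numerically with finite differences; a sketch for an illustrative discrete distribution (not from the lecture):

```python
from math import exp

# Illustrate M'_X(0) = E(X) and M''_X(0) = E(X^2) by finite
# differences of the mgf. Values and probabilities are illustrative.
values = [0, 1, 2]
probs = [0.5, 0.3, 0.2]

def mgf(t):
    """M_X(t) = E(e^{tX}) for the discrete distribution above."""
    return sum(p * exp(t * x) for x, p in zip(values, probs))

mean = sum(x * p for x, p in zip(values, probs))        # E(X)   = 0.7
second = sum(x**2 * p for x, p in zip(values, probs))   # E(X^2) = 1.1

h = 1e-4
d1 = (mgf(h) - mgf(-h)) / (2 * h)              # central difference ~ M'(0)
d2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2    # second difference  ~ M''(0)

assert abs(d1 - mean) < 1e-6
assert abs(d2 - second) < 1e-4
```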
• With some further assumptions the moments do determine the distribution:
– If the distributions of X and Y have bounded support, then they are the same if and only if all the moments are the same.
– If the moment generating functions $M_X, M_Y$ exist and are equal for $-h < t < h$, then X and Y have the same distribution.
• We can also consider the characteristic function $m_X(t) = E(e^{itX})$.
• Because $e^{itx} = \cos(tx) + i\sin(tx)$, the characteristic function always exists.
• There is a 1-1 correspondence between characteristic functions and distributions.

Special distributions
• There is a catalogue of 'standard' distributions $P_X$ of random variables X.
• Often a random experiment that we encounter in practice is such that the associated random variable X of interest has such a standard distribution.
• Choosing a standard distribution is the selection of a mathematical model for a random experiment, described by the probability space $(\Re, \mathcal{B}, P_X)$.
• Often $P_X$ depends on parameters that have to be chosen in order to have a fully specified mathematical model.

Description of special distributions:
(i) In what type of random experiments can the standard distribution be used?
(ii) Mean, variance, mgf (if it exists).
(iii) Shape of the density, i.e. a graph of the density.

Discrete distributions

Discrete uniform distribution
• Consider a random experiment with a finite number of outcomes that, without loss of generality, can be labeled $1, \ldots, N$.
• If the outcomes are equally likely, $P_X$ has a density with respect to the counting measure
$$f_X(x) = \Pr(X = x) = \frac{1}{N}, \quad x = 1, \ldots, N,$$
and $f_X(x) = 0$ elsewhere.
• This is the discrete uniform distribution.
• This distribution has one parameter, N.
• Moments etc. only have meaning if the outcomes $1, \ldots, N$ are not just labels, but a count.
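When the outcomes are genuine counts, the moments of the discrete uniform distribution, $E(X) = \frac{N+1}{2}$ and $\mathrm{Var}(X) = \frac{(N+1)(N-1)}{12}$ as derived on the following slides, are easy to verify by direct enumeration; a quick sketch with an illustrative N:

```python
# Direct enumeration check of the discrete uniform moments.
# N = 10 is an arbitrary illustrative choice.
N = 10
outcomes = range(1, N + 1)

mean = sum(outcomes) / N                            # E(X)
var = sum((k - mean) ** 2 for k in outcomes) / N    # Var(X)

assert abs(mean - (N + 1) / 2) < 1e-12              # (N+1)/2 = 5.5
assert abs(var - (N + 1) * (N - 1) / 12) < 1e-12    # (N^2-1)/12 = 8.25
```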
• Moment generating function:
$$M_X(t) = \frac{1}{N}\sum_{k=1}^{N} e^{tk} = \frac{1}{N}\, e^t\, \frac{1 - e^{tN}}{1 - e^t}$$
• Using $\sum_{k=1}^N k = \frac{N(N+1)}{2}$ and $\sum_{k=1}^N k^2 = \frac{N(N+1)(2N+1)}{6}$ we have
$$E(X) = \frac{1}{N}\sum_{k=1}^N k = \frac{N+1}{2}, \qquad E(X^2) = \frac{1}{N}\sum_{k=1}^N k^2 = \frac{(N+1)(2N+1)}{6},$$
so that
$$\mathrm{Var}(X) = E(X^2) - E(X)^2 = \frac{(N+1)(N-1)}{12}$$

Bernoulli distribution
• The random experiment has two outcomes that we label 0 and 1.
• Denote $\Pr(X = 1) = p$.
• $P_X$ has a density with respect to the counting measure
$$f_X(x) = p^x (1-p)^{1-x}, \quad x = 0, 1,$$
and $f_X(x) = 0$ elsewhere.
• This is the Bernoulli distribution.
• There is one parameter p with $0 \le p \le 1$.
• The mgf is $M_X(t) = pe^t + 1 - p$ and
$$E(X) = p, \qquad E(X^2) = E(X) = p, \qquad \mathrm{Var}(X) = E(X^2) - E(X)^2 = p - p^2 = p(1-p)$$

Binomial distribution
• Consider a sequence of independent Bernoulli random experiments (or trials).
• Define X as the number of 1-s in n trials.
• Consider the event $X = x$.
– For this event x trials must have outcome 1 and $n - x$ outcome 0.
– A sequence with x 1-s and $n - x$ 0-s is e.g. $1001\ldots1100$.
– The probability of this sequence is $p^x(1-p)^{n-x}$.
– There are $\binom{n}{x}$ sequences of 0-s and 1-s that have the same probability, so that
$$\Pr(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$$
– Hence $P_X$ has a density with respect to the counting measure
$$f_X(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n,$$
and $f_X(x) = 0$ elsewhere.
• This is the Binomial distribution. Notation: $X \sim B(n, p)$.
• Binomial formula:
$$(a + b)^n = \sum_{k=0}^n \binom{n}{k} a^k b^{n-k}$$
• We use this formula to establish:
– The density sums to 1:
$$\sum_{x=0}^n \binom{n}{x} p^x (1-p)^{n-x} = (p + (1-p))^n = 1$$
– The mgf is
$$M_X(t) = \sum_{x=0}^n e^{tx} \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=0}^n \binom{n}{x} \left(pe^t\right)^x (1-p)^{n-x} = \left(pe^t + 1 - p\right)^n$$
• Using the mgf we find
$$E(X) = \frac{dM_X}{dt}(0) = n\left(pe^t + 1 - p\right)^{n-1} pe^t \Big|_{t=0} = np$$
$$E(X^2) = \frac{d^2M_X}{dt^2}(0) = n\left(pe^t + 1 - p\right)^{n-1} pe^t \Big|_{t=0} + n(n-1)\left(pe^t + 1 - p\right)^{n-2} p^2 e^{2t} \Big|_{t=0} = np + n(n-1)p^2,$$
so that $\mathrm{Var}(X) = E(X^2) - E(X)^2 = np(1-p)$.
• Let $Y_k$ be the outcome of the k-th Bernoulli trial, so that $X = \sum_{k=1}^n Y_k$ with $Y_1, \ldots, Y_n$ stochastically independent.
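As an aside, the binomial formulas just derived (density summing to 1, mean np, variance np(1−p), and the mgf) can be sanity-checked numerically; a sketch with illustrative n and p:

```python
from math import comb, exp

# Sanity checks of the binomial results: pmf sums to 1, E(X) = np,
# Var(X) = np(1-p), and the mgf matches (p e^t + 1 - p)^n at a test
# point. n, p and t are arbitrary illustrative choices.
n, p = 10, 0.3
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

total = sum(pmf)
mean = sum(x * f for x, f in enumerate(pmf))
second = sum(x**2 * f for x, f in enumerate(pmf))

assert abs(total - 1) < 1e-12
assert abs(mean - n * p) < 1e-9
assert abs(second - mean**2 - n * p * (1 - p)) < 1e-9

t = 0.5
mgf_direct = sum(exp(t * x) * f for x, f in enumerate(pmf))
assert abs(mgf_direct - (p * exp(t) + 1 - p) ** n) < 1e-9
```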
• This implies that
$$E(X) = nE(Y_1) = np, \qquad \mathrm{Var}(X) = n\mathrm{Var}(Y_1) = np(1-p), \qquad M_X(t) = \left(M_{Y_1}(t)\right)^n = \left(pe^t + 1 - p\right)^n$$
• Shape of the density $f_X$:
– We have
$$\frac{f_X(x)}{f_X(x-1)} = 1 + \frac{(n+1)p - x}{x(1-p)}$$
– We conclude that $f_X$ is increasing for $x < (n+1)p$ and decreasing for $x > (n+1)p$.
– If $p > \frac{n}{n+1}$ then $f_X$ is increasing for $x = 0, \ldots, n$ and if $p < \frac{1}{n+1}$ then $f_X$ is decreasing for $x = 0, \ldots, n$. Otherwise $f_X$ first increases and then decreases.
– The value of x that maximizes $f_X$ is called the mode of the distribution of X.
– For the binomial distribution the mode is the largest integer not exceeding $(n+1)p$.
• The binomial distribution has two parameters, n (a positive integer) and p with $0 \le p \le 1$.
• Example: sampling.
– Let p be the fraction of households in the US with income less than \$15000 per year.
– Select n households at random from the population.
– Define X as the number of households among the n selected with income less than \$15000.
– The distribution of X is binomial if the selections of households are independent.
– This is true if the selection is done with replacement, and approximately true if the population is sufficiently large.
– Assume $n = 100$ and 16 households have an income less than \$15000.
– Now 16 is an estimate of $E(X) = np$ and this suggests that it is reasonable to guess that $\hat{p} = \frac{16}{n} = .16$, i.e. 16% of US households have an income less than \$15000.

Hypergeometric distribution
• In the example we assumed (counterfactually) that selection was with replacement.
• Now consider a population of size N from which we select a sample of size n without replacement. In the population, M households have an income of less than \$15000.
• X is the number of households among the n selected with income less than \$15000.
• $X = x$ iff
– we select x households from the M with an income of less than \$15000: this can be done in $\binom{M}{x}$ ways;
– we select the remaining $n - x$ households from the $N - M$ with an income greater than or equal to \$15000: this can be done in $\binom{N-M}{n-x}$ ways.
• The total number of selections (without replacement) of n households from the population of N households is $\binom{N}{n}$.
• Combining these results we have
$$\Pr(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$$
• The distribution $P_X$ has a density with respect to the counting measure
$$f_X(x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}, \quad x = 0, \ldots, n,$$
and $f_X(x) = 0$ otherwise.
• The distribution $P_X$ is the Hypergeometric distribution.
• It can be shown (see Casella and Berger) that
$$E(X) = n\frac{M}{N}, \qquad \mathrm{Var}(X) = \frac{N-n}{N-1}\, n\, \frac{M}{N}\left(1 - \frac{M}{N}\right)$$
• Compare these results to those for the binomial distribution.

Geometric distribution
• Consider a sequence of independent Bernoulli random experiments with probability of outcome 1 equal to p.
• Call outcome 1 a 'success' and outcome 0 a 'failure'.
• Define X as the number of experiments before the first success.
• $X = x$ iff the outcomes of $x + 1$ Bernoulli experiments are $000\ldots01$, with x leading 0-s.
• Hence $\Pr(X = x) = (1-p)^x p$.
• $P_X$ has a density with respect to the counting measure
$$f_X(x) = (1-p)^x p, \quad x = 0, 1, \ldots,$$
and $f_X(x) = 0$ otherwise.
• The distribution $P_X$ is called the Geometric distribution.
• This distribution has one parameter p with $0 \le p \le 1$.
• The mgf is
$$M_X(t) = E\left(e^{tX}\right) = \sum_{x=0}^{\infty} e^{tx} (1-p)^x p = p\sum_{x=0}^{\infty} \left((1-p)e^t\right)^x = \frac{p}{1 - (1-p)e^t}$$
(for $(1-p)e^t < 1$).
• From the mgf we find
$$E(X) = \frac{dM_X}{dt}(0) = \frac{1-p}{p}, \qquad E(X^2) = \frac{d^2M_X}{dt^2}(0) = \frac{1-p}{p^2} + \left(\frac{1-p}{p}\right)^2,$$
so that $\mathrm{Var}(X) = \frac{1-p}{p^2}$.
• Sometimes we define $X_1$ as the number of Bernoulli experiments needed for the first success.
• Then $X_1 = X + 1$ and e.g. $M_{X_1}(t) = E\left(e^{tX_1}\right) = e^t E\left(e^{tX}\right)$.
• Example of the geometric distribution:
– Consider a job seeker and let p be the probability of receiving a job offer in any week.
– The week in which the first offer is received has the distribution $P_{X_1}$.
– We have for $x_2 \ge x_1$, using $\Pr(X_1 > x) = (1-p)^x$ (the event $X_1 > x$ means the first x trials are all failures),
$$\Pr(X_1 > x_2 \mid X_1 > x_1) = \frac{\Pr(X_1 > x_2)}{\Pr(X_1 > x_1)} = \frac{(1-p)^{x_2}}{(1-p)^{x_1}} = (1-p)^{x_2 - x_1} = \Pr(X_1 > x_2 - x_1)$$
– Conclusion: if the job seeker has waited $x_1$ weeks, the probability that he/she has to wait more than another $x_2 - x_1$ weeks is the same as the probability of waiting more than $x_2 - x_1$ weeks from the beginning of the job search. The geometric distribution has no memory.

Negative binomial distribution
• Setup as for the geometric distribution.
• Define X as the number of failures before the r-th success.
• $X = x$ iff trial $x + r$ is a success (event A) and in the previous $x + r - 1$ trials there are $r - 1$ successes and x failures (event B).
• Because the events A and B depend on independent random variables, $P(A \cap B) = P(A)P(B)$.
• $P(A) = p$.
• A sequence with $r - 1$ successes and x failures has probability $p^{r-1}(1-p)^x$. Because we can choose the $r - 1$ successes among the $x + r - 1$ trials in $\binom{x+r-1}{r-1}$ ways, this is the number of such sequences. Hence
$$P(B) = \binom{x+r-1}{r-1} p^{r-1} (1-p)^x$$
• Combining:
$$\Pr(X = x) = p\binom{x+r-1}{r-1} p^{r-1} (1-p)^x = \binom{x+r-1}{r-1} p^r (1-p)^x$$
• $P_X$ has a density with respect to the counting measure
$$f_X(x) = \binom{x+r-1}{r-1} p^r (1-p)^x, \quad x = 0, 1, \ldots,$$
and $f_X(x) = 0$ otherwise.
• This is the Negative binomial distribution.
• The parameters are r (an integer) and p with $0 \le p \le 1$.

Poisson distribution
• The Poisson distribution applies to the number of occurrences of some event in a time interval of finite length, e.g. the number of job offers received by a job seeker in a month.
• Offers can arrive at any moment (in continuous time). Compare with the geometric distribution.
• Define $X(a, b)$ as the number of offers in $[a, b)$.
• The symbol $o(h)$ (small o of h) denotes any function with $\lim_{h \to 0} \frac{o(h)}{h} = 0$.
• Assumptions:
(i) $\Pr(X(s, s+h) = 1) = \lambda h + o(h)$
(ii) $\Pr(X(s, s+h) \ge 2) = o(h)$
(iii) $X(a, b)$ and $X(c, d)$ are independent if $[a, b) \cap [c, d) = \emptyset$.
• Consider $[0, t)$ and divide it into n intervals of length $h = \frac{t}{n}$.
Then (neglecting probabilities that are of order $o(h)$)
$$\Pr(X(0,t) = k) = \binom{n}{k} (\lambda h)^k (1 - \lambda h)^{n-k} = \binom{n}{k} \left(\lambda \frac{t}{n}\right)^k \left(1 - \lambda \frac{t}{n}\right)^{n-k} = \frac{(\lambda t)^k}{k!}\, \frac{n \cdots (n-k+1)}{n \cdots n}\, \left(1 - \frac{\lambda t}{n}\right)^{n-k}$$
Now
$$\lim_{n \to \infty} \frac{n \cdots (n-k+1)}{n \cdots n} = 1, \qquad \lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^{n-k} = \lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^{n} \lim_{n \to \infty} \left(1 - \frac{\lambda t}{n}\right)^{-k} = e^{-\lambda t}$$
• Conclusion: for $n \to \infty$, and writing X for $X(0, t)$,
$$\Pr(X = k) = e^{-\lambda t} \frac{(\lambda t)^k}{k!}$$
• The distribution $P_X$ has a density with respect to the counting measure
$$f_X(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x = 0, 1, \ldots,$$
and $f_X(x) = 0$ otherwise (here $\theta = \lambda t$).
• The distribution $P_X$ is the Poisson distribution. It has one parameter $\theta > 0$. Notation: $X \sim \text{Poisson}(\theta)$.
• The mgf is
$$M_X(t) = \sum_{x=0}^{\infty} e^{tx} e^{-\theta} \frac{\theta^x}{x!} = e^{-\theta} \sum_{x=0}^{\infty} \frac{(e^t \theta)^x}{x!} = e^{(e^t - 1)\theta},$$
so that $E(X) = \theta$, $E(X^2) = \theta^2 + \theta$, and $\mathrm{Var}(X) = \theta$. Note $E(X) = \mathrm{Var}(X)$.

Continuous distributions

Uniform distribution
• Random experiment: pick a number at random from $[a, b]$.
• $P_X([a, x]) = \frac{x - a}{b - a} = \int_a^x \frac{1}{b-a}\,ds$
• Hence $P_X$ has a density with respect to the Lebesgue measure
$$f_X(x) = \frac{1}{b - a}, \quad a \le x \le b,$$
and $f_X(x) = 0$ otherwise.
• $P_X$ is the Uniform distribution on $[a, b]$. Notation: $X \sim U[a, b]$.
• We have
$$M_X(t) = \frac{e^{bt} - e^{at}}{(b - a)t}, \qquad E(X) = \frac{a + b}{2}, \qquad \mathrm{Var}(X) = \frac{(b - a)^2}{12}$$
• Graph of the density: see figure.

Normal distribution
• The distribution $P_X$ has density with respect to the Lebesgue measure
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}, \quad -\infty < x < \infty$$
• The mgf is
$$M_X(t) = E\left(e^{tX}\right) = e^{t\mu} E\left(e^{t(X - \mu)}\right) = e^{t\mu} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{t(x - \mu)} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}\,dx = e^{t\mu} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}\left((x - \mu)^2 - 2\sigma^2 t(x - \mu)\right)}\,dx$$
Now
$$(x - \mu)^2 - 2\sigma^2 t(x - \mu) = (x - \mu)^2 - 2\sigma^2 t(x - \mu) + \sigma^4 t^2 - \sigma^4 t^2 = (x - \mu - \sigma^2 t)^2 - \sigma^4 t^2,$$
so that
$$M_X(t) = e^{t\mu + \frac{1}{2}\sigma^2 t^2} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu - \sigma^2 t)^2}\,dx = e^{t\mu + \frac{1}{2}\sigma^2 t^2}$$
• From the mgf: $E(X) = \mu$, $E(X^2) = \sigma^2 + \mu^2$, so that $\mathrm{Var}(X) = \sigma^2$.
• The distribution $P_X$ is the Normal distribution with mean $\mu$ and variance $\sigma^2$. Notation: $X \sim N(\mu, \sigma^2)$.
• Define $Z = \frac{X - \mu}{\sigma}$. Then $E(Z) = 0$, $\mathrm{Var}(Z) = 1$. Hence Z has a normal distribution with $\mu = 0$, $\sigma^2 = 1$.
This is the standard normal distribution, with density
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}, \quad -\infty < x < \infty,$$
and cdf
$$\Phi(x) = \int_{-\infty}^{x} \phi(s)\,ds$$
We can compute the probability of an interval $[a, b]$ with the standard normal cdf:
$$\Pr(a \le X \le b) = \Pr\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right)$$
• Shape of the normal density: bell curve.
• Why is the normal distribution so popular?
– Galton's quincunx or dropping board.
– Define $X_n$ as the position (relative to 0) after n rows of pins.
– If $Z_n$ takes values −1 and 1 and gives the direction at row n, then $X_n = Z_1 + \ldots + Z_n$.
– If n is large then $X_n$ has approximately the normal distribution.
– Central limit theorem: the sum of many independent small effects has approximately a normal distribution.

Exponential distribution
• Consider the waiting time to an event that can occur at any time (compare with the geometric distribution).
• Define the hazard or failure rate by
$$\Pr(\text{event in } [t, t+dt) \mid \text{event after } t) = \Pr(t \le X < t + dt \mid X \ge t) = \frac{f_X(t)\,dt}{1 - F_X(t)}$$
• Assume a constant hazard rate:
$$\frac{f_X(t)}{1 - F_X(t)} = \lambda$$
Then the solution, obtained by integration, is
$$f_X(t) = \lambda e^{-\lambda t}$$
• The distribution $P_X$ has a density with respect to the Lebesgue measure
$$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0,$$
and $f_X(x) = 0$ otherwise.
• $P_X$ is the Exponential distribution. There is one parameter $\lambda > 0$ and the notation is $X \sim \text{Exp}(\lambda)$.
• The mgf is $M_X(t) = \frac{\lambda}{\lambda - t}$ (for $t < \lambda$), and hence $E(X) = \frac{1}{\lambda}$, $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.
• Note for $t \ge s$:
$$\Pr(X > t \mid X > s) = \frac{\Pr(X > t)}{\Pr(X > s)} = \frac{e^{-\lambda t}}{e^{-\lambda s}} = e^{-\lambda(t - s)}$$
If you have waited s, the probability of an additional wait of $t - s$ is the same as if the wait had started at time 0.
• Like the geometric distribution, the exponential distribution has no memory.
• If X is the length of a human life, compare $\Pr(X > 40 \mid X > 30)$ and $\Pr(X > 70 \mid X > 60)$.
• Connection with the Poisson distribution: if the event is recurrent and the waiting time has an exponential distribution with parameter $\lambda$, then the number of occurrences in $[0, t]$ has a Poisson distribution with parameter $\lambda t$.

Gamma distribution
• The Gamma distribution is the distribution of $X = Y_1 + \ldots$
$+ \, Y_r$ with $Y_k$ independent exponential random variables with parameter $\lambda$.
• X is the waiting time to the r-th occurrence of the event. Compare with the negative binomial distribution.
• The distribution $P_X$ has a density with respect to the Lebesgue measure
$$f_X(x) = \frac{\lambda (\lambda x)^{r-1} e^{-\lambda x}}{\Gamma(r)}, \quad x \ge 0,$$
and $f_X(x) = 0$ otherwise, with $\Gamma$ the Gamma function: $\Gamma(r) = (r-1)!$ if r is a positive integer and otherwise it has to be computed numerically.
• This is the Gamma distribution with parameters $\lambda, r > 0$; r need not be an integer. Notation: $X \sim \Gamma(\lambda, r)$.
• The mgf is
$$M_X(t) = \left(\frac{\lambda}{\lambda - t}\right)^r, \quad t < \lambda,$$
so that $E(X) = \frac{r}{\lambda}$ and $\mathrm{Var}(X) = \frac{r}{\lambda^2}$.

Lognormal distribution
• Let $Y \sim N(\mu, \sigma^2)$ and define $X = e^Y$.
• The distribution $P_X$ has density
$$f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(\ln x - \mu)^2}, \quad x > 0,$$
and $f_X(x) = 0$ otherwise. Derive this density.
• This is the Lognormal distribution with parameters $\mu$ and $\sigma^2$.
• The mean and variance can be derived from the mgf of the normal distribution:
$$E(X) = e^{\mu + \frac{1}{2}\sigma^2}, \qquad \mathrm{Var}(X) = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}$$
What are $E(\ln X)$ and $\mathrm{Var}(\ln X)$?

Cauchy distribution
• A random variable that has a distribution with density with respect to the Lebesgue measure
$$f_X(x) = \frac{1}{\pi\beta\left(1 + \left(\frac{x - \alpha}{\beta}\right)^2\right)}, \quad -\infty < x < \infty,$$
has a Cauchy distribution with parameters $\alpha$ and $\beta > 0$.
• The density is symmetric around $\alpha$. This is the median of X.
• $E(X)$ does not exist and $\mathrm{Var}(X) = \infty$.
• The mgf is $\infty$ for $t \ne 0$.

Chi-squared distribution
• The chi-squared distribution is a special case of the $\Gamma$ distribution: set $r = \frac{k}{2}$ and $\lambda = \frac{1}{2}$.
• The density is
$$f_X(x) = \frac{1}{\Gamma\left(\frac{k}{2}\right) 2^{\frac{k}{2}}}\, x^{\frac{k}{2} - 1} e^{-\frac{x}{2}}, \quad x \ge 0,$$
and $f_X(x) = 0$ otherwise.
• The parameter k is called the degrees of freedom of the distribution.
• The chi-squared distribution is important because of the following result: if X has a standard normal distribution, then $Y = X^2$ has a chi-squared distribution with $k = 1$.
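Before deriving this formally, the claim can be illustrated by simulation: $Y = Z^2$ should have the chi-squared(1) mean 1 and variance 2 (a Gamma with $r = \lambda = \frac{1}{2}$, so mean $r/\lambda = 1$, variance $r/\lambda^2 = 2$). A sketch with an arbitrary seed and sample size:

```python
import random

# Simulate Y = Z^2 for Z standard normal and compare the sample
# mean and variance with the chi-squared(1) values E(Y) = 1 and
# Var(Y) = 2. Seed and sample size are illustrative choices.
random.seed(0)
n = 200_000
draws = [random.gauss(0.0, 1.0) ** 2 for _ in range(n)]

mean = sum(draws) / n
var = sum((y - mean) ** 2 for y in draws) / n

assert abs(mean - 1.0) < 0.05  # SE of the mean is about sqrt(2/n) ~ 0.003
assert abs(var - 2.0) < 0.2
```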
• We derive the mgf:
$$M_Y(t) = E\left(e^{tX^2}\right) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{tx^2 - \frac{1}{2}x^2}\,dx = \frac{1}{\sqrt{1 - 2t}} \int_{-\infty}^{\infty} \frac{1}{\frac{1}{\sqrt{1-2t}}\sqrt{2\pi}} e^{-\frac{1}{2}\frac{x^2}{1/(1-2t)}}\,dx = \frac{1}{\sqrt{1 - 2t}} = \left(\frac{\frac{1}{2}}{\frac{1}{2} - t}\right)^{\frac{1}{2}},$$
which is the mgf of the $\Gamma$ distribution with $r = \frac{1}{2}$ and $\lambda = \frac{1}{2}$, i.e. the chi-squared distribution with $k = 1$.

Exponential family of distributions
• The exponential family of densities consists of the densities that can be expressed as
$$f_X(x) = h(x)\, c(\theta)\, e^{\sum_{i=1}^k w_i(\theta) t_i(x)}, \quad -\infty < x < \infty$$
• Note that $c, w_i$, $i = 1, \ldots, k$, do not depend on x and $h, t_i$, $i = 1, \ldots, k$, do not depend on $\theta$. Here $\theta$ is the vector of parameters of the distribution.
• Why useful: we will see that if we have data from an exponential family distribution, the information can be summarized by $t_i$, $i = 1, \ldots, k$.
• Examples:
(i) Binomial distribution: for $x = 0, \ldots, n$
$$f_X(x) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} (1-p)^n e^{x \ln\left(\frac{p}{1-p}\right)}$$
Hence $h(x) = \binom{n}{x}$, $t(x) = x$, $c(\theta) = (1-p)^n$, $w(\theta) = \ln\left(\frac{p}{1-p}\right)$.
(ii) Normal distribution: for $-\infty < x < \infty$
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2} = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{\mu^2}{2\sigma^2}} e^{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x}$$
Hence $h(x) = 1$, $t_1(x) = x^2$, $w_1(\theta) = -\frac{1}{2\sigma^2}$, $t_2(x) = x$, $w_2(\theta) = \frac{\mu}{\sigma^2}$, $c(\theta) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{\mu^2}{2\sigma^2}}$.
• Other exponential family distributions: Poisson, exponential, Gamma.
• The density of the uniform distribution is
$$f_X(x) = \frac{1}{b - a} I(a \le x \le b)$$
The function $I(a \le x \le b)$ cannot be factorized into a function of x alone and a function of $a, b$ alone. Hence the uniform distribution does not belong to the exponential family.

Multivariate distributions: recapitulation
• Consider a probability space $(\Omega, \mathcal{A}, P)$ and define a vector of random variables or random vector X as a function $X : \Omega \to \Re^K$, i.e.
$$X(\omega) = \begin{pmatrix} X_1(\omega) \\ \vdots \\ X_K(\omega) \end{pmatrix}$$
• The distribution of X is a probability measure $P_X : \mathcal{B}^K \to [0, 1]$. This is usually called the joint distribution of the random vector X.
• We consider the case that $P_X$ has a density with respect to the counting measure (discrete distribution) or with respect to the Lebesgue measure (continuous distribution).
• The density $f_X(x_1, \ldots, x_K)$ is called the joint density of X.
• We have
$$\Pr(X_1 \in B) = P_X(B \times \Re \times \ldots \times \Re) = \int_B \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_K)\,dx_K \ldots dx_2\,dx_1 = \int_B f_{X_1}(x_1)\,dx_1$$
with
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f_X(x_1, x_2, \ldots, x_K)\,dx_2 \ldots dx_K$$
• $f_{X_1}$ is called the marginal density of $X_1$.
• The marginal density of $X_k$ for any k is obtained in the same way. For discrete distributions replace integration by summation.
• Consider the subvectors $X_1, \ldots, X_{K_1}$ and $X_{K_1+1}, \ldots, X_K$.
• The distributions of these subvectors are independent if and only if
$$f_X(x_1, \ldots, x_K) = f_{X_1 \ldots X_{K_1}}(x_1, \ldots, x_{K_1})\, f_{X_{K_1+1} \ldots X_K}(x_{K_1+1}, \ldots, x_K),$$
i.e. the joint density is the product of the marginal densities.
• The conditional distribution of $X_1, \ldots, X_{K_1}$ given $X_{K_1+1}, \ldots, X_K$ has density
$$f_{X_1 \ldots X_{K_1} \mid X_{K_1+1} \ldots X_K}(x_1, \ldots, x_{K_1} \mid x_{K_1+1}, \ldots, x_K) = \frac{f_X(x_1, \ldots, x_K)}{f_{X_{K_1+1} \ldots X_K}(x_{K_1+1}, \ldots, x_K)},$$
i.e. it is the ratio of the joint density and the marginal density of the variables on which we condition.
• If $\tilde{X}$ is any subvector of X that does not have $X_1$ as a component, then the conditional mean of $X_1$ given $\tilde{X} = \tilde{x}$ can be computed using the conditional density of $X_1$ given $\tilde{X}$:
$$E(X_1 \mid \tilde{X} = \tilde{x}) = \int_{\Re} x_1 f_{X_1 \mid \tilde{X}}(x_1 \mid \tilde{x})\,dx_1$$
For a discrete distribution replace integration by summation.
• The conditional variance of $X_1$ given $\tilde{X}$ is
$$\mathrm{Var}(X_1 \mid \tilde{X} = \tilde{x}) = E\left[\left(X_1 - E(X_1 \mid \tilde{X} = \tilde{x})\right)^2 \,\middle|\, \tilde{X} = \tilde{x}\right]$$
• We have
$$\mathrm{Var}(X_1 \mid \tilde{X} = \tilde{x}) = E\left[X_1^2 - 2X_1 E(X_1 \mid \tilde{X} = \tilde{x}) + E(X_1 \mid \tilde{X} = \tilde{x})^2 \,\middle|\, \tilde{X} = \tilde{x}\right] = E\left(X_1^2 \mid \tilde{X} = \tilde{x}\right) - 2E(X_1 \mid \tilde{X} = \tilde{x})^2 + E(X_1 \mid \tilde{X} = \tilde{x})^2 = E\left(X_1^2 \mid \tilde{X} = \tilde{x}\right) - E(X_1 \mid \tilde{X} = \tilde{x})^2$$
Compare this result to that for the unconditional variance.
• Law of iterated expectations:
$$E(X_1) = E_{\tilde{X}}\left(E_{X_1 \mid \tilde{X}}(X_1 \mid \tilde{X})\right)$$
Remember that on the rhs we just integrate $E(X_1 \mid \tilde{X} = \tilde{x})$ with respect to the distribution of $\tilde{X}$.
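The law of iterated expectations is easy to verify on a small discrete joint distribution; a sketch where the joint pmf table is an illustrative choice, not from the lecture:

```python
# Check E(X1) = E[E(X1|X2)] on a small discrete joint pmf.
# joint[(x1, x2)] = Pr(X1 = x1, X2 = x2); the table is illustrative.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.15,
    (2, 0): 0.05, (2, 1): 0.20,
}
assert abs(sum(joint.values()) - 1) < 1e-12

# Marginal pmf of X2.
marg_x2 = {}
for (x1, x2), p in joint.items():
    marg_x2[x2] = marg_x2.get(x2, 0.0) + p

# Conditional means E(X1 | X2 = x2) from the conditional density
# f(x1|x2) = f(x1, x2) / f_{X2}(x2).
cond_mean = {
    x2: sum(x1 * p for (x1, v2), p in joint.items() if v2 == x2) / marg_x2[x2]
    for x2 in marg_x2
}

# Integrate the conditional mean over the distribution of X2.
iterated = sum(cond_mean[x2] * marg_x2[x2] for x2 in marg_x2)
direct = sum(x1 * p for (x1, _), p in joint.items())

assert abs(iterated - direct) < 1e-12
```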
• For the variance note
$$E_{\tilde{X}}\left[\mathrm{Var}(X_1 \mid \tilde{X})\right] = E_{\tilde{X}}\left[E\left(X_1^2 \mid \tilde{X}\right)\right] - E_{\tilde{X}}\left[E(X_1 \mid \tilde{X})^2\right]$$
and, because $E(X_1 \mid \tilde{X})$ is a random variable that is a function of $\tilde{X}$,
$$\mathrm{Var}\left(E(X_1 \mid \tilde{X})\right) = E_{\tilde{X}}\left[E(X_1 \mid \tilde{X})^2\right] - \left(E_{\tilde{X}}\left[E(X_1 \mid \tilde{X})\right]\right)^2$$
We obtain, if we add these equations,
$$E\left[\mathrm{Var}(X_1 \mid \tilde{X})\right] + \mathrm{Var}\left(E(X_1 \mid \tilde{X})\right) = E(X_1^2) - (E(X_1))^2 = \mathrm{Var}(X_1)$$

Summary measures associated with multivariate distributions, i.e. the distribution of a random vector X
• Obvious: means and variances of the random variables in X (marginal means and variances).
• For random vectors we also consider the covariance of any two components of X, say $X_1$ and $X_2$:
$$\mathrm{Cov}(X_1, X_2) = E\left[(X_1 - E(X_1))(X_2 - E(X_2))\right]$$
• The covariance is informative on the relation between $X_1$ and $X_2$; e.g. for a discrete distribution
$$\mathrm{Cov}(X_1, X_2) = \sum_{x_1}\sum_{x_2} (x_1 - E(X_1))(x_2 - E(X_2))\, f_{X_1 X_2}(x_1, x_2)$$
If outcomes with $x_1 - E(X_1) > 0$ and $x_2 - E(X_2) > 0$, or $x_1 - E(X_1) < 0$ and $x_2 - E(X_2) < 0$ (deviations go in the same direction), are more likely than outcomes with $x_1 - E(X_1) > 0$ and $x_2 - E(X_2) < 0$, or $x_1 - E(X_1) < 0$ and $x_2 - E(X_2) > 0$ (deviations go in opposite directions), then $\mathrm{Cov}(X_1, X_2) > 0$.
• In that case there is a positive association between $X_1$ and $X_2$.
• If the second type of outcomes is more likely, $\mathrm{Cov}(X_1, X_2) < 0$ and the association is negative.
• Note for constants c, d:
$$\mathrm{Cov}(cX_1, dX_2) = cd\,\mathrm{Cov}(X_1, X_2),$$
so that the size of $\mathrm{Cov}(X_1, X_2)$ is not a good measure of the strength of the association.
• To measure the strength we define the correlation coefficient of $X_1, X_2$ by
$$\rho_{X_1 X_2} = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{\mathrm{Var}(X_1)}\sqrt{\mathrm{Var}(X_2)}}$$
• To derive its properties we need the Cauchy-Schwarz inequality:
$$|E(X_1 X_2)| \le \sqrt{E(X_1^2)}\sqrt{E(X_2^2)}$$
Proof: consider
$$0 \le E\left[(tX_1 + X_2)^2\right] = t^2 E(X_1^2) + 2t E(X_1 X_2) + E(X_2^2)$$
The rhs is a quadratic in t with at most one zero, so its discriminant satisfies
$$4E(X_1 X_2)^2 - 4E(X_1^2)E(X_2^2) \le 0$$
Dividing by 4 and taking the square root gives the inequality. If $E\left[(tX_1 + X_2)^2\right] = 0$ for some t, then $\Pr(tX_1 + X_2 = 0) = 1$, i.e.
the joint distribution is concentrated on the line $tx_1 + x_2 = 0$.
• Properties of the correlation coefficient:
– $\rho_{cX_1, dX_2} = \rho_{X_1 X_2}$ (for $cd > 0$).
– By Cauchy-Schwarz
$$|\mathrm{Cov}(X_1, X_2)| = \left|E\left[(X_1 - E(X_1))(X_2 - E(X_2))\right]\right| \le \sqrt{E\left[(X_1 - E(X_1))^2\right]}\sqrt{E\left[(X_2 - E(X_2))^2\right]},$$
so that $|\rho_{X_1 X_2}| \le 1$.
– Note $|\rho_{X_1 X_2}| = 1 \Leftrightarrow \Pr\left(X_2 - E(X_2) = t(X_1 - E(X_1))\right) = 1$ for some t. Hence $\Pr(X_2 = a + bX_1) = 1$ with $a = E(X_2) - tE(X_1)$ and $b = t$. Note that $\Pr\left(X_2 - E(X_2) = t(X_1 - E(X_1))\right) = 1$ implies $\mathrm{Cov}(X_1, X_2) = b\,\mathrm{Var}(X_1)$, so that $\mathrm{sign}(\rho_{X_1 X_2}) = \mathrm{sign}(\mathrm{Cov}(X_1, X_2)) = \mathrm{sign}(b)$. Conclusion: $|\rho_{X_1 X_2}| = 1 \Leftrightarrow \Pr(X_2 = a + bX_1) = 1$ for some $b \ne 0$. If $\rho_{X_1 X_2} = 1$ then $b > 0$ and if $\rho_{X_1 X_2} = -1$ then $b < 0$.
– The correlation coefficient is a measure of the strength of the association and the extreme values correspond to a linear relation.
• In the case of a multivariate distribution we organize the variances and covariances in a matrix, the variance(-covariance) matrix of X:
$$\mathrm{Var}(X) = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_K) \\ \mathrm{Cov}(X_1, X_2) & \mathrm{Var}(X_2) & & \vdots \\ \vdots & & \ddots & \vdots \\ \mathrm{Cov}(X_1, X_K) & \cdots & \cdots & \mathrm{Var}(X_K) \end{pmatrix}$$
Note that this is a symmetric $K \times K$ matrix: $\mathrm{Var}(X) = \mathrm{Var}(X)'$. Often we use the notation $\mathrm{Var}(X) = \Sigma$.
• Remember, if X is a K-vector, then
$$(X - \mu)(X - \mu)' = \begin{pmatrix} X_1 - \mu_1 \\ \vdots \\ X_K - \mu_K \end{pmatrix} \begin{pmatrix} X_1 - \mu_1 & \cdots & X_K - \mu_K \end{pmatrix} = \begin{pmatrix} (X_1 - \mu_1)^2 & \cdots & (X_1 - \mu_1)(X_K - \mu_K) \\ \vdots & \ddots & \vdots \\ (X_1 - \mu_1)(X_K - \mu_K) & \cdots & (X_K - \mu_K)^2 \end{pmatrix},$$
so that, denoting $\mu = E(X)$,
$$\Sigma = \mathrm{Var}(X) = E\left[(X - \mu)(X - \mu)'\right]$$

Linear and quadratic functions of random vectors
• If X is a random vector with K components and a is a K-vector of constants, we define the linear function of X
$$a'X = \sum_{k=1}^K a_k X_k$$
• Hence $E(a'X) = E\left(\sum_{k=1}^K a_k X_k\right)$
$= \sum_{k=1}^K a_k E(X_k) = a'E(X)$.
• Also
$$\mathrm{Var}(a'X) = E\left[(a'X - E(a'X))^2\right] = E\left[(a'X - a'\mu)(X'a - \mu'a)\right] = E\left[a'(X - \mu)(X - \mu)'a\right] = a'E\left[(X - \mu)(X - \mu)'\right]a = a'\Sigma a$$

Moment generating function of a joint distribution
• If X is a random vector, the mgf of X is
$$M_X(t) = E\left(e^{t_1 X_1 + \cdots + t_K X_K}\right),$$
if the mgf exists for $-h < t_k < h$, $k = 1, \ldots, K$. Note $t = (t_1, \ldots, t_K)'$.
• Note that
$$\frac{\partial^2 M_X}{\partial t_1 \partial t_2}(t) = E\left(X_1 X_2 e^{t_1 X_1 + \cdots + t_K X_K}\right),$$
so that $\frac{\partial^2 M_X}{\partial t_1 \partial t_2}(0) = E(X_1 X_2)$.
• This can be used to compute the covariance, because $\mathrm{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2)$.
• The mgf of the marginal distribution of $X_1$ is $M_{X_1}(t_1) = M_X(t_1, 0, \ldots, 0)$.

Special multivariate distributions

Multinomial distribution
• Binomial distribution: number of 1's in n independent Bernoulli experiments.
• Instead of a Bernoulli experiment with two outcomes, consider a random experiment with K outcomes $k = 1, \ldots, K$.
• An example is to pick a student at random from the class and record his/her nationality. Label the nationalities $k = 1, \ldots, K$.
• If the fraction with nationality k is $p_k$, then if the outcome of the random selection is Y we have $\Pr(Y = k) = p_k$, $k = 1, \ldots, K$, with $\sum_{k=1}^K p_k = 1$.
• Repeat this experiment n times and let the repetitions be independent.
• Define $X_k$ as the number of experiments with outcome k. Note $\sum_{k=1}^K X_k = n$, so that $X_K$ is determined by $X_1, \ldots, X_{K-1}$.
• Consider a sequence of n outcomes, e.g. outcomes $3, 4, 1, 1, \ldots, K-1, K$ for experiments $1, 2, 3, 4, \ldots, n-1, n$, with probabilities $p_3, p_4, p_1, p_1, \ldots, p_{K-1}, p_K$. The probability of a sequence with counts $x_1, \ldots, x_K$ is
$$p_1^{x_1} p_2^{x_2} \cdots p_{K-1}^{x_{K-1}} p_K^{x_K},$$
with $x_K = n - \sum_{k=1}^{K-1} x_k$.
• To compute $\Pr(X_1 = x_1, \ldots, X_{K-1} = x_{K-1})$ we count the number of such sequences.
• This is equivalent to:
– Pick $x_1$ experiments with outcome 1, $x_2$ with outcome 2, etc. from the n experiments.
– Start with picking the $x_1$ experiments with outcome 1 among the n experiments. This can be done in $\binom{n}{x_1}$ ways.
– From the remaining $n - x_1$ experiments pick the experiments with outcome 2. This can be done in $\binom{n - x_1}{x_2}$ ways.
– The total number of ways to choose the experiments with outcomes 1 and 2 is
$$\binom{n}{x_1}\binom{n - x_1}{x_2} = \frac{n!}{x_1! x_2! (n - x_1 - x_2)!}$$
– Using the same argument repeatedly we find that the total number of ways to choose the experiments with outcomes $1, 2, \ldots, K$ is
$$\frac{n!}{x_1! \cdots (n - x_1 - \cdots - x_{K-1})!} = \frac{n!}{x_1! \cdots x_K!}$$
• Hence
$$\Pr(X_1 = x_1, \ldots, X_{K-1} = x_{K-1}) = \frac{n!}{x_1! \cdots x_K!}\, p_1^{x_1} p_2^{x_2} \cdots p_{K-1}^{x_{K-1}} p_K^{x_K}$$
• The Multinomial joint density of $X_1, \ldots, X_{K-1}$ is
$$f_X(x_1, \ldots, x_{K-1}) = \frac{n!}{\prod_{k=1}^K x_k!} \prod_{k=1}^K p_k^{x_k}, \quad 0 \le x_k \le n, \; \sum_{k=1}^K x_k = n,$$
and $f_X = 0$ otherwise.
• Multinomial formula:
$$(a_1 + \cdots + a_K)^n = \sum_{x_1 + \cdots + x_K = n} \frac{n!}{x_1! \cdots x_K!}\, a_1^{x_1} \cdots a_K^{x_K}$$
• Using this, the mgf is
$$M_X(t) = E\left(e^{t_1 X_1 + \cdots + t_{K-1} X_{K-1}}\right) = \sum_{x_1 + \cdots + x_K = n} \frac{n!}{x_1! \cdots x_K!} \left(e^{t_1} p_1\right)^{x_1} \cdots \left(e^{t_{K-1}} p_{K-1}\right)^{x_{K-1}} p_K^{x_K} = \left(\sum_{k=1}^{K-1} e^{t_k} p_k + p_K\right)^n$$
• From the mgf we find
$$E(X_k) = np_k, \qquad \mathrm{Var}(X_k) = np_k(1 - p_k), \qquad \mathrm{Cov}(X_k, X_l) = -np_k p_l$$
• Exercise: What is the marginal distribution of $X_k$? What is the conditional distribution of $X_1, X_2$ given $X_3 = x_3, \ldots, X_{K-1} = x_{K-1}$?

Multivariate normal distribution
• The K-dimensional random vector X has a K-dimensional Multivariate normal distribution if its distribution has a density with respect to the K-dimensional Lebesgue measure equal to
$$f_X(x) = \frac{1}{(2\pi)^{K/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)}, \quad -\infty < x < \infty$$
• By completion of squares (see the 1-dimensional case) the mgf is
$$M_X(t) = e^{t'\mu + \frac{1}{2}t'\Sigma t}$$
Exercise: derive the mgf.
• Hence $E(X) = \mu$, $\mathrm{Var}(X) = \Sigma$. Exercise: derive these results.
• The marginal distribution of $X_k$ is normal with mean $\mu_k$ and variance $\sigma_k^2$, the k-th element of the main diagonal of $\Sigma$. Exercise: prove this using the mgf.
• Special case K = 2: the bivariate normal distribution.
Let the random vector be $(Y, X)'$.
• The conditional distribution of Y given $X = x$ is normal with
$$E(Y \mid X = x) = \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x - \mu_X), \qquad \mathrm{Var}(Y \mid X = x) = \sigma_Y^2\left(1 - \frac{\sigma_{XY}^2}{\sigma_X^2 \sigma_Y^2}\right) = \sigma_Y^2\left(1 - \rho_{XY}^2\right),$$
with $\sigma_{XY} = \mathrm{Cov}(X, Y)$.
• The conditional mean is linear in x. Compare with the result that $\Pr(Y = a + bX) = 1$ if and only if $|\rho_{XY}| = 1$.

Regression fallacy or regression to the mean
• Francis Galton (1822-1911) observed that tall fathers have on average shorter sons, and short fathers have on average taller sons (in Victorian England mothers and daughters did not count).
• If this process were to continue, one would expect that in the long run extremes would disappear and all fathers and sons would have the average height.
• Using the same reasoning: short sons have on average taller fathers (with a height closer to the mean) and tall sons have on average shorter fathers (again with a height closer to the mean).
• By this argument there is a tendency to move away from the mean!
• Similar observations can be made about many phenomena: rookie players who do exceptionally well in the first year tend to have a slump in the second; bringing in new management when a company underperforms seems to improve performance, etc.
• Analysis: X = height of father, Y = height of son.
• Reasonable assumption: X, Y have a bivariate normal distribution with
$$E(X) = E(Y) = \mu, \qquad \mathrm{Var}(X) = \mathrm{Var}(Y) = \sigma^2, \qquad 0 < \rho_{XY} < 1$$
• Hence $E(Y \mid X = x) = \mu + \rho(x - \mu)$.
• If $x > \mu$: $0 < E(Y \mid X = x) - \mu < x - \mu$, i.e. the average height of sons of fathers with more than average height is closer to the mean.
• If $x < \mu$: $0 > E(Y \mid X = x) - \mu > x - \mu$, i.e. the average height of sons of fathers with less than average height is closer to the mean.
• However, the heights of fathers and sons have the same (normal) distribution, i.e. there is no change over the generations.
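Regression to the mean can be illustrated by simulating such a bivariate normal pair, drawing Y from its conditional distribution given X; the parameter values and seed below are illustrative choices:

```python
import random
from math import sqrt

# Simulate (X, Y) bivariate normal with equal means/variances and
# correlation rho, as in the father/son example. mu, sigma, rho,
# the seed and the sample size are illustrative choices.
random.seed(1)
mu, sigma, rho = 175.0, 7.0, 0.5
n = 100_000

tall_x, tall_y = [], []
for _ in range(n):
    x = random.gauss(mu, sigma)
    # Conditional distribution of Y given X = x:
    # N(mu + rho*(x - mu), sigma^2*(1 - rho^2))
    y = random.gauss(mu + rho * (x - mu), sigma * sqrt(1 - rho**2))
    if x > mu + sigma:  # fathers at least one sd above the mean
        tall_x.append(x)
        tall_y.append(y)

avg_x = sum(tall_x) / len(tall_x)
avg_y = sum(tall_y) / len(tall_y)

# Sons of tall fathers are taller than average, but closer to the mean.
assert avg_y > mu
assert avg_y - mu < avg_x - mu
```

Even though the sons of tall fathers are pulled toward the mean, the marginal distribution of Y is the same $N(\mu, \sigma^2)$ as that of X, consistent with the last bullet above.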
The distribution of linear and quadratic functions of normal random vectors
• X is a K-dimensional random vector with $X \sim N(\mu, \Sigma)$.
• Consider the random variables
(i) $Y_1 = a'X$ with a a K-vector of constants (scalar).
(ii) $Y_2 = AX + b$ with A an $M \times K$ matrix and b an M-vector of constants.
(iii) $Y_3 = X'CX$ with C a symmetric $K \times K$ matrix of constants.
• From the mgf of $Y_1$ and $Y_2$ we find:
(i) $Y_1 \sim N(a'\mu, a'\Sigma a)$. Exercise: derive this.
(ii) $Y_2 \sim N(A\mu + b, A\Sigma A')$. Exercise: derive this.
– We verify:
$$E(Y_2) = AE(X) + b = A\mu + b$$
$$\mathrm{Var}(Y_2) = E\left[(Y_2 - A\mu - b)(Y_2 - A\mu - b)'\right] = E\left[(AX - A\mu)(AX - A\mu)'\right] = E\left[A(X - \mu)(X - \mu)'A'\right] = AE\left[(X - \mu)(X - \mu)'\right]A' = A\Sigma A'$$
(iii) Special case $X \sim N(0, I)$ and C idempotent, i.e. $C^2 = C$, the matrix generalization of unity.
– Let P be the $K \times K$ matrix of eigenvectors of C, chosen such that $P'P = I$, i.e. P is orthonormal.
– Define the diagonal matrix of eigenvalues
$$\Lambda = \begin{pmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_K \end{pmatrix}$$
– We have $CP = P\Lambda$ and $P'CP = \Lambda$, $C = P\Lambda P'$, because by $P'P = I$ we have $P' = P^{-1}$.
– Hence $P\Lambda P' = C = C^2 = P\Lambda^2 P'$, so that $\Lambda^2 = \Lambda$.
– This implies that each $\lambda_k$ is either 0 or 1. Let L be the number of eigenvalues equal to 1 and consider $Z = P'X$, so that $Z \sim N(0, I)$.
– Hence
$$Y_3 = X'P\Lambda P'X = Z'\Lambda Z = \sum_{k=1}^K \lambda_k Z_k^2 \sim \chi^2(L)$$
– Finally $\mathrm{tr}(C) = \mathrm{tr}(P\Lambda P') = \mathrm{tr}(\Lambda P'P) = \mathrm{tr}(\Lambda) = L$.
• Let $X_1$ and $X_2$ be subvectors of X of dimensions $K_1$ and $K_2$ with $K_1 + K_2 = K$. Then the variance matrix of X is
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{pmatrix}$$
with $\mathrm{Var}(X_1) = \Sigma_{11}$, $\mathrm{Var}(X_2) = \Sigma_{22}$ and $\Sigma_{12} = E\left((X_1 - \mu_1)(X_2 - \mu_2)'\right)$. We have that $X_1$ and $X_2$ are independent if and only if $\Sigma_{12} = 0$.
• To see this, note that if $\Sigma_{12} = 0$,
$$\Sigma^{-1} = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}$$
Hence
$$(x - \mu)'\Sigma^{-1}(x - \mu) = (x_1 - \mu_1)'\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)'\Sigma_{22}^{-1}(x_2 - \mu_2)$$
Substitution in the density of the multivariate normal distribution shows that this density factorizes into a function of $x_1$ and a function of $x_2$, which establishes that these random vectors are independent.
• Conclusion: in the normal distribution $X_1, X_2$ are independent if and only if $\mathrm{Cov}(X_1, X_2) = 0$.
• Define $Y_4 = X'BX$ with B idempotent. Then if $X \sim N(0, I)$:
(i) $Y_1$ and $Y_3$ are stochastically independent if and only if $Ca = 0$.
(ii) $Y_3$ and $Y_4$ are stochastically independent if and only if $BC = CB = 0$.
Proof:
(i) $Y_3 = X'CX = X'C^2X = X'C'CX$, which is a function of CX. Hence $Y_1$ and $Y_3$ are independent if and only if $\mathrm{Cov}(CX, a'X) = E(CXX'a) = Ca = 0$.
(ii) $Y_3 = X'C'CX$ and $Y_4 = X'B'BX$, so that $Y_3$ and $Y_4$ are independent if and only if $E(BXX'C') = BC' = BC = 0$.
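The quadratic-form result can be illustrated with a small idempotent matrix; a sketch where C is the rank-1 projection $C = vv'/(v'v)$ for an arbitrary illustrative vector v, so that $\mathrm{tr}(C) = 1$ and $X'CX$ should behave like a $\chi^2(1)$ variable:

```python
import random

# For X ~ N(0, I) and idempotent C, X'CX ~ chi-squared(tr(C)).
# C = v v'/(v'v) is a rank-1 projection; v, the seed and the
# sample size are illustrative choices.
v = [1.0, 2.0, 2.0]
vv = sum(c * c for c in v)  # v'v = 9
C = [[vi * vj / vv for vj in v] for vi in v]

# Check idempotency C^2 = C and tr(C) = 1.
C2 = [[sum(C[i][k] * C[k][j] for k in range(3)) for j in range(3)]
      for i in range(3)]
assert all(abs(C2[i][j] - C[i][j]) < 1e-12
           for i in range(3) for j in range(3))
trace = sum(C[i][i] for i in range(3))
assert abs(trace - 1.0) < 1e-12

# Simulate X'CX = (v'X)^2 / (v'v) and compare its sample mean with
# the chi-squared(1) mean, which equals tr(C) = 1.
random.seed(2)
n = 200_000
total = 0.0
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(3)]
    vx = sum(vi * xi for vi, xi in zip(v, x))
    total += vx * vx / vv
sample_mean = total / n
assert abs(sample_mean - 1.0) < 0.05
```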
80 • Conclusion: In the normal distribution X1 , X2 are independent if and only if Cov(X1 , X2 ) = 0. • Define Y4 = X 0 BX with B idempotent. Then if X ∼ N (0, I) (i) Y1 and Y3 are independent if and only if Ba = 0. (ii) Y3 and Y4 are stochastically independent if and only if BC = CB = 0. Proof: (i) Y3 = X 0 CX = X 0 C 2 X = X 0 C 0 CX which is a function of CX. Hence Y1 and Y3 are independent if and only if Cov(BX, a0 X) = E(BXX 0 a) = Ba = 0 (ii) Y3 = X 0 C 0 CX and Y4 = X 0 D0 DX so that Y3 and Y4 are independent if and only if Cov(BXX 0 C 0 ) = BC = 0. 81