1.10.3 Conditional Distributions

Definition 1.16. Let $X = (X_1, X_2)$ denote a discrete bivariate rv with joint pmf $p_X(x_1, x_2)$ and marginal pmfs $p_{X_1}(x_1)$ and $p_{X_2}(x_2)$. For any $x_1$ such that $p_{X_1}(x_1) > 0$, the conditional pmf of $X_2$ given that $X_1 = x_1$ is the function of $x_2$ defined by
\[
p_{X_2|x_1}(x_2) = \frac{p_X(x_1, x_2)}{p_{X_1}(x_1)}.
\]
Analogously, we define the conditional pmf of $X_1$ given $X_2 = x_2$,
\[
p_{X_1|x_2}(x_1) = \frac{p_X(x_1, x_2)}{p_{X_2}(x_2)}.
\]

It is easy to check that these functions are indeed pmfs. For example,
\[
\sum_{x_2 \in \mathcal{X}_2} p_{X_2|x_1}(x_2) = \sum_{x_2 \in \mathcal{X}_2} \frac{p_X(x_1, x_2)}{p_{X_1}(x_1)} = \frac{p_{X_1}(x_1)}{p_{X_1}(x_1)} = 1.
\]

Example 1.22. Let $X_1$ and $X_2$ be defined as in Example 1.18. The conditional pmf of $X_2$ given $X_1 = 5$ is

$x_2$                    0     1     2     3     4     5
$p_{X_2|X_1=5}(x_2)$     0    1/2    0    1/2    0     0

Exercise 1.14. Let $S$ and $G$ denote the smoking status and gender as defined in Exercise 1.13. Calculate the probability that a randomly selected student is
1. a smoker, given that he is a male;
2. female, given that the student smokes.

Analogously to the conditional distribution for discrete rvs, we define the conditional distribution for continuous rvs.

Definition 1.17. Let $X = (X_1, X_2)$ denote a continuous bivariate rv with joint pdf $f_X(x_1, x_2)$ and marginal pdfs $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$. For any $x_1$ such that $f_{X_1}(x_1) > 0$, the conditional pdf of $X_2$ given that $X_1 = x_1$ is the function of $x_2$ defined by
\[
f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)}.
\]
Analogously, we define the conditional pdf of $X_1$ given $X_2 = x_2$,
\[
f_{X_1|x_2}(x_1) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)}.
\]

Here too, it is easy to verify that these functions are pdfs. For example,
\[
\int_{\mathcal{X}_2} f_{X_2|x_1}(x_2)\,dx_2
= \int_{\mathcal{X}_2} \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)}\,dx_2
= \frac{\int_{\mathcal{X}_2} f_X(x_1, x_2)\,dx_2}{f_{X_1}(x_1)}
= \frac{f_{X_1}(x_1)}{f_{X_1}(x_1)} = 1.
\]

Example 1.23. For the random variables defined in Example 1.21 the conditional pdfs are
\[
f_{X_1|x_2}(x_1) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)} = \frac{6x_1 x_2^2}{3x_2^2} = 2x_1
\]
and
\[
f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)} = \frac{6x_1 x_2^2}{2x_1} = 3x_2^2.
\]

The conditional pdfs allow us to calculate conditional expectations. The conditional expected value of a function $g(X_2)$ given that $X_1 = x_1$ is defined by
\[
\operatorname{E}[g(X_2)|x_1] =
\begin{cases}
\sum_{x_2 \in \mathcal{X}_2} g(x_2)\, p_{X_2|x_1}(x_2) & \text{for a discrete rv,}\\[4pt]
\int_{\mathcal{X}_2} g(x_2)\, f_{X_2|x_1}(x_2)\,dx_2 & \text{for a continuous rv.}
\end{cases}
\tag{1.14}
\]

Example 1.24. The conditional mean and variance of $X_2$ given a value of $X_1$, for the variables defined in Example 1.21, are
\[
\mu_{X_2|x_1} = \operatorname{E}(X_2|x_1) = \int_0^1 x_2 \cdot 3x_2^2\,dx_2 = \frac{3}{4},
\]
and
\[
\sigma^2_{X_2|x_1} = \operatorname{var}(X_2|x_1) = \operatorname{E}(X_2^2|x_1) - [\operatorname{E}(X_2|x_1)]^2
= \int_0^1 x_2^2 \cdot 3x_2^2\,dx_2 - \left(\frac{3}{4}\right)^2 = \frac{3}{80}.
\]

Lemma 1.2. For random variables $X$ and $Y$ defined on supports $\mathcal{X}$ and $\mathcal{Y}$, respectively, and a function $g(\cdot)$ whose expectation exists, the following result holds:
\[
\operatorname{E}[g(Y)] = \operatorname{E}\{\operatorname{E}[g(Y)|X]\}.
\]

Proof. From the definition of conditional expectation we can write
\[
\operatorname{E}[g(Y)|X = x] = \int_{\mathcal{Y}} g(y) f_{Y|x}(y)\,dy.
\]
This is a function of $x$ whose expectation is
\[
\operatorname{E}_X\{\operatorname{E}_Y[g(Y)|X]\}
= \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} g(y) f_{Y|x}(y)\,dy \right) f_X(x)\,dx
= \int_{\mathcal{X}} \int_{\mathcal{Y}} g(y) \underbrace{f_{Y|x}(y) f_X(x)}_{= f_{(X,Y)}(x,y)}\,dy\,dx
= \int_{\mathcal{Y}} g(y) \underbrace{\int_{\mathcal{X}} f_{(X,Y)}(x, y)\,dx}_{= f_Y(y)}\,dy
= \operatorname{E}[g(Y)].
\]

Exercise 1.15. Show the following two equalities, which result from the above lemma:
1. $\operatorname{E}(Y) = \operatorname{E}\{\operatorname{E}[Y|X]\}$;
2. $\operatorname{var}(Y) = \operatorname{E}[\operatorname{var}(Y|X)] + \operatorname{var}(\operatorname{E}[Y|X])$.
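These identities can also be illustrated numerically. The short sketch below is our addition to the notes (it assumes NumPy is available); it samples from the joint pdf $f_X(x_1, x_2) = 6x_1 x_2^2$ of Example 1.21 by the inverse-CDF method and checks the conditional moments of Example 1.24 together with the equalities of Exercise 1.15.

\begin{verbatim}
# Illustrative Monte Carlo check (our addition) for Examples 1.21/1.24 and
# Exercise 1.15, using the joint pdf f(x1, x2) = 6 x1 x2^2 on (0, 1)^2.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Inverse-CDF sampling: X1 has pdf 2*x1 (cdf x1^2), X2 has pdf 3*x2^2 (cdf x2^3).
x1 = np.sqrt(rng.uniform(size=n))
x2 = rng.uniform(size=n) ** (1 / 3)

print(x2.mean())   # ~ 0.75   = E(X2) = E[E(X2 | X1)]  (tower property)
print(x2.var())    # ~ 0.0375 = 3/80; since X1 and X2 are independent here,
                   # var(E[X2|X1]) = 0 and var(X2) = E[var(X2|X1)] = 3/80
\end{verbatim}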
1.10.4 Hierarchical Models and Mixture Distributions

When a random variable depends on another random variable, we may try to model it in a hierarchical way. This leads to a mixture distribution. We explain this situation by the following example.

Example 1.25. Denote by $N$ the number of eggs laid by an insect. Suppose that each egg survives, independently of the other eggs, with the same probability $p$. Let $X$ denote the number of survivors. Then $X|(N = n)$ is a binomial rv with $n$ trials and probability of success $p$, i.e., $X|(N = n) \sim \operatorname{Bin}(n, p)$. Assume that the number of eggs the insect lays follows a Poisson distribution with parameter $\lambda$, i.e., $N \sim \operatorname{Poisson}(\lambda)$. That is, we can write the following hierarchical model,
\[
X|N \sim \operatorname{Bin}(N, p), \qquad N \sim \operatorname{Poisson}(\lambda).
\]
Then the marginal distribution of the random variable $X$ can be written as
\[
P(X = x) = \sum_{n=0}^{\infty} P(X = x, N = n)
= \sum_{n=0}^{\infty} P(X = x | N = n)\, P(N = n)
= \sum_{n=x}^{\infty} \binom{n}{x} p^x (1-p)^{n-x} \frac{e^{-\lambda}\lambda^n}{n!},
\]
since $X|N = n$ is binomial, $N$ is Poisson, and the conditional probability is zero if $n < x$. Multiplying this expression by $\frac{\lambda^x}{\lambda^x}$ and then simplifying gives
\[
P(X = x) = \sum_{n=x}^{\infty} \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x}\, \frac{e^{-\lambda}\lambda^n}{n!}\, \frac{\lambda^x}{\lambda^x}
= \frac{(\lambda p)^x e^{-\lambda}}{x!} \sum_{n=x}^{\infty} \frac{[(1-p)\lambda]^{n-x}}{(n-x)!}
= \frac{(\lambda p)^x e^{-\lambda}}{x!}\, e^{(1-p)\lambda}
= \frac{(\lambda p)^x e^{-\lambda p}}{x!}.
\]
That is, $X \sim \operatorname{Poisson}(\lambda p)$. Then we can easily get the expected number of surviving eggs as $\operatorname{E}(X) = \lambda p$. Note that in the above example
\[
\lambda p = p\operatorname{E}(N) = \operatorname{E}(pN) = \operatorname{E}\big(\operatorname{E}(X|N)\big),
\]
since $X|N \sim \operatorname{Bin}(N, p)$.

Example 1.26. A new drug is tested on $n$ patients in phase III of a clinical trial. Assume that a response can either be efficacy (success) or no efficacy (failure). It is expected that each patient's chance to react positively to the drug is different and the probability of success follows a $\operatorname{Beta}(\alpha, \beta)$ distribution. Then the response of each patient can be modelled as a hierarchical Bernoulli rv, that is,
\[
X_i|P_i \sim \operatorname{Bernoulli}(P_i), \quad i = 1, 2, \ldots, n, \qquad P_i \sim \operatorname{Beta}(\alpha, \beta).
\]
We are interested in the expected total number of efficacious responses and also in its variance. Let us denote by $Y$ the total number of successes. Then
\[
Y = \sum_{i=1}^n X_i
\]
and
\[
\operatorname{E} Y = \sum_{i=1}^n \operatorname{E}(X_i) = \sum_{i=1}^n \operatorname{E}\big(\operatorname{E}(X_i|P_i)\big) = \sum_{i=1}^n \operatorname{E}(P_i) = \sum_{i=1}^n \frac{\alpha}{\alpha+\beta} = \frac{n\alpha}{\alpha+\beta}.
\]
Now we will calculate the variance of $Y$. The random variables $X_i$ are independent, hence
\[
\operatorname{var}(Y) = \operatorname{var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{var}(X_i).
\]
We have
\[
\operatorname{var}(X_i) = \operatorname{var}\big(\operatorname{E}(X_i|P_i)\big) + \operatorname{E}\big(\operatorname{var}(X_i|P_i)\big), \quad i = 1, 2, \ldots, n.
\]
The first term in the above expression is
\[
\operatorname{var}\big(\operatorname{E}(X_i|P_i)\big) = \operatorname{var}(P_i) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)},
\]
since $P_i \sim \operatorname{Beta}(\alpha, \beta)$. The second term of the expression for the variance of $X_i$ is
\[
\operatorname{E}\big(\operatorname{var}(X_i|P_i)\big) = \operatorname{E}\big(P_i(1 - P_i)\big)
= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 p_i(1-p_i)\, p_i^{\alpha-1}(1-p_i)^{\beta-1}\,dp_i
= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 p_i^{\alpha}(1-p_i)^{\beta}\,dp_i
= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1)\Gamma(\beta+1)}{\Gamma(\alpha+\beta+2)}
= \frac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)}.
\]
Here we used the fact that
\[
\frac{\Gamma(\alpha+\beta+2)}{\Gamma(\alpha+1)\Gamma(\beta+1)}\, p_i^{\alpha}(1-p_i)^{\beta}
\]
is a pdf of a $\operatorname{Beta}(\alpha+1, \beta+1)$ random variable, and we also used the property of the Gamma function that $\Gamma(z+1) = z\Gamma(z)$. Hence,
\[
\operatorname{var}(X_i) = \operatorname{var}\big(\operatorname{E}(X_i|P_i)\big) + \operatorname{E}\big(\operatorname{var}(X_i|P_i)\big)
= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} + \frac{\alpha\beta}{(\alpha+\beta)(\alpha+\beta+1)}
= \frac{\alpha\beta}{(\alpha+\beta)^2}.
\]
Finally, as the variance of $X_i$ does not depend on $i$, we get
\[
\operatorname{var}(Y) = \frac{n\alpha\beta}{(\alpha+\beta)^2}.
\]

Exercise 1.16. The number of spam messages $Y$ sent to a server in a day has a Poisson distribution with parameter $\lambda = 21$. Each spam message independently has probability $p = 1/3$ of not being detected by the spam filter. Let $X$ denote the number of spam messages getting through the filter. Calculate the expected daily number of spam messages which get into the server. Also, calculate the variance of $X$.
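The thinning result of Example 1.25 is easy to confirm by simulation. The sketch below is our illustration, not part of the original notes (it assumes NumPy); it draws $N \sim \operatorname{Poisson}(\lambda)$ and then $X|N \sim \operatorname{Bin}(N, p)$, and checks that the sample mean and variance of $X$ are both close to $\lambda p$, as they must be for a $\operatorname{Poisson}(\lambda p)$ variable.

\begin{verbatim}
# Illustrative simulation (our addition) of the hierarchical model in Example 1.25.
import numpy as np

rng = np.random.default_rng(1)
lam, p, reps = 10.0, 0.4, 1_000_000

N = rng.poisson(lam, size=reps)   # number of eggs laid
X = rng.binomial(N, p)            # number of surviving eggs, X | N ~ Bin(N, p)

print(X.mean(), X.var())          # both ~ lam * p = 4.0
\end{verbatim}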
1.10.5 Independence of Random Variables

Definition 1.18. Let $X = (X_1, X_2)$ denote a continuous bivariate rv with joint pdf $f_X(x_1, x_2)$ and marginal pdfs $f_{X_1}(x_1)$ and $f_{X_2}(x_2)$. Then $X_1$ and $X_2$ are called independent random variables if, for every $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$,
\[
f_X(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2).
\tag{1.15}
\]
We define independent discrete random variables analogously.

If $X_1$ and $X_2$ are independent, then the conditional pdf of $X_2$ given $X_1 = x_1$ is
\[
f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)} = \frac{f_{X_1}(x_1) f_{X_2}(x_2)}{f_{X_1}(x_1)} = f_{X_2}(x_2),
\]
regardless of the value of $x_1$. An analogous property holds for the conditional pdf of $X_1$ given $X_2 = x_2$.

Example 1.27. It is easy to notice that for the variables defined in Example 1.21 we have
\[
f_X(x_1, x_2) = 6x_1 x_2^2 = 2x_1 \cdot 3x_2^2 = f_{X_1}(x_1) f_{X_2}(x_2).
\]
So the variables $X_1$ and $X_2$ are independent.

In fact, two rvs are independent if and only if there exist functions $g(x_1)$ and $h(x_2)$ such that for every $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$,
\[
f_X(x_1, x_2) = g(x_1) h(x_2),
\]
and the support of one variable does not depend on the value of the other variable.

Theorem 1.13. Let $X_1$ and $X_2$ be independent random variables. Then:
1. For any $A \subset \mathbb{R}$ and $B \subset \mathbb{R}$,
\[
P(X_1 \in A, X_2 \in B) = P(X_1 \in A)\, P(X_2 \in B),
\]
that is, $\{X_1 \in A\}$ and $\{X_2 \in B\}$ are independent events.
2. For $g(X_1)$, a function of $X_1$ only, and for $h(X_2)$, a function of $X_2$ only, we have
\[
\operatorname{E}[g(X_1) h(X_2)] = \operatorname{E}[g(X_1)]\, \operatorname{E}[h(X_2)].
\]

Proof. Assume that $X_1$ and $X_2$ are continuous random variables. To prove the theorem for discrete rvs we follow the same steps with sums instead of integrals.

1. We have
\[
P(X_1 \in A, X_2 \in B)
= \int_B \int_A f_X(x_1, x_2)\,dx_1\,dx_2
= \int_B \int_A f_{X_1}(x_1) f_{X_2}(x_2)\,dx_1\,dx_2
= \int_B \left( \int_A f_{X_1}(x_1)\,dx_1 \right) f_{X_2}(x_2)\,dx_2
= \int_A f_{X_1}(x_1)\,dx_1 \int_B f_{X_2}(x_2)\,dx_2
= P(X_1 \in A)\, P(X_2 \in B).
\]

2. Similar arguments as in Part 1 give
\[
\operatorname{E}[g(X_1) h(X_2)]
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x_1) h(x_2) f_X(x_1, x_2)\,dx_1\,dx_2
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x_1) h(x_2) f_{X_1}(x_1) f_{X_2}(x_2)\,dx_1\,dx_2
= \int_{-\infty}^{\infty} g(x_1) f_{X_1}(x_1)\,dx_1 \int_{-\infty}^{\infty} h(x_2) f_{X_2}(x_2)\,dx_2
= \operatorname{E}[g(X_1)]\, \operatorname{E}[h(X_2)].
\]

Below we will apply this result to the moment generating function of a sum of independent random variables.

Theorem 1.14. Let $X_1$ and $X_2$ be independent random variables with moment generating functions $M_{X_1}(t)$ and $M_{X_2}(t)$, respectively. Then the moment generating function of the sum $Y = X_1 + X_2$ is given by
\[
M_Y(t) = M_{X_1}(t) M_{X_2}(t).
\]

Proof. By the definition of the mgf and by Theorem 1.13, part 2, we have
\[
M_Y(t) = \operatorname{E} e^{tY} = \operatorname{E} e^{t(X_1 + X_2)} = \operatorname{E}\big(e^{tX_1} e^{tX_2}\big) = \operatorname{E} e^{tX_1}\, \operatorname{E} e^{tX_2} = M_{X_1}(t) M_{X_2}(t).
\]

Note that this result can be easily extended to a sum of any number of mutually independent random variables.

Example 1.28. Let $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ be independent. What is the distribution of $Y = X_1 + X_2$? Using Theorem 1.14 we can write
\[
M_Y(t) = M_{X_1}(t) M_{X_2}(t)
= \exp\{\mu_1 t + \sigma_1^2 t^2/2\} \exp\{\mu_2 t + \sigma_2^2 t^2/2\}
= \exp\{(\mu_1 + \mu_2)t + (\sigma_1^2 + \sigma_2^2)t^2/2\}.
\]
This is the mgf of a normal rv with $\operatorname{E}(Y) = \mu_1 + \mu_2$ and $\operatorname{var}(Y) = \sigma_1^2 + \sigma_2^2$.
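As an aside (our illustration, not part of the notes; it assumes NumPy), both Theorem 1.14 and Example 1.28 can be checked numerically: the empirical mgf of $Y = X_1 + X_2$ at a fixed $t$ should agree with the product $M_{X_1}(t) M_{X_2}(t)$, and the sample mean and variance of $Y$ should be close to $\mu_1 + \mu_2$ and $\sigma_1^2 + \sigma_2^2$.

\begin{verbatim}
# Simulation sketch (our addition) for Theorem 1.14 and Example 1.28.
import numpy as np

rng = np.random.default_rng(2)
mu1, s1, mu2, s2, reps = 1.0, 2.0, -3.0, 1.5, 1_000_000

y = rng.normal(mu1, s1, reps) + rng.normal(mu2, s2, reps)
print(y.mean(), y.var())     # ~ mu1 + mu2 = -2.0 and s1**2 + s2**2 = 6.25

t = 0.3                      # empirical mgf vs. exp{(mu1+mu2)t + (s1^2+s2^2)t^2/2}
print(np.exp(t * y).mean())
print(np.exp((mu1 + mu2) * t + (s1**2 + s2**2) * t**2 / 2))
\end{verbatim}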
Exercise 1.17. A part of an electronic system has two types of components in joint operation. Denote by $X_1$ and $X_2$ the random length of life (measured in hundreds of hours) of components of type I and of type II, respectively. Assume that the joint density function of the two rvs is given by
\[
f_X(x_1, x_2) = \frac{1}{8}\, x_1 \exp\left\{-\frac{x_1 + x_2}{2}\right\} I_{\mathcal{X}},
\quad \text{where } \mathcal{X} = \{(x_1, x_2) : x_1 > 0,\ x_2 > 0\}.
\]
1. Calculate the probability that both components will have a life length longer than 100 hours, that is, find $P(X_1 > 1, X_2 > 1)$.
2. Calculate the probability that a component of type II will have a life length longer than 200 hours, that is, find $P(X_2 > 2)$.
3. Are $X_1$ and $X_2$ independent? Justify your answer.
4. Calculate the expected value of the so-called relative efficiency of the two components, which is expressed by $\operatorname{E}\left(\frac{X_2}{X_1}\right)$.

1.10.6 Covariance and Correlation

Covariance and correlation are two measures of the strength of a relationship between two rvs. We will use the following notation:
\[
\operatorname{E}(X_1) = \mu_{X_1}, \quad \operatorname{E}(X_2) = \mu_{X_2}, \quad \operatorname{var}(X_1) = \sigma^2_{X_1}, \quad \operatorname{var}(X_2) = \sigma^2_{X_2}.
\]
Also, we assume that $\sigma^2_{X_1}$ and $\sigma^2_{X_2}$ are finite positive values. The simplified notation $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ will be used when it is clear which rvs we refer to.

Definition 1.19. The covariance of $X_1$ and $X_2$ is defined by
\[
\operatorname{cov}(X_1, X_2) = \operatorname{E}[(X_1 - \mu_{X_1})(X_2 - \mu_{X_2})].
\tag{1.16}
\]

Some useful properties of the covariance and correlation are given in the following two theorems.

Theorem 1.15. Let $X_1$ and $X_2$ denote random variables and let $a, b, c, d, e, f$ denote some constants. Then the following properties hold.
1. $\operatorname{cov}(X_1, X_2) = \operatorname{E}(X_1 X_2) - \mu_{X_1} \mu_{X_2}$.
2. If the random variables $X_1$ and $X_2$ are independent, then $\operatorname{cov}(X_1, X_2) = 0$.
3. $\operatorname{var}(aX_1 + bX_2) = a^2 \operatorname{var}(X_1) + b^2 \operatorname{var}(X_2) + 2ab \operatorname{cov}(X_1, X_2)$.
4. For $U = aX_1 + bX_2 + e$ and for $V = cX_1 + dX_2 + f$ we can write
\[
\operatorname{cov}(U, V) = ac \operatorname{var}(X_1) + bd \operatorname{var}(X_2) + (ad + bc) \operatorname{cov}(X_1, X_2).
\]

Proof. We will show property 4.
\[
\operatorname{cov}(U, V)
= \operatorname{E}\Big[\big((aX_1 + bX_2 + e) - \operatorname{E}(aX_1 + bX_2 + e)\big)\big((cX_1 + dX_2 + f) - \operatorname{E}(cX_1 + dX_2 + f)\big)\Big]
= \operatorname{E}\Big[\big(a(X_1 - \operatorname{E} X_1) + b(X_2 - \operatorname{E} X_2)\big)\big(c(X_1 - \operatorname{E} X_1) + d(X_2 - \operatorname{E} X_2)\big)\Big]
= \operatorname{E}\Big[ac(X_1 - \operatorname{E} X_1)^2 + bd(X_2 - \operatorname{E} X_2)^2 + (ad + bc)(X_1 - \operatorname{E} X_1)(X_2 - \operatorname{E} X_2)\Big]
= ac \operatorname{var} X_1 + bd \operatorname{var} X_2 + (ad + bc) \operatorname{cov}(X_1, X_2).
\]

Property 2 says that if two variables are independent, then their covariance is zero. This does not work both ways: a covariance of zero does not imply that the variables are independent. The following small example shows this fact.

Example 1.29. Let $X \sim U(-1, 1)$ and let $Y = X^2$. Then
\[
\operatorname{E}(X) = 0, \qquad
\operatorname{E}(Y) = \operatorname{E}(X^2) = \int_{-1}^{1} x^2 \cdot \frac{1}{2}\,dx = \frac{1}{3}, \qquad
\operatorname{E}(XY) = \operatorname{E}(X^3) = 0.
\]
Hence $\operatorname{cov}(X, Y) = \operatorname{E}(XY) - \operatorname{E}(X)\operatorname{E}(Y) = 0$, but $Y$ is a function of $X$, so these two variables are not independent.

Another measure of the strength of the relationship between two rvs is correlation. It is defined as follows.

Definition 1.20. The correlation of $X_1$ and $X_2$ is defined by
\[
\rho_{(X_1, X_2)} = \frac{\operatorname{cov}(X_1, X_2)}{\sigma_{X_1} \sigma_{X_2}}.
\tag{1.17}
\]

Correlation is a dimensionless measure and it expresses the strength of a linear relationship between the variables, as shown in the following theorem.

Theorem 1.16. For any random variables $X_1$ and $X_2$,
1. $-1 \le \rho_{(X_1, X_2)} \le 1$;
2. $|\rho_{(X_1, X_2)}| = 1$ iff there exist numbers $a \ne 0$ and $b$ such that $P(X_2 = aX_1 + b) = 1$. If $\rho_{(X_1, X_2)} = 1$ then $a > 0$, and if $\rho_{(X_1, X_2)} = -1$ then $a < 0$.

Exercise 1.18. Prove Theorem 1.16. Hint: Consider the roots (and so the discriminant) of the quadratic function of $t$:
\[
g(t) = \operatorname{E}\big[(X - \mu_X)t + (Y - \mu_Y)\big]^2.
\]
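Before moving on, a short simulation (our illustration, assuming NumPy) makes Example 1.29 concrete: the sample covariance of $X \sim U(-1, 1)$ and $Y = X^2$ is close to zero, yet conditioning on $|X| > 1/2$ changes the distribution of $Y$ completely, so the variables are clearly dependent.

\begin{verbatim}
# Illustrative check (our addition) of Example 1.29: zero covariance does not
# imply independence.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 1_000_000)
y = x ** 2

print(np.cov(x, y)[0, 1])                  # ~ 0: X and Y are uncorrelated
print((y > 0.25).mean())                   # P(Y > 1/4) = 0.5
print((y[np.abs(x) > 0.5] > 0.25).mean())  # P(Y > 1/4 | |X| > 1/2) = 1.0
\end{verbatim}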
Example 1.30. Let the joint pdf of $X, Y$ be $f_{X,Y}(x, y) = 1$ on the support $\{(x, y) : 0 < x < 1,\ x < y < x + 1\}$. The two rvs are not independent as the range of $Y$ depends on $x$. We will calculate the correlation between $X$ and $Y$. For this we need to obtain the marginal pdfs, $f_X(x)$ and $f_Y(y)$. The marginal pdf of $X$ is
\[
f_X(x) = \int_x^{x+1} 1\,dy = y\big|_x^{x+1} = 1, \quad \text{on support } \{x : 0 < x < 1\}.
\]
Note that to obtain the marginal pdf of the rv $Y$ the range of $X$ has to be considered separately for $y \in (0, 1)$ and for $y \in [1, 2)$. When $y \in (0, 1)$ then $x \in (0, y)$. When $y \in [1, 2)$, then $x \in (y - 1, 1)$. Hence,
\[
f_Y(y) =
\begin{cases}
\int_0^y 1\,dx = x\big|_0^y = y, & \text{for } y \in (0, 1);\\[4pt]
\int_{y-1}^1 1\,dx = x\big|_{y-1}^1 = 2 - y, & \text{for } y \in [1, 2);\\[4pt]
0, & \text{otherwise.}
\end{cases}
\]
These give
\[
\mu_X = \frac{1}{2}, \qquad \sigma_X^2 = \frac{(b - a)^2}{12} = \frac{1}{12};
\qquad
\mu_Y = 1, \qquad \sigma_Y^2 = \operatorname{E}(Y^2) - [\operatorname{E}(Y)]^2 = \frac{7}{6} - 1 = \frac{1}{6},
\]
where
\[
\operatorname{E}(Y^2) = \int_0^1 y^2 \cdot y\,dy + \int_1^2 y^2 (2 - y)\,dy = \frac{7}{6}.
\]
Also,
\[
\operatorname{E}(XY) = \int_0^1 \int_x^{x+1} xy \cdot 1\,dy\,dx = \frac{7}{12}.
\]
Hence,
\[
\operatorname{cov}(X, Y) = \operatorname{E}(XY) - \operatorname{E}(X)\operatorname{E}(Y) = \frac{7}{12} - \frac{1}{2} \times 1 = \frac{1}{12}.
\]
Finally,
\[
\rho_{(X, Y)} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{12}}{\sqrt{\frac{1}{12} \cdot \frac{1}{6}}} = \frac{1}{\sqrt{2}}.
\]
The linear relationship between $X$ and $Y$ is not very strong.

Note: We can make an interesting comparison of this value of the correlation with the correlation of $X$ and $Y$ having a joint uniform distribution on $\{(x, y) : 0 < x < 1,\ x < y < x + 0.1\}$, which is a 'narrower strip' of values than previously. Then $f_{X,Y}(x, y) = 10$ and it can be shown that $\rho_{(X, Y)} = 10/\sqrt{101}$, which is close to 1. The linear relationship between $X$ and $Y$ is very strong in this case.
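Both correlations above can be reproduced by simulation. The sketch below is our addition (assuming NumPy); it uses the fact that a point uniform on the strip $0 < x < 1$, $x < y < x + w$ can be generated as $X \sim U(0, 1)$ and $Y = X + wU$ with $U \sim U(0, 1)$ independent of $X$.

\begin{verbatim}
# Monte Carlo sketch (our addition) for Example 1.30 and the note that follows it.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.uniform(size=n)

for w in (1.0, 0.1):                    # strip widths: Example 1.30, then the note
    y = x + w * rng.uniform(size=n)
    print(w, np.corrcoef(x, y)[0, 1])   # ~ 1/sqrt(2) = 0.707 and 10/sqrt(101) = 0.995
\end{verbatim}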