Ch.4 Multivariate Variables and Their Distribution

1 Introduction
In Chapter 3, we defined a univariate random variable as a rule that assigns a number to
each outcome of a random experiment. When several different rules each assign a number
to the same outcome of a random experiment, we have a multivariate random variable. In
Chapter 1, we saw several examples of studies where more than one variable is observed
from each population unit. Some additional examples are
1. The X, Y and Z components of wind velocity can be measured in studies of atmospheric turbulence.
2. The velocity X and stopping distance Y of an automobile can be studied in an
automobile safety study.
3. The diameter at breast height (DBH) and age of a tree can be measured in a study
aimed at developing a method for predicting age from diameter.
In such studies, it is not only interesting to investigate the behavior of each variable
individually, but also to investigate the degree of relationship between them.
We say that we know the joint distribution of a bivariate variable, (X, Y ), if we know all
probabilities of the form
P (a < X ≤ b, c < Y ≤ d), with a < b, c < d,
where P (a < X ≤ b, c < Y ≤ d) is to be interpreted as P (a < X ≤ b and c < Y ≤ d) or
as P ([a < X ≤ b] ∩ [c < Y ≤ d]). Similarly, we say that we know the joint distribution
of a multivariate variable, X1 , X2 , . . . , Xm , if we know all probabilities of the form
P (a1 < X1 ≤ b1 , a2 < X2 ≤ b2 , . . . , am < Xm ≤ bm ), with ak < bk , k = 1, . . . , m.
As in the univariate case, which was considered in Chapter 3, a concise description of
the joint probability distribution of any multivariate random variable can be achieved
through its cumulative distribution function (cdf).
Definition 1.1. The joint or bivariate cdf of two random variables, X, Y , is defined by
F (x, y) = P (X ≤ x, Y ≤ y).
The joint or multivariate cdf of several random variables, X1 , X2 , . . . , Xm , is defined by
F(x1, x2, . . . , xm) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xm ≤ xm).
The cdf is convenient for calculating the probability that (X, Y ) will lie in a rectangle. In
the bivariate case the formula is
P(x1 < X ≤ x2, y1 < Y ≤ y2) = F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1).   (1.1)
Example 1.1. Let (X, Y ) be uniformly distributed on the unit rectangle [0, 1] × [0, 1].
This means that the probability that (X, Y ) lies in a subset A of the unit rectangle
equals the area of A. Thus, the bivariate cdf of (X, Y ) is
F (x, y) = xy, for 0 ≤ x, y ≤ 1,
F (x, y) = x, for 0 ≤ x ≤ 1, y ≥ 1,
F (x, y) = y, for 0 ≤ y ≤ 1, x ≥ 1,
F (x, y) = 1, for x, y ≥ 1, and
F (x, y) = 0, if either x ≤ 0 or y ≤ 0.
The probability that (X, Y ) will lie in the rectangle A = (0.3, 0.6) × (0.4, 0.7) is

P(0.3 < X ≤ 0.6, 0.4 < Y ≤ 0.7) = (0.6)(0.7) − (0.3)(0.7) − (0.6)(0.4) + (0.3)(0.4) = 0.09,

which is equal to the area of A.
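As a quick numerical illustration (a minimal sketch; the function names unif_cdf and rect_prob are ours, not from the text), formula (1.1) can be evaluated for the uniform cdf of this example in Python:

```python
# A small numerical check of formula (1.1) using the uniform-on-[0,1]x[0,1] cdf.
# The function names below (unif_cdf, rect_prob) are illustrative, not from the text.

def unif_cdf(x, y):
    """Joint cdf F(x, y) of the uniform distribution on the unit rectangle."""
    x = min(max(x, 0.0), 1.0)   # F is constant outside [0, 1] in each coordinate
    y = min(max(y, 0.0), 1.0)
    return x * y

def rect_prob(F, x1, x2, y1, y2):
    """P(x1 < X <= x2, y1 < Y <= y2) computed from a joint cdf F via (1.1)."""
    return F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1)

print(round(rect_prob(unif_cdf, 0.3, 0.6, 0.4, 0.7), 4))   # 0.09, the area of A
```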
The joint cdf of (X, Y ) can also be used for obtaining the cdf of X or of Y . Thus, if
F (x, y) is the bivariate cdf of (X, Y ), then
FX(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = F(x, ∞),   (1.2)
is the cdf of X. Similarly, FY (y) = F (∞, y) is the formula for obtaining the cdf of Y from
the bivariate cdf of (X, Y ).
Example 1.2. Let (X, Y ) be uniformly distributed on the unit rectangle [0, 1] × [0, 1].
Find the cdf of the variables X and Y .
Solution. The bivariate cdf, F (x, y), of (X, Y ) is given in Example 1.1. Using it and
relation (1.2) we have that the cdf of X is
FX (x) = F (x, ∞) = x, 0 ≤ x ≤ 1,
which is recognized to be the cdf of the uniform in [0, 1] distribution. Similarly FY (y) =
F (∞, y) = y, 0 ≤ y ≤ 1.
Though very convenient for calculating the probability that (X, Y ) will take value in a
rectangle, and also for calculating the individual probability distributions of X and Y ,
the cdf is not the most convenient tool for calculating directly the probability that (X, Y )
will take value in regions other than rectangles, such as circles.
In the rest of this chapter, we will define additional concepts associated with joint
distributions: marginal distributions, conditional distributions, and independence of random
variables. Concise descriptions of the joint, marginal and conditional distributions will
be given by appropriately extending the definitions, given in Chapter 3 for univariate
variables, of the probability mass function and the probability density function. Finally,
we will introduce certain parameters of the joint, marginal and conditional distributions
that are often used for descriptive and inferential purposes.
2 Describing the Joint Distribution of Discrete Random Variables

2.1 Joint probability mass function
Definition 2.1. The joint or bivariate probability mass function (pmf ) of the discrete
random variables X, Y is defined as
p(x, y) = P (X = x, Y = y).
The joint or multivariate pmf of the discrete random variables X1 , X2 , . . . , Xn is similarly defined as
p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn ).
3
Let the sample space of (X, Y ) be S = {(x1, y1), (x2, y2), . . .}. The above definition
implies that p(x, y) = 0 for all (x, y) that do not belong to S (i.e. for all (x, y) different from
(x1, y1), (x2, y2), . . .). Moreover, from the Axioms of probability it follows that

p(xi, yi) ≥ 0, for all i, and Σ_i p(xi, yi) = 1.   (2.1)
Example 2.1. A robot performs two tasks, welding joints and tightening bolts. Let X
be the number of defective welds, and Y be the number of improperly tightened bolts per
car. Suppose that the possible values of X are 0,1,2 and those of Y are 0,1,2,3. Thus,
the sample space S consists of 12 pairs (0, 0), (0, 1), (0, 2), (0, 3), . . . , (2, 3). The joint pmf
of X, Y is given by
          Y
 X        0      1      2      3
 0      .84    .03    .02    .01
 1      .06    .01    .008   .002
 2      .01    .005   .004   .001
In agreement with (2.1), the sum of all probabilities equals one. Inspection of this pmf
reveals that 84% of the cars have no defective welds and no improperly tightened bolts,
while only one in a thousand have two defective welds and three improperly tightened
bolts. In fact, the pmf provides answers to all questions regarding the probability of robot
errors. For example,
P(exactly one error) = P(X = 1, Y = 0) + P(X = 0, Y = 1) = .06 + .03 = .09,
and
P (exactly one defective weld) = P (X = 1, Y = 0) + P (X = 1, Y = 1)
+ P (X = 1, Y = 2) + P (X = 1, Y = 3)
= .06 + .01 + .008 + .002 = .08.
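For readers who want to experiment, here is a small Python sketch of the same table (the dictionary layout and variable names are ours); it checks the normalization (2.1), reproduces the two probabilities just computed, and its row sums anticipate the marginal pmf discussed in Section 2.2:

```python
# A quick check of the robot-error pmf table of Example 2.1 (dictionary layout is ours).
p = {
    (0, 0): .84, (0, 1): .03,  (0, 2): .02,   (0, 3): .01,
    (1, 0): .06, (1, 1): .01,  (1, 2): .008,  (1, 3): .002,
    (2, 0): .01, (2, 1): .005, (2, 2): .004,  (2, 3): .001,
}

print(round(sum(p.values()), 10))                       # 1.0, in agreement with (2.1)
print(round(p[(1, 0)] + p[(0, 1)], 3))                  # P(exactly one error) = 0.09
print(round(sum(p[(1, y)] for y in range(4)), 3))       # P(exactly one defective weld) = 0.08

# Row sums give the marginal pmf of X (see Section 2.2): {0: 0.9, 1: 0.08, 2: 0.02}
print({x: round(sum(p[(x, y)] for y in range(4)), 3) for x in range(3)})
```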
Example 2.2. [MULTINOMIAL RANDOM VARIABLE] The probabilities that a certain electronic component will last less than 50 hours of continuous use, between 50 and
90 hours, or more than 90 hours, are p1 = 0.2, p2 = 0.5, and p3 = 0.3, respectively.
Consider a simple random sample of size eight of such electronic components, and set X1 for
the number of these components that last less than 50 hours, X2 for the number of these
that last between 50 and 90 hours, and X3 for the number of these that last more than 90
hours. Then (X1 , X2 , X3 ) is an example of a multinomial random variable. The sample
space S of (X1 , X2 , X3 ) consists of all triples of nonnegative integers (x1 , x2 , x3 ) that satisfy
x1 + x2 + x3 = 8.
It can be shown that the joint pmf of (X1 , X2 , X3 ) is given by
p(x1, x2, x3) = (8!/(x1! x2! x3!)) p1^{x1} p2^{x2} p3^{x3},
for any x1 , x2 , x3 in the sample space. For example, the probability that one such component will last less than 50 hours, five will last between 50 and 90 hours, and two will last
more than 90 hours is
p(1, 5, 2) = (8!/(1! 5! 2!)) 0.2^1 0.5^5 0.3^2 = 0.0945.
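A direct evaluation of this multinomial pmf is easy to script; the following sketch (the helper name multinomial_pmf is ours) reproduces the value 0.0945 using only the standard library:

```python
# Direct evaluation of the multinomial pmf of Example 2.2 (function name is ours).
from math import factorial

def multinomial_pmf(counts, probs):
    """p(x1, ..., xk) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk, with n = sum of counts."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)          # exact integer division: multinomial coefficient
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob

print(round(multinomial_pmf([1, 5, 2], [0.2, 0.5, 0.3]), 4))   # 0.0945
```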
2.2 Marginal probability mass functions
In Section 1, we learned how to obtain the cdf of the variable X from the bivariate cdf of
(X, Y ). This individual distribution of the variable X is called the marginal distribution
of X. Similarly, the individual distribution of Y is called the marginal distribution of Y .
In this section, we will learn how to obtain the pmf of a marginal distribution from that
of a bivariate (or multivariate) pmf. We begin by considering the bivariate distribution
given in Example 2.1.
Example 2.3. Consider the bivariate distribution of (X, Y ) given in Example 2.1. Find
the marginal pmf of X.
Solution. The probability pX (0) = P (X = 0) is found by summing the probabilities in
the first row of the table giving the joint pmf of (X, Y ):
pX (0) = P (X = 0, Y = 0) + P (X = 0, Y = 1)
+ P (X = 0, Y = 2) + P (X = 0, Y = 3)
= 0.84 + 0.03 + 0.02 + 0.01 = 0.9.
Also, the probability pX (1) = P (X = 1) is found by summing the probabilities in the
second row of the joint pmf table, and the probability pX (2) = P (X = 2) is found by
summing the probabilities in the third row of the joint pmf table. Thus, the marginal
pmf of X is:
 x        0     1      2
 pX(x)   .9    .08    .02
Note that the pmf of X in the above example can be read from the vertical margin of
the joint pmf table; see the table below. This justifies naming pX the marginal pmf of X.
Also, the marginal pmf pY of Y is read from the lower horizontal margin of the joint pmf.
          Y
 X        0      1      2      3
 0      .84    .03    .02    .01      .9
 1      .06    .01    .008   .002     .08
 2      .01    .005   .004   .001     .02
        .91    .045   .032   .013    1.0
In general, we have the following proposition.
Proposition 2.1. Let x1 , x2 , . . . be the possible values of X and y1 , y2 , . . . be the possible
values of Y . The marginal pmfs of X and Y , respectively, are given by
pX(x) = Σ_i p(x, yi),    pY(y) = Σ_i p(xi, y).
In the case of more than two random variables, X1 , X2 , . . . , Xm , the marginal pmf of each
variable Xi can be obtained by summing the joint pmf over all possible values of the
other random variables. For example, it can be shown that, if the joint distribution of
X1 , X2 , . . . , Xm is multinomial, then each Xi has a binomial distribution.
2.3 Conditional Distributions
The concept of a conditional distribution of a discrete random variable is an extension of
the concept of conditional probability of an event.
For a discrete (X, Y ), if x is a possible value of X, i.e. pX (x) > 0, the concept of
conditional probability provides answers to questions regarding the value of Y , given that
X = x has been observed. For example, the conditional probability that Y takes the
value y given that X = x is
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = p(x, y) / pX(x),
where p(x, y) is the joint pmf of (X, Y ) and pX (x) is the marginal pmf of X.
Definition 2.2. Let (X, Y ) be discrete, and let SY = {y1 , y2 , . . .} be the sample space of
Y . Then if x is a possible value of X, i.e. pX (x) > 0,
pY|X(yj | x) = p(x, yj) / pX(x),   j = 1, 2, . . . ,
where p(x, y) is the joint pmf of (X, Y ) and pX (x) is the marginal pmf of X, is called the
conditional pmf of Y given that X = x.
Example 2.4. Consider the discrete (X, Y ) of Example 2.1. Then, the conditional pmf
of Y given that X = 0 is
 y                     0        1        2        3
 pY|X(y | X = 0)    .9333    .0333    .0222    .0111
These numbers are obtained by dividing each joint probability in the row that corresponds
to the X-value 0, by the marginal probability that X = 0. As an illustration,
pY|X(0 | X = 0) = p(0, 0) / pX(0) = .84 / .9 = .9333.

Proposition 2.2.
1. The conditional pmf is a proper pmf. Thus, for each x with
pX (x) > 0,
pY|X(yj | x) ≥ 0, for all j = 1, 2, . . ., and Σ_j pY|X(yj | x) = 1.
2. If we know the conditional pmf of Y given X = x, for all values x in the sample
space of X (i.e. if we know pY |X (yj |x), for all j = 1, 2, . . ., and for all possible
values x of X), and also know the marginal pmf of X, then the joint pmf of (X, Y )
can be obtained as
p(x, y) = pY |X (y|x)pX (x).
3. The marginal pmf of Y can be obtained as
pY(y) = Σ_{x in SX} pY|X(y | x) pX(x).
The second part of the above proposition is useful because it is often easier to specify the
marginal distribution of X and the conditional distribution of Y given X, than to specify
directly the joint distribution of (X, Y ); see Example 2.5 below. Part 3 of Proposition 2.2
follows from the Law of Total Probability.
Example 2.5. It is known that, with probability 0.6, a new laptop owner will install a
wireless internet connection at home within a month. Let X denote the number of new
laptop owners in a week from a certain region, and let Y denote the number among them
who install a wireless connection at home within a month. Suppose
that the pmf of X is
 x        0     1     2     3      4
 pX(x)   0.1   0.2   0.3   0.25   0.15
Find the joint distribution of (X, Y ). Find the probability that Y = 4.
Solution. According to part 2 of Proposition 2.2, since the marginal distribution of X
is known, the joint distribution of (X, Y ) can be specified if pY |X (y|x) is known for all
possible values x of X. Given that X = x, however, Y has the binomial distribution with
n = x trials and probability of success p = 0.6, so that
pY|X(y | x) = (x choose y) 0.6^y 0.4^{x−y}.
For example, if X = 3 then the probability that Y = 2 is
pY|X(2 | 3) = (3 choose 2) 0.6^2 0.4^{3−2} = 0.4320.
Next, according to part 3 of Proposition 2.2,
pY(4) = Σ_{x=0}^{4} pY|X(4 | x) pX(x) = 0 + 0 + 0 + 0 + 0.6^4 × 0.15 = 0.0194.
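The same Law of Total Probability calculation can be scripted; the sketch below assumes scipy is available and uses its binomial pmf for pY|X(4 | x):

```python
# Law of Total Probability calculation of P(Y = 4) in Example 2.5,
# using the binomial conditional pmf of Y given X = x (scipy is assumed available).
from scipy.stats import binom

pX = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.25, 4: 0.15}

# binom.pmf(4, x, 0.6) is 0 whenever x < 4, so only the x = 4 term contributes.
pY4 = sum(binom.pmf(4, x, 0.6) * pX[x] for x in pX)
print(round(pY4, 4))   # 0.0194
```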
We conclude this subsection by pointing out that, since the conditional pmf is a proper
pmf, it is possible to consider its expected value and its variance. These are called the
conditional expected value and conditional variance, respectively. As an example,
Example 2.6. Consider the discrete (X, Y ) of Example 2.1. Calculate the conditional
expected value of Y given that X = 0.
Solution. Using the conditional pmf that we found in Example 2.4, we obtain,
E(Y | X = 0) = 0 × (.9333) + 1 × (.0333) + 2 × (.0222) + 3 × (.0111) = .111.
Compare this with the unconditional, or marginal, expected value of Y , which is E(Y ) =
.148.
2.4 Independence
The notion of independence of random variables is an extension of the notion of independence of events. We say that the events A = [X = x] and B = [Y = y] are independent
if
P (X = x, Y = y) = P (X = x)P (Y = y),
where, as you may recall, P (X = x, Y = y) means P ([X = x] ∩ [Y = y]). If the above
equality holds for all possible values, x, of X, and all possible values, y, of Y , then X, Y
are called independent. In particular, we have the following definition.
Definition 2.3. The discrete random variables X, Y are called independent if
pX,Y (x, y) = pX (x)pY (y), for all x, y,
where pX,Y is the joint pmf of (X, Y ) and pX , pY are the marginal pmfs of X, Y , respectively.
Note that pX,Y (x, y) = pX (x)pY (y) can be rephrased as: The events A = [X = x] and
B = [Y = y] are independent for all x, y.
The next proposition is a collection of some statements that are equivalent to the statement of independence of two random variables.
Proposition 2.3. Each of the following statements implies, and is implied by, the independence of the random variables X, Y .
1. FX,Y (x, y) = FX (x)FY (y), for all x, y, where FX,Y is the joint cdf of (X, Y ) and
FX , FY are the marginal cdfs of X, Y , respectively.
2. pY |X (y|x) = pY (y), where pY |X (y|x) is the conditional probability of [Y = y] given
that [X = x], and pY is the marginal pmf of Y . In other words, the conditional pmf
of Y given X = x is the same for all values x that X might take.
3. pX|Y (x|y) = pX (x), where pX|Y (x|y) is the conditional probability of [X = x] given
that [Y = y], and pX is the marginal pmf of X. In other words, the conditional pmf
of X given Y = y is the same for all values y that Y might take.
4. Any event associated with the random variable X is independent from any event
associated with the random variable Y , i.e. [X ∈ A] is independent from [Y ∈ B],
where A is any subset of the sample space of X, and B is any subset of the sample
space of Y .
5. For any two functions h and g, the random variables h(X), g(Y ) are independent.
Example 2.7. Consider the joint distribution of (X, Y ) given in Example 2.1. Are X, Y
independent?
Solution. Here X, Y are not independent since
p(0, 0) = .84 ≠ pX(0) pY(0) = (.9)(.91) = .819.
Example 2.8. Are the X, Y of Example 2.5 independent?
Solution. Here X, Y are not independent since
pX,Y(3, 4) = pY|X(4 | 3) pX(3) = 0 × pX(3) = 0 ≠ pX(3) pY(4) = 0.25 × 0.0194 = 0.0049.
Example 2.9. A system is made up of two components connected in parallel. Let A,
B denote the two components. Thus, the system fails if both components fail. Let the
random variable X take the value 1 if component A works, and the value 0 if it does not.
Similarly, Y takes the value 1 if component B works, and the value 0 if it does not. From
the repair history of the system it is known that the joint pmf of (X, Y ) is
          Y
 X         0          1
 0      0.0098     0.9702     0.98
 1      0.0002     0.0198     0.02
        0.01       0.99       1.0
In this example, it can be seen that pX,Y (x, y) = pX (x)pY (y), for all x, y. Thus, X, Y are
independent. Moreover, calculation of the conditional pmf of Y given X reveals that the
conditional pmf of Y given X = x does not depend on the particular value x of X, in
accordance with part 2 of Proposition 2.3.
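The factorization in Definition 2.3 can be checked numerically for this table; a minimal sketch (the dictionary layout is ours):

```python
# Checking the factorization pXY(x, y) = pX(x) pY(y) for the table of Example 2.9.
from itertools import product

pXY = {(0, 0): 0.0098, (0, 1): 0.9702, (1, 0): 0.0002, (1, 1): 0.0198}
pX = {x: sum(pXY[(x, y)] for y in (0, 1)) for x in (0, 1)}   # row sums
pY = {y: sum(pXY[(x, y)] for x in (0, 1)) for y in (0, 1)}   # column sums

print(all(abs(pXY[(x, y)] - pX[x] * pY[y]) < 1e-12 for x, y in product((0, 1), (0, 1))))
# True: every cell factors, so X and Y are independent
```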
The definition of independence of two random variables extends to several random variables.
Definition 2.4. The discrete random variables X1 , X2 , . . . , Xm are independent if their
joint pmf is the product of the corresponding marginal pmfs, namely, if
pX1 ,X2 ,...,Xm (x1 , x2 , . . . , xm ) = pX1 (x1 )pX2 (x2 ) · · · pXm (xm ),
for all x1 , x2 , . . . , xm .
3 Describing the Joint Distribution of Continuous Random Variables
Definition 3.1. The joint or bivariate density function of the continuous (X, Y ) is a
non-negative function f (x, y) such that
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1, and

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy,
for any ’reasonable’ two-dimensional set A. The joint or multivariate probability density
function of the continuous (X1 , X2 , . . . , Xn ) is a non-negative function f (x1 , x2 , . . . , xn )
such that
∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, x2, . . . , xn) dx1 · · · dxn = 1, and

P((X1, X2, . . . , Xn) ∈ A) = ∫ · · · ∫_A f(x1, x2, . . . , xn) dx1 · · · dxn,
for any ’reasonable’ n-dimensional set A.
In particular, the definition implies
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.
Thus, from the geometric point of view, probabilities are represented as volumes under
the joint pdf f (x, y), which now is a surface.
The pdf can be derived from the cdf by differentiating:
f(x1, x2, . . . , xn) = ∂^n F(x1, x2, . . . , xn) / (∂x1 · · · ∂xn).   (3.1)
Example 3.1. Using the joint cdf of the bivariate uniform distribution in the unit rectangle [0, 1] × [0, 1], which was given in Example 1.1, find the corresponding joint pdf.
Solution. Using (3.1) and the form of the bivariate cdf of the uniform distribution in the
unit rectangle, we obtain that f (x, y) = 0, if (x, y) is outside the unit rectangle, and
f (x, y) = 1, for 0 ≤ x, y ≤ 1.
Example 3.2. Consider the bivariate density function
f(x, y) = (12/7)(x^2 + xy), 0 ≤ x, y ≤ 1.
Find the probability that X > Y .
Solution. The desired probability can be found by integrating f over the region {(x, y)|0 ≤
y ≤ x ≤ 1}:
P(X > Y) = (12/7) ∫_0^1 ∫_0^x (x^2 + xy) dy dx = 9/14.
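For readers who prefer a numerical check, the double integral can be evaluated with scipy (assumed available); dblquad takes the integrand as a function of (y, x) with y-limits that may depend on x:

```python
# Numerical confirmation of P(X > Y) = 9/14 for the density of Example 3.2
# (scipy is assumed available).
from scipy.integrate import dblquad

f = lambda y, x: (12 / 7) * (x**2 + x * y)             # joint pdf on the unit square

prob, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: x)   # region 0 <= y <= x <= 1
print(round(prob, 6), round(9 / 14, 6))                # both about 0.642857
```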
Example 3.3. Consider the bivariate distribution of Example 3.2. Find the probability
that X ≤ 0.6 and Y ≤ 0.4, and the joint cdf of (X, Y ).
Solution. Using the joint pdf given in Example 3.2, we have
P(X ≤ 0.6, Y ≤ 0.4) = F(0.6, 0.4) = (12/7) ∫_0^{0.6} ∫_0^{0.4} (x^2 + xy) dy dx = 0.0741.
In general,

F(x, y) = ∫_0^x ∫_0^y (12/7)(s^2 + st) dt ds = (12/7)(x^3 y / 3 + x^2 y^2 / 4).

Moreover, an easy differentiation verifies that f(x, y) = ∂^2 F(x, y) / (∂x ∂y) is indeed the joint
pdf given in Example 3.2.
3.1 Marginal Distributions
In the continuous case, the marginal pdf of X and Y can be found by integrating the
bivariate pdf f (x, y):
fX(x) = ∫_{−∞}^{∞} f(x, y) dy,    fY(y) = ∫_{−∞}^{∞} f(x, y) dx.   (3.2)
Example 3.4. Find the marginal pdf of X and Y from their joint bivariate uniform
distribution given in Example 3.1.
Solution. From (3.2), we have
fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 1 dy = 1, for 0 ≤ x ≤ 1, and fX(x) = 0, for x ∉ [0, 1].
Similarly, the marginal pdf of Y is obtained by
fY(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_0^1 1 dx = 1, for 0 ≤ y ≤ 1, and fY(y) = 0, for y ∉ [0, 1].
Thus, each of X and Y has a uniform in [0, 1] distribution, in agreement with Example
1.2.
Example 3.5. Find the marginal pdf of X and Y from their joint bivariate distribution
given in Example 3.2.
Solution. From (3.2), we have
fX(x) = ∫_0^1 (12/7)(x^2 + xy) dy = (12/7) x^2 + (6/7) x, for 0 ≤ x ≤ 1, and fX(x) = 0, for x ∉ [0, 1].
Similarly, the marginal pdf of Y is given by
fY(y) = ∫_0^1 (12/7)(x^2 + xy) dx = 4/7 + (6/7) y, for 0 ≤ y ≤ 1.
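The two integrations in (3.2) for this example can also be done symbolically; a minimal sketch assuming sympy is available:

```python
# Symbolic computation of the marginal pdfs in Example 3.5 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = sp.Rational(12, 7) * (x**2 + x * y)     # joint pdf of Example 3.2 on [0,1] x [0,1]

fX = sp.integrate(f, (y, 0, 1))             # marginal pdf of X: (12/7)x**2 + (6/7)x
fY = sp.integrate(f, (x, 0, 1))             # marginal pdf of Y: 4/7 + (6/7)y
print(sp.simplify(fX))
print(sp.simplify(fY))
```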
3.2 Conditional Distributions
In analogy with the definition in the discrete case, if (X, Y ) are continuous with joint pdf
f (x, y), the conditional pdf of Y given X = x is defined to be
fY|X=x(y) = f(x, y) / fX(x),   (3.3)
if fX (x) > 0. In the case of discretized measurements, fY |X=x (y)∆y approximates P (y ≤
Y ≤ y + ∆y|x ≤ X ≤ x + ∆x), as can be seen from
P(y ≤ Y ≤ y + ∆y | x ≤ X ≤ x + ∆x) = P(y ≤ Y ≤ y + ∆y, x ≤ X ≤ x + ∆x) / P(x ≤ X ≤ x + ∆x)
≈ (f(x, y) ∆x ∆y) / (fX(x) ∆x) = (f(x, y) / fX(x)) ∆y.
The definition of conditional pdf implies that
f(x, y) = fY|X=x(y) fX(x),   (3.4)
which is useful for specifying joint probability distributions. Integrating this over x we
obtain
fY(y) = ∫_{−∞}^{∞} fY|X=x(y) fX(x) dx,   (3.5)
which is the Law of Total Probability for continuous variables.
Example 3.6. For a cylinder selected at random from the manufacturing line, let X=height,
Y =radius. Suppose X, Y have a joint pdf

f(x, y) = 3x / (8y^2), if 1 ≤ x ≤ 3 and 1/2 ≤ y ≤ 3/4, and f(x, y) = 0 otherwise.
Find fX (x) and fY |X=x (y).
Solution. According to the formulae,

fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_{0.5}^{0.75} (3x/(8y^2)) dy = x/4, for 1 ≤ x ≤ 3, and

fY|X=x(y) = f(x, y) / fX(x) = 3/(2y^2), for 1/2 ≤ y ≤ 3/4.
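A symbolic check of this solution (assuming sympy is available): integrating out y recovers fX(x) = x/4, and dividing f by fX gives the conditional pdf, which is free of x:

```python
# Verification of Example 3.6 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = 3 * x / (8 * y**2)                                   # joint pdf on 1<=x<=3, 1/2<=y<=3/4

fX = sp.integrate(f, (y, sp.Rational(1, 2), sp.Rational(3, 4)))
print(sp.simplify(fX))                                   # x/4
print(sp.simplify(f / fX))                               # 3/(2*y**2), no dependence on x
```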
3.3 Independence
Definition 3.2. The continuous random variables X, Y are called independent if
fX,Y(x, y) = fX(x) fY(y), for all x, y,
where fX,Y is the joint pdf of (X, Y ), and fX , fY are the marginal pdfs of X, Y , respectively.
The next proposition is a collection of some statements that are equivalent to the statement of independence of two random variables.
Proposition 3.1. Each of the following statements implies, and is implied by, the independence of the continuous random variables X, Y .
1. FX,Y (x, y) = FX (x)FY (y), for all x, y, where FX,Y is the joint cdf of (X, Y ) and
FX , FY are the marginal cdfs of X, Y , respectively.
2. fY |X=x (y) = fY (y), where fY |X=x is the conditional pdf of Y given that [X = x],
and fY is the marginal pdf of Y . In other words, the conditional pdf of Y , given
X = x, is the same for all values x that X might take.
3. fX|Y =y (x) = fX (x), where fX|Y =y (x) is the conditional pdf of X given that [Y = y],
and fX is the marginal pdf of X. In other words, the conditional pdf of X, given
Y = y, is the same for all values y that Y might take.
4. Any event associated with the random variable X is independent from any event
associated with the random variable Y , i.e. [X ∈ A] is independent from [Y ∈ B],
where A is any subset of the sample space of X, and B is any subset of the sample
space of Y .
5. For any two functions h and g, the random variables h(X), g(Y ) are independent.
Example 3.7. Consider the joint distribution of X=height, and Y =radius given in Example 3.6. Are X and Y independent?
Solution. The marginal pdf of X was derived in Example 3.6. That of Y is

fY(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_1^3 (3x/(8y^2)) dx = 3/(2y^2), for 1/2 ≤ y ≤ 3/4.
Finally, the joint pdf of (X, Y ) is given in Example 3.6. From this it can be verified that

f(x, y) = fX(x) fY(y).

Thus, X and Y are independent. An alternative method of checking the independence of
X and Y is to examine the conditional pdf fY|X=x(y). In Example 3.6 it was obtained that
fY|X=x(y) = 3/(2y^2). Since this is constant in x, we conclude that X and Y are independent,
according to part 2 of Proposition 3.1. Finally, as an application of part 5 of Proposition
3.1, X and Y^2 are also independent.
4 Expected Value of a Statistic
A function h(X1 , . . . , Xn ) of random variables will be called a statistic. Statistics are,
of course, random variables and, as such, they have a distribution. The distribution of a
statistic is known as its sampling distribution. As in the univariate case, we will see in
this section that the expected value of a function of random variables (statistic) can be
obtained without having to first obtain its distribution. This is a very useful/convenient
method of calculating the expected value as the sampling distribution of a statistic is
typically difficult to obtain; this will be demonstrated in Chapter 5. The variance of
a linear combination of random variables can also be calculated without having to first
obtain its distribution, but this calculation involves the concept of covariance and thus it
will be considered in the next section.
Let (X, Y ) be discrete with joint pmf pX,Y . The expected value of a function, h(X, Y ), of
(X, Y ) is computed by
E[h(X, Y)] = Σ_x Σ_y h(x, y) pX,Y(x, y).
In the continuous case, summation is replaced by integration. Thus, if (X, Y ) is continuous
with joint pdf fX,Y , then the expected value of a function, h(X, Y ), of (X, Y ) is computed
by
E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.
The formulae extend directly to functions of more than two random variables. Thus, in
the discrete case, the expected value of the statistic h(X1 , . . . , Xn ) is computed by
E[h(X1, . . . , Xn)] = Σ_{x1} · · · Σ_{xn} h(x1, . . . , xn) p(x1, . . . , xn),
where p denotes the joint pmf of X1 , . . . , Xn , while in the continuous case, the expected
value of h(X1 , . . . , Xn ) is computed by
E[h(X1, . . . , Xn)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x1, . . . , xn) f(x1, . . . , xn) dx1 · · · dxn.
Example 4.1. Consider the joint distribution of (X, Y ) given in Example 2.1. Find the
expected value of the total number of errors, T , that the robot makes on a car.
Solution. Here T = h(X, Y ) = X + Y . Thus
E(T) = Σ_x Σ_y (x + y) p(x, y)
     = 0(.84) + 1(.03) + 2(.02) + 3(.01)
     + 1(.06) + 2(.01) + 3(.008) + 4(.002)
     + 2(.01) + 3(.005) + 4(.004) + 5(.001)
     = .268.
Example 4.2. Consider the joint distribution of X=height, Y = radius given in Example
3.6. Find the expected value of the volume of a cylinder.
Solution: The volume of the cylinder is given in terms of the height (X) and radius (Y )
by the function h(X, Y) = π Y^2 X. Thus,

E[h(X, Y)] = ∫_1^3 ∫_{0.5}^{0.75} π y^2 x (3x/(8y^2)) dy dx = (13/16) π.
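A numerical check of this expected value, assuming scipy is available:

```python
# Numerical check that E(pi * Y**2 * X) = 13*pi/16 for the cylinder of Example 3.6
# (scipy is assumed available).
from math import pi
from scipy.integrate import dblquad

integrand = lambda y, x: (pi * y**2 * x) * (3 * x / (8 * y**2))   # h(x, y) * f(x, y)

val, _ = dblquad(integrand, 1, 3, lambda x: 0.5, lambda x: 0.75)
print(round(val, 4), round(13 * pi / 16, 4))                      # both about 2.5525
```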
Of special interest is the case where the function of interest is a sum, or, more generally, a linear combination of random variables. The function h(X1 , . . . , Xn ) is a linear
combination of X1 , . . . , Xn if
h(X1 , . . . , Xn ) = a1 X1 + a2 X2 + . . . + an Xn ,
where a1, . . . , an are given constants. The total T = Σ_i Xi is a linear combination with all
ai = 1, and the sample mean X̄ = (1/n) T is a linear combination with all ai = 1/n.
Proposition 4.1. Let (X1 , . . . , Xn ) have any joint distribution (i.e. independent or dependent, discrete or continuous), and set E(Xi ) = µi . Then
E(a1 X1 + · · · + an Xn ) = a1 µ1 + . . . + an µn .
In other words, the expected value of a linear combination of random variables is the same
linear combination of their expected values.
Corollary 4.1. Let (X1 , . . . , Xn ) have any joint distribution. If E(X1 ) = · · · = E(Xn ) =
µ, then
E(X̄) = µ  and  E(T) = nµ,

where T = Σ_i Xi and X̄ = (1/n) T.
Corollary 4.2. Let (X1 , X2 ) have any joint distribution. Then
E(X1 − X2 ) = µ1 − µ2 , and E(X1 + X2 ) = µ1 + µ2 .
As an application of Corollary 4.2, the expected value of the total number of errors which
was obtained in Example 4.1, can also be computed as
E(T ) = E(X + Y ) = E(X) + E(Y ) = .12 + .148 = .268.
We conclude this section with a result about the expected value of a product of independent random variables.
Proposition 4.2. If X and Y are independent, then
E(XY ) = E(X)E(Y ).
In general, if X1 , . . . , Xn are independent,
E(X1 · · · Xn ) = E(X1 ) · · · E(Xn ).
As an application of Proposition 4.2, the expected value of the volume h(X, Y ) = πY 2 X,
which was obtained in Example 4.2, can also be calculated as
E[h(X, Y )] = πE(Y 2 )E(X),
since, as shown in Example 3.7, X and Y are independent, and thus, by part 5 of Proposition 3.1, X and Y 2 are also independent.
5 Parameters of a Multivariate Distribution

5.1 The Regression Function
It is often very interesting and informative to know how the expected value of one variable
changes when we have observed the value that another variable has taken. In fact, in the
study where X is the velocity and Y is the stopping distance of an automobile, and in
the study where X is the diameter at breast height and Y is the age of a tree, both of
which were mentioned in Section 1, knowing how the expected value of Y changes with
X would be of primary interest.
Definition 5.1. For the bivariate random variable (X, Y ), the function
µY |X (x) = E(Y |X = x)
is called the regression function of Y on X.
Example 5.1. For the discrete (X, Y ) considered in Example 2.1, regarding the errors a
robot makes per car, calculate the regression function of Y on X.
Solution. In Example 2.4 we obtained the conditional pmf of Y given that X = 0 as
 y                     0         1         2         3
 pY|X(y | X = 0)    0.9333    0.0333    0.0222    0.0111
and in Example 2.6, we computed the conditional expectation of Y given that X = 0 as
E(Y |X = 0) = 0.111. Repeating these calculations, conditioning first on X = 1 and then
on X = 2, we obtain
 y                     0        1        2        3
 pY|X(y | X = 1)    0.75     0.125    0.1      0.025
so that E(Y |X = 1) = 0.4, and
 y                     0       1       2       3
 pY|X(y | X = 2)    0.5     0.25    0.2     0.05
from which we obtain E(Y |X = 2) = 0.8. Summarizing the above calculations of the
conditional expectation of Y in a table, we obtain the regression function of Y on X:
 x             0        1      2
 µY|X(x)    0.111     0.4    0.8
The information that this regression function makes visually apparent, and which was
not easily discernible from the joint probability mass function, is that in a car with more
defective welds, you can expect to experience more improperly tightened bolts.
Note further, that a weighted average of the conditional expected values of Y , with weights
equal to the marginal probabilities of X, gives the unconditional expected value of Y ; that
is, using the marginal pmf of X from Example 2.3, we have
E(Y |X = 0)pX (0) + E(Y |X = 1)pX (1) + E(Y |X = 2)pX (2)
= 0.111 × 0.9 + 0.4 × 0.08 + 0.8 × 0.02 = 0.148 = E(Y ).
This is the Law of Total Probability for Expectations.
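Both the regression function and this weighted-average check can be reproduced from the joint pmf table; a minimal Python sketch (the dictionary layout is ours):

```python
# The regression function of Example 5.1 and the total-expectation check,
# computed from the joint pmf table of Example 2.1 (dictionary layout is ours).
p = {
    (0, 0): .84, (0, 1): .03,  (0, 2): .02,   (0, 3): .01,
    (1, 0): .06, (1, 1): .01,  (1, 2): .008,  (1, 3): .002,
    (2, 0): .01, (2, 1): .005, (2, 2): .004,  (2, 3): .001,
}
pX = {x: sum(p[(x, y)] for y in range(4)) for x in range(3)}

# mu_{Y|X}(x) = sum_y y * p(x, y) / pX(x)
reg = {x: sum(y * p[(x, y)] for y in range(4)) / pX[x] for x in range(3)}
print({x: round(m, 3) for x, m in reg.items()})        # {0: 0.111, 1: 0.4, 2: 0.8}

# Weighted average of the conditional means recovers E(Y) = 0.148
print(round(sum(reg[x] * pX[x] for x in range(3)), 3))
```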
For continuous variables, it is customary to model the regression function. The simplest
model is the linear regression model which specifies that
µY|X(x) = α + βx.   (5.1)
Quadratic and more complicated models are also commonly used. The advantage of such
models is that they offer easy interpretation of the effect of X on the expected value of Y ,
and also that the typically unknown parameters α and β are easily estimated from data,
as we will see in the chapter on regression analysis.
Example 5.2. In accelerated life testing, products are operated under harsher conditions
than those encountered in real life in order to expedite the observation of the life times of
the products being tested. Of course, interest lies in the life time of the products under
normal operating conditions. A regression model can be used to transfer information
obtained under harsher conditions to normal operating conditions. As an example, it
may be assumed that the lifetime of a randomly chosen product has the exponential
distribution but that the parameter λ of the distribution (see Example 3.7 of Chapter
3) depends on the stress applied. Let T denote the lifetime, and let X denote the stress
applied. Let
λ(x) = (α + βx)^{−1},   (5.2)
and assume that, when stress X = x is applied then the lifetime T has distribution with
pdf
fT|X=x(t) = λ(x) e^{−λ(x) t}.   (5.3)
As we saw in Example 4.12 of Chapter 3, the expected value of an exponentially distributed
random variable is the inverse of the parameter λ. Thus, if the model assumptions (5.2),
(5.3) are correct, then the regression function of T on X is
µT|X(x) = α + βx.   (5.4)
5.2 Covariance and Correlation
When two random variables X and Y are not independent, they are dependent. We say
that X, Y are positively dependent if “large” values of X are associated with “large”
values of Y , and “small” values of X are associated with “small” values of Y . For example,
X=height and Y =weight of a randomly selected adult male, are positively dependent. In
the opposite case we say that X, Y are negatively dependent.
It is often of interest to quantify the dependence of two (positively or negatively) dependent variables. Thus the population parameters of a bivariate distribution involve
measures of dependence, in addition to the parameters of the two marginal distributions.
The regression function, which was discussed in the previous subsection, is a consequence
and a manifestation of dependence since, if X and Y are independent, then the regression
function µY|X(x) is constant in x. However, it is important to note that the regression
function is not designed to measure the degree of dependence of X and Y .
In this subsection we will define the correlation of X and Y , as a measure of linear
dependence, and the rank correlation, as a more general measure of nonlinear dependence.
In order to do that, we will first need to talk about the covariance of X and Y .
Definition 5.2. The covariance of X and Y , denoted by Cov(X, Y ) or σXY , is defined
as
σXY = E[(X − µX )(Y − µY )] = E(XY ) − µX µY ,
where µX and µY are the marginal expected values of X and Y , respectively.
The second equality in the above definition is a computational formula for the covariance,
similar to the computational (short-cut) formula σ_X^2 = E[(X − µX)^2] = E(X^2) − µ_X^2 for
the variance. It is also worth pointing out that the covariance of a random variable with
itself, i.e. Cov(X, X) or σXX, is

σXX = E[(X − µX)(X − µX)] = E[(X − µX)^2] = σ_X^2,

which is the variance of X.
In order to develop an intuitive understanding of the role of covariance in the quantification of dependence, let us consider a finite underlying population of N units each of
which has characteristics of interest (x, y). As a concrete example, consider the population of men 25-30 years old residing in Centre County, PA, and let x=height, y=weight
be the characteristics of interest. Let (x1 , y1 ), (x2 , y2 ), . . . , (xN , yN ) denote the values of
the characteristics of the N units. Let (X, Y ) denote the characteristics of a randomly
selected unit. Then (X, Y ) has a discrete distribution (even though height and weight
are continuous variables!) taking each of the possible values (x1 , y1 ), . . . , (xN , yN ) with
probability 1/N . In this case the covariance formula in Definition 5.2 can be computed
as
σXY = (1/N) Σ_{i=1}^{N} (xi − µX)(yi − µY),   (5.5)

where µX = (1/N) Σ_{i=1}^{N} xi and µY = (1/N) Σ_{i=1}^{N} yi are the marginal expected values of X and
Y , respectively. Suppose now that X, Y have positive dependence. Then the products
(xi − µX )(yi − µY ),
which appear in the summation of relation (5.5), will tend to be positive, and thus σXY
will, in all likelihood, also be positive. If the dependence is negative then these products
will tend to be negative and so will σXY . Thus we see σXY will be positive or negative
according to whether the dependence of X and Y is positive or negative. Other properties
of covariance are summarized in the next proposition.
Proposition 5.1.
1. If X, Y are independent, then Cov(X, Y ) = 0.
2. Cov(X, Y ) = −Cov(X, −Y ) = −Cov(−X, Y ).
3. For any real numbers b and d,
Cov(X + b, Y + d) = Cov(X, Y ).
4. For any real numbers a, b, c and d,
Cov(aX + b, cY + d) = ac Cov(X, Y ).
The first property, which is desirable for any measure of the degree of dependence, follows
from the formula for calculating the expected value of a product of independent random
variables, which was given in Proposition 4.2. The second property means that, if the
sign of one of the two variables changes, then a positive dependence becomes negative
and vice-versa. The third property means that adding constants to the random variables
will not change their covariance. The fourth property, however, renders the covariance
undesirable as a measure of dependence. This is because it implies that the covariance
of X and Y changes when the scale (or unit) changes. Thus, in the example where
X=height and Y =weight, changing the scale from (m, kg) to (ft, lb), changes the value
of the covariance of X and Y . Clearly, we would like a measure of dependence to be
unaffected by such scale changes. This leads to the definition of the correlation coefficient
as a scale-free version of covariance.
Definition 5.3. The correlation, or correlation coefficient of X and Y , denoted by
Corr(X, Y ) or ρXY , is defined as
ρX,Y = Corr(X, Y) = Cov(X, Y) / (σX σY),
where σX , σY are the marginal standard deviations of X, Y , respectively.
The following proposition summarizes some properties of the correlation coefficient.
Proposition 5.2.
1. If a and c are either both positive or both negative, then
Corr(aX + b, cY + d) = Corr(X, Y ).
If a and c are of opposite signs, then
Corr(aX + b, cY + d) = −Corr(X, Y ).
2. For any two random variables X, Y ,
−1 ≤ ρX,Y ≤ 1,
and if X, Y are independent then ρX,Y = 0.
3. ρX,Y = 1 or −1 if and only if Y = aX + b for some numbers a, b with a ≠ 0.
The properties listed in Proposition 5.2 imply that correlation is indeed a successful
measure of linear dependence. First, it has the desirable property of being independent
of scale. Second, the fact that it takes values between −1 and 1, makes it possible to
develop a feeling for the degree of dependence between X and Y . Thus, if the variables
are independent, their correlation coefficient is zero, while ρX,Y = ±1 happens if and only
if X and Y have the strongest possible (that is, knowing one amounts to knowing the
other) linear dependence.
It should be emphasized that correlation measures only linear dependence. This being
the case, it is possible for two variables to be dependent but to have zero correlation. See
figure below. Thus, if two variables have zero correlation, but we do not know whether
or not they are independent, we call them uncorrelated.
[FIGURE SHOWING CORRELATION IN VARIOUS CONFIGURATIONS]
Example 5.3. Let X denote the deductible in car insurance, and let Y denote the deductible in home insurance, of a randomly chosen home and car owner in some community.
Suppose that X, Y have the following joint pmf.
          y
 x          0       100      200
 100      .20      .10      .20      .5
 250      .05      .15      .30      .5
          .25      .25      .50     1.0
where the deductible amounts are in dollars. Find σX,Y and ρX,Y . Next, express the
deductible amounts in cents and find again the covariance and the correlation coefficient.
Solution. We will use the computational formula Cov(X, Y) = E(XY) − E(X)E(Y). First,

E(XY) = Σ_x Σ_y x y p(x, y) = 23,750.

Also,

E(X) = Σ_x x pX(x) = 175,  E(Y) = 125.

Thus, Cov(X, Y) = 23,750 − (175)(125) = 1875. Omitting the details of the calculations, the
standard deviations are computed to be σX = 75, σY = 82.92. Thus, ρX,Y = .301. Next, if
the deductible amounts are expressed in cents, then the new deductible amounts are
(X′, Y′) = (100X, 100Y). Thus,

Cov(X′, Y′) = Cov(100X, 100Y) = (100)(100) Cov(X, Y) = 18,750,000,

which is the reason (mentioned above) that covariance is not a suitable measure of dependence. The correlation remains unchanged.
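The calculations of this example are easy to reproduce; a minimal sketch (the dictionary layout is ours; amounts in dollars):

```python
# Covariance and correlation for the deductible table of Example 5.3
# (dictionary layout is ours).
from math import sqrt

p = {(100, 0): .20, (100, 100): .10, (100, 200): .20,
     (250, 0): .05, (250, 100): .15, (250, 200): .30}

EX  = sum(x * q for (x, y), q in p.items())                    # 175
EY  = sum(y * q for (x, y), q in p.items())                    # 125
EXY = sum(x * y * q for (x, y), q in p.items())                # 23750
cov = EXY - EX * EY                                            # 1875

sdX = sqrt(sum(x**2 * q for (x, y), q in p.items()) - EX**2)   # 75
sdY = sqrt(sum(y**2 * q for (x, y), q in p.items()) - EY**2)   # about 82.92
print(round(cov), round(cov / (sdX * sdY), 4))                 # 1875 0.3015 (about .30)
```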
Example 5.4. Consider the multinomial experiment of Example 2.2, but with a sample
of size one. Thus, one electronic component will be tested, and if it lasts less than 50
hours, then Y1 = 1, Y2 = 0, and Y3 = 0; if it lasts between 50 and 90 hours, then Y1 = 0,
Y2 = 1, and Y3 = 0; if it lasts more than 90 hours, then Y1 = 0, Y2 = 0, and Y3 = 1. Find
Cov(Y1 , Y2 ), Cov(Y1 , Y3 ) and Cov(Y2 , Y3 ).
Solution. We will use computational formula Cov(X, Y ) = E(XY ) − E(X)E(Y ). First
note that
E(Y1 Y2 ) = 0.
This is due to the fact that the sample space of the bivariate random variable (Y1 , Y2 )
is {(1, 0), (0, 1), (0, 0)}, so that the product Y1 Y2 is always equal to zero. Next, since the
marginal distribution of each Yi is Bernoulli, we have that
E(Yi ) = P (Yi = 1).
Thus, according to the information given in Example 2.2, E(Y1 ) = 0.2, E(Y2 ) = 0.5,
E(Y3 ) = 0.3. It follows that Cov(Y1 , Y2 ) = −0.1, Cov(Y1 , Y3 ) = −0.06 and Cov(Y2 , Y3 ) =
−0.15.
Example 5.5. Suppose (X, Y ) have joint pdf

f(x, y) = 24xy, for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, x + y ≤ 1, and f(x, y) = 0 otherwise.
Find Cov(X, Y ), ρX,Y .
Solution. We will use Cov(X, Y) = E(XY) − E(X)E(Y). From (3.2), the marginal pdfs are
fX(x) = ∫_0^{1−x} 24xy dy = 12x(1 − x)^2, 0 ≤ x ≤ 1, and, by symmetry, fY(y) = 12y(1 − y)^2, 0 ≤ y ≤ 1.
Then

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x, y) dx dy = ∫_0^1 ∫_0^{1−x} xy (24xy) dy dx = 2/15,

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^1 x (12x(1 − x)^2) dx = 2/5,

E(Y) = ∫_{−∞}^{∞} y fY(y) dy = ∫_0^1 y (12y(1 − y)^2) dy = 2/5.

Thus Cov(X, Y) = 2/15 − (2/5)(2/5) = −2/75.

Next, E(X^2) = ∫_0^1 x^2 (12x(1 − x)^2) dx = 1/5, so σ_X^2 = 1/5 − 4/25 = 1/25 and σX = 1/5.
Similarly, σY = 1/5. Thus

ρX,Y = (−2/75) / ((1/5)(1/5)) = −50/75 = −2/3.
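An exact symbolic check of these integrals, assuming sympy is available:

```python
# Exact computation of Cov(X, Y) and rho for Example 5.5 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 24 * x * y                                            # joint pdf on the triangle x+y<=1

def E(g):
    """Expected value of g(X, Y) over the region 0 <= y <= 1-x, 0 <= x <= 1."""
    return sp.integrate(sp.integrate(g * f, (y, 0, 1 - x)), (x, 0, 1))

cov = E(x * y) - E(x) * E(y)
rho = cov / sp.sqrt((E(x**2) - E(x)**2) * (E(y**2) - E(y)**2))
print(cov, rho)                                           # -2/75 -2/3
```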
5.3 Variance and Covariance of Linear Combinations
In Section 4 we saw that the expected value of a linear combination of random variables is
the same linear combination of the expected values, regardless of whether or not the random variables are independent. Here we will see that the variance of a linear combination
involves the pairwise covariances, and thus it depends on whether or not the variables are
independent.
Proposition 5.3. Let the variables X1 , . . . , Xn have variances σ_{Xi}^2 = σi^2 and covariances
σ_{Xi,Xj} = σij. Then

1. If X1 , . . . , Xn are independent, so that all σij = 0,

Var(a1 X1 + . . . + an Xn) = a1^2 σ1^2 + . . . + an^2 σn^2.

2. Without independence,

Var(a1 X1 + . . . + an Xn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj).
Corollary 5.4.

1. If X1 , X2 are independent,

Var(X1 − X2) = σ1^2 + σ2^2,   Var(X1 + X2) = σ1^2 + σ2^2.

2. Without independence,

Var(X1 − X2) = σ1^2 + σ2^2 − 2 Cov(X1, X2),   Var(X1 + X2) = σ1^2 + σ2^2 + 2 Cov(X1, X2).
Corollary 5.5. If X1 , . . . , Xn are independent and σ1^2 = · · · = σn^2 = σ^2, then

Var(X̄) = σ^2 / n  and  Var(T) = n σ^2,

where T = Σ_i Xi and X̄ = T/n.
Proposition 5.4. Let the variables X1 , . . . , Xn have variances σ_{Xi}^2 = σi^2 and covariances
σ_{Xi,Xj} = σij. Let also X_{j1} , . . . , X_{jk} be a collection of random variables obtained from
X1 , . . . , Xn . Then,

Cov(X1 , a1 X_{j1} + · · · + ak X_{jk}) = Σ_{i=1}^{k} ai Cov(X1, X_{ji}).
Example 5.6. Consider the multinomial experiment of Example 2.2. Thus, n = 8 products are tested, X1 denotes the number of those that last less than 50 hours, X2 denotes
the number that last between 50 and 90 hours, and X3 = 8 − X1 − X2 denotes the number
that last more than 90 hours. Find the covariance of X1 and X2 .
Solution. For each of the eight products, i.e. for each i = 1, . . . , 8, define triples of
variables (Yi1 , Yi2 , Yi3 ), as in Example 5.4. Thus, if the ith product lasts less than 50
hours, then Yi1 = 1 and Yi2 = Yi3 = 0; if it lasts between 50 and 90 hours then Yi2 = 1
and Yi1 = Yi3 = 0; if it lasts more than 90 hours then Yi3 = 1 and Yi1 = Yi2 = 0. Thus,
X1 , the number of products that last less than 50 hours, is given as
X1 = Σ_{i=1}^{8} Yi1.
Similarly,

X2 = Σ_{i=1}^{8} Yi2  and  X3 = Σ_{i=1}^{8} Yi3.
It follows that
Cov(X1, X2) = Cov(Σ_{i=1}^{8} Yi1, Σ_{j=1}^{8} Yj2) = Σ_{i=1}^{8} Σ_{j=1}^{8} Cov(Yi1, Yj2).
Assuming that the lifetimes of different products are independent, we have that, if i ≠ j,
then Cov(Yi1, Yj2) = 0. Thus,
Cov(X1, X2) = Σ_{i=1}^{8} Cov(Yi1, Yi2) = Σ_{i=1}^{8} (−0.1) = −0.8,
where −0.1 is the covariance of Yi1 and Yi2 , as derived in Example 5.4. The covariance
of X1 and X3 is similarly found to be 8 × (−0.06) = −0.48 and the covariance of X2 and
X3 is similarly found to be 8 × (−0.15) = −1.2.
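These covariances can also be checked by simulation; a minimal sketch assuming numpy is available (the sample size of 200,000 is arbitrary):

```python
# Monte Carlo check of the multinomial covariances of Example 5.6
# (numpy is assumed available).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.multinomial(8, [0.2, 0.5, 0.3], size=200_000)   # rows are (X1, X2, X3)

C = np.cov(samples, rowvar=False)                             # 3 x 3 sample covariance matrix
print(round(C[0, 1], 2), round(C[0, 2], 2), round(C[1, 2], 2))
# approximately -0.8, -0.48, -1.2
```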