Ch.4 Multivariate Variables and Their Distribution

1 Introduction
In Chapter 3, we defined a univariate random variable as a rule that assigns a number to
each outcome of a random experiment. When several different rules each assign a number
to the same outcome of a random experiment, we have a multivariate random variable. In
Chapter 1, we saw several examples of studies where more than one variable is observed
from each population unit. Some additional examples are
1. The X, Y and Z components of wind velocity can be measured in studies of atmospheric turbulence.
2. The velocity X and stopping distance Y of an automobile can be studied in an
automobile safety study.
3. The diameter at breast height (DBH) and age of a tree can be measured in a study
aimed at developing a method for predicting age from diameter.
In such studies, it is not only interesting to investigate the behavior of each variable
individually, but also to investigate the degree of relationship between them.
We say that we know the joint distribution of a bivariate variable, (X, Y ), if we know all
probabilities of the form
P (a < X ≤ b, c < Y ≤ d), with a < b, c < d,
where P (a < X ≤ b, c < Y ≤ d) is to be interpreted as P (a < X ≤ b and c < Y ≤ d) or
as P ([a < X ≤ b] ∩ [c < Y ≤ d]). Similarly, we say that we know the joint distribution
of a multivariate variable, X1 , X2 , . . . , Xm , if we know all probabilities of the form
P (a1 < X1 ≤ b1 , a2 < X2 ≤ b2 , . . . , am < Xm ≤ bm ), with ak < bk , k = 1, . . . , m.
As in the univariate case, which was considered in Chapter 3, a concise description of
the joint probability distribution of any multivariate random variable can be achieved
through its cumulative distribution function (cdf).
Definition 1.1. The joint or bivariate cdf of two random variables, X, Y , is defined by
F (x, y) = P (X ≤ x, Y ≤ y).
The joint or multivariate cdf of several random variables, X1 , X2 , . . . , Xm , is defined by
F(x1, x2, . . . , xm) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xm ≤ xm).
The cdf is convenient for calculating the probability that (X, Y ) will lie in a rectangle. In
the bivariate case the formula is
P(x1 < X ≤ x2, y1 < Y ≤ y2) = F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1).   (1.1)
Example 1.1. Let (X, Y ) be uniformly distributed on the unit rectangle [0, 1] × [0, 1].
This means that the probability that (X, Y ) lies in a subset A of the unit rectangle
equals the area of A. Thus, the bivariate cdf of (X, Y ) is
F (x, y) = xy, for 0 ≤ x, y ≤ 1,
F (x, y) = x, for 0 ≤ x ≤ 1, y ≥ 1,
F (x, y) = y, for 0 ≤ y ≤ 1, x ≥ 1,
F (x, y) = 1, for x, y ≥ 1, and
F (x, y) = 0, if either x ≤ 0 or y ≤ 0.
The probability that (X, Y ) will lie in the rectangle A = (0.3, 0.6) × (0.4, 0.7) is

P(0.3 < X ≤ 0.6, 0.4 < Y ≤ 0.7) = (0.6)(0.7) − (0.3)(0.7) − (0.6)(0.4) + (0.3)(0.4) = 0.09,

which is equal to the area of A.
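As a quick numerical illustration (a minimal sketch; the function names unif_cdf and rect_prob are ours, not from the text), formula (1.1) can be evaluated for the uniform cdf of this example in Python:

```python
# A small numerical check of formula (1.1) using the uniform-on-[0,1]x[0,1] cdf.
# The function names below (unif_cdf, rect_prob) are illustrative, not from the text.

def unif_cdf(x, y):
    """Joint cdf F(x, y) of the uniform distribution on the unit rectangle."""
    x = min(max(x, 0.0), 1.0)   # F is constant outside [0, 1] in each coordinate
    y = min(max(y, 0.0), 1.0)
    return x * y

def rect_prob(F, x1, x2, y1, y2):
    """P(x1 < X <= x2, y1 < Y <= y2) computed from a joint cdf F via (1.1)."""
    return F(x2, y2) - F(x2, y1) - F(x1, y2) + F(x1, y1)

print(round(rect_prob(unif_cdf, 0.3, 0.6, 0.4, 0.7), 4))   # 0.09, the area of A
```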
The joint cdf of (X, Y ) can also be used for obtaining the cdf of X or of Y . Thus, if
F (x, y) is the bivariate cdf of (X, Y ), then
FX(x) = P(X ≤ x) = P(X ≤ x, Y ≤ ∞) = F(x, ∞),   (1.2)
is the cdf of X. Similarly, FY (y) = F (∞, y) is the formula for obtaining the cdf of Y from
the bivariate cdf of (X, Y ).
Example 1.2. Let (X, Y ) be uniformly distributed on the unit rectangle [0, 1] × [0, 1].
Find the cdf of the variables X and Y .
Solution. The bivariate cdf, F (x, y), of (X, Y ) is given in Example 1.1. Using it and
relation (1.2) we have that the cdf of X is
FX (x) = F (x, ∞) = x, 0 ≤ x ≤ 1,
which is recognized to be the cdf of the uniform in [0, 1] distribution. Similarly FY (y) =
F (∞, y) = y, 0 ≤ y ≤ 1.
Though very convenient for calculating the probability that (X, Y ) will take value in a
rectangle, and also for calculating the individual probability distributions of X and Y ,
the cdf is not the most convenient tool for calculating directly the probability that (X, Y )
will take value in regions other than rectangles, such as circles.
In the rest of this chapter, we will define additional concepts associated with joint
distributions: marginal distributions, conditional distributions, and independence of random
variables. Concise descriptions of the joint, marginal and conditional distributions will
be given by appropriately extending the definitions, given in Chapter 3 for univariate
variables, of the probability mass function and the probability density function. Finally,
we will introduce certain parameters of the joint, marginal and conditional distributions
that are often used for descriptive and inferential purposes.
2 Describing the Joint Distribution of Discrete Random Variables

2.1 Joint probability mass function
Definition 2.1. The joint or bivariate probability mass function (pmf ) of the discrete
random variables X, Y is defined as
p(x, y) = P (X = x, Y = y).
The joint or multivariate pmf of the discrete random variables X1 , X2 , . . . , Xn is similarly defined as
p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn ).
3
Let the sample space of (X, Y ) be S = {(x1, y1), (x2, y2), . . .}. The above definition
implies that p(x, y) = 0 for all (x, y) that do not belong to S (i.e. for all (x, y) different from
(x1, y1), (x2, y2), . . .). Moreover, from the Axioms of probability it follows that

p(xi, yi) ≥ 0, for all i, and Σ_i p(xi, yi) = 1.   (2.1)
Example 2.1. A robot performs two tasks, welding joints and tightening bolts. Let X
be the number of defective welds, and Y be the number of improperly tightened bolts per
car. Suppose that the possible values of X are 0,1,2 and those of Y are 0,1,2,3. Thus,
the sample space S consists of 12 pairs (0, 0), (0, 1), (0, 2), (0, 3), . . . , (2, 3). The joint pmf
of X, Y is given by
          Y
 X        0      1      2      3
 0      .84    .03    .02    .01
 1      .06    .01    .008   .002
 2      .01    .005   .004   .001
In agreement with (2.1), the sum of all probabilities equals one. Inspection of this pmf
reveals that 84% of the cars have no defective welds and no improperly tightened bolts,
while only one in a thousand have two defective welds and three improperly tightened
bolts. In fact, the pmf provides answers to all questions regarding the probability of robot
errors. For example,
P(exactly one error) = P(X = 1, Y = 0) + P(X = 0, Y = 1) = .06 + .03 = .09,
and
P (exactly one defective weld) = P (X = 1, Y = 0) + P (X = 1, Y = 1)
+ P (X = 1, Y = 2) + P (X = 1, Y = 3)
= .06 + .01 + .008 + .002 = .08.
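For readers who want to experiment, here is a small Python sketch of the same table (the dictionary layout and variable names are ours); it checks the normalization (2.1), reproduces the two probabilities just computed, and its row sums anticipate the marginal pmf discussed in Section 2.2:

```python
# A quick check of the robot-error pmf table of Example 2.1 (dictionary layout is ours).
p = {
    (0, 0): .84, (0, 1): .03,  (0, 2): .02,   (0, 3): .01,
    (1, 0): .06, (1, 1): .01,  (1, 2): .008,  (1, 3): .002,
    (2, 0): .01, (2, 1): .005, (2, 2): .004,  (2, 3): .001,
}

print(round(sum(p.values()), 10))                       # 1.0, in agreement with (2.1)
print(round(p[(1, 0)] + p[(0, 1)], 3))                  # P(exactly one error) = 0.09
print(round(sum(p[(1, y)] for y in range(4)), 3))       # P(exactly one defective weld) = 0.08

# Row sums give the marginal pmf of X (see Section 2.2): {0: 0.9, 1: 0.08, 2: 0.02}
print({x: round(sum(p[(x, y)] for y in range(4)), 3) for x in range(3)})
```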
Example 2.2. [MULTINOMIAL RANDOM VARIABLE] The probabilities that a certain electronic component will last less than 50 hours of continuous use, between 50 and
90 hours, or more than 90 hours, are p1 = 0.2, p2 = 0.5, and p3 = 0.3, respectively.
Consider a simple random sample of size eight of such electronic components, and set X1 for
the number of these components that last less than 50 hours, X2 for the number of these
that last between 50 and 90 hours, and X3 for the number of these that last more than 90
hours. Then (X1 , X2 , X3 ) is an example of a multinomial random variable. The sample
space S of (X1 , X2 , X3 ) consists of all triples of nonnegative integers (x1 , x2 , x3 ) that satisfy
x1 + x2 + x3 = 8.
It can be shown that the joint pmf of (X1 , X2 , X3 ) is given by
p(x1, x2, x3) = (8!/(x1! x2! x3!)) p1^{x1} p2^{x2} p3^{x3},
for any x1 , x2 , x3 in the sample space. For example, the probability that one such component will last less than 50 hours, five will last between 50 and 90 hours, and two will last
more than 90 hours is
p(1, 5, 2) = (8!/(1! 5! 2!)) 0.2^1 0.5^5 0.3^2 = 0.0945.
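A direct evaluation of this multinomial pmf is easy to script; the following sketch (the helper name multinomial_pmf is ours) reproduces the value 0.0945 using only the standard library:

```python
# Direct evaluation of the multinomial pmf of Example 2.2 (function name is ours).
from math import factorial

def multinomial_pmf(counts, probs):
    """p(x1, ..., xk) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk, with n = sum of counts."""
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)          # exact integer division: multinomial coefficient
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p ** x
    return coef * prob

print(round(multinomial_pmf([1, 5, 2], [0.2, 0.5, 0.3]), 4))   # 0.0945
```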
2.2 Marginal probability mass functions
In Section 1, we learned how to obtain the cdf of the variable X from the bivariate cdf of
(X, Y ). This individual distribution of the variable X is called the marginal distribution
of X. Similarly, the individual distribution of Y is called the marginal distribution of Y .
In this section, we will learn how to obtain the pmf of a marginal distribution from that
of a bivariate (or multivariate) pmf. We begin by considering the bivariate distribution
given in Example 2.1.
Example 2.3. Consider the bivariate distribution of (X, Y ) given in Example 2.1. Find
the marginal pmf of X.
Solution. The probability pX (0) = P (X = 0) is found by summing the probabilities in
the first row of the table giving the joint pmf of (X, Y ):
pX (0) = P (X = 0, Y = 0) + P (X = 0, Y = 1)
+ P (X = 0, Y = 2) + P (X = 0, Y = 3)
= 0.84 + 0.03 + 0.02 + 0.01 = 0.9.
Also, the probability pX (1) = P (X = 1) is found by summing the probabilities in the
second row of the joint pmf table, and the probability pX (2) = P (X = 2) is found by
summing the probabilities in the third row of the joint pmf table. Thus, the marginal
pmf of X is:
 x        0     1      2
 pX(x)   .9    .08    .02
Note that the pmf of X in the above example can be read from the vertical margin of
the joint pmf table; see the table below. This justifies naming pX the marginal pmf of X.
Also, the marginal pmf pY of Y is read from the lower horizontal margin of the joint pmf.
          Y
 X        0      1      2      3
 0      .84    .03    .02    .01      .9
 1      .06    .01    .008   .002     .08
 2      .01    .005   .004   .001     .02
        .91    .045   .032   .013    1.0
In general, we have the following proposition.
Proposition 2.1. Let x1 , x2 , . . . be the possible values of X and y1 , y2 , . . . be the possible
values of Y . The marginal pmfs of X and Y , respectively, are given by
pX(x) = Σ_i p(x, yi),    pY(y) = Σ_i p(xi, y).
In the case of more than two random variables, X1 , X2 , . . . , Xm , the marginal pmf of each
variable Xi can be obtained by summing the joint pmf over all possible values of the
other random variables. For example, it can be shown that, if the joint distribution of
X1 , X2 , . . . , Xm is multinomial, then each Xi has a binomial distribution.
2.3 Conditional Distributions
The concept of a conditional distribution of a discrete random variable is an extension of
the concept of conditional probability of an event.
For a discrete (X, Y ), if x is a possible value of X, i.e. pX (x) > 0, the concept of
conditional probability provides answers to questions regarding the value of Y , given that
X = x has been observed. For example, the conditional probability that Y takes the
value y given that X = x is
P(Y = y | X = x) = P(X = x, Y = y) / P(X = x) = p(x, y) / pX(x),
where p(x, y) is the joint pmf of (X, Y ) and pX (x) is the marginal pmf of X.
Definition 2.2. Let (X, Y ) be discrete, and let SY = {y1 , y2 , . . .} be the sample space of
Y . Then if x is a possible value of X, i.e. pX (x) > 0,
pY|X(yj | x) = p(x, yj) / pX(x),   j = 1, 2, . . . ,
where p(x, y) is the joint pmf of (X, Y ) and pX (x) is the marginal pmf of X, is called the
conditional pmf of Y given that X = x.
Example 2.4. Consider the discrete (X, Y ) of Example 2.1. Then, the conditional pmf
of Y given that X = 0 is
 y                     0        1        2        3
 pY|X(y | X = 0)    .9333    .0333    .0222    .0111
These numbers are obtained by dividing each joint probability in the row that corresponds
to the X-value 0, by the marginal probability that X = 0. As an illustration,
pY|X(0 | X = 0) = p(0, 0) / pX(0) = .84 / .9 = .9333.

Proposition 2.2.
1. The conditional pmf is a proper pmf. Thus, for each x with
pX (x) > 0,
pY|X(yj | x) ≥ 0, for all j = 1, 2, . . ., and Σ_j pY|X(yj | x) = 1.
2. If we know the conditional pmf of Y given X = x, for all values x in the sample
space of X (i.e. if we know pY |X (yj |x), for all j = 1, 2, . . ., and for all possible
values x of X), and also know the marginal pmf of X, then the joint pmf of (X, Y )
can be obtained as
p(x, y) = pY |X (y|x)pX (x).
3. The marginal pmf of Y can be obtained as
pY(y) = Σ_{x in SX} pY|X(y | x) pX(x).
The second part of the above proposition is useful because it is often easier to specify the
marginal distribution of X and the conditional distribution of Y given X, than to specify
directly the joint distribution of (X, Y ); see Example 2.5 below. Part 3 of Proposition 2.2
follows from the Law of Total Probability.
Example 2.5. It is known that, with probability 0.6, a new laptop owner will install a
wireless internet connection at home within a month. Let X denote the number of new
laptop owners in a week from a certain region, and let Y denote the number among them
who install a wireless connection at home within a month. Suppose
that the pmf of X is
 x        0     1     2     3      4
 pX(x)   0.1   0.2   0.3   0.25   0.15
Find the joint distribution of (X, Y ). Find the probability that Y = 4.
Solution. According to part 2 of Proposition 2.2, since the marginal distribution of X
is known, the joint distribution of (X, Y ) can be specified if pY |X (y|x) is known for all
possible values x of X. Given that X = x, however, Y has the binomial distribution with
n = x trials and probability of success p = 0.6, so that
pY|X(y | x) = (x choose y) 0.6^y 0.4^{x−y}.
For example, if X = 3 then the probability that Y = 2 is
pY|X(2 | 3) = (3 choose 2) 0.6^2 0.4^{3−2} = 0.4320.
Next, according to part 3 of Proposition 2.2,
pY(4) = Σ_{x=0}^{4} pY|X(4 | x) pX(x) = 0 + 0 + 0 + 0 + 0.6^4 × 0.15 = 0.0194.
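The same Law of Total Probability calculation can be scripted; the sketch below assumes scipy is available and uses its binomial pmf for pY|X(4 | x):

```python
# Law of Total Probability calculation of P(Y = 4) in Example 2.5,
# using the binomial conditional pmf of Y given X = x (scipy is assumed available).
from scipy.stats import binom

pX = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.25, 4: 0.15}

# binom.pmf(4, x, 0.6) is 0 whenever x < 4, so only the x = 4 term contributes.
pY4 = sum(binom.pmf(4, x, 0.6) * pX[x] for x in pX)
print(round(pY4, 4))   # 0.0194
```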
We conclude this subsection by pointing out that, since the conditional pmf is a proper
pmf, it is possible to consider its expected value and its variance. These are called the
conditional expected value and conditional variance, respectively. As an example,
Example 2.6. Consider the discrete (X, Y ) of Example 2.1. Calculate the conditional
expected value of Y given that X = 0.
Solution. Using the conditional pmf that we found in Example 2.4, we obtain,
E(Y | X = 0) = 0 × (.9333) + 1 × (.0333) + 2 × (.0222) + 3 × (.0111) = .111.
Compare this with the unconditional, or marginal, expected value of Y , which is E(Y ) =
.148.
2.4 Independence
The notion of independence of random variables is an extension of the notion of independence of events. We say that the events A = [X = x] and B = [Y = y] are independent
if
P (X = x, Y = y) = P (X = x)P (Y = y),
where, as you may recall, P (X = x, Y = y) means P ([X = x] ∩ [Y = y]). If the above
equality holds for all possible values, x, of X, and all possible values, y, of Y , then X, Y
are called independent. In particular, we have the following definition.
Definition 2.3. The discrete random variables X, Y are called independent if
pX,Y (x, y) = pX (x)pY (y), for all x, y,
where pX,Y is the joint pmf of (X, Y ) and pX , pY are the marginal pmfs of X, Y , respectively.
Note that pX,Y (x, y) = pX (x)pY (y) can be rephrased as: The events A = [X = x] and
B = [Y = y] are independent for all x, y.
The next proposition is a collection of some statements that are equivalent to the statement of independence of two random variables.
Proposition 2.3. Each of the following statements implies, and is implied by, the independence of the random variables X, Y .
1. FX,Y (x, y) = FX (x)FY (y), for all x, y, where FX,Y is the joint cdf of (X, Y ) and
FX , FY are the marginal cdfs of X, Y , respectively.
2. pY |X (y|x) = pY (y), where pY |X (y|x) is the conditional probability of [Y = y] given
that [X = x], and pY is the marginal pmf of Y . In other words, the conditional pmf
of Y given X = x is the same for all values x that X might take.
3. pX|Y (x|y) = pX (x), where pX|Y (x|y) is the conditional probability of [X = x] given
that [Y = y], and pX is the marginal pmf of X. In other words, the conditional pmf
of X given Y = y is the same for all values y that Y might take.
4. Any event associated with the random variable X is independent from any event
associated with the random variable Y , i.e. [X ∈ A] is independent from [Y ∈ B],
where A is any subset of the sample space of X, and B is any subset of the sample
space of Y .
5. For any two functions h and g, the random variables h(X), g(Y ) are independent.
Example 2.7. Consider the joint distribution of (X, Y ) given in Example 2.1. Are X, Y
independent?
Solution. Here X, Y are not independent since
p(0, 0) = .84 ≠ pX(0) pY(0) = (.9)(.91) = .819.
Example 2.8. Are the X, Y of Example 2.5 independent?
Solution. Here X, Y are not independent since
pX,Y(3, 4) = pY|X(4 | 3) pX(3) = 0 × pX(3) = 0 ≠ pX(3) pY(4) = 0.25 × 0.0194 = 0.0049.
Example 2.9. A system is made up of two components connected in parallel. Let A,
B denote the two components. Thus, the system fails if both components fail. Let the
random variable X take the value 1 if component A works, and the value 0 if it does not.
Similarly, Y takes the value 1 if component B works, and the value 0 if it does not. From
the repair history of the system it is known that the joint pmf of (X, Y ) is
          Y
 X         0          1
 0      0.0098     0.9702     0.98
 1      0.0002     0.0198     0.02
        0.01       0.99       1.0
In this example, it can be seen that pX,Y (x, y) = pX (x)pY (y), for all x, y. Thus, X, Y are
independent. Moreover, calculation of the conditional pmf of Y given X reveals that the
conditional pmf of Y given X = x does not depend on the particular value x of X, in
accordance with part 2 of Proposition 2.3.
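The factorization in Definition 2.3 can be checked numerically for this table; a minimal sketch (the dictionary layout is ours):

```python
# Checking the factorization pXY(x, y) = pX(x) pY(y) for the table of Example 2.9.
from itertools import product

pXY = {(0, 0): 0.0098, (0, 1): 0.9702, (1, 0): 0.0002, (1, 1): 0.0198}
pX = {x: sum(pXY[(x, y)] for y in (0, 1)) for x in (0, 1)}   # row sums
pY = {y: sum(pXY[(x, y)] for x in (0, 1)) for y in (0, 1)}   # column sums

print(all(abs(pXY[(x, y)] - pX[x] * pY[y]) < 1e-12 for x, y in product((0, 1), (0, 1))))
# True: every cell factors, so X and Y are independent
```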
The definition of independence of two random variables extends to several random variables.
Definition 2.4. The discrete random variables X1 , X2 , . . . , Xm are independent if their
joint pmf is the product of the corresponding marginal pmfs, namely, if
pX1 ,X2 ,...,Xm (x1 , x2 , . . . , xm ) = pX1 (x1 )pX2 (x2 ) · · · pXm (xm ),
for all x1 , x2 , . . . , xm .
3 Describing the Joint Distribution of Continuous Random Variables
Definition 3.1. The joint or bivariate density function of the continuous (X, Y ) is a
non-negative function f (x, y) such that
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1, and

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy,
for any ’reasonable’ two-dimensional set A. The joint or multivariate probability density
function of the continuous (X1 , X2 , . . . , Xn ) is a non-negative function f (x1 , x2 , . . . , xn )
such that
∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x1, x2, . . . , xn) dx1 · · · dxn = 1, and

P((X1, X2, . . . , Xn) ∈ A) = ∫ · · · ∫_A f(x1, x2, . . . , xn) dx1 · · · dxn,
for any ’reasonable’ n-dimensional set A.
In particular, the definition implies
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx.
Thus, from the geometric point of view, probabilities are represented as volumes under
the joint pdf f (x, y), which now is a surface.
The pdf can be derived from the cdf by differentiating:
f(x1, x2, . . . , xn) = ∂^n F(x1, x2, . . . , xn) / (∂x1 · · · ∂xn).   (3.1)
Example 3.1. Using the joint cdf of the bivariate uniform distribution in the unit rectangle [0, 1] × [0, 1], which was given in Example 1.1, find the corresponding joint pdf.
Solution. Using (3.1) and the form of the bivariate cdf of the uniform distribution in the
unit rectangle, we obtain that f (x, y) = 0, if (x, y) is outside the unit rectangle, and
f (x, y) = 1, for 0 ≤ x, y ≤ 1.
Example 3.2. Consider the bivariate density function
f(x, y) = (12/7)(x^2 + xy), 0 ≤ x, y ≤ 1.
Find the probability that X > Y .
Solution. The desired probability can be found by integrating f over the region {(x, y)|0 ≤
y ≤ x ≤ 1}:
P(X > Y) = (12/7) ∫_0^1 ∫_0^x (x^2 + xy) dy dx = 9/14.
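For readers who prefer a numerical check, the double integral can be evaluated with scipy (assumed available); dblquad takes the integrand as a function of (y, x) with y-limits that may depend on x:

```python
# Numerical confirmation of P(X > Y) = 9/14 for the density of Example 3.2
# (scipy is assumed available).
from scipy.integrate import dblquad

f = lambda y, x: (12 / 7) * (x**2 + x * y)             # joint pdf on the unit square

prob, _ = dblquad(f, 0, 1, lambda x: 0, lambda x: x)   # region 0 <= y <= x <= 1
print(round(prob, 6), round(9 / 14, 6))                # both about 0.642857
```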
Example 3.3. Consider the bivariate distribution of Example 3.2. Find the probability
that X ≤ 0.6 and Y ≤ 0.4, and the joint cdf of (X, Y ).
Solution. Using the joint pdf given in Example 3.2, we have
P(X ≤ 0.6, Y ≤ 0.4) = F(0.6, 0.4) = (12/7) ∫_0^{0.6} ∫_0^{0.4} (x^2 + xy) dy dx = 0.0741.
In general,

F(x, y) = ∫_0^x ∫_0^y (12/7)(s^2 + st) dt ds = (12/7)(x^3 y / 3 + x^2 y^2 / 4).

Moreover, an easy differentiation verifies that f(x, y) = ∂^2 F(x, y) / (∂x ∂y) is indeed the joint
pdf given in Example 3.2.
3.1 Marginal Distributions
In the continuous case, the marginal pdf of X and Y can be found by integrating the
bivariate pdf f (x, y):
fX(x) = ∫_{−∞}^{∞} f(x, y) dy,    fY(y) = ∫_{−∞}^{∞} f(x, y) dx.   (3.2)
Example 3.4. Find the marginal pdf of X and Y from their joint bivariate uniform
distribution given in Example 3.1.
Solution. From (3.2), we have
fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_0^1 1 dy = 1, for 0 ≤ x ≤ 1, and fX(x) = 0, for x ∉ [0, 1].
Similarly, the marginal pdf of Y is obtained by
fY(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_0^1 1 dx = 1, for 0 ≤ y ≤ 1, and fY(y) = 0, for y ∉ [0, 1].
Thus, each of X and Y has a uniform in [0, 1] distribution, in agreement with Example
1.2.
Example 3.5. Find the marginal pdf of X and Y from their joint bivariate distribution
given in Example 3.2.
Solution. From (3.2), we have
fX(x) = ∫_0^1 (12/7)(x^2 + xy) dy = (12/7) x^2 + (6/7) x, for 0 ≤ x ≤ 1, and fX(x) = 0, for x ∉ [0, 1].
Similarly, the marginal pdf of Y is given by
fY(y) = ∫_0^1 (12/7)(x^2 + xy) dx = 4/7 + (6/7) y, for 0 ≤ y ≤ 1.
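The two integrations in (3.2) for this example can also be done symbolically; a minimal sketch assuming sympy is available:

```python
# Symbolic computation of the marginal pdfs in Example 3.5 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = sp.Rational(12, 7) * (x**2 + x * y)     # joint pdf of Example 3.2 on [0,1] x [0,1]

fX = sp.integrate(f, (y, 0, 1))             # marginal pdf of X: (12/7)x**2 + (6/7)x
fY = sp.integrate(f, (x, 0, 1))             # marginal pdf of Y: 4/7 + (6/7)y
print(sp.simplify(fX))
print(sp.simplify(fY))
```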
3.2 Conditional Distributions
In analogy with the definition in the discrete case, if (X, Y ) are continuous with joint pdf
f (x, y), the conditional pdf of Y given X = x is defined to be
fY|X=x(y) = f(x, y) / fX(x),   (3.3)
if fX (x) > 0. In the case of discretized measurements, fY |X=x (y)∆y approximates P (y ≤
Y ≤ y + ∆y|x ≤ X ≤ x + ∆x), as can be seen from
P(y ≤ Y ≤ y + ∆y | x ≤ X ≤ x + ∆x) = P(y ≤ Y ≤ y + ∆y, x ≤ X ≤ x + ∆x) / P(x ≤ X ≤ x + ∆x)
≈ (f(x, y) ∆x ∆y) / (fX(x) ∆x) = (f(x, y) / fX(x)) ∆y.
The definition of conditional pdf implies that
f(x, y) = fY|X=x(y) fX(x),   (3.4)
which is useful for specifying joint probability distributions. Integrating this over x we
obtain
fY(y) = ∫_{−∞}^{∞} fY|X=x(y) fX(x) dx,   (3.5)
which is the Law of Total Probability for continuous variables.
Example 3.6. For a cylinder selected at random from the manufacturing line, let X=height,
Y =radius. Suppose X, Y have a joint pdf

f(x, y) = 3x / (8y^2), if 1 ≤ x ≤ 3 and 1/2 ≤ y ≤ 3/4, and f(x, y) = 0 otherwise.
Find fX (x) and fY |X=x (y).
Solution. According to the formulae,

fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_{0.5}^{0.75} (3x/(8y^2)) dy = x/4, for 1 ≤ x ≤ 3, and

fY|X=x(y) = f(x, y) / fX(x) = 3/(2y^2), for 1/2 ≤ y ≤ 3/4.
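A symbolic check of this solution (assuming sympy is available): integrating out y recovers fX(x) = x/4, and dividing f by fX gives the conditional pdf, which is free of x:

```python
# Verification of Example 3.6 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = 3 * x / (8 * y**2)                                   # joint pdf on 1<=x<=3, 1/2<=y<=3/4

fX = sp.integrate(f, (y, sp.Rational(1, 2), sp.Rational(3, 4)))
print(sp.simplify(fX))                                   # x/4
print(sp.simplify(f / fX))                               # 3/(2*y**2), no dependence on x
```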
3.3 Independence
Definition 3.2. The continuous random variables X, Y are called independent if
fX,Y(x, y) = fX(x) fY(y), for all x, y,
where fX,Y is the joint pdf of (X, Y ), and fX , fY are the marginal pdfs of X, Y , respectively.
The next proposition is a collection of some statements that are equivalent to the statement of independence of two random variables.
Proposition 3.1. Each of the following statements implies, and is implied by, the independence of the continuous random variables X, Y .
1. FX,Y (x, y) = FX (x)FY (y), for all x, y, where FX,Y is the joint cdf of (X, Y ) and
FX , FY are the marginal cdfs of X, Y , respectively.
2. fY |X=x (y) = fY (y), where fY |X=x is the conditional pdf of Y given that [X = x],
and fY is the marginal pdf of Y . In other words, the conditional pdf of Y , given
X = x, is the same for all values x that X might take.
3. fX|Y =y (x) = fX (x), where fX|Y =y (x) is the conditional pdf of X given that [Y = y],
and fX is the marginal pdf of X. In other words, the conditional pdf of X, given
Y = y, is the same for all values y that Y might take.
4. Any event associated with the random variable X is independent from any event
associated with the random variable Y , i.e. [X ∈ A] is independent from [Y ∈ B],
where A is any subset of the sample space of X, and B is any subset of the sample
space of Y .
5. For any two functions h and g, the random variables h(X), g(Y ) are independent.
Example 3.7. Consider the joint distribution of X=height, and Y =radius given in Example 3.6. Are X and Y independent?
Solution. The marginal pdf of X was derived in Example 3.6. That of Y is

fY(y) = ∫_{−∞}^{∞} f(x, y) dx = ∫_1^3 (3x/(8y^2)) dx = 3/(2y^2), for 1/2 ≤ y ≤ 3/4.
Finally, the joint pdf of (X, Y ) is given in Example 3.6. From this it can be verified that

f(x, y) = fX(x) fY(y).

Thus, X and Y are independent. An alternative method of checking the independence of
X and Y is to examine the conditional pdf fY|X=x(y). In Example 3.6 it was obtained that
fY|X=x(y) = 3/(2y^2). Since this is constant in x, we conclude that X and Y are independent,
according to part 2 of Proposition 3.1. Finally, as an application of part 5 of Proposition
3.1, X and Y^2 are also independent.
4 Expected Value of a Statistic
A function h(X1 , . . . , Xn ) of random variables will be called a statistic. Statistics are,
of course, random variables and, as such, they have a distribution. The distribution of a
statistic is known as its sampling distribution. As in the univariate case, we will see in
this section that the expected value of a function of random variables (statistic) can be
obtained without having to first obtain its distribution. This is a very useful/convenient
method of calculating the expected value as the sampling distribution of a statistic is
typically difficult to obtain; this will be demonstrated in Chapter 5. The variance of
a linear combination of random variables can also be calculated without having to first
obtain its distribution, but this calculation involves the concept of covariance and thus it
will be considered in the next section.
Let (X, Y ) be discrete with joint pmf pX,Y . The expected value of a function, h(X, Y ), of
(X, Y ) is computed by
E[h(X, Y)] = Σ_x Σ_y h(x, y) pX,Y(x, y).
In the continuous case, summation is replaced by integration. Thus, if (X, Y ) is continuous
with joint pdf fX,Y , then the expected value of a function, h(X, Y ), of (X, Y ) is computed
by
E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy.
The formulae extend directly to functions of more than two random variables. Thus, in
the discrete case, the expected value of the statistic h(X1 , . . . , Xn ) is computed by
E[h(X1, . . . , Xn)] = Σ_{x1} · · · Σ_{xn} h(x1, . . . , xn) p(x1, . . . , xn),
where p denotes the joint pmf of X1 , . . . , Xn , while in the continuous case, the expected
value of h(X1 , . . . , Xn ) is computed by
E[h(X1, . . . , Xn)] = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x1, . . . , xn) f(x1, . . . , xn) dx1 · · · dxn.
Example 4.1. Consider the joint distribution of (X, Y ) given in Example 2.1. Find the
expected value of the total number of errors, T , that the robot makes on a car.
Solution. Here T = h(X, Y ) = X + Y . Thus
E(T) = Σ_x Σ_y (x + y) p(x, y)
     = 0(.84) + 1(.03) + 2(.02) + 3(.01)
     + 1(.06) + 2(.01) + 3(.008) + 4(.002)
     + 2(.01) + 3(.005) + 4(.004) + 5(.001)
     = .268.
Example 4.2. Consider the joint distribution of X=height, Y = radius given in Example
3.6. Find the expected value of the volume of a cylinder.
Solution: The volume of the cylinder is given in terms of the height (X) and radius (Y )
by the function h(X, Y) = π Y^2 X. Thus,

E[h(X, Y)] = ∫_1^3 ∫_{0.5}^{0.75} π y^2 x (3x/(8y^2)) dy dx = (13/16) π.
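A numerical check of this expected value, assuming scipy is available:

```python
# Numerical check that E(pi * Y**2 * X) = 13*pi/16 for the cylinder of Example 3.6
# (scipy is assumed available).
from math import pi
from scipy.integrate import dblquad

integrand = lambda y, x: (pi * y**2 * x) * (3 * x / (8 * y**2))   # h(x, y) * f(x, y)

val, _ = dblquad(integrand, 1, 3, lambda x: 0.5, lambda x: 0.75)
print(round(val, 4), round(13 * pi / 16, 4))                      # both about 2.5525
```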
Of special interest is the case where the function of interest is a sum, or, more generally, a linear combination of random variables. The function h(X1 , . . . , Xn ) is a linear
combination of X1 , . . . , Xn if
h(X1 , . . . , Xn ) = a1 X1 + a2 X2 + . . . + an Xn ,
where a1, . . . , an are given constants. The total T = Σ_i Xi is a linear combination with all
ai = 1, and the sample mean X̄ = (1/n) T is a linear combination with all ai = 1/n.
Proposition 4.1. Let (X1 , . . . , Xn ) have any joint distribution (i.e. independent or dependent, discrete or continuous), and set E(Xi ) = µi . Then
E(a1 X1 + · · · + an Xn ) = a1 µ1 + . . . + an µn .
In other words, the expected value of a linear combination of random variables is the same
linear combination of their expected values.
Corollary 4.1. Let (X1 , . . . , Xn ) have any joint distribution. If E(X1 ) = · · · = E(Xn ) =
µ, then
E(X̄) = µ  and  E(T) = nµ,

where T = Σ_i Xi and X̄ = (1/n) T.
Corollary 4.2. Let (X1 , X2 ) have any joint distribution. Then
E(X1 − X2 ) = µ1 − µ2 , and E(X1 + X2 ) = µ1 + µ2 .
As an application of Corollary 4.2, the expected value of the total number of errors which
was obtained in Example 4.1, can also be computed as
E(T ) = E(X + Y ) = E(X) + E(Y ) = .12 + .148 = .268.
We conclude this section with a result about the expected value of a product of independent random variables.
Proposition 4.2. If X and Y are independent, then
E(XY ) = E(X)E(Y ).
In general, if X1 , . . . , Xn are independent,
E(X1 · · · Xn ) = E(X1 ) · · · E(Xn ).
As an application of Proposition 4.2, the expected value of the volume h(X, Y ) = πY 2 X,
which was obtained in Example 4.2, can also be calculated as
E[h(X, Y )] = πE(Y 2 )E(X),
since, as shown in Example 3.7, X and Y are independent, and thus, by part 5 of Proposition 3.1, X and Y 2 are also independent.
5 Parameters of a Multivariate Distribution

5.1 The Regression Function
It is often very interesting and informative to know how the expected value of one variable
changes when we have observed the value that another variable has taken. In fact, in the
study where X is the velocity and Y is the stopping distance of an automobile, and in
the study where X is the diameter at breast height and Y is the age of a tree, both of
which were mentioned in Section 1, knowing how the expected value of Y changes with
X would be of primary interest.
Definition 5.1. For the bivariate random variable (X, Y ), the function
µY |X (x) = E(Y |X = x)
is called the regression function of Y on X.
Example 5.1. For the discrete (X, Y ) considered in Example 2.1, regarding the errors a
robot makes per car, calculate the regression function of Y on X.
Solution. In Example 2.4 we obtained the conditional pmf of Y given that X = 0 as
 y                     0         1         2         3
 pY|X(y | X = 0)    0.9333    0.0333    0.0222    0.0111
and in Example 2.6, we computed the conditional expectation of Y given that X = 0 as
E(Y |X = 0) = 0.111. Repeating these calculations, conditioning first on X = 1 and then
on X = 2, we obtain
 y                     0        1        2        3
 pY|X(y | X = 1)    0.75     0.125    0.1      0.025
so that E(Y |X = 1) = 0.4, and
 y                     0       1       2       3
 pY|X(y | X = 2)    0.5     0.25    0.2     0.05
from which we obtain E(Y |X = 2) = 0.8. Summarizing the above calculations of the
conditional expectation of Y in a table, we obtain the regression function of Y on X:
 x             0        1      2
 µY|X(x)    0.111     0.4    0.8
The information that this regression function makes visually apparent, and which was
not easily discernible from the joint probability mass function, is that in a car with more
defective welds, you can expect to experience more improperly tightened bolts.
Note further, that a weighted average of the conditional expected values of Y , with weights
equal to the marginal probabilities of X, gives the unconditional expected value of Y ; that
is, using the marginal pmf of X from Example 2.3, we have
E(Y |X = 0)pX (0) + E(Y |X = 1)pX (1) + E(Y |X = 2)pX (2)
= 0.111 × 0.9 + 0.4 × 0.08 + 0.8 × 0.02 = 0.148 = E(Y ).
This is the Law of Total Probability for Expectations.
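Both the regression function and this weighted-average check can be reproduced from the joint pmf table; a minimal Python sketch (the dictionary layout is ours):

```python
# The regression function of Example 5.1 and the total-expectation check,
# computed from the joint pmf table of Example 2.1 (dictionary layout is ours).
p = {
    (0, 0): .84, (0, 1): .03,  (0, 2): .02,   (0, 3): .01,
    (1, 0): .06, (1, 1): .01,  (1, 2): .008,  (1, 3): .002,
    (2, 0): .01, (2, 1): .005, (2, 2): .004,  (2, 3): .001,
}
pX = {x: sum(p[(x, y)] for y in range(4)) for x in range(3)}

# mu_{Y|X}(x) = sum_y y * p(x, y) / pX(x)
reg = {x: sum(y * p[(x, y)] for y in range(4)) / pX[x] for x in range(3)}
print({x: round(m, 3) for x, m in reg.items()})        # {0: 0.111, 1: 0.4, 2: 0.8}

# Weighted average of the conditional means recovers E(Y) = 0.148
print(round(sum(reg[x] * pX[x] for x in range(3)), 3))
```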
For continuous variables, it is customary to model the regression function. The simplest
model is the linear regression model which specifies that
µY|X(x) = α + βx.   (5.1)
Quadratic and more complicated models are also commonly used. The advantage of such
models is that they offer easy interpretation of the effect of X on the expected value of Y ,
and also that the typically unknown parameters α and β are easily estimated from data,
as we will see in the chapter on regression analysis.
Example 5.2. In accelerated life testing, products are operated under harsher conditions
than those encountered in real life in order to expedite the observation of the life times of
the products being tested. Of course, interest lies in the life time of the products under
normal operating conditions. A regression model can be used to transfer information
obtained under harsher conditions to normal operating conditions. As an example, it
may be assumed that the lifetime of a randomly chosen product has the exponential
distribution but that the parameter λ of the distribution (see Example 3.7 of Chapter
3) depends on the stress applied. Let T denote the lifetime, and let X denote the stress
applied. Let
λ(x) = (α + βx)^{−1},   (5.2)
and assume that, when stress X = x is applied then the lifetime T has distribution with
pdf
fT|X=x(t) = λ(x) e^{−λ(x) t}.   (5.3)
As we saw in Example 4.12 of Chapter 3, the expected value of an exponentially distributed
random variable is the inverse of the parameter λ. Thus, if the model assumptions (5.2),
(5.3) are correct, then the regression function of T on X is
µT|X(x) = α + βx.   (5.4)
5.2 Covariance and Correlation
When two random variables X and Y are not independent, they are dependent. We say
that X, Y are positively dependent if “large” values of X are associated with “large”
values of Y , and “small” values of X are associated with “small” values of Y . For example,
X=height and Y =weight of a randomly selected adult male, are positively dependent. In
the opposite case we say that X, Y are negatively dependent.
It is often of interest to quantify the dependence of two (positively or negatively) dependent variables. Thus the population parameters of a bivariate distribution involve
measures of dependence, in addition to the parameters of the two marginal distributions.
The regression function, which was discussed in the previous subsection, is a consequence
and a manifestation of dependence since, if X and Y are independent, then the regression
function µY|X(x) is constant in x. However, it is important to note that the regression
function is not designed to measure the degree of dependence of X and Y .
In this subsection we will define the correlation of X and Y , as a measure of linear
dependence, and the rank correlation, as a more general measure of nonlinear dependence.
In order to do that, we will first need to talk about the covariance of X and Y .
Definition 5.2. The covariance of X and Y , denoted by Cov(X, Y ) or σXY , is defined
as
σXY = E[(X − µX )(Y − µY )] = E(XY ) − µX µY ,
where µX and µY are the marginal expected values of X and Y , respectively.
The second equality in the above definition is a computational formula for the covariance,
similar to the computational (short-cut) formula σ_X^2 = E[(X − µX)^2] = E(X^2) − µ_X^2 for
the variance. It is also worth pointing out that the covariance of a random variable with
itself, i.e. Cov(X, X) or σXX, is

σXX = E[(X − µX)(X − µX)] = E[(X − µX)^2] = σ_X^2,

which is the variance of X.
In order to develop an intuitive understanding of the role of covariance in the quantification of dependence, let us consider a finite underlying population of N units each of
which has characteristics of interest (x, y). As a concrete example, consider the population of men 25-30 years old residing in Centre County, PA, and let x=height, y=weight
be the characteristics of interest. Let (x1 , y1 ), (x2 , y2 ), . . . , (xN , yN ) denote the values of
the characteristics of the N units. Let (X, Y ) denote the characteristics of a randomly
selected unit. Then (X, Y ) has a discrete distribution (even though height and weight
are continuous variables!) taking each of the possible values (x1 , y1 ), . . . , (xN , yN ) with
probability 1/N . In this case the covariance formula in Definition 5.2 can be computed
as
σXY = (1/N) Σ_{i=1}^{N} (xi − µX)(yi − µY),   (5.5)

where µX = (1/N) Σ_{i=1}^{N} xi and µY = (1/N) Σ_{i=1}^{N} yi are the marginal expected values of X and
Y , respectively. Suppose now that X, Y have positive dependence. Then the products
(xi − µX )(yi − µY ),
which appear in the summation of relation (5.5), will tend to be positive, and thus σXY
will, in all likelihood, also be positive. If the dependence is negative then these products
will tend to be negative and so will σXY . Thus we see σXY will be positive or negative
according to whether the dependence of X and Y is positive or negative. Other properties
of covariance are summarized in the next proposition.
Proposition 5.1.
1. If X, Y are independent, then Cov(X, Y ) = 0.
2. Cov(X, Y ) = −Cov(X, −Y ) = −Cov(−X, Y ).
3. For any real numbers b and d,
Cov(X + b, Y + d) = Cov(X, Y ).
4. For any real numbers a, b, c and d,
Cov(aX + b, cY + d) = ac Cov(X, Y ).
The first property, which is desirable for any measure of the degree of dependence, follows
from the formula for calculating the expected value of a product of independent random
variables, which was given in Proposition 4.2. The second property means that, if the
sign of one of the two variables changes, then a positive dependence becomes negative
and vice-versa. The third property means that adding constants to the random variables
will not change their covariance. The fourth property, however, renders the covariance
undesirable as a measure of dependence. This is because it implies that the covariance
of X and Y changes when the scale (or unit) changes. Thus, in the example where
X=height and Y =weight, changing the scale from (m, kg) to (ft, lb), changes the value
of the covariance of X and Y . Clearly, we would like a measure of dependence to be
unaffected by such scale changes. This leads to the definition of the correlation coefficient
as a scale-free version of covariance.
Definition 5.3. The correlation, or correlation coefficient of X and Y , denoted by
Corr(X, Y ) or ρXY , is defined as
ρX,Y = Corr(X, Y) = Cov(X, Y) / (σX σY),
where σX , σY are the marginal standard deviations of X, Y , respectively.
The following proposition summarizes some properties of the correlation coefficient.
Proposition 5.2.
1. If a and c are either both positive or both negative, then
Corr(aX + b, cY + d) = Corr(X, Y ).
If a and c are of opposite signs, then
Corr(aX + b, cY + d) = −Corr(X, Y ).
2. For any two random variables X, Y ,
−1 ≤ ρX,Y ≤ 1,
and if X, Y are independent then ρX,Y = 0.
3. ρX,Y = 1 or −1 if and only if Y = aX + b for some numbers a, b with a ≠ 0.
The properties listed in Proposition 5.2 imply that correlation is indeed a successful
measure of linear dependence. First, it has the desirable property of being independent
of scale. Second, the fact that it takes values between −1 and 1, makes it possible to
develop a feeling for the degree of dependence between X and Y . Thus, if the variables
are independent, their correlation coefficient is zero, while ρX,Y = ±1 happens if and only
if X and Y have the strongest possible (that is, knowing one amounts to knowing the
other) linear dependence.
It should be emphasized that correlation measures only linear dependence. This being
the case, it is possible for two variables to be dependent but to have zero correlation. See
figure below. Thus, if two variables have zero correlation, but we do not know whether
or not they are independent, we call them uncorrelated.
[FIGURE SHOWING CORRELATION IN VARIOUS CONFIGURATIONS]
Example 5.3. Let X denote the deductible in car insurance, and let Y denote the deductible in home insurance, of a randomly chosen home and car owner in some community.
Suppose that X, Y have the following joint pmf.
          y
 x          0       100      200
 100      .20      .10      .20      .5
 250      .05      .15      .30      .5
          .25      .25      .50     1.0
where the deductible amounts are in dollars. Find σX,Y and ρX,Y . Next, express the
deductible amounts in cents and find again the covariance and the correlation coefficient.
Solution. We will use the computational formula Cov(X, Y) = E(XY) − E(X)E(Y). First,

E(XY) = Σ_x Σ_y x y p(x, y) = 23,750.

Also,

E(X) = Σ_x x pX(x) = 175,  E(Y) = 125.

Thus, Cov(X, Y) = 23,750 − (175)(125) = 1875. Omitting the details of the calculations, the
standard deviations are computed to be σX = 75, σY = 82.92. Thus, ρX,Y = .301. Next, if
the deductible amounts are expressed in cents, then the new deductible amounts are
(X′, Y′) = (100X, 100Y). Thus,

Cov(X′, Y′) = Cov(100X, 100Y) = (100)(100) Cov(X, Y) = 18,750,000,

which is the reason (mentioned above) that covariance is not a suitable measure of dependence. The correlation remains unchanged.
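The calculations of this example are easy to reproduce; a minimal sketch (the dictionary layout is ours; amounts in dollars):

```python
# Covariance and correlation for the deductible table of Example 5.3
# (dictionary layout is ours).
from math import sqrt

p = {(100, 0): .20, (100, 100): .10, (100, 200): .20,
     (250, 0): .05, (250, 100): .15, (250, 200): .30}

EX  = sum(x * q for (x, y), q in p.items())                    # 175
EY  = sum(y * q for (x, y), q in p.items())                    # 125
EXY = sum(x * y * q for (x, y), q in p.items())                # 23750
cov = EXY - EX * EY                                            # 1875

sdX = sqrt(sum(x**2 * q for (x, y), q in p.items()) - EX**2)   # 75
sdY = sqrt(sum(y**2 * q for (x, y), q in p.items()) - EY**2)   # about 82.92
print(round(cov), round(cov / (sdX * sdY), 4))                 # 1875 0.3015 (about .30)
```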
Example 5.4. Consider the multinomial experiment of Example 2.2, but with a sample
of size one. Thus, one electronic component will be tested, and if it lasts less than 50
hours, then Y1 = 1, Y2 = 0, and Y3 = 0; if it lasts between 50 and 90 hours, then Y1 = 0,
Y2 = 1, and Y3 = 0; if it lasts more than 90 hours, then Y1 = 0, Y2 = 0, and Y3 = 1. Find
Cov(Y1 , Y2 ), Cov(Y1 , Y3 ) and Cov(Y2 , Y3 ).
Solution. We will use computational formula Cov(X, Y ) = E(XY ) − E(X)E(Y ). First
note that
E(Y1 Y2 ) = 0.
This is due to the fact that the sample space of the bivariate random variable (Y1 , Y2 )
is {(1, 0), (0, 1), (0, 0)}, so that the product Y1 Y2 is always equal to zero. Next, since the
marginal distribution of each Yi is Bernoulli, we have that
E(Yi ) = P (Yi = 1).
Thus, according to the information given in Example 2.2, E(Y1 ) = 0.2, E(Y2 ) = 0.5,
E(Y3 ) = 0.3. It follows that Cov(Y1 , Y2 ) = −0.1, Cov(Y1 , Y3 ) = −0.06 and Cov(Y2 , Y3 ) =
−0.15.
Example 5.5. Suppose (X, Y ) have joint pdf

f(x, y) = 24xy, for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, x + y ≤ 1, and f(x, y) = 0 otherwise.
Find Cov(X, Y ), ρX,Y .
Solution. We will use Cov(X, Y) = E(XY) − E(X)E(Y). From (3.2), the marginal pdfs are
fX(x) = ∫_0^{1−x} 24xy dy = 12x(1 − x)^2, 0 ≤ x ≤ 1, and, by symmetry, fY(y) = 12y(1 − y)^2, 0 ≤ y ≤ 1.
Then

E(XY) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy f(x, y) dx dy = ∫_0^1 ∫_0^{1−x} xy (24xy) dy dx = 2/15,

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^1 x (12x(1 − x)^2) dx = 2/5,

E(Y) = ∫_{−∞}^{∞} y fY(y) dy = ∫_0^1 y (12y(1 − y)^2) dy = 2/5.

Thus Cov(X, Y) = 2/15 − (2/5)(2/5) = −2/75.

Next, E(X^2) = ∫_0^1 x^2 (12x(1 − x)^2) dx = 1/5, so σ_X^2 = 1/5 − 4/25 = 1/25 and σX = 1/5.
Similarly, σY = 1/5. Thus

ρX,Y = (−2/75) / ((1/5)(1/5)) = −50/75 = −2/3.
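An exact symbolic check of these integrals, assuming sympy is available:

```python
# Exact computation of Cov(X, Y) and rho for Example 5.5 (sympy is assumed available).
import sympy as sp

x, y = sp.symbols('x y', nonnegative=True)
f = 24 * x * y                                            # joint pdf on the triangle x+y<=1

def E(g):
    """Expected value of g(X, Y) over the region 0 <= y <= 1-x, 0 <= x <= 1."""
    return sp.integrate(sp.integrate(g * f, (y, 0, 1 - x)), (x, 0, 1))

cov = E(x * y) - E(x) * E(y)
rho = cov / sp.sqrt((E(x**2) - E(x)**2) * (E(y**2) - E(y)**2))
print(cov, rho)                                           # -2/75 -2/3
```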
5.3 Variance and Covariance of Linear Combinations
In Section 4 we saw that the expected value of a linear combination of random variables is
the same linear combination of the expected values, regardless of whether or not the random variables are independent. Here we will see that the variance of a linear combination
involves the pairwise covariances, and thus it depends on whether or not the variables are
independent.
Proposition 5.3. Let the variables X1 , . . . , Xn have variances σ_{Xi}^2 = σi^2 and covariances
σ_{Xi,Xj} = σij. Then

1. If X1 , . . . , Xn are independent, so that all σij = 0,

Var(a1 X1 + . . . + an Xn) = a1^2 σ1^2 + . . . + an^2 σn^2.

2. Without independence,

Var(a1 X1 + . . . + an Xn) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj).
Corollary 5.4.

1. If X1 , X2 are independent,

Var(X1 − X2) = σ1^2 + σ2^2,   Var(X1 + X2) = σ1^2 + σ2^2.

2. Without independence,

Var(X1 − X2) = σ1^2 + σ2^2 − 2 Cov(X1, X2),   Var(X1 + X2) = σ1^2 + σ2^2 + 2 Cov(X1, X2).
Corollary 5.5. If X1 , . . . , Xn are independent and σ1^2 = · · · = σn^2 = σ^2, then

Var(X̄) = σ^2 / n  and  Var(T) = n σ^2,

where T = Σ_i Xi and X̄ = T/n.
Proposition 5.4. Let the variables X1 , . . . , Xn have variances σ_{Xi}^2 = σi^2 and covariances
σ_{Xi,Xj} = σij. Let also X_{j1} , . . . , X_{jk} be a collection of random variables obtained from
X1 , . . . , Xn . Then,

Cov(X1 , a1 X_{j1} + · · · + ak X_{jk}) = Σ_{i=1}^{k} ai Cov(X1, X_{ji}).
Example 5.6. Consider the multinomial experiment of Example 2.2. Thus, n = 8 products are tested, X1 denotes the number of those that last less than 50 hours, X2 denotes
the number that last between 50 and 90 hours, and X3 = 8 − X1 − X2 denotes the number
that last more than 90 hours. Find the covariance of X1 and X2 .
Solution. For each of the eight products, i.e. for each i = 1, . . . , 8, define triples of
variables (Yi1 , Yi2 , Yi3 ), as in Example 5.4. Thus, if the ith product lasts less than 50
hours, then Yi1 = 1 and Yi2 = Yi3 = 0; if it lasts between 50 and 90 hours then Yi2 = 1
and Yi1 = Yi3 = 0; if it lasts more than 90 hours then Yi3 = 1 and Yi1 = Yi2 = 0. Thus,
X1 , the number of products that last less than 50 hours, is given as
X1 = Σ_{i=1}^{8} Yi1.
Similarly,

X2 = Σ_{i=1}^{8} Yi2  and  X3 = Σ_{i=1}^{8} Yi3.
It follows that
Cov(X1, X2) = Cov(Σ_{i=1}^{8} Yi1, Σ_{j=1}^{8} Yj2) = Σ_{i=1}^{8} Σ_{j=1}^{8} Cov(Yi1, Yj2).
Assuming that the lifetimes of different products are independent, we have that, if i ≠ j,
then Cov(Yi1, Yj2) = 0. Thus,
Cov(X1, X2) = Σ_{i=1}^{8} Cov(Yi1, Yi2) = Σ_{i=1}^{8} (−0.1) = −0.8,
where −0.1 is the covariance of Yi1 and Yi2 , as derived in Example 5.4. The covariance
of X1 and X3 is similarly found to be 8 × (−0.06) = −0.48 and the covariance of X2 and
X3 is similarly found to be 8 × (−0.15) = −1.2.
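These covariances can also be checked by simulation; a minimal sketch assuming numpy is available (the sample size of 200,000 is arbitrary):

```python
# Monte Carlo check of the multinomial covariances of Example 5.6
# (numpy is assumed available).
import numpy as np

rng = np.random.default_rng(0)
samples = rng.multinomial(8, [0.2, 0.5, 0.3], size=200_000)   # rows are (X1, X2, X3)

C = np.cov(samples, rowvar=False)                             # 3 x 3 sample covariance matrix
print(round(C[0, 1], 2), round(C[0, 2], 2), round(C[1, 2], 2))
# approximately -0.8, -0.48, -1.2
```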