Chapter 7
Properties of Expectations and Central Limit Theorem
7.1 Introduction
 Contents ---
 In this chapter, additional properties of the expected values of random variables will be explored.
 Also discussed are some limit theorems, especially the central limit theorem
that is probably the most important and surprising result in probability.
7.2 Expectation of Sums of Random Variables
 Additional properties about jointly distributed random variables ---
 Proposition 7.1 (computing the mean of a function of two jointly distributed random variables) ---
If X and Y have a joint pmf p(x, y), then
E[g(X, Y)] = \sum_y \sum_x g(x, y) p(x, y).
If X and Y have a joint pdf f(x, y), then
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y) dx dy.
Proof: similar to the proof for the case of a single random variable; left as an
exercise.
 Example 7.1 --An accident occurs at a location X that is uniformly distributed on a road of
length L; and at the time of the accident an ambulance is at a location Y that is
also uniformly distributed on the same road. Assume that X and Y are independent.
Find the expected distance between the ambulance and the location of the
accident.
Solution:
 The pdf of X is fX(x) = 1/L for 0 < x < L; 0, otherwise.
 Similarly, the pdf of Y is fY(y) = 1/L for 0 < y < L; 0, otherwise.
 By Fact 6.9 of the last chapter, since X and Y are independent, we get the joint pdf f of X and Y to be
f(x, y) = fX(x)fY(y) = (1/L)(1/L) = 1/L²    for 0 < x < L, 0 < y < L;
        = 0    otherwise.
 The distance between the ambulance and the location of the accident is just a function of X and Y: g(X, Y) = |X − Y|.
 And the expected distance, by Proposition 7.1, is
E[|X − Y|] = E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |x − y| f(x, y) dx dy
           = \int_0^L \int_0^L |x − y| (1/L²) dy dx.
 Now, we want to compute the value of \int_0^L |x − y| dy first, in which the range of integration for y is [0, L] and x is a fixed value.
 If y ∈ [0, x), then x > y, and so |x − y| = x − y.
 On the other hand, if y ∈ [x, L], then x ≤ y, and so |x − y| = −(x − y) = y − x.
 Consequently,
\int_0^L |x − y| dy = \int_0^x (x − y) dy + \int_x^L (y − x) dy
  = (xy − y²/2)|_0^x + (y²/2 − xy)|_x^L
  = (x² − x²/2) + [(L²/2 − xL) − (x²/2 − x²)]
  = L²/2 + x² − xL.
 Now, we can compute E[|X − Y|] as:
E[|X − Y|] = \int_0^L \int_0^L |x − y| (1/L²) dy dx
  = (1/L²) \int_0^L (L²/2 + x² − xL) dx
  = (1/L²) (L²x/2 + x³/3 − x²L/2)|_0^L
  = (1/L²) (L³/2 + L³/3 − L³/2)
  = L/3.
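 A quick numerical sanity check of this result can be done by simulation. The following Python sketch (added here for illustration; the road length L = 10 and the trial count are arbitrary choices) estimates E[|X − Y|] by averaging over many random accident and ambulance locations and compares the estimate with L/3.

import random

# Monte Carlo check of Example 7.1: E[|X - Y|] = L/3 for independent X, Y ~ Uniform(0, L).
L = 10.0              # road length (arbitrary choice for this check)
n_trials = 100_000    # number of simulated accidents

total = 0.0
for _ in range(n_trials):
    x = random.uniform(0, L)    # accident location
    y = random.uniform(0, L)    # ambulance location
    total += abs(x - y)

print("simulated E[|X - Y|]:", total / n_trials)    # should be close to L/3
print("theoretical value   :", L / 3)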
 Expectations of sums of random variables ---
 Fact 7.1 --Given two random variables X and Y, the following equality is true:
E[X +Y] = E[X] + E[Y].
Proof:
Regarding X + Y as a function of two random variables g(X, Y), we can apply Proposition 7.1 to get
E[X + Y] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x + y) f(x, y) dx dy
  = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f(x, y) dy dx + \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y f(x, y) dx dy
  = \int_{-\infty}^{\infty} x (\int_{-\infty}^{\infty} f(x, y) dy) dx + \int_{-\infty}^{\infty} y (\int_{-\infty}^{\infty} f(x, y) dx) dy
  = \int_{-\infty}^{\infty} x fX(x) dx + \int_{-\infty}^{\infty} y fY(y) dy
  = E[X] + E[Y].
 Fact 7.2 --Given n random variables X1, X2, …, Xn, the following equality is true:
E[X1 + X2 + … + Xn] = E[X1] + E[X2] + … + E[Xn].
Proof: easy to derive using Fact 7.1 and the principle of induction.
 Example 7.2 (the expected number of matches) --A group of N people throw their hats into the center of a room. The hats
are mixed up, and each person randomly selects one. Find the expected
number of people that select their own hats.
Solution:
 Let the number of persons that select their own hats be denoted as X.
 Obviously, X may be computed as X = X1 + X2 + … + XN where the random variable Xi (1 ≤ i ≤ N) is defined as
Xi = 1 if the ith person selects his own hat;
   = 0 otherwise.
 Now, the probability for a person to select his/her own hat among the N
ones is just 1/N, i.e.,
P{person i selects his/her own hat} = P{Xi = 1} = 1/N.
 Therefore, by definition we get the expected value for Xi as
E[Xi] = 1·P{Xi = 1} + 0·P{Xi = 0} = 1·(1/N) + 0·(1 − 1/N) = 1/N
where i = 1, 2, …, N.
 And so, by Fact 7.2 we get
E[X] = E[X1 + X2 + … + XN] = E[X1] + E[X2] + … + E[XN]
     = (1/N) × N = 1.
 That is, exactly one person selects his/her own hat on the average.
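 The conclusion E[X] = 1 holds for every N, which can be checked by a short simulation (an added illustration; N = 20 and the trial count are arbitrary choices).

import random

# Simulation of Example 7.2: expected number of people who select their own hats.
N = 20                 # number of people (arbitrary)
n_trials = 100_000

total_matches = 0
for _ in range(n_trials):
    hats = list(range(N))
    random.shuffle(hats)                      # hats[i] is the hat received by person i
    total_matches += sum(1 for i in range(N) if hats[i] == i)

print("simulated E[X]:", total_matches / n_trials)    # close to 1, regardless of N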
7.3 Random Samples and Their Properties
 Concept ---
 Every observed data item of the distribution of a certain random variable is
itself random in nature, and so may be regarded as a random variable, too.
 Therefore, we have the following definitions.
 Random samples ---
 Definition 7.1 (random sample) ---
A random sample of size n arising from a certain random variable X is a collection of n independent random variables X1, X2, …, Xn such that for every i = 1, 2, …, n, Xi is identically distributed with X, meaning that every Xi has the same pmf or pdf as that of X.
Each Xi is called a sample variable; X is called the population random
variable; and the mean and variance of X are called the population mean and
variance, respectively.
 Definition 7.2 (random sampling and sample value) --The process of obtaining a set of observed data of a random sample is
called random sampling; the set of the observed data is also called a random
sample; and each observed data item is called a sample value.
 A note --- the term random sample thus has two meanings:
 a set of random variables all independently and identically distributed
with a population random variable X; or
 a set of observed data of such random variables.
 An abbreviation --- we use iid to mean independently and identically
distributed.
 Example 7.3 ---
In a resistor manufacturing company, according to past experience it is known that the resistance of a resistor produced by the company is a normal random variable X ~ N(μ, σ²) where μ and σ² are unknown.
 A random sampling process is conducted to take a set of sample values
of resistances from every 5 resistors produced.
 One such sample-value set is (8.04, 8.02, 8.07, 7.99, 8.03), another is (8.01, 7.89, 8.10, 8.02, 8.00), and a third is (7.98, 8.01, 8.05, 7.90, 8.15), and so on.
 These sets are random samples. The first sample values in the three sets, namely, 8.04, 8.01, 7.98, …, are all random in nature, and so may be described by a random variable X1, as mentioned previously.
 Similarly, the second sample values in the sets, namely, 8.02, 7.89, 8.01, …, may be described by a second random variable X2, and so on, yielding three additional random variables X3, X4, and X5.
 The set of these five random variables X1, X2, …, X5 constitutes a random sample arising from X.
 Properties of random samples and sample means ---
 Fact 7.3 (the joint pdf of the random variables of a random sample) ---
The sample variables Xi's in a random sample X1, X2, …, Xn of size n arising from a population random variable X with pmf or pdf f(x) have a joint pmf or pdf of the following form:
f(x1, x2, …, xn) = f(x1)f(x2)…f(xn).
Proof: easy to derive by applying Facts 6.8 or 6.9 according to the property
of independence and the principle of induction.
 Definition 7.3 (sample mean) --Given a random sample X1, X2, …, Xn arising from a population random
variable X, the following two functions of the sample variables
X̄ = (1/n)(X1 + X2 + … + Xn) = \sum_{i=1}^{n} X_i / n;    (7.1)
To = X1 + X2 + … + Xn = \sum_{i=1}^{n} X_i    (7.2)
are called the sample mean and the sample total of the random sample,
respectively.
 A note: the sample mean is itself a random variable, as mentioned previously,
and so is the sample total.
 Fact 7.4 (the mean of the sample mean) ---
The mean μ_X̄ = E[X̄] of the sample mean X̄ of a random sample X1, X2, ..., Xn of size n arising from a population random variable X with mean μ is μ_X̄ = μ.
Proof:
 By Definition 7.1 and Fact 7.2, we have
μ_X̄ = E[X̄] = E[(X1 + X2 + … + Xn)/n]
  = (E[X1] + E[X2] + … + E[Xn])/n
  = (nμ)/n
  = μ.
 A note: the sample mean is useful as an estimator of the mean of the
population random variable (to be discussed in the next chapter).
 Example 7.4 (continued from Example 7.3) ---
Suppose that we want to estimate the mean μ of the normal distribution of the resistance values of the resistors produced by the factory mentioned in Example 7.3. As mentioned above, the sample mean may be used for this purpose.
 The sample mean value x̄1 computed in terms of the first set of sample values is x̄1 = (8.04 + 8.02 + 8.07 + 7.99 + 8.03)/5 = 8.03; that computed in terms of the second set is x̄2 = (8.01 + 7.89 + 8.10 + 8.02 + 8.00)/5 ≈ 8.00; and that computed in terms of the third set is x̄3 = (7.98 + 8.01 + 8.05 + 7.90 + 8.15)/5 ≈ 8.02, and so on.
 These estimates, 8.03, 8.00, 8.02, etc., are random in nature, and are values of the sample mean X̄ = (X1 + X2 + … + X5)/5.
 Fact 7.4 says that the mean of these values, μ_X̄ = E[X̄], is just the real mean μ of the population random variable X.
 Note that there are ways other than the sample mean for estimating μ (to be discussed in the next chapter).
7.4 Covariance, Variance of Sums, and Correlations
 Expectation of a product of independent random variables ---
 Proposition 7.2 --If X and Y are two independent random variables, then for any functions
h and g, the following equality holds:
E[g(X)h(Y)] = E[g(X)]E[h(Y)].
Proof:
 Suppose that X and Y are jointly continuous with joint pdf f(x, y).
 By the independence of X and Y and Fact 6.9, we have
f(x, y) = fX(x)fY(y).
 Then
E[g(X)h(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)h(y) f(x, y) dx dy
  = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x)h(y) fX(x) fY(y) dx dy
  = \int_{-\infty}^{\infty} g(x) fX(x) dx · \int_{-\infty}^{\infty} h(y) fY(y) dy
  = E[g(X)]E[h(Y)].
 The proof for the discrete case is similar.
 Covariance ---
 Definition 7.4 (the covariance of two random variables) ---
The covariance between two random variables X and Y, denoted by Cov(X, Y), is defined by
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
where μ_X and μ_Y are the means of X and Y, respectively.
 Notes:
 It is easy to see from the above definition that Cov(X, Y) = Cov(Y, X), i.e., the operator Cov is commutative.
 Comparing the above definition with that of Var(X), which is Var(X) = E[(X − μ)²] where μ is the mean of X, we see that the former becomes the latter if Y is taken to be X, as described by the following fact.
 Fact 7.5 --Var(X) = Cov(X, X).
Proof: easy and left as an exercise.
 Fact 7.6 ---
Cov(X, Y) = E[XY] − E[X]E[Y].
Proof: easy and left as an exercise.
 A note: compare the above formula with that for computing Var(X) from Proposition 5.2, which is Var(X) = E[X²] − (E[X])²; the latter is just a special case of the former.
 Fact 7.7 --If two random variables X and Y are independent, then Cov(X, Y) = 0, but
the reverse is not always true.
Proof:
 Since X and Y are independent, by Proposition 7.2,
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]
  = E[X − μ_X]E[Y − μ_Y]
  = (E[X] − μ_X)(E[Y] − μ_Y)
  = 0.
 To prove the second statement, let X and Y be defined respectively as:
P{X = 0} = P{X = 1} = P{X = −1} = 1/3;
and
Y = 0 if X ≠ 0;
  = 1, else.
 From the definition of the random variable Y above, we see that the value of the product random variable XY is always zero, and so we have E[XY] = 0.
 Also, it is easy to compute E[X] = (1/3)(0 + 1 − 1) = 0.
 Therefore, by Fact 7.6, we get Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
 However, clearly, X and Y are not independent, as can be seen easily from the definition of Y. Done.
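 The counterexample above can also be checked numerically; the sketch below (added here, not part of the original proof; the sample size is arbitrary) estimates Cov(X, Y) from simulated data and shows it is near 0 even though Y is a deterministic function of X.

import random

# Counterexample of Fact 7.7: X uniform on {-1, 0, 1}, Y = 1 if X = 0 and Y = 0 otherwise.
# Cov(X, Y) = 0, yet X and Y are clearly not independent (Y is determined by X).
n = 200_000
xs = [random.choice([-1, 0, 1]) for _ in range(n)]
ys = [1 if x == 0 else 0 for x in xs]

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

print("estimated Cov(X, Y):", cov_xy)    # approximately 0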
 Fact 7.8 --Cov(aX, Y) = Cov(X, aY) = aCov(X, Y);
Cov(aX, bY) = abCov(X, Y).
Proof: easy from the definition of the covariance and left as exercises.
 Proposition 7.3 ---
Cov(\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(X_i, Y_j).
(Note: here the Xi need not be iid; neither need the Yj's.)
Proof:
 Let μ_i = E[X_i] and ν_j = E[Y_j].
 Then, E[\sum_{i=1}^{n} X_i] = \sum_{i=1}^{n} μ_i and E[\sum_{j=1}^{m} Y_j] = \sum_{j=1}^{m} ν_j.
 So,
Cov(\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j)
  = E[(\sum_{i=1}^{n} X_i − E[\sum_{i=1}^{n} X_i])(\sum_{j=1}^{m} Y_j − E[\sum_{j=1}^{m} Y_j])]
  = E[(\sum_{i=1}^{n} (X_i − μ_i))(\sum_{j=1}^{m} (Y_j − ν_j))]
  = E[\sum_{i=1}^{n} \sum_{j=1}^{m} (X_i − μ_i)(Y_j − ν_j)]
  = \sum_{i=1}^{n} \sum_{j=1}^{m} E[(X_i − μ_i)(Y_j − ν_j)]    (by Fact 7.2)
  = \sum_{i=1}^{n} \sum_{j=1}^{m} Cov(X_i, Y_j).    (by definition of Cov)
 Fact 7.9 ---
Var(\sum_{i=1}^{n} X_i) = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i<j} Cov(X_i, X_j).
Proof:
Var(\sum_{i=1}^{n} X_i) = Cov(\sum_{i=1}^{n} X_i, \sum_{j=1}^{n} X_j)    (by Fact 7.5)
  = \sum_{i=1}^{n} \sum_{j=1}^{n} Cov(X_i, X_j)    (by Proposition 7.3)
  = \sum_{i=1}^{n} Cov(X_i, X_i) + \sum_{i≠j} Cov(X_i, X_j)
  = \sum_{i=1}^{n} Var(X_i) + \sum_{i≠j} Cov(X_i, X_j)    (by Fact 7.5)
  = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{i<j} Cov(X_i, X_j).
 Fact 7.10 ---
If X1, X2, …, Xn are pairwise independent, in that Xi and Xj are independent for all i ≠ j, then
Var(\sum_{i=1}^{n} X_i) = \sum_{i=1}^{n} Var(X_i).
Proof:
 If Xi and Xj are independent for all i ≠ j, then by Fact 7.7, we have Cov(X_i, X_j) = 0.
 And so by Fact 7.9, the result is derived.
 Fact 7.11 (the variance of the sample mean) ---
If X1, X2, …, Xn constitutes a random sample arising from a population random variable X with variance σ², then the variance of their sample mean X̄ is
Var(X̄) = σ²/n.
Proof:
 By the definition of the sample mean, X̄ = (X1 + X2 + … + Xn)/n where the Xi are all iid, we get
Var(X̄) = Var((X1 + X2 + … + Xn)/n)
  = Var((1/n) \sum_{i=1}^{n} X_i)
  = (1/n)² Var(\sum_{i=1}^{n} X_i)    (by Proposition 5.3)
  = (1/n²) \sum_{i=1}^{n} Var(X_i)    (by the iid property and Fact 7.10)
  = (1/n²)(nσ²)    (∵ Var(X_i) = σ² for all i = 1, 2, …, n)
  = σ²/n.
 Notes:
 By Facts 7.4 and 7.11 above, we have the results
E[X̄] = μ; Var(X̄) = σ²/n,
respectively, if the population random variable X has mean μ and variance σ².
 It is also easy to see the following fact about the sample total To.
 Fact 7.12 (the mean and variance of the sample total) ---
E[To] = nμ; Var(To) = nσ².
Proof: easy by Facts 7.4 and 7.11; left as an exercise.
 Example 7.5 ---
Prove Fact 4.7, which says that the mean and variance of a binomial random variable with parameters n and p are E[X] = np and Var(X) = np(1 − p), respectively.
Proof:
 By definition, a binomial random variable with parameters n and p may be considered as the sample total To of a random sample of size n, X1, X2, …, Xn, arising from a Bernoulli random variable X defined as:
X = 1 with probability p and X = 0 with probability 1 − p.
 The mean of the Bernoulli random variable X by definition is
E[X] = 1·p + 0·(1 − p) = p.
 Also, the second moment of X is E[X²] = 1²·p + 0²·(1 − p) = p.
 And the variance of it is thus
Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p).
 Now, by Fact 7.12, the mean and variance of the sample total To may be computed to be
E[To] = nE[X] = np;
Var(To) = nVar(X) = np(1 − p).
 Sample variance ---
 Definition 7.5 (sample variance) ---
Given a random sample X1, X2, …, Xn arising from a population random variable X, the function of the sample variables
S² = [(X1 − X̄)² + (X2 − X̄)² + … + (Xn − X̄)²]/(n − 1)
   = [\sum_{i=1}^{n} (X_i − X̄)²]/(n − 1)    (7.3)
is called the sample variance, where X̄ is the sample mean of the random sample.
 A note: the value n − 1 instead of n is used as the divisor in the above definition. The reason will become clear in the next fact.
 Fact 7.13 (the mean of the sample variance) ---
If X1, X2, …, Xn constitutes a random sample arising from a population random variable X with mean μ and variance σ², then the mean E[S²] of the sample variance S² is just σ².
Proof:
 First, by definition we have
(n − 1)S² = \sum_{i=1}^{n} (X_i − X̄)²
  = \sum_{i=1}^{n} [(X_i − μ) − (X̄ − μ)]²
  = \sum_{i=1}^{n} (X_i − μ)² − 2(X̄ − μ) \sum_{i=1}^{n} (X_i − μ) + \sum_{i=1}^{n} (X̄ − μ)²
  = \sum_{i=1}^{n} (X_i − μ)² − 2(X̄ − μ)·n(X̄ − μ) + n(X̄ − μ)²
  = \sum_{i=1}^{n} (X_i − μ)² − n(X̄ − μ)².
 Taking expectations of the two sides of the above equality and by the linearity of the expectation operator E[·] (Fact 7.2), we get
(n − 1)E[S²] = E[\sum_{i=1}^{n} (X_i − μ)² − n(X̄ − μ)²]
  = \sum_{i=1}^{n} E[(X_i − μ)²] − nE[(X̄ − μ)²].    (A)
 But E[(X̄ − μ)²] = E[(X̄ − E[X̄])²]    (by Fact 7.4: E[X̄] = μ)
  = Var(X̄).    (by definition)
 Therefore, (A) above becomes
(n − 1)E[S²] = \sum_{i=1}^{n} E[(X_i − μ)²] − nVar(X̄)
  = nσ² − n(σ²/n)    (by definition of σ² and Fact 7.11)
  = (n − 1)σ².
 Therefore, E[S²] = σ². Done.
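 Fact 7.13 is the reason for the divisor n − 1 in Definition 7.5: it makes S² an unbiased estimator of σ². The sketch below (an added illustration; the normal population N(0, 4) and the sample size are arbitrary choices) compares the average of S² with the average of the biased version that divides by n.

import random

# Empirical check of Fact 7.13: E[S^2] = sigma^2 when the divisor n - 1 is used.
mu, sigma = 0.0, 2.0          # population N(0, 4); arbitrary choice
n, n_samples = 5, 50_000      # sample size and number of repeated samples

sum_unbiased, sum_biased = 0.0, 0.0
for _ in range(n_samples):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    sum_unbiased += ss / (n - 1)
    sum_biased += ss / n

print("sigma^2                :", sigma ** 2)                   # 4.0
print("average of S^2 (n - 1) :", sum_unbiased / n_samples)     # close to 4.0
print("average with divisor n :", sum_biased / n_samples)       # close to 4.0 * (n - 1)/n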
 A summary ---
 Now according to Facts 7.4 and 7.13, we have the means of the sample mean X̄ and the sample variance S², respectively, as
E[X̄] = μ; E[S²] = σ²,
where μ and σ² are the mean and the variance of the population random variable X, respectively.
 Also, by Fact 7.11 we have Var(X̄) = σ²/n.
 How about the value Var(S²)? This value is difficult to derive except for some special cases, like the population random variable being normal, as described by the following fact.
 Fact 7.14 (the variance of the sample variance of a random sample arising from a normal population distribution) ---
If X1, X2, …, Xn constitutes a random sample arising from a normal population random variable X with mean μ and variance σ², then the variance Var(S²) of the sample variance S² of the random sample is 2σ⁴/(n − 1).
Proof: see the reference book.
 A note: additional facts about the sample mean and the sample variance are described in the following fact.
 Fact 7.15 (relations between the sample mean and the sample variance) ---
If X1, X2, …, Xn constitutes a random sample arising from a normal population random variable X with mean μ and variance σ², then the sample mean X̄ and the sample variance S² of the random sample have the following properties:
(a) X̄ and S² are independent;
(b) X̄ is a normal random variable with mean μ and variance σ²/n; and
(c) (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom.
Proof: see the reference book.
 Example 7.6 ---
Use the result of (c) in Fact 7.15 to prove Fact 7.14.
Solution:
 First, by Fact 6.13, the mean and variance of a gamma distribution with parameters (t, λ) are t/λ and t/λ², respectively.
 From (c) of Fact 7.15, (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom; and according to Definition 6.11, this random variable is just a gamma random variable with parameters (t, λ) where t = (n − 1)/2 and λ = 1/2.
 Therefore, the variance of (n − 1)S²/σ² may be computed as
Var((n − 1)S²/σ²) = t/λ² = [(n − 1)/2]/(1/2)² = 2(n − 1).    (C)
 On the other hand, by Proposition 5.3, which says Var(aX + b) = a²Var(X) for any random variable, we get
Var((n − 1)S²/σ²) = [(n − 1)/σ²]²Var(S²).    (D)
 Equating the above two results (C) and (D), we get
[(n − 1)/σ²]²Var(S²) = 2(n − 1)
or equivalently,
Var(S²) = 2σ⁴/(n − 1)
which is just the result of Fact 7.14. Done.
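 The value 2σ⁴/(n − 1) can be checked by repeated sampling as well. The sketch below (added here; the population N(1, 9) and the sample size n = 10 are arbitrary choices) estimates Var(S²) empirically and compares it with the formula of Fact 7.14.

import random

# Empirical check of Fact 7.14: Var(S^2) = 2*sigma^4/(n - 1) for a normal population.
mu, sigma = 1.0, 3.0            # population N(1, 9); arbitrary choice
n, n_samples = 10, 50_000

s2_values = []
for _ in range(n_samples):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2_values.append(sum((x - xbar) ** 2 for x in xs) / (n - 1))

mean_s2 = sum(s2_values) / n_samples
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / n_samples

print("theoretical Var(S^2):", 2 * sigma ** 4 / (n - 1))    # 2 * 81 / 9 = 18
print("simulated Var(S^2)  :", var_s2)                      # should be close to 18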
 Sample covariance ---
 Definition 7.6 (sample covariance) ---
Given two random samples X1, X2, …, Xn and Y1, Y2, …, Yn arising from two random variables X and Y, respectively, the sample covariance of X and Y is defined as
S_XY = [(X1 − X̄)(Y1 − Ȳ) + (X2 − X̄)(Y2 − Ȳ) + … + (Xn − X̄)(Yn − Ȳ)]/(n − 1)
     = [\sum_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)]/(n − 1)    (7.4)
where X̄ and Ȳ are the sample means of the two random samples arising from X and Y, respectively.
 Correlation ---
 Definition 7.7 (correlation of two random variables) ---
The correlation of two random variables X and Y, denoted by ρ(X, Y), is defined, as long as the product Var(X)Var(Y) is positive, by
ρ(X, Y) = Cov(X, Y) / \sqrt{Var(X)Var(Y)}.    (7.5)
 Fact 7.16 (limits of correlation values) ---
The correlation of two random variables X and Y is limited to the interval [−1, 1], i.e.,
−1 ≤ ρ(X, Y) ≤ 1.
Proof:
 Suppose random variables X and Y have variances σ_X² and σ_Y², respectively.
 Then,
0 ≤ Var(X/σ_X + Y/σ_Y)    (by definition of variance)
  = Var(X/σ_X) + Var(Y/σ_Y) + 2Cov(X/σ_X, Y/σ_Y)    (by Fact 7.9)
  = Var(X)/σ_X² + Var(Y)/σ_Y² + 2Cov(X, Y)/(σ_Xσ_Y)    (by Proposition 5.3 and Fact 7.8)
  = 2[1 + ρ(X, Y)].    (∵ Var(X) = σ_X², Var(Y) = σ_Y²)
 That is, 0 ≤ 2[1 + ρ(X, Y)], which implies
−1 ≤ ρ(X, Y).    (E)
 On the other hand,
0 ≤ Var(X/σ_X − Y/σ_Y)    (by definition of variance)
  = Var(X/σ_X) + Var(Y/σ_Y) − 2Cov(X/σ_X, Y/σ_Y)    (by Facts 7.8 and 7.9)
  = Var(X)/σ_X² + Var(Y)/σ_Y² − 2Cov(X, Y)/(σ_Xσ_Y)    (by Proposition 5.3 and Fact 7.8)
  = 2[1 − ρ(X, Y)].    (∵ Var(X) = σ_X², Var(Y) = σ_Y²)
 That is, 0 ≤ 2[1 − ρ(X, Y)], which implies
ρ(X, Y) ≤ 1.    (F)
 By (E) and (F), we get −1 ≤ ρ(X, Y) ≤ 1. Done.
 Fact 7.17 (the extreme linearity cases of correlation values) ---
ρ(X, Y) = 1 implies that Y = a + bX where b = σ_Y/σ_X > 0; and
ρ(X, Y) = −1 implies that Y = a + bX where b = −σ_Y/σ_X < 0.
Proof:
 For ρ(X, Y) = 1 to be true, we see from the last proof that 0 = Var(X/σ_X − Y/σ_Y) must be true.
 This, by the definition of variance, implies that X/σ_X − Y/σ_Y must be a constant (say c), i.e.,
X/σ_X − Y/σ_Y = c.
 So, Y = a + bX where a = −cσ_Y and b = σ_Y/σ_X > 0.
 The second case may be proved similarly with a = cσ_Y and b = −σ_Y/σ_X < 0. Done.
 Fact 7.18 (the inverse cases of linearity of Fact 7.17) ---
Y = a + bX implies ρ(X, Y) = 1 or −1, depending on the sign of b.
Proof: easy and left as an exercise.
 Comments and illustrations about the property of correlation ---
 The correlation ρ(X, Y) is a measure of the linearity between the two random variables X and Y.
 A value of ρ(X, Y) close to +1 or −1 indicates a high degree of linearity between X and Y, whereas a value close to 0 indicates a lack of such linearity (see Fig. 7.1).
Fig. 7.1 Illustration of linearity of extreme correlation values: (a) ρ(X, Y) = 1; (b) ρ(X, Y) = −1.
 A positive value of ρ(X, Y) indicates that Y tends to increase when X does, whereas a negative value indicates that Y tends to decrease when X increases (see Fig. 7.2).
Fig. 7.2 Illustration of the mutual tendency of X and Y for positive and negative correlation values: (a) ρ(X, Y) > 0; (b) ρ(X, Y) < 0.
 When (X, Y) = 0, it means that there is no relation between the
tendency of the values of X and Y (see Fig. 7.3).
y
(X, Y) = 0
x
Fig. 7.3 Illustration of mutual tendency of X and Y for correlation value of zero.
 Definition 7.8 ---
If ρ(X, Y) = 0, then the random variables X and Y are said to be uncorrelated.
 Example 7.7 ---
A group of 10 ladies spend money on cosmetics and clothes, and the sample values in a recent year are shown in Table 7.1. Consider the expenses for cosmetics and clothes as two random variables X and Y. Find the correlation ρ(X, Y) using the data of Table 7.1 to see whether or not spending more money on cosmetics implies spending more on clothes in the lady group. Considering the data in the table as sample values of random samples of X and Y, compute the sample mean values and the sample variance values for use as the mean and variance values of X and Y in computing the value of ρ.
Table 7.1 Sample values of yearly cosmetics and clothes fees of a lady group.
Lady No.        1     2     3      4     5      6      7      8      9      10
Cosmetics Fee   3000  5000  12000  2000  7000   15000  5000   6000   8000   10000
Clothes Fee     7000  8000  25000  5000  12000  30000  10000  15000  20000  18000
Solution:
 The sample mean values may be computed to be x̄ = (\sum_{i=1}^{10} x_i)/10 = 73000/10 = 7300 and ȳ = (\sum_{i=1}^{10} y_i)/10 = 150000/10 = 15000, respectively, for use as estimates of the means E[X] and E[Y] of X and Y, respectively (details omitted).
 The sample variance values s_x² and s_y², for use as estimates of the variances Var(X) and Var(Y) of X and Y, may be computed to be s_x² = 148100000/(10 − 1) and s_y² = 606000000/(10 − 1), respectively (details omitted).
 The sample covariance value, for use as an estimate of the covariance Cov(X, Y) of X and Y, may be computed to be s_xy = 290000000/(10 − 1) (details omitted).
 Therefore, the correlation ρ(X, Y) may be computed, using the estimates, to be
ρ(X, Y) = Cov(X, Y)/\sqrt{Var(X)Var(Y)} ≈ s_xy/(s_x²s_y²)^{1/2}
  = [290000000/(10 − 1)] / {[148100000/(10 − 1)][606000000/(10 − 1)]}^{1/2}
  = 290000000/(148100000 × 606000000)^{1/2}
  ≈ 0.9680,
which is quite close to 1, meaning that X and Y are strongly positively correlated (Y tends to increase as X increases), as can be seen from Fig. 7.4, where the data distribution is quite close to that of Fig. 7.1(a), as indicated by the red line.
Fig. 7.4 Data distribution of Example 7.7.
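 The arithmetic of this example can be reproduced with a few lines of Python (added here only as a check of the computation above; it uses the data of Table 7.1 directly).

# Recomputing the sample correlation of Example 7.7 from the data of Table 7.1.
cosmetics = [3000, 5000, 12000, 2000, 7000, 15000, 5000, 6000, 8000, 10000]
clothes   = [7000, 8000, 25000, 5000, 12000, 30000, 10000, 15000, 20000, 18000]

n = len(cosmetics)
x_bar = sum(cosmetics) / n      # 7300
y_bar = sum(clothes) / n        # 15000
sxx = sum((x - x_bar) ** 2 for x in cosmetics) / (n - 1)
syy = sum((y - y_bar) ** 2 for y in clothes) / (n - 1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(cosmetics, clothes)) / (n - 1)

rho = sxy / (sxx * syy) ** 0.5
print(f"sample correlation: {rho:.4f}")    # 0.9680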
 Comments -- The sample mean, variance, and covariance values computed in the last
example are respectively estimates of the means, variances, and
covariance of the population random variables.
 More about parameter estimation will be discussed in the next chapter.
 Bivariate normal distribution ---
 Notes -- The bivariate normal distribution is important in many applications; it is
also a starting point to study multivariate normal distributions.
 This joint normal distribution has many interesting properties which are
useful in various applications.
 Definition 7.9 (bivariate normal distribution) ---
Two jointly distributed random variables X and Y are said to have a bivariate normal distribution if their joint pdf is given by
f(x, y) = \frac{1}{2πσ_xσ_y\sqrt{1 − ρ²}} \exp\left\{ −\frac{1}{2(1 − ρ²)} \left[ \left(\frac{x − μ_x}{σ_x}\right)² + \left(\frac{y − μ_y}{σ_y}\right)² − \frac{2ρ(x − μ_x)(y − μ_y)}{σ_xσ_y} \right] \right\}.    (7.6)
 Illustration of a bivariate normal distribution ---
A diagram of the shape of the pdf of a bivariate normal distribution is shown in Fig. 7.5.
Fig. 7.5 Shape of the pdf of a bivariate normal distribution.
 Fact 7.19 ---
Given the bivariate normal distribution of two jointly distributed random variables X and Y as described in Definition 7.9, the following facts are true:
(a) both of the marginal random variables X and Y are normal, with parameters (μ_x, σ_x²) and (μ_y, σ_y²), respectively;
(b) the ρ in the pdf is the correlation of X and Y;
(c) X and Y are independent when ρ = 0.
Proof: see the reference book and the following example.
 Example 7.8 ---
Prove (c) of Fact 7.19 above.
Proof:
 When ρ = 0 in (7.6) of Definition 7.9, f(x, y) there becomes
f(x, y) = \frac{1}{2πσ_xσ_y} \exp\left\{ −\frac{1}{2}\left[ \left(\frac{x − μ_x}{σ_x}\right)² + \left(\frac{y − μ_y}{σ_y}\right)² \right] \right\}
which can be factored into
f(x, y) = \frac{1}{\sqrt{2π}σ_x} \exp\left[ −\frac{1}{2}\left(\frac{x − μ_x}{σ_x}\right)² \right] · \frac{1}{\sqrt{2π}σ_y} \exp\left[ −\frac{1}{2}\left(\frac{y − μ_y}{σ_y}\right)² \right]
        = fX(x)fY(y)
with fX(x) and fY(y) being the pdf's of the two normal random variables X and Y, respectively.
 Therefore, by Proposition 6.1 of the last chapter, we conclude that X and Y are independent.
7.5 Conditional Expectation
 Definitions ---
 Recall -- (Definition 6.12) The conditional pmf of X, given that Y = y, is defined
for all y such that P{Y = y} > 0, by
pX|Y(x|y) = P{X = x | Y = y} = p(x, y)/pY(y).
 (Definition 6.14) The conditional pdf of X, given that Y = y, is defined
for all y such that fY(y) > 0, by
fX|Y(x|y) = f(x, y)/fY(y).
 Definition 7.10 (for the discrete case) ---
The conditional expectation of X, given that Y = y, is defined for all y such that pY(y) > 0, by
E[X|Y = y] = \sum_x x P{X = x | Y = y} = \sum_x x pX|Y(x|y).
 Definition 7.11 (for the continuous case) ---
The conditional expectation of X, given that Y = y, is defined for all y such that fY(y) > 0, by
E[X|Y = y] = \int_{-\infty}^{\infty} x fX|Y(x|y) dx.
 An interpretation of the conditional expectation --The conditional expectation given Y = y can be thought of as being an
ordinary expectation on a reduced sample space consisting only of outcomes
for which Y = y.
 Example 7.9 ---
Suppose that the joint pdf of random variables X and Y is given by
f(x, y) = e^{−x/y} e^{−y} / y    for 0 < x < ∞, 0 < y < ∞.
Compute E[X|Y = y].
Solution:
 First compute
fX|Y(x|y) = f(x, y)/fY(y)
  = f(x, y) / \int_{-\infty}^{\infty} f(x, y) dx
  = [(1/y)e^{−x/y}e^{−y}] / [\int_0^{∞} (1/y)e^{−x/y}e^{−y} dx]
  = [(1/y)e^{−x/y}] / [\int_0^{∞} e^{−x/y} d(x/y)]
  = (1/y)e^{−x/y} / [−e^{−x/y}|_0^{∞}]
  = (1/y)e^{−x/y} / [−(0 − 1)]
  = (1/y)e^{−x/y}.
 So, E[X|Y = y] = \int_{-\infty}^{\infty} x fX|Y(x|y) dx
  = \int_0^{∞} (x/y)e^{−x/y} dx
  = −\int_0^{∞} x d(e^{−x/y})
  = −x e^{−x/y}|_{x=0}^{∞} + \int_0^{∞} e^{−x/y} dx
    (using integration by parts \int u dv = uv − \int v du with v = e^{−x/y} and u = x)
  = (0 − 0) + (−y)e^{−x/y}|_{x=0}^{∞}
  = 0 − (−y)e^{0} = y.
 Comments ---
 The above example, with a result of y, shows that E[X|Y = y] may be thought of as a function g(y) of y, i.e., E[X|Y = y] = E_X[X|Y = y] = g(y).
 Therefore, E[X|Y] may be thought of as a function of the random variable Y, g(Y), whose value at Y = y is just E[X|Y = y]. And E[E[X|Y]] in more detail is just E_Y[E_X[X|Y]].
 Just as conditional probabilities satisfy all of the properties of ordinary
probabilities, so do conditional expectations satisfy all of the properties
of ordinary expectations.
 Some facts resulting from this viewpoint are as follows.
 Fact 7.20 ---
E[g(X)|Y = y] = \sum_x g(x) pX|Y(x|y)    in the discrete case;
             = \int_{-\infty}^{\infty} g(x) fX|Y(x|y) dx    in the continuous case.
E[\sum_{i=1}^{n} X_i | Y = y] = \sum_{i=1}^{n} E[X_i | Y = y].
Proof: left as exercises.
 Proposition 7.4 (computing expectation by conditioning) ---
E[X] = E[E[X|Y]],
where for the discrete case of Y,
E[E[X|Y]] = E_Y[E_X[X|Y]] = \sum_y E_X[X|Y = y] P{Y = y};
and for the continuous case of Y,
E[E[X|Y]] = E_Y[E_X[X|Y]] = \int_{-\infty}^{\infty} E_X[X|Y = y] fY(y) dy.
 Proof: (for the discrete case only; the continuous case is left as an exercise)
\sum_y E[X|Y = y] P{Y = y} = \sum_y (\sum_x x P{X = x|Y = y}) P{Y = y}
  = \sum_y \sum_x x [P{X = x, Y = y}/P{Y = y}] P{Y = y}
  = \sum_y \sum_x x P{X = x, Y = y}
  = \sum_x x (\sum_y P{X = x, Y = y})
  = \sum_x x P{X = x}
  = E[X].
 Example 7.10 --A miner is trapped in a mine containing 3 doors. The 1st door leads to a
tunnel that will take him to safety after 3 hours of travel. The 2nd door leads
to a tunnel that will return him to the mine after 5 hours of travel. The 3rd
door leads to a tunnel that will return him to the mine after 7 hours. If we
assume that the miner is at all times equally likely to choose any of the doors,
what is the expected length of time until he reaches safety?
Solution:
 Let X be the amount of time (in hours) until the miner reaches safety, and
let Y denote the door he initially chooses.
 Now, according to Proposition 7.4, we have
E[X] = E[X|Y = 1]P{Y = 1} + E[X|Y = 2]P{Y = 2} + E[X|Y = 3]P{Y = 3}
= (1/3)(E[X|Y = 1] + E[X|Y = 2] + E[X|Y = 3]).
 However,
E[X|Y = 1] = 3;
E[X|Y = 2] = 5 + E[X];    (why? Because he returns to his original situation.)
E[X|Y = 3] = 7 + E[X].
 Therefore,
E[X] = (1/3)(3 + 5 + E[X] + 7 + E[X])
which can be solved to get
E[X] = 15.
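 The answer E[X] = 15 can be verified by simulating the miner's journey directly (an added sketch, not part of the original solution; the number of trials is arbitrary).

import random

# Simulation of Example 7.10: expected time until the miner reaches safety.
n_trials = 100_000
total_time = 0.0
for _ in range(n_trials):
    t = 0.0
    while True:
        door = random.randint(1, 3)    # each door equally likely every time
        if door == 1:
            t += 3                     # this tunnel leads to safety
            break
        elif door == 2:
            t += 5                     # back to the mine
        else:
            t += 7                     # back to the mine
    total_time += t

print("simulated E[X]:", total_time / n_trials)    # close to 15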
7.6 Central Limit Theorem and Related Laws
 Introduction ---
 The central limit theorem provides insight into why many random variables
have probability distributions that are approximately normal.
 For example, the measurement error in a scientific experiment, often
normally distributed, may be thought of as a sum of a number of underlying
perturbations and errors of small magnitude, which are not necessarily
normal.
 Laws related to the central limit theorem will also be discussed subsequently.
 The weak law of large numbers ---
 Theorem 7.1 (the weak law of large numbers) ---
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with mean μ and variance σ², then for any ε > 0, the following equality is true:
lim_{n→∞} P{|X̄ − μ| < ε} = 1.    (7.7)
That is, as n → ∞, the probability that X̄ and μ are arbitrarily close approaches 1.
Proof: we prove this theorem in three stages.
Stage 1 --- proof of Markov's inequality:
P{Y ≥ a} ≤ E[Y]/a    for all a > 0,
where Y is a nonnegative random variable.
 For a > 0, define a new random variable
I = 1 if Y ≥ a;
  = 0, otherwise.
 Then, by definition,
E[I] = 1·P{Y ≥ a} + 0·P{Y < a} = P{Y ≥ a}.    (G)
 Also, since Y ≥ 0 (∵ Y is a nonnegative random variable), for all values of Y it is always true that
Y/a ≥ I, or equivalently, I ≤ Y/a,
because
 for Y ≥ a, Y/a ≥ 1 = I; and
 for Y < a, Y/a ≥ 0 = I.
 Taking the expectations of the two sides of the inequality I ≤ Y/a leads to E[I] ≤ E[Y/a], or equivalently, P{Y ≥ a} ≤ E[Y]/a according to (G).
Stage 2 --- proof of Chebyshev's inequality:
P{|W − μ| ≥ k} ≤ σ²/k²    for all k > 0,
where W is a random variable with mean μ and variance σ².
 Define the random variable Y mentioned in the proof of Stage 1 above as Y = (W − μ)². Then Y is nonnegative.
 Also take the value a there to be a = k².
 Then Markov's inequality proven above can be applied here to get
P{Y ≥ a} = P{(W − μ)² ≥ k²} ≤ E[Y]/a = E[(W − μ)²]/k² = σ²/k²    (H)
where the equality E[(W − μ)²] = σ² has been used.
 Since the inequality (W − μ)² ≥ k² means |W − μ| ≥ k, we get from (H) the desired result
P{|W − μ| ≥ k} ≤ σ²/k².
Stage 3 --- proof of the theorem.
 Assume that the variance σ² of the population random variable X is finite. The proof for the infinite case is omitted.
 The sample mean X̄ has mean μ and variance σ²/n according to Facts 7.4 and 7.11.
 Take W and k mentioned in Stage 2 to be X̄ and ε, respectively.
 Then, Chebyshev's inequality proven there says that
P{|W − μ| ≥ ε} = P{|X̄ − μ| ≥ ε} ≤ (σ²/n)/ε² = σ²/(nε²),
or equivalently, that
P{|X̄ − μ| < ε} = 1 − P{|X̄ − μ| ≥ ε} ≥ 1 − σ²/(nε²).
 As n → ∞, σ²/(nε²) → 0 since σ² is finite.
 Therefore, P{|X̄ − μ| < ε} → 1 as n → ∞.
 That is, the desired result lim_{n→∞} P{|X̄ − μ| < ε} = 1 is obtained. Done.
 Comments ---
 With lim_{n→∞} P{|X̄ − μ| < ε} = 1, we say that X̄ converges to μ in probability.
 The weak law of large numbers will be used to define a good property,
“consistency,” of estimators of parameters of various probability
distributions, as will be discussed in the next chapter.
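 The statement of the weak law can also be observed numerically. The sketch below (added here; the fair-die population with μ = 3.5, the tolerance ε = 0.2, and the repetition counts are all arbitrary choices) estimates P{|X̄ − μ| < ε} for several sample sizes n and shows it approaching 1.

import random

# Numerical illustration of the weak law of large numbers (Theorem 7.1).
# Population: a fair die, so mu = 3.5; tolerance eps = 0.2.
mu, eps = 3.5, 0.2
n_repeats = 1_000          # number of sample means drawn for each n

for n in (10, 100, 1000, 5000):
    hits = 0
    for _ in range(n_repeats):
        xbar = sum(random.randint(1, 6) for _ in range(n)) / n
        hits += 1 if abs(xbar - mu) < eps else 0
    print(f"n = {n:5d}: estimated P{{|Xbar - mu| < {eps}}} = {hits / n_repeats:.3f}")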
 Central limit theorem ---
 Theorem 7.2 (the central limit theorem) ---
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with mean μ and variance σ², then for any ε > 0, the following equality is true:
lim_{n→∞} P{ (X̄ − μ)/(σ/\sqrt{n}) < ε } = Φ(ε)    (7.8)
where Φ(·) is the cdf of the standard normal random variable (called the error function hereafter). That is, as n → ∞, the random variable Y = (X̄ − μ)/(σ/\sqrt{n}) is approximately a unit normal random variable.
Proof: see the reference book.
 Notes ---
 The above theorem is valid both for continuous random variables and discrete ones. Also, the population random variable need not be normal.
 Another form of the theorem may be easily derived (left as an exercise) for the sample total To (instead of for the sample mean X̄):
lim_{n→∞} P{ (To − nμ)/(σ\sqrt{n}) < ε } = Φ(ε)    (7.9)
which means again that as n → ∞, the random variable Z = (To − nμ)/(σ\sqrt{n}) is approximately a unit normal random variable.
 Significance of the central limit theorem ---
 Providing a simple method for computing approximate probabilities for the sample total To or the sample mean X̄ of a random sample X1, X2, …, Xn (i.e., for the sum of iid random variables or for their mean);
 Helping explain the remarkable fact that the empirical frequencies of so many natural populations exhibit bell-shaped (i.e., normal) distributions (see the illustration discussed next).
 An illustration of the effect of the central limit theorem as n → ∞ ---
 See Fig. 7.6, in which each diagram illustrates the histogram of the sample total of a random sample of size n arising from a discrete population random variable X describing the outcome of a die toss (X = 1, 2, …, 6 with probability 1/6 for each value), based on 10000 tossings of the n dice, where n is taken to be 1, 2, 3, 5, 10, 50 for the six diagrams, respectively (i.e., sample size = n, and number of random samples = 10000); a simulation sketch reproducing this effect is given after this list.
 As can be seen, as n becomes larger and larger, the histogram becomes closer and closer to a bell shape --- the shape of a normal distribution.
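 The sketch below (added here; it prints crude text histograms instead of the plots of Fig. 7.6, and the choices n = 1, 2, 5 with 10000 repetitions follow the description above) reproduces this bell-shaped tendency.

import random
from collections import Counter

# Text histograms of the total of n dice over 10000 repetitions (cf. Fig. 7.6).
def dice_total_histogram(n_dice, n_repeats=10_000):
    counts = Counter(sum(random.randint(1, 6) for _ in range(n_dice))
                     for _ in range(n_repeats))
    for total in sorted(counts):
        print(f"{total:3d} | {'*' * (counts[total] // 50)}")    # one '*' per 50 occurrences

for n in (1, 2, 5):
    print(f"\ntotal of {n} dice:")
    dice_total_histogram(n)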
 A rule of thumb about the magnitude of n for using the central limit theorem ---
How large should n be for the approximation using the central limit theorem to be good enough? A rule of thumb is n > 30.
 Example 7.11 (use of the central limit theorem) --When a batch of a certain chemical product is prepared, the amount of a
particular impurity in the batch is a random variable with mean 4.0g and
variance 2.25 g². If 50 batches are independently prepared, what is the approximate probability that the sample mean amount of impurity, X̄, is larger than 3.8 g?
Fig. 7.6 Illustration of the central limit theorem showing the histogram of the sample total becoming bell-shaped as n → ∞ (downloaded 05/04/2010 from http://gaussianwaves.blogspot.com/2010/01/central-limit-theorem.html).
Solution:
 Here the size of the random sample is 50 > 30, so the central limit theorem can be applied to the sample mean X̄ here.
 The population random variable X here has mean μ = 4.0 g and σ = (2.25)^{1/2} = 1.5 g.
 Since by (7.8) we have P{ (X̄ − μ)/(σ/\sqrt{n}) < ε } ≈ Φ(ε) as n → ∞, we can compute the desired approximate probability as
P{X̄ > 3.8} = P{ (X̄ − 4.0)/(1.5/\sqrt{50}) > (3.8 − 4.0)/(1.5/\sqrt{50}) }
  ≈ P{Z > −0.94}    (by (7.8))
  = 1 − P{Z ≤ −0.94}
  = 1 − Φ(−0.94)
  = 1 − (1 − Φ(0.94))
  = Φ(0.94)
  ≈ 0.8264.
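 The same arithmetic can be done directly in Python, expressing the standard normal cdf Φ through the error function math.erf (an added check, not part of the original solution).

import math

# Recomputing Example 7.11 with the CLT approximation.
def phi(z):
    """Standard normal cdf, written in terms of math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 4.0, 1.5, 50
z = (3.8 - mu) / (sigma / math.sqrt(n))       # about -0.94
print("P{Xbar > 3.8} ~", 1 - phi(z))          # about 0.83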
 Example 7.12 (use of the central limit theorem) --If 10 fair dice are rolled, find the approximate probability that the sum
obtained is between 30 and 40 using the central limit theorem.
Solution:
 Let the random variable X denote the outcome of a die.
 The sum of the 10 outcomes of the 10 fair dice may be regarded as the sample total To = \sum_{i=1}^{10} X_i of a random sample X1, X2, …, X10 of size 10 arising from the population random variable X.
 The mean of X is E[X] = (1 + 2 + … + 6)/6 = 7/2, and the variance of X is Var(X) = E[X²] − (E[X])² = (1² + 2² + … + 6²)/6 − (7/2)² = 35/12.
 And so the sample total To has mean nμ = 10 × 7/2 = 35 and standard deviation σ\sqrt{n} = \sqrt{35/12} × \sqrt{10} = \sqrt{350/12}.
 By the central limit theorem described by (7.9), we get
P{30 ≤ To ≤ 40} = P{ (30 − 35)/\sqrt{350/12} ≤ (To − 35)/\sqrt{350/12} ≤ (40 − 35)/\sqrt{350/12} }
  ≈ P{ −\sqrt{6/7} ≤ Z ≤ \sqrt{6/7} }
  = Φ(\sqrt{6/7}) − Φ(−\sqrt{6/7})
  = Φ(\sqrt{6/7}) − (1 − Φ(\sqrt{6/7}))
  = 2Φ(\sqrt{6/7}) − 1    (\sqrt{6/7} ≈ 0.9258)
  ≈ 2Φ(0.93) − 1
  ≈ 2 × 0.8238 − 1
  = 0.6476.
 Note that the rule of thumb for n is not satisfied here, so the approximation of the probability is not very accurate.
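 Since n = 10 is below the rule-of-thumb threshold, it is instructive to compare the CLT estimate with a brute-force simulation (an added sketch; the trial count is arbitrary). The simulated value usually comes out a few percent higher than the approximation computed above.

import random

# Simulation check of Example 7.12: P{30 <= sum of 10 dice <= 40}.
n_trials = 200_000
hits = sum(1 for _ in range(n_trials)
           if 30 <= sum(random.randint(1, 6) for _ in range(10)) <= 40)

print("simulated probability:", hits / n_trials)    # typically around 0.69
print("CLT approximation    :", 0.6476)             # value computed in the example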
 Example 7.13 (the DeMoivre-Laplace Limit Theorem as a special case of the central limit theorem) ---
Show that the DeMoivre-Laplace Limit Theorem mentioned in Chapter 5 and repeated below is a special case of the central limit theorem:
If Sn denotes the number of successes that occur when n independent trials, each with a success probability p, are performed, then for any a < b, it is true that
P{ a ≤ (Sn − np)/\sqrt{np(1 − p)} ≤ b } → Φ(b) − Φ(a)
as n → ∞ (note: Sn is a random variable here).
Proof:
 Sn obviously is the sample total To of a random sample of size n arising from a Bernoulli random variable X with success probability p, which has mean p and variance p(1 − p) as seen from the result of Example 7.5.
 Applying (7.9) of the central limit theorem above with nμ = np and σ\sqrt{n} = \sqrt{p(1 − p)} × \sqrt{n} = \sqrt{np(1 − p)}, we get the desired result.
 A generalized central limit theorem ---
 Theorem 7.2 (the central limit theorem for independent random variables) ---
Let X1, X2, … be a sequence of independent random variables with respective means and variances μ_i = E[X_i] and σ_i² = Var(X_i). If
(a) the X_i are uniformly bounded, i.e., for some M, P{|X_i| < M} = 1 for all i; and
(b) \sum_{i=1}^{∞} σ_i² = ∞,
then the following equality is true:
lim_{n→∞} P{ \frac{\sum_{i=1}^{n} (X_i − μ_i)}{\sqrt{\sum_{i=1}^{n} σ_i²}} < ε } = Φ(ε).    (7.10)
Proof: omitted.
 A comment --- the above generalized central limit theorem is more useful for estimating the probability distribution of the sum of an unlimited number of independently, but not necessarily identically, distributed random variables.
 The strong law of large numbers ---
 Theorem 7.3 (the strong law of large numbers) ---
If X̄ is the sample mean of a random sample X1, X2, …, Xn of size n arising from a population random variable X with a finite mean μ, then the following equality is true:
P{ lim_{n→∞} X̄ = μ } = 1.    (7.11)
Proof: omitted.
 A comment --- with P{ lim_{n→∞} X̄ = μ } = 1, we say that X̄ converges to μ with probability 1.
 Comparison of the strong and weak laws of large numbers ---
 In notation, the two laws are:
Weak: lim_{n→∞} P{|X̄ − μ| < ε} = 1.    (7.7)
Strong: P{ lim_{n→∞} X̄ = μ } = 1.    (7.11)
 In words, the two laws may be described as follows.
Weak: the sample mean X̄ converges in probability towards the population mean μ (as n approaches ∞).
Strong: the sample mean X̄ converges with probability 1 (or almost surely) to the population mean μ (as n approaches ∞).
 Mathematically, the difference between the two laws is as follows.
Weak: for a specified large n*, the sample mean X̄ is likely to be near μ. Thus, it leaves open the possibility that |X̄ − μ| > ε happens an infinite number of times for n > n*, although at infrequent intervals.
Strong: this open possibility almost surely will not occur. In particular, it implies that, with probability 1, for any ε > 0 the inequality |X̄ − μ| < ε holds for all large enough n.
 Example 7.14 (an application of the strong law of large numbers) ---
The strong law of large numbers may be used to estimate the probability p of success of a Bernoulli random variable X (which is a binomial random variable with a single trial) or in other similar cases; a minimal simulation sketch of this procedure is given after the example.
 Recall the definition of a Bernoulli random variable X:
X = 1 if a success with probability p occurs; and
  = 0, otherwise.
 The mean of X is μ = E[X] = 1·P{success} + 0·P{failure} = 1·p + 0·(1 − p) = p.
 Let X1, X2, …, Xn be a random sample arising from X with X̄ as the sample mean (note: the sample total To = X1 + X2 + … + Xn here is just the binomial random variable consisting of n trials, each with the Bernoulli distribution).
 Then the law above says that as n → ∞, with probability 1 the sample mean X̄ → μ = p.
 That is, the mean of the n sample values approaches the probability p almost surely as n approaches ∞.
 The validity of this way of estimating p was mentioned in Chapter 2 (at the end of Section 2.2).
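 A minimal sketch of this estimation procedure (added here; the true value p = 0.3 is arbitrary and is used only to generate the data): the running relative frequency of successes settles down toward p as n grows.

import random

# Estimating the success probability p of a Bernoulli random variable by the sample
# mean (relative frequency of successes), as justified by the strong law of large numbers.
p_true = 0.3            # unknown in practice; chosen here only to generate data
successes = 0
for n in range(1, 100_001):
    successes += 1 if random.random() < p_true else 0
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:6d}: estimate of p = {successes / n:.4f}")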
 Historical notes ---
 The central limit theorem was first proved by the French mathematician Pierre-Simon, marquis de Laplace, who observed that errors of measurement, which can usually be regarded as the sum of a large number of tiny forces, tend to be normally distributed.
 The central limit theorem was regarded as an important contribution to
science.
 The strong law of large numbers is probably the best-known result in
probability theory.