1.10.3 Conditional Distributions
Definition 1.16. Let X = (X1 , X2 ) denote a discrete bivariate rv with joint
pmf pX (x1 , x2 ) and marginal pmfs pX1 (x1 ) and pX2 (x2 ). For any x1 such that
pX1 (x1 ) > 0, the conditional pmf of X2 given that X1 = x1 is the function of x2
defined by
$$p_{X_2|x_1}(x_2) = \frac{p_X(x_1, x_2)}{p_{X_1}(x_1)}.$$
Analogously, we define the conditional pmf of X1 given X2 = x2 by
$$p_{X_1|x_2}(x_1) = \frac{p_X(x_1, x_2)}{p_{X_2}(x_2)}.$$
It is easy to check that these functions are indeed pmfs. For example,
$$\sum_{x_2 \in \mathcal{X}_2} p_{X_2|x_1}(x_2) = \sum_{x_2 \in \mathcal{X}_2} \frac{p_X(x_1, x_2)}{p_{X_1}(x_1)} = \frac{\sum_{x_2 \in \mathcal{X}_2} p_X(x_1, x_2)}{p_{X_1}(x_1)} = \frac{p_{X_1}(x_1)}{p_{X_1}(x_1)} = 1.$$
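To make the computation concrete, here is a minimal Python sketch that builds a conditional pmf from a joint pmf stored as a dictionary and checks that it sums to one. The joint pmf values below are hypothetical, chosen only for illustration; they are not those of Example 1.18.

```python
# Conditional pmf p_{X2|x1}(x2) = p_X(x1, x2) / p_{X1}(x1) for a toy joint pmf.
# The joint pmf values are hypothetical, chosen only to illustrate the computation.

joint = {  # keys are (x1, x2), values are p_X(x1, x2)
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def marginal_x1(x1):
    """Marginal pmf of X1: sum the joint pmf over x2."""
    return sum(p for (a, _), p in joint.items() if a == x1)

def conditional_x2_given_x1(x1):
    """Conditional pmf of X2 given X1 = x1."""
    px1 = marginal_x1(x1)
    return {b: p / px1 for (a, b), p in joint.items() if a == x1}

cond = conditional_x2_given_x1(1)
print(cond)                # {0: 0.4285..., 1: 0.5714...}
print(sum(cond.values()))  # 1.0, as required of a pmf
```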
Example 1.22. Let X1 and X2 be defined as in Example 1.18. The conditional
pmf of X2 given X1 = 5 is

  x2                0     1     2     3     4     5
  pX2|X1=5(x2)      0    1/2    0    1/2    0     0
Exercise 1.14. Let S and G denote the smoking status and gender as defined in
Exercise 1.13. Calculate the probability that a randomly selected student is
1. a smoker given that he is a male;
2. female, given that the student smokes.
Analogously to the conditional distribution for discrete rvs, we define the conditional distribution for continuous rvs.
Definition 1.17. Let X = (X1 , X2 ) denote a continuous bivariate rv with joint
pdf fX (x1 , x2 ) and marginal pdfs fX1 (x1 ) and fX2 (x2 ). For any x1 such that
fX1 (x1 ) > 0, the conditional pdf of X2 given that X1 = x1 is the function of x2
defined by
$$f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)}.$$
Analogously, we define the conditional pdf of X1 given X2 = x2 by
$$f_{X_1|x_2}(x_1) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)}.$$
Here too, it is easy to verify that these functions are pdfs. For example,
$$\int_{\mathcal{X}_2} f_{X_2|x_1}(x_2)\,dx_2 = \int_{\mathcal{X}_2} \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)}\,dx_2 = \frac{\int_{\mathcal{X}_2} f_X(x_1, x_2)\,dx_2}{f_{X_1}(x_1)} = \frac{f_{X_1}(x_1)}{f_{X_1}(x_1)} = 1.$$
Example 1.23. For the random variables defined in Example 1.21 the conditional
pdfs are
$$f_{X_1|x_2}(x_1) = \frac{f_X(x_1, x_2)}{f_{X_2}(x_2)} = \frac{6x_1 x_2^2}{3x_2^2} = 2x_1$$
and
$$f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)} = \frac{6x_1 x_2^2}{2x_1} = 3x_2^2.$$
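As a numerical sanity check of Example 1.23, the sketch below integrates each conditional pdf over (0, 1), assuming (as the marginals 2x1 and 3x2² suggest) that the joint pdf of Example 1.21 is fX(x1, x2) = 6x1x2² on the unit square; both integrals should equal one.

```python
from scipy import integrate

# Joint pdf from Example 1.21 (assumed support: 0 < x1 < 1, 0 < x2 < 1).
f_joint = lambda x1, x2: 6 * x1 * x2**2
f_x1 = lambda x1: 2 * x1          # marginal pdf of X1
f_x2 = lambda x2: 3 * x2**2       # marginal pdf of X2

# Conditional pdfs f_{X2|x1}(x2) and f_{X1|x2}(x1).
f_x2_given_x1 = lambda x2, x1: f_joint(x1, x2) / f_x1(x1)
f_x1_given_x2 = lambda x1, x2: f_joint(x1, x2) / f_x2(x2)

# Each conditional pdf should integrate to 1 for any conditioning value.
print(integrate.quad(f_x2_given_x1, 0, 1, args=(0.4,))[0])  # ~1.0
print(integrate.quad(f_x1_given_x2, 0, 1, args=(0.7,))[0])  # ~1.0
```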
The conditional pdfs allow us to calculate conditional expectations. The conditional expected value of a function g(X2) given that X1 = x1 is defined by
$$E[g(X_2) \mid x_1] =
\begin{cases}
\displaystyle\sum_{x_2 \in \mathcal{X}_2} g(x_2)\, p_{X_2|x_1}(x_2) & \text{for a discrete rv,} \\[2ex]
\displaystyle\int_{\mathcal{X}_2} g(x_2)\, f_{X_2|x_1}(x_2)\,dx_2 & \text{for a continuous rv.}
\end{cases}
\qquad (1.14)$$
Example 1.24. The conditional mean and variance of X2 given a value of X1 , for
the variables defined in Example 1.21 are
$$\mu_{X_2|x_1} = E(X_2 \mid x_1) = \int_0^1 x_2 \cdot 3x_2^2\,dx_2 = \frac{3}{4},$$
and
$$\sigma^2_{X_2|x_1} = \operatorname{var}(X_2 \mid x_1) = E(X_2^2 \mid x_1) - [E(X_2 \mid x_1)]^2 = \int_0^1 x_2^2 \cdot 3x_2^2\,dx_2 - \left(\frac{3}{4}\right)^2 = \frac{3}{80}.$$
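These two moments can also be recovered by numerical integration of the conditional pdf f_{X2|x1}(x2) = 3x2² from Example 1.23; a short check:

```python
from scipy import integrate

f_x2_given_x1 = lambda x2: 3 * x2**2   # conditional pdf from Example 1.23

# Conditional mean E(X2 | x1) and conditional second moment E(X2^2 | x1).
mean, _ = integrate.quad(lambda x2: x2 * f_x2_given_x1(x2), 0, 1)
second, _ = integrate.quad(lambda x2: x2**2 * f_x2_given_x1(x2), 0, 1)

print(mean)              # 0.75   = 3/4
print(second - mean**2)  # 0.0375 = 3/80
```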
Lemma 1.2. For random variables X and Y defined on support X and Y, respectively, and a function g(·) whose expectation exists, the following result holds
E[g(Y )] = E{E[g(Y )|X]}.
Proof. From the definition of conditional expectation we can write
$$E[g(Y) \mid X = x] = \int_{\mathcal{Y}} g(y)\, f_{Y|x}(y)\,dy.$$
This is a function of x whose expectation is
$$\begin{aligned}
E_X\{E_Y[g(Y) \mid X]\} &= \int_{\mathcal{X}} \left[ \int_{\mathcal{Y}} g(y)\, f_{Y|x}(y)\,dy \right] f_X(x)\,dx \\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} g(y)\, \underbrace{f_{Y|x}(y)\, f_X(x)}_{= f_{(X,Y)}(x, y)} \,dy\,dx \\
&= \int_{\mathcal{Y}} g(y)\, \underbrace{\int_{\mathcal{X}} f_{(X,Y)}(x, y)\,dx}_{= f_Y(y)} \,dy \\
&= E[g(Y)].
\end{aligned}$$
Exercise 1.15. Show the following two equalities which result from the above
lemma.
1. E(Y ) = E{E[Y |X]};
2. var(Y ) = E[var(Y |X)] + var(E[Y |X]).
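These identities are easy to illustrate (though of course not prove) by Monte Carlo. The sketch below uses a hypothetical hierarchy, X ~ U(0, 1) and Y | X ~ N(X, 1), chosen only for illustration; the simulated mean and variance of Y should match E[E(Y|X)] = E(X) = 1/2 and E[var(Y|X)] + var(E[Y|X]) = 1 + 1/12.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical hierarchy chosen for illustration: X ~ U(0, 1), Y | X ~ N(X, 1).
x = rng.uniform(0.0, 1.0, size=n)
y = rng.normal(loc=x, scale=1.0)

# 1. E(Y) = E[E(Y|X)] = E(X) = 1/2
print(y.mean())   # ~0.5

# 2. var(Y) = E[var(Y|X)] + var(E[Y|X]) = 1 + 1/12 ~ 1.0833
print(y.var())    # ~1.083
```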
1.10.4 Hierarchical Models and Mixture Distributions
When a random variable depends on another random variable, we may model it in a
hierarchical way. This leads to a mixture distribution. We illustrate this
situation with the following example.
Example 1.25. Denote by N the number of eggs laid by an insect. Suppose that
each egg survives, independently of other eggs, with the same probability p. Let
X denote the number of survivors. Then X|N = n is a binomial rv with n trials
and probability of success p, i.e.,
X|(N = n) ∼ Bin(n, p).
Assume that the number of eggs the insect lays follows a Poisson distribution with
parameter λ, i.e.,
N ∼ Poisson(λ).
That is, we can write the following hierarchical model,
X|N ∼ Bin(N, p)
N ∼ Poisson(λ).
Then, the marginal distribution of the random variable X can be written as
$$\begin{aligned}
P(X = x) &= \sum_{n=0}^{\infty} P(X = x, N = n) \\
&= \sum_{n=0}^{\infty} P(X = x \mid N = n)\, P(N = n) \\
&= \sum_{n=x}^{\infty} \binom{n}{x} p^x (1 - p)^{n - x}\, \frac{e^{-\lambda} \lambda^n}{n!},
\end{aligned}$$
since X|N = n is Binomial, N is Poisson and the conditional probability is zero
if n < x. Multiplying this expression by $\lambda^x / \lambda^x$ and then simplifying gives
$$\begin{aligned}
P(X = x) &= \sum_{n=x}^{\infty} \frac{n!}{(n - x)!\, x!}\, p^x (1 - p)^{n - x}\, \frac{e^{-\lambda} \lambda^n}{n!}\, \frac{\lambda^x}{\lambda^x} \\
&= \frac{(\lambda p)^x e^{-\lambda}}{x!} \sum_{n=x}^{\infty} \frac{\bigl[(1 - p)\lambda\bigr]^{n - x}}{(n - x)!} \\
&= \frac{(\lambda p)^x e^{-\lambda}}{x!}\, e^{(1 - p)\lambda} \\
&= \frac{(\lambda p)^x e^{-\lambda p}}{x!}.
\end{aligned}$$
That is,
X ∼ Poisson(λp).
Then we can easily get the expected number of surviving eggs as E(X) = λp. Note that in the above example
$$\lambda p = p\, E(N) = E(pN) = E\bigl[E(X \mid N)\bigr],$$
since X|N ∼ Bin(N, p).
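A quick simulation confirms the thinning result X ~ Poisson(λp). The values λ = 4 and p = 0.3 below are arbitrary illustrative choices; for a Poisson(λp) variable the mean and variance both equal λp = 1.2.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p, n_sim = 4.0, 0.3, 1_000_000      # arbitrary illustrative values

n_eggs = rng.poisson(lam, size=n_sim)    # N ~ Poisson(lambda)
survivors = rng.binomial(n_eggs, p)      # X | N = n ~ Bin(n, p)

# Mean and variance of a Poisson(lambda * p) rv both equal lambda * p = 1.2.
print(survivors.mean(), survivors.var())  # both ~1.2
```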
Example 1.26. A new drug is tested on n patients in phase III of a clinical trial.
Assume that a response can either be efficacy (success) or no efficacy (failure). It
is expected that each patient’s chance to react positively to the drug is different and
the probability of success follows a Beta(α, β) distribution. Then, the response
for each patient could be modeled as a hierarchical Bernoulli rv, that is,
Xi |Pi ∼ Bernoulli(Pi ), i = 1, 2, . . . , n,
Pi ∼ Beta(α, β).
We are interested in the expected total number of efficacious responses and also
in its variance.
Let us denote by Y the total number of successes. Then
$$Y = \sum_{i=1}^{n} X_i$$
and
$$E\,Y = \sum_{i=1}^{n} E(X_i) = \sum_{i=1}^{n} E\bigl[E(X_i \mid P_i)\bigr] = \sum_{i=1}^{n} E(P_i) = \sum_{i=1}^{n} \frac{\alpha}{\alpha + \beta} = \frac{n\alpha}{\alpha + \beta}.$$
Now, we will calculate the variance of Y . Random variables Xi are independent,
hence
$$\operatorname{var}(Y) = \operatorname{var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{var}(X_i).$$
We have
$$\operatorname{var}(X_i) = \operatorname{var}\bigl[E(X_i \mid P_i)\bigr] + E\bigl[\operatorname{var}(X_i \mid P_i)\bigr], \qquad i = 1, 2, \ldots, n.$$
The first term in the above expression is
$$\operatorname{var}\bigl[E(X_i \mid P_i)\bigr] = \operatorname{var}(P_i) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)},$$
since Pi ∼ Beta(α, β). The second term of the expression for the variance of Xi
is
$$\begin{aligned}
E\bigl[\operatorname{var}(X_i \mid P_i)\bigr] &= E\bigl[P_i(1 - P_i)\bigr] \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 p_i (1 - p_i)\, p_i^{\alpha - 1} (1 - p_i)^{\beta - 1}\,dp_i \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 p_i^{\alpha} (1 - p_i)^{\beta}\,dp_i \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha + 1)\Gamma(\beta + 1)}{\Gamma(\alpha + \beta + 2)} \\
&= \frac{\alpha\beta}{(\alpha + \beta)(\alpha + \beta + 1)}.
\end{aligned}$$
Here we used the fact that
$$\frac{\Gamma(\alpha + \beta + 2)}{\Gamma(\alpha + 1)\Gamma(\beta + 1)}\, p_i^{\alpha} (1 - p_i)^{\beta}$$
is the pdf of a Beta(α + 1, β + 1) random variable, and we also used the property of
the Gamma function that Γ(z + 1) = zΓ(z).
Hence,
$$\operatorname{var}(X_i) = \operatorname{var}\bigl[E(X_i \mid P_i)\bigr] + E\bigl[\operatorname{var}(X_i \mid P_i)\bigr]
= \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} + \frac{\alpha\beta}{(\alpha + \beta)(\alpha + \beta + 1)}
= \frac{\alpha\beta}{(\alpha + \beta)^2}.$$
Finally, as the variance of Xi does not depend on i, we get
$$\operatorname{var}(Y) = \frac{n\alpha\beta}{(\alpha + \beta)^2}.$$
Exercise 1.16. The number of spam messages Y sent to a server in a day has a
Poisson distribution with parameter λ = 21. Each spam message independently
has probability p = 1/3 of not being detected by the spam filter. Let X denote the
number of spam messages getting through the filter. Calculate the expected daily
number of spam messages which get into the server. Also, calculate the variance
of X.
1.10.5 Independence of Random Variables
Definition 1.18. Let X = (X1 , X2 ) denote a continuous bivariate rv with joint
pdf fX (x1 , x2 ) and marginal pdfs fX1 (x1 ) and fX2 (x2 ). Then X1 and X2 are
called independent random variables if, for every x1 ∈ X1 and x2 ∈ X2,
$$f_X(x_1, x_2) = f_{X_1}(x_1)\, f_{X_2}(x_2). \qquad (1.15)$$
We define independent discrete random variables analogously.
If X1 and X2 are independent, then the conditional pdf of X2 given X1 = x1 is
$$f_{X_2|x_1}(x_2) = \frac{f_X(x_1, x_2)}{f_{X_1}(x_1)} = \frac{f_{X_1}(x_1)\, f_{X_2}(x_2)}{f_{X_1}(x_1)} = f_{X_2}(x_2),$$
regardless of the value of x1. An analogous property holds for the conditional pdf of
X1 given X2 = x2.
Example 1.27. It is easy to notice that for the variables defined in Example 1.21
we have
$$f_X(x_1, x_2) = 6x_1 x_2^2 = 2x_1 \cdot 3x_2^2 = f_{X_1}(x_1)\, f_{X_2}(x_2).$$
So, the variables X1 and X2 are independent.
In fact, two rvs are independent if and only if there exist functions g(x1) and h(x2)
such that for every x1 ∈ X1 and x2 ∈ X2,
$$f_X(x_1, x_2) = g(x_1)\, h(x_2),$$
and the support of one variable does not depend on the value of the other variable.
Theorem 1.13. Let X1 and X2 be independent random variables. Then
1. For any A ⊂ R and B ⊂ R,
$$P(X_1 \in A, X_2 \in B) = P(X_1 \in A)\, P(X_2 \in B),$$
that is, {X1 ∈ A} and {X2 ∈ B} are independent events.
2. For g(X1), a function of X1 only, and for h(X2), a function of X2 only, we
have
$$E[g(X_1) h(X_2)] = E[g(X_1)]\, E[h(X_2)].$$
Proof. Assume that X1 and X2 are continuous random variables. To prove the
theorem for discrete rvs we follow the same steps with sums instead of integrals.
1. We have
$$\begin{aligned}
P(X_1 \in A, X_2 \in B) &= \int_B \int_A f_X(x_1, x_2)\,dx_1\,dx_2 \\
&= \int_B \int_A f_{X_1}(x_1)\, f_{X_2}(x_2)\,dx_1\,dx_2 \\
&= \int_B \left[ \int_A f_{X_1}(x_1)\,dx_1 \right] f_{X_2}(x_2)\,dx_2 \\
&= \int_A f_{X_1}(x_1)\,dx_1 \int_B f_{X_2}(x_2)\,dx_2 \\
&= P(X_1 \in A)\, P(X_2 \in B).
\end{aligned}$$
2. Similar arguments as in Part 1 give
$$\begin{aligned}
E[g(X_1) h(X_2)] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x_1) h(x_2)\, f_X(x_1, x_2)\,dx_1\,dx_2 \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x_1) h(x_2)\, f_{X_1}(x_1)\, f_{X_2}(x_2)\,dx_1\,dx_2 \\
&= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} g(x_1)\, f_{X_1}(x_1)\,dx_1 \right] h(x_2)\, f_{X_2}(x_2)\,dx_2 \\
&= \int_{-\infty}^{\infty} g(x_1)\, f_{X_1}(x_1)\,dx_1 \int_{-\infty}^{\infty} h(x_2)\, f_{X_2}(x_2)\,dx_2 \\
&= E[g(X_1)]\, E[h(X_2)].
\end{aligned}$$
Below we will apply this result for the moment generating function of a sum of
independent random variables.
Theorem 1.14. Let X1 and X2 be independent random variables with moment
generating functions MX1 (t) and MX2 (t), respectively. Then the moment generating function of the sum Y = X1 + X2 is given by
MY (t) = MX1 (t)MX2 (t).
Proof. By the definition of mgf and by Theorem 1.13, part 2, we have
$$M_Y(t) = E\bigl[e^{tY}\bigr] = E\bigl[e^{t(X_1 + X_2)}\bigr] = E\bigl[e^{tX_1} e^{tX_2}\bigr] = E\bigl[e^{tX_1}\bigr] E\bigl[e^{tX_2}\bigr] = M_{X_1}(t)\, M_{X_2}(t).$$
Note that this result can be easily extended to a sum of any number of mutually
independent random variables.
Example 1.28. Let X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) be independent. What
is the distribution of Y = X1 + X2 ?
Using Theorem 1.14 we can write
$$\begin{aligned}
M_Y(t) &= M_{X_1}(t)\, M_{X_2}(t) \\
&= \exp\{\mu_1 t + \sigma_1^2 t^2/2\} \exp\{\mu_2 t + \sigma_2^2 t^2/2\} \\
&= \exp\{(\mu_1 + \mu_2) t + (\sigma_1^2 + \sigma_2^2) t^2/2\}.
\end{aligned}$$
This is the mgf of a normal rv with E(Y) = µ1 + µ2 and var(Y) = σ1² + σ2².
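The same conclusion can be checked by simulation. The sketch below uses the illustrative values µ1 = 1, σ1² = 4, µ2 = 2, σ2² = 9, and compares the sample mean and variance of Y = X1 + X2 with µ1 + µ2 = 3 and σ1² + σ2² = 13.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Illustrative parameters: X1 ~ N(1, 2^2), X2 ~ N(2, 3^2), independent.
x1 = rng.normal(1.0, 2.0, size=n)
x2 = rng.normal(2.0, 3.0, size=n)
y = x1 + x2

print(y.mean())  # ~3.0  (= mu1 + mu2)
print(y.var())   # ~13.0 (= sigma1^2 + sigma2^2)
```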
Exercise 1.17. A part of an electronic system has two types of components in joint
operation. Denote by X1 and X2 the random length of life (measured in hundreds
of hours) of component of type I and of type II, respectively. Assume that the joint
density function of two rvs is given by
$$f_X(x_1, x_2) = \frac{1}{8}\, x_1 \exp\!\left\{-\frac{x_1 + x_2}{2}\right\} I_{\mathcal{X}}(x_1, x_2),$$
where X = {(x1, x2) : x1 > 0, x2 > 0}.
1. Calculate the probability that both components will have a life length longer
than 100 hours, that is, find P (X1 > 1, X2 > 1).
2. Calculate the probability that a component of type II will have a life length
longer than 200 hours, that is, find P (X2 > 2).
3. Are X1 and X2 independent? Justify your answer.
4. Calculate the expected value of the so-called relative efficiency of the two components, which is expressed by
$$E\!\left(\frac{X_2}{X_1}\right).$$
1.10.6 Covariance and Correlation
Covariance and correlation are two measures of the strength of a relationship between two rvs.
We will use the following notation.
$$E(X_1) = \mu_{X_1}, \qquad E(X_2) = \mu_{X_2}, \qquad \operatorname{var}(X_1) = \sigma^2_{X_1}, \qquad \operatorname{var}(X_2) = \sigma^2_{X_2}.$$
Also, we assume that $\sigma^2_{X_1}$ and $\sigma^2_{X_2}$ are finite positive values. A simplified notation
$\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ will be used when it is clear which rvs we refer to.
Definition 1.19. The covariance of X1 and X2 is defined by
cov(X1 , X2 ) = E[(X1 − µX1 )(X2 − µX2 )].
(1.16)
Some useful properties of the covariance and correlation are given in the following
two theorems.
Theorem 1.15. Let X1 and X2 denote random variables and let a, b, c, . . . denote
some constants. Then, the following properties hold.
1. cov(X1 , X2 ) = E(X1 X2 ) − µX1 µX2 .
2. If random variables X1 and X2 are independent then
cov(X1 , X2 ) = 0.
3. var(aX1 + bX2 ) = a2 var(X1 ) + b2 var(X2 ) + 2ab cov(X1 , X2 ).
4. For U = aX1 + bX2 + e and for V = cX1 + dX2 + f we can write
cov(U, V ) = ac var(X1 ) + bd var(X2 ) + (ad + bc) cov(X1 , X2 ).
Proof. We will show property 4.
$$\begin{aligned}
\operatorname{cov}(U, V) &= E\Bigl\{\bigl[(aX_1 + bX_2 + e) - E(aX_1 + bX_2 + e)\bigr]\bigl[(cX_1 + dX_2 + f) - E(cX_1 + dX_2 + f)\bigr]\Bigr\} \\
&= E\Bigl\{\bigl[a(X_1 - E X_1) + b(X_2 - E X_2)\bigr]\bigl[c(X_1 - E X_1) + d(X_2 - E X_2)\bigr]\Bigr\} \\
&= E\Bigl\{ac(X_1 - E X_1)^2 + bd(X_2 - E X_2)^2 + (ad + bc)(X_1 - E X_1)(X_2 - E X_2)\Bigr\} \\
&= ac \operatorname{var} X_1 + bd \operatorname{var} X_2 + (ad + bc) \operatorname{cov}(X_1, X_2).
\end{aligned}$$
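Property 4 can also be verified numerically. The sketch below draws a correlated bivariate normal sample (the covariance matrix and the constants a, b, c, d, e, f are illustrative choices) and compares the sample covariance of U and V with the right-hand side of the identity.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Illustrative correlated pair: var(X1) = 1, var(X2) = 2, cov(X1, X2) = 0.5.
cov_matrix = [[1.0, 0.5], [0.5, 2.0]]
x1, x2 = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=n).T

a, b, e = 2.0, -1.0, 3.0        # U = a*X1 + b*X2 + e
c, d, f = 0.5, 4.0, -2.0        # V = c*X1 + d*X2 + f
u = a * x1 + b * x2 + e
v = c * x1 + d * x2 + f

sample_cov = np.cov(u, v)[0, 1]
formula = a * c * 1.0 + b * d * 2.0 + (a * d + b * c) * 0.5
print(sample_cov, formula)      # both ~ -3.25
```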
Property 2 says that if two variables are independent, then their covariance is
zero. The converse does not hold: zero covariance does not imply that the
variables are independent. The following small example shows this fact.
Example 1.29. Let X ∼ U(−1, 1) and let Y = X². Then
$$E(X) = 0, \qquad E(Y) = E(X^2) = \int_{-1}^{1} x^2\, \frac{1}{2}\,dx = \frac{1}{3}, \qquad E(XY) = E(X^3) = 0.$$
Hence,
cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0,
but Y is a function of X, so these two variables are not independent.
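A simulation makes the point vivid: the sample covariance of X and Y = X² is essentially zero, yet Y depends strongly on X (exposed here through the correlation of |X| with Y).

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, size=1_000_000)   # X ~ U(-1, 1)
y = x**2                                     # Y = X^2, fully determined by X

print(np.cov(x, y)[0, 1])               # ~0: covariance vanishes ...
print(np.corrcoef(np.abs(x), y)[0, 1])  # ~0.97: ... yet Y is far from independent of X
```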
Another measure of the strength of the relationship between two rvs is correlation.
It is defined as follows.
Definition 1.20. The correlation of X1 and X2 is defined by
$$\rho_{(X_1, X_2)} = \frac{\operatorname{cov}(X_1, X_2)}{\sigma_{X_1} \sigma_{X_2}}. \qquad (1.17)$$
Correlation is a dimensionless measure and it expresses the strength of linearly
related variables as shown in the following theorem.
Theorem 1.16. For any random variables X1 and X2
1. −1 ≤ ρ(X1 ,X2 ) ≤ 1,
2. |ρ(X1,X2)| = 1 iff there exist numbers a ≠ 0 and b such that
P (X2 = aX1 + b) = 1.
If ρ(X1 ,X2 ) = 1 then a > 0, and if ρ(X1 ,X2 ) = −1 then a < 0.
Exercise 1.18. Prove Theorem 1.16. Hint: Consider roots (and so the discriminant) of the quadratic function of t:
$$g(t) = E\bigl[(X - \mu_X)t + (Y - \mu_Y)\bigr]^2.$$
Example 1.30. Let the joint pdf of X, Y be
fX,Y (x, y) = 1
on the support {(x, y) : 0 < x < 1, x < y < x + 1}.
The two rvs are not independent as the range of Y depends on x. We will calculate
the correlation between X and Y . For this we need to obtain the marginal pdfs,
fX (x) and fY (y). The marginal pdf for X is
$$f_X(x) = \int_x^{x+1} 1\,dy = y \Big|_x^{x+1} = 1, \qquad \text{on support } \{x : 0 < x < 1\}.$$
Note that to obtain the marginal pdf for the rv Y the range of X has to be considered separately for y ∈ (0, 1) and for y ∈ [1, 2). When y ∈ (0, 1) then x ∈ (0, y).
When y ∈ [1, 2), then x ∈ (y − 1, 1). Hence,
$$f_Y(y) =
\begin{cases}
\displaystyle\int_0^y 1\,dx = x \Big|_0^y = y, & \text{for } y \in (0, 1); \\[2ex]
\displaystyle\int_{y-1}^{1} 1\,dx = x \Big|_{y-1}^{1} = 2 - y, & \text{for } y \in [1, 2); \\[2ex]
0, & \text{otherwise.}
\end{cases}$$
These give
$$\mu_X = \frac{1}{2}, \qquad \sigma_X^2 = \frac{(b - a)^2}{12} = \frac{1}{12};$$
$$\mu_Y = 1, \qquad \sigma_Y^2 = E(Y^2) - [E(Y)]^2 = \frac{7}{6} - 1 = \frac{1}{6},$$
where
$$E(Y^2) = \int_0^1 y^2 \cdot y\,dy + \int_1^2 y^2 (2 - y)\,dy = \frac{7}{6}.$$
Also,
$$E(XY) = \int_0^1 \int_x^{x+1} xy \cdot 1\,dy\,dx = \frac{7}{12}.$$
Hence,
$$\operatorname{cov}(X, Y) = E(XY) - E(X)\, E(Y) = \frac{7}{12} - \frac{1}{2} \times 1 = \frac{1}{12}.$$
Finally,
$$\rho_{(X,Y)} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\tfrac{1}{12}}{\sqrt{\tfrac{1}{12} \cdot \tfrac{1}{6}}} = \frac{1}{\sqrt{2}}.$$
The linear relationship between X and Y is not very strong.
Note: We can make an interesting comparison of this value of the correlation with
the correlation of X and Y having a joint uniform distribution on {(x, y) : 0 <
x < 1, x < y < x + 0.1}, which is a 'narrower strip' of values than previously.
Then fX,Y(x, y) = 10 and it can be shown that ρ(X, Y) = 10/√101, which is
close to 1. The linear relationship between X and Y is very strong in this case.
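Both correlations are easy to approximate by Monte Carlo. Conditionally on X = x, Y is uniform on (x, x + 1) in the first case and on (x, x + 0.1) in the second, so we can simulate Y as X + cU with U ~ U(0, 1) independent of X and c equal to 1 or 0.1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)
u = rng.uniform(0.0, 1.0, size=n)

# Unit-width strip: Y | X = x uniform on (x, x + 1).
print(np.corrcoef(x, x + u)[0, 1])        # ~0.707 = 1/sqrt(2)

# Narrow strip: Y | X = x uniform on (x, x + 0.1).
print(np.corrcoef(x, x + 0.1 * u)[0, 1])  # ~0.995 = 10/sqrt(101)
```

In this representation the correlation works out to 1/√(1 + c²), so narrowing the strip (smaller c) pushes the correlation towards 1, in line with the comparison above.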