MSc Maths and Statistics 2008
Department of Economics UCL
Chapter 6: Random Variables and Distributions
Jidong Zhou
Chapter 6: Random Variables and Distributions
We will now study the main tools used to characterize experiments with uncertainty:
random variables and their distributions.
1 Single Random Variables and Distributions
1.1 Basic definitions
• Random variables: a random variable is a function that maps the sample space Ω of an
experiment into R. In other words, a random variable X is a function that assigns a
real number X(ω) to each possible experimental outcome ω ∈ Ω.
— for example, for an experiment in which a coin is tossed 10 times, the sample space
consists of 2^10 sequences of 10 heads and tails. The number of heads obtained on
the 10 tosses can be regarded as a random variable; let us denote it by X.
Clearly, X maps each possible sequence into the set {0, 1, · · · , 10}.
• Distributions: suppose A is a subset of R and we wish to measure the probability that
X ∈ A. This is given by:
Pr(X ∈ A) = Pr(ω ∈ Ω : X(ω) ∈ A).
Note that {ω ∈ Ω : X(ω) ∈ A} is an event and so the right-hand side is well defined.
The distribution of a random variable X is the collection of all probabilities Pr(X ∈ A)
for all subsets A of the real numbers.
— consider the above example again. If A = {1, · · · , 10}, then Pr(X ∈ A) is just the
probability that the experiment outcome is a sequence with at least one head, and
so
Pr(X ∈ A) = 1 − (1/2)^10.
• Distribution functions: the distribution function (or df ) F of a random variable X is a
function defined for each real number x as follows:
F (x) = Pr(X ≤ x) = Pr(ω ∈ Ω : X(ω) ≤ x).
It just measures the probability of the event consisting of those outcomes satisfying
X(ω) ≤ x. Sometimes we also call it the cumulative distribution function (or cdf ).
— it is easy to show that F must satisfy the following properties:
∗ if x1 < x2 , then F (x1 ) ≤ F (x2 );
∗ limx→−∞ F (x) = 0 and limx→∞ F (x) = 1;
∗ F is continuous from the right, i.e., F (x) = F (x+ ).
— if the df of a random variable X is known, then we can derive the probability of X
belonging to any interval:
∗ Pr(X > x) = 1 − F (x);
∗ Pr(x1 < X ≤ x2 ) = F (x2 ) − F (x1 );
∗ Pr(X < x) = F (x− );
∗ Pr(X = x) = F (x) − F (x−).
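As a quick numerical illustration (not part of the original notes), these identities can be checked with any distribution whose df is available; the sketch below assumes Python with scipy installed and uses a binomial random variable purely as an example.

# Sketch: checking the df identities with a binomial X (illustrative choice).
from scipy.stats import binom

n, p = 10, 0.5
def F(x):                                   # F(x) = Pr(X <= x)
    return binom.cdf(x, n, p)

x1, x2 = 2, 6
print(1 - F(x2))                            # Pr(X > x2) = 1 - F(x2)
print(F(x2) - F(x1))                        # Pr(x1 < X <= x2) = F(x2) - F(x1)
# For this integer-valued X, F(x-) = F(x - 1), so Pr(X = x) = F(x) - F(x - 1):
print(F(4) - F(3), binom.pmf(4, n, p))      # the two values agree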
Now we discuss two classes of random variables:
1.2 Discrete random variables
A random variable X is discrete if there are at most countably many possible values for X.
For example, the above random variable counting the number of heads is a discrete one.
• For a discrete random variable having values among {xi }i∈N , its distribution function
can be calculated as
F (x) = ∑_{i: xi ≤ x} Pr(X = xi ).
Clearly, this df must be a step function and discontinuous.
• A discrete random variable can also be characterized by its probability function (or pf )
defined as
f (x) = Pr(X = x)
for x ∈ R. If x is not one of the possible values of X, clearly f (x) = 0. This pf f (x)
just measures the likelihood of each particular outcome x.
— the relationship between the df and pf for a discrete random variable is
f (x) = F (x) − F (x−)   or   F (x) = ∑_{i: xi ≤ x} f (xi ).
• An important discrete random variable: the binomial distribution with parameters n
and p is represented by the pf
f (x) = C_n^x p^x (1 − p)^{n−x} for x = 0, 1, · · · , n, and f (x) = 0 otherwise.
For example, consider n products produced by a firm. Suppose the probability of each
product being defective is p and these n products are independently produced. Then
f (x) just measures the probability that x of them are defective. From this pf, it is easy
to construct the df.
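A small sketch of the defective-products example (not from the notes; the values n = 20 and p = 0.05 are illustrative), computing the binomial pf directly from the formula and accumulating it into the df:

# Sketch: binomial pf and df for the defective-products example.
from math import comb
import numpy as np

n, p = 20, 0.05                              # 20 products, each defective with prob. 0.05
f = np.array([comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)])
F = np.cumsum(f)                             # df evaluated at x = 0, 1, ..., n

print(f[2])                                  # Pr(exactly 2 defective products)
print(F[2])                                  # Pr(at most 2 defective products)
print(f.sum())                               # the pf sums to 1 (up to rounding)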
1.3 Continuous random variables
A random variable X is continuous if it can take any value on some (bounded or unbounded)
interval. For example, a person's weight next year, tomorrow's temperature, and next month's
house price can all be regarded as continuous random variables.
• Probability density functions: given the distribution function F of a continuous random
variable, we define its probability density function (or pdf ) as a nonnegative function f
satisfying
F (x) = ∫_{−∞}^{x} f (t) dt
for all x ∈ R.
— if F is differentiable, then
f (x) = F ′(x).
— given the probability density function f , we can calculate
Pr(a < X ≤ b) = ∫_a^b f (x) dx.
— for a continuous random variable with a continuous distribution function, Pr(X = x) = 0 for any x ∈ R.
• An example: the uniform distribution on [a, b] has pdf
f (x) = 1/(b − a) if x ∈ [a, b], and f (x) = 0 otherwise,
and df
F (x) = 0 if x < a;  F (x) = (x − a)/(b − a) if x ∈ [a, b];  F (x) = 1 if x > b.
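The pdf–df relationship for the uniform distribution can be checked numerically; the following sketch (my addition, with illustrative endpoints a = 2 and b = 5) approximates F(x) by a Riemann sum of the pdf:

# Sketch: approximating F(x) as the integral of f up to x for the uniform distribution.
import numpy as np

a, b = 2.0, 5.0                              # illustrative endpoints
def f(x):                                    # uniform pdf on [a, b]
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

x = 3.7
grid = np.linspace(a - 1.0, x, 200_001)      # grid from below the support up to x
dx = grid[1] - grid[0]
print(np.sum(f(grid)) * dx)                  # Riemann-sum approximation of F(x)
print((x - a) / (b - a))                     # closed-form F(x); the values agree closely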
1.4 Functions of a random variable
Given the distribution of X, we want to know the distribution of Y = h(X) where h(·) is a
function.
• X is a discrete random variable: if g(y) is the probability function of Y , then
g(y) = Pr(Y = y) = Pr[h(X) = y] = ∑_{x: h(x)=y} f (x).
• X is a continuous random variable: if G(y) is the distribution function of Y , then
G(y) = Pr(Y ≤ y) = Pr(h(X) ≤ y) = ∫_{x: h(x) ≤ y} f (x) dx.
If G(y) is a differentiable function, then the pdf of Y is
g(y) = G′(y).
— example: X has the uniform distribution on [−1, 1] and Y = X². Then for 0 ≤ y ≤ 1,
G(y) = Pr(X² ≤ y) = ∫_{−√y}^{√y} f (x) dx = √y.
For y > 1, G(y) = 1; and for y < 0, G(y) = 0. The pdf of Y on (0, 1] is
g(y) = 1/(2√y).
— if h is a strictly monotonic function, then the pdf of Y can be directly calculated as
g(y) = f [h^{−1}(y)] |dh^{−1}(y)/dy| = f [h^{−1}(y)] / |h′[h^{−1}(y)]|,
where h^{−1} is the inverse function of h. The second equality follows from the
derivative rule for the inverse function. We prove this result when h is strictly
increasing (the proof is similar if h is strictly decreasing):
G(y) = Pr[X ≤ h^{−1}(y)] = ∫_{−∞}^{h^{−1}(y)} f (x) dx.
So
g(y) = G′(y) = f [h^{−1}(y)] dh^{−1}(y)/dy.
The second equality uses Leibniz's rule, which we introduced in Chapter 2.
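A Monte Carlo sketch (not part of the notes) of the Y = X² example: simulate X uniformly on [−1, 1] and compare the empirical df of Y with G(y) = √y derived above.

# Sketch: Monte Carlo check of the Y = X^2 example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)   # X ~ Uniform[-1, 1]
y = x**2                                     # Y = X^2

for q in (0.1, 0.25, 0.5, 0.9):
    print(q, np.mean(y <= q), np.sqrt(q))    # empirical G(q) versus sqrt(q)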
Exercise 1 (i) Suppose that the pdf of a random variable X is
f (x) = x/2 if 0 < x < 2, and f (x) = 0 otherwise.
Determine the df and pdf of the new random variable Y = X(2 − X).
(ii) Suppose the pdf of a random variable X is
f (x) = e^{−x} if x > 0, and f (x) = 0 otherwise.
Determine the pdf of Y = √X.
(iii) Suppose X has a continuous distribution function F , and let Y = F (X). Show
that Y has a uniform distribution on [0, 1]. (This transformation from X to Y is called the
probability integral transformation.)
1.5 Moments
The distribution of a random variable contains all of the probabilistic information about it.
However, it is usually cumbersome to present the entire distribution. Instead, a few summaries
of the distribution can give a rough idea of what the distribution looks like. The most
commonly used summaries are the moments of the random variable.
• Expectation
— for a discrete random variable X with pf f having positive values on {xi }, its
expectation is
E(X) = ∑_i xi f (xi ).
When X has infinitely many values, this series may not converge.¹ We say E(X)
exists if and only if
∑_i |xi | f (xi ) < ∞.
This condition guarantees that ∑_i xi f (xi ) converges.

¹ For example, if f (n) = 1/(kn²) for n = 1, 2, · · · , where k = ∑_{n=1}^{∞} 1/n² (which converges, as we have confirmed in Chapter 1), then ∑_{n=1}^{∞} nf (n) does not converge.
— for a continuous random variable X with pdf f , its expectation is
E(X) = ∫_{−∞}^{∞} x f (x) dx.
Similarly, this integral may not be well defined.² We say E(X) exists if and
only if
∫_{−∞}^{∞} |x| f (x) dx < ∞.
— the expectation of X is also called the expected value of X or the mean of X. It can
be regarded as the center of gravity of the distribution of X, but not necessarily
the central position of the distribution.
— examples:
(i) the expectation of the uniform distribution on [a, b] is
∫_a^b x/(b − a) dx = (a + b)/2.
(ii) the expectation of the binomial distribution is
∑_{x=0}^{n} x C_n^x p^x (1 − p)^{n−x} = np.
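Both expectations can be confirmed numerically; the sketch below (my addition, with illustrative parameter values) approximates the uniform integral by a Riemann sum and evaluates the binomial sum directly.

# Sketch: numerical check of the two expectations.
import numpy as np
from math import comb

a, b = 2.0, 5.0
grid = np.linspace(a, b, 200_001)
dx = grid[1] - grid[0]
print(np.sum(grid / (b - a)) * dx, (a + b) / 2)            # uniform mean vs. (a + b)/2

n, p = 10, 0.3
mean = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))
print(mean, n * p)                                          # binomial mean vs. np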
— some properties of the expectation (we assume all expectations exist):
∗ for scalars a and b,
E(a + bX) = a + bE(X).
∗ for two random variables X1 and X2 , we have
E(X1 + X2 ) = E(X1 ) + E(X2 ).
∗ if h(X) = h1 (X) + h2 (X) is a function of X, then E(h(X)) = E(h1 (X)) +
E(h2 (X)).
∗ E[h(X)] = ∫_{−∞}^{∞} h(x) f (x) dx,
but E[h(X)] is in general not equal to h [E(X)] except when h is a linear
function.
² For example, for the Cauchy distribution, which has pdf f (x) = 1/[π(1 + x²)], one can verify that ∫_{−∞}^{∞} x f (x) dx does not exist.
• Variance
— the variance of a distribution is given by:
V ar(X) = E[(X − E(X))²] = E(X²) − [E(X)]².
It is also often denoted by σ². (The variance of a distribution may also fail to exist.)
The square root of the variance is called the standard deviation and is often denoted
by σ = √V ar(X).
— it measures the spread or dispersion of the distribution around its mean.
— examples:
(i) the variance of the uniform distribution on [a, b] is
∫_a^b x²/(b − a) dx − [(a + b)/2]² = (a² + ab + b²)/3 − [(a + b)/2]² = (b − a)²/12.
(ii) the variance of the binomial distribution is np(1 − p).
— some properties of the variance:
∗ V ar(c) = 0 where c is a constant;
∗ V ar(aX + b) = a2 V ar(X) where a and b are scalars.
• Two useful inequalities:
— Markov Inequality: suppose X is a random variable with Pr(X ≥ 0) = 1. Then
for any real number t > 0,
Pr(X ≥ t) ≤ E(X)/t.
∗ this result can help bound the probabilities of a random variable when only its mean is known.
— Chebyshev Inequality: suppose X is a random variable and V ar(X) exists.
Then for any real number t > 0,
Pr(|X − E(X)| ≥ t) ≤ V ar(X)/t².
∗ this follows from the Markov inequality by noting that |X − E(X)| ≥ t ⇐⇒ [X − E(X)]² ≥ t².
∗ this result says that realizations of a random variable far from its mean become increasingly unlikely.
∗ see Section 4.8 of DeGroot and Schervish (2002) for more applications of these two results.
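The two inequalities are easy to inspect by simulation. The sketch below (not from the notes) uses an exponential distribution with mean 1 purely as an illustrative nonnegative random variable.

# Sketch: Markov and Chebyshev bounds versus simulated probabilities.
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # nonnegative, E(X) = 1, Var(X) = 1
mean, var = x.mean(), x.var()

t = 3.0
print(np.mean(x >= t), mean / t)                           # Pr(X >= t) vs. Markov bound
print(np.mean(np.abs(x - mean) >= t), var / t**2)          # vs. Chebyshev bound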
• Higher order moments
— the kth moment of X is E(X^k )
∗ the mean of X is just the first-order moment
∗ again, the kth moment may fail to exist for some distributions. We say the kth moment exists if and only if E(|X^k |) < ∞
∗ if E(|X^k |) < ∞ for some positive integer k, then E(|X^j |) < ∞ for any positive integer j < k.
— the kth central moment of X is E[(X − E(X))^k ]
∗ the variance of X is just the second-order central moment
• Moment generating functions
The moment generating function (or mgf ) of a random variable X is
ψ(t) = E(e^{tX}).
If ψ(t) exists for all values of t in an open interval around t = 0, then we have
ψ^{(n)}(0) = E(X^n ).
That is, the nth-order derivative of the mgf of X evaluated at t = 0 is just the nth
moment of X. Thus, the mean is ψ′(0) and the variance is ψ″(0) − [ψ′(0)]². In many
cases, using the mgf to compute moments is more convenient than using the definition
directly.
— example: The pdf of X is
f (x) = e^{−x} if x > 0, and f (x) = 0 otherwise.
Compute the mean and variance of X.
ψ(t) = ∫_0^∞ e^{tx} e^{−x} dx = ∫_0^∞ e^{(t−1)x} dx = 1/(1 − t)
for t < 1. So ψ(t) exists for t in an open interval around t = 0. Since
ψ′(t) = 1/(1 − t)² and ψ″(t) = 2/(1 − t)³,
it is easy to show that E(X) = 1 and V ar(X) = 1.
— an important result: if the mgfs of two random variables X and Y are identical for all values of t in an open interval around t = 0, then the probability distributions of X and Y must be identical.
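As a sketch (not in the notes), the moments in the example above can be recovered from ψ(t) = 1/(1 − t) by numerical differentiation at t = 0:

# Sketch: moments from the mgf psi(t) = 1/(1 - t) via central finite differences.
def psi(t):
    return 1.0 / (1.0 - t)

h = 1e-4
psi1 = (psi(h) - psi(-h)) / (2 * h)                  # approximates psi'(0)
psi2 = (psi(h) - 2 * psi(0.0) + psi(-h)) / h**2      # approximates psi''(0)

print(psi1, psi2 - psi1**2)                          # close to E(X) = 1 and Var(X) = 1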
• Quantile and median
The p-quantile of a distribution is the value x that divides the distribution into two parts,
one with probability p and the other with probability 1 − p. More precisely, if a random
variable’s distribution function is F , then its p-quantile is the smallest x such that
F (x) ≥ p. In particular, the 0.5-quantile is called the median. That is, the median of a
distribution divides it into two parts, each with equal probability.
— examples:
(a) if Pr(X = 1) = 0.1, Pr(X = 2) = 0.4, Pr(X = 3) = 0.3, and Pr(X = 4) = 0.2,
then the median is 2, and the 0.8-quantile is 3.
(b) if a continuous random variable has the pdf
f (x) = 1/2 for 0 ≤ x ≤ 1, f (x) = 1 for 2.5 ≤ x ≤ 3, and f (x) = 0 otherwise,
then the median is 1, and the 0.4-quantile is 0.8.
— in some cases, the median reflects the “average” value of a random variable X
better than the mean. For example, if Pr(X = 10) = 0.99 and Pr(X = 10000) =
0.01, then the mean of X is 109.9, which is much higher than 10, but its median is
10, which is closer to the realized value of X most of the time.
— the median minimizes the mean absolute error E(|X − d|) over d, while the mean minimizes the mean square error E[(X − d)²].
— given the pdf f (x), the value of x at which f (x) is maximized is called the
mode of the distribution
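A short sketch (my addition) applying the definition “the p-quantile is the smallest x with F(x) ≥ p” to example (a) above:

# Sketch: p-quantiles of example (a) from the definition.
import numpy as np

values = np.array([1, 2, 3, 4])
probs = np.array([0.1, 0.4, 0.3, 0.2])
F = np.cumsum(probs)                         # df at the support points

def quantile(p):
    return values[np.argmax(F >= p)]         # smallest x with F(x) >= p

print(quantile(0.5))                         # median: 2
print(quantile(0.8))                         # 0.8-quantile: 3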
Exercise 2 (i) Let X be a random variable that can take only the values 0, 1, 2, · · · . Show
E(X) = ∑_{n=0}^{∞} n Pr(X = n) = ∑_{n=1}^{∞} Pr(X ≥ n).
(ii) Prove that the variance of the binomial distribution is np(1 − p).
(iii) Let X have a discrete uniform distribution on the integers 1, · · · , n. Compute the variance of X. (You may wish to use the formula ∑_{k=1}^{n} k² = n(n + 1)(2n + 1)/6.)
2 Bivariate Distributions
In many cases, we need more than one random variable to describe an experiment. This part
considers the bivariate case. Let (X, Y ) be a pair of random variables. We first study their
joint distribution.
2.1 Joint distributions
• The discrete case: if both X and Y are discrete random variables, the joint probability
function is
f (x, y) = Pr(X = x, Y = y).
If (x, y) is not a possible value of (X, Y ), then f (x, y) = 0. The pf is always non-negative and
satisfies
∑_{x,y} f (x, y) = 1.
The joint distribution function is now
F (x, y) = ∑_{xi ≤ x, yj ≤ y} f (xi , yj ).
• The continuous case: if X and Y are continuous random variables, the joint distribution
function is
F (x, y) = Pr(X ≤ x, Y ≤ y)
for any (x, y) ∈ R². It is nondecreasing in each argument and satisfies
lim_{x→−∞, y→−∞} F (x, y) = 0   and   lim_{x→∞, y→∞} F (x, y) = 1.
The joint probability density function is a nonnegative function f defined on R² such that
F (a, b) = ∫_{−∞}^{b} ∫_{−∞}^{a} f (x, y) dx dy
for any (a, b) ∈ R².
If F (x, y) is twice differentiable, then the pdf is
f (x, y) = ∂²F (x, y)/(∂x∂y).
And we can calculate
Pr(a < X ≤ b, c < Y ≤ d) = ∫_c^d ∫_a^b f (x, y) dx dy.
• Example: Suppose the joint pdf of X and Y is
f (x, y) = cx²y for x² ≤ y ≤ 1, and f (x, y) = 0 otherwise.
Determine the value of c and then calculate Pr(X ≥ Y ).
First of all, f must satisfy
∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dx dy = 1,
which implies
c ∫_{−1}^{1} ∫_{x²}^{1} x²y dy dx = (c/2) ∫_{−1}^{1} x²(1 − x⁴) dx = 1.
It is easy to solve for c = 21/4. The probability is
Pr(X ≥ Y ) = (21/4) ∫_{0}^{1} ∫_{x²}^{x} x²y dy dx = 3/20.
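The two numbers in this example can be checked by numerical integration; the sketch below (not from the notes) uses scipy.integrate.dblquad.

# Sketch: checking c = 21/4 and Pr(X >= Y) = 3/20 by numerical double integration.
from scipy.integrate import dblquad

c = 21.0 / 4.0
pdf = lambda y, x: c * x**2 * y              # dblquad integrates over y (inner), x (outer)

# Total probability mass: x in [-1, 1], y in [x^2, 1].
total, _ = dblquad(pdf, -1, 1, lambda x: x**2, lambda x: 1)
print(total)                                 # close to 1, confirming c = 21/4

# Pr(X >= Y): the region x^2 <= y <= x, which requires 0 <= x <= 1.
prob, _ = dblquad(pdf, 0, 1, lambda x: x**2, lambda x: x)
print(prob)                                  # close to 3/20 = 0.15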
• The expectation of a function of two random variables:
E[h(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f (x, y) dx dy.
Exercise 3 Suppose that the joint pdf of X and Y is
f (x, y) = c(x² + y) for 0 ≤ y ≤ 1 − x², and f (x, y) = 0 otherwise.
Determine the value of c and then calculate Pr(Y ≤ X + 1).
2.2 Marginal distributions
Given the joint distribution function of X and Y , we want to know the distribution function
of each random variable. This is called the marginal distribution. In general, given F (x, y),
the marginal distribution function of X is
F1 (x) = Pr(X ≤ x, Y ≤ ∞),
and that of Y is
F2 (y) = Pr(X ≤ ∞, Y ≤ y).
• The discrete case: the marginal probability function of X is
f1 (x) = ∑_y f (x, y),
and that of Y is
f2 (y) = ∑_x f (x, y).
Then, the marginal distribution function of X is
F1 (x) = ∑_{xi ≤ x} f1 (xi ),
and that of Y is
F2 (y) = ∑_{yj ≤ y} f2 (yj ).
• The continuous case: the marginal distribution function of X is
F1 (x) = Pr(X ≤ x, Y ≤ ∞) = ∫_{−∞}^{∞} ∫_{−∞}^{x} f (x̃, y) dx̃ dy,
and the marginal probability density function of X is
f1 (x) = ∫_{−∞}^{∞} f (x, y) dy.
Similarly,
F2 (y) = Pr(X ≤ ∞, Y ≤ y) = ∫_{−∞}^{∞} ∫_{−∞}^{y} f (x, ỹ) dỹ dx
and
f2 (y) = ∫_{−∞}^{∞} f (x, y) dx.
• Example: Suppose the joint pdf of X and Y is
f (x, y) = 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f (x, y) = 0 otherwise.
Then the marginal pdf of X is
f1 (x) = ∫_0^1 f (x, y) dy = 1
and the marginal df of X is
F1 (x) = x.
• Although the marginal distributions of X and Y can be derived from their joint distribution, it is usually impossible to reconstruct their joint distribution from their marginal
distributions without additional information. (See the exceptional case below where two
random variables are independent.)
• The moments of each random variable can be calculated by using its marginal distribution. Since it is very straightforward, we will not present the details.
2.3 Conditional distributions
We have encountered the concept of conditional probability before. We now apply it to
distribution functions.
Suppose we know the joint distribution of a pair of random variables (X, Y ). In general,
we can derive the revised probability of Y ∈ B conditional on having learned that X ∈ A as
follows:
Pr(Y ∈ B|X ∈ A) = Pr(Y ∈ B, X ∈ A) / Pr(X ∈ A)
if Pr(X ∈ A) > 0. Both the numerator and the denominator can be computed from the joint
distribution of X and Y . From now on, we focus on conditional distribution functions (i.e.,
B has the form of Y ≤ y and A is a singleton set).
• The discrete case: given the joint probability function f (x, y), the probability function
of Y conditional on X = x is
f2 (y|x) ≡ Pr(Y = y|X = x) = Pr(Y = y, X = x) / Pr(X = x) = f (x, y) / f1 (x).
It measures the revised probability of Y = y conditional on X = x. Then the distribution
function of Y conditional on X = x is
F2 (y|x) = [∑_{yj ≤ y} f (x, yj )] / f1 (x).
The conditional distribution function of X can be similarly derived.
• The continuous case: since Pr(X = x) = 0 for a continuous random variable, we can
derive the conditional distribution of Y in the following way:
Pr(Y ≤ y|x < X ≤ x + ∆) = [F (x + ∆, y) − F (x, y)] / [F1 (x + ∆) − F1 (x)].
Then we divide both numerator and denominator by ∆ and then let ∆ tend to zero.
This limit operation yields the conditional distribution function of Y :
F2 (y|x) ≡ Pr(Y ≤ y|X = x) = [∂F (x, y)/∂x] / [dF1 (x)/dx] = [∂F (x, y)/∂x] / f1 (x).
Then the conditional probability density function of Y is
f2 (y|x) = ∂F2 (y|x)/∂y = f (x, y) / f1 (x)
whenever F2 (y|x) is differentiable with respect to y.
• In either case, we have
f (x, y) = f2 (y|x)f1 (x) = f1 (x|y)f2 (y).
That is, if we know the marginal pdf and the conditional pdf, then we can reconstruct
the joint pdf. Furthermore, we also have
f1 (x|y) = f2 (y|x)f1 (x) / f2 (y) = f2 (y|x)f1 (x) / ∫_x f2 (y|x)f1 (x) dx.
(In the discrete case, the integral in the denominator should be replaced by a sum.)
This is Bayes’ Theorem for random variables.
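A sketch (not from the notes) of these identities for an illustrative joint pdf f(x, y) = x + y on the unit square, which is a valid pdf since it integrates to 1:

# Sketch: conditional pdfs and Bayes' theorem for f(x, y) = x + y on the unit square.
import numpy as np

f = lambda x, y: x + y
f1 = lambda x: x + 0.5                       # marginal of X: integral of (x + y) over y in [0, 1]
f2 = lambda y: y + 0.5                       # marginal of Y (by symmetry)

x, y = 0.3, 0.7
f2_cond = f(x, y) / f1(x)                    # f2(y|x) = f(x, y) / f1(x)
f1_cond = f(x, y) / f2(y)                    # f1(x|y) = f(x, y) / f2(y)
print(f1_cond, f2_cond * f1(x) / f2(y))      # Bayes' theorem: the two values agree

grid = np.linspace(0.0, 1.0, 200_001)        # f2(.|x) integrates to 1 over y
dy = grid[1] - grid[0]
print(np.sum(f(x, grid) / f1(x)) * dy)       # close to 1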
Exercise 4 Suppose the joint pdf of X and Y is
f (x, y) = (3/16)(4 − 2x − y) for x > 0, y > 0 and 2x + y < 4, and f (x, y) = 0 otherwise.
Determine the conditional pdf of Y for every given value of X, and compute Pr(Y ≥ 2|X =
0.5).
2.4 Conditional moments
Our exposition is for continuous random variables, but all results also hold for discrete ones.
Consider X and Y with the joint pdf f (x, y).
• Conditional expectation:
— the conditional expectation of Y given X = x is
E(Y |x) = ∫_{−∞}^{∞} y f2 (y|x) dy,
where f2 (y|x) is the conditional pdf of Y . When x changes, this conditional expectation will also change.
— the conditional expectation of Y given X, denoted by E(Y |X), is a function of X
and so a random variable. If h(x) ≡ E(Y |x), then E(Y |X) = h(X) and its distribution can be derived from X’s marginal distribution according to this functional
relationship.
— then if all related expectations exist, we have
E[E(Y |X)] = ∫_{−∞}^{∞} E(Y |x) f1 (x) dx
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f2 (y|x) f1 (x) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f (x, y) dx dy
= E(Y ),
where the second step uses the definition of E(Y |x). This result is called the law
of iterated expectations.
— similarly, E [E(r(X, Y )|X)] = E(r(X, Y )) for any function r.
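A Monte Carlo sketch (my addition) of the law of iterated expectations, using an illustrative pair in which X is uniform on [0, 1] and, given X = x, Y is normal with mean x:

# Sketch: law of iterated expectations by simulation.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)            # X ~ Uniform[0, 1]
y = rng.normal(loc=x, scale=1.0)             # given X = x, Y ~ Normal(x, 1), so E(Y|X) = X

print(np.mean(y))                            # E(Y) estimated directly
print(np.mean(x))                            # E[E(Y|X)] = E(X); both are close to 0.5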
• Conditional variance:
— the conditional variance of Y given X = x is
V ar(Y |x) = E{[Y − E(Y |x)]²|x} = E(Y ²|x) − [E(Y |x)]².
Exercise 5 (i) Prove (a) if E(Y |X) = 0 then E(Y ) = 0; and (b) if E(Y |X) = 0 then
E(XY ) = 0.
(ii) Suppose the distribution of X is symmetric with respect to the point x = 0 and all
moments of X exist. Suppose E(Y |X) = aX + b for constants a and b. Show that X^{2m} and
Y are uncorrelated for m = 1, 2, · · · .
(iii) Show that
V ar(Y ) = E [V ar(Y |X)] + V ar[E(Y |X)].
2.5 Independent random variables
• Two random variables X and Y are independent iff, for any two subsets A and B of R,
we have
Pr(X ∈ A and Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B).
• Equivalently, two random variables X and Y are independent iff
F (x, y) = F1 (x)F2 (y)
or
f (x, y) = f1 (x)f2 (y)
or
f1 (x|y) = f1 (x)
if f2 (y) > 0.
— the last statement just says that knowing the realized value of Y does not change
our probability judgment of X (and vice versa).
— the last statement also indicates that if X and Y are independent random variables,
then the set of all (x, y) pairs where f (x, y) > 0 should be rectangular.
• Example: suppose the joint pdf of X and Y is
f (x, y) = 2e^{−(x+2y)} for x ≥ 0 and y ≥ 0, and f (x, y) = 0 otherwise.
Are X and Y independent of each other? It is easy to calculate that f1 (x) = e^{−x} for
x ≥ 0 and f1 (x) = 0 for x < 0; and f2 (y) = 2e^{−2y} for y ≥ 0 and f2 (y) = 0 for y < 0.
Thus, f (x, y) = f1 (x)f2 (y) and so X and Y are indeed independent.³
• Properties: if X and Y are two independent random variables, then
— E(XY ) = E(X)E(Y )
— V ar(aX + bY ) = a²V ar(X) + b²V ar(Y )
— E(X|Y ) = E(X) and E(Y |X) = E(Y ), where
E(X|Y = y) = ∫_{−∞}^{+∞} x f1 (x|y) dx,
E(Y |X = x) = ∫_{−∞}^{+∞} y f2 (y|x) dy.
³ In fact, two continuous random variables are independent iff f (x, y) = g1 (x)g2 (y) for all x and y,
where the gi are nonnegative functions. That is, the joint pdf can be factorized into the product of a nonnegative
function of x and a nonnegative function of y.
— h(X) and g(Y ) are also independent for any two functions h and g, and so E [h(X)g(Y )] =
E [h(X)] E [g(Y )]
— if ψx and ψy are the mgfs of X and Y , respectively, then the mgf of Z = X + Y is
ψz = ψx ψy
Exercise 6 (i) Suppose the joint pdf of X and Y is
f (x, y) = kx²y² for x² + y² ≤ 1, and f (x, y) = 0 otherwise.
Show that X and Y are not independent.
(ii) Suppose X1 and X2 are two independent random variables and their mgfs are ψ1 (t) and ψ2 (t),
respectively. Let Y = X1 + X2 and let its mgf be ψ(t). Show that, if all mgfs exist, then
ψ(t) = ψ1 (t)ψ2 (t).
(iii) Let (X, Y, Z) be independent random variables such that:
E(X) = −1 and V ar(X) = 2,
E(Y ) = 0 and V ar(Y ) = 3,
E(Z) = 1 and V ar(Z) = 4.
Let
T = 2X + Y − 3Z + 4,
U = (X + Z)(Y + Z).
Find E(T ), V ar(T ), E(T ²) and E(U ).
2.6 Covariance and correlations
These two concepts are used to measure how much two random variables X and Y depend on
each other. Let E(X) and E(Y ) be the expectations of X and Y, respectively. (Notice that
they are calculated by using X and Y ’s marginal distributions.)
• The covariance of X and Y :
Cov(X, Y ) = E [(X − E(X))(Y − E(Y ))]
= E(XY ) − E(X)E(Y )
— if V ar(X) < ∞ and V ar(Y ) < ∞, then Cov(X, Y ) will exist and be finite.
— the sign of the covariance indicates the direction of covariation of X and Y . But
its magnitude is also influenced by the overall magnitudes of X and Y .
• The correlation of X and Y :
ρ(X, Y ) = Cov(X, Y ) / √[V ar(X)V ar(Y )]
whenever both variances are nonzero.
— ρ is between −1 and 1.⁴
— X and Y are said to be positively correlated if ρ(X, Y ) > 0; they are negatively
correlated if ρ < 0; and they are uncorrelated if ρ = 0.
• Properties:
— if X and Y are independent, then Cov(X, Y ) = 0 and ρ(X, Y ) = 0. But the
converse of this statement is not true.⁵
— if Y = aX + b for some constants a and b, then ρ(X, Y ) = 1 if a > 0 and ρ(X, Y ) =
−1 if a < 0. The converse is also true.
— the correlation only measures the linear relationship between X and Y . A large
|ρ| means that X and Y are close to being linearly related and hence are closely
related. But when |ρ| is small, X and Y could still be closely related according to
some nonlinear relationship. (See the example in footnote 5.)
— if both V ar(X) and V ar(Y ) are finite, then
V ar(X + Y ) = V ar(X) + V ar(Y ) + 2Cov(X, Y ).     (1)
Furthermore,
V ar(aX + bY ) = a²V ar(X) + b²V ar(Y ) + 2abCov(X, Y ),
and
V ar(∑_{i=1}^{n} Xi ) = ∑_{i=1}^{n} V ar(Xi ) + ∑_{i≠j} Cov(Xi , Xj ).
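These identities can be inspected by simulation; the sketch below (not from the notes) uses an illustrative linearly related pair X and Y = 0.5X + ε.

# Sketch: covariance, correlation and identity (1) by simulation.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.normal(size=n)                       # X ~ N(0, 1)
y = 0.5 * x + rng.normal(size=n)             # Y = 0.5 X + e, e independent of X

cov = np.mean(x * y) - x.mean() * y.mean()   # Cov(X, Y) = E(XY) - E(X)E(Y)
rho = cov / np.sqrt(x.var() * y.var())
print(cov, rho)                              # close to 0.5 and 0.5/sqrt(1.25)

print((x + y).var(), x.var() + y.var() + 2 * cov)   # identity (1): both sides agree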
⁴ This result is based on the Schwarz inequality [E(XY )]² ≤ E(X²)E(Y ²) for any two random variables. If
the right-hand side is finite, then the equality will hold iff there are constants a and b such that aX + bY = 0
with probability 1.
⁵ That is, even if two random variables are uncorrelated, they can be dependent. For example, X is a
discrete random variable with Pr(X = 1) = Pr(X = 0) = Pr(X = −1) = 1/3, and Y = X². They are clearly
dependent, but one can check that Cov(X, Y ) = 0.
Exercise 7 (i) Suppose that the pair (X, Y ) are uniformly distributed on the interior of the
circle of radius 1. Compute Cov(X, Y ).
(ii) Suppose X has a uniform distribution on the interval [−2, 2] and Y = X^6 . Show X and
Y are uncorrelated.
(iii) Prove the result (1), and
Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z)
where a, b, c are constants and all covariances exist.
(iv) Suppose that X and Y have the same variance, and the variances of X +Y and X −Y
also exist. Show X + Y and X − Y are uncorrelated.
2.7 Multivariate distributions
All of the above concepts and results can be readily extended to the case with more than two
random variables.
• Let X = [X1 X2 · · · Xn ]T be an n × 1 column vector of random variables.
• The joint distribution function is:
F (x) = Pr(X ≤ x)
= Pr(X1 ≤ x1 , X2 ≤ x2 , · · · , Xn ≤ xn )
• The joint pdf in the continuous case is:
f (x) = ∂^n F (x) / (∂x1 ∂x2 · · · ∂xn ).
• The marginal pdf is:
f1,··· ,k (x1 , · · · , xk ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (x1 , · · · , xn ) dxk+1 · · · dxn   (an (n − k)-fold integral).
• Without loss of generality, the joint pdf of the last n − k random variables conditional
on the first k < n random variables’ realized values is
f (x1 , · · · , xn ) / f1,··· ,k (x1 , · · · , xk ).
• The n random variables are independent iff
F (x1 , · · · , xn ) = F1 (x1 ) · · · Fn (xn )
or
f (x1 , · · · , xn ) = f1 (x1 ) · · · fn (xn ),
where Fi and fi are the marginal df and pdf of Xi .
• Some of the most important moments are the following:
— expectation:
E(X) = (E(X1 ), E(X2 ), · · · , E(Xn ))^T ,
where each of the expectations inside the vector is computed using the corresponding marginal
distribution. For example:
E(X1 ) = ∫_{−∞}^{+∞} x1 f1 (x1 ) dx1 .
— covariance matrix:
Σ ≡ V ar(X) = E[(X − E(X))(X − E(X))^T ]
    ⎛ σ11 · · · σ1n ⎞
  = ⎜  ..   ..   ..  ⎟
    ⎝ σn1 · · · σnn ⎠
  = E(XX^T ) − E(X)E(X)^T .
— for a constant column vector a,
V ar(a^T X) = E[(a^T X − a^T E(X))²]
= E[{a^T (X − E(X))}²]
= E[a^T (X − E(X))(X − E(X))^T a]
= a^T Σa.
This is a quadratic form. Since the variance is always non-negative by definition, it
yields a^T Σa ≥ 0 for any nonzero a. That is, the covariance matrix
Σ is positive semidefinite.
— for a constant matrix A, we have
V ar(AX) = AΣA^T .
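A sketch (my addition) checking V ar(a^T X) = a^T Σa and the positive semidefiniteness of Σ on simulated data with an illustrative 3-dimensional X:

# Sketch: Var(a^T X) = a^T Sigma a and positive semidefiniteness of Sigma.
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
L = np.array([[1.0, 0.0, 0.0],               # mixing matrix creating dependent components
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = rng.normal(size=(n, 3)) @ L.T            # rows of X are draws of the random vector

Sigma = np.cov(X, rowvar=False)              # sample covariance matrix
a = np.array([1.0, -2.0, 0.5])

print(np.var(X @ a))                         # Var(a^T X) estimated directly
print(a @ Sigma @ a)                         # a^T Sigma a; the two values agree closely
print(np.linalg.eigvalsh(Sigma))             # all eigenvalues are nonnegative (PSD)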
2.8 Functions of multiple random variables
We focus on the case with continuous random variables.
• Suppose the joint pdf of n random variables X1 , · · · , Xn is f (x1 , · · · , xn ), and a new
random variable is constructed as Y = h(X1 , · · · , Xn ). What is the pdf of Y ? We
can compute the df of Y first:
G(y) = Pr(Y ≤ y) = ∫ · · · ∫_{A(y)} f (x1 , · · · , xn ) dx1 · · · dxn ,
where A(y) = {(x1 , · · · , xn ) ∈ Rⁿ : h(x1 , · · · , xn ) ≤ y}. If G(y) is differentiable, we can
derive g(y) = G′(y).
• Example: suppose n independent random variables X1 , · · · , Xn share the same distribution F , which is differentiable and has density function f . Let Ymax = max{X1 , · · · , Xn }
and Ymin = min{X1 , · · · , Xn }. Determine the pdfs of Ymax and Ymin .
Gmax (y) = Pr(Ymax ≤ y) = Pr(X1 ≤ y, · · · , Xn ≤ y) = F (y)^n ,
and so gmax (y) = nF (y)^{n−1} f (y).
Gmin (y) = Pr(Ymin ≤ y) = 1 − Pr(Ymin > y) = 1 − [1 − F (y)]^n ,
and so gmin (y) = n[1 − F (y)]^{n−1} f (y).
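A Monte Carlo sketch (not from the notes) of these formulas for n = 5 independent Uniform[0, 1] draws, for which F(y) = y, Gmax(y) = y^5 and Gmin(y) = 1 − (1 − y)^5:

# Sketch: dfs of Y_max and Y_min for n = 5 uniform draws.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 5, 1_000_000
samples = rng.uniform(0.0, 1.0, size=(reps, n))
y_max = samples.max(axis=1)
y_min = samples.min(axis=1)

for q in (0.5, 0.8, 0.9):
    print(q, np.mean(y_max <= q), q**n)               # empirical vs. F(q)^n = q^5
    print(q, np.mean(y_min <= q), 1 - (1 - q)**n)     # empirical vs. 1 - (1 - q)^5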
Exercise 8 (i) Revisit the above example on Ymax and Ymin . Derive the joint pdf of (Ymax , Ymin ).
(ii) Suppose X1 and X2 are two independent random variables, each uniformly distributed over
[0, 1]. Find the pdf of Y = X1 + X2 .
• Now consider the case with n new random variables:
Y1 = h1 (X1 , · · · , Xn ), . . . , Yn = hn (X1 , · · · , Xn ).     (2)
We want to derive the joint pdf of Y1 , · · · , Yn .
To do that, we need assumptions about the functions hi . If S is the subset of Rⁿ such that
Pr((X1 , · · · , Xn ) ∈ S) = 1 and T is the subset of Rⁿ such that Pr((Y1 , · · · , Yn ) ∈ T ) = 1,
we assume that the transformation from S to T by all hi is a one-to-one correspondence.
That is, given a point (y1 , · · · , yn ) in T , we have a unique preimage (x1 , · · · , xn ) in S.
With this assumption, we can solve (2) in terms of
X1 = s1 (Y1 , · · · , Yn ), . . . , Xn = sn (Y1 , · · · , Yn ).     (3)
Construct the determinant
J = det[ ∂si /∂yj ]   (the n × n matrix whose (i, j) entry is ∂si /∂yj )
for every point (y1 , · · · , yn ) ∈ T . We call it the Jacobian of the transformation in (3).
Then the joint pdf of the n new random variables is
g(y1 , · · · , yn ) = f (s1 , · · · , sn ) |J| for (y1 , · · · , yn ) ∈ T , and g(y1 , · · · , yn ) = 0 otherwise,
where |J| is the absolute value of the Jacobian.⁶
• Example: suppose the joint pdf of X1 and X2 is
f (x1 , x2 ) = 4x1 x2 for 0 < x1 , x2 < 1, and f (x1 , x2 ) = 0 otherwise.
Let Y1 = X1 /X2 and Y2 = X1 X2 . Find the joint pdf of Y1 and Y2 .
It is easy to see that y1 > 0 and y2 ∈ (0, 1). For each pair of such y1 and y2 , we can
derive
x1 = √(y1 y2 ),   x2 = √(y2 /y1 ).
Then the Jacobian is
J = det( ∂xi /∂yj ) = (1/2)√(y2 /y1 ) · 1/(2√(y1 y2 )) − (1/2)√(y1 /y2 ) · ( −(1/(2y1 ))√(y2 /y1 ) ) = 1/(2y1 ).
Therefore,
g(y1 , y2 ) = 2y2 /y1 for 0 < y2 < y1 and y1 y2 < 1 (so that the corresponding (x1 , x2 ) lies in (0, 1)²), and g(y1 , y2 ) = 0 otherwise.
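A Monte Carlo sketch (my addition) of this example. Since f(x1, x2) = 4x1x2 factorizes as (2x1)(2x2), X1 and X2 can be simulated independently by the inverse-df method (the square root of a uniform draw has pdf 2x on (0, 1)); an event probability in (Y1, Y2) is then compared with the corresponding integral of g.

# Sketch: Monte Carlo check of the transformed joint pdf g(y1, y2) = 2 y2 / y1.
import numpy as np
from scipy.integrate import dblquad

rng = np.random.default_rng(6)
n = 1_000_000
x1 = np.sqrt(rng.uniform(size=n))            # inverse-df draws from pdf 2x on (0, 1)
x2 = np.sqrt(rng.uniform(size=n))
y1, y2 = x1 / x2, x1 * x2

print(np.mean((y1 <= 2.0) & (y2 <= 0.5)))    # empirical Pr(Y1 <= 2, Y2 <= 0.5)

# Same probability from g: for y2 <= 0.5 the support requires y2 <= y1 (and y1*y2 < 1
# holds automatically on this region), so integrate y1 from y2 to 2, y2 from 0 to 0.5.
val, _ = dblquad(lambda v1, v2: 2 * v2 / v1, 0, 0.5, lambda v2: v2, lambda v2: 2.0)
print(val)                                   # both values are close to 0.5*log(2) + 0.125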
• A technique for the sum (or the difference) of two random variables:
Suppose we want to know the pdf of Y = X1 + X2 or X1 − X2 . Sometimes it is quite
hard to calculate G(y) = Pr(X1 + X2 ≤ y) or Pr(X1 − X2 ≤ y). In that case, we can
introduce another new random variable Z = X2 . Then we first derive the joint pdf of
(Y, Z) and then find the marginal distribution of Y .
⁶ In particular, if Y = AX, where X and Y are vectors of random variables and A is an n × n nonsingular
matrix, then
g(y) = f (A⁻¹ y) / |det A|.
Exercise 9 Suppose that X1 and X2 are independent and share the same distribution
f (x) = e^{−x} for x > 0, and f (x) = 0 otherwise.
Find the pdf of Y = X1 − X2 .