Introduction to Probability
Motoya Machida
January 17, 2006
This material is designed to provide a foundation in the mathematics of probability. We begin with the basic
concepts of probability, such as events, random variables, independence, and conditional probability. Having developed these concepts, the remainder of the material concentrates on the methods of calculation in probability
and their applications through exercises. The topics treated here are divided into four major parts: (I) probability and distributions, (II) discrete distributions, (III) continuous distributions, and (IV) joint distributions.
When completed, these topics will unify the understanding of density and distribution functions of various kinds,
calculation and interpretation of moments, and distributions related to the normal distribution.
1 Sample space and events
The set of all possible outcomes of an experiment is called the sample space, and is typically denoted by Ω. For
example, if the outcome of an experiment is the order of finish in a race among 3 boys, Jim, Mike and Tom, then
the sample space becomes
Ω = {(J, M, T ), (J, T, M ), (M, J, T ), (M, T, J), (T, J, M ), (T, M, J)}.
In another example, suppose that a researcher is interested in the lifetime of a transistor, and measures it in
minutes. Then the sample space is represented by
Ω = {all nonnegative real numbers}.
Any subset of the sample space is called an event. In the who-wins-the-race example, “Mike wins the race” is
an event, which we denote by A. Then we can write
A = {(M, J, T ), (M, T, J)}.
In the transistor example, “the transistor does not last longer than 2500 minutes” is an event, which we denote
by E. And we can write
E = {x : 0 ≤ x ≤ 2500}.
1.1 Set operations
Once we have events A, B, . . ., we can define a new event from these events by using set operations—union,
intersection, and complement. The event A ∪ B, called the union of A and B, means that either A or B or both
occurs. Consider the “who-wins-the-race” example, and let B denote the event “Jim wins the second place”,
that is,
B = {(M, J, T ), (T, J, M )}.
Then A ∪ B means that “either Mike wins the first, or Jim wins the second, or both,” that is,
A ∪ B = {(M, T, J), (T, J, M ), (M, J, T )}.
The event A ∩ B, called the intersection of A and B, means that both A and B occur. In our example, A ∩ B
means that “Mike wins the first and Jim wins the second,” that is,
A ∩ B = {(M, J, T )}.
The event Aᶜ, called the complement of A, means that A does not occur. In our example, the event Aᶜ means
that "Mike does not win the race", that is,
Aᶜ = {(J, M, T), (J, T, M), (T, J, M), (T, M, J)}.
Now suppose that C is the event “Tom wins the race.” Then what is the event A ∩ C? It is impossible that
Mike wins and Tom wins at the same time. In mathematics, it is called the empty set, denoted by ∅, and can be
expressed in the form
A ∩ C = ∅.
If the two events A and B satisfy A ∩ B = ∅, then they are said to be disjoint. For example, “Mike wins the
race” and “Tom wins the race” are disjoint events.
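These set operations can be checked directly in Python. The following is a small sketch: the tuples abbreviate the race outcomes as above, and the variable names are my own.

from itertools import permutations

# Sample space: all orders of finish among Jim (J), Mike (M), Tom (T).
omega = set(permutations(["J", "M", "T"]))

A = {o for o in omega if o[0] == "M"}   # Mike wins the race
B = {o for o in omega if o[1] == "J"}   # Jim finishes second
C = {o for o in omega if o[0] == "T"}   # Tom wins the race

print(A | B)           # union: Mike first, or Jim second, or both
print(A & B)           # intersection: only ("M", "J", "T")
print(omega - A)       # complement of A: Mike does not win
print(A & C == set())  # True: A and C are disjoint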
1.2 Exercises
1. Two six-sided dice are thrown sequentially, and the face values that come up are recorded.
(a) List the sample space Ω.
(b) List the elements that make up the following events:
i. A = {the sum of the two values is at least 5}
ii. B = {the value of the first die is higher than the value of the second}
iii. C = {the first value is 4}
(c) List the elements of the following events:
i. A ∩ C
ii. B ∪ C
iii. A ∩ (B ∪ C)
2 Probability and combinatorial methods
A probability P is a “function” defined over all the “events” (that is, all the subsets of Ω), and must satisfy the
following properties:
1. If A ⊂ Ω, then 0 ≤ P (A) ≤ 1
2. P (Ω) = 1
3. If A1, A2, . . . are events and Ai ∩ Aj = ∅ for all i ≠ j, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
Property 3 is called the "axiom of countable additivity," which is clearly motivated by the property:
P (A1 ∪ · · · ∪ An ) = P (A1 ) + · · · + P (An ) if A1 , A2 , . . . , An are mutually exclusive events.
Quite frankly, there is nothing in one’s intuitive notion that requires this axiom.
2.1 Equally likely outcomes
When the sample space
Ω = {ω1 , . . . , ωN }
consists of N outcomes, it is often natural to assume that P ({ω1 }) = · · · = P ({ωN }). Then we can obtain the
probability of a single outcome by
P({ωi}) = 1/N    for i = 1, . . . , N.
Assuming this, we can compute the probability of any event A by counting the number of outcomes in A, and
obtain
P(A) = (number of outcomes in A) / N.
A problem involving equally likely outcomes can be solved by counting. The mathematics of counting is known
as combinatorial analysis. Here we summarize several basic facts—the multiplication principle, permutations, and
combinations.
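As a small illustration of computing a probability by counting equally likely outcomes (using the race example above; the fractions module keeps the probability exact):

from fractions import Fraction
from itertools import permutations

# Each of the 3! orders of finish is assumed equally likely, with probability 1/6.
omega = list(permutations(["J", "M", "T"]))
A = [o for o in omega if o[0] == "M"]    # "Mike wins the race"

prob_A = Fraction(len(A), len(omega))    # number of outcomes in A over N
print(prob_A)                            # 1/3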
2.2 Multiplication principle
If one experiment has m outcomes, and a second has n outcomes, then there are m × n outcomes for the two
experiments.
The multiplication principle can be extended and used in the following experiment. Suppose that we have n
differently numbered balls in an urn. In the first trial, we pick up a ball from the urn, record its number, and
put it back into the urn. In the second trial, we again pick up, record, and put back a ball. Continue the same
procedure until the r-th trial. The whole process is called a sampling with replacement. By generalizing the
multiplication principle, the number of all possible outcomes becomes
n × · · · × n = n^r    (r factors).
Example 1. How many different 7-place license plates are possible if the first 3 places are to be occupied by
letters and the final 4 by numbers?
Example 2. How many different functions f defined on {1, . . . , n} are possible if the value f (i) takes either 0
or 1?
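A minimal sketch of Examples 1 and 2 under the multiplication principle (the value n = 5 in the second part is an arbitrary choice for illustration):

# Example 1: 3 letter places and 4 digit places, repetition allowed.
plates = 26**3 * 10**4
print(plates)            # 175760000

# Example 2: each of f(1), ..., f(n) is 0 or 1, so there are 2**n such functions.
n = 5                    # n is arbitrary here, chosen only for illustration
print(2**n)              # 32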
2.3 Permutations
Consider again “n balls in an urn.” In this experiment, we pick up a ball from the urn, record its number, but do
not put it back into the urn. In the second trial, we again pick up, record, and do not put back a ball. Continue
the same procedure r-times. The whole process is called a sampling without replacement. Then, the number of
all possible outcomes becomes
n × (n − 1) × (n − 2) × · · · × (n − r + 1)    (r factors).
Given a set of n different elements, the number of all possible "ordered" sets of size r is called the number of r-element permutations of an n-element set. In particular, the number of n-element permutations of an n-element set is expressed as

n × (n − 1) × (n − 2) × · · · × 2 × 1 = n!    (n factors).
The above mathematical symbol “n!” is called “n factorial.” We define 0! = 1, since there is one way to order 0
elements.
Example 3. How many different 7-place license plates are possible if the first 3 places are using different letters
and the final 4 using different numbers?
Example 4. How many different functions f defined on {1, . . . , n} are possible if the function f takes values on {1, . . . , n} and satisfies f(i) ≠ f(j) for all i ≠ j?
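A short sketch of Examples 3 and 4 using Python's math.perm and math.factorial (the value n = 5 is again an arbitrary illustration):

import math

# Example 3: 3 distinct letters followed by 4 distinct digits.
print(math.perm(26, 3) * math.perm(10, 4))   # 15600 * 5040 = 78624000

# Example 4: one-to-one functions from {1, ..., n} onto {1, ..., n}: n! of them.
n = 5                                        # illustrative choice of n
print(math.factorial(n))                     # 120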
2.4 Combinations
In an experiment, we have n differently numbered balls in an urn, and pick up r balls “as a group.” Then there
are
n!/(n − r)!
ways of selecting the group if the order is relevant (r-element permutation). However, the order is irrelevant
when you choose r balls as a group, and any particular r-element group has been counted exactly r! times. Thus,
the number of r-element groups can be calculated as
C(n, r) = n!/((n − r)! r!).

You should read the symbol C(n, r) as "n choose r."
Example 5. A committee of 3 is to be formed from a group of 20 people. How many different committees are
possible?
Example 6. From a group of 5 women and 7 men, how many different committees consisting of 2 women and
3 men can be formed?
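Examples 5 and 6 can be computed directly with math.comb (a small sketch):

import math

# Example 5: committees of 3 from 20 people.
print(math.comb(20, 3))                    # 1140

# Example 6: 2 women out of 5 and 3 men out of 7, by the multiplication principle.
print(math.comb(5, 2) * math.comb(7, 3))   # 10 * 35 = 350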
2.5 Binomial theorem
The term "n choose r" is often referred to as a binomial coefficient, because of the following identity:

(a + b)^n = Σ_{i=0}^n C(n, i) a^i b^{n−i}.
In fact, we can give a proof of the binomial theorem by using combinatorial analysis. For the moment we pretend that the commutative law cannot be used for multiplication. Then, for example, the expansion of (a + b)^2 becomes
aa + ab + ba + bb,
and consists of the 4 terms aa, ab, ba, and bb. In general, how many terms in the expansion of (a + b)^n should contain a's in exactly i places? The answer to this question yields the binomial theorem.
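The identity can also be checked numerically for a few sample values of a, b, and n (an illustrative sketch, not part of the proof):

import math

# Check the binomial theorem for a few sample values of a, b, and n.
for a, b, n in [(2.0, 3.0, 5), (1.5, -0.5, 7)]:
    lhs = (a + b) ** n
    rhs = sum(math.comb(n, i) * a**i * b**(n - i) for i in range(n + 1))
    print(abs(lhs - rhs) < 1e-9)   # True, True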
2.6 Optional topic
A lot of n items contains k defectives, and m are selected randomly and inspected. How should the value of m
be chosen so that the probability that at least one defective item turns up is 0.9? Apply your answer to the
following cases:
1. n = 1000 and k = 10;
2. n = 10,000 and k = 100.
Suppose that we have a lot of size n containing k defectives. If we sample and inspect m random items, what is the
probability that we will find i defectives in our sample? This probability is called a hypergeometric distribution,
and is expressed by

p(i) = C(k, i) C(n − k, m − i) / C(n, m),    i = max{0, m + k − n}, . . . , min{m, k}.
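A brute-force sketch of the optional topic above (the helper functions and the search loop are my own; they use the hypergeometric probability above with i = 0):

import math

def p_no_defective(n, k, m):
    # Hypergeometric probability of i = 0 defectives in a sample of size m.
    return math.comb(n - k, m) / math.comb(n, m)

def smallest_m(n, k, target=0.9):
    # Smallest sample size m with P(at least one defective) >= target.
    for m in range(1, n + 1):
        if 1 - p_no_defective(n, k, m) >= target:
            return m

print(smallest_m(1000, 10))     # case 1: n = 1000, k = 10
print(smallest_m(10_000, 100))  # case 2: n = 10,000, k = 100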
2.7 Exercises
1. How many ways are there to encode the 26-letter English alphabet into 8-bit binary words (sequences of
eight 0’s and 1’s)?
2. What is the coefficient of x^3 y^4 in the expansion of (x + y)^7?
3. A child has six blocks, three of which are red and three of which are green. How many patterns can she
make by placing them all in a line?
4. A deck of 52 cards is shuffled thoroughly. What is the probability that the four aces are all next to each
other? (Hint: Imagine that you have 52 slots lined up to place the four aces. How many different ways
can you “choose” four slots for those aces? How many different ways do you get consecutive slots for those
aces?)
3 Conditional probability and independence
Let A and B be two events where P(B) ≠ 0. Then the conditional probability of A given B can be defined as

P(A|B) = P(A ∩ B) / P(B).
The idea of “conditioning” is that “if we have known that B has occurred, the sample space should have become
B rather than Ω.”
Example 1. Mr. and Mrs. Jones have two children. What is the conditional probability that their children are
both boys, given that they have at least one son?
3.1 Bayes' rule
Suppose that P (A|B) and P (B) are known. Then we can use this to find
P (A ∩ B) = P (A|B)P (B).
Example 2. Celine estimates that her chance to receive an A grade would be 1/2 in a French course, and 2/3 in a chemistry course. If she decides whether to take a French course or a chemistry course this semester based on the flip of a coin, what is the probability that she gets an A in chemistry?
Law of total probability. Let A and B be two events. Then we can write the probability P (A) as
P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).
In general, suppose that we have a sequence B1, B2, . . . , Bn of mutually disjoint events satisfying ∪_{i=1}^n Bi = Ω, where "mutual disjointness" means that Bi ∩ Bj = ∅ for all i ≠ j. (The events B1, B2, . . . , Bn are called "a partition of Ω.") Then for any event A we have

P(A) = Σ_{i=1}^n P(A ∩ Bi) = Σ_{i=1}^n P(A|Bi)P(Bi).
Example 3. A study shows that an accident-prone person will have an accident in a one-year period with probability 0.4, whereas this probability is only 0.2 for a non-accident-prone person. Suppose that we assume that 30 percent of all the new policyholders of an insurance company are accident prone. Then what is the
probability that a new policyholder will have an accident within a year of purchasing a policy?
Bayes' rule. Let A and B1, . . . , Bn be events such that the Bi's are mutually disjoint, ∪_{i=1}^n Bi = Ω, and P(Bi) > 0 for all i. Then

P(Bj|A) = P(A ∩ Bj) / P(A) = P(A|Bj)P(Bj) / Σ_{i=1}^n P(A|Bi)P(Bi).
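A short numerical sketch of the law of total probability and Bayes' rule applied to Example 3 (the final line, asking how likely a policyholder who had an accident is to be accident prone, is my own added illustration):

p_B = 0.30                   # P(accident prone)
p_A_given_B = 0.4            # P(accident within a year | accident prone)
p_A_given_Bc = 0.2           # P(accident within a year | not accident prone)

# Law of total probability.
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)
print(p_A)                   # 0.26

# Bayes' rule: probability of being accident prone given that an accident occurred.
print(p_A_given_B * p_B / p_A)   # about 0.46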
3.2 Concept of independence
Intuitively we would like to say that A and B are independent if knowing about one event gives us no information about the other. That is, P(A|B) = P(A) and P(B|A) = P(B). We say A and B are independent if
P (A ∩ B) = P (A)P (B).
This definition is symmetric in A and B, and allows P (A) and/or P (B) to be 0. Furthermore, a collection of
events A1 , A2 , . . . , An is said to be mutually independent if it satisfies
P(A_{i1} ∩ · · · ∩ A_{im}) = P(A_{i1}) P(A_{i2}) · · · P(A_{im})

for any subcollection A_{i1}, A_{i2}, . . . , A_{im}.
3.3 Exercises
1. Urn A has three red balls and two white balls, and urn B has two red balls and five white balls. A fair
coin is tossed; if it lands heads up, a ball is drawn from urn A, and otherwise, a ball is drawn from urn B.
(a) What is the probability that a red ball is drawn?
(b) If a red ball is drawn, what is the probability that the coin landed heads up?
2. An urn contains three red and two white balls. A ball is drawn, and then it and another ball of the same
color are placed back in the urn. Finally, a second ball is drawn.
(a) What is the probability that the second ball drawn is white?
(b) If the second ball drawn is white, what is the probability that the first ball drawn was red?
3. Two dice are rolled, and the sum of the face values is six. What is the probability that at least one of the
dice came up a three?
4. A couple has two children. What is the probability that both are girls given that the oldest is a girl? What
is the probability that both are girls given that one of them is a girl?
5. What is the probability that the following system works if each unit fails independently with probability p
(see the figure below)?
4 Random variables
A numerically valued function X of an outcome ω from a sample space Ω
X : Ω → R : ω → X(ω)
is called a random variable (r.v.), and usually determined by an experiment. We conventionally denote random
variables by uppercase letters X, Y, Z, U, V, . . . from the end of the alphabet.
Discrete random variables. A discrete random variable is a random variable that can take values on a finite set
{a1 , a2 , . . . , an } of real numbers (usually integers), or on a countably infinite set {a1 , a2 , a3 , . . .}. The statement
such as “X = ai ” is an event since
{ω : X(ω) = ai }
is a subset of the sample space Ω. Thus, we can consider the probability of the event {X = ai}, denoted by
P (X = ai ). The function
p(ai ) := P (X = ai )
over the possible values of X, say a1 , a2 , . . ., is called a frequency function, or a probability mass function. The
frequency function p must satisfy
Σ_i p(ai) = 1,
where the sum is over the possible values of X. This will completely describe the probabilistic nature of the
random variable X. Another useful function is the cumulative distribution function (c.d.f.), and it is defined by
F (x) = P (X ≤ x),
−∞ < x < ∞.
The c.d.f. of a discrete r.v. is a nondecreasing step function. It jumps wherever p(x) > 0, and the jump at
ai is p(ai ). C.d.f.’s are usually denoted by uppercase letters, while frequency functions are usually denoted by
lowercase letters.
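A small sketch of a frequency function and its cumulative distribution function (the values p(0) = 0.2, p(1) = 0.5, p(2) = 0.3 are assumed only for illustration; fractions keep the arithmetic exact):

from fractions import Fraction as F

# Frequency function of a discrete random variable (values assumed for illustration).
p = {0: F(2, 10), 1: F(5, 10), 2: F(3, 10)}

def cdf(x):
    # F(x) = P(X <= x): sum the frequency function over all values a_i <= x.
    return sum(prob for value, prob in p.items() if value <= x)

print(sum(p.values()) == 1)                      # True: the frequencies sum to one
print([cdf(x) for x in (-1, 0, 0.5, 1, 2, 3)])   # nondecreasing step function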
4.1 Continuous random variables
A continuous random variable is a random variable whose possible values are real values such as 78.6, 5.7, 10.24,
and so on. Examples of continuous random variables include temperature, height, diameter of metal cylinder,
etc. In what follows, a random variable means a “continuous” random variable, unless it is specifically said to
be discrete.
The probability distribution of a random variable X specifies how its values are distributed over the real numbers.
This is completely characterized by the cumulative distribution function (cdf). The cdf F(t) := P(X ≤ t) represents the probability that the random variable X is less than or equal to t. Then we often say that "the
random variable X is distributed as F (t).” The cdf F (t) must satisfy the following properties:
1. The cdf F (t) is a positive function [that is, F (t) ≥ 0].
2. The cdf F (t) increases monotonically [that is, if s < t, then F (s) ≤ F (t)].
3. The cdf F(t) must tend to zero as t goes to negative infinity "−∞", and must tend to one as t goes to positive infinity "+∞" [that is, lim_{t→−∞} F(t) = 0 and lim_{t→+∞} F(t) = 1].
4.2 Probability density function
It is often the case that the probability that the random variable X takes a value in a particular range is given
by the area under a curve over that range of values. This curve is called the probability density function (pdf ) of
the random variable X, denoted by f (x). Thus, the probability that “a ≤ X ≤ b” can be expressed by
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
Furthermore, we can find the following relationship between cdf and pdf:
F(t) = P(X ≤ t) = ∫_{−∞}^t f(x) dx.
This implies that (i) the cdf F is a continuous function, and (ii) the pdf f (x), if f is continuous at x, can be
obtained as the derivative
f(x) = dF(x)/dx.
Example 1. A random variable X is called a uniform random variable on [a, b], when X takes any real number
in the interval [a, b] equally likely. Then the pdf of X is given by
f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise [that is, if x < a or b < x].
Compute the cdf F (x) for the uniform random variable X.
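A minimal sketch of Example 1 with the illustrative choice a = 0, b = 2; it checks the closed-form cdf F(x) = (x − a)/(b − a) on [a, b] against a Riemann sum of the pdf:

def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # Integrate the density: F(x) = (x - a)/(b - a) on [a, b], 0 below a, 1 above b.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0.0, 2.0                       # illustrative interval
print(uniform_cdf(1.5, a, b))         # 0.75

# Midpoint-rule check of the cdf at x = 1.5 against the integral of the pdf.
n = 100_000
approx = sum(uniform_pdf(a + (i + 0.5) * (1.5 - a) / n, a, b) for i in range(n)) * (1.5 - a) / n
print(round(approx, 4))               # 0.75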
4.3 Quantile function
Given a cdf F (x), we can define the quantile function Q(u) = inf{x : F (x) ≥ u} for u ∈ (0, 1). In particular, if F
is continuous and strictly increasing, then we have Q(u) = F^{−1}(u). When F(x) is the cdf of a random variable X, and F(x) is continuous, we can find

xp = Q(p) ⇔ P(X ≤ xp) = p

for every p ∈ (0, 1). The value x1/2 = Q(1/2) is called the median of F, and the values x1/4 = Q(1/4) and x3/4 = Q(3/4) are called the lower quartile and the upper quartile, respectively.
Example 2. Compute the quantile function Q(u) for a uniform random variable X on [a, b].
4.4 Joint distributions
When two discrete random variables X and Y are obtained at the same experiment, we can define their joint
frequency function by
p(ai , bj ) = P (X = ai , Y = bj )
where a1 , a2 , . . . and b1 , b2 , . . . are the possible values of X and Y , respectively. The marginal frequency function
of X, denoted by pX , can be calculated by
pX(ai) = P(X = ai) = Σ_j p(ai, bj),
where the sum is over the possible values b1, b2, . . . of Y. Similarly, the marginal frequency function pY(bj) = Σ_i p(ai, bj) of Y is given by summing over the possible values a1, a2, . . . of X.
Joint distributions of two continuous random variables. Consider a pair (X, Y ) of continuous random
variables. A joint density function f (x, y) is a nonnegative function satisfying
∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1,

and is used to compute probabilities constructed from the random variables X and Y simultaneously. For example, we can calculate the probability

P(u1 ≤ X ≤ u2, v1 ≤ Y ≤ v2) = ∫_{u1}^{u2} ∫_{v1}^{v2} f(x, y) dy dx.

In particular,

F(u, v) = P(X ≤ u, Y ≤ v) = ∫_{−∞}^u ∫_{−∞}^v f(x, y) dy dx

is called the joint distribution function.
Given the joint density function f (x, y), the distribution for each of X and Y is called the marginal distribution,
and the marginal density functions of X and Y , denoted by fX (x) and fY (y), are given respectively by
fX(x) = ∫_{−∞}^∞ f(x, y) dy   and   fY(y) = ∫_{−∞}^∞ f(x, y) dx.
Example 3. Consider the following joint density function of random variables X and Y :
f(x, y) = λ^2 e^{−λy} if 0 ≤ x ≤ y, and f(x, y) = 0 otherwise.    (1)
(a) Find P (X ≤ 1, Y ≤ 1). (b) Find the marginal density functions fX (x) and fY (y).
4.5 Conditional density functions
Suppose that two random variables X and Y have a joint density function f(x, y). If fX(x) > 0, then we can define the conditional density function fY|X(y|x) given X = x by

fY|X(y|x) = f(x, y) / fX(x).

Similarly we can define the conditional density function fX|Y(x|y) given Y = y by

fX|Y(x|y) = f(x, y) / fY(y)
if fY (y) > 0. Then, clearly we have the following relation
f (x, y) = fY |X (y|x)fX (x) = fX|Y (x|y)fY (y).
Example 4. Find the conditional density functions fX|Y (x|y) and fY |X (y|x) for the joint density function in
Example 3 (see 4.4).
4.6 Independent random variables
Let X and Y be discrete random variables with joint frequency function p(x, y). Then X and Y are said to be
independent if they satisfy
p(x, y) = pX (x)pY (y)
for all possible values of (x, y).
In the same manner, if the joint density function f (x, y) for continuous random variables X and Y is expressed
in the form
f (x, y) = fX (x)fY (y) for all x, y,
then X and Y are independent.
4.7 Exercises
1. Suppose that X is a discrete random variable with P (X = 0) = .25, P (X = 1) = .125, P (X = 2) = .125,
and P (X = 3) = .5. Graph the frequency function and the cumulative distribution function of X.
2. An experiment consists of throwing a fair coin four times. (1) List the sample space Ω. (2) For each
discrete random variable X as defined in (a)–(d), describe the event “X = 0” as a subset of Ω and compute
P (X = 0). (3) For each discrete random variable X as defined in (a)–(d), find the possible values of X
and compute the frequency function.
(a) X is the number of heads before the first tail.
(b) X is the number of heads following the first tail.
(c) X is the number of heads minus the number of tails.
(d) X is the number of tails times the number of heads.
3. Let F(x) = 1 − exp(−α x^β) for x ≥ 0, α > 0 and β > 0, and F(x) = 0 for x < 0. Show that F is a cdf, and
find the corresponding density.
4. Sketch the pdf and cdf of a random variable X that is uniform on [−1, 1].
5. Suppose that X has the density function f(x) = cx^2 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise.
(a) Find c.
(b) Find the cdf.
(c) What is P (.1 ≤ X < .5)?
6. The joint frequency function of two discrete random variables, X and Y , is given by
         y = 1   y = 2   y = 3   y = 4
x = 1     .10     .05     .02     .02
x = 2     .05     .20     .05     .02
x = 3     .02     .05     .20     .04
x = 4     .02     .02     .04     .10
(a) Find the marginal functions of X and Y .
(b) Are X and Y independent?
(c) Find a joint frequency function whose marginals X and Y are given by the same marginal functions
in (a) but are independent.
7. Suppose that random variables X and Y have the joint density
f(x, y) = (6/7)(x + y)^2 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the marginal density functions of X and Y .
(b) Find the conditional density functions of X given Y = y and of Y given X = x.
Optional Problems.
1. Let f (x) = (1 + αx)/2 for −1 ≤ x ≤ 1 and f (x) = 0 otherwise, where −1 ≤ α ≤ 1. Show that f is a
density, and find the corresponding cdf. Find the quartiles and the median of the distribution in terms
of α.
2. The Cauchy cumulative distribution function is
F(x) = 1/2 + (1/π) tan^{−1}(x),    −∞ < x < ∞.
(a) Show that this is a cdf.
(b) Find the density function.
(c) Find x such that P (X > x) = .1.
5 Expectation
Let X be a discrete random variable whose possible values are a1, a2, . . ., and let p(x) be the frequency function of X. Then the expectation (expected value or mean) of the random variable X is given by

E[X] = Σ_i ai p(ai).
We often denote the expected value E[X] of X by µ or µX .
Expectation of discrete random variables. For a function g, we can define the expectation of a function of a discrete random variable X by

E[g(X)] = Σ_i g(ai) p(ai).
Suppose that we have two random variables X and Y , and that p(x, y) is their joint frequency function. Then
the expectation of a function g(X, Y) of the two random variables X and Y is defined by

E[g(X, Y)] = Σ_{i,j} g(ai, bj) p(ai, bj),
where the sum is over all the possible values of (X, Y ).
Example 1. A random variable X takes on values 0, 1, 2 with the respective probabilities P (X = 0) = 0.2,
P(X = 1) = 0.5, P(X = 2) = 0.3. Compute E[X] and E[X^2].
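A minimal sketch of Example 1:

from fractions import Fraction as F

# Frequency function from Example 1.
p = {0: F(1, 5), 1: F(1, 2), 2: F(3, 10)}

E_X = sum(a * pa for a, pa in p.items())        # E[X]   = 0(0.2) + 1(0.5) + 2(0.3)
E_X2 = sum(a**2 * pa for a, pa in p.items())    # E[X^2] = 0(0.2) + 1(0.5) + 4(0.3)
print(E_X, E_X2)                                # 11/10 17/10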
5.1 Expectation of continuous random variables
Let X be a continuous random variable, and let f be the pdf of X. Then the expectation of the random variable X is given by

E[X] = ∫_{−∞}^∞ x f(x) dx.    (2)
For a function g, we define the expectation E[g(X)] of the function g(X) of the random variable X by

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx,

if g(x) is integrable with respect to f(x) dx, that is, ∫_{−∞}^∞ |g(x)| f(x) dx < ∞. Furthermore, if we have two random variables X and Y with joint density function f(x, y), then the expectation of the function g(X, Y) of the two random variables becomes

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy,

if g is integrable. For example, now we can define E[X^2], E[XY], E[X + Y] and so on.
Example 2. Let X be a uniform random variable on [a, b]. Compute E[X] and E[X^2].
5.2 Properties of expectation
One can think of the expectation E(X) as “an operation on a random variable X” which returns the average
value for X. It is also useful to observe that this is a “linear operator” satisfying the following properties:
1. Let a be a constant, and let X be a random variable having the pdf f (x). Then we can show that
E[a + X] = ∫_{−∞}^∞ (a + x) f(x) dx = a + ∫_{−∞}^∞ x f(x) dx = a + E[X].
2. Let a and b be scalars, and let X and Y be random variables having the joint density function f (x, y) and
the respective marginal density functions fX (x) and fY (y). Then we can show that
E[aX + bY] = ∫_{−∞}^∞ ∫_{−∞}^∞ (ax + by) f(x, y) dx dy = a ∫_{−∞}^∞ x fX(x) dx + b ∫_{−∞}^∞ y fY(y) dy = aE[X] + bE[Y].
3. Let a and b1, . . . , bn be scalars, and let X1, . . . , Xn be random variables. By applying property 1 and then applying property 2 repeatedly, we can obtain

E[a + Σ_{i=1}^n bi Xi] = a + E[Σ_{i=1}^n bi Xi] = a + b1 E[X1] + E[Σ_{i=2}^n bi Xi] = · · · = a + Σ_{i=1}^n bi E[Xi].

5.3 Expectation with independent random variables
Suppose that X and Y are independent random variables. Then the joint density function f (x, y) of X and Y
can be expressed as
f (x, y) = fX (x)fY (y).
And the expectation of the function of the form g1 (X) × g2 (Y ) is given by
E[g1(X) g2(Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g1(x) g2(y) f(x, y) dx dy = ∫_{−∞}^∞ g1(x) fX(x) dx × ∫_{−∞}^∞ g2(y) fY(y) dy = E[g1(X)] × E[g2(Y)].
5.4 Variance and covariance
The variance of a random variable X, denoted by Var(X) or σ^2, is the expected value of "the squared difference between the random variable and its expected value E(X)." Thus, it is given by

Var(X) := E[(X − E(X))^2] = E[X^2] − (E[X])^2.

The square root √Var(X) of the variance Var(X) is called the standard error (SE) (or standard deviation (SD)) of the random variable X.
Now suppose that we have two random variables X and Y . Then the covariance of two random variables X and
Y can be defined as
Cov(X, Y ) := E((X − µx )(Y − µy )) = E(XY ) − E(X) × E(Y ),
where µx = E(X) and µy = E(Y ). The covariance Cov(X, Y ) measures the strength of the dependence of the
two random variables.
Example 3. Find Var(X) for a uniform random variable X on [a, b].
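A Monte Carlo sketch of Example 3 (the interval a = 0, b = 2 and the sample size are arbitrary illustrative choices; the closed forms (a + b)/2 and (b − a)^2/12 are the standard answers the simulation should approach):

import random

# Monte Carlo check of E[X] and Var(X) for X uniform on [a, b].
random.seed(0)
a, b = 0.0, 2.0
xs = [random.uniform(a, b) for _ in range(200_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 3), round(var, 3))   # close to (a + b)/2 = 1.0 and (b - a)**2/12 = 0.333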
5.5 Properties of variance and covariance
(a) If X and Y are independent, then Cov(X, Y ) = 0 by observing that E[XY ] = E[X] · E[Y ].
(b) In contrast to the expectation, the variance is not a linear operator. For two random variables X and Y , we
have
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).    (3)
However, if X and Y are independent, by observing that Cov(X, Y ) = 0 in (3), we have
Var(X + Y) = Var(X) + Var(Y).    (4)
In general, when we have a sequence of independent random variables X1 , . . . , Xn , the property (4) is extended
to
Var(X1 + · · · + Xn ) = Var(X1 ) + · · · + Var(Xn ).
Variance and covariance under linear transformation. Let a and b be scalars (that is, real-valued constants), and let X be a random variable. Then the variance of aX + b is given by
Var(aX + b) = a2 Var(X).
Now let a1 , a2 , b1 and b2 be scalars, and let X and Y be random variables. Then similarly the covariance of
a1 X + b1 and a2 Y + b2 can be given by
Cov(a1 X + b1 , a2 Y + b2 ) = a1 a2 Cov(X, Y )
5.6 Moment generating function
Let X be a random variable. Then the moment generating function (mgf ) of X is given by
M(t) = E[e^{tX}].

The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as

M(t) = ∫_{−∞}^∞ e^{tx} f(x) dx.
5.7 Properties of moment generating function
(a) The most significant property of moment generating function is that “the moment generating function
uniquely determines the distribution.”
(b) Let a and b be constants, and let MX (t) be the mgf of a random variable X. Then the mgf of the random
variable Y = a + bX can be given as follows.
MY(t) = E[e^{tY}] = E[e^{t(a+bX)}] = e^{at} E[e^{(bt)X}] = e^{at} MX(bt).
(c) Let X and Y be independent random variables having the respective mgf’s MX (t) and MY (t). Recall that
E[g1 (X)g2 (Y )] = E[g1 (X)]·E[g2 (Y )] for functions g1 and g2 . We can obtain the mgf MZ (t) of the sum Z = X +Y
of random variables as follows.
MZ(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] · E[e^{tY}] = MX(t) · MY(t).
(d) When t = 0, it clearly follows that M (0) = 1. Now by differentiating M (t) r times, we obtain
M^(r)(t) = d^r/dt^r E[e^{tX}] = E[d^r/dt^r e^{tX}] = E[X^r e^{tX}].

In particular, when t = 0, M^(r)(0) generates the r-th moment of X as follows:

M^(r)(0) = E[X^r],    r = 1, 2, 3, . . .
Example 4. Find the mgf M(t) for a uniform random variable X on [a, b]. Then derive the derivative M′(0) at t = 0 by using the definition of the derivative and L'Hôpital's rule.
5.8 Exercises
1. If n men throw their hats into a pile and each man takes a hat at random, what is the expected number
of matches?
Hint: Suppose Y is the number of matches. How can you express the random variable Y by using indicator
random variables X1 , . . . , Xn ? What is the event for the indicator random variable Xi ?
2. Let X be a random variable with the pdf f(x) = 2x if 0 ≤ x ≤ 1, and f(x) = 0 otherwise. Then, (a) find E[X], and (b) find E[X^2] and Var(X).
3. Suppose that E[X] = µ and Var(X) = σ^2. Let Z = (X − µ)/σ. Show that E[Z] = 0 and Var(Z) = 1.
4. Recall that the expectation is a "linear operator." By using the definitions of variance and covariance in
terms of the expectation, show that
(a) Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ), and
(b) Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y ).
5. If X and Y are independent random variables with equal variances, find Cov(X + Y, X − Y ).
6. If X and Y are independent random variables and Z = Y − X, find expressions for the covariance of X
and Z in terms of the variance of X and Y .
7. Let X and Y be independent random variables, and let α and β be scalars. Find an expression for the mgf
of Z = αX + βY in terms of the mgf’s of X and Y .
Optional Problems.
1. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that each gunner selects an
aircraft to shoot at independently of the other gunners, and that each gunner hits the selected aircraft with
probability p. Find the expected number of aircraft hit by the gunners.
Hint: For each i = 1, . . . , n, let Xi be the indicator random variable of the event that the i-th aircraft was hit. Explain how you can verify P(Xi = 0) = (1 − p/n)^m by yourself, and use it to find the expectation of interest.
6 Binomial and Related Distributions
An experiment involving a trial that results in one of two possible outcomes, say "success" or "failure," and that is
repeatedly performed, is called a binomial experiment. Tossing a coin is an example of such a trial where head
or tail are the two possible outcomes.
6.1 Relation between Bernoulli and binomial random variable
A binomial random variable can be expressed in terms of n Bernoulli random variables. If X1, X2, . . . , Xn are independent Bernoulli random variables with success probability p, then the sum of those random variables

Y = Σ_{i=1}^n Xi

is distributed as the binomial distribution with parameter (n, p).
Sum of independent binomial random variables. If X and Y are independent binomial random variables
with respective parameters (n, p) and (m, p), then the sum X + Y is distributed as the binomial distribution
with parameter (n + m, p).
Rough explanation. Observe that we can express X = Σ_{i=1}^n Zi and Y = Σ_{i=n+1}^{n+m} Zi in terms of independent Bernoulli random variables Zi's with success probability p. Then the resulting sum X + Y = Σ_{i=1}^{n+m} Zi must be a binomial random variable with parameter (n + m, p).
6.2 Expectation and variance of binomial distribution
Let X be a Bernoulli random variable with success probability p. Then the expectation of X becomes
E[X] = 0 × (1 − p) + 1 × p = p.
By using E[X] = p, we can compute the variance as follows:
Var(X) = (0 − p)^2 × (1 − p) + (1 − p)^2 × p = p(1 − p).
Now let Y be a binomial random variable with parameter (n, p). Recall that Y can be expressed as the sum Σ_{i=1}^n Xi of independent Bernoulli random variables X1, . . . , Xn with success probability p. By using the properties of expectation and variance, we obtain

E[Y] = E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi] = np;

Var(Y) = Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) = np(1 − p).
Moment generating functions. Let X be a Bernoulli random variable with success probability p. Then the
mgf MX (t) is given by
MX(t) = e^{(0)t} × (1 − p) + e^{(1)t} × p = (1 − p) + p e^t.

Then we can derive the mgf of a binomial random variable Y = Σ_{i=1}^n Xi from the mgf of a Bernoulli random variable by applying the mgf property repeatedly:

MY(t) = MX1(t) × MX2(t) × · · · × MXn(t) = [(1 − p) + p e^t]^n.
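A quick numerical check of E[Y] = np and Var(Y) = np(1 − p) against the binomial frequency function C(n, k) p^k (1 − p)^{n−k} (the values n = 10, p = 0.3 are illustrative):

import math

# Direct check of the binomial mean and variance formulas.
n, p = 10, 0.3
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum(k**2 * pmf[k] for k in range(n + 1)) - mean**2
print(round(mean, 6), round(var, 6))   # 3.0 and 2.1, i.e. np and np(1 - p)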
6.3 Geometric distribution
For 0 < a < 1, we can establish the following formulas.
Σ_{k=1}^n a^{k−1} = (1 − a^n)/(1 − a);

Σ_{k=1}^∞ a^{k−1} = 1/(1 − a);    (5)

Σ_{k=1}^∞ k a^{k−1} = d/dx [Σ_{k=0}^∞ x^k]|_{x=a} = d/dx [1/(1 − x)]|_{x=a} = 1/(1 − a)^2;    (6)

Σ_{k=1}^∞ k^2 a^{k−1} = d/dx [Σ_{k=1}^∞ k x^k]|_{x=a} = d/dx [x/(1 − x)^2]|_{x=a} = 1/(1 − a)^2 + 2a/(1 − a)^3.
Geometric random variables. In sampling independent Bernoulli trials, each with probability of success p,
the frequency function of the trial number of the first success is
p(k) = (1 − p)^{k−1} p,    k = 1, 2, . . . .
This is called a geometric distribution.
By using geometric series formula (5) we can immediately see the following:
Σ_{k=1}^∞ p(k) = p Σ_{k=1}^∞ (1 − p)^{k−1} = p/(1 − (1 − p)) = 1;

M(t) = Σ_{k=1}^∞ e^{kt} p(k) = p e^t Σ_{k=1}^∞ [(1 − p)e^t]^{k−1} = p e^t/(1 − (1 − p)e^t).
Example 3. Consider a roulette wheel consisting of 38 numbers—1 through 36, 0, and double 0. If Mr. Smith
always bets that the outcome will be one of the numbers 1 through 12, what is the probability that (a) his first
win will occur on his fourth bet; (b) he will lose his first 5 bets?
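A small sketch of Example 3 (the computation assumes the geometric model above with p = 12/38):

from fractions import Fraction

# Example 3: each bet wins with probability p = 12/38.
p = Fraction(12, 38)

first_win_on_4th = (1 - p) ** 3 * p        # geometric frequency function at k = 4
lose_first_5 = (1 - p) ** 5                # no success in the first five trials
print(float(first_win_on_4th), float(lose_first_5))   # about 0.1011 and 0.1500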
Expectation and variance of geometric distributions. Let X be a geometric random variable with success
probability p. To calculate the expectation E[X] and the variance Var(X), we apply geometric series formula
(6) to obtain
E[X] = Σ_{k=1}^∞ k p(k) = p Σ_{k=1}^∞ k(1 − p)^{k−1} = 1/p;

Var(X) = E[X^2] − (E[X])^2 = Σ_{k=1}^∞ k^2 p(k) − 1/p^2 = p[1/p^2 + 2(1 − p)/p^3] − 1/p^2 = (1 − p)/p^2.
6.4 Negative binomial distribution
Given independent Bernoulli trials with probability of success p, the frequency function of the number of trials
until the r-th success is
p(k) = C(k − 1, r − 1) p^r (1 − p)^{k−r},    k = r, r + 1, . . . .
This is called a negative binomial distribution with parameter (r, p).
The geometric distribution is a special case of the negative binomial distribution with r = 1. Moreover, if X1, . . . , Xr are independent and identically distributed (iid) geometric random variables with parameter p, then the sum

Y = Σ_{i=1}^r Xi    (7)
becomes a negative binomial random variable with parameter (r, p).
Expectation, variance and mgf of negative binomial distribution. By using the sum of iid geometric rv's (see Equation 7) we can compute the expectation, the variance, and the mgf of the negative binomial random variable Y.

E[Y] = E[Σ_{i=1}^r Xi] = Σ_{i=1}^r E[Xi] = r/p;

Var(Y) = Var(Σ_{i=1}^r Xi) = Σ_{i=1}^r Var(Xi) = r(1 − p)/p^2;

MY(t) = MX1(t) × MX2(t) × · · · × MXr(t) = [p e^t/(1 − (1 − p)e^t)]^r.
Example 4. What is the average number of times one must throw a die until the outcome “1” has occurred 4
times?
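A minimal sketch of Example 4 (the truncation point in the sanity check is an arbitrary large value):

import math
from fractions import Fraction

# Example 4: throws of a fair die until the outcome "1" occurs r = 4 times.
r, p = 4, Fraction(1, 6)
print(r / p)           # E[Y] = r/p = 24 throws on average

# Sanity check against the negative binomial frequency function, truncated at large k.
mean = sum(k * math.comb(k - 1, r - 1) * float(p) ** r * (1 - float(p)) ** (k - r)
           for k in range(r, 2000))
print(round(mean, 2))  # close to 24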
6.5 Exercises
1. Appending three extra bits to a 4-bit word in a particular way (a Hamming code) allows detection and
correction of up to one error in any of the bits. (a) If each bit has probability .05 of being changed during
communication, and the bits are changed independently of each other, what is the probability that the
word is correctly received (that is, 0 or 1 bit is in error)? (b) How does this probability compare to the
probability that the word will be transmitted correctly with no check bits, in which case all four bits would
have to be transmitted correctly for the word to be correct?
2. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?
3. Two boys play basketball in the following way. They take turns shooting and stop when a basket is made.
Player A goes first and has probability p1 of making a basket on any throw. Player B, who shoots
second, has probability p2 of making a basket. The outcomes of the successive trials are assumed to be
independent.
(a) Find the frequency function for the total number of attempts.
(b) What is the probability that player A wins?
4. Suppose that in a sequence of independent Bernoulli trials each with probability of success p, the number
of failures up to the first success is counted. (a) What is the frequency function for this random variable?
(b) Find the frequency function for the number of failures up to the rth success.
Hint: In (b) let X be the number of failures up to the rth success, and let Y be the number of trials until
the r-th success. What is the relation between the random variables X and Y ?
5. Find an expression for the cumulative distribution function of a geometric random variable.
6. In a sequence of independent trials with probability p of success, what is the probability that there are
exactly r successes before the kth failure?
Optional Problems.
1. (Banach Match Problem) A pipe smoker carries one box of matches in his left pocket and one in his right.
Initially, each box contains n matches. If he needs a match, the smoker is equally likely to choose either
pocket. What is the frequency function for the number of matches in the other box when he first discovers
that one box is empty?
Hint: Let Ak denote the event that the pipe smoker first discovers that the right-hand box is empty and
that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he
first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the
time. Then use Ak ’s and Bk ’s to express the frequency function of interest.
7 Poisson Distribution
The number e = lim_{n→∞} (1 + 1/n)^n = 2.7182 · · · is known as the base of the natural logarithm (also called Napier's number), and is associated with the Taylor series

e^x = Σ_{k=0}^∞ x^k/k!.    (8)
The Poisson distribution with parameter λ > 0 has the frequency function
p(k) = e^{−λ} λ^k/k!,    k = 0, 1, 2, . . .
By applying Taylor series (8) we can immediately see the following:
Σ_{k=0}^∞ p(k) = e^{−λ} Σ_{k=0}^∞ λ^k/k! = e^{−λ} e^λ = 1;

M(t) = Σ_{k=0}^∞ e^{kt} p(k) = e^{−λ} Σ_{k=0}^∞ (e^t λ)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.
Example 1. Suppose that the number of typographical errors per page has a Poisson distribution with parameter
λ = 0.5. Find the probability that there is at least one error on a single page.
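A one-line sketch of Example 1:

import math

# Example 1: number of typographical errors per page is Poisson with lambda = 0.5.
lam = 0.5
p_at_least_one = 1 - math.exp(-lam) * lam**0 / math.factorial(0)   # 1 - p(0)
print(round(p_at_least_one, 4))                                    # 0.3935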
7.1 Poisson approximation to binomial distribution
The Poisson distribution may be used as an approximation for a binomial distribution with parameter (n, p)
when n is large and p is small enough so that np is a moderate number λ. Let λ = np. Then the binomial
frequency function becomes
p(k) = [n!/(k!(n − k)!)] (λ/n)^k (1 − λ/n)^{n−k}
     = (λ^k/k!) · [n!/((n − k)! n^k)] · (1 − λ/n)^n · (1 − λ/n)^{−k}
     → (λ^k/k!) · 1 · e^{−λ} · 1 = e^{−λ} λ^k/k!    as n → ∞.
Example 2. Suppose that a microchip is defective with probability 0.02. Find the probability that a sample
of 100 microchips will contain at most 1 defective microchip.
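A short sketch comparing the exact binomial probability of Example 2 with its Poisson approximation:

import math

# Example 2: n = 100 chips, each defective with probability p = 0.02, so lambda = np = 2.
n, p = 100, 0.02
lam = n * p

binom = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(2))
poisson = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(2))
print(round(binom, 4), round(poisson, 4))   # 0.4033 versus 0.406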
7.2 Expectation and variance
Let Xn , n = 1, 2, . . ., be a sequence of binomial random variables with parameter (n, λ/n). Then we regard the
limit of the sequence as a Poisson random variable Y , and use the limit to find E[Y ] and Var(Y ). Thus, we
obtain
E[Xn] = λ → E[Y] = λ    as n → ∞;

Var(Xn) = λ(1 − λ/n) → Var(Y) = λ    as n → ∞.
Sum of independent Poisson random variables. The sum of independent Poisson random variables is a
Poisson random variable: If X and Y are independent Poisson random variables with respective parameters λ1 and λ2, then Z = X + Y is distributed as the Poisson distribution with parameter λ1 + λ2.
Rough explanation. Suppose that Xn and Yn are independent binomial random variables with respective parameter (kn , pn ) and (ln , pn ) for each n = 1, 2, . . .. Then the sum Xn + Yn of the random variables has the binomial
distribution with parameter (kn + ln , pn ). By letting kn pn → λ1 and ln pn → λ2 with kn , ln → ∞ and pn → 0,
the respective limits of Xn and Yn are Poisson random variables X and Y with respective parameters λ1 and
λ2 . Moreover, the limit of Xn + Yn is the sum of the random variables X and Y , and has a Poisson distribution
with parameter λ1 + λ2 .
7.3 Exercises
1. Use the mgf to show that the expectation and the variance of a Poisson random variable are the same and
equal to the parameter λ.
2. The probability of being dealt a royal straight flush (ace, king, queen, jack, and ten of the same suit) in
poker is about 1.54 × 10−6 . Suppose that an avid poker player sees 100 hands a week, 52 weeks a year, for
20 years.
(a) What is the probability that she never sees a royal straight flush dealt?
(b) What is the probability that she sees at least two royal straight flushes dealt?
3. Professor Rice was told that he has only 1 chance in 10,000 of being trapped in a much-maligned elevator
in the mathematics building. Suppose he goes to work 5 days a week, 52 weeks a year, for 10 years and always rides the elevator up to his office when he first arrives. What is the probability that he will never be
trapped? That he will be trapped once? Twice? Assume that the outcomes on all the days are mutually
independent.
4. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members of the population are
affected independently, find the probability of k cases in a population of 100,000 for k = 0, 1, 2.
5. Suppose that in a city the number of suicides can be approximated by a Poisson random variable with
λ = 0.33 per month.
(a) What is the distribution for the number of suicides per year? What is the average number of suicides
in one year?
(b) What is the probability of two suicides in one month?
Optional Problems.
1. Let X be a negative binomial random variable with parameter (r, p).
(a) When r = 2 and p = 0.2, find the probability P (X ≤ 4).
(b) Let Y be a binomial random variable with parameter (n, p). When n = 4 and p = 0.2, find the
probability P (Y ≥ 2).
(c) By comparing the results above, what can you find about the relationship between P (X ≤ 4) and
P (Y ≥ 2)? Generalize the relationship to the one between P (X ≤ n) and P (Y ≥ 2), and find the
probability P (X ≤ 10) when r = 2 and p = 0.2.
(d) By using the Poisson approximation, find the probability P (X ≤ 1000) when r = 2 and p = 0.001.
8 Solutions to Exercises
8.1 Binomial and related distributions
1. The number X of bit errors has a binomial distribution with n = 4 and p = 0.05.
(a) P(X ≤ 1) = C(4, 0)(0.95)^4 + C(4, 1)(0.05)(0.95)^3 ≈ 0.986.
(b) P(X = 0) = C(4, 0)(0.95)^4 ≈ 0.815.
2. "9 heads in 10 tosses" [C(10, 9)(1/2)^10 ≈ 9.77 × 10^{−3}] is more likely than "18 heads in 20 tosses" [C(20, 18)(1/2)^20 ≈ 1.81 × 10^{−4}].
3. Let X be the total number of attempts.
(a) P(X = n) = (1 − p1)^{k−1}(1 − p2)^{k−1} p1 if n = 2k − 1, and P(X = n) = (1 − p1)^k(1 − p2)^{k−1} p2 if n = 2k.
(b) P({Player A wins}) = Σ_{k=1}^∞ P(X = 2k − 1) = p1 Σ_{k=1}^∞ [(1 − p1)(1 − p2)]^{k−1} = p1/(p1 + p2 − p1 p2).
4. Let X be the number of failures up to the rth success, and let Y be the number of trials until the r-th
success. Then Y has a negative binomial distribution, and has the relation X = Y − r.
(a) Here r = 1 (that is, Y has a geometric distribution), and P(X = k) = P(Y = k + 1) = (1 − p)^k p.
(b) P(X = k) = P(Y = k + r) = C(k + r − 1, r − 1) p^r (1 − p)^k.
5. Let X be a geometric random variable.
P(X ≤ k) = Σ_{i=1}^k p(1 − p)^{i−1} = 1 − (1 − p)^k.
6. Let X be the number of trials up to the k-th failure. Then X has a negative binomial distribution with
“success probability” (1 − p).
P({exactly r successes before the k-th failure}) = P(X = r + k) = C(k + r − 1, k − 1)(1 − p)^k p^r.
Optional Problem (Banach Match Problem). Let X be the number of matches in the other box when he first
discovered that one box is empty. Furthermore, let Ak denote the event that the pipe smoker first discovers that
the right-hand box is empty and that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the time. Then we can observe that P(Ak) = P(Bk) and P(X = k) = P(Ak) + P(Bk) = 2P(Ak).
Now in order for Ak to happen he has to make exactly (2n + 1 − k) attempts, choosing the right-hand box on the (n + 1)st of them. Thus, P(Ak) = C(2n − k, n)(1/2)^{2n+1−k}, and therefore we obtain P(X = k) = C(2n − k, n)(1/2)^{2n−k}.
8.2 Poisson distribution
1. We first calculate M′(t) = λe^t e^{λ(e^t−1)} and M″(t) = (λe^t)^2 e^{λ(e^t−1)} + λe^t e^{λ(e^t−1)}. Then we obtain E[X] = M′(0) = λ and E[X^2] = M″(0) = λ^2 + λ. Finally we get Var(X) = E[X^2] − (E[X])^2 = λ.
2. The number X of royal straight flushes has approximately a Poisson distribution with λ = (1.54 × 10^{−6}) × (100 × 52 × 20) ≈ 0.16.
(a) P(X = 0) = e^{−λ} ≈ 0.852.
(b) P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − (e^{−λ} + λe^{−λ}) ≈ 0.0115.
3. The number X of times Professor Rice is trapped has approximately a Poisson distribution with λ = (1/10000) × (5 × 52 × 10) = 0.26. Then P(X = 0) = e^{−λ} ≈ 0.77, P(X = 1) = λe^{−λ} ≈ 0.20, and P(X = 2) = (λ^2/2)e^{−λ} ≈ 0.026.
4. The number X of cases of this disease has approximately a Poisson distribution with λ = (1/1000) × 100000 = 100. Then it has the frequency function p(k) ≈ e^{−100} 100^k/k!. And we obtain p(0) ≈ 3 × 10^{−44}, p(1) ≈ 4 × 10^{−42}, and p(2) ≈ 2 × 10^{−40}.
5. Let Xi be the number of suicides in the i-th month, and let Y be the number of suicides in a year. Then Y = X1 + · · · + X12 has a Poisson distribution with λ = 12 × 0.33 = 3.96. Thus, (a) E[Y] = 3.96 and (b) P(X1 = 2) = (0.33^2/2)e^{−0.33} ≈ 0.039.
Optional Problem.
(a) P(X ≤ 4) = Σ_{k=2}^4 C(k − 1, 1)(0.2)^2(0.8)^{k−2} = 0.1808.
(b) P(Y ≥ 2) = 1 − Σ_{k=0}^1 C(4, k)(0.2)^k(0.8)^{4−k} = 0.1808.
(c) By using P(X ≤ n) = P(Y ≥ 2), P(X ≤ 10) = P(Y ≥ 2) = 1 − Σ_{k=0}^1 C(10, k)(0.2)^k(0.8)^{10−k} ≈ 0.6242.
(d) By using P(X ≤ n) = P(Y ≥ 2) and the Poisson approximation with λ = np = 1, P(X ≤ 1000) = P(Y ≥ 2) ≈ 1 − e^{−λ} − λe^{−λ} ≈ 0.2642.
9 Summary Diagram
The discrete distributions treated so far are summarized below; the numbered relations (1)–(7) among them are explained in the list that follows.

Bernoulli trial: P(X = 1) = p and P(X = 0) = q := 1 − p, where p is the probability of success and q := 1 − p the probability of failure; E[X] = p, Var(X) = pq.

Binomial(n, p): P(X = i) = C(n, i) p^i q^{n−i}, i = 0, . . . , n; E[X] = np, Var(X) = npq.

Geometric(p): P(X = i) = q^{i−1} p, i = 1, 2, . . .; E[X] = 1/p, Var(X) = q/p^2.

Negative binomial(r, p): P(X = i) = C(i − 1, r − 1) p^r q^{i−r}, i = r, r + 1, . . .; E[X] = r/p, Var(X) = rq/p^2.

Poisson(λ): P(X = i) = e^{−λ} λ^i/i!, i = 0, 1, . . .; E[X] = λ, Var(X) = λ.
1. X is the number of Bernoulli trials until the first success occurs.
2. X is the number of Bernoulli trials until the rth success occurs. If X1, · · · , Xr are independent geometric random variables with parameter p, then X = Σ_{k=1}^r Xk becomes a negative binomial random variable with parameter (r, p).
3. r = 1.
4. X is the number of successes in n Bernoulli trials. If X1, · · · , Xn are independent Bernoulli trials, then X = Σ_{k=1}^n Xk becomes a binomial random variable with parameter (n, p).
5. If X1 and X2 are independent binomial random variables with respective parameters (n1 , p) and (n2 , p),
then X = X1 + X2 is also a binomial random variable with (n1 + n2 , p).
6. Poisson approximation is used for binomial distribution by letting n → ∞ and p → 0 while λ = np.
7. If X1 and X2 are independent Poisson random variables with respective parameters λ1 and λ2 , then
X = X1 + X2 is also a Poisson random variable with λ1 + λ2 .
10 Exponential and Gamma Distributions
10.1 Exponential distribution
The exponential density is defined by

f(x) = λe^{−λx} if x ≥ 0, and f(x) = 0 if x < 0.

Then the cdf is computed as

F(t) = 1 − e^{−λt} if t ≥ 0, and F(t) = 0 if t < 0.
Memoryless property. The function S(t) = 1 − F (t) = P (X > t) is called the survival function. When F (t)
is an exponential distribution, we have S(t) = e^{−λt} for t ≥ 0. Furthermore, we can find that

P(X > t + s | X > s) = P(X > t + s) / P(X > s) = S(t + s) / S(s) = S(t) = P(X > t),    (9)

which is referred to as the memoryless property of the exponential distribution.
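A quick numerical sketch of the memoryless property (9) (the values of λ, s, and t are arbitrary illustrative choices):

import math

# Numerical check of the memoryless property for an exponential survival function.
lam, s, t = 0.2, 3.0, 5.0
S = lambda u: math.exp(-lam * u)      # S(u) = P(X > u)

left = S(t + s) / S(s)                # P(X > t + s | X > s)
right = S(t)                          # P(X > t)
print(math.isclose(left, right))      # True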
10.2 Gamma distribution
The gamma function, denoted by Γ(α), is defined as
Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du,    α > 0.

Then the gamma density is defined as

f(x) = [λ^α/Γ(α)] x^{α−1} e^{−λx},    x ≥ 0,
which depends on two parameters α > 0 and λ > 0. We call the parameter α a shape parameter, because changing
α changes the shape of the density. We call the parameter λ a scale parameter, because changing λ merely rescales
the density without changing its shape. This is equivalent to changing the units of measurement (feet to meters,
or seconds to minutes). Suppose that the parameter α is an integer n. Then we have Γ(n) = (n − 1)!. In
particular, the gamma distribution with α = 1 becomes an exponential distribution (with parameter λ).
Chi-square distribution. The gamma distribution with α = n/2 and λ = 1/2 is called the chi-square distribution with n degrees of freedom.
Expectation, variance and mgf of a gamma random variable. Let X be a gamma random variable with parameter (α, λ). For t < λ, the mgf MX(t) of X can be computed as follows.

MX(t) = ∫_0^∞ e^{tx} [λ^α/Γ(α)] x^{α−1} e^{−λx} dx = [λ^α/Γ(α)] ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx
      = [λ^α/((λ − t)^α Γ(α))] ∫_0^∞ ((λ − t)x)^{α−1} e^{−(λ−t)x} (λ − t) dx = [λ/(λ − t)]^α [1/Γ(α)] ∫_0^∞ u^{α−1} e^{−u} du = [λ/(λ − t)]^α.

Then we can find the first and second moments

E[X] = M′X(0) = α/λ   and   E[X^2] = M″X(0) = α(α + 1)/λ^2.

Thus, we can obtain Var(X) = α(α + 1)/λ^2 − α^2/λ^2 = α/λ^2.

10.3 Exercises
1. Let X be a gamma random variable with parameter (α, λ), and let Y = cX with c > 0.
(a) Show that P(Y ≤ t) = ∫_{−∞}^t f(x) dx, where

f(x) = [(λ/c)^α/Γ(α)] x^{α−1} e^{−(λ/c)x}.

(b) Conclude that Y is a gamma random variable with parameter (α, λ/c). Thus, only λ is affected by the transformation Y = cX, which justifies calling λ a scale parameter.
2. Suppose that the lifetime of an electronic component follows an exponential distribution with λ = 0.2.
(a) Find the probability that the lifetime is less than 10.
(b) Find the probability that the lifetime is between 5 and 15.
3. X is an exponential random variable, and P (X < 1) = 0.05. What is λ?
4. The lifetime in months of each component in a microchip is exponentially distributed with parameter µ = 1/10000.
(a) For a particular component, we want to find the probability that the component breaks down before 5 months, and the probability that the component breaks down before 50 months. Use the approximation e^{−x} ≈ 1 − x for a small number x to calculate these probabilities.
(b) Suppose that the microchip contains 1000 such components and is operational until three of these components break down. The lifetimes of these components are independent and exponentially distributed with parameter µ = 1/10000. By using the Poisson approximation, find the probability that the microchip operates functionally for 5 months, and the probability that the microchip operates functionally for 50 months.
5. The length of a phone call can be represented by an exponential distribution. Then answer the following
questions.
(a) A report says that 50% of phone call ends in 20 minutes. Assuming that this report is true, calculate
the parameter λ for the exponential distribution.
(b) Suppose that someone arrived immediately ahead of you at a public telephone booth. Using the result
of (a), find the probability that you have to wait between 10 and 20 minutes.
Optional Problems.
1. Suppose that a non-negative random variable X has the memoryless property
P (X > t + s | X > s) = P (X > t).
By completing the following questions, we will show that X must be an exponential random variable.
(a) Let S(t) be the survival function of X. Show that the memoryless property implies that S(s + t) = S(s)S(t) for s, t ≥ 0.
(b) Let κ = S(1). Show that κ > 0.
(c) Show that S(1/n) = κ^{1/n} and S(m/n) = κ^{m/n}.
Therefore, we can obtain S(t) = κ^t for any t ≥ 0. By letting λ = − ln κ, we can write S(t) = e^{−λt}.
2. The gamma function is a generalized factorial function.
(a) Show that Γ(1) = 1.
(b) Show that Γ(α + 1) = αΓ(α). Hint: Use integration by parts.
(c) Conclude that Γ(n) = (n − 1)! for n = 1, 2, 3, . . ..
(d) Use the formula Γ(1/2) = √π to show that if n is an odd integer then we have

Γ(n/2) = √π (n − 1)! / [2^{n−1} ((n − 1)/2)!].
11 Normal Distributions
A normal distribution is represented by a family of distributions which have the same general shape, sometimes
described as “bell shaped.” The normal distribution has the pdf
f(x) := [1/(σ√(2π))] exp(−(x − µ)^2/(2σ^2)),    (10)

which depends upon two parameters µ and σ^2. Here π = 3.14159 . . . is the famous "pi" (the ratio of the circumference of a circle to its diameter), and exp(u) is the exponential function e^u with the base e = 2.71828 . . . of the natural logarithm.
Some much-celebrated integrations are the following:

∫_{−∞}^∞ [1/(σ√(2π))] e^{−(x−µ)^2/(2σ^2)} dx = 1,    (11)

∫_{−∞}^∞ [1/(σ√(2π))] x e^{−(x−µ)^2/(2σ^2)} dx = µ,

∫_{−∞}^∞ [1/(σ√(2π))] (x − µ)^2 e^{−(x−µ)^2/(2σ^2)} dx = σ^2.
The integration (11) guarantees that the density function (10) always represents a "probability density" no matter what values are chosen for the parameters µ and σ^2. We say that a random variable X is normally distributed with parameter (µ, σ^2) when X has the density function (10). The parameter µ, called a mean (or location
parameter), provides the center of the density, and the density function f (x) is symmetric around µ. The
parameter σ is a standard deviation (or, a scale parameter); small values of σ lead to high peaks but sharp drops.
Larger values of σ lead to flatter densities. The shorthand notation
X ∼ N(µ, σ^2)

is often used to express that X is a normal random variable with parameter (µ, σ^2).
11.1 Linear transform of normal random variable
One of the important properties of the normal distribution is that if X is a normal random variable with parameter (µ, σ^2) [that is, the pdf of X is given by the density function (10)], then Y = aX + b is also a normal random variable having parameter (aµ + b, (aσ)^2). In particular,

(X − µ)/σ    (12)

becomes a normal random variable with parameter (0, 1), called the z-score. Then the normal density φ(x) with parameter (0, 1) becomes

φ(x) := (1/√(2π)) e^{−x^2/2},

which is known as the standard normal density. We define the cdf Φ by

Φ(t) := ∫_{−∞}^t φ(x) dx.
11.2 Moment generating function
Let X be a standard normal random variable. Then we can compute the mgf of X as follows.
MX(t) = (1/√(2π)) ∫_{−∞}^∞ e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^∞ e^{−(1/2)(x−t)^2 + t^2/2} dx = e^{t^2/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)^2/2} dx = exp(t^2/2).

Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf M(t) of Y can be given by

M(t) = e^{µt} MX(σt) = exp(σ^2 t^2/2 + µt).

Then we can find the first and second moments

E[Y] = M′(0) = µ   and   E[Y^2] = M″(0) = µ^2 + σ^2.

Thus, we can obtain Var(Y) = (µ^2 + σ^2) − µ^2 = σ^2.
11.3 Calculating probability
Suppose that a tomato plant height X is normally distributed with parameter (µ, σ^2). Then what is the probability that the tomato plant height is between a and b?
How to calculate. The integration
P(a ≤ X ≤ b) = ∫_a^b [1/(σ√(2π))] e^{−(x−µ)^2/(2σ^2)} dx
seems too complicated. But if we consider the random variable (X − µ)/σ, then the event {a ≤ X ≤ b} is equivalent to the event

(a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ.

In terms of probability, this means that

P(a ≤ X ≤ b) = P((a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ).    (13)

And it is easy to calculate

a′ = (a − µ)/σ   and   b′ = (b − µ)/σ.

Since the z-score (12) is distributed as the standard normal distribution, the above probability becomes

P(a ≤ X ≤ b) = P(a′ ≤ (X − µ)/σ ≤ b′) = ∫_{a′}^{b′} φ(x) dx = Φ(b′) − Φ(a′).    (14)
Then we will be able to look at the values for Φ(a0 ) and Φ(b0 ) from the table of standard normal distribution.
For example, if a0 = −0.19 and b0 = 0.29, then Φ(−0.19) = 0.4247 and Φ(0.29) = 0.6141, and the probability of
interest becomes 0.1894, or approximately 0.19.
Suppose that a random variable X is normally distributed with mean µ and standard deviation σ. Then we can obtain
1. P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ);
2. P(X ≤ b) = Φ((b − µ)/σ);
3. P(a ≤ X) = 1 − Φ((a − µ)/σ).
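The z-score reduction (13)–(14) and the three formulas above are easy to reproduce numerically. The following is a small added sketch, assuming Python with scipy is available; the inputs are hypothetical and chosen so that a′ = −0.19 and b′ = 0.29, matching the worked example.

from scipy.stats import norm

def normal_interval_prob(a, b, mu, sigma):
    # P(a <= X <= b) for X ~ N(mu, sigma^2), via the z-scores a' and b' as in (14)
    a0 = (a - mu) / sigma
    b0 = (b - mu) / sigma
    return norm.cdf(b0) - norm.cdf(a0)

# Hypothetical parameters giving a' = -0.19 and b' = 0.29:
print(normal_interval_prob(a=9.81, b=10.29, mu=10.0, sigma=1.0))   # about 0.1894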
11.4 Exercises
1. Let X be a normal random variable with parameter (µ, σ²), and let Y = aX + b with a ≠ 0.
(a) Show that
P(Y ≤ t) = ∫_{−∞}^{t} f(x) dx,
where
f(x) = (1/(|a|σ√(2π))) exp(−(x − aµ − b)²/(2(aσ)²)).
(b) Conclude that Y is a normal random variable with parameter (aµ + b, (aσ)²).
2. Let X be a normal random variable with parameter (µ, σ²), and let Y = e^X. Find the density function f of Y by forming
P(Y ≤ t) = ∫_0^t f(x) dx.
This density function f is called the lognormal density since ln Y is normally distributed.
3. Show that the standard normal density integrates to unity via (a) and (b).
(a) Show that ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(−(x²+y²)/2) dx dy = ∫_0^∞ ∫_0^{2π} e^(−r²/2) r dθ dr = 2π.
(b) Show that (1/(σ√(2π))) ∫_{−∞}^{∞} e^(−(x−µ)²/(2σ²)) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−y²/2) dy = 1.
4. Suppose that in a certain population, individuals' heights are approximately normally distributed with parameters µ = 70 in and σ = 3 in.
(a) What proportion of the population is over 6ft. tall?
(b) What is the distribution of heights if they are expressed in centimeters?
5. Let X be a normal random variable with µ = 5 and σ = 10. Find (a) P (X > 10), (b) P (−20 < X < 15),
and (c) the value of x such that P (X > x) = 0.05.
6. If X ∼ N (µ, σ 2 ), then show that P (|X − µ| ≤ 0.675σ) = 0.5.
7. If X ∼ N(µ, σ²), find the value of c in terms of σ such that P(µ − c ≤ X ≤ µ + c) = 0.95.
12 Solutions to exercises
12.1 Exponential and gamma distributions
1. (a) P(Y ≤ t) = P(cX ≤ t) = P(X ≤ t/c) = ∫_{−∞}^{t/c} (λ^α/Γ(α)) x^(α−1) e^(−λx) dx = ∫_{−∞}^{t} ((λ/c)^α/Γ(α)) y^(α−1) e^(−(λ/c)y) dy by substituting x = y/c.
(b) The integrand f(y) is a gamma density with (α, λ/c).
2. Let X be the lifetime of the electronic component.
(a) P (X ≤ 10) = 1 − e−(0.2)(10) ≈ 0.8647.
(b) P (5 ≤ X ≤ 15) = P (X ≤ 15) − P (X ≤ 5) = e−(0.2)(5) − e−(0.2)(15) ≈ 0.3181.
3. By solving P (X < 1) = 1 − e−λ = 0.05, we obtain λ = − ln 0.95 ≈ 0.0513.
4. (a) Let p1 be the probability that the component breaks down before 5 months, and let p2 be the probability that the component breaks down before 50 months. Then,
p1 = 1 − e^(−5µ) ≈ 1 − (1 − 0.0005) = 0.0005;
p2 = 1 − e^(−50µ) ≈ 1 − (1 − 0.005) = 0.005.
(b) Let X1 be the number of components breaking down before 5 months, and let X2 be the number of components breaking down before 50 months. Then Xi has the binomial distribution with parameter (pi, 1000) for i = 1, 2. By using the Poisson approximation with λ1 = 1000p1 = 0.5 and λ2 = 1000p2 = 5, we can obtain
P(X1 ≤ 2) = e^(−λ1) (1 + λ1 + λ1²/2) ≈ 0.986;
P(X2 ≤ 2) = e^(−λ2) (1 + λ2 + λ2²/2) ≈ 0.125.
5. Let X be the length of a phone call in minutes.
(a) By solving P(X ≤ 20) = 1 − e^(−20λ) = 0.5, we obtain λ = −(ln 0.5)/20 ≈ 0.0347.
(b) P(10 ≤ X ≤ 20) = (1 − e^(−20λ)) − (1 − e^(−10λ)) ≈ 0.207.
Optional Problems.
1. (a) Since P(X > t + s | X > s) = P(X > t + s)/P(X > s) = S(t + s)/S(s), we obtain S(t) = S(t + s)/S(s).
(b) Let κ = S(1) = S(1/2 + 1/2) = [S(1/2)]² > 0.
(c) Since κ = S(1) = S(1/n + · · · + 1/n) = [S(1/n)]^n, we have S(1/n) = κ^(1/n). Furthermore, S(m/n) = S(1/n + · · · + 1/n) = [S(1/n)]^m = κ^(m/n).
2. (a) Γ(1) = ∫_0^∞ e^(−u) du = 1.
(b) Γ(α + 1) = ∫_0^∞ u^α e^(−u) du = [−u^α e^(−u)]_0^∞ + α ∫_0^∞ u^(α−1) e^(−u) du = αΓ(α).
(c) Γ(n) = (n − 1)Γ(n − 1) = · · · = (n − 1)!
(d) Γ(n/2) = (n/2 − 1)(n/2 − 2) · · · (1/2)Γ(1/2) = (n − 2)(n − 4) · · · (1)Γ(1/2)/2^((n−1)/2) = √π (n − 1)!/(2^(n−1) ((n−1)/2)!).
12.2 Normal distributions
1. (a) Assume a > 0. Then
P(Y ≤ t) = P(aX + b ≤ t) = P(X ≤ (t − b)/a) = ∫_{−∞}^{(t−b)/a} (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx
= ∫_{−∞}^{t} (1/(aσ√(2π))) exp(−(y − aµ − b)²/(2(aσ)²)) dy
by substituting x = (y − b)/a. When a < 0, we can similarly obtain
P(Y ≤ t) = ∫_{−∞}^{t} (1/(|a|σ√(2π))) exp(−(y − aµ − b)²/(2(aσ)²)) dy.
(b) The integrand f(y) is a normal density with (aµ + b, (aσ)²).
2. For t > 0 we obtain
P(Y ≤ t) = P(e^X ≤ t) = P(X ≤ ln t) = ∫_{−∞}^{ln t} (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx = ∫_0^t (1/(yσ√(2π))) exp(−(ln y − µ)²/(2σ²)) dy
by substituting x = ln y. Thus, the density is given by f(y) = (1/(yσ√(2π))) exp(−(ln y − µ)²/(2σ²)), y > 0.
3. (a) ∫_0^∞ ∫_0^{2π} e^(−r²/2) r dθ dr = 2π ∫_0^∞ r e^(−r²/2) dr = 2π ∫_0^∞ e^(−u) du = 2π.
(b) Substituting x = σy + µ gives (1/(σ√(2π))) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−y²/2) dy = 1.
4. Let X be the height in inches normally distributed with µ = 70in and σ = 3in.
(a) P (X ≥ (6)(12)) ≈ 1 − Φ(0.67) ≈ 0.25.
(b) (2.54)X is the height in centimeters, and normally distributed with µ = (2.54)(70) = 177.8cm and
σ = (2.54)(3) = 7.62cm.
5. (a) P (X > 10) = 1 − Φ(0.5) ≈ 0.3085. (b) P (−20 < X < 15) = Φ(1) − Φ(−2.5) ≈ 0.8351.
(c) Observe that P (X > x) = 1−Φ((x−5)/10) = 0.05. Since Φ(1.64) ≈ 0.95, we can find (x−5)/10 ≈ 1.64,
and therefore, x ≈ 21.4.
6. P (|X − µ| ≤ 0.675σ) = P (−0.675σ ≤ X − µ ≤ 0.675σ) = P (−0.675 ≤ (X − µ)/σ ≤ 0.675) = Φ(0.675) −
Φ(−0.675) ≈ 0.5.
7. Observing that P (µ − c ≤ X ≤ µ + c) = P (|(X − µ)/σ| ≤ c/σ) = 0.95, we obtain P ((X − µ)/σ ≤ c/σ) =
Φ(c/σ) = 0.975. Since Φ(1.96) ≈ 0.975, we can find c/σ ≈ 1.96, and therefore, c = 1.96σ.
13 Summary Diagram
The original notes display a diagram whose arrows, numbered (1)–(7), correspond to the relationships listed below. The four distributions appearing in the diagram are:
Normal(µ, σ²): f(x) = (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞; E[X] = µ, Var(X) = σ².
Chi-square (with m degrees of freedom): f(x) = (1/2) e^(−x/2) (x/2)^(m/2−1)/Γ(m/2), x ≥ 0; E[X] = m, Var(X) = 2m.
Exponential(λ): f(x) = λ e^(−λx), x ≥ 0; E[X] = 1/λ, Var(X) = 1/λ².
Gamma(α, λ): f(x) = (λ^α/Γ(α)) x^(α−1) e^(−λx), x ≥ 0 († Γ(n) = (n − 1)! when n is a positive integer); E[X] = α/λ, Var(X) = α/λ².
1. If X1 and X2 are independent normal random variables with respective parameters (µ1, σ1²) and (µ2, σ2²), then X = X1 + X2 is a normal random variable with (µ1 + µ2, σ1² + σ2²).
2. If X1, · · · , Xm are iid normal random variables with parameter (µ, σ²), then X = Σ_{k=1}^m ((Xk − µ)/σ)² is a chi-square (χ²) random variable with m degrees of freedom.
3. α = m/2 and λ = 1/2.
4. If X1, . . . , Xn are independent exponential random variables with respective parameters λ1, · · · , λn, then X = min{X1, . . . , Xn} is an exponential random variable with parameter Σ_{i=1}^n λi.
5. If X1, · · · , Xn are iid exponential random variables with parameter λ, then X = Σ_{k=1}^n Xk is a gamma random variable with (α = n, λ).
6. α = 1.
7. If X1 and X2 are independent random variables having gamma distributions with respective parameters (α1, λ) and (α2, λ), then X = X1 + X2 is a gamma random variable with parameter (α1 + α2, λ).
14 Bivariate Normal Distribution
14.1 Review of matrix algebra
Matrices are usually denoted by capital letters, A, B, C, . . ., while scalars (real numbers) are denoted by lower case letters, a, b, c, . . .. A 2-by-2 matrix is written below row by row as [a b; c d].
1. Scalar multiplication: k[a b; c d] = [ka kb; kc kd].
2. Product of two matrices: [a b; c d][e f; g h] = [ae + bg  af + bh; ce + dg  cf + dh].
3. Determinant of a matrix: det[a b; c d] = ad − bc.
4. The inverse of a matrix: If A = [a b; c d] and det A ≠ 0, then the inverse A^(−1) of A is given by
A^(−1) = (1/det A)[d −b; −c a] = (1/(ad − bc))[d −b; −c a],
and satisfies AA^(−1) = A^(−1)A = [1 0; 0 1].
5. The transpose of a matrix: [a b; c d]^T = [a c; b d].
14.2 Quadratic forms
If a 2-by-2 matrix A satisfies A^T = A, then it is of the form [a b; b c], and is said to be symmetric. A 2-by-1 matrix x = [x; y] is called a column vector, usually denoted by boldface letters x, y, . . ., and the transpose x^T = (x  y) is called the row vector. Then, the bivariate function
Q(x, y) = x^T A x = (x  y)[a b; b c][x; y] = ax² + 2bxy + cy²   (15)
is called a quadratic form.
If ac − b² ≠ 0 and ac ≠ 0, then the symmetric matrix can be decomposed into the following forms, known as Cholesky decompositions:
[a b; b c] = [√a  0; b/√a  √((ac − b²)/a)] [√a  b/√a; 0  √((ac − b²)/a)]   (16)
[a b; b c] = [√((ac − b²)/c)  b/√c; 0  √c] [√((ac − b²)/c)  0; b/√c  √c]   (17)
Corresponding to the Cholesky decompositions of the matrix A, the quadratic form Q(x, y) can also be decomposed as follows:
Q(x, y) = (√a x + (b/√a) y)² + (√((ac − b²)/a) y)²   (18)
= (√((ac − b²)/c) x)² + ((b/√c) x + √c y)²   (19)
14.3 Bivariate normal density
The function
f(x, y) = (1/(2πσxσy√(1 − ρ²))) exp(−(1/2) Q(x, y))   (20)
with the quadratic form
Q(x, y) = (1/(1 − ρ²)) [((x − µx)/σx)² − 2ρ(x − µx)(y − µy)/(σxσy) + ((y − µy)/σy)²]   (21)
gives the joint density function of a bivariate normal distribution. Note that the parameters σx², σy², and ρ must satisfy σx² > 0, σy² > 0, and −1 < ρ < 1. By defining the 2-by-2 symmetric matrix (also known as the covariance matrix) and the two column vectors
Σ = [σx²  ρσxσy; ρσxσy  σy²],   x = [x; y],   µ = [µx; µy],
the quadratic form can be expressed as
Q(x, y) = (x − µ)^T Σ^(−1) (x − µ) = (x − µx  y − µy) [σx²  ρσxσy; ρσxσy  σy²]^(−1) [x − µx; y − µy].   (22)
Suppose that two random variables X and Y have the bivariate normal distribution (20). Then we can derive the following properties.
1. Their marginal density functions fX(x) and fY(y) become
fX(x) = (1/(√(2π)σx)) exp{−(1/2)((x − µx)/σx)²}   and   fY(y) = (1/(√(2π)σy)) exp{−(1/2)((y − µy)/σy)²}.
Thus, X and Y are normally distributed with respective parameters (µx, σx²) and (µy, σy²).
2. If ρ = 0, then the quadratic form (21) becomes
Q(x, y) = ((x − µx)/σx)² + ((y − µy)/σy)²,
and consequently we have f(x, y) = fX(x)fY(y). Thus, X and Y are independent if and only if ρ = 0.
3. If X and Y are not independent (that is, ρ ≠ 0), we can compute the conditional density function fY|X(y|x) given X = x as
fY|X(y|x) = (1/(√(2π)σy√(1 − ρ²))) exp{−(1/2) [(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²},
which is the normal density function with parameter (µy + ρ(σy/σx)(x − µx), σy²(1 − ρ²)). Similarly, the conditional density function fX|Y(x|y) given Y = y becomes
fX|Y(x|y) = (1/(√(2π)σx√(1 − ρ²))) exp{−(1/2) [(x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ²))]²},
which is the normal density function with parameter (µx + ρ(σx/σy)(y − µy), σx²(1 − ρ²)).
14.4 Covariance of bivariate normal random variables
Let (X, Y) be bivariate normal random variables with parameters (µx, µy, σx², σy², ρ). Recall that f(x, y) = f(x)f(y|x), and that f(y|x) is the normal density with mean µy + ρ(σy/σx)(x − µx) and variance σy²(1 − ρ²). First we can compute
∫_{−∞}^{∞} (y − µy) f(y|x) dy = ρ(σy/σx)(x − µx).
Then we can apply it to obtain
Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)(y − µy) f(x, y) dy dx = ∫_{−∞}^{∞} (x − µx) [∫_{−∞}^{∞} (y − µy) f(y|x) dy] f(x) dx
= ρ(σy/σx) ∫_{−∞}^{∞} (x − µx)² f(x) dx = ρ(σy/σx) σx² = ρσxσy.
The correlation coefficient of X and Y is defined by
ρ = Cov(X, Y)/√(Var(X)Var(Y)).
It implies that the parameter ρ of the bivariate normal distribution represents the correlation coefficient of X and Y.
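The identity Cov(X, Y) = ρσxσy can be illustrated by simulation. The following is an added sketch assuming numpy; the parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7   # hypothetical parameters
mean = [mu_x, mu_y]
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xy = rng.multivariate_normal(mean, cov, size=200_000)
x, y = xy[:, 0], xy[:, 1]

print(np.cov(x, y)[0, 1], rho * sigma_x * sigma_y)   # sample covariance vs rho*sigma_x*sigma_y
print(np.corrcoef(x, y)[0, 1], rho)                  # sample correlation vs rho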
14.5 Exercises
1. Verify the Cholesky decompositions by multiplying the matrices in each of (16) and (17).
2. Show that the right-hand side of (18) equals (15) by calculating
Q(x, y) = (x  y) [√a  0; b/√a  √((ac − b²)/a)] [√a  b/√a; 0  √((ac − b²)/a)] [x; y].
Similarly show that the right-hand side of (19) equals (15).
3. Show that (21) is equal to (22).
4. Show that (22) has the following decompositions
Q(x, y) = [(x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ²))]² + ((y − µy)/σy)²   (E1)
= [(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]² + ((x − µx)/σx)²   (E2)
by completing the following:
(a) Find the Cholesky decompositions of Σ^(−1).
(b) Verify (E1) and (E2) by using the Cholesky decompositions of Σ^(−1).
5. Compute fX(x) by completing the following:
(a) Show that ∫_{−∞}^{∞} f(x, y) dy becomes
(1/(√(2π)σx)) exp{−(1/2)((x − µx)/σx)²} × (1/(√(2π)σy√(1 − ρ²))) ∫_{−∞}^{∞} exp{−(1/2)[(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²} dy.
(b) Show that
(1/(√(2π)σy√(1 − ρ²))) ∫_{−∞}^{∞} exp{−(1/2)[(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²} dy = (1/√(2π)) ∫_{−∞}^{∞} exp(−y²/2) dy = 1.
Now find fY|X(y|x) given X = x.
15 Conditional Expectation
Let f(x, y) be a joint density function for random variables X and Y, and let fY|X(y|x) be the conditional density function of Y given X = x (see Lecture Note No. ?? for definition). Then the conditional expectation of Y given X = x, denoted by E(Y|X = x), is defined as the function h(x) of x:
E(Y|X = x) := h(x) = ∫_{−∞}^{∞} y fY|X(y|x) dy.
In particular, h(X) is the function of the random variable X, and is called the conditional expectation of Y given X, denoted by E(Y|X). Similarly, we can define the conditional expectation for a function g(Y) of the random variable Y by
E(g(Y)|X = x) := ∫_{−∞}^{∞} g(y) fY|X(y|x) dy.
Since E(g(Y)|X = x) is a function, say h(x), of x, we can define E(g(Y)|X) as the function h(X) of the random variable X.
15.1 Properties of conditional expectation
(a) Linearity. Let a and b be constants. By the definition of conditional expectation, it clearly follows that
E(aY + b|X = x) = aE(Y |X = x) + b. Consequently,
E(aY + b|X) = aE(Y |X) + b.
(b) Law of total expectation. Since E(g(Y)|X) is a function h(X) of the random variable X, we can consider “the expectation E[E(g(Y)|X)] of the conditional expectation E(g(Y)|X),” and compute it as follows.
E[E(g(Y)|X)] = ∫_{−∞}^{∞} h(x) fX(x) dx = ∫_{−∞}^{∞} [∫_{−∞}^{∞} g(y) fY|X(y|x) dy] fX(x) dx
= ∫_{−∞}^{∞} [∫_{−∞}^{∞} fY|X(y|x) fX(x) dx] g(y) dy = ∫_{−∞}^{∞} g(y) fY(y) dy = E[g(Y)].
Thus, we have E[E(g(Y)|X)] = E[g(Y)].
(c) Conditional variance formula. The conditional variance, denoted by Var(Y|X = x), can be defined as
Var(Y|X = x) := E((Y − E(Y|X = x))²|X = x) = E(Y²|X = x) − (E(Y|X = x))²,
where the second equality can be obtained from the linearity property in (a). Since Var(Y|X = x) is a function, say h(x), of x, we can define Var(Y|X) as the function h(X) of the random variable X. Now compute “the variance Var(E(Y|X)) of the conditional expectation E(Y|X)” and “the expectation E[Var(Y|X)] of the conditional variance Var(Y|X)” as follows.
Var(E(Y|X)) = E[(E(Y|X))²] − (E[E(Y|X)])² = E[(E(Y|X))²] − (E[Y])²   (1)
E[Var(Y|X)] = E[E(Y²|X)] − E[(E(Y|X))²] = E[Y²] − E[(E(Y|X))²]   (2)
By adding (1) and (2) together, we can obtain
Var(E(Y|X)) + E[Var(Y|X)] = E[Y²] − (E[Y])² = Var(Y).
(d) In calculating the conditional expectation E(g1 (X)g2 (Y )|X = x) given X = x, we can treat g1 (X) as the
constant g1 (x); thus, we have E(g1 (X)g2 (Y )|X = x) = g1 (x)E(g2 (Y )|X = x). This justifies
E(g1 (X)g2 (Y )|X) = g1 (X)E(g2 (Y )|X).
15.2 Prediction
Suppose that we have two random variables X and Y . Having observed X = x, one may attempt to “predict”
the value of Y as a function g(x) of x. The function g(X) of the random variable X can be constructed for
this purpose, and is called a predictor of Y . The criterion for the “best” predictor g(X) is to find a function
g to minimize E[(Y − g(X))2 ]. We can compute E[(Y − g(X))2 ] by applying the properties (a)–(d) in Lecture
summary No.21.
E[(Y − g(X))²] = E[(Y − E(Y|X) + E(Y|X) − g(X))²]
= E[(Y − E(Y|X))² + 2(Y − E(Y|X))(E(Y|X) − g(X)) + (E(Y|X) − g(X))²]
= E[E((Y − E(Y|X))²|X)] + E[2(E(Y|X) − g(X)) E((Y − E(Y|X))|X)] + E[(E(Y|X) − g(X))²]
= E[Var(Y|X)] + E[(E(Y|X) − g(X))²],
since E((Y − E(Y|X))|X) = 0 makes the middle term vanish. This is minimized when g(x) = E(Y|X = x). Thus, we can find g(X) = E(Y|X) as the best predictor of Y.
The best predictor for bivariate normal distributions. When X and Y have the bivariate normal distribution with parameter (µx, µy, σx², σy², ρ), we can find the best predictor
E(Y|X) = µy + ρ(σy/σx)(X − µx).
15.3 The best linear predictor
Except for the case of the bivariate normal distribution, the conditional expectation E(Y|X) could be a complicated nonlinear function of X. So we can instead consider the best linear predictor
g(X) = α + βX
of Y by minimizing E[(Y − (α + βX))²]. Since
E[(Y − (α + βX))²] = E[Y²] − 2αµy − 2βE[XY] + α² + 2αβµx + β²E[X²],
where µx = E[X] and µy = E[Y], we can obtain the desired values α and β by solving
∂/∂α E[(Y − (α + βX))²] = −2µy + 2α + 2βµx = 0,
∂/∂β E[(Y − (α + βX))²] = −2E[XY] + 2αµx + 2βE[X²] = 0.
The solution can be expressed by using σx² = Var(X), σy² = Var(Y), and ρ = Cov(X, Y)/(σxσy):
β = (E[XY] − E[X]E[Y])/(E[X²] − (E[X])²) = ρσy/σx,
α = µy − βµx = µy − ρ(σy/σx)µx.
In summary, the best linear predictor g(X) becomes
g(X) = µy + ρ(σy/σx)(X − µx).
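The coefficients α and β can be recovered from data by exactly these formulas. The following is an added sketch assuming numpy; the simulated data and the true coefficients are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=100_000)                  # hypothetical X
y = 3.0 - 0.8 * x + rng.normal(0.0, 0.5, size=x.size)   # hypothetical Y

beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # beta = Cov(X, Y)/Var(X) = rho * sigma_y / sigma_x
alpha = y.mean() - beta * x.mean()              # alpha = mu_y - beta * mu_x
print(alpha, beta)                              # close to the hypothetical values 3.0 and -0.8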
15.4 Exercises
1. Suppose that the joint density function f(x, y) of X and Y is given by
f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the covariance and the correlation of X and Y.
(b) Find the conditional expectations E(X|Y = y) and E(Y|X = x).
(c) Find the best linear predictor of Y in terms of X.
(d) Find the best predictor of Y.
2. Suppose that we have the random variables X and Y having the joint density
f(x, y) = (6/7)(x + y)² if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the covariance and correlation of X and Y.
(b) Find the conditional expectation E(Y|X = x).
(c) Find the best linear predictor of Y.
3. Let X be the age of the pregnant woman, and let Y be the age of the prospective father at a certain hospital.
Suppose that the distribution of X and Y can be approximated by a bivariate normal distribution with
parameters µx = 28, µy = 32, σx = 6, σy = 8 and ρ = 0.8.
(a) Find the probability that the prospective father is over 30.
(b) When the pregnant woman is 31 years old, what is the best guess of the prospective father’s age?
(c) When the pregnant woman is 31 years old, find the probability that the prospective father is over 30.
16 Transformations of Variables
16.1 Transformations of a random variable
For an interval A = [a, b] on the real line with a < b, the integration of a function f(x) over A is formally written as
∫_A f(x) dx = ∫_a^b f(x) dx.   (23)
Let h(u) be a differentiable and strictly monotonic function (that is, either strictly increasing or strictly decreasing) on the real line. Then there is the inverse function g = h^(−1), so that we can define the inverse image h^(−1)(A) of A = [a, b] by
h^(−1)(A) = g(A) = {g(x) : a ≤ x ≤ b}.
The change of variable in the integral (23) provides the formula
∫_A f(x) dx = ∫_{g(A)} f(h(u)) |(d/du) h(u)| du.   (24)
Theorem A. Let X be a random variable with pdf fX(x). Suppose that a function g(x) is differentiable and strictly monotonic. Then the random variable Y = g(X) has the pdf
fY(y) = fX(g^(−1)(y)) |(d/dy) g^(−1)(y)|.   (25)
A sketch of proof. Since g(x) is strictly monotonic, we can find the inverse function h = g^(−1), and apply the formula (24). Thus, by letting A = (−∞, t], we can rewrite the cdf for Y as
FY(t) = P(Y ≤ t) = P(Y ∈ A) = P(g(X) ∈ A) = P(X ∈ h(A))
= ∫_{h(A)} fX(x) dx = ∫_{g(h(A))} fX(h(y)) |(d/dy) h(y)| dy = ∫_A fX(g^(−1)(y)) |(d/dy) g^(−1)(y)| dy.
By comparing it with the form ∫_{−∞}^{t} fY(y) dy, we can obtain (25).
Linear transform of normal random variable. Let X be a normal random variable with parameter (µ, σ²). Then we can define the linear transform Y = aX + b with real values a ≠ 0 and b. To compute the density function fY(y) of the random variable Y, we can apply Theorem A and obtain
fY(y) = (1/(|a|σ√(2π))) exp(−(y − (aµ + b))²/(2(aσ)²)).
This is the normal density function with parameter (aµ + b, (aσ)²).
Example. Let X be a uniform random variable on [0, 1]. Suppose that X takes nonnegative values [and
therefore, f (x) = 0 for x ≤ 0]. Find the pdf fY (y) for the random variable Y = X n .
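As a quick numerical check of Theorem A for this example (an added sketch assuming Python with numpy; the exponent n = 3 is a hypothetical choice), note that for uniform X on [0, 1] the cdf of Y = X^n is FY(t) = P(X ≤ t^(1/n)) = t^(1/n) on [0, 1]:

import numpy as np

rng = np.random.default_rng(0)
n = 3                                       # hypothetical exponent
x = rng.uniform(0.0, 1.0, size=200_000)
y = x**n
# Compare the empirical cdf of Y with F_Y(t) = t^(1/n):
for t in (0.1, 0.5, 0.9):
    print(t, np.mean(y <= t), t**(1.0 / n))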
16.2 Transformations of bivariate random variables
For a rectangle A = {(x, y) : a1 ≤ x ≤ b1, a2 ≤ y ≤ b2} on a plane, the integration of a function f(x, y) over A is formally written as
∫∫_A f(x, y) dx dy = ∫_{a2}^{b2} ∫_{a1}^{b1} f(x, y) dx dy.   (26)
Suppose that a transformation (g1(x, y), g2(x, y)) is differentiable and has the inverse transformation (h1(u, v), h2(u, v)) satisfying
u = g1(x, y) and v = g2(x, y)   if and only if   x = h1(u, v) and y = h2(u, v).   (27)
Then we can state the change of variable in the double integral (26) as follows: (i) define the inverse image of A by
h^(−1)(A) = g(A) = {(g1(x, y), g2(x, y)) : (x, y) ∈ A},
(ii) calculate the Jacobian
Jh(u, v) = det [∂h1/∂u  ∂h1/∂v; ∂h2/∂u  ∂h2/∂v],   (28)
(iii) finally, if it does not vanish for all (u, v), then we can obtain
∫∫_A f(x, y) dx dy = ∫∫_{g(A)} f(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.   (29)
Theorem B. Suppose that (a) random variables X and Y have a joint density function fX,Y(x, y), (b) a transformation (g1(x, y), g2(x, y)) is differentiable, and (c) its inverse transformation (h1(u, v), h2(u, v)) satisfies (27). Then the random variables U = g1(X, Y) and V = g2(X, Y) have the joint density function
fU,V(u, v) = fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)|.   (30)
A sketch of proof. Define A = {(x, y) : x ≤ s, y ≤ t} and the image of A under the inverse transformation (h1, h2) by
h(A) = {(h1(u, v), h2(u, v)) : (u, v) ∈ A}.
Then we can rewrite the joint distribution function FU,V for (U, V) by applying the formula (29).
FU,V(s, t) = P((U, V) ∈ A) = P((g1(X, Y), g2(X, Y)) ∈ A) = P((X, Y) ∈ h(A))
= ∫∫_{h(A)} fX,Y(x, y) dx dy = ∫∫_{g(h(A))} fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv
= ∫∫_A fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.
By comparing it with the form ∫_{−∞}^{t} ∫_{−∞}^{s} fU,V(u, v) du dv, we can obtain (30).
16.3 Exercises
1. Suppose that (X1, X2) has a bivariate normal distribution with mean vector µ = [µx; µy] and covariance matrix Σ = [σx²  ρσxσy; ρσxσy  σy²]. Consider the linear transformation
Y1 = g1(X1, X2) = a1X1 + b1;
Y2 = g2(X1, X2) = a2X2 + b2.
(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).
(b) Find the joint density function fY1,Y2(u, v) for (Y1, Y2).
(c) Conclude that the joint density function fY1,Y2(u, v) is the bivariate normal density with mean vector µ = [a1µx + b1; a2µy + b2] and covariance matrix Σ = [(a1σx)²  ρ(a1σx)(a2σy); ρ(a1σx)(a2σy)  (a2σy)²].
2. Let X1 and X2 be iid standard normal random variables. Consider the linear transformation
Y1 = g1(X1, X2) = a11X1 + a12X2 + b1;
Y2 = g2(X1, X2) = a21X1 + a22X2 + b2,
where a11a22 − a12a21 ≠ 0. Find the joint density function fY1,Y2(u, v) for (Y1, Y2), and conclude that it is the bivariate normal density with mean vector µ = [b1; b2] and covariance matrix Σ = [σx²  ρσxσy; ρσxσy  σy²], where σx² = a11² + a12², σy² = a21² + a22², and ρ = (a11a21 + a12a22)/√((a11² + a12²)(a21² + a22²)).
3. Suppose that X and Y are independent random variables with respective pdf's fX(x) and fY(y). Consider the transformation
U = g1(X, Y) = X + Y;
V = g2(X, Y) = X/(X + Y).
(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).
(b) Find the joint density function fX,Y(x, y) for (X, Y).
(c) Now let fX(x) and fY(y) be the gamma density functions with respective parameters (α1, λ) and (α2, λ). Then show that U = X + Y has the gamma distribution with parameter (α1 + α2, λ) and that V = X/(X + Y) has the density function
fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^(α1−1) (1 − v)^(α2−1).   (31)
The density function (31) is called the beta distribution with parameter (α1, α2).
4. Let X and Y be random variables. The joint density function of X and Y is given by
f(x, y) = x e^(−(x+y)) if 0 ≤ x and 0 ≤ y, and f(x, y) = 0 otherwise.
(a) Are X and Y independent? Describe the marginal distributions for X and Y with specific parameters.
(b) Find the distribution for the random variable U = X + Y.
5. Consider an investment for three years in which you plan to invest $500 on average at the beginning of every year. The fund yields a fixed annual interest of 10%; thus, the fund invested at the beginning of
the first year yields 33.1% at the end of term (that is, at the end of the third year), the fund of the second
year yields 21%, and the fund of the third year yields 10%.
(a) How much do you expect to have in your investment account at the end of term?
(b) From your past spending behavior, you assume that the annual investment you decide is independent
every year, and that its standard deviation is $200. (That is, the standard deviation for the amount
of money you invest each year is $200.) Find the standard deviation for the total amount of money
you have in the account at the end of term.
(c) Furthermore, suppose that your annual investment is normally distributed. (That is, the amount of
money you invest each year is normally distributed.) Then what is the probability that you have more
than $2000 in the account at the end of term?
6. Let X be an exponential random variable with parameter λ, and define Y = X^n, where n is a positive integer.
(a) Compute the pdf fY(y) for Y.
(b) Find c so that
f(x) = c x^n e^(−λx) if x ≥ 0, and f(x) = 0 otherwise,
is a pdf. Hint: You should know the name of the pdf f(x).
(c) By using (b), find the expectation E[Y ].
7. Let X and Y be independent normal random variables with mean zero and the respective variances σx² and σy². Then consider the change of variables via
U = a11X + a12Y and V = a21X + a22Y,   (32)
where D = a11a22 − a12a21 ≠ 0.
(a) Find the variances σu² and σv² for U and V, respectively.
(b) Find the inverse transformation (h1(u, v), h2(u, v)) so that X and Y can be expressed in terms of U and V via X = h1(U, V) and Y = h2(U, V). And compute the Jacobian Jh(u, v).
(c) Find ρ so that the joint density function fU,V(u, v) of U and V can be expressed as
fU,V(u, v) = (1/(2πσuσv√(1 − ρ²))) exp(−(1/2) Q(u, v)).
(d) Express Q(u, v) by using σu, σv, and ρ to conclude that the joint density is the bivariate normal density with parameters µu = µv = 0, σu, σv, and ρ.
(e) Let X′ = X + µx and Y′ = Y + µy. Then consider the change of variables via U′ = a11X′ + a12Y′ and V′ = a21X′ + a22Y′. Express U′ and V′ in terms of U and V, and describe the joint distribution for U′ and V′ with specific parameters.
(f) Now suppose that X and Y are independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). Then what is the joint distribution for U and V given by (32)?
8. To complete this problem, you may use the following theorem:
Theorem C. Let X and Y be independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). If a11a22 − a12a21 ≠ 0, then the random variables
U = a11X + a12Y and V = a21X + a22Y
have the bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu² = a11²σx² + a12²σy², σv² = a21²σx² + a22²σy², and
ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
Now suppose that a real-valued data X is transmitted through a noisy channel from location A to location B, and that the value U received at location B is given by
U = X + Y,
where Y is a normal random variable with mean zero and variance σy². Furthermore, we assume that X is normally distributed with mean µ and variance σx², and is independent of Y.
(a) Find the expectation E[U] and the variance Var(U) of the received value U at location B.
(b) Describe the joint distribution for X and U with specific parameters.
(c) Provided that U = t is observed at location B, find the best predictor g(t) of X.
(d) Suppose that we have µ = 1, σx = 1, and σy = 0.5. When U = 1.3 was observed, find the best predictor of X given U = 1.3.
17 Solutions to exercises
17.1 Conditional expectation
1. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = (1/4) − (1/3)(2/3) = 1/36, and ρ = (1/36)/√((1/18)(1/18)) = 1/2.
(b) Since f(x|y) = 1/y for 0 ≤ x ≤ y and f(y|x) = 1/(1 − x) for x ≤ y ≤ 1, we obtain E(X|Y = y) = y/2 and E(Y|X = x) = (x + 1)/2.
(c) The best linear predictor g(X) of Y is given by g(X) = µy + ρ(σy/σx)(X − µx) = (1/2)X + 1/2.
(d) The best predictor of Y is E(Y|X) = (X + 1)/2. [This is the same as (c). Why?]
2. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = 17/42 − (9/14)² = −5/588 and ρ = (−5/588)/(199/2940) = −25/199.
(b) E(Y|X = x) = (6x² + 8x + 3)/(4(3x² + 3x + 1)).
(c) The best linear predictor g(X) of Y is given by g(X) = −(25/199)X + 144/199.
3. (a) P(Y ≥ 30) = P((Y − 32)/8 ≥ −1/4) = 1 − Φ(−0.25) ≈ 0.60.
(b) E(Y | X = 31) = 32 + (0.8)(8/6)(31 − 28) = 35.2.
(c) Since the conditional density fY|X(y|x) is the normal density with mean µ = 35.2 and variance σ² = (8)²(1 − (0.8)²) = 23.04, we can obtain
P(Y ≥ 30 | X = 31) ≈ P((Y − 35.2)/4.8 ≥ −1.08 | X = 31) = 1 − Φ(−1.08) ≈ 0.86.
Introduction to Probability
17.2
Transformations of variables
1
0
v − b2
u − b1
1
a
1
and h2 (u, v) =
. Thus, Jh (u, v) = det
.
1. (a) h1 (u, v) =
=
0 a12
a1
a2
a1 a2
1 1
1
u − b1 v − b 2
p
·
(b) fY1 ,Y2 (u, v) = fX1 ,X2
,
=
exp − Q(u, v) , where
a1
a2
a1 a2 2π(|a1 |σx )(|a2 |σy ) 1 − ρ2
2
#
"
2 2
1
v − (a2 µy + b2 )
(u − (a1 µx + b1 ))(v − (a2 µy + b2 ))
u − (a1 µx + b1 )
Q(u, v) =
+
− 2ρ
.
1 − ρ2
a 1 σx
a 2 σy
(a1 σx )(a2 σy )
(c) From (b), fY1 ,Y2 (u, v) is a bivariate normal distribution with parameter µu = a1 µx + b1 , µv = a2 µy + b2 ,
σu = |a1 |σx , σv = |a2 |σy , and ρ if a1 a2 > 0 (or, −ρ if a1 a2 < 0).
2. (a) h1(u, v) = (a22(u − b1) − a12(v − b2))/(a11a22 − a12a21) and h2(u, v) = (−a21(u − b1) + a11(v − b2))/(a11a22 − a12a21). Thus, Jh(u, v) = 1/(a11a22 − a12a21).
(b) Note that fX1,X2(x, y) = (1/(2π)) exp(−(x² + y²)/2). Then we obtain
fY1,Y2(u, v) = fX1,X2(h1(u, v), h2(u, v)) · |1/(a11a22 − a12a21)| = (1/(2π|a11a22 − a12a21|)) exp(−(1/2) Q(u, v)),
where
Q(u, v) = (u − b1  v − b2) [a11² + a12²  a11a21 + a12a22; a11a21 + a12a22  a21² + a22²]^(−1) [u − b1; v − b2].
(c) By comparing fY1,Y2(u, v) with a bivariate normal density, we can find that fY1,Y2(u, v) is a bivariate normal density with parameters µu = b1, µv = b2, σu² = a11² + a12², σv² = a21² + a22², and ρ = (a11a21 + a12a22)/√((a11² + a12²)(a21² + a22²)).
3. (a) h1(u, v) = uv and h2(u, v) = u − uv. Thus, Jh(u, v) = det[v  u; 1 − v  −u] = −u.
(b) fX,Y(x, y) = fX(x)fY(y).
(c) We can obtain
fU,V(u, v) = |−u| × fX(uv) × fY(u − uv) = u × (1/Γ(α1)) λ^(α1) e^(−λ(uv)) (uv)^(α1−1) × (1/Γ(α2)) λ^(α2) e^(−λ(u−uv)) (u − uv)^(α2−1) = fU(u) fV(v),
where fU(u) = (1/Γ(α1 + α2)) λ^(α1+α2) e^(−λu) u^(α1+α2−1) and fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^(α1−1)(1 − v)^(α2−1). Thus, fU(u) is a gamma density with parameter (α1 + α2, λ), and fV(v) is a beta density with parameter (α1, α2).
4. (a) Observe that f(x, y) = (xe^(−x)) · (e^(−y)) and that xe^(−x), x ≥ 0, is the gamma density with parameter (2, 1) and e^(−y), y ≥ 0, is the exponential density with λ = 1. Thus, we can obtain
fX(x) = ∫_0^∞ f(x, y) dy = xe^(−x) ∫_0^∞ e^(−y) dy = xe^(−x), x ≥ 0;
fY(y) = ∫_0^∞ f(x, y) dx = e^(−y) ∫_0^∞ xe^(−x) dx = e^(−y), y ≥ 0.
Since f(x, y) = fX(x)fY(y), X and Y are independent.
(b) Notice that the exponential density fY(y) with λ = 1 can be seen as the gamma density with (1, 1). Since the sum of independent gamma random variables with the same rate λ again has a gamma distribution, U has the gamma distribution with (3, 1).
5. Let X1 be the amount of money you invest in the first year, let X2 be the amount in the second year, and let X3 be the amount in the third year. Then the total amount of money you have at the end of term, say Y, becomes
Y = 1.331X1 + 1.21X2 + 1.1X3.
(a) E(1.331X1 + 1.21X2 + 1.1X3) = 1.331E(X1) + 1.21E(X2) + 1.1E(X3) = 3.641 × 500 = 1820.5.
(b) Var(1.331X1 + 1.21X2 + 1.1X3) = (1.331)²Var(X1) + (1.21)²Var(X2) + (1.1)²Var(X3) ≈ 4.446 × 200² ≈ 177800. Thus, the standard deviation is √Var(Y) ≈ √177800 ≈ 421.7.
(c) P(Y ≥ 2000) = P((Y − 1820.5)/421.7 ≥ (2000 − 1820.5)/421.7) ≈ 1 − Φ(0.43) ≈ 0.334.
6. (a) By using the change-of-variable formula with g(x) = x^n, we can obtain
fY(y) = fX(g^(−1)(y)) |(d/dy) g^(−1)(y)| = fX(y^(1/n)) · (1/n) y^(−(n−1)/n) = (λ/n) y^(−(n−1)/n) exp(−λ y^(1/n)), y ≥ 0.
(b) Since the pdf f(x) is of the form of a gamma density with (n + 1, λ), we must have f(x) = (λ^(n+1)/Γ(n + 1)) x^n e^(−λx), x ≥ 0. Therefore, c = λ^(n+1)/Γ(n + 1) (= λ^(n+1)/n!, since n is an integer).
(c) Since f(x) in (b) is a pdf, we have ∫_0^∞ f(x) dx = ∫_0^∞ (λ^(n+1)/Γ(n + 1)) x^n e^(−λx) dx = 1. Thus, we can obtain
E[Y] = E[X^n] = ∫_0^∞ x^n λ e^(−λx) dx = Γ(n + 1)/λ^n (= n!/λ^n, since n is an integer).
7. (a) σu² = Var(U) = a11²σx² + a12²σy² and σv² = Var(V) = a21²σx² + a22²σy².
(b) h1(u, v) = (1/D)(a22u − a12v) and h2(u, v) = (1/D)(−a21u + a11v). And we have Jh(u, v) = 1/D.
(c) ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
(d) Q(u, v) can be expressed as
Q(u, v) = (1/(1 − ρ²)) [(u/σu)² − 2ρ uv/(σuσv) + (v/σv)²].
Thus, the joint density function fU,V(u, v) is the bivariate normal density with parameters µu = µv = 0 together with σu, σv, ρ as calculated in (a) and (c).
(e) U′ = U + a11µx + a12µy and V′ = V + a21µx + a22µy. Then U′ and V′ have the bivariate normal density with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu, σv, and ρ.
(f) By (d) and (e), (U, V) has a bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu² = a11²σx² + a12²σy², σv² = a21²σx² + a22²σy², and
ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
8. (a) E[U] = E[X] + E[Y] = µ and Var(U) = Var(X) + Var(Y) = σx² + σy².
(b) By using Theorem C, we can find that X and U have the bivariate normal distribution with means µx = µ and µu = µ, variances σx² and σu² = σx² + σy², and ρ = σx/√(σx² + σy²).
(c) E(X | U = t) = µx + ρ(σx/σu)(t − µu) = µ + (σx²/(σx² + σy²))(t − µ).
(d) E(X | U = 1.3) = 1 + ((1)²/((1)² + (0.5)²))(1.3 − 1) = 1.24.
18 Order statistics
18.1 Extrema
Let X1, . . . , Xn be independent and identically distributed random variables (iid random variables) with the common cdf F(x) and pdf f(x). Then we can define the random variable V = min(X1, . . . , Xn), the minimum of X1, . . . , Xn. To find the distribution of V, consider the survival function P(V > x) of V and calculate as follows.
P(V > x) = P(X1 > x, . . . , Xn > x) = P(X1 > x) × · · · × P(Xn > x) = [1 − F(x)]^n.
Thus, we can obtain the cdf FV(x) and the pdf fV(x) of V:
FV(x) = 1 − [1 − F(x)]^n   and   fV(x) = (d/dx) FV(x) = n f(x)[1 − F(x)]^(n−1).
In particular, if X1, . . . , Xn are independent exponential random variables with the common parameter λ, then the minimum min(X1, . . . , Xn) has the density (nλ)e^(−(nλ)x), which is again an exponential density function with parameter nλ.
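The exponential-minimum fact is easy to see in simulation. The following is an added sketch assuming numpy; the rate λ and the sample size n are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
lam, n = 0.5, 4                                      # hypothetical rate and number of variables
samples = rng.exponential(scale=1.0 / lam, size=(100_000, n))
v = samples.min(axis=1)                              # V = min(X1, ..., Xn)

print(v.mean(), 1.0 / (n * lam))                     # mean of V vs 1/(n*lambda)
t = 0.3
print(np.mean(v > t), np.exp(-n * lam * t))          # survival P(V > t) vs e^(-(n*lambda)t)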
18.2 Order statistics
Let X1, . . . , Xn be iid random variables with the common cdf F(x) and the pdf f(x). When we sort X1, . . . , Xn as
X(1) < X(2) < · · · < X(n),
the random variable X(k) is called the k-th order statistic. Then we have the following theorem.
Theorem. The pdf of the k-th order statistic X(k) is given by
fk(x) = (n!/((k − 1)!(n − k)!)) f(x) F^(k−1)(x) [1 − F(x)]^(n−k).   (33)
Probability of a small interval—the proof of the theorem. If f is continuous at x and δ > 0 is small, then the probability that a random variable X falls into the small interval [x, x + δ] is approximated by δf(x). In differential notation, we can write
P(x ≤ X ≤ x + dx) = f(x) dx.
Now consider the k-th order statistic X(k) and the event that x ≤ X(k) ≤ x + dx. This event occurs when (k − 1) of the Xi's are less than x and (n − k) of the Xi's are greater than x ≈ x + dx. Since there are n!/((k − 1)! 1! (n − k)!) such arrangements of the Xi's, we can obtain the probability of this event as
fk(x) dx = (n!/((k − 1)!(n − k)!)) (F(x))^(k−1) (f(x) dx) [1 − F(x)]^(n−k),
which implies (33).
18.3 Beta distribution
When X1, . . . , Xn are iid uniform random variables on [0, 1], the pdf of the k-th order statistic X(k) becomes
fk(x) = (n!/((k − 1)!(n − k)!)) x^(k−1)(1 − x)^(n−k), 0 ≤ x ≤ 1,
which is known as the beta distribution with parameters α = k and β = n − k + 1.
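This Beta(k, n − k + 1) claim can be checked against scipy's beta distribution. The following is an added sketch; the sample size n and rank k are hypothetical.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
n, k = 7, 3                                            # hypothetical sample size and rank
u = rng.uniform(size=(200_000, n))
xk = np.sort(u, axis=1)[:, k - 1]                      # k-th order statistic X_(k)

print(xk.mean(), k / (n + 1))                          # mean of Beta(k, n-k+1) is k/(n+1)
x0 = 0.4
print(np.mean(xk <= x0), beta.cdf(x0, k, n - k + 1))   # empirical cdf vs Beta cdf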
18.4 Exercises
1. Suppose that a system's components are connected in series and have lifetimes that are independent exponential random variables X1, . . . , Xn with respective parameters λ1, . . . , λn.
(a) Let V be the lifetime of the system. Show that V = min(X1, . . . , Xn).
(b) Show that the random variable V is exponentially distributed with parameter Σ_{i=1}^n λi.
2. Each component Ai of the following system has a lifetime Xi, where X1, . . . , X6 are iid exponential random variables with parameter λ.
(Diagram: the six components form three two-component branches, (A1, A2), (A3, A4), and (A5, A6), arranged so that the system lifetime is given by the expression in part (a).)
(a) Let V be the lifetime of the system. Show that V = max(min(X1 , X2 ), min(X3 , X4 ), min(X5 , X6 )).
(b) Find the cdf and the pdf of the random variable V .
3. A card contains n chips and has an error-correcting mechanism such that the card still functions if a single
chip fails but does not function if two or more chips fail. If each chip has an independent and exponentially
distributed lifetime with parameter λ, find the pdf of the card’s lifetime.
4. Suppose that a queue has n servers and that the length of time to complete a job is an exponential random
variable. If a job is at the top of the queue and will be handled by the next available server, what is the
distribution of the waiting time until service? What is the distribution of the waiting time until service of
the next job in the queue?
5. Let U and V be iid uniform random variables on [0, 1], and let X = min(U, V) and Y = max(U, V).
(a) We have discussed that the probability P(x ≤ X ≤ x + dx) is approximated by f(x) dx. Similarly, we can find the following formula in the case of a joint distribution:
P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = f(x, y) dx dy.
In particular, we can obtain
P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) = dx dy.
Now by showing that
P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) + P(x ≤ V ≤ x + dx, y ≤ U ≤ y + dy),
justify that the joint density function f(x, y) of X and Y is given by
f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(b) Find the conditional expectations E(X|Y = y) and E(Y |X = x).
19 MGF techniques
19.1 Moment generating function
Let X be a random variable. Then the moment generating function (mgf) of X is given by
M(t) = E[e^(tX)].
The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as
M(t) = ∫_{−∞}^{∞} e^(tx) f(x) dx.
The most significant property of the moment generating function is that it uniquely determines the distribution.
Suppose that X is a standard normal random variable. Then we can compute the mgf of X as follows.
MX(t) = (1/√(2π)) ∫_{−∞}^{∞} e^(tx) e^(−x²/2) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−(x−t)²/2 + t²/2) dx = e^(t²/2) (1/√(2π)) ∫_{−∞}^{∞} e^(−(x−t)²/2) dx = exp(t²/2).
Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf of Y can be given by
MY(t) = e^(µt) MX(σt) = exp(σ²t²/2 + µt).
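The formula MX(t) = exp(t²/2) can be sanity-checked by estimating E[e^(tX)] from simulated standard normal draws. A brief added sketch assuming numpy:

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 1.5):
    print(t, np.mean(np.exp(t * x)), np.exp(t * t / 2))   # Monte Carlo estimate vs exp(t^2/2)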
Similarly, suppose now that X is a gamma random variable with parameter (α, λ). For t < λ, the mgf MX(t) of X can be computed as follows.
MX(t) = ∫_0^∞ e^(tx) (λ^α/Γ(α)) x^(α−1) e^(−λx) dx = (λ^α/Γ(α)) ∫_0^∞ x^(α−1) e^(−(λ−t)x) dx
= (λ^α/((λ − t)^α Γ(α))) ∫_0^∞ ((λ − t)x)^(α−1) e^(−(λ−t)x) (λ − t) dx = (λ/(λ − t))^α (1/Γ(α)) ∫_0^∞ u^(α−1) e^(−u) du = (λ/(λ − t))^α.
19.2 Joint moment generating function
Let X and Y be random variables having a joint density function f(x, y). Then we can define the joint moment generating function M(s, t) of X and Y by
M(s, t) = E[e^(sX+tY)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(sx+ty) f(x, y) dx dy.
If X and Y are independent, then the joint mgf M(s, t) becomes
M(s, t) = E[e^(sX+tY)] = E[e^(sX)] · E[e^(tY)] = MX(s) · MY(t).
Moreover, it is known that if the joint mgf M(s, t) of X and Y is of the form M1(s) · M2(t), then X and Y are independent and have the respective mgf's M1(t) and M2(t).
19.3 Sum of independent random variables
Let X and Y be independent random variables having the respective probability density functions fX(x) and fY(y). Then the cumulative distribution function FZ(z) of the random variable Z = X + Y can be given as follows.
FZ(z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−x} fX(x) fY(y) dy dx = ∫_{−∞}^{∞} fX(x) FY(z − x) dx,
where FY is the cdf of Y. By differentiating FZ(z), we can obtain the pdf fZ(z) of Z as
fZ(z) = (d/dz) FZ(z) = ∫_{−∞}^{∞} fX(x) (d/dz) FY(z − x) dx = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.
The form of integration ∫_{−∞}^{∞} f(x) g(z − x) dx is called the convolution. Thus, the pdf fZ(z) is given by the convolution of the pdf's fX(x) and fY(y). However, the use of the moment generating function makes it easier to find the distribution of the sum of independent random variables.
Let X and Y be independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). Then the sum Z = X + Y has the mgf
MZ(t) = MX(t) · MY(t) = exp(σx²t²/2 + µxt) · exp(σy²t²/2 + µyt) = exp((σx² + σy²)t²/2 + (µx + µy)t),
which is the mgf of the normal distribution with parameter (µx + µy, σx² + σy²). Since the mgf uniquely determines the distribution, we can find that Z is a normal random variable with parameter (µx + µy, σx² + σy²).
Let X and Y be independent gamma random variables with the respective parameters (α1, λ) and (α2, λ). Then the sum Z = X + Y has the mgf
MZ(t) = MX(t) · MY(t) = (λ/(λ − t))^(α1) · (λ/(λ − t))^(α2) = (λ/(λ − t))^(α1+α2),
which is the mgf of the gamma distribution with parameter (α1 + α2, λ). Thus, Z is a gamma random variable with parameter (α1 + α2, λ).
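The conclusion for the gamma case can be compared against scipy's gamma cdf. The following is an added sketch; α1, α2, and λ are hypothetical.

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
a1, a2, lam = 2.0, 3.5, 1.5                            # hypothetical alpha1, alpha2, lambda
x = rng.gamma(shape=a1, scale=1.0 / lam, size=300_000)
y = rng.gamma(shape=a2, scale=1.0 / lam, size=300_000)
z = x + y                                              # should be gamma(a1 + a2, lambda)

t = 4.0
print(np.mean(z <= t), gamma.cdf(t, a=a1 + a2, scale=1.0 / lam))   # empirical vs theoretical cdf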
20 Chi-square, t-, and F-distributions
20.1 Chi-square distribution
The gamma distribution with parameter (n/2, 1/2),
f(x) = (1/(2^(n/2) Γ(n/2))) x^(n/2−1) e^(−x/2), x ≥ 0,
is called the chi-square distribution with n degrees of freedom.
(a) Let X be a standard normal random variable, and let
Y = X²
be the square of X. Then we can express the cdf FY of Y as FY(t) = P(X² ≤ t) = P(−√t ≤ X ≤ √t) = Φ(√t) − Φ(−√t) in terms of the standard normal cdf Φ. By differentiating the cdf, we can obtain the pdf fY as follows.
fY(t) = (d/dt) FY(t) = (1/√(2πt)) e^(−t/2) = (1/(2^(1/2) Γ(1/2))) t^(−1/2) e^(−t/2),
where we use Γ(1/2) = √π. Thus, the square X² of X is a chi-square random variable with 1 degree of freedom.
(b) Let X1, . . . , Xn be iid standard normal random variables. Since the Xi²'s have the gamma distribution with parameter (1/2, 1/2), the sum
Y = Σ_{i=1}^n Xi²
has the gamma distribution with parameter (n/2, 1/2). That is, the sum Y has the chi-square distribution with n degrees of freedom.
20.2 Student's t-distribution
Let X be a chi-square random variable with n degrees of freedom, and let Y be a standard normal random variable. Suppose that X and Y are independent. Then the distribution of the quotient
Z = Y/√(X/n)
is called the Student's t-distribution with n degrees of freedom. The pdf fZ(z) is given by
fZ(z) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2), −∞ < z < ∞.
20.3 F-distribution
Let U and V be independent chi-square random variables with r1 and r2 degrees of freedom, respectively. Then the distribution of the quotient
Z = (U/r1)/(V/r2)
is called the F-distribution with (r1, r2) degrees of freedom. The pdf fZ(z) is given by
fZ(z) = (Γ((r1 + r2)/2)/(Γ(r1/2)Γ(r2/2))) (r1/r2)^(r1/2) z^(r1/2−1) (1 + (r1/r2)z)^(−(r1+r2)/2), 0 < z < ∞.
20.4 Quotient of two random variables
Let X and Y be independent random variables having the respective pdf's fX(x) and fY(y). Then the cdf FZ(z) of the quotient
Z = Y/X
can be computed as follows.
FZ(z) = P(Y/X ≤ z) = P(Y ≥ zX, X < 0) + P(Y ≤ zX, X > 0)
= ∫_{−∞}^{0} [∫_{xz}^{∞} fY(y) dy] fX(x) dx + ∫_{0}^{∞} [∫_{−∞}^{xz} fY(y) dy] fX(x) dx.
By differentiating, we can obtain
fZ(z) = (d/dz) FZ(z) = ∫_{−∞}^{0} [−x fY(xz)] fX(x) dx + ∫_{0}^{∞} [x fY(xz)] fX(x) dx = ∫_{−∞}^{∞} |x| fY(xz) fX(x) dx.
Let X be a chi-square random variable with n degrees of freedom. Then the pdf of the random variable U = √(X/n) is given by
fU(x) = (d/dx) P(√(X/n) ≤ x) = (d/dx) FX(nx²) = 2nx fX(nx²) = (n^(n/2)/(2^(n/2−1) Γ(n/2))) x^(n−1) e^(−nx²/2)
for x ≥ 0; otherwise, fU(x) = 0. Suppose that Y is a standard normal random variable and independent of X. Then the quotient
Z = Y/√(X/n)
has a t-distribution with n degrees of freedom. The pdf fZ(z) can be computed as follows.
fZ(z) = ∫_{−∞}^{∞} |x| fY(xz) fU(x) dx = (n^(n/2)/(2^((n−1)/2) √π Γ(n/2))) ∫_0^∞ x^n e^(−(z²+n)x²/2) dx
= (1/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2) ∫_0^∞ t^((n−1)/2) e^(−t) dt = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2)
for −∞ < z < ∞.
Let U and V be independent chi-square random variables with r1 and r2 degrees of freedom, respectively. Then the quotient
Z = (U/r1)/(V/r2)
has the F-distribution with (r1, r2) degrees of freedom. By putting Y = U/r1 and X = V/r2, we can compute the pdf fZ(z) as follows.
fZ(z) = ∫_{−∞}^{∞} |x| fY(xz) fX(x) dx = (r1^(r1/2) r2^(r2/2)/(2^((r1+r2)/2) Γ(r1/2)Γ(r2/2))) z^(r1/2−1) ∫_0^∞ x^((r1+r2)/2−1) e^(−(r1z+r2)x/2) dx
= (r1^(r1/2) r2^(r2/2)/(Γ(r1/2)Γ(r2/2))) (r1z + r2)^(−(r1+r2)/2) z^(r1/2−1) ∫_0^∞ t^((r1+r2)/2−1) e^(−t) dt
= (Γ((r1 + r2)/2)/(Γ(r1/2)Γ(r2/2))) (r1/r2)^(r1/2) z^(r1/2−1) (1 + (r1/r2)z)^(−(r1+r2)/2), 0 < z < ∞.
21 Sampling distributions
21.1 Population and random sample
The recorded data values x1, . . . , xn are typically considered as the observed values of iid random variables X1, . . . , Xn having a common probability distribution f(x). To judge the quality of data, it is useful to envisage a population from which the sample should be drawn. A random sample is chosen at random from the population to ensure that the sample is representative of the population.
Let X1, . . . , Xn be iid normal random variables with parameter (µ, σ²). Then the random variable
X̄ = (1/n) Σ_{i=1}^n Xi
is called the sample mean, and the random variable
S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²
is called the sample variance. And they have the following properties.
1. E[X̄] = µ and Var(X̄) = σ²/n.
2. The sample variance S² can be rewritten as
S² = (1/(n − 1)) (Σ_{i=1}^n Xi² − nX̄²) = (1/(n − 1)) (Σ_{i=1}^n (Xi − µ)² − n(X̄ − µ)²).
In particular, we can obtain
E[S²] = (1/(n − 1)) (Σ_{i=1}^n E[(Xi − µ)²] − nE[(X̄ − µ)²]) = (1/(n − 1)) (Σ_{i=1}^n Var(Xi) − nVar(X̄)) = σ².
3. X̄ has the normal distribution with parameter (µ, σ²/n), and X̄ and S² are independent. Furthermore,
W_(n−1) = (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^n (Xi − X̄)²
has the chi-square distribution with (n − 1) degrees of freedom.
4. The random variable
T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √(((n − 1)S²/σ²)/(n − 1))
has the t-distribution with (n − 1) degrees of freedom.
Rough explanation of Property 3. Assuming that X̄ and S² are independent (which is not so easy to prove), we can show that the random variable W_(n−1) has the chi-square distribution with (n − 1) degrees of freedom. By introducing V = (1/σ²) Σ_{i=1}^n (Xi − µ)² and W1 = (n/σ²)(X̄ − µ)², we can write V = W_(n−1) + W1. Since W_(n−1) and W1 are independent, the mgf MV(t) of V can be expressed as
MV(t) = MW_(n−1)(t) × MW1(t),
where MW_(n−1)(t) and MW1(t) are the mgf's of W_(n−1) and W1, respectively. Observe that W1 has the chi-square distribution with 1 degree of freedom. Thus, we can obtain
MW_(n−1)(t) = MV(t)/MW1(t) = (1 − 2t)^(−n/2)/(1 − 2t)^(−1/2) = (1 − 2t)^(−(n−1)/2),
which is the mgf of the chi-square distribution with (n − 1) degrees of freedom.
21.2 Sampling distributions under normal assumption
Suppose that the data X1, . . . , Xn are iid normal random variables with parameter (µ, σ²).
1. The sample mean X̄ is normally distributed with parameter (µ, σ²/n), and the random variable
√n(X̄ − µ)/σ ∼ N(0, 1)   (34)
has the standard normal distribution.
2. The random variable
(1/σ²) Σ_{i=1}^n (Xi − X̄)² = (n − 1)S²/σ² ∼ χ²_(n−1)
is distributed as the chi-square distribution with (n − 1) degrees of freedom, and is independent of the random variable (34).
3. The random variable
(X̄ − µ)/(S/√n) = [√n(X̄ − µ)/σ] / √(((n − 1)S²/σ²)/(n − 1)) ∼ t_(n−1)
has the t-distribution with (n − 1) degrees of freedom, where S = √(S²) is called the sample standard deviation.
21.3 Critical points
Given a random variable X, the critical point for level α is defined as the value c satisfying P (X > c) = α.
1. Suppose that a random variable X has the chi-square distribution with n degrees of freedom. Then the
critical point of chi-square distribution, denoted by χα,n , is defined by
P (X > χα,n ) = α.
2. Suppose that a random variable Z has the t-distribution with m degrees of freedom. Then the critical
point is defined as the value tα,m satisfying
P (Z > tα,m ) = α.
Since the t-distribution is symmetric, the critical point tα,m can be equivalently given by
P (Z < −tα,m ) = α.
Together we can obtain the expression
P (−tα/2,m ≤ Z ≤ tα/2,m ) = 1 − P (Z < −tα/2,m ) − P (Z > tα/2,m ) = 1 − α.
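In practice these critical points are read off with a statistics library rather than a table. The following is an added sketch assuming scipy; the level α and the degrees of freedom n and m are hypothetical. The ppf (inverse cdf) gives the point c with P(X ≤ c) = 1 − α, which is exactly the level-α critical point.

from scipy.stats import chi2, t

alpha, n, m = 0.05, 10, 15                    # hypothetical level and degrees of freedom
chi_crit = chi2.ppf(1 - alpha, df=n)          # P(X > chi_crit) = alpha for X ~ chi-square(n)
t_crit = t.ppf(1 - alpha, df=m)               # P(Z > t_crit) = alpha for Z ~ t(m)
t_two_sided = t.ppf(1 - alpha / 2, df=m)      # P(-c <= Z <= c) = 1 - alpha
print(chi_crit, t_crit, t_two_sided)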
21.4 Exercises
1. Consider a sample X1 , . . . , Xn of normally distributed random variables with variance σ 2 = 5. Suppose
that n = 31.
(a) What is the value of c for which P (S 2 ≤ c) = 0.90?
(b) What is the value of c for which P (S 2 ≤ c) = 0.95?
2. Consider a sample X1 , . . . , Xn of normally distributed random variables with mean µ. Suppose that n = 16.
(a) What is the value of c for which P (|4(X̄ − µ)/S| ≤ c) = 0.95?
(b) What is the value of c for which P (|4(X̄ − µ)/S| ≤ c) = 0.99?
3. Suppose that X1 , . . . , X10 are iid normal random variables with parameters µ = 0 and σ 2 > 0.
(a) Let Z = X1 + X2 + X3 + X4 . Then what is the distribution for Z?
(b) Let W = X5² + X6² + X7² + X8² + X9² + X10². Then what is the distribution for W/σ²?
(c) Describe the distribution for the random variable (Z/2)/√(W/6).
(d) Find the value c for which P(Z/√W ≤ c) = 0.95.
22 Law of large numbers
22.1 Chebyshev's inequality
We can define an indicator function I_A of a subset A of the real line as
I_A(x) = 1 if x ∈ A, and I_A(x) = 0 otherwise.
Let X be a random variable having the pdf f(x). Then we can express the probability
P(X ∈ A) = ∫_A f(x) dx = ∫_{−∞}^{∞} I_A(x) f(x) dx.
Now suppose that X has the mean µ = E[X] and the variance σ² = Var(X), and that we choose the particular subset
A = {x : |x − µ| > t}.
Then we can observe that
I_A(x) ≤ (x − µ)²/t² for all x,
which can be used to derive
P(|X − µ| > t) = P(X ∈ A) = ∫_{−∞}^{∞} I_A(x) f(x) dx ≤ (1/t²) ∫_{−∞}^{∞} (x − µ)² f(x) dx = Var(X)/t² = σ²/t².
The inequality P(|X − µ| > t) ≤ σ²/t² is called Chebyshev's inequality.
22.2 Weak law of large numbers
Let Y1, Y2, . . . be a sequence of random variables. Then we can define the event An = {|Yn − α| > ε} and the probability P(An) = P(|Yn − α| > ε). We say that the sequence Yn converges to α in probability if we have
lim_{n→∞} P(|Yn − α| > ε) = 0
for any choice of ε > 0.
Now let X1, X2, . . . be a sequence of iid random variables with mean µ and variance σ². Then for each n = 1, 2, . . ., the sample mean
X̄n = (1/n) Σ_{i=1}^n Xi
has mean µ and variance σ²/n. Given a fixed ε > 0, we can apply Chebyshev's inequality to obtain
P(|X̄n − µ| > ε) ≤ (σ²/n)/ε² → 0 as n → ∞.
Thus, we have established that “the sample mean X̄n converges to µ in probability,” which is called the weak law of large numbers. A stronger result, “the sample mean X̄n converges to µ almost surely,” was established by Kolmogorov, and is called the strong law of large numbers.
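The weak law is easy to see numerically: as n grows, the sample mean concentrates around µ. A brief added sketch assuming numpy, with hypothetical exponential data of mean 2:

import numpy as np

rng = np.random.default_rng(6)
mu = 2.0                                        # true mean of the hypothetical Exponential data
eps = 0.1
for n in (10, 100, 1_000, 10_000):
    xbar = rng.exponential(scale=mu, size=(5_000, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > eps))  # estimate of P(|X_bar_n - mu| > eps), shrinking in n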
23 Central limit theorems
23.1 Convergence in distribution
Let X1, X2, . . . be a sequence of random variables having the cdf's F1, F2, . . ., and let X be a random variable having the cdf F. Then we say that the sequence Xn converges to X in distribution (in short, Xn converges to F) if
lim_{n→∞} Fn(x) = F(x) for every x at which F(x) is continuous.   (35)
Note that the convergence in (35) is completely characterized in terms of the distributions F1, F2, . . . and F. Recall that the distributions F1(x), F2(x), . . . and F(x) are uniquely determined by the respective moment generating functions, say M1(t), M2(t), . . . and M(t). Furthermore, we have an “equivalent” version of the convergence in terms of the mgf's.
Continuity Theorem. If
lim_{n→∞} Mn(t) = M(t) for every t (around 0),   (36)
then the corresponding distributions Fn's converge to F in the sense of (35).
Alternative criterion for convergence in distribution. Let Y1, Y2, . . . be a sequence of random variables, and let Y be a random variable. Then the sequence Yn converges to Y in distribution if and only if
lim_{n→∞} E[g(Yn)] = E[g(Y)]
for every bounded continuous function g.
Slutsky’s theorem. Let Z1 , Z2 , . . . and W1 , W2 , . . . be two sequences of random variables, and let c be a constant
value. Suppose that the sequence Zn converges to Z in distribution, and that the sequence Wn converges to c in
probability. Then
1. The sequence Zn + Wn converges to Z + c in distribution.
2. The sequence Zn Wn converges to cZ in distribution.
3. The sequence Zn /Wn converges to Z/c in distribution.
23.2 Central Limit Theorem
Let X1, X2, . . . be a sequence of iid random variables having the common distribution F with mean 0 and variance 1. Then
Zn = (1/√n) Σ_{i=1}^n Xi,  n = 1, 2, . . .
converges to the standard normal distribution.
A sketch of proof. Let M be the mgf for the distribution F, and let Mn be the mgf for the random variable Zn for n = 1, 2, . . .. Since the Xi's are iid random variables, we have
Mn(t) = E[e^(tZn)] = E[e^((t/√n)(X1+···+Xn))] = E[e^((t/√n)X1)] × · · · × E[e^((t/√n)Xn)] = [M(t/√n)]^n.
Observing that M′(0) = E[X1] = 0 and M″(0) = E[X1²] = 1, we can apply a Taylor expansion to M(t/√n) around 0:
M(t/√n) = M(0) + (t/√n) M′(0) + (t²/(2n)) M″(0) + εn(t) = 1 + t²/(2n) + εn(t).
Recall that lim_{n→∞} (1 + a/n)^n = e^a; in a similar manner, we can obtain
lim_{n→∞} Mn(t) = lim_{n→∞} [1 + (t²/2)/n + εn(t)]^n = e^(t²/2),
which is the mgf for the standard normal distribution.
23.3 Normal approximation
Let X1, X2, . . . be a sequence of iid random variables having the common distribution F with mean µ and variance σ². Then
Zn = (Σ_{i=1}^n Xi − nµ)/(σ√n) = (1/√n) Σ_{i=1}^n (Xi − µ)/σ,  n = 1, 2, . . .
converges to the standard normal distribution.
Let X1, X2, . . . , Xn be finitely many iid random variables with mean µ and variance σ². If the size n is adequately large, then the distribution of the sum
Y = Σ_{i=1}^n Xi
can be approximated by the normal distribution with parameter (nµ, nσ²). A general rule for “adequately large” n is about n ≥ 30, but the approximation is often good for much smaller n. Similarly, the sample mean
X̄ = (1/n) Σ_{i=1}^n Xi
has approximately the normal distribution with parameter (µ, σ²/n). Then the following normal approximations can be applied to probability computations.
Random variable: Y = X1 + · · · + Xn.  Normal approximation: N(nµ, nσ²).  Probability: P(a ≤ Y ≤ b) ≈ Φ((b − nµ)/(σ√n)) − Φ((a − nµ)/(σ√n)).
Random variable: X̄ = (X1 + · · · + Xn)/n.  Normal approximation: N(µ, σ²/n).  Probability: P(a ≤ X̄ ≤ b) ≈ Φ((b − µ)/(σ/√n)) − Φ((a − µ)/(σ/√n)).
Example. An insurance company has 10,000 automobile policyholders. The expected yearly claim per policyholder is $240, with a standard deviation of $800. Find approximately the probability that the total yearly claim exceeds $2.7 million. Can you say that such an event is statistically highly unlikely?
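A sketch of the computation for this example (added, assuming scipy is available): under the normal approximation the total claim Y is approximately N(nµ, nσ²) with n = 10,000, µ = 240, and σ = 800.

import numpy as np
from scipy.stats import norm

n, mu, sigma = 10_000, 240.0, 800.0
total_mean = n * mu                       # 2,400,000
total_sd = sigma * np.sqrt(n)             # 80,000
z = (2_700_000 - total_mean) / total_sd   # 3.75
print(1 - norm.cdf(z))                    # about 8.8e-05, so the event is indeed very unlikely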
23.4 Normal approximation to binomial distribution
Suppose that X1, . . . , Xn are iid Bernoulli random variables with mean p = E(X) and variance p(1 − p) = Var(X). If the size n is adequately large, then the distribution of the sum
Y = Σ_{i=1}^n Xi
can be approximated by the normal distribution with parameter (np, np(1 − p)). Thus, the normal distribution N(np, np(1 − p)) approximates the binomial distribution B(n, p). A general rule for “adequately large” n is to satisfy np ≥ 5 and n(1 − p) ≥ 5.
Let Y be a binomial random variable with parameter (n, p), and let Z be a normal random variable with parameter (np, np(1 − p)). Then the distribution of Y can be approximated by that of Z. Since Z is a continuous random variable, the approximation of probability should improve when the following formula of continuity correction is used:
P(i ≤ Y ≤ j) ≈ P(i − 0.5 ≤ Z ≤ j + 0.5) = Φ((j + 0.5 − np)/√(np(1 − p))) − Φ((i − 0.5 − np)/√(np(1 − p))).
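The continuity correction can be compared with the exact binomial probability. The following is an added sketch assuming scipy; the parameters n, p, i, j are hypothetical.

import numpy as np
from scipy.stats import binom, norm

n, p = 20, 0.4                                 # hypothetical parameters
i, j = 6, 10
exact = binom.cdf(j, n, p) - binom.cdf(i - 1, n, p)           # exact P(i <= Y <= j)
mu, sd = n * p, np.sqrt(n * p * (1 - p))
approx = norm.cdf((j + 0.5 - mu) / sd) - norm.cdf((i - 0.5 - mu) / sd)   # continuity correction
print(exact, approx)                            # the two values should be close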
23.5 Exercises
1. The actual voltage of a new 1.5-volt battery has the probability density function
f(x) = 5, 1.4 ≤ x ≤ 1.6.
Estimate the probability that the sum of the voltages from 120 new batteries lies between 170 and 190
volts.
2. The germination time in days of a newly planted seed has the probability density function
f (x) = 0.3e−0.3x ,
x ≥ 0.
If the germination times of different seeds are independent of one another, estimate the probability that
the average germination time of 2000 seeds is between 3.1 and 3.4 days.
3. Calculate the following probabilities by using normal approximation with continuity correction.
(a) Let X be a binomial random variable with n = 10 and p = 0.7. Find P (X ≥ 8).
(b) Let X be a binomial random variable with n = 15 and p = 0.3. Find P (2 ≤ X ≤ 7).
(c) Let X be a binomial random variable with n = 9 and p = 0.4. Find P (X ≤ 4).
(d) Let X be a binomial random variable with n = 14 and p = 0.6. Find P (8 ≤ X ≤ 11).