Lectures for STP 421: Probability Theory
Jay Taylor
August 30, 2016
Contents
1 Overview and Conceptual Foundations of Probability
1.1 Determinism, Uncertainty, and Probability
1.2 Probability Spaces and Measure Theory
1.3 Properties of Events
1.4 Properties of Probabilities
2 Conditional Probabilities
2.1 Motivation
2.2 Independence
2.3 The Law of Total Probability
2.4 Bayes' Formula
2.4.1 Bayesian Statistics
2.4.2 Mendelian Genetics
2.5 Conditional Probabilities are Probabilities (Optional)
2.6 The Gambler's Ruin Problem (Optional)
3 Random Variables and Probability Distributions
3.1 Random Variables
3.1.1 Discrete random variables
3.1.2 Continuous random variables
3.1.3 Properties of the Cumulative Distribution Function
3.2 Expectations and Moments
3.2.1 Moment generating functions
4 A Menagerie of Distributions
4.1 The Binomial Distribution
4.1.1 Sampling and the hypergeometric distribution
4.2 The Poisson Distribution
4.2.1 Fluctuation tests and the origin of adaptive mutations
4.2.2 Poisson processes
4.3 Random Waiting Times
4.3.1 The geometric and exponential distributions
4.3.2 The gamma distribution
4.4 Continuous Distributions on Finite Intervals
4.4.1 The uniform distribution
4.4.2 The beta distribution
4.4.3 Bayesian estimation of proportions
4.4.4 Binary to analogue: construction of the uniform distribution
4.5 The Normal Distribution
4.6 Transformations of Random Variables
5 Random Vectors and Multivariate Distributions
5.1 Joint and Marginal Distributions
5.2 Independent Random Variables
5.3 Conditional Distributions
5.3.1 Discrete Conditional Distributions
5.3.2 Continuous Conditional Distributions
5.3.3 Conditional Expectations
5.4 Expectations of Sums and Products of Random Variables
5.4.1 Expectations of products
5.5 Covariance and Correlation
5.6 The Law of Large Numbers and the Central Limit Theorem
5.6.1 The Weak and Strong Laws of Large Numbers
5.6.2 The Central Limit Theorem
5.7 Multivariate Normal Distributions
5.8 Sample Statistics
5.8.1 The χ2 Distribution
5.8.2 Student's t Distribution
Chapter 1
Overview and Conceptual Foundations of Probability
1.1 Determinism, Uncertainty, and Probability
Scientific determinism postulates that we can predict how a system will develop over time if
we know (i) the laws governing the system and (ii) the initial state of the system. In particular,
in a deterministic world, we should obtain the same outcome whenever we repeat an experiment
under identical conditions. Deterministic theories were particularly influential in the eighteenth
and nineteenth centuries in large part because of the remarkable successes of Newtonian mechanics and remain important today, as can be seen in the continuing use of ordinary and partial
differential equations to model numerous physical phenomena. For example, Newton’s second
law - “force equals mass times acceleration” - can be expressed as a system of ODE’s,
m d²x/dt² = F(x),    x(0) = x0,    x′(0) = v0,
which can be solved exactly (in principle) if we know the forces F acting on the system and
the initial state (x0 , v0 ) of the system. In fact, Newtonian mechanics once seemed so effective
at describing nature that the mathematician Pierre-Simon Laplace (1814) was led to theorize
that the entire future and past of the universe could be determined exactly by an entity (Laplace's
demon) that knew the current positions and momenta of all of the particles in the universe.
However, despite these successes, scientific applications of deterministic models are encumbered
by multiple sources of uncertainty, some of which are listed below.
• Imprecise knowledge of the state of the system due to measurement error.
• Imprecise knowledge of model parameters.
• Model uncertainty and approximation: George Box: “All models are wrong, but some are
useful.” For example, molecular and cosmological processes are usually neglected when we
formulate models of circulation or airplane flight.
• Numerical solutions are usually approximate due to round-off and truncation errors.
• Certain kinds of deterministic models (chaotic dynamical systems) are known to be exquisitely
sensitive to their initial conditions in the sense that trajectories that are initially arbitrarily
close diverge exponentially.
• Other models are ill-posed in the sense that they either don’t have strong solutions or they
have more than one solution consistent with initial and/or boundary data. For example, it
is possible that the Navier-Stokes equation, which is used to model fluid flow, lacks strong
solutions for some physically realistic parametrizations.
• According to some interpretations, quantum mechanics provides an intrinsically stochastic
description of nature, i.e., Schrödinger’s equation only allows us to calculate the probability
that a system occupies a given state.
For these reasons, scientists routinely must deal with uncertainty in one form or another. While
different theoretical tools have been developed to quantify uncertainty, most approaches make
use of probability theory. As a mathematical subject, probability has its origins in the 17’th
century in studies of games of chance and gambling. Since then, several different interpretations
of probabilities have been proposed and these differ mainly in the status that they assign to
probability as either a description of some feature of the physical world or as a description of our
knowledge of the world. The main interpretations are described in the following list.
• Frequentist interpretation: The probability of an event is equal to its limiting frequency
in an infinite series of independent identical trials (R. Ellis, J. Venn).
• Subjective interpretation: The probability of a proposition is a measure of how strongly
an individual believes that the proposition is true (F. Ramsey, J. von Neumann, B. de
Finetti). According to this interpretation, different individuals can rationally assign different probabilities to the same proposition.
• Objective interpretation: The probability of a proposition is a measure of the strength
of evidence supporting that proposition (J. M. Keynes, R. Jeffrey, R. T. Cox, E. Jaynes,
G. Polya). Under this interpretation, rational individuals with the same evidence should
assign the same probabilities to a proposition.
The frequentist interpretation is sometimes said to be physical because it considers probabilities to be properties of the physical world that are independent of any observer. In contrast,
the subjective and objective (logical) interpretations are said to be Bayesian or epistemological because
they treat probabilities as properties of an observer’s knowledge or beliefs about the world. A
very important difference between the frequentist and Bayesian interpretations concerns their
scope: whereas frequentist probabilities can only be assigned to events that can occur repeatedly, Bayesian probabilities can be assigned to events that may only occur once. For example,
both frequentist and Bayesian probabilities can be assigned to the event that a tossed coin will
land on heads, but only a Bayesian probability can be assigned to the proposition that Hillary
Clinton will win the next presidential election since that election will occur exactly once. In this
sense, the Bayesian interpretations of probability allow for a much wider theory than does the
frequentist interpretation, although not everyone regards this as legitimate.
1.2 Probability Spaces and Measure Theory
Although probabilities can be interpreted in several different ways, it turns out that there is a
common formal mathematical framework that is sufficient for all of these interpretations. This is
known as measure theoretic probability, which was formulated by the Russian mathematician Andrei Kolmogorov in the early 1930’s. The most important object in measure theoretic
probability is known as a probability space, which is defined below. To motivate the definition,
imagine that you are conducting an experiment with a random outcome and that you wish to
describe this scenario mathematically. Here we will temporarily adopt the frequentist viewpoint.
The first thing that we need to know is the set of all possible outcomes for the experiment. Let us
call this set Ω. In many cases, we may not be interested in the exact outcome of the experiment,
but rather in knowing whether it belongs to a particular collection of elements in Ω that share
some property, e.g., we sample a person at random and we want to know if their height is greater
than 6 feet. Such a subset is said to be an event. To describe the uncertainty of the experiment,
we need to be able to assign a probability to each event that might be of interest to us. In other
words, we want to be able to write down (or at least calculate) a set function P which takes an
event as an argument and returns a real number. For example, if E is an event, we are interested
in the probability of E, which will be denoted P(E). P is said to be a probability distribution
or probability measure.
Under the frequentist interpretation, P(E) is the limiting frequency of the event E in a series
of independent identical trials. This tells us that probabilities will always take values between 0
and 1, so that the range of the function P is the interval [0, 1]. In other words, for every event
E, we know that 0 ≤ P(E) ≤ 1. (Hint: if you ever calculate a probability and arrive at either a
negative number or a number greater than 1, then somewhere along the way you have made a
mistake.) Furthermore, if we assume that every trial results in exactly one outcome in Ω, then it
follows that the probability of Ω is equal to 1, while the probability of the empty set ∅ is equal to
0: P(∅) = 0 and P(Ω) = 1. Finally, if E and F are mutually exclusive events, meaning that
E and F cannot both occur, then it is easy to see that the limiting frequency
of the event E ∪ F that either E or F occurs is equal to the sum of the limiting frequencies of
E and F . In other words, if E and F are disjoint subsets of Ω, then P(E ∪ F ) = P(E) + P(F ).
Similarly, if E1 , E2 , · · · , En are mutually exclusive, then similar considerations indicate that
P(E1 ∪ · · · ∪ En ) = P(E1 ) + · · · + P(En ).
The following definition provides a formal framework for these observations.
Definition 1.1. A probability space is a triple {Ω, F, P} where:
• Ω is the sample space, i.e., the set of all possible outcomes of the experiment.
• F is a collection of subsets of Ω which we call events. F is called a σ-algebra and is
required to satisfy the following conditions:
1. The empty set and the sample space are both events: ∅, Ω ∈ F.
2. If E is an event, then its complement E c = Ω\E is also an event.
3. If E1 , E2 , · · · are events, then their union ∪n En is an event.
• P is a function from F into [0, 1]: if E is an event, then P(E) is the probability of E. P is
said to be a probability distribution or probability measure on F and is also required
to satisfy several conditions:
1. P(∅) = 0; P(Ω) = 1.
2. Countable additivity: If E1, E2, · · · are mutually exclusive events, i.e., Ei ∩ Ej = ∅ whenever i ≠ j, then
P( ⋃_{n=1}^∞ En ) = Σ_{n=1}^∞ P(En).
Here are several examples.
Example 1.1. Suppose that we toss a fair coin once and let Ω = {H, T } be the set of possible
outcomes, i.e., either H if we get heads or T if we get tails. Let F = {∅, {H}, {T }, Ω} be the
collection of all subsets of Ω and define the probability distribution P by setting
P(∅) = 0,    P({H}) = P({T}) = 1/2,    P(Ω) = 1.
Then (Ω, F, P) is a probability space which provides a model for the outcome of the coin toss.
Example 1.2. Now suppose that we toss a fair coin repeatedly and let N be the number of tosses
until we get the first heads (including that toss). Let Ω = N = {1, 2, 3, · · · }, let F = P(Ω) be
the power set of Ω, i.e., the collection of all subsets of Ω, and define the probability of an event
E = {n1 , n2 , · · · } ⊂ Ω by
P(E) ≡ Σ_{ni ∈ E} (1/2)^{ni}.
Then (Ω, F, P) is a probability space which provides a model for the number of tosses that are
carried out until we get heads. The variable N is said to be a random variable.
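To make the measure in Example 1.2 concrete, here is a short Monte Carlo sketch in Python (not part of the original notes; the helper tosses_until_heads and the choice of event E are our own illustrations) that estimates P(E) by simulation and compares it with the defining sum:

import random

def tosses_until_heads():
    # Return the index of the first toss that lands on heads (1-based).
    n = 1
    while random.random() >= 0.5:      # interpret values >= 0.5 as tails
        n += 1
    return n

trials = 200_000
E = {1, 3, 5}                          # example event: first heads occurs on toss 1, 3, or 5
empirical = sum(tosses_until_heads() in E for _ in range(trials)) / trials
exact = sum(0.5 ** n for n in E)       # P(E) as defined in Example 1.2
print(empirical, "vs", exact)          # both close to 0.65625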
Example 1.3. A probability space (Ω, F, P) is said to be a symmetric probability space if
the sample space Ω = {x1 , · · · , xN } is finite; the σ-algebra F is the collection of all subsets of Ω,
i.e., F = P(Ω) is the power set of Ω; and the probability of any event E ⊂ Ω is equal to
P(E) = |E|/|Ω| = |E|/N,
where |E| is equal to the number of elements in E. In particular, it follows that the probability
of any single element x ∈ Ω is equal to 1/N and so all elements are equally likely.
Example 1.4. Suppose that a deck contains 52 cards and let Ω be the set containing all 52!
different ways of ordering the deck. If we assume that the deck is randomly shuffled, then every
ordering of cards is equally likely and so we can use the symmetric probability space (Ω, F, P) as
a mathematical model for the effect of shuffling on the order of the cards.
We next consider why we need to restrict to countable additivity and why we only require
σ-algebras to be closed under countable unions.
Example 1.5. Suppose that we want to model an experiment in which the outcome is a real
number that is uniformly distributed between 0 and 1. Then
• Ω = [0, 1];
• F should at least contain all open and closed intervals contained in [0, 1];
• P([a, b]) = b − a whenever 0 ≤ a ≤ b ≤ 1.
The last definition asserts that the probability that the outcome lies in an interval contained in
[0, 1] is equal to the length of the interval, which is what we mean when we say the outcome is
uniformly distributed on [0, 1]. One consequence of this definition is that points have probability
0, i.e., P({x}) = P([x, x]) = x − x = 0 since {x} = [x, x]. This shows that the countable additivity
of probability measures cannot generally be extended to uncountable collections of events, since
the left-hand and right-hand sides of the following expression are not equal:
1 = P([0, 1]) ≠ Σ_{x ∈ [0,1]} P({x}) = 0.
Indeed, the problem here is that although the sets {x} and {y} are disjoint whenever x ≠ y, there are uncountably many such sets contained in [0, 1], i.e., [0, 1] is uncountably infinite. Later we
will see that such sums need to be replaced by integrals.
Notice that we haven’t specified the σ-algebra in Example 1.5, only that it should contain all
of the open and closed intervals contained in [0, 1]. One might hope that we could take F to
be the power set of [0, 1], i.e., we will let all subsets of [0, 1] be events, but it can be shown that
there is no function P defined on the power set P([0, 1]) with values in [0, 1] which satisfies the
conditions required of a probability measure and has the property that P([a, b]) = b − a for all
[a, b] ⊂ [0, 1]. However, one can also show that there is a σ-algebra F containing all of the open
and closed intervals which is strictly smaller than P([0, 1]) but which allows one to define P as
above in a self-consistent manner. This is called the Borel σ-algebra on [0, 1] and it is defined
to be the smallest σ-algebra containing all of the open intervals, i.e., it is the intersection of all
σ-algebras which contain these intervals. The sets contained in this σ-algebra are called Borel
sets. In general, any set that we might reasonably expect to encounter in applications will be a
Borel set, but there is no simple description of what these sets “look like”, i.e., the structure of
a Borel set can be arbitrarily complicated.
In this course, we will mostly pay very little attention to the σ-algebras defined on the probability
spaces that we study. With finite or countably-infinite sets, these usually are taken to be the
collection of all subsets of the sample space, so that the probability of every subset is defined.
Probability measures on uncountable sets are more complicated objects and one needs to pay
attention to these technical details to avoid contradiction. David Williams’ book ‘Probability
with Martingales’ (Cambridge, 1991) provides an excellent introduction to these issues for those
who wish to learn the mathematics more rigorously than presented here.
1.3 Properties of Events
Because the mathematical theory of probability is described using the language of sets, to apply
this theory to real-world problems we need to be able to translate these problems into statements
about sets. To help with this, we recall the following properties of sets.
Operations with events:
• union: E ∪ F is the set of outcomes that are in E or F (non-exclusive)
• intersection: E ∩ F is the set of outcomes that are in both E and F
– some authors use the notation EF
– E and F are said to be mutually exclusive if E ∩ F = ∅
• unions and intersections can be extended to more than two sets.
• complements: E c is the set of outcomes in Ω that are not in E
Unions and intersections satisfy several important algebraic laws:
• Commutative laws: E ∪ F = F ∪ E, E ∩ F = F ∩ E
• Associative laws: (E ∪ F ) ∪ G = E ∪ (F ∪ G), (E ∩ F ) ∩ G = E ∩ (F ∩ G)
• Distributive laws: (E ∪ F ) ∩ G = (E ∩ G) ∪ (F ∩ G), (E ∩ F ) ∪ G = (E ∪ G) ∩ (F ∪ G)
Lemma 1.1. DeMorgan's laws:
( ⋃_{n=1}^∞ En )^c = ⋂_{n=1}^∞ En^c,    ( ⋂_{n=1}^∞ En )^c = ⋃_{n=1}^∞ En^c.
One consequence of DeMorgan’s laws is that σ-algebras are closed under countable intersections:
if E1 , E2 , · · · are events in a σ-algebra F, then
⋂_{n=1}^∞ En
is also an event in F.
1.4 Properties of Probabilities
The following proposition describes some useful properties that follow from the definition of a
probability space. These are often used when calculating probabilities and should be committed
to memory.
Proposition 1.1. The following properties hold for any two events A, B in a probability space:
1. Complements: P(Ac ) = 1 − P(A).
2. Subsets: if A ⊂ B, then P(A) ≤ P(B);
3. Disjoint events: If A and B are mutually exclusive, then
P(A ∩ B) = 0.
4. Unions: For any two events A and B (not necessarily mutually exclusive), we have:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example 1.6. Suppose that the frequency of HIV infection in a community is 0.05, that the
frequency of tuberculosis infection is 0.1, and that the frequency of dual infection is 0.01. What
proportion of individuals have neither HIV nor tuberculosis?
P(HIV or tuberculosis) = P(HIV) + P(tuberculosis) − P(both) = 0.05 + 0.1 − 0.01 = 0.14
P(neither) = 1 − P(HIV or tuberculosis) = 0.86
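The properties in Proposition 1.1 are easy to verify concretely on a small symmetric probability space. The following Python sketch (our own illustration, not from the notes; the helper prob is hypothetical) checks the complement and union rules for a fair six-sided die using exact rational arithmetic:

from fractions import Fraction

omega = set(range(1, 7))                       # sample space of one die roll

def prob(event):
    # P(E) = |E| / |Omega| for a symmetric probability space.
    return Fraction(len(event & omega), len(omega))

A = {1, 2, 3}          # roll at most 3
B = {2, 4, 6}          # roll an even number

assert prob(omega - A) == 1 - prob(A)                          # complements
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)          # unions
print(prob(A | B))     # 5/6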
The following identity generalizes part 4 of Proposition 1.1 and is known as the inclusion-exclusion
formula. It too is sometimes useful when performing calculations.
Theorem 1.1. Let E1, · · · , En be a collection of events. Then
P(E1 ∪ E2 ∪ · · · ∪ En) = Σ_{i=1}^n P(Ei) − Σ_{1≤i1<i2≤n} P(Ei1 ∩ Ei2) + · · ·
+ (−1)^{r+1} Σ_{i1<i2<···<ir} P(Ei1 ∩ Ei2 ∩ · · · ∩ Eir) + · · · + (−1)^{n+1} P(E1 ∩ E2 ∩ · · · ∩ En).
This can be verified with the help of indicator variables and expectations.
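One way to convince yourself of Theorem 1.1 without indicator variables is to check it numerically. The sketch below (ours; the sample space and the three events are arbitrary choices) evaluates both sides of the inclusion-exclusion formula exactly on a small symmetric probability space:

from fractions import Fraction
from itertools import combinations

omega = set(range(1, 13))                         # a 12-point symmetric sample space
prob = lambda E: Fraction(len(E), len(omega))

events = [{1, 2, 3, 4}, {3, 4, 5, 6}, {2, 4, 6, 8, 10, 12}]

# Right-hand side of Theorem 1.1: alternating sum over all non-empty sub-collections.
rhs = Fraction(0)
for r in range(1, len(events) + 1):
    for subset in combinations(events, r):
        rhs += (-1) ** (r + 1) * prob(set.intersection(*subset))

lhs = prob(set.union(*events))                    # probability of the union, computed directly
assert lhs == rhs
print(lhs)                                        # 3/4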
Chapter 2
Conditional Probabilities
2.1 Motivation
Suppose that two fair dice are rolled and that the sum of the two rolls is even. What is the
probability that we have rolled a 1 with the first die?
Without the benefit of any theory, we could address this question empirically in the following
way. Suppose that we perform this experiment N times, with N large, and let (xi , yi ) be the
outcome obtained on the i’th trial, where xi is the number rolled with the first die and yi is the
number rolled with the second die. Then we can estimate the probability of interest (which we
will call P ) by dividing the number of trials for which xi = 1 and xi + yi is even by the total
number of trials that have xi + yi even:
P ≈ #{i : xi = 1 and xi + yi even} / #{i : xi + yi even}
= [#{i : xi = 1 and xi + yi even}/N] / [#{i : xi + yi even}/N],
where we have divided both the numerator and denominator by N in the second line.
Now, notice that for N large, we expect the following approximations to hold:
P{the first roll equals 1 and the sum is even} ≈ #{i : xi = 1 and xi + yi even }/N
P{the sum is even} ≈ #{i : xi + yi even }/N,
with these becoming identities as N tends to infinity. This suggests that we can evaluate the
probability P exactly using the following equation:
P = P{the first roll equals 1 and the sum is even} / P{the sum is even}.
Here P is said to be the conditional probability that the first roll is equal to 1 given that the
sum of the two rolls is even. This identity motivates the following definition.
Definition 2.1. Let E and F be events and assume that P(F ) > 0. Then the conditional
probability of E given F is defined by the ratio
P(E|F) = P(E ∩ F)/P(F).
F can be thought of as some additional piece of information that we have concerning the outcome of an experiment. The conditional probability P(E|F ) then summarizes how this additional
information affects our belief that the event E occurs. Notice that the ratio appearing in the
definition is only well-defined if the denominator is positive, hence the important requirement that
P(F ) > 0.
Example 2.1. A student is taking an exam. Suppose that the probability that the student finishes
the exam in less than x hours is x/2 for x ∈ [0, 2]. Show that the conditional probability that the
student does not finish the exam in one hour given that they are still working after 45 minutes is
0.8.
Example 2.2. A fair coin is flipped twice. Show that the conditional probability that both flips
land on heads given that the first flip lands on heads is 1/2. Show that the conditional probability
that both flips land on heads given that at least one flip lands on heads is 1/3.
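Both claims in Example 2.2 can be confirmed by brute-force enumeration of the four equally likely outcomes. The following Python sketch (ours; cond_prob is a hypothetical helper, not from the notes) does exactly that:

from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=2))                 # HH, HT, TH, TT, each with probability 1/4

def cond_prob(event, given):
    # P(E | F) = P(E and F) / P(F) on a symmetric space.
    F = [w for w in outcomes if given(w)]
    return Fraction(sum(event(w) for w in F), len(F))

both_heads = lambda w: w == ("H", "H")
print(cond_prob(both_heads, lambda w: w[0] == "H"))      # 1/2
print(cond_prob(both_heads, lambda w: "H" in w))         # 1/3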
The next result is a simple, but very useful consequence of the definition of conditional probability.
Proposition 2.1. (Multiplication Rule) Suppose that E and F are events and that P(F ) > 0.
Then
P(E ∩ F) = P(E|F) · P(F).    (2.1)
Similarly, if E1, · · · , Ek are events and P(E1 ∩ · · · ∩ Ek−1) > 0, then recursive application of equation (2.1) shows that
P (E1 ∩ · · · ∩ Ek ) = P (E1 ) P (E2 |E1 ) P (E3 |E1 ∩ E2 ) · · · P (Ek |E1 ∩ · · · ∩ Ek−1 ) .
Example 2.3. Suppose that n balls are chosen without replacement from an urn that initially
contains r red balls and b blue balls. Given that k of the n balls are blue, show that the conditional
probability that the first ball chosen is blue is k/n.
Solution: Let B be the event that the first ball chosen is blue and let Bk be the event that k of
the n balls chosen are blue. Then we need to calculate:
P(B|Bk) = P(B ∩ Bk)/P(Bk).
To evaluate the numerator, write
P(Bk ∩ B) = P(Bk |B)P(B),
and notice that P(B) = b/(b + r). Also, P(Bk |B) is just the probability that a set of n − 1 balls
chosen from an urn containing r red balls and b − 1 blue balls contains k − 1 blue balls, which is
equal to
P(Bk|B) = C(b−1, k−1) C(r, n−k) / C(r+b−1, n−1),
where C(m, j) denotes the binomial coefficient "m choose j".
Similarly,
P(Bk) = C(b, k) C(r, n−k) / C(r+b, n).
It follows that
P(B|Bk) = [ (b/(r+b)) · C(b−1, k−1) C(r, n−k) / C(r+b−1, n−1) ] / [ C(b, k) C(r, n−k) / C(r+b, n) ] = k/n.
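A quick simulation makes the answer k/n in Example 2.3 plausible before working through the algebra. The sketch below (ours; the particular values of r, b, n, k are arbitrary) conditions on the event B_k by rejection and estimates P(B | B_k):

import random

r, b, n, k = 5, 4, 6, 3
urn = ["red"] * r + ["blue"] * b

hits = total = 0
for _ in range(200_000):
    sample = random.sample(urn, n)              # draw n balls without replacement, in draw order
    if sample.count("blue") == k:               # condition on the event B_k
        total += 1
        hits += (sample[0] == "blue")           # event B: first ball drawn is blue
print(hits / total, "vs", k / n)                # both close to 0.5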
2.2 Independence
We begin with the definition of pairwise independence, i.e., of what it means for two events to
be independent. In the sequel, we will extend this to more than two sets.
Definition 2.2. Two events E and F are said to be independent if
P(E ∩ F ) = P(E)P(F ).
Intuitively, E and F are independent if knowing that F has occurred provides no information
about the likelihood that E has occurred and vice versa:
P(E|F) = P(E ∩ F)/P(F) = P(E)P(F)/P(F) = P(E),
P(F|E) = P(E ∩ F)/P(E) = P(E)P(F)/P(E) = P(F).
Remark 2.1. Independence is a symmetric relation: to say that E is independent of F also
means that F is independent of E.
Example 2.4. Suppose that two coins are tossed and that all four outcomes are equally likely.
Let E be the event that the first coin lands on heads and let F be the event that the second coin
lands on heads. Then, as expected, E and F are independent since
P(E ∩ F) = 1/4 = P(E)P(F).
Example 2.5. Suppose that we toss two fair dice. Let E1 be the event that the sum of the dice
is 6 and let F be the event that the first die equals 4. Then,
P(E1 ∩ F) = 1/36 ≠ (5/36) · (1/6) = P(E1)P(F),
and thus E1 and F are not independent. On the other hand, if we let E2 be the event that the
sum of the dice is 7, then
P(E2 ∩ F) = 1/36 = (1/6) · (1/6) = P(E2)P(F),
and so E2 and F are independent.
Proposition 2.2. If E and F are independent, then so are each of the following three pairs: E
and F c , E c and F , and E c and F c .
Proof. This follows by direct calculation of the joint probabilities. For example,
P (E ∩ F c ) = P(E) − P (E ∩ F )
= P(E) − P(E) · P(F )
= P(E) · P (F c ) ,
which shows that E and F c are independent.
As promised, we can extend the definition of independence to collections of more than two events.
Furthermore, in the example that follows the definition, we show that independence of a collection of objects can be a strictly stronger condition than mere independence of each pair of objects.
Definition 2.3. A countable collection of events, E1 , E2 , · · · , is said to be independent if for
every finite sub-collection Ei1 , Ei2 , Ei3 , · · · , Ein , we have:
P(Ei1 ∩ Ei2 ∩ · · · ∩ Ein) = P(Ei1) P(Ei2) · · · P(Ein).
Example 2.6. Pairwise independence does not imply independence: Suppose that two
fair dice are tossed and let E1 be the event that the first die equals 4, let E2 be the event that
the second die equals 4, and let E3 be the event that the sum of the dice is 7. As above, we
know that each pair of events is independent. However, it is not the case that all three events are
independent since
P(E1 ∩ E2 ∩ E3) = 0 ≠ (1/6)(1/6)(1/6) = P(E1)P(E2)P(E3).
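Because the sample space of two dice has only 36 points, the claims in Examples 2.5 and 2.6 can be checked exactly by enumeration. The following sketch (ours) verifies pairwise independence of E1, E2, E3 and the failure of full independence:

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
prob = lambda pred: Fraction(sum(pred(w) for w in outcomes), len(outcomes))

E1 = lambda w: w[0] == 4                 # first die equals 4
E2 = lambda w: w[1] == 4                 # second die equals 4
E3 = lambda w: w[0] + w[1] == 7          # sum of the dice is 7

pairs_ok = all(prob(lambda w, A=A, B=B: A(w) and B(w)) == prob(A) * prob(B)
               for A, B in [(E1, E2), (E1, E3), (E2, E3)])
triple = prob(lambda w: E1(w) and E2(w) and E3(w))
print(pairs_ok, triple, prob(E1) * prob(E2) * prob(E3))   # True, 0, 1/216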
Example 2.7. Suppose that an infinite sequence of independent trials is performed such that
each trial results in a success with probability p and a failure with probability q = 1 − p. Calculate
the probability that:
• (a) at least one success occurs in the first n trials;
• (b) exactly k successes occur in the first n trials;
• (c) all trials result in successes.
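For reference, the standard closed forms are 1 − q^n for (a), C(n, k) p^k q^{n−k} for (b), and 0 for (c) whenever p < 1. The sketch below (ours; the parameter values are arbitrary) checks (a) and (b) by simulation against these expressions:

import random
from math import comb

p, n, k = 0.3, 10, 4
q = 1 - p
trials = 200_000

at_least_one = exactly_k = 0
for _ in range(trials):
    successes = sum(random.random() < p for _ in range(n))
    at_least_one += (successes >= 1)
    exactly_k += (successes == k)

print(at_least_one / trials, "vs", 1 - q ** n)
print(exactly_k / trials, "vs", comb(n, k) * p ** k * q ** (n - k))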
2.3 The Law of Total Probability
Let E and F be two events and notice that we can write E as the following disjoint union:
E = (E ∩ F ) ∪ (E ∩ F c ).
Then, using the additivity of probabilities and the definition of conditional probabilities, we
obtain:
P(E) = P(E ∩ F ) + P(E ∩ F c )
= P(E|F )P(F ) + P(E|F c )P(F c )
= P(E|F )P(F ) + P(E|F c )(1 − P(F )).    (2.2)
This formula can sometimes be used to calculate the probability of a complicated event by exploiting the additional information provided by knowing whether or not F has also occurred.
The hard part is knowing how to choose F so that the conditional probabilities P(E|F ) and
P(E|F c ) are both easy to calculate.
Example 2.8. Suppose that there are two classes of people, those that are accident prone and
those that are not, and that the risk of having an accident within a one year period is either 0.4
or 0.2 depending on which class a person belongs to. If the proportion of accident-prone people
in the population is 0.3, calculate the probability that a randomly selected person has an accident
in a given year (0.26). What is the conditional probability that an individual is accident-prone
given that they have had an accident in the last year? (6/13).
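The two numbers quoted in Example 2.8 follow directly from equation (2.2) and the definition of conditional probability. A minimal sketch (ours) with exact fractions:

from fractions import Fraction

p_prone = Fraction(3, 10)                        # proportion of accident-prone people
p_acc_given_prone = Fraction(4, 10)
p_acc_given_not = Fraction(2, 10)

p_accident = p_acc_given_prone * p_prone + p_acc_given_not * (1 - p_prone)
p_prone_given_acc = p_acc_given_prone * p_prone / p_accident

print(p_accident)            # 13/50 = 0.26
print(p_prone_given_acc)     # 6/13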
Equation (2.2) has the following far-reaching generalization.
Proposition 2.3. (Law of Total Probability) Suppose that F1 , · · · , Fn are mutually exclusive
events such that P (Fi ) > 0 for each i = 1, · · · , n and E ⊂ F1 ∪ · · · ∪ Fn . Then
P(E) = Σ_{i=1}^n P(E ∩ Fi) = Σ_{i=1}^n P(E|Fi) · P(Fi).
Proof. The result follows from the countable additivity of P and the fact that E is the disjoint
union of the sets E ∩ Fi for i = 1, · · · , n.
2.4 Bayes' Formula
Bayes’ formula describes the relationship between the two conditional probabilities P(E|F ) and
P(F |E).
Theorem 2.1. (Bayes’ Formula) Suppose that E and F are events such that P(E) > 0 and
P(F ) > 0. Then
P(F|E) = P(E|F) · P(F) / P(E).    (2.3)
Similarly, if F1, · · · , Fn are mutually exclusive events such that P(Fi) > 0 for i = 1, · · · , n and E ⊂ F1 ∪ · · · ∪ Fn, then
P(Fj|E) = P(E ∩ Fj)/P(E) = P(E|Fj) P(Fj) / Σ_{i=1}^n P(E|Fi) P(Fi).    (2.4)
Example 2.9. Suppose that we have three cards, and that the first card has two red sides, the
second card has two black sides, and the third card has one red side and one black side. One of
these cards is chosen at random and placed on the ground. If the upper side of this card is red,
show that the probability that the other side is black is 1/3.
Example 2.10. Suppose that a test for HIV-1 infection is known to have a false positive rate
of 2.3% and a false negative rate of 1.4%, and that the prevalence of HIV-1 in a particular
population is equal to 1 : 10, 000. If an individual with no known risk factors takes the test and
tests positive for HIV-1 infection, what is the probability that they are, in fact, infected?
We can use Bayes’ formula to solve this problem, but we first introduce some notation. Let
H ≡ “the individual is infected by HIV-1”
H̄ ≡ “the individual is not infected by HIV-1”
D ≡ “the individual tests positive”
As above, Bayes’ formula can be written as
P(H|D) = P(H) · P(D|H)/P(D).
Here H and H̄ are the two hypotheses of interest and these are both mutually exclusive and
exhaustive. Thus, the denominator in the fraction on the right-hand side can be expanded as
P(D) = P(H)P(D|H) + P(H̄)P(D|H̄).
In light of the prior information stipulated in the problem, we have
P(H) = 0.0001
P(H̄) = 1 − P(H) = 0.9999
P(D|H) = 0.986
P(D|H̄) = 0.023.
Indeed, since the individual has no known risk factors for HIV-1 infection, we can take the overall
prevalence of HIV-1 infection in the population as our estimate of the prior probability that the
individual is infected, i.e., we are using frequency data to estimate (or elicit) the prior probability
of a hypothesis of interest. The second probability follows from the sum rule. The third probability
is concerned with the event where the individual tests positive (D) when they are in fact infected (H); this is just the probability of a true positive, which by the sum rule is equal to 1 minus the probability of a false negative, i.e., 1 − 0.014 = 0.986. Similarly, the fourth probability is that of a false positive, which is 0.023 as stipulated above. Substituting these
quantities into the equation written above shows that P(D) = 0.0230963. It then follows that
P(H|D) = (0.0001 × 0.986) / (0.0001 × 0.986 + 0.9999 × 0.023) ≈ 0.0043,
and so an individual with a positive test result but no known risk factors is still very unlikely to
be infected by HIV-1 despite the fact that the test is fairly accurate. The reason for this is that
the prevalence of HIV-1 infection in the population is so low that an individual that has no reason
to suspect that they are HIV-1 infected is far more likely to receive a positive test result due to a
testing error than because they are, in fact, infected.
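The arithmetic of Example 2.10 is easy to reproduce. The sketch below (ours; bayes is a hypothetical helper, not from the notes) packages the calculation so that the prevalence and error rates can be varied:

def bayes(prior, true_positive_rate, false_positive_rate):
    # Posterior probability of infection given a positive test result.
    p_positive = prior * true_positive_rate + (1 - prior) * false_positive_rate
    return prior * true_positive_rate / p_positive

posterior = bayes(prior=0.0001, true_positive_rate=0.986, false_positive_rate=0.023)
print(round(posterior, 4))    # about 0.0043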
Example 2.11. Suppose that 5% of men and 0.25% of women are color blind. A color-blind
person is chosen at random. Assuming that there are an equal number of males and females in
the population, what is the probability that the chosen individual is male?
Solution: Let M and F be the event that the chosen individual is a male or female, respectively,
and let C denote the event that a randomly chosen individual is color blind. Then the probability
of interest is
P(M|C) = P(C|M)P(M) / [P(C|M)P(M) + P(C|F)P(F)]
= (0.05 × 0.5) / (0.05 × 0.5 + 0.0025 × 0.5)
≈ 0.9524.
2.4.1 Bayesian Statistics
Bayes’ formula is often applied to problems of inference in the following way. Suppose that
our goal is to decide among a set of competing hypotheses, H1 , · · · , Hn . Let p1 , · · · , pn be a
probability distribution on the hypotheses, i.e., p1 + · · · + pn = 1, where pi is a measure of the
strength of our belief in hypothesis Hi prior to performing the experiment. In the jargon of
Bayesian statistics, (p1 , · · · , pn ) is said to be the prior distribution on the space of hypotheses.
Now suppose that we perform an experiment to help decide between the hypotheses, and let the
outcome of this experiment be the event E. Bayes’ formula can be used to update our belief
in the relative probabilities of the different hypotheses in light of the new data. The posterior
probability, p∗i , of hypothesis Hi is defined to be the conditional probability of Hi given E:
p∗i = P(Hi|E) = P(E|Hi)P(Hi) / Σ_{j=1}^n P(E|Hj)P(Hj) = pi · P(E|Hi) / Σ_{j=1}^n pj · P(E|Hj).
Notice that (p∗1 , · · · , p∗n ) is a probability distribution on the space of hypotheses that is called
the posterior distribution. It depends both on the data collected in the experiment as well as
on our prior belief about the likelihood of the different hypotheses.
Example 2.12. Suppose that we are studying a population containing n individuals and that our
goal is to estimate the prevalence of tuberculosis in the population. Let Hk be the hypothesis that
k individuals are infected, and suppose that our prior distribution on (H0 , · · · , Hn ) is given by
pk = 1/(n + 1),    k = 0, 1, · · · , n.
Such a prior is said to be uninformative and means that we have no prior information concerning the incidence of tuberculosis in the population.
Now, suppose that the experiment consists of sampling one member of the population and testing
them for tuberculosis infection. Let E be the event that that individual is infected and observe
that P(E|Hk ) = k/n and that
P(E) = Σ_{k=0}^n pk P(E|Hk) = Σ_{k=0}^n (1/(n + 1)) · (k/n) = (1/(n(n + 1))) · (n(n + 1)/2) = 1/2.
Using Bayes’ formula, the posterior probability of hypothesis Hi is
p∗i = pi · P(E|Hi) / (1/2) = 2 · (1/(n + 1)) · (i/n) = 2i / (n(n + 1))
for i = 0, · · · , n. Notice that p∗i ≥ pi for i ≥ n/2, i.e., observing an infected individual in our
sample increases the posterior probability that the incidence of tuberculosis is greater than 1/2.
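The posterior distribution in Example 2.12 can be computed directly for any finite n. The following sketch (ours; n = 10 is an arbitrary choice) reproduces P(E) = 1/2 and the posterior weights 2i/(n(n+1)):

from fractions import Fraction

n = 10
prior = [Fraction(1, n + 1)] * (n + 1)                   # uninformative prior on H_0, ..., H_n
likelihood = [Fraction(k, n) for k in range(n + 1)]      # P(E | H_k) = k/n, as in the example

evidence = sum(p * L for p, L in zip(prior, likelihood)) # P(E)
posterior = [p * L / evidence for p, L in zip(prior, likelihood)]

print(evidence)                                          # 1/2
print(posterior[n])                                      # 2n/(n(n+1)) = 2/11 when n = 10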
2.4.2 Mendelian Genetics
Gregor Mendel’s 1866 paper proposing a mathematical model for heredity was one of the first
applications of probability to the natural sciences. Although we now know that the genetic architecture of some traits is much more complicated than Mendel imagined, Mendelian genetics still
provides a useful framework for understanding how many traits of interest, including thousands
connected with hereditary human diseases, are inherited. In fact, these models play an important
role both in genetic counseling and in forensic DNA analysis.
We begin with a list of important facts and concepts:
• Humans have two copies of most genes (except for those on the sex chromosomes).
• A child independently inherits one gene from each parent, i.e., one from the mother and
one from the father.
• Each of the two gene copies in a parent is equally likely to be inherited.
• Different types of a gene are called alleles. These are often denoted by capital and lowercase
letters, e.g., B and b.
• A genotype is a list of the types of the two genes possessed by an individual.
• If there are two alleles, then there are three possible genotypes: BB, bb, and Bb. Notice
that we usually ignore the order of alleles when writing the genotype, i.e., Bb and bB are
the same genotype.
• The types and frequencies of the genotypes inherited by offspring can be calculated using
a Punnett square. With two alleles, B and b, the possible crosses are:
– BB × BB produces only BB
– bb × bb produces only bb
– BB × Bb produces 1/2 BB + 1/2 Bb
– bb × Bb produces 1/2 bb + 1/2 Bb
– Bb × Bb produces 1/4 BB + 1/2 Bb + 1/4 bb
– BB × bb produces only Bb.
Example 2.13. Assume that eye color is determined by a single pair of genes, which can be
either B or b. If a person has one or more copies of B, then they will have brown eyes. If a
person has two copies of b, then they will have blue eyes. Suppose that Smith and his parents
have brown eyes, and that Smith’s sister has blue eyes.
a) What is the probability that Smith carries a blue-eyed gene?
b) Suppose that Smith’s partner has blue eyes. What is the probability that their first child will
have blue eyes?
c) Suppose that their first child has brown eyes. What is the probability that their second child
will also have brown eyes?
Solutions:
a) Because Smith’s sister has blue eyes, her genotype is bb. Consequently, both of Smith’s parents
(who are brown-eyed) must have genotypes Bb, since the sister must inherit a b gene from each
parent. However, Smith himself could have genotype BB or Bb. Let E1 and E2 be the events
that Smith has genotype BB or Bb, respectively, and let F be the event that Smith has brown
eyes. Then the probability that Smith has genotype Bb is
P(E2|F) = P(F|E2)P(E2)/P(F) = (1 · 1/2)/(3/4) = 2/3,
while the probability that Smith has genotype BB is P(E1 |F ) = 1 − P(E2 |F ) = 1/3.
b) If Smith’s partner has blue eyes, then her genotype is bb. Now if Smith has genotype BB,
then his children can only have genotype Bb and therefore have zero probability of having blue
eyes. On the other hand, if Smith has genotype Bb, then his first child will have blue eyes with
probability 1/2 (i.e., with the probability that the child inherits a b gene from Smith). Thus, the
total probability that the first child has blue eyes is:
1/3 · 0 + 2/3 · 1/2 = 1/3.
c) Knowing that their first child has brown eyes provides us with additional information about
the genotypes of the parents. Let G be the event that the first child has brown eyes and let E1
and E2 be the events that Smith has genotype BB or Bb, respectively. From (a) we know that
P(E1 ) = 1/3 and P(E2 ) = 2/3, while (b) tells us that P(G) = 1 − 1/3 = 2/3. Consequently,
P(E1|G) = P(G|E1)P(E1)/P(G) = (1 · 1/3)/(2/3) = 1/2,
P(E2|G) = 1 − P(E1|G) = 1/2.
Let H be the event that the second child has brown eyes and notice that
P(H|E1 , G) = 1
P(H|E2 , G) = 1/2.
Then,
P(H|G) = P(H ∩ E1 |G) + P(H ∩ E2 |G)
= P(H|E1 , G)P(E1 |G) + P(H|E2 , G)P(E2 |G)
= 1 · 1/2 + 1/2 · 1/2 = 3/4.
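The three answers in Example 2.13 amount to a sequence of Bayesian updates on Smith's genotype, which the following sketch (ours, with exact fractions) reproduces:

from fractions import Fraction

half = Fraction(1, 2)

# (a) Prior from the Bb x Bb cross, conditioned on Smith having brown eyes (not bb):
#     P(BB | brown) = (1/4)/(3/4) = 1/3,  P(Bb | brown) = (1/2)/(3/4) = 2/3.
p_BB, p_Bb = Fraction(1, 3), Fraction(2, 3)

# (b) Partner is bb, so a child has blue eyes only if Smith is Bb and passes b (probability 1/2).
p_first_blue = p_BB * 0 + p_Bb * half
print(p_first_blue)                                   # 1/3

# (c) Update Smith's genotype on the event G = {first child has brown eyes}, then predict.
p_G = 1 - p_first_blue
p_BB_given_G = p_BB * 1 / p_G
p_Bb_given_G = p_Bb * half / p_G
p_second_brown = p_BB_given_G * 1 + p_Bb_given_G * half
print(p_second_brown)                                 # 3/4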
2.5 Conditional Probabilities are Probabilities (Optional)
Theorem 2.2. Let F be an event in a probability space (Ω, F, P) such that P(F ) > 0. Then the
conditional probability P(·|F ) is a probability measure on (Ω, F).
Proof. We need to show that P(·|F ) satisfies the three axioms of a probability measure. The
first axiom requires that for any event E,
0 ≤ P(E|F) = P(E ∩ F)/P(F) ≤ 1,
which follows because P(E ∩ F) ≥ 0 and P(F) > 0, and because P(E ∩ F) ≤ P(F) (since E ∩ F ⊂ F).
The second axiom follows by a simple calculation:
P(Ω|F) = P(Ω ∩ F)/P(F) = P(F)/P(F) = 1.
Lastly, let E1, E2, · · · be a countable collection of disjoint events and let E = ⋃_{n=1}^∞ En. Then
P(E|F) = P(E ∩ F)/P(F)
= (1/P(F)) · P( ⋃_{n=1}^∞ (En ∩ F) )
= (1/P(F)) · Σ_{n=1}^∞ P(En ∩ F)
= Σ_{n=1}^∞ P(En|F),
where in passing from the second line to the third we have made use of the countable additivity
of P(·) and of the fact that the events En ∩ F, n ≥ 1 are disjoint. This verifies that P(·|F ) is
countably additive, which completes the proof.
One useful consequence of Theorem 2.2 is that we can condition on a series of events, e.g., if
E, F , and G are events and we define
Q(E) = P(E|F ),
then the conditional probability Q(E|G) is well-defined. In fact,
Q(E|G) = Q(E ∩ G)/Q(G)
= P(E ∩ G|F)/P(G|F)
= [P(E ∩ G ∩ F)/P(F)] · [P(F)/P(G ∩ F)]
= P(E ∩ F ∩ G)/P(F ∩ G)
= P(E|F ∩ G),
which shows that conditioning first on F and then on G is equivalent to conditioning on the
intersection of F and G. Furthermore, this correspondence allows us to rewrite the identity
Q(E) = Q(E|G)Q(G) + Q(E|Gc )Q(Gc )
in the useful form
P(E|F ) = P(E|F ∩ G)P(G|F ) + P(E|F ∩ Gc )P(Gc |F ).
Example 2.14. A female chimp gives birth to a single offspring, but it is unknown which of two
male chimps is the father. Before any genetic data is collected, it is believed that the probability
that male number 1 is the father is p and that the probability that male number 2 is the father is
1 − p. (Recall that (p, 1 − p) is the prior distribution on the two hypotheses concerning paternity.)
DNA is collected from all four individuals and sequenced at a single site on the genome, showing
that the maternal genotype is AA, the genotype of male 1 is aa, the genotype of male 2 is Aa,
and that the genotype of the offspring is Aa. What is the posterior probability that male 1 is the
father?
Solution: Let all probabilities be conditional on the genotypes of the adults and define Mi , i =
1, 2 to be the event that male i is the father and G to be the event that the offspring genotype
is Aa. Then
P(M1|G) = P(M1 ∩ G)/P(G)
= P(G|M1)P(M1) / [P(G|M1)P(M1) + P(G|M2)P(M2)]
= (1 · p) / (1 · p + (1/2) · (1 − p))
= 2p/(1 + p).
Notice that P(M1 |G) > p = P(M1 ) whenever p < 1, i.e., the observed genotypes increase the
likelihood that male 1 is the father.
Example 2.15. Suppose that a series of independent trials is conducted and that each trial
results in a success with probability p and a failure with probability q = 1 − p. What is the
probability that a run of n consecutive successes occurs before a run of m consecutive failures?
Solution: Let us introduce the following events
E = {a run of n successes happens before a run of m failures}
S = {the first trial results in a success}
Sn−1 = {trials 2 through n all result in successes}
Fm−1 = {trials 2 through m all result in failures}.
By the law of total probability, we have
P(E) = P(E|S)P(S) + P(E|S c )P(S c ) = p · P(E|S) + q · P(E|S c ).
The conditional probabilities in this last expression can be evaluated in the following way. First
observe that
P(E|S) = P(E|S_{n−1} ∩ S)P(S_{n−1}|S) + P(E|S_{n−1}^c ∩ S)P(S_{n−1}^c|S)
= p^{n−1} + (1 − p^{n−1}) P(E|S_{n−1}^c ∩ S),
since S_{n−1} and S are independent and P(E|S_{n−1} ∩ S) = 1 (n consecutive successes occur on the first n trials if S and S_{n−1} are both true). However, since the event S_{n−1}^c ∩ S means that a failure has occurred during at least one of the trials 2 through n and since whether E occurs or not does not depend on any event happening before this first failure, it follows that
P(E|S_{n−1}^c ∩ S) = P(E|S^c),
i.e., it is as if the failure occurred on the very first trial. Thus
P(E|S) = p^{n−1} + (1 − p^{n−1}) P(E|S^c).
Similar reasoning shows that
P(E|S^c) = P(E|F_{m−1} ∩ S^c)P(F_{m−1}|S^c) + P(E|F_{m−1}^c ∩ S^c)P(F_{m−1}^c|S^c)
= 0 + (1 − q^{m−1}) P(E|F_{m−1}^c ∩ S^c)
= (1 − q^{m−1}) P(E|S),
where we have used the fact that P(E|F_{m−1} ∩ S^c) = 0 (since then m consecutive failures occur during the first m trials) and that P(E|F_{m−1}^c ∩ S^c) = P(E|S).
By combining the equations for P(E|S) and P(E|S^c), we obtain
P(E|S) = p^{n−1} + (1 − p^{n−1})(1 − q^{m−1}) P(E|S),
which then implies that
P(E|S) = p^{n−1} / [1 − (1 − p^{n−1})(1 − q^{m−1})],
P(E|S^c) = p^{n−1}(1 − q^{m−1}) / [1 − (1 − p^{n−1})(1 − q^{m−1})].
Finally, by substituting these conditional probabilities into the initial equation derived for P(E),
we arrive at the answer to our problem:
P(E) = p^{n−1}(1 − q^m) / [1 − (1 − p^{n−1})(1 − q^{m−1})] = [p^n + p^{n−1}(q − q^m)] / [1 − (1 − p^{n−1})(1 − q^{m−1})].
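The final formula can be checked by simulating the trial sequence directly. The sketch below (ours; the parameter values and the helper name are arbitrary) estimates the probability that a run of n successes precedes a run of m failures and compares it with the closed form derived above:

import random

def n_successes_before_m_failures(p, n, m):
    run_s = run_f = 0
    while True:
        if random.random() < p:
            run_s, run_f = run_s + 1, 0
            if run_s == n:
                return True
        else:
            run_s, run_f = 0, run_f + 1
            if run_f == m:
                return False

p, n, m = 0.6, 3, 2
q = 1 - p
exact = p ** (n - 1) * (1 - q ** m) / (1 - (1 - p ** (n - 1)) * (1 - q ** (m - 1)))
trials = 100_000
estimate = sum(n_successes_before_m_failures(p, n, m) for _ in range(trials)) / trials
print(estimate, "vs", round(exact, 4))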
2.6 The Gambler's Ruin Problem (Optional)
The following is a classic problem of probability that dates back to the 17’th century.
Two gamblers, A and B, bet on the outcomes of successive flips of a coin. On each flip, A wins a
dollar from B if the coin comes up heads, while B wins a dollar from A if the coin comes up tails.
The game continues until one of the players runs out of money. Suppose that the coin tosses are
mutually independent and that each toss turns up heads with probability p. If A starts with i dollars and B starts with N −i dollars, what is the probability that A ends up with all of the money?
Solution: Let E be the event that A eventually wins all of the money and let
Pi = P (E|A starts with i dollars)
be the conditional probability of E given that A starts with i dollars. We will use the independence of the successive coin tosses to obtain a system of linear equations for the probabilities
P0 , P1 , · · · , PN . To this end, let H be the event that the first toss lands on heads and observe
that
Pi = P(E|H)P(H) + P(E|H c )P(H c )
= pP(E|H) + (1 − p)P(E|H c )
= pPi+1 + (1 − p)Pi−1 ,    (2.5)
since P(E|H) = Pi+1 and P(E|H c ) = Pi−1 .
Indeed, if H is true, then A will have i + 1 dollars and B will have N − (i + 1) dollars after
the first toss. Furthermore, since the successive tosses are independent, the probability that A
eventually wins all of the money conditional on having i + 1 dollars after the first toss is the
same as the probability that A wins all of the money conditional on A starting the game with
i + 1 dollars.
Equation (2.5) can be reformulated as a one-step recursion by first writing Pi = pPi + qPi = pPi+1 + qPi−1 and then rearranging terms on both sides of the identity to obtain
p(Pi+1 − Pi) = q(Pi − Pi−1),
i.e.,
Pi+1 − Pi = α(Pi − Pi−1),
where α = q/p.
Notice that P0 = 0 (A loses if she has no money to wager). Thus,
P2 − P1 = α(P1 − P0) = αP1
P3 − P2 = α(P2 − P1) = α^2 P1
···
Pi − Pi−1 = α(Pi−1 − Pi−2) = α^{i−1} P1
···
PN − PN−1 = α(PN−1 − PN−2) = α^{N−1} P1.
Adding the first i − 1 equations gives
Pi − P1 = P1 Σ_{k=1}^{i−1} α^k,
and therefore
Pi = P1 · (1 − α^i)/(1 − α) if p ≠ 1/2 (α ≠ 1),    and    Pi = P1 · i if p = 1/2 (α = 1).
Then, using the fact that PN = 1 (A wins if she begins with all of the money), we can solve for
P1 :
P1 = (1 − α)/(1 − α^N) if p ≠ 1/2,    and    P1 = 1/N if p = 1/2.
Substituting these results back into the expression for Pi, we obtain
Pi = (1 − (q/p)^i)/(1 − (q/p)^N) if p ≠ 1/2,    and    Pi = i/N if p = 1/2.
If we let Qi denote the probability that B ends up with all the money when the players have i
and N − i dollars at the start of the game, then by the symmetry of the problem, we see that
Qi = (1 − (p/q)^{N−i})/(1 − (p/q)^N) if p ≠ 1/2,    and    Qi = (N − i)/N if p = 1/2.
In particular, a simple calculation shows that Pi + Qi = 1, i.e., with probability 1, either A or
B eventually wins all of the money. In other words, the probability that the game continues
indefinitely with neither A nor B winning is 0.
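The closed form for Pi is straightforward to check by simulation. The following sketch (ours; the stake sizes and p are arbitrary choices) plays the game repeatedly and compares the empirical winning frequency for A with the formula for p ≠ 1/2:

import random

def a_wins(i, N, p):
    # Simulate the game until A has 0 or N dollars; return True if A wins everything.
    while 0 < i < N:
        i += 1 if random.random() < p else -1
    return i == N

i, N, p = 3, 10, 0.55
q = 1 - p
exact = (1 - (q / p) ** i) / (1 - (q / p) ** N)
trials = 50_000
estimate = sum(a_wins(i, N, p) for _ in range(trials)) / trials
print(estimate, "vs", round(exact, 4))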
Remark 2.2. If we let Xn denote A’s holdings after the n’th coin toss, then the random sequence
(Xn : n ≥ 0) is an example of a Markov chain and, in the case p = q = 1/2, of a martingale.
Chapter 3
Random Variables and Probability Distributions
3.1
Random Variables
Imagine that an experiment is performed which results in a random outcome ω belonging to a set
Ω. We saw in the last chapter that we could model such a scenario by introducing a probability
space (Ω, F, P) in which Ω is the sample space, F is a collection of events in Ω, and P is a
distribution which specifies the probability P(A) of each event A ∈ F. However, depending on
the size and the complexity of the sample space as well as on the experimentalist’s resources, it
may not be possible to directly observe ω. Instead, what the experimentalist usually can do is to
measure some quantity X = X(ω) that depends on ω and which reveals some information about
the outcome. Because the value of X depends on the outcome of the experiment, which is random, we say that X is a random variable. Random variables play a central role in probability
theory, statistics, and stochastic modeling and so this entire chapter is devoted to their properties.
Definition 3.1. Suppose that (Ω, F, P) is a probability space and that E is a set equipped with
a σ-algebra E of events. A function X : Ω → E is said to be an E-valued random variable if
for every set A ∈ E we have
X −1 (A) = {ω ∈ Ω : X(ω) ∈ A} = {X ∈ A} ∈ F.
In other words, for each set A ∈ E, we must be able to decide whether the event {X ∈ A} is true
or not.
Here E can be a set of objects of any type, including integers, real numbers, complex numbers,
vectors, matrices, functions, geometric objects, graphs, networks, and even probability distributions themselves. For example, if E is a set of matrices, then we would say that an E-valued
random variable X is a random matrix. This level of generality allows us to apply random
variables to problems coming from many different disciplines. However, most often E is a subset
of the real numbers, in which case we would say that X is a real-valued random variable.
In fact, for many authors random variables are synonymous with real-valued random variables.
Two examples follow.
Example 3.1. Suppose that we toss a fair coin one hundred times and record the outcome of
each toss as H or T . We can model this experiment using a symmetric probability space (Ω, F, P)
in which Ω is the collection of sequences of length one hundred in which each element is either
H or T , F is the power set of Ω, and the probability of each outcome ω = (ω1 , · · · , ω100 ) is
P({ω}) = 1/2^{100}.
Because Ω is a very large set, we might choose to restrict our attention to the number of tosses
that land on heads, which we can represent by a random variable X : Ω → E = {0, 1, · · · , 100},
where
X(ω) = #{i : ωi = H},
and we take E to be the power set of E. Notice that because F is the power set of Ω, every set
{X ∈ A} belongs to F when A ∈ E.
Example 3.2. Suppose that a fair die is rolled twice and that we record the outcome of each roll.
We can model this experiment with a symmetric probability space (Ω, F, P) where Ω = {(i, j) :
1 ≤ i, j ≤ 6}, F is the power set of Ω, and the probability of each outcome ω = (ω1 , ω2 ) is
P({ω}) = 1/36.
In this case, the sum of the two numbers rolled is a random variable X : Ω → E = {2, · · · , 12}
defined by
X(ω) = ω1 + ω2 .
Similarly, the product, the difference and the quotient of the two numbers rolled are all random
variables on Ω.
Arguably, the most important attribute of a random variable is its distribution, which tells us
the probabilities of events involving this variable.
Definition 3.2. Let X be an E-valued random variable defined on a probability space (Ω, F, P).
Then the distribution of X is a probability distribution PX : E → [0, 1] on the space (E, E)
defined by the formula
PX (A) = P(X ∈ A)
whenever A ∈ E. Furthermore, the triple (E, E, PX ) is itself a probability space.
Example 3.3. Let X be the sum obtained when a fair die is rolled twice. Then the distribution
of X on the set E = {2, 3, · · · , 12} is determined by the following probabilities:
i       2     3     4     5     6     7     8     9     10    11    12
PX(i)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Furthermore, the probability of any subset A ⊂ E can be calculated by summing the probabilities
of the elements of A:
PX(A) = Σ_{i ∈ A} PX(i).
Indeed, X is an example of a discrete random variable and the table shows the values of the
probability mass function of X (see below).
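The table in Example 3.3 can be reproduced by enumerating the 36 equally likely outcomes. A minimal sketch (ours; the use of Counter is simply a bookkeeping choice):

from collections import Counter
from fractions import Fraction
from itertools import product

counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
for s, px in pmf.items():
    print(s, px)        # 2 -> 1/36, 3 -> 2/36 (printed as 1/18), ..., 7 -> 1/6, ..., 12 -> 1/36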
Remark: Although random variables are formally defined as functions on probability spaces,
it is common practice to suppress all mention of the underlying probability space (Ω, F, P) and
instead only work with the variable X and its distribution PX on (E, E). This allows us to focus
on quantities that we can actually observe and measure. This will be the convention followed
in these lecture notes except in a few places where it will be convenient to use the underlying
probability space as a mathematical tool to prove results about random variables.
In the special case where X is a real-valued random variable, the distribution of X can be
summarized by a function called the cumulative distribution function of X (abbreviated c.d.f.).
This is useful because real-valued functions are often less complicated objects than probability
distributions defined on σ-algebras on the real numbers. This is particularly true of continuous
random variables, which can assume any value in an interval.
Definition 3.3. If X is a real-valued random variable, then the cumulative distribution
function of X is the function FX : R → [0, 1] defined by
FX(x) ≡ P(X ≤ x) = P(X ∈ (−∞, x])
for any x ∈ R.
Notice that FX (x) is said to be the cumulative distribution function of X because it measures
the cumulative probability of X from −∞ up to x.
It is clear that the cumulative distribution function of a random variable is uniquely determined
by the distribution of the variable. However, less obviously, the converse is also true: the
distribution of a random variable is uniquely determined by its cumulative distribution function.
In other words, if X and Y have the same c.d.f., i.e., if FX (x) = FY (x) for every x ∈ R, then
X and Y have the same distribution: P(X ∈ A) = P(Y ∈ A) for every ‘nice’ subset A ⊂ R.
Section 3.1.3 describes several other important properties of cumulative distribution functions.
We defer examples to the next two sections.
Having specialized to real-valued random variables, we now specialize further. Namely, in this
course we will mainly focus on two classes of real-valued random variables: discrete and continuous, which are defined in the next two sections. Random variables which are neither discrete nor
continuous also exist (e.g., a random variable with the Cantor distribution, which is supported
on the Cantor middle thirds set, is neither discrete nor continuous), but discrete and continuous
random variables are the ones most commonly encountered in applications.
3.1.1 Discrete random variables
Recall that a set E is said to be countable if it is either finite or countably infinite. In the
latter case, E can be put into one-to-one correspondence with the natural numbers, i.e., we can
enumerate the elements of E = {x1 , x2 , x3 , · · · }. Examples of countably infinite sets include the
integers, the non-negative even integers and the rationals. An example of a set that is infinite
but not countably infinite is the interval [0, 1]; this was first established by Georg Cantor using
his famous diagonalization argument.
Definition 3.4. A real-valued random variable X is said to be discrete if there is a countable
set of values E = {x1 , x2 , · · · } ⊂ R such that P(X ∈ E) = 1. In this case, the probability mass
function of X is the function pX : E → [0, 1] defined by
pX (y) = P(X = y).
The next proposition shows how the probability mass function (p.m.f.) of a discrete random
variable is related to the distribution of that variable. In particular, it follows that if two discrete random variables have the same p.m.f., then they necessarily have the same distribution.
Proposition 3.1. Suppose that X is a discrete random variable with values in E and let pX be
the probability mass function of X. If A ⊂ E, then
P(X ∈ A) = ∑_{y∈A} p_X(y),
i.e., the probability that X belongs to A is equal to the sum of the probability mass function
evaluated at all of the elements of A.
Proof. Since A can be written as a countable disjoint union of the singleton sets containing the
elements contained in A, i.e.,
A = ∪_{y∈A} {y},
we can use the countable additivity of probability distributions to conclude that
P(X ∈ A) = ∑_{y∈A} P(X = y) = ∑_{y∈A} p_X(y).
In particular, it follows that the sum of the probability mass function over all of the points
contained in E is equal to 1:
1 = P(Ω) = ∑_{x_i ∈ E} p_X(x_i).
We can also express the cumulative distribution function of a discrete random variable X in
terms of its probability mass function.
F_X(x) ≡ P(X ≤ x) = ∑_{x_i ≤ x} p_X(x_i).
This shows that the cumulative distribution function of a discrete random variable with values in the set E = {x_1, x_2, · · · } is a step function which is piecewise constant on the intervals [x_i, x_{i+1}) with a jump of size p_X(x_i) at x = x_i. (See Figure 5.1 in Gregory (2005).)
Example 3.4. A random variable X is said to be a Bernoulli random variable with parameter
p ∈ [0, 1] if X takes values in the set E = {0, 1} and the p.m.f. of X is equal to
p_X(x) = 1 − p   if x = 0,   and   p_X(x) = p   if x = 1.
We also say that X has a Bernoulli distribution with parameter p. Notice that the cumulative distribution function of X is the following step function:

F_X(x) = 0 if x < 0,   1 − p if 0 ≤ x < 1,   1 if x ≥ 1.
Bernoulli random variables are often used to describe trials that can result in just two possible
outcomes, e.g., a coin toss experiment can be represented by a Bernoulli random variable if we
assign X = 1 whenever we get heads and X = 0 whenever we get tails.
3.1.2 Continuous random variables
Definition 3.5. A real-valued random variable X is said to be continuous if there is a function
fX : R → [0, ∞], called the probability density function of X, such that
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx   for every −∞ ≤ a ≤ b ≤ ∞.
If a random variable X is continuous, then its probability density function (or density for short)
can be calculated by evaluating the following limit:
f_X(x) = lim_{δx→0+} P(x ≤ X ≤ x + δx)/δx.        (3.1)
This shows that the probability density f_X(x) evaluated at a point x is not equal to the probability that X = x; instead it is the limit of the ratio of a probability to the length of a shrinking interval, as both tend to zero. This explains why densities can take on values that are greater than 1. On the other hand, there is a one-to-one relationship between the probability density function of a continuous random variable and the distribution of that variable. Our next proposition should be compared with Proposition 3.1.
Proposition 3.2. Suppose that X is a continuous random variable with density fX . Then
P(X ∈ A) = ∫_A f_X(x) dx
for any subset A ⊂ R for which the integral is well-defined.
The collection of subsets A ⊂ R for which the integral is well-defined is known as the Lebesgue
σ-algebra and includes all countable unions and intersections of closed or open intervals. See
Williams (1991) for a careful discussion of these issues. By taking A = R, it follows that the integral of the density over the entire real line is equal to 1:

1 = P(X ∈ R) = ∫_{−∞}^{∞} f_X(x) dx.
On the other hand, if we take A = {y} for any y ∈ R, then
P(X = y) = ∫_y^y f_X(x) dx = 0,
which shows that a continuous random variable has no atoms. This highlights an observation that may seem paradoxical at first, which is that events with probability zero are not, in general, impossible. In this case, we know that the random variable X will take on some value y even though the probability of this event is zero: P(X = y) = 0.
Proposition 3.2 also shows how the probability density function and the cumulative distribution
function are related. Since this relationship is so important in practice, we will state it as a
corollary.
Corollary 3.1. Suppose that X is a continuous random variable with density fX (x) and c.d.f.
FX (x). Then FX is the antiderivative of fX , while fX is the derivative of FX :
F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} f_X(y) dy,        f_X(x) = (d/dx) F_X(x).
Thus, if we know the density of a random variable, then we can find its cumulative distribution
function by integration, while if we know the cumulative distribution function, then we can find
the density function by differentiation.
Example 3.5. Let X be a continuous random variable with probability density function
f_X(x) = C x^α   if x ∈ [0, 1],   and   f_X(x) = 0   if x < 0 or x > 1.
Here α > −1 is a parameter which characterizes the distribution, while C is a constant which
must be chosen so that the density integrates to 1:
1 = ∫_0^1 C x^α dx = C/(α + 1).
In other words, we must take C = α+1 to get a probability density. Notice that we must also take
α > −1 since otherwise the integral diverges, but values of α ∈ (−1, 0) are perfectly legitimate
even though the density function diverges as x → 0.
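As a quick numerical sanity check (added here, not in the original notes), the sketch below integrates the density of Example 3.5 for a value of α in (−1, 0) and confirms that the normalizing constant C = α + 1 makes it integrate to 1, even though the density is unbounded near 0; the choice α = −0.5 is arbitrary.

    from scipy.integrate import quad

    alpha = -0.5                      # any value in (-1, 0) works; the density blows up at 0
    C = alpha + 1                     # normalizing constant from Example 3.5

    density = lambda x: C * x**alpha
    total, err = quad(density, 0, 1)  # improper integral, handled by quad
    print(total)                      # approximately 1.0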
3.1.3 Properties of the Cumulative Distribution Function
We have seen that the cumulative distribution function of a random variable can be either
discontinuous or continuous depending on whether or not there are particular values that the
random variable assumes with positive probability. However, there are several properties that
are shared by every cumulative distribution function.
Proposition 3.3. Let X be a real-valued random variable (not necessarily discrete) with cumulative distribution function F (x) = P(X ≤ x). Then
1. F is non-decreasing, i.e., if x ≤ y, then F (x) ≤ F (y).
2. limx→∞ F (x) = 1.
3. limx→−∞ F (x) = 0.
4. F is right-continuous, i.e., for any x and any decreasing sequence (x_n, n ≥ 1) that converges to x, lim_{n→∞} F(x_n) = F(x).
Proof. 1.) If x ≤ y, then {X ≤ x} ⊂ {X ≤ y}, which implies that F (x) ≤ F (y).
2.) If (xn , n ≥ 1) is an increasing sequence such that xn → ∞, then the events En = {X ≤ xn }
form an increasing sequence with
{X < ∞} = ∪_{n=1}^{∞} E_n.
It follows from the continuity properties of probability measures (Lecture 2) that
lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P(X < ∞) = 1.
3.) Likewise, if (xn , n ≥ 1) is a decreasing sequence such that xn → −∞, then the events
En = {X ≤ xn } form a decreasing sequence with
∅ = ∩_{n=1}^{∞} E_n.
In this case, the continuity properties of measures imply that
lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P(∅) = 0.
4.) If xn , n ≥ 1 is a decreasing sequence converging to x, then the sets En = {X ≤ xn } also form
a decreasing sequence with
{X ≤ x} = ∩_{n=1}^{∞} E_n.
Consequently,
lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P{X ≤ x} = F(x).
Some other useful properties of cumulative distribution functions are listed below.
(i) P(a < X ≤ b) = F (b) − F (a) for all a < b.
(ii) Another application of the continuity property shows that the probability that X is strictly
less than x is equal to
P(X < x) = lim_{n→∞} P(X ≤ x − 1/n) = lim_{n→∞} F(x − 1/n).
In other words, P(X < x) is equal to the left limit of F at x, which is often denoted F (x−).
However, notice that in general this limit need not be equal to F (x):
F (x) = P(X < x) + P(X = x),
and so P(X < x) = F (x) if and only if P(X = x) = 0.
(iii) The converse of this proposition is also true: namely, if F : R → [0, 1] is a function satisfying
properties (1) - (4), then F is the CDF of some random variable X. We will prove a partial
result in this direction later on in the course.
3.2 Expectations and Moments
Definition 3.6. Suppose that X is a real-valued random variable which is either discrete with
probability mass function pX (x) or continuous with probability density function fX (x). Then the
expected value of X is the weighted average of the values of X given by
E[X] = ∑_{x_i} x_i p_X(x_i)        if X is discrete,
     = ∫_{−∞}^{∞} x f_X(x) dx     if X is continuous,
provided that the sum or the integral exists.
The expected value of a random variable X may also be called the expectation, the mean or
the first moment of X and is sometimes denoted hXi. The latter notation is especially common
in the physics and engineering literature and is also used by Gregory (2005).
Example 3.6. Let X be a Bernoulli random variable with parameter p (cf. Example 3.4). Then
the expected value of X is
E[X] = (1 − p) · 0 + p · 1 = p.
Example 3.7. Let X be the outcome when we roll a fair die. Then
E[X] = ∑_{i=1}^{6} (1/6) · i = 3.5.
Example 3.8. Let X be the random variable described in Example 3.5, i.e., X is continuous with density (α + 1)x^α for x ∈ [0, 1] where α > −1. Then the expected value of X is equal to

E[X] = ∫_0^1 (α + 1)x^α · x dx = (α + 1)/(α + 2).
Notice that in the first two examples, the random variable X cannot be equal to its expected
value. In this sense, the name ‘expected value’ is somewhat misleading and it could be argued
that it would be better to refer to E[X] as the mean of X. On the other hand, we will later see
that if X1 , X2 , · · · are independent, identically-distributed real-valued random variables with the
same distribution as X, then
lim_{N→∞} (1/N) ∑_{i=1}^{N} X_i = E[X]
provided that X has a finite mean, i.e., the sample mean converges to the expected value as the
size of the sample increases. This result is known as the strong law of large numbers.
One way to obtain new random variables is to compose a real-valued function with an existing
random variable. For example, suppose that X is a discrete random variable that takes values in
the set E = {x1 , x2 , . . . } with probabilities pX (xi ), and let g : E → R be a real-valued function.
Then Y = g(X) is a discrete random variable with values in the countable set g(E) = {y1 , y2 , · · · }
and the probability mass function of Y is
p_Y(y_i) ≡ P(Y = y_i) = ∑_{x_j : g(x_j) = y_i} p_X(x_j).
Notice that more than one value x_j may contribute to the probability of each point y_i if g is not
one-to-one. Similarly, if X is a continuous random variable and g : R → R is a function, then
Y = g(X) is also a real-valued random variable, although it need not be continuous.
Example 3.9. Suppose that X is a discrete random variable that takes values in the set {−1, 0, 1}
with probabilities
p_X(−1) = 1/4,   p_X(0) = 1/2,   p_X(1) = 1/4.

Then Y = X² is a discrete random variable that takes values in the set {0, 1} with probabilities

p_Y(0) = p_X(0) = 1/2,   p_Y(1) = p_X(−1) + p_X(1) = 1/2,

and so

E[X²] = (1/2) · 0 + (1/2) · 1 = 1/2.
The following proposition allows us to calculate the expectation E[g(X)] without having to first find the probability mass function or density function of the random variable Y = g(X). Although the identities contained in the proposition appear as a definition in Gregory (2005), in fact, these must be derived from Definition 3.6. We will prove the result under the assumption that X is discrete; see Williams (1991) for a proof of the general case.
Proposition 3.4. Suppose that X is a real-valued random variable which is either discrete with
probability mass function pX (x) or continuous with probability density function fX (x). Then, for
any real-valued function g : R → R,
E[g(X)] = ∑_{x_i} p_X(x_i) g(x_i)        if X is discrete,
        = ∫_{−∞}^{∞} f_X(x) g(x) dx     if X is continuous,
provided that the sum or the integral exists.
Proof. Suppose that X is a discrete random variable with values in the set E = {x1 , x2 , · · · }
and probability mass function pX . If Y = g(X), then Y is also a discrete random variable which
takes on distinct values {y1 , y2 , · · · } with probabilities pY (yj ) and we can define
Ej ≡ g −1 (yj ) = {xi ∈ E : g(xi ) = yj } .
Then the sets E1 , E2 , · · · are disjoint with union E = E1 ∪ E2 ∪ · · · , and
E[g(X)] = ∑_j p_Y(y_j) · y_j
        = ∑_j ( ∑_{x_i ∈ E_j} p_X(x_i) ) · y_j
        = ∑_j ∑_{x_i ∈ E_j} p_X(x_i) · y_j
        = ∑_j ∑_{x_i ∈ E_j} p_X(x_i) · g(x_i)
        = ∑_{x_i ∈ E} p_X(x_i) · g(x_i).
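To illustrate Proposition 3.4 concretely (an added sketch, not from the original notes), the Python snippet below computes E[g(X)] for the random variable of Example 3.9 in two ways: by first building the probability mass function of Y = g(X) and then averaging, and directly via the formula of the proposition. Both give 1/2.

    from fractions import Fraction
    from collections import defaultdict

    p_X = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}
    g = lambda x: x**2

    # Method 1: build the p.m.f. of Y = g(X), then average over its values.
    p_Y = defaultdict(Fraction)
    for x, p in p_X.items():
        p_Y[g(x)] += p
    E_via_Y = sum(y * p for y, p in p_Y.items())

    # Method 2: Proposition 3.4 -- average g(x) directly against the p.m.f. of X.
    E_direct = sum(g(x) * p for x, p in p_X.items())

    print(E_via_Y, E_direct)   # both equal 1/2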
An important consequence of Proposition 3.4 is that expectations are linear.
Corollary 3.2. If a and b are constants, then
E[aX + b] = aE[X] + b.
Proposition 3.4 can also be used to calculate the moments of a real-valued random variable.
Definition 3.7. The n’th moment of a real-valued random variable X is the quantity
μ'_n = E[X^n] = ∑_{x_i} p_X(x_i) x_i^n        if X is discrete,
              = ∫_{−∞}^{∞} f_X(x) x^n dx     if X is continuous,

provided that the sum or the integral exists. Similarly, if μ = E[X] is finite, then the n'th central moment of X is the quantity

μ_n = E[(X − μ)^n] = ∑_{x_i} p_X(x_i)(x_i − μ)^n        if X is discrete,
                   = ∫_{−∞}^{∞} f_X(x)(x − μ)^n dx     if X is continuous,
provided that the sum or the integral exists.
Notice that the first central moment vanishes, since the linearity of expectations implies that
µ1 = E[X − µ] = µ − µ = 0.
On the other hand, the second, third and fourth central moments are sufficiently important in
statistics and applied probability to warrant special names. The second central moment is the
most familiar and is called the variance of the random variable. The variance is usually denoted
σ 2 and can be evaluated using the following formula:
σ² = E[(X − μ)²] = E[X² − 2μX + μ²] = E[X²] − 2μ² + μ² = E[X²] − μ².
The square root of the variance is called the standard deviation. This is denoted σ and has
the same units as X, whereas the variance has the units of X 2 , e.g., if the units of X are cm,
then the units of the standard deviation are also cm, whereas the units of the variance are cm2 .
The third and fourth central moments are also useful in some instances. The third central moment is known as the skewness and measures the asymmetry of the distribution of a random variable
about its mean. A random variable that is perfectly symmetric about its mean has skewness
µ3 = 0, whilst a variable is said to be either right-skewed or left-skewed if µ3 > 0 or µ3 < 0,
respectively. The fourth central moment is known as the kurtosis and measures how peaked the
distribution is around the mean. Since the skewness and kurtosis are dimensioned quantities with
units equal to the units of X raised to the third or fourth power, respectively, it is often preferred
to replace these by dimensionless quantities obtained by dividing by the standard deviation σ
raised to this power. This leads us to define the following quantities
Coefficient of skewness: α_3 = μ_3/σ³
Coefficient of kurtosis: α_4 = μ_4/σ⁴.
See Figure 5.3 in Gregory (2005) for a selection of distributions with different degrees of skewness
and kurtosis.
3.2.1 Moment generating functions
Moment generating functions have a number of important applications in probability theory.
In this section, we will see how they can be used to calculate all of the moments of a random
variable X in a single step. See the following sections for several examples of this approach.
Definition 3.8. If X is a real-valued random variable, then the moment generating function
of X is the function mX : R → [0, ∞] defined by the formula
m_X(t) = E[e^{tX}] = ∑_{x_i} p_X(x_i) e^{t x_i}        if X is discrete,
                   = ∫_{−∞}^{∞} f_X(x) e^{tx} dx      if X is continuous.
Although the moment generating function is defined for all t ∈ R, it is possible for it to diverge everywhere except at t = 0 where mX (0) = 1 holds for any random variable X. On the
other hand, if X is a bounded random variable, i.e., if there is a constant K < ∞ such that
P(|X| ≤ K) = 1, then the moment generating function will be finite everywhere: mX (t) < ∞ for
all t ∈ R. Furthermore, it can be shown that the distribution of a random variable is uniquely
determined by the moment generating function whenever that function is finite in some open
interval containing 0. In particular, if two random variables have the same moment generating
function and this function is finite in some open interval, then the two variables also have the
same distribution.
To understand why the moment generating function has this name, suppose that the Taylor
series of the moment generating function mX (t) of a random variable X is convergent in some
open interval (−a, a) containing 0. Then
m_X(t) = E[e^{tX}] = E[ ∑_{n=0}^{∞} (1/n!) t^n X^n ] = ∑_{n=0}^{∞} (1/n!) μ'_n t^n,
where μ'_n = E[X^n] is the n'th (non-central) moment of X and the last step requires that we be
able to interchange the sum with the expectation. This shows that the moments of X can be
found by first finding its moment generating function and then expanding it in a Taylor series
and reading off the coefficients. In particular, the n’th moment of X is simply the n’th derivative
of mX (t) evaluated at t = 0:
μ'_n = m_X^{(n)}(0).
In some cases, this procedure may require less work than direct evaluation of the sum or the
integral expression of the moment.
Example 3.10. Suppose that λ > 0 is a positive constant and let X be a continuous random
variable with density fX (x) = λe−λx on [0, ∞) and fX (x) = 0 for x < 0. (X is said to be an
exponential random variable with parameter λ.) Then the moment generating function of X is
m_X(t) = E[e^{tX}] = ∫_0^∞ e^{tx} λe^{−λx} dx = λ/(λ − t)   if t < λ,   and   m_X(t) = ∞   if t ≥ λ.
Since this is a smooth function on the open interval (−∞, λ), it can be expanded in a Taylor series (convergent for |t| < λ) as

λ/(λ − t) = ∑_{n=0}^{∞} λ^{−n} t^n,

and it follows that the n'th moment of X is simply μ'_n = λ^{−n} n!.
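As an added check (not part of the original notes), the following Python sketch compares the formula μ'_n = n!/λⁿ against moments estimated from simulated exponential variables; λ = 2 and n ≤ 3 are arbitrary choices.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    samples = rng.exponential(scale=1.0 / lam, size=1_000_000)

    for n in range(1, 4):
        empirical = np.mean(samples**n)
        exact = math.factorial(n) / lam**n
        print(n, round(float(empirical), 4), exact)   # empirical moments approach n!/lam**n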
Chapter 4
A Menagerie of Distributions
4.1 The Binomial Distribution
Suppose that an experiment is repeated n times and that each trial has two possible outcomes, which we denote Q and Q̄. Assume that the trials are independent of one another and that the probability of outcome Q is the same in each trial, say p. If we let n = 3 and we let Q_i be the proposition that the i'th trial results in outcome Q, then there are eight possibilities:

(Q_1, Q_2, Q_3), (Q̄_1, Q_2, Q_3), (Q_1, Q̄_2, Q_3), (Q_1, Q_2, Q̄_3), (Q̄_1, Q̄_2, Q_3), (Q̄_1, Q_2, Q̄_3), (Q_1, Q̄_2, Q̄_3), (Q̄_1, Q̄_2, Q̄_3).
We can calculate the probability of each of these possibilities using the product rule. For example, the probability of outcome (Q̄_1, Q_2, Q_3) is equal to

P(Q̄_1, Q_2, Q_3) = P(Q̄_1) × P(Q_2 | Q̄_1) × P(Q_3 | Q̄_1, Q_2)
                 = P(Q̄_1) × P(Q_2) × P(Q_3)
                 = (1 − P(Q_1)) × P(Q_2) × P(Q_3)
                 = p²(1 − p).

Similarly, the probabilities of outcomes (Q_1, Q̄_2, Q_3) and (Q_1, Q_2, Q̄_3) are

P(Q_1, Q̄_2, Q_3) = P(Q_1) × P(Q̄_2) × P(Q_3) = p²(1 − p)
P(Q_1, Q_2, Q̄_3) = P(Q_1) × P(Q_2) × P(Q̄_3) = p²(1 − p).
Upon calculating the probabilities of the other five possible outcomes for a sequence of three trials, we see that the probability of a particular outcome depends only on the number of trials that result in Q or Q̄, but not on the order in which these occur. Let X be a random variable that records the number of trials that result in outcome Q. Since there is exactly one way that all three trials can result in three Q's, three ways that the three trials can result in two Q's, three ways that they can result in one Q, and one way that they can result in 0 Q's, it follows that the probability mass function of X is equal to

p_X(0) = (1 − p)³        p_X(1) = 3p(1 − p)²
p_X(2) = 3p²(1 − p)      p_X(3) = p³.
In general, suppose that n independent trials are performed and let X be the number of trials
that result in outcome Q. Then, by reasoning as in the above example, we can show that
P(X = k) = (number of ways that k Q's can occur in n trials) × p^k (1 − p)^{n−k}
         = (number of ways of selecting a subset of size k from a set containing n elements) × p^k (1 − p)^{n−k}
         = (n choose k) p^k (1 − p)^{n−k}
         = [n!/(k!(n − k)!)] p^k (1 − p)^{n−k},

where the constants

(n choose k) = n!/(k!(n − k)!)        (4.1)
count the number of combinations of k elements from a set containing n elements. These are
sometimes denoted nC_r or C^n_r, depending on the source, and they are called binomial coefficients because they appear as coefficients of monomials in binomial expansions:

(x + y)^n = ∑_{k=0}^{n} (n choose k) x^k y^{n−k}.        (4.2)
This distribution occurs so frequently in probability theory and statistics that it has its own
name.
Definition 4.1. A random variable X is said to have the binomial distribution with parameters n and p if it takes values in the set {0, 1, · · · , n} with probability mass function
p_X(k) ≡ P(X = k) = (n choose k) p^k (1 − p)^{n−k}.        (4.3)

In this case, we write X ∼ Binomial(n, p) to indicate that X has this distribution, and we say that X is a binomial random variable.
The name of this distribution stems from the connection between the probabilities p(k) and the
coefficients in a binomial expansion:
∑_{k=0}^{n} p(k) = ∑_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} = (p + 1 − p)^n = 1.
Notice that a Bernoulli random variable with parameter p is also a binomial random variable with parameters n = 1 and p. In fact, the next theorem shows that there is a close relationship between Bernoulli distributions and binomial distributions even when n > 1.
Theorem 4.1. Let X1 , · · · , Xn be independent Bernoulli random variables, each with the same
parameter p. Then the sum X = X1 + · · · + Xn is a binomial random variable with parameters
n and p.
Proof. The random variable X counts the number of Bernoulli variables X1 , · · · , Xn that are
equal to 1, i.e., the number of successes in the n independent trials. Clearly X takes values in
the set {0, · · · , n}. To calculate the probability that X = k, let E be the event that Xi1 = Xi2 =
· · · = X_{i_k} = 1 and X_j = 0 for all j ∉ {i_1, · · · , i_k}. Then, because the Bernoulli variables are
independent, we know that
P(E) = p^k (1 − p)^{n−k}.
However, there are (n choose k) such sets of indices i_1, · · · , i_k, these corresponding to mutually exclusive events, and so

P(X = k) = (n choose k) p^k (1 − p)^{n−k}.
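The content of Theorem 4.1 is easy to check by simulation. The short Python sketch below (an addition, not from the original notes) draws n independent Bernoulli(p) variables many times, sums them, and compares the empirical frequencies with the Binomial(n, p) mass function; n = 5 and p = 0.3 are arbitrary choices.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(1)
    n, p, reps = 5, 0.3, 200_000

    # Sum of n independent Bernoulli(p) variables, repeated many times.
    sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)

    for k in range(n + 1):
        empirical = np.mean(sums == k)
        exact = comb(n, k) * p**k * (1 - p)**(n - k)
        print(k, round(float(empirical), 4), round(exact, 4))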
Example 4.1. Suppose that a fair coin is tossed n times. Then the number of heads that appear
is a binomial random variable with parameters n and p = 1/2.
Example 4.2. Sampling with replacement. Suppose that n balls are sampled with replacement from an urn which contains 3 red balls and 9 blue balls. If X denotes the number of red balls in the sample, then X ∼ Binomial(n, 0.25), since the probability of sampling a red ball on any given draw is 3/(3 + 9) = 0.25. More generally, if the urn contains r red balls and b blue balls, then X ∼ Binomial(n, p) where p = r/(r + b).
Example 4.3. Suppose that eye color in humans is under the control of a single gene. Individuals
with genotypes BB or Bb have brown eyes, while individuals with genotype bb have blue eyes.
Suppose that two brown-eyed individuals with genotypes Bb and Bb have a total of 4 children.
Assuming Mendelian inheritance, what is the distribution of the number of blue-eyed offspring?
Answer: Let Xi equal 1 if the i’th child has blue eyes and 0 otherwise. Then Xi is a Bernoulli
random variable with parameter p = 1/4, since a child will be blue-eyed only if they inherit a blue
allele from each parent, which happens with probability 1/4. Since the genotypes of the different
offspring are independent, the total number of blue-eyed children is binomial with parameters
n = 4 and p = 1/4.
The mean and the variance of the binomial distribution can be calculated with the help of the
moment generating function. If X ∼ Binomial(n, p), then the moment generating function of X
is equal to
m_X(t) ≡ E[e^{tX}] = ∑_{k=0}^{n} P(X = k) e^{tk}
                   = ∑_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} e^{tk}
                   = ∑_{k=0}^{n} (n choose k) (pe^t)^k (1 − p)^{n−k}
                   = (pe^t + 1 − p)^n,

where the final identity follows from the binomial formula. With the help of the chain rule and the product rule, differentiating twice (with respect to t) gives

m'_X(t) = npe^t (pe^t + 1 − p)^{n−1}
m''_X(t) = npe^t (pe^t + 1 − p)^{n−1} + n(n − 1)p²e^{2t} (pe^t + 1 − p)^{n−2}.

It follows that the first two moments of X are

E[X] = m'_X(0) = np
E[X²] = m''_X(0) = np + n(n − 1)p²,
while the variance of X is

Var(X) = E[X²] − (E[X])² = np(1 − p).
Thus, as we might expect, the mean of X is an increasing function of both parameters, n and p,
while the variance of X is an increasing function of n which is largest when p = 1/2. Furthermore, the variance is 0 whenever p = 0 or p = 1, since in these cases we either have X = 0 with
probability 1 or X = n with probability one, i.e., every trial results in the same outcome.
The final result of this section asserts that the probability mass function of a binomial random
variable is always peaked around its mean.
Proposition 4.1. Let X be a binomial random variable with parameters (n, p). Then P(X = k)
is a unimodal function of k with its maximum at the largest integral value of k less than or equal
to (n + 1)p.
Proof. A simple calculation shows that
P(X = k)/P(X = k − 1) = (n − k + 1)p / (k(1 − p)),
which is greater than or equal to 1 if and only if
k ≤ (n + 1)p.
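A small numerical check of Proposition 4.1 (added here, not in the original notes): the Python sketch below evaluates the Binomial(n, p) mass function over all k and confirms that it is maximized at ⌊(n + 1)p⌋; the parameter values are arbitrary, chosen so that (n + 1)p is not an integer.

    from math import comb, floor

    def binom_pmf(k, n, p):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    for n, p in [(10, 0.3), (17, 0.4), (25, 0.9)]:
        pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
        argmax = max(range(n + 1), key=lambda k: pmf[k])
        print(n, p, argmax, floor((n + 1) * p))   # the last two numbers agree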
4.1.1 Sampling and the hypergeometric distribution
In Example 4.2, we noted that if n balls are sampled with replacement from an urn containing
r red balls and b blue balls, then the number of red balls in the sample is binomially distributed
with parameters n and p = r/(r + b). It should be emphasized that the assumption that we
are sampling with replacement is crucial here: the probability of sampling a red ball on the i’th
draw is still p no matter which balls were previously sampled because these are returned to the
urn before the next ball is sampled.
Perhaps more often than not, sampling is done without replacement, e.g., when we ‘sample’
the population by mailing out a survey to a random selection of individuals or when we enroll
individuals in a clinical trial, we usually do not want to include ‘duplicates’ in our sample. In this
case, the probability of sampling a particular type of object changes as we remove individuals
from the population. Consider the urn example again, but now suppose that we sample 3 balls
without replacement from an urn containing 3 red balls and 9 blue balls, and let X be the number
of red balls in our sample. If the order in which red and blue balls are sampled is considered, then
there are eight possible outcomes: RRR, BRR, RBR, RRB, BBR, BRB, RBB, and BBB,
which occur with probabilities:
P(RRR) = P(R) × P(R|R) × P(R|RR) = (3/12) × (2/11) × (1/10) = (3 × 2 × 1)/(12 × 11 × 10)
P(BRR) = P(B) × P(R|B) × P(R|BR) = (9/12) × (3/11) × (2/10) = (3 × 2 × 9)/(12 × 11 × 10)
P(RBR) = P(R) × P(B|R) × P(R|RB) = (3/12) × (9/11) × (2/10) = (3 × 2 × 9)/(12 × 11 × 10)
P(RRB) = P(R) × P(R|R) × P(B|RR) = (3/12) × (2/11) × (9/10) = (3 × 2 × 9)/(12 × 11 × 10)
P(BBR) = P(B) × P(B|B) × P(R|BB) = (9/12) × (8/11) × (3/10) = (3 × 9 × 8)/(12 × 11 × 10)
P(BRB) = P(B) × P(R|B) × P(B|BR) = (9/12) × (3/11) × (8/10) = (3 × 9 × 8)/(12 × 11 × 10)
P(RBB) = P(R) × P(B|R) × P(B|RB) = (3/12) × (9/11) × (8/10) = (3 × 9 × 8)/(12 × 11 × 10)
P(BBB) = P(B) × P(B|B) × P(B|BB) = (9/12) × (8/11) × (7/10) = (9 × 8 × 7)/(12 × 11 × 10).
As with our previous example, the probability of each outcome depends only on the numbers of
red and blue balls contained in the sample, not on the order in which they appear. Thus the
probability mass function of X is equal to
p_X(0) = (3 choose 0) × [3! 9!/(3! 6!)] ÷ [12!/9!]        p_X(1) = (3 choose 1) × [3! 9!/(2! 7!)] ÷ [12!/9!]
p_X(2) = (3 choose 2) × [3! 9!/(1! 8!)] ÷ [12!/9!]        p_X(3) = (3 choose 3) × [3! 9!/(0! 9!)] ÷ [12!/9!].
More generally, suppose that n balls are sampled without replacement from an urn containing r
red balls and b blue balls and let X be the number of red balls contained in the sample. Clearly
n cannot be larger than the number of balls contained in the urn, so n ≤ r + b. Furthermore,
the possible values that X can assume are integers between max{0, n − b} and min{n, r}. To
calculate P(X = k), where k lies between these two bounds, we first observe that the probability
that k red balls are first sampled and then n − k blue balls are sampled is
[r/(r + b)] × [(r − 1)/(r + b − 1)] × ··· × [(r − (k − 1))/(r + b − (k − 1))] × [b/(r + b − k)] × [(b − 1)/(r + b − (k + 1))] × ··· × [(b − (n − k − 1))/(r + b − (n − 1))]
= [r!/(r − k)!] × [b!/(b − (n − k))!] ÷ [(r + b)!/(r + b − n)!].
However, since there are (n choose k) different orders in which the k red balls can appear in the sample, it follows that the probability that the sample contains k red balls is equal to

P(X = k) = (n choose k) × [r!/(r − k)!] × [b!/(b − (n − k))!] ÷ [(r + b)!/(r + b − n)!]
         = [r!/(k!(r − k)!)] × [b!/((n − k)!(b − (n − k))!)] ÷ [(r + b)!/(n!(r + b − n)!)]
         = (r choose k)(b choose n−k) / (r+b choose n).
Like the binomial distribution, this distribution is of sufficient importance that it has its own
name.
Definition 4.2. A random variable X is said to have the hypergeometric distribution with
parameters (N, m, n) if X takes values in the set {max{0, n − (N − m)}, · · · , min{n, m}} with
probability mass function
p_X(k) = (m choose k)(N−m choose n−k) / (N choose n).        (4.4)

In this case, we write X ∼ Hypergeometric(N, m, n).
Unfortunately, the moment generating function is not of much use in this case. (It can be
expressed in terms of a hypergeometric function, hence the name of the distribution.) However,
by direct calculation, it can be shown that the mean and the variance of a hypergeometric random
variable are
m
E[X] = n
N
m m N − n
.
Var(X) = n
1−
N
N
N −1
In other words, if we let p = m/N be the probability of a ‘success’, then X has the same mean np
as a binomially distributed random variable with parameters (n, p), but it has a smaller variance
in general, since
m m N − n
n
≤ np(1 − p).
1−
N
N
N −1
On the other hand, the hypergeometric distribution with parameters (N, m, n) will almost be
the same as the binomial distribution with parameters (n, p) when p = m/N and N and m
are both much larger than n. To make this result precise, suppose that Xt is a sequence of
hypergeometric random variables with parameters (Nt , mt , n), where p = mt /Nt ∈ (0, 1) for all
t ≥ 1 and Nt → ∞ and mt → ∞ as t → ∞. Then
P(X_t = k) = (n choose k) × [m_t/N_t] × ··· × [(m_t − (k − 1))/(N_t − (k − 1))] × [(N_t − m_t)/(N_t − k)] × ··· × [(N_t − m_t − (n − k − 1))/(N_t − (n − 1))]
           → (n choose k) p^k (1 − p)^{n−k}

as t → ∞. Intuitively, sampling a small number of balls without replacement from an urn
containing a very large number of balls is almost equivalent to sampling with replacement,
since in the latter case we are unlikely to ever sample the same ball more than once anyway.
For this reason, the binomial distribution is often regarded as an adequate approximation to
the hypergeometric distribution when analyzing data that was generated by sampling without
replacement, provided that the sample size is much smaller than the population size.
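The quality of this approximation is easy to explore numerically. The Python sketch below (an addition, not part of the original notes) compares the Hypergeometric(N, m, n) and Binomial(n, m/N) mass functions for a fixed small sample size n as the population grows; the specific numbers are arbitrary.

    from scipy.stats import hypergeom, binom

    n = 5                                     # sample size, kept small
    for N in [20, 200, 2000]:                 # population size grows
        m = N // 4                            # number of 'successes', so p = 0.25
        max_diff = max(abs(hypergeom.pmf(k, N, m, n) - binom.pmf(k, n, m / N))
                       for k in range(n + 1))
        print(N, round(max_diff, 5))          # the difference shrinks as N grows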
4.2 The Poisson Distribution
Suppose that we perform a large number of independent trials, say n ≥ 100, and that the probability of a success on any one trial is small, say p = λ/n ≪ 1. If we let X denote the total number of successes that occur in all n trials, then we know from the last lecture that X ∼ Binomial(n, p) and therefore
P(X = k) = (n choose k) p^k (1 − p)^{n−k}
         = [n!/((n − k)! k!)] (λ/n)^k (1 − λ/n)^{n−k}
         = [n!/(n^k (n − k)!)] (λ^k/k!) (1 − λ/n)^n (1 − λ/n)^{−k}
         = [n(n − 1)(n − 2) ··· (n − k + 1)/n^k] (λ^k/k!) (1 − λ/n)^n (1 − λ/n)^{−k}
         ≈ e^{−λ} λ^k/k!,
where the approximation in the last line can be justified by the following three limits:

lim_{n→∞} (1 − λ/n)^n = e^{−λ}
lim_{n→∞} (1 − λ/n)^{−k} = 1
lim_{n→∞} n(n − 1)(n − 2) ··· (n − k + 1)/n^k = 1.
Furthermore, since these approximations sum to 1 when k ranges over the non-negative integers,
e^{−λ} ∑_{k=0}^{∞} λ^k/k! = e^{−λ} e^{λ} = 1,

it follows that the function

p(k) = e^{−λ} λ^k/k!,   k ≥ 0,
defines a probability distribution on the non-negative integers. Like the binomial distribution,
this distribution is so important in probability theory and statistics (and even in the sciences)
that it has its own name, which is taken from that of the 19’th century French mathematician,
Siméon Denis Poisson (1781-1840).
Definition 4.3. A random variable X is said to have the Poisson distribution with parameter λ ≥ 0 if X takes values in the non-negative integers with probability mass function
p_X(k) = P(X = k) = e^{−λ} λ^k/k!.
In this case we write X ∼ Poisson(λ).
To calculate the mean and the variance of this distribution, we first identify its moment generating
function:
m_X(t) = E[e^{tX}] = ∑_{k=0}^{∞} p_X(k) e^{tk} = e^{−λ} ∑_{k=0}^{∞} (λe^t)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.
Differentiating twice with respect to t gives
m'_X(t) = λ e^{t + λ(e^t − 1)}
m''_X(t) = λ(1 + λe^t) e^{t + λ(e^t − 1)},

and we then find

E[X] = m'_X(0) = λ
Var(X) = m''_X(0) − (E[X])² = λ.
Thus, the parameter λ is equal to both the mean and the variance of the Poisson distribution.
It has long been recognized that the Poisson distribution provides a surprisingly accurate model
for the statistics of a large number of seemingly unrelated phenomena. Some examples include:
• the number of misprints per page of a book;
• the number of wrong telephone numbers dialed in a day;
• the number of customers entering a post office per day;
• the number of vacancies (per year) on the US Supreme Court (Table 5.1);
• the number of sites that are mutated when a gene is replicated;
• the number of α-particles discharged per day by a ¹⁴C source;
• the number of major earthquakes per year in a region.
Table 4.1: Data from Cole (2010) compared with a Poisson distribution with λ = 0.5.

                                               Years with x vacancies
                                          1837-1932              1933-2007
Number of vacancies (x)   Probability   Observed  Expected    Observed  Expected
0                         0.6065        59        58.2        47        45.5
1                         0.3033        27        29.1        21        22.7
2                         0.0758        9         7.3         7         5.7
3                         0.0126        1         1.2         0         1.0
≥4                        0.0018        0         0.2         0         0.1
Since these phenomena are generated by very different physical and biological processes, the fact
that they share similar statistical properties cannot be explained by the specific mechanisms
that operate in each instance. Instead, the widespread emergence of the Poisson distribution
appears to be a consequence of a more general mathematical result. As we saw at the beginning
of this section, if the probability of success per trial is small, then the total number of successes
that occur when a large number of trials is performed is approximately Poisson distributed with
parameter λ = np. This result is sometimes known as the Law of Rare Events because it
describes the statistics of count data for events that have a small probability of occurring, i.e.,
that are rare. In fact, as Theorem 4.2 illustrates, we can relax the assumption that every trial is
equally likely to result in a success provided that certain other conditions are satisfied. This is
an example of the so-called Poisson paradigm.
Theorem 4.2. For each n ≥ 1, let Xn,1 , · · · , Xn,n be a family of independent Bernoulli random
variables with parameters pn,1 , · · · , pn,n . Suppose that
lim_{n→∞} max_{1≤i≤n} p_{n,i} = 0        and        lim_{n→∞} ∑_{i=1}^{n} p_{n,i} = λ.

Then,

lim_{n→∞} P(X_{n,1} + ··· + X_{n,n} = k) = e^{−λ} λ^k/k!.
Remark: The first condition guarantees that each event is individually unlikely to occur, while
the second condition implies that the mean number of events that do occur is approximately λ
when n is large.
To see how the law of rare events might explain Poisson-distributed count data, consider the
number of misprints per page of a book. We can regard each word printed on a given page as
a trial, which can either result in the correct spelling of the word or in a misprint. Assuming
that the typist or editor is at least somewhat competent, the probability that any one word is a
misprint is likely to be small, while the number of words per page is probably on the order of
250. Furthermore, barring catastrophic failures, the presence or absence of misprints is probably nearly independent between words. However, under these conditions, the law of rare events
tells us that the total number of misprints per page should be approximately Poisson distributed.
Example 4.4. If, on average, a healthy adult contracts three colds per year, what is the probability
that they will contract at least two colds in a given year? To answer this, we will assume that the
number of colds contracted per year by a healthy adult is Poisson distributed with parameter
λ = 3. If we let X denote this random variable, then
P(X ≥ 2) = 1 − P(X < 2) = 1 − P(X = 0) − P(X = 1) = 1 − e^{−3} 3⁰/0! − e^{−3} 3¹/1! ≈ 0.801,
where 0! = 1 by convention.
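The same number can be obtained directly in Python; the sketch below (added for illustration, not part of the original notes) evaluates the complement of the Poisson(3) c.d.f. at 1.

    from math import exp
    from scipy.stats import poisson

    lam = 3
    by_hand = 1 - exp(-lam) * (1 + lam)       # 1 - P(X = 0) - P(X = 1)
    by_scipy = poisson.sf(1, lam)             # survival function: P(X >= 2) = P(X > 1)
    print(round(by_hand, 3), round(by_scipy, 3))   # both print 0.801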
4.2.1 Fluctuation tests and the origin of adaptive mutations
One of the classic experiments of molecular genetics is the fluctuation test, which was developed by Salvador Luria and Max Delbrück in 1943 in the course of their investigations of the
mechanisms that give rise to adaptive mutations. An adaptive mutation is one that increases
the fitness of the organism that carries it. Specifically, Luria and Delbrück wanted to determine
whether adaptive mutations occur spontaneously or else are directly induced by the environmental conditions to which they are adapted. To do this, they developed an experimental system
based on a bacterium, Escherichia coli, along with a type of virus known as a T1 bacteriophage
that infects this bacterium. When T1 phage is added to a culture of E. coli, most of the bacteria
are killed, but a few resistant cells usually survive and will give rise to resistant colonies that
can be seen on the surface of a petri dish. In other words, resistance to T1 phage is a trait that
varies across E. coli bacteria and which is heritable, i.e., the descendants of resistant bacteria
are usually themselves resistant.
The fluctuation test carried out by Luria and Delbrück was based on the following procedures:
1. A single E. coli culture was initiated from a single T1-susceptible cell and allowed to grow
to a population containing millions of bacteria. Several small samples were taken from this
colony and spread on agar plates that had also been inoculated with the T1 phage. These
plates were left for a period and the number of resistant colonies was counted.
2. The procedure described in (1) was repeated several times, using independently established
E. coli cultures and the resulting data were used to estimate both the mean and the variance
of the number of resistant colonies arising in each culture.
Luria and Delbrück then reasoned as follows. If resistance mutations do not arise spontaneously
but instead are induced by the presence of T1 phage, then because the mutation rate per cell
is low (as shown by the data) and because there are a large number of cells per agar plate that
could mutate and give rise to mutant colonies, the total number of mutant colonies generated
in each trial will be approximately Poisson distributed. In particular, under these assumptions,
the mean and the variance of the number of mutant colonies generated in each trial will be
approximately equal. On the other hand, if mutations occur spontaneously during expansion of
the bacterial populations used in each trial, then the proportion of resistant cells in each culture
prior to T1 plating will depend on the timing of the mutation relative to the expansion of the
population from a single cell. In particular, if the mutation occurs at an early stage of expansion,
then it will be inherited by many of the cells that are spread onto the plates containing the T1
phage and there will be many resistant colonies. However, if the mutation occurs at a late stage
of the expansion, then it will be inherited by few cells and so there will be few resistant colonies.
It can be shown that under this scenario, the number of mutant colonies does not follow the
Poisson distribution and has a variance that is much higher than the mean.
Luria and Delbrück repeated this experiment several times and found that the ratio of the
sample variance to the sample mean ranged from about 4 to 200. In no case was the sample
variance similar to the sample mean and additional controls were performed that showed that
sampling error could not account for the discrepancy between the two. In other words, the data
provide strong evidence against the hypothesis that resistance mutations are induced by the
T1 phage itself and are consistent with the predictions of the hypothesis that these mutations
arise spontaneously prior to exposure to the virus. Similar results have emerged from the many
studies of mutation that were carried out after Luria and Delbrück’s original work using different
organisms and different selection pressures, and indeed these findings are consistent with our
current understanding of the molecular basis of mutation.
4.2.2 Poisson processes
Poisson distributions can also be used to model scenarios in which events occur at a ‘constant
rate’ through space or time. Consider a sequence of events that occur repeatedly through time
(such as earthquakes or mutations to DNA) and assume that the following properties are satisfied:
• Property 1 : Events occur at rate λ per unit time, i.e., the probability that exactly one
event occurs in a given interval [t, t + h] is equal to λh + o(h).
• Property 2 : Events are not clustered, i.e., the probability that two or more events occur in
an interval of length h is o(h).
• Property 3 : Independence. For any set of non-overlapping intervals [s1 , t1 ], [s2 , t2 ], · · · , [sr , tr ],
the numbers of events occurring in these intervals are independent.
Here we have used Landau’s little-o notation: If f and g are real-valued functions on [0, ∞),
we write that f (h) = g(h) + o(h) if
lim_{h→0} [f(h) − g(h)]/h = 0.
In particular, we write f (h) = o(h) if f (h)/h converges to 0 as h → 0. For example, if f (h) = h2 ,
then f (h) = o(h).
Let N (t) denote the number of events that occur in the interval [0, t]. Our goal is to calculate
the distribution of N (t), which we can do by breaking the interval [0, t] into n non-overlapping
subintervals, each of length t/n. Then,
P (N (t) = k) = P (An ) + P (Bn )
where
An = {k subintervals each contain exactly one event, while the others contain no events}
Bn = {N (t) = k and at least one subinterval contains ≥ 2 events }.
Notice that An and Bn are mutually exclusive.
We first show that P (Bn ) can be made arbitrarily small by taking n sufficiently large. Indeed,
P(B_n) ≤ P{at least one subinterval contains two or more events}
       ≤ ∑_{k=1}^{n} P{the k'th subinterval contains two or more events}
       ≤ ∑_{k=1}^{n} o(t/n) = n · o(t/n) = t · [o(t/n)/(t/n)].

However, since t/n → 0 as n → ∞, it follows from the definition of o(h) that this last expression vanishes as n → ∞, which then implies that

lim_{n→∞} P(B_n) = 0.
Next, using Properties 1 and 2, observe that
P{an interval of length t/n contains one event} = λt/n + o(t/n)
P{an interval of length t/n contains no events} = 1 − λt/n + o(t/n).
Also, using Property 3, we have
P(A_n) = (n choose k) [λt/n + o(t/n)]^k [1 − λt/n + o(t/n)]^{n−k} → e^{−λt} (λt)^k/k!   as n → ∞,

since there are (n choose k) ways for the k events to be distributed across the n subintervals such that each subinterval contains either 0 or 1 event.
Putting these results together shows that
P(N(t) = k) = lim_{n→∞} (P(A_n) + P(B_n)) = e^{−λt} (λt)^k/k!.
In other words, for each t ≥ 0, the number of events N (t) to have occurred by time t is a Poisson
random variable with parameter λt. The collection of random variables (N (t) : t ≥ 0) is said to
be a Poisson process with intensity λ. Such processes are used to model a wide variety of
phenomena such as disease transmissions, earthquakes, and arrival of traffic at stoplights.
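A Poisson process with intensity λ can be simulated by exploiting the standard fact (used here without proof) that the gaps between successive events are independent Exponential(λ) waiting times. The Python sketch below (an added illustration, not from the original notes) generates many realizations and checks that the mean and variance of N(t) are both close to λt, as the Poisson(λt) distribution requires; λ = 2 and t = 3 are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    lam, t, reps = 2.0, 3.0, 100_000

    counts = np.empty(reps, dtype=int)
    for i in range(reps):
        # Accumulate exponential inter-event times until time t is exceeded.
        arrivals = np.cumsum(rng.exponential(scale=1.0 / lam, size=30))
        counts[i] = np.searchsorted(arrivals, t, side='right')

    print(counts.mean(), counts.var())   # both should be close to lam * t = 6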
4.3 Random Waiting Times

4.3.1 The geometric and exponential distributions
Suppose that a sequence of independent Bernoulli trials is performed and that each trial has
probability p ∈ (0, 1] of resulting in a success. If T denotes the number of trials performed until
the first success is recorded, then T is a random variable that takes values in the positive integers
{1, 2, · · · , ∞} with probability mass function
p_T(k) = (1 − p)^{k−1} p,   k ≥ 1.
Indeed, T = k if and only if the first k − 1 trials all result in failures and the k’th trial results in
a success. A random variable with this distribution is said to have the geometric distribution
with success probability p, in which case we write T ∼ Geometric(p).
The moment generating function of T is
m_T(t) = ∑_{k=1}^{∞} (1 − p)^{k−1} p e^{tk} = pe^t ∑_{k=1}^{∞} [(1 − p)e^t]^{k−1} = pe^t/(1 − (1 − p)e^t)   if t < −ln(1 − p),   and   m_T(t) = ∞   otherwise,
with derivatives
m'_T(t) = pe^t/(1 − (1 − p)e^t)²
m''_T(t) = −pe^t/(1 − (1 − p)e^t)² + 2pe^t/(1 − (1 − p)e^t)³.
It follows that the mean and the variance of T are equal to
E[T] = 1/p
Var(T) = (1 − p)/p².
As expected, the smaller p is, the more trials that we have to conduct on average to obtain the
first success.
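These formulas are easy to confirm by simulation; the Python sketch below (added here, not part of the original notes) draws geometric waiting times for a few values of p and compares the sample mean and variance with 1/p and (1 − p)/p².

    import numpy as np

    rng = np.random.default_rng(3)
    for p in [0.5, 0.1, 0.01]:
        T = rng.geometric(p, size=200_000)    # number of trials up to the first success
        print(p, T.mean(), 1 / p, T.var(), (1 - p) / p**2)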
Example 4.5. The probability of a mutation in a eukaryotic genome is approximately µ = 10^{−8} per site per generation. Thus it takes, on average, approximately µ^{−1} = 10^8 generations for a
particular site to mutate. To put this time span into perspective, the earliest undisputed animal
fossils date to the Cambrian explosion about 540 MYA, while the dinosaurs first appeared in the
Triassic about 230 MYA and the non-avian dinosaurs went extinct at the end of the Cretaceous
around 65 MYA. Also, because eukaryotic genomes generally contain many millions to billions
of nucleotides, these typically mutate in every generation.
As illustrated in the preceding example, the geometric distribution is often used to model the
random waiting time until the first “success” occurs. This makes the most sense if the trials are
performed at a constant rate, e.g., one trial per day, in which case T will have the same units as
the time between trials. However, if the probability of success per trial is very small, then this
waiting time will typically be very large, in which case the following approximation can be used.
Theorem 4.3. Let (ε_n ; n ≥ 1) be a sequence of positive numbers tending to 0 and for each n ≥ 1 let T_n be a geometric random variable with parameter ε_n. Then, for every t ≥ 0,

lim_{n→∞} P{ε_n T_n ≤ t} = 1 − e^{−t}.

Proof. The result follows from the calculation

P{ε_n T_n ≤ t} = 1 − P{T_n > t/ε_n} = 1 − (1 − ε_n)^{t/ε_n} → 1 − e^{−t}   as n → ∞,

where we have used the following result:

lim_{n→∞} (1 + a_n)^{b_n} = e^c,

whenever (a_n : n ≥ 1) and (b_n : n ≥ 1) are sequences of real numbers such that a_n → 0, b_n → ∞, and a_n b_n → c as n tends to infinity.
Theorem 4.3 not only provides us with an approximation for the geometric distribution, but
it also leads to a new distribution that is interesting in its own right. Since the function
F (t) = 1 − e−t is nondecreasing and differentiable with F (0) = 0 and F (t) → 1 as t → ∞,
it is the cumulative distribution function of a continuous random variable X with probability
density p_X(t) = F'(t) = e^{−t} for t ≥ 0. This distribution is so important that it has been named.
Definition 4.4. A continuous random variable X is said to have the exponential distribution
with parameter λ if the density of X is

p_X(x) = λe^{−λx}   if x > 0,   and   p_X(x) = 0   otherwise.
In this case we write X ∼ Exponential(λ).
Theorem 4.3 illustrates an important theme in probability theory, which is that discrete random variables that take on large integer values can often be approximated by continuous random variables. In this particular case, T_n will typically assume values that are of the same order of magnitude as its mean, E[T_n] = ε_n^{−1}, which will be large whenever ε_n is small. In contrast, if we normalize T_n by dividing by its mean, then we obtain a random variable ε_n T_n which is no longer integer-valued but which has mean E[ε_n T_n] = 1 and typically assumes values that are of the same order of magnitude. Notice that we can also interpret ε_n T_n as the time to the first success, where time is measured in units proportional to ε_n^{−1}.
Previously (Example 3.10) we showed that the moment generating function of X ∼ Exponential(λ) is

m_X(t) = λ/(λ − t)   if t < λ,   and   m_X(t) = ∞   if t ≥ λ,
which implies that the mean and the variance are
E[X] = 1/λ
Var(X) = 1/λ².
Thus, the mean of an exponential random variable is equal to the reciprocal of its parameter,
while the variance is equal to the mean squared. Like the geometric distribution, the exponential
distribution is often used to model the random time that elapses until some event occurs. With
this interpretation, the parameter λ is often referred to as the rate parameter.
An important property of the exponential distribution is that it is memoryless: if X is exponentially distributed with parameter λ, then for all t, s ≥ 0,
P{X > t + s|X > t} = P{X > s}.
To verify this, use the CDF calculated above to obtain
P{X > t} = 1 − P{X ≤ t} = e^{−λt},

and then calculate

P{X > t + s | X > t} = P{X > t + s}/P{X > t} = e^{−λ(t+s)}/e^{−λt} = e^{−λs} = P{X > s}.        (4.5)
In other words, if we think of X as being the random time until some event occurs, then conditional on the event not having occurred by time t, i.e., on X > t, the distribution of the time
remaining until the event does occur, X − t, is also exponential with parameter λ. It is as if the
process determining whether the event occurs between times t and t + s has no ‘memory’ of the
amount of time that has elapsed up to that point. This observation also sheds further light on
why λ is called the rate parameter. Indeed,
P(t < X ≤ t + δt | X > t) = [e^{−λt} − e^{−λ(t+δt)}]/e^{−λt} = 1 − e^{−λδt} ≈ λδt,

provided 0 < δt ≪ 1. This shows that the probability that the event occurs in a short time
interval (t, t + δt] given that it has not already occurred by time t is approximately proportional
to the rate λ times the duration of the interval δt. Furthermore, because this probability does
not depend on t, the rate is constant in time. Remarkably, these properties also uniquely characterize the exponential distribution.
Proposition 4.2. Suppose that X is a continuous random variable with values in [0, ∞). Then X is memoryless if and only if X is exponentially distributed with parameter λ for some λ > 0.
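A quick empirical illustration of the memoryless property (an addition, not part of the original notes): the Python sketch below estimates P(X > t + s | X > t) from simulated Exponential(λ) variables and compares it with P(X > s) = e^{−λs}; λ, t, and s are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, t, s = 1.5, 0.8, 0.5
    X = rng.exponential(scale=1.0 / lam, size=1_000_000)

    conditional = np.mean(X > t + s) / np.mean(X > t)   # P(X > t+s | X > t)
    unconditional = np.exp(-lam * s)                    # P(X > s)
    print(round(float(conditional), 4), round(float(unconditional), 4))   # approximately equal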
As an aside, the geometric distribution could also be said to be memoryless provided we restrict
to integer-valued times. To see this, let T ∼ Geometric(p) and observe that for all positive
integers n, m ≥ 1,
P(T > n + m | T > n) = P(T > n + m)/P(T > n) = (1 − p)^{n+m}/(1 − p)^n = (1 − p)^m = P(T > m),
which is clearly a discrete analogue of (4.5). (Here we have used the fact that P(T > k) = (1 − p)^k, since T > k precisely when the first k trials all result in failures.) This is not surprising given our earlier observation
that the exponential distribution arises as a limiting case of the geometric distribution.
4.3.2 The gamma distribution
Although the memorylessness of the exponential distribution is often a convenient mathematical
property, there are many temporal phenomena in nature that occur at rates which vary over time
and which therefore are not memoryless. For example, because mortalities are age-dependent in
many species, e.g., the very young and the very old are usually much more likely to die in a given
period than are adult individuals, the exponential distribution does not provide a good statistical
model for the life spans of individuals of these species. For this reason, additional continuous
distributions on the non-negative real numbers are needed and one of the most important families
of such distributions is the gamma distribution.
Definition 4.5. A random variable X is said to have the gamma distribution with shape
parameter α > 0 and scale parameter λ > 0 if its density is

p(x) = [λ^α/Γ(α)] x^{α−1} e^{−λx}   if x > 0,   and   p(x) = 0   otherwise,

where the gamma function Γ(α) is defined (for α > 0) by the formula

Γ(α) = ∫_0^∞ y^{α−1} e^{−y} dy.
In this case, we write X ∼ Gamma(α, λ).
Notice that the exponential distribution with parameter λ is identical to the gamma distribution
with parameters (1, λ). Later we will show that if X1 , · · · , Xn are independent exponential RVs,
each with parameter λ, then the sum X = X1 + · · · + Xn is a gamma RV with parameters (n, λ).
Thus, gamma-distributed random variables are sometimes used as statistical models for lifespans
that are determined by a sequence of independent events which occur at constant rate (e.g., transformation of a lineage of cells following the accumulation of mutations that disrupt regulation of
the cell cycle). Furthermore, unlike the exponential distribution, the gamma distribution is not
(in general) memoryless and its mode is positive whenever the shape parameter is greater than 1.
Before we can calculate the moments of X, we need to spend a little time investigating the
gamma function, which is important in its own right. Integration by parts shows that for α > 1,
the gamma function satisfies the following important recursion:
Γ(α) = [−y^{α−1} e^{−y}]_0^∞ + (α − 1) ∫_0^∞ y^{α−2} e^{−y} dy = (α − 1)Γ(α − 1).
In particular, if α = n is a positive integer, then because

Γ(1) = ∫_0^∞ e^{−y} dy = 1,

we obtain

Γ(n) = (n − 1)Γ(n − 1) = (n − 1)(n − 2)Γ(n − 2) = ··· = (n − 1)(n − 2) ··· 3 · 2 · Γ(1) = (n − 1)!        (4.6)
which shows that the gamma function is a continuous interpolation of the factorial function.
In fact, it can be shown that the gamma function is the unique interpolation of the factorial
function which can be expanded as a power series. Equation (4.6) can be used to calculate the
moments of the gamma distribution. If X has the gamma distribution with parameters (α, λ),
then
E[X^n] = ∫_0^∞ x^n [λ^α/Γ(α)] x^{α−1} e^{−λx} dx
       = λ^{−n} [Γ(n + α)/Γ(α)] ∫_0^∞ [λ^{n+α}/Γ(n + α)] x^{n+α−1} e^{−λx} dx
       = λ^{−n} ∏_{k=0}^{n−1} (α + k).
In particular, taking n = 1 and n = 2 gives

E[X] = α/λ
Var(X) = E[X²] − E[X]² = α(α + 1)/λ² − α²/λ² = α/λ².
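As a numerical cross-check (added, not in the original notes), the Python sketch below compares the formula E[Xⁿ] = λ^{−n} ∏_{k=0}^{n−1}(α + k) with moments estimated from simulated gamma variables; α = 2.5 and λ = 1.5 are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)
    alpha, lam = 2.5, 1.5

    # NumPy parametrizes the gamma distribution by shape and scale = 1/lambda.
    X = rng.gamma(shape=alpha, scale=1.0 / lam, size=1_000_000)

    for n in (1, 2, 3):
        exact = np.prod([alpha + k for k in range(n)]) / lam**n
        empirical = np.mean(X**n)
        print(n, round(float(empirical), 3), round(float(exact), 3))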
4.4 Continuous Distributions on Finite Intervals

4.4.1 The uniform distribution
The uniform distribution is arguably the simplest of all the continuous distributions since its
density is constant on an interval. Later in the course we will see that this distribution is special
in another sense, since it is the distribution with maximum entropy on any finite interval. For
these reasons, the uniform distribution is often used as a non-informative prior distribution
for variables when only the lower and upper bounds are known.
Definition 4.6. A continuous random variable X is said to have the uniform distribution on the interval (a, b) if the density function of X is

p(x) = 1/(b − a)   if x ∈ (a, b),   and   p(x) = 0   otherwise.
In this case we write X ∼ Uniform(a, b) or X ∼ U (a, b). In particular, if X ∼ U (0, 1), then X
is said to be a standard uniform random variable.
To say that X is uniformly distributed on an interval (a, b) means that each point in (a, b) is
equally likely to be a value of X. In particular, the probability that X lies in any subinterval of
(a, b) is proportional to the length of the subinterval: if (l, r) ⊂ (a, b), then
P(X ∈ (l, r)) = (r − l)/(b − a).
Of course, since X is continuous, the probability that X = x for any x ∈ (a, b) is 0.
Suppose that X has the U (a, b) distribution. Then the n’th moment of X is given by
E[X^n] = [1/(b − a)] ∫_a^b x^n dx = [1/(b − a)] [1/(n + 1)] (b^{n+1} − a^{n+1}).
It follows that the mean and the variance of X are

E[X] = (a + b)/2
Var(X) = E[X²] − E[X]² = (b − a)²/12.
4.4.2 The beta distribution
Although mathematically convenient, the uniform distribution is often too restrictive when modeling the statistics of variables that take values in a finite interval. When this is true, we can
sometimes turn to the beta distribution as a fairly flexible generalization of the standard uniform
distribution.
Definition 4.7. A random variable X is said to have the beta distribution with parameters
a, b > 0 if its density is
p(x) = [1/β(a, b)] x^{a−1}(1 − x)^{b−1}   if x ∈ (0, 1),   and   p(x) = 0   otherwise,

where the beta function β(a, b) is defined (for a, b > 0) by the formula

β(a, b) = ∫_0^1 x^{a−1}(1 − x)^{b−1} dx.
Notice that if a = b = 1, then X is simply a standard uniform random variable. Also, if X1 and
X2 are independent gamma-distributed RVs with parameters (a, θ) and (b, θ), respectively, then
the random variable X = X1 /(X1 + X2 ) is beta-distributed with parameters (a, b). In particular,
if X1 and X2 are independent exponential RVs with parameter λ, then X = X1 /(X1 + X2 ) is
uniformly distributed on [0, 1]. The beta distribution is often used to model distributions of
frequencies.
It can be shown that the beta function and the gamma function are related by the following
identity:
β(a, b) = Γ(a)Γ(b)/Γ(a + b).
In turn, we can use this identity to evaluate the moments of the beta distribution. Let X be a
beta random variable with parameters (a, b). Then
E[X^n] = [1/β(a, b)] ∫_0^1 x^n x^{a−1}(1 − x)^{b−1} dx
       = β(n + a, b)/β(a, b)
       = [Γ(n + a)Γ(b)/Γ(n + a + b)] × [Γ(a + b)/(Γ(a)Γ(b))]
       = [Γ(n + a)/Γ(a)] × [Γ(a + b)/Γ(n + a + b)]
       = ∏_{k=0}^{n−1} (a + k)/(a + b + k).
Taking n = 1 and n = 2 gives
E[X] = a/(a + b)
Var(X) = E[X²] − E[X]² = a(a + 1)/[(a + b)(a + b + 1)] − a²/(a + b)² = ab/[(a + b)²(a + b + 1)].
4.4.3 Bayesian estimation of proportions
Suppose that we conduct n independent trials and that we find that exactly k of them result
in a success. If we assume that each trial has the same unknown probability p of resulting in a
success, what does our data tell us about the value of p?
In problem set 4 you addressed this question by calculating the maximum likelihood estimate of p. Here we will describe a Bayesian approach to the same problem. Let us define the
propositions I, D and Hp by
I = “n independent, identically-distributed trials have been conducted”
D = “exactly k of the trials have resulted in a success”
Hp = “the probability of success per trial is in (p, p + δp)”
Notice that there are infinitely many hypotheses Hp which are indexed by values of p ∈ [0, 1].
By Bayes’ formula, we know that the posterior probability of hypothesis Hp is equal to
\[
P(H_p \mid D, I) = P(H_p \mid I)\, \frac{P(D \mid H_p, I)}{P(D \mid I)} \qquad (4.7)
\]
where P(Hp |I) is the prior probability of Hp , P(D|Hp , I) is the likelihood function, and P(D|I) is the prior predictive probability of the data. Under the assumptions of the problem, the likelihood function is given by a binomial probability
\[
P(D \mid H_p, I) = \binom{n}{k}\, p^k (1-p)^{n-k}.
\]
54
CHAPTER 4. A MENAGERIE OF DISTRIBUTIONS
Thus, to complete the analysis, we need to choose a prior distribution for the value of p. Recall
that in Bayesian inference the prior distribution quantifies our degree of belief in each hypothesis
before we take into account the new data. This belief could be based on data generated by
previous experiments of the same type or on completely different sources of information (or
both). For binomial trials, it is often convenient to choose a prior distribution on p that belongs
to the beta family, i.e., we will assume that the density of the prior distribution on Hp is
\[
P(H_p \mid I) = \frac{1}{\beta(a, b)}\, p^{a-1} (1-p)^{b-1}\, dp
\]
for some pair of numbers a, b > 0. Because the variance of the beta distribution is a decreasing
function of a and b, if a and b are both large, then the prior will be concentrated around the
mean a/(a + b), which would reflect a scenario in which we have very strong prior beliefs concerning
the likely values of p. On the other hand, if a = b = 1, then the beta distribution is just the
standard uniform distribution, which would reflect a scenario in which we have little (perhaps no)
information concerning the likely values of p. Whatever the values of these parameters, equation
(4.7) tells us that the posterior density of p is equal to
\[
P(H_p \mid D, I) = \frac{1}{\beta(a,b)}\, p^{a-1}(1-p)^{b-1}\; \frac{\binom{n}{k}\, p^k (1-p)^{n-k}}{P(D \mid I)}
= \frac{1}{C}\, p^{k+a-1} (1-p)^{n-k+b-1},
\]
where the constant $C = \beta(a,b)\,P(D \mid I)\big/\binom{n}{k}$ can either be calculated directly or it can be determined indirectly by noting that it is a normalizing constant for the density on the right-hand
side, which is just the density of a beta distribution with parameters a + k and b + n − k. Since
this density must still integrate to 1 over [0, 1], it follows that
C = β(a + k, b + n − k).
This shows that the posterior distribution belongs to the same parametric family of distributions
as the prior distribution, differing only in the values of the parameters, which now depend on
the data. Prior distributions with this property are said to be conjugate priors and have
the advantage that the posterior distribution can be determined immediately without having to
resort to expensive numerical estimates of the prior predictive probability.
In this particular problem, we see that the mean and the variance of the posterior distribution
are
\[
E[p \mid D, I] = \frac{a+k}{a+b+n}, \qquad
\mathrm{Var}(p \mid D, I) = \frac{(a+k)(b+n-k)}{(a+b+n)^2 (a+b+n+1)}.
\]
This shows that the variance of the posterior distribution is necessarily less than the variance of
the prior distribution by a quantity that is increasing in n, i.e., the more data that we collect,
the less uncertain we are about the true value of p. Furthermore, if k and n are large relative
to a and b, then the mean of the posterior distribution will be close to p̂M L = k/n, which is the
maximum likelihood estimate of p mentioned above. In fact, the relative magnitudes of a and b
on the one hand and k and n − k on the other hand reflect the relative importance of the two
sources of information that we have concerning the value of p. When a, b ≫ k, n − k, the data is only weakly informative about the value of p and then the posterior distribution depends mainly on the prior distribution. On the other hand, if a, b ≪ k, n − k, then the data is very informative
about the value of p and this too is reflected in the posterior distribution.
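The conjugate update described above takes only a few lines of code. The Python sketch below (an added illustration; the prior parameters and data are hypothetical) computes the posterior parameters, mean, and variance for a beta prior and binomial data.

# Beta-binomial conjugate update (illustrative values)
import numpy as np

def beta_binomial_posterior(a, b, k, n):
    """Return the parameters of the posterior beta distribution."""
    return a + k, b + n - k

a0, b0 = 1.0, 1.0          # uniform (non-informative) prior
k, n = 14, 20              # hypothetical data: 14 successes in 20 trials
a1, b1 = beta_binomial_posterior(a0, b0, k, n)

post_mean = a1 / (a1 + b1)
post_var = a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1))
print(a1, b1, post_mean, post_var)   # posterior mean is close to k/n = 0.7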
4.4.4 Binary to analogue: construction of the uniform distribution
Thus far we have simply assumed that continuous distributions exist, without taking care to
specify the σ-algebra of measurable subsets of R or to prove that there really are probability
distributions with these properties. Here we partially redress this absence by showing how to
construct a standard uniform random variable from a sequence of i.i.d. Bernoulli random variables.
Suppose that X1 , X2 , · · · is a sequence of independent Bernoulli random variables, each with parameter 1/2. If we define
\[
X = \sum_{n=1}^{\infty} X_n \left(\tfrac{1}{2}\right)^{n},
\]
then X is a standard uniform random variable. Notice that the random sequence (X1 , X2 , · · · )
is the binary representation of the random number X. That X is uniformly distributed can be
deduced as follows:
Step 1: We say that a number x ∈ R is a dyadic rational number if x has a finite binary
expansion, i.e., there exists an integer n ≥ 1 and numbers c1 , · · · , cn ∈ {0, 1} such that
\[
x = \sum_{k=1}^{n} c_k \left(\tfrac{1}{2}\right)^{k}.
\]
Notice that every dyadic rational number is rational, but that not every rational number is dyadic rational, e.g., x = 1/3 is not dyadic rational. On the other hand, the set of dyadic rational numbers is dense in R, i.e., for every z ∈ R and every ε > 0, there exists a dyadic rational number x such that |z − x| < ε.
Step 2: Every number x ∈ R either has a unique, non-terminating binary expansion, or it
has two binary expansions, one ending in an infinite sequence of 0’s and the other ending in an
infinite sequence of 1’s. In either case,
P{X = x} = 0.
Step 3: Given numbers c1 , · · · , cn ∈ {0, 1}, let x be the corresponding dyadic rational defined
by the equation shown in Step 1 and let E(c1 , · · · , cn ) be the interval (x, x + 2−n ]. Then
\[
P\{X \in E(c_1, \cdots, c_n)\} = P\{X_1 = c_1, \cdots, X_n = c_n\} - P\{X = x\} = \left(\tfrac{1}{2}\right)^{n}.
\]
Furthermore, if 0 ≤ a < b ≤ 1, where a and b are dyadic rational numbers, then because the
interval (a, b] can be written as the disjoint union of finitely many sets of the type E(c1 , · · · , cn )
for suitable choices of the ci ’s, we have that
P{X ∈ (a, b]} = (b − a).
Step 4: Given any two numbers 0 ≤ a < b ≤ 1, not necessarily dyadic rational, we can find
a decreasing sequence (an ; n ≥ 1) and an increasing sequence (bn ; n ≥ 1) such that an < bn ,
an → a and bn → b. Then the sets (an , bn ) form an increasing sequence, with limit set (a, b),
and by the continuity properties of probability measures, we know that
\[
P\{X \in (a, b)\} = \lim_{n \to \infty} P\{X \in (a_n, b_n]\} = \lim_{n \to \infty} (b_n - a_n) = b - a.
\]
However, this result implies that X is uniform on [0, 1].
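The construction can be illustrated numerically by truncating the binary expansion after finitely many terms. The Python sketch below (an added illustration, not part of the notes) uses N = 30 Bernoulli(1/2) bits per sample, so the truncation error is at most 2^{-30}, and checks that the resulting variates have mean close to 1/2 and variance close to 1/12.

# Approximate a standard uniform variate from i.i.d. Bernoulli(1/2) bits
import numpy as np

rng = np.random.default_rng(2)
N, samples = 30, 100_000
bits = rng.integers(0, 2, size=(samples, N))      # i.i.d. fair bits
weights = 0.5 ** np.arange(1, N + 1)
x = bits @ weights                                # truncated binary expansions

print(x.mean(), x.var())    # close to 1/2 and 1/12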
4.5 The Normal Distribution
As explained below, the standard normal distribution is one of the most important probability
distributions in all of mathematics. Not only does it play a central role in large sample statistics,
but it is also relevant to many physical and biological processes, including heat conduction,
diffusion, and the evolution of quantitative traits. Furthermore, the standard normal distribution
is the building block for an entire family of normal distributions, having different means and
variances.
Definition 4.8. A continuous real-valued random variable X is said to have the normal distribution with mean µ and variance σ 2 if the density of X is
\[
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad -\infty < x < \infty.
\]
In this case we write X ∼ N (µ, σ 2 ). In particular, X is said to be a standard normal random
variable if X ∼ N (0, 1).
Remark: Normal distributions are also called Gaussian distributions after Carl Friedrich
Gauss (1777-1855).
We can show that the integral of p(x) over R is equal to 1 by using a clever trick attributed to
Gauss. First, by making the substitution y = (x − µ)/σ, we see that
\[
\int_{-\infty}^{\infty} p(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy.
\]
Then, letting $I = \int_{-\infty}^{\infty} \exp(-y^2/2)\, dy$, we obtain
\[
I^2 = \left(\int_{-\infty}^{\infty} e^{-x^2/2}\, dx\right)\left(\int_{-\infty}^{\infty} e^{-y^2/2}\, dy\right)
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\, dx\, dy
= \int_0^{\infty}\!\int_0^{2\pi} r e^{-r^2/2}\, d\theta\, dr
= 2\pi \int_0^{\infty} r e^{-r^2/2}\, dr = 2\pi.
\]
Here we have changed to polar coordinates in passing from the second to the third line, i.e., we set x = r cos(θ), y = r sin(θ), and dx dy = r dθ dr. This calculation shows that $I = \sqrt{2\pi}$, which confirms that p(x) is a probability density.
An important property of normal distributions is that any affine transformation of a normal
random variable is also a normal random variable. Specifically, if X is normally distributed with
mean µ and variance σ 2 , then the random variable Y = aX + b is also normally distributed, but
with mean aµ + b and variance a2 σ 2 . To verify this claim, assume that a > 0 and observe that
the c.d.f. of Y is
\[
F_Y(x) = P(Y \le x) = P(aX + b \le x) = P\!\left(X \le \frac{x-b}{a}\right) = F_X\!\left(\frac{x-b}{a}\right).
\]
The density of Y can be found by differentiating FY (x) and is equal to
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \frac{d}{dx} F_X\!\left(\frac{x-b}{a}\right) = \frac{1}{a}\, p_X\!\left(\frac{x-b}{a}\right)
= \frac{1}{a\sigma\sqrt{2\pi}}\, e^{-(x-a\mu-b)^2/2a^2\sigma^2},
\]
which shows that Y ∼ N (aµ + b, a2 σ 2 ) as claimed. A similar argument works if a < 0. In
particular, by taking b = −µ and a = 1/σ, we see that the variable Z = (X − µ)/σ is a standard
normal random variable. For this reason, Z is said to be standardized. Similarly, if Z is a
standard normal random variable, then so is −Z.
For statistical applications of the normal distribution, we are often interested in probabilities of
the form P(X > x) = 1−P(X ≤ x). Although a simple analytical expression is unavailable, these
quantities can be calculated numerically and have been extensively tabulated for the standard
normal distribution. An online copy of such a table can be found at Wikipedia under the topic
‘Standard normal table’. Furthermore, because every normal random variable can be expressed
in terms of a standard normal random variable, we can use these lookup tables for more general
problems. For example, if X ∼ N (µ, σ 2 ), then by defining Z = (X − µ)/σ, we see that
P(X > x) = P(σZ + µ > x)
= 1 − P(Z ≤ (x − µ)/σ) = 1 − Φ((x − µ)/σ),
where Φ(x) is the CDF of the standard normal distribution:
\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy.
\]
Alternatively, functions that evaluate the cumulative distribution function of any normal random
variable X with arbitrary mean and variance are available in computational software packages
such as Matlab, Mathematica and R. For example, the Matlab command
p = normcdf (x, mu, sigma)
will calculate the probability P(X ≤ x) when X ∼ N (µ, σ 2 ). Note that the symbols x, mu, and
sigma in the command should be replaced by their intended values, e.g., p = normcdf (1, 0, 1).
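For readers working in Python rather than Matlab, an analogous computation can be done with scipy (this snippet is an added illustration, not from the notes); scipy.stats.norm.cdf takes the mean and standard deviation as the loc and scale arguments.

# Python analogue of the Matlab call above
from scipy.stats import norm

x, mu, sigma = 1.0, 0.0, 1.0
p = norm.cdf(x, loc=mu, scale=sigma)   # P(X <= x) for X ~ N(mu, sigma^2)
print(p)                               # approximately 0.841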
Standardization is also used in the proof of the next proposition.
Proposition 4.3. Let X be a normal random variable with parameters µ and σ 2 . Then the
expected value of X is µ, while the variance of X is σ 2 .
Proof. Let Z = (X − µ)/σ, so that Z ∼ N (0, 1). Then
\[
E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-x^2/2}\, dx = 0,
\]
while integration by parts with u = x and dv = x e^{-x^2/2} dx shows that
\[
\mathrm{Var}(Z) = E[Z^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^2 e^{-x^2/2}\, dx
= \frac{1}{\sqrt{2\pi}} \left( -x e^{-x^2/2}\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) = 1.
\]
Since X = µ + σZ, the proposition follows from the identities
E[X] = µ + σE[Z] = µ
Var(X) = σ 2 Var(Z) = σ 2 .
Our next result helps to explain why normal distributions are so prevalent. Recall that if
X1 , · · · , Xn are independent Bernoulli random variables, each having parameter p, then the
distribution of the sum Sn = X1 + · · · + Xn is binomial with parameters (n, p). Furthermore, if
p is small, then, by the law of rare events, we know that Sn is approximately Poisson distributed
with parameter np. In this case, Sn is typically small, of order np, and so a discrete approximation is to be expected. If, instead, np is large, then Sn will typically be large and it makes
more sense to look for a continuous approximation, much as we did in Theorem 4.3. Such an
approximation is given in the next theorem.
Theorem 4.4. (DeMoivre-Laplace Limit Theorem) If Sn is a binomial random variable
with parameters n and p and Z is a standard normal random variable, then
\[
\lim_{n \to \infty} P\left\{ a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b \right\} = P(a \le Z \le b) = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx
\]
for all real numbers a ≤ b.
In other words, the DeMoivre-Laplace Theorem asserts that when both n and np are large, the
distribution of the standardized random variable $Z_n = (S_n - np)/\sqrt{np(1-p)}$ can be approximated by the
distribution of a standard normal random variable Z. Furthermore, since affine transformations
of normal distributions are still normal (see above), it follows that for large n the variable Sn
is approximately normal with mean np and variance np(1 − p). We illustrate the use of this
approximation in Example 4.6 below. The accuracy of this approximation depends on both n
and p, but it is usually accurate to within a few percent when np(1 − p) ≥ 10 and the error can
be shown to decay at a rate that is asymptotic to n−1/2 . The DeMoivre-Laplace Theorem can
be illustrated mechanically with the help of a device known as a Galton board or a quincunx;
an applet simulating a Galton board can be found at the following link:
http://www.jcu.edu/math/isep/quincunx/quincunx.html.
To use this applet, set the number of bins to 10 or greater and run a series of simulations with
an increasing number of balls (e.g., 10, 100, 1000, 10000). Be sure to set the speed to Maximum
Speed for larger simulations.
In fact, the Demoivre-Laplace theorem is a special case of a much more powerful result which
is known as the Central Limit Theorem. This asserts that the distribution of the sum of a
large number of independent, identically-distributed random variables with any distribution with
finite mean and finite variance is approximately normal. In the setting of the Demoivre-Laplace
theorem, the individual random variables are all Bernoulli, but there is nothing special about
the Bernoulli distribution. Indeed, we could just as easily add Poisson or exponential or beta-distributed random variables and the sum would still be approximately normally distributed.
Later in the course we will use moment generating functions to prove that the CLT holds for
sums of random variables that have finite exponential moments and then we will be able to
deduce Theorem 4.4 as a corollary.
Example 4.6. Suppose that a fair coin is tossed 100 times. What is the probability that the
number of heads obtained is between 45 and 55 (inclusive)?
Solution: If X denotes the number of heads obtained in 100 tosses, then X is a binomial random
variable with parameters (100, 1/2). By the Demoivre-Laplace theorem, we know that
\[
P\{45 \le X \le 55\} = P\left\{ -1 \le \frac{X - 50}{5} \le 1 \right\} \approx \Phi(1) - \Phi(-1) = \Phi(1) - (1 - \Phi(1)) = 2\Phi(1) - 1 = 0.683,
\]
where a table of Z-scores has been consulted for the numerical value of Φ(1) = 0.841. Here we
have made use of the identity
Φ(−x) = 1 − Φ(x),
which follows from the fact that if X ∼ N (0, 1), then
P{X ≤ −x} = P{X > x} = 1 − P{X ≤ x}.
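The quality of this approximation can be checked directly. The Python sketch below (an added illustration using scipy) compares the exact binomial probability with the normal approximation used above; the exact value is closer to 0.73, and most of the gap disappears if a continuity correction (using 44.5 and 55.5) is applied.

# Exact binomial probability versus the normal approximation
from scipy.stats import binom, norm

n, p = 100, 0.5
exact = binom.cdf(55, n, p) - binom.cdf(44, n, p)    # P(45 <= X <= 55)
approx = norm.cdf(1.0) - norm.cdf(-1.0)              # 2*Phi(1) - 1
print(exact, approx)                                 # roughly 0.729 versus 0.683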
4.6 Transformations of Random Variables
In Section 4.5 we saw that an affine transformation of a normal random variable is also normally
distributed and we used this observation to prove several results about normal distributions.
Since transformations of random variables are frequently used in statistics and applied probability, we need a general set of techniques to determine their distributions. This is the subject of
this section.
Suppose that X is a real-valued random variable and let Y = g(X), where g : R → R is a
real-valued function. Then
P(Y ∈ E) = P(g(X) ∈ E) = P(X ∈ g −1 (E)),
where the set g −1 (E) is the preimage of E under g
g −1 (E) = {x ∈ R : g(x) ∈ E}.
Notice that we do not require g to be one-to-one and so g −1 may not be defined as a function.
Furthermore, since in general E ≠ g −1 (E), it is clear that Y and X will usually have different
distributions. However, provided that the transformation is not too complicated, we can sometimes express the cumulative distribution function of Y in terms of that of X. We first see how
this is done in two examples and then describe a general approach.
Example 4.7. Let X be a non-negative random variable with CDF FX and let Y = X n , where
n ≥ 1. Then, for any x ≥ 0,
\[
F_Y(x) = P(Y \le x) = P(X^n \le x) = P(X \le x^{1/n}) = F_X(x^{1/n}),
\]
while for any x < 0, we have F_Y(x) = 0. Similarly, if X has a density p_X = F_X', then Y also has a density given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\[4pt] \frac{1}{n}\, x^{\frac{1}{n}-1}\, p_X\!\left(x^{1/n}\right) & \text{if } x \ge 0. \end{cases}
\]
Example 4.8. Let X be a continuous real-valued random variable with density p_X(x) = F_X'(x) and let Y = X^2. Then, for x ≥ 0,
\[
F_Y(x) = P\{Y \le x\} = P\{X^2 \le x\} = P\{-\sqrt{x} \le X \le \sqrt{x}\} = P\{-\sqrt{x} < X \le \sqrt{x}\} = F_X(\sqrt{x}) - F_X(-\sqrt{x}),
\]
where we have used the fact that $P(X = -\sqrt{x}) = 0$ (since X is continuous). In contrast, if x < 0, then F_Y(x) = 0. In this case, the density of Y is given by
\[
p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\[4pt] \frac{1}{2\sqrt{x}}\left( p_X(\sqrt{x}) + p_X(-\sqrt{x}) \right) & \text{if } x \ge 0. \end{cases}
\]
Notice that we obtain different expressions for F_Y(x) in Examples 4.7 and 4.8 in the case n = 2
depending on whether X is strictly non-negative or can take on both positive and negative values. The reason for this is that the function x → x2 is one-to-one when restricted to [0, ∞) but
is two-to-one when considered on the entire real line.
Monotonic transformations: Suppose that X is a real-valued random variable with CDF FX
and let Y = g(X), where the function g : R → (l, r) is strictly increasing, continuous, one-to-one
and onto, i.e., if x < y, then g(x) < g(y). Then g is invertible on the interval (l, r) and so the
CDF of Y can be expressed as
\[
F_Y(x) = P\{Y \le x\} = P\{g(X) \le x\} = P\{X \le g^{-1}(x)\} = F_X(g^{-1}(x)),
\]
for any x ∈ (l, r). Of course, if x ≥ r, then P{Y ≤ x} = 1, while if x ≤ l, then P{Y ≤ x} = 0. Combining these facts, we obtain the general expression
\[
F_Y(x) = \begin{cases} 0 & \text{if } x \le l \\ F_X(g^{-1}(x)) & \text{if } x \in (l, r) \\ 1 & \text{if } x \ge r. \end{cases}
\]
If, instead, g is a decreasing function, then
\[
F_Y(x) = P\{g(X) \le x\} = P\{X \ge g^{-1}(x)\} = 1 - F_X(g^{-1}(x)-),
\]
where $F_X(x-) = \lim_{x_n \uparrow x} F_X(x_n)$ is the left-hand limit of F_X at x. Thus, in this case, the CDF of Y has the form
\[
F_Y(x) = \begin{cases} 0 & \text{if } x \le l \\ 1 - F_X(g^{-1}(x)-) & \text{if } x \in (l, r) \\ 1 & \text{if } x \ge r. \end{cases}
\]
Of course, if X is continuous, then FX (x−) = FX (x) and the last result can be simplified.
Example 4.9. Suppose that U is a standard uniform random variable and let g be the function g(y) = (b − a)y + a, where b ≠ a, and set Y = g(U). If b > a, then a ≤ Y ≤ b and the cumulative
distribution function of Y is
\[
F_Y(x) = P((b-a)U + a \le x) = P\!\left(U \le \frac{x-a}{b-a}\right) = \frac{x-a}{b-a},
\]
provided x ∈ [a, b]. This shows that Y is uniform on the interval [a, b]. On the other hand, if
b < a, then b ≤ Y ≤ a and
\[
F_Y(x) = P((b-a)U + a \le x) = P\!\left(U \ge \frac{x-a}{b-a}\right) = 1 - P\!\left(U \le \frac{x-a}{b-a}\right) = 1 - \frac{a-x}{a-b} = \frac{x-b}{a-b},
\]
whenever x ∈ [b, a], which shows that Y is uniform on the interval [b, a]. In particular, by taking
a = 1 and b = 0, we see that Y ≡ 1 − U is uniform on [0, 1], i.e., 1 − U is also a standard
uniform random variable. We will use this observation below in Example 4.10.
The next proposition describes an important application of these identities.
Theorem 4.5. Suppose that U is a uniform random variable on (0, 1) and let X be a real-valued
random variable with a strictly increasing continuous CDF FX (·). Then the random variable
Y = FX−1 (U ) has the same distribution as X, while the random variable Z = FX (X) is uniformly
distributed on (0, 1).
Proof. First consider the distribution of Y :
FY (x) = P{Y ≤ x}
= P{FX−1 (U ) ≤ x}
= P{U ≤ FX (x)}
= FX (x),
since FX (x) ∈ [0, 1] for all x and P{U ≤ y} = y whenever y ∈ [0, 1]. This shows that Y and X
have the same distribution.
Similarly, the CDF of Z = FX (X) is
FZ (x) = P{Z ≤ x}
= P{FX (X) ≤ x}
= P{X ≤ FX−1 (x)}
= FX (FX−1 (x))
= x,
for any x ∈ [0, 1], which shows that Z is a uniform random variable on [0, 1].
This result has important applications in computational statistics and stochastic simulations. For
example, many computational statistics packages include routines that will simulate a sequence
of independent uniform random variables U1 , U2 , · · · (or some suitable approximation thereof).
Then, if X is a random variable with a strictly increasing CDF, we can generate a sequence of
independent random variables X1 , X2 , · · · , each having the same distribution as X, by setting
Xi = FX−1 (Ui ). This is computationally efficient only if the inverse FX−1 can be easily evaluated.
Example 4.10. Recall that the CDF of the exponential distribution with parameter λ is
FX (x) = 1 − e−λx .
A simple calculation shows that
\[
F_X^{-1}(x) = -\frac{1}{\lambda} \ln(1 - x),
\]
and so it follows that if U is uniform on [0, 1], then
\[
Y = -\frac{1}{\lambda} \ln(1 - U)
\]
is exponentially distributed with parameter λ. In fact, because the distribution of 1 − U is also uniform on [0, 1], the random variable
\[
Y' = -\frac{1}{\lambda} \ln(U)
\]
is also exponentially distributed with parameter λ.
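This inverse-transform recipe is exactly how exponential variates are often generated in practice. The Python sketch below (an added illustration; λ = 0.5 is an arbitrary choice) feeds standard uniforms through F_X^{-1} and checks that the sample mean is close to 1/λ.

# Inverse-transform sampling of the exponential distribution
import numpy as np

rng = np.random.default_rng(3)
lam = 0.5
u = rng.uniform(size=100_000)
y = -np.log(1.0 - u) / lam        # Y = F^{-1}(U)
print(y.mean(), 1.0 / lam)        # both close to 2, the exponential mean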
If X is a continuous random variable and g is a strictly monotonic function with a differentiable
inverse, then it can be shown that the random variable Y = g(X) is also continuous and the
density of Y can be expressed in terms of g and the density of X. This relationship is described
in the next proposition.
Proposition 4.4. Let X be a continuous random variable with density pX (·) and suppose that
g is a strictly monotonic continuous function with differentiable inverse g −1 . Then Y = g(X) is
a continuous random variable with density function
\[
p_Y(y) = \begin{cases} \left| \frac{d}{dy}\, g^{-1}(y) \right| \cdot p_X\!\left(g^{-1}(y)\right) & \text{if } y = g(x) \text{ for some } x \\[4pt] 0 & \text{otherwise.} \end{cases}
\]
Proof. Without loss of generality, we may assume that g is increasing. Suppose that y = g(x) for some x. Then, as shown above, F_Y(y) = F_X(g^{-1}(y)), and differentiating with respect to y gives
\[
p_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\, g^{-1}(y) \cdot p_X\!\left(g^{-1}(y)\right).
\]
Chapter 5
Random Vectors and Multivariate Distributions
5.1 Joint and Marginal Distributions
Because the world is complex and multi-faceted, scientific investigations are usually faced with
a multitude of variables whose relationships to one another are only poorly understood. For
example, if we are interested in understanding temporal fluctuations in quantities such as ocean
surface temperatures or currency exchange rates, then we will need to study time series of
data that can be represented as a sequence of random variables, X1 , X2 , · · · , where the variable
Xt denotes the quantity at interest at the time of the t’th measurement. Similarly, if we are
conducting a genome-wide association study (GWAS) to identify genetic variants that predispose individuals to complex diseases such as cancer or diabetes or schizophrenia, then we need
to consider an array of random variables, Xi1 , · · · , Xin , Yi , where Yi indicates whether the i’th
individual in the study is affected (a case) or not (a control) and Xij indicates which nucleotide
{A, T, C, G} is present at the j’th site in this individual.
To build probabilistic models for these kinds of data, we need to consider groups of random
variables X1 , · · · , Xn that are all defined on the same probability space. Provided that these
are ordered in some manner, we can treat each group of variables as a random vector X =
(X1 , · · · , Xn ) which is a map from the sample space Ω into Rn :
X(ω) ≡ (X1 (ω), · · · , Xn (ω)) ∈ Rn .
Our main goal in this chapter is to develop some tools that can be used to describe and study
random vectors.
Definition 5.1. Suppose that X1 , X2 , · · · , Xn are real-valued random variables defined on the
same probability space (Ω, F, P). Then the joint distribution of these variables is the probability
distribution Q of the random vector X = (X1 , · · · , Xn ) defined by
Q(E) = P (X1 , · · · , Xn ) ∈ E
for any measurable subset E ⊂ Rn . Likewise, the joint cumulative distribution function of
(X1 , · · · , Xn ) is defined by
F (x1 , · · · , xn ) = P (X1 ≤ x1 , · · · , Xn ≤ xn ) .
As with real-valued random variables, the joint distribution and the joint CDF are each fully
determined by the other. This means that if two random vectors X = (X1 , · · · , Xn ) and
Y = (Y1 , · · · , Yn ) have the same joint cumulative distribution function, then their components
have the same joint distribution.
If X = (X1 , · · · , Xn ) is a random vector, then each variable Xi is itself a real-valued random
variable and the distribution of Xi is called the marginal distribution of Xi . Note that
each component variable has a marginal distribution. By exploiting the continuity properties of
probability distributions, the marginal cumulative distribution function of Xi can be determined
from the joint cumulative distribution of the random vector. For example, the marginal CDF of
X1 is equal to
\[
F_{X_1}(x) = P(X_1 \le x) = P(X_1 \le x, X_2 < \infty, \cdots, X_n < \infty)
= P\!\left( \bigcup_{M \ge 1} \{X_1 \le x, X_2 < M, \cdots, X_n < M\} \right)
= \lim_{M \to \infty} F_{\mathbf{X}}(x, M, \cdots, M).
\]
This process of recovering the marginal distribution of a single random variable from the joint
distribution of the random vector is called marginalisation.
Caveat: The joint distribution of the variables (X1 , · · · , Xn ) fully determines all of the marginal
distributions. However, knowledge of the marginal distributions alone is not sufficient to determine the joint distribution: there are usually infinitely many joint distributions that have the
same marginal distributions.
Definition 5.2. If X1 , · · · , Xn are discrete random variables defined on the same probability
space and Xi takes values in the countable set Ei , then X = (X1 , · · · , Xn ) is a discrete random
vector with values in the product space E = E1 × · · · × En and the joint probability mass
function of X is defined by
p(x1 , · · · , xn ) = P(X1 = x1 , · · · , Xn = xn ).
If we are given the joint probability mass function pX of a discrete random vector X = (X1 , · · · , Xn ),
then we can find the marginal probability mass functions of the variables Xi by summing pX
over appropriate subsets of E. For example, the marginal probability mass function of X1 is
given by the following sum:
\[
p_{X_1}(x) = P(X_1 = x) = \sum_{x_2, \cdots, x_n} p_{\mathbf{X}}(x, x_2, \cdots, x_n).
\]
In other words, we can calculate the marginal probability mass function for Xi at a point x ∈ Ei
by summing pX over all possible vectors in E with the i’th coordinate held fixed at x.
Example 5.1. Suppose that n independent experiments are performed, each of which can have
one of r possible outcomes (labeled 1, · · · , r) with respective probabilities p1 , · · · , pr , where p1 +
· · · + pr = 1. Let Xi denote the number of experiments with outcome i. Then X = (X1 , · · · , Xr ) is a discrete random vector with probability mass function
\[
p(n_1, \cdots, n_r) = P\{X_1 = n_1, \cdots, X_r = n_r\} = \binom{n}{n_1, \cdots, n_r}\, p_1^{n_1} \cdots p_r^{n_r},
\]
and X is said to have the multinomial distribution with parameters n and (p1 , · · · , pr ). Notice that X1 + · · · + Xr = n. Of course, if r = 2, then this is just the binomial distribution with parameters n and p1 .
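Multinomial vectors are easy to simulate, and marginalisation can be checked empirically. The following Python sketch (an added illustration with arbitrary probabilities) draws multinomial samples with numpy and verifies that the first component behaves like a Binomial(n, p1) random variable.

# Multinomial sampling and a marginal check (illustrative values)
import numpy as np

rng = np.random.default_rng(4)
n, probs = 10, [0.2, 0.3, 0.5]
counts = rng.multinomial(n, probs, size=100_000)   # each row sums to n

p1 = probs[0]
print(counts[:, 0].mean(), n * p1)                 # binomial mean n*p1
print(counts[:, 0].var(), n * p1 * (1 - p1))       # binomial variance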
Definition 5.3. The random variables X1 , · · · , Xn are said to be jointly continuous if there
exists a function p(x1 , · · · , xn ), called the joint probability density function, such that
\[
P\{(X_1, \cdots, X_n) \in E\} = \int \cdots \int_{(x_1, \cdots, x_n) \in E} p(x_1, \cdots, x_n)\, dx_1 \cdots dx_n
\]
for any measurable subset E ⊂ Rn .
As in the univariate case, the joint density function can be found by differentiating the joint
cumulative distribution function. However, when there are multiple variables, we must take
partial derivatives for each component of the vector:
pX (x1 , · · · , xn ) = ∂x1 · · · ∂xn FX (x1 , · · · , xn ).
Furthermore, if the variables X1 , · · · , Xn are jointly continuous, then each variable Xi is marginally
continuous and its marginal density function can be found by integrating the joint density function over all coordinates except xi :
\[
p_{X_i}(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_{\mathbf{X}}(x_1, \cdots, x_{i-1}, x, x_{i+1}, \cdots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n.
\]
Caveat: X1 , · · · , Xn can all be individually continuous even when they are not jointly continuous. For example, the random vector (X, X) is never continuous even if X is itself a continuous
random variable.
Example 5.2. The random vector X = (X1 , · · · , Xn−1 ) is said to have the Dirichlet distribution with parameters α1 , · · · , αn > 0 if its components are jointly continuous with joint density
function
\[
p_{\mathbf{X}}(x_1, \cdots, x_{n-1}) = \frac{\Gamma(\alpha)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \prod_{i=1}^{n} x_i^{\alpha_i - 1}, \qquad \text{where } \alpha = \alpha_1 + \cdots + \alpha_n,
\]
for all x1 , · · · , xn−1 ≥ 0 satisfying x1 + · · · + xn−1 ≤ 1, where xn ≡ 1 − x1 − · · · − xn−1 . Since x1 + · · · + xn = 1 by construction, the Dirichlet distribution is usually regarded as a continuous distribution on the standard (n − 1)-dimensional simplex
\[
K_{n-1} = \Big\{ (x_1, \cdots, x_n) \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1 \text{ and } x_i \ge 0 \text{ for all } i \Big\}.
\]
When n = 2, the Dirichlet distribution reduces to the Beta distribution.
5.2 Independent Random Variables
Definition 5.4. A collection of n random variables X1 , · · · , Xn , all defined on the same probability space, are said to be independent if for every collection of measurable sets A1 , · · · , An ,
\[
P\{X_1 \in A_1, \cdots, X_n \in A_n\} = \prod_{i=1}^{n} P\{X_i \in A_i\}.
\]
In other words, the random variables X1 , · · · , Xn are independent if and only if for all such subsets A1 , · · · , An ⊂ R, the sets {X1 ∈ A1 }, · · · , {Xn ∈ An } are independent. Also, we say that an
infinite collection of random variables is independent if every finite subcollection is independent.
Heuristically, X1 and X2 are independent if knowing the value of X1 provides us with no information about the value of X2 , and vice versa.
Proposition 5.1. Suppose X1 , · · · , Xn are jointly distributed random variables, i.e., they are
all defined on the same probability space, and let X = (X1 , · · · , Xn ). Then each of the following
conditions implies independence:
(a) The joint cumulative distribution function factors into a product of the marginal cumulative
distribution functions, i.e., for all real numbers x1 , · · · , xn ,
\[
F_{\mathbf{X}}(x_1, \cdots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i).
\]
(b) X1 , · · · , Xn are discrete and the joint probability mass function factors into a product of
the marginal probability mass functions:
\[
p_{\mathbf{X}}(x_1, \cdots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i)
\]
for all x1 , · · · , xn .
(c) X1 , · · · , Xn are jointly continuous and the joint probability density function factors into a
product of the marginal probability density functions:
\[
f_{\mathbf{X}}(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)
\]
for all x1 , · · · , xn .
Example 5.3. Let Z be a Poisson random variable with parameter λ, let W1 , W2 , · · · be a
collection of independent Bernoulli random variables, each with parameter p, and define
\[
X = \sum_{i=1}^{Z} W_i, \qquad Y = Z - X.
\]
Then X and Y are independent Poisson random variables with parameters λp and λ(1 − p),
respectively. To prove this, we calculate
\[
P\{X = i, Y = j\} = P\{X = i, Z = i + j\}
= P\{X = i \mid Z = i + j\} \cdot P\{Z = i + j\}
= \binom{i+j}{i} p^i (1-p)^j \cdot e^{-\lambda} \frac{\lambda^{i+j}}{(i+j)!}
= \frac{e^{-\lambda p} (\lambda p)^i}{i!} \cdot \frac{e^{-\lambda(1-p)} (\lambda(1-p))^j}{j!}
\equiv p_X(i) \cdot p_Y(j),
\]
where pX and pY are the probability mass functions of Poisson random variables with parameters
λp and λ(1 − p). Independence of X and Y then follows from the factorization of the joint
probability mass function. Furthermore, since Z = X + Y , we have also shown that the sum of
two independent Poisson-distributed random variables is also Poisson-distributed with parameter
equal to the sum of the parameters of those two variables.
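The thinning construction in this example can be checked by simulation. The Python sketch below (an added illustration; the parameter values are arbitrary) generates (X, Y) as above and compares their sample means with λp and λ(1 − p); their sample covariance is also close to zero, consistent with independence.

# Poisson thinning: X and Y are independent Poisson variables
import numpy as np

rng = np.random.default_rng(5)
lam, p, samples = 6.0, 0.3, 100_000
z = rng.poisson(lam, size=samples)
x = rng.binomial(z, p)              # keep each of the Z points with probability p
y = z - x

print(x.mean(), lam * p)            # Poisson(lambda*p) mean
print(y.mean(), lam * (1 - p))      # Poisson(lambda*(1-p)) mean
print(np.cov(x, y)[0, 1])           # close to 0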
5.3 Conditional Distributions
5.3.1 Discrete Conditional Distributions
Definition 5.5. If X and Y are discrete random variables with joint probability mass function
pX,Y (x, y), then we define the conditional probability mass function of X given Y = y by
\[
p_{X|Y}(x|y) = P\{X = x \mid Y = y\} = \frac{P\{X = x, Y = y\}}{P\{Y = y\}} = \frac{p_{X,Y}(x, y)}{p_Y(y)},
\]
provided pY (y) > 0.
In other words, since X and Y are defined on the same probability space, we can consider the
distribution of X conditional on the event Y = y. This is of great importance in statistics, since
we are often interested in the distributions of variables that can only be indirectly estimated
from the values of other correlated random variables. Of course, if X and Y are independent,
then the conditional probability mass function of X given Y = y is equal to the unconditioned
probability mass function:
\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_X(x)\, p_Y(y)}{p_Y(y)} = p_X(x).
\]
In other words, if X and Y are independent, then Y provides us with no information concerning
the value of X.
Example 5.4. Suppose that X and Y are independent Poisson RVs with parameters λ1 and λ2 and let Z = X + Y . Then, using the fact (see Example 5.3) that Z is a Poisson RV with
parameter λ1 + λ2 , the conditional probability mass function of X given Z = n is
\[
p_{X|Z}(k|n) = \frac{P\{X = k, Z = n\}}{P\{Z = n\}}
= \frac{P\{X = k, Y = n - k\}}{P\{Z = n\}}
= \frac{P\{X = k\}\, P\{Y = n - k\}}{P\{Z = n\}}
= e^{-\lambda_1} \frac{\lambda_1^k}{k!} \cdot e^{-\lambda_2} \frac{\lambda_2^{n-k}}{(n-k)!} \cdot \left( e^{-(\lambda_1+\lambda_2)} \frac{(\lambda_1 + \lambda_2)^n}{n!} \right)^{-1}
= \binom{n}{k} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{k} \left( \frac{\lambda_2}{\lambda_1 + \lambda_2} \right)^{n-k},
\]
for k = 0, · · · , n, and so it follows that the conditional distribution of X given X + Y = n is
Binomial with parameters n and p = λ1 /(λ1 + λ2 ). Notice that once we condition on the event
X + Y = n, X and Y are no longer independent.
The next example illustrates how the conditional joint distribution of more than two RVs can be
calculated.
Example 5.5. Recall that (X1 , · · · , Xk ) has the multinomial distribution with parameters n and
(p1 , · · · , pk ), p1 + · · · + pk = 1, if the joint probability mass function has the form
\[
P\{X_1 = n_1, \cdots, X_k = n_k\} = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}, \qquad n_i \ge 0, \quad \sum_{i=1}^{k} n_i = n.
\]
Notice that by combining groups of the component variables, we again obtain a multinomial
RV. For example, if we combine the first r categories into a single type and we define Fr =
1 − pr+1 − · · · − pk = p1 + · · · + pr , then the random vector (X1 + · · · + Xr , Xr+1 , · · · , Xk ) has
the multinomial distribution with parameters n and (Fr , pr+1 , · · · , pk ). This can be shown by
calculating
\[
P\{X_1 + \cdots + X_r = l,\, X_{r+1} = n_{r+1}, \cdots, X_k = n_k\}
= \sum_{n_1 + \cdots + n_r = l} P\{X_1 = n_1, \cdots, X_k = n_k\}
= \sum_{n_1 + \cdots + n_r = l} \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_r^{n_r} p_{r+1}^{n_{r+1}} \cdots p_k^{n_k}
= \frac{n!}{l!\, n_{r+1}! \cdots n_k!}\, F_r^{\,l}\, p_{r+1}^{n_{r+1}} \cdots p_k^{n_k},
\]
provided l + nr+1 + · · · + nk = n. The last identity is a consequence of the multinomial formula.
Now suppose that we observe that Xr+1 = nr+1 , · · · , Xk = nk . Then, if we set m = nr+1 +
· · · + nk and let Fr be defined as in the preceding paragraph, the conditional joint distribution of
(X1 , · · · , Xr ) given this observation is equal to
\[
P\{X_1 = n_1, \cdots, X_r = n_r \mid X_{r+1} = n_{r+1}, \cdots, X_k = n_k\}
= \frac{P\{X_1 = n_1, \cdots, X_k = n_k\}}{P\{X_{r+1} = n_{r+1}, \cdots, X_k = n_k\}}
= \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_r^{n_r} p_{r+1}^{n_{r+1}} \cdots p_k^{n_k}
\cdot \left( \frac{n!}{(n-m)!\, n_{r+1}! \cdots n_k!}\, F_r^{\,n-m}\, p_{r+1}^{n_{r+1}} \cdots p_k^{n_k} \right)^{-1}
= \frac{(n-m)!}{n_1! \cdots n_r!} \left( \frac{p_1}{F_r} \right)^{n_1} \cdots \left( \frac{p_r}{F_r} \right)^{n_r}.
\]
In other words, the conditional distribution of (X1 , · · · , Xr ) given the values of the random variables Xr+1 , · · · , Xk is again multinomial with parameters n − m and (p1 /Fr , · · · , pr /Fr ).
5.3.2 Continuous Conditional Distributions
Definition 5.6. If X and Y are jointly continuous random variables with joint probability density
pX,Y (x, y) and marginal densities pX (x) and pY (y), then for any y such that pY (y) > 0, we can
define the conditional probability density of X given that Y = y by
\[
p_{X|Y}(x|y) = \lim_{h \to 0} \frac{1}{h}\, P\{X \in (x, x+h] \mid Y \in (y, y+h]\}
= \lim_{h \to 0} \frac{\frac{1}{h^2}\, P\{X \in (x, x+h],\, Y \in (y, y+h]\}}{\frac{1}{h}\, P\{Y \in (y, y+h]\}}
= \frac{p_{X,Y}(x, y)}{p_Y(y)},
\]
provided pY (y) > 0.
Because Y is marginally continuous, we are effectively conditioning on an event with zero probability: P{Y = y} = 0. That we are able to make sense of this is a consequence of the fact
that X and Y are jointly continuous: this controls the singularity in the ratio that defines the
conditional probability.
Remark: If X and Y are independent, then the conditional probability density of X given
Y = y is equal to the unconditioned probability density:
\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_X(x)\, p_Y(y)}{p_Y(y)} = p_X(x).
\]
Example 5.6. We say that a random vector (X, Y ) has a bivariate normal distribution if
there exist parameters (µx , µy ) and σx > 0, σy > 0, −1 < ρ < 1 such that the joint density of X
and Y is
\[
p_{X,Y}(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}
\exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left( \frac{x - \mu_x}{\sigma_x} \right)^2 + \left( \frac{y - \mu_y}{\sigma_y} \right)^2 - 2\rho\, \frac{(x - \mu_x)(y - \mu_y)}{\sigma_x \sigma_y} \right] \right\}.
\]
The conditional density of X given that Y = y is equal to
\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = C_1\, p_{X,Y}(x, y)
= C_2 \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left( \frac{x - \mu_x}{\sigma_x} \right)^2 - 2\rho\, \frac{x (y - \mu_y)}{\sigma_x \sigma_y} \right] \right\}
= C_3 \exp\left\{ -\frac{1}{2\sigma_x^2 (1-\rho^2)} \left[ x^2 - 2x \left( \mu_x + \rho\, \frac{\sigma_x}{\sigma_y}\,(y - \mu_y) \right) \right] \right\}
= C_4 \exp\left\{ -\frac{1}{2\sigma_x^2 (1-\rho^2)} \left( x - \mu_x - \rho\, \frac{\sigma_x}{\sigma_y}\,(y - \mu_y) \right)^2 \right\}.
\]
Apart from the constant C4 , which can be evaluated explicitly by using the fact that $\int_{-\infty}^{\infty} p_{X|Y}(x|y)\, dx = 1$, we recognize the conditional density as the density of a normal RV with
• mean = µx + ρ(σx /σy )(y − µy )
• variance = σx2 (1 − ρ2 ).
Remark: Notice that X and Y are independent if and only if ρ = 0.
5.3.3 Conditional Expectations
Definition 5.7. If X and Y are jointly discrete random variables with conditional probability
mass function pX|Y (x|y), then the conditional expectation of X given Y = y is the quantity
\[
E[X \mid Y = y] = \sum_{x} x \cdot P\{X = x \mid Y = y\} = \sum_{x} x\, p_{X|Y}(x|y).
\]
Similarly, if X and Y are jointly continuous with conditional density function pX|Y (x|y), then the conditional expectation of X given Y = y is defined by
\[
E[X \mid Y = y] = \int_{-\infty}^{\infty} x \cdot p_{X|Y}(x|y)\, dx.
\]
Example 5.7. Suppose that X and Y are independent Poisson RVs, both with parameter λ, and
let Z = X + Y . We wish to find the conditional expectation of X given that Z = n. To do so,
we first recall from Example 5.4 that the conditional distribution of X given Z = n is binomial
with parameters (n, 1/2):
\[
p_{X|Z}(k|n) = \binom{n}{k} \left(\frac{1}{2}\right)^{n},
\]
for k = 0, · · · , n. Consequently,
\[
E[X \mid Z = n] = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n} k = \frac{n}{2}.
\]
We can also solve this problem by observing that
n = E[Z|Z = n] = E[X|Z = n] + E[Y |Z = n] = 2E[X|Z = n],
using the fact that E[X|Z = n] = E[Y |Z = n] since X and Y have the same conditional distribution given Z.
In general, all of the results that hold for expectations also hold for conditional expectations,
including the following useful identity:
\[
E[g(X) \mid Y = y] =
\begin{cases}
\sum_{x} g(x)\, p_{X|Y}(x|y) & \text{discrete case} \\[4pt]
\int_{-\infty}^{\infty} g(x)\, p_{X|Y}(x|y)\, dx & \text{continuous case.}
\end{cases}
\]
The next definition introduces a subtle concept that can be used to great effect when calculating
expected values. This is particularly important in the theory of stochastic processes and Markov
chains.
Definition 5.8. The conditional expectation of X given Y is the random variable
E[X|Y ] = H(Y ),
where H(y) is the (deterministic) function defined by the formula
H(y) = E[X|Y = y].
Definition 5.8 should be compared and contrasted with Definition 5.7: whereas the conditional expectation E[X|Y = y] defined in 5.7 is a number, the conditional expectation E[X|Y ] defined in 5.8 is a random variable. The difference arises because in the former case we are conditioning
on a specific event, namely Y = y, and asking for the mean value of X under the assumption
that this particular event is true, whereas in the latter case we are conditioning on a random
variable and asking for the mean value of X as a function of the unknown random value that Y
might take.
Example 5.8. Suppose that X, Y , and Z are as in Example 5.7. Then
\[
E[X \mid Z] = \frac{1}{2}\, Z,
\]
since H(z) = E[X|Z = z] = z/2.
Our next result states that the expected value of a random variable X can be calculated by averaging the conditional expectation of X given any other random variable Y over the distribution
of Y .
Proposition 5.2. For any two random variables X and Y , we have
E[X] = E[E[X|Y ]],
provided that both E[X] and E[X|Y ] exist. In particular, if Y is discrete, then
\[
E[X] = \sum_{y} E[X \mid Y = y]\, P\{Y = y\},
\]
while if Y is continuous with density pY (y), then
\[
E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\, p_Y(y)\, dy.
\]
Proof. We will assume that X and Y are both discrete. Then
\[
\sum_{y} E[X \mid Y = y]\, P\{Y = y\}
= \sum_{y} \sum_{x} x\, P\{X = x \mid Y = y\}\, P\{Y = y\}
= \sum_{y} \sum_{x} x\, \frac{P\{X = x, Y = y\}}{P\{Y = y\}}\, P\{Y = y\}
= \sum_{y} \sum_{x} x\, P\{X = x, Y = y\}
= \sum_{x} x \sum_{y} P\{X = x, Y = y\}
= \sum_{x} x\, P\{X = x\}
= E[X].
\]
5.4 Expectations of Sums and Products of Random Variables
We begin with a general result concerning the expectation of a function of two random variables.
Proposition 5.3. If X and Y have a joint probability mass function pX,Y (x, y) and g : R2 → R,
then
\[
E[g(X, Y)] = \sum_{y} \sum_{x} p_{X,Y}(x, y)\, g(x, y).
\]
Similarly, if X and Y are jointly continuous with joint density pX,Y (x, y), then
\[
E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, g(x, y)\, dx\, dy.
\]
Proof. By expressing g as the difference between its positive and negative parts, g = g + − g − ,
it suffices to prove the result under the assumption that g is non-negative. Here we will assume
that X and Y are jointly continuous; a similar proof works in the case where both are discrete.
Then, by exercise 7 of PS3, we have
\[
E[g(X, Y)] = \int_0^{\infty} P\{g(X, Y) > t\}\, dt
= \int_0^{\infty} \left( \int\!\!\int_{(x,y):\, g(x,y) > t} p_{X,Y}(x, y)\, dx\, dy \right) dt
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_0^{g(x,y)} p_{X,Y}(x, y)\, dt\, dx\, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, g(x, y)\, dx\, dy,
\]
where the interchange of integrals in passing from the second to the third line is justified by the
fact that the integrand is everywhere non-negative.
Example 5.9. Suppose that X and Y are independent RVs that are uniformly distributed on
[0, 1]. Then
\[
E\big[\,|X - Y|\,\big] = \int_0^1 \int_0^1 |x - y|\, dy\, dx
= \int_0^1 \left( \int_0^x (x - y)\, dy + \int_x^1 (y - x)\, dy \right) dx
= \int_0^1 \left( x^2 - x^2/2 + (1 - x^2)/2 - x(1 - x) \right) dx
= \int_0^1 \left( x^2 - x + \frac{1}{2} \right) dx
= \frac{1}{3}.
\]
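A Monte Carlo check of this value takes only a few lines (an added illustration in Python, not part of the notes):

# E|X - Y| for independent standard uniforms
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(size=200_000)
y = rng.uniform(size=200_000)
print(np.abs(x - y).mean())     # close to 1/3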
The following corollary states that the expected value of a sum of random variables is equal
to the sum of the expected values of those variables. This is true even when the variables are
not independent. Put another way, expectation is a linear functional on the space of random
variables with finite expectations.
Corollary 5.1. Suppose that X1 , X2 , · · · , Xn are random variables defined on the same probability space and that E[Xi ] is finite for all i = 1, · · · , n. Then
\[
E\left[ \sum_{i=1}^{n} X_i \right] = \sum_{i=1}^{n} E[X_i].
\]
Proof. If the Xi are all discrete or all jointly continuous, the corollary follows from Proposition 5.3 by taking g(x, y) = x + y and then proceeding by induction on n.
Example 5.10. Suppose that X1 , · · · , Xn are independent, identically-distributed random variables, each with expected value µ. Then the sample mean of the Xi ’s is just the unweighted
average of the sample:
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]
Notice that
\[
E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu,
\]
i.e., the expected value of the sample mean is equal to the mean of the distribution. For this
reason, the sample mean is said to be an unbiased estimator of the expected value.
Example 5.11. Expectation of a binomial RV: Recall that if X1 , · · · , Xn are independent
Bernoulli RVs, each with parameter p, then their sum X = X1 + · · · + Xn is a Binomial RV with
parameters n and p. A simple calculation shows that E[Xi ] = p · 1 + (1 − p) · 0 = p for each i.
Consequently,
\[
E[X] = \sum_{i=1}^{n} E[X_i] = np,
\]
as we showed earlier using moment generating functions.
Example 5.12. Let X be the number of cards that are left in their original position by a random
permutation of a deck of n cards. Then X can be written as the sum X = X1 + · · · + Xn , where
Xi is the indicator function of the event that the i’th card is left in its original position. Notice
that each Xi is a Bernoulli RV with parameter p = 1/n, but the Xi 's are not independent.
Nonetheless,
\[
E[X] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1.
\]
Example 5.13. Random Walks on Z. Suppose that X1 , X2 , · · · is a collection of independent
RVs, each with distribution P{Xi = 1} = P{Xi = −1} = 1/2, and let SN = X1 + · · · + XN . We
can think of SN as the position at time N of a particle which in each unit interval of time is
equally likely to move to the right or to the left. Notice that
\[
E[S_N] = \sum_{i=1}^{N} E[X_i] = 0,
\]
since E[Xi ] = (1/2) · 1 + (1/2) · (−1) = 0 for each i. Thus, the expected position of the particle never changes. A better measure of the displacement of the particle away from its initial position at 0 is given by the root mean square displacement (RMSD) of SN ,
\[
D_N \equiv \sqrt{E\big[S_N^2\big]}.
\]
To find DN , we begin by calculating
\[
D_N^2 = E[S_N^2] = E\big[(X_1 + \cdots + X_N)^2\big]
= E\left[ \sum_{i=1}^{N} X_i^2 \right] + E\left[ \sum_{1 \le i \ne j \le N} X_i X_j \right]
= N\, E[X_1^2] + N(N-1)\, E[X_1 X_2],
\]
where to obtain the last line, we have used the identities E[X_i^2] = E[X_1^2] and E[X_i X_j] = E[X_1 X_2] for all i and j such that i ≠ j. Furthermore,
\[
E[X_1^2] = \tfrac{1}{2} \cdot 1^2 + \tfrac{1}{2} \cdot (-1)^2 = 1, \qquad
E[X_1 X_2] = \tfrac{1}{4} \cdot 1 \cdot 1 + \tfrac{1}{4} \cdot 1 \cdot (-1) + \tfrac{1}{4} \cdot (-1) \cdot 1 + \tfrac{1}{4} \cdot (-1) \cdot (-1) = 0.
\]
Substituting these values into the previous equation shows that $D_N^2 = N$ and so the RMSD of the particle is
\[
D_N = N^{1/2}.
\]
Notice that DN diverges as N → ∞, but at a rate that is sublinear in N .
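The √N growth of the RMSD is easy to see in simulation. The Python sketch below (an added illustration; the walk length and number of replicates are arbitrary) generates many independent walks and compares the empirical root mean square displacement with √N.

# Root mean square displacement of a simple symmetric random walk
import numpy as np

rng = np.random.default_rng(7)
N, walks = 400, 20_000
steps = rng.choice([-1, 1], size=(walks, N))
s_n = steps.sum(axis=1)

print(s_n.mean())                                # close to 0
print(np.sqrt((s_n ** 2).mean()), np.sqrt(N))    # both close to 20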
5.4.1 Expectations of products
Proposition 5.4. If X and Y are independent, then for any pair of real-valued functions h and
g,
E[h(X)g(Y )] = E[h(X)] · E[g(Y )],
provided that the expectations appearing on both sides of the identity exist.
Proof. We will prove this under the assumption that X and Y are jointly continuous with joint
density pX,Y = pX · pY . Then
\[
E[h(X) g(Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) g(y)\, p_{X,Y}(x, y)\, dx\, dy
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x) g(y)\, p_X(x)\, p_Y(y)\, dx\, dy
= \left( \int_{-\infty}^{\infty} p_X(x) h(x)\, dx \right) \left( \int_{-\infty}^{\infty} p_Y(y) g(y)\, dy \right)
= E[h(X)] \cdot E[g(Y)].
\]
Corollary 5.2. Taking h(x) = x and g(y) = y in Proposition 5.4 yields the important identity:
E[XY ] = E[X] · E[Y ],
provided that X, Y and XY all have finite expectations.
Corollary 5.3. If X1 , · · · , Xn are independent RVs, then a simple induction argument on n
shows that
\[
E\left[ \prod_{i=1}^{n} f_i(X_i) \right] = \prod_{i=1}^{n} E[f_i(X_i)],
\]
again provided that all of the expectations appearing in the formula exist.
Our next proposition establishes an important property of moment generating functions of sums
of independent random variables.
Proposition 5.5. Suppose that X1 , · · · , Xn are independent RVs with moment generating functions ψX1 (t), · · · , ψXn (t). Then the moment generating function of the sum X = X1 + · · · + Xn
is
\[
\psi_X(t) = E\big[ e^{t(X_1 + \cdots + X_n)} \big] = \prod_{i=1}^{n} E\big[ e^{t X_i} \big] = \prod_{i=1}^{n} \psi_{X_i}(t),
\]
i.e., the moment generating function of a sum of independent RVs is just the product of the
moment generating functions of the individual RVs.
Example 5.14. Suppose that X1 , · · · , Xn are independent Poisson RVs with parameters λ1 , · · · , λn ,
respectively. Then the moment generating function of the sum X = X1 + · · · + Xn is
\[
\psi_X(t) = \prod_{i=1}^{n} e^{-\lambda_i (1 - e^t)} = \exp\left( -(1 - e^t) \sum_{i=1}^{n} \lambda_i \right),
\]
which shows that X is a Poisson RV with parameter $\sum_{i=1}^{n} \lambda_i$. Compare this with Example 5.3.

5.5 Covariance and Correlation
Definition 5.9. If X and Y are RVs with means µX = E[X] and µY = E[Y ], then the covariance between X and Y is the quantity
\[
\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \cdot \mu_Y.
\]
Remark: The covariance of two random variables is a measure of their association. The covariance is positive if when X is larger than µX , then Y also tends to be larger than µY , while the
covariance is negative if the opposite pattern holds. In particular, if X and Y are independent,
then
Cov(X, Y ) = E[X − µX ]E[Y − µY ] = 0,
because knowledge of how X compares with its mean provides no information about Y or vice
versa. Observe that the converse is not true. There are random variables that have zero covariance but which are not independent. For example, suppose that X has the distribution
\[
P\{X = 0\} = P\{X = 1\} = P\{X = -1\} = \frac{1}{3},
\]
and that the random variable Y is defined by
\[
Y = \begin{cases} 0 & \text{if } X \ne 0 \\ 1 & \text{if } X = 0. \end{cases}
\]
Then E[X] = 0 and also E[XY ] = 0 because XY = 0 with probability 1, so that
Cov(X, Y ) = E[XY ] − E[X]E[Y ] = 0,
but X and Y clearly are not independent. For example,
\[
P\{X = 0 \mid Y = 1\} = 1 \ne \frac{1}{3} = P\{X = 0\}.
\]
Any two random variables X and Y are said to be uncorrelated if Cov(X, Y ) = 0, regardless
of whether or not they are independent.
The next proposition lists some useful properties of covariances.
Proposition 5.6.
(a) Cov(X, Y ) = Cov(Y, X)
(b) Cov(X, X) = V ar(X)
(c) Cov(aX, Y ) = a Cov(X, Y ) for any scalar a

(d) $\mathrm{Cov}\Big( \sum_{i=1}^{n} X_i,\; \sum_{j=1}^{m} Y_j \Big) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{Cov}(X_i, Y_j)$.
Each property can be proved directly from the definition of the covariance and the linearity
properties of expectations.
A very useful formula follows by combining parts (b) and (d) of the preceding proposition:
\[
\mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + \sum_{1 \le i \ne j \le n} \mathrm{Cov}(X_i, X_j).
\]
In particular, if the Xi 's are pairwise independent, then Cov(Xi , Xj ) = 0 whenever i ≠ j, so that
\[
\mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \mathrm{Var}(X_i).
\]
If, in addition to being pairwise independent, the Xi 's are identically distributed, say with variance Var(Xi ) = σ², then
\[
\mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = n\sigma^2.
\]
Example 5.15. If X1 , · · · , Xn are independent Bernoulli RVs, each with parameter p, then
X = X1 + · · · + Xn is a Binomial RV with parameters (n, p). Because the variance of each
Bernoulli RV is p(1 − p), the preceding formula for the variance of a sum of IID RVs shows that
V ar(X) = np(1 − p),
confirming our previous calculation using moment generating functions.
Example 5.16. If X1 , · · · , Xn is a collection of IID (independent, identically-distributed) random variables with mean µ and variance σ 2 , then the sample mean X̄ and the sample variance
S 2 are the quantities defined by
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \big( X_i - \bar{X} \big)^2.
\]
If we think of the Xi ’s as the outcomes of a sequence of independent replicates of some experiment,
then the sample mean and sample variance are often used to estimate the true mean and true
variance of the underlying distribution. Recall that in the previous section we showed that the
sample mean is an unbiased estimator for the true mean:
E[X̄] = µ.
Although unbiasedness is a desirable property of an estimator, we also care about the variance of
the estimator about its expectation. For the sample mean, we can calculate
\[
\mathrm{Var}(\bar{X}) = \left( \frac{1}{n} \right)^2 \mathrm{Var}\left( \sum_{i=1}^{n} X_i \right)
= \left( \frac{1}{n} \right)^2 \sum_{i=1}^{n} \mathrm{Var}(X_i)
= \frac{\sigma^2}{n},
\]
which shows that the variance of the sample mean is inversely proportional to the number of
independent replicates that have been performed. We can also show that the sample variance is
an unbiased estimator of the true variance. We first observe that
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \big( X_i - \mu + \mu - \bar{X} \big)^2
= \frac{1}{n-1} \left( \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) + \sum_{i=1}^{n} (\bar{X} - \mu)^2 \right)
= \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \mu)^2 - \frac{n}{n-1} (\bar{X} - \mu)^2.
\]
Taking expectations gives
\[
E[S^2] = \frac{1}{n-1} \sum_{i=1}^{n} E\big[(X_i - \mu)^2\big] - \frac{n}{n-1}\, \mathrm{Var}(\bar{X})
= \frac{n}{n-1}\, \sigma^2 - \frac{n}{n-1} \cdot \frac{\sigma^2}{n}
= \sigma^2.
\]
Informally, the reason that we must divide by n − 1 rather than n to obtain an unbiased estimate
of the variance is that the deviations between the Xi ’s and the sample mean (which is computed
using the observed data) tend to be smaller than the deviations between the Xi ’s and the true
mean.
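This bias is easy to demonstrate numerically. The Python sketch below (an added illustration; the normal sampling distribution and parameter values are arbitrary) averages the sample variance over many replicate datasets, once with the n − 1 divisor and once with the n divisor.

# Unbiasedness of the sample variance (ddof=1 divides by n - 1)
import numpy as np

rng = np.random.default_rng(8)
n, reps, sigma2 = 5, 100_000, 4.0
data = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2_unbiased = data.var(axis=1, ddof=1)
s2_biased = data.var(axis=1, ddof=0)
print(s2_unbiased.mean(), sigma2)                 # close to 4
print(s2_biased.mean(), sigma2 * (n - 1) / n)     # close to 3.2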
The next definition provides us with a dimensionless measure of the association between two
random variables.
Definition 5.10. The correlation of two random variables X and Y with variances σ_X² > 0 and σ_Y² > 0 is the quantity
\[
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.
\]
Because
\[
\rho(aX, bY) = \frac{ab \cdot \mathrm{Cov}(X, Y)}{a\sigma_X \cdot b\sigma_Y} = \rho(X, Y)
\]
for any pair of positive constants a and b, the correlation between two random variables does not depend on the units in which they are measured, i.e., the correlation is dimensionless. Another important property of the correlation is that
\[
-1 \le \rho(X, Y) \le 1. \qquad (5.1)
\]
This can be proved as follows. First note that
\[
0 \le \mathrm{Var}\left( \frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} \right)
= \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} + 2\, \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
= 2\big( 1 + \rho(X, Y) \big),
\]
which implies that ρ(X, Y ) ≥ −1. A similar calculation involving the variance of the difference
of X/σX and Y /σY shows that ρ(X, Y ) ≤ 1. If we interpret the standard deviation of a random
variable as a measure of its length or norm, then (5.1) can be interpreted as a version of the Cauchy-Schwarz inequality.
It can be shown that if ρ(X, Y ) = 1, then with probability one
\[
Y = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(X - \mu_X),
\]
while if ρ(X, Y ) = −1, then with probability one
\[
Y = \mu_Y - \frac{\sigma_Y}{\sigma_X}\,(X - \mu_X).
\]
Thus the correlation coefficient can also be treated as a measure of the collinearity of two random variables.
5.6 The Law of Large Numbers and the Central Limit Theorem
In Example 5.16, we showed that the sample mean X̄N of a series of N independent and
identically-distributed random variables is an unbiased estimator of the true mean µ = E[X1 ]
and that the variance of the sample mean is inversely proportional to N :
\[
E[\bar{X}_N] = \mu, \qquad \mathrm{Var}(\bar{X}_N) = \frac{\sigma^2}{N}.
\]
This suggests that when N is large, the sample mean should be close to µ and that the difference
X̄N − µ should typically be of order N −1/2 in magnitude. Our aim in this section is to explain
in what sense these statements are true.
5.6.1 The Weak and Strong Laws of Large Numbers
Proposition 5.7. (Markov’s Inequality) If X is a nonnegative real-valued RV, then for any
positive real number a > 0,
\[
P\{X \ge a\} \le \frac{E[X]}{a}.
\]
Proof. Let Ia be the indicator function of the event {X ≥ a}:
\[
I_a(\omega) = \begin{cases} 1 & \text{if } X(\omega) \ge a \\ 0 & \text{otherwise.} \end{cases}
\]
Then, since X ≥ 0, we have
\[
I_a \le \frac{X}{a}.
\]
Taking expectations of both sides of this inequality gives
\[
P\{X \ge a\} = E[I_a] \le \frac{E[X]}{a}.
\]
Proposition 5.8. (Chebyshev’s Inequality) If X is a RV with finite mean µ and finite variance σ 2 , then for any positive real number k > 0,
\[
P\{|X - \mu| \ge k\} \le \frac{\sigma^2}{k^2}.
\]
Proof. The result follows by taking a = k 2 and applying Markov’s inequality to the non-negative
random variable (X − µ)2
\[
P\{|X - \mu| \ge k\} = P\{(X - \mu)^2 \ge k^2\} \le \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}.
\]
While Markov’s inequality provides a simple bound on the tail of the distribution of X, Chebyshev’s inequality allows us to estimate the probability that X differs greatly from its mean. In
general, both estimates tend to be crude in the sense that the probabilities in question can be
much smaller than the upper bounds given by these inequalities. However, even these crude
inequalities can be used to prove some important results.
Our next result, called the weak law of large numbers, is one of the celebrated limit theorems of
probability theory. Before stating this theorem, we introduce one of the modes of convergence
of a sequence of random variables. In fact, there are several distinct senses in which a sequence of
random variables can be said to converge to a limit and we will encounter two more of these below.
Definition 5.11. We say that a sequence of random variables, X1 , X2 , · · · , converges in probability to a random variable X if for every ε > 0,
\[
\lim_{n \to \infty} P\{|X_n - X| > \varepsilon\} = 0.
\]
Notice that the variables X1 , X2 , · · · and X must all be defined on the same probability space for
this definition to make sense.
Theorem 5.1. (The Weak Law of Large Numbers) Suppose that X1 , X2 , · · · is a sequence
of i.i.d. variables with finite mean µ = E[X1 ] and let Sn = X1 + · · · + Xn . Then, for any ε > 0,
\[
\lim_{n \to \infty} P\left\{ \left| \frac{S_n}{n} - \mu \right| \ge \varepsilon \right\} = 0,
\]
i.e., the sample averages $\tfrac{1}{n}S_n$ converge in probability to the mean.
Proof. We will prove the theorem under the additional assumption that Var(Xi ) = σ 2 < ∞. In
this case we have
\[
E\left[ \frac{1}{n} S_n \right] = \mu \qquad \text{and} \qquad \mathrm{Var}\left( \frac{1}{n} S_n \right) = \frac{\sigma^2}{n},
\]
and so Chebyshev's inequality implies that for any ε > 0,
\[
P\left\{ \left| \frac{1}{n} S_n - \mu \right| \ge \varepsilon \right\} \le \frac{\sigma^2}{n \varepsilon^2},
\]
which tends to 0 as n → ∞.
In other words, given any ε > 0, the weak law of large numbers asserts that by collecting sufficiently many independent samples, we can make the probability that the sample mean and the true mean differ by more than an amount ε as close to zero as we want. Here we think of ε as an upper bound on the acceptable error in our estimate. In fact, there is an even stronger sense in
which the sample mean converges to the true mean as the number of samples tends to infinity.
This is the content of the next theorem, which we state without proof.
Theorem 5.2. (Strong Law of Large Numbers) Let X1 , X2 , · · · be a sequence of i.i.d. random variables, each with finite mean µ = E[X1 ], and let Sn = X1 + · · · + Xn . Then,
\[
P\left\{ \lim_{n \to \infty} \frac{1}{n} S_n = \mu \right\} = 1,
\]
i.e., the sequence $\tfrac{1}{n}S_n$ converges almost surely to µ.
The significance of the strong law of large numbers is that it shows that the frequentist concept
of probabilities can be derived using the measure-theoretic machinery introduced by Kolmogorov
in the 1930’s. In particular, suppose that (Ω, F, P) is a probability space and that E ∈ F is an
event, and let X1 , X2 , · · · be independent indicator variables for E:
Xi = 1 if E occurs on the i'th trial, and Xi = 0 otherwise.

Then E[Xi] = P(E) and so the strong law tells us that

P( limn→∞ (X1 + · · · + Xn)/n = P(E) ) = 1,
i.e., the limiting frequency of E is equal to its probability.
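The following Python sketch (added here for illustration; the event probability P(E) = 0.3 is an arbitrary choice) tracks the running relative frequency of an event along a single sequence of independent trials, which is exactly the quantity the strong law says converges to P(E).

    import numpy as np

    # Running relative frequency of an event E with P(E) = 0.3 along one
    # sequence of independent trials (indicator variables X_i).
    rng = np.random.default_rng(1)
    p = 0.3
    x = rng.random(100_000) < p                      # X_i = 1 if E occurs on trial i
    freqs = np.cumsum(x) / np.arange(1, x.size + 1)  # relative frequency after n trials
    for n in [10, 100, 1_000, 10_000, 100_000]:
        print(n, freqs[n - 1])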
5.6.2 The Central Limit Theorem
Although the strong law of large numbers tells us that the sample means of a sequence of i.i.d.
random variables converge almost surely to the expected value of the sampling distribution, it
does not tell us anything about the rate of convergence or about the size of the fluctuations of the
sample means about this limit. This information is important because we might like to know, for
example, how likely it is that the sample mean of, say, 10,000 independent variates will differ from the true mean by more than 10% of the value of the latter. The Central Limit Theorem is one of several results that give us some insight into the fluctuations of the sequence of sample means
under the additional assumption that the sampling distribution has finite variance. Before stating
this theorem, we begin by defining a new mode of convergence for a sequence of random variables.
Definition 5.12. Suppose that F and Fn , n ≥ 1 are cumulative distribution functions on R.
Then Fn is said to converge weakly to F if limn→∞ Fn (x) = F (x) at every continuity point x
of F . Likewise, a sequence of random variables Xn is said to converge in distribution to a
random variable X if the cumulative distribution functions Fn (x) = P(Xn ≤ x) converge weakly
to F (x) = P(X ≤ x).
Example 5.17. To see why we only require pointwise convergence at continuity points, let Xn
and X be defined by setting P(Xn = 1/n) = 1 and P(X = 0) = 1. Then
Fn(x) = 0 if x < 1/n and Fn(x) = 1 otherwise,

and

F(x) = 0 if x < 0 and F(x) = 1 otherwise,

which shows that Fn(x) converges to F(x) as n → ∞ at every x ≠ 0. Since F(x) is continuous everywhere except at x = 0, it follows that the sequence Xn converges in distribution to X even though the sequence Fn(0) = 0 does not converge to F(0) = 1.
The next proposition shows how we can use moment generating functions to deduce that a
sequence of random variables converges in distribution. The assumption that the moment generating function of the limit X is finite is essential.
Proposition 5.9. Let X1 , X2 , · · · be a sequence of random variables with moment generating
functions ψXn (t), and let X be a random variable with moment generating function ψX . Then
the sequence Xn converges in distribution to X if
limn→∞ ψXn(t) = ψX(t) < ∞   for all t ∈ R.
In effect, this proposition says that two random variables Xn and X are close (in the sense that their
distributions are close) if their moment generating functions are close. This will be key to our
proof of the central limit theorem, but we first need to determine the moment generating function
of a normally-distributed random variable.
Example 5.18. If Z is a standard normal RV (with mean 0 and variance 1), then

ψZ(t) = ∫_{−∞}^{∞} e^(tx) (1/√(2π)) e^(−x²/2) dx
      = e^(t²/2) (1/√(2π)) ∫_{−∞}^{∞} e^(−(x−t)²/2) dx
      = e^(t²/2)
      = Σ_{n=0}^{∞} t^(2n)/(2^n n!).

Thus, the moments of Z are

M2n = (2n)!/(2^n n!)   and   M2n+1 = 0
for all n ≥ 0. To calculate the moment generating function of a normal RV X with mean µ and variance σ², recall that Z ≡ (X − µ)/σ is a standard normal RV. Then, writing X = σZ + µ, we have

ψX(t) = E[e^(tX)] = E[e^(t(σZ+µ))] = e^(tµ) ψZ(σt) = exp(σ²t²/2 + µt).
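A quick Monte Carlo check of these formulas, added here as an aside (the values t = 0.5, µ = 1, σ = 2 are arbitrary choices): the sample average of e^(tX) should be close to exp(σ²t²/2 + µt).

    import numpy as np

    # Monte Carlo check of psi_Z(t) = exp(t^2/2) and psi_X(t) = exp(sigma^2 t^2/2 + mu t).
    rng = np.random.default_rng(2)
    t, mu, sigma = 0.5, 1.0, 2.0
    Z = rng.standard_normal(2_000_000)
    X = mu + sigma * Z
    print(np.exp(t * Z).mean(), np.exp(t**2 / 2))                       # psi_Z(t)
    print(np.exp(t * X).mean(), np.exp(sigma**2 * t**2 / 2 + mu * t))   # psi_X(t)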
Theorem 5.3. (The Central Limit Theorem) Suppose that X1, X2, · · · is a sequence of i.i.d. variables with finite mean µ and variance σ², and let Sn = X1 + · · · + Xn. Then the sequence of random variables

Zn ≡ (Sn − nµ)/(σ√n) = n^(1/2) (Sn/n − µ)/σ

converges in distribution to the standard normal N(0, 1), i.e., for every real number z,

limn→∞ P(Zn ≤ z) = Φ(z) ≡ (1/√(2π)) ∫_{−∞}^{z} e^(−x²/2) dx.
Proof. We will prove the C.L.T. under the assumption that the moment generating function of the random variables Xi, denoted ψ(t), is finite on the entire real line. Let us first assume that µ = 0 and σ² = 1. Notice that the moment generating function of the scaled random variable Xi/√n is

E[exp(tXi/√n)] = ψ(t/√n),

and that the moment generating function of the sum Zn = (X1 + · · · + Xn)/√n is equal to [ψ(t/√n)]^n. Let

L(t) = log ψ(t)

and observe that L(0) = L′(0) = 0 and L′′(0) = 1. By Proposition 5.9, it suffices to show that

limn→∞ [ψ(t/√n)]^n = e^(t²/2),

which is equivalent to

limn→∞ n L(t/√n) = t²/2.

However, this last identity can be verified using L'Hôpital's rule (with x = 1/√n):

limn→∞ L(t/√n)/n^(−1) = limx→0 L(tx)/x² = limx→0 t L′(tx)/(2x) = limx→0 t² L′′(tx)/2 = t²/2.

The general case can then be handled by applying this result to the standardized variables Xi* = (Xi − µ)/σ, which have mean 0 and variance 1.
What is remarkable about the central limit theorem is that it shows that the distribution of
a sum of sufficiently many i.i.d. random variables with finite variance is approximately normal
no matter what distribution the individual variables have. This may explain why the normal
distribution is encountered in so many seemingly unrelated phenomena, such as the distribution
of height in a population of adult males, the distribution of particle velocities in a gas, or the
distribution of errors in an experiment.
Remark 5.1. The restriction imposed by the use of the moment generating function in the proof of the C.L.T. can be surmounted by instead working with characteristic functions:

χX(t) ≡ E[e^(itX)].

Provided that X is real-valued, we have |e^(itX)| = 1 and so the modulus of the characteristic function is less than or equal to 1: |χX(t)| ≤ 1 for all t ∈ (−∞, ∞). This means that the characteristic function of any real-valued random variable is finite on the entire real line. Furthermore, Proposition 5.9 still holds if we replace pointwise convergence of moment generating functions by pointwise convergence of characteristic functions.
Example 5.19. Let X1 , · · · , X10 be independent random variables, each uniformly distributed
on [0, 1], and let X = X1 + · · · + X10 . The central limit theorem can be used to approximate the
distribution of X. For example,
P(X > 6) = P( (X − 5)/√(10/12) > (6 − 5)/√(10/12) ) ≈ 1 − Φ(√1.2) ≈ 0.137.
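A simulation makes it easy to check how good this normal approximation is for only ten summands; the sketch below (not part of the original notes) compares a Monte Carlo estimate of P(X > 6) with the value 0.137 obtained above.

    import numpy as np
    from scipy.stats import norm

    # Monte Carlo check of the normal approximation for a sum of 10 Uniform(0,1)'s.
    rng = np.random.default_rng(3)
    X = rng.random((500_000, 10)).sum(axis=1)
    print((X > 6).mean())                               # simulated P(X > 6)
    print(1 - norm.cdf((6 - 5) / np.sqrt(10 / 12)))     # normal approximation, about 0.137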
A version of the central limit theorem holds even when the random variables X1 , X2 , · · · are
independent but not necessarily identically-distributed.
Theorem 5.4. Let X1, X2, · · · be a sequence of independent random variables with µi = E[Xi] and σi² = Var(Xi). If the Xi are uniformly bounded, i.e., there exists M < ∞ such that P(|Xi| ≤ M) = 1 for all i ≥ 1, and if σ1² + σ2² + · · · = ∞, then

limn→∞ P( [(X1 − µ1) + · · · + (Xn − µn)] / (σ1² + · · · + σn²)^(1/2) ≤ x ) = Φ(x)

for each x ∈ R.
This more general result can be used to prove the following surprising statement about the
number of prime divisors of a randomly chosen positive integer.
Theorem 5.5. (Erdős–Kac) For each n ≥ 1, let Xn be uniformly distributed on the set {1, · · · , n}, and define φ(m) to be the number of distinct prime divisors of the integer m, e.g., φ(2) = 1, φ(3) = 1, φ(4) = 1, φ(5) = 1, φ(6) = 2, etc. Then, for every real number x,

limn→∞ P( (φ(Xn) − log log(n)) / √(log log(n)) ≤ x ) = Φ(x).
In other words, if an integer is chosen at random between 1 and n, then for large n the number
of prime divisors of that integer is approximately normally distributed with mean and variance
both equal to log log(n).
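The Erdős–Kac theorem can be explored numerically, although the log log(n) normalization means convergence is very slow, so only rough agreement should be expected at computationally convenient n. The sketch below (an illustration added to the notes; the choices n = 10^6 and 20,000 samples are arbitrary) standardizes the prime-divisor counts and reports their mean and standard deviation, which should be roughly 0 and 1.

    import numpy as np

    def num_prime_divisors(m):
        """Number of distinct prime divisors of m (the phi(m) of the theorem)."""
        count, d = 0, 2
        while d * d <= m:
            if m % d == 0:
                count += 1
                while m % d == 0:
                    m //= d
            d += 1
        if m > 1:
            count += 1
        return count

    rng = np.random.default_rng(4)
    n = 10**6
    sample = rng.integers(1, n + 1, size=20_000)   # uniform on {1, ..., n}
    counts = np.array([num_prime_divisors(int(k)) for k in sample], dtype=float)
    loglog = np.log(np.log(n))
    z = (counts - loglog) / np.sqrt(loglog)
    print(z.mean(), z.std())   # only roughly 0 and 1; convergence is slow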
5.7 Multivariate Normal Distributions
It can be shown that the central limit theorem extends to sums of independent, identically-distributed random vectors which, when suitably normalized, converge in distribution to a multivariate normal distribution. There are several equivalent definitions of the MVN distribution, but here we adopt a slightly restricted version based on the density function.
Definition 5.13. A random vector X = (X1, · · · , Xn) with values in R^n is said to have a multivariate normal distribution if there is a vector µ = (µ1, · · · , µn) and a symmetric positive-definite n × n matrix Σ = (σij) with determinant |Σ| > 0 such that X has density

f(x) = (1/√((2π)^n |Σ|)) exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) ).

In this case, we write X ∼ N(µ, Σ) and say that µ is the mean vector and Σ is the covariance matrix of X.
Random vectors that satisfy the conditions of the preceding definition are said to have a non-degenerate MVN distribution. Degenerate MVN distributions can also be defined by allowing the covariance matrix to have rank less than n; in this case, X is supported on a proper affine subspace of R^n. Because degenerate MVN distributions do not have a density on R^n, the following property, which we state as a theorem, is sometimes taken as a more general definition of the MVN distribution that encompasses both the degenerate and non-degenerate cases.
Theorem 5.6. Suppose that X ∼ N (µ, Σ) is an n-dimensional MVN random vector and let
c = (c1 , · · · , cn ) ∈ Rn . Then Y = c · X = c1 X1 + · · · + cn Xn is normally distributed with
E[Y] = Σ_{i=1}^n ci µi   and   Var(Y) = Σ_{i=1}^n ci² σii + 2 Σ_{i<j} ci cj σij.
In particular, E[Xi ] = µi , Var(Xi ) = σii and cov(Xi , Xj ) = σij .
Proof. By subtracting the mean vector if necessary, we can assume that X ∼ N (0, Σ), where
0 = (0, · · · , 0) ∈ Rn . Since Σ is a symmetric positive-definite matrix, it can be shown that there
is an n × n matrix A = (aij ) with inverse B = A−1 such that Σ = AAT and Σ−1 = B T B. Notice
that the following relationships exist between the entries of Σ and those of A:
σij = Σ_{k=1}^n aik ajk.

Furthermore, A can be chosen so that |A| = |Σ|^(1/2) > 0.
We will show that Y has the stated normal distribution by calculating its moment generating
function, which we can do by evaluating the following n-dimensional integral
ψY(t) = E[e^(tY)] = ∫_{R^n} e^(tc·x) (1/√((2π)^n |Σ|)) exp( −(1/2) x^T Σ^(−1) x ) dx.
Our first step is to make the change of variables x = Ay. Since dx = |A|dy, this gives
ψY(t) = (1/√((2π)^n |Σ|)) ∫_{R^n} e^(tc·Ay) e^(−(1/2) y^T A^T B^T B A y) |A| dy

      = (1/√((2π)^n)) ∫_{R^n} e^(tc·Ay − |y|²/2) dy,
where we have used the following identities: AB = I = B^T A^T and y^T y = |y|² = Σ_{j=1}^n yj². Furthermore, since

tc · Ay = t Σ_{i=1}^n ci (Ay)i = t Σ_{i=1}^n Σ_{j=1}^n ci aij yj = t Σ_{j=1}^n γj yj,   where γj = Σ_{i=1}^n ci aij,

the exponential function inside the integral can be factored into a product of exponentials:

e^(tc·Ay − |y|²/2) = Π_{j=1}^n e^(t γj yj − yj²/2).
This allows us to write the n-dimensional integral as a product of one-dimensional integrals, which can each be evaluated using the moment generating function of the normal distribution:

ψY(t) = Π_{j=1}^n (1/√(2π)) ∫_{−∞}^{∞} e^(t γj yj − yj²/2) dyj = Π_{j=1}^n e^(γj² t²/2) = e^(γ² t²/2),
where

γ² = Σ_{j=1}^n γj² = Σ_{j=1}^n ( Σ_{i=1}^n ci aij )² = Σ_{i=1}^n ci² Σ_{k=1}^n aik² + 2 Σ_{i<j} ci cj Σ_{k=1}^n aik ajk = Σ_{i=1}^n ci² σii + 2 Σ_{i<j} ci cj σij.

This shows that Y is normally distributed with mean 0 and variance γ².
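The factorization Σ = AA^T used in the proof also gives a practical recipe for sampling from a multivariate normal distribution: if Z has independent standard normal coordinates, then µ + AZ ∼ N(µ, Σ). The Python sketch below (an added illustration with arbitrarily chosen µ, Σ, and c) samples X this way, taking A to be the Cholesky factor of Σ, and checks the mean and variance of Y = c · X against the formulas of Theorem 5.6.

    import numpy as np

    rng = np.random.default_rng(5)
    mu = np.array([1.0, -2.0, 0.5])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])
    c = np.array([1.0, 2.0, -1.0])

    A = np.linalg.cholesky(Sigma)          # lower-triangular A with A A^T = Sigma
    Z = rng.standard_normal((200_000, 3))  # independent standard normal coordinates
    X = mu + Z @ A.T                       # rows of X are samples from N(mu, Sigma)
    Y = X @ c                              # the linear combination c . X

    print(Y.mean(), c @ mu)                # sample mean vs c . mu
    print(Y.var(ddof=1), c @ Sigma @ c)    # sample variance vs c^T Sigma c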
In fact, it can be shown that any affine transformation of an MVN-distributed random vector is also MVN.

Theorem 5.7. Suppose that X ∼ N(µ, Σ) takes values in R^n and let Y = AX + b, where A is an m × n matrix of full rank m (with m ≤ n) and b ∈ R^m. Then Y is an m-dimensional MVN random vector with distribution N(Aµ + b, AΣA^T).
This result is important in its own right, but we will use it in the proof of the following theorem.
Theorem 5.8. Suppose that X1 , · · · , Xn are independent normal random variables with mean µ
and variance σ 2 and let X̄ = (X1 + · · · + Xn )/n be their sample mean. Then X̄ is independent
of the random variables X1 − X̄, · · · , Xn − X̄.
Proof. The trick is to notice that the random vector Y = (X̄, X2 − X̄, · · · , Xn − X̄) can be
written as a linear transformation of the random vector X = (X1, · · · , Xn), i.e., Y = AX, where

A =
[  1/n       1/n      · · ·     1/n     ]
[ −1/n     1 − 1/n    · · ·    −1/n     ]
[   ⋮          ⋮         ⋱        ⋮      ]
[ −1/n      −1/n      · · ·   1 − 1/n   ]
Since X is MVN with mean vector (µ, · · · , µ) and covariance matrix σ 2 In×n , it follows that Y
is also MVN with mean vector (µ, 0, · · · , 0). To find the covariance matrix of Y , we recall that
X̄ has variance σ 2 /n. Also, since
cov(Xi, X̄) = cov(Xi, Xi/n) = σ²/n,
it follows that the variance of Xi − X̄ is
Var(Xi − X̄) = Var(Xi) + Var(X̄) − 2 cov(Xi, X̄) = ((n − 1)/n) σ².
Similarly,
cov(X̄, Xi − X̄) = cov(X̄, Xi) − Var(X̄) = 0,
while
cov(Xi − X̄, Xj − X̄) = cov(Xi, Xj) − cov(Xi, X̄) − cov(Xj, X̄) + Var(X̄) = −σ²/n.
From these results it can be shown that the joint density of Y can be written as

fY(y1, · · · , yn) = C exp( −(y1 − µ)²/(2σ²/n) ) exp( −(1/(2σ²)) [ (Σ_{i=2}^n yi)² + Σ_{i=2}^n yi² ] ),
where C is a normalizing constant. However, since the joint density factors into a product of terms that depend either only on y1 = x̄ or only on (y2, · · · , yn) = (x2 − x̄, · · · , xn − x̄), it follows that X̄ is independent of (X2 − X̄, · · · , Xn − X̄). Furthermore, since Σ_{i=1}^n (Xi − X̄) = 0, we see that

X1 − X̄ = − Σ_{i=2}^n (Xi − X̄),

which shows that X1 − X̄ is a function of the vector (X2 − X̄, · · · , Xn − X̄). Consequently, X̄ is also independent of (X1 − X̄, · · · , Xn − X̄).
Since the sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²

is a function of the vector (X1 − X̄, · · · , Xn − X̄), Theorem 5.8 has the following important corollary.
Corollary 5.4. If X1 , · · · , Xn are i.i.d. normal random variables, then the sample mean X̄ is
independent of the sample variance S 2 .
In other words, provided that the data is normally distributed and the measurements are independent, the error in our estimate of the mean is independent of the error in our estimate of the
variance.
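A quick simulation (added here; the sample size n = 5 and the comparison with exponential data are arbitrary choices) makes the corollary plausible: for normal data the sample correlation between X̄ and S² is essentially zero, while for skewed data it is clearly not. Zero correlation is only a necessary consequence of independence, so this is a sanity check rather than a proof.

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps = 5, 200_000

    # Normal data: correlation between the sample mean and sample variance is ~ 0.
    X = rng.normal(0.0, 1.0, size=(reps, n))
    print(np.corrcoef(X.mean(axis=1), X.var(axis=1, ddof=1))[0, 1])

    # Exponential (skewed) data: the two statistics are noticeably correlated.
    Y = rng.exponential(1.0, size=(reps, n))
    print(np.corrcoef(Y.mean(axis=1), Y.var(axis=1, ddof=1))[0, 1])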
5.8 Sample Statistics

5.8.1 The χ² Distribution
Suppose that we perform a series of i.i.d. trials that result in measurements X1 , · · · , Xn and we
let X̄ and S 2 denote the sample mean and sample variance:
X̄ = (1/n) Σ_{i=1}^n Xi   and   S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².
If we assume that each of the variables Xi is normally distributed with unknown mean µ and
unknown variance σ 2 , then X̄ is also normally distributed with mean µ and variance σ 2 /n. The
sample variance, however, is not normally distributed.
To identify the distribution of S 2 , we first observe that if Z is a standard normal random variable,
then by the change-of-variables formula (cf. Example 5.8), the density of Z 2 on (0, ∞) is
fZ²(x) = 2 · (1/(2√x)) · (1/√(2π)) e^(−x/2) = (1/√(2π)) x^(1/2 − 1) e^(−x/2).

However, this is just the density of the gamma distribution with shape parameter α = 1/2 and rate parameter λ = 1/2, which is also known as the χ²-distribution with one degree of freedom and is sometimes abbreviated χ²_1. As an aside, since we can also write this density as
fZ²(x) = ((1/2)^(1/2)/Γ(1/2)) x^(1/2 − 1) e^(−x/2),

this calculation leads to the identity Γ(1/2) = √π.
The moment generating function of the χ²_1 distribution is

mZ²(t) ≡ E[e^(tZ²)] = ∫_0^∞ e^(tx) (1/√(2π)) x^(−1/2) e^(−x/2) dx
       = (1/√(2π)) ∫_0^∞ x^(−1/2) e^(−(1/2 − t)x) dx
       = (1/√(2π)) · 2 ∫_0^∞ e^(−(1 − 2t)y²/2) dy        (substituting x = y²)
       = (1/√(2π)) · √(2π)/√(1 − 2t) = (1 − 2t)^(−1/2),

provided t < 1/2.
Now suppose that Z1, · · · , Zn are independent standard normal random variables and let R = Z1² + · · · + Zn² be the sum of the squares. Our goal is to identify the distribution of R. Because the Zi are independent, we know that the moment generating function of their sum is equal to the product of the moment generating functions of the individual terms, i.e.,

mR(t) = E[e^(tR)] = (1 − 2t)^(−n/2).
I claim that this is also the moment generating function of a Gamma distribution with shape
parameter α = n/2 and rate parameter λ = 1/2. Indeed, if Y is such a random variable, then
mY(t) = E[e^(tY)] = ∫_0^∞ e^(ty) ((1/2)^(n/2)/Γ(n/2)) y^(n/2 − 1) e^(−y/2) dy
       = (1/2)^(n/2) (1/Γ(n/2)) ∫_0^∞ y^(n/2 − 1) e^(−(1/2 − t)y) dy
       = (1/2)^(n/2) (1/2 − t)^(−n/2) (1/Γ(n/2)) ∫_0^∞ x^(n/2 − 1) e^(−x) dx        (substituting x = (1/2 − t)y)
       = (1 − 2t)^(−n/2) (1/Γ(n/2)) Γ(n/2) = (1 − 2t)^(−n/2),
as claimed. This shows that Y and R have the same distribution. In practice, R is said to have the χ² distribution with n degrees of freedom, which is written R ∼ χ²_n. The relevance of this distribution to the sample variance is explained in the following proposition.
Proposition 5.10. If the trial outcomes X1, · · · , Xn are i.i.d. normal random variables, then the normalized sample variance (n − 1)S²/σ² is χ²-distributed with n − 1 degrees of freedom.
Proof. In Example 5.16, we showed that the sample variance can be rewritten in the following form:

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − µ)² − (n/(n − 1)) (X̄ − µ)².

If we then rearrange these terms and divide each one by σ², we arrive at the following expression:

(n − 1)S²/σ² + (X̄ − µ)²/(σ²/n) = Σ_{i=1}^n (Xi − µ)²/σ².

Now we know that the right-hand side is χ²-distributed with n degrees of freedom, while the second term on the left-hand side is also χ²-distributed but with only one degree of freedom. Since the two terms on the left-hand side are independent by Corollary 5.4, we can use moment generating functions to show that the first term on the left-hand side is χ²-distributed with n − 1 degrees of freedom.
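The proposition is easy to check by simulation; in the Python sketch below (an added illustration with the arbitrary choices n = 8, µ = 10, σ = 2), the empirical distribution function of (n − 1)S²/σ² is compared with the χ² distribution function with n − 1 degrees of freedom at a few points.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(7)
    n, mu, sigma, reps = 8, 10.0, 2.0, 200_000
    X = rng.normal(mu, sigma, size=(reps, n))
    W = (n - 1) * X.var(axis=1, ddof=1) / sigma**2   # normalized sample variances
    for w in [2.0, 7.0, 12.0]:
        print(w, (W <= w).mean(), chi2.cdf(w, df=n - 1))   # empirical vs chi^2_{n-1} CDF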
Example 5.20. Suppose that 16 individuals are sampled at random from a population in which height is normally distributed with unknown mean µ and variance σ². We know that the sample variance S² is an unbiased estimator of σ², but we might also be interested in calculating the probability that S² is unusually small relative to σ², e.g., we would like to calculate P(S² ≤ 0.25σ²). Since 15S²/σ² ∼ χ²_15, this can be done by evaluating the following probability,

P(S² ≤ σ²/4) = P(15S²/σ² ≤ 15/4) = P(χ²_15 ≤ 3.75) ≈ 0.0016,

which was done with the help of the Matlab function chi2cdf(x, df).
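For readers working in Python rather than Matlab, the same probability can be evaluated with scipy (this is an equivalent call, not the one used in the notes):

    from scipy.stats import chi2

    print(chi2.cdf(3.75, df=15))   # P(chi^2_15 <= 3.75), approximately 0.0016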
5.8.2 Student's t Distribution
When the measurements X1, · · · , Xn are normally distributed, we know that the Z-score

Z = (X̄ − µ)/(σ/n^(1/2))

is also normally distributed. However, this observation is only useful in practice if we know the actual value of the variance σ². If not, then we can replace σ in this expression by the square root S of the sample variance, but then the distribution of the statistic

T = (X̄ − µ)/(S/n^(1/2))

is no longer normal. Instead, T is said to have Student's t-distribution with n − 1 degrees of freedom. This distribution is characterized in the next proposition.
Proposition 5.11. Let Z be a standard normal random variable and let V be an independent χ²-distributed random variable with n degrees of freedom. Then the test statistic

T = Z/√(V/n)

has Student's t distribution with n degrees of freedom, and the density of T on (−∞, ∞) is

fT(t) = Γ((n + 1)/2) / (√(nπ) Γ(n/2)) · (1 + t²/n)^(−(n+1)/2).
Proof. The density of T can be found by marginalizing over the joint density of T and V,

pT(t) = ∫_0^∞ pT,V(t, v) dv,

and the joint density can be found with the help of the product rule:

pT,V(t, v) = pT|V(t|v) pV(v).

The first term on the right-hand side can be found by noting that, conditional on V = v, T is normally distributed with mean 0 and variance n/v, i.e.,

pT|V(t|v) = (√v/√(2πn)) e^(−vt²/(2n)).

Also, since V is χ²-distributed with n degrees of freedom, it follows that the density of V is that of the gamma distribution with shape parameter n/2 and rate parameter 1/2:

pV(v) = ((1/2)^(n/2)/Γ(n/2)) v^(n/2 − 1) e^(−v/2).

Multiplying these gives

pT,V(t, v) = (1/√(2πn)) ((1/2)^(n/2)/Γ(n/2)) v^((n+1)/2 − 1) e^(−(1/2 + t²/(2n)) v),

and marginalizing over v then gives

pT(t) = Γ((n + 1)/2) / (√(nπ) Γ(n/2)) · (1 + t²/n)^(−(n+1)/2).
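As with the χ² results above, this can be checked numerically. The sketch below (an added illustration; the values n = 6, µ = 3, σ = 2 are arbitrary) simulates the statistic T = (X̄ − µ)/(S/√n) from normal samples and compares its empirical distribution function with that of Student's t distribution with n − 1 degrees of freedom.

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(8)
    n, mu, sigma, reps = 6, 3.0, 2.0, 200_000
    X = rng.normal(mu, sigma, size=(reps, n))
    T = (X.mean(axis=1) - mu) / (X.std(axis=1, ddof=1) / np.sqrt(n))
    for x in [-2.0, 0.0, 1.5]:
        print(x, (T <= x).mean(), t.cdf(x, df=n - 1))   # empirical vs t_{n-1} CDF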