36-217 Probability Theory and Random Processes
Lecture Notes
Mattia Ciollaro
June 24, 2015
NOTICE: text in red means that a change in the text has been
made during or after class (possibly because of a typo).
Lecture 1
Populations, Statistics and Random Processes
Broadly speaking, in Statistics we try to infer, learn, or estimate some features or parameters of a ‘population’ using observed data sampled from that population.
What do we mean by ‘population’? We can think of a population as
an entity which encompasses all possible data that can be observed in an
experiment. This can be a tangible collection of objects or a more abstract
entity.
Examples of tangible populations are:
• the collection of all pennies in the USA
• all human beings born in the year 2000.
Most often, we perform an experiment because we are interested in learning about a particular feature or parameter of a population. Examples of parameters related to the above populations are:
• the mean weight of all the pennies in the USA
• the proportion of people born in the year 2000 that are taller than
5.25 feet today
• the ‘variability’ in the weight of all the pennies in the USA.
In order to learn such features we usually proceed as follows:
1. specify a statistical model for the unknown true population (e.g. specify a class of functions that approximate well the unknown true distribution of the weight of all the pennies in the USA)
2. collect data X1 , . . . , Xn , possibly by performing a particular experiment (e.g. sample n pennies and measure their weight)
3. compute a statistic, i.e. a function of the data (and of the data exclusively!), to estimate the parameter of interest (e.g. compute the
average weight of the n sampled pennies)
4. perform further statistical analyses (see, for instance, 36-226!).
Probability Theory provides the rigorous mathematical framework needed
in the above process (and much more...). Probability Theory can be studied
at very different levels, depending on one’s purpose and needs. At a purely
mathematical level, Probability Theory is an application of Measure Theory to real-world problems. At a more introductory and conceptual level,
Probability Theory is a set of mathematical tools that allows us to carry
out and analyze the process described above in a scientific way (especially
for statisticians, computer scientists, scientists and friends).
As an example, some frequently used statistics or estimators are

• the sample mean X̄ = (1/n) ∑_{i=1}^n Xi
• the sample variance S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)².
In particular, they estimate the mean µ of the distribution of interest (which is a measure of central tendency) and its variance σ² (which is a measure
of variability/dispersion). With respect to the examples above, we can use
the sample mean to estimate the mean weight µ of all the pennies in the
USA by collecting n pennies, measuring their weights X1 , . . . , Xn , and then
computing X̄. We expect that, as n gets larger, X̄ ≈ µ. If we are also interested in estimating the variance σ² of the weight of the pennies in the USA or its standard deviation σ = √σ², we might compute the sample variance S² or the sample standard deviation S = √S² (note that, as opposed to the sample variance, the sample standard deviation has the same unit of measure as the data X1, . . . , Xn). Again, as n gets larger, we expect that S² ≈ σ².
We can also use the sample mean to estimate the proportion of people born in the year 2000 who are taller than 5.25 feet today: first we measure the heights X1, . . . , Xn of n persons born in the year 2000, then we introduce an indicator variable

Zi = 1_{(5.25,∞)}(Xi) = { 1 if Xi > 5.25; 0 if Xi ≤ 5.25 }   (1)

and we finally compute Z̄ = (1/n) ∑_{i=1}^n Zi.
Again, as an example, we might imagine that a given probability distribution, for instance the Normal distribution, might be a good model to describe the weight of the pennies in the USA. The Normal distribution is completely characterized by two parameters: its mean µ and its variance σ².
How would you plan an experiment to estimate µ and σ² if you believe that the Normal distribution N(µ, σ²) is a good model for the weight of all the pennies in the USA?
We should always keep in mind that there is a difference between models and the truth. While the Normal distribution might provide a good model for the distribution of the weight of the pennies in the USA, the true distribution of the weight of the pennies in the USA might not be Normal (and most likely it is not Normal, in fact).
TRUTH → MODELS → STATISTICS
Why is the Normal distribution, for any µ and σ², most likely not the true distribution of the weight of the pennies in the USA?
There are situations in which the quantity that we want to study evolves
with time (or with space, or with respect to some other dimension). For
example, suppose that you are interested in the number of people walking
into a store on a given day. We can model that quantity as a random
quantity Xt varying with respect to time t. Most often there exists some kind of ‘dependence’ between the number of people in the store at time t and the number of people in the store at time t′ (especially if |t − t′| is small).
There is a branch of Probability Theory that is devoted to studying and modeling this type of problem. We usually refer to the collection {Xt : t ∈ T } as
a ‘random process’ or as a ‘stochastic process’. The set T can represent a
time interval, a spatial region, or some more elaborate domain.
Another example of a random process is observing the location of robberies occurring in a given city.
Another typical example of a random process is the evolution of a stock
price in Wall Street as a (random) function of time.
We will devote part of this course to studying (at an introductory level) some of the theory related to random processes.
Sample Spaces and Set Theory
What do we mean by probability? There are two major interpretations:
• objective probability: the long-run frequency of the occurrence of a
given event (e.g. seeing heads in a coin flip)
• subjective probability: a subjective belief in the rate of occurrence of
a given event.
Methods based on objective probability are usually called frequentist or
classical whereas those based on subjective probability are usually called
Bayesian, as they are associated with the Bayesian paradigm. In this class, we focus on frequentist methods (but we will discuss the basic idea behind Bayesian Statistics later in the course).
In Probability Theory, the notion of ‘event’ is usually described in terms of a set (of outcomes). Therefore, before beginning our discussion of Probability Theory, it is worthwhile to review some basic facts about set theory.
Throughout the course we will adopt the convention that Ω denotes the
universal set (the superset of all sets) and ∅ denotes the empty set (i.e. a
set that does not contain any element). Recall that a set A is a subset of
another set B (or A is contained in B) if for any element x of A we have
x ∈ A =⇒ x ∈ B. We denote the relation A is a subset of B by A ⊂ B
(similarly, A ⊃ B means A is a superset of B or A contains B). Clearly,
if A ⊂ B and B ⊂ A, then A = B. Let ∧ and ∨ stand for ‘and’ and ‘or’
respectively. Given two subsets A and B of Ω, recall the following basic set
operations:
• union of sets: A ∪ B = {x ∈ Ω : x ∈ A ∨ x ∈ B}
• intersection of sets: A ∩ B = {x ∈ Ω : x ∈ A ∧ x ∈ B}
• complement of a set: A^c = {x ∈ Ω : x ∉ A}
• set difference: A \ B = {x ∈ Ω : x ∈ A ∧ x ∉ B}
• symmetric set difference A∆B = (A \ B) ∪ (B \ A) = (A ∪ B) \ (A ∩ B).
VENN’S DIAGRAMS
Notice that, for any set A ⊂ Ω, A ∪ Ac = Ω and A ∩ Ac = ∅. Two sets
A, B ⊂ Ω for which A ∩ B = ∅ are said to be disjoint or mutually exclusive.
Also, recall the distributive property for unions and intersections. For
A, B, C ⊂ Ω, we have
• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
Finally, we have De Morgan’s Laws:
• (A ∪ B)c = Ac ∩ B c
• (A ∩ B)c = Ac ∪ B c .
These can be extended to the union and the intersection of any n > 0 sets:
• (∪_{i=1}^n Ai)^c = ∩_{i=1}^n Ai^c
• (∩_{i=1}^n Ai)^c = ∪_{i=1}^n Ai^c.
ILLUSTRATION WITH VENN’S DIAGRAMS
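These identities are easy to check computationally. The sketch below (with arbitrary, made-up sets A, B, C) verifies De Morgan’s Laws, the distributive properties, and the symmetric-difference identity using Python’s built-in set type.

```python
# A quick check of the set identities above on a small universal set Omega.
Omega = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {2, 4, 6, 8}

def complement(S):
    return Omega - S

# De Morgan's Laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Distributive properties
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)

# Symmetric difference: (A \ B) U (B \ A) = (A U B) \ (A intersect B)
assert (A - B) | (B - A) == (A | B) - (A & B) == A ^ B
print("all identities hold")
```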
We used before the word ‘experiment’ to describe the process of collecting data or observations. An experiment is, broadly speaking, any process by which an observation is made. This can be done actively, if you have control over the apparatus that collects the data, or passively, if you only get to see the data but have no control over the apparatus that collects them.
An experiment generates observations or outcomes. Consider the following
example: you toss a fair coin twice (your experiment). The possible outcomes (simple events) of the experiment are HH, HT, TH, TT (H: heads,
T: tails). The collection of all possible outcomes of an experiment forms the
‘sample space’ of that experiment. In this case, the sample space is the set
Ω = {HH, HT, T H, T T }. Any subset of the sample space of an experiment
is called an event. For instance, the event ‘you observe at least one tails in
your experiment’ is the set A = {HT, T H, T T } ⊂ Ω.
Homework 1 (due May 19th)
1. (25 pts) Consider the Normal distribution N(µ, σ²) that we mentioned in class and a sample X1, . . . , Xn of size n obtained from this distribution. Which of the following are statistics and why?
(a) X̄
(b) √n (X̄ − µ)/S
(c) ∏_{i=1}^n Xi
(d) max{X1 , . . . , Xn , σ 2 }
(e) 1{X3 >0} (X3 )
Solution:
(a), (c), and (e) are statistics since they are functions depending only on the data. (b) and (d) are not statistics since they also depend on the parameters µ and σ².
2. (25 pts) Show that, for any A, B1, . . . , Bn ⊂ Ω with ∪_{i=1}^n Bi = Ω, we have A = (A ∩ B1) ∪ · · · ∪ (A ∩ Bn).
Solution: We have A = A ∩ Ω and Ω = ∪_{i=1}^n Bi. Thus,

A = A ∩ Ω = A ∩ (∪_{i=1}^n Bi).

Using the distributive property of intersection over union, we get

A = ∪_{i=1}^n (A ∩ Bi).
3. (25 pts) Consider the following experiments:
• you have three keys and exactly one of them opens the door
that you want to open; at each attempt to open the door, you
pick one of the keys completely at random and try to open the
door (at each attempt you pick one of the three keys at random,
regardless of the fact that you may already have tried some in
previous attempts). You count the number of attempts needed
to open the door. What is the sample space for this experiment?
• you have an urn with 1 red ball, 1 blue ball, and 1 white ball. Two times, you draw a ball from the urn, look at its color, take note of the color on a piece of paper, and put the ball back in the urn. What is the sample space for this experiment? Write out the set corresponding to the event ‘you observe at least 1 white ball’.
Solution:
• The sample space is the set of positive integers Ω = {1, 2, 3, 4, 5, . . . }.
• Let R: red, B: blue, and W: white. The sample space is the set
of all possible pairs of observed colors
Ω = {RR, RB, RW, BR, BB, BW, W R, W B, W W }.
Let A be the event ‘you observe at least 1 white ball’. Then
A = {RW, BW, W R, W B, W W }.
4. (25 pts) Think of an example of a random process and describe what
the quantity of interest Xt is and what the variable t represents in
your example (time, space, . . . ?)
5. (EXTRA CREDIT, 10 pts) Show that the expression of the sample variance

S² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄)²   (2)

is equivalent to

S² = (1/(n − 1)) [∑_{i=1}^n Xi² − (1/n) (∑_{i=1}^n Xi)²].   (3)
Solution:
We have

(n − 1)S² = ∑_{i=1}^n (Xi − X̄)² = ∑_{i=1}^n (Xi² − 2XiX̄ + X̄²)
          = ∑_{i=1}^n Xi² − 2X̄ ∑_{i=1}^n Xi + nX̄² = ∑_{i=1}^n Xi² − 2nX̄² + nX̄²
          = ∑_{i=1}^n Xi² − nX̄² = ∑_{i=1}^n Xi² − n (1/n²) (∑_{i=1}^n Xi)².

Thus,

S² = (1/(n − 1)) [∑_{i=1}^n Xi² − (1/n) (∑_{i=1}^n Xi)²].
Lecture 2
Recommended readings: WMS, sections 2.1 → 2.6
Unconditional Probability
Consider again the example from the previous lecture. We toss a fair coin
twice. The sample space for the experiment is Ω = {HH, HT, T H, T T }.
If we were to repeat the experiment over and over again, what
is the frequency of the occurrence of the sequence HT ? What
about the other sequences? What about the frequency of {HT } ∪
{T T }?
To make the intuition a little more formal, we can say that a probability
measure on Ω is a set function P satisfying the following axioms:
• for any A ⊂ Ω, P (A) ∈ [0, 1]
• P (Ω) = 1
• for any countable collection of disjoint events {Ai}_{i=1}^∞ ⊂ Ω, P(∪_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).
An interesting question to ask is what the probability of the union of two
(possibly non-disjoint) events is. That is, given A, B ⊂ Ω, what is P (A∪B)?
Here is the answer. Notice first that A = (A ∩ B) ∪ (A ∩ B c ) and the two sets
are disjoint (why?). Using the third axiom of probability (the countable
additivity axiom), we have
P (A) = P (A ∩ B) + P (A ∩ B c )
P (B) = P (A ∩ B) + P (Ac ∩ B)
(4)
Now, A ∪ B = (A ∩ B c ) ∪ (Ac ∩ B) ∪ (A ∩ B) and again these sets are all
disjoint. Putting all together we get
P (A ∪ B) = P (A) − P (A ∩ B) + P (B) − P (A ∩ B) + P (A ∩ B) =
= P (A) + P (B) − P (A ∩ B)
(5)
How does this formula read when A and B are disjoint? Why
is P (A ∩ B) = 0 when A and B are disjoint?
This can be extended to more than 2 events. For instance, given A, B, C ⊂
Ω, we have
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C)
+ P (A ∩ B ∩ C)
(6)
and, in general, for A1, . . . , An ⊂ Ω we have the so-called inclusion-exclusion formula (note the alternating signs of the summands):

P(∪_{i=1}^n Ai) = ∑_{i=1}^n P(Ai) − ∑_{1≤i<j≤n} P(Ai ∩ Aj) + ∑_{1≤i<j<k≤n} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n−1} P(A1 ∩ · · · ∩ An).   (7)
Notice that in general we have the following ‘union bound’ for any A1, . . . , An ⊂ Ω:

P(∪_{i=1}^n Ai) ≤ ∑_{i=1}^n P(Ai).   (8)
For a given experiment with m possible outcomes, how can we compute
the probability of an event of interest? We can always do the following:
1. define the experiment and describe its simple events Ei , i ∈ {1, . . . , m}
2. define reasonable probabilities on the simple events, P (Ei ), i ∈ {1, . . . , m}
3. define the event of interest A in terms of the simple events Ei
4. compute P(A) = ∑_{i : Ei ∈ A} P(Ei).
Here is an example (and we will describe it in terms of the above scheme).
There are 5 candidates for two identical job positions: 3 females and 2 males.
What is the probability that a completely random selection process will
appear discriminatory? (i.e. exactly 2 males or exactly 2 females are chosen
for these job positions) We can approach this problem in the following way.
1. we introduce 5 labels for each of the 5 candidates: M1 , M2 , F1 , F2 ,
F3 . The sample space is then
Ω ={M1 M2 , M1 F1 , M1 F2 , M1 F3 , M2 M1 , M2 F1 , M2 F2 , M2 F3 , F1 M1 , F1 M2 ,
F1 F2 , F1 F3 , F2 M1 , F2 M2 , F2 F1 , F2 F3 , F3 M1 , F3 M2 , F3 F1 , F3 F2 }
2. because the selection process is completely random, each of the simple
events of the sample space is equally likely. Therefore the probability
of any simple event is just 1/|Ω|
3. the event of interest is A = {M1 M2 , M2 M1 , F1 F2 , F1 F3 , F2 F1 , F2 F3 , F3 F1 , F3 F2 }
4. P (A) = P (M1 M2 ) + P (M2 M1 ) + · · · + P (F3 F2 ) = |A|/|Ω| = 8/20 =
2/5 = 0.4.
Idea: if the simple events are equally likely to occur, then the
probability of a composite event A is just P (A) = |A|/|Ω|.
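The recipe above can be carried out by brute-force enumeration. The following sketch redoes the hiring example: it lists Ω with itertools and counts the outcomes that make up A.

```python
from itertools import permutations

# Enumerate all ordered selections of 2 candidates out of 5 and count the
# outcomes in which both selected candidates have the same sex.
candidates = ["M1", "M2", "F1", "F2", "F3"]
Omega = list(permutations(candidates, 2))        # |Omega| = 5 * 4 = 20
A = [w for w in Omega if w[0][0] == w[1][0]]     # both 'M' or both 'F'
print(len(A), len(Omega), len(A) / len(Omega))   # 8 20 0.4
```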
Questions to ask when you define the sample space and the probabilities
on the simple events:
• is the sampling done with or without replacement?
• does the order of the labels matter?
• can we efficiently compute the size of the sample space, |Ω|?
Basic techniques from combinatorial analysis come in handy for this type of question when the simple events forming the sample space are equally likely to occur. Let’s take a closer look at some relevant cases.
Sampling with replacement when order matters
Suppose you have a group of m elements a1, . . . , am and another group of n elements b1, . . . , bn. You can form mn pairs containing one element from each
group. Of course you can easily extend this reasoning to more than just two
groups. This is a useful fact when we are sampling with replacement and
the order matters.
Consider the following example. You toss a six-sided die twice. You have
m = 6 simple outcomes on the first roll and n = 6 possible outcomes in the
second roll. The size of the sample space consisting of all possible pairs of
the type (i, j) for i, j ∈ {1, . . . , 6} is |Ω| = mn = 62 = 36. Is the experiment
performed with replacement? Yes, if the first roll is a 2, nothing precludes
the fact that the second roll is a 2 again. Does the order matter? Yes, the
pair (2,5) corresponding to a 2 on the first roll and a 5 on the second roll is
not equal to the pair (5,2) corresponding to a 5 on the first roll and a 2 on
the second roll.
Sampling without replacement when order matters
An ordered arrangement of r distinct elements is called a permutation. The number of ways of ordering n distinct elements taken r at a time is denoted P^n_r, where

P^n_r = n(n − 1)(n − 2) · · · (n − r + 1) = n!/(n − r)!.   (9)
Consider the following example. There are 30 members in a student organization and they need to choose a president, a vice-president, and a treasurer. In how many ways can this be done? The number of distinct elements (persons) is n = 30 and the number of persons to be appointed is r = 3. The sampling is done without replacement (a president is chosen out of 30 people, then among the 29 people left a vice-president is chosen, and finally among the 28 people left a treasurer is chosen). Does the order matter? Yes, the president is elected first, then the vice-president, and finally the treasurer (think about ‘ranking’ the 30 candidates). The number of ways in which the three positions can be assigned is therefore P^30_3 = 30!/(30 − 3)! = 30 ∗ 29 ∗ 28 = 24360.
Sampling without replacement when order does not matter
The number of combinations of n elements taken r at a time corresponds to the number of subsets of size r that can be formed from the n objects. This is denoted C^n_r, where

C^n_r = P^n_r / r! = n! / ((n − r)! r!).   (10)
Here is an example. How many different subsets of three people can become officers of the organization, if chosen randomly? Order doesn’t matter here (we are not interested in the exact appointments for a given candidate). The answer is therefore C^30_3 = 30!/(27! ∗ 3!) = (30 ∗ 29 ∗ 28)/6 = 4060.
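For reference, both counts can be reproduced with the Python standard library:

```python
from math import comb, perm

# The counts from the two examples above.
print(perm(30, 3))   # ordered: president, vice-president, treasurer -> 24360
print(comb(30, 3))   # unordered: subsets of 3 officers -> 4060
```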
The binomial coefficient C^n_r can be extended to the multinomial case in a straightforward way. Suppose that we want to partition n distinct elements into k distinct groups of sizes n1, . . . , nk in such a way that each of the n elements appears in exactly one of the groups. This can be done in

(n choose n1, . . . , nk) = n! / (n1! n2! · · · nk!)   (11)

possible ways.
Exercises in class:
1. The final exam of 36-217 has 5 multiple choice questions. Questions 1 to 3 and question 5 each have 3 possible choices. Question 4 has 3 possible choices. Suppose that you haven’t studied at all for the exam and your answers on the final exam are given completely at random. What is the probability that you will get a full score on the exam?
2. You have a string of 10 letters. How many substrings of length 7 can
you form based on the original string?
3. Google is looking to hire 3 software engineers and asked CMU for potential candidates. CMU advised Google to hire 3 students from the summer 36-217 class. Because the positions must be filled as quickly
as possible, Google decided to skip the interview process and hire 3
of you completely at random. What is the probability that *you* will
become a software engineer at Google?
Lecture 3
Recommended readings: WMS, sections 2.7 → 2.13
Conditional Probability
The probability of an event A may vary as a function of the occurrence of
other events. Then, it becomes interesting to compute conditional probabilities, e.g. the probability that the event A occurs given the knowledge that
another event B occurred.
As an example of unconditional probability, think about the event A =‘my
final grade for the 36-217 summer class final exam is higher than 90/100’
(this event is not conditional on the occurrence of another event) and its
probability P (A). As an example of conditional probability, consider now
the event B = ‘my midterm grade for the 36-217 class was higher than
80/100’: how are P (A) and P (A given that B occurred) related? Intuitively, we might expect that P (A) < P (A given that B occurred)! (beware,
however, that as we will see the definition of conditional probability is not
intrinsically related to causality)
The conditional probability of an event A given another event B is usually denoted P (A|B).
INTUITIVE DESCRIPTION OF CONDITIONAL PROBABILITY VIA VENN’S DIAGRAMS
By definition, the conditional probability of the event A given the event B is

P(A|B) = P(A ∩ B) / P(B).   (12)
Observe that the quantity P (A|B) is well-defined only if P (B) > 0.
Consider the following table of joint probabilities:

        A      A^c
B      0.3    0.4
B^c    0.1    0.2
The unconditional probabilities of the events A and B are respectively P(A) = 0.3 + 0.1 = 0.4 and P(B) = 0.3 + 0.4 = 0.7. The conditional probability of the event A given the event B is P(A|B) = P(A ∩ B)/P(B) = 0.3/0.7 = 3/7 while the probability of the event B given the event A is P(B|A) = P(B ∩ A)/P(A) = 0.3/0.4 = 3/4. The conditional probability of the event A given that B^c occurs is P(A|B^c) = P(A ∩ B^c)/P(B^c) = 0.1/0.3 = 1/3. Does the probability of A vary as a function of the occurrence of the event B?
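The same computations, written as a short sketch over the joint table (stored here as a dictionary keyed by cell):

```python
# The joint table above; each key is an (A-part, B-part) cell.
joint = {("A", "B"): 0.3, ("Ac", "B"): 0.4,
         ("A", "Bc"): 0.1, ("Ac", "Bc"): 0.2}

P_A = joint[("A", "B")] + joint[("A", "Bc")]   # marginal of A: 0.4
P_B = joint[("A", "B")] + joint[("Ac", "B")]   # marginal of B: 0.7

print(joint[("A", "B")] / P_B)                 # P(A|B)  = 0.3/0.7 = 3/7
print(joint[("A", "B")] / P_A)                 # P(B|A)  = 0.3/0.4 = 3/4
print(joint[("A", "Bc")] / (1 - P_B))          # P(A|Bc) = 0.1/0.3 = 1/3
```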
The conditional probability P(·|B) with respect to an event B with P(B) > 0 is a proper probability measure, therefore the three axioms of probability that we discussed in Lecture 2 hold for P(·|B) as well. When it comes to standard computations, P(·|B) has the same properties as P(·): for instance, for disjoint events A1, A2 ⊂ Ω we have P(A1 ∪ A2|B) = P(A1|B) + P(A2|B).
Exercise in class
By using a certain prescription drug, there is a 10% chance of experiencing headaches, a 15% chance of experiencing nausea, and a 5% chance of experiencing both headaches and nausea. What is the probability that you will experience a headache given that you feel nauseous after taking the prescription drug?
Independence
The event A ⊂ Ω is said to be independent of the event B ⊂ Ω if P (A ∩
B) = P (A)P (B). This means that the occurrence of the event B does not
alter the chance that the event A happens. In fact, assuming that P (B) >
0, we easily see that this is equivalent to P (A|B) = P (A ∩ B)/P (B) =
P (A)P (B)/P (B) = P (A).
Furthermore, assuming that also P(A) > 0, we have that P(A|B) = P(A) is equivalent to P(B|A) = P(B) (independence is a symmetric relation!).
Exercise in class
Consider the following events and the corresponding table of probabilities:

        A      A^c
B       0     0.4
B^c    0.2    0.4
Are the events A and B independent?
Exercise in class
Consider the following events and the corresponding table of probabilities:

        A       A^c
B      1/4     1/2
B^c    1/12    1/6
Are the events A and B independent?
Exercise in class
Let A, B ⊂ Ω and P (B) > 0.
• What is P (A|Ω)?
• What is P (A|A)?
• Let B ⊂ A. What is P (A|B)?
• Let A ∩ B = ∅. What is P (A|B)?
Notice that from the definition of conditional probability, we have the following:

P(A ∩ B) = P(B)P(A|B) = P(A)P(B|A).   (13)
This can be generalized to more than two events. For instance, for three
events A, B, C ⊂ Ω, we have
P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B)
(14)
and for n events A1 , . . . , An
P (A1 ∩ · · · ∩ An ) = P (A1 )P (A2 |A1 )P (A3 |A1 ∩ A2 ) . . . P (An |A1 ∩ · · · ∩ An−1 ).
(15)
Law of Total Probability and Bayes’ Rule
Assume that {Bi}_{i=1}^∞ is a partition of Ω, i.e. for any i ≠ j we have Bi ∩ Bj = ∅ and ∪_{i=1}^∞ Bi = Ω. Assume also that P(Bi) > 0 ∀i. Then, for any A ⊂ Ω, we have the so-called law of total probability

P(A) = ∑_{i=1}^∞ P(A|Bi)P(Bi).   (16)

As you showed in Homework 1, A = ∪_{i=1}^∞ (A ∩ Bi) when ∪_{i=1}^∞ Bi = Ω. But this time the sets (A ∩ Bi) are also disjoint since {Bi}_{i=1}^∞ is a partition of Ω. Furthermore, by equation (13) we have P(A ∩ Bi) = P(A|Bi)P(Bi). Thus,

P(A) = ∑_{i=1}^∞ P(A ∩ Bi) = ∑_{i=1}^∞ P(A|Bi)P(Bi).
We can now use the law of total probability to derive the so-called Bayes’ rule. We have

P(Bi|A) = P(Bi ∩ A)/P(A) = P(A|Bi)P(Bi) / ∑_{j=1}^∞ P(A|Bj)P(Bj).   (17)

For two events A, B this reduces to

P(B|A) = P(A|B)P(B)/P(A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B^c)P(B^c)].   (18)
Exercise in class
You are diagnosed with a disease that has two types, A and B. In the population, the probability of having type A is 10% and the probability of having type B is 90%. You undergo a test that is 80% accurate, i.e. if you have the type A disease, the test will diagnose type A with probability 80% and type B with probability 20% (and vice versa). The test indicates that you have type A. What is the probability that you really have the type A disease?

Let A = ‘you have type A’, B = ‘you have type B’, TA = ‘the test diagnoses type A’, and TB = ‘the test diagnoses type B’. We know that P(A) = 0.1 and P(B) = 0.9. The test is 80% accurate, meaning that

P(TA|A) = P(TB|B) = 0.8
P(TB|A) = P(TA|B) = 0.2.
We want to compute P(A|TA). We have

P(A|TA) = P(A ∩ TA)/P(TA) = P(TA|A)P(A) / [P(TA|A)P(A) + P(TA|B)P(B)]
        = (0.8 ∗ 0.1) / (0.8 ∗ 0.1 + 0.2 ∗ 0.9) = 8/26 = 4/13.
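A small sketch reproducing this computation exactly (Fraction keeps the arithmetic in exact rational form):

```python
from fractions import Fraction

# Bayes' rule for the disease-type example: priors P(A) = 0.1, P(B) = 0.9,
# test accuracy P(TA|A) = P(TB|B) = 0.8.
P_A, P_B = Fraction(1, 10), Fraction(9, 10)
P_TA_given_A, P_TA_given_B = Fraction(8, 10), Fraction(2, 10)

P_TA = P_TA_given_A * P_A + P_TA_given_B * P_B   # law of total probability
print(P_TA_given_A * P_A / P_TA)                 # posterior P(A|TA) = 4/13
```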
Homework 2 (due May 21st)
1. (12 pts) A secretary types 4 letters to 4 people and addresses the 4
envelopes. If she inserts the letters at random, what is the probability
that exactly 2 letters will go in the correct envelopes?
Solution:
Call the letters A, B, C, D. What is the probability that exactly the letters A and B go in the correct envelopes? For that to happen, A and B must be matched to their envelopes and C and D must both be mismatched, which occurs in exactly one of the 4! possible orderings (the one where C and D are swapped). So this probability is 1/4!. Now, one has C^4_2 possible choices of 2 letters out of 4 to be addressed correctly. Thus, the probability that exactly 2 letters will go in the correct envelopes is C^4_2 ∗ 1/4! = 6/24 = 1/4.
2. (12 pts) A student prepares for an exam by studying a list of 10 problems. She can solve 6 of them. For the exam, the instructor chooses
5 problems among the 10 completely at random. What is the probability that the student can solve all of the problems on the day of the
exam?
Solution:
How many combinations of test questions are possible (i.e. how big is the sample space)? There are C^10_5 possible combinations of test questions. From the six questions that the student can solve, one can choose 5 of them in C^6_5 ways. Thus the probability that the student can solve all of the problems on the day of the exam is

C^6_5 / C^10_5 = [6!/(5!(6 − 5)!)] / [10!/(5!(10 − 5)!)] = 6/252 = 1/42.
3. (12 pts) A fair die is tossed six times, and the number on the uppermost
face is recorded each time. What is the probability that the numbers
recorded are 1, 2, 3, 4, 5, 6, in any order?
Solution:
The number of possible outcomes of the experiment is 6^6 (in the first roll there are 6 possible outcomes, in the second roll there are again 6 possible outcomes, and so on). The number of ways in which the numbers 1, 2, 3, 4, 5, 6 can appear in any order is 6!. Therefore, the probability that the numbers recorded after the experiment are 1, 2, 3, 4, 5, 6 in any order is 6!/6^6 = 20/6^4 ≈ 0.0154.
4. (12 pts) A population of voters contains 40% Republicans and 60%
Democrats. It is reported that 30% of the Republicans and 70% of the
Democrats favor an election issue. A person chosen at random from
this population is found to favor the issue in question. What is the
conditional probability that this person is a Democrat?
Solution:
Let D and R denote the events ‘the person is a Democrat’ and ‘the person is a Republican’, respectively. Let F denote the event ‘the person favors the issue’. We have P(D) = 0.6, P(R) = 0.4, P(F|D) = 0.7, and P(F|R) = 0.3. Then, using Bayes’ Rule,

P(D|F) = P(F|D)P(D)/P(F) = P(F|D)P(D) / [P(F|D)P(D) + P(F|R)P(R)]
       = (0.7 ∗ 0.6) / (0.7 ∗ 0.6 + 0.3 ∗ 0.4) = 7/9.
5. (4 pts) Use the axioms of probability to show that for any A ⊂ Ω we have P(A^c) = 1 − P(A).
Solution:
We have Ω = A∪Ac . Since A and Ac are disjoint, P (Ω) = P (A∪Ac ) =
P (A) + P (Ac ). Thus, P (Ac ) = P (Ω) − P (A). Since P (Ω) = 1, we then
have P (Ac ) = 1 − P (A).
6. (8 pts) Let B ⊂ A. Show that P (A) = P (B) + P (A ∩ B c ).
Solution:
Remember from Homework 1 that we have A = (A ∩ B) ∪ (A ∩ B c ).
Furthermore, these two sets are disjoint: A∩B is a subset of B, A∩B c
is a subset of B c , and obviously B and B c are disjoint. Therefore,
P (A) = P (A ∩ B) + P (A ∩ B c ). Since B ⊂ A, we have A ∩ B = B and
thus P (A) = P (B) + P (A ∩ B c ).
7. (10 pts) Show that for A, B ⊂ Ω, P (A∩B) ≥ max{0, P (A)+P (B)−1}.
Solution:
Because the probability of any event is a non-negative number, it is
clear that P (A ∩ B) ≥ 0. Furthermore, recall that in general for two
events A and B we have
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
from which it follows that
P (A ∩ B) = P (A) + P (B) − P (A ∪ B)
Since P (A ∪ B) ≤ 1, we have
P (A ∩ B) = P (A) + P (B) − P (A ∪ B) ≥ P (A) + P (B) − 1.
Thus, since P (A ∩ B) ≥ 0 and P (A ∩ B) ≥ P (A) + P (B) − 1, we
conclude that P (A ∩ B) ≥ max{0, P (A) + P (B) − 1}.
8. (10 pts) Let A, B ⊂ Ω be such that P (A) > 0 and P (B) > 0. Show
that
• if A and B are disjoint, then they are not independent
• if A and B are independent, then they are not disjoint.
Solution:
• Suppose that A and B are independent. Then, P (A ∩ B) =
P (A)P (B) > 0 (since P (A) > 0 and P (B) > 0 by assumption).
However, A and B are disjoint by assumption, i.e. A ∩ B = ∅
which implies that P (A ∩ B) = 0, and we reach a contradiction.
• Suppose that A and B are disjoint, i.e. A ∩ B = ∅. Then P (A ∩
B) = 0. However, A and B are independent by assumption, i.e.
P (A∩B) = P (A)P (B) > 0 (recall that P (A) > 0 and P (B) > 0),
and again we reach a contradiction.
9. (10 pts) Let A, B ⊂ Ω be such that P (A) > 0 and P (B) > 0. Prove
or disprove the following claim: in general, it is true that P (A|B) =
P (B|A). If you believe that the claim is true, clearly explain your
reasoning. If you believe that the claim is false in general, can you
find an additional condition under which the claim is true?
Solution:
The claim is false in general. A simple counter example: take A
arbitrary with 0 < P (A) < 1 and B = Ω. We have P (A|B) =
P (A|Ω) = P (A) while P (B|A) = P (Ω|A) = 1. Notice, however, that
if P (A) = P (B), then
P(A|B) = P(A ∩ B)/P(B) = P(A ∩ B)/P(A) = P(B|A).

Therefore, under this additional condition, the claim is true.
10. (10 pts) Let A, B ⊂ Ω be independent events. Show that the following
pairs of events are also independent:
• A, B c
• Ac , B
• Ac , B c .
Solution:
• We have A = (A ∩ B) ∪ (A ∩ B c ) and the two events are disjoint.
Therefore,
P (A ∩ B c ) = P (A) − P (A ∩ B) = P (A) − P (A)P (B)
= P (A)[1 − P (B)] = P (A)P (B c )
which implies that A and B c are independent.
• To show Ac and B are independent, just switch the roles of A
and B in the argument above.
• Once again we can write A^c = (A^c ∩ B^c) ∪ (A^c ∩ B) and the two events are disjoint. Hence, as in the first part,

P(A^c ∩ B^c) = P(A^c) − P(A^c ∩ B) = P(A^c) − P(A^c)P(B)
             = P(A^c)[1 − P(B)] = P(A^c)P(B^c)

implying that A^c and B^c are also independent (notice that here we used the previous result on the independence of A^c and B).
Lecture 4
Recommended readings: WMS, sections 3.1, 3.2, 4.1 → 4.3
Random Variables and Probability Distributions
Let’s start with an example. Suppose that you flip a fair coin three times. The sample space for this experiment is

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.

Since the coin is fair, we have that

P({HHH}) = P({HHT}) = · · · = P({TTT}) = 1/|Ω| = 1/8.
Suppose that we are interested in a particular quantity associated to the experiment, say the number of tails that we observe. This is of course a random quantity. In particular, we can conveniently define a function X : Ω → R that counts the number of tails. Precisely,

X(HHH) = 0
X(HHT) = 1
X(HTH) = 1
X(HTT) = 2
...
X(TTT) = 3.

Since the outcome of the experiment is random, the value that the function X takes is also random. That’s why we call a function like X a random variable¹. P induces through X a probability distribution on R. In particular, we can easily see that the probability distribution of the random variable X is

P(X = 0) = P({HHH}) = 1/8
P(X = 1) = P({HHT, HTH, THH}) = 3/8
P(X = 2) = P({HTT, THT, TTH}) = 3/8
P(X = 3) = P({TTT}) = 1/8
P(X = any other number) = 0.

¹ For historical reasons X is called a random ‘variable’, even though X is clearly a function mapping Ω into R.
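The push-forward of P through X can be computed mechanically by enumerating Ω, as in this sketch:

```python
from collections import Counter
from itertools import product

# Enumerate the sample space of three fair coin flips and push the uniform
# probability on Omega through X = 'number of tails'.
Omega = list(product("HT", repeat=3))    # 8 equally likely outcomes
pmf = Counter(w.count("T") for w in Omega)
for x in sorted(pmf):
    print(x, pmf[x], "/", len(Omega))    # P(X=0)=1/8, P(X=1)=3/8, ...
```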
You might be wondering why we care about random variables. After all, based on the example, it seems that we still have to clearly define and specify the sample space in order to determine P(X = x) for x ∈ R. Actually, it turns out that when we model a random quantity of interest (such as the number of tails in the coin tossing example), most of the time one assumes (or knows) a distribution to use. As we shall see later, in the example above we would simply assume that X has a Binomial distribution without having to specify the sample space and a probability on it! Assuming that X has a Binomial distribution (or any other distribution) is equivalent to making an assumption for the probabilities that we are interested in, i.e. P(X = x) for x ∈ R.
Exercise in class
Let X and Y be random variables with the following distributions:

P(X = x) = { 0.4 if x = 0; 0.6 if x = 1; 0 if x ∉ {0, 1} }

and

P(Y = y) = { 0.7 if y = −1; 0.3 if y = 1; 0 if y ∉ {−1, 1} }.

Suppose that for any x, y ∈ R the events {X = x} and {Y = y} are independent. What is the probability distribution of the random variable Z = X + Y? We have

Z = { −1 if {X = 0} ∩ {Y = −1} occurs; 0 if {X = 1} ∩ {Y = −1} occurs; 1 if {X = 0} ∩ {Y = 1} occurs; 2 if {X = 1} ∩ {Y = 1} occurs }.

Thus,

P(Z = z) = { 0.4 ∗ 0.7 if z = −1; 0.6 ∗ 0.7 if z = 0; 0.4 ∗ 0.3 if z = 1; 0.6 ∗ 0.3 if z = 2; 0 if z ∉ {−1, 0, 1, 2} }
         = { 0.28 if z = −1; 0.42 if z = 0; 0.12 if z = 1; 0.18 if z = 2; 0 if z ∉ {−1, 0, 1, 2} }.
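The same distribution can be obtained programmatically by summing P(X = x)P(Y = y) over all pairs with x + y = z (a discrete convolution), as in this sketch:

```python
from collections import defaultdict

# Distribution of Z = X + Y for the independent X and Y above.
pX = {0: 0.4, 1: 0.6}
pY = {-1: 0.7, 1: 0.3}

pZ = defaultdict(float)
for x, px in pX.items():
    for y, py in pY.items():
        pZ[x + y] += px * py    # independence: P(X=x, Y=y) = P(X=x)P(Y=y)

print({z: round(p, 2) for z, p in sorted(pZ.items())})
# {-1: 0.28, 0: 0.42, 1: 0.12, 2: 0.18}
```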
At this point it is worthwhile making an important distinction between
two types of random variables, namely discrete random variables and continuous random variables.
We say that a random variable is discrete if the set of values that it can
take is at most countable. On the other hand, a random variable taking
values in an uncountably infinite set is called continuous.
Question
Consider the following:
• you draw a circle on a piece of paper and one diameter of the circle; at the center of the circle, you keep your pencil standing orthogonal to the plane where the circle lies. At some point you let go of the pencil. X is the random variable corresponding to the angle that the pencil forms with the diameter of the circle that you drew after the pencil fell on the piece of paper.
• you toss a coin. Y is the random variable corresponding to the number of tosses needed until you observe tails for the first time.
What are the possible values that X and Y can take? Are X and Y discrete
or continuous random variables?
Depending on whether a given random variable X is discrete or continuous, we use two special types of functions in order to describe its distribution.

• if X is discrete, let’s define its support as the set supp(X) = {x ∈ R : P(X = x) > 0} (if X is discrete, supp(X) is either a finite or countable set). We can describe the probability distribution of X in terms of its probability mass function (p.m.f.), i.e. the function

p(x) = P(X = x)   (19)

mapping R into [0, 1]. The function p satisfies the following properties:
1. p(x) ∈ [0, 1] ∀x ∈ R
2. ∑_{x∈supp(X)} p(x) = 1.
• if X is continuous, we can describe the probability distribution of X by means of the probability density function (p.d.f.) f : R → R+. Define in this case supp(X) = {x ∈ R : f(x) > 0}. The function f satisfies the following properties:
1. f(x) ≥ 0 ∀x ∈ R
2. ∫_R f(x) dx = ∫_{supp(X)} f(x) dx = 1.

We use f to compute the probability of events of the type {X ∈ (a, b]} for a < b. In particular, for a < b, we have

P(X ∈ (a, b]) = ∫_a^b f(x) dx.   (20)

Notice that this implies that, for any x ∈ R, P(X = x) = 0 if X is a continuous random variable! Also, if X is a continuous random variable, it is clear from above that

P(X ∈ (a, b]) = P(X ∈ [a, b]) = P(X ∈ [a, b)) = P(X ∈ (a, b)).   (21)

In general, for any set A ⊂ R, we have

P(X ∈ A) = ∫_A f(x) dx.
Exercise in class
A group of 4 components is known to contain 2 defectives. An inspector randomly tests the components one at a time until the two defectives are found. Let X denote the number of tests on which the second defective is found. What is the p.m.f. of X? What is the support of X? Graph the p.m.f. of X.

Let D stand for defective and N stand for non-defective. The sample space for this experiment is Ω = {DD, NDD, DND, DNND, NDND, NNDD}. The simple events in Ω are equally likely because the inspector samples the components completely at random. Thus, the probability of each simple event in Ω is just 1/|Ω| = 1/6. We have

P(X = 2) = P({DD}) = 1/6
P(X = 3) = P({NDD, DND}) = 2/6 = 1/3
P(X = 4) = P({DNND, NDND, NNDD}) = 3/6 = 1/2
P(X = any other number) = 0.

Thus, the p.m.f. of X is

p(x) = { 1/6 if x = 2; 1/3 if x = 3; 1/2 if x = 4; 0 if x ∉ {2, 3, 4} }
and supp(X) = {2, 3, 4}.
GRAPH
Exercise in class
Consider the following p.d.f. for the random variable X:

f(x) = e^{−x} 1_{[0,∞)}(x) = { e^{−x} if x ≥ 0; 0 if x < 0 }.

What is the support of X? Compute P(2 < X < 3). Graph the p.d.f. of X.
The support of X is clearly the set [0, ∞). We have

P(2 < X < 3) = P(X ∈ (2, 3)) = ∫_2^3 f(x) dx = [−e^{−x}]_2^3 = e^{−2} − e^{−3}.
GRAPH
Another way to describe the distribution of a random variable is by means of its cumulative distribution function (c.d.f.). Again we will separate the discrete case and the continuous case.

• If X is a discrete random variable, its c.d.f. is defined as the function

F(x) = ∑_{y ≤ x, y ∈ supp(X)} p(y).

Notice that for a discrete random variable, F is not a continuous function.

• If X is a continuous random variable, its c.d.f. is defined as the function

F(x) = ∫_{−∞}^x f(y) dy.

Notice that for a continuous random variable, F is a continuous function.
In both cases, the c.d.f. of X satisfies the following properties:
1. lim_{x→−∞} F(x) = 0
2. lim_{x→+∞} F(x) = 1
3. x ≤ y =⇒ F(x) ≤ F(y)
4. F is a right-continuous function, i.e. for any x ∈ R we have lim_{y→x⁺} F(y) = F(x).
Exercise in class
Compute the c.d.f. of the random variable X in the two examples above and draw its graph.

For the example in the discrete case, we have

F(x) = { 0 if x < 2; 1/6 if 2 ≤ x < 3; 1/6 + 1/3 = 1/2 if 3 ≤ x < 4; 1/2 + 1/2 = 1 if x ≥ 4 }.

For the example in the continuous case, we have

F(x) = { 0 if x < 0; ∫_0^x e^{−y} dy if x ≥ 0 } = { 0 if x < 0; [−e^{−y}]_0^x if x ≥ 0 } = { 0 if x < 0; 1 − e^{−x} if x ≥ 0 }.
GRAPHS
What is the relationship between the c.d.f. of a random variable and its p.m.f./p.d.f.?

• For a discrete random variable X, let x(i) denote the i-th smallest element in supp(X). Then,

p(x(i)) = { F(x(i)) − F(x(i−1)) if i ≥ 2; F(x(i)) if i = 1 }.

• For a continuous random variable X,

f(x) = dF(y)/dy |_{y=x}

for any x at which F is differentiable.
Furthermore, notice that for a continuous random variable X with c.d.f. F and for a < b one has

P(a < X ≤ b) = P(a ≤ X ≤ b) = P(a ≤ X < b) = P(a < X < b)
             = ∫_a^b f(x) dx = ∫_{−∞}^b f(x) dx − ∫_{−∞}^a f(x) dx = F(b) − F(a).
This is easy. However, for a discrete random variable one has to be careful
with the bounds. Using the same notation as before, for i < j one has
P (x(i) < X ≤ x(j) ) = F (x(j) ) − F (x(i) )
P (x(i) ≤ X ≤ x(j) ) = F (x(j) ) − F (x(i) ) + p(x(i) )
P (x(i) ≤ X < x(j) ) = F (x(j−1) ) − F (x(i) ) + p(x(i) )
P (x(i) < X < x(j) ) = F (x(j−1) ) − F (x(i) ).
Suppose that the c.d.f. F of a continuous random variable X is strictly increasing. Then F is invertible, meaning that there exists a function F^{−1} : (0, 1) → R ∪ {−∞, +∞} such that for any x ∈ R we have F^{−1}(F(x)) = x. Then, for any α ∈ (0, 1) we can define the α-quantile of X as the number

xα = F^{−1}(α)   (22)

with the property that P(X ≤ xα) = α. This can be extended to c.d.f.’s that are not strictly increasing and to p.m.f.’s, but for this class we will not use that extension.
Exercise in class
Consider again a random variable X with p.d.f.

f(x) = e^{−x} 1_{[0,∞)}(x) = { e^{−x} if x ≥ 0; 0 if x < 0 }.

Compute the α-quantile of X.

We know from above that

F(x) = { 0 if x < 0; 1 − e^{−x} if x ≥ 0 }.

For α ∈ (0, 1), set α = F(xα) = 1 − e^{−xα}. We then have that the α-quantile is

xα = − log(1 − α).
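A handy consequence, sketched below: if U ∼ Uniform(0, 1), then F^{−1}(U) = −log(1 − U) is a draw from this distribution (this is inverse-c.d.f. sampling, a standard use of quantile functions). A quick numerical check:

```python
import math
import random

# Draw from the p.d.f. e^{-x} on [0, inf) by inverting its c.d.f.
random.seed(0)
draws = [-math.log(1 - random.random()) for _ in range(100000)]
print(sum(draws) / len(draws))   # close to 1, the mean of this distribution
```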
Lecture 5
Recommended readings: WMS, sections 3.3, 4.3
Expectation and Variance
The expectation (or expected value or mean) is an important operator associated to a probability distribution. Given a random variable X with p.m.f. p (if X is discrete) or p.d.f. f (if X is continuous), its expectation E(X) is defined as

• E(X) = ∑_{x∈supp(X)} x p(x), if X is discrete
• E(X) = ∫_R x f(x) dx = ∫_{x∈supp(X)} x f(x) dx, if X is continuous.

Roughly speaking, E(X) is the ‘center’ of the distribution of X.
Exercise in class
Consider the random variable X and its p.m.f.

p(x) = { 0.2 if x = 0; 0.3 if x = 1; 0.1 if x = 2; 0.4 if x = 3; 0 if x ∉ {0, 1, 2, 3} }.

What is E(X)?

By definition we have

E(X) = ∑_{x∈supp(X)} x p(x) = 0 ∗ 0.2 + 1 ∗ 0.3 + 2 ∗ 0.1 + 3 ∗ 0.4 = 0.3 + 0.2 + 1.2 = 1.7.
Exercise in class
Consider the random variable X and its p.d.f.

f(x) = 3x² 1_{[0,1]}(x).

What is E(X)?

Again, by definition

E(X) = ∫_R x f(x) dx = ∫_{supp(X)} x f(x) dx = ∫_0^1 x ∗ 3x² dx = 3 ∫_0^1 x³ dx = (3/4) x⁴ |_0^1 = 3/4.
Consider a function g : R → R of the random variable X and the new random variable g(X). The expectation of g(X) is simply

• E(g(X)) = ∑_{x∈supp(X)} g(x) p(x), if X is discrete
• E(g(X)) = ∫_R g(x) f(x) dx = ∫_{x∈supp(X)} g(x) f(x) dx, if X is continuous.
Exercise in class
Consider once again the two random variables above and the function g(x) = x². What is E(X²)?

• In the discrete example, we have

E(X²) = ∑_{x∈supp(X)} x² p(x) = 0² ∗ 0.2 + 1² ∗ 0.3 + 2² ∗ 0.1 + 3² ∗ 0.4 = 0.3 + 0.4 + 3.6 = 4.3.

• In the continuous example, we have

E(X²) = ∫_R x² f(x) dx = ∫_{supp(X)} x² f(x) dx = ∫_0^1 x² ∗ 3x² dx = 3 ∫_0^1 x⁴ dx = (3/5) x⁵ |_0^1 = 3/5.
A technical note: for a random variable X, the expected value E(X) is a well-defined quantity whenever E(|X|) < ∞.

The expected value operator E is a linear operator: given two random variables X, Y and scalars a, b ∈ R we have

E(aX) = aE(X)
E(X + Y) = E(X) + E(Y).   (23)

For any scalar a ∈ R, we have E(a) = a. To see this, consider the random variable Y with p.m.f.

p(y) = 1_{a}(y) = { 1 if y = a; 0 if y ≠ a }.
From the definition of E(Y ) it is clear that E(Y ) = a.
Exercise in class
Consider again the continuous random variable X of the previous example. What is E(X + X²)?

We have E(X + X²) = E(X) + E(X²) = 3/4 + 3/5 = 27/20.
Beware that in general, for two random variables X, Y , it is not true
that E(XY ) = E(X)E(Y ). However, we shall see later in the class that
this is true when X and Y are independent.
The letter µ is often used to denote the expected value E(X) of a random
variable X, i.e. µ = E(X).
Another very important operator associated to a probability distribution is the variance. The variance measures how ‘spread out’ the distribution of a random variable is. The variance of a random variable X is defined as

V(X) = E[(X − µ)²] = E(X²) − µ².   (24)

Equivalently, one can write

• if X is discrete: V(X) = ∑_{x∈supp(X)} (x − µ)² p(x) = ∑_{x∈supp(X)} x² p(x) − µ²
• if X is continuous: V(X) = ∫_{x∈supp(X)} (x − µ)² f(x) dx = ∫_{x∈supp(X)} x² f(x) dx − µ².
Exercise in class
Consider the random variables of the previous examples. What is V(X)?

• In the discrete example, V(X) = E(X²) − [E(X)]² = 4.3 − (1.7)² = 4.3 − 2.89 = 1.41.
• In the continuous example, V(X) = E(X²) − [E(X)]² = 3/5 − (3/4)² = 3/5 − 9/16 = 48/80 − 45/80 = 3/80.
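These moments are easy to double-check numerically; the sketch below approximates the integrals for the continuous example with a midpoint Riemann sum:

```python
# Numerical check of E(X), E(X^2), and V(X) for f(x) = 3x^2 on [0, 1].
n = 1_000_000
dx = 1.0 / n

m1 = m2 = 0.0
for i in range(n):
    x = (i + 0.5) * dx    # midpoint of the i-th subinterval
    f = 3 * x * x
    m1 += x * f * dx      # accumulates E(X)
    m2 += x * x * f * dx  # accumulates E(X^2)

print(m1, m2, m2 - m1 ** 2)   # ~0.75, ~0.6, ~0.0375 = 3/80
```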
σ 2 is often used to denote the variance V (X) of a random variable X,
i.e. σ 2 = V (X). The variance of a random variable X is finite as soon as
E(X 2 ) < ∞.
Here are two important properties of the variance operator. For any
a ∈ R we have
• V (aX) = a2 V (X)
• V (a + X) = V (X).
Exercise in class
Consider the continuous random variable of the previous examples. What is V(√(80/3) X)?

We have

V(√(80/3) X) = (80/3) V(X) = (80/3)(3/80) = 1.
Beware that in general, for two random variables X, Y , it is not true
that V (X + Y ) = V (X) + V (Y ). However, we shall see later in the course
that this is true when X and Y are independent.
Markov’s Inequality and Tchebysheff’s Inequality
Sometimes we may want to compute or approximate the probability of certain events involving a random variable X even if we don’t know its distribution (but given that we know its expectation or its variance). Let a > 0. The following inequalities are useful in these cases:

P(|X| ≥ a) ≤ E(|X|)/a   (Markov’s inequality)
P(|X| ≥ a) ≤ E(X²)/a²   (Tchebysheff’s inequality)   (25)

Let σ² = V(X) as usual, so that σ is the standard deviation of X. Note that we can conveniently apply the second inequality to X − µ with a = kσ; it then reads

P(|X − µ| ≥ kσ) ≤ σ²/(kσ)² = 1/k²   (26)

where µ = E(X). Thus, we can easily bound the probability that X deviates from its expectation by more than k times its standard deviation.
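A quick empirical sketch of inequality (26), simulating (just for illustration) a Normal random variable: the observed frequency of {|X − µ| ≥ kσ} indeed stays below 1/k².

```python
import random

# Compare the empirical frequency of |X - mu| >= k*sigma with the bound 1/k^2.
random.seed(0)
mu, sigma, n = 0.0, 1.0, 200000
xs = [random.gauss(mu, sigma) for _ in range(n)]

for k in [1, 2, 3]:
    freq = sum(abs(x - mu) >= k * sigma for x in xs) / n
    print(k, round(freq, 4), "<=", round(1 / k**2, 4))
```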
Homework 3 (due May 26th)
1. (8 pts) Consider the random variable X with p.m.f.

p(x) = { 0.1 if x = 0; 0.9 if x = 1; 0 if x ∉ {0, 1} }.

Draw the p.m.f. of X. Compute the c.d.f. of X and draw it.

Solution:
The c.d.f. of X is

F(x) = { 0 if x < 0; p(0) if 0 ≤ x < 1; p(0) + p(1) if x ≥ 1 }
     = { 0 if x < 0; 0.1 if 0 ≤ x < 1; 0.1 + 0.9 = 1 if x ≥ 1 }.
[GRAPHS: the p.m.f. p(x) and the c.d.f. F(x) of X]
2. (10 pts) Let X be a continuous random variable with p.d.f.

f(x) = 3x² 1_{[0,1]}(x).

Compute the c.d.f. of X and draw it. What is P(X ∈ (0.2, 0.4))? What is P(X ∈ [−3, 0.2))?
Solution:
The c.d.f. of X is

F(x) = ∫_{−∞}^x f(y) dy = { 0 if x < 0; ∫_0^x 3y² dy if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; y³ |_0^x if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; x³ if 0 ≤ x < 1; 1 if x ≥ 1 }.
[GRAPH: the c.d.f. F(x) of X]
We have

P(X ∈ (0.2, 0.4)) = F(0.4) − F(0.2) = 0.4³ − 0.2³ = 0.056

and

P(X ∈ [−3, 0.2)) = F(0.2) − F(−3) = 0.2³ − 0 = 0.008.
3. (10 pts) Consider again the exercise on the defective components from Lecture 4. Compute the c.d.f. of X and use it (possibly together with the p.m.f. of X) to compute the following probabilities:
• P(2 ≤ X ≤ 5)
• P(X > 3)
• P(0 < X < 4).

Solution:
The c.d.f. of X is

F(x) = { 0 if x < 2; p(2) if 2 ≤ x < 3; p(2) + p(3) if 3 ≤ x < 4; p(2) + p(3) + p(4) if x ≥ 4 }
     = { 0 if x < 2; 1/6 if 2 ≤ x < 3; 1/6 + 1/3 = 1/2 if 3 ≤ x < 4; 1/6 + 1/3 + 1/2 = 1 if x ≥ 4 }.

Thus, we have

P(2 ≤ X ≤ 5) = F(5) − F(2) + p(2) = 1 − 1/6 + 1/6 = 1
P(X > 3) = 1 − P(X ≤ 3) = 1 − F(3) = 1 − 1/2 = 1/2
P(0 < X < 4) = F(3) − F(0) = 1/2 − 0 = 1/2.
4. (10 pts) Consider the following c.d.f. for the random variable X:

F(x) = P(X ≤ x) = { 0 if x < 0; (1/8)√x if 0 ≤ x < 1; −3/8 + (1/2)x if 1 ≤ x < 3/2; −3/2 + (5/4)x if 3/2 ≤ x < 2; 1 if x ≥ 2 }.

What is the p.d.f. of X?
Solution:
Notice that F is piecewise differentiable. Therefore, the p.d.f. of X is

f(x) = dF(y)/dy |_{y=x} = { 0 if x ≤ 0; 1/(16√x) if 0 < x < 1; 1/2 if 1 ≤ x < 3/2; 5/4 if 3/2 ≤ x < 2; 0 if x ≥ 2 }.
5. (10 pts) Let α ∈ (0, 1). Derive the α-quantile associated to the following c.d.f.:

F(x) = 1/(1 + e^{−x}).

Solution:
We need to solve F(xα) = α with respect to xα. We have

1/(1 + e^{−xα}) = α
1/α = 1 + e^{−xα}
e^{−xα} = (1 − α)/α
xα = − log((1 − α)/α) = log(α/(1 − α)).
6. (10 pts) Consider the continuous random variable X with p.d.f. f and its transformation 1_A(X), where A ⊂ R. Show that E(1_A(X)) = P(X ∈ A).

Solution:
Let g(x) = 1_A(x). By definition, the expected value of the random variable g(X) is

E(g(X)) = ∫_R g(x) f(x) dx.

In this case, we have

E(g(X)) = E(1_A(X)) = ∫_R 1_A(x) f(x) dx = ∫_A f(x) dx = P(X ∈ A).
7. (6 pts) Consider a random variable X and a scalar a ∈ R. Using the
definition of variance, show that V (X + a) = V (X).
Solution:
Consider the random variable Y = X + a. We have that E(Y ) =
E(X + a) = E(X) + a. Now, by definition, V (Y ) = E[(Y − E(Y ))2 ] =
E[(X + a − (E(X) + a))2 ] = E[(X − E(X))2 ] = V (X).
8. (6 pts) Let X be a random variable with expectation µ. Show that
E[(X − µ)2 ] = E(X 2 ) − µ2 .
Solution:
We have
E[(X − µ)2 ] = E(X 2 + µ2 − 2µX) = E(X 2 ) + µ2 + E(−2µX) =
= E(X 2 ) + µ2 − 2µE(X) = E(X 2 ) + µ2 − 2µ ∗ µ
= E(X 2 ) + µ2 − 2µ2 = E(X 2 ) − µ2 .
9. (10 pts) Let X be a random variable with p.d.f. f(x) = 1_{[0,1]}(x).
• Draw the p.d.f. of X.
• Compute the c.d.f. of X and draw it.
• Compute E(3X).
• Compute V(2X + 100).

Solution:
The c.d.f. of X is

F(x) = ∫_{−∞}^x f(y) dy = { 0 if x < 0; ∫_0^x dy if 0 ≤ x < 1; 1 if x ≥ 1 }
     = { 0 if x < 0; x if 0 ≤ x < 1; 1 if x ≥ 1 }.

[GRAPHS: the p.d.f. f(x) and the c.d.f. F(x) of X]

We have that E(X) = ∫_R x f(x) dx = ∫_0^1 x ∗ 1 dx = (1/2) x² |_0^1 = 1/2. Therefore, E(3X) = 3E(X) = 3/2. Furthermore,

E(X²) = ∫_R x² f(x) dx = ∫_0^1 x² ∗ 1 dx = (1/3) x³ |_0^1 = 1/3,

hence V(X) = E(X²) − [E(X)]² = 1/3 − 1/4 = 1/12. Thus, V(2X + 100) = V(2X) = 4V(X) = 4/12 = 1/3.
10. (10 pts) A secretary is given n > 0 passwords to access a computer file. In order to open the file, she starts entering the passwords in a random order, and after each attempt she discards the password that she used if that password was not the correct one. Let X be the random variable corresponding to the number of passwords that the secretary needs to enter before she can open the file.
• Argue that P(X = 1) = P(X = 2) = · · · = P(X = n) = 1/n.
• Find the expectation and the variance of X.

Potentially useful facts:
• ∑_{x=1}^n x = n(n + 1)/2
• ∑_{x=1}^n x² = n(n + 1)(2n + 1)/6.
Solution:
Let Ri denote the event ‘the secretary chooses the right password on the i-th attempt’. We have P(X = 1) = P(R1) = 1/n since the secretary is picking one password at random from the n available passwords. The probability that she chooses the right password on the second attempt is

P(X = 2) = P(R1^c ∩ R2) = P(R1^c)P(R2|R1^c) = ((n − 1)/n) ∗ (1/(n − 1)) = 1/n,

i.e. it equals the probability that she chooses the wrong password on the first attempt and then chooses the right one on the second attempt. Similarly,

P(X = 3) = P(R1^c ∩ R2^c ∩ R3) = P(R1^c)P(R2^c|R1^c)P(R3|R1^c ∩ R2^c) = ((n − 1)/n) ∗ ((n − 2)/(n − 1)) ∗ (1/(n − 2)) = 1/n,
and then it is clear that P(X = i) = 1/n for all i ∈ {1, . . . , n}. The expectation of X is therefore

E(X) = ∑_{x=1}^n x P(X = x) = (1/n) ∑_{x=1}^n x = (1/n) ∗ n(n + 1)/2 = (n + 1)/2.

To compute the variance of X we also need

E(X²) = ∑_{x=1}^n x² P(X = x) = (1/n) ∑_{x=1}^n x² = (1/n) ∗ n(n + 1)(2n + 1)/6 = (n + 1)(2n + 1)/6.

Thus,

V(X) = E(X²) − [E(X)]² = (n + 1)(2n + 1)/6 − (n + 1)²/4
     = (2n² + 3n + 1)/6 − (n² + 2n + 1)/4
     = (1/3 − 1/4)n² + (1/2 − 1/2)n + (1/6 − 1/4)
     = n²/12 − 1/12 = (n² − 1)/12.
11. (10 pts) Consider the positive random variable X (i.e. P(X > 0) = 1) with expectation E(X) = 1 and variance V(X) = 0.002. Use Markov’s inequality and Tchebysheff’s inequality to bound the probability of the event {X < 0.1}. Are both bounds useful in this case? Comment on the bounds.

Solution:
Markov’s inequality yields

P(|X| ≥ 0.1) ≤ E|X|/0.1.

Hence, P(|X| < 0.1) ≥ 1 − E|X|/0.1. Since X is positive, this is equivalent to

P(X < 0.1) ≥ 1 − E(X)/0.1 = 1 − 10 = −9.

The bound is valid and correct! The probability of any event is a non-negative number, so it is true that P(X < 0.1) ≥ −9. However, this bound is uninformative. It gives us no additional information about the distribution of X.

Tchebysheff’s inequality, applied to the deviation of X from its mean as in (26), is more informative. Since E(X) = 1, the event {X < 0.1} implies {|X − 1| ≥ 0.9}, so

P(X < 0.1) ≤ P(|X − 1| ≥ 0.9) ≤ V(X)/0.9² = 0.002/0.81 ≈ 0.0025.

This time, because the variance of X is relatively small, we obtain a non-trivial bound for the probability of interest: the event {X < 0.1} is very unlikely.
Lecture 6
Recommended readings: WMS, sections 3.4 → 3.8
Independent Random Variables
Consider a collection of n random variables X1, . . . , Xn and their joint probability distribution. Their joint probability distribution can be described in terms of the joint c.d.f.

F_{X1,...,Xn}(x1, x2, . . . , xn) = P(X1 ≤ x1 ∩ X2 ≤ x2 ∩ · · · ∩ Xn ≤ xn),   (27)

in terms of the joint p.m.f. (if the random variables are all discrete)

p_{X1,...,Xn}(x1, x2, . . . , xn) = P(X1 = x1 ∩ X2 = x2 ∩ · · · ∩ Xn = xn),   (28)

or in terms of the joint p.d.f. f_{X1,...,Xn} (if the random variables are all continuous).

The random variables X1, . . . , Xn are said to be independent if any of the following holds:

• F_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n F_{Xi}(xi)
• p_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n p_{Xi}(xi)
• f_{X1,...,Xn}(x1, x2, . . . , xn) = ∏_{i=1}^n f_{Xi}(xi).

Then, if we consider an arbitrary collection of events {X1 ∈ A1}, {X2 ∈ A2}, . . . , {Xn ∈ An}, we have that

P(X1 ∈ A1 ∩ X2 ∈ A2 ∩ · · · ∩ Xn ∈ An) = ∏_{i=1}^n P(Xi ∈ Ai).
If the random variables also share the same marginal distribution, i.e. we have

• F_{Xi} = F ∀i ∈ {1, . . . , n}
• p_{Xi} = p ∀i ∈ {1, . . . , n} (if the random variables are all discrete)
• f_{Xi} = f ∀i ∈ {1, . . . , n} (if the random variables are all continuous)

then the random variables X1, . . . , Xn are said to be independent and identically distributed, usually shortened to i.i.d..
Frequently Used Discrete Distributions
The Binomial Distribution
Consider again tossing a coin (not necessarily fair) n times in such a way that each coin flip is independent of the other coin flips. By this, we mean that if Hi denotes the event ‘observing heads on the i-th toss’, then P(Hi ∩ Hj) = P(Hi)P(Hj) for all i ≠ j. Suppose that the probability of seeing heads on each flip is p ∈ [0, 1] (and let’s call the event ‘seeing heads’ a success). Introduce the random variables

Xi = { 1 if the i-th flip is a success; 0 otherwise }

for i ∈ {1, 2, . . . , n}. The number of heads that we observe (or the number of successes in the experiment) is

X = ∑_{i=1}^n Xi.
Under the conditions described above, the random variable X is distributed according to the Binomial distribution with parameters n ∈ Z+ (number of trials) and p ∈ [0, 1] (probability of success in each trial). We denote this by X ∼ Binomial(n, p). The p.m.f. of X is

p(x) = { C^n_x p^x (1 − p)^{n−x} if x ∈ {0, 1, 2, . . . , n}; 0 if x ∉ {0, 1, 2, . . . , n} }.   (29)

Its expectation is E(X) = np and its variance is V(X) = np(1 − p).

In particular, when n = 1, we usually say that X is distributed according to the Bernoulli distribution with parameter p, denoted X ∼ Bernoulli(p). In this case E(X) = p and V(X) = p(1 − p).
Exercise in class:
Let X denote the number of 6s observed in 4 rolls of a fair die. Then X ∼ Binomial(n, p) with n = 4 and p = 1/6. What is the probability that we observe the number 6 at least 3 times in the 4 rolls? What is the expected number of times that the number 6 is observed in 4 rolls?
We have

P(X ≥ 3) = P(X = 3) + P(X = 4) = p(3) + p(4)
         = C^4_3 (1/6)³ (1 − 1/6)^{4−3} + C^4_4 (1/6)⁴ (1 − 1/6)^{4−4}
         = 4 ∗ (1/6³)(5/6) + 1/6⁴ = 21/6⁴ ≈ 0.016

and E(X) = np = 4/6 = 2/3.
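The same numbers, computed directly from the p.m.f. (29):

```python
from math import comb

# Binomial(n=4, p=1/6): the number of 6s in four rolls of a fair die.
n, p = 4, 1 / 6

def pmf(x):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

print(pmf(3) + pmf(4))   # P(X >= 3) = 21/6^4 ~ 0.0162
print(n * p)             # E(X) = 2/3
```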
The Geometric Distribution
Consider now counting the number of coin flips needed until the first success is observed in the setting described above, and let X denote the corresponding random variable. Then X has the Geometric distribution with parameter p ∈ [0, 1], denoted X ∼ Geometric(p). Its p.m.f. is given by

p(x) = { (1 − p)^{x−1} p if x ∈ {1, 2, 3, . . . }; 0 if x ∉ {1, 2, 3, . . . } }.   (30)
INTUITION
The expected value of X is E(X) = 1/p, while its variance is V(X) = (1 − p)/p².

The Geometric distribution is one of the distributions that have the so-called ‘memoryless property’. This means that if X ∼ Geometric(p), then

P(X > x + y | X > x) = P(X > y)

for any 0 < x ≤ y. To see this, let’s first compute the c.d.f. of X. We have

F(x) = P(X ≤ x) = { 0 if x < 1; p ∑_{y=1}^{⌊x⌋} (1 − p)^{y−1} if x ≥ 1 }
     = { 0 if x < 1; p ∑_{y=0}^{⌊x⌋−1} (1 − p)^y if x ≥ 1 }
     = { 0 if x < 1; p (1 − (1 − p)^{⌊x⌋})/(1 − (1 − p)) if x ≥ 1 }
     = { 0 if x < 1; 1 − (1 − p)^{⌊x⌋} if x ≥ 1 }.

Let’s look for example at the case 1 < x < y (with x and y integers, so that ⌊x + y⌋ = ⌊x⌋ + ⌊y⌋). We have

P(X > x + y | X > x) = P(X > x + y)/P(X > x) = (1 − F(x + y))/(1 − F(x)) = (1 − p)^{⌊x+y⌋}/(1 − p)^{⌊x⌋} = (1 − p)^{⌊y⌋} = P(X > y).
Exercise in class
Consider again rolling a fair die. What is the probability that the first 6 is
rolled on the 4-th roll?
Let X denote the random variable counting the number of rolls needed
to observe the first 6. We have X ∼ Geometric(p) with p = 1/6. Then,
P(X = 4) = p(4) = (1 − p)^{4−1} p = (5/6)^3 (1/6) = 5^3/6^4 ≈ 0.096.
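A short Python sketch (again with our own helper names) can confirm both this computation and the memoryless property, using P(X > x) = (1 − p)^x for non-negative integers x:

    p = 1/6

    def geom_pmf(x, p):
        # p.m.f. of Geometric(p) from equation (30), x = 1, 2, 3, ...
        return (1 - p)**(x - 1) * p

    def geom_sf(x, p):
        # P(X > x) for a non-negative integer x
        return (1 - p)**x

    print(geom_pmf(4, p))                     # 5**3/6**4 ~ 0.0965
    x, y = 3, 5
    print(geom_sf(x + y, p) / geom_sf(x, p))  # P(X > x + y | X > x)
    print(geom_sf(y, p))                      # equals P(X > y)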
The Negative Binomial Distribution
Suppose that we are interested in counting the number of coin flips needed in
order to observe the r-th success. If X denotes the corresponding random
variable, then X has the Negative Binomial distribution with parameters
r ∈ Z+ (number of successes) and p ∈ [0, 1] (probability of success on each
trial). We use the notation X ∼ NBinomial(r, p). Its p.m.f. is then
p(x) = C(x − 1, r − 1) p^r (1 − p)^{x−r} if x ∈ {r, r + 1, . . . }, and p(x) = 0 otherwise.   (31)
INTUITION
The expected value of X is E(X) = r/p and its variance is V(X) = r(1 − p)/p^2.
Exercise in class:
Consider again rolling a fair die. What is the probability that the 3rd 6 is
observed on the 5-th roll?
We have X ∼ NBinomial(r, p) with r = 3 and p = 1/6, and

P(X = 5) = C(5 − 1, 3 − 1) (1/6)^3 (5/6)^{5−3} = 6 · 5^2/6^5 ≈ 0.019.
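The p.m.f. in equation (31) is easy to evaluate directly; the sketch below does so with math.comb. One caution if you later use a library instead: some implementations (e.g. scipy.stats.nbinom) count failures rather than trials, so they are parameterized differently from these notes.

    from math import comb

    def nbinom_pmf(x, r, p):
        # p.m.f. of NBinomial(r, p) from equation (31), x = r, r + 1, ...
        return comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

    print(nbinom_pmf(5, 3, 1/6))   # 6 * 5**2/6**5 ~ 0.0193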
Question:
How do the Geometric distribution and the Negative Binomial distribution relate?
The Hypergeometric Distribution
Suppose that you have a population of N elements that can be divided into
2 subpopulations of size r and N − r (say the population of ‘successes’ and
‘failures’, respectively). Imagine that you sample n ≤ N elements from this population without replacement. What is the probability that your
sample contains x successes? To answer this question we can introduce the
random variable X with the Hypergeometric distribution with parameters r ∈ Z+ (number of successes in the population), N ∈ Z+ (population size), and n ∈ Z+ (number of trials). We write X ∼ HGeometric(r, N, n). Its
p.m.f. is

p(x) = C(r, x) C(N − r, n − x)/C(N, n) if x ∈ {0, 1, . . . , r}, and p(x) = 0 otherwise   (32)

(with the convention that a binomial coefficient C(a, b) is zero when b < 0 or b > a, which automatically takes care of the constraints x ≤ n and n − x ≤ N − r).
The expectation and the variance of X are E(X) = nr/N and

V(X) = n (r/N) ((N − r)/N) ((N − n)/(N − 1)).
Exercise in class:
There are 10 candidates for 4 software engineering positions at Google. 7 of
them are female candidates and the remaining 3 are male candidates. If the
selection process at Google is completely random, what is the probability
that only 1 male candidate will be eventually hired?
The number of male candidates hired by Google can be described by
means of the random variable X ∼ Hypergeometric(r, N, n) with r = 3,
N = 10, and n = 4. Then,
P(X = 1) = C(3, 1) C(7, 3)/C(10, 4) = (3 · 35)/210 = 1/2.
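Here too the arithmetic is quick to verify with a few lines of Python (helper name ours):

    from math import comb

    def hgeom_pmf(x, r, N, n):
        # p.m.f. of HGeometric(r, N, n) from equation (32)
        return comb(r, x) * comb(N - r, n - x) / comb(N, n)

    print(hgeom_pmf(1, r=3, N=10, n=4))   # 3 * 35/210 = 0.5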
The Poisson Distribution
The Poisson distribution can be thought of as a limiting case of the Binomial distribution. Consider the following example: you own a store and you want to model the number of people who enter your store on a given day. In any time interval of that day, the number of people walking in the store is certainly discrete, but in principle that number can be any non-negative integer. We could try and divide the day into n smaller subperiods in such a way that, as n grows, at most one person can walk into the store in any given subperiod. If we let n → ∞, it is clear however that the probability p that a person will walk in the store in an infinitesimally small subperiod of time is such that p → 0. The Poisson distribution arises as a limiting case of the Binomial distribution when n → ∞, p → 0, and np → λ ∈ (0, ∞).
The p.m.f. of a random variable X that has the Poisson distribution
with parameter λ > 0, denoted X ∼ Poisson(λ) is
p(x) = e^{−λ} λ^x/x! if x ∈ {0, 1, 2, 3, . . . }, and p(x) = 0 otherwise.   (33)
Both the expected value and the variance of X are equal to λ, E(X) =
V (X) = λ.
Exercise in class:
At the CMU USPS office, the expected number of students waiting in line
between 1PM and 2PM is 3. What is the probability that you will see more
than 2 students already in line in front of you, if you go to the USPS office
in that period of time?
Let X ∼ Poisson(λ) with λ = 3 be the number of students in line when you arrive at the office. We want to compute P(X > 2). We have
P(X > 2) = 1 − P(X ≤ 2) = 1 − ∑_{x=0}^{2} p(x) = 1 − e^{−3} ∑_{x=0}^{2} 3^x/x!
= 1 − e^{−3} (1 + 3 + 9/2) = 1 − e^{−3} (17/2) ≈ 0.577.
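As a sanity check, the same tail probability can be computed numerically (a sketch; poisson_pmf is our own helper):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        # p.m.f. of Poisson(lam) from equation (33)
        return exp(-lam) * lam**x / factorial(x)

    lam = 3
    print(1 - sum(poisson_pmf(x, lam) for x in range(3)))   # ~ 0.5768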
Lecture 7
Recommended readings: WMS, sections 4.4 → 4.8
Frequently Used Continuous Distributions
In this section, we will describe some of the most commonly used continuous distributions. In particular, we will focus on three important families which
are often used in the statistical modeling of physical phenomena:
1. the Normal family: this is a class of distributions that is commonly
used to describe physical phenomena where the quantity of interest
takes values (at least in principle) in the range (−∞, ∞)
2. the Gamma family: this is a class of distributions which are frequently
used to describe physical phenomena where the quantity of interest
takes non-negative values, i.e. it takes values in the interval [0, ∞)
3. the Beta family: this is a class of distributions that is commonly used
to describe physical phenomena where the quantity of interest takes
values in some interval [a, b] of the real line.
We will also focus on some relevant subfamilies of the families mentioned
above.
The Uniform Distribution
The Uniform distribution is the simplest among the continuous distributions. We say that a random variable X has the Uniform distribution with
parameters a, b ∈ R, a < b, if its p.d.f. is
f(x) = (1/(b − a)) 1_{[a,b]}(x), i.e. f(x) = 1/(b − a) if x ∈ [a, b] and f(x) = 0 if x ∉ [a, b].   (34)

In this case, we use the notation X ∼ Uniform(a, b). We have E(X) = (a + b)/2 and V(X) = (b − a)^2/12.

GRAPH
The Normal Distribution
The Normal distribution is one of the most relevant distributions for applications
in Statistics. We say that a random variable X has a Normal distribution
with parameters µ ∈ R and σ 2 ∈ R+ if its p.d.f. is
f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}.

In this case, we write X ∼ N(µ, σ^2). The parameters of X correspond to its expectation µ = E(X) and its variance σ^2 = V(X). The c.d.f. of X ∼ N(µ, σ^2) can be expressed in terms of the error function erf(·). However, for our purposes, we will not need to investigate this further.
GRAPH
The Standardized Normal Distribution Given X ∼ N (µ, σ 2 ), we can
always obtain a ‘standardized’ version Z of X such that Z still has a Normal
distribution, E(Z) = 0, and V (Z) = 1 (i.e. Z ∼ N (0, 1)). This can be done
by means of the transformation

Z = (X − µ)/σ.
The random variable Z is said to be standardized. Of course, one can also standardize random variables that have other distributions (as long as they have finite variance), but unlike the Normal case, the resulting standardized variable may no longer belong to the same family as the original random variable X.
The Normal family is closed with respect to translation and scaling: if
X ∼ N (µ, σ 2 ), then aX + b ∼ N (aµ + b, a2 σ 2 ) for a 6= 0 and b ∈ R.
Furthermore, if X ∼ N (µ, σ 2 ) and Y ∼ N (ν, τ 2 ) and X and Y are
independent, then X + Y ∼ N (µ + ν, σ 2 + τ 2 ).
We finally mention that it is common notation to indicate the c.d.f. of
Z ∼ N (0, 1) by Φ(·). Notice that Φ(−x) = 1 − Φ(x) for any x ∈ R.
GRAPH
The Gamma Distribution
We say that the random variable X has a Gamma distribution with parameters α, β > 0 if its p.d.f. is
f(x) = (1/(β^α Γ(α))) x^{α−1} e^{−x/β} 1_{[0,∞)}(x).   (35)
GRAPH
Notice that the p.d.f. of the Gamma distribution includes the Gamma
function
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx   (36)
for α > 0. Furthermore, for α ∈ Z+ , we have that Γ(α) = (α − 1)!.
Make sure not to confuse the Gamma distribution, described by the p.d.f.
of equation (35), and the Gamma function of equation (36)!
The following recursive property of the Gamma function is often handy
for computations that involve Gamma-distributed random variables: for any
α > 0, we have Γ(α + 1) = αΓ(α).
The expectation and the variance of X are respectively E(X) = αβ and
V (X) = αβ 2 .
The c.d.f. of a Gamma-distributed random variable can be expressed
explicitly in terms of the incomplete Gamma function. Once again, for our
purposes, we don’t need to investigate this further.
We saw that the Normal family is closed with respect to translation
and scaling. The Gamma family is closed with respect to positive scaling only. If X ∼ Gamma(α, β), then cX ∼ Gamma(α, cβ), provided that
c > 0. Furthermore, if X1 ∼ Gamma(α1, β), X2 ∼ Gamma(α2, β), . . . , Xn ∼ Gamma(αn, β) are independent random variables, then ∑_{i=1}^{n} Xi ∼ Gamma(∑_{i=1}^{n} αi, β).
The Exponential Distribution
The Exponential distribution constitutes a subfamily of the Gamma distribution. In particular, X is said to have an Exponential distribution
with parameter β > 0 if X ∼ Gamma(1, β). In that case, we write X ∼
Exponential(β).
The p.d.f. of X is therefore
f(x) = (1/β) e^{−x/β} 1_{[0,∞)}(x).
Because the Exponential distribution is a subfamily of the Gamma distribution, we have E(X) = αβ = β and V(X) = αβ^2 = β^2.
Exercise in class
Compute the c.d.f. of X ∼ Exponential(β).
We have

F(x) = P(X ≤ x) = 0 for x < 0, while for x ≥ 0

F(x) = ∫_0^x (1/β) e^{−y/β} dy = [−e^{−y/β}]_0^x = 1 − e^{−x/β}.
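This closed form is easy to check by simulation, a minimal sketch of which follows (note that the standard library's random.expovariate takes the rate 1/β, not the mean β):

    import random
    from math import exp

    beta, x = 2.0, 1.5
    random.seed(0)
    draws = [random.expovariate(1/beta) for _ in range(100_000)]
    print(sum(d <= x for d in draws) / len(draws))   # empirical P(X <= x)
    print(1 - exp(-x/beta))                          # F(x) ~ 0.5276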
The Chi-Square Distribution
The Chi-Square distribution is another relevant subfamily of the Gamma family of distributions. It frequently arises in statistical model-fitting procedures. We say that X has a Chi-Square distribution with ν > 0 ‘degrees of freedom’ if X ∼ Gamma(ν/2, 2). In this case, we write X ∼ χ2(ν). We
have E(X) = ν and V (X) = 2ν.
The Beta distribution
Suppose that you are interested in studying a quantity Y that can only take
values in the interval [a, b] ⊂ R. We can easily transform Y in such a way
that it can only take values in the standard unit interval [0, 1]: it is enough
to consider the normalized version of Y:

X = (Y − a)/(b − a).
Then, a flexible family of distributions that can be used to model Y is the
Beta family. We say that a random variable X has a Beta distribution with
parameters α, β > 0, denoted X ∼ Beta(α, β), if its p.d.f. is of the form
f(x) = (x^{α−1} (1 − x)^{β−1}/B(α, β)) 1_{[0,1]}(x),   (37)

where

B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β)/Γ(α + β).
GRAPH
The c.d.f. of X can be expressed explicitly in terms of the incomplete
Beta function
B(x; α, β) = ∫_0^x y^{α−1} (1 − y)^{β−1} dy,
but we don’t need to investigate this further for our purposes.
The expected value of X is E(X) = α/(α + β) and its variance is V(X) = αβ/((α + β)^2 (α + β + 1)).
A Note on the Normalizing Constant of a Probability Density
Function
Most frequently, a given p.d.f. f takes the form
f (x) = cg(x)
where c is a positive constant and g is a function. The part of f depending on
x, i.e. the function g, is usually called the kernel of the p.d.f. f . Very often
one can guess whether f belongs to a certain family by simply inspecting
g. Then, if f is indeed a density, c > 0 is exactly the ‘right’ constant which
makes f integrate to 1. Therefore, if for any reason c is unknown, but
you can guess that X ∼ f belongs to a certain family of distributions for
a particular value of its parameters, in order to figure out c one does not
necessarily have to compute

c = ( ∫_{supp(X)} g(x) dx )^{−1}.
Let us illustrate this by means of two examples:
• Let f(x) = c e^{−x/2} 1_{[0,∞)}(x), and suppose we want to figure out what c is. Here g(x) = e^{−x/2} 1_{[0,∞)}(x) is the kernel of an Exponential p.d.f. with parameter β = 2. We know therefore that c must be equal to c = 1/β = 1/2.
• Let f(x) = c x^4 (1 − x)^5 1_{[0,1]}(x), and again suppose that we want to figure out the value of c. Here g(x) = x^4 (1 − x)^5 1_{[0,1]}(x), which is the kernel of a Beta p.d.f. with parameters α = 5 and β = 6. We know therefore that c must be the number

c = Γ(α + β)/(Γ(α)Γ(β)) = Γ(5 + 6)/(Γ(5)Γ(6)) = 10!/(4! 5!) = 1260.
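Both constants can be double-checked numerically with math.gamma (a small sketch under the parameter choices above):

    from math import gamma

    # Beta kernel x**4 * (1 - x)**5: c = Gamma(11)/(Gamma(5) * Gamma(6))
    print(gamma(11) / (gamma(5) * gamma(6)))   # 1260.0

    # Exponential kernel exp(-x/2): c = 1/beta with beta = 2
    print(1 / 2)                               # 0.5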
Homework 4 (due May 28th)
1. (6 pts) An oil exploration firm is formed with enough capital to finance
ten explorations. The probability of a particular exploration being
successful is 0.1. Assume the explorations are independent. Find the
mean and variance of the number of successful explorations.
Solution:
Let n ∈ Z+ and p ∈ [0, 1]. For any i ∈ {1, . . . , n}, let Xi be the
Bernoulli(p) random variable describing the success or failure of the
i-th exploration of the oil exploration firm:
Xi = 1 if the i-th exploration is successful, and Xi = 0 otherwise.
The number of successful explorations is
Y = ∑_{i=1}^{n} Xi.
Since the explorations are assumed to be independent, it follows that
Y ∼ Binomial(n, p) (recall that the sum of n independent Bernoulli(p)
random variables is a random variable itself and its distribution is
Binomial(n, p)). In this exercise n = 10 and p = 0.1 so we have
E(Y ) = np = 10 ∗ 0.1 = 1 and V (Y ) = np(1 − p) = 1 ∗ 0.9 = 0.9.
2. (8 pts) Ten motors are packaged for sale in a certain warehouse. The
motors sell for USD 100 each, but a double-your-money-back guarantee
is in effect for any defectives the purchaser may receive (i.e. if the
purchaser receives one defective motor, the seller has to pay USD 200
back to the purchaser). Find the expected net gain for the seller if the
probability of any one motor being defective is 0.08 (assuming that
the quality of any one motor is independent of that of the others).
Solution:
Let n be the number of motors that are packaged for sale in the warehouse and c be the price of each motor. For any i ∈ {1, . . . , n} let Di
denote the Bernoulli(p) random variable describing whether the i-th
motor is defective or not:
Di = 1 if the i-th motor is defective, and Di = 0 otherwise.
Every time a motor is sold the seller gains c dollars, but if the motor turns out to be defective the seller has to pay 2c dollars back to the buyer because of the double-your-money-back guarantee. Assuming that the quality of each motor is independent of the quality of every other motor in stock, the net gain of the seller is the random variable

G = cn − 2c ∑_{i=1}^{n} Di,

where ∑_{i=1}^{n} Di ∼ Binomial(n, p), and the expected net gain is the expectation of G:

E(G) = E(cn − 2c ∑_{i=1}^{n} Di) = cn − 2c E(∑_{i=1}^{n} Di) = cn − 2cnp = cn(1 − 2p).

Since c = USD 100, n = 10 and p = 0.08, we have that

E(G) = USD 100 · 10 · (1 − 2 · 0.08) = USD 1000 · 0.84 = USD 840.
3. (6 pts) Consider again Exercise 10 of Homework 3. Suppose this time
that the secretary does not discard a password once she tried it and
discovered that the password is wrong. What is the distribution of the
random variable X corresponding to the number of attempts needed
until she enters the correct password? What are the mean and the
variance of X?
Solution:
We can model the sequence of attempts using a sequence of i.i.d. random variables X1, X2, . . . ∼ Bernoulli(p) where p = 1/n. The number of
attempts needed to enter the correct password for the first time then
corresponds to a random variable Y ∼ Geometric(p). Its expectation
is E(Y ) = 1/p = n and its variance is V (Y ) = (1 − p)/p2 = n(n − 1).
4. (8 pts) A geological study indicates that an exploratory oil well should
strike oil with probability 0.2.
• What is the probability that the first strike comes on the third
well drilled?
• What is the probability that the third strike comes on the seventh
well drilled?
• What assumptions did you make to obtain the answers to parts
(a) and (b)?
• Find the mean and variance of the number of wells that must be
drilled if the company wants to set up three producing wells.
Solution:
Let X be the random variable describing the number of the trial on
which the r-th oil well is observed to strike oil. X is distributed according to NBinomial(r, p) and its probability mass function is
P(X = x) = C(x − 1, r − 1) p^r (1 − p)^{x−r} if x = r, r + 1, . . . , and P(X = x) = 0 otherwise.
• The probability that the first strike comes on the third well drilled
is
P(X = 3) = C(3 − 1, 1 − 1) p^1 (1 − p)^{3−1} = p(1 − p)^2,
which is equal to the probability of observing the first success on
the third trial (notice that this is really the same as doing the
computation with a Geometric(p) distribution since here r = 1).
p = 0.2, therefore we have
P(X = 3) = 0.2 · (0.8)^2 = 0.128.
• Here r = 3, so the probability that the third strike comes on the
seventh well drilled is
P(X = 7) = C(7 − 1, 3 − 1) p^3 (1 − p)^{7−3} = C(6, 2) · 0.2^3 · 0.8^4 ≈ 0.049.
• In order to apply the Negative Binomial model we need to make
some assumptions. In particular, we have to assume that all the
trials are identical and independent Bernoulli trials (i.e. every
time a well is drilled the fact that it strikes oil or not does not
depend on what happens at other wells and the probability that
a well strikes oil is the same for all wells).
• Here we want to compute E(X) and V(X) when r = 3, namely the expectation and the variance of the number of the trial on which the third well is observed to strike oil. We have

E(X) = r/p = 3/0.2 = 15 and V(X) = r(1 − p)/p^2 = (3 · 0.8)/0.2^2 = 60.
5. (6 pts) Cards are dealt at random and without replacement from a
standard 52 card deck. What is the probability that the second king
is dealt on the fifth card?
Solution:
The probability of the event that the second king is dealt on the fifth
card is equivalent to the probability of the intersection of the two
events A = “exactly one king is dealt in the first four cards” and B = “a king is dealt on the fifth card”. We have
P(A) = C(4, 1) C(48, 3)/C(52, 4) and P(B|A) = C(3, 1) C(45, 0)/C(48, 1) = 3/48, so

P(A ∩ B) = P(A) P(B|A) = [C(4, 1) C(48, 3)/C(52, 4)] · (3/48) = (4 · 17296/270725) · (3/48) ≈ 0.016.
We can also interpret the problem in the following way. Let K1 ∼
HGeometric(r, N, n) be the random variable describing the number of
kings that are dealt in the first n cards from a deck of N cards containing exactly r kings and let K2 ∼ Hypergeometric(s, M, m) be a
random variable describing the number of kings that are dealt in the
first m cards from a deck of M cards containing exactly s kings. Assume moreover that K1 and K2 are independent. Then the probability
that the second king is dealt on the fifth hand can be expressed as
P(K1 = 1, K2 = 1) = P(K1 = 1) P(K2 = 1) = [C(r, 1) C(N − r, n − 1)/C(N, n)] · [C(s, 1) C(M − s, m − 1)/C(M, m)]
for a particular choice of the parameters. For this question, we know
that N = 52, n = 4 and r = 4 and we know how the deck changes after
exactly one king and three other cards are dealt: indeed, we know that
M = N − n, m = 5 − n and s = r − 1. Moreover, the independence
assumption is appropriate since the cards are always dealt randomly,
no matter the composition of the deck. It’s easy to check that plugging these parameter values into the expression for P(K1 = 1, K2 = 1) yields
exactly
P(K1 = 1, K2 = 1) = [C(r, 1) C(N − r, n − 1)/C(N, n)] · [C(r − 1, 1) C(N − n − (r − 1), 5 − n − 1)/C(N − n, 5 − n)]
= [C(4, 1) C(48, 3)/C(52, 4)] · [C(3, 1) C(45, 0)/C(48, 1)] ≈ 0.016.
6. (8 pts) A parking lot has two entrances. Cars arrive at entrance I
according to a Poisson distribution at an average of three per hour
and at entrance II according to a Poisson distribution at an average of
four per hour. What is the probability that a total of three cars will
arrive at the parking lot in a given hour? Assume that the numbers
of cars arriving at the two entrances are independent.
Solution:
Let C1 ∼ Poisson(λ) and C2 ∼ Poisson(µ) denote the number of cars
arriving in an hour at entrance I and entrance II of the parking lot
respectively. We know that λ = 3 and µ = 4 and we assume that C1
and C2 are independent. At a given hour
C = C1 + C2
is the total number of cars arriving at the parking lot. We are interested in computing P (C = 3). The event C = 3 can be written as the
following disjoint union of events
{C = 3} = ({C1 = 0} ∩ {C2 = 3}) ∪ ({C1 = 1} ∩ {C2 = 2})∪
∪ ({C1 = 2} ∩ {C2 = 1}) ∪ ({C1 = 3} ∩ {C2 = 0})
whose probability is (here we use the fact that the union is a disjoint
union and we exploit the independence of C1 and C2 )
P(C = 3) = P(C1 = 0 ∩ C2 = 3) + P(C1 = 1 ∩ C2 = 2) + P(C1 = 2 ∩ C2 = 1) + P(C1 = 3 ∩ C2 = 0)
= P(C1 = 0) P(C2 = 3) + P(C1 = 1) P(C2 = 2) + P(C1 = 2) P(C2 = 1) + P(C1 = 3) P(C2 = 0)
= e^{−λ} e^{−µ} µ^3/3! + e^{−λ} λ e^{−µ} µ^2/2 + e^{−λ} (λ^2/2) e^{−µ} µ + e^{−λ} (λ^3/3!) e^{−µ}
= e^{−(λ+µ)} (µ^3/3! + λ µ^2/2 + λ^2 µ/2 + λ^3/3!) ≈ 0.052.
7. (6 pts) Derive the c.d.f. of X ∼ Uniform(a, b).
Solution:
Recall that

f(x) = (1/(b − a)) 1_{[a,b]}(x).

We have

F(x) = P(X ≤ x) = 0 if x < a, F(x) = ∫_a^x 1/(b − a) dy = (x − a)/(b − a) if a ≤ x < b, and F(x) = 1 if x ≥ b.
8. (4 pts) Let X ∼ Uniform(0, 1). Is X ∼ Beta(α, β) for some α, β > 0?
Solution:
Recall the p.d.f. corresponding to the Beta distribution with parameters α, β > 0:
f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} 1_{[0,1]}(x).
By comparing this to the p.d.f. of X,

fX(x) = 1_{[0,1]}(x),

it is easily seen that X ∼ Uniform(0, 1) ≡ Beta(1, 1). Notice that in this case Γ(α) = Γ(β) = Γ(1) = 0! = 1 and Γ(α + β) = Γ(2) = 1! = 1.
9. (8 pts) The number of defective circuit boards coming off a soldering
machine follows a Poisson distribution. During a specific eight-hour
day, one defective circuit board was found. Conditional on the event
that a single defective circuit was produced, the time at which it was
produced has a Uniform distribution.
• Find the probability that the defective circuit was produced during the first hour of operation during that day.
• Find the probability that it was produced during the last hour of
operation during that day.
• Given that no defective circuit boards were produced during the
first four hours of operation, find the probability that the defective board was manufactured during the fifth hour.
Solution:
Let T be the random variable denoting the time at which the defective
circuit board was produced. T is uniformly distributed in the interval [0, 8] (hours), that is T ∼ Uniform(0, 8). Its probability density
function is therefore

f(t) = 1/8 if t ∈ [0, 8], and f(t) = 0 otherwise,

and its cumulative distribution function is

F(t) = P(T ≤ t) = 0 if t < 0, F(t) = t/8 if t ∈ [0, 8), and F(t) = 1 if t ≥ 8.
• The probability that the defective circuit board was produced during the first hour of operation during the day is

P(T ∈ (0, 1]) = P(0 < T ≤ 1) = F(1) − F(0) = 1/8 − 0 = 1/8.

• The probability that the defective circuit board was produced during the last hour of operation during the day is

P(T ∈ (7, 8]) = P(7 < T ≤ 8) = F(8) − F(7) = 1 − 7/8 = 1/8.
• Given that no defective circuit boards were produced in the first four hours (that is, we know that the event T > 4 occurred), the probability that the defective board was manufactured during the fifth hour, i.e. in the interval (4, 5], is

P(T ∈ (4, 5] | T > 4) = P(T ∈ (4, 5] ∩ T > 4)/P(T > 4) = P(T ∈ (4, 5])/P(T > 4)
= (F(5) − F(4))/(1 − F(4)) = (5/8 − 4/8)/(1 − 4/8) = (1/8)/(1/2) = 1/4.
10. (6 pts) Let Z ∼ N (0, 1). Show that, for any z > 0, P (|Z| > z) =
2(1 − Φ(z)).
Solution:
We have
P (|Z| > z) = P (Z > z ∪ Z < −z) = P (Z > z) + P (Z < −z)
= 1 − P (Z ≤ z) + P (Z < −z) = 1 − Φ(z) + Φ(−z)
= 1 − Φ(z) + 1 − Φ(z) = 2(1 − Φ(z)).
11. (6 pts) Let Z ∼ N (0, 1). Express in terms of Φ, the c.d.f. of Z, the
following probabilities
• P (Z 2 < 1)
• P (3Z/5 + 3 > 2).
Compute the exact probability of the event Φ^{−1}(α/3) ≤ Z ≤ Φ^{−1}(4α/3), where Φ^{−1}(α) is the inverse c.d.f. computed at α (i.e. the α-quantile of Z).
Solution:
We have
• P(Z^2 < 1) = 1 − P(Z^2 > 1) = 1 − P(|Z| > 1). Based on the result from the previous exercise, P(|Z| > 1) = 2(1 − Φ(1)). Thus, P(Z^2 < 1) = 1 − 2(1 − Φ(1)) = 2Φ(1) − 1.
• P (3Z/5 + 3 > 2) = P (3Z/5 > −1) = P (Z > −5/3) = 1 −
Φ(−5/3) = Φ(5/3).
Regarding the last question, we have
P(Φ^{−1}(α/3) ≤ Z ≤ Φ^{−1}(4α/3)) = Φ(Φ^{−1}(4α/3)) − Φ(Φ^{−1}(α/3)) = 4α/3 − α/3 = α.
12. (8 pts) The SAT and ACT college entrance exams are taken by thousands of students each year. The mathematics portions of each of these
exams produce scores that are approximately normally distributed. In
recent years, SAT mathematics exam scores have averaged 480 with
standard deviation 100. The average and standard deviation for ACT
mathematics scores are 18 and 6, respectively.
• An engineering school sets 550 as the minimum SAT math score
for new students. What percentage of students will score below
550 in a typical year? Express your answer in terms of Φ.
• What score should the engineering school set as a comparable
standard on the ACT math test? You can leave your answer in
terms of Φ−1 .
Solution:
Let S denote the random variable describing the SAT mathematics
scores and let A denote the random variable describing the ACT mathematics scores. Let also Z ∼ N(0, 1). S is distributed approximately according to N(µs, σs^2) with µs = 480 and σs = 100, and A is distributed approximately according to N(µa, σa^2) with µa = 18 and σa = 6.
• Based on the approximations above, the percentage of students scoring below 550 on the SAT math exam in a typical year is

P(S ≤ 550) = P((S − µs)/σs ≤ (550 − µs)/σs) ≈ P(Z ≤ (550 − 480)/100) = P(Z ≤ 0.7) = Φ(0.7) ≈ 0.758.
• The minimum score that the engineering school should require from new students on the ACT math test, in order for the two minimum scores to be comparable standards, is the number a that satisfies

P(A ≤ a) ≈ P(S ≤ 550) ≈ Φ(0.7).

We have to solve for a the following equation:

P(A ≤ a) = P((A − µa)/σa ≤ (a − µa)/σa) ≈ P(Z ≤ (a − µa)/σa) = Φ(0.7),

namely we need to solve (a − µa)/σa = 0.7, whose solution is a = 0.7 σa + µa = 0.7 · 6 + 18 = 22.2.
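If a numerical value is wanted, Φ can be evaluated with the standard library via the identity Φ(z) = (1 + erf(z/√2))/2; the sketch below checks both parts of the exercise:

    from math import erf, sqrt

    def Phi(z):
        # standard Normal c.d.f. via the error function
        return (1 + erf(z / sqrt(2))) / 2

    print(Phi(0.7))       # ~ 0.758, the fraction scoring below 550
    print(0.7 * 6 + 18)   # comparable ACT cutoff a = 22.2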
13. (6 pts) Suppose that a random variable X has a probability density
function given by
f(x) = c x^3 e^{−x/2} 1_{[0,∞)}(x).
• Find the value of c that makes f a valid p.d.f. (hint: if you think
about this carefully, no calculations are needed!)
• Does X have a Chi-Square distribution? If so, with how many
degrees of freedom?
• What are E(X) and V (X)?
Solution:
• Given that f is a bona fide p.d.f., c has to be exactly the positive constant that makes the kernel of the p.d.f.,

x^3 e^{−x/2} 1_{[0,∞)}(x),

integrate to 1. We recognize this as the kernel of a Gamma p.d.f. with parameters α = 4 and β = 2. It follows that c must be the number

c = 1/(β^α Γ(α)) = 1/(2^4 Γ(4)) = 1/(16 · 3!) = 1/96.
• Recall that Gamma(ν/2, 2) ≡ χ2 (ν). Therefore, in this case,
ν = 8 and the above p.d.f. corresponds to a Chi-Square p.d.f.
with ν = 8 degrees of freedom.
• We have E(X) = ν = 8 and V (X) = 2ν = 16.
14. (8 pts) Consider the random variable X ∼ Exponential(β). Show that
the distribution of X has the memoryless property.
Solution:
Let s, t ≥ 0. We have that

P(X > s + t | X > s) = P(X > s + t)/P(X > s) = (1 − F(s + t))/(1 − F(s)) = e^{−(s+t)/β}/e^{−s/β} = e^{−t/β} = P(X > t).
Therefore, X has the memoryless property.
15. (6 pts) An interesting property of the Exponential distribution is that
its associated hazard function is constant on [0, ∞). The hazard function associated to a density f is defined as the function
h(x) = f(x)/(1 − F(x)).   (38)
One way to interpret the hazard function is in terms of the failure rate of a certain component. In this sense, h can be thought of as the ratio between the probability density of failure at time x, f(x), and the probability that the component is still working at that time, which is P(X > x) = 1 − F(x).
Show that if X ∼ Exponential(β) is a random variable describing the
time at which a particular component fails to work, then the failure
rate of X is constant (i.e. h does not depend on x).
Solution:
For x > 0 we have

h(x) = f(x)/(1 − F(x)) = ((1/β) e^{−x/β})/(1 − (1 − e^{−x/β})) = 1/β.
16. (EXTRA CREDIT, 10 pts) Consider the Binomial p.m.f.
p(x) = C(n, x) p^x (1 − p)^{n−x} 1_{{0,1,...,n}}(x)

with parameters n and p. Show that

lim_{n→∞} p(x) = e^{−λ} (λ^x/x!) 1_{{0,1,...}}(x),
where the limit is taken as n → ∞, p → 0, and np → λ > 0.
Solution:
Take any x ∈ {0, 1, . . . , n}. We have

p(x) = C(n, x) p^x (1 − p)^{n−x} = (n!/(x!(n − x)!)) (np/n)^x (1 − np/n)^{n−x}
= (n(n − 1) · · · (n − x + 1)/n^x) (1/x!) (np)^x (1 − np/n)^n (1 − np/n)^{−x}.

Now,

n(n − 1) · · · (n − x + 1)/n^x → 1 as n → ∞,
(np)^x → λ^x as np → λ,
(1 − np/n)^n → (1 − λ/n)^n as np → λ, and (1 − λ/n)^n → e^{−λ} as n → ∞,
(1 − np/n)^{−x} → (1 − λ/n)^{−x} as np → λ, and (1 − λ/n)^{−x} → 1 as n → ∞.

Thus,

p(x) → 1 · (1/x!) · λ^x · e^{−λ} · 1 = e^{−λ} λ^x/x!,

which is the p.m.f. of a Poisson distribution with parameter λ.
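A small numerical illustration of this limit (our own sketch, not part of the exercise): holding λ = np fixed and letting n grow, the Binomial p.m.f. values approach the Poisson p.m.f.

    from math import comb, exp, factorial

    lam, x = 2.0, 3
    for n in (10, 100, 10_000):
        p = lam / n
        # Binomial(n, lam/n) p.m.f. at x
        print(n, comb(n, x) * p**x * (1 - p)**(n - x))
    print("Poisson limit:", exp(-lam) * lam**x / factorial(x))   # ~ 0.1804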
17. (EXTRA CREDIT, 10 pts) Consider the Normal p.d.f.

f(x) = (1/√(2πσ^2)) e^{−(x−µ)^2/(2σ^2)}

with parameters µ ∈ R and σ^2 > 0. Show that µ − σ and µ + σ correspond to the inflection points of f.
Solution:
First, we need to compute the second derivative of f. We have

f′(x) = −(1/√(2πσ^2)) ((x − µ)/σ^2) e^{−(x−µ)^2/(2σ^2)}

and therefore

f′′(x) = −(1/√(2πσ^2)) (1/σ^2) [e^{−(x−µ)^2/(2σ^2)} − ((x − µ)^2/σ^2) e^{−(x−µ)^2/(2σ^2)}]
= −(1/√(2πσ^2)) (1/σ^2) e^{−(x−µ)^2/(2σ^2)} [1 − (x − µ)^2/σ^2].

Notice that

−(1/√(2πσ^2)) (1/σ^2) e^{−(x−µ)^2/(2σ^2)} < 0

for any x ∈ R, therefore the sign of f′′ depends on

1 − (x − µ)^2/σ^2 = (σ^2 − (x − µ)^2)/σ^2.

Clearly, (σ^2 − (x − µ)^2)/σ^2 ≥ 0 for |x − µ| ≤ σ ⇐⇒ µ − σ ≤ x ≤ µ + σ, (σ^2 − (x − µ)^2)/σ^2 ≤ 0 for |x − µ| ≥ σ ⇐⇒ x ≥ µ + σ ∨ x ≤ µ − σ, and (σ^2 − (x − µ)^2)/σ^2 = 0 for x = µ ± σ. Therefore x = µ ± σ are the two inflection points of f.
18. (EXTRA CREDIT, 10 pts) Consider the random variable X ∼ Beta(α, β) with α, β ∈ Z+. Show that, for n ∈ Z+,

E(X^{n+1})/E(X^n) = (α + n)/(α + β + n).

Use this result to show that

E(X^n) = ∏_{i=0}^{n−1} (α + i)/(α + β + i).
Solution:
For the first claim, fix an arbitrary n ∈ Z+. Then

E(X^n) = (Γ(α + β)/(Γ(α)Γ(β))) ∫_0^1 x^n x^{α−1} (1 − x)^{β−1} dx = (Γ(α + β)/(Γ(α)Γ(β))) ∫_0^1 x^{α+n−1} (1 − x)^{β−1} dx.

Now,

x^{α+n−1} (1 − x)^{β−1} 1_{[0,1]}(x)

is the kernel of a Beta(α + n, β) p.d.f., hence its integral on [0, 1] must be equal to

∫_0^1 x^{α+n−1} (1 − x)^{β−1} dx = Γ(α + n)Γ(β)/Γ(α + β + n).

Therefore,

E(X^n) = (Γ(α + β)/(Γ(α)Γ(β))) (Γ(α + n)Γ(β)/Γ(α + β + n)) = Γ(α + β)Γ(α + n)/(Γ(α)Γ(α + β + n)).

Then,

E(X^{n+1})/E(X^n) = (Γ(α + β)Γ(α + n + 1)/(Γ(α)Γ(α + β + n + 1))) · (Γ(α)Γ(α + β + n)/(Γ(α + β)Γ(α + n)))
= (Γ(α + n + 1)/Γ(α + n)) · (Γ(α + β + n)/Γ(α + β + n + 1))
= ((α + n)Γ(α + n)/Γ(α + n)) · (Γ(α + β + n)/((α + β + n)Γ(α + β + n)))
= (α + n)/(α + β + n).
For the second claim, it is enough to notice that, since E(X^0) = 1, the product telescopes:

E(X^n) = ∏_{i=0}^{n−1} E(X^{i+1})/E(X^i).

From the previous result, we have that

E(X^{i+1})/E(X^i) = (α + i)/(α + β + i),

thus

E(X^n) = ∏_{i=0}^{n−1} E(X^{i+1})/E(X^i) = ∏_{i=0}^{n−1} (α + i)/(α + β + i).
Lecture 8
Recommended readings: WMS, sections 5.1 → 5.4
Multivariate Probability Distributions
So far we have focused on univariate probability distributions, i.e. probability distributions for a single random variable. However, when we discussed
independence of random variables in Lecture 6, we introduced the notion of
joint c.d.f., joint p.m.f., and joint p.d.f. for a collection of n random variables
X1 , . . . , Xn . In this lecture we will elaborate more on these objects.
Let X1 , . . . , Xn be a collection of n random variables.
• Regardless of whether they are discrete or continuous, we denote by
FX1 ,...,Xn their joint c.d.f., i.e. the function
FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ≤ x1 ∩ · · · ∩ Xn ≤ xn ).
• If they are all discrete, we denote by pX1 ,...,Xn their joint p.m.f., i.e.
the function
pX1 ,...,Xn (x1 , . . . , xn ) = P (X1 = x1 ∩ · · · ∩ Xn = xn ).
• If they are all continuous, we denote by fX1 ,...,Xn their joint p.d.f..
The above functions satisfy properties that are similar to those satisfied by
their univariate counterparts.
• The joint c.d.f. FX1 ,...,Xn satisfies:
– FX1 ,...,Xn (x1 , . . . , xn ) ∈ [0, 1] for any x1 , . . . , xn ∈ R.
– FX1 ,...,Xn is monotonically non-decreasing in each of its variables
– FX1 ,...,Xn is right-continuous with respect to each of its variables
– limxi →−∞ FX1 ,...,Xn (x1 , . . . , xi , . . . , xn ) = 0 for any i ∈ {1, . . . , n}
– limx1 →+∞,...,xn →+∞ FX1 ,...,Xn (x1 , . . . , xn ) = 1
• The joint p.m.f. satisfies:
– pX1 ,...,Xn (x1 , . . . , xn ) ∈ [0, 1] for any x1 , . . . , xn ∈ R.
– ∑_{x1 ∈ supp(X1)} · · · ∑_{xn ∈ supp(Xn)} pX1,...,Xn(x1, . . . , xn) = 1.
• The joint p.d.f. satisfies:
– fX1,...,Xn(x1, . . . , xn) ≥ 0 for any x1, . . . , xn ∈ R.
– ∫_R · · · ∫_R fX1,...,Xn(x1, . . . , xn) dx1 . . . dxn = ∫_{supp(X1)} · · · ∫_{supp(Xn)} fX1,...,Xn(x1, . . . , xn) dxn . . . dx1 = 1.
Furthermore we have

FX1,...,Xn(x1, . . . , xn) = ∑_{y1 ≤ x1, y1 ∈ supp(X1)} · · · ∑_{yn ≤ xn, yn ∈ supp(Xn)} p(y1, . . . , yn)

and

FX1,...,Xn(x1, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX1,...,Xn(y1, . . . , yn) dyn . . . dy1.
Exercise in class:
You are given the following bivariate p.m.f.:

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
What is P (X1 ≤ 1 ∩ X2 ≤ 0) = FX1 ,X2 (1, 0)? We have
FX1,X2(1, 0) = pX1,X2(0, −1) + pX1,X2(0, 0) = 1/8 + 1/4 = 3/8.
GRAPH
Exercise in class:
You are given the following bivariate p.d.f.:
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2)
• What is P (X1 ≤ 1 ∩ X2 > 5)?
• What is P (X1 + X2 ≤ 3)?
We have
P(X1 ≤ 1 ∩ X2 > 5) = ∫_0^1 ∫_5^∞ fX1,X2(x1, x2) dx2 dx1 = ∫_0^1 ∫_5^∞ e^{−(x1+x2)} dx2 dx1
= ∫_0^1 e^{−x1} dx1 · ∫_5^∞ e^{−x2} dx2 = [−e^{−x1}]_0^1 · [−e^{−x2}]_5^∞ = (1 − e^{−1}) e^{−5},
and
GRAPH
P(X1 + X2 ≤ 3) = ∫_0^3 ∫_0^{3−x1} e^{−(x1+x2)} dx2 dx1 = ∫_0^3 e^{−x1} ∫_0^{3−x1} e^{−x2} dx2 dx1
= ∫_0^3 e^{−x1} [−e^{−x2}]_0^{3−x1} dx1 = ∫_0^3 e^{−x1} (1 − e^{x1−3}) dx1
= ∫_0^3 (e^{−x1} − e^{−3}) dx1 = [−e^{−x1}]_0^3 − 3e^{−3} = 1 − 4e^{−3}.
Marginal Distributions
Given a collection of random variables X1 , . . . , Xn and their joint distribution, how can we derive the marginal distribution of only one of them, say
Xi? The idea is to sum or integrate the joint distribution over all the variables except Xi.
GRAPH
Thus, given pX1,...,Xn we have that

pXi(xi) = ∑_{y1 ∈ supp(X1)} · · · ∑_{yi−1 ∈ supp(Xi−1)} ∑_{yi+1 ∈ supp(Xi+1)} · · · ∑_{yn ∈ supp(Xn)} pX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn),
and given fX1,...,Xn we have

fXi(xi) = ∫_R · · · ∫_R fX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn) dyn . . . dyi+1 dyi−1 . . . dy1
= ∫_{supp(X1)} · · · ∫_{supp(Xi−1)} ∫_{supp(Xi+1)} · · · ∫_{supp(Xn)} fX1,...,Xn(y1, . . . , yi−1, xi, yi+1, . . . , yn) dyn . . . dyi+1 dyi−1 . . . dy1.
Exercise in class:
Consider again the bivariate p.m.f.

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
Derive the marginal p.m.f. of X2 .
Notice first that supp(X2) = {−1, 0, 1}. We have

pX2(−1) = pX1,X2(0, −1) + pX1,X2(2, −1) = 1/8 + 1/4 = 3/8,
pX2(0) = pX1,X2(0, 0) + pX1,X2(2, 0) = 1/4 + 1/4 = 1/2,
pX2(1) = pX1,X2(0, 1) = 1/8.

Thus, pX2(x2) = 3/8 if x2 = −1; 1/2 if x2 = 0; 1/8 if x2 = 1; and 0 if x2 ∉ {−1, 0, 1}.
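Computationally, marginalizing a discrete joint p.m.f. is just a sum over the other variable; here is a minimal sketch for the table above (using a plain Python dictionary of our own making):

    joint = {(0, -1): 1/8, (0, 0): 1/4, (0, 1): 1/8,
             (2, -1): 1/4, (2, 0): 1/4}

    p_X2 = {}
    for (x1, x2), p in joint.items():
        # sum the joint p.m.f. over x1 for each value of x2
        p_X2[x2] = p_X2.get(x2, 0) + p

    print(p_X2)   # {-1: 0.375, 0: 0.5, 1: 0.125}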
Exercise in class:
Consider again the bivariate p.d.f.
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2).
Derive the marginal p.d.f. of X1 .
Notice first that supp(X1 ) = [0, ∞). For x1 ∈ [0, ∞) we have
fX1(x1) = ∫_R fX1,X2(x1, x2) dx2 = e^{−x1} ∫_0^∞ e^{−x2} dx2 = e^{−x1}.

Thus,

fX1(x1) = e^{−x1} 1_{[0,∞)}(x1).
Conditional Distributions
We will limit our discussion here at the bivariate case, but all that follows
is easily extended to more than two random variables.
Suppose that we are given a pair of random variables X1 , X2 and we
want to compute probabilities of the type
P (X1 ∈ A1 |X2 = x2 )
for a particular fixed value x2 of X2 . To do this, we need either the conditional p.m.f. (if X1 is discrete) or the conditional p.d.f. (if X1 is continuous)
of X1 given X2 = x2 . By definition, we have
pX1|X2=x2(x1) = pX1,X2(x1, x2)/pX2(x2),
fX1|X2=x2(x1) = fX1,X2(x1, x2)/fX2(x2).
Notice the two following important facts:

1. pX1|X2=x2 and fX1|X2=x2 are not well-defined if x2 is such that pX2(x2) = 0 or fX2(x2) = 0, respectively (i.e. x2 ∉ supp(X2))

2. given x2 ∈ supp(X2), pX1|X2=x2 and fX1|X2=x2 are null whenever pX1,X2(x1, x2) = 0 and fX1,X2(x1, x2) = 0, respectively.
So whenever you are computing a conditional distribution, it is good practice to 1) determine the support of the conditioning variable X2 and clearly
state that the conditional distribution that you are about to compute is only
well-defined for x2 ∈ supp(X2 ) and 2) given x2 ∈ supp(X2 ), clarify for which
values of x1 ∈ R the conditional distribution pX1|X2=x2 or fX1|X2=x2 is null.
Exercise in class:
Consider again the bivariate p.m.f.

pX1,X2(x1, x2) = 1/8 if (x1, x2) = (0, −1); 1/4 if (x1, x2) = (0, 0); 1/8 if (x1, x2) = (0, 1); 1/4 if (x1, x2) = (2, −1); 1/4 if (x1, x2) = (2, 0); and 0 otherwise.
Derive the conditional distribution of X1 given X2 .
First of all notice that the conditional p.m.f. of X1 given X2 = x2 is
only well-defined for x2 ∈ supp(X2 ) = {−1, 0, 1}. Then we have
pX1|X2=−1(x1) = pX1,X2(x1, −1)/pX2(−1) = (1/8)/(3/8) = 1/3 if x1 = 0; (1/4)/(3/8) = 2/3 if x1 = 2; and 0 if x1 ∉ {0, 2}.

pX1|X2=0(x1) = pX1,X2(x1, 0)/pX2(0) = (1/4)/(1/2) = 1/2 if x1 = 0; (1/4)/(1/2) = 1/2 if x1 = 2; and 0 if x1 ∉ {0, 2}.

pX1|X2=1(x1) = pX1,X2(x1, 1)/pX2(1) = (1/8)/(1/8) = 1 if x1 = 0; and 0 if x1 ≠ 0.
Exercise in class:
Consider again the bivariate p.d.f.
fX1,X2(x1, x2) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2).
Derive the conditional p.d.f. of X2 given X1 .
First of all notice that the conditional p.d.f. of X2 given X1 = x1 is only well-defined for x1 ∈ supp(X1) = [0, ∞). For x1 ∈ [0, ∞), we have
fX2|X1=x1(x2) = fX1,X2(x1, x2)/fX1(x1) = e^{−(x1+x2)} 1_{[0,∞)×[0,∞)}(x1, x2)/(e^{−x1} 1_{[0,∞)}(x1)) = e^{−(x1+x2)} 1_{[0,∞)}(x2)/e^{−x1} = e^{−x2} 1_{[0,∞)}(x2).
Question: what family of distributions do X1 and X2 belong to?
What are the parameter values?
Question: can we say something about whether or not X1 and X2
are independent in this case?
A Test for the Independence of Two Random Variables
Suppose that you are given (X1 , X2 ) ∼ fX1 ,X2 . If
1. the support of fX1 ,X2 is a ‘rectangular’ region and
2. fX1 ,X2 (x1 , x2 ) = fX1 (x1 )fX2 (x2 ) for all x1 , x2 ∈ R
then X1 and X2 are independent. The same is true for the case where
(X1 , X2 ) is discrete, after obvious changes to 1). Notice that 1) can be conveniently captured by appropriately using indicator functions.
Exercise in class:
Consider the following bivariate p.d.f.:
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
Are X1 and X2 independent?
A Note on Independence and Normalizing Constants
Frequently, the p.d.f. (and the same holds true for a p.m.f.) of a pair of
random variables takes the form
fX1,X2(x1, x2) = c g(x1) h(x2)   (39)
where c > 0 is a constant, g is a function depending only on x1 and h is a
function depending only on x2 . If this is the case, the two random variables
X1 and X2 are independent and g and h are the kernels of the p.d.f. of
X1 and X2 respectively. Thus, by simply inspecting g and h we can guess
to which family of distributions X1 and X2 belong to. We do not need to
worry about the constant c, as we know that if fX1 ,X2 is a bona fide p.d.f.,
then c = c1 c2 is exactly equal to the product of the normalizing constants
c1 and c2 that satisfy
c1 ∫_R g(x) dx = c2 ∫_R h(x) dx = 1.   (40)
This easily generalizes to a collection of n > 2 random variables.
Lecture 9
Recommended readings: WMS, sections 5.5 → 5.8
Expected Value in the Context of Bivariate Distributions (part
1)
In Lecture 5, we introduced the concept of expectation or expected value of
a random variable. We will now introduce other operators that are based on
the expected value operator and we will investigate how they act on bivariate
distributions. While we focus on bivariate distributions, it is worthwhile to
keep in mind that all that we discuss in this Lecture can be extended to
collections of random variables X1 , . . . , Xn with n > 2.
For a function g : R^2 → R, (x1, x2) ↦ g(x1, x2), and a pair of random variables (X1, X2) with joint p.m.f. pX1,X2 (if discrete) or joint p.d.f. fX1,X2 (if continuous), the expected value of g(X1, X2) is defined as

E(g(X1, X2)) = ∑_{x1 ∈ supp(X1)} ∑_{x2 ∈ supp(X2)} g(x1, x2) pX1,X2(x1, x2) (discrete case),

E(g(X1, X2)) = ∫_R ∫_R g(x1, x2) fX1,X2(x1, x2) dx1 dx2 (continuous case).
Expectation, Variance and Independence
In Lecture 5, we pointed out that in general, given two random variables X1
and X2 , it is not true that E(X1 X2 ) = E(X1 )E(X2 ). We said, however, that
it is true as soon as X1 and X2 are independent. Let’s see why. Without loss
of generality, assume that X1 and X2 are continuous (for the discrete case,
just change integration into summation as usual). If they are independent,
fX1 ,X2 = fX1 fX2 . Take g(x1 , x2 ) = x1 x2 . Then, we have
E(X1 X2) = ∫_R ∫_R x1 x2 fX1,X2(x1, x2) dx1 dx2 = ∫_R ∫_R x1 x2 fX1(x1) fX2(x2) dx1 dx2
= (∫_R x1 fX1(x1) dx1)(∫_R x2 fX2(x2) dx2) = E(X1) E(X2).
The above argument is easily extended to show that for any two functions g1, g2 : R → R, if X1 and X2 are independent, then E(g1(X1) g2(X2)) = E(g1(X1)) E(g2(X2)).
Exercise in class:
Consider the following bivariate p.d.f.:
fX1,X2(x1, x2) = (3/2) e^{−3x1} 1_{[0,∞)×[0,2]}(x1, x2).
Are X1 and X2 independent? Why? What are their marginal distributions? Compute E(3X1X2 + 3).
Exercise in class:
Consider the pair of random variables X1 and X2 with joint p.d.f.
fX1,X2(x1, x2) = (1/8) x1 e^{−(x1+x2)/2} 1_{[0,∞)×[0,∞)}(x1, x2)
and consider the function g(x, y) = y/x. What is E(g(X1 , X2 ))?
First of all, notice that X1 and X2 are independent. Therefore, we know
already that E(g(X1 , X2 )) = E(X2 /X1 ) = E(X2 )E(1/X1 ). Moreover, the
kernel of the p.d.f. of X1 is

x1 e^{−x1/2} 1_{[0,∞)}(x1),

while that of the p.d.f. of X2 is

e^{−x2/2} 1_{[0,∞)}(x2).
Thus, X1 and X2 are distributed according to Gamma(2, 2) and Exponential(2)
respectively. It follows that E(X2 ) = β = 2. We only need to compute
E(1/X1 ). This is equal to
E(1/X1) = ∫_0^∞ (1/x1) fX1(x1) dx1 = ∫_0^∞ (1/x1) (1/4) x1 e^{−x1/2} dx1 = (1/4) ∫_0^∞ e^{−x1/2} dx1 = 2/4 = 1/2.
It follows that E(X2 /X1 ) = E(X2 )E(1/X1 ) = 2(1/2) = 1.
Back to Lecture 5 again. There we mentioned that for two random
variables X and Y, V(X + Y) ≠ V(X) + V(Y) usually, but that the result is true if X and Y are independent. Now, we can easily see that if X and Y are
independent and we take the functions g1 = g2 = g with g(X) = X − E(X),
by our previous results we have
V (X + Y ) = E[(X + Y − (E(X) + E(Y )))2 ] = E[((X − E(X)) + (Y − E(Y )))2 ]
= E[(X − E(X))2 + (Y − E(Y ))2 + 2(X − E(X))(Y − E(Y ))]
= E[(X − E(X))2 ] + E[(Y − E(Y ))2 ] + 2E(X − E(X))E(Y − E(Y ))
= V (X) + V (Y ) + 2 ∗ 0 ∗ 0 = V (X) + V (Y ).
(41)
Notice that we exploited the independence of X and Y together with the results above to deal with the expectation of the cross-product 2(X−E(X))(Y −
E(Y )) with g1 = g2 = g.
Question: what about V (X − Y )?
Covariance
The covariance of a pair of random variables X and Y is a measure of their
linear dependence. By definition, the covariance between X and Y is
Cov(X, Y ) = E[(X − E(X))(Y − E(Y ))] = E(XY ) − E(X)E(Y ).
(42)
It is easy to check that the covariance operator satisfies, for a, b, c, d ∈ R
Cov(a + bX, c + dY ) = bdCov(X, Y ).
Also, notice that Cov(X, Y ) = Cov(Y, X) and Cov(X, X) = V (X).
GRAPH
Question: suppose that X and Y are independent. What is Cov(X, Y )?
There exists a scaled version of the covariance which takes values in
[−1, 1]. This is the correlation between X and Y . The correlation between
X and Y is defined as
Cor(X, Y) = Cov(X, Y)/√(V(X)V(Y)).   (43)
It is easy to check that for a, b, c, d ∈ R with bd > 0, we have Cor(a + bX, c + dY) = Cor(X, Y) (while Cor(a + bX, c + dY) = −Cor(X, Y) if bd < 0). So, up to a sign, and unlike the covariance operator, the correlation operator is not affected by affine transformations of X and Y.
Beware that while we saw that X and Y are independent =⇒ Cov(X, Y ) =
0, the converse is not true (i.e. independence is stronger than uncorrelation). Consider the following example. Let X be a random variable with
E(X) = E(X 3 ) = 0. Consider Y = X 2 . Clearly, X and Y are not independent (in fact, Y is a deterministic function of X!). However,
Cov(X, Y ) = E(XY ) − E(X)E(Y ) = E(X ∗ X 2 ) − 0 = E(X 3 ) = 0.
So X and Y are uncorrelated, because Cor(X, Y ) = 0, but they are not
independent!
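A quick simulation makes this concrete: Z ∼ N(0, 1) satisfies E(Z) = E(Z^3) = 0, so Z and Z^2 fit the example above (a sketch using only the standard library):

    import random

    random.seed(0)
    z = [random.gauss(0, 1) for _ in range(100_000)]
    y = [zi**2 for zi in z]
    mz, my = sum(z)/len(z), sum(y)/len(y)
    # sample covariance of Z and Y = Z**2
    cov = sum((zi - mz)*(yi - my) for zi, yi in zip(z, y)) / len(z)
    print(cov)   # close to 0, even though Y is a function of Z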
Exercise in class:
Look back at equation (41). Without assuming that X and Y are independent, but rather only making the weaker assumption that X and Y are
uncorrelated (i.e. Cov(X, Y ) = 0), show that
V (X + Y ) = V (X − Y ) = V (X) + V (Y ).
From the discussion above it is clear that in general
V (X + Y ) = V (X) + V (Y ) + 2Cov(X, Y )
V (X − Y ) = V (X) + V (Y ) − 2Cov(X, Y ).
This generalizes as follows. Given a1, . . . , an ∈ R and X1, . . . , Xn,

V(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai^2 V(Xi) + ∑_{i=1}^{n} ∑_{j≠i} ai aj Cov(Xi, Xj) = ∑_{i=1}^{n} ai^2 V(Xi) + ∑_{1≤i<j≤n} 2 ai aj Cov(Xi, Xj).

If the random variables X1, . . . , Xn are all pairwise uncorrelated, then obviously the second summand above is null.
Exercise in class:
You are given three random variables X, Y, Z with V (X) = 1, V (Y ) = 4,
V (Z) = 3, Cov(X, Y ) = 0, Cov(X, Z) = 1, and Cov(Y, Z) = 1. Compute
V (3X + Y − 2Z).
We have
V (3X + Y − 2Z) = V (3X) + V (Y ) + V (−2Z) + 2Cov(3X, Y ) + 2Cov(3X, −2Z) + 2Cov(Y, −2Z)
= 9V (X) + V (Y ) + 4V (Z) + 6Cov(X, Y ) − 12Cov(X, Z) − 4Cov(Y, Z)
= 9 + 4 + 12 − 12 − 4 = 9.
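The general formula above translates directly into code; the sketch below (function name ours) reproduces the result of the exercise:

    def var_lincomb(a, V, C):
        # V(sum_i a_i X_i) with V[i] = V(X_i) and C[(i, j)] = Cov(X_i, X_j)
        n = len(a)
        return sum(a[i] * a[j] * (V[i] if i == j else C[(min(i, j), max(i, j))])
                   for i in range(n) for j in range(n))

    V = [1, 4, 3]                          # V(X), V(Y), V(Z)
    C = {(0, 1): 0, (0, 2): 1, (1, 2): 1}  # the given covariances
    print(var_lincomb([3, 1, -2], V, C))   # 9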
Homework 5 (due June 2nd)
1. (14 pts) Consider the following function:
f(x1, x2) = 6 x1^2 x2 if 0 ≤ x1 ≤ x2 ∧ x1 + x2 ≤ 2, and f(x1, x2) = 0 otherwise.
• Verify that f is a valid bivariate p.d.f.
• Suppose (X1 , X2 ) ∼ f . Are X1 and X2 independent random
variables?
• Compute P(X1 + X2 < 1)
• Compute the marginal p.d.f. of X1 . What family does this distribution belong to? What are the values of its parameters?
• Compute the marginal p.d.f. of X2
• Compute the conditional p.d.f. of X2 given X1
• Compute P (X2 < 1.1|X1 = 0.6).
Solution:
• It is clear that fX1,X2(x1, x2) ≥ 0 for any x1, x2 ∈ R. We just need to check that fX1,X2 integrates to 1:
∫_R ∫_R fX1,X2(x1, x2) dx1 dx2 = ∫_0^1 ∫_{x1}^{2−x1} 6 x1^2 x2 dx2 dx1 = 6 ∫_0^1 x1^2 ∫_{x1}^{2−x1} x2 dx2 dx1
= 3 ∫_0^1 x1^2 [x2^2]_{x1}^{2−x1} dx1 = 3 ∫_0^1 x1^2 [(2 − x1)^2 − x1^2] dx1 = 12 ∫_0^1 x1^2 (1 − x1) dx1
= 12 Γ(3)Γ(2)/Γ(5) = 12 · 2/4! = 24/24 = 1.
• X1 and X2 are not independent since the support of fX1 ,X2 (the
union of the green region and the red region in the figure) is not
a rectangular region.
• The region corresponding to the event X1 + X2 < 1 is the red
region in the figure. We have
P(X1 + X2 < 1) = ∫_0^{1/2} ∫_{x1}^{1−x1} 6 x1^2 x2 dx2 dx1 = 6 ∫_0^{1/2} x1^2 ∫_{x1}^{1−x1} x2 dx2 dx1
= 3 ∫_0^{1/2} x1^2 [(1 − x1)^2 − x1^2] dx1 = 3 ∫_0^{1/2} x1^2 (1 − 2x1) dx1
= [x1^3 − (3/2) x1^4]_0^{1/2} = 1/8 − 3/32 = 1/32.
• Notice that supp(X1 ) = [0, 1]. For x1 ∈ [0, 1] we thus have
fX1(x1) = ∫_{x1}^{2−x1} 6 x1^2 x2 dx2 = 3 x1^2 [x2^2]_{x1}^{2−x1} = 3 x1^2 [(2 − x1)^2 − x1^2] = 12 x1^2 (1 − x1),

which we recognize as the p.d.f. of a Beta(3, 2) distribution.
• Notice that supp(X2 ) = [0, 2]. For 0 ≤ x2 < 1 we have
fX2(x2) = ∫_0^{x2} 6 x1^2 x2 dx1 = 2 x2 [x1^3]_0^{x2} = 2 x2^4.
For 1 ≤ x2 ≤ 2 we have
fX2(x2) = ∫_0^{2−x2} 6 x1^2 x2 dx1 = 2 x2 [x1^3]_0^{2−x2} = 2 x2 (2 − x2)^3.
• The conditional p.d.f. of X2 given X1 = x1 is only well-defined
for x1 ∈ supp(X1 ) = [0, 1]. For x1 ∈ (0, 1),
fX2|X1=x1(x2) = fX1,X2(x1, x2)/fX1(x1) = 6 x1^2 x2 1_{[x1,2−x1]}(x2)/(12 x1^2 (1 − x1)) = (x2/(2(1 − x1))) 1_{[x1,2−x1]}(x2).
• We have
P(X2 < 1.1 | X1 = 0.6) = ∫_{−∞}^{1.1} fX2|X1=0.6(x2) dx2 = ∫_{−∞}^{1.1} (x2/(2(1 − 0.6))) 1_{[0.6,1.4]}(x2) dx2
= (1/1.6) [x2^2]_{0.6}^{1.1} = (1.21 − 0.36)/1.6 = 0.53125.
2. (12 pts) Consider the following bivariate p.m.f.:

pX1,X2(x1, x2) = 0.38 if (x1, x2) = (0, 0); 0.17 if (x1, x2) = (1, 0); 0.14 if (x1, x2) = (0, 1); 0.02 if (x1, x2) = (1, 1); 0.24 if (x1, x2) = (0, 2); and 0.05 if (x1, x2) = (1, 2).
• Compute the marginal p.m.f. of X1 and the marginal p.m.f. of
X2
• Compute pX2 |X1 =0 , the conditional p.m.f. of X2 given X1 = 0
• Compute P (X1 = 0|X2 = 2)?
Solution:
• Notice first that supp(X1) = {0, 1}. Thus, we have

pX1(0) = pX1,X2(0, 0) + pX1,X2(0, 1) + pX1,X2(0, 2) = 0.38 + 0.14 + 0.24 = 0.76,
pX1(1) = pX1,X2(1, 0) + pX1,X2(1, 1) + pX1,X2(1, 2) = 0.17 + 0.02 + 0.05 = 0.24,

and pX1(x1) = 0 if x1 ∉ {0, 1}.
Now notice that supp(X2) = {0, 1, 2}. We have

pX2(0) = pX1,X2(0, 0) + pX1,X2(1, 0) = 0.38 + 0.17 = 0.55,
pX2(1) = pX1,X2(0, 1) + pX1,X2(1, 1) = 0.14 + 0.02 = 0.16,
pX2(2) = pX1,X2(0, 2) + pX1,X2(1, 2) = 0.24 + 0.05 = 0.29,

and pX2(x2) = 0 if x2 ∉ {0, 1, 2}.

• We have

pX2|X1=0(x2) = pX1,X2(0, x2)/pX1(0) = 0.38/0.76 = 0.5 if x2 = 0; 0.14/0.76 ≈ 0.184 if x2 = 1; 0.24/0.76 ≈ 0.316 if x2 = 2; and 0 if x2 ∉ {0, 1, 2}.
• We have
P(X1 = 0 | X2 = 2) = P(X1 = 0 ∩ X2 = 2)/P(X2 = 2) = pX1,X2(0, 2)/pX2(2) = 0.24/0.29 ≈ 0.828.
3. (8 pts) Let X1 ∼ Uniform(0, 1) and, for 0 < x1 ≤ 1,
fX2|X1=x1(x2) = (1/x1) 1_{(0,x1]}(x2).
• What family does fX2 |X1 =x1 belong to? What are the values of
its parameters?
• Compute the joint p.d.f. of (X1 , X2 )
• Compute the marginal p.d.f. of X2 .
Solution:
• It is clear that fX2 |X1 =x1 for 0 < x1 ≤ 1 is a Uniform p.d.f. over
the interval (0, x1 ].
• We have

fX1,X2(x1, x2) = fX1(x1) fX2|X1=x1(x2) = 1_{(0,1]}(x1) (1/x1) 1_{(0,x1]}(x2) = (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2).
• We have, for x2 ∈ (0, 1],

fX2(x2) = ∫_{x2}^1 (1/x1) dx1 = [log(x1)]_{x2}^1 = −log(x2),

and fX2(x2) = 0 for x2 ∉ (0, 1].
4. (10 pts) Let X1 ∼ Binomial(n1 , p) and X2 ∼ Binomial(n2 , p). Assume
that X1 and X2 are independent. Consider W = X1 + X2 . One
can show that W ∼ Binomial(n1 + n2 , p). Use this result to show
that pX1 |W =w is a Hypergeometric p.m.f. with parameters r = n1 ,
N = n1 + n2 , and n = w.
Solution:
We have, for x1 ∈ {0, 1, . . . , w},

pX1|W=w(x1) = P(X1 = x1 | W = w) = P(X1 = x1 ∩ W = w)/P(W = w) = P(X1 = x1 ∩ X2 = w − x1)/P(W = w)
= P(X1 = x1) P(X2 = w − x1)/P(W = w)
= [C(n1, x1) p^{x1} (1 − p)^{n1−x1} · C(n2, w − x1) p^{w−x1} (1 − p)^{n2−w+x1}]/[C(n1 + n2, w) p^w (1 − p)^{n1+n2−w}]
= C(n1, x1) C(n2, w − x1)/C(n1 + n2, w),

and clearly pX1|W=w(x1) = 0 if x1 ∉ {0, 1, . . . , w}.
5. (10 pts) Suppose that you have a coin such that, if tossed, head appears
with probability p ∈ [0, 1] and tail occurs with probability 1−p. Person
A tosses the coin. The number of tosses until the first head appears for
person A is X1 . Then person B tosses the coin. The number of tosses
until the first head appears for person B is X2 . Assume that X1 and
X2 are independent. What is the probability of the event X1 = X2 ?
Solution:
Obviously, X1 ∼ Geometric(p) and X2 ∼ Geometric(p). Because X1 and X2 are independent, we have

pX1,X2(x1, x2) = pX1(x1) pX2(x2) = (1 − p)^{x1−1} p 1_{{1,2,...}}(x1) · (1 − p)^{x2−1} p 1_{{1,2,...}}(x2) = (1 − p)^{x1+x2−2} p^2 1_{{1,2,...}}(x1) 1_{{1,2,...}}(x2).
Then,

P(X1 = X2) = ∑_{x ∈ {1,2,...}} pX1,X2(x, x) = p^2 ∑_{x ∈ {1,2,...}} (1 − p)^{2x−2} = p^2 ∑_{x ∈ {1,2,...}} [(1 − p)^2]^{x−1} = p^2/(1 − (1 − p)^2) = p/(2 − p).
6. (8 pts) Let (X1 , X2 ) ∼ fX1 ,X2 with
fX1,X2(x1, x2) = (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2).
Compute E(X1 − X2 ).
Solution:
We have E(X1 − X2) = E(X1) − E(X2). First we compute the marginal p.d.f.’s. They are

fX1(x1) = ∫_R fX1,X2(x1, x2) dx2 = ∫_R (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2) dx2 = (1/x1) 1_{(0,1]}(x1) ∫_0^{x1} 1_{(0,x1]}(x2) dx2 = 1_{(0,1]}(x1)

and

fX2(x2) = ∫_R fX1,X2(x1, x2) dx1 = ∫_R (1/x1) 1_{{0<x2≤x1≤1}}(x1, x2) dx1 = 1_{(0,1]}(x2) ∫_{x2}^1 (1/x1) dx1 = −log(x2) 1_{(0,1]}(x2).

Notice that fX1 is the p.d.f. of Uniform(0, 1), therefore E(X1) = 1/2. On the other hand, integrating by parts,

E(X2) = −∫_0^1 x2 log(x2) dx2 = −([x2^2/2 · log(x2)]_0^1 − ∫_0^1 (x2/2) dx2) = [x2^2/4]_0^1 = 1/4.
Therefore E(X1 − X2 ) = E(X1 ) − E(X2 ) = 1/2 − 1/4 = 1/4.
7. (10 pts) Consider
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
• Are X1 and X2 independent?
• Are X1 and X2 correlated (positively or negatively) or uncorrelated?
Solution:
• X1 and X2 are not independent. We can see this in two ways:
1) we draw the support of their joint p.d.f. (which is the set
indicated by the indicator function; the red triangle in the figure)
and we easily see that it does not correspond to a rectangular
region; 2) we rewrite the indicator function as
1{0≤x1 ≤x2 ≤1} (x1 , x2 ) = 1[0,x2 ] (x1 )1[x1 ,1] (x2 )
and then we easily see that there is no way to factorize the joint
p.d.f. into two functions g and h such that g only depends on x1
and h only depends on x2 .
• To answer the second question, we need to compute Cov(X1 , X2 ).
For that we need E(X1 ), E(X2 ), and E(X1 X2 ). We have
E(X1 X2) = ∫_R ∫_R x1 x2 fX1,X2(x1, x2) dx1 dx2 = ∫_0^1 ∫_0^{x2} x1 x2 · 6(1 − x2) dx1 dx2
= 6 ∫_0^1 x2 (1 − x2) ∫_0^{x2} x1 dx1 dx2 = 3 ∫_0^1 x2 (1 − x2) [x1^2]_0^{x2} dx2
= 3 ∫_0^1 x2^3 (1 − x2) dx2 = 3 Γ(4)Γ(2)/Γ(6) = 3 · 3!/5! = 18/120 = 3/20.
Then,
E(X1) = ∫_0^1 x1 ∫_{x1}^1 6(1 − x2) dx2 dx1 = 6 ∫_0^1 x1 [x2 − x2^2/2]_{x1}^1 dx1
= 6 ∫_0^1 (x1/2 − x1^2 + x1^3/2) dx1 = 6 (1/4 − 1/3 + 1/8) = 1/4
and
E(X2) = ∫_0^1 6 x2 (1 − x2) ∫_0^{x2} dx1 dx2 = 6 ∫_0^1 x2^2 (1 − x2) dx2 = 6 Γ(3)Γ(2)/Γ(5) = 6 · 2/4! = 12/24 = 1/2.

Thus, Cov(X1, X2) = E(X1 X2) − E(X1) E(X2) = 3/20 − (1/4)(1/2) = 3/20 − 1/8 = 1/40. Since the covariance is positive, X1 and X2 are positively correlated and therefore not independent.
8. (10 pts) Let X1 and X2 be uncorrelated random variables. Consider
Y1 = X1 + X2 and Y2 = X1 − X2 .
• Express Cov(Y1 , Y2 ) in terms of V (X1 ) and V (X2 )
• Give an expression for Cor(Y1 , Y2 )
• Is it possible that Cor(Y1 , Y2 ) = 0? If so, when does this happen?
Solution:
• We have
Cov(Y1 , Y2 ) = Cov(X1 + X2 , X1 − X2 )
= Cov(X1 , X1 ) − Cov(X1 , X2 ) + Cov(X2 , X1 ) − Cov(X2 , X2 )
= V (X1 ) − V (X2 ).
• By definition
Cor(Y1, Y2) = Cov(Y1, Y2)/√(V(Y1)V(Y2)).
We have
V(Y1) = V(X1 + X2) = V(X1) + V(X2) + 2Cov(X1, X2) = V(X1) + V(X2) + 0 = V(X1) + V(X2)

and

V(Y2) = V(X1 − X2) = V(X1) + V(X2) − 2Cov(X1, X2) = V(X1) + V(X2) − 0 = V(X1) + V(X2),

thus

Cor(Y1, Y2) = Cov(Y1, Y2)/√(V(Y1)V(Y2)) = (V(X1) − V(X2))/(V(X1) + V(X2)).
• If V (X1 ) > V (X2 ), then Y1 and Y2 are positively correlated.
If V (X1 ) < V (X2 ), then Y1 and Y2 are negatively correlated.
Finally, If V (X1 ) = V (X2 ), then Y1 and Y2 are uncorrelated.
9. (10 pts) Let
fX1(x1) = (1/6) x1^3 e^{−x1} 1_{[0,∞)}(x1)

and

fX2(x2) = (1/2) e^{−x2/2} 1_{[0,∞)}(x2).
Let Y = X1 − X2 .
• Compute E(Y )
• Let Cov(X1 , X2 ) = γ. Compute V (Y )
• Suppose that X1 and X2 are uncorrelated. What is V (Y ) in this
case?
Solution:
• It is clear from the inspection of the two p.d.f.’s that X1 ∼
Gamma(4, 1) and X2 ∼ Exponential(2). Thus, E(Y ) = E(X1 −
X2 ) = E(X1 ) − E(X2 ) = 4 − 2 = 2.
• V (Y ) = V (X1 − X2 ) = V (X1 ) + V (X2 ) − 2Cov(X1 , X2 ) = 4 +
4 − 2γ = 2(4 − γ).
• If X1 and X2 are uncorrelated, then γ = 0 and V (Y ) = 8.
10. (8 pts) Use the definition of variance to show that V (X ±Y ) = V (X)+
V (Y ) ± 2Cov(X, Y ).
Solution: We have
V(X ± Y) = E[(X ± Y)^2] − [E(X ± Y)]^2
= E(X^2 + Y^2 ± 2XY) − [E(X)]^2 − [E(Y)]^2 ∓ 2E(X)E(Y)
= E(X^2) − [E(X)]^2 + E(Y^2) − [E(Y)]^2 ± 2E(XY) ∓ 2E(X)E(Y)
= V(X) + V(Y) ± 2[E(XY) − E(X)E(Y)] = V(X) + V(Y) ± 2Cov(X, Y).
11. (EXTRA CREDIT, 12 pts) Let Z ∼ N(0, 1). Show that for any odd n ≥ 1, E(Z^n) = 0.
Solution: Let f denote the p.d.f. associated to N(0, 1). Notice that f is an even function, i.e. f(z) = f(−z) for any z ∈ R. We have

E(Z^n) = ∫_R z^n f(z) dz = ∫_{−∞}^0 z^n f(z) dz + ∫_0^{+∞} z^n f(z) dz
= ∫_{−∞}^0 z^n f(−z) dz + ∫_0^{+∞} z^n f(z) dz
= ∫_{−∞}^0 −((−z)^n) f(−z) dz + ∫_0^{+∞} z^n f(z) dz,

using that z^n = −(−z)^n for odd n. Make the change of variable x = −z in the first integral. We get

∫_{−∞}^0 −((−z)^n) f(−z) dz + ∫_0^{+∞} z^n f(z) dz = −∫_0^{+∞} x^n f(x) dx + ∫_0^{+∞} z^n f(z) dz = 0.
12. (EXTRA CREDIT, 15 pts) Show that for two random variables X
and Y , Cor(X, Y ) ∈ [−1, 1].
Hint: consider X′ = X − E(X) and Y′ = Y − E(Y) and study the quadratic function h(t, X′, Y′) = (tX′ − Y′)^2 as a function of t. Notice that h(t, X′, Y′) ≥ 0, therefore E(h(t, X′, Y′)) ≥ 0, which implies that the discriminant of E(h(t, X′, Y′)) as a quadratic function of t must be non-positive! Try to use this fact to prove the claim that Cor(X, Y) ∈ [−1, 1].
Solution: We have E(h(t, X′, Y′)) ≥ 0. This means

E[(tX′ − Y′)^2] = E(t^2 X′^2 + Y′^2 − 2t X′Y′) = t^2 E(X′^2) − 2t E(X′Y′) + E(Y′^2) ≥ 0.

This can only happen if the discriminant is non-positive, i.e. if

4[E(X′Y′)]^2 − 4 E(X′^2) E(Y′^2) ≤ 0,

i.e.

[E(X′Y′)]^2 ≤ E(X′^2) E(Y′^2),

or, equivalently,

[E(X′Y′)]^2/(E(X′^2) E(Y′^2)) ≤ 1.

Notice that

[E(X′Y′)]^2/(E(X′^2) E(Y′^2)) = [E((X − E(X))(Y − E(Y)))]^2/(E((X − E(X))^2) E((Y − E(Y))^2)) = [Cov(X, Y)]^2/(V(X)V(Y)) = [Cor(X, Y)]^2.

Thus, it follows that |Cor(X, Y)| ≤ 1 or, equivalently, −1 ≤ Cor(X, Y) ≤ 1.
Lecture 10
Recommended readings: WMS, section 5.11
Expected Value in the Context of Bivariate Distributions (part 2)
Using conditional distributions, we can define the notion of conditional expectation, conditional variance, and conditional covariance.
Conditional Expectation
Given a conditional p.m.f. pX1|X2=x2 or a conditional p.d.f. fX1|X2=x2, the
conditional expectation of X1 given X2 = x2 is defined as
E(X1 | X2 = x2) = Σ_{x1∈supp(X1)} x1 pX1|X2=x2(x1)    (44)
in the discrete case and as
E(X1 | X2 = x2) = ∫_{x1∈supp(X1)} x1 fX1|X2=x2(x1) dx1    (45)
in the continuous case.
It is clear from the definition that the conditional expectation is a function of the value of X2 . Since X2 is a random variable, this means that
the conditional expectation (despite its name) is itself a random variable!
This is a fundamental difference with respect to the standard concept of
expectation for a random variable (for instance E(X1 )) which is just a constant depending on the distribution of X1 . In terms of notation, we often
denote the random variable corresponding to the conditional expectation of
X1 given X2 simply as E(X1 |X2 ).
It is easy to see from the definition of conditional expectation that
E(E(X1 |X2 )) = E(X1 ), where the outer expectation is taken with respect
to the probability distribution of X2 .
Note that if X1 and X2 are independent, then E(X1 |X2 ) = E(X1 ).
Conditional Variance
We can also define the conditional variance of X1 given X2 . We have
V(X1 | X2) = E[(X1 − E(X1|X2))² | X2] = E(X1² | X2) − [E(X1|X2)]².    (46)
Obtaining the unconditional variance from the conditional variance is a little
‘harder’ than obtaining the unconditional expectation from the conditional
expectation:
V (X1 ) = E[V (X1 |X2 )] + V [E(X1 |X2 )].
Note that if X1 and X2 are independent, then V (X1 |X2 ) = V (X1 ).
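Both identities above are easy to check numerically. Below is a minimal Monte Carlo sketch (our own addition, not part of the original notes); the toy model X2 ∼ Uniform(0, 1) with X1 | X2 = x2 ∼ N(x2, 1) is an arbitrary choice, made so that E(X1|X2) = X2 and V(X1|X2) = 1 are known in closed form.

```python
# A Monte Carlo sketch (assumed toy model, not from the notes) illustrating
# E(E(X1|X2)) = E(X1) and V(X1) = E[V(X1|X2)] + V[E(X1|X2)].
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x2 = rng.uniform(0.0, 1.0, size=n)      # X2 ~ Uniform(0, 1)
x1 = rng.normal(loc=x2, scale=1.0)      # X1 | X2 = x2 ~ N(x2, 1)

# Here E(X1|X2) = X2 and V(X1|X2) = 1, so the two formulas predict:
print(x1.mean(), x2.mean())             # both ≈ E(X1) = 1/2
print(x1.var(), 1.0 + x2.var())         # both ≈ V(X1) = 1 + 1/12
```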
Conditional Covariance
It is worthwhile mentioning that we can also define the conditional covariance between X1 and X2 given a third random variable X3. We have
Cov(X1 , X2 |X3 ) = E[(X1 − E(X1 |X3 ))(X2 − E(X2 |X3 ))|X3 ]
= E(X1 X2 |X3 ) − E(X1 |X3 )E(X2 |X3 ).
(47)
To get the unconditional covariance we have the following formula:
Cov(X1 , X2 ) = E[Cov(X1 , X2 |X3 )] + Cov[E(X1 |X3 ), E(X2 |X3 )].
(48)
Note that if X1 and X3 are independent, and X2 and X3 are independent,
then Cov(X1 , X2 |X3 ) = Cov(X1 , X2 ).
In terms of computation, everything is essentially unchanged. The only
difference is that we sum or integrate against the conditional p.m.f./conditional
p.d.f. rather than the marginal p.m.f./marginal p.d.f..
Exercise in class:
Consider again the p.d.f. of exercise 7 in Homework 5:
fX1 ,X2 (x1 , x2 ) = 6(1 − x2 )1{0≤x1 ≤x2 ≤1} (x1 , x2 ).
What is E(X1 |X2 )? Compute E(X1 ).
We first need to compute fX1|X2=x2 for x2 ∈ supp(X2). For x2 ∈ [0, 1]
we have
fX2(x2) = ∫_0^{x2} 6(1 − x2) dx1 = 6x2(1 − x2) 1[0,1](x2).
Notice that from the marginal p.d.f. of X2 we can see that X2 ∼ Beta(2, 2).
For x2 ∈ (0, 1) we have
fX1|X2=x2(x1) = fX1,X2(x1, x2) / fX2(x2) = 6(1 − x2)1[0,x2](x1) / (6x2(1 − x2)) = (1/x2) 1[0,x2](x1).
Notice that fX1|X2=x2 is the p.d.f. of a Uniform(0, x2) distribution. It follows
that E(X1|X2 = x2) = x2/2. We can write this more concisely (and in a
way that stresses more the fact that the conditional expectation is a random
variable!) as E(X1|X2) = X2/2.
We have E(X1) = E[E(X1|X2)] = E(X2/2) = E(X2)/2 = [2/(2 + 2)]/2 = 1/4.
Exercise in class:
Let N ∼ Poisson(λ) and let Y1, . . . , YN iid ∼ Gamma(α, β) with N
independent of each of the Yi's. Let T = Σ_{i=1}^N Yi. Compute E(T|N), E(T),
V(T|N), and V(T).
We have that
E(T | N = n) = E(Σ_{i=1}^N Yi | N = n) = Σ_{i=1}^n E(Yi | N = n) = Σ_{i=1}^n E(Yi) = nαβ.
Thus, E(T|N) = Nαβ. Now,
E(T) = E[E(T|N)] = E(Nαβ) = αβE(N) = αβλ.
The conditional variance is
V(T | N = n) = V(Σ_{i=1}^N Yi | N = n) = V(Σ_{i=1}^n Yi | N = n) = Σ_{i=1}^n V(Yi | N = n) = Σ_{i=1}^n V(Yi) = nαβ².
Thus, V(T|N) = Nαβ². The unconditional variance of T is then
V(T) = V[E(T|N)] + E[V(T|N)] = V(Nαβ) + E(Nαβ²)
= α²β²V(N) + αβ²E(N)
= α²β²λ + αβ²λ
= αβ²λ(1 + α).
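A quick simulation sketch of this exercise (our own addition; the parameter values are arbitrary). Note that numpy's gamma sampler uses a shape/scale convention that matches the Gamma(α, β) parameterization of these notes (mean αβ, variance αβ²).

```python
# Check E(T) = αβλ and V(T) = αβ²λ(1 + α) for T = Σ_{i=1}^N Yi,
# with N ~ Poisson(λ) and Yi iid ~ Gamma(α, β) independent of N.
import numpy as np

rng = np.random.default_rng(1)
lam, alpha, beta = 3.0, 2.0, 1.5          # arbitrary illustrative values
reps = 50_000
N = rng.poisson(lam, size=reps)
T = np.array([rng.gamma(alpha, beta, size=k).sum() for k in N])

print(T.mean(), alpha * beta * lam)                   # ≈ αβλ
print(T.var(), alpha * beta**2 * lam * (1 + alpha))   # ≈ αβ²λ(1+α)
```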
Exercise in class:
Let Q ∼ Uniform(0, 1) and Y |Q ∼ Binomial(n, Q). Compute E(Y ) and
V (Y ).
We have
E(Y) = E[E(Y|Q)] = E(nQ) = nE(Q) = n/2
and
V(Y) = V[E(Y|Q)] + E[V(Y|Q)] = V(nQ) + E(nQ(1 − Q))
= n²V(Q) + nE(Q − Q²) = n²/12 + nE(Q) − nE(Q²)
= n²/12 + n/2 − n(V(Q) + [E(Q)]²)
= n²/12 + n/2 − n/12 − n/4
= n²/12 + n/6 = (n/6)(n/2 + 1).
Lecture 11
Recommended readings: WMS, sections 6.1 → 6.4
Functions of Random Variables and Their Distributions
Suppose that you are given a random variable X ∼ FX and a function
g : R → R. Consider the new random variable Y = g(X). How can we find
the distribution of Y ? This is the focus of this lecture.
There are two approaches that one can typically use to find the distribution of a function of a random variable Y = g(X): the first approach is
based on the c.d.f., the other approach is based on the change of variable
technique. The former is general and works for any function g, the latter
requires some assumptions on g, but it is generally faster than the approach
based on the c.d.f.
The method of the cumulative distribution function
The idea is pretty simple. You know that X ∼ FX and you want to find
FY . The process is as follows:
1. FY (y) = P (Y ≤ y) by definition
2. P (Y ≤ y) = P (g(X) ≤ y), since Y = g(X)
3. now you want to express the event {g(X) ≤ y} in terms of X, since
you know the distribution of X only
4. let A = {x ∈ R : g(x) ≤ y}; then {g(X) ≤ y} = {X ∈ A}
5. it follows that P (g(X) ≤ y) = P (X ∈ A) which can typically be
expressed in terms of FX
6. (EXTRA) once you have FX , the p.m.f. or the p.d.f. of X can be
easily derived from it.
Exercise in class:
Let Z ∼ N (0, 1) and g(x) = x2 . Consider Y = g(Z) = Z 2 . What is the
probability distribution of Y ?
Following the steps above, we have
FY(y) = P(Y ≤ y) = P(Z² ≤ y) = { 0 if y < 0; P(|Z| ≤ √y) if y ≥ 0 }
= { 0 if y < 0; P(Z ∈ [−√y, √y]) if y ≥ 0 }.
Notice that in this case A = [−√y, √y]. It follows that
FY(y) = { 0 if y < 0; Φ(√y) − Φ(−√y) if y ≥ 0 } = { 0 if y < 0; 2Φ(√y) − 1 if y ≥ 0 }.
Let φ denote the standard normal p.d.f.
φ(x) = (1/√(2π)) e^{−x²/2}.
Differentiating FY, the p.d.f. of Y = Z² is therefore
fY(y) = { 0 if y ≤ 0; (1/√y) φ(√y) if y > 0 } = { 0 if y ≤ 0; (1/√(2π)) y^{−1/2} e^{−y/2} if y > 0 }.
Notice that this is the p.d.f. of a Gamma(1/2, 2) ≡ χ²(1) distribution.
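As a sanity check (our own addition), one can compare the empirical distribution of Z² against the c.d.f. 2Φ(√y) − 1 derived above, and against the χ²(1) c.d.f.; scipy provides both reference curves.

```python
# Numerical check (a sketch, not from the notes): Z² for Z ~ N(0,1)
# matches 2Φ(√y) − 1, which is the χ²(1) c.d.f.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.standard_normal(10**6) ** 2

for q in (0.5, 1.0, 2.0, 4.0):
    empirical = (y <= q).mean()
    derived = 2 * stats.norm.cdf(np.sqrt(q)) - 1   # 2Φ(√q) − 1
    print(q, empirical, derived, stats.chi2.cdf(q, df=1))
```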
Exercise in class: the probability integral transform
Consider a continuous random variable X ∼ FX where FX, the c.d.f.
of X, is strictly increasing. Consider the random variable Y = FX(X). What
is the probability distribution of Y?
The transformation Y = FX(X) is usually called the probability integral
transform. To see why, notice that
Y = FX(X) = ∫_{−∞}^X fX(y) dy.
We have
FY(y) = P(Y ≤ y) = P(FX(X) ≤ y) = { 0 if y < 0; P(X ≤ FX^{−1}(y)) if y ∈ [0, 1); 1 if y ≥ 1 }.
In this case, therefore, A = (−∞, FX^{−1}(y)]. Then,
FY(y) = { 0 if y < 0; FX(FX^{−1}(y)) if y ∈ [0, 1); 1 if y ≥ 1 } = { 0 if y < 0; y if y ∈ [0, 1); 1 if y ≥ 1 }.
The p.d.f. of Y is therefore
fY(y) = 1[0,1](y),
i.e. Y ∼ Uniform(0, 1).
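A short numerical sketch of the probability integral transform (our own addition; the Exponential example is an arbitrary choice of a continuous, strictly increasing FX):

```python
# If X ~ FX with FX continuous and strictly increasing, FX(X) ~ Uniform(0,1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=10**6)   # X ~ Exponential with mean 2
u = stats.expon.cdf(x, scale=2.0)            # Y = FX(X)

# The empirical c.d.f. of Y should be close to the identity on [0, 1]
for t in (0.1, 0.25, 0.5, 0.9):
    print(t, (u <= t).mean())
```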
The method of the change of variable
We will not focus on the mathematical motivation behind this technique.
If you are curious and want to learn more about it, I strongly recommend
taking a look at this document. For our purposes, we will take this procedure
as a black-box.
In order to apply the method of the change of variable, the function g
must admit an inverse g −1 (this is not required by the method based on the
c.d.f.). A sufficient condition is that g is strictly monotone (increasing or
decreasing).
Suppose that you are given X ∼ FX and you want to compute the
probability distribution of Y = g(X). First of all, determine the support of
Y. Then, use this formula to obtain fY on the support of Y from fX and g:
fY(y) = fX(g^{−1}(y)) |(d/dz) g^{−1}(z)|_{z=y}.
Exercise in class:
Consider X ∼ Beta(1, 2) and g(x) = 2x−1. What is the p.d.f. of Y = g(X)?
First of all, notice that (with a little abuse of notation) supp(Y) =
g(supp(X)). Since supp(X) = [0, 1], it follows that supp(Y) = [−1, 1].
We have
fX(x) = 2(1 − x)1[0,1](x),
g^{−1}(x) = (x + 1)/2,
and
(d/dx) g^{−1}(x) = 1/2.
Thus, for y ∈ [−1, 1],
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = 2(1 − (y + 1)/2) · (1/2) = 1 − (y + 1)/2 = (1 − y)/2.
The complete description of the p.d.f. of Y is therefore
fY(y) = ((1 − y)/2) 1[−1,1](y).
Homework 6 (due June 4th)
1. (8 pts) Let X|Q = q ∼ Binomial(n, q) and Q ∼ Beta(α, β). Compute
E(X).
Solution:
We have
E(X) = E[E(X|Q)] = E(nQ) = nE(Q) = nα/(α + β).
2. (12 pts) Let X|Q = q ∼ Binomial(n, q) and Q ∼ Beta(3, 2). Find the
marginal p.d.f. of X. (Notice that here we are mixing discrete and
continuous random variables, but the usual definitions still apply!)
Solution:
The support of X is supp(X) = {0, 1, . . . , n}. For x ∈ supp(X) we have
pX(x) = ∫_0^1 pX,Q(x, q) dq = ∫_0^1 pX|Q=q(x) pQ(q) dq
= ∫_0^1 (n choose x) q^x (1 − q)^{n−x} · (Γ(3 + 2)/(Γ(3)Γ(2))) q²(1 − q) dq
= 12 (n choose x) ∫_0^1 q^{2+x} (1 − q)^{n−x+1} dq
= 12 (n choose x) Γ(3 + x)Γ(n − x + 2)/Γ(5 + n).
For any x ∉ supp(X), of course pX(x) = 0.
3. (10 pts) Let X ∼ Exponential(4). Consider Y = 3X + 1. Find the
p.d.f. of Y and E(Y ).
Solution:
We have g(x) = 3x + 1, thus the support of Y is [1, ∞), g^{−1}(y) = (y − 1)/3,
and
(d/dy) g^{−1}(y) = 1/3.
It follows that, for y ∈ [1, ∞),
fY(y) = fX(g^{−1}(y)) |(d/dy) g^{−1}(y)| = (1/4) e^{−(y−1)/12} · (1/3) = (1/12) e^{−(y−1)/12},
otherwise fY(y) = 0. Now, with the change of variable x = (y − 1)/12,
E(Y) = ∫_1^∞ y (1/12) e^{−(y−1)/12} dy = ∫_0^∞ (12x + 1) e^{−x} dx = 12 ∫_0^∞ x e^{−x} dx + ∫_0^∞ e^{−x} dx = 12 + 1 = 13.
4. (10 pts) We mentioned in class that if X ∼ Gamma(α, β) then, for
c > 0, Y = cX ∼ Gamma(α, cβ). Prove it.
Solution:
We have g(x) = cx, g^{−1}(x) = x/c, and
(d/dx) g^{−1}(x) = 1/c.
The image of the support of X through g is still [0, ∞). Then, for
y ∈ [0, ∞),
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (1/(β^α Γ(α))) (y/c)^{α−1} e^{−y/(cβ)} · (1/c) = (1/((cβ)^α Γ(α))) y^{α−1} e^{−y/(cβ)},
otherwise fY(y) = 0. This is clearly the p.d.f. of a Gamma(α, cβ)
distribution.
5. (12 pts) Let F be a given continuous and strictly increasing c.d.f.. Let
U ∼ Uniform(0, 1). Find a function g such that g(U ) ∼ F . You can
assume that g is strictly increasing.
Solution:
Let Y = g(U). We have
FY(y) = P(Y ≤ y) = P(g(U) ≤ y) = P(U ≤ g^{−1}(y)) = { 0 if g^{−1}(y) < 0; g^{−1}(y) if g^{−1}(y) ∈ [0, 1); 1 if g^{−1}(y) ≥ 1 }.
Thus, it is enough to set g^{−1} = F, i.e. g = F^{−1}, and then Y = g(U) ∼ F.
6. (12 pts) Let X ∼ fX with
fX(x) = (b/x²) 1[b,∞)(x)
and let U ∼ Uniform(0, 1). Find a function g such that g(U) has the
same distribution of X.
Solution:
Using the result of exercise 5, we need to set g = FX^{−1}. We have
FX(x) = { 0 if x < b; ∫_b^x (b/y²) dy if x ≥ b } = { 0 if x < b; 1 − b/x if x ≥ b }.
It follows that
g(x) = FX^{−1}(x) = b/(1 − x).
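Exercises 5 and 6 together describe inverse transform sampling; a short sketch (our own addition, with b = 2 an arbitrary choice):

```python
# Inverse transform sampling: with g(u) = b/(1 − u) and U ~ Uniform(0,1),
# g(U) should have c.d.f. FX(x) = 1 − b/x on [b, ∞).
import numpy as np

rng = np.random.default_rng(4)
b = 2.0
u = rng.uniform(size=10**6)
x = b / (1.0 - u)

# Compare the empirical c.d.f. with 1 − b/x at a few points
for t in (2.5, 4.0, 10.0):
    print(t, (x <= t).mean(), 1.0 - b / t)
```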
7. (10 pts) Let θ > 0 and X ∼ fX with
fX(x) = (2x/θ) e^{−x²/θ} 1[0,∞)(x).
Consider Y = X². What is fY, the p.d.f. of Y? What are E(Y) and
V(Y)?
Solution:
Notice that the function g(x) = x² is invertible over the support of
X, [0, ∞). Furthermore, g is such that supp(Y) = supp(X). We have
g^{−1}(x) = √x and
(d/dx) g^{−1}(x) = 1/(2√x).
Therefore
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (2√y/θ) e^{−y/θ} 1[0,∞)(√y) · 1/(2√y) = (1/θ) e^{−y/θ} 1[0,∞)(y).
It follows that Y ∼ Exponential(θ). Thus, E(Y) = θ and V(Y) = θ².
8. (10 pts) Let X ∼ Beta(α, β). Consider Y = 1 − X. What is the
probability distribution of Y ? Now, let α = β = 1. Claim that X and
Y have the same distribution. Are X and Y also independent?
Solution:
Here g(x) = 1 − x, so that g^{−1}(x) = g(x) and (d/dx) g^{−1}(x) = −1. Notice
that, for this g, supp(Y) = supp(X) = [0, 1]. Then, for y ∈ [0, 1] we
have
fY(y) = fX(g^{−1}(y)) |(d/dx) g^{−1}(x)|_{x=y} = (Γ(α + β)/(Γ(α)Γ(β))) (1 − y)^{α−1} y^{β−1} |−1| = (Γ(α + β)/(Γ(α)Γ(β))) (1 − y)^{α−1} y^{β−1},
otherwise fY(y) = 0. It is clear then that Y ∼ Beta(β, α). If α =
β = 1, then X and Y share the exact same distribution, which is
Uniform(0, 1). However, since Y is a deterministic function of X (in
particular, Y = 1 − X), it is obviously false that X and Y are independent random variables.
9. (6 pts) Let X ∼ FX . What is E(X|X)?
Solution:
For x ∈ supp(X), we have E(X|X = x) = E(x|X = x) = E(x) = x,
i.e. E(X|X) = X.
10. (10 pts) Let X, Y, Z be three random variables such that E(X|Z) =
3Z, E(Y |Z) = 1, and E(XY |Z) = 4Z. Assume furthermore that
E(Z) = 0. Are X and Y uncorrelated?
Solution:
We have
Cov(X, Y ) = E[Cov(X, Y |Z)] + Cov[E(X|Z), E(Y |Z)]
= E[E(XY |Z) − E(X|Z)E(Y |Z)] + Cov(3Z, 1)
= E(4Z − 3Z · 1) + 3 · Cov(Z, 1) = E(Z) + 3 · 0 = 0 + 0 = 0.
It follows that X and Y are uncorrelated.
Lecture 12
Recommended readings: WMS, sections 7.1 → 7.4
The Empirical Rule, Sampling Distributions, the Central Limit
Theorem, and the Delta Method
In this lecture we will focus on some basic statistical concepts at the basis
of statistical inference. Before starting, let’s quickly discuss a useful rule of
thumb associated to the Normal distribution which is often mentioned and
used in practice.
The Empirical Rule Based on the Normal Distribution
Suppose that you collect data X1, . . . , Xn ∼ f where f is an unknown probability density function which, however, you know to be 'bell-shaped' (or you
expect to be 'bell-shaped' given the information you have on the particular
phenomenon you are observing, or you can show to be 'bell-shaped' using, for
example, a histogram).
INTRODUCTION TO HISTOGRAM AND DENSITY ESTIMATION
We know that we can estimate the mean µf and the standard deviation
σf of f by means of the sample mean and the sample standard deviation
X̄ and S, respectively. On the basis of these two statistics, you may be
interested in approximately quantifying the probability content of intervals
of the form [µf − kσf , µf + kσf ] where k is a positive integer. Because for
X ∼ N (µ, σ 2 ) one has
P (µ − σ ≤ X ≤ µ + σ) ≈ 68%
P (µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
P (µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99%,
one would expect that, based on X̄ and S,
P ([X̄ − S, X̄ + S]) ≈ P ([µf − σf , µf + σf ]) ≈ 68%
P ([X̄ − 2S, X̄ + 2S]) ≈ P ([µf − 2σf , µf + 2σf ]) ≈ 95%
P([X̄ − 3S, X̄ + 3S]) ≈ P([µf − 3σf, µf + 3σf]) ≈ 99%    (49)
if the probability density f is 'bell-shaped'. The approximations of equation
(49) are frequently referred to as the empirical rules based on the Normal
distribution.
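A quick numerical sketch of the empirical rule (our own addition; the Normal sample with mean 10 and standard deviation 3 is an arbitrary 'bell-shaped' example):

```python
# For a bell-shaped sample, [X̄ ± kS] should contain ≈ 68/95/99% of the data.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=3.0, size=100_000)
xbar, s = x.mean(), x.std(ddof=1)   # sample mean and sample standard deviation

for k in (1, 2, 3):
    frac = ((x >= xbar - k * s) & (x <= xbar + k * s)).mean()
    print(k, round(frac, 4))
```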
Sampling Distributions
To illustrate the concept of sampling distribution, we will consider the Normal model, which assumes that the data X1, . . . , Xn are i.i.d.
with a Normal distribution N(µ, σ²), with parameters µ and σ² that are usually assumed to be unknown. In order to estimate µ and σ², we saw that we can
use the two statistics X̄ and S 2 (the sample mean and the sample variance).
These statistics, or estimators, are random variables too, since they depend
on the data X1 , . . . , Xn . If they are random variables, then they must have
their own probability distributions! The probability distribution of a statistic (or an appropriate stabilizing transformation of it) is called the sampling
distribution of that statistic. We have the following results:
1. √n(X̄ − µ)/σ ∼ N(0, 1)
2. (n − 1)S²/σ² ∼ χ²(n − 1)
3. √n(X̄ − µ)/S ∼ t(n − 1).
Some comments: 1. is easy to understand and prove (it’s just standardization of a Normal random variable!). 1. is useful when we are interested in
making inferences about µ and we know σ 2 . 2. is a little harder to prove.
We use this result when we are interested in making inferences about σ 2 and
µ is unknown. 3. is a classical result. It is useful when we want to make
inferences on µ and σ 2 is unknown. We won’t study in detail the t distribution. However, this distribution arises when we take the ratio of a standard
Normal distribution and the square root of a χ2 distribution divided by its
degrees of freedom (under the assumption that the Normal and the χ2 distributed random variables are also independent). The t distribution looks
like a Normal distribution with ‘fatter’ tails. As the number of degrees of
freedom of the t distribution goes to ∞, the t distribution ‘converges in
distribution’ to a standard Normal distribution.
For the purposes of this course, we say that a sequence of random variables X1 , X2 , . . . Xn , . . . converges in distribution to a certain distribution
F if their c.d.f.’s F1 , F2 , . . . , Fn , . . . are such that
lim_{n→∞} Fn(x) = F(x)
for each x ∈ R at which F is continuous. We will use the notation →d
to denote convergence in distribution. The idea here is that the limiting
c.d.f. F (often called the limiting distribution of the X’s) can be used to
approximate probability statements about the random variables in the sequence. This idea will be key to the next result, the Central Limit Theorem.
BASIC INFERENCE EXAMPLE AND p-VALUES (IMPORTANT:
CHI-SQUARE EXAMPLE)
The Central Limit Theorem
The Central Limit Theorem is a remarkable result. Simply put, suppose that
you have a sequence of random variables X1, X2, . . . , Xn, . . . iid ∼ F distributed
according to some c.d.f. F , such that their expectation µ and their variance
σ 2 exist and are finite. Then, the standardized version of their average
converges in distribution to a standard Normal distribution.
In other words,
√n(X̄n − µ)/σ →d Z    (50)
as n → ∞, where Z ∼ N(0, 1). Another way to express the Central Limit
Theorem is to say that, for any x ∈ R,
lim_{n→∞} P(√n(X̄n − µ)/σ ≤ x) = Φ(x).    (51)
This result is extremely frequently used and invoked to approximate probability statements regarding the average of a collection of i.i.d. random
variables when n is 'large'. However, the population variance σ² is rarely
known. The Central Limit Theorem can be extended to accommodate this
case. We have
√n(X̄n − µ)/S →d Z    (52)
where Z ∼ N(0, 1).
Exercise in class:
The number of errors per computer program has a Poisson distribution with
mean λ = 5. We receive n = 125 programs written by n = 125 different
programmers. Let X1 , . . . , Xn denote the number of errors in each of the
n programs. Then X1, . . . , Xn iid ∼ Poisson(λ). We want to approximate the
probability that the average number of errors per program is not larger than
5.5.
We have µ = E(X1) = λ = 5 and σ = √V(X1) = √λ = √5. Furthermore,
P(X̄n ≤ 5.5) = P(√n(X̄n − µ)/σ ≤ √n(5.5 − µ)/σ)
≈ P(Z ≤ √125 (5.5 − 5)/√5) = P(Z ≤ 2.5)
= Φ(2.5)
where Z ∼ N(0, 1).
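This approximation is easy to check by simulation (our own addition; a sketch only):

```python
# Monte Carlo check of the exercise above: the exact P(X̄n ≤ 5.5)
# should be close to the CLT approximation Φ(2.5) ≈ 0.9938.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam, n, reps = 5.0, 125, 50_000
xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)

print((xbar <= 5.5).mean())    # Monte Carlo estimate of P(X̄n ≤ 5.5)
print(stats.norm.cdf(2.5))     # CLT approximation Φ(2.5)
```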
The Delta Method
If we have a sequence of random variables X1, X2, . . . , Xn, . . . whose standardized average converges in distribution to a standard Normal, and a differentiable function
g : R → R, then the Delta Method allows us to find the limiting distribution
of the sequence g(X̄1), g(X̄2), . . . , g(X̄n), . . . . Assume that
√n(X̄n − µ)/σ →d N(0, 1)
and that g : R → R is differentiable with g′(µ) ≠ 0. Then,
√n(g(X̄n) − g(µ)) / (|g′(µ)|σ) →d N(0, 1).
Exercise in class:
Let X1, . . . , Xn iid ∼ F where F is some c.d.f. such that both the expectation
µ and the variance σ² of the X's exist and are finite. By the Central Limit
Theorem,
√n(X̄n − µ)/σ →d N(0, 1).
Consider the function g(x) = e^x. Find the limiting distribution of g(X̄n).
We have g′(x) = g(x) = e^x > 0 for any x ∈ R; thus, g′(µ) ≠ 0 necessarily.
By applying the Delta Method, we have that
√n(e^{X̄n} − e^µ) / (e^µ σ) = √n[g(X̄n) − g(µ)] / (|g′(µ)|σ) →d N(0, 1).
Thus, we can use the distribution
N(e^µ, e^{2µ}σ²/n)
to approximate probability statements about g(X̄n) = e^{X̄n} when n is 'large'.
Homework 7 (due June 9th)
1. (30 pts) Let X1, . . . , X10 iid ∼ N(µ, σ²). We don't know the value of
σ 2 , but we want to decide whether, based on the data X1 , . . . , X10 ,
σ 2 ≤ σ02 = 3 or not. We observe from the data that S 2 = s2 = 10.
• Use the sampling distribution of S 2 to compute the probability of
the event {S 2 ≥ s2 } under the assumption that the true unknown
variance is σ 2 = σ02 (you will need to use a statistical software
and use the c.d.f. of a χ2 with a certain number of degrees of
freedom to compute this probability)2 .
• Based on the probability that you computed, explain why the
data X1, . . . , X10 suggest that it is likely not true that σ² ≤ σ0².
Solution:
• We know that
(n − 1)S²/σ² ∼ χ²(n − 1).
Here, σ² is unknown. However, let's assume that σ² is at least as
large as σ0² and let's therefore plug σ² = σ0² in the above result
about the sampling distribution of S². It then follows that
P(S² ≥ s²) = P((n − 1)S²/σ0² ≥ (n − 1)s²/σ0²) = P(χ²(9) ≥ 30) ≈ 0.0004
(with a little abuse of notation).
• We thus discovered that, if the true unknown variance was not
larger than σ0² = 3 as we conjectured, then the probability of
observing a value of the sample variance s² as extreme as the one
we observed from the data (s² = 10) is extremely small. Based on
the probability that we computed, it looks therefore very unlikely
that the true unknown variance σ² is not larger than σ0² = 3.
2 We can safely plug in σ0² because
sup_{0<σ²≤σ0²} P_{σ²}(S² > s²) = P_{σ0²}(S² > s²).
Here the notation P_{σ0²}(A) should be interpreted as the 'probability of the event A when
we assume that the true unknown variance is σ0²'.
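For reference, the probability P(χ²(9) ≥ 30) can be computed with a statistical software, as the problem suggests; one possible choice (our own addition) is scipy:

```python
# P(χ²(9) ≥ 30) = 1 − F(30), where F is the c.d.f. of a χ² with 9 d.o.f.
from scipy import stats

print(stats.chi2.sf(30, df=9))   # ≈ 0.0004
```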
2. (20 pts) The times a cashier spends processing individual customer’s
orders are independent random variables with mean 2.5 minutes and
standard deviation 2 minutes. What is the approximate probability
that it will take more than 4 hours to process the orders of 100 people?
You can leave your answer in terms of Φ.
Solution:
Let Z ∼ N(0, 1) and let X1, . . . , Xn denote the times spent by the
cashier to process the n = 100 individual customers' orders. Without any
other assumption on the distribution of the X's other than independence, finite mean and finite variance, we can only rely on the Central
Limit Theorem. In this case, n = 100 is likely large enough to give a
good approximation to the event of interest, which is
{Σ_{i=1}^n Xi > 240}
(notice that there are exactly 240 minutes in 4 hours). Thus, we have
P(Σ_{i=1}^n Xi > 240) = P(X̄n > 240/n)
= P(√n(X̄n − µ)/σ > √n(240/n − µ)/σ)
≈ P(Z > √n(240/n − µ)/σ) = 1 − P(Z ≤ √n(240/n − µ)/σ)
= 1 − Φ(√n(240/n − µ)/σ)
= 1 − Φ(10(240/100 − 2.5)/2) = 1 − Φ(−1/2) = Φ(1/2).
3. (25 pts) Let X1 , . . . , Xn be iid random variables with c.d.f. F , finite
expectation µ and finite variance σ 2 . Consider the random variable
Y = aX̄n + b for a, b ∈ R and a > 0. Assume that n is ‘large’. Give
an approximate expression for the α-quantile of Y . Your answer will
be in terms of Φ−1 .
Solution:
Let Z ∼ N(0, 1). Without further information about F, we can only
use the Central Limit Theorem. By definition, the α-quantile of Y is
the value q ∈ R such that P(Y ≤ q) = α. We can use the Central
Limit Theorem to approximate this number when n is large. We have
α = P(Y ≤ q) = P(aX̄n + b ≤ q) = P(X̄n ≤ (q − b)/a)
= P(√n(X̄n − µ)/σ ≤ √n((q − b)/a − µ)/σ)
≈ P(Z ≤ √n((q − b)/a − µ)/σ) = Φ(√n((q − b)/a − µ)/σ).
Thus,
√n((q − b)/a − µ)/σ ≈ Φ^{−1}(α),
i.e.
q ≈ b + a(µ + σΦ^{−1}(α)/√n).
4. (25 pts) Let X1, . . . , Xn iid ∼ Bernoulli(p) be a sequence of binary random variables where
Xi = { 1, if the Pittsburgh Pirates won the i-th game against the Cincinnati Reds; 0, if the Pittsburgh Pirates lost the i-th game against the Cincinnati Reds }
and p ∈ (0, 1). Suppose that we are interested in the odds that the
Pittsburgh Pirates win a game against the Cincinnati Reds. To do that,
we need to consider the function
g(p) = p/(1 − p)
representing the odds of the Pirates winning against the Reds. It is
natural to estimate p by means of X̄n. Provide a distribution that can
be used to approximate probability statements about g(X̄n).
Solution:
We have E(Xi) = p ∈ (0, 1) and V(Xi) = p(1 − p) ∈ (0, 1) for any
i ∈ {1, . . . , n}, so both the expectation and the variance of the X's exist
and are finite. Consider the function g(x) = x/(1 − x) defined for x ∈ (0, 1).
We have
g′(x) = 1/(1 − x)².
Notice that g′(x) > 0 for any x in (0, 1). Thus, g′(E(Xi)) ≠ 0. The
Delta Method allows us to write
√n(X̄n/(1 − X̄n) − p/(1 − p)) / ((1/(1 − p)²)√(p(1 − p))) = √n[g(X̄n) − g(E(Xi))] / (|g′(E(Xi))|√(V(Xi))) →d N(0, 1).
Hence, we can use the distribution
N(p/(1 − p), p/(n(1 − p)³))
to approximate probability statements about g(X̄n) = X̄n/(1 − X̄n).
5. (EXTRA CREDIT, 15 pts) Let U1, U2, . . . , Un, . . . be a sequence of
random variables such that Un →d Uniform(0, 1). Consider the function
g(x) = − log x and the new sequence of random variables X1, X2, . . . , Xn, . . .
with Xi = − log Ui. Prove that Xn →d Exponential(1). Hint: use the
method of the c.d.f. that we discussed in lecture 11.
Solution:
We first want to compute the c.d.f. of the n-th transformed variable
Xn in the sequence. We have
FXn(x) = P(Xn ≤ x) = P(− log Un ≤ x) = P(Un ≥ e^{−x}).
Because Un →d Uniform(0, 1), it follows that
P(Un ≥ e^{−x}) → { 0 if x < 0; 1 − e^{−x} if x ≥ 0 } = F(x).
We recognize F as the c.d.f. of an Exponential(1) distribution.
Thus, we conclude that Xn →d Exponential(1).
Lecture 13
Recommended readings: N/A
The Law of Large Numbers, Convergence of Random Variables, Convergence Results
The Law of Large Numbers
In the previous set of lecture notes we studied the Central Limit Theorem
and a particular mode of convergence, the convergence in distribution. Another important result on the convergence of the average of a sequence of
random variables is the (weak) law of large numbers 3 . Simply put, the law
of large numbers says that the sample mean of a sequence of iid random
variables converges to the common expectation of the random variables in
the sequence. More precisely, let X1 , . . . , Xn , . . . be a sequence of iid random
variables with common mean µ. Then, for any ε > 0,
lim_{n→∞} P(|X̄n − µ| > ε) = 0.    (53)
In this case, we use the notation X̄n →p µ.
Example in class:
Suppose that you repeatedly flip a coin which has probability p ∈ (0, 1) of
showing head. Introduce the random variables
Xi = { 1, if the i-th flip is head; 0, if the i-th flip is tails }.
Consider the random variable X̄n describing the observed proportion of
heads. Then, as n → ∞, X̄n →p p. In English, this means that for any
arbitrarily small ε > 0, the probability of the event 'the proportion of heads
differs from p by more than ε' converges to 0 as n → ∞.
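A quick simulation sketch of this example (our own addition; p = 0.3 is an arbitrary choice):

```python
# The running proportion of heads settles near p as n grows,
# in line with the (weak) law of large numbers.
import numpy as np

rng = np.random.default_rng(7)
p = 0.3
flips = rng.binomial(1, p, size=100_000)
running_mean = flips.cumsum() / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])
```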
Convergence of Random Variables
So far we introduced two notions of convergence for random variables: convergence in distribution (which we associated to the Central Limit Theorem)
and convergence in probability (which we associated to the weak law of large
numbers). There are at least two other important notions of convergence
3 We will discuss a strong version in the next section.
Figure 1: Relationships between different modes of convergence.
that are frequently used and mentioned: convergence in quadratic mean (or
L2 -convergence) and almost sure convergence.
We say that the sequence of random variables X1, X2, . . . , Xn, . . . converges in quadratic mean to the random variable X if
lim_{n→∞} E[(Xn − X)²] = 0.    (54)
We use the notation Xn →L2 X in this case. This type of convergence is thus
related to the convergence of expectations.
We say that the sequence of random variables X1, X2, . . . , Xn, . . . (defined over some sample space Ω) converges almost surely to the random
variable X if
P({ω ∈ Ω : lim_{n→∞} Xn(ω) = X(ω)}) = 1.
This is a very strong notion of convergence (essentially the strongest in probability theory, except for 'sure' convergence!). It states that the set of elements ω in the sample space Ω for which the sequence X1(ω), X2(ω), . . . , Xn(ω), . . .
converges to X(ω) is a set that has probability 1. We denote almost sure
convergence as Xn →a.s. X.
Almost sure convergence is associated to the strong law of large numbers.
Under the same assumptions of the weak law of large numbers, we have
X̄n →a.s. µ.
Convergence Results
Figure 1 illustrates some relationships between the different modes of convergence that we discussed. Furthermore, even if it is not included in the
Figure, almost sure convergence implies convergence in probability.
Here is a list of interesting results. Let X1 , . . . , Xn , . . . and Y1 , . . . , Yn , . . .
be two sequences of random variables, X and Y be two random variables,
and let g : R → R be a continuous function. We have
• Xn →a.s. X and Yn →a.s. Y =⇒ Xn + Yn →a.s. X + Y
• Xn →p X and Yn →p Y =⇒ Xn + Yn →p X + Y
• Xn →L2 X and Yn →L2 Y =⇒ Xn + Yn →L2 X + Y
• Xn →d c ∈ R and Yn →d Y =⇒ Xn + Yn →d c + Y
• Xn →a.s. X and Yn →a.s. Y =⇒ XnYn →a.s. XY
• Xn →p X and Yn →p Y =⇒ XnYn →p XY
• Xn →d c ∈ R and Yn →d Y =⇒ XnYn →d cY
• Xn →a.s. X =⇒ g(Xn) →a.s. g(X)
• Xn →p X =⇒ g(Xn) →p g(X)
• Xn →d X =⇒ g(Xn) →d g(X).
Lecture 14
Recommended readings: N/A
Probability Inequalities and Expectation Inequalities
Probability Inequalities
We already studied Markov’s inequality and Tchebysheff’s inequality in lecture 5. Here we discuss two other important inequalities that give exponential upper bounds on the probability content of the tails of a distribution.
We will focus on Hoeffding’s inequality and McDiarmid’s inequality.
Hoeffding’s Inequality
Let X1, . . . , Xn be independent random variables with E(Xi) = 0 and
P(ai ≤ Xi ≤ bi) = 1 for some ai < bi and for all i ∈ {1, . . . , n}. Then
for any ε > 0 we have
P(Σ_{i=1}^n Xi > ε) ≤ e^{−tε} Π_{i=1}^n e^{t²(bi−ai)²/8}
for any t > 0.
Exercise in class:
Suppose that ai = a and bi = b for all i ∈ {1, . . . , n}.
• How does Hoeffding's inequality read in this case?
• For any ε > 0, the bound is valid for any t > 0. Derive the tightest
bound when ai = a and bi = b for all i ∈ {1, . . . , n}.
It is easy to see that when ai = a and bi = b for all i ∈ {1, . . . , n} the
bound reads
P(Σ_{i=1}^n Xi > ε) ≤ e^{−tε} e^{t²n(b−a)²/8} = e^{−tε + n(b−a)²t²/8}.
The tightest bound is the one obtained by minimizing with respect to t the
quadratic exponent. The exponent is minimized for t = 4ε/(n(b − a)²). By
plugging back this value of t we obtain
P(Σ_{i=1}^n Xi > ε) ≤ e^{−2ε²/(n(b−a)²)}.
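A numerical sketch of the bound (our own addition): the summands Xi = Ui − 1/2 with Ui ∼ Uniform(0, 1) are an arbitrary choice of bounded, mean-zero variables with a = −1/2 and b = 1/2.

```python
# Compare the optimized Hoeffding bound exp(−2ε²/(n(b−a)²))
# with the actual tail probability P(Σ Xi > ε).
import numpy as np

rng = np.random.default_rng(8)
n, reps, eps = 50, 200_000, 5.0
x = rng.uniform(size=(reps, n)) - 0.5        # E(Xi) = 0, Xi ∈ [−1/2, 1/2]
tail = (x.sum(axis=1) > eps).mean()

bound = np.exp(-2 * eps**2 / (n * 1.0**2))   # here b − a = 1
print(tail, bound)                           # tail probability ≤ bound
```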
McDiarmid’s Inequality
Let X1 , . . . , Xn be independent random variables. Suppose that the function
g is such that
sup g(x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − g(x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn ) ≤ ci
x1 ,...,xn ,x0i
for i ∈ {1, . . . , n}. Then
P (g(X1 , . . . , Xn ) − E[g(X1 , . . . , Xn )] ≥ ) ≤ e
2
c2
i=1 i
− Pn2
.
Expectation Inequalities
We will only discuss two basic expectation inequalities: the Cauchy-Schwarz
inequality and Jensen's inequality. Although simple, they are useful in a
variety of situations.
Cauchy-Schwarz Inequality
Let X and Y be two random variables such that E(X²) and E(Y²) exist
and are finite. Then,
E(|XY|) ≤ √(E(X²)E(Y²)).
Exercise in class:
Use the Cauchy-Schwarz inequality to show that for two random variables X
and Y such that E(X²) and E(Y²) exist and are finite, |Cor(X, Y)| ≤ 1.
Jensen’s Inequality
Let g : R → R be a convex function and let X be a random variable. Then,
provided E(X) exists,
g[E(X)] ≤ E[g(X)].
Exercise in class:
Use Jensen’s inequality to show that for any random variable X, V (X) ≥ 0.
Homework 8 (due June 11th)
1. (20 pts) Prove the weak law of large numbers under the additional
assumption that the variance σ 2 of the iid sequence X1 , X2 , . . . , Xn , . . .
exists and is finite. Hint: use Tchebysheff’s inequality.
Solution:
Take any ε > 0. By Tchebysheff's inequality we have that
P(|X̄n − µ| > ε) ≤ V(X̄n)/ε² = σ²/(nε²).
Clearly,
0 ≤ lim_{n→∞} P(|X̄n − µ| > ε) ≤ lim_{n→∞} σ²/(nε²) = 0,
hence the result.
2. (20 pts) Consider the sequence of random variables X1, X2, . . . , Xn, . . .
with Xn ∼ N(µ, 3/(1 + log n)). Show that Xn →p µ.
Solution:
Take any arbitrary ε > 0. Tchebysheff's inequality implies
P(|Xn − µ| > ε) ≤ V(Xn)/ε² = 3/((1 + log n)ε²).
Clearly,
0 ≤ lim_{n→∞} P(|Xn − µ| > ε) ≤ lim_{n→∞} 3/((1 + log n)ε²) = 0,
and thus Xn →p µ.
3. (20 pts) Prove that for a sequence of random variables X1, X2, . . . , Xn, . . .
and a random variable X we have Xn →L2 X =⇒ Xn →p X. Hint:
once again, use Tchebysheff's inequality.
Solution:
Take any arbitrary ε > 0. Tchebysheff's inequality implies
P(|Xn − X| > ε) = P(|Xn − X|² > ε²) ≤ E[(Xn − X)²]/ε².
By assumption lim_{n→∞} E[(Xn − X)²] = 0, therefore
0 ≤ lim_{n→∞} P(|Xn − X| > ε) ≤ lim_{n→∞} E[(Xn − X)²]/ε² = 0,
and therefore Xn →p X.
4. (20 pts) Let X1, . . . , Xn iid ∼ Bernoulli(p). Let ε > 0. Use Hoeffding's
inequality to derive a tight upper bound for the probability of the
event {|X̄n − p| > ε}.
Solution:
{|X̄n − p| > ε} = {|Σ_{i=1}^n (Xi − p)| > nε}
= {Σ_{i=1}^n (Xi − p) > nε} ∪ {Σ_{i=1}^n (p − Xi) > nε}.
Notice that Xi − p ∈ [−1, 1] with probability 1, p − Xi ∈ [−1, 1] with
probability 1, and E(Xi − p) = E(p − Xi) = 0 for any i ∈ {1, . . . , n}.
We can therefore apply Hoeffding's inequality, which yields
P(|X̄n − p| > ε) = P(Σ_{i=1}^n (Xi − p) > nε) + P(Σ_{i=1}^n (p − Xi) > nε)
≤ e^{−2(nε)²/(n(1−(−1))²)} + e^{−2(nε)²/(n(1−(−1))²)} = 2e^{−nε²/2}.
5. (20 pts) Consider McDiarmid’s inequality. Prove that McDiarmid’s
inequality coincides withPHoeffding’s inequality in the particular case
where g(x1 , . . . xn ) = n1 ni=1 xi , P (a ≤ Xi ≤ b)= 1, and E(Xi ) = 0
for i ∈ {1, . . . , n}.
Solution:
Notice that
g(x1 , . . . , xi−1 , xi , xi+1 , . . . , xn ) − g(x1 , . . . , xi−1 , x0i , xi+1 , . . . , xn )
sup
x1 ,...,x0i ,...,xn
1
0 =
sup
(x
−
x
)
i
i
.
x1 ,...,x0i ,...,xn n
Because a ≤ Xi ≤ b with probability 1, it follows that
g(X1 , . . . , Xi−1 , Xi , Xi+1 , . . . , Xn ) − g(X1 , . . . , Xi−1 , Xi0 , Xi+1 , . . . , Xn )
sup
X1 ,...,Xi0 ,...,Xn
1
b−a
0 =
sup
(X
−
X
)
i
i
≤ n .
X1 ,...,Xi0 ,...,Xn n
116
with probability 1. It is easily seen that E[g(X1 , . . . , Xn )] = 0, so
McDiarmid’s inequality can be applied and yields
P (g(X1 , . . . , Xn ) − E[g(X1 , . . . , Xn )] > )
= P X̄n > −P
≤e
−
=e
22
b−a 2
n
n
i=1
(
22
(b−a)2
n
)
−
=e
2n2
(b−a)2
which is the same bound that we obtain by applying Hoeffding’s inequality.
6. (EXTRA CREDIT, 15 pts) Let Xn ∼ Poisson(λn) with n ∈ {1, 2, . . . }
and λn = 1/n. Show that Yn = nXn →p 0. Hint: remember from the
convergence results that we studied that if Yn →d c for some c ∈ R,
then the implication Yn →d c =⇒ Yn →p c holds true.
Solution:
Consider FYn, the c.d.f. of Yn. We have that
FYn(t) = P(nXn ≤ t) = P(Xn ≤ t/n).
Clearly, if t < 0 then FYn(t) = 0. Take now t ∈ [0, 1). Since t/n < 1
and Xn takes values in {0, 1, . . . },
FYn(t) = P(Xn ≤ t/n) = P(Xn = 0) = pXn(0) = e^{−λn} = e^{−1/n}
(and, for t ≥ 1, FYn(t) ≥ e^{−1/n} as well).
Now, since e^{−1/n} → 1 as n → ∞, it is clear that the limiting c.d.f. F is
the function
F(t) = { 0 if t < 0; 1 if t ≥ 0 }
which corresponds to the c.d.f. of a random variable that assigns
probability mass 1 to the point t = 0. This shows that Yn →d 0, which
implies that Yn →p 0.
Lecture 15
Recommended readings: WMS, sections 3.9, 4.9, 6.5
Moment-generating Functions
In this lecture, we introduce a particular function associated to a probability
distribution F which uniquely characterizes F . This function is called the
moment-generating function (henceforth shortened to m.g.f.) of F because,
if it exists, it allows us to easily compute any moment of F, i.e. any expectation
of the type E(X^k) with k ∈ {0, 1, . . . } for X ∼ F.
There are other functions that are similar to the m.g.f.s which we will not
discuss in this course: these include the probability-generating function for
discrete probability distributions (which is a compact power series representation of the p.m.f.), the characteristic function (which is the inverse Fourier
transform of a p.m.f./p.d.f. and always exists) and the cumulant-generating
function (which is the logarithm of the m.g.f.).
The moments {E(X^k)}_{k=1}^∞ associated to a distribution X ∼ F completely characterize F (if they exist). They can all be encapsulated in the
m.g.f.
mX(t) = E(e^{tX}) = { ∫_{supp(X)} e^{tx} fX(x) dx if X is continuous; Σ_{supp(X)} e^{tx} pX(x) if X is discrete }.
If two random variables X and Y are such that mX = mY then their c.d.f.’s
FX and FY are equal at almost all points (i.e. they can differ at at most
countably many points).
We say that the moment generating function mX of X ∼ F exists if
there exists an open neighborhood around t = 0 in which mX (t) is finite.
Note that it is always true that, for X ∼ F with an arbitrary distribution
F , mX (0) = 1.
The name of this function comes from the following feature: suppose
that mX exists, then for any k ∈ {0, 1, . . . }
d^k/dt^k mX(t)|_{t=0} = E(X^k).    (55)
This means that we can 'generate' the moments of X ∼ F from mX by
differentiating mX and evaluating its derivatives at t = 0.
Exercise in class:
Show that the m.g.f. of X ∼ Binomial(n, p) is
mX(t) = [pe^t + 1 − p]^n
and use it to compute V (X).
We have
mX(t) = Σ_{x=0}^n e^{tx} pX(x) = Σ_{x=0}^n e^{tx} (n choose x) p^x (1 − p)^{n−x} = Σ_{x=0}^n (n choose x) (pe^t)^x (1 − p)^{n−x} = [pe^t + (1 − p)]^n
where we used the binomial theorem
(a + b)^n = Σ_{i=0}^n (n choose i) a^i b^{n−i}.
Notice that the function above is well-defined and finite for any t ∈ R. Then,
E(X) = d/dt mX(t)|_{t=0} = npe^t[pe^t + (1 − p)]^{n−1}|_{t=0} = np
and
E(X²) = d²/dt² mX(t)|_{t=0} = (n(n − 1)p²e^{2t}[pe^t + (1 − p)]^{n−2} + npe^t[pe^t + (1 − p)]^{n−1})|_{t=0} = n(n − 1)p² + np.
Thus,
V(X) = E(X²) − [E(X)]² = n(n − 1)p² + np − n²p² = −np² + np = np(1 − p).
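The 'generate the moments by differentiating at t = 0' step can also be carried out symbolically; below is a sketch with sympy (our own addition; n = 10 is an arbitrary choice):

```python
# Generate the Binomial moments from its m.g.f., as in the derivation above.
import sympy as sp

t, p = sp.symbols('t p')
n = 10
m = (p * sp.exp(t) + 1 - p) ** n      # m.g.f. of Binomial(n, p)

EX = sp.diff(m, t, 1).subs(t, 0)      # E(X), should simplify to n p
EX2 = sp.diff(m, t, 2).subs(t, 0)     # E(X²)
V = sp.simplify(EX2 - EX**2)          # should equal n p (1 − p)
print(sp.simplify(EX))
print(sp.factor(V))
```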
Exercise in class:
Show that the m.g.f. of X ∼ Uniform(0, 1) is
mX(t) = (e^t − 1)/t.
We have
mX(t) = ∫_0^1 e^{tx} dx = (1/t) e^{tx}|_0^1 = (e^t − 1)/t,
which is well-defined and finite for any t ∈ R (recall that mX(0) = 1 for any
X).
It is easy to verify that a m.g.f. mX satisfies
m_{aX+b}(t) = e^{bt} mX(at).    (56)
M.g.f.'s also provide a tool which is often useful to identify the distribution of a linear combination of random variables. In the m.g.f. world sums
become products! In particular, consider a collection of n independent random variables X1, . . . , Xn with m.g.f.'s mX1, . . . , mXn. It is easy to check
from the definition of m.g.f. that, if we consider Y = Σ_{i=1}^n (aiXi + bi), then
mY(t) = e^{(Σ_{i=1}^n bi)t} Π_{i=1}^n mXi(ait).    (57)
Exercise in class:
The m.g.f. of X ∼ Poisson(λ) is
mX(t) = e^{λ(e^t − 1)}
for t ∈ R. Consider Y ∼ Poisson(µ) with X and Y independent. What is
the distribution of X + Y?
Since X and Y are independent, we have
m_{X+Y}(t) = mX(t)mY(t) = e^{λ(e^t−1)} e^{µ(e^t−1)} = e^{(λ+µ)(e^t−1)}
and we recognize this as the m.g.f. of a Poisson(λ + µ) distribution.
Homework 9 (due June 16th)
1. (30 pts) Show that if Xi ∼ Gamma(αi, β) for i ∈ {1, . . . , n} and X1, . . . , Xn
are independent, then X = Σ_{i=1}^n Xi ∼ Gamma(Σ_{i=1}^n αi, β). Hint: the
m.g.f. of Y ∼ Gamma(α, β) is
mY(t) = (1 − βt)^{−α}
which is well-defined and finite for t < 1/β.
Solution:
For t < 1/β, we have
mX(t) = Π_{i=1}^n mXi(t) = Π_{i=1}^n (1 − βt)^{−αi} = (1 − βt)^{−Σ_{i=1}^n αi}.
We easily see that the m.g.f. of X thus corresponds to the m.g.f. of a
Gamma(Σ_{i=1}^n αi, β) distribution.
2. (20 pts) We showed in class that if X ∼ N(0, 1), then X² ∼ χ²(1). We
mentioned that if X1, . . . , Xn iid ∼ N(0, 1), then Y = Σ_{i=1}^n Xi² ∼ χ²(n).
Use the result that you obtained from the previous exercise to prove
that Y ∼ χ²(n).
Solution:
Since Xi² ∼ χ²(1) ≡ Gamma(1/2, 2), we have
m_{Xi²}(t) = (1 − 2t)^{−1/2}
for any i ∈ {1, . . . , n}. By applying the result from the previous exercise we obtain
mY(t) = (1 − 2t)^{−Σ_{i=1}^n 1/2} = (1 − 2t)^{−n/2}
for t < 1/2, which corresponds to the m.g.f. of a Gamma(n/2, 2) ≡
χ²(n) distribution.
3. (25 pts) Let X ∼ Binomial(n1 , p) and Y ∼ Binomial(n2 , p) with X
and Y independent random variables. What is the distribution of
Z = X + Y?
Solution:
We have
mZ(t) = mX(t)mY(t) = [pe^t + 1 − p]^{n1} [pe^t + 1 − p]^{n2} = [pe^t + 1 − p]^{n1+n2},
which clearly corresponds to the m.g.f. of a Binomial(n1 + n2, p) distribution.
4. (25 pts) Derive the m.g.f. of X ∼ Uniform(a, b).
Solution:
We have
mX(t) = E(e^{tX}) = ∫_a^b (1/(b − a)) e^{tx} dx = (1/((b − a)t)) e^{tx}|_a^b = (e^{bt} − e^{at})/((b − a)t),
which is well-defined and finite for any t ∈ R.
5. (EXTRA CREDIT, 20 pts) Consider the discrete random variable X ∼
pX with mX well-defined and finite in an open neighborhood of 0, say
(−ε, ε) for some ε > 0. Show that
d^k/dt^k mX(t)|_{t=0} = E(X^k)
for k ∈ {0, 1, . . . }. Hint: use the Taylor series representation of the
function t ↦ e^{tx} around t = 0.
Solution:
Consider the Taylor series representation of the function g(t) = e^{tx}
around t = 0. We have
g(t) = Σ_{i=0}^∞ (g^{(i)}(0)/i!) t^i.
Notice that for i ∈ {0, 1, . . . },
g^{(i)}(t) = x^i g(t) = x^i e^{tx},
thus
g^{(i)}(0) = x^i g(0) = x^i.
It follows that
mX(t) = E(e^{tX}) = Σ_{x∈supp(X)} e^{tx} pX(x) = Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) t^i,
hence
d^k/dt^k mX(t)|_{t=0} = Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) (d^k/dt^k t^i)|_{t=0}
= Σ_{x∈supp(X)} pX(x) Σ_{i=0}^∞ (x^i/i!) i(i − 1) · · · (i − k + 1) t^{i−k}|_{t=0}
= Σ_{x∈supp(X)} x^k pX(x) = E(X^k).
The last equality follows because
• for 0 ≤ i < k, one of the factors in i(i − 1) · · · (i − k + 1) is 0
• for i > k, t^{i−k}|_{t=0} = 0
• for i = k, (x^i/i!) i(i − 1) · · · (i − k + 1) t^{i−k}|_{t=0} = (x^k/k!) k! = x^k.
Lecture 16
Recommended readings: N/A
Introduction to Random Processes
Informally: a random process is a mathematical model of a probabilistic
experiment that evolves in ‘time’ and generates a sequence of (random)
values.
A little more formally: consider a collection of random variables X =
{Xt }t∈T defined on a sample space Ω endowed with a probability measure
P , where the collection is indexed by a ‘time’ parameter t taking values in
the set T . Such collection is commonly referred to as a random process or
as a stochastic process. You might have noticed that once again, the word
‘time’ has quotation marks: this is because, while we often think about
the parameter t as time, we can also think of spatial random processes, i.e.
random processes that evolve with respect to space rather than time. In
other more complicated cases, the index set T in which the parameter t
takes its values can be an even more general domain that is not necessarily
related to time or space.
Recall that a random variable is a function mapping the sample space Ω
into a subset S of the real numbers. In the jargon of random processes the
set S is usually called the state space of the random process X. In fact, if
at the time t ∈ T the random variable Xt takes the value Xt = xt ∈ S, we
say that the state of the random process X at time t is xt .
Based on the features of the sets T and the sets S we say that X is a
• discrete-time random process, if T is a discrete set (i.e. T is finite or
countably infinite)
• continuous-time random process, if T is not a discrete set
• discrete-state random process, if S is a discrete set
• continuous-state random process, if S is a not a discrete set.
It is clear from above that X = {Xt }t∈T is a function of both the elements
of Ω and of the time parameter t. We can therefore think of X in at least
two ways:
• as a collection of random variables Xt taking values in S; i.e. for any
fixed time t, the random process X corresponds to a random variable
Xt valued in S
• as the collection of random functions of time (‘trajectories’); i.e. for
any fixed ω ∈ Ω, we can view X as the sample path or ‘trajectory’
t 7→ Xt (ω) which is a (random) function of time.
More generally, a random process is a mapping of the type
X:
Ω×T →S
(ω, t) 7→ Xt (ω)
GRAPHICAL INTERPRETATION
In contrast to the study of a single random variable or to the study
of a collection of independent random variables, the analysis of a random
process puts much more focus on the dependence structure between the
random variables in the process at different times and on the implications
of that dependence structure.
Because there are many real-world phenomena that can be interpreted
and modeled as random processes, there is a wide variety of random processes. In this class, however, we will focus only on the following processes:
• the Bernoulli process
• the Poisson process
• Brownian motion (the Wiener process)
• discrete-time Markov chains.
With respect to the processes listed above, the Bernoulli process and discrete-time Markov chains are discrete-time and discrete-state processes, whereas
the Poisson process is a continuous-time and discrete-state process. Brownian motion is instead a continuous-time and continuous-state random process.
We can further characterize the Bernoulli process and the Poisson process
as arrival-type processes: in these processes, we are interested in occurrences
that have the character of an ‘arrival’, such as customers walking into a
store, messages receptions at a receiver, incoming telephone calls, etc. In
the Bernoulli process arrivals occur in discrete-time while in the Poisson
process arrivals occur in continuous-time.
In the following, we provide a very basic and intuitive interpretation of
the aforementioned random processes.
The Bernoulli Process
The Bernoulli process is one of the most elementary random processes. It
can be visualized as an infinite sequence of independent coin tosses where
the probability of heads in each toss is a fixed number p ∈ [0, 1]. Thus, a
Bernoulli process is essentially a sequence of iid Bernoulli random variables.
Question: what is the state space of a Bernoulli process?
The Poisson Process
The Poisson process is a ‘counting’ random process in continuous time. The
basic idea is that we are interested in counting the number of events (occurring at a certain rate λ > 0) that occurred up to a certain time t. For
example, imagine that you own a store and you are counting the number
of customers that entered into your store up to a certain time t. A Poisson
process is a probabilistic model that can be used to model this phenomenon.
In a Poisson process, the number of events occurred up to a certain time t
is a random variable that has a Poisson distribution with parameter λt. We
will give a more precise definition of the Poisson process in the next lecture,
and we will study some interesting properties of this process.
Question: what is the state space of a Poisson process?
Brownian Motion
Brownian motion is one of the most important and well-studied continuous
time stochastic processes. It has almost surely continuous sample paths
which are almost everywhere not differentiable. Brownian motion is part of
a class of random processes called diffusion processes; these processes usually
arise as the solution of a stochastic differential equation. Brownian motion
is the key process for the definition of stochastic integrals, i.e. integrals of
the type
∫_0^T Xt(ω) dBt(ω)    (58)
where {Bt}_{t=0}^T is a Brownian motion and {Xt}_{t=0}^T is another random process.
The branch of mathematics devoted to studying the properties of these
integrals is called stochastic calculus.
From a more practical point of view, the study of Brownian motion is
historically associated to modeling the motion of a particle suspended in a
fluid, where the motion results from the collision of the particle with the
atoms and the molecules forming gas or the liquid in which the particle is
immersed.
Question: what is the state space of a Brownian motion?
Discrete-Time Markov Chains
Informally, a Markov chain is a random process with the property that,
conditional on the present state of the process, the future states of the
process are independent of the past states. Mathematically, this condition
is described by the Markov property
P(Xn = xn | Xn−1 = xn−1, Xn−2 = xn−2, . . . , X0 = x0) = P(Xn = xn | Xn−1 = xn−1).    (59)
How would you compactly summarize the Markov chains in the
pictures below?
Lecture 17
Recommended readings: Ross, sections 5.3.1 → 5.3.3
The Bernoulli Process and the Poisson Process
The Bernoulli Process
Let X = {Xi}_{i=1}^∞ be a sequence of iid Bernoulli(p) random variables. The
random process X is called a Bernoulli process.
Question: describe the sample paths of a Bernoulli process.
The Bernoulli process can be thought of as an arrival process: at each
time i either there is an arrival (i.e. Xi = 1, or equivalently stated, we
observe a success) or there is not an arrival (i.e. Xi = 0, or equivalently
stated, we observe a failure).
Many of the properties of a Bernoulli process are already known to us
from previous discussions.
Question: consider n distinct times; what is the probability distribution of the number of arrivals in those n times?
Question: what is the probability distribution of the first arrival?
Question: consider the time needed until the r-th arrival (r ∈
{1, 2, . . . }); what is the probability distribution associated to the
time of the r-th arrival?
The independence of the random variables in the Bernoulli process has
an important implication about the process. Consider a Bernoulli process
X = {Xi}_{i=1}^∞ and the process X−n = {Xi}_{i=n+1}^∞. Because the random variables in X−n are independent of the random variables in X \ X−n, it follows
that X−n is itself a Bernoulli process (starting at time n) which does not
depend on the initial n random variables of X. We say that the Bernoulli
process thus satisfies the fresh-start property.
Question: recall the memoryless property of the Geometric distribution. How do you interpret this property in light of the fresh-start property of the Bernoulli process?
The Poisson Process
A random process X = {Xt }t∈T is called a counting process if Xt represents
the number of events that occur by time t. More properly, if X is a counting
process, it must also satisfy the following intuitive properties:
• for any t ∈ T , Xt ≥ 0
• for any t ∈ T , Xt ∈ Z
• if s ≤ t, then Xs ≤ Xt
• for s ≤ t, Xt −Xs is the number of events occurring in the time interval
(s, t].
A counting process X is said to have independent increments if the number of events occurring in disjoint time intervals are independent. This
means that the random variables Xt − Xs and Xv − Xu are independent
whenever (s, t] ∩ (u, v] = ∅.
Furthermore, a counting process X is said to have stationary increments
if the distribution of the number of events that occur in a time interval only
depends on the length of the time interval. This means that the random
variables Xt+s − Xs have the same distribution for all s ∈ T .
Example:
Consider a Bernoulli process X = {Xi}_{i=1}^∞ and the process Y = {Yi}_{i=1}^∞ with
Yi = Σ_{j≤i} Xj. Then Y is a discrete-time counting process with independent
and stationary increments.
and stationary increments.
We are now ready to give rigoroulsy define what a Poisson process is.
A counting process X = {Xt }t≥0 is said to be a Poisson process with rate
λ > 0 if
• X0 = 0
• X has independent increments
• the number of events in any time interval of length t > 0 is Poisson
distributed with expectation λt, i.e. for any s, t ≥ 0 we have X(t +
s) − X(s) ∼ Poisson(λt).
Question: does a Poisson process have stationary increments?
There is another (equivalent) definition of Poisson process: a counting
process X = {Xt }t≥0 is said to be a Poisson process with rate λ > 0 if
130
1. X0 = 0
2. X has stationary and independent increments
3. P (Xh = 1) = λh + o(h)
4. P (Xh ≥ 2) = o(h)
Condition 3. says that the probability of observing an arrival in a small
amount of time is roughly proportional to h. Condition 4. says that it is
very unlikely to observe more than an arrival in a short amount of time. Try
and verify that these conditions are consistent with a Poisson distribution!
So, we know about the probability distribution of the count of arrivals
up to any time in a Poisson process. But what about the probability distribution of the inter-arrival times {Ti }∞
i=1 ? Notice that the event {T1 > t} is
equivalent to the event ‘there are no arrivals in the time interval [0, t], i.e.
{Xt = 0}. Thus,
P (T1 > t) = P (Xt = 0) = e−λt
from which we see that T1 ∼ Exponential(1/λ). Now, what is the distribution of T2 ? We have
P (T2 > t|T1 = s) = P (Xt+s − Xs = 0|T1 = s) = P (Xt+s − Xs = 0|Xs − X0 = 1)
= P (Xt+s − Xs = 0) = e−λt
which implies that T2 ∼ Exponential(1/λ) as well. By using the same argument, it is easy to see that all the inter-arrival times of a Poisson process
with rate λ are iid Exponential(1/λ) random variables, i.e. with mean 1/λ.
Question: what is the distribution of the waiting time until the
n-th arrival?
We can also describe a Poisson process in terms of inter-arrival times.
Suppose we start with {Ti}_{i=1}^∞, a sequence of iid Exponential random
variables with expectation 1/λ for some λ > 0. Define the random variables Si = Σ_{j≤i} Tj for
i ∈ {1, 2, . . . } and S0 = 0, corresponding to the waiting time until the i-th
arrival. Finally, let Xt = max{i : Si ≤ t}. Then the process X = {Xt}_{t≥0} is
a Poisson process with rate λ. From the definition of Poisson process, it is then easy to
see that the following hold true:
• Xt ≥ n ⇐⇒ Sn ≤ t
• Xt = n ⇐⇒ Sn ≤ t ∧ Sn+1 > t.
A similar argument can be made for the Bernoulli process using iid Geometric(p)
inter-arrival times.
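A simulation sketch of this construction (our own addition; λ = 2 and t = 3 are arbitrary): build the arrival times Si from iid Exponential inter-arrival times with mean 1/λ and check that the number of arrivals in [0, t] looks Poisson(λt).

```python
# Construct a Poisson process from exponential inter-arrival times and
# verify that Xt = max{i : Si ≤ t} has mean and variance ≈ λt.
import numpy as np

rng = np.random.default_rng(9)
lam, t, reps = 2.0, 3.0, 20_000
counts = np.empty(reps, dtype=int)
for r in range(reps):
    # generate far more inter-arrival times than can plausibly fit in [0, t]
    gaps = rng.exponential(scale=1.0 / lam, size=64)
    arrivals = gaps.cumsum()                                 # S1, S2, ...
    counts[r] = np.searchsorted(arrivals, t, side='right')   # Xt

print(counts.mean(), lam * t)    # ≈ λt
print(counts.var(), lam * t)     # Poisson: variance ≈ mean
```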
The Poisson Process as a Continuous-time Version of the Bernoulli
Process
Fix an arrival rate λ > 0, some time t > 0, and divide the time interval (0, t]
into n subintervals of length h = t/n. Consider now a Bernoulli process X =
{Xi }ni=1 defined over these time subintervals, where each Xi is a Bernoulli
random variable recording whether there was an arrival in the time interval
((i − 1)h, ih]. Impose now the conditions 3. and 4. of the definition of
Poisson process that we described above. We have that the probability p of
observing at least one arrival in any of these subintervals is
p = λh + o(h) = λ(t/n) + o(1/n).
Thus, the number of subintervals in which we record an arrival has a Binomial(n, p)
distribution with p.m.f.
p(x) = (n choose x) (λt/n + o(1/n))^x (1 − λt/n + o(1/n))^{n−x} 1{0,1,...,n}(x).
Now, let n → ∞ or equivalently h → 0 so that the partition of (0, t] becomes
finer and finer, and we approach the continuous-time regime. Following the
same limit calculations that we did in Homework 4, as n → ∞ we have
p(x) → e^{−λt} ((λt)^x/x!) 1{0,1,...}(x),
which is the p.m.f. of a Poisson(λt) distribution. Thus, the Poisson process
can be thought of as a continuous-time version of the Bernoulli process.
Lecture 18
Recommended readings: Ross, sections 5.3.4, 5.3.5, 5.4.1
Splitting, Merging, Further Properties of the Poisson Process, the Nonhomogeneous Poisson Process, and the Spatial
Poisson Process
Splitting a Bernoulli Process
Consider a Bernoulli process of parameter p ∈ [0, 1]. Suppose that we keep
any arrival with probability q ∈ [0, 1] or otherwise discard it with probability 1 − q. The new process thus obtained is still a Bernoulli process with
parameter pq. This is an example of thinned Bernoulli process. Similarly,
the process obtained by the discarded arrivals is a thinned Bernoulli process
as well (with parameter p(1 − q)).
This can be easily generalized to the case where we split the original
process in more than two subprocesses.
Merging a Bernoulli Process
Consider two independent Bernoulli processes, one with parameter p and
the other with parameter q. Consider the new process obtained by
Xi = { 1 if there was an arrival at time i in either process; 0 otherwise }.
The process thus obtained is a Bernoulli process with parameter p + q − pq.
This is easily generalized to the case where we merge more than two
independent Bernoulli processes.
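A quick simulation sketch of splitting and merging (our own addition; the values of p and q are arbitrary):

```python
# Thinning a Bernoulli(p) process with keep-probability q gives Bernoulli(pq);
# merging two independent processes gives Bernoulli(p + q − pq).
import numpy as np

rng = np.random.default_rng(10)
n, p, q = 10**6, 0.4, 0.3
x = rng.binomial(1, p, size=n)       # original Bernoulli(p) process
keep = rng.binomial(1, q, size=n)    # independent keep/discard decisions

kept = x * keep                      # thinned process
print(kept.mean(), p * q)            # ≈ pq

y = rng.binomial(1, q, size=n)       # an independent Bernoulli(q) process
merged = np.maximum(x, y)            # arrival in either process
print(merged.mean(), p + q - p * q)  # ≈ p + q − pq
```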
Splitting a Poisson Process
Consider a Poisson process X with rate λ > 0. Suppose that there can be
two different types of arrivals in the process, say type A and type B, and that
each arrival is classified as an arrival of type A with probability p ∈ [0, 1] and
as an arrival of type B with probability 1 − p and that the classification is
done independently from the rest of the process. Let XA and XB denote the
counting processes associated to arrivals of type A and type B respectively.
Then XA and XB are Poisson processes with parameters λp and λ(1 − p)
respectively. Furthermore, XA and XB are independent processes. The two
processes XA and XB are examples of thinned Poisson processes.
This can be easily generalized to the case where we consider n ≥ 3
different types of arrivals.
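A minimal Monte Carlo sketch of the splitting property (Python/NumPy; the values of λ, p, t, and the number of replications are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam, p, t, reps = 2.0, 0.3, 10.0, 50_000

N = rng.poisson(lam * t, size=reps)   # total arrivals in (0, t]
NA = rng.binomial(N, p)               # type-A arrivals: each kept with prob. p
NB = N - NA                           # type-B arrivals

print(NA.mean(), NA.var())            # both close to lam*p*t = 6
print(NB.mean(), NB.var())            # both close to lam*(1-p)*t = 14
print(np.corrcoef(NA, NB)[0, 1])      # close to 0, consistent with independence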
Merging two Poisson Processes
Consider two Poisson processes X1 and X2 with rates λ > 0 and µ > 0 respectively. Consider the new Poisson process X = X1 + X2 . The merged Poisson process X is a Poisson process with rate λ + µ.
Furthermore, any arrival in the process X has probability λ/(λ + µ) of being originated from X1 and probability µ/(λ + µ) of being originated from X2 .
This is easily generalized to the case where more than two Poisson processes are merged.
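The merging property can also be checked numerically. A minimal sketch (arbitrary rates; it simulates the two processes via their Exponential inter-arrival times):

import numpy as np

rng = np.random.default_rng(0)
lam, mu, t = 1.0, 3.0, 1000.0

# arrival times of the two processes on (0, t]
T1 = np.cumsum(rng.exponential(1 / lam, size=int(2 * lam * t)))
T2 = np.cumsum(rng.exponential(1 / mu, size=int(2 * mu * t)))
T1, T2 = T1[T1 <= t], T2[T2 <= t]

merged = np.sort(np.concatenate([T1, T2]))
gaps = np.diff(merged)
print(gaps.mean())            # close to 1/(lam + mu) = 0.25
print(len(T1) / len(merged))  # close to lam/(lam + mu) = 0.25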
Conditional Distribution of the Arrival Times
Suppose that we know that the number of arrivals up to time t > 0 in a
Poisson process with rate λ is n > 0. What is the conditional distribution
of the n arrival times S1 , . . . , Sn given that there were exactly n arrivals?
We will not prove this result (although in homework 10 you will prove the simple case where n = 1), but it turns out that the arrival times are uniformly scattered in the time interval [0, t]: more precisely, given that there were n arrivals, (S1 , . . . , Sn ) is distributed as the vector of order statistics of n iid Uniform(0, t) random variables. Because t can be arbitrarily
large, this gives us a new perspective on the Poisson process: conditional on
the number of arrivals, the Poisson process is the random process that gets
closest to generating values that are uniformly scattered over an interval of
arbitrarily large length (in the limit, [0, ∞)).
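This result also gives a simple recipe for simulating a Poisson process on a fixed window [0, t]: first draw the number of arrivals, then scatter them uniformly. A minimal sketch (arbitrary parameter values):

import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.0, 5.0

n = rng.poisson(lam * t)                  # number of arrivals in [0, t]
arrivals = np.sort(rng.uniform(0, t, n))  # given n, times are sorted uniforms
print(arrivals)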
The Nonhomogeneous Poisson Process
So far we fixed the rate of a Poisson process to be a fixed scalar λ > 0.
However, we can generalize the definition of Poisson process and allow the
arrival rate to vary as a function of time.
We say that a random process X is a nonhomogeneous Poisson process
with intensity function λ(t) for t ≥ 0 if
• X is a counting process
• X0 = 0
• X has independent increments
• P (Xt+h − Xt = 1) = λ(t)h + o(h)
• P (Xt+h − Xt ≥ 2) = o(h)
In this case, for any 0 ≤ s < t, we have

Xt − Xs ∼ Poisson( ∫_s^t λ(u) du ).
Question: is the nonhomogeneous Poisson process described above
a random process with stationary increments?
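One standard way to simulate such a process is by thinning a homogeneous Poisson process: simulate candidate arrivals at a constant rate λmax that dominates λ(t), and keep a candidate at time s with probability λ(s)/λmax. A minimal sketch, where the intensity λ(t) = 1 + sin(t) and the horizon are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

def lam(t):
    return 1.0 + np.sin(t)   # example intensity, bounded above by lam_max = 2

def nonhomogeneous_poisson(T, lam, lam_max):
    # candidate arrivals come from a homogeneous Poisson process of rate lam_max
    times, s = [], 0.0
    while True:
        s += rng.exponential(1 / lam_max)
        if s > T:
            break
        if rng.random() < lam(s) / lam_max:   # keep with probability lam(s)/lam_max
            times.append(s)
    return np.array(times)

print(nonhomogeneous_poisson(10.0, lam, 2.0))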
The Spatial Poisson Process
We saw that the Poisson process is a probabilistic model for random scattering of values across the positive real numbers. What if we want to randomly
scatter points over a spatial domain?
We say that a counting process N = {N (A)}A⊂X on a set X is a spatial
Poisson process with rate λ > 0 if
• N (∅) = 0
• N has independent increments, i.e. if A1 , . . . , An are disjoint sets then
N (A1 ), . . . , N (An ) are independent random variables
• N (A) ∼ Poisson(λ vol(A)).
The spatial Poisson process is a basic example of a point process. If we
denote Y1 , Y2 , . . . the ‘points’ of the process, the spatial Poisson random
process induces a random measure (the Poisson random measure) over the
subsets of X by means of
N(A) = Σ_{i=1}^{N(X)} 1_A (Yi ).
ILLUSTRATION: ASTEROIDS HITTING A PLANET
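A minimal sketch of simulating a spatial Poisson process on a rectangle (conditional on the count, the points are uniform on the region; λ and the rectangle are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
lam = 5.0                     # expected number of points per unit area
a, b = 4.0, 3.0               # rectangle [0, a] x [0, b]

n = rng.poisson(lam * a * b)  # N(A) ~ Poisson(lam * vol(A))
points = rng.uniform([0, 0], [a, b], size=(n, 2))  # uniform given the count
print(n, points[:5])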
Homework 10 (due June 23rd)
1. (25 pts) Consider a random walk process on the integers X = {Xi}∞_{i=0} with P (X0 = x0 ) = 1 for some x0 ∈ Z. The process has state space S = Z, the set of integer numbers. Furthermore, at any time i ∈ {0, 1, . . . } we have

P (Xi+1 = xi + 1|Xi = xi ) = p
P (Xi+1 = xi − 1|Xi = xi ) = 1 − p.

• Does the random process X satisfy the Markov property?
• Consider the new process Y = {Yi}∞_{i=0} with Yi = Σ_{j≤i} Xj . Does the random process Y satisfy the Markov property?
Solution:
• Yes, X satisfies the Markov property: for any i ∈ {0, 1, . . . }, the
state of the process X at time i + 1 only depends on the state of
the process at time i.
• Let Yi = Σ_{j≤i} Xj . Notice that Yi+1 = Yi + Xi+1 . We have

P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (Yi + Xi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (yi + Xi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 )
= P (Xi+1 = yi+1 − yi |Yi = yi , Yi−1 = yi−1 )
= P (Xi+1 = yi+1 − yi |Xi = yi − yi−1 )
= p if yi+1 − yi = yi − yi−1 + 1, and 1 − p if yi+1 − yi = yi − yi−1 − 1
(equivalently, p if yi+1 = 1 + 2yi − yi−1 and 1 − p if yi+1 = −1 + 2yi − yi−1 ).

Thus,

P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 , . . . , Y0 = y0 ) = P (Yi+1 = yi+1 |Yi = yi , Yi−1 = yi−1 ),

which depends on yi−1 and not only on yi , so Y does not satisfy the Markov property.
2. (15 pts) Consider a Bernoulli process X and its inter-arrival times {Ti}∞_{i=1} . Show that Tr = Σ_{i=1}^r Ti with r ∈ {1, 2, . . . } is a Negative Binomial random variable with parameters r and p, where p is the probability of success of each of the random variables {Xi}∞_{i=1} . Hint: you may want to read again through lecture 15, where we illustrated that moment-generating functions are a useful tool to derive the distribution of a sum of independent random variables.
Solution:
The inter-arrival times of a Bernoulli process are iid Geometric(p) random variables. The m.g.f. of a Geometric distribution of parameter p is

m(t) = p e^t / (1 − (1 − p) e^t)

for t < − log(1 − p). Consider the random variable Tr = Σ_{i=1}^r Ti . Since the Ti are independent, we have

m_{Tr}(t) = m^r(t) = [p e^t / (1 − (1 − p) e^t)]^r

for t < − log(1 − p), which is the m.g.f. of the Negative Binomial distribution with parameters r and p. Thus, Tr ∼ NBinomial(r, p).
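A quick Monte Carlo check of this result (Python/NumPy sketch with arbitrary r and p; note that numpy's geometric sampler counts the trials up to and including the first success, which matches the convention used for the inter-arrival times here):

import numpy as np

rng = np.random.default_rng(0)
r, p, reps = 5, 0.3, 200_000

Tr = rng.geometric(p, size=(reps, r)).sum(axis=1)  # sum of r iid Geometric(p)

print(Tr.mean(), r / p)              # NBinomial(r, p) mean: r/p
print(Tr.var(), r * (1 - p) / p**2)  # NBinomial(r, p) variance: r(1-p)/p^2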
3. (30 pts) Consider a Poisson process X = {Xt }t≥0 and fix a time t0 ≥ 0. Show that the process Y = {Yt }t≥0 with Yt = Xt+t0 − Xt0 satisfies the fresh-start property, i.e. show that Y is also a Poisson process and that it is independent of {Xt }0≤t<t0 . Hint: show that Y satisfies
the properties that describe a Poisson process according to the first
definition of Poisson process that we discussed in class.
Solution:
We need to check that the process Y satisfies the properties that define
a Poisson process. We have
• Y0 = Xt0+0 − Xt0 = 0
• for any two time intervals (s, t] and (u, v] such that (s, t]∩(u, v] =
∅, we have Yt − Ys = Xt0 +t − Xt0 +s and Yv − Yu = Xt0 +v − Xt0 +u ;
since X is a Poisson process and (t0 +s, t0 +t]∩(t0 +u, t0 +v] = ∅ if
(s, t]∩(u, v] = ∅, it follows that Yt −Ys and Yv −Yu are independent
• for any s, t ≥ 0 we have Yt+s − Ys = Xt+s+t0 − Xs+t0 ∼ Poisson(λ(t + s + t0 − (s + t0 ))) ≡ Poisson(λt).
Notice that, because X is a Poisson process, for any t ≥ 0, Yt =
Xt+t0 − Xt0 is independent of any increment Xv − Xu with 0 ≤ u <
v < t0 . Thus, Y is independent from {Xt }0≤t<t0 . We thus conclude
that Poisson processes satisfy the fresh-start property.
4. (30 pts) Consider a Poisson process X = {Xt }t≥0 and the time of
the first arrival T1 . Show that, conditional on the event {Xt = 1}
(i.e. there was only one arrival until time t > 0), T1 is uniformly
distributed on [0, t]. Use the definition of conditional probability and express the event {T1 ≤ s ∩ Xt = 1} in terms of Poisson process count increments to show that the function s ↦ P (T1 ≤ s|Xt = 1) corresponds to a Uniform(0, t) c.d.f.
Solution:
We want to compute P (T1 ≤ s|Xt = 1). We have

P (T1 ≤ s|Xt = 1) = P (T1 ≤ s ∩ Xt = 1)/P (Xt = 1)

and therefore

P (T1 ≤ s|Xt = 1) = 0 if s < 0
P (T1 ≤ s|Xt = 1) = P (Xs − X0 = 1 ∩ Xt − Xs = 0)/P (Xt = 1) if 0 ≤ s < t
P (T1 ≤ s|Xt = 1) = 1 if s ≥ t

since, when 0 ≤ s < t, the intersection of the events ‘the first arrival was before time s’ and ‘there was exactly one arrival up to time t’ is ‘there was an arrival in the time interval [0, s] and there were no other arrivals in the time interval (s, t]’.
For 0 ≤ s < t, the independence and the stationarity of the increments of a Poisson process imply that

P (Xs − X0 = 1 ∩ Xt − Xs = 0)/P (Xt = 1) = P (Xs − X0 = 1)P (Xt − Xs = 0)/P (Xt = 1)
= [e^(−λs) λs · e^(−λ(t−s))] / [e^(−λt) λt] = s/t.
It follows that

P (T1 ≤ s|Xt = 1) = 0 if s < 0, s/t if 0 ≤ s < t, and 1 if s ≥ t,

which is the c.d.f. of a Uniform(0, t) distribution.
Lecture 19
Recommended readings: Ross, sections 10.1, 10.3, 10.5, 10.6
Brownian Motion
Multivariate Normal Distribution
Before introducing the Brownian motion, it is worthwhile to recall that an n-dimensional random vector (X1 , . . . , Xn ) is distributed according to a (nondegenerate) multivariate Normal distribution with mean µ ∈ Rn and
covariance matrix Σ ∈ Rn×n if its joint density function is
f (x1 , . . . , xn ) = f (x) = (1/√((2π)^n det(Σ))) e^(−(1/2)(x−µ)^T Σ^(−1) (x−µ))
and the covariance matrix Σ is a positive definite matrix. If the covariance
matrix is only positive semi-definite, the random vector (X1 , . . . , Xn ) is said
to have a degenerate Normal distribution. Two key properties of a Normal
random vector are
1. for i ≠ j, Σi,j = Σj,i = 0 ⇐⇒ Xi and Xj are independent (zero correlation is equivalent to independence for jointly Normal random variables)

2. for any α = (α1 , . . . , αn ) ∈ Rn , the linear combination Σ_{i=1}^n αi Xi has a Normal distribution with expectation α^T µ and variance α^T Σα.
Brownian Motion
A random process B = {Bt }t≥0 valued in R is called Brownian motion, if it
satisfies:
• B0 = 0
• B has independent increments
• for any s, t ≥ 0, Bt − Bs ∼ N (0, |t − s|)
The following properties follow from the definition of Brownian motion.
1. for any s, t ≥ 0, Cov(Bs , Bt ) = E(Bs Bt ) = min{s, t} = s ∧ t
2. for any 0 ≤ t1 < t2 < · · · < tn , the random vector (Bt1 , Bt2 , . . . , Btn ) has a multivariate Normal distribution with null expectation and covariance matrix Σ with entries Σj,k = min{tj , tk }, i.e.

[t1  t1  t1  . . .  t1]
[t1  t2  t2  . . .  t2]
[t1  t2  t3  . . .  t3]
[ .   .   .   . .   . ]
[t1  t2  t3  . . .  tn]
3. Brownian motion is a Gaussian process, i.e. for any 0 ≤ t1 < t2 <
· · · < tn and for any α1 , α2 , . . . , αn ∈ R, the random variable α1 Bt1 +
· · · + αn Btn is a Normal random variable
Conversely, if a random process B = {Bt }t≥0 satisfies properties 1., 2., and
3., and B0 = 0, then B is a Brownian motion.
Note that for any t ≥ 0 we have E(Bt ) = 0 and V (Bt ) = V (Bt − B0 ) = t
so the variance of the process increases linearly with time.
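A minimal sketch of simulating one Brownian motion sample path on [0, T] via independent Normal(0, dt) increments (T and the grid size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 1000
dt = T / n

# independent N(0, dt) increments, cumulatively summed
increments = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(increments)])  # enforces B_0 = 0

print(B[-1])   # one draw of B_T ~ N(0, T)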
Properties of the Brownian Motion Sample Paths
Let us recall that a function f : R → R is said to be γ-Hölder continuous
(for some γ > 0) on an interval [a, b] if
sup_{x,y∈[a,b], x≠y} |f (x) − f (y)| / |x − y|^γ < ∞.
The sample paths t 7→ Bt (ω) of a Brownian motion satisfy the following
properties:
• they are almost surely continuous, i.e. P ({ω ∈ Ω : t ↦ Bt (ω) is a continuous function}) = 1

• they are almost surely γ-Hölder continuous for γ ∈ (0, 1/2) on the interval [0, T ] for any T > 0, i.e. P ({ω ∈ Ω : t ↦ Bt (ω) is a γ-Hölder-continuous function on [0, T ]}) = 1 for any γ ∈ (0, 1/2) and for any T > 0; furthermore, with probability 1 they are not γ-Hölder continuous on any [a, b] ⊂ R+ if γ > 1/2
• they are almost surely nowhere differentiable.
Brownian Motion with Drift
We say that X = {Xt }t≥0 is a Brownian motion with drift coefficient µ ∈ R
and variance σ 2 > 0 if for any t ≥ 0
Xt = σBt + µt
and B = {Bt }t≥0 is a Brownian motion. In this case we have, for any t ≥ 0,
E(Xt ) = µt and V (Xt ) = tσ 2 , thus both the expected value and the variance
increase linearly with time.
Geometric Brownian Motion
Consider the price of a stock X = {Xt }t≥0 evolving as a function of time.
Often, it is reasonable to assume that the percentage changes in the price
of the stock are independent and identically distributed random variables.
In discrete time, this means that we assume that the random variables {Xn+1/Xn}∞_{n=0} are iid.
Let Yn+1 = Xn+1 /Xn for n ∈ {0, 1, . . . } so that Xn+1 = Yn+1 Xn . By
iteration, it follows that
Xn+1 = Yn+1 Yn Yn−1 . . . Y1 X0
thus

log(Xn+1 ) = Σ_{i=1}^{n+1} log(Yi ) + log(X0 ).
After suitable normalization (and moving to the continuous-time regime)
we can approximate the random variables {log(Xn )}∞
n=0 with a Brownian
motion (possibly with a drift) — question for you: why does this make
sense? In this case, the continuous-time random process X is called geometric Brownian motion.
More compactly, let B be a Brownian motion with drift coefficient µ ∈ R and variance σ^2 > 0. The random process X = e^B is a geometric Brownian motion.
In this case, for any t ≥ 0 we have

E(Xt ) = e^((µ + σ^2/2) t)
V (Xt ) = e^(2(µ + σ^2/2) t) (e^(σ^2 t) − 1).
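A quick Monte Carlo check of these two formulas (Python/NumPy sketch with arbitrary µ, σ, and t; at a fixed time t, Bt ∼ N(µt, σ^2 t), so no full path is needed):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, t, reps = 0.1, 0.4, 1.0, 100_000

# B_t for a Brownian motion with drift mu and variance sigma^2
Bt = mu * t + sigma * rng.normal(0.0, np.sqrt(t), size=reps)
Xt = np.exp(Bt)

print(Xt.mean(), np.exp((mu + sigma**2 / 2) * t))                              # E(X_t)
print(Xt.var(), np.exp(2 * (mu + sigma**2 / 2) * t) * (np.exp(sigma**2 * t) - 1))  # V(X_t)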
Brownian Bridge
Let B = {Bt }t∈[0,1] be a Brownian motion. Consider the conditional random
process B̄ = B|{B1 = 0}. The random process B̄ is called a Brownian bridge;
the name is suggested by the fact that the sample paths of B̄ are almost
surely ‘suspended’ between B̄0 = 0 and B̄1 = 0, thus forming a bridge
between the points (0, 0) and (1, 0).
We have that, for any t ∈ [0, 1], E(B̄t ) = 0 and V (B̄t ) = t(1 − t) —
question for you: at which time does the Brownian bridge have
the highest variability? Furthermore, for any 0 < s < t < 1, we have
that Cov(B̄s , B̄t ) = s(1 − t).
A Brownian bridge can be also expressed as
B̄t = Bt − tB1
where B = {Bt }t∈[0,1] is a Brownian motion.
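A minimal numerical check of the representation Bt − tB1 and of the variance t(1 − t) (arbitrary t and sample size):

import numpy as np

rng = np.random.default_rng(0)
reps, t = 100_000, 0.3

B_t = rng.normal(0.0, np.sqrt(t), size=reps)
B_1 = B_t + rng.normal(0.0, np.sqrt(1 - t), size=reps)  # B_1 = B_t + independent increment
bridge = B_t - t * B_1                                  # bridge value at time t

print(bridge.mean())               # close to 0
print(bridge.var(), t * (1 - t))   # close to t(1-t) = 0.21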
Lecture 20
Recommended readings: Ross, sections 4.1 → 4.2
Markov Chains (part 1)
The Markov Property
Let X = {Xn}∞_{n=0} be a discrete-time random process with discrete state space S. X is said to be a Markov chain if it satisfies the Markov property

P (Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 , . . . , X0 = x0 ) = P (Xn+1 = xn+1 |Xn = xn )

for any n ≥ 0.
The Markov property can be stated in the following equivalent way. Let
h be any bounded function mapping S ∞ → R. Then X satisfies the Markov
property if
E(h(Xn+1 , Xn+2 , . . . )|Xn , Xn−1 , . . . , X0 )
= E(h(Xn+1 , Xn+2 , . . . )|Xn ).
How can we interpret the above condition in an algorithmic fashion? Think
about h being any algorithm you may use on the future states of the process (i.e. h is any feature of the future states of the process that you may
be interested to consider). Then your ‘best’ prediction of the algorithm’s
output on the process’s future states given the entire history of the process
up to the current state is a function only of the current state.
Example in class:
Let X = {Xn}∞_{n=0} be a random process where {Xn}∞_{n=0} is a collection of
n=0 be a random process where {Xn }n=0 is a collection of
independent random variables. Then, for all n ≥ 0,
E(h(Xn+1 , Xn+2 , . . . )|Xn , Xn−1 , . . . , X0 )
= E(h(Xn+1 , Xn+2 , . . . )) = E(h(Xn+1 , Xn+2 , . . . )|Xn ).
Thus X satisfies the Markov property.
We are usually interested in two kinds of questions about Markov chains:
• how will the system evolve in the short term? In this case, we use
conditional probabilities to compute the likelihood of various events
and evolutionary paths.
• how will the system evolve in the long term? In this case, we use
conditional probabilities to identify the ‘limiting behavior’ over vast
stretches of time, or the ‘equilibrium’ of the system.
Transition Probabilities
From now on, to simplify the notation, we will assume that the state space
S is the set (or a subset) of the positive integers {0, 1, 2, . . . }. The evolution
of a Markov chain with respect to time is described in terms of the transition
probabilities
P (Xn+1 = j|Xn = i) = Pn;i,j
Notice that in general the transition probability from state i to state j can
vary with time. It is very convenient, especially for computational reasons,
to arrange the transition probabilities into a transition probability matrix
(t.p.m.) P n .
When the transition probabilities Pn;i,j do not change over time, we
say that the Markov chain is time-homogeneous. In this case, we drop the
subscript n and we simply write
P (Xn+1 = j|Xn = i) = Pi,j .
Similarly, the t.p.m. of a time-homogeneous Markov chain is denoted P . In this class we will mainly focus on time-homogeneous Markov chains.
Exercise in class:
Write the t.p.m. of the time-homogeneous Markov chain in the figure below (the three-state example from last time).

[Figure: transition diagram on the states 0, 1, 2 with transition probabilities 0.6 (0 → 0), 0.4 (0 → 1), 0.7 (1 → 0), 0.3 (1 → 2), and 1.0 (2 → 1).]
Exercise in class:
Consider a simple random walk starting from 0. At each time n, the process increments by 1 with probability p ∈ [0, 1] or decrements by 1 with probability 1 − p. Describe the t.p.m. of this process.
Stochastic Matrices
Consider a matrix A. The matrix A is said to be a stochastic matrix if it
satisfies the two following properties:
1. all the entries of A are non-negative, i.e. Ai,j ≥ 0 for all i, j

2. each row of A sums to 1: Σ_j Ai,j = 1 for all i.

Notice that we call the matrix ‘stochastic’, but such a matrix is not random! If the matrix A is such that 1. and 2. are satisfied and, furthermore, also each column of A sums to 1 (Σ_i Ai,j = 1 for all j), then A is said to be a doubly stochastic matrix.
Question: is a t.p.m. a stochastic matrix? Why?
Question: is the t.p.m. of the 3-state Markov chain above a doubly stochastic matrix?
Question: is the t.p.m. of the simple random walk above a doubly
stochastic matrix?
Using Transition Probabilities
Let’s start with an example. Consider again the 3-state Markov chain with its t.p.m.

P =
[0.6  0.4  0  ]
[0.7  0    0.3]
[0    1    0  ]
Suppose that the initial distribution of the Markov chain at time 0 is pX0 =
(P (X0 = 0), P (X0 = 1), P (X0 = 2)) = (0.2, 0, 0.8). What is the probability
distribution of X1 ? What is the probability distribution of Xn , for any n ≥ 1?
Let P (n) denote the n-step t.p.m.. Based on the example above, we see
that P (n) = P P P . . . P , i.e. P (n) is the n-th power of P , P n , with respect
to the usual matrix multiplication.
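A minimal NumPy sketch of these computations for the 3-state example above (the distribution of Xn is the row vector pX0 multiplied by the n-th matrix power of P; n = 10 is an arbitrary choice):

import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.7, 0.0, 0.3],
              [0.0, 1.0, 0.0]])
p0 = np.array([0.2, 0.0, 0.8])

assert np.allclose(P.sum(axis=1), 1.0)   # P is a stochastic matrix

p1 = p0 @ P                              # distribution of X_1
pn = p0 @ np.linalg.matrix_power(P, 10)  # distribution of X_10
print(p1, pn)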
This is made precise in terms of the Chapman-Kolmogorov equations,
which state that for all positive integers n, m we have
P (n+m) = P (n) P (m)
with P (0) = I.
Therefore, we conclude that a discrete-state time-homogeneous Markov chain is completely described by the initial probability distribution pX0 and the t.p.m. P .
Lecture 21
Recommended readings: Ross, section 4.3
Markov Chains (part 2)
Probability of Sample Paths
Consider a sample path (i0 , i1 , . . . , in ). Note that the Markov property
makes it very easy to compute the probability of observing this sample
path. In fact we have
P (X0 = i0 ∩ X1 = i1 ∩ · · · ∩ Xn = in )
= P (X0 = i0 )P (X1 = i1 |X0 = i0 )P (X2 = i2 |X1 = i1 ∩ X0 = i0 ) × . . .
× P (Xn = in |Xn−1 = in−1 ∩ · · · ∩ X1 = i1 ∩ X0 = i0 )
= P (X0 = i0 )P (X1 = i1 |X0 = i0 )P (X2 = i2 |X1 = i1 ) . . . P (Xn = in |Xn−1 = in−1 ).
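A minimal sketch of this computation (Python/NumPy; the function name and the example path are illustrative choices, and the t.p.m. is the 3-state example from the previous lecture):

import numpy as np

def path_probability(p0, P, path):
    # P(X0 = i0, X1 = i1, ..., Xn = in) for a time-homogeneous Markov chain
    prob = p0[path[0]]
    for i, j in zip(path[:-1], path[1:]):
        prob *= P[i, j]   # one transition probability per step, by the Markov property
    return prob

P = np.array([[0.6, 0.4, 0.0],
              [0.7, 0.0, 0.3],
              [0.0, 1.0, 0.0]])
p0 = np.array([0.2, 0.0, 0.8])
print(path_probability(p0, P, [2, 1, 0, 0]))  # 0.8 * 1.0 * 0.7 * 0.6 = 0.336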
Communicating Classes
We now turn to the questions of which states of the process can be reached
from which. This type of analysis will help us understand both the short
term and the long term behavior of a Markov chain.
Let i, j ∈ S be two states (possibly with i = j). We say that j is accessible from i if, for some n ≥ 0, P^(n)_{i,j} > 0. This is often denoted i → j.
Let again i, j ∈ S be two states (possibly with i = j). We say that i and
j communicate if i → j and j → i. This is often denoted i ↔ j.
The relation of communication satisfies the following properties for each
i, j, k ∈ S:
1. i ↔ i
2. i ↔ j =⇒ j ↔ i
3. i ↔ j ∧ j ↔ k =⇒ i ↔ k.
Communication is therefore an equivalence relation. This means that
the state space S can be partitioned into disjoint subsets, where all the
states that communicate with each other belong to the same subset. These
subsets are called communicating classes. Each state of the state space S
lies in exactly one communicating class.
Exercise in class:
Consider the following t.p.m. of a Markov chain:
1 1

2
2 0 0
 1 1 0 0
2
2

P =
1 1 1 1
4
4
4
4
0 0 0 1
What are the communicating classes of this Markov chain?
Exercise in class:
Consider the following t.p.m. of a Markov chain:

P =
[1    0    0    0    0    0     0   ]
[0.1  0    0.4  0    0.4  0     0.1 ]
[0    0.3  0    0.7  0    0     0   ]
[0    0    0.3  0    0.5  0     0.2 ]
[0    0.3  0    0.3  0    0.4   0   ]
[0    0    0    0    0    0.25  0.75]
[0    0    0    0    0    0.6   0.4 ]
What are the communicating classes of this Markov chain?
Absorbing Classes
A communicating class is called absorbing if the Markov chain never leaves
that class once it enters.
Question: what communicating classes are absorbing in the above
examples?
Irreducibility
A Markov chain is said to be irreducible if it has only one communicating
class (trivial partition of the state space).
Exercise in class:
Consider again the random walk on the pentagon. Is this Markov chain
irreducible?
If a Markov chain is not irreducible, then we can partition its state space
S into disjoint sets

S = D ∪ (∪i Ci )    (60)
where each Ci is an absorbing communicating class and D is a union of
non-absorbing communicating classes.
Note the following key implication: eventually, the Markov chain will
leave D and enter exactly one of the Ci ’s. At that point the Markov chain
will act like an irreducible Markov chain on the reduced state space Ci .
Recurrent and Transient States
Informally, we say that a state i ∈ S is recurrent if the probability that, starting from state i, the Markov chain will ever visit state i again is 1. On the other hand, if the probability that, starting from state i, the Markov chain will ever visit state i again is strictly less than 1, then we say that i is a transient state.
Note that, based on the above definition,
• if i ∈ S is recurrent then, if the Markov chain starts from state i, state
i will be visited infinitely often
• if i ∈ S is transient then, each time the Markov chain visits state i,
there is a positive probability p ∈ (0, 1] that state i will never be visited
again. Therefore, the probability that starting from state i the Markov
chain will visit state i exactly n times is (1 − p)n−1 p; this means that
the amount of time that the Markov chain spends in state i (starting
from state i) is a Geometric random variable with parameter p. This
implies that, if the Markov chain starts from state i, the expected
amount of time spent in state i is 1/p.
From the above, it follows that a state i ∈ S is recurrent if and only
if, starting from state i, the expected number of times that the Markov
chain visits state i is infinite. This allows us to characterize recurrent and
transient states in terms of transition probabilities. Let In = 1{Xn =i} (Xn )
be the random variable indicating whether the Markov chain is visiting state i at time n. Then Σ_{n=0}^∞ In is the amount of time spent by the Markov chain in state i. We have

E( Σ_{n=0}^∞ In | X0 = i ) = Σ_{n=0}^∞ E(In |X0 = i) = Σ_{n=0}^∞ P (Xn = i|X0 = i) = Σ_{n=0}^∞ P^(n)_{i,i}
and therefore

Σ_{n=0}^∞ P^(n)_{i,i} = ∞ ⇐⇒ i is recurrent
Σ_{n=0}^∞ P^(n)_{i,i} < ∞ ⇐⇒ i is transient.
Notice also that if i ∈ S is recurrent then every state in the same communicating class is also recurrent. The property of being recurrent or transient
is a class property.
Example in class:
Consider again the random walk on the integers. Since this Markov chain is irreducible (all the states communicate), either all the states are recurrent or they are all transient. It turns out that the random walk on the integers is recurrent if and only if the probability of a positive/negative increment is exactly 1/2. Otherwise, the random walk on the integers is transient (see Ross, pages 208-209 for a proof).
Exercise in class: consider again the previous two Markov chains.
Which classes are recurrent/transient?
Lecture 22
Recommended readings: Ross, section 4.4
Markov Chains (part 3)
Periodicity
Let’s start with a simple example. Consider the t.p.m.

P =
[0  1  0]
[0  0  1]
[1  0  0].

We have that

P^(2) =
[0  0  1]
[1  0  0]
[0  1  0]

and

P^(3) =
[1  0  0]
[0  1  0]
[0  0  1].
Starting from any of the states 0, 1, 2, how often does the Markov chain
return to the initial state?
The above example suggests that this simple Markov chain has a built-in
periodicity of order 3. How can we make this idea more formal and generalize
it to more complicated Markov chains?
For any state s ∈ S of a Markov chain, we define the period of s as

d(s) = gcd{n ≥ 1 : P^(n)_{s,s} > 0}

where gcd stands for ‘greatest common divisor’. This implies that P^(n)_{s,s} = 0 unless n = m d(s) for some m ∈ Z.
An irreducible Markov chain is said to be aperiodic if d(s) = 1 for all
states s ∈ S. It is called strongly aperiodic if Ps,s > 0 for some s ∈ S.
Exercise in class:
Consider

P =
[0    1/2  1/2]
[1/4  0    3/4]
[1/8  7/8  0  ].
Is this Markov chain periodic, aperiodic, or strongly aperiodic?
Exercise in class:
Consider

P =
[1/4  1/2  1/4]
[1/4  0    3/4]
[1/8  7/8  0  ].
Is this Markov chain periodic, aperiodic, or strongly aperiodic?
Periodicity is another class property. This means that the period is
constant over any fixed communicating class of a Markov chain.
Cyclic Class Decomposition
Let X be an irreducible Markov chain with period d. There exists a partition of S

S = ∪_{k=1}^d Uk

such that

P (Xn+1 ∈ Uk+1 |Xn = s) = 1

for s ∈ Uk and k = 1, . . . , d, with the convention Ud+1 = U1 .
The sets U1 , . . . , Ud are called cyclic classes of X because X cycles through them successively. It thus follows that the ‘accelerated’ Markov chain Xd = {Xid}∞_{i=1} has transition probability matrix P^d and each Uk is an absorbing, irreducible, aperiodic class.
In light of this result, we can rewrite the state space decomposition of equation (60) as

S = D ∪ (∪i Ci ) = D ∪ (∪i ∪_{j=1}^{di} Ui,j )

where each absorbing class Ci is an irreducible Markov chain with period di whose state space can be partitioned into the di cyclic classes Ui,1 , . . . , Ui,di .
Positive and Null Recurrence
We already defined the notion of recurrence in the previous lecture. Here,
we introduce a somewhat technical (but important) point. Suppose that
state s ∈ S is recurrent. Then, based on the previous lecture, this means
that if the Markov chain starts from state s, the chain will visit state s
in the future infinitely often. We now make the following distinction for a
recurrent state s (or, equivalently, for the communicating class of s, since
recurrence is a class property):
• s is positive recurrent if, starting from s, the expected amount of time
until the Markov chain visits again state s is finite
• s is null recurrent if, starting from s, the expected amount of time
until the Markov chain visits again state s is infinite.
While there is indeed a difference between positive and null recurrent states,
it can be shown that in a Markov chain with finite state space all recurrent
states are positive recurrent.
Ergodicity, Equilibrium and Limiting Probabilities
Consider a Markov chain with t.p.m. P . Suppose there exists a probability
distribution π over the state space such that πP = π. This implies that
π = πP (n) for all integers n ≥ 0. The probability distribution π is called the
equilibrium distribution or the stationary distribution of the Markov chain.
In equilibrium, πj is the proportion of time spent by the Markov chain in state j.
A communicating class is said to be ergodic if it is positive recurrent and
aperiodic. We have the following important result: for an irreducible and
ergodic Markov chain the limiting probabilities
πj = lim_{n→∞} P^(n)_{i,j} ,   j ∈ S

exist and the limit does not depend on i. Furthermore, the vector π of the above limiting probabilities is the unique solution to π = πP and Σ_j πj = 1 (hence it corresponds to the stationary distribution of the Markov chain).
The limiting probabilities can also be shown to correspond to the long-term proportion of time that the Markov chain spends in each state. More precisely, we have that

πj = lim_{n→∞} vij (n)/n
where vij (n) denotes the expected number of visits to state j, starting from
state i, within the first n steps.
Equilibrium When the State Space is Finite
Let U be a |S| × |S| matrix with entries all equal to 1, let u be a |S|-dimensional vector with entries all equal to 1, and let o be a |S|-dimensional vector with entries all equal to 0. Let X be an irreducible, aperiodic, and finite state (and thus ergodic) Markov chain. If we want to find the stationary distribution of X we must solve the system

π(I − P ) = o
πU = u.
Exercise for home:
Consider the t.p.m.

0.6 0.4 0
P = 0.7 0 0.3 .
0
1
0

Apply the above formula to verify that π ≈ (0.5738, 0.3279, 0.0984).
It can be shown that if the matrix P is doubly stochastic, we have that
πj = 1/|S| for all j ∈ S.
Exercise in class:
Consider again the random walk on the pentagon. What is the stationary
distribution of this Markov chain?
156
Lecture 23
Recommended readings: Ross, section 4.6
Markov Chains (part 4)
More on Equilibrium Distribution and Limiting Probabilities
In the previous lecture, we studied that if an irreducible Markov chain is
ergodic (i.e. positive recurrent and aperiodic), the limiting probabilities
lim Pijn
n→∞
j∈S
(61)
exist and these limits do not depend on i or on the initial distribution pX0 .
Furthermore, we said that, whenever a stationary distribution π exists, the
limiting probabilities above correspond exactly to the entries of π.
Let us know clarify a slightly technical point. While it is true that in
order for the Markov chain to admit the above limiting probabilities we
need the Markov chain to be irreducible, positive recurrent and aperiodic
(i.e. irreducible and ergodic), the existence of a stationary distribution does
not require the aperiodicity assumption.
Hitting Probabilities (part 1)
Recall the decomposition of the state space of equation (60). Suppose that
the Markov chain starts at a state s ∈ D which is not absorbing. At some
point in time, the Markov chain will eventually enter one of the absorbing
classes Ci of the decomposition (60) and will be ‘stuck’ in that class from
that time on. What is the probability that, starting from s ∈ D, the chain
will hit the absorbing class Ci ?
Define hi (s) as the probability that, starting from a state s ∈ S, the
chain will hit the absorbing class Ci . We have
X
hi (s) = Σ_{s′∈Ci} P_{s,s′} + Σ_{u∈D} P_{s,u} hi (u)
P Define P D to be the submatrix of P associated to D and bi (s) =
s0 ∈Ci Pss0 for s ∈ D. When S is finite, we can rewrite the above equation in vector/matrix form as
(I − P D )hi = bi =⇒ hi = (I − P D )−1 bi .
157
Example in class:
Let X be a Markov chain with t.p.m.

1 0
1 0
2 1
P =
 0 12
0
3
0 0
0
0
1
3
1
6
1
2
0
1
3
0
0
0

0
0

0

1
3
1
defined on S = {0, 1, 2, 3, 4}. We have that C1 = {0}, C2 = {4}, and
D = {1, 2, 3}. Then,


0 13 61
P D =  21 0 12  ,
1
1
3
3 0
1
2
b1 =  0  ,
0
 
0
b2 =  0  ,
1
3
and
(I − P D )−1


30 14 12
1 
24 34 21 .
=
19
18 16 30
Thus,


15
1  
12
h1 =
19
9
and


4
1  
7 .
h2 =
19
10
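A minimal NumPy check of these computations (it reproduces (I − P_D)^(−1), h1, and h2, up to the common factor 1/19):

import numpy as np

P_D = np.array([[0, 1/3, 1/6],
                [1/2, 0, 1/2],
                [1/3, 1/3, 0]])
b1 = np.array([1/2, 0, 0])
b2 = np.array([0, 0, 1/3])

M = np.linalg.inv(np.eye(3) - P_D)
print(19 * M)       # approx. [[30, 14, 12], [24, 34, 21], [18, 16, 30]]
print(19 * M @ b1)  # approx. (15, 12, 9)
print(19 * M @ b2)  # approx. (4, 7, 10)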
Block Decomposition of the T.P.M.
From now on, we consider a finite state Markov chain with a set A of |A| = a absorbing states and a set T of |T | = t transient states. We can rearrange the t.p.m. of the Markov chain as
P =
[I  0]
[S  T]

where I is an a × a identity matrix, T is a t × t matrix, S is a t × a matrix, and 0 is an a × t matrix of zeroes. Notice that the values of S and T are not unique: they depend on how you order the states.
Hitting Probabilities (part 2), Expected Hitting Times, Expected
Number of Visits
In terms of the block decomposition of the t.p.m., it can be shown that we
can obtain the hitting probabilities matrix by simply computing
(I − T )−1 S.
On the other hand, the matrix (I − T )−1 gives us the expected number
of visits to state s′ starting from state s, with s, s′ ∈ T .
Finally, if we let u denote a vector of ones, it can be shown that m =
(I − T )−1 u returns the vector of the expected time until, starting from a
transient state s ∈ T , the Markov chain hits an absorbing state.
Example in class (the rat maze):
A rat is placed in the maze depicted below. If the rat is in the rooms 2, 3,
4, or 5, it chooses one of the doors at random with equal probability. The
rat stays put once it reaches either the food or the shock.
The t.p.m. for the chain is (ordering the states as F, 2, 3, 4, 5, S)

P =
[1    0    0    0    0    0  ]
[1/2  0    0    1/2  0    0  ]
[1/3  0    0    1/3  1/3  0  ]
[0    1/3  1/3  0    0    1/3]
[0    0    1/2  0    0    1/2]
[0    0    0    0    0    1  ]
Using the formulae above, we have that the matrix of the expected number
of visits is
 26 2
5
2 
21
4
 21
 10
21
2
21
7
10
7
4
7
5
7
7
4
7
10
7
2
7
while the matrix of hitting probabilities is
5 2
7
4
 73

7
2
7
7
3
7
4
7
5
7
159
21
10 
21  ,
4 
21
26
21
and the expected times until absorption are

m = (49/21, 56/21, 56/21, 49/21)^T .

How do we interpret these numbers? For instance, starting from room 2 the rat needs on average 49/21 ≈ 2.33 steps to reach either the food or the shock and, from the hitting probabilities above, it reaches the food before the shock with probability 5/7.

[Figure: the rat maze, with the Food (F) room, the rooms 2, 3, 4, and 5, and the Shock (S) room.]
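A minimal NumPy sketch that reproduces the rat-maze computations via the block decomposition (the transient states are ordered 2, 3, 4, 5 and the absorbing states F, S, following the t.p.m. above):

import numpy as np

# T: transitions among the transient rooms; S: transitions into F and S
T = np.array([[0, 0, 1/2, 0],
              [0, 0, 1/3, 1/3],
              [1/3, 1/3, 0, 0],
              [0, 1/2, 0, 0]])
S = np.array([[1/2, 0],
              [1/3, 0],
              [0, 1/3],
              [0, 1/2]])

M = np.linalg.inv(np.eye(4) - T)  # expected numbers of visits among transient states
print(M @ S)            # hitting probabilities: rows 2, 3, 4, 5; columns F, S
print(M @ np.ones(4))   # expected times until absorption: (49/21, 56/21, 56/21, 49/21)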