MATH 526 LECTURE NOTES
TERRY SOO
Abstract. These notes are for Math 526 (a first course in probability and statistics for students with the power of first year Calculus), following the textbook Probability and Statistics for Engineers and Scientists, by Walpole, Myers, Myers, and Ye. They are
not a substitute for reading the book or attending class.
1. Chapter 1
1.1. Introduction. We will try to get an idea of what statistics is
about with the following examples. A population is a collection of
individuals or items of a particular type. For example, we could be
interested in the population of KU students or the population of KState
students. Suppose we are interested in knowing which students are
taller. The population mean of the heights of a population is given by
summing all the heights and then dividing by the number of students.
One way of interpreting this question is to compare the population
mean of the heights of the two populations.
However, it may be too hard or undesirable to actually measure the
heights of all the students of both populations; the best we may be able
to do is take a sample; that is, just measure the heights of a subset of the
students. What subsets should one take? A simple way to sample is to
take a simple random sample, where each individual is equally likely to
be sampled. So say we sample n students from each of the populations
and obtain the height data: x1 , . . . , xn and y1 , . . . , yn , giving the heights
of the sampled KU and KState students. We compute the sample mean
of the heights of the sampled KU students given by
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \]
and we similarly compute y¯. But how close are these sample means to
the population means? How certain can we be that these estimators
for the population means tell us anything about them?
My favourite examples in probability and statistics usually involve
flipping coins and rolling dice. Suppose we are given a coin and want
to determine if it is fair or not. Let’s assign a value of 1 to heads and
0 to tails. In this case, we imagine the population to be the set of all coin
flips that are generated by the coin, and a sample is given by flipping the
coin a finite number of times. Suppose we obtain the values x1 , . . . , xn .
Again, we can take the sample mean x¯ and hope that this tells us
something about the true probability that a coin flip comes up heads.
By the end of this course we will have a quantitative way of saying how
good these estimators are.
1.2. Standard deviation. Let’s return to the heights of the KU students. What if you for some reason knew that most KU students are
roughly the same height? Would you be more confident that the sample
average would be a good estimator for the population mean? One way
to measure the spread of sample numerical values x1 , . . . , xn is given
by the sample standard deviation. The sample variance is given by
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \]
where x¯ is the sample mean, and the sample standard deviation is given
by
\[ s = \sqrt{s^2}. \]
Hopefully, we will discuss why n − 1 appears instead of n, later in the
course.
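The definitions of the sample mean, sample variance, and sample standard deviation can be sketched in a few lines of Python. This is only an illustration: the function names and the height data are made up for the example, and the standard library's `statistics.variance` (which also divides by n − 1) is used as a cross-check.

```python
import math
import statistics

def sample_mean(xs):
    """Sample mean: the sum of the data divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance s^2, dividing by n - 1 (needs at least 2 data points)."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Sample standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

heights = [170.0, 172.0, 168.0, 181.0, 174.0]  # hypothetical heights in cm

print(sample_mean(heights), sample_variance(heights), sample_sd(heights))
# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(sample_variance(heights), statistics.variance(heights))
```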
Example done in class 1.1. Let L(z) = az+b. Given data x1 , . . . , xn ,
consider the transformed data y1 , . . . , yn , where yi = L(xi ). Think of
converting from Celsius to Fahrenheit for a concrete example. Find the
relation between sx , the sample standard deviation of x1 , . . . , xn and sy ,
the sample standard deviation of y1 , . . . , yn .
Example done in class 1.2. Let $f(z) := \sum_{i=1}^{n} (z - x_i)^2$. Find the
minimum of f. Use the power of Calculus.
Exercise 1.3. Prove the so-called short cut formula:
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Big( \sum_{i=1}^{n} x_i \Big)^2. \]
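Before attempting the proof, it can help to check the short cut formula numerically. The sketch below does this with exact rational arithmetic on a made-up integer data set; the function names and the data are assumptions for illustration, and a numerical check is of course not a proof.

```python
from fractions import Fraction

def centered_sum_of_squares(xs):
    """Left-hand side: sum of (x_i - xbar)^2, computed exactly (integer data)."""
    xbar = Fraction(sum(xs), len(xs))
    return sum((x - xbar) ** 2 for x in xs)

def shortcut_sum_of_squares(xs):
    """Right-hand side: sum of x_i^2 minus (1/n) * (sum of x_i)^2."""
    return sum(x * x for x in xs) - Fraction(sum(xs) ** 2, len(xs))

data = [2, 6, 5, 9, 3]  # hypothetical integer data

print(centered_sum_of_squares(data), shortcut_sum_of_squares(data))
```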
Exercise 1.4. Consider m := min {x1 , . . . , xn } and M := max {x1 , . . . , xn },
so that m ∈ {x1 , . . . , xn } and m ≤ xi for all i = 1, 2, . . . , n and
M ∈ {x1 , . . . , xn } and M ≥ xi for all i = 1, 2, . . . , n. (For example,
min {2, 6, 5} = 2 and max {2, 6, 5} = 6.)
(a) Show that m ≤ x¯ ≤ M.
(b) Show that
\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \le (M - m)^2. \]
2. Chapter 2.1, 2.2
2.1. Sample spaces. We will now build a mathematical framework
that will model random events. By an experiment we mean any process
that generates a set of data. Call the set of all possible outcomes of
a statistical experiment a sample space; sometimes it is denoted by
S, S, or Ω. In the simple experiment, where we toss a coin, we can
take S = {H, T } or S = {0, 1}. In another experiment, where we toss a
coin until it lands heads, we can take S = {H, T H, T T H, T T T H, . . .}.
If we are wondering how much longer it will take the bus to arrive
we can take S = {t ∈ R : t ≥ 0} . An event is a subset of a sample
space. That is, E is an event of a sample space S, if every element of
E is an element of S; in this case, we write, E ⊆ S. Note that the
whole set S and the empty or null set, ∅ are always events. In the
simple experiment where we roll a dice, and then toss a coin, we can
take S = {(i, j) : i = 1, 2, 3, 4, 5, 6, j = 0, 1}, and examples of events are
given by {(6, 0), (6, 1)} and {(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)}; in
words, these are the events that the dice rolled 6, and the coin came
up tails, respectively.
2.2. Set operations. In dealing with events, it will be necessary to
acquaint ourselves with set notions. By x ∈ E, we mean that x is
a member of the set E (sometimes we also say that x is an element
of E). Two sets E and F are equal whenever they have exactly the
same elements. Thus if E = {1, 2, 3} and F = {3, 2, 1}, then F = E.
Let’s assume we are working in the sample space S. If A ⊆ S, then
by the complement of A with respect to S, we mean the set A′ =
{x ∈ S : x ∉ A}; this is the set of all elements of S that are not in A,
and is also sometimes denoted by A^c. The intersection of two events
A and B is the set A ∩ B = {x ∈ S : x ∈ A and x ∈ B}. The union
of two events A and B is the set A ∪ B = {x ∈ S : x ∈ A or x ∈ B}.
Example done in class 2.1. If S1 and S2 are sample spaces, we can
create another sample space by taking
S1 × S2 = {(s1 , s2 ) : s1 ∈ S1 , s2 ∈ S2 } .
Use this notation to write sample spaces for the following experiments:
flip a coin three times; flip a coin and then roll a dice.
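Product sample spaces are easy to enumerate on a computer. The following sketch uses `itertools.product` from the Python standard library; the labels "H", "T" and the variable names are choices made for this illustration.

```python
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Sample space for three coin flips: coin x coin x coin.
three_flips = list(product(coin, repeat=3))

# Sample space for a coin flip followed by a roll: coin x die.
coin_then_die = list(product(coin, die))

print(len(three_flips), len(coin_then_die))
```

Each outcome is a tuple such as `("H", "T", "H")` or `("T", 4)`, matching the ordered pairs in the definition of S1 × S2.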
Example done in class 2.2. Let A and B be events of a sample space
S, where A ⊆ B ⊆ S. Show that B′ ⊆ A′.
We will now verify one of De Morgan’s laws.
Example done in class 2.3 (De Morgan). Let A and B be events of
a sample space S. Show that
(1) (A ∩ B)′ = A′ ∪ B′ and
(2) (A ∪ B)′ = A′ ∩ B′.
Solution. We will verify (half) of the first identity (1). Note that in
order to verify a set identity such as E = F , it suffices to check that
E ⊆ F and F ⊆ E. Let x ∈ A′ ∪ B′; then x is in A′ or B′. Note that
A ∩ B ⊆ A, thus Exercise 2.2 gives that A′ ⊆ (A ∩ B)′. If x ∈ A′, then
we have x ∈ (A ∩ B)′. Similarly, if x ∈ B′, then x ∈ (A ∩ B)′. Thus
in both cases, x ∈ (A ∩ B)′; hence, A′ ∪ B′ ⊆ (A ∩ B)′. It remains to
verify that (A ∩ B)′ ⊆ A′ ∪ B′. Do this for homework.
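De Morgan's laws can be sanity-checked on a small finite example using Python's built-in sets. The sample space and the two events below are arbitrary choices for illustration; checking one example does not replace the proof.

```python
S = set(range(1, 11))   # a small sample space {1, ..., 10}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def complement(E, space=S):
    """Complement of E with respect to the sample space."""
    return space - E

# (A ∩ B)' = A' ∪ B'
assert complement(A & B) == complement(A) | complement(B)
# (A ∪ B)' = A' ∩ B'
assert complement(A | B) == complement(A) & complement(B)
```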
Exercise 2.4. Let S = {a, b, c, d}. List all the subsets of S that have
exactly three elements.
3. Chapter 2.3
3.1. Counting techniques. Often we want to say that all the outcomes
of an experiment are equally likely to happen. In order to do this, we
need to know how to count the number of elements in a sample space.
The most basic rule is the multiplication rule, which states that if A
and B are finite sets with cardinalities |A| = n and |B| = m, then
|A × B| = nm. Another way of stating this, is that if the first element
of an ordered pair can be chosen in n ways, and the second element
of an ordered pair can be chosen in m ways, then the total number of
ordered pairs is nm.
Given, say, 3 distinct objects {a1 , a2 , a3 }, a permutation is an ordered
arrangement of them; for example, (a1 , a2 , a3 ) or (a2 , a1 , a3 ).
Example done in class 3.1. List all the possible permutations of the
above example.
We can use the multiplication rule to count the number of possible
permutations of n objects; there are n choices for the first entry, then
there are n − 1 choices for the second entry, and so on, so that there
are n! = n(n − 1)(n − 2) · · · 2 · 1 permutations. We take by definition that
0! = 1.
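As a quick illustration of the count n!, the permutations of three objects can be listed with `itertools.permutations`; the object labels are placeholders chosen for this sketch.

```python
from itertools import permutations
from math import factorial

objects = ["a1", "a2", "a3"]

# Every ordered arrangement of the three objects.
perms = list(permutations(objects))

for perm in perms:
    print(perm)

# The multiplication rule predicts 3 * 2 * 1 = 3! arrangements.
assert len(perms) == factorial(len(objects))
```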
Given n distinct objects {a1 , . . . , an }, suppose we want to partition
them into 2 sets of size k and n − k (where we don’t care about order). We can think of this as choosing k objects from n without regard
to order; such a selection is called a combination. We can count the
number of ways to do this via the following argument. Every permutation say, (a1 , . . . , ak , ak+1 , . . . an ) or (a2 , a1 , . . . , ak , ak+1 , . . . , an ) yields
a partition of 2 sets, by putting the first k coordinates into one set and
putting the last n − k elements into another. However, in the examples
above, some of them give the same set. For each permutation, there are
exactly k! · (n − k)! permutations (counting itself) that give the same partition;
thus the answer is
\[ \binom{n}{k} = \binom{n}{k,\, n-k} = \frac{n!}{k!\,(n-k)!}. \]
Similarly, we can derive a formula to count the number of partitions
of n distinct objects into r sets of size n1 , . . . , nr , where n1 + · · · + nr = n.
We obtain that the number is
\[ \binom{n}{n_1, \ldots, n_r} = \frac{n!}{n_1! \cdots n_r!}. \]
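The two counting formulas translate directly into code. The sketch below computes them from factorials and compares the binomial case with `math.comb` from the standard library; the function names `binomial` and `multinomial` are chosen for this illustration.

```python
from math import comb, factorial

def binomial(n, k):
    """Number of ways to choose k objects from n: n! / (k! (n - k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def multinomial(sizes):
    """Number of partitions of n = sum(sizes) distinct objects into
    sets of the given sizes: n! / (n_1! ... n_r!)."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)
    return result

print(binomial(5, 3), multinomial([2, 2, 1]))
assert binomial(5, 3) == comb(5, 3)
```

With r = 2 the multinomial coefficient reduces to the binomial one, so `multinomial([k, n - k])` agrees with `binomial(n, k)`.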
Example done in class 3.2. In Powerball, a lottery, 5 white balls
are drawn from 59 numbered balls, and 1 red ball is drawn from 35
numbered red balls. The jackpot is won by guessing correctly all the
balls that are drawn (you don’t have to guess the order). How many
different choices are there for the player?
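One way to set up the count is with the multiplication rule: choose the 5 white balls without regard to order, then choose the red ball. A short sketch, assuming the ball counts stated in the example:

```python
from math import comb

white_choices = comb(59, 5)   # 5 white balls from 59, order ignored
red_choices = comb(35, 1)     # 1 red ball from 35

# Multiplication rule: each white selection pairs with each red ball.
total_tickets = white_choices * red_choices
print(white_choices, red_choices, total_tickets)
```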
Example done in class 3.3 (Binomial formula). Show that
\[ (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}, \]
and then use this to show that a set of size n has 2^n subsets.
4. Chapter 2.3, 2.4
Exercise 4.1. What is wrong with the following argument? Let A and
B be disjoint sets, both of size 5. The set A has $\binom{5}{3}$ subsets of size 3;
ditto for B. So, the set A ∪ B has size 10, and has $\binom{5}{3} \times \binom{5}{3}$ subsets of
size 6. Thus $\binom{5}{3}^2 = \binom{10}{6}$.
4.1. Poker hands. A poker hand is a set of five cards from the deck of
52 standard playing cards. In a standard deck of playing cards (sometimes called a French deck), there are 13 ranks (A, 2, 3, . . . , 10, J, Q, K),
and for each rank there are four suits: ♦-diamond, ♥-heart, ♣-club,
and ♠-spade.
Example done in class 4.2 (2 pair). A two-pair is of the form
(aa)(bb)c, where a, b, c are cards of distinct rank. Thus we do not count
a four of a kind as a two-pair! An example of a two pair would be,
{4♦, 4♥, K♣, K♠, A♥}
(1) What is the total number of poker hands?
(2) What is the total number of two-pairs?
Solution. There are $\binom{52}{5}$ poker hands. To count the number of two-pairs,
we first choose two ranks from the 13 ranks, then for each of the
two ranks, we need to choose 2 suits from the 4 possible suits, then we
still need to pick one card to be the non-paired card; to do this, we note
that there are 11 remaining ranks to choose from, and for each rank
there are four possible suits. Thus we have that the total number of two
pairs is given by:
\[ \binom{13}{2} \binom{4}{2} \binom{4}{2} (11)(4). \]
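The counts in this solution are easy to evaluate with `math.comb`. The sketch below follows the same steps as the argument (ranks of the pairs, suits of each pair, then the unpaired card); the variable names are illustrative, and the final ratio assumes every hand is equally likely.

```python
from fractions import Fraction
from math import comb

total_hands = comb(52, 5)

pair_ranks = comb(13, 2)       # two ranks for the pairs
pair_suits = comb(4, 2) ** 2   # 2 suits out of 4, for each of the two pairs
fifth_card = 11 * 4            # remaining rank, then its suit

two_pairs = pair_ranks * pair_suits * fifth_card
print(total_hands, two_pairs, Fraction(two_pairs, total_hands))
```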
Exercise 4.3 (1 pair). How many one-pair poker hands are there?
Two pairs, and three-of-a-kinds, etc, do not count as a one-pair. A
one-pair is of the form (aa)(bcd), where a, b, c, d are of distinct rank.
Exercise 4.4 (3-of-a-kind, a triple). How many 3-of-a-kind poker hands
are there? A three-of-a-kind is of the form (aaa)(bc), where a, b, c are
of distinct rank.
Exercise 4.5 (Straights, including straight flushes, and royal flushes).
Let us say that a straight is of the form abcde, where all the cards are of
distinct rank, and can be arranged in increasing order. For the purposes
of order, an ace can count as a 1 or a 14, and J = 11,Q = 12, and
K = 13. (Sometimes by a straight, we mean to exclude the cases where
the cards abcde all have the same suit, but we will allow these for this
exercise.)
4.2. The axioms of probability. In order to do probability on a
sample space S, we have to assign a number in [0, 1] for events or
subsets of S. We require the following rules on a set function P; these
are Kolmogorov’s axioms for probability theory. For all events A, we
require P(A) ∈ [0, 1], and we also require P(∅) = 0, and P(S) = 1.
Another reasonable requirement, on P, is that if two events A and B
are such that A ∩ B = ∅ (disjoint), then P(A) + P(B) = P(A ∪ B). In
fact, we will need a stronger requirement that if A1 , A2 , . . . is a sequence
of mutually exclusive events (that is, Ai ∩ Aj = ∅ if i ≠ j), then
\[ P\Big( \bigcup_{i=1}^{\infty} A_i \Big) = \sum_{i=1}^{\infty} P(A_i). \]
(This last axiom is referred to as countable additivity.) Sometimes a
sample space, together with its events, and a set function P is called a
probability space. In many ways the function P behaves like an area
function that gives the standard Euclidean area of subsets of a unit
square.
4.3. The case of equal probabilities. One way to define such a set function P on a finite sample space S is to assign equal probabilities
to each element of the sample space; this then leads us to set P(E) =
|E|/|S| for an event E, where |E| is the number of elements in E.
Example done in class 4.6. Find the probability that when two fair
dice are rolled the sum will be seven. How about seven or eight?
Solution. Let S = {1, 2, 3, 4, 5, 6}², and let P assign equal probabilities to
each of the outcomes. Note that |S| = 6² = 36. The event
A = {(1, 6), (6, 1), (3, 4), (4, 3), (2, 5), (5, 2)}
is exactly when the sum of the two dice is 7. Since |A| = 6, we have
that P(A) = 1/6. The event
B = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}
is exactly when the sum of the two dice is 8. Clearly A ∩ B = ∅, thus
P(A ∪ B) = P(A) + P(B) = 6/36 + 5/36 = 11/36.
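The rule P(E) = |E|/|S| can be carried out by brute-force enumeration. The sketch below rebuilds the sample space for two dice and recomputes both answers exactly; the helper `prob` is a name invented for this illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space {1,...,6}^2 with equal probabilities.
S = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) = |E| / |S|, where the event is given as a predicate on outcomes."""
    favourable = [s for s in S if event(s)]
    return Fraction(len(favourable), len(S))

p_seven = prob(lambda s: s[0] + s[1] == 7)
p_seven_or_eight = prob(lambda s: s[0] + s[1] in (7, 8))
print(p_seven, p_seven_or_eight)
```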
Exercise 4.7. In a well-mixed deck of standard playing cards, what is the
probability that I am dealt a 4-of-a-kind when I am dealt 5 cards? A
4-of-a-kind is a hand of the form aaaab where a and b are of distinct
rank. See Section 4.1 for terminology.
4.4. Arbitrary probabilities. Suppose we are given a sample space S with a countable number of elements; that is, the elements of S can be put into
a sequence S = {s1 , s2 , . . .}. We can also check that if $(p_i)_{i=1}^{\infty}$ is a
sequence of non-negative numbers such that $\sum_{i=1}^{\infty} p_i = 1$, then for an
event $E = \{s_{i_1}, s_{i_2}, \ldots\}$, if we set
\[ P(E) = \sum_{j=1}^{\infty} p_{i_j}, \]
then Kolmogorov’s axioms are satisfied.
Example done in class 4.8 (Poisson process). Set S = {0, 1, 2, 3, . . .}.
Check that if $p_i = e^{-1}/i!$, then by assigning probabilities $p_i$ to i ∈ S,
Kolmogorov’s axioms are satisfied; in other words, check that the $p_i$ are
positive and have 1 as the sum.
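A numerical check (not a proof) that these probabilities sum to 1 takes only a few lines. The cut-off of 50 terms is an arbitrary choice for this sketch; the terms decay so quickly that the remainder is far below floating-point precision.

```python
import math

def p(i):
    """The probability assigned to i: e^{-1} / i!."""
    return math.exp(-1) / math.factorial(i)

# Partial sum of the first 50 terms; the tail is negligible.
partial_sum = sum(p(i) for i in range(50))
print(partial_sum)
```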
Exercise 4.9. Let p ∈ (0, 1). Consider the probability space for two
coin flips given by: S = {HH, HT, T H, T T } with P given by
P(HH) = p², P(HT ) = P(T H) = p(1 − p), P(T T ) = (1 − p)².
Check that the probabilities do indeed sum to 1 (for any p ∈ (0, 1)).
Write down the event E that the first flip comes up heads. What is the
probability for this event (in the case that p = 3/4)?
5. Chapter 2.5, 2.6
5.1. Some formulas. Given the axioms of probability and standard
set identities we can deduce some useful formulas for computing probabilities. Let P be a probability on a sample space S.
Example done in class 5.1. If A is an event, then P(A) + P(A′) =
P(S) = 1.
Example done in class 5.2. Suppose we flip a fair coin three times.
What is the probability that there will be at least one head?
Exercise 5.3. For any events A and B, we have
A = (A ∩ B) ∪ (A ∩ B′)
and
A ∪ B = A ∪ (B ∩ A′),
where A is disjoint from B ∩ A′.
Example done in class 5.4 (Inclusion-exclusion). If A and B are
events, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Solution. We rewrite A ∪ B as a disjoint union so that A ∪ B = A ∪ (B ∩
A′) and A is disjoint from B ∩ A′. Thus P(A ∪ B) = P(A) + P(B ∩ A′).
We also have that P(B ∩ A′) + P(B ∩ A) = P(B), so some algebra yields
the required result.
Exercise 5.5 (Monotonicity). For events A, B with A ⊆ B, show that
P(A) ≤ P(B).
Solution. Rewrite B = A ∪ (B ∩ A′), so that P(B) = P(A) + P(B ∩ A′) ≥ P(A), since P(B ∩ A′) ≥ 0.
Example done in class 5.6. Assume that our Math 526 class contains
35 students. If 10 students are enrolled in an English course, and 15
students are enrolled in a Spanish course, and 20 students are enrolled
in an English or Spanish course, then what is the probability that a randomly
selected student will be enrolled in both an English and a Spanish course? (Here
we mean that each student is selected with equal probability.)
Exercise 5.7 (Inclusion-Exclusion, counting). Suppose that S is a finite set; that is, |S| = N , for some integer N > 0. Also assume that
P({s}) = 1/N, for all s ∈ S; that is, each element of S has equal weight
under P. Use Exercise 5.4 to recover the standard inclusion-exclusion
formula:
|A ∪ B| = |A| + |B| − |A ∩ B|.
Exercise 5.8 (Inclusion-Exclusion for three sets). Use Exercise 5.4
to prove the following inclusion-exclusion formula in the case of three
sets, A, B, C ⊆ S:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
− P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
5.2. Conditional probabilities. Suppose I roll a fair dice with values {1, 2, 3, 4, 5, 6}. If I told you that the result was even, then what is
the probability that the result was a 4? In some sense, what has happened, is that with the new information, we should consider a reduced
sample space, with possibly different probabilities. The new sample
space should be {2, 4, 6}, with each outcome having equal probabilities; another way is to just use the same sample space {1, 2, 3, 4, 5, 6},
but assign zero probability to 1, 3, 5, and equal probabilities to 2, 4, 6.
One way to formalize this is to use conditional probabilities. Let A, B
be events, with P(B) > 0. Let
\[ P(A|B) = \frac{P(A \cap B)}{P(B)}. \]
Exercise 5.9. Let P(B) > 0. Check that the set function Q(A) =
P(A|B) satisfies Kolmogorov’s axioms.
Example done in class 5.10. Consider a fair dice roll with values
{1, 2, 3, 4, 5, 6}. Let B be the event that the result is even. Let A be the
event that the roll is a 2. Compute P(A|B).
Solution. Note that P(B) = 1/2, and A ∩ B = {2}. Thus
\[ P(A|B) = \frac{1/6}{1/2} = 1/3. \]
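The same computation can be done mechanically from the definition of conditional probability. In the sketch below, events are Python sets, `P` is the equal-probabilities rule on the six faces, and `cond` is a helper name introduced for this illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def P(event):
    """Equal probabilities on S: P(E) = |E| / |S|."""
    return Fraction(len(event & S), len(S))

def cond(A, B):
    """Conditional probability P(A | B) = P(A ∩ B) / P(B), for P(B) > 0."""
    if P(B) == 0:
        raise ValueError("conditioning event must have positive probability")
    return P(A & B) / P(B)

B = {2, 4, 6}   # the result is even
A = {2}         # the roll is a 2
print(cond(A, B))
```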
Example done in class 5.11. Suppose we do the following two step
experiment. I flip a fair coin with values {h, t}, and if it comes up
heads, then I roll a fair dice, with values {1, 2, 3, 4, 5, 6}, otherwise, if
it comes up tails, then I flip another fair coin with values {0, 1}.
(1) What is a sample space for this experiment?
(2) What is the probability that we have a head and a 1?
(3) What is the probability that we have a tail and a 1?
(4) What is the probability that we have a 1?
Solution. Let H be the event that the first coin is heads, let T be the
event that it is tails, and let U be the event that the final result is 1.
(1) We can take
S = {(h, 1), (h, 2), (h, 3), (h, 4), (h, 5), (h, 6), (t, 0), (t, 1)} .
(2) We know that if the first coin flip comes up heads, then
the probability of a 1 is 1/6; in other words, P(U |H) = 1/6. Thus
P(U ∩ H) = P(U |H)P(H) = (1/6)(1/2) = 1/12.
(3) We know that if the first coin flip comes up tails, then the probability of a 1 is 1/2; in other words, P(U |T ) = 1/2. Thus P(U ∩ T ) =
P(U |T )P(T ) = (1/2)(1/2) = 1/4.
(4) Note that P(U ) = P(U ∩ H) + P(U ∩ T ), since H′ = T . So,
P(U ) = 1/12 + 3/12 = 4/12 = 1/3.
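The arithmetic in parts (2) to (4) can be reproduced with exact fractions. The variable names below mirror the events H, T, and U from the solution and are otherwise arbitrary.

```python
from fractions import Fraction

P_H = Fraction(1, 2)            # first coin is heads
P_T = Fraction(1, 2)            # first coin is tails
P_U_given_H = Fraction(1, 6)    # fair dice shows 1
P_U_given_T = Fraction(1, 2)    # second fair coin shows 1

# Multiplication rule for each branch of the experiment.
P_U_and_H = P_U_given_H * P_H
P_U_and_T = P_U_given_T * P_T

# H and T partition the sample space, so the two pieces add up to P(U).
P_U = P_U_and_H + P_U_and_T
print(P_U_and_H, P_U_and_T, P_U)
```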
6. Lecture 6, Chapter 2.6, 2.7
Exercise 6.1. Suppose you are dealt two cards from a standard (well-mixed) 52 card French deck. By counting, compute the probability that
you will have at least one ace. Try this problem again using conditional
probabilities.
6.1. Stochastic independence. Two events A and B are independent if P(A ∩ B) = P(A)P(B). In the case that P(B) > 0, it is easy to
check that this is equivalent to the condition that P(A|B) = P(A). In
other words, knowing B does not affect the probability of A.
Example done in class 6.2. Check that the two definitions are equivalent.
Solution. We checked in class that if A and B are independent, then
P(A|B) = P(A) (for B with P(B) > 0); it remains to verify the other
direction.
Exercise 6.3. Find the probability that when two independent six-sided dice are rolled you get two sixes. Find the probability that you will get
doubles.
Exercise 6.4. Suppose two friends Britney and Christina go shopping.
Let bi be the probability that Britney buys i things, and let ci be the
probability that Christina buys i things. Suppose that bi and ci are
given by:
b0 = 1/10, b1 = 2/10, b2 = 3/10, b3 = 2/10, b4 = 2/10,
and
c0 = 3/10, c1 = 3/10, c2 = 2/10, c3 = 2/10.
Find the probability that Britney and Christina buy the same number
of things, if their shopping habits are independent of each other.
Solution. Let Bi and Ci be the events that Britney and Christina buy
i things, respectively. The event that they both buy i things is Bi ∩ Ci ,
which has probability P(Bi ∩ Ci ) = P(Bi )P(Ci ) since the events Bi and
Ci are independent. Let E be the event that they buy the same number
of things. Then E is given by the disjoint union:
E = (B0 ∩ C0 ) ∪ (B1 ∩ C1 ) ∪ (B2 ∩ C2 ) ∪ (B3 ∩ C3 ).
So P(E) = b0 c0 + b1 c1 + b2 c2 + b3 c3 .
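The final sum is quick to evaluate exactly. The lists `b` and `c` below hold the probabilities from the exercise; `zip` stops at the shorter list, which matches the fact that Christina never buys 4 things.

```python
from fractions import Fraction

# b[i] and c[i]: probabilities that Britney / Christina buy i things.
b = [Fraction(k, 10) for k in (1, 2, 3, 2, 2)]
c = [Fraction(k, 10) for k in (3, 3, 2, 2)]

# Both lists should be probability distributions.
assert sum(b) == 1 and sum(c) == 1

# P(E) = b0 c0 + b1 c1 + b2 c2 + b3 c3, using independence term by term.
p_same = sum(bi * ci for bi, ci in zip(b, c))
print(p_same)
```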
Exercise 6.5. Check that if A and B are independent, then A and B′
are also independent.
Let A = {A1 , A2 , . . .} be a collection of events. We say that the
events are pairwise independent if P(Ai ∩ Aj ) = P(Ai )P(Aj ) for all
i ≠ j, and we say that they are independent or mutually independent
if for every finite subset of events $A_{i_1}, A_{i_2}, \ldots, A_{i_n}$ (where the $i_j$ are
distinct), we have that
\[ P\Big( \bigcap_{j=1}^{n} A_{i_j} \Big) = \prod_{j=1}^{n} P(A_{i_j}). \]
Exercise 6.6. Give an example of a sample space S with probabilities
P , where A1 , A2 , A3 are events such that they are pairwise independent,
but not mutually independent.
Solution. Let S = {0, 1}² and let P assign equal probabilities to each
element of S. Let A1 := {(0, 0), (0, 1)}, A2 := {(0, 0), (1, 0)}, and
A3 = {(0, 1), (1, 0)}. In words, A1 is the event that the first flip is tails,
A2 is the event that the second flip is tails, A3 is the event that exactly
one of the flips came out tails. Clearly, A1 and A2 are independent.
It is easy to check that A3 and A1 are independent and A3 and A2 are
independent. However, A1 ∩ A2 ∩ A3 = ∅, thus
0 = P(A1 ∩ A2 ∩ A3 ) ≠ P(A1 )P(A2 )P(A3 ) = (1/2)³.
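The claims that are "easy to check" in this solution can be verified by enumeration. The sketch below encodes the three events as sets of outcomes and tests the product rule for every pair and for the triple; the helper `P` is a name chosen for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

S = set(product((0, 1), repeat=2))   # two flips, 0 = tails, 1 = heads

def P(event):
    """Equal probabilities on the four outcomes."""
    return Fraction(len(event), len(S))

A1 = {(0, 0), (0, 1)}   # first flip is tails
A2 = {(0, 0), (1, 0)}   # second flip is tails
A3 = {(0, 1), (1, 0)}   # exactly one flip is tails

# Product rule for each of the three pairs.
pairwise = all(P(E & F) == P(E) * P(F)
               for E, F in combinations((A1, A2, A3), 2))

# Product rule for the triple, which fails here.
mutual = P(A1 & A2 & A3) == P(A1) * P(A2) * P(A3)
print(pairwise, mutual)
```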
Example done in class 6.7. Consider the experiment where I flip
a coin, that is known to come up heads with probability 2/3, 7 times.
Assume that the coin flips are independent.
(a) How many ways are there of getting exactly 3 heads in 7 tosses?
(This is just a counting question, there is no probability involved.)
(b) What is the probability that I will get HHHT T T T ; that is, 3 consecutive heads, then 4 consecutive tails?
(c) What is the probability that I will get T T T T HHH?
(d) What is the probability that I will get exactly three heads?
(e) To generalize, suppose instead that the coin is known to come up
heads with probability p ∈ (0, 1), and I flip it n times. Find an
expression for the probability that I will get exactly k heads (for
0 ≤ k ≤ n).
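One way to organize parts (a) to (e): each particular sequence with k heads has probability p^k (1 − p)^(n−k) by independence, and there are n-choose-k such sequences. The sketch below evaluates this with exact fractions; the function name is made up for the illustration.

```python
from fractions import Fraction
from math import comb

def prob_exactly_k_heads(n, k, p):
    """(number of sequences with k heads) x (probability of each one)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p = Fraction(2, 3)

ways = comb(7, 3)                   # part (a)
one_sequence = p**3 * (1 - p)**4    # parts (b) and (c): same probability
three_heads = prob_exactly_k_heads(7, 3, p)   # part (d)
print(ways, one_sequence, three_heads)
```

As a sanity check, summing `prob_exactly_k_heads(7, k, p)` over k = 0, ..., 7 gives exactly 1.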
7. Chapter 2.6, 2.7
We will elaborate on the techniques used in Exercise 5.11.
7.1. Total probability. Suppose B1 , B2 , B3 give a partition of a sample space S, so that B1 , B2 , B3 are mutually exclusive, and their union
is all of S. Given any event A, clearly it is given by the disjoint union,
A = (A ∩ B1 ) ∪ (A ∩ B2 ) ∪ (A ∩ B3 ),
thus
P(A) = P(A ∩ B1 ) + P(A ∩ B2 ) + P(A ∩ B3 ).
We also know from the definition of conditional probabilities, that if
each of the Bi has non-zero probability, then
P(A ∩ Bi ) = P(A|Bi )P(Bi ),
for each i = 1, 2, 3. Thus we obtain that,
P(A) = P(A|B1 )P(B1 ) + P(A|B2 )P(B2 ) + P(A|B3 )P(B3 ).
Sometimes this is referred to as the rule of total probability.
Sometimes we also want to compute P(Bi |A), and a bit of algebra gives
the following formula, in the case i = 3:
\[ P(B_3|A) = \frac{P(B_3 \cap A)}{P(A)} = \frac{P(A|B_3)P(B_3)}{P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3)}. \]
This is referred to as Bayes’ theorem.
Exercise 7.1 (Two Face). The DC comic book villain Two-Face often
uses a coin to decide the fate of his victims. If the result of the flip is
tails, then the victim is spared, otherwise the victim is killed. It turns
out he actually randomly selects from three coins: a fair one, one that
comes up tails 1/3 of the time, and another that comes up tails 1/10
of the time. What is the probability that a victim is spared?
Solution. Let Sp denote the event that the victim is spared and let
C1 be the event that the fair coin is used, C2 be the event that the coin
that comes up tails 1/3 of the time is used, and C3 denote the event that
the coin that comes up tails 1/10 of the time is used. Then
P(Sp) = P(Sp|C1 )P(C1 ) + P(Sp|C2 )P(C2 ) + P(Sp|C3 )P(C3 )
= (1/2)(1/3) + (1/3)(1/3) + (1/10)(1/3) = 14/45.
Exercise 7.2. In Exercise 7.1, what is the probability that Two-Face
used the fair coin, given that the victim was spared?
Solution.
\[ P(C_1|\mathrm{Sp}) = \frac{P(C_1 \cap \mathrm{Sp})}{P(\mathrm{Sp})} = \frac{P(\mathrm{Sp}|C_1)P(C_1)}{P(\mathrm{Sp})} = \frac{(1/2)(1/3)}{(1/2)(1/3) + (1/3)(1/3) + (1/10)(1/3)}. \]
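The rule of total probability and Bayes' theorem for the Two-Face exercises can be evaluated exactly as follows. As in the solution above, the sketch assumes each of the three coins is selected with probability 1/3; the dictionary keys are labels chosen for this illustration.

```python
from fractions import Fraction

# Prior: each coin is equally likely to be selected.
prior = {"C1": Fraction(1, 3), "C2": Fraction(1, 3), "C3": Fraction(1, 3)}
# Probability of tails (victim spared) for each coin.
p_tails = {"C1": Fraction(1, 2), "C2": Fraction(1, 3), "C3": Fraction(1, 10)}

# Rule of total probability: P(Sp) = sum of P(Sp | Ci) P(Ci).
p_spared = sum(p_tails[c] * prior[c] for c in prior)

# Bayes' theorem: P(Ci | Sp) = P(Sp | Ci) P(Ci) / P(Sp).
posterior = {c: p_tails[c] * prior[c] / p_spared for c in prior}
print(p_spared, posterior["C1"])
```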
Exercise 7.3 (Monty Hall meets Bayes). In a game show there are
three doors, behind each of which is a prize; behind one door is a car,
and the other two, goats. The car is equally likely to be behind each door.
Your goal is to win the car. After you choose a door, the host of
the game show, who knows where the car is, chooses at random (as
randomly as possible) one of the remaining doors (one which you didn’t
pick) which contains a goat, and reveals this information to you. You are
then offered the choice of staying with your current choice or switching
your choice to the other remaining door. What should you do if you
want to maximize your chances of winning? Assume for simplicity that
you originally picked the first door, and the host revealed that behind
the third door is a goat, so that your choice is between staying with the
first door or switching to the second door.
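After working the exercise by hand, the answer can be checked with the same Bayes computation used for Two-Face. The sketch below assumes you picked door 1 and the host opened door 3; the likelihoods encode the host's rule (a uniformly random goat door among those you did not pick), and the dictionary layout is a choice made for this illustration.

```python
from fractions import Fraction

# Prior: the car is equally likely to be behind doors 1, 2, 3.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# P(host opens door 3 | car behind door d), given that you picked door 1:
#   car behind 1: host picks door 2 or 3 at random;
#   car behind 2: host is forced to open door 3;
#   car behind 3: host never reveals the car.
likelihood = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

evidence = sum(likelihood[d] * prior[d] for d in prior)
posterior = {d: likelihood[d] * prior[d] / evidence for d in prior}
print("stay:", posterior[1], "switch:", posterior[2])
```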
Exercise 7.4 (Medical testing). Consider a screening test for a certain disease that has the following reliability. If the patient has the
disease, the test will be positive, inconclusive, or negative with probabilities 0.8, 0.05, 0.15, respectively. If the patient does not have the
disease, the test will be positive, inconclusive, or negative with probabilities 0.03, 0.1, 0.87, respectively. Suppose that the disease is present
in 14 percent of the population of adults who are referred by their doctors to be screened with this test.
(1) An adult is referred by their doctor for testing. What is the
probability that the person will be correctly diagnosed?
(2) If an adult has been correctly diagnosed, what is the probability
the person does not have the disease?
8. Exam 1, Answers
Q1: 173 cm, 13 cm², 3.6056 cm; Q3: 0.9; Q4: 1 in 5,153,632.64706;
Q5: 1/12, 2/3; Q6: 0.6; Q7: 0.2344; Q8: (See Exercise 2.93 in the
textbook); Q9: 5/11 = 0.45454545; Q10: Use the Binomial formula.
9. Chapter 3.1
9.1. Random variables. A (real-valued) random variable is a function which assigns an element of R to each member of a sample space
S. Typically, random variables are denoted by capital Roman letters
such as X, Y and Z. In probability, we write
{X = n} = {s ∈ S : X(s) = n}
and
{X ≤ x} = {s ∈ S : X(s) ≤ x} .
The simplest random variables are constant (real-valued) random variables; that is a random variable X such that for some b ∈ R, we have
P(X = b) = 1.
9.2. Bernoulli random variables. A random variable that only takes
the values 0 and 1 is called a Bernoulli random variable. Let p ∈ (0, 1).
A Bernoulli random variable X, with P(X = 1) = p is called a Bernoulli
random variable with parameter p; if X is such a random variable, we
often write X ∼ Bern(p).
Example done in class 9.1. Given the sample space S = {1, 2, 3, 4, 5, 6}
where P assigns equal probabilities to each element of S, define a Bernoulli
random variable on S. Define a Bernoulli random variable X on S such
that the probability of the event {X = 1} is 1/3.
Solution. We can take Y (s) = 1 for all s ∈ S even, and Y (s) = 0
otherwise. We can take X(s) = 1 if s = 1, 2, and X(s) = 0 otherwise.
Example done in class 9.2. Consider the simple experiment where
we keep flipping a fair coin until we see a head. Let X be the number
of flips. Define a sample space for X. What is the probability that
X = 4? Compute P(X ∈ {2, 5}).
Solution. We can let S = {H, T H, T T H, T T T H, T T T T H, . . .}, and
assign probabilities 1/2, 1/4, 1/8, . . . to each one of the elements, respectively. Note these probabilities are a geometric series with unit
sum. We have $P(X = n) = 2^{-n}$. Thus P(X = 4) = 1/16, and
P(X ∈ {2, 5}) = 1/4 + 1/32.
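These values are easy to reproduce with exact fractions. In the sketch below, `pmf` is a name chosen for the function n ↦ P(X = n), and the partial sum over the first 30 terms illustrates how the geometric series approaches 1.

```python
from fractions import Fraction

def pmf(n):
    """P(X = n) = 2^{-n}: n - 1 tails followed by a head, fair coin."""
    return Fraction(1, 2 ** n)

p_four = pmf(4)
p_two_or_five = pmf(2) + pmf(5)   # disjoint events, so probabilities add

# Partial sums of the geometric series: 1 - 2^{-30} after 30 terms.
partial = sum(pmf(n) for n in range(1, 31))
print(p_four, p_two_or_five, 1 - partial)
```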
Example done in class 9.3. Suppose that I will bet 5 dollars on a
toss of a coin, whereby I will win 5 dollars if it comes up heads, and
lose 5 dollars otherwise. Suppose that the coin is tossed two times and
that I will make a bet each time. Write down a sample space for the
coin tosses, and a random variable W defined on the sample space that
represents my possible winnings and losses.
Solution. Let S = {T T, T H, HT, HH}. Then
W (T T ) = −10, W (T H) = 0 = W (HT ), W (HH) = 10.
Example done in class 9.4 (Indicators). Let A be an event of a
sample space S. Define a random variable on S via
\[ 1_A(s) = \begin{cases} 1 & \text{if } s \in A, \\ 0 & \text{otherwise.} \end{cases} \]
Let A and B be independent events such that P(A) = P(B) = 2/3. Let
$X = 1_A + 1_B$. Find P(X = x) for all x ∈ R.
Solution. Notice that {X = 2} = A ∩ B, {X = 1} = (A ∩ B′) ∪
(A′ ∩ B), and {X = 0} = A′ ∩ B′, and the events A and B are independent. Thus we have that P(X = 2) = (2/3)(2/3), P(X = 1) =
(2)(2/3)(1/3), P(X = 0) = (1/3)(1/3); for all other x ∈ R, we have
P(X = x) = 0.
Example done in class 9.5. Suppose a system consists of 3 components and will work if at least two of the components are functioning.
Let Ci be the event that component i is functioning. Let W be the event
that the system is functioning.
(a) For any events A and B, check that $1_{A \cap B} = 1_A 1_B$.
(b) Also check that if A and B are disjoint events, then $1_{A \cup B} = 1_A + 1_B$.
(c) Express W in terms of the events Ci ; that is, write down an expression that looks like
W = (C1 ∩ C2 ∩ C3 ) ∪ . . .
(d) Express $1_W$ in terms of the indicators $1_{C_i}$; that is, write down an
expression that looks like
\[ 1_W = 1_{C_1} 1_{C_2} 1_{C_3} + \cdots. \]
(e) Suppose that the events Ci are independent, and that P(C1 ) = 0.2,
P(C2 ) = 0.3 and P(C3 ) = 0.6. Compute P(W ).
Solution.
(a) Notice that both expressions only take values in {0, 1}, and note
that $1_A 1_B = 1$ if and only if $1_A = 1$ and $1_B = 1$. And $1_A = 1$ and
$1_B = 1$ if and only if $1_{A \cap B} = 1$.
(b) Since A and B are disjoint, we have that both expressions only
take values in {0, 1}. Note that $1_{A \cup B} = 0$ if and only if $1_A = 0$ and
$1_B = 0$. And $1_A = 0$ and $1_B = 0$ if and only if $1_A + 1_B = 0$.
(c) W = (C1 ∩ C2 ∩ C3) ∪ (C1 ∩ C2 ∩ C3′) ∪ (C1 ∩ C2′ ∩ C3) ∪ (C1′ ∩ C2 ∩ C3).
(d) $1_W = 1_{C_1} 1_{C_2} 1_{C_3} + 1_{C_1} 1_{C_2} 1_{C_3'} + 1_{C_1} 1_{C_2'} 1_{C_3} + 1_{C_1'} 1_{C_2} 1_{C_3}$.
(e) Since W is expressed as a disjoint union, and the events Ci are
independent, we have
P(W ) = 0.2(0.3)(0.6) + 0.2(0.3)(0.4) + 0.2(0.7)(0.6) + 0.8(0.3)(0.6) = 0.288.
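Part (e) can also be computed by enumerating all 2³ on/off states of the components and adding up the probabilities of the states in which at least two components work. The sketch below does this with exact fractions; `prob_state` is a helper name invented for the illustration, and independence of the components is what justifies multiplying.

```python
from fractions import Fraction
from itertools import product

# P(C1), P(C2), P(C3): probabilities that each component functions.
p = [Fraction(2, 10), Fraction(3, 10), Fraction(6, 10)]

def prob_state(state):
    """Probability of one on/off pattern, using independence of the C_i."""
    result = Fraction(1)
    for works, pi in zip(state, p):
        result *= pi if works else 1 - pi
    return result

# The system works when at least two of the three components work.
p_system = sum(prob_state(s)
               for s in product((0, 1), repeat=3)
               if sum(s) >= 2)
print(p_system, float(p_system))
```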
Exercise 9.6 (Inclusion-Exclusion again). Let A and B be events of a
sample space S. Show that
\[ 1_{A \cup B} = 1_A + 1_B - 1_{A \cap B}. \]
9.3. Law of large numbers. Let p ∈ (0, 1). Let $(A_i)_{i=1}^{\infty}$ be a sequence
of mutually independent events, all with $P(A_i) = P(A_1) = p$. For a
concrete example, think of an experiment where we flip a coin an infinite number of times, and let $A_i$ be the event that the ith flip comes
up heads. The law of large numbers gives the connection between the
mathematical foundations of probability and our intuitive understanding of probability as a limiting relative frequency: it states that on an
event of probability one, we have that
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} 1_{A_i} = P(A_1) = p; \]
more precisely, there is an event $S'$ such that $P(S') = 1$, and for all
$s \in S'$, we have
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} 1_{A_i}(s) = P(A_1) = p. \]
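The limiting relative frequency can be illustrated with a small simulation. This is only a sketch under stated assumptions: the helper `relative_frequency`, the choice p = 0.3, the sample size, and the seed are all illustrative, and a finite simulation suggests the limit rather than proving it.

```python
import random

def relative_frequency(p, n, seed=0):
    """Fraction of n independent trials in which an event of probability p occurs."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

# The relative frequency should be close to p when n is large.
freq = relative_frequency(0.3, 100_000)
print(freq)
```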
9.4. Types of random variables. Let X be a random variable. If
there is a countable set $C = \{v_1, v_2, \ldots\}$ such that P(X ∈ C) = 1, then X is
a discrete random variable; in other words, X only takes values in
a countable set. We will discuss continuous random variables in more
detail later; such a random variable might, for example, take values in all of the interval
[0, 1].
9.5. Uniform random variables. Suppose that U is a random variable that takes values in [0, 1] as randomly as possible. Such a random variable should have the property that P(a ≤ U ≤ b) = b − a, for all
0 ≤ a < b ≤ 1; that is, the probability that U lies in any interval is
exactly the length of the interval. The random variable U is called a
uniform random variable, and it can be defined on the sample space
[0, 1], by taking U(s) = s for all s ∈ [0, 1], and constructing P (this
is actually no easy task) on [0, 1] so that P((a, b)) = b − a for all
0 ≤ a ≤ b ≤ 1.
Exercise 9.7. Let U be a uniform random variable on [0, 1]. Show
that P(U = x) = 0 for all x ∈ [0, 1]. What is wrong with the following
calculation:
\[ 1 = P(U \in [0, 1]) = \sum_{x \in [0,1]} P(U = x) = 0. \]
Hint: [0, 1] can not be arranged in a sequence.
Solution. Let ε > 0. Observe that
\[ 0 \le P(U = x) \le P(x - \varepsilon \le U \le x + \varepsilon) \le 2\varepsilon. \]
Since this inequality holds for all ε > 0, we must have that P(U = x) =
0. The calculation is flawed because although the sets {U = x} are
clearly disjoint, the set of all x ∈ [0, 1] can not be put into a sequence,
thus we can not apply Kolmogorov’s (countable additivity) axiom.
10. Chapter 3.2
10.1. Probability distributions. Let X be a discrete random variable. The probability mass function (pmf) of X is defined via
p(x) = P(X = x)
for all x (for which X takes values). (In the case that X is real-valued, it suffices
to consider all x ∈ R.) Kolmogorov’s axioms give that
p(x) ≥ 0 and $\sum_x p(x) = 1$, and similarly, any function p with these
two properties is a probability mass function for a discrete random
variable. The cumulative distribution function (cdf) of a real-valued
random variable X is defined to be
F (x) = P(X ≤ x),
for all x ∈ R. In the case that X is also a discrete random variable we
have that
\[ F(x) = \sum_{y : y \le x} p(y). \]
We say that either p or F gives the distribution or law of a discrete
real-valued random variable. Often, what we care about in practice is
not the actual sample space that a random variable may be defined on,
but its distribution.
Exercise 10.1. Let p be a pmf given by p(0) = 1/3, p(1) = 1/9, and
p(2) = x (and p(y) = 0 for all other y). What is x?
Exercise 10.2. Find the pmf and cdf of a Bernoulli random variable
with parameter 1/3.
Solution. Clearly, p(0) = 2/3, p(1) = 1/3, and p(x) = 0 for all other
x ∈ R. As for F , we have F (x) = 0 for all x < 0, F (x) = 2/3 for all
x ∈ [0, 1), and F (x) = 1 for all x ∈ [1, ∞).
Exercise 10.3 (Sum of independent Bernoulli random variable). Let
p ∈ (0, 1). Suppose $(A_i)_{i=1}^{n}$ are mutually independent events with $P(A_i) = p$. Let
\[ X = \sum_{i=1}^{n} 1_{A_i}. \]
Find the pmf for X.
Solution. Let 0 ≤ k ≤ n. One way that X = k is for $A_1, \ldots, A_k$ to
occur and $A_{k+1}, \ldots, A_n$ not to occur. The mutual independence of the
events $(A_i)$ gives that
\[ P(A_1 \cap A_2 \cap \cdots \cap A_k \cap A_{k+1}' \cap \cdots \cap A_n') = p^k (1-p)^{n-k}, \]
and every other way for exactly k of the events to occur has the same probability; there are exactly $\binom{n}{k}$ ways for k of the
events $(A_i)_{i=1}^{n}$ to occur. Thus
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \]
The random variable X in Exercise 10.3 is called a binomial random
variable with parameters (n, p), and we write X ∼ Bin(n, p). It can be
used to model the number of ‘successes’ (and failures) in n independent
trials that each have a success probability of p.
Exercise 10.4. Suppose I have a standard French deck. Consider the
following experiment. I deal myself a poker hand of 5 cards, check the
hand, then put the cards back in the deck, mix the deck, and then repeat for a
grand total of 6 times. What is the probability that exactly three of the six hands
are one-pair poker hands?
Exercise 10.5. Show that log : (0, ∞) → R is an increasing function;
that is, if x < y, then log(x) < log(y). Let f : R → (0, ∞). Show that
if for some xM ∈ R we have that log(f (xM )) ≥ log(f (x)) for all x ∈ R,
then f (xM ) ≥ f (x) for all x ∈ R. Also check that if xc is a critical
point of f (x), then xc is also a critical point of log(f (x)).
Solution. Let x < y. Clearly, log(y) − log(x) = log(y/x). Since
x < y, we know that y/x > 1. Also log(z) > 0 for all z > 1. Thus
log(y) − log(x) > 0. In particular, if $f(x_M) < f(z)$ for some z ∈ R,
then $\log(f(x_M)) < \log(f(z))$, which contradicts the assumption on $x_M$.
If g(x) = log(f(x)), note that
\[ g'(x) = \frac{f'(x)}{f(x)}. \]
Thus if $x_c$ is a critical point for f (that is, $f'(x_c) = 0$ or f is not
differentiable at $x_c$), then the same must hold for g.
Exercise 10.6 (Maximum likelihood estimate). Suppose I have a coin,
and I want to figure out the probability p that it lands heads. I flip it
n = 10 times, and find that I get k = 3 heads (those two numbers I
do know). We don’t know p, but we do know that we can model our
experiment as a random variable X which counts the number of heads,
where X ∼ Bin(n, p). Consider the function f : [0, 1] → [0, 1], given
by
\[ f(p) = \binom{n}{k} p^k (1-p)^{n-k} = P(X = k). \]
Find the value of p for which f is maximized. (You may assume that n = 10
and k = 3.) Hint: by using Exercise 10.5, it might be easier to maximize
g(x) = log(f(x)) rather than f(x) directly. (H)
Solution. By Exercise 10.5, we consider
\[ g(p) = \log(f(p)) = \log\binom{n}{k} + k \log(p) + (n-k)\log(1-p). \]
We have that
\[ g'(p) = \frac{k}{p} - \frac{n-k}{1-p}, \]
and setting g'(p) = 0 and solving for p, we obtain that p = k/n. Furthermore, by the 1st or 2nd derivative test, it is easy to argue that g
and hence f attains its maximum value at p = k/n.
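As a numerical sanity check of p = k/n for n = 10 and k = 3, one can evaluate the likelihood on a fine grid and pick the largest value. This is a sketch only; the grid resolution and the names `likelihood` and `p_hat` are illustrative choices, not part of the notes.

```python
from math import comb

def likelihood(p, n=10, k=3):
    """Binomial likelihood f(p) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate f on the grid 0, 0.001, ..., 1 and keep the maximizer.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat)
```

The grid maximizer agrees with the calculus answer k/n = 0.3, since f is increasing to the left of k/n and decreasing to the right.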
Exercise 10.7. Suppose F is the cdf for the integer valued random
variable X, and we know that F (0) = 0, F(1) = 1/3, F (2) = 2/3,
F (3) = 5/6, F (4) = 11/12, and F (5) = 1. Compute the pmf for X.
Exercise 10.8. Let X be a discrete integer valued (P(X ∈ Z) = 1)
random variable. Let F be the cdf for X. Express the following in
terms of F : P(X > a), P(X = a), P(a ≤ X ≤ b), P(a < X < b),
P(a < X ≤ b), P(a ≤ X < b). Assume that a, b ∈ Z.
11. Chapter 3.3 (I)
11.1. Continuous random variables. We will define continuous random variables via a continuous analogue of the pmf. A function f :
R → [0, ∞) is a probability density function (pdf) if
\[ \int_{-\infty}^{\infty} f(x)\,dx = 1. \]
Example done in class 11.1. Let c > 0, and f : R → [0, ∞) be
defined via
\[ f(x) = \begin{cases} c/x^2 & \text{if } x \in [1, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
Find c so that f is a pdf.
We say that X is a continuous random variable if there is a pdf f so
that the cdf of X is given via
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du. \]
Note that as a function of x, the cdf F is continuous. In fact it is
differentiable if f is continuous, and by the fundamental theorem of
calculus, F 0 (x) = f (x). (If F is differentiable at x, then we take f (x) =
F 0 (x).)
The standard normal distribution is given by the pdf
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}. \]
We say that Z is a standard normal random variable if it has pdf f .
This is one of the most important densities in probability and statistics.
See Figures 1 and 2 for illustrations of the pdf and the cdf. We will
verify later that this is indeed a pdf. It turns out that there is no
nice formula for the anti-derivative of f, so we can only numerically
compute values of F(x). We will discuss in more detail how to handle
this important distribution later.
In many ways integrals and pdfs will function like sums and pmfs.
However, it is possible for a pdf to have a value greater than 1. We
also have the approximation
\[ (2\varepsilon) f(x) \approx \int_{x-\varepsilon}^{x+\varepsilon} f(u)\,du, \]
and for many purposes it may be helpful to pretend that f(x)dx =
P(X = x), but for us this is not a good mathematical statement.
Figure 1. A graph of the pdf for a standard normal
random variable (horizontal axis x from −4 to 4, vertical axis f(x) from 0 to 0.4).
Example done in class 11.2. Show that if X is a continuous random
variable with pdf f, then P(X = b) = 0 for all b ∈ R.
Solution. Let ε > 0. Observe that {X = b} ⊆ {b − ε < X ≤ b}.
Hence 0 ≤ P(X = b) ≤ F(b) − F(b − ε). Since F is continuous,
taking a limit as ε → 0 gives the required result.
12. Chapter 3.3 (II)
In contrast to discrete random variables, for a continuous
random variable X with a pdf f and cdf F, Example 11.2 gives that
for any a < b, we have
\[
\begin{aligned}
P(a \le X \le b) &= P(a < X < b) \\
&= P(a \le X < b) \\
&= P(a < X \le b) \\
&= \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx \\
&= F(b) - F(a) \\
&= \int_{a}^{b} f(x)\,dx.
\end{aligned}
\]
Figure 2. A graph of the cdf for a standard normal
random variable (horizontal axis x from −4 to 4, vertical axis from 0 to 1).
Example done in class 12.1 (Uniform random variables). Find the
pdf for a random variable U such that P(U ∈ [0, 1]) = 1, and for any
interval [a, b] ⊆ [0, 1], we have P(U ∈ [a, b]) = b − a. We say that U is
uniformly distributed in [0, 1].
Figure 3. An illustration of P(−2 ≤ Z ≤ 0) for a standard normal random variable Z (standard normal density, x from −4 to 4).
Solution. Let
\[ f(x) = \begin{cases} 1 & \text{if } x \in [0, 1], \\ 0 & \text{otherwise.} \end{cases} \]
Example done in class 12.2 (Exponential random variables). Let
λ > 0. Let X be a continuous random variable with the property that
for all x ≥ 0, we have $P(X > x) = e^{-\lambda x}$. Note that X is positive. We call
X an exponential random variable with parameter 1/λ. (We will see
later why it is 1/λ, instead of just λ.) Find the pdf for X (assuming
the pdf is piecewise continuous).
Solution. We know that $F(x) = \int_{-\infty}^{x} f(t)\,dt$. Taking a derivative
yields F'(x) = f(x). We know that F(x) = 0 for all x ≤ 0, and
$F(x) = 1 - e^{-\lambda x}$ otherwise. Thus
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \in [0, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
Exercise 12.3 (Cauchy distribution). For all x ∈ R, set
\[ f(x) = \frac{1}{\pi(1 + x^2)}. \]
Check that f is a pdf.
Solution. Clearly f(x) ≥ 0. In order to do the integral, make the change
of variables x = tan u, so that $dx = (\sec^2 u)\,du$, from which we obtain,
using the trigonometric identity $1 + \tan^2 u = \sec^2 u$, that
\[ \int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi} \int_{-\pi/2}^{\pi/2} 1\,du = 1. \]
Exercise 12.4. Let c > 0. Consider the pdf given by
\[ f(x) = \begin{cases} c x^4 & \text{if } x \in [0, 2), \\ 0 & \text{otherwise.} \end{cases} \]
What is c? Find P(1 ≤ X < ∞).
12.1. Convolutions. We know that we can add random variables to
create other random variables, but what operations can we perform on
pdfs and pmfs to obtain new ones?
Suppose f and g are pdfs. Define
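\[ h(z) = (f \star g)(z) = \int_{-\infty}^{\infty} f(x) g(z - x)\,dx. \]
We can check that h is also a pdf. Clearly, h ≥ 0, since both f, g ≥ 0.
Also, by interchanging the order of integration (Fubini’s theorem), we
have
\[
\begin{aligned}
\int_{-\infty}^{\infty} h(z)\,dz &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x) g(z - x)\,dx\,dz \\
&= \int_{-\infty}^{\infty} f(x) \left( \int_{-\infty}^{\infty} g(z - x)\,dz \right) dx \\
&= \int_{-\infty}^{\infty} f(x) \cdot 1\,dx = 1.
\end{aligned}
\]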
We can define the convolution of pmfs f, g : Z → R similarly via
\[ (f \star g)(n) = \sum_{i \in \mathbb{Z}} f(i) g(n - i). \]
Exercise 12.5 (Convolutions of pmfs). Check that $f \star g$ is also a pmf.
12.2. Constructing (continuous) random variables. Note that in
defining continuous random variables, we never actually constructed
them! All we did was specify a distribution via a pdf or cdf. Recall
that a random variable is defined on a sample space. We never defined
the actual sample spaces on which the random variables live, nor
did we construct the probability P that agrees with the distribution. It
turns out that this is actually beyond the scope of this course. However,
if you believe in the existence of uniform random variables, then that
is all that is needed to construct other random variables.
In order to construct uniform random variables, say a random variable U such that P(U ≤ x) = x for all x ∈ [0, 1], we need to be
able to define a set function ℓ on all intervals (a, b) ⊂ [0, 1], such that
ℓ((a, b)) = b − a. This is not a problem, except that we also need to extend
this function in a reasonable way (satisfying Kolmogorov’s axioms) to
include all subsets of [0, 1]. It turns out this is actually in some sense
impossible; the best you can hope for is to include enough subsets of
[0, 1] so that you have a large set of events with which to do
probability.
Exercise 12.6. Assume that U is a uniform random variable such that
P(U ≤ x) = x for all x ∈ [0, 1].
(1) Let p ∈ (0, 1). Find a function φ so that φ(U ) is a Bernoulli
random variable with P(φ(U ) = 1) = p.
(2) Assume that F is a cdf for a continuous random variable X,
and that F is an increasing function so that if x < y, then
F (x) < F (y), and the inverse F −1 is well-defined. Check that
the random variable defined via F −1 (U ) has the same cdf as X.
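Part (2) is the idea behind what is often called inverse transform sampling. As an illustration only, take the exponential cdf $F(x) = 1 - e^{-\lambda x}$ from Example 12.2, which is strictly increasing on [0, ∞) with inverse $F^{-1}(u) = -\log(1 - u)/\lambda$. The sketch below (the names `sample_exponential`, `lam`, the seed and the sample size are assumptions made for the example) compares an empirical tail probability with $e^{-\lambda x}$.

```python
import math
import random

def sample_exponential(lam, rng):
    """Apply the inverse cdf F^{-1}(u) = -log(1 - u) / lam to a uniform sample u."""
    u = rng.random()  # uniform on [0, 1)
    return -math.log(1.0 - u) / lam

rng = random.Random(1)
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]

# Compare the empirical value of P(X > 0.5) with the exact value exp(-lam * 0.5).
empirical_tail = sum(1 for x in samples if x > 0.5) / len(samples)
exact_tail = math.exp(-lam * 0.5)
print(empirical_tail, exact_tail)
```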
Exercise 12.7 (Hit/miss von Neumann). Let (Vi )i∈N be independent
and uniformly distributed in the unit square centered at the origin in
R2 , so that P(Vi ∈ A) = area(A) for subsets A of the unit square. Let
D be the disk that is inscribed in the square. Let N denote the first
time i such that $V_i \in D$. Show that $V_N$ is uniformly distributed on D,
so that
\[ P(V_N \in A) = \frac{\mathrm{area}(A)}{\mathrm{area}(D)}, \]
for subsets A of D.
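The hit/miss procedure can be sketched in code. This is an illustration, not a proof: the function name `sample_disk`, the seed, and the test set A (a concentric disk of radius 1/4, for which area(A)/area(D) = 1/4) are choices made for the example.

```python
import random

def sample_disk(rng):
    """Draw points uniformly from the unit square centered at the origin until one lands in D."""
    while True:
        x = rng.random() - 0.5
        y = rng.random() - 0.5
        if x * x + y * y <= 0.25:  # D is the inscribed disk of radius 1/2
            return x, y

rng = random.Random(2)
points = [sample_disk(rng) for _ in range(50_000)]

# A = concentric disk of radius 1/4, so area(A) / area(D) = (1/4)^2 / (1/2)^2 = 1/4.
inner_fraction = sum(1 for x, y in points if x * x + y * y <= 0.0625) / len(points)
print(inner_fraction)
```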
Exercise 12.8 (Fair flips from a biased coin). Suppose you do not
know if a coin is fair or not. How can you and your friend decide, in a
random and fair way, who will pay for dinner?
13. Chapter 3.4 (i)
13.1. Joint distributions. Suppose I have a deck of cards and deal
myself two cards which we can represent as a discrete random variable
Z = (X1 , X2 ), so that X1 is the first card dealt, and X2 is the second
card dealt. If X1 is the 2 of spades, then X2 can not be the 2 of spades.
Sometimes it helps to think of Z as two random variables with a joint
distribution rather than as a single random variable taking values as
ordered pairs. We say that $f = f_{X,Y}$ is a joint probability mass function
for a pair of random variables (X, Y) if f(x, y) ≥ 0 for all (x, y),
$\sum_x \sum_y f(x, y) = 1$, and P({X = x} ∩ {Y = y}) = P(X = x, Y = y) =
f(x, y). Let $f_X$ and $f_Y$ be the pmfs of X and Y, respectively. Notice that
$f_X(x) = P(X = x) = \sum_y f(x, y)$ and $f_Y(y) = P(Y = y) = \sum_x f(x, y)$;
in the context of joint distributions, these are called the marginal
distributions of X and Y (alone), respectively. The conditional distribution of X given Y = y (provided that P(Y = y) > 0) is given
by
\[ f(x|y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]
A similar expression holds for conditional distribution of Y given X =
x.
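The definitions of marginal and conditional distributions translate directly into a few lines of code. The sketch below uses a small made-up joint pmf (the dictionary `joint` is hypothetical and is not one of the tables in these notes) and exact fractions, so the sums can be checked by hand.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf f(x, y) on {0, 1} x {0, 1}; the values sum to 1.
joint = {
    (0, 0): Fraction(1, 8),
    (0, 1): Fraction(3, 8),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
}

def marginals(joint):
    """Return (f_X, f_Y) by summing the joint pmf over the other coordinate."""
    f_x, f_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), prob in joint.items():
        f_x[x] += prob
        f_y[y] += prob
    return dict(f_x), dict(f_y)

def conditional_x_given_y(joint, y):
    """Return the conditional pmf x -> f(x | y) = f(x, y) / f_Y(y)."""
    _, f_y = marginals(joint)
    return {x: prob / f_y[y] for (x, yy), prob in joint.items() if yy == y}

f_x, f_y = marginals(joint)
cond = conditional_x_given_y(joint, 1)
print(f_x, f_y, cond)
```

For this table, f_Y(1) = 5/8, so f(0|1) = (3/8)/(5/8) = 3/5 and f(1|1) = 2/5.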
Example done in class 13.1. Let c ≥ 0. Let the joint probability
mass function f of two discrete random variables X and Y be given by
the following table:
f(x, y)   y = 0   y = 1   y = 2
x = 0     0.12    0.08    0.2
x = 1     0.2     0.1     0.1
x = 2     0.1     c       0.05
(a) Find c.
(b) Compute P(X = 0).
(c) Compute P(Y = 2).
(d) Compute P(X = 0|Y = 1).
(Thus f (1, 2) = P(X = 1, Y = 2) = 0.2 and f (2, 1) = c)
Exercise 13.2. Consider the joint pmf of random variables X and
Y given by f (0, 0) = 1/12, f (0, 1) = 1/4, f (1, 0) = 5/12, f (1, 1) =
3/12. Find the marginal distributions of X and Y . Find the conditional
distribution of X given Y = 1.
Exercise 13.3. Consider the experiment where we roll a fair die,
record its value as the random variable X, and then if X = x, we flip a
biased coin x times, and record the number of heads as the random
variable Y. Assume that the probability that the coin comes up heads is 2/3.
Find the joint pmf of X and Y.
Solution. By Exercise 10.3, we know that for 1 ≤ n ≤ 6 and 0 ≤ k ≤ n, we have
\[ P(Y = k \mid X = n) = \binom{n}{k} \left(\frac{2}{3}\right)^{k} \left(\frac{1}{3}\right)^{n-k}. \]
We also know that P(X = n) = 1/6 for all n = 1, 2, 3, 4, 5, 6. Hence
\[ P(X = n, Y = k) = P(Y = k \mid X = n) P(X = n) = \binom{n}{k} \left(\frac{2}{3}\right)^{k} \left(\frac{1}{3}\right)^{n-k} \frac{1}{6}, \]
for 0 ≤ k ≤ n, and n = 1, 2, 3, 4, 5, 6.
13.2. Independence of Random variables. Given two random variables X and Y (discrete or otherwise), we say that they are independent
if for all x, y ∈ R, we have
P({X ≤ x} ∩ {Y ≤ y}) = P(X ≤ x)P(Y ≤ y).
Similarly, given a sequence of random variables $(X_i)$, we say that they
are independent if for any finite number of coordinates $i_1, \ldots, i_n$ (where
the $i_j$ are distinct) we have
\[ P\big(\{X_{i_1} \le x_1\} \cap \cdots \cap \{X_{i_n} \le x_n\}\big) = P(X_{i_1} \le x_1) \cdots P(X_{i_n} \le x_n) \]
for all $x_1, \ldots, x_n \in \mathbb{R}$. It is possible to check that this condition is
equivalent to the seemingly stronger condition that
\[ P\big(\{X_{i_1} \in A_1\} \cap \cdots \cap \{X_{i_n} \in A_n\}\big) = P(X_{i_1} \in A_1) \cdots P(X_{i_n} \in A_n) \]
for all (reasonable) subsets $A_1, \ldots, A_n \subseteq \mathbb{R}$. Using this condition, it
is possible to check that functions of independent random variables
are also independent; that is, if Xi are independent random variables,
and gi are (deterministic) functions, then gi (Xi ) are also independent
random variables.
Exercise 13.4. For integer-valued discrete random variables, show that
X and Y are independent if and only if for all integers n, m ∈ Z we
have
P({X = n} ∩ {Y = m}) = P(X = n)P(Y = m),
so that if $f_{X,Y}$ is the joint pmf of X and Y, then X and Y are independent
if and only if $f_{X,Y}(n, m) = f_X(n) f_Y(m)$.
Example done in class 13.5. Let X and Y be random variables
which take values in {1, 2, 3, 4, 5, 6} with equal probability; that is, P(X =
i) = 1/6 = P(Y = i) for all i = 1, 2, . . . , 6. Also assume that X and Y
are independent. Compute P(XY = 12).
Example done in class 13.6. Let λ, µ > 0. Suppose that X and Y
are independent random variables such that $P(X > x) = e^{-\lambda x}$ for all
x ≥ 0, and $P(Y > y) = e^{-\mu y}$ for all y ≥ 0. If Z = min{X, Y}, then
find the cdf for Z.
Solution. Note that {Z > z} = {X > z} ∩ {Y > z}. Since X and Y
are independent, we have that
\[ P(Z > z) = P(X > z) P(Y > z) = e^{-(\lambda + \mu) z} \]
for all z ≥ 0. Also note that P(Z > 0) = 1. Thus
\[ F(z) = \begin{cases} 1 - e^{-(\lambda + \mu) z} & \text{if } z \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
Exercise 13.7 (Sum of independent random variables). Let X and
Y be independent discrete integer-valued random variables with pmfs
given by f and g, respectively. Show that if Z = X + Y , then the pmf
of Z is given by the convolution $f \star g$. (See Exercise 12.5.)
Solution. Let $A_i = \{X = i\} \cap \{Y = n - i\}$ for each i ∈ Z. Observe
that the $(A_i)_{i \in \mathbb{Z}}$ are mutually disjoint, and
\[ \{Z = n\} = \bigcup_{i \in \mathbb{Z}} A_i. \]
Since X and Y are independent, we have
\[ P(Z = n) = \sum_{i \in \mathbb{Z}} P(X = i) P(Y = n - i) = (f \star g)(n). \]
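For pmfs with finite support, the convolution sum can be computed directly. The following sketch (the function name `convolve_pmfs` and the two-dice example are illustrative assumptions) stores a pmf as a dictionary from integers to exact fractions and accumulates f(i)g(j) at the key i + j, which is the sum over i of f(i)g(n − i) with n = i + j.

```python
from fractions import Fraction

def convolve_pmfs(f, g):
    """Return the pmf of X + Y for independent X ~ f and Y ~ g (finite supports)."""
    h = {}
    for i, f_i in f.items():
        for j, g_j in g.items():
            h[i + j] = h.get(i + j, Fraction(0)) + f_i * g_j
    return h

# Example: the sum of two independent fair six-sided dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])
```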
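13.3. Poisson random variables. Let λ > 0 and consider the pmf
given by
\[ p_\lambda(n) = \begin{cases} \dfrac{e^{-\lambda} \lambda^{n}}{n!} & \text{if } n = 0, 1, 2, 3, \ldots, \\ 0 & \text{otherwise.} \end{cases} \]
If X is a random variable with pmf $p_\lambda$, then we say that X is a Poisson
random variable with mean or parameter λ, and we write X ∼ Poi(λ).
Exercise 13.8. Check that $p_\lambda$ is indeed a pmf.
Solution. Since
\[ e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!}, \]
we have that
\[ \sum_{n=0}^{\infty} \frac{e^{-\lambda} \lambda^{n}}{n!} = e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^{n}}{n!} = e^{-\lambda} e^{\lambda} = 1. \]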
Exercise 13.9. Suppose X and Y are independent Poisson random
variables with parameters λ and µ, respectively. Let Z = X + Y.
(1) If λ = 2 and µ = 1, find P(Z = 1).
(2) Find the pmf of Z. (Hint, it is the pmf of a Poisson random
variable. Use Exercise 13.7. The Binomial formula may also
come in handy.)
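Solution.
(1) Observe that {Z = 1} is given by the disjoint union
\[ \big(\{X = 1\} \cap \{Y = 0\}\big) \cup \big(\{X = 0\} \cap \{Y = 1\}\big). \]
Since X and Y are independent we have that
\[
\begin{aligned}
P(Z = 1) &= p_\lambda(1) p_\mu(0) + p_\lambda(0) p_\mu(1) \\
&= e^{-(\lambda + \mu)} \big(\lambda(1) + 1(\mu)\big) \\
&= e^{-(\lambda + \mu)} (\lambda + \mu) \\
&= p_{\lambda + \mu}(1) \\
&= p_3(1) = 3e^{-3}.
\end{aligned}
\]
(2) By Exercise 13.7, we know that
\[ P(Z = n) = (p_\lambda \star p_\mu)(n) = \sum_{i \in \mathbb{Z}} p_\lambda(i) p_\mu(n - i). \]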
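We know that P(Z = n) = 0 if n < 0, since X and Y are
nonnegative. Let n ≥ 0. Note that $p_\lambda(i) = 0$ for all i < 0, and
$p_\mu(n - i) = 0$ for all i > n. Thus
\[
\begin{aligned}
\sum_{i \in \mathbb{Z}} p_\lambda(i) p_\mu(n - i) &= \sum_{i=0}^{n} p_\lambda(i) p_\mu(n - i) \\
&= \sum_{i=0}^{n} \frac{e^{-\lambda} \lambda^{i}}{i!} \cdot \frac{e^{-\mu} \mu^{n-i}}{(n - i)!}.
\end{aligned}
\]
Now recall that
\[ \binom{n}{i} = \frac{n!}{i!(n - i)!}. \]
So
\[ P(Z = n) = \frac{e^{-(\lambda + \mu)}}{n!} \sum_{i=0}^{n} \binom{n}{i} \lambda^{i} \mu^{n-i}. \]
Now the binomial formula, Exercise 3.3, with x = λ and y = µ
gives that
\[ P(Z = n) = e^{-(\lambda + \mu)} \frac{(\lambda + \mu)^{n}}{n!}. \]
So we obtain that Z is a Poisson random variable with parameter λ + µ.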
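The identity can also be checked numerically for particular parameters. This is a sketch, with the helper names `poisson_pmf` and `convolved` chosen for illustration: it evaluates the finite convolution sum for λ = 2 and µ = 1 and compares it with the Poisson pmf with parameter 3.

```python
from math import exp, factorial

def poisson_pmf(lam, n):
    """Poisson pmf p_lam(n) = exp(-lam) lam^n / n! for n >= 0, and 0 otherwise."""
    if n < 0:
        return 0.0
    return exp(-lam) * lam**n / factorial(n)

def convolved(lam, mu, n):
    """The convolution sum over i = 0, ..., n of p_lam(i) * p_mu(n - i)."""
    return sum(poisson_pmf(lam, i) * poisson_pmf(mu, n - i) for i in range(n + 1))

# Largest discrepancy between the convolution and Poi(3) over n = 0, ..., 14.
max_gap = max(abs(convolved(2.0, 1.0, n) - poisson_pmf(3.0, n)) for n in range(15))
print(max_gap)
```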
14. Chapter 3.4 (ii)
14.1. Joint distributions for continuous random variables. We
say that f : R² → [0, ∞) is a joint probability density function if
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v)\,du\,dv = 1. \]
We say that F is the joint distribution function for continuous random variables
X and Y with joint pdf f if
\[ F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du = P(X \le x, Y \le y) \]
for all x, y ∈ R. This is equivalent to saying that
\[ P((X, Y) \in A) = \iint_{A} f(u, v)\,du\,dv \]
for all (nice) regions A of R². (By nice, I mean anything you can write
down or define explicitly, and anything we will use in this course.)
If F is also sufficiently smooth at a point (x, y), then we can recover
f(x, y) via:
\[ \frac{\partial^2}{\partial x \partial y} F(x, y) = f(x, y). \]
From this we can deduce that if X and Y are continuous random variables, and f is the joint pdf for X and Y, then X and Y are independent
if and only if $f(x, y) = f_X(x) f_Y(y)$ for all x, y ∈ R, where $f_X$ and $f_Y$
are the pdfs for X and Y. The marginals are given via:
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx. \]
It is trickier to define conditional distributions for continuous random
variables, since the event {Y = y} has probability zero, when Y is
continuous. Let X and Y be continuous random variables with a joint
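pdf f. Consider the following calculation for small ε > 0, which is
based on the approximation
\[ (2\varepsilon) f(x) \approx \int_{x - \varepsilon}^{x + \varepsilon} f(u)\,du. \]
We have
\[
\begin{aligned}
P(X \le x \mid Y \in (y - \varepsilon, y + \varepsilon)) &= \frac{P\big(\{X \le x\} \cap \{Y \in (y - \varepsilon, y + \varepsilon)\}\big)}{P(Y \in (y - \varepsilon, y + \varepsilon))} \\
&\approx \frac{\int_{-\infty}^{x} \int_{y - \varepsilon}^{y + \varepsilon} f(u, v)\,dv\,du}{2\varepsilon f_Y(y)} \\
&\approx \frac{2\varepsilon \int_{-\infty}^{x} f(u, y)\,du}{2\varepsilon f_Y(y)} \\
&= \frac{\int_{-\infty}^{x} f(u, y)\,du}{f_Y(y)}.
\end{aligned}
\]
So we define the conditional distribution function of X given Y = y via
\[ F_{X|Y}(x|y) = \frac{\int_{-\infty}^{x} f(u, y)\,du}{f_Y(y)}, \]
and the conditional density of X given that Y = y via
\[ f_{X|Y}(x|y) = f(x|y) = \frac{f(x, y)}{f_Y(y)}, \]
provided that $f_Y(y) > 0$. We can similarly define the conditional distribution of Y given that X = x.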
Note that the condition that fX|Y (x|y) is given by a function of only
the x variable, and does not depend on y is equivalent to that of X and
Y being independent. The analogous statement for fY |X is also true.
Exercise 14.1. Let c > 0. Consider the joint distribution for continuous random variables X and Y given by the density
\[ f(x, y) = \begin{cases} c x e^{-5x} e^{-xy} & \text{if } x, y \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]
(a) Find c.
(b) Find the marginal $f_X$.
(c) Find the marginal $f_Y$.
(d) Check, directly by integration, that the function defined by
\[ g(y) = \begin{cases} \dfrac{5}{(5 + y)^2} & \text{if } y \ge 0, \\ 0 & \text{otherwise,} \end{cases} \]
is indeed a pdf.
(e) Find the density for the conditional distribution of Y given that
X = x, for x > 0.
Solution.
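(a) Clearly f(x, y) ≥ 0. We have the requirement that
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \]
We have that
\[
\begin{aligned}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy &= \int_{0}^{\infty} \int_{0}^{\infty} f(x, y)\,dx\,dy \\
&= c \int_{0}^{\infty} e^{-5x} \left( \int_{0}^{\infty} x e^{-xy}\,dy \right) dx \\
&= c \int_{0}^{\infty} e^{-5x}\,dx \\
&= \frac{c}{5}.
\end{aligned}
\]
Thus c = 5.
(b) From our previous calculation, we easily see that $f_X(x) = 5e^{-5x}$ for
all x ≥ 0, and $f_X(x) = 0$ otherwise.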
(c) We will need to do integration by parts. Recall that the product rule
gives that
\[ \frac{d}{dx}\big(u(x) v(x)\big) = u'(x) v(x) + u(x) v'(x). \]
Thus integrating gives
\[ \int_{a}^{b} u(x) v'(x)\,dx = \Big[ u(x) v(x) \Big]_{a}^{b} - \int_{a}^{b} u'(x) v(x)\,dx. \]
We know that $f_Y(y) = 0$ for y < 0. For y ≥ 0, we are supposed to
compute
\[ f_Y(y) = \int_{0}^{\infty} 5x e^{-5x} e^{-xy}\,dx. \]
Take u(x) = x and $v'(x) = 5e^{-x(5+y)}$. So take $v(x) = \frac{-5}{5+y} e^{-x(5+y)}$.
We find that
\[ f_Y(y) = \left[ x \, \frac{-5}{5 + y} e^{-x(5+y)} \right]_{x=0}^{x=\infty} - \left[ \frac{5}{(5 + y)^2} e^{-x(5+y)} \right]_{x=0}^{x=\infty}. \]
Using the fact that $x e^{-x} \to 0$ as x → ∞, we have that
\[ f_Y(y) = \frac{5}{(5 + y)^2}. \]
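(d) Note that
\[ \int_{-\infty}^{\infty} g(y)\,dy = \left[ \frac{-5}{5 + y} \right]_{0}^{\infty} = 1. \]
(e) We have that
\[ f_{Y|X}(y|x) = \begin{cases} x e^{-xy} & \text{if } y \ge 0, \\ 0 & \text{otherwise.} \end{cases} \]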
Exercise 14.2. Let c > 0. Consider the joint distribution for continuous random variables X and Y given by the density
\[ f(x, y) = \begin{cases} c & \text{if } 0 \le y \le x \le 1, \\ 0 & \text{otherwise.} \end{cases} \]
(a) Find c. (Hint: it may help to draw a picture of the region of integration.)
(b) Find the marginal $f_X$.
(c) Find the marginal $f_Y$.
(d) Check, directly by integration or elementary school math, that the
function defined by
\[ g(y) = \begin{cases} 2(1 - y) & \text{if } y \in [0, 1], \\ 0 & \text{otherwise,} \end{cases} \]
is indeed a pdf.
(e) Find the density for the conditional distribution of X given that
Y = y, for y ∈ (0, 1).
(f) Fix some b ∈ [0, 1). Check, directly by integration or elementary
school math, that the function defined by
\[ h(x) = \begin{cases} \dfrac{1}{1 - b} & \text{if } 0 \le b \le x < 1, \\ 0 & \text{otherwise,} \end{cases} \]
is indeed a pdf.
Solution.
(a) The region of integration is a triangle with boundaries given by the
equations y = 0, x = 1, and y = x, which has area 1/2; thus c = 2. Alternatively, we can do the
integration:
\[
\begin{aligned}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy &= c \int_{0}^{1} \int_{0}^{x} 1\,dy\,dx \\
&= c \int_{0}^{1} x\,dx \\
&= c \Big[ x^2/2 \Big]_{0}^{1} \\
&= c/2.
\end{aligned}
\]
Thus c = 2.
(b) Clearly, $f_X(x) = 0$ for all x < 0 and all x > 1. For x ∈ [0, 1], we
have that
\[ f_X(x) = \int_{0}^{x} 2\,dy = 2x. \]
(c) Clearly, $f_Y(y) = 0$ for all y < 0 and all y > 1. For all y ∈ [0, 1],
we have
\[ f_Y(y) = \int_{y}^{1} 2\,dx = 2(1 - y). \]
(d) It is clear that the required integral of g is given by the area of a
right-angled triangle with height 2 and base 1, which has unit area.
(e) The required conditional density is given by
\[ f_{X|Y}(x|y) = \begin{cases} \dfrac{1}{1 - y} & \text{if } 0 \le y \le x < 1, \\ 0 & \text{otherwise.} \end{cases} \]
(f) The required integral of h is given by the area of a rectangle with
height 1/(1 − b) and length 1 − b, which has unit area.
14.2. Normal distributions. Let ρ ∈ (−1, 1). Let f : R² → [0, ∞)
be given by
\[ f(x, y) = \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} \big( x^2 - 2\rho x y + y^2 \big) \right). \]
This is called a standard bivariate normal distribution. We will first
consider the case where ρ = 0. If g : R → [0, ∞) is given by
\[ g(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}, \]
then g is the pdf for the standard normal distribution. A random variable
with pdf g is a standard normal random variable.
Example done in class 14.3.
(1) In the case that ρ = 0, check that the bivariate normal distribution defined above is indeed a joint distribution; that is, it is
nonnegative, and its double integral over the entire real plane is
one.
(2) In the case ρ = 0, also rewrite f (x, y) = f1 (x)f2 (y) as a product
of functions in x and y alone.
(3) Show that the pdf for the standard normal distribution is indeed
a pdf.
Solution.
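(1) We use polar coordinates and the change of coordinates x =
r cos θ, y = r sin θ, dx dy = r dr dθ:
\[
\begin{aligned}
\frac{1}{2\pi} \iint_{\mathbb{R}^2} e^{-\frac{1}{2}(x^2 + y^2)}\,dx\,dy &= \frac{1}{2\pi} \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2/2}\,dr\,d\theta \\
&= \int_{0}^{\infty} r e^{-r^2/2}\,dr \\
&= \Big[ -e^{-r^2/2} \Big]_{0}^{\infty} \\
&= 1.
\end{aligned}
\]
(2) Write f(x, y) = g(x)g(y), where g is the pdf for the standard
normal distribution.
(3) We know that
\[ 1 = \iint_{\mathbb{R}^2} f(x, y)\,dx\,dy = \left( \int_{\mathbb{R}} g(x)\,dx \right)^{2}, \]
so, since g ≥ 0, the
result follows.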
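Since there is no elementary antiderivative, a quick numerical check of part (3) is reassuring. The sketch below uses a plain midpoint rule on [−10, 10]; the function names, the truncation of the real line to that interval, and the number of steps are all assumptions made for the illustration (the tails beyond ±10 are negligibly small).

```python
import math

def standard_normal_pdf(x):
    """The density g(x) = exp(-x^2 / 2) / sqrt(2 pi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_integral(f, a, b, steps=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

total = midpoint_integral(standard_normal_pdf, -10.0, 10.0)
print(total)
```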
15. Chapter 3.4 (III)
15.1. More on normal distributions. Let µ ∈ R and σ > 0. The
density for a normal random variable X with parameters (µ, σ) is given
by
\[ n(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}; \]
in which case we also write X ∼ N(µ, σ²).
Example done in class 15.1. Check that for every µ ∈ R, and σ > 0,
we indeed have that n(·; µ, σ) is a density.
Exercise 15.2 (Transformation of the standard normal).
(1) Let X be a standard normal random variable, so that X ∼
N (0, 1). Let σ > 0, and µ ∈ R. Find the pdf for the random
variable Y = σX + µ.
(2) Find the second marginal for the bivariate normal distribution, f,
without assuming ρ = 0; that is, compute $\int_{-\infty}^{\infty} f(x, y)\,dx$. (Hint:
complete the square $x^2 - 2\rho x y + y^2 = (x - \rho y)^2 + (1 - \rho^2) y^2$. You
will get the density for a standard normal random variable.)
(3) Verify that $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.
Solution.
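(1) We have
\[ P(Y \le x) = P(\sigma X + \mu \le x) = P\left( X \le \frac{x - \mu}{\sigma} \right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{x - \mu}{\sigma}} e^{-\frac{u^2}{2}}\,du. \]
By the change of variables $u = \frac{t - \mu}{\sigma}$ we have that
\[ F_Y(x) = P(Y \le x) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t - \mu)^2}{2\sigma^2}}\,dt. \]
By taking a derivative we obtain the required pdf
\[ F_Y'(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}. \]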
(2) After completing the square, we are to do the integral:
\[
\begin{aligned}
&\frac{1}{2\pi \sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2(1 - \rho^2)} \big( (x - \rho y)^2 + (1 - \rho^2) y^2 \big) \right) dx \\
&\qquad = \frac{\exp(-y^2/2)}{2\pi \sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2(1 - \rho^2)} (x - \rho y)^2 \right) dx.
\end{aligned}
\]
From the previous example, with σ² = 1 − ρ² and µ = ρy, we
know that
\[ \frac{1}{\sqrt{2\pi} \sqrt{1 - \rho^2}} \int_{-\infty}^{\infty} \exp\left( -\frac{1}{2(1 - \rho^2)} (x - \rho y)^2 \right) dx = 1. \]
Thus we are left with
\[ \int_{-\infty}^{\infty} f(x, y)\,dx = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} y^2}, \]
which is exactly the pdf for a standard normal random variable.
(3) We just need to integrate both sides of the above equation with respect to y; we
already know the answer for this integral since the right-hand side is the pdf for
a standard normal random variable.
Exercise 15.3 (Sum of continuous random variables). Let X and Y
be continuous random variables with a joint pdf f , and pdfs given by
fX and fY , respectively. Let Z = X + Y . Let z ∈ R.
(1) Show that the pdf for Z is given by
\[ f_{X+Y}(z) = \int_{-\infty}^{\infty} f(x, z - x)\,dx. \]
(2) If X and Y are independent show that the pdf for Z is given by
$f_X \star f_Y$.
Solution.
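(1) Let R = {(x, y) ∈ R² : x + y ≤ z}. Then
\[
\begin{aligned}
P(Z \le z) &= \iint_{R} f(x, y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{z - x} f(x, y)\,dy\,dx.
\end{aligned}
\]
By a change of variables y = u − x (here x is a constant), we
have
\[
\begin{aligned}
P(Z \le z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{z} f(x, u - x)\,du\,dx \\
&= \int_{-\infty}^{z} \left( \int_{-\infty}^{\infty} f(x, u - x)\,dx \right) du.
\end{aligned}
\]
Differentiating with respect to z gives the result.
(2) If X and Y are independent, we have that
\[ f(x, z - x) = f_X(x) f_Y(z - x), \]
from which the result follows.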
Exercise 15.4. Let X and Y have the joint density given by the standard bivariate normal distribution. Compute the conditional distribution of X given that Y = y. When are X and Y independent?
Solution. From our previous calculations in Exercise 15.2, it is clear
that
\[ f(x|y) = \frac{1}{\sqrt{2\pi} \sqrt{1 - \rho^2}} \exp\left( -\frac{1}{2(1 - \rho^2)} (x - \rho y)^2 \right). \]
We see that X and Y are independent if and only if ρ = 0.
16. Chapter 4.1, 4.3 (I)
16.1. Mean of a random variable. Let X be a real-valued discrete
random variable with pmf f, then the expected value of X is defined
via:
\[ \mathbb{E}X = \sum_{x} x f(x). \]
If X is a continuous random variable with pdf f then its expected value
is defined similarly:
\[ \mathbb{E}X = \int_{-\infty}^{\infty} x f(x)\,dx. \]
We often also write µ = EX. We also call EX the mean of a random
variable; the justification for this is given by a version of the law of
large numbers which states that if $(X_i)_{i=1}^{\infty}$ is a sequence of independent
random variables with the same distribution such that $\mathbb{E}|X_1| < \infty$,
then on an event of probability one, we have
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mathbb{E}X_1. \]
Example done in class 16.1. Let X be a Bernoulli random variable
with parameter p, so that P(X = 1) = p. Show that EX = p. Let A be
an event, show that E1A = P(A).
Exercise 16.2. Let X be a Binomial random variable with parameters
(n, p), where p ∈ (0, 1), so that it has a pmf given by:
\[ P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n-k} \]
for all integers k with 0 ≤ k ≤ n. Show directly using the definition of
expectation that EX = np.
Solution.
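\[
\begin{aligned}
\mathbb{E}X &= \sum_{k=0}^{n} k \binom{n}{k} p^{k} (1 - p)^{n-k} \\
&= \sum_{k=1}^{n} k \binom{n}{k} p^{k} (1 - p)^{n-k} \\
&= p \sum_{k=1}^{n} k \binom{n}{k} p^{k-1} (1 - p)^{n-k} \\
&= p \sum_{k=1}^{n} k \frac{n!}{k!(n - k)!} p^{k-1} (1 - p)^{n-k} \\
&= pn \sum_{k=1}^{n} \frac{(n - 1)!}{(k - 1)!(n - k)!} p^{k-1} (1 - p)^{n-k} \\
&= pn \sum_{k=0}^{n-1} \frac{(n - 1)!}{k!(n - k - 1)!} p^{k} (1 - p)^{n-k-1} \\
&= pn \sum_{k=0}^{n-1} \frac{(n - 1)!}{k!(n - 1 - k)!} p^{k} (1 - p)^{n-1-k} \\
&= pn \sum_{k=0}^{n-1} \binom{n-1}{k} p^{k} (1 - p)^{n-1-k} \\
&= pn(1) = pn.
\end{aligned}
\]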
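The identity EX = np can be confirmed exactly for specific parameters by summing the definition with rational arithmetic. The sketch below is only a spot check under illustrative choices (the function name `binomial_mean` and the parameters n = 10, p = 1/3 are not from the notes).

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """Compute the sum over k of k * C(n, k) p^k (1 - p)^(n - k) directly from the definition."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))

mean = binomial_mean(10, Fraction(1, 3))
print(mean)
```

With exact fractions the result is exactly 10/3, matching np.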
Example done in class 16.3 (N-sided fair dice). Suppose that X is
a random variable which takes values in R = {1, 2, 3, . . . , n} with equal
probability; that is, P(X = i) = 1/n for all i ∈ R. Let $Y \overset{d}{=} X$; that is,
P(Y = z) = P(X = z) for all z ∈ R. Also assume that X and Y are
independent. Compute P(Y = X) and EX.
Exercise 16.4 (Geometric random variables). We say that T is a
geometric random variable with parameter p ∈ (0, 1) if $P(T = k) =
(1 - p)^{k-1} p$ for all k = 1, 2, 3, . . . Compute ET. If you flip a fair coin,
how many times on average do you have to flip it in order to get a
head? Hint: recall from Calculus that for all r ∈ (0, 1) we have that
\[ f(r) := \frac{1}{1 - r} = \sum_{n=0}^{\infty} r^{n} \]
and
\[ f'(r) = \frac{1}{(1 - r)^2} = \sum_{n=1}^{\infty} n r^{n-1}. \]
Example done in class 16.5. Let U be uniformly distributed in [0, 1].
Find EU .
Exercise 16.6. Let σ > 0 and µ ∈ R. Let X be a continuous random
variable with pdf given by
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \]
so that X ∼ N(µ, σ²). Find EX.
Exercise 16.7. Let λ > 0. Let X be an exponential random variable
with parameter 1/λ. Find EX. (See Exercise 12.2.)
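Solution. The pdf for X is given by
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \in [0, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
Thus
\[ \mathbb{E}X = \int_{0}^{\infty} x f(x)\,dx. \]
We will need to integrate by parts. Recall that:
\[ \int_{a}^{b} u(x) v'(x)\,dx = \Big[ u(x) v(x) \Big]_{a}^{b} - \int_{a}^{b} u'(x) v(x)\,dx. \]
Set v'(x) = f(x) and u(x) = x; thus we take $v(x) = -e^{-\lambda x}$, and have
u'(x) = 1. So, we have
\[
\begin{aligned}
\mathbb{E}X &= \Big[ -x e^{-\lambda x} \Big]_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x}\,dx \\
&= 0 + \left[ -\frac{1}{\lambda} e^{-\lambda x} \right]_{0}^{\infty} \\
&= \frac{1}{\lambda}.
\end{aligned}
\]
(Recall that $x e^{-x} \to 0$ as x → ∞.)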
Example done in class 16.8. Let X be a continuous random variable
with probability density function f. If f is even (f(x) = f(−x) for all
x ∈ R), then show that EX = 0 and P(X ≥ x) = P(X ≤ −x).
Let α ∈ (0, 1). Show that P(−z ≤ X ≤ z) = 1 − α if and only if
P(X > z) = α/2. We should assume that
\[ \int_{-\infty}^{\infty} |x| f(x)\,dx < \infty \]
to avoid any problems with ∞ − ∞.
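Solution. Since f is even, by a change of variables, x = −u we have
that
\begin{align*}
\int_{-\infty}^{0} x f(x)\,dx &= \int_0^\infty -u f(-u)\,du \\
&= \int_0^\infty -u f(u)\,du.
\end{align*}
Hence
\begin{align*}
\int_{-\infty}^{\infty} x f(x)\,dx &= \int_{-\infty}^{0} x f(x)\,dx + \int_0^\infty x f(x)\,dx \\
&= \int_0^\infty x f(x)\,dx - \int_0^\infty x f(x)\,dx = 0.
\end{align*}
Also, by a change of variables x = −u we have
\[ \mathbb{P}(X \ge z) = \int_z^\infty f(x)\,dx = \int_{-\infty}^{-z} f(x)\,dx = \mathbb{P}(X \le -z). \]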
For the final claim, observe that
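\begin{align*}
\mathbb{P}(-z \le X \le z) &= \mathbb{P}(X \le z) - \mathbb{P}(X < -z) = 1 - \mathbb{P}(X > z) - \mathbb{P}(X < -z) \\
&= 1 - 2\mathbb{P}(X > z),
\end{align*}
where the last step uses P(X < −z) = P(X > z), which holds because f is even. Hence P(−z ≤ X ≤ z) = 1 − α if and only if P(X > z) = α/2.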
Exercise 16.9. Let Z be a discrete real-valued random variable. Show that
if Z ≥ 0, then EZ ≥ 0.
end of Exam 2 coverage
17. Quiz and review of Quiz and Extra HW
18. Chapter 4.1, 4.3 (II)
18.1. Functions of random variables.
Example done in class 18.1. Suppose X is uniformly distributed
on {−3, −2, −1, 0, 1, 2, 3}; that is, P(X = i) = 1/7 for all integers
−3 ≤ i ≤ 3. Let $g(x) = x^2$. Compute Eg(X).
Rather than computing the pmf of g(X), it turns out that there is often an
easier way. Let X be a discrete random variable taking values in D.
Given a function g : D → R such that E|g(X)| < ∞, we have that
\[ \mathbb{E}g(X) = \sum_{x \in D} g(x)\,\mathbb{P}(X = x). \]
Similarly, if f is the pdf for a continuous random variable X, and g is a function
such that g : R → R and g(X) is a continuous random variable, then
\[ \mathbb{E}g(X) = \int_{-\infty}^{\infty} g(x) f(x)\,dx. \]
Sometimes these formulas are called the law of the unconscious statistician; I am not sure why.
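To see the two routes side by side, here is a small sketch for the setting of Example 18.1 (X uniform on {−3, . . . , 3} and g(x) = x²). The variable names are ours, and exact fractions are used so the two answers can be compared without rounding.

```python
from collections import defaultdict
from fractions import Fraction

values = range(-3, 4)                      # support of X in Example 18.1
pmf_X = {x: Fraction(1, 7) for x in values}

def g(x):
    return x * x

# Route 1: build the pmf of Y = g(X) first, then sum y * P(Y = y).
pmf_Y = defaultdict(Fraction)
for x, px in pmf_X.items():
    pmf_Y[g(x)] += px
via_pmf_of_Y = sum(y * py for y, py in pmf_Y.items())

# Route 2: law of the unconscious statistician, sum g(x) * P(X = x).
via_lotus = sum(g(x) * px for x, px in pmf_X.items())

print(via_pmf_of_Y, via_lotus)
```

Route 2 never needs the pmf of g(X), which is the point of the formula.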
Exercise 18.2. Let λ > 0. Let X be a random variable with the
property that for all x ≥ 0, we have $\mathbb{P}(X > x) = e^{-\lambda x}$. Find $\mathbb{E}(X^2)$.
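Solution. The pdf for X is given by
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \in [0, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
Thus
\[ \mathbb{E}(X^2) = \int_0^\infty x^2 f(x)\,dx. \]
Again we will need to integrate by parts. Recall that:
\[ \int_a^b u(x) v'(x)\,dx = u(x)v(x)\Big|_a^b - \int_a^b u'(x) v(x)\,dx. \]
Choose $u(x) = x^2$, and $v'(x) = f(x)$. Thus we take $v(x) = -e^{-\lambda x}$, and
have $u'(x) = 2x$. So we have
\[ \mathbb{E}(X^2) = \Big[-x^2 e^{-\lambda x}\Big]_0^\infty + \int_0^\infty 2x e^{-\lambda x}\,dx. \]
The first term on the right hand side vanishes since $x^2 e^{-x} \to 0$ as $x \to \infty$.
As for the second term, we already know how to deal with that from
Exercise 16.7, since from that exercise, we know
\[ \mathbb{E}X = \int_0^\infty x \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}; \]
hence,
\[ \mathbb{E}(X^2) = \int_0^\infty 2x e^{-\lambda x}\,dx = \frac{2}{\lambda^2}. \]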
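The values 1/λ and 2/λ² can be checked by numerical integration. The sketch below uses a plain midpoint rule on a truncated interval; the function name exp_moment, the truncation point 60/λ, and the step count are all arbitrary choices of ours, not something taken from the textbook.

```python
import math

def exp_moment(lam, power, upper_factor=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x**power * lam * exp(-lam * x)
    over [0, upper_factor / lam]; the tail beyond that point is negligible."""
    upper = upper_factor / lam
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x**power * lam * math.exp(-lam * x)
    return total * h

lam = 2.0
print(exp_moment(lam, 1), 1 / lam)       # first moment vs 1/lambda
print(exp_moment(lam, 2), 2 / lam**2)    # second moment vs 2/lambda^2
```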
We will give a proof of the law of the unconscious statistician, for
the case of discrete random variables. We need to show that
\[ \sum_y y\,\mathbb{P}(g(X) = y) = \sum_x g(x)\,\mathbb{P}(X = x). \]
In order to do this, we will re-arrange the sum on the right-hand side.
For each y, let $g^{-1}(y) = \{x : g(x) = y\}$. Note that for each y,
\[ \mathbb{P}(g(X) = y) = \sum_{x \in g^{-1}(y)} \mathbb{P}(X = x), \]
so that
\[ y\,\mathbb{P}(g(X) = y) = y \sum_{x \in g^{-1}(y)} \mathbb{P}(X = x) = \sum_{x \in g^{-1}(y)} g(x)\,\mathbb{P}(X = x). \]
Summing both sides over y gives
\[ \sum_y y\,\mathbb{P}(g(X) = y) = \sum_y \sum_{x \in g^{-1}(y)} g(x)\,\mathbb{P}(X = x). \]
Since the union over all y of $g^{-1}(y)$ gives all possible values of x, the
right hand side is the same as:
\[ \sum_x g(x)\,\mathbb{P}(X = x). \]
Similar formulas hold for joint distributions. In particular, the version for the discrete joint distribution is a consequence of the regular
one, since given two discrete random variables X, Y, the random variable Z = (X, Y) is also a discrete random variable: let g be a function
of (X, Y), then
\[ \mathbb{E}g(X, Y) = \sum_x \sum_y g(x, y)\,\mathbb{P}(X = x, Y = y). \]
For the continuous random variables, if f is the joint pdf of continuous
random variables X and Y, we have
\[ \mathbb{E}g(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y)\,dx\,dy. \]
Example done in class 18.3. Let U1 and U2 be independent random
variables uniformly distributed in [0, 1]. Compute E(U1 U2 ).
18.2. Linearity and independent products. It turns out the law of
the unconscious statistician for (joint) distributions can be very useful
for deriving other useful properties of the expectation operator E.
Exercise 18.4. If X and Y are independent real-valued random variables, then E(XY ) = (EX)(EY ). Show this for discrete random variables using the law of the unconscious statistician with the function
g(x, y) = xy.
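Solution.
\begin{align*}
\mathbb{E}(XY) &= \mathbb{E}g(X, Y) \\
&= \sum_x \sum_y xy\,\mathbb{P}(X = x, Y = y) \\
&= \sum_x \sum_y xy\,\mathbb{P}(X = x)\,\mathbb{P}(Y = y) && \text{since $X$ and $Y$ are independent} \\
&= \sum_x x\,\mathbb{P}(X = x) \sum_y y\,\mathbb{P}(Y = y) \\
&= (\mathbb{E}X)(\mathbb{E}Y).
\end{align*}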
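The computation can be mirrored on a concrete pair. As an illustration only (two independent fair dice are our choice, not an example from the notes), the sketch below builds the joint pmf as a product and compares E(XY) with (EX)(EY) exactly.

```python
from fractions import Fraction
from itertools import product

die = {k: Fraction(1, 6) for k in range(1, 7)}   # pmf of one fair die

# Joint pmf of two independent dice: P(X = x, Y = y) = P(X = x) P(Y = y).
joint = {(x, y): die[x] * die[y] for x, y in product(die, die)}

# Law of the unconscious statistician with g(x, y) = x * y.
E_XY = sum(x * y * pxy for (x, y), pxy in joint.items())
E_X = sum(x * px for x, px in die.items())

print(E_XY, E_X * E_X)
```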
Exercise 18.5. Find two random variables X and Y such that E(XY ) =
(EX)(EY ), but X and Y are not independent.
Example done in class 18.6. Let a, b ∈ R. If X is a real-valued
random variable, then E(aX + b) = aEX + b. Show this for a continuous random variable X with pdf f , using the law of the unconscious
statistician with the function g(x) = ax + b.
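Solution.
\begin{align*}
\mathbb{E}(aX + b) &= \mathbb{E}g(X) \\
&= \int_{-\infty}^{\infty} (ax + b) f(x)\,dx \\
&= \int_{-\infty}^{\infty} a x f(x)\,dx + \int_{-\infty}^{\infty} b f(x)\,dx \\
&= a \int_{-\infty}^{\infty} x f(x)\,dx + b \int_{-\infty}^{\infty} f(x)\,dx \\
&= a\,\mathbb{E}X + b.
\end{align*}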
Example done in class 18.7. If X and Y are real-valued random
variables (not necessarily independent), then E(X + Y ) = EX + EY
(provided that E|X| + E|Y | < ∞). Show this for discrete random variables using the law of the unconscious statistician with the function
g(x, y) = x + y.
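Solution.
\begin{align*}
\mathbb{E}(X + Y) &= \mathbb{E}(g(X, Y)) \\
&= \sum_x \sum_y (x + y)\,\mathbb{P}(X = x, Y = y) \\
&= \sum_x \sum_y \big[ x\,\mathbb{P}(X = x, Y = y) + y\,\mathbb{P}(X = x, Y = y) \big] \\
&= \sum_x x \sum_y \mathbb{P}(X = x, Y = y) + \sum_y y \sum_x \mathbb{P}(X = x, Y = y) \\
&= \sum_x x\,\mathbb{P}(X = x) + \sum_y y\,\mathbb{P}(Y = y) \\
&= \mathbb{E}X + \mathbb{E}Y.
\end{align*}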
Example done in class 18.8. Use the linearity of expectation to
find the expectation of a Binomial random variable with parameters
(n, p).
18.3. Information theory. Let X be a discrete random variable taking the values $\{a_1, \ldots, a_n\}$, where $\mathbb{P}(X = a_i) \neq 0$ for all i = 1, 2, . . . , n.
The information function is another random variable associated with
X defined via:
\[ I = I_X = -\sum_{i=1}^{n} \log(\mathbb{P}(X = a_i))\,\mathbf{1}_{\{X = a_i\}}; \]
in other words, if $X = a_i$, then $I = -\log(\mathbb{P}(X = a_i))$. Notice that
I ≥ 0. The log here is the usual natural logarithm; another popular
choice is $\log_2$, especially in computer science.
The information function is a measure of how much information you gain, or
how surprised you are, when you see the outcome of X. For example,
if X is a Bernoulli random variable with parameter 0.99, we have that if
X = 1, then I = −log(0.99) ≈ 0.01, but if X = 0, then I = −log(0.01) ≈ 4.6,
which is a much larger number. The entropy of X is denoted by H(X) and is
given by $H(X) = \mathbb{E}I_X$; thus we can think of entropy as the expected information given by a random variable. Entropy plays an important role
in coding theory, and we will touch upon some of its basic properties
in the next exercises.
Exercise 18.9. Let X be a discrete random variable taking values in
$\{a_1, \ldots, a_n\}$, where $\mathbb{P}(X = a_i) \neq 0$ for all i = 1, . . . , n. Show that
\[ H(X) = -\sum_{i=1}^{n} \mathbb{P}(X = a_i) \log(\mathbb{P}(X = a_i)). \]
(Hint, take the expectation of IX and use the linearity of expectation,
and remember what the expectation of an indicator is.)
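Solution. Recall that $\mathbb{E}\mathbf{1}_A = \mathbb{P}(A)$. By the linearity of expectation, we
have
\begin{align*}
H(X) = \mathbb{E}I_X &= -\sum_{i=1}^{n} \log(\mathbb{P}(X = a_i))\,\mathbb{E}(\mathbf{1}_{\{X = a_i\}}) \\
&= -\sum_{i=1}^{n} \log(\mathbb{P}(X = a_i))\,\mathbb{P}(X = a_i).
\end{align*}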
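The formula translates directly into a few lines of code. This is a sketch with names of our own choosing (information, entropy), representing a pmf as a dictionary from outcomes to nonzero probabilities and using the natural logarithm as in the text.

```python
import math

def information(pmf, outcome):
    """Value of I_X on the event {X = outcome}: minus the natural log of its probability."""
    return -math.log(pmf[outcome])

def entropy(pmf):
    """H(X) = -sum_i P(X = a_i) log P(X = a_i), for a dict of nonzero probabilities."""
    return -sum(p * math.log(p) for p in pmf.values())

bern = {1: 0.99, 0: 0.01}          # the Bernoulli(0.99) example from the text
print(information(bern, 1), information(bern, 0), entropy(bern))
```

The entropy printed at the end is the probability-weighted average of the two information values, which is exactly the statement H(X) = E I_X.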
Exercise 18.10. Let X be a Bernoulli random variable with parameter
p ∈ (0, 1). Compute H(X). Find the value of p which maximizes H(X).
(Hint: After you find a formula for H(X) that depends on p, you might
need to use calculus.)
Solution. Let
f (p) = −p log p − (1 − p) log(1 − p).
Clearly, H(X) = f (p).
Note that technically f is only defined for p ∈ (0, 1). However,
$\lim_{p \to 0} f(p) = \lim_{p \to 1} f(p) = 0$, from which it makes sense to set
f(1) = f(0) = 0. We want to find the p which maximizes f(p) on
[0, 1]. We want to find the derivative of f and find critical points; that
is, points c for which $f'(c) = 0$. Then we need to check which one of
those maximizes f in the interval [0, 1]. We have that
\begin{align*}
f'(p) &= -\log p - 1 + \frac{1}{1-p} + \log(1 - p) - \frac{p}{1-p} \\
&= \log(1 - p) - \log(p).
\end{align*}
Thus we are left to solve for p in
\[ \log(1 - p) = \log(p), \]
or
\[ 1 - p = p, \]
from which we deduce that p = 1/2. Since f is positive on (0, 1),
f(1) = f(0) = 0, and p = 1/2 is the only critical point, f(1/2) = log 2 is
the maximum value of f on [0, 1]. (We can also argue using the first
derivative test.)
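A crude grid search gives the same answer as the calculus. This is only an illustrative sketch (the grid spacing of 0.001 and the function name are our choices); it evaluates f on a grid and reports the maximizer.

```python
import math

def bernoulli_entropy(p):
    """f(p) = -p log p - (1 - p) log(1 - p), with f(0) = f(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=bernoulli_entropy)
print(best_p, bernoulli_entropy(best_p), math.log(2))
```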
18.3.1. Joint entropy. If X and Y are discrete random variables with
joint pmf f, and marginals $f_X$ and $f_Y$, then we also define their joint
entropy to be given by
\[ H(X, Y) = -\sum_x \sum_y f(x, y) \log(f(x, y)) \]
(in this notation we take 0 log(0) = 0).
Exercise 18.11. Show that if X and Y are independent, then
H(X, Y ) = H(X) + H(Y ).
For a fixed value of x, such that $f_X(x) > 0$, we define
\[ H(Y | X = x) = -\sum_y f_{Y|X}(y|x) \log(f_{Y|X}(y|x)), \]
and the conditional entropy of Y given X to be,
\[ H(Y | X) = \sum_x f_X(x)\,H(Y | X = x). \]
Exercise 18.12. Show that
\[ H(Y | X) = -\sum_x \sum_y f(x, y) \log(f_{Y|X}(y|x)). \]
Exercise 18.13. Show that H(X, Y ) = H(X) + H(Y |X) and conclude
that if X and Y are independent, then H(Y |X) = H(Y ).
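Before proving the identity in Exercise 18.13, it can be reassuring to test it on a small table. The joint pmf below is made up purely for illustration, and the helper name H is ours; the conditional entropy is computed with the double-sum formula from Exercise 18.12.

```python
import math

# Hypothetical joint pmf f(x, y) of a dependent pair (X, Y); the numbers are made up.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf):
    """Entropy of a pmf given as a dict, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

# Marginal pmf of X: f_X(x) = sum over y of f(x, y).
f_X = {}
for (x, _), p in joint.items():
    f_X[x] = f_X.get(x, 0.0) + p

# H(Y|X) = -sum_x sum_y f(x, y) log f_{Y|X}(y|x), where f_{Y|X}(y|x) = f(x, y) / f_X(x).
H_Y_given_X = -sum(p * math.log(p / f_X[x]) for (x, _), p in joint.items() if p > 0)

print(H(joint), H(f_X) + H_Y_given_X)
```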
19. Chapter 4.2, 4.3 (I)
19.1. Variance. The variance of a random variable X is defined via:
\[ \mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}X)^2\big]. \]
We often write $\mathrm{Var}(X) = \sigma^2$, and we say that $\sqrt{\mathrm{Var}(X)} = \sigma$ is the
standard deviation of the random variable X.
The law of large numbers relates the variance to the sample variance
in the following way. If $(X_i)_{i \in \mathbb{N}}$ is a sequence of independent random
variables, all with the same distribution with $\mathbb{E}X_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2$, then
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 = \mathbb{E}(X_1 - \mu)^2 = \sigma^2. \]
(We cheated slightly here since we have the difference of Xi with µ,
instead of the sample mean. We will do a more detailed calculation
later.)
For discrete random variables and continuous random variables, using the law of the unconscious statistician with the function
$g(x) = (x - \mu)^2$, we can obtain formulas for computing the variance.
For a discrete random variable X we have
\[ \mathrm{Var}(X) = \sum_x (x - \mu)^2\,\mathbb{P}(X = x). \]
And for a continuous random variable X with pdf f, we
have
\[ \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx. \]
Example done in class 19.1 (Variance of a Bernoulli random variable). Let X be a random variable with P(X = 1) = p and P(X = 0) =
1 − p. Find Var(X).
Solution. We know that EX = p. Using the above formula we have:
\[ \mathrm{Var}(X) = (0 - p)^2 (1 - p) + (1 - p)^2 p = (1 - p)\big(p^2 + (1 - p)p\big) = (1 - p)p. \]
Exercise 19.2 (Variance of standard normal). Let X be a continuous
random variable with pdf given by
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}. \]
Find the variance of X.
Solution. It is easy to see that EX = 0, thus $\mathrm{Var}(X) = \mathbb{E}(X^2)$. Again
we will need to integrate by parts to compute $\mathbb{E}(X^2)$. Recall that:
\[ \int_a^b u(x) v'(x)\,dx = u(x)v(x)\Big|_a^b - \int_a^b u'(x) v(x)\,dx. \]
Choose $u(x) = x$ and $v'(x) = x f(x)$, so that $v(x) = -f(x)$. Integration by parts gives:
\[ \int_{-\infty}^{\infty} x^2 f(x)\,dx = -\frac{x}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} \Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx. \]
We have that the second term equals 1 (it is the integral of the pdf f) and the first term is zero, since
it follows from L'Hôpital's rule that the exponential goes to zero faster
than any polynomial goes to infinity. Hence Var(X) = 1.
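Example done in class 19.3 (Short-cut formula). Show that
\[ \mathrm{Var}(X) = \mathbb{E}(X^2) - (\mathbb{E}X)^2. \]
Solution.
\[ \mathrm{Var}(X) = \mathbb{E}(X - \mathbb{E}X)^2 = \mathbb{E}\big(X^2 - 2X\,\mathbb{E}X + (\mathbb{E}X)^2\big). \]
Applying the linearity of expectation, we have,
\[ \mathrm{Var}(X) = \mathbb{E}(X^2) - 2(\mathbb{E}X)^2 + (\mathbb{E}X)^2 = \mathbb{E}(X^2) - (\mathbb{E}X)^2. \]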
Exercise 19.4. Let X be a Poisson random variable with parameter
λ > 0, so that
\[ \mathbb{P}(X = k) = e^{-\lambda} \frac{\lambda^k}{k!} \]
for k = 0, 1, 2, . . .. Show that EX = λ = Var(X).
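Solution. First we show that EX = λ.
\begin{align*}
\mathbb{E}X = \sum_{k=0}^{\infty} k\,\mathbb{P}(X = k) &= e^{-\lambda} \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} \\
&= e^{-\lambda} \lambda \sum_{k=1}^{\infty} k \frac{\lambda^{k-1}}{k!} = e^{-\lambda} \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \\
&= \lambda.
\end{align*}
With the short-cut formula, it remains to compute $\mathbb{E}(X^2)$:
\begin{align*}
\mathbb{E}(X^2) &= e^{-\lambda} \sum_{k=0}^{\infty} k^2 \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k=1}^{\infty} k^2 \frac{\lambda^k}{k!} \\
&= e^{-\lambda} \lambda \sum_{k=1}^{\infty} k \frac{\lambda^{k-1}}{(k-1)!} = e^{-\lambda} \lambda \sum_{k=0}^{\infty} (k + 1) \frac{\lambda^k}{k!} \\
&= e^{-\lambda} \lambda \Big( \sum_{k=0}^{\infty} k \frac{\lambda^k}{k!} + \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \Big) \\
&= \lambda(\mathbb{E}X + 1) = \lambda^2 + \lambda.
\end{align*}
Thus $\mathrm{Var}(X) = \lambda^2 + \lambda - \lambda^2 = \lambda$.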
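The two series can also be summed numerically. In this sketch the function names and the truncation at 150 terms are our own choices (the neglected tail is tiny for moderate λ); the variance is obtained from the short-cut formula.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam**k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_mean_var(lam, terms=150):
    """Truncated sums for EX and E(X^2); returns (mean, variance) via the short-cut formula."""
    mean = sum(k * poisson_pmf(k, lam) for k in range(terms))
    second = sum(k * k * poisson_pmf(k, lam) for k in range(terms))
    return mean, second - mean**2

print(poisson_mean_var(3.5))
```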
Exercise 19.5. Let λ > 0. Let X be an exponential random variable
with parameter 1/λ. Find Var(X).
Solution. By the short-cut formula, we have that
\[ \mathrm{Var}(X) = \mathbb{E}(X^2) - (\mathbb{E}X)^2. \]
By Exercises 16.7 and 18.2, we have that
\[ \mathrm{Var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}. \]
20. Test 2
Q1: a)0.43478, b)0.42, c)1.8; Q2: a)4, c)1, d)0, e)0.05859, f)4/5;
Q3: a)1/5, b)1/5[g(10) − g(5)]; Q4: a)0.34956, b)0.04637; Q5: a)2,
b)FX (4) − FX (3).
21. Chapter 4.2, 4.3 (II)
21.1. Sample variance and variance. Let $(X_i)_{i=1}^{\infty}$ be a sequence of
independent random variables such that $\mathbb{E}X_i = \mathbb{E}X_1$ and $\mathbb{E}X_i^2 = \mathbb{E}X_1^2$
for all i ∈ N. Let
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Notice by the short-cut formula, Exercise 1.3, we have that
\[ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \Big( \frac{1}{n} \sum_{i=1}^{n} X_i \Big)^2. \]
Taking a limit as n → ∞, by the law of large numbers, we have
on an event of probability one that
\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = \mathbb{E}X_1^2 - (\mathbb{E}X_1)^2 = \mathrm{Var}(X_1) \]
by the other short-cut formula, Exercise 19.3.
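Both the algebraic identity and the limit can be illustrated by simulation. The sketch below is our own (rolls of a fair die are an arbitrary choice, for which the variance is 35/12); the function two_forms returns the two sides of the identity, and the loop shows them approaching the variance as n grows.

```python
import random

def two_forms(xs):
    """Return (1/n) sum (x_i - mean)^2 and (1/n) sum x_i^2 - mean^2 for a list xs."""
    n = len(xs)
    mean = sum(xs) / n
    centered = sum((x - mean) ** 2 for x in xs) / n
    shortcut = sum(x * x for x in xs) / n - mean**2
    return centered, shortcut

random.seed(0)                       # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    rolls = [random.randint(1, 6) for _ in range(n)]   # fair die, Var = 35/12
    print(n, two_forms(rolls), 35 / 12)
```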
21.2. Variance for sums of random variables.
Example done in class 21.1. Let X be a random variable and a, b ∈
R. Find Var(aX), Var(X + b).
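Solution. We have
\[ \mathrm{Var}(aX) = \mathbb{E}\big(aX - \mathbb{E}(aX)\big)^2 = \mathbb{E}\big(aX - a\,\mathbb{E}X\big)^2 = a^2\,\mathbb{E}(X - \mathbb{E}X)^2. \]
Thus $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$.
We have
\[ \mathrm{Var}(X + b) = \mathbb{E}\big((X + b) - \mathbb{E}(X + b)\big)^2 = \mathbb{E}\big(X + b - \mathbb{E}X - b\big)^2 = \mathbb{E}(X - \mathbb{E}X)^2. \]
Thus Var(X + b) = Var(X).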
Example done in class 21.2. Recall that if X and Y are independent
random variables, then E(XY ) = (EX)(EY ). Use this fact to show
that if X and Y are independent random variables, then Var(X + Y ) =
Var(X) + Var(Y ).
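Solution.
\begin{align*}
\mathrm{Var}(X + Y) &= \mathbb{E}(X + Y)^2 - (\mathbb{E}(X + Y))^2 \\
&= \mathbb{E}X^2 + 2\mathbb{E}(XY) + \mathbb{E}Y^2 - (\mathbb{E}(X) + \mathbb{E}(Y))^2 \\
&= 2\mathbb{E}(XY) + \mathbb{E}X^2 + \mathbb{E}Y^2 - 2(\mathbb{E}X)\mathbb{E}Y - (\mathbb{E}X)^2 - (\mathbb{E}Y)^2 \\
&= 2(\mathbb{E}X)\mathbb{E}Y + \mathbb{E}X^2 + \mathbb{E}Y^2 - 2(\mathbb{E}X)\mathbb{E}Y - (\mathbb{E}X)^2 - (\mathbb{E}Y)^2 \\
&= \mathbb{E}(X^2) - (\mathbb{E}X)^2 + \mathbb{E}(Y^2) - (\mathbb{E}Y)^2 \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y).
\end{align*}
In the first equation, we used the short-cut formula. In the second equation,
we used the linearity of expectation (E(aX + bY) = aEX + bEY). In
the fourth equation, we used the independence of X and Y. Finally, in
the last equation, we used the short-cut formula again.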
Example done in class 21.3. Let a, b ∈ R. Show that if X and Y are
independent random variables, then aX and bY are also independent
random variables. (See also Section 13.2.)
Example done in class 21.4. Let a, b ∈ R. Let X and Y be independent random variables. Show that $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$; thus in particular, Var(X − Y) = Var(X) + Var(Y).
Example done in class 21.5. Let X and Y be random variables that
are not necessarily independent. Show that
Var(X + Y ) = Var(X) + Var(Y ) + 2E(XY ) − 2(EX)(EY ).
Example done in class 21.6 (Variance for Binomial random variables). Let $(X_i)_{i=1}^{n}$ be independent Bernoulli random variables with parameter p. Then $X = \sum_{i=1}^{n} X_i$ is a Binomial random variable with
parameters (n, p). Find the variance of X.
Solution. By Exercise 19.1, we have that $\mathrm{Var}(X_i) = p(1 - p)$. Since
X is a sum of independent random variables, we know that
\[ \mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = np(1 - p). \]
Exercise 21.7. Let $X_1, X_2, X_3, \ldots$ be independent random variables
with the same distribution; that is, all the $X_i$ have the same cdf. Let
$\mathbb{E}X_1 = \mu$ and $\mathrm{Var}(X_1) = \sigma^2 > 0$. Let $S_n = \sum_{i=1}^{n} X_i$. Find (formulas,
in terms of n, µ, and $\sigma^2$ for) the mean and the variance of $S_n$; determine
the mean and variance (by applying the formulas) for the specific case
n = 25, µ = 3, and $\sigma^2 = 7$.
Solution. The linearity of expectation gives $\mathbb{E}S_n = n\mathbb{E}X_1 = n\mu$, and
the independence of the $X_i$ gives $\mathrm{Var}(S_n) = n\,\mathrm{Var}(X_1) = n\sigma^2$.
In the specific case n = 25, µ = 3, and $\sigma^2 = 7$, we obtain that
$\mathbb{E}S_n = 25(3) = 75$, and $\mathrm{Var}(S_n) = 25(7) = 175$.
Exercise 21.8 (Standard Version). Let X be a random variable. Sometimes it makes things nice to standardize random variables by defining:
\[ Z = \frac{X - \mathbb{E}X}{\sqrt{\mathrm{Var}(X)}}. \]
Check that Z has mean 0 and unit variance; that is, Var(Z) = 1.
(Assume that 0 < Var(X) < ∞.)
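Solution. Let EX = µ and $\mathrm{Var}(X) = \sigma^2$, to bring out the fact that
they are just constants.
\[ \mathbb{E}\Big( \frac{X - \mu}{\sigma} \Big) = \frac{1}{\sigma}\,\mathbb{E}(X - \mu) = \frac{1}{\sigma}(\mathbb{E}X - \mu) = 0. \]
\[ \mathrm{Var}\Big( \frac{X - \mu}{\sigma} \Big) = \frac{1}{\sigma^2}\,\mathrm{Var}(X - \mu) = \frac{1}{\sigma^2}\,\mathrm{Var}(X) = 1. \]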
Example done in class 21.9. Let c ∈ R. Suppose that X is a random
variable such that P(X = c) = 1. Show that Var(X) = 0. (In fact, the
other direction is also true. That is, if Var(X) = 0, then there exists
c ∈ R, such that P(X = c) = 1; moreover c = EX.)
21.3. Covariance. Given two random variables X and Y, we define
the covariance of X and Y to be given by
\[ \mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}X)(Y - \mathbb{E}Y)\big]. \]
Whenever Var(X), Var(Y) > 0, the correlation coefficient is given
by
\[ \rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \]
Exercise 21.10. Let Y = aX + b. Find ρ(X, Y). Assume a ≠ 0, and
Var(X) ≠ 0. (You may have to take cases depending on whether a is
negative or positive.)
Solution.
\begin{align*}
\mathrm{Cov}(X, Y) &= \mathbb{E}\big[(X - \mathbb{E}X)(Y - \mathbb{E}Y)\big] \\
&= \mathbb{E}\big[(X - \mathbb{E}X)(aX + b - \mathbb{E}(aX + b))\big] \\
&= \mathbb{E}\big[(X - \mathbb{E}X)(aX - a\,\mathbb{E}X)\big] \\
&= a\,\mathbb{E}\big[(X - \mathbb{E}X)(X - \mathbb{E}X)\big] \\
&= a\,\mathrm{Var}(X).
\end{align*}
We also have that $\mathrm{Var}(Y) = \mathrm{Var}(aX + b) = \mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$.
Since $\sqrt{a^2} = |a|$, we have that $\rho(X, Y) = \frac{a}{|a|}$. Thus if a > 0, then
ρ(X, Y) = 1, and if a < 0, then ρ(X, Y) = −1.
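The conclusion is easy to confirm on a concrete distribution. In the sketch below X is a fair die, which is an arbitrary choice of ours, and rho_linear is a hypothetical helper that computes the correlation of X and aX + b straight from the definitions of covariance and variance.

```python
import math
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}    # X is a fair die (an arbitrary choice)

def rho_linear(a, b):
    """Correlation of X and Y = a*X + b, computed from the definitions."""
    EX = sum(x * p for x, p in die.items())
    EY = sum((a * x + b) * p for x, p in die.items())
    cov = sum((x - EX) * (a * x + b - EY) * p for x, p in die.items())
    var_x = sum((x - EX) ** 2 * p for x, p in die.items())
    var_y = sum((a * x + b - EY) ** 2 * p for x, p in die.items())
    return float(cov) / math.sqrt(float(var_x * var_y))

print(rho_linear(3, 2), rho_linear(-0.5, 7))
```

The sign of a alone decides the answer; the values of |a| and b do not matter.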
Exercise 21.11. Prove the following short-cut formula. Show that
Cov(X, Y ) = E(XY ) − (EX)(EY ).
Exercise 21.12. Suppose that (X, Y) are standard normal random
variables with joint density function f given by the standard bivariate
normal distribution with parameter ρ ∈ (−1, 1) so that
\[ f(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}} \exp\Big( -\frac{1}{2(1 - \rho^2)} (x^2 - 2\rho x y + y^2) \Big). \]
Find the covariance of X and Y. Hint: use the law of the unconscious
statistician with the function g(x, y) = xy.
Chapter 4.4–Postponed, please read
21.4. Chebyshev’s inequality. Let X be a random variable with
$\mathbb{E}X^2 < \infty$. Let a ≥ 0. Chebyshev’s inequality states that
\[ a^2\,\mathbb{P}(|X| > a) \le \mathbb{E}X^2. \]
To see this consider the random variable Y defined by $Y = a\,\mathbf{1}_{\{|X| > a\}}$.
Notice that Y ≤ |X|, since Y = 0 when |X| ≤ a and Y simply takes
the value a when |X| > a. Since Y ≤ |X|, and Y is nonnegative, we
have that $Y^2 \le X^2$. Taking expectations on both sides, and noting that $\mathbb{E}(Y^2) = a^2\,\mathbb{P}(|X| > a)$, we obtain the
required result. Actually, we cheated a bit; we need the following
result to show that if X ≤ Y, then EX ≤ EY:
Exercise 21.13. Show that if Z is a real-valued random variable (discrete or continuous) such that P(Z ≥ 0) = 1, then EZ ≥ 0.
Exercise 21.14. Use Chebyshev’s theorem to obtain the version in the
textbook. Let X be a real-valued random variable with EX = µ and $\mathrm{Var}(X) = \sigma^2$. Show that for any k > 0, we have
\[ \mathbb{P}(|X - \mu| \le k\sigma) \ge 1 - \frac{1}{k^2}. \]
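To get a feel for how conservative the bound is, one can compare it with an exact probability. The sketch below uses a Poisson random variable with λ = 4 (so µ = 4 and σ = 2); this choice, the function names, and the truncation at 150 terms are ours and serve only as an illustration.

```python
import math

def poisson_pmf(j, lam):
    """P(X = j) = exp(-lam) * lam**j / j!"""
    return math.exp(-lam) * lam**j / math.factorial(j)

def prob_within(lam, k_sigma, terms=150):
    """P(|X - mu| <= k*sigma) for X ~ Poisson(lam), where mu = lam and sigma = sqrt(lam)."""
    mu, sigma = lam, math.sqrt(lam)
    return sum(poisson_pmf(j, lam) for j in range(terms) if abs(j - mu) <= k_sigma * sigma)

for k in (1.5, 2, 3):
    print(k, prob_within(4.0, k), 1 - 1 / k**2)   # exact probability vs Chebyshev lower bound
```

In each row the first number should be at least as large as the second, usually by a wide margin.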
Exercise 21.15 (Convergence in probability). Let $X_n$ be a sequence
of random variables such that for any ε > 0 we have that
\[ \lim_{n \to \infty} \mathbb{P}(|X_n| > \varepsilon) = 0; \]
in this case we say that $X_n$ converges to 0 in probability. Show that if
\[ \lim_{n \to \infty} \mathbb{E}X_n^2 = 0, \]
then $X_n$ converges to 0 in probability.
Solution. Let ε > 0. By Chebyshev’s theorem, we have that
\[ 0 \le \varepsilon^2\,\mathbb{P}(|X_n| > \varepsilon) \le \mathbb{E}X_n^2. \]
Taking limits on both sides, we obtain that $\lim_{n \to \infty} \mathbb{P}(|X_n| > \varepsilon) = 0$, as
required.
21.5. Weak law of large numbers. Using Chebyshev’s inequality
and Exercise 21.15 we can prove a version of the weak law of large
numbers.
Exercise 21.16. Let $X_i$ be a sequence of independent random variables
all with the same mean $\mathbb{E}X_1 = 0$ and bounded variance $\mathrm{Var}(X_i) \le C < \infty$ (for some C). Let $S_n = X_1 + \cdots + X_n$. Show that
\[ \lim_{n \to \infty} \mathbb{E}\Big( \frac{S_n}{n} \Big)^2 = 0 \]
and that $S_n / n$ converges to 0 in probability.
Solution. Observe that $\mathbb{E}S_n = 0$, so that $\mathrm{Var}(S_n) = \mathbb{E}S_n^2$. By using
Exercise 21.2, we have that
\begin{align*}
0 \le \mathbb{E}(S_n / n)^2 = \frac{1}{n^2}\,\mathbb{E}S_n^2 &= \frac{1}{n^2}\,\mathrm{Var}(S_n) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) \le \frac{1}{n^2} \sum_{i=1}^{n} C \le \frac{C}{n}.
\end{align*}
Taking limits on both sides we obtain the first required result, and applying Exercise 21.15, we obtain the second result.
22. Chapter 5.1, 5.2, 5.6
22.1. Some named discrete random variables. We collect here
some of the discrete random variables that we have discussed so far.
22.1.1. Bernoulli random variables. Let p ∈ (0, 1). We say that X ∼
Bern(p) if P(X = 1) = p and P(X = 0) = 1 − p. We have computed
that EX = p and Var(X) = p(1 − p).
22.1.2. Binomial random variables. Let n ≥ 0. Let p ∈ (0, 1). We say
that X ∼ Bin(n, p) if for all 0 ≤ k ≤ n, we have
\[ \mathbb{P}(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}. \]
Note that if $(X_i)_{i=1}^{n}$ are independent Bernoulli random variables with
parameter p, then
\[ \sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, p). \]
Using this, we have computed that EX = np and Var(X) = np(1 − p).
22.1.3. Using your calculator to help compute binomial probabilities.
Your calculator has the pmf of binomial random variables built into it.
For example, if Y ∼ Bin(8, 0.345), then if we want to know P(Y = 2),
we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option A,
(iv) enter: binompdf(8,0.345,2)
(v) which gives 0.2631746302
More importantly, your calculator allows you to compute the cdf of a
Binomial random variable. Let X ∼ Bin(23, 0.345). For example, it
would be a very tedious task to compute
P(X ≤ 8) = P(X = 0) + P(X = 1) + · · · + P(X = 8).
Your book has tables of the cdf values of Binomial random variables,
but only for certain values of p. Thirty years ago, we might have learned to
use these tables, but we will instead opt to use the calculator. For
example, to compute P(X ≤ 8), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option B,
(iv) enter: binomcdf(23,0.345,8)
(v) which gives 0.6058681469
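If a calculator is not at hand, the same two quantities are easy to compute in a few lines of code. This is a sketch only: the Python function names binompdf and binomcdf are ours, chosen to mirror the calculator commands and their argument order, and they are not part of any library.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k) for X ~ Bin(n, p); argument order mirrors the calculator call."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k) for X ~ Bin(n, p), as a sum of pmf values."""
    return sum(binompdf(n, p, j) for j in range(k + 1))

print(binompdf(8, 0.345, 2))     # compare with the calculator value quoted above
print(binomcdf(23, 0.345, 8))
```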
Exercise 22.1. Let X ∼ Bin(15, 0.3457).
(a) Find P(X = 7).
(b) Find P(X ≤ 7).
(c) Find P(X < 7).
(d) Find P(X > 8).
(e) Find P(2 ≤ X ≤ 5).
Solution. The calculator gives:
(a) P(X = 7) = 0.12754386.
(b) P(X ≤ 7) = 0.8936125.
(c) P(X < 7) = P(X ≤ 7) − P(X = 7) = 0.7661.
(d) P(X > 8) = 1 − P(X ≤ 8) = 1 − 0.9610004418 = 0.0389995582.
(e) P(2 ≤ X ≤ 5) = P(X ≤ 5) − P(X < 2) = P(X ≤ 5) − P(X ≤ 1) = 0.5783131724 − 0.0153912842 = 0.5629218882.
22.2. Poisson random variables. Let λ > 0. We say that X is a
Poisson random variable with parameter λ and write X ∼ Poi(λ)
if
\[ \mathbb{P}(X = k) = e^{-\lambda} \frac{\lambda^k}{k!} \]
for k = 0, 1, 2, . . .. We showed that EX = λ = Var(X).
22.2.1. Using your calculator to compute Poisson probabilities. Your
calculator also has the Poisson pmf and cdf built in. For example, let X ∼
Poi(5), so that X is a Poisson random variable with mean 5. To
compute P(X = 4), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option C,
(iv) enter: poissonpdf(5,4)
(v) which gives 0.1754673698
To compute P(X ≤ 4), we can do the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then (scroll down) choose option D,
(iv) enter: poissoncdf(5,4)
(v) which gives 0.4404932851
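The same numbers can be reproduced without a calculator. As before, this is only a sketch; the Python names poissonpdf and poissoncdf are ours, chosen to match the calculator commands and their argument order.

```python
import math

def poissonpdf(lam, k):
    """P(X = k) for X ~ Poi(lam); argument order mirrors the calculator call."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poissoncdf(lam, k):
    """P(X <= k) for X ~ Poi(lam), as a sum of pmf values."""
    return sum(poissonpdf(lam, j) for j in range(k + 1))

print(poissonpdf(5, 4), poissoncdf(5, 4))
```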
Exercise 22.2. Let X ∼ Poi(7).
(a) Find P(X = 7).
(b) Find P(X ≤ 7).
(c) Find P(X < 7).
(d) Find P(X > 8).
(e) Find P(2 ≤ X ≤ 5).
Solution. The calculator gives:
(a) P(X = 7) = 0.14900.
(b) P(X ≤ 7) = 0.59871.
(c) P(X < 7) = P(X ≤ 7) − P(X = 7) = 0.449711.
(d) P(X > 8) = 1 − P(X ≤ 8) = 1 − 0.7290912682 = 0.2709087318.
(e) P(2 ≤ X ≤ 5) = P(X ≤ 5) − P(X ≤ 1) = 0.3007082762 −
0.0072950557 = 0.2934132204.
22.3. Poisson process on a line. Let N (t) be the number of arrivals
of some random process up to time t ≥ 0, where N (0) = 0 and N (s) ≤
N (t) for all s ≤ t. Think of people arriving to an ice cream shop. We
say that N is a Poisson process (on the positive real line) of intensity
λ > 0 if the following conditions are satisfied:
(i) (Stationarity: The number of arrivals in an interval of time depends only on the length of time.) For all t > s we have that
$N(t) - N(s) \overset{d}{=} N(t - s)$; that is, P([N(t) − N(s)] = k) = P(N(t − s) = k) for all k = 0, 1, 2, . . . . Thus the distribution of N(t) − N(s)
only depends on t − s.
(ii) (Independent Increments.) For all $t_1 < t_2 < \cdots < t_n$, the random variables $[N(t_n) - N(t_{n-1})], [N(t_{n-1}) - N(t_{n-2})], \ldots, [N(t_2) - N(t_1)]$ are independent.
(iii) (Orderliness: two customers do not arrive at the same time.)
\[ \lim_{h \to 0} \frac{1}{h}\,\mathbb{P}(N(h) \ge 2) = 0. \]
(iv) (In a small interval of time, the probability that a customer arrives
is proportional to λ.)
\[ \lim_{h \to 0} \frac{1}{h}\,\mathbb{P}(N(h) = 1) = \lambda > 0. \]
When these conditions hold, we have that
\[ \mathbb{P}(N(t) = k) = e^{-\lambda t} \frac{(\lambda t)^k}{k!} \]
for all t > 0 and all k = 0, 1, . . . ; in other words for any fixed t, we
have that N(t) is a Poisson random variable with parameter λt. Thus
EN(t) = λt. Usually λ has units like arrivals per time.
To see why N(t) has anything to do with a Poisson random variable,
let t > 0, and partition the interval [0, t] into n intervals of size t/n,
where n is large. By condition (iii) and condition (i) we can basically
assume that in each interval there is at most one arrival. Let p =
λ(t/n). By conditions (ii) and (iv), we have that the probability that there
are k arrivals is given by
\[ \mathbb{P}(N(t) = k) = \binom{n}{k} p^k (1 - p)^{n-k} + g(n) = \frac{n!}{(n-k)!\,k!} \frac{(\lambda t)^k}{n^k} \Big( 1 - \frac{\lambda t}{n} \Big)^{n-k} + g(n), \]
where g(n) → 0 as n → ∞. Using the fact that
\[ \lim_{n \to \infty} \Big( 1 + \frac{1}{n} \Big)^n = e, \]
and
\[ \lim_{n \to \infty} \frac{n!}{(n-k)!\,n^k} = 1, \]
we obtain that
\[ \lim_{n \to \infty} \Big[ \binom{n}{k} p^k (1 - p)^{n-k} + g(n) \Big] = \lim_{n \to \infty} \Big[ \frac{n!}{(n-k)!\,k!} \frac{(\lambda t)^k}{n^k} \Big( 1 - \frac{\lambda t}{n} \Big)^{n-k} + g(n) \Big] = e^{-\lambda t} \frac{(\lambda t)^k}{k!}. \]
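The convergence of the binomial probabilities to the Poisson pmf can be watched numerically. The sketch below ignores the error term g(n) and simply evaluates the binomial expression with p = λt/n for increasing n; the values λ = 2, t = 1.5, k = 4 and the function names are arbitrary illustrative choices.

```python
from math import comb, exp, factorial

def binomial_approx(n, k, lam, t):
    """C(n, k) p^k (1 - p)^(n - k) with p = lam * t / n."""
    p = lam * t / n
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_limit(k, lam, t):
    """exp(-lam * t) * (lam * t)**k / k!"""
    return exp(-lam * t) * (lam * t) ** k / factorial(k)

lam, t, k = 2.0, 1.5, 4              # illustrative values
for n in (10, 100, 10_000):
    print(n, binomial_approx(n, k, lam, t), poisson_limit(k, lam, t))
```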
Exercise 22.3.
Dr. Zed stays at school from 9:00 AM to 5:00 PM. Dr. Zed’s office
hours are from 9:00 AM to 10:00 AM. The number of students visiting
Dr. Zed can be modelled by using two independent Poisson processes.
During office hours, the number of students visiting can be modelled
by a Poisson process of rate 5 per hour. Outside of office hours, the
number of students visiting can be modelled using a Poisson process of
rate 1/2 per hour.
(a) What is the probability that no students show up during office hours?
(R)
(b) What is the probability that exactly 2 students show up during a 45
minute period during office hours? (R)
(c) What is the probability that no students show up during the entire
day at school? Hint: Let A be the event that no student shows up
during the office hours and let B be the event that no students show
up outside of office hours. You want to compute P(A ∩ B).
(d) What is the probability that exactly one student shows up during
the entire day at school?
Solution. There is one hour of office hours and 7 hours of non-office
hours. Let $N_1$ and $N_2$ be the number of students that come during office
hours and non-office hours respectively. Then $N_1 \sim \mathrm{Poi}(5(1))$ and
$N_2 \sim \mathrm{Poi}((1/2)(7))$. Assume that $N_1$ and $N_2$ are independent.
(a) We need to compute $\mathbb{P}(N_1 = 0) = e^{-5} 5^0 / 0! \approx 0.006738$.
(b) Let $N_3 \sim \mathrm{Poi}((45/60)(5))$. We need to compute $\mathbb{P}(N_3 = 2) = e^{-3.75} (3.75)^2 / 2! \approx 0.165359$.
(c) We need to compute $\mathbb{P}(N_1 = 0, N_2 = 0) = \mathbb{P}(N_1 = 0)\,\mathbb{P}(N_2 = 0) = e^{-5} 5^0 / 0! \times e^{-7/2} (7/2)^0 / 0! = e^{-8.5} \approx 0.000203$.
(d) We need to compute $\mathbb{P}(N_1 = 1, N_2 = 0) + \mathbb{P}(N_1 = 0, N_2 = 1) = e^{-5} 5^1 / 1! \times e^{-7/2} (7/2)^0 / 0! + e^{-5} 5^0 / 0! \times e^{-7/2} (7/2)^1 / 1! \approx 0.001729$.
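The four probabilities can be recomputed in a few lines, which is a useful way to double-check the arithmetic. This is a sketch; poi is a helper name of ours for the Poisson pmf, and the variables a, b, c, d correspond to the four parts of the exercise.

```python
import math

def poi(lam, k):
    """Poisson pmf: exp(-lam) * lam**k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

office, other = 5 * 1, 0.5 * 7       # Poisson parameters: rate times number of hours

a = poi(office, 0)                                                   # no students in office hours
b = poi(5 * 45 / 60, 2)                                              # exactly 2 in a 45 minute period
c = poi(office, 0) * poi(other, 0)                                   # nobody all day
d = poi(office, 1) * poi(other, 0) + poi(office, 0) * poi(other, 1)  # exactly one all day
print(a, b, c, d)
```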
Exercise 22.4. Suppose domestic and international passengers arrive
at a security checkpoint independently of each other at rates of 10 passengers per hour and 3 passengers per hour, respectively.
(a) What is the probability that within a 15 minute interval, exactly 2
domestic and 3 international passengers arrive at the checkpoint?
(b) What is the probability that within a one-hour interval exactly 10 passengers
(regardless of type) arrive at the checkpoint? (Hint: Use Exercise
13.9)
(c) What is the probability that within a one-hour interval between five and
ten passengers, inclusive, arrive at the checkpoint?
Exercise 22.5. Let N (x) be a Poisson process of intensity λ > 0.
Show that
P(N (x) = 1 | N (1) = 1) = x
for all x ∈ [0, 1]. Thus if you know that there is one arrival in [0, 1],
then the time of that arrival is uniformly distributed in [0, 1].
22.4. Poisson processes in higher dimensions. Let N(A) be the
number of items (of some random process) in a set A of area (or volume)
L(A) ≥ 0, where N(A) = 0 if L(A) = 0 and N(A) ≤ N(B) if A ⊆ B.
(Think of the number of trees in some patch of the forest or stars in
some region of outer space.) We say that N is a (spatial) Poisson
process of intensity λ on a set D, if for all A ⊆ D
\[ \mathbb{P}(N(A) = k) = e^{-\lambda L(A)} \frac{(\lambda L(A))^k}{k!} \tag{1} \]
for all k = 0, 1, 2, . . . ; in other words N(A) is a Poisson random variable
with parameter λL(A). Thus EN(A) = λL(A). Usually λ has units
like items per area or items per volume.
Similar to the case of one dimension, the following conditions on N
motivate definition (1):
(i) (Stationarity.) For all A, B ⊆ D, if L(A) = L(B), then $N(A) \overset{d}{=} N(B)$; that is, P(N(A) = k) = P(N(B) = k) for all k = 0, 1, 2, . . . .
Thus the distribution of N(A) only depends on the area (or volume) of A.
(ii) (Independence.) If $A_1, \ldots, A_n \subseteq D$ are disjoint sets, then the random variables $N(A_1), \ldots, N(A_n)$ are independent.
(iii) If $D \supseteq A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$ is a sequence of sets such that $L(A_n) \to 0$ as n → ∞, then
\[ \lim_{n \to \infty} \frac{1}{L(A_n)}\,\mathbb{P}(N(A_n) \ge 2) = 0 \]
and
(iv)
\[ \lim_{n \to \infty} \frac{1}{L(A_n)}\,\mathbb{P}(N(A_n) = 1) = \lambda > 0. \]
Exercise 22.6. Suppose the number of trees in a forest can be modelled
as a Poisson process, and on average there are 25 trees per square mile.
(a) In a 2 square mile plot of land, what is the probability that there
are 40 trees?
(b) In a 2 square mile plot of land, what is the probability that there
are (strictly) less than 40 trees?
Solution. Let X be the number of trees in a 2 square mile plot of land.
We have that X ∼ Poi(2(25)).
(a) We compute
\[ \mathbb{P}(X = 40) = e^{-50}\,50^{40} / 40! \approx 0.021499. \]
(b) We are asked to find P(X < 40). Since X is a discrete integer-valued random variable, we have P(X < 40) = P(X ≤ 39). Our calculator easily gives that P(X ≤ 39) ≈ 0.0645703689.
23. Chapter 6.1, 6.6, 6.2, 6.3
23.1. Some named continuous random variables. We collect here
some of the continuous random variables that we have discussed so far.
23.1.1. Uniform random variables. Let a < b. We say that X is uniformly distributed in the interval [a, b] (and write X ∼ Unif[a, b] or
X ∼ U[a, b]) if it has the pdf given by
\[ f(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a, b], \\ 0 & \text{otherwise.} \end{cases} \]
Exercise 23.1. Find the mean and variance of a random variable that
is uniformly distributed in the interval [a, b].
Solution. Let X ∼ Unif[a, b], and let f be the pdf for X. Recall from
high-school that
\[ b^2 - a^2 = (b - a)(b + a), \]
and
\[ b^3 - a^3 = (b - a)(b^2 + ab + a^2). \]
We have that
\[ \mathbb{E}X = \frac{1}{b-a} \int_a^b x\,dx = \frac{x^2}{2(b-a)} \Big|_a^b = \frac{b + a}{2}, \]
and
\[ \mathbb{E}X^2 = \frac{1}{b-a} \int_a^b x^2\,dx = \frac{x^3}{3(b-a)} \Big|_a^b = \frac{b^2 + ab + a^2}{3}. \]
Hence by the short-cut formula we have that
\[ \mathrm{Var}(X) = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \frac{b^2 - 2ab + a^2}{12} = \frac{(b - a)^2}{12}. \]
Exercise 23.2. Let $U_1$ and $U_2$ be independent random variables that
are uniformly distributed in [0, 1]. Let $V = U_1 + U_2$. Let W be uniformly
distributed in [0, 2].
(a) Compute the variance of V.
(b) Compute the variance of W.
(c) Find the pdf for V. (Hint: use Exercise 15.3.)
Solution. Note that we know from Exercise 23.1 that $\mathrm{Var}(U_1) = \mathrm{Var}(U_2) = 1/12$.
(a) Since $U_1$ and $U_2$ are independent, we have that $\mathrm{Var}(V) = \mathrm{Var}(U_1) + \mathrm{Var}(U_2) = 1/6$.
(b) Again, from Exercise 23.1, we have that Var(W) = 1/3.
(c) Let f be the pdf for $U_1$; thus it is also the pdf for $U_2$. Let x ∈ [0, 1].
Let z ∈ R. Note that f(z − x) = 1 if z − x ∈ [0, 1] and 0 otherwise.
By Exercise 15.3, we have that
\[ f_V(z) = \int_0^1 f(z - x)\,dx. \]
Thus if z ∈ [0, 1], then
\[ f_V(z) = \int_0^z 1\,dx = z, \]
and if z ∈ [1, 2], then
\[ f_V(z) = \int_{z-1}^{1} dx = 2 - z. \]
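The triangular shape of the density of V = U1 + U2 can be confirmed by evaluating the convolution integral numerically. This sketch uses a midpoint rule with 10,000 subintervals, an arbitrary choice of ours, and the name f_V is used for the density of V.

```python
def f(x):
    """pdf of Unif[0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_V(z, steps=10_000):
    """Midpoint-rule approximation of the integral of f(z - x) over x in [0, 1]."""
    h = 1.0 / steps
    return sum(f(z - (i + 0.5) * h) for i in range(steps)) * h

for z in (0.25, 0.5, 1.0, 1.5, 1.75):
    expected = z if z <= 1 else 2 - z     # the piecewise formula derived above
    print(z, f_V(z), expected)
```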
23.1.2. Exponential random variables. Let λ > 0. Recall that X is an
exponential random variable with parameter 1/λ if it has pdf given by
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \in [0, \infty), \\ 0 & \text{otherwise.} \end{cases} \]
We know that an exponential random variable with parameter 1/λ
has mean 1/λ and variance $1/\lambda^2$. We also know that if X is an exponential random variable with parameter 1/λ, then $\mathbb{P}(X > x) = e^{-\lambda x}$
for all x ≥ 0.
Exponential random variables are connected to Poisson processes in
the following way. Suppose N is a Poisson process on the positive real line
with intensity λ. Consider T the time of the first arrival; that is, the
smallest time t such that N(t) = 1. We have that
\[ \mathbb{P}(T > t) = \mathbb{P}(N(t) = 0) = e^{-\lambda t}. \]
Thus T is an exponential random variable with parameter 1/λ. Another important property of the exponential random variable is the
memoryless property, which states that for t, s ≥ 0, we have
\begin{align*}
\mathbb{P}(X > t + s \mid X > s) &= \frac{\mathbb{P}\big(\{X > t + s\} \cap \{X > s\}\big)}{\mathbb{P}(X > s)} \\
&= \frac{\mathbb{P}(X > t + s)}{\mathbb{P}(X > s)} \\
&= \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} \\
&= e^{-\lambda t} \\
&= \mathbb{P}(X > t).
\end{align*}
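The memoryless property is simple to check numerically from the tail formula. In this sketch the function names and the values λ = 0.5, t = 3, s = 2 are ours and purely illustrative.

```python
import math

def tail(x, lam):
    """P(X > x) = exp(-lam * x) for x >= 0."""
    return math.exp(-lam * x)

def conditional_tail(t, s, lam):
    """P(X > t + s | X > s), computed as a ratio of tail probabilities."""
    return tail(t + s, lam) / tail(s, lam)

lam, t, s = 0.5, 3.0, 2.0            # illustrative values
print(conditional_tail(t, s, lam), tail(t, lam))
```

The two printed numbers agree no matter how large s is, which is what "memoryless" means.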
Exercise 23.3. Suppose that the lifetime, in years, of a lightbulb is
given by an exponential random variable with mean 5 years.
(a) What is the probability that the light bulb lasts longer than 5 years?
(b) Suppose I used the light bulb for 5 years, what is the probability that
it will last another 5 years?
Exercise 23.4. Suppose type-A light bulbs have a lifetime, in years,
that can be modelled by exponential random variables with mean 5, and
that type-B light bulbs can similarly be modelled by exponential random
variables with mean 7.
(a) Suppose that a room is lit by one type-A and one type-B light bulb,
which operate independently of each other. What is the probability
that the room will stay lit for at least 10 years? (Hint: you may
need to use the inclusion-exclusion formula. Exercise 13.6 may also
be helpful.) (H)
(b) What is the expected time the room will stay lit?
Exercise 23.5. Let X and Y be independent exponential random variables with mean 1. Find the pdf for W = X + Y .
23.2. Normal random variables. Let µ ∈ R and σ > 0. We say
that X is a normal random variable with mean µ and variance σ² and
write X ∼ N(µ, σ²) if it has pdf given by
\[ n(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x - \mu)^2}. \]
We know that if Z ∼ N(0, 1), and if X = σZ + µ, then X ∼ N(µ, σ²).
We also know that EZ = 0 and Var(Z) = 1, from which we can
deduce that EX = µ and Var(X) = σ². Similarly, if X ∼ N(µ, σ²) and
Z = (X − µ)/σ, then Z ∼ N(0, 1).
23.2.1. Using the tables. Let Z ∼ N(0, 1) be a standard normal random variable. Let f : R → [0, ∞) be the pdf for Z; thus
\[ f(x) = \frac{1}{\sqrt{2\pi}} \exp\Big(\frac{-x^2}{2}\Big). \]
By definition, for each z ∈ R, we have that
\[ \mathbb{P}(Z \le z) = \Phi(z) = \int_{-\infty}^{z} f(x)\,dx. \]
However, f does not have an elementary anti-derivative. Tables of
values of Φ(z) are contained in Table A.3 of your book. The table
contains the values of Φ(z) for z ∈ [−3.49, 3.49], where the value of z
is given to two decimal places, and the value of Φ(z) is given to four
decimal places. From the table we can read off that Φ(−1.1) ≈ 0.1357,
Φ(−1.17) ≈ 0.1210, Φ(0) = 0.5, Φ(0.01) ≈ 0.5040, Φ(0.1) ≈ 0.5398,
and Φ(3.49) ≈ 0.9998. If we want to compute the value of Φ(0.435) we
can take a linear interpolation of the values Φ(0.43) and Φ(0.44)
via:
Φ(0.435) ≈ [Φ(0.43) + Φ(0.44)]/2.
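For readers who want to double-check table look-ups on a computer, here is a minimal Python sketch using the standard library class `statistics.NormalDist`. The helper name `phi` is an assumption made for illustration; the rounding to four places mimics the table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def phi(z):
    """Standard normal cdf, playing the role of a table look-up."""
    return Z.cdf(z)

# Table-style values, rounded to four decimal places
table_043 = round(phi(0.43), 4)
table_044 = round(phi(0.44), 4)

interpolated = (table_043 + table_044) / 2  # midpoint interpolation for z = 0.435
direct = phi(0.435)                         # value computed without the table

print(table_043, table_044, interpolated, direct)
```

The interpolated and direct values agree to about four decimal places, which is all the table can promise.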
Exercise 23.6. Let Z be a standard normal variable.
(a) Find P(Z ≤ 2.343).
(b) Find P(Z < 2.343).
(c) Find P(1 ≤ Z ≤ 2).
(d) Find P(Z > 0.3).
Solution.
(a) P(Z ≤ 2.343) ≈ Φ(2.34) ≈ 0.9904
(b) Since Z is a continuous random variable, we have that P(Z ≤
2.343) = P(Z < 2.343).
(c) Observe that
P(1 ≤ Z ≤ 2) = P(Z ≤ 2) − P(Z < 1)
= P(Z ≤ 2) − P(Z ≤ 1)
= Φ(2) − Φ(1) ≈ 0.9772 − 0.8413 = 0.1359.
(d) P(Z > 0.3) = 1 − P(Z ≤ 0.3) = 1 − Φ(0.3) ≈ 1 − 0.6179 = 0.3821.
Some tables actually only give the values of Φ(z) for z ≥ 0. The
symmetries of the normal distribution can be used to deduce the other
values. Let Z ∼ N(0, 1). For z ≥ 0, we have that
P(Z ≥ −z) = P(Z ≤ z),
and
P(Z ≤ −z) = Φ(−z) = 1 − Φ(z) = 1 − P(Z ≤ z).
Similarly, we have that
P(−z ≤ Z ≤ z) = P(Z ≤ z) − P(Z ≤ −z) = 2P(Z ≤ z) − 1 = 2Φ(z) − 1.
Exercise 23.7. Let Z ∼ N (0, 1).
(a) Find z ∈ R so that P(Z ≤ z) = 0.567
(b) Find z ≥ 0, so that P(−z ≤ Z ≤ z) = 0.5.
Solution.
(a) We see that Φ(0.16) ≈ 0.5636 and Φ(0.17) ≈ 0.5675. So by taking
z = 0.17, we have that P(Z ≤ z) ≈ 0.567.
(b) We know that P(−z ≤ Z ≤ z) = 2Φ(z) − 1. Thus we want to solve
for z in 2Φ(z) − 1 = 0.5, which leads to the equation Φ(z) = 0.75.
We find on the tables that Φ(0.67) = 0.7486 and Φ(0.68) = 0.7517.
So we have that P(−0.675 ≤ Z ≤ 0.675) ≈ 0.5.
24. Chapter 6.3, 6.4
24.1. Critical values and z-alpha notation. Let Z ∼ N (0, 1). Let
α ∈ (0, 1). The number zα is the number such that
P(Z ≥ zα ) = α.
The numbers zα are sometimes called z-critical values.
Exercise 24.1.
(a) Find zα for α = 0.1, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005
(b) Show that P(−zα/2 ≤ Z ≤ zα/2 ) = 1 − α.
Solution.
(a) Note that α = P(Z ≥ z_α) = 1 − Φ(z_α). Thus we want to find z_α,
so that Φ(z_α) = 1 − α. For α = 0.1, we find that Φ(1.28) ≈ 0.8997
and Φ(1.29) ≈ 0.9015. So we take z_{0.1} ≈ 1.28. For α = 0.05,
we find that Φ(1.64) ≈ 0.9495 and Φ(1.65) ≈ 0.9505. So we take
z_{0.05} ≈ 1.645. We similarly find the other values: z_{0.025} ≈ 1.96,
z_{0.01} ≈ 2.33, z_{0.005} ≈ 2.58, z_{0.001} ≈ 3.08, and z_{0.0005} ≈ 3.27.
(b) We have that
P(−zα/2 ≤ Z ≤ zα/2 ) = 2P(Z ≤ zα/2 ) − 1 = 2(1 − α/2) − 1 = 1 − α.
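The critical values can also be computed directly rather than read from the table. The sketch below uses the standard library method `NormalDist.inv_cdf`; the function name `z_critical` is a hypothetical helper introduced here, and the printed values may differ from table-based answers in the last digit.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return z_alpha with P(Z >= z_alpha) = alpha for Z ~ N(0, 1)."""
    return NormalDist().inv_cdf(1 - alpha)

for alpha in (0.1, 0.05, 0.025, 0.01, 0.005):
    print(alpha, round(z_critical(alpha), 4))

# Part (b): P(-z_{alpha/2} <= Z <= z_{alpha/2}) should equal 1 - alpha
alpha = 0.05
z = z_critical(alpha / 2)
coverage = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(coverage)
```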
24.2. Non-standard normals. Let µ ∈ R and σ > 0. Suppose X ∼
N(µ, σ²). If we set
\[ Z = \frac{X - \mu}{\sigma}, \]
then Z ∼ N(0, 1). This relation allows us to compute probabilities
regarding X using the standard normal tables.
Exercise 24.2. Let X ∼ N(0.5, 2²).
(a) Find P(X ≤ 2)
(b) Find P(X ≥ −1)
(c) Find P(−1 ≤ X ≤ 2).
Solution.
(a) Note that Z = (X − 0.5)/2 is a standard normal random variable.
We have that
P(X ≤ 2) = P(X − 0.5 ≤ 2 − 0.5)
= P((X − 0.5)/2 ≤ (2 − 0.5)/2)
= P(Z ≤ 0.75).
From the tables we have that P(Z ≤ 0.75) ≈ 0.7734.
(b) Similarly, we find that
P(X ≥ −1) = P(Z ≥ (−1 − .5)/2)
= P(Z ≥ −0.75).
We know that 1 − P(Z ≤ −0.75) = P(Z ≥ −0.75). From the tables,
we have that P(Z ≥ −0.75) = P(Z ≤ 0.75) ≈ 0.7734.
(c) We have that
P(−1 ≤ X ≤ 2) = P(−0.75 ≤ Z ≤ 0.75) = 2P(Z ≤ 0.75) − 1 ≈ 0.5468
24.2.1. Using your calculator. Your calculator has an even more accurate and powerful normal table built in. For example, if X ∼
N(0.5, 2²), your calculator can easily give the value of P(−1 ≤ X ≤ 2).
All that needs to be done is the following:
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then press 2 for normalcdf,
(iv) enter: normalcdf(-1,2,0.5,2) (make sure you use the negative sign
and not the minus)
(v) which gives 0.5467454411
Note that if X ∼ N (µ, σ 2 ), then
normalcdf (a, b, µ, σ) ≈ P(a ≤ X ≤ b).
In order to (approximately) obtain P(X ≤ b), we can set a to be
something like −9999.
The calculator also has the ability to do inverse look-up. Suppose
X ∼ N(1, 2²), and we want to know the value a for which P(X ≤ a) =
0.65.
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then press 3 for invNorm,
(iv) enter: invNorm(0.65,1,2)
(v) which gives 1.770640945
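If no calculator is at hand, the same two look-ups can be sketched in Python with the standard library. The function names `normalcdf` and `inv_norm` below are hypothetical helpers written to mirror the calculator calls; they are not calculator or library functions.

```python
from statistics import NormalDist

def normalcdf(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), mirroring the calculator call."""
    X = NormalDist(mu, sigma)
    return X.cdf(b) - X.cdf(a)

def inv_norm(p, mu=0.0, sigma=1.0):
    """The value a with P(X <= a) = p for X ~ N(mu, sigma^2)."""
    return NormalDist(mu, sigma).inv_cdf(p)

print(normalcdf(-1, 2, 0.5, 2))   # compare with the calculator value 0.5467454411
print(inv_norm(0.65, 1, 2))       # compare with the calculator value 1.770640945
```

Note that `NormalDist` takes the standard deviation σ, not the variance σ², just as the calculator does.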
24.3. The class policy on calculators and the normal tables.
You are free to use the calculator. However, as a general policy,
in a question that contains normal random variables, part of showing
your work means that you should be able to obtain your answer from
using the standard normal tables provided in your book. You may use
the calculator instead of the tables, and I encourage you to use the full
power of your calculator to double-check your answers. However, as far
as obtaining your final answer, you should pretend that the power of
your calculator is limited to that of looking up standard normal values
that are available in the tables; so that in presenting your final answers,
you are only looking up the values Φ(z) for z ∈ R. For example, you
are free to use
normalcdf (−9999, z, 0, 1) ≈ Φ(z).
Similarly, you are free to use the invNorm function. Please make it
clear in your work at which point you use the tables or your calculator.
Exercise 24.3. Let X ∼ N (5, σ 2 ). If P(X ≤ 6 | X ≥ 5) = 0.383, then
what is σ > 0?
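Solution. Observe that
\[ 0.383 = \mathbb{P}(X \le 6 \mid X \ge 5) = \frac{\mathbb{P}(5 \le X \le 6)}{\mathbb{P}(X \ge 5)} = \frac{\mathbb{P}(X \le 6) - \mathbb{P}(X \le 5)}{\mathbb{P}(X \ge 5)}. \]
Clearly, P(X ≥ 5) = 1/2 = P(X ≤ 5) (why?). Also, Z = (X − 5)/σ ∼
N(0, 1). Thus
\[ \mathbb{P}(X \le 6) = \mathbb{P}\Big(Z \le \frac{1}{\sigma}\Big). \]
Solving the first display for P(X ≤ 6) gives P(X ≤ 6) = 1/2 + 0.383/2 = 0.6915. Hence
\[ 0.6915 = \mathbb{P}\Big(Z \le \frac{1}{\sigma}\Big). \]
Since Φ(0.5) ≈ 0.6915, we conclude from the tables that 1/σ = 0.5; that is, σ = 2.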
Exercise 24.4. Find the value of the following integral, without using
your calculator. You will need to use the normal tables. (H)
\[ \int_{-0.25}^{0.5} e^{-x^2}\,dx. \]
Exercise 24.5. Suppose the height of a randomly selected woman (aged
18-64) is normally distributed with mean µ = 163cm and variance
σ² = 10² cm².
(a) Given that a randomly selected woman is taller than average, what
is the probability that she is taller than 170cm?
(b) Given a random sample of 10 women (that is, 10 women whose
heights are independent of each other, and are given by the above
distribution), what is the probability that exactly 3 of them will be
taller than 165cm?
(c) Same set-up as part (b): what is the probability that no more than
3 women will be taller than 165cm?
Solution. Let X ∼ N(163, 10²). Let Z = (X − 163)/10, so that
Z ∼ N(0, 1).
(a) We are asked to compute
\[ \mathbb{P}(X > 170 \mid X > 163) = \frac{\mathbb{P}(X > 170)}{\mathbb{P}(X > 163)} = \frac{\mathbb{P}(Z > 0.7)}{1/2} \approx_{\text{table}} 0.484. \]
(b) Set
\[ p = \mathbb{P}(X > 165) = \mathbb{P}\Big(Z > \frac{2}{10}\Big) = 1 - \mathbb{P}(Z \le 0.2) \approx_{\text{table}} 1 - 0.5793 = 0.4207. \]
Let S ∼ Bin(10, p). We are asked to compute
\[ \mathbb{P}(S = 3) = \binom{10}{3} p^3 (1 - p)^7 \approx 0.1956. \]
(c) We are asked to compute P(S ≤ 3) ≈ 0.3318.
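The binomial computations in parts (b) and (c) can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the solution; the helper names `binom_pmf` and `binom_cdf` are introduced here for illustration.

```python
import math
from statistics import NormalDist

# Table value used in the solution: P(Z <= 0.2) is about 0.5793
p_table = 1 - 0.5793
# The same probability computed without rounding to the table
p_direct = 1 - NormalDist(163, 10).cdf(165)

def binom_pmf(k, n, p):
    """P(S = k) for S ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(S <= k) for S ~ Bin(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

print(round(binom_pmf(3, 10, p_table), 4))  # part (b)
print(round(binom_cdf(3, 10, p_table), 4))  # part (c)
```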
25. Sums of independent normal random variables
25.1. Sums (and differences) of independent normal random
variables. Let µ1, µ2 ∈ R and σ1, σ2 > 0. Let X1 ∼ N(µ1, σ1²) and
X2 ∼ N(µ2, σ2²) be independent random variables. Let X = X1 + X2.
It turns out that X is also a normal random variable, but what should
its mean and variance be? Notice that E(X1 + X2) = µ1 + µ2 (we do
not even need independence for this) and Var(X1 + X2) = Var(X1) +
Var(X2) = σ1² + σ2² (we do need independence for this). Thus X ∼
N(µ1 + µ2, σ1² + σ2²).
Example done in class 25.1 (Linear scaling). Let a, b ∈ R. Let
X ∼ N(µ, σ²). Show that if Y = aX + b, then Y ∼ N(aµ + b, a²σ²).
(You may assume that you know that Y is a normal random variable
with some mean and variance, that you are trying to figure out.)
Example done in class 25.2. Use Exercise 25.1 to show that if X1 ∼
N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) are independent random variables, then
X = X1 − X2 ∼ N(µ1 − µ2, σ1² + σ2²).
Exercise 25.3. Let X and Y be independent standard normal random
variables. Let W = X + Y. Prove using Exercise 15.3 that W ∼ N(0, 2).
(H)
Exercise 25.4. Let µ1, µ2 ∈ R and σ1, σ2 > 0. Let f(x) = n(x; µ1, σ1) and
g(x) = n(x; µ2, σ2). Recall that we defined convolutions for pdfs as well
as pmfs. Show that
\[ (f \star g)(x) = n\Big(x;\ \mu_1 + \mu_2,\ \sqrt{\sigma_1^2 + \sigma_2^2}\Big) \]
(the algebra could be quite hard), and use Exercise 15.3 to prove the claim of Section
25.1.
Example done in class 25.5. Manufacture of a certain component
requires two different machining operations. Machining time for each
operation has a normal distribution, and the two times are independent
of one another. The mean values are 30 minutes and 20 minutes,
respectively, and the standard deviations are 3 minutes and 2 minutes,
respectively. What is the probability that the total machining time for
one of these components will be less than 45 minutes?
Solution. Let X1 ∼ N(30, 3²) and X2 ∼ N(20, 2²) and assume that
X1 and X2 are independent. Thus X = X1 + X2 has the same distribution as the total manufacturing time of one of these components.
We are asked to compute P(X ≤ 45). Since X is a sum of independent normals, we have that X is also a normal random variable. It
must have mean EX1 + EX2 and variance Var(X1) + Var(X2). Thus
X ∼ N(30 + 20, 3² + 2²). Let Z ∼ N(0, 1), then
\[ \mathbb{P}(X \le 45) = \mathbb{P}(\sqrt{13}\,Z + 50 \le 45) = \mathbb{P}(Z \le -1.39). \]
The tables give that P(Z ≤ −1.39) = 0.0823.
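As a cross-check of this solution, Python's standard library class `statistics.NormalDist` supports adding two normal distributions under the assumption that they are independent, so the means and variances add exactly as described above. The variable names below are illustrative.

```python
from statistics import NormalDist

X1 = NormalDist(mu=30, sigma=3)   # first machining time
X2 = NormalDist(mu=20, sigma=2)   # second machining time

# Adding two NormalDist objects treats them as independent:
# the means add and the variances add.
X = X1 + X2

print(X.mean, X.variance)   # mean 50 and variance 13, up to floating-point rounding
print(X.cdf(45))            # compare with the table answer 0.0823
```

The small gap between the computed probability and 0.0823 comes from rounding −5/√13 to −1.39 before using the table.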
Exercise 25.6. In Canada, according to Wikipedia, the average height
of males (aged 25-44) is 1.76 meters and the average height of females
(aged 25-44) is 1.633 meters. Let us assume that the heights of males
and females are normally distributed. Let Y be the height of a male.
Then Y ∼ N(1.76, σ_m²), for some σ_m > 0. Suppose that P(1.4 ≤
Y ≤ 2.12) = 0.99. Similarly, let X be the height of a female so that
X ∼ N(1.633, σ_f²). Suppose that P(1.22 ≤ X ≤ 2.046) = 0.99.
(a) Find σ_m.
(b) Find σ_f.
(c) If I randomly choose a man and woman, what is the probability
that the woman will be taller? Clearly state what assumptions are
needed for your calculations.
26. Prelude to the central limit theorem
26.1. Baby central limit theorem. In particular, if X_i ∼ N(µ, σ²)
are independent, then we know that
\[ S_n = X_1 + \cdots + X_n \sim N(n\mu, n\sigma^2). \]
Thus,
\[ \bar{X} = \frac{S_n}{n} \sim N(\mu, \sigma^2/n) \]
and
\[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{S_n - n\mu}{\sigma\sqrt{n}} \sim N(0, 1). \]
A simple random sample from a distribution is given by independent
random variables X_1, . . . , X_n, that all have the same distribution; that is,
P(X_i ≤ x) = F(x) = P(X_1 ≤ x).
This is the basic and common sampling assumption in statistics. We
discuss sampling further later, when we state a version of the central
limit theorem that works for any distribution.
Exercise 26.1. Assume that (from experience) if we randomly select a
fish from a pond and measure its length, this random (length) variable
is normally distributed with mean 13cm and variance 7cm².
(a) If we sample 100 fish, what is the probability that their sample mean
X̄ is less than 12cm?
(b) If we sample 49 fish, what is the probability that their sample mean
X̄ is less than 12cm?
Solution. We are asked to compute P(X̄_n < 12) for n = 49, 100. Well,
the game is to make this a probability about N(0, 1) random variables, then
use the tables. Notice in general:
\[ \mathbb{P}(\bar{X} < a) = \mathbb{P}\Big(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{a - \mu}{\sigma/\sqrt{n}}\Big). \]
Thus if
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}, \]
then Z ∼ N(0, 1), and
(a)
\[ \mathbb{P}(\bar{X}_{100} < 12) = \mathbb{P}\Big(Z < \frac{12 - 13}{\sqrt{7}/10}\Big) = \mathbb{P}(Z < -3.78) \approx 0, \]
and
(b)
\[ \mathbb{P}(\bar{X}_{49} < 12) = \mathbb{P}\Big(Z < \frac{12 - 13}{\sqrt{7}/7}\Big) = \mathbb{P}(Z < -2.65) \approx 0.0040. \]
Exercise 26.2. According to Wikipedia, the SAT (a standardized test
in the United States) has a traditional score range of 200 − 800 and is based on
a normal distribution with mean 500 and standard deviation 100. Let
X ∼ N(500, 100²).
(a) Find P(200 ≤ X ≤ 800).
(b) What is the probability that the average score of 30 randomly selected people (who wrote the test) will be greater than 500?
(c) What is the probability that the average score of 30 randomly selected people (who wrote the test) will be greater than 530?
27. Chapter 8.1-8.3 (KU custom edition 7.1-7.3)
27.1. Independent random variables and simple random sampling. We will model the random outcome of an experiment using random variables. For example, before we flip a coin and see the result,
we think of it as a Bernoulli random variable with some parameter
p ∈ (0, 1). Thus when we sample from a population, before we actually
observe the outcomes of the sample, we think of the random sample
as a sequence of random variables X1, . . . , Xn. For example, before we
measure the heights of say 10 KU students, we think of their heights as
random variables X1, . . . , X10. We say that two random variables have
the same distribution if they have the same cdf (or same pmf or pdf).
Often we assume that random variables all have the same distribution and are independent. Often in practice these may not be realistic
assumptions!
One of the main goals in this course is to develop methods of statistical inference; that is, to infer something about the population from the
sample. We will see that treating random samples as random variables
will allow us to develop quantitative methods to analyze the effectiveness
of a statistical inference.
By a (simple) random sample of size n from a population or distribution we mean a sequence of independent random variables X1, . . . , Xn
which all have the same distribution. In the language of sampling, the
expectation and variance of X1 are sometimes referred to as the population mean and population variance, or the true mean and true variance.
27.2. Statistics. Suppose we are sampling from a normal distribution,
in order to determine the population mean. As we discussed at the
beginning of the course, one way would be to estimate it, using the sample
mean:
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \]
Note that we call both the random variable X̄ (before we actually
observed the values of X_i) and its value x̄, after we observe or compute
it, the sample mean.
We call any function of a random sample (for example, the sample
mean) a statistic. Thus a statistic is also a random variable.
Another statistic that will play an important role in our discussion
is the sample variance:
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2. \]
Again, when we use the notation capital ‘S’ we are referring to the
random variable and when we use the notation little ‘s’ we are referring
to the observed or measured values of S.
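To make the two statistics concrete, here is a small Python sketch that computes an observed sample mean and sample variance and compares them with the standard library functions `statistics.mean` and `statistics.variance` (which also divides by n − 1). The data values are made up purely for illustration.

```python
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0]   # made-up observations x_1, ..., x_n
n = len(data)

x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)  # divide by n - 1, not n

# The standard library uses the same n - 1 convention for the sample variance
lib_mean = statistics.mean(data)
lib_var = statistics.variance(data)

print(x_bar, s2, lib_mean, lib_var)
```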
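Exercise 27.1. Let X1, . . . , Xn be a simple random sample. Let EX1 =
µ and Var(X1) = σ². Show that EX̄ = µ. What is Var(X̄)? (R)
Solution.
\begin{align*}
E\bar{X} &= E\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) \\
&= \frac{1}{n}\sum_{i=1}^{n} EX_i \\
&= \frac{n}{n}\mu = \mu.
\end{align*}
Since the X_i are independent in a simple random sample,
\begin{align*}
\mathrm{Var}(\bar{X}) &= \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) \\
&= \frac{1}{n^2}\mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) \\
&= \frac{n}{n^2}\sigma^2 = \frac{\sigma^2}{n}.
\end{align*}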
Exercise 27.2. Let X1 and X2 be independent Bernoulli random variables with parameter p ∈ (0, 1). Let X = max(X1, X2). Compute EX,
in the case that p = 1/2.
Solution. Note that X only takes values in {0, 1}. We have that
P(X = 0) = P(X1 = 0, X2 = 0) = P(X1 = 0)P(X2 = 0), since X1
and X2 are independent. Thus P(X = 0) = (1 − p)² and P(X = 1) =
1 − (1 − p)² = 2p − p². Hence
EX = 0 · (1 − p)² + 1 · (2p − p²) = p(2 − p).
In the case p = 1/2, this gives EX = 3/4.
Exercise 27.3 (The reason for the n − 1). Let X1, . . . , Xn be a simple
random sample. Suppose that Var(X1) = σ². Show that E(S²) = σ².
Solution. Hint: Also let EX1 = µ. Note that
\[ (X_i - \bar{X})^2 = X_i^2 - 2X_i\bar{X} + \bar{X}^2 = X_i^2 - \frac{2}{n}\sum_{j=1,\, j \ne i}^{n} X_i X_j - \frac{2}{n}X_i^2 + \bar{X}^2. \]
Note that the X_i's are independent, thus EX_iX_j = µ², when i ≠ j. Also
recall that the short-cut formula gives that EX_i² = Var(X_i) + (EX_i)² =
σ² + µ².
28. Chapter 8.4 (KU custom edition 7.4)
28.1. The central limit theorem. Let X1, X2, X3, . . . be independent random variables with the same distribution. Let EX1 = µ and
Var(X1) = σ² > 0. Let Z ∼ N(0, 1) and Φ(x) = P(Z ≤ x). Let
\[ S_n = \sum_{i=1}^{n} X_i \]
and
\[ Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}. \]
(Note that we know that in the special case where the X_i are normal
random variables we have that Z_n ∼ N(0, 1).)
The CLT states that for all x ∈ R we have that as n → ∞,
\[ F_n(x) = \mathbb{P}(Z_n \le x) \to \Phi(x). \]
A standard rule of thumb is that F_n(x) ≈ Φ(x) if n > 30. This amazing theorem tells us that we can estimate the probabilities associated
with Z_n by pretending Z_n is Z, without knowing anything about the
underlying distribution of the X_i. It is this theorem that makes much
of statistics possible, since in practice we may not know the underlying
distribution of a population. In the context of statistics courses, it is
often useful to observe that
\[ Z_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}. \]
In the context of random sampling note that X1, . . . , Xn is a simple
random sample from a distribution with population mean µ and variance σ².
28.2. Using the CLT.
Exercise 28.1. Let X1, . . . , X100 denote the actual net weights of 100
randomly selected bags of concrete mix, where each bag is marked 50lbs,
but in fact the weight is random. We assume that the X_i are independent and have the same distribution.
(a) If the expected weight of each bag is 50lbs and the variance is 1lbs²,
calculate
P(49.9 ≤ X̄ ≤ 50.1)
(approximately) using the CLT, where X̄ is the usual sample mean
given by
\[ \bar{X} = \frac{1}{100}\sum_{i=1}^{100} X_i. \]
(b) If the expected weight of each bag is 49.8 rather than 50, calculate
P(49.9 ≤ X̄ ≤ 50.1).
28.3. Chapter 6.5, Normal approximations to the binomial.
The central limit theorem was first proved for Bernoulli random variables, and
can be used to approximate a Binomial distribution. Such approximations were especially important before modern computers were readily
available.
Let (X_i)_{i=1}^∞ be independent Bernoulli random variables with parameter p ∈ (0, 1). Recall that EX1 = p and Var(X1) = p(1 − p). Let
S_n = X1 + · · · + Xn. Recall that S_n ∼ Bin(n, p). Let
\[ Z_n = \frac{S_n - np}{\sqrt{np(1 - p)}}. \]
The central limit theorem implies that Z_n is approximately normal when n is
large.
Exercise 28.2. Let X ∼ Bin(100, 1/2). Use your calculator to compute P(X ≤ 45), and use the central limit theorem to approximate P(X ≤ 45).
Solution. Let Z′ = (X − 50)/5, and let Z ∼ N(0, 1). We have that
\begin{align*}
\mathbb{P}(X \le 45) &= \mathbb{P}(X - 50 \le -5) \\
&= \mathbb{P}(Z' \le -1) \\
&\approx_{\text{CLT}} \mathbb{P}(Z \le -1) \\
&=_{\text{table}} 0.1587.
\end{align*}
The calculator gives 0.1841.
If the previous exercise seems unsatisfying, we can apply the so-called
continuity correction to improve the approximation. The following adjusts for the fact that we are trying to approximate something discrete
with something continuous.
Let X ∼ Bin(n, p), then
\[ \mathbb{P}(X \le x) \approx \mathbb{P}\Big(Z \le \frac{x + 0.5 - np}{\sqrt{np(1 - p)}}\Big). \]
Notice the extra 0.5 term. Some textbooks state that this approximation is adequate when both np ≥ 10 and n(1 − p) ≥ 10.
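The effect of the continuity correction can be seen by comparing the exact binomial probability with the two normal approximations. The following Python sketch does this; the function names `binom_cdf` and `normal_approx` and the parameter values n = 50, p = 0.4, x = 18 are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def binom_cdf(x, n, p):
    """Exact P(X <= x) for X ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def normal_approx(x, n, p, correction=True):
    """CLT approximation to P(X <= x), optionally with the 0.5 correction."""
    shift = 0.5 if correction else 0.0
    z = (x + shift - n * p) / math.sqrt(n * p * (1 - p))
    return NormalDist().cdf(z)

n, p, x = 50, 0.4, 18   # illustrative values with np = 20 and n(1 - p) = 30
print(binom_cdf(x, n, p))                         # exact value
print(normal_approx(x, n, p, correction=False))   # plain CLT approximation
print(normal_approx(x, n, p, correction=True))    # with continuity correction
```

In examples like this one, the corrected approximation typically lands noticeably closer to the exact value than the uncorrected one.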
Exercise 28.3. Do the previous exercise using the continuity correction. (R)
28.4. Normal approximation to the Poisson. We know from Exercise 13.9 that the sum of two independent Poisson random variables
is again a Poisson random variable. In particular, if X_i are independent
Poisson random variables with mean 1, then S_n = X1 + · · · + Xn is a
Poisson random variable with mean n and variance n.
Exercise 28.4. Let Y ∼ Poi(50). Apply the central limit theorem to
approximate P(Y ≤ 47).
Solution. Let X_i be independent Poisson random variables with mean
1, then S = X1 + · · · + X50 ∼ Poi(50). Let
\[ Z' = \frac{S - 50}{\sqrt{50}}. \]
The central limit theorem gives that Z′ is approximately a standard
normal random variable. Let Z ∼ N(0, 1). We have that
\begin{align*}
\mathbb{P}(S \le 47) &= \mathbb{P}(S - 50 \le -3) \\
&= \mathbb{P}(Z' \le -0.424) \\
&\approx_{\text{CLT}} \mathbb{P}(Z \le -0.424) \\
&\approx_{\text{table}} 0.3372.
\end{align*}
Exercise 28.5 (An important observation for the case of two distributions). Later, we will state a version of the CLT for the following set-up.
For now, consider the following exercise. Let X1, . . . , Xm, Y1, . . . , Yn be
independent random variables, where all the X_i's have the same mean
µ_X and variance σ_X² and all the Y_j's have the same mean µ_Y and variance σ_Y².
(For CLT we will need to also assume that all the X_i's come
from a common distribution, and similarly, all the Y_j's come from a
common distribution.) Define X̄ and Ȳ in the usual way:
\[ \bar{X} = \frac{1}{m}\sum_{i=1}^{m} X_i, \]
and
\[ \bar{Y} = \frac{1}{n}\sum_{j=1}^{n} Y_j. \]
Consider the random variables
\[ W = \bar{X} - \bar{Y} \]
and
\[ Z = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{m} + \frac{\sigma_Y^2}{n}}}. \]
(a) What is the mean of X̄? What is the mean of Ȳ? (R)
(b) What is the variance of X̄? What is the variance of Ȳ? (R)
(c) What is the mean of W? (R)
(d) What is the variance of W? (R)
(e) What is the mean and variance of Z? (R)
29. Super Quiz
Answers based on tables. Q1: a)0.07, b)0.285714, c)2.4331, d)
12.705. Q2: b)0.130377, d)0.08564. Q3: a)0.2643, b) 0.2228, c)0.7368,
d)0.99375, e)0.0336.
30. Chapter 8.5
30.1. t-distribution. We will define a one-parameter family of distributions that are closely related to the standard normal distribution
and will be very useful later on. Our approach here will be slightly
more basic than the approach taken in your textbook.
Let (X_i) be a sequence of independent normal random variables with
mean µ and variance σ². Assume n ≥ 2. As usual set,
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \]
and
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2. \]
Consider the random variable T defined via:
\[ T = T_{n-1} = \frac{\bar{X} - \mu}{S/\sqrt{n}}. \]
We say that the (continuous) random variable T has a t-distribution
with parameter ν = n − 1 (degrees of freedom); similarly if Y is any
random variable with the same distribution (same cdf, same pdf) as T ,
then we say that Y has a t-distribution with n − 1 degrees of freedom,
and write Y ∼ tn−1 .
Note that if we replaced the random variable S by the number σ in
the definition of T_{n−1}, then T_{n−1} ∼ N(0, 1). In fact, as n → ∞, we
have P(T_n ≤ x) → Φ(x), where Φ is the cdf for the standard normal
distribution. The pdf for T_{n−1} has a complicated formula, but we will only need to refer to
the following properties. We have that ET_{n−1} = 0; moreover, if g_{n−1} is
the pdf for T_{n−1}, then g_{n−1} is an even function. Hence Exercise 16.8
applies.
As with the normal distribution, we may retrieve the cdf for a t-distribution using a table or using a calculator. For example, let T5 have
a t-distribution with ν = 5 degrees of freedom. Suppose we want to
compute P(−1 ≤ T5 ≤ 2):
(i) Press the blue 2nd button,
(ii) then press DISTR (VARS),
(iii) then press 6 for tcdf,
(iv) enter: tcdf(-1,2,5) (make sure you use the negative sign and not
the minus)
(v) which gives 0.767421529.
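The Python standard library has no built-in t-distribution, so the sketch below approximates the same probability by integrating the standard t density numerically with Simpson's rule. The function names `t_pdf` and `tcdf` are hypothetical helpers, and the density formula used is the usual one for ν degrees of freedom, which the notes above do not write out.

```python
import math

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def tcdf(a, b, nu, steps=2000):
    """Approximate P(a <= T <= b) by Simpson's rule; steps must be even."""
    h = (b - a) / steps
    total = t_pdf(a, nu) + t_pdf(b, nu)
    for i in range(1, steps):
        # Simpson weights: 4 at odd interior nodes, 2 at even interior nodes
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, nu)
    return total * h / 3

print(tcdf(-1, 2, 5))  # compare with the calculator value 0.767421529
```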
30.1.1. Critical values and t-alpha notation. Let T_ν ∼ t_ν. Let α ∈
(0, 1). The number t_α = t_{α,ν} is the number such that
P(T_ν ≥ t_{α,ν}) = α.
The numbers t_{α,ν} are sometimes called t-critical values. The tables at
the back of your book give the values of t_{α,ν} for ν = 1, . . . , 30, and
α = 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.025. The table also gives values in the
case where ν = ∞; by this, they mean t_{α,∞} = z_α (see Section 24.1).
Exercise 30.1. Let T ∼ t10 . Compute P(T > 1 | T > 0.5).
31. Chapter 9.1-9.3
31.1. Estimators and Estimates. A point estimator is a statistic
that is meant to give us an idea of a parameter of a population, for
example, the mean or variance. Thus both the sample
mean and the sample variance are point estimators. We call the observed values
of point estimators point estimates. How do we measure how good
a point estimator is? The following is a very basic way we can classify
point estimators.
31.2. Unbiased estimators. Suppose that θ is a parameter for a population. An estimator X for θ is unbiased if EX = θ.
Example done in class 31.1. Show that X̄, the sample mean, is an
unbiased estimator for the population mean.
Exercise 31.2. Show that S², the sample variance, is an unbiased
estimator for the population variance. (Use Exercise 27.3)
Exercise 31.3. Suppose you have a random sample X1, . . . , Xn, where
each X_i ∼ Bern(p). We know that the variance is given by σ² =
Var(X_i) = p(1 − p). We want to estimate σ². Of course we can use
S², however it seems more natural to use
\[ Y = \bar{X}(1 - \bar{X}), \]
since X̄ is an estimator for p. Find EY.
Solution. We have that EY = EX̄ − E(X̄²). Thus by the short-cut
formula,
\[ EY = E\bar{X} - \mathrm{Var}(\bar{X}) - (E\bar{X})^2 = p - \frac{p(1-p)}{n} - p^2 = p(1 - p)\Big(1 - \frac{1}{n}\Big). \]
Exercise 31.4. Suppose you have a random sample X1, . . . , Xn, where
we have X_i ∼ U(0, θ). We do not know θ and we want to estimate it.
Consider
M = max {X1, . . . , Xn},
and
\[ \hat{\Theta} = \frac{n + 1}{n} M. \]
(a) What is the cdf for X1?
(b) Find the cdf for M. Hint: Observe that
{M ≤ x} = {X1 ≤ x} ∩ · · · ∩ {Xn ≤ x},
and remember that the X_i are assumed to be independent.
(c) Find the pdf for M by using calculus.
(d) Show that EM = (n/(n + 1)) θ.
(e) Show that EΘ̂ = θ.
Solution.
(a) For x < 0, we have P(X1 ≤ x) = 0, for x ≥ θ, we have P(X1 ≤
x) = 1, and for x ∈ [0, θ], we have P(X1 ≤ x) = x/θ.
(b) Since
{M ≤ x} = {X1 ≤ x} ∩ · · · ∩ {Xn ≤ x},
and the X_i are independent, we have by the previous part that
\[ \mathbb{P}(M \le x) = \Big(\frac{x}{\theta}\Big)^n, \]
for all x ∈ [0, θ], P(M ≤ x) = 1 for all x > θ, and P(M ≤ x) = 0
for all x < 0.
(c) By taking a derivative, we obtain that
\[ f_M(x) = \begin{cases} 0 & \text{if } x < 0, \\ \frac{n}{\theta^n} x^{n-1} & \text{if } x \in [0, \theta], \\ 0 & \text{if } x > \theta. \end{cases} \]
(d) We have that
\[ EM = \int_0^{\theta} \frac{n}{\theta^n} x^n\,dx = \frac{n}{\theta^n (n + 1)} x^{n+1} \Big|_0^{\theta} = \frac{n}{n + 1}\theta. \]
(e) By definition, Θ̂ = ((n + 1)/n) M, so by the previous part,
\[ E\hat{\Theta} = \frac{n + 1}{n} EM = \theta. \]
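The unbiasedness shown in parts (d) and (e) can be illustrated by simulation. The sketch below draws many samples from U(0, θ) and averages the maximum; the function name `simulate`, the seed, and the values θ = 10, n = 5 are illustrative assumptions, and a simulation only approximates the expectations.

```python
import random

def simulate(theta, n, trials, seed=0):
    """Average M = max(X_1..X_n) and Theta-hat = (n + 1)/n * M over many samples."""
    rng = random.Random(seed)
    sum_m = 0.0
    for _ in range(trials):
        m = max(rng.uniform(0, theta) for _ in range(n))
        sum_m += m
    mean_m = sum_m / trials
    return mean_m, (n + 1) / n * mean_m

theta, n = 10.0, 5            # illustrative values
mean_m, mean_theta_hat = simulate(theta, n, trials=20000)
print(mean_m)                 # should be near n / (n + 1) * theta = 8.33...
print(mean_theta_hat)         # should be near theta = 10
```

The maximum M alone systematically underestimates θ, and the factor (n + 1)/n removes that bias.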
31.3. Errors. If we are trying to estimate a parameter θ, sometimes
we denote its estimator by Θ̂. The estimated standard error of an
(unbiased) estimator is given by √Var(Θ̂).
In the special case where we are sampling from a normal distribution with mean µ, it is known that the sample mean is an estimator
for µ that has the least variance amongst all the other unbiased
estimators you can cook up.
Exercise 31.5. Find the estimated standard error for the sample mean.
Exercise 31.6. Find the estimated standard error for Θ̂ in Exercise
31.4.
32. Chapter 9.4, 9.10 (I)
The central limit theorem or an assumption of normality will allow us
to estimate how good the sample mean X̄ (in simple random sampling)
is as a point estimator for the true (population) mean µ (which we
do not know). We will start first with a somewhat unrealistic, but
instructive example.
32.1. Baby (exact) confidence intervals (for the population
mean): normal population, known population variance. Let
X_i be independent normal random variables all with mean µ and variance σ².
Here µ is unknown, but σ is known. Let X̄ be the usual sample
mean. We know that
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \]
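Recall that for α ∈ (0, 1), z_α is the number such that P(Z ≥ z_α) = α, so
that P(−z_{α/2} ≤ Z ≤ z_{α/2}) = 1 − α. Consider the following calculation
\begin{align*}
1 - \alpha &= \mathbb{P}(-z_{\alpha/2} \le Z \le z_{\alpha/2}) \\
&= \mathbb{P}\Big(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) \\
&= \mathbb{P}\Big(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big) \\
&= \mathbb{P}\Big(\mu \in \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big]\Big);
\end{align*}
that is, the probability that the true mean µ lies in the random interval
given by
\[ \Big[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big] \]
is 1 − α.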
This calculation motivates the following definition. Suppose we actually
observe the values of X1, . . . , Xn, and we find that X̄ = x̄. Then
we say that the (deterministic) interval given by
\[ \Big[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big] \]
is a 100(1 − α)% (exact, two-sided) confidence interval for µ. Sometimes a CI is also expressed compactly as
\[ \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}. \]
Recall that √Var(X̄) = σ/√n. The term σ/√n is also sometimes called the
standard error of X̄.
Let us stress that a confidence interval (CI) for µ is computed after
we collect data, and will be a deterministic interval like (134, 151). The
population mean µ is a fixed deterministic number as well, like µ = 123
or µ = 143. Thus, once the confidence interval is computed, whether µ
belongs to a confidence interval is not a random event. It is incorrect to
say that µ will lie in a confidence interval with probability 1 − α; this is
only true for the random interval given by our motivating calculation.
Example done in class 32.1. What is the probability that 2.5 ∈ [2, 3]?
What is the probability that 2.5 ∈ [3, 5]?
Example done in class 32.2. Suppose I know that the heights of men
in Lawrence are normally distributed with some unknown mean µ, and
known variance σ² = 16cm². Suppose we collect a random sample
of 8 men's heights, in cm, in Lawrence, and have the following data:
178, 173, 176, 156, 190, 175, 170, 190
Compute a 95 percent CI for µ.
Solution. We have that the CI is given by
\[ \Big[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\Big], \]
where n = 8, σ = 4, and α = 0.05. We need to compute x̄; this can be
done with your calculator:
(i) Press the blue 2nd button,
(ii) then press LIST (STAT),
(iii) side scroll to option MATH and choose mean,
(iv) enter: mean({178, 173, 176, 156, 190, 175, 170, 190})
(v) which gives 176
We need to determine the value of z_{0.05/2}. From Exercise 24.1, we have
that z_{0.05/2} ≈ 1.96. Thus the CI is given by
(173.228, 178.772).
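The same interval can be computed in Python as a check on the arithmetic. This is a minimal sketch; the function name `z_interval` is a hypothetical helper, and it uses the unrounded critical value from `NormalDist.inv_cdf` rather than the table value 1.96, so the endpoints agree with the answer above only up to rounding.

```python
import math
from statistics import NormalDist

def z_interval(data, sigma, alpha):
    """Two-sided 100(1 - alpha)% CI for the mean when sigma is known."""
    n = len(data)
    x_bar = sum(data) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

heights = [178, 173, 176, 156, 190, 175, 170, 190]
low, high = z_interval(heights, sigma=4, alpha=0.05)
print(round(low, 3), round(high, 3))
```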
33. Chapter 9.4, 9.10 (II)
What about when the variance σ² is unknown? Here we use point
estimators for σ². For example,
\[ S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2. \]
In the case of coin flips where X_i take the values 1 and 0, we can also
use the random variable given by
\[ \bar{X}(1 - \bar{X}). \]
33.1. Large sample size (CLT approximate) confidence intervals (for the population mean): unknown population variance.
Let X1, . . . , Xn be a simple random sample (where the X_i's may not
necessarily be normal). Then for n large
\[ Z_n = \frac{\bar{X} - \mu}{S/\sqrt{n}} \approx N(0, 1); \]
that is, P(Z_n ≤ x) is approximately given by P(Z ≤ x) when n is large (textbooks sometimes state n ≥ 40).
A 100(1 − α)% (approx two-sided) confidence interval can be derived
in a similar way by replacing σ with a point estimator in our earlier
discussion in Section 32.1. We obtain that the random interval given
by
\[ \Big[\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}},\ \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\Big] \]
contains the population mean µ with probability approximately 1 − α.
Here, the probability is approximate since we appealed to a version
of the central limit theorem. (Sometimes the term S/√n is called the
estimated standard error.)
Suppose we actually observe the values of X1, . . . , Xn, and we find
that X̄ = x̄ and S = s. Then we say that the (deterministic) interval
given by
\[ \Big[\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\Big] \]
is a 100(1 − α)% (approx two-sided) confidence interval for µ.
Exercise 33.1. Suppose that in a random sample of 50 kitchens with
gas cooking appliances we monitor the CO2 levels for a one week period
and find that the sample mean was 654.16 (ppm) and the standard
deviation was 164.43.
(a) Calculate (approx) a 95 percent (two-sided) confidence interval for
µ—the true average CO2 level in the population of all homes from
which the sample was selected.
(b) Suppose that we assume that the sample standard deviation will be
no greater than 175. What sample size would be necessary to obtain
an interval width no greater than 50 (ppm) for a confidence level
of 95 percent?
Hint: you may use that if Z ∼ N (0, 1), then P(−1.96 ≤ Z ≤ 1.96) =
0.95 and P(Z ≤ 1.96) = 0.975.
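Solution. Since the sample size is large, $n = 50 > 40$, we may use the
large sample size confidence interval approximations that are based on
the CLT. Thus we use the formula:
\[ \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}. \]
With $\bar{x} = 654.16$, $s = 164.43$, $n = 50$, and $z_{0.05/2} = 1.96$, the margin is
$1.96 \cdot 164.43/\sqrt{50} \approx 45.58$.
So the required confidence interval is (608.58, 699.74).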
For the second part of the question, note that the width is given by
\[ 2(1.96) \frac{s}{\sqrt{n}}. \]
Since we know that $s \le 175$, the width is no greater than
\[ 2(1.96) \frac{175}{\sqrt{n}}. \]
Thus we need to solve for $n$ in the inequality
\[ 50 \ge 2(1.96) \frac{175}{\sqrt{n}}. \]
Rearranging gives $\sqrt{n} \ge 2(1.96)(175)/50 = 13.72$, so
we obtain that $n \ge 188.2384$. So take $n = 189$.
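The two computations in this solution can be checked with a short Python sketch. It uses only the standard library; the function names are our own, and the critical value comes from `NormalDist().inv_cdf` rather than from the tables, so it is 1.95996... instead of the rounded 1.96.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Return z_{alpha/2}, i.e. the point with P(Z >= z) = alpha/2 for Z ~ N(0,1)."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def large_sample_ci(xbar, s, n, alpha):
    """Approximate (CLT-based) two-sided CI for the mean from summary statistics."""
    margin = z_critical(alpha) * s / sqrt(n)
    return xbar - margin, xbar + margin


def sample_size_for_width(s_bound, width, alpha):
    """Smallest n with 2 * z_{alpha/2} * s_bound / sqrt(n) <= width."""
    return ceil((2 * z_critical(alpha) * s_bound / width) ** 2)


# Part (a): xbar = 654.16, s = 164.43, n = 50, 95 percent confidence.
lo, hi = large_sample_ci(654.16, 164.43, 50, 0.05)

# Part (b): s assumed to be at most 175, width at most 50.
n_needed = sample_size_for_width(175, 50, 0.05)

print(round(lo, 2), round(hi, 2), n_needed)
```

To two decimal places the endpoints agree with the interval above, and the required sample size is again 189.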
In practice, it may be difficult to obtain large sample sizes. In the
case where we sample from a normal distribution, it is not necessary
to have a large sample size.
33.2. Exact confidence intervals (for the population mean):
normal population, unknown population variance. Let $X_1, \ldots, X_n$
be a simple random sample where the $X_i$ are normal random variables
with mean $\mu$ and unknown variance $\sigma^2$. Then for all $n \ge 2$, we have
\[ Z_n = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}; \]
in other words, $Z_n$ has the t-distribution with $n - 1$ degrees of freedom.
A $100(1 - \alpha)\%$ (two-sided) confidence interval can be derived
in a similar way as before. Instead of appealing to the standard normal
distribution, we appeal to the t-distribution. Recall that if $T_\nu \sim t_\nu$,
then the numbers $t_{\alpha,\nu}$ are such that
\[ P(T_\nu \ge t_{\alpha,\nu}) = \alpha. \]
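We obtain that the random interval given by
\[ \left[ \bar{X} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}, \; \bar{X} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \right] \]
contains the population mean $\mu$ with probability $1 - \alpha$. Suppose we
actually observe the values of $X_1, \ldots, X_n$, and we find that $\bar{X} = \bar{x}$ and
$S = s$. Then we say that the (deterministic) interval given by
\[ \left( \bar{x} - t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}, \; \bar{x} + t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \right) \]
is a $100(1 - \alpha)\%$ (two-sided) confidence interval for $\mu$.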
Example done in class 33.2. From experience, we know that test
scores for a certain standardized test can be modelled using the normal
distribution; that is, if $X$ is the test score of a randomly sampled student, then $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$, for
some µ and some (unknown) σ > 0. Suppose that in a simple random
sample of 8 students we have the following test scores:
500, 620, 520, 700, 562, 658, 550, 656.
Find a 95 percent confidence interval for the true mean µ.
Solution. We need to compute $\bar{x}$ and $s$; this can be done with your
calculator. We obtain that $\bar{x} = 595.75$. To obtain $s$, we can proceed as
follows:
(i) Press the blue 2nd button,
(ii) then press DISTR (LIST),
(iii) side scroll to option MATH and choose stdDev,
(iv) enter: stdDev({500, 620, 520, 700, 562, 658, 550, 656})
(v) which gives 72.8006
From the tables, we have that $t_{0.05/2,7} = 2.365$. So the required CI is
given by
\[ \bar{x} \pm t_{0.05/2,7} \frac{s}{\sqrt{8}} = 595.75 \pm 60.87 \]
or
(534.88, 656.62).
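The same interval can be reproduced with a short Python sketch using only the standard library. The standard library has no t-distribution quantile function, so the critical value 2.365 is taken from the tables as above; the function name is our own.

```python
from math import sqrt
from statistics import mean, stdev

# Test scores from the example.
scores = [500, 620, 520, 700, 562, 658, 550, 656]

# t_{0.05/2, 7}, read from the tables (not computed here).
T_CRIT = 2.365


def t_interval(data, t_crit):
    """Two-sided CI for the mean of a normal sample, given the t critical value."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)  # sample standard deviation, with divisor n - 1
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin


lo, hi = t_interval(scores, T_CRIT)
print(round(lo, 2), round(hi, 2))
```

Note that `statistics.stdev` uses the divisor $n - 1$, which matches the definition of $S$; `statistics.pstdev` uses the divisor $n$ and would give a different answer.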
33.3. Using your calculator. Your calculator is capable of computing CIs. For example, we can do Example 33.2 in the following way:
(i) Press the STAT button,
(ii) then side scroll to TESTS,
(iii) choose option 8-TInterval,
(iv) now you can enter raw data or computed statistics:
(v) we can enter $\bar{x} = 595.75$, Sx = 72.8006, n = 8, and C-Level=0.95.
Similarly, Exercises 32.2 and 33.1 can be easily treated using option
7-ZInterval.
I encourage you to explore the functions of your calculator or other
software. However, simply using your calculator to compute a CI will
not be sufficient to receive any marks on tests and assignments in this
course. In a question, if you just write the required CI from your
calculator, it is very likely that you will not receive any partial credit.
You are welcome to use the functions of your calculator to check your
work; however, I expect you to manually compute the CI as I have in
the exercises.
The next exercise can be treated using option A-1-PropZInt.
33.4. Large sample size (CLT approximate) confidence intervals (for the population mean): proportions. Let $X_1, \ldots, X_n$ be
a simple random sample where the $X_i$ are Bernoulli random variables
with parameter $p \in (0, 1)$. Note that $EX_i = \mu = p$. Note that $\bar{X}$ is the
proportion of ‘successes’ or ones that occur in $n$ trials. In this context,
we often write $\bar{X} = \hat{P}$. The observed value of $\hat{P}$ is often denoted by $\hat{p}$.
A version of the central limit theorem implies that
\[ Z_n = \frac{\hat{P} - p}{\sqrt{\hat{P}(1 - \hat{P})/n}} \approx N(0, 1), \]
where the approximation is good if both $np \ge 10$ and $n(1 - p) \ge 10$.
A $100(1 - \alpha)\%$ (approx two-sided) confidence interval can be derived
in a similar way as before. We obtain that the random interval given
by
\[ \left[ \bar{X} - z_{\alpha/2} \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}, \; \bar{X} + z_{\alpha/2} \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}} \right] \]
contains the population proportion $p$ with probability approximately $1 - \alpha$. Suppose
we actually observe the values of $X_1, \ldots, X_n$, and we find that $\hat{P} = \hat{p}$.
Then we say that the (deterministic) interval given by
\[ \left( \hat{p} - z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \; \hat{p} + z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \right) \]
is a $100(1 - \alpha)\%$ (approx two-sided) confidence interval for $p$. Note
that in order to ensure that the approximate CI is reliable, we check
that $n\hat{p} \ge 10$ and $n(1 - \hat{p}) \ge 10$.
Example done in class 33.3. We wish to investigate p, the proportion of people with the disease Deadly-Virus who die within three years
after receiving a newly discovered treatment. We have taken a random sample of 187 people with Deadly-Virus and gave them the new
treatment. Three years later, 170 of these patients have died.
(a) Find a 97 percent CI for p.
(b) Suppose that in the future we wish to carry out a second study in
which we will construct a 99 percent confidence interval which will
estimate p to within 0.001. Using the data given in the setup as a
pilot study, estimate the sample size needed.
Solution.
(a) We have that $n = 187$ and $\hat{p} = 170/187$. Note that $n\hat{p} = 170 \ge 10$
and $n(1 - \hat{p}) = 17 \ge 10$. We
need to find $z_{0.03/2}$. From the tables, we have that $P(Z \le 2.17) \approx
0.985$. The inverse normal function on your calculator gives $P(Z \le
2.170090375) \approx 0.985$. Thus we obtain that the CI is given by
0.9091 ± 0.045619.
(b) To estimate the size $n$, we solve for $n$ in the equation
\[ 0.001 = z_{0.01/2} \, \sigma/\sqrt{n}. \]
We have that the tables give $z_{0.01/2} \approx 2.576$, and since we are supposed to use the data as a pilot study, we estimate that
\[ \sigma \approx \sqrt{\hat{p}(1 - \hat{p})}, \]
where $\hat{p} = 170/187$. Some algebra yields $n \approx 548411$.
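Both parts of this example can be checked with a short Python sketch using only the standard library. The function names are our own. The critical values are computed with `NormalDist().inv_cdf` instead of being read from the tables, so the sample size comes out slightly smaller than the value obtained with the rounded $z_{0.01/2} \approx 2.576$.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Return z_{alpha/2}, i.e. the point with P(Z >= z) = alpha/2 for Z ~ N(0,1)."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def proportion_ci(successes, n, alpha):
    """Approximate (CLT-based) two-sided CI for a proportion."""
    p_hat = successes / n
    # Reliability check from the notes: n * p_hat and n * (1 - p_hat) at least 10.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("normal approximation may be unreliable")
    margin = z_critical(alpha) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def pilot_sample_size(p_hat, margin, alpha):
    """Solve margin = z_{alpha/2} * sqrt(p_hat * (1 - p_hat) / n) for n, rounded up."""
    return ceil((z_critical(alpha) / margin) ** 2 * p_hat * (1 - p_hat))


# Part (a): 170 deaths among 187 patients, 97 percent confidence.
lo, hi = proportion_ci(170, 187, 0.03)

# Part (b): 99 percent confidence, margin 0.001, pilot estimate 170/187.
n_est = pilot_sample_size(170 / 187, 0.001, 0.01)

print(round(lo, 4), round(hi, 4), n_est)
```

The required sample size is very sensitive to the critical value because of the small margin 0.001, which is why the answer from this sketch and the answer from the tables differ in the last few digits.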
Exercise 33.4. Let $X \sim \mathrm{Bern}(p)$. Consider $f(p) = \mathrm{Var}(X)$. What is
the maximum value of $f$?
Exercise 33.5. We have a magician’s coin, and we want to investigate
$p \in (0, 1)$, the probability that on a flip of the coin it comes up heads.
We want to construct a 95 percent confidence interval for $p$.
(a) Estimate how many flips we need in order to estimate $p$ to within
0.001.
(b) Suppose I flipped the coin 100 times and got 23 heads and 77 tails.
What is the CI?
end of coverage for Test 3