Introduction to Probability
Franco Vivaldi
School of Mathematical Sciences
© The University of London, 2016
Last updated: May 2, 2017
Abstract
These notes are part of the course material of MTH4107 Introduction to Probability. They
are a complement to, rather than a substitute for, the lectures; in particular, they do not contain
all examples, proofs, or explanations. All material included in this document is examinable,
unless explicitly stated otherwise.
At the end of each chapter there are supplementary exercises, some of which are taken from
past mid-term tests and final examinations. Solutions to these exercises will not be provided;
by contrast, we often supply numerical answers, to support the development of self-assessment
skills.
These notes are an extended version of the lecture notes written by Robert Johnson for an
earlier version of this course.
1
Introduction
1.1
Overview
We begin by asking some questions about probability.
1. Would you rather be given £5, or toss a coin winning £10 if it comes up heads?
2. How many people do you need in a room to have a better than 50% chance that two of them
share a birthday?
3. A suspect’s fingerprints match those found on a murder weapon. The chance of a match
occurring by chance is around 1 in 50,000. Is the suspect likely to be innocent?
4. A network is formed by taking 100 points and connecting each pair with some fixed probability. What can we say about the number and size of connected components in the network?
All these questions relate to situations where there is some randomness —events we cannot
predict with certainty. This could be either because they happen by chance (tossing a coin) or they
are beyond our knowledge (the innocence or guilt of a suspect). Probability is the mathematical
discipline that studies the laws that govern random phenomena. Since randomness is all around us,
it is a useful theory; it is also mathematically appealing.
The first question is not entirely a mathematical one and there is no right or wrong answer. It’ll
depend on circumstances. To bring it closer to mathematics, we modify it by introducing a variable
quantity x as follows:
1b. Would you rather (a) be given £5, or (b) toss a coin winning £ x if it comes up heads?
Now the answer depends on the value of x. If x ≤ 5, then anybody would answer (a), and most
people would still answer (a) for x = 6. For x = 100 most people would answer (b), and there must
be an intermediate value of x for which opinions are as divided as possible.
To determine such a value, let us consider a large number N of repetitions of the same experiment. If N people choose (a), then the total amount paid out will be exactly £ 5N. If N people
choose (b), then the total amount paid out will be close to £ Nx/2, the closer the larger the value
of N. (This is the principle upon which the gambling business is based.) Then the ratio of these
figures will be close to Nx/(2 × 5N) = x/10. As N → ∞, the value x = 10 is the only value for
which this ratio converges to 1. This means that for x = 10, with respect to the money paid out,
options (a) and (b) are equivalent.
Another approach concerns the ‘degree of belief’ in the odds of case (b). Most people would
not be prepared to choose (b) if x < 10, and would not be prepared to offer (b) to someone else if
x > 10. Thus x = 10 is perceived to be the only value which makes the bet on (b) ‘fair’.
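The long-run reasoning above can be checked by simulation. Below is a minimal Python sketch (the function name, trial count, and seed are our own choices, not part of the notes): each play of option (b) pays £x on heads, and for x = 10 the average payout per play settles near the £5 of option (a).

```python
import random

def average_payout(x, n_trials=100_000, seed=0):
    """Average payout per play of option (b): win x pounds on heads."""
    rng = random.Random(seed)
    total = sum(x for _ in range(n_trials) if rng.random() < 0.5)
    return total / n_trials

# For x = 10 the empirical average is close to 5, matching option (a).
print(average_payout(10))
```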
Probability Theory, which we begin to develop in chapter 3, provides the mathematical apparatus needed to characterise the outcome of a large number of repeated experiments, or to quantify
a degree of belief.
Question 2 is a simple calculation which we will see how to do in chapter 4. If you haven’t
seen this before then try to guess what the answer will be. Most people have rather poor intuition
for this kind of question, so I would expect there to be a wide range of guesses, with many of
them being far from the correct answer.
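As a preview of the calculation in chapter 4, the answer to Question 2 can be computed in a few lines. The sketch below assumes 365 equally likely birthdays and ignores leap years:

```python
from math import prod

def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    return 1 - prod(1 - i / days for i in range(n))

# Smallest n with a better-than-50% chance of a shared birthday.
n = 1
while p_shared_birthday(n) <= 0.5:
    n += 1
print(n, round(p_shared_birthday(n), 4))  # 23 0.5073
```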
Question 3 emphasises how important it is to determine exactly what information we are given
and what we want to know. The court would have to consider the probability that the suspect
is innocent under the assumption that the fingerprints match. The numerical probability given in
the question is the probability that the fingerprints match under the assumption that the suspect is
innocent. These are in general completely different quantities. The erroneous assumption that they
are equal is sometimes called the ‘prosecutor’s fallacy’. The mathematical tool for considering
probabilities given certain assumptions is conditional probability, which we develop in chapter 5.
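To see how different the two quantities can be, here is a back-of-the-envelope sketch. The pool of 200,000 alternative suspects is a hypothetical figure of our own, chosen only for illustration:

```python
p_match_if_innocent = 1 / 50_000      # P(match | innocent), from the question
population = 200_000                  # hypothetical pool of possible culprits

guilty_matches = 1                                         # the culprit matches
innocent_matches = (population - 1) * p_match_if_innocent  # expected by chance
p_innocent_if_match = innocent_matches / (guilty_matches + innocent_matches)
print(round(p_innocent_if_match, 2))  # 0.8: a matching suspect is likely innocent
```

So under these (made-up) numbers, P(innocent | match) is about 0.8 even though P(match | innocent) is only 1/50,000.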
The final question describes the construction of a random graph. These objects, first studied
within pure mathematics, have found a large number of applications, ranging from epidemiology
to computing. This fascinating area of mathematics is beyond the scope of this course; you will be
able to explore it at advanced undergraduate or post-graduate level.
1.2
Elements of the mathematical language
(Some material in this section belongs to the syllabus of the Mathematical Structures course.)
Higher mathematics should be approached as a new language; understanding Probability requires fluency in this language. We begin by introducing words which represent some fundamental
types of objects:
number,
set,
function,
sequence,
sentence,
operator.
1. Number. I assume you know what a number is. Some related words are:
natural number,
integer,
rational,
irrational,
real,
complex.
2. Set. A set is a collection (family, aggregate) of well-defined¹ distinct objects, called the
elements of the set. Two sets are equal if they have the same elements.
Set notation employs a pair of curly brackets. The brackets’ content specifies the elements of
the set, in various ways. The simplest format is a list:
{777, 7, 77},   {0, 2, 4, 6, . . .},   {1, 2, . . . , n},   {}.    (1)
Note the use of the comma and the ellipsis (dots). The latter is used to represent large or infinite
lists implicitly. The rightmost expression represents the set with no elements, called the empty set,
which is commonly denoted by ∅.
¹ unambiguous
The definition ‘The set of books in Queen Mary’s library’ is probably flawed: any decent library
has multiple copies of some books. Also, is a borrowed book considered to be in the library? (Fix
this definition —see exercises.)
The sentence ‘In the set of 1-digit prime numbers, which element comes before 7?’ is meaningless. This set is well-defined, and it has four elements. However, from the definition we infer
that the order in which the set’s elements are specified is irrelevant. (Read the definition of set
again, and convince yourself that this is in fact the case.) So the set
{7, 2, 5, 3}
is the (as opposed to a) set of 1-digit prime numbers.
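Python's built-in set type obeys the same convention, so such claims can be checked mechanically; a small sketch:

```python
# Order (and repetition) of the listed elements is irrelevant in a set.
assert {7, 2, 5, 3} == {2, 3, 5, 7}
assert {7, 7, 7} == {7}
assert len(set()) == 0          # set() is Python's empty set ({} is an empty dict)
print({7, 2, 5, 3} == {2, 3, 5, 7})  # True
```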
Not all sets can be represented by listing their elements, explicitly or implicitly, as we did in
(1). The set of points in the plane is an example.
Whenever you write, speak, or think mathematics, the word set should occur often. Let us
write some phrases containing this word:
A set of sets. A finite set of polynomials. Two distinct sets of triangles.
The set of real non-zero solutions of an equation.
3. Function. A function consists of two sets, called the domain and the co-domain, together
with a rule that associates to every element of the domain a unique element of the co-domain.
In symbols:
f :A→B
x ↦ f (x).
Here A denotes the domain, B the co-domain, x is an element of A (written x ∈ A), f is the function’s
name, and f (x) is the unique element of B associated to x, called the value of f at x.
It is easy to make up flawed function definitions:
f : A ↦ B,  x → f (x)   [wrong arrows]
f : {0, 1, 2} → (0, 1, 2),  x ↦ 2 − x   [the co-domain is not a set]
f : {0, 1} → {0, 1},  x ↦ 2x   [ f (1) is not an element of the co-domain].
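A finite function can be checked against the definition mechanically. In the sketch below (our own modelling choice, not part of the notes) a function is a Python dict together with an explicit co-domain; the check detects the flawed definition x ↦ 2x:

```python
def is_function(f, domain, codomain):
    """True if dict f assigns to every element of the domain
    exactly one value lying in the co-domain."""
    return set(f) == set(domain) and all(f[x] in codomain for x in domain)

f = {x: 2 - x for x in {0, 1, 2}}   # f : {0,1,2} -> {0,1,2}, x |-> 2 - x
g = {x: 2 * x for x in {0, 1}}      # the flawed x |-> 2x on {0,1} -> {0,1}

print(is_function(f, {0, 1, 2}, {0, 1, 2}))  # True
print(is_function(g, {0, 1}, {0, 1}))        # False: g[1] = 2 is not in the co-domain
```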
4. Sequence. A sequence is an ordered list of well-defined objects, called the terms (or
elements) of the sequence. Two sequences are the same if all corresponding terms in the two lists
are the same.
The sequence notation employs round brackets. The elements of a sequence —if not listed
explicitly— are typically represented by one symbol, which is decorated with an integer subscript
specifying the position of the term in the sequence:
(1, 3, 5)
(−1, −3, −5, . . .)
(x1 , x2 , x3 , . . .).
The use of ellipsis is ubiquitous in sequence notation. Note that (3, 1) ≠ (1, 3), whereas {3, 1} =
{1, 3}. In some cases, a collection of distinct objects may be represented both as a set and as a
sequence, for instance, the prime numbers. The two representations serve different purposes. Not
all collections of distinct objects may be represented as sequences, e.g., the numbers between 0
and 1.
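In Python, tuples model sequences and sets model sets, so the distinction can be verified directly:

```python
# Sequences are ordered; sets are not.
assert (3, 1) != (1, 3)
assert {3, 1} == {1, 3}

# Sequences may repeat terms; a set collapses repetitions.
assert len((1, 1, 2)) == 3
assert len({1, 1, 2}) == 2
print("ok")
```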
5. Sentence (or statement). I assume you know what a verbal sentence is. There are also symbolic sentences, such as ‘1 < 2’. A symbolic sentence is constructed by combining mathematical
objects via relational operators —see below. We emphasise the difference that exists between a
false sentence (‘My dog is black’ if my dog is, in fact, white), and a meaningless expression (‘My
black is dog’), namely an expression which breaks formal rules of grammar or syntax. Thus ‘1 > 2’
is a valid sentence (which happens to be false), while the expression ‘< 1 <’ is meaningless.
6. Operator. Operators are special types of functions in common use. The well-known
arithmetical operators turn a pair of numbers into a number:
+ (addition)   − (subtraction)   × (multiplication)   / (division).
Set operators turn a pair of sets into a set:
∪ (union)   ∩ (intersection)   ∖ (difference)   △ (symmetric difference).
Relational operators turn a pair of mathematical objects of various types into a sentence. Equalities and inequalities, which act —in the first instance— on pairs of numbers, are among the best-known relational operators:
=   ≠   <   ≤   >   ≥   etc.
The operators = and ≠ can be made to act on any object type, not just numbers. For instance,
we have defined their meaning for sets and sequences. For the time being, we shall consider
inequalities as being defined only for real numbers.
There are relational operators for sets. The following operators transform into a sentence,
respectively, a pair of sets and pair consisting of an arbitrary object and a set:
⊆ (inclusion)
⊂ (proper inclusion)
∈ (membership).
We write A ⊂ B, which reads ‘A is a proper subset of B’, to mean A ⊆ B and A ≠ B. Thus ‘N ⊂ Z’
reads ‘The natural numbers are a (proper) subset of the integers’, and −1 ∉ N reads, ‘−1 is not a
natural number’.
Logical operators turn sentences into sentences:
¬ (not)   ∧ (and)   ∨ (or)   ⇒ (only if).    (2)
For example:
(x = −5) ⇒ (x² = 25)
(x ∈ R) ∧ (x ∉ Q).
Let us translate the rightmost sentence into words. A literal translation, ‘x is a real number and x is
not a rational number’, is clumsy and robotic. By contrast, the sentence ‘x is an irrational number’
shows that the meaning of the symbols has been understood.
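Python provides not, and, and or for ¬, ∧, ∨; the implication ⇒ has no built-in operator, but it can be expressed as shown in this sketch (the helper name implies is ours):

```python
def implies(p, q):
    """The sentence p => q: false only when p is true and q is false."""
    return (not p) or q

# (x = -5) => (x^2 = 25) holds for every x: a false premise implies anything.
for x in (-5, 5, 3):
    assert implies(x == -5, x**2 == 25)
print(implies(True, False))  # False: the only falsifying combination
```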
As your mathematical maturity grows, so will the confidence with which you use the words
and symbols introduced in this section.
EXERCISES
EXERCISE 1.1. Write a very short summary of section 1.1 (fewer than 50 words).
EXERCISE 1.2.
1. Replace the flawed definition ‘The set of books in Queen Mary’s library’ given in section
1.2 with a mathematically sound definition.
2. Give examples of flawed set definitions, and explain why they are flawed.
3. Explain the difference between f and f (x) in function notation, without using any
symbol.
4. Give examples of flawed function definitions, and explain why they are flawed.
EXERCISE 1.3. For each word listed below, write a true and a false sentence employing that word.
set, sequence, function, co-domain, sentence, set operator, relational operator.
EXERCISE 1.4. Give an example of each item.
1. A set of sets.
2. A set of sequences.
3. A sequence of sets.
4. A sequence of sequences.
5. A finite sequence of functions.
6. An infinite set of functions.
7. A sequence of set operators.
8. An infinite sequence of sentences.
[Hint: look at the next exercise for inspiration.]
EXERCISE 1.5. You are given expressions which may or may not be meaningful, according to the
grammatical conventions agreed in this chapter. If an expression is meaningless, then say so, and
explain why; if an expression is meaningful, then assign it to one of the following classes:
number,
set,
sequence,
sentence,
operator;
for sentences, state, in addition, whether they are true or false.
1. {0}
2. ∅
3. ∅ ∈ ∅
4. 1 + 1 ⇒ 2
5. 3/2
6. 3 ∖ 2
7. 2 ⊆ {3}
8. 2 ∈ {3}
9. {} ∉ {}
10. {1, {1}}
11. {∅} = ∅
12. {2 = 2, 2 ≠ 2}
13. ∩
14. 2 ∪ 1
15. {∩} ∪ {∪}
16. ∖
17. (b, b, b, . . .)
18. ((1), (1, 2), (1, 2, 3))
19. (1, 2) ⊂ (1, 2, 3)
20. (2, 2) > (1, 1)
21. (1, 2) = (2, 1)
22. {(−1, 0, 1})
23. ({−1, 0, 1})
24. {(−1, 0, 1)}
25. {+, −, ×, /}
26. (+, −, +, −, . . .)
27. {{{+}}}
[Hint: When dealing with the empty set, think of it as {}.]
[Ans: there are 7 meaningless expressions, 1 number, 8 sets, 4 sequences, 5 sentences, 2 operators.]
EXERCISE 1.6. Write ten expressions of the type above, five of which are meaningless.
EXERCISE 1.7. Let p_n be the nth prime number: (p_1, p_2, p_3, . . .) = (2, 3, 5, . . .). Test your grasp
of sequence notation, by writing out the first few terms of the following sequences, for n ≥ 1:
p_{n+1},   p_n + 1,   2p_n,   p_{2n},   p_n²,   p_{n²}.
2
Sample Space and Events
The general setting for Probability is this: we perform an experiment and we record an outcome.
Outcomes must be well-defined, mutually exclusive, and cover all possibilities. To describe these
phenomena we shall employ the mathematical language of sets, see section 1.2.
Definition. The sample space of an experiment is the set of all possible outcomes. It is denoted
by Ω, and its elements —the outcomes of an experiment— are denoted by ω.
Any non-empty set can be a sample space. A sample space with a single element is formally
legitimate, if pointless.
Definition. An event is a subset of the sample space. The event A occurs if the outcome is an
element of A. An event comprising just one element of Ω is called elementary or simple.
Make sure you have assimilated the new vocabulary:
Ω (sample space)
ω (outcome)
{ω} (elementary event).
We write some sentences, in two different ways:
symbols      words
A ⊆ Ω        A is an event
A = {ω}      A is an elementary event
A = ∅        A is the impossible event
A = Ω        A is the certain event
We compose some short phrases and sentences using these new words:
A finite sample space. Two elementary events. Distinct outcomes.
If event A occurs, so does event B. Events A and B are not disjoint.
Provide illustrative examples.
EXAMPLE. A coin is tossed three times and the sequence of heads/tails is recorded. We let
Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}
where htt means the first toss is a head, the second is a tail, the third is a tail, etc. The elements
of the set Ω are words made of three letters from the alphabet {h, t}. Since the order in which
the letters appear is essential and repetition is allowed, we recognise that a word is a sequence in
disguise: hht means (h, h, t).
Let A be the event described in words as ‘the second toss is a tail’. Then
A = {hth, htt, tth, ttt}.    (3)
We see that A ⊂ Ω.
The sample space could equally be represented as a set of binary sequences²
Ω = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)},    (4)
or even as a set of integers: Ω = {1, 2, 3, 4, 5, 6, 7, 8}, provided that we agree on what each integer
represents. (The last representation is rather cryptic.)
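The first representation of Ω can be generated rather than typed out; a sketch using Python's itertools (variable names ours):

```python
from itertools import product

# All words of length 3 over the alphabet {h, t}.
omega = {"".join(w) for w in product("ht", repeat=3)}
assert len(omega) == 8

# The event A: 'the second toss is a tail'.
A = {w for w in omega if w[1] == "t"}
print(sorted(A))  # ['hth', 'htt', 'tth', 'ttt']
```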
EXAMPLE. Take an exam repeatedly until you pass. Here the sample space is infinite, and may
be represented in many ways. For instance, Ω could be a set of words from the alphabet {p, f},
representing pass and fail, respectively.
Ω := {p, fp, ffp, fffp, . . .}.
Let A be the event ‘It takes me at least three attempts before I pass’. Then
A := {ffp, fffp, ffffp, . . .}.
The elements of Ω may be replaced by binary sequences, writing (0, 0, 0, 1) in place of fffp. Alternatively, one could just record the number of attempts, or the number of failures, that is,
Ω = N = {1, 2, 3, . . .}
or
Ω = {0, 1, 2, . . .}.
These representations are more efficient than the previous ones —they use less information.
EXAMPLE. We record the birthday of N people, represented as terms of an N-element sequence.
Each birthday (a date) will be encoded as an integer between 1 and 365 (or 366, if we count leap
years). An outcome takes the form:
ω = (b1, b2, b3, . . . , bN),   bk ∈ {1, 2, . . . , 365},
and hence
(32, 32, 32, . . . , 32)
says that everyone was born on February 1st. The event described as ‘There are no repeated
birthdays’ is the set of all N-element sequences of the type above, which don’t have any repeated
element.
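For a toy value of N the event ‘no repeated birthdays’ can be listed exhaustively; a sketch with N = 2 (our choice, to keep the sample space small):

```python
from itertools import product

N, days = 2, 365
omega = list(product(range(1, days + 1), repeat=N))   # all birthday sequences
assert len(omega) == days**N

# Outcomes whose terms are all distinct: no repeated birthdays.
no_repeat = [w for w in omega if len(set(w)) == N]
print(len(no_repeat))  # 365 * 364 = 132860
```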
To summarise, when we analyse a probability problem, the first question to ask is: What type
of object will an outcome ω be? It could be a number, a sequence, a set, etc. If ω is a number,
then: What type of number is it (integer, rational, real)? Suppose instead that ω is a composite
object (a sequence or a set). What type of objects are the elements of ω (numbers, symbols, etc)?
Then, obviously: Is ω finite or infinite?, and if it’s finite, then: Do all outcomes have the same
number of elements?
² sequences of 0s and 1s
Now we collect together all possible outcomes, and form the sample space Ω. Is the sample
space finite or infinite?
Finally, we identify the event A we are interested in. This is done by specifying certain properties of the outcomes. Normally, not all outcomes will have such properties; selecting those that do,
we obtain the event A —a proper subset of Ω. If all outcomes have these properties, then we obtain
the certain event A = Ω; if no outcome has these properties, then we obtain the impossible event
A = ∅. In any case, if Ω is finite, then A must be finite; otherwise: Is A finite or infinite?
You will have noted that we have asked many questions about structures, without reference to
examples or calculations. This approach is necessary in higher mathematics.
2.1
New events from old ones
Let A and B be events. Since events are sets, we can construct new events by combining old
events using set operators: complement, union, intersection, etc. This procedure is mirrored
in the sentences that define the events, where set operators become logical operators —see (2).
Specifically, taking the complement corresponds to the negation ‘not’, union becomes ‘or’, and
intersection becomes ‘and’.
sentence                           set      set operator
A does not occur                   Ac       complement
A and B both occur                 A ∩ B    intersection
At least one of A and B occurs     A ∪ B    union
A occurs and B does not            A ∖ B    difference
Exactly one of A and B occurs      A △ B    symmetric difference
The following symbolic sentences are true for any event A in Ω:
Ac = Ω ∖ A    A ∩ Ac = ∅    A ∪ Ac = Ω    Ωc = ∅    A △ A = ∅.    (5)
Make sure you understand their meaning.
2.2
A probability function
We want to assign a numerical value to an event, which reflects the chance that it occurs. The
corresponding mathematical structure is a function P, which, for every event A of a sample space
Ω, assigns a number P(A), which we call the probability of the event A.
We start with a limited brief: we assume that Ω is finite —the simplest scenario— and we seek
the simplest function P that fits the above description, postponing the details (e.g., characterising
the domain of P). Thus we define P(A) to be the ratio of the number of elements of A to the total
number of elements in Ω. In symbols:
P(A) := (# elements of A) / (# elements of Ω) = |A| / |Ω|,    (6)
where |X| denotes the number of elements (or the cardinality) of a finite set X.
EXAMPLE. Let Ω = {h, t}. Then P({h}) = |{h}|/|Ω| = 1/2, in agreement with our expectation.
In chapter 3 we shall see that many different probabilities may be defined over the same set Ω.
The one given above assumes that all elementary outcomes are equally likely, which may or may
not be a reasonable description of the phenomenon under study. In the example of three coin tosses
given above [equation (4)], this is sensible if the coin is fair, in which case we would get that the
event A in (3) has probability 4/8 = 1/2. However, if the coin is biased, then this would not be a
reasonable notion of probability.
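Definition (6) is straightforward to implement for the three-toss example; a sketch (exact fractions avoid rounding):

```python
from fractions import Fraction
from itertools import product

omega = {"".join(w) for w in product("ht", repeat=3)}

def P(A):
    """The equally-likely probability P(A) = |A| / |Omega| on a finite sample space."""
    return Fraction(len(A), len(omega))

A = {w for w in omega if w[1] == "t"}   # 'the second toss is a tail'
print(P(A))  # 1/2
```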
We also run into difficulties if Ω is infinite. If Ω = N (the set of positive integers), then there
is no reasonable way of assigning a probability to the integers in such a way that all elementary
outcomes are equally likely (why?). However, there are ways of assigning a probability to the
integers, with the property that any outcome has a positive chance of occurring.
If Ω is a bounded region of the plane (the disc of unit radius, for instance), then there are similar
problems with defining probability in terms of ratios. However, here we do have some notion of
choosing an element of Ω in such a way that all possibilities are equally likely. We could do this
by setting the probability of an event (a region in Ω) to be its area.
In the next section we will see how to define a mathematical notion of probability which can
deal with outcomes not all being equally likely, and with infinite sample spaces.
From definition (6) we immediately deduce that
0 ≤ P(A) ≤ 1.
What events have zero probability? We must have |A|/|Ω| = 0, namely |A| = 0, that is, A = {} = ∅.
Similarly, for A to have probability 1, we must have |A|/|Ω| = 1, namely |A| = |Ω|. But since
A ⊆ Ω, then A = Ω, necessarily.
Let now A1 and A2 be mutually exclusive events, by which we mean A1 ∩ A2 = ∅. Then
|A1 ∪ A2 | = |A1 | + |A2 | (think about it), and hence
P(A1 ∪ A2) = |A1 ∪ A2| / |Ω| = (|A1| + |A2|) / |Ω| = |A1|/|Ω| + |A2|/|Ω| = P(A1) + P(A2).
We express this result in one sentence, without symbols.
The probability of the union of two disjoint events is equal to the sum of the probabilities of the individual events.
If A1 , A2 , A3 have the same property, then, using the argument above twice, we find:
P(A1 ∪ A2 ∪ A3) = P(A1 ∪ A2) + P(A3)
= P(A1) + P(A2) + P(A3).
We can generalise this to any finite number of mutually exclusive events.
EXERCISES
In the following exercises, we shall employ the notation developed above, namely:
symbols    words
Ω          a sample space
ω          an outcome
P          a probability
A, B       events
Ac         the complement of A
|A|        the cardinality of A
EXERCISE 2.1. Write the symbolic sentences (5) using only words.
EXERCISE 2.2. Let A, B, C be events. Write the following events as sets, using only symbols.
1. A does not occur or B occurs.
2. A occurs and B does not occur.
3. Neither A nor B occur.
4. At most one of A or B occurs.
[Mid-term test 2016]
5. Exactly one of A or B occurs.
6. At least one of A or B occurs.
7. Of A, B, and C,
i) only A occurs;
ii) exactly two of them occur.
[Final examination 2016.]
EXERCISE 2.3. Some of the following statements/expressions are false/meaningless. Identify all
faulty items and explain briefly what the fault is; the shorter the explanation, the better.
1. An event is something that happens.
2. All events contain outcomes.
3. Every outcome is an event.
4. Every elementary event is an outcome.
5. The intersection of two elementary events.
6. The complement of the impossible event.
7. The sample space is an event.
8. An infinite event.
9. The compliment of the sample space.
10. Every outcome is an element of some event.
[Hint: There are five faulty items.]
EXERCISE 2.4. Let Ω be a set of triangles.
1. Give an example of an elementary outcome.
2. Give an example of an event.
3. Give an example of two mutually exclusive events.
Do the same in the case in which Ω is a set of
i) prime numbers,
ii) books,
iii) polynomials.
EXERCISE 2.5. Four people, A, B, C, and D, are supposed to be attending an event, at which their
attendance is recorded.
i) Read through this exercise, and identify the items in which the word ‘event’ is used with two
different meanings.
ii) Write down the sample space as a set of sets. Explain your notation.
iii) Write down the sample space as a set of binary sequences³. Explain your notation.
iv) Using both notations, write down the event “Exactly three of them attend the event” as a set.
v) Using both notations, write down the event “A attends the event and D does not” as a set.
vi) Write down another event both as a sentence and as a set, using both notations.
vii) Give an example of an experiment whose outcomes could be represented with the notation
iii).
EXERCISE 2.6. You are given expressions, some meaningful, others meaningless (according to the
grammatical conventions agreed in this chapter). If an expression is meaningless, then say so and
give a reason (as opposed to producing a correct expression.) If an expression is meaningful, then
assign it to one of the following classes:
number,
set,
function,
sequence,
sentence,
operator.
In this exercise, you may assume that ω is not a set. (In general, it could be.)
³ A binary sequence is a sequence of 0s and 1s.
EXAMPLES:
A ∩ Bc     Set;
2 ∖ {1}    Meaningless: illegal left operand of set difference operator.
1. {ω}c
2. {0c}
3. {{ω}}c
4. Ω ⊆ {ω}
5. 1 ⊂ Ω
6. {ω} ∈ Ω
7. |A| − 1
8. |A| × |B|
9. |A| ∖ |B|
10. A ⊃ B
11. A ≤ B
12. A ∩ ∅
13. ω ∉ {{ω}}
14. {ω} = Ω
15. ω ∈ {ω}
16. Ω ⇒ ω
17. ω < Ω
18. Ω △ Ω
19. Ω ∖ Ω
20. Ω/Ω
21. ({ω} ∖ Ω) ⊂ {ω}
22. △{ω}
23. (A △ A)c
24. Ω △ ∅
25. P
26. P ∩ P
27. {P}
28. ∅ ∖ ∅ = 0
29. ∅ ∈ P(∅)
30. P(A, B)
31. P(Ac)
32. P(A)c
33. P((Ac)c)
34. 1 − P(A)
35. P(A) ∖ 1
36. P(∅)²
37. 1/P({0})
38. P(A) − P(B)
39. P(A) ∖ P(B)
40. P({2}) < 2⁻¹
41. (P, P, P, . . .)
42. P(P(A))
43. P(A) ∩ P(B)
44. P(A ∩ B)
45. (((Ωc)c)c)c
46. P(A ≠ B)
47. {P(A)}
48. P(A ∪ B) ⇒ P(A)
[Ans: Number of occurrences of listed categories: number (9), set (10), function (1), sequence (1),
sentence (8), operator (0), meaningless (19).]
3
The Axioms of Probability
A probability is a function satisfying certain conditions, called Kolmogorov’s axioms. We have
seen an example of such a function in section 2.2, and the idea is to identify a larger class of
functions of similar type. The axioms are chosen so as to make the construction plausible, and as
general as possible. Recall that the sample space Ω is a set, and that an event A is a subset of Ω.
Definition (Kolmogorov’s Axioms for Probability). Let Ω be a finite non-empty set. A probability
P is a function which assigns to each subset A ⊆ Ω a real number P(A) such that:
I. P(A) ≥ 0
II. P(Ω) = 1
III. If (A1, A2, . . . , An) is a sequence of pairwise disjoint subsets of Ω (that is, Ai ∩ Aj = ∅ for
all i ≠ j), then
P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An).    (7)
First, some informal considerations. We think of P as a function that assigns a weight to the
subsets of Ω. Axiom I says that the weight of a set must be non-negative, and axiom II says
that Ω itself has unit weight. Axiom III says that the total weight of a collection of sets with no
common elements (or mutually exclusive events, in the language of probability) may be computed
by adding up the weights of the individual sets.
Next, we analyse the definition in detail.
1. The first sentence introduces a set Ω with certain properties, which will be kept fixed for the rest
of the definition. At the beginning of the second sentence
A probability P is a function which . . .
we note two indefinite articles, denoting lack of uniqueness: the object being defined is not completely specified by the listed properties. Discounting the obvious fact that different choices of the
set Ω lead to different functions P, we note a more substantial point. On the same set Ω, it is possible to define more than one probability function (indeed infinitely many of them —see exercises).
The indefinite articles in the definition officially confirm, as we anticipated, that the function P
introduced in section 2.2 is just one among many.
2. The domain of P is the set of all subsets of Ω, called the power set of Ω, and denoted by P(Ω).
Thus
P({a, b, c}) = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}.
The co-domain of P is the set of real numbers, so we write:
P : P(Ω) → R.
By virtue of Axiom I, the co-domain of P may be taken as the set of non-negative real numbers.
Soon we shall see that such a co-domain may be restricted further, to the closed unit interval (the
interval with endpoints 0 and 1, both included).
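The power set of a small set can be enumerated with Python's itertools; a sketch (the helper name power_set is ours):

```python
from itertools import chain, combinations

def power_set(s):
    """All subsets of s, returned as frozensets."""
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

subsets = power_set({"a", "b", "c"})
print(len(subsets))  # 8 = 2^3 subsets, from the empty set to {a, b, c} itself
```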
3. Ω is a subset of Ω (that is, Ω ∈ P(Ω)), so the expression P(Ω) in II makes sense.
4. In III, the expression A1 ∪ A2 ∪ · · · ∪ An denotes the collection of the elements of Ω which belong
to at least one set Ai. This is a well-defined⁴ subset of Ω. Hence the expression P(A1 ∪ A2 ∪ · · · ∪
An ) is legitimate.
5. The sum in (7) may be written compactly using the sigma-notation:
P(A1) + P(A2) + · · · + P(An) = ∑_{i=1}^{n} P(Ai).    (8)
The summation index i is a dummy variable: its role is mere book-keeping, and so its identity is
irrelevant:
∑_{i=1}^{n} P(Ai) = ∑_{k=1}^{n} P(Ak) = ∑_{1≤j≤n} P(Aj).    (9)
(The rightmost form of sigma notation can be quite useful —see exercises.) There is an analogous
notation for repeated unions:
A1 ∪ A2 ∪ · · · ∪ An = ⋃_{i=1}^{n} Ai = ⋃_{1≤i≤n} Ai.
6. If Ω is infinite, then one has the impression (which turns out to be erroneous) that all that is
needed is to amend Axiom III as follows:
III. If (A1, A2, . . .) is a sequence of pairwise disjoint subsets of Ω (that is, Ai ∩ Aj = ∅ for all
i ≠ j), then
P(A1 ∪ A2 ∪ · · · ) = P(A1) + P(A2) + · · ·    (10)
Now the sequence of events, their union, and the summation, can be finite or infinite, and the
open-ended notation works well in both cases. Unfortunately, there are two unresolved problems:
I. We haven’t explained the meaning of an infinite sum:
∑_{i=1}^{∞} P(Ai) = ∑_{i≥1} P(Ai) = P(A1) + P(A2) + · · · .
It is not sufficient just to think of adding up infinitely many numbers. For this course we circumvent
this difficulty by only using known infinite sums, such as the sum of a geometric progression.
II. If Ω is not countable, meaning that it is impossible to arrange its elements to form a sequence (e.g., Ω is an interval, or a disc), then not every subset of Ω can be an event. Some subsets
⁴ Formally, the definition of this set rests on the identity (A ∪ B) ∪ C = A ∪ (B ∪ C), valid for any sets A, B, C, and the induction principle (see the Mathematical Structures course).
are so weird that they can’t be assigned probabilities in any meaningful way. The reason for this is
subtle, and beyond the scope of this module⁵. This will not concern us: every reasonable set you
would ever want to consider in this course does not have this problem.
EXAMPLE. If Ω is finite, then you can check that setting P(A) = |A|/|Ω| gives a probability. This
is the case if every outcome in the sample space is equally likely. However, you should not assume
that every outcome is equally likely without good reason.
EXAMPLE. A coin need not be fair. So we may let Ω = {h, t} with
P({}) = 0,   P({h}) = 1/3,   P({t}) = 2/3,   P({h, t}) = 1.
This function satisfies the axioms, but it is not of the form |A|/|Ω|.
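Since Ω here is tiny, the claim that this function satisfies the axioms can be verified exhaustively; a sketch (exact fractions avoid rounding):

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({"h", "t"})
weight = {"h": Fraction(1, 3), "t": Fraction(2, 3)}

def P(A):
    return sum(weight[w] for w in A)

subsets = [frozenset(c) for c in
           chain.from_iterable(combinations(omega, r)
                               for r in range(len(omega) + 1))]

assert all(P(A) >= 0 for A in subsets)              # Axiom I
assert P(omega) == 1                                # Axiom II
assert all(P(A | B) == P(A) + P(B)                  # Axiom III, disjoint pairs
           for A in subsets for B in subsets if not (A & B))
print("all axioms hold")
```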
3.1
Consequences of the axioms
From the axioms, we shall infer several properties that any probability P must have. Hopefully,
these will agree with our intuition about probability (if they did not, then this would suggest that
we had not made a good choice of axioms). The proofs of all of these are simple deductions from
the axioms, which we will use extensively in the rest of the course.
Proposition 3.1. For every event A we have P(Ac ) = 1 − P(A).
PROOF. Let A be any event. Set A1 = A and A2 = Ac. By definition of the complement, A1 ∩ A2 =
∅ and so we can apply Axiom III (with n = 2) to get
P(A1 ∪ A2 ) = P(A1 ) + P(A2 ).
But (again by definition of the complement), A1 ∪ A2 = Ω, so
P(Ω) = P(A1 ) + P(A2 ) = P(A) + P(Ac ).
By Axiom II we have P(Ω) = 1 and so
1 = P(A) + P(Ac).    (11)
Rearranging this gives the result. 

Notice that each line of the proof which is not a simple manipulation is justified by one of the
axioms or a definition. We can use the results we have proved to deduce more results, such as the
following corollary.
Corollary 3.2. P(∅) = 0.

P ROOF. By definition of complement, we have Ωc = ∅. Hence by Proposition 3.1

P(∅) = 1 − P(Ω) = 1 − 1 = 0,

where the second equality uses Axiom II. 

Corollary 3.3. If A is an event, then P(A) ≤ 1.

5 If you want to find out more about this then you could read up on non-measurable sets and the Banach-Tarski paradox, but be aware that this is post-graduate mathematics.
P ROOF. We rewrite identity (11) as P(A) = 1 − P(Ac). The result now follows from the fact that P(Ac) ≥ 0, from Axiom I. 

From Axioms I, II, and the last two corollaries, it follows that the set of values assumed by a probability is a subset of the closed unit interval; moreover, such a subset must contain both endpoints 0 and 1 (see exercises).
Proposition 3.4. If A and B are events and A ⊆ B, then P(A) ≤ P(B).

P ROOF. We have B = (B r A) ∪ A and (B r A) ∩ A = ∅. Then Axiom III gives P(B) = P(B r A) + P(A), and the result follows by noting that P(B r A) ≥ 0, from Axiom I. 

Proposition 3.5. If A = {ω1, ω2, . . . , ωn} is a finite event, then
P(A) = ∑_{i=1}^{n} P({ω_i}).    (12)
Proposition 3.6 (Inclusion-exclusion for two events6). For any two events A and B we have

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

The idea here is that if we just take P(A) + P(B) then we have “counted A ∩ B twice” so this must be subtracted. Here is a proof:

P ROOF. The events A r B, B r A and A ∩ B are pairwise disjoint and their union is A ∪ B. Hence

P(A ∪ B) = P(A r B) + P(B r A) + P(A ∩ B) (Axiom III).

Also (again by Axiom III),

P(A) = P(A r B) + P(A ∩ B),    P(B) = P(B r A) + P(A ∩ B).

It follows that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B). 

6 If we are in the situation where P(A) = |A|/|Ω| then this reduces to an expression involving cardinalities of sets. This was the version of inclusion-exclusion that you saw in Mathematical Structures.
Proposition 3.7 (Inclusion-exclusion for three events). For any three events A, B and C we have
P(A ∪ B ∪C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩C) − P(B ∩C) + P(A ∩ B ∩C).
As in the case of two events, the above formula has an intuitive interpretation: the first three
terms over-count; the following three terms are meant to compensate, but we end up undercounting, which requires a final adjustment. (Do the book-keeping carefully: at first sight it would
seem that there should be a factor 2 in front of the last term.)
P ROOF. The quantity to be determined may be written as P((A ∪ B) ∪ C), and the computation
reduces to the case of two events. The proof now involves applying proposition 3.6 three times,
using the set identity7
(A ∪ B) ∩C = (A ∩C) ∪ (B ∩C).
From the above we obtain
P((A ∪ B) ∩C) = P(A ∩C) + P(B ∩C) − P((A ∩C) ∩ (B ∩C))
= P(A ∩C) + P(B ∩C) − P(A ∩ B ∩C).
Keeping this identity in mind, we compute
P(A ∪ B ∪C) = P((A ∪ B) ∪C) = P(A ∪ B) + P(C) − P((A ∪ B) ∩C)
= P(A) + P(B) − P(A ∩ B) + P(C) − P((A ∩C) ∪ (B ∩C))
= P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩C) − P(B ∩C) + P(A ∩ B ∩C),
as desired. 

Propositions 3.6 and 3.7 supersede Axiom III for n = 2 and n = 3, respectively, in that they don’t require the assumption that the events be disjoint. There is also an inclusion-exclusion formula for n events. (Can you work it out and prove it?) However, as n increases, these formulae become increasingly complicated.
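Both inclusion-exclusion formulae are easy to confirm by direct enumeration on a small finite sample space. A Python sketch (illustrative only; A, B, C are arbitrary random subsets of a 20-point sample space with equally likely outcomes):

```python
import random

random.seed(0)
omega = range(20)
P = lambda A: len(A) / len(omega)          # equally likely outcomes

A, B, C = (set(random.sample(omega, 8)) for _ in range(3))

# Proposition 3.6 (two events):
assert abs(P(A | B) - (P(A) + P(B) - P(A & B))) < 1e-12

# Proposition 3.7 (three events):
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
assert abs(lhs - rhs) < 1e-12
```

Running the same check with other seeds, set sizes, or sample spaces gives the same agreement, which is a useful sanity test before attempting the n-event generalisation.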
3.2 Applications
We consider some applications of the results of the previous section.
A PPLYING PROPOSITION 3.5. If Ω = {ω1 , . . . , ωn } is finite, then, taking A = Ω in proposition 3.5,
we see that we can define a probability P on Ω as follows. First, we choose n non-negative real
numbers a1 , . . . , an whose sum is 1. Then, we let P({ωk }) = ak for k = 1, . . . , n. This function P
satisfies Kolmogorov’s axioms (see exercises).
The following extreme choice is instructive:
P({ω1}) = 1,    P({ω2}) = P({ω3}) = · · · = P({ωn}) = 0.    (13)

7 See Mathematical Structures.
All the weight of the sample space is concentrated on a single elementary event; it is plain that the
sum of the probabilities of the elementary events is 1. Now let
B = {ω2 , ω3 , . . . , ωn }
C = {ω1 }
D = {ω1 , ω2 }.
From proposition 3.5, we obtain

P(B) = 0,    P(C) = P(D) = 1.
We have constructed a probability for which the impossible event is not the only event that has
zero probability, and the certain event is not the only event that has probability 1. Furthermore,
we have C ⊂ D (proper inclusion) but P(C) = P(D), which illustrates a feature of proposition 3.4
which may have been overlooked.
Using proposition 3.5 one can show that there are infinitely many distinct probability functions
over any finite set Ω, as long as |Ω| > 1 (see exercises).
A PPLYING PROPOSITIONS 3.1 AND 3.7. Suppose that a certain quantity x cannot be computed
exactly. Bounds for x are two real numbers a and b such that a ≤ x ≤ b. If b − a is small, then such
bounds provide valuable information about x.
Now let A, B, C be events with

P(A) = P(B) = P(C) = 1/3,    P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/10.
We are interested in the probability that none of A, B, C occurs, namely in P((A ∪ B ∪ C)^c), but from the given data we cannot compute this quantity exactly. So we look for bounds.
From Propositions 3.1 and 3.7, we have:

P((A ∪ B ∪ C)^c) = 1 − P(A ∪ B ∪ C)
 = 1 − P(A) − P(B) − P(C) + P(A ∩ B) + P(A ∩ C) + P(B ∩ C) − P(A ∩ B ∩ C)
 = 1 − 3 · (1/3) + 3 · (1/10) − P(A ∩ B ∩ C) = 3/10 − P(A ∩ B ∩ C).

However

0 ≤ P(A ∩ B ∩ C) ≤ P(A ∩ B) = 1/10,

and therefore

2/10 ≤ P((A ∪ B ∪ C)^c) ≤ 3/10.

These are the desired bounds, which identify P((A ∪ B ∪ C)^c) with a precision of 0.1.
3.3 The language of probability
The language of probability is a peculiar dialect, which employs non-mathematical terms for well-established concepts (e.g., ‘event’ to mean ‘set’), as well as idiomatic expressions. Some of these expressions are grammatically incorrect, but they also have a certain street appeal, much like “innit” or “I woz ’ere.”

We shall now spell out clearly why certain expressions are sloppy, which sloppy expressions are deemed acceptable, and which are not. This exercise will sharpen our knowledge of set terminology and notation, and improve our understanding of the foundations of probability theory.
In probability one routinely writes P(ω) for P({ω}). The former expression is incorrect, since
ω is not an element of the domain of P. But once we stipulate that this is acceptable, then we can
rewrite (12) as

P(A) = ∑_{i=1}^{n} P(ω_i).
The new expression is lighter, and its meaning is clear. However, when we use this shorthand, we
must remember that we are writing something to mean something else. Furthermore, the above
licence is restricted to the expression P(ω). For instance, the expression P(ω1 ∪ ω2) is not acceptable: one must write P({ω1} ∪ {ω2}). Let us state our first agreement on sloppiness:
If the argument of a probability function is an elementary event, then the latter may be
replaced by the corresponding outcome.
Next we consider proper language vs. dialect for set definitions. Suppose that we want to define
a certain subset A of a set Ω of integers.
Let A be the set of all positive integers in Ω.
The set A cannot be specified by listing its elements, since we don’t know what Ω is. A more
general —and powerful— way to write a set symbolically is given by
A = {ω ∈ Ω : ω > 0}.
(14)
The curly brackets are the signature of a set, but the expression inside employs a new syntax. On
the left of the separator ‘:’ a membership statement (ω ∈ Ω) tells us that the set being defined is a
subset of Ω, and that the elements of Ω are called ω. On the right there is a sentence about ω (‘ω
is positive’). The set A is defined as the collection of those ωs for which such a sentence is true.
Then A is necessarily a subset of Ω, that is, an event.
Note that in expression (14) the symbol ω is a dummy variable [much like the summation
index in (9)], which can be changed to any other symbol:
{ω ∈ Ω : ω > 0} = {♦ ∈ Ω : ♦ > 0}.
To emphasise the auxiliary role of dummy variables, we translate these set definitions into words,
without any reference to variables:
‘The set of elements of Ω which are positive.’
Now we turn to the probability dialect. As an event, the set A would normally be defined as
follows:
A = ‘The outcome is a positive integer’
or
A = ‘ω > 0’.
(15)
These expressions are plainly wrong because an event is a set (the set of all ω which are positive),
not a sentence (ω is positive). However, replacing an event by its defining sentence [the two being
linked by (14)] is very much part of the probability idiom, and we shall agree that this transgression
is not a mistake.
This permissive state of affairs can lead to colloquial hybrids such as
A =‘I pick a positive integer’.
While this engaging style may be appealing, we see that the drier sentences (15) convey mathematical information more clearly.
This problem appears again when we write about probabilities. Using (14), the expanded
expression for P(A) is unsightly:
P({ω ∈ Ω : ω > 0})
and so we shall accept the sloppy but leaner form
P(ω > 0).
We must now remember that the argument of the function P is incorrect, since the domain of P is
a set of sets (the power set of Ω), not a set of sentences. We shall draw the line here though; the
minimalist expression
P(positive)
whereby a sentence has been further reduced to an adjective, is not acceptable.
The set/sentence duality of events may be exploited to write the same thing in two different
ways:
S ETS :       A ⊆ B             ‘The set A is contained in the set B’
S ENTENCES :  ω ∈ A ⇒ ω ∈ B     ‘If A occurs, then B occurs’.
One could even replace the second expressions by ‘The event A implies the event B’. Be
aware, however, that this linguistic convention does not apply outside probability. A statement
such as ‘The set A implies the set B’ is not acceptable in mathematics!
We shall learn more dialectal expressions later on, including the catchy and very prominent
expression random variable (see chapter 8), which will turn out to be non-random, and indeed
not even a variable!
As attractive as it may be, the informal language of events brings about an unexpected difficulty:
if events are written as sentences, how can we distinguish between an event and a sentence about
an event? For instance, let Ω be a set of books, and let
A = ‘The chosen book is a novel’
B = ‘The chosen book is boring’.
(We assume that such events are well-defined.) Now consider the sentences:
‘The chosen book is an interesting novel’
‘All novels are boring’.
(16)
It turns out that the first sentence represents an event, while the second does not. To see this, we
translate them into symbols:
A ∩ Bc
A ⊆ B.
The first quantity is a set, the second a sentence; thus P(A ∩ Bc ) is meaningful while P(A ⊆ B) is
meaningless.
Rather than analysing subtle syntactical differences between the sentences (16), we merely
observe that, when an ambiguity arises, it is advisable to abandon the probability dialect, and use
instead the unambiguous language of sets.
E XERCISES
E XERCISE 3.1. Write out the following sums in full.
3
1)
∑ P(Ak )2
3
2)
∑ P(A j−2)
3)
5)
∑ P(A j2 )n− j
6)
j=3
∑
P({k})
|k|<3
i=1
n
k=1
8
4)
∑ P({ωi})i−1
j=1
P({k2 })
∑
2
k <8
[Ans: 6) P({0}) + 2P({1}) + 2P({4}).]
E XERCISE 3.2. Express the following statements using only words:
i) Kolmogorov’s axioms.
ii) All propositions and corollaries stated in this chapter.
E XERCISE 3.3. We change the dictionary of probability as follows:
set 7→ bag
outcome 7→ apple
probabilty 7→ weight.
Formulate Axiom I, Corollary 3.2, Corollary 3.3, and Proposition 3.5 using this new language.
[Hint: What is an event?]
E XERCISE 3.4. Describe the inclusion-exclusion formulae in fewer than 100 words and without
using any mathematical symbol.
[Hint: Do not attempt a detailed description.]
E XERCISE 3.5. From the inclusion-exclusion formula for n = 3, derive explicit formulae for P(A∪
B ∪C) for the following special cases:
1. A ⊆ B.
2. A ⊆ B ⊆ C.
3. A and B are disjoint.
4. All events are pairwise disjoint.
E XERCISE 3.6. Write the following events as sets, using only symbols.
1. B occurs if A occurs.
2. A occurs if and only if B occurs. [Hard: Mid-term test 2014.]
[Hints: 1. This is the set of ω such that the sentence ω ∈ A ⇒ ω ∈ B is true. 2. This means: A
occurs if B occurs, and A does not occur if B does not occur.]
E XERCISE 3.7. [Final examination 2016.]
you use.
Prove the following statements, quoting any result
1. For all A, we have P(A ∩ Ac ) = 0.
2. The probability that exactly one of the events A or B occurs is equal to P(A) + P(B) − 2P(A ∩ B).
E XERCISE 3.8. Let (A1 , A2 , . . . ) be a sequence of subsets of Ω, such that Ai ∩ A j = ∅ for all i 6= j.
Prove that if Ω is finite, then only finitely many of the Ai can be non-empty.
E XERCISE 3.9.
1. Define all probability functions on the sample space Ω = {0, 1}. Show that there are infinitely many such functions, and that each function assumes 2, 3, or 4 distinct values. [Hint:
use proposition 3.5].
2. Let |Ω| = 3. For n = 2, 3, . . . , 8, give an example of a function P which assumes precisely n
distinct values.
3. [Hard.] Let |Ω| = n. Give an example of a probability on Ω which assumes 2^n distinct values. [Hint: first show that |P(Ω)| = 2^n, where P(X) denotes the power set of a set X; then define an injective probability function.]
E XERCISE 3.10. Let Ω = {1, . . . , n}, and let a1 , . . . , an be non-negative real numbers such that
a1 + · · · + an = 1. Show that the function
P : P(Ω) → R,    P(A) = ∑_{k∈A} a_k
is a probability, namely it satisfies Kolmogorov’s axioms.
[Hint: Axioms I and II are easy; Axiom III requires a basic property of finite sums.]
E XERCISE 3.11. Let Ω be a non-empty set, and let ω ∗ be an arbitrary element of Ω. We define a
function P as follows:
P : P(Ω) → R,    P(A) = 1 if ω∗ ∈ A,  and  P(A) = 0 if ω∗ ∉ A.
Show that P is a probability.
E XERCISE 3.12. Let the sample space be the set N of natural numbers. Provide an example of
each event (or sequence of events) specified below. Describe events as sets (we provide a range of
valid answers for item 1).
1. A finite event.
[Some valid answers: {1,2} [boring]; ∅ [cheeky]; the set {n, . . . , n + m}, for some m, n ∈ N
[good ]; the set of primes with fifty decimal digits [good ].]
2. The certain event.
3. An infinite event.
4. Two finite events which are not mutually exclusive.
5. Two mutually exclusive infinite events.
6. Two infinite events, the second occurring if the first occurs.
7. Two infinite events which are not mutually exclusive, even though none of them implies the
other.
8. Two events with infinitely many outcomes in common.
9. An infinite event with a finite complement.
10. An infinite sequence of finite events.
11. A finite sequence of infinite events.
12. An infinite sequence of non mutually exclusive finite events.
13. An infinite sequence of infinite events.
14. An infinite sequence of mutually exclusive infinite events. [Hard ]
15. An infinite descending sequence8 of infinite events. Can all these events occur simultaneously? [Hard ]
[Hint: In items involving infinite events, consider using the sets mN := {m, 2m, 3m, . . .} (e.g., 2N
is the set of even integers).]
E XERCISE 3.13. Suppose that A, B and C are events. Prove9 the following:
i) P(A ∩ B) ≥ P(A) + P(B) − 1;
ii) If A ⊆ Bc then P(A) + P(B) ≤ 1;
iii) P(A△B) = P(A) + P(B) − 2P(A ∩ B);
iv) P(A ∪ B ∪ C) ≤ P(A) + P(B) + P(C).
v) Formulate and prove a version of iv) for n events rather than three.
[Hint: in iv) you could start by showing that P(A ∪ B ∪C) = P(A) + P(B r A) + P(C r (A ∪ B)).]
8 A sequence of sets is descending if each term of the sequence contains the next term; only the case of proper inclusion is interesting here.
9 Each step in your proofs —unless it is a straightforward manipulation— must be justified by a definition, axiom or result proved in section 3.1.
4 Sampling
If the sample space Ω is finite, and all outcomes are equally likely, then computing the probability
of an event A requires counting elements of sets:
P(A) = |A| / |Ω|.
To compute |A| and |Ω|, we are often interested in finding how many ways there are of choosing k
elements from an n-element set X. This is called sampling from a set. The results of sampling are
special types of outcome, called samples (whence the expression ‘sample space’).
4.1 Sample types
Samples are either sets or sequences, with k elements. Their nature depends on what we mean
by selection: is the order of the selected items important? Is repetition allowed? If the order is
important, then, necessarily, the outcomes will be sequences (or words, which are sequences in
disguise). Such sequences may or may not have repeated elements. If there are no repetitions and
the order is unimportant, then the outcomes will be sets. As we shall see, in the latter case it is still
possible to represent the outcomes as sequences, leading to two different ways of computing the
same thing.
We now develop a general framework for dealing with problems of this type. The analysis is
subdivided into four cases.
Ordered selection with repetition
We begin with an example. Let Ω be the set of flags with three vertical stripes, each of which can
be red, blue, green, or white. This is an ordered selection of colours (we begin from the mast, say),
allowing repeated colours.
A flag ω will be represented as a word whose letters denote the colours. Specifically, we let
Ω be the set of all 3-letter words from a 4-letter alphabet {b, g, r, w}. For instance, rbb ∈ Ω. To
compute the cardinality of Ω, we note that there are 4 choices for the first stripe; for each choice
there are 4 choices for the second stripe (because we are allowed to repeat colours); for each pair
of choices, there are 4 choices for the third stripe. In total we have
|Ω| = 4 × 4 × 4 = 4^3 = 64.
If A is the set of flags whose first stripe is red, then there is just 1 way of choosing the first letter
in the word (it must be r), while, as above, there are 4 ways of choosing the second and the third
letters. Thus |A| = 1 × 4 × 4 = 16 (indeed |A| is just the number of two-stripe flags), and hence the
probability that the first stripe of a flag chosen at random be red is given by
P(A) = |A|/|Ω| = 16/64 = 1/4.
Let us now describe the general situation. Let X be a set. If we make an ordered selection of k
elements of X with repetition (we imagine putting each selected element back into the set, so that
it could be selected again), then the sample space is the set of all k-element sequences (also called
ordered k-tuples) of elements of X:
Ω = {(x1 , x2 , . . . , xk ) : xi ∈ X for all i}.
(17)
Let n = |X|. There are n choices for x1 , for each of these there are n choices for x2 , and so on. It
follows that
|Ω| = n × n × · · · × n  (k factors) = n^k = |X|^k.
The set definition (17) has a different syntax from that seen earlier in (14). Rather than having
{membership statement : sentence about the member}
we now have
{arbitrary expression : sentence about the expression}.
With this format one can construct genuinely new sets, not just subsets of an existing set as in (14).
[The set (17) is denoted by X^k. This idiomatic expression is interpreted as a form of ‘repeated multiplication’, called a Cartesian product of sets. Using this notation, we obtain the neat identity |X^k| = |X|^k.]
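The flag count can be reproduced with itertools.product, which enumerates exactly the Cartesian product X^k (an illustrative Python sketch, not part of the original notes):

```python
from itertools import product

colours = "bgrw"
flags = list(product(colours, repeat=3))   # the Cartesian product X^3
assert len(flags) == 4 ** 3 == 64

# Event A: the first stripe is red.
red_first = [f for f in flags if f[0] == "r"]
assert len(red_first) == 16                # 1 × 4 × 4 choices
assert len(red_first) / len(flags) == 1 / 4
```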
Ordered selection without repetition
Suppose that in our flag example we are not allowed to repeat any colour. Then Ω is the set of
3-letter words from a 4-letter alphabet without repetitions: Ω = {rgb, bgr, rbw, . . .}. There are 4
choices of the first stripe, 3 choices for the second stripe, and 2 choices for the third stripe. In total:
|Ω| = 4 × 3 × 2 = 24.
More generally, if we make an ordered selection of k elements from a set X without repetition
(that is to say, we do not replace the selected item in the set), then the sample space is the set of all
k-element sequences (ordered k-tuples) of distinct elements of X. In symbols:
Ω = {(x1 , x2 , . . . , xk ) : xi ∈ X, with xi 6= x j for all i 6= j}.
To determine the cardinality of this set, notice that if |X| = n, then there are n choices for x1 ; for
each of these choices there are n − 1 choices for x2 ; for each of these there are n − 2 choices for x3
and so on. Hence,
|Ω| = n(n − 1)(n − 2) · · · (n − k + 1)  (k factors) = n!/(n − k)!    (18)

where the factorial function n ↦ n! is defined as

0! = 1,    n! = 1 · 2 · · · (n − 1) · n = ∏_{k=1}^{n} k,    n ≥ 1.    (19)
If k = n, then we select all elements from X, in some order. An arrangement of n objects is
called a permutation. So the number of permutations of n objects is equal to
n!/(n − n)! = n!/0! = n!.
The convention that 0! = 1 ensures that formula (18) is still valid in this situation. We see that from
a single n-element set we can construct n! distinct n-element sequences, by rearranging the terms.
The factorial function grows very rapidly10. We display the first few values:

n  : 1  2  3  4   5    6    7     8      9       10
n! : 1  2  6  24  120  720  5040  40320  362880  3628800

Thus a set of ten cards can be reshuffled in more than three million different ways!
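Both counts are built into Python's standard library; the sketch below (illustrative only) checks the falling factorial for the flag example and the values in the table above:

```python
from itertools import permutations
from math import factorial, perm

# Flags with three distinct colours: ordered selection without repetition.
assert len(list(permutations("bgrw", 3))) == perm(4, 3) == 24

# An n-element set has n! permutations; the table above ends at n = 10.
assert factorial(10) == 3628800
assert len(list(permutations(range(5)))) == factorial(5) == 120
```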
Unordered selection without repetition
Let Ω be the set of volleyball teams (6 players) that can be formed from a squad of 10. The order
in which the team members are listed does not change the team, and hence is irrelevant; obviously,
there cannot be any repetition. So this is an unordered selection without repetition.
We begin with an ordered selection. From formula (18) with n = 10 and k = 6, we find that there are 10!/(10 − 6)! possibilities. Since the order is irrelevant, we must now divide this quantity by the number of ways of re-arranging the team members, which is 6!. We obtain

|Ω| = 10!/(4! 6!) = (10 · 9 · 8 · 7)/(2 · 3 · 4) = 210.
Let us consider the general setting. If we make an unordered selection of k elements from a set
X, without repeated elements, then the sample space is the set of all k-element subsets of X. Thus
the sample space is a subset of the power set of X:

Ω = {A ∈ P(X) : |A| = k}    or    Ω = {A : A ⊆ X, |A| = k}.
An ordered sample is obtained by taking an element of Ω and putting its elements in order. Each
element of the sample space can be ordered in k! ways and so letting n = |X|, and using formula
(18), we obtain
k! × |Ω| = n!/(n − k)!,

and therefore

|Ω| = \binom{n}{k}    where    \binom{n}{k} = n!/(k!(n − k)!).    (20)
The expression \binom{n}{k} (read as “n choose k”) is called a binomial coefficient. By convention, if k > n, then we let \binom{n}{k} = 0.
10 The factorial grows faster than any exponential, that is, for any a > 0, we have lim_{n→∞} a^n/n! = 0.
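The volleyball computation can be checked by enumeration; the sketch below (illustrative, using the standard library's comb and perm) also confirms the k!-to-one correspondence between ordered and unordered samples used to derive (20):

```python
from itertools import combinations
from math import comb, factorial, perm

squad = range(10)
teams = list(combinations(squad, 6))       # unordered, no repetition
assert len(teams) == comb(10, 6) == 210

# Each team of 6 players corresponds to 6! ordered line-ups:
assert perm(10, 6) == factorial(6) * comb(10, 6)
```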
Unordered selection with repetition
[This item is not examinable.]
For completeness, we mention that the number of ways of choosing k elements from an n-element set, unordered but with repetition, is

\binom{n+k−1}{k}.

The proof of this statement is harder than the other three cases and will be omitted. Try proving it, and even if you fail, try to see why the naive generalisation of the argument for unordered sampling without repetition doesn’t work. In other words, why is the answer not n^k/k!?
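A quick illustrative check in Python (not part of the original notes): enumerating multisets directly confirms the formula, and shows that the naive guess fails:

```python
from itertools import combinations_with_replacement
from math import comb, factorial

n, k = 4, 3
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k) == 20

# The naive guess n^k / k! is not even an integer here:
assert n ** k / factorial(k) != len(multisets)
```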
4.2 Summary
When dealing with a sampling problem, three questions must be answered.
1. What is the set X we are selecting from? What is n = |X|?
2. What is the number k = |ω| of selected objects?
3. Will the outcomes ω be sets or sequences?
Then we have to count; a few functions suffice for many counting problems encountered in Probability, and elsewhere.
1. The powers n^k, representing the number of sequences of length k of elements of an n-element set. Thus there are 2^k sequences of k coin tosses, 10^4 available pin numbers for credit cards, and |B|^{|A|} functions from a finite set A to a finite set B.

2. The falling powers11

n^{\underline{k}} = n · (n − 1) · · · (n − k + 1),

giving the number of sequences of length k of distinct elements of an n-element set. There are 10^{\underline{4}} pin numbers without repeated digits, and 8^{\underline{3}} ways to give three medals to eight competitors.

3. The factorial function n! = n^{\underline{n}}, giving the number of ways of re-arranging the terms of an n-element sequence. There are 5! ways of placing five books on a shelf, and 4! ways of sitting four people around a table.

4. The binomial coefficient

\binom{n}{k} = n^{\underline{k}}/k!,

giving the number of k-element subsets of an n-element set. There are \binom{52}{4} ways to choose four cards from a deck, and \binom{10}{3} binary sequences of length 10 with three zeros. (Here we choose the three locations of the zeros, among ten possibilities.)
11 This terminology and notation are due to D. E. Knuth.
The following result provides a summary.
Theorem 4.1. The number of ways of selecting k elements from an n-element set is:

      Ordered?   Repetition?   #
1)    yes        yes           n^k
2)    yes        no            n^{\underline{k}}
3)    no         no            \binom{n}{k}
4)    no         yes           \binom{n+k−1}{k}
[Item 4 is not examinable.]
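The four entries of the table can be cross-checked against direct enumeration. The helper function below is our own naming, not from the notes; the sketch is purely illustrative:

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

def count(n, k, ordered, repetition):
    """The four entries of Theorem 4.1."""
    if ordered:
        return n ** k if repetition else perm(n, k)
    return comb(n + k - 1, k) if repetition else comb(n, k)

X = range(4)   # n = 4, k = 2
assert count(4, 2, True, True) == len(list(product(X, repeat=2))) == 16
assert count(4, 2, True, False) == len(list(permutations(X, 2))) == 12
assert count(4, 2, False, False) == len(list(combinations(X, 2))) == 6
assert count(4, 2, False, True) == len(list(combinations_with_replacement(X, 2))) == 10
```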
4.3 Examples
Sampling problems are tricky. They are often expressed in non-mathematical terms, and are seldom solvable by mechanically applying Theorem 4.1. Becoming fluent in this area of Probability requires extensive exercise —there are no shortcuts. Some examples of sampling problems are given below. The problem sheets contain many exercises on sampling; many more are given at the end of this chapter. You are advised to do all of them!
E XAMPLE . I have 10 distinct coins, 7 silver ones, and 3 copper ones. I select 4 at random, all
outcomes being equally likely. Consider the events
A: ‘I select 4 silver coins’
B: ‘I select 4 coins, 2 silver and 2 copper, in any order’
C: ‘I select 2 silver coins; then I select 2 copper coins’.
The coins are elements of a set, which is the union of two subsets (c for copper, s for silver):
X = Xc ∪ Xs ,
|Xc | = 3,
|Xs | = 7.
In all cases, the sampling is without repetitions. In cases A and B, the sampling is unordered, and
hence an outcome ω is a set; in case C the sampling is ordered, which means that an outcome is a
sequence or a word. For A and B it is also possible to regard the samples as being ordered, in which
case ω is a sequence. However, this is not possible for C, since samples have repeated elements
and they must be sequences.
Accordingly, we shall compute the probability of A and B in two different ways, whereas for C
there is only one approach.
First case: Ω is a set of sets. Then, using (20), we find

|Ω| = \binom{10}{4} = 10!/(4! 6!) = 210.

To compute P(A), we need |A|. We have

|A| = \binom{7}{4} = 7!/(4! 3!) = 35,

and hence

P(A) = |A|/|Ω| = 35/210 = 1/6.

To compute P(B), we need |B|. There are \binom{7}{2} two-element subsets of Xs, and for each of them there are \binom{3}{2} two-element subsets of Xc. Hence

|B| = \binom{7}{2} \binom{3}{2} = 63,

and

P(B) = \binom{7}{2} \binom{3}{2} \binom{10}{4}^{−1} = 63/210 = 3/10.
Now we deal with ordered sampling. First we consider the ordered version of Ω, A, and B
(retaining the same symbols), and then we consider the event C. The sample space is now the set
of all four-element sequences (x1 , x2 , x3 , x4 ) of distinct elements of X. Hence
|Ω| = 10 · 9 · 8 · 7.    (21)

The set A is the set of all four-element sequences of elements of Xs, so |A| = 7 · 6 · 5 · 4. As to the set B, there are six possible words (sequences) consisting of 2 letters c and 2 letters s, namely

sscc, sccs, ccss, cssc, cscs, scsc.    (22)

Now, regardless of the word, there are 7 · 6 ways of choosing the two elements of Xs, and 3 · 2 ways of choosing the two elements of Xc. Thus |B| = 6 · (7 · 6) · (3 · 2). From (21) and the above we obtain

P(A) = (7 · 6 · 5 · 4)/(10 · 9 · 8 · 7) = 1/6,    P(B) = (6 · 7 · 6 · 3 · 2)/(10 · 9 · 8 · 7) = 3/10,

as before —as it should be!

The event C is a special instance of B, namely that corresponding to the word sscc. Thus |C| = 7 · 6 · 3 · 2, and hence

P(C) = (7 · 6 · 3 · 2)/(10 · 9 · 8 · 7) = 1/20 = (1/6) P(B).
E XAMPLE . (The birthday problem.) We want to compute the probability that in a group of n
people, at least two of them have the same birthday. A birthday is an element of the set
X = {1, 2, . . . , 365}
hence |X| = 365 (we exclude leap years). Let bk ∈ X be the birthday of the kth person, with
k = 1, . . . , n. Thus a sample (an outcome) is an arbitrary sequence of the type
ω = (b1 , b2 , . . . , bn )
bk ∈ X.
Be aware that the notation employed here differs from that of Theorem 4.1. In the birthday problem, the sample size is n, and |X| = 365; in Theorem 4.1 the sample size is k, and, confusingly, n = |X|. (Being able to adjust to new notation is a distinctive skill of a mathematician.)
Then
|Ω| = |X|^n = 365^n.
(Using Cartesian product notation —see section 4.1— we have Ω = X n .) Let An be the event
‘There is a repeated birthday among n people’. Then Acn is ‘There is no repeated birthday’, and
hence
|Acn | = 365 · 364 · 363 · · · (365 − n + 1).
Therefore

P(An) = 1 − P(Acn) = 1 − (365 · 364 · · · (365 − n + 1))/365^n.    (23)

With the aid of a computer, we obtain the following (approximate) values of P(An):
n      : 22     23     60    70
P(An)  : 0.475  0.507  0.99  0.999
(A simple, approximate explicit formula for P(An ) is derived in the exercises.) Since P(An ) is an
increasing function of n, these data show that n = 23 is the smallest integer for which n people are
more likely than not to have a common birthday. In a group of 70 people, the chance of not having
a common birthday is less than 1%.
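Formula (23) is easy to evaluate by machine; the illustrative sketch below reproduces the values in the table:

```python
from math import prod

def p_shared(n, days=365):
    """P(at least two of n people share a birthday); formula (23)."""
    return 1 - prod(days - i for i in range(n)) / days ** n

assert 0.475 < p_shared(22) < 0.476
assert 0.507 < p_shared(23) < 0.508        # n = 23 tips past 1/2
assert p_shared(22) < 0.5 < p_shared(23)
assert p_shared(70) > 0.99
```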
E XAMPLE . (Quality control.) An inspector examines 10 items selected at random from a batch of
100 manufactured items. The batch is accepted if none of the 10 items is defective. Compute the
probability that a batch containing exactly 10 defective items is accepted.
The sample space Ω is the set of all 10-element subsets of a set with 100 elements. The event A: ‘The chosen batch is accepted’ is the set of all 10-element subsets of a set with 90 elements —the good items in the batch. Thus

|Ω| = \binom{100}{10} = 100!/(10! 90!),    |A| = \binom{90}{10} = 90!/(10! 80!),

and therefore

P(A) = \binom{90}{10} \binom{100}{10}^{−1} = (81 · 82 · · · 90)/(91 · 92 · · · 100) = 520058680173/1573664496040 ≈ 0.33.    (24)
An approximate formula for P(A) is derived in the next section.
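The exact value (24) can be checked with Python's integer arithmetic (an illustrative sketch; Fraction keeps the computation exact):

```python
from math import comb, prod
from fractions import Fraction

p_accept = Fraction(comb(90, 10), comb(100, 10))

# Same value via the cancelled form (81 · 82 ··· 90)/(91 · 92 ··· 100):
assert p_accept == Fraction(prod(range(81, 91)), prod(range(91, 101)))
assert round(float(p_accept), 2) == 0.33
```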
We conclude with some general observations on sampling problems.
• Consider which is easier to compute: the cardinality of an event, or that of its complement?
(In the birthday problem, the latter was easier.)
• Learn how to recognise common structures in seemingly unrelated problems. (Exercises
4.11, 4.13, and 4.21 at the end of this chapter are the birthday problem in disguise.)
• In (22) we have seen a subdivision of a problem into cases, to which the same combinatorial
formulae are then applied. Exercises 4.10 and 4.19 at the end of this chapter require this
technique (two cases need to be considered).
4.4
Sumpplementary material
[This section is not examinable.]
4.4.1 Two famous formulae
The following two formulae are among the most famous in analysis. They provide —among many
other things— simple approximate expressions of complicated sampling formulae, such as (23)
and (24).
Let e = 2.71828 · · · be Napier’s constant (the base of the natural logarithm). Then, for all x ∈ R we have

lim_{n→∞} (1 + x/n)^n = e^x.    (25)

Let π = 3.141592653 . . . be Archimedes’ constant. Then

n! ∼ (n/e)^n √(2πn)    (Stirling’s formula)    (26)

where we write f(n) ∼ g(n) to mean

lim_{n→∞} f(n)/g(n) = 1.

If n is large, then we write

(1 + x/n)^n ≈ e^x,    n! ≈ (n/e)^n √(2πn).

The symbol ≈, which reads ‘is approximately equal to’, is not a mathematical symbol, because it does not give information on the quality of the approximation. Formulae derived using these approximations will need to be checked (by computer, typically) against the exact expressions, to find out how good they are.
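For instance, such checks take only a few lines of illustrative Python:

```python
from math import e, factorial, pi, sqrt

# (1 + x/n)^n approaches e^x as n grows (here x = 1):
assert abs((1 + 1 / 10**6) ** 10**6 - e) < 1e-4

# Stirling: the ratio n! / ((n/e)^n * sqrt(2*pi*n)) tends to 1 from above.
for n in (10, 50, 100):
    ratio = factorial(n) / ((n / e) ** n * sqrt(2 * pi * n))
    assert 1 < ratio < 1.01
```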
As an example, let us consider the quality control formula (24). We rewrite it as follows:
P(A) = (81/91) × (82/92) × · · · × (90/100).
There are 10 terms in the above product, and each term is close to 9/10 = 1 − (1/10). Hence
P(A) ≈ (1 − 1/10)^{10} ≈ 1/e = 0.36 . . .
This simple result is appealing, and considering that the derivation involved two approximations,
the accuracy is acceptable. A more sophisticated application is given in the last exercise at the end
of the chapter.
4.4.2 The Binomial Theorem
The following well-known theorem —called the binomial theorem— is an application of counting
selections.
Theorem 4.2. For all real numbers x and all non-negative integers n, the following holds:

    (1 + x)^n = ∑_{r=0}^{n} \binom{n}{r} x^r.    (27)
P ROOF. We have

    (1 + x)^n = (1 + x) × (1 + x) × · · · × (1 + x)    (n terms)
              = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n.

Label each term (1 + x) in the product, from 1 to n, so as to regard these terms as distinct objects. They form a set with n elements. To construct the monomial x^r we must select r distinct terms from which to take x, while taking 1 from the remaining n − r terms. This is an unordered selection of r objects from an n element set without repetition, so there are \binom{n}{r} ways of doing it. This is the coefficient a_r, and the result follows. □

If we replace x with y/x in (27) and we multiply both sides by x^n, we obtain the following
alternative statement of the theorem:
    (x + y)^n = ∑_{r=0}^{n} \binom{n}{r} x^{n−r} y^r,    x, y ∈ R, n ≥ 0.
The binomial theorem can also be proved by induction on n.
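A numerical sanity check of identity (27) for small n (a Python sketch):

```python
from math import comb

# Check (1 + x)^n = sum_{r=0}^{n} C(n, r) x^r for several n and x.
for n in range(8):
    for x in (-1.5, 0.0, 0.25, 2.0):
        lhs = (1 + x) ** n
        rhs = sum(comb(n, r) * x ** r for r in range(n + 1))
        assert abs(lhs - rhs) < 1e-9, (n, x)
print("identity (27) holds for the tested values")
```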
E XERCISES
E XERCISE 4.1. Simplify each expression as much as you can.

1. \binom{8}{4}. Ans: 70.
2. \binom{15}{5}. Ans: 3 · 7 · 11 · 13.
3. … Ans: 127/2.
4. … Ans: 127/126.
5. … Ans: 2^5.
6. … Ans: 2^15 · 3^6 · 5^3.
7. \binom{10}{8}\binom{8}{7}/\binom{18}{11}. Ans: 5/(2 · 13 · 17).
8. \binom{97}{50}/\binom{100}{50}. Ans: 2^2/(3 · 11).
E XERCISE 4.2. Assuming that the probability of any child being male is 1/2, find the probability
that in a family of five children (a) all children are of the same sex; (b) the three oldest are boys
and the two youngest are girls; (c) three of them are boys and two of them are girls.
[Ans: (a) 1/16; (b) 1/32; (c) 5/16.]
E XERCISE 4.3. Two balls are drawn from an urn containing one white, three black, and four green
balls. (a) What is the probability that the first is white and the second is black? (b) What is the
probability if the first ball is replaced before the second drawing?
[Ans: (a) 3/56; (b) 3/64.]
E XERCISE 4.4. Two fair dice are rolled. Find the probability that the sum of the outcomes is not
equal to 4.
[Ans: 11/12.]
E XERCISE 4.5. Three cards are drawn at random from a full deck. What is the probability of
getting a three, a seven, and an ace?
[Ans: 2^4/(5^2 · 13 · 17).]
E XERCISE 4.6. One of the numbers 2, 4, 6, 7, 8, 11, 12 and 13 is chosen at random as the numerator
of a fraction, and then one of the remaining numbers is chosen at random as the denominator of
the fraction. What is the probability of the fraction being in lowest terms?
[Hint: A fraction is a two-element sequence in disguise. Ans: 9/14.]
E XERCISE 4.7. [Mid-term test 2016.] A three-digit pin number is assigned at random. Determine
the probability that two digits are the same, while the other is different.
[Ans: 3^3/10^2.]
E XERCISE 4.8. [Mid-term test 2016.] I select four distinct integers at random among the integers
1, 2, . . . , 10. Determine the probability that my selection contains at most one prime number.
[Ans: 19/42.]
E XERCISE 4.9. [Final examination 2016.] A fair coin is tossed repeatedly. Let n > 2. What is
the probability that a head will occur for the second time on the nth toss?
[Ans: (n − 1)/2^n.]
E XERCISE 4.10. [Final examination 2016.] At a round table with eight chairs, four men and four
women sit at random. What is the probability that every woman will sit between two men?
[Ans: 1/35.]
E XERCISE 4.11. [Final examination 2016.] Four people choose at random one holiday destination among ten destinations. Show that the probability that at least two people will make the same
choice is less than 1/2.
[Ans: 62/125.]
E XERCISE 4.12. The word DRAWER is spelled with six scrabble tiles. The tiles are then randomly
rearranged. What is the probability of the rearranged tiles spelling the word REWARD?
[Ans: 1/360.]
E XERCISE 4.13. [Mid-term test 2014.] An underground train made of n cars is boarded by r
passengers (r ≤ n), each entering a car at random. What is the probability of the passengers all
ending up in different cars?
[Ans: n(n − 1) · · · (n − r + 1)/n^r.]
E XERCISE 4.14. A three-digit integer is formed at random from the digits {1, 2, 3, 4}, with no
repeated digits. (a) What is the probability that such an integer is even? (b) What is the same
probability if we use the digits {1, 2, 3, 4, 5}?
[Ans: (a) 1/2; (b) 2/5.]
E XERCISE 4.15. An integer is chosen at random from the first 2000 positive integers. What is the
probability that the chosen integer is divisible by 6 or 8?
[Ans: 1/4.]
E XERCISE 4.16. From five married couples, four people are selected. What is the probability that
two men and two women are chosen?
[Ans: 10/21.]
E XERCISE 4.17. [Final examination 2015.] I forgot the last digit of my cash card’s pin number.
I am allowed three attempts, after which the card will be withdrawn. What is the probability that
I’ll be able to use the card?
[Ans: 3/10.]
E XERCISE 4.18. A wooden cube with painted faces is sawed up into 1000 little cubes, all of the
same size. The little cubes are then mixed up, and one is chosen at random. What is the probability
of it having just two painted faces?
[Ans: 0.096.]
E XERCISE 4.19. I have ten books, five each of two titles, and I place them at random on a shelf.
What is the probability that (a) five copies of one title follow five copies of the other title on the
shelf? (b) the two titles alternate on the shelf?
[Ans: (a) 1/126; (b) same as (a).]
E XERCISE 4.20. A batch of 100 items contains 5 defective items. Fifty items are chosen at random
and then inspected. Suppose the whole batch is accepted if no more than one of the 50 inspected
items is defective. What is the probability of accepting the whole batch?
[Ans: (47 · 37)/(99 · 97) ≈ 0.18.]
E XERCISE 4.21. [Hard.] What is the probability that a random function between two finite sets is injective?
[Hint: Give a sensible definition of random function; establish the necessary notation.]
E XERCISE 4.22. Let 1 ≤ r ≤ n. A subset of {1, 2, . . . , n} of cardinality r is chosen at random.
i) Calculate the probability that 1 is an element of the chosen subset.
ii) Without using i) and Proposition 3.1, calculate the probability that 1 is not an element of the
chosen subset. Hence deduce that

    \binom{n}{r} = \binom{n−1}{r} + \binom{n−1}{r−1}.

[Ans: i) r/n; ii) (n − r)/n (after simplification).]
E XERCISE 4.23. A random selection of n balls is made from a bag containing n red balls and n
blue balls. Let Ai be the event that the selection contains exactly i red balls.
i) Find P(Ai ).
ii) Deduce that

    ∑_{i=0}^{n} \binom{n}{i}^2 = \binom{2n}{n}.

[Hint: use Axiom III.]
E XERCISE 4.24. I have 4n balls, 2n black and 2n white. I divide them at random into two groups
of 2n balls each. Let An be the event: ‘Each group consists of n black balls and n white balls’.
1. Compute P(An ).
2. [Hard.] Using Stirling's formula (26), show that P(An) ∼ 2/√(2πn).

[Ans: 1. \binom{2n}{n}^2 \binom{4n}{2n}^{−1}.]
E XERCISE 4.25. [Hard.] We assemble a necklace at random, using 2n beads of two different
colours, with n beads of each colour. Let pn be the probability that the two colours alternate. Show
that, for all n ≥ 1, the quantity p_n^{−1} is an integer.
E XERCISE 4.26. [Hard.] We seek a simple approximate expression for the inscrutable formula
(23).
i) Show that

    P(An) = 1 − ∏_{k=0}^{n−1} (1 − k/365).
ii) Using i) and (25), show that
    P(An) ≈ β(n),    where    β(n) = 1 − e^{−n(n−1)/730}.
[Hint: You will need the identities e^a e^b = e^{a+b} and a = (a^{365})^{1/365}, and the formula for the
sum of an arithmetic progression.]
iii) Using Maple, analyse the accuracy of the approximate expression β(n), by plotting the difference of the two sequences P(An) − β(n), as functions of n.
5
Conditional Probability
We want to measure the influence that the occurrence of an event B has on the probability of
another event A occurring. For example, if B occurs, then A may be more likely to occur, or even
occur with certainty.
Definition. The conditional probability of an event A, given an event B with P(B) > 0, is defined as

    P(A|B) := P(A ∩ B)/P(B).    (28)
The centrepiece of (28) is the assignment operator ‘:=’, which appears, explicitly or implicitly, in every symbolic definition. This operator assigns a meaning to the expression on the left in terms of the expression on the right. Typically, the former is a short-hand of the latter, employing new symbols, or combinations of symbols. For instance, the expression

    A ∆ B := (A ∪ B) ∖ (A ∩ B),

gives meaning to the symbol ‘∆’, by specifying its action on two sets in terms of quantities defined previously (union, intersection, and difference operators)12.
The right-hand side of (28) is well-defined: it’s a ratio of real numbers with a non-zero denominator. But the left-hand side is obscure: what new symbol is being defined? To answer this
question, we rewrite (28) using a different, more conventional, notation:
    H(A, B) := P(A ∩ B)/P(B).    (29)
We see that a conditional probability is a new function (here called H) of two variables, much like in the expression

    h(x, y) := sin(x + y)/sin(y),    sin(y) ≠ 0.    (30)
So, the symbol defined via (28) is not ‘|’, but rather a combination of P and ‘|’.
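The two-argument reading (29) is easy to make concrete on a finite sample space with equally likely outcomes (a Python sketch; the name H follows (29), the rest of the set-up is illustrative):

```python
from fractions import Fraction

def H(A, B, omega):
    """P(A|B) = P(A n B)/P(B) for events A, B (subsets of omega),
    with all outcomes of omega equally likely."""
    if not B:
        raise ValueError("P(B) must be positive")
    return Fraction(len(A & B), len(omega)) / Fraction(len(B), len(omega))

# Two rolls of a fair die; conditioning on rolling a double.
omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}
A = {w for w in omega if w[0] == 6}       # first roll is a 6
B = {w for w in omega if w[0] == w[1]}    # the two rolls are equal

print(H(A, B, omega))  # 1/6
```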
These grammatical subtleties are also present in the verbal definition of conditional probability:
The probability of A, given B, is defined as the probability of A intersection B divided
by the probability of B.
Now a comma appears after the first occurrence of the symbol A, but not after the second. As a
consequence, the expression ‘A, given B’ (unlike ‘A intersection B’) does not identify an actual
object.
12 Mathematicians often write ‘=’ to mean ‘:=’, particularly if the left-hand side consists of a single symbol: p = 2^127 − 1. Computer scientists can't afford such a liberty, as programming languages treat equations and assignments very differently.
We have learned that the last item in the list
P(A ∪ B),    P(A ∩ B),    P(A ∖ B),    P(A∆B),    P(A|B),
is very different from all the others, in spite of a superficial similarity. In particular, the expression
A|B, taken in isolation, is meaningless, that is, ‘|’ is not a set operator.
The definition of conditional probability does not require that A happens after B. One way
of thinking of this is to imagine that B has been promoted to the new sample space, and that all
outcomes outside B are ignored. Then the only information that B has of A is A ∩ B, while the factor
P(B) in the denominator ensures that P(A|B) is correctly normalised. In fact, we have (see exercises)
    0 ≤ P(A|B) ≤ 1.    (31)
The conditional probability of A, given B, is the new probability of A in these circumstances: It
measures how the occurrence of B influences the chance of the event A occurring.
For instance, if P(A|B) < P(A), then B occurring makes A less probable. An extreme case is
P(A|B) = 0, which means that, given B, the event A is impossible. On the other hand, if P(A|B) >
P(A) then B occurring makes A more probable. In the extreme case P(A|B) = 1, the occurrence
of B makes A certain. Another special case is P(A|B) = P(A) [which is seen to be equivalent to
P(A ∩ B) = P(A)P(B), using (28)], which means that B occurring does not affect the occurrence of
A. This particularly important case will be discussed in the next chapter.
E XAMPLE . We roll a fair die twice, and consider the events
A = ‘The first roll is a 6’
B = ‘Roll a double’
C = ‘Roll at least one odd number’.
We find |Ω| = 36 and
|A| = 6,    |B| = 6,    |C| = 27,    |A ∩ B| = 1,    |B ∩ C| = 3,    |A ∩ C| = 3.

From the above, we compute

    P(A) = 1/6,    P(B) = 1/6,    P(C) = 3/4,
    P(A ∩ B) = 1/36,    P(B ∩ C) = 1/12,    P(A ∩ C) = 1/12,

and therefore

    P(A|B) = 1/6 = P(A),    P(B|A) = 1/6 = P(B),
    P(B|C) = 1/9 < P(B),    P(C|B) = 1/2 < P(C),
    P(A|C) = 1/9 < P(A),    P(C|A) = 1/2 < P(C).    (32)
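All of the values in (32) can be verified by enumerating the 36 outcomes (a Python sketch):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

A = [w for w in omega if w[0] == 6]             # first roll is a 6
B = [w for w in omega if w[0] == w[1]]          # roll a double
C = [w for w in omega if w[0] % 2 or w[1] % 2]  # at least one odd number

def p(E):
    return Fraction(len(E), len(omega))

def cond(E, F):  # P(E|F), all outcomes equally likely
    return Fraction(len(set(E) & set(F)), len(F))

assert (p(A), p(B), p(C)) == (Fraction(1, 6), Fraction(1, 6), Fraction(3, 4))
assert cond(A, B) == p(A) == Fraction(1, 6)   # B does not affect A
assert cond(B, C) == Fraction(1, 9) < p(B)    # C makes B less likely
assert cond(C, B) == Fraction(1, 2) < p(C)
```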
5.1
The probability of intersection of events
The probability of the union of events is given by the inclusion-exclusion formulae, proved for two
and three events in section 3.1. In this section we establish a formula for the probability of the
intersection of an arbitrary number of events, in terms of conditional probabilities (theorem 5.1).
We begin with an example.
Consider our earlier sampling example of selecting 4 coins from 7 silvers and 3 coppers, ordered and without repetition. Let Ai be the event ‘The ith coin is silver’. Then P(A1) = 7/10. If A1 occurs, then the second selection is from 6 silvers and 3 coppers, that is, P(A2|A1) = 6/9 = 2/3. Hence, from the definition of conditional probability, we obtain, after rearranging:

    P(A1 ∩ A2) = P(A1) × P(A2|A1) = (7/10) × (2/3) = 7/15.
Also, by definition

    P(A3|A1 ∩ A2) = P(A1 ∩ A2 ∩ A3)/P(A1 ∩ A2),

and hence

    P(A1 ∩ A2 ∩ A3) = P(A1) × P(A2|A1) × P(A3|A1 ∩ A2).
If A1 and A2 occur, then the third selection is from 5 silvers and 3 coppers, so that P(A3|A1 ∩ A2) = 5/8 and

    P(A1 ∩ A2 ∩ A3) = (7/10) × (2/3) × (5/8) = 7/24.

Continuing as above we find

    P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × P(A4|A1 ∩ A2 ∩ A3)
                         = (7/10) × (2/3) × (5/8) × (4/7) = 1/6,

which we computed earlier as (7 · 6 · 5 · 4)/(10 · 9 · 8 · 7).
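The chain of conditional probabilities can be checked against the direct count (a Python sketch):

```python
from fractions import Fraction

# Selecting 4 coins from 7 silver + 3 copper, ordered, without repetition.
# Chain rule: multiply the conditional probabilities of drawing silver each time.
chain = Fraction(7, 10) * Fraction(6, 9) * Fraction(5, 8) * Fraction(4, 7)

# Direct count: favourable ordered selections over all ordered selections.
direct = Fraction(7 * 6 * 5 * 4, 10 * 9 * 8 * 7)

assert chain == direct == Fraction(1, 6)
print(chain)  # 1/6
```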
The next result is a generalisation of this procedure.
Theorem 5.1. For any events A1 , A2 , . . . , An , we have

    P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(An|A1 ∩ A2 ∩ · · · ∩ An−1),    (33)

provided that all of the conditional probabilities involved are defined (that is, that the second argument of each conditional probability has positive probability).
The statement of the theorem depends on a natural number n, and we shall prove it by induction
on n (see Mathematical Structures lecture notes). To do this, we first prove that the statement is
true for 2 events. This is called the base case. We then prove that, given any k ≥ 2, if the statement
is true for k events then it is also true for k + 1 events. (This is called the inductive step.) The
principle of induction then ensures that the statement is true for all natural numbers n.
P ROOF.
Base case: When n = 2 the statement is P(A1 ∩ A2 ) = P(A1 ) × P(A2 |A1 ). This is true by the
definition of conditional probability rearranged.
Inductive step: Let k ≥ 2 and suppose that the statement is true for k events. We will use this to
prove that the statement is true for k + 1 events. If we can do this then we may conclude that it is
true for all n.
By definition

    P(Ak+1|A1 ∩ A2 ∩ · · · ∩ Ak) = P(A1 ∩ · · · ∩ Ak+1)/P(A1 ∩ · · · ∩ Ak).
Rearranging this we get
P(A1 ∩ · · · ∩ Ak+1 ) = P(A1 ∩ · · · ∩ Ak ) × P(Ak+1 |A1 ∩ · · · ∩ Ak ).
The fact that the result is true for k events means that
P(A1 ∩ · · · ∩ Ak ) = P(A1 ) × P(A2 |A1 ) × · · · × P(Ak |A1 ∩ · · · ∩ Ak−1 ).
Substituting this into the previous equation gives that
    P(A1 ∩ · · · ∩ Ak+1) = P(A1) × P(A2|A1) × · · · × P(Ak|A1 ∩ · · · ∩ Ak−1) × P(Ak+1|A1 ∩ · · · ∩ Ak).
But this is precisely the statement of the theorem for k + 1 events.
The proof is complete. □

5.2
Ordered Sampling Revisited
Conditional probability provides an alternative approach to questions involving ordered sampling.
Suppose that I select a sequence (x1 , . . . , xr ) of r objects taken from an n-element set X, without
repetition (that is, x1 , . . . , xr are distinct).
Let Ai be the event “the ith selection is xi ”, and let A = A1 ∩ · · · ∩ Ar . Then A = {(x1 , . . . , xr )} is the elementary event ‘the chosen sample is (x1 , . . . , xr )’, and hence

    P(A) = 1/|Ω| = 1/(n × (n − 1) × · · · × (n − r + 1)) = (n − r)!/n!.    (34)

However, from Theorem 5.1 we also have

    P(A) = P(A1 ∩ A2 ∩ · · · ∩ Ar)
         = P(A1) × P(A2|A1) × P(A3|A1 ∩ A2) × · · · × P(Ar|A1 ∩ A2 ∩ · · · ∩ Ar−1).
If A1 ∩ · · · ∩ Ai−1 occurs, then the ith selection is from the set X ∖ {x1 , . . . , xi−1 }, since xi is not equal to any of x1 , . . . , xi−1 . So P(Ai|A1 ∩ · · · ∩ Ai−1) = 1/(n − i + 1). Inserting these expressions in the formula above we obtain

    P(A) = (1/n) × (1/(n − 1)) × · · · × (1/(n − r + 1)) = (n − r)!/n!.
Expression (34) is the reciprocal of a product of integers. Using conditional probability, we have
computed it as a product of reciprocals of integers.
With the same approach, we now deal with a variant of this problem. Suppose that Y ⊆ X. We want to compute the probability that the chosen sample consists of elements of Y . Let m = |Y |, and let Bi be the event ‘the ith selection is in Y ’. Then the desired event is B = B1 ∩ · · · ∩ Br . From Theorem 5.1, we have

    P(B) = P(B1 ∩ B2 ∩ · · · ∩ Br)
         = P(B1) × P(B2|B1) × P(B3|B1 ∩ B2) × · · · × P(Br|B1 ∩ B2 ∩ · · · ∩ Br−1).

Now we have

    P(Bi|B1 ∩ · · · ∩ Bi−1) = (m − i + 1)/(n − i + 1),

since if B1 ∩ · · · ∩ Bi−1 occurs, then the ith selection is from a set of n − i + 1 elements, m − i + 1 of which are in Y . Hence

    P(B) = (m/n) × ((m − 1)/(n − 1)) × · · · × ((m − r + 1)/(n − r + 1)) = m!(n − r)!/(n!(m − r)!).
Now suppose that we allow repetitions. Then the probability of the ith selection being xi does not depend on earlier selections, being a selection from the same set X with n elements:

    P(Ai|A1 ∩ A2 ∩ · · · ∩ Ai−1) = 1/n.

So for any x1 , . . . , xr ∈ X we get

    P(A) = (1/n) × (1/n) × · · · × (1/n) = 1/n^r.
Similarly,

    P(B) = (m/n) × (m/n) × · · · × (m/n) = m^r/n^r.

We will see later that the case of selection with repetition is simpler because the events corresponding to each selection are independent.
The answers we get here are the same as the ones you would get by using the ideas in the sampling section and calculating P(A) = |A|/|Ω|, where Ω is the sample space for the complete experiment of selecting r objects, and the event A = {(x1 , . . . , xr )} consists of a single outcome. Arguably, the present method is slicker. Also, the assumption we make (in each selection, all remaining elements in X are equally likely) is simpler than the assumption we made previously (each possible outcome from Ω —an r-element sequence— is equally likely).
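Both routes to P(B) are easy to confirm for concrete numbers (a Python sketch, reusing the coin example with n = 10, m = 7, r = 4):

```python
from fractions import Fraction
from math import factorial

def p_chain(n, m, r):
    """P(all r ordered selections, without repetition, fall in an
    m-element subset Y), via the chain of conditional probabilities."""
    out = Fraction(1)
    for i in range(1, r + 1):
        out *= Fraction(m - i + 1, n - i + 1)
    return out

def p_count(n, m, r):
    """The same probability via the counting formula m!(n-r)!/(n!(m-r)!)."""
    return Fraction(factorial(m) * factorial(n - r),
                    factorial(n) * factorial(m - r))

assert p_chain(10, 7, 4) == p_count(10, 7, 4) == Fraction(1, 6)
```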
5.3
Notation
[This section is not examinable.]
In chapter 3 we introduced the Sigma-notation to write sums compactly [see (8)]. We now
introduce a unified notation which generalises the Sigma-notation to operators other than addition.
Let (X1 , X2 , . . . , Xn ) be a sequence. For n ≥ 1 we define:

    ∑_{k=1}^{n} Xk := X1 + · · · + Xn        (Xk numbers)
    ∏_{k=1}^{n} Xk := X1 × · · · × Xn        (Xk numbers)
    ⋃_{k=1}^{n} Xk := X1 ∪ · · · ∪ Xn        (Xk sets)
    ⋂_{k=1}^{n} Xk := X1 ∩ · · · ∩ Xn        (Xk sets)

For n = 1, the value of these expressions is just the first term in the sequence, e.g.,

    ∑_{k=1}^{1} Xk := X1 ,    ⋃_{k=1}^{1} Xk := X1 ,    etc.

With this notation, formula (7) reads

    P(⋃_{k=1}^{n} Ak) = ∑_{k=1}^{n} P(Ak),

while formula (33) becomes

    P(⋂_{k=1}^{n} Ak) = P(A1) ∏_{k=2}^{n} P(Ak | ⋂_{j=1}^{k−1} Aj).    (35)
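In a programming language this generalisation of ∑ to other operators is a fold, and Python's functools.reduce plays that role (a sketch):

```python
from functools import reduce
import operator

xs = [3, 1, 4, 1, 5]
sets = [{1, 2}, {2, 3}, {3, 4}]

# The four big operators of section 5.3 as folds over a binary operator.
total = reduce(operator.add, xs)     # X1 + ... + Xn
prod  = reduce(operator.mul, xs)     # X1 x ... x Xn
union = reduce(operator.or_, sets)   # X1 u ... u Xn
inter = reduce(operator.and_, sets)  # X1 n ... n Xn

assert total == 14 and prod == 60
assert union == {1, 2, 3, 4} and inter == set()

# For a length-1 sequence, reduce returns the first (only) term, as in the text.
assert reduce(operator.add, [7]) == 7
```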
E XERCISES
E XERCISE 5.1. Justify all numerical results in (32).
E XERCISE 5.2. Prove the inequalities (31).
E XERCISE 5.3. Let Ω = {1, 2, . . . , 16}, with all outcomes being equally likely, and let A = {1, . . . , 4}.
Give an example of an event B such that
1. P(A|B) = 0
2. P(A|B) = 1
3. P(A|B) < P(A)
4. P(A|B) = P(A)
5. P(A|B) > P(A).
[Ans: 2. B = {3}.]
E XERCISE 5.4. A standard fair die is rolled twice.
i) Find the probability that the sum of the two rolls is at least 9.
ii) Find the conditional probability that the first roll is 4 given that the sum of the two rolls is at
least 9.
iii) Find the conditional probability that the first roll is not 4 given that the sum of the two rolls
is at least 9.
iv) Find the conditional probability that the first roll is 4 given that the sum of the two rolls is
less than 9.
v) Find the conditional probability that the sum of the two rolls is at least 9 given that the first
roll is a 4.
[Ans: i) 5/18; ii) 1/5; iii) 4/5; iv) 2/13; v) 1/3.]
E XERCISE 5.5. True or false? Justify your answer; the notation is standard.
[Don’t guess: use lecture notes.]
1. For some A, B, P(A ∩ B) ≠ P(B ∩ A).
2. For all A, P(A) > 0.
3. For all A, P(A) = 1 − P(Ac ).
4. For some A, B, P(A ∪ B) = P(A) + P(B).
5. For all A, P(Ω ∪ A) = 1.
6. For all A there is B such that P(A) + P(B) = 1.
7. For all P, if P(A) ≤ P(B) then A ⊆ B.
8. For all A, there is B such that P(A ∩ B) ≠ P(B).
9. For all A, B, P(A ∩ B) = P(A)P(B).
10. For all A, B, P(A ∩ B) = P(A)P(A|B).
11. For some A, B, P(A ∩ B) = P(A)P(B).
12. For all A, B, P(A ∩ B) = P(A) + P(B) − P(A ∪ B).
13. For some A, P(A ∩ B) = P(B) for all B.
14. For all P, A, P(A ∩ ∅) = 0.
15. For all A, B, P(A∆B) = P(A) + P(B) − 2P(A ∩ B).
16. [Hard.] For all P, if P(A) = 0, then A = ∅.
17. [Hard.] For all P, if P(A) = 1 then A = Ω.
E XERCISE 5.6. Let A and B be events with P(A) > 0 and P(B) > 0.
i) Prove that P(Ac |B) = 1 − P(A|B).
ii) Show that knowledge of P(A|B) gives no information about P(A|Bc ).
iii) Prove that if P(A|B) > P(A) then P(B|A) > P(B).
iv) Prove that if P(A) = P(B) = 2/3, then P(A|B) ≥ 1/2.
6
Independence
The inclusion-exclusion formulae show that P(A1 ∪ · · · ∪ An ) may or may not be equal to P(A1 ) +
· · · + P(An ). Equality holds if the events Ai are pairwise disjoint, namely if, for all i ≠ j we have
Ai ∩ A j = ∅. There is an analogous concept for P(A1 ∩ · · · ∩ An ), the probability of intersection of
events. We begin with the case n = 2.
Definition. We say that the events A and B are independent if
    P(A ∩ B) = P(A) × P(B).    (36)
The meaning of a common English word has just been re-defined: this word can no longer be used
casually.
Events which are clearly physically unrelated turn out to be independent according to definition
(36) —this is why such a definition is sensible. For instance, let us consider two tosses of a
fair coin. The sample space is {hh, ht,th,tt}. The events A = {hh, ht} (the first toss is h) and
B = {hh,th} (the second toss is h) have probability 1/2, and since
    P(A ∩ B) = P({hh}) = 1/4 = P(A) × P(B),
the events A and B are independent. By contrast, the events ‘the first toss is h’ and ‘the first toss is
t’ are not independent, as readily verified.
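Both checks can be confirmed by enumeration (a Python sketch):

```python
from fractions import Fraction
from itertools import product

omega = list(product("ht", repeat=2))  # hh, ht, th, tt: equally likely

def p(E):
    return Fraction(len(E), len(omega))

def independent(A, B):
    AB = [w for w in A if w in B]
    return p(AB) == p(A) * p(B)

A = [w for w in omega if w[0] == "h"]   # first toss is h
B = [w for w in omega if w[1] == "h"]   # second toss is h
T = [w for w in omega if w[0] == "t"]   # first toss is t

assert independent(A, B)       # definition (36) holds
assert not independent(A, T)   # P(A n T) = 0, but P(A)P(T) = 1/4
```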
However, independent is not equivalent to physically unrelated. For example, we saw that if a
fair die is rolled twice then the event “first roll is a 6” and the event “both rolls produce the same
number” are independent. Therefore, even in the most obvious cases (e.g., two successive coin
tosses), if you assume independence, then you must clearly state so. Failure to do so will be treated
as a logical mistake.
The assumption of independence often appears in the formulation of problems. If so, then in
the solution you must say where you use this assumption.
The next theorem connects independence to conditional probabilities.
Theorem 6.1. Let A and B be events with P(A) > 0 and P(B) > 0. The following are equivalent:
i) A and B are independent
ii) P(A|B) = P(A)
iii) P(B|A) = P(B).
This result says that if A and B are independent, then the occurrence of A does not affect the
probability of B, and vice-versa. The equivalence of the three statements is expressed formally as
i) ⇔ ii) ⇔ iii).
P ROOF. It will be sufficient to show that i) ⇒ ii), ii) ⇒ iii), and iii) ⇒ i) (think about it).
i) ⇒ ii): [That is, if A and B are independent, then P(A|B) = P(A).] Suppose that A and B are independent. Then

    P(A|B) = P(A ∩ B)/P(B)           [by definition (28)]
           = (P(A) × P(B))/P(B)      [by assumption (36)]
           = P(A),

as desired.
ii) ⇒ iii): Suppose that P(A|B) = P(A). By the definition of conditional probability this means that
P(A ∩ B)/P(B) = P(A). Rearranging, and using the fact that P(A) ≠ 0, we get P(A ∩ B)/P(A) =
P(B), that is, P(B|A) = P(B).
iii) ⇒ i): Suppose that P(B|A) = P(B). By definition, this means that P(A ∩ B)/P(A) = P(B).
Rearranging we get P(A ∩ B) = P(A)P(B), that is, A and B are independent events. □

Defining independence for more than two events is more complicated.
E XAMPLE . We roll a fair die two times. We consider the events

    A: ‘first roll is odd’,    B: ‘second roll is odd’,    C: ‘sum of rolls is odd’.

We find

    P(A) = 1/2,    P(B) = 1/2,    P(C) = 1/2    [Why?],

and

    P(A ∩ B) = 1/4,    P(A ∩ C) = 1/4,    P(B ∩ C) = 1/4,

and hence A and B and C are pairwise independent, meaning that A and B are independent, A and C are independent, and B and C are independent. However, it is plain that if A and B occur, then C is impossible; likewise, if A occurs and B does not, or vice-versa, then C is certain. This is reflected mathematically in the following relation:

    P(A ∩ B ∩ C) = P(∅) = 0 ≠ P(A) × P(B) × P(C).
So the probability of all three happening is not the product of individual probabilities.
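Enumeration confirms the pairwise independence and the failure of the triple product (a Python sketch):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))

def p(E):
    return Fraction(len(E), len(omega))

A = [w for w in omega if w[0] % 2 == 1]           # first roll odd
B = [w for w in omega if w[1] % 2 == 1]           # second roll odd
C = [w for w in omega if (w[0] + w[1]) % 2 == 1]  # sum odd

AB  = [w for w in A if w in B]
ABC = [w for w in AB if w in C]

assert p(AB) == p(A) * p(B)          # pairwise independent
assert p(ABC) == 0                   # A and B together make C impossible
assert p(ABC) != p(A) * p(B) * p(C)  # so the triple product fails
```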
We generalise the definition of independence to the case of more than two events. This definition must take into account all possible intersections of events.
Definition. We say that the n events A1 , A2 , . . . , An are independent if for every 1 < s ≤ n, and every sequence of indices (i1 , . . . , is ) such that 1 ≤ i1 < i2 < · · · < is ≤ n, we have

    P(A_{i1} ∩ A_{i2} ∩ · · · ∩ A_{is}) = P(A_{i1}) × P(A_{i2}) × · · · × P(A_{is}).    (37)
To familiarise ourselves with the zoo of sequences of indices (i1 , . . . , is ) that appear in this definition, we list them explicitly for n = 2, 3, 4.
n=2:
(1, 2)
n=3:
(1, 2), (1, 3), (2, 3), (1, 2, 3)
n=4:
(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (1, 2, 3, 4).
Thus for n = 2 we have one pair (s = 2); for n = 3 we have three pairs (s = 2) and one triple (s = 3); for n = 4 we have six pairs (s = 2), four triples (s = 3) and one quadruple (s = 4). (Write down explicitly some of the identities (37) for n = 4.)
We see that to establish the independence of more than two events requires a lot more work
than just checking the independence of all possible pairs. (The number of identities (37) grows
very rapidly with n —see exercise 6.12.) For this reason, the expression ‘mutually independent’ is
sometimes found in the literature to characterise the independence of more than two events. I think
that the adverb ‘mutually’ is confusing here, since it suggests a non-existing connection with the
expression ‘mutually exclusive’, that is, disjoint. Independence involves the probability P, while
disjointness doesn’t, being a statement solely about subsets of Ω. In fact, two mutually exclusive
events can never be independent, unless one has zero probability (see exercises).
Let us now consider examples of calculating probabilities of events which are built up of several
independent events.
E XAMPLE . A coin which has probability p of showing heads is tossed n times in succession. What
is the probability that heads comes up exactly r times?
Outcomes are words ω = ω1 · · · ωn [equivalently, sequences (ω1 , . . . , ωn )], where ωk ∈ {h,t}.
Since the results of the tosses are physically unrelated, we may assume that the events A1 , . . . , An ,
where Ak =‘the kth toss is ωk ’ are independent13 . This means that the conditional probability of
Ak , given any possible combination of the remaining A j , j ≠ k, is equal to the probability of Ak ,
which is either p or 1 − p. Equivalently, to compute the probability
P(A1 ∩ · · · ∩ An ) = P({ω1 · · · ωn })
of a particular sequence of r heads and n − r tails, we just multiply together the appropriate probability for each toss. That is,
    P(h · · · h t · · · t) = p^r (1 − p)^{n−r}    (r heads, then n − r tails).
Clearly, any sequence of r heads and n − r tails will have the same probability. There are \binom{n}{r} such sequences, and so if A is the set of all samples ω which has exactly r heads, then

    P(A) = \binom{n}{r} p^r (1 − p)^{n−r}.

We will see this expression again when we discuss the binomial distribution.
13 See exercise 6.11.
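A sketch of this count (the function name is mine):

```python
from math import comb

def prob_heads(n, r, p):
    """P(exactly r heads in n tosses of a coin with head probability p),
    assuming the tosses are independent."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The events 'exactly r heads', r = 0..n, partition the sample space,
# so their probabilities must sum to 1.
assert abs(sum(prob_heads(10, r, 0.3) for r in range(11)) - 1.0) < 1e-12

print(prob_heads(2, 1, 0.5))  # 0.5
```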
Figure 1: Flow on graphs.
E XAMPLE . To go from x to y, I can use buses a or b. Let A be the event ‘bus a is running’, let B
be the event ‘bus b is running’ (see figure 1, left), and let
    P(A) = 9/10,    P(B) = 4/5.

What is the probability that I can travel (which is the event A ∪ B), assuming that A and B are independent?

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)      [by inclusion-exclusion]
             = P(A) + P(B) − P(A)P(B)      [by assumption]
             = 9/10 + 4/5 − 36/50 = 49/50.
Now suppose that we have the more complex network illustrated in figure 1, right, where each
event corresponds to transport on the associated section of the network. Being able to travel from
x to y corresponds to the event
T = [A ∪ (B ∩C)] ∩ D.
(Think about it.) Let us now assume that the events A, B,C, D are independent. Then the events D
and A ∪ (B ∩C) are also independent, since the former does not feature in the latter. We compute:
    P(T) = P(D ∩ [A ∪ (B ∩ C)])                     [by commutativity of intersection]
         = P(D) P(A ∪ (B ∩ C))                      [by independence]
         = P(D) [P(A) + P(B ∩ C) − P(A ∩ B ∩ C)]    [by inclusion-exclusion]
         = P(D) [P(A) + P(B)P(C) − P(A) P(B) P(C)]  [by independence]
         = P(D) [P(A) + P(B) P(C)(1 − P(A))]
         = P(D) [P(A) + P(Ac ) P(B) P(C)].
We have expressed the probability of a complex event in terms of the probabilities of its constituent
components.
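With concrete probabilities, the final expression can be checked against a direct enumeration of the sixteen on/off states of the network (a Python sketch; the numeric values are my own, for illustration):

```python
from itertools import product

# Illustrative link probabilities for a, b, c, d (not from the text).
pa, pb, pc, pd = 0.9, 0.8, 0.7, 0.95

def prob(state):
    """Probability of one on/off state (a, b, c, d), using independence."""
    out = 1.0
    for on, q in zip(state, (pa, pb, pc, pd)):
        out *= q if on else 1 - q
    return out

# Direct enumeration of the 16 states in which T = [A u (B n C)] n D occurs.
direct = sum(prob(s) for s in product((True, False), repeat=4)
             if (s[0] or (s[1] and s[2])) and s[3])

# The closed-form expression derived above: P(D)[P(A) + P(A^c)P(B)P(C)].
formula = pd * (pa + (1 - pa) * pb * pc)

assert abs(direct - formula) < 1e-12
```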
6.1
Notation
[This section is not examinable.]
In this section we shall rewrite expression (37) in a more concise and elegant way. As a starting
point, we adopt the notation developed in section 5.3:

    P(⋂_{k=1}^{s} A_{ik}) = ∏_{k=1}^{s} P(A_{ik}).
However, the information about the structure of the double subscripts ik must still be supplied
separately.
To get rid of the double subscripts, we note that the terms of the sequence (i1 , . . . , is ) are distinct. In addition, the order in which these terms appear is irrelevant, since the intersection and
multiplication operators are commutative —one can independently and arbitrarily re-arrange the
terms of the two expressions in (37). So we turn these sequences into sets, and we rewrite (37)
compactly as

    P(⋂_{i∈I} Ai) = ∏_{i∈I} P(Ai),    I ⊆ {1, . . . , n}, |I| > 1.    (38)
All awkward double subscripts have disappeared. In addition, the formula is self-contained.
But we can go further, and dispose of all subscripts altogether! Let Λ be the set of events whose independence is being tested. Formula (38) says that for every subset 𝒜 ⊆ Λ with more than one element, the probability that all events in 𝒜 occur is equal to the product of the probabilities of the individual events in 𝒜. The following variant of (38) neatly translates this prescription into symbols:

    P(⋂_{A∈𝒜} A) = ∏_{A∈𝒜} P(A),    𝒜 ∈ P(Λ) ∖ {∅}.    (39)
(Recall that if X is a set, then P(X) denotes the power set of X, namely the set of all subsets of X, see page 14.) The condition |𝒜| > 1 is unnecessary, since for |𝒜| = 1 we get the trivial identity P(A) = P(A) [see (35), page 44]. If |𝒜| = 0, that is, 𝒜 = ∅, then the left-hand side of (39) is unproblematic, as it reduces to P(∅) = 0. However, the right-hand side would now represent an empty product, and the latter is normally defined to be equal to 1. This is why in (39) we have removed the empty set from P(Λ).
The transformation of (37) into (39) illustrates a general aspect of the mathematical language:
by adopting high-level constructs (subsets of a power set, in our case), the exposition becomes
simpler, albeit more abstract.
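Definition (39) quantifies over every subfamily with at least two events; in code this is a loop over the power set (a Python sketch, tested on the pairwise-but-not-independent triple from the dice example):

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product(range(1, 7), repeat=2))

def p(E):
    return Fraction(len(E), len(omega))

def independent_family(events):
    """Check (39): every subfamily with >= 2 events multiplies up."""
    for s in range(2, len(events) + 1):
        for fam in combinations(events, s):
            inter = [w for w in omega if all(w in E for E in fam)]
            prodp = Fraction(1)
            for E in fam:
                prodp *= p(E)
            if p(inter) != prodp:
                return False
    return True

A = [w for w in omega if w[0] % 2 == 1]
B = [w for w in omega if w[1] % 2 == 1]
C = [w for w in omega if (w[0] + w[1]) % 2 == 1]

assert independent_family([A, B])         # pairs are fine
assert not independent_family([A, B, C])  # the triple product fails
```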
E XERCISES
E XERCISE 6.1. Prove that two mutually exclusive events, none of which has zero probability, are
not independent.
E XERCISE 6.2. [Final examination 2015.] Let Ω = {1, 2, . . . , 20} and let A be the set of even integers in Ω. Assuming that all elements of Ω are equally likely, give an example —with justification—
of events B and C such that
(i) A and B are independent;
(ii) A and C are not independent.
E XERCISE 6.3. [Final examination 2016.] Let Ω be the interval [0, 1], and let the probability of
an interval A ⊂ Ω be equal to the length of A. Give an example —with justification— of
1. two intervals A and B such that A and B partition Ω;
2. two intervals A and B such that A and B are independent;
3. two intervals A and B such that P(A|B) = 1.
E XERCISE 6.4. True or false? Justify your answer. (The notation is standard; in expressions of
the type P(A|B), you may assume that P(B) ≠ 0.)
1. For all A, P(A|Ac ) = 0.
2. For all A, B, P(Ac |Bc ) + P(Ac |B) = 1.
3. For all A, B,C, P(C ∩ B ∩ A) = P(A) · P(B|A) · P(C|A ∩ B).
4. For all A, B, if P(A|B) = P(A), then P(A ∩ B) = P(A) · P(B).
5. For all A, B,C, P(A) = P(A|B) · P(B) + P(A|C) · P(C).
E XERCISE 6.5. I toss a fair coin repeatedly, stopping when I have either tossed it four times or I
have seen two heads (whichever happens first). I record the sequence of heads/tails seen.
i) Write down the sample space.
ii) Are all elements of the sample space equally likely? Justify your answer.
iii) Calculate the probability that I see two heads.
iv) Calculate the conditional probability that I see two heads given that the first toss is a tail.
v) Calculate the conditional probability that the first toss is a tail given that I see two heads.
[Ans: ii) No; iii) 11/16; iv) 1/2; v) 4/11.]
EXERCISE 6.6. A positive integer from the set {1, 2, 3, . . . , 36} is chosen at random with all
choices equally likely. Let E be the event “the chosen number is even”, O be the event “the chosen
number is odd”, Q be the event “the chosen number is a perfect square”, and Dk be the event “the
chosen number is a multiple of k”. Decide with justification:
i) Are the events E and O independent?
ii) Are the events E and Q independent?
iii) Are the events O and Q independent?
iv) Are the events D3 and D4 independent?
v) Are the events D4 and D6 independent?
EXERCISE 6.7. The top card of a thoroughly shuffled deck of playing cards is turned over. Let
A be the event “the card is an Ace”, R be the event “the card belongs to a red suit (♦ or ♥)”, and
M be the event “the card belongs to a major suit (♥ or ♠)”. Show that the events A, R and M are
independent.
EXERCISE 6.8. There are two roads from A to B and two roads from B to C. Suppose that each
road is closed with probability p and that the state of each road is independent of the others. What
condition on p ensures that the probability that I can travel from A to C is at least 1/2?
[Ans: p ≤ 1 − √2/2.]
EXERCISE 6.9.
i) Prove that if A and B are independent events then Ac and B are independent events. Which
question on this sheet illustrates a special case of this?
ii) Prove that if A, B and C are independent events then A and B ∪C are independent.
iii) Prove that if A and Ac are independent events then P(A) = 0 or 1. Which question on this
sheet is related to this?
EXERCISE 6.10. [Final examination 2015.] Let Ω = {1, 2, 3}. Give an example —with justification—
of a probability P for which the events {1, 2} and {2, 3} are independent.
[Hint: See exercise 3.11.]
EXERCISE 6.11. We toss a fair coin n times, with n ≥ 2. Choose any pair of indices i, j such that
1 ≤ i < j ≤ n; choose any pair of symbols ω1, ω2 ∈ {h, t}. Prove that the events
A := ‘The ith toss is ω1’   and   B := ‘The jth toss is ω2’
are independent.
[Hint: Show that |A ∩ B| = 2^{n−2}.]
EXERCISE 6.12. [Hard.] Prove that in order to establish the independence of n events, the number
of identities (37) that need to be checked is 2^n − n − 1.
[Hint: express the answer as a sum of binomial coefficients; then use the binomial theorem, see
page 34.]
7 More on Conditional Probability
We have already used conditional probability as an aid to calculating probabilities. In this section we develop two tools for performing such computations: the Total Probability Theorem, and
Bayes’ Theorem. We begin with a definition.
Definition. The events E1 , E2 , . . . , En partition Ω (or form a partition of Ω) if they are pairwise
disjoint and their union is the whole of Ω.
In symbols:
E_j ∩ E_k = ∅ for j ≠ k,   and   E1 ∪ · · · ∪ En = Ω.
The events E and E c partition Ω, for any E, e.g., the even and the odd integers form a partition
of Z.
Theorem 7.1 (Total Probability). If E1, E2, . . . , En partition Ω and P(Ei) ≠ 0 for all i, then for any
event A we have
P(A) = ∑_{i=1}^{n} P(A|Ei) P(Ei).
The condition P(Ei ) > 0 implies, in particular, that Ei is not empty. The same condition will
be found in all applications of the idea of a partition.
As a special case, if 0 < P(E) < 1 (hence P(E) ≠ 0 and P(E^c) ≠ 0), then
P(A) = P(A|E)P(E) + P(A|E c )P(E c ).
PROOF. Let Ai = A ∩ Ei. Then the events A1, . . . , An are pairwise disjoint (because E1, . . . , En are
pairwise disjoint), and their union is A.
Applying Axiom III we get P(A) = P(A1 ∪ · · · ∪ An) = ∑_{i=1}^{n} P(Ai).
Now, using the fact that P(Ei) ≠ 0, the definition of conditional probability gives P(Ai) =
P(A ∩ Ei) = P(A|Ei)P(Ei), and hence P(A) = ∑_{i=1}^{n} P(A|Ei)P(Ei), as desired. □
Often we know the conditional probability of an event under certain assumptions. This result
tells us how to aggregate these conditional probabilities to find the probability of the event —a
‘divide and conquer’ technique. You saw several examples of calculating probabilities by this
method in the exercise sheets.
EXAMPLE. We consider a flow along the graph depicted in figure 2. The starting point is the vertex
x, from which we choose at random one of the four outgoing edges, to arrive at one of the vertices
αi . From there we choose at random one among the outgoing edges, and repeat this process until
we arrive at one vertex without outgoing edges, where the flow stops. What is the probability that
we arrive at y?
Figure 2: Flow on a graph.
For i = 1, . . . , 4, let Ei be the event ‘The flow reaches αi’. These events are mutually exclusive
(obviously), equiprobable (by assumption), and their union is the certain event (the flow must pass
through one of these vertices). Therefore
P(Ei) = 1/4,   i = 1, . . . , 4.
Let Y be the event ‘The flow reaches y’. By considering the number of outgoing edges from the
αi, and any subsequent vertex, we obtain
P(Y|E1) = 1/3,   P(Y|E2) = 1/2,   P(Y|E3) = 1,   P(Y|E4) = 2/5.
Hence, from theorem 7.1, we compute
P(Y) = ∑_{i=1}^{4} P(Y|Ei) P(Ei) = (1/4) × (1/3 + 1/2 + 1 + 2/5) = 67/120.
Our next result is an analogue of the theorem of total probability for conditional probabilities.
Theorem 7.2. If E1 , E2 , . . . , En partition Ω, and if A and B are events with P(B ∩ Ei ) > 0 for all i,
then
P(A|B) = ∑_{i=1}^{n} P(A|B ∩ Ei) P(Ei|B).
PROOF. We apply theorem 7.1 to the event A ∩ B:
P(A|B) = P(A ∩ B)/P(B) = (1/P(B)) ∑_{i=1}^{n} P(A ∩ B|Ei) × P(Ei)
       = (1/P(B)) ∑_{i=1}^{n} [P(A ∩ B ∩ Ei)/P(Ei)] × P(Ei)
       = ∑_{i=1}^{n} [P(A ∩ B ∩ Ei)/P(B ∩ Ei)] × [P(B ∩ Ei)/P(B)]
       = ∑_{i=1}^{n} P(A|B ∩ Ei) × P(Ei|B). □
EXAMPLE. I have two coins: one is fair and the other has probability 3/4 of coming up heads. I
toss a coin selected at random.
a) What is the probability that I get a head?
b) Suppose I get a head. What is the probability that the coin is fair?
c) What is the probability that a second toss of the same coin also gives a head?
We consider the events:
F : ‘I select the fair coin’,
H1 : ‘First toss is head’,
H2 : ‘Second toss is head’.
We have P(F) = P(F c ) = 1/2.
a) We need P(H1). Using the total probability theorem, we obtain
P(H1) = P(H1|F) × P(F) + P(H1|F^c) × P(F^c) = (1/2) × (1/2) + (3/4) × (1/2) = 5/8.
b) We need P(F|H1). By definition
P(F|H1) = P(F ∩ H1)/P(H1) = (1/2)²/(5/8) = 2/5.
c) We want P(H2|H1). Using theorem 7.2, we find
P(H2|H1) = P(H2|H1 ∩ F) × P(F|H1) + P(H2|H1 ∩ F^c) × P(F^c|H1)
         = (1/2) × (2/5) + (3/4) × (3/5) = 13/20,
where we have used the identity P(A^c|B) = 1 − P(A|B), see exercise 5.6.
As we have seen, P(A|B) and P(B|A) are very different things. The following theorem, known as
Bayes’ Theorem, relates these two conditional probabilities; it is used when you need to calculate
P(B|A) from P(A|B).
Theorem 7.3 (Bayes). For any events A and B, with P(A), P(B) > 0, we have
P(B|A) = P(A|B) P(B) / P(A).
PROOF. By definition, P(A|B) = P(A ∩ B)/P(B), that is, P(A ∩ B) = P(A|B)P(B). Since A ∩ B =
B ∩ A, we have P(B|A) = P(A ∩ B)/P(A). Substitution gives the desired result. □
The conclusion of this theorem is sometimes stated as
P(B|A) = P(A|B) P(B) / [P(A|B) P(B) + P(A|B^c) P(B^c)],   (40)
by applying the theorem of total probability to the denominator. We can go further. Given any
partition B1, . . . , Bn, with P(Bj) > 0, Bayes’ theorem can also be stated as
P(Bk|A) = P(A|Bk) P(Bk) / P(A) = P(A|Bk) P(Bk) / ∑_{j=1}^{n} P(A|Bj) P(Bj).
EXAMPLE. (Medical trials.) 2% of a population suffer from a disease. We have a diagnostic test
which has a 90% chance of giving a positive result in someone with the disease, and a 1% chance of
giving a false positive. We are interested in the probability that a person testing positive does have
the disease. Let D be the event ‘the person tested has the disease’ and let P be the event ‘the test is
positive’. We want P(D|P).
By Bayes’ theorem [in the form (40)], we have:
P(D|P) = P(P|D) P(D) / [P(P|D) P(D) + P(P|D^c) P(D^c)]
       = (9/10 × 2/100) / (9/10 × 2/100 + 1/100 × 98/100)
       = 180/278 = 90/139 ≈ 0.65.
Since 0 < P(P) < 1, from exercise 5.6 we obtain
P(Dc |P) = 1 − P(D|P) ≈ 0.35.
So, even though the test has a very small chance of giving a ‘false positive’ (i.e., of identifying
the disease in a healthy patient), the conditional probability of not having the disease given a
positive test is surprisingly large. This is due to the fact that the disease is rare. The prosecutor’s
fallacy mentioned in the introduction is another example, with similar features.
One way of thinking of these examples is that we have changed our assessment of the likelihood
of B from P(B) to P(B|A) given the information that A occurs. This “updating” of probabilities in
the light of evidence is called Bayesian inference by statisticians.
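The numbers in the medical-trials example can be reproduced in a few lines of Python; a sketch using exact rational arithmetic (the variable names are ours):

```python
from fractions import Fraction as F

p_D = F(2, 100)             # prevalence: 2% of the population has the disease
p_pos_given_D = F(9, 10)    # P(P|D): 90% chance of a true positive
p_pos_given_Dc = F(1, 100)  # P(P|D^c): 1% chance of a false positive

# Bayes' theorem in the form (40): total probability expands the denominator.
p_pos = p_pos_given_D * p_D + p_pos_given_Dc * (1 - p_D)
p_D_given_pos = p_pos_given_D * p_D / p_pos

print(p_D_given_pos)         # 90/139
print(float(p_D_given_pos))  # about 0.647
```

Changing `p_D` shows directly how the rarity of the disease, not the quality of the test, drives the surprisingly large value of P(D^c|P).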
EXERCISES
EXERCISE 7.1. Let A and B be arbitrary non-empty events. Which of the following collections of
sets partition Ω?
i) the two events A, B ∖ A,
ii) the four events A, A^c, B, B^c,
iii) the four events A ∖ B, B ∖ A, A ∩ B, (A ∪ B)^c,
iv) the three events A ∩ B, A △ B, A^c ∩ B^c,
v) the three events A, B, (A ∪ B)^c.
In each case give a brief reason. If the given sets do not partition Ω, are there conditions on A and
B that would turn them into a partition?
EXERCISE 7.2. [Final examination 2015.] A box contains two fair coins and one unfair coin with
two heads. A coin is selected at random from the box, and tossed twice. (a) What is the probability
that both tosses show heads? (b) If both tosses show heads, what is the probability that it is the
unfair coin?
[Ans: (a) 1/2; (b) 2/3]
EXERCISE 7.3. Let E1 and E2 be disjoint events, and let A ⊆ (E1 ∪ E2).
1. Prove that A = (A ∩ E1 ) ∪ (A ∩ E2 ).
2. Let P(Ei ) > 0, for i = 1, 2. Prove that P(A) = P(A|E1 )P(E1 ) + P(A|E2 )P(E2 ).
3. Generalise the above results to the case of n events E1 , . . . , En .
EXERCISE 7.4. Mimi and Rodolfo are looking for a key in the dark. Suppose that the key may be
under the table, behind the bookshelf or in the corridor and has a 1/3 chance of being in each of
these places. Mimi searches under the table; if the key is there she has a 3/5 chance of finding it.
Rodolfo searches behind the bookshelf; if the key is there he has a 1/5 chance of finding it.
i) Calculate the probability that the key is found.
ii) Suppose that the key is found. Calculate the conditional probability that it is found by
Rodolfo.
iii) Suppose that the key is not found. Calculate the conditional probability that it is in the
corridor.
[Ans: i) 4/15; ii) 1/4; iii) 5/11.]
EXERCISE 7.5. Two key members of a volleyball team are injured, and each has probability 1/3
of recovering before the match. The recoveries of the two players are independent of each other. If
both are able to play then the team has probability 3/4 of winning the match, if only one of them
plays then the probability of winning is 1/2 and if neither plays the probability of winning is 1/16.
What is the probability that the match is won?
[Ans: 1/3.]
EXERCISE 7.6. In a game of tennis between Roger and Rafael, the score has reached deuce.¹⁴
Suppose that each point is won by Roger with probability 1/4 (and otherwise by Rafael). Suppose
also (for simplicity, if unrealistically) that the outcome of each point is independent of all other
points.
Let x be the probability that Roger wins the game, u be the conditional probability that Roger
wins the game given that he wins the first point and v be the conditional probability that Roger
wins the game given that he loses the first point. Show that:
4x = u + 3v,
4u = 3x + 1,
4v = x,
and hence determine the probability that Roger wins the game.
[Ans: 1/10.]
¹⁴ Once the score has reached deuce, play continues until (effectively) one player has a lead of two points.
8 Random Variables
A probabilistic description of a problem requires a set Ω (the sample space) and a function P (a
probability) which assigns a real number to each subset of Ω (an event). So far, we have seen
many sets and operations on sets, but just two functions: the probability and the related conditional
probability [see (28), p. 39].
In this section we introduce a new class of functions that will become a permanent feature of
the theory: the random variables. These functions will give us great flexibility in defining events,
and a tool for organising the study of their probabilities, both conceptually and computationally.
Often we are interested not in the exact outcome of an experiment, but only in some aspect of
it. For instance, if Ω is a set of university students, then an outcome ω is a student, and we may
only want to know their year of study, or their score in a particular exam. Random variables
are introduced to handle these sorts of situations, and the idea is to represent the measurement of
interest as a real number. This prescription is suitable to most measurements.
Definition. A random variable X is a function X : Ω → R.
This definition¹⁵ may be given without symbols:
A random variable is a real-valued function defined over the sample space.
The expression ‘real-valued function’ is synonymous with ‘a function whose co-domain is the set
R’.
Once again, we must adjust to the dialect of probability: plainly, a random variable is not
a variable and is not random. Moreover, the symbols used for random variables (capital letters
from the end of the alphabet: X,Y, Z) are hardly the symbols normally chosen for functions. As
the theory develops, the reasons for this seemingly incongruous terminology and notation will
emerge.
We look at examples of random variables.
EXAMPLE. Let the sample space Ω be a set of books. There are endless possibilities for random
variables on this space. To define a random variable X, we must make up a rule which assigns a
real number X(ω) to each book ω ∈ Ω. For example:
1. the number of pages of ω;
2. the number of letters ‘c’ appearing in the title of ω;
3. 1 if ω is a thriller, and 0 otherwise;
4. the weight in kg of ω.
¹⁵ If Ω is not countable, not all functions Ω → R are random variables; see section 8.5.
Note that the symbol ω must appear in each definition (unless the function X is a constant). Thus,
in example 3, the set of thrillers in Ω is the set of books ω such that X(ω) = 1, and the probability
that a book chosen at random is a thriller is equal to the probability that X(ω) = 1. We see how a
random variable, that is, a real function, identifies an event, hence a probability.
The attentive reader will have noticed that for these functions to be well-defined, it is necessary
that all expressions appearing in the definition have an exact meaning. For instance, in item 1 we
must decide whether or not the ‘number of pages’ includes all pages between the front and back
cover. In item 2, we must decide if the ‘title’ of a book should include the sub-title, if there is
one. The term ‘thriller’ in 3 seems subjective, and to avoid ambiguity we may need to restrict the
sample space to books equipped with a list of key-words, which specify genres. Item 4 is even
more problematic —see exercises.
While a mathematician would much prefer to deal only with crisp mathematical objects, real-valued
functions over non-mathematical sets are necessary in many applications (e.g., database design).
EXAMPLE. Let Ω be a set of polynomials in one variable, with integer coefficients. For example:
ω = 2 + 3x − 5x³ − 6x⁴ + x⁹.   (41)
We give examples of random variables over Ω, indicating in parentheses the value X(ω) for the
polynomial ω in (41):
1. the degree of ω [X(ω) = 9]
2. the constant coefficient of ω [X(ω) = 2]
3. the largest coefficient of ω [X(ω) = 3]
4. the number of negative coefficients of ω [X(ω) = 2]
5. the sum of the coefficients of ω. [X(ω) = −5]
6. the value of ω at √2, that is, the number obtained by substituting √2 for the indeterminate
of ω. [X(ω) = 9√2 − 22]
Again, the symbol ω appears in each definition. Thus, in item 1, the non-constant polynomials in
Ω are those for which X(ω) ≥ 1; in item 4, the probability that a polynomial chosen at random has
non-negative coefficients is equal to the probability that X(ω) = 0.
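These random variables become ordinary computable functions once we fix a representation for ω. A sketch in Python, encoding the polynomial (41) by its coefficient list (the representation is our choice, not part of the notes):

```python
import math

# omega from (41), stored so that omega[i] is the coefficient of x^i.
omega = [2, 3, 0, -5, -6, 0, 0, 0, 0, 1]

X1 = max(i for i, c in enumerate(omega) if c != 0)   # degree
X2 = omega[0]                                        # constant coefficient
X3 = max(omega)                                      # largest coefficient
X4 = sum(1 for c in omega if c < 0)                  # number of negative coefficients
X5 = sum(omega)                                      # sum of the coefficients
X6 = sum(c * math.sqrt(2) ** i for i, c in enumerate(omega))  # value at sqrt(2)

print(X1, X2, X3, X4, X5, round(X6, 6))  # 9 2 3 2 -5 and 9*sqrt(2) - 22
```

Each `Xi` realises one of the six random variables listed above as a function of the outcome ω.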
EXAMPLE. Roll a standard die: Ω = {1, 2, 3, 4, 5, 6} ⊆ R. Let X be the identity function on Ω,
that is, X(ω) = ω. Since Ω is a set of real numbers, the function X is a (rather useless) random
variable.
Definition. Let X be a random variable. The set
Range(X) := {X(ω) : ω ∈ Ω}
(42)
is called the range¹⁶ of X.
It should be clear that the range can be defined for any function, not just random variables. The
next examples illustrate this concept.
EXAMPLE. A coin is tossed three times. Let Ω be the set of outcomes, represented as words of
three letters from the alphabet {h,t}:
Ω := {hhh, hht, hth, htt,thh,tht,tth,ttt}.
We give some examples of random variables on Ω, indicating in parentheses the range of values
each random variable assumes:
1. the number X(ω) of heads in ω, e.g., X(thh) = 2; [Range(X) = {0, 1, 2, 3}];
2. the number Y (ω) of tails in ω: Y (ω) = 3 − X(ω); [Range(Y ) = {0, 1, 2, 3}];
3. the number Z(ω) giving the absolute value of the difference between the number of heads
and the number of tails in ω:
Z(ω) = |X(ω) − Y(ω)|, e.g., Z(htt) = 1; [Range(Z) = {1, 3}];
4. the number M(ω) giving the largest among the number of heads and the number of tails in
ω: M(ω) = max{X(ω), Y(ω)}, e.g., M(thh) = 2; [Range(M) = {2, 3}];
5. the number W(ω) of occurrences of two consecutive equal entries in ω, e.g., W(tht) = 0;
[Range(W) = {0, 1, 2}].
EXAMPLE. If we toss a coin n times, we expect about n/2 heads. To analyse this problem, we
define the functions X and Z as we did above for n = 3. We call Z the discrepancy of the sequence
ω.
Z(ω) = |X(ω) −Y (ω)| = |X(ω) − (n − X(ω))| = |2X(ω) − n|.
(43)
Thus the discrepancy of a sequence of n elements has the same parity as n. Since the range of X is
{0, . . . , n}, we have
Range(Z) = {0, 2, 4, . . . , n − 2, n} if n is even;   {1, 3, 5, . . . , n − 2, n} if n is odd.
¹⁶ The range is also called the image of X, or the image of Ω under X.
To compute the discrepancy we represent ω as a binary sequence (with, say, 1 and 0 meaning
head and tail, respectively):
ω = (ω1, . . . , ωn),   ωk ∈ {0, 1}.
Then the number X(ω) of heads in ω is given by a sum:
X(ω) = ∑_{k=1}^{n} ωk.   (44)
We expect the discrepancy to be small compared to n. The data shown below refer to 30
computer-generated sequences ω, of n = 1000 coin tosses each. We display the corresponding 30
values of Z(ω):
28, 6, 58, 60, 24, 10, 16, 6, 24, 36, 0, 22, 26, 26, 14, 30, 4, 34, 4, 2, 4, 44, 16, 24, 72, 28, 28, 26, 32, 8.
These are the same data sorted in ascending order, which is more informative.
0, 2, 4, 4, 4, 6, 6, 8, 10, 14, 16, 16, 22, 24, 24, 24, 26, 26, 26, 28, 28, 28, 30, 32, 34, 36, 44, 58, 60, 72.
(In section 8.3, I’ll show you how to do this with a computer.) Note that some relatively low values
are missing, e.g. Z(ω) = 12. The sample space is huge: |Ω| = 2^1000 ≈ 10^301, and so the above
data offer just a glimpse of it. They suggest that the probability that Z(ω) < 100 (which is 10% of
the value of n) must be very large, since all data fall within that range. Later on, we shall develop
techniques to compute these probabilities. Eventually, probability theory will allow one to answer
questions such as: on average, how many times do we have to repeat the experiment above before
having a better than 50% chance of seeing Z(ω) ≥ 100?
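Data of this kind are easy to generate, in the spirit of the computer demonstration promised for section 8.3. A sketch (the seed, and hence the actual numbers, are arbitrary; this is not the code used to produce the data above):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only
n, trials = 1000, 30

def discrepancy(n):
    """Z(omega) = |2 X(omega) - n|, with X(omega) the head count, as in (43)-(44)."""
    heads = sum(random.randint(0, 1) for _ in range(n))
    return abs(2 * heads - n)

data = sorted(discrepancy(n) for _ in range(trials))
print(data)  # 30 sorted values of Z; since n = 1000 is even, all are even
```

Running this a few times shows the same qualitative picture as the table above: the discrepancies cluster far below 100.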
With a single random variable it is possible to represent a large number of events. The idea is
to consider values, or ranges of values, assumed by the random variable, and construct events from
them. To see how this is done, let k be a real number, and let us consider the equation X(ω) = k.
It represents an event A(k), namely
A(k) = {ω ∈ Ω : X(ω) = k}.
(45)
By construction, A(k) is a subset of Ω, namely, an event. This is the set of solutions of the equation
X(ω) = k, that is, the set of outcomes ω at which the value of the function X is equal to k. If X
does not assume the value k, then A(k) = ∅ (the impossible event). Likewise, if X is the constant
function ω ↦ k (not a very useful random variable!), then A(k) = Ω, the certain event.
We now see where the terminology and notation of random variables come from. If the outcome
ω is random (meaning that its value is not predictable), then the real number X(ω) will also be
random, and it is legitimate to call X(ω) a random variable. In probability one then writes X to
mean X(ω), which endorses the idea that a dependent variable (the value of the function X) has
been promoted to a variable.
This notational convention now propagates through the symbolic language. The shorthand
expression ‘X = k’ is used to mean ‘X(ω) = k’, which in turn is a shorthand for {ω ∈ Ω : X(ω) =
k}. The result is that ‘X = k’ is a shorthand of a shorthand: from a set, to a mathematical sentence,
to an idiomatic expression. (We are —in effect— writing log = 2 to mean {e²}, which is rather
extreme.) The final stage of this notational tsunami is to write P(X = k) to mean the probability of
the aforementioned event.
EXAMPLE. We return to the three coin toss example seen earlier, adopting the same notation. The
expression
P(X = 2) = P(Y = 1) = P({hht, hth,thh})
represents the probability of seeing exactly 2 heads. Likewise
P(X ≤ 2) = P(Y ≥ 1) = P({hht, hth, thh, htt, tht, tth, ttt})
is the probability of seeing at most 2 heads, while the probability of seeing more heads than tails is
P(X > Y) = P(X ≥ 2) = P({hht, hth, thh, hhh}).
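For a fair coin, these events and their probabilities can be enumerated directly from the sample space; a minimal sketch:

```python
from fractions import Fraction as F
from itertools import product

# Sample space of three fair-coin tosses; all 8 outcomes equally likely.
Omega = [''.join(w) for w in product('ht', repeat=3)]

def P(event):
    return F(len(event), len(Omega))

X = lambda w: w.count('h')  # number of heads
Y = lambda w: w.count('t')  # number of tails

print(P({w for w in Omega if X(w) == 2}))    # P(X = 2)  = 3/8
print(P({w for w in Omega if X(w) <= 2}))    # P(X <= 2) = 7/8
print(P({w for w in Omega if X(w) > Y(w)}))  # P(X > Y)  = 1/2
```

Each set comprehension is literally the event {ω ∈ Ω : X(ω) = k} of (45), built from the random variable.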
8.1 Probability mass functions
Associated to every random variable, there is a new probability function, which we now define.
Definition. Let X be a random variable. The function
δX : R → R,   δX(k) = P(X = k) = P({ω ∈ Ω : X(ω) = k})
is called the probability mass function (PMF) of X.
If the reference to the random variable is clear, we write δ for δX . Since k is arbitrary, the
domain of δ is the whole of R (see below). The co-domain is R because the end result is a
probability, which is a real number.
Give yourself time to absorb this definition and related notation. This is a detailed description
of the procedure for constructing δX (k) from k.
1. Choose an arbitrary k ∈ R, and consider the equation X(ω) = k.
2. Determine the set A(k) of all the solutions to this equation —see (45).
3. Compute P(A(k)). This is the value of the PMF at k.
In step 1, if k is not among the values assumed by the function X, then A(k) = ∅, and P(A(k)) = 0.
The fact that we may have δ (k) = 0 for most k is not an impediment here. The mass function of X
is determined once we know Range(X).
EXAMPLE. Consider again three tosses of a coin, and the random variable X which gives the
number of heads observed. Let the coin have probability p of coming up heads. The event ‘X = 2’
is {hht, hth,thh}, and so
P(X = 2) = P({hht, hth, thh}) = 3p²(1 − p).
Similarly, you can work out P(X = 0), P(X = 1), P(X = 3), and obtain that the probability mass
function is
k         0         1           2           3
P(X = k)  (1 − p)³  3p(1 − p)²  3p²(1 − p)  p³    (46)
Note that the random variable Y giving the number of tails in ω has exactly the same probability
mass function. This simple example shows that different random variables may have the same
mass function (see exercises).
EXAMPLE. I toss a fair coin until I see a head. Let T be the number of tosses. Then T is a random
variable taking values in N. If k ∈ N then there is a single outcome ω such that T(ω) = k, namely
ω = t · · · t h, with k − 1 tails. Then
P(T = k) = P(t · · · t h) = 1/2^k.
We have
k         1    2    3    · · ·  i      · · ·
P(T = k)  1/2  1/4  1/8  · · ·  1/2^i  · · ·    (47)
Let us now consider a different random variable for the same system. We play a game of chance
in which the player —after paying a fee— chooses k ∈ N, and wins 2^k pounds if T(ω) = k. The
new random variable
W(ω) = 2^T(ω)   (48)
represents the (gross) amount won, and P(W = 2^k) = 1/2^k. The random variable W is a function
of the random variable T, the function being f(T) = 2^T. The PMF of W is computed from that of
T [given by (47)]:
n         2    2²    2³    · · ·  2^k    · · ·
P(W = n)  1/2  1/2²  1/2³  · · ·  1/2^k  · · ·    (49)
How do we choose the fee in such a way that the Casino, after a large number of games, breaks
even? Later we shall be able to answer this question.
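The two PMFs (47) and (49) can be tabulated together, truncating the countably infinite range at some cutoff; a sketch:

```python
from fractions import Fraction as F

K = 20  # truncation point for the (countably infinite) range of T

# PMF (47) of T: P(T = k) = 1/2^k.
pmf_T = {k: F(1, 2 ** k) for k in range(1, K + 1)}

# W = f(T) = 2^T relabels the values but keeps the probabilities, as in (49).
pmf_W = {2 ** k: p for k, p in pmf_T.items()}

print(pmf_T[3], pmf_W[8])   # both 1/8: the same event, relabelled
print(sum(pmf_T.values()))  # 1 - 1/2^K, tending to 1 as K grows
```

The dictionary comprehension for `pmf_W` makes the point of (48) concrete: applying f to T moves the probability mass to new labels without changing it.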
Definition. A random variable X is discrete if its range is either finite or countably infinite.
The definition of random variable implies that the range of X is a subset of R. For instance, if
the set of values a random variable assumes is N (such as the number of coin tosses until the first
head is seen) or Z, then this variable is discrete. If the set of values a random variable assumes is
R or some interval of real numbers (for example the height of a randomly chosen person), then the
random variable is not discrete.
Discrete random variables are conceptually simpler, and we will mainly be studying these.
Later in the course we will look at a special type of non-discrete random variables called continuous
random variables.
Lemma 8.1. If X is a discrete random variable then
∑_k δX(k) = ∑_k P(X = k) = 1,
where the sum is over the range of X.¹⁷
Before proving the lemma, we make some remarks. If X assumes finitely many values k1 , k2 , . . . , kn ,
then this result may be written as
∑_{i=1}^{n} P(X = ki) = 1.
If X assumes an infinite sequence of values k1 , k2 , . . . (namely, the set Range(X) is countably
infinite), we write instead
∑_{i=1}^{∞} P(X = ki) = 1.
To treat the finite and infinite cases with the same notation, we write
∑_{i≥1} P(X = ki) = 1    or    ∑_{k∈Range(X)} P(X = k) = 1.
Lemma 8.1 provides a useful check when calculating the PMF.
Let’s see the lemma in action on examples considered above. For the PMF of the random
variable X of the three coin tosses example (46), we have
∑_{k=0}^{3} P(X = k) = (1 − p)³ + 3p(1 − p)² + 3p²(1 − p) + p³ = [(1 − p) + p]³ = 1.
For the PMF of the random variable T in (47) we have
∑_{k=1}^{∞} P(T = k) = ∑_{k=1}^{∞} 1/2^k = (1/2)/(1 − 1/2) = 1.
¹⁷ If the set of values X can take is infinite we don’t have a formal definition of what this sum means. However, in most cases that we will look at, you will be able to work it out (for instance you know how to find the sum of infinitely many terms of a geometric progression).
PROOF OF LEMMA 8.1. Let Ei be the event X = ki, for i = 1, 2, . . .. Then the sets Ei partition Ω.
Indeed, if i ≠ j, then we must have Ei ∩ Ej = ∅, because if Ei and Ej had a point ω in common, then
the function X would have two different values at ω. Likewise, the union of the Ei is Ω, because Ω
is the domain of the function X. The lemma now follows from Axiom III (which, as stated, applies
only to finite or countable sample spaces). □
8.2 Expectation and variance
A random variable is characterised by two key properties: its expectation and variance.
Definition. The expectation E(X) of a discrete random variable X is
E(X) := ∑_{k∈Range(X)} k P(X = k).   (50)
Note that the value of the sum could be infinite, in which case we say that the expectation of X
is infinite. If X assumes the sequence of values k1 , k2 , . . . (we can form such a sequence —finite or
infinite— because X is discrete), then we can rewrite the expectation as
∑_{i≥1} ki P(X = ki) = k1 P(X = k1) + k2 P(X = k2) + · · · .
Thus to compute the expectation of X, we must first determine its PMF (the values ki assumed by
X and the corresponding probabilities), and then perform a summation.
The expectation of a random variable is a generalisation of the notion of arithmetical mean µ
of a sequence k1 , . . . , kn of real numbers, given by
µ = (1/n) ∑_{i=1}^{n} ki.
To see this, suppose that Ω = {1, . . . , n}, and that all numbers i ∈ Ω are equiprobable, namely
P(i) = 1/n. Let X be the random variable defined by X(i) = ki , for all i ∈ Ω. Then
µ = (1/n) ∑_{i=1}^{n} ki = ∑_{i=1}^{n} ki × (1/n) = ∑_{i=1}^{n} ki P(i) = ∑_{i=1}^{n} ki P(X = ki) = E(X).
So the arithmetical mean is equal to the expectation value of the random variable X which generates
the sequence of numbers being considered.
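This identity is easy to test numerically: pool repeated values of the sequence into a PMF and compare the two sides. A sketch (the sample sequence is arbitrary):

```python
from fractions import Fraction as F
from collections import Counter

ks = [3, 1, 4, 1, 5]  # the sequence k_1, ..., k_n; repeats are allowed
n = len(ks)

# Arithmetical mean of the sequence.
mean = F(sum(ks), n)

# E(X) computed from the PMF: equal values k_i pool their probability mass.
pmf = {k: F(c, n) for k, c in Counter(ks).items()}
expectation = sum(k * p for k, p in pmf.items())

print(mean, expectation)  # both 14/5
```

The repeated value 1 shows why the PMF route works: its two occurrences contribute probability 2/5 to a single term of the sum (50).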
Definition. The variance Var(X) of a discrete random variable X is defined as
Var(X) := ∑_k (k − E(X))² P(X = k),   (51)
where the summation is over Range(X).
Since all terms in the sum (51) are non-negative, so is the variance, cf. Proposition 8.5i below.
EXAMPLE. Let Ω = {1, 2} with equiprobable elements: P(1) = P(2) = 1/2. Now fix an arbitrary
real number a, and define the random variable X by the formula
X(1) = a,   X(2) = −a.
The PMF of X is easily described: it assumes the value 1/2 at the points ±a, and it is zero everywhere else. The expectation value of X is easy to guess:
E(X) = aP(X = a) − aP(X = −a) = aP(1) − aP(2) = a/2 − a/2 = 0.
On the other hand, the variance (51) does depend on a:
Var(X) = (a − 0)² P(X = a) + (−a − 0)² P(X = −a) = a²/2 + a²/2 = a².
The larger |a| is, the further apart the values of X are, and the larger the variance of X.
The example above illustrates the meaning of variance: it measures how concentrated the
distribution of mass is about the expectation. Thus a small variance means sharply concentrated
mass, and a large variance means that the mass is spread out. Zero variance is possible only if
there is just one value of X at which the mass function is non-zero (see exercise 8.9). This is the
case, in particular, if the function X is constant.
The following result provides alternative formulae for expectation and variance.
Proposition 8.2 (Alternative formulae for E and Var). If X is a discrete random variable, then
i) E(X) = ∑_{ω∈Ω} X(ω) P(ω);
ii) Var(X) = ∑_{k∈Range(X)} k² P(X = k) − E(X)².
Some mathematicians regard formula ii) as the definition of variance. Since this formula and
the one given above have the same value, this makes no difference.
PROOF. We prove i). Let X assume the values k1, k2, . . ., and let Ai = {ω ∈ Ω : X(ω) = ki}. Then
the sets Ai partition Ω, and we find
E(X) = ∑_{i≥1} ki P(X = ki) = ∑_{i≥1} ki ∑_{ω∈Ai} P(ω) = ∑_{i≥1} ∑_{ω∈Ai} ki P(ω)
     = ∑_{i≥1} ∑_{ω∈Ai} X(ω) P(ω) = ∑_{ω∈Ω} X(ω) P(ω).
We prove ii). We compute (all summations extend to Range(X)):
Var(X) = ∑_k k² P(X = k) − 2E(X) ∑_k k P(X = k) + E(X)² ∑_k P(X = k)
       = ∑_k k² P(X = k) − 2E(X)² + E(X)² = ∑_k k² P(X = k) − E(X)²,
since ∑_k k P(X = k) = E(X) and ∑_k P(X = k) = 1. The proof is complete. □
EXAMPLE. For the three coins example described earlier we have
E(X) = 0 × (1 − p)³ + 1 × 3p(1 − p)² + 2 × 3p²(1 − p) + 3 × p³
     = 3p[(1 − p)² + 2p(1 − p) + p²] = 3p
and (using Prop. 8.2)
Var(X) = 0² × (1 − p)³ + 1² × 3p(1 − p)² + 2² × 3p²(1 − p) + 3² × p³ − (3p)²
       = 3p(1 − p).
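The closed forms E(X) = 3p and Var(X) = 3p(1 − p) can be confirmed symbolically for any rational p; a sketch using the PMF (46):

```python
from fractions import Fraction as F

def pmf_three_tosses(p):
    """PMF (46): X = number of heads in three tosses of a p-coin."""
    q = 1 - p
    return {0: q ** 3, 1: 3 * p * q ** 2, 2: 3 * p ** 2 * q, 3: p ** 3}

for p in (F(1, 2), F(1, 3), F(3, 4)):
    pmf = pmf_three_tosses(p)
    E = sum(k * pr for k, pr in pmf.items())
    V = sum(k ** 2 * pr for k, pr in pmf.items()) - E ** 2  # Prop. 8.2 ii)
    assert E == 3 * p and V == 3 * p * (1 - p)
    print(p, E, V)
```

Exact fractions make the comparison an equality test rather than a floating-point approximation; Lemma 8.1 can be checked the same way by summing the PMF values.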
We have seen examples of random variables obtained by applying a function to other random
variables. For instance, the number Y of tails in a sequence of n coin tosses is equal to n − X, where
X is the number of heads in the same sequence. In (48), we have considered an exponential dependence between random variables. The following result tells us how to compute the expectation
value of a function of a random variable.
Lemma 8.3. Let X be a discrete random variable, and let f : R → R be any function. Then f (X)
is a discrete random variable, and
E(f(X)) = ∑_{ω∈Ω} f(X(ω)) P(ω) = ∑_k f(k) P(X = k),
where in the rightmost sum we have k ∈ Range(X).
PROOF: Let Y = f(X). Then Y : Ω → R, and since Range(X) is finite or countable, so is Range(Y).
Hence Y is a discrete random variable. Let Range(X) = {k1, k2, . . .} and let Ai = {ω ∈ Ω : X(ω) = ki}. We have
E(f(X)) = ∑_{ω∈Ω} f(X(ω)) P(ω) = ∑_{i≥1} ∑_{ω∈Ai} f(X(ω)) P(ω)
        = ∑_{i≥1} ∑_{ω∈Ai} f(ki) P(ω) = ∑_{i≥1} f(ki) ∑_{ω∈Ai} P(ω)
        = ∑_{i≥1} f(ki) P(Ai) = ∑_{i≥1} f(ki) P(X = ki) = ∑_k f(k) P(X = k),
where the first and second equalities follow, respectively, from propositions 8.2i and 3.5. □
Using this lemma, we can rewrite the definition of the variance of a discrete random variable
X as
Var(X) = E((X − E(X))²).
Here the quantity (X − E(X))² is the random variable obtained by applying the function
x ↦ (x − E(X))² to X.
Also, Proposition 8.2ii becomes
Var(X) = E(X²) − E(X)².   (52)
(Be sure to distinguish between E(X²) and E(X)²; they are completely different things.)
Proposition 8.4 (Properties of expectation). Let X and Y be discrete random variables, and let c ∈ R.

i) E(c) = c

ii) E(cX) = cE(X)

iii) E(X + Y) = E(X) + E(Y)

iv) If m ≤ X(ω) ≤ M for all ω ∈ Ω, then m ≤ E(X) ≤ M.

v) If there exists a ∈ R such that P(X = a − t) = P(X = a + t) for all t > 0 (that is, X is symmetric about a), then E(X) = a.
In item i), the expression E(c) represents the expectation value of the constant function ω ↦ c. Thus the symbol c is used with two different meanings within the same sentence! Likewise, in items ii) and iii), the expressions cX and X + Y represent, respectively, the functions ω ↦ cX(ω) and ω ↦ X(ω) + Y(ω).
P ROOF.
i) The constant random variable c takes the value c with probability 1. So E(c) = c × 1 = c
ii) From formula (50), we find

E(cX) = ∑_k ck P(X = k) = c ∑_k k P(X = k) = cE(X).
iii) We use Proposition 8.2i:

E(X + Y) = ∑_{ω∈Ω} [X(ω) + Y(ω)] P(ω) = ∑_{ω∈Ω} X(ω) P(ω) + ∑_{ω∈Ω} Y(ω) P(ω) = E(X) + E(Y).
iv) Since every value that X takes does not exceed M, we have that

E(X) = ∑_k k P(X = k) ≤ M ∑_k P(X = k) = M.

Similarly, since every value that X takes is ≥ m, we have that

E(X) = ∑_k k P(X = k) ≥ m ∑_k P(X = k) = m.
v) We have:

E(X) = ∑_k k P(X = k)
= a P(X = a) + ∑_{t>0} ((a − t) P(X = a − t) + (a + t) P(X = a + t))
= a P(X = a) + ∑_{t>0} (a P(X = a − t) + a P(X = a + t))
(the terms involving t cancel because P(X = a − t) = P(X = a + t))
= a ∑_k P(X = k) = a.
Proposition 8.5 (Properties of variance). Let X be a discrete random variable and c ∈ R be a
constant.
i) Var(X) > 0
ii) Var(c) = 0
iii) Var(X + c) = Var(X)
iv) Var(cX) = c² Var(X).
P ROOF. For most of these we can either use the definition of variance or Proposition 8.2. However,
one or other of these approaches may be easier.
i) By definition, Var(X) = ∑_k (k − E(X))² P(X = k). Since the square of any real number is non-negative and probabilities are all non-negative, each summand is non-negative. It follows that Var(X) ≥ 0.
ii) Since E(c) = c we have that Var(c) = (c − c)² × 1 = 0.
iii) By (52) and parts i)–iii) of Proposition 8.4, we have:

Var(X + c) = E((X + c)²) − (E(X + c))²
= E(X² + 2cX + c²) − (E(X) + c)²
= E(X²) + 2cE(X) + c² − ((E(X))² + 2cE(X) + c²)
= E(X²) − (E(X))²
= Var(X).
iv) By Proposition 8.4ii,

Var(cX) = E((cX)²) − (E(cX))² = c²E(X²) − (cE(X))² = c²(E(X²) − (E(X))²) = c² Var(X).
The proof is complete.

Several items of propositions 8.4 and 8.5 could be expressed in concise form by saying that for any random variables X, Y and real numbers a, b we have:

E(aX + bY) = aE(X) + bE(Y),   Var(aX + b) = a² Var(X).   (53)
Such simple transformation properties of expectation and variance exist only for affine functions of
random variables. One should not expect analogous formulae for other functions, e.g., f (X) = X 2
or f (X) = eX . For instance, in chapter 9 we shall see that the expectation value of the random
variable T in equation (47) is equal to 2. By contrast, the expectation value of the variable W = 2T
—see equation (49)— is infinite:
E(W) = 2 · (1/2) + 2² · (1/2²) + 2³ · (1/2³) + · · · = 1 + 1 + 1 + · · · = ∞.
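The divergence can be seen numerically. In the sketch below (Python; the truncation point K is arbitrary) the partial sums for E(T) approach 2, while the corresponding partial sums for W = 2^T are equal to K and grow without bound.

```python
# P(T = k) = (1/2)^k for k = 1, 2, 3, ...
def partial_ET(K):
    # Truncated expectation of T: sum of k * (1/2)^k for k <= K
    return sum(k * 0.5**k for k in range(1, K + 1))

def partial_EW(K):
    # Truncated expectation of W = 2^T: each term 2^k * (1/2)^k equals 1
    return sum(2**k * 0.5**k for k in range(1, K + 1))
```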
EXAMPLE. Consider our example of tossing a coin three times. Let Y be the number of tails observed. Clearly Y = 3 − X (where as before X is the number of heads observed)18. We worked out that E(X) = 3p and Var(X) = 3p(1 − p). So, using parts i)–iii) of Proposition 8.4, E(Y) = 3 − E(X) = 3(1 − p). Using parts iii) and iv) of Proposition 8.5, Var(Y) = (−1)² Var(X) = 3p(1 − p).
8.3 Virtual coin tosses
[This section is not examinable (but useful).]
We present a guided tour of techniques for generating sequences of coin tosses with a computer.
We shall use the Maple programming language, but only a minimal knowledge of this language
is necessary. At the end of this section, you’ll see how I have obtained the data displayed on
page 63. You will also be able to run your own computer simulations, and you will see interesting
phenomena that are found in more advanced courses. Equally important, doing some mathematical
experimentation with large sample spaces, non-trivial outcomes, and random variables is the best way to learn this material.
We begin by creating our virtual coin:
> coin:=rand(0..1):
18 Remember that this identity connects two functions. More explicitly, it means that for any i ∈ S we have Y(i) = 3 − X(i).
This command creates the function coin, a bespoke version of the Maple random number generator rand which produces a binary output19 (type ?rand for details). After stipulating that 1 means
head and 0 tail, we toss our virtual coin four times:
> coin(),coin(),coin(),coin();
0, 0, 0, 1
Note that the function coin is called with an empty argument. If we toss it again, there is a high
probability (how high?) that the result will be different:
> coin(),coin(),coin(),coin();
0, 1, 1, 1
and in fact it is.
Next we construct a virtual outcome ω, which is a sequence of 30 coin tosses:
> omega:=[seq(coin(),i=1..30)];
ω := [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1]
How many heads? Surely we don’t want to count them by hand!
> add(x,x=omega);
17
As in the previous section, we let X(ω) be the number of heads in ω. The next command defines
the random variable X, given by (44).
> X:=omg->add(x,x=omg):
> X(omega);
17
Let’s get serious now. We want sequences of n = 1000 coin tosses:
> n:=1000:
We also need the discrepancy of a sequence, namely the random variable Z defined in (43):
> Z:=omg->abs(2*X(omg)-n):
We save typing by defining the short-hand function Sct, whereby Sct(k) is a sequence of k coin
tosses.
> Sct:=k->[seq(coin(),i=1..k)]:
Generating a sequence of n coin tosses is now straightforward:
> omega:=Sct(n):
> X(omega),Z(omega);
492, 16
This time we have 492 heads, hence 508 tails, and the discrepancy is 16. We merge the generation
of ω and the evaluation of Z(ω) in a single command:
19 Likewise, die:=rand(1..6) creates a virtual die.
> Z(Sct(n));
22
Predictably, this result differs from the previous one. With all this machinery, evaluating the discrepancy for a sequence of computer-generated outcomes requires little effort:
> [seq(Z(Sct(n)),i=1..30)];
[16, 58, 2, 14, 24, 28, 14, 16, 54, 28, 2, 18, 8, 44, 14, 22, 0, 50, 28, 14, 6, 6, 32, 0, 4, 20, 46, 54, 20, 50]
This list is more informative when its elements are sorted in ascending order:
> sort(%);
[0, 0, 2, 2, 4, 6, 6, 8, 14, 14, 14, 14, 16, 16, 18, 20, 20, 22, 24, 28, 28, 28, 32, 44, 46, 50, 50, 54, 54, 58]
Many values are repeated, many are missing, and the largest discrepancy is 58. In particular, even
though the probability that two outcomes ω coincide is minuscule (what is it?), the probability that
two values of Z(ω) coincide seems quite large. (Computing the latter probability requires more
advanced techniques.)
In preparation for what is to come, it would be really nice to know how many times each
value of the discrepancy is repeated, without having to inspect the sequence. For this purpose, let
us define the function Freq, which does just that: it computes the frequency of occurrence of the
elements of a sequence. (This function is cryptic: I leave it to the Maple geeks to figure out how it
works.)
> Freq:=L->[seq([x,nops(select((l,x)->evalb(l=x),L,x))],
x=sort([op(op(L))]))]:
Let us test this function on the last set of data:
> Freq(%%);
[[0, 2], [2, 2], [4, 1], [6, 2], [8, 1], [14, 4], [16, 2], [18, 1], [20, 2], [22, 1], [24, 1], [28, 3], [32, 1],
[44, 1], [46, 1], [50, 2], [54, 2], [58, 1]]
(The output format should be clear: 0 occurs twice, 2 occurs twice, etc.; the output is sorted even
if the input is not.)
Now the grand finale: we produce 10⁴ outcomes, requiring a total of ten million coin tosses. (If your computer is not up to this task, then just reduce the number of outcomes.)
> [seq(Z(Sct(n)),i=1..10000)]:
> Zfr:=Freq(%):
First, let us check how often small values of the discrepancy repeat:
> Zfr[1..10];
[[0, 269], [2, 503], [4, 486], [6, 489], [8, 468], [10, 437], [12, 484], [14, 448], [16, 444], [18, 413]]
All 10 smallest values of Z occur repeatedly. We also note that 0 occurs less often than 2. Let’s
find out what is the largest recorded value of the discrepancy:
> Zfr[-1];
[122, 1]
Thus, among 104 sequences of coin tosses, there was one sequence with 122 more heads than tails,
or vice-versa. To get an overview of what’s going on, we plot our frequency data:
> plot(Zfr);
Given that the computed sample space has 10⁴ elements, a value of Z(ω) which is repeated, say, 100 times, occurs with a probability equal to 100/10⁴ = 0.01. These data tell us that in sequences
of 1000 coin tosses, finding discrepancies greater than 100 is very unlikely.
In the next section we shall develop tools for computing probabilities associated with values of
random variables. Later on, when we study the Bernoulli distribution, we’ll even be able to explain
the distinctive shape of the function shown in the above plot —it can be derived from formula (57)
p. 83.
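For readers who prefer not to use Maple, the whole experiment translates almost line by line into Python; the function names below mirror the Maple ones above and are my own choice.

```python
import random
from collections import Counter

n = 1000  # length of each sequence of tosses

def Sct(k):
    """A sequence of k virtual coin tosses (1 = head, 0 = tail)."""
    return [random.randint(0, 1) for _ in range(k)]

def X(omg):
    """Number of heads in the outcome omg."""
    return sum(omg)

def Z(omg):
    """Discrepancy: |#heads - #tails|."""
    return abs(2 * X(omg) - n)

# Frequency of each discrepancy value over many simulated outcomes
Zfr = Counter(Z(Sct(n)) for _ in range(1000))
```

Note that for a sequence of even length the discrepancy is always an even number, which is why only even values appeared in the sorted list above.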
8.4 Probability and digital image processing
[This section is not examinable.]
Some constructs of probability (sample spaces, discrete random variables, probability mass
functions) are relevant to the representation and processing of digital images.
An (8-bit) RGB colour space is the set of triples (r, g, b), where r, g, b are integers in the range 0
to 28 − 1 = 255. These integers represent the intensity of the three primary colours red (R), green
(G), and blue (B), each ranging from 0 (no light) to a maximum value (saturated light). The colour
resolution depends on the number of available bits. With 8 binary bits (base 2 representation), we
have:
0 = 00000000
255 = 11111111.
A colour is an element of the RGB space, which physical devices (e.g., a computer screen) reproduce as a superposition of the primary colours. An 8-bit RGB colour space represents (2⁸)³ distinct colours, about 16 million. Here are some examples:
(0, 0, 0)          black
(255, 255, 255)    white
(a, a, a)          grey      (0 < a < 255; small a: dark grey)
(a, 0, 0)          red
(0, a, 0)          green
(0, 0, a)          blue
(a, a, 0)          yellow
(a, 0, a)          magenta
(0, a, a)          cyan
The light that reaches a pixel of a camera sensor during an exposure gets converted into an
element of the RGB space, via a sophisticated process. The result is a digital image Ω, which is a
collection of pixels ω, arranged in an I × J rectangular array:
ω = (i, j, r, g, b),   1 ≤ i ≤ I,   1 ≤ j ≤ J,   0 ≤ r, g, b ≤ 255.
The integer co-ordinates i, j, specify the location of the pixel in the array, while the last three
entries specify the colour at that location. Let M = I × J. Then |Ω| = M, the number of pixels in
the image, usually measured in units of Megapixels (one million pixels).
Next we introduce some random variables. The basic ones extract a single primary colour from
a pixel:
R(ω) = r,   G(ω) = g,   B(ω) = b.
Light intensity is equally important. It can be defined in several ways:

Z(ω) = (1/3)(R(ω) + G(ω) + B(ω)),
W(ω) = [0.30 × R(ω) + 0.59 × G(ω) + 0.11 × B(ω)].   (54)
The square brackets denote rounding to the nearest integer, which ensures that these functions also assume integer values between 0 and 255.
primary colours, while the more sophisticated function W is a weighted sum of primary colours,
where the weights take into account the physiology of the human eye. (The eye has maximal
sensitivity to green light, and minimal to blue light.) For example, the function
(i, j, r, g, b) ↦ (i, j, W(ω), W(ω), W(ω))
converts a colour image to black-and-white.
The luminosity histograms —very familiar to photographers— are the PMFs of the random
variables R, G, B,W . They provide statistical information about a large number of pixels. Thus, for
light intensity one considers the events ‘W = k’, which correspond to the following subsets of Ω:
Λk = {ω ∈ Ω : W(ω) = k},   k = 0, . . . , 255.
This is the solution set of the equation W(ω) = k. Hence

δW(k) = |Λk| / M,   k = 0, . . . , 255.
Digital cameras and image-processing programs routinely display the function δW, and/or the corresponding triple of PMFs for the primary colours: δR, δG, δB.
In a well-exposed image, the value of δW (k) should be very small when k is near 0 (deep
shadows) or near 255 (highlights). A large value of δW (k) for k at or near 255 signals the presence
of a large number of saturated pixels. This usually results in irreversible loss of details in the
highlights. Similarly, a large value of δW (k) for small k denotes extended deep shadows void of
details. Image-processing software affords a visualisation of such trouble spots. Thus saturated (or near-saturated) highlights comprise the set of pixels

Sj = {ω ∈ Ω : W(ω) ≥ 255 − j} = ∪_{i=0}^{j} Λ_{255−i},   j ≥ 0,
for small values of j. Typically, the value of j is controlled by a slider, and the corresponding event
S j is displayed with a distinctive colour, thereby identifying the affected area of the image.
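These definitions are mechanical to implement. The sketch below (Python, with a tiny synthetic list of pixels standing in for a real image) computes W, the luminosity PMF δW, and the saturated set Sj.

```python
from collections import Counter

# A toy "image": each pixel is a tuple (i, j, r, g, b).
pixels = [
    (1, 1, 255, 255, 255), (1, 2, 0, 0, 0), (1, 3, 100, 100, 100),
    (2, 1, 255, 0, 0), (2, 2, 0, 255, 0), (2, 3, 254, 254, 254),
]
M = len(pixels)  # number of pixels

def W(pixel):
    # Weighted light intensity (54), rounded to the nearest integer
    _, _, r, g, b = pixel
    return round(0.30 * r + 0.59 * g + 0.11 * b)

counts = Counter(W(p) for p in pixels)
delta_W = {k: counts[k] / M for k in counts}  # the luminosity PMF

def S(j):
    # Saturated or near-saturated highlights: pixels with W >= 255 - j
    return [p for p in pixels if W(p) >= 255 - j]
```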
8.5 Is every function X : Ω → R a random variable?
[This section is not examinable.]
Briefly, we raise a point of mathematical rigour concerning the definition of a random variable
given on page 61. If the set Ω is uncountable (infinite, but such that its elements cannot be arranged
to form a sequence, e.g., Ω = R), then the given definition of random variable is not quite correct. It
turns out that some functions are so pathological that they cannot be regarded as random variables.
This should not be entirely surprising: we have already mentioned the fact that if Ω is uncountable
there are sets which are too pathological to be regarded as events. This subtlety is well beyond the
scope of this course and will not concern us at all. Every reasonable function that you would ever
want to use here will be a legitimate random variable.
EXERCISES

EXERCISE 8.1. The following definitions of random variables are rather sloppy.
1. Ω is a set of books, and X(ω) is the weight of ω.
2. Ω is a set of people, and X(ω) is the country where ω was born.
3. Ω is a set of cars, and X(ω) is the capacity of ω’s fuel tank.
Turn them into well-defined mathematical functions, by specifying more precisely the sample
space and/or the function definition.
EXERCISE 8.2. For each sample space, define two discrete random variables, one assuming only
integer values, and one assuming also non-integer values.
1. A set of integer sequences.
2. A set of infinite integer sequences.
3. A set of infinite real sequences.
4. A set of real intervals.
5. A set of triangles in the Cartesian plane.
EXERCISE 8.3. For each random variable of the previous exercise, define an event via this variable, using only symbols. Then define the same event using only words.
EXERCISE 8.4. A random variable X has the following probability mass function:

n          −2     −1     0     1      2
P(X = n)   1/10   2/5   1/4   1/5   1/20

a) Calculate the probability of each of the following events:

i) X = 2;   ii) X = 3;   iii) X ≤ 1;   iv) X² < 2.

[Ans: i) 1/20; ii) 0; iii) 19/20; iv) 17/20.]
b) Let Y be a new random variable defined by Y = X² + 4.
i) What values does Y take?
ii) Find the probability mass function of Y .
[Ans: δY (4) = 1/4; δY (5) = 3/5; δY (8) = 3/20.]
EXERCISE 8.5. Let X be a random variable with range {0, 1, 2}. Suppose that E(X) = a and Var(X) = a². Determine the probability mass function of X.
EXERCISE 8.6. You roll two fair standard dice and record the outcome ω = (ω1, ω2) of the two
numbers showing. Let T (ω) = ω1 + ω2 and let D(ω) = |ω1 − ω2 |.
i) Find the probability mass function of T .
ii) Find the expectation of T .
iii) Find the probability mass function of D.
iv) Find the expectation of D.
v) Say in words what the random variable M = (T + D)/2 measures.
[Ans: ii) 7, iv) 35/18.]
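The numerical answers can be confirmed by enumerating the 36 equally likely outcomes (Python sketch):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # the 36 pairs (w1, w2)
prob = Fraction(1, 36)                           # each pair is equally likely

ET = sum((w1 + w2) * prob for w1, w2 in outcomes)     # expectation of T
ED = sum(abs(w1 - w2) * prob for w1, w2 in outcomes)  # expectation of D
```

With exact fractions the two expectations come out as 7 and 35/18, matching the stated answers.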
EXERCISE 8.7. A bag contains 6 red marbles and 2 blue marbles. I choose 5 at random without
replacement. Let B be the number of blue marbles I end up with, and R be the number of red marbles I end up with. Find the probability mass function of B. Without doing any more calculations
write down the probability mass function of R.
[Ans: δB (0) = δR (5) = 3/28, δB (1) = δR (4) = 15/28, δB (2) = δR (3) = 10/28]
EXERCISE 8.8.
a) Adam has five keys in his pocket one of which opens his front door. On arriving home he
takes out a key at random and attempts to open the door with it. If it doesn’t fit the door then
he replaces it in his pocket. This is repeated until he manages to open the door. Let A be the
random variable ‘the number of attempts he takes’. Find the probability mass function of A.
b) Eve has five keys in her pocket one of which opens her front door. On arriving home she
takes out a key at random and attempts to open the door with it. If it doesn’t fit the door
then she holds on to it and picks a key from those remaining in her pocket. This is repeated
until she manages to open the door. Let E be the random variable ‘the number of attempts
she takes’. Find the probability mass function of E.
[Ans: a) δA(n) = (4/5)^{n−1} × 1/5; b) δE(n) = 1/5 for n = 1, . . . , 5.]
EXERCISE 8.9. Prove that the variance of a discrete random variable X is equal to 0 if and only if
for some k ∈ Range(X), we have P(X = k) = 1.
EXERCISE 8.10. Let X be a random variable taking values in the set {0, 1, 2, 3, . . . , n}.

i) Show that

E(X) = ∑_{i=1}^{n} P(X ≥ i).

ii) Deduce that if E(X) < 1 then X takes the value 0 with positive probability.

iii) Use part i) to prove that for all 1 ≤ t ≤ n we have

P(X ≥ t) ≤ E(X)/t.
EXERCISE 8.11. Let Ω = {1, 2, 3}, and assume that all outcomes are equally likely. Define two
distinct random variables on Ω which have the same probability mass function. Can you define
more than two such random variables? How many? Do the same without the assumption that all
outcomes are equally likely.
EXERCISES FOR SECTION 8.4

EXERCISE 8.12. Let Ω be a perfectly exposed picture of a chess-board over a grey background with RGB colour (β, β, β), with 0 ≤ β ≤ 255. The camera has a sensor with I × J pixels, and ratio I/J = 3/2. Compute the probability mass function of the random variable W [defined in (54)], as a function of β, and of the relative area α of the chess-board with respect to the area of the image.
9 Some Discrete Distributions
In the past eight chapters, we have used coin tosses, balls in bags, card decks, etc., as metaphors
for classes of general phenomena. Now it’s time to extract the mathematical essence of these phenomena, and define universal objects of greater significance. These are prototypes of probability
mass functions, called distributions.
Let δ : R → R be a function. What properties must δ satisfy in order to be the probability mass function of some discrete random variable? Keeping in mind the properties of PMFs, the answer is not difficult:

1. There is a finite or countable set R = {k1, k2, . . .} of real numbers such that δ(x) ≠ 0 only if x ∈ R. (Such a set R serves as the range of some discrete random variable.)

2. We have

δ(ki) ≥ 0   and   ∑_{i≥1} δ(ki) = 1.   (55)

(This ensures that the values of δ behave like probabilities.)
A function with these properties is said to be a distribution.
Certain distributions occur so often that they have been given special names (Bernoulli, Binomial, Hypergeometric, Geometric, Poisson) and a special notation:

Bernoulli(p),   Bin(n, p),   Hg(n, M, N),   Geom(p),   Poisson(λ).

Each distribution depends on one or more parameters, so that it can be tailored to different situations. In this section we describe the types of phenomena these distributions model, and we determine their expectation and variance as functions of the parameters.
If X is a random variable and δ is a distribution, then we write X ∼ δ to indicate that the
probability mass function of X is δ . Thus X ∼ δX . Note that we may have X ∼ δ and Y ∼ δ even if
X 6= Y . For instance, in a sequence of fair coin tosses, the number of heads and the number of tails
seen are different random variables, but they have the same probability mass function: δX = δY .
Bernoulli distribution
Suppose that a trial (experiment) with two outcomes is performed. We will call the outcomes
success and failure, and let P(success) = p (hence P(failure) = 1 − p). Such a trial is called a
Bernoulli trial (or a Bernoulli(p) trial if we wish to emphasise the probability).
The random variable X has the Bernoulli(p) distribution (write X ∼ Bernoulli(p)) if its range
is {0, 1}, and if its probability mass function is

k          0       1
P(X = k)   1 − p   p
Figure 3: Binomial distribution Bin(n, p), for n = 50 and p = 1/10 (blue), p = 1/2 (red), p = 3/4 (black).
The expectation and variance of X are calculated from the PMF in the usual way:

E(X) = 0 × (1 − p) + 1 × p = p,
Var(X) = 0² × (1 − p) + 1² × p − p² = p(1 − p).   (56)
The variance attains its maximum at p = 1/2, and is equal to zero for p = 0, 1 (see exercise 8.9).
The Bernoulli distribution emerges naturally in the following very general construct. Let Ω be any sample space, and let A be an event. The random variable IA, defined as

IA : Ω → R,   IA(ω) = 1 if ω ∈ A,   IA(ω) = 0 if ω ∉ A,

is called the indicator function (or characteristic function) of the set A (in Ω). It is plain that the event ‘IA = 1’ has probability P(A). Therefore IA has the Bernoulli distribution with parameter p = P(A). In symbols: IA ∼ Bernoulli(P(A)).
Binomial distribution
If n independent Bernoulli trials are performed20, and if we let X be the number of trials which result in success, then X has the binomial distribution. We write X ∼ Bin(n, p). The random variable X takes values in {0, 1, 2, . . . , n} and has PMF

P(X = k) = \binom{n}{k} p^k (1 − p)^{n−k}   for 0 ≤ k ≤ n.   (57)
20 meaning that if we let Ei be the event that the ith trial results in success, then the events E1, . . . , En are independent
(The binomial distribution satisfies lemma 8.1: ∑_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k} = (p + (1 − p))^n = 1.)
To find E(X) and Var(X) for the binomial distribution, we need two identities for the binomial coefficient:

\binom{n}{k} = n!/(k!(n − k)!) = (n/k) · (n − 1)!/((k − 1)!(n − k)!) = (n/k) \binom{n−1}{k−1},   k ≥ 1,   (58)

\binom{n}{k} = (n(n − 1)/(k(k − 1))) · (n − 2)!/((k − 2)!(n − k)!) = (n(n − 1)/(k(k − 1))) \binom{n−2}{k−2},   k ≥ 2.   (59)
Let q = 1 − p. We find, using (58):

E(X) = ∑_{k=0}^{n} k P(X = k) = ∑_{k=0}^{n} k \binom{n}{k} p^k q^{n−k}
= ∑_{k=1}^{n} n \binom{n−1}{k−1} p^k q^{n−k}
= np ∑_{k=1}^{n} \binom{n−1}{k−1} p^{k−1} q^{n−k}
= np ∑_{j=0}^{n−1} \binom{n−1}{j} p^j q^{(n−1)−j}
= np (p + q)^{n−1} = np.
For the variance, we begin by considering E(X²) − E(X). Using (59), we compute:

E(X²) − E(X) = E(X(X − 1)) = ∑_{k=0}^{n} k(k − 1) \binom{n}{k} p^k q^{n−k}
= ∑_{k=2}^{n} n(n − 1) \binom{n−2}{k−2} p^k q^{n−k}
= n(n − 1) p² ∑_{k=2}^{n} \binom{n−2}{k−2} p^{k−2} q^{(n−2)−(k−2)}
= n(n − 1) p² ∑_{j=0}^{n−2} \binom{n−2}{j} p^j q^{(n−2)−j}
= n(n − 1) p².
So E(X²) = n(n − 1)p² + np, and

Var(X) = E(X²) − E(X)² = np[p(n − 1) + 1 − np] = np(1 − p).
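These closed forms can be confirmed by summing the PMF (57) directly, with exact rational arithmetic (Python sketch; the values of n and p are illustrative):

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)  # illustrative parameters
q = 1 - p

# The binomial PMF (57), as exact fractions
pmf = {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

EX = sum(k * pk for k, pk in pmf.items())
VarX = sum(k * k * pk for k, pk in pmf.items()) - EX**2
```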
A comparison with (56) shows that the expectation and variance of Bin(n, p) are n times those
of Bernoulli(p). This is an instance of a general phenomenon, which you will see in chapter 11
(corollary 11.2 and theorem 11.3). Thus, for fixed n, the expectation is a linear function of p, while
the variance is quadratic, achieving its maximum at p = 1/2 and its minimum at p = 0, 1, where it
is equal to 0. The binomial distribution for n = 50 and three values of p is displayed in figure 3.
These formulae were proved using the binomial theorem and binomial identities. There are
other methods of proof, including the use of generating functions, which you may see in books or
in other courses.
EXAMPLE. If X ∼ Bin(n, p), then Y = n − X is the number of failures in n independent Bernoulli(p) trials. It follows that Y ∼ Bin(n, 1 − p).
EXAMPLE. (Binomial distribution and sampling) I select n balls at random from a bag containing N balls, M of which are red. After each selection I replace the ball in the bag. Each random selection will result in a red ball with probability M/N, and the outcome of each pick is independent of all the others (since we are sampling with replacement). So the number of red balls chosen has distribution Bin(n, M/N). It has expectation n(M/N) and variance n(M/N)((N − M)/N).
Hypergeometric distribution
Let us select n objects from a set of N objects, M of which belong to a certain subset, which we
call A. If repetition is allowed, then the number X of elements of A among the n objects selected
has the binomial distribution.
If instead we assume that sampling is without repetition, then the binomial distribution is no
longer appropriate. In this case the random variable X takes values in the set {0, 1, 2, . . . , M},
although some of these values will be assumed with zero probability; if n < M some large values
cannot occur, and if n > N − M some small values are excluded (see exercise 9.3).
The computation of the PMF is a standard sampling problem. The sample space has size \binom{N}{n}, where we clearly must assume that n ≤ N. If we fix an integer k, then there are \binom{M}{k} ways of choosing the k elements of A; for each of these there are \binom{N−M}{n−k} ways of choosing the n − k objects which are not in A. The PMF is given by

P(X = k) = \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n},   0 ≤ k ≤ M,   (60)
which is called the hypergeometric distribution (in symbols: X ∼ Hg(n, M, N)). Noting that the
binomial coefficient is zero if the lower argument is negative or is greater than the upper argument,
we conclude that expression (60) is automatically zero in the ranges mentioned above.
It can be shown that:

E(X) = n(M/N),   Var(X) = n (M/N) ((N − M)/N) ((N − n)/(N − 1)).   (61)

The derivation of these formulae will be dealt with in problem 11.5.
It is instructive to compare the above formulae with those for Bin(n, M/N), the analogous random variable for sampling with repetition. Let p = M/N. Then formulae (61) read:

E(X) = np,   Var(X) = np(1 − p) (N − n)/(N − 1).

The expectation is the same as for the binomial, while the variance differs by the factor (N − n)/(N − 1); for n > 1 this factor makes the variance smaller, so the random variable is more sharply concentrated. However, when N is very large compared with n, the factor (N − n)/(N − 1) is close to 1, so the variance is close to that of the binomial distribution.
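Formulae (60) and (61) lend themselves to the same kind of direct check (Python sketch; the parameters are illustrative):

```python
from fractions import Fraction
from math import comb

n, M, N = 5, 6, 20  # illustrative: draw n objects from N, M of which are in A
p = Fraction(M, N)

# The hypergeometric PMF (60), as exact fractions
pmf = {k: Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))
       for k in range(0, min(n, M) + 1)}

EX = sum(k * pk for k, pk in pmf.items())
VarX = sum(k * k * pk for k, pk in pmf.items()) - EX**2
```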
Geometric distribution
Let us consider an unlimited number of independent Bernoulli trials, and let X be the number of
trials up to and including the first success. In this case X has the geometric distribution, written
as X ∼ Geom(p). The random variable X takes values in {1, 2, 3, . . . } and has PMF
P(X = k) = p(1 − p)^{k−1},   k ≥ 1.

Notice also that the event X > k occurs if and only if the first k trials all result in failure. Hence,

P(X > k) = (1 − p)^k.
To compute expectation and variance, we use the summation formulae

∑_{k≥1} k x^k = x/(1 − x)²,   ∑_{k≥1} k² x^k = x(1 + x)/(1 − x)³,   |x| < 1.   (62)
They can be derived by considering the sum of the geometric series:

f(x) = 1/(1 − x) = ∑_{k≥0} x^k,   |x| < 1.
Differentiating term by term, we obtain, for |x| < 1:

x df(x)/dx = x/(1 − x)² = x ∑_{k≥0} k x^{k−1} = ∑_{k≥1} k x^k.
This is the first formula in (62). To derive the second formula, we compute:

x² d²f(x)/dx² + x df(x)/dx = 2x²/(1 − x)³ + x/(1 − x)² = x(1 + x)/(1 − x)³.

Term-by-term differentiation gives

x² d²f(x)/dx² + x df(x)/dx = x² ∑_{k≥0} k(k − 1) x^{k−2} + ∑_{k≥1} k x^k = ∑_{k≥1} (k² − k) x^k + ∑_{k≥1} k x^k = ∑_{k≥1} k² x^k.

Comparing the above two expressions gives the second formula in (62).
Using the first formula in (62), we compute the expectation:

E(X) = ∑_{k≥1} k(1 − p)^{k−1} p = (p/(1 − p)) ∑_{k≥1} k(1 − p)^k = (p/(1 − p)) × (1 − p)/(1 − (1 − p))² = 1/p.
Using the second formula in (62) and the result above, we find:

Var(X) = E(X²) − E(X)²
= ∑_{k≥1} k²(1 − p)^{k−1} p − 1/p²
= (p/(1 − p)) ∑_{k≥1} k²(1 − p)^k − 1/p²
= (p/(1 − p)) × (1 − p)(2 − p)/p³ − 1/p²
= (2 − p)/p² − 1/p² = (1 − p)/p².
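Since the range of X is infinite, a numerical check must truncate the sums; the neglected tail is geometrically small, so the truncated moments agree with 1/p and (1 − p)/p² to high accuracy. A Python sketch with illustrative values:

```python
p = 0.25   # illustrative success probability
K = 400    # truncation point; the neglected tail is of order K*(1-p)^K

# Truncated geometric PMF: P(X = k) = p(1-p)^(k-1), k = 1..K
pmf = [p * (1 - p)**(k - 1) for k in range(1, K + 1)]

EX = sum(k * pk for k, pk in zip(range(1, K + 1), pmf))
EX2 = sum(k * k * pk for k, pk in zip(range(1, K + 1), pmf))
VarX = EX2 - EX**2
```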
Poisson distribution
Suppose that incidents of some kind occur at random times, at an average rate λ per unit time (or,
equivalently, at random locations, at an average rate λ per unit length). For example, a call centre
receiving 70,000 calls a year receives on average λ = 8 calls an hour, the unit time of interest.
Other examples are the number of particles emitted from a radioactive source in a second, the
number of claims received by an insurance company in a month, the number of bugs found in
1,000 lines of computer code.
Given λ, let X be the number of incidents which occur in unit time. The random variable X has a Poisson distribution (write X ∼ Poisson(λ)) if it assumes the values {0, 1, 2, . . . }, and has PMF (see figure 4)

P(X = k) = e^{−λ} λ^k/k!,   k ≥ 0.   (63)
As we shall see, such a distribution models accurately the situations described above.
The analysis of the Poisson distribution requires knowledge of the exponential function. We
shall need two formulae. The first formula defines the exponential function as a power series:
e^x := ∑_{k≥0} x^k/k! = 1 + x + x²/2 + x³/6 + x⁴/24 + · · ·,   x ∈ R.   (64)

The second formula is a classic result of analysis [see (25)]:

lim_{n→∞} (1 + x/n)^n = e^x,   x ∈ R.   (65)
We begin by showing that (63) is indeed a distribution [cf. (55)]. This function does assume non-negative values on a countable set. Moreover, using (64), we find

∑_{k≥0} P(X = k) = ∑_{k≥0} e^{−λ} λ^k/k! = e^{−λ} ∑_{k≥0} λ^k/k! = e^{−λ} e^λ = 1.
Figure 4: Poisson distribution (63), for λ = 5 (blue) and λ = 15 (red).
The expression (63) arises as a limit of binomial distributions Bin(n, p), in the case in which p
is small, n is large, while np is moderate. The idea is to identify the product np with the quantity
λ introduced above, which is then kept constant as n → ∞.
To carry out this plan, we divide the unit of time into n subintervals. By choosing a sufficiently large value of n, we ensure that the quantity λ/n is very small. Under these circumstances, the occurrence of 1 incident in each subinterval is a rare event, and we expect the probability of this event to be λ/n. Because λ/n is very small, the probability that there be two or more incidents in the same subinterval is so small that we can neglect it (you can verify that this probability is very close to (λ/n)²).
Thus we assume that in each interval there are either 0 or 1 incidents, with the probability of 0 incidents being 1 − λ/n. Then the total number of incidents has distribution Bin(n, λ/n). This means that the probability of k incidents per unit time is:

P(X = k) ≈ \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k} = [n(n − 1) · · · (n − k + 1)/n^k] (λ^k/k!) (1 − λ/n)^{−k} (1 − λ/n)^n.
As n is made large, this gives a better and better approximation. Now

n(n − 1)(n − 2) · · · (n − k + 1) = n^k (1 − 1/n)(1 − 2/n) · · · (1 − (k − 1)/n),

and therefore

P(X = k) ≈ U(n, k) (λ^k/k!) (1 − λ/n)^n,
where

U(n, k) = (1 − 1/n)(1 − 2/n) · · · (1 − (k − 1)/n)(1 − λ/n)^{−k}.
Now fix k, and consider the limit n → ∞. Since each term in the above product tends to 1, we have

lim_{n→∞} U(n, k) = 1.

Moreover, expression (65) gives

lim_{n→∞} (1 − λ/n)^n = e^{−λ}.
Putting everything together, we obtain

P(X = k) = lim_{n→∞} U(n, k) (λ^k/k!) (1 − λ/n)^n = (λ^k/k!) e^{−λ},
which is the PMF of the Poisson distribution.
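The limit can be witnessed numerically: for fixed k, the binomial probabilities with p = λ/n approach the Poisson value as n grows (Python sketch; the values of λ and k are illustrative).

```python
from math import comb, exp, factorial

lam, k = 4.0, 2  # illustrative values

def binom_pmf(n):
    # P(X = k) under Bin(n, lam/n)
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

poisson_pk = exp(-lam) * lam**k / factorial(k)
# binom_pmf(n) approaches poisson_pk as n grows
```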
From (64) we find

E(X) = ∑_{k≥0} k e^{−λ} λ^k/k! = e^{−λ} ∑_{k≥1} k λ^k/k! = λ e^{−λ} ∑_{k≥1} λ^{k−1}/(k − 1)! = λ e^{−λ} ∑_{m≥0} λ^m/m! = λ e^{−λ} e^λ = λ.
For the variance, we consider

E(X(X − 1)) = ∑_{k≥0} k(k − 1) e^{−λ} λ^k/k! = λ² e^{−λ} ∑_{k≥2} λ^{k−2}/(k − 2)! = λ² e^{−λ} ∑_{m≥0} λ^m/m! = λ² e^{−λ} e^λ = λ².
So E(X²) − E(X) = λ², or E(X²) = λ² + λ, and

Var(X) = (λ² + λ) − λ² = λ.
With hindsight, these results should not be surprising. Indeed, the expectation value of the
Binomial distribution Bin(n, p) is np = nλ /n = λ . Likewise, the variance is np(1 − p) = λ (1 −
p) ∼ λ , since p is small.
Figure 5: The probability mass function δ (left) and the cumulative distribution function F (right) of the
data (67), which refer to a discrete random variable.
9.1 Cumulative distribution functions
The cumulative distribution function (CDF) FX of a random variable X is defined as

FX : R → R,   FX(x) = P(X ≤ x).   (66)
As we did for the PMF, we shall write F in place of FX , if the reference to the random variable is
clear from the context. Note that the definition of the CDF does not require the random variable X
to be discrete. Accordingly, and in preparation for the introduction of continuous random variables
in chapter 10, we have used the symbol x in place of k to denote the elements of the domain of δ
and F. We shall continue to use the symbol k when we refer to a discrete random variable.
EXAMPLE. Let a discrete random variable X have the following PMF:

k          k1   k2   k3
P(X = k)   a    b    c       (67)
where a, b, c > 0 and a + b + c = 1. The PMF and CDF of X are displayed in figure 5. By its very
nature, the cumulative distribution function of a discrete random variable is piecewise-defined:


0
if k < k1



a
if k1 6 k < k2
FX (k) =

a + b if k2 6 k < k3




1
if k > k3 .
The CDF of a discrete random variable is a step-function. The steps are located at all elements
ki of Range(X) at which δ (ki ) > 0, the height of the step at ki being equal to δ (ki ). The value
of the function at the discontinuity point agrees with the height of the upper step, by virtue of the
non-strict inequality in the definition (66). (Think about it.) Furthermore, the function is constant
between consecutive steps.
For a discrete random variable, the cumulative distribution function provides an alternative —
and often very useful— representation of the information stored in the probability mass function.
In other words, if we know one of the PMF and CDF, then we can determine the other. To determine
the CDF from the PMF, we obtain, from (66):
\[
F_X(x) = \sum_{k \le x} P(X = k) = \sum_{k \le x} \delta(k). \tag{68}
\]
If a, b are real numbers with a ≤ b, then the above expression gives
\[
P(a < X \le b) = F_X(b) - F_X(a) = \sum_{a < k \le b} \delta_X(k). \tag{69}
\]
Let us now determine the PMF from the CDF. Let X take values k_1, k_2, k_3, ... with k_1 < k_2 <
k_3 < .... Then
\[
P(X = k_i) = F_X(k_i) - F_X(k_{i-1}).
\]
This formula also holds in the case in which the sequence (k_i) is decreasing, provided that we
introduce negative indices: ... < k_{-3} < k_{-2} < k_{-1}; one could also have a doubly-infinite sequence:
... < k_{-2} < k_{-1} < k_0 < k_1 < k_2 < .... Statistical tables often give the CDF rather than the PMF,
so you may need to use this fact to find the PMF from tables.
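The two directions of this correspondence can be sketched in Python (a small illustrative helper, not from the notes; names are ours):

```python
def pmf_to_cdf(support, pmf):
    # Running total of the PMF over the increasing support: F(k_i) = sum_{j <= i} pmf[k_j]
    cdf, total = [], 0.0
    for k in support:
        total += pmf[k]
        cdf.append(total)
    return cdf

def cdf_to_pmf(support, cdf):
    # Successive differences recover the PMF: P(X = k_i) = F(k_i) - F(k_{i-1})
    return {k: cdf[i] - (cdf[i - 1] if i > 0 else 0.0)
            for i, k in enumerate(support)}

support = [1, 2, 3]
pmf = {1: 0.2, 2: 0.5, 3: 0.3}
cdf = pmf_to_cdf(support, pmf)
print(cdf)                        # approximately [0.2, 0.7, 1.0]
print(cdf_to_pmf(support, cdf))   # recovers the original PMF
```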
EXAMPLE. Let T ∼ Geom(p). Then
\[
F_T(k) = P(T \le k) = 1 - P(T > k) = 1 - P(A_k),
\]
where A_k is the event 'the first k trials are failures'. Then
\[
F_T(k) = 1 - (1-p)^k, \qquad k \in \mathbb{N}.
\]
To write a formula for F_T(x) valid for all x ∈ R, we introduce the floor function, denoted by the
symbol ⌊·⌋. Thus for any x ∈ R, we let ⌊x⌋ be the largest integer not exceeding x, e.g.,
\[
\lfloor 3.1 \rfloor = 3 \qquad \lfloor -1 \rfloor = -1 \qquad \lfloor -11/10 \rfloor = -2.
\]
The floor function is a step function, with unit steps at the integers. Using the floor function, we
obtain:
\[
F_T(x) = \begin{cases}
1 - (1-p)^{\lfloor x \rfloor} & x \ge 0\\
0 & x < 0.
\end{cases}
\]
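As a quick check (our own sketch, not part of the notes), the formula with the floor function can be evaluated for non-integer arguments:

```python
from math import floor

def geom_cdf(p, x):
    # F_T(x) = 1 - (1 - p)^floor(x) for x >= 0, and 0 for x < 0
    if x < 0:
        return 0.0
    return 1.0 - (1.0 - p) ** floor(x)

p = 0.5
print(geom_cdf(p, 3))     # 1 - (1/2)^3 = 0.875
print(geom_cdf(p, 3.9))   # same value: the CDF only steps at integers
```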
Let us now return to the general case of the cumulative distribution function, which, as we pointed
out, does not require the random variable to be discrete. The CDF F_X of a random variable X has
the following properties:
I. 0 ≤ F_X(x) ≤ 1.

II. The following limits hold:
\[
\lim_{x\to\infty} F_X(x) = 1 \qquad \lim_{x\to-\infty} F_X(x) = 0. \tag{70}
\]

III. The function F_X is non-decreasing, that is, if a < b then F_X(a) ≤ F_X(b).

IV. P(a < X ≤ b) = F_X(b) − F_X(a).
V. If P(X = x) > 0, then FX has a discontinuity at x.
Property I is a consequence of the fact that F_X(x) is a probability. To establish III, we note that the
event X ≤ a, being a subset of the event X ≤ b, has no larger probability. IV follows directly from
the definition of the CDF. The proof of properties II and V is left as an exercise. In the next section
we will use the CDF to study a new type of random variable, called a continuous random variable.
EXERCISES
EXERCISE 9.1. Condense this chapter into half a page, by writing a summary of the properties of
the various distributions (PMF, expectation, variance, and example of their occurrence).
EXERCISE 9.2. Suppose that A ∼ Poisson(3), B ∼ Geom(1/3), and that C ∼ Bin(4, 1/6). Find
the following probabilities:

i) P(A = 2);  ii) P(A > 2);  iii) P(B = 3);  iv) P(B ≤ 3);  v) P(C = 2).

[Ans: i) 9/2e^3; ii) 1 − 17/2e^3; iii) 4/27; iv) 19/27; v) 25/216.]
EXERCISE 9.3. Let X ∼ Hg(n, M, N); determine Range(X).
EXERCISE 9.4. Let Ω := {−1, 0, 1} and let
\[
P(-1) = 1/2 \qquad P(0) = 1/6 \qquad P(1) = 1/3.
\]
We consider sequences of elements of Ω, where each element is selected according to the above
probability, and where the result of each selection is independent from all other selections.
Determine the probability mass function of each of the following random variables. (If the PMF
is one of those introduced in this chapter, then identify it, specifying the parameter(s); otherwise,
write down the PMF explicitly.)

i) The number of zeros in a sample of length 50.

ii) The number of entries up to and including the first negative term.
iii) The number of entries following the first positive term, up to and including the first zero.
iv) The number of negative terms in the ten selections that follow the first occurrence of a positive term.
v) The number of entries up to and including the second positive term.
vi) Given a positive integer m, the number of entries up to and including the mth occurrence of
a positive term.
[Ans.: i) Bin(50, 1/6).]
EXERCISE 9.5. [Final examination 2016.] A hundred tickets are sold in a lottery in which there
is a top prize of £50 and four prizes of £10 each. Each ticket costs £1. Let X be your net gain when
you buy one ticket.
1. Compute the probability mass function of X.
2. Compute and draw the cumulative distribution function of X.
3. Compute the expectation of X.
Explain what you do.
EXERCISE 9.6. [Final examination 2016.] Assume that, on average, a typographical error is
found every 1000 typeset characters. Compute the probability that a 600-character page contains
fewer than two errors.
EXERCISE 9.7. Two players, Ann and Bob, in turn roll a fair die. Ann rolls first, and the first one
to roll a six wins.
i) What are their respective chances of winning?
ii) Let the random variable X be the number of times Bob rolls the die. Compute P(X = k) for
k 6 3.
[Ans: i) 6/11, 5/11; ii) p, p(1 − p)(2 − p), p(1 − p)^3(2 − p), p(1 − p)^5(2 − p).]
EXERCISE 9.8. An electrical component is installed on a certain day and is inspected on each
subsequent day. Let G be the number of days that the component lasts for before breaking.
i) Suppose that G ∼ Geom(p). Show that for any k, ℓ > 0
\[
P(G > k + \ell \mid G > k) = P(G > \ell).
\]

ii) Describe in words what the conclusion of part i) means for the lifetime of the component.

iii) Show that if G satisfies P(G > k + ℓ | G > k) = P(G > ℓ) for all k, ℓ > 0, then G must have a
geometric distribution.

[Hint: i) you must sum a geometric series. iii) begin with k = ℓ = 1, then k = 1, ℓ = 2, etc.]
EXERCISE 9.9. Let X be the number of fish caught by a fisherman in one afternoon. Suppose that
X is distributed Poisson(λ ). Each fish has probability p of being a salmon, independently of all
other fish caught. Let Y be the number of salmon caught.
i) Suppose that the fisherman catches m fish. What is the probability that k of them are salmon?
ii) Show that:
\[
P(Y = k) = \sum_{m \ge k} P(Y = k \mid X = m)\, P(X = m).
\]

iii) Hence find the probability mass function of Y. What is the distribution of Y?

[Hint: ii) Use the Total Probability Theorem, with a suitable partition.]
EXERCISE 9.10.

a) A discrete random variable A has the following probability mass function:

    n          −1    0    1    2
    P(A = n)  1/10  3/5  1/5  1/10
Find the cumulative distribution function of A.
b) A discrete random variable B has CDF
\[
F_B(r) = \begin{cases}
0   & \text{if } r < 1/2\\
1/5 & \text{if } 1/2 \le r < 3\\
3/5 & \text{if } 3 \le r < 5\\
1   & \text{if } 5 \le r.
\end{cases}
\]
Write down the values which B takes. Find the PMF of B.
EXERCISE 9.11.

a) [Final examination 2016.] Introduce briefly the Poisson distribution. Verify that it is indeed
a distribution, and explain —without proof— its connection with the binomial distribution.

b) Show that the Poisson distribution attains its maximum value at k = ⌊λ⌋, where ⌊·⌋ is the
floor function.

[Hint: Consider δ(k + 1)/δ(k).]
EXERCISE 9.12. Prove properties II and V (page 92) of a CDF.
10 Continuous Random Variables
We have mostly dealt with discrete random variables, namely functions Ω → R assuming a finite
or countable set of values. In this chapter we consider random variables whose range is an interval,
which could be finite, semi-infinite, or the whole real line.
We introduce continuous random variables as limits of discrete random variables. For any
n ≥ 1, we consider the discrete random variable X_n, which assumes n values, equally spaced in the
interval [0,1], with equal probability. Let us denote by δ_n and F_n, respectively, the PMF and CDF
of X_n; we have
\[
\mathrm{Range}(X_n) = \left\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}\right\} \qquad
\delta_n\!\left(\frac{j}{n}\right) = \frac{1}{n}, \quad j = 1, \ldots, n, \tag{71}
\]
while δ_n(x) = 0 for any x not of the form j/n. The CDF of the random variable X_4 is displayed in
figure 6, top left. It has four steps of height 1/4, the height being the value of the corresponding
PMF at that point. The random variable X_8 assumes twice as many values, and hence its CDF,
shown in the top right diagram, has twice as many steps, of half the height. We double the number
of values again, and again, and display the CDFs of X_16 and X_32 in the bottom diagrams. These
CDFs can be expressed analytically with the aid of the floor function:
\[
F_n(x) = \begin{cases}
0 & \text{if } x < 0\\
\frac{1}{n}\lfloor nx \rfloor & \text{if } 0 \le x \le 1\\
1 & \text{if } x > 1
\end{cases}
\qquad n \ge 1, \tag{72}
\]
where we have used the fact that ⌊nx⌋ is the number of steps lying to the left of x —including the
step at x, if there is one.
In the limit n → ∞, we obtain the random variable X:
\[
X = \lim_{n\to\infty} X_n \qquad \mathrm{Range}(X) = [0, 1]. \tag{73}
\]
The CDF of X is obtained from (72) by noting that, since 0 ≤ nx − ⌊nx⌋ < 1, we have
\[
0 \le x - \frac{\lfloor nx \rfloor}{n} < \frac{1}{n}
\]
for all x ∈ R. As n grows, 1/n approaches 0, and hence the quantity ⌊nx⌋/n approaches x. So we
have
\[
F(x) = \lim_{n\to\infty} F_n(x) = \begin{cases}
0 & \text{if } x < 0\\
x & \text{if } 0 \le x \le 1\\
1 & \text{if } x > 1.
\end{cases} \tag{74}
\]
This is a continuous function, which assumes every value between 0 and 1.
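The convergence of F_n to F can be seen numerically (an illustrative sketch of ours, not part of the notes):

```python
from math import floor

def F_n(n, x):
    # CDF (72) of the discrete random variable X_n
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return floor(n * x) / n

x = 0.637
for n in (4, 8, 16, 1024):
    print(n, F_n(n, x))   # approaches F(x) = x as n grows
```

The error is at most 1/n, in agreement with the inequality above.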
As a physical model for the random variable X given in (73), let us consider a low-friction
wheel, initially made to spin at very high angular velocity. When the wheel comes to a rest, we
Figure 6: The cumulative distribution functions F_n, for n = 4, 8, 16, 32 [see (72)], illustrate the transition
from discrete to continuous random variables. (The vertical segments in the diagrams are merely a guide for
the eye; they are not part of the graph of the functions.)
consider its angular position θ = 2πx measured with respect to a fixed external reference point. It
is plausible to assume that θ is distributed uniformly between 0 and 2π. Then x, which we take as
the value of our random variable X, will be distributed uniformly between 0 and 1. The random
variable Xn will then represent the case of a wheel which, by some device, can only come to rest at
n equally spaced angular positions (like a wheel of fortune at a fair).
Many natural random variables have the property that the set of values they assume is an interval of R, or R itself. The example above suggests that the CDF is still useful in this case. By
contrast, the PMF is no longer immediately useful, since its value becomes zero everywhere, as in
(71). These considerations motivate the following definition.
Definition. A random variable X is continuous if its cumulative distribution function FX is a
continuous function.
All properties of a CDF listed in section 9.1 apply to the continuous case as well. In particular,
F_X is a non-decreasing function with F_X(x) → 0 as x → −∞ and F_X(x) → 1 as x → ∞ (properties
II and III). Moreover, property V implies that P(X = x) = 0 for any x. Indeed, if P(X = x) were
positive, then F_X would have a discontinuity at x, and hence X would not be a continuous random
variable. So P(X < x) = P(X ≤ x) = F(x) and
\[
P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b) = F_X(b) - F_X(a).
\]
The function (74) is seen to possess all these properties.
EXAMPLE. An archer fires an arrow which hits a circular target Ω of radius 1. Suppose that the
probability that the arrow hits a certain region A is equal to the area of A divided by the area of Ω.
This amounts to assuming that the point ω which is hit by the arrow is chosen uniformly at random
within the target. Let X be the distance from ω to the centre of Ω, and let A_r be the disc of radius
r with the same centre as Ω. Then
\[
P(X \le r) = P(A_r) = \frac{|A_r|}{|\Omega|} = \frac{\pi r^2}{\pi} = r^2, \qquad 0 \le r \le 1.
\]
So
\[
F(r) = \begin{cases}
0 & \text{if } r < 0\\
r^2 & \text{if } 0 \le r \le 1\\
1 & \text{if } r > 1.
\end{cases} \tag{75}
\]
All properties of CDFs listed in section 9.1 can be verified for this example.
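A Monte Carlo simulation (our own sketch, not in the notes) supports the formula F(r) = r^2: sample points uniformly in the unit disc by rejection from the enclosing square and count how often the radius is at most r.

```python
import random

def sample_radius(rng):
    # A point uniform in the unit disc, by rejection from the square [-1,1]^2
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x * x + y * y) ** 0.5

rng = random.Random(0)
r, trials = 0.6, 100_000
hits = sum(sample_radius(rng) <= r for _ in range(trials))
print(hits / trials)   # close to F(r) = r^2 = 0.36
```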
EXAMPLE. Let X_n be the random variable (71), and let [a, b] be an interval in [0, 1]. We are
interested in the probability that X_n belongs to [a, b]. We begin by isolating the values of X_n which
lie in such an interval:
\[
R_n(a, b) = \mathrm{Range}(X_n) \cap [a, b].
\]
Using this notation, we write
\[
P(a \le X_n \le b) = \sum_{k \in R_n(a,b)} \delta_n(k) = \frac{1}{n}\,|R_n(a, b)|.
\]
Next we determine the cardinality of R_n(a, b). Let j^+ and j^- be, respectively, the largest and the
smallest integer j such that j/n ∈ [a, b]. Then
\[
j^+ = bn - \varepsilon^+ \qquad j^- = an + \varepsilon^- \qquad 0 \le \varepsilon^+, \varepsilon^- < 1,
\]
where the quantities ε^± ensure that j^± are integers^{21}. Then there are j^+ − j^- + 1 elements of
Range(X_n) in [a, b], and therefore
\[
P(a \le X_n \le b) = \frac{1}{n}(j^+ - j^- + 1) = b - a + \frac{1}{n}(1 - \varepsilon^+ - \varepsilon^-).
\]
Since |1 − ε^+ − ε^-| ≤ 1, we find
\[
\lim_{n\to\infty} P(a \le X_n \le b) = b - a + \lim_{n\to\infty}\frac{1}{n}(1 - \varepsilon^+ - \varepsilon^-) = b - a. \tag{76}
\]

^{21} We recognise that j^+ = ⌊bn⌋. Moreover, we have j^- = ⌈an⌉, where ⌈x⌉ is the ceiling of x —the smallest integer
not smaller than x.
We see that in the limit, the probability that the value of the random variable X_n belongs to
an interval is equal to the length of the interval. Can we obtain this result directly from the CDF
(74), using calculus? The answer is yes, provided that we assume —as is the case here— that
our CDFs are differentiable everywhere, except, possibly, at finitely many points. This condition
is satisfied by the CDFs of continuous random variables, except in some pathological cases which
will not be encountered in this course.
These considerations motivate the following definition.
Definition. Let X be a continuous random variable. The probability density function (or PDF)
δ_X is the function
\[
\delta_X(t) = \frac{d}{dt} F_X(t).
\]
EXAMPLE. For the CDF (74), we find:
\[
\delta_X(t) = \begin{cases}
1 & \text{if } 0 \le t \le 1\\
0 & \text{otherwise.}
\end{cases} \tag{77}
\]
For the CDF (75), we have:
\[
\delta_X(t) = \begin{cases}
2t & \text{if } 0 \le t \le 1\\
0 & \text{otherwise.}
\end{cases} \tag{78}
\]
The definition of PDF is ambiguous, since δ_X is not determined uniquely at points where F_X
is not differentiable (technically we are defining not a single PDF, but a whole family of possible
PDFs, any of which will do equally well). For example, in (77), the 'bad' points are t = 0, 1.
However, the choice of values given to δ_X at these points will make no difference to integrals
involving it, so everything that follows is unaffected by such a choice.
We determine some properties of the PDF. Since the CDF is non-decreasing, we have δ_X(t) ≥ 0
for all t. Also, by the fundamental theorem of calculus^{22} we have that
\[
P(a \le X \le b) = F_X(b) - F_X(a) = \int_a^b \delta_X(t)\,dt. \tag{79}
\]
This is the analogue of expression (69) for a discrete random variable, where instead of an integral
we had a sum.
From (79) and property (70) of the CDF, we obtain the normalisation condition
\[
\int_{-\infty}^{\infty} \delta_X(t)\,dt = 1, \tag{80}
\]
which is the analogue of lemma 8.1 (p. 67) for the continuous case. This identity gives us a way of
checking that a calculated PDF is plausible.

^{22} To apply the fundamental theorem of calculus we need some mild conditions on δ_X, because things might go
wrong if the PDF is particularly unpleasant. However, for any reasonable function (such as any function you'll meet
in this course) this problem will not arise.
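In practice such a plausibility check can be done numerically (an illustrative sketch of ours; the quadrature routine is a simple midpoint rule, adequate for the smooth densities in these notes):

```python
def integrate(f, a, b, n=100_000):
    # Midpoint rule approximation of the integral of f over [a, b]
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

pdf = lambda t: 2 * t if 0 <= t <= 1 else 0.0   # the PDF (78)
print(integrate(pdf, 0, 1))        # normalisation: ~ 1.0
print(integrate(pdf, 0.25, 0.75))  # P(1/4 <= X <= 3/4) = (3/4)^2 - (1/4)^2 = 0.5
```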
In our example (77), expression (79) becomes
\[
P(a \le X \le b) = \int_a^b \delta_X(t)\,dt = \int_a^b dt = b - a,
\]
in accordance with our earlier explicit calculation (76).
For a continuous random variable, it is usually easier to calculate the CDF, and then find the
PDF by differentiating. Conversely, the CDF can be found from the PDF by integrating. Specifically, we have the integral expression
\[
F_X(x) = \int_{-\infty}^{x} \delta_X(t)\,dt
\]
to be compared with the sum
\[
F_Y(x) = \sum_{-\infty < k \le x} \delta_Y(k)
\]
for a discrete random variable Y.
Several important parameters of a random variable are extracted from its CDF.
Definition. The number m is a median of X if F_X(m) = 1/2. The numbers ℓ and u are, respectively,
the lower and upper quartiles of X if F_X(ℓ) = 1/4 and F_X(u) = 3/4.
EXAMPLE. For the CDF (74) we find immediately: F_X(1/2) = 1/2, and hence m = 1/2. By the
same token, ℓ = 1/4 and u = 3/4. For the example (75) we have F(m) = m^2 = 1/2, and hence
m = 1/√2; likewise, we find ℓ = 1/2 and u = √3/2.
The definitions of median and quartiles also apply to discrete random variables. However, in
the discrete case these quantities may not exist, or may exist but fail to be unique. If the random
variable is continuous, then they are guaranteed to exist and are generally unique (why?).
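For a continuous random variable, the median and quartiles can be computed by inverting the CDF numerically; the bisection sketch below is our own illustration (it relies on F being continuous and non-decreasing, exactly the existence argument above):

```python
def quantile(F, q, lo=0.0, hi=1.0, tol=1e-12):
    # Bisection for F(m) = q on [lo, hi]; F continuous and non-decreasing
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

F = lambda x: x * x               # the CDF (75)
print(quantile(F, 0.5))           # median 1/sqrt(2), about 0.7071
print(quantile(F, 0.25))          # lower quartile 1/2
print(quantile(F, 0.75))          # upper quartile sqrt(3)/2, about 0.8660
```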
Definition. Let X be a continuous random variable with probability density function δ. The expectation E(X) is given by
\[
E(X) := \int_{-\infty}^{\infty} t\,\delta(t)\,dt. \tag{81}
\]
The variance Var(X) of X is given by
\[
\mathrm{Var}(X) := \int_{-\infty}^{\infty} (t - E(X))^2\,\delta(t)\,dt. \tag{82}
\]
Next we derive the analogue of proposition 8.2ii for the continuous case. We compute:
\[
\begin{aligned}
\mathrm{Var}(X) &= \int_{-\infty}^{\infty} \left(t^2 - 2tE(X) + E(X)^2\right)\delta(t)\,dt\\
&= \int_{-\infty}^{\infty} t^2\,\delta(t)\,dt - 2E(X)\int_{-\infty}^{\infty} t\,\delta(t)\,dt + E(X)^2\int_{-\infty}^{\infty} \delta(t)\,dt\\
&= \int_{-\infty}^{\infty} t^2\,\delta(t)\,dt - E(X)^2,
\end{aligned}
\]
where we have used elementary properties of integration, the normalisation condition (80), and the
definition (81).
The properties (53) of expectation and variance apply verbatim to the continuous case. Furthermore, if g : R → R is a function, then
\[
E(g(X)) = \int_{-\infty}^{\infty} g(t)\,\delta_X(t)\,dt,
\]
which is the analogue of lemma 8.3, p. 70.
In all these formulae, the integrals extend from −∞ to ∞. In practice, the PDF is often 0 outside
a smaller range, and so we can integrate over this smaller range only.
EXAMPLE. For the PDF given in (77), the normalisation condition (80) becomes
\[
\int_{-\infty}^{\infty} \delta_X(t)\,dt = \int_0^1 dt = [t]_0^1 = 1,
\]
while expectation and variance are given by
\[
E(X) = \int_{-\infty}^{\infty} t\,\delta_X(t)\,dt = \int_0^1 t\,dt = \left[\frac{t^2}{2}\right]_0^1 = \frac{1}{2},
\]
\[
\mathrm{Var}(X) = \int_{-\infty}^{\infty} t^2\,\delta(t)\,dt - \left(\frac{1}{2}\right)^2
= \int_0^1 t^2\,dt - \frac{1}{4} = \left[\frac{t^3}{3}\right]_0^1 - \frac{1}{4} = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.
\]
EXAMPLE. For the PDF (78), the normalisation condition (80) becomes
\[
\int_{-\infty}^{\infty} \delta_X(t)\,dt = \int_0^1 2t\,dt = \left[t^2\right]_0^1 = 1,
\]
while expectation and variance are given by
\[
E(X) = \int_0^1 t\cdot 2t\,dt = \left[\frac{2t^3}{3}\right]_0^1 = \frac{2}{3},
\]
\[
\mathrm{Var}(X) = \int_0^1 t^2\cdot 2t\,dt - \left(\frac{2}{3}\right)^2
= \left[\frac{t^4}{2}\right]_0^1 - \frac{4}{9} = \frac{1}{2} - \frac{4}{9} = \frac{1}{18}.
\]
10.1 Some special continuous distributions
As for the discrete case, some distributions occur so frequently that we give them special names.
Here we look at two of them, the uniform and the exponential distributions. The former is a
generalisation of (74), the latter is derived from the Poisson distribution.
Uniform distribution
Suppose that a number X is chosen from the interval [a, b], with X being equally likely to be
anywhere in the interval. We will interpret this condition as meaning that the probability that the
value of X belongs to a sub-interval of [a, b] is equal to the length of that sub-interval, divided by
the length of [a, b]. Then we say that X has the uniform distribution and write X ∼ Uniform[a, b]
or X ∼ U[a, b]. The PDF of X is given by
\[
\delta_X(x) = \begin{cases}
\frac{1}{b-a} & \text{if } a \le x \le b\\
0 & \text{otherwise.}
\end{cases}
\]
Integration gives the CDF:
\[
F_X(x) = \begin{cases}
0 & \text{if } x < a\\
\frac{x-a}{b-a} & \text{if } a \le x \le b\\
1 & \text{if } x > b.
\end{cases}
\]
The normalisation condition (80) is verified immediately:
\[
\int_{-\infty}^{\infty} \delta_X(t)\,dt = \frac{1}{b-a}\int_a^b dt = 1,
\]
while expectation and variance are given by
\[
E(X) = \int_a^b \frac{t}{b-a}\,dt = \left[\frac{t^2}{2(b-a)}\right]_a^b = \frac{b^2-a^2}{2(b-a)} = \frac{b+a}{2},
\]
\[
\mathrm{Var}(X) = \int_a^b \frac{t^2}{b-a}\,dt - \left(\frac{b+a}{2}\right)^2
= \left[\frac{t^3}{3(b-a)}\right]_a^b - \frac{(b+a)^2}{4}
= \frac{b^3-a^3}{3(b-a)} - \frac{(b+a)^2}{4} = \frac{(b-a)^2}{12}.
\]
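These formulas can be checked by simulation (our own sketch, not in the notes; the interval [2, 5] is an arbitrary choice):

```python
import random

a, b = 2.0, 5.0
rng = random.Random(1)
xs = [rng.uniform(a, b) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, "vs exact", (a + b) / 2)        # ~ 3.5
print(var, "vs exact", (b - a) ** 2 / 12)   # ~ 0.75
```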
Exponential distribution
This continuous distribution is related to the discrete Poisson distribution. Suppose that the number
of incidents occurring in any interval of time of length t is distributed Poisson (λt). Instead of
counting the number of incidents in a fixed interval (this would give a discrete random variable
with the Poisson distribution), we look at the time X at which the first incident occurs. This
gives a continuous random variable which we say has the exponential distribution, and write
X ∼ Exp(λ ). Using the connection with the Poisson distribution, we compute the CDF of the
exponential distribution as follows. If t < 0, then we let F_X(t) = 0. If t > 0, we find
\[
\begin{aligned}
F_X(t) &= P(X \le t)\\
&= 1 - P(X > t)\\
&= 1 - P(\text{there are no incidents in the interval } [0,t])\\
&= 1 - e^{-\lambda t}\frac{(\lambda t)^0}{0!}\\
&= 1 - e^{-\lambda t}.
\end{aligned}
\]
Note that F_X is a non-decreasing continuous function which tends to 1, in accordance with properties II and III of CDFs. Differentiating gives the PDF
\[
\delta_X(t) = \begin{cases}
0 & \text{if } t < 0\\
\lambda e^{-\lambda t} & \text{if } t > 0.
\end{cases}
\]
The expectation and variance of the exponential distribution are found using integration by
parts (exercise 10.7i):
\[
E(X) = \frac{1}{\lambda}; \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}.
\]
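These values can be confirmed by simulation (our own sketch, not in the notes), sampling from the exponential distribution by inverting its CDF:

```python
import random
from math import log

# Inverse-CDF sampling: solving F(x) = 1 - e^{-lam x} = u for x gives
# x = -log(1 - u)/lam, so if U ~ Uniform[0,1] then -log(1 - U)/lam ~ Exp(lam).
lam = 2.0
rng = random.Random(2)
xs = [-log(1 - rng.random()) / lam for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, "vs 1/lam =", 1 / lam)          # ~ 0.5
print(var, "vs 1/lam^2 =", 1 / lam ** 2)    # ~ 0.25
```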
In later courses, you will meet the normal distribution, arguably the most important continuous distribution. It emerges as a continuous limit of the discrete binomial distribution.
10.2 Monotonic transformations of random variables
If g is any function which is defined on the range of X (the set of values that X takes), and which
takes values in R, we can define a new random variable Y = g(X), namely a new function Y(ω) =
g(X(ω)). If g is monotonic (increasing or decreasing) then the PDF of Y can be found from the
PDF of X.
We begin with a simple example: Y = aX + b, with a, b ∈ R. If a > 0, then
\[
\begin{aligned}
F_Y(x) &= P(Y \le x)\\
&= P(aX + b \le x)\\
&= P\!\left(X \le \frac{x-b}{a}\right)\\
&= F_X\!\left(\frac{x-b}{a}\right).
\end{aligned}
\]
Using the chain rule of differentiation we obtain:
\[
\delta_Y(x) = \frac{d}{dx} F_X\!\left(\frac{x-b}{a}\right) = \frac{1}{a}\,\delta_X\!\left(\frac{x-b}{a}\right).
\]
Similarly, if a < 0, we find
\[
\begin{aligned}
F_Y(x) &= P(aX + b \le x)\\
&= P\!\left(X \ge \frac{x-b}{a}\right)\\
&= 1 - P\!\left(X < \frac{x-b}{a}\right)\\
&= 1 - P\!\left(X \le \frac{x-b}{a}\right)\\
&= 1 - F_X\!\left(\frac{x-b}{a}\right),
\end{aligned}
\]
where we have used the fact that P(X < x) = P(X ≤ x) for all x, since X is a continuous random
variable. By differentiating the expression above, we obtain the PDF:
\[
\delta_Y(x) = -\frac{1}{a}\,\delta_X\!\left(\frac{x-b}{a}\right).
\]
Now let us consider the general case. If g is increasing, then it can be shown that g has an
inverse function g^{-1} which is also increasing (this is intuitively obvious by considering the graph
of g). Now,
\[
\begin{aligned}
F_Y(x) &= P(Y \le x)\\
&= P(g(X) \le x)\\
&= P(X \le g^{-1}(x)) \quad \text{(as } g^{-1} \text{ is increasing)}\\
&= F_X(g^{-1}(x)).
\end{aligned}
\]
Now, provided that g^{-1} is differentiable, we can differentiate with respect to x and obtain (using
the chain rule) that
\[
\delta_Y(x) = \frac{d}{dx}\,g^{-1}(x)\;\delta_X(g^{-1}(x)).
\]
Similarly, if g is decreasing then it can be shown that g has an inverse function which is also
decreasing. Now,
\[
\begin{aligned}
F_Y(x) &= P(Y \le x)\\
&= P(g(X) \le x)\\
&= P(X \ge g^{-1}(x)) \quad \text{(as } g^{-1} \text{ is decreasing)}\\
&= 1 - F_X(g^{-1}(x)).
\end{aligned}
\]
Provided that g^{-1} is differentiable, we can differentiate with respect to x and (again using the chain
rule) obtain that
\[
\delta_Y(x) = -\frac{d}{dx}\,g^{-1}(x)\;\delta_X(g^{-1}(x)).
\]
Note that the function δ_Y is non-negative because \(\frac{d}{dx}\,g^{-1}(x) \le 0\).
We can deal with both cases with a single formula, by saying that if g is monotonic and g^{-1} is
differentiable then
\[
\delta_Y(x) = \left|\frac{d}{dx}\,g^{-1}(x)\right|\,\delta_X(g^{-1}(x)).
\]
We remark that if Y is related to X by a non-monotone function then this method does not
work. For instance, if Y = X^2 then
\[
F_Y(x) = P(X^2 \le x) = P(-\sqrt{x} \le X \le \sqrt{x}) = F_X(\sqrt{x}) - F_X(-\sqrt{x}),
\]
so we can still relate the CDF of Y to the CDF of X but not in such a simple way.
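The monotone formula can be tested numerically (an illustrative sketch of ours). Note that although x ↦ x^2 is non-monotone on R, it is increasing on [0, 1], so for X ∼ Uniform[0,1] the formula does apply: g^{-1}(y) = √y, hence F_Y(y) = F_X(√y) = √y.

```python
import random

# Y = X^2 with X ~ Uniform[0,1]; g is increasing on [0,1], so
# F_Y(y) = sqrt(y) and P(a < Y <= b) = sqrt(b) - sqrt(a).
rng = random.Random(3)
ys = [rng.random() ** 2 for _ in range(200_000)]
a, b = 0.25, 0.64
empirical = sum(a < y <= b for y in ys) / len(ys)
exact = b ** 0.5 - a ** 0.5   # 0.8 - 0.5 = 0.3
print(empirical, "vs", exact)
```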
EXERCISES
EXERCISE 10.1. Define a continuous random variable on each of the following sample spaces:
1. A set of books.
2. A set of real intervals.
3. A set of triangles.
4. A set of real functions f : R → R.
EXERCISE 10.2. Consider the function δ : R → R, given by:
\[
\delta(x) = \begin{cases}
c x^{\alpha} & 0 \le x \le 1\\
0 & \text{otherwise}
\end{cases}
\]
where c and α are real numbers, and α > −1.
(i) For given α, determine the value of c such that δ = δX is the probability density function of
some random variable X.
(ii) Compute the expectation and the median of X.
[Ans: i) c = α + 1; ii) (α + 1)/(α + 2), (1/2)^{1/(α+1)}.]
EXERCISE 10.3.

a) A continuous random variable X has probability density function (PDF):
\[
\delta_X(x) = \begin{cases}
0 & \text{if } x < 1\\
x - 1 & \text{if } 1 \le x < 2\\
3 - x & \text{if } 2 \le x < 3\\
0 & \text{if } x \ge 3.
\end{cases}
\]
i) Find the CDF of X.
ii) By considering the properties of CDFs, write down a couple of ways in which you
could check that your answer to part i) is plausible. Perform those checks and revisit
part i) if appropriate.
iii) Calculate P(3/2 < X < 5/2) by evaluating the CDF at appropriate places.
b) A continuous random variable Y has cumulative distribution function
\[
F_Y(y) = \begin{cases}
0 & \text{if } y \le 0\\
\frac{3y}{4} & \text{if } 0 < y \le 1\\
\frac{y+8}{12} & \text{if } 1 < y \le 4\\
1 & \text{if } y > 4.
\end{cases}
\]
i) Find the PDF of Y .
ii) Calculate the expectation and variance of Y .
iii) Calculate the median of Y .
EXERCISE 10.4. Let Ω be the right-angled triangle with vertices (0, 0), (1, 0), (1, 1). Let a be a
point chosen at random from within the triangle with the probability that a is in any fixed region
being proportional to the area of the region. Let T be the random variable whose value is ‘the angle
that the line from (0, 0) to a makes with the x-axis’. Find the CDF and PDF of T .
EXERCISE 10.5. Let X be a continuous random variable.
i) Prove that the median, lower quartile and upper quartile of X all exist.
ii) If you know the median of X what can you say about the median of the random variable 2X?
iii) If you know the quartiles of X what can you say about the lower quartile of the random
variable −2X?
EXERCISE 10.6. Let U be a random variable with the Uniform[0, 1] distribution.
i) Show that if r, s ∈ R with r > 0 then rU + s also has a uniform distribution.
ii) Use this to give another derivation of the expectation and variance of a Uniform[a, b] random
variable which only requires integrating functions related to the Uniform[0, 1] distribution.
iii) Show that U 2 does not have the uniform distribution.
[Hint: For both i) and iii) start by considering the CDF of the new random variable.]
EXERCISE 10.7. Let E be a random variable with the Exp(λ) distribution. Let U be a random
variable with the Uniform[0, 1] distribution.
i) Show that E(E) = 1/λ and Var(E) = 1/λ 2 .
ii) Show that for x, y > 0, the conditional probability that E > x + y given that E > x is equal
to the probability that E > y. This is called the memoryless property of the exponential
distribution; can you see why?
iii) Show that for 0 < x < 1, 0 < y < 1, the conditional probability that U > x + y given that
U > x is strictly less than the probability that U > y. (So U does not have the memoryless
property).
iv) How does the expectation of E compare with its median?
v) How does the expectation of U compare with its median?
EXERCISE 10.8. Use the intermediate value theorem to show that the median and the quartiles of
a continuous random variable do exist. Give an example of CDF for which they are not unique.
EXERCISE 10.9. [Hard.] Let X_n be the sequence of random variables introduced at the beginning
of this chapter. Show that for any number α ∈ [0, 1], there is a sequence of rational numbers
xn ∈ Range(Xn ) such that limn→∞ xn = α.
11 Working with Several Random Variables
[This chapter is not examinable.]
Sometimes it is useful to consider more than one random variable at the same time, or to write
a random variable as a combination of other random variables. In this section we develop some of
this theory in the discrete case. All random variables mentioned are assumed to be discrete. Much
of what follows is also true for continuous random variables (but with sums replaced by integrals
and pmf replaced by pdf).
If we have two discrete random variables X and Y defined on the same sample space, the function
\[
\delta_{X,Y} : \mathbb{R}^2 \to \mathbb{R}, \qquad (x, y) \mapsto P(X = x, Y = y)
\]
is called the joint probability mass function (jpmf) of X and Y. The domain of a jpmf is the
Cartesian plane, namely the set of ordered pairs (two-element sequences) of real numbers. The
jpmf of X and Y can be presented as a table. The example we had in lectures gave the following:
                  R
          0      1      2      3
      0   0     3/35   6/35   1/35
  B   1  2/35  12/35   6/35    0
      2  2/35   3/35    0      0

Here, for example, the top right entry means that P(R = 3, B = 0) = 1/35.
You can work out the PMFs of X and Y from the joint PMF by
\[
P(X = x) = \sum_y P(X = x, Y = y), \qquad P(Y = y) = \sum_x P(X = x, Y = y).
\]
We sometimes refer to these as the marginal PMFs of X and Y to emphasise that they came from a
joint PMF.
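The marginal sums can be computed mechanically; the sketch below (ours, not from the notes) uses exact fractions and the values as we read them from the lecture table:

```python
from fractions import Fraction as Fr

# Joint PMF of the lecture example: jpmf[(r, b)] = P(R = r, B = b)
jpmf = {(0, 0): Fr(0),     (1, 0): Fr(3, 35),  (2, 0): Fr(6, 35), (3, 0): Fr(1, 35),
        (0, 1): Fr(2, 35), (1, 1): Fr(12, 35), (2, 1): Fr(6, 35), (3, 1): Fr(0),
        (0, 2): Fr(2, 35), (1, 2): Fr(3, 35),  (2, 2): Fr(0),     (3, 2): Fr(0)}

# Marginal PMFs: sum the joint PMF over the other variable
marg_R = {r: sum(p for (x, _), p in jpmf.items() if x == r) for r in range(4)}
marg_B = {b: sum(p for (_, y), p in jpmf.items() if y == b) for b in range(3)}
print(marg_R)
print(marg_B)
```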
If X, Y are random variables and g is a real-valued function of two variables then g(X, Y) is
another random variable. It satisfies
\[
E(g(X,Y)) = \sum_x \sum_y g(x, y)\,P(X = x, Y = y).
\]
Theorem 11.1. If X and Y are discrete random variables then
E(X +Y ) = E(X) + E(Y ).
PROOF.
\[
\begin{aligned}
E(X + Y) &= \sum_x \sum_y (x + y)\,P(X = x, Y = y)\\
&= \sum_x \sum_y x\,P(X = x, Y = y) + \sum_y \sum_x y\,P(X = x, Y = y)\\
&= \sum_x x \sum_y P(X = x, Y = y) + \sum_y y \sum_x P(X = x, Y = y)\\
&= \sum_x x\,P(X = x) + \sum_y y\,P(Y = y)\\
&= E(X) + E(Y).
\end{aligned}
\]
If we have more than two random variables then applying this theorem repeatedly and using
basic properties of expectation we obtain:
Corollary 11.2 (Linearity of expectation). If X_1, X_2, ..., X_n are discrete random variables and
c_1, c_2, ..., c_n ∈ R, then
\[
E\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i\,E(X_i). \tag{83}
\]
Linearity is an important property of expectation. The next example gives one illustration of
its use.
EXAMPLE. We provide an alternative way of deriving the expectation of a binomial random variable. Suppose that X ∼ Bin(n, p) and for every 1 ≤ i ≤ n define a random variable
\[
X_i = \begin{cases}
1 & \text{if the } i\text{th trial results in success}\\
0 & \text{if the } i\text{th trial results in failure.}
\end{cases}
\]
Clearly X = X_1 + X_2 + ... + X_n. Also, for every i we have X_i ∼ Bernoulli(p) and so E(X_1) =
E(X_2) = ... = E(X_n) = p. Corollary 11.2 now tells us that
\[
E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = np.
\]
The X_i in this argument are sometimes called indicator variables.
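The indicator decomposition also suggests a direct simulation (our own sketch; the parameters n = 20, p = 0.3 are arbitrary):

```python
import random

# Simulate X = X_1 + ... + X_n with X_i ~ Bernoulli(p); the sample mean
# of the Bin(n, p) draws should be close to np.
n, p = 20, 0.3
rng = random.Random(4)
trials = 100_000
total = sum(sum(rng.random() < p for _ in range(n)) for _ in range(trials))
print(total / trials, "vs np =", n * p)   # ~ 6.0
```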
11.1 Independent random variables
The concept of independence was developed for events in chapter 6. Here we introduce an analogous concept for random variables.
Definition. Two discrete random variables X and Y are independent if for all x and y we have
P(X = x,Y = y) = P(X = x)P(Y = y)
that is, if the events X = x and Y = y are independent.
More generally, the random variables X1 , X2 , . . . , Xn are independent if for all x1 , x2 , . . . , xn we
have
P(X1 = x1 , X2 = x2 , . . . , Xn = xn ) = P(X1 = x1 )P(X2 = x2 ) . . . P(Xn = xn ).
It is worth noting (because it is a frequent misconception) that we do not require X and Y to be
independent in Theorem 11.1. However, if X and Y are independent then we can say more.
Theorem 11.3. If X and Y are independent discrete random variables then:
i) E(XY ) = E(X)E(Y ),
ii) Var(X +Y ) = Var(X) + Var(Y ).
PROOF.
i)
\[
E(XY) = \sum_x \sum_y xy\,P(X = x, Y = y).
\]
Since X and Y are independent we have that
\[
E(XY) = \sum_x \sum_y xy\,P(X = x)\,P(Y = y).
\]
Using properties of summation, we can express the right-hand side as a product to get:
\[
E(XY) = \left(\sum_x x\,P(X = x)\right) \times \left(\sum_y y\,P(Y = y)\right) = E(X)\,E(Y).
\]
(If you are having trouble seeing the last step, then write it out in full in a special case, e.g.,
where x assumes three values x_1, x_2, x_3, and y assumes two values y_1, y_2.)
ii)
\[
\begin{aligned}
\mathrm{Var}(X + Y) &= E((X + Y)^2) - E(X + Y)^2\\
&= E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2\\
&= E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2\\
&= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2(E(XY) - E(X)E(Y)).
\end{aligned}
\]
Now since X and Y are independent we use part i) to deduce that the term in brackets is 0.
Hence,
\[
\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).
\]
The converse of this is false; it is possible to have X and Y not independent but still have
E(XY ) = E(X)E(Y ).
Applying this result repeatedly and using basic properties of variance we obtain:
Corollary 11.4. If X_1, X_2, ..., X_n are independent discrete random variables and c_1, c_2, ..., c_n ∈ R
then
\[
\mathrm{Var}\left(\sum_{i=1}^{n} c_i X_i\right) = \sum_{i=1}^{n} c_i^2\,\mathrm{Var}(X_i).
\]
EXAMPLE. We provide an alternative way of deriving the variance of a binomial random variable.
Suppose that X ∼ Bin(n, p) and we define indicator variables Xi as in the previous example. The
Xi are independent random variables (because the trials involved in the definition of the Binomial
distribution are independent). Since Xi ∼ Bernoulli(p) we have that Var(X1 ) = Var(X2 ) = · · · =
Var(Xn ) = p(1 − p). Corollary 11.4 now tells us that
Var(X) = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ) = np(1 − p).
The next definition gives a measure of how far from being independent two random variables
are. You will meet these concepts again in statistics modules.
Definition. The covariance Cov(X,Y ) of two discrete random variables X and Y is
Cov(X,Y ) = E(XY ) − E(X)E(Y ).
It can be shown that
Cov(X,Y ) = E ((X − E(X))(Y − E(Y ))) .
The correlation coefficient Corr(X,Y ) of X and Y is defined to be
Corr(X,Y) = Cov(X,Y) / √(Var(X)Var(Y)).
The rough idea of these definitions is that:
• If X and Y are independent then Cov(X,Y ) = Corr(X,Y ) = 0 (by Theorem 11.3).
• If X and Y have a tendency to either both be large or both be small then the covariance and
correlation coefficient will be positive.
• If there is a tendency for one of X and Y to be large while the other is small then the covariance and correlation coefficient will be negative.
• The more extreme this tendency is, the larger in magnitude the covariance and correlation coefficient will be.
• The correlation coefficient (as we shall see later) is normalised to lie between −1 and 1, and
to be unchanged when X and Y are scaled linearly.
Here ‘X being large (or small)’ really means X being larger (or smaller) than E(X), and similarly
for Y.
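These quantities are straightforward to compute from a joint pmf. As a sketch, the snippet below uses a hypothetical joint distribution (chosen arbitrarily for illustration) in which X and Y tend to be equal, so both the covariance and the correlation come out positive:

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint pmf on {0,1} x {0,1}; mass concentrated on X = Y.
joint = {(0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())
cov = E_XY - E_X * E_Y                                  # Cov(X,Y)

var_x = sum((x - E_X)**2 * p for (x, y), p in joint.items())
var_y = sum((y - E_Y)**2 * p for (x, y), p in joint.items())
corr = float(cov) / sqrt(float(var_x * var_y))          # Corr(X,Y)

assert cov == Fraction(1, 8)       # positive, as expected
assert abs(corr - 0.5) < 1e-12
```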
The following properties of covariance and correlation can be deduced from their definitions.
Proposition 11.5. If X and Y are discrete random variables and a, b, c, d ∈ R then:
i) Var(X +Y ) = Var(X) + Var(Y ) + 2Cov(X,Y ),
ii) Cov(aX + b, cY + d) = acCov(X,Y ),
iii) If a, c > 0 then Corr(aX + b, cY + d) = Corr(X,Y ),
iv) −1 ≤ Corr(X,Y ) ≤ 1.
The point of parts ii) and iii) is that if I decide to scale one or both of the random variables
by a linear transformation (for example measuring a temperature in degrees Fahrenheit rather than
degrees Celsius) then the covariance may change but the correlation coefficient will not.
P ROOF.
i) During the proof of Theorem 11.3 ii) we showed that for any X and Y,
Var(X +Y ) = Var(X) + Var(Y ) + 2(E(XY ) − E(X)E(Y ))
(look back to check that we really did prove this for any X,Y , in other words up to this point
of the proof we did not use that X,Y were independent). The result follows from this.
ii) We will use the second definition of covariance that
Cov(aX + b, cY + d) = E((aX + b − E(aX + b))(cY + d − E(cY + d)))
Using basic properties of expectation we get
Cov(aX + b, cY + d) = E((aX + b − aE(X) − b)(cY + d − cE(Y) − d))
= E(a(X − E(X)) c(Y − E(Y)))
= acE((X − E(X))(Y − E(Y)))
= acCov(X,Y).
iii) We have that Var(aX + b) = a2 Var(X) and Var(cY + d) = c2 Var(Y ). This and the previous
part give that
Corr(aX + b, cY + d) = Cov(aX + b, cY + d) / √(Var(aX + b)Var(cY + d))
= acCov(X,Y) / √(a²Var(X) c²Var(Y)).
Now, since a, c > 0 we have that √(a²) = a and √(c²) = c, and so
Corr(aX + b, cY + d) = acCov(X,Y) / (ac√(Var(X)Var(Y))) = Corr(X,Y).
iv) We omit the proof.
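Parts ii) and iii) are easy to check numerically. The sketch below confirms that a linear rescaling leaves the correlation coefficient unchanged; the simulated data, the dependence Y = X + noise, and the constants a, b, c, d are all arbitrary illustrative choices (the sample correlation with n in the denominator stands in for the population quantity):

```python
import random

random.seed(0)

def corr(xs, ys):
    """Sample correlation coefficient (population form, n in the denominator)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx)**2 for x in xs) / n
    vy = sum((y - my)**2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# Dependent pair: Y = X + noise (arbitrary illustration).
xs = [random.randint(1, 6) for _ in range(10_000)]
ys = [x + random.random() for x in xs]

# e.g. rescaling one axis Celsius -> Fahrenheit must not change Corr.
a, b, c, d = 1.8, 32.0, 2.5, -7.0
scaled = corr([a*x + b for x in xs], [c*y + d for y in ys])
assert abs(corr(xs, ys) - scaled) < 1e-9
```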
11.2
Conditional random variables
As in the rest of this chapter, all random variables will be assumed to be discrete.
Let us recall some important notational conventions introduced in chapter 8. If X is a random
variable (namely a function Ω → R) and x ∈ R, then the sentence ‘X = x’ represents an event,
namely the set {ω ∈ Ω : X(ω) = x}, comprising all solutions of the equation X(ω) = x in sample
space. Now we push this convention further, and use hybrid representations of events —as sets or
as sentences— within the same expression!
If X is a random variable and E is an event then, by the definition of conditional probability,
P(X = k|E) = P(X = k and E) / P(E).
This gives the pmf of a new random variable called ‘X given E’ or X|E (or ‘X conditioned on E’).
We can consider properties of X|E such as its expectation and variance. By definition,
E(X|E) = ∑_k k P(X = k|E).
E XAMPLE . A standard fair die is rolled twice. Let X be the number showing on the first roll and
E be the event ‘at least one odd number is rolled’. Find the distribution of X|E.
We have that P(E) = 3/4. If r is odd then
P(X = r|E) = P(X = r and E)/P(E) = P(X = r)/P(E) = (1/6)/(3/4) = 2/9.
If r is even then
P(X = r|E) = P(X = r and E)/P(E) = (P(X = r) × P(second roll is odd))/P(E)
= (1/6 × 1/2)/(3/4) = 1/9.
It follows that the conditional pmf of X|E is
n           1    2    3    4    5    6
P(X = n|E)  2/9  1/9  2/9  1/9  2/9  1/9
You can now work out that E(X|E) = (2 + 2 + 6 + 4 + 10 + 6)/9 = 30/9 = 10/3.
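The whole example can be checked by brute-force enumeration of the 36 equally likely outcomes of the two rolls. A Python sketch recovering P(E), the conditional pmf, and E(X|E) = 10/3:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (first roll, second roll).
outcomes = list(product(range(1, 7), repeat=2))
# E: at least one odd number is rolled.
event = [(r1, r2) for r1, r2 in outcomes if r1 % 2 == 1 or r2 % 2 == 1]

assert Fraction(len(event), 36) == Fraction(3, 4)   # P(E) = 3/4

# Conditional pmf of X (first roll) given E, and its expectation.
cond_pmf = {n: Fraction(sum(1 for r1, _ in event if r1 == n), len(event))
            for n in range(1, 7)}
assert cond_pmf[1] == Fraction(2, 9) and cond_pmf[2] == Fraction(1, 9)
assert sum(n * p for n, p in cond_pmf.items()) == Fraction(10, 3)
```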
E XAMPLE . Often the event E is given by another random variable. Look back at the joint distribution of R and B in the last chapter. Finding the distribution of R|B = 0 means finding P(R = r|B = 0).
Hence we divide the top row of the joint distribution by P(B = 0). To find P(B = 0) we take the
sum of all elements in the top row of the joint distribution. We get P(B = 0) = 10/35 and that the
conditional pmf is:
n               1     2     3
P(R = n|B = 0)  3/10  6/10  1/10
The following theorem is similar to the theorem of total probability.
Theorem 11.6. If E1 , E2 , . . . , En partition Ω (that is, they are pairwise disjoint and their union is
Ω) and P(Ei ) > 0 for all i then
E(X) = ∑_{i=1}^n P(Ei)E(X|Ei).
P ROOF. By definition
E(X) = ∑_k k P(X = k).
If we apply the Theorem of Total Probability with the partition E1 , . . . , En to P(X = k) we get
E(X) = ∑_k k ∑_{i=1}^n P(X = k|Ei)P(Ei)
= ∑_{i=1}^n ∑_k k P(X = k|Ei)P(Ei)
= ∑_{i=1}^n P(Ei) ∑_k k P(X = k|Ei)
= ∑_{i=1}^n P(Ei)E(X|Ei).
This theorem can sometimes be used to calculate the expectation of a random variable without
having to calculate the whole PMF. We saw one example in lectures and here is a second example
giving an alternative derivation of the expectation and variance of a geometric random variable.
If X ∼ Geom(p) (so X is the number of trials up to and including the first success in a sequence
of Bernoulli trials) and A is the event “the first trial is a success” then
E(X) = P(A)E(X|A) + P(Aᶜ)E(X|Aᶜ).
Clearly, if A occurs, then X = 1 with probability 1 and so E(X|A) = 1.
If A does not occur then the number of subsequent trials (not including the first one) to the first
success is also distributed Geom(p) and so
E(X|Aᶜ) = 1 + E(X).
Substituting into the above equation we obtain an equation for E(X) which when solved gives
E(X) = 1/p (agreeing with our previous method of working this out).
Similarly,
E(X²) = P(A)E(X²|A) + P(Aᶜ)E(X²|Aᶜ).
As before E(X²|A) = 1. Since the number of trials after the first one is also distributed Geom(p)
we have that E(X²|Aᶜ) = E((1 + X)²). Substituting this gives
E(X²) = p + (1 − p)E(1 + 2X + X²)
= p + (1 − p)(1 + 2/p + E(X²))
which when solved gives E(X²) = (2 − p)/p² and hence Var(X) = (1 − p)/p².
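The two conditioning equations above can be checked for any fixed p. The Python sketch below verifies them, with exact arithmetic, at the arbitrary value p = 1/3:

```python
from fractions import Fraction

# Conditioning equations from the text:
#   E(X)   = p*1 + (1-p)*(1 + E(X))          =>  E(X)   = 1/p
#   E(X^2) = p*1 + (1-p)*E((1+X)^2)          =>  E(X^2) = (2-p)/p^2
p = Fraction(1, 3)

EX = 1 / p
assert EX == p * 1 + (1 - p) * (1 + EX)            # E(X) solves the recursion

EX2 = (2 - p) / p**2
assert EX2 == p * 1 + (1 - p) * (1 + 2*EX + EX2)   # E((1+X)^2) expanded
assert EX2 - EX**2 == (1 - p) / p**2               # Var(X) = (1-p)/p^2
```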
E XERCISES
E XERCISE 11.1. Two fair standard dice are rolled. Let A be the number of 1s seen and B be the
number of 2s seen in the outcome. Find the joint distribution of A and B. Find the covariance of A
and B and the correlation coefficient of A and B.
E XERCISE 11.2. Suppose that X,Y, Z are random variables with X ∼ Bin(7, 1/6), Y ∼ Geom(1/2),
Z ∼ Poisson(6). Suppose further that X and Y are independent but that X and Z are not independent. Which of the following can be determined from this information? Find the value of those
which can be determined.
i) E(X +Y )
ii) E(X + Z)
iii) E(X + 2Y + 3Z)
iv) E(X² + Y² + Z²)
v) Var(X +Y )
vi) Var(X + Z)
vii) Var(X + 2Y + 3Z).
E XERCISE 11.3. A fair die is rolled. Let N be the number showing on the die. Now a fair coin is
tossed N times. Let X be the number of heads seen. Find the conditional distribution of X given
N = n and hence find the expectation of X.
E XERCISE 11.4. Let X be the number of fish caught by a fisherman and Y be the number of fish
caught by a second fisherman in one afternoon of fishing. Suppose that X is distributed Poisson(λ )
and Y is distributed Poisson(µ). Suppose further that X and Y are independent random variables.
i) Show that
n
P(X +Y = n) =
∑ e−λ
k=0
λ k −µ µ n−k
e
.
k!
(n − k)!
ii) Hence find the distribution of the total number of fish caught.
E XERCISE 11.5. Suppose that we have n objects, m of which are red and the remaining n − m of
which are blue. We make a selection of r objects without replacement. Let X be the number of
red objects among the r objects chosen. Recall that the random variable X has the hypergeometric
distribution —see (60) p. 85. This question leads you through the calculation of the expectation
and variance of X, given in (61).
i) Regard the selection as being made with order and define random variables X1 , . . . , Xr by
Xi = 0 if the ith object chosen is blue, and Xi = 1 if the ith object chosen is red.
Find the distribution of each Xi (hint: they each have the same distribution) and express X in
terms of the Xi .
ii) Hence show that the expectation of X is rm/n.
iii) What is the distribution of Xi²?
iv) What is the distribution of Xi Xj for i ≠ j?
v) Hence find E(X²) and deduce that
Var(X) = r (m/n) (1 − m/n) ((n − r)/(n − 1)).