Notes for COMP11120 (Probability part)
Kees van Schaik
[email protected]
Alan Turing Building, office 2.142
2013-2014
Note: If you find any typos/errors/... I would be very grateful if you’d let me know
at [email protected].
Contents

1 Building blocks of probability theory
  1.1 Sample spaces and events
  1.2 Basic event/set operations
  1.3 Probability measures
  1.4 Some more rules for set/event manipulations and computing probabilities

2 Equally likely outcomes
  2.1 Equally likely outcomes
  2.2 A little bit of combinatorics
  2.3 Examples

3 The power of conditioning
  3.1 Conditional probabilities
  3.2 Independent events
  3.3 Partitions
  3.4 The law of total probability
  3.5 Bayes' Theorem

4 Random variables
  4.1 What are random variables?
  4.2 Computing probabilities with random variables
  4.3 Constructing new random variables from existing ones
  4.4 Probability mass functions (pmf's)
  4.5 Cumulative distribution functions (cdf's)
  4.6 The mean (or: expected value) of a random variable
  4.7 The variance and standard deviation of a random variable

5 Some well known distributions
  5.1 What is a distribution?
  5.2 The Bernoulli distribution
  5.3 The Binomial distribution
  5.4 The Poisson distribution
  5.5 The geometric distribution
Chapter 1
Introduction: building blocks of
probability theory
Many or even all of you will already have some understanding of what probability theory
is about. In fact it's likely that quite a few of you have already seen some probability
theory during your previous education, but otherwise you will all, for one reason or
another, have thought about basic probabilistic questions. Probably the most boring example
is the tossing of a coin, which yields either heads or tails. What makes the experiment of
tossing a coin probabilistic is that you don't know whether you will get heads or tails:
the outcome of the experiment is 'random'. Probability theory concerns the study of
such experiments, and tries to provide tools that allow us to analyse such 'random'
experiments.
In the sequel, the word 'experiment' is to be taken very broadly: we essentially use
it for any situation in which a ‘random’ outcome will occur, be it a game, a lottery, the
number of car accidents in Manchester on any given day, the decay of radioactive material
etc. etc.
1.1
Sample spaces and events
We begin by collecting all possible outcomes of the experiment in a set called the sample
space:
Definition 1.1.1 (Sample space). All possible outcomes of an experiment are collected
in a set called the sample space, which is usually denoted by the letter Ω. In general we will
label the possible outcomes of the experiment as ω1, ω2, ... so that Ω = {ω1, ω2, . . . , ωn}
if the experiment has n possible outcomes and Ω = {ω1, ω2, . . .} if the experiment has
infinitely many possible outcomes∗. (In particular examples we may deviate from this
general notation though and will use something more specific for the example in question.)

∗ Note that there is an underlying assumption here, namely that we can actually put all the outcomes
in a list, i.e. we can order them. This is only possible when the set in question is countable. Examples
of countable sets are the natural numbers N, the integers Z and (even) the rational numbers Q, or any
subsets of those sets. An example of a set that is uncountable, i.e. not countable, is the real numbers
R or any (non-empty) interval [a, b] ⊆ R: no matter how hard you try, you can't list all elements of R.
Mathematically speaking, a set S is countable exactly if there exists an injective (one-to-one) function
f : S → N. In this course we will only focus on the situation where the sample space Ω is countable. If
Ω were uncountable, some results we discuss in these notes would seriously break down! A first proof
that R is uncountable was given by the famous mathematician Georg Cantor around the 1890s.
Example 1.1.2. Consider the experiment of throwing a coin. Then there are two possible
outcomes, namely heads (H) and tails (T). Therefore the sample space is the set Ω =
{H, T}. (Or, in general notation, Ω = {ω1 , ω2 } where ω1 : coin shows heads and ω2 : coin
shows tails). Next consider rolling a dice and observing the number of dots. The resulting
sample space is Ω = {1, 2, 3, 4, 5, 6}. (Or, in general notation, Ω = {ω1 , ω2 , . . . , ω6 } where
ωi : dice shows i dots for i = 1, . . . , 6). Finally, suppose you throw two coins. Each coin
can show either heads (H) or tails (T), and the outcome of the experiment consists of a
combination of two of those. Hence Ω = {HH, HT, TH, TT}. (And the obvious analogue
in general notation).
In the sequel we are not only interested in the outcome of the experiment but, more
broadly, in certain collections of outcomes. For instance, if we throw a dice we might be
very interested in whether or not the number of dots is less than 3. This is true if the
number of dots is either 1 or 2, and false otherwise. To answer this question, rather than in
the outcome itself we are hence more interested in whether the outcome is in the set {1, 2}
(in which case the answer to the question is 'yes') or not (in which case the answer to the
question is 'no'). In order to facilitate such questions we give a name to such collections
of outcomes:
Definition 1.1.3 ((Occurring) events). An event is a set containing one or more outcomes from the sample space Ω (or, none in case it is the empty set, see below). Hence
mathematically speaking an event is nothing but a subset of the sample space: A ⊆ Ω. In
fact, any subset of Ω is (by definition) an event. This includes the empty set† , denoted ∅.
Furthermore, we say that an event occurs (or happens) if the actual outcome of the
experiment is an element of the event. (Hence we only know for sure whether an event
occurs or not when we have done the experiment and observed the outcome!).
Remark 1.1.4 (‘Extreme’ events). There are two ‘extreme’ events in the sense that they
are the smallest and largest possible event (in terms of the number of outcomes they
contain). The smallest is ∅ (the empty set). This set does not contain any outcomes at
all and hence this is an event that never occurs. The largest is Ω itself. This set clearly
contains all possible outcomes and as such it always occurs.
All other events contain at least some, but not all elements of Ω. Hence these events
may or may not occur.
† Recall that the 'empty set' is the set that contains no elements at all. By convention the empty set
is a subset of any set.
Remark 1.1.5. If Ω has n elements then there are 2^n events (or, equivalently, Ω has 2^n
subsets). A way to see that is as follows. Write Ω = {ω1, ω2, . . . , ωn}. You can associate
each subset of Ω with a sequence of zeros and ones, of length n. Namely, the i-th entry of
the sequence is a 1 if ωi is an element of the subset and otherwise a 0, for i = 1, . . . , n. In
this way, the sequence with only 0's corresponds to the subset which contains no elements
at all, i.e. the empty set; while the sequence with only 1's corresponds to the subset which
contains all elements, i.e. Ω itself. All other sequences have at least one 0 and one 1 in them
and represent subsets somewhere between these two extremes (see also the remark above).
Hence there is a one-to-one correspondence (i.e. a bijection) between all subsets of Ω
and all these sequences. So rather than counting subsets we may as well count sequences,
and as there are n entries, each to be filled with one of two choices, namely a 0 or a 1, we see
that there are indeed 2^n such sequences.
Example 1.1.6 (Example 1.1.2 cont'd). In the experiment of tossing a coin we have Ω =
{H, T}. There are 2^2 = 4 possible subsets of Ω, namely ∅, {H}, {T} and Ω itself. These
are hence also the four possible events. We can express these events in words as follows:

∅   : "nothing happens" (this event never occurs, see also Remark 1.1.4)
{H} : "coin shows heads" (this event occurs exactly if the outcome is heads)
{T} : "coin shows tails" (this event occurs exactly if the outcome is tails)
Ω   : "coin shows either heads or tails" (this event always occurs, see also Remark 1.1.4).
Example 1.1.7 (Example 1.1.2 cont’d). When we throw a dice we have Ω = {1, 2, 3, 4, 5, 6}.
There are hence 2^6 = 64 possible events. As always this includes ∅ and Ω. Three more
examples of events, denoted A, B and C:
A: ”No. of dots is less than 3”, hence A = {1, 2};
B: ”No. of dots is even”, hence B = {2, 4, 6};
C: ”No. of dots is 1”, hence C = {1}.
Recall that an event occurs when the actual outcome of the experiment is a member of
the event in question. Hence we can compile the following table:
Observed outcome        Event occurs?
of the experiment       A     B     C
1                       Y     N     Y
2                       Y     Y     N
3                       N     N     N
4                       N     Y     N
5                       N     N     N
6                       N     Y     N
Remark 1.1.8. Shifting our focus to events rather than (only) outcomes might make
you wonder how we can capture a particular outcome in terms of events. In fact, the event
C in the above Example 1.1.7 shows how to do that. With the general notation Ω =
{ω1, ω2, . . . , ωn} we have that

the outcome of the experiment is ωi exactly if the event {ωi} occurs.
1.2 Basic event/set operations
As events are just sets, we can use the usual set operations to create new sets/events
from existing ones. For the sake of completeness we collect the usual set operations in the
following definition, and we give an interpretation of what the set operations mean when
translated to the world of outcomes and events.
Definition 1.2.1 (Basic set operations). If A, B ⊆ Ω are events then we have the following.
• We use the symbol #A to denote the number of outcomes in A, i.e. #A ∈ {0, 1, 2, . . .};
we can also have #A = ∞.
• The complement of A, denoted Ac, consists of all outcomes that are not in A, i.e.
Ac = {ω ∈ Ω | ω ∉ A}.
Note that the event Ac occurs ⇐⇒ the event A does not occur.
• The union A ∪ B consists of all outcomes that are either in A or in B (or in both):
A ∪ B = {ω ∈ Ω | ω ∈ A or ω ∈ B}.
Note that the event A ∪ B occurs ⇐⇒ the event A, or the event B occurs, or both.
• The intersection A ∩ B consists of all outcomes that are in both A and B:
A ∩ B = {ω ∈ Ω | ω ∈ A and ω ∈ B}.
Note that the event A ∩ B occurs ⇐⇒ both events A and B occur.
• If B ⊆ A we can define the (set) difference of A and B (‘A minus B’):
A \ B = {ω ∈ Ω | ω ∈ A and ω ∉ B}.
Note that the event A \ B occurs ⇐⇒ the event A occurs, but the event B does
not.
Figure 1.1: A set A (left) and its complement Ac (right)

Figure 1.2: Two sets A (left top) and B (right top) with their union (left bottom) and intersection (right bottom)

Figure 1.3: With the set A as in the previous two graphs and B as on the left, the set difference A \ B on the right
See Figures 1.1–1.3 for a graphical representation of the above set operations. (These
representations are called 'Venn diagrams' and can be very useful for gaining some intuition
for how (complicated) set operations work out.)
Definition 1.2.2 (Disjoint events). Two events A and B are called disjoint when A∩B =
∅. That is, two events are disjoint exactly when they cannot occur at the same time.
Example 1.2.3 (Example 1.1.7 cont’d). In the setting of throwing a dice with Ω =
{1, 2, 3, 4, 5, 6} and the three events:
A = {No. of dots is less than 3} = {1, 2};
B = {No. of dots is even} = {2, 4, 6};
C = {No. of dots is 1} = {1}
we have for instance
Ac = {3, 4, 5, 6} = {No. of dots is at least 3},
A ∪ C = {1, 2} = A,
A ∩ C = {1} = C,
A ∩ B = {2},
B ∩ C = ∅,
A \ C = {2}.
1.3 Probability measures
We have so far discussed two of the main ingredients, namely the sample space and events.
However we would also like to talk about the likelihood of outcomes of the experiment and in
particular the likelihood of events occurring. That is to say, we want to have a function
that assigns to every possible event the likelihood that this event occurs.
Such a function is called a probability measure. For its definition we need to know
what disjoint events are:
Definition 1.3.1 (Disjoint events). Two events A and B are called disjoint when A ∩ B =
∅. That is, two events are disjoint exactly when they cannot both occur.
Definition 1.3.2 (Probability measure). A probability measure, denoted P, is a mapping
P : {Collection of all events} → [0, 1]
i.e. P assigns to any event A a number P(A) ∈ [0, 1]. This number we understand as the likelihood that the event A occurs, i.e. the likelihood that the actual outcome of the experiment
is an element of A.
Furthermore a probability measure should satisfy the following properties:
(i) P(Ω) = 1,
(ii) if A and B are disjoint events then P(A ∪ B) = P(A) + P(B).
Remark 1.3.3. In the above Definition 1.3.2, there are two 'extreme' cases:

P(A) = 0 ⇐⇒ A will never occur,
P(A) = 1 ⇐⇒ A will always occur.

We will see in a bit that P(A) = 0 holds at least when A = ∅, while P(A) = 1 holds at
least when A = Ω. Other events A will typically have P(A) ∈ (0, 1) – i.e. A may or may not
occur. (Compare also with Remark 1.1.4).
Example 1.3.4 (Example 1.1.2 cont’d). Consider again the experiment of tossing a coin,
with Ω = {H, T} and the four possible events ∅, {H}, {T} and Ω. We can define a
(candidate) probability measure P by setting
P(∅) := 0,  P({H}) := 1/2,  P({T}) := 1/2  and  P(Ω) := 1.
We need to verify this mapping P is indeed a probability measure, i.e. that all conditions
in Definition 1.3.2 are satisfied.
Point (i) is just by definition.
For point (ii), we need to verify that for all possible combinations of disjoint events A
and B we indeed have
P(A ∪ B) = P(A) + P(B).   (1.1)
You can verify all combinations yourself; two particular examples: if we take A = ∅ and
B = {H} then

P(A ∪ B) = P(∅ ∪ {H}) = P({H}) = 1/2

while

P(A) + P(B) = P(∅) + P({H}) = 0 + 1/2 = 1/2

and hence we have verified that (1.1) indeed holds. If we take A = {H} and B = {T}
then

P(A ∪ B) = P({H} ∪ {T}) = P({H, T}) = P(Ω) = 1

while

P(A) + P(B) = P({H}) + P({T}) = 1/2 + 1/2 = 1

so both are indeed equal.
Note that P is the choice that corresponds to a ‘fair’ coin, i.e. when heads and tails
both appear with probability 1/2. However P is certainly not the only possible probability
measure.
Indeed, if we define a mapping P as
P(∅) := 0,  P({H}) := 1/4,  P({T}) := 3/4  and  P(Ω) := 1

you can verify this is indeed also a probability measure. This one corresponds to the situation
where the coin isn't fair and heads only appears with probability 1/4.
In fact, for any α ∈ [0, 1] define the mapping Pα by setting
Pα(∅) := 0,  Pα({H}) := α,  Pα({T}) := 1 − α  and  Pα(Ω) := 1,

then it can be verified (exercise) that these are indeed all probability measures. Note also that
for α = 1/2 we recover the fair-coin measure P from above, while for α = 1/4 we end up with
the biased measure from above. We see that even in this very basic example there are infinitely
many probability measures!
Finally then, it’s not very difficult to come up with an example of a mapping that is
not a valid probability measure (exercise). Take for instance the mapping P1 defined as
P1(∅) := 0,  P1({H}) := 1/3,  P1({T}) := 1/3  and  P1(Ω) := 1.
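For a sample space as small as Ω = {H, T} this checking can even be done mechanically: we can list every event and test both conditions of Definition 1.3.2 by brute force. Below is a minimal sketch in Python (not part of the original notes; the function names are just illustrative). It accepts the fair-coin measure from above and rejects the invalid mapping P1.

```python
from itertools import combinations

def all_events(omega):
    # All subsets of the sample space, i.e. all events.
    return [frozenset(c) for r in range(len(omega) + 1)
            for c in combinations(omega, r)]

def is_probability_measure(P, omega):
    # Condition (i) of Definition 1.3.2: P(Omega) = 1.
    if P[frozenset(omega)] != 1:
        return False
    # Condition (ii): P(A u B) = P(A) + P(B) for all disjoint events A, B.
    for A in all_events(omega):
        for B in all_events(omega):
            if not (A & B) and abs(P[A | B] - (P[A] + P[B])) > 1e-12:
                return False
    return True

omega = {"H", "T"}
P_fair = {frozenset(): 0, frozenset("H"): 1/2,
          frozenset("T"): 1/2, frozenset("HT"): 1}
P_1 = {frozenset(): 0, frozenset("H"): 1/3,
       frozenset("T"): 1/3, frozenset("HT"): 1}

print(is_probability_measure(P_fair, omega))  # True
print(is_probability_measure(P_1, omega))     # False: 1/3 + 1/3 != P(Omega)
```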
One disadvantage of Definition 1.3.2 is that it requires us to check that point (ii) of the
definition holds for any disjoint events A and B – and as there may be many combinations
of disjoint events this may be quite a task. Luckily there is a shortcut.
For this, let us use the general notation Ω = {ω1 , ω2 , . . . , ωn } (we take for simplicity a
finite sample space, but it also works for infinite ones. However, the argument we’re going
to use below does not work when the sample space is not countable. Not relevant for us
now, but good to know in case you’d end up doing more probability later on – cf. the
footnote at the beginning of this chapter).
For any i = 1, . . . , n define pi := P({ωi }) (i.e. pi is the probability that the experiment
has ωi as outcome). The important thing to realise is now that in fact we only need to
know these pi ’s to compute the probability of any event. Indeed, take for instance the
event A = {ω1 , ω2 }, then
P(A) = P({ω1} ∪ {ω2}) =(∗) P({ω1}) + P({ω2}) = p1 + p2.
Note that (*) uses that {ω1 } and {ω2 } are disjoint sets, i.e. {ω1 } ∩ {ω2 } = ∅, and hence
we can use part (ii) of Definition 1.3.2. You can probably imagine this works in the same
way for any event: its probability is computed by adding up the probabilities of each of
the outcomes that are elements of the event. We can express this in a formula as follows.
For any event A we have
P(A) = Σ_{i=1}^{n} 1A(ωi) pi.   (1.2)
The symbol 1 is an ‘indicator function’. In general, for any set S the function 1S (x) takes
the value 1 if x ∈ S holds and it takes the value 0 otherwise. Hence the indicator function
appearing in (1.2) behaves as follows:
1A(ωi) = 1 if ωi ∈ A, and 1A(ωi) = 0 if ωi ∉ A.   (1.3)
The formula (1.2) therefore reads as follows: the probability of A is obtained by looping
over all outcomes in the sample space Ω, and if the outcome is an element of A then add
the corresponding pi to the number you had and otherwise add nothing.
Note that if we apply the formula (1.2) to the event A = Ω then 1A (ωi ) = 1 for every
i = 1, . . . , n and thus the formula reads
P(Ω) = Σ_{i=1}^{n} pi.
Since we know from Definition 1.3.2 that P(Ω) = 1 this implies that
Σ_{i=1}^{n} pi = 1.
So, if we put the above argument in a Result then we have derived the following:
Result 1.3.5. Suppose Ω = {ω1 , ω2 , . . . , ωn }. Let P be a probability measure. Define
pi := P({ωi }) for i = 1, . . . , n. Then we have
(i) pi ≥ 0 for all i = 1, . . . , n and p1 + p2 + . . . + pn = 1,
(ii) for any event A ⊆ Ω we have
P(A) = Σ_{i=1}^{n} 1A(ωi) pi

where the indicator function 1 is defined in (1.3).
Also if Ω has infinitely many elements (i) and (ii) are still true, with the understanding
that the n has to be replaced by ∞ i.e.
(i’) pi ≥ 0 for all i = 1, 2, . . . and p1 + p2 + . . . = 1,
(ii’) for any event A ⊆ Ω we have
P(A) = Σ_{i=1}^{∞} 1A(ωi) pi.
In fact, if you think a bit about it you can also invert the above result. Still working with
the sample space Ω = {ω1 , ω2 , . . . , ωn }, take any sequence of numbers pi for i = 1, . . . , n
such that pi ≥ 0 and p1 + p2 + . . . + pn = 1. Now define the mapping P by setting for any
event A
P(A) := Σ_{i=1}^{n} 1A(ωi) pi.   (1.4)
Note that this formula in particular yields P({ωi }) = pi for every i = 1, . . . , n. Now this
is indeed a probability measure. To verify point (i) of Definition 1.3.2, indeed we have
P(Ω) = Σ_{i=1}^{n} 1Ω(ωi) pi = Σ_{i=1}^{n} pi = 1.
To verify point (ii), let A and B be disjoint events. First we make the following step:
1A(ωi) + 1B(ωi) = [1 if ωi ∈ A, else 0] + [1 if ωi ∈ B, else 0]
= [2 if ωi ∈ A ∩ B;  1 if ωi ∈ (A ∪ B) \ (A ∩ B);  0 if ωi ∉ A ∪ B]
=(∗) [1 if ωi ∈ A ∪ B;  0 if ωi ∉ A ∪ B]
= 1A∪B(ωi).
Note that the second equality follows by considering the different values both indicator
functions can take. The crucial step is (*), where we use that A and B are disjoint i.e.
that A ∩ B = ∅ to simplify the expression we had before. This computation hence tells
us that on account of A ∩ B = ∅ we have 1A (ωi ) + 1B (ωi ) = 1A∪B (ωi ). Using this we see
that
P(A ∪ B) = Σ_{i=1}^{n} 1A∪B(ωi) pi   (by (1.4))
= Σ_{i=1}^{n} (1A(ωi) + 1B(ωi)) pi
= Σ_{i=1}^{n} 1A(ωi) pi + Σ_{i=1}^{n} 1B(ωi) pi
= P(A) + P(B)   (by (1.4))
so that we have also verified point (ii) of Definition 1.3.2 and we can thus conclude that
P as defined in (1.4) is indeed a probability measure.
This all brings us to
Result 1.3.6. Suppose Ω = {ω1 , ω2 , . . . , ωn }. Take any sequence of numbers pi for i =
1, . . . , n such that pi ≥ 0 and p1 + p2 + . . . + pn = 1. For any event A, define the mapping
P as
P(A) := Σ_{i=1}^{n} 1A(ωi) pi.   (1.5)
(That is, P(A) is found by adding up all those pi ’s for which the corresponding outcome
ωi is an element of A.) Then P is a probability measure, chosen in such a way that for all
i = 1, . . . , n the experiment has outcome ωi with probability pi = P({ωi }).
If Ω = {ω1 , ω2 , . . . , } the same result still holds true, with the obvious adjustments such
as that we need an infinite sequence of non-negative pi ’s summing to 1 rather than a finite
one and we need to replace the upper bound of the sum in (1.5) by ∞.
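As a quick illustration, here is a minimal sketch in Python (not part of the original notes) of the construction in Result 1.3.6: given a pmf (p_i), the probability of an event is obtained by summing the p_i over the outcomes in the event, exactly as in (1.5).

```python
def make_measure(pmf):
    # pmf: dict mapping each outcome omega_i to p_i, with p_i >= 0 and sum 1.
    assert all(p >= 0 for p in pmf.values())
    assert abs(sum(pmf.values()) - 1) < 1e-12
    def P(A):
        # Formula (1.5): add up the p_i for which omega_i is an element of A.
        return sum(p for omega, p in pmf.items() if omega in A)
    return P

P = make_measure({i: 1/6 for i in range(1, 7)})  # a fair dice
print(P({1, 2}))               # ~1/3, event A from Example 1.1.7
print(P({2, 4, 6}))            # ~1/2, event B
print(P({1, 2, 3, 4, 5, 6}))   # 1.0 = P(Omega)
```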
What we have actually done in this bit is connect two ways of thinking about
probabilities. We pretty much all know from previous experience (secondary school etc.)
how to compute probabilities: we assign a probability to each of the possible outcomes
of the experiment, and then the probability of an event is computed by looking at which
outcomes are an element of the event and adding up the probabilities of these outcomes.
This is the first approach (and an intuitively appealing one). The second approach is the
more mathematical approach, where we prefer to work with simple and more abstract
definitions. According to the mathematical approach, a probability measure is simply a
mapping as defined in Definition 1.3.2.
The above two Results 1.3.5 & 1.3.6 connect these two approaches. They tell us that
the first, more intuitive approach is in fact consistent with the mathematical definition of
a probability measure, as you would hope, so that we can essentially mix the use of both
approaches to our liking‡.
‡ It is good to point out that the first approach above only works when the sample space is either finite
or countable; the mathematical definition works for any possible sample space, though that situation
has its own quirks, such as the fact that one has to be careful with how events are exactly defined.
However this is of course all outside the scope of this course!
Example 1.3.7 (Example 1.2.3 cont’d). Consider again the example of throwing a dice,
with Ω = {1, 2, 3, 4, 5, 6} and the events A, B and C we introduced before:
A = {No. of dots is less than 3} = {1, 2};
B = {No. of dots is even} = {2, 4, 6};
C = {No. of dots is 1} = {1}.
Let us now compute some probabilities, using Result 1.3.6. Assuming that the dice is
fair it is natural to assume all six possible outcomes occur with probability 1/6. Hence we
take p1 = p2 = . . . = p6 = 1/6. Then
P(A) = p1 + p2 = 1/3,  P(B) = p2 + p4 + p6 = 1/2  and  P(C) = p1 = 1/6.
Alternatively, these probabilities could be computed using that by Definition 1.3.2 we
have P(A ∪ B) = P(A) + P(B) for disjoint events A and B. The strategy is then to split
the events A, B and C into disjoint events for which we know their probabilities:

P(A) = P({1} ∪ {2}) = P({1}) + P({2}) = p1 + p2 = 1/3,
P(B) = P({2} ∪ {4} ∪ {6}) = P({2}) + P({4}) + P({6}) = p2 + p4 + p6 = 1/2
and
P(C) = P({1}) = 1/6.

1.4 Some more rules for set/event manipulations and computing probabilities
We have now introduced three of the four core concepts, namely sample spaces, events
and probability measures. The final one is random variables and we’ll get to those later.
Let us conclude this chapter by stating some results that are helpful for manipulating
events and computing probabilities.
For events we have the following. These identities may come in very handy!
Result 1.4.1. Consider a sample space Ω and let A, B and C be events. Then we have:
(i) A ∩ Ac = ∅ and A ∪ Ac = Ω
(ii) Distributive rules (note that these are exactly the same laws as in arithmetic, provided you replace "∪" by "+" and "∩" by "×"):
• (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
• (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
(iii) De Morgan’s laws:
• (A ∪ B)c = Ac ∩ B c
• (A ∩ B)c = Ac ∪ B c
Proof. Ad (i). This is clear from the definition Ac = {ω ∈ Ω | ω ∉ A}.
Ad (ii). For the first one,
ω ∈ (A ∪ B) ∩ C ⇐⇒ ω ∈ A ∪ B and ω ∈ C
⇐⇒ (ω ∈ A and ω ∈ C) or (ω ∈ B and ω ∈ C)
⇐⇒ ω ∈ A ∩ C or ω ∈ B ∩ C ⇐⇒ ω ∈ (A ∩ C) ∪ (B ∩ C).
The proof of the second one is similar.
Ad (iii). For the first one,
ω ∈ (A ∪ B)c ⇐⇒ ω ∉ A ∪ B ⇐⇒ ω ∉ A and ω ∉ B
⇐⇒ ω ∈ Ac and ω ∈ Bc ⇐⇒ ω ∈ Ac ∩ Bc.
The proof of the second one is an exercise.
For computing probabilities we have the following — again very useful! For the proofs,
note that all we know (at this point) about a probability measure is that the probability
of the whole sample space is 1 and that the probability of the union of two disjoint events
is the sum of their probabilities. Cf. Definition 1.3.2. Therefore we will be looking for ways
to make use of disjoint events.
Result 1.4.2. Consider a sample space Ω, let A and B be events and let P be a probability
measure. Then we have:
(i) P(Ac) = 1 − P(A)   ('complementary law')
(ii) if B ⊆ A then P(B) ≤ P(A) and P(A \ B) = P(A) − P(B)
(iii) P(A ∪ B) = P(A) + P(B) − P(A ∩ B)   ('additive law')
(iv) P(Ac ∩ B) = P(B) − P(A ∩ B)   ('2nd complementary law')
(v) P(A ∩ Bc) = P(A) − P(A ∩ B)   ('3rd complementary law')
Proof. Ad (i). From Result 1.4.1 we know that A and Ac are disjoint events (*) while we
also have A ∪ Ac = Ω (**). Recall from Definition 1.3.2 (i) that P(Ω) = 1. Using these
facts together with Definition 1.3.2 (ii) (*) we get
1 = P(Ω) =(∗∗) P(A ∪ Ac) =(∗) P(A) + P(Ac)
and the result follows.
Ad (ii). If B ⊆ A then we may cut A up in two disjoint parts: B and A \ B, i.e.
A = B ∪ (A \ B) (make a Venn diagram!). Hence, again making use of Definition 1.3.2
(ii) we find
P(A) = P(B ∪ (A \ B)) = P(B) + P(A \ B)
and rearranging a bit we indeed get P(A \ B) = P(A) − P(B).
Furthermore, we may also write this as P(B) = P(A) − P(A \ B). Since the probability
of any set is bigger than or equal to 0, also P(A \ B) ≥ 0 and it hence indeed follows that
P(B) ≤ P(A).
Ad (iii). We may write the event A ∪ B for instance as the union of the two disjoint
events A and B \ (A ∩ B) (make a Venn diagram!). This yields:
P(A ∪ B) = P(A ∪ (B \ (A ∩ B))) = P(A) + P(B \ (A ∩ B)) = P(A) + P(B) − P(A ∩ B),
where we have again used Definition 1.3.2 (ii) and the last step makes use of part (ii)
above.
Ad (iv). Draw again a Venn diagram to convince yourself that we may write Ac ∩ B =
B \ (A ∩ B), so that
P(Ac ∩ B) = P(B \ (A ∩ B)) = P(B) − P(A ∩ B),
where the last step uses part (ii) above.
Ad (v). This is exactly the same as part (iv), only with the roles of the events A and
B reversed.
Remark 1.4.3. If we plug A = Ω into part (i) above it reads as P(∅) = 1 − P(Ω). As
P(Ω) = 1 (cf. Definition 1.3.2 (i)) it follows that P(∅) = 0 (as we promised to show in
Remark 1.3.3). Also, this means that if we plug two disjoint events A and B into the
‘additive law’ above then that one reads
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(∅) = P(A) + P(B),
which is exactly Definition 1.3.2 (ii). So the difference between the two is that the ‘additive
law’ holds also when the intersection A∩B is not empty, in which case P(A)+P(B) would
be too large as it counts the elements in A ∩ B twice (they are both in A and in B), and
hence the correction term P(A ∩ B) is subtracted.
We conclude with two examples.
Example 1.4.4. Suppose that we have events A and B of which we know that P(A) = 0.4,
P(B) = 0.3 and P(A∩B) = 0.2. We would like to compute P(A∪B), P(Ac ) and P(Ac ∩B c ).
For the first, applying the ‘additive law’ from Result 1.4.2 we find
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.4 + 0.3 − 0.2 = 0.5.
For the second, the ‘complementary law’ from Result 1.4.2 yields
P(Ac ) = 1 − P(A) = 1 − 0.4 = 0.6.
Finally, the last one is more challenging. We (obviously) need to try to rewrite the event
Ac ∩ B c into one or more of the events for which we are given the probabilities. One of the
De Morgan’s laws from Result 1.4.1 gives us what we need, namely Ac ∩ B c = (A ∪ B)c :
P(Ac ∩ B c ) = P((A ∪ B)c ) = 1 − P(A ∪ B) = 1 − 0.5 = 0.5,
where we used the ‘complementary law’ from Result 1.4.2 and the fact that we computed
P(A ∪ B) above.
Example 1.4.5. Consider a loaded dice which has the property that 1 and 3 dots show
up with probability 1/3 while 2, 4, 5 and 6 dots show up with probability 1/12. Compute
the probability that the outcome is even, and the probability that at least 2 dots show up.
First note that we again have the sample space Ω = {1, 2, 3, 4, 5, 6}. Following the
approach in Result 1.3.6 and Example 1.3.7 we define for i = 1, . . . , 6 the probabilities for
each of the outcomes as pi = P({ωi }). We are given that p1 = p3 = 1/3 and p2 = p4 =
p5 = p6 = 1/12. Then
P({No. of dots is even}) = P({2, 4, 6}) = p2 + p4 + p6 = 1/4

and

P({No. of dots is at least 2}) = 1 − P({No. of dots is 1}) = 1 − p1 = 2/3,
where we used the ‘complementary law’ from Result 1.4.2 again. (In terms of sets, we have
{No. of dots is at least 2} = {2, 3, 4, 5, 6} so that its complement is {1} = {No. of dots is 1}.)
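These computations are easy to mechanise. Below is a short sketch in Python (not part of the original notes) that reproduces Example 1.4.5 with exact fractions and double-checks the 'complementary law':

```python
from fractions import Fraction

# The loaded dice of Example 1.4.5.
pmf = {1: Fraction(1, 3), 3: Fraction(1, 3),
       2: Fraction(1, 12), 4: Fraction(1, 12),
       5: Fraction(1, 12), 6: Fraction(1, 12)}

def P(A):
    return sum(pmf[w] for w in A)

print(P({2, 4, 6}))        # 1/4: the outcome is even
print(P({2, 3, 4, 5, 6}))  # 2/3: at least 2 dots, computed directly
print(1 - P({1}))          # 2/3 again, via the complementary law
```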
Chapter 2
Equally likely outcomes
In the previous chapter we have developed some general theory for computing probabilities. Now, very often it is the case that we are dealing with an experiment that has the
special property that every outcome is equally likely to occur. Think for instance about
throwing a fair dice, a fair coin, randomly selecting a prize winner in a lottery etc. etc.
Since this is such a prominent particular case of experiments we dedicate this chapter
to discussing such experiments in more detail. We will see that computing probabilities
becomes a matter of counting.
2.1 Equally likely outcomes
Consider an experiment with sample space Ω = {ω1 , ω2 , . . . , ωn } and suppose that we
assume that each outcome is equally likely. That is to say, if we set pi := P({ωi }), the
probability that the experiment has outcome ωi for i = 1, . . . , n, then we have p1 = p2 =
. . . = pn . Since we know that p1 + p2 + . . . + pn = 1 (cf. Result 1.3.5 in Ch. 1) it must be
the case that pi = 1/n for all i = 1, . . . , n. Furthermore, recall from Result 1.3.5 in Ch. 1
that we can compute the probability of any event A by the formula
P(A) = Σ_{i=1}^{n} 1A(ωi) pi.

If we now plug in that pi = 1/n for all i = 1, . . . , n we get

P(A) = Σ_{i=1}^{n} 1A(ωi) (1/n) = (1/n) Σ_{i=1}^{n} 1A(ωi).   (2.1)

The expression Σ_{i=1}^{n} 1A(ωi) is a sum that loops over all elements ωi in the sample space,
and adds 1 to the total if ωi ∈ A and 0 otherwise. Hence the number we end up with
is exactly equal to the number of elements in A, denoted #A. Therefore we can further
simplify the above (2.1) to
P(A) = (1/n) · #A = #A/#Ω

where in the last step we used that since Ω = {ω1, ω2, . . . , ωn} we have #Ω = n.
So we have shown the following:
Result 2.1.1. Suppose that an experiment has equally likely outcomes. Then we may
compute the probability of any event A using the formula
P(A) = #A/#Ω.   (2.2)

(Recall that for any set S, #S denotes the number of elements in S.)
The above result essentially tells us that in the case of equally likely outcomes, computing probabilities boils down to (simply, or sometimes not so simply) counting how
many outcomes there are in your event A and in the whole sample space Ω.
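In code the recipe of Result 2.1.1 is literally a pair of counts; a tiny sketch in Python (not part of the original notes):

```python
from fractions import Fraction

omega = range(1, 7)                  # fair dice: six equally likely outcomes
A = [w for w in omega if w < 3]      # "No. of dots is less than 3"
print(Fraction(len(A), len(omegaećo := omega.__len__() * 0 + 6)))
```

Wait: a correct version of that last line simply divides the two counts:

```python
from fractions import Fraction

omega = range(1, 7)                  # fair dice: six equally likely outcomes
A = [w for w in omega if w < 3]      # "No. of dots is less than 3"
print(Fraction(len(A), len(omega)))  # P(A) = #A / #Omega = 1/3
```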
2.2 A little bit of combinatorics
Since computing probabilities boils down to counting outcomes in this case, let us very
briefly discuss some (probably quite familiar) results on how to count in a ‘clever’ way.
Within mathematics, such questions are studied in the field known as combinatorics.
The first result is probably very familiar to you and is the basis of all counting we’ll
do in this chapter.
Result 2.2.1 (Multiplication rule). Suppose we need to select a total of n items. If there
are k1 possible choices for the first item, k2 possible choices for the second item, ..., kn
possible choices for the n-th item then there are in total
k1 · k2 · . . . · kn
ways to select the n items.
Next let us recall two basic quantities that we’ll use throughout:
Definition 2.2.2 (Factorial). For any n ∈ N we define
n! = n · (n − 1) · (n − 2) · . . . · 2 · 1
and
0! = 1.
Or, in recursive form: n! = n · (n − 1)! with initial value 0! = 1.
Definition 2.2.3 (Binomial coefficient). For any n ∈ N and k = 0, 1, . . . , n we define the
binomial coefficient C(n, k) as

C(n, k) = n! / (k!(n − k)!)
(also pronounced as ‘n choose k’).
The question we'll be asking ourselves next is as follows: consider a set of n distinct
elements, how many different selections of k ≤ n items out of this set are there? The
answer also depends on the type of selecting you are doing. The selection could be with
or without replacement. 'Without replacement' means that a selected item is put aside
and can not be selected again, while (hence) 'with replacement' means that a selected item
remains available in the set to be (potentially) selected again. Furthermore it matters
whether the order is important. If the order is important then you count two selections
that consist of the same items but contain those items in a different order as two different
selections, while if the order is not important then you consider all selections consisting
of the same items as one selection only.
Example 2.2.4. Consider the set {a, b, c} and suppose we would like to select 2 items
out of this set. (Check the below claims for yourself !)
First consider selections with replacement. If the order is important then all possible
selections are
{a, a}, {a, b}, {a, c}, {b, a}, {b, b}, {b, c}, {c, a}, {c, b}, {c, c}.
(2.3)
If the order is not important then there are fewer selections possible, as we consider {a, b}
and {b, a} to be the same selection, and analogously for {a, c} and {c, a}, {b, c} and {c, b}. Then
we are hence only left with
{a, a}, {a, b}, {a, c}, {b, b}, {b, c}, {c, c}.
Next consider selections without replacement. If the order is important then all possible
selections are
{a, b}, {a, c}, {b, a}, {b, c}, {c, a}, {c, b}
(2.4)
while if the order is not important we again do not see a distinction between {a, b} and
{b, a}, {a, c} and {c, a}, {b, c} and {c, b} so that we end up with the possible selections
{a, b}, {a, c}, {b, c}.
(2.5)
Here are some formulae (without proof) for computing how many selections are possible.
Result 2.2.5. Consider the question of how many different selections of k elements out
of a set of n distinct items we can make (k ≤ n). The answer is given in the following
table:
                          Order important?
                          Yes                    No
Without replacement:      n!/(n − k)!    (1)     C(n, k)          (2)
With replacement:         n^k            (3)     (complicated!)
If the order is important we typically talk about ‘permutations’, otherwise about ‘combinations’.
Example 2.2.6 (Example 2.2.4 cont'd). Let's check that the formulae from Result 2.2.5
are consistent with the above example. We have n = 3 and k = 2. For (2.3) we get indeed
n^k = 3^2 = 9 possible selections. For (2.4) we get indeed

n!/(n − k)! = 3!/1! = 6

possible selections, and for (2.5) we get indeed

C(n, k) = C(3, 2) = 3!/(2! · 1!) = 3

possible selections.
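These three closed formulas are available directly in Python's standard library; a quick check (not part of the original notes) against the hand count of Example 2.2.4 with n = 3 and k = 2:

```python
import math

n, k = 3, 2
print(n ** k)           # (3) ordered, with replacement: 9 selections
print(math.perm(n, k))  # (1) ordered, without replacement: n!/(n-k)! = 6
print(math.comb(n, k))  # (2) unordered, without replacement: C(n, k) = 3
```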
Remark 2.2.7. If we take k = n in the above Result 2.2.5 we effectively look at selecting
all items from the set of n distinct items. In that case we see that there are n!/0! = n!
different ways to select all items from the set of n items without replacement and where
the order is important. If the order is not important (but still without replacement) there
is only C(n, n) = 1 way of selecting all items – indeed, if we select all items from the set and
then do it again, then necessarily the second selection is just a reordered version of the
first selection and since the order is not important the second one is the same as the first
one!
The above Result 2.2.5 deals only with sets of distinct items. We conclude this section
with one more very useful result (again without proof), which deals with a counting
problem if the set contains duplicate items:
Result 2.2.8. Consider a set of n items, consisting of l subsets where each of these subsets
consists of identical items (but the items are different between different subsets). Suppose
that the first subset contains k1 (identical) items, the second one k2 (identical) items, ...,
the l-th one kl (identical) items. (Hence k1 + k2 + · · · + kl = n.) Then the number of
different permutations of all items of this set (i.e. the number of ways to select all n items
without replacement and where the order matters) is given by:
n! / (k1! · k2! · . . . · kl!).
See Example 2.3.1 for an application to illustrate this result.
2.3 Examples
The material in this chapter is typically best understood by going through examples. Here
are a bunch.
Example 2.3.1. An explorer has rations for two weeks, consisting of 4 cans of Tuna, 7
cans of Spam, and 3 cans of Beanie Weenies. If he opens one can each day, in how many
different ways can he consume the rations?
This is an application of Result 2.2.8. We have in total n = 14 cans. If all the cans were
different then there would be 14! = 87178291200 different ways to consume the
rations (cf. Remark 2.2.7). However in reality there are far fewer different ways, because
given a certain choice we can swap for instance two cans of Tuna in that choice and end
up with exactly the same choice! This wouldn't be possible if all cans were different.
Now, the collection of cans consists of l = 3 subsets of identical items (the Tuna, Spam
and Beanie Weenies). So k1 = 4, k2 = 7 and k3 = 3, and the number of different ways in
which he can consume his rations is

14! / (4! · 7! · 3!) = 120120.
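A small sketch in Python (not part of the original notes) confirming the count of Result 2.2.8 for this example:

```python
import math

def multiset_permutations(counts):
    # Result 2.2.8: n! divided by the factorials of the group sizes.
    n = sum(counts)
    result = math.factorial(n)
    for k in counts:
        result //= math.factorial(k)
    return result

print(multiset_permutations([4, 7, 3]))  # 14!/(4! * 7! * 3!) = 120120
```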
Example 2.3.2. A fair coin is tossed 10 times. What is the probability of:
a) exactly 5 heads,
b) at least 9 heads,
c) less than 9 heads.
For the solution, first note that as the coin is fair this is indeed an experiment with
equally likely outcomes. The sample space Ω consists of all possible sequences of length 10,
each entry filled with either H or T. Hence, the 'multiplication rule' Result 2.2.1 tells us
that #Ω = 2^10 = 1024.
For a), the event A = {5 heads} consists of all sequences of length 10 with 5 entries
filled with an H and the other entries with a T. Hence, #A equals the number of ways
in which we can select 5 out of 10 positions to put our H in (the other entries are then
automatically filled with a T as that's the only other option). Applying formula (2) from
Result 2.2.5 yields

#A = C(10, 5) = 252.

So using (2.2) we find

P({5 heads}) = 252/1024 = 63/256.

Alternatively we could apply Result 2.2.8 to count the elements in A. We are interested in
how many ways there are to fill a sequence of length 10 with 5 heads and 5 tails. Or, put
differently, in how many permutations there are of a set containing 5 heads and 5 tails.
Using Result 2.2.8 with n = 10, l = 2, k1 = 5, k2 = 5 we find that there are

10! / (5! · 5!) = 252

permutations, reassuringly the same number we found above.
For b), we may write the event A = {at least 9 heads} as A = {9 heads} ∪ {10 heads}
and since these events are disjoint (i.e. their intersection is empty) we have in fact that

#A = #{9 heads} + #{10 heads} = C(10, 9) + C(10, 10) = 10 + 1 = 11

(using the same counting method as in part a)) so that again by (2.2) we find

P({at least 9 heads}) = 11/1024.

Finally for c), given the work we did above already it is a good idea here to use the 'complementary law' (cf. Result 1.4.2 in Ch. 1) since {less than 9 heads}c = {at least 9 heads}
and hence

P({less than 9 heads}) = 1 − P({at least 9 heads}) = 1 − 11/1024 = 1013/1024,

where we used the result from b).
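Since the sample space has only 2^10 = 1024 elements, these answers can also be checked by brute-force enumeration; a sketch in Python (not part of the original notes):

```python
import math
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=10))   # the whole sample space
heads = [seq.count("H") for seq in outcomes]

print(Fraction(sum(h == 5 for h in heads), len(outcomes)))  # 63/256
print(Fraction(sum(h >= 9 for h in heads), len(outcomes)))  # 11/1024
print(Fraction(sum(h < 9 for h in heads), len(outcomes)))   # 1013/1024
print(math.comb(10, 5))                                     # 252 = #{5 heads}
```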
Example 2.3.3. Three fair dice are thrown. Compute the probability that:
a) the throw is (1, 1, 1),
b) exactly two of the dice show 2 dots,
c) the sum of the dice equals 16.
First again note that as the dice are fair, all outcomes are equally likely. The sample
space Ω consists of all sequences of length 3, each entry a natural number between 1 and 6.
By the 'multiplication rule' Result 2.2.1 we have #Ω = 6^3 = 216.
For a), the event A = {(1, 1, 1)} contains just one outcome and hence using (2.2)

P({the throw is (1, 1, 1)}) = 1/216.

For b), the event A = {exactly two of the dice show 2 dots} consists of any of the outcomes
(2, 2, ?), (2, ?, 2) and (?, 2, 2) where ? is a number unequal to 2. By the 'multiplication rule'
Result 2.2.1 there are 1 · 1 · 5 possibilities of the type (2, 2, ?), 1 · 5 · 1 possibilities of the
type (2, ?, 2) and 5 · 1 · 1 possibilities of the type (?, 2, 2). Hence

#A = 1 · 1 · 5 + 1 · 5 · 1 + 5 · 1 · 1 = 15

and thus by (2.2)

P({exactly two of the dice show 2 dots}) = 15/216 = 5/72.

For c), the sum of the dice is 16 exactly if we either throw two sixes and a four, or two
fives and a six. Of course, this can happen in several different ways: the first case corresponds
to one of the outcomes (6, 6, 4), (6, 4, 6) and (4, 6, 6); while the second case corresponds to
one of the outcomes (6, 5, 5), (5, 6, 5) and (5, 5, 6). Hence

A = {the sum of the dice equals 16} = {(6, 6, 4), (6, 4, 6), (4, 6, 6), (6, 5, 5), (5, 6, 5), (5, 5, 6)}

so that #A = 6 and thus by (2.2)

P({the sum of the dice equals 16}) = 6/216 = 1/36.

An alternative way to count the elements in A is the following. If we would like to know
how many outcomes there are containing two sixes and a four, we could apply Result 2.2.8
with n = 3, l = 2, k1 = 2 and k2 = 1. This yields

3!/(2! · 1!) = 6/(2 · 1) = 3

possible outcomes. In a similar way all outcomes consisting of two fives and a six can be
counted, which also gives 3, so that using this method we also arrive at #A = 6.
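Again the sample space is small enough (216 outcomes) to double-check everything by enumeration; a sketch in Python (not part of the original notes):

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=3))   # all 6^3 = 216 throws

def P(event):
    # Equally likely outcomes, so P = (favourable outcomes) / #Omega.
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

print(P(lambda w: w == (1, 1, 1)))    # 1/216
print(P(lambda w: w.count(2) == 2))   # 5/72
print(P(lambda w: sum(w) == 16))      # 1/36
```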
Example 2.3.4. Consider an urn containing 3 green, 4 blue and 5 red balls. You take
two balls out, without replacement (i.e. you don’t put the first ball back). Compute the
probability of getting:
a) two blue balls,
b) the first ball red and the second ball not red,
c) the second ball green.
This is again an experiment with equally likely outcomes once we distinguish the individual
balls: the sample space Ω consists of ordered pairs of distinct balls. As there are 12 balls in
total and it is without replacement, there are in total 12 · 11 = 132 possible pairs, i.e.
#Ω = 132. (This can be seen from either the
‘multiplication rule’ Result 2.2.1 or alternatively from formula (1) in Result 2.2.5 with
n = 12 and k = 2, which yields 12!/10! = 12 · 11 = 132.)
For a), as there are 4 blue balls in the urn we have
#{both balls are blue} = 4 · 3 = 12

(again either by the 'multiplication rule' Result 2.2.1 or by formula (1) in Result 2.2.5
with n = 4 and k = 2) and hence by (2.2)

P({both balls are blue}) = 12/132 = 1/11.

For b), the first ball has to be red and the second not. For the second ball there are
then 3 + 4 = 7 possibilities that are not red. This yields, again by the 'multiplication rule'
Result 2.2.1:

#{first ball red, second ball not red} = 5 · 7 = 35

so that by (2.2)

P({first ball red, second ball not red}) = 35/132.

Finally for c), this is slightly more tricky as the number of possibilities for the second
ball to be green depends on what the first ball was: if the first ball was green then there is
one less possibility than when the first ball was not green. One way around this problem
is to note that we have

{second ball green} =
{first ball green, second ball green} ∪ {first ball not green, second ball green}

and as these two events are disjoint (i.e. their intersection is empty) we have

#{second ball green}
= #{first ball green, second ball green} + #{first ball not green, second ball green}.

The point is that we can easily compute how many elements both events on the right hand
side have, namely using the same argument as above we have

#{first ball green, second ball green} = 3 · 2 = 6
and #{first ball not green, second ball green} = 9 · 3 = 27.

Hence

#{second ball green} = 6 + 27 = 33

and thus

P({second ball green}) = 33/132 = 1/4.
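As a final check, here is a brute-force sketch in Python (not part of the original notes) that enumerates all 12 · 11 = 132 equally likely ordered draws:

```python
from fractions import Fraction
from itertools import permutations

balls = ["G"] * 3 + ["B"] * 4 + ["R"] * 5    # the 12 balls in the urn
draws = list(permutations(range(12), 2))     # ordered, without replacement

def P(event):
    return Fraction(sum(1 for i, j in draws if event(balls[i], balls[j])),
                    len(draws))

print(P(lambda a, b: a == "B" and b == "B"))   # 1/11
print(P(lambda a, b: a == "R" and b != "R"))   # 35/132
print(P(lambda a, b: b == "G"))                # 1/4
```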
Chapter 3
The power of conditioning
In this chapter we will introduce the concept of conditioning on an event, which means that
we consider the situation of performing the experiment given that we know in advance
that a certain event will occur. This is a very powerful concept which leads to several
important results in probability theory such as the law of total probability and Bayes’
Theorem.
3.1 Conditional probabilities
To introduce the concept, consider the example of throwing a (fair) dice. Hence the sample
space is Ω = {1, 2, 3, 4, 5, 6}. As done many times before, since the dice is fair each of the
outcomes is equally likely and hence pi := P({i}), the probability that the outcome is i,
equals 1/6 for any i = 1, . . . , 6. Define the event
B = {the outcome is at least 5} = {5, 6}.
Now consider the experiment of throwing the dice given that we know that the event B
occurs. If we know in advance that the event B occurs, then we know in advance that the
outcome of the experiment will be either 5 or 6. Define
qi := the probability that the outcome is i given that we know that the event B occurs,
still for i = 1, . . . , 6. Now, the outcomes 1, 2, 3 and 4 can not occur, since they are not
elements of B and we know that B occurs. Hence q1 = q2 = q3 = q4 = 0. The outcomes 5
and 6 can both occur, and they are still equally likely but have become a lot more likely
now that we know it must definitely be one of these two, so that q5 = q6 = 1/2. That is to
say, we obtain q5 and q6 by taking p5 and p6 and scaling these up with a factor 3. This
factor is equal to 1/P(B) = 1/(1/3) = 3.
So, the effect of knowing in advance that the event B occurs, or conditioning on the
event B, is that the probabilities of the 6 possible outcomes are transformed as follows:
(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)  ⇝  (0, 0, 0, 0, 1/2, 1/2).
This transformation is obtained according to the following principle:
(i) all outcomes not in the event B are assigned probability 0 (as they could not possibly
occur),
(ii) the probability of each outcome in the event B is scaled up by a factor 1/P(B) (to
reflect the fact they have become more likely to occur than in the original experiment
without conditioning).
As usual we don’t want to work with outcomes only but with events, and hence we
would like to translate this principle to the level of events. It turns out the right way to
do this is through the following formula: for any event A
the probability that A occurs given that we know that B occurs = P(A ∩ B)/P(B).   (3.1)
Let’s do some quick sanity checks of this formula, by considering what it gives us in three
different cases for the event A. (A Venn diagram in each case should be helpful!).
• First suppose that A only consists of outcomes that are not in B (or, A ∩ B = ∅).
Hence, following (i) above, each outcome in A should be assigned probability 0 and
therefore the event A should have probability 0 of occuring. This is indeed also what
formula (3.1) yields. Namely, using that A ∩ B = ∅ we see:
P(A ∩ B)/P(B) = P(∅)/P(B) = 0,
where we used that P(∅) = 0 (cf. Remark 1.4.3 in Ch. 1).
• Next consider the case that A only contains outcomes that are in B, i.e. A ⊆ B,
then following (ii) we would expect that A has a probability of occuring that is the
old probability P(A) scaled up by a factor 1/P(B), i.e. P(A)/P(B). This is indeed
what we get. Namely, using (3.1) with the fact that A ⊆ B implies A ∩ B = A yields
in this case
P(A ∩ B)/P(B) = P(A)/P(B).
• Finally, the only other case is when A contains both outcomes that are in B and
outcomes that are not in B. In that case, following (i) we should expect that all
outcomes in A that are not in B get probability 0. Following (ii), the outcomes in
A that are also in B (which is exactly all outcomes in A ∩ B) should get a new
probability that consists of the old probability, i.e. P(A ∩ B), scaled up by a factor
1/P(B). So we would expect to find
0 + (1/P(B)) · P(A ∩ B) = P(A ∩ B)/P(B)
and reassuringly that’s exactly formula (3.1)!
Let us put this now in a neat definition:
Definition 3.1.1 (Conditional probability). Let B ⊆ Ω be an event with P(B) > 0. Take
any other event A ⊆ Ω. We use the notation
P(A | B)
for the (conditional) probability of A given B, or, more elaborately, the probability that
the event A occurs given that we know in advance that the event B occurs. This probability
is given by the formula:
P(A | B) = P(A ∩ B)/P(B).   (3.2)
Note that we can rewrite this formula in a form called the ‘multiplicative law’:
P(A ∩ B) = P(A | B)P(B).
(3.3)
Example 3.1.2. Let’s do the good old dice example one more time, so Ω = {1, 2, 3, 4, 5, 6}.
Define the events
A = {# dots is odd} = {1, 3, 5}
and
B = {# dots is at least 4} = {4, 5, 6}.
Then the probability of A given B is
P(A | B) = P(A ∩ B)/P(B) = P({5})/P({4, 5, 6}) = (1/6)/(3/6) = 1/3.
Maybe we could have guessed this result, namely since B consists of 3 (equally likely)
outcomes of which 2 are even and 1 is odd it makes sense that if we know in advance that
B occurs then the probability of an odd outcome is 1/3.
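Formula (3.2) is again straightforward to check mechanically; a minimal sketch in Python (not part of the original notes) for this dice example:

```python
from fractions import Fraction

omega = set(range(1, 7))            # fair dice: equally likely outcomes

def P(A):
    return Fraction(len(A), len(omega))

def cond(A, B):
    # Definition 3.1.1: P(A | B) = P(A intersect B) / P(B).
    return P(A & B) / P(B)

A = {1, 3, 5}      # number of dots is odd
B = {4, 5, 6}      # number of dots is at least 4
print(cond(A, B))  # 1/3
```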
Example 3.1.3. Two cards are taken from a deck of cards (with the usual 52 cards in it).
What is the probability that the first card is Queen of Spades and the second is the King
of Spades?
Note you could of course solve this using the techniques from previous chapter! If we
first do that we find that the event we’re interested in contains only 1 element while the
whole sample space Ω contains 52 · 51 = 2652 possible outcomes, so the required probability
is 1/2652.
We may also use our new technique of conditioning. Namely, define the events
B = {first card is Queen of Spades}
and
A = {second card is King of Spades}.
Then P(B) = 1/52 and P(A | B) = 1/51 (because if we know in advance that B occurs,
i.e. that the first card drawn was the Queen of Spades, then for the second draw there are
51 cards left of which the King of Spades is one). Hence by the 'multiplicative law' (3.3)
we get

P(A ∩ B) = P(A | B)P(B) = (1/51) · (1/52) = 1/(51 · 52) = 1/2652.

As A ∩ B = {first card is Queen of Spades and second card is King of Spades} we have
found again that the requested probability is 1/2652.
Remark 3.1.4. What we actually have done in (3.2) is define ourselves a mapping P̃
that assigns to any possible event A the number

P̃(A) := P(A | B) = P(A ∩ B)/P(B) ∈ [0, 1].

You can in fact verify that this mapping P̃ satisfies both conditions of Definition 1.3.2 from
Ch. 1, and hence P̃ is a probability measure. It is a probability measure 'concentrated' on
only the event B ⊆ Ω rather than on the whole sample space Ω, in the sense that P̃(B) = 1.
3.2 Independent events
Take an event B with P(B) > 0. If the probability of an event A does not alter at all
when we condition on B, i.e. when
P(A) = P(A | B)   (3.4)
we say that the event A is independent of the event B. If we plug the formula for P(A | B)
from (3.2) in (3.4) we see that
P(A) = P(A | B) ⇐⇒ P(A) = P(A ∩ B)/P(B) ⇐⇒ P(A)P(B) = P(A ∩ B).   (3.5)
We typically take the last equation above, i.e.
P(A)P(B) = P(A ∩ B)
as the definition of independence rather than (3.4). (As we saw in (3.5) they are equivalent
– at least as long as P(B) > 0.) The reason for this is the following. If P(A) > 0 we also
have
P(A)P(B) = P(A ∩ B) ⇐⇒ P(B) = P(A ∩ B)/P(A) = P(B ∩ A)/P(A) = P(B | A).
If we concentrate on the last equation, i.e. P(B) = P(B | A), we see that this one means
that B is independent of A. Indeed it tells us that the probability of B does not alter
when we condition on the event A.
So, for any events A and B with P(A) > 0 and P(B) > 0 we have that
A is independent of B ⇐⇒ P(A)P(B) = P(A ∩ B) ⇐⇒ B is independent of A.
In particular this tells us that independence can never happen in only one direction:
if A is independent of B then necessarily B is also independent of A (and vice versa)!
Therefore we simply say that two events A and B are independent. This means both A
is independent of B and B is independent of A.
Putting this in a nice definition:
Definition 3.2.1 (Independence). Two events A and B are said to be independent if it
holds that
P(A)P(B) = P(A ∩ B).
(3.6)
If P(A) > 0 and P(B) > 0 we have in particular
A and B are independent ⇐⇒ P(A) = P(A | B) ⇐⇒ P(B) = P(B | A).   (3.7)
Remark 3.2.2 (Disjoint vs independent: warning!!). A very common mistake is to confuse the concepts of independence and being disjoint. However they are really completely
different. By definition A and B are disjoint exactly if A ∩ B = ∅, cf. Definition 1.2.2 in
Ch. 1. This is clearly a very different definition than (3.6)!
In fact, if A and B are disjoint then P(A ∩ B) = P(∅) = 0. Hence for them to be
independent as well, equation (3.6) tells us that we would need to have P(A) = 0 or
P(B) = 0 and this is (of course) usually not the case.
Also intuitively we can understand that independence and being disjoint cannot both
be true. Two events are independent when the occurrence of the one event does not give
you any extra information about how likely it is that the other event will occur as well. If
the events are disjoint then you do have extra information: if the one event occurs then
you know for sure the other event won't occur, as they are disjoint and hence don't share
any outcomes!
Example 3.2.3. A fair coin is tossed twice. Consider the events A = {1st throw shows heads}
and B = {2nd throw shows tails}. Are A and B independent?
It is an experiment with equally likely outcomes and the sample space is Ω = {HH, TH, HT, TT}.
We have A = {HH, HT} and B = {HT, TT}, and hence A ∩ B = {HT}. We find

P(A) = 2/4 = 1/2,  P(B) = 2/4 = 1/2  and  P(A ∩ B) = 1/4,

and hence it indeed holds that

P(A ∩ B) = P(A)P(B)

as both sides equal 1/4; we therefore conclude that A and B are indeed independent.
Intuitively, this makes a lot of sense. A and B are independent when the occurrence of
the one event does not give you any extra information about the occurrence of the other.
Now A only concerns the outcome of the first throw and B only concerns the outcome of
the second throw, so these events do not give any information about each other.
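The independence test (3.6) is a one-line comparison once the probabilities are computed; a quick sketch in Python (not part of the original notes) for this example:

```python
from fractions import Fraction
from itertools import product

omega = set(product("HT", repeat=2))     # {HH, HT, TH, TT}

def P(A):
    return Fraction(len(A), len(omega))

A = {w for w in omega if w[0] == "H"}    # 1st throw shows heads
B = {w for w in omega if w[1] == "T"}    # 2nd throw shows tails

# (3.6): A and B are independent exactly if P(A)P(B) = P(A intersect B).
print(P(A) * P(B) == P(A & B))           # True
```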
3.3 Partitions
In the remaining part of this chapter we will use the concept of conditioning introduced
above together with the concept of ‘partitions’ to derive two very important results in
probability theory: the law of total probability and Bayes’ Theorem.
In this section we will first introduce the concept of partitions. The idea is very simple:
imagine your sample space as a sheet of paper. Take a pair of scissors and cut the sheet up into
different pieces. Then the collection of these pieces of paper is what we call a 'partition'
of the original sheet of paper. The two crucial properties are:
(i) the pieces do not overlap,
(ii) putting all the pieces together we find the whole sheet of paper back again.
The mathematical definition of a partition of a sample space (the ‘sheet of paper’)
follows this image exactly: we take a collection of events E1 , . . . , Em (the ‘pieces of paper’)
such that they do not overlap, i.e. Ei ∩Ej = ∅ for all i and j such that i 6= j, and such that
taking them all together we find the sample space back again, i.e. E1 ∪ E2 ∪ . . . ∪ Em = Ω.
Definition 3.3.1 (Partition). Consider a sample space Ω. A partition is a collection of
(non-empty) events E1 , . . . , Em for some m ≥ 2 that satisfies the following two conditions:
(i) Ei ∩ Ej = ∅ for all i = 1, . . . , m and j = 1, . . . , m with i ≠ j,
(ii) E1 ∪ E2 ∪ . . . ∪ Em = Ω.
Any collection of events that satisfies point (i) above is said to be mutually disjoint. So a
collection of events is a partition exactly if they are mutually disjoint and if their union
is the whole sample space.
A partition can also very well consist of infinitely many events; in this case the definition gets extended in the obvious way:
(i’) Ei ∩ Ej = ∅ for all i ≥ 1 and j ≥ 1 with i ≠ j,
(ii’) E1 ∪ E2 ∪ E3 ∪ . . . = Ω.
See also the two top diagrams in Figure 3.1.
Remark 3.3.2. If you think a bit about it, the following also holds true: a collection of
events E1 , . . . , Em is a partition exactly if any outcome ω in the sample space Ω belongs
to exactly one of the events E1 , . . . , Em .
In terms of the sheet of paper: every molecule of the sheet of paper ends up in exactly
one of the pieces (ignoring that in reality some will get stuck to the scissors etc...).
Example 3.3.3. Consider the sample space Ω = {ω1 , ω2 , ω3 }. Which of the following
collections of events are partitions?
a) E1 = {ω1 , ω2 }, E2 = {ω2 , ω3 }
b) E1 = {ω1 }, E2 = {ω3 }
c) E1 = {ω1 , ω2 }, E2 = {ω3 }.
For a), this is not a partition because E1 ∩ E2 = {ω2 } and this intersection should be empty according to Definition 3.3.1 (i). Or, another way to say the same, the element ω2 is in both E1 and E2 and this is not allowed (cf. Remark 3.3.2).
For b), this is not a partition either because E1 ∪ E2 = {ω1 , ω3 } and this union should be equal to Ω (Definition 3.3.1 (ii)). Using Remark 3.3.2 again, it is not a partition since ω2 is not an element of either E1 or E2 .
For c), yes this is a partition: we have E1 ∩ E2 = ∅ and E1 ∪ E2 = Ω so the conditions of Definition 3.3.1 are indeed satisfied. Alternatively, following Remark 3.3.2, indeed each of the possible outcomes ω1 , ω2 and ω3 is an element of either E1 or E2 (but not of both).
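For those who like to see this in code, here is a small Python sketch of Definition 3.3.1 applied to Example 3.3.3 (the helper name is_partition is made up for illustration):

def is_partition(events, omega):
    # Condition (i): the events are mutually disjoint.
    pairwise_disjoint = all(events[i].isdisjoint(events[j])
                            for i in range(len(events))
                            for j in range(i + 1, len(events)))
    # Condition (ii): together the events give back the whole sample space.
    covers_omega = set().union(*events) == omega
    return pairwise_disjoint and covers_omega

omega = {"w1", "w2", "w3"}
print(is_partition([{"w1", "w2"}, {"w2", "w3"}], omega))  # False: w2 is in both
print(is_partition([{"w1"}, {"w3"}], omega))              # False: w2 is not covered
print(is_partition([{"w1", "w2"}, {"w3"}], omega))        # True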
Remark 3.3.4. The simplest partition we can think of is the following: take any event B.
Then set E1 = B and E2 = B c . This is indeed a partition since E1 ∩ E2 = B ∩ B c = ∅ and
E1 ∪ E2 = B ∪ B c = Ω (as is obvious from the definition of complement, see Definition
1.2.1 in Ch. 1) so that the conditions of Definition 3.3.1 are indeed satisfied.
Remark 3.3.5. The number of partitions that are possible when the sample space consists
of n outcomes is called a Bell number. (We don't allow the empty set to be one of the events in a partition. If you allowed that, you could produce an infinite number of partitions for every sample space by just adding empty sets all the time.) For n = 2 it is 2, for n = 3
it is 5, for n = 4 it is 15, for n = 5 it is 52 and for n = 6 it is 203. So we see, as you
probably had guessed already, that the number of possible partitions grows very quickly as
the sample space gets larger.
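As an aside, the Bell numbers satisfy the recurrence B(n + 1) = ∑_{k=0}^{n} C(n, k) B(k), which makes them easy to compute. A short Python sketch of this recurrence (just an illustration, with B(0) = 1 by convention) reproduces the values quoted above:

from math import comb

def bell(n):
    # Bell numbers via the recurrence B(m + 1) = sum over k of C(m, k) * B(k).
    B = [1]  # B(0) = 1
    for m in range(n):
        B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
    return B[n]

print([bell(n) for n in range(2, 7)])  # [2, 5, 15, 52, 203]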
We conclude with two results concerning computing probabilities with partitions.
Result 3.3.6 (Additive law for mutually disjoint events). For any collection of mutually
disjoint events E1 , E2 , . . . , Em , i.e. events such that Ei ∩ Ej = ∅ for all i = 1, . . . , m and
j = 1, . . . , m with i ≠ j (cf. Definition 3.3.1), we have
P(E1 ∪ E2 ∪ . . . ∪ Em ) = P(E1 ) + P(E2 ) + . . . + P(Em ).
Proof. This is essentially an extension of the ‘additive law’ (cf. Result 1.4.2 in Ch. 1)
to more than two events which is intuitively obvious: the probability of the union of a
collection of events that do not overlap at all should simply be the sum of the probabilities
of each of these events. However for a truly sound mathematical proof you would need
to use the principle of induction (which I think I read somewhere you will be doing next
semester), so for now we will just be happy with the fact that it is an obvious result!
Result 3.3.7. Take any partition E1 , E2 , . . . , Em of a sample space Ω. Then we have that
P(E1 ) + P(E2 ) + . . . + P(Em ) = 1.
Proof. As E1 , E2 , . . . , Em forms a partition, we know that these events are mutually disjoint (cf. Definition 3.3.1). Hence by the ‘additive law for mutually disjoint events’ above
we have that
P(E1 ) + P(E2 ) + . . . + P(Em ) = P(E1 ∪ E2 ∪ . . . ∪ Em ).    (3.8)
Furthermore, we also know from the definition of a partition that E1 ∪ E2 ∪ . . . ∪ Em = Ω
and we have by definition that P(Ω) = 1 (cf. Definition 1.3.2). So, plugging these two
facts into (3.8) it indeed follows that
P(E1 ) + P(E2 ) + . . . + P(Em ) = P(E1 ∪ E2 ∪ . . . ∪ Em ) = P(Ω) = 1.
3.4 The law of total probability
Let us now combine the two ingredients, namely conditional probabilities and partitions,
into a formula known as the ‘law of total probability’. Take any event A and a partition
consisting of the events E1 , . . . , Em . Then we have the following:
A = (A ∩ E1 ) ∪ (A ∩ E2 ) ∪ . . . ∪ (A ∩ Em ).    (3.9)
A way to see this is as follows. Take your sheet of paper again, and colour somewhere on
the sheet a shape to represent your event A in a sample space. Then cut your piece of
paper into m pieces. Each of the pieces of paper contains part of the coloured shape (or,
of course, there may be none of it on some of the pieces). The effect is that we have cut
A up into m different parts, namely the parts on each of the pieces of paper. Hence we
could write something like:
A = Part of A on piece 1 + Part of A on piece 2 + . . . + Part of A on piece m
which is exactly what (3.9) is saying! Namely, translating this back to the sample space,
we break the event A up into different parts by considering the parts A ∩ E1 (all outcomes
in A that are also in E1 ), A ∩ E2 (all outcomes in A that are also in E2 ), ..., A ∩ Em (all
outcomes in A that are also in Em ) and then formula (3.9) is saying nothing but ”we can
find the event A back by collecting all the parts together”. See also Figure 3.1.
Remark 3.4.1 (Warning!). For equation (3.9) to hold it is crucial that the events E1 , . . . , Em
form a partition of the sample space. If that’s not true, so if the events are not mutually
disjoint and/or their union is not the whole sample space, then in general equation (3.9)
will not be true!
The law of total probability now follows quite easily from (3.9). First note that the
events A ∩ E1 , A ∩ E2 , . . . , A ∩ Em are mutually disjoint (recall Definition 3.3.1). Indeed,
for any i = 1, . . . , m and j = 1, . . . , m with i ≠ j we have that
(A ∩ Ei ) ∩ (A ∩ Ej ) = A ∩ Ei ∩ A ∩ Ej = A ∩ A ∩ Ei ∩ Ej = A ∩ Ei ∩ Ej = A ∩ ∅ = ∅
(note that in the first three equalities above nothing special happens, just working things
out a bit and using that it doesn’t matter in which order you take intersections. Then we
use that as the events E1 , . . . , Em form a partition they are mutually disjoint, in particular
Ei ∩ Ej = ∅ for the values of i and j we have chosen here).
Having established that the events A ∩ E1 , A ∩ E2 , . . . , A ∩ Em are mutually disjoint,
we get from (3.9) and the ‘additive law for mutually disjoint events’ (cf. Result 3.3.6) that
Figure 3.1: Left top: a grey square. Oh no, a sample space. Right top: the sample space
‘cut up’ into four parts, thereby forming a partition consisting of the events E1 , E2 , E3
and E4 . Left bottom: an event A in the sample space. Right bottom: as the sample space
gets ‘cut up’ into the same four events E1 , . . . , E4 , each of them contains a portion of the
event A: E1 contains the portion A ∩ E1 , E2 contains the portion A ∩ E2 etc.
P(A) = P((A ∩ E1 ) ∪ (A ∩ E2 ) ∪ . . . ∪ (A ∩ Em ))
= P(A ∩ E1 ) + P(A ∩ E2 ) + . . . + P(A ∩ Em ).

Finally, using the 'multiplicative law' (cf. Definition 3.1.1) we may also write this as

P(A) = P(A | E1 )P(E1 ) + P(A | E2 )P(E2 ) + . . . + P(A | Em )P(Em )    (3.10)
and this is exactly the formula known as the ‘law of total probability’. It may at first
seem a bit strange that it would be somehow useful to express the very short and simple
expression P(A) as the much more complicated right hand side of (3.10), but there are
plenty of examples where the easiest (or even only) way to compute P(A) is by using this
formula!
Recall from Remark 3.3.4 that the simplest partition consists of two events only, and
is obtained by taking any event B and setting E1 = B and E2 = B c . If we take this
particular partition, then formula (3.10) simplifies to
P(A) = P(A | B)P(B) + P(A | B c )P(B c ).
So:
Result 3.4.2 (Law of total probability). Take a partition of the sample space consisting
of the events E1 , . . . , Em . Then we have for any event A the law of total probability:
P(A) = P(A | E1 )P(E1 ) + P(A | E2 )P(E2 ) + . . . + P(A | Em )P(Em ).    (3.11)

In the special case the partition consists of only two elements, so E1 = B and E2 = B c for some event B (cf. Remark 3.3.4), the formula becomes

P(A) = P(A | B)P(B) + P(A | B c )P(B c ).    (3.12)
Example 3.4.3. A company buys tyres from two suppliers, say supplier 1 and 2. Suppose that 40% of all tyres come from supplier 1, and that the rate of defectives is 10% for supplier 1 and 5% for supplier 2. What is the probability that a randomly chosen tyre is defective?
Define the event A = {tyre is defective}, so we want to compute P(A). There is no 'direct' way to do this, since the rate of defectives is different for the two suppliers. So we need to 'split' into the different cases that the tyre comes from supplier 1 resp. supplier 2. For this, define the event B = {tyre comes from supplier 1}. Note that B c = {tyre does not come from supplier 1} = {tyre comes from supplier 2}.
Now, since 40% of all tyres come from supplier 1 we know that P(B) = 0.4. Hence by
the ‘complementary law’ P(B c ) = 0.6. (Or use that B c = {tyre comes from supplier 2}
and since 40% comes from supplier 1, it must be that 60% comes from supplier 2 and
hence P(B c ) = 0.6.) Also, using what we know about the rates of defectives we see that
P(A | B) = P({tyre is defective} | {tyre comes from supplier 1}) = 0.1
and
P(A | B c ) = P({tyre is defective} | {tyre comes from supplier 2}) = 0.05
The law of total probability, with the partition E1 = B and E2 = B c i.e. formula
(3.12), now yields
P(A) = P(A | B)P(B) + P(A | B c )P(B c ) = 0.1 · 0.4 + 0.05 · 0.6 = 0.07.
In the above solution we haven't specified the sample space. It is very well possible to do so if you like, for instance as follows. Of all tyres we are interested in two properties: who the supplier was and whether or not the tyre was defective. Hence we might write the outcome of the experiment of randomly choosing a tyre as (D, 1) (defective and supplier was 1), (D, 2) (defective and supplier was 2), (¬D, 1) (not defective and supplier was 1) and (¬D, 2) (not defective and supplier was 2). The sample space is then Ω = {(D, 1), (D, 2), (¬D, 1), (¬D, 2)}.
(Note that this is not an experiment with equally likely outcomes and hence we can't conclude that each of the four outcomes has probability 1/4 of occurring!)
The events A, B and B c above can then be written as A = {(D, 1), (D, 2)}, B = {(D, 1), (¬D, 1)} and B c = {(D, 2), (¬D, 2)}. The rest of the solution works exactly the same (of course).
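As a quick numerical sanity check of the solution above, formula (3.12) is a one-liner in Python (the variable names are just illustrative):

# Law of total probability for the tyre example, cf. formula (3.12).
p_B    = 0.4          # P(B): tyre comes from supplier 1
p_Bc   = 1 - p_B      # P(B^c): tyre comes from supplier 2
p_A_B  = 0.10         # P(A | B): defective given supplier 1
p_A_Bc = 0.05         # P(A | B^c): defective given supplier 2

p_A = p_A_B * p_B + p_A_Bc * p_Bc
print(p_A)            # 0.07 (up to floating point rounding)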
Example 3.4.4. Suppose that in a semiconductor manufacturing plant the probability of
chip failure depends on the level of contamination.
If the level of contamination is high then the probability of failure is 0.1. If the level of contamination is medium then the probability of failure is 0.01. Finally, if the level of contamination is low then the probability of failure is 0.001. Every chip is exposed to one level of contamination only.
Suppose that in a particular production run 20% of the chips are subjected to high
levels of contamination, 30% to medium levels and 50% to low levels of contamination.
What is the probability a randomly selected chip fails?
We need to compute the probability of the event A = {the chip fails}. Again we cannot do
this directly as it depends on the level of contamination the chip has been exposed to. Therefore, define the events E1 = {contamination is high}, E2 = {contamination is medium}
and E3 = {contamination is low}. Note that E1 , E2 and E3 together form a partition of the
sample space. Indeed, following Remark 3.3.2, every chip we can randomly select must be
a member of exactly one of these events (as we assumed that every chip is exposed to one
level of contamination only).
We are given that P(E1 ) = 0.2, P(E2 ) = 0.3 and P(E3 ) = 0.5. Furthermore we are
given that P(A | E1 ) = 0.1, P(A | E2 ) = 0.01 and P(A | E3 ) = 0.001. Hence applying the
law of total probability (3.11) we find
P(A) = P(A | E1 )P(E1 ) + P (A | E2 )P(E2 ) + P(A | E3 )P(E3 )
= 0.1 · 0.2 + 0.01 · 0.3 + 0.001 · 0.5 = 0.0235.
Analogous to what we did above, again we could also in this case be more precise and define a probability space. Any chip is either defective (D) or not (¬D), and was subjected to a contamination level H, M or L. This means the sample space can be written as Ω = {(D, H), (D, M ), (D, L), (¬D, H), (¬D, M ), (¬D, L)}. The events above now become A = {(D, H), (D, M ), (D, L)}, E1 = {(D, H), (¬D, H)}, E2 = {(D, M ), (¬D, M )} and E3 = {(D, L), (¬D, L)}. In particular, you could now check in a more rigorous way that the events E1 , E2 , E3 indeed form a partition, i.e. satisfy the conditions of Definition 3.3.1!
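The same computation can be written once and for all as a small Python helper for formula (3.11) (the function name total_probability is made up for illustration):

def total_probability(p_E, p_A_given_E):
    # Formula (3.11): P(A) = sum over i of P(A | Ei) * P(Ei).
    return sum(pa * pe for pa, pe in zip(p_A_given_E, p_E))

p_E         = [0.2, 0.3, 0.5]      # P(E1), P(E2), P(E3)
p_A_given_E = [0.1, 0.01, 0.001]   # P(A | E1), P(A | E2), P(A | E3)

print(total_probability(p_E, p_A_given_E))   # 0.0235 (up to rounding)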
3.5 Bayes' Theorem
Finally in this chapter we look at Bayes’ Theorem∗ . This is quite a famous one, and used
in several ways in both probability and statistics. It is quite easy to prove, using the law
of total probability from the previous section.
Take any event A with P(A) > 0, and a partition consisting of the events E1 , . . . , Em .
Take any i = 1, 2, . . . , m. Then we have the following:
P(Ei | A) = P(Ei ∩ A) / P(A) = P(A | Ei )P(Ei ) / P(A)
= P(A | Ei )P(Ei ) / [P(A | E1 )P(E1 ) + P(A | E2 )P(E2 ) + . . . + P(A | Em )P(Em )].
Note that the first equality uses the definition of conditional probability (cf. Definition
3.1.1), the second uses the ‘multiplicative law’ (cf. Definition 3.1.1) and the final equality
uses the law of total probability (cf. Result 3.4.2).
The equality between the ultimate left hand and right hand sides is called Bayes’
Theorem:
Result 3.5.1 (Bayes’ Theorem). Let A be an event with P(A) > 0. Take a partition of the
sample space consisting of the events E1 , . . . , Em . Then we have for any i = 1, 2, . . . , m:
P(Ei | A) = P(A | Ei )P(Ei ) / [P(A | E1 )P(E1 ) + P(A | E2 )P(E2 ) + . . . + P(A | Em )P(Em )].    (3.13)

In particular, if the partition consists of two events only, i.e. E1 = B and E2 = B c for some event B (cf. Remark 3.3.4), then this formula boils down to

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B c )P(B c )]    (3.14)

(if we take i = 1 in the above formula) and

P(B c | A) = P(A | B c )P(B c ) / [P(A | B)P(B) + P(A | B c )P(B c )]    (3.15)

(if we take i = 2 in the above formula).
Generally speaking, in our setting Bayes’ Theorem is usually a good choice if you
have to find the probability of some event A given an event B, and you only know the
probability of B given A instead. That is, it helps to ‘invert’ the events in a conditional
probability.
Before we move on to an example, let us first state that the 'complementary law' also holds for conditional probabilities, often a very useful result when working with Bayes' Theorem (the proof is left as an exercise):
∗ Named after the English mathematician Thomas Bayes.
Result 3.5.2 (Complementary law for conditional probabilities). In a sample space Ω,
let A and B be any events such that P(B) > 0. Then we have that
P(A | B) = 1 − P(Ac | B).
Example 3.5.3. A new medical test is designed to identify a rare illness. If somebody has the disease then the test correctly identifies this with probability 0.99. If somebody does not have the disease then the test correctly identifies this with probability 0.95.
Suppose that the probability of having the illness is 0.0001. Given that for a randomly selected person the test is positive, what is the probability this person indeed has the disease?
For the solution, define the events A = {test is positive} and B = {person has disease}. Then we are asked to compute P(B | A). We are given that P(A | B) = 0.99, P(Ac | B c ) = 0.95 and P(B) = 0.0001. This is a situation in which we know P(A | B) and are asked to compute P(B | A). As pointed out above, Bayes' Theorem is then usually helpful. Formula
(3.14) reads:
P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | B c )P(B c )].
We see that in order to use this, in addition to what we know already we need to find P(B c ) and P(A | B c ). For the former we can simply use the 'complementary law' (cf. Result 1.4.2 in Ch. 1) to find P(B c ) = 1 − 0.0001 = 0.9999. For the latter we use the above Result 3.5.2 to derive

P(A | B c ) = 1 − P(Ac | B c ) = 1 − 0.95 = 0.05.
Plugging all this into the formula we find

P(B | A) = (0.99 · 0.0001) / (0.99 · 0.0001 + 0.05 · 0.9999) ≈ 0.002.

If you think about it, would you be happy when your GP uses a test for a serious disease on you with the characteristic that when the test is positive there is only a 0.002 probability you actually do have the disease? Nah. What is most intriguing about this example is the following. If the probability that a randomly selected person has the disease increases from the 0.0001 we used above to 0.1, then you can verify that P(B | A) changes from approx 0.002 to approx 0.69. That makes the test suddenly seem a lot more sensible, however the test itself hasn't changed, only the probability a person has the disease. So, Bayes' Theorem shows us that the quality of the test (as perceived by its users) does not only depend on how well the test is able to detect the disease but also on how large the percentage of the population having the disease is.
Finally, if you’d like you could again work with a sample space as well. In this case,
what matters about the selected person is whether they have the disease (D) or not (H),
and whether the test is positive (P ) or not (N ). So we could write the sample space as
Ω = {(D, P ), (D, N ), (H, P ), (H, N )} and express all events as subsets of this sample
space.
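If you want to play with the numbers in this example, here is a small Python sketch of formula (3.14) (the name posterior is just illustrative); it reproduces both the prior 0.0001 case and the prior 0.1 case discussed above:

def posterior(p_B, p_A_given_B, p_A_given_Bc):
    # Bayes' Theorem, formula (3.14): P(B | A).
    p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)
    return p_A_given_B * p_B / p_A

print(posterior(0.0001, 0.99, 0.05))   # approx 0.002
print(posterior(0.1,    0.99, 0.05))   # approx 0.69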
Chapter 4
Random variables
So far we have introduced three fundamental concepts in probability theory: sample
spaces, events and probability measures. In this and the next (final) chapter we discuss
the last fundamental concept, namely random variables.
4.1 What are random variables?
The idea is that often you are not only interested in the outcome of the experiment itself but (also) in some consequence that outcome has. For instance, in a betting game
with dice you are typically not (only) interested in the outcome of the throw(s) but more
in your overall win or loss. Every outcome of the experiment yields a certain win or loss,
and hence we can analyse the overall win or loss by introducing a mapping that assigns
to every outcome the associated win or loss should that outcome indeed occur. That is
to say, we are talking about a mapping from the sample space to (say) the real numbers.
That is exactly what a random variable in general is:
Definition 4.1.1 (Random variables). A random variable is a mapping from the sample
space Ω to the real numbers R, usually denoted by X. That is, a random variable is a
mapping
X:Ω→R
so that every outcome ω ∈ Ω is assigned a real number X(ω) ∈ R. We denote by R ⊆ R
the range of X, i.e. the values X can actually attain. (Sometimes the term ‘image’ is used
rather than ‘range’).
Remark 4.1.2. In our case, where the sample space is countable – recall the discussion
at the beginning of Ch. 1 – the range R will always be a countable subset of R. In general,
when this is the case X is called a discrete random variable. There are other types of
random variables as well – which could occur if the sample space is not countable – but
this is outside the scope of this course.
Remark 4.1.3. A random variable is in principle nothing more (or less) than a function with Ω as its domain. It is important to realise the following though. If we think back to our setup, which consists of an experiment with outcomes in Ω, then as long as the experiment is not yet performed we don't yet know what the outcome will be, and hence we also don't yet know what value the random variable will take! The random variable 'inherits' the uncertainty about the outcome of the experiment in the sense that it is uncertain what value it will take. In particular this also means that we can talk about the probability that a random variable takes a certain value etc. Hence the 'random' in the name! ;).
Example 4.1.4. Consider a betting game in which a coin is tossed. When tails appears you receive £1 while if heads appears you have to pay £1. Clearly, more than in the outcome of the experiment itself, i.e. heads or tails, you are interested in the financial consequences.
With Ω = {H, T }, if we define the mapping X : Ω → R by setting
X(T ) := 1 and X(H) := −1    (4.1)
then X represents the win you will make by playing this game. In this case the range of
X is R = {−1, 1}.
Note that since we don’t know what outcome the experiment will have we also don’t
know which value the random variable X will take, as this value depends on the outcome.
We only know (for sure) which value X will take after we have performed the experiment
and know the outcome so that we can apply the rule (4.1).
Example 4.1.5. Consider an experiment in which you roll three dice and you are interested in the sum of the dots of the three dice. We can see every outcome as a triple of
numbers, each between 1 and 6. Hence, we may write
Ω = {(k1 , k2 , k3 ) | ki ∈ {1, 2, 3, 4, 5, 6} for i = 1, 2, 3}.
The random variable X giving us the sum of the dots of the three dice is then defined as
the mapping X : Ω → R given by
X((k1 , k2 , k3 )) := k1 + k2 + k3
for every (k1 , k2 , k3 ) ∈ Ω.
Note that the possible values X can take make up the set {3, 4, . . . , 18} and hence the
range of X is R = {3, 4, . . . , 18}.
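Since a random variable is literally just a function on the sample space, it is straightforward to write Example 4.1.5 down in Python (a sketch; omega and X are illustrative names):

from itertools import product

# The sample space: all 6^3 = 216 triples of dice values.
omega = list(product([1, 2, 3, 4, 5, 6], repeat=3))

def X(outcome):
    # The random variable: the sum of the dots of the three dice.
    k1, k2, k3 = outcome
    return k1 + k2 + k3

R = sorted({X(w) for w in omega})
print(R)   # [3, 4, 5, ..., 18], the range of X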
4.2 Computing probabilities with random variables
When we have specified a probability measure P for our experiment then we know how
likely each possible outcome of our experiment is to occur. This carries over to random
variables: as the value a random variable turns out to take is determined by the outcome
of the experiment, and each possible outcome occurs with a certain probability, also every
value in the range of the random variable occurs with a certain probability!
The key point to understand here is the following. Recall that X is a mapping from Ω
to R ⊆ R. Take some possible value r ∈ R. Then X takes this value r if and only if the
experiment yields an outcome ω ∈ Ω satisfying X(ω) = r. That is to say,
X takes the value r
⇐⇒
the outcome ω of the experiment satisfies X(ω) = r.
(Of course, there can be only one or there can be multiple outcomes ω such that X(ω) = r).
This means that the probability that X takes the value r should be equal to the probability
of the event {ω ∈ Ω | X(ω) = r} ⊆ Ω, which is the event consisting of all those outcomes
that are mapped to the value r by X:
the probability that X takes the value r = P({ω ∈ Ω | X(ω) = r}).
Remark 4.2.1. As probabilists are a bit lazy and prefer to write things as short as
possible we use shorthands for expressing events and probabilities that involve random
variables.
For instance, when we write ‘{X = r}’ we actually mean the event {ω ∈ Ω | X(ω) =
r}, i.e. the event consisting of all outcomes that are mapped to the value r by the
random variable X. Also the equal sign might be some other operator, for instance when we write '{X < 2}' we actually mean the event {ω ∈ Ω | X(ω) < 2} and when we write '{X ≠ 4}' we actually mean the event {ω ∈ Ω | X(ω) ≠ 4}.
We do something similar when it comes to notation for computing probabilities
with random variables. For instance, we write ‘P(X = 2)’ when we actually mean
P({ω ∈ Ω | X(ω) = 2}), ‘P(X > 10)’ when we actually mean P({ω ∈ Ω | X(ω) > 10})
etc. etc.
In this section we will write out those shorthands in full for your convenience,
but later in the chapter we will use the shorthand only. Of course, when you find
it confusing, try to translate the shorthand to the full form first and then do the
computation!
Next up are some examples to illustrate how to compute simple probabilities involving
random variables.
Example 4.2.2. Consider an experiment with sample space Ω = {ω1 , ω2 , ω3 , ω4 }. Suppose
that
P({ω1 }) = P({ω2 }) = 1/3   and   P({ω3 }) = P({ω4 }) = 1/6.
Figure 4.1: The situation in Example 4.2.2: we have an experiment that can yield any of
the outcomes from the sample space Ω = {ω1 , . . . , ω4 }, and a random variable X that
assigns to each possible outcome a value from its range R = {0, 1, 2}
Define the random variable X by setting X(ω1 ) = 0, X(ω2 ) = 1, X(ω3 ) = X(ω4 ) = 2.
What is the range R of X? Compute each of the probabilities P(X = 0), P(X = 2)
and P(X < 2). See also Figure 4.1.
We have R = {0, 1, 2}, as these are the only three values X can take. For computing
the requested probabilities we find:
P(X = 0) =(∗) P({ω ∈ Ω | X(ω) = 0}) =(∗∗) P({ω1 }) = 1/3,

P(X = 2) =(∗) P({ω ∈ Ω | X(ω) = 2}) =(∗∗) P({ω3 , ω4 }) =(∗∗∗) P({ω3 }) + P({ω4 }) = 1/3

and

P(X < 2) =(∗) P({ω ∈ Ω | X(ω) < 2}) =(∗∗) P({ω1 , ω2 }) =(∗∗∗) P({ω1 }) + P({ω2 }) = 2/3.
Note that in every computation, in step (*) we write out the shorthand, in step (**) we
investigate which outcomes are in the event we are interested in by looking at the definition
of X, and finally in step (***) we apply the ‘additive law’.
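Steps (*)-(***) are exactly what a brute-force computation does in code. Here is a minimal Python sketch of Example 4.2.2 (the names prob, X and P_X are illustrative only):

from fractions import Fraction

prob = {"w1": Fraction(1, 3), "w2": Fraction(1, 3),
        "w3": Fraction(1, 6), "w4": Fraction(1, 6)}
X = {"w1": 0, "w2": 1, "w3": 2, "w4": 2}

def P_X(condition):
    # Sum P({w}) over all outcomes w whose value X(w) satisfies the condition.
    return sum(p for w, p in prob.items() if condition(X[w]))

print(P_X(lambda r: r == 0))   # 1/3
print(P_X(lambda r: r == 2))   # 1/3
print(P_X(lambda r: r < 2))    # 2/3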
Example 4.2.3 (Example 4.1.4 cont’d). Consider again the experiment and random
variable introduced in Example 4.1.4. Let us assume the coin is fair so that we have
P({H}) = 1/2 and P({T }) = 1/2. The random variable X has range R = {−1, 1}. Compute both P(X = −1) and P(X = 1).
Following the same principle as in the above example, we find
P(X = −1) = P({ω ∈ Ω | X(ω) = −1}) = P({H}) = 1/2

and similarly

P(X = 1) = P({ω ∈ Ω | X(ω) = 1}) = P({T }) = 1/2.
Example 4.2.4 (Example 4.1.5 cont’d). Consider again the experiment and random
variable introduced in Example 4.1.5. Assuming the three dice are fair this is an experiment
with equally likely outcomes and hence (recall Chapter 2) we have for any event A that
P(A) = #A / #Ω    (4.2)

where #Ω = 6³ = 216. Compute the probability that X is equal to 3 and the probability
that X is at most 4.
For the probability that X is equal to 3 we get
P(X = 3) = P({ω ∈ Ω | X(ω) = 3}) = P({(1, 1, 1)}) = 1/216 (using (4.2))

and for the probability that X is at most 4 we get

P(X ≤ 4) = P({ω ∈ Ω | X(ω) ≤ 4}) = P({(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)}) = 4/216 = 1/54 (using (4.2)).
We conclude this section with an important observation. Consider any random variable
X and let us write its range in a generic form: R = {r1 , r2 , . . . , rm }. Define for i = 1, . . . , m
the events Ei = {X = ri } = {ω ∈ Ω | X(ω) = ri }. Take any outcome ω ∈ Ω. Then we
know that X maps ω to some value in its range, so ω must be an element of exactly one
of the events E1 , E2 , . . . , Em . But this means, according to Remark 3.3.2 in Ch. 3, that
the events E1 , E2 , . . . , Em form a partition of the sample space!
This has some very useful consequences. For instance, Result 3.3.7 in Ch. 3 tells us
that this implies that
P(E1 ) + P(E2 ) + . . . + P(Em ) = P(X = r1 ) + P(X = r2 ) + . . . + P(X = rm ) = 1,
or, using the sum notation:
∑_{i=1}^{m} P(X = ri ) = 1.
Figure 4.2: The situation in Example 4.2.2: a partition of the sample space Ω =
{ω1 , . . . , ω4 } consisting of the events E1 , E2 , E3 is generated by the random variable X
with range R = {0, 1, 2} by setting E1 = {ω ∈ Ω | X(ω) = 0} = {ω1 }, E2 = {ω ∈
Ω | X(ω) = 1} = {ω2 }, E3 = {ω ∈ Ω | X(ω) = 2} = {ω3 , ω4 }
Result 4.2.5. Take any random variable X with range R = {r1 , r2 , . . . , rm }. Define the
collection of events E1 , E2 , . . . , Em as
Ei = {X = ri } = {ω ∈ Ω | X(ω) = ri }
for i = 1, . . . , m.
Then the collection E1 , E2 , . . . , Em is a partition of the sample space. One consequence of
this fact is that
∑_{i=1}^{m} P(X = ri ) = P(X = r1 ) + P(X = r2 ) + . . . + P(X = rm ) = 1.
This result also holds true if the range of X has infinitely many elements∗ , say R =
{r1 , r2 , . . .}. Then we get infinitely many events in our partition, i.e. Ei = {X = ri } for
all i ≥ 1, and we have
∑_{i=1}^{∞} P(X = ri ) = P(X = r1 ) + P(X = r2 ) + . . . = 1.

∗ As mentioned before, we're always assuming it is at most countable!
Example 4.2.6 (Example 4.2.2 cont’d). In Example 4.2.2 we had the sample space Ω =
{ω1 , . . . , ω4 } and a random variable X with range R = {0, 1, 2} defined as X(ω1 ) = 0,
X(ω2 ) = 1, X(ω3 ) = X(ω4 ) = 2.
Following the above Result 4.2.5 we define the events E1 = {ω ∈ Ω | X(ω) = 0},
E2 = {ω ∈ Ω | X(ω) = 1} and E3 = {ω ∈ Ω | X(ω) = 2}. According to Result 4.2.5 these
three events should form a partition of the sample space Ω. Indeed, using the definition of
X we find
E1 = {ω ∈ Ω | X(ω) = 0} = {ω1 },
E2 = {ω ∈ Ω | X(ω) = 1} = {ω2 }
and E3 = {ω ∈ Ω | X(ω) = 2} = {ω3 , ω4 },
which is indeed a partition of Ω. See also Figure 4.2.
Result 4.2.5 also tells us that we should have P(X = 0) + P(X = 1) + P(X = 2) = 1.
We can indeed verify that this is true. We had already computed that P(X = 0) = 1/3 and
P(X = 2) = 1/3. To compute the last one you can verify that P(X = 1) = P({ω2 }) = 1/3
and we conclude that indeed
P(X = 0) + P(X = 1) + P(X = 2) = 1/3 + 1/3 + 1/3 = 1.
Example 4.2.7. Let Ω be some sample space and let X be a random variable of which
we know that its range is R = {1, 2, 3, 4, 5} and that P(X = 1) = P(X = 2) = P(X =
4) = P(X = 5) = 1/6. Compute P(X = 3). Also compute P(X ≥ 4) and P(X < 3.7).
First note that we know the probabilities for all values in the range R, with the exception of 3. We know from Result 4.2.5 that the sum of all probabilities must equal 1, that
is
P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) = 1.
Plugging in the probabilities we are given we can deduce
P(X = 3) = 1 − P(X = 1) − P(X = 2) − P(X = 4) − P(X = 5) = 1 − 4/6 = 1/3.
Next we compute P(X ≥ 4). For this, let us first write out the shorthand to realise
that we are looking at the event {ω ∈ Ω | X(ω) ≥ 4}. Note that we know nothing about
the sample space at all! So we can’t figure out which outcomes are in this event. But we
don’t need to. The first thing to realise is that since the range of X is {1, 2, 3, 4, 5}, the
outcomes ω for which X(ω) ≥ 4 must be exactly those outcomes for which either X(ω) = 4
or X(ω) = 5. That is to say, we have
{ω ∈ Ω | X(ω) ≥ 4} = {ω ∈ Ω | X(ω) = 4} ∪ {ω ∈ Ω | X(ω) = 5}
53
54
Notes for COMP11120 (Probability part)
and hence we have that
P({ω ∈ Ω | X(ω) ≥ 4}) = P({ω ∈ Ω | X(ω) = 4} ∪ {ω ∈ Ω | X(ω) = 5}),
or, in shorthand:
P(X ≥ 4) = P({X = 4} ∪ {X = 5}).
(4.3)
From Result 4.2.5 we know that the events {X = 1}, {X = 2}, . . . , {X = 5} form a
partition of the sample space. In particular that means that they are mutually disjoint
(recall Definition 3.3.1 in Ch. 3), which implies that {X = 4} and {X = 5} are disjoint,
i.e. {X = 4} ∩ {X = 5} = ∅. But that means that we can apply the ‘additive law’ from
Ch. 1 to (4.3) to derive that
P(X ≥ 4) = P({X = 4}) + P({X = 5}) = 1/6 + 1/6 = 1/3.
Finally, to compute P(X < 3.7) we proceed in exactly the same way. First we note that
since X has range {1, 2, 3, 4, 5}, any outcome in the event {ω ∈ Ω | X(ω) < 3.7} must be in one of the events {X = 1}, {X = 2} or {X = 3}. That is to say:
{X < 3.7} = {X = 1} ∪ {X = 2} ∪ {X = 3}.
Again we have that the three events in the above right hand side are mutually disjoint by
Result 4.2.5 and hence the ‘additive law’ from Ch. 1 yields
P(X < 3.7) = P(X = 1) + P(X = 2) + P(X = 3) = 1/6 + 1/6 + 1/3 = 2/3.
Remark 4.2.8. In the above Example 4.2.7 something very new has happened: we have
considered a situation in which we did not at all specify what the experiment is, nor what
the sample space looks like, and hence neither what the probability of each of the outcomes
in the sample space is. Instead we only specified the random variable, its range and the
probabilities for each of the values in the range of X. We will do this more often in the
remaining part of this course: when dealing with random variables we will often leave the
experiment and sample space unspecified – just because we don’t need to specify them to
do the computations we need to do!
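In code this point of view is very natural: we simply store the pmf and never mention the sample space. Here is a Python sketch of Example 4.2.7 (illustrative only):

from fractions import Fraction

# Only the range and the probabilities of X are specified, nothing else.
pmf = {1: Fraction(1, 6), 2: Fraction(1, 6), 3: Fraction(1, 3),
       4: Fraction(1, 6), 5: Fraction(1, 6)}

assert sum(pmf.values()) == 1   # Result 4.2.5

print(sum(p for r, p in pmf.items() if r >= 4))    # P(X >= 4)  = 1/3
print(sum(p for r, p in pmf.items() if r < 3.7))   # P(X < 3.7) = 2/3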
4.3 Constructing new random variables from existing ones
If we have a random variable X, i.e. a mapping X : Ω → R, and a function f : R → R, we can look at the composition of these two mappings to create one new mapping:
Ω —X→ R —f→ R    (first apply X, then f ; the composition is the mapping Y )
That is to say, the mapping Y : Ω → R is defined as
Y (ω) = f (X(ω)) for all ω ∈ Ω.    (4.4)
This is also a random variable, as it is simply a mapping from Ω to R (cf. Definition
4.1.1). Of course, the range of Y may be different from the range of X!
Example 4.3.1. Take the sample space Ω = {ω1 , ω2 , ω3 } and suppose that all outcomes
are equally likely to occur. Define the random variable X by X(ω1 ) = −1, X(ω2 ) = 0 and
X(ω3 ) = 1. Then X has range RX = {−1, 0, 1} (we use the subscript ‘X’ to specify it is
the range of X), and as we did in the previous subsection we may compute
P(X = −1) = P({ω1 }) = 1/3,   P(X = 0) = P({ω2 }) = 1/3   and   P(X = 1) = P({ω3 }) = 1/3.
Now, define the random variable Y by setting Y (ω) = X(ω)2 for all ω ∈ Ω. That is,
we use (4.4) with the function f (x) = x2 . Then we find
Y (ω1 ) = X(ω1 )2 = (−1)2 = 1,
Y (ω2 ) = X(ω2 )2 = 02 = 0 and
Y (ω3 ) = X(ω3 )2 = 12 = 1.
From this we deduce that the range of Y is RY = {0, 1} and we may compute
P(Y = 0) = P({ω2 }) = 1/3   and   P(Y = 1) = P({ω1 , ω3 }) = 2/3.

(Note that we have P(X = −1) + P(X = 0) + P(X = 1) = 1 and P(Y = 0) + P(Y = 1) = 1, as should be the case according to Result 4.2.5).
Remark 4.3.2. Another case of laziness of notation: we normally simply write 'Y = X 2 '
when we actually mean ‘the random variable Y defined by Y (ω) = X(ω)2 for all ω ∈ Ω’.
Just as another example, we write Y = 2X 2 − 3X + 10 just to mean that Y is given
by Y (ω) = 2X(ω)2 − 3X(ω) + 10 for all ω ∈ Ω – which is (4.4) with the function
f (x) = 2x2 − 3x + 10.
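Constructing Y = f (X) is also easy to mirror in code: values r with the same image f (r) pool their probabilities. A Python sketch for Example 4.3.1 (the helper name pmf_of_function is made up for illustration):

from collections import defaultdict
from fractions import Fraction

pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def pmf_of_function(pmf, f):
    # Values r with the same f(r) get their probabilities added together.
    out = defaultdict(Fraction)
    for r, p in pmf.items():
        out[f(r)] += p
    return dict(out)

print(pmf_of_function(pmf_X, lambda x: x ** 2))
# {1: Fraction(2, 3), 0: Fraction(1, 3)}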
4.4 Probability mass functions (pmf's)
Every† random variable X has a ‘probability mass function’, usually abbreviated to ‘pmf’.
This is simply a function, usually denoted by p, containing all the information concerning
the range R of X and the associated probabilities:
† Discrete, see Remark 4.1.2.
Definition 4.4.1 (Probability mass function (pmf)). The probability mass function
(pmf ) of a random variable X with range R = {r1 , r2 , . . . , rm } is the function
p : R → [0, 1]
defined as
p(ri ) := P(X = ri )
for all i = 1, . . . , m.
If the range has infinitely many elements, say R = {r1 , r2 , . . .}, then the definition is
adjusted in the obvious way:
p(ri ) := P(X = ri )
for all i ≥ 1.
Example 4.4.2. Consider a betting game where a (fair) dice is rolled. If one or two dots
show you get £1, if three, four or five dots show you get £2 and otherwise you get £3.
You are interested in the profit you make playing this game. We have Ω = {1, 2, 3, 4, 5, 6},
where all outcomes are equally likely. The random variable X representing your profit is
given by
X(1) = X(2) = 1,   X(3) = X(4) = X(5) = 2   and   X(6) = 3.
Hence X has range R = {1, 2, 3}, and thus its pmf is a mapping p : {1, 2, 3} → [0, 1]. We
can easily compute its values:
p(1) = P(X = 1) = P({1, 2}) = P({1}) + P({2}) = 1/3,

p(2) = P(X = 2) = P({3, 4, 5}) = P({3}) + P({4}) + P({5}) = 1/2

and

p(3) = P(X = 3) = P({6}) = 1/6.
Note that in each computation above, the first step is just to fill in the definition of the
pmf, and from then on it’s just the same arguments we used in the previous two sections
as well.
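The same pmf can be computed mechanically by counting outcomes, as in this Python sketch of Example 4.4.2 (illustrative names only):

from collections import Counter
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]                  # a fair dice: each outcome has probability 1/6
X = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3}    # the profit for each outcome

counts = Counter(X[w] for w in omega)
pmf = {r: Fraction(c, len(omega)) for r, c in counts.items()}
print(pmf)   # {1: Fraction(1, 3), 2: Fraction(1, 2), 3: Fraction(1, 6)}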
There’s not so much exciting to say about pmf’s, except for the following. We know
from Result 4.2.5 that for any random variable X with range R = {r1 , r2 , . . . , rm } (resp.
R = {r1 , r2 , . . .}) it holds that
∑_{i=1}^{m} P(X = ri ) = 1   resp.   ∑_{i≥1} P(X = ri ) = 1.
If we plug in the definition of the pmf, this directly translates to
∑_{i=1}^{m} p(ri ) = 1   resp.   ∑_{i≥1} p(ri ) = 1.
Hence:
Result 4.4.3. For any random variable X with range R = {r1 , r2 , . . . , rm } (or R =
{r1 , r2 , . . .} if the range has infinitely many elements) we have that
∑_{i=1}^{m} p(ri ) = 1

(or ∑_{i≥1} p(ri ) = 1 in the infinite case).

4.5 Cumulative distribution functions (cdf's)
Like the probability mass function we introduced in the previous section, the 'cumulative distribution function' (abbreviated to 'cdf') of a random variable is a function that contains all the information about the range of the random variable and the associated probabilities for each value in its range. In that sense the cdf is interchangeable with the pmf: if you know the cdf you can derive the pmf, and vice versa. (The cdf has one big advantage over the pmf: the cdf exists for every random variable, not only for the discrete ones we limit ourselves to in this course. That is mainly why it is a popular tool to use.)
The cdf is usually denoted F and it is a function F : R → [0, 1] which assigns to every x ∈ R the probability that the random variable X takes a value that is less than or equal to x. That is to say:
Definition 4.5.1 (Cumulative distribution function (cdf)). The cumulative distribution
function (cdf ) of a random variable X is a mapping, usually denoted by F , of the form
F : R → [0, 1]
and it is defined as follows:
F (x) = P(X ≤ x)
for all x ∈ R.
Remark 4.5.2 (Short intermezzo: drawing the graph of discontinuous functions). We will
find in a moment that (in our case) the cdf is a discontinuous function, that is to say, if
you draw its graph you will find there are ‘jumps’ in the graph. For drawing such a ‘jump’
there is a special agreement. To explain this, consider two functions f and g, with the
following formulae:
f (x) = { 1 if x < 1;  2 if x ≥ 1 }    and    g(x) = { 1 if x ≤ 1;  2 if x > 1 }.
Figure 4.3: On the left a plot of the function f and on the right a plot of the function
g (cf. Remark 4.5.2). The ‘closed circle’ indicates what the value is at a point where the
function has a jump, while the ‘open circle’ shows what the value is not. So we read from
the left graph that f (1) = 2 and from the right graph that g(1) = 1
These functions are almost identical, the only difference is for x = 1, since f (1) = 2 and
g(1) = 1. (Mathematicians say that f is ‘right continuous’ while g is ‘left continuous’.)
To make the difference between these two clear in a graph we draw ‘open’ and ‘closed’
circles at the points where the graph has a jump. Probably staring at the graph is much
more helpful than trying to explain in words, hence have a look at Figure 4.3 and match
the location of the ‘open’ and ‘closed’ circles for both f and g up with the above formulae.
Example 4.5.3 (Example 4.4.2 cont’d). Consider again the random variable X we discussed in Example 4.4.2, which had range R = {1, 2, 3} and pmf p given by p(1) = 1/3,
p(2) = 1/2 and p(3) = 1/6. Let us try to figure out what the cdf F of this random variable
looks like.
We work from ‘left to right’ through all possible values for x ∈ R, where we will see that
only something interesting happens if x crosses one of the values in the range R = {1, 2, 3}
of X.
First pick any x ∈ R such that x < 1 (the smallest value in the range of X). Then
the event {X ≤ x} consists of all outcomes such that X(ω) ≤ x. However, the smallest
value X can have is 1. And we have chosen x < 1. Hence there are no outcomes in the
event {X ≤ x}! That is to say, {X ≤ x} = ∅. Hence plugging in the definition of the cdf
we get for such values of x:
F (x) = P(X ≤ x) = P(∅) = 0 if x < 1.    (4.5)
Next pick any x ∈ R such that 1 ≤ x < 2 (so, x is at least as large as the smallest value
in R but smaller than the second smallest value in R). Consider again the event {X ≤ x}
in this situation. Every outcome ω ∈ Ω such that X(ω) = 1 also satisfies X(ω) ≤ x – by
our choice of x, in particular since x ≥ 1. Any outcome ω ∈ Ω such that X(ω) = 2 or
[Figure 4.4: a plot of the cdf F we found in Example 4.5.3.]
X(ω) = 3 does not satisfy X(ω) ≤ x – by our choice of x, in particular since x < 2. But
this means that we have that {X ≤ x} = {X = 1} and hence we can compute:
F (x) = P(X ≤ x) = P(X = 1) = p(1) = 1/3 if 1 ≤ x < 2.    (4.6)
We keep going in the same fashion until we have considered all possible values of x.
The next one is any x ∈ R such that 2 ≤ x < 3. Consider again the event {X ≤ x} in this
situation. Every outcome ω ∈ Ω such that X(ω) = 1 or X(ω) = 2 also satisfies X(ω) ≤ x
– by our choice of x, in particular since x ≥ 2. Any outcome ω ∈ Ω such that X(ω) = 3
however does not satisfy X(ω) ≤ x – by our choice of x, in particular since x < 3. So, in
this case we arrive at {X ≤ x} = {X = 1} ∪ {X = 2} and using this we can compute:
F (x) = P(X ≤ x) = P({X = 1} ∪ {X = 2}) = P(X = 1) + P(X = 2) = p(1) + p(2) = 5/6 if 2 ≤ x < 3.    (4.7)
Note that in the above computation we have used that the events {X = 1} and {X = 2} are
disjoint, as we of course remember very well from Result 4.2.5!
We have only one case left to consider, namely any x ∈ R such that x ≥ 3. But in this
case we simply have {X ≤ x} = Ω. Indeed, as X(ω) can be either 1, 2 or 3, it is always
true that X(ω) ≤ x – since we have x ≥ 3. This now yields
F (x) = P(X ≤ x) = P(Ω) = 1 if x ≥ 3.    (4.8)
Finally, as the cdf F is now fully specified (we have computed its value for every
possible x ∈ R in (4.5)-(4.8)) we can make a graph of the function, see Figure 4.4.
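The case distinction we just made by hand is exactly what the rule F (x) = ∑_{r ≤ x} p(r) automates. A Python sketch (illustrative only, using the pmf of Example 4.4.2):

from fractions import Fraction

pmf = {1: Fraction(1, 3), 2: Fraction(1, 2), 3: Fraction(1, 6)}

def F(x):
    # cdf: sum the pmf over all values r in the range with r <= x.
    return sum(p for r, p in pmf.items() if r <= x)

print(F(0.5), F(1), F(2.5), F(10))   # 0 1/3 5/6 1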
[Figure 4.5: the cdf jumps at r1 = 1, r2 = 2, r3 = 3 with heights p(1) = 1/3, p(2) = 1/2 and p(3) = 1/6. See Remark 4.5.4.]
Remark 4.5.4. If we think about what exactly happened in the above Example 4.5.3 we
can write down the following general rule (in our case): the cdf F of a random variable X
with range R = {r1 , r2 , . . . , rm } and pmf p is a ‘piecewise constant’ function: it consists
of a collection of constant pieces (i.e. the flat bits in the graph) and ‘jumps’. The ‘jumps’
occur at the points in the range of X, i.e. for x = r1 , x = r2 , ..., x = rm , and the height
of these jumps is given by the corresponding probabilities p(r1 ), p(r2 ), ..., p(rm ).
In the above Example 4.5.3 we have r1 = 1, r2 = 2, r3 = 3 and p(1) = 1/3, p(2) =
1/2, p(3) = 1/6, see Figure 4.5.
Remark 4.5.5. An important observation is that the cdf contains in one graph all information about a random variable X: given the graph of a cdf we can deduce what the range
of X is and what the pmf is. For this we simply use Remark 4.5.4: the range consists of
exactly all the points where the graph ‘jumps’, and the corresponding probabilities can be
read from the size of the jumps.
4.6 The mean (or: expected value) of a random variable
We know by now very well that a random variable takes different values depending on what the outcome of the underlying experiment is. It is then quite natural to wonder what
the ‘average value’ is of a random variable. It is very comparable to having a sequence of
numbers, say a1 , a2 , . . . , an , where each number has a weight, say that wi is the weight of
the number ai . Then the average value of this sequence is given by the well known formula
Average value = (w1 a1 + w2 a2 + . . . + wn an ) / (w1 + w2 + . . . + wn ).    (4.9)
If we think about a random variable X, then it takes a value from its range R =
{r1 , r2 , . . . , rm }, each value ri with the probability P(X = ri ). If we think about these
probabilities as the ‘weights’, then the analogue of (4.9) reads
Average value of X = [P(X = r1 ) · r1 + P(X = r2 ) · r2 + . . . + P(X = rm ) · rm ] / [P(X = r1 ) + P(X = r2 ) + . . . + P(X = rm )].
Since P(X = r1 ) + P(X = r2 ) + . . . + P(X = rm ) = 1 (cf. Result 4.2.5) this simplifies to
Average value of X = P(X = r1 )r1 + P(X = r2 )r2 + . . . + P(X = rm )rm
and this is exactly how we define the average value or mean of a random variable:
Definition 4.6.1 (Mean). Let X be a random variable with range R = {r1 , r2 , . . . , rm }
and pmf p. Then the mean of X is denoted by E[X] and is defined as
E[X] = ∑_{i=1}^{m} ri P(X = ri ) = ∑_{i=1}^{m} ri p(ri ).    (4.10)
If the range has infinitely many elements, R = {r1 , r2 , . . .}, then the sum simply becomes
infinite as well:
E[X] = ∑_{i=1}^{∞} ri P(X = ri ) = ∑_{i=1}^{∞} ri p(ri ).
In fact, we would like to extend this definition slightly. As we discussed in Section
4.3, if we take any function f : R → R then we may construct a new random variable Y by defining Y = f (X), i.e. Y (ω) = f (X(ω)) for all ω ∈ Ω. If the range
of X is RX = {r1 , r2 , . . . , rm } then the random variable Y can take any of the values
f (r1 ), f (r2 ), . . . , f (rm ), where each of these values occurs with probability P(X = ri )
– note that these values may not all be necessarily different, cf. Example 4.3.1 for an
example of this – and hence, analogously to the reasoning above, it makes sense to set
Average value of f (X) = P(X = r1 )f (r1 ) + P(X = r2 )f (r2 ) + . . . + P(X = rm )f (rm ).
This is indeed how we define the mean of the random variable f (X):
Definition 4.6.2 (Mean of a function of a random variable). Let X be a random variable
with range R = {r1 , r2 , . . . , rm } and pmf p. Let f : R → R be a function. Then the mean
of the random variable f (X) is defined as
E[f (X)] = ∑_{i=1}^{m} f (ri )P(X = ri ) = ∑_{i=1}^{m} f (ri )p(ri ).    (4.11)
Again, if the range has infinitely many elements, R = {r1 , r2 , . . .}, then the sum simply
becomes infinite as well:
E[f (X)] = ∑_{i=1}^{∞} f (ri )P(X = ri ) = ∑_{i=1}^{∞} f (ri )p(ri ).
Remark 4.6.3. If we take the function f (x) = x then f (X) is simply equal to X. Reassuringly, if we plug f (x) = x into the formula (4.11) then this boils down to
E[X] = ∑_{i=1}^{m} ri p(ri ),
which is exactly (4.10). So, (4.11) is indeed an extension of (4.10): it applies more broadly,
but in cases where both formulae apply they yield the same result.
Remark 4.6.4. You might find this last bit a bit fishy, and rightfully so. Namely, if
Y = f (X) is a random variable as we know it is then it has its own range, say RY =
{y1 , y2 , . . . , yn } and its own pmf say pY . Hence we wouldn’t need the above Definition 4.6.2
since we could simply apply the original Definition 4.6.1 to the random variable Y to get
a formula for the mean of f (X):
E[f (X)] = E[Y ] = ∑_{i=1}^{n} yi pY (yi ).
The good news is that if you think about it then there is no real problem as this gives
exactly the same result as using Definition 4.6.2. That is, we have
∑_{i=1}^{n} yi pY (yi ) = ∑_{i=1}^{m} f (ri )p(ri ),
so it doesn’t matter which of the two definitions you use! See also the example
below.
Example 4.6.5 (Example 4.3.1 cont’d). In Example 4.3.1 we were looking at a random
variable X with range RX = {−1, 0, 1} and pmf pX given by pX (−1) = pX (0) = pX (1) =
1/3. Then Definition 4.6.1 yields that the mean of X is given by
E[X] = −1 · 1/3 + 0 · 1/3 + 1 · 1/3 = 0.
The random variable Y was defined as Y = X 2 , and we had already deduced that the range
of Y was RY = {0, 1} with pmf pY given by pY (0) = 1/3 and pY (1) = 2/3.
Now, we may compute the mean of Y = X 2 by two different approaches (see the above
Remark). Firstly we may use that Y is a random variable in its own right and we can
directly apply Definition 4.6.1 to compute
E[Y ] = 0 · 1/3 + 1 · 2/3 = 2/3.

On the other hand, since Y = f (X) with f (x) = x² we may also apply Definition 4.6.2 to compute

E[Y ] = E[X²] = (−1)² · 1/3 + 0² · 1/3 + 1² · 1/3 = 2/3.    (4.12)

So indeed both approaches give the same result. The advantage of using the second approach is that you do not first need to determine what the range and pmf of Y are; for (4.12) you only need to know what the range and pmf of X are!
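Both Definition 4.6.1 and Definition 4.6.2 are one short loop over the pmf in code. A Python sketch (the helper name expectation is made up for illustration):

from fractions import Fraction

pmf = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

def expectation(pmf, f=lambda r: r):
    # With the default f(r) = r this is just E[X]; in general it is E[f(X)].
    return sum(f(r) * p for r, p in pmf.items())

print(expectation(pmf))                    # E[X]   = 0
print(expectation(pmf, lambda r: r ** 2))  # E[X^2] = 2/3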
Example 4.6.6. Consider a random variable X with range R = {0, 1, 2, . . .} and pmf p
given by
p(k) = a (1/3)^k for k = 0, 1, 2, . . ..
What is a? Also compute the mean of X.
We know that p must satisfy the property (cf. Result 4.4.3)
∑_{k=0}^{∞} p(k) = 1.    (4.13)
Plugging in the formula for p we find that
∑_{k=0}^{∞} p(k) = ∑_{k=0}^{∞} a (1/3)^k = a ∑_{k=0}^{∞} (1/3)^k = a · 1/(1 − 1/3) = (3/2) a,

where we used the geometric series: ∑_{k=0}^{∞} x^k = 1/(1 − x) for any x ∈ (−1, 1). Hence
condition (4.13) boils down to
(3/2) a = 1 =⇒ a = 2/3.
So we arrive at
p(k) = (2/3) (1/3)^k for k = 0, 1, 2, . . ..
To compute the mean of X we apply Definition 4.6.1 and manipulate a bit to get
E[X] = ∑_{k=0}^{∞} k p(k) = ∑_{k=0}^{∞} k (2/3)(1/3)^k = (2/3) · (1/3) ∑_{k=1}^{∞} k (1/3)^{k−1}.    (4.14)
The reason that it is handy to ‘move a factor 1/3 out of the sum’, as we did above in the
last step, is the following. If we differentiate the geometric series once we see that
∑_{k=0}^{∞} x^k = 1/(1 − x)    =⇒ (differentiate)    ∑_{k=1}^{∞} k x^{k−1} = 1/(1 − x)².
If we plug x = 1/3 into the above right hand side, we find that
∑_{k=1}^{∞} k (1/3)^{k−1} = 1/(1 − 1/3)² = 9/4.
Plugging this result back into (4.14) yields
E[X] = (2/3) · (1/3) · (9/4) = 1/2.
We conclude this section with some properties of the mean.
Result 4.6.7. We have the following for any random variable X with range R.
(i) If the range contains only one element, say R = {c}, then E[X] = c.
(ii) For any a, b ∈ R we have
E[aX + b] = aE[X] + b.
Proof. Let p denote the pmf of X.
Ad (i). Since the range only contains the element c it must be that p(c) = 1 (cf. Result
4.4.3). Hence we get from Definition 4.6.1 (where m = 1) that
E[X] = c · 1 = c.
Ad (ii). By defining the function f (x) = ax + b we have that E[f (X)] = E[aX + b].
Invoking Definition 4.6.2 and then plugging in the definition of f we get
E[f (X)] = ∑_{i=1}^{m} f (ri )p(ri ) = ∑_{i=1}^{m} (ari + b)p(ri ) = a ∑_{i=1}^{m} ri p(ri ) + b ∑_{i=1}^{m} p(ri ).
But now by Definition 4.6.1, the first term in the right hand side above is equal to aE[X], while we know from Result 4.4.3 that ∑_{i=1}^{m} p(ri ) = 1 and hence the second term in the
above right hand side is just equal to b. So we indeed arrive at
E[aX + b] = E[f (X)] = aE[X] + b
as required.
4.7 The variance and standard deviation of a random variable
In the previous section we have extensively discussed what the mean of a random variable
is, the 'average value' it takes. However, of course the mean does not give us a full picture of what a random variable looks like. For instance, consider two random variables: X with
range RX = {−1, 0, 1} and pmf pX (−1) = pX (0) = pX (1) = 1/3, and Y with range
RY = {−10, 0, 10} and pmf pY (−10) = pY (0) = pY (10) = 1/3. Then both random
variables X and Y have mean equal to 0, indeed:
E[X] = −1 · 1/3 + 0 · 1/3 + 1 · 1/3 = 0   and   E[Y ] = −10 · 1/3 + 0 · 1/3 + 10 · 1/3 = 0.
However there is a clear difference between X and Y : the range of Y is much ‘wider’, Y
can take the values −10 and 10 which are both ‘far away’ from its mean 0, while X has
range {−1, 0, 1} and is hence always at most a distance of 1 away from its mean 0.
The idea is now to introduce a measure for how far, on average, a random variable
is away from its mean. Several such measures are possible, but one very common one is
the variance. The variance works as follows. First the mean of a random variable X is
computed, and for convenience let’s denote the mean by µ. Then we consider the random
variable Y given by Y = (X − µ)2 , that is
Y (ω) = (X(ω) − µ)2
for all ω ∈ Ω.
Note that (X(ω) − µ)2 is a measure for the distance between X(ω) and the mean µ of
X. We could just as well have taken |X(ω) − µ|, or (X(ω) − µ)4 for instance, but there
are technical reasons why the square is a good choice. Now we have a random variable
measuring the distance between X and its mean, but for easy comparison we would like
just a number rather, and hence we take the mean of this distance:
E[(X − µ)2 ]
and this is exactly what the variance is.
Definition 4.7.1 (Variance). For a random variable X with mean µ we define the variance, denoted Var(X), as
Var(X) = E[(X − µ)2 ].
The variance is a measure for how far X is on average away from its mean µ: the larger
the variance, the larger this distance on average is.
Example 4.7.2. Consider again the two random variables we introduced at the beginning
of this section, namely X with range RX = {−1, 0, 1} and pmf pX (−1) = pX (0) = pX (1) =
1/3, and Y with range RY = {−10, 0, 10} and pmf pY (−10) = pY (0) = pY (10) = 1/3. We
had already computed that both random variables have mean equal to 0. So the mean of
X, say µX , satisfies µX = 0 and also the mean of Y , say µY , satisfies µY = 0.
Let us for both compute their variance. Note that this is a straightforward application
of Definition 4.6.2, where we set the function f equal to f (x) = (x − µX )2 :
Var(X) = E[(X − µX )²] =(Def. 4.6.2) f (−1)pX (−1) + f (0)pX (0) + f (1)pX (1)
= (−1 − 0)² · 1/3 + (0 − 0)² · 1/3 + (1 − 0)² · 1/3 = 2/3,
and now for Y , where we use f (x) = (x − µY )2 :
Var(Y ) = E[(Y − µY )²] =(Def. 4.6.2) f (−10)pY (−10) + f (0)pY (0) + f (10)pY (10)
= (−10 − 0)² · 1/3 + (0 − 0)² · 1/3 + (10 − 0)² · 1/3 = 200/3.
At the beginning of this section we have argued that Y could get much further away
from its mean 0 than X could, and we see this confirmed by the above computation: the
variance of Y is much larger than the variance of X.
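Both variances are easy to reproduce in code, straight from Definition 4.7.1 (a Python sketch with illustrative helper names):

from fractions import Fraction

def mean(pmf):
    return sum(r * p for r, p in pmf.items())

def variance(pmf):
    # Var(X) = E[(X - mu)^2], computed directly from the definition.
    mu = mean(pmf)
    return sum((r - mu) ** 2 * p for r, p in pmf.items())

pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
pmf_Y = {-10: Fraction(1, 3), 0: Fraction(1, 3), 10: Fraction(1, 3)}

print(variance(pmf_X))   # 2/3
print(variance(pmf_Y))   # 200/3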
Here are some useful properties of the variance:
Result 4.7.3. For any random variable X we have
(i) Var(X) = E[X²] − (E[X])²,    (4.15)
(ii) if the range of X contains only one element then Var(X) = 0,
(iii) for any a, b ∈ R we have
Var(aX + b) = a2 Var(X).
Proof. Ad (i). Let us as usual write the range of X in the generic form R = {r1 , r2 , . . . , rm }.
Also let us denote the mean of X by µ, for notational convenience, and let p be the pmf
of X. Then we have
Var(X) = E[(X − µ)²] =(∗) ∑_{i=1}^{m} (ri − µ)² p(ri ) =(∗∗) ∑_{i=1}^{m} (ri² − 2µri + µ²)p(ri )
=(∗∗∗) ∑_{i=1}^{m} ri² p(ri ) − 2µ ∑_{i=1}^{m} ri p(ri ) + µ² ∑_{i=1}^{m} p(ri ).    (4.16)
Note that (*) uses Definition 4.6.2, in (**) we just work out the brackets, and in (***)
we work out the brackets a bit further to break the single summation up. Now, we can
simplify the terms in the ultimate right hand side quite a bit. Indeed, from Definition
4.6.2 we know that
∑_{i=1}^{m} ri² p(ri ) = E[X²],
from Definition 4.6.1 we know
∑_{i=1}^{m} ri p(ri ) = E[X] = µ
and from Result 4.4.3 we have
∑_{i=1}^{m} p(ri ) = 1.
Plugging this all back into (4.16) we see that we get
Var(X) = E[X 2 ] − 2µ · µ + µ2 = E[X 2 ] − 2µ2 + µ2 = E[X 2 ] − µ2 = E[X 2 ] − (E[X])2 .
Ad (ii). Suppose that the range of X contains only the element c. We have seen in
Result 4.6.7 (i) that in this case E[X] = c. Also, it must mean that the pmf p of X satisfies
p(c) = 1. Hence we get
Var(X) = E[(X − µ)²] =(∗) (c − c)² p(c) = 0 · 1 = 0,
where (*) uses Definition 4.6.2.
Ad (iii). Let us as usual write the range of X in the generic form R = {r1 , r2 , . . . , rm }.
Also let us denote the mean of X by µ, for notational convenience, and let p be the pmf
of X. By part (i) we have that
Var(aX + b) = E[(aX + b)²] − (E[aX + b])².    (4.17)
We’ll just need to work this out. For the first term we may use Definition 4.6.2 again, this
time with the function f (x) = (ax + b)2 , to compute
E[(aX + b)²] = ∑_{i=1}^{m} (ari + b)² p(ri ) = a²E[X²] + 2abµ + b²,
where the steps for the last equality are very similar to what we did in the proof of part
(i) above. For the second term we may use Result 4.6.7 to see that
(E[aX + b])2 = (aE[X] + b)2 = (aµ + b)2 = a2 µ2 + 2abµ + b2 .
Plugging this all back into (4.17) we find that
Var(aX + b) = a²E[X²] + 2abµ + b² − (a²µ² + 2abµ + b²) = a²E[X²] − a²µ²
= a²(E[X²] − µ²) = a²(E[X²] − (E[X])²) = a² Var(X),
as required. Note the last step uses the above part (i) again.
Sometimes it is more convenient to use formula (4.15) in a computation of the variance,
however usually both the original definition and that formula are possible to use.
We conclude this chapter with a definition of the standard deviation. Essentially
the standard deviation is fully equivalent to the variance, it measures exactly the same
‘spread’, just on a different scale. It is simply defined as the square root of the variance.
Definition 4.7.4 (Standard deviation). The standard deviation of a random variable X,
denoted SD(X), is defined as
SD(X) = √Var(X).
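To see these definitions at work on a computer, here is a small Python sketch (our own illustration, not part of the formal development; the function names are made up for this example) that computes the mean, variance and standard deviation of a random variable from its pmf, and reproduces the values 2/3 and 200/3 found for X and Y above:

import math

def mean(pmf):
    # E[X]: sum of r * p(r) over the range (cf. Definition 4.6.1)
    return sum(r * p for r, p in pmf.items())

def variance(pmf):
    # Var(X) = E[(X - mu)^2], computed directly from the definition
    mu = mean(pmf)
    return sum((r - mu) ** 2 * p for r, p in pmf.items())

def variance_shortcut(pmf):
    # The alternative formula (4.15): Var(X) = E[X^2] - (E[X])^2
    return sum(r ** 2 * p for r, p in pmf.items()) - mean(pmf) ** 2

def sd(pmf):
    # SD(X) = sqrt(Var(X)), as in Definition 4.7.4
    return math.sqrt(variance(pmf))

# The examples from this section: X has range {-1, 0, 1} and Y has
# range {-10, 0, 10}, each value having probability 1/3.
pmf_X = {-1: 1/3, 0: 1/3, 1: 1/3}
pmf_Y = {-10: 1/3, 0: 1/3, 10: 1/3}

print(variance(pmf_X), variance_shortcut(pmf_X))  # both 2/3 ~ 0.6667
print(variance(pmf_Y), variance_shortcut(pmf_Y))  # both 200/3 ~ 66.67
print(sd(pmf_X), sd(pmf_Y))                       # ~0.8165 and ~8.165
# Note that Y = 10 * X here, and indeed Var(Y) = 10^2 * Var(X),
# illustrating part (iii) of Result 4.7.3.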
Chapter 5
Some well known distributions
In this final chapter we discuss some of the most well known distributions that (discrete) random variables can have.
5.1 What is a distribution?
So far we have always discussed a situation in which we started with an experiment and a corresponding sample space, and in the previous chapter we have seen how we can define a random variable associated with a certain experiment. To be able to do computations with random variables we need to know what their range is and what their pmf is. Now, as you can probably imagine, many different experiments with associated random variables can lead to the same range and the same pmf. Since mathematicians love to make things as abstract as possible, they like to treat all these cases the same. This is done by focussing only on the range and pmf of a random variable.
Definition 5.1.1 (Distribution). A distribution is simply a combination of a range and
a pmf. If two random variables have identical ranges and pmf's they are said to have the
same distribution (or, are equal in law).
In this chapter we will briefly discuss some of the most famous distributions, i.e.
combinations of ranges and pmf’s.
5.2 The Bernoulli distribution
The most elementary one is without any doubt the Bernoulli distribution:
Definition 5.2.1 (Bernoulli distribution). A random variable X has the Bernoulli distribution with parameter p ∈ [0, 1] (notation: X ∼ Bernoulli(p)) when its range is R = {0, 1}
and its pmf pX is given by
pX(0) = 1 − p   and   pX(1) = p.
Example 5.2.2. Consider the experiment of tossing a coin, so Ω = {H, T }. Define the
random variable X by setting X(H) = 0 and X(T ) = 1. Then the range of X is R =
{0, 1} and (if the coin is indeed fair) its pmf pX is given by pX (0) = P({H}) = 1/2 and
pX(1) = P({T}) = 1/2. Hence X satisfies Definition 5.2.1 above with p = 1/2. Therefore
X has the Bernoulli distribution with parameter 1/2, i.e. X ∼ Bernoulli(1/2).
Example 5.2.3. A fair die is thrown, so Ω = {1, 2, 3, 4, 5, 6}. Define the random variable
X by setting
X(1) = X(2) = 0,
X(3) = X(4) = X(5) = X(6) = 1.
Then X has range {0, 1}, and its pmf pX is given by

pX(0) = P({1, 2}) = 2/6 = 1/3   and   pX(1) = P({3, 4, 5, 6}) = 4/6 = 2/3.

This is again a Bernoulli distribution, now with parameter value p = 2/3. Hence X ∼
Bernoulli(2/3).
When X ∼ Bernoulli(p) it is quite straightforward to compute its mean and variance.
Using Definition 4.6.1 from Ch. 4 we find
E[X] = 0 · pX (0) + 1 · pX (1) = 0 · (1 − p) + 1 · p = p.
To compute its variance we first compute, using Definition 4.6.2 from Ch. 4:
E[X^2] = 0^2 · pX(0) + 1^2 · pX(1) = 0 · (1 − p) + 1 · p = p
so that by Result 4.7.3 we find
Var(X) = E[X^2] − (E[X])^2 = p − p^2 = p(1 − p).
Hence:
Result 5.2.4. Suppose that X ∼ Bernoulli(p). Then
E[X] = p   and   Var(X) = p(1 − p).
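As a quick sanity check of Result 5.2.4, the following sketch (again an illustration of ours) simulates the die experiment of Example 5.2.3 many times and compares the empirical mean and variance with p = 2/3 and p(1 − p) = 2/9:

import random

def bernoulli_from_die():
    # Example 5.2.3: throw a fair die; X = 0 on {1, 2} and X = 1 on
    # {3, 4, 5, 6}, so X ~ Bernoulli(2/3).
    return 0 if random.randint(1, 6) <= 2 else 1

trials = 100_000
xs = [bernoulli_from_die() for _ in range(trials)]
m = sum(xs) / trials
v = sum((x - m) ** 2 for x in xs) / trials
print(m)  # close to p = 2/3
print(v)  # close to p(1 - p) = 2/9 ~ 0.222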
5.3 The Binomial distribution
Consider a simple experiment that yields either ‘success’ with probability p or ‘failure’
with probability (hence) 1 − p. Suppose now that you perform this experiment n times,
independently of each other, and suppose that you define the random variable X as the
total number of successes you have. This is the typical setup for a binomial distribution.
Note that X has range R = {0, 1, . . . , n}. Also, we can compute the pmf pX of X quite
easily. For any k ∈ R, p(k) = P(X = k) is the probability of getting k times success out of
n tries. Any possible sequence of k successes and n − k failures has probability p^k (1 − p)^(n−k) of occurring (by independence of the experiments we can multiply the probabilities). There are C(n, k) such sequences, where C(n, k) denotes the binomial coefficient ‘n choose k’ (cf. Ch. 2), hence

p(k) = P(X = k) = C(n, k) · p^k · (1 − p)^(n−k).
Therefore we define:
Definition 5.3.1 (Binomial distribution). A random variable X has the Binomial distribution with parameters n ∈ N and p ∈ [0, 1] (notation: X ∼ Binomial(n, p)) when its
range is R = {0, 1, . . . , n} and its pmf pX is given by
pX(k) = C(n, k) · p^k · (1 − p)^(n−k)   for all k ∈ {0, 1, . . . , n}.
To compute the mean and variance of a Binomial distribution it is easiest to make
use of a fact we haven’t proven as it is outside the scope of our course, namely the fact
that if we may write the random variable X as the sum of other random variables, say
Y1 , Y2 , . . . , Yn , then
E[X] = E[Y1] + E[Y2] + . . . + E[Yn]   (5.1)
and if the Yi ’s are independent (we haven’t even defined what that exactly means!) we in
addition also have that
Var(X) = Var(Y1) + Var(Y2) + . . . + Var(Yn).   (5.2)
Now, if we believe for a second that these are true facts, then we can argue as follows. We
do n identical experiments. Define the random variable Yi as follows. Yi takes the value 1 if
the i-th experiment yields ‘success’ and 0 if the i-th experiment yields ‘failure’. Recalling
the previous section we then know that Yi ∼ Bernoulli(p) and in particular E[Yi ] = p and
Var(Yi ) = p(1−p) (cf. Result 5.2.4). Furthermore, if we want to know how many successes
we have had in total we might just as well add up the random variables Y1 , Y2 , . . . , Yn ,
that is, we have
X = Y1 + Y2 + . . . + Yn .
Using the ‘facts’ (5.1) and (5.2) we hence arrive at
E[X] = p + p + . . . + p = np
and
Var(X) = p(1 − p) + p(1 − p) + . . . + p(1 − p) = np(1 − p).
Alternative proofs are possible; for instance, for the mean you could first write out the definition and then manipulate it to get the result out, but that is a fair bit of work as well. Hence we leave it at this intuitive explanation instead.
Result 5.3.2. Suppose that X ∼ Binomial(n, p). Then
E[X] = np   and   Var(X) = np(1 − p).
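The sum-of-Bernoullis argument is also easy to check empirically. The sketch below (ours; the helper name binomial_sample is made up) builds each Binomial(n, p) sample as a sum of n independent Bernoulli(p) samples and compares the empirical mean and variance with np and np(1 − p):

import random

def binomial_sample(n, p):
    # X = Y_1 + ... + Y_n with the Y_i independent Bernoulli(p) samples,
    # exactly as in the argument above
    return sum(1 if random.random() < p else 0 for _ in range(n))

n, p, trials = 20, 0.3, 50_000
xs = [binomial_sample(n, p) for _ in range(trials)]
m = sum(xs) / trials
v = sum((x - m) ** 2 for x in xs) / trials
print(m, n * p)            # empirical mean vs np = 6
print(v, n * p * (1 - p))  # empirical variance vs np(1 - p) = 4.2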
Example 5.3.3. You are starting your career as a darts player in a Manchester pub. As you are just starting, they give you a special darts board that contains only a big bullseye and otherwise nothing. If you hit the bullseye you get 50 points; otherwise you get no points. Suppose that each throw hits the bullseye with probability 1/10. If you throw 100 darts, what is the expected number of points? What is the probability of scoring at least 100 points?
This is a case where you repeat a simple experiment that yields ‘success’ (you hit the bullseye) with probability 1/10 and ‘failure’ (no points) with (hence) probability 9/10, in total 100 times. Hence if X denotes the number of successes then X ∼ Binomial(100, 1/10). The expected number of successes is E[X], which is equal to np = 100 · 1/10 = 10 (cf. Result 5.3.2). So the expected number of points is 50 · 10 = 500.
The probability of at least 100 points is the same as the probability of hitting the bullseye at least twice, i.e. P(X ≥ 2). It is easiest here to use the ‘complement rule’. Namely,
since
{X ≥ 2} = {X = 2} ∪ {X = 3} ∪ . . . ∪ {X = 100}
we have P(X ≥ 2) = P(X = 2) + P(X = 3) + . . . + P(X = 100). Computing all these probabilities would be quite a lot of work. It is cleverer to use the fact that
P(X = 0) + P(X = 1) + . . . + P(X = 100) = 1
(cf. Result 4.2.5 in Ch. 4) which yields
P(X = 0) + P(X = 1) + P(X ≥ 2) = 1 =⇒ P(X ≥ 2) = 1 − P(X = 0) − P(X = 1).
Using the formula for the pmf from Definition 5.3.1 we can compute

P(X = 0) = C(100, 0) · (1/10)^0 · (9/10)^100 ≈ 0.0000266

and

P(X = 1) = C(100, 1) · (1/10)^1 · (9/10)^99 ≈ 0.000295
so that we arrive at
P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) ≈ 0.9997.
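For those who want to reproduce this on a computer, here is a possible Python version of the dart computation (a sketch of ours, using only the standard library):

from math import comb  # comb(n, k) is the binomial coefficient C(n, k) (Python 3.8+)

n, p = 100, 1 / 10

def binomial_pmf(k):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k), as in Definition 5.3.1
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p0, p1 = binomial_pmf(0), binomial_pmf(1)
print(p0)           # ~0.0000266
print(p1)           # ~0.000295
print(1 - p0 - p1)  # P(X >= 2) ~ 0.9997, via the complement rule
print(50 * n * p)   # expected number of points: 500.0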
5.4 The Poisson distribution
The next one on the list is the Poisson distribution. This one has a somewhat vaguer interpretation. It has one parameter, denoted by λ > 0. One of its uses is to count the number of ‘successes’ when an experiment is repeated many times and the probability of ‘success’ is very small. Indeed, a Binomial distribution with a very large value of n and a very small value of p is very similar to a Poisson distribution with parameter λ = np. This fact is known as the Poisson limit theorem. The advantage of using the Poisson distribution over the (in principle more correct) Binomial distribution in such situations (i.e. when n is very large and p is very small) is that it is easier to do computations with the Poisson distribution.
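The Poisson limit theorem is easy to observe numerically. The following sketch (ours, purely for illustration) compares the Binomial(n, p) pmf with the pmf of the Poisson distribution with parameter λ = np (as given in Definition 5.4.1 below), for a large n and a small p:

from math import comb, exp, factorial

n, p = 1000, 0.005  # n large, p small; so lam = np = 5
lam = n * p

for k in range(8):
    binom = comb(n, k) * p ** k * (1 - p) ** (n - k)
    poisson = lam ** k / factorial(k) * exp(-lam)
    print(k, round(binom, 5), round(poisson, 5))  # the two columns nearly agree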
The Poisson distribution is the first one in our list with a range containing infinitely
many elements.
Definition 5.4.1 (Poisson distribution). A random variable X has the Poisson distribution with parameter λ > 0 (notation: X ∼ Poisson(λ)) when its range is R = {0, 1, . . .}
and its pmf pX is given by
pX(k) = (λ^k / k!) · e^(−λ)   for all k ∈ {0, 1, . . .}.
For deriving the mean and the variance of a Poisson distribution it is good to recall
the power series of the exponential function:
e^x = Σ_{k=0}^∞ x^k / k!.   (5.3)
For the mean, we get from Definition 4.6.1 in Ch. 4 that
E[X] = Σ_{k=0}^∞ k · pX(k) = Σ_{k=0}^∞ k · (λ^k / k!) · e^(−λ) = Σ_{k=1}^∞ k · (λ^k / k!) · e^(−λ).
Note that in the last step we let the sum start at k = 1 rather than at k = 0, which is
absolutely fine since
k · (λ^k / k!) · e^(−λ) = 0   when k = 0.
Now, using that for any k ≥ 1
k / k! = k / (k(k − 1)(k − 2) . . . 2 · 1) = 1 / ((k − 1)(k − 2) . . . 2 · 1) = 1 / (k − 1)!
we may further manipulate
Σ_{k=1}^∞ k · (λ^k / k!) · e^(−λ) = e^(−λ) · λ · Σ_{k=1}^∞ λ^(k−1) / (k − 1)!.
But
Σ_{k=1}^∞ λ^(k−1) / (k − 1)! = Σ_{k=0}^∞ λ^k / k! = e^λ,
the last step by (5.3).
Putting the pieces together we now arrive at
E[X] = e^(−λ) · λ · e^λ = λ.
It turns out that the variance is equal to λ as well (exercise).
Result 5.4.2. Suppose that X ∼ Poisson(λ). Then
E[X] = λ   and   Var(X) = λ.
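Both values can also be checked numerically by truncating the infinite sums that define E[X] and E[X^2] (a sketch of ours; cutting the sums off at a large K is an approximation, but the neglected tail of the Poisson pmf is astronomically small):

from math import exp

lam, K = 3.5, 100  # truncate the infinite sums at K; the neglected tail is tiny

# Build the Poisson pmf iteratively: p(0) = e^(-lam), p(k) = p(k - 1) * lam / k
pmf = [exp(-lam)]
for k in range(1, K):
    pmf.append(pmf[-1] * lam / k)

m = sum(k * pk for k, pk in enumerate(pmf))        # ~ lam
ex2 = sum(k ** 2 * pk for k, pk in enumerate(pmf))
print(m, ex2 - m ** 2)                             # both ~ 3.5 = lam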
5.5 The geometric distribution
We conclude with the geometric distribution. This one again has a quite clear interpretation. Suppose again that you perform a sequence of identical experiments, independent of each other, where each experiment generates ‘success’ with probability p ∈ [0, 1] and ‘failure’ with probability (hence) 1 − p. Let X be the number of experiments you have to do until you get ‘success’ for the first time. Then X has a geometric distribution with parameter p.
It is clear from this description what the range is, namely R = {1, 2, . . .} (again with
infinitely many elements). Also, the probability that X takes the value k ∈ R is equal
to the probability that we get k − 1 ‘failures’ and then a ‘success’, the corresponding
probability is hence
P(X = k) = (1 − p)k−1 p.
So:
Definition 5.5.1 (Geometric distribution). A random variable X has the Geometric
distribution with parameter p ∈ [0, 1] (notation: X ∼ Geometric(p)) when its range is
R = {1, 2, . . .} and its pmf pX is given by
pX(k) = (1 − p)^(k−1) · p   for all k ∈ {1, 2, . . .}.
To compute the mean of a geometric distribution we make use of the geometric series:
Σ_{k=0}^∞ x^k = 1 / (1 − x)   for any x ∈ (−1, 1).
(Indeed, we use the same arguments as in Example 4.6.6 in Ch. 4, which was essentially
also a geometric distribution!). If we differentiate∗ both sides of this equation we get
Σ_{k=1}^∞ k · x^(k−1) = 1 / (1 − x)^2   for any x ∈ (−1, 1).   (5.4)
Now, again using Definition 4.6.1 from Ch. 4 we get
E[X] = Σ_{k=1}^∞ k · pX(k) = Σ_{k=1}^∞ k · (1 − p)^(k−1) · p = p · Σ_{k=1}^∞ k · (1 − p)^(k−1).
Using (5.4) with x = 1 − p yields
Σ_{k=1}^∞ k · (1 − p)^(k−1) = 1 / p^2
so that we arrive at
E[X] = p · (1 / p^2) = 1 / p.
In a similar fashion it can be derived that the variance is equal to (1 − p)/p^2.
Result 5.5.2. Suppose that X ∼ Geometric(p). Then

E[X] = 1/p   and   Var(X) = (1 − p)/p^2.

∗ We are assuming here that when we differentiate the infinite sum, we may just as well differentiate each of the terms of the infinite sum. This is certainly true when you have a finite sum, as you know very well, but actually doing it for an infinite sum would need some extra justification.
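To close the chapter, here is a simulation sketch (our own illustration) that checks Result 5.5.2 empirically: it generates geometric samples by repeating Bernoulli trials until the first success, exactly as in the description at the start of this section:

import random

def geometric_sample(p):
    # Repeat independent Bernoulli(p) experiments until the first 'success'
    # and return how many experiments were needed (so the sample is >= 1)
    k = 1
    while random.random() >= p:
        k += 1
    return k

p, trials = 0.25, 100_000
xs = [geometric_sample(p) for _ in range(trials)]
m = sum(xs) / trials
v = sum((x - m) ** 2 for x in xs) / trials
print(m, 1 / p)             # empirical mean vs 1/p = 4
print(v, (1 - p) / p ** 2)  # empirical variance vs (1 - p)/p^2 = 12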