Probabilistic Models and Data Analysis
Lecture Notes
Verena Wolf ([email protected])
February 9, 2016
Contents

1 Probabilities and Discrete Random Variables
1.1 Probabilities
1.2 Discrete Random Variables
1.2.1 Expectation and Variance
1.2.2 Higher Moments, Covariance and Correlation
1.3 Important Discrete Probability Distributions
1.4 Exercises

2 Continuous Random Variables and Laws of Large Numbers
2.1 σ-algebras
2.2 Continuous random variables
2.3 Important continuous distributions
2.4 Laws of Large Numbers
2.4.1 Chebyshev's inequality
2.4.2 Weak Law of Large Numbers
2.4.3 Strong Law of Large Numbers
2.4.4 Central Limit Theorem
2.5 Exercises

3 Discrete-time Markov Chains
3.1 Discrete-Time Markov Chains
3.1.1 Transient distribution
3.1.2 Equilibrium distribution
3.2 Case Study: PageRank
3.3 Case Study: DNA Methylation
3.4 From DTMCs to CTMCs
3.5 Exercises

4 Hidden Markov Models
4.1 Computing the probability of outputs
4.2 Most likely hidden sequence of states
4.3 Exercises

5 Computer Simulations and Monte Carlo Methods
5.1 Generating random variates
5.2 Monte Carlo simulation of DTMCs
5.3 Output Analysis
5.3.1 Interval Estimation
5.4 Exercises

6 Parameter Estimation
6.1 Method of moments
6.2 Generalized method of moments
6.3 Maximum likelihood method
6.4 Estimation of Variances
6.5 Estimation of Parameters of HMMs
6.6 Exercises

7 Statistical Testing
7.1 Level α tests: general approach
7.2 Standard Normal null distribution (Z-test)
7.3 T-tests for unknown σ
7.4 p-Value
7.5 Chi-square tests
7.5.1 Estimated variance
7.5.2 Observed Counts
7.5.3 Testing independence
7.6 Multiple Testing
7.6.1 Bonferroni correction
7.6.2 Holm-Bonferroni method
7.6.3 Benjamini-Hochberg procedure
7.7 Exercises
Preface
These lecture notes are based on a course held at Saarland University in 2015. We use hyperlinks to Wikipedia pages to encourage students to further explore the context of this course. However, we would like to point out that Wikipedia entries are mostly written by laymen and should not be used as a primary information source.
Chapter 1
Probabilities and Discrete Random Variables

"What a point, a right angle, or a circle is, I already know before my first geometry lesson; I just cannot yet make it precise. In the same way, I already know what probability is before I have defined it." (Hans Freudenthal)
This chapter gives a short primer on probabilities, as well as on discrete and continuous random variables.
1.1 Probabilities
Let us consider chance experiments with a countable number of possible
outcomes ω1 , ω2 , . . .. The set Ω = {ω1 , ω2 , . . .} of all outcomes is called the
sample space. Subsets of Ω are called events and by 2Ω we denote the set
of all events.
Example 1: Rolling a die and tossing a coin
If we roll a die, the set of possible outcomes is Ω = {1, 2, . . . , 6}. The
event “number is even” is given by E = {2, 4, 6}.
If we toss a coin and count the number of trials until a head turns up for
the first time, Ω = {1, 2, . . .}. The event “number of trials greater than
10” is given by E = {11, 12, . . .}.
We define what a probability is using Kolmogorov’s axioms.
Definition 1: Probability
Assume Ω is a discrete (i.e. finite or countably infinite) and non-empty
sample space. Let P be a function such that P : 2Ω → [0, 1]. The value
P(A) is called the probability of event A if P is such that
1. P(Ω) = 1,
2. for pairwise disjoint events A1, A2, . . . it holds that P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . . .
When we reason about the probability of certain events, we can use many arguments from set theory. For instance, if the events A, B, and C are as illustrated in Figure 1.1, it holds that
P(A ∪ B) = P(A) + P(B) and
P(A ∪ C) = P(A) + P(C) − P(A ∩ C).

[Figure 1.1: Using set arguments for the calculation of event probabilities.]

Example 2: Rolling a die
The probability of the event {1, 2, . . . , 6} is one. Moreover, for each n ∈ {1, 2, . . . , 6} it holds that P({n}) = 1/6. Thus, P({2, 4, 6}) = 1/6 + 1/6 + 1/6 = 1/2. Similarly, P({1, 3, 5}) = 1 − P({2, 4, 6}) = 1/2.
Example 3: Tossing a coin until heads for the first time
The probabilities of the events "exactly n trials" and "more than three trials" are
P({n}) = (1/2)^n and
P({4, 5, . . .}) = P(Ω \ {1, 2, 3}) = 1 − (1/2 + 1/4 + 1/8) = 1/8.
Note that P(Ω) = Σ_{ω∈Ω} P({ω}) = 1/2 + 1/4 + 1/8 + . . . = 1.
The triple (Ω, 2Ω , P ) is called a discrete probability space.
Definition 2: Conditional Probability
Let A, B be events and P (B) > 0. Then
P(A|B) := P(A ∩ B) / P(B)
is called the probability of A under the condition B. Clearly, this implies that
P (A ∩ B) = P (A|B) · P (B).
It is easy to show that the value P (A|B) is a probability in the sense of
Definition 1.
Example 4: Lung Cancer
We define the events
A: person gets lung cancer, with P(A) = 72/200000 = 0.00036,
B: person is a smoker, with P(B) = 0.25.
Of the people that get lung cancer, 90% are smokers. The experiment consists in choosing a person at random. For the probability of getting lung cancer under the condition of being a smoker, we calculate
P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B) = 0.9 · 0.00036 / 0.25 = 0.001296.
For the probability of getting lung cancer under the condition of not being a smoker, we get
P(A|B̄) = P(A ∩ B̄)/P(B̄) = P(B̄|A)P(A)/P(B̄) = (1 − 0.9) · 0.00036 / 0.75 = 0.000048.
Thus, the chance of getting lung cancer is about 27 times higher for smokers than for non-smokers.
Definition 3: Independence
Let 0 < P (B) < 1. The event A is called independent of B if
P (A|B) = P (A|B̄).
Thus, the event A has the same probability no matter whether we condition on B or on B̄. Our intuition tells us that in this case, event B "has nothing to do" with event A. For example, you roll two dice, one red and one blue. Event A is "red die gives six" and event B is "blue die gives even number". Then, clearly, A is independent of B and vice versa. However, the following example shows that independence can also occur between events that are highly related.
Example 5: Independent Events
Consider the case A = Ω. The event Ω is independent of any event B with 0 < P(B) < 1 since
P(Ω|B) = P(Ω ∩ B)/P(B) = 1 = P(Ω ∩ B̄)/P(B̄) = P(Ω|B̄).
Assume now that Ω = {1, 2, . . . , 6}, A = {5, 6}, and B = {2, 4, 6}. Then
P(A|B) = P(A ∩ B)/P(B) = P({6})/P({2, 4, 6}) = 2 · 1/6 = 1/3
and
P(A|B̄) = P(A ∩ B̄)/P(B̄) = P({5})/P({1, 3, 5}) = 2 · 1/6 = 1/3,
which means that A = {5, 6} is independent of B = {2, 4, 6}.
Note that one can use the alternative (and equivalent) condition
P (A ∩ B) = P (A) · P (B)
to define independence. In particular, this condition can also be checked if
P (B) = 0.
Assume that we have a finite or countably infinite number of events A1, A2, . . . that are pairwise disjoint. Assume further Ω = A1 ∪ A2 ∪ . . . and P(Ai) > 0 for all i. Often, the probability of some event B is unknown, but the conditional probabilities P(B|A1), P(B|A2), . . . are known. In this case, P(B) can be computed using the law of total probability, which states that
P(B) = Σ_i P(B|Ai) · P(Ai),
where each summand P(B|Ai) · P(Ai) equals P(B ∩ Ai). The corresponding partitioning of Ω is illustrated in Fig. 1.2.

[Figure 1.2: The law of total probability.]

If, however, P(Ai|B) is the probability of interest, one can use Bayes' theorem to compute it. With Bayes' theorem we can, for any event A with P(A) > 0, express P(A|B) by P(B|A), that is,
P(A|B) = P(B|A) · P(A) / P(B).
Thus, for the above problem of computing P(Ai|B) for pairwise disjoint events Ai, this gives
P(Ai|B) = P(B ∩ Ai) / P(B) = P(B|Ai) · P(Ai) / P(B) = P(B|Ai) · P(Ai) / Σ_j P(B|Aj) · P(Aj).

Example 6: Noisy Channel
Consider the noisy binary channel illustrated below. If the channel is noise-free (p = 0), zero is transmitted from the upper left node to the upper right node. Similarly, one is transmitted from the lower left node to the lower right node.

[Figure: the binary channel; a sent bit reaches the receiver unchanged with probability 1 − p and flipped with probability p.]
If the channel is noisy, with probability p > 0 one is transmitted instead
of zero and zero instead of one. Assume that the probability of sending
zero is π0 and the probability of sending one is π1 = 1 − π0 .
We define the events
• A0: send 0, with P(A0) = π0,
• A1: send 1, with P(A1) = π1 := 1 − π0,
• B0: receive 0,
• B1: receive 1.
Then, P(B1|A0) = P(B0|A1) = p and P(B1|A1) = P(B0|A0) = 1 − p. We calculate
P(B1) = P(B1|A0) · P(A0) + P(B1|A1) · P(A1) = p · π0 + (1 − p) · π1,
P(B0) = P(B0|A0) · P(A0) + P(B0|A1) · P(A1) = (1 − p) · π0 + p · π1.
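As a quick numerical check, the following MATLAB sketch computes P(B1) and, via Bayes' theorem, the posterior probability P(A0|B1) that a 0 was sent given that a 1 was received. The parameter values p = 0.1 and π0 = 0.6 are illustrative assumptions, not taken from the text.

% Posterior P(A0|B1) for the noisy binary channel (illustrative values).
p   = 0.1;                    % flip probability of the channel (assumed)
pi0 = 0.6;                    % probability of sending 0 (assumed)
pi1 = 1 - pi0;                % probability of sending 1
PB1 = p*pi0 + (1 - p)*pi1;    % P(B1) by the law of total probability
PA0givenB1 = p*pi0 / PB1      % Bayes' theorem: P(B1|A0)P(A0)/P(B1)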
1.2 Discrete Random Variables

A random variable is used to represent an outcome of an experiment. Technically, the fact that this variable is "random" (that is, it takes values with a certain probability) is realized by using a mapping, as illustrated in the figure below.

[Figure: a random variable X maps outcomes ω1, ω2, ω3 ∈ Ω to real values in X(Ω).]
Definition 4: Discrete Random Variable
Let (Ω, 2Ω , P ) be a discrete probability space. A function
X:Ω→R
is called a discrete (real-valued) random variable on (Ω, 2Ω , P ).
Now that we have mapped the outcomes of an experiment to R, we need to assign probabilities to the subsets of X(Ω) = {a ∈ R | ∃ω ∈ Ω : X(ω) = a}. Since we already have probabilities for events A ⊆ Ω, we define a function PX : 2X(Ω) → [0, 1] such that, for A ∈ 2X(Ω),
PX(A) := P(X⁻¹(A)) = P({ω ∈ Ω | X(ω) ∈ A}).
Then (X(Ω), 2X(Ω), PX) is a discrete probability space.
[Figure 1.3: Discrete probability distribution (rolling two dice). Figure 1.4: Cumulative probability distribution (rolling two dice).]
Example 7: Rolling two dice
We want to define a random variable for the sum of the two numbers on the two dice. Let Ω = {(ω1, ω2) | ω1, ω2 ∈ {1, . . . , 6}} and let X : Ω → R be such that X(ω1, ω2) = ω1 + ω2. Then, the probability that the sum equals 10 is
PX({10}) = P(X⁻¹({10})) = P({(ω1, ω2) ∈ Ω | X(ω1, ω2) = 10}) = P({(4, 6)}) + P({(6, 4)}) + P({(5, 5)}) = 1/12.
In the sequel, we use the following "shortcuts" to refer to subsets of Ω:
• “X = a” stands for the set {ω ∈ Ω | X(ω) = a}
• “X ≤ a” stands for the set {ω ∈ Ω | X(ω) ≤ a}
• “X < a” . . .
Thus, for instance,
P(X ≤ a) = P({ω ∈ Ω | X(ω) ≤ a}) = Σ_{c∈X(Ω), c≤a} P(X = c).
The function f : X(Ω) → [0, 1] with f (a) = P (X = a) is called the discrete
probability distribution of X. It tells us the probability of each possible
event of the form “X = a”. Of similar importance is the following function
related to a random variable X.
Definition 5: Cumulative Probability Distribution
Let X be a discrete (real-valued) random variable. The function F : R → [0, 1] with
F(x) := P(X ≤ x) = Σ_{a∈X(Ω), a≤x} P(X = a)
is called the cumulative probability distribution of X.
Example 8: Rolling two dice
Assume that X is defined as in Example 7. The discrete probability distribution and the cumulative probability distribution of X are shown in Figures 1.3 and 1.4, respectively.
1.2.1 Expectation and Variance
Besides the probability distribution of a discrete random variable X, other values related to X are of interest. The most important ones are the expectation and the variance of X, which are defined as
• E(X) = Σ_{x∈X(Ω)} x · P(X = x) and
• V(X) = Σ_{x∈X(Ω)} (x − E(X))² · P(X = x),
respectively. (Note that the sums might not converge, in which case the expectation/variance does not exist.) The standard deviation of X is given by √V(X). Note that V(X) (and √V(X)) are always non-negative.
Example 9: Expected Value of Special Random Variables
Let c ∈ R. Assume X(ω) = c for all ω ∈ Ω. Then E(X) = c.
Let A ⊆ Ω and define Y = I_A where
I_A(ω) = 1 if ω ∈ A, and I_A(ω) = 0 otherwise.
Then E(Y) = 1 · P(A) + 0 · P(Ā) = P(A).
Often the expectation of a random variable is denoted by the Greek letter µ, while the variance is denoted by σ² and σ denotes the standard deviation. The latter measures the variation in the same "units" as X and µ, while σ² measures in squared units.
One might think that E[X] is something like "the best guess for X" that we can make. However, note that even if X is discrete (e.g. X ∈ {1, 2, . . .}), its expectation may not be a valid outcome (e.g. 2.6), since it is a real number. In addition, it might be a bad guess if the probability distribution of X is, for instance, bimodal as illustrated in Fig. 1.5.

[Figure 1.5: Bimodal discrete probability distribution.]
Example 10:
We consider first a standard fair die with 6 sides. For such a die the expectation is given by E(X) = Σ_{k=1}^{6} k · 1/6 = 3.5 and the variance is V(X) = Σ_{k=1}^{6} (k − 3.5)² · 1/6 ≈ 2.92. Now let us consider a second fair die where the numbers on sides 3 and 4 are substituted by 1 and 6 (so the die has sides 1, 2, 1, 6, 5, 6). For this die the expectation is 3.5 as before, but the variance is much larger, V(Y) ≈ 4.92. The distributions of both dice are shown in the figure below (blue) together with the corresponding variance (green).

[Figure: probability distributions of the two dice; both have expectation E(X) = E(Y) = 3.5.]
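These values are easy to check numerically; a minimal MATLAB sketch:

% Expectation and variance of the two dice of Example 10.
sides1 = 1:6;                          % standard die
sides2 = [1 2 1 6 5 6];                % modified die
p = ones(1,6)/6;                       % each side has probability 1/6
EX = sum(sides1 .* p)                  % E(X) = 3.5
VX = sum((sides1 - EX).^2 .* p)        % V(X) approx. 2.92
EY = sum(sides2 .* p)                  % E(Y) = 3.5
VY = sum((sides2 - EY).^2 .* p)        % V(Y) approx. 4.92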
It is possible to connect random variables on the same sample space Ω using operations such as +, −, ·, etc. For instance, we define Z = X + Y as a random variable on Ω by
Z(ω) = X(ω) + Y(ω) for all ω ∈ Ω.
The probability P(Z = z) = P(X + Y = z) is then well-defined since
P(Z = z) = P({ω ∈ Ω | Z(ω) = z}) = Σ_{ω∈Ω, X(ω)+Y(ω)=z} P({ω}).
Moreover,
Σ_{z∈Z(Ω)} P(Z = z) = Σ_{z∈Z(Ω)} Σ_{ω∈Ω, X(ω)+Y(ω)=z} P({ω}) = P(Ω) = 1.
[Figure 1.6: Sum of two random variables (rolling two dice).]
Example 11: Rolling two dice
Assume that Ω = {1, 2, . . . , 6}² and X, Y : Ω → R are such that, for (ω1, ω2) ∈ Ω,
X(ω1, ω2) = ω1 and Y(ω1, ω2) = ω2.
Then for Z = X + Y we have
Z(ω1, ω2) = X(ω1, ω2) + Y(ω1, ω2) = ω1 + ω2
and (see the illustration in Fig. 1.6)
P(Z = z) = Σ_{(ω1,ω2)∈Ω, X(ω1,ω2)+Y(ω1,ω2)=z} P({(ω1, ω2)}) = Σ_{(ω1,ω2)∈Ω, ω1+ω2=z} P({(ω1, ω2)}).
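The distribution of Z can be obtained by enumerating the 36 equally likely outcomes; a minimal MATLAB sketch:

% Distribution of Z = X + Y for two fair dice (Example 11).
pZ = zeros(1, 12);                     % pZ(z) = P(Z = z) for z = 2,...,12
for w1 = 1:6
    for w2 = 1:6
        pZ(w1 + w2) = pZ(w1 + w2) + 1/36;   % each pair has probability 1/36
    end
end
pZ(10)                                 % 3/36 = 1/12, cf. Example 7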
Now, having two random variables on the same probability space, we can
define a joint distribution as follows.
Definition 6: Joint Probability Distribution
Let X, Y be discrete (real-valued) random variables on the same probability
space (Ω, 2Ω , P ). The function PX,Y : X(Ω) × Y (Ω) → [0, 1] with
PX,Y(a, b) := P(X = a ∧ Y = b) = P(X⁻¹({a}) ∩ Y⁻¹({b}))
is called the joint probability distribution of X and Y.
Similar to the definition of independence between events we can define when
two random variables are independent.
Definition 7: Independence of Random Variables
Let X, Y be discrete (real-valued) random variables on the same probability space (Ω, 2Ω, P). We call X and Y independent iff for all a ∈ X(Ω), b ∈ Y(Ω)
P(X = a | Y = b) = P(X = a).
(Or equivalently, X and Y are independent iff P(X = a ∧ Y = b) = P(X = a) · P(Y = b).)
Example 12: Rolling two dice
Assume that X and Y are defined as in Example 11. Then X and Y are independent, since for any a, b ∈ {1, . . . , 6},
P(X = a ∧ Y = b) = P(X⁻¹({a}) ∩ Y⁻¹({b})) = P({(ω1, ω2) ∈ Ω | ω1 = a} ∩ {(ω1, ω2) ∈ Ω | ω2 = b}) = P({(a, b)}) = 1/36
and
P(X = a) · P(Y = b) = P(X⁻¹({a})) · P(Y⁻¹({b})) = P({(ω1, ω2) ∈ Ω | ω1 = a}) · P({(ω1, ω2) ∈ Ω | ω2 = b}) = 6/36 · 6/36 = 1/36.
We are now able to state some properties of the expectation. Let X, Y be
discrete random variables on the same probability space and assume that
E(X) and E(Y) exist. Then, for a, b ∈ R,
1. E(X) = Σ_{ω∈Ω} X(ω) · P({ω}),
2. E(a · X + b) = a · E(X) + b,
3. E(X + Y ) = E(X) + E(Y ).
4. If X and Y are independent, then E(X · Y ) = E(X) · E(Y ).
Proof. (1.)
E(X) = Σ_{x∈X(Ω)} x · P(X = x)   (definition of E(X))
= Σ_{x∈X(Ω)} x · P({ω | X(ω) = x})   (definition of P(X = x))
= Σ_{x∈X(Ω)} x · Σ_{ω∈Ω, X(ω)=x} P({ω})   (2nd axiom of Def. 1)
= Σ_{ω∈Ω} X(ω) · P({ω})   (X is a function)
Proof. (2.) Let Z = a · X + b be a discrete random variable on the same probability space as X, such that Z(ω) = a · X(ω) + b. We now find:
E(a · X + b) = E(Z)   (definition of Z)
= Σ_{ω∈Ω} Z(ω) · P({ω})   (first property)
= Σ_{ω∈Ω} (a · X(ω) + b) · P({ω})   (definition of Z)
= a Σ_{ω∈Ω} X(ω) · P({ω}) + b Σ_{ω∈Ω} P({ω})   (calculus)
= a Σ_{ω∈Ω} X(ω) · P({ω}) + b · P(Ω)   (2nd axiom of Def. 1)
= a Σ_{ω∈Ω} X(ω) · P({ω}) + b   (1st axiom of Def. 1)
= a · E(X) + b   (first property)
Proof. (3.) Let Z = X + Y be a discrete random variable on the same probability space as X and Y, such that Z(ω) = X(ω) + Y(ω). We now find:
E(X + Y) = E(Z)   (definition of Z)
= Σ_{ω∈Ω} Z(ω) · P({ω})   (first property)
= Σ_{ω∈Ω} (X(ω) + Y(ω)) · P({ω})   (definition of Z)
= Σ_{ω∈Ω} X(ω) · P({ω}) + Σ_{ω∈Ω} Y(ω) · P({ω})   (calculus)
= E(X) + E(Y)   (first property)
Proof. (4.) Let Z = X · Y be a discrete random variable on the same probability space as X and Y, such that Z(ω) = X(ω) · Y(ω). Let X and Y be independent. We now find:
E(X · Y) = E(Z)   (definition of Z)
= Σ_{ω∈Ω} Z(ω) · P({ω})   (first property)
= Σ_{ω∈Ω} X(ω) · Y(ω) · P({ω})   (definition of Z)
= Σ_{x∈X(Ω), y∈Y(Ω)} x · y · Σ_{ω∈Ω, X(ω)=x, Y(ω)=y} P({ω})   (X, Y are functions)
= Σ_{x∈X(Ω), y∈Y(Ω)} x · y · P({ω | X(ω) = x ∧ Y(ω) = y})   (2nd axiom of Def. 1)
= Σ_{x∈X(Ω), y∈Y(Ω)} x · y · P(X = x ∧ Y = y)   (definition of P(. . .))
= Σ_{x∈X(Ω), y∈Y(Ω)} x · y · P(X = x) · P(Y = y)   (independence of X and Y)
= Σ_{x∈X(Ω)} x · P(X = x) · Σ_{y∈Y(Ω)} y · P(Y = y)   (calculus)
= E(X) · E(Y)   (definition of E(X), E(Y))
We list the properties of the variance operator without proof: Let X, Y be discrete random variables on the same probability space and assume that E(X), E(Y), VAR(X), and VAR(Y) exist. Then
1. VAR(a · X + b) = a² · VAR(X) for a, b ∈ R,
2. VAR(X) = E(X²) − E(X)²,
3. VAR(X + Y) = VAR(X) + VAR(Y) + 2 · (E(X · Y) − E(X) · E(Y)),
4. if X and Y are independent, then VAR(X + Y) = VAR(X) + VAR(Y).
The definition of independent random variables can be extended to more
than two variables as follows.
Definition 8: Independence (of n discrete random variables)
For n ∈ N, let X1 , . . . , Xn be discrete (real-valued) random variables on the
same probability space (Ω, 2Ω , P ). We call X1 , . . . , Xn independent iff for all
a1 ∈ X1 (Ω), . . . , an ∈ Xn (Ω)
P (X1 = a1 , . . . , Xn = an ) = P (X1 = a1 ) · . . . · P (Xn = an ).
Note that if X1 , . . . , Xn are independent, then
E(X1 · . . . · Xn ) = E(X1 ) · . . . · E(Xn )
and
VAR(X1 + . . . + Xn ) = VAR(X1 ) + . . . + VAR(Xn ).
1.2.2 Higher Moments, Covariance and Correlation
In the previous section we have used different operators (e.g. +) to combine several random variables and considered the expectation of combinations of random variables. Thus, we already used the following general formula for the expectation of a function g:
E(g(X)) = Σ_x g(x) · P(X = x).
(Again, the sum might not converge, in which case E(g(X)) does not exist.) Note that this formula can also be applied if g is not a one-to-one function, e.g. if g(x) = x². In the case g(x) = x^i, we call E(X^i) the i-th moment of X. Note that the moments of a distribution are related to its skewness and its kurtosis. The former characterizes the degree of asymmetry while the latter characterizes the flatness or peakedness of the distribution.
While expectation, variance and standard deviation describe properties of a single random variable, the covariance and the correlation measure the level of dependence between variables:
• The covariance of two random variables X and Y is defined by
COV(X, Y) = E([X − E(X)][Y − E(Y)]).
• Assuming VAR(X), VAR(Y) > 0, the correlation of X and Y is defined by
COR(X, Y) = COV(X, Y) / (√VAR(X) · √VAR(Y)).
The covariance is the expected product of the deviations of X and Y from their respective means. The correlation can be seen as a scaled version of the covariance having the same sign. If X and Y are positively correlated, COR(X, Y) > 0, large values of X (positive deviations from E(X)) generally correspond to large values of Y (an example is given in Fig. 1.7). Similarly, small values of X imply small values of Y. If X and Y are negatively correlated, COR(X, Y) < 0, large values of X generally correspond to small values of Y. If X and Y are uncorrelated, COR(X, Y) = 0, we know that there is no linear dependency between X and Y. This does, however, not imply that they are independent, since other forms of dependence are possible, as the following example shows.

[Figure 1.7: Positively correlated X and Y.]
Example 13: Uncorrelated Variables
Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X². Then E(X) = 0 and E(Y) = 2/3. Thus,
COV(X, Y) = Σ_{x,y} P(X = x ∧ Y = y)(x − E(X))(y − E(Y))
= 1/3 · (−1 − 0)(1 − 2/3) + 1/3 · (1 − 0)(1 − 2/3) = 0.
Thus, X and Y are uncorrelated even though Y is a function of X (the strongest form of dependence). Clearly, X and Y are not independent since, for instance,
P(X = 1 ∧ Y = 1) = 1/3 ≠ P(X = 1) · P(Y = 1) = 1/3 · 2/3 = 2/9.
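A minimal MATLAB sketch reproducing the computation of Example 13:

% Covariance of X (uniform on {-1,0,1}) and Y = X^2 (Example 13).
x  = [-1 0 1];  px = [1 1 1]/3;        % distribution of X
y  = x.^2;                             % value of Y for each outcome
EX = sum(x .* px);                     % 0
EY = sum(y .* px);                     % 2/3
covXY = sum((x - EX) .* (y - EY) .* px)   % 0: X and Y are uncorrelated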
Next, we list some important properties of the covariance:
• COV(X, Y) = COV(Y, X)
• COV(X, Y) = E(X · Y) − E(X) · E(Y)
• COV(X, X) = VAR(X)
Finally, we note that since the correlation is scaled and has the property −1 ≤ COR(X, Y) ≤ 1, we have a maximal positive (negative) correlation at COR(X, Y) = 1 (COR(X, Y) = −1). In this case all possible realizations of (X, Y) lie on a straight line; thus, given X = x, the value Y = y is fixed by the line.

1.3 Important Discrete Probability Distributions
Let us now consider some important discrete probability distributions.
• Bernoulli distribution: Consider again an experiment where there are only two possible outcomes (e.g. flipping a coin). This is called a Bernoulli trial. Let us use 0 (failure) and 1 (success) for the two possible outcomes. The probability of 1 is P(X = 1) = p and P(X = 0) = 1 − p. Thus, the mean of the Bernoulli distribution is E(X) = 0 · (1 − p) + 1 · p = p.
• Binomial distribution: We consider a sequence of n independent Bernoulli trials and count the number X of successes. Then X follows a binomial distribution, i.e. for k ∈ {0, 1, . . . , n}
P(X = k) = (n choose k) · p^k · (1 − p)^(n−k),
where the first factor counts all possible orderings of k successes among n trials, the second one gives the probability that in the independent trials we have k successes, and the last factor describes the probability of (the remaining) n − k failures. Note that one can express X as the sum of n independent Bernoulli variables X1, . . . , Xn:
X = X1 + . . . + Xn.
Thus, E(X) = E(X1) + . . . + E(Xn) = p + . . . + p = n · p. Please use the Wikipedia link to view the plot of the distribution and learn more about the binomial distribution.
• Geometric distribution: Consider again an experiment where the probability of an event A is P(A) = p. We repeat this experiment until A occurs for the first time. Let X be the random variable that describes the number of trials until A occurs for the first time. Then X is called geometrically distributed. We have P(X = i) = (1 − p)^(i−1) · p and E(X) = 1/p. The variance of X is V(X) = (1 − p)/p². Please use the Wikipedia link to view the plot of the distribution and learn more about the geometric distribution.
• Poisson distribution: Consider a call center where on average µ = 6 calls per minute arrive. Let X be the random variable that represents the number of calls in the next minute and assume P(X = k) = (µ^k / k!) · e^(−µ). Then X is called Poisson distributed and has expectation
E(X) = Σ_{k=0}^{∞} k · (µ^k / k!) · e^(−µ) = µ e^(−µ) Σ_{k=1}^{∞} µ^(k−1) / (k − 1)! = µ e^(−µ) e^µ = µ
and variance V(X) = µ (without proof). Please use the Wikipedia link to view the plot of the distribution and learn more about the Poisson distribution.
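As a small numerical illustration of the binomial formula, the following MATLAB sketch evaluates the probabilities P(X = k) for the assumed values n = 10 and p = 0.3 and checks that they sum to one and have mean n · p:

% Binomial probabilities and a check of E(X) = n*p (illustrative values).
n = 10;  p = 0.3;
k = 0:n;
pmf = arrayfun(@(j) nchoosek(n, j) * p^j * (1 - p)^(n - j), k);
sum(pmf)                               % 1: the probabilities sum to one
sum(k .* pmf)                          % n*p = 3: the expectation E(X)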
1.4 Exercises
Easy
Exercise 1: Probability What is wrong with the following axioms for
probability:
The value P (A) is called the probability of event A if P is such that
1. P (Ω) = 1,
2. for events A1 , A2 , . . . it holds that
P (A1 ∪ A2 ∪ . . .) = P (A1 ) + P (A2 ) + . . . .
Exercise 2: A Chance Experiment We throw a fair coin three consecutive times. What is the sample space of this experiment? Can you describe two different events on this sample space?
Exercise 3: Conditional Probability Show that the value P (A|B) is
a probability in the sense of Definition 1.
Exercise 4: Independence Suppose that we toss a 6-sided die. Each of the six numbers is a sample point and is assigned probability 1/6. Define the events A and B as follows: A = {1, 3, 4}, B = {3, 4, 5, 6}. Prove that A and B are independent.
Exercise 5: Variance Assume that for a discrete random variable X we have VAR(X) = 0. What can be said about the probability distribution of X?
Exercise 6: Covariance Let X and Y be two random variables such that VAR(X) = 2 and COV(X, Y) = 1. Compute the covariance COV(5X, 2X + 3Y).
Exercise 7: Correlation of Random Variables Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X² − X. Are these random variables uncorrelated?
Exercise 8: Correlation Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X² + X. Are these random variables correlated? What is the correlation coefficient?
Exercise 9: Poisson vs Binomial Use MATLAB to plot a Poisson distribution with mean µ = 10. Assume now a binomial distribution with n = 30 trials. What is the success probability p of the binomial distribution such that the two distributions look similar?
Medium
Exercise 10: Conditional Probability In the summer semester 2015 there were 1162 students at Saarland University studying computer science and 184 studying bioinformatics. About 43% of the bioinformatics students are female, and about 18% of the computer science students are female. What is the probability that a female student from the total of 1346 students (studying either bioinformatics or computer science) is a computer science student?
Exercise 11: Independence Prove that for P (B) > 0 the following
conditions are all equivalent:
i) P (A ∩ B) = P (A) · P (B)
ii) P (A|B) = P (A|B̄)
iii) P (A|B) = P (A).
Can A be independent of B but not vice versa?
Exercise 12: Law of Total Probability A test for a viral infection is
95% reliable for infected patients and 99% for healthy patients. Assume that
4% of all patients that are tested are infected with the virus. If a patient is
tested positive, what is the probability that he really has the virus?
Exercise 13: Discrete Random Variables Let (Ω, 2Ω , P ) be a discrete
probability space with Ω being finite and X : Ω → R. Prove that, for
A ⊂ X(Ω), the value
PX (A) := P (X −1 (A)) = P ({ω ∈ Ω | X(ω) ∈ A})
is a probability. (Hint: Note that if Ω is finite, the second axiom of Kolmogorov’s probability definition reduces to “If A and B are disjoint subsets
of Ω then P (A ∪ B) = P (A) + P (B)”.)
Exercise 14: Discrete Random Variables Consider a game of chance
where a player bets one euro and picks a number between 1 and 6, e.g. 2.
A die is then rolled three times and the player receives
  nothing      if no 2 was thrown,
  two euros    if 2 was thrown one time,
  three euros  if 2 was thrown two times,
  four euros   if 2 was thrown three times,
and similarly for the other numbers. Define a random variable Z that describes
the net profit (“Reingewinn”) of one round of the game. Calculate the
probabilities P (Z = −1), P (Z = 0), P (Z = 1), P (Z = 2), P (Z = 3), as
well as the expectation, variance, and standard deviation of Z.
Difficult
Exercise 15: Variance and Expectation Prove the following:
1. V AR(X) = E(X 2 ) − E(X)2
2. VAR(a · X + b) = a2 · VAR(X) for a, b ∈ R,
3. VAR(X + Y ) = VAR(X) + VAR(Y ) + 2 · (E(X · Y ) − E(X) · E(Y )).
4. If X and Y are independent, then VAR(X +Y ) = VAR(X)+VAR(Y ).
Exercise 16: Binomial Distribution Prove that for a binomially distributed random variable, X ∼ B(n, p) it holds E[X] = pn and V AR[X] =
p(1 − p)n.
Exercise 17: Poisson Distribution A Poisson distributed random variable X has the following probability distribution:
P(X = k) = (µ^k / k!) · e^(−µ),  k ∈ {0, 1, . . .}.
Prove that the mean and the variance of X are equal to µ.
Chapter 2
Continuous Random Variables and Laws of Large Numbers
Consider a chance experiment where the sample space Ω contains uncountably many elements. In this case, assigning probabilities to all elements in
2Ω poses problems. Assume, for instance, we randomly choose a real number in [0, 1]. If all numbers are equally likely to occur, we have to assign
probability zero to each since their “sum” must be one (note that the sum
over uncountably many nonzero values is undefined). Instead of giving a
new definition of probabilities, we restrict ourselves to a set of events, called
σ-algebra, for which we can define probabilities as in the discrete case.
2.1 σ-algebras
If we define a random variable as a function X : Ω → R, we want to reason
about the probability of events such as X = x for any x ∈ R or a < X ≤ b
for any interval (a, b]. From the properties of probabilities (see Def. 1), we
then know the probability of the disjoint union of countably many of such
sets as well as the probability of the complement of such a set1 .
Definition 9: σ-algebra
A set F ⊆ 2Ω is called a σ-algebra if
1. Ω ∈ F,
2. A ∈ F implies Ā ∈ F.
3. If A1 , A2 , . . . ∈ F is a sequence of sets then
A1 ∪ A2 ∪ . . . ∈ F.
¹From the two conditions in Def. 1, one can easily derive that P(Ā) = 1 − P(A).
The most important example of a σ-algebra, which is needed in the sequel,
is the σ-algebra that is generated by a set E ⊆ 2Ω . We define the smallest
σ-algebra that contains E by
σ(E) := ∩{F ⊃ E : F is a σ-algebra}.
Example 14: Generated σ-algebra
For simplicity, we consider the finite set Ω = {1, 2, . . . , 6}. Let E =
{{2, 6}, {5, 6}}. Then
σ(E) = {{2, 6}, {5, 6}, {1, 3, 4, 5}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {6}, . . .}.
If Ω is finite, the idea is to construct a sequence of sets in an iterative
fashion, where we start with E and obtain the next set from the previous
one by joining and complementing elements of the current set. If no
new element can be constructed by union or complement operations, the
current set equals σ(E).
Next, we assume Ω = R and
E = {(a, b] : a, b ∈ R, a ≤ b}.
The σ-algebra B := σ(E) is called the Borel algebra on the reals. It contains
all subsets, called Borel sets, of 2R that can be obtained from E by countable
union and complement operations. Note that also intervals of the form
(a, b) or [a, b] are Borel sets. A similar construction is possible for Ω = Rn .
Intuitively, the Borel sets are those sets, for which we can assign a “volume”,
“area” or “size”. Subsets of Rn that are not Borel sets are only of theoretical
interest since for practical applications they are not of importance.
We are now able to give a more general definition of a probability space (i.e.
also for sample sets with uncountably many elements).
Definition 10: Probability Space
Let Ω be a nonempty set and let F ⊆ 2Ω be a σ-algebra. A probability space
is a triple (Ω, F, P ) where the probability measure P : F → [0, 1] is such
that
• P (Ω) = 1 and,
• if A1 , A2 , . . . ∈ F is a sequence of pairwise disjoint sets, then
P (A1 ∪ A2 ∪ . . .) = P (A1 ) + P (A2 ) + . . . .
Example 15: Discrete Probability Space
Any discrete probability space (Ω, 2Ω , P ) fulfills Def. 10, since 2Ω is a
σ-algebra and P is as in Def. 1.
Example 16: One-dimensional Interval
Let Ω = R, F = B and 0 ≤ a < b, a, b ∈ R. Consider a probability measure P : F → [0, 1] such that
P({ω ∈ Ω | x < ω ≤ y}) = (min(y, b) − max(x, a)) / (b − a) if max(x, a) < min(y, b), and 0 otherwise.
Similar to how we extended the set E = {(a, b] : a, b ∈ R, a ≤ b} to a
σ-algebra, we can show that if P is a probability measure and defined as
above for all half-open intervals, its value for the remaining sets in B is
uniquely determined.
2.2 Continuous random variables
Definition 11: Real-valued Random Variable
Let (Ω, F, P) be a probability space. A real-valued random variable on (Ω, F, P) is a function X : Ω → R such that for all A ∈ B
X⁻¹(A) = {ω ∈ Ω | X(ω) ∈ A} ∈ F.

[Figure 2.1: Is the inverse image of A w.r.t. X an element of F?]

The above definition ensures that if we want to know the probability that X is in some Borel set A, we can consider the inverse image of A with respect to X, for which we know its probability. Clearly, we can define a probability measure PX : B → [0, 1] by setting PX(A) := P({ω | X(ω) ∈ A}) = P(X⁻¹(A)) and use similar notations as in the discrete case (e.g. P(a < X ≤ b)).
Definition 12: Cumulative Probability Distribution
Let X be a real-valued random variable on (Ω, F, P ). The function F :
R → [0, 1] with x 7→ F (x) := P (X ≤ x) is called the cumulative probability
distribution of X.
Example 17: One-dimensional Interval
Assume that X is a randomly chosen point in the interval [a, b] and (Ω, F, P) is as in Ex. 16. Then
F(y) = P(X ≤ y) = (y − a)/(b − a) if y ∈ [a, b],  1 if y > b,  and 0 otherwise.
We call a random variable X on (Ω, F, P) discrete if X(Ω) is a discrete set (finite or countably infinite). We call X continuous if X(Ω) contains uncountably many elements and there exists a non-negative and integrable function f : R → R≥0, called density, with
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy.
Clearly,
∫_{−∞}^{∞} f(y) dy = 1
since F(∞) = 1, but note that P(X = y) ≠ f(y).
Example 18: One-dimensional Interval
Assume that X is as in Ex. 17. Then X is a continuous random variable since the (constant) function f with
f(y) = 1/(b − a) if y ∈ [a, b], and 0 otherwise
is the density of X. We can verify this by calculating
F(x) = ∫_{−∞}^{x} f(y) dy = ∫_{a}^{x} 1/(b − a) dy = (x − a)/(b − a) = P(X ≤ x)
for x ∈ [a, b], F(x) = 0 for x < a, and F(x) = 1 for x > b. Figure 2.2 shows a plot of the functions f and F.
The distribution in the above example is called the (continuous) uniform
distribution.
The expectation and variance of a continuous random variable X with density f are defined as
E(X) = ∫_{−∞}^{∞} x · f(x) dx and V(X) = ∫_{−∞}^{∞} (x − E(X))² · f(x) dx.
[Figure 2.2: Density and cumulative probability distribution of a random variable X.]
Note that these integrals may not exist, in which case the expectation/variance is undefined.
Many results for discrete random variables carry over to the continuous setting. For instance, properties of the expectation and variance (e.g. E(X +
Y ) = E(X) + E(Y )) or results concerning the connection of random variables, joint distributions, stochastic independence, etc. We do not discuss
them here, as they are very similar to the results presented above. Moreover,
in the sequel, we will mostly work with discrete random variables.
2.3 Important continuous distributions
Besides the uniform distribution that we have seen above, there are some
other important continuous distributions:
Uniform Distribution
We already described the uniform distribution in the examples above. In
summary we get the following properties if X is uniformly distributed on
the interval (a, b) (denoted by X ∼ U (a, b)):
• Its density is constant on the interval (a, b), that is, f (x) = 1/(b − a)
for x ∈ (a, b) and f (x) = 0 otherwise.
• If we want to know the probability that X falls into a subinterval of (a, b) of length d, we get
P(X ∈ (c, c + d)) = d/(b − a),  (c, c + d) ⊆ (a, b),
which means that it is proportional to the length d (and independent of c, cf. Figure 2.2).
• The expectation is given by the midpoint of the interval, E(X) = (a + b)/2.
Exponential Distribution
Let λ > 0. We say that a continuous random variable X is exponentially distributed with parameter λ (denoted by X ∼ Exp(λ)) if the density of X is such that, for t ∈ R,
f(t) = λ · e^(−λt) if t ≥ 0, and f(t) = 0 otherwise.
The cumulative probability distribution of X is then given by
F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{0}^{x} λ · e^(−λt) dt = 1 − e^(−λx) if x ≥ 0, and F(x) = 0 otherwise.
The exponential distribution is often used to describe waiting times or interarrival times: in many real-world systems where a sequence of rare events plays an important role, the time between these events is assumed to be exponentially distributed. For instance, the time until a radioactive decay is often modelled this way. This is related to the fact that these events are assumed to occur spontaneously and that the exponential distribution has the so-called memoryless property:
Assume that the random variable X describes a waiting time, we already know that X ≥ t, and we ask for the probability that X ≥ t + h (we have to wait for an additional h time units). Then, if X is exponentially distributed, we can verify that
P(X ≥ t + h | X ≥ t) = P(X ≥ h),  t, h > 0.
The exponential distribution is the only continuous distribution that is memoryless. In fact, it is possible to derive from the memoryless property that the distribution of X must be exponential. Similarly, one can show that the geometric distribution is the only discrete distribution that is memoryless.
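A quick numerical check of the memoryless property in MATLAB (λ, t, and h are arbitrary illustrative values):

% Memoryless property: P(X >= t+h | X >= t) = P(X >= h) for X ~ Exp(lambda).
lambda = 0.5;  t = 2;  h = 3;          % illustrative values
S = @(x) exp(-lambda*x);               % survival function P(X >= x)
lhs = S(t + h) / S(t)                  % conditional probability
rhs = S(h)                             % equal, since the exponentials cancel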
Example 19: Bus waiting paradox
Assume that Bob arrives at a bus stop where buses arrive on average
every 10 min. The interarrival times X of the buses are exponentially
distributed. Bob asks one of the people waiting there how long he has already been waiting. Whatever the answer of that person is (e.g. t = 20 min or 1 min), it does not change the probability that Bob has to wait h minutes for the next bus.
The following property of exponentially distributed random variables gives rise to an algorithm for the generation of exponentially distributed numbers (realizations of X), the inverse transform method: if U ∼ U(0, 1) then X ∼ Exp(λ), where X = − ln(U)/λ. This can be verified as follows. We have the cumulative distribution F(x) = 1 − e^(−λx) for X. Thus, y = F(x) is a number between 0 and 1. Solving for x yields
y := F(x) = 1 − e^(−λx)
⇐⇒ 1 − y = e^(−λx)
⇐⇒ ln(1 − y) = −λx
⇐⇒ x = −(1/λ) · ln(1 − y) = F⁻¹(y).
Thus, if U is uniformly distributed on (0, 1), the random variable X = F⁻¹(U) = −(1/λ) · ln(1 − U) must be exponentially distributed with parameter λ. Moreover, since 1 − U has the same distribution as U, the random variable X = −(1/λ) · ln(U) must also be exponentially distributed with parameter λ.
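A minimal MATLAB sketch of this generation procedure:

% Exponential variates via the inverse transform method: X = -log(U)/lambda.
lambda = 2;
u = rand(1, 1e5);                      % uniform random numbers on (0,1)
x = -log(u) / lambda;                  % exponentially distributed realizations
mean(x)                                % close to E(X) = 1/lambda = 0.5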
Normal distribution
The normal distribution is one of the most important continuous distributions since it naturally arises when we sum up many random variables. Therefore, measurement errors and fluctuations are often (approximately) normally distributed. Also measurements of the length or size of certain objects often give normally distributed results.
A normally distributed random variable X with mean µ and variance σ² (denoted by X ∼ N(µ, σ²)) has the density
f(x) = 1/(σ√(2π)) · exp(−(x − µ)²/(2σ²)).
The density of the standardized normal distribution with mean µ = 0 and variance σ² = 1 is then defined as
φ(x) = 1/√(2π) · e^(−x²/2)
and its cumulative distribution function is
Φ(x) = ∫_{−∞}^{x} 1/√(2π) · e^(−z²/2) dz.
Assume we are interested in the probability that X ∼ N (µ, σ 2 ) falls into a
certain interval (a, b). Since the solution of the above integral is difficult to
compute, we can standardise X and use the fact that we have a table with
the values for Φ.
Definition 13: Standardized Random Variables.
For any random variable X with finite E(X) = µ and finite VAR(X) = σ 2 ,
we define its standardised version
X* = (X − µ)/σ.
Then E(X*) = 0 and VAR(X*) = 1 since
E(X*) = E((X − µ)/σ) = (E(X) − µ)/σ = 0
and
VAR(X*) = VAR((X − µ)/σ) = VAR(X)/σ² = 1.
For X ∼ N(µ, σ²) this means that we can find the cumulative distribution of X* in a standard normal table and use it to compute that of X. Using X* = (X − µ)/σ ⇐⇒ X = X*σ + µ, we get
P(X ≤ x) = P(X*σ + µ ≤ x) = P(X* ≤ (x − µ)/σ).
Example 20: Computing Normal Probabilities
The average salary of an employee in Germany is µ = 30432 Euros per year (in 2012, gross). Assuming a normal distribution and a standard deviation of σ = 10000 Euros, we can compute the proportion of employees that earn between, say, 20 000 and 40 000 Euros per year as follows:
P(20000 < X < 40000) = P((20000 − µ)/σ < (X − µ)/σ < (40000 − µ)/σ)
= P(−1.0432 < X* < 0.9568)
= Φ(0.9568) − Φ(−1.0432)
≈ 0.83147 − 0.14917 = 0.6823,
where we used that Φ(−1.0432) = 1 − Φ(1.0432).
Next, we can compute the income limit of the poorest 3% of the employees, i.e. find x such that P(X < x) = 0.03:
P(X < x) = P(X* < (x − µ)/σ) = Φ((x − µ)/σ) = 0.03.
Thus,
x = µ + σ · Φ⁻¹(0.03) = 30432 + 10000 · (−1.88) = 11632,
where we used that Φ(−1.88) ≈ 0.03.
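These values can be reproduced in MATLAB; a sketch expressing Φ via the built-in error function erf (and its inverse via erfinv), so no toolbox functions are needed:

% Normal probabilities of Example 20 via the error function.
Phi = @(x) 0.5 * (1 + erf(x / sqrt(2)));            % standard normal cdf
mu = 30432;  sigma = 10000;
Phi((40000 - mu)/sigma) - Phi((20000 - mu)/sigma)   % approx. 0.6823
mu + sigma * sqrt(2) * erfinv(2*0.03 - 1)           % 3% limit, approx. 11624
                                % (Example 20 rounds Phi^{-1}(0.03) to -1.88)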
Finally, we note that if a random variable X is normally distributed with
expectation µ̂ and variance σ̂ 2 , written X ∼ N (µ̂, σ̂ 2 ), then, for a, b ∈ R,
aX + b has distribution N (aµ̂ + b, (aσ̂)2 ).
Gamma distribution
The Gamma distribution generalizes a number of other distributions (e.g. the exponential distribution) and has a rather flexible shape compared to other continuous distributions. Therefore it can be used to fit very different types of data.
A random variable X is Gamma distributed with parameters α and β (written X ∼ Gamma(α, β)) if its density is given by
f(x) = 1/(Γ(α) · β^α) · x^(α−1) · exp(−x/β),  x > 0, α > 0, β > 0,
and f(x) = 0 whenever x ≤ 0. Thus, it assigns positive probability only to positive values of X, and the parameters α (for the shape) and β (for the scale) must be positive. The gamma function Γ(α) is used as a normalization factor of the density. Often α is a positive integer and then Γ(α) = (α − 1)!, which means that the density (and its integral) is easy to evaluate (for plots of Gamma densities please visit https://upload.wikimedia.org/wikipedia/commons/e/e6/Gamma_distribution_pdf.svg).
How do the parameters α and β influence the mean and the variance of the distribution? It can be shown that the mean of X ∼ Gamma(α, β) is µ = αβ and the variance is σ² = αβ². Further interesting properties of the Gamma distribution are:
• For large α and β, the distribution is very similar to a normal distribution (with mean µ = αβ and variance σ² = αβ²).
• If α = 1 and β = 1/λ, then X ∼ Exp(λ).
• If α = ν/2 and β = 2, then the distribution is a chi-squared distribution with ν degrees of freedom (where ν is a positive integer). In that case, we have the distribution of a sum of the squares of ν independent standard normal random variables. The chi-squared distribution is often used for hypothesis testing (goodness of fit tests).
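A small MATLAB sketch checking numerically that the Gamma density integrates to one and has mean αβ (the values α = 3 and β = 2 are illustrative):

% Gamma(alpha, beta) density: normalization and mean (illustrative values).
alpha = 3;  beta = 2;
f = @(x) 1/(gamma(alpha)*beta^alpha) .* x.^(alpha-1) .* exp(-x/beta);
integral(f, 0, Inf)                    % 1: f is a density
integral(@(x) x .* f(x), 0, Inf)       % alpha*beta = 6: the mean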
2.4 Laws of Large Numbers
Consider n realizations x1 = X(ω1), x2 = X(ω2), . . . , xn = X(ωn) of a random variable X with finite expectation µ. They could, for instance, be generated by repeating an experiment n times under equal conditions. Intuitively, for large n the sample mean
x̄ = (1/n) Σ_{i=1}^{n} xi ≈ µ.
In order to determine the quality of the above approximation, we recall Chebyshev's Inequality.
2.4.1 Chebyshev's inequality
Let Y be a random variable with finite expectation E(Y) and finite variance VAR(Y). Assume that besides E(Y) and VAR(Y), nothing is known about Y (e.g. the cumulative probability distribution). Our aim is to reason about the deviation of Y from its expectation. According to Chebyshev's Inequality (illustrated in the figure below), for any a > 0,
P(|Y − E(Y)| ≥ a) ≤ VAR(Y)/a².   (2.1)
Note that the proof of the theorem is quite straightforward.

[Figure: the event {ω ∈ Ω | |Y(ω) − µ| < a} corresponds to the interval (µ − a, µ + a) around the expectation µ.]

2.4.2 Weak Law of Large Numbers
In order to apply the inequality above, we assume that x1, x2, . . . , xn are realizations of independent random variables X1, X2, . . . , Xn, all having the same distribution as X (see also Def. 8). Define
Zn = (1/n) Σ_{i=1}^{n} Xi.
Then E(Zn) = (1/n) Σ_{i=1}^{n} E(Xi) = µ and, if X has finite variance VAR(X) = σ²,
VAR(Zn) = VAR((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} VAR(Xi) = n · σ² / n² = σ²/n.
(See also the properties of the variance operator listed in Section 1.2.1.) Thus, Eq. (2.1) gives us, for any ε > 0,
P(|Zn − µ| ≥ ε) ≤ σ²/(n · ε²).
If ε is given, we can make σ²/(n · ε²) arbitrarily small by increasing n. Thus, for any ε > 0,
lim_{n→∞} P(|Zn − µ| ≥ ε) = 0   (or, equivalently, lim_{n→∞} P(|Zn − µ| < ε) = 1),
which is known as the weak law of large numbers². Here, the sequence {Zn}_{n≥1} of random variables converges "weakly" since only the corresponding probabilities converge.
²Note that, as opposed to the strong law of large numbers, for the weak law of large numbers pairwise independence of X1, . . . , Xn is already sufficient and (full) independence is not necessary.
Example 21: Bernoulli Trials
Consider a sequence of n Bernoulli trials and for 1 ≤ j ≤ n, let Xj be the random variable that is 1 if the j-th trial is a success and 0 otherwise. Moreover, let P(Xj = 1) = p for all j. Then E(X1) = E(X2) = . . . = E(Xn) = p. According to the weak law of large numbers, for any ε > 0,
P(|(1/n) Σ_{j=1}^{n} Xj − p| < ε) → 1
as n → ∞.
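This convergence is easy to observe in simulation; a minimal MATLAB sketch plotting the running sample mean of Bernoulli trials (p = 0.3 is an illustrative value):

% Weak law of large numbers: running mean of Bernoulli(p) trials.
p = 0.3;  n = 1e4;
x  = rand(1, n) < p;                   % n Bernoulli trials (0/1)
Zn = cumsum(x) ./ (1:n);               % running sample means Z_1, ..., Z_n
plot(1:n, Zn); hold on;
plot([1 n], [p p], 'r--');             % Zn settles around p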
2.4.3 Strong Law of Large Numbers
An even stronger result than the weak law of large numbers is provided by the strong law of large numbers, which states that
P(lim_{n→∞} |Zn − µ| = 0) = 1.
Here, we measure the probability of all outcomes ω ∈ Ω with lim_{n→∞} |Zn(ω) − µ| = 0 and find that this event occurs with probability one. The strong law of large numbers implies the weak law of large numbers.
We have seen two laws that match our initial intuition that
x̄ = (1/n) Σ_{i=1}^{n} xi ≈ µ.
In the following we will see that even more can be derived about sums of random variables.
2.4.4 Central Limit Theorem
Recall that Zn = (1/n) Σ_{i=1}^{n} Xi, where the Xi are independent and identically distributed random variables with finite expectation µ and finite variance σ² > 0. According to the central limit theorem, the distribution of
Zn* = (Zn − E(Zn)) / √VAR(Zn) = (Zn − µ) / (σ/√n)
converges to the normal distribution with expectation 0 and variance 1 (standard normal distribution). Thus, for x, y ∈ R, x < y,
lim_{n→∞} P(x < Zn* < y) = (1/√(2π)) ∫_{x}^{y} e^(−0.5t²) dt
and Zn is approximately normally distributed with expectation µ and variance σ²/n, since Zn = (σ/√n) · Zn* + µ and Zn* ∼ N(0, 1).

[Figure 2.3: The distribution of Zn* approaches the standard normal distribution as n increases (panels for n = 4, n = 10, and n = 50).]
Example 22: Approximation of the binomial distribution
We have already seen that for a sequence of n independent Bernoulli trials the number Sn of successes follows a binomial distribution. We can write Sn as
Sn = X1 + . . . + Xn,
where X1, . . . , Xn are independent Bernoulli random variables with parameter p. If n is large, we can approximate the binomial distribution of Sn by a normal distribution. The central limit theorem tells us that Zn = (1/n) Sn is approximately normally distributed with mean p (of the Bernoulli random variables) and variance p(1 − p)/n. Therefore Sn must approximately follow the distribution N(np, np(1 − p)). Fig. 2.3 shows a plot of the cumulative probability distribution of Zn*. It can be seen that the distribution (in yellow) approaches N(0, 1) (red curve) as n increases.
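The effect can be reproduced by simulation; a minimal MATLAB sketch comparing the standardized means with N(0, 1) (the parameter values are illustrative):

% Central limit theorem: standardized means of Bernoulli trials.
p = 0.3;  n = 50;  runs = 1e4;
Sn = sum(rand(n, runs) < p);                 % binomial counts, one per run
ZnStar = (Sn/n - p) ./ sqrt(p*(1-p)/n);      % standardized sample means
histogram(ZnStar, 'Normalization', 'pdf');   % compare with the N(0,1) density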
2.5 Exercises
Easy
Exercise 18: Probability Density Function Find the value of the constant c ∈ R such that the function f is the density of a continuous random variable X (or the joint density of X and Y), or argue why there is no such constant. Consider the following functions:
• Case 1: f(x) = c · x if 0 ≤ x ≤ 1, and f(x) = 0 otherwise,
• Case 2: f(x) = 1 + c · x if 0 ≤ x ≤ 3, and f(x) = 0 otherwise,
• Case 3: f(x, y) = c · e^(−(x+2y)) if 0 ≤ x < ∞ and 0 ≤ y < ∞, and f(x, y) = 0 otherwise.
For cases 1 and 2, calculate the expected value E(X) if it is possible.
Exercise 19: Exponential Distribution Assume that the lifetime T (in 1000 hour units) of a certain facility (e.g. a piece of hardware) has the exponential distribution with parameter λ = 1/2. Find the following:
1. The average lifetime
2. P (T > 3)
3. P (T ≤ 2)
4. P (1 ≤ T ≤ 4)
Exercise 20: Chebyshev inequality Suppose that X is exponentially
distributed with parameter λ > 0. Compute the true value and the Chebyshev bound for the probability that X is at least k standard deviations away
from the mean. Plot both as a function of k.
Exercise 21: Exponential distribution An advertisement agency assumes that the time TV viewers remember the content of a commercial is
exponentially distributed with parameter λ = 0.25 (days). What is the portion of TV viewers that are able to remember the content of the commercial
after 7 days?
Exercise 22: Inverse transform method Prove that for a continuous
random variable X, for which the inverse F −1 (x) of the cumulative distribution function exists, the following holds:
If U ∼ U (0, 1) and Y = F −1 (U ) then Y has the same distribution as X.
Medium
Exercise 23: Normal distribution Assume that Z is normally distributed with mean −0.5 and standard deviation 2. Determine the following
values and probabilities.
a) Determine p = P (Z ∗ < 0.3), where Z ∗ denotes the standardized random variable of Z.
b) Determine p = P (Z ≤ 0).
c) Determine z with P (Z ≤ z) = 0.271.
d) Determine p = P (|Z| > 0.1).
Exercise 24: Inverse Transform Method
a) Prove that for a continuous random variable X, for which the inverse
F −1 (x) of the cumulative distribution function exists, the following
holds:
If U ∼ U (0, 1) and Y = F −1 (U ) then Y has the same distribution as
X.
b) The cumulative distribution function of the Pareto distribution is given as
F(x) = 1 − (xm/x)^α if x ≥ xm, and F(x) = 0 if x < xm,
where x ∈ R and α, xm > 0.
i) Use the transform method in a) to determine a procedure to generate a Pareto distributed random variate if U ∼ U[0, 1] is given.
ii) If Y is an exponentially distributed random variable with rate α, then xm · exp(Y) is Pareto distributed with parameters α and xm as above. Describe an alternative procedure for generating a Pareto distributed random variate when an exponentially distributed random variate Y is given.
Exercise 25: Monte-Carlo Simulation Consider the generalized exponential distribution f(x) = c · e^(−x³ + 1.5x²), 0 ≤ x ≤ 3. Write a MATLAB or R function that computes the value of c such that f(x) is a density, and compute the expectation of the above distribution by simulating uniformly distributed numbers u ∼ U[0, 3]², without applying the inverse transform method.
Difficult
Exercise 26: Memoryless Property Prove that the exponential distribution is the only continuous distribution with the memoryless property.
Exercise 27: Transformation of Random Variables Assume that
X and Y are independent and identically distributed exponential random
variables. Show that the conditional distribution of X given X + Y = t (i.e.
P (X > x|X + Y = t)) is the uniform distribution on (0, t).
Chapter 3
Discrete-time Markov Chains
A Markov chain is a countable family of random variables that take values
in a discrete set S (called state space) obeying the Markov property. They
typically describe the temporal dynamics of a process (or system) over time
and we can distinguish between chains that act in continuous time and chains
that act in discrete time.
3.1 Discrete-Time Markov Chains
A discrete-time Markov chain (DTMC) is a family of random variables (Xn)n∈N0, where n is a discrete index, Xn : Ω → S, and X fulfils the Markov property
P(Xn+1 = y | Xn = xn, Xn−1 = xn−1, . . . , X0 = x0) = P(Xn+1 = y | Xn = xn)
for all n, y, x0, . . . , xn. The Markov property expresses the assumption that the information of the present (i.e., Xn = xn) is relevant to predict the future behaviour of the system, whereas additional information about the past (Xj = xj, j ≤ n − 1) is irrelevant.
W.l.o.g. we will assume that S = N¹ and that the transition probabilities pij = P(Xn+1 = j | Xn = i) do not depend on n (time homogeneity). Thus, we can arrange the transition probabilities in a matrix P = (pij)i,j∈{1,2,...}. The matrix P is called the transition probability matrix of the DTMC X.
¹We can simply enumerate all states in S such that S = {s1, s2, . . .} and then just write X = i instead of X = si.
3.1.1 Transient distribution
Let p0 be the row vector that contains the initial distribution at time zero, i.e., the entries P(X0 = x) for all x. We now find that p1 = p0 · P since
P(X1 = x) = Σ_{y∈S} P(X1 = x | X0 = y) · P(X0 = y).
Successive application of this argument yields
pn = pn−1 · P = . . . = p0 · P^n,
that is, the probabilities after n steps are obtained by n multiplications with P. Note that P is a stochastic matrix, i.e., the row sums are one and it has only non-negative entries. If we consider P · . . . · P = P^n then this matrix is stochastic as well and contains for each state the probability of reaching another state after n steps. Therefore, in the sequel we may sometimes write P^(n) instead of P^n.
Example 23: Discrete-time Markov chain
Assume we want to describe the weather over time and we only distinguish between sunny, cloudy and rainy weather. Thus, S = {1, 2, 3} where 1 stands for sunny, 2 for cloudy, and 3 for rainy weather. Assume further we know that with probability 0.7 a sunny day is followed by another sunny day, with probability 0.2 by a cloudy day, and with probability 0.1 by a rainy day. Similarly, we can make assumptions about the day after a cloudy day or a rainy day. Note that we assume that the distribution for the next day is only determined by the weather on the current day and nothing else, in particular not by the weather in the past.
Figure 3.1 shows the underlying graph of the DTMC X, which consists of the state space S (nodes of the graph) and the possible transitions annotated with the transition probabilities. The matrix

      ( 0.7   0.2   0.1  )
P  =  ( 0.25  0.5   0.25 )
      ( 0.1   0.2   0.7  )

contains the transition probabilities of X. Let p0 = (1, 0, 0) be the initial distribution of the chain. Then, after two steps,
p2 = p0 · P² = (0.55, 0.26, 0.19),
that is, for instance, P(X2 = 2) = 0.26.

[Figure 3.1: The underlying graph of a DTMC.]
Given P and p0, we have seen two ways of computing the vector pn of state probabilities after n steps, called the transient distribution after n steps. Either we successively compute p1, p2, . . . , pn by multiplying the vector with P, or we compute the matrix power P^n and multiply it with p0. The former approach is usually preferable: if P is large (an N × N matrix), it requires significantly less computational work (naive matrix-matrix multiplication is O(N^3) while vector-matrix multiplication is O(N^2)) and less memory. Note that even if P is a sparse matrix, the powers of P will typically not be sparse.
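For illustration, a minimal MATLAB sketch of the vector-based approach for the weather chain of Example 23 (the step count and variable names are our choice):

    % transient distribution via repeated vector-matrix multiplication
    P = [0.7 0.2 0.1; 0.25 0.5 0.25; 0.1 0.2 0.7];
    p = [1 0 0];       % initial distribution p0
    n = 2;             % number of steps
    for k = 1:n
        p = p * P;     % one vector-matrix multiplication per step
    end
    p                  % yields (0.55, 0.26, 0.19), cf. Example 23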
3.1.2 Equilibrium distribution
Often one is interested in the distribution of a DTMC in the long run, i.e., when the number of steps approaches infinity. By multiplying the current distribution with P very often, we can approximate the limit

    lim_{n→∞} pn =: π,

called the equilibrium distribution of X. Its entries contain the long-run frequencies with which the process visits the states.
Example 24: Equilibrium distribution
In the weather example above, pn converges to the following vector after
a large number of multiplications with P :
π ≈ (0.3571, 0.2857, 0.3571).
Does this limit exist for every DTMC? Does it always converge to a distribution? The following example shows that there are cases where π does not
exist.
Example 25: Periodic chain
Assume that p0 = (1, 0, 0) and

    P = [ 0  1  0
          0  0  1
          1  0  0 ].

The underlying graph is the directed cycle 1 → 2 → 3 → 1.
Then the successive multiplication of p0 with P yields an alternating sequence of vectors p1 = (0, 1, 0), p2 = (0, 0, 1), p3 = (1, 0, 0) = p0, . . . . However, if we choose p0 = (1/3, 1/3, 1/3) then we get convergence to π = (1/3, 1/3, 1/3). As an alternative, we can consider the following limit (which always exists for periodic Markov chains) and which corresponds to the time average of the transient probability of each state:

    π̃ = lim_{n→∞} (1/n) Σ_{k=1}^{n} pk = (1/3, 1/3, 1/3).
The next example shows that sometimes, π exists but it is not a distribution.
Instead all entries converge to zero.
Example 26: Infinite non-recurrent chain
Consider the infinite-state DTMC illustrated by the graph below

[Graph: states 0, 1, 2, 3, . . . ; from state 0 the chain moves to state 1 with probability 1; from each state i ≥ 1 it moves right to i + 1 with probability 2/3 and left to i − 1 with probability 1/3.]

and assume that we start with probability one in state 0. In the infinite vectors p1, p2, . . . , the probability mass will move more and more to the (unbounded) right. The probability of returning to state 0 in finite time is not one. The limit lim_{n→∞} pn yields a vector with all (infinitely many) entries equal to zero.
Another problem is that the limit π may depend on the initial distribution
p0 . In particular, the underlying graph of the DTMC may contain several
strongly connected components. If some of these components are bottom,
they cannot be left.
Example 27: DTMCs with strongly connected components
Consider the two DTMCs illustrated by the graph below.
[Figure: two eight-state DTMCs. In both, the states are partitioned into three SCCs indicated by the colours red, yellow and green; in the left DTMC only the red SCC {6, 7} is bottom, while in the right DTMC both the red SCC {6, 7} and the green SCC {3, 4, 8} are bottom.]
Both contain three strongly connected components (SCCs), indicated by the three colours red, yellow and green. However, the DTMC on the left contains only one SCC that is bottom, i.e., that cannot be left (red states), while in the case of the DTMC on the right there are two SCCs that are bottom (red and green). In the left case it does not matter which initial distribution we choose: in the limit we will always remain in the states 6 and 7 and switch between these two states. On average we will spend half of the time in state 6 and the other half in state 7. Thus, π_left = (0, 0, 0, 0, 0, 1/2, 1/2, 0). In the right case, however, we have a dependence on the initial distribution. For instance, if p0 = (0, 0, 0, 0, 0, 1, 0, 0) then we get the same result as for the left DTMC, while for p0 = (0, 0, 1, 0, 0, 0, 0, 0) we get an equilibrium distribution π_right that only assigns positive probability to the states 3, 4 and 8. If, for example, p0 = (1, 0, . . .) then the states 3, 4, 6, 7 and 8 will get a positive equilibrium probability. We will revisit this example in the sequel after discussing how π_right can be computed in that case.
The examples above motivate the following definitions.
Definition 14: Irreducibility
A DTMC is called irreducible if any state can be reached from any other state, i.e., for all i, j ∈ S there exist n, n′ ∈ N0 such that Pij^(n) > 0 and Pji^(n′) > 0. Otherwise we call the DTMC reducible.
Example 28: Irreducible and reducible DTMCs
The two DTMCs in Example 27 are reducible. The DTMCs in Examples 23, 25 and 26 are all irreducible. Note that a DTMC with |S| > 1 that contains a state i with pii = 1 is always reducible.
Definition 15: Aperiodic DTMC
We define the period of a state i as

    p = gcd{ n ≥ 1 : P(X(n) = i | X(0) = i) > 0 }

(the gcd is taken over all n for which the chain can return from state i to i in n steps), where gcd is the greatest common divisor. A state is called aperiodic if its period is 1. A Markov chain is called aperiodic if all its states are aperiodic.
In the sequel, we will only consider aperiodic DTMCs.
Equilibrium of finite irreducible DTMCs
Based on the two definitions above we can state an important result in
Markov chain theory.
Let (X(t), t ∈ N0) be a finite-state Markov chain that is irreducible (and aperiodic). Then the equilibrium distribution π exists and fulfils the equation system

    π = π · P,    Σ_{i∈S} πi = 1.

Moreover, in this case the above equation system always has a unique solution. Thus, we can compute π by solving the equation system. The condition π = π · P is often referred to as the global balance equations, since it states that in the limit the probability flow Σ_j πj pji into state i equals the flow Σ_j πi pij out of state i, i.e.

    Σ_j πj pji = πi = πi Σ_j pij   ⇐⇒   Σ_{j≠i} πj pji = πi Σ_{j≠i} pij.
Example 29: Equilibrium distribution
Consider again the DTMC in Example 23. We can compute the solution of the above equation system by rewriting π = π · P into π(P − I) = 0, where I is the identity matrix of size 3 × 3. Note that the matrix A = (P − I)^T (where T refers to the transpose of a matrix) has rank 2 and contains redundant information (column sums are zero because row sums of P are one). We can therefore ignore one row of A and instead incorporate the normalization condition Σ_{i∈S} πi = 1 by replacing the removed row by a vector of ones. Thus, we solve

    [ −0.3  0.25  0.1 ]   [ π1 ]   [ 0 ]
    [  0.2  −0.5  0.2 ] · [ π2 ] = [ 0 ].
    [   1     1    1  ]   [ π3 ]   [ 1 ]

Note that the last equation of the system ensures that Σ_{i∈S} πi = 1. The solution is (cf. Example 24) π ≈ (0.3571, 0.2857, 0.3571).
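In MATLAB, this computation can be sketched as follows (a direct transcription of the equation system above; the variable names are ours):

    % equilibrium distribution of the weather DTMC (Example 23)
    P = [0.7 0.2 0.1; 0.25 0.5 0.25; 0.1 0.2 0.7];
    A = (P - eye(3))';     % transpose of P - I
    A(3,:) = [1 1 1];      % replace one row by the normalization condition
    b = [0; 0; 1];
    pi_eq = A \ b          % approx. (0.3571, 0.2857, 0.3571)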
Equilibrium of finite reducible DTMCs
If X is reducible (and aperiodic), the equilibrium distribution π depends on
the initial distribution p0 and requires to analyse the underlying graph of
X.
Definition 16: Bottom strongly connected component (BSCC)
A set T of states is strongly connected if, for each pair of states i and j in
T , j is reachable from i passing only through states in T .
A strongly connected component (SCC) is a maximally strongly connected
set of states, i.e. no superset of it is also strongly connected.
A bottom strongly connected component (BSCC) is an SCC T from which
no state outside T is reachable from T .
The equilibrium distribution π can be computed by carrying out the following steps:
1. Find all bottom strongly connected components B1, . . . , Bk in the underlying graph of X.
2. Compute the vectors pe1, . . . , pek of the probabilities of eventually entering B1, . . . , Bk, as defined below.
3. Determine the equilibrium distributions πB1, . . . , πBk of the components considered in isolation.
Given pe1, . . . , pek and πB1, . . . , πBk, we can compute the equilibrium probability of a state i ∈ Bℓ as

    π(i) = Σ_j p0(j) peℓ(j) πBℓ(i),

where πBℓ is the equilibrium distribution of the BSCC Bℓ considered in isolation. The idea of the computation above is that we start initially in state j with probability p0(j) and then eventually enter the BSCC Bℓ from j with probability peℓ(j), and within Bℓ state i has the equilibrium probability πBℓ(i). Summation over all j gives us the final equilibrium probability of i. If i ∉ B1 ∪ . . . ∪ Bk then π(i) = 0, because eventually the process will enter one of the BSCCs, which it cannot leave. Note that since the BSCCs considered in isolation are irreducible and finite, we can compute πBℓ by solving the equation system

    πBℓ = πBℓ PBℓ,    Σ_{j∈Bℓ} πBℓ(j) = 1,

where PBℓ is the transition probability matrix restricted to the subset Bℓ.
For step 1 we have to find all BSCCs of X. This requires running a graph analysis algorithm such as Tarjan's algorithm.
For step 2 we have to solve the following system of equations for each BSCC Bℓ. We define a vector peℓ such that

    peℓ(i) = 0                                      if Bℓ is not reachable from i,
    peℓ(i) = 1                                      if i ∈ Bℓ,
    peℓ(i) = Σ_{j∉Bℓ} pij peℓ(j) + Σ_{j∈Bℓ} pij    otherwise.

It can be shown that peℓ(i) is the probability of eventually entering Bℓ when starting in state i. The last equation follows from the two cases that either we enter Bℓ in one step (second term) or we first visit a state j outside Bℓ and from there eventually reach Bℓ (first term). Note that we can alternatively write this as

    peℓ = peℓ P^T,   peℓ(i) = 1 if i ∈ Bℓ,   peℓ(i) = 0 if Bℓ is not reachable from i,     (3.1)

and incorporate the last two conditions into the system of equations, similarly as we incorporated the normalization condition Σ_i π(i) = 1 above.
Example 30: Equilibrium distribution of a reducible DTMC
Consider again the right DTMC in Example 27 and assume that the initial distribution is

    p0 = (1/2, 0, 1/2, 0, 0, 0, 0, 0).

The BSCCs are B1 = {3, 4, 8} and B2 = {6, 7} and thus π(1) = π(2) = π(5) = 0. For step 3 we find

    πB1 = (1/3, 1/3, 1/3),    πB2 = (1/2, 1/2)

by applying the approach for irreducible chains to B1 and B2. For step 2 we first consider the probability of reaching B1. We can aggregate all states of B1 and all states of B2 into one state each, since once we have reached one of these sets, the future evolution of the process is not of interest. Thus, we consider the aggregated DTMC below in order to compute pe1 and pe2.
[Graph: the aggregated DTMC over the transient states 1, 2 and 5 and the two absorbing aggregate states {3, 4, 8} and {6, 7}, each of which carries a self-loop with probability 1.]
Next we transform the equation system (3.1) into Ax = b and define the matrix A = (P − I) (now, P is the transition probability matrix of the aggregated chain). This matrix has two rows containing only zeros, corresponding to the two BSCCs. We replace these two rows by unit vectors that are zero everywhere except at the position of B1 (of B2, respectively), where they are equal to one. Then, by choosing for b a unit vector with a one at the position of Bℓ, we can incorporate the two constraints peℓ(i) = 1 if i ∈ Bℓ and peℓ(i) = 0 if Bℓ is not reachable from i. This yields pe1 = (0.4, 0.4, 1, 1, 0.2, 0, 0, 1) and pe2 = (0.6, 0.6, 0, 0, 0.8, 1, 1, 0). Multiplication with p0 gives us the probabilities of reaching B1 and B2 (after starting with initial distribution p0):

    p0 · pe1^T = 0.7,    p0 · pe2^T = 0.3.

Thus, we finally get

    π(1) = π(2) = π(5) = 0,    π(3) = π(4) = π(8) = 0.7 · 1/3,    π(6) = π(7) = 0.3 · 1/2.
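The row-replacement scheme used above can be written down generically. The following MATLAB sketch (our own transcription of the described procedure) computes the vectors peℓ for a finite DTMC, given the transition matrix P and the BSCCs as index sets:

    % eventual-entry probabilities into BSCC number ell, cf. Eq. (3.1)
    function pe = entry_probs(P, bsccs, ell)
    % P     : N x N transition probability matrix
    % bsccs : cell array of index vectors, one per BSCC
    % ell   : index of the target BSCC
    N = size(P, 1);
    A = P - eye(N);                   % row i encodes sum_j p_ij x_j - x_i = 0
    b = zeros(N, 1);
    for k = 1:length(bsccs)
        for i = bsccs{k}
            A(i,:) = 0; A(i,i) = 1;   % overwrite BSCC rows with x_i = ...
            b(i) = (k == ell);        % ... 1 inside B_ell, 0 in other BSCCs
        end
    end
    pe = (A \ b)';
    end

Applied to the full eight-state right DTMC of Example 27 with bsccs = {[3 4 8], [6 7]} and ell = 1, this should reproduce pe1 = (0.4, 0.4, 1, 1, 0.2, 0, 0, 1) from above.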
Equilibrium of infinite DTMCs
There are many applications for Markov chains with infinitely many states. Although the real system can only have a finite number of states, we often decide for a model with infinitely many states. The reason is that a priori bounds on certain countable quantities are often not known, and instead of assuming some (unrealistically high) bound it is easier to work with an infinite model, analyse it and find realistic bounds that are met with very high probability.
Example 31: Queuing System
Queuing systems are mathematical models that describe real-world systems where queuing occurs, for instance, network packets in the Internet, customers waiting to be served, or molecules that are produced and can degrade. Most queuing systems are continuous-time Markov chains, but it is also possible to design a DTMC that describes a queuing system, as can be seen below (assuming pλ + pµ ≤ 1).

[Graph: states 0, 1, 2, 3, . . . ; from state i the chain moves to i + 1 with probability pλ and, for i ≥ 1, to i − 1 with probability pµ; the remaining probability (1 − pλ in state 0 and 1 − pλ − pµ otherwise) is a self-loop.]

In each step, a customer arrives with probability pλ and a customer leaves with probability pµ. With probability 1 − pλ − pµ nothing happens during the current time step. Alternatively, we can see this chain as a random walk on the non-negative integers.
In the sequel we state sufficient conditions under which the equilibrium distribution exists. An important state property in this context is positive recurrence which, roughly speaking, holds for a state i if the process returns to i infinitely often and the expected time between two returns is finite. Formally, we define the first passage time (or hitting time)

    Tij = inf{ n ≥ 1 : Xn = j | X0 = i }

and the probability to enter state j from state i within n ≥ 1 steps

    Fij(n) = P(X1 = j ∨ X2 = j ∨ . . . ∨ Xn = j | X0 = i).
Definition 17: Recurrence
• A state i is called recurrent if Fii (∞) = 1, and transient otherwise.
• A recurrent state i is called positive recurrent if E(Tii ) < ∞.
• A recurrent state i is called null recurrent if E(Tii ) = ∞.
Note that if a finite chain is irreducible then all states are positive recurrent. Moreover, recurrence is a BSCC property, i.e., if a single state of a BSCC fulfils it, then all states of the BSCC fulfil it (for SCCs this might not be true). In addition, it is possible to show that if an irreducible (possibly infinite) chain has at least one positive recurrent state, then its equilibrium distribution exists.
Example 32: Recurrence
The states of the chain in Example 31 are all positive recurrent if pλ < pµ, null recurrent if pλ = pµ, and transient if pλ > pµ.
Next we find that positive recurrence allows us to compute the equilibrium
distribution by solving an equation system.
Let (X(t), t ∈ N0) be a (possibly infinite) DTMC in which all states are positive recurrent (and aperiodic). Then the equilibrium distribution π exists and is the unique solution of the following equation system:

    π = π · P,    Σ_{i∈S} πi = 1.
Note that aperiodic chains where all states are positive recurrent are also
called ergodic.
The above system of equations is infinite and therefore it is often solved by "guessing" a solution and showing that it fulfils the equations.
Example 33: Guessing the equilibrium distribution
For the chain in Example 31 we assume that pµ > pλ and consider the global balance equations:

    π0 pλ = π1 pµ
    π1 (pλ + pµ) = π0 pλ + π2 pµ
    . . .
    πi (pλ + pµ) = πi−1 pλ + πi+1 pµ
    . . .

Inserting the first equation into the second one yields

    π1 (pλ + pµ) = π1 pµ + π2 pµ   ⇐⇒   π1 pλ = π2 pµ.

Next, we can insert π1 pλ = π2 pµ into the next equation

    π2 (pλ + pµ) = π1 pλ + π3 pµ

(for state 2), which yields

    π2 (pλ + pµ) = π2 pµ + π3 pµ   ⇐⇒   π2 pλ = π3 pµ.
Thus, our guess is that

    π1 = (pλ/pµ) π0,    π2 = (pλ/pµ) π1 = (pλ/pµ)^2 π0,    . . . ,    πi = (pλ/pµ)^i π0,

and

    Σ_{i=0}^{∞} πi = 1   ⇐⇒   Σ_{i=0}^{∞} (pλ/pµ)^i π0 = 1   ⇐⇒   π0 = 1 / Σ_{i=0}^{∞} (pλ/pµ)^i,

which implies that π0 = 1 − pλ/pµ (geometric series). And indeed, inserting our guess into the global balance equations shows that it is a solution of the equations. Since we assumed that pµ > pλ, the DTMC is ergodic (see also Example 35); hence the solution is unique and we have guessed the correct values for π.
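Such a guess can also be sanity-checked numerically by truncating the infinite chain; a small MATLAB sketch (the truncation level and the parameter values are our choice):

    % numerical check of the geometric guess via truncation
    pl = 0.2; pm = 0.5;                 % p_lambda and p_mu with pm > pl
    K  = 200;                           % truncation level
    P  = diag((1-pl-pm)*ones(K+1,1)) + diag(pl*ones(K,1),1) + diag(pm*ones(K,1),-1);
    P(1,1) = 1 - pl;                    % state 0 has no departure
    P(K+1,K+1) = 1 - pm;                % reflecting upper boundary
    p = [1, zeros(1,K)];                % start in state 0
    for n = 1:10000, p = p * P; end     % approximate the limit distribution
    [p(1), 1 - pl/pm]                   % both entries should be approx. 0.6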
In general, ergodicity of infinite-state chains is undecidable. However, infinite-state chains that describe real-world systems (e.g. queuing systems) often have a very regular structure which allows a graph analysis, and all (B)SCCs can be determined. Then we can proceed in a similar way as for finite DTMCs. However, within an infinite BSCC it may still be the case that Fii(∞) < 1. Thus, the question remains how to check positive recurrence if a graph analysis tells us that the infinite chain is irreducible. Instead of reasoning about Fii(n) it is often easier to reason about the drift of the chain, which is usually defined in terms of a level function l : S → R+:

    dl(i) = E(l(X1) − l(X0) | X0 = i) = Σ_j pij (l(j) − l(i)).
The drift tells us the expected change of the level function in one step when starting in state i.
Example 34: Drift (1)
For the chain in Example 31 assume that the level of state i is i + 1, i.e. l(i) = i + 1. Then the drift for state i with i > 0 is

    dl(i) = Σ_j pij (l(j) − l(i)) = pλ · (+1) + pµ · (−1) = pλ − pµ.

The drift of state i = 0 is dl(0) = pλ.
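The drift can also be evaluated mechanically from a (truncated) transition matrix. A brief MATLAB sketch for the queue, with parameter values of our own choosing:

    % drift d_l(i) = sum_j p_ij (l(j) - l(i)) for the truncated queue
    pl = 0.2; pm = 0.5; K = 50;
    P  = diag((1-pl-pm)*ones(K+1,1)) + diag(pl*ones(K,1),1) + diag(pm*ones(K,1),-1);
    P(1,1) = 1 - pl; P(K+1,K+1) = 1 - pm;
    l  = (1:K+1)';           % row i+1 represents state i, so the level is i + 1
    d  = P * l - l;          % drift vector (distorted near the truncation boundary)
    [d(1), d(2), pl - pm]    % d_l(0) = p_lambda; interior drift = p_lambda - p_mu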
The following result provides a sufficient criterion for ergodicity.
Let X be an irreducible (possibly infinite) DTMC with state space S and
l : S → R+ such that the set {i ∈ S | l(i) ≤ k} is finite for any k < ∞. Then
X is ergodic if there exists a constant c > 0 and a finite subset C ⊂ S with
i) dl (i) ≤ −c for all i ∈ S \ C,
ii) dl (i) < ∞ for all i ∈ C.
Intuitively, the above result states that the drift may be larger than −c only in a finite number of states (the set C, in whose direction the chain drifts); in particular, the drift may be positive there. In all other states the drift must be ≤ −c, which means that the chain "drifts away" from these states.
It is important to note that the level function (also called Lyapunov function) must be such that the first k levels contain only finitely many states (see the condition "the set {i ∈ S | l(i) ≤ k} is finite for any k < ∞"). For instance, if the state space of X is S = {(i, j) | i, j ∈ N0} (a two-dimensional grid) then we can, for instance, use the level function l(i, j) = max{i, j, 1}.
Example 35: Drift (2)
For the chain in Example 31 we find that the function l with l(i) = i + 1 fulfils the criterion that {i ∈ S | l(i) ≤ k} is finite for any k < ∞. Moreover, if pµ > 1/2 then pλ < 1/2 and thus

    dl(i) = pλ − pµ < 0,    i = 1, 2, . . . .

Thus, state 0 is the only state with positive drift, dl(0) = pλ, and hence C = {0}. We conclude that in the case pµ > 1/2 the chain is ergodic. If, however, pλ = pµ = 1/2 or pλ > 1/2, then we have infinitely many states with drift 0 or even positive drift. (Note that we cannot conclude that the DTMC is not ergodic, since a different function l may exist that fulfils the ergodicity conditions.)
To summarize, we have considered infinite-state DTMCs and focused on
the case where the underlying graph is irreducible (the graph forms one
large BSCC). If the chain is ergodic, the global balance equations hold and
we can check ergodicity by using the drift arguments above. Checking the
reverse (all states are transient) is difficult, as is reasoning about infinite reducible DTMCs (with infinite SCCs or BSCCs). These cases are therefore omitted here (also because they are of little practical significance).

[Figure 3.2: The structure of the random web surfer model. From page j, each of the nj linked pages is reached with probability α/nj + (1 − α)/N and every other page with probability (1 − α)/N. Figure: RWTH Aachen and H. Hermanns, Saarland University.]
3.2 Case Study: PageRank
About 16 years ago the success story of Google started with the development
of Google’s PageRank algorithm, which was used to rank web pages according to their importance to the users. The theoretical basis of PageRank is
the random web surfer model which corresponds to a discrete-time Markov
chain.
Note that due to its simplicity and generality, PageRank is also used in
other domains, e.g. to rank genes according to a microarray experiment or
according to their importance w.r.t. a disease. Moreover, the ranking mechanism can be applied to any graph and yields a network centrality score that
reflects the importance of each node in light of the entire graph structure.
For the random web surfer model we let N be the total number of all (indexed) web pages. Assume that a web surfer starts initially at some random web page. Then she follows one of the links on that page (each link has the same probability) and visits the next web page. Then again she follows one of the links on that page and visits the next web page, and so on. What is the probability to be on page i after k clicks? What if k → ∞? The idea behind this model is that more important websites are likely to receive more links from other websites and should be ranked higher. The pages that are visited with high probability in the limit should be ranked first.
Obviously, we can describe this process with a DTMC where each state corresponds to a web page. The initial distribution of the chain is p0(j) = 1/N for all j ∈ {1, . . . , N}. The transition probability to go from page j to page i is pji = 1/nj if there is a link from j to page i, where nj is the number of links on page j (see also Figure 3.2, left). The probability of being on page
i after 1 click is therefore

    p1(i) = Σ_{j→i} (1/nj) p0(j),

where j → i means that there is a link from j to i. For k clicks we get the recursion

    pk(i) = Σ_{j→i} (1/nj) pk−1(j).
Obviously, the underlying graph may be reducible: the surfer may end up in a BSCC which cannot be left, since each page in it has only links to pages inside the BSCC but not to pages outside. In addition, the equilibrium distribution would give probability zero to all pages that are not part of a BSCC; thus all these pages could not be ranked among each other and would appear in the ranking after all BSCC pages. Moreover, it is not very realistic that a surfer always only follows links. The solution is to assume that in each step the surfer either follows a link with probability α or decides to start over with probability 1 − α and again randomly chooses a page from all N possible pages. The parameter 0 < α < 1 is called the damping factor. In Figure 3.2 (right) the outgoing transitions of a page are illustrated (incoming transitions are omitted). Now the chain is irreducible (there is a direct transition from any page to any other page with probability at least (1 − α)/N) and the equilibrium distribution is given by the solution of the equation system

    πi = Σ_{j→i} (α/nj) πj + (1 − α)/N,    i ∈ {1, 2, . . . , N},

where we used the normalization Σ_j πj = 1 to simplify the restart term.
This equation system is usually solved by applying iterative methods such as the power method, which successively multiplies a vector p with the transition probability matrix P, i.e. we iteratively compute p(k+1) = p(k) P for k = 0, 1, . . . until convergence. The initial vector can, for instance, be chosen as the uniform distribution that assigns probability 1/N to each page. Clearly, for very large graphs (N > 10,000,000) distributed algorithms have to be applied.
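A compact MATLAB sketch of the power method on a small hypothetical link graph (the adjacency matrix, the damping factor and the iteration count are our own choices):

    % PageRank via the power method on a toy graph
    L = [0 1 1 0; 0 0 1 0; 1 0 0 1; 0 0 1 0];  % L(j,i) = 1 iff page j links to page i
    N = size(L, 1); alpha = 0.85;
    n = sum(L, 2);                             % number of outgoing links per page
    P = alpha * diag(1 ./ n) * L + (1 - alpha) / N;
    p = ones(1, N) / N;                        % uniform initial vector
    for k = 1:100
        p = p * P;                             % power iteration
    end
    p                                          % approximate PageRank vector

(Every page of the toy graph has at least one outgoing link; pages without links would need special treatment.)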
We finally note that the PageRank algorithm, although it was the first algorithm used to order search engine results, is nowadays only one part of Google's ranking methodology.
3.3 Case Study: DNA Methylation
DNA Methylation is a modification of the DNA where a methyl group is
added to a cytosine base and it typically occurs at so-called CpG sites,
where a cytosine nucleotide (C) is followed by a guanine nucleotide (G).
Hence, on the opposite side of the DNA strand there is the complementary sequence GC.

[Figure 3.3: DNA methylation after DNA replication.]
DNA methylation plays an important role in gene regulation, for instance,
during the early development of an embryo. Moreover, abnormal DNA
methylation patterns have been observed in cancer cells.
When the cell divides, the DNA is replicated by splitting the two strands
and synthesizing a new complementary sequence on the other side of the
two DNA strands of the mother cell. Thus, in each of the two daughter
cells, one strand is completely unmethylated. However, during replication
the cell tries to maintain the methylation pattern by adding methyl groups
on the newly synthesized strands. With high probability a hemimethylated
CpG site (methyl group on one strand) becomes fully methylated (methyl
group on both strands, see Figure 3.3). However, methylation patterns are
not always accurately copied and methyl groups may also be added to CpG
sites that are unmethylated such that the site becomes hemimethylated.
If we model the dynamics of DNA methylation over time, we use a DTMC
with three states (unmethylated, hemimethylated, fully methylated). The
time steps correspond to cell replications and the transition probability matrix P is the product of the matrices D and M where D describes cell
division and M describes all methylation events between two cell divisions.
In Figure 3.4 we illustrate the transitions of D on the left and those of M
on the right.
[Figure 3.4: Transitions of cell division (matrix D) and of methylation events (matrix M) between the states U (unmethylated), H (hemimethylated) and F (fully methylated), with parameters λ, µ and δ.]
Given a time course of the state of a CpG, we can estimate the parameters
µ, λ, δ and the initial conditions of the system (how to do parameter estimation will be discussed later in the course). Moreover, we can simulate
the system until equilibrium or distinguish the contributions to methylation
of several enzymes given knockout data. When extending the model to a
hidden Markov model, we can also take into account measurement errors
(hidden Markov models will be discussed later in the course).
3.4 From DTMCs to CTMCs
Since time is continuous in most real systems, it is often easier not to discretize time to obtain a time-discrete model but to work with a continuous-time model. A continuous-time Markov chain (CTMC) differs from a discrete-time Markov chain only in that the time between two transitions is not exactly one time unit but a (positive) random number R. It is possible to prove that if the process (Xt)t≥0, Xt : Ω → S, fulfils the Markov property then the distribution of R is exponential, i.e.

    P(R > t) = exp(−λt),    t ≥ 0,

for some λ > 0 (assuming that the state can be left with positive probability). If the process enters a state that cannot be left with positive probability, then it will remain there forever.
Note that in DTMCs we use high self-loop probabilities if we want the process to stay long in a certain state (while our discretization of time is rather fine, since other states are left after a single step). In CTMCs, there are no self-loops. Instead, a small value of the parameter λ is chosen for each state in which the process should remain for a long time.
Example 36: Molecular Evolution
We model the evolution of a single DNA base across multiple generations
using the Jukes-Cantor Model. Thus, Xn describes what base is found on
a fixed DNA position of the n-th descendant of an organism. We assume
that the base is not under any selective pressure and that the base in one
generation depends only on the base in the immediately previous generation. We construct a DTMC (Xn) with state space {A, C, G, T} whose states correspond to the four possible bases. Furthermore, in each generation the current base changes to each of the three other bases with probability µ. Thus,
    P = [ 1−3µ    µ      µ      µ
            µ    1−3µ    µ      µ
            µ      µ    1−3µ    µ
            µ      µ      µ    1−3µ ].
Due to its symmetry, all states have the same equilibrium probability. However, for a fixed number k of generations we can compute the probability of still having the initial base at the DNA position. Since the probability of a base mutation is very small (µ ≈ 10^−9), a discrete model is not very appropriate here, since it takes many generations until a state is left (the number of generations is geometrically distributed with parameter 3µ; mean 1/(3µ) ≈ 3.3 · 10^8). Thus, we can use a CTMC where the time until a state is left is exponentially distributed with parameter λ. For instance, we could choose λ = 10^−6; then the average time until a state is left is 1/λ = 10^6. The probability to enter a different state within the next t time units is then given by
P (R ≤ t) = 1 − exp(−λt).
For the probability to jump from one state to another we can still use the
matrix P . For instance, the probability to jump from state A to state C
within t time units is given by P (R ≤ t) · µ.
For CTMCs, the transient distributions can be computed by solving a system
of ordinary differential equations. The equilibrium distribution is, similarly
as for DTMCs, computed by solving a linear system of equations. In the
sequel we will not further discuss CTMCs but move on to hidden Markov
models.
3.5 Exercises
Easy
Exercise 28: Periodic DTMC Assume that we choose p0 = (1/4, 1/4, 1/2)
as initial distribution in Example 25. Will the sequence pi converge for
i → ∞?
Exercise 29: Transient Distribution Let X be a DTMC with state space S = {1, 2} and transition probability matrix

    P = [ 1−α    α
           β    1−β ],

where α, β ∈ (0, 1).
a) Find the equilibrium distribution of X.
b) Use MATLAB (or R) to find the transient probability distribution after 8 steps if the initial distribution is given by p0 = (1/5, 4/5) and the parameters are α = 0.18, β = 0.64. (If you have not yet installed MATLAB then write pseudocode.)
Exercise 30: Periodic Markov Chain Assume that the state space of
a DTMC is S = {1, 2, 3, 4, 5, 6, 7} and the transition probability matrix P is

    P = [  0    0   1/2  1/4  1/4   0    0
           0    0   1/3   0   2/3   0    0
           0    0    0    0    0   1/3  2/3
           0    0    0    0    0   1/2  1/2
           0    0    0    0    0   3/4  1/4
          1/2  1/2   0    0    0    0    0
          1/4  3/4   0    0    0    0    0  ].
a) Draw the underlying graph of the chain; is the chain irreducible?
b) Find the period d of state 5.
Exercise 31: Equilibrium of a DTMC
[Graph: a three-state DTMC over the states 1, 2, 3 with transition probabilities 1/4, 1/2, 3/4, 1/2 and 1.]
Explain why the equilibrium probability π of this DTMC exists and write MATLAB (or R) code for computing it. (If you have not yet installed MATLAB then write pseudocode.)
Exercise 32: Infinite DTMC (1) Assume that the state space of a DTMC is S = N0 and its transition probabilities are P(Xt+1 = 0 | Xt = i) = 1/2 and P(Xt+1 = i + 1 | Xt = i) = 1/2 for all i ∈ N0, t ∈ N.
a) Argue why the chain is irreducible.
b) Prove that state 0 is recurrent.
c) Let l(i) = i + 1 be a level function. Simplify the corresponding drift function dl as much as possible.
d) Choose a constant c > 0 such that the drift is greater than −c only in a finite number of states.
e) Find the equilibrium distribution of the chain.
Exercise 33: Number of visits Consider an ergodic DTMC with equilibrium distribution π.
a) Show that the average recurrence time mj of state j is equal to the
inverse of state j’s equilibrium probability πj .
b) Show that the average number vi,j of visits to state i between two
successive visits to state j is equal to the ratio πi /πj .
Medium
Exercise 34: Squaring P
a) Consider the transition probability matrix P of a DTMC:

    P = [ 0.125  0.7    0.175
          0.3    0.2    0.5
          0.68   0.178  0.142 ]
Write MATLAB (or R) code for computing the transient distribution of the chain after 16 jumps if the initial distribution is given by (0.14, 0.6, 0.26). In your code you should use exactly 4 matrix multiplications and only one vector-matrix multiplication. (If you have not yet installed MATLAB then write pseudocode.)
b) Would your result change if the initial distribution were different from (0.14, 0.6, 0.26)? Argue why or why not.
c) Discuss the general drawbacks of your approach if it is applied to large Markov chains with sparse P.
Exercise 35: Equilibrium of reducible DTMCs Compute the equilibrium distribution for the two reducible DTMCs that follow. For a): can you do it without solving the corresponding equation system for the reachability probabilities? Verify your results with MATLAB for both cases!
a) [Graph: a reducible DTMC over the states 1–8; the transition probabilities include 1/3, 2/3, 1/2, 99/100 and 1/100.]

b) [Graph: a reducible DTMC; the transition probabilities include 1/2, 1/4, 1/3 and 2/3.]
Exercise 36: Infinite DTMC (2) Consider a DTMC with state space S = Z+ × Z+ such that for state (i, j) the transition probabilities are given by the following rules:
• The probability to go from state (i, j) to (i + 1, j) is 1/8.
• The probability to go from state (i, j) to (i, j + 1) is 1/8.
• If max{i, j} > 1 then the probability to go from state (i, j) to (i −
1, j − 1) is 1/3.
• The remaining probabilities are self-loop probabilities.
Prove that the chain is ergodic. Argue using a suitable drift function.
Difficult
Exercise 37: Reversible Markov Chain
Consider an irreducible, discrete-time Markov chain X with a finite state space S and transition probability matrix P. Let π be the equilibrium distribution of X. Consider the probabilities rij = pji πj / πi, which are the transition probabilities of the reversed chain X∗, i.e. the DTMC (X∗n)n∈N with

    X∗n = Xτ−n

for some fixed time τ. Note that the matrix R with entries rij is the transition probability matrix of X∗. The chain X is called reversible if and only if rij = pij (or, in terms of matrices, R = P, where R is the transition matrix of the reversed Markov chain X∗).
a) Give a matrix definition of R based on the matrix P and the vector π.
b) Consider two Markov chains with the following transition probability matrices:

    P1 = [ 0.8  0.15  0.05        P2 = [  0   5/9  4/9
           0.7  0.2   0.1                5/6   0   1/6
           0.5  0.3   0.2  ]             4/5  1/5   0  ]

Compute the equilibrium distributions of the two chains and the transition matrices R of the reversed chains. Are these Markov chains reversible?
c) Show that given the reversibility condition R = P it follows that
πi pij = πj pji .
These equations (for all i, j) are called the detailed balance equations.
d) Derive the global balance equations from the detailed balance equations.
e) Consider an irreducible Markov chain X with transition matrix P and rij = pji πj / πi. Prove that X is reversible if and only if the probability of traversing any closed path (cycle) is the same in both the forward and backward directions.
Exercise 38: The Game Suppose you are playing a game together with
your friend. Initially you have A Euros and your friend has B Euros. Each
time you toss a fair coin. If you win, you get 1 Euro from your friend. If you lose, you give 1 Euro to your friend. You and your friend stop gambling only
if either you or your friend run out of money. Your money after the t-th
game is defined by Xt and X0 = A.
a) Draw the state graph of the above process X. Is X a DTMC?
b) What is the probability that you eventually lose all of your money?
Exercise 39: Consider a DTMC with state space {1, 2, . . . , n + 1} where state n + 1 cannot be left. Thus, its transition probability matrix P has the form

    P = [ T  v
          0  1 ],

where T is an n × n matrix, v is a vector of size n, and 0 is the n-vector of zeros. Let p0 be the initial distribution.
a) Find a closed-form expression for the probability p(k) to be in the absorbing state after k steps.
b) What must hold for p0, v and T such that p(k) is a discrete probability distribution on N?
c) Find a closed-form expression for the mean of that distribution.
Exercise 40: Poisson Process
A Poisson process with constant rate λ > 0 is a CTMC (N(t))t≥0 taking values in Z+ with the following properties:
a) P(N(t) = k) = e^{−λt} (λt)^k / k!,  k ∈ Z+.
b) P(N(t2) = k2 | N(t1) = k1) = P(N(t2 − t1) = k2 − k1) for t2 ≥ t1, k1, k2 ∈ Z+.
Derive an expression for P(N(s) = k | N(t) = n) when s < t, k, n ∈ Z+.
Exercise 41: DNA methylation
Recall the DNA methylation model in Figure 3.4 in the lecture notes.
a) Biologists can measure DNA methylation with a method called Bisulfite Sequencing. During this treatment,
– each unmethylated Cytosine (C) is first converted to Thymine
(T). With a conversion error of 1 − c, however, the unmethylated
C remains an unmethylated C.
– each methylated Cytosine (5mC) is first converted to (an unmethylated) Cytosine (C).
Define an HMM that describes the output of Bisulfite Sequencing.
b) Consider a modification of methylated Cytosines (5mCs) where the methyl group is oxidized and becomes a hydroxy group (5hmC). Formulate a new HMM with states UU, UM, MU, MM, HU, UH, HM, MH, HH where
– UU refers to a CpG that is unmethylated on both sides,
– UM (MU) refers to a CpG where the lower (upper) strand Cytosine is methylated,
– MM refers to a CpG where both Cytosines are methylated,
– UH (HU) refers to a CpG where the lower (upper) strand Cytosine is hydroxylated,
– MH (HM) refers to a CpG where the lower (upper) strand Cytosine is hydroxylated and the upper (lower) strand Cytosine is methylated,
– HH refers to a CpG where both Cytosines are hydroxylated.
Include a new parameter for the probability that a methylated site becomes hydroxylated.
c) Based on the DTMC in b) define two HMMs that describe the results of Bisulfite Sequencing and oxidative Bisulfite Sequencing. The conversions (and conversion errors) of Bisulfite Sequencing and oxidative Bisulfite Sequencing are illustrated below. (Note that the experiment consists of two steps, (ox.) bisulfite treatment and sequencing, but the intermediate result after bisulfite treatment need not be explicitly modelled.)

[Diagram: conversion behaviour of C, 5mC and 5hmC under bisulfite and oxidative bisulfite treatment followed by PCR and sequencing, with conversion error parameters c and d.]
Chapter 4
Hidden Markov Models
In this chapter we discuss Hidden Markov Models (HMMs), which can be seen as an extension of DTMCs. A HMM is particularly useful whenever we cannot observe the state of a Markov chain directly but only a part of it or "an output" of it. In particular, if we can observe the state but there is a certain error in our observation, then a HMM is an appropriate model.
A HMM extends a finite-state DTMC in that besides the set S of (hidden)
states (w.l.o.g. we assume S = {1, 2, . . . , N }) there is a set O of observable
states or outputs. Moreover, besides the transition probability matrix P
there is a matrix B of output probabilities. If the number of hidden states
|S| = N and the number of observable states |O| = M , then B is an N × M matrix such that the entry bik is the probability that state i outputs the
k-th observation ok ∈ O. If we define the random variable Yn as the output
after n steps of the process and Xn the hidden state after n steps then
bik = P (Yn = ok | Xn = i).
(Note that for the transition probabilities of a DTMC we had pij = P (Xn =
j | Xn−1 = i).) Thus, a HMM is a combination of two stochastic processes X
and Y where X is hidden and Y , which is a function of X, can be observed.
Example 37: Annual Temperatures (1)
We want to model the average annual temperature at a particular location
on earth over a series of years for which no direct measurements of the temperature exist (e.g. 500 years ago). For simplicity we only consider
two annual temperatures, H for ”hot” and C for ”cold”. Thus, S =
{H, C}. Moreover, we know that the probability of a hot year followed by
another hot year is 0.7 and the probability that a cold year is followed by
another cold year is 0.6 (e.g. estimated from current weather statistics),
i.e. the transition probability matrix P equals (rows and columns in the order H, C)

    P = [ 0.7  0.3
          0.4  0.6 ].
In addition, we have data about the size of tree growth rings in these
years (S for small, M for medium, and L for large; O = {S, M, L}) and we know that the relationship between annual temperature and tree ring sizes is given by the matrix B (rows H, C; columns S, M, L), defined as

    B = [ 0.1  0.4  0.5
          0.7  0.2  0.1 ].
The entries of the matrix B are also often called emission probabilities.
4.1 Computing the probability of outputs
Assume we want to compute the probability that a certain sequence τ = ok0 ok1 . . . okm of outputs has been observed, i.e. we want to compute

    P(Y0 = ok0, Y1 = ok1, . . . , Ym = okm).

Let us start with the simplest case m = 0 and assume that p0 is the initial distribution of X. Since Y is a function of X we can condition on X to compute probabilities of Y and use the law of total probability, i.e.

    P(Y0 = ok0) = Σ_{i∈S} P(X0 = i) · P(Y0 = ok0 | X0 = i) = Σ_{i∈S} p0(i) · bik0.
For m = 1 we get

    P(Y0 = ok0, Y1 = ok1)
    = Σ_{i∈S} P(X0 = i) P(Y0 = ok0, Y1 = ok1 | X0 = i)
    = Σ_{i∈S} p0(i) P(Y0 = ok0 | Y1 = ok1, X0 = i) P(Y1 = ok1 | X0 = i)
    = Σ_{i∈S} p0(i) P(Y0 = ok0 | X0 = i) P(Y1 = ok1 | X0 = i)
    = Σ_{i∈S} p0(i) bik0 Σ_{j∈S} P(X1 = j | X0 = i) P(Y1 = ok1 | X1 = j, X0 = i)
    = Σ_{i∈S} p0(i) bik0 Σ_{j∈S} pij P(Y1 = ok1 | X1 = j)
    = Σ_{i∈S} p0(i) bik0 Σ_{j∈S} pij bjk1.
By defining the diagonal matrices W0, W1, . . . , Wm as

    Wℓ = diag(b1kℓ, b2kℓ, . . . , bNkℓ),    ℓ = 0, 1, . . . , m,

we can write the above sum as a vector-matrix product

    P(Y0 = ok0, Y1 = ok1) = p0 W0 P W1 e,

where e is a vector of ones, and thus for general m

    P(Y0 = ok0, Y1 = ok1, . . . , Ym = okm) = p0 W0 P W1 P . . . Wm−1 P Wm e.
According to the so-called forward algorithm, we compute this large vector-matrix product from left to right, i.e.
1. Compute the vector v0 = p0 · W0.
2. For n = 1, 2, . . . , m compute vn = vn−1 · P · Wn.
3. Return the sum of the entries of the final vector vm.
Note that a backward computation is also possible.
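A direct MATLAB transcription of the forward algorithm (our own sketch; the observation sequence is passed as a vector of column indices of B):

    % forward algorithm: probability of an observed output sequence
    function prob = forward_prob(p0, P, B, tau)
    % p0 : 1 x N initial distribution,  P : N x N transition matrix
    % B  : N x M output probabilities,  tau : observed output indices
    v = p0 .* B(:, tau(1))';             % v0 = p0 * W0
    for n = 2:length(tau)
        v = (v * P) .* B(:, tau(n))';    % vn = v_{n-1} * P * Wn
    end
    prob = sum(v);
    end

For the model of Example 37, forward_prob([0.6 0.4], [0.7 0.3; 0.4 0.6], [0.1 0.4 0.5; 0.7 0.2 0.1], [1 2 1 3]) returns approximately 0.0096, matching Example 38 below.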
Example 38: Annual temperatures (2)
For the example of annual temperatures we have the following observed sequence:

    τ = S M S L.

Assuming the initial distribution p0 = (0.6, 0.4), we compute P(Y0 = S, Y1 = M, Y2 = S, Y3 = L) as

    (0.6 0.4) · [0.1 0; 0 0.7] · [0.7 0.3; 0.4 0.6] · [0.4 0; 0 0.2] · [0.7 0.3; 0.4 0.6]
              · [0.1 0; 0 0.7] · [0.7 0.3; 0.4 0.6] · [0.5 0; 0 0.1] · (1, 1)^T = 0.0096.
4.2 Most likely hidden sequence of states
Now we would like to find the sequence ν∗ of hidden states that is most likely for a given sequence τ = ok0 ok1 . . . okm of observations. Thus, ν∗ is the sequence that maximizes the value

    P(ν | τ) = P(X0 = i0, X1 = i1, . . . , Xm = im | Y0 = ok0, Y1 = ok1, . . . , Ym = okm)

among all sequences ν = i0 i1 . . . im of hidden states. However, since

    P(ν | τ) = P(ν, τ) / P(τ)

and P(τ) is independent of ν, we can alternatively maximize P(ν, τ), i.e.

    ν∗ = argmax_ν P(ν | τ) = argmax_ν P(ν, τ).

The Viterbi algorithm determines ν∗ by iteratively computing the values

    wℓ(i) = max_{i0 i1 ... iℓ−1} P(i0 i1 . . . iℓ = i, τ),

which give the maximal probability of all paths ending in state i after ℓ steps together with the observed sequence τ, where the maximum is taken over all hidden state sequences for the first ℓ − 1 steps. Note that max_i wm(i) = P(ν∗, τ) gives a solution to the above problem. By induction it is easy to show that

    wℓ+1(j) = max_i wℓ(i) pij bjkℓ+1.
Hence, we can iteratively compute the vectors w0 , w1 , . . . , wm and track the
sequence of states that corresponds to the maximum as follows:
1. Initialize the vector w0 with entries w0(i) = p0(i) bik0.
2. For all ℓ ∈ {1, 2, . . . , m} compute the vector wℓ with entries

    wℓ(j) = max_i wℓ−1(i) pij bjkℓ,    j ∈ S,

and store the states that correspond to the maximum in the vectors sℓ with entries

    sℓ(j) = argmax_i wℓ−1(i) pij,    j ∈ S.

3. Return the probability P(ν∗, τ) = max_i wm(i) and the last maximal state i∗m = argmax_i wm(i).
4. Backtrack the remaining states of the optimal sequence ν∗ by setting

    i∗ℓ = sℓ+1(i∗ℓ+1),    ℓ = m − 1, . . . , 1, 0.
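The following MATLAB sketch implements these four steps (our own transcription; sequence positions are MATLAB's 1-based indices):

    % Viterbi algorithm: most likely hidden state sequence
    function [nu, prob] = viterbi(p0, P, B, tau)
    m = length(tau);
    w = p0 .* B(:, tau(1))';                 % step 1: w0
    s = zeros(m, length(p0));                % backpointers
    for l = 2:m                              % step 2
        [mx, arg] = max(diag(w) * P, [], 1); % mx(j) = max_i w(i) p_ij
        w = mx .* B(:, tau(l))';
        s(l, :) = arg;
    end
    [prob, i] = max(w);                      % step 3
    nu = zeros(1, m); nu(m) = i;
    for l = m:-1:2                           % step 4: backtracking
        nu(l-1) = s(l, nu(l));
    end
    end

Calling viterbi([0.6 0.4], [0.7 0.3; 0.4 0.6], [0.1 0.4 0.5; 0.7 0.2 0.1], [1 2 1 3]) yields ν∗ = (2, 2, 2, 1) with probability ≈ 0.0028, matching Example 39 below.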
Example 39: Annual temperatures (3)
Given τ = S M S L we want to find the most likely sequence of annual temperatures. We first compute

    w0 = (0.6, 0.4) · [0.1 0; 0 0.7] = (0.06, 0.28).

Then we find

    w1(1) = max{0.06 · 0.7, 0.28 · 0.4} · 0.4 = 0.0448,
    w1(2) = max{0.06 · 0.3, 0.28 · 0.6} · 0.2 = 0.0336,

and s1 = (2, 2). The next step gives

    w2(1) = max{0.0448 · 0.7, 0.0336 · 0.4} · 0.1 ≈ 0.0031,
    w2(2) = max{0.0448 · 0.3, 0.0336 · 0.6} · 0.7 ≈ 0.0141,

and s2 = (1, 2). Finally,

    w3(1) = max{0.0031 · 0.7, 0.0141 · 0.4} · 0.5 ≈ 0.0028,
    w3(2) = max{0.0031 · 0.3, 0.0141 · 0.6} · 0.1 ≈ 0.0008,

and s3 = (2, 2). Hence, P(ν∗, τ) ≈ 0.0028 and the last state of ν∗ is i∗3 = 1. Backtracking yields i∗2 = s3(1) = 2, i∗1 = s2(2) = 2, and i∗0 = s1(2) = 2. Thus, ν∗ = 2 2 2 1.
We will revisit HMMs when we consider the problem of estimating parameters of Markov models.
4.3 Exercises
Exercise 42: Consider the hidden Markov model with the set of hidden states S = {s1, s2, s3} and the set of outputs O = {1, 2, 3}. The transition probability and emission matrices are as follows:

    P = [  0   1/2  1/2        B = [ 1/2  1/2   0
           1    0    0               1/2   0   1/2
           0    1    0  ]             0   1/2  1/2 ]

The initial distribution is p0 = (1, 0, 0).
• Consider the two observed sequences τ1 = 1, 2, 3 and τ2 = 1, 3, 1. What are all possible state sequences for the observed sequences? What is the probability to observe these sequences?
• Compute the most likely sequence of the hidden states for the observed
sequences τ1 = 1, 3, 1 and τ2 = 1, 2, 1 without using the Viterbi algorithm.
• Consider an alternative HMM model where p0 = (1/3, 1/3, 1/3) and

    P = [  0   1/2  1/2        B = [ 1/2  1/2   0
          1/3  1/3  1/3              1/4   0   3/4
          2/3  1/3   0  ]             0   3/4  1/4 ]

Again, compute the most likely sequence of the hidden states for the observed sequence τ3 = 1, 3, 1 using the Viterbi algorithm.
Exercise 43: HMM as DTMC Consider two dice D1 and D2 with four faces. One chooses die D1 with probability α (or D2 with probability 1 − α), throws it, records the result and repeats. The probability distributions of the results obtained by throwing the dice are given below:

    Die   1    2    3    4
    D1   1/4  1/4  1/4  1/4
    D2   1/6  1/6  1/3  1/3

The process of choosing a die is hidden and we assume that we can only observe the result (i.e. the number of dots on the face).
a) What are the possible outcomes of this experiment?
b) Define an HMM for the experiment.
c) This experiment can be modelled also as a discrete time Markov chain
with state space S = {D1 , D2 , 1, 2, 3, 4}. Write out the transition
probability matrix of the corresponding DTMC or draw the underlying
graph.
d) This experiment can be modelled also as another DTMC with state
space S = {1, 2, 3, 4}. Write out the transition probability matrix of
the corresponding DTMC or draw the underlying graph.
Medium
Exercise 44: Sequence Generation
a) Construct an HMM that generates a sequence of the form A^c1 B^c2 A^c3 B^c4 . . . , where A^ci represents a sequence of length ci consisting only of 'A' (and similarly for B^ci), i ∈ {1, 2, . . .}. The natural numbers c1, c2, c3, . . . are drawn from the set {1, 2, 3} with equal probabilities.
b) Construct an HMM that generates a sequence of the form A^c1 B^{4−c1} A^c2 B^{4−c2} . . . . The natural numbers c1, c2, c3, . . . are drawn from the set {1, 2, 3} with equal probabilities.
Chapter 5
Computer Simulations and
Monte Carlo Methods
Many software libraries provide a random number generator that returns a
sequence of numbers u1 , u2 , . . . that follow a U (0, 1) distribution, i.e. that
are uniformly distributed on the interval (0, 1). Moreover, these numbers
are (approximately) independent. Thus, if we generate, say, 100 random numbers and check how many of them, say N, are smaller than a certain value p ∈ (0, 1), then N/100 ≈ p. Thus, we can approximate certain probabilities using random numbers. In the previous chapters we already used this technique to approximate the area below some function. This is particularly useful if the integral is difficult to evaluate numerically.
5.1 Generating random variates
Often, random variates other than uniform ones are needed, that is, random numbers that follow a certain distribution. If it is a continuous distribution with an invertible cumulative distribution function, the inverse transform method (see Section 2.3) can be used. For discrete distributions there are two possibilities. Assume we want to generate random numbers that are realizations of a discrete random variable X. Then
a) we either use a specific relationship between X and U ∼ U(0, 1), or
b) we split the interval (0, 1) into disjoint intervals and do a case distinction, as explained below.
Example 40: Generating Bernoulli distributed numbers
Let U ∼ U(0, 1) and p ∈ (0, 1). We define

    X = 1 if U < p,   and   X = 0 if U ≥ p.

Then P(X = 1) = P(U < p) = p and P(X = 0) = 1 − p. Thus, X follows the Bernoulli distribution with parameter p.
Algorithmically, we first generate U, set X accordingly and return X. MATLAB code for this is (after setting the value for p):

    U = rand;
    X = (U < p)
Example 41: Generating binomially distributed numbers
Let U1, . . . , Un be independent and U(0, 1)-distributed. Moreover, let p ∈ (0, 1). We define

    Xi = 1 if Ui < p,   and   Xi = 0 if Ui ≥ p,

and Y = Σ_{i=1}^{n} Xi. Then X1, X2, . . . , Xn are Bernoulli distributed (with parameter p) and thus their sum Y is binomially distributed with parameters n and p (see Section 1.3).
Algorithmically, we generate U1, . . . , Un, set X1, X2, . . . , Xn accordingly and return Y = Σ_{i=1}^{n} Xi. MATLAB code for this is (after setting the values for n and p):

    U = rand(n, 1);
    Y = sum(U < p)
Example 42: Generating geometrically distributed numbers
Let U1, U2, . . . be independent and U(0, 1)-distributed. Moreover, let p ∈ (0, 1). We define

    X = min{ i : Ui < p }.

Then, clearly, P(X = 1) = p and it is easy to see that P(X = k) = (1 − p)^{k−1} p.
Algorithmically, we set X = 1 and perform a while-loop in which we generate U and check whether U > p (loop condition). If this is the case we set X = X + 1. After the loop we return X. MATLAB code for this is (after setting the value for p):

    X = 1;
    while (rand > p)
        X = X + 1;
    end;
    X
The splitting of the interval (0, 1) can be done for any discrete random variable X. Assume that X takes the values x0, x1, . . . and P(X = x0) = p0, P(X = x1) = p1, . . . . Then we divide the interval (0, 1) into the subintervals

    A0 = (0, p0),   A1 = [p0, p0 + p1),   A2 = [p0 + p1, p0 + p1 + p2),   . . .

Then we let X = xi if U falls into interval Ai. Note that since Ai has length pi, the probability that U falls into Ai is exactly pi. Hence X has the desired distribution.
Algorithmically, the naive approach to find the index i with

    p0 + . . . + pi−1 ≤ U < p0 + . . . + pi

is performed as follows:
1. initialize c = p0 for the cumulative value,
2. iteratively set c = c + pk in the k-th execution of a while loop that is only executed if U ≥ c.
If U < c holds after i iterations, then X = xi. MATLAB code for this is (where pi is stored in p(i + 1), as MATLAB array indices start at 1):

    c = p(1); % set to p_0
    i = 0;
    U = rand;
    while (U >= c)
        i = i + 1;
        c = c + p(i+1); % add p_i
    end;
    X = i
There is a simple way to improve the efficiency of the above approach. If we sort the values x0, x1, . . . by decreasing probability p0, p1, . . . , we can ensure that large intervals (high probabilities) are considered first. Hence it is more likely that the number of iterations is small, since U is more likely to fall into one of the first intervals.
In the case that X can take infinitely many values with positive probability,
the intervals can be constructed on the fly. However, care has to be taken in
how the intervals are traversed in order to ensure convergence of the while
loop.
Example 43: Generating Poisson distributed numbers
For the Poisson distribution we have x0 = 0, x1 = 1, . . . and, for parameter λ > 0,

    pi = P(X = xi) = e^{−λ} λ^i / i!,    i = 0, 1, 2, . . . .

We perform the following steps (MATLAB code, variable lambda is set):

    c = exp(-lambda);
    i = 0;
    U = rand;
    while (U >= c)
        i = i + 1;
        c = c + exp(-lambda) * lambda^i / factorial(i);
    end;
    X = i
We note that MATLAB also offers functions for generating random variates for many well-known continuous and discrete distributions (function
random).
5.2 Monte Carlo simulation of DTMCs
Since we already discussed how to generate realizations of a discrete random variable X, we can simply apply this iteratively to simulate a given DTMC with initial distribution p0 and transition probability matrix P.
First we have to determine the initial state of the chain. Hence we apply the algorithm of the previous section to the random variable X0 with distribution p0, taking values in S = {1, 2, . . .}. Once we have determined the initial state k0, we apply the procedure to X1 given X0 = k0, where we use the distribution given by the row of P that corresponds to k0. This generates the next state k1, and again we apply the procedure to X2 given X1 = k1 by considering the row of P that corresponds to state k1, and so forth.
For a finite DTMC with N states, transition probability matrix P and initial distribution p0 we perform a simulation for T ∈ N steps using the following MATLAB code:

    n = 0; % step counter
    d = p0;
    stateseq = '';
    while (n < T) % while time horizon not yet reached
        U = rand;
        c = d(1); % first interval
        i = 1;
        while (U >= c) % next state not yet found
            i = i + 1;
            c = c + d(i);
        end;
        % add state at time n to state sequence
        stateseq = [stateseq, sprintf('%g ', i)];
        d = P(i, :); % select next distribution
        n = n + 1;
    end;
The above algorithm generates a single trajectory Xn (ω) : N0 → S of the
chain, i.e. a sequence of states visited by the chain. This is called Monte
Carlo Simulation of the Markov chain. To estimate probabilities or expectations of random variables, a large number of such trajectories have to be
generated, i.e. we have to execute the above algorithm for a large number of
times. In the next section we discuss how we can get accurate estimations
based on a collection of trajectories.
5.3 Output Analysis
Assume that we generate 100 trajectories of a DTMC using the algorithm presented in the previous section. For each run we can check whether a certain event occurred or not. E.g. assume that in 67 of the 100 runs we have Xn = 1, that is, in these 67 runs the chain was in state 1 after n steps (for some fixed n). Then the relative frequency of the event "chain is in state 1 after n steps" is 67/100. Intuitively, we can deduce that P(Xn = 1) ≈ 67/100. Or, alternatively, we generate 100 Bernoulli distributed numbers. Assume that in 67 of the 100 generations X = 1. Then the relative frequency of the event X = 1 is 67/100 and thus p = P(X = 1) ≈ 67/100, i.e., we have an approximation of the success parameter p.
However, we do not know how good such estimations are, since the generations are based on pseudo-random numbers. If we do another 100 runs, we could get a very different frequency for the same event. In this section, we discuss how we can estimate probabilities of events and expected values of random variables using Monte Carlo simulation. We will see how we can determine the quality of an estimated value by exploiting the central limit theorem.
5.3.1 Interval Estimation
Assume that we are interested in the probability of a certain event A, such as X = x (or, for a DTMC, Xn = x). For ω ∈ Ω, define

    χA(ω) = 1 if ω ∈ A,   and   χA(ω) = 0 if ω ∉ A.

Then E(χA) = 1 · P(χA = 1) + 0 · P(χA = 0) = P(χA = 1) = P(A). Thus, an estimate of E(χA) is also an estimate of P(A). Therefore, in the sequel, we concentrate on estimates of the expectation of some random variable.
Let X be a random variable and let X1, X2, . . . , Xn be independent random variables, all having the same distribution as X. For instance, if we generate n trajectories of a DTMC using Monte Carlo simulation, the random variable Xj represents the value of χA in the j-th trajectory (i.e., it is either 1 or 0) or the state of the chain at time t.
Let µ = E(X) = E(X1) = . . . = E(Xn). We discuss the construction of interval bounds Il, Ir such that "with high probability" µ ∈ [Il, Ir]. Since these bounds depend on X1, X2, . . . , Xn, Il and Ir are random variables.

Estimators for Expectation and Variance. The sample mean

    Zn = (1/n) Σ_{i=1}^{n} Xi

is called an estimator for µ. In Section 2.4.2, we calculated that E(Zn) = µ, which means that Zn is an unbiased estimator. Let VAR(X) = VAR(X1) = . . . = VAR(Xn) = σ^2 and recall that VAR(Zn) = σ^2/n. According to the central limit theorem,

    Zn∗ = (Zn − µ) / (σ/√n)

approximately follows a standard normal distribution. We are interested in the deviation |Zn − µ| of the estimator Zn from µ, that is, in

    P(|Zn∗| ≤ z) = P( |Zn − µ| / (σ/√n) ≤ z ) = P( |Zn − µ| ≤ z · σ/√n ).     (5.1)

Since we do not know σ (or σ^2) we estimate σ^2 with the sample variance

    S^2 = 1/(n−1) Σ_{i=1}^{n} (Xi − Zn)^2.     (5.2)

Note that S^2 is an unbiased estimator for σ^2 and that

    S^2 = (Σ_{i=1}^{n} Xi^2)/(n−1) − (Σ_{i=1}^{n} Xi)^2/((n−1)n).     (5.3)
Confidence Level. In order to control the estimation error, we want the
constraint in Eq. 5.1 to hold with a high probability. We choose a confidence
level β and define z such that β = P (|Zn∗ | ≤ z). Usually, β ∈ {0.95, 0.99} and
an approximation for z can be found in the standard normal table. Since
the standard normal distribution is symmetric around the expectation 0,
P (|Zn∗ | ≤ z) = P (−z ≤ Zn∗ ≤ z). Let Φ be the cumulative probability
distribution of the standard normal distribution. For a confidence level of
β = 0.95, for instance, we get Φ(1.96) ≈ P (Zn∗ ≤ 1.96) ≈ 0.975. Thus,
P(−1.96 ≤ Zn* ≤ 1.96) ≈ P(Zn* ≤ 1.96) − P(Zn* ≤ −1.96)
≈ Φ(1.96) − (1 − Φ(1.96))
≈ 0.975 − 0.025 = 0.95,
and z ≈ 1.96 (see also the illustration in Fig. 5.1).
Confidence Interval. Since
β = P(|Zn*| ≤ z) ≈ P( |Zn − µ| ≤ z · √(S²/n) )
we get the confidence interval
[ Zn − z·√(S²/n), Zn + z·√(S²/n) ] = [Il, Ir].
We see now that the confidence interval depends on the number of generated
trajectories n. To halve the width of the confidence interval, we must
generate four times as many trajectories.
Note that we cannot deduce that µ lies in [Il, Ir] with probability β since
• Zn and S² are random variables,
• Zn is normally distributed only as n → ∞,
• we approximated σ² by S².
The correct interpretation is that, for large n and a large number of realizations
of the random interval [Il, Ir], β is the percentage of realizations [Il(ω), Ir(ω)]
that will contain µ.
[Figure 5.1: Probability density function of N(0, 1); the central area of 95%
lies between −1.96 and 1.96, with 2.5% in each tail.]
Practical Issues. The simulation algorithm in Section 5.2 can be used to
compute realizations of the random variables Xi. Assume that we execute
the algorithm n times to generate n trajectories. We add two variables sum
and sumSq (for the sum of squares) that we initialize with zero. In the i-th
run, we increase sum and sumSq by xi = Xi(ω) and xi² = (Xi(ω))²,
respectively. After the generation of n trajectories, we compute zn = sum/n
and
s² = sumSq/(n−1) − sum²/((n−1)·n).
Note that in the latter case we exploit Eq. (5.3). Then, for a fixed confidence,
the interval
[ zn − z·√(s²/n), zn + z·√(s²/n) ]
can be constructed.
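As an illustration, the following Matlab/Octave sketch carries out this
bookkeeping for the simple case where each run reports whether the event A
occurred; the "simulation run" is only a placeholder (a hypothetical Bernoulli
experiment with success probability 0.3), and z = 1.96 is hard-coded for
β = 0.95:

    % Monte Carlo estimate of P(A) with a 95% confidence interval.
    % The call to 'rand' stands in for one simulation run; replace it
    % by a trajectory generator that returns x = 1 if A occurred.
    n = 10000;                     % number of simulation runs
    sum_  = 0;                     % accumulates the sum of x_i
    sumSq = 0;                     % accumulates the sum of x_i^2
    for i = 1:n
        x = double(rand < 0.3);    % placeholder run: Bernoulli(0.3)
        sum_  = sum_  + x;
        sumSq = sumSq + x^2;
    end
    zn = sum_ / n;                             % sample mean Z_n
    s2 = sumSq/(n-1) - sum_^2/((n-1)*n);       % sample variance, Eq. (5.3)
    z  = 1.96;                                 % quantile for beta = 0.95
    Il = zn - z*sqrt(s2/n);
    Ir = zn + z*sqrt(s2/n);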
Number of Simulation Runs. If the interval [Il, Ir] is large relative to
Zn, the quality of the estimator is poor and more simulation runs have to
be carried out. This is likely the case if n is small or if we try to estimate
the probability of a rare event. Let us fix the relative width of the
interval to be 0.2 (which means that we have a relative error of at most 0.1)
and choose confidence level β = 0.95. Thus, z ≈ 1.96. Assume that we want
to estimate γ = P(A) and Xi is one if A occurs in the i-th run and zero
otherwise. Clearly,
E(X1) = E(X2) = ... = E(Xn) = 1·γ + 0·(1−γ) = γ.
We can determine the number of necessary runs N by bounding the relative
width of the confidence interval as follows:
2 · z·√(S²/N)/γ ≤ 0.2  ⇒  z²·S²/(0.01·γ²) ≤ N  ⇒  384 · S²/γ² ≤ N.
Using the fact that
σ² = VAR[χA] = γ(1−γ)
and replacing S² by σ² yields N ≥ 384 · (1−γ)/γ. For instance, the sufficient
number of runs to guarantee that probabilities, having at least the order of
magnitude of 10⁻⁵, are estimated with a relative error of at most 0.1 and a
confidence of 95% is N ≈ 38,000,000.
Monte Carlo simulation of Markov chains has a number of drawbacks
compared to a direct numerical solution using matrix-vector multiplications:
• If the event whose probability has to be calculated is rare, a large
number of simulation runs is necessary to estimate its probability.
• In order to halve the confidence interval, four times more simulation
runs have to be carried out. It is therefore expensive to obtain accurate
results.
• It is important that the pseudorandom numbers used during the simulation
are close to "truly random". This is necessary to ensure independence.
5.4 Exercises
Easy
Exercise 45: Generating random numbers (negative binomial distribution)
A negative binomially distributed random variable X with
parameters r ∈ {1, 2, ...} and p ∈ (0, 1) has probability distribution
P(X = k) = C(k + r − 1, k) · p^k · (1 − p)^r,   k = 0, 1, 2, ...,
where C(n, k) denotes the binomial coefficient, and describes the number of
successes in a sequence of i.i.d. Bernoulli trials before a specified number r
of failures occurs. For example, if we throw a die repeatedly until the third
time "6" appears (r = three failures), then the probability distribution of the
number of non-6 values that appeared will be a negative binomial.
a) Discuss the intuition behind the formula for the probability distribution
with your fellow students.
b) Assuming that only addition and the generation of u ∼ U(0, 1) are
available as operations, write code in Matlab (or R or some Matlab-clone
that you prefer) that generates random numbers that follow a negative
binomial distribution.
Medium
Exercise 46: Gillespie Algorithm Write down the pseudocode for
simulating a CTMC. Consider that in each state x ∈ Nᵖ of the CTMC there
are transition rules Rj, j ∈ {1, ..., N}, each of which fires with a certain
transition rate αj(x) and results in the jump from state x to state
x + vj, where vj ∈ Nᵖ is the change vector that corresponds to rule j.
Finally, consider that the residence time in state x is exponentially distributed
with parameter α0 = Σj αj(x).
Exercise 47: Simulation of Page Rank Write code in Matlab (or R or
some Matlab-clone that you prefer) that simulates the DTMC of the random
surfer model. As input it should take the adjacency matrix of the directed
graph of web links and the output should be the ranking of the nodes. As
an initial state you can choose any state and as a simulation time horizon
just use some large number of steps (say 10000). Based on the total number
of visits of each node, approximate the equilibrium probabilities.
Apply your algorithm to the graph in the exercise sheet of last week and
compare your ranking results.
Chapter 6
Parameter Estimation
We consider an unknown distribution of which we have a collection of samples.
We know the family of the distribution (e.g. Poisson) or the underlying
model (e.g. the distribution of a DTMC after n steps), but some of the
parameters are unknown and have to be estimated.
We already saw estimators for the mean of the distribution and for its
standard deviation in Section 5.3. They are random variables computed from a
collection of data and are subject to a sampling error. We also constructed
random intervals that contain the true value of the parameter with high
probability.
We may also use simple descriptive statistics such as measures of location
(mean, median, quantiles), of variability (variance, standard deviation) or
the range. Often this is used to determine the family of the distribution if it is
unknown. However, there are more parameters that shape a distribution. In
the sequel we will discuss some popular methods for estimating parameters.
Sometimes they may lead to different results for the estimation.
Example 44: Photon Count
Consider a telescope observing the photon flux of a certain star during
certain time intervals of fixed length. We assume that the distribution of
the number of observed photons in the interval [t, t+h] is constant in t, i.e.
we have the same distribution for all intervals of length h. Assume further
that n independent measurements are performed for certain intervals of
length h, i.e. we have n numbers x1 , . . . , xn for the photon flux within
intervals of length h. Clearly, we will in reality never have exact and
truly independent measurements but let us assume for now that
a) the measurements have been taken over n days for, say, h = 1 hour
every day,
b) the telescope gives a very accurate measurement.
From a) we get approximate independence and from b) we derive that
measurement errors may be neglected. Random photon counts can well be
described by a Poisson distribution with some parameter λ.
We know that the mean of the Poisson distribution is E(X) = λ and the
variance is VAR(X) = λ as well. Shall we now use the sample mean
x̄ = (1/n) · Σ_{i=1}^n xi
to fit E(X) = λ, or shall we use the sample variance
s² = (1/(n−1)) · Σ_{i=1}^n (xi − x̄)²
as defined in Section 5.3 to fit VAR(X) = λ? Which of the two will give
us a better estimate for λ? What does "better" mean in this context?
6.1 Method of moments
Given data points x1 , . . . , xn of n independent measurements, the idea of
the method of moments is to adjust the moments of the distribution such
that they fit the sample moments of the data. In general, the k-th sample
moment is defined as
moment is defined as
mk = (1/n) · Σ_{i=1}^n (xi)^k.
Note that we often write x̄ for m1 . For k > 1 it is often a good alternative
to consider the k-th sample central moments
m̄k = (1/n) · Σ_{i=1}^n (xi − x̄)^k
since non-central moments may become very large. Note that for the sample
variance two definitions exist, the one for m̄2 and the one for s2 . The
difference is that for m̄2 we divide the sum by n while for s2 we divide it by
n−1. The reason is that if we view the sample variance as an estimator of the
true variance of a random variable X based on the data points X1 , . . . , Xn
where the Xi are i.i.d. (same distribution as X with mean µ and variance
σ²), then the estimator
S² = (1/(n−1)) · Σ_{i=1}^n (Xi − X̄)²,  where X̄ = (1/n) · Σ_{i=1}^n Xi,
is unbiased (E(S²) = σ²), while
S̃² = (1/n) · Σ_{i=1}^n (Xi − X̄)²
is not (E(S̃²) ≠ σ²). We will come back to this issue when we discuss the
pros and cons of the method of moments.
Now we assume that we consider a distribution with exactly K unknown
parameters θ1 , . . . , θK . Then the moments of the distribution (often called
population moments) are functions of these parameters, i.e. E(X^k) =
µk(θ1, ..., θK). We may also simply write µ instead of µ1(θ1, ..., θK). Clearly,
the central moments E((X − µ)^k) = µ̄k(θ1, ..., θK), k = 2, 3, ..., are also
functions of θ1, ..., θK.
To determine these parameters we construct an equation system
µ1(θ̂1, ..., θ̂K) = m1
µ̄2(θ̂1, ..., θ̂K) = m̄2
...
µ̄K(θ̂1, ..., θ̂K) = m̄K
and solve for θ̂1, ..., θ̂K. Note that it is also possible to equate the non-central
moments instead. Also note that, in general, it may be necessary to
add equations for higher moments if there is no unique solution for θ̂1, ..., θ̂K
with K equations (for instance, because some equations are linearly dependent).
Example 45: Fitting the mean of a Poisson distribution
Revisiting Example 44 we would determine the parameter λ = µ by setting
λ = x̄ = m1 .
Example 46: Fitting the parameters of a Normal distribution
Assume that our data points x1 , . . . , xn are realizations of a random variable X that follows a Normal distribution. The Normal distribution has
parameter θ1 = µ for the mean and θ2 = σ for the standard deviation.
Thus, we directly get as a solution that µ = x̄ (from the first equation)
and σ = √m̄2 (from the second equation).
As a variant, assume that it is known that X has mean µ = 0. Then, for
the single remaining parameter σ we do not set K = 1 and consider only
the first equation µ1(µ, σ) = m1, since the first moment µ1(µ, σ) is equal
to the first parameter µ and thus does not give any constraints on σ.
Instead we consider the second equation µ̄2(µ, σ) = m̄2, since µ̄2(µ, σ) = σ²
and thus σ = √m̄2.
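To make this concrete, a minimal Matlab/Octave sketch of these two fits,
under the assumption that a data vector x is given:

    % Method of moments: match sample moments to population moments.
    % Poisson: E(X) = lambda, so the estimate is the sample mean.
    lambda_hat = mean(x);
    % Normal: mu = m1 and sigma = sqrt of the second sample central moment.
    mu_hat    = mean(x);
    sigma_hat = sqrt(mean((x - mean(x)).^2));   % divides by n, not n-1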
The method of moments has a number of disadvantages compared to the
estimation methods considered in the sequel. The estimators may be biased
and thus may have a systematic estimation error. An example is the bias of
the sample variance.
Example 47: Bias of the sample variance
As already discussed above, according to the method of moments, we would
estimate the variance of the true distribution by
s̃² = (1/n) · Σ_{i=1}^n (xi − x̄)²
instead of using the unbiased estimator
s² = (1/(n−1)) · Σ_{i=1}^n (xi − x̄)².
Intuitively, s̃2 underestimates the variance since in its definition we use
an estimator for the mean instead of the true mean and this estimator
must lie in the center of the samples (while the true mean might not)
and therefore the deviation from it is smaller than the deviations from
the true mean. Consider for instance a normal distribution with mean 0
and variance 100. For a small sample size of, say, n = 3 one might get
the data points x1 = −1.9412, x2 = −21.3836, x3 = −8.3959 using the
following MATLAB command x = random('Normal',0,10,3,1);. The
sample mean x̄ = −10.5736 is far away from the true mean. Obviously,
the deviations from the true mean 0 are much larger than the deviations
from x̄, and thus s̃² is only var(x,1) = 65.3717, while s² = var(x) =
98.0576 is closer to the true variance 100 (note that the Matlab
command var(x) uses the unbiased estimator!).
If we could consider the deviations from the (usually unknown) true mean
0, the estimator with the factor 1/n becomes unbiased:
E[ (1/n) · Σ_{i=1}^n (Xi − E(X))² ] = (1/n) · Σ_{i=1}^n E[(Xi − E(X))²]
= (1/n) · n · VAR(Xi) = VAR(X).
6.2 Generalized method of moments
The method of moments has the drawback that it does not make use of
all information available from the data points x1 , . . . , xn . For instance, in
Example 44 we only use x̄ and do not take into account the variation of the
data, i.e. s2 or s̃2 .
The generalized method of moments does take into account this information
by considering cost functions instead of equating the population moments
of the distribution and the sample moments.
In the sequel, we simply write θ for a vector θ = (θ1 , . . . , θK ) of parameters.
We define the following cost functions for J ≥ K:
g1(θ) = m1 − µ1(θ)
g2(θ) = m2 − µ2(θ)
...
gJ(θ) = mJ − µJ(θ)
Note that in the previous section we only considered the case where J = K
(number of parameters equals number of equations) and we wanted all costs
to be zero.
Now, we can choose a least squares estimator which chooses θ̂ such that
S(θ) = Σ_{j=1}^J gj²(θ)
becomes minimal, i.e. θ̂ = arg minθ S(θ). However, this global cost function
S(θ) is not optimal since for higher order moments the variance of gj will be
much higher than for low-order moments. (Note that gj is also a function
of the data and can thus be interpreted as a random variable. Due to the
j-th power in the j-th sample mean, we can expect a higher variance for
j becoming larger.) Thus, the idea is to weight each cost function by its
variance. However, since some cost functions may be correlated (as we see
in the next example), we take a more general approach that allows us to
also consider covariances.
Example 48: Correlated cost functions
Recall that for the normal distribution we have two parameters µ and σ 2
and that the first four population moments and cost functions are given
by
E(X) = µ            (mean)       g1(µ̂, σ̂²) = m1 − µ̂
VAR(X) = σ²          (variance)   g2(µ̂, σ̂²) = m̄2 − σ̂²
E([X − µ]³) = 0      (skewness)   g3(µ̂, σ̂²) = m̄3 − 0
E([X − µ]⁴) = 3σ⁴    (kurtosis)   g4(µ̂, σ̂²) = m̄4 − 3σ̂⁴
where we used the central moments instead of the non-central ones. We
see that the second and the fourth cost function both give constraints only
on σ and are correlated.
We now interpret g1(θ), ..., gJ(θ) as a (column) vector g(θ) and consider
the very general case that the cost function is
S̃(θ) = g(θ)′ W g(θ)     (6.1)
where W is a positive-definite J × J matrix with appropriate weights. Note
that if W is the identity matrix I then S̃(θ) = S(θ).
We now give some sufficient conditions under which the estimator θ̂ =
arg minθ S̃(θ) is consistent, i.e. converges in distribution to the true value θ̃
of θ.
1. It must hold that J ≥ K (the number of moment conditions is not lower
than the number of parameters). Otherwise we do not have enough
conditions to identify θ1, ..., θK.
2. We must have at least K moment conditions that are functionally
independent. This is the case if the rank of the Jacobian matrix of the
g-functions (evaluated at θ̃) has a rank of at least K.
3. If we replace the sample moments by the population moments in the
cost terms of S̃(θ), then S̃(θ) must yield zero for θ = θ̃. This must
be a unique property of θ̃, i.e., no other value of θ should satisfy this
equality.
It turns out that for the case J > K we can minimize the variance of the
estimator θ̂ that minimizes (6.1) as follows:
We know from the central limit theorem (multidimensional case) that √n · g(θ)
converges to a normal distribution with mean zero (if θ is the 'true' parameter
value) and covariance matrix Σ, where the entries satisfy
COV(gj, gk) = COV( (1/n)·Σ_{i=1}^n Xi^j − µj , (1/n)·Σ_{i=1}^n Xi^k − µk )
= (1/n) · COV(Xi^j, Xi^k).
It can be shown that if we choose W proportional to the inverse of Σ we get
an optimal weighting matrix (in the sense that the variance of our estimator
θ̂ becomes minimal). Thus, we ignore the 1/n factor and try to estimate the
covariance matrix F of the population moments (with entries COV(Xi^j, Xi^k))
by
F̂_{j,k} = (1/n) · Σ_{i=1}^n (Xi^j − µj)(Xi^k − µk).
Then we choose W = F̂ −1 . However, F̂ depends on the parameter vector
θ (as it depends on µ1 , . . . , µJ which are functions of θ). Therefore, the
following 2-step procedure can be used to find a good choice for W :
1. Choose W = I and compute
θ̂(1) = arg min_θ (g′ W g).
2. Compute F̂ with entries
F̂_{j,k} = (1/n) · Σ_{i=1}^n [Xi^j − µj(θ̂(1))] · [Xi^k − µk(θ̂(1))]
and for W = F̂⁻¹ compute
θ̂(2) = arg min_θ (g′ W g).
Example 49: Photon Count revisited
Consider again the photon count example (see Example 44). In this case,
we have a single parameter θ = λ and E[Xi] = λ, E[Xi²] = λ² + λ. Thus,
the first two cost functions are
g1(λ) = (1/n) · Σ_{i=1}^n Xi − λ
g2(λ) = (1/n) · Σ_{i=1}^n Xi² − λ² − λ.
In the method of moments approach, we only used the first cost
function and chose λ̂ = (1/n) · Σ_{i=1}^n Xi. We use the Matlab command
x=random('Poisson',10,1,30) to produce 30 random numbers that follow
a Poisson distribution with mean λ = 10. Let us assume that the returned
numbers are such that m1 = x̄ = 10.5 and m2 = 126.3. We do a
first estimation of λ by minimizing g′Wg for W = I, i.e. we minimize
g1(λ)² + g2(λ)² w.r.t. λ. The Matlab command
fminsearch(@(l) (10.5 - l)^2 + (126.3 - l^2 - l)^2, 1)
starts a search at initial value 1 and yields λ̂ = 10.776, which is slightly
worse than the estimation λ̂ = x̄ of the method of moments. Next we
compute Σ̂ based on our estimation 10.776. For instance, for the two
off-diagonal entries we compute mean((x - 10.776).*(x.^2 - 10.776^2 -
10.776)), where x is the vector of data points. As its inverse we get an
(asymptotically) optimal weight matrix
W = [ 0.9788  −0.0375 ; −0.0375  0.0015 ].
Finally, we call again
fminsearch(@(l) W(1,1)*(mean(x)-l)^2 + W(2,2)*(mean(x.^2)-l^2-l)^2
+ 2*W(1,2)*(mean(x)-l)*(mean(x.^2)-l^2-l), 1)
and get the estimation λ̂ = 10.08.
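The two-step procedure of Section 6.2 can be written compactly. The
following Matlab/Octave sketch assumes only the data vector x from above;
the concrete numbers will of course vary with the sample:

    % Two-step GMM for the Poisson parameter lambda (sketch).
    g = @(l) [mean(x) - l; mean(x.^2) - l^2 - l];   % cost function vector
    % Step 1: identity weight matrix.
    l1 = fminsearch(@(l) g(l)' * g(l), 1);
    % Step 2: estimate F and use W = inv(F) as weight matrix.
    F = zeros(2,2);
    F(1,1) = mean((x - l1).^2);
    F(1,2) = mean((x - l1).*(x.^2 - l1^2 - l1));
    F(2,1) = F(1,2);
    F(2,2) = mean((x.^2 - l1^2 - l1).^2);
    W  = inv(F);
    l2 = fminsearch(@(l) g(l)' * W * g(l), l1);     % final GMM estimate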
It is important to note that using the weight matrix might not always result
in a more accurate estimation. In fact, our choice of W is only asymptotically
optimal. However, the estimator is always consistent (it converges in
probability to the true value of the parameter) and asymptotically normal.
6.3 Maximum likelihood method
Again, we assume that n data points X1 , . . . , Xn are given as well as the
family of the distribution. The maximum likelihood method is the most
common way of estimating parameters of a distribution. Recall that an
estimator is a function of the data. Here, we consider maximum likelihood
estimators.
Let θ be the unknown parameter and let Pθ (·) be the corresponding discrete
probability mass function, i.e. the Xi are all independent and identically
distributed with P (Xi = x) = Pθ (x), x ∈ R. Therefore, the probability of
observing X1 = x1 , . . . , Xn = xn , assuming that θ is the true value, is given
by
Lθ (x1 , . . . , xn ) = Pθ (x1 )Pθ (x2 )...Pθ (xn ).
Lθ (x1 , . . . , xn ) is also called the likelihood of the data and we say that θ̂ is
a maximum likelihood estimator (MLE) of θ if for all possible θ
Lθ (x1 , . . . , xn ) ≤ Lθ̂ (x1 , . . . , xn ).
Note that the MLE of θ is, in general, not unique. Intuitively, the MLE is
the parameter value that “best explains” the observed data.
To maximize the likelihood, we can consider its derivatives w.r.t. θ and find
parameter values where ∂/∂θ Lθ(x1, ..., xn) equals 0 (in some cases such points
do not exist and we find θ̂ at the boundary of the set of possible values for
θ).
Often the maximum of the log-likelihood
ln Lθ (x1 , . . . , xn ) = ln Pθ (x1 ) + ln Pθ (x2 ) + ... + ln Pθ (xn ),
which is equal to that of the likelihood (as the logarithm is strictly monotonically increasing), is easier to compute. More concretely, θ̂ maximizes Lθ
if and only if θ̂ maximizes ln Lθ .
Example 50: MLE of the geometric distribution
Assume we have collected data that represents inter-arrival times of customers (in minutes). Since the inter-arrival times are approximately geometrically distributed (we hypothesize this after looking at certain summary statistics such as the range of the data, the mean, the skewness,
the variance or a histogram plot of the data), we wish to fit the data to
a geometric distribution with parameter θ, i.e. we want to find a θ that
best explains the data. The likelihood of the data x1 , . . . , xn is in this case
given by
Lθ(x1, ..., xn) = Pθ(x1) · Pθ(x2) · ... · Pθ(xn)
= θ(1−θ)^x1 · θ(1−θ)^x2 · ... · θ(1−θ)^xn
= θ^n · (1−θ)^(Σ_{i=1}^n xi),
since the probability to observe xi is Pθ(xi) = θ·(1−θ)^xi. Since
Lθ(x1, ..., xn) is a function of θ, we shortly write L(θ) and omit the
dependence on the data. Considering the log-likelihood yields
ln L(θ) = n · ln θ + Σ_{i=1}^n xi · ln(1−θ).
We maximize ln L(θ) by setting the derivative to zero:
d ln L(θ)/dθ = n/θ − (1/(1−θ)) · Σ_i xi,
0 = n/θ − (1/(1−θ)) · Σ_i xi  ⇔  Σ_i xi = n·(1−θ)/θ  ⇔  x̄ = 1/θ − 1  ⇔  θ̂ = 1/(x̄ + 1),
where x̄ = (1/n) · Σ_{i=1}^n xi. Next we find that θ̂ is really a maximizer:
d² ln L(θ)/dθ² = −n/θ² − (1/(1−θ)²) · Σ_i xi < 0.
Hence, the MLE of θ is θ̂ = 1/(x̄ + 1).
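A quick numerical sanity check of this estimator (a sketch; the chosen
parameter value and the inversion formula for geometric variates are
illustrative assumptions, not part of the example above):

    % Check the MLE theta_hat = 1/(xbar+1) on synthetic geometric data.
    theta = 0.3; n = 100000;
    x = floor(log(rand(n,1)) ./ log(1 - theta));  % geometric on {0,1,2,...}
    thetaHat = 1/(mean(x) + 1);                   % should be close to 0.3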
Note that in the above example, since the mean of the geometric distribution is (1 − θ)/θ we matched the mean x̄ of the data and the mean of the
distribution. Note that this is not always the case for a maximum likelihood
estimator, i.e. the estimator of the method of moments is not always the
same as the MLE.
Assume now that we have hypothesized a continuous distribution for our
data. In that case, the likelihood function is
Lθ (x1 , . . . , xn ) = fθ (x1 ) · fθ (x2 ) · ... · fθ (xn )
where fθ is the density of the distribution that depends on the parameter
θ. Again, since Lθ (x1 , . . . , xn ) is a function of θ, we shortly write L(θ) and
omit the dependence on the data. An MLE θ̂ of θ is defined as a value that
maximizes L(θ) (over all permissible values of θ).
The reason why the likelihood is a density in the continuous case is that the
probability of a single outcome is zero in the continuous case. Hence one
could consider
P(xi − h < X < xi + h) = ∫_{xi−h}^{xi+h} fθ(x) dx
for the outcome xi and some very small h. But this quantity is approximately
equal to 2h·fθ(xi) and thus proportional to the factors of the likelihood
defined above. Hence, this approach would result in the same MLE.
Example 51: MLE of the exponential distribution
Assume that data were collected on the inter-arrival times for cars in a
drive-up banking facility. Since histograms of the data show an exponential
curve, we hypothesize an exponential distribution. Note that some
distributions generalize the exponential distribution but have more parameters
(e.g. the gamma distribution). They would provide a fit which is at least as
good as the fit of the exponential distribution.
For the exponential distribution the parameter is θ = λ, λ > 0, and the
density is fλ(x) = λe^(−λx). Hence, the likelihood of the samples x1, ..., xn
is
L(λ) = Π_{i=1}^n λe^(−λxi) = λ^n · e^(−λ·Σ_i xi)
and the log-likelihood is
ln L(λ) = n · ln λ − λ · Σ_i xi
and its derivative w.r.t. λ is
d ln L(λ)/dλ = n/λ − Σ_i xi.
Setting the derivative to zero yields
0 = n/λ − Σ_i xi  ⇔  1/λ = (1/n) · Σ_i xi = x̄
and thus λ = 1/x̄. We find that λ is a maximizer:
d² ln L(λ)/dλ² = −n/λ² < 0
and finally get λ̂ = 1/x̄.
We remark that for some distributions the log-likelihood function may not
be useful, and also, finding a maximum by setting the derivative to zero is
not always possible. In general, if an analytic solution is not possible, global
optimization methods have to be applied in order to determine which of
several local optima of the likelihood has the highest value.
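As a minimal sketch of such a numerical approach (here for the exponential
model, where the analytic solution λ̂ = 1/x̄ is available for comparison; the
data vector is made up for illustration, and for multimodal likelihoods
fminsearch would have to be restarted from several initial values):

    % Numerical MLE: minimize the negative log-likelihood.
    % Exponential model with density lambda*exp(-lambda*x).
    % max(.,realmin) keeps the argument of log positive if the
    % optimizer tries a non-positive lambda.
    negLogL = @(l, x) -(numel(x)*log(max(l, realmin)) - l*sum(x));
    x = [1.2 0.4 2.7 0.9 1.5];          % some example data
    lambdaHat = fminsearch(@(l) negLogL(l, x), 1);
    % Compare with the analytic MLE 1/mean(x).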
We list some important properties of Maximum Likelihood Estimators (some
of the properties require mild “regularity” assumptions, e.g. likelihood function must be differentiable, support of distribution does not depend on θ.)
• Uniqueness: For most common distributions, the MLE is unique; that
is, L(θ̂) is strictly greater than L(θ) for any other value of θ.
• Asymptotic unbiasedness: θ̂ is a function of the samples x1, ..., xn and
thus a random variable if Xi = xi is not given, i ∈ {1, 2, ..., n}.
Assume now that X1, ..., Xn have distribution Pθ and θ̂n is the MLE
of θ based on the observations X1, ..., Xn. Then
lim_{n→∞} E[θ̂n] = θ.
Therefore, we say that θ̂n is asymptotically unbiased.
• Invariance: If θ̂ is an MLE of θ and if g is a one-to-one function, then
g(θ̂) is an MLE of g(θ). To see that this holds, we note that L(θ) and
L̃(g(θ)) := L(g⁻¹(g(θ))) are both maximized by θ̂, so the MLE of g(θ),
interpreted as the unknown parameter, is ĝ = g(θ̂).
Example 52: Invariance
Assume that x1, x2, ..., xn are samples from a Bernoulli distribution with
parameter θ = p. Then the likelihood is
Π_{i=1}^n [θ·xi + (1−θ)·χ₌₀(xi)] = θ^(Σ_i xi) · (1−θ)^(n − Σ_i xi),
where χ₌₀(x) = 1 if x = 0 and χ₌₀(x) = 0 otherwise.
Hence the log-likelihood is
ln L(θ) = Σ_i xi · ln(θ) + (n − Σ_i xi) · ln(1−θ)
and setting its derivative to zero yields
∂/∂θ ln L(θ) = Σ_i xi/θ − (n − Σ_i xi)/(1−θ) = 0
⇒ (1−θ̂) · Σ_i xi = θ̂ · (n − Σ_i xi)  ⇔  Σ_i xi = n·θ̂  ⇔  θ̂ = Σ_i xi / n.
Next we find that θ̂ = Σ_i xi/n =: x̄ is a maximizer:
∂²/∂θ² ln L(θ) = −Σ_i xi/θ² − (n − Σ_i xi)/(1−θ)² < 0 for any θ ∈ [0, 1].
Thus, θ̂ = x̄.
Next, we assume that the variance g(θ) = θ(1−θ) is our parameter (note
that in the permissible range of θ, g is one-to-one). Define L̃ such that
L̃(g(θ)) = L(θ), i.e., we consider the same values for the likelihood, but
the likelihood function is different as it is a function in g(θ). Setting the
derivative to zero yields
∂/∂g L̃(g) = ∂/∂θ L̃(g(θ)) · ∂θ/∂g = 0,
where L̃(g(θ)) = L(θ). Since ∂/∂θ L(g(θ))|_{θ=θ̂} = 0, it holds that
∂/∂g L̃(g) = 0 if we choose ĝ = g(θ̂) = x̄(1−x̄).
• Asymptotically normally distributed: For n → ∞, θ̂n converges in
distribution to N(θ, δ(θ)) (normal distribution with mean θ and variance
δ(θ)), where θ̂n is our estimation based on n observations and θ is the
"true" parameter of the distribution of the observations. Moreover,
δ(θ) = −( E[ d² ln L(θ)/dθ² ] )⁻¹.
One can even show that for any other estimator θ̃, which converges
in distribution to N(θ, σ²), we have δ(θ) ≤ σ² (the variance is greater or
equal). Therefore, MLEs are called best asymptotically normal. For
large n, we can estimate the probability that θ̂ deviates from the true
value by more than some given ε (see confidence intervals; later).
• Strongly consistent: MLEs are strongly consistent, i.e.,
P( lim_{n→∞} θ̂n = θ ) = 1.
In the case of more than one parameter, the MLE is found in a very similar
way, i.e. we find the vector θ̂ = (θ̂1, ..., θ̂m) that maximizes L(θ) or ln L(θ).
For instance, for two parameters α and β we try to solve ∂/∂α ln L(θ) = 0 and
∂/∂β ln L(θ) = 0 simultaneously for α and β, where θ = (α, β). (See exercises
for an example.)
6.4 Estimation of Variances
It is very important to analyze how good the estimated value given by some
estimator θ̂ is. If VAR(θ̂) is large, then our estimation is of bad quality and
might lead to wrong conclusions about the real system. Typically, when
results of parameter estimations are reported, the (estimated) standard
deviations √VAR(θ̂) are given as well.
Example 53: Poisson distribution
In the previous sections we found that both the method of moments and
the maximum likelihood approach gives the estimator
θ̂ = X̄
for the Poisson distribution, i.e. the best value for the unknown mean of
the Poisson distribution is the sample mean. In Section 5.3.1 we already
found that the variance of X̄ is σ 2 /n where σ 2 is the variance of the
distribution of the Xi and thus for the Poisson distribution we get
VAR(θ̂) = VAR(X̄) = θ∗ /n.
Here, θ∗ is the ’true’ (unknown) value that we estimate with θ̂ = X̄.
Thus, we can estimate VAR(θ̂) as X̄/n.
Alternatively, we can estimate σ 2 by computing the sample variance S 2
(as explained in Example 47) and estimate VAR(θ̂) as S 2 /n.
Often, analytic formulas for VAR(θ̂) cannot be derived. In this case, one can
either use the bootstrap method or estimate VAR(θ̂) based on the properties
of the estimator.
Generalized Method of Moments. For the GMM estimator, it can be
shown that (under the assumptions made in Section 6.2) the estimator is
asymptotically normally distributed with (asymptotic) covariance matrix
V = (1/n) · (G′ F⁻¹ G)⁻¹
if we choose the optimal weighting matrix W = F⁻¹. Here, the matrix G is
the expected Jacobian matrix of the cost functions (where the sample moments
are replaced by the theoretical moments, i.e. we replace X̄ by X, m2 by X²,
etc.) evaluated at the true value of θ. We can therefore replace G and F by
their corresponding estimates Ĝ
and F̂ and compute V̂ = (1/n) · (Ĝ′ F̂⁻¹ Ĝ)⁻¹, where
Ĝ = [ ∂g1(θ̂)/∂θ1  ...  ∂g1(θ̂)/∂θK
      ...
      ∂gJ(θ̂)/∂θ1  ...  ∂gJ(θ̂)/∂θK ].
Thus, estimates for the variances of θ̂1, ..., θ̂K are given by the diagonal
entries of V̂.
Example 54: Estimated error of GMM estimator
Recall Example 49. For our final estimation λ̂ = 10.08 we first consider
the matrix G. We find ∂(X − λ)/∂λ = −1 and ∂(X² − λ² − λ)/∂λ = −2λ − 1.
Thus, G is independent of X, and we do not need to replace the population
moments E[X^i] by their corresponding sample moments but can directly
use
Ĝ = [ −1 ; −2λ̂ − 1 ] = [ −1 ; −21.16 ]
as an estimator. Then we estimate the variance of λ̂ as
V̂ = (1/n) · (Ĝ′ W Ĝ)⁻¹ = (1/n) · (1/0.0634) = 0.5256.
Thus, the standard error is √0.5256 = 0.7250.
Maximum Likelihood Method. To estimate the variance of a maximum
likelihood estimator, we exploit the fact that MLEs are asymptotically
normally distributed. Thus, for large sample sizes and a single parameter
θ ∈ R, we have approximately a variance of
δ(θ*) = −( E[ d² ln L(θ*, X1, ..., Xn)/dθ² ] )⁻¹.
In the above expression we wrote L(θ, X1, ..., Xn) for the likelihood to
emphasize that L is a function of θ and X1, ..., Xn. Moreover, here θ* is
the true value of θ. For an estimator based on the data we simply replace
X1, ..., Xn by their corresponding realizations x1, ..., xn and the (unknown)
true value of θ by the estimate θ̂, i.e.
−( ∂² ln L(θ̂, x1, ..., xn)/∂θ² )⁻¹.
Intuitively, the second derivative tells us the curvature of the likelihood; if
it is "flat" at θ̂ then our estimated value might not be very accurate and
the variance (the negative inverse) is large. Then either we do not have enough
samples or the parameter is difficult to identify (see also Example 56 below).
We are not very confident about our estimated value since
perturbing θ̂ slightly also gives us a similar likelihood.
For several parameters, the same approach is used, i.e. the negative diagonal
entries of the inverse of the Hessian matrix give estimates for the variances
of the parameters.
Example 55: Parameters of the normal distribution
Assume that we have n i.i.d. samples X1, ..., Xn that follow a normal
distribution with (unknown) mean µ and (unknown) variance σ². Hence,
θ = (µ, σ²).
The likelihood of the data is
L(X1, ..., Xn; θ) = Π_{i=1}^n (1/√(2πσ²)) · exp( −(1/(2σ²)) · (Xi − µ)² ).
Thus, the log-likelihood of the data is
ln L(X1, ..., Xn; θ) = −(n/2)·ln(2π) − (n/2)·ln(σ²) − (1/(2σ²)) · Σ_{i=1}^n (Xi − µ)².
Next we compute the derivatives w.r.t. µ and σ²:
∂/∂µ ln L(X1, ..., Xn; θ) = (1/σ²) · Σ_{i=1}^n (Xi − µ)
∂/∂σ² ln L(X1, ..., Xn; θ) = −(n/2)·(σ²)⁻¹ + (1/2)·(σ²)⁻² · Σ_{i=1}^n (Xi − µ)².
Setting the two derivatives to zero yields from the first equation
Σ_{i=1}^n (Xi − µ) = 0  ⇒  Σ_{i=1}^n Xi − n·µ = 0  ⇒  µ̂ = (1/n) · Σ_{i=1}^n Xi = X̄.
Inserting this into the second equation and multiplying by 2σ⁴ gives
−n·σ² + Σ_{i=1}^n (Xi − µ̂)² = 0  ⇒  σ̂² = (1/n) · Σ_{i=1}^n (Xi − X̄)².
Thus, the MLE of σ² is a biased estimator.
To prove that µ̂ and σ̂² yield a maximum we have to consider the Hessian
matrix
H(µ, σ²) = [ ∂²/∂µ² ln L        ∂²/∂µ∂σ² ln L
             ∂²/∂σ²∂µ ln L      ∂²/∂(σ²)² ln L ]
= [ −n/σ²                                −(σ²)⁻² · Σ_{i=1}^n (Xi − µ)
    −(σ²)⁻² · Σ_{i=1}^n (Xi − µ)         (n/2)·(σ²)⁻² − (σ²)⁻³ · Σ_{i=1}^n (Xi − µ)² ].
We have a local maximum at θ̂ = (µ̂, σ̂²) if ∂²/∂µ² ln L < 0 and if the
determinant
det(H(µ̂, σ̂²)) = ∂²/∂µ² ln L · ∂²/∂(σ²)² ln L − ∂²/∂µ∂σ² ln L · ∂²/∂σ²∂µ ln L
is positive at θ̂. Next we insert concrete numbers to simplify the computation.
We generate 50 random numbers that are normally distributed with
mean zero and variance one. We get the estimates
µ̂ = −0.1229 and σ̂² = 0.9903
and the Hessian is (approximately)
[ −50.4906   −0.0000
  −0.0000   −25.4930 ].
(Note that its determinant is indeed positive.) The inverse of the Hessian
is
[ −0.0198    0.0000
   0.0000   −0.0392 ].
Thus, the estimated variances are 0.0198 for µ̂ and 0.0392 for σ̂², yielding
standard errors of ε1 = 0.1407 and ε2 = 0.1981. Note that the true mean
and variance are elements of the intervals [µ̂ − ε1, µ̂ + ε1] and [σ̂² − ε2, σ̂² + ε2].
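A small Matlab/Octave sketch that reproduces this kind of computation;
the closed-form Hessian derived above is evaluated at the estimates, and
with randn data the concrete numbers will of course differ from those in the
example:

    % Standard errors of the normal MLEs via the Hessian.
    n  = 50;
    x  = randn(n, 1);                 % N(0,1) samples
    mu = mean(x);                     % MLE of the mean
    s2 = mean((x - mu).^2);           % (biased) MLE of the variance
    % Hessian of ln L at (mu, s2), using the formulas derived above:
    H = [ -n/s2,                -sum(x - mu)/s2^2;
          -sum(x - mu)/s2^2,    n/(2*s2^2) - sum((x - mu).^2)/s2^3 ];
    V = -inv(H);                      % estimated covariance matrix
    stdErr = sqrt(diag(V));           % standard errors of mu and s2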
Example 56: Parameters of a birth-death process
Recall the DTMC that we considered in Example 31 on page 42. We
illustrate its graph below.
[Diagram: a birth-death chain on the states 0, 1, 2, 3, ... with transition
probability pλ from each state i to i+1, probability pµ from each state i ≥ 1
to i−1, and self-loop probabilities 1−pλ (state 0) and 1−pλ−pµ (states i ≥ 1).]
Assume that pλ and pµ are unknown parameters that we want to estimate
and the chain starts with probability one in state 0 at time t = 0. We
consider three cases of observations:
91
6.4. ESTIMATION OF VARIANCES
a) We have n independent samples X1 , . . . , Xn which give us the state
of the chain at times t1 , . . . , tn .
b) We have a time series X1 , . . . , Xn from a single run of the chain
which give us the state of the chain at times t1 , . . . , tn for t1 < . . . <
tn .
c) We have n independent steady-state observations X1 , . . . , Xn .
How good will our estimations of pλ and pµ be in each of the three cases?
For a) we have a likelihood
La(pλ, pµ) = Π_{i=1}^n p_{ti}(Xi),
where p_t = p0 · P^t is the transient distribution after t steps. Note that the
transition probability matrix P depends on pλ and pµ.
In case b) the likelihood is
Lb(pλ, pµ) = p_{t1}(X1) · (P^{t2−t1})_{X1,X2} · ... · (P^{tn−tn−1})_{Xn−1,Xn},
where (P^t)_{i,j} denotes the entry of the matrix P^t at position (i, j).
Finally, for c) we have
Lc(pλ, pµ) = Π_{i=1}^n π(Xi),
where π is the equilibrium distribution of the chain (assuming pλ < pµ).
After generating n = 100 samples for pλ = 0.4 and pµ = 0.45 in case of
a) the MLE gives
p̂λ = 0.299 ± 0.138 and p̂µ = 0.330 ± 0.158
as a local minimum of the negative log-likelihood. The Matlab function
fmincon that performs the local search also approximates the Hessian from
which we estimated the standard errors 0.138 and 0.158 of p̂λ and p̂µ .
Note that these estimated values give a very close estimation of the ratios
pλ /(pλ +pµ ) and pµ /(pλ +pµ ), but the absolute numbers are too small and
the estimate for the selfloop probability 1 − (pλ + pµ ) is too high. The high
errors show us that the estimation is of bad quality. This problem occurs
very often in realistic applications of the MLE and shows that certain
parameters may not be identifiable. It is related to the fact that we are
considering a single observation of many independent runs of the chain
which is not sufficient to estimate pλ and pµ .
Again we generate n = 100 samples for pλ = 0.4 and pµ = 0.45 but as a
time series (case b). The corresponding MLE gives
p̂λ = 0.399 ± 0.027 and p̂µ = 0.443 ± 0.030
as a local minimum of the negative log-likelihood. Thus, time-series data
allows to estimate the true parameters very accurately which comes from
the fact that the data is correlated and thus much more informative.
Finally, we generate n = 100 steady-state samples (case c). The corresponding
MLE gives
p̂λ = 0.254 ± 11.446 and p̂µ = 0.295 ± 12.456
as a local minimum of the negative log-likelihood. The standard errors
are extremely high, showing that the log-likelihood function is "very flat"
around the minimum, and thus the estimated values may not be close to the
true values of the parameters. We have again the problem of parameters
that are not identifiable, which is obvious since the equilibrium distribution
does not uniquely define the DTMC. In fact, there are infinitely
many DTMCs having the same equilibrium distribution but different values
for pλ and pµ. However, the ratios are again estimated very accurately,
which comes from the fact that the ratios can indeed be determined from
equilibrium data.
6.5 Estimation of Parameters of HMMs
When we discussed hidden Markov models, we discussed the forward algorithm
for computing the probability of a sequence τ = o_{k0} o_{k1} ... o_{km} of
outputs. The idea is to perform the following vector-matrix multiplications:
P(τ) = P(Y0 = o_{k0}, Y1 = o_{k1}, ..., Ym = o_{km}) = p0 W0 P W1 P ... Wm−1 P Wm e.     (6.2)
Now, we want to adjust the parameters of the HMM given a sequence
τ = ok0 ok1 , . . . , okm . As parameters, we have the initial distribution p0 of
the hidden states, the transition probabilities pi,j between the hidden states
and the emission probabilities bik of making output ok when the hidden state
is i. In a maximum likelihood approach, we must maximize P (τ ) w.r.t. the
parameters. Thus, global optimization methods have to be applied to maximize the logarithm of P (τ ). An alternative approach based on the iterative
Baum-Welch procedure gives a local maximum of P (τ ) in the following way:
Figure 6.1: Illustration of the computation of wt (i, j). Here, aij is the
transition probability and bj is the emission probability.
Instead of P(τ) we consider the probability
wt(i, j) = P(Xt = i, Xt+1 = j | τ),   t ∈ {0, 1, ..., m−1},
which is the probability of being in state i at time t and in state j at time
t+1, given τ. The values of wt(i, j) can be computed iteratively by exploiting
that
wt(i, j) = αt(i) · pij · b_{j,o_{t+1}} · βt+1(j) / P(τ)
= αt(i) · pij · b_{j,o_{t+1}} · βt+1(j) / ( Σ_i Σ_j αt(i) · pij · b_{j,o_{t+1}} · βt+1(j) ),
where
αt(i) = P(Y0 = o_{k0}, ..., Yt = o_{kt}, Xt = i) = p0 W0 P W1 P ... Wt−1 P Wt ei
and
βt+1(j) = P(Yt+2 = o_{kt+2}, ..., Ym = o_{km} | Xt+1 = j) = ej P Wt+2 P ... Wm−1 P Wm e,
where eh is a vector which is zero everywhere except for position h where
it is equal to one. Note that αt(i) and βt+1(j) give intermediate values
of the large matrix-vector product in Eq. 6.2, where the product is either
evaluated from the left or from the right. The idea is illustrated in Figure 6.1.
Obviously, the value
γt(i) = Σ_j wt(i, j) = Σ_j P(Xt = i, Xt+1 = j | τ) = P(Xt = i | τ)
is the probability of being in state i given the observation τ. Therefore, we
can compute the corresponding time averages:
Σ_{t=0}^{m−1} γt(i)
equals the average number of transitions from state i (given τ), and
Σ_{t=0}^{m−1} wt(i, j)
equals the average number of transitions from i to j (given τ).
Example 57:
For the example of annual temperatures we considered the following observed
sequence in Example 38:
τ = S M S L.
Assuming the initial distribution p0 = (0.6 0.4), we computed the probability
P(Y0 = S, Y1 = M, Y2 = S, Y3 = L) as
(0.6 0.4) · [0.1 0; 0 0.7] · [0.7 0.3; 0.4 0.6] · [0.4 0; 0 0.2] · [0.7 0.3; 0.4 0.6]
· [0.1 0; 0 0.7] · [0.7 0.3; 0.4 0.6] · [0.5 0; 0 0.1] · (1 1)′ = 0.0096,
where [a b; c d] denotes the 2 × 2 matrix with rows (a b) and (c d).
Let us now compute, for instance, w0(1, 2): we find that α0(1) = 0.6 ·
0.1 = 0.06. Similarly, β1(2) = 0.1244 is the second entry of the vector
P W2 P W3 e. Thus, w0(1, 2) = (1/N) · 0.06 · 0.3 · 0.2 · 0.1244 = (1/N) ·
0.0004 ≈ 0.047, where we used that p_{1,2} = 0.3 and the emission probability
of output M in state 2 is 0.2. Here, N is the normalization factor P(τ) = 0.0096.
Next we use the values wt(i, j) to improve a first guess for the model
parameters. We first choose p0, P and B randomly and compute wt(i, j) for
all t < m and all i, j. Then we choose new parameter values according to
the following set of reestimation formulas:
p0(i) = expected number of times in state i at time 0 = γ0(i)
pij = (expected number of transitions from i to j) / (expected number of
transitions from i) = Σ_t wt(i, j) / Σ_t γt(i)
bik = (expected number of times in i with output ok) / (expected number of
times in i) = Σ_{t: Yt = ok} γt(i) / Σ_t γt(i)
It is possible to prove that the reestimation step leads to parameter values
with a higher likelihood unless we are in a (possibly only local) maximum of
the likelihood function. Thus, it can replace the search for a local optimum
of the likelihood. Note that computing the values wt(i, j) can be done
efficiently by evaluating the matrix-vector product in Eq. 6.2 from the left
and from the right.
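To make the bookkeeping concrete, here is a minimal Matlab/Octave sketch
of these quantities for a single observation sequence. The names prior
(1 × S initial distribution), A (S × S transition matrix), B (S × K emission
matrix) and obs (vector of output indices) are assumptions for illustration;
no rescaling of α and β is performed, so the sketch is only suitable for
short sequences:

    % Forward-backward quantities for one output sequence (no rescaling).
    S = numel(prior); m = numel(obs) - 1;
    alpha = zeros(m+1, S); beta = ones(m+1, S);
    alpha(1,:) = prior .* B(:, obs(1))';            % alpha_0
    for t = 1:m
        alpha(t+1,:) = (alpha(t,:) * A) .* B(:, obs(t+1))';
    end
    for t = m:-1:1
        beta(t,:) = (A * (B(:, obs(t+1)) .* beta(t+1,:)'))';
    end
    Ptau = sum(alpha(m+1,:));                       % P(tau)
    w = zeros(S, S, m);                             % w(:,:,t) is w_{t-1}(i,j)
    for t = 1:m
        w(:,:,t) = (alpha(t,:)' * (B(:, obs(t+1)) .* beta(t+1,:)')') .* A / Ptau;
    end
    gamma = squeeze(sum(w, 2));                     % gamma_{t-1}(i), S x m
    % Reestimates (the emission reestimate is analogous):
    p0new = gamma(:,1)';
    Anew  = sum(w, 3) ./ sum(gamma, 2);             % row-wise normalization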
Example 58: Annual temperatures revisited
For the example of annual temperatures we get the following reestimates
(after starting with the true values of the parameters): the initial distribution
p0 = (0.6 0.4) is (badly) estimated as
(0.1882 0.8118).
The transition probability matrix
[0.7 0.3; 0.4 0.6]
is estimated as
[0.56 0.44; 0.50 0.50].
Note that much better estimation results are achieved if several observation
sequences are considered.
6.6 Exercises
Easy
Exercise 48: Method of Moments I
Consider n independent and identically distributed numbers that follow (a)
an exponential distribution with unknown parameter λ or (b) a uniform
distribution with unknown support [a, b]. Derive the method of moments
estimator (a) for λ and (b) for a and b.
Exercise 49: Method of Moments II
Derive the method of moments estimator for a 2-dimensional normal
distribution where the 2-dimensional mean (µx, µy) and the 2 × 2 covariance
matrix Σ have to be estimated. Write out the details for a given set of n
sample pairs (X1, Y1), ..., (Xn, Yn).
Exercise 50: Confidence Interval for the Sample Mean
Construct an approximate (1 − α)100 % confidence interval for the sample
mean X̄ for the case a) that the variance σ 2 of the distribution is known
and b) is not known.
Hint: Make sure that the probability that the interval covers the true mean
is 1 − α.
Exercise 51: Variance of maximum likelihood estimators
Assume that X1, ..., Xn are i.i.d. samples from a distribution with density
fθ(x) = (θ + 1) · x^θ, where 0 ≤ θ ≤ 1 and 0 ≤ x ≤ 1.
• Compute the MLE θ̂ for the unknown parameter θ.
• How would you approximate the distribution of the MLE for parameter
θ? What is the standard deviation (standard error) of θ̂?
Exercise 52: Uniqueness of MLE
Assume that X1, ..., Xn are i.i.d. samples from the uniform distribution
U(θ, θ+1), where the value of the parameter θ ∈ (−∞, +∞). Show that the
maximum likelihood estimator for θ is not unique.
Exercise 53: Application of Baum-Welch Algorithm
Consider Example 58 in the Lecture Notes and assume that besides τ1 =
(S M S L) a second independent observation τ2 = (L L L M ) is given. How
would you adjust the re-estimation formulas in the case of several independent observations?
Write a short program that conducts one iteration of the Baum-Welch procedure to re-estimate the parameters p0 and P (starting from the true values
of the parameters) where both observations are taken into account. Assume
that B is fixed and known.
Exercise 54: Maximum likelihood estimator
Given n i.i.d. samples x1, ..., xn from the distributions below, compute the
MLE for the unknown parameters.
a) Binomial distribution B(m, θ) with m known.
b) Uniform distribution
f(x) = 1/(b−a) if a ≤ x ≤ b, and f(x) = 0 otherwise.
Exercise 55: MLE for the double exponential distribution
The double exponential distribution has density f(x, θ) = (1/2)·exp(−|x − θ|)
for x ∈ R. Verify that θ̂ = med(x1, ..., xn). Use the following property of
the median function med: the sample median is the point which minimizes
the sum of the distances from the samples.
Exercise 56: MLE for a small discrete-time Markov chain
Consider a discrete-time Markov chain with transition probability matrix
P = [1−α α; 1 0].
Estimate the parameter α for the case that you have independent observations
of the chain in steady-state where the chain was four times in state 1
and two times in state 2.
Medium
Exercise 57: Baum-Welch Algorithm
• Prove that Σ_j αt(j) · βt(j) = P(τ).
• Derive a formula for computing w̃t(i, j) = P(Xt = i, Xt+2 = j | τ) and
a re-estimation formula for the 2-step transition probability P(Xt+2 =
j | Xt = i).
Exercise 58: Optimal observation time
Recall that the Fisher information is defined as
I(λ) = −E_X[ ∂² ln L(X; λ)/∂λ² ] = E_X[ ( ∂ ln L(X; λ)/∂λ )² ]
and also recall that the variance of an MLE can be approximated by I(λ)⁻¹.
Consider a 2-state Markov chain X in continuous time where we have a
single transition from state 1 to state 2 at rate λ and where
P (Xt = 1) = e−λt and P (Xt = 2) = 1 − e−λt .
At what time should we observe the process to get the best estimation for
λ (if we can only do a single observation at a single time point)? Try to
maximize the information I(λ) (minimize the variance of λ̂) by plotting I(λ)
in time and for different λ, e.g. λ ∈ {0.01, 0.1, 1, 10}.
Exercise 59: Biased and unbiased estimators
a) What is the method of moments estimator for the second central
moment (variance)? Express its mean in terms of VAR[X] = E[(X − µ)²].
Hint: Use that E[(X̄ − µ)²] = (1/n) · VAR[X].
b) Consider two sets of samples X1, ..., Xn and Y1, ..., Ym that are i.i.d.,
respectively, with means µX and µY and with the same variance σ².
Show that the pooled variance estimator
Sp² = (1/(n+m−2)) · ( Σ_{i=1}^n (Xi − X̄)² + Σ_{i=1}^m (Yi − Ȳ)² )
estimates σ² unbiasedly.
c) For the method of moments estimator m̄3 for the third central moment
it can be shown that E(m̄3) = ((n−1)(n−2)/n²) · µ̄3. Based on this result,
derive an unbiased estimator for the third moment by making the
appropriate correction.
d) Write a small Matlab program that tests the two estimators for the
third central moment for normally distributed data (the third central
moment of the normal distribution is zero). Use small sample sizes to
analyse the influence of the bias.
Exercise 60: Performance of estimators for the mean of a normal distribution
The parameter µ of the normal distribution is both equal to the mean E[X]
and the median m, which is defined as the number that satisfies the
inequalities
P(X ≤ m) = 1/2 and P(X ≥ m) = 1/2
(in general the median may not be unique). While E[X] is estimated by X̄,
a (consistent) estimator for m is simply the median M of the data, i.e. in
the case of an odd number of samples, it is (after sorting) the (n + 1)/2-th
value. In the case of an even number of samples, it is the arithmetic mean
of the n/2-th and the (n + 2)/2-th value (after sorting).
We already know that X̄ is asymptotically normally distributed with variance
σ²/n. As opposed to this, it can be shown that the asymptotic variance
of M is 2πσ²/(4n).
Assume that σ² = 1. Use the above facts to determine for both estimators
the number of samples n such that the probability that µ is estimated with
a distance of at most 0.2 from the true value is (at least) 95.8%.
Hint: 95.8% is the probability that a normally distributed number deviates
by at most 2 standard deviations from its mean.
Exercise 61: MLE for Pareto distribution (2 parameters)
The Pareto distribution is a power law probability distribution that is used
in the description of social, scientific, geophysical, and many other types of
observable phenomena. Its pdf is given by
f(x) = α · x_m^α / x^(α+1) if x ≥ x_m, and f(x) = 0 otherwise.
Given n Pareto distributed random numbers x1, ..., xn, provide the MLE
x̂m for x_m and then the MLE for α as a function of x̂m and the data.
Exercise 62: Gene frequencies of blood antigens
The problem is to estimate the gene frequencies of blood antigens A and
B from observed frequencies of four blood groups in a sample. Denote the
gene frequencies of A and B by θ1 and θ2 respectively, and the expected
probabilities of the four blood groups O, A, B and AB by π1, π2, π3, and π4.
These probabilities are functions of θ1 and θ2 and are given as follows:
π1 = 2θ1θ2, π2 = θ1(2 − θ1 − 2θ2), π3 = θ2(2 − θ2 − 2θ1) and π4 = (1 − θ1 − θ2)².
The observed frequencies y1, y2, y3, y4 of the four blood groups are jointly
multinomially distributed with n = Σ_{i=1}^4 yi.
Derive an equation system for the MLEs for θ1 and θ2 (you need not solve
the equation system nor consider second derivatives to prove that you found
a maximizer).
Difficult
Exercise 63: Fisher Information
The Fisher information is a way of measuring the amount of information that
an observable random variable X carries about an unknown parameter upon
which the probability of X depends. The inverse of the Fisher information
is the asymptotic variance of the maximum likelihood estimator (see the
lecture notes).
a) Show the following equality of the Fisher information for the case of
one parameter θ:
−E_X[ ∂² ln L(X; θ)/∂θ² ] = E_X[ ( ∂ ln L(X; θ)/∂θ )² ]
b) Compute the Fisher information of a binomial RV X ∼ B(n, θ). Can
you plot the Fisher information as a function of θ?
Exercise 64: Derivative of the transient distribution
Consider again the discrete-time Markov chain with initial distribution p0 =
(1 0) and transition probability matrix
P = [1−α α; 1 0].
Assume that you have independent observations of the chain at certain transient times t1 , . . . , tn .
a) In addition to an expression for computing the log-likelihood, derive a
formula for numerically computing the derivative of the log-likelihood
w.r.t. α.
b) Write a Matlab program to estimate the parameter α, where you give
in addition to the log-likelihood also its derivatives to the optimization
function (e.g. fmincon).
100
6.6. EXERCISES
Difficult
Exercise 65: *MLE: Taking into account measurement errors
Consider an i.i.d. sample O1, ..., On from the Poisson distribution (observation
samples are provided in the CMS). We assume that the measurement error is
additive, i.e. Oi = Xi + εi, where Xi is the "true" state and the measurement
errors are i.i.d. with εi ∼ N(0, 1). Notice that the measurement error is also
assumed to be independent of X.
a) Propose a way to compute the (log-)likelihood of a sample.
b) Compute (using Matlab or similar) the value of the likelihood of a
given sequence assuming that λ = 4.
c) Derive a formula for the derivative of the (log-)likelihood. How would
you make the optimization procedure even more efficient (in terms of
convergence)?
d) Write a Matlab (or similar) program to estimate the parameter λ where
you use both the value of the (log-)likelihood and its derivative in the
optimization procedure.
e) Derive a formula for the derivative of the (log-)likelihood with respect
to the variance of the measurement error, i.e. w.r.t. the variance V,
where εi ∼ N(0, V).
f) Write a Matlab (or similar) program to estimate both the parameter
λ and the variance of the measurement error V, where you use both
the value of the (log-)likelihood and its derivatives in the optimization
procedure.
Hint: Note that you have to find a way to approximate the likelihood using
a finite summation.
Chapter 7
Statistical Testing
Carrying out statistical tests that are related to the observations of the real
system and/or the model is very important when we make claims or statements
about the system. A popular class of tests are hypothesis tests, which
are used to verify statistical hypotheses.
In a statistical hypothesis test, we usually first formulate a (null) hypothesis
H0 and an alternative hypothesis HA . The two statements H0 and HA must
be mutually exclusive and the test can either accept or reject H0 (in favor
of HA). The null hypothesis is either
• an equality, or
• the absence of an effect or some relation.
Note that this leads to null hypotheses that are in many cases not the
same as the statement that we want to verify. The reason for the above
constraint is that intuitively, we need an equality (or absence of an effect or
some relation) to fix the distribution that we consider during the test.
We begin with a motivating example.
Example 59: Defective Products
Assume that a manufacturer claims that at most 3% of his products are
defective. We want to verify this statement to decide whether we accept
the shipment of the products or not. We define
H0 : fraction of defective products is equal to 3%
HA : fraction of defective products is greater than 3%
where for HA we used the right-tail alternative. If we have evidence that
H0 is rejected in favor of HA then we reject the shipment.
Note that HA: "fraction of defective products is smaller than 3%" is not
useful since then we will always accept the shipment, no matter if we have
evidence for H0 or for HA.
Example 60: Concurrent Users
Assume that we want to verify the statement that the average number of
concurrent users of an online PC gaming platform increased by 2000 this
year. We define
H0: µ2 − µ1 = 2000
HA: µ2 − µ1 ≠ 2000
where µ1 is the average number of concurrent users of the last year and µ2
the average number of this year. Here, HA is called a two-sided alternative
since it covers both cases µ2 − µ1 > 2000 and µ2 − µ1 < 2000.
From the two examples above we see that there can be two-sided alternatives,
one-sided, left-tail alternatives (HA is µ < µ0) and one-sided, right-tail
alternatives (HA is µ > µ0), where H0 is µ = µ0.
The outcome of our test depends on a finite random sample and thus we may
always take a wrong decision. The four situations depicted below are
possible. Our goal is to keep each of the two errors small. Thus, a good
test only results in a wrong decision if the sample is not very representative
(i.e. extreme).

                 Result of the test
                 Reject H0        Accept H0
H0 is true       Type I error     correct
H0 is false      correct          Type II error

Often the type I error is seen as more dangerous since it corresponds to
"convicting an innocent defendant" or "sending a healthy patient to a
surgery". Therefore, we fix the probability α of a type I error, which is also
called the significance level of the test:
α = P{reject H0 | H0 is true}.
The probability of rejecting a false hypothesis (avoiding a type II error) is
the power of the test and a function of the parameter θ about which we make
our hypothesis:
p(θ) = P{reject H0 | θ; HA is true}.
Typically, α is chosen very small, e.g. α ∈ {0.01, 0.05, 0.10}, such that the
type I error is kept small and we only reject H0 with a lot of confidence.
7.1
Level α tests: general approach
To test H0 against HA we perform the following steps:
1. Compute a test statistic T which is a function of the sample (or an
estimator) and thus a random number. The distribution of T , given
H0 is true, is known.
268
2. Consider the null distribution F0 of T, i.e. the distribution of T given that H0 is true, and find the portion of the area below the density curve that corresponds to α; this is the region where we reject H0. The remaining part (of area 1 − α) is the acceptance region.
We always have that

P{T ∈ acceptance region | H0 is true} = 1 − α

and

P{T ∈ rejection region | H0 is true} = α.

3. Accept H0 if T belongs to the acceptance region and reject it otherwise.
It is important to mention that if we accept H0 we cannot say that 'with probability 1 − α the hypothesis H0 is true'. The reason is that H0 is not random, i.e. it either holds with probability one or it does not hold with probability one. The only correct interpretation is that if we reject H0, then the data provides sufficient evidence against H0 and in favor of HA. Either H0 is really not true or our data is not representative, which happens with probability α. If we accept H0, then the data does not provide sufficient evidence to reject H0. In the absence of sufficient evidence, by default, we accept H0.
Let us now reason in detail about finding the rejection region with area α, since there are many regions with area α below the density curve. The best choice is an area that ensures that the type II error is small. Thus, we choose the rejection region such that it is likely that T falls into this region if HA is true. This will maximize the power of the test, i.e. the probability of rejecting H0 given that HA is true. Often, this results in a choice of the rejection region as illustrated in Figure 7.1 for a normally distributed test statistic. Usually, the test statistic is defined such that

• the right-tail alternative forces T to be large,

• the left-tail alternative forces T to be small,

• the two-sided alternative forces T to be either large or small.
[Figure 7.1: Acceptance and rejection regions for a normally distributed test statistic: a) one-sided right-tail alternative (reject if T ≥ zα); b) one-sided left-tail alternative (reject if T ≤ −zα); c) two-sided alternative (reject if |T| ≥ zα/2).]
7.2 Standard Normal null distribution (Z-test)

For a large number of applications, the null distribution of T (the distribution of T given that H0 is true) is standard normal. Then the test is called a Z-test and the test statistic is usually denoted by Z. Usually, one of the following cases applies:

• we consider sample means of normally distributed data,

• we consider sample means of arbitrarily distributed data where the number of samples is large,

• we consider sample proportions of arbitrarily distributed data where the number of samples is large,

• we consider differences of sample means or sample proportions where the number of samples is large.

In all of these cases a Z-test can be used. Let zα be the number such that P(Z > zα) = 1 − Φ(zα) = α if Z is a standard normally distributed random variable with cumulative distribution function Φ. Then we reject H0 if

• Z ≥ zα for a test with right-tail alternative,

• Z ≤ −zα for a test with left-tail alternative,

• |Z| ≥ zα/2 for a test with two-sided alternative.
Note that in each of these cases the test has significance level α; for the right-tail alternative, for instance, P{reject H0 | H0 is true} = P{Z ≥ zα | H0} = 1 − Φ(zα) = α.
Example 61: A test for the mean
Assume that we have used X̄ to estimate the unknown mean µ0 of a
normal distribution based on n = 100 independent samples. Since the
samples are normally distributed, X̄ is normally distributed too and we
know that E[X̄] = µ0 (see previous chapter) and that VAR[X̄] = σ 2 /n.
Assume further that σ = 800 is known and that the data is such that
X̄ = 5200. We would like to verify (with a significance of 5%, α = 0.05)
whether the true mean is greater than 5000, i.e. H0 : µ0 = 5000 and HA :
µ0 > 5000 (right-tail alternative).
1. We first compute the test statistic

   Z = (X̄ − µ0) / (σ/√n) = (5200 − 5000) / (800/√100) = 2.5.
2. The critical value is zα = 1.645 (from the table of the standard
normal distribution). Thus, we should reject H0 if Z ≥ 1.645 and
accept it otherwise.
3. Since Z falls into the rejection region, we have enough evidence to
reject H0 and support the alternative hypothesis HA : µ0 > 5000.
Note that if, for instance, X̄ = 5100, we would get Z = 1.25 and not have
enough evidence to reject H0 and believe in the alternative hypothesis that
µ0 > 5000.
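The computation in Example 61 is easy to reproduce in code. Below is a minimal sketch in Python, assuming the scipy library is available; scipy.stats.norm simply replaces the lookup in the standard normal table.

from scipy.stats import norm

# Setup of Example 61: known sigma, right-tail alternative HA: mu > 5000
n, sigma, xbar, mu0, alpha = 100, 800, 5200, 5000, 0.05

# Test statistic Z = (xbar - mu0) / (sigma / sqrt(n))
Z = (xbar - mu0) / (sigma / n ** 0.5)        # = 2.5

# Critical value z_alpha with P(Z >= z_alpha) = alpha
z_alpha = norm.ppf(1 - alpha)                # approx. 1.645

print("reject H0" if Z >= z_alpha else "accept H0")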
Example 62: Two-Sample Z-test of proportions
A quality inspector finds 10 defective parts in a sample of n = 500 parts
received from manufacturer A. Out of m = 400 parts from manufacturer
B, she finds 12 defective ones. A computer-making company uses these
parts in their computers and claims that the quality of parts produced by
A and B is the same. At the 5% level of significance, do we have enough
evidence to disprove this claim?
We test H0 : pA = pB, or H0 : pA − pB = 0, against HA : pA ≠ pB, where pA (pB) is the proportion of defective parts from manufacturer A (B),
respectively.
This is a two-sided test because no direction of the alternative has been
indicated.
1. We first compute the values of the estimated proportions

   p̂A = 10/500 = 0.02 and p̂B = 12/400 = 0.03.

   These estimators are asymptotically normally distributed (sum of n and m Bernoulli variables divided by n and m, respectively), where the means are the true proportions pA and pB and the variances are pA(1 − pA)/n and pB(1 − pB)/m. Hence, p̂A − p̂B is also asymptotically normally distributed with variance pA(1 − pA)/n + pB(1 − pB)/m and, under H0, mean zero. Thus, inserting the estimated values, we standardize and get

   Z = (p̂A − p̂B) / √( p̂A(1 − p̂A)/n + p̂B(1 − p̂B)/m ) = −0.945.
2. The critical value is zα/2 = 1.96 (from the table of the standard
normal distribution). Since this is a two-sided test we should reject
H0 if |Z| ≥ 1.96 and accept it otherwise.
3. Since Z falls into the acceptance region, we do not have enough
evidence to reject H0 . Although the sample proportions of defective
parts are unequal, the difference between them appears too small to
claim that population proportions are different.
Alternatively, we can consider a single estimator p̂ for the overall proportion of defective products, since we assume that H0 (pA = pB) holds. With p := pA = pB this implies that

VAR[XA] = VAR[XB] = p(1 − p)

if XA and XB are Bernoulli distributed with parameter p. Hence, if we estimate the common proportion of defective parts as

p̂ = (number of defective parts) / (total number of parts) = (n p̂A + m p̂B) / (n + m) = 0.0244,

then we can use it to replace the unknown probability p in the variance estimators VAR[p̂A] = p(1 − p)/n and VAR[p̂B] = p(1 − p)/m. Hence,

VAR[p̂A − p̂B] = p̂(1 − p̂)/n + p̂(1 − p̂)/m
and we get

Z = (p̂A − p̂B) / √( p̂(1 − p̂)/n + p̂(1 − p̂)/m ) = −0.966.

Again, Z falls into the acceptance region and we do not have enough evidence to reject H0.
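A sketch of the pooled version of this test in Python, again assuming scipy for the normal quantile; the counts are those of Example 62.

from scipy.stats import norm

n, m = 500, 400                              # sample sizes for A and B
pA_hat, pB_hat = 10 / n, 12 / m              # estimated proportions

# Pooled estimate of the common proportion under H0: pA = pB
p_hat = (10 + 12) / (n + m)                  # approx. 0.0244

# Standardized difference using the pooled variance estimate
Z = (pA_hat - pB_hat) / (p_hat * (1 - p_hat) * (1 / n + 1 / m)) ** 0.5

# Two-sided test at level alpha = 0.05: reject if |Z| >= z_{alpha/2}
z_crit = norm.ppf(1 - 0.05 / 2)              # approx. 1.96
print(f"Z = {Z:.3f}, reject H0: {abs(Z) >= z_crit}")   # Z approx. -0.966, no rejection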
7.3 T-tests for unknown σ
In the previous section we used an estimator for the unknown true variance σ². In the special case that our data X1, ..., Xn is normally distributed with mean µ and variance σ², and we estimate the unknown mean with

X̄n = (1/n) Σ_{i=1}^n Xi,

we can make use of a result from statistics that tells us the following: We know that E[X̄n] = µ and thus

Z = (X̄n − µ) / √(σ²/n)

must follow a standard normal distribution (we standardized it by subtracting the mean and dividing by the standard deviation!). However, since σ² is unknown, we estimate it using

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².

It can be shown that

T = (X̄n − µ) / √(Sn²/n)

follows a Student's t-distribution with n − 1 degrees of freedom. Thus, once µ is fixed (because of our hypothesis H0) and the data X1, ..., Xn is given, we can compute T and check whether it falls into the acceptance or rejection region by looking at the values tα, −tα or tα/2, depending on the type of alternative. Note that tα is the value such that

P(T > tα) = α

and it can be found in the table of the Student's t-distribution (in the same way we find tα/2). Note that the degrees of freedom n − 1 is a parameter of the Student's t-distribution, since the distribution of T changes when the sample size n changes.
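As an illustration, the following sketch runs a one-sample T-test on a small made-up data set; the sample values and the hypothesized mean are assumptions for illustration only, and a recent scipy version is assumed. scipy's ttest_1samp performs the same computation as the manual formula above.

from scipy import stats
import statistics

# Hypothetical sample, assumed normally distributed; H0: mu = 5000,
# right-tail alternative HA: mu > 5000
x = [5120, 4980, 5310, 5075, 4890, 5240, 5060, 5150]
n, mu0, alpha = len(x), 5000, 0.05

# Manual computation: T = (X̄n - mu0) / sqrt(Sn² / n)
T = (statistics.mean(x) - mu0) / (statistics.stdev(x) / n ** 0.5)

# Critical value t_alpha of the t-distribution with n - 1 degrees of freedom
t_alpha = stats.t.ppf(1 - alpha, df=n - 1)

# The same test via scipy (the statistic matches T)
T_scipy, p = stats.ttest_1samp(x, popmean=mu0, alternative="greater")
print(f"T = {T:.3f}, t_alpha = {t_alpha:.3f}, p = {p:.3f}")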
7.4 p-Value
So far, we were testing hypotheses by means of acceptance and rejection regions, where we need to know the significance level α in order to conduct a test. However, there is no systematic way of choosing α, the probability of making a type I error. Of course, when it seems too dangerous to reject a true H0, we choose a low significance level. But how low? Should we choose α = 0.01? Perhaps, 0.001? Or even 0.0001? If we choose α small enough, we can always make sure that H0 is accepted: a very low level of significance yields a very small rejection region and thus acceptance of the null hypothesis, while a very high level of significance yields a large rejection region and thus rejection.

Also, if our observed test statistic Zobs belongs to a rejection region but it is "too close to call", i.e. the Z-statistic lies almost at the boundary of the rejection region, then how do we report the result? Formally, we should reject the null hypothesis, but practically, we realize that a slightly different significance level α could have expanded the acceptance region just enough to cover Zobs and force us to accept H0. Is there a statistical measure to quantify how far away we are from a "too close to call"?

The idea is to try to test a hypothesis using all levels of significance. Then we have two cases: 1) Very small values of α make it very unlikely to reject the hypothesis because they yield very small rejection regions. 2) High significance levels α make it likely to reject H0 and correspond to a large rejection region; a sufficiently large α produces a rejection region that covers our test statistic, and we will be forced to reject H0. The P-value is the boundary value between the accept case 1) and the reject case 2). Thus,

the p-value is the lowest significance level α that forces rejection of H0 and also the highest significance level α that forces acceptance of H0.

Usually α ∈ [0.01, 0.1] (although there are exceptions). Then, a P-value greater than 0.1 exceeds all natural significance levels, and the null hypothesis should be accepted. Conversely, if a P-value is less than 0.01, then it is smaller than all natural significance levels, and the null hypothesis should be rejected. Only if the P-value happens to fall between 0.01 and 0.1 do we really have to think about the level of significance. This is the "too close to call" case. A good decision is then to collect more data until a more definitive answer can be obtained.
Important: Pr(observation | hypothesis) ≠ Pr(hypothesis | observation). The probability of observing a result given that some hypothesis is true is not equivalent to the probability that a hypothesis is true given that some result has been observed. Using the p-value as a "score" is committing an egregious logical error: the transposed conditional fallacy.
[Figure 7.2: Interpretation of the p-value as the probability that we observe a test statistic T that is at least as extreme as Tobs given that H0 is true. A p-value (the green shaded area) is the probability of an observed (or more extreme) result assuming that the null hypothesis is true.]

We compute p by fixing Zobs = zα and selecting α accordingly: at the boundary between α-to-accept and α-to-reject, the observed test statistic coincides with the critical value. For a one-sided right-tail alternative, we thus have

p = α = P(Z ≥ zα) = P(Z ≥ Zobs) = 1 − Φ(Zobs),
where Z is standard normally distributed and Zobs is the test statistic. The computation of p is similar for the one-sided left-tail and the two-sided case, as well as for T-tests. Being "extreme" is determined by the alternative: for a right-tail alternative, large values of Zobs are extreme; for a left-tail alternative, small values are extreme. We summarize the computation of the p-value for a Z-test of H0 : θ = θ0 below, where we distinguish the three different cases for the alternative hypothesis HA:

  Alternative HA         p-value                Computation
  right-tail, θ > θ0     P{Z ≥ Zobs}            1 − Φ(Zobs)
  left-tail,  θ < θ0     P{Z ≤ Zobs}            Φ(Zobs)
  two-sided,  θ ≠ θ0     P{|Z| ≥ |Zobs|}        2(1 − Φ(|Zobs|))

From the definition of the p-value, it is also clear that
p = P(observing a test statistic T that is at least as extreme as Tobs | H0).

In Figure 7.2 we illustrate this interpretation (the green shaded area is the p-value there and Tobs is the observed data point). Thus, it is wrong to say that the p-value tells us something about the probability that H0 is true (given the observation)! A high p-value tells us that the observed or even more extreme values of Zobs are not so unlikely (given H0), and therefore, we see no contradiction with H0 and do not reject it. Conversely, a low p-value signals that such an extreme test statistic is unlikely if H0 is true. Since we really observed it, our data are not consistent with H0 and we reject it.
Example 63: Two-Sample Z-Test of Proportions (revisited)
In Example 62 we computed a test statistic of Zobs = −0.945 for a two-sided test which compares the quality of parts produced by two different manufacturers. We compute p as

p = P(|Z| ≥ |−0.945|) = 2(1 − Φ(0.945)) ≈ 0.3447.

This p-value is quite high and indicates that the null hypothesis should not be rejected. If H0 is true, then the chance of observing a value for Z that is as extreme or more extreme than Zobs is about 34%. This is no contradiction with the assumption that H0 is true.
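The same number can be obtained directly from the standard normal cdf; a short sketch in Python (assuming scipy):

from scipy.stats import norm

Z_obs = -0.945                                 # from Example 62

# Two-sided p-value: P(|Z| >= |Z_obs|) = 2 (1 - Phi(|Z_obs|))
p_two_sided = 2 * (1 - norm.cdf(abs(Z_obs)))   # approx. 0.345
# For comparison, the one-sided p-values:
p_right = 1 - norm.cdf(Z_obs)                  # P(Z >= Z_obs)
p_left = norm.cdf(Z_obs)                       # P(Z <= Z_obs)
print(p_two_sided, p_right, p_left)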
7.5 Chi-square tests
Chi-square tests are another popular class of statistical tests that can be used
to support statements about estimated variances, about estimated counts,
about a family of distributions and about independence.
7.5.1 Estimated variance
Recall that an unbiased estimator for the population variance σ² is given by the sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

We would like to know the distribution of S² to perform a test for the hypothesis H0 : σ² = σ0². Since the summands (Xi − X̄)² are not independent, we cannot directly use the central limit theorem. In fact, for moderate or small n the distribution is not Normal at all (for large n a generalization of the central limit theorem gives an approximate normal distribution). If the Xi are independent and normally distributed observations with VAR(Xi) = σ², then the distribution of

(n − 1)S² / σ² = Σ_{i=1}^n ((Xi − X̄)/σ)²

is Chi-square with n − 1 degrees of freedom. The Chi-square distribution is a special case of the Gamma distribution and its density is given by

f(x) = (1/(Γ(ν/2) 2^{ν/2})) x^{ν/2 − 1} e^{−x/2},    x > 0,
and f(x) = 0 whenever x ≤ 0. Here ν > 0 is a parameter that equals the degrees of freedom. Note that the mean is equal to ν and the variance is 2ν. Thus, for a level-α test with H0 : σ² = σ0² we compute

χ²obs = (n − 1)S² / σ0²

and compare the value χ²obs with the critical values of the Chi-square distribution. More concretely,

• for a right-tail alternative HA : σ² > σ0², we reject H0 if χ²obs ≥ χ²_α,

• for a left-tail alternative HA : σ² < σ0², we reject H0 if χ²obs ≤ χ²_{1−α},

• for the two-sided alternative HA : σ² ≠ σ0², we reject H0 if either χ²obs ≥ χ²_{α/2} or χ²obs ≤ χ²_{1−α/2},
where χ²_α is such that α = P(χ² > χ²_α). We summarize the chi-square test for the population variance below and omit an example here, since this test works exactly as the Z and T tests.

  Null hypothesis: σ² = σ0²,  test statistic: χ²obs = (n − 1)S²/σ0²

  Alternative HA   Rejection region                            p-value
  σ² > σ0²         χ²obs ≥ χ²_α                                P(χ² ≥ χ²obs)
  σ² < σ0²         χ²obs ≤ χ²_{1−α}                            P(χ² ≤ χ²obs)
  σ² ≠ σ0²         χ²obs ≥ χ²_{α/2} or χ²obs ≤ χ²_{1−α/2}      2 min{P(χ² ≥ χ²obs), P(χ² ≤ χ²obs)}
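A sketch of the right-tail variance test in Python; the sample size and sample variance below are assumptions for illustration, and scipy.stats.chi2 replaces the table lookup:

from scipy.stats import chi2

# Hypothetical data summary: n = 30 samples with sample variance S² = 6.5;
# test H0: sigma² = 4 against the right-tail alternative HA: sigma² > 4
n, S2, sigma0_sq, alpha = 30, 6.5, 4.0, 0.05

chi2_obs = (n - 1) * S2 / sigma0_sq           # approx. 47.13

# Right-tail critical value and p-value with n - 1 degrees of freedom
chi2_alpha = chi2.ppf(1 - alpha, df=n - 1)    # approx. 42.56
p_value = chi2.sf(chi2_obs, df=n - 1)         # P(chi² >= chi²_obs)
print(f"reject H0: {chi2_obs >= chi2_alpha}, p = {p_value:.4f}")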
7.5.2 Observed Counts
The chi-square test can also be used to compare observed and expected counts via the chi-square statistic

χ² = Σ_{k=1}^N (Obs(k) − Exp(k))² / Exp(k),    (7.1)

where N is the number of categories or groups of data, defined depending on our testing problem. Obs(k) is the observed number of samples that belong to category k, and Exp(k) = E[Obs(k) | H0 is true] is the expected number of samples that fall into category k if the null hypothesis H0 is true. This is always a one-sided, right-tail test, because only low values of χ² indicate that the observed counts are close to what we expect them to be under the null hypothesis. On the contrary, large values of χ² occur if the observed counts are far from the expected counts, which indicates inconsistency of the data and H0. Therefore, the test can be seen as a "frequency comparison" (compare the data histogram with a fitted density or mass function). A level α rejection region for this chi-square test is [χ²_α, +∞) and the p-value is computed as

p = P(χ² ≥ χ²obs).
It can be shown that for large n,

(Obs(k) − Exp(k)) / √Exp(k)

approximately follows a standard normal distribution, and thus the null distribution of χ² converges to the Chi-square distribution with N − 1 degrees of freedom as the sample size increases to infinity. We subtract one degree of freedom from N because we know that Σ_k Obs(k) = n, so once any N − 1 counts are known, the remaining one is uniquely determined.

As a rule of thumb, we require an expected count of at least 5 in each category, i.e. Exp(k) ≥ 5 for all k, and at least 3 bins (N ≥ 3). Moreover, it is often beneficial if

Exp(1) ≈ ... ≈ Exp(N).

After defining the bins and computing χ² we can use the Chi-square distribution to construct rejection regions and compute p-values to decide whether H0 can be rejected or not.
Example 64: Testing a distribution
We want to test whether the observed data belong to a particular distribution. For example, we may want to test whether a sample comes
from the Normal distribution, whether interarrival times are Exponential
and counts are Poisson, whether a random number generator returns high
quality Standard Uniform values, or whether a die is unbiased.
In general, for n samples X1, ..., Xn with distribution F we test H0 : F = F0 vs HA : F ≠ F0, where F0 is some given distribution.

To conduct the test, we take all possible values of X under F0 (the support of F0) and split them into N bins B1, ..., BN such that the expected number of samples that fall into a bin is at least 5. It could be that the left-most interval is B1 = (−∞, a1] or the right-most interval is [aN−1, ∞) or both. The observed count for the k-th bin Bk is

Obs(k) = |{i = 1, ..., n : Xi ∈ Bk}|,

where for a set A we use |A| to denote the number of elements of A. Hence Σ_{k=1}^N Obs(k) = n. If H0 is true and all Xi have the distribution F0, then Obs(k), the number of "successes" in n trials, has Binomial distribution with parameters n and

pk = F0(Bk) = P(Xi ∈ Bk | H0).

Then, the corresponding expected count is the expected value of this Binomial distribution, Exp(k) = n pk. We compute χ² as defined in Eq. 7.1 and conduct the test.
Let us now become more concrete: Suppose that after losing a large amount of money, an unlucky gambler questions whether the game was fair and the die was really unbiased. The last 90 tosses of this die gave the following results:

  Score      1   2   3   4   5   6
  Frequency  20  15  12  17   9  17

We test H0 : F = F0 vs HA : F ≠ F0, where F is the distribution of the number of dots on the die, and F0 is such that P(X = i) = 1/6 for i ∈ {1, 2, ..., 6}. We choose Bi = {i}; then the observed counts are given by the table above and the expected counts are n/6 = 90/6 = 15 for all bins (all more than 5). We compute

χ² = Σ_{k=1}^N (Obs(k) − Exp(k))²/Exp(k)
   = (20−15)²/15 + (15−15)²/15 + (12−15)²/15 + (17−15)²/15 + (9−15)²/15 + (17−15)²/15 = 5.2

and find (from the table of the chi-square distribution for N − 1 = 5 degrees of freedom) that the p-value

p = P(χ² ≥ 5.2) ∈ [0.2, 0.8].

Hence there is no significant evidence to reject H0, and therefore, no evidence that the die was biased.
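The die example can be reproduced with scipy's built-in goodness-of-fit routine (assuming scipy is available; scipy.stats.chisquare implements exactly the statistic of Eq. 7.1):

from scipy.stats import chisquare

# Observed die counts from Example 64; expected count 15 for each face under H0
obs = [20, 15, 12, 17, 9, 17]
stat, p = chisquare(obs, f_exp=[15] * 6)
print(stat, p)   # statistic 5.2; p approx. 0.39, consistent with p in [0.2, 0.8]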
Assume that we have the same situation as in the previous example, i.e., we suppose that X1, ..., Xn have distribution F0, but we also used X1, ..., Xn to fit the parameters of F0 (e.g. we computed the MLE θ̂). Assume F0 has m parameters (m ≥ 1). One can show that in this case, if H0 is true, then χ² converges to a chi-square distribution with N − m − 1 degrees of freedom. Thus, we have to subtract the number of estimated parameters from the degrees of freedom and define the rejection region accordingly. This type of test is called a goodness-of-fit test.

In general, it is a major drawback of the test that there is no clear prescription for the interval selection, and one can come to different conclusions depending on how the intervals are chosen. However, it can be applied to any hypothesized distribution and is often used to test whether artificially generated pseudo-random numbers are U(0, 1)-distributed and independent.
7.5.3 Testing independence
In many practical applications we would like to test if there is a significant association between two features. This helps to understand cause-and-effect relationships. For example, is it true that smoking causes lung cancer? Do the data confirm that drinking and driving increases the chance of a traffic accident? We can use a chi-square test to verify

H0 : Factors A and B are independent vs HA : Factors A and B are dependent.

In general, we assume that there are two sets of categories A1, ..., Ak and B1, ..., Bm where Ai ∩ Aj = ∅ and Bi ∩ Bj = ∅ for any i ≠ j, and independence is understood as for random variables, i.e., factors A and B are independent if any randomly selected x belongs to categories Ai and Bj independently of each other. Therefore, we are testing

H0 : P(x ∈ Ai ∩ Bj) = P(x ∈ Ai)P(x ∈ Bj) for all i, j
vs
HA : P(x ∈ Ai ∩ Bj) ≠ P(x ∈ Ai)P(x ∈ Bj) for some i, j.
Example 65: Smokers and lung cancer
We have k = m = 2 with A1 = set of smokers, A2 = set of non-smokers, B1 = set of persons with lung cancer, B2 = set of persons without lung cancer.

We next assume the following observed counts:

                 B1    B2    ...  Bm    row total
  A1             n11   n12   ...  n1m   n1·
  A2             n21   n22   ...  n2m   n2·
  ...            ...   ...   ...  ...   ...
  Ak             nk1   nk2   ...  nkm   nk·
  column total   n·1   n·2   ...  n·m   n·· = n

where ni· = Σ_j nij and n·j = Σ_i nij are the row and column totals. This table defines the observed counts Obs(i, j) = nij. We then estimate the probabilities P(x ∈ Ai) and P(x ∈ Bj) for all i, j as

P̂(x ∈ Ai) = ni·/n and P̂(x ∈ Bj) = n·j/n.
If H0 is true, then P(x ∈ Ai ∩ Bj) = P(x ∈ Ai)P(x ∈ Bj) and thus

Exp(i, j) = n · P(x ∈ Ai ∩ Bj).

Based on the information that we have, we can only estimate

Êxp(i, j) = n · (ni·/n)(n·j/n) = ni· n·j / n.

Note that the row totals, the column totals, and the whole table total are the same for the observed and the estimated expected counts, which provides a useful check for mistakes. Next, we compute the test statistic

χ²obs = Σ_{i=1}^k Σ_{j=1}^m (Obs(i, j) − Êxp(i, j))² / Êxp(i, j)

and consider its corresponding chi-square distribution. It can be shown that the degrees of freedom are (k − 1)(m − 1), due to some linearly dependent constraints in the sum of differences.
Example 66: Internet Shopping
A web designer suspects that the chance for an internet shopper to make a purchase through her shopping web site varies depending on the day of the week. To test this claim, she collects data during one week, when the web site recorded 3758 hits.
  Observed             Mon  Tue  Wed  Thu  Fri  Sat  Sun  Total
  No purchase          399  261  284  263  393  531  502   2633
  Single purchase      119   72   97   51  143  145  150    777
  Multiple purchases    39   50   20   15   41   97   86    348
  Total                557  383  401  329  577  773  738   3758
Testing independence (i.e., the probability of making a single purchase or multiple purchases is the same on any day of the week), we compute the estimated expected counts,

Êxp(i, j) = ni· n·j / n for i = 1, ..., 7, j = 1, 2, 3.
  Expected             Mon     Tue     Wed     Thu     Fri     Sat     Sun    Total
  No purchase          390.26  268.34  280.96  230.51  404.27  541.59  517.07  2633
  Single purchase      115.16   79.19   82.91   68.02  119.30  159.82  152.59   777
  Multiple purchases    51.58   35.47   37.13   30.47   53.43   71.58   68.34   348
  Total                557     383     401     329     577     773     738    3758
Then, the test statistic is

χ²obs = (399 − 390.26)²/390.26 + ... + (86 − 68.34)²/68.34 = 60.79.

We have (7 − 1)(3 − 1) = 12 degrees of freedom and get a very low p-value < 0.001, which means that we should reject H0, i.e. there is significant evidence that the probability of making a single purchase or multiple purchases varies during the week.

In the case of small sample sizes and very unequally distributed data (among the cells), Fisher's exact test can be used instead to test for independence.
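As a sketch of how this test looks in code (assuming Python with scipy), scipy.stats.chi2_contingency computes the expected counts, the statistic, the degrees of freedom and the p-value directly from the observed table:

from scipy.stats import chi2_contingency

# Observed counts of Example 66 (rows: no/single/multiple purchases,
# columns: Mon, ..., Sun)
obs = [[399, 261, 284, 263, 393, 531, 502],
       [119,  72,  97,  51, 143, 145, 150],
       [ 39,  50,  20,  15,  41,  97,  86]]

chi2_stat, p, dof, expected = chi2_contingency(obs)
print(chi2_stat, dof, p)   # approx. 60.8 with 12 degrees of freedom, p < 0.001

For 2x2 tables, scipy.stats.fisher_exact provides the Fisher's exact test mentioned above.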
7.6 Multiple Testing
Up to this point we have only discussed the testing procedure applicable
for a single hypothesis. However, we often wish to test many associated
hypotheses at the same time. This leads to the so-called multiple testing
problem. When testing many hypotheses simultaneously, careful attention
must be paid to what may be concluded from the p-values of the individual
tests. As an example, suppose that a salesman claims that some substance
is beneficial, and to demonstrate this, he tests 100 different null hypotheses,
one for each of 100 different illnesses, each null hypothesis stating that the
substance is not useful in helping to cure one of the illnesses involved and
each alternative hypothesis stating that the substance is useful. Suppose
that the appropriate statistical tests lead to 100 p-values, one for each disease. If a Type I error of 1% had been used for each individual test, the
probability that at least one of the 100 null hypotheses will be rejected
given that the substance is of no value for any of the illnesses (all null hypotheses true) can be computed from the Binomial distribution with success
probability p = 0.01 and N = 100. If k is the number of successes, then
P(k ≥ 1) = 1 − P(k = 0) = 1 − (1 − p)^N ≈ 0.634.
Thus, even if the substance is of no value there is a very high chance that
some null hypotheses get rejected. This might lead the salesman to claim
that the substance is useful for those illnesses where, by chance, the null
hypothesis was rejected. He may provide only the data relating to those
illnesses, which taken out of context may appear convincing. This indicates
the potential error in drawing conclusions from individual p-values obtained from many associated tests carried out in parallel (in fact this happened in a number of medical research projects, for instance, related to the association between breast cancer and polymorphisms in certain genes).
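The inflation of the error probability in the salesman's scenario is a one-line computation; a quick sketch in Python:

# Probability of at least one (false) rejection among N = 100 independent
# level-0.01 tests when all null hypotheses are true
alpha, N = 0.01, 100
print(1 - (1 - alpha) ** N)   # approx. 0.634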
One possible approach to this problem is to use a so-called "experiment-wise" or "family-wise" p-value, defined as follows. The probability of rejecting at least one null hypothesis when all null hypotheses are true is known as the family-wise error rate (FWER), i.e.,

FWER = P(#{rejected null hypotheses that are true} ≥ 1).

The aim is to control the FWER at some acceptable value α. To do this, it is necessary to use a Type I error for each individual test that is smaller than α.
7.6.1 Bonferroni correction
One of the simplest ways of assuring an FWER of α when N tests are performed is to use a Type I error of α/N for each individual test. Then, the "success probability" for the binomial distribution is α/N instead of α, and we get, if all null hypotheses are true and the tests are independent,

FWER = P(V ≥ 1) = 1 − P(V = 0) = 1 − (1 − α/N)^N ≤ α,

where V is the number of rejected null hypotheses that are true.

In general, if N0 ≤ N is the total number of true null hypotheses, we consider N0 Bernoulli random variables X1, ..., X_{N0}, where Xk = 1 if the k-th true null hypothesis is (falsely) rejected, so that P(Xk = 1) = α/N. Then

P(Σ_{k=1}^{N0} Xk ≥ 1) = P(X1 = 1 ∨ ... ∨ X_{N0} = 1) ≤ Σ_{k=1}^{N0} P(Xk = 1) = N0 · α/N ≤ α.

This correction applies whether the tests are independent or not, but it is quite conservative and has the disadvantage that only a very small p-value leads to the rejection of H0. The power of the test (reject H0 if HA is true) gets very small.
Example 67: Five Tests
Let us assume we have 5 tests with null hypotheses H0(1), H0(2), ..., H0(5) and we computed the following p-values:

H0(1) : 0.005, H0(2) : 0.011, H0(3) : 0.02, H0(4) : 0.04, H0(5) : 0.13

For a significance level α = 0.05 we have a local significance level of α/N = 0.01 and can only reject H0(1).
7.6.2 Holm-Bonferroni method
The Holm-Bonferroni method adjusts the rejection criterion of each individual hypothesis as follows: First, we sort the hypotheses according to their p-values, i.e. pi is the p-value of H0(i) and p1 ≤ ... ≤ pN. Then we fix the "global" significance level α and compute the minimal index k such that

pk > α / (N + 1 − k).

This can be done by defining α1 = α/N, α2 = α/(N − 1), ..., αN = α/1 and comparing p1 with α1, p2 with α2, etc., until pk > αk. If k > 1 we reject H0(1), H0(2), ..., H0(k−1) and do not reject H0(k), H0(k+1), ..., H0(N). If k = 1 then no null hypothesis is rejected, and if no such k exists, all null hypotheses are rejected. It can be shown (exercise!) that the FWER of this method is at most α.
Example 68: Five Tests
The p-values of the 5 tests in the previous example were already sorted and we compare

H0(1) : 0.005 < α/N = 0.01,
H0(2) : 0.011 < α/(N − 1) = 0.0125,
H0(3) : 0.02 > α/(N − 2) = 0.0167.

Hence we reject H0(1) and H0(2) and do not reject H0(3), H0(4) and H0(5).
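The stepwise comparison is straightforward to code; a minimal sketch of the Holm-Bonferroni procedure applied to the five p-values above:

# Holm-Bonferroni: compare the sorted p-values p_k with alpha / (N + 1 - k)
pvals = [0.005, 0.011, 0.02, 0.04, 0.13]   # p1 <= ... <= pN
alpha, N = 0.05, len(pvals)

k = 1
while k <= N and pvals[k - 1] <= alpha / (N + 1 - k):
    k += 1
# reject H0(1), ..., H0(k-1); here k = 3, so H0(1) and H0(2) are rejected
print(f"reject the first {k - 1} hypotheses")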
Let us finally consider the following table of counts of null hypotheses:

             H0 accepted   H0 rejected   total
  true H0    U             V             N0
  false H0   T             S             N − N0
  total      N − R         R             N

The second column gives the number of hypotheses where the p-value was small enough to reject, and the first column gives the number of hypotheses that were not rejected. We have that FWER = P(V ≥ 1), where V is the number of incorrectly rejected (true) null hypotheses. Since controlling the FWER leads to extremely conservative methods (in particular if N is very large),
other methods accept a few false positives. Let the false discovery proportion be defined as

Fdp = V/R if R ≥ 1, and Fdp = 0 otherwise.

Then it is often useful to control the false discovery rate FDR = E[Fdp], i.e. the average proportion of false positives among the set of rejected hypotheses. For instance, medical screening guidelines define FDR = 0.1 as acceptable for certain types of screenings. Note that controlling the FDR does not lead to FWER control in general. Moreover, we can only control its expectation, but the actual proportion of false positives may be much higher.
In the exercises we will show that

• if all null hypotheses are true (N = N0), then the FDR is equivalent to the FWER, and

• we always have that FWER ≥ FDR.

Thus, controlling the FDR is less stringent and thus more powerful in the sense that it allows us to reject more hypotheses.
7.6.3 Benjamini-Hochberg procedure
A simple method that controls the FDR is the Benjamini-Hochberg procedure. It controls the FDR such that its expectation is at most α. For instance, if one uses microarray experiments to compare the expression levels of 20,000 genes in liver tumors and in normal liver cells, an FDR of 5% would give up to 5% of the genes with significant results being false positives. In follow-up experiments, one can then check whether the selected genes really show a significant difference between the normal and tumor cells. This ensures that we only miss a potentially important discovery with low probability, which is reasonable if the costs for additional tests on the selected genes are low.

First, we again sort the hypotheses according to their p-values, i.e. pi is the p-value of H0(i) and p1 ≤ ... ≤ pN. Then we fix an FDR of α and compute the maximal index k such that

pk ≤ (k/N) α.

We reject H0(1), H0(2), ..., H0(k) and do not reject H0(k+1), H0(k+2), ..., H0(N). If no such k exists, none of the null hypotheses is rejected. It can be shown that, if the tests are independent, then the average FDR of this method is (N0/N) α, which is at most α.
Example 69: Five Tests (revisited)
Recall that the p-values of the 5 tests in the previous example were

H0(1) : 0.005, H0(2) : 0.011, H0(3) : 0.02, H0(4) : 0.04, H0(5) : 0.13.

In the Benjamini-Hochberg procedure we compare

H0(1) : 0.005 ≤ (1/N) α = 0.01,
H0(2) : 0.011 ≤ (2/N) α = 0.02,
H0(3) : 0.02 ≤ (3/N) α = 0.03,
H0(4) : 0.04 ≤ (4/N) α = 0.04,
H0(5) : 0.13 > (5/N) α = 0.05.

Hence we reject H0(1), ..., H0(4) and do not reject H0(5). Note that it is possible that for some i < k we have a p-value higher than (i/N) α.
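Analogously, a minimal sketch of the Benjamini-Hochberg step-up comparison for the same five p-values:

# Benjamini-Hochberg: find the maximal k with p_k <= (k/N) * alpha
pvals = [0.005, 0.011, 0.02, 0.04, 0.13]   # p1 <= ... <= pN
alpha, N = 0.05, len(pvals)

k_max = max((k for k, p in enumerate(pvals, start=1) if p <= k * alpha / N),
            default=0)
# reject H0(1), ..., H0(k_max); here k_max = 4
print(f"reject the first {k_max} hypotheses")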
7.7 Exercises

Easy