Discrete-Time Martingales and the Kalman Filter
Mark Webster
Student Number: 200313814
Module Code: MATH5003M
Supervisor: Dr. J. Voss
May 5, 2011
Abstract
This report covers the material required to talk about martingales, an important type
of random process, and the Kalman Filter, a method for determining the value of a
process with noisy observations.
We begin with a definition of Lebesgue integration, and build up the probability theory
necessary to talk about expectation and conditional expectation. This is followed by
more in-depth descriptions of martingales, and some simulations in R. The final section
covers the basic theory of the Kalman filter, a discrete-time method for determining
the state of a process given noisy measurements and a dynamical model of said state.
The appendices contain a brisk coverage of the material required from measure theory.
Contents

Notation Table

1 Introduction

2 Expectation and Conditional Expectation
  2.1 Expectation
    2.1.1 Definition and Basic Theorems
    2.1.2 Variance and Inner Products
    2.1.3 Elementary Formula and Probability Density Functions
  2.2 Conditional Expectation
    2.2.1 The Fundamental Theorem
    2.2.2 Example

3 Martingales
  3.1 Definition
    3.1.1 Examples: Martingales and Markov Chains
    3.1.2 Stopping Times
    3.1.3 R source code: Random Walk
  3.2 The Convergence Theorem
  3.3 Further Results
    3.3.1 Orthogonality of Increments
    3.3.2 Lévy's Upward Theorem

4 Filtering
  4.1 Bayes' Formula For Bivariate Normal Distributions
    4.1.1 Recursive Property in Probability Distribution Functions
    4.1.2 Bayes' Formula
  4.2 Single Random Variable
    4.2.1 System Model and Filter
    4.2.2 R source code: Filtering a Single Value
  4.3 Series of Variables; Kalman Filter
    4.3.1 System Model and Filter
    4.3.2 R source code: Kalman Filter
    4.3.3 Example: Moving on a Line
    4.3.4 R Source Code: Movement Problem
    4.3.5 Extension for Multiple Processes and Observations
    4.3.6 Comments

A Appendices
  A.1 Measures
  A.2 Events
  A.3 Random Variables
  A.4 Independence
  A.5 Integration

Bibliography
Notation Table

Symbol        Meaning                                          Equivalent in sources
Ω             Set of possible events (A.2.1)                   Ω
ω             An event in Ω (A.2.1)                            ω
X, Y, Z       Random variable (A.3.2)                          X, Y, Z
Xn, Yn, Zn    Random process                                   Mn for martingales
Cn            Pre-visible process (3.1.15)                     Cn
C • X         Martingale transform (3.1.16)                    C • X
Lp            Lebesgue space (2.1.12)                          Lp
log           Logarithm to base e                              log, ln
s.t.          "such that"                                      "such that"
∃             "there exists"                                   N/A
∀             "for all"                                        ∀
|X|           Absolute value                                   |X|
|X|p          Lp-norm (2.1.12)                                 ‖X‖p
⟨X, Y⟩        Inner or scalar product (2.1.16)                 ⟨X, Y⟩
ΛX            Law of X (A.3.3)                                 ΛX
µ, ν          Measure (A.1.4), or mean                         µ
λ             Lebesgue measure (A.1.5, A.1.8)                  Leb
F, G          σ-algebras (A.1.1)                               F, G
IA            Indicator function                               IA
ε             Small value                                      ε
T ∧ n         min(T, n)                                        T ∧ n
N             {1, 2, 3, . . .}                                 N
Z+            {0, 1, 2, . . .}                                 Z+
R+            [0, ∞)                                           R+
n             Timestep of process                              n, k
gn            Shift value in Kalman filter (4.3.1)             gn, u(k)
Hn², Kn²      Variance of noise in filter (4.2.1, 4.3.1)       Hn², Kn², or Q(k), R(k)
Zn            Mean of estimate (4.2.1, 4.3.1)                  Zn, x̂(k|k)
Vn            Variance of estimate (4.2.1, 4.3.1)              Vn, P(k|k)
mΣ            Set of Σ-measurable functions (A.3.1)            mΣ
SF            Set of simple functions (A.5.2)                  SF
P(A)          Power set of A (2.1.2)                           P(A)
N(µ, σ²)      Normal distribution (2.1.23)                     N(µ, σ²)
φn            Kalman gain factor (4.3.5)                       K(k)
rn, r̄n        Residual, or innovation (4.3.5)                  z̃n⁻
Chapter 1
Introduction
We look at two interesting and important ideas. Martingales are random processes
that, on average, stay the same. From this simple definition flows a collection of
results, particularly about stopped martingales: under suitable conditions we can say
exactly what the expected value of a martingale is at the time it stops. This is very
useful, since many random processes are martingales, and we can also use martingales
to analyse other processes; see Example 3.1.27 for an example.
Say we have a process which we wish to observe the value of: we know the dynamics
the process follows, but our measurements of the current value are noisy, so blindly
reading the measurements could result in large errors. The second idea we look at,
the Kalman filter, is an algorithm for estimating what the value actually is; it has
many applications, since we almost never have perfect instruments to measure with.
Chapter 2 covers the theory of expectation, an essential part of anything where we
wish to know the expected, or average, value of a random variable, either in general
(expectation) or given certain information (conditional expectation). Some of the
beginning material here comes from results in measure theory, in particular the idea
of using Lebesgue integration instead of Riemann integration: additional information
is available in the Appendix if desired.
Chapter 3 defines martingales, and gives an outline of the theorems the definition now
allows us to use. There is also a comparison with Markov chains, another common
type of random process. There are several examples given in this chapter, mostly based
on scenarios in gambling. In the next chapter they are used to show that the filters
eventually have a fixed variance in their estimates. Martingales have a lot of uses, and
only a few are presented here. The examples and R source code are the author's own
unless otherwise stated; other material is from [8] unless otherwise stated.
Chapter 4 deals with the subject of filtering, where we wish to estimate the value of
a process given only noisy observations and knowledge of the dynamics of the process
and the observations. The notation and most of the basic theory are from [8], with
additional material from [1, 6, 7]; the examples and the material in Section 4.3.5 are the author's
own unless otherwise stated.
The Appendix contains the basic material from measure theory that is needed to talk
about probability and expectation: it is a brisk summary of the first few chapters
of [8].
All R programs and this report can be found at http://www.maths.leeds.ac.uk/
~voss/projects/2010-martingales/.
Chapter 2
Expectation and Conditional Expectation
The theory of martingales depends heavily on the use of conditional expectations. We
therefore first describe expectation and conditional expectation, and the associated
theorems needed. Material in this section is from [8] unless otherwise marked.
Not all of the material in this chapter is required to study martingales: the important
parts will be listed at the beginning of each section.
2.1 Expectation
Important parts in this section are Definitions 2.1.1, 2.1.8, 2.1.10, 2.1.12, 2.1.15, 2.1.16,
2.1.19, 2.1.21, Theorems 2.1.3, 2.1.7, 2.1.9, 2.1.11, 2.1.14, and Examples 2.1.2, 2.1.23.
2.1.1 Definition and Basic Theorems
Definition 2.1.1 The expectation of X is
$$E(X) = \int_\Omega X \, dP = \int_{\omega \in \Omega} X(\omega) \, P(d\omega) = P(X),$$
using the Lebesgue integral as defined in Definition A.5.1, where P is a probability
measure as described in Definition A.2.1, and where Ω is a set of events as defined
in Definition A.2.1.
Example 2.1.2 Take [Own Example] Ω = {x1, x2, . . . , xn} ⊆ R with xi ≠ xj ∀i ≠ j,
X(ω) = ω, and F = P(Ω), where P(Ω) is the power set of Ω, the set of all subsets of Ω.
Let $P(A) = \sum_{i=1}^n I_{x_i \in A} \, p_i$, so pi = P(X = xi). We now have
$$E(X) = \int_\Omega X \, dP = \sum_i x_i P(X = x_i) = \sum_i x_i P(\{x_i\}) = \sum_i x_i p_i.$$
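A minimal R sketch of this example (illustrative values, not one of the report's programs): we compare the exact expectation Σi xi pi with the empirical mean of simulated draws.

#Check E(X) = sum(x*p) for a discrete random variable against
#an empirical mean (illustrative sketch)
x = c(1, 2, 5)
p = c(0.2, 0.3, 0.5)
sum(x*p)                                     #exact: 3.3
mean(sample(x, 10^5, replace=TRUE, prob=p))  #approximately 3.3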
Taking P(Xn → X) = 1 for sequence (Xn )n∈N and some variable X, we have the following basic theorems [8, Section 6.2], derived from their equivalents in Appendix A.5:
Theorem 2.1.3 (Monotone-Convergence Theorem) For 0 ≤ Xn ↑ X,
E(Xn) ↑ E(X) ≤ ∞.
Theorem 2.1.4 (Fatou's Lemma) Xn ≥ 0 ⇒ E(X) ≤ lim inf E(Xn).
Theorem 2.1.5 (Dominated-Convergence Theorem) If ∃Y such that |Xn(ω)| ≤ Y(ω) ∀n, ω and E(Y) < ∞, then
E|Xn − X| → 0, so that E(Xn) → E(X).
Theorem 2.1.6 (Scheffé's Lemma) E|Xn| → E|X| ⟺ E|Xn − X| → 0.
We also have the following new theorems:
Theorem 2.1.7 (Bounded Convergence Theorem) If ∃ finite K s.t. |Xn(ω)| ≤ K ∀n, ω, then
E|Xn − X| → 0.
We will need this theorem later to prove one of the main theorems in Chapter 3, Doob’s
Optional Stopping Theorem (Theorem 3.1.21).
Definition 2.1.8 $E(X, F) = \int_F X(\omega) \, P(d\omega) = E(X I_F)$, where F ∈ F, with F the
σ-algebra of events from the probability triple described in Definition A.2.1.
Theorem 2.1.9 (Markov's Inequality) Let X ∈ mΣ, i.e. a Σ-measurable function as defined in Definition A.3.1, and let g be a non-decreasing, Borel-measurable function g : R → [0, ∞], with the Borel σ-algebra on R as defined in Example A.1.3.
Then
$$E(g(X)) \geq E(g(X), X \geq c) \geq g(c) \, P(X \geq c).$$
This theorem will be useful in the next section, where we want to show the existence
of a conditional expectation.
2.1.2 Variance and Inner Products
Definition 2.1.10 A function c : G → R for G ⊂ R is convex on G if its graph lies
below any of its chords, i.e. if for all x, y ∈ G, c(λx + (1 − λ)y) ≤ λc(x) + (1 − λ)c(y)
∀ 0 < λ < 1.
For example, the function c(x) = x² on R has
$$(\lambda x + (1-\lambda)y)^2 - (\lambda x^2 + (1-\lambda)y^2) = \lambda^2 x^2 + 2\lambda(1-\lambda)xy + (1-\lambda)^2 y^2 - \lambda x^2 - (1-\lambda)y^2 = -\lambda(1-\lambda)(y-x)^2 \leq 0.$$
So x² is convex on R.
Theorem 2.1.11 (Jensen's Inequality) For convex c, E(X) < ∞, P(X ∈ G) = 1
for an open subset G ⊂ R, and E|c(X)| < ∞, we have E(c(X)) ≥ c(E(X)).
This comes in useful in Chapter 4, when we want to show that the estimate of a single
value is a martingale.
Definition 2.1.12 For 1 ≤ p < ∞, we say X ∈ Lp if E|X|p < ∞. Then
$|X|_p = (E|X|^p)^{1/p}$, where |·|p is the Lp-norm. Lp is also a vector space.
Theorem 2.1.13 (Monotonicity of Lp-norms) Take p ≤ r, Y ∈ Lr. Then Y ∈ Lp,
and |Y|p ≤ |Y|r.
Theorem 2.1.14 (Schwarz Inequality) For X, Y ∈ L²,
i) XY ∈ L¹ and E(XY) ≤ E|XY| ≤ |X|₂|Y|₂;
ii) X + Y ∈ L², |X + Y|₂ ≤ |X|₂ + |Y|₂.
This theorem, like Theorem 2.1.9, will be used in the section on conditional expectation.
Definition 2.1.15 Take X, Y ∈ L², µX = E(X), µY = E(Y). Then X̃ = X − µX, Ỹ =
Y − µY are in L². By the Schwarz Inequality, X̃Ỹ ∈ L¹. We then define the covariance
Cov(X, Y) = E(X̃Ỹ) = E[(X − µX)(Y − µY)] = E(XY) − µXµY, and the
variance Var(X) = Cov(X, X) = E(X²) − µX².
This gives the usual form of variance and covariance in probability and statistics; it
comes in particularly useful in Chapter 4.
Definition 2.1.16 The inner, or scalar, product is ⟨X, Y⟩ = E(XY). The angle θ
between X and Y then obeys cos θ = ⟨X, Y⟩/(|X|₂|Y|₂), and the correlation of X
and Y is corr(X, Y) = cos θ = ⟨X̃, Ỹ⟩/(|X̃|₂|Ỹ|₂). We say that X, Y are orthogonal, or
perpendicular, when ⟨X, Y⟩ = 0; we also write this as X ⊥ Y. In this case, on
L² we have |X + Y|₂² = |X|₂² + |Y|₂². Taking X̃, Ỹ, in probabilistic language this
becomes
Var(X + Y) = Var(X) + Var(Y) when Cov(X, Y) = 0.
This is known as Pythagoras' theorem.
Using inner products in L² will be shown later to be an easy – and usually possible –
way to show the existence of a conditional expectation.
Theorem 2.1.17 (Parallelogram Law) By the bilinearity of ⟨·, ·⟩,
$$|X+Y|_2^2 + |X-Y|_2^2 = \langle X+Y, X+Y \rangle + \langle X-Y, X-Y \rangle = 2|X|_2^2 + 2|Y|_2^2.$$
2.1.3 Elementary Formula and Probability Density Functions
In this section we introduce the idea of a probability density function. This allows us
to integrate over the real numbers to find the expectation, variance, or other functions
of a random variable, by giving us a function that describes the probability of the
variable being in any interval.
Theorem 2.1.18 (Completeness of Lp) Let p ∈ [1, ∞). If (Xn) is a Cauchy
sequence in Lp, i.e.
$$\lim_{k \to \infty} \sup_{r,s \geq k} |X_r - X_s|_p = 0,$$
then ∃X ∈ Lp such that Xr → X in Lp, i.e.
$$\lim_{r \to \infty} |X_r - X|_p = 0.$$
Definition 2.1.19 Take a subspace K ⊂ L² that is closed under limits, i.e. any
Cauchy sequence (Vn) in K has Vn → V ∈ K. Then ∀X ∈ L² ∃Y ∈ K s.t.
|X − Y|₂ = inf{|X − W|₂ : W ∈ K} and (X − Y) ⊥ Z ∀Z ∈ K. We say Y is the
orthogonal projection of X onto K; it is almost surely unique.
Theorem 2.1.20 (Elementary Formula for Expectation) Let h be a
Borel-measurable function, and let ΛX be the law of X on (R, B), i.e. ΛX(B) = P(X ∈ B)
as defined in Definition A.3.3. Then
h(X) ∈ L¹ ⟺ h ∈ L¹(R, B, ΛX).
In this case, $E(h(X)) = \Lambda_X(h) = \int_{\mathbb{R}} h(x) \, \Lambda_X(dx)$.
Definition 2.1.21 The probability density function fX for X, if it exists, is a
Borel-measurable function fX : R → [0, ∞] such that $P(X \in B) = \int_B f_X(x) \, dx$ for B ∈ B.
We also write this as fX = dΛX/dλ, where λ is the Lebesgue measure on B, described in
Examples A.1.5 and A.1.8, and using the notation for density from Definition A.5.15.
Note: This means that, if we have a probability density function for X, we can
now write the expectation as $\int X f_X \, d\lambda$, by Lemma A.5.16. In particular, if xfX(x)
is also Riemann-integrable and X(ω) = ω, we can also write the expectation as
$\int x f_X(x) \, dx$, and the two integrals are equal.
Example 2.1.22 The uniform distribution U[a, b] on the interval [a, b] has constant
probability density function fX(x) = 1/(b − a). Then the mean of X is
$$E(X) = \int_a^b \frac{x \, dx}{b-a} = \frac{b+a}{2},$$
i.e. the midpoint of the interval, and the variance is
$$E(X^2) - E(X)^2 = \int_a^b \frac{x^2 \, dx}{b-a} - \frac{(b+a)^2}{4} = \frac{b^3-a^3}{3(b-a)} - \frac{(b+a)^2}{4} = \frac{b^2+ab+a^2}{3} - \frac{b^2+2ab+a^2}{4} = \frac{(b-a)^2}{12}.$$
Example 2.1.23 The normal distribution N(µ, σ²) has probability density function
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Then the mean of X,
$$E(X) = \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \int_{-\infty}^{\infty} \left( \frac{x-\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} + \frac{\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) dx = \mu \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \mu,$$
since the first term is an odd function about µ and integrates to zero. And the variance,
$$E(X^2) - E(X)^2 = \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx - \mu^2 = \int_{-\infty}^{\infty} x \, \frac{x-\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx + \int_{-\infty}^{\infty} \frac{x\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx - \mu^2$$
$$= \left[ -x \, \frac{\sigma}{\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right]_{x=-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{\sigma}{\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx + \mu^2 - \mu^2 = \sigma^2,$$
integrating the first term by parts.
This is a rather important example: a lot of random processes are normal, or can be
approximated as normal over a long period of time. In Chapter 4 one of the
underlying assumptions is that the noise is normally distributed, something known as
"white noise".
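As a minimal R check of this example (illustrative values; note that dnorm takes the standard deviation, not the variance), we can compute the mean and variance numerically via the elementary formula (Theorem 2.1.20):

#Numerical check of Example 2.1.23 with mu = 1, sigma^2 = 4
mu = 1; sigma = 2
integrate(function(x) x*dnorm(x, mu, sigma), -Inf, Inf)                 #mean: 1
integrate(function(x) x^2*dnorm(x, mu, sigma), -Inf, Inf)$value - mu^2  #variance: 4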
Theorem 2.1.24 (Hölder's Inequality) Take f ∈ Lp(S, Σ, µ) and h ∈ Lq(S, Σ, µ),
where 1/p + 1/q = 1. Then
f h ∈ L¹(S, Σ, µ) and |µ(f h)| ≤ µ(|f h|) ≤ |f|p|h|q.
Theorem 2.1.25 (Minkowski's Inequality) Take f, g ∈ Lp(S, Σ, µ). Then
|f + g|p ≤ |f|p + |g|p.
2.2 Conditional Expectation
The important parts of this section are Definition 2.2.2, Notation 2.2.3, Theorem 2.2.1,
and the “Existence in L2 ” section of the proof of said theorem.
2.2.1 The Fundamental Theorem
Now that we have the concept of expectation, we can give a formal definition of conditional expectation. The usual notion of conditional expectation Y = E(X|Z) is a
random variable Y whose value depends on the value of the random variable Z. Using our
definitions of events and random variables, we can say that Z is a real-valued function
Z(ω) of a point ω in the sample space. If Z has a given value, this
limits the possible values of ω; in other words, out of the possible events, the actual
outcome has been limited to a subset of the family of events F, say G. The conditional
expectation is then the expected value of X(ω) given that information.
Theorem 2.2.1 (Fundamental Theorem) Take a random variable X with E|X| <
∞, and G a sub-σ-algebra of F in the usual probability triple (Ω, F, P). Then there is
a random variable Y with Y ∈ mG, E|Y | < ∞ such that
∀G ∈ G E(Y, G) = E(X, G).
This Y is almost surely unique, i.e. if a random variable Ỹ has the same properties
then P(Y = Ỹ ) = 1.
Definition 2.2.2 A random variable Y with the properties described above is called a
version of conditional expectation E(X|G). We say that Y = E(X|G) almost surely.
Notation 2.2.3 We write E(X|Z) as shorthand for E(X|σ(Z)), where σ(Z) is the
σ-algebra generated by Z, as defined in Definition A.1.2.
Proof of Almost Sure Uniqueness
Proof by contradiction [8, Section 9.5]. Assuming the conditional expectation exists,
we have X ∈ L¹, and two versions Y, Ỹ of E(X|G). Then, by the definition, we have
Y, Ỹ ∈ L¹(Ω, G, P), and E(Y, G) = E(Ỹ, G) ⇒ E(Y − Ỹ, G) = 0 ∀G ∈ G.
We now suppose Y, Ỹ are not almost surely equal. Without loss of generality, we take
P(Y > Ỹ) > 0. We can introduce an error term and construct a sequence
of events An = {Y > Ỹ + εn} with An ↑ {Y > Ỹ} as εn ↓ 0. Then ∃n such that
P(Y > Ỹ + εn) = P(Y − Ỹ > εn) > 0.
Since Y, Ỹ are G-measurable, we can use the Markov inequality (Theorem 2.1.9) with
g(c) = c, c = εn to get
E(Y − Ỹ, Y − Ỹ > εn) ≥ εn P(Y − Ỹ > εn) > 0.
But E(Y − Ỹ, Y − Ỹ > εn) = 0, since {Y − Ỹ > εn} ∈ G. Contradiction; therefore Y is almost surely unique. □
Proof of Existence for X ∈ L2
We give a proof of existence for the case E|X|² < ∞, since this is a simpler case: we
can use the idea of orthogonal projections (Definition 2.1.19). Take Y as the orthogonal
projection of X onto L²(Ω, G, P), which always exists. Then we have
⟨X − Y, Z⟩ = E((X − Y)Z) = 0 ∀Z ∈ L²(Ω, G, P).
By linearity, we therefore arrive at E(XZ) = E(YZ). We can now set Z = I_G for
G ∈ G to obtain the result. □
Proof of Existence for X ∈ L1
In the standard machine (Method A.5.13), we defined h = h⁺ − h⁻, with h⁺, h⁻ as
described in Notation A.5.10, to prove something for all measurable functions if it was
true for all positive measurable functions. Similarly, we define X = X⁺ − X⁻ and
limit ourselves to the case X ∈ (L¹)⁺. We now choose [8, Section 9.5] a sequence
0 ≤ Xn ↑ X with each Xn bounded. Then Xn ∈ L², so by the previous section we can form the sequence
(Yn = E(Xn|G))n∈N.
Lemma 2.2.4 For a non-negative bounded random variable X, E(X|G) ≥ 0 almost
surely.
Proof: By contradiction. Assume a version Y of the expectation has P(Y < 0) > 0:
then ∃ε > 0 such that we can set G = {Y < −ε} with P(G) > 0, and take the Markov
Inequality (Theorem 2.1.9) with g(x) = x, X = −Y, c = ε to obtain
E(−Y, −Y > ε) ≥ εP(−Y > ε) > 0.
We thus have
0 ≤ E(X, G) = E(Y, G) ≤ −εP(G) < 0.
Contradiction, so E(X|G) ≥ 0 almost surely. □
From this lemma, applied to Xn and to Xn+1 − Xn, we have 0 ≤ Yn ≤ Yn+1 almost surely ∀n, so we define Y(ω) = lim sup Yn(ω).
Then Y ∈ mG, and Yn ↑ Y almost surely. By the Monotone-Convergence Theorem
(Theorem 2.1.3), we then take the limit of E(Yn, G) = E(Xn, G) to get
E(Y, G) = E(X, G) ∀G ∈ G. □

2.2.2 Example
Take [Own Example] Ω = {1, 2, 3, 4, 5, 6}, F = P(Ω), G = {∅, {1, 2}, {3, 4, 5, 6}, Ω},
and P(A ∈ F) = ♯A/6, where P(Ω) is the power set of Ω, and ♯A is the number of elements
of A. We then have the probability triple associated with rolling a fair six-sided die.
As we expect, the probability measure gives E(X) = 7/2 for X(ω) = ω. Since Y must
be G-measurable, we can say that
Y(ω) = y1 for ω ∈ {1, 2}, and Y(ω) = y2 for ω ∈ {3, 4, 5, 6}.
Since E(Y IG) = E(X IG) ∀G ∈ G, Y is determined by the simultaneous equations
E(Y I{1,2}) = E(X I{1,2}), E(Y I{3,4,5,6}) = E(X I{3,4,5,6}).
From the probability measure, we then have
$$\tfrac{1}{6}y_1 + \tfrac{1}{6}y_1 = \tfrac{1}{6}(1+2), \qquad \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 = \tfrac{1}{6}(3+4+5+6),$$
or (1/3)y1 = 1/2, (2/3)y2 = 3. We therefore obtain the result
$$E(X|G) = \tfrac{3}{2} I_{\{1,2\}} + \tfrac{9}{2} I_{\{3,4,5,6\}}.$$
Note: Since E|X|² < ∞ we can treat Y as the orthogonal projection – as from
Definition 2.1.19 – of X onto L²(Ω, G, P). Then any Z ∈ L²(Ω, G, P) can be written
Z = z1 I{1,2} + z2 I{3,4,5,6}, and we obtain
$$\langle X - Y, Z \rangle = E((X-Y)Z) = \tfrac{1}{6}\left((3 - 2y_1)z_1 + (18 - 4y_2)z_2\right) = 0 \quad \forall z_1, z_2.$$
So we see the inner product behaves as expected.
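A minimal R sketch of this example (not one of the report's programs): estimating E(X|G) by averaging simulated die rolls over each block of the partition.

#Estimate E(X|G) for a fair die by averaging rolls within each
#block of the partition {1,2}, {3,4,5,6} (illustrative sketch)
rolls = sample(1:6, 10^5, replace=TRUE)
mean(rolls[rolls <= 2])  #approximately 3/2
mean(rolls[rolls >= 3])  #approximately 9/2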
Chapter 3
Martingales
Martingales, and similarly supermartingales and submartingales, are an important
type of random process: a martingale's expected value stays the same over time, so on
average it stays where it is. (In the case of supermartingales and submartingales, the
expected value is monotonically decreasing or increasing respectively.) This allows us
to derive other results for random processes, which is especially useful as these three
types of process are rather common: typical examples of processes that can be modelled
as a martingale are the amount of money a gambler owns during several rounds of
betting, and Brownian motion.
3.1 Definition
Important parts in this section are Definitions 3.1.1, 3.1.5, 3.1.7, 3.1.17, 3.1.18, 3.1.15,
3.1.16, Theorem 3.1.21 and Corollary 3.1.26.
Definition 3.1.1 A filtration {Fn}n≥0 is an increasing family of sub-σ-algebras of
F, with F0 ⊆ F1 ⊆ . . . ⊆ F. We define F∞ = σ(∪n Fn) ⊆ F.
Example 3.1.2 For a common example from gambling, consider a sequence of dice
throws, each an independent and identically distributed random variable Xi for
i ∈ Z+. Then the filtration Fn = σ(X0, X1, . . . , Xn) is the σ-algebra generated by the set
{X0, . . . , Xn} as defined in Definition A.1.2. In other words, it is the collection of all
events that are determined by the throws up to time n.
Definition 3.1.3 A filtered space (Ω, F, {Fn }, P) is a probability triple (Ω, F, P)
with an associated filtration {Fn }n≥0 .
Example 3.1.4 {Fn } is usually taken as the natural filtration described in Example 3.1.2,
Fn = σ(X0 , X1 , . . . , Xn ).
For the gambling example, the filtered space is thus a description of the probabilistic
model for dice-throwing, with a record of the dice throws up to time n.
Definition 3.1.5 A process X = (Xn | n ≥ 0) is adapted to filtration {Fn } if Xn
is Fn -measurable ∀n.
Example 3.1.6 For the gambling example, a process Yn is adapted to Fn if it is any
function of {X0, X1, . . . , Xn}, so that it is determinable at time n; a common example
would be the sum of the throws, $Y_n = \sum_{i=0}^{n} X_i$.
Definition 3.1.7 A process X is a martingale relative to ({Fn}, P) if
i) X is adapted,
ii) E|Xn| < ∞ ∀n,
iii) E(Xn|Fn−1) = Xn−1 almost surely ∀n ≥ 1.
A supermartingale has the equality in iii) replaced by "≤", and a submartingale
has it replaced by "≥"; so, a supermartingale decreases on average, and a submartingale
increases on average.
3.1.1 Examples: Martingales and Markov Chains
Martingales and Markov chains both appear often in probability theory, so we look at
the distinctions between the two.
Definition 3.1.8 A stochastic process X is a collection of random variables
(Xγ | γ ∈ C) parametrized by set C, where the variables are all on the same probability
triple.
Definition 3.1.9 A stochastic, or transition, matrix P = (p_ij) is a matrix such
that
$$p_{ij} \geq 0, \qquad \sum_k p_{ik} = 1.$$
Definition 3.1.10 A time-homogeneous Markov Chain X = (Xn | n ∈ Z+) is
a stochastic process parametrized by the set Z+ with elements Xn ∈ E for some set E. If
E is countable, the chain is then defined by a stochastic |E| × |E| matrix P and an
initial distribution µ over E. Then
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \mu_{i_0} p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$
Corollary 3.1.11 A Markov Chain X is "memoryless", i.e.
$$P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_{n-1} = i_{n-1}).$$
Proof: From the definition, we have [Own Proof]
$$P(X_0 = i_0, \ldots, X_n = i_n) = P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \, P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = \mu_{i_0} p_{i_0 i_1} \cdots p_{i_{n-2} i_{n-1}} \, P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}),$$
$$\therefore\ p_{i_{n-1} i_n} = P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_{n-1} = i_{n-1}). \ \Box$$
So we have the other common definition of a Markov Chain, where the next value only
depends on the current value.
Example 3.1.12 A random walk Sn = X1 + X2 + . . . + Xn, where the Xn are independent, identically distributed random variables, has conditional expectation
$$E(S_n \mid F_{n-1}) = E(S_{n-1} + X_n \mid F_{n-1}) = S_{n-1} + E(X_n \mid F_{n-1}) = S_{n-1} + E(X_n).$$
So S is only a martingale when E(X) = 0. Additionally, we can say
$$P(S_0 = i_0 = 0, \ldots, S_n = i_n) = P(S_0 = 0, \ldots, S_{n-1} = i_{n-1}) \, P(S_n = i_n \mid S_{n-1} = i_{n-1}) = P(S_1 = i_1) P(S_2 = i_2 \mid S_1 = i_1) \cdots P(S_n = i_n \mid S_{n-1} = i_{n-1}) = p_{0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$
So S is always a Markov chain, with µ0 = 1.
Example 3.1.13 Instead of the usual Markov chain, let every value after time
n = 1 be determined by a new stochastic matrix Q = Q(X1) whose values depend on
the value of X1; in other words,
$$P(X_0 = i_0, X_1 = i_1) = \mu_{i_0} p_{i_0 i_1}, \qquad P(X_0 = i_0, \ldots, X_n = i_n) = \mu_{i_0} p_{i_0 i_1} q_{i_1 i_2}(i_1) q_{i_2 i_3}(i_1) \cdots q_{i_{n-1} i_n}(i_1) \quad \forall n > 1.$$
Except in the trivial case, the process X is no longer a Markov Chain, since the transition
probabilities are no longer "memoryless", but also depend on an older value. However,
if P and Q(X1) are defined in such a way that
$$E(X_1 \mid X_0 = i_0) = \sum_{j \in E} j \, p_{i_0 j} = i_0 \quad \forall i_0 \in E,$$
$$E(X_n \mid F_{n-1}) = \sum_{j \in E} j \, q_{i_{n-1} j}(i_1) = i_{n-1} \quad \forall n > 1,\ i_1, i_{n-1} \in E,$$
then we have a process that isn't a Markov Chain, but is a martingale.
Note that the above process can become a Markov Chain if we store the value of X1; in
this case, the process becomes a Markov Chain over the bivariate state space (X1, Xn).
Example 3.1.14 The "Martingale betting system" is a gambling strategy where the
gambler repeatedly doubles the size of his bet each round, until he wins a round. The
theory behind this is that, given an infinite amount of time and money, the gambler
can keep raising his bet indefinitely until he wins a round, at which point he is up by
the size of his initial bet. Mathematically, we write this as a sum of weighted random
variables. Without loss of generality, we take the starting value as zero, and assume
the payout is evens, i.e. the payout for a successful round is the initial bet, plus the
same again. Then we can write [8, Section 10.6]
$$S_n = C_1 X_1 + C_2 X_2 + \ldots + C_n X_n = S_{n-1} + C_n X_n,$$
where P(Xn = 1) = p, P(Xn = −1) = 1 − p, and Cn is a random variable that is
determined by past results, i.e.
Cn = Cn(X1, X2, . . . , Xn−1).
Cn is called a pre-visible process, and we define this term after the example. In this
case, we have
Cn = 1 if Xn−1 = 1, and Cn = 2Cn−1 if Xn−1 = −1.
For S to be a Markov Chain, we need the next value Sn to be completely determinable
from the value of Sn−1. This would require C to be determinable by S, but this isn't
the case, so S can't be a Markov Chain. On the other hand,
$$E(S_n \mid F_{n-1}) = E(S_{n-1} + C_n X_n \mid F_{n-1}) = S_{n-1} + C_n E(X_n),$$
so this is a martingale if E(Xn) = 0.
Note this strategy can take a long time, and hence a lot of money, waiting for a
successful round: the chance of waiting N rounds for a successful one is P(N = n) =
(1 − p)^{n−1} p, with an expected time of
$$E(N) = \sum_{n \in \mathbb{N}} n(1-p)^{n-1}p = \sum_{n \in \mathbb{N}} \sum_{m \geq n} (1-p)^{m-1}p = \sum_{n \in \mathbb{N}} (1-p)^{n-1} = \frac{1}{p}.$$
In addition, if we assume the initial bet C1 = c, the amount of money required to bet
in N rounds is c + 2c + . . . + 2^{N−1} c = (2^N − 1)c, so the expected amount of money M
required is
$$E(M) = \sum_{n \in \mathbb{N}} (2^n - 1)c(1-p)^{n-1}p = c\sum_{n \in \mathbb{N}} 2^n(1-p)^{n-1}p - c\sum_{n \in \mathbb{N}} (1-p)^{n-1}p = \begin{cases} \infty & 0 \leq p \leq 1/2, \\ \frac{c}{2p-1} & 1/2 < p \leq 1, \end{cases}$$
a rather large amount to need just to gain an amount c.
This betting system is where martingales derive their name from; the origin of the
word before this is unclear. There are two main theories [2]: either that it comes from
the name of a type of saddle – which bifurcates into two equally long strips in the
middle – or that it comes from the Provençal phrase "a la martegalo" – referring to
the inhabitants of Martigues, who had a reputation for doing things in a ridiculous or
naive way – and means "in an absurd manner", an appropriate origin for the naming
of this betting system if it is the case.
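A minimal R sketch of the betting system (illustrative, not one of the report's programs), checking the simulated averages against E(N) = 1/p and E(M) = c/(2p − 1):

#Simulate the doubling strategy until the first win, returning the
#number of rounds played and the total money bet (illustrative sketch)
run.doubling = function(p, stake=1) {
  bet = stake; total = 0; rounds = 0
  repeat {
    total = total + bet      #money needed for this round
    rounds = rounds + 1
    if (runif(1) < p) break  #won this round
    bet = 2*bet              #lost: double the next bet
  }
  c(rounds=rounds, money=total)
}
res = replicate(10^4, run.doubling(p=0.6))
rowMeans(res)  #approximately 1/0.6 = 1.67 rounds and 1/(2*0.6-1) = 5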
Definition 3.1.15 A process C = (Cn)n∈N is pre-visible if Cn ∈ mFn−1 ∀n.
Such a pre-visible process is thus determinable in advance of when it is used. Examples
include a controllable parameter, which would need to be determined based on past
results, and cannot be based on the upcoming one, such as C in the example above.
Definition 3.1.16 The martingale transform is $(C \bullet X)_n = \sum_{1 \leq k \leq n} C_k(X_k - X_{k-1})$,
where C is pre-visible. This is the discrete equivalent of the stochastic integral.
3.1.2 Stopping Times
Definition 3.1.17 A map T : Ω → {0, 1, 2, . . . ; ∞} is a stopping time if {T = n} =
{ω : T(ω) = n} ∈ Fn ∀n ≤ ∞. It is possible to have T = ∞.
The requirement that {T = n} ∈ Fn means that the decision to stop at a certain time
can only depend on what has happened up to that time. For example, you can set
the stopping time as the first occurrence of any set of values, T = inf{n ≥ 0 : Xn ∈ Y}
(or, similarly, the nth occurrence), but you can't set the stopping time as the nth last
occurrence of any value, because this usually cannot be determined until the whole
process has been observed.
Definition 3.1.18 The stopped process X^T = (X_{T∧n}), or the process X stopped at T, is
the process X up to stopping time T, and is equal to X_T ∀n ≥ T. This can also be
written via (C^(T) • X)n = X_{T∧n} − X0, where C_n^(T) = I{n≤T}. C^(T) is then pre-visible,
since {C_n^(T) = 0} = {T ≤ n − 1} = ∪_{0≤k≤n−1} {T = k} ∈ Fn−1.
Theorem 3.1.19 If X is a (super)martingale, and T is a stopping time, then X^T is
a (super)martingale, with E(X_{T∧n}) (less than or) equal to E(X0) ∀n.
Example 3.1.20 Take X as a simple symmetric random walk on Z [8], starting at 0, with stopping
time T = inf{n; Xn = 1}. Then E(X_T) = 1. However, E(X_{T∧n}) = E(X0) = 0. We
therefore do not necessarily have E(X_T) = E(X0).
The next theorem gives sufficient conditions for E(X_T) = E(X0).
Theorem 3.1.21 (Doob's Optional Stopping Theorem) Let X be a
(super)martingale and T be a stopping time. Then X_T is integrable and E(X_T) is (less
than or) equal to E(X0) if any of the following hold:
i) T is bounded, i.e. ∃N ∈ N such that T(ω) ≤ N ∀ω;
ii) X is bounded, i.e. ∃K ∈ R+ such that |Xn(ω)| ≤ K ∀n, ω, and T is almost surely
finite;
iii) E(T) < ∞, and ∃K ∈ R+ such that |Xn(ω) − Xn−1(ω)| ≤ K ∀(n, ω).
Proof: From Theorem 3.1.19, E(X_{T∧n}) ≤ E(X0). Then for X a supermartingale,
i) Take n = N.
ii) Take n ↑ ∞ using the Bounded Convergence Theorem (Theorem 2.1.7).
iii) The condition gives $|X_{T \wedge n} - X_0| = |\sum_{k=1}^{T \wedge n} (X_k - X_{k-1})| \leq KT$ with E(KT) < ∞,
so we can take n ↑ ∞.
For X a martingale, we apply the above to −X to show equality. □
Example 3.1.22 (Gambler's Ruin) Say we have an amount of money S0, and we
can bet on an infinite number of rounds, where the odds of victory and the corresponding
payoffs are the same as in Example 3.1.14. We aim to reach an amount of money S,
and so we stop when we've either reached this amount or run out of money: we'd like
to know how likely we are to succeed.
Take our money at time n as Sn = S0 + (C • X)n, with (C • X)n being the martingale
transform. Our stopping time is T = inf{n; Sn ∈ {0, S}}. Our situation can be split into
two cases:
i) p = q = 1/2. In this case Sn is a martingale regardless of our choice of Cn, and
since Sn is bounded we can derive our result from Theorem 3.1.21 [3, Section
12.2]:
$$0 \cdot P(S_T = 0) + S \cdot P(S_T = S) = E(S_T) = S_0,$$
so P(Success) = S0/S. This is irrespective of our gambling strategy, as we'd expect
from a fair game.
ii) p ≠ q. Sn is now either a supermartingale or a submartingale, so Theorem 3.1.21
will give us an inequality, and we can't use it to directly calculate the answer.
For the moment, let's suppose we always take Cn = 1, and that S0 = S/2. If
we now write p_i for P(S_T = 2i | S0 = i), then we can use the fact that Sn is
time-homogeneous to write p_i = p·p_{i+1} + q·p_{i−1}, with boundary conditions p0 = 0,
p_{2i} = 1. This is a recurrence relation with solution
$$p_i = \frac{1 - (q/p)^i}{1 - (q/p)^{2i}} = \frac{1}{1 + (q/p)^i} = \frac{p^i}{p^i + q^i}.$$
For p > q this is greater than 1/(1 + q/p) = p, and for p < q this is smaller than p,
where p would be our chance of success by betting i and reaching our stop time in
one round. Applying this result to each bet instead of the entire problem, we find
that when p > 1/2 our best strategy is to bet in increments as small as possible,
giving P(Success) = (1 − (q/p)^{S0})/(1 − (q/p)^S), and when p < 1/2 our best strategy is to bet as high
as possible: intuitively, in the former case the odds benefit us in the long term
so we can take our time, and in the latter case they do not.
Note: Calculating the chance of success when p < q can be complicated, depending
on exactly how we restrict the bets: limiting them to stay in [0, S] means the recurrence
relation on p_i changes depending on whether i ≥ S/2. (In the case S0 = S/2, we can
obviously still say the chance of success is p.) On the other hand, if we allow ourselves
to just stop when we have at least S, and bet all of our money at each timestep, the
relation is simpler and has solution P(Success) = p^{t+1}, where −t − 1 ≤ log2(i/S) < −t.
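A minimal R sketch of case i) (illustrative, not one of the report's programs): simulating the fair game with unit bets and checking P(Success) = S0/S.

#Simulate gambler's ruin with unit bets in the fair case p = 1/2
ruin.success = function(S0, S, p=0.5) {
  money = S0
  while (money > 0 && money < S) {
    money = money + sample(c(-1,1), 1, prob=c(1-p, p))
  }
  money == S
}
mean(replicate(10^3, ruin.success(S0=3, S=10)))  #approximately 3/10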
Corollary 3.1.23 If X is a martingale with Xn − Xn−1 bounded by some constant
K, C is a pre-visible process bounded by some constant L, and T is a stopping
time with E(T) < ∞, then E(C • X)_T = 0.
Corollary 3.1.24 If X is a non-negative supermartingale, and T is an almost-surely-finite stopping time, then E(X_T) ≤ E(X0).
Example 3.1.25 Take a simple binomial random walk, with X0 = 0, $X_n = \sum_{k=1}^n Z_k$
and Z_k equally likely to be −1 or 1 ∀k. Xn is then a martingale. Let stopping
time T = min{n : Xn = −1}. We then have that i) T is not bounded, and ii) X is
not bounded. Since −1 = E(X_T) ≠ E(X0) = 0, iii) must also not hold. We therefore
conclude that E(T) = ∞.
We usually need to determine whether E(T) < ∞ instead of deriving it from Doob's
Optional-Stopping Theorem.
Corollary 3.1.26 Let T be a stopping time such that ∃N ∈ N, ε > 0 such that
∀n ∈ N: P(T ≤ n + N | Fn) > ε almost surely. Then E(T) < ∞.
Example 3.1.27 Say we have a monkey randomly pressing keys on a typewriter, letters only, and we want to know how long it will take for it to type the word "abracadabra" [8, Exercise E10.6]. We suppose that at each timestep a gambler arrives
with one betting chip, and bets it on the monkey typing the first letter, A, at the fair
payoff of 25 to 1. If he wins, then at the next timestep he bets all 26 chips on the
monkey typing B, and so on through the whole word, until the monkey types the whole
word, or it misses a letter and he loses all his chips.
Let X_{n,m} be the number of chips owned at time m by the gambler who entered at time
n – so, for example, X_{n,n} will be equal to either 26 or 0. X_{n,m} is then a martingale
with regard to m. Then the sum of chips owned by all gamblers at time m,
$Z_m = \sum_{n=1}^m X_{n,m}$, has E(Z_m) = m, and the process W_m = Z_m − m is a martingale
with E(W_m) = 0. If we can find E(W_T) = E(Z_T) − E(T), where T is the stopping time
of the monkey typing the whole word, then we can express the expected time taken by
the expected total number of chips held by the gamblers.
We can use Corollary 3.1.26 to show that E(T) < ∞, by, for example, setting N > 11
and ε = 26^{−11}. In addition, Z_m ≤ 26Z_{m−1} + 26, and $Z_m \leq \sum_{i=1}^{11} 26^i$ ∀m, so we
know that $|Z_m - Z_{m-1}| \leq 25\sum_{i=0}^{10} 26^i + 26 = 26^{11} + 25$. We can thus use condition
iii) of Doob's Optional Stopping Theorem (3.1.21) to say that E(W_T) = 0, and so
E(Z_T) = E(T). E(Z_T) has a value predetermined by the stop condition, since only
certain gamblers can have any chips remaining when the word is finished: those who
entered at times T, T − 3, and T − 10 will have chips, since there are sequences of
letters that are both at the beginning and the end of the word, of lengths 1, 4 and 11
respectively. These gamblers hold 26, 26⁴ and 26¹¹ chips respectively, so
E(T) = E(Z_T) = 26 + 26⁴ + 26¹¹.
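Simulating the full word is impractical, since E(T) is of the order 26¹¹, but a minimal R sketch with the word "aa" over the two-letter alphabet {a, b} checks the same argument: the prefixes of "aa" that are also suffixes have lengths 1 and 2, so the argument above gives E(T) = 2 + 2² = 6.

#Estimate the expected time for a random typist on {a,b} to
#produce "aa"; the gambling argument gives E(T) = 2 + 4 = 6
type.until = function(target=c("a","a"), alphabet=c("a","b")) {
  history = character(0)
  repeat {
    history = c(history, sample(alphabet, 1))
    n = length(history)
    if (n >= length(target) &&
        all(tail(history, length(target)) == target)) return(n)
  }
}
mean(replicate(10^4, type.until()))  #approximately 6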
3.1.3 R source code: Random Walk
This script creates a given number of random walks over a given number of timesteps
with a given probability distribution, then plots them on a graph. The creation of the
walk data is left as a separate function to allow manipulation before plotting.
#Create walk data according to any defined random distribution
create.walk = function(timesteps, runs, start, f) {
  y = array(dim=c(timesteps+1, runs))
  y[1,] = start
  for (time in 1:timesteps) {
    x = f(runs)
    y[time+1,] = y[time,] + x
  }
  y
}

draw.walk = function(data) {
  ##Create a "width matrix" to record the no. of occurrences of
  ##each edge
  calculate.width = function(data, timesteps, runs) {
    z = array(NA, c(timesteps, runs))
    for (time in 1:timesteps) {
      ##loop to runs-1 only: the inner loop below looks ahead, and
      ##the last run is handled by the check after this loop
      for (run in 1:(runs-1)) {
        if (is.na(z[time,run])) {
          z[time,run] = 1
          for (through in (run+1):runs) {
            if (data[time,through]==data[time,run] &&
                data[time+1,through]==data[time+1,run]) {
              z[time,c(run,through)] = c(z[time,run]+1, 0)
            }
          }
        }
      }
      if (is.na(z[time,runs])) {
        z[time,runs] = 1
      }
    }
    z
  }
  ##Use width matrix to plot walk with thickness depending on frequency
  draw.data = function(data, width, timesteps, runs) {
    plot(c(0,timesteps), range(data),
         type="n", xlab="Time", ylab="Walk values")
    for (run in 1:runs) {
      for (time in 1:timesteps) {
        if (width[time,run] > 0) {
          lines(c(time-1,time), c(data[time,run],
                data[time+1,run]), lwd=width[time,run])
        }
      }
    }
  }
  b = calculate.width(data, dim(data)[1]-1, dim(data)[2])
  draw.data(data, b, dim(data)[1]-1, dim(data)[2])
}

Some example commands, respectively for an even discrete (−1, 1) random walk and
a normal distribution of mean 0 and variance 1, each with 10 runs over 25 time units:

source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R")
x = create.walk(25, 10, 0, function(x) sample(c(-1,1), x, replace=TRUE))
draw.walk(x)

source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R")
x = create.walk(25, 10, 0, function(x) rnorm(x, 0, 1))
draw.walk(x)
[Figure 3.1: Example graphic for a binomial random walk in Program 3.1.3, with increments equally likely to be 1 or −1, taken over 25 time steps with 10 sample runs.]
[Figure 3.2: Example graphic for a random walk in Program 3.1.3 with normally-distributed increments, with mean 0 and variance 1, over 25 time steps and 10 sample runs.]

3.2 The Convergence Theorem
Important parts in this section are Theorem 3.2.5 and Lemmas 3.2.2 and 3.2.3.
For a process X on R, let Y_N = (C • X)_N, where the pre-visible strategy Cn(a, b) is defined
as follows for a < b: Cn = 0 until X < a; then Cn = 1 until X > b; then Cn = 0,
and the strategy repeats. More formally,
$$C_1 = I_{\{X_0 < a\}}, \qquad C_n = I_{\{C_{n-1} = 1\}} I_{\{X_{n-1} \leq b\}} + I_{\{C_{n-1} = 0\}} I_{\{X_{n-1} < a\}}.$$
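A minimal R sketch of this strategy (illustrative), which also counts the upcrossings defined in Definition 3.2.1 below:

#Follow the strategy C_n(a,b) along a path X_0,...,X_N and count
#the completed upcrossings of [a,b] (illustrative sketch)
count.upcrossings = function(X, a, b) {
  C = 0   #current state of the strategy
  up = 0  #completed upcrossings
  for (n in 1:length(X)) {
    if (C == 0 && X[n] < a) C = 1    #start holding below a
    else if (C == 1 && X[n] > b) {   #sell above b: one upcrossing
      C = 0
      up = up + 1
    }
  }
  up
}
X = c(0, cumsum(sample(c(-1,1), 1000, replace=TRUE)))
count.upcrossings(X, a=-2, b=2)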
Definition 3.2.1 Let the number of upcrossings U_N[a, b](ω) be the number of times
that X goes from below a to above b in time N.
Since Y_N totals the amount gained in U_N upcrossings, plus an amount at the
end if C_N = 1, we can say that
Lemma 3.2.2 Y_N(ω) ≥ (b − a)U_N[a, b](ω) − [X_N(ω) − a]⁻, where [n]⁻ = |min(n, 0)|.
Lemma 3.2.3 (Doob's Upcrossing Lemma) Let X be a supermartingale and U_N[a, b]
be the number of upcrossings of [a, b] by time N. Then
(b − a)E U_N[a, b] ≤ E([X_N − a]⁻).
Proof: Y = C • X is also a supermartingale, so E(Y_N) ≤ 0. The result follows by taking
expectations in Lemma 3.2.2. □
Corollary 3.2.4 Let X be a supermartingale bounded in L¹, a, b ∈ R such that a < b,
and U∞[a, b] = lim_{N↑∞} U_N[a, b]. Then
$$(b-a) E\,U_\infty[a, b] \leq |a| + \sup_n E|X_n| < \infty,$$
and so
P(U∞[a, b] = ∞) = 0.
Conversely, if U∞[a, b] = ∞ for some a, b, then X is not bounded in L¹.
Proof: By Lemma 3.2.3, (b − a)E U_N[a, b] ≤ |a| + E|X_N| ≤ |a| + sup_n E|Xn|. We then
let N ↑ ∞ using the Monotone Convergence Theorem (Theorem 2.1.3). □
Theorem 3.2.5 (Doob's Forward Convergence Theorem) For a
supermartingale X bounded in L¹, X∞ = lim_{n↑∞} Xn almost surely exists and is finite.
Definition 3.2.6 We set X∞(ω) = lim sup Xn(ω) ∀ω. X∞ is then F∞-measurable,
and X∞ = lim Xn almost surely.
Corollary 3.2.7 If X is a non-negative supermartingale, then
E|Xn| = E(Xn) ≤ E(X0),
so X is bounded in L¹ and X∞ exists almost surely.
3.3 Further Results
The most important point in this section is Theorem 3.3.2. From the theorems above,
we often need to show that a martingale is bounded in L¹. Often the easiest method
for proving this is to show that the martingale is bounded in L²: the result then follows
by taking part i) of the Schwarz Inequality (Theorem 2.1.14) with Y = 1.
3.3.1 Orthogonality of Increments
Let X be a martingale in L², and s, t, u, v ∈ Z+ such that s ≤ t ≤ u ≤ v. We have that
E(Xv|Fu) = Xu almost surely, so Xv − Xu is orthogonal to L²(Fu) from Section 2.2.1.
In particular,
$$E((X_t - X_s)(X_v - X_u)) = E\big(E((X_t - X_s)(X_v - X_u) \mid F_t)\big) = E\big((X_t - X_s) E(X_v - X_u \mid F_t)\big) = E((X_t - X_s)(X_t - X_t)) = 0,$$
so we can express Xn as a sum of orthogonal terms,
$$X_n = X_0 + \sum_{k=1}^{n} (X_k - X_{k-1}).$$
By Pythagoras' Theorem (Definition 2.1.16), we thus have
$$E(X_n^2) = E(X_0^2) + \sum_{k=1}^{n} E[(X_k - X_{k-1})^2].$$
Theorem 3.3.1 [8, Chapter 12] Let Xn ∈ L² be a martingale. Then Xn is bounded
in L² if and only if
$$\sum E[(X_k - X_{k-1})^2] < \infty,$$
and in this case Xn → X∞ almost surely and in L².
Proof: Follows immediately from the above expression of E(Xn²) as a sum of orthogonal terms.
Given that the martingale is thus bounded in L¹, Doob's Convergence Theorem (Theorem 3.2.5) then shows that X∞ = lim Xn exists almost surely. □
3.3.2 Lévy's Upward Theorem
Theorem 3.3.2 (Lévy's Upward Theorem) Let X ∈ L¹(Ω, F, P), and let
Yn = E(X|Fn) almost surely. Then Yn → E(X|F∞) almost surely and in L¹.
Loosely speaking, this means that if we have an expectation of a process
given information up to time n, then that expectation will tend to the expectation
given all the information we will ever have.
Chapter 4
Filtering
We can now discuss filtering, the theory of estimating the value of unknown quantities
we can only measure with noise. The main method used here is an expansion of Bayes’
formula, given in the first section.
4.1 Bayes' Formula For Bivariate Normal Distributions
Notation 4.1.1 We write the conditional probability
P(B|A) = P(AB)/P(A)
as C_A(B).
We also have C_B(A) = P(AB)/P(B), and so we can write
C_B(A) = C_A(B)P(A)/P(B).
This is Bayes' theorem, and is often used to "update" the estimate of P(A) after
observing event B. In this case we often call P(A) the prior probability, and C_B(A)
the posterior probability. The recursive property of conditional probabilities is then
written as
$$C_{A,B,C}(D) = C_{A,B}(D|C) = \frac{C_{A,B}(CD)}{C_{A,B}(C)}.$$
When updating estimated probabilities as described above, this lets us generalise Bayes'
theorem to cases where we have a series of observations.
4.1.1 Recursive Property in Probability Distribution Functions
Take random variables X, Y, Z, T with joint pdf f_{X,Y,Z,T} : R⁴ → R, so that for B ∈ B⁴,
$$P\{(X, Y, Z, T) \in B\} = \int_B f_{X,Y,Z,T}(x, y, z, t) \, dx \, dy \, dz \, dt.$$
We then have $f_{X,Y,Z}(x, y, z) = \int_{\mathbb{R}} f_{X,Y,Z,T}(x, y, z, t) \, dt$. Then the conditional pdf of T is
$$f_{T|X,Y,Z}(t|x, y, z) = \frac{f_{X,Y,Z,T}(x, y, z, t)}{f_{X,Y,Z}(x, y, z)},$$
and the recurrence property is written as
$$f_{T|X,Y,Z} = (f_{T|Z})|_{X,Y} = \frac{f_{T,Z|X,Y}}{f_{Z|X,Y}}.$$
Example 4.1.2 For two random variables X, Y with joint pdf f_{X,Y} on R² [8],
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)} \propto f_X(x) f_{Y|X}(y|x),$$
with the constant of proportionality determined by $\int_{\mathbb{R}} f_{X|Y}(x|y) \, dx = 1$.
4.1.2 Bayes' Formula
Lemma 4.1.3 (Bayes' Formula) [8, Section 15.7] Let µ, a, b ∈ R, U, W ∈ (0, ∞),
and X, Y be random variables such that
L(X) = N(µ, U), C_X(Y) = N(a + bX, W),
where N(µ, σ²) is a normal distribution as described in Example 2.1.23. Then C_Y(X) =
N(Z, V), where V ∈ (0, ∞) and Z are such that
$$\frac{1}{V} = \frac{1}{U} + \frac{b^2}{W}, \qquad \frac{Z}{V} = \frac{\mu}{U} + \frac{b(Y-a)}{W}.$$
Proof: X has distribution function $f_X(x) = (2\pi U)^{-1/2} e^{-\frac{(x-\mu)^2}{2U}}$, and Y has conditional distribution function $f_{Y|X}(y|x) = (2\pi W)^{-1/2} e^{-\frac{(y-a-bx)^2}{2W}}$. From Example 4.1.2,
f_{X|Y}(x|y) ∝ f_X(x) f_{Y|X}(y|x), so
$$\log f_{X|Y}(x|y) = c_1(y) - \frac{(x-\mu)^2}{2U} - \frac{(y-a-bx)^2}{2W}$$
$$= c_1(y) - \frac{1}{2UW}\left( W(x^2 - 2\mu x + \mu^2) + U(b^2x^2 - 2b(y-a)x + f(y)) \right)$$
$$= c_1(y) - \frac{1}{2UW}\left( (W + b^2U)x^2 - 2(\mu W + b(y-a)U)x + f(y) \right)$$
$$= c_2(y) - \frac{W + b^2U}{2UW}\left( x - \frac{\mu W + b(y-a)U}{W + b^2U} \right)^2 = c_2(y) - \frac{(x-z)^2}{2V},$$
where c1(y), c2(y) are terms independent of x, f(y) collects the remaining terms in y, and
$$\frac{1}{V} = \frac{W + b^2U}{UW} = \frac{1}{U} + \frac{b^2}{W}, \qquad \frac{z}{V} = \frac{\mu W + b(y-a)U}{UW} = \frac{\mu}{U} + \frac{b(y-a)}{W}.$$
Thus, f_{X|Y}(x|y) has the normal distribution given above. □
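A minimal R sketch of this update rule (an illustrative helper function, not one of the report's programs):

#One-step Bayes update from Lemma 4.1.3: prior X ~ N(mu, U) and
#observation Y|X ~ N(a + bX, W) give posterior X|Y ~ N(Z, V)
bayes.update = function(mu, U, a, b, Y, W) {
  V = 1/(1/U + b^2/W)           #1/V = 1/U + b^2/W
  Z = V*(mu/U + b*(Y - a)/W)    #Z/V = mu/U + b(Y - a)/W
  c(mean=Z, var=V)
}
bayes.update(0, 1, 0, 1, 2, 1)  #prior N(0,1), Y = X + noise, Y = 2: posterior N(1, 1/2)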
4.2 Single Random Variable
4.2.1 System Model and Filter
We can now apply the above theory of conditional probabilities to the case of a
single random variable, where we can only get observations of the variable with random
noise. Specifically, we suppose we have independent random variables X, ν1, ν2, . . . with
normal distributions,
L(X) = N(0, σ²), L(νk) = N(0, c_k²).
We then say we can only observe X through the series of observations Y_k = X + ν_k, and
set F_n = σ(Y1, Y2, . . . , Yn).
Notation 4.2.1 We write the conditional law C_{Fn}(X), defined in Notation 4.1.1,
as C_n(X), with C_0(X) = N(0, σ²). Then C_N(X) is the estimate of X after N
observations.
Say we have determined that C_{n−1}(X) = N(Z_{n−1}, V_{n−1}). Then Y_n = X + ν_n gives
C_{n−1}(Y_n|X) = N(X, c_n²),
so we can use Bayes' formula (Lemma 4.1.3), with
µ = Z_{n−1}, U = V_{n−1}, a = 0, b = 1, W = c_n²,
to obtain the recursion [8, Section 15.8, with misprint as if U = V_n]
C_n(X) = C_{n−1}(X|Y_n) = N(Z_n, V_n),
$$\frac{1}{V_n} = \frac{1}{V_{n-1}} + \frac{1}{c_n^2}, \qquad \frac{Z_n}{V_n} = \frac{Z_{n-1}}{V_{n-1}} + \frac{Y_n}{c_n^2}.$$
Since C_0(X) = N(0, σ²) = N(Z_0, V_0), we have shown that C_n(X) = N(Z_n, V_n) ∀n.
Lemma 4.2.2 The above estimate mean, Z_n, is a martingale, and Z_n → Z∞ = X
almost surely.
Proof: We first show Z_n is a martingale by checking the three conditions in Definition 3.1.7.
i) We have Z_n = E(X|F_n), so Z_n is clearly adapted.
ii) By Jensen's Inequality (Theorem 2.1.11) with c(x) = x², applied to the inner expectation,
E(X|F_n)² ≤ E(X²|F_n), so |Z_n|₂² = E(E(X|F_n)²) ≤ E(X²) = σ². Z_n is thus bounded
in L², and so is also bounded in L¹.
iii) E(Z_n|F_m) = E(E(X|F_n)|F_m) = E(X|F_m) = Z_m ∀m < n.
So Z_n is a martingale with E(X − Z_n)² = V_n, bounded in L². We thus know from
Lévy's Upward Theorem (Theorem 3.3.2) that Z_n = E(X|F_n) → Z∞ = E(X|F∞)
almost surely and in L². We'd like to show that our mean estimate of X tends towards
X as the number of observations tends to infinity, i.e. that Z∞ = X almost surely,
so we also need the variance V_n ↓ 0. From the recurrence formula for V_n, we have
$$V_n = \left( \sigma^{-2} + \sum_{k=1}^n c_k^{-2} \right)^{-1};$$
we can thus say that Z∞ = X almost surely if and only if $\sum c_k^{-2} = \infty$. □

4.2.2 R source code: Filtering a Single Value
This script takes a normally-distributed variable and noisy measurements of it, calculates the estimate of the variable as the measurements are taken, then plots the
variable, the estimate’s mean and standard deviation, and the measurements.
#Script for creating noisy measurements of a value, and giving
#an estimate of the variable by filtering

##Calculate the estimated mean and variance of possible values
create.estimates = function(Y, signalmean, signalsd, noisesd) {
  n = length(Y)
  noisesd = matrix(noisesd, n, 1)
  est = array(NA, dim=c(length(Y)+1, 2))
  ##the columns hold Z/V and 1/V while accumulating, so initialise
  ##with the prior mean and variance in that form
  est[1,] = c(signalmean/signalsd^2, 1/signalsd^2)
  for(step in 1:n) {
    est[step+1,2] = est[step,2] + 1/(noisesd[step]^2)
    est[step+1,1] = est[step,1] + Y[step]/(noisesd[step]^2)
  }
  est[,2] = 1/est[,2]
  est[,1] = est[,1]*est[,2]
  est[,2] = sqrt(est[,2])
  colnames(est) = c("mean", "sd")
  est
}

draw.filter = function(signal, Y, est) {
  ##Calculate one standard deviation to each side of the estimates
  estplus = est[,1]+est[,2]
  estminus = est[,1]-est[,2]
  plot(c(0,length(Y)),
       range(c(est[,1], estplus, estminus, signal-0.1, signal+0.1)),
       type="n", xlab="Measurements", ylab="Estimate")
  lines(0:length(Y), est[,1])
  lines(0:length(Y), estplus, lty=2)
  lines(0:length(Y), estminus, lty=2)
  abline(h=signal, lty=3)
  points(1:length(Y), Y, pch="+")
}

create.filter = function(n, noisesd, signalmean=0, signalsd=1) {
  signal = rnorm(1, signalmean, signalsd)
  observations = rnorm(n, signal, noisesd)
  est = create.estimates(observations, signalmean, signalsd, noisesd)
  draw.filter(signal, observations, est)
}

[Figure 4.1: Example graphic for measuring a random variable in Program 4.2.2, with mean 0 and variance 1, with noise of variance 1.]
[Figure 4.2: Standard deviation of the estimate in Figure 4.1 over time.]
[Figure 4.3: Example graphic for measuring a random variable in Program 4.2.2, with mean 0 and variance 1, with noise of variance c_n² = 1/n².]
[Figure 4.4: Standard deviation of the estimate in Figure 4.3 over time.]
4.3 Series of Variables; Kalman Filter
4.3.1 System Model and Filter
We can use the same recursion for a series of variables. For the Kalman Filter, we
assume that the signal variable (X_n)n∈N follows a disturbed linear recursion,
$$X_n - X_{n-1} = A_n X_{n-1} + g_n + \nu_n,$$
where g_n is known at time n − 1 – so it is pre-visible – and L(ν_n) = N(0, H_n²), and
that the observations Y_n follow the recursion
$$Y_n - Y_{n-1} = C_n X_n + \epsilon_n,$$
where C_n is known, and L(ε_n) = N(0, K_n²). We also assume that C_n ≠ 0 – otherwise
the observation has no relation to the signal – and that K_n ≠ 0, since K_n = 0 would
mean our observations are perfectly accurate and we don't need to filter.
g_n is required to be pre-visible for many applications of the filter in optimization, since
pre-visibility means g_n can be some controllable variable used to optimize to certain
conditions [6]; see Section 4.3.3 for an example of this.
We express the equations in terms of X_n − X_{n−1} and Y_n − Y_{n−1}, rather than X_n and
Y_n, because these recurrence relations are often used as an approximation to stochastic
differential equations like dX_t = A_t X_t dt + dg_t + H_t dB_t and dY_t = C_t X_t dt + K_t dB̃_t, where
B, B̃ are standard Brownian motions, approximated by (B_n − B_{n−1}), (B̃_n − B̃_{n−1}) ∼
N(0, 1). These are used in the Kalman-Bucy filter, the continuous-time equivalent of
the Kalman filter; for more information see [4].
For X_0 we again have an initial mean and variance, as in Section 4.2.1. Since we only
look at the difference Y_1 − Y_0, the value of Y_0 is arbitrary, but to equate the process to
a continuous one we usually take Y_0 = 0 by convention.
We now have equations
$$C_{n-1}(X_n) = N(\alpha_n Z_{n-1} + g_n,\ \alpha_n^2 V_{n-1} + H_n^2), \qquad C_{n-1}(Y_n|X_n) = N(Y_{n-1} + C_n X_n,\ K_n^2),$$
where α_n = A_n + 1. As before, we suppose that C_{n−1}(X_{n−1}) = N(Z_{n−1}, V_{n−1}).
We then apply Lemma 4.1.3, with
µ = α_n Z_{n−1} + g_n, U = α_n² V_{n−1} + H_n², a = Y_{n−1}, b = C_n, W = K_n²,
to obtain the recursion
$$\frac{1}{V_n} = \frac{1}{\alpha_n^2 V_{n-1} + H_n^2} + \frac{C_n^2}{K_n^2}, \qquad \frac{Z_n}{V_n} = \frac{\alpha_n Z_{n-1} + g_n}{\alpha_n^2 V_{n-1} + H_n^2} + \frac{C_n(Y_n - Y_{n-1})}{K_n^2}.$$
Lemma 4.3.1 In the case where H_n, C_n, K_n, A_n, g_n are constant in time, the scaled
estimate
$$M_n = \begin{cases} \alpha^{-n}\left(Z_n - \frac{g}{1-\alpha}\right) & \alpha \neq 0, 1, \\ Z_n - gn & \alpha = 1, \\ Z_n & \alpha = 0, \end{cases}$$
is a martingale.
Proof: The case where α = 0 is obvious, since M_n = g; otherwise, for m < n,
$$E(Z_n|F_m) = \alpha E(Z_{n-1}|F_m) + g = \alpha^2 E(Z_{n-2}|F_m) + g + \alpha g = \ldots = \alpha^{n-m}Z_m + g\sum_{i=0}^{n-m-1} \alpha^i = \begin{cases} \alpha^{n-m}Z_m + g\,\frac{\alpha^{n-m}-1}{\alpha-1} & \alpha \neq 1, \\ Z_m + g(n-m) & \alpha = 1, \end{cases}$$
$$\therefore\ E\left(\alpha^{-n}\left(Z_n + \frac{g}{\alpha-1}\right)\Bigm|F_m\right) = \alpha^{-m}\left(Z_m + \frac{g}{\alpha-1}\right) \quad (\alpha \neq 1), \qquad E(Z_n - gn\,|\,F_m) = Z_m - gm \quad (\alpha = 1). \ \Box$$
α = 1. Theorem 4.3.2 Vn tends to a limit V∞ .
As before, we assume C, K 6= 0. We first look for a fixed point, 1/V∞ =
C2
α2 V∞ +H 2 + K 2 , then examine the stability of the fixed point with the following lemma:
Proof:
1
Lemma 4.3.3 For an equation xn+1 = f (xn ), the fixed point x∗ is locally stable if
(x∗ )
(x∗ )
| dfdx
| < 1, since f (x∗ + n ) ∼ f (x∗ ) + n dfdx
by Taylor series for small error n .
We rearrange the fixed-point equation to
V∞ = f (V∞ ) =
K 2 (α2 V∞ + H 2 )
,
K 2 + C 2 (α2 V∞ + H 2 )
2
2
2
)
and take Vn+1 = f (Vn ), where f (x) = K 2K+C(α2 (αx+H
2 x+H 2 ) , in the above Lemma.
The equation can also be rearranged to
2
α2 C 2 V∞
+ C 2 H 2 + K 2 (1 − α2 ) V∞ − K 2 H 2 = 0.
This quadratic equation can reduce to a linear equation depending on α, so we look at
the following distinct cases:
1. For α 6= 0 and H 6= 0, we have the full quadratic equation, with solution
V∞ =
(α2 − 1)K 2 − C 2 H 2 +
p
((α2 − 1)K 2 − C 2 H 2 )2 + 4a2 C 2 K 2 H 2
.
2α2 C 2
28
In this case we only take the positive root: since α, H 6= 0 the square root term
is larger than the rest, so the negative root is less than zero, and since variance
is positive this is impossible. The function derivative is
α2 K 2
α2 C 2 K 2 (α2 x + H 2 )
df (x)
= 2
−
2
dx
K + C 2 (α2 x + H 2 )
K 2 + C 2 (α2 x + H 2 )
=
α2 K 4
K 2 + C 2 (α2 x + H 2 )
2 .
2
For convergence we thus require α2 K 4 < K 2 + C 2 (α2 V∞ + H 2 ) , or
|α|K 2
2|α|K 2
K 2 + C 2 H 2 + α2 C 2 V∞
1 2
(α − 1)K 2 − C 2 H 2
< K 2 + C 2H 2 +
2
q
2
+ (α2 − 1)K 2 − C 2 H 2 + 4a2 C 2 K 2 H 2 bysolution,
√ < (α2 + 1)K 2 + C 2 H 2 + 2 . . . ,
<
√
∴ (|α| − 1)2 K 2 + C 2 H 2 + 2 . . . > 0.
The variance therefore always converges to V∞ , since all the terms are positive
and C 2 H 2 > 0.
2. For α 6= 0 and H = 0, the equation now has two fixed points, at V−∞ = 0 and
2
2
V+∞ = (α α−1)K
, and the convergence condition becomes
2C2
(|α| − 1)2 K 2 ± (α2 − 1)K 2 > 0.
For V−∞ this becomes (1 − |α|)K 2 > 0, and for V+∞ this becomes (α2 − |α|)K 2 >
0. The stability of these points thus depends on |α|: for |α| < 1, V−∞ = 0 is stable,
and V+∞ is both unstable and negative; for |α| > 1, the variance converges to
non-zero value V+∞ . and V−∞ = 0 is unstable; for |α| = 1, V−∞ = V+∞ = 0 is
stable. In summary, in this case we always have convergence.
3. For α = 0, the equation reduces to (C 2 H 2 + K 2 )V∞ − K 2 H 2 = 0, with solution
V∞ =
We have f (x) =
K2H2
K 2 +C 2 H 2 ,
so
df (x)
dx
K 2H 2
.
+ C 2H 2
K2
= 0. We thus always have stability.
In summary, the variance always converges to V∞ . 4.3.2
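A minimal R check of this convergence (illustrative, with arbitrary constants): iterating the variance recursion and comparing with the closed-form fixed point of case 1.

#Iterate 1/V_n = 1/(alpha^2 V_{n-1} + H^2) + C^2/K^2 and compare
#the result with the fixed point V_inf (illustrative sketch)
alpha = 1.5; H = 1; C = 1; K = 2
V = 10  #arbitrary starting variance
for (n in 1:100) V = 1/(1/(alpha^2*V + H^2) + C^2/K^2)
d = (alpha^2 - 1)*K^2 - C^2*H^2
Vinf = (d + sqrt(d^2 + 4*alpha^2*C^2*K^2*H^2))/(2*alpha^2*C^2)
c(V, Vinf)  #both approximately 2.49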
4.3.2 R source code: Kalman Filter
This takes a series of signals related by disturbed linear recursion, and noisy observations of them, then plots the signal series, the mean and standard deviation series
for the estimations, and the differences between the observations. The differences are
taken to keep the plots in the same area as the signal and observation series.
#Script for creating noisy measurements of a process, and giving
#an estimate of same by filtering

##Convert any possibly time-constant values into length-n arrays,
##collected in a list
convert.to.array = function(n, ...) {
  args = list(...)
  for(a in 1:length(args)) {
    args[[a]] = array(args[[a]], n)
  }
  args
}

create.signal = function(n, meanX0, sdX0, A, g, H) {
  signal = array(0, n+1)
  convert = convert.to.array(n, A+1, g, H)
  alpha = convert[[1]]
  g = convert[[2]]
  H = convert[[3]]
  signal[1] = rnorm(1, meanX0, sdX0)
  for(step in 1:n) {
    signal[step+1] = rnorm(1, alpha[step]*signal[step]+g[step],
                           H[step])
  }
  signal
}

create.observations = function(n, signal, C, K) {
  convert = convert.to.array(n, 0, C, K)
  observations = convert[[1]]
  C = convert[[2]]
  K = convert[[3]]
  observations[1] = rnorm(1, C[1]*signal[2], K[1])
  for(step in 1:(n-1)) {
    observations[step+1] = rnorm(1, C[step+1]*signal[step+2],
                                 K[step+1])
  }
  observations
}

##Calculate the estimated mean and variance of possible values
create.estimates = function(Y, meanX0, sdX0, A, g, H, C, K) {
  n = length(Y)
  convert = convert.to.array(n, A+1, g, H, C, K)
  alpha = convert[[1]]
  g = convert[[2]]
  H = convert[[3]]
  C = convert[[4]]
  K = convert[[5]]
  est = array(0, c(n+1, 2))
  est[1,] = c(meanX0, sdX0^2)
  for(step in 1:n) {
    ##variance update: 1/V_n = 1/(alpha^2 V_{n-1} + H^2) + C^2/K^2
    est[step+1,2] = 1/(1/((alpha[step]^2)*est[step,2]
                     + H[step]^2) + (C[step]/K[step])^2)
    ##mean update: Z_n/V_n = (alpha Z_{n-1} + g)/(alpha^2 V_{n-1} + H^2)
    ##             + C (Y_n - Y_{n-1})/K^2
    check = (alpha[step]*est[step,1]
             + g[step])/((alpha[step]^2)*est[step,2] + H[step]^2)
    check2 = C[step]*Y[step]/(K[step]^2)
    est[step+1,1] = (check+check2)*est[step+1,2]
  }
  est[,2] = sqrt(est[,2])
  colnames(est) = c("mean", "sd")
  est
}

draw.filter = function(signal, Y, est) {
  ##Calculate one standard deviation to each side of the estimates
  estplus = est[,1]+est[,2]
  estminus = est[,1]-est[,2]
  plot(c(0,length(Y)), range(c(est[,1], estplus, estminus, signal-0.1,
       signal+0.1, Y)), type="n", xlab="Measurements", ylab="Estimate")
  lines(0:length(Y), est[,1])
  lines(0:length(Y), estplus, lty=2)
  lines(0:length(Y), estminus, lty=2)
  lines(0:length(Y), signal, lty=3)
  points(1:length(Y), Y, pch="+")
}

create.filter = function(n, K, meanX0=0, sdX0=1, A=-0.1, g=0,
                         H=1, C=1) {
  signal = create.signal(n, meanX0, sdX0, A, g, H)
  observations = create.observations(n, signal, C, K)
  est = create.estimates(observations, meanX0, sdX0, A, g,
                         H, C, K)
  draw.filter(signal, observations, est)
}

[Figure 4.5: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(12,1).]
[Figure 4.6: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(20,1).]

[Figure 4.7: Example graphic for measuring a random process in Program 4.3.2 with default values and noise variance Kn² = 1/n², create.filter(12,1/(1:12)).]

4.3.3 Example: Moving on a Line
Consider the optimality problem [6, Section 11.4] of an object moving on the line R with controllable velocity gn at each timestep, where we can only measure the position of the object with noise, and we must choose gn, based only on Fn−1 = {Y0, Y1, . . . , Yn−1}, to minimize
\[
E\Big(\sum_{n=0}^{N-1} g_n^2 + DX_N^2\Big)
\]
for a finite stopping time N and some D. Specifically,
we have a system
\[
X_{n+1} = X_n + g_n, \qquad Y_n = X_n + \epsilon_n,
\]
where L(ε_n) = N(0, 1). We thus have a Kalman filter, with
\[
\alpha = 1, \quad g = g_{n-1}, \quad H_n^2 = 0, \quad C = 1, \quad K_n^2 = 1.
\]
Our position estimate is then N(Zn, Vn) with
\[
\frac{1}{V_n} = \frac{1}{V_{n-1}} + 1, \qquad \frac{Z_n}{V_n} = \frac{Z_{n-1} + g_{n-1}}{V_{n-1}} + Y_n,
\]
\[
\therefore\ V_n = \frac{V_{n-1}}{V_{n-1} + 1} = \frac{V_0}{nV_0 + 1},
\]
\[
Z_n = \frac{Y_nV_{n-1} + (Z_{n-1} + g_{n-1})}{V_{n-1} + 1} = \frac{Y_nV_0 + (Z_{n-1} + g_{n-1})([n-1]V_0 + 1)}{nV_0 + 1}.
\]
In the absence of other information, we can use the above recursion by assuming we have no information at time n = 0, i.e. Z0 = z, 1/V0 = 0. Then
\[
Z_1 = Y_1, \quad V_1 = 1, \quad Z_n = \frac{Y_n + (n-1)(Z_{n-1} + g_{n-1})}{n}, \quad V_n = \frac{1}{n}.
\]
Let F(Z_k, V_k, k) = E(\sum_{n=k}^{N-1} g_n^2 + DX_N^2 \mid \{Z_k, V_k, Y_k\} = G_k). Then
\[
F(Z_k, V_k, k) = g_k^2 + E(F(Z_{k+1}, V_{k+1}, k+1) \mid G_k, g_k),
\]
\[
F(Z_N, V_N, N) = E(DX_N^2 \mid F_N) = D\,E(X_N^2 \mid G_N) = D\big(E(X_N^2 - Z_N^2 \mid G_N) + Z_N^2\big) = D(V_N + Z_N^2).
\]
Suppose we can write F(Z_{k+1}, V_{k+1}, k+1) = A_{k+1}Z_{k+1}^2 + B_{k+1}. Then we have
\[
\begin{aligned}
F(Z_k, V_k, k) &= g_k^2 + E(A_{k+1}Z_{k+1}^2 + B_{k+1} \mid G_k, g_k)\\
&= g_k^2 + A_{k+1}E\left(\left(\frac{Y_{k+1}V_k + Z_k + g_k}{V_k + 1}\right)^2 \,\middle|\, G_k, g_k\right) + B_{k+1}\\
&= g_k^2 + A_{k+1}E\left(\left(\frac{V_kX_k + Z_k + V_k\epsilon_{k+1} + (V_k + 1)g_k}{V_k + 1}\right)^2 \,\middle|\, G_k, g_k\right) + B_{k+1}\\
&= (A_{k+1} + 1)g_k^2 + 2A_{k+1}Z_kg_k + A_{k+1}Z_k^2 + \frac{A_{k+1}V_k^2}{V_k + 1} + B_{k+1}.
\end{aligned}
\]
Minimizing over g_k gives g_k = -\frac{A_{k+1}}{A_{k+1}+1}Z_k, so
\[
F(Z_k, V_k, k) = \frac{A_{k+1}}{A_{k+1}+1}Z_k^2 + B_{k+1} + \frac{A_{k+1}V_k^2}{V_k + 1} = A_kZ_k^2 + B_k,
\]
where
\[
A_k = \frac{A_{k+1}}{A_{k+1}+1} = \frac{D}{1 + (N-k)D},
\]
\[
\begin{aligned}
B_k &= B_{k+1} + \frac{A_{k+1}V_k^2}{V_k + 1} = B_{k+1} + \frac{DV_0^2}{(1 + (N-k-1)D)(1 + kV_0)(1 + (k+1)V_0)}\\
&= \frac{DV_0}{1 + NV_0} + DV_0^2\sum_{i=k}^{N-1}\frac{1}{(1 + iV_0)(1 + (i+1)V_0)(1 + (N-i-1)D)}.
\end{aligned}
\]
Thus, the optimality problem has the solution g_k = −DZ_k/(1 + (N − k)D), so the expression for Z_k becomes
\[
\begin{aligned}
Z_k &= \frac{Y_kV_0 + ([k-1]V_0 + 1)\frac{1 + (N-k)D}{1 + (N-k+1)D}Z_{k-1}}{kV_0 + 1}\\
&= \frac{1 + (N-k)D}{kV_0 + 1}\left(\frac{(k-1)V_0 + 1}{1 + (N-k+1)D}Z_{k-1} + \frac{V_0}{1 + (N-k)D}Y_k\right)\\
&= \frac{1 + (N-k)D}{kV_0 + 1}\left(\frac{Z_0}{1 + ND} + \sum_{i=1}^{k}\frac{V_0}{1 + (N-i)D}Y_i\right),
\end{aligned}
\]
and the expected final cost at time k is
\[
F(Z_k, V_k, k) = \frac{D}{1 + (N-k)D}Z_k^2 + \frac{DV_0}{1 + NV_0} + DV_0^2\sum_{i=k}^{N-1}\frac{1}{(1 + iV_0)(1 + (i+1)V_0)(1 + (N-i-1)D)}.
\]
4.3.4 R Source Code: Movement Problem
This gives examples of the results of the problem in Example 4.3.3, plotting the current
position, the mean and standard deviation of the estimated position, and the observed
position.
#Script for the movement problem
##Create arrays and starting values
create.start = function(n, meanX0, sdX0) {
signal = array(NA,n+1)
signal[1] = rnorm(1,meanX0,sdX0)
signal
}
create.observations = function(signal) {
##create array for observations for time 0 to n-1
n=length(signal)-1
observations = array(NA,n)
observations[1] = rnorm(1,signal[1],1)
observations
}
create.estimates = function(Y, meanX0, sdX0) {
n = length(Y)
est = array(NA, c(n, 2))
if (sdX0 == Inf) {
est[1,] = c(Y[1], 1)
}
else {
est[1,] = c((meanX0 + Y[1]*sdX0^2)/(1+sdX0^2),
sdX0^2/(1+sdX0^2))
}
colnames(est) = c("mean", "sd")
est
}
##Calculate values in next time step
create.move = function(k,n,signal,est,D) {
if (D==Inf) {g=-est[k,1]/(n-k+1)} else {
g = -D*est[k,1]/(1+(n-k+1)*D)
}
signal[k+1] = signal[k] + g
signal
}
update.observations = function(k,observations,signal) {
observations[k] = rnorm(1,signal[k],1)
observations
}
update.estimates = function(k,est,signal,Y,D) {
n = length(Y)
if (D==Inf) {g=-est[k,1]/(n-k+1)} else {
g = -D*est[k,1]/(1+(n-k+1)*D)
}
est[k+1,2] = est[k,2]/(est[k,2] + 1)
check = (est[k,1] + g)/est[k,2]
est[k+1,1] = (check+Y[k+1])*est[k+1,2]
est
}
draw.filter = function(signal,Y,est) {
##Calculate one standard deviation to each side of the estimates
estplus = est[,1]+est[,2]
estminus = est[,1]-est[,2]
plot(c(0,length(Y)), range(c(est[,1],estplus,estminus,signal,Y,0)),
type="n", xlab="Measurements", ylab="Estimate")
lines(0:(length(Y)-1),est[,1])
lines(0:(length(Y)-1),estplus,lty=2)
lines(0:(length(Y)-1),estminus,lty=2)
lines(0:length(Y),signal,lty=3)
points(0:(length(Y)-1),Y,pch="+")
}
create.filter = function(n, meanX0=0, sdX0=1, D=1) {
if (sdX0 == Inf) {
##using Inf normally would give "Not a Number" errors
signal = create.start(n,meanX0,0)
}
else signal = create.start(n, meanX0, sdX0)
observations = create.observations(signal)
est = create.estimates(observations, meanX0, sdX0)
for(k in 1:(n-1)) {
signal = create.move(k,n,signal,est,D)
observations = update.observations(k+1,observations,signal)
est = update.estimates(k,est,signal,observations,D)
}
signal = create.move(n,n,signal,est,D)
est[,2] = sqrt(est[,2])
draw.filter(signal, observations, est)
}
[Figure 4.8: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12).]
[Figure 4.9: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12).]
[Figure 4.10: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D = ∞, create.filter(12,D=Inf).]
[Figure 4.11: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D = 0, create.filter(12,D=0), obviously resulting in zero movement.]
[Figure 4.12: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 10 with infinite starting variance, create.filter(12,meanX0=10,sdX0=Inf).]
[Figure 4.13: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 0 with infinite starting variance, create.filter(12,sdX0=Inf); we then still have movement due to the inaccurate observations.]
4.3.5 Extension for Multiple Processes and Observations
The above form of the Kalman filter is satisfactory, but it causes problems if we have more than one variable, or more than one observation: in either case we also need to consider covariances between the different approximations, so we need a more sophisticated model. For this reason, the literature on the Kalman filter usually writes the equations in matrix form; we therefore extend the basic Kalman filter above to the case where we have v variables we wish to approximate and m observations.
The aforementioned literature usually splits the Kalman filter into two steps: the time update, or prediction step, and the measurement update, or correction step [7]. This makes the calculations more digestible, especially when using matrices, so we first show that our solution for the simple case can be written in this form.
Rewriting the Simple Case
We first write our solution in terms of Vn and Zn instead of 1/Vn and Zn/Vn:
\[
\frac{1}{V_n} = \frac{1}{\alpha_n^2V_{n-1} + H_n^2} + \frac{C_n^2}{K_n^2} = \frac{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
V_n = \frac{K_n^2(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
\frac{Z_n}{V_n} = \frac{\alpha_nZ_{n-1} + g_n}{\alpha_n^2V_{n-1} + H_n^2} + \frac{C_n(Y_n - Y_{n-1})}{K_n^2} = \frac{K_n^2(\alpha_nZ_{n-1} + g_n) + C_n(Y_n - Y_{n-1})(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
Z_n = \frac{K_n^2(\alpha_nZ_{n-1} + g_n) + C_n(Y_n - Y_{n-1})(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)}.
\]
These expressions, and those that follow, frequently use the terms E(Xn | Fn−1) = αnZn−1 + gn and E(Vn | Fn−1) = αn²Vn−1 + Hn²: for convenience, we write these as Z̃n and Ṽn respectively.
We then write the equation for Zn as
\[
Z_n = \frac{K_n^2\tilde Z_n + C_n(Y_n - Y_{n-1})\tilde V_n}{K_n^2 + C_n^2\tilde V_n} = \tilde Z_n + \frac{C_n\tilde V_n(Y_n - Y_{n-1} - C_n\tilde Z_n)}{K_n^2 + C_n^2\tilde V_n} = \tilde Z_n + \phi_nr_n,
\]
where φn = CnṼn/(Kn² + Cn²Ṽn) is previsible, and is often called the gain, blending, or Kalman gain factor [1, 7]; it can be thought of as an indication of how much we value the new information compared to our old estimate. Additionally, the term rn = Yn − Yn−1 − CnZ̃n = Yn − E(Yn | Fn−1) is the difference between the actual and expected measurement Yn, and is often called the residual or innovation [7].
The gain factor has the property Cnφn + Kn²/(Kn² + Cn²Ṽn) = 1, so we can write the system as
\[
\tilde Z_n = \alpha_nZ_{n-1} + g_n, \quad \tilde V_n = \alpha_n^2V_{n-1} + H_n^2, \quad \phi_n = \frac{C_n\tilde V_n}{K_n^2 + C_n^2\tilde V_n};
\]
\[
r_n = Y_n - Y_{n-1} - C_n\tilde Z_n, \quad Z_n = \tilde Z_n + \phi_nr_n, \quad V_n = (1 - C_n\phi_n)\tilde V_n.
\]
Matrix Form
Say we wish to estimate v variables from m observations. The variables can now depend on each other, so for the ith variable Xi,n we have dynamics equation
\[
X_{i,n} - X_{i,n-1} = \sum_{j=1}^{v} A_{i,j,n}X_{j,n-1} + g_{i,n} + \nu_{i,n},
\]
and for the ith observation Yi,n we have equation
\[
Y_{i,n} - Y_{i,n-1} = \sum_{j=1}^{v} C_{i,j,n}X_{j,n} + \epsilon_{i,n}.
\]
In matrix form, these equations become
\[
x_n - x_{n-1} = A_nx_{n-1} + g_n + \nu_n, \qquad y_n - y_{n-1} = C_nx_n + \epsilon_n.
\]
Additionally, the independence conditions for x_n, ν_{i,n}, and ε_{i,n} are given by the equations [1, Section 7.1]
\[
E(\nu_nx_m^T) = 0\ \ \forall n > m, \qquad E(\epsilon_nx_m^T) = 0\ \ \forall n, m,
\]
\[
E(\epsilon_n\nu_m^T) = E(\nu_n\nu_m^T) = E(\epsilon_n\epsilon_m^T) = 0 \quad \forall n \ne m,
\]
where x^T denotes the usual transpose of a vector, and the distribution of the noise variables is given by the covariance matrix
\[
E\left(\begin{pmatrix}\nu_n\\ \epsilon_n\end{pmatrix}\begin{pmatrix}\nu_n\\ \epsilon_n\end{pmatrix}^T\right) = \begin{pmatrix}H_n^2 & J_n^2\\ (J_n^2)^T & K_n^2\end{pmatrix}.
\]
Usually the two noise variables are independent, so J_n^2 = 0.
If we assume that we have an estimate at time n − 1 of mean z_{n−1} and variance V_{n−1}, then we have
\[
E(x_n \mid F_{n-1}) = \alpha_nz_{n-1} + g_n, \qquad E(y_n \mid F_{n-1}) = y_{n-1} + C_n(\alpha_nz_{n-1} + g_n),
\]
where αn = An + I. Let ξn = xn − E(xn | Fn−1) = αn(xn−1 − zn−1) + νn, and χn = yn − E(yn | Fn−1) = Cn(xn − αnzn−1 − gn) + εn = Cn(αn(xn−1 − zn−1) + νn) + εn = Cnξn + εn. Let Ṽn = Var(ξn) = Hn² + αnVn−1αnᵀ; we can then write the covariance matrix
\[
\mathrm{Cov}\begin{pmatrix}\xi_n\\ \chi_n\end{pmatrix} = \begin{pmatrix}\tilde V_n & J_n^2 + \tilde V_nC_n^T\\ (J_n^2)^T + C_n\tilde V_n & K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T\end{pmatrix}.
\]
Lemma 4.3.4 If two random variables x, y are normally distributed with mean 0 and symmetric covariance matrix
\[
\mathrm{Cov}\begin{pmatrix}x\\ y\end{pmatrix} = \begin{pmatrix}V_{xx} & V_{xy}\\ V_{yx} & V_{yy}\end{pmatrix},
\]
where V_{yy} is non-singular, then
\[
E(x \mid y) = V_{xy}V_{yy}^{-1}y, \qquad \mathrm{Var}(x \mid y) = V_{xx} - V_{xy}V_{yy}^{-1}V_{yx}.
\]
Proof: The term [6] x − VxyVyy⁻¹y is linear in x, y, so is normally distributed. Additionally, E((x − VxyVyy⁻¹y)yᵀ) = 0, so it is independent of y, and VxyVyy⁻¹y is the conditional expectation of x given y. So we can say that E(x | y) = VxyVyy⁻¹y, and that
\[
\begin{aligned}
\mathrm{Cov}(x - V_{xy}V_{yy}^{-1}y \mid y) &= \mathrm{Cov}(x - V_{xy}V_{yy}^{-1}y)\\
&= E\big(xx^T - xy^TV_{yy}^{-T}V_{xy}^T - V_{xy}V_{yy}^{-1}yx^T + V_{xy}V_{yy}^{-1}yy^TV_{yy}^{-T}V_{xy}^T\big)\\
&= V_{xx} - V_{xy}V_{yy}^{-1}V_{yx} - V_{xy}V_{yy}^{-1}V_{yx} + V_{xy}V_{yy}^{-1}V_{yy}V_{yy}^{-1}V_{yx}\\
&= V_{xx} - V_{xy}V_{yy}^{-1}V_{yx}. \qquad \square
\end{aligned}
\]
ξn and χn have mean 0, so by the above lemma, and by the fact that E(ξn | χn) = zn − z̃n and Var(ξn | χn) = Vn, we can now say that
\[
z_n = \tilde z_n + (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}(y_n - E(y_n \mid F_{n-1})),
\]
\[
V_n = \tilde V_n - (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}((J_n^2)^T + C_n\tilde V_n).
\]
We have now found the gain factor and the innovation,
\[
\phi_n = (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}, \qquad r_n = y_n - E(y_n \mid F_{n-1}),
\]
and can more simply write the above as
\[
z_n = \tilde z_n + \phi_nr_n, \qquad V_n = \tilde V_n - \phi_n((J_n^2)^T + C_n\tilde V_n).
\]
We can now write the complete form of the filter,
\[
\tilde z_n = \alpha_nz_{n-1} + g_n, \qquad \tilde V_n = H_n^2 + \alpha_nV_{n-1}\alpha_n^T,
\]
\[
\phi_n = (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1},
\]
\[
r_n = y_n - y_{n-1} - C_n\tilde z_n, \qquad z_n = \tilde z_n + \phi_nr_n, \qquad V_n = \tilde V_n - \phi_n((J_n^2)^T + C_n\tilde V_n).
\]
Note: Usually Jn² = 0: this gives φn = ṼnCnᵀ(Kn² + CnṼnCnᵀ)⁻¹ and Vn = (I − φnCn)Ṽn, and we then have the same form as the simple case.
Example: Moving Under Gravity of Unknown Force
Say we are measuring the position of a particle moving under gravity with observation noise, where the particle begins at zero height, but the initial velocity W and the acceleration g due to gravity are not known exactly [5, Section 7.2, with a different solution method and noisy timesteps]. The current position, Xn, follows the recursion
\[
X_n = Wn - \tfrac{1}{2}gn^2 = X_{n-1} + W + g\big(\tfrac{1}{2} - n\big).
\]
The signal is then the vector of the position, initial velocity and acceleration, and follows the recursion
\[
x_n = \begin{pmatrix}X_n\\ W\\ g\end{pmatrix} = \begin{pmatrix}1 & 1 & \tfrac{1}{2} - n\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}x_{n-1} = \alpha_nx_{n-1}.
\]
Our only observation is of the current position, so we have measurement recursion
\[
y_n - y_{n-1} = \begin{pmatrix}1 & 0 & 0\end{pmatrix}x_n + \epsilon_n = C_nx_n + \epsilon_n,
\]
where εn ∼ N(0, Kn²). Say for timestep n − 1 we have estimate mean and variance
\[
z_{n-1} = \begin{pmatrix}z_{n-1}\\ W_{n-1}\\ g_{n-1}\end{pmatrix}, \qquad V_{n-1} = \begin{pmatrix}V_{n-1,1,1} & V_{n-1,1,2} & V_{n-1,1,3}\\ V_{n-1,2,1} & V_{n-1,2,2} & V_{n-1,2,3}\\ V_{n-1,3,1} & V_{n-1,3,2} & V_{n-1,3,3}\end{pmatrix};
\]
then we have gain factor
\[
\phi_n = \frac{1}{K_n^2 + C_n\tilde V_nC_n^T}\tilde V_nC_n^T = \frac{1}{K_n^2 + \tilde V_{n,1,1}}\tilde V_nC_n^T,
\]
and innovation
\[
r_n = y_n - y_{n-1} - \tilde z_n.
\]
The estimate mean then has form
\[
z_n = \tilde z_n + \phi_nr_n = \tilde z_n + \frac{1}{K_n^2 + \tilde V_{n,1,1}}\tilde V_nC_n^T(y_n - y_{n-1} - \tilde z_n).
\]
Looking at the components zn = Cnzn, Wn = (0 1 0)zn, and gn = (0 0 1)zn separately, we can see that
\[
z_n = \frac{K_n^2\tilde z_n + \tilde V_{n,1,1}(y_n - y_{n-1})}{K_n^2 + \tilde V_{n,1,1}}, \qquad
W_n = W_{n-1} + \frac{\tilde V_{n,2,1}(y_n - y_{n-1} - \tilde z_n)}{K_n^2 + \tilde V_{n,1,1}}, \qquad
g_n = g_{n-1} + \frac{\tilde V_{n,3,1}(y_n - y_{n-1} - \tilde z_n)}{K_n^2 + \tilde V_{n,1,1}}.
\]
Additionally the estimate variance has form
\[
V_n = (I - \phi_nC_n)\tilde V_n = \left(I - \frac{1}{K_n^2 + \tilde V_{n,1,1}}\begin{pmatrix}\tilde V_{n,1,1} & 0 & 0\\ \tilde V_{n,2,1} & 0 & 0\\ \tilde V_{n,3,1} & 0 & 0\end{pmatrix}\right)\tilde V_n.
\]
From this we can see that Vn,1,2, Vn,1,3 can only be non-zero if they are non-zero in the prior estimate Ṽn, and that the sub-matrix
\[
\begin{pmatrix}\tilde V_{n,2,2} & \tilde V_{n,2,3}\\ \tilde V_{n,3,2} & \tilde V_{n,3,3}\end{pmatrix}
\]
will stay the same if Ṽ0,1,2 = Ṽ0,1,3 = 0. However, in Ṽn = αnVn−1αnᵀ we have Ṽn,1,2 = Vn−1,1,2 + Vn−1,2,2 + (½ − n)Vn−1,3,2, and similarly for Ṽn,1,3, so unless we definitely know W or g already we will use all parts of the covariance matrix.
4.3.6 Comments
1. The dynamics for the progression of the signal, or the observation, can be nonlinear. In this case, the estimates are calculated by taking partial derivatives of
the recursion functions, as in a Taylor series: see [7] for more information. An
example would be a digital meter, with measurement noise dependent on the
size of X or Y as the meter switches between scales of measurement.
2. If either noise has a non-zero mean, we can simply adjust the noise by including
its mean in gn or E(Yn |Fn−1 ).
3. In the case Cn = 0, the observations have no dependence on Xn, so the signal is not observable, and the filter simply returns the predictions Zn = Z̃n and Vn = Ṽn. In the matrix form, this is equivalent to Cn being singular, and so at least one component of r n is not observable: see [1, 6] for more on observability.
4. The signal at time n, Xn , can also be approximated given Fm for m > n. This
is referred to as smoothing: for more information see [1, Chapter 9].
Appendix A
Appendices
Theories about probability, expectations, and so on are derived from the more abstract
field of measure theory. We therefore give a brisk summary of the important results
and definitions from the first few chapters of [8].
A.1 Measures
Definition A.1.1 For a set S and a collection Σ of subsets of S, we say Σ is a σ-algebra on S if
\[
S \in \Sigma, \qquad F \in \Sigma \Rightarrow F^c = S \setminus F \in \Sigma,
\]
and, for any sequence of subsets (F_n)_{n\in\mathbb{N}} with F_n ∈ Σ,
\[
\bigcup_n F_n \in \Sigma.
\]
We say F ∈ Σ is a Σ-measurable subset of S.
Definition A.1.2 We say σ(Σ) is the σ-algebra generated by Σ, where σ(Σ) is
the intersection of all σ-algebras with Σ as a subset.
Example A.1.3 (Borel σ-algebras) The Borel σ-algebra B(S) on the space S is generated by the family of open subsets of S. B(R), often written as just B, thus contains almost every subset of R met in practice, though not every subset of R. Similarly, B((0, 1]) is often written as B(0, 1].
Definition A.1.4 We say a function µ : Σ → [0, ∞] on a σ-algebra Σ of the set S is a measure if µ(∅) = 0, where ∅ is the empty set, and, for any sequence of disjoint subsets (F_n ∈ Σ)_{n\in\mathbb{N}},
\[
\mu\Big(\bigcup_{n\in\mathbb{N}} F_n\Big) = \sum_{n\in\mathbb{N}} \mu(F_n).
\]
Furthermore, it is a probability measure if µ(S) = 1.
Example A.1.5 The Lebesgue measure on B(0, 1], λ : B(0, 1] → [0, 1], has the form
\[
\lambda(a, b) = b - a, \qquad \lambda\Big(\bigcup_{n\in\mathbb{N}}(a_n, b_n)\Big) = \sum_{n\in\mathbb{N}}(b_n - a_n)
\]
for disjoint intervals (a_n, b_n). The Lebesgue measure is thus a general measure of length.
Definition A.1.6 We say I is a π-system on S if it is a family of subsets of S that is stable under finite intersections.
Theorem A.1.7 (Carathéodory's Extension Theorem) Let Σ be the σ-algebra generated by the algebra Σ0. Then, for a countably additive map µ0 : Σ0 → [0, ∞], there exists a measure µ : Σ → [0, ∞] such that µ = µ0 on Σ0. So, we can extend a measure to the generated σ-algebra.
Example A.1.8 The Lebesgue measure on B(0, 1] can be extended to B[0, 1] by saying
that λ{0} = 0. We can also extend to B, so we can measure length on the whole of R.
A.2 Events
Definition A.2.1 We say a probability triple is a measure space (Ω, F, P), where Ω is the sample space, ω ∈ Ω is a sample point, the σ-algebra F is the family of events – so that an event is an F-measurable subset of Ω – and P is a probability measure on (Ω, F).
Definition A.2.2 For a sequence of events (E_n)_{n\in\mathbb{N}},
\[
(E_n, \text{i.o.}) = (E_n \text{ infinitely often}) = \limsup E_n = \bigcap_m \bigcup_{n\ge m} E_n = \{\omega \mid \forall m\ \exists n(\omega) \ge m \text{ s.t. } \omega \in E_{n(\omega)}\} = \{\omega \mid \omega \in E_n \text{ for infinitely many } n\},
\]
\[
(E_n, \text{ev.}) = (E_n \text{ eventually}) = \liminf E_n = \bigcup_m \bigcap_{n\ge m} E_n = \{\omega \mid \exists m(\omega) \text{ s.t. } \omega \in E_n\ \forall n \ge m(\omega)\} = \{\omega \mid \omega \in E_n \text{ for all large } n\}.
\]
Theorem A.2.3 (Fatou’s Lemma) P(lim inf En ) ≤ lim inf P(En ).
Theorem A.2.4 (Reverse Fatou Lemma) For a finite measure P, P(lim sup En) ≥ lim sup P(En).
Theorem A.2.5 (First Borel–Cantelli Lemma) For events (E_n)_{n\in\mathbb{N}},
\[
\sum_n P(E_n) < \infty \Rightarrow P(\limsup E_n) = P(E_n, \text{i.o.}) = 0.
\]
A.3 Random Variables
Definition A.3.1 A function f : S → R is Σ-measurable if f⁻¹ : B → Σ. mΣ is the set of all Σ-measurable functions on S, and mΣ⁺ is the set of all non-negative elements of mΣ.
Definition A.3.2 For sample space Ω and σ-algebra F, a random variable X is an F-measurable function X : Ω → R, i.e. X⁻¹ : B → F, where B is as defined in Example A.1.3.
Definition A.3.3 For a random variable X, the law LX of X is LX = P ◦ X −1 ,
LX : B 7→ [0, 1]. Then LX is a probability measure on (R, B).
Definition A.3.4 For a random variable X, the distribution function of X is the function FX : R → [0, 1], where
\[
F_X(c) = \mathcal{L}_X\big((-\infty, c]\big) = P(X \le c) = P\{\omega \mid X(\omega) \le c\}.
\]
Theorem A.3.5 (Monotone-Class Theorem) Let H be a class of bounded functions S → R with the following conditions:
i) H is a vector space over R,
ii) the constant function 1 ∈ H,
iii) for non-negative functions (fn ∈ H)n∈N with fn ↑ f and f bounded, f ∈ H.
Then, if H contains the indicator function of every set in a π-system I, it contains every bounded σ(I)-measurable function on S.
A.4 Independence
Definition A.4.1 Sub-σ-algebras A_n of F are independent if, for all a_i ∈ A_i (i ∈ N) and distinct i_j for j = 1 to n,
\[
P(a_{i_1} \cap a_{i_2} \cap \dots \cap a_{i_n}) = \prod_{k=1}^{n} P(a_{i_k}).
\]
Definition A.4.2 Random variables X1 , X2 , . . . are independent if σ-algebras
σ(X1 ), σ(X2 ), . . .
are independent.
Definition A.4.3 Events E1 , E2 , . . . are independent if the σ-algebras E1 , E2 , . . . are
independent, where En is the σ-algebra {∅, En , Ω \ En , Ω}.
Theorem A.4.4 (Second Borel–Cantelli Lemma) For a sequence of independent events (E_n)_{n\in\mathbb{N}},
\[
\sum_n P(E_n) = \infty \Rightarrow P(E_n, \text{i.o.}) = P(\limsup E_n) = 1.
\]
Proof: We have
\[
(\limsup E_n)^c = \liminf E_n^c = \bigcup_m \bigcap_{n\ge m} E_n^c.
\]
We then have, by independence,
\[
P\Big(\bigcap_{n\ge m} E_n^c\Big) = \prod_{n\ge m}(1 - P(E_n)) \le \prod_{n\ge m} e^{-P(E_n)} = \exp\Big(-\sum_{n\ge m} P(E_n)\Big) = 0,
\]
using 1 − x ≤ e⁻ˣ and the divergence of the sum. So P(lim inf Enᶜ) = 0, and thus P(lim sup En) = 1. □

A.5 Integration
Definition A.5.1 We say that the Lebesgue integral of f with respect to the measure µ is µ(f) = ∫ f dµ, and that
\[
\mu(f, A) = \int_A f(s)\,\mu(ds) = \mu(fI_A), \qquad s \in S,\ A \in \Sigma.
\]
The integral is linear in f.
Definition A.5.2 f ∈ mΣ⁺ is simple if it can be written as a weighted sum of indicator functions, i.e. f = \sum_{k=1}^{m} a_kI_{A_k} for some a_k ≥ 0 and some A_k ∈ Σ. We then write f ∈ SF⁺.
We can assume that the A_k in the above definition are disjoint, since
\[
a_1I_{A_1} + a_2I_{A_2} = a_1I_{A_1\cap A_2^c} + (a_1 + a_2)I_{A_1\cap A_2} + a_2I_{A_1^c\cap A_2}.
\]
Definition A.5.3 For a subset A ∈ Σ, we define µ0(I_A) = µ(A) ≤ ∞, where µ0 is a naive integral defined for simple functions. For f ∈ SF⁺ we define
\[
\mu_0(f) = \sum_{k=1}^{m} a_k\mu(A_k) \le \infty.
\]
Definition A.5.4 For f ∈ mΣ⁺ we define µ(f) = sup{µ0(h) | h ∈ SF⁺, h ≤ f} ≤ ∞. So we can take the integral of a non-negative function as the supremum of the integrals of simple functions below it.
Theorem A.5.5 (Monotone-Convergence Theorem) For a sequence of functions (f_n ∈ mΣ⁺)_{n\in\mathbb{N}},
\[
f_n \uparrow f \Rightarrow \mu(f_n) \uparrow \mu(f) \le \infty.
\]
So the integral of fn tends to the integral of f.
Definition A.5.6 The rth staircase function a⁽ʳ⁾ : [0, ∞] → [0, ∞] is defined as
\[
a^{(r)}(x) = \begin{cases} 0 & x = 0,\\ (i-1)2^{-r} & (i-1)2^{-r} < x \le i2^{-r} \le r,\ i \in \mathbb{N},\\ r & x > r. \end{cases}
\]
The functions f⁽ʳ⁾ = a⁽ʳ⁾ ∘ f are simple functions, with f⁽ʳ⁾ ↑ f. By the Monotone-Convergence Theorem (Theorem A.5.5), we now have
\[
\mu(f) = \uparrow\lim_r \mu(f^{(r)}) = \uparrow\lim_r \mu_0(f^{(r)}).
\]
Since the a⁽ʳ⁾ are left-continuous, we also have fn ↑ f ⇒ a⁽ʳ⁾(fn) ↑ a⁽ʳ⁾(f).
Theorem A.5.7 (Fatou's Lemma) For (f_n ∈ mΣ⁺)_{n\in\mathbb{N}},
\[
\mu(\liminf f_n) \le \liminf \mu(f_n), \quad\text{i.e.}\quad \int \liminf f_n \le \liminf \int f_n.
\]
Theorem A.5.8 (Reverse Fatou's Lemma) For (f_n ∈ mΣ⁺)_{n\in\mathbb{N}} with fn ≤ g for some g ∈ mΣ⁺ with µ(g) < ∞,
\[
\mu(\limsup f_n) \ge \limsup \mu(f_n).
\]
Definition A.5.9 L¹(S, Σ, µ) is the set of µ-integrable functions, f ∈ mΣ such that µ(|f|) < ∞.
Notation A.5.10 We let f⁺(s) = max(f(s), 0), f⁻(s) = max(−f(s), 0). Then we have
\[
f = f^+ - f^-, \qquad |f| = f^+ + f^-.
\]
Note: This means that
\[
\int f\,d\mu = \mu(f) = \mu(f^+) - \mu(f^-), \qquad \int |f|\,d\mu = \mu(|f|) = \mu(f^+) + \mu(f^-).
\]
So we immediately get µ(f) ≤ µ(|f|), with equality if and only if f is non-negative almost everywhere.
Note: Since f⁺, f⁻ ∈ mΣ⁺, and the integral is linear, we can extend the definition of the integral µ0 to the set of measurable functions mΣ.
Theorem A.5.11 (Dominated-Convergence Theorem) Take functions fn ∈ mΣ with fn → f almost everywhere, and
\[
|f_n(s)| \le g(s)
\]
for some g ∈ L¹(S, Σ, µ)⁺. Then f ∈ L¹(S, Σ, µ), and µ(|fn − f|) → 0, so µ(fn) → µ(f).
Theorem A.5.12 (Scheffé’s Lemma) Take fn , f ∈ L1 (S, Σ, µ), with fn → f almost everywhere. Then
µ(|fn − f |) → 0 iff µ(|fn |) → µ(|f |).
Method A.5.13 (Standard Machine) A method for proving a linear result is true
for all functions in a space.
i) Show the result is true for indicator functions.
ii) By linearity, show the result is true for functions in SF + .
iii) Use the Monotone-Convergence Theorem to show the result is true for functions
in mΣ+ .
iv) Write h = h+ − h− and use linearity to show the result is true for measurable
functions.
Definition A.5.14 For an F-measurable function f, we define the measure fµ by fµ(A) = µ(f, A).
Definition A.5.15 A measure µ(A) = ∫ I_A f dν, also written µ = f ◦ ν, has density f relative to ν. We then write
\[
\frac{d\mu}{d\nu} = f.
\]
Lemma A.5.16 If dµ/dν = f, and g is an F-measurable function, then fg is also F-measurable, and
\[
\int g\,d\mu = \int gf\,d\nu.
\]
Proof: By the standard machine [Own Proof], Method A.5.13.
i) For g an indicator function I_A, ∫ I_A dµ = µ(A) = ∫ I_A f dν, by the definition of the density.
ii) For g a simple function, g = Σ_{k=1}^m g_kI_{A_k}, so ∫ g dµ = Σ g_k ∫ I_{A_k} dµ and ∫ gf dν = Σ g_k ∫ I_{A_k}f dν, which are equal by part i).
iii) For g a non-negative F-measurable function, we can define g as the limit of a sequence (g_n)_{n∈N} of simple functions by using the staircase function from Definition A.5.6. By part ii) we have µ(g_n, A) = ν(g_nf, A), so by the Monotone-Convergence Theorem (A.5.5), µ(g, A) = ν(gf, A), which is equivalent to ∫ gI_A dµ = ∫ gI_Af dν.
iv) For g an F-measurable function, we apply part iii) to g⁺ and g⁻. We then have the result by linearity: g = g⁺ − g⁻ ⇒ µ(g, A) = µ(g⁺, A) − µ(g⁻, A) = ν(g⁺f, A) − ν(g⁻f, A) = ν(gf, A). □
Corollary A.5.17 If ∫ f dν = 1, then µ is a probability measure.
Bibliography
[1] Donald E. Catlin. Estimation, Control, and the Discrete Kalman Filter. Springer,
1989.
[2] Roger Mansuy. Histoire de martingales. Mathématiques & Sciences Humaines,
(169):105–113, 2005.
[3] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized
Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[4] Bernt Øksendal. Stochastic Differential Equations. Springer, 2000.
[5] Albert Tarantola. Inverse Problem Theory. Society for Industrial and Applied
Mathematics, 2005.
[6] Richard Weber. Optimization and control. Lecture notes for the Optimization and
Control course at Cambridge, 2010.
[7] Greg Welch and Gary Bishop. An introduction to the Kalman filter. Technical
Report TR 95-041, University of North Carolina, Department of Computer Science,
July 2006. Introductory article that also discusses the case of nonlinear systems.
[8] David Williams. Probability with Martingales. Cambridge University Press, 1991.