Discrete-Time Martingales and the Kalman Filter
Mark Webster
Student Number: 200313814
Module Code: MATH5003M
Supervisor: Dr. J. Voss
May 5, 2011
Abstract
This report covers the material required to talk about martingales, an important type
of random process, and the Kalman Filter, a method for determining the value of a
process with noisy observations.
We begin with a definition of Lebesgue integration, and build up the probability theory
necessary to talk about expectation and conditional expectation. This is followed by
more in-depth descriptions of martingales, and some simulations in R. The final section
covers the basic theory of the Kalman filter, a discrete-time method for determining
the state of a process given noisy measurements and a dynamical model of said state.
The appendices contain a brisk coverage of the material required from measure theory.
Contents

Notation Table

1 Introduction

2 Expectation and Conditional Expectation
  2.1 Expectation
    2.1.1 Definition and Basic Theorems
    2.1.2 Variance and Inner Products
    2.1.3 Elementary Formula and Probability Density Functions
  2.2 Conditional Expectation
    2.2.1 The Fundamental Theorem
    2.2.2 Example

3 Martingales
  3.1 Definition
    3.1.1 Examples: Martingales and Markov Chains
    3.1.2 Stopping Times
    3.1.3 R source code: Random Walk
  3.2 The Convergence Theorem
  3.3 Further Results
    3.3.1 Orthogonality of Increments
    3.3.2 Lévy's Upward Theorem

4 Filtering
  4.1 Bayes' Formula For Bivariate Normal Distributions
    4.1.1 Recursive Property in Probability Distribution Functions
    4.1.2 Bayes' Formula
  4.2 Single Random Variable
    4.2.1 System Model and Filter
    4.2.2 R source code: Filtering a Single Value
  4.3 Series of Variables; Kalman Filter
    4.3.1 System Model and Filter
    4.3.2 R source code: Kalman Filter
    4.3.3 Example: Moving on a Line
    4.3.4 R Source Code: Movement Problem
    4.3.5 Extension for Multiple Processes and Observations
    4.3.6 Comments

A Appendices
  A.1 Measures
  A.2 Events
  A.3 Random Variables
  A.4 Independence
  A.5 Integration

Bibliography
Notation Table

Symbol        Meaning                                          Equivalent in sources
Ω             Set of possible events (A.2.1)                   Ω
ω             An event in Ω (A.2.1)                            ω
X, Y, Z       Random variable (A.3.2)                          X, Y, Z
Xn, Yn, Zn    Random process                                   Mn for martingales
Cn            Pre-visible process (3.1.15)                     Cn
C • X         Martingale transform (3.1.16)                    C • X
Lp            Lebesgue space (2.1.12)                          Lp
log           Logarithm to base e                              log, ln
s.t.          "such that"                                      "such that"
∃             "there exists"                                   N/A
∀             "for all"                                        ∀
|X|           Absolute value                                   |X|
|X|p          Lp-norm (2.1.12)                                 ‖X‖p
⟨X, Y⟩        Inner or scalar product (2.1.16)                 ⟨X, Y⟩
ΛX            Law of X (A.3.3)                                 ΛX
µ, ν          Measure (A.1.4), or mean                         µ
λ             Lebesgue measure (A.1.5, A.1.8)                  Leb
F, G          σ-algebras (A.1.1)                               F, G
IA            Indicator function                               IA
ε             Small value                                      ε
T ∧ n         min(T, n)                                        T ∧ n
N             {1, 2, 3, . . .}                                 N
Z+            {0, 1, 2, . . .}                                 Z+
R+            [0, ∞)                                           R+
n             Timestep of process                              n, k
gn            Shift value in Kalman filter (4.3.1)             gn, u(k)
Hn², Kn²      Variance of noise in filter (4.2.1, 4.3.1)       Hn², Kn², or Q(k), R(k)
Zn            Mean of estimate (4.2.1, 4.3.1)                  Zn, x̂(k|k)
Vn            Variance of estimate (4.2.1, 4.3.1)              Vn, P(k|k)
mΣ            Set of Σ-measurable functions (A.3.1)            mΣ
SF            Set of simple functions (A.5.2)                  SF
P(A)          Power set of A (2.1.2)                           P(A)
N(µ, σ²)      Normal distribution (2.1.23)                     N(µ, σ²)
φn            Kalman gain factor (4.3.5)                       K(k)
rn, r̄n        Residual, or innovation (4.3.5)                  z̃n⁻
Chapter 1
Introduction
We look at two interesting and important ideas. Martingales are random processes
that, on average, stay the same. From this simple definition flows a collection of
results, particularly about stopped martingales: under suitable conditions we can say
exactly what the expected value of a martingale is at the time it stops. This is very
useful, since many random processes are martingales, and we can also use martingales
to analyse other processes; see Example 3.1.27 for an example.
Say we have a process which we wish to observe the value of: we know the dynamics
the process follows, but our measurements of the current value are noisy, so blindly
reading the measurements could result in large errors. The second idea we look at,
the Kalman filter, is an algorithm for estimating what the value actually is; it has
many applications, since we almost never have perfect instruments to measure with.
Chapter 2 covers the theory of expectation, an essential part of anything where we
wish to know the expected, or average, value of a random variable, either in general
(expectation) or given certain information (conditional expectation). Some of the
beginning material here comes from results in measure theory, in particular the idea
of using Lebesgue integration instead of Riemann integration: additional information
is available in the Appendix if desired.
Chapter 3 defines martingales, and gives an outline of the theorems the definition now
allows us to use. There is also a comparison with Markov chains, another common
type of random process. There are several examples given in this chapter, mostly based
on scenarios in gambling. In the next chapter they are used to show that the filters
eventually have a fixed variance in their estimates. Martingales have a lot of uses, and
only a few are presented here. The examples and R source code are the author's own
unless otherwise stated; other material is from [8] unless otherwise stated.
Chapter 4 deals with the subject of filtering, where we wish to estimate the value of
a process given only noisy observations and knowledge of the dynamics of the process
and the observations. The notation and most of the basic theory are from [8], with
additional material from [1, 6, 7]; the examples and the material in Section 4.3.5 are the author's
own unless otherwise stated.
The Appendix contains the basic material from measure theory that is needed to talk
about probability and expectation: it is a brisk summary of the first few chapters
of [8].
All R programs and this report can be found at http://www.maths.leeds.ac.uk/
~voss/projects/2010-martingales/.
Chapter 2
Expectation and Conditional Expectation
The theory of martingales depends heavily on the use of conditional expectations. We
therefore first describe expectation and conditional expectation, and the associated
theorems needed. Material in this section is from [8] unless otherwise marked.
Not all of the material in this chapter is required to study martingales: the important
parts will be listed at the beginning of each section.
2.1 Expectation
Important parts in this section are Definitions 2.1.1, 2.1.8, 2.1.10, 2.1.12, 2.1.15, 2.1.16,
2.1.19, 2.1.21, Theorems 2.1.3, 2.1.7, 2.1.9, 2.1.11, 2.1.14, and Examples 2.1.2, 2.1.23.
2.1.1 Definition and Basic Theorems
Definition 2.1.1 The expectation of X is
$$E(X) = \int_\Omega X \, dP = \int_{\omega \in \Omega} X(\omega) \, P(d\omega) = P(X),$$
using the Lebesgue integral as defined in Definition A.5.1, where P is a probability
measure as described in Definition A.2.1, and where Ω is a set of events as defined
in Definition A.2.1.
Example 2.1.2 Take [Own Example] Ω = {x1, x2, . . . , xn} ⊆ R with xi ≠ xj ∀i ≠ j,
X(ω) = ω, and F = P(Ω), where P(Ω) is the power set of Ω, the set of all subsets of Ω.
Let $P(A) = \sum_{i=1}^n I_{x_i \in A} \, p_i$, so pi = P(X = xi). We now have
$$E(X) = \int_\Omega X \, dP = \sum_i x_i P(X = x_i) = \sum_i x_i P(\{x_i\}) = \sum_i x_i p_i.$$
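A minimal R sketch of this example (illustrative values, not one of the report's programs): we compare the exact expectation Σi xi pi with the empirical mean of simulated draws.

#Check E(X) = sum(x*p) for a discrete random variable against
#an empirical mean (illustrative sketch)
x = c(1, 2, 5)
p = c(0.2, 0.3, 0.5)
sum(x*p)                                     #exact: 3.3
mean(sample(x, 10^5, replace=TRUE, prob=p))  #approximately 3.3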
Taking P(Xn → X) = 1 for sequence (Xn )n∈N and some variable X, we have the following basic theorems [8, Section 6.2], derived from their equivalents in Appendix A.5:
Theorem 2.1.3 (Monotone-Convergence Theorem) For 0 ≤ Xn ↑ X,
E(Xn) ↑ E(X) ≤ ∞.
Theorem 2.1.4 (Fatou's Lemma) Xn ≥ 0 ⇒ E(X) ≤ lim inf E(Xn).
Theorem 2.1.5 (Dominated-Convergence Theorem) If ∃Y such that |Xn(ω)| ≤ Y(ω) ∀n, ω and E(Y) < ∞, then
E|Xn − X| → 0, so that E(Xn) → E(X).
Theorem 2.1.6 (Scheffé's Lemma) E|Xn| → E|X| ⟺ E|Xn − X| → 0.
We also have the following new theorems:
Theorem 2.1.7 (Bounded Convergence Theorem) If ∃ finite K s.t. |Xn(ω)| ≤ K ∀n, ω, then
E|Xn − X| → 0.
We will need this theorem later to prove one of the main theorems in Chapter 3, Doob’s
Optional Stopping Theorem (Theorem 3.1.21).
Definition 2.1.8 $E(X, F) = \int_F X(\omega) \, P(d\omega) = E(X I_F)$, where F ∈ F, with F the
σ-algebra of events from the probability triple described in Definition A.2.1.
Theorem 2.1.9 (Markov's Inequality) Let X ∈ mΣ, i.e. a Σ-measurable function as defined in Definition A.3.1, and let g be a non-decreasing, Borel-measurable function g : R → [0, ∞], with the Borel σ-algebra on R as defined in Example A.1.3.
Then
$$E(g(X)) \geq E(g(X), X \geq c) \geq g(c) \, P(X \geq c).$$
This theorem will be useful in the next section, where we want to show the existence
of a conditional expectation.
2.1.2 Variance and Inner Products
Definition 2.1.10 A function c : G → R for G ⊂ R is convex on G if its graph lies
below any of its chords, i.e. if for all x, y ∈ G, c(λx + (1 − λ)y) ≤ λc(x) + (1 − λ)c(y)
∀ 0 < λ < 1.
For example, the function c(x) = x² on R has
$$(\lambda x + (1-\lambda)y)^2 - (\lambda x^2 + (1-\lambda)y^2) = \lambda^2 x^2 + 2\lambda(1-\lambda)xy + (1-\lambda)^2 y^2 - \lambda x^2 - (1-\lambda)y^2 = -\lambda(1-\lambda)(y-x)^2 \leq 0.$$
So x² is convex on R.
Theorem 2.1.11 (Jensen's Inequality) For convex c, E(X) < ∞, P(X ∈ G) = 1
for an open subset G ⊂ R, and E|c(X)| < ∞, we have E(c(X)) ≥ c(E(X)).
This comes in useful in Chapter 4, when we want to show that the estimate of a single
value is a martingale.
Definition 2.1.12 For 1 ≤ p < ∞, we say X ∈ Lp if E|X|p < ∞. Then
$|X|_p = (E|X|^p)^{1/p}$, where |·|p is the Lp-norm. Lp is also a vector space.
Theorem 2.1.13 (Monotonicity of Lp-norms) Take p ≤ r, Y ∈ Lr. Then Y ∈ Lp,
and |Y|p ≤ |Y|r.
Theorem 2.1.14 (Schwarz Inequality) For X, Y ∈ L²,
i) XY ∈ L¹ and E(XY) ≤ E|XY| ≤ |X|₂|Y|₂;
ii) X + Y ∈ L², |X + Y|₂ ≤ |X|₂ + |Y|₂.
This theorem, like Theorem 2.1.9, will be used in the section on conditional expectation.
Definition 2.1.15 Take X, Y ∈ L², µX = E(X), µY = E(Y). Then X̃ = X − µX, Ỹ =
Y − µY are in L². By the Schwarz Inequality, X̃Ỹ ∈ L¹. We then define the covariance
Cov(X, Y) = E(X̃Ỹ) = E[(X − µX)(Y − µY)] = E(XY) − µXµY, and the
variance Var(X) = Cov(X, X) = E(X²) − µX².
This gives the usual form of variance and covariance in probability and statistics; it
comes in particularly useful in Chapter 4.
Definition 2.1.16 The inner, or scalar, product is ⟨X, Y⟩ = E(XY). The angle θ
between X and Y then obeys cos θ = ⟨X, Y⟩/(|X|₂|Y|₂), and the correlation of X
and Y is corr(X, Y) = cos θ = ⟨X̃, Ỹ⟩/(|X̃|₂|Ỹ|₂). We say that X, Y are orthogonal, or
perpendicular, when ⟨X, Y⟩ = 0; we also write this as X ⊥ Y. In this case, on
L² we have |X + Y|₂² = |X|₂² + |Y|₂². Taking X̃, Ỹ, in probabilistic language this
becomes
Var(X + Y) = Var(X) + Var(Y) when Cov(X, Y) = 0.
This is known as Pythagoras' theorem.
Using inner products in L² will be shown later to be an easy – and usually possible –
way to show the existence of a conditional expectation.
Theorem 2.1.17 (Parallelogram Law) By the bilinearity of ⟨·, ·⟩,
$$|X+Y|_2^2 + |X-Y|_2^2 = \langle X+Y, X+Y \rangle + \langle X-Y, X-Y \rangle = 2|X|_2^2 + 2|Y|_2^2.$$
2.1.3 Elementary Formula and Probability Density Functions
In this section we introduce the idea of a probability density function. This allows us
to integrate over the real numbers to find the expectation, variance, or other functions
of a random variable, by giving us a function that describes the probability of the
variable being in any interval.
Theorem 2.1.18 (Completeness of Lp) Let p ∈ [1, ∞). If (Xn) is a Cauchy
sequence in Lp, i.e.
$$\lim_{k \to \infty} \sup_{r,s \geq k} |X_r - X_s|_p = 0,$$
then ∃X ∈ Lp such that Xr → X in Lp, i.e.
$$\lim_{r \to \infty} |X_r - X|_p = 0.$$
Definition 2.1.19 Take a subspace K ⊂ L² that is closed under limits, i.e. any
Cauchy sequence (Vn) in K has Vn → V ∈ K. Then ∀X ∈ L² ∃Y ∈ K s.t.
|X − Y|₂ = inf{|X − W|₂ : W ∈ K} and (X − Y) ⊥ Z ∀Z ∈ K. We say Y is the
orthogonal projection of X onto K; it is almost surely unique.
Theorem 2.1.20 (Elementary Formula for Expectation) Let h be a
Borel-measurable function, and let ΛX be the law of X on (R, B), i.e. ΛX(B) = P(X ∈ B)
as defined in Definition A.3.3. Then
h(X) ∈ L¹ ⟺ h ∈ L¹(R, B, ΛX).
In this case, $E(h(X)) = \Lambda_X(h) = \int_{\mathbb{R}} h(x) \, \Lambda_X(dx)$.
Definition 2.1.21 The probability density function fX for X, if it exists, is a
Borel-measurable function fX : R → [0, ∞] such that $P(X \in B) = \int_B f_X(x) \, dx$ for B ∈ B.
We also write this as fX = dΛX/dλ, where λ is the Lebesgue measure on B, described in
Examples A.1.5 and A.1.8, and using the notation for density from Definition A.5.15.
Note: This means that, if we have a probability density function for X, we can
now write the expectation as $\int X f_X \, d\lambda$, by Lemma A.5.16. In particular, if xfX(x)
is also Riemann-integrable and X(ω) = ω, we can also write the expectation as
$\int x f_X(x) \, dx$, and the two integrals are equal.
Example 2.1.22 The uniform distribution U[a, b] on the interval [a, b] has constant
probability density function fX(x) = 1/(b − a). Then the mean of X is
$$E(X) = \int_a^b \frac{x \, dx}{b-a} = \frac{b+a}{2},$$
i.e. the midpoint of the interval, and the variance is
$$E(X^2) - E(X)^2 = \int_a^b \frac{x^2 \, dx}{b-a} - \frac{(b+a)^2}{4} = \frac{b^3-a^3}{3(b-a)} - \frac{(b+a)^2}{4} = \frac{b^2+ab+a^2}{3} - \frac{b^2+2ab+a^2}{4} = \frac{(b-a)^2}{12}.$$
Example 2.1.23 The normal distribution N(µ, σ²) has probability density function
$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. Then the mean of X,
$$E(X) = \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \int_{-\infty}^{\infty} \left( \frac{x-\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} + \frac{\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right) dx = \mu \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \mu,$$
since the first term is an odd function about µ and integrates to zero. And the variance,
$$E(X^2) - E(X)^2 = \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx - \mu^2 = \int_{-\infty}^{\infty} x \, \frac{x-\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx + \int_{-\infty}^{\infty} \frac{x\mu}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx - \mu^2$$
$$= \left[ -x \, \frac{\sigma}{\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right]_{x=-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{\sigma}{\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx + \mu^2 - \mu^2 = \sigma^2,$$
integrating the first term by parts.
This is a rather important example: a lot of random processes are normal, or can be
approximated as normal over a long period of time. In Chapter 4 one of the
underlying assumptions is that the noise is normally distributed, something known as
"white noise".
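As a minimal R check of this example (illustrative values; note that dnorm takes the standard deviation, not the variance), we can compute the mean and variance numerically via the elementary formula (Theorem 2.1.20):

#Numerical check of Example 2.1.23 with mu = 1, sigma^2 = 4
mu = 1; sigma = 2
integrate(function(x) x*dnorm(x, mu, sigma), -Inf, Inf)                 #mean: 1
integrate(function(x) x^2*dnorm(x, mu, sigma), -Inf, Inf)$value - mu^2  #variance: 4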
Theorem 2.1.24 (Hölder's Inequality) Take f ∈ Lp(S, Σ, µ) and h ∈ Lq(S, Σ, µ),
where 1/p + 1/q = 1. Then
f h ∈ L¹(S, Σ, µ) and |µ(f h)| ≤ µ(|f h|) ≤ |f|p|h|q.
Theorem 2.1.25 (Minkowski's Inequality) Take f, g ∈ Lp(S, Σ, µ). Then
|f + g|p ≤ |f|p + |g|p.
2.2 Conditional Expectation
The important parts of this section are Definition 2.2.2, Notation 2.2.3, Theorem 2.2.1,
and the “Existence in L2 ” section of the proof of said theorem.
2.2.1 The Fundamental Theorem
Now that we have the concept of expectation, we can give a formal definition of conditional expectation. The usual notion of conditional expectation Y = E(X|Z) is a
random variable Y whose value depends on the value of the random variable Z. Using our
definitions of events and random variables, we can say that Z is a real-valued function
Z(ω) of a point ω in the sample space. If Z has a given value, this
limits the possible values of ω; in other words, out of the possible events, the actual
outcome has been limited to a subset of the family of events F, say G. The conditional
expectation is then the expected value of X(ω) given that information.
Theorem 2.2.1 (Fundamental Theorem) Take a random variable X with E|X| <
∞, and G a sub-σ-algebra of F in the usual probability triple (Ω, F, P). Then there is
a random variable Y with Y ∈ mG, E|Y | < ∞ such that
∀G ∈ G E(Y, G) = E(X, G).
This Y is almost surely unique, i.e. if a random variable Ỹ has the same properties
then P(Y = Ỹ ) = 1.
Definition 2.2.2 A random variable Y with the properties described above is called a
version of conditional expectation E(X|G). We say that Y = E(X|G) almost surely.
Notation 2.2.3 We write E(X|Z) as shorthand for E(X|σ(Z)), where σ(Z) is the
σ-algebra generated by Z, as defined in Definition A.1.2.
Proof of Almost Sure Uniqueness
Proof by contradiction [8, Section 9.5]. Assuming the conditional expectation exists,
we have X ∈ L¹, and two versions Y, Ỹ of E(X|G). Then, by the definition, we have
Y, Ỹ ∈ L¹(Ω, G, P), and E(Y, G) = E(Ỹ, G) ⇒ E(Y − Ỹ, G) = 0 ∀G ∈ G.
We now suppose Y, Ỹ are not almost surely equal. Without loss of generality, we take
P(Y > Ỹ) > 0. We can introduce an error term and construct a sequence
of events An = {Y > Ỹ + εn} with An ↑ {Y > Ỹ} as εn ↓ 0. Then ∃n such that
P(Y > Ỹ + εn) = P(Y − Ỹ > εn) > 0.
Since Y, Ỹ are G-measurable, we can use the Markov inequality (Theorem 2.1.9) with
g(c) = c, c = εn to get
E(Y − Ỹ, Y − Ỹ > εn) ≥ εn P(Y − Ỹ > εn) > 0.
But E(Y − Ỹ, Y − Ỹ > εn) = 0, since {Y − Ỹ > εn} ∈ G. Contradiction; therefore Y is almost surely unique. □
Proof of Existence for X ∈ L2
We give a proof of existence for the case E|X|² < ∞, since this is a simpler case: we
can use the idea of orthogonal projections (Definition 2.1.19). Take Y as the orthogonal
projection of X onto L²(Ω, G, P), which always exists. Then we have
⟨X − Y, Z⟩ = E((X − Y)Z) = 0 ∀Z ∈ L²(Ω, G, P).
By linearity, we therefore arrive at E(XZ) = E(YZ). We can now set Z = I_G for
G ∈ G to obtain the result. □
Proof of Existence for X ∈ L1
In the standard machine (Method A.5.13), we defined h = h⁺ − h⁻, with h⁺, h⁻ as
described in Notation A.5.10, to prove something for all measurable functions if it was
true for all positive measurable functions. Similarly, we define X = X⁺ − X⁻ and
limit ourselves to the case X ∈ (L¹)⁺. We now choose [8, Section 9.5] a sequence
0 ≤ Xn ↑ X with each Xn bounded. Then Xn ∈ L², so by the previous section we can form the sequence
(Yn = E(Xn|G))n∈N.
Lemma 2.2.4 For a non-negative bounded random variable X, E(X|G) ≥ 0 almost
surely.
Proof: By contradiction. Assume a version Y of the expectation has P(Y < 0) > 0:
then ∃ε > 0 such that we can set G = {Y < −ε} with P(G) > 0, and take the Markov
Inequality (Theorem 2.1.9) with g(x) = x, X = −Y, c = ε to obtain
E(−Y, −Y > ε) ≥ εP(−Y > ε) > 0.
We thus have
0 ≤ E(X, G) = E(Y, G) ≤ −εP(G) < 0.
Contradiction, so E(X|G) ≥ 0 almost surely. □
From this lemma, applied to Xn and to Xn+1 − Xn, we have 0 ≤ Yn ≤ Yn+1 almost surely ∀n, so we define Y(ω) = lim sup Yn(ω).
Then Y ∈ mG, and Yn ↑ Y almost surely. By the Monotone-Convergence Theorem
(Theorem 2.1.3), we then take the limit of E(Yn, G) = E(Xn, G) to get
E(Y, G) = E(X, G) ∀G ∈ G. □

2.2.2 Example
Take [Own Example] Ω = {1, 2, 3, 4, 5, 6}, F = P(Ω), G = {∅, {1, 2}, {3, 4, 5, 6}, Ω},
and P(A ∈ F) = ♯A/6, where P(Ω) is the power set of Ω, and ♯A is the number of elements
of A. We then have the probability triple associated with rolling a fair six-sided die.
As we expect, the probability measure gives E(X) = 7/2 for X(ω) = ω. Since Y must
be G-measurable, we can say that
Y(ω) = y1 for ω ∈ {1, 2}, and Y(ω) = y2 for ω ∈ {3, 4, 5, 6}.
Since E(Y IG) = E(X IG) ∀G ∈ G, Y is determined by the simultaneous equations
E(Y I{1,2}) = E(X I{1,2}), E(Y I{3,4,5,6}) = E(X I{3,4,5,6}).
From the probability measure, we then have
$$\tfrac{1}{6}y_1 + \tfrac{1}{6}y_1 = \tfrac{1}{6}(1+2), \qquad \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 + \tfrac{1}{6}y_2 = \tfrac{1}{6}(3+4+5+6),$$
or (1/3)y1 = 1/2, (2/3)y2 = 3. We therefore obtain the result
$$E(X|G) = \tfrac{3}{2} I_{\{1,2\}} + \tfrac{9}{2} I_{\{3,4,5,6\}}.$$
Note: Since E|X|² < ∞ we can treat Y as the orthogonal projection – as from
Definition 2.1.19 – of X onto L²(Ω, G, P). Then any Z ∈ L²(Ω, G, P) can be written
Z = z1 I{1,2} + z2 I{3,4,5,6}, and we obtain
$$\langle X - Y, Z \rangle = E((X-Y)Z) = \tfrac{1}{6}\left((3 - 2y_1)z_1 + (18 - 4y_2)z_2\right) = 0 \quad \forall z_1, z_2.$$
So we see the inner product behaves as expected.
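A minimal R sketch of this example (not one of the report's programs): estimating E(X|G) by averaging simulated die rolls over each block of the partition.

#Estimate E(X|G) for a fair die by averaging rolls within each
#block of the partition {1,2}, {3,4,5,6} (illustrative sketch)
rolls = sample(1:6, 10^5, replace=TRUE)
mean(rolls[rolls <= 2])  #approximately 3/2
mean(rolls[rolls >= 3])  #approximately 9/2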
Chapter 3
Martingales
Martingales, and similarly supermartingales and submartingales, are an important
type of random process: a martingale's expected value stays the same over time, so on
average it stays where it is. (In the case of supermartingales and submartingales, the
expected value is monotonically decreasing or increasing respectively.) This allows us
to derive other results for random processes, which is especially useful as these three
types of process are rather common: typical examples of processes that can be modelled
as a martingale are the amount of money a gambler owns during several rounds of
betting, and Brownian motion.
3.1 Definition
Important parts in this section are Definitions 3.1.1, 3.1.5, 3.1.7, 3.1.17, 3.1.18, 3.1.15,
3.1.16, Theorem 3.1.21 and Corollary 3.1.26.
Definition 3.1.1 A filtration {Fn}n≥0 is an increasing family of sub-σ-algebras of
F, with F0 ⊆ F1 ⊆ . . . ⊆ F. We define F∞ = σ(∪n Fn) ⊆ F.
Example 3.1.2 For a common example from gambling, consider a sequence of dice
throws, each an independent and identically distributed random variable Xi for
i ∈ Z+. Then the filtration Fn = σ(X0, X1, . . . , Xn) is the σ-algebra generated by the set
{X0, . . . , Xn} as defined in Definition A.1.2. In other words, it is the collection of all
events that are determined by the throws up to time n.
Definition 3.1.3 A filtered space (Ω, F, {Fn }, P) is a probability triple (Ω, F, P)
with an associated filtration {Fn }n≥0 .
Example 3.1.4 {Fn } is usually taken as the natural filtration described in Example 3.1.2,
Fn = σ(X0 , X1 , . . . , Xn ).
For the gambling example, the filtered space is thus a description of the probabilistic
model for dice-throwing, with a record of the dice throws up to time n.
Definition 3.1.5 A process X = (Xn | n ≥ 0) is adapted to filtration {Fn } if Xn
is Fn -measurable ∀n.
Example 3.1.6 For the gambling example, a process Yn is adapted to Fn if it is any
function of {X0, X1, . . . , Xn}, so that it is determinable at time n; a common example
would be the sum of the throws, $Y_n = \sum_{i=0}^{n} X_i$.
Definition 3.1.7 A process X is a martingale relative to ({Fn}, P) if
i) X is adapted,
ii) E|Xn| < ∞ ∀n,
iii) E(Xn|Fn−1) = Xn−1 almost surely ∀n ≥ 1.
A supermartingale has the equality in iii) replaced by "≤", and a submartingale
has it replaced by "≥"; so, a supermartingale decreases on average, and a submartingale
increases on average.
3.1.1 Examples: Martingales and Markov Chains
Martingales and Markov chains both appear often in probability theory, so we look at
the distinctions between the two.
Definition 3.1.8 A stochastic process X is a collection of random variables
(Xγ | γ ∈ C) parametrized by set C, where the variables are all on the same probability
triple.
Definition 3.1.9 A stochastic, or transition, matrix P = (p_ij) is a matrix such
that
$$p_{ij} \geq 0, \qquad \sum_k p_{ik} = 1.$$
Definition 3.1.10 A time-homogeneous Markov Chain X = (Xn | n ∈ Z+) is
a stochastic process parametrized by the set Z+ with elements Xn ∈ E for some set E. If
E is countable, the chain is then defined by a stochastic |E| × |E| matrix P and an
initial distribution µ over E. Then
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \mu_{i_0} p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$
Corollary 3.1.11 A Markov Chain X is "memoryless", i.e.
$$P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_{n-1} = i_{n-1}).$$
Proof: From the definition, we have [Own Proof]
$$P(X_0 = i_0, \ldots, X_n = i_n) = P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \, P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = \mu_{i_0} p_{i_0 i_1} \cdots p_{i_{n-2} i_{n-1}} \, P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}),$$
$$\therefore\ p_{i_{n-1} i_n} = P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_{n-1} = i_{n-1}). \ \Box$$
So we have the other common definition of a Markov Chain, where the next value only
depends on the current value.
Example 3.1.12 A random walk Sn = X1 + X2 + . . . + Xn, where the Xn are independent, identically distributed random variables, has conditional expectation
$$E(S_n \mid F_{n-1}) = E(S_{n-1} + X_n \mid F_{n-1}) = S_{n-1} + E(X_n \mid F_{n-1}) = S_{n-1} + E(X_n).$$
So S is only a martingale when E(X) = 0. Additionally, we can say
$$P(S_0 = i_0 = 0, \ldots, S_n = i_n) = P(S_0 = 0, \ldots, S_{n-1} = i_{n-1}) \, P(S_n = i_n \mid S_{n-1} = i_{n-1}) = P(S_1 = i_1) P(S_2 = i_2 \mid S_1 = i_1) \cdots P(S_n = i_n \mid S_{n-1} = i_{n-1}) = p_{0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$
So S is always a Markov chain, with µ0 = 1.
Example 3.1.13 Instead of the usual Markov chain, let every value after time
n = 1 be determined by a new stochastic matrix Q = Q(X1) whose values depend on
the value of X1; in other words,
$$P(X_0 = i_0, X_1 = i_1) = \mu_{i_0} p_{i_0 i_1}, \qquad P(X_0 = i_0, \ldots, X_n = i_n) = \mu_{i_0} p_{i_0 i_1} q_{i_1 i_2}(i_1) q_{i_2 i_3}(i_1) \cdots q_{i_{n-1} i_n}(i_1) \quad \forall n > 1.$$
Except in the trivial case, the process X is no longer a Markov Chain, since the transition
probabilities are no longer "memoryless", but also depend on an older value. However,
if P and Q(X1) are defined in such a way that
$$E(X_1 \mid X_0 = i_0) = \sum_{j \in E} j \, p_{i_0 j} = i_0 \quad \forall i_0 \in E,$$
$$E(X_n \mid F_{n-1}) = \sum_{j \in E} j \, q_{i_{n-1} j}(i_1) = i_{n-1} \quad \forall n > 1,\ i_1, i_{n-1} \in E,$$
then we have a process that isn't a Markov Chain, but is a martingale.
Note that the above process can become a Markov Chain if we store the value of X1; in
this case, the process becomes a Markov Chain over the bivariate state space (X1, Xn).
Example 3.1.14 The "Martingale betting system" is a gambling strategy where the
gambler repeatedly doubles the size of his bet each round, until he wins a round. The
theory behind this is that, given an infinite amount of time and money, the gambler
can keep raising his bet indefinitely until he wins a round, at which point he is up by
the size of his initial bet. Mathematically, we write this as a sum of weighted random
variables. Without loss of generality, we take the starting value as zero, and assume
the payout is evens, i.e. the payout for a successful round is the initial bet, plus the
same again. Then we can write [8, Section 10.6]
$$S_n = C_1 X_1 + C_2 X_2 + \ldots + C_n X_n = S_{n-1} + C_n X_n,$$
where P(Xn = 1) = p, P(Xn = −1) = 1 − p, and Cn is a random variable that is
determined by past results, i.e.
Cn = Cn(X1, X2, . . . , Xn−1).
Cn is called a pre-visible process, and we define this term after the example. In this
case, we have
Cn = 1 if Xn−1 = 1, and Cn = 2Cn−1 if Xn−1 = −1.
For S to be a Markov Chain, we need the next value Sn to be completely determinable
from the value of Sn−1. This would require C to be determinable by S, but this isn't
the case, so S can't be a Markov Chain. On the other hand,
$$E(S_n \mid F_{n-1}) = E(S_{n-1} + C_n X_n \mid F_{n-1}) = S_{n-1} + C_n E(X_n),$$
so this is a martingale if E(Xn) = 0.
Note this strategy can take a long time, and hence a lot of money, waiting for a
successful round: the chance of waiting N rounds for a successful one is P(N = n) =
(1 − p)^{n−1} p, with an expected time of
$$E(N) = \sum_{n \in \mathbb{N}} n(1-p)^{n-1}p = \sum_{n \in \mathbb{N}} \sum_{m \geq n} (1-p)^{m-1}p = \sum_{n \in \mathbb{N}} (1-p)^{n-1} = \frac{1}{p}.$$
In addition, if we assume the initial bet C1 = c, the amount of money required to bet
in N rounds is c + 2c + . . . + 2^{N−1} c = (2^N − 1)c, so the expected amount of money M
required is
$$E(M) = \sum_{n \in \mathbb{N}} (2^n - 1)c(1-p)^{n-1}p = c\sum_{n \in \mathbb{N}} 2^n(1-p)^{n-1}p - c\sum_{n \in \mathbb{N}} (1-p)^{n-1}p = \begin{cases} \infty & 0 \leq p \leq 1/2, \\ \frac{c}{2p-1} & 1/2 < p \leq 1, \end{cases}$$
a rather large amount to need just to gain an amount c.
This betting system is where martingales derive their name from; the origin of the
word before this is unclear. There are two main theories [2]: either that it comes from
the name of a type of saddle – which bifurcates into two equally long strips in the
middle – or that it comes from the Provençal phrase "a la martegalo" – referring to
the inhabitants of Martigues, who had a reputation for doing things in a ridiculous or
naive way – and means "in an absurd manner", an appropriate origin for the naming
of this betting system if it is the case.
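A minimal R sketch of the betting system (illustrative, not one of the report's programs), checking the simulated averages against E(N) = 1/p and E(M) = c/(2p − 1):

#Simulate the doubling strategy until the first win, returning the
#number of rounds played and the total money bet (illustrative sketch)
run.doubling = function(p, stake=1) {
  bet = stake; total = 0; rounds = 0
  repeat {
    total = total + bet      #money needed for this round
    rounds = rounds + 1
    if (runif(1) < p) break  #won this round
    bet = 2*bet              #lost: double the next bet
  }
  c(rounds=rounds, money=total)
}
res = replicate(10^4, run.doubling(p=0.6))
rowMeans(res)  #approximately 1/0.6 = 1.67 rounds and 1/(2*0.6-1) = 5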
Definition 3.1.15 A process C = (Cn)n∈N is pre-visible if Cn ∈ mFn−1 ∀n.
Such a pre-visible process is thus determinable in advance of when it is used. Examples
include a controllable parameter, which would need to be determined based on past
results, and cannot be based on the upcoming one, such as C in the example above.
Definition 3.1.16 The martingale transform is $(C \bullet X)_n = \sum_{1 \leq k \leq n} C_k(X_k - X_{k-1})$,
where C is pre-visible. This is the discrete equivalent of the stochastic integral.
3.1.2 Stopping Times
Definition 3.1.17 A map T : Ω → {0, 1, 2, . . . ; ∞} is a stopping time if {T = n} =
{ω : T(ω) = n} ∈ Fn ∀n ≤ ∞. It is possible to have T = ∞.
The requirement that {T = n} ∈ Fn means that the decision to stop at a certain time
can only depend on what has happened up to that time. For example, you can set
the stopping time as the first occurrence of any set of values, T = inf{n ≥ 0 : Xn ∈ Y}
(or, similarly, the nth occurrence), but you can't set the stopping time as the nth last
occurrence of any value, because this usually cannot be determined until the whole
process has been observed.
Definition 3.1.18 The stopped process X^T = (X_{T∧n}), or the process X stopped at T, is
the process X up to stopping time T, and is equal to X_T ∀n ≥ T. This can also be
written via (C^(T) • X)n = X_{T∧n} − X0, where C_n^(T) = I{n≤T}. C^(T) is then pre-visible,
since {C_n^(T) = 0} = {T ≤ n − 1} = ∪_{0≤k≤n−1} {T = k} ∈ Fn−1.
Theorem 3.1.19 If X is a (super)martingale, and T is a stopping time, then X^T is
a (super)martingale, with E(X_{T∧n}) (less than or) equal to E(X0) ∀n.
Example 3.1.20 Take X as a simple symmetric random walk on Z [8], starting at 0, with stopping
time T = inf{n; Xn = 1}. Then E(X_T) = 1. However, E(X_{T∧n}) = E(X0) = 0. We
therefore do not necessarily have E(X_T) = E(X0).
The next theorem gives sufficient conditions for E(X_T) = E(X0).
Theorem 3.1.21 (Doob's Optional Stopping Theorem) Let X be a
(super)martingale and T be a stopping time. Then X_T is integrable and E(X_T) is (less
than or) equal to E(X0) if any of the following hold:
i) T is bounded, i.e. ∃N ∈ N such that T(ω) ≤ N ∀ω;
ii) X is bounded, i.e. ∃K ∈ R+ such that |Xn(ω)| ≤ K ∀n, ω, and T is almost surely
finite;
iii) E(T) < ∞, and ∃K ∈ R+ such that |Xn(ω) − Xn−1(ω)| ≤ K ∀(n, ω).
Proof: From Theorem 3.1.19, E(X_{T∧n}) ≤ E(X0). Then for X a supermartingale,
i) Take n = N.
ii) Take n ↑ ∞ using the Bounded Convergence Theorem (Theorem 2.1.7).
iii) The condition gives $|X_{T \wedge n} - X_0| = |\sum_{k=1}^{T \wedge n} (X_k - X_{k-1})| \leq KT$ with E(KT) < ∞,
so we can take n ↑ ∞.
For X a martingale, we apply the above to −X to show equality. □
Example 3.1.22 (Gambler's Ruin) Say we have an amount of money S0, and we
can bet on an infinite number of rounds, where the odds of victory and the corresponding
payoffs are the same as in Example 3.1.14. We aim to reach an amount of money S,
and so we stop when we've either reached this amount or run out of money: we'd like
to know how likely we are to succeed.
Take our money at time n as Sn = S0 + (C • X)n, with (C • X)n being the martingale
transform. Our stopping time is T = inf{n; Sn ∈ {0, S}}. Our situation can be split into
two cases:
i) p = q = 1/2. In this case Sn is a martingale regardless of our choice of Cn, and
since Sn is bounded we can derive our result from Theorem 3.1.21 [3, Section
12.2]:
$$0 \cdot P(S_T = 0) + S \cdot P(S_T = S) = E(S_T) = S_0,$$
so P(Success) = S0/S. This is irrespective of our gambling strategy, as we'd expect
from a fair game.
ii) p ≠ q. Sn is now either a supermartingale or a submartingale, so Theorem 3.1.21
will give us an inequality, and we can't use it to directly calculate the answer.
For the moment, let's suppose we always take Cn = 1, and that S0 = S/2. If
we now write p_i for P(S_T = 2i | S0 = i), then we can use the fact that Sn is
time-homogeneous to write p_i = p·p_{i+1} + q·p_{i−1}, with boundary conditions p0 = 0,
p_{2i} = 1. This is a recurrence relation with solution
$$p_i = \frac{1 - (q/p)^i}{1 - (q/p)^{2i}} = \frac{1}{1 + (q/p)^i} = \frac{p^i}{p^i + q^i}.$$
For p > q this is greater than 1/(1 + q/p) = p, and for p < q this is smaller than p,
where p would be our chance of success by betting i and reaching our stop time in
one round. Applying this result to each bet instead of the entire problem, we find
that when p > 1/2 our best strategy is to bet in increments as small as possible,
giving P(Success) = (1 − (q/p)^{S0})/(1 − (q/p)^S), and when p < 1/2 our best strategy is to bet as high
as possible: intuitively, in the former case the odds benefit us in the long term
so we can take our time, and in the latter case they do not.
Note: Calculating the chance of success when p < q can be complicated, depending
on exactly how we restrict the bets: limiting them to stay in [0, S] means the recurrence
relation on p_i changes depending on whether i ≥ S/2. (In the case S0 = S/2, we can
obviously still say the chance of success is p.) On the other hand, if we allow ourselves
to just stop when we have at least S, and bet all of our money at each timestep, the
relation is simpler and has solution P(Success) = p^{t+1}, where −t − 1 ≤ log2(i/S) < −t.
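A minimal R sketch of case i) (illustrative, not one of the report's programs): simulating the fair game with unit bets and checking P(Success) = S0/S.

#Simulate gambler's ruin with unit bets in the fair case p = 1/2
ruin.success = function(S0, S, p=0.5) {
  money = S0
  while (money > 0 && money < S) {
    money = money + sample(c(-1,1), 1, prob=c(1-p, p))
  }
  money == S
}
mean(replicate(10^3, ruin.success(S0=3, S=10)))  #approximately 3/10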
Corollary 3.1.23 If X is a martingale with Xn − Xn−1 bounded by some constant
K, C is a pre-visible process bounded by some constant L, and T is a stopping
time with E(T) < ∞, then E(C • X)_T = 0.
Corollary 3.1.24 If X is a non-negative supermartingale, and T is an almost-surely-finite stopping time, then E(X_T) ≤ E(X0).
Example 3.1.25 Take a simple binomial random walk, with X0 = 0, $X_n = \sum_{k=1}^n Z_k$
and Z_k equally likely to be −1 or 1 ∀k. Xn is then a martingale. Let stopping
time T = min{n : Xn = −1}. We then have that i) T is not bounded, and ii) X is
not bounded. Since −1 = E(X_T) ≠ E(X0) = 0, iii) must also not hold. We therefore
conclude that E(T) = ∞.
We usually need to determine whether E(T) < ∞ instead of deriving it from Doob's
Optional-Stopping Theorem.
Corollary 3.1.26 Let T be a stopping time such that ∃N ∈ N, ε > 0 such that
∀n ∈ N: P(T ≤ n + N | Fn) > ε almost surely. Then E(T) < ∞.
Example 3.1.27 Say we have a monkey randomly pressing keys on a typewriter, letters only, and we want to know how long it will take for it to type the word "abracadabra" [8, Exercise E10.6]. We suppose that at each timestep a gambler arrives
with one betting chip, and bets it on the monkey typing the first letter, A, at the fair
payoff of 25 to 1. If he wins, then at the next timestep he bets all 26 chips on the
monkey typing B, and so on through the whole word, until the monkey types the whole
word, or it misses a letter and he loses all his chips.
Let X_{n,m} be the number of chips owned at time m by the gambler who entered at time
n – so, for example, X_{n,n} will be equal to either 26 or 0. X_{n,m} is then a martingale
with regard to m. Then the sum of chips owned by all gamblers at time m,
$Z_m = \sum_{n=1}^m X_{n,m}$, has E(Z_m) = m, and the process W_m = Z_m − m is a martingale
with E(W_m) = 0. If we can find E(W_T) = E(Z_T) − E(T), where T is the stopping time
of the monkey typing the whole word, then we can express the expected time taken by
the expected total number of chips held by the gamblers.
We can use Corollary 3.1.26 to show that E(T) < ∞, by, for example, setting N > 11
and ε = 26^{−11}. In addition, Z_m ≤ 26Z_{m−1} + 26, and $Z_m \leq \sum_{i=1}^{11} 26^i$ ∀m, so we
know that $|Z_m - Z_{m-1}| \leq 25\sum_{i=0}^{10} 26^i + 26 = 26^{11} + 25$. We can thus use condition
iii) of Doob's Optional Stopping Theorem (3.1.21) to say that E(W_T) = 0, and so
E(Z_T) = E(T). E(Z_T) has a value predetermined by the stop condition, since only
certain gamblers can have any chips remaining when the word is finished: those who
entered at times T, T − 3, and T − 10 will have chips, since there are sequences of
letters that are both at the beginning and the end of the word, of lengths 1, 4 and 11
respectively. These gamblers hold 26, 26⁴ and 26¹¹ chips respectively, so
E(T) = E(Z_T) = 26 + 26⁴ + 26¹¹.
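Simulating the full word is impractical, since E(T) is of the order 26¹¹, but a minimal R sketch with the word "aa" over the two-letter alphabet {a, b} checks the same argument: the prefixes of "aa" that are also suffixes have lengths 1 and 2, so the argument above gives E(T) = 2 + 2² = 6.

#Estimate the expected time for a random typist on {a,b} to
#produce "aa"; the gambling argument gives E(T) = 2 + 4 = 6
type.until = function(target=c("a","a"), alphabet=c("a","b")) {
  history = character(0)
  repeat {
    history = c(history, sample(alphabet, 1))
    n = length(history)
    if (n >= length(target) &&
        all(tail(history, length(target)) == target)) return(n)
  }
}
mean(replicate(10^4, type.until()))  #approximately 6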
3.1.3 R source code: Random Walk
This script creates a given number of random walks over a given number of timesteps
with a given probability distribution, then plots them on a graph. The creation of the
walk data is left as a separate function to allow manipulation before plotting.
#Create walk data according to any defined random distribution
create.walk = function(timesteps, runs, start, f) {
  y = array(dim=c(timesteps+1, runs))
  y[1,] = start
  for (time in 1:timesteps) {
    x = f(runs)
    y[time+1,] = y[time,] + x
  }
  y
}

draw.walk = function(data) {
  ##Create a "width matrix" to record the no. of occurrences of
  ##each edge
  calculate.width = function(data, timesteps, runs) {
    z = array(NA, c(timesteps, runs))
    for (time in 1:timesteps) {
      ##loop to runs-1 only: the inner loop below looks ahead, and
      ##the last run is handled by the check after this loop
      for (run in 1:(runs-1)) {
        if (is.na(z[time,run])) {
          z[time,run] = 1
          for (through in (run+1):runs) {
            if (data[time,through]==data[time,run] &&
                data[time+1,through]==data[time+1,run]) {
              z[time,c(run,through)] = c(z[time,run]+1, 0)
            }
          }
        }
      }
      if (is.na(z[time,runs])) {
        z[time,runs] = 1
      }
    }
    z
  }
  ##Use width matrix to plot walk with thickness depending on frequency
  draw.data = function(data, width, timesteps, runs) {
    plot(c(0,timesteps), range(data),
         type="n", xlab="Time", ylab="Walk values")
    for (run in 1:runs) {
      for (time in 1:timesteps) {
        if (width[time,run] > 0) {
          lines(c(time-1,time), c(data[time,run],
                data[time+1,run]), lwd=width[time,run])
        }
      }
    }
  }
  b = calculate.width(data, dim(data)[1]-1, dim(data)[2])
  draw.data(data, b, dim(data)[1]-1, dim(data)[2])
}

Some example commands, respectively for an even discrete (−1, 1) random walk and
a normal distribution of mean 0 and variance 1, each with 10 runs over 25 time units:

source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R")
x = create.walk(25, 10, 0, function(x) sample(c(-1,1), x, replace=TRUE))
draw.walk(x)

source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R")
x = create.walk(25, 10, 0, function(x) rnorm(x, 0, 1))
draw.walk(x)
[Figure 3.1: Example graphic for a binomial random walk in Program 3.1.3, with increments equally likely to be 1 or −1, taken over 25 time steps with 10 sample runs.]
[Figure 3.2: Example graphic for a random walk in Program 3.1.3 with normally-distributed increments, with mean 0 and variance 1, over 25 time steps and 10 sample runs.]

3.2 The Convergence Theorem
Important parts in this section are Theorem 3.2.5 and Lemmas 3.2.2 and 3.2.3.
For a process X on R, let Y_N = (C • X)_N, where the pre-visible strategy Cn(a, b) is defined
as follows for a < b: Cn = 0 until X < a; then Cn = 1 until X > b; then Cn = 0,
and the strategy repeats. More formally,
$$C_1 = I_{\{X_0 < a\}}, \qquad C_n = I_{\{C_{n-1} = 1\}} I_{\{X_{n-1} \leq b\}} + I_{\{C_{n-1} = 0\}} I_{\{X_{n-1} < a\}}.$$
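A minimal R sketch of this strategy (illustrative), which also counts the upcrossings defined in Definition 3.2.1 below:

#Follow the strategy C_n(a,b) along a path X_0,...,X_N and count
#the completed upcrossings of [a,b] (illustrative sketch)
count.upcrossings = function(X, a, b) {
  C = 0   #current state of the strategy
  up = 0  #completed upcrossings
  for (n in 1:length(X)) {
    if (C == 0 && X[n] < a) C = 1    #start holding below a
    else if (C == 1 && X[n] > b) {   #sell above b: one upcrossing
      C = 0
      up = up + 1
    }
  }
  up
}
X = c(0, cumsum(sample(c(-1,1), 1000, replace=TRUE)))
count.upcrossings(X, a=-2, b=2)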
Definition 3.2.1 Let the number of upcrossings U_N[a, b](ω) be the number of times
that X goes from below a to above b in time N.
Since Y_N totals the amount gained in U_N upcrossings, plus an amount at the
end if C_N = 1, we can say that
Lemma 3.2.2 Y_N(ω) ≥ (b − a)U_N[a, b](ω) − [X_N(ω) − a]⁻, where [n]⁻ = |min(n, 0)|.
Lemma 3.2.3 (Doob's Upcrossing Lemma) Let X be a supermartingale and U_N[a, b]
be the number of upcrossings of [a, b] by time N. Then
(b − a)E U_N[a, b] ≤ E([X_N − a]⁻).
Proof: Y = C • X is also a supermartingale, so E(Y_N) ≤ 0. The result follows by taking
expectations in Lemma 3.2.2. □
Corollary 3.2.4 Let X be a supermartingale bounded in L¹, a, b ∈ R such that a < b,
and U∞[a, b] = lim_{N↑∞} U_N[a, b]. Then
$$(b-a) E\,U_\infty[a, b] \leq |a| + \sup_n E|X_n| < \infty,$$
and so
P(U∞[a, b] = ∞) = 0.
Conversely, if U∞[a, b] = ∞ for some a, b, then X is not bounded in L¹.
Proof: By Lemma 3.2.3, (b − a)E U_N[a, b] ≤ |a| + E|X_N| ≤ |a| + sup_n E|Xn|. We then
let N ↑ ∞ using the Monotone Convergence Theorem (Theorem 2.1.3). □
Theorem 3.2.5 (Doob's Forward Convergence Theorem) For a
supermartingale X bounded in L¹, X∞ = lim_{n↑∞} Xn almost surely exists and is finite.
Definition 3.2.6 We set X∞(ω) = lim sup Xn(ω) ∀ω. X∞ is then F∞-measurable,
and X∞ = lim Xn almost surely.
Corollary 3.2.7 If X is a non-negative supermartingale, then
E|Xn| = E(Xn) ≤ E(X0),
so X is bounded in L¹ and X∞ exists almost surely.
3.3 Further Results
The most important point in this section is Theorem 3.3.2. From the theorems above,
we often need to show that a martingale is bounded in L¹. Often the easiest method
for proving this is to show that the martingale is bounded in L²: the result then follows
by taking part i) of the Schwarz Inequality (Theorem 2.1.14) with Y = 1.
3.3.1 Orthogonality of Increments
Let X be a martingale in L², and s, t, u, v ∈ Z+ such that s ≤ t ≤ u ≤ v. We have that
E(Xv|Fu) = Xu almost surely, so Xv − Xu is orthogonal to L²(Fu) from Section 2.2.1.
In particular,
$$E((X_t - X_s)(X_v - X_u)) = E\big(E((X_t - X_s)(X_v - X_u) \mid F_t)\big) = E\big((X_t - X_s) E(X_v - X_u \mid F_t)\big) = E((X_t - X_s)(X_t - X_t)) = 0,$$
so we can express Xn as a sum of orthogonal terms,
$$X_n = X_0 + \sum_{k=1}^{n} (X_k - X_{k-1}).$$
By Pythagoras' Theorem (Definition 2.1.16), we thus have
$$E(X_n^2) = E(X_0^2) + \sum_{k=1}^{n} E[(X_k - X_{k-1})^2].$$
Theorem 3.3.1 [8, Chapter 12] Let Xn ∈ L² be a martingale. Then Xn is bounded
in L² if and only if
$$\sum E[(X_k - X_{k-1})^2] < \infty,$$
and in this case Xn → X∞ almost surely and in L².
Proof: Follows immediately from the above expression of E(Xn²) as a sum of orthogonal terms.
Given that the martingale is thus bounded in L¹, Doob's Convergence Theorem (Theorem 3.2.5) then shows that X∞ = lim Xn exists almost surely. □
3.3.2 Lévy's Upward Theorem
Theorem 3.3.2 (Lévy's Upward Theorem) Let X ∈ L¹(Ω, F, P), and let
Yn = E(X|Fn) almost surely. Then Yn → E(X|F∞) almost surely and in L¹.
Loosely speaking, this means that if we have an expectation of a process
given information up to time n, then that expectation will tend to the expectation
given all the information we will ever have.
Chapter 4
Filtering
We can now discuss filtering, the theory of estimating the value of unknown quantities
we can only measure with noise. The main method used here is an expansion of Bayes’
formula, given in the first section.
4.1 Bayes' Formula For Bivariate Normal Distributions
Notation 4.1.1 We write the conditional probability
P(B|A) = P(AB)/P(A)
as C_A(B).
We also have C_B(A) = P(AB)/P(B), and so we can write
C_B(A) = C_A(B)P(A)/P(B).
This is Bayes' theorem, and is often used to "update" the estimate of P(A) after
observing event B. In this case we often call P(A) the prior probability, and C_B(A)
the posterior probability. The recursive property of conditional probabilities is then
written as
$$C_{A,B,C}(D) = C_{A,B}(D|C) = \frac{C_{A,B}(CD)}{C_{A,B}(C)}.$$
When updating estimated probabilities as described above, this lets us generalise Bayes'
theorem to cases where we have a series of observations.
4.1.1 Recursive Property in Probability Distribution Functions
Take random variables X, Y, Z, T with joint pdf f_{X,Y,Z,T} : R⁴ → R, so that for B ∈ B⁴,
$$P\{(X, Y, Z, T) \in B\} = \int_B f_{X,Y,Z,T}(x, y, z, t) \, dx \, dy \, dz \, dt.$$
We then have $f_{X,Y,Z}(x, y, z) = \int_{\mathbb{R}} f_{X,Y,Z,T}(x, y, z, t) \, dt$. Then the conditional pdf of T is
$$f_{T|X,Y,Z}(t|x, y, z) = \frac{f_{X,Y,Z,T}(x, y, z, t)}{f_{X,Y,Z}(x, y, z)},$$
and the recurrence property is written as
$$f_{T|X,Y,Z} = (f_{T|Z})|_{X,Y} = \frac{f_{T,Z|X,Y}}{f_{Z|X,Y}}.$$
Example 4.1.2 For two random variables X, Y with joint pdf f_{X,Y} on R² [8],
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)} \propto f_X(x) f_{Y|X}(y|x),$$
with the constant of proportionality determined by $\int_{\mathbb{R}} f_{X|Y}(x|y) \, dx = 1$.
4.1.2 Bayes' Formula
Lemma 4.1.3 (Bayes' Formula) [8, Section 15.7] Let µ, a, b ∈ R, U, W ∈ (0, ∞),
and X, Y be random variables such that
L(X) = N(µ, U), C_X(Y) = N(a + bX, W),
where N(µ, σ²) is a normal distribution as described in Example 2.1.23. Then C_Y(X) =
N(Z, V), where V ∈ (0, ∞) and Z are such that
$$\frac{1}{V} = \frac{1}{U} + \frac{b^2}{W}, \qquad \frac{Z}{V} = \frac{\mu}{U} + \frac{b(Y-a)}{W}.$$
Proof: X has distribution function $f_X(x) = (2\pi U)^{-1/2} e^{-\frac{(x-\mu)^2}{2U}}$, and Y has conditional distribution function $f_{Y|X}(y|x) = (2\pi W)^{-1/2} e^{-\frac{(y-a-bx)^2}{2W}}$. From Example 4.1.2,
f_{X|Y}(x|y) ∝ f_X(x) f_{Y|X}(y|x), so
$$\log f_{X|Y}(x|y) = c_1(y) - \frac{(x-\mu)^2}{2U} - \frac{(y-a-bx)^2}{2W}$$
$$= c_1(y) - \frac{1}{2UW}\left( W(x^2 - 2\mu x + \mu^2) + U(b^2x^2 - 2b(y-a)x + f(y)) \right)$$
$$= c_1(y) - \frac{1}{2UW}\left( (W + b^2U)x^2 - 2(\mu W + b(y-a)U)x + f(y) \right)$$
$$= c_2(y) - \frac{W + b^2U}{2UW}\left( x - \frac{\mu W + b(y-a)U}{W + b^2U} \right)^2 = c_2(y) - \frac{(x-z)^2}{2V},$$
where c1(y), c2(y) are terms independent of x, f(y) collects the remaining terms in y, and
$$\frac{1}{V} = \frac{W + b^2U}{UW} = \frac{1}{U} + \frac{b^2}{W}, \qquad \frac{z}{V} = \frac{\mu W + b(y-a)U}{UW} = \frac{\mu}{U} + \frac{b(y-a)}{W}.$$
Thus, f_{X|Y}(x|y) has the normal distribution given above. □
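A minimal R sketch of this update rule (an illustrative helper function, not one of the report's programs):

#One-step Bayes update from Lemma 4.1.3: prior X ~ N(mu, U) and
#observation Y|X ~ N(a + bX, W) give posterior X|Y ~ N(Z, V)
bayes.update = function(mu, U, a, b, Y, W) {
  V = 1/(1/U + b^2/W)           #1/V = 1/U + b^2/W
  Z = V*(mu/U + b*(Y - a)/W)    #Z/V = mu/U + b(Y - a)/W
  c(mean=Z, var=V)
}
bayes.update(0, 1, 0, 1, 2, 1)  #prior N(0,1), Y = X + noise, Y = 2: posterior N(1, 1/2)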
4.2 Single Random Variable
4.2.1 System Model and Filter
We can now apply the above theory of conditional probabilities to the case of a
single random variable, where we can only get observations of the variable with random
noise. Specifically, we suppose we have independent random variables X, ν1, ν2, . . . with
normal distributions,
L(X) = N(0, σ²), L(νk) = N(0, c_k²).
We then say we can only observe X through the series of observations Y_k = X + ν_k, and
set F_n = σ(Y1, Y2, . . . , Yn).
Notation 4.2.1 We write the conditional law C_{Fn}(X), defined in Notation 4.1.1,
as C_n(X), with C_0(X) = N(0, σ²). Then C_N(X) is the estimate of X after N
observations.
Say we have determined that C_{n−1}(X) = N(Z_{n−1}, V_{n−1}). Then Y_n = X + ν_n gives
C_{n−1}(Y_n|X) = N(X, c_n²),
so we can use Bayes' formula (Lemma 4.1.3), with
µ = Z_{n−1}, U = V_{n−1}, a = 0, b = 1, W = c_n²,
to obtain the recursion [8, Section 15.8, with misprint as if U = V_n]
C_n(X) = C_{n−1}(X|Y_n) = N(Z_n, V_n),
$$\frac{1}{V_n} = \frac{1}{V_{n-1}} + \frac{1}{c_n^2}, \qquad \frac{Z_n}{V_n} = \frac{Z_{n-1}}{V_{n-1}} + \frac{Y_n}{c_n^2}.$$
Since C_0(X) = N(0, σ²) = N(Z_0, V_0), we have shown that C_n(X) = N(Z_n, V_n) ∀n.
Lemma 4.2.2 The above estimate mean, Z_n, is a martingale, and Z_n → Z∞ = X
almost surely.
Proof: We first show Z_n is a martingale by checking the three conditions in Definition 3.1.7.
i) We have Z_n = E(X|F_n), so Z_n is clearly adapted.
ii) By Jensen's Inequality (Theorem 2.1.11) with c(x) = x², applied to the inner expectation,
E(X|F_n)² ≤ E(X²|F_n), so |Z_n|₂² = E(E(X|F_n)²) ≤ E(X²) = σ². Z_n is thus bounded
in L², and so is also bounded in L¹.
iii) E(Z_n|F_m) = E(E(X|F_n)|F_m) = E(X|F_m) = Z_m ∀m < n.
So Z_n is a martingale with E(X − Z_n)² = V_n, bounded in L². We thus know from
Lévy's Upward Theorem (Theorem 3.3.2) that Z_n = E(X|F_n) → Z∞ = E(X|F∞)
almost surely and in L². We'd like to show that our mean estimate of X tends towards
X as the number of observations tends to infinity, i.e. that Z∞ = X almost surely,
so we also need the variance V_n ↓ 0. From the recurrence formula for V_n, we have
$$V_n = \left( \sigma^{-2} + \sum_{k=1}^n c_k^{-2} \right)^{-1};$$
we can thus say that Z∞ = X almost surely if and only if $\sum c_k^{-2} = \infty$. □

4.2.2 R source code: Filtering a Single Value
This script takes a normally-distributed variable and noisy measurements of it, calculates the estimate of the variable as the measurements are taken, then plots the
variable, the estimate’s mean and standard deviation, and the measurements.
#Script for creating noisy measurements of a value, and giving
#an estimate of the variable by filtering

##Calculate the estimated mean and variance of possible values
create.estimates = function(Y, signalmean, signalsd, noisesd) {
  n = length(Y)
  noisesd = matrix(noisesd, n, 1)
  est = array(NA, dim=c(length(Y)+1, 2))
  ##the columns hold Z/V and 1/V while accumulating, so initialise
  ##with the prior mean and variance in that form
  est[1,] = c(signalmean/signalsd^2, 1/signalsd^2)
  for(step in 1:n) {
    est[step+1,2] = est[step,2] + 1/(noisesd[step]^2)
    est[step+1,1] = est[step,1] + Y[step]/(noisesd[step]^2)
  }
  est[,2] = 1/est[,2]
  est[,1] = est[,1]*est[,2]
  est[,2] = sqrt(est[,2])
  colnames(est) = c("mean", "sd")
  est
}

draw.filter = function(signal, Y, est) {
  ##Calculate one standard deviation to each side of the estimates
  estplus = est[,1]+est[,2]
  estminus = est[,1]-est[,2]
  plot(c(0,length(Y)),
       range(c(est[,1], estplus, estminus, signal-0.1, signal+0.1)),
       type="n", xlab="Measurements", ylab="Estimate")
  lines(0:length(Y), est[,1])
  lines(0:length(Y), estplus, lty=2)
  lines(0:length(Y), estminus, lty=2)
  abline(h=signal, lty=3)
  points(1:length(Y), Y, pch="+")
}

create.filter = function(n, noisesd, signalmean=0, signalsd=1) {
  signal = rnorm(1, signalmean, signalsd)
  observations = rnorm(n, signal, noisesd)
  est = create.estimates(observations, signalmean, signalsd, noisesd)
  draw.filter(signal, observations, est)
}

[Figure 4.1: Example graphic for measuring a random variable in Program 4.2.2, with mean 0 and variance 1, with noise of variance 1.]
[Figure 4.2: Standard deviation of the estimate in Figure 4.1 over time.]
[Figure 4.3: Example graphic for measuring a random variable in Program 4.2.2, with mean 0 and variance 1, with noise of variance c_n² = 1/n².]
[Figure 4.4: Standard deviation of the estimate in Figure 4.3 over time.]
4.3 Series of Variables; Kalman Filter
4.3.1 System Model and Filter
We can use the same recursion for a series of variables. For the Kalman Filter, we
assume that the signal variable (X_n)n∈N follows a disturbed linear recursion,
$$X_n - X_{n-1} = A_n X_{n-1} + g_n + \nu_n,$$
where g_n is known at time n − 1 – so it is pre-visible – and L(ν_n) = N(0, H_n²), and
that the observations Y_n follow the recursion
$$Y_n - Y_{n-1} = C_n X_n + \epsilon_n,$$
where C_n is known, and L(ε_n) = N(0, K_n²). We also assume that C_n ≠ 0 – otherwise
the observation has no relation to the signal – and that K_n ≠ 0, since K_n = 0 would
mean our observations are perfectly accurate and we don't need to filter.
g_n is required to be pre-visible for many applications of the filter in optimization, since
pre-visibility means g_n can be some controllable variable used to optimize to certain
conditions [6]; see Section 4.3.3 for an example of this.
We express the equations in terms of X_n − X_{n−1} and Y_n − Y_{n−1}, rather than X_n and
Y_n, because these recurrence relations are often used as an approximation to stochastic
differential equations like dX_t = A_t X_t dt + dg_t + H_t dB_t and dY_t = C_t X_t dt + K_t dB̃_t, where
B, B̃ are standard Brownian motions, approximated by (B_n − B_{n−1}), (B̃_n − B̃_{n−1}) ∼
N(0, 1). These are used in the Kalman-Bucy filter, the continuous-time equivalent of
the Kalman filter; for more information see [4].
For X_0 we again have an initial mean and variance, as in Section 4.2.1. Since we only
look at the difference Y_1 − Y_0, the value of Y_0 is arbitrary, but to equate the process to
a continuous one we usually take Y_0 = 0 by convention.
We now have equations
$$C_{n-1}(X_n) = N(\alpha_n Z_{n-1} + g_n,\ \alpha_n^2 V_{n-1} + H_n^2), \qquad C_{n-1}(Y_n|X_n) = N(Y_{n-1} + C_n X_n,\ K_n^2),$$
where α_n = A_n + 1. As before, we suppose that C_{n−1}(X_{n−1}) = N(Z_{n−1}, V_{n−1}).
We then apply Lemma 4.1.3, with
µ = α_n Z_{n−1} + g_n, U = α_n² V_{n−1} + H_n², a = Y_{n−1}, b = C_n, W = K_n²,
to obtain the recursion
$$\frac{1}{V_n} = \frac{1}{\alpha_n^2 V_{n-1} + H_n^2} + \frac{C_n^2}{K_n^2}, \qquad \frac{Z_n}{V_n} = \frac{\alpha_n Z_{n-1} + g_n}{\alpha_n^2 V_{n-1} + H_n^2} + \frac{C_n(Y_n - Y_{n-1})}{K_n^2}.$$
Lemma 4.3.1 In the case where H_n, C_n, K_n, A_n, g_n are constant in time, the scaled
estimate
$$M_n = \begin{cases} \alpha^{-n}\left(Z_n - \frac{g}{1-\alpha}\right) & \alpha \neq 0, 1, \\ Z_n - gn & \alpha = 1, \\ Z_n & \alpha = 0, \end{cases}$$
is a martingale.
Proof: The case where α = 0 is obvious, since M_n = g; otherwise, for m < n,
$$E(Z_n|F_m) = \alpha E(Z_{n-1}|F_m) + g = \alpha^2 E(Z_{n-2}|F_m) + g + \alpha g = \ldots = \alpha^{n-m}Z_m + g\sum_{i=0}^{n-m-1} \alpha^i = \begin{cases} \alpha^{n-m}Z_m + g\,\frac{\alpha^{n-m}-1}{\alpha-1} & \alpha \neq 1, \\ Z_m + g(n-m) & \alpha = 1, \end{cases}$$
$$\therefore\ E\left(\alpha^{-n}\left(Z_n + \frac{g}{\alpha-1}\right)\Bigm|F_m\right) = \alpha^{-m}\left(Z_m + \frac{g}{\alpha-1}\right) \quad (\alpha \neq 1), \qquad E(Z_n - gn\,|\,F_m) = Z_m - gm \quad (\alpha = 1). \ \Box$$
α = 1. Theorem 4.3.2 Vn tends to a limit V∞ .
As before, we assume C, K 6= 0. We first look for a fixed point, 1/V∞ =
C2
α2 V∞ +H 2 + K 2 , then examine the stability of the fixed point with the following lemma:
Proof:
1
Lemma 4.3.3 For an equation xn+1 = f (xn ), the fixed point x∗ is locally stable if
(x∗ )
(x∗ )
| dfdx
| < 1, since f (x∗ + n ) ∼ f (x∗ ) + n dfdx
by Taylor series for small error n .
We rearrange the fixed-point equation to
V∞ = f (V∞ ) =
K 2 (α2 V∞ + H 2 )
,
K 2 + C 2 (α2 V∞ + H 2 )
2
2
2
)
and take Vn+1 = f (Vn ), where f (x) = K 2K+C(α2 (αx+H
2 x+H 2 ) , in the above Lemma.
The equation can also be rearranged to
2
α2 C 2 V∞
+ C 2 H 2 + K 2 (1 − α2 ) V∞ − K 2 H 2 = 0.
This quadratic equation can reduce to a linear equation depending on α, so we look at
the following distinct cases:
1. For α 6= 0 and H 6= 0, we have the full quadratic equation, with solution
V∞ =
(α2 − 1)K 2 − C 2 H 2 +
p
((α2 − 1)K 2 − C 2 H 2 )2 + 4a2 C 2 K 2 H 2
.
2α2 C 2
28
In this case we only take the positive root: since α, H 6= 0 the square root term
is larger than the rest, so the negative root is less than zero, and since variance
is positive this is impossible. The function derivative is
α2 K 2
α2 C 2 K 2 (α2 x + H 2 )
df (x)
= 2
−
2
dx
K + C 2 (α2 x + H 2 )
K 2 + C 2 (α2 x + H 2 )
=
α2 K 4
K 2 + C 2 (α2 x + H 2 )
2 .
2
For convergence we thus require α2 K 4 < K 2 + C 2 (α2 V∞ + H 2 ) , or
|α|K 2
2|α|K 2
K 2 + C 2 H 2 + α2 C 2 V∞
1 2
(α − 1)K 2 − C 2 H 2
< K 2 + C 2H 2 +
2
q
2
+ (α2 − 1)K 2 − C 2 H 2 + 4a2 C 2 K 2 H 2 bysolution,
√ < (α2 + 1)K 2 + C 2 H 2 + 2 . . . ,
<
√
∴ (|α| − 1)2 K 2 + C 2 H 2 + 2 . . . > 0.
The variance therefore always converges to V∞ , since all the terms are positive
and C 2 H 2 > 0.
2. For α 6= 0 and H = 0, the equation now has two fixed points, at V−∞ = 0 and
2
2
V+∞ = (α α−1)K
, and the convergence condition becomes
2C2
(|α| − 1)2 K 2 ± (α2 − 1)K 2 > 0.
For V−∞ this becomes (1 − |α|)K 2 > 0, and for V+∞ this becomes (α2 − |α|)K 2 >
0. The stability of these points thus depends on |α|: for |α| < 1, V−∞ = 0 is stable,
and V+∞ is both unstable and negative; for |α| > 1, the variance converges to
non-zero value V+∞ . and V−∞ = 0 is unstable; for |α| = 1, V−∞ = V+∞ = 0 is
stable. In summary, in this case we always have convergence.
3. For α = 0, the equation reduces to (C 2 H 2 + K 2 )V∞ − K 2 H 2 = 0, with solution
V∞ =
We have f (x) =
K2H2
K 2 +C 2 H 2 ,
so
df (x)
dx
K 2H 2
.
+ C 2H 2
K2
= 0. We thus always have stability.
In summary, the variance always converges to V∞ . 4.3.2
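A minimal R check of this convergence (illustrative, with arbitrary constants): iterating the variance recursion and comparing with the closed-form fixed point of case 1.

#Iterate 1/V_n = 1/(alpha^2 V_{n-1} + H^2) + C^2/K^2 and compare
#the result with the fixed point V_inf (illustrative sketch)
alpha = 1.5; H = 1; C = 1; K = 2
V = 10  #arbitrary starting variance
for (n in 1:100) V = 1/(1/(alpha^2*V + H^2) + C^2/K^2)
d = (alpha^2 - 1)*K^2 - C^2*H^2
Vinf = (d + sqrt(d^2 + 4*alpha^2*C^2*K^2*H^2))/(2*alpha^2*C^2)
c(V, Vinf)  #both approximately 2.49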
4.3.2 R source code: Kalman Filter
This takes a series of signals related by disturbed linear recursion, and noisy observations of them, then plots the signal series, the mean and standard deviation series
for the estimations, and the differences between the observations. The differences are
taken to keep the plots in the same area as the signal and observation series.
#Script for creating noisy measurements of a process, and giving
#an estimate of same by filtering

##Convert any possibly time-constant values into length-n arrays,
##collected in a list
convert.to.array = function(n, ...) {
  args = list(...)
  for(a in 1:length(args)) {
    args[[a]] = array(args[[a]], n)
  }
  args
}

create.signal = function(n, meanX0, sdX0, A, g, H) {
  signal = array(0, n+1)
  convert = convert.to.array(n, A+1, g, H)
  alpha = convert[[1]]
  g = convert[[2]]
  H = convert[[3]]
  signal[1] = rnorm(1, meanX0, sdX0)
  for(step in 1:n) {
    signal[step+1] = rnorm(1, alpha[step]*signal[step]+g[step],
                           H[step])
  }
  signal
}

create.observations = function(n, signal, C, K) {
  convert = convert.to.array(n, 0, C, K)
  observations = convert[[1]]
  C = convert[[2]]
  K = convert[[3]]
  observations[1] = rnorm(1, C[1]*signal[2], K[1])
  for(step in 1:(n-1)) {
    observations[step+1] = rnorm(1, C[step+1]*signal[step+2],
                                 K[step+1])
  }
  observations
}

##Calculate the estimated mean and variance of possible values
create.estimates = function(Y, meanX0, sdX0, A, g, H, C, K) {
  n = length(Y)
  convert = convert.to.array(n, A+1, g, H, C, K)
  alpha = convert[[1]]
  g = convert[[2]]
  H = convert[[3]]
  C = convert[[4]]
  K = convert[[5]]
  est = array(0, c(n+1, 2))
  est[1,] = c(meanX0, sdX0^2)
  for(step in 1:n) {
    ##variance update: 1/V_n = 1/(alpha^2 V_{n-1} + H^2) + C^2/K^2
    est[step+1,2] = 1/(1/((alpha[step]^2)*est[step,2]
                     + H[step]^2) + (C[step]/K[step])^2)
    ##mean update: Z_n/V_n = (alpha Z_{n-1} + g)/(alpha^2 V_{n-1} + H^2)
    ##             + C (Y_n - Y_{n-1})/K^2
    check = (alpha[step]*est[step,1]
             + g[step])/((alpha[step]^2)*est[step,2] + H[step]^2)
    check2 = C[step]*Y[step]/(K[step]^2)
    est[step+1,1] = (check+check2)*est[step+1,2]
  }
  est[,2] = sqrt(est[,2])
  colnames(est) = c("mean", "sd")
  est
}

draw.filter = function(signal, Y, est) {
  ##Calculate one standard deviation to each side of the estimates
  estplus = est[,1]+est[,2]
  estminus = est[,1]-est[,2]
  plot(c(0,length(Y)), range(c(est[,1], estplus, estminus, signal-0.1,
       signal+0.1, Y)), type="n", xlab="Measurements", ylab="Estimate")
  lines(0:length(Y), est[,1])
  lines(0:length(Y), estplus, lty=2)
  lines(0:length(Y), estminus, lty=2)
  lines(0:length(Y), signal, lty=3)
  points(1:length(Y), Y, pch="+")
}

create.filter = function(n, K, meanX0=0, sdX0=1, A=-0.1, g=0,
                         H=1, C=1) {
  signal = create.signal(n, meanX0, sdX0, A, g, H)
  observations = create.observations(n, signal, C, K)
  est = create.estimates(observations, meanX0, sdX0, A, g,
                         H, C, K)
  draw.filter(signal, observations, est)
}

[Figure 4.5: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(12,1).]
[Figure 4.6: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(20,1).]

[Figure 4.7: Example graphic for measuring a random process in Program 4.3.2 with default values and noise variance Kn² = 1/n², create.filter(12,1/(1:12)).]

4.3.3 Example: Moving on a Line
Consider the optimality problem [6, Section 11.4] of an object moving on the line R with controllable velocity gn at each timestep, where we can only measure the position of the object with noise, and we must choose gn, based only on Fn−1 = {Y0, Y1, . . . , Yn−1}, to minimize
\[
E\Big(\sum_{n=0}^{N-1} g_n^2 + DX_N^2\Big)
\]
for a finite stopping time N and some D. Specifically,
we have a system
\[
X_{n+1} = X_n + g_n, \qquad Y_n = X_n + \epsilon_n,
\]
where L(ε_n) = N(0, 1). We thus have a Kalman filter, with
\[
\alpha = 1, \quad g = g_{n-1}, \quad H_n^2 = 0, \quad C = 1, \quad K_n^2 = 1.
\]
Our position estimate is then N(Zn, Vn) with
\[
\frac{1}{V_n} = \frac{1}{V_{n-1}} + 1, \qquad \frac{Z_n}{V_n} = \frac{Z_{n-1} + g_{n-1}}{V_{n-1}} + Y_n,
\]
\[
\therefore\ V_n = \frac{V_{n-1}}{V_{n-1} + 1} = \frac{V_0}{nV_0 + 1},
\]
\[
Z_n = \frac{Y_nV_{n-1} + (Z_{n-1} + g_{n-1})}{V_{n-1} + 1} = \frac{Y_nV_0 + (Z_{n-1} + g_{n-1})([n-1]V_0 + 1)}{nV_0 + 1}.
\]
In the absence of other information, we can use the above recursion by assuming we have no information at time n = 0, i.e. Z0 = z, 1/V0 = 0. Then
\[
Z_1 = Y_1, \quad V_1 = 1, \quad Z_n = \frac{Y_n + (n-1)(Z_{n-1} + g_{n-1})}{n}, \quad V_n = \frac{1}{n}.
\]
Let F(Z_k, V_k, k) = E(\sum_{n=k}^{N-1} g_n^2 + DX_N^2 \mid \{Z_k, V_k, Y_k\} = G_k). Then
\[
F(Z_k, V_k, k) = g_k^2 + E(F(Z_{k+1}, V_{k+1}, k+1) \mid G_k, g_k),
\]
\[
F(Z_N, V_N, N) = E(DX_N^2 \mid F_N) = D\,E(X_N^2 \mid G_N) = D\big(E(X_N^2 - Z_N^2 \mid G_N) + Z_N^2\big) = D(V_N + Z_N^2).
\]
Suppose we can write F(Z_{k+1}, V_{k+1}, k+1) = A_{k+1}Z_{k+1}^2 + B_{k+1}. Then we have
\[
\begin{aligned}
F(Z_k, V_k, k) &= g_k^2 + E(A_{k+1}Z_{k+1}^2 + B_{k+1} \mid G_k, g_k)\\
&= g_k^2 + A_{k+1}E\left(\left(\frac{Y_{k+1}V_k + Z_k + g_k}{V_k + 1}\right)^2 \,\middle|\, G_k, g_k\right) + B_{k+1}\\
&= g_k^2 + A_{k+1}E\left(\left(\frac{V_kX_k + Z_k + V_k\epsilon_{k+1} + (V_k + 1)g_k}{V_k + 1}\right)^2 \,\middle|\, G_k, g_k\right) + B_{k+1}\\
&= (A_{k+1} + 1)g_k^2 + 2A_{k+1}Z_kg_k + A_{k+1}Z_k^2 + \frac{A_{k+1}V_k^2}{V_k + 1} + B_{k+1}.
\end{aligned}
\]
Minimizing over g_k gives g_k = -\frac{A_{k+1}}{A_{k+1}+1}Z_k, so
\[
F(Z_k, V_k, k) = \frac{A_{k+1}}{A_{k+1}+1}Z_k^2 + B_{k+1} + \frac{A_{k+1}V_k^2}{V_k + 1} = A_kZ_k^2 + B_k,
\]
where
\[
A_k = \frac{A_{k+1}}{A_{k+1}+1} = \frac{D}{1 + (N-k)D},
\]
\[
\begin{aligned}
B_k &= B_{k+1} + \frac{A_{k+1}V_k^2}{V_k + 1} = B_{k+1} + \frac{DV_0^2}{(1 + (N-k-1)D)(1 + kV_0)(1 + (k+1)V_0)}\\
&= \frac{DV_0}{1 + NV_0} + DV_0^2\sum_{i=k}^{N-1}\frac{1}{(1 + iV_0)(1 + (i+1)V_0)(1 + (N-i-1)D)}.
\end{aligned}
\]
Thus, the optimality problem has the solution g_k = −DZ_k/(1 + (N − k)D), so the expression for Z_k becomes
\[
\begin{aligned}
Z_k &= \frac{Y_kV_0 + ([k-1]V_0 + 1)\frac{1 + (N-k)D}{1 + (N-k+1)D}Z_{k-1}}{kV_0 + 1}\\
&= \frac{1 + (N-k)D}{kV_0 + 1}\left(\frac{(k-1)V_0 + 1}{1 + (N-k+1)D}Z_{k-1} + \frac{V_0}{1 + (N-k)D}Y_k\right)\\
&= \frac{1 + (N-k)D}{kV_0 + 1}\left(\frac{Z_0}{1 + ND} + \sum_{i=1}^{k}\frac{V_0}{1 + (N-i)D}Y_i\right),
\end{aligned}
\]
and the expected final cost at time k is
\[
F(Z_k, V_k, k) = \frac{D}{1 + (N-k)D}Z_k^2 + \frac{DV_0}{1 + NV_0} + DV_0^2\sum_{i=k}^{N-1}\frac{1}{(1 + iV_0)(1 + (i+1)V_0)(1 + (N-i-1)D)}.
\]
4.3.4 R Source Code: Movement Problem
This gives examples of the results of the problem in Example 4.3.3, plotting the current
position, the mean and standard deviation of the estimated position, and the observed
position.
#Script for the movement problem
##Create arrays and starting values
create.start = function(n, meanX0, sdX0) {
signal = array(NA,n+1)
signal[1] = rnorm(1,meanX0,sdX0)
signal
}
create.observations = function(signal) {
##create array for observations for time 0 to n-1
n=length(signal)-1
observations = array(NA,n)
observations[1] = rnorm(1,signal[1],1)
observations
}
create.estimates = function(Y, meanX0, sdX0) {
n = length(Y)
est = array(NA, c(n, 2))
if (sdX0 == Inf) {
est[1,] = c(Y[1], 1)
}
else {
est[1,] = c((meanX0 + Y[1]*sdX0^2)/(1+sdX0^2),
sdX0^2/(1+sdX0^2))
}
colnames(est) = c("mean", "sd")
est
}
##Calculate values in next time step
create.move = function(k,n,signal,est,D) {
if (D==Inf) {g=-est[k,1]/(n-k+1)} else {
g = -D*est[k,1]/(1+(n-k+1)*D)
}
signal[k+1] = signal[k] + g
signal
}
update.observations = function(k,observations,signal) {
observations[k] = rnorm(1,signal[k],1)
observations
}
update.estimates = function(k,est,signal,Y,D) {
n = length(Y)
if (D==Inf) {g=-est[k,1]/(n-k+1)} else {
g = -D*est[k,1]/(1+(n-k+1)*D)
}
est[k+1,2] = est[k,2]/(est[k,2] + 1)
check = (est[k,1] + g)/est[k,2]
est[k+1,1] = (check+Y[k+1])*est[k+1,2]
est
}
draw.filter = function(signal,Y,est) {
##Calculate one standard deviation to each side of the estimates
estplus = est[,1]+est[,2]
estminus = est[,1]-est[,2]
plot(c(0,length(Y)), range(c(est[,1],estplus,estminus,signal,Y,0)),
type="n", xlab="Measurements", ylab="Estimate")
lines(0:(length(Y)-1),est[,1])
lines(0:(length(Y)-1),estplus,lty=2)
lines(0:(length(Y)-1),estminus,lty=2)
lines(0:length(Y),signal,lty=3)
points(0:(length(Y)-1),Y,pch="+")
}
create.filter = function(n, meanX0=0, sdX0=1, D=1) {
if (sdX0 == Inf) {
##using Inf normally would give "Not a Number" errors
signal = create.start(n,meanX0,0)
}
else signal = create.start(n, meanX0, sdX0)
observations = create.observations(signal)
est = create.estimates(observations, meanX0, sdX0)
for(k in 1:(n-1)) {
signal = create.move(k,n,signal,est,D)
observations = update.observations(k+1,observations,signal)
est = update.estimates(k,est,signal,observations,D)
}
signal = create.move(n,n,signal,est,D)
est[,2] = sqrt(est[,2])
draw.filter(signal, observations, est)
}
[Figure 4.8: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12).]
[Figure 4.9: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12).]
[Figure 4.10: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D = ∞, create.filter(12,D=Inf).]
[Figure 4.11: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D = 0, create.filter(12,D=0), obviously resulting in zero movement.]
[Figure 4.12: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 10 with infinite starting variance, create.filter(12,meanX0=10,sdX0=Inf).]
[Figure 4.13: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 0 with infinite starting variance, create.filter(12,sdX0=Inf); we then still have movement due to the inaccurate observations.]
4.3.5 Extension for Multiple Processes and Observations
The above form of the Kalman filter is satisfactory, but it causes problems if we have more than one variable, or more than one observation: in either case we also need to consider covariances between the different approximations, so we need a more sophisticated model. For this reason, the literature on the Kalman filter usually writes the equations in matrix form; we therefore extend the basic Kalman filter above to the case where we have v variables we wish to approximate and m observations.
The aforementioned literature usually splits the Kalman filter into two steps: the time update, or prediction step, and the measurement update, or correction step [7]. This makes the calculations more digestible, especially when using matrices, so we first show that our solution for the simple case can be written in this form.
Rewriting the Simple Case
We first write our solution in terms of Vn and Zn instead of 1/Vn and Zn/Vn:
\[
\frac{1}{V_n} = \frac{1}{\alpha_n^2V_{n-1} + H_n^2} + \frac{C_n^2}{K_n^2} = \frac{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
V_n = \frac{K_n^2(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
\frac{Z_n}{V_n} = \frac{\alpha_nZ_{n-1} + g_n}{\alpha_n^2V_{n-1} + H_n^2} + \frac{C_n(Y_n - Y_{n-1})}{K_n^2} = \frac{K_n^2(\alpha_nZ_{n-1} + g_n) + C_n(Y_n - Y_{n-1})(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2(\alpha_n^2V_{n-1} + H_n^2)},
\]
\[
Z_n = \frac{K_n^2(\alpha_nZ_{n-1} + g_n) + C_n(Y_n - Y_{n-1})(\alpha_n^2V_{n-1} + H_n^2)}{K_n^2 + C_n^2(\alpha_n^2V_{n-1} + H_n^2)}.
\]
These expressions, and those that follow, frequently use the terms E(Xn | Fn−1) = αnZn−1 + gn and E(Vn | Fn−1) = αn²Vn−1 + Hn²: for convenience, we write these as Z̃n and Ṽn respectively.
We then write the equation for Zn as
\[
Z_n = \frac{K_n^2\tilde Z_n + C_n(Y_n - Y_{n-1})\tilde V_n}{K_n^2 + C_n^2\tilde V_n} = \tilde Z_n + \frac{C_n\tilde V_n(Y_n - Y_{n-1} - C_n\tilde Z_n)}{K_n^2 + C_n^2\tilde V_n} = \tilde Z_n + \phi_nr_n,
\]
where φn = CnṼn/(Kn² + Cn²Ṽn) is previsible, and is often called the gain, blending, or Kalman gain factor [1, 7]; it can be thought of as an indication of how much we value the new information compared to our old estimate. Additionally, the term rn = Yn − Yn−1 − CnZ̃n = Yn − E(Yn | Fn−1) is the difference between the actual and expected measurement Yn, and is often called the residual or innovation [7].
The gain factor has the property Cnφn + Kn²/(Kn² + Cn²Ṽn) = 1, so we can write the system as
\[
\tilde Z_n = \alpha_nZ_{n-1} + g_n, \quad \tilde V_n = \alpha_n^2V_{n-1} + H_n^2, \quad \phi_n = \frac{C_n\tilde V_n}{K_n^2 + C_n^2\tilde V_n};
\]
\[
r_n = Y_n - Y_{n-1} - C_n\tilde Z_n, \quad Z_n = \tilde Z_n + \phi_nr_n, \quad V_n = (1 - C_n\phi_n)\tilde V_n.
\]
Matrix Form
Say we wish to estimate v variables from m observations. The variables can now depend on each other, so for the ith variable Xi,n we have dynamics equation
\[
X_{i,n} - X_{i,n-1} = \sum_{j=1}^{v} A_{i,j,n}X_{j,n-1} + g_{i,n} + \nu_{i,n},
\]
and for the ith observation Yi,n we have equation
\[
Y_{i,n} - Y_{i,n-1} = \sum_{j=1}^{v} C_{i,j,n}X_{j,n} + \epsilon_{i,n}.
\]
In matrix form, these equations become
\[
x_n - x_{n-1} = A_nx_{n-1} + g_n + \nu_n, \qquad y_n - y_{n-1} = C_nx_n + \epsilon_n.
\]
Additionally, the independence conditions for x_n, ν_{i,n}, and ε_{i,n} are given by the equations [1, Section 7.1]
\[
E(\nu_nx_m^T) = 0\ \ \forall n > m, \qquad E(\epsilon_nx_m^T) = 0\ \ \forall n, m,
\]
\[
E(\epsilon_n\nu_m^T) = E(\nu_n\nu_m^T) = E(\epsilon_n\epsilon_m^T) = 0 \quad \forall n \ne m,
\]
where x^T denotes the usual transpose of a vector, and the distribution of the noise variables is given by the covariance matrix
\[
E\left(\begin{pmatrix}\nu_n\\ \epsilon_n\end{pmatrix}\begin{pmatrix}\nu_n\\ \epsilon_n\end{pmatrix}^T\right) = \begin{pmatrix}H_n^2 & J_n^2\\ (J_n^2)^T & K_n^2\end{pmatrix}.
\]
Usually the two noise variables are independent, so J_n^2 = 0.
If we assume that we have an estimate at time n − 1 of mean z_{n−1} and variance V_{n−1}, then we have
\[
E(x_n \mid F_{n-1}) = \alpha_nz_{n-1} + g_n, \qquad E(y_n \mid F_{n-1}) = y_{n-1} + C_n(\alpha_nz_{n-1} + g_n),
\]
where αn = An + I. Let ξn = xn − E(xn | Fn−1) = αn(xn−1 − zn−1) + νn, and χn = yn − E(yn | Fn−1) = Cn(xn − αnzn−1 − gn) + εn = Cn(αn(xn−1 − zn−1) + νn) + εn = Cnξn + εn. Let Ṽn = Var(ξn) = Hn² + αnVn−1αnᵀ; we can then write the covariance matrix
\[
\mathrm{Cov}\begin{pmatrix}\xi_n\\ \chi_n\end{pmatrix} = \begin{pmatrix}\tilde V_n & J_n^2 + \tilde V_nC_n^T\\ (J_n^2)^T + C_n\tilde V_n & K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T\end{pmatrix}.
\]
Lemma 4.3.4 If two random variables x, y are normally distributed with mean 0 and symmetric covariance matrix
\[
\mathrm{Cov}\begin{pmatrix}x\\ y\end{pmatrix} = \begin{pmatrix}V_{xx} & V_{xy}\\ V_{yx} & V_{yy}\end{pmatrix},
\]
where V_{yy} is non-singular, then
\[
E(x \mid y) = V_{xy}V_{yy}^{-1}y, \qquad \mathrm{Var}(x \mid y) = V_{xx} - V_{xy}V_{yy}^{-1}V_{yx}.
\]
Proof: The term [6] x − VxyVyy⁻¹y is linear in x, y, so is normally distributed. Additionally, E((x − VxyVyy⁻¹y)yᵀ) = 0, so it is independent of y, and VxyVyy⁻¹y is the conditional expectation of x given y. So we can say that E(x | y) = VxyVyy⁻¹y, and that
\[
\begin{aligned}
\mathrm{Cov}(x - V_{xy}V_{yy}^{-1}y \mid y) &= \mathrm{Cov}(x - V_{xy}V_{yy}^{-1}y)\\
&= E\big(xx^T - xy^TV_{yy}^{-T}V_{xy}^T - V_{xy}V_{yy}^{-1}yx^T + V_{xy}V_{yy}^{-1}yy^TV_{yy}^{-T}V_{xy}^T\big)\\
&= V_{xx} - V_{xy}V_{yy}^{-1}V_{yx} - V_{xy}V_{yy}^{-1}V_{yx} + V_{xy}V_{yy}^{-1}V_{yy}V_{yy}^{-1}V_{yx}\\
&= V_{xx} - V_{xy}V_{yy}^{-1}V_{yx}. \qquad \square
\end{aligned}
\]
ξn and χn have mean 0, so by the above lemma, and by the fact that E(ξn | χn) = zn − z̃n and Var(ξn | χn) = Vn, we can now say that
\[
z_n = \tilde z_n + (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}(y_n - E(y_n \mid F_{n-1})),
\]
\[
V_n = \tilde V_n - (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}((J_n^2)^T + C_n\tilde V_n).
\]
We have now found the gain factor and the innovation,
\[
\phi_n = (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1}, \qquad r_n = y_n - E(y_n \mid F_{n-1}),
\]
and can more simply write the above as
\[
z_n = \tilde z_n + \phi_nr_n, \qquad V_n = \tilde V_n - \phi_n((J_n^2)^T + C_n\tilde V_n).
\]
We can now write the complete form of the filter,
\[
\tilde z_n = \alpha_nz_{n-1} + g_n, \qquad \tilde V_n = H_n^2 + \alpha_nV_{n-1}\alpha_n^T,
\]
\[
\phi_n = (J_n^2 + \tilde V_nC_n^T)(K_n^2 + C_nJ_n^2 + (J_n^2)^TC_n^T + C_n\tilde V_nC_n^T)^{-1},
\]
\[
r_n = y_n - y_{n-1} - C_n\tilde z_n, \qquad z_n = \tilde z_n + \phi_nr_n, \qquad V_n = \tilde V_n - \phi_n((J_n^2)^T + C_n\tilde V_n).
\]
Note: Usually Jn² = 0: this gives φn = ṼnCnᵀ(Kn² + CnṼnCnᵀ)⁻¹ and Vn = (I − φnCn)Ṽn, and we then have the same form as the simple case.
Example: Moving Under Gravity of Unknown Force
Say we are measuring the position of a particle moving under gravity with observation noise, where the particle begins at zero height, but the initial velocity W and the acceleration g due to gravity are not known exactly [5, Section 7.2, with a different solution method and noisy timesteps]. The current position, Xn, follows the recursion
\[
X_n = Wn - \tfrac{1}{2}gn^2 = X_{n-1} + W + g\big(\tfrac{1}{2} - n\big).
\]
The signal is then the vector of the position, initial velocity and acceleration, and follows the recursion
\[
x_n = \begin{pmatrix}X_n\\ W\\ g\end{pmatrix} = \begin{pmatrix}1 & 1 & \tfrac{1}{2} - n\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}x_{n-1} = \alpha_nx_{n-1}.
\]
Our only observation is of the current position, so we have measurement recursion
\[
y_n - y_{n-1} = \begin{pmatrix}1 & 0 & 0\end{pmatrix}x_n + \epsilon_n = C_nx_n + \epsilon_n,
\]
where εn ∼ N(0, Kn²). Say for timestep n − 1 we have estimate mean and variance
\[
z_{n-1} = \begin{pmatrix}z_{n-1}\\ W_{n-1}\\ g_{n-1}\end{pmatrix}, \qquad V_{n-1} = \begin{pmatrix}V_{n-1,1,1} & V_{n-1,1,2} & V_{n-1,1,3}\\ V_{n-1,2,1} & V_{n-1,2,2} & V_{n-1,2,3}\\ V_{n-1,3,1} & V_{n-1,3,2} & V_{n-1,3,3}\end{pmatrix};
\]
then we have gain factor
\[
\phi_n = \frac{1}{K_n^2 + C_n\tilde V_nC_n^T}\tilde V_nC_n^T = \frac{1}{K_n^2 + \tilde V_{n,1,1}}\tilde V_nC_n^T,
\]
and innovation
\[
r_n = y_n - y_{n-1} - \tilde z_n.
\]
The estimate mean then has form
\[
z_n = \tilde z_n + \phi_nr_n = \tilde z_n + \frac{1}{K_n^2 + \tilde V_{n,1,1}}\tilde V_nC_n^T(y_n - y_{n-1} - \tilde z_n).
\]
Looking at the components zn = Cnzn, Wn = (0 1 0)zn, and gn = (0 0 1)zn separately, we can see that
\[
z_n = \frac{K_n^2\tilde z_n + \tilde V_{n,1,1}(y_n - y_{n-1})}{K_n^2 + \tilde V_{n,1,1}}, \qquad
W_n = W_{n-1} + \frac{\tilde V_{n,2,1}(y_n - y_{n-1} - \tilde z_n)}{K_n^2 + \tilde V_{n,1,1}}, \qquad
g_n = g_{n-1} + \frac{\tilde V_{n,3,1}(y_n - y_{n-1} - \tilde z_n)}{K_n^2 + \tilde V_{n,1,1}}.
\]
Additionally the estimate variance has form
\[
V_n = (I - \phi_nC_n)\tilde V_n = \left(I - \frac{1}{K_n^2 + \tilde V_{n,1,1}}\begin{pmatrix}\tilde V_{n,1,1} & 0 & 0\\ \tilde V_{n,2,1} & 0 & 0\\ \tilde V_{n,3,1} & 0 & 0\end{pmatrix}\right)\tilde V_n.
\]
From this we can see that Vn,1,2, Vn,1,3 can only be non-zero if they are non-zero in the prior estimate Ṽn, and that the sub-matrix
\[
\begin{pmatrix}\tilde V_{n,2,2} & \tilde V_{n,2,3}\\ \tilde V_{n,3,2} & \tilde V_{n,3,3}\end{pmatrix}
\]
will stay the same if Ṽ0,1,2 = Ṽ0,1,3 = 0. However, in Ṽn = αnVn−1αnᵀ we have Ṽn,1,2 = Vn−1,1,2 + Vn−1,2,2 + (½ − n)Vn−1,3,2, and similarly for Ṽn,1,3, so unless we definitely know W or g already we will use all parts of the covariance matrix.
4.3.6 Comments
1. The dynamics for the progression of the signal, or the observation, can be nonlinear. In this case, the estimates are calculated by taking partial derivatives of
the recursion functions, as in a Taylor series: see [7] for more information. An
example would be a digital meter, with measurement noise dependent on the
size of X or Y as the meter switches between scales of measurement.
2. If either noise has a non-zero mean, we can simply adjust the noise by including
its mean in gn or E(Yn |Fn−1 ).
3. In the case Cn = 0, the observations have no dependence on Xn, so the signal is not observable, and the filter simply returns the predictions Zn = Z̃n and Vn = Ṽn. In the matrix form, this is equivalent to Cn being singular, and so at least one component of r n is not observable: see [1, 6] for more on observability.
4. The signal at time n, Xn , can also be approximated given Fm for m > n. This
is referred to as smoothing: for more information see [1, Chapter 9].
Appendix A
Appendices
Theories about probability, expectations, and so on are derived from the more abstract
field of measure theory. We therefore give a brisk summary of the important results
and definitions from the first few chapters of [8].
A.1 Measures
Definition A.1.1 For a set S and a collection Σ of subsets of S, we say Σ is a σ-algebra on S if
\[
S \in \Sigma, \qquad F \in \Sigma \Rightarrow F^c = S \setminus F \in \Sigma,
\]
and, for any sequence of subsets (F_n)_{n\in\mathbb{N}} with F_n ∈ Σ,
\[
\bigcup_n F_n \in \Sigma.
\]
We say F ∈ Σ is a Σ-measurable subset of S.
Definition A.1.2 We say σ(Σ) is the σ-algebra generated by Σ, where σ(Σ) is
the intersection of all σ-algebras with Σ as a subset.
Example A.1.3 (Borel σ-algebras) The Borel σ-algebra B(S) on the space S is generated by the family of open subsets of S. B(R), often written as just B, thus contains almost every subset of R met in practice, though not every subset of R. Similarly, B((0, 1]) is often written as B(0, 1].
Definition A.1.4 We say a function µ : Σ → [0, ∞] on a σ-algebra Σ of the set S is a measure if µ(∅) = 0, where ∅ is the empty set, and, for any sequence of disjoint subsets (F_n ∈ Σ)_{n\in\mathbb{N}},
\[
\mu\Big(\bigcup_{n\in\mathbb{N}} F_n\Big) = \sum_{n\in\mathbb{N}} \mu(F_n).
\]
Furthermore, it is a probability measure if µ(S) = 1.
Example A.1.5 The Lebesgue measure on B(0, 1], λ : B(0, 1] → [0, 1], has the form
\[
\lambda(a, b) = b - a, \qquad \lambda\Big(\bigcup_{n\in\mathbb{N}}(a_n, b_n)\Big) = \sum_{n\in\mathbb{N}}(b_n - a_n)
\]
for disjoint intervals (a_n, b_n). The Lebesgue measure is thus a general measure of length.
Definition A.1.6 We say I is a π-system on S if it is a family of subsets of S that is stable under finite intersections.
Theorem A.1.7 (Carathéodory's Extension Theorem) Let Σ be the σ-algebra generated by the algebra Σ0. Then, for a countably additive map µ0 : Σ0 → [0, ∞], there exists a measure µ : Σ → [0, ∞] such that µ = µ0 on Σ0. So, we can extend a measure to the generated σ-algebra.
Example A.1.8 The Lebesgue measure on B(0, 1] can be extended to B[0, 1] by saying
that λ{0} = 0. We can also extend to B, so we can measure length on the whole of R.
A.2 Events
Definition A.2.1 We say a probability triple is a measure space (Ω, F, P), where Ω is the sample space, ω ∈ Ω is a sample point, the σ-algebra F is the family of events – so that an event is an F-measurable subset of Ω – and P is a probability measure on (Ω, F).
Definition A.2.2 For a sequence of events (E_n)_{n\in\mathbb{N}},
\[
(E_n, \text{i.o.}) = (E_n \text{ infinitely often}) = \limsup E_n = \bigcap_m \bigcup_{n\ge m} E_n = \{\omega \mid \forall m\ \exists n(\omega) \ge m \text{ s.t. } \omega \in E_{n(\omega)}\} = \{\omega \mid \omega \in E_n \text{ for infinitely many } n\},
\]
\[
(E_n, \text{ev.}) = (E_n \text{ eventually}) = \liminf E_n = \bigcup_m \bigcap_{n\ge m} E_n = \{\omega \mid \exists m(\omega) \text{ s.t. } \omega \in E_n\ \forall n \ge m(\omega)\} = \{\omega \mid \omega \in E_n \text{ for all large } n\}.
\]
Theorem A.2.3 (Fatou’s Lemma) P(lim inf En ) ≤ lim inf P(En ).
Theorem A.2.4 (Reverse Fatou Lemma) For a finite measure P, P(lim sup En) ≥ lim sup P(En).
Theorem A.2.5 (First Borel–Cantelli Lemma) For events (E_n)_{n\in\mathbb{N}},
\[
\sum_n P(E_n) < \infty \Rightarrow P(\limsup E_n) = P(E_n, \text{i.o.}) = 0.
\]
A.3 Random Variables
Definition A.3.1 A function f : S → R is Σ-measurable if f⁻¹ : B → Σ. mΣ is the set of all Σ-measurable functions on S, and mΣ⁺ is the set of all non-negative elements of mΣ.
Definition A.3.2 For sample space Ω and σ-algebra F, a random variable X is an F-measurable function X : Ω → R, i.e. X⁻¹ : B → F, where B is as defined in Example A.1.3.
Definition A.3.3 For a random variable X, the law LX of X is LX = P ◦ X −1 ,
LX : B 7→ [0, 1]. Then LX is a probability measure on (R, B).
Definition A.3.4 For a random variable X, the distribution function of X is the function FX : R → [0, 1], where
\[
F_X(c) = \mathcal{L}_X\big((-\infty, c]\big) = P(X \le c) = P\{\omega \mid X(\omega) \le c\}.
\]
Theorem A.3.5 (Monotone-Class Theorem) Let H be a class of bounded functions S → R with the following conditions:
i) H is a vector space over R,
ii) the constant function 1 ∈ H,
iii) for non-negative functions (fn ∈ H)n∈N with fn ↑ f and f bounded, f ∈ H.
Then, if H contains the indicator function of every set in a π-system I, it contains every bounded σ(I)-measurable function on S.
A.4 Independence
Definition A.4.1 Sub-σ-algebras A_n of F are independent if, for all a_i ∈ A_i (i ∈ N) and distinct i_j for j = 1 to n,
\[
P(a_{i_1} \cap a_{i_2} \cap \dots \cap a_{i_n}) = \prod_{k=1}^{n} P(a_{i_k}).
\]
Definition A.4.2 Random variables X1 , X2 , . . . are independent if σ-algebras
σ(X1 ), σ(X2 ), . . .
are independent.
Definition A.4.3 Events E1 , E2 , . . . are independent if the σ-algebras E1 , E2 , . . . are
independent, where En is the σ-algebra {∅, En , Ω \ En , Ω}.
Theorem A.4.4 (Second Borel–Cantelli Lemma) For a sequence of independent events (E_n)_{n\in\mathbb{N}},
\[
\sum_n P(E_n) = \infty \Rightarrow P(E_n, \text{i.o.}) = P(\limsup E_n) = 1.
\]
Proof: We have
\[
(\limsup E_n)^c = \liminf E_n^c = \bigcup_m \bigcap_{n\ge m} E_n^c.
\]
We then have, by independence,
\[
P\Big(\bigcap_{n\ge m} E_n^c\Big) = \prod_{n\ge m}(1 - P(E_n)) \le \prod_{n\ge m} e^{-P(E_n)} = \exp\Big(-\sum_{n\ge m} P(E_n)\Big) = 0,
\]
using 1 − x ≤ e⁻ˣ and the divergence of the sum. So P(lim inf Enᶜ) = 0, and thus P(lim sup En) = 1. □

A.5 Integration
Definition A.5.1 We say that the Lebesgue integral of f with respect to the measure µ is µ(f) = ∫ f dµ, and that
\[
\mu(f, A) = \int_A f(s)\,\mu(ds) = \mu(fI_A), \qquad s \in S,\ A \in \Sigma.
\]
The integral is linear in f.
Definition A.5.2 f ∈ mΣ⁺ is simple if it can be written as a weighted sum of indicator functions, i.e. f = \sum_{k=1}^{m} a_kI_{A_k} for some a_k ≥ 0 and some A_k ∈ Σ. We then write f ∈ SF⁺.
We can assume that the A_k in the above definition are disjoint, since
\[
a_1I_{A_1} + a_2I_{A_2} = a_1I_{A_1\cap A_2^c} + (a_1 + a_2)I_{A_1\cap A_2} + a_2I_{A_1^c\cap A_2}.
\]
Definition A.5.3 For a subset A ∈ Σ, we define µ0(I_A) = µ(A) ≤ ∞, where µ0 is a naive integral defined for simple functions. For f ∈ SF⁺ we define
\[
\mu_0(f) = \sum_{k=1}^{m} a_k\mu(A_k) \le \infty.
\]
Definition A.5.4 For f ∈ mΣ⁺ we define µ(f) = sup{µ0(h) | h ∈ SF⁺, h ≤ f} ≤ ∞. So we can take the integral of a non-negative function as the supremum of the integrals of simple functions below it.
Theorem A.5.5 (Monotone-Convergence Theorem) For a sequence of functions (f_n ∈ mΣ⁺)_{n\in\mathbb{N}},
\[
f_n \uparrow f \Rightarrow \mu(f_n) \uparrow \mu(f) \le \infty.
\]
So the integral of fn tends to the integral of f.
Definition A.5.6 The rth staircase function a⁽ʳ⁾ : [0, ∞] → [0, ∞] is defined as
\[
a^{(r)}(x) = \begin{cases} 0 & x = 0,\\ (i-1)2^{-r} & (i-1)2^{-r} < x \le i2^{-r} \le r,\ i \in \mathbb{N},\\ r & x > r. \end{cases}
\]
The functions f⁽ʳ⁾ = a⁽ʳ⁾ ∘ f are simple functions, with f⁽ʳ⁾ ↑ f. By the Monotone-Convergence Theorem (Theorem A.5.5), we now have
\[
\mu(f) = \uparrow\lim_r \mu(f^{(r)}) = \uparrow\lim_r \mu_0(f^{(r)}).
\]
Since the a⁽ʳ⁾ are left-continuous, we also have fn ↑ f ⇒ a⁽ʳ⁾(fn) ↑ a⁽ʳ⁾(f).
Theorem A.5.7 (Fatou's Lemma) For (f_n ∈ mΣ⁺)_{n\in\mathbb{N}},
\[
\mu(\liminf f_n) \le \liminf \mu(f_n), \quad\text{i.e.}\quad \int \liminf f_n \le \liminf \int f_n.
\]
Theorem A.5.8 (Reverse Fatou's Lemma) For (f_n ∈ mΣ⁺)_{n\in\mathbb{N}} with fn ≤ g for some g ∈ mΣ⁺ with µ(g) < ∞,
\[
\mu(\limsup f_n) \ge \limsup \mu(f_n).
\]
Definition A.5.9 L¹(S, Σ, µ) is the set of µ-integrable functions, f ∈ mΣ such that µ(|f|) < ∞.
Notation A.5.10 We let f⁺(s) = max(f(s), 0), f⁻(s) = max(−f(s), 0). Then we have
\[
f = f^+ - f^-, \qquad |f| = f^+ + f^-.
\]
Note: This means that
\[
\int f\,d\mu = \mu(f) = \mu(f^+) - \mu(f^-), \qquad \int |f|\,d\mu = \mu(|f|) = \mu(f^+) + \mu(f^-).
\]
So we immediately get µ(f) ≤ µ(|f|), with equality if and only if f is non-negative almost everywhere.
Note: Since f⁺, f⁻ ∈ mΣ⁺, and the integral is linear, we can extend the definition of the integral µ0 to the set of measurable functions mΣ.
Theorem A.5.11 (Dominated-Convergence Theorem) Take functions fn ∈ mΣ with fn → f almost everywhere, and
\[
|f_n(s)| \le g(s)
\]
for some g ∈ L¹(S, Σ, µ)⁺. Then f ∈ L¹(S, Σ, µ), and µ(|fn − f|) → 0, so µ(fn) → µ(f).
Theorem A.5.12 (Scheffé’s Lemma) Take fn , f ∈ L1 (S, Σ, µ), with fn → f almost everywhere. Then
µ(|fn − f |) → 0 iff µ(|fn |) → µ(|f |).
Method A.5.13 (Standard Machine) A method for proving a linear result is true
for all functions in a space.
i) Show the result is true for indicator functions.
ii) By linearity, show the result is true for functions in SF + .
iii) Use the Monotone-Convergence Theorem to show the result is true for functions
in mΣ+ .
iv) Write h = h+ − h− and use linearity to show the result is true for measurable
functions.
Definition A.5.14 For an F-measurable function f, we define the measure fµ by fµ(A) = µ(f, A).
Definition A.5.15 A measure µ(A) = ∫ I_A f dν, also written µ = f ◦ ν, has density f relative to ν. We then write
\[
\frac{d\mu}{d\nu} = f.
\]
Lemma A.5.16 If dµ/dν = f, and g is an F-measurable function, then fg is also F-measurable, and
\[
\int g\,d\mu = \int gf\,d\nu.
\]
Proof: By the standard machine [Own Proof], Method A.5.13.
i) For g an indicator function I_A, ∫ I_A dµ = µ(A) = ∫ I_A f dν, by the definition of the density.
ii) For g a simple function, g = Σ_{k=1}^m g_kI_{A_k}, so ∫ g dµ = Σ g_k ∫ I_{A_k} dµ and ∫ gf dν = Σ g_k ∫ I_{A_k}f dν, which are equal by part i).
iii) For g a non-negative F-measurable function, we can define g as the limit of a sequence (g_n)_{n∈N} of simple functions by using the staircase function from Definition A.5.6. By part ii) we have µ(g_n, A) = ν(g_nf, A), so by the Monotone-Convergence Theorem (A.5.5), µ(g, A) = ν(gf, A), which is equivalent to ∫ gI_A dµ = ∫ gI_Af dν.
iv) For g an F-measurable function, we apply part iii) to g⁺ and g⁻. We then have the result by linearity: g = g⁺ − g⁻ ⇒ µ(g, A) = µ(g⁺, A) − µ(g⁻, A) = ν(g⁺f, A) − ν(g⁻f, A) = ν(gf, A). □
Corollary A.5.17 If ∫ f dν = 1, then µ is a probability measure.
Bibliography
[1] Donald E. Catlin. Estimation, Control, and the Discrete Kalman Filter. Springer,
1989.
[2] Roger Mansuy. Histoire de martingales. Mathématiques & Sciences Humaines,
(169):105–113, 2005.
[3] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized
Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[4] Bernt Øksendal. Stochastic Differential Equations. Springer, 2000.
[5] Albert Tarantola. Inverse Problem Theory. Society for Industrial and Applied
Mathematics, 2005.
[6] Richard Weber. Optimization and control. Lecture notes for the Optimization and
Control course at Cambridge, 2010.
[7] Greg Welch and Gary Bishop. An introduction to the Kalman filter. Technical
Report TR 95-041, University of North Carolina, Department of Computer Science,
July 2006. Introductory article that also discusses the case of nonlinear systems.
[8] David Williams. Probability with Martingales. Cambridge University Press, 1991.