Probabilistic Models and Data Analysis

Lecture Notes

Verena Wolf ([email protected])

February 9, 2016

Contents

1 Probabilities and Discrete Random Variables
  1.1 Probabilities
  1.2 Discrete Random Variables
      1.2.1 Expectation and Variance
      1.2.2 Higher Moments, Covariance and Correlation
  1.3 Important Discrete Probability Distributions
  1.4 Exercises

2 Continuous Random Variables and Laws of Large Numbers
  2.1 σ-algebras
  2.2 Continuous random variables
  2.3 Important continuous distributions
  2.4 Laws of Large Numbers
      2.4.1 Chebyshev's inequality
      2.4.2 Weak Law of Large Numbers
      2.4.3 Strong Law of Large Numbers
      2.4.4 Central Limit Theorem
  2.5 Exercises

3 Discrete-time Markov Chains
  3.1 Discrete-Time Markov Chains
      3.1.1 Transient distribution
      3.1.2 Equilibrium distribution
  3.2 Case Study: PageRank
  3.3 Case Study: DNA Methylation
  3.4 From DTMCs to CTMCs
  3.5 Exercises

4 Hidden Markov Models
  4.1 Computing the probability of outputs
  4.2 Most likely hidden sequence of states
  4.3 Exercises

5 Computer Simulations and Monte Carlo Methods
  5.1 Generating random variates
  5.2 Monte Carlo simulation of DTMCs
  5.3 Output Analysis
      5.3.1 Interval Estimation
  5.4 Exercises

6 Parameter Estimation
  6.1 Method of moments
  6.2 Generalized method of moments
  6.3 Maximum likelihood method
  6.4 Estimation of Variances
  6.5 Estimation of Parameters of HMMs
  6.6 Exercises

7 Statistical Testing
  7.1 Level α tests: general approach
  7.2 Standard Normal null distribution (Z-test)
  7.3 T-tests for unknown σ
  7.4 p-Value
  7.5 Chi-square tests
      7.5.1 Estimated variance
      7.5.2 Observed Counts
      7.5.3 Testing independence
  7.6 Multiple Testing
      7.6.1 Bonferroni correction
      7.6.2 Holm-Bonferroni method
      7.6.3 Benjamini-Hochberg procedure
  7.7 Exercises
Preface

These lecture notes are based on a course held at Saarland University in 2015. We use hyperlinks to Wikipedia pages to encourage students to explore the context of this course further. However, we would like to point out that Wikipedia entries are mostly written by laymen and should not be used as a primary information source.

Chapter 1

Probabilities and Discrete Random Variables

"I already know what a point, a right angle, a circle is before my first geometry lesson; I just cannot yet make it precise. In the same way, I already know what probability is before I have defined it." (Hans Freudenthal)

This chapter gives a short primer on probabilities, as well as on discrete and continuous random variables.

1.1 Probabilities

Let us consider chance experiments with a countable number of possible outcomes ω1, ω2, . . .. The set Ω = {ω1, ω2, . . .} of all outcomes is called the sample space. Subsets of Ω are called events, and by 2^Ω we denote the set of all events.

Example 1: Rolling a die and tossing a coin
If we roll a die, the set of possible outcomes is Ω = {1, 2, . . . , 6}. The event "number is even" is given by E = {2, 4, 6}. If we toss a coin and count the number of trials until a head turns up for the first time, Ω = {1, 2, . . .}. The event "number of trials greater than 10" is given by E = {11, 12, . . .}.

We define what a probability is using Kolmogorov's axioms.

Definition 1: Probability
Assume Ω is a discrete (i.e. finite or countably infinite) and non-empty sample space. Let P be a function such that P : 2^Ω → [0, 1]. The value P(A) is called the probability of event A if P is such that
1. P(Ω) = 1,
2. for pairwise disjoint events A1, A2, . . . it holds that

   P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . . .

When we reason about the probability of certain events, we can use many arguments from set theory. For instance, if the events A, B, and C are as illustrated in Figure 1.1, it holds that P(A ∪ B) = P(A) + P(B) and P(A ∪ C) = P(A) + P(C) − P(A ∩ C).

[Figure 1.1: Using set arguments for the calculation of event probabilities.]

Example 2: Rolling a die
The probability of the event {1, 2, . . . , 6} is one. Moreover, for each n ∈ {1, 2, . . . , 6} it holds that P({n}) = 1/6. Thus, P({2, 4, 6}) = 1/6 + 1/6 + 1/6 = 1/2. Similarly, P({1, 3, 5}) = 1 − P({2, 4, 6}) = 1/2.

Example 3: Tossing a coin until heads for the first time
The probabilities of the events "exactly n trials" and "more than three trials" are P({n}) = (1/2)^n and

   P({4, 5, . . .}) = P(Ω \ {1, 2, 3}) = 1 − (1/2 + 1/4 + 1/8) = 1/8.

Note that P(Ω) = Σ_{ω∈Ω} P({ω}) = 1/2 + 1/4 + 1/8 + . . . = 1.

The triple (Ω, 2^Ω, P) is called a discrete probability space.

Definition 2: Conditional Probability
Let A, B be events and P(B) > 0. Then

   P(A|B) := P(A ∩ B) / P(B)

is called the probability of A under the condition B.

Clearly, this implies that P(A ∩ B) = P(A|B) · P(B). It is easy to show that the value P(A|B) is a probability in the sense of Definition 1.
Example 4: Lung Cancer
We define the events
• A: person gets lung cancer, with P(A) = 72/200000 = 0.00036,
• B: person is a smoker, with P(B) = 0.25.
Of the people that get lung cancer, 90% are smokers. The experiment consists in choosing a person at random. For the probability of getting lung cancer under the condition of being a smoker, we calculate

   P(A|B) = P(A ∩ B)/P(B) = P(B|A) · P(A)/P(B) = 0.9 · 0.00036/0.25 = 0.001296.

For the probability of getting lung cancer under the condition of not being a smoker, we get

   P(A|B̄) = P(A ∩ B̄)/P(B̄) = P(B̄|A) · P(A)/P(B̄) = (1 − 0.9) · 0.00036/0.75 = 0.000048.

Thus, the chance of getting lung cancer is about 27 times higher for smokers compared to non-smokers.

Definition 3: Independence
Let 0 < P(B) < 1. The event A is called independent of B if

   P(A|B) = P(A|B̄).

Thus, the event A has the same probability no matter whether we condition on B or on B̄. Our intuition tells us that in this case, event B "has nothing to do" with the event A. E.g. you roll two dice, one red and one blue. Event A is "red die gives six" and event B is "blue die gives an even number". Then, clearly, A is independent of B and vice versa. However, the following example shows that independence can also occur between events that are highly related.

Example 5: Independent Events
Consider the case A = Ω. The event Ω is independent of any event B with 0 < P(B) < 1 since

   P(Ω|B) = P(Ω ∩ B)/P(B) = 1 = P(Ω ∩ B̄)/P(B̄) = P(Ω|B̄).

Assume now that Ω = {1, 2, . . . , 6}, A = {5, 6}, and B = {2, 4, 6}. Then

   P(A|B) = P(A ∩ B)/P(B) = P({6})/P({2, 4, 6}) = (1/6)/(1/2) = 1/3

and

   P(A|B̄) = P(A ∩ B̄)/P(B̄) = P({5})/P({1, 3, 5}) = (1/6)/(1/2) = 1/3,

which means that A = {5, 6} is independent of B = {2, 4, 6}.

Note that one can use the alternative (and equivalent) condition P(A ∩ B) = P(A) · P(B) to define independence. In particular, this condition can also be checked if P(B) = 0.

Assume that we have a finite or countably infinite number of events A1, A2, . . . that are pairwise disjoint. Assume further Ω = A1 ∪ A2 ∪ . . . and P(Ai) > 0 for all i. Often, the probability of some event B is unknown, but the conditional probabilities P(B|Ai) are known. In this case, P(B) can be computed using the law of total probability, which states that

   P(B) = Σ_i P(B|Ai) · P(Ai),

where each term P(B|Ai) · P(Ai) equals P(B ∩ Ai). The corresponding partitioning of Ω is illustrated in Fig. 1.2.

[Figure 1.2: The law of total probability: the disjoint events A1, A2, A3, A4 partition Ω, and B intersects several of them.]

If, however, P(Ai|B) is the probability of interest, one can use Bayes' theorem to compute it. With Bayes' theorem we can, for any event A with P(A) > 0, express P(A|B) by P(B|A), that is,

   P(A|B) = P(B|A) · P(A) / P(B).

Thus, for the above problem of computing P(Ai|B) for pairwise disjoint events Ai, this gives

   P(Ai|B) = P(B ∩ Ai)/P(B) = P(B|Ai) · P(Ai)/P(B) = P(B|Ai) · P(Ai) / Σ_j P(B|Aj) · P(Aj).
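As a quick numerical companion to Example 4, the following Python sketch (an illustration added here, not part of the original notes) recomputes both conditional probabilities via Bayes' theorem.

```python
# Bayes' theorem applied to Example 4 (lung cancer vs. smoking).
p_A = 72 / 200000          # P(A): person gets lung cancer
p_B = 0.25                 # P(B): person is a smoker
p_B_given_A = 0.9          # P(B|A): fraction of smokers among lung cancer patients

# P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
# P(A|not B) = P(not B|A) * P(A) / P(not B)
p_A_given_notB = (1 - p_B_given_A) * p_A / (1 - p_B)

print(p_A_given_B)                    # 0.001296
print(p_A_given_notB)                 # 4.8e-05
print(p_A_given_B / p_A_given_notB)   # 27.0
```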
Example 6: Noisy Channel
Consider the noisy binary channel illustrated below. If the channel is noise-free (p = 0), zero is transmitted from the upper left node to the upper right node. Similarly, one is transmitted from the lower left node to the lower right node.

[Figure: noisy binary channel; 0 and 1 are transmitted correctly with probability 1 − p and flipped with probability p.]

If the channel is noisy, with probability p > 0 one is transmitted instead of zero and zero instead of one. Assume that the probability of sending zero is π0 and the probability of sending one is π1 = 1 − π0. We define the events
• A0: send 0, with P(A0) = π0,
• A1: send 1, with P(A1) = π1 := 1 − π0,
• B0: receive 0,
• B1: receive 1.
Then, P(B1|A0) = P(B0|A1) = p and P(B1|A1) = P(B0|A0) = 1 − p. We calculate

   P(B1) = P(B1|A0) · P(A0) + P(B1|A1) · P(A1) = p · π0 + (1 − p) · π1,
   P(B0) = P(B0|A0) · P(A0) + P(B0|A1) · P(A1) = (1 − p) · π0 + p · π1.

1.2 Discrete Random Variables

A random variable is used to represent an outcome of an experiment. Technically, the fact that this variable is "random" (that is, it takes values with a certain probability) is realized by using a mapping, as illustrated below.

[Figure: a random variable X maps each outcome ω ∈ Ω to a real value X(ω) ∈ X(Ω).]

Definition 4: Discrete Random Variable
Let (Ω, 2^Ω, P) be a discrete probability space. A function

   X : Ω → R

is called a discrete (real-valued) random variable on (Ω, 2^Ω, P).

Now that we have mapped the outcome of an experiment to R, we need to assign probabilities to the subsets of X(Ω) = {a ∈ R | ∃ω ∈ Ω : X(ω) = a}. Since we already have probabilities for events A ⊆ Ω, we define a function PX : 2^X(Ω) → [0, 1] such that, for A ∈ 2^X(Ω),

   PX(A) := P(X^{−1}(A)) = P({ω ∈ Ω | X(ω) ∈ A}).

Then (X(Ω), 2^X(Ω), PX) is a discrete probability space.

[Figure 1.3: Discrete probability distribution (rolling two dice). Figure 1.4: Cumulative probability distribution (rolling two dice).]

Example 7: Rolling two dice
We want to define a random variable for the sum of the two numbers on the two dice. Let Ω = {(ω1, ω2) | ω1, ω2 ∈ {1, . . . , 6}} and let X : Ω → R be such that X(ω1, ω2) = ω1 + ω2. Then, the probability of having the number 10 is

   PX({10}) = P(X^{−1}({10})) = P({(ω1, ω2) ∈ Ω | X(ω1, ω2) = 10})
            = P({(4, 6)}) + P({(6, 4)}) + P({(5, 5)}) = 1/12.

In the sequel, we use the following "shortcuts" to refer to subsets of Ω:
• "X = a" stands for the set {ω ∈ Ω | X(ω) = a},
• "X ≤ a" stands for the set {ω ∈ Ω | X(ω) ≤ a},
• "X < a" etc. are defined analogously.
Thus, for instance,

   P(X ≤ a) = P({ω ∈ Ω | X(ω) ≤ a}) = Σ_{c∈X(Ω), c≤a} P(X = c).

The function f : X(Ω) → [0, 1] with f(a) = P(X = a) is called the discrete probability distribution of X. It tells us the probability of each possible event of the form "X = a". Of similar importance is the following function related to a random variable X.

Definition 5: Cumulative Probability Distribution
Let X be a discrete (real-valued) random variable. The function F : R → [0, 1] with

   F(x) := P(X ≤ x) = Σ_{a∈X(Ω), a≤x} P(X = a)

is called the cumulative probability distribution of X.

Example 8: Rolling two dice
Assume that X is defined as in Example 7. The discrete probability distribution and the cumulative probability distribution of X are shown in Figures 1.3 and 1.4, respectively.

1.2.1 Expectation and Variance

Besides the probability distribution of a discrete random variable X, other values related to X are of interest. The most important ones are the expectation and the variance of X, which are defined as

   E(X) = Σ_{x∈X(Ω)} x · P(X = x)   and   V(X) = Σ_{x∈X(Ω)} (x − E(X))² · P(X = x),

respectively. (Note that the sum might not converge, in which case the expectation/variance does not exist.) The standard deviation of X is given by √V(X). Note that V(X) (and √V(X)) are always non-negative.

Example 9: Expected Value of Special Random Variables
Let c ∈ R. Assume X(ω) = c for all ω ∈ Ω. Then E(X) = c. Let A ⊆ Ω and define Y = I_A where

   I_A(ω) = 1 if ω ∈ A, and 0 otherwise.

Then E(Y) = 1 · P(A) + 0 · P(Ā) = P(A).
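The definitions of E(X) and V(X) are easy to evaluate mechanically. The following Python sketch (an added illustration, not part of the original notes) computes expectation, variance and standard deviation for the two-dice sum of Example 7.

```python
from itertools import product

# Distribution of X = sum of two fair dice (Example 7).
omega = list(product(range(1, 7), repeat=2))       # sample space, 36 equally likely outcomes
pmf = {}
for w1, w2 in omega:
    pmf[w1 + w2] = pmf.get(w1 + w2, 0) + 1 / 36    # P(X = x)

E = sum(x * p for x, p in pmf.items())             # E(X) = sum of x * P(X = x)
V = sum((x - E) ** 2 * p for x, p in pmf.items())  # V(X) = sum of (x - E(X))^2 * P(X = x)
print(E, V, V ** 0.5)                              # 7.0, 5.833..., 2.415...
```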
Often the expectation of a random variable is denoted by the Greek letter µ, while the variance is denoted by σ², and σ denotes the standard deviation. The latter measures the variation in the same "units" as X and µ, while σ² measures in squared units.

One might think that E(X) is something like "the best guess for X" that we can make. However, note that even if X is discrete (e.g. X ∈ {1, 2, . . .}), its expectation may not be a valid outcome (e.g. 2.6) since it is a real number. In addition, it might be a bad guess if the probability distribution of X is, for instance, bimodal as illustrated in Fig. 1.5.

[Figure 1.5: Bimodal discrete probability distribution.]

Example 10:
We consider first a standard fair die with 6 sides. For such a die the expectation is given by E(X) = Σ_{k=1}^{6} k · 1/6 = 3.5 and the variance is V(X) = Σ_{k=1}^{6} (k − 3.5)² · 1/6 ≈ 2.9. Now let us consider a second fair die where the numbers on sides 3 and 4 are substituted by 1 and 6 (so the die has sides 1, 2, 1, 6, 5, 6). For this die the expectation is 3.5 as before, but the variance is much larger, V(Y) ≈ 4.9.

[Figure: distributions of both dice (blue) together with the corresponding expectations and variances (green).]

It is possible to connect random variables on the same sample space Ω using operations such as +, −, ·, etc. For instance, we define Z = X + Y as a random variable on Ω by Z(ω) = X(ω) + Y(ω) for all ω ∈ Ω. The probability P(Z = z) = P(X + Y = z) is then well-defined since

   P(Z = z) = P({ω ∈ Ω | Z(ω) = z}) = Σ_{ω∈Ω, X(ω)+Y(ω)=z} P({ω}).

Moreover,

   Σ_{z∈Z(Ω)} P(Z = z) = Σ_{z∈Z(Ω)} Σ_{ω∈Ω, X(ω)+Y(ω)=z} P({ω}) = P(Ω) = 1.

[Figure 1.6: Sum of two random variables (rolling two dice): an outcome (ω1, ω2) is mapped to X(Ω), Y(Ω) and Z(Ω) with Z(ω1, ω2) = ω1 + ω2.]

Example 11: Rolling two dice
Assume that Ω = {1, 2, . . . , 6}² and X, Y : Ω → R are such that, for (ω1, ω2) ∈ Ω, X(ω1, ω2) = ω1 and Y(ω1, ω2) = ω2. Then for Z = X + Y we have Z(ω1, ω2) = X(ω1, ω2) + Y(ω1, ω2) = ω1 + ω2 and (see the illustration in Fig. 1.6)

   P(Z = z) = Σ_{(ω1,ω2)∈Ω, X(ω1,ω2)+Y(ω1,ω2)=z} P({(ω1, ω2)}) = Σ_{(ω1,ω2)∈Ω, ω1+ω2=z} P({(ω1, ω2)}).

Now, having two random variables on the same probability space, we can define a joint distribution as follows.

Definition 6: Joint Probability Distribution
Let X, Y be discrete (real-valued) random variables on the same probability space (Ω, 2^Ω, P). The function PX,Y : X(Ω) × Y(Ω) → [0, 1] with

   PX,Y(a, b) := P(X = a ∧ Y = b) = P(X^{−1}({a}) ∩ Y^{−1}({b}))

is called the joint probability distribution of X and Y.

Similar to the definition of independence between events, we can define when two random variables are independent.

Definition 7: Independence of Random Variables
Let X, Y be discrete (real-valued) random variables on the same probability space (Ω, 2^Ω, P). We call X and Y independent iff for all a ∈ X(Ω), b ∈ Y(Ω)

   P(X = a | Y = b) = P(X = a).

(Or equivalently, X and Y are independent iff P(X = a ∧ Y = b) = P(X = a) · P(Y = b).)

Example 12: Rolling two dice
Assume that X and Y are defined as in Example 11. Then X and Y are independent, since for any a, b ∈ {1, . . . , 6},

   P(X = a ∧ Y = b) = P(X^{−1}({a}) ∩ Y^{−1}({b}))
                    = P({(ω1, ω2) ∈ Ω | ω1 = a} ∩ {(ω1, ω2) ∈ Ω | ω2 = b})
                    = P({(a, b)}) = 1/36

and

   P(X = a) · P(Y = b) = P(X^{−1}({a})) · P(Y^{−1}({b}))
                       = P({(ω1, ω2) ∈ Ω | ω1 = a}) · P({(ω1, ω2) ∈ Ω | ω2 = b})
                       = 6/36 · 6/36 = 1/36.
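The independence claim of Example 12 can also be verified by brute force; the short Python sketch below (an added illustration, not from the notes) compares P(X = a ∧ Y = b) with P(X = a) · P(Y = b) for all 36 pairs.

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # Omega = {1,...,6}^2
p = 1 / 36                                        # each outcome is equally likely

for a, b in product(range(1, 7), repeat=2):
    joint = sum(p for (w1, w2) in outcomes if w1 == a and w2 == b)  # P(X=a and Y=b)
    px = sum(p for (w1, w2) in outcomes if w1 == a)                 # P(X=a) = 6/36
    py = sum(p for (w1, w2) in outcomes if w2 == b)                 # P(Y=b) = 6/36
    assert abs(joint - px * py) < 1e-12           # both sides equal 1/36

print("X and Y are independent")
```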
We are now able to state some properties of the expectation. Let X, Y be discrete random variables on the same probability space and assume that E(X) and E(Y) exist. Then, for a, b ∈ R,

1. E(X) = Σ_{ω∈Ω} X(ω) · P({ω}),
2. E(a · X + b) = a · E(X) + b,
3. E(X + Y) = E(X) + E(Y),
4. if X and Y are independent, then E(X · Y) = E(X) · E(Y).

Proof (1.):
   E(X) = Σ_{x∈X(Ω)} x · P(X = x)                          (definition of E(X))
        = Σ_{x∈X(Ω)} x · P({ω | X(ω) = x})                  (definition of P(X = x))
        = Σ_{x∈X(Ω)} x · Σ_{ω∈Ω, X(ω)=x} P({ω})             (2nd axiom of Def. 1)
        = Σ_{ω∈Ω} X(ω) · P({ω})                             (X is a function)

Proof (2.): Let Z = a · X + b be a discrete random variable on the same probability space as X, such that Z(ω) = a · X(ω) + b. We now find:
   E(a · X + b) = E(Z)                                      (definition of Z)
        = Σ_{ω∈Ω} Z(ω) · P({ω})                             (first property)
        = Σ_{ω∈Ω} (a · X(ω) + b) · P({ω})                   (definition of Z)
        = a · Σ_{ω∈Ω} X(ω) · P({ω}) + b · Σ_{ω∈Ω} P({ω})    (calculus)
        = a · Σ_{ω∈Ω} X(ω) · P({ω}) + b · P(Ω)              (2nd axiom of Def. 1)
        = a · Σ_{ω∈Ω} X(ω) · P({ω}) + b                     (1st axiom of Def. 1)
        = a · E(X) + b                                      (first property)

Proof (3.): Let Z = X + Y be a discrete random variable on the same probability space as X and Y, such that Z(ω) = X(ω) + Y(ω). We now find:
   E(X + Y) = E(Z)                                          (definition of Z)
        = Σ_{ω∈Ω} Z(ω) · P({ω})                             (first property)
        = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P({ω})                    (definition of Z)
        = Σ_{ω∈Ω} X(ω) · P({ω}) + Σ_{ω∈Ω} Y(ω) · P({ω})     (calculus)
        = E(X) + E(Y)                                       (first property)

Proof (4.): Let Z = X · Y be a discrete random variable on the same probability space as X and Y, such that Z(ω) = X(ω) · Y(ω). Let X and Y be independent. We now find:
   E(X · Y) = E(Z)                                          (definition of Z)
        = Σ_{ω∈Ω} Z(ω) · P({ω})                             (first property)
        = Σ_{ω∈Ω} X(ω) · Y(ω) · P({ω})                      (definition of Z)
        = Σ_{x∈X(Ω), y∈Y(Ω)} x · y · Σ_{ω∈Ω, X(ω)=x, Y(ω)=y} P({ω})   (X, Y are functions)
        = Σ_{x∈X(Ω), y∈Y(Ω)} x · y · P(X = x ∧ Y = y)       (2nd axiom of Def. 1)
        = Σ_{x∈X(Ω), y∈Y(Ω)} x · y · P(X = x) · P(Y = y)    (independence of X and Y)
        = Σ_{x∈X(Ω)} x · P(X = x) · Σ_{y∈Y(Ω)} y · P(Y = y) (calculus)
        = E(X) · E(Y)                                       (definition of E(X), E(Y))

We list the properties of the variance operator without proof. Let X, Y be discrete random variables on the same probability space and assume that E(X), E(Y), VAR(X), and VAR(Y) exist. Then

1. VAR(a · X + b) = a² · VAR(X) for a, b ∈ R,
2. VAR(X) = E(X²) − E(X)²,
3. VAR(X + Y) = VAR(X) + VAR(Y) + 2 · (E(X · Y) − E(X) · E(Y)),
4. if X and Y are independent, then VAR(X + Y) = VAR(X) + VAR(Y).

The definition of independent random variables can be extended to more than two variables as follows.

Definition 8: Independence (of n discrete random variables)
For n ∈ N, let X1, . . . , Xn be discrete (real-valued) random variables on the same probability space (Ω, 2^Ω, P). We call X1, . . . , Xn independent iff for all a1 ∈ X1(Ω), . . . , an ∈ Xn(Ω)

   P(X1 = a1, . . . , Xn = an) = P(X1 = a1) · . . . · P(Xn = an).

Note that if X1, . . . , Xn are independent, then E(X1 · . . . · Xn) = E(X1) · . . . · E(Xn) and VAR(X1 + . . . + Xn) = VAR(X1) + . . . + VAR(Xn).
1.2.2 Higher Moments, Covariance and Correlation

In the previous section we used different operators (e.g. +) to combine several random variables and considered the expectation of combinations of random variables. Thus, we already used the following general formula for the expectation of a function g:

   E(g(X)) = Σ_x g(x) · P(X = x).

(Again the sum might not converge, in which case E(g(X)) does not exist.) Note that this formula can also be applied if g is not a one-to-one function, e.g. if g(x) = x². In the case g(x) = x^i, we call E(X^i) the i-th moment of X. Note that the moments of a distribution are related to its skewness and its kurtosis. The former characterizes the degree of asymmetry while the latter characterizes the flatness or peakedness of the distribution.

While expectation, variance and standard deviation describe properties of a single random variable, the covariance and the correlation measure the level of dependence between variables:

• The covariance of two random variables X and Y is defined by

   COV(X, Y) = E([X − E(X)][Y − E(Y)])

• and, assuming VAR(X), VAR(Y) > 0, the correlation of X and Y is defined by

   COR(X, Y) = COV(X, Y) / (√VAR(X) · √VAR(Y)).

The covariance is the expected product of the deviations of X and Y from their respective means. The correlation can be seen as a scaled version of the covariance having the same sign. If X and Y are positively correlated, COR(X, Y) > 0, large values of X (positive deviations from E(X)) generally correspond to large values of Y. Similarly, small values of X imply small values of Y (an example is given in Fig. 1.7). If X and Y are negatively correlated, COR(X, Y) < 0, large values of X generally correspond to small values of Y.

[Figure 1.7: Positively correlated X and Y.]

If X and Y are uncorrelated, COR(X, Y) = 0, we know that there is no linear dependency between X and Y. This does, however, not imply that they are independent, since other forms of dependence are possible, as the following example shows.

Example 13: Uncorrelated Variables
Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X². Then E(X) = 0 and E(Y) = 2/3. Thus,

   COV(X, Y) = Σ_{x,y} P(X = x ∧ Y = y) · (x − E(X)) · (y − E(Y))
             = (1/3)(−1 − 0)(1 − 2/3) + (1/3)(1 − 0)(1 − 2/3) = 0.

Thus, X and Y are uncorrelated even though Y is a function of X (the strongest form of dependence). Clearly, X and Y are not independent since, for instance,

   P(X = 1 ∧ Y = 1) = 1/3 ≠ P(X = 1) · P(Y = 1) = (1/3) · (2/3) = 2/9.

Next, we list some important properties of the covariance:
• COV(X, Y) = COV(Y, X),
• COV(X, Y) = E(X · Y) − E(X) · E(Y),
• COV(X, X) = VAR(X).

Finally, we note that since the correlation is scaled and has the property −1 ≤ COR(X, Y) ≤ 1, we have a maximal positive (negative) correlation at COR(X, Y) = 1 (COR(X, Y) = −1). In this case all possible realizations of (X, Y) lie on a straight line. Thus, given X = x, the value Y = y is fixed by the line.

[Figure: realizations of (X, Y) lying on a straight line in the case of maximal correlation.]
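The computation in Example 13 can be replayed numerically; the following Python sketch (an added illustration, not part of the original notes) computes COV(X, Y) and the failed independence check for X uniform on {−1, 0, 1} and Y = X².

```python
# X uniform on {-1, 0, 1}, Y = X^2 (Example 13)
xs, px = [-1, 0, 1], 1 / 3

EX = sum(x * px for x in xs)                          # E(X) = 0
EY = sum(x ** 2 * px for x in xs)                     # E(Y) = 2/3
cov = sum(px * (x - EX) * (x ** 2 - EY) for x in xs)
print(cov)                                            # 0.0: uncorrelated

# ... yet not independent:
print(1 / 3, (1 / 3) * EY)   # P(X=1, Y=1) = 1/3  vs  P(X=1) * P(Y=1) = 2/9
```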
1.3 Important Discrete Probability Distributions

Let us now consider some important discrete probability distributions.

• Bernoulli distribution: Consider again an experiment where there are only two possible outcomes (e.g. flipping a coin). This is called a Bernoulli trial. Let us use 0 (failure) and 1 (success) for the two possible outcomes. The probability of 1 is P(X = 1) = p and P(X = 0) = 1 − p. Thus, the mean of the Bernoulli distribution is E(X) = 0 · (1 − p) + 1 · p = p.

• Binomial distribution: We consider a sequence of n independent Bernoulli trials and count the number X of successes. Then X follows a binomial distribution, i.e. for k ∈ {0, 1, . . . , n}

   P(X = k) = (n choose k) · p^k · (1 − p)^{n−k},

where the first factor counts all possible orderings of k successes among n trials, the second one gives the probability that in the independent trials we have k successes, and the last factor describes the probability of the (remaining) n − k failures. Note that one can express X as the sum of n independent Bernoulli variables X1, . . . , Xn: X = X1 + . . . + Xn. Thus, E(X) = E(X1) + . . . + E(Xn) = p + . . . + p = n · p. Please use the Wikipedia link to view the plot of the distribution and learn more about the binomial distribution.

• Geometric distribution: Consider again an experiment where the probability of an event A is P(A) = p. We repeat this experiment until A occurs for the first time. Let X be the random variable that describes the number of trials until A occurs for the first time. Then X is called geometrically distributed. We have P(X = i) = (1 − p)^{i−1} · p and E(X) = 1/p. The variance of X is V(X) = (1 − p)/p². Please use the Wikipedia link to view the plot of the distribution and learn more about the geometric distribution.

• Poisson distribution: Consider a call center where on average µ = 6 calls per minute arrive. Let X be the random variable that represents the number of calls in the next minute and assume P(X = k) = (µ^k/k!) · e^{−µ}. Then X is called Poisson distributed and has expectation

   E(X) = Σ_{k=0}^{∞} k · (µ^k/k!) · e^{−µ} = µ · e^{−µ} · Σ_{k=1}^{∞} µ^{k−1}/(k − 1)! = µ · e^{−µ} · e^{µ} = µ

and variance V(X) = µ (without proof). Please use the Wikipedia link to view the plot of the distribution and learn more about the Poisson distribution.
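As a small sanity check of the formulas above, the following Python sketch (scipy assumed; this example is not part of the original notes) tabulates the binomial and Poisson probability mass functions with matching means.

```python
from scipy.stats import binom, poisson

n, p = 30, 1 / 3        # binomial with mean n * p = 10
mu = 10                 # Poisson with the same mean

for k in (5, 10, 15):
    # P(X = k) under both distributions: similar, but not identical
    print(k, binom.pmf(k, n, p), poisson.pmf(k, mu))

print(binom.mean(n, p), poisson.mean(mu))   # 10.0 and 10.0
```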
1.4 Exercises

Easy

Exercise 1: Probability What is wrong with the following axioms for probability: The value P(A) is called the probability of event A if P is such that
1. P(Ω) = 1,
2. for events A1, A2, . . . it holds that P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . . .

Exercise 2: A Chance Experiment We throw a fair coin three consecutive times. What is the sample space of this experiment? Can you describe two different events on this sample space?

Exercise 3: Conditional Probability Show that the value P(A|B) is a probability in the sense of Definition 1.

Exercise 4: Independence Suppose that we toss a 6-sided die. Each of the six numbers is a sample point and is assigned probability 1/6. Define the events A and B as follows: A = {1, 3, 4}, B = {3, 4, 5, 6}. Prove that A and B are independent.

Exercise 5: Variance Assume that for a discrete random variable VAR(X) = 0. What can be said about the probability distribution of X?

Exercise 6: Covariance Let X and Y be two random variables such that VAR(X) = 2 and COV(X, Y) = 1. Compute the covariance COV(5X, 2X + 3Y).

Exercise 7: Correlation of Random Variables Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X² − X. Are these random variables uncorrelated?

Exercise 8: Correlation Assume that X is a discrete random variable such that P(X = −1) = P(X = 0) = P(X = 1) = 1/3 and Y = X² + X. Are these random variables correlated? What is the correlation coefficient?

Exercise 9: Poisson vs Binomial Use MATLAB to plot a Poisson distribution with mean µ = 10. Assume now a binomial distribution with n = 30 trials. What is the success probability p of the binomial such that the two distributions look similar?

Medium

Exercise 10: Conditional Probability In the summer semester 2015 there were 1162 students at Saarland University that studied computer science and 184 that studied bioinformatics. About 43% of the bioinformatics students are female and for computer science about 18% are female. What is the probability that a female student from the total number of 1346 students (studying either bioinformatics or computer science) is a computer science student?

Exercise 11: Independence Prove that for P(B) > 0 the following conditions are all equivalent:
i) P(A ∩ B) = P(A) · P(B)
ii) P(A|B) = P(A|B̄)
iii) P(A|B) = P(A).
Can A be independent of B but not vice versa?

Exercise 12: Law of Total Probability A test for a viral infection is 95% reliable for infected patients and 99% for healthy patients. Assume that 4% of all patients that are tested are infected with the virus. If a patient is tested positively, what is the probability that he really has the virus?

Exercise 13: Discrete Random Variables Let (Ω, 2^Ω, P) be a discrete probability space with Ω finite and X : Ω → R. Prove that, for A ⊆ X(Ω), the value PX(A) := P(X^{−1}(A)) = P({ω ∈ Ω | X(ω) ∈ A}) is a probability. (Hint: Note that if Ω is finite, the second axiom of Kolmogorov's probability definition reduces to "If A and B are disjoint subsets of Ω then P(A ∪ B) = P(A) + P(B)".)

Exercise 14: Discrete Random Variables Consider a game of chance where a player bets one euro and picks a number between 1 and 6, e.g. 2. A die is then rolled three times and the player receives nothing if no 2 was thrown, two euros if 2 was thrown one time, three euros if 2 was thrown two times, four euros if 2 was thrown three times, and similarly for the other numbers. Define a random variable Z that describes the net profit ("Reingewinn") of one round of the game. Calculate the probabilities P(Z = −1), P(Z = 0), P(Z = 1), P(Z = 2), P(Z = 3), as well as the expectation, variance, and standard deviation of Z.

Difficult

Exercise 15: Variance and Expectation Prove the following:
1. VAR(X) = E(X²) − E(X)²,
2. VAR(a · X + b) = a² · VAR(X) for a, b ∈ R,
3. VAR(X + Y) = VAR(X) + VAR(Y) + 2 · (E(X · Y) − E(X) · E(Y)),
4. if X and Y are independent, then VAR(X + Y) = VAR(X) + VAR(Y).

Exercise 16: Binomial Distribution Prove that for a binomially distributed random variable X ∼ B(n, p) it holds that E(X) = pn and VAR(X) = p(1 − p)n.

Exercise 17: Poisson Distribution A Poisson distributed random variable X has the following probability distribution:

   P(X = k) = (µ^k/k!) · e^{−µ},   k ∈ {0, 1, . . .}.

Prove that the mean and the variance of X are equal to µ.

Chapter 2

Continuous Random Variables and Laws of Large Numbers

Consider a chance experiment where the sample space Ω contains uncountably many elements. In this case, assigning probabilities to all elements in 2^Ω poses problems. Assume, for instance, that we randomly choose a real number in [0, 1]. If all numbers are equally likely to occur, we have to assign probability zero to each, since their "sum" must be one (note that the sum over uncountably many nonzero values is undefined). Instead of giving a new definition of probabilities, we restrict ourselves to a set of events, called a σ-algebra, for which we can define probabilities as in the discrete case.
2.1 σ-algebras

If we define a random variable as a function X : Ω → R, we want to reason about the probability of events such as X = x for any x ∈ R or a < X ≤ b for any interval (a, b]. From the properties of probabilities (see Def. 1), we then know the probability of the disjoint union of countably many such sets as well as the probability of the complement of such a set (from the two conditions in Def. 1, one can easily derive that P(Ā) = 1 − P(A)).

Definition 9: σ-algebra
A set F ⊆ 2^Ω is called a σ-algebra if
1. Ω ∈ F,
2. A ∈ F implies Ā ∈ F,
3. if A1, A2, . . . ∈ F is a sequence of sets, then A1 ∪ A2 ∪ . . . ∈ F.

The most important example of a σ-algebra, which is needed in the sequel, is the σ-algebra that is generated by a set E ⊆ 2^Ω. We define the smallest σ-algebra that contains E by

   σ(E) := ∩{F ⊇ E : F is a σ-algebra}.

Example 14: Generated σ-algebra
For simplicity, we consider the finite set Ω = {1, 2, . . . , 6}. Let E = {{2, 6}, {5, 6}}. Then

   σ(E) = {{2, 6}, {5, 6}, {1, 3, 4, 5}, {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {6}, . . .}.

If Ω is finite, the idea is to construct a sequence of sets in an iterative fashion, where we start with E and obtain the next set from the previous one by joining and complementing elements of the current set. If no new element can be constructed by union or complement operations, the current set equals σ(E).

Next we assume Ω = R and E = {(a, b] : a, b ∈ R, a ≤ b}. The σ-algebra B := σ(E) is called the Borel algebra on the reals. It contains all subsets of 2^R, called Borel sets, that can be obtained from E by countable union and complement operations. Note that also intervals of the form (a, b) or [a, b] are Borel sets. A similar construction is possible for Ω = R^n. Intuitively, the Borel sets are those sets to which we can assign a "volume", "area" or "size". Subsets of R^n that are not Borel sets are only of theoretical interest, since they are of no importance for practical applications.

We are now able to give a more general definition of a probability space (i.e. also for sample sets with uncountably many elements).

Definition 10: Probability Space
Let Ω be a nonempty set and let F ⊆ 2^Ω be a σ-algebra. A probability space is a triple (Ω, F, P) where the probability measure P : F → [0, 1] is such that
• P(Ω) = 1 and,
• if A1, A2, . . . ∈ F is a sequence of pairwise disjoint sets, then P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . . .

Example 15: Discrete Probability Space
Any discrete probability space (Ω, 2^Ω, P) fulfills Def. 10, since 2^Ω is a σ-algebra and P is as in Def. 1.

Example 16: One-dimensional Interval
Let Ω = R, F = B and 0 ≤ a < b, a, b ∈ R. Consider a probability measure P : F → [0, 1] such that

   P({ω ∈ Ω | x < ω ≤ y}) = (min(y, b) − max(x, a)) / (b − a)   if max(x, a) < min(y, b),
   P({ω ∈ Ω | x < ω ≤ y}) = 0                                   otherwise.

Similar to how we extended the set E = {(a, b] : a, b ∈ R, a ≤ b} to a σ-algebra, we can show that if P is a probability measure and defined as above for all half-open intervals, its value for the remaining sets in B is uniquely determined.

2.2 Continuous random variables

Definition 11: Real-valued Random Variable
Let (Ω, F, P) be a probability space. A real-valued random variable on (Ω, F, P) is a function X : Ω → R such that for all A ∈ B

   X^{−1}(A) = {ω ∈ Ω | X(ω) ∈ A} ∈ F.

[Figure 2.1: Is the inverse image of A w.r.t. X an element of F?]

The above definition ensures that if we want to know the probability that X is in some Borel set A, we can consider the inverse image of A with respect to X, for which we know its probability.
Clearly, we can define a probability measure PX : B → [0, 1] by setting

   PX(A) := P({ω | X(ω) ∈ A}) = P(X^{−1}(A))

and use similar notations as in the discrete case (e.g. P(a < X ≤ b)).

Definition 12: Cumulative Probability Distribution
Let X be a real-valued random variable on (Ω, F, P). The function F : R → [0, 1] with x ↦ F(x) := P(X ≤ x) is called the cumulative probability distribution of X.

Example 17: One-dimensional Interval
Assume that X is a randomly chosen point in the interval [a, b] and (Ω, F, P) is as in Ex. 16. Then

   F(y) = P(X ≤ y) = (y − a)/(b − a)   if y ∈ [a, b],
   F(y) = 1                            if y > b,
   F(y) = 0                            otherwise.

We call a random variable X on (Ω, F, P) discrete if X(Ω) is a discrete set (finite or countably infinite). We call X continuous if X(Ω) contains uncountably many elements and there exists a non-negative and integrable function f : R → R≥0, called the density, with

   F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy.

Clearly,

   ∫_{−∞}^{∞} f(y) dy = 1

since F(∞) = 1, but note that P(X = y) ≠ f(y).

Example 18: One-dimensional Interval
Assume that X is as in Ex. 17. Then X is a continuous random variable since the (constant) function f with

   f(y) = 1/(b − a)   if y ∈ [a, b],   and   f(y) = 0   otherwise,

is the density of X. We can verify this by calculating

   F(x) = ∫_{−∞}^{x} f(y) dy = ∫_{a}^{x} 1/(b − a) dy = (x − a)/(b − a) = P(X ≤ x)

for x ∈ [a, b], F(x) = 0 for x < a, and F(x) = 1 for x > b. Figure 2.2 shows a plot of the functions f and F.

[Figure 2.2: Density and cumulative probability distribution of a uniformly distributed random variable X: f is constant at 1/(b − a) on [a, b], and F increases linearly from 0 at a to 1 at b.]

The distribution in the above example is called the (continuous) uniform distribution. The expectation and variance of a continuous random variable X with density f are defined as

   E(X) = ∫_{−∞}^{∞} x · f(x) dx   and   V(X) = ∫_{−∞}^{∞} (x − E(X))² · f(x) dx.

Note that these integrals may not exist, in which case the expectation/variance is undefined. Many results for discrete random variables carry over to the continuous setting, for instance, properties of the expectation and variance (e.g. E(X + Y) = E(X) + E(Y)) or results concerning the connection of random variables, joint distributions, stochastic independence, etc. We do not discuss them here, as they are very similar to the results presented above. Moreover, in the sequel, we will mostly work with discrete random variables.

2.3 Important continuous distributions

Besides the uniform distribution that we have seen above, there are some other important continuous distributions.

Uniform Distribution
We already described the uniform distribution in the examples above. In summary we get the following properties if X is uniformly distributed on the interval (a, b) (denoted by X ∼ U(a, b)):
• Its density is constant on the interval (a, b), that is, f(x) = 1/(b − a) for x ∈ (a, b) and f(x) = 0 otherwise.
• If we want to know the probability that X falls into a subinterval of (a, b) of length d, we get

   P(X ∈ (c, c + d)) = d/(b − a),   (c, c + d) ⊆ (a, b),

  which means that it is proportional to the length d (and independent of c, cf. Figure 2.2).
• The expectation is given by the midpoint of the interval, E(X) = (a + b)/2.
Exponential Distribution
Let λ > 0. We say that a continuous random variable X is exponentially distributed with parameter λ (denoted by X ∼ Exp(λ)) if the density of X is such that, for t ∈ R,

   f(t) = λ · e^{−λt}   if t ≥ 0,   and   f(t) = 0   otherwise.

The cumulative probability distribution of X is then given by

   F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{0}^{x} λ · e^{−λt} dt = 1 − e^{−λx}   if x ≥ 0,   and   F(x) = 0   otherwise.

The exponential distribution is often used to describe waiting times or interarrival times, since in many real-world systems where a sequence of rare events plays an important role, the time between these events is assumed to be exponentially distributed. For instance, the time until radioactive decay is commonly modelled in this way. This is related to the fact that these events are assumed to occur spontaneously and that the exponential distribution has the so-called memoryless property: Assume that the random variable X describes a waiting time, that we already know that X ≥ t, and that we ask for the probability that X ≥ t + h (we have to wait for an additional h time units). Then, if X is exponentially distributed, we can verify that

   P(X ≥ t + h | X ≥ t) = P(X ≥ h),   t, h > 0.

The exponential distribution is the only continuous distribution that is memoryless. In fact, it is possible to derive from the memoryless property that the distribution of X must be exponential. Similarly, one can show that the geometric distribution is the only discrete distribution that is memoryless.

Example 19: Bus waiting paradox
Assume that Bob arrives at a bus stop where buses arrive on average every 10 min and the interarrival times X of the buses are exponentially distributed. Bob asks one of the people waiting there how long he has already been waiting. Whatever the answer of that person is (e.g. t = 20 min or 1 min), it does not change the probability that Bob has to wait for h minutes for the next bus.

The following property of exponentially distributed random variables gives rise to an algorithm for the generation of exponentially distributed numbers (realizations of X), the so-called inverse transform method: if U ∼ U(0, 1) then X ∼ Exp(λ), where X = −ln(U)/λ. This can be verified as follows. We have the cumulative distribution F(x) = 1 − e^{−λx} for X. Thus, y = F(x) is a number between 0 and 1. Solving for x yields

   y := F(x) = 1 − e^{−λx}
   ⇐⇒ 1 − y = e^{−λx}
   ⇐⇒ ln(1 − y) = −λx
   ⇐⇒ x = −(1/λ) · ln(1 − y) = F^{−1}(y).

Thus, if U is uniformly distributed on (0, 1), the random variable X = F^{−1}(U) = −(1/λ) · ln(1 − U) must be exponentially distributed with parameter λ. Moreover, since 1 − U has the same distribution as U, the random variable X = −(1/λ) · ln(U) must also be exponentially distributed with parameter λ.
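The derivation above translates directly into a generator for exponentially distributed numbers. A minimal Python sketch (numpy assumed; the function name is ours, not from the notes):

```python
import numpy as np

def exp_variates(lam, size, seed=None):
    """Inverse transform method: if U ~ U(0,1), then X = -ln(U)/lam ~ Exp(lam)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size)   # uniform random variates
    return -np.log(u) / lam

x = exp_variates(lam=0.5, size=100_000)
print(x.mean())   # close to E(X) = 1/lambda = 2.0
```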
Normal distribution
The normal distribution is one of the most important continuous distributions, since it naturally arises when we sum up many random variables. Therefore, measurement errors and fluctuations are often (approximately) normally distributed. Measurements of the length or size of certain objects also often give normally distributed results. A normally distributed random variable X with mean µ and variance σ² (denoted by X ∼ N(µ, σ²)) has the density

   f(x) = (1/(σ√(2π))) · exp(−(x − µ)²/(2σ²)).

The density of the standardized normal distribution with mean µ = 0 and variance σ² = 1 is then defined as

   φ(x) = (1/√(2π)) · e^{−x²/2}

and its cumulative distribution function is

   Φ(x) = ∫_{−∞}^{x} (1/√(2π)) · e^{−z²/2} dz.

Assume we are interested in the probability that X ∼ N(µ, σ²) falls into a certain interval (a, b). Since the solution of the above integral is difficult to compute, we can standardize X and use the fact that we have a table with the values of Φ.

Definition 13: Standardized Random Variables
For any random variable X with finite E(X) = µ and finite VAR(X) = σ², we define its standardized version

   X* = (X − µ)/σ.

Then E(X*) = 0 and VAR(X*) = 1 since

   E(X*) = E((X − µ)/σ) = (E(X) − µ)/σ = 0

and

   VAR(X*) = VAR((X − µ)/σ) = VAR(X)/σ² = 1.

For X ∼ N(µ, σ²) this means that we can find the cumulative distribution of X* in a standard normal table and use it to compute that of X. Using X* = (X − µ)/σ ⇐⇒ X = X*σ + µ we get

   P(X ≤ x) = P(X*σ + µ ≤ x) = P(X* ≤ (x − µ)/σ).

Example 20: Computing Normal Probabilities
The average salary of an employee in Germany is µ = 30432 Euros per year (in 2012, gross). Assuming a normal distribution and a standard deviation of σ = 10000 Euros, we can compute the proportion of employees that earn between, say, 20 000 and 40 000 Euros per year as follows:

   P(20000 < X < 40000) = P((20000 − µ)/σ < X* < (40000 − µ)/σ)
                        = P(−1.0432 < X* < 0.9568)
                        = Φ(0.9568) − Φ(−1.0432)
                        ≈ 0.83147 − 0.14917 = 0.6823,

where we used that Φ(−1.0432) = 1 − Φ(1.0432). Next, we can compute the income limit of the poorest 3% of the employees, i.e. find x such that P(X < x) = 0.03:

   P(X < x) = P(X* < (x − µ)/σ) = Φ((x − µ)/σ) = 0.03.

Thus, x = µ + σ · Φ^{−1}(0.03) = 30432 + 10000 · (−1.88) = 11632, where we used that Φ(−1.88) ≈ 0.03.

Finally, we note that if a random variable X is normally distributed with expectation µ̂ and variance σ̂², written X ∼ N(µ̂, σ̂²), then, for a, b ∈ R, aX + b has distribution N(aµ̂ + b, (aσ̂)²).
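Standard normal tables are built into scipy, so the numbers of Example 20 can be reproduced directly; the following Python sketch is an added illustration (scipy assumed), not part of the notes.

```python
from scipy.stats import norm   # norm.cdf is Phi, norm.ppf is Phi^{-1}

mu, sigma = 30432, 10000

# P(20000 < X < 40000) = Phi(0.9568) - Phi(-1.0432)
p = norm.cdf((40000 - mu) / sigma) - norm.cdf((20000 - mu) / sigma)
print(p)   # approx 0.682

# income limit of the poorest 3%: x = mu + sigma * Phi^{-1}(0.03)
x = mu + sigma * norm.ppf(0.03)
print(x)   # approx 11624 (the notes round Phi^{-1}(0.03) to -1.88 and obtain 11632)
```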
Gamma distribution
The Gamma distribution generalizes a number of other distributions (e.g. the exponential distribution) and has a rather flexible shape compared to other continuous distributions. Therefore it can be used to fit very different types of data. A random variable X is Gamma distributed with parameters α and β (written X ∼ Gamma(α, β)) if its density is given by

   f(x) = (1/(Γ(α) · β^α)) · x^{α−1} · exp(−x/β),   x > 0, α > 0, β > 0,

and f(x) = 0 whenever x ≤ 0. Thus, it assigns positive probability only to positive values of X, and also the parameters α (for the shape) and β (for the scale) must be positive. The gamma function Γ(α) is used as a normalization factor of the density. Often α is a positive integer and then Γ(α) = (α − 1)!, which means that the density (and its integral) is easy to evaluate (for a plot of the Gamma densities please visit https://upload.wikimedia.org/wikipedia/commons/e/e6/Gamma_distribution_pdf.svg).

How do the parameters α and β influence the mean and the variance of the distribution? It can be shown that the mean of X ∼ Gamma(α, β) is µ = αβ and the variance is σ² = αβ². Further interesting properties of the Gamma distribution are:
• For large α and β, the distribution is very similar to a normal distribution (with mean µ = αβ and variance σ² = αβ²).
• If α = 1 and β = 1/λ, then X ∼ Exp(λ).
• If α = ν/2 and β = 2, then the distribution is a chi-squared distribution with ν degrees of freedom (where ν is a positive integer). In that case, we have the distribution of a sum of the squares of ν independent standard normal random variables. The chi-squared distribution is often used for hypothesis testing (goodness-of-fit tests).

2.4 Laws of Large Numbers

Consider n realizations x1 = X(ω1), x2 = X(ω2), . . . , xn = X(ωn) of a random variable X with finite expectation µ. They could, for instance, be generated by repeating an experiment n times under equal conditions. Intuitively, for large n the sample mean

   x̄ = (1/n) Σ_{i=1}^{n} xi ≈ µ.

In order to determine the quality of the above approximation, we recall Chebyshev's inequality.

2.4.1 Chebyshev's inequality

Let Y be a random variable with finite expectation E(Y) and finite variance VAR(Y). Assume that besides E(Y) and VAR(Y), nothing is known about Y (e.g. the cumulative probability distribution). Our aim is to reason about the deviation of Y from its expectation. According to Chebyshev's inequality, for any a > 0,

   P(|Y − E(Y)| ≥ a) ≤ VAR(Y)/a².   (2.1)

[Figure: the complementary event {ω ∈ Ω | |Y(ω) − µ| < a} corresponds to the interval (µ − a, µ + a) around the expectation µ.]

Note that the proof of the theorem is quite straightforward.

2.4.2 Weak Law of Large Numbers

In order to apply the inequality above, we assume that x1, x2, . . . , xn are realizations of independent random variables X1, X2, . . . , Xn, all having the same distribution as X (see also Def. 8). Define

   Zn = (1/n) Σ_{i=1}^{n} Xi.

Then E(Zn) = (1/n) Σ_{i=1}^{n} E(Xi) = µ and, if X has finite variance VAR(X) = σ²,

   VAR(Zn) = VAR((1/n) Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} VAR(Xi) = n·σ²/n² = σ²/n.

(See also the properties of the variance operator in Section 1.2.1.) Thus, Eq. (2.1) gives us, for any ε > 0,

   P(|Zn − µ| ≥ ε) ≤ σ²/(n·ε²).

If ε is given, we can make σ²/(n·ε²) arbitrarily small by increasing n. Thus, for any ε > 0,

   lim_{n→∞} P(|Zn − µ| ≥ ε) = 0   (or, equivalently, lim_{n→∞} P(|Zn − µ| < ε) = 1),

which is known as the weak law of large numbers. Here, the sequence {Zn}_{n≥1} of random variables converges "weakly" since only the corresponding probabilities converge. (Note that, as opposed to the strong law of large numbers, for the weak law of large numbers pairwise independence of X1, . . . , Xn is already sufficient and (full) independence is not necessary.)

Example 21: Bernoulli Trials
Consider a sequence of n Bernoulli trials and for 1 ≤ j ≤ n, let Xj be the random variable that is 1 if the j-th trial is a success and 0 otherwise. Moreover, let P(Xj = 1) = p for all j. Then E(X1) = E(X2) = . . . = E(Xn) = p. According to the weak law of large numbers, for any ε > 0,

   P(|(1/n) Σ_{j=1}^{n} Xj − p| < ε) → 1   as n → ∞.

2.4.3 Strong Law of Large Numbers

An even stronger result than the weak law of large numbers is provided by the strong law of large numbers, which states that

   P(lim_{n→∞} |Zn − µ| = 0) = 1.

Here, we measure the probability of all outcomes ω ∈ Ω with lim_{n→∞} |Zn(ω) − µ| = 0 and find that this event occurs with probability one. The strong law of large numbers implies the weak law of large numbers.

We have seen two laws that match our initial intuition that x̄ = (1/n) Σ_{i=1}^{n} xi ≈ µ. In the following we will see that even more can be derived about sums of random variables.

2.4.4 Central Limit Theorem

Recall that Zn = (1/n) Σ_{i=1}^{n} Xi, where the Xi are independent and identically distributed random variables with finite expectation µ and finite variance σ² > 0. According to the central limit theorem, the distribution of

   Zn* = (Zn − E(Zn))/√VAR(Zn) = (Zn − µ)/(σ/√n)

converges to the normal distribution with expectation 0 and variance 1 (the standard normal distribution). Thus, for x, y ∈ R, x < y,

   lim_{n→∞} P(x < Zn* < y) = (1/√(2π)) ∫_{x}^{y} e^{−t²/2} dt
and Zn is approximately normally distributed with expectation µ and variance σ²/n, since Zn = (σ/√n) · Zn* + µ and Zn* ∼ N(0, 1).

Example 22: Approximation of the binomial distribution
We have already seen that for a sequence of n independent Bernoulli trials the number Sn of successes follows a binomial distribution. We can write Sn as Sn = X1 + . . . + Xn, where X1, . . . , Xn are independent Bernoulli random variables with parameter p. If n is large, we can approximate the binomial distribution of Sn by a normal distribution. The central limit theorem tells us that Zn = (1/n) · Sn is approximately normally distributed with mean p (of the Bernoulli random variables) and variance p(1 − p)/n. Therefore Sn must approximately follow the distribution N(np, np(1 − p)).

Fig. 2.3 shows a plot of the cumulative probability distribution of Zn*. It can be seen that the distribution approaches N(0, 1) as n increases.

[Figure 2.3: The distribution of Zn* (yellow) approaches the standard normal distribution (red) as n increases; panels for n = 4, n = 10 and n = 50.]
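The convergence illustrated in Fig. 2.3 can be reproduced by simulation; the following Python sketch (numpy assumed; added here, not part of the notes) standardizes the mean of n Bernoulli trials and estimates one value of the limiting distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 0.5, 100_000                              # Bernoulli parameter, repetitions

for n in (4, 10, 50):
    x = rng.binomial(1, p, size=(reps, n))          # reps sequences of n Bernoulli trials
    zn = x.mean(axis=1)                             # Z_n = (X_1 + ... + X_n) / n
    zn_star = (zn - p) / np.sqrt(p * (1 - p) / n)   # Z_n* = (Z_n - mu) / (sigma / sqrt(n))
    # P(Z_n* <= 1) approaches Phi(1) = 0.8413 as n grows
    print(n, np.mean(zn_star <= 1.0))
```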
2.5 Exercises

Easy

Exercise 18: Probability Density Function Find the value of the constant c ∈ R such that the function f is the density of a continuous random variable X (or the joint density of X and Y), or argue why there is no such constant. Consider the following functions:
• Case 1: f(x) = c · x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
• Case 2: f(x) = 1 + c · x for 0 ≤ x ≤ 3, and f(x) = 0 otherwise.
• Case 3: f(x, y) = c · e^{−(x+2y)} for 0 ≤ x < ∞, 0 ≤ y < ∞, and f(x, y) = 0 otherwise.
For the cases 1 and 2, calculate the expected value E(X) if it is possible.

Exercise 19: Exponential Distribution Assume that the lifetime T (in 1000 hour units) of a certain facility (e.g. a piece of hardware) has the exponential distribution with parameter λ = 1/2. Find the following:
1. the average lifetime,
2. P(T > 3),
3. P(T ≤ 2),
4. P(1 ≤ T ≤ 4).

Exercise 20: Chebyshev Inequality Suppose that X is exponentially distributed with parameter λ > 0. Compute the true value and the Chebyshev bound for the probability that X is at least k standard deviations away from the mean. Plot both as a function of k.

Exercise 21: Exponential Distribution An advertisement agency assumes that the time TV viewers remember the content of a commercial is exponentially distributed with parameter λ = 0.25 (days). What portion of TV viewers are able to remember the content of the commercial after 7 days?

Exercise 22: Inverse Transform Method Prove that for a continuous random variable X, for which the inverse F^{−1}(x) of the cumulative distribution function exists, the following holds: If U ∼ U(0, 1) and Y = F^{−1}(U), then Y has the same distribution as X.

Medium

Exercise 23: Normal Distribution Assume that Z is normally distributed with mean −0.5 and standard deviation 2. Determine the following values and probabilities.
a) Determine p = P(Z* < 0.3), where Z* denotes the standardized random variable of Z.
b) Determine p = P(Z ≤ 0).
c) Determine z with P(Z ≤ z) = 0.271.
d) Determine p = P(|Z| > 0.1).

Exercise 24: Inverse Transform Method
a) Prove that for a continuous random variable X, for which the inverse F^{−1}(x) of the cumulative distribution function exists, the following holds: If U ∼ U(0, 1) and Y = F^{−1}(U), then Y has the same distribution as X.
b) The cumulative distribution function of the Pareto distribution is given as

   F(x) = 1 − (xm/x)^α   for x ≥ xm,   and   F(x) = 0   for x < xm,

where x ∈ R and α, xm > 0.
i) Use the transform method in a) to determine a procedure to generate a Pareto distributed random variate if U ∼ U[0, 1] is given.
ii) If Y is an exponentially distributed random variable with rate α, then xm · exp(Y) is Pareto distributed with parameters α and xm as above. Describe an alternative procedure for generating a Pareto distributed random variate when an exponentially distributed random variate Y is given.

Exercise 25: Monte-Carlo Simulation Consider the generalized exponential distribution f(x) = c · e^{−x³+1.5x²}, 0 ≤ x ≤ 3. Write a MATLAB or R function that computes the value of c such that f(x) is a density, and compute the expectation of the above distribution by simulating uniformly distributed numbers u ∼ U([0, 3]²) without applying the inverse transform method.

Difficult

Exercise 26: Memoryless Property Prove that the exponential distribution is the only continuous distribution with the memoryless property.

Exercise 27: Transformation of Random Variables Assume that X and Y are independent and identically distributed exponential random variables. Show that the conditional distribution of X given X + Y = t (i.e. P(X > x | X + Y = t)) is the uniform distribution on (0, t).

Chapter 3

Discrete-time Markov Chains

A Markov chain is a countable family of random variables that take values in a discrete set S (called the state space) and obey the Markov property. Markov chains typically describe the temporal dynamics of a process (or system) over time, and we can distinguish between chains that act in continuous time and chains that act in discrete time.

3.1 Discrete-Time Markov Chains

A discrete-time Markov chain (DTMC) is a family of random variables (Xn)_{n∈N0}, where n is a discrete index, Xn : Ω → S, and X fulfils the Markov property

   P(Xn+1 = y | Xn = xn, Xn−1 = xn−1, . . . , X0 = x0) = P(Xn+1 = y | Xn = xn)

for all n, y, x0, . . . , xn. The Markov property expresses the assumption that the information about the present (i.e., Xn = xn) is relevant to predict the future behaviour of the system, whereas additional information about the past (Xj = sj, j ≤ n − 1) is irrelevant. W.l.o.g. we will assume that S = N (we can simply enumerate all states in S such that S = {s1, s2, . . .} and then just write X = i instead of X = si) and that the transition probabilities

   pij = P(Xn+1 = j | Xn = i)

do not depend on n (time homogeneity). Thus, we can arrange the transition probabilities in a matrix P = (pij)_{i,j∈{1,2,...}}. The matrix P is called the transition probability matrix of the DTMC X.

3.1.1 Transient distribution

Let p0 be the row vector that contains the initial distribution at time zero, i.e., the entries P(X0 = x) for all x. We now find that p1 = p0 · P since

   P(X1 = x) = Σ_{y∈S} P(X1 = x | X0 = y) · P(X0 = y).

Successive application of this argument yields

   pn = pn−1 · P = . . . = p0 · P^n,

that is, the probabilities after n steps are obtained by n multiplications with P. Note that P is a stochastic matrix, i.e., the row sums are one and it has only non-negative entries. If we consider P · . . . · P = P^n, then this matrix is stochastic as well and contains for each state the probability of reaching another state after n steps. Therefore, in the sequel we may sometimes write P^(n) instead of P^n.
Example 23: Discrete-time Markov chain
Assume we want to describe the weather over time and we only distinguish between sunny, cloudy and rainy weather. Thus, S = {1, 2, 3}, where 1 stands for sunny, 2 for cloudy, and 3 for rainy weather. Assume further we know that with probability 0.7 a sunny day is followed by another sunny day, with probability 0.2 by a cloudy day, and with probability 0.1 by a rainy day. Similarly, we can make assumptions about the day after a cloudy day or a rainy day. Note that we assume that the distribution for the next day is only determined by the weather on the current day and nothing else, in particular not by the weather in the past. Figure 3.1 shows the underlying graph of the DTMC X, which consists of the state space S (nodes of the graph) and the possible transitions annotated with the transition probabilities. The matrix

   P = [ 0.7   0.2   0.1
         0.25  0.5   0.25
         0.1   0.2   0.7  ]

contains the transition probabilities of X. Let p0 = (1, 0, 0) be the initial distribution of the chain. Then, after two steps,

   p2 = p0 · P² = (0.55, 0.26, 0.19),

that is, for instance, P(X2 = 2) = 0.26.

[Figure 3.1: The underlying graph of a DTMC: the three weather states with their transition probabilities.]

Given P and p0, we have seen two ways of computing the vector pn of state probabilities after n steps, called the transient distribution after n steps. Either we successively compute p1, p2, . . . , pn by multiplying the vector with P, or we compute the matrix product P^n and multiply it with p0. The former approach is usually preferable since, if P is large (an N × N square matrix), it requires significantly less computational work (matrix multiplication is O(N³) while matrix-vector multiplication is O(N²) if carried out naively) and less memory. Note that even if P is a sparse matrix, the powers of P will typically not be sparse.

3.1.2 Equilibrium distribution

Often one is interested in the distribution of a DTMC in the long run, i.e. when the number of steps approaches infinity. By multiplying the current distribution with P very often, we can approximate the limit

   lim_{n→∞} pn =: π,

called the equilibrium distribution of X. Its entries contain the frequencies with which the process visits the states in the long run.

Example 24: Equilibrium distribution
In the weather example above, pn converges to the following vector after a large number of multiplications with P:

   π ≈ (0.3571, 0.2857, 0.3571).
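Both computations for the weather chain (p2 from Example 23 and π from Example 24) take only a few lines; the Python sketch below (numpy assumed) is an added illustration, not part of the original notes.

```python
import numpy as np

P = np.array([[0.70, 0.20, 0.10],
              [0.25, 0.50, 0.25],
              [0.10, 0.20, 0.70]])   # transition probability matrix of Example 23
p = np.array([1.0, 0.0, 0.0])        # initial distribution p0

for _ in range(2):                   # two steps: p2 = p0 * P^2
    p = p @ P
print(p)                             # [0.55 0.26 0.19]

for _ in range(1000):                # approximate the equilibrium by further iteration
    p = p @ P
print(p)                             # approx [0.3571 0.2857 0.3571]
```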
The limit lim_{n→∞} pn yields a vector with all (infinitely many) entries equal to zero.

Another problem is that the limit π may depend on the initial distribution p0. In particular, the underlying graph of the DTMC may contain several strongly connected components. If some of these components are bottom, they cannot be left.

Example 27: DTMCs with strongly connected components
Consider the two DTMCs illustrated by the graph below.

[Figure: two DTMCs over the states 1–8 with transition probabilities 1/2, 1/3 and 1; in each, the three strongly connected components are indicated by the colours red, yellow and green.]

Both contain three strongly connected components (SCCs), indicated by the three colours red, yellow and green. However, the DTMC on the left contains only one SCC that is bottom, i.e. that cannot be left (red states), while in the case of the DTMC on the right there are two SCCs that are bottom (red and green). In the left case it does not matter what initial distribution we choose. In the limit we will always remain in the states 6 and 7 and switch between these two states. On average we will spend half of the time in state 6 and the other half in state 7. Thus, π_left = (0, 0, 0, 0, 0, 1/2, 1/2, 0). In the right case, however, we have a dependence on the initial distribution. For instance, if p0 = (0, 0, 0, 0, 0, 1, 0, 0) then we get the same result as for the left DTMC, while for p0 = (0, 0, 1, 0, 0, 0, 0, 0) we get an equilibrium distribution π_right that only assigns positive probability to the states 3, 4 and 8. If, for example, p0 = (1, 0, . . .) then the states 3, 4, 6, 7 and 8 will get a positive equilibrium probability. We will revisit this example in the sequel after discussing how π_right can be computed in that case.

The examples above motivate the following definitions.

Definition 14: Irreducibility
A DTMC is called irreducible if any state can be reached from any other state, i.e., for all i, j ∈ S there exist n, n′ ∈ N0 such that P^(n)_ij > 0 and P^(n′)_ji > 0. Otherwise we call the DTMC reducible.

Example 28: Irreducible and reducible DTMCs
The two DTMCs in Example 27 are reducible. The DTMCs in the Examples 23, 25 and 26 are all irreducible. Note that a DTMC with |S| > 1 that contains a state i with pii = 1 is always reducible.

Definition 15: Aperiodic DTMC
We define the period of a state i as

p = gcd{n : P (X(n) = i | X(0) = i) > 0},

i.e. the greatest common divisor of all n for which the chain can go from state i back to i in n steps. A state is called aperiodic if its period is 1. A Markov chain is called aperiodic if all its states are aperiodic.

In the sequel, we will only consider aperiodic DTMCs.

Equilibrium of finite irreducible DTMCs

Based on the two definitions above we can state an important result in Markov chain theory. Let (X(t), t ∈ N0) be a finite-state Markov chain that is irreducible (and aperiodic). Then the equilibrium distribution π exists and fulfils the following equation system

π = π · P,    Σ_{i∈S} πi = 1.

Moreover, in this case the above equation system always has a unique solution. Thus, we can compute π by solving the equation system. The condition π = π · P is often referred to as the global balance equations, since it states that in the limit the probability flow Σ_j πj pji into state i is equal to the flow Σ_j πi pij out of state i, i.e.

Σ_j πj pji = πi = πi Σ_j pij  ⟺  Σ_{j≠i} πj pji = πi Σ_{j≠i} pij.
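In practice, this equation system is solved numerically. The following MATLAB sketch does so for the weather chain of Example 23 by replacing one redundant equation with the normalization condition; the worked computation in Example 29 below proceeds in exactly the same way.

P = [0.7 0.2 0.1; 0.25 0.5 0.25; 0.1 0.2 0.7];  % weather chain of Example 23
N = size(P,1);
A = (P - eye(N))';     % transpose of P - I; it has rank N-1
A(N,:) = ones(1,N);    % replace one redundant row by the normalization condition
b = [zeros(N-1,1); 1];
piEq = (A \ b)'        % equilibrium distribution, approx. (0.3571, 0.2857, 0.3571)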
Example 29: Equilibrium distribution
Consider again the DTMC in Example 23. We can compute the solution of the above equation system by rewriting π = π · P into π(P − I) = 0, where I is the identity matrix of size 3 × 3. Note that the matrix A = (P − I)^T (where T refers to the transpose of a matrix) has rank 2 and contains redundant information (its column sums are zero because the row sums of P are one). We can therefore ignore one row of A and instead incorporate the normalization condition Σ_{i∈S} πi = 1 by replacing the removed row by a vector of ones. Thus, we solve

( −0.3  0.25  0.1      ( π1 )     ( 0 )
   0.2  −0.5  0.2   ·    π2    =    0
   1     1    1    )     π3 )     ( 1 ).

Note that the last equation of the system ensures that Σ_{i∈S} πi = 1. The solution is (cf. Example 24) π ≈ (0.3571, 0.2857, 0.3571).

Equilibrium of finite reducible DTMCs

If X is reducible (and aperiodic), the equilibrium distribution π depends on the initial distribution p0, and computing it requires an analysis of the underlying graph of X.

Definition 16: Bottom strongly connected component (BSCC)
A set T of states is strongly connected if, for each pair of states i and j in T , j is reachable from i passing only through states in T . A strongly connected component (SCC) is a maximal strongly connected set of states, i.e. no superset of it is also strongly connected. A bottom strongly connected component (BSCC) is an SCC T from which no state outside T is reachable.

The equilibrium distribution π can be computed by carrying out the following steps:

1. Find all bottom strongly connected components B1, . . . , Bk in the underlying graph of X.
2. Compute the vectors pe1, . . . , pek of the probabilities of eventually entering B1, . . . , Bk, as defined below.
3. Determine the equilibrium distributions πB1, . . . , πBk of the components considered in isolation.

Given pe1, . . . , pek and πB1, . . . , πBk, we can compute the equilibrium probability of a state i ∈ Bℓ as

π(i) = Σ_j p0(j) peℓ(j) πBℓ(i),

where πBℓ is the equilibrium distribution of the BSCC Bℓ considered in isolation. The idea of the computation above is that we start initially in state j with probability p0(j) and then eventually enter from j the BSCC Bℓ with probability peℓ(j), in which state i has the equilibrium probability πBℓ(i). Summation over all j gives us the final equilibrium probability of i. If i ∉ B1 ∪ . . . ∪ Bk then π(i) = 0 because eventually the process will enter one of the BSCCs, which it cannot leave. Note that since the BSCCs considered in isolation are irreducible and finite, we can compute πBℓ by solving the equation system

πBℓ = πBℓ PBℓ,    Σ_{j∈Bℓ} πBℓ(j) = 1,

where PBℓ is the transition probability matrix restricted to the subset Bℓ.

For step 1 we have to find all BSCCs of X. This requires running a graph analysis algorithm such as Tarjan's algorithm.

For step 2 we have to solve the following system of equations for each BSCC Bℓ. We define a vector peℓ such that

peℓ(i) = 0                                           if Bℓ is not reachable from i,
peℓ(i) = 1                                           if i ∈ Bℓ,
peℓ(i) = Σ_{j∉Bℓ} pij peℓ(j) + Σ_{j∈Bℓ} pij          otherwise.

It can be shown that peℓ(i) is the probability of eventually entering Bℓ when starting in state i. The last equation follows from the two cases that either we enter Bℓ in one step (second term) or we first visit some state j outside Bℓ and enter Bℓ from there (first term). Note that we can alternatively write this as

peℓ = peℓ · P^T,  peℓ(i) = 1 if i ∈ Bℓ,  peℓ(i) = 0 if Bℓ is not reachable from i,   (3.1)

and incorporate the last two conditions into the system of equations in the same way as the normalization condition Σ_i π(i) = 1 was incorporated above.
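For step 2, a minimal MATLAB sketch is given below; it assumes that the BSCC under consideration and the states from which it cannot be reached have already been determined by a graph analysis (B and U are row vectors of state indices, and the function name reach_prob is purely illustrative).

function pe = reach_prob(P, B, U)
  % Probability of eventually entering the BSCC with state indices B;
  % U contains the states from which B is not reachable.
  N = size(P,1);
  A = P - eye(N);           % encodes pe(i) = sum_j p_ij * pe(j) for transient i
  b = zeros(N,1);
  for i = [B U]             % fix pe(i) = 1 on B and pe(i) = 0 on U
    A(i,:) = 0; A(i,i) = 1;
    b(i) = ismember(i, B);
  end
  pe = A \ b;
end

Applied to the right-hand DTMC of Example 27, this yields the vectors pe1 and pe2 used in the following example.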
Example 30: Equilibrium distribution of a reducible DTMC
Consider again the right DTMC in Example 27 and assume that the initial distribution is p0 = (1/2, 0, 1/2, 0, 0, 0, 0, 0). The BSCCs are B1 = {3, 4, 8} and B2 = {6, 7}, and thus π(1) = π(2) = π(5) = 0. For step 3 we find

πB1 = (1/3, 1/3, 1/3),    πB2 = (1/2, 1/2)

by applying the approach for irreducible chains to B1 and B2. For step 2 we first consider the probability of reaching B1. We can aggregate all states of B1 and of B2 into one state each, since once we have reached one of these sets, the future evolution of the process is not of interest. Thus, we consider the DTMC below in order to compute pe1 and pe2.

[Figure: the aggregated DTMC — the transient states 1, 2 and 5 with their original transition probabilities (1/2 and 1/3), and the two absorbing macro-states {3, 4, 8} and {6, 7}, each with a self-loop of probability 1.]

Next we transform the equation system in (3.1) into Ax = b and define the matrix A = (P − I) (now, P is the transition probability matrix of the aggregated chain). This matrix has two rows containing only zeros, corresponding to the two BSCCs. We replace these two rows with the unit vectors that are zero everywhere except at the position of B1 (of B2, respectively), where they are equal to one. Then, by choosing for b a unit vector with a one at the position of Bℓ, we can incorporate the two constraints peℓ(i) = 1 if i ∈ Bℓ, and peℓ(i) = 0 if Bℓ is not reachable from i. This yields

pe1 = (0.4, 0.4, 1, 1, 0.2, 0, 0, 1)  and  pe2 = (0.6, 0.6, 0, 0, 0.8, 1, 1, 0).

Multiplication with p0 gives us the probability of reaching B1 and B2 (after starting with initial distribution p0):

p0 · pe1^T = 0.7,    p0 · pe2^T = 0.3.

Thus, we finally get

π(1) = π(2) = π(5) = 0,  π(3) = π(4) = π(8) = 0.7 · 1/3,  π(6) = π(7) = 0.3 · 1/2.

Equilibrium of infinite DTMCs

There are many applications for Markov chains with infinitely many states. Although the real system can only have a finite number of states, we often decide for a model with infinitely many states. The reason is that often a priori bounds on certain countable objects are not known, and instead of assuming some (very unlikely high) bounds it is easier to work with an infinite model, analyse it, and find realistic bounds that are met with very high probability.

Example 31: Queuing System
Queuing systems are mathematical models that describe real-world systems where queuing occurs, for instance, network packets in the Internet, customers waiting to be served, or molecules that are produced and can degrade. Most queuing systems are modelled as continuous-time Markov chains, but it is also possible to design a DTMC that describes a queuing system, as can be seen below (assuming pλ + pµ ≤ 1).

[Figure: the underlying graph — from state 0 the chain moves to state 1 with probability pλ and stays in 0 with probability 1 − pλ; from every state i ≥ 1 it moves to i + 1 with probability pλ, to i − 1 with probability pµ, and stays in i with probability 1 − pλ − pµ.]

In each step, a customer arrives with probability pλ and leaves with probability pµ. With probability 1 − pλ − pµ nothing happens during the current time step. Alternatively we can see this chain as a random walk on the non-negative integers.

In the sequel we state sufficient conditions under which the equilibrium distribution exists. An important state property in this context is positive recurrence, which holds for a state i if the process returns to i infinitely often. Formally, we define the first passage time (or hitting time)

Tij = inf{n ≥ 1 : Xn = j | X0 = i}

and the probability to enter state j from state i within n ≥ 1 steps

Fij(n) = P (X1 = j ∨ X2 = j ∨ . . . ∨ Xn = j | X0 = i).

Definition 17: Recurrence
• A state i is called recurrent if Fii(∞) = 1, and transient otherwise.
• A recurrent state i is called positive recurrent if E(Tii) < ∞.
• A recurrent state i is called null recurrent if E(Tii) = ∞.

Note that if a finite chain is irreducible, then all states are positive recurrent. Moreover, recurrence is a BSCC property, i.e., if a single state of a BSCC fulfils it, then all states of the BSCC fulfil it (for SCCs this might not be true). In addition, it is possible to show that if an irreducible (possibly infinite) chain has at least one positive recurrent state, then its equilibrium distribution exists.

Example 32: Recurrence
The states of the chain in Example 31 are all positive recurrent if pµ > 1/2, null recurrent if pλ = pµ = 1/2, and transient if pλ > 1/2.

Next we find that positive recurrence allows us to compute the equilibrium distribution by solving an equation system. Let (X(t), t ∈ N0) be a (possibly infinite) DTMC in which all states are positive recurrent (and aperiodic). Then the equilibrium distribution π exists and is the unique solution of the following equation system

π = π · P,    Σ_{i∈S} πi = 1.

Note that aperiodic chains in which all states are positive recurrent are also called ergodic. The above system of equations is infinite, and therefore it is often solved by "guessing" a solution and showing that it fulfils the equations.

Example 33: Guessing the equilibrium distribution
For the chain in Example 31 we assume that pµ > pλ and consider the global balance equations:

π0 pλ = π1 pµ
π1 (pλ + pµ) = π0 pλ + π2 pµ
. . .
πi (pλ + pµ) = πi−1 pλ + πi+1 pµ
. . .

Inserting the first equation into the second one yields

π1 (pλ + pµ) = π1 pµ + π2 pµ  ⟺  π1 pλ = π2 pµ.

Next, we can insert π1 pλ = π2 pµ into the next equation π2 (pλ + pµ) = π1 pλ + π3 pµ (for state 2), which yields

π2 (pλ + pµ) = π2 pµ + π3 pµ  ⟺  π2 pλ = π3 pµ.

Thus, our guess is that

π1 = (pλ/pµ) π0,  π2 = (pλ/pµ) π1 = (pλ/pµ)² π0,  . . . ,  πi = (pλ/pµ)^i π0

and

Σ_{i=0}^{∞} πi = 1  ⟺  π0 Σ_{i=0}^{∞} (pλ/pµ)^i = 1  ⟺  π0 = 1 / Σ_{i=0}^{∞} (pλ/pµ)^i,

which implies (geometric series) that π0 = 1 − pλ/pµ. And indeed, inserting our guess into the global balance equations shows that it is a solution of the equations. Since we assumed that pµ > pλ and thus the DTMC is ergodic (see also Example 35), it must be the unique solution, and thus we have guessed the correct values for π.

In general, ergodicity of infinite-state chains is undecidable. However, often infinite-state chains that describe real-world systems (e.g. queuing systems) have a very regular structure which allows a graph analysis, and all (B)SCCs can be determined. Then we can proceed in a similar way as for finite DTMCs. However, within an infinite BSCC it may still be the case that Fii(∞) < 1. Thus, the question remains how to check positive recurrence if a graph analysis tells us that the infinite chain is irreducible. Instead of reasoning about Fii(n), it is often easier to reason about the drift of the chain, which is usually defined in terms of a level function l : S → R+,

dl(i) = E(l(X1) − l(X0) | X0 = i) = Σ_j pij (l(j) − l(i)).

The drift tells us the expected change of the level function when starting in state i.

Example 34: Drift (1)
For the chain in Example 31 assume that the level of state i is i + 1, i.e. l(i) = i + 1. Then the drift for state i with i > 0 is

dl(i) = Σ_j pij (l(j) − l(i)) = pλ · (+1) + pµ · (−1) = pλ − pµ.

The drift of state i = 0 is dl(0) = pλ.

The following result provides a sufficient criterion for ergodicity.
Let X be an irreducible (possibly infinite) DTMC with state space S and l : S → R+ such that the set {i ∈ S | l(i) ≤ k} is finite for any k < ∞. Then X is ergodic if there exists a constant c > 0 and a finite subset C ⊂ S with

i) dl(i) ≤ −c for all i ∈ S \ C,
ii) dl(i) < ∞ for all i ∈ C.

Intuitively, the above result states that the drift may only be greater than −c in a finite number of states (the set C, in whose direction the chain drifts). In particular, the drift may be positive there. In all other states the drift must be ≤ −c, which means that the chain "drifts away" from these states (towards lower levels). It is important to note that the level function (also called Lyapunov function) is such that in the first k levels there cannot be infinitely many states (see the condition "the set {i ∈ S | l(i) ≤ k} is finite for any k < ∞"). For instance, if the state space of X is S = {(i, j) | i, j ∈ N0} (two-dimensional grid) then we can, for instance, use the level function l(i, j) = max{i, j, 1}, as illustrated in the figure on the right.

Example 35: Drift (2)
For the chain in Example 31 we find that the function l with l(i) = i + 1 fulfils the criterion that {i ∈ S | l(i) ≤ k} is finite for any k < ∞. Moreover, if pµ > 1/2 then pλ < 1/2 and thus

dl(i) = pλ − pµ < 0,  i = 1, 2, . . .

Thus, state 0 is the only state with positive drift dl(0) = pλ, and thus C = {0}. We conclude that in the case pµ > 1/2 the chain is ergodic. If, however, pλ = pµ = 1/2 or pλ > 1/2 then we have infinitely many states with drift 0 or even positive drift. (Note that we cannot conclude that the DTMC is not ergodic, since a different function l may exist that fulfils the ergodicity conditions.)

To summarize, we have considered infinite-state DTMCs and focused on the case where the underlying graph is irreducible (the graph forms one large BSCC). If the chain is ergodic, the global balance equations hold, and we can check ergodicity by using the drift arguments above. Checking the reverse (all states are transient) is difficult, as is reasoning about infinite reducible DTMCs (with infinite SCCs or BSCCs). It is therefore omitted here (also because there is little practical significance in these cases).

[Figure 3.2: The structure of the random web surfer model: from a page j with nj links, each linked page is reached with probability α/nj + (1 − α)/N and every other page with probability (1 − α)/N. Source: RWTH Aachen and H. Hermanns, Universität des Saarlandes.]

3.2 Case Study: PageRank

About 16 years ago the success story of Google started with the development of Google's PageRank algorithm, which was used to rank web pages according to their importance to the users. The theoretical basis of PageRank is the random web surfer model, which corresponds to a discrete-time Markov chain. Note that due to its simplicity and generality, PageRank is also used in other domains, e.g. to rank genes according to a microarray experiment or according to their importance w.r.t. a disease. Moreover, the ranking mechanism can be applied to any graph and yields a network centrality score that reflects the importance of each node in light of the entire graph structure.

For the random web surfer model we let N be the total number of all (indexed) web pages. Assume that a web surfer starts initially at some random web page. Then she follows one of the links on that page (each link has the same probability) and visits the next web page.
Then again she follows one of the links on that page and visits the next web page, and so on. What is the probability to be on page i after k clicks? What if k → ∞? The idea behind this model is that more important websites are likely to receive more links from other websites and should be ranked higher. The pages that are visited with high probability in the limit should be ranked first. Obviously, we can describe this process with a DTMC where each state corresponds to a web page. The initial distribution of the chain is p0(j) = 1/N for all j ∈ {1, . . . , N}. The transition probability to go from page j to page i is pji = 1/nj if there is a link from j to page i, where nj is the number of links on page j (see also Figure 3.2, left). The probability of being on page i after 1 click is therefore

p1(i) = Σ_{j→i} (1/nj) p0(j),

where j → i means that there is a link from j to i. For k clicks we get the recursion

pk(i) = Σ_{j→i} (1/nj) pk−1(j).

Obviously, the underlying graph may be reducible, e.g. the surfer may end up in a BSCC which cannot be left, since each page has only links to pages in the BSCC but not to pages outside. In addition, the equilibrium distribution would give probability zero to all pages that are not part of a BSCC, and thus all these pages could not be ranked among each other and would appear in the ranking after all BSCC pages. Moreover, it is not very realistic that a surfer always only follows links. The solution is to assume that in each step the surfer either follows a link with probability α, or she decides to start over with probability 1 − α and again randomly chooses a page from all N possible pages. The parameter 0 < α < 1 is called the damping factor. In Figure 3.2 (right) we illustrate the outgoing transitions of a page (incoming transitions are omitted). Now we see that the chain is irreducible (there is a direct transition from any page to any other page with probability at least (1 − α)/N), and the equilibrium distribution is given by the solution of the equation system

πi = Σ_{j→i} (α/nj) πj + (1 − α)/N,    i ∈ {1, 2, . . . , N},

where we used Σ_j πj = 1. This equation system is usually solved by applying iterative methods such as the power method, which successively multiplies a vector p with the transition probability matrix P , i.e. we iteratively compute p^(k+1) = p^(k) P for k = 0, 1, . . . until convergence. The initial vector can, for instance, be chosen as the uniform distribution that assigns probability 1/N to each page. Clearly, for very large graphs (N > 10,000,000) distributed algorithms have to be applied.
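The following MATLAB sketch illustrates the power method for a small random surfer chain; the link graph A and the damping factor α = 0.85 are arbitrary illustrative choices, and the sketch assumes that every page has at least one outgoing link.

A = [0 1 1; 1 0 1; 1 1 0];           % A(j,i) = 1 if page j links to page i (example graph)
alpha = 0.85;                        % damping factor (illustrative value)
N = size(A,1);
n = sum(A,2);                        % n(j): number of links on page j
P = alpha * (A ./ n) + (1-alpha)/N;  % transition matrix of the random surfer
p = ones(1,N) / N;                   % start with the uniform distribution
for k = 1:100                        % power method: p <- p*P until convergence
    p = p * P;
end
p                                    % approximate equilibrium = PageRank scores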
We finally note that the PageRank algorithm, although it was the first algorithm used to order search engine results, is nowadays only a part of Google's current ranking methodology.

3.3 Case Study: DNA Methylation

DNA methylation is a modification of the DNA where a methyl group is added to a cytosine base. It typically occurs at so-called CpG sites, where a cytosine nucleotide (C) is followed by a guanine nucleotide (G); hence on the opposite side of the DNA strand there is the complementary sequence GC. DNA methylation plays an important role in gene regulation, for instance, during the early development of an embryo. Moreover, abnormal DNA methylation patterns have been observed in cancer cells.

[Figure 3.3: DNA methylation after DNA replication.]

When the cell divides, the DNA is replicated by splitting the two strands and synthesizing a new complementary sequence on the other side of the two DNA strands of the mother cell. Thus, in each of the two daughter cells, one strand is completely unmethylated. However, during replication the cell tries to maintain the methylation pattern by adding methyl groups on the newly synthesized strands. With high probability a hemimethylated CpG site (methyl group on one strand) becomes fully methylated (methyl group on both strands, see Figure 3.3). However, methylation patterns are not always accurately copied, and methyl groups may also be added to CpG sites that are unmethylated, such that the site becomes hemimethylated.

If we model the dynamics of DNA methylation over time, we use a DTMC with three states (unmethylated U, hemimethylated H, fully methylated F). The time steps correspond to cell replications, and the transition probability matrix P is the product of the matrices D and M, where D describes cell division and M describes all methylation events between two cell divisions. In Figure 3.4 we illustrate the transitions of D and those of M.

[Figure 3.4: Transitions of cell division (matrix D): U → U with probability 1, H → U and H → H each with probability 1/2, F → H with probability 1. Transitions of the methylation events (matrix M): U → H with probability µ and U → U with probability 1 − µ; H → F with probability λ, H → U with probability δ and H → H with probability 1 − λ − δ; F → F with probability 1.]

Given a time course of the state of a CpG, we can estimate the parameters µ, λ, δ and the initial conditions of the system (how to do parameter estimation will be discussed later in the course). Moreover, we can simulate the system until equilibrium, or distinguish the contributions of several enzymes to methylation given knockout data. When extending the model to a hidden Markov model, we can also take into account measurement errors (hidden Markov models will be discussed later in the course).

3.4 From DTMCs to CTMCs

Since time is continuous in most real systems, it is often easier not to discretize time to get a time-discrete model but to work with a continuous-time model. A continuous-time Markov chain (CTMC) differs from a discrete-time Markov chain only in that the time between two transitions is not exactly one time unit but a (positive) random number R. It is possible to prove that if the process (Xt)t∈R≥0, Xt : Ω → S, fulfils the Markov property, then the distribution of R is exponential, i.e.

P (R > t) = exp(−λt),  t ≥ 0,

for some λ > 0 (assuming that the state can be left with positive probability). If the process enters a state that cannot be left with positive probability, then it will remain there forever. Note that in DTMCs we use high self-loop probabilities if we want the process to stay long in a certain state (while our discretization of time is rather fine, since other states are left after a single step). In CTMCs there are no self-loops. Instead, a small value for the parameter λ is chosen for each state in which the process remains for a long time.

Example 36: Molecular Evolution
We model the evolution of a single DNA base across multiple generations using the Jukes-Cantor model. Thus, Xn describes which base is found at a fixed DNA position in the n-th descendant of an organism. We assume that the base is not under any selective pressure and that the base in one generation depends only on the base in the immediately previous generation. We construct a DTMC (Xn) with state space {A, C, G, T} corresponding to the four possible bases.
Furthermore, in each generation the current base changes with probability µ to each of the three other bases. Thus,

P = ( 1 − 3µ   µ        µ        µ
      µ        1 − 3µ   µ        µ
      µ        µ        1 − 3µ   µ
      µ        µ        µ        1 − 3µ ).

Due to its symmetry, all states have the same equilibrium probability. However, for a fixed number k of generations we can compute the probability of still having the initial base at the DNA position. Since the probability of a base mutation is very small (µ ≈ 10⁻⁹), a discrete model is not very appropriate here, since it takes many generations until a state is left (geometrically distributed with parameter 3µ; mean 1/(3µ) ≈ 3.3 · 10⁸). Thus, we can use a CTMC where the time until a state is left is exponentially distributed with parameter λ. For instance, we could choose λ = 10⁻⁶. Then the average time until a state is left is 1/λ = 10⁶. The probability to enter a different state within the next t time units is then given by P (R ≤ t) = 1 − exp(−λt). For the probability to jump from one state to another we can still use the matrix P . For instance, the probability to jump from state A to state C within t time units is given by P (R ≤ t) · µ.

For CTMCs, the transient distributions can be computed by solving a system of ordinary differential equations. The equilibrium distribution is, similarly as for DTMCs, computed by solving a linear system of equations. In the sequel we will not further discuss CTMCs but move on to hidden Markov models.

3.5 Exercises

Easy

Exercise 28: Periodic DTMC
Assume that we choose p0 = (1/4, 1/4, 1/2) as initial distribution in Example 25. Will the sequence pi converge for i → ∞?

Exercise 29: Transient Distribution
Let X be a DTMC with state space S = {1, 2} and transition probability matrix

P = ( 1 − α   α
      β       1 − β ),

where α, β ∈ (0, 1).
a) Find the equilibrium distribution of X.
b) Use MATLAB (or R) to find the transient probability distribution after 8 steps if the initial distribution is given by p0 = (1/5, 4/5) and the parameters are α = 0.18, β = 0.64. (If you have not yet installed MATLAB then write pseudocode.)

Exercise 30: Periodic Markov Chain
Assume that the state space of a DTMC is S = {1, 2, 3, 4, 5, 6, 7} and the transition probability matrix P is

P = ( 0    0    1/2  1/4  1/4  0    0
      0    0    1/3  0    2/3  0    0
      0    0    0    0    0    1/3  2/3
      0    0    0    0    0    1/2  1/2
      0    0    0    0    0    3/4  1/4
      1/2  1/2  0    0    0    0    0
      1/4  3/4  0    0    0    0    0   ).

a) Draw the underlying graph of the chain; is the chain irreducible?
b) Find the period d of state 5.

Exercise 31: Equilibrium of a DTMC
Explain why the equilibrium probability π of the DTMC shown on the left exists and write MATLAB (or R) code for computing it. (If you have not yet installed MATLAB then write pseudocode.)

[Figure: a three-state DTMC with transition probabilities 1/4, 1/2, 3/4, 1/2 and 1.]

Exercise 32: Infinite DTMC (1)
Assume that the state space of a DTMC is S = N0 and its transition probabilities are P (Xt+1 = 0 | Xt = i) = 1/2 and P (Xt+1 = i + 1 | Xt = i) = 1/2 for all i ∈ N0, t ∈ N. The underlying graph is shown below:

[Figure: the underlying graph of the chain.]

a) Argue why the chain is irreducible.
b) Prove that state 0 is recurrent.
c) Let l(i) = i + 1 be a level function. Simplify the corresponding drift function dl as much as possible.
d) Choose a constant c > 0 such that the drift is greater than −c only in a finite number of states.
e) Find the equilibrium distribution of the chain.

Exercise 33: Number of visits
Consider an ergodic DTMC with equilibrium distribution π.
a) Show that the average recurrence time mj of state j is equal to the inverse of state j's equilibrium probability πj.
b) Show that the average number vi,j of visits to state i between two successive visits to state j is equal to the ratio πi/πj.

Medium

Exercise 34: Squaring P
a) Consider the transition probability matrix P of a DTMC:

P = ( 0.125  0.7    0.175
      0.3    0.2    0.5
      0.68   0.178  0.142 ).

Write MATLAB (or R) code for computing the transient distribution of the chain after 16 jumps if the initial distribution is given by (0.14, 0.6, 0.26). In your code you should use exactly 4 matrix multiplications and only one vector-matrix multiplication. (If you have not yet installed MATLAB then write pseudocode.)
b) Would your result change if the initial distribution were different from (0.14, 0.6, 0.26)? Justify your answer.
c) Discuss the general drawbacks of your approach if it is applied to large Markov chains with sparse P .

Exercise 35: Equilibrium of reducible DTMCs
Compute the equilibrium distribution for the reducible DTMCs that follow. For a): can you do it without solving the corresponding equation system for the reachability probabilities? Verify your results with MATLAB for both cases!

a) [Figure: a reducible DTMC over the states 1, 2, 3, 5, 6, 7, 8 with transition probabilities 1/3, 2/3, 99/100, 1/100 and 1/2.]

b) [Figure: a reducible DTMC over the states 1, 2, 3, 4, 5, 6 with transition probabilities 1/2, 1/4, 1/3 and 2/3.]

Exercise 36: Infinite DTMC (2)
Consider a DTMC with state space S = Z²₊ such that for state (i, j) the transition probabilities are given by the following rules:
• The probability to go from state (i, j) to (i + 1, j) is 1/8.
• The probability to go from state (i, j) to (i, j + 1) is 1/8.
• If max{i, j} > 1 then the probability to go from state (i, j) to (i − 1, j − 1) is 1/3.
• The remaining probabilities are self-loop probabilities.
Prove that the chain is ergodic. Argue using a suitable drift function.

Difficult

Exercise 37: Reversible Markov Chain
Consider an irreducible, discrete-time Markov chain X with a finite state space S and transition probability matrix P . Let π be the equilibrium distribution of X. Consider the probabilities rij = (πj/πi) pji, which are the transition probabilities of the reversed chain X∗, i.e. the DTMC (Xn∗)n∈N with Xn∗ = Xτ−n for some fixed time τ . Note that the matrix R with entries rij is the transition probability matrix of X∗. The chain X is called reversible if and only if rij = pij (or, in terms of matrices, R = P , where R is the transition matrix of the reversed Markov chain X∗).
a) Give a matrix definition of R based on the matrix P and the vector π.
b) Consider two Markov chains with the following transition probability matrices

P1 = ( 0.8  0.15  0.05        P2 = ( 0    5/9  4/9
       0.7  0.2   0.1                5/6  0    1/6
       0.5  0.3   0.2  )             4/5  1/5  0   ).

Compute the equilibrium distribution of the two chains and the transition matrices R of the reversed chains. Are these Markov chains reversible?
c) Show that the reversibility condition R = P implies πi pij = πj pji. These equations (for all i, j) are called the detailed balance equations.
d) Derive the global balance equations from the detailed balance equations.
e) Consider an irreducible Markov chain X with transition matrix P and rij = (πj/πi) pji. Prove that X is reversible if and only if the probability of traversing any closed path (cycle) is the same in both the forward and backward directions.

Exercise 38: The Game
Suppose you are playing a game together with your friend. Initially you have A Euros and your friend has B Euros. Each time you toss a fair coin. If you win, you get 1 Euro from your friend. If you lose, you give 1 Euro to your friend.
You and your friend stop gambling only if either you or your friend runs out of money. Your money after the t-th game is denoted by Xt, and X0 = A.
a) Draw the state graph of the above process X. Is X a DTMC?
b) What is the probability that you eventually lose all of your money?

Exercise 39:
Consider a DTMC with state space {1, 2, . . . , n + 1} where state n + 1 cannot be left. Thus, its transition probability matrix P has the form

P = ( T  v
      0  1 )

where T is an n × n matrix, v is a column vector of size n, and 0 is the n-vector of zeros.
Let p0 be the initial distribution.
a) Find a closed-form expression for the probability p(k) to be in the absorbing state after k steps.
b) What must hold for p0, v and T such that p(k) is a discrete probability distribution on N?
c) Find a closed-form expression for the mean of that distribution.

Exercise 40: Poisson Process
A Poisson process with constant rate λ > 0 is a CTMC (N(t))t≥0 taking values in Z+ with the following properties:
a) P (N(t) = k) = e^(−λt) (λt)^k / k!,  k ∈ Z+.
b) P (N(t2) = k2 | N(t1) = k1) = P (N(t2 − t1) = k2 − k1) for t2 ≥ t1, k1, k2 ∈ Z+.
Derive an expression for P (N(s) = k | N(t) = n) when s < t, k, n ∈ Z+.

Exercise 41: DNA methylation
Recall the DNA methylation model in Figure 3.4 in the lecture notes.
a) Biologists can measure DNA methylation with a method called bisulfite sequencing. During this treatment,
– each unmethylated cytosine (C) is converted to thymine (T). With a conversion error of 1 − c, however, the unmethylated C remains an unmethylated C.
– each methylated cytosine (5mC) is converted to (an unmethylated) cytosine (C).
Define an HMM that describes the output of bisulfite sequencing.
b) Consider a modification of methylated cytosines (5mCs) where the methyl group is oxidized and becomes a hydroxy group (5hmC). Formulate a new HMM with states UU, UM, MU, MM, HU, UH, HM, MH, HH where
– UU refers to a CpG that is unmethylated on both sides,
– UM (MU) refers to a CpG where the lower (upper) strand cytosine is methylated,
– MM refers to a CpG where both cytosines are methylated,
– UH (HU) refers to a CpG where the lower (upper) strand cytosine is hydroxylated,
– MH (HM) refers to a CpG where the lower (upper) strand cytosine is hydroxylated and the upper (lower) strand cytosine is methylated,
– HH refers to a CpG where both cytosines are hydroxylated.
Include a new parameter for the probability that a methylated site becomes hydroxylated.
c) Based on the DTMC in b), define two HMMs that describe the results of bisulfite sequencing and oxidative bisulfite sequencing. The conversions (and conversion errors) of bisulfite sequencing and oxidative bisulfite sequencing are illustrated below. (Note that the experiment consists of two steps, (ox.) bisulfite treatment and sequencing, but the intermediate result after bisulfite treatment need not be explicitly modelled.)

[Figure: the conversions of bisulfite and oxidative bisulfite treatment followed by PCR and sequencing — an unmethylated C is converted to U (read as T) with probability c and stays C with probability 1 − c; 5mC is read as C; under oxidative bisulfite treatment, 5hmC is converted to 5hmU (read as T) with probability d.]

Chapter 4 Hidden Markov Models

In this chapter we discuss hidden Markov models (HMMs), which can be seen as an extension of DTMCs. An HMM is particularly useful in modelling whenever we cannot observe the state of a Markov chain directly but only a part of it or "an output" of it. In particular, if we can observe the state but there is a certain error in our observation, then an HMM is an appropriate model.
An HMM extends a finite-state DTMC in that besides the set S of (hidden) states (w.l.o.g. we assume S = {1, 2, . . . , N}) there is a set O of observable states or outputs. Moreover, besides the transition probability matrix P there is a matrix B of output probabilities. If the number of hidden states is |S| = N and the number of observable states is |O| = M , then B is an N × M matrix such that the entry bik is the probability that state i outputs the k-th observation ok ∈ O. If we define the random variable Yn as the output after n steps of the process and Xn as the hidden state after n steps, then

bik = P (Yn = ok | Xn = i).

(Note that for the transition probabilities of a DTMC we had pij = P (Xn = j | Xn−1 = i).) Thus, an HMM is a combination of two stochastic processes X and Y, where X is hidden and Y, which is a function of X, can be observed.

Example 37: Annual Temperatures (1)
We want to model the average annual temperature at a particular location on earth over a series of years for which no direct measurements of the temperature exist (e.g. 500 years ago). For simplicity we only consider two annual temperatures, H for "hot" and C for "cold". Thus, S = {H, C}. Moreover, we know that the probability of a hot year followed by another hot year is 0.7 and the probability that a cold year is followed by another cold year is 0.6 (e.g. estimated from current weather statistics), i.e. the transition probability matrix P equals (rows and columns ordered H, C)

P = ( 0.7  0.3
      0.4  0.6 ).

In addition, we have data about the size of tree growth rings in these years (S for small, M for medium, and L for large; O = {S, M, L}), and we know that the relationship between annual temperature and tree ring sizes is given by the matrix B (rows H, C; columns S, M, L), defined as

B = ( 0.1  0.4  0.5
      0.7  0.2  0.1 ).

The entries of the matrix B are also often called emission probabilities.

4.1 Computing the probability of outputs

Assume we want to compute the probability that a certain sequence τ = ok0 ok1 . . . okm of outputs has been observed, i.e. we want to compute P (Y0 = ok0, Y1 = ok1, . . . , Ym = okm). Let us start with the simplest case m = 0 and let us assume that p0 is the initial distribution of X. Since Y is a function of X, we can condition on X to compute probabilities of Y and use the law of total probability, i.e.

P (Y0 = ok0) = Σ_{i∈S} P (X0 = i) · P (Y0 = ok0 | X0 = i) = Σ_{i∈S} p0(i) · bik0.

For m = 1 we get

P (Y0 = ok0, Y1 = ok1)
= Σ_{i∈S} P (X0 = i) P (Y0 = ok0, Y1 = ok1 | X0 = i)
= Σ_{i∈S} p0(i) P (Y0 = ok0 | Y1 = ok1, X0 = i) P (Y1 = ok1 | X0 = i)
= Σ_{i∈S} p0(i) P (Y0 = ok0 | X0 = i) P (Y1 = ok1 | X0 = i)
= Σ_{i∈S} Σ_{j∈S} p0(i) bik0 P (X1 = j | X0 = i) P (Y1 = ok1 | X1 = j, X0 = i)
= Σ_{i∈S} Σ_{j∈S} p0(i) bik0 pij P (Y1 = ok1 | X1 = j)
= Σ_{i∈S} Σ_{j∈S} p0(i) bik0 pij bjk1.

By defining the diagonal matrices

Wℓ = diag(b1kℓ, b2kℓ, . . . , bNkℓ),  ℓ = 0, 1, . . . , m,

we can write the above sum as a vector-matrix product

P (Y0 = ok0, Y1 = ok1) = p0 W0 P W1 e,

where e is a vector of ones, and thus for general m

P (Y0 = ok0, Y1 = ok1, . . . , Ym = okm) = p0 W0 P W1 P . . . Wm−1 P Wm e.

According to the so-called forward algorithm, we compute this large vector-matrix product from left to right, i.e.

1. Compute the vector v0 = p0 · W0.
2. For n = 1, 2, . . . , m compute vn = vn−1 · P · Wn.
3. Return the sum of the entries of the final vector vm.

Note that a backward computation is also possible.
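The following MATLAB sketch implements the forward algorithm for the HMM of Example 37; the observation sequence and the initial distribution p0 = (0.6, 0.4) are taken from Example 38 below, and the encoding of the outputs as indices (S = 1, M = 2, L = 3) is an illustrative choice.

P = [0.7 0.3; 0.4 0.6];          % transition probabilities (Example 37)
B = [0.1 0.4 0.5; 0.7 0.2 0.1];  % emission probabilities (Example 37)
p0 = [0.6 0.4];                  % initial distribution (Example 38)
obs = [1 2 1 3];                 % tau = S M S L with S=1, M=2, L=3
v = p0 * diag(B(:,obs(1)));      % v0 = p0 * W0
for n = 2:length(obs)
    v = v * P * diag(B(:,obs(n)));   % vn = v(n-1) * P * Wn
end
prob = sum(v)                    % P(Y0=S, Y1=M, Y2=S, Y3=L) = 0.0096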
Example 38: Annual temperatures (2)
For the example of annual temperatures we have the following observed sequence:

τ = S M S L.

Assuming the initial distribution p0 = (0.6, 0.4), we compute P (Y0 = S, Y1 = M, Y2 = S, Y3 = L) as the product p0 W0 P W1 P W2 P W3 e, i.e.

(0.6, 0.4) · diag(0.1, 0.7) · P · diag(0.4, 0.2) · P · diag(0.1, 0.7) · P · diag(0.5, 0.1) · (1, 1)^T = 0.0096.

4.2 Most likely hidden sequence of states

Now we would like to find the sequence ν∗ of hidden states that is most likely for a given sequence τ = ok0 ok1 . . . okm of observations. Thus, ν∗ is the sequence that maximizes the value

P (ν | τ) = P (X0 = i0, X1 = i1, . . . , Xm = im | Y0 = ok0, Y1 = ok1, . . . , Ym = okm)

among all sequences ν = i0 i1 . . . im of hidden states. However, since

P (ν | τ) = P (ν, τ) / P (τ)

and P (τ) does not depend on ν, we can alternatively maximize P (ν, τ), i.e.

ν∗ = argmax_ν P (ν | τ) = argmax_ν P (ν, τ).

The Viterbi algorithm determines ν∗ by iteratively computing the value

wℓ(i) = max_{i0 i1 ... iℓ−1} P (i0 i1 . . . iℓ = i, τ),

which gives the maximal probability of all paths ending in state i after ℓ steps together with the observed sequence τ , where the maximum is taken over all hidden state sequences for the first ℓ − 1 steps. Note that max_i wm(i) = P (ν∗, τ) gives a solution to the above problem. By induction it is easy to show that

wℓ+1(j) = max_i wℓ(i) pij bjkℓ+1.

Hence, we can iteratively compute the vectors w0, w1, . . . , wm and track the sequence of states that corresponds to the maximum as follows:

1. Initialize the vector w0 with entries w0(i) = p0(i) bik0.
2. For all ℓ ∈ {1, 2, . . . , m} compute the vector wℓ with entries

wℓ(j) = max_i wℓ−1(i) pij bjkℓ,  j ∈ S,

and store the states that correspond to the maximum in the vectors sℓ with entries

sℓ(j) = argmax_i wℓ−1(i) pij,  j ∈ S.

3. Return the probability P (ν∗, τ) = max_i wm(i) and the last maximal state i∗m = argmax_i wm(i).
4. Backtrack the remaining states of the optimal sequence ν∗ by setting

i∗ℓ = sℓ+1(i∗ℓ+1),  ℓ = m − 1, . . . , 1, 0.

Example 39: Annual temperatures (3)
Given τ = S M S L we want to find the most likely sequence of annual temperatures given τ . We first compute

w0 = (0.6, 0.4) · diag(0.1, 0.7) = (0.06, 0.28).

Then we find

w1(1) = max{0.06 · 0.7, 0.28 · 0.4} · 0.4 = 0.0448
w1(2) = max{0.06 · 0.3, 0.28 · 0.6} · 0.2 = 0.0336

and s1 = (2, 2). The next step gives

w2(1) = max{0.0448 · 0.7, 0.0336 · 0.4} · 0.1 ≈ 0.0031
w2(2) = max{0.0448 · 0.3, 0.0336 · 0.6} · 0.7 ≈ 0.0141

and s2 = (1, 2). Finally,

w3(1) = max{0.0031 · 0.7, 0.0141 · 0.4} · 0.5 ≈ 0.0028
w3(2) = max{0.0031 · 0.3, 0.0141 · 0.6} · 0.1 ≈ 0.0008

and s3 = (2, 2). Hence, P (ν∗, τ) ≈ 0.0028 and the last state of ν∗ is i∗3 = 1. Backtracking yields i∗2 = s3(1) = 2, i∗1 = s2(2) = 2, and i∗0 = s1(2) = 2. Thus, ν∗ = 2 2 2 1.

We will revisit HMMs when we consider the problem of estimating parameters of Markov models.
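As a complement, here is a minimal MATLAB sketch of the Viterbi algorithm for the same model and observation sequence as in Example 39; again the output encoding S = 1, M = 2, L = 3 is an illustrative choice.

P = [0.7 0.3; 0.4 0.6];
B = [0.1 0.4 0.5; 0.7 0.2 0.1];
p0 = [0.6 0.4];
obs = [1 2 1 3];                    % tau = S M S L
m = length(obs);
w = p0 .* B(:,obs(1))';             % w0(i) = p0(i) * b_{i,k0}
s = zeros(m, length(p0));           % backpointers s_l(j)
for l = 2:m
    [wmax, s(l,:)] = max(w' .* P);  % maximize over the predecessor state i
    w = wmax .* B(:,obs(l))';
end
nu = zeros(1,m);
[pbest, nu(m)] = max(w);            % P(nu*, tau) and the last state
for l = m-1:-1:1
    nu(l) = s(l+1, nu(l+1));        % backtracking
end
pbest, nu                           % approx. 0.0028 and nu* = 2 2 2 1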
4.3 Exercises

Exercise 42:
Consider the hidden Markov model with the set of hidden states S = {s1, s2, s3} and the set of outputs O = {1, 2, 3}. The transition probability and emission matrices are as follows:

P = ( 0    1/2  1/2        B = ( 1/2  1/2  0
      1    0    0                1/2  0    1/2
      0    1    0   )            0    1/2  1/2 ).

The initial distribution is p0 = (1, 0, 0).
• Consider the two observed sequences τ1 = 1, 2, 3 and τ2 = 1, 3, 1. What are all possible state sequences for the observed sequences? What is the probability to observe these sequences?
• Compute the most likely sequence of the hidden states for the observed sequences τ1 = 1, 3, 1 and τ2 = 1, 2, 1 without using the Viterbi algorithm.
• Consider an alternative HMM model where p0 = (1/3, 1/3, 1/3) and

P = ( 0    1/2  1/2        B = ( 1/2  1/2  0
      1/3  1/3  1/3              1/4  0    3/4
      2/3  1/3  0   )            0    3/4  1/4 ).

Again, compute the most likely sequence of the hidden states for the observed sequence τ3 = 1, 3, 1 using the Viterbi algorithm.

Exercise 43: HMM as DTMC
Consider two dice D1 and D2 with four faces. One chooses die D1 with probability α (or D2 with probability 1 − α), throws it, records the result and repeats. The probability distributions of the results obtained by throwing the dice are given below:

Die   1    2    3    4
D1    1/4  1/4  1/4  1/4
D2    1/6  1/6  1/3  1/3

The process of choosing a die is hidden, and we assume that we can only observe the result (i.e. the number of dots on the face).
a) What are the possible outcomes of this experiment?
b) Define an HMM for the experiment.
c) This experiment can also be modelled as a discrete-time Markov chain with state space S = {D1, D2, 1, 2, 3, 4}. Write out the transition probability matrix of the corresponding DTMC or draw the underlying graph.
d) This experiment can also be modelled as another DTMC with state space S = {1, 2, 3, 4}. Write out the transition probability matrix of the corresponding DTMC or draw the underlying graph.

Medium

Exercise 44: Sequence Generation
a) Construct an HMM that generates a sequence of the form A^c1 B^c2 A^c3 B^c4 . . . where A^ci represents a sequence of length ci consisting only of 'A' (and similarly for B^ci), i ∈ {1, 2, . . .}. The natural numbers c1, c2, c3, . . . are drawn from the set {1, 2, 3} with equal probabilities.
b) Construct an HMM that generates a sequence of the form A^c1 B^(4−c1) A^c2 B^(4−c2) . . . . The natural numbers c1, c2, c3, . . . are drawn from the set {1, 2, 3} with equal probabilities.

Chapter 5 Computer Simulations and Monte Carlo Methods

Many software libraries provide a random number generator that returns a sequence of numbers u1, u2, . . . that follow a U(0, 1) distribution, i.e. that are uniformly distributed on the interval (0, 1). Moreover, these numbers are (approximately) independent. Thus, if we generate, say, 100 random numbers and check how many of them, say N , are smaller than a certain value p ∈ (0, 1), then N/100 ≈ p. Thus, we can approximate certain probabilities using random numbers. In the previous chapters we already used this technique to approximate the area below some function. This is particularly useful if an integral is difficult to evaluate numerically.

5.1 Generating random variates

Often, random variates other than uniform ones are needed, that is, random numbers that follow a certain distribution. If it is a continuous distribution with an invertible cumulative distribution function, the inverse transform method (see Section 2.3) can be used. For discrete distributions there are two possibilities. Assume we want to generate random numbers that are realizations of a discrete random variable X. Then
a) we either use a specific relationship between X and U ∼ U(0, 1), or
b) we split the interval (0, 1) into disjoint intervals and do a case distinction, as explained below.

Example 40: Generating Bernoulli distributed numbers
Let U ∼ U(0, 1) and p ∈ (0, 1). We define

X = 1 if U < p,  and  X = 0 if U ≥ p.

Then P (X = 1) = P (U < p) = p and P (X = 0) = 1 − p. Thus, X follows the Bernoulli distribution with parameter p.
Algorithmically, we first generate U , set X accordingly and return X. MATLAB code for this is (after setting the value for p):

U = rand;
X = (U < p)

Example 41: Generating binomially distributed numbers
Let U1, . . . , Un be independent and U(0, 1)-distributed. Moreover, let p ∈ (0, 1). We define

Xi = 1 if Ui < p,  and  Xi = 0 if Ui ≥ p,

and Y = Σ_{i=1}^{n} Xi. Then X1, X2, . . . , Xn are Bernoulli distributed (with parameter p) and thus their sum Y is binomially distributed with parameters n and p (see Section 1.3). Algorithmically, we generate U1, . . . , Un, set X1, X2, . . . , Xn accordingly and return Y = Σ_{i=1}^{n} Xi. MATLAB code for this is (after setting the values for n and p):

U = rand(n,1);
X = sum(U < p)

Example 42: Generating geometrically distributed numbers
Let U1, U2, . . . be independent and U(0, 1)-distributed. Moreover, let p ∈ (0, 1). We define

X = min{i : Ui < p}.

Then, clearly, P (X = 1) = p, and it is easy to see that P (X = k) = (1 − p)^(k−1) p. Algorithmically, we set X = 1 and perform a while-loop in which we generate U and check whether U > p (loop condition). If this is the case we set X = X + 1. After the loop we return X. MATLAB code for this is (after setting the value for p):

X = 1;
while (rand > p)
    X = X + 1;
end;
X

The splitting of the interval (0, 1) can be done for any discrete random variable X. Assume that X takes the values x0, x1, . . . and P (X = x0) = p0, P (X = x1) = p1, . . . . Then we divide the interval (0, 1) into the subintervals

A0 = (0, p0),  A1 = [p0, p0 + p1),  A2 = [p0 + p1, p0 + p1 + p2),  . . .

Then we let X = xi if U falls into interval Ai. Note that since Ai has length pi, the probability that U falls into Ai is exactly pi. Hence X has the desired distribution. Algorithmically, the naive approach to find the index i with p0 + . . . + pi−1 ≤ U < p0 + . . . + pi is performed as follows:

1. Initialize c = p0 for the cumulative value.
2. Iteratively set c = c + pk in the k-th execution of a while loop that is only executed if U ≥ c.

If U < c holds after i iterations, then X = xi. MATLAB code for this is (where p_i is stored in p(i+1), since MATLAB array indices start at 1):

c = p(1);         % set to p_0
i = 0;
U = rand;
while (U >= c)
    i = i + 1;
    c = c + p(i+1);   % add p_i
end;
X = i

There is a simple way to improve the efficiency of the above approach. If we sort the values x0, x1, . . . according to their probabilities p0, p1, . . . , we can ensure that large intervals (high probabilities) are considered first. Hence it is more likely that the number of iterations is small, since U is more likely to fall into one of the first intervals. In the case that X can take infinitely many values with positive probability, the intervals can be constructed on the fly. However, care has to be taken in how the intervals are traversed in order to ensure termination of the while loop.

Example 43: Generating Poisson distributed numbers
For the Poisson distribution we have x0 = 0, x1 = 1, . . . and, for parameter λ > 0,

pi = P (X = xi) = e^(−λ) λ^i / i!,  i = 0, 1, 2, . . . .

We perform the following steps (MATLAB code, variable lambda is set):

c = exp(-lambda);
i = 0;
U = rand;
while (U >= c)
    i = i + 1;
    c = c + exp(-lambda) * lambda^i / factorial(i);
end;
X = i

We note that MATLAB also offers functions for generating random variates for many well-known continuous and discrete distributions (function random).
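For continuous distributions, the inverse transform method mentioned at the beginning of this section can be coded just as easily whenever the cumulative distribution function can be inverted. A minimal sketch for the exponential distribution (the rate lambda = 2 is an arbitrary illustrative value):

lambda = 2;                % rate of the exponential distribution
U = rand;
X = -log(1 - U) / lambda   % since F(x) = 1 - exp(-lambda*x), F^{-1}(u) = -log(1-u)/lambda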
5.2 Monte Carlo simulation of DTMCs

Since we already discussed how to generate realizations of a discrete random variable X, we can simply apply this iteratively to simulate a given DTMC with initial distribution p0 and transition probability matrix P . First we have to determine the initial state of the chain. Hence we apply the algorithm of the previous section to the random variable X0 with distribution p0, taking values in S = {1, 2, . . .}. Once we have determined the initial state k0, we apply the procedure to X1 given X0 = k0, where we use the distribution given by the row of P that corresponds to k0. This generates the next state k1, and again we apply the procedure to X2 given X1 = k1 by considering the row of P that corresponds to state k1, and so forth. For a finite DTMC with N states, transition probability matrix P and initial distribution p0, we perform a simulation for T ∈ N steps using the following MATLAB code:

n = 0;                % step counter
d = p0;
stateseq = '';
while (n < T)         % while time horizon not yet reached
    U = rand;
    c = d(1);         % first interval
    i = 1;
    while (U >= c)    % next state not yet found
        i = i + 1;
        c = c + d(i);
    end;
    % add state at time n to state sequence
    stateseq = strcat(stateseq, sprintf(' %g', i));
    d = P(i,:);       % select next distribution
    n = n + 1;
end;

The above algorithm generates a single trajectory Xn(ω) : N0 → S of the chain, i.e. a sequence of states visited by the chain. This is called Monte Carlo simulation of the Markov chain. To estimate probabilities or expectations of random variables, a large number of such trajectories has to be generated, i.e. we have to execute the above algorithm a large number of times. In the next section we discuss how we can get accurate estimates based on a collection of trajectories.

5.3 Output Analysis

Assume that we generate 100 trajectories of a DTMC using the algorithm presented in the previous section. For each run we can check whether a certain event occurred or not. E.g. assume that in 67 of the 100 runs we have Xn = 1, that is, in these 67 runs the chain was in state 1 after n steps (for some fixed n). Then the relative frequency of the event "chain is in state 1 after n steps" is 67/100. Intuitively, we can deduce that P (Xn = 1) ≈ 67/100. Or, alternatively, we generate 100 Bernoulli distributed numbers. Assume that in 67 of the 100 generations X = 1. Then the relative frequency of the event X = 1 is 67/100, and thus p = P (X = 1) ≈ 67/100, i.e. we have an approximation of the success parameter p. However, we do not know how good such estimates are, since the generations are based on pseudo-random numbers. If we do another 100 runs, we could get a very different frequency for the same event. In this section we discuss how we can estimate probabilities of events and expected values of random variables using Monte Carlo simulation. We will see how we can determine the quality of an estimated value by exploiting the central limit theorem.

5.3.1 Interval Estimation

Assume that we are interested in the probability of a certain event A, such as X = x (or, in a DTMC, Xn = x). For ω ∈ Ω, define

χA(ω) = 1 if ω ∈ A,  and  χA(ω) = 0 if ω ∉ A.

Then E(χA) = 1 · P (χA = 1) + 0 · P (χA = 0) = P (χA = 1) = P (A). Thus, an estimate of E(χA) is also an estimate of P (A). Therefore, in the sequel, we concentrate on estimates of the expectation of some random variable. Let X be a random variable and let X1, X2, . . . , Xn be independent random variables, all having the same distribution as X.
For instance, if we generate n trajectories of a DTMC using Monte Carlo simulation, the random variable Xj represents the value of χA in the j-th trajectory (i.e., it is either 1 or 0) or the state of the chain at some fixed time. Let µ = E(X) = E(X1) = . . . = E(Xn). We discuss the construction of interval bounds Il, Ir such that "with high probability" µ ∈ [Il, Ir]. Since these bounds depend on X1, X2, . . . , Xn, Il and Ir are random variables.

Estimators for Expectation and Variance. The sample mean

Zn = (1/n) Σ_{i=1}^{n} Xi

is called an estimator for µ. In Section 2.4.2 we calculated that E(Zn) = µ, which means that Zn is an unbiased estimator. Let VAR(X) = VAR(X1) = . . . = VAR(Xn) = σ² and recall that VAR(Zn) = σ²/n. According to the central limit theorem,

Zn∗ = (Zn − µ) / (σ/√n)

approximately follows a standard normal distribution. We are interested in the deviation |Zn − µ| of the estimator Zn from µ, that is, in

P (|Zn∗| ≤ z) = P (|Zn − µ| / (σ/√n) ≤ z) = P (|Zn − µ| ≤ z · σ/√n).   (5.1)

Since we do not know σ (or σ²), we estimate σ² with the sample variance

S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − Zn)².   (5.2)

Note that S² is an unbiased estimator for σ² and that

S² = (Σ_{i=1}^{n} Xi²)/(n−1) − (Σ_{i=1}^{n} Xi)²/((n−1)n).   (5.3)

Confidence Level. In order to control the estimation error, we want the constraint in Eq. (5.1) to hold with a high probability. We choose a confidence level β and define z such that β = P (|Zn∗| ≤ z). Usually, β ∈ {0.95, 0.99}, and an approximation for z can be found in the standard normal table. Since the standard normal distribution is symmetric around the expectation 0, P (|Zn∗| ≤ z) = P (−z ≤ Zn∗ ≤ z). Let Φ be the cumulative probability distribution of the standard normal distribution. For a confidence level of β = 0.95, for instance, we get Φ(1.96) ≈ P (Zn∗ ≤ 1.96) ≈ 0.975. Thus,

P (−1.96 ≤ Zn∗ ≤ 1.96) ≈ Φ(1.96) − (1 − Φ(1.96)) ≈ 0.975 − 0.025 = 0.95

and z ≈ 1.96 (see also the illustration in Fig. 5.1).

[Figure 5.1: Probability density function of N(0, 1): for z = 1.96, 95% of the probability mass lies between −1.96 and 1.96, with 2.5% in each tail.]

Confidence Interval. Since β = P (|Zn∗| ≤ z) ≈ P (|Zn − µ| ≤ z √(S²/n)), we get the confidence interval

[Zn − z √(S²/n), Zn + z √(S²/n)] = [Il, Ir].

We see now that the confidence interval depends on the number n of generated trajectories. To make the confidence interval half as wide, we must generate four times as many trajectories. Note that we cannot deduce that µ lies in [Il, Ir] with probability β since
• Zn and S² are random variables,
• Zn is normally distributed only as n → ∞,
• we approximated σ² by S².
The correct interpretation is that, for large n and a large number of realizations of the random interval [Il, Ir], β is the percentage of realizations [Il(ω), Ir(ω)] that will contain µ.

Practical Issues. The simulation algorithm in Section 5.2 can be used to compute realizations of the random variables Xi. Assume that we execute the algorithm n times to generate n trajectories. We add two variables sum and sumsq that we initialize with zero. In the i-th run, we increase sum by xi = Xi(ω) and sumsq by xi² = (Xi(ω))². After the generation of n trajectories, we compute zn = sum/n and

s² = sumsq/(n−1) − sum²/((n−1)n),

where we exploit Eq. (5.3). Then, for a fixed confidence level, the interval

[zn − z √(s²/n), zn + z √(s²/n)]

can be constructed.
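The following MATLAB sketch puts these steps together; the vector X stands for the n simulation outputs (here filled with dummy Bernoulli data as a placeholder for actual simulation runs):

n = 1000;
X = (rand(1,n) < 0.3);                        % placeholder for n simulation outputs
z = 1.96;                                     % confidence level beta = 0.95
zn = sum(X) / n;                              % sample mean
s2 = sum(X.^2)/(n-1) - sum(X)^2/((n-1)*n);    % sample variance via Eq. (5.3)
I = [zn - z*sqrt(s2/n), zn + z*sqrt(s2/n)]    % confidence interval [Il, Ir]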
Number of Simulation Runs. If the interval [Il, Ir] is large relative to Zn, the quality of the estimator is poor and more simulation runs have to be carried out. This is likely the case if n is small or if we try to estimate the probability of a rare event. Let us fix the relative width of the interval to be 0.2 (which means that we have a relative error of at most 0.1) and choose the confidence level β = 0.95. Thus, z ≈ 1.96. Assume that we want to estimate γ = P (A) and Xi is one if A occurs in the i-th run and zero otherwise. Clearly, E(X1) = E(X2) = . . . = E(Xn) = 1 · γ + 0 · (1 − γ) = γ. We can determine the number N of necessary runs by bounding the relative width of the confidence interval as follows:

2 · z · √(S²/N) / γ ≤ 0.2  ⟹  z² S² / (0.01 γ²) ≤ N  ⟹  384 · S²/γ² ≤ N.

Using the fact that σ² = VAR(χA) = γ(1 − γ) and replacing S² by σ² yields

N ≥ 384 · (1 − γ)/γ.

For instance, the sufficient number of runs to guarantee that probabilities having at least the order of magnitude of 10⁻⁵ are estimated with a relative error of at most 0.1 and a confidence of 95% is N ≈ 38,000,000.

Monte Carlo simulation of Markov chains has a number of drawbacks compared to a direct numerical solution using matrix-vector multiplications:
• If the event whose probability has to be calculated is rare, a large number of simulation runs is necessary to estimate its probability.
• In order to halve the confidence interval, four times more simulation runs have to be carried out. It is therefore expensive to obtain accurate results.
• It is important that the pseudo-random numbers used during the simulation are close to "truly random". This is necessary to ensure independence.

5.4 Exercises

Easy

Exercise 45: Generating random numbers (negative binomial distribution)
A negatively binomially distributed random variable X with parameters r ∈ {1, 2, . . .} and p ∈ (0, 1) has probability distribution

P (X = k) = (k + r − 1 choose k) p^k (1 − p)^r,  k = 0, 1, 2, . . .

and describes the number of successes in a sequence of i.i.d. Bernoulli trials before a specified number r of failures occurs. For example, if we throw a die repeatedly until "6" appears for the third time (r = 3 failures), then the number of non-6 values that have appeared follows a negative binomial distribution.
a) Discuss the intuition behind the formula for the probability distribution with your fellow students.
b) Assuming that only addition and the generation of u ∼ U(0, 1) are available as operations, write code in MATLAB (or R or some MATLAB-clone that you prefer) that generates random numbers that follow a negative binomial distribution.

Medium

Exercise 46: Gillespie Algorithm
Write down the pseudocode for simulating a CTMC. Consider that in each state x ∈ N^p of the CTMC there are transition rules Rj, j ∈ {1, . . . , N}, each of which fires with a certain transition rate αj(x) and results in a jump from state x to state x + vj, where vj ∈ N^p is the change vector that corresponds to rule j. Finally, consider that the residence time in state x is exponentially distributed with parameter α0 = Σ_j αj(x).

Exercise 47: Simulation of PageRank
Write code in MATLAB (or R or some MATLAB-clone that you prefer) that simulates the DTMC of the random surfer model. As input it should take the adjacency matrix of the directed graph of web links, and the output should be the ranking of the nodes. As the initial state you can choose any state, and as the simulation time horizon just use some large number of steps (say 10000).
5.4 Exercises

Easy

Exercise 45: Generating random numbers (negative binomial distribution)
A negative binomially distributed random variable X with parameters r ∈ {1, 2, ...} and p ∈ (0, 1) has probability distribution

    P(X = k) = C(k+r−1, k) · p^k (1−p)^r,   k = 0, 1, 2, ...

(where C(·,·) denotes the binomial coefficient) and describes the number of successes in a sequence of i.i.d. Bernoulli trials before a specified number r of failures occurs. For example, if we throw a die repeatedly until the third time "6" appears (r = three failures), then the probability distribution of the number of non-6 values that appeared will be a negative binomial.
a) Discuss the intuition behind the formula for the probability distribution with your fellow students.
b) Assuming that only addition and the generation of u ∼ U(0, 1) are available as operations, write code in Matlab (or R or some Matlab-clone that you prefer) that generates random numbers that follow a negative binomial distribution.

Medium

Exercise 46: Gillespie Algorithm
Write down the pseudocode for simulating a CTMC. Consider that in each state x ∈ N^p of the CTMC there are transition rules R_j, j ∈ {1, ..., N}, each of which fires with a certain transition rate α_j(x) and results in the jump from state x to state x + v_j, where v_j ∈ N^p is the change vector that corresponds to rule j. Finally, consider that the residence time in state x is exponentially distributed with parameter α_0 = Σ_j α_j(x).

Exercise 47: Simulation of PageRank
Write code in Matlab (or R or some Matlab-clone that you prefer) that simulates the DTMC of the random surfer model. As input it should take the adjacency matrix of the directed graph of web links, and the output should be the ranking of the nodes. As an initial state you can choose any state, and as a simulation time horizon just use some large number of steps (say 10000). Based on the total number of visits of each node, approximate the equilibrium probabilities. Apply your algorithm to the graph in the exercise sheet of last week and compare your ranking results.

Chapter 6
Parameter Estimation

We consider an unknown distribution of which we have a collection of samples. We know the family of the distribution (e.g. Poisson) or the underlying model (e.g. the distribution of a DTMC after n steps), but some of the parameters are unknown and have to be estimated. We already saw estimators for the mean of a distribution and for its standard deviation in Section 5.3. They are random variables computed from a collection of data and are subject to a sampling error. We also constructed random intervals that contain the true value of the parameter with high probability.

We may also use simple descriptive statistics such as measures of location (mean, median, quantiles), variability (variance, standard deviation) or the range. Often this is used to determine the family of the distribution if it is unknown. However, there are more parameters that shape a distribution. In the sequel we will discuss some popular methods for estimating parameters. Sometimes they may lead to different estimation results.

Example 44: Photon Count
Consider a telescope observing the photon flux of a certain star during certain time intervals of fixed length. We assume that the distribution of the number of observed photons in the interval [t, t+h] is constant in t, i.e. we have the same distribution for all intervals of length h. Assume further that n independent measurements are performed for certain intervals of length h, i.e. we have n numbers x_1, ..., x_n for the photon flux within intervals of length h. Clearly, we will in reality never have exact and truly independent measurements, but let us assume for now that
a) the measurements have been taken over n days for, say, h = 1 hour every day,
b) the telescope gives a very accurate measurement.
From a) we get approximate independence and from b) we derive that measurement errors may be neglected. Random photon counts can be well described by a Poisson distribution with some parameter λ. We know that the mean of the Poisson distribution is E(X) = λ and the variance is VAR(X) = λ as well. Shall we now use the sample mean

    x̄ = (1/n) Σ_{i=1}^n x_i

to fit E(X) = λ, or shall we use the sample variance

    s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)²

as defined in Section 5.3 to fit VAR(X) = λ? Which of the two will give us a better estimate for λ? What does "better" mean in this context?

6.1 Method of moments

Given data points x_1, ..., x_n of n independent measurements, the idea of the method of moments is to adjust the moments of the distribution such that they fit the sample moments of the data. In general, the k-th sample moment is defined as

    m_k = (1/n) Σ_{i=1}^n (x_i)^k.

Note that we often write x̄ for m_1. For k > 1 it is often a good alternative to consider the k-th sample central moment

    m̄_k = (1/n) Σ_{i=1}^n (x_i − x̄)^k

since non-central moments may become very large. Note that for the sample variance two definitions exist, the one for m̄_2 and the one for s². The difference is that for m̄_2 we divide the sum by n while for s² we divide it by n−1. The reason is that if we view the sample variance as an estimator of the true variance of a random variable X based on the data points X_1, ..., X_n, where the X_i are i.i.d.
(same distribution as X with mean µ and variance σ²), then the estimator

    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²,  where X̄ = (1/n) Σ_i X_i,

is unbiased (E(S²) = σ²), while

    S̃² = (1/n) Σ_{i=1}^n (X_i − X̄)²

is not (E(S̃²) ≠ σ²). We will come back to this issue when we discuss the pros and cons of the method of moments.

Now we assume that we consider a distribution with exactly K unknown parameters θ_1, ..., θ_K. Then the moments of the distribution (often called population moments) are functions of these parameters, i.e. E(X^k) = µ_k(θ_1, ..., θ_K). We may also simply write µ instead of µ_1(θ_1, ..., θ_K). Clearly, the central moments E((X−µ)^k) = µ̄_k(θ_1, ..., θ_K), k = 2, 3, ..., are also functions of θ_1, ..., θ_K. To determine these parameters we construct an equation system

    µ_1(θ̂_1, ..., θ̂_K) = m_1
    µ̄_2(θ̂_1, ..., θ̂_K) = m̄_2
    ...
    µ̄_K(θ̂_1, ..., θ̂_K) = m̄_K

and solve for θ̂_1, ..., θ̂_K. Note that it is also possible to equate the non-central moments instead. Also note that, in general, it may be necessary to add equations for higher moments if there is no unique solution for θ̂_1, ..., θ̂_K with K equations (for instance, because some equations are linearly dependent).

Example 45: Fitting the mean of a Poisson distribution
Revisiting Example 44, we would determine the parameter λ = µ by setting λ = x̄ = m_1.

Example 46: Fitting the parameters of a Normal distribution
Assume that our data points x_1, ..., x_n are realizations of a random variable X that follows a Normal distribution. The Normal distribution has parameter θ_1 = µ for the mean and θ_2 = σ for the standard deviation. Thus, we directly get as a solution that µ = x̄ (from the first equation) and σ = √m̄_2 (from the second equation). As a variant, assume that it is known that X has mean µ = 0. Then, for the single remaining parameter σ we do not set K = 1 and consider only the first equation µ_1(µ, σ) = m_1, since the first moment µ_1(µ, σ) is equal to the first parameter µ and thus does not give any constraints on σ. Instead we consider the second equation µ̄_2(µ, σ) = m̄_2, since µ̄_2(µ, σ) = σ² and thus σ = √m̄_2.
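The fit in Example 46 is easy to reproduce numerically. A minimal Matlab sketch (sample size and distribution parameters are chosen arbitrarily for illustration):

    % Method of moments fit for a Normal distribution (cf. Example 46)
    x = random('Normal', 5, 2, 1000, 1);   % 1000 samples with mu = 5, sigma = 2
    m1 = mean(x);                 % first sample moment
    m2c = mean((x - m1).^2);      % second sample central moment
    mu_hat = m1;                  % matches the first population moment
    sigma_hat = sqrt(m2c);        % matches the second central moment
    % mu_hat and sigma_hat should be close to 5 and 2, respectively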
The method of moments has a number of disadvantages compared to the estimation methods considered in the sequel. The estimators may be biased and thus may have a systematic estimation error. An example is the bias of the sample variance.

Example 47: Bias of the sample variance
As already discussed above, according to the method of moments we would estimate the variance of the true distribution by

    s̃² = (1/n) Σ_{i=1}^n (x_i − x̄)²

instead of using the unbiased estimator

    s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)².

Intuitively, s̃² underestimates the variance: in its definition we use an estimator for the mean instead of the true mean, and this estimator must lie in the center of the samples (while the true mean might not), so the deviations from it are smaller than the deviations from the true mean. Consider for instance a normal distribution with mean 0 and variance 100. For a small sample size of, say, n = 3 one might get the data points x_1 = −1.9412, x_2 = −21.3836, x_3 = −8.3959 using the following MATLAB command: x = random('Normal',0,10,3,1);. The sample mean x̄ = −10.5736 is far away from the true mean. Obviously, the deviations from the true mean 0 are much larger than the deviations from x̄, and thus s̃² is only var(x,1) = 65.3717, while var(x) = s² = 98.0576 is closer to the true variance (note that the Matlab command var(x) uses the unbiased estimator!). If we could consider the deviations from the (usually unknown) true mean 0, the estimator with the factor 1/n would become unbiased:

    E( (1/n) Σ_{i=1}^n (X_i − E(X))² ) = (1/n) Σ_{i=1}^n E((X_i − E(X))²) = (1/n) · n · VAR(X_i) = VAR(X).

6.2 Generalized method of moments

The method of moments has the drawback that it does not make use of all information available from the data points x_1, ..., x_n. For instance, in Example 44 we only use x̄ and do not take into account the variation of the data, i.e. s² or s̃². The generalized method of moments does take this information into account by considering cost functions instead of equating the population moments of the distribution and the sample moments. In the sequel, we simply write θ for a vector θ = (θ_1, ..., θ_K) of parameters. We define the following cost functions for J ≥ K:

    g_1(θ) = m_1 − µ_1(θ)
    g_2(θ) = m_2 − µ_2(θ)
    ...
    g_J(θ) = m_J − µ_J(θ)

Note that in the previous section we only considered the case where J = K (the number of parameters equals the number of equations) and we wanted all costs to be zero. Now, we can choose a least squares estimator which chooses θ̂ such that

    S(θ) = Σ_{j=1}^J g_j²(θ)

becomes minimal, i.e. θ̂ = arg min_θ S(θ). However, this global cost function S(θ) is not optimal since for higher-order moments the variance of g_j will be much higher than for low-order moments. (Note that g_j is also a function of the data and can thus be interpreted as a random variable. Due to the j-th power in the j-th sample moment, we can expect a higher variance as j becomes larger.) Thus, the idea is to weight each cost function by its variance. However, since some cost functions may be correlated (as we see in the next example), we take a more general approach that allows us to also consider covariances.

Example 48: Correlated cost functions
Recall that for the normal distribution we have two parameters µ and σ², and that the first four population moments and cost functions are given by

    E(X) = µ           (mean)       g_1(µ̂, σ̂²) = m_1 − µ̂
    VAR(X) = σ²        (variance)   g_2(µ̂, σ̂²) = m̄_2 − σ̂²
    E([X−µ]³) = 0      (skewness)   g_3(µ̂, σ̂²) = m̄_3 − 0
    E([X−µ]⁴) = 3σ⁴    (kurtosis)   g_4(µ̂, σ̂²) = m̄_4 − 3σ̂⁴

where we used the central moments instead of the non-central ones. We see that the second and the fourth cost functions both give constraints only on σ and are correlated.

We now interpret g_1(θ), ..., g_J(θ) as a (column) vector g(θ) and consider the very general case that the cost function is

    S̃(θ) = g(θ)' W g(θ)    (6.1)

where W is a positive-definite J×J matrix with appropriate weights. Note that if W is the identity matrix I then S̃(θ) = S(θ). We now give some sufficient conditions under which the estimator θ̂ = arg min_θ S̃(θ) is consistent, i.e. converges in probability to the true value θ̃ of θ.
1. It must hold that J ≥ K (the number of moment conditions is not lower than the number of parameters). Otherwise we do not have enough conditions to identify θ_1, ..., θ_K.
2. We must have at least K moment conditions that are functionally independent. This is the case if the Jacobian matrix of the g-functions (evaluated at θ̃) has rank at least K.
3.
If we replace the sample moments by the population moments in the cost terms of S̃(θ), then S̃(θ) must yield zero for θ = θ̃. This must be a unique property of θ̃, i.e., no other value of θ should satisfy this equality.

It turns out that for the case J > K we can minimize the variance of the estimator θ̂ that minimizes (6.1) as follows. We know from the central limit theorem (multidimensional case) that √n · g(θ) converges to a normal distribution with mean zero (if θ is the 'true' parameter value) and covariance matrix Σ, where the entry Σ_{j,k} is given as n·COV(g_j, g_k) and

    COV(g_j, g_k) = COV( (1/n) Σ_i X_i^j − µ_j , (1/n) Σ_i X_i^k − µ_k ) = (1/n) COV(X^j, X^k).

It can be shown that if we choose W proportional to the inverse of Σ, we get an optimal weighting matrix (in the sense that the variance of our estimator θ̂ becomes minimal). Thus, we ignore the 1/n factor and estimate the covariance matrix F of the population moments (with entries COV(X^j, X^k)) by

    F̂_{j,k} = (1/n) Σ_{i=1}^n (X_i^j − µ_j)(X_i^k − µ_k).

Then we choose W = F̂⁻¹. However, F̂ depends on the parameter vector θ (as it depends on µ_1, ..., µ_J, which are functions of θ). Therefore, the following 2-step procedure can be used to find a good choice for W:
1. Choose W = I and compute θ̂⁽¹⁾ = arg min_θ (g' W g).
2. Compute F̂ with entries

    F̂_{j,k} = (1/n) Σ_{i=1}^n [X_i^j − µ_j(θ̂⁽¹⁾)][X_i^k − µ_k(θ̂⁽¹⁾)]

and for W = F̂⁻¹ compute θ̂⁽²⁾ = arg min_θ (g' W g).

Example 49: Photon Count revisited
Consider again the photon count example (see Example 44). In this case, we have a single parameter θ = λ and E[X_i] = λ, E[X_i²] = λ² + λ. Thus, the first two cost functions are

    g_1(λ) = (1/n) Σ_{i=1}^n X_i − λ
    g_2(λ) = (1/n) Σ_{i=1}^n X_i² − λ² − λ

In the method of moments approach, we only used the first cost function and chose λ = (1/n) Σ_i X_i. We use the Matlab command x=random('Poisson',10,1,30) to produce 30 random numbers that follow a Poisson distribution with mean λ = 10. Let us assume that the returned numbers are such that m_1 = x̄ = 10.5 and m_2 = 126.3. We do a first estimation of λ by minimizing g' W g for W = I, i.e. we minimize g_1(λ)² + g_2(λ)² w.r.t. λ. The Matlab command

    fminsearch(@(l) (10.5 - l)^2 + (126.3 - l^2 - l)^2, 1)

starts a search at initial value 1 and yields λ̂ = 10.776, which is slightly worse than the estimation λ = x̄ of the method of moments. Next we compute F̂ based on our estimate 10.776. For instance, for the two off-diagonal entries we compute

    mean((x - 10.776) .* (x.^2 - 10.776^2 - 10.776))

where x is the vector of data points. As the inverse of F̂ we get an (asymptotically) optimal weight matrix

    W = [0.9788 −0.0375; −0.0375 0.0015].

Finally, we call again

    fminsearch(@(l) W(1,1)*(mean(x)-l)^2 + W(2,2)*(mean(x.^2)-l^2-l)^2 + 2*W(1,2)*(mean(x)-l)*(mean(x.^2)-l^2-l), 1)

and get the estimate λ̂ = 10.08. It is important to note that using the weight matrix might not always result in a more accurate estimation. In fact, our choice of W is only asymptotically optimal. However, the estimator is always consistent (converges in probability to the true value of the parameter) and asymptotically normal.
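The two-step procedure of Example 49 can be written compactly. A minimal Matlab sketch, assuming Poisson data in a vector x and the two moment functions of the example (a fresh random sample will, of course, not reproduce the exact numbers above):

    % Two-step GMM for the Poisson parameter lambda (cf. Example 49)
    x = random('Poisson', 10, 1, 30);                % 30 samples, true lambda = 10
    g = @(l) [mean(x) - l; mean(x.^2) - l^2 - l];    % vector of cost functions
    obj = @(l, W) g(l)' * W * g(l);                  % weighted quadratic cost (6.1)

    % Step 1: identity weights
    l1 = fminsearch(@(l) obj(l, eye(2)), 1);

    % Step 2: estimate F at l1 and re-minimize with W = inv(F)
    d1 = x - l1;                   % deviations of X_i   from mu_1(l1)
    d2 = x.^2 - l1^2 - l1;         % deviations of X_i^2 from mu_2(l1)
    F = [mean(d1.*d1), mean(d1.*d2); mean(d2.*d1), mean(d2.*d2)];
    l2 = fminsearch(@(l) obj(l, inv(F)), l1);

In practice one should check that F̂ is well conditioned before inverting it; for small n the weighting can otherwise do more harm than good, as the example itself notes.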
6.3 Maximum likelihood method

Again, we assume that n data points X_1, ..., X_n are given as well as the family of the distribution. The maximum likelihood method is the most common way of estimating parameters of a distribution. Recall that an estimator is a function of the data. Here, we consider maximum likelihood estimators. Let θ be the unknown parameter and let P_θ(·) be the corresponding discrete probability mass function, i.e. the X_i are all independent and identically distributed with P(X_i = x) = P_θ(x), x ∈ R. Therefore, the probability of observing X_1 = x_1, ..., X_n = x_n, assuming that θ is the true value, is given by

    L_θ(x_1, ..., x_n) = P_θ(x_1) P_θ(x_2) ··· P_θ(x_n).

L_θ(x_1, ..., x_n) is also called the likelihood of the data, and we say that θ̂ is a maximum likelihood estimator (MLE) of θ if for all possible θ

    L_θ(x_1, ..., x_n) ≤ L_θ̂(x_1, ..., x_n).

Note that the MLE of θ is, in general, not unique. Intuitively, the MLE is the parameter value that "best explains" the observed data.

To maximize the likelihood, we can consider its derivatives w.r.t. θ and find parameter values where ∂/∂θ L_θ(x_1, ..., x_n) equals 0 (in some cases such points do not exist and we find θ̂ at the boundary of the set of possible values for θ). Often the maximum of the log-likelihood

    ln L_θ(x_1, ..., x_n) = ln P_θ(x_1) + ln P_θ(x_2) + ... + ln P_θ(x_n),

which is equal to that of the likelihood (as the logarithm is strictly monotonically increasing), is easier to compute. More concretely, θ̂ maximizes L_θ if and only if θ̂ maximizes ln L_θ.

Example 50: MLE of the geometric distribution
Assume we have collected data that represents inter-arrival times of customers (in minutes). Since the inter-arrival times are approximately geometrically distributed (we hypothesize this after looking at certain summary statistics such as the range of the data, the mean, the skewness, the variance or a histogram plot of the data), we wish to fit the data to a geometric distribution with parameter θ, i.e. we want to find a θ that best explains the data. The likelihood of the data x_1, ..., x_n is in this case given by

    L_θ(x_1, ..., x_n) = P_θ(x_1) P_θ(x_2) ··· P_θ(x_n) = θ(1−θ)^{x_1} · θ(1−θ)^{x_2} · ... · θ(1−θ)^{x_n} = θ^n (1−θ)^{Σ_{i=1}^n x_i},

since the probability to observe x_i is P_θ(x_i) = θ·(1−θ)^{x_i}. Since L_θ(x_1, ..., x_n) is a function of θ, we shortly write L(θ) and omit the dependence on the data. Considering the log-likelihood yields

    ln L(θ) = n·ln θ + (Σ_{i=1}^n x_i) ln(1−θ).

We maximize ln L(θ) by setting the derivative to zero:

    d/dθ ln L(θ) = n/θ − (1/(1−θ)) Σ_i x_i = 0
    ⟺ Σ_i x_i = n(1−θ)/θ  ⟺  x̄ = 1/θ − 1  ⟺  θ̂ = 1/(x̄+1),

where x̄ = (1/n) Σ_{i=1}^n x_i. Next we verify that θ̂ is really a maximizer:

    d²/dθ² ln L(θ) = −n/θ² − (1/(1−θ)²) Σ_i x_i < 0.

Hence, the MLE of θ is θ̂ = 1/(x̄+1).

Note that in the above example, since the mean of the geometric distribution is (1−θ)/θ, we matched the mean x̄ of the data and the mean of the distribution. Note that this is not always the case for a maximum likelihood estimator, i.e. the estimator of the method of moments is not always the same as the MLE.
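As a quick numerical check of Example 50 (a minimal sketch; geornd from the Statistics Toolbox draws from exactly the pmf θ(1−θ)^x used above):

    % MLE of the geometric parameter (cf. Example 50)
    theta = 0.3;
    x = geornd(theta, 1000, 1);        % 1000 samples, pmf theta*(1-theta)^k
    theta_hat = 1 / (mean(x) + 1);     % closed-form MLE derived above
    % theta_hat should be close to 0.3; alternatively, maximize the
    % log-likelihood numerically and compare:
    negLL = @(t) -(numel(x)*log(t) + sum(x)*log(1-t));
    theta_num = fminbnd(negLL, 1e-6, 1-1e-6);   % should agree with theta_hat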
Assume now that we have hypothesized a continuous distribution for our data. In that case, the likelihood function is

    L_θ(x_1, ..., x_n) = f_θ(x_1) · f_θ(x_2) ··· f_θ(x_n),

where f_θ is the density of the distribution, which depends on the parameter θ. Again, since L_θ(x_1, ..., x_n) is a function of θ, we shortly write L(θ) and omit the dependence on the data. An MLE θ̂ of θ is defined as a value that maximizes L(θ) (over all permissible values of θ). The reason why the likelihood is a density in the continuous case is that the probability of a single outcome is zero. Hence one could consider

    P(x_i − h < X < x_i + h) = ∫_{x_i−h}^{x_i+h} f_θ(x) dx

for the outcome x_i and some very small h. But this quantity is approximately equal to 2h·f_θ(x_i) and thus proportional to the factors of the likelihood defined above. Hence, this approach would result in the same MLE.

Example 51: MLE of the exponential distribution
Assume that data were collected on the inter-arrival times of cars in a drive-up banking facility. Since histograms of the data show an exponential curve, we hypothesize an exponential distribution. Note that some distributions generalize the exponential distribution but have more parameters (e.g. the gamma distribution). They would provide a fit which is at least as good as the fit of the exponential distribution. For the exponential distribution the parameter is θ = λ, λ > 0, and the density is f_λ(x) = λe^{−λx}. Hence, the likelihood of the samples x_1, ..., x_n is

    L(λ) = Π_{i=1}^n λe^{−λx_i} = λ^n · e^{−λ Σ_i x_i}

and the log-likelihood is

    ln L(λ) = n·ln λ − λ Σ_i x_i

and its derivative w.r.t. λ is

    d/dλ ln L(λ) = n/λ − Σ_i x_i.

Setting the derivative to zero yields

    n/λ = Σ_i x_i  ⟺  1/λ = (1/n) Σ_i x_i = x̄.

We find that this is a maximizer:

    d²/dλ² ln L(λ) = −n/λ² < 0

and finally get λ̂ = 1/x̄.

We remark that for some distributions the log-likelihood function may not be useful, and also, finding a maximum by setting the derivative to zero is not always possible. In general, if an analytic solution is not possible, global optimization methods have to be applied in order to determine which of several local optima of the likelihood has the highest value.

We list some important properties of maximum likelihood estimators (some of the properties require mild "regularity" assumptions, e.g. the likelihood function must be differentiable and the support of the distribution must not depend on θ):

• Uniqueness: For most common distributions, the MLE is unique; that is, L(θ̂) is strictly greater than L(θ) for any other value of θ.

• Asymptotic unbiasedness: θ̂ is a function of the samples x_1, ..., x_n and thus a random variable if X_i = x_i is not given, i ∈ {1, 2, ..., n}. Assume now that X_1, ..., X_n have distribution P_θ and θ̂_n is the MLE of θ based on the observations X_1, ..., X_n. Then

    lim_{n→∞} E[θ̂_n] = θ.

Therefore, we say that θ̂_n is asymptotically unbiased.

• Invariance: If θ̂ is an MLE of θ and if g is a one-to-one function, then g(θ̂) is an MLE of g(θ). To see that this holds, we note that L(θ) and L̃(g(θ)) := L(g⁻¹(g(θ))) are both maximized by θ̂, so the MLE of g(θ), interpreted as the unknown parameter, is g(θ̂).

Example 52: Invariance
Assume that x_1, x_2, ..., x_n are samples from a Bernoulli distribution with parameter θ = p. Then the likelihood is

    Π_{i=1}^n [θ·x_i + (1−θ)·χ_{=0}(x_i)] = θ^{Σ_i x_i} · (1−θ)^{n − Σ_i x_i},

where χ_{=0}(x) = 1 if x = 0 and 0 otherwise. Hence the log-likelihood is

    ln L(θ) = (Σ_i x_i) ln θ + (n − Σ_i x_i) ln(1−θ)

and setting its derivative to zero yields

    ∂/∂θ ln L(θ) = (Σ_i x_i)/θ − (n − Σ_i x_i)/(1−θ) = 0
    ⟹ (1−θ̂) Σ_i x_i = θ̂ (n − Σ_i x_i)  ⟺  Σ_i x_i = n·θ̂  ⟺  θ̂ = (Σ_i x_i)/n.

Next we find that θ̂ = (Σ_i x_i)/n =: x̄ is a maximizer:

    ∂²/∂θ² ln L(θ) = −(Σ_i x_i)/θ² − (n − Σ_i x_i)/(1−θ)² < 0 for any θ ∈ [0, 1].

Thus, θ̂ = x̄. Next, we assume that the variance g(θ) = θ(1−θ) is our parameter (note that in the permissible range of θ, g is one-to-one).
Define L̃ such that L̃(g(θ)) = L(θ), i.e., we consider the same values for the likelihood, but the likelihood function is different as it is a function of g(θ). Setting the derivative to zero yields

    ∂/∂g L̃(g) = ∂/∂θ L̃(g(θ)) · ∂θ/∂g = ∂/∂θ L(θ) · ∂θ/∂g = 0.

Since ∂/∂θ L(θ)|_{θ=θ̂} = 0, it holds that ∂/∂g L̃(g) = 0 if we choose ĝ = g(θ̂) = x̄(1−x̄).

• Asymptotic normality: For n → ∞, θ̂_n converges in distribution to N(θ, δ(θ)) (the normal distribution with mean θ and variance δ(θ)), where θ̂_n is our estimate based on n observations and θ is the "true" parameter of the distribution of the observations. Moreover,

    δ(θ) = −( E[ d² ln L(θ)/dθ² ] )⁻¹.

One can even show that for any other estimator θ̃ which converges in distribution to N(θ, σ²), we have δ(θ) ≤ σ² (the variance is greater or equal). Therefore, MLEs are called best asymptotically normal. For large n, we can estimate the probability that θ̂ deviates from the true value by more than some ε (see confidence intervals; later).

• Strong consistency: MLEs are strongly consistent, i.e.,

    P( lim_{n→∞} θ̂_n = θ ) = 1.

In the case of more than one parameter, the MLE is found in a very similar way, i.e. we find the vector θ̂ = (θ̂_1, ..., θ̂_m) that maximizes L(θ) or ln L(θ). For instance, for two parameters α and β we try to solve ∂/∂α ln L(θ) = 0 and ∂/∂β ln L(θ) = 0 simultaneously for α and β, where θ = (α, β). (See the exercises for an example.)

6.4 Estimation of Variances

It is very important to analyze how good the estimated value given by some estimator θ̂ is. If VAR(θ̂) is large, then our estimation is of bad quality and might lead to wrong conclusions about the real system. Typically, when results of parameter estimations are reported, the (estimated) standard deviations √VAR(θ̂) are given as well.

Example 53: Poisson distribution
In the previous sections we found that both the method of moments and the maximum likelihood approach give the estimator θ̂ = X̄ for the Poisson distribution, i.e. the best value for the unknown mean of the Poisson distribution is the sample mean. In Section 5.3.1 we already found that the variance of X̄ is σ²/n, where σ² is the variance of the distribution of the X_i, and thus for the Poisson distribution we get

    VAR(θ̂) = VAR(X̄) = θ*/n.

Here, θ* is the 'true' (unknown) value that we estimate with θ̂ = X̄. Thus, we can estimate VAR(θ̂) as X̄/n. Alternatively, we can estimate σ² by computing the sample variance S² (as explained in Example 47) and estimate VAR(θ̂) as S²/n.

Often, analytic formulas for VAR(θ̂) cannot be derived. In this case, one can either use the bootstrap method or estimate VAR(θ̂) based on the properties of the estimator.

Generalized Method of Moments. For the GMM estimator, it can be shown that (under the assumptions made in Section 6.2) the estimator is asymptotically normally distributed with (asymptotic) covariance matrix

    V = (1/n) (G' F⁻¹ G)⁻¹

if we choose the optimal weighting matrix W = F⁻¹. Here, the matrix G is the expected Jacobian matrix of the cost functions (where the sample moments are replaced by the theoretical moments¹), evaluated at the true value of θ. We can therefore replace G and F by their corresponding estimates Ĝ and F̂ and compute

    V̂ = (1/n) (Ĝ' F̂⁻¹ Ĝ)⁻¹,  where  Ĝ = [ ∂g_1(θ̂)/∂θ_1 ... ∂g_1(θ̂)/∂θ_K ; ... ; ∂g_J(θ̂)/∂θ_1 ... ∂g_J(θ̂)/∂θ_K ].

Thus, estimates for the variances of θ̂_1, ..., θ̂_K are given by the diagonal entries of V̂.

¹ I.e. we replace X̄ by X, m_2 by X², etc.
Example 54: Estimated error of the GMM estimator
Recall Example 49. For our final estimate λ̂ = 10.08 we first consider the matrix G. We find ∂(X−λ)/∂λ = −1 and ∂(X²−λ²−λ)/∂λ = −2λ−1. Thus, G is independent of X, and we need not replace the population moments E[X^i] by their corresponding sample moments but can directly use

    Ĝ = [ −1 ; −2λ̂−1 ] = [ −1 ; −21.16 ]

as an estimate. Then we estimate the variance of λ̂ as

    V̂ = (1/n) (Ĝ' W Ĝ)⁻¹ = (1/n) · (1/0.0634) = 0.5256.

Thus, the standard error is √0.5256 = 0.7250.

Maximum Likelihood Method. To estimate the variance of a maximum likelihood estimator, we exploit the fact that MLEs are asymptotically normally distributed. Thus, for large sample sizes and a single parameter θ ∈ R, we have approximately a variance of

    δ(θ*) = −( E[ d² ln L(θ*, X_1, ..., X_n)/dθ² ] )⁻¹.

In the above expression we wrote L(θ, X_1, ..., X_n) for the likelihood to emphasize that L is a function of θ and X_1, ..., X_n. Moreover, here θ* is the true value of θ. For an estimate based on the data we simply replace X_1, ..., X_n by their corresponding realizations x_1, ..., x_n and the (unknown) true value of θ by the estimate θ̂, i.e.

    −( ∂² ln L(θ̂, x_1, ..., x_n)/∂θ² )⁻¹.

Intuitively, the second derivative tells us the curvature of the likelihood; if it is 'flat' at θ̂, then our estimated value might not be very accurate and the variance (negative inverse) is large. Then either we do not have enough samples or the parameter is difficult to identify (see also Example 56 below). We are not very confident about our estimated value, since perturbing θ̂ slightly gives us a similar likelihood. For several parameters, the same approach is used, i.e. the negative diagonal entries of the inverse of the Hessian matrix give estimates for the variances of the parameters.

Example 55: Parameters of the normal distribution
Assume that we have n i.i.d. samples X_1, ..., X_n that follow a normal distribution with (unknown) mean µ and (unknown) variance σ². Hence, θ = (µ, σ²). The likelihood of the data is

    L(X_1, ..., X_n; θ) = Π_{i=1}^n (1/√(2πσ²)) exp( −(X_i−µ)²/(2σ²) ).

Thus, the log-likelihood of the data is

    ln L(X_1, ..., X_n; θ) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{i=1}^n (X_i−µ)².

Next we compute the derivatives w.r.t. µ and σ²:

    ∂/∂µ ln L(X_1, ..., X_n; θ) = (1/σ²) Σ_{i=1}^n (X_i−µ)
    ∂/∂σ² ln L(X_1, ..., X_n; θ) = −(n/2)(σ²)⁻¹ + (1/2)(σ²)⁻² Σ_{i=1}^n (X_i−µ)².

Setting the two derivatives to zero yields from the first equation

    Σ_{i=1}^n (X_i−µ) = 0  ⟹  Σ_{i=1}^n X_i − nµ = 0  ⟹  µ̂ = (1/n) Σ_{i=1}^n X_i = X̄.

Inserting this into the second equation and multiplying by 2σ⁴ gives

    −nσ² + Σ_{i=1}^n (X_i−µ̂)² = 0  ⟹  σ̂² = (1/n) Σ_{i=1}^n (X_i−X̄)².

Thus, the MLE of σ² is a biased estimator. To prove that µ̂ and σ̂² yield a maximum, we have to consider the Hessian matrix

    H(µ, σ²) = [ ∂²/∂µ∂µ ln L , ∂²/∂µ∂σ² ln L ; ∂²/∂σ²∂µ ln L , ∂²/∂σ²∂σ² ln L ]

with entries

    ∂²/∂µ∂µ ln L = −n/σ²,
    ∂²/∂µ∂σ² ln L = ∂²/∂σ²∂µ ln L = −(σ²)⁻² Σ_{i=1}^n (X_i−µ),
    ∂²/∂σ²∂σ² ln L = (n/2)(σ²)⁻² − (σ²)⁻³ Σ_{i=1}^n (X_i−µ)².

We have a local maximum at θ̂ = (µ̂, σ̂²) if ∂²/∂µ∂µ ln L < 0 and if the determinant

    det(H(µ̂, σ̂²)) = ∂²/∂µ∂µ ln L · ∂²/∂σ²∂σ² ln L − ∂²/∂µ∂σ² ln L · ∂²/∂σ²∂µ ln L

is positive at θ̂. Next we insert concrete numbers to simplify the computation. We generate 50 random numbers that are normally distributed with mean zero and variance one.
We get the estimates µ̂ = −0.1229 and σ̂² = 0.9903, and the Hessian is (approximately)

    [ −50.4906  −0.0000 ; −0.0000  −25.4930 ].

(Note that its determinant is indeed positive.) The inverse of the Hessian is

    [ −0.0198  0.0000 ; −0.0000  −0.0392 ].

Thus, the estimated variances are 0.0198 for µ̂ and 0.0392 for σ̂², yielding standard errors of ε₁ = 0.1407 and ε₂ = 0.1981. Note that the true mean and variance are elements of the intervals [µ̂−ε₁, µ̂+ε₁] and [σ̂²−ε₂, σ̂²+ε₂].
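The numbers in Example 55 are easy to reproduce. A minimal Matlab sketch using the analytic Hessian derived above (the off-diagonal entries vanish at the MLE since Σ(x_i − µ̂) = 0; a fresh sample gives slightly different values):

    % Standard errors of the MLE for normal data via the Hessian (cf. Example 55)
    n = 50;
    x = random('Normal', 0, 1, n, 1);    % 50 samples, mean 0, variance 1
    mu_hat = mean(x);
    s2_hat = mean((x - mu_hat).^2);      % biased MLE of the variance
    H = [-n/s2_hat, 0; ...
         0, n/(2*s2_hat^2) - sum((x-mu_hat).^2)/s2_hat^3];
    se = sqrt(diag(-inv(H)));            % standard errors of mu_hat and s2_hat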
Example 56: Parameters of a birth-death process
Recall the DTMC that we considered in Example 31 on page 42. [Its graph: states 0, 1, 2, 3, ..., with transition probability p_λ from state i to i+1, transition probability p_µ from state i+1 to i, self-loop probability 1−p_λ in state 0 and 1−p_λ−p_µ in all other states.]

Assume that p_λ and p_µ are unknown parameters that we want to estimate and that the chain starts with probability one in state 0 at time t = 0. We consider three cases of observations:
a) We have n independent samples X_1, ..., X_n which give us the state of the chain at times t_1, ..., t_n.
b) We have a time series X_1, ..., X_n from a single run of the chain which gives us the state of the chain at times t_1, ..., t_n with t_1 < ... < t_n.
c) We have n independent steady-state observations X_1, ..., X_n.
How good will our estimates of p_λ and p_µ be in each of the three cases? For a) we have the likelihood

    L_a(p_λ, p_µ) = Π_{i=1}^n p_{t_i}(X_i),

where p_t = p_0 · P^t is the transient distribution after t steps. Note that the transition probability matrix P depends on p_λ and p_µ. In case b) the likelihood is

    L_b(p_λ, p_µ) = p_{t_1}(X_1) · (P^{t_2−t_1})_{X_1,X_2} · ... · (P^{t_n−t_{n−1}})_{X_{n−1},X_n},

where (P^t)_{i,j} denotes the entry of the matrix P^t at position (i,j). Finally, for c) we have

    L_c(p_λ, p_µ) = Π_{i=1}^n π(X_i),

where π is the equilibrium distribution of the chain (assuming p_λ < p_µ).

After generating n = 100 samples for p_λ = 0.4 and p_µ = 0.45 in case a), the MLE gives p̂_λ = 0.299 ± 0.138 and p̂_µ = 0.330 ± 0.158 as a local minimum of the negative log-likelihood. The Matlab function fmincon that performs the local search also approximates the Hessian, from which we estimated the standard errors 0.138 and 0.158 of p̂_λ and p̂_µ. Note that these estimated values give a very close estimate of the ratios p_λ/(p_λ+p_µ) and p_µ/(p_λ+p_µ), but the absolute numbers are too small and the estimate for the self-loop probability 1−(p_λ+p_µ) is too high. The high errors show us that the estimation is of bad quality. This problem occurs very often in realistic applications of the MLE and shows that certain parameters may not be identifiable. It is related to the fact that we are considering a single observation of many independent runs of the chain, which is not sufficient to estimate p_λ and p_µ.

Again we generate n = 100 samples for p_λ = 0.4 and p_µ = 0.45, but as a time series (case b). The corresponding MLE gives p̂_λ = 0.399 ± 0.027 and p̂_µ = 0.443 ± 0.030 as a local minimum of the negative log-likelihood. Thus, time-series data allows us to estimate the true parameters very accurately, which comes from the fact that the data is correlated and thus much more informative.

Finally, we generate n = 100 steady-state samples (case c). The corresponding MLE gives p̂_λ = 0.254 ± 11.446 and p̂_µ = 0.295 ± 12.456 as a local minimum of the negative log-likelihood. The standard errors are extremely high, showing that the log-likelihood function is 'very flat' around the minimum and thus the estimated values may not be close to the true values of the parameters. We have again the problem of parameters that are not identifiable, which is obvious since the equilibrium distribution does not uniquely define the DTMC. In fact, there are infinitely many DTMCs having the same equilibrium distribution but different values for p_λ and p_µ. However, the ratios are again estimated very accurately, which comes from the fact that the ratios can indeed be determined from equilibrium data.

6.5 Estimation of Parameters of HMMs

When we discussed hidden Markov models, we introduced the forward algorithm for computing the probability of a sequence τ = o_{k_0} o_{k_1} ... o_{k_m} of outputs. The idea is to perform the following vector-matrix multiplications:

    P(τ) = P(Y_0 = o_{k_0}, Y_1 = o_{k_1}, ..., Y_m = o_{k_m}) = p_0 W_0 P W_1 P ... W_{m−1} P W_m e.    (6.2)

Now we want to adjust the parameters of the HMM given a sequence τ = o_{k_0} o_{k_1} ... o_{k_m}. As parameters, we have the initial distribution p_0 of the hidden states, the transition probabilities p_{i,j} between the hidden states, and the emission probabilities b_{ik} of making output o_k when the hidden state is i. In a maximum likelihood approach, we must maximize P(τ) w.r.t. the parameters. Thus, global optimization methods have to be applied to maximize the logarithm of P(τ). An alternative approach based on the iterative Baum-Welch procedure gives a local maximum of P(τ) in the following way:

[Figure 6.1: Illustration of the computation of w_t(i,j). Here, a_{ij} is the transition probability and b_j is the emission probability.]

Instead of P(τ) we consider the probabilities

    w_t(i,j) = P(X_t = i, X_{t+1} = j | τ),  t ∈ {0, 1, ..., m−1},

i.e. the probability of being in state i at time t and in state j at time t+1, given τ. The values of w_t(i,j) can be computed iteratively by exploiting that

    w_t(i,j) = α_t(i) p_{ij} b_{j,o_{t+1}} β_{t+1}(j) / P(τ) = α_t(i) p_{ij} b_{j,o_{t+1}} β_{t+1}(j) / ( Σ_i Σ_j α_t(i) p_{ij} b_{j,o_{t+1}} β_{t+1}(j) ),

where

    α_t(i) = P(Y_0 = o_{k_0}, ..., Y_t = o_{k_t}, X_t = i) = p_0 W_0 P W_1 P ... W_{t−1} P W_t e_i

and

    β_{t+1}(j) = P(Y_{t+2} = o_{k_{t+2}}, ..., Y_m = o_{k_m} | X_{t+1} = j) = e_j P W_{t+2} P ... W_{m−1} P W_m e,

where e_h is a vector which is zero everywhere except at position h, where it is equal to one. Note that α_t(i) and β_{t+1}(j) give intermediate values of the large matrix-vector product in Eq. 6.2, where the product is evaluated either from the left or from the right. The idea is illustrated in Figure 6.1. Obviously, the value

    γ_t(i) = Σ_j w_t(i,j) = Σ_j P(X_t = i, X_{t+1} = j | τ) = P(X_t = i | τ)

is the probability of being in state i given the observation τ. Therefore, we can compute the corresponding time averages

    Σ_{t=0}^{m−1} γ_t(i),

which equals the average number of transitions from state i (given τ), and

    Σ_{t=0}^{m−1} w_t(i,j),

which equals the average number of transitions from i to j (given τ).

Example 57:
For the example of annual temperatures we considered the following observed sequence in Example 38:

    τ = S M S L.

Assuming the initial distribution p_0 = (0.6 0.4), we computed the probability P(Y_0 = S, Y_1 = M, Y_2 = S, Y_3 = L) as

    (0.6 0.4) · W_S · P · W_M · P · W_S · P · W_L · (1 1)' = 0.0096,

where P = [0.7 0.3; 0.4 0.6], W_S = diag(0.1, 0.7), W_M = diag(0.4, 0.2) and W_L = diag(0.5, 0.1). Let us now compute, for instance, w_0(1,2): we find that α_0(1) = 0.6 · 0.1 = 0.06. Similarly, β_1(2) = 0.1244 is the second entry of the vector P W_2 P W_3 e. Thus,

    w_0(1,2) = (1/N) · 0.06 · 0.3 · 0.2 · 0.1244 ≈ 0.0004/N,

where we used that p_{1,2} = 0.3 and the emission probability of output M in state 2 is 0.2. Here, N is the normalization factor P(τ).
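The product in Example 57 can be checked directly; a minimal Matlab sketch, with matrix names as in the example:

    % Forward computation of P(tau) for tau = S M S L (cf. Example 57)
    p0 = [0.6 0.4];               % initial distribution of hidden states
    P  = [0.7 0.3; 0.4 0.6];      % transition probability matrix
    WS = diag([0.1 0.7]);         % emission probabilities of output S
    WM = diag([0.4 0.2]);         % emission probabilities of output M
    WL = diag([0.5 0.1]);         % emission probabilities of output L
    e  = [1; 1];
    Ptau = p0 * WS * P * WM * P * WS * P * WL * e   % yields 0.0096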
Next we use the values w_t(i,j) to improve a first guess for the model parameters. We first choose p_0, P and B randomly and compute w_t(i,j) for all t < m and all i, j. Then we choose new parameter values according to the following set of re-estimation formulas:

    p_0(i) = expected number of times in state i at time 0 = γ_0(i)
    p_{ij} = (expected number of transitions from i to j) / (expected number of transitions from i) = Σ_t w_t(i,j) / Σ_t γ_t(i)
    b_{ik} = (expected number of times in i with output o_k) / (expected number of times in i) = Σ_{t: Y_t = o_k} γ_t(i) / Σ_t γ_t(i)

It is possible to prove that the re-estimation step leads to parameter values with a higher likelihood, unless we are already in a (possibly only local) maximum of the likelihood function. Thus, it can replace the search for a local optimum of the likelihood. Note that computing the values w_t(i,j) can be done efficiently by evaluating the matrix-vector product in 6.2 from the left and from the right.

Example 58: Annual temperatures revisited
For the example of annual temperatures we get the following re-estimates (after starting with the true values of the parameters): the initial distribution p_0 = (0.6 0.4) is (badly) estimated as (0.1882 0.8118). The transition probability matrix

    [0.7 0.3; 0.4 0.6]

is estimated as

    [0.56 0.44; 0.50 0.50].

Note that much better estimation results are achieved if several observation sequences are considered.
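One such re-estimation step for p_0 and P can be sketched in a few lines of Matlab (a minimal sketch of the procedure described above; B is kept fixed, outputs are coded as 1 = S, 2 = M, 3 = L, and the code column index t corresponds to time t−1 in the text). Run on τ = S M S L, it reproduces the re-estimates of Example 58:

    % One Baum-Welch re-estimation step for the 2-state example (cf. Example 58)
    P  = [0.7 0.3; 0.4 0.6];  p0 = [0.6 0.4];
    B  = [0.1 0.4 0.5; 0.7 0.2 0.1];   % B(i,k): emission prob. of output k in state i
    o  = [1 2 1 3];                    % observed sequence S M S L
    m  = numel(o);
    alpha = zeros(2, m); beta = ones(2, m);
    alpha(:,1) = p0' .* B(:,o(1));                       % forward initialization
    for t = 2:m
        alpha(:,t) = (P' * alpha(:,t-1)) .* B(:,o(t));   % forward recursion
    end
    for t = m-1:-1:1
        beta(:,t) = P * (B(:,o(t+1)) .* beta(:,t+1));    % backward recursion
    end
    Ptau  = sum(alpha(:,m));                 % = 0.0096 for this example
    gamma = alpha .* beta / Ptau;            % gamma(i,t) = P(X_{t-1} = i | tau)
    Pnew  = zeros(2,2);
    for t = 1:m-1                            % accumulate sum_t w_t(i,j)
        Pnew = Pnew + (alpha(:,t) * (B(:,o(t+1)) .* beta(:,t+1))') .* P / Ptau;
    end
    p0new = gamma(:,1)';                     % yields (0.1882 0.8118)
    Pnew  = Pnew ./ repmat(sum(Pnew, 2), 1, 2);  % row-normalize by sum_t gamma_t(i)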
6.6 Exercises

Easy

Exercise 48: Method of Moments I
Consider n independent and identically distributed numbers that follow (a) an exponential distribution with unknown parameter λ or (b) a uniform distribution with unknown support [a, b]. Derive the method of moments estimator (a) for λ and (b) for a and b.

Exercise 49: Method of Moments II
Derive the method of moments estimator for a 2-dimensional normal distribution where the 2-dimensional mean (µ_x, µ_y) and the 2×2 covariance matrix Σ have to be estimated. Write out the details for a given set of n sample pairs (X_1, Y_1), ..., (X_n, Y_n).

Exercise 50: Confidence Interval for the Sample Mean
Construct an approximate (1−α)·100% confidence interval for the sample mean X̄ for the case a) that the variance σ² of the distribution is known and b) that it is not known. Hint: Make sure that the probability that the interval covers the true mean is 1−α.

Exercise 51: Variance of maximum likelihood estimators
Assume that X_1, ..., X_n are i.i.d. samples from a distribution with density f_θ(x) = (θ+1)x^θ, where 0 ≤ θ ≤ 1 and 0 ≤ x ≤ 1.
• Compute the MLE θ̂ for the unknown parameter θ.
• How would you approximate the distribution of the MLE for parameter θ? What is the standard deviation (standard error) of θ̂?

Exercise 52: Uniqueness of the MLE
Assume that X_1, ..., X_n are i.i.d. samples from the uniform distribution U(θ, θ+1), where the parameter θ ∈ (−∞, +∞). Show that the maximum likelihood estimator for θ is not unique.

Exercise 53: Application of the Baum-Welch Algorithm
Consider Example 58 in the lecture notes and assume that besides τ_1 = (S M S L) a second independent observation τ_2 = (L L L M) is given. How would you adjust the re-estimation formulas in the case of several independent observations? Write a short program that conducts one iteration of the Baum-Welch procedure to re-estimate the parameters p_0 and P (starting from the true values of the parameters) where both observations are taken into account. Assume that B is fixed and known.

Exercise 54: Maximum likelihood estimators
Given n i.i.d. samples x_1, ..., x_n from the distributions below, compute the MLE for the unknown parameters.
a) Binomial distribution B(m, θ) with m known.
b) Uniform distribution with density f(x) = 1/(b−a) if a ≤ x ≤ b, and 0 otherwise.

Exercise 55: MLE for the double exponential distribution
The double exponential distribution has density f(x, θ) = (1/2) exp(−|x−θ|) for x ∈ R. Verify that θ̂ = med(x_1, ..., x_n). Use the following property of the median function med: the sample median is the point which minimizes the sum of the distances from the samples.

Exercise 56: MLE for a small discrete-time Markov chain
Consider a discrete-time Markov chain with transition probability matrix

    P = [1−α α; 1 0].

Estimate the parameter α for the case that you have independent observations of the chain in steady state where the chain was four times in state 1 and two times in state 2.

Medium

Exercise 57: Baum-Welch Algorithm
• Prove that Σ_j α_t(j) β_t(j) = P(τ).
• Derive a formula for computing w̃_t(i,j) = P(X_t = i, X_{t+2} = j | τ) and a re-estimation formula for the 2-step transition probability P(X_{t+2} = j | X_t = i).

Exercise 58: Optimal observation time
Recall that the Fisher information is defined as

    I(λ) = −E_X[ ∂² ln L(X;λ)/∂λ² ] = E_X[ ( ∂ ln L(X;λ)/∂λ )² ]

and also recall that the variance of an MLE can be approximated by I(λ)⁻¹. Consider a 2-state Markov chain X in continuous time where we have a single transition from state 1 to state 2 at rate λ, and where P(X_t = 1) = e^{−λt} and P(X_t = 2) = 1 − e^{−λt}. At what time should we observe the process to get the best estimate for λ (if we can only do a single observation at a single time point)? Try to maximize the information I(λ) (minimize the variance of λ̂) by plotting I(λ) over time and for different λ, e.g. λ ∈ {0.01, 0.1, 1, 10}.

Exercise 59: Biased and unbiased estimators
a) What is the method of moments estimator for the second central moment (variance)? Express its mean in terms of VAR[X] = E[(X−µ)²]. Hint: Use that E[(X̄−µ)²] = (1/n) VAR[X].
b) Consider two sets of samples X_1, ..., X_n and Y_1, ..., Y_m that are i.i.d., respectively, with means µ_X and µ_Y and with the same variance σ². Show that the pooled variance estimator

    S_p² = ( Σ_{i=1}^n (X_i−X̄)² + Σ_{i=1}^m (Y_i−Ȳ)² ) / (n+m−2)

estimates σ² unbiasedly.
c) For the method of moments estimator m̄_3 for the third central moment, it can be shown that E(m̄_3) = ((n−1)(n−2)/n²) µ̄_3. Based on this result, derive an unbiased estimator for the third moment by making the appropriate correction.
d) Write a small Matlab program that tests the two estimators for the third central moment on normally distributed data (the third central moment of the normal distribution is zero). Use small sample sizes to analyse the influence of the bias.

Exercise 60: Performance of estimators for the mean of a normal distribution
The parameter µ of the normal distribution is both equal to the mean E[X] and to the median m, which is defined as the number that satisfies the inequalities

    P(X ≤ m) = 1/2 and P(X ≥ m) = 1/2

(in general the median may not be unique). While E[X] is estimated by X̄, a (consistent) estimator for m is simply the median M of the data, i.e. in the case of an odd number of samples it is (after sorting) the (n+1)/2-th value; in the case of an even number of samples it is the arithmetic mean of the n/2-th and the (n+2)/2-th value (after sorting).
We already know that X̄ is asymptotically normally distributed with variance σ²/n. As opposed to this, it can be shown that the asymptotic variance of M is 2πσ²/(4n). Assume that σ² = 1. Use the above facts to determine for both estimators the number of samples n such that the probability that µ is estimated with a distance of at most 0.2 from the true value is (at least) 95.8%. Hint: 95.8% is the probability that a normally distributed number deviates by at most 2 standard deviations from its mean.

Exercise 61: MLE for the Pareto distribution (2 parameters)
The Pareto distribution is a power-law probability distribution that is used in the description of social, scientific, geophysical, and many other types of observable phenomena. Its pdf is given by

    f(x) = α x_m^α / x^{α+1} if x ≥ x_m, and 0 otherwise.

Given n Pareto distributed random numbers x_1, ..., x_n, provide the MLE x̂_m for x_m and then the MLE for α as a function of x̂_m and the data.

Exercise 62: Gene frequencies of blood antigens
The problem is to estimate the gene frequencies of blood antigens A and B from observed frequencies of four blood groups in a sample. Denote the gene frequencies of A and B by θ_1 and θ_2 respectively, and the expected probabilities of the four blood groups O, A, B and AB by π_1, π_2, π_3 and π_4. These probabilities are functions of θ_1 and θ_2 and are given as follows: π_1 = 2θ_1θ_2, π_2 = θ_1(2−θ_1−2θ_2), π_3 = θ_2(2−θ_2−2θ_1) and π_4 = (1−θ_1−θ_2)². The observed frequencies y_1, y_2, y_3, y_4 of the four blood groups are jointly multinomially distributed with n = Σ_{i=1}^4 y_i. Derive an equation system for the MLEs of θ_1 and θ_2 (you need not solve the equation system, nor consider second derivatives to prove that you found a maximizer).

Difficult

Exercise 63: Fisher Information
The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter upon which the probability of X depends. The inverse of the Fisher information is the asymptotic variance of the maximum likelihood estimator (see the lecture notes).
a) Show the following equality of the Fisher information for the case of one parameter θ:

    −E_X[ ∂² ln L(X;θ)/∂θ² ] = E_X[ ( ∂ ln L(X;θ)/∂θ )² ]

b) Compute the Fisher information of a binomial RV X ∼ B(n, θ). Can you plot the Fisher information as a function of θ?

Exercise 64: Derivative of the transient distribution
Consider again the discrete-time Markov chain with initial distribution p_0 = (1 0) and transition probability matrix

    P = [1−α α; 1 0].

Assume that you have independent observations of the chain at certain transient times t_1, ..., t_n.
a) In addition to an expression for computing the log-likelihood, derive a formula for numerically computing the derivative of the log-likelihood w.r.t. α.
b) Write a Matlab program to estimate the parameter α, where you give in addition to the log-likelihood also its derivative to the optimization function (e.g. fmincon).

Exercise 65: *MLE: Taking into account measurement errors
Consider an i.i.d. sample O_1, ..., O_n from the Poisson distribution (observation samples are provided in the CMS). We assume that the measurement error is additive, i.e. O_i = X_i + ε_i, where X_i is the "true" state and the measurement errors are i.i.d. with ε_i ∼ N(0, 1). Notice that the measurement error is also assumed to be independent of X.
a) Propose a way to compute the (log-)likelihood of a sample.
b) Compute (using Matlab or similar) the value of the likelihood of a given sequence assuming that λ = 4.
c) Derive a formula for the derivative of the (log-)likelihood. How would you make the optimization procedure even more efficient (in terms of convergence)?
d) Write a Matlab (or similar) program to estimate the parameter λ, where you use both the value of the (log-)likelihood and its derivative in the optimization procedure.
e) Derive a formula for the derivative of the (log-)likelihood with respect to the variance of the measurement error, i.e. w.r.t. the variance V, where ε_i ∼ N(0, V).
f) Write a Matlab (or similar) program to estimate both the parameter λ and the variance of the measurement error V, where you use both the value of the (log-)likelihood and its derivatives in the optimization procedure.
Hint: Note that you have to find a way to approximate the likelihood using finite summation.

Chapter 7
Statistical Testing

Carrying out statistical tests that are related to observations of the real system and/or the model is very important when we make claims or statements about the system. A popular class of tests are hypothesis tests that are used to verify statistical hypotheses. In a statistical hypothesis test, we usually first formulate a (null) hypothesis H_0 and an alternative hypothesis H_A. The two statements H_0 and H_A must be mutually exclusive, and the test can either accept or reject H_0 (in favor of H_A). The null hypothesis is either
• an equality, or
• the absence of an effect or some relation.
Note that this leads to null hypotheses that are in many cases not the same as the statement that we want to verify. The reason for the above constraint is that, intuitively, we need an equality (or the absence of an effect or some relation) to fix the distribution that we consider during the test. We begin with a motivating example.

Example 59: Defective Products
Assume that a manufacturer claims that at most 3% of his products are defective. We want to verify this statement to decide whether we accept the shipment of the products or not. We define

    H_0: the fraction of defective products is equal to 3%
    H_A: the fraction of defective products is greater than 3%

where for H_A we used the right-tail alternative. If we have evidence that H_0 is rejected in favor of H_A, then we reject the shipment. Note that H_A: 'the fraction of defective products is smaller than 3%' is not useful, since then we would always accept the shipment, no matter whether we
have evidence for H_0 or for H_A.

Example 60: Concurrent Users
Assume that we want to verify the statement that the average number of concurrent users of an online PC gaming platform increased by 2000 this year. We define

    H_0: µ_2 − µ_1 = 2000
    H_A: µ_2 − µ_1 ≠ 2000

where µ_1 is the average number of concurrent users of last year and µ_2 the average number of this year. Here, H_A is called a two-sided alternative since it covers both cases µ_2 − µ_1 > 2000 and µ_2 − µ_1 < 2000.

From the two examples above we see that there can be two-sided alternatives, one-sided left-tail alternatives (H_A is µ < µ_0) and one-sided right-tail alternatives (H_A is µ > µ_0), where H_0 is µ = µ_0.

When testing hypotheses, we realize that all we see is a random sample. Therefore, with all the best statistics skills, our decision to accept or to reject H_0 may still be wrong. That would be a sampling error. The outcome of our test depends on a finite random sample and thus we may always take a wrong decision. Four situations are possible:

                    Reject H_0       Accept H_0
    H_0 is true     Type I error     correct
    H_0 is false    correct          Type II error

In two of the four cases, the test results in a correct decision: either we accepted a true hypothesis, or we rejected a false hypothesis. The other two situations are sampling errors: a type I error occurs when we reject the true null hypothesis, and a type II error occurs when we accept the false null hypothesis. Each error occurs with a certain probability that we hope to keep small. Thus, a good test only results in a wrong decision if the sample is not very representative (i.e. extreme).

Often the type I error is seen as more dangerous, since it corresponds to 'convicting an innocent defendant' or 'sending a healthy patient to surgery'. Therefore, we fix the probability α of a type I error, which is also called the significance level of the test:

    α = P{reject H_0 | H_0 is true}.

The probability of rejecting a false hypothesis (avoiding a type II error) is the power of the test, a function of the parameter θ about which we make our hypothesis:

    p(θ) = P{reject H_0 | θ; H_A is true}.

Typically, α is chosen very small, e.g. α ∈ {0.01, 0.05, 0.10}, such that the type I error is kept small and we only reject H_0 with a lot of confidence.

7.1 Level α tests: general approach

To test H_0 against H_A we perform the following steps:
1. Compute a test statistic T, which is a function of the sample (or an estimator) and thus a random number. The distribution of T, given that H_0 is true, is known.
2. Consider the null distribution F_0 of T, given that H_0 is true, and find the portion that corresponds to α, i.e. the part of the area below the density curve that gives α and is thus the region where we reject H_0. The remaining part (of area 1−α) is the acceptance region.
[Figure: density f_0(T) of the null distribution, split into an acceptance region (probability 1−α when H_0 is true) and a rejection region (probability α when H_0 is true).]

In the illustration, T is expected to be large if H_A is true. Therefore, the rejection region is at the right tail of the distribution. In a two-sided test, we consider portions of α/2 at both tails of the distribution. We always have that

    P{T ∈ acceptance region | H_0 is true} = 1 − α

and

    P{T ∈ rejection region | H_0 is true} = α.

3. Accept H_0 if T belongs to the acceptance region and reject it otherwise.

It is important to mention that if we accept H_0, we cannot say that 'the hypothesis H_0 is true'. The reason is that H_0 is not random, i.e. it either holds with probability one or it does not hold with probability one. The only correct interpretation is the following: if we reject H_0, then the data provides sufficient evidence against H_0 and in favor of H_A; either H_0 is really not true or our data is not representative, which happens with probability α. If we accept H_0, then the data does not provide sufficient evidence to reject H_0. In the absence of sufficient evidence, by default, we accept H_0.

Let us now reason about finding the rejection region with area α, since there are many regions with area α below the density curve. The best choice is an area that ensures that the type II error is small. Thus, we choose the rejection region such that it is likely that T falls into this region if H_A is true. This will maximize the power of the test, i.e. the probability of rejecting H_0 given that H_A is true. Often, this results in a choice of the rejection region as illustrated in Figure 7.1 for a normally distributed test statistic. Usually, the test statistic is defined such that
• the right-tail alternative forces T to be large,
• the left-tail alternative forces T to be small,
• the two-sided alternative forces T to be either large or small.
[Figure 7.1: Acceptance and rejection regions for a normally distributed test statistic. a) one-sided right-tail alternative (reject if T ≥ z_α); b) one-sided left-tail alternative (reject if T ≤ −z_α); c) two-sided alternative (reject if |T| ≥ z_{α/2}).]

7.2 Standard Normal null distribution (Z-test)

For a large number of applications, the null distribution of T (the distribution of T given that H_0 is true) is standard normal. Then the test is called a Z-test. Usually, one of the following cases applies:
• we consider sample means of normally distributed data,
• we consider sample means of arbitrarily distributed data where the number of samples is large,
• we consider sample proportions of arbitrarily distributed data where the number of samples is large,
• we consider differences of sample means or sample proportions where the number of samples is large.
In all of these cases a Z-test can be used. Let z_α be the number such that P(Z > z_α) = 1 − Φ(z_α) = α if Z is a standard normally distributed random variable. Then we reject H_0 if
• Z ≥ z_α for a test with right-tail alternative,
• Z ≤ −z_α for a test with left-tail alternative,
• |Z| ≥ z_{α/2} for a test with two-sided alternative.
Note that in each case we have P{reject H_0 | H_0 is true} = α (for the two-sided alternative, 2·(1 − Φ(z_{α/2})) = α).

Example 61: A test for the mean
Assume that we have used X̄ to estimate the unknown mean µ_0 of a normal distribution based on n = 100 independent samples. Since the samples are normally distributed, X̄ is normally distributed too, and we know that E[X̄] = µ_0 (see previous chapter) and that VAR[X̄] = σ²/n. Assume further that σ = 800 is known and that the data is such that X̄ = 5200. We would like to verify (with a significance level of 5%, α = 0.05) whether the true mean is greater than 5000, i.e. H_0: µ_0 = 5000 and H_A: µ_0 > 5000 (right-tail alternative).
1. We first compute the test statistic

    Z = (X̄ − µ_0)/(σ/√n) = (5200 − 5000)/(800/√100) = 2.5.

2. The critical value is z_α = 1.645 (from the table of the standard normal distribution). Thus, we should reject H_0 if Z ≥ 1.645 and accept it otherwise.
3. Since Z falls into the rejection region, we have enough evidence to reject H_0 and support the alternative hypothesis H_A: µ_0 > 5000.
Note that if, for instance, X̄ = 5100, we would get Z = 1.25 and would not have enough evidence to reject H_0 and believe in the alternative hypothesis that µ_0 > 5000.
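The decision rule of Example 61 in a few lines of Matlab (a minimal sketch; norminv from the Statistics Toolbox replaces the table lookup of z_α):

    % One-sample Z-test for the mean (cf. Example 61)
    n = 100; sigma = 800; mu0 = 5000; xbar = 5200; alpha = 0.05;
    Z = (xbar - mu0) / (sigma/sqrt(n));   % = 2.5
    z = norminv(1 - alpha);               % critical value 1.645
    reject = Z >= z;                      % true: reject H0 in favor of mu0 > 5000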
Example 62: Two-Sample Z-test of proportions
A quality inspector finds 10 defective parts in a sample of n = 500 parts received from manufacturer A. Out of m = 400 parts from manufacturer B, she finds 12 defective ones. A computer-making company uses these parts in their computers and claims that the quality of parts produced by A and B is the same. At the 5% level of significance, do we have enough evidence to disprove this claim?

We test H0 : pA = pB, or H0 : pA − pB = 0, against HA : pA ≠ pB, where pA (pB) is the proportion of defective parts from manufacturer A (B), respectively. This is a two-sided test because no direction of the alternative has been indicated.

1. We first compute the values of the estimated proportions

p̂A = 10/500 = 0.02 and p̂B = 12/400 = 0.03.

These estimators are asymptotically normally distributed (sums of n and m Bernoulli variables divided by n and m, respectively), where the means are the true proportions pA and pB and the variances are pA(1 − pA)/n and pB(1 − pB)/m. Consequently, p̂A − p̂B is also asymptotically normally distributed, with mean zero under H0 and variance pA(1 − pA)/n + pB(1 − pB)/m. Thus, inserting the estimated values, we standardize and get

Z = (p̂A − p̂B)/√(pA(1 − pA)/n + pB(1 − pB)/m) = −0.945.

2. The critical value is zα/2 = 1.96 (from the table of the standard normal distribution). Since this is a two-sided test, we should reject H0 if |Z| ≥ 1.96 and accept it otherwise.

3. Since Z falls into the acceptance region, we do not have enough evidence to reject H0. Although the sample proportions of defective parts are unequal, the difference between them appears too small to claim that the population proportions are different.

Alternatively, we can consider a single estimator p̂ for the overall proportion of defective products, since we assume that H0 (pA = pB) holds. With p := pA = pB this implies that VAR[XA] = VAR[XB] = p(1 − p) if XA and XB are Bernoulli distributed with parameter p. Hence, if we estimate the common proportion of defective parts as

p̂ = (number of defective parts)/(total number of parts) = (np̂A + mp̂B)/(n + m) = 0.0244,

then we can use it to replace the unknown probability p in the variance estimators VAR[p̂A] = p(1 − p)/n and VAR[p̂B] = p(1 − p)/m. Hence,

VAR[p̂A − p̂B] ≈ p̂(1 − p̂)/n + p̂(1 − p̂)/m

and we get

Z = (p̂A − p̂B)/√(p̂(1 − p̂)/n + p̂(1 − p̂)/m) = −0.966.

Again, Z falls into the acceptance region and we do not have enough evidence to reject H0.
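Both variants of the test are easy to reproduce numerically. The following hedged sketch (numbers taken from Example 62) computes the test statistic with the unpooled and with the pooled variance estimate:

    from math import sqrt
    from scipy.stats import norm

    n, m = 500, 400
    p_a, p_b = 10 / n, 12 / m                 # estimated proportions 0.02 and 0.03

    # unpooled variance estimate
    se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / m)
    z = (p_a - p_b) / se                      # about -0.945

    # pooled estimate under H0: p_A = p_B
    p = (10 + 12) / (n + m)                   # about 0.0244
    se_pooled = sqrt(p * (1 - p) * (1 / n + 1 / m))
    z_pooled = (p_a - p_b) / se_pooled        # about -0.966

    z_crit = norm.ppf(1 - 0.05 / 2)           # 1.96 for the two-sided test
    print(z, z_pooled, abs(z) >= z_crit)      # False: H0 is not rejected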
7.3 T-tests for unknown σ

In the previous section we used an estimator for the unknown true variance σ². In the special case that our data X1, . . . , Xn is normally distributed with mean µ and variance σ² and we estimate the unknown mean with

X̄n = (1/n) Σ_{i=1}^n Xi,

we can make use of a result from statistics that tells us the following: We know that E[X̄n] = µ, and thus

Z = (X̄n − µ)/√(σ²/n)

must follow a standard normal distribution (we standardized it by subtracting the mean and dividing by the standard deviation). However, since σ² is unknown, we estimate it using

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)².

It can be shown that

T = (X̄n − µ)/√(Sn²/n)

follows a Student's t-distribution with n − 1 degrees of freedom. Thus, once µ is fixed (because of our hypothesis H0) and the data X1, . . . , Xn is given, we can compute T and check whether it falls into the acceptance or rejection region by looking at the values tα, −tα or tα/2, depending on the type of alternative. Note that tα is the value such that P(T > tα) = α and it can be found in the table of the Student's t-distribution (in the same way we find tα/2). Note also that the number of degrees of freedom n − 1 is a parameter of the Student's t-distribution, since the distribution of T changes when the sample size n changes.
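As a small illustration, the following sketch runs a right-tail T-test on made-up data (the sample and µ0 are assumptions, not from the notes); t.ppf is SciPy's quantile function of the Student's t-distribution.

    import numpy as np
    from scipy.stats import t

    x = np.array([5.1, 4.9, 5.4, 5.0, 5.3, 4.8])   # hypothetical sample
    mu_0 = 5.0                                     # H0: mu = mu_0
    n = len(x)

    x_bar = x.mean()
    s2 = ((x - x_bar) ** 2).sum() / (n - 1)        # unbiased estimate S_n^2
    T = (x_bar - mu_0) / np.sqrt(s2 / n)           # t-distributed with n-1 df under H0

    t_alpha = t.ppf(1 - 0.05, df=n - 1)            # critical value for the right tail
    print(T, t_alpha, T >= t_alpha)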
7.4 p-Value

So far, we were testing hypotheses by means of acceptance and rejection regions, where we need to know the significance level α in order to conduct a test. However, there is no systematic way of choosing α, the probability of making a type I error. Of course, when it seems too dangerous to reject a true H0, we choose a low significance level. But how low? Should we choose α = 0.01? Perhaps 0.001? Or even 0.0001? If we choose α small enough, we can always make sure that H0 is accepted, since a very small α yields a very small rejection region. Also, if our observed test statistic Zobs belongs to a rejection region but is "too close to call", i.e., almost at the boundary, then how do we report the result? Formally, we should reject the null hypothesis, but practically, we realize that a slightly different significance level α could have expanded the acceptance region just enough to cover Zobs and force us to accept H0. Is there a statistical measure to quantify how far away we are from a "too close to call"?

The idea is to try to test a hypothesis using all levels of significance. Then we have two cases:

1) Very small values of α make it very unlikely to reject the hypothesis because they yield very small rejection regions.
2) High significance levels α make it likely to reject H0 because they correspond to large rejection regions; for a sufficiently large α we will be forced to reject H0.

The p-value is the boundary value between the accept case 1) and the reject case 2). Thus, the p-value is the lowest significance level α that forces rejection of H0 and, at the same time, the highest significance level α that forces acceptance of H0.

Usually α ∈ [0.01, 0.1] (although there are exceptions). Then, a p-value greater than 0.1 exceeds all natural significance levels, and the null hypothesis should be accepted. Conversely, if a p-value is less than 0.01, then it is smaller than all natural significance levels, and the null hypothesis should be rejected. Only if the p-value happens to fall between 0.01 and 0.1 do we really have to think about the level of significance. This is the "too close to call" situation; a good decision is then to collect more data until a more definitive answer can be obtained.
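This boundary property is easy to check numerically. Here is a tiny sketch (the observed statistic is an assumed value, not from the notes): for every level α, the right-tail Z-test rejects H0 exactly when α is at least the p-value.

    from scipy.stats import norm

    z_obs = 2.1                                 # assumed observed test statistic
    p = 1 - norm.cdf(z_obs)                     # right-tail p-value, about 0.018

    for alpha in (0.01, 0.02, 0.05):
        reject = z_obs >= norm.ppf(1 - alpha)   # level-alpha right-tail test
        print(alpha, reject, alpha >= p)        # the two decisions always agree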
We compute p by fixing Zobs = zα and selecting α such that p = α, i.e., for a one-sided right-tail alternative we have

p = α = P(Z ≥ zα) = P(Z ≥ Zobs) = 1 − Φ(Zobs),

where Z is standard normally distributed and Zobs is the observed test statistic, a concrete number computed from the data. The computation of p for the one-sided left-tail and for the two-sided case is similar, as well as for T-tests. We summarize the computation of the p-value for the Z-test of H0 : θ = θ0 below, where we distinguish the three different cases for the alternative hypothesis HA:

Alternative HA       p-value              Computation
right-tail θ > θ0    P{Z ≥ Zobs}          1 − Φ(Zobs)
left-tail θ < θ0     P{Z ≤ Zobs}          Φ(Zobs)
two-sided θ ≠ θ0     P{|Z| ≥ |Zobs|}      2(1 − Φ(|Zobs|))

From the definition of the p-value it is also clear that

p = P(observing a test statistic T that is at least as extreme as Tobs | H0),

where being "extreme" is determined by the alternative: for a right-tail alternative, large values are extreme; for a left-tail alternative, small values are extreme. In Figure 7.2 we illustrate this interpretation (the green shaded area is the p-value and Tobs is the observed data point).

Figure 7.2: Interpretation of the p-value as the probability of observing a test statistic T that is at least as extreme as Tobs, given that H0 is true.

Thus, it is wrong to say that the p-value tells us something about the probability that H0 is true (given the observation)! The probability of observing a result given that some hypothesis is true is not equivalent to the probability that the hypothesis is true given that the result has been observed; using the p-value in this way commits the fallacy of the transposed conditional. A high p-value tells us that the observed or even more extreme values of Zobs are not so unlikely (given H0), and therefore we see no contradiction with H0 and do not reject it. Conversely, a low p-value signals that such an extreme test statistic is unlikely if H0 is true. Since we really observed it, our data is not consistent with H0 and we reject it.

Example 63: Two-Sample Z-Test of Proportions (revisited)
In Example 62 we computed a test statistic of Zobs = −0.945 for a two-sided test which compares the quality of parts produced by two different manufacturers. We compute p as

p = P(|Z| ≥ |−0.945|) = 2(1 − Φ(0.945)) ≈ 0.345.

This p-value is quite high and indicates that the null hypothesis should not be rejected. If H0 is true, then the chance of observing a value for Z that is as extreme or more extreme than Zobs is about 34.5%. This is no contradiction with the assumption that H0 is true.
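The three p-value formulas of the table above are one-liners; the sketch below (using Zobs from Example 63) is one possible implementation:

    from scipy.stats import norm

    z_obs = -0.945

    p_right = 1 - norm.cdf(z_obs)              # right-tail alternative
    p_left = norm.cdf(z_obs)                   # left-tail alternative
    p_two = 2 * (1 - norm.cdf(abs(z_obs)))     # two-sided alternative, about 0.345
    print(p_right, p_left, p_two)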
7.5 Chi-square tests

Chi-square tests are another popular class of statistical tests that can be used to support statements about estimated variances, about estimated counts, about a family of distributions, and about independence.

7.5.1 Estimated variance

Recall that an unbiased estimator for the population variance σ² is given by the sample variance

S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)².

We would like to know the distribution of S² to perform a test for the hypothesis H0 : σ² = σ0². Since the summands (Xi − X̄)² are not independent, we cannot directly use the central limit theorem. In fact, for moderate or small n the distribution is not normal at all (for large n a generalization of the central limit theorem gives an approximately normal distribution). If the Xi are independent and normally distributed observations with VAR[Xi] = σ², then the distribution of

(n − 1)S²/σ² = Σ_{i=1}^n ((Xi − X̄)/σ)²

is chi-square with n − 1 degrees of freedom. The chi-square distribution is a special case of the Gamma distribution and its density is given by

f(x) = (1/(Γ(ν/2) 2^{ν/2})) x^{ν/2−1} e^{−x/2}, x > 0,

and f(x) = 0 whenever x ≤ 0. Here ν > 0 is a parameter that equals the degrees of freedom. Note that the mean is equal to ν and the variance is 2ν. Thus, for a level α test of H0 : σ² = σ0² we compute

χ²obs = (n − 1)S²/σ0²

and compare the value χ²obs with the critical values of the chi-square distribution. More concretely,

• for a right-tail alternative HA : σ² > σ0², we reject H0 if χ²obs ≥ χ²_α,
• for a left-tail alternative HA : σ² < σ0², we reject H0 if χ²obs ≤ χ²_{1−α},
• for the two-sided alternative HA : σ² ≠ σ0², we reject H0 if either χ²obs ≥ χ²_{α/2} or χ²obs ≤ χ²_{1−α/2},

where χ²_α is such that α = P(χ² > χ²_α). We summarize the chi-square test for the population variance below and omit a worked example here, since this test works exactly like the Z- and T-tests. The same test can also be used for the standard deviation, because testing σ² = σ0² is equivalent to testing σ = σ0.

Null hypothesis H0 : σ² = σ0², test statistic χ²obs = (n − 1)S²/σ0²:

Alternative HA   Rejection region                            p-value
σ² > σ0²         χ²obs ≥ χ²_α                                P(χ² ≥ χ²obs)
σ² < σ0²         χ²obs ≤ χ²_{1−α}                            P(χ² ≤ χ²obs)
σ² ≠ σ0²         χ²obs ≥ χ²_{α/2} or χ²obs ≤ χ²_{1−α/2}      2 min{P(χ² ≥ χ²obs), P(χ² ≤ χ²obs)}
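Although the notes omit a worked example, a hedged sketch of the right-tail variance test is given below; the data and σ0² are made-up values:

    import numpy as np
    from scipy.stats import chi2

    x = np.array([12.1, 9.8, 11.3, 10.7, 12.9, 8.8, 11.0, 10.2])  # hypothetical data
    sigma2_0 = 1.0                                 # H0: sigma^2 = sigma_0^2 = 1.0
    n, alpha = len(x), 0.05

    s2 = x.var(ddof=1)                             # sample variance S^2
    chi2_obs = (n - 1) * s2 / sigma2_0             # observed test statistic

    chi2_alpha = chi2.ppf(1 - alpha, df=n - 1)     # critical value chi^2_alpha
    p_value = 1 - chi2.cdf(chi2_obs, df=n - 1)     # right-tail p-value
    print(chi2_obs, chi2_alpha, chi2_obs >= chi2_alpha, p_value)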
7.5.2 Observed Counts

The chi-square test can also be used to compare observed and expected counts via the chi-square statistic

χ² = Σ_{k=1}^N (Obs(k) − Exp(k))²/Exp(k),     (7.1)

where N is the number of categories or groups of data, defined depending on our testing problem. Obs(k) is the observed number of samples that belong to category k, and Exp(k) = E[Obs(k) | H0 is true] is the expected number of samples that fall into category k if the null hypothesis H0 is true. This is always a one-sided, right-tail test, because only low values of χ² indicate that the observed counts are close to what we expect them to be under the null hypothesis. On the contrary, large values of χ² occur if the observed counts are far from the expected counts, which indicates inconsistency of the data and H0. The test can therefore be seen as a "frequency comparison" (compare the data histogram with a fitted density or mass function). A level α rejection region for this chi-square test is [χ²_α, +∞) and the p-value is computed as p = P(χ² ≥ χ²obs).

It can be shown that for large n the standardized counts

(Obs(k) − Exp(k))/√Exp(k)

follow approximately a standard normal distribution, and thus the null distribution of χ² converges to the chi-square distribution with N − 1 degrees of freedom as the sample size increases to infinity. We subtract one degree of freedom from N because we know that Σ_k Obs(k) = n, so once any N − 1 counts are known, the remaining one is uniquely determined. As a rule of thumb, we require an expected count of at least 5 in each category, i.e. Exp(k) ≥ 5 for all k, and at least 3 bins (N ≥ 3). Moreover, it is often beneficial if Exp(1) ≈ . . . ≈ Exp(N). After defining the bins and computing χ² we can use the chi-square distribution to construct rejection regions and compute p-values to decide whether H0 can be rejected or not.

Example 64: Testing a distribution
We want to test whether the observed data belong to a particular distribution. For example, we may want to test whether a sample comes from the normal distribution, whether interarrival times are exponential and counts are Poisson, whether a random number generator returns high-quality standard uniform values, or whether a die is unbiased. In general, for n samples X1, . . . , Xn with distribution F we test

H0 : F = F0 vs HA : F ≠ F0,

where F0 is some given distribution. To conduct the test, we take all possible values of X under F0 (the support of F0) and split them into N bins B1, . . . , BN such that the expected number of samples that fall into each bin is at least 5. It could be that the left-most interval is B1 = (−∞, a1] or the right-most interval is [aN−1, ∞), or both. The observed count for the k-th bin Bk is

Obs(k) = |{i = 1, . . . , n : Xi ∈ Bk}|,

where for a set A we use |A| to denote the number of elements of A. Hence Σ_{k=1}^N Obs(k) = n. If H0 is true and all Xi have the distribution F0, then Obs(k), the number of "successes" in n trials, has a binomial distribution with parameters n and pk = F0(Bk) = P(Xi ∈ Bk | H0). The corresponding expected count is then the expected value of this binomial distribution, Exp(k) = npk. We compute χ² as defined in Eq. 7.1 and conduct the test.
Let us now become more concrete: Suppose that after losing a large amount of money, an unlucky gambler questions whether the game was fair and the die was really unbiased. The last 90 tosses of this die gave the following results:

Score        1    2    3    4    5    6
Frequency   20   15   12   17    9   17

We test H0 : F = F0 vs HA : F ≠ F0, where F is the distribution of the number of dots on the die and F0 is such that P(X = i) = 1/6 for i ∈ {1, 2, . . . , 6}. We choose Bi = {i}; the observed counts are then given by the table above and the expected counts are n/6 = 90/6 = 15 for all bins (all more than 5). We compute

χ² = Σ_{k=1}^N (Obs(k) − Exp(k))²/Exp(k)
   = (20−15)²/15 + (15−15)²/15 + (12−15)²/15 + (17−15)²/15 + (9−15)²/15 + (17−15)²/15 = 5.2

and find (from the table of the chi-square distribution with N − 1 = 5 degrees of freedom) that the p-value p = P(χ² ≥ 5.2) ∈ [0.2, 0.8]. Hence there is no significant evidence to reject H0, and therefore no evidence that the die was biased.

Assume now that we have the same situation as in the previous example, i.e., we suppose that X1, . . . , Xn have distribution F0, but we also used X1, . . . , Xn to fit the parameters of F0 (e.g. we computed the MLE θ̂). Assume F0 has m parameters (m ≥ 1). One can show that in this case, if H0 is true, then χ² converges to a chi-square distribution with N − m − 1 degrees of freedom. Thus, we have to subtract the number of estimated parameters from the degrees of freedom and define the rejection region accordingly. This type of test is called a goodness of fit test. In general, it is a major drawback of the test that there is no clear prescription for the interval selection, and one can come to different conclusions depending on how the intervals are chosen. However, it can be applied to any hypothesized distribution and is often used to test whether artificially generated pseudo-random numbers are U(0, 1)-distributed and independent.
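Returning to the die example: it can be reproduced in a few lines. The following sketch recomputes the statistic and the exact p-value (which the notes read off a table as p ∈ [0.2, 0.8]):

    import numpy as np
    from scipy.stats import chi2

    obs = np.array([20, 15, 12, 17, 9, 17])      # observed counts, n = 90
    exp = np.full(6, obs.sum() / 6)              # expected counts, 15 per bin

    chi2_stat = ((obs - exp) ** 2 / exp).sum()   # 5.2, as computed above
    p = 1 - chi2.cdf(chi2_stat, df=len(obs) - 1)
    print(chi2_stat, p)                          # p is about 0.39, no rejection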
7.5.3 Testing independence

In many practical applications we would like to test whether there is a significant association between two features. This helps to understand cause-and-effect relationships. For example, is it true that smoking causes lung cancer? Do the data confirm that drinking and driving increases the chance of a traffic accident? We can use a chi-square test to verify

H0 : Factors A and B are independent vs HA : Factors A and B are dependent.

In general, we assume that there are two sets of categories A1, . . . , Ak and B1, . . . , Bm where Ai ∩ Aj = ∅ and Bi ∩ Bj = ∅ for any i ≠ j. Independence is understood as for random variables, i.e., factors A and B are independent if any randomly selected x belongs to categories Ai and Bj independently of each other. Therefore, we are testing

H0 : P(x ∈ Ai ∩ Bj) = P(x ∈ Ai)P(x ∈ Bj) for all i, j vs
HA : P(x ∈ Ai ∩ Bj) ≠ P(x ∈ Ai)P(x ∈ Bj) for some i, j.

Example 65: Smokers and lung cancer
Here we have k = m = 2 with A1 = set of smokers, A2 = set of non-smokers, B1 = set of persons with lung cancer, B2 = set of persons without lung cancer.

We next assume the following observed counts:

              B1    B2    ...   Bm    row total
A1            n11   n12   ...   n1m   n1·
A2            n21   n22   ...   n2m   n2·
...           ...   ...   ...   ...   ...
Ak            nk1   nk2   ...   nkm   nk·
column total  n·1   n·2   ...   n·m   n·· = n

where ni· = Σ_j nij and n·j = Σ_i nij are the row and column totals. This table defines the observed counts Obs(i, j) = nij. We then estimate the probabilities P(x ∈ Ai) and P(x ∈ Bj) as

P̂(x ∈ Ai) = ni·/n and P̂(x ∈ Bj) = n·j/n for all i, j.

If H0 is true, then P(x ∈ Ai ∩ Bj) = P(x ∈ Ai)P(x ∈ Bj) and thus Exp(i, j) = n · P(x ∈ Ai ∩ Bj). Based on the information that we have, we can only estimate it as

Êxp(i, j) = n · (ni·/n) · (n·j/n) = ni· n·j / n.

Next, we compute the test statistic

χ²obs = Σ_{i=1}^k Σ_{j=1}^m (Obs(i, j) − Êxp(i, j))² / Êxp(i, j)

and consider the corresponding chi-square distribution. It can be shown that here the degrees of freedom are (k − 1)(m − 1), due to some linearly dependent constraints in the sum of differences.
Example 66: Internet Shopping
A web designer suspects that the chance for an internet shopper to make a purchase through her shopping web site varies depending on the day of the week. To test this claim, she collects data during one week, in which the web site recorded 3758 hits.

Observed             Mon   Tue   Wed   Thu   Fri   Sat   Sun   Total
No purchase          399   261   284   263   393   531   502    2633
Single purchase      119    72    97    51   143   145   150     777
Multiple purchases    39    50    20    15    41    97    86     348
Total                557   383   401   329   577   773   738    3758

Testing independence (i.e., the probability of making a purchase or multiple purchases is the same on any day of the week), we compute the estimated expected counts,

Êxp(i, j) = ni· n·j / n for i = 1, . . . , 7, j = 1, 2, 3.

We get

Expected              Mon      Tue      Wed      Thu      Fri      Sat      Sun     Total
No purchase          390.26   268.34   280.96   230.51   404.27   541.59   517.07   2633
Single purchase      115.16    79.19    82.91    68.02   119.30   159.82   152.59    777
Multiple purchases    51.58    35.47    37.13    30.47    53.43    71.58    68.34    348
Total                 557      383      401      329      577      773      738     3758

and the test statistic is

χ²obs = (399 − 390.26)²/390.26 + . . . + (86 − 68.34)²/68.34 = 60.79.

We have (7 − 1)(3 − 1) = 12 degrees of freedom and get a very low p-value < 0.001, which means that we should reject H0: there is significant evidence that the probability of making a single purchase or multiple purchases varies during the week.

In the case of small sample sizes and very unequally distributed data (among the cells), Fisher's exact test can be used instead to test for independence.
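This computation is also what SciPy's chi2_contingency function implements; the sketch below reproduces Example 66 from the observed counts alone:

    import numpy as np
    from scipy.stats import chi2_contingency

    # observed counts (rows: no / single / multiple purchases; columns: Mon-Sun)
    obs = np.array([
        [399, 261, 284, 263, 393, 531, 502],
        [119,  72,  97,  51, 143, 145, 150],
        [ 39,  50,  20,  15,  41,  97,  86],
    ])

    chi2_obs, p_value, dof, expected = chi2_contingency(obs)
    print(chi2_obs, dof, p_value)   # about 60.79 with 12 degrees of freedom, p < 0.001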
7.6 Multiple Testing

Up to this point we have only discussed the testing procedure for a single hypothesis. However, we often wish to test many associated hypotheses at the same time. This leads to the so-called multiple testing problem. When testing many hypotheses simultaneously, careful attention must be paid to what may be concluded from the p-values of the individual tests.

As an example, suppose that a salesman claims that some substance is beneficial and, to demonstrate this, he tests 100 different null hypotheses, one for each of 100 different illnesses, each null hypothesis stating that the substance is not useful in helping to cure one of the illnesses involved and each alternative hypothesis stating that the substance is useful. Suppose that the appropriate statistical tests lead to 100 p-values, one for each disease. If a type I error of 1% is used for each individual test, the probability that at least one of the 100 null hypotheses will be rejected, given that the substance is of no value for any of the illnesses (all null hypotheses true), can be computed from the binomial distribution with success probability p = 0.01 and N = 100. If k is the number of successes, then

P(k ≥ 1) = 1 − P(k = 0) = 1 − (1 − p)^N ≈ 0.634.

Thus, even if the substance is of no value, there is a very high chance that some null hypotheses get rejected. This might lead the salesman to claim that the substance is useful for those illnesses where, by chance, the null hypothesis was rejected. He may provide only the data relating to those illnesses, which taken out of context may appear convincing. This indicates the potential error in drawing conclusions from individual p-values obtained from many associated tests carried out in parallel (in fact this has happened in a number of medical research projects, for instance, related to the association between breast cancer and polymorphisms in certain genes).

One possible approach to this problem is to use a so-called "experiment-wise" or "family-wise" p-value, defined as follows. The probability of rejecting at least one null hypothesis when all null hypotheses are true is known as the family-wise error rate (FWER), i.e.,

FWER = P(#{rejected null hypotheses that are true} ≥ 1).

The aim is to control the FWER at some acceptable value α. To do this, it is necessary to use a type I error for each individual test that is smaller than α.

7.6.1 Bonferroni correction

One of the simplest ways of ensuring an FWER of at most α when N tests are performed is to use a type I error of α/N for each individual test. Then the "success probability" of the binomial distribution is α/N instead of α and we get, if all null hypotheses are true and the tests are independent,

FWER = P(V ≥ 1) = 1 − P(V = 0) = 1 − (1 − α/N)^N ≤ α,

where V is the number of rejected null hypotheses that are true. In general, if N0 ≤ N is the total number of true null hypotheses, we consider N0 Bernoulli random variables X1, . . . , XN0, where Xk = 1 if the k-th true null hypothesis is rejected. Then

P(Σ_{k=1}^{N0} Xk ≥ 1) = P(X1 = 1 ∨ . . . ∨ XN0 = 1) ≤ Σ_{k=1}^{N0} P(Xk = 1) ≤ Σ_{k=1}^{N} α/N = α.

This correction applies whether the tests are independent or not, but it is quite conservative and has the disadvantage that only a very small p-value leads to the rejection of H0. The power of the test (reject H0 if HA is true) gets very small.

Example 67: Five Tests
Let us assume we have 5 tests with null hypotheses H0(1), H0(2), . . . , H0(5) and we computed the following p-values:

H0(1): 0.005, H0(2): 0.011, H0(3): 0.02, H0(4): 0.04, H0(5): 0.13.

For a significance level α = 0.05 we have a local significance level of α/N = 0.01 and can only reject H0(1).

7.6.2 Holm-Bonferroni method

The Holm-Bonferroni method adjusts the rejection criterion of each individual hypothesis as follows: First, we sort the hypotheses according to their p-values, i.e., pi is the p-value of H0(i) and p1 ≤ . . . ≤ pN. Then we fix the "global" significance level α and compute the minimal index k such that

pk > α/(N + 1 − k).

This can be done by defining α1 = α/N, α2 = α/(N − 1), . . . , αN = α/1 and comparing p1 with α1, p2 with α2, etc., until pk > αk. If k > 1, we reject H0(1), H0(2), . . . , H0(k−1) and do not reject H0(k), H0(k+1), . . . , H0(N). If k = 1, then no null hypothesis is rejected, and if no such k exists, all null hypotheses are rejected. It can be shown (exercise!) that the FWER of this method is at most α.

Example 68: Five Tests
The p-values of the 5 tests in the previous example were already sorted and we compare

H0(1): 0.005 < α/N = 0.01,
H0(2): 0.011 < α/(N − 1) = 0.0125,
H0(3): 0.02 > α/(N − 2) = 0.0167.

Hence we reject H0(1) and H0(2) and do not reject H0(3), H0(4) and H0(5).
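Both corrections are straightforward to code. Here is a minimal sketch applying them to the five sorted p-values of Examples 67 and 68:

    alpha = 0.05
    p = [0.005, 0.011, 0.02, 0.04, 0.13]      # sorted p-values p_1 <= ... <= p_N
    N = len(p)

    # Bonferroni: reject H0(i) if p_i <= alpha / N
    bonferroni = [p_i <= alpha / N for p_i in p]

    # Holm-Bonferroni: compare p_i with alpha / (N + 1 - i) until the first failure
    rejected = 0
    for i, p_i in enumerate(p, start=1):
        if p_i > alpha / (N + 1 - i):
            break
        rejected = i

    print(bonferroni)   # [True, False, False, False, False]
    print(rejected)     # 2: reject H0(1) and H0(2)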
Let us finally consider the following table of counts of null hypotheses:

           H0 accepted   H0 rejected   total
true H0         U             V          N0
false H0        T             S        N − N0
total         N − R           R           N

The second column gives the number of hypotheses where the p-value was small enough to reject, and the first column gives the number of hypotheses that were not rejected. We have that FWER = P(V ≥ 1), where V is the number of incorrectly rejected (falsely significant) tests.

Since controlling the FWER leads to extremely conservative methods (in particular if N is very large), other methods accept a few false positives. Let the false discovery proportion be defined as

Fdp = V/R if R ≥ 1, and Fdp = 0 otherwise.

Then it is often useful to control the false discovery rate

FDR = E[Fdp],

i.e., the average proportion of false positives among the set of rejected hypotheses. For instance, medical screening guidelines consider an FDR of 0.1 acceptable for certain types of screenings. Note that controlling the FDR does not lead to FWER control in general. Moreover, we can only control its expectation; the actual proportion of false positives may be much higher. In the exercises we will show that

• if all null hypotheses are true (N = N0), then the FDR is equivalent to the FWER, and
• we always have FWER ≥ FDR.

Controlling the FDR is thus less stringent and hence more powerful, in the sense that it allows us to reject more hypotheses.

7.6.3 Benjamini-Hochberg procedure

A simple method that controls the FDR is the Benjamini-Hochberg procedure. It controls the FDR such that its expectation is at most α. For instance, if one uses microarray experiments to compare the expression levels of 20,000 genes in liver tumors and in normal liver cells, an FDR of 5% would allow up to 5% of the genes with significant results to be false positives. In follow-up experiments, one can then check whether the selected genes really show a significant difference between the normal and tumor cells. This ensures that only with low probability do we miss a potentially important discovery, which is reasonable if the costs for additional tests on the selected genes are low.

First, we again sort the hypotheses according to their p-values, i.e., pi is the p-value of H0(i) and p1 ≤ . . . ≤ pN. Then we fix an FDR of α and compute the maximal index k such that

pk ≤ (k/N) α.

We reject H0(1), H0(2), . . . , H0(k) and do not reject H0(k+1), H0(k+2), . . . , H0(N). If no such k exists, none of the null hypotheses is rejected. It can be shown that, if the tests are independent, then the FDR of this method is (N0/N) α, which is at most α.

Example 69: Five Tests (revisited)
Recall that the p-values of the 5 tests in the previous example were

H0(1): 0.005, H0(2): 0.011, H0(3): 0.02, H0(4): 0.04, H0(5): 0.13.

In the Benjamini-Hochberg procedure we compare

H0(1): 0.005 ≤ (1/N)α = 0.01,
H0(2): 0.011 ≤ (2/N)α = 0.02,
H0(3): 0.02 ≤ (3/N)α = 0.03,
H0(4): 0.04 ≤ (4/N)α = 0.04,
H0(5): 0.13 > (5/N)α = 0.05.

Hence we reject H0(1), . . . , H0(4) and do not reject H0(5). Note that it is possible that for some i < k we have a p-value higher than (i/N)α.

7.7 Exercises

Easy