ST2351 Probability and Theoretical Statistics
Course Notes for Michaelmas Term 2012
Simon Wilson, September 2012
© Simon Wilson, 2012
Contents

1 Events and Probabilities
  1.1 Probability Spaces
  1.2 A Few Simple Theorems
  1.3 Discrete Probability Spaces
  1.4 Interpretation of Probability
  1.5 Conditional Probability and Independence
  1.6 Two Important Theorems
  1.7 The 3-door (or Monty Hall) problem
  1.8 The Birthday Problem

2 Discrete Random Variables
  2.1 Some Common Discrete Random Variables
  2.2 Expectation
  2.3 Expectation of Functions of X
  2.4 Conditional Expectation
  2.5 Properties of Expectation
  2.6 Variance

3 Continuous Random Variables
  3.1 Properties of the pdf
  3.2 Examples of Continuous Random Variables
  3.3 Expectation and Variance
  3.4 Properties of Expectations
  3.5 Functions of a Random Variable

4 Multivariate Random Variables
  4.1 Multivariate Discrete Distributions
  4.2 Expectation
  4.3 Independence
  4.4 Generalising to n Random Variables
  4.5 Multivariate Distribution Functions
  4.6 Continuous Random Variables
  4.7 Marginal Density Functions and Independence
  4.8 Conditional Density Functions
  4.9 Expectations of Continuous Random Variables
  4.10 Variance and Covariance
  4.11 Sums of Random Variables
  4.12 The Multivariate Gaussian Distribution

5 Moment and Characteristic Functions
  5.1 Moment Generating Functions
  5.2 Characteristic Functions

6 Two Probability Theorems
  6.1 The Law of Averages
  6.2 The Central Limit Theorem
1 Events and Probabilities
HANDOUT: outline
Probability was developed by gamblers in the 17th century. However, its rigorous development had
to wait for the start of the 20th century.
Probability starts with the idea of an experiment or trial:
Definition: An experiment is any course of action whose consequence is not predetermined.
When we have an experiment, there is more than one possible outcome, and we cannot determine
which will happen.
Definition: The set of all possible outcomes of an experiment is called the sample space. It is
usually denoted Ω.
Example: experiment=“throw a die”. Then Ω = {1, 2, 3, 4, 5, 6}.
Definition: An event is just a subset of the sample space that interests us.
The singletons {ω}, for ω ∈ Ω, are also events, called elementary events. The set of events is the event space and is denoted F.
For example, "we throw an odd number" or "we don't throw a 6" are events when we throw a die.
For simple things such as a die, F is usually the power set. However, in more complicated situations, it may be considerably smaller than the set of all subsets (this is to do with σ-algebras, which we do not go into here). However, again for reasons that we do not go into, F must satisfy:
• F is non-empty;
• A ∈ F ⇒ Ω − A ∈ F;
• A_1, A_2, ... ∈ F ⇒ ∪_i A_i ∈ F.
Usually, we think that some outcomes of an experiment are more likely to occur than others. Probability is a way to quantify the likelihood of each event in F occurring. It is a number between 0
and 1, and so is a mapping from F to [0, 1]. We give it certain properties:
Definition: A probability measure on (Ω, F) is a mapping P : F → [0, 1] that satisfies:
• P (Ω) = 1 and P (∅) = 0;
• If A_1, A_2, ... are disjoint events then:
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).
Example: throwing a die: A_1 = get a 1 or 2, A_2 = get a 5; P(A_1 ∪ A_2) = P(A_1) + P(A_2) = P(1) + P(2) + P(5) = 3/6 = 0.5.
1.1 Probability Spaces
For any experiment, there is a mathematical object called a probability space that describes it.
Definition: A probability space is a triple (Ω, F, P ) where
• Ω is a set;
• F is an event space of subsets of Ω;
• P is a probability measure on (Ω, F)
1.2 A Few Simple Theorems
Theorem: A, B ∈ F ⇒ A − B ∈ F.
Proof: Ω − (A − B) = (Ω − A) ∪ B, which is an event by the properties of F.
Theorem: A_1, A_2, ... ∈ F ⇒ ∩_{i=1}^∞ A_i ∈ F.
Proof: This is just de Morgan's law: Ω − ∩_{i=1}^∞ A_i = ∪_{i=1}^∞ (Ω − A_i). A_i ∈ F ⇒ Ω − A_i ∈ F ⇒ Ω − ∩_{i=1}^∞ A_i ∈ F ⇒ ∩_{i=1}^∞ A_i ∈ F (the complement of Ω − ∩_{i=1}^∞ A_i).
Theorem: For A ∈ F, P (Ω − A) = 1 − P (A) (i.e. P (not A) = 1 − P (A)).
Proof: Ω − A ∈ F and is disjoint from A, thus P(A) + P(Ω − A) = P(Ω) = 1.
Theorem: For A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof: DRAW A VENN DIAGRAM. A − B and A ∩ B are disjoint, thus P(A) = P((A − B) ∪ (A ∩ B)) = P(A − B) + P(A ∩ B). Similarly P(B) = P((B − A) ∪ (B ∩ A)) = P(B − A) + P(B ∩ A); adding, we get
P(A) + P(B) = P(A − B) + 2P(A ∩ B) + P(B − A) = [P((A − B) ∪ (A ∩ B) ∪ (B − A))] + P(A ∩ B) = P(A ∪ B) + P(A ∩ B).
Useful hint: it is usually best when trying to prove things like this to draw a Venn diagram and put
things in terms of disjoint sets, then use the rule about unions of disjoint sets.
1.3 Discrete Probability Spaces
When Ω is countable, we say that the probability space is discrete. In this case, F is almost always
the set of all subsets of Ω. We always assume that is the case.
If that is so, then the elementary events ω ∈ Ω are in F so we can define P({ω}) = P(ω). Then, for any event A, we have P(A) = P(∪_{ω∈A} {ω}) = Σ_{ω∈A} P(ω).
So, for discrete spaces, P (A) is always the sum of probabilities of the outcomes in A.
Example: Let Ω = {1, ..., N} and suppose P(i) = 1/N, i.e. all outcomes are equally likely. Then P(A) = |A|/N. This is very common (lotteries, picking a card from a pack, etc.). We look at an example of this in the tutorials.
1.4 Interpretation of Probability
1.5 Conditional Probability and Independence
I take a pack of cards and you pick one: P (3♥) = 1/52.
Suppose I pick a card and tell you “it is a heart”. Now what is P (3♥)?
The extra information "it is a heart" has affected the probability. In fact I have told you that the event "it is a heart" has occurred, and this has affected the probability of the event "it is the 3♥".
In general, if A, B ∈ F, and we want P(A) and are given that B has occurred, then A will only occur if A ∩ B occurs, and P(A) changes to something we call the probability of A given B, or the conditional probability.
Definition: If A, B ∈ F and P(B) > 0 then the conditional probability of A given B is denoted P(A | B) and is defined to be P(A | B) = P(A ∩ B)/P(B).
Example: Consider two urns. Urn 1 has 3 white and 2 black balls, urn 2 has 1 white and 6 black
balls. An experiment consists of tossing a fair coin. If it lands H then a ball is picked from urn 1,
else a ball is picked from urn 2:
1. What is Ω?
2. What is P (W ball | H thrown)?
3. What is P (W ball | T thrown)?
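These can be read off from the urn compositions (3/5 and 1/7); a quick Monte Carlo check in Python (an added sketch, not part of the original notes):

import random

def trial():
    # Toss a fair coin; H -> urn 1 (3 white, 2 black), T -> urn 2 (1 white, 6 black).
    coin = random.choice("HT")
    urn = ["W"] * 3 + ["B"] * 2 if coin == "H" else ["W"] * 1 + ["B"] * 6
    return coin, random.choice(urn)

results = [trial() for _ in range(200_000)]
for side in "HT":
    picks = [ball for coin, ball in results if coin == side]
    print(side, picks.count("W") / len(picks))   # ~0.6 for H, ~0.143 for T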
Definition: Two events A, B are independent if the occurrence of one does not affect the probability of the other; that is, if P(A | B) = P(A) and P(B | A) = P(B).
From the definition of P(A | B), we see that A and B are independent iff P(A ∩ B) = P(A)P(B).
More generally, a family of events {A_i : i ∈ I} is independent if for all finite subsets J ⊆ I:
P(∩_{i∈J} A_i) = Π_{i∈J} P(A_i).
The family is pairwise independent if the above holds for |J| = 2.
Example: Take 3 events A, B and C.
1. What must hold for these 3 events to be independent?
2. Consider throwing a 4-sided die, where each outcome is equally likely. Let A = {1, 2}, B = {1, 3} and C = {1, 4}. Show that these are pairwise independent but not independent. Calculate P(A), P(B), P(C), P(A ∩ B), etc. and compare with the definition.
1.6 Two Important Theorems
Two very useful laws of probability are the partition law and Bayes' law:
Theorem: (Partition law) Let B_1, B_2, ... ∈ F be a partition of Ω. Then for any A ∈ F:
P(A) = Σ_i P(A | B_i) P(B_i).
Proof: P(A) = P(A ∩ (∪_i B_i)) = P(∪_i (A ∩ B_i)) = Σ_i P(A ∩ B_i) = Σ_i P(A | B_i) P(B_i).
Note: a special case is if the partition sets are just each outcome, so you have P(A) = Σ_{∀x∈Ω} P(A | x) P(x).
Example: Tomorrow's weather is either rain or snow but not both, with P(rain) = 0.4 and P(snow) = 0.6. If it rains, then I am late for the class with prob. 0.2; if it snows, with prob. 0.6. What is P(late)?
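For checking, by the partition law (with partition {rain, snow}):
P(late) = P(late | rain) P(rain) + P(late | snow) P(snow) = 0.2 × 0.4 + 0.6 × 0.6 = 0.08 + 0.36 = 0.44.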
Theorem: (Bayes' law) Let B_1, B_2, ... ∈ F be a partition of Ω. Then for any A ∈ F:
P(B_i | A) = P(A | B_i) P(B_i) / Σ_j P(A | B_j) P(B_j).
Proof: P(B_i | A) = P(B_i ∩ A)/P(A) = P(A | B_i) P(B_i)/P(A) = result (by the partition law).
Again, a special case is, for any x ∈ Ω, P(x | A) = P(A | x) P(x) / Σ_{∀y∈Ω} P(A | y) P(y).
1.7 The 3-door (or Monty Hall) problem
There are 3 identical doors. Behind 1 door is a nice prize (say a car). Behind the other two is
something not very desirable (nothing, or a donkey). You play the following game:
• You will select a door and of course want to win the nice prize. Your host is Monty Hall who
knows what lies behind the doors but will not tell you!
• Once you have picked a door then Monty will open one of the other doors that does not have
the prize behind it. Note that there will always be at least one such door;
• You now have a choice. You can:
1. Stay with your original choice of door;
2. Switch to the other unopened door.
The question is: does it make any difference to your chances of winning the nice prize by switching
or not?
Let the doors be labelled W, L_1, L_2. The probability that you pick each is 1/3. If you initially pick W then Monty has a choice of opening L_1 or L_2; assume wlog that he opens L_1 in this case. However, if you pick L_1/L_2 then Monty can only open L_2/L_1 (no choice). If you decide to switch then there is only one remaining door to switch to.
The outcomes of this experiment are a pair (x, y) where x is your initial choice and y is the door
that you finally choose after switching or not. Consider the two options:
Switch: Ω_switch = {(W, L_2), (L_1, W), (L_2, W)}. These occur with probabilities 1/3 each. The event "win" = {(L_1, W), (L_2, W)} and P(win) = 1/3 + 1/3 = 2/3.
Stay: Ω_stay = {(W, W), (L_1, L_1), (L_2, L_2)}. These occur with probabilities 1/3 each. The event "win" = {(W, W)} and P(win) = 1/3.
So it is better to switch!
Another (and simpler) way to see what is happening is to draw a table of the outcomes of the game according to which door you initially pick:

Pick first | Switch | Stay
W          | L_2    | W
L_1        | W      | L_1
L_2        | W      | L_2

So when switching, you win iff you initially pick L_1 or L_2, which has probability 2/3. When staying, you win iff you initially pick W, with probability 1/3.
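The 2/3 versus 1/3 answer is also easy to confirm by simulation; a minimal Python sketch (an addition, not part of the original notes):

import random

def play(switch):
    doors = ["W", "L1", "L2"]
    pick = random.choice(doors)
    # Monty opens a losing door other than your pick (L1 if you picked W, as above).
    opened = next(d for d in doors if d != pick and d != "W")
    if switch:
        pick = next(d for d in doors if d not in (pick, opened))
    return pick == "W"

n = 100_000
print(sum(play(True) for _ in range(n)) / n)    # ~2/3 when switching
print(sum(play(False) for _ in range(n)) / n)   # ~1/3 when staying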
1.8 The Birthday Problem
There are N people in a room. What is the chance that none of them share a birthday? What do
you think for this class (ignoring 29th February for now)? Is it more than 25%? More than 50%?
More than 75%?
We can figure out the probability as follows. Assume 365 days and that each date is equally likely to
be a birthday (not quite true but we’ll ignore that for now). Let pN = P (N people do not have a birthday in common).
Label the people in the room 1, 2, . . . , N . We’ll use an induction argument.
• For N = 2, p2 = P (person 2 does not have the same birthday as person 1) = 364/365;
• We have:
p_N = P(person N does not have the same birthday as persons 1, ..., N−1 AND persons 1, ..., N−1 do not share a birthday)
    = P(person N does not have the same birthday as persons 1, ..., N−1) × P(persons 1, ..., N−1 do not share a birthday)
    = ((365 − (N−1))/365) p_{N−1}.
(The first factor is a conditional probability: given that persons 1, ..., N−1 occupy N−1 distinct dates, person N must avoid those dates.)
Hence
p_N = ((365 − (N−1))/365) p_{N−1} = ((365 − (N−1))/365)((365 − (N−2))/365) p_{N−2} = ··· = Π_{i=2}^N (366 − i)/365.
The handout shows a plot of these probabilities for different N . Are you surprised?
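The numbers are easy to reproduce; a short Python sketch (an addition, not from the notes) computing p_N from the product formula:

# p_N = prod over i = 2..N of (366 - i)/365
p = 1.0
for N in range(2, 61):
    p *= (366 - N) / 365
    if N in (10, 20, 23, 30, 40, 50, 60):
        print(N, round(p, 3))
# At N = 23, p_N drops below 0.5: a shared birthday is already more likely than not.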
HANDOUT: BIRTHDAY PROBLEM
HANDOUT: BAYES LAW - EXPERT SYSTEMS
HANDOUT: TUTORIAL 1
2 Discrete Random Variables
Random variables are a very important idea in probability theory.
Simplistic Definition: a discrete random variable X is the sample space of a discrete probability space that is a subset of ℝ.
e.g. throwing a die, Ω = {1, 2, 3, 4, 5, 6}, is a random variable; tossing a coin, Ω = {H, T}, is not a random variable.
The big advantage of random variables is that P can be represented by a function of a real variable p : Ω → [0, 1] for the probability of each outcome. We would like to be able to do this for non-real-valued Ω as well, which we can do if we have a mapping from Ω to ℝ to represent Ω. To make it compatible with the definition of a probability space, it should have certain properties as well. This gives us:
Formal Definition: A discrete random variable X is a mapping X : Ω → ℝ on the space (Ω, F, P), where Ω is discrete and
• X(Ω) is a countable subset of ℝ;
• {ω ∈ Ω | X(ω) = x} ∈ F, ∀x ∈ ℝ.
Note: when Ω ⊂ ℝ, X is almost always the identity mapping, which gives us the simplistic definition.
Given a random variable, we can propose:
Definition: If X is a discrete random variable on (Ω, F, P ) then the probability mass function (pmf)
pX of X is the function such that
pX (x) = P ({ω ∈ Ω | X(ω) = x}).
We often say P (X = x) for pX (x).
p_X(x) is a probability, and so must satisfy 0 ≤ p_X(x) ≤ 1 and Σ_{∀x∈ℝ} p_X(x) = 1.
Example: Toss a coin. Ω = {H, T}. An obvious random variable to associate with tossing a coin is X(T) = 0 and X(H) = 1. If P(H) = p and P(T) = 1 − p then the pmf can be written p^x (1 − p)^{1−x}.
Example: Throw a die. Ω = {1, 2, 3, 4, 5, 6}. Suppose you play a game where you: win £2 if a 5 is thrown, lose £1 if a 1 or a 3 is thrown, and lose nothing otherwise. Then your winnings are a discrete random variable X : Ω → ℝ with X(1) = X(3) = −1, X(2) = X(4) = X(6) = 0 and X(5) = 2.
2.1 Some Common Discrete Random Variables
Certain random variables occur very frequently, and we list 6 of them here:
1. Bernoulli. This is the coin toss. X takes value 1 with probability p and 0 with probability 1 − p; no other value is possible. Often 1 is associated with success and 0 with failure, and so p is called the "success probability". The pmf is
P(X = x) = p^x (1 − p)^{1−x}, x = 0, 1.
2. Binomial. Suppose we toss a coin n times, where P(H) = p, and each toss is independent. What is the chance that we get x heads? In general, we have P(k H) = p^k (1 − p)^{n−k} × (no. of ways of getting k H). The number of ways of getting x H from n is C(n, x) = n!/(x!(n − x)!). This gives us the binomial distribution. It is the number of "successes" in n independent Bernoulli trials, with prob. of success p. The pmf is:
P(X = x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, ..., n.
Example: throw a die 4 times; what is the probability of getting 0, 1, 2, etc. sixes? (Binomial, n = 4, p = 1/6.)
3. Geometric. If we keep tossing the above coin, how many tosses until we get a H? For the first H to be on the xth toss, we must have x − 1 T's then a H, with prob. (1 − p)^{x−1} p. This is the geometric distribution. The pmf is:
P(X = x) = (1 − p)^{x−1} p, x = 1, 2, ....
Example: what is the probability of seeing the first 6 on the xth throw of a die? For x = 3?
4. Negative binomial. How many coin tosses until we get k H's? This is the negative binomial. Its pmf is:
P(X = x) = C(x − 1, k − 1) p^k (1 − p)^{x−k}, x = k, k + 1, ....
When k = 1 we get the geometric.
5. Poisson. The Poisson is a random variable on {0, 1, ...} with pmf:
P(X = x) = (λ^x/x!) e^{−λ}, x = 0, 1, 2, ....
The Poisson describes many physical systems, such as the number of customers entering a bank per minute, radioactive decays per second, etc.
Example: the number of server failures in a data centre per day is Poisson with λ = 3. What is the probability of 0 failures, 4 failures, 2 or more failures?
The Poisson is justified by thinking of a binomial with large n and small p, letting n → ∞ and p → 0 such that np = λ is a constant:
P(X = k) = lim_{n→∞} (n!/(k!(n − k)!)) (λ/n)^k (1 − λ/n)^{n−k}
         = lim_{n→∞} (n(n − 1)···(n − k + 1)/n^k) (λ^k/k!) (1 − λ/n)^n (1 − λ/n)^{−k}
         = (λ^k/k!) lim_{n→∞} (n(n − 1)···(n − k + 1)/n^k) · lim_{n→∞} (1 − λ/n)^n · lim_{n→∞} (1 − λ/n)^{−k}
         = (λ^k/k!) e^{−λ}.
6. Hypergeometric. This is the "lottery probability". There are N balls in a bag, of which K are white and N − K are black; in a lottery, the white ones correspond to the numbers that you picked. n of these are picked without replacement; these correspond to the numbers chosen to be the winning ones. The probability that x of the n are white is:
P(X = x) = C(K, x) C(N − K, n − x) / C(N, n), x = 0, 1, ..., min(n, K).
For example, in a lottery where you choose 6 from 45 numbers (so N = 45, K = 6) and 6 numbers are picked as the winning ones without replacement (so n = 6):
P(x numbers picked) = C(6, x) C(39, 6 − x) / C(45, 6), x = 0, 1, ..., 6.
Thus P(0) = 0.401, P(1) = 0.424, P(2) = 0.151, P(3) = 0.02247, P(4) = 0.0013605, P(5) = 0.00002875, P(6) = 0.0000001235 (1 in 8.1 million).
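All six of these pmfs are available in scipy.stats, so the worked examples above can be checked numerically; a sketch (an addition to the notes):

from scipy import stats

# Binomial: 4 die throws, p = 1/6; P(0), P(1), ..., P(4) sixes.
print([round(stats.binom.pmf(x, 4, 1/6), 4) for x in range(5)])

# Geometric: first 6 on the 3rd throw: (5/6)^2 * (1/6).
print(round(stats.geom.pmf(3, 1/6), 4))

# Poisson(3) server failures: P(0), P(4), P(2 or more).
print(round(stats.poisson.pmf(0, 3), 4),
      round(stats.poisson.pmf(4, 3), 4),
      round(1 - stats.poisson.cdf(1, 3), 4))

# Hypergeometric lottery (choose 6 from 45): P(all 6 match) ~ 1 in 8.1 million.
print(stats.hypergeom.pmf(6, 45, 6, 6))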
2.2 Expectation
When we throw a die, we do not know of course which outcome will occur. However, there are
properties of random experiments that are not uncertain.
For example, suppose that we repeatedly throw the die. Let X_i be the ith outcome and look at the average number thrown, Σ_{i=1}^n X_i/n, as a function of n.
HANDOUT: DIE THROWS
Such a property holds for many random variables. It is called the expectation.
Definition: The expectation of a discrete random variable X with pmf p_X(x) is
E(X) = Σ_{∀x} x p_X(x).
It is quite easy to show that the running average Σ_{i=1}^n X_i/n does indeed "converge" to E(X) as n → ∞ (the law of large numbers).
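Here is a minimal simulation sketch (an addition, not from the notes) of that running average for die throws:

import random

total = 0
for n in range(1, 100_001):
    total += random.randint(1, 6)           # one die throw
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(n, total / n)                 # running average -> E(X) = 3.5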
The expectation is also called the mean of X. It is also the "centre of gravity" of a probability distribution.
Example: Let X be Bernoulli(p). Then E(X) = Σ_{∀x} x p_X(x) = 0 × (1 − p) + 1 × p = p.
Example: Let X be geometric(p). Then
E(X) = Σ_{∀x} x p_X(x) = Σ_{x=1}^∞ x(1 − p)^{x−1} p = p Σ_{x=1}^∞ (d/dp)(−(1 − p)^x) = −p (d/dp) Σ_{x=1}^∞ (1 − p)^x = −p (d/dp)(1/p − 1) = −p(−1/p²) = 1/p.
Example: Let X be Poisson(λ). Then
E(X) = Σ_{x=0}^∞ x (λ^x/x!) e^{−λ} = e^{−λ} Σ_{x=1}^∞ λ^x/(x − 1)! = λ e^{−λ} Σ_{x=0}^∞ λ^x/x! = λ e^{−λ} e^λ = λ.
Other mean values are: for the binomial(n, p), E(X) = np; for the negative binomial(k, p), E(X) = k/p.
HANDOUT: PROBABILITY DISTRIBUTION EXAMPLES
2.3 Expectation of Functions of X
For any function g : ℝ → ℝ, we define the expectation of g(X) to be
E(g(X)) = Σ_{∀x} g(x) p_X(x).
Example: X is Bernoulli(p). What is E(e^X)? E(e^X) = Σ_{x=0,1} e^x p^x (1 − p)^{1−x} = (1 − p) + pe = 1 + (e − 1)p.
2.4 Conditional Expectation
We define the conditional expectation of X given A to be:
E(X | A) = Σ_{∀x} x P(X = x | A).
There is an equivalent of the Partition theorem for conditional expectations, which goes as follows:
Theorem: Let B_1, B_2, ... be a partition of Ω and let X be any rv on Ω. Then
E(X) = Σ_i E(X | B_i) P(B_i),
whenever this sum converges absolutely.
Proof: using the definition of conditional probability:
Σ_i E(X | B_i) P(B_i) = Σ_i Σ_x x P({X = x} ∩ B_i) = Σ_x x P({X = x} ∩ (∪_i B_i)) = Σ_x x P(X = x) = E(X).
2.5 Properties of Expectation
Lemma: The following properties hold:
1. For a constant c, E(c) = c;
2. For any rv X and c ∈ ℝ, E(cX) = cE(X);
3. For any 2 rvs X and Y, E(X + Y) = E(X) + E(Y);
4. If X and Y are independent then E(XY) = E(X)E(Y).
5. In fact, X and Y are independent iff E(g(X)h(Y )) = E(g(X)) E(h(Y )) for all functions g and
h for which the expectations exist.
Proof: trivial except for the last two, which we will prove when we come to do multivariate distributions in Section 4.
2.6 Variance
Here are two probability distributions on {0, 1, 2, 3, 4, 5}: distribution 1 is (0.05, 0.05, 0.4, 0.4, 0.05, 0.05),
distribution 2 is (0.2, 0.1, 0.2, 0.2, 0.1, 0.2). PLOT THEM. We see that the expectation in both cases
is 2.5 but that no. 1 is much less “spread out” or variable than no. 2.
The variance of a random variable measures the amount of spread. It is defined to be
Var(X) = E((X − E(X))2 ),
the mean squared distance between X and its mean. Note that:
E((X − E(X))²) = E(X² − 2E(X)X + E(X)²) = E(X²) − 2E(X)² + E(X)² = E(X²) − E(X)²,
where E(X²) = Σ_{∀x} x² p_X(x). We usually use this formula to compute the variance.
Example: X is Bernoulli(p). We know that E(X) = p, and E(X 2 ) = 02 (1 − p) + 12 p = p, thus
Var(X) = p − p2 = p(1 − p).
Example: X is Poisson(λ). We know that E(X) = λ and:
E(X²) = Σ_{x=0}^∞ x² p_X(x) = Σ_{x=0}^∞ x² λ^x e^{−λ}/x! = e^{−λ} Σ_{x=1}^∞ x λ^x/(x − 1)! = e^{−λ} Σ_{x=1}^∞ ((x − 1) + 1) λ^x/(x − 1)!
      = e^{−λ} ( Σ_{x=1}^∞ (x − 1) λ^x/(x − 1)! + Σ_{x=1}^∞ λ^x/(x − 1)! )
      = e^{−λ} ( λ² Σ_{x=2}^∞ λ^{x−2}/(x − 2)! + λ Σ_{x=1}^∞ λ^{x−1}/(x − 1)! )
      = e^{−λ} ( λ² e^λ + λ e^λ ) = λ² + λ,
thus Var(X) = λ² + λ − λ² = λ. So for the Poisson distribution, mean = variance.
Other variances are: binomial, Var(X) = np(1 − p); geometric, Var(X) = (1 − p)/p².
HANDOUT: TUTORIAL 2
3 Continuous Random Variables
Discrete rvs take only countably many values, which is very restrictive; continuous random quantities (lifetime of a light bulb, height of a randomly picked individual, etc.) are also needed. To model continuous quantities, we need to redefine the rv.
Definition: A random variable X on the space (Ω, F, P) is a mapping X : Ω → ℝ such that {ω ∈ Ω | X(ω) ≤ x} ∈ F for all x ∈ ℝ.
This subsumes the definition of a discrete rv.
Random variables in general are studied by their distribution function F_X : ℝ → [0, 1], which is defined to be:
F_X(x) = P({ω ∈ Ω : X(ω) ≤ x}),
or more simply P(X ≤ x).
Note: For discrete random variables with a mass function p_X(x), P(X ≤ x) = Σ_{k≤x} p_X(k), so F_X looks like an increasing step function.
Properties of F_X: Intuitively F_X(x) → 1 as x → ∞, and F_X(x) → 0 as x → −∞. Also, F_X is non-decreasing, since for y > x:
F_X(y) = P(X ≤ y) = P({ω ∈ Ω : X(ω) ≤ x} ∪ {ω ∈ Ω : x < X(ω) ≤ y}) = F_X(x) + P({ω ∈ Ω : x < X(ω) ≤ y}) ≥ F_X(x).
Finally, for a bounded interval (a, b], we have:
P(a < X ≤ b) = P({X ≤ b} − {X ≤ a}) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a).
Definition: A continuous random variable X is one where F_X(x) can be written
F_X(x) = ∫_{−∞}^x f_X(s) ds,
for some function f_X(s). f_X is called the probability density function (pdf).
3.1 Properties of the pdf
Any function f_X : ℝ → ℝ satisfying:
1. f_X(x) ≥ 0 (since F_X is non-decreasing);
2. ∫_{−∞}^∞ f_X(x) dx = 1 (since F_X(x) → 1);
can be a pdf. In addition, we observe that:
1. f_X(x) = dF_X(x)/dx, wherever the derivative exists.
2. For a < b, P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a) = ∫_{−∞}^b f_X(x) dx − ∫_{−∞}^a f_X(x) dx = ∫_a^b f_X(x) dx.
3. f_X(x) is NOT the probability that X = x. For small δx,
P(x ≤ X ≤ x + δx) = ∫_x^{x+δx} f_X(u) du ≈ f_X(x) δx,
so f_X(x) δx is a probability.
4. In fact, for a continuous rv, P (X = x) = 0! This seems counter-intuitive but isn’t really. It’s a
result of measure theory ({x} has measure 0). Practically, we can never measure a continuous
quantity exactly, there is always some uncertainty, and so we always will really be talking
about X in an interval.
5. f_X does not have to be in [0, 1], continuous or even bounded (as long as it integrates to 1).
3.2 Examples of Continuous Random Variables
• Uniform. Let a < b, and define
F_X(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x ≤ b; 1 if x > b.
F_X satisfies the properties of a distribution function. The pdf is f_X(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise.
• Exponential. Let λ > 0 and define
F_X(x) = 0 if x ≤ 0; 1 − e^{−λx} if x > 0.
The pdf is f_X(x) = λe^{−λx} for x > 0.
• Normal, or Gaussian. A very important distribution. The pdf has parameters µ ∈ ℝ and σ² > 0:
f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/2σ²}, x ∈ ℝ.
F_X(x) is not available in closed form.
We’ll introduce others as we need to.
HANDOUT: GAUSSIAN, TABLES OF THE NORMAL DISTRIBUTION
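Where the handout's tables give Φ(z) = P(Z ≤ z) for the standard normal, the same values can be obtained numerically; a small sketch using scipy.stats (an addition to the notes):

from scipy import stats

print(stats.norm.cdf(1.41))                    # Phi(1.41) ~ 0.9207
# For X ~ N(mu, sigma^2), P(a < X <= b) = F_X(b) - F_X(a):
print(stats.norm.cdf(1) - stats.norm.cdf(-1))  # P(-1 < Z <= 1) ~ 0.6827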
3.3 Expectation and Variance
The expected value of a continuous rv X is:
E(X) = ∫_{−∞}^∞ x f_X(x) dx,
and the expected value of a function of X, g(X), is:
E(g(X)) = ∫_{−∞}^∞ g(x) f_X(x) dx.
The variance of X is still E((X − E(X))²), which can be calculated as E(X²) − E(X)².
Example: f_X(x) = 3x², 0 ≤ x ≤ 1, 0 otherwise. Is this a legitimate pdf? If so, what are E(X) and Var(X)?
Since f_X(x) ≥ 0 and ∫_{−∞}^∞ f_X(x) dx = x³|₀¹ = 1, it is a legitimate pdf.
E(X) = ∫_0^1 x · 3x² dx = 0.75 x⁴|₀¹ = 0.75.
E(X²) = ∫_0^1 3x⁴ dx = 0.6; Var(X) = 0.6 − 0.75² = 0.0375.
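A numerical check of this example (an added sketch, not from the notes):

from scipy.integrate import quad

f = lambda x: 3 * x**2                        # the pdf on [0, 1]
total, _ = quad(f, 0, 1)                      # should integrate to 1
mean, _ = quad(lambda x: x * f(x), 0, 1)      # E(X) = 0.75
ex2, _ = quad(lambda x: x**2 * f(x), 0, 1)    # E(X^2) = 0.6
print(total, mean, ex2 - mean**2)             # 1.0, 0.75, Var(X) = 0.0375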
Example: Uniform distribution.
E(X) = ∫_a^b x/(b − a) dx = x²/(2(b − a))|_a^b = (b² − a²)/(2(b − a)) = (a + b)(b − a)/(2(b − a)) = (a + b)/2.
E(X²) = x³/(3(b − a))|_a^b = (b³ − a³)/(3(b − a)) = (a² + ab + b²)/3,
thus
Var(X) = (a² + ab + b²)/3 − (a + b)²/4 = (b − a)²/12.
Example: Exponential distribution.
E(X) = ∫_0^∞ x λe^{−λx} dx = (by parts) = 1/λ.
It turns out that E(X²) = 2/λ², thus Var(X) = 1/λ².
Example: Gaussian: E(X) = µ and Var(X) = σ².
HANDOUT: RETURN TO PROBABILITY DISTRIBUTION EXAMPLES
3.4 Properties of Expectations
...are exactly those of expectation of discrete random variables.
3.5 Functions of a Random Variable
Given a random variable X with pdf f_X(x), can we say anything about the pdf of a function of X, say g(X)? Yes, we can, as long as we make some restrictions on the class of g:
Theorem: Let X be a continuous rv with pdf f_X(x) and let g be a strictly increasing and differentiable function from ℝ to ℝ. Then Y = g(X) has pdf:
f_Y(y) = f_X(g⁻¹(y)) (d/dy)[g⁻¹(y)].
Proof: if we look at the distribution function, then since g is strictly increasing:
P (Y ≤ y) = P (g(X) ≤ y) = P (X ≤ g −1 (y));
so FY (y) = FX (g −1 (y)). Differentiate both sides wrt y to get the result.
Note that if g is strictly decreasing then by the same proof we have
f_Y(y) = −f_X(g⁻¹(y)) (d/dy)[g⁻¹(y)].
Example: If X has density f_X(x) then Y = X³ has distribution function:
P(Y ≤ y) = P(X³ ≤ y) = P(X ≤ y^{1/3}),
and density (by the theorem) f_X(y^{1/3}) y^{−2/3}/3.
The density of g(X) exists for many cases outside that of the theorem, but there are no general
results. The general idea is to work out the distribution function first and then differentiate to get
the pdf.
Example: Let X have pdf f_X(x) and let Y = X². The distribution function of Y is then:
P(Y ≤ y) = P(X² ≤ y) = 0 if y < 0; P(−√y ≤ X ≤ √y) if y ≥ 0.
Differentiating, we see that f_Y(y) = 0 for y < 0, and for y > 0 we have:
f_Y(y) = (d/dy) P(Y ≤ y) = (d/dy) P(−√y ≤ X ≤ √y) = (d/dy)(F_X(√y) − F_X(−√y)) = (1/(2√y))(f_X(√y) + f_X(−√y)).
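As an illustration (an added sketch, not from the notes): if X ~ N(0, 1), this formula gives f_Y(y) = e^{−y/2}/√(2πy), the chi-squared density with one degree of freedom, which a simulation confirms:

import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000) ** 2        # samples of Y = X^2

emp = (y <= 1).mean()                          # empirical P(Y <= 1)
grid = np.linspace(1e-6, 1, 10_000)
dens = np.exp(-grid / 2) / np.sqrt(2 * np.pi * grid)
print(emp, np.trapz(dens, grid))               # both ~0.6827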
HANDOUT: Tutorial 3.
4 Multivariate Random Variables
Often we want to make probability statements about 2 or more random variables at the same time. We do this using the methods of multivariate probability.
4.1 Multivariate Discrete Distributions
Definition: Let X and Y be discrete random variables on (Ω, F, P). The joint probability mass function of X and Y is the function p_{X,Y} : ℝ² → [0, 1] defined by:
p_{X,Y}(x, y) = P({ω ∈ Ω | X(ω) = x and Y(ω) = y}),
often written P(X = x, Y = y).
Lemma: The individual pmfs of X and Y are
p_X(x) = Σ_{∀y} P(X = x, Y = y),  p_Y(y) = Σ_{∀x} P(X = x, Y = y);
these are also called the marginal distributions of X and Y.
Proof: by the partition theorem: P(X = x) = Σ_{∀y} P(X = x | Y = y) P(Y = y) = Σ_{∀y} P(X = x, Y = y).
Similarly, for a vector of rvs (X1 , X2, . . . , Xn ), the joint pmf is P (X1 = x1 , X2 = x2 , . . . , Xn = xn ).
Example: X and Y take values in {0, 1, 2}. The joint pmf can be written in a table as follows:

         Y = 0   Y = 1   Y = 2
X = 0    0.1     0.1     0.05
X = 1    0.15    0       0.2
X = 2    0.25    0.05    0.1
We see that P (X = 1) = P (X = 1, Y = 0) + P (X = 1, Y = 1) + P (X = 1, Y = 2) = 0.35.
4.2 Expectation
For any function g : ℝ² → ℝ, the expectation of g(X, Y) is
E(g(X, Y)) = Σ_{∀x} Σ_{∀y} g(x, y) p_{X,Y}(x, y).
Theorem: The expectation operator is linear, that is, for a, b ∈ ℝ and discrete rvs X and Y,
E(aX + bY) = aE(X) + bE(Y).
Proof:
E(aX + bY) = Σ_x Σ_y (ax + by) p_{X,Y}(x, y) = a Σ_x x Σ_y p_{X,Y}(x, y) + b Σ_y y Σ_x p_{X,Y}(x, y) = a Σ_x x p_X(x) + b Σ_y y p_Y(y) = aE(X) + bE(Y).
4.3 Independence
Definition: Random variables X and Y are independent if p_{X,Y}(x, y) = p_X(x) p_Y(y), ∀x, y.
This could also be written p_{X,Y}(x, y) = (Σ_y p_{X,Y}(x, y))(Σ_x p_{X,Y}(x, y)). In fact:
Theorem: Discrete rvs X and Y are independent iff there exist functions g, h such that p_{X,Y}(x, y) = g(x)h(y), ∀x, y.
Proof: ⇒. Choose g(x) = p_X(x) and h(y) = p_Y(y).
⇐. If such g and h exist then p_X(x) = Σ_y p_{X,Y}(x, y) = g(x) Σ_y h(y) and p_Y(y) = h(y) Σ_x g(x). We also know that 1 = Σ_x Σ_y p_{X,Y}(x, y) = Σ_x g(x) Σ_y h(y). Thus
p_{X,Y}(x, y) = g(x)h(y) = g(x)h(y) Σ_x g(x) Σ_y h(y) = p_X(x) p_Y(y).
Theorem: If X and Y are independent rvs then E(XY) = E(X)E(Y).
Proof: E(XY) = Σ_{x,y} xy p_{X,Y}(x, y) = Σ_{x,y} xy p_X(x) p_Y(y) = Σ_x x p_X(x) Σ_y y p_Y(y) = E(X)E(Y).
The converse is not true: there are dependent rvs for which E(XY) = E(X)E(Y).
Example: Let P(X = −1) = P(X = 0) = P(X = 1) = 1/3, and let Y = 0 if X = 0 and Y = 1 if X ≠ 0. Then X and Y are dependent (since P(X = 0, Y = 1) = 0 but P(X = 0)P(Y = 1) = 2/9), yet E(XY) = 0 and E(X) = 0, E(Y) = 2/3 ⇒ E(X)E(Y) = 0 also.
However, the following is true:
Theorem: Discrete rvs X and Y are independent iff E(g(X)h(Y)) = E(g(X))E(h(Y)) for all functions g, h : ℝ → ℝ for which these expectations exist.
Proof: ⇐. Choose
g(x) = 1 if x = a, 0 if x ≠ a;   h(y) = 1 if y = b, 0 if y ≠ b.
Then E(g(X)h(Y)) = p_{X,Y}(a, b) and E(g(X))E(h(Y)) = p_X(a) p_Y(b). So p_{X,Y}(a, b) = p_X(a) p_Y(b). Since we can repeat this ∀a, b, we have p_{X,Y}(a, b) = p_X(a) p_Y(b) ∀a, b, so X and Y are independent.
⇒ follows as in the previous proof (with g(X) and h(Y) in place of X and Y).
4.4 Generalising to n Random Variables
Everything we have done generalises to the case where we have a vector of rvs X = (X_1, X_2, ..., X_n). The joint pmf is given by
p_X(x) = P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n).
The marginal pmfs are given by
p_{X_i}(x_i) = Σ_{∀x_j : j≠i} P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n).
If the X_i are independent then p_X(x) = Π_i p_{X_i}(x_i) and E(X_1 X_2 ··· X_n) = E(X_1)E(X_2) ··· E(X_n).
Multivariate Distribution Functions
• For any two rvs X and Y , we define the joint distribution function to be:
FX,Y (x, y) = P (X ≤ x, Y ≤ y).
• We see that limx,y→−∞ FX,Y (x, y) = 0 and that limx,y→∞ FX,Y (x, y) = 1. Further F is nondecreasing in the sense that FX,Y (x1 , y1 ) ≤ FX,Y (x2 , y2 ) whenever x1 < x2 and y1 < y2 .
• We can also compute the marginal distribution functions of X and Y by noting that FX (x) =
P (X ≤ x) = P (X ≤ x, Y ≤ ∞) = FX,Y (x, ∞); similarly FY (y) = FX,Y (∞, y).
• If X and Y are independent iff P (X ≤ x, Y ≤ y) = P (X ≤ x)P (Y ≤ y), which is to say
FX,Y (x, y) = FX (x)FY (y).
• Finally, for n rvs X1 , X2 , . . . , Xn , the joint distribution is FX (x) = P (X1 ≤ x1 , . . . , Xn ≤ xn ).
The variables are independent if FX (x) = FX1 (x1 )FX2 (x2 ) · · · FXn (xn ).
Example: Let F_{X,Y}(x, y) = 1 − e^{−x} − e^{−y} + e^{−x−y} for x, y ≥ 0. This is a distribution function, and we have F_X(x) = F_{X,Y}(x, ∞) = 1 − e^{−x} and F_Y(y) = 1 − e^{−y} (exponential rvs). Further, we see that F_{X,Y}(x, y) = F_X(x) F_Y(y), so X and Y are independent.
Example: Let F_{X,Y}(x, y) = 1 − 0.5e^{−x} − 0.5e^{−y} for x, y ≥ 0. Then F_X(x) = 1 − 0.5e^{−x} and F_Y(y) = 1 − 0.5e^{−y}, but now F_{X,Y}(x, y) ≠ F_X(x) F_Y(y) and so these two rvs are not independent.
4.6 Continuous Random Variables
X and Y are called (jointly) continuous if F_{X,Y}(x, y) can be written
F_{X,Y}(x, y) = ∫_{−∞}^x ∫_{−∞}^y f_{X,Y}(u, v) dv du,
for a function f_{X,Y} : ℝ² → [0, ∞) called the joint probability density function.
f has many of the properties of the pdf. We have that f_{X,Y}(x, y) ≥ 0 and
∫_{−∞}^∞ ∫_{−∞}^∞ f_{X,Y}(x, y) dx dy = 1.
Any function f with these properties is a pdf. Also, P(x < X < x + δx, y < Y < y + δy) ≈ f_{X,Y}(x, y) δx δy. Also, for any A ⊂ ℝ², we have
P((X, Y) ∈ A) = ∫∫_A f_{X,Y}(x, y) dx dy.
Everything generalises to (X_1, X_2, ..., X_n). For continuous rvs there is a pdf f(x_1, ..., x_n) such that:
P((X_1, X_2, ..., X_n) ∈ A) = ∫_A f(x_1, ..., x_n) dx,
for A ⊂ ℝⁿ, and further
f(x_1, ..., x_n) = ∂ⁿ/(∂x_1 ··· ∂x_n) P(X_1 ≤ x_1, ..., X_n ≤ x_n).
4.7 Marginal Density Functions and Independence
This follows closely the ideas of joint discrete rvs. The marginal pdfs of X and Y are:
f_X(x) = (d/dx) P(X ≤ x) = (d/dx) ∫_{−∞}^x ∫_{−∞}^∞ f_{X,Y}(u, v) dv du = ∫_{−∞}^∞ f_{X,Y}(x, y) dy,
and f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x, y) dx.
If X and Y are independent then F_{X,Y}(x, y) = F_X(x) F_Y(y), which implies f_{X,Y}(x, y) = f_X(x) f_Y(y).
Example: X and Y have the joint pdf:
f(x, y) = cx if 0 < y < x < 1, and 0 otherwise.
What is c? What are the marginal densities of X and Y? Are X and Y independent? What is P(X > 2Y)?
Answer: Since f integrates to 1, we have
1 = ∫_0^1 ∫_0^x cx dy dx = ∫_0^1 cx² dx = c/3,
so c = 3. The marginal of X is:
f_X(x) = ∫_0^x 3x dy = 3x², 0 < x < 1,
and of Y:
f_Y(y) = ∫_y^1 3x dx = 3(1 − y²)/2, 0 < y < 1.
Since f_{X,Y}(x, y) ≠ f_X(x) f_Y(y), X and Y are dependent. Finally:
P(X > 2Y) = ∫_0^{1/2} ∫_{2y}^1 3x dx dy = ∫_0^{1/2} (3/2 − 6y²) dy = 3/4 − 1/4 = 1/2.
4.8 Conditional Density Functions
If X and Y have joint pdf f_{X,Y}(x, y) then we can talk about the pdf of X conditional on Y = y. The pdf that describes this is the conditional density function, defined as:
f_{X|Y}(x | y) = f_{X,Y}(x, y)/f_Y(y);
similarly f_{Y|X}(y | x) = f_{X,Y}(x, y)/f_X(x).
Example: Let f_{X,Y}(x, y) = e^{−y}, for 0 < x < y < ∞, and 0 otherwise. What are f_{X|Y}(x | y) and f_{Y|X}(y | x)?
f_X(x) = ∫_x^∞ e^{−y} dy = e^{−x} and f_Y(y) = ∫_0^y e^{−y} dx = y e^{−y}. Thus f_{Y|X}(y | x) = e^{x−y}, y > x, and f_{X|Y}(x | y) = 1/y, 0 < x < y.
Conditional expectation works in the intuitive way:
E(X | Y = y) = ∫_{∀x} x f_{X|Y}(x | y) dx.
The equivalent of the partition theorem for continuous expectation is:
Theorem: For jointly continuous rvs X and Y, E(Y) = ∫ E(Y | X = x) f_X(x) dx.
4.9 Expectations of Continuous Random Variables
The properties of expectations of discrete random quantities also hold for continuous ones. For X and Y jointly continuous and a function g : ℝ² → ℝ, E(g(X, Y)) is defined as
E(g(X, Y)) = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f_{X,Y}(x, y) dx dy.
Expectation is linear (E(aX + bY ) = aE(X) + bE(Y )). If X and Y are independent then E(XY ) =
E(X)E(Y ). X and Y are independent iff E(g(X)h(Y )) = E(g(X))E(h(Y )) for all functions g and
h for which these expectations exist.
The proofs of these results are very similar to those for the discrete case, but with the pdf replacing
the pmf and integrals replacing sums.
Example: Let (X, Y) be the co-ordinates of a point uniformly distributed on the unit circle, so that
f_{X,Y}(x, y) = 1/π for x² + y² ≤ 1, and 0 otherwise.
What is the expected distance of the point from the origin? This distance is √(X² + Y²). Its expected value is:
E(√(X² + Y²)) = ∫_{−∞}^∞ ∫_{−∞}^∞ √(x² + y²) f_{X,Y}(x, y) dx dy = ∫_{−1}^1 ∫_{−√(1−y²)}^{√(1−y²)} (1/π) √(x² + y²) dx dy
= (letting x = r cos θ, y = r sin θ) = (1/π) ∫_0^{2π} ∫_0^1 r · r dr dθ = (1/π) · 2π · (1/3) = 2/3.
4.10 Variance and Covariance
Remember that Var(X) = E((X − E(X))²) is supposed to measure the amount of dispersion in X. Recall that we can also write Var(X) = E(X²) − E(X)². Variance has the following properties:
Lemma: Var(X) = 0 iff P(X = E(X)) = 1.
Proof: Let µ = E(X). ⇐: if P(X = µ) = 1 then E(X²) = µ², hence Var(X) = E(X²) − E(X)² = µ² − µ² = 0.
⇒: If Var(X) = 0 then E(X²) = µ². Define Y = X − µ. Then E(Y²) = E((X − µ)²) = 0. Since E(Y²) = Σ_{∀y} y² P(Y = y), the only way this can happen is if P(Y = 0) = 1, hence P(X = µ) = 1.
Lemma: Var(aX + b) = a² Var(X), for a, b ∈ ℝ.
Proof:
Var(aX + b) = E(((aX + b) − E(aX + b))²) = E((aX + b − aE(X) − b)²) = E(a²(X − E(X))²) = a² E((X − E(X))²) = a² Var(X).
Often, statisticians work with the standard deviation, which is √Var(X). This has the nice property that SD(aX) = |a| SD(X).
Definition: The covariance of X and Y is Cov(X, Y) = E((X − E(X))(Y − E(Y))).
We can expand the definition of covariance to get Cov(X, Y) = E(XY) − E(X)E(Y).
The covariance contains information about the joint behaviour of X and Y. If Cov(X, Y) > 0 then it is more likely that X − E(X) and Y − E(Y) have the same sign.
Lemma: If X and Y are independent then Cov(X, Y) = 0.
Proof: Cov(X, Y) = E(XY) − E(X)E(Y), so it is sufficient to show that E(XY) = E(X)E(Y) when X and Y are independent. This is easy (it was shown earlier). The converse is not true in general; see:

          Y = −1   Y = 1
X = −1    1/6      1/6
X = 0     1/3      0
X = 1     1/6      1/6

You can check that X and Y are dependent, e.g. p_{X,Y}(0, 1) = 0 ≠ 1/3 × 1/3 = p_X(0) p_Y(1), but E(XY) = E(X) = E(Y) = 0 so Cov(X, Y) = 0.
Often we normalise the covariance:
Definition: The correlation between X and Y is
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
This is a sensible normalisation because:
Theorem: −1 ≤ ρ(X, Y) ≤ 1.
Proof: a direct result of the Cauchy-Schwarz inequality (below), setting U = X − E(X) and V = Y − E(Y).
Theorem: (Cauchy-Schwarz inequality) If U and V are random variables then E(UV)² ≤ E(U²)E(V²).
Proof: Let W = sU + V for s ∈ ℝ. Since W² ≥ 0, we have that 0 ≤ E(W²) = E((sU + V)²) = E(U²)s² + 2E(UV)s + E(V²). Clearly E(U²) ≥ 0. If E(U²) = 0 then P(U = 0) = 1 and the result holds. If E(U²) > 0 then the inequality implies that, as a quadratic in s, E(U²)s² + 2E(UV)s + E(V²) touches the s-axis at most once. Thus the discriminant of the quadratic satisfies (2E(UV))² − 4E(U²)E(V²) ≤ 0 and the result follows.
ρ has the following property:
Theorem: ρ(X, Y) = 1 iff Y = a + bX for b > 0, a ∈ ℝ; ρ(X, Y) = −1 iff Y = a + bX for b < 0, a ∈ ℝ.
Proof: ⇐ is easy to see, since Var(Y) = b² Var(X), so √(Var(X) Var(Y)) = |b| Var(X), and
Cov(X, Y) = E(XY) − E(X)E(Y) = E(X(a + bX)) − E(X)E(a + bX) = aE(X) + bE(X²) − aE(X) − bE(X)² = b(E(X²) − E(X)²) = b Var(X),
thus ρ(X, Y) = b/|b|, which is 1 for b > 0 and −1 for b < 0.
⇒: Let ρ(X, Y) = 1. Let a = Var(X), b = 2Cov(X, Y) > 0 (since ρ > 0) and c = Var(Y). Then
b² − 4ac = 4Var(X)Var(Y)[ρ(X, Y)² − 1] = 0.
Thus the quadratic as² + bs + c = 0 has two equal roots, say α = −b/2a, which must be negative since a, b > 0.
Now let W = α(X − E(X)) + (Y − E(Y)). Then
E(W²) = aα² + bα + c = 0,
which can only happen if P(W = 0) = 1; thus P(Y = −αX + β) = 1 for β = αE(X) + E(Y), and since α < 0 the relationship is increasing linear.
We can repeat this for ρ = −1, getting α > 0 and thus a decreasing linear relationship.
4.11 Sums of Random Variables
Theorem: Let X and Y be discrete random variables with joint probability mass function p_{X,Y}(x, y). Then the pmf of Z = X + Y is given by:
p_Z(z) = Σ_{x=−∞}^∞ p_{X,Y}(x, z − x).
Proof:
p_Z(z) = P(Z = z) = P(X + Y = z) = P((X = z and Y = 0) or (X = z − 1 and Y = 1) or (X = z + 1 and Y = −1) or ···)
       = Σ_{x=−∞}^∞ P(X = x and Y = z − x) = Σ_{∀x} p_{X,Y}(x, z − x), as required.
Note that if X, Y ≥ 0 then this reduces to:
p_Z(z) = Σ_{x=0}^z p_{X,Y}(x, z − x).
Example: For the joint pmf at the start of the Section, what is the pmf of Z = X + Y?

         Y = 0   Y = 1   Y = 2
X = 0    0.1     0.1     0.05
X = 1    0.15    0       0.2
X = 2    0.25    0.05    0.1
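A direct computation of p_Z from this table, as a Python sketch (an addition to the notes):

# Joint pmf from the table: p[(x, y)] = P(X = x, Y = y).
p = {(0, 0): 0.10, (0, 1): 0.10, (0, 2): 0.05,
     (1, 0): 0.15, (1, 1): 0.00, (1, 2): 0.20,
     (2, 0): 0.25, (2, 1): 0.05, (2, 2): 0.10}

pZ = {}
for (x, y), prob in p.items():
    pZ[x + y] = pZ.get(x + y, 0.0) + prob    # p_Z(z) = sum_x p_{X,Y}(x, z - x)
print(dict(sorted(pZ.items())))
# {0: 0.1, 1: 0.25, 2: 0.3, 3: 0.25, 4: 0.1}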
Example: Let X and Y be independent Poisson random variables with mean λx and λy respectively.
What is the distribution of Z = X + Y ?
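A worked answer for the Poisson case: since X, Y ≥ 0, the convolution sum gives
p_Z(z) = Σ_{x=0}^z p_X(x) p_Y(z − x) = Σ_{x=0}^z (λ_x^x e^{−λ_x}/x!)(λ_y^{z−x} e^{−λ_y}/(z − x)!) = (e^{−(λ_x+λ_y)}/z!) Σ_{x=0}^z C(z, x) λ_x^x λ_y^{z−x} = ((λ_x + λ_y)^z/z!) e^{−(λ_x+λ_y)},
by the binomial theorem; so Z is Poisson(λ_x + λ_y).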
The equivalent result for continuous random variables replaces (as usual) the sum by an integral:
Theorem: Let X and Y be continuous random variables with joint probability density function f_{X,Y}(x, y). Then the pdf of Z = X + Y is given by:
f_Z(z) = ∫_{−∞}^∞ f_{X,Y}(x, z − x) dx.
If X, Y ≥ 0 then
f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx.
Example: What is the pdf of Z = X + Y for the example at the start of the Section?
Example: Let X and Y be independent exponential random variables with the same parameter λ. What is the distribution of Z = X + Y?
Can we say anything about the mean and variance of the sum of random variables?
Theorem: Let X and Y be random variables. Then
E(aX + bY) = aE(X) + bE(Y)
and
Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).
Proof:
Var(aX + bY) = E(((aX + bY) − E(aX + bY))²) = E(((aX − E(aX)) + (bY − E(bY)))²)
= E(a²(X − E(X))² + 2ab(X − E(X))(Y − E(Y)) + b²(Y − E(Y))²) = a² Var(X) + 2ab E((X − E(X))(Y − E(Y))) + b² Var(Y).
Note that in vector notation we can write
Var((a, b)(X, Y)ᵀ) = (a, b) ( Var(X)      Cov(X, Y) ) (a, b)ᵀ.
                            ( Cov(X, Y)   Var(Y)    )
This is useful later!!
Lemma: If X and Y are independent then Cov(X, Y ) = 0 and so Var(X + Y ) =Var(X)+Var(Y ).
4.12 The Multivariate Gaussian Distribution
This is the most important distribution in statistics; it forms the basis of many statistical methods
where data are vectors rather than single numbers. Here we will concentrate on its definition,
properties and examples. But you will meet it in many places if you study statistical methods
further in sophister/postgraduate courses.
Try http://www.youtube.com/watch?v=TC0ZAX3DA88 for a 15 minute video on it!!
Definition: The multivariate normal (also called the multivariate Gaussian) distribution is a continuous probability distribution defined for a set of random variables X = (X_1, ..., X_k). Its density has the form:
f_X(x) = (1/((2π)^{k/2} |Σ|^{1/2})) exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) ),  x ∈ ℝᵏ,   (1)
where µ = (µ_1, ..., µ_k)ᵀ is called the mean vector, Σ is a k × k matrix called the variance-covariance matrix, and |Σ| is the determinant of Σ. Σ must be a positive definite matrix for this density to be well defined, i.e. vᵀΣv > 0 for all v ∈ ℝᵏ − {0}. The entries in this matrix are the variances (on the main diagonal) and covariances (off the main diagonal), i.e.
Σ_ii = Var(X_i) = σ_i²,
Σ_ij = Cov(X_i, X_j) = ρ_ij σ_i σ_j,
where ρ_ij is the correlation between X_i and X_j.
Handout: bivariate normal examples
There are many properties of this distribution; here we simply state them.
1. All marginal distributions of X are also Gaussian, with the means and variances given in µ and Σ. So X_i is normal with mean µ_i and variance Σ_ii.
2. Further, any joint marginal of a subset of {X_1, ..., X_k} is also multivariate Gaussian. Its mean vector is the vector of the means of the components of the subset, and its variance-covariance matrix is just the rows and columns of Σ corresponding to those components. To put this in terms of an equation, partition X into two sub-vectors X(1) = (X_1, ..., X_i) and X(2) = (X_{i+1}, ..., X_k). Similarly partition the mean vector into µ(1) = (µ_1, ..., µ_i) and µ(2) = (µ_{i+1}, ..., µ_k), and the variance-covariance matrix:
Σ = ( Σ11  Σ12 )
    ( Σ21  Σ22 ),
where
Σ11 = ( σ_1²           ρ_12 σ_1 σ_2   ···   ρ_1i σ_1 σ_i )
      ( ρ_21 σ_2 σ_1   σ_2²           ···   ρ_2i σ_2 σ_i )
      ( ···            ···            ···   ···          )
      ( ρ_i1 σ_i σ_1   ρ_i2 σ_i σ_2   ···   σ_i²         )
is the top left i × i sub-matrix of Σ, etc. Then X(1) is multivariate normal with mean µ(1) and variance-covariance matrix Σ11, and X(2) is multivariate normal with mean µ(2) and variance-covariance matrix Σ22.
3. Conditionals. All conditional distributions of any subset of {X_1, ..., X_k}, conditional on any disjoint other subset, are also Gaussian. Defining X(1) = (X_1, ..., X_i) and X(2) = (X_{i+1}, ..., X_k) as before, the conditional distribution of X(1) given X(2) = x(2) is Gaussian with mean
µ(1) + Σ12 Σ22⁻¹ (x(2) − µ(2))
and variance-covariance matrix
Σ11 − Σ12 Σ22⁻¹ Σ21.
4. Linear combinations. Let Y = b_1 X_1 + b_2 X_2 + ··· + b_k X_k. Letting b = (b_1, ..., b_k)ᵀ, we can write Y = bᵀX. Then Y is also Gaussian; its mean is bᵀµ = b_1 µ_1 + ··· + b_k µ_k and its variance is
Σ_{i=1}^k Σ_{j=1}^k b_i b_j Σ_ij = bᵀΣb.
Note that the variance can be rewritten
Σ_{i=1}^k b_i² Var(X_i) + Σ_{i=1}^k Σ_{j≠i} b_i b_j Cov(X_i, X_j) = Σ_{i=1}^k b_i² Var(X_i) + 2 Σ_{i=1}^k Σ_{j=1}^{i−1} b_i b_j Cov(X_i, X_j).
The mean and variance of bᵀX are bᵀµ and bᵀΣb for any vector of random variables X; the special property when X is Gaussian is that the distribution of bᵀX is also Gaussian.
5. Linear transformations. More generally, let B be an m × k matrix and let Y = BX ∈ ℝᵐ. Then Y is Gaussian with mean Bµ and variance-covariance matrix BΣBᵀ.
Example: let X = (X_1, X_2, X_3, X_4)ᵀ be multivariate normally distributed with mean (1, 0, 3, 2)ᵀ and variance-covariance matrix
(  4     2.4    1     −1  )
(  2.4   9     −0.3    2.4 )
(  1    −0.3    1      0   )
( −1     2.4    0      1   ).
• What is E(X_2)? What is Var(X_2)? What is Cov(X_1, X_3)? What is Corr(X_2, X_3)?
• What is the distribution of X_4? What is the distribution of (X_1, X_2)ᵀ? What is the distribution of (X_2, X_4)ᵀ?
• What is the distribution of X_1 | X_2 = x_2? What is the distribution of (X_1, X_2)ᵀ | (X_3, X_4)ᵀ? What is the distribution of (X_1, X_4)ᵀ | X_2?
• What is the distribution of Y = X_1 + 3X_2 − X_3 + 2X_4? What is the distribution of
Y = (Y_1, Y_2)ᵀ = (X_2 − X_4, 3X_1 + X_2 + 5X_3 − 2X_4)ᵀ?
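The linear-transformation questions in particular are easy to check numerically; a numpy sketch (an addition, not from the notes):

import numpy as np

mu = np.array([1, 0, 3, 2])
Sigma = np.array([[ 4.0,  2.4,  1.0, -1.0],
                  [ 2.4,  9.0, -0.3,  2.4],
                  [ 1.0, -0.3,  1.0,  0.0],
                  [-1.0,  2.4,  0.0,  1.0]])

b = np.array([1, 3, -1, 2])                  # Y = X1 + 3X2 - X3 + 2X4
print(b @ mu, b @ Sigma @ b)                 # mean b'mu, variance b'Sigma b

B = np.array([[0, 1, 0, -1],                 # Y1 = X2 - X4
              [3, 1, 5, -2]])                # Y2 = 3X1 + X2 + 5X3 - 2X4
print(B @ mu)                                # mean vector B mu
print(B @ Sigma @ B.T)                       # covariance matrix B Sigma B'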
HANDOUT: Tutorial 4
5 Moment and Characteristic Functions
5.1 Moment Generating Functions
Definition: For a random variable X, the moment generating function (mgf) of X is denoted M_X(t) and defined as:
M_X(t) = E(e^{tX}) = Σ_x e^{tx} p_X(x) if X is discrete, or ∫_{−∞}^∞ e^{tx} f_X(x) dx if X is continuous,
wherever this sum or integral converges absolutely.
Example: Normal distribution. For N(0, 1), we have:
M_X(t) = ∫_{−∞}^∞ e^{tx} (1/√(2π)) e^{−x²/2} dx = e^{t²/2} ∫_{−∞}^∞ (1/√(2π)) e^{−(x−t)²/2} dx = e^{t²/2}.
This is valid ∀t.
Example: Exponential distribution:
M_X(t) = ∫_0^∞ e^{tx} λe^{−λx} dx = λ ∫_0^∞ e^{(t−λ)x} dx = λ/(λ − t) if t < λ, and ∞ if t ≥ λ,
so it only exists for t < λ.
Theorem: If M_X(t) exists in an open neighbourhood containing 0 then, for k = 1, 2, ...,
E(X^k) = (d^k/dt^k) M_X(t) |_{t=0}.
Sketch Proof: this follows the similar proof for the pgf:
(d^k/dt^k) M_X(t) = (d^k/dt^k) E(e^{tX}) = (d^k/dt^k) ∫_{−∞}^∞ e^{tx} f_X(x) dx = ∫_{−∞}^∞ (d^k/dt^k) e^{tx} f_X(x) dx = ∫_{−∞}^∞ x^k e^{tx} f_X(x) dx = E(X^k e^{tX});
let t = 0 to get the answer.
Related to the above theorem is:
Theorem: If M_X(t) < ∞ in an open neighbourhood of 0 then there is a unique distribution with mgf M_X(t). Further, under this condition, E(X^k) < ∞ and
M_X(t) = Σ_{k=0}^∞ (t^k/k!) E(X^k).
Proof: we do not prove this, merely commenting that it is just the Laplace inversion theorem. The sum is just the series expansion of M_X about 0; we see from the previous theorem that the coefficients in this expansion are indeed the moments.
Theorem: If X has mgf M_X(t) then Y = aX + b has mgf M_Y(t) = e^{tb} M_X(at).
Proof: M_Y(t) = E(e^{tY}) = E(e^{t(aX+b)}) = E(e^{tb} e^{atX}) = e^{tb} E(e^{(at)X}) = e^{tb} M_X(at).
Example: We know that if X ~ N(0, 1) then M_X(t) = e^{t²/2}. What about, more generally, Y ~ N(µ, σ²)? Since Y = µ + σX, we can say that M_Y(t) = e^{µt} M_X(σt) = e^{µt} e^{σ²t²/2} = e^{µt+0.5σ²t²}.
Theorem: If X and Y are independent rvs then X + Y has mgf
M_{X+Y}(t) = M_X(t) M_Y(t).
Proof: M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = M_X(t) M_Y(t).
5.2 Characteristic Functions
The problem that the mgf is not necessarily defined for all t can be solved if we look at E(e^{itX}), for i = √−1. This always exists, for all t.
Definition: The characteristic function of a random variable X is the function φ_X(t) defined by:
φ_X(t) = E(e^{itX}) = Σ_{∀x} e^{itx} p_X(x) if X is discrete, or ∫_{−∞}^∞ e^{itx} f_X(x) dx if X is continuous.
Note: Since e^{itx} = cos(tx) + i sin(tx), we can also write φ_X(t) = E(cos(tX)) + iE(sin(tX)). Since e^{itX} lies on the unit circle in ℂ, we have |φ_X(t)| ≤ 1. We can also write φ_X(t) = M_X(it) if the mgf exists in an open neighbourhood of the origin.
Example: X ~ N(µ, σ²). Since M_X(t) = e^{µt+0.5σ²t²}, which exists ∀t, we can write φ_X(t) = M_X(it) = e^{iµt−0.5σ²t²}.
Example: X is exponential(λ). Then
φ_X(t) = ∫_0^∞ e^{itx} λe^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it)
(this may be found by splitting the integral into real and imaginary parts, or using residues).
Characteristic functions have the following properties of the mgf (proofs similar to the mgf case):
1. If X has cf φ_X(t) then Y = aX + b has cf φ_Y(t) = e^{itb} φ_X(at).
2. If X and Y are independent then φ_{X+Y}(t) = φ_X(t) φ_Y(t).
3. As long as the expectations exist,
i^k E(X^k) = (d^k/dt^k) φ_X(t) |_{t=0}.
Observant ones among you will have noticed that the characteristic function looks like a Fourier
transform of fX (x). Therefore we have:
Theorem: (Uniqueness) Let X and Y have cfs φX and φY respectively. Then X and Y have the
same distributions iff φ_X(t) = φ_Y(t) for all t ∈ ℝ.
Theorem: (Inversion) Let X be continuous with pdf f_X(x) and cf φ_X(t). Then
f_X(x) = (1/2π) ∫_{−∞}^∞ e^{−itx} φ_X(t) dt
at every point x at which f is differentiable.
6 Two Probability Theorems
6.1 The Law of Averages
We motivated expectation by looking at the running average X̄n = (X1 + · · · + Xn )/n of iid rvs. As
n → ∞, it was stated that X̄n → E(X). The law of averages makes this statement explicit.
Definition: A sequence of random variables Y1 , Y2 , . . . converges in mean square to the rv Y if
E((Yn − Y )2 ) → 0 as n → ∞.
Example: Let Y_n be a Bernoulli rv with success probability 1/n. Let Y be the constant rv 0. Then
E((Y_n − Y)²) = E(Y_n²) = 0²(1 − 1/n) + 1²(1/n) = 1/n → 0,
so Y_n → Y in mean square.
Example: Let Y_n take the value n with probability 1/n and 0 with probability 1 − 1/n, and Y as before. Then
E((Y_n − Y)²) = E(Y_n²) = 0²(1 − 1/n) + n² × 1/n = n → ∞,
so Y_n does not tend to Y in mean square (even though P(Y_n ≠ 0) → 0).
Theorem: (Law of Large Numbers) Let X_1, X_2, ... be a sequence of independent random variables each having mean µ and variance σ². Then
(1/n)(X_1 + ··· + X_n) → µ
in mean square.
Proof: Let S_n = Σ_{i=1}^n X_i; we are interested in what happens to S_n/n. We have
E(S_n/n) = (1/n)E(S_n) = (1/n)E(X_1 + ··· + X_n) = (1/n)(E(X_1) + E(X_2) + ··· + E(X_n)) = µ,
and so
E((S_n/n − µ)²) = Var(S_n/n) = (1/n²)Var(X_1 + X_2 + ··· + X_n) = (1/n²)(Var(X_1) + ··· + Var(X_n)) = (1/n²)nσ² = σ²/n → 0,
thus S_n/n converges in mean square to µ, as required.
6.2 The Central Limit Theorem
We know that E(S_n) = nµ, and that in the limit S_n/n → µ in mean square. The next question is how much variation there is about this mean for S_n as n → ∞. The central limit theorem tells us this for iid rvs.
To start, we define
Z_n = (S_n − E(S_n))/√Var(S_n)
to be the standardised version of S_n. Note that Z_n = a_n S_n + b_n for constants a_n = 1/√Var(S_n) and b_n = −a_n E(S_n), chosen so that E(Z_n) = 0 and Var(Z_n) = 1.
Next, since E(S_n) = nµ and Var(S_n) = nσ² (see the proof of the LLN), we can write
Z_n = (S_n − nµ)/(σ√n).
Theorem: (Central Limit Theorem) Let X_1, X_2, ... be iid rvs having mean µ and variance σ². The standardised sum
Z_n = (S_n − nµ)/(σ√n),
where S_n = X_1 + ··· + X_n, satisfies, as n → ∞,
P(Z_n ≤ z) → ∫_{−∞}^z (1/√(2π)) e^{−u²/2} du
for z ∈ ℝ. In other words, the distribution of Z_n converges point-wise (for each z ∈ ℝ) to the standard normal distribution.
Note: This is a remarkable result! Not only does the distribution of Zn converge to something, it
ALWAYS converges to the standard normal, NO MATTER WHAT the distribution of the Xi . The
only requirement is that Xi have a mean and variance.
Note: Since Z_n is approximately N(0, 1), by the properties of the normal distribution we have that S_n = nµ + σ√n Z_n is approximately N(nµ, nσ²).
Proof: Let U_i = X_i − µ, so that E(U_i) = 0 and Var(U_i) = σ². The mgf of the U_i is M_U(t) = e^{−µt} M_X(t).
Now Z_n = Σ_{i=1}^n U_i/(σ√n), so the mgf of Z_n is
M_{Z_n}(t) = E(e^{tZ_n}) = E(exp((t/(σ√n)) Σ_{i=1}^n U_i)) = Π_{i=1}^n E(exp((t/(σ√n)) U_i)) = M_U(t/(σ√n))^n,
by the properties of the mgf (independence of the U_i).
Now recall the theorem that gives the mgf as a power series expansion about 0:
M_U(x) = 1 + xE(U_1) + 0.5x²E(U_1²) + o(x²) = 1 + 0.5σ²x² + o(x²),
since E(U_1) = 0 and E(U_1²) = E(U_1²) − E(U_1)² = Var(U_1) = σ².
Thus
M_{Z_n}(t) = M_U(t/(σ√n))^n = [1 + 0.5σ²(t/(σ√n))² + o((t/(σ√n))²)]^n = [1 + 0.5t²/n + o(1/n)]^n.
As n → ∞, this → exp(t²/2) (recall that lim_{n→∞}(1 + x/n)^n = e^x). This is the mgf of the standard normal; by the uniqueness theorem for mgfs, the limiting distribution must be the standard normal.
Example: A fair coin is thrown 20000 times. Let S be the number of heads. Find an approximation to P(9900 < S < 10010).
S is in fact binomial(20000, 0.5). But we can write S = Σ_{i=1}^{20000} X_i, where X_i is Bernoulli(0.5), taking the value 1 for a H and 0 for a T. The mean and variance of X_i exist: µ = E(X_i) = 0.5 and σ² = Var(X_i) = 0.25. We can apply the CLT to say that for large n, S is approximately N(nµ, nσ²) = N(10000, 5000). Thus
P(9900 < S < 10010) = P((9900 − 10000)/√5000 < (S − 10000)/√5000 < (10010 − 10000)/√5000)
= P(−1.41 < Z < 0.141) = P(Z < 0.141) − P(Z < −1.41) = 0.5557 − (1 − P(Z < 1.41)) = 0.5557 − 1 + 0.9207 = 0.4764.
Note that the actual probability using the true binomial distribution is 0.4793.
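A quick numerical comparison of the exact binomial probability with the normal approximation (an added sketch, not from the notes):

from scipy import stats

n, p = 20000, 0.5
mu, sd = n * p, (n * p * (1 - p)) ** 0.5     # 10000 and sqrt(5000)

# Exact: P(9900 < S < 10010) = P(9901 <= S <= 10009) for the discrete binomial.
print(stats.binom.cdf(10009, n, p) - stats.binom.cdf(9900, n, p))

# CLT approximation, N(10000, 5000), as in the text.
print(stats.norm.cdf(10010, mu, sd) - stats.norm.cdf(9900, mu, sd))
# Compare with the values quoted above; small differences reflect table rounding.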
HANDOUT: Tutorial 5
HANDOUT: Sample Exam Questions