Lectures for STP 421: Probability Theory

Jay Taylor

August 30, 2016

Contents

1 Overview and Conceptual Foundations of Probability
  1.1 Determinism, Uncertainty, and Probability
  1.2 Probability Spaces and Measure Theory
  1.3 Properties of Events
  1.4 Properties of Probabilities
2 Conditional Probabilities
  2.1 Motivation
  2.2 Independence
  2.3 The Law of Total Probability
  2.4 Bayes' Formula
    2.4.1 Bayesian Statistics
    2.4.2 Mendelian Genetics
  2.5 Conditional Probabilities are Probabilities (Optional)
  2.6 The Gambler's Ruin Problem (Optional)
3 Random Variables and Probability Distributions
  3.1 Random Variables
    3.1.1 Discrete random variables
    3.1.2 Continuous random variables
    3.1.3 Properties of the Cumulative Distribution Function
  3.2 Expectations and Moments
    3.2.1 Moment generating functions
4 A Menagerie of Distributions
  4.1 The Binomial Distribution
    4.1.1 Sampling and the hypergeometric distribution
  4.2 The Poisson Distribution
    4.2.1 Fluctuation tests and the origin of adaptive mutations
    4.2.2 Poisson processes
  4.3 Random Waiting Times
    4.3.1 The geometric and exponential distributions
    4.3.2 The gamma distribution
  4.4 Continuous Distributions on Finite Intervals
    4.4.1 The uniform distribution
    4.4.2 The beta distribution
    4.4.3 Bayesian estimation of proportions
    4.4.4 Binary to analogue: construction of the uniform distribution
  4.5 The Normal Distribution
  4.6 Transformations of Random Variables
5 Random Vectors and Multivariate Distributions
  5.1 Joint and Marginal Distributions
  5.2 Independent Random Variables
  5.3 Conditional Distributions
    5.3.1 Discrete Conditional Distributions
    5.3.2 Continuous Conditional Distributions
    5.3.3 Conditional Expectations
  5.4 Expectations of Sums and Products of Random Variables
    5.4.1 Expectations of products
  5.5 Covariance and Correlation
  5.6 The Law of Large Numbers and the Central Limit Theorem
    5.6.1 The Weak and Strong Laws of Large Numbers
    5.6.2 The Central Limit Theorem
  5.7 Multivariate Normal Distributions
  5.8 Sample Statistics
    5.8.1 The χ2 Distribution
    5.8.2 Student's t Distribution

Chapter 1

Overview and Conceptual Foundations of Probability

1.1 Determinism, Uncertainty, and Probability

Scientific determinism postulates that we can predict how a system will develop over time if we know (i) the laws governing the system and (ii) the initial state of the system. In particular, in a deterministic world, we should obtain the same outcome whenever we repeat an experiment under identical conditions. Deterministic theories were particularly influential in the eighteenth and nineteenth centuries, in large part because of the remarkable successes of Newtonian mechanics, and they remain important today, as can be seen in the continuing use of ordinary and partial differential equations to model numerous physical phenomena. For example, Newton's second law - "force equals mass times acceleration" - can be expressed as a system of ODEs,

    m d^2x/dt^2 = F(x),    x(0) = x0,    x'(0) = v0,

which can be solved exactly (in principle) if we know the forces F acting on the system and the initial state (x0, v0) of the system.
In fact, Newtonian mechanics once seemed so effective at describing nature that the mathematician Pierre-Simon Laplace (1814) was led to theorize that the entire future and past of the universe could be determined exactly by an entity (Laplace's demon) that knew the current positions and momenta of all of the particles in the universe. However, despite these successes, scientific applications of deterministic models are encumbered by multiple sources of uncertainty, some of which are listed below.

• Imprecise knowledge of the state of the system due to measurement error.
• Imprecise knowledge of model parameters.
• Model uncertainty and approximation. In the words of George Box, "All models are wrong, but some are useful." For example, molecular and cosmological processes are usually neglected when we formulate models of circulation or airplane flight.
• Numerical solutions are usually approximate due to round-off and truncation errors.
• Certain kinds of deterministic models (chaotic dynamical systems) are known to be exquisitely sensitive to their initial conditions, in the sense that trajectories that are initially arbitrarily close diverge exponentially.
• Other models are ill-posed in the sense that they either don't have strong solutions or they have more than one solution consistent with initial and/or boundary data. For example, it is possible that the Navier-Stokes equation, which is used to model fluid flow, lacks strong solutions for some physically realistic parametrizations.
• According to some interpretations, quantum mechanics provides an intrinsically stochastic description of nature, i.e., Schrödinger's equation only allows us to calculate the probability that a system occupies a given state.

For these reasons, scientists routinely must deal with uncertainty in one form or another. While different theoretical tools have been developed to quantify uncertainty, most approaches make use of probability theory.
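The sensitivity of chaotic systems to initial conditions mentioned above can be demonstrated numerically. The sketch below uses the logistic map x_{n+1} = r x_n (1 − x_n), a standard example of a chaotic system that is not discussed in these notes; the parameter value r = 3.99 and the starting points are illustrative choices.

```python
# Sensitivity to initial conditions: two logistic-map trajectories that
# start only 1e-10 apart eventually become completely different.
# (Illustrative example; the logistic map is not part of these notes.)

def logistic_trajectory(x0, r=3.99, steps=60):
    """Iterate x -> r*x*(1-x) starting from x0, returning all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.3)
b = logistic_trajectory(0.3 + 1e-10)

# The separation between the two trajectories grows roughly
# exponentially before saturating at order 1.
for n in (0, 10, 20, 30, 40, 50):
    print(n, abs(a[n] - b[n]))
```

Although the recursion is fully deterministic, after a few dozen steps the two trajectories are effectively unrelated, so any measurement error in the initial state destroys long-range predictability.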
As a mathematical subject, probability has its origins in 17th-century studies of games of chance and gambling. Since then, several different interpretations of probability have been proposed, and these differ mainly in the status that they assign to probability, either as a description of some feature of the physical world or as a description of our knowledge of the world. The main interpretations are described in the following list.

• Frequentist interpretation: The probability of an event is equal to its limiting frequency in an infinite series of independent identical trials (R. Ellis, J. Venn).
• Subjective interpretation: The probability of a proposition is a measure of how strongly an individual believes that the proposition is true (F. Ramsey, J. von Neumann, B. de Finetti). According to this interpretation, different individuals can rationally assign different probabilities to the same proposition.
• Objective (logical) interpretation: The probability of a proposition is a measure of the strength of the evidence supporting that proposition (J. M. Keynes, R. Jeffrey, R. T. Cox, E. Jaynes, G. Polya). Under this interpretation, rational individuals with the same evidence should assign the same probabilities to a proposition.

The frequentist interpretation is sometimes said to be physical because it considers probabilities to be properties of the physical world that are independent of any observer. In contrast, the subjective and logical interpretations are said to be Bayesian or epistemological because they treat probabilities as properties of an observer's knowledge or beliefs about the world. A very important difference between the frequentist and Bayesian interpretations concerns their scope: whereas frequentist probabilities can only be assigned to events that can occur repeatedly, Bayesian probabilities can be assigned to events that may occur only once.
For example, both frequentist and Bayesian probabilities can be assigned to the event that a tossed coin will land on heads, but only a Bayesian probability can be assigned to the proposition that Hillary Clinton will win the next presidential election, since that election will occur exactly once. In this sense, the Bayesian interpretations of probability allow for a much wider theory than does the frequentist interpretation, although not everyone regards this as legitimate.

1.2 Probability Spaces and Measure Theory

Although probabilities can be interpreted in several different ways, it turns out that there is a common formal mathematical framework that is sufficient for all of these interpretations. This is known as measure-theoretic probability, which was formulated by the Russian mathematician Andrei Kolmogorov in the early 1930's. The most important object in measure-theoretic probability is known as a probability space, which is defined below. To motivate the definition, imagine that you are conducting an experiment with a random outcome and that you wish to describe this scenario mathematically. Here we will temporarily adopt the frequentist viewpoint.

The first thing that we need to know is the set of all possible outcomes of the experiment. Let us call this set Ω. In many cases, we may not be interested in the exact outcome of the experiment, but rather in knowing whether it belongs to a particular collection of elements of Ω that share some property, e.g., we sample a person at random and we want to know if their height is greater than 6 feet. Such a subset is said to be an event. To describe the uncertainty of the experiment, we need to be able to assign a probability to each event that might be of interest to us. In other words, we want to be able to write down (or at least calculate) a set function P which takes an event as an argument and returns a real number.
For example, if E is an event, we are interested in the probability of E, which will be denoted P(E). P is said to be a probability distribution or probability measure. Under the frequentist interpretation, P(E) is the limiting frequency of the event E in a series of independent identical trials. This tells us that probabilities will always take values between 0 and 1, so that the range of the function P is contained in the interval [0, 1]. In other words, for every event E, we know that 0 ≤ P(E) ≤ 1. (Hint: if you ever calculate a probability and arrive at either a negative number or a number greater than 1, then somewhere along the way you have made a mistake.) Furthermore, if we assume that every trial results in exactly one outcome in Ω, then it follows that the probability of Ω is equal to 1, while the probability of the empty set ∅ is equal to 0:

    P(∅) = 0 and P(Ω) = 1.

Finally, if E and F are mutually exclusive events, meaning that E and F cannot both occur on the same trial, then it is easy to see that the limiting frequency of the event E ∪ F, that either E or F occurs, is equal to the sum of the limiting frequencies of E and F. In other words, if E and F are disjoint subsets of Ω, then

    P(E ∪ F) = P(E) + P(F).

Similarly, if E1, E2, · · · , En are mutually exclusive, then similar considerations indicate that

    P(E1 ∪ · · · ∪ En) = P(E1) + · · · + P(En).

The following definition provides a formal framework for these observations.

Definition 1.1. A probability space is a triple (Ω, F, P) where:

• Ω is the sample space, i.e., the set of all possible outcomes of the experiment.
• F is a collection of subsets of Ω which we call events. F is called a σ-algebra and is required to satisfy the following conditions:
  1. The empty set and the sample space are both events: ∅, Ω ∈ F.
  2. If E is an event, then its complement E^c = Ω \ E is also an event.
  3. If E1, E2, · · · are events, then their union ∪n En is an event.
• P is a function from F into [0, 1]: if E is an event, then P(E) is the probability of E. P is said to be a probability distribution or probability measure on F and is also required to satisfy several conditions:
  1. P(∅) = 0 and P(Ω) = 1.
  2. Countable additivity: if E1, E2, · · · are mutually exclusive events, i.e., Ei ∩ Ej = ∅ whenever i ≠ j, then

      P( ∪_{n=1}^∞ En ) = Σ_{n=1}^∞ P(En).

Here are several examples.

Example 1.1. Suppose that we toss a fair coin once and let Ω = {H, T} be the set of possible outcomes, i.e., either H if we get heads or T if we get tails. Let F = {∅, {H}, {T}, Ω} be the collection of all subsets of Ω and define the probability distribution P by setting

    P(∅) = 0,    P({H}) = P({T}) = 1/2,    P(Ω) = 1.

Then (Ω, F, P) is a probability space which provides a model for the outcome of the coin toss.

Example 1.2. Now suppose that we toss a fair coin repeatedly and let N be the number of tosses until we get the first heads (including that toss). Let Ω = N = {1, 2, 3, · · · }, let F = P(Ω) be the power set of Ω, i.e., the collection of all subsets of Ω, and define the probability of an event E = {n1, n2, · · · } ⊂ Ω by

    P(E) ≡ Σ_{ni ∈ E} (1/2)^{ni}.

Then (Ω, F, P) is a probability space which provides a model for the number of tosses that are carried out until we get heads. The variable N is said to be a random variable.

Example 1.3. A probability space (Ω, F, P) is said to be a symmetric probability space if the sample space Ω = {x1, · · · , xN} is finite; the σ-algebra F is the collection of all subsets of Ω, i.e., F = P(Ω) is the power set of Ω; and the probability of any event E ⊂ Ω is equal to

    P(E) = |E|/|Ω| = |E|/N,

where |E| is the number of elements in E. In particular, it follows that the probability of any single element x ∈ Ω is equal to 1/N, and so all elements are equally likely.

Example 1.4. Suppose that a deck contains 52 cards and let Ω be the set containing all 52!
different ways of ordering the deck. If we assume that the deck is randomly shuffled, then every ordering of the cards is equally likely, and so we can use the symmetric probability space (Ω, F, P) as a mathematical model for the effect of shuffling on the order of the cards.

We next consider why we need to restrict to countable additivity and why we only require σ-algebras to be closed under countable unions.

Example 1.5. Suppose that we want to model an experiment in which the outcome is a real number that is uniformly distributed between 0 and 1. Then

• Ω = [0, 1];
• F should at least contain all open and closed intervals contained in [0, 1];
• P([a, b]) = b − a whenever 0 ≤ a ≤ b ≤ 1.

The last condition asserts that the probability that the outcome lies in an interval contained in [0, 1] is equal to the length of that interval, which is what we mean when we say the outcome is uniformly distributed on [0, 1]. One consequence of this definition is that individual points have probability 0: since {x} = [x, x],

    P({x}) = P([x, x]) = x − x = 0.

This shows that the countable additivity of probability measures cannot generally be extended to uncountable collections of events, since the left-hand and right-hand sides of the following expression are not equal:

    1 = P([0, 1]) ≠ Σ_{x ∈ [0,1]} P({x}) = 0.

Indeed, the problem here is that although the sets {x} and {y} are disjoint whenever x ≠ y, there are uncountably many such sets contained in [0, 1], i.e., [0, 1] is uncountably infinite. Later we will see that such sums need to be replaced by integrals.

Notice that we haven't specified the σ-algebra in Example 1.5, only that it should contain all of the open and closed intervals contained in [0, 1].
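The defining property of Example 1.5, P([a, b]) = b − a, can be checked empirically in the frequentist spirit of this chapter. The sketch below is a minimal Monte Carlo check (not part of the original notes); the interval endpoints and trial count are arbitrary illustrative choices.

```python
# Monte Carlo check of Example 1.5: for an outcome uniformly distributed
# on [0, 1], the relative frequency of landing in [a, b] approaches the
# interval's length b - a as the number of trials grows.
import random

random.seed(1)

def estimate_interval_probability(a, b, trials=200_000):
    """Estimate P([a, b]) by the relative frequency of hits."""
    hits = sum(1 for _ in range(trials) if a <= random.random() <= b)
    return hits / trials

print(estimate_interval_probability(0.2, 0.7))   # close to 0.5
print(estimate_interval_probability(0.9, 1.0))   # close to 0.1
```

Note that no single point x is ever hit with positive frequency, which is the numerical counterpart of P({x}) = 0.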
One might hope that we could take F to be the power set of [0, 1], i.e., let every subset of [0, 1] be an event, but it can be shown that there is no function P defined on the power set P([0, 1]) with values in [0, 1] which satisfies the conditions required of a probability measure and has the property that P([a, b]) = b − a for all [a, b] ⊂ [0, 1]. However, one can also show that there is a σ-algebra F containing all of the open and closed intervals which is strictly smaller than P([0, 1]) but which allows one to define P as above in a self-consistent manner. This is called the Borel σ-algebra on [0, 1], and it is defined to be the smallest σ-algebra containing all of the open intervals, i.e., it is the intersection of all σ-algebras which contain these intervals. The sets contained in this σ-algebra are called Borel sets. In general, any set that we might reasonably expect to encounter in applications will be a Borel set, but there is no simple description of what these sets "look like", i.e., the structure of a Borel set can be arbitrarily complicated.

In this course, we will mostly pay very little attention to the σ-algebras defined on the probability spaces that we study. With finite or countably infinite sets, these are usually taken to be the collection of all subsets of the sample space, so that the probability of every subset is defined. Probability measures on uncountable sets are more complicated objects, and one needs to pay attention to these technical details to avoid contradictions. David Williams' book 'Probability with Martingales' (Cambridge, 1991) provides an excellent introduction to these issues for those who wish to learn the mathematics more rigorously than presented here.

1.3 Properties of Events

Because the mathematical theory of probability is described using the language of sets, to apply this theory to real-world problems we need to be able to translate these problems into statements about sets.
To help with this, we recall the following properties of sets.

Operations with events:

• Union: E ∪ F is the set of outcomes that are in E or F (non-exclusive).
• Intersection: E ∩ F is the set of outcomes that are in both E and F.
  – Some authors use the notation EF.
  – E and F are said to be mutually exclusive if E ∩ F = ∅.
• Unions and intersections can be extended to more than two sets.
• Complement: E^c is the set of outcomes in Ω that are not in E.

Unions and intersections satisfy several important algebraic laws:

• Commutative laws: E ∪ F = F ∪ E, E ∩ F = F ∩ E
• Associative laws: (E ∪ F) ∪ G = E ∪ (F ∪ G), (E ∩ F) ∩ G = E ∩ (F ∩ G)
• Distributive laws: (E ∪ F) ∩ G = (E ∩ G) ∪ (F ∩ G), (E ∩ F) ∪ G = (E ∪ G) ∩ (F ∪ G)

Lemma 1.1. (DeMorgan's laws)

    ( ∪_{n=1}^∞ En )^c = ∩_{n=1}^∞ En^c
    ( ∩_{n=1}^∞ En )^c = ∪_{n=1}^∞ En^c

One consequence of DeMorgan's laws is that σ-algebras are closed under countable intersections: if E1, E2, · · · are events in a σ-algebra F, then ∩_{n=1}^∞ En is also an event in F.

1.4 Properties of Probabilities

The following proposition describes some useful properties that follow from the definition of a probability space. These are often used when calculating probabilities and should be committed to memory.

Proposition 1.1. The following properties hold for any two events A, B in a probability space:

1. Complements: P(A^c) = 1 − P(A).
2. Subsets: if A ⊂ B, then P(A) ≤ P(B).
3. Disjoint events: if A and B are mutually exclusive, then P(A ∩ B) = 0.
4. Unions: for any two events A and B (not necessarily mutually exclusive),

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Example 1.6. Suppose that the frequency of HIV infection in a community is 0.05, that the frequency of tuberculosis infection is 0.1, and that the frequency of dual infection is 0.01. What proportion of individuals have neither HIV nor tuberculosis?
Solution: By the union formula and the complement rule,

    P(HIV or tuberculosis) = P(HIV) + P(tuberculosis) − P(both) = 0.05 + 0.1 − 0.01 = 0.14
    P(neither) = 1 − P(HIV or tuberculosis) = 0.86.

The following identity generalizes part 4 of Proposition 1.1 and is known as the inclusion-exclusion formula. It too is sometimes useful when performing calculations.

Theorem 1.1. Let E1, · · · , En be a collection of events. Then

    P(E1 ∪ E2 ∪ · · · ∪ En) = Σ_{i=1}^n P(Ei) − Σ_{1 ≤ i1 < i2 ≤ n} P(Ei1 ∩ Ei2) + · · ·
        + (−1)^{r+1} Σ_{i1 < i2 < · · · < ir} P(Ei1 ∩ Ei2 ∩ · · · ∩ Eir)
        + · · · + (−1)^{n+1} P(E1 ∩ E2 ∩ · · · ∩ En).

This can be verified with the help of indicator variables and expectations.

Chapter 2

Conditional Probabilities

2.1 Motivation

Suppose that two fair dice are rolled and that the sum of the two rolls is even. What is the probability that we have rolled a 1 with the first die? Without the benefit of any theory, we could address this question empirically in the following way. Suppose that we perform this experiment N times, with N large, and let (xi, yi) be the outcome obtained on the i'th trial, where xi is the number rolled with the first die and yi is the number rolled with the second die. Then we can estimate the probability of interest (which we will call P) by dividing the number of trials for which xi = 1 and xi + yi is even by the total number of trials that have xi + yi even:

    P ≈ #{i : xi = 1 and xi + yi even} / #{i : xi + yi even}
      = (#{i : xi = 1 and xi + yi even}/N) / (#{i : xi + yi even}/N),

where we have divided both the numerator and denominator by N in the second line. Now, notice that for N large, we expect the following approximations to hold:

    P{the first roll equals 1 and the sum is even} ≈ #{i : xi = 1 and xi + yi even}/N
    P{the sum is even} ≈ #{i : xi + yi even}/N,

with these becoming identities as N tends to infinity.
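The empirical procedure just described is easy to carry out on a computer. The sketch below (a simulation added for illustration, not part of the original notes) estimates the conditional probability by the ratio of counts; direct enumeration of the 18 even-sum outcomes, of which 3 have a first roll of 1, gives the exact value 1/6.

```python
# Empirical estimate of the conditional probability that the first die
# shows 1 given that the sum of the two dice is even, exactly as in the
# frequency argument above.  The exact value is 3/18 = 1/6.
import random

random.seed(0)

N = 200_000
even_sum = 0                  # trials with x + y even
first_is_one_and_even = 0     # trials with x = 1 and x + y even
for _ in range(N):
    x, y = random.randint(1, 6), random.randint(1, 6)
    if (x + y) % 2 == 0:
        even_sum += 1
        if x == 1:
            first_is_one_and_even += 1

# Ratio of the two relative frequencies, as in the derivation above.
print(first_is_one_and_even / even_sum)   # close to 1/6 ≈ 0.1667
```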
This suggests that we can evaluate the probability P exactly using the following equation:

    P = P{the first roll equals 1 and the sum is even} / P{the sum is even}.

Here P is said to be the conditional probability that the first roll is equal to 1 given that the sum of the two rolls is even. This identity motivates the following definition.

Definition 2.1. Let E and F be events and assume that P(F) > 0. Then the conditional probability of E given F is defined by the ratio

    P(E|F) = P(E ∩ F) / P(F).

F can be thought of as some additional piece of information that we have concerning the outcome of an experiment. The conditional probability P(E|F) then summarizes how this additional information affects our belief that the event E occurs. Notice that the ratio appearing in the definition is only well-defined if the denominator is positive, hence the important requirement that P(F) > 0.

Example 2.1. A student is taking an exam. Suppose that the probability that the student finishes the exam in less than x hours is x/2 for x ∈ [0, 2]. Show that the conditional probability that the student does not finish the exam in one hour given that they are still working after 45 minutes is 0.8.

Example 2.2. A fair coin is flipped twice. Show that the conditional probability that both flips land on heads given that the first lands on heads is 1/2. Show that the conditional probability that both flips land on heads given that at least one flip lands on heads is 1/3.

The next result is a simple but very useful consequence of the definition of conditional probability.

Proposition 2.1. (Multiplication Rule) Suppose that E and F are events and that P(F) > 0. Then

    P(E ∩ F) = P(E|F) · P(F).    (2.1)

Similarly, if E1, · · · , Ek are events and P(E1 ∩ · · · ∩ Ek−1) > 0, then recursive application of equation (2.1) shows that

    P(E1 ∩ · · · ∩ Ek) = P(E1) P(E2|E1) P(E3|E1 ∩ E2) · · · P(Ek|E1 ∩ · · · ∩ Ek−1).

Example 2.3.
Suppose that n balls are chosen without replacement from an urn that initially contains r red balls and b blue balls. Given that k of the n balls are blue, show that the conditional probability that the first ball chosen is blue is k/n.

Solution: Let B be the event that the first ball chosen is blue and let Bk be the event that k of the n balls chosen are blue. Then we need to calculate

    P(B|Bk) = P(B ∩ Bk) / P(Bk).

To evaluate the numerator, write P(Bk ∩ B) = P(Bk|B)P(B), and notice that P(B) = b/(b + r). Also, P(Bk|B) is just the probability that a set of n − 1 balls chosen from an urn containing r red balls and b − 1 blue balls contains k − 1 blue balls. Writing C(m, j) for the binomial coefficient "m choose j", this is equal to

    P(Bk|B) = C(b − 1, k − 1) C(r, n − k) / C(r + b − 1, n − 1).

Similarly,

    P(Bk) = C(b, k) C(r, n − k) / C(r + b, n).

It follows that

    P(B|Bk) = [b/(r + b)] · [C(b − 1, k − 1) C(r, n − k) / C(r + b − 1, n − 1)] · [C(r + b, n) / (C(b, k) C(r, n − k))] = k/n.

2.2 Independence

We begin with the definition of pairwise independence, i.e., of what it means for two events to be independent. In the sequel, we will extend this to more than two events.

Definition 2.2. Two events E and F are said to be independent if

    P(E ∩ F) = P(E)P(F).

Intuitively, E and F are independent if knowing that F has occurred provides no information about the likelihood that E has occurred, and vice versa:

    P(E|F) = P(E ∩ F)/P(F) = P(E)P(F)/P(F) = P(E)
    P(F|E) = P(E ∩ F)/P(E) = P(E)P(F)/P(E) = P(F).

Remark 2.1. Independence is a symmetric relation: to say that E is independent of F also means that F is independent of E.

Example 2.4. Suppose that two coins are tossed and that all four outcomes are equally likely. Let E be the event that the first coin lands on heads and let F be the event that the second coin lands on heads. Then, as expected, E and F are independent since

    P(E ∩ F) = 1/4 = P(E)P(F).

Example 2.5. Suppose that we toss two fair dice. Let E1 be the event that the sum of the dice is 6 and let F be the event that the first die equals 4.
Then

    P(E1 ∩ F) = 1/36 ≠ 5/36 · 1/6 = P(E1)P(F),

and thus E1 and F are not independent. On the other hand, if we let E2 be the event that the sum of the dice is 7, then

    P(E2 ∩ F) = 1/36 = 1/6 · 1/6 = P(E2)P(F),

and so E2 and F are independent.

Proposition 2.2. If E and F are independent, then so are each of the following three pairs: E and F^c, E^c and F, and E^c and F^c.

Proof. This follows by direct calculation of the joint probabilities. For example,

    P(E ∩ F^c) = P(E) − P(E ∩ F) = P(E) − P(E) · P(F) = P(E) · P(F^c),

which shows that E and F^c are independent.

As promised, we can extend the definition of independence to collections of more than two events. Furthermore, in the example that follows the definition, we show that independence of a collection of events can be a strictly stronger condition than mere independence of each pair of events.

Definition 2.3. A countable collection of events E1, E2, · · · is said to be independent if for every finite sub-collection Ei1, Ei2, · · · , Ein, we have

    P(Ei1 ∩ Ei2 ∩ · · · ∩ Ein) = P(Ei1) P(Ei2) · · · P(Ein).

Example 2.6. Pairwise independence does not imply independence. Suppose that two fair dice are tossed and let E1 be the event that the first die equals 4, let E2 be the event that the second die equals 4, and let E3 be the event that the sum of the dice is 7. As above, we know that each pair of events is independent. However, it is not the case that all three events are independent, since

    P(E1 ∩ E2 ∩ E3) = 0 ≠ 1/6 · 1/6 · 1/6 = P(E1)P(E2)P(E3).

Example 2.7. Suppose that an infinite sequence of independent trials is performed such that each trial results in a success with probability p and a failure with probability q = 1 − p. Calculate the probability that:

• (a) at least one success occurs in the first n trials;
• (b) exactly k successes occur in the first n trials;
• (c) all trials result in successes.
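The standard answers to Example 2.7 are (a) 1 − q^n, (b) C(n, k) p^k q^(n−k), and (c) lim p^n = 0 whenever p < 1; these follow directly from the independence of the trials. The sketch below (added for illustration; the parameter values p = 0.3, n = 10, k = 4 are arbitrary) checks (a) and (b) against a simulation.

```python
# Checking the answers to Example 2.7 by simulation:
#   (a) P(at least one success in first n trials)  = 1 - q**n
#   (b) P(exactly k successes in first n trials)   = C(n, k) * p**k * q**(n-k)
#   (c) P(all trials succeed) = lim p**n = 0 for p < 1 (not simulated).
import random
from math import comb

random.seed(2)
p, n, k = 0.3, 10, 4
q = 1 - p

exact_a = 1 - q**n
exact_b = comb(n, k) * p**k * q**(n - k)

trials = 100_000
hits_a = hits_b = 0
for _ in range(trials):
    successes = sum(random.random() < p for _ in range(n))
    hits_a += successes >= 1
    hits_b += successes == k

print(exact_a, hits_a / trials)   # the two columns should nearly agree
print(exact_b, hits_b / trials)
```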
2.3 The Law of Total Probability

Let E and F be two events and notice that we can write E as the following disjoint union:

    E = (E ∩ F) ∪ (E ∩ F^c).

Then, using the additivity of probabilities and the definition of conditional probabilities, we obtain:

    P(E) = P(E ∩ F) + P(E ∩ F^c)
         = P(E|F)P(F) + P(E|F^c)P(F^c)
         = P(E|F)P(F) + P(E|F^c)(1 − P(F)).    (2.2)

This formula can sometimes be used to calculate the probability of a complicated event by exploiting the additional information provided by knowing whether or not F has also occurred. The hard part is knowing how to choose F so that the conditional probabilities P(E|F) and P(E|F^c) are both easy to calculate.

Example 2.8. Suppose that there are two classes of people, those that are accident prone and those that are not, and that the risk of having an accident within a one-year period is either 0.4 or 0.2 depending on which class a person belongs to. If the proportion of accident-prone people in the population is 0.3, calculate the probability that a randomly selected person has an accident in a given year (0.26). What is the conditional probability that an individual is accident-prone given that they have had an accident in the last year? (6/13)

Equation (2.2) has the following far-reaching generalization.

Proposition 2.3. (Law of Total Probability) Suppose that F1, · · · , Fn are mutually exclusive events such that P(Fi) > 0 for each i = 1, · · · , n and E ⊂ F1 ∪ · · · ∪ Fn. Then

    P(E) = Σ_{i=1}^n P(E ∩ Fi) = Σ_{i=1}^n P(E|Fi) · P(Fi).

Proof. The result follows from the countable additivity of P and the fact that E is the disjoint union of the sets E ∩ Fi for i = 1, · · · , n.

2.4 Bayes' Formula

Bayes' formula describes the relationship between the two conditional probabilities P(E|F) and P(F|E).

Theorem 2.1. (Bayes' Formula) Suppose that E and F are events such that P(E) > 0 and P(F) > 0. Then

    P(F|E) = P(E|F) · P(F) / P(E).    (2.3)

Similarly, if F1, · · · , Fn are mutually exclusive events with E ⊂ F1 ∪ · · · ∪ Fn and P(Fi) > 0 for i = 1, · · · , n, then

    P(Fj|E) = P(E ∩ Fj) / P(E) = P(E|Fj) P(Fj) / Σ_{i=1}^n P(E|Fi) P(Fi).    (2.4)

Example 2.9. Suppose that we have three cards, and that the first card has two red sides, the second card has two black sides, and the third card has one red side and one black side. One of these cards is chosen at random and placed on the ground. If the upper side of this card is red, show that the probability that the other side is black is 1/3.

Example 2.10. Suppose that a test for HIV-1 infection is known to have a false positive rate of 2.3% and a false negative rate of 1.4%, and that the prevalence of HIV-1 in a particular population is equal to 1 : 10,000. If an individual with no known risk factors takes the test and tests positive for HIV-1 infection, what is the probability that they are, in fact, infected?

We can use Bayes' formula to solve this problem, but we first introduce some notation. Let

    H ≡ "the individual is infected by HIV-1"
    H̄ ≡ "the individual is not infected by HIV-1"
    D ≡ "the individual tests positive"

As above, Bayes' formula can be written as

    P(H|D) = P(H) P(D|H) / P(D).

Here H and H̄ are the two hypotheses of interest, and these are both mutually exclusive and exhaustive. Thus, the denominator on the right-hand side can be expanded as

    P(D) = P(H)P(D|H) + P(H̄)P(D|H̄).

In light of the given prior information, we have

    P(H) = 0.0001
    P(H̄) = 1 − P(H) = 0.9999
    P(D|H) = 0.986
    P(D|H̄) = 0.023.

Indeed, since the individual has no known risk factors for HIV-1 infection, we can take the overall prevalence of HIV-1 infection in the population as our estimate of the prior probability that the individual is infected, i.e., we are using frequency data to estimate (or elicit) the prior probability of a hypothesis of interest. The second probability follows from the sum rule.
The third probability is concerned with the event in which the individual tests positive (D) when they are in fact infected (H); this is just the probability of a true positive, which by the sum rule is equal to 1 minus the probability of a false negative: 1 − 0.014 = 0.986. Similarly, the fourth probability is that of a false positive, which is 0.023 according to I. Substituting these quantities into the equation written above shows that P(D) = 0.0230963. It then follows that

    P(H|D) = (0.0001 × 0.986) / (0.0001 × 0.986 + 0.9999 × 0.023) ≈ 0.0043,

and so an individual with a positive test result but no known risk factors is still very unlikely to be infected by HIV-1, despite the fact that the test is fairly accurate. The reason for this is that the prevalence of HIV-1 infection in the population is so low that an individual who has no reason to suspect that they are HIV-1 infected is far more likely to receive a positive test result due to a testing error than because they are, in fact, infected.

Example 2.11. Suppose that 5% of men and 0.25% of women are color blind. A color-blind person is chosen at random. Assuming that there are an equal number of males and females in the population, what is the probability that the chosen individual is male?

Solution: Let M and F be the events that the chosen individual is male or female, respectively, and let C denote the event that a randomly chosen individual is color blind. Then the probability of interest is

    P(M|C) = P(C|M)P(M) / (P(C|M)P(M) + P(C|F)P(F))
           = (0.05 · 0.5) / (0.05 · 0.5 + 0.0025 · 0.5)
           = 0.9524.

2.4.1 Bayesian Statistics

Bayes' formula is often applied to problems of inference in the following way. Suppose that our goal is to decide among a set of competing hypotheses, H_1, ..., H_n.
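Before setting up the general machinery, it may help to check Example 2.10 numerically. A short sketch (the variable names are ours; the rates are those quoted in the text):

```python
# Numerical check of Example 2.10.
p_H = 0.0001              # prior: prevalence of HIV-1 (1 in 10,000)
p_D_given_H = 1 - 0.014   # true positive rate = 1 - false negative rate
p_D_given_Hbar = 0.023    # false positive rate

# Law of total probability for the evidence P(D):
p_D = p_H * p_D_given_H + (1 - p_H) * p_D_given_Hbar

# Bayes' formula for the posterior P(H|D):
p_H_given_D = p_H * p_D_given_H / p_D

print(p_D)           # ≈ 0.0230963
print(p_H_given_D)   # ≈ 0.0043
```

Note how completely the tiny prior P(H) dominates the fairly accurate likelihoods, which is the phenomenon discussed above.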
Let p_1, ..., p_n be a probability distribution on the hypotheses, i.e., p_1 + ··· + p_n = 1, where p_i is a measure of the strength of our belief in hypothesis H_i prior to performing the experiment. In the jargon of Bayesian statistics, (p_1, ..., p_n) is said to be the prior distribution on the space of hypotheses. Now suppose that we perform an experiment to help decide between the hypotheses, and let the outcome of this experiment be the event E. Bayes' formula can be used to update our belief in the relative probabilities of the different hypotheses in light of the new data. The posterior probability, p*_i, of hypothesis H_i is defined to be the conditional probability of H_i given E:

    p*_i = P(H_i|E) = P(E|H_i)P(H_i) / Σ_{j=1}^n P(E|H_j)P(H_j) = p_i · P(E|H_i) / Σ_{j=1}^n p_j · P(E|H_j).

Notice that (p*_1, ..., p*_n) is a probability distribution on the space of hypotheses; it is called the posterior distribution. It depends both on the data collected in the experiment and on our prior belief about the likelihood of the different hypotheses.

Example 2.12. Suppose that we are studying a population containing n individuals and that our goal is to estimate the prevalence of tuberculosis in the population. Let H_k be the hypothesis that k individuals are infected, and suppose that our prior distribution on (H_0, ..., H_n) is given by

    p_k = 1/(n + 1),    k = 0, 1, ..., n.

Such a prior is said to be uninformative and means that we have no prior information concerning the incidence of tuberculosis in the population. Now suppose that the experiment consists of sampling one member of the population and testing them for tuberculosis infection. Let E be the event that this individual is infected and observe that P(E|H_k) = k/n and that

    P(E) = Σ_{k=0}^n p_k P(E|H_k) = Σ_{k=0}^n (1/(n + 1)) · (k/n) = (1/(n(n + 1))) · (n(n + 1)/2) = 1/2.

Using Bayes' formula, the posterior probability of hypothesis H_i is

    p*_i = p_i · P(E|H_i) / (1/2) = (2/(n + 1)) · (i/n) = 2i/(n(n + 1))

for i = 0, ..., n.
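The prior-to-posterior update just carried out can be checked with exact arithmetic. A sketch (ours) for a small population, n = 10:

```python
# Posterior distribution of Example 2.12 for n = 10, using exact fractions.
from fractions import Fraction

n = 10
prior = [Fraction(1, n + 1)] * (n + 1)               # uninformative prior on H_0..H_n
likelihood = [Fraction(k, n) for k in range(n + 1)]  # P(E|H_k) = k/n

# Evidence P(E) via the law of total probability; the text shows it is 1/2.
evidence = sum(p * l for p, l in zip(prior, likelihood))

# Bayes' formula applied to every hypothesis at once.
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

# Closed form from the text: p*_i = 2i/(n(n+1)).
closed_form = [Fraction(2 * i, n * (n + 1)) for i in range(n + 1)]

print(evidence)                  # 1/2
print(posterior == closed_form)  # True
```

Because everything is a `Fraction`, the comparison with the closed form is exact rather than approximate.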
Notice that p*_i ≥ p_i for i ≥ n/2, i.e., observing an infected individual in our sample increases the posterior probability that the incidence of tuberculosis is greater than 1/2.

2.4.2 Mendelian Genetics

Gregor Mendel's 1866 paper proposing a mathematical model for heredity was one of the first applications of probability to the natural sciences. Although we now know that the genetic architecture of some traits is much more complicated than Mendel imagined, Mendelian genetics still provides a useful framework for understanding how many traits of interest, including thousands connected with hereditary human diseases, are inherited. In fact, these models play an important role both in genetic counseling and in forensic DNA analysis.

We begin with a list of important facts and concepts:

• Humans have two copies of most genes (except for those on the sex chromosomes).
• A child independently inherits one gene from each parent, i.e., one from the mother and one from the father.
• Each of the two gene copies in a parent is equally likely to be inherited.
• Different types of a gene are called alleles. These are often denoted by capital and lowercase letters, e.g., B and b.
• A genotype is a list of the types of the two genes possessed by an individual.
• If there are two alleles, then there are three possible genotypes: BB, bb, and Bb. Notice that we usually ignore the order of alleles when writing the genotype, i.e., Bb and bB are the same genotype.
• The types and frequencies of the genotypes inherited by offspring can be calculated using a Punnett square. With two alleles, B and b, the possible crosses are:
  – BB × BB produces only BB
  – bb × bb produces only bb
  – BB × Bb produces 1/2 BB + 1/2 Bb
  – bb × Bb produces 1/2 bb + 1/2 Bb
  – Bb × Bb produces 1/4 BB + 1/2 Bb + 1/4 bb
  – BB × bb produces only Bb.

Example 2.13. Assume that eye color is determined by a single pair of genes, each of which can be either B or b.
If a person has one or more copies of B, then they will have brown eyes. If a person has two copies of b, then they will have blue eyes. Suppose that Smith and his parents have brown eyes, and that Smith's sister has blue eyes.

a) What is the probability that Smith carries a blue-eyed gene?

b) Suppose that Smith's partner has blue eyes. What is the probability that their first child will have blue eyes?

c) Suppose that their first child has brown eyes. What is the probability that their second child will also have brown eyes?

Solutions:

a) Because Smith's sister has blue eyes, her genotype is bb. Consequently, both of Smith's parents (who are brown-eyed) must have genotype Bb, since the sister must inherit a b gene from each parent. However, Smith himself could have genotype BB or Bb. Let E_1 and E_2 be the events that Smith has genotype BB or Bb, respectively, and let F be the event that Smith has brown eyes. Then the probability that Smith has genotype Bb is

    P(E_2|F) = P(F|E_2)P(E_2) / P(F) = (1 · 1/2) / (3/4) = 2/3,

where P(F) = 3/4 because a Bb × Bb cross produces a brown-eyed child with probability 1/4 + 1/2 = 3/4, while the probability that Smith has genotype BB is

    P(E_1|F) = 1 − P(E_2|F) = 1/3.

b) If Smith's partner has blue eyes, then her genotype is bb. Now if Smith has genotype BB, then his children can only have genotype Bb and therefore have zero probability of having blue eyes. On the other hand, if Smith has genotype Bb, then his first child will have blue eyes with probability 1/2 (i.e., with the probability that the child inherits a b gene from Smith). Thus, the total probability that the first child has blue eyes is

    1/3 · 0 + 2/3 · 1/2 = 1/3.

c) Knowing that their first child has brown eyes provides us with additional information about the genotypes of the parents. Let G be the event that the first child has brown eyes and let E_1 and E_2 be the events that Smith has genotype BB or Bb, respectively.
From (a) we know that P(E_1) = 1/3 and P(E_2) = 2/3, while (b) tells us that P(G) = 1 − 1/3 = 2/3. Consequently,

    P(E_1|G) = P(G|E_1)P(E_1) / P(G) = (1 · 1/3) / (2/3) = 1/2,
    P(E_2|G) = 1 − P(E_1|G) = 1/2.

Let H be the event that the second child has brown eyes and notice that

    P(H|E_1, G) = 1,    P(H|E_2, G) = 1/2.

Then

    P(H|G) = P(H ∩ E_1|G) + P(H ∩ E_2|G)
           = P(H|E_1, G)P(E_1|G) + P(H|E_2, G)P(E_2|G)
           = 1 · 1/2 + 1/2 · 1/2 = 3/4.

2.5 Conditional Probabilities are Probabilities (Optional)

Theorem 2.2. Let F be an event in a probability space (Ω, F, P) such that P(F) > 0. Then the conditional probability P(·|F) is a probability measure on (Ω, F).

Proof. We need to show that P(·|F) satisfies the three axioms of a probability measure. The first axiom requires that for any event E,

    0 ≤ P(E|F) = P(E ∩ F)/P(F) ≤ 1,

which follows because P(E ∩ F) ≥ 0 and P(F) > 0, and because P(E ∩ F) ≤ P(F) (since E ∩ F ⊂ F).

The second axiom follows by a simple calculation:

    P(Ω|F) = P(Ω ∩ F)/P(F) = P(F)/P(F) = 1.

Lastly, let E_1, E_2, ... be a countable collection of disjoint events and let E = ∪_{n=1}^∞ E_n. Then

    P(E|F) = P(E ∩ F)/P(F)
           = (1/P(F)) · P(∪_{n=1}^∞ (E_n ∩ F))
           = (1/P(F)) · Σ_{n=1}^∞ P(E_n ∩ F)
           = Σ_{n=1}^∞ P(E_n|F),

where in passing from the second line to the third we have made use of the countable additivity of P(·) and of the fact that the events E_n ∩ F, n ≥ 1, are disjoint. This verifies that P(·|F) is countably additive, which completes the proof.

One useful consequence of Theorem 2.2 is that we can condition on a series of events, e.g., if E, F, and G are events and we define Q(E) = P(E|F), then the conditional probability Q(E|G) is well-defined. In fact,

    Q(E|G) = Q(E ∩ G)/Q(G)
           = P(E ∩ G|F)/P(G|F)
           = (P(E ∩ G ∩ F)/P(F)) · (P(F)/P(G ∩ F))
           = P(E ∩ F ∩ G)/P(F ∩ G)
           = P(E|F ∩ G),

which shows that conditioning first on F and then on G is equivalent to conditioning on the intersection of F and G.
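Conditioning on several events at once, as in part (c) of Example 2.13 above, can always be checked by brute-force enumeration of the underlying outcomes. A sketch (ours, not from the text) that recomputes P(H|G) = 3/4 by enumerating Smith's genotype together with the gene each child inherits from him (the partner is bb, so a child has brown eyes iff it inherits B from Smith):

```python
# Brute-force check of Example 2.13(c) with exact fractions.
from fractions import Fraction
from itertools import product

# Smith's genotype, with the posterior probabilities from part (a).
smith = [("BB", Fraction(1, 3)), ("Bb", Fraction(2, 3))]

def genes(g):
    # The two gene copies of a parent with genotype g, each transmitted
    # with probability 1/2.
    return [(g[0], Fraction(1, 2)), (g[1], Fraction(1, 2))]

num = den = Fraction(0)
for g, pg in smith:
    # (a, pa): gene passed to child 1; (b, pb): gene passed to child 2.
    for (a, pa), (b, pb) in product(genes(g), genes(g)):
        p = pg * pa * pb
        if a == "B":          # event G: first child is brown-eyed
            den += p
            if b == "B":      # second child brown-eyed as well
                num += p

print(num / den)   # 3/4
```

The ratio num/den is exactly the conditional probability P(H|G), computed without ever writing down Bayes' formula.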
Furthermore, this correspondence allows us to rewrite the identity Q(E) = Q(E|G)Q(G) + Q(E|G^c)Q(G^c) in the useful form

    P(E|F) = P(E|F ∩ G)P(G|F) + P(E|F ∩ G^c)P(G^c|F).

Example 2.14. A female chimp gives birth to a single offspring, but it is unknown which of two male chimps is the father. Before any genetic data is collected, it is believed that the probability that male number 1 is the father is p and that the probability that male number 2 is the father is 1 − p. (Recall that (p, 1 − p) is the prior distribution on the two hypotheses concerning paternity.) DNA is collected from all four individuals and sequenced at a single site on the genome, showing that the maternal genotype is AA, the genotype of male 1 is aa, the genotype of male 2 is Aa, and that the genotype of the offspring is Aa. What is the posterior probability that male 1 is the father?

Solution: Let all probabilities be conditional on the genotypes of the adults, and define M_i, i = 1, 2, to be the event that male i is the father and G to be the event that the offspring genotype is Aa. Then

    P(M_1|G) = P(M_1 ∩ G)/P(G)
             = P(G|M_1)P(M_1) / (P(G|M_1)P(M_1) + P(G|M_2)P(M_2))
             = (1 · p) / (1 · p + (1/2) · (1 − p))
             = 2p/(1 + p).

Notice that P(M_1|G) > p = P(M_1) whenever p < 1, i.e., the observed genotypes increase the likelihood that male 1 is the father.

Example 2.15. Suppose that a series of independent trials is conducted and that each trial results in a success with probability p and a failure with probability q = 1 − p. What is the probability that a run of n consecutive successes occurs before a run of m consecutive failures?

Solution: Let us introduce the following events:

    E = {a run of n successes happens before a run of m failures}
    S = {the first trial results in a success}
    S_{n−1} = {trials 2 through n all result in successes}
    F_{m−1} = {trials 2 through m all result in failures}.
By the law of total probability, we have

    P(E) = P(E|S)P(S) + P(E|S^c)P(S^c) = p · P(E|S) + q · P(E|S^c).

The conditional probabilities in this last expression can be evaluated in the following way. First observe that

    P(E|S) = P(E|S_{n−1} ∩ S)P(S_{n−1}|S) + P(E|S^c_{n−1} ∩ S)P(S^c_{n−1}|S)
           = p^{n−1} + (1 − p^{n−1}) · P(E|S^c_{n−1} ∩ S),

since S_{n−1} and S are independent and P(E|S_{n−1} ∩ S) = 1 (n consecutive successes occur on the first n trials if S and S_{n−1} are both true). However, since the event S^c_{n−1} ∩ S means that a failure has occurred during at least one of the trials 2 through n, and since whether E occurs or not does not depend on any event happening before this first failure, it follows that

    P(E|S^c_{n−1} ∩ S) = P(E|S^c),

i.e., it is as if the failure occurred on the very first trial. Thus

    P(E|S) = p^{n−1} + (1 − p^{n−1}) · P(E|S^c).

Similar reasoning shows that

    P(E|S^c) = P(E|F_{m−1} ∩ S^c)P(F_{m−1}|S^c) + P(E|F^c_{m−1} ∩ S^c)P(F^c_{m−1}|S^c)
             = 0 + (1 − q^{m−1}) · P(E|F^c_{m−1} ∩ S^c)
             = (1 − q^{m−1}) · P(E|S),

where we have used the fact that P(E|F_{m−1} ∩ S^c) = 0 (since then m consecutive failures occur during the first m trials) and that P(E|F^c_{m−1} ∩ S^c) = P(E|S). By combining the equations for P(E|S) and P(E|S^c), we obtain

    P(E|S) = p^{n−1} + (1 − p^{n−1})(1 − q^{m−1}) · P(E|S),

which then implies that

    P(E|S) = p^{n−1} / (1 − (1 − p^{n−1})(1 − q^{m−1})),
    P(E|S^c) = p^{n−1}(1 − q^{m−1}) / (1 − (1 − p^{n−1})(1 − q^{m−1})).

Finally, by substituting these conditional probabilities into the initial equation derived for P(E), we arrive at the answer to our problem:

    P(E) = (p^n + p^{n−1}(q − q^m)) / (1 − (1 − p^{n−1})(1 − q^{m−1})) = p^{n−1}(1 − q^m) / (1 − (1 − p^{n−1})(1 − q^{m−1})).

2.6 The Gambler's Ruin Problem (Optional)

The following is a classic problem of probability that dates back to the 17th century. Two gamblers, A and B, bet on the outcomes of successive flips of a coin. On each flip, A wins a dollar from B if the coin comes up heads, while B wins a dollar from A if the coin comes up tails.
The game continues until one of the players runs out of money. Suppose that the coin tosses are mutually independent and that each toss turns up heads with probability p. If A starts with i dollars and B starts with N − i dollars, what is the probability that A ends up with all of the money?

Solution: Let E be the event that A eventually wins all of the money and let

    P_i = P(E | A starts with i dollars)

be the conditional probability of E given that A starts with i dollars. We will use the independence of the successive coin tosses to obtain a system of linear equations for the probabilities P_0, P_1, ..., P_N. To this end, let H be the event that the first toss lands on heads and observe that

    P_i = P(E|H)P(H) + P(E|H^c)P(H^c)
        = p · P(E|H) + (1 − p) · P(E|H^c)
        = p · P_{i+1} + (1 − p) · P_{i−1},    (2.5)

since P(E|H) = P_{i+1} and P(E|H^c) = P_{i−1}. Indeed, if H is true, then A will have i + 1 dollars and B will have N − (i + 1) dollars after the first toss. Furthermore, since the successive tosses are independent, the probability that A eventually wins all of the money conditional on having i + 1 dollars after the first toss is the same as the probability that A wins all of the money conditional on A starting the game with i + 1 dollars.

Equation (2.5) can be reformulated as a one-step recursion by first writing

    P_i = p P_i + q P_i = p P_{i+1} + q P_{i−1}

and then rearranging terms on both sides of the identity to obtain

    p (P_{i+1} − P_i) = q (P_i − P_{i−1}),    i.e.,    P_{i+1} − P_i = α (P_i − P_{i−1}),

where α = q/p. Notice that P_0 = 0 (A loses if she has no money to wager). Thus,

    P_2 − P_1 = α (P_1 − P_0) = α P_1
    P_3 − P_2 = α (P_2 − P_1) = α^2 P_1
    ···
    P_i − P_{i−1} = α (P_{i−1} − P_{i−2}) = α^{i−1} P_1
    ···
    P_N − P_{N−1} = α (P_{N−1} − P_{N−2}) = α^{N−1} P_1.

Adding the first i − 1 equations gives

    P_i − P_1 = P_1 Σ_{k=1}^{i−1} α^k,

and therefore

    P_i = P_1 (1 − α^i)/(1 − α)    if p ≠ 1/2 (α ≠ 1),
    P_i = i P_1                    if p = 1/2 (α = 1).

Then, using the fact that P_N = 1 (A wins if she begins with all of the money), we can solve for P_1:

    P_1 = (1 − α)/(1 − α^N)    if p ≠ 1/2,
    P_1 = 1/N                  if p = 1/2.

Substituting these results back into the expression for P_i, we obtain

    P_i = (1 − (q/p)^i)/(1 − (q/p)^N)    if p ≠ 1/2,
    P_i = i/N                            if p = 1/2.

If we let Q_i denote the probability that B ends up with all of the money when the players have i and N − i dollars at the start of the game, then by the symmetry of the problem we see that

    Q_i = (1 − (p/q)^{N−i})/(1 − (p/q)^N)    if p ≠ 1/2,
    Q_i = (N − i)/N                          if p = 1/2.

In particular, a simple calculation shows that P_i + Q_i = 1, i.e., with probability 1, either A or B eventually wins all of the money. In other words, the probability that the game continues indefinitely with neither A nor B winning is 0.

Remark 2.2. If we let X_n denote A's holdings after the n'th coin toss, then the random sequence (X_n : n ≥ 0) is an example of a Markov chain and, in the case p = q = 1/2, of a martingale.

Chapter 3: Random Variables and Probability Distributions

3.1 Random Variables

Imagine that an experiment is performed which results in a random outcome ω belonging to a set Ω. We saw in the last chapter that we could model such a scenario by introducing a probability space (Ω, F, P) in which Ω is the sample space, F is a collection of events in Ω, and P is a distribution which specifies the probability P(A) of each event A ∈ F. However, depending on the size and the complexity of the sample space, as well as on the experimentalist's resources, it may not be possible to directly observe ω. Instead, what the experimentalist usually can do is measure some quantity X = X(ω) that depends on ω and which reveals some information about the outcome. Because the value of X depends on the outcome of the experiment, which is random, we say that X is a random variable.
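The idea that X is just a function of the outcome ω can be made concrete in code. A small sketch (our own illustration, anticipating the coin-tossing example below), in which ω is a full sequence of tosses and X(ω) is the summary we actually record:

```python
# A random variable as a function X(omega): omega is the full (often
# unobservable) outcome, X(omega) the quantity we actually measure.
import random

rng = random.Random(2016)

def draw_omega(n=100):
    """Sample an outcome: a sequence of n coin tosses."""
    return tuple(rng.choice("HT") for _ in range(n))

def X(omega):
    """The measured quantity: the number of heads in omega."""
    return sum(1 for toss in omega if toss == "H")

omega = draw_omega()
print(X(omega))   # an integer between 0 and 100
```

The randomness lives entirely in the draw of ω; the function X itself is deterministic.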
Random variables play a central role in probability theory, statistics, and stochastic modeling, and so this entire chapter is devoted to their properties.

Definition 3.1. Suppose that (Ω, F, P) is a probability space and that E is a set equipped with a σ-algebra E of events. A function X : Ω → E is said to be an E-valued random variable if for every set A ∈ E we have

    X^{−1}(A) = {ω ∈ Ω : X(ω) ∈ A} = {X ∈ A} ∈ F.

In other words, for each set A ∈ E, the set {X ∈ A} must be an event, so that we can assign it a probability.

Here E can be a set of objects of any type, including integers, real numbers, complex numbers, vectors, matrices, functions, geometric objects, graphs, networks, and even probability distributions themselves. For example, if E is a set of matrices, then we would say that an E-valued random variable X is a random matrix. This level of generality allows us to apply random variables to problems coming from many different disciplines. However, most often E is a subset of the real numbers, in which case we say that X is a real-valued random variable. In fact, for many authors random variables are synonymous with real-valued random variables. Two examples follow.

Example 3.1. Suppose that we toss a fair coin one hundred times and record the outcome of each toss as H or T. We can model this experiment using a symmetric probability space (Ω, F, P) in which Ω is the collection of sequences of length one hundred in which each element is either H or T, F is the power set of Ω, and the probability of each outcome ω = (ω_1, ..., ω_100) is

    P({ω}) = 1/2^100.

Because Ω is a very large set, we might choose to restrict our attention to the number of tosses that land on heads, which we can represent by a random variable X : Ω → E = {0, 1, ..., 100}, where

    X(ω) = #{i : ω_i = H},

and we take E to be the power set of E.
Notice that because F is the power set of Ω, every set {X ∈ A} belongs to F when A ∈ E.

Example 3.2. Suppose that a fair die is rolled twice and that we record the outcome of each roll. We can model this experiment with a symmetric probability space (Ω, F, P) where Ω = {(i, j) : 1 ≤ i, j ≤ 6}, F is the power set of Ω, and the probability of each outcome ω = (ω_1, ω_2) is

    P({ω}) = 1/36.

In this case, the sum of the two numbers rolled is a random variable X : Ω → E = {2, ..., 12} defined by X(ω) = ω_1 + ω_2. Similarly, the product, the difference, and the quotient of the two numbers rolled are all random variables on Ω.

Arguably, the most important attribute of a random variable is its distribution, which tells us the probabilities of events involving this variable.

Definition 3.2. Let X be an E-valued random variable defined on a probability space (Ω, F, P). Then the distribution of X is a probability distribution P_X : E → [0, 1] on the space (E, E) defined by the formula

    P_X(A) = P(X ∈ A)    whenever A ∈ E.

Furthermore, the triple (E, E, P_X) is itself a probability space.

Example 3.3. Let X be the sum obtained when a fair die is rolled twice. Then the distribution of X on the set E = {2, 3, ..., 12} is determined by the following probabilities:

    i        2     3     4     5     6     7     8     9     10    11    12
    P_X(i)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Furthermore, the probability of any subset A ⊂ E can be calculated by summing the probabilities of the elements of A:

    P_X(A) = Σ_{i∈A} P_X(i).

Indeed, X is an example of a discrete random variable and the table shows the values of the probability mass function of X (see below).

Remark: Although random variables are formally defined as functions on probability spaces, it is common practice to suppress all mention of the underlying probability space (Ω, F, P) and instead only work with the variable X and its distribution P_X on (E, E). This allows us to focus on quantities that we can actually observe and measure.
This will be the convention followed in these lecture notes, except in a few places where it will be convenient to use the underlying probability space as a mathematical tool to prove results about random variables.

In the special case where X is a real-valued random variable, the distribution of X can be summarized by a function called the cumulative distribution function of X (abbreviated c.d.f.). This is useful because real-valued functions are often less complicated objects than probability distributions defined on σ-algebras on the real numbers. This is particularly true of continuous random variables, which can assume any value in an interval.

Definition 3.3. If X is a real-valued random variable, then the cumulative distribution function of X is the function F_X : R → [0, 1] defined by

    F_X(x) ≡ P(X ≤ x) = P(X ∈ (−∞, x])

for any x ∈ R.

F_X is said to be the cumulative distribution function of X because it measures the cumulative probability of X from −∞ up to x. It is clear that the cumulative distribution function of a random variable is uniquely determined by the distribution of the variable. Less obviously, the converse is also true: the distribution of a random variable is uniquely determined by its cumulative distribution function. In other words, if X and Y have the same c.d.f., i.e., if F_X(x) = F_Y(x) for every x ∈ R, then X and Y have the same distribution: P(X ∈ A) = P(Y ∈ A) for every 'nice' subset A ⊂ R. Section 3.1.3 describes several other important properties of cumulative distribution functions. We defer examples to the next two sections.

Having specialized to real-valued random variables, we now specialize further. Namely, in this course we will mainly focus on two classes of real-valued random variables, discrete and continuous, which are defined in the next two sections.
Random variables which are neither discrete nor continuous also exist (e.g., a random variable with the Cantor distribution, which is supported on the Cantor middle-thirds set, is neither discrete nor continuous), but discrete and continuous random variables are the ones most commonly encountered in applications.

3.1.1 Discrete random variables

Recall that a set E is said to be countable if it is either finite or countably infinite. In the latter case, E can be put into one-to-one correspondence with the natural numbers, i.e., we can enumerate the elements of E = {x_1, x_2, x_3, ...}. Examples of countably infinite sets include the integers, the non-negative even integers, and the rationals. An example of a set that is infinite but not countably infinite is the interval [0, 1]; this was first established by Georg Cantor using his famous diagonalization argument.

Definition 3.4. A real-valued random variable X is said to be discrete if there is a countable set of values E = {x_1, x_2, ...} ⊂ R such that P(X ∈ E) = 1. In this case, the probability mass function of X is the function p_X : E → [0, 1] defined by

    p_X(y) = P(X = y).

The next proposition shows how the probability mass function (p.m.f.) of a discrete random variable is related to the distribution of that variable. In particular, it follows that if two discrete random variables have the same p.m.f., then they necessarily have the same distribution.

Proposition 3.1. Suppose that X is a discrete random variable with values in E and let p_X be the probability mass function of X. If A ⊂ E, then

    P(X ∈ A) = Σ_{y∈A} p_X(y),

i.e., the probability that X belongs to A is equal to the sum of the probability mass function evaluated at all of the elements of A.

Proof.
Since A can be written as a countable disjoint union of the singleton sets containing its elements, i.e.,

    A = ∪_{y∈A} {y},

we can use the countable additivity of probability distributions to conclude that

    P(X ∈ A) = Σ_{y∈A} P(X = y) = Σ_{y∈A} p_X(y).

In particular, it follows that the sum of the probability mass function over all of the points contained in E is equal to 1:

    1 = P(X ∈ E) = Σ_{x_i∈E} p_X(x_i).

We can also express the cumulative distribution function of a discrete random variable X in terms of its probability mass function:

    F_X(x) ≡ P(X ≤ x) = Σ_{x_i≤x} p_X(x_i).

This shows that the cumulative distribution function of a discrete random variable with values in the set E = {x_1, x_2, ...} is a step function which is piecewise constant on the intervals [x_i, x_{i+1}) with a jump of size p_X(x_i) at x = x_i. (See Figure 5.1 in Gregory (2005).)

Example 3.4. A random variable X is said to be a Bernoulli random variable with parameter p ∈ [0, 1] if X takes values in the set E = {0, 1} and the p.m.f. of X is equal to

    p_X(x) = 1 − p    if x = 0,
    p_X(x) = p        if x = 1.

We also say that X has a Bernoulli distribution with parameter p. Notice that the cumulative distribution function of X is the following step function:

    F_X(x) = 0        if x < 0,
    F_X(x) = 1 − p    if x ∈ [0, 1),
    F_X(x) = 1        if x ≥ 1.

Bernoulli random variables are often used to describe trials that can result in just two possible outcomes, e.g., a coin toss experiment can be represented by a Bernoulli random variable if we assign X = 1 whenever we get heads and X = 0 whenever we get tails.

3.1.2 Continuous random variables

Definition 3.5. A real-valued random variable X is said to be continuous if there is a function f_X : R → [0, ∞], called the probability density function of X, such that

    P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx

for every −∞ ≤ a ≤ b ≤ ∞.
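Definition 3.5 can be explored numerically. A sketch (the density f(x) = 2x on [0, 1] below is our own illustration, not an example from the text) that approximates P(a ≤ X ≤ b) with a midpoint Riemann sum:

```python
# P(a <= X <= b) as the integral of a density over [a, b], approximated
# with a midpoint Riemann sum.  f(x) = 2x on [0, 1] is an assumed example.
def integrate(f, a, b, n=10_000):
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

f = lambda x: 2 * x if 0.0 <= x <= 1.0 else 0.0

total = integrate(f, 0.0, 1.0)   # total mass: should be (approximately) 1
prob = integrate(f, 0.2, 0.5)    # P(0.2 <= X <= 0.5) = 0.5**2 - 0.2**2 = 0.21
print(total, prob)
```

For this density F(x) = x^2 on [0, 1], so the exact answer 0.21 is easy to verify by hand; the midpoint rule reproduces it essentially exactly because f is linear on the interval.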
If a random variable X is continuous, then its probability density function (or density for short) can be calculated by evaluating the following limit:

    f_X(x) = lim_{δx→0+} P(x ≤ X ≤ x + δx)/δx.    (3.1)

This shows that the probability density f_X(x) evaluated at a point x is not equal to the probability that X = x, but instead is the limit of the ratio of two probabilities as each tends to zero. This explains why densities can take on values that are greater than 1. On the other hand, there is a one-to-one relationship between the probability density function of a continuous random variable and the distribution of that variable. Our next proposition should be compared with Proposition 3.1.

Proposition 3.2. Suppose that X is a continuous random variable with density f_X. Then

    P(X ∈ A) = ∫_A f_X(x) dx

for any subset A ⊂ R for which the integral is well-defined.

The collection of subsets A ⊂ R for which the integral is well-defined is known as the Lebesgue σ-algebra and includes all countable unions and intersections of closed or open intervals. See Williams (1991) for a careful discussion of these issues.

By taking A = R, it follows that the integral of the density over the entire real line is equal to 1:

    1 = P(X ∈ R) = ∫_{−∞}^∞ f_X(x) dx.

On the other hand, if we take A = {y} for any y ∈ R, then

    P(X = y) = ∫_y^y f_X(x) dx = 0,

which shows that a continuous random variable has no atoms. This highlights an observation that may seem paradoxical at first: events with probability zero are not, in general, impossible. In this case, we know that the random variable X will take on some value y even though the probability of this event is zero: P(X = y) = 0.

Proposition 3.2 also shows how the probability density function and the cumulative distribution function are related. Since this relationship is so important in practice, we will state it as a corollary.

Corollary 3.1. Suppose that X is a continuous random variable with density f_X(x) and c.d.f. F_X(x).
Then F_X is an antiderivative of f_X, while f_X is the derivative of F_X:

    F_X(x) = P(X ≤ x) = ∫_{−∞}^x f_X(y) dy,
    f_X(x) = (d/dx) F_X(x).

Thus, if we know the density of a random variable, then we can find its cumulative distribution function by integration, while if we know the cumulative distribution function, then we can find the density function by differentiation.

Example 3.5. Let X be a continuous random variable with probability density function

    f_X(x) = C x^α    if x ∈ [0, 1],
    f_X(x) = 0        if x < 0 or x > 1.

Here α > −1 is a parameter which characterizes the distribution, while C is a constant which must be chosen so that the density integrates to 1:

    1 = ∫_0^1 C x^α dx = C/(α + 1).

In other words, we must take C = α + 1 to get a probability density. Notice that we must also take α > −1, since otherwise the integral diverges; however, values of α ∈ (−1, 0) are perfectly legitimate even though the density function diverges as x → 0.

3.1.3 Properties of the Cumulative Distribution Function

We have seen that the cumulative distribution function of a random variable can be either discontinuous or continuous depending on whether or not there are particular values that the random variable assumes with positive probability. However, there are several properties that are shared by every cumulative distribution function.

Proposition 3.3. Let X be a real-valued random variable (not necessarily discrete) with cumulative distribution function F(x) = P(X ≤ x). Then

1. F is non-decreasing, i.e., if x ≤ y, then F(x) ≤ F(y).
2. lim_{x→∞} F(x) = 1.
3. lim_{x→−∞} F(x) = 0.
4. F is right-continuous, i.e., for any x and any decreasing sequence (x_n, n ≥ 1) that converges to x, lim_{n→∞} F(x_n) = F(x).

Proof. 1.) If x ≤ y, then {X ≤ x} ⊂ {X ≤ y}, which implies that F(x) ≤ F(y).

2.) If (x_n, n ≥ 1) is an increasing sequence such that x_n → ∞, then the events E_n = {X ≤ x_n} form an increasing sequence with

    {X < ∞} = ∪_{n=1}^∞ E_n.

It follows from the continuity properties of probability measures (Lecture 2) that

    lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P(X < ∞) = 1.

3.) Likewise, if (x_n, n ≥ 1) is a decreasing sequence such that x_n → −∞, then the events E_n = {X ≤ x_n} form a decreasing sequence with

    ∅ = ∩_{n=1}^∞ E_n.

In this case, the continuity properties of measures imply that

    lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P(∅) = 0.

4.) If (x_n, n ≥ 1) is a decreasing sequence converging to x, then the sets E_n = {X ≤ x_n} also form a decreasing sequence with

    {X ≤ x} = ∩_{n=1}^∞ E_n.

Consequently,

    lim_{n→∞} F(x_n) = lim_{n→∞} P(E_n) = P(X ≤ x) = F(x).

Some other useful properties of cumulative distribution functions are listed below.

(i) P(a < X ≤ b) = F(b) − F(a) for all a < b.

(ii) Another application of the continuity property shows that the probability that X is strictly less than x is equal to

    P(X < x) = lim_{n→∞} P(X ≤ x − 1/n) = lim_{n→∞} F(x − 1/n).

In other words, P(X < x) is equal to the left limit of F at x, which is often denoted F(x−). However, notice that in general this limit need not be equal to F(x):

    F(x) = P(X < x) + P(X = x),

and so P(X < x) = F(x) if and only if P(X = x) = 0.

(iii) A converse of this proposition is also true: namely, if F : R → [0, 1] is a function satisfying properties (1)-(4), then F is the c.d.f. of some random variable X. We will prove a partial result in this direction later on in the course.

3.2 Expectations and Moments

Definition 3.6. Suppose that X is a real-valued random variable which is either discrete with probability mass function p_X(x) or continuous with probability density function f_X(x). Then the expected value of X is the weighted average of the values of X given by

    E[X] = Σ_{x_i} p_X(x_i) x_i        if X is discrete,
    E[X] = ∫_{−∞}^∞ f_X(x) x dx       if X is continuous,

provided that the sum or the integral exists.
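In the discrete case, Definition 3.6 translates directly into code. A sketch (ours) computing E[X] for the sum of two fair dice from Example 3.3, with exact arithmetic:

```python
# E[X] = sum_i p_X(x_i) x_i for the sum of two fair dice (Example 3.3),
# building the p.m.f. by enumerating the 36 equally likely outcomes.
from fractions import Fraction

p_X = {}
for i in range(1, 7):
    for j in range(1, 7):
        p_X[i + j] = p_X.get(i + j, Fraction(0)) + Fraction(1, 36)

mean = sum(p * x for x, p in p_X.items())
print(mean)   # 7
```

Using `Fraction` keeps the p.m.f. values exact (e.g., p_X(7) = 1/6) so that the weighted average comes out to exactly 7 rather than a floating-point approximation.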
The expected value of a random variable X may also be called the expectation, the mean, or the first moment of X, and is sometimes denoted ⟨X⟩. The latter notation is especially common in the physics and engineering literature and is also used by Gregory (2005).

Example 3.6. Let X be a Bernoulli random variable with parameter p (cf. Example 4.3). Then the expected value of X is
\[ E[X] = (1-p)\cdot 0 + p \cdot 1 = p. \]

Example 3.7. Let X be the outcome when we roll a fair die. Then
\[ E[X] = \sum_{i=1}^{6} \frac{1}{6} \cdot i = 3.5. \]

Example 3.8. Let X be the random variable described in Example 3.5, i.e., X is continuous with density (α + 1)x^α for x ∈ [0, 1], where α > −1. Then the expected value of X is equal to
\[ E[X] = \int_0^1 (\alpha+1)x^\alpha \cdot x\, dx = \frac{\alpha+1}{\alpha+2}. \]

Notice that in the first two examples, the random variable X cannot be equal to its expected value. In this sense, the name 'expected value' is somewhat misleading, and it could be argued that it would be better to refer to E[X] as the mean of X. On the other hand, we will later see that if X_1, X_2, ··· are independent, identically-distributed real-valued random variables with the same distribution as X, then
\[ \lim_{N\to\infty} \frac{1}{N} \sum_{i=1}^{N} X_i = E[X] \]
provided that X has a finite mean, i.e., the sample mean converges to the expected value as the size of the sample increases. This result is known as the strong law of large numbers.

One way to obtain new random variables is to compose a real-valued function with an existing random variable. For example, suppose that X is a discrete random variable that takes values in the set E = {x_1, x_2, ...} with probabilities p_X(x_i), and let g : E → ℝ be a real-valued function. Then Y = g(X) is a discrete random variable with values in the countable set g(E) = {y_1, y_2, ···}, and the probability mass function of Y is
\[ p_Y(y_i) \equiv P(Y = y_i) = \sum_{x_j : g(x_j) = y_i} p_X(x_j). \]
Notice that more than one value x_j may contribute to the probability of each point y_i if g is not one-to-one. Similarly, if X is a continuous random variable and g : ℝ → ℝ is a function, then Y = g(X) is also a real-valued random variable, although it need not be continuous.

Example 3.9. Suppose that X is a discrete random variable that takes values in the set {−1, 0, 1} with probabilities
\[ p_X(-1) = \frac{1}{4}, \qquad p_X(0) = \frac{1}{2}, \qquad p_X(1) = \frac{1}{4}. \]
Then Y = X² is a discrete random variable that takes values in the set {0, 1} with probabilities
\[ p_Y(0) = p_X(0) = \frac{1}{2}, \qquad p_Y(1) = p_X(-1) + p_X(1) = \frac{1}{2}, \]
and so
\[ E[X^2] = \frac{1}{2}\cdot 0 + \frac{1}{2}\cdot 1 = \frac{1}{2}. \]

The following proposition allows us to calculate the expectation E[g(X)] without having to first find the probability mass function or density function of the random variable Y = g(X). Although the identities contained in the proposition appear as a definition in Gregory (2005), in fact they must be derived from Definition 3.6. We will prove the result under the assumption that X is discrete; see Williams (1991) for a proof of the general case.

Proposition 3.4. Suppose that X is a real-valued random variable which is either discrete with probability mass function p_X(x) or continuous with probability density function f_X(x). Then, for any real-valued function g : ℝ → ℝ,
\[ E[g(X)] = \begin{cases} \sum_{x_i} p_X(x_i)\, g(x_i) & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} f_X(x)\, g(x)\,dx & \text{if } X \text{ is continuous,} \end{cases} \]
provided that the sum or the integral exists.

Proof. Suppose that X is a discrete random variable with values in the set E = {x_1, x_2, ···} and probability mass function p_X. If Y = g(X), then Y is also a discrete random variable which takes on distinct values {y_1, y_2, ···} with probabilities p_Y(y_j), and we can define
\[ E_j \equiv g^{-1}(y_j) = \{x_i \in E : g(x_i) = y_j\}. \]
Then the sets E_1, E_2, ··· are disjoint with union E = E_1 ∪ E_2 ∪ ···, and
\[ E[g(X)] = \sum_j p_Y(y_j)\cdot y_j = \sum_j \sum_{x_i \in E_j} p_X(x_i)\cdot y_j = \sum_j \sum_{x_i \in E_j} p_X(x_i)\cdot g(x_i) = \sum_{x_i \in E} p_X(x_i)\cdot g(x_i). \]

An important consequence of Proposition 3.4 is that expectations are linear.

Corollary 3.2. If a and b are constants, then E[aX + b] = aE[X] + b.

Proposition 3.4 can also be used to calculate the moments of a real-valued random variable.

Definition 3.7. The n'th moment of a real-valued random variable X is the quantity
\[ \mu'_n = E[X^n] = \begin{cases} \sum_{x_i} p_X(x_i)\, x_i^n & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} f_X(x)\, x^n\, dx & \text{if } X \text{ is continuous,} \end{cases} \]
provided that the sum or the integral exists. Similarly, if μ = E[X] is finite, then the n'th central moment of X is the quantity
\[ \mu_n = E[(X-\mu)^n] = \begin{cases} \sum_{x_i} p_X(x_i)(x_i-\mu)^n & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} f_X(x)(x-\mu)^n\, dx & \text{if } X \text{ is continuous,} \end{cases} \]
provided that the sum or the integral exists.

Notice that the first central moment vanishes, since the linearity of expectations implies that
\[ \mu_1 = E[X - \mu] = \mu - \mu = 0. \]
On the other hand, the second, third, and fourth central moments are sufficiently important in statistics and applied probability to warrant special names. The second central moment is the most familiar and is called the variance of the random variable. The variance is usually denoted σ² and can be evaluated using the following formula:
\[ \sigma^2 = E[(X-\mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2. \]
The square root of the variance is called the standard deviation. This is denoted σ and has the same units as X, whereas the variance has the units of X², e.g., if the units of X are cm, then the units of the standard deviation are also cm, whereas the units of the variance are cm². The third and fourth central moments are also useful in some instances.
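A small sketch (illustrative, not part of the notes) showing how Proposition 3.4 and the shortcut σ² = E[X²] − μ² work together for the discrete variable of Example 3.9:

```python
# pmf of X from Example 3.9: values -1, 0, 1 with probabilities 1/4, 1/2, 1/4.
pmf = {-1: 0.25, 0: 0.5, 1: 0.25}

def expect(g, pmf):
    """E[g(X)] computed directly from the pmf of X (Proposition 3.4),
    without first finding the pmf of Y = g(X)."""
    return sum(p * g(x) for x, p in pmf.items())

mu = expect(lambda x: x, pmf)          # first moment: E[X] = 0
second = expect(lambda x: x**2, pmf)   # second moment: E[X^2] = 1/2
variance = second - mu**2              # sigma^2 = E[X^2] - mu^2
assert mu == 0.0 and second == 0.5 and variance == 0.5
```

The same `expect` helper evaluates any moment by passing `g = lambda x: x**n`.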
The third central moment is known as the skewness and measures the asymmetry of the distribution of a random variable about its mean. A random variable that is perfectly symmetric about its mean has skewness μ₃ = 0, whilst a variable is said to be either right-skewed or left-skewed according as μ₃ > 0 or μ₃ < 0, respectively. The fourth central moment is known as the kurtosis and measures how peaked the distribution is around the mean. Since the skewness and kurtosis are dimensioned quantities, with units equal to the units of X raised to the third or fourth power, respectively, it is often preferred to replace them by dimensionless quantities obtained by dividing by the standard deviation σ raised to the same power. This leads us to define the following quantities:
\[ \text{Coefficient of skewness:}\quad \alpha_3 = \frac{\mu_3}{\sigma^3}, \qquad\qquad \text{Coefficient of kurtosis:}\quad \alpha_4 = \frac{\mu_4}{\sigma^4}. \]
See Figure 5.3 in Gregory (2005) for a selection of distributions with different degrees of skewness and kurtosis.

3.2.1 Moment generating functions

Moment generating functions have a number of important applications in probability theory. In this section, we will see how they can be used to calculate all of the moments of a random variable X in a single step. See the following sections for several examples of this approach.

Definition 3.8. If X is a real-valued random variable, then the moment generating function of X is the function m_X : ℝ → [0, ∞] defined by the formula
\[ m_X(t) = E\!\left[e^{tX}\right] = \begin{cases} \sum_{x_i} p_X(x_i)\, e^{t x_i} & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} f_X(x)\, e^{tx}\, dx & \text{if } X \text{ is continuous.} \end{cases} \]

Although the moment generating function is defined for all t ∈ ℝ, it is possible for it to diverge everywhere except at t = 0, where m_X(0) = 1 holds for any random variable X. On the other hand, if X is a bounded random variable, i.e., if there is a constant K < ∞ such that P(|X| ≤ K) = 1, then the moment generating function will be finite everywhere: m_X(t) < ∞ for all t ∈ ℝ. Furthermore, it can be shown that the distribution of a random variable is uniquely
determined by the moment generating function whenever that function is finite in some open interval containing 0. In particular, if two random variables have the same moment generating function and this function is finite in some open interval, then the two variables also have the same distribution.

To understand why the moment generating function has this name, suppose that the Taylor series of the moment generating function m_X(t) of a random variable X is convergent in some open interval (−a, a) containing 0. Then
\[ m_X(t) = E\!\left[e^{tX}\right] = E\!\left[\sum_{n=0}^{\infty} \frac{1}{n!}\, t^n X^n\right] = \sum_{n=0}^{\infty} \frac{1}{n!}\, \mu'_n\, t^n, \]
where μ'_n = E[X^n] is the n'th (non-central) moment of X and the last step requires that we be able to interchange the sum with the expectation. This shows that the moments of X can be found by first finding its moment generating function and then expanding it in a Taylor series and reading off the coefficients. In particular, the n'th moment of X is simply the n'th derivative of m_X(t) evaluated at t = 0:
\[ \mu'_n = m_X^{(n)}(0). \]
In some cases, this procedure may require less work than direct evaluation of the sum or the integral expression of the moment.

Example 3.10. Suppose that λ > 0 is a positive constant and let X be a continuous random variable with density f_X(x) = λe^{−λx} on [0, ∞) and f_X(x) = 0 for x < 0. (X is said to be an exponential random variable with parameter λ.) Then the moment generating function of X is
\[ m_X(t) = E\!\left[e^{tX}\right] = \int_0^{\infty} e^{tx}\, \lambda e^{-\lambda x}\, dx = \begin{cases} \dfrac{\lambda}{\lambda - t} & \text{if } t < \lambda, \\ \infty & \text{if } t \ge \lambda. \end{cases} \]
Since this is a smooth function on the open interval (−∞, λ), it can be expanded in a Taylor series as
\[ \frac{\lambda}{\lambda - t} = \sum_{n=0}^{\infty} \lambda^{-n} t^n, \]
and it follows that the n'th moment of X is simply μ'_n = λ^{−n} n!.

Chapter 4

A Menagerie of Distributions

4.1 The Binomial Distribution

Suppose that an experiment is repeated n times and that each trial has two possible outcomes, which we denote Q and Q̄.
Assume that the trials are independent of one another and that the probability of outcome Q is the same in each trial, say p. If we let n = 3 and we let Q_i be the proposition that the i'th trial results in outcome Q, then there are eight possibilities:
\[ Q_1 Q_2 Q_3,\quad \bar{Q}_1 Q_2 Q_3,\quad Q_1 \bar{Q}_2 Q_3,\quad Q_1 Q_2 \bar{Q}_3,\quad \bar{Q}_1 \bar{Q}_2 Q_3,\quad \bar{Q}_1 Q_2 \bar{Q}_3,\quad Q_1 \bar{Q}_2 \bar{Q}_3,\quad \bar{Q}_1 \bar{Q}_2 \bar{Q}_3. \]
We can calculate the probability of each of these possibilities using the product rule. For example, the probability of outcome Q̄_1, Q_2, Q_3 is equal to
\[ P(\bar{Q}_1, Q_2, Q_3) = P(\bar{Q}_1) \times P(Q_2 \mid \bar{Q}_1) \times P(Q_3 \mid \bar{Q}_1, Q_2) = P(\bar{Q}_1) \times P(Q_2) \times P(Q_3) = (1 - P(Q_1)) \times P(Q_2) \times P(Q_3) = p^2(1-p). \]
Similarly, the probabilities of outcomes Q_1, Q̄_2, Q_3 and Q_1, Q_2, Q̄_3 are
\[ P(Q_1, \bar{Q}_2, Q_3) = P(Q_1) \times P(\bar{Q}_2) \times P(Q_3) = p^2(1-p), \]
\[ P(Q_1, Q_2, \bar{Q}_3) = P(Q_1) \times P(Q_2) \times P(\bar{Q}_3) = p^2(1-p). \]
Upon calculating the probabilities of the other five possible outcomes for a sequence of three trials, we see that the probability of a particular outcome depends only on the number of trials that result in Q or Q̄, not on the order in which these occur.

Let X be a random variable that records the number of trials that result in outcome Q. Since there is exactly one way that all three trials can result in three Q's, three ways that they can result in exactly two Q's, three ways that they can result in exactly one Q, and one way that they can result in 0 Q's, it follows that the probability mass function of X is equal to
\[ p_X(0) = (1-p)^3, \qquad p_X(1) = 3p(1-p)^2, \qquad p_X(2) = 3p^2(1-p), \qquad p_X(3) = p^3. \]

In general, suppose that n independent trials are performed and let X be the number of trials that result in outcome Q. Then, by reasoning as in the above example, we can show that
\[ P(X = k) = \big(\text{number of ways that } k \text{ Q's can occur in } n \text{ trials}\big) \times p^k (1-p)^{n-k} = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}, \]
where the constants
\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \tag{4.1} \]
count the number of combinations of k elements from a set containing n elements. These are sometimes denoted {}_nC_k or C^n_k, depending on the source, and they are called binomial coefficients because they appear as coefficients of monomials in binomial expansions:
\[ (x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}. \tag{4.2} \]
This distribution occurs so frequently in probability theory and statistics that it has its own name.

Definition 4.1. A random variable X is said to have the binomial distribution with parameters n and p if it takes values in the set {0, 1, ···, n} with probability mass function
\[ p_X(k) \equiv P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \tag{4.3} \]
In this case, we write X ∼ Binomial(n, p) to indicate that X has this distribution, and we say that X is a binomial random variable.

The name of this distribution stems from the connection between the probabilities p_X(k) and the coefficients in a binomial expansion:
\[ \sum_{k=0}^{n} p_X(k) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + 1 - p)^n = 1. \]

Notice that a Bernoulli random variable with parameter p is also a binomial random variable with parameters n = 1 and p. In fact, the next theorem shows that there is a close relationship between Bernoulli distributions and binomial distributions even when n > 1.

Theorem 4.1. Let X_1, ···, X_n be independent Bernoulli random variables, each with the same parameter p. Then the sum X = X_1 + ··· + X_n is a binomial random variable with parameters n and p.

Proof. The random variable X counts the number of Bernoulli variables X_1, ···, X_n that are equal to 1, i.e., the number of successes in the n independent trials. Clearly X takes values in the set {0, ···, n}. To calculate the probability that X = k, let E be the event that X_{i_1} = X_{i_2} = ··· = X_{i_k} = 1 and X_j = 0 for all j ∉ {i_1, ···, i_k}. Then, because the Bernoulli variables are independent, we know that
\[ P(E) = p^k (1-p)^{n-k}. \]
However, there are
\(\binom{n}{k}\) such sets of indices i_1, ···, i_k, and these correspond to mutually exclusive events, so
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}. \]

Example 4.1. Suppose that a fair coin is tossed n times. Then the number of heads that appear is a binomial random variable with parameters n and p = 1/2.

Example 4.2. Sampling with replacement. Suppose that n balls are sampled with replacement from an urn which contains 3 red balls and 9 blue balls. If X denotes the number of red balls in the sample, then X ∼ Binomial(n, 0.25), since the probability of sampling a red ball on any given draw is 3/(3 + 9) = 0.25. More generally, if the urn contains r red balls and b blue balls, then X ∼ Binomial(n, p), where p = r/(r + b).

Example 4.3. Suppose that eye color in humans is under the control of a single gene. Individuals with genotypes BB or Bb have brown eyes, while individuals with genotype bb have blue eyes. Suppose that two brown-eyed individuals, both with genotype Bb, have a total of 4 children. Assuming Mendelian inheritance, what is the distribution of the number of blue-eyed offspring?

Answer: Let X_i equal 1 if the i'th child has blue eyes and 0 otherwise. Then X_i is a Bernoulli random variable with parameter p = 1/4, since a child will be blue-eyed only if they inherit a blue allele from each parent, which happens with probability 1/4. Since the genotypes of the different offspring are independent, the total number of blue-eyed children is binomial with parameters n = 4 and p = 1/4.

The mean and the variance of the binomial distribution can be calculated with the help of the moment generating function. If X ∼ Binomial(n, p), then the moment generating function of X is equal to
\[ m_X(t) \equiv E\!\left[e^{tX}\right] = \sum_{k=0}^{n} P(X=k)\, e^{tk} = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} e^{tk} = \sum_{k=0}^{n} \binom{n}{k} (pe^t)^k (1-p)^{n-k} = \left(pe^t + 1 - p\right)^n, \]
where the final identity follows from the binomial formula.
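The closed form just derived is easy to sanity-check numerically: for any t, summing p_X(k)e^{tk} term by term should reproduce (pe^t + 1 − p)^n. A sketch (the parameter values are arbitrary illustrations):

```python
import math

def binomial_pmf(k, n, p):
    """p_X(k) = C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def mgf_by_sum(t, n, p):
    """m_X(t) = sum over k of p_X(k) e^(t k), evaluated term by term."""
    return sum(binomial_pmf(k, n, p) * math.exp(t * k) for k in range(n + 1))

def mgf_closed_form(t, n, p):
    """m_X(t) = (p e^t + 1 - p)^n, from the binomial formula."""
    return (p * math.exp(t) + 1 - p) ** n

n, p = 10, 0.3
for t in (-1.0, 0.0, 0.5, 1.0):
    assert abs(mgf_by_sum(t, n, p) - mgf_closed_form(t, n, p)) < 1e-9
```

At t = 0 both expressions reduce to 1, as they must for any moment generating function.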
With the help of the chain rule and the product rule, differentiating twice with respect to t gives
\[ m'_X(t) = npe^t\left(pe^t + 1 - p\right)^{n-1}, \]
\[ m''_X(t) = npe^t\left(pe^t + 1 - p\right)^{n-1} + n(n-1)p^2 e^{2t}\left(pe^t + 1 - p\right)^{n-2}. \]
It follows that the first two moments of X are
\[ E[X] = m'_X(0) = np, \qquad E[X^2] = m''_X(0) = np + n(n-1)p^2, \]
while the variance of X is
\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = np(1-p). \]
Thus, as we might expect, the mean of X is an increasing function of both parameters, n and p, while the variance of X is an increasing function of n which, for fixed n, is largest when p = 1/2. Furthermore, the variance is 0 whenever p = 0 or p = 1, since in these cases we either have X = 0 with probability 1 or X = n with probability 1, i.e., every trial results in the same outcome.

The final result of this section asserts that the probability mass function of a binomial random variable is always peaked around its mean.

Proposition 4.1. Let X be a binomial random variable with parameters (n, p). Then P(X = k) is a unimodal function of k with its maximum at the largest integer k less than or equal to (n + 1)p.

Proof. A simple calculation shows that
\[ \frac{P(X = k)}{P(X = k-1)} = \frac{(n-k+1)p}{k(1-p)}, \]
which is greater than or equal to 1 if and only if k ≤ (n + 1)p.

4.1.1 Sampling and the hypergeometric distribution

In Example 4.2, we noted that if n balls are sampled with replacement from an urn containing r red balls and b blue balls, then the number of red balls in the sample is binomially distributed with parameters n and p = r/(r + b). It should be emphasized that the assumption that we are sampling with replacement is crucial here: the probability of sampling a red ball on the i'th draw is still p no matter which balls were previously sampled, because these are returned to the urn before the next ball is sampled.
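The with-replacement claim of Example 4.2 can itself be checked by simulation: repeated draws from the 3-red/9-blue urn should match the Binomial(n, 0.25) pmf. A sketch (the seed and trial count are arbitrary choices):

```python
import math
import random

random.seed(0)

def draw_with_replacement(n, reds=3, blues=9):
    """Number of red balls when n balls are drawn with replacement."""
    p_red = reds / (reds + blues)
    return sum(1 for _ in range(n) if random.random() < p_red)

n, trials = 5, 200_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[draw_with_replacement(n)] += 1

# Empirical frequencies should be close to the Binomial(5, 0.25) pmf.
for k in range(n + 1):
    expected = math.comb(n, k) * 0.25**k * 0.75 ** (n - k)
    assert abs(counts[k] / trials - expected) < 0.01
```

With 200,000 trials the sampling error in each frequency is on the order of 0.001, well inside the tolerance used above.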
Perhaps more often than not, sampling is done without replacement: e.g., when we 'sample' a population by mailing out a survey to a random selection of individuals, or when we enroll individuals in a clinical trial, we usually do not want to include 'duplicates' in our sample. In this case, the probability of sampling a particular type of object changes as we remove individuals from the population.

Consider the urn example again, but now suppose that we sample 3 balls without replacement from an urn containing 3 red balls and 9 blue balls, and let X be the number of red balls in our sample. If the order in which red and blue balls are sampled is considered, then there are eight possible outcomes: RRR, BRR, RBR, RRB, BBR, BRB, RBB, and BBB, which occur with probabilities:
\[ P(RRR) = P(R) \times P(R|R) \times P(R|RR) = \frac{3}{12}\cdot\frac{2}{11}\cdot\frac{1}{10} = \frac{3\cdot 2\cdot 1}{12\cdot 11\cdot 10} \]
\[ P(BRR) = P(B) \times P(R|B) \times P(R|BR) = \frac{9}{12}\cdot\frac{3}{11}\cdot\frac{2}{10} = \frac{3\cdot 2\cdot 9}{12\cdot 11\cdot 10} \]
\[ P(RBR) = P(R) \times P(B|R) \times P(R|RB) = \frac{3}{12}\cdot\frac{9}{11}\cdot\frac{2}{10} = \frac{3\cdot 2\cdot 9}{12\cdot 11\cdot 10} \]
\[ P(RRB) = P(R) \times P(R|R) \times P(B|RR) = \frac{3}{12}\cdot\frac{2}{11}\cdot\frac{9}{10} = \frac{3\cdot 2\cdot 9}{12\cdot 11\cdot 10} \]
\[ P(BBR) = P(B) \times P(B|B) \times P(R|BB) = \frac{9}{12}\cdot\frac{8}{11}\cdot\frac{3}{10} = \frac{3\cdot 9\cdot 8}{12\cdot 11\cdot 10} \]
\[ P(BRB) = P(B) \times P(R|B) \times P(B|BR) = \frac{9}{12}\cdot\frac{3}{11}\cdot\frac{8}{10} = \frac{3\cdot 9\cdot 8}{12\cdot 11\cdot 10} \]
\[ P(RBB) = P(R) \times P(B|R) \times P(B|RB) = \frac{3}{12}\cdot\frac{9}{11}\cdot\frac{8}{10} = \frac{3\cdot 9\cdot 8}{12\cdot 11\cdot 10} \]
\[ P(BBB) = P(B) \times P(B|B) \times P(B|BB) = \frac{9}{12}\cdot\frac{8}{11}\cdot\frac{7}{10} = \frac{9\cdot 8\cdot 7}{12\cdot 11\cdot 10}. \]
As with our previous example, the probability of each outcome depends only on the numbers of red and blue balls contained in the sample, not on the order in which they appear. Thus the probability mass function of X is equal to
\[ p_X(k) = \binom{3}{k}\binom{9}{3-k}\Big/\binom{12}{3}, \qquad k = 0, 1, 2, 3. \]
More generally, suppose that n balls are sampled without replacement from an urn containing r red balls and b blue balls, and let X be the number of red balls contained in the sample.
Clearly n cannot be larger than the number of balls contained in the urn, so n ≤ r + b. Furthermore, the possible values that X can assume are the integers between max{0, n − b} and min{n, r}. To calculate P(X = k), where k lies between these two bounds, we first observe that the probability that k red balls are sampled first, followed by n − k blue balls, is
\[ \frac{r}{r+b}\cdot\frac{r-1}{r+b-1}\cdots\frac{r-(k-1)}{r+b-(k-1)}\cdot\frac{b}{r+b-k}\cdot\frac{b-1}{r+b-(k+1)}\cdots\frac{b-(n-k-1)}{r+b-(n-1)} = \frac{r!}{(r-k)!}\cdot\frac{b!}{(b-(n-k))!}\Big/\frac{(r+b)!}{(r+b-n)!}. \]
However, since there are \(\binom{n}{k}\) different orders in which the k red balls can appear in the sample, it follows that the probability that the sample contains k red balls is equal to
\[ P(X = k) = \binom{n}{k}\cdot\frac{r!}{(r-k)!}\cdot\frac{b!}{(b-(n-k))!}\Big/\frac{(r+b)!}{(r+b-n)!} = \frac{r!}{k!(r-k)!}\cdot\frac{b!}{(n-k)!(b-(n-k))!}\Big/\frac{(r+b)!}{n!(r+b-n)!} = \frac{\binom{r}{k}\binom{b}{n-k}}{\binom{r+b}{n}}. \]
Like the binomial distribution, this distribution is of sufficient importance that it has its own name.

Definition 4.2. A random variable X is said to have the hypergeometric distribution with parameters (N, m, n) if X takes values in the set {max{0, n − (N − m)}, ···, min{n, m}} with probability mass function
\[ p_X(k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}. \tag{4.4} \]
In this case, we write X ∼ Hypergeometric(N, m, n).

Unfortunately, the moment generating function is not of much use in this case. (It can be expressed in terms of a hypergeometric function, hence the name of the distribution.) However, by direct calculation it can be shown that the mean and the variance of a hypergeometric random variable are
\[ E[X] = n\,\frac{m}{N}, \qquad \mathrm{Var}(X) = n\,\frac{m}{N}\left(1 - \frac{m}{N}\right)\frac{N-n}{N-1}. \]
In other words, if we let p = m/N be the probability of a 'success', then X has the same mean np as a binomially distributed random variable with parameters (n, p), but in general it has a smaller variance, since
\[ n\,\frac{m}{N}\left(1 - \frac{m}{N}\right)\frac{N-n}{N-1} \le np(1-p). \]
On the other hand, the hypergeometric distribution with parameters (N, m, n) will be almost the same as the binomial distribution with parameters (n, p) when p = m/N and N and m are both much larger than n. To make this result precise, suppose that X_t is a sequence of hypergeometric random variables with parameters (N_t, m_t, n), where p = m_t/N_t ∈ (0, 1) for all t ≥ 1 and N_t → ∞ and m_t → ∞ as t → ∞. Then
\[ P(X_t = k) = \binom{n}{k}\cdot\frac{m_t}{N_t}\cdots\frac{m_t-(k-1)}{N_t-(k-1)}\cdot\frac{N_t-m_t}{N_t-k}\cdots\frac{N_t-m_t-(n-k-1)}{N_t-(n-1)} \;\to\; \binom{n}{k} p^k (1-p)^{n-k} \]
as t → ∞. Intuitively, sampling a small number of balls without replacement from an urn containing a very large number of balls is almost equivalent to sampling with replacement, since in the latter case we are unlikely to ever sample the same ball more than once anyway. For this reason, the binomial distribution is often regarded as an adequate approximation to the hypergeometric distribution when analyzing data that was generated by sampling without replacement, provided that the sample size is much smaller than the population size.

4.2 The Poisson Distribution

Suppose that we perform a large number of independent trials, say n ≥ 100, and that the probability of a success on any one trial is small, say p = λ/n ≪ 1. If we let X denote the total number of successes that occur in all n trials, then we know from the last lecture that X ∼ Binomial(n, p) and therefore
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{(n-k)!\,k!}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k}\cdot\frac{\lambda^k}{k!}\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k} \approx \frac{\lambda^k}{k!}\, e^{-\lambda}, \]
where the approximation in the last step can be justified by the following three limits:
\[ \lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^n = e^{-\lambda}, \qquad \lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^{-k} = 1, \qquad \lim_{n\to\infty}\frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} = 1. \]
Furthermore, since these approximations sum to 1 when k ranges over the non-negative integers,
\[ e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k}{k!} \]
= e^{−λ} e^{λ} = 1, it follows that the function
\[ p(k) = e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad k \ge 0, \]
defines a probability distribution on the non-negative integers. Like the binomial distribution, this distribution is so important in probability theory and statistics (and even in the sciences) that it has its own name, which is taken from that of the 19th-century French mathematician Siméon Denis Poisson (1781–1840).

Definition 4.3. A random variable X is said to have the Poisson distribution with parameter λ ≥ 0 if X takes values in the non-negative integers with probability mass function
\[ p_X(k) = P(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}. \]
In this case we write X ∼ Poisson(λ).

To calculate the mean and the variance of this distribution, we first identify its moment generating function:
\[ m_X(t) = E\!\left[e^{tX}\right] = \sum_{k=0}^{\infty} p_X(k)\, e^{tk} = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^k e^{tk}}{k!} = e^{-\lambda}\, e^{\lambda e^t} = e^{\lambda(e^t - 1)}. \]
Differentiating twice with respect to t gives
\[ m'_X(t) = \lambda e^{t + \lambda(e^t - 1)}, \qquad m''_X(t) = \lambda\left(1 + \lambda e^t\right) e^{t + \lambda(e^t - 1)}, \]
and we then find
\[ E[X] = m'_X(0) = \lambda, \qquad \mathrm{Var}(X) = m''_X(0) - (E[X])^2 = \lambda. \]
Thus, the parameter λ is equal to both the mean and the variance of the Poisson distribution.

It has long been recognized that the Poisson distribution provides a surprisingly accurate model for the statistics of a large number of seemingly unrelated phenomena. Some examples include:

• the number of misprints per page of a book;
• the number of wrong telephone numbers dialed in a day;
• the number of customers entering a post office per day;
• the number of vacancies (per year) on the US Supreme Court (Table 4.1);
• the number of sites that are mutated when a gene is replicated;
• the number of α-particles discharged per day by a ¹⁴C source;
• the number of major earthquakes per year in a region.

Table 4.1: Data from Cole (2010) compared with a Poisson distribution with λ = 0.5.
Number of       Probability    Years with x vacancies
vacancies (x)                  1837-1932             1933-2007
                               Observed  Expected   Observed  Expected
0               0.6065         59        58.2       47        45.5
1               0.3033         27        29.1       21        22.7
2               0.0758         9         7.3        7         5.7
3               0.0126         1         1.2        0         1.0
≥ 4             0.0018         0         0.2        0         0.1

Since these phenomena are generated by very different physical and biological processes, the fact that they share similar statistical properties cannot be explained by the specific mechanisms that operate in each instance. Instead, the widespread emergence of the Poisson distribution appears to be a consequence of a more general mathematical result. As we saw at the beginning of this section, if the probability of success per trial is small, then the total number of successes that occur when a large number of trials is performed is approximately Poisson distributed with parameter λ = np. This result is sometimes known as the Law of Rare Events because it describes the statistics of count data for events that have a small probability of occurring, i.e., that are rare. In fact, as Theorem 4.2 illustrates, we can relax the assumption that every trial is equally likely to result in a success, provided that certain other conditions are satisfied. This is an example of the so-called Poisson paradigm.

Theorem 4.2. For each n ≥ 1, let X_{n,1}, ···, X_{n,n} be a family of independent Bernoulli random variables with parameters p_{n,1}, ···, p_{n,n}. Suppose that
\[ \lim_{n\to\infty}\max_{1\le i\le n} p_{n,i} = 0 \qquad \text{and} \qquad \lim_{n\to\infty}\sum_{i=1}^{n} p_{n,i} = \lambda. \]
Then
\[ \lim_{n\to\infty} P(X_{n,1} + \cdots + X_{n,n} = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}. \]

Remark: The first condition guarantees that each event is individually unlikely to occur, while the second condition implies that the mean number of events that do occur is approximately λ when n is large.

To see how the law of rare events might explain Poisson-distributed count data, consider the number of misprints per page of a book.
We can regard each word printed on a given page as a trial, which can result either in the correct spelling of the word or in a misprint. Assuming that the typist or editor is at least somewhat competent, the probability that any one word is a misprint is likely to be small, while the number of words per page is probably on the order of 250. Furthermore, barring catastrophic failures, the presence or absence of misprints is probably nearly independent between words. Under these conditions, the law of rare events tells us that the total number of misprints per page should be approximately Poisson distributed.

Example 4.4. If, on average, a healthy adult contracts three colds per year, what is the probability that they will contract at least two colds in a given year? To answer this, we will assume that the number of colds contracted per year by a healthy adult is Poisson distributed with parameter λ = 3. If we let X denote this random variable, then
\[ P(X \ge 2) = 1 - P(X < 2) = 1 - P(X = 0) - P(X = 1) = 1 - e^{-3}\frac{3^0}{0!} - e^{-3}\frac{3^1}{1!} \approx 0.801, \]
where 0! = 1 by convention.

4.2.1 Fluctuation tests and the origin of adaptive mutations

One of the classic experiments of molecular genetics is the fluctuation test, which was developed by Salvador Luria and Max Delbrück in 1943 in the course of their investigations of the mechanisms that give rise to adaptive mutations. An adaptive mutation is one that increases the fitness of the organism that carries it. Specifically, Luria and Delbrück wanted to determine whether adaptive mutations occur spontaneously or are instead directly induced by the environmental conditions to which they are adapted. To do this, they developed an experimental system based on a bacterium, Escherichia coli, along with a type of virus known as a T1 bacteriophage that infects this bacterium. When T1 phage is added to a culture of E.
coli, most of the bacteria are killed, but a few resistant cells usually survive and give rise to resistant colonies that can be seen on the surface of a petri dish. In other words, resistance to T1 phage is a trait that varies across E. coli bacteria and which is heritable, i.e., the descendants of resistant bacteria are usually themselves resistant. The fluctuation test carried out by Luria and Delbrück was based on the following procedures:

1. A single E. coli culture was initiated from a single T1-susceptible cell and allowed to grow to a population containing millions of bacteria. Several small samples were taken from this culture and spread on agar plates that had also been inoculated with the T1 phage. These plates were left for a period and the number of resistant colonies was counted.

2. The procedure described in (1) was repeated several times, using independently established E. coli cultures, and the resulting data were used to estimate both the mean and the variance of the number of resistant colonies arising in each culture.

Luria and Delbrück then reasoned as follows. If resistance mutations do not arise spontaneously but instead are induced by the presence of T1 phage, then, because the mutation rate per cell is low (as shown by the data) and because there are a large number of cells per agar plate that could mutate and give rise to mutant colonies, the total number of mutant colonies generated in each trial will be approximately Poisson distributed. In particular, under these assumptions, the mean and the variance of the number of mutant colonies generated in each trial will be approximately equal. On the other hand, if mutations occur spontaneously during expansion of the bacterial populations used in each trial, then the proportion of resistant cells in each culture prior to T1 plating will depend on the timing of the mutation relative to the expansion of the population from a single cell.
In particular, if the mutation occurs at an early stage of the expansion, then it will be inherited by many of the cells that are spread onto the plates containing the T1 phage, and there will be many resistant colonies. However, if the mutation occurs at a late stage of the expansion, then it will be inherited by few cells, and so there will be few resistant colonies. It can be shown that under this scenario the number of mutant colonies does not follow the Poisson distribution and has a variance that is much larger than the mean.

Luria and Delbrück repeated this experiment several times and found that the ratio of the sample variance to the sample mean ranged from about 4 to 200. In no case was the sample variance similar to the sample mean, and additional controls were performed that showed that sampling error could not account for the discrepancy between the two. In other words, the data provide strong evidence against the hypothesis that resistance mutations are induced by the T1 phage itself, and are consistent with the predictions of the hypothesis that these mutations arise spontaneously prior to exposure to the virus. Similar results have emerged from the many studies of mutation carried out after Luria and Delbrück's original work, using different organisms and different selection pressures, and indeed these findings are consistent with our current understanding of the molecular basis of mutation.

4.2.2 Poisson processes

Poisson distributions can also be used to model scenarios in which events occur at a 'constant rate' through space or time. Consider a sequence of events that occur repeatedly through time (such as earthquakes or mutations to DNA) and assume that the following properties are satisfied:

• Property 1: Events occur at rate λ per unit time, i.e., the probability that exactly one event occurs in a given interval [t, t + h] is equal to λh + o(h).
• Property 2: Events are not clustered, i.e., the probability that two or more events occur in an interval of length $h$ is $o(h)$.

• Property 3: Independence. For any set of non-overlapping intervals $[s_1, t_1], [s_2, t_2], \cdots, [s_r, t_r]$, the numbers of events occurring in these intervals are independent.

Here we have used Landau's little-o notation: if $f$ and $g$ are real-valued functions on $[0, \infty)$, we write $f(h) = g(h) + o(h)$ if
\[ \lim_{h \to 0} \frac{f(h) - g(h)}{h} = 0. \]
In particular, we write $f(h) = o(h)$ if $f(h)/h$ converges to 0 as $h \to 0$. For example, if $f(h) = h^2$, then $f(h) = o(h)$.

Let $N(t)$ denote the number of events that occur in the interval $[0, t]$. Our goal is to calculate the distribution of $N(t)$, which we can do by breaking the interval $[0, t]$ into $n$ non-overlapping subintervals, each of length $t/n$. Then
\[ P(N(t) = k) = P(A_n) + P(B_n), \]
where
\[ A_n = \{ k \text{ subintervals each contain exactly one event, while the others contain no events} \}, \]
\[ B_n = \{ N(t) = k \text{ and at least one subinterval contains } \geq 2 \text{ events} \}. \]
Notice that $A_n$ and $B_n$ are mutually exclusive. We first show that $P(B_n)$ can be made arbitrarily small by taking $n$ sufficiently large. Indeed,
\[ P(B_n) \leq P\{\text{at least one subinterval contains two or more events}\} \leq \sum_{k=1}^n P\{\text{the } k\text{'th subinterval contains two or more events}\} \leq \sum_{k=1}^n o\!\left(\frac{t}{n}\right) = n \cdot o\!\left(\frac{t}{n}\right) = t \cdot \frac{o(t/n)}{t/n}. \]
However, since $t/n \to 0$ as $n \to \infty$, it follows from the definition of $o(h)$ that this last expression vanishes as $n \to \infty$, which then implies that
\[ \lim_{n \to \infty} P(B_n) = 0. \]
Next, using Properties 1 and 2, observe that
\[ P\{\text{an interval of length } t/n \text{ contains one event}\} = \frac{\lambda t}{n} + o\!\left(\frac{t}{n}\right), \]
\[ P\{\text{an interval of length } t/n \text{ contains no events}\} = 1 - \frac{\lambda t}{n} + o\!\left(\frac{t}{n}\right). \]
Also, using Property 3, we have
\[ P(A_n) = \binom{n}{k} \left( \frac{\lambda t}{n} + o\!\left(\frac{t}{n}\right) \right)^{k} \left( 1 - \frac{\lambda t}{n} + o\!\left(\frac{t}{n}\right) \right)^{n-k} \to e^{-\lambda t} \frac{(\lambda t)^k}{k!} \quad \text{as } n \to \infty, \]
since there are $\binom{n}{k}$ ways for the $k$ events to be distributed across the $n$ subintervals such that each subinterval contains either 0 or 1 event.
Putting these results together shows that
\[ P(N(t) = k) = \lim_{n \to \infty} \big( P(A_n) + P(B_n) \big) = e^{-\lambda t} \frac{(\lambda t)^k}{k!}. \]
In other words, for each $t \geq 0$, the number of events $N(t)$ to have occurred by time $t$ is a Poisson random variable with parameter $\lambda t$. The collection of random variables $(N(t) : t \geq 0)$ is said to be a Poisson process with intensity $\lambda$. Such processes are used to model a wide variety of phenomena such as disease transmissions, earthquakes, and the arrival of traffic at stoplights.

4.3 Random Waiting Times

4.3.1 The geometric and exponential distributions

Suppose that a sequence of independent Bernoulli trials is performed and that each trial has probability $p \in (0, 1]$ of resulting in a success. If $T$ denotes the number of trials performed until the first success is recorded, then $T$ is a random variable that takes values in the positive integers $\{1, 2, \cdots\}$ with probability mass function
\[ p_T(k) = (1-p)^{k-1} p, \quad k \geq 1. \]
Indeed, $T = k$ if and only if the first $k-1$ trials all result in failures and the $k$'th trial results in a success. A random variable with this distribution is said to have the geometric distribution with success probability $p$, in which case we write $T \sim \mathrm{Geometric}(p)$. The moment generating function of $T$ is
\[ m_T(t) = \sum_{k=1}^\infty (1-p)^{k-1} p \, e^{tk} = p e^t \sum_{k=1}^\infty \big( (1-p) e^t \big)^{k-1} = \begin{cases} \dfrac{p e^t}{1 - (1-p)e^t} & \text{if } t < -\ln(1-p) \\ \infty & \text{otherwise,} \end{cases} \]
with derivatives
\[ m_T'(t) = \frac{p e^t}{(1 - (1-p)e^t)^2}, \qquad m_T''(t) = \frac{-p e^t}{(1 - (1-p)e^t)^2} + \frac{2 p e^t}{(1 - (1-p)e^t)^3}. \]
It follows that the mean and the variance of $T$ are equal to
\[ E[T] = \frac{1}{p}, \qquad \mathrm{Var}(T) = \frac{1-p}{p^2}. \]
As expected, the smaller $p$ is, the more trials we must conduct on average to obtain the first success.

Example 4.5. The probability of a mutation in a eukaryotic genome is approximately $\mu = 10^{-8}$ per site per generation. Thus it takes, on average, approximately $\mu^{-1} = 10^8$ generations for a particular site to mutate.
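The mean and variance formulas for the geometric distribution are easy to check by simulation. The following Python sketch (not part of the original notes; the function name `geometric_sample` and the values of `p` and the sample size are illustrative choices) samples $T$ by counting Bernoulli($p$) trials until the first success and compares the empirical mean and variance with $1/p$ and $(1-p)/p^2$.

```python
import random

def geometric_sample(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

rng = random.Random(0)
p = 0.2
n = 200_000
samples = [geometric_sample(p, rng) for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

# Theory: E[T] = 1/p = 5 and Var(T) = (1-p)/p^2 = 20
print(mean, var)
```

With 200,000 samples, the empirical mean and variance should match the theoretical values to within a few percent.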
To put this time span into perspective, the earliest undisputed animal fossils date to the Cambrian explosion about 540 MYA, while the dinosaurs first appeared in the Triassic about 230 MYA and the non-avian dinosaurs went extinct at the end of the Cretaceous around 65 MYA. Also, because eukaryotic genomes generally contain many millions to billions of nucleotides, these typically mutate in every generation.

As illustrated in the preceding example, the geometric distribution is often used to model the random waiting time until the first "success" occurs. This makes the most sense if the trials are performed at a constant rate, e.g., one trial per day, in which case $T$ will have the same units as the time between trials. However, if the probability of success per trial is very small, then this waiting time will typically be very large, in which case the following approximation can be used.

Theorem 4.3. Let $(\epsilon_n : n \geq 1)$ be a sequence of positive numbers tending to 0 and for each $n \geq 1$ let $T_n$ be a geometric random variable with parameter $\epsilon_n$. Then, for every $t \geq 0$,
\[ \lim_{n \to \infty} P\{\epsilon_n T_n \leq t\} = 1 - e^{-t}. \]

Proof. The result follows from the calculation
\[ P\{\epsilon_n T_n \leq t\} = 1 - P\left\{ T_n > \frac{t}{\epsilon_n} \right\} = 1 - (1 - \epsilon_n)^{\lfloor t/\epsilon_n \rfloor} \to 1 - e^{-t} \quad \text{as } n \to \infty, \]
where we have used the following result:
\[ \lim_{n \to \infty} (1 + a_n)^{b_n} = e^c \]
whenever $(a_n : n \geq 1)$ and $(b_n : n \geq 1)$ are sequences of real numbers such that $a_n \to 0$, $b_n \to \infty$, and $a_n b_n \to c$ as $n$ tends to infinity.

Theorem 4.3 not only provides us with an approximation for the geometric distribution, but it also leads to a new distribution that is interesting in its own right. Since the function $F(t) = 1 - e^{-t}$ is nondecreasing and differentiable with $F(0) = 0$ and $F(t) \to 1$ as $t \to \infty$, it is the cumulative distribution function of a continuous random variable $X$ with probability density $p_X(t) = F'(t) = e^{-t}$ for $t \geq 0$. This distribution is so important that it has been named.

Definition 4.4.
A continuous random variable $X$ is said to have the exponential distribution with parameter $\lambda$ if the density of $X$ is
\[ p_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases} \]
In this case we write $X \sim \mathrm{Exponential}(\lambda)$.

Theorem 4.3 illustrates an important theme in probability theory, which is that discrete random variables that take on large integer values can often be approximated by continuous random variables. In this particular case, $T_n$ will typically assume values that are of the same order of magnitude as its mean, $E[T_n] = \epsilon_n^{-1}$, which will be large whenever $\epsilon_n$ is small. In contrast, if we normalize $T_n$ by dividing by its mean, then we obtain a random variable $\epsilon_n T_n$ which is no longer integer-valued but which has mean $E[\epsilon_n T_n] = 1$ and typically assumes values that are of the same order of magnitude. Notice that we can also interpret $\epsilon_n T_n$ as the time to the first success, where time is measured in units proportional to $\epsilon_n^{-1}$.

Previously (Example 4.7) we showed that the moment generating function of $X \sim \mathrm{Exponential}(\lambda)$ is
\[ m_X(t) = \begin{cases} \dfrac{\lambda}{\lambda - t} & \text{if } t < \lambda \\ \infty & \text{if } t \geq \lambda, \end{cases} \]
which implies that the mean and the variance are
\[ E[X] = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}. \]
Thus, the mean of an exponential random variable is equal to the reciprocal of its parameter, while the variance is equal to the mean squared.

Like the geometric distribution, the exponential distribution is often used to model the random time that elapses until some event occurs. With this interpretation, the parameter $\lambda$ is often referred to as the rate parameter. An important property of the exponential distribution is that it is memoryless: if $X$ is exponentially distributed with parameter $\lambda$, then for all $t, s \geq 0$,
\[ P\{X > t + s \,|\, X > t\} = P\{X > s\}. \tag{4.5} \]
To verify this, use the CDF calculated above to obtain $P\{X > t\} = 1 - P\{X \leq t\} = e^{-\lambda t}$, and then calculate
\[ P\{X > t + s \,|\, X > t\} = \frac{P\{X > t + s\}}{P\{X > t\}} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P\{X > s\}. \]
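The memoryless property can be checked both analytically, using the survival function $S(u) = e^{-\lambda u}$, and empirically from simulated exponential variates. A minimal Python sketch (the values $\lambda = 1.5$, $t = 2$, $s = 0.7$ and the sample size are arbitrary illustrative choices):

```python
import math
import random

lam = 1.5
t, s = 2.0, 0.7

# Analytic check: P(X > t+s | X > t) = S(t+s)/S(t) should equal P(X > s) = S(s)
surv = lambda u: math.exp(-lam * u)
assert abs(surv(t + s) / surv(t) - surv(s)) < 1e-12

# Empirical check: among samples exceeding t, what fraction exceeds t + s?
rng = random.Random(1)
samples = [rng.expovariate(lam) for _ in range(300_000)]
exceed_t = [x for x in samples if x > t]
cond = sum(x > t + s for x in exceed_t) / len(exceed_t)
print(cond, surv(s))
```

The conditional frequency `cond` should agree with the unconditional probability $e^{-\lambda s}$ up to sampling noise.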
In other words, if we think of $X$ as being the random time until some event occurs, then conditional on the event not having occurred by time $t$, i.e., on $X > t$, the distribution of the time remaining until the event does occur, $X - t$, is also exponential with parameter $\lambda$. It is as if the process determining whether the event occurs between times $t$ and $t + s$ has no 'memory' of the amount of time that has elapsed up to that point.

This observation also sheds further light on why $\lambda$ is called the rate parameter. Indeed,
\[ P(t < X \leq t + \delta t \,|\, X > t) = \frac{e^{-\lambda t} - e^{-\lambda(t + \delta t)}}{e^{-\lambda t}} = 1 - e^{-\lambda \, \delta t} \approx \lambda \, \delta t, \]
provided $0 < \delta t \ll 1$. This shows that the probability that the event occurs in a short time interval $(t, t + \delta t]$, given that it has not already occurred by time $t$, is approximately proportional to the rate $\lambda$ times the duration of the interval $\delta t$. Furthermore, because this probability does not depend on $t$, the rate is constant in time. Remarkably, these properties also uniquely characterize the exponential distribution.

Proposition 4.2. Suppose that $X$ is a continuous random variable with values in $[0, \infty)$. Then $X$ is memoryless if and only if $X$ is exponentially distributed for some parameter $\lambda > 0$.

As an aside, the geometric distribution could also be said to be memoryless provided we restrict to integer-valued times. To see this, let $T \sim \mathrm{Geometric}(p)$, recall that $P(T > n) = (1-p)^n$, and observe that for all positive integers $n, m \geq 1$,
\[ P(T > n + m \,|\, T > n) = \frac{P(T > n + m)}{P(T > n)} = \frac{(1-p)^{n+m}}{(1-p)^{n}} = (1-p)^{m} = P(T > m), \]
which is clearly a discrete analogue of (4.5). This is not surprising given our earlier observation that the exponential distribution arises as a limiting case of the geometric distribution.

4.3.2 The gamma distribution

Although the memorylessness of the exponential distribution is often a convenient mathematical property, there are many temporal phenomena in nature that occur at rates which vary over time and which therefore are not memoryless.
For example, because mortality is age-dependent in many species, e.g., the very young and the very old are usually much more likely to die in a given period than are adult individuals, the exponential distribution does not provide a good statistical model for the life spans of individuals of these species. For this reason, additional continuous distributions on the non-negative real numbers are needed, and one of the most important families of such distributions is the gamma distribution.

Definition 4.5. A random variable $X$ is said to have the gamma distribution with shape parameter $\alpha > 0$ and rate parameter $\lambda > 0$ if its density is
\[ p(x) = \begin{cases} \dfrac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases} \]
where the gamma function $\Gamma(\alpha)$ is defined (for $\alpha > 0$) by the formula
\[ \Gamma(\alpha) = \int_0^\infty y^{\alpha-1} e^{-y} \, dy. \]
In this case, we write $X \sim \mathrm{Gamma}(\alpha, \lambda)$.

Notice that the exponential distribution with parameter $\lambda$ is identical to the gamma distribution with parameters $(1, \lambda)$. Later we will show that if $X_1, \cdots, X_n$ are independent exponential random variables, each with parameter $\lambda$, then the sum $X = X_1 + \cdots + X_n$ is a gamma random variable with parameters $(n, \lambda)$. Thus, gamma-distributed random variables are sometimes used as statistical models for lifespans that are determined by a sequence of independent events which occur at constant rate (e.g., transformation of a lineage of cells following the accumulation of mutations that disrupt regulation of the cell cycle). Furthermore, unlike the exponential distribution, the gamma distribution is not (in general) memoryless, and its mode is positive whenever the shape parameter is greater than 1.

Before we can calculate the moments of $X$, we need to spend a little time investigating the gamma function, which is important in its own right. Integration by parts shows that for $\alpha > 1$, the gamma function satisfies the following important recursion:
\[ \Gamma(\alpha) = \Big[ -y^{\alpha-1} e^{-y} \Big]_0^\infty + (\alpha - 1) \int_0^\infty y^{\alpha-2} e^{-y} \, dy = (\alpha - 1)\Gamma(\alpha - 1). \]
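The recursion $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$ can be spot-checked numerically with Python's `math.gamma`; the sample values of $\alpha$ below are arbitrary illustrative choices:

```python
import math

# Check the recursion Gamma(alpha) = (alpha - 1) * Gamma(alpha - 1)
for alpha in [1.5, 2.0, 3.7, 10.0]:
    lhs = math.gamma(alpha)
    rhs = (alpha - 1) * math.gamma(alpha - 1)
    assert abs(lhs - rhs) < 1e-9 * abs(lhs)

# Together with Gamma(1) = 1, the recursion gives Gamma(n) = (n-1)! for integers
for n in range(1, 8):
    assert abs(math.gamma(n) - math.factorial(n - 1)) < 1e-9

print("recursion verified")
```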
In particular, if $\alpha = n$ is a positive integer, then because
\[ \Gamma(1) = \int_0^\infty e^{-y} \, dy = 1, \]
we obtain
\[ \Gamma(n) = (n-1)\Gamma(n-1) = (n-1)(n-2)\Gamma(n-2) = \cdots = (n-1)(n-2)\cdots 3 \cdot 2 \cdot \Gamma(1) = (n-1)! \tag{4.6} \]
which shows that the gamma function is a continuous interpolation of the factorial function. In fact, by the Bohr-Mollerup theorem, the gamma function is the unique log-convex interpolation of the factorial function on $(0, \infty)$.

Equation (4.6) can be used to calculate the moments of the gamma distribution. If $X$ has the gamma distribution with parameters $(\alpha, \lambda)$, then
\[ E[X^n] = \int_0^\infty x^n \frac{\lambda^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x} \, dx = \lambda^{-n} \frac{\Gamma(n+\alpha)}{\Gamma(\alpha)} \int_0^\infty \frac{\lambda^{n+\alpha}}{\Gamma(n+\alpha)} x^{n+\alpha-1} e^{-\lambda x} \, dx = \lambda^{-n} \prod_{k=0}^{n-1} (\alpha + k), \]
where the middle integral equals 1 because the integrand is the density of a $\mathrm{Gamma}(n+\alpha, \lambda)$ random variable. In particular, taking $n = 1$ and $n = 2$ gives
\[ E[X] = \frac{\alpha}{\lambda}, \qquad \mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2}. \]

4.4 Continuous Distributions on Finite Intervals

4.4.1 The uniform distribution

The uniform distribution is arguably the simplest of all the continuous distributions since its density is constant on an interval. Later in the course we will see that this distribution is special in another sense: it is the distribution with maximum entropy on any finite interval. For these reasons, the uniform distribution is often used as a non-informative prior distribution for variables when only the lower and upper bounds are known.

Definition 4.6. A continuous random variable $X$ is said to have the uniform distribution on the interval $(a, b)$ if the density function of $X$ is
\[ p(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in (a, b) \\ 0 & \text{otherwise.} \end{cases} \]
In this case we write $X \sim \mathrm{Uniform}(a, b)$ or $X \sim U(a, b)$. In particular, if $X \sim U(0, 1)$, then $X$ is said to be a standard uniform random variable.

To say that $X$ is uniformly distributed on an interval $(a, b)$ means that each point in $(a, b)$ is equally likely to be a value of $X$.
In particular, the probability that $X$ lies in any subinterval of $(a, b)$ is proportional to the length of the subinterval: if $(l, r) \subset (a, b)$, then
\[ P\big( X \in (l, r) \big) = \frac{r - l}{b - a}. \]
Of course, since $X$ is continuous, the probability that $X = x$ for any $x \in (a, b)$ is 0.

Suppose that $X$ has the $U(a, b)$ distribution. Then the $n$'th moment of $X$ is given by
\[ E[X^n] = \frac{1}{b-a} \int_a^b x^n \, dx = \frac{1}{b-a} \left[ \frac{x^{n+1}}{n+1} \right]_a^b = \frac{1}{n+1} \cdot \frac{b^{n+1} - a^{n+1}}{b - a}. \]
It follows that the mean and the variance of $X$ are
\[ E[X] = \frac{1}{2}(a + b), \qquad \mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{1}{12}(b-a)^2. \]

4.4.2 The beta distribution

Although mathematically convenient, the uniform distribution is often too restrictive when modeling the statistics of variables that take values in a finite interval. When this is true, we can sometimes turn to the beta distribution as a fairly flexible generalization of the standard uniform distribution.

Definition 4.7. A random variable $X$ is said to have the beta distribution with parameters $a, b > 0$ if its density is
\[ p(x) = \begin{cases} \dfrac{1}{\beta(a, b)} x^{a-1} (1-x)^{b-1} & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases} \]
where the beta function $\beta(a, b)$ is defined (for $a, b > 0$) by the formula
\[ \beta(a, b) = \int_0^1 x^{a-1} (1-x)^{b-1} \, dx. \]

Notice that if $a = b = 1$, then $X$ is simply a standard uniform random variable. Also, if $X_1$ and $X_2$ are independent gamma-distributed random variables with parameters $(a, \theta)$ and $(b, \theta)$, respectively, then the random variable $X = X_1/(X_1 + X_2)$ is beta-distributed with parameters $(a, b)$. In particular, if $X_1$ and $X_2$ are independent exponential random variables with parameter $\lambda$, then $X = X_1/(X_1 + X_2)$ is uniformly distributed on $[0, 1]$. The beta distribution is often used to model distributions of frequencies.

It can be shown that the beta function and the gamma function are related by the following identity:
\[ \beta(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}. \]
In turn, we can use this identity to evaluate the moments of the beta distribution. Let $X$ be a beta random variable with parameters $(a, b)$.
Then
\[ E[X^n] = \frac{1}{\beta(a, b)} \int_0^1 x^n x^{a-1} (1-x)^{b-1} \, dx = \frac{\beta(n+a, b)}{\beta(a, b)} = \frac{\Gamma(n+a)\Gamma(b)}{\Gamma(n+a+b)} \cdot \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} = \frac{\Gamma(n+a)}{\Gamma(a)} \cdot \frac{\Gamma(a+b)}{\Gamma(n+a+b)} = \prod_{k=0}^{n-1} \frac{a+k}{a+b+k}. \]
Taking $n = 1$ and $n = 2$ gives
\[ E[X] = \frac{a}{a+b}, \qquad \mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{a(a+1)}{(a+b)(a+b+1)} - \frac{a^2}{(a+b)^2} = \frac{ab}{(a+b)^2 (a+b+1)}. \]

4.4.3 Bayesian estimation of proportions

Suppose that we conduct $n$ independent trials and find that exactly $k$ of them result in a success. If we assume that each trial has the same unknown probability $p$ of resulting in a success, what does our data tell us about the value of $p$? In problem set 4 you addressed this question by calculating the maximum likelihood estimate of $p$. Here we will describe a Bayesian approach to the same problem. Let us define the propositions $I$, $D$ and $H_p$ by

$I$ = "n independent, identically-distributed trials have been conducted"
$D$ = "exactly k of the trials have resulted in a success"
$H_p$ = "the probability of success per trial is in $(p, p + \delta p)$"

Notice that there are infinitely many hypotheses $H_p$, indexed by values of $p \in [0, 1]$. By Bayes' formula, we know that the posterior probability of hypothesis $H_p$ is equal to
\[ P(H_p | D, I) = P(H_p | I) \, \frac{P(D | H_p, I)}{P(D | I)}, \tag{4.7} \]
where $P(H_p | I)$ is the prior probability of $H_p$, $P(D | H_p, I)$ is the likelihood function, and $P(D | I)$ is the prior predictive probability of the data. Under the assumptions of the problem, the likelihood function is given by a binomial probability
\[ P(D | H_p, I) = \binom{n}{k} p^k (1-p)^{n-k}. \]
Thus, to complete the analysis, we need to choose a prior distribution for the value of $p$. Recall that in Bayesian inference the prior distribution quantifies our degree of belief in each hypothesis before we take into account the new data. This belief could be based on data generated by previous experiments of the same type or on completely different sources of information (or both).
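The identity $\beta(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ and the first-moment formula $E[X] = a/(a+b)$ derived above can be checked numerically. A sketch in Python (the midpoint-rule integrator, function names, and parameter values are illustrative choices, not part of the original notes):

```python
import math

def beta_fn(a, b):
    """beta(a, b) via the gamma-function identity."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_integral(a, b, n=200_000):
    """Midpoint-rule approximation of the defining integral of beta(a, b)."""
    h = 1.0 / n
    return h * sum(((k + 0.5) * h) ** (a - 1) * (1 - (k + 0.5) * h) ** (b - 1)
                   for k in range(n))

a, b = 2.5, 4.0
print(beta_fn(a, b), beta_integral(a, b))  # the two values should agree closely

# Moment formula: E[X] = beta(a+1, b)/beta(a, b) = a/(a+b)
mean = beta_fn(a + 1, b) / beta_fn(a, b)
assert abs(mean - a / (a + b)) < 1e-12
```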
For binomial trials, it is often convenient to choose a prior distribution on $p$ that belongs to the beta family, i.e., we will assume that the density of the prior distribution on $H_p$ is
\[ P(H_p | I) = \frac{1}{\beta(a, b)} p^{a-1} (1-p)^{b-1} \, dp \]
for some pair of numbers $a, b > 0$. Because the variance of the beta distribution is a decreasing function of $a$ and $b$, if $a$ and $b$ are both large, then the prior will be concentrated around the mean $a/(a+b)$, which would reflect a scenario in which we have very strong prior beliefs concerning the likely values of $p$. On the other hand, if $a = b = 1$, then the beta distribution is just the standard uniform distribution, which would reflect a scenario in which we have little (perhaps no) information concerning the likely values of $p$. Whatever the values of these parameters, equation (4.7) tells us that the posterior density of $p$ is equal to
\[ P(H_p | D, I) = \frac{1}{\beta(a, b)} p^{a-1} (1-p)^{b-1} \cdot \frac{\binom{n}{k} p^k (1-p)^{n-k}}{P(D|I)} = \frac{1}{C} \, p^{k+a-1} (1-p)^{n-k+b-1}, \]
where the constant $C = \beta(a, b) P(D|I) / \binom{n}{k}$ can either be calculated directly or determined indirectly by noting that it is a normalizing constant for the density on the right-hand side, which is just the density of a beta distribution with parameters $a + k$ and $b + n - k$. Since this density must still integrate to 1 over $[0, 1]$, it follows that $C = \beta(a + k, b + n - k)$.

This shows that the posterior distribution belongs to the same parametric family of distributions as the prior distribution, differing only in the values of the parameters, which now depend on the data. Prior distributions with this property are said to be conjugate priors and have the advantage that the posterior distribution can be determined immediately, without having to resort to expensive numerical estimates of the prior predictive probability. In this particular problem, we see that the mean and the variance of the posterior distribution are
\[ E[p | D, I] = \frac{a + k}{a + b + n}, \qquad \mathrm{Var}(p | D, I) = \frac{(a + k)(b + n - k)}{(a + b + n)^2 (a + b + n + 1)}. \]
This shows that the variance of the posterior distribution is typically smaller than the variance of the prior distribution and tends to 0 as $n$ grows, i.e., the more data we collect, the less uncertain we are about the true value of $p$. Furthermore, if $k$ and $n$ are large relative to $a$ and $b$, then the mean of the posterior distribution will be close to $\hat{p}_{ML} = k/n$, which is the maximum likelihood estimate of $p$ mentioned above. In fact, the relative magnitudes of $a$ and $b$ on the one hand and $k$ and $n - k$ on the other reflect the relative importance of the two sources of information that we have concerning the value of $p$. When $a, b \gg k, n - k$, the data is only weakly informative about the value of $p$, and the posterior distribution then depends mainly on the prior distribution. On the other hand, if $a, b \ll k, n - k$, then the data is very informative about the value of $p$, and this too is reflected in the posterior distribution.

4.4.4 Binary to analogue: construction of the uniform distribution

Thus far we have simply assumed that continuous distributions exist, without taking care to specify the $\sigma$-algebra of measurable subsets of $\mathbb{R}$ or to prove that there really are probability distributions with these properties. Here we partially redress this absence by showing how to construct a standard uniform random variable from a sequence of i.i.d. Bernoulli random variables. Suppose that $X_1, X_2, \cdots$ is a sequence of independent Bernoulli random variables, each with parameter $1/2$. If we define
\[ X = \sum_{n=1}^\infty \left( \frac{1}{2} \right)^n X_n, \]
then $X$ is a standard uniform random variable. Notice that the random sequence $(X_1, X_2, \cdots)$ is the binary representation of the random number $X$.
That $X$ is uniformly distributed can be deduced as follows:

Step 1: We say that a number $x \in \mathbb{R}$ is a dyadic rational number if $x$ has a finite binary expansion, i.e., there exists an integer $n \geq 1$ and numbers $c_1, \cdots, c_n \in \{0, 1\}$ such that
\[ x = \sum_{k=1}^n c_k \left( \frac{1}{2} \right)^k. \]
Notice that every dyadic rational number is rational, but not every rational number is dyadic rational, e.g., $x = 1/3$ is not dyadic rational. On the other hand, the set of dyadic rational numbers is dense in $\mathbb{R}$, i.e., for every $z \in \mathbb{R}$ and every $\epsilon > 0$, there exists a dyadic rational number $x$ such that $|z - x| < \epsilon$.

Step 2: Every number $x \in \mathbb{R}$ either has a unique, non-terminating binary expansion, or it has two binary expansions, one ending in an infinite sequence of 0's and the other ending in an infinite sequence of 1's. In either case, $P\{X = x\} = 0$.

Step 3: Given numbers $c_1, \cdots, c_n \in \{0, 1\}$, let $x$ be the corresponding dyadic rational defined by the equation shown in Step 1 and let $E(c_1, \cdots, c_n)$ be the interval $(x, x + 2^{-n}]$. Then
\[ P\{X \in E(c_1, \cdots, c_n)\} = P\{X_1 = c_1, \cdots, X_n = c_n\} - P\{X = x\} = \left( \frac{1}{2} \right)^n. \]
Furthermore, if $0 \leq a < b \leq 1$, where $a$ and $b$ are dyadic rational numbers, then because the interval $(a, b]$ can be written as the disjoint union of finitely many sets of the type $E(c_1, \cdots, c_n)$ for suitable choices of the $c_i$'s, we have that $P\{X \in (a, b]\} = b - a$.

Step 4: Given any two numbers $0 \leq a < b \leq 1$, not necessarily dyadic rational, we can find a decreasing sequence $(a_n : n \geq 1)$ and an increasing sequence $(b_n : n \geq 1)$ of dyadic rationals such that $a_n < b_n$, $a_n \to a$ and $b_n \to b$. Then the sets $(a_n, b_n]$ form an increasing sequence with limit set $(a, b]$, and by the continuity properties of probability measures, we know that
\[ P\{X \in (a, b]\} = \lim_{n \to \infty} P\{X \in (a_n, b_n]\} = \lim_{n \to \infty} (b_n - a_n) = b - a. \]
However, this result implies that $X$ is uniform on $[0, 1]$.
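The construction above can be simulated by truncating the binary series at finitely many bits. A minimal Python sketch (not part of the original notes; the choice of 53 bits, matching double precision, and the sample size are illustrative assumptions):

```python
import random

def uniform_from_bits(rng, n_bits=53):
    """Approximate X = sum_{k>=1} X_k / 2^k by truncating at n_bits fair bits."""
    x = 0.0
    for k in range(1, n_bits + 1):
        x += rng.getrandbits(1) / 2 ** k
    return x

rng = random.Random(2)
samples = [uniform_from_bits(rng) for _ in range(50_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

# A standard uniform variable has mean 1/2 and variance 1/12
print(mean, var)
```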
4.5 The Normal Distribution

As explained below, the standard normal distribution is one of the most important probability distributions in all of mathematics. Not only does it play a central role in large sample statistics, but it is also relevant to many physical and biological processes, including heat conduction, diffusion, and the evolution of quantitative traits. Furthermore, the standard normal distribution is the building block for an entire family of normal distributions, having different means and variances.

Definition 4.8. A continuous real-valued random variable $X$ is said to have the normal distribution with mean $\mu$ and variance $\sigma^2$ if the density of $X$ is
\[ p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2/2\sigma^2}, \quad -\infty < x < \infty. \]
In this case we write $X \sim N(\mu, \sigma^2)$. In particular, $X$ is said to be a standard normal random variable if $X \sim N(0, 1)$.

Remark: Normal distributions are also called Gaussian distributions after Carl Friedrich Gauss (1777-1855).

We can show that the integral of $p(x)$ over $\mathbb{R}$ is equal to 1 by using a clever trick attributed to Gauss. First, by making the substitution $y = (x - \mu)/\sigma$, we see that
\[ \int_{-\infty}^\infty p(x) \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-y^2/2} \, dy. \]
Then, letting $I = \int_{-\infty}^\infty e^{-y^2/2} \, dy$, we obtain
\[ I^2 = \left( \int_{-\infty}^\infty e^{-x^2/2} \, dx \right) \left( \int_{-\infty}^\infty e^{-y^2/2} \, dy \right) = \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-(x^2 + y^2)/2} \, dx \, dy = \int_0^\infty \int_0^{2\pi} r e^{-r^2/2} \, d\theta \, dr = 2\pi \int_0^\infty r e^{-r^2/2} \, dr = 2\pi. \]
Here we have changed to polar coordinates in passing from the second expression to the third, i.e., we set $x = r\cos(\theta)$, $y = r\sin(\theta)$, and $dx\,dy = r \, d\theta \, dr$. This calculation shows that $I = \sqrt{2\pi}$, which confirms that $p(x)$ is a probability density.

An important property of normal distributions is that any affine transformation of a normal random variable is also a normal random variable. Specifically, if $X$ is normally distributed with mean $\mu$ and variance $\sigma^2$, then the random variable $Y = aX + b$ is also normally distributed, but with mean $a\mu + b$ and variance $a^2\sigma^2$.
To verify this claim, assume that $a > 0$ and observe that the c.d.f. of $Y$ is
\[ F_Y(x) = P(Y \leq x) = P(aX + b \leq x) = P\left( X \leq \frac{x-b}{a} \right) = F_X\left( \frac{x-b}{a} \right). \]
The density of $Y$ can be found by differentiating $F_Y(x)$ and is equal to
\[ p_Y(x) = \frac{d}{dx} F_Y(x) = \frac{d}{dx} F_X\left( \frac{x-b}{a} \right) = \frac{1}{a} p_X\left( \frac{x-b}{a} \right) = \frac{1}{a\sigma\sqrt{2\pi}} e^{-(x - a\mu - b)^2/2a^2\sigma^2}, \]
which shows that $Y \sim N(a\mu + b, a^2\sigma^2)$ as claimed. A similar argument works if $a < 0$. In particular, by taking $b = -\mu$ and $a = 1/\sigma$, we see that the variable $Z = (X - \mu)/\sigma$ is a standard normal random variable. For this reason, $Z$ is said to be standardized. Similarly, if $Z$ is a standard normal random variable, then so is $-Z$.

For statistical applications of the normal distribution, we are often interested in probabilities of the form $P(X > x) = 1 - P(X \leq x)$. Although a simple analytical expression is unavailable, these quantities can be calculated numerically and have been extensively tabulated for the standard normal distribution. An online copy of such a table can be found at Wikipedia under the topic 'Standard normal table'. Furthermore, because every normal random variable can be expressed in terms of a standard normal random variable, we can use these lookup tables for more general problems. For example, if $X \sim N(\mu, \sigma^2)$, then by defining $Z = (X - \mu)/\sigma$, we see that
\[ P(X > x) = P(\sigma Z + \mu > x) = 1 - P\big( Z \leq (x - \mu)/\sigma \big) = 1 - \Phi\big( (x - \mu)/\sigma \big), \]
where $\Phi(x)$ is the CDF of the standard normal distribution:
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} \, dy. \]
Alternatively, functions that evaluate the cumulative distribution function of a normal random variable with arbitrary mean and variance are available in computational software packages such as Matlab, Mathematica and R. For example, the Matlab command p = normcdf(x, mu, sigma) will calculate the probability $P(X \leq x)$ when $X \sim N(\mu, \sigma^2)$. Note that the symbols x, mu, and sigma in the command should be replaced by their intended values, e.g., p = normcdf(1, 0, 1). Standardization is also used in the proof of the next proposition.

Proposition 4.3.
Let $X$ be a normal random variable with parameters $\mu$ and $\sigma^2$. Then the expected value of $X$ is $\mu$, while the variance of $X$ is $\sigma^2$.

Proof. Let $Z = (X - \mu)/\sigma$, so that $Z \sim N(0, 1)$. Then
\[ E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty x e^{-x^2/2} \, dx = 0, \]
since the integrand is an odd function, while integration by parts with $u = x$ and $dv = x e^{-x^2/2} \, dx$ shows that
\[ \mathrm{Var}(Z) = E[Z^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty x^2 e^{-x^2/2} \, dx = \frac{1}{\sqrt{2\pi}} \left( \Big[ -x e^{-x^2/2} \Big]_{-\infty}^\infty + \int_{-\infty}^\infty e^{-x^2/2} \, dx \right) = 1. \]
Since $X = \mu + \sigma Z$, the proposition follows from the identities
\[ E[X] = \mu + \sigma E[Z] = \mu, \qquad \mathrm{Var}(X) = \sigma^2 \mathrm{Var}(Z) = \sigma^2. \]

Our next result helps to explain why normal distributions are so prevalent. Recall that if $X_1, \cdots, X_n$ are independent Bernoulli random variables, each having parameter $p$, then the distribution of the sum $S_n = X_1 + \cdots + X_n$ is binomial with parameters $(n, p)$. Furthermore, if $p$ is small, then, by the law of rare events, we know that $S_n$ is approximately Poisson distributed with parameter $np$. In this case, $S_n$ is typically small, of order $np$, and so a discrete approximation is to be expected. If, instead, $np$ is large, then $S_n$ will typically be large and it makes more sense to look for a continuous approximation, much as we did in Theorem 4.3. Such an approximation is given in the next theorem.

Theorem 4.4. (DeMoivre-Laplace Limit Theorem) If $S_n$ is a binomial random variable with parameters $n$ and $p$ and $Z$ is a standard normal random variable, then
\[ \lim_{n \to \infty} P\left\{ a \leq \frac{S_n - np}{\sqrt{np(1-p)}} \leq b \right\} = P(a \leq Z \leq b) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2} \, dx \]
for all real numbers $a \leq b$.

In other words, the DeMoivre-Laplace Theorem asserts that when both $n$ and $np$ are large, the distribution of the standardized random variable $Z_n = (S_n - np)/\sqrt{np(1-p)}$ can be approximated by the distribution of a standard normal random variable $Z$. Furthermore, since affine transformations of normal distributions are still normal (see above), it follows that for large $n$ the variable $S_n$ is approximately normal with mean $np$ and variance $np(1-p)$.
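The quality of the DeMoivre-Laplace approximation can be examined directly by comparing an exact binomial probability with $\Phi(b) - \Phi(a)$. In the Python sketch below (not part of the original notes), $\Phi$ is computed from the error function, and the values $n = 1000$, $p = 0.3$ are arbitrary illustrative choices:

```python
import math

def phi(x):
    """Standard normal CDF expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 1000, 0.3
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# Exact probability that the standardized sum lies in [-1, 1]
lo, hi = math.ceil(mu - sigma), math.floor(mu + sigma)
exact = sum(binom_pmf(n, p, k) for k in range(lo, hi + 1))
approx = phi(1) - phi(-1)
print(exact, approx)
```

For these parameter values $np(1-p) = 210 \geq 10$, and the exact and approximate values should agree to roughly two decimal places.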
We illustrate the use of this approximation in Example 4.6 below. The accuracy of the approximation depends on both $n$ and $p$, but it is usually accurate to within a few percent when $np(1-p) \geq 10$, and the error can be shown to decay at a rate that is asymptotic to $n^{-1/2}$.

The DeMoivre-Laplace Theorem can be illustrated mechanically with the help of a device known as a Galton board or a quincunx; an applet simulating a Galton board can be found at the following link: http://www.jcu.edu/math/isep/quincunx/quincunx.html. To use this applet, set the number of bins to 10 or greater and run a series of simulations with an increasing number of balls (e.g., 10, 100, 1000, 10000). Be sure to set the speed to Maximum Speed for larger simulations.

In fact, the DeMoivre-Laplace theorem is a special case of a much more powerful result known as the Central Limit Theorem. This asserts that the distribution of the sum of a large number of independent, identically-distributed random variables with any distribution with finite mean and finite variance is approximately normal. In the setting of the DeMoivre-Laplace theorem, the individual random variables are all Bernoulli, but there is nothing special about the Bernoulli distribution. Indeed, we could just as easily add Poisson or exponential or beta-distributed random variables and the sum would still be approximately normally distributed. Later in the course we will use moment generating functions to prove that the CLT holds for sums of random variables that have finite exponential moments, and then we will be able to deduce Theorem 4.4 as a corollary.

Example 4.6. Suppose that a fair coin is tossed 100 times. What is the probability that the number of heads obtained is between 45 and 55 (inclusive)?

Solution: If $X$ denotes the number of heads obtained in 100 tosses, then $X$ is a binomial random variable with parameters $(100, 1/2)$.
By the DeMoivre-Laplace theorem, we know that
\[ P\{45 \leq X \leq 55\} = P\left\{ -1 \leq \frac{X - 50}{5} \leq 1 \right\} \approx \Phi(1) - \Phi(-1) = \Phi(1) - (1 - \Phi(1)) = 2\Phi(1) - 1 = 0.683, \]
where a table of Z-scores has been consulted for the numerical value $\Phi(1) = 0.841$. Here we have made use of the identity $\Phi(-x) = 1 - \Phi(x)$, which follows from the fact that if $X \sim N(0, 1)$, then by symmetry
\[ P\{X \leq -x\} = P\{X > x\} = 1 - P\{X \leq x\}. \]

4.6 Transformations of Random Variables

In Section 4.5 we saw that an affine transformation of a normal random variable is also normally distributed, and we used this observation to prove several results about normal distributions. Since transformations of random variables are frequently used in statistics and applied probability, we need a general set of techniques to determine their distributions. This is the subject of this section.

Suppose that $X$ is a real-valued random variable and let $Y = g(X)$, where $g : \mathbb{R} \to \mathbb{R}$ is a real-valued function. Then
\[ P(Y \in E) = P(g(X) \in E) = P(X \in g^{-1}(E)), \]
where the set $g^{-1}(E)$ is the preimage of $E$ under $g$:
\[ g^{-1}(E) = \{ x \in \mathbb{R} : g(x) \in E \}. \]
Notice that we do not require $g$ to be one-to-one, and so $g^{-1}$ may not be defined as a function. Furthermore, since in general $E \neq g^{-1}(E)$, it is clear that $Y$ and $X$ will usually have different distributions. However, provided that the transformation is not too complicated, we can sometimes express the cumulative distribution function of $Y$ in terms of that of $X$. We first see how this is done in two examples and then describe a general approach.

Example 4.7. Let $X$ be a non-negative random variable with CDF $F_X$ and let $Y = X^n$, where $n \geq 1$. Then, for any $x \geq 0$,
\[ F_Y(x) = P(Y \leq x) = P(X^n \leq x) = P(X \leq x^{1/n}) = F_X\big( x^{1/n} \big), \]
while for any $x < 0$, we have $F_Y(x) = 0$. Similarly, if $X$ has a density $p_X(x) = F_X'(x)$, then $Y$ also has a density given by
\[ p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \dfrac{1}{n} x^{\frac{1}{n} - 1} p_X\big( x^{1/n} \big) & \text{if } x > 0. \end{cases} \]

Example 4.8.
Let $X$ be a continuous real-valued random variable with density $p_X(x) = F_X'(x)$ and let $Y = X^2$. Then, for $x \ge 0$,
$$F_Y(x) = P\{Y \le x\} = P\{X^2 \le x\} = P\{-\sqrt{x} \le X \le \sqrt{x}\} = P\{-\sqrt{x} < X \le \sqrt{x}\} = F_X(\sqrt{x}) - F_X(-\sqrt{x}),$$
where we have used the fact that $P(X = -\sqrt{x}) = 0$ (since $X$ is continuous). In contrast, if $x < 0$, then $F_Y(x) = 0$. In this case, the density of $Y$ is given by
$$p_Y(x) = \frac{d}{dx} F_Y(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{2\sqrt{x}}\left(p_X(\sqrt{x}) + p_X(-\sqrt{x})\right) & \text{if } x \ge 0. \end{cases}$$

Notice that we obtain different expressions for $F_Y(x)$ in Examples 4.7 and 4.8 in the case $n = 2$, depending on whether $X$ is strictly non-negative or can take on both positive and negative values. The reason for this is that the function $x \mapsto x^2$ is one-to-one when restricted to $[0, \infty)$ but is two-to-one when considered on the entire real line.

Monotonic transformations: Suppose that $X$ is a real-valued random variable with CDF $F_X$ and let $Y = g(X)$, where the function $g : \mathbb{R} \to (l, r)$ is strictly increasing (i.e., if $x < y$, then $g(x) < g(y)$), continuous, one-to-one and onto. Then $g$ is invertible on the interval $(l, r)$, and so the CDF of $Y$ can be expressed as
$$F_Y(x) = P\{Y \le x\} = P\{g(X) \le x\} = P\{X \le g^{-1}(x)\} = F_X(g^{-1}(x))$$
for any $x \in (l, r)$. Of course, if $x \ge r$, then $P\{Y \le x\} = 1$, while if $x \le l$, then $P\{Y \le x\} = 0$. Combining these facts, we obtain the general expression
$$F_Y(x) = \begin{cases} 0 & \text{if } x \le l \\ F_X(g^{-1}(x)) & \text{if } x \in (l, r) \\ 1 & \text{if } x \ge r. \end{cases}$$

If, instead, $g$ is a decreasing function, then
$$F_Y(x) = P\{g(X) \le x\} = P\{X \ge g^{-1}(x)\} = 1 - F_X(g^{-1}(x)-),$$
where $F_X(x-) = \lim_{x_n \uparrow x} F_X(x_n)$ is the left-hand limit of $F_X$ at $x$. Thus, in this case, the CDF of $Y$ has the form
$$F_Y(x) = \begin{cases} 0 & \text{if } x \le l \\ 1 - F_X(g^{-1}(x)-) & \text{if } x \in (l, r) \\ 1 & \text{if } x \ge r. \end{cases}$$
Of course, if $X$ is continuous, then $F_X(x-) = F_X(x)$ and the last result can be simplified.

Example 4.9. Suppose that $U$ is a standard uniform random variable and let $g$ be the function $g(y) = (b - a)y + a$, where $b \ne a$, and set $Y = g(U)$.
If $b > a$, then $a \le Y \le b$ and the cumulative distribution function of $Y$ is
$$F_Y(x) = P((b-a)U + a \le x) = P\left(U \le \frac{x-a}{b-a}\right) = \frac{x-a}{b-a},$$
provided $x \in [a, b]$. This shows that $Y$ is uniform on the interval $[a, b]$. On the other hand, if $b < a$, then $b \le Y \le a$ and
$$F_Y(x) = P((b-a)U + a \le x) = P\left(U \ge \frac{x-a}{b-a}\right) = 1 - P\left(U \le \frac{x-a}{b-a}\right) = 1 - \frac{a-x}{a-b} = \frac{x-b}{a-b},$$
whenever $x \in [b, a]$, which shows that $Y$ is uniform on the interval $[b, a]$. In particular, by taking $a = 1$ and $b = 0$, we see that $Y \equiv 1 - U$ is uniform on $[0, 1]$, i.e., $1 - U$ is also a standard uniform random variable. We will use this observation below in Example 4.10.

The next proposition describes an important application of these identities.

Theorem 4.5. Suppose that $U$ is a uniform random variable on $(0, 1)$ and let $X$ be a real-valued random variable with a strictly increasing continuous CDF $F_X(\cdot)$. Then the random variable $Y = F_X^{-1}(U)$ has the same distribution as $X$, while the random variable $Z = F_X(X)$ is uniformly distributed on $(0, 1)$.

Proof. First consider the distribution of $Y$:
$$F_Y(x) = P\{Y \le x\} = P\{F_X^{-1}(U) \le x\} = P\{U \le F_X(x)\} = F_X(x),$$
since $F_X(x) \in [0, 1]$ for all $x$ and $P\{U \le y\} = y$ whenever $y \in [0, 1]$. This shows that $Y$ and $X$ have the same distribution.

Similarly, the CDF of $Z = F_X(X)$ is
$$F_Z(x) = P\{Z \le x\} = P\{F_X(X) \le x\} = P\{X \le F_X^{-1}(x)\} = F_X(F_X^{-1}(x)) = x$$
for any $x \in [0, 1]$, which shows that $Z$ is a uniform random variable on $[0, 1]$.

This result has important applications in computational statistics and stochastic simulation. For example, many computational statistics packages include routines that will simulate a sequence of independent uniform random variables $U_1, U_2, \cdots$ (or some suitable approximation thereof). Then, if $X$ is a random variable with a strictly increasing CDF, we can generate a sequence of independent random variables $X_1, X_2, \cdots$, each having the same distribution as $X$, by setting $X_i = F_X^{-1}(U_i)$.
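The second half of Theorem 4.5 is easy to check numerically. The following is a minimal sketch in Python (not from the notes; the exponential distribution with rate 1 is an arbitrary test case, drawn with the standard-library `random.expovariate`): it verifies that $Z = F_X(X)$ behaves like a uniform random variable on $(0, 1)$.

```python
import random
import math

random.seed(0)
n = 100_000

# X ~ Exponential(1) has CDF F_X(x) = 1 - exp(-x), so Z = F_X(X) = 1 - exp(-X)
zs = [1.0 - math.exp(-random.expovariate(1.0)) for _ in range(n)]

mean_z = sum(zs) / n                                # should be close to 1/2
frac_below = sum(1 for z in zs if z <= 0.25) / n    # should be close to 0.25
print(mean_z, frac_below)
```

Both summary statistics match what a uniform random variable on $(0, 1)$ would produce, up to Monte Carlo error of order $n^{-1/2}$.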
This is computationally efficient only if the inverse $F_X^{-1}$ can be easily evaluated.

Example 4.10. Recall that the CDF of the exponential distribution with parameter $\lambda$ is $F_X(x) = 1 - e^{-\lambda x}$. A simple calculation shows that
$$F_X^{-1}(x) = -\frac{1}{\lambda}\ln(1 - x),$$
and so it follows that if $U$ is uniform on $[0, 1]$, then
$$Y = -\frac{1}{\lambda}\ln(1 - U)$$
is exponentially distributed with parameter $\lambda$. In fact, because the distribution of $1 - U$ is also uniform on $[0, 1]$, the random variable
$$Y' = -\frac{1}{\lambda}\ln(U)$$
is also exponentially distributed with parameter $\lambda$.

If $X$ is a continuous random variable and $g$ is a strictly monotonic function with a differentiable inverse, then it can be shown that the random variable $Y = g(X)$ is also continuous, and the density of $Y$ can be expressed in terms of $g$ and the density of $X$. This relationship is described in the next proposition.

Proposition 4.4. Let $X$ be a continuous random variable with density $p_X(\cdot)$ and suppose that $g$ is a strictly monotonic continuous function with differentiable inverse $g^{-1}$. Then $Y = g(X)$ is a continuous random variable with density function
$$p_Y(y) = \begin{cases} \left|\frac{d}{dy} g^{-1}(y)\right| \cdot p_X(g^{-1}(y)) & \text{if } y = g(x) \text{ for some } x \\ 0 & \text{otherwise.} \end{cases}$$

Proof. Without loss of generality, we may assume that $g$ is increasing. Suppose that $y = g(x)$ for some $x$. Then, as shown above, $F_Y(y) = F_X(g^{-1}(y))$, and differentiating with respect to $y$ gives
$$p_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} g^{-1}(y) \cdot p_X(g^{-1}(y)).$$

Chapter 5 Random Vectors and Multivariate Distributions

5.1 Joint and Marginal Distributions

Because the world is complex and multi-faceted, scientific investigations are usually faced with a multitude of variables whose relationships to one another are only poorly understood.
For example, if we are interested in understanding temporal fluctuations in quantities such as ocean surface temperatures or currency exchange rates, then we will need to study time series of data that can be represented as a sequence of random variables $X_1, X_2, \cdots$, where the variable $X_t$ denotes the quantity of interest at the time of the $t$'th measurement. Similarly, if we are conducting a genome-wide association study (GWAS) to identify genetic variants that predispose individuals to complex diseases such as cancer or diabetes or schizophrenia, then we need to consider an array of random variables $X_{i1}, \cdots, X_{in}, Y_i$, where $Y_i$ indicates whether the $i$'th individual in the study is affected (a case) or not (a control) and $X_{ij}$ indicates which nucleotide $\{A, T, C, G\}$ is present at the $j$'th site in this individual.

To build probabilistic models for these kinds of data, we need to consider groups of random variables $X_1, \cdots, X_n$ that are all defined on the same probability space. Provided that these are ordered in some manner, we can treat each group of variables as a random vector $X = (X_1, \cdots, X_n)$, which is a map from the sample space $\Omega$ into $\mathbb{R}^n$:
$$X(\omega) \equiv (X_1(\omega), \cdots, X_n(\omega)) \in \mathbb{R}^n.$$
Our main goal in this chapter is to develop some tools that can be used to describe and study random vectors.

Definition 5.1. Suppose that $X_1, X_2, \cdots, X_n$ are real-valued random variables defined on the same probability space $(\Omega, \mathcal{F}, P)$. Then the joint distribution of these variables is the probability distribution $Q$ of the random vector $X = (X_1, \cdots, X_n)$ defined by
$$Q(E) = P\left((X_1, \cdots, X_n) \in E\right)$$
for any measurable subset $E \subset \mathbb{R}^n$. Likewise, the joint cumulative distribution function of $(X_1, \cdots, X_n)$ is defined by
$$F(x_1, \cdots, x_n) = P(X_1 \le x_1, \cdots, X_n \le x_n).$$

As with real-valued random variables, the joint distribution and the joint CDF each fully determine the other.
This means that if two random vectors $X = (X_1, \cdots, X_n)$ and $Y = (Y_1, \cdots, Y_n)$ have the same joint cumulative distribution function, then their components have the same joint distribution.

If $X = (X_1, \cdots, X_n)$ is a random vector, then each variable $X_i$ is itself a real-valued random variable, and the distribution of $X_i$ is called the marginal distribution of $X_i$. Note that each component variable has a marginal distribution. By exploiting the continuity properties of probability distributions, the marginal cumulative distribution function of $X_i$ can be determined from the joint cumulative distribution function of the random vector. For example, the marginal CDF of $X_1$ is equal to
$$F_{X_1}(x) = P(X_1 \le x) = P(X_1 \le x, X_2 < \infty, \cdots, X_n < \infty) = P\left(\bigcup_{M \ge 1} \{X_1 \le x, X_2 < M, \cdots, X_n < M\}\right) = \lim_{M \to \infty} F_X(x, M, \cdots, M).$$
This process of recovering the marginal distribution of a single random variable from the joint distribution of the random vector is called marginalisation.

Caveat: The joint distribution of the variables $(X_1, \cdots, X_n)$ fully determines all of the marginal distributions. However, knowledge of the marginal distributions alone is not sufficient to determine the joint distribution: there are usually infinitely many joint distributions that have the same marginal distributions.

Definition 5.2. If $X_1, \cdots, X_n$ are discrete random variables defined on the same probability space and $X_i$ takes values in the countable set $E_i$, then $X = (X_1, \cdots, X_n)$ is a discrete random vector with values in the product space $E = E_1 \times \cdots \times E_n$, and the joint probability mass function of $X$ is defined by
$$p(x_1, \cdots, x_n) = P(X_1 = x_1, \cdots, X_n = x_n).$$

If we are given the joint probability mass function $p_X$ of a discrete random vector $X = (X_1, \cdots, X_n)$, then we can find the marginal probability mass functions of the variables $X_i$ by summing $p_X$ over appropriate subsets of $E$.
For example, the marginal probability mass function of $X_1$ is given by the following sum:
$$p_{X_1}(x) = P(X_1 = x) = \sum_{x_2, \cdots, x_n} p_X(x, x_2, \cdots, x_n).$$
In other words, we can calculate the marginal probability mass function of $X_i$ at a point $x \in E_i$ by summing $p_X$ over all possible vectors in $E$ with the $i$'th coordinate held fixed at $x$.

Example 5.1. Suppose that $n$ independent experiments are performed, each of which can have one of $r$ possible outcomes (labeled $1, \cdots, r$) with respective probabilities $p_1, \cdots, p_r$, where $p_1 + \cdots + p_r = 1$. Let $X_i$ denote the number of experiments with outcome $i$. Then $X = (X_1, \cdots, X_r)$ is a discrete random vector with probability mass function
$$p(n_1, \cdots, n_r) = P\{X_1 = n_1, \cdots, X_r = n_r\} = \binom{n}{n_1, \cdots, n_r} p_1^{n_1} \cdots p_r^{n_r},$$
and $X$ is said to have the multinomial distribution with parameters $n$ and $(p_1, \cdots, p_r)$. Notice that $X_1 + \cdots + X_r = n$. Of course, if $r = 2$, then this is just the binomial distribution with parameters $n$ and $p_1$.

Definition 5.3. The random variables $X_1, \cdots, X_n$ are said to be jointly continuous if there exists a function $p(x_1, \cdots, x_n)$, called the joint probability density function, such that
$$P\{(X_1, \cdots, X_n) \in E\} = \int \cdots \int_{(x_1, \cdots, x_n) \in E} p(x_1, \cdots, x_n)\, dx_1 \cdots dx_n$$
for any measurable subset $E \subset \mathbb{R}^n$.

As in the univariate case, the joint density function can be found by differentiating the joint cumulative distribution function. However, when there are multiple variables, we must take one partial derivative for each component of the vector:
$$p_X(x_1, \cdots, x_n) = \partial_{x_1} \cdots \partial_{x_n} F_X(x_1, \cdots, x_n).$$
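In the discrete case, marginalisation can be verified directly on a computer. The sketch below (parameter values are arbitrary illustrative choices, with $r = 3$ outcomes) checks that summing the multinomial joint pmf of Example 5.1 over all but the first count recovers the Binomial$(n, p_1)$ pmf, and that the marginal masses sum to one.

```python
from math import comb, factorial, isclose

n, p = 6, (0.2, 0.3, 0.5)  # arbitrary illustrative parameters

def multinomial_pmf(counts, probs):
    """Joint pmf of the multinomial distribution at the given count vector."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    out = float(coef)
    for c, q in zip(counts, probs):
        out *= q ** c
    return out

total = 0.0
for k in range(n + 1):
    # Marginalise: sum the joint pmf over all (n2, n3) with n2 + n3 = n - k
    marginal = sum(multinomial_pmf((k, j, n - k - j), p) for j in range(n - k + 1))
    # The marginal of X1 should be the Binomial(n, p1) pmf
    assert isclose(marginal, comb(n, k) * p[0] ** k * (1 - p[0]) ** (n - k))
    total += marginal
print("marginal of X1 is Binomial(n, p1); total mass:", total)
```

The same summation pattern works for any component: hold one coordinate fixed and sum the joint pmf over all admissible values of the others.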
Furthermore, if the variables $X_1, \cdots, X_n$ are jointly continuous, then each variable $X_i$ is marginally continuous, and its marginal density function can be found by integrating the joint density function over all coordinates except $x_i$:
$$p_{X_i}(x) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p_X(x_1, \cdots, x_{i-1}, x, x_{i+1}, \cdots, x_n)\, dx_1 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n.$$

Caveat: $X_1, \cdots, X_n$ can all be individually continuous even when they are not jointly continuous. For example, the random vector $(X, X)$ is never jointly continuous, even if $X$ is itself a continuous random variable.

Example 5.2. The random vector $X = (X_1, \cdots, X_{n-1})$ is said to have the Dirichlet distribution with parameters $\alpha_1, \cdots, \alpha_n > 0$ if its components are jointly continuous with joint density function
$$p_X(x_1, \cdots, x_{n-1}) = \frac{\Gamma(\alpha)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \prod_{i=1}^{n} x_i^{\alpha_i - 1}, \qquad \alpha = \alpha_1 + \cdots + \alpha_n,$$
for all $x_1, \cdots, x_{n-1} \ge 0$ satisfying $x_1 + \cdots + x_{n-1} \le 1$, where $x_n \equiv 1 - x_1 - \cdots - x_{n-1}$. Since $x_1 + \cdots + x_n = 1$ by construction, the Dirichlet distribution is usually regarded as a continuous distribution on the standard $(n-1)$-dimensional simplex
$$K_{n-1} = \left\{(x_1, \cdots, x_n) \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1 \text{ and } x_i \ge 0 \text{ for all } i\right\}.$$
When $n = 2$, the Dirichlet distribution reduces to the Beta distribution.

5.2 Independent Random Variables

Definition 5.4. A collection of $n$ random variables $X_1, \cdots, X_n$, all defined on the same probability space, is said to be independent if for every collection of measurable sets $A_1, \cdots, A_n$,
$$P\{X_1 \in A_1, \cdots, X_n \in A_n\} = \prod_{i=1}^{n} P\{X_i \in A_i\}.$$
In other words, the random variables $X_1, \cdots, X_n$ are independent if and only if for all such subsets $A_1, \cdots, A_n \subset \mathbb{R}$, the events $\{X_1 \in A_1\}, \cdots, \{X_n \in A_n\}$ are independent. Also, we say that an infinite collection of random variables is independent if every finite subcollection is independent.

Heuristically, $X_1$ and $X_2$ are independent if knowing the value of $X_1$ provides us with no information about the value of $X_2$, and vice versa.
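The factorization in Definition 5.4 can be illustrated empirically. A minimal sketch (not from the notes; two independent dice rolls are an arbitrary test case): the estimated joint probability of two events should match the product of the estimated individual probabilities.

```python
import random

random.seed(1)
n = 100_000
# Two independent fair dice
rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(n)]

# Events A = {X1 <= 2} and B = {X2 is even}
p_joint = sum(1 for a, b in rolls if a <= 2 and b % 2 == 0) / n
p_a = sum(1 for a, _ in rolls if a <= 2) / n
p_b = sum(1 for _, b in rolls if b % 2 == 0) / n

# Both estimates should be close to (1/3) * (1/2) = 1/6
print(p_joint, p_a * p_b)
```

Of course, a simulation can only support independence, not prove it; the definition quantifies over all measurable sets, while the check above examines one pair of events.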
Proposition 5.1. Suppose $X_1, \cdots, X_n$ are jointly distributed random variables, i.e., they are all defined on the same probability space, and let $X = (X_1, \cdots, X_n)$. Then each of the following conditions implies independence:

(a) The joint cumulative distribution function factors into a product of the marginal cumulative distribution functions, i.e., for all real numbers $x_1, \cdots, x_n$,
$$F_X(x_1, \cdots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i).$$

(b) $X_1, \cdots, X_n$ are discrete and the joint probability mass function factors into a product of the marginal probability mass functions:
$$p_X(x_1, \cdots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i)$$
for all $x_1, \cdots, x_n$.

(c) $X_1, \cdots, X_n$ are jointly continuous and the joint probability density function factors into a product of the marginal probability density functions:
$$f_X(x_1, \cdots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$
for all $x_1, \cdots, x_n$.

Example 5.3. Let $Z$ be a Poisson random variable with parameter $\lambda$, let $W_1, W_2, \cdots$ be a collection of independent Bernoulli random variables, each with parameter $p$, and define
$$X = \sum_{i=1}^{Z} W_i, \qquad Y = Z - X.$$
Then $X$ and $Y$ are independent Poisson random variables with parameters $\lambda p$ and $\lambda(1-p)$, respectively. To prove this, we calculate
$$P\{X = i, Y = j\} = P\{X = i, Z = i+j\} = P\{X = i \mid Z = i+j\} \cdot P\{Z = i+j\} = \binom{i+j}{i} p^i (1-p)^j \cdot e^{-\lambda} \frac{\lambda^{i+j}}{(i+j)!} = \frac{e^{-\lambda p}(\lambda p)^i}{i!} \cdot \frac{e^{-\lambda(1-p)}(\lambda(1-p))^j}{j!} \equiv p_X(i) \cdot p_Y(j),$$
where $p_X$ and $p_Y$ are the probability mass functions of Poisson random variables with parameters $\lambda p$ and $\lambda(1-p)$. Independence of $X$ and $Y$ then follows from the factorization of the joint probability mass function. Furthermore, since $Z = X + Y$, we have also shown that the sum of two independent Poisson-distributed random variables is also Poisson-distributed, with parameter equal to the sum of the parameters of those two variables.

5.3 Conditional Distributions

5.3.1 Discrete Conditional Distributions

Definition 5.5.
If $X$ and $Y$ are discrete random variables with joint probability mass function $p_{X,Y}(x, y)$, then we define the conditional probability mass function of $X$ given $Y = y$ by
$$p_{X|Y}(x|y) = P\{X = x \mid Y = y\} = \frac{P\{X = x, Y = y\}}{P\{Y = y\}} = \frac{p_{X,Y}(x, y)}{p_Y(y)},$$
provided $p_Y(y) > 0$.

In other words, since $X$ and $Y$ are defined on the same probability space, we can consider the distribution of $X$ conditional on the event $Y = y$. This is of great importance in statistics, since we are often interested in the distributions of variables that can only be indirectly estimated from the values of other correlated random variables.

Of course, if $X$ and $Y$ are independent, then the conditional probability mass function of $X$ given $Y = y$ is equal to the unconditional probability mass function:
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_X(x) p_Y(y)}{p_Y(y)} = p_X(x).$$
In other words, if $X$ and $Y$ are independent, then $Y$ provides us with no information concerning the value of $X$.

Example 5.4. Suppose that $X$ and $Y$ are independent Poisson RVs with parameters $\lambda_1$ and $\lambda_2$, and let $Z = X + Y$. Then, using the fact (see Example 5.3) that $Z$ is a Poisson RV with parameter $\lambda_1 + \lambda_2$, the conditional probability mass function of $X$ given $Z = n$ is
$$p_{X|Z}(k|n) = \frac{P\{X = k, Z = n\}}{P\{Z = n\}} = \frac{P\{X = k, Y = n-k\}}{P\{Z = n\}} = \frac{P\{X = k\} P\{Y = n-k\}}{P\{Z = n\}} = e^{-\lambda_1}\frac{\lambda_1^k}{k!} \cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} \cdot \left(e^{-(\lambda_1+\lambda_2)}\frac{(\lambda_1+\lambda_2)^n}{n!}\right)^{-1} = \binom{n}{k}\left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^k \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{n-k}$$
for $k = 0, \cdots, n$, and so it follows that the conditional distribution of $X$ given $X + Y = n$ is binomial with parameters $n$ and $p = \lambda_1/(\lambda_1+\lambda_2)$. Notice that once we condition on the event $X + Y = n$, $X$ and $Y$ are no longer independent.

The next example illustrates how the conditional joint distribution of more than two RVs can be calculated.

Example 5.5.
Recall that $(X_1, \cdots, X_k)$ has the multinomial distribution with parameters $n$ and $(p_1, \cdots, p_k)$, $p_1 + \cdots + p_k = 1$, if the joint probability mass function has the form
$$P\{X_1 = n_1, \cdots, X_k = n_k\} = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k}, \qquad n_i \ge 0, \quad \sum_{i=1}^{k} n_i = n.$$

Notice that by combining groups of the component variables, we again obtain a multinomial RV. For example, if we combine the first $r$ categories into a single type and we define
$$F_r = 1 - p_{r+1} - \cdots - p_k = p_1 + \cdots + p_r,$$
then the random vector $(X_1 + \cdots + X_r, X_{r+1}, \cdots, X_k)$ has the multinomial distribution with parameters $n$ and $(F_r, p_{r+1}, \cdots, p_k)$. This can be shown by calculating
$$P\{X_1 + \cdots + X_r = l, X_{r+1} = n_{r+1}, \cdots, X_k = n_k\} = \sum_{n_1 + \cdots + n_r = l} P\{X_1 = n_1, \cdots, X_k = n_k\} = \sum_{n_1 + \cdots + n_r = l} \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_r^{n_r} p_{r+1}^{n_{r+1}} \cdots p_k^{n_k} = \frac{n!}{l!\, n_{r+1}! \cdots n_k!} F_r^l\, p_{r+1}^{n_{r+1}} \cdots p_k^{n_k},$$
provided $l + n_{r+1} + \cdots + n_k = n$. The last identity is a consequence of the multinomial formula.

Now suppose that we observe that $X_{r+1} = n_{r+1}, \cdots, X_k = n_k$. Then, if we set $m = n_{r+1} + \cdots + n_k$ and let $F_r$ be defined as in the preceding paragraph, the conditional joint distribution of $(X_1, \cdots, X_r)$ given this observation is equal to
$$P\{X_1 = n_1, \cdots, X_r = n_r \mid X_{r+1} = n_{r+1}, \cdots, X_k = n_k\} = \frac{P\{X_1 = n_1, \cdots, X_k = n_k\}}{P\{X_{r+1} = n_{r+1}, \cdots, X_k = n_k\}} = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_r^{n_r} p_{r+1}^{n_{r+1}} \cdots p_k^{n_k} \cdot \left(\frac{n!}{(n-m)!\, n_{r+1}! \cdots n_k!} F_r^{n-m}\, p_{r+1}^{n_{r+1}} \cdots p_k^{n_k}\right)^{-1} = \frac{(n-m)!}{n_1! \cdots n_r!}\left(\frac{p_1}{F_r}\right)^{n_1} \cdots \left(\frac{p_r}{F_r}\right)^{n_r}.$$
In other words, the conditional distribution of $(X_1, \cdots, X_r)$ given the values of the random variables $X_{r+1}, \cdots, X_k$ is again multinomial, with parameters $n - m$ and $(p_1/F_r, \cdots, p_r/F_r)$.

5.3.2 Continuous Conditional Distributions

Definition 5.6.
If $X$ and $Y$ are jointly continuous random variables with joint probability density $p_{X,Y}(x, y)$ and marginal densities $p_X(x)$ and $p_Y(y)$, then for any $y$ such that $p_Y(y) > 0$, we can define the conditional probability density of $X$ given that $Y = y$ by
$$p_{X|Y}(x|y) = \lim_{h \to 0} \frac{1}{h} P\{X \in (x, x+h] \mid Y \in (y, y+h]\} = \lim_{h \to 0} \frac{\frac{1}{h^2} P\{X \in (x, x+h],\, Y \in (y, y+h]\}}{\frac{1}{h} P\{Y \in (y, y+h]\}} = \frac{p_{X,Y}(x, y)}{p_Y(y)}.$$

Because $Y$ is marginally continuous, we are effectively conditioning on an event with zero probability: $P\{Y = y\} = 0$. That we are able to make sense of this is a consequence of the fact that $X$ and $Y$ are jointly continuous: this controls the singularity in the ratio that defines the conditional probability.

Remark: If $X$ and $Y$ are independent, then the conditional probability density of $X$ given $Y = y$ is equal to the unconditional probability density:
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_X(x) p_Y(y)}{p_Y(y)} = p_X(x).$$

Example 5.6. We say that a random vector $(X, Y)$ has a bivariate normal distribution if there exist parameters $(\mu_x, \mu_y)$ and $\sigma_x > 0$, $\sigma_y > 0$, $-1 < \rho < 1$ such that the joint density of $X$ and $Y$ is
$$p_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 - 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right\}.$$
The conditional density of $X$ given that $Y = y$ is equal to
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = C_1\, p_{X,Y}(x, y) = C_2 \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\,\frac{x(y-\mu_y)}{\sigma_x\sigma_y}\right]\right\} = C_3 \exp\left\{-\frac{1}{2\sigma_x^2(1-\rho^2)}\left(x^2 - 2x\left[\mu_x + \rho\frac{\sigma_x}{\sigma_y}(y-\mu_y)\right]\right)\right\} = C_4 \exp\left\{-\frac{1}{2\sigma_x^2(1-\rho^2)}\left(x - \mu_x - \rho\frac{\sigma_x}{\sigma_y}(y-\mu_y)\right)^2\right\}.$$
Apart from the constant $C_4$, which can be evaluated explicitly by using the fact that $\int_{-\infty}^{\infty} p_{X|Y}(x|y)\,dx = 1$, we recognize the conditional density as the density of a normal RV with

• mean $= \mu_x + \rho(\sigma_x/\sigma_y)(y - \mu_y)$
• variance $= \sigma_x^2(1 - \rho^2)$.

Remark: Notice that $X$ and $Y$ are independent if and only if $\rho = 0$.

5.3.3 Conditional Expectations

Definition 5.7.
If $X$ and $Y$ are jointly discrete random variables with conditional probability mass function $p_{X|Y}(x|y)$, then the conditional expectation of $X$ given $Y = y$ is the quantity
$$E[X \mid Y = y] = \sum_x x \cdot P\{X = x \mid Y = y\} = \sum_x x \cdot p_{X|Y}(x|y).$$
Similarly, if $X$ and $Y$ are jointly continuous with conditional density function $p_{X|Y}(x|y)$, then the conditional expectation of $X$ given $Y = y$ is defined by
$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x \cdot p_{X|Y}(x|y)\, dx.$$

Example 5.7. Suppose that $X$ and $Y$ are independent Poisson RVs, both with parameter $\lambda$, and let $Z = X + Y$. We wish to find the conditional expectation of $X$ given that $Z = n$. To do so, we first recall from Example 5.4 that the conditional distribution of $X$ given $Z = n$ is binomial with parameters $(n, 1/2)$:
$$p_{X|Z}(k|n) = \binom{n}{k}\left(\frac{1}{2}\right)^n$$
for $k = 0, \cdots, n$. Consequently,
$$E[X \mid Z = n] = \sum_{k=0}^{n} k \binom{n}{k}\left(\frac{1}{2}\right)^n = \frac{n}{2}.$$
We can also solve this problem by observing that
$$n = E[Z \mid Z = n] = E[X \mid Z = n] + E[Y \mid Z = n] = 2E[X \mid Z = n],$$
using the fact that $E[X \mid Z = n] = E[Y \mid Z = n]$, since $X$ and $Y$ have the same conditional distribution given $Z$.

In general, all of the results that hold for expectations also hold for conditional expectations, including the following useful identity:
$$E[g(X) \mid Y = y] = \begin{cases} \sum_x g(x)\, p_{X|Y}(x|y) & \text{discrete case} \\ \int_{-\infty}^{\infty} g(x)\, p_{X|Y}(x|y)\, dx & \text{continuous case.} \end{cases}$$

The next definition introduces a subtle concept that can be used to great effect when calculating expected values. This is particularly important in the theory of stochastic processes and Markov chains.

Definition 5.8. The conditional expectation of $X$ given $Y$ is the random variable $E[X|Y] = H(Y)$, where $H(y)$ is the (deterministic) function defined by the formula $H(y) = E[X \mid Y = y]$.

Definition 5.8 should be compared and contrasted with Definition 5.7: whereas the conditional expectation $E[X \mid Y = y]$ defined in 5.7 is a number, the conditional expectation $E[X|Y]$ defined in 5.8 is a random variable.
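The conditional distribution underlying Example 5.7 (derived in Example 5.4) can be checked exactly, rather than by simulation, since all of the pmfs involved are computable. A sketch with arbitrary illustrative parameter values:

```python
from math import exp, factorial, comb, isclose

# If X ~ Poisson(l1) and Y ~ Poisson(l2) are independent, then given
# X + Y = n, X should be Binomial(n, l1/(l1 + l2)).
l1, l2, n = 1.5, 2.5, 7  # arbitrary illustrative parameters
ppois = lambda k, lam: exp(-lam) * lam ** k / factorial(k)

p = l1 / (l1 + l2)
cond_mean = 0.0
for k in range(n + 1):
    # P{X = k | X + Y = n} computed from the Poisson pmfs
    conditional = ppois(k, l1) * ppois(n - k, l2) / ppois(n, l1 + l2)
    assert isclose(conditional, comb(n, k) * p ** k * (1 - p) ** (n - k))
    cond_mean += k * conditional

print(cond_mean, n * p)  # E[X | X + Y = n] = n * p, which is n/2 when l1 = l2
```

Setting `l1 == l2` recovers the $(n, 1/2)$ case of Example 5.7, with conditional mean $n/2$.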
The difference arises because in the former case we are conditioning on a specific event, namely $Y = y$, and asking for the mean value of $X$ under the assumption that this particular event has occurred, whereas in the latter case we are conditioning on a random variable and asking for the mean value of $X$ as a function of the unknown random value that $Y$ might take.

Example 5.8. Suppose that $X$, $Y$, and $Z$ are as in Example 5.7. Then
$$E[X|Z] = \frac{1}{2}Z,$$
since $h(z) = E[X \mid Z = z] = \frac{1}{2}z$.

Our next result states that the expected value of a random variable $X$ can be calculated by averaging the conditional expectation of $X$ given any other random variable $Y$ over the distribution of $Y$.

Proposition 5.2. For any two random variables $X$ and $Y$, we have
$$E[X] = E[E[X|Y]],$$
provided that both $E[X]$ and $E[X|Y]$ exist. In particular, if $Y$ is discrete, then
$$E[X] = \sum_y E[X \mid Y = y]\, P\{Y = y\},$$
while if $Y$ is continuous with density $p_Y(y)$, then
$$E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\, p_Y(y)\, dy.$$

Proof. We will assume that $X$ and $Y$ are both discrete. Then
$$\sum_y E[X \mid Y = y]\, P\{Y = y\} = \sum_y \sum_x x\, P\{X = x \mid Y = y\}\, P\{Y = y\} = \sum_y \sum_x x\, \frac{P\{X = x, Y = y\}}{P\{Y = y\}}\, P\{Y = y\} = \sum_y \sum_x x\, P\{X = x, Y = y\} = \sum_x x \sum_y P\{X = x, Y = y\} = \sum_x x\, P\{X = x\} = E[X].$$

5.4 Expectations of Sums and Products of Random Variables

We begin with a general result concerning the expectation of a function of two random variables.

Proposition 5.3. If $X$ and $Y$ have a joint probability mass function $p_{X,Y}(x, y)$ and $g : \mathbb{R}^2 \to \mathbb{R}$, then
$$E[g(X, Y)] = \sum_y \sum_x p_{X,Y}(x, y)\, g(x, y).$$
Similarly, if $X$ and $Y$ are jointly continuous with joint density $p_{X,Y}(x, y)$, then
$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} p_{X,Y}(x, y)\, g(x, y)\, dx\, dy.$$

Proof. By expressing $g$ as the difference between its positive and negative parts, $g = g^+ - g^-$, it suffices to prove the result under the assumption that $g$ is non-negative. Here we will assume
that $X$ and $Y$ are jointly continuous; a similar proof works in the case where both are discrete. Then, by exercise 7 of PS3, we have
$$E[g(X, Y)] = \int_0^{\infty} P\{g(X, Y) > t\}\, dt = \int_0^{\infty} \left(\iint_{(x,y):\, g(x,y) > t} p_{X,Y}(x, y)\, dx\, dy\right) dt = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \left(\int_0^{g(x,y)} p_{X,Y}(x, y)\, dt\right) dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p_{X,Y}(x, y)\, g(x, y)\, dx\, dy,$$
where the interchange of integrals in passing from the second expression to the third is justified by the fact that the integrand is everywhere non-negative.

Example 5.9. Suppose that $X$ and $Y$ are independent RVs that are uniformly distributed on $[0, 1]$. Then
$$E|X - Y| = \int_0^1 \int_0^1 |x - y|\, dy\, dx = \int_0^1 \left(\int_0^x (x - y)\, dy + \int_x^1 (y - x)\, dy\right) dx = \int_0^1 \left(x^2 - x^2/2 + (1 - x^2)/2 - x(1 - x)\right) dx = \int_0^1 \left(x^2 - x + \frac{1}{2}\right) dx = \frac{1}{3}.$$

The following corollary states that the expected value of a sum of random variables is equal to the sum of the expected values of those variables. This is true even when the variables are not independent. Put another way, expectation is a linear functional on the space of random variables with finite expectations.

Corollary 5.1. Suppose that $X_1, X_2, \cdots, X_n$ are random variables defined on the same probability space and that $E[X_i]$ is finite for all $i = 1, \cdots, n$. Then
$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i].$$

Proof. If the $X_i$ are all discrete or all jointly continuous, the corollary follows from Proposition 5.3 by taking $g(x, y) = x + y$ and then proceeding by induction on $n$.

Example 5.10. Suppose that $X_1, \cdots, X_n$ are independent, identically-distributed random variables, each with expected value $\mu$. Then the sample mean of the $X_i$'s is just the unweighted average of the sample:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Notice that
$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu,$$
i.e., the expected value of the sample mean is equal to the mean of the distribution.
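The value $E|X - Y| = 1/3$ computed in Example 5.9 is easy to confirm by Monte Carlo. A minimal sketch (not from the notes):

```python
import random

random.seed(2)
n = 200_000

# Estimate E|X - Y| for independent uniforms X, Y on [0, 1]
est = sum(abs(random.random() - random.random()) for _ in range(n)) / n
print(est)  # should be close to 1/3
```

The Monte Carlo error here is of order $n^{-1/2}$, so with $n = 200{,}000$ draws the estimate typically agrees with $1/3$ to about three decimal places.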
For this reason, the sample mean is said to be an unbiased estimator of the expected value.

Example 5.11. Expectation of a binomial RV: Recall that if $X_1, \cdots, X_n$ are independent Bernoulli RVs, each with parameter $p$, then their sum $X = X_1 + \cdots + X_n$ is a binomial RV with parameters $n$ and $p$. A simple calculation shows that $E[X_i] = p \cdot 1 + (1-p) \cdot 0 = p$ for each $i$. Consequently,
$$E[X] = \sum_{i=1}^{n} E[X_i] = np,$$
as we showed earlier using moment generating functions.

Example 5.12. Let $X$ be the number of cards that are left in their original position by a random permutation of a deck of $n$ cards. Then $X$ can be written as the sum $X = X_1 + \cdots + X_n$, where $X_i$ is the indicator function of the event that the $i$'th card is left in its original position. Notice that each $X_i$ is a Bernoulli RV with parameter $p = 1/n$, but the $X_i$'s are not independent. Nonetheless,
$$E[X] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1.$$

Example 5.13. Random walks on $\mathbb{Z}$. Suppose that $X_1, X_2, \cdots$ is a collection of independent RVs, each with distribution $P\{X_i = 1\} = P\{X_i = -1\} = \frac{1}{2}$, and let $S_N = X_1 + \cdots + X_N$. We can think of $S_N$ as the position at time $N$ of a particle which in each unit interval of time is equally likely to move to the right or to the left. Notice that
$$E[S_N] = \sum_{i=1}^{N} E[X_i] = 0,$$
since $E[X_i] = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot (-1) = 0$ for each $i$. Thus, the expected position of the particle never changes. A better measure of the displacement of the particle away from its initial position at 0 is given by the root mean square displacement (RMSD) of $S_N$,
$$D_N \equiv \sqrt{E[S_N^2]}.$$
To find $D_N$, we begin by calculating
$$D_N^2 = E[S_N^2] = E\left[(X_1 + \cdots + X_N)^2\right] = E\left[\sum_{i=1}^{N} X_i^2\right] + E\left[\sum_{1 \le i \ne j \le N} X_i X_j\right] = N E[X_1^2] + N(N-1) E[X_1 X_2],$$
where to obtain the last expression we have used the identities $E[X_i^2] = E[X_1^2]$ and $E[X_i X_j] = E[X_1 X_2]$ for all $i$ and $j$ such that $i \ne j$.
Furthermore,
$$E[X_1^2] = \frac{1}{2} \cdot 1^2 + \frac{1}{2} \cdot (-1)^2 = 1$$
and
$$E[X_1 X_2] = \frac{1}{4} \cdot 1 \cdot 1 + \frac{1}{4} \cdot 1 \cdot (-1) + \frac{1}{4} \cdot (-1) \cdot 1 + \frac{1}{4} \cdot (-1) \cdot (-1) = 0.$$
Substituting these values into the previous equation shows that $D_N^2 = N$, and so the RMSD of the particle is $D_N = N^{1/2}$. Notice that $D_N$ diverges as $N \to \infty$, but at a rate that is sublinear in $N$.

5.4.1 Expectations of products

Proposition 5.4. If $X$ and $Y$ are independent, then for any pair of real-valued functions $h$ and $g$,
$$E[h(X) g(Y)] = E[h(X)] \cdot E[g(Y)],$$
provided that the expectations appearing on both sides of the identity exist.

Proof. We will prove this under the assumption that $X$ and $Y$ are jointly continuous with joint density $p_{X,Y} = p_X \cdot p_Y$. Then
$$E[h(X) g(Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x) g(y)\, p_{X,Y}(x, y)\, dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x) g(y)\, p_X(x)\, p_Y(y)\, dx\, dy = \left(\int_{-\infty}^{\infty} p_X(x) h(x)\, dx\right)\left(\int_{-\infty}^{\infty} p_Y(y) g(y)\, dy\right) = E[h(X)] \cdot E[g(Y)].$$

Corollary 5.2. Taking $h(x) = x$ and $g(y) = y$ in Proposition 5.4 yields the important identity
$$E[XY] = E[X] \cdot E[Y],$$
provided that $X$, $Y$, and $XY$ all have finite expectations.

Corollary 5.3. If $X_1, \cdots, X_n$ are independent RVs, then a simple induction argument on $n$ shows that
$$E\left[\prod_{i=1}^{n} f_i(X_i)\right] = \prod_{i=1}^{n} E[f_i(X_i)],$$
again provided that all of the expectations appearing in the formula exist.

Our next proposition establishes an important property of moment generating functions of sums of independent random variables.

Proposition 5.5. Suppose that $X_1, \cdots, X_n$ are independent RVs with moment generating functions $\psi_{X_1}(t), \cdots, \psi_{X_n}(t)$. Then the moment generating function of the sum $X = X_1 + \cdots + X_n$ is
$$\psi_X(t) = E\left[e^{t(X_1 + \cdots + X_n)}\right] = \prod_{i=1}^{n} E\left[e^{tX_i}\right] = \prod_{i=1}^{n} \psi_{X_i}(t),$$
i.e., the moment generating function of a sum of independent RVs is just the product of the moment generating functions of the individual RVs.

Example 5.14. Suppose that $X_1, \cdots, X_n$ are independent Poisson RVs with parameters $\lambda_1, \cdots, \lambda_n$, respectively.
Then the moment generating function of the sum $X = X_1 + \cdots + X_n$ is
$$\psi_X(t) = \prod_{i=1}^{n} e^{-\lambda_i(1 - e^t)} = \exp\left(-(1 - e^t)\sum_{i=1}^{n} \lambda_i\right),$$
which shows that $X$ is a Poisson RV with parameter $\sum_{i=1}^{n} \lambda_i$. Compare this with Example 5.3.

5.5 Covariance and Correlation

Definition 5.9. If $X$ and $Y$ are RVs with means $\mu_X = E[X]$ and $\mu_Y = E[Y]$, then the covariance between $X$ and $Y$ is the quantity
$$\mathrm{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right] = E[XY] - \mu_X \mu_Y.$$

Remark: The covariance of two random variables is a measure of their association. The covariance is positive if $Y$ tends to be larger than $\mu_Y$ when $X$ is larger than $\mu_X$, while the covariance is negative if the opposite pattern holds. In particular, if $X$ and $Y$ are independent, then
$$\mathrm{Cov}(X, Y) = E[X - \mu_X]\, E[Y - \mu_Y] = 0,$$
because knowledge of how $X$ compares with its mean provides no information about $Y$, or vice versa.

Observe that the converse is not true: there are random variables that have zero covariance but which are not independent. For example, suppose that $X$ has the distribution
$$P\{X = 0\} = P\{X = 1\} = P\{X = -1\} = \frac{1}{3},$$
and that the random variable $Y$ is defined by
$$Y = \begin{cases} 0 & \text{if } X \ne 0 \\ 1 & \text{if } X = 0. \end{cases}$$
Then $E[X] = 0$ and also $E[XY] = 0$ because $XY = 0$ with probability 1, so that
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = 0,$$
but $X$ and $Y$ clearly are not independent. For example,
$$P\{X = 0 \mid Y = 1\} = 1 \ne \frac{1}{3} = P\{X = 0\}.$$
Any two random variables $X$ and $Y$ are said to be uncorrelated if $\mathrm{Cov}(X, Y) = 0$, regardless of whether or not they are independent.

The next proposition lists some useful properties of covariances.

Proposition 5.6.

(a) $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
(b) $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
(c) $\mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X, Y)$ for any scalar $a$
(d) $\mathrm{Cov}\left(\sum_{i=1}^{n} X_i,\, \sum_{j=1}^{m} Y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}(X_i, Y_j)$.

Each property can be proved directly from the definition of the covariance and the linearity properties of expectations. A very useful formula follows by combining parts (b) and (d) of the preceding proposition:
\[
\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i) + \sum_{1 \le i \neq j \le n} \mathrm{Cov}(X_i, X_j).
\]
In particular, if the $X_i$'s are pairwise independent, then $\mathrm{Cov}(X_i, X_j) = 0$ whenever $i \neq j$, so that
\[
\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \mathrm{Var}(X_i).
\]
If, in addition to being pairwise independent, the $X_i$'s are identically distributed, say with variance $\mathrm{Var}(X_i) = \sigma^2$, then
\[
\mathrm{Var}\left(\sum_{i=1}^n X_i\right) = n\sigma^2.
\]

Example 5.15. If $X_1, \cdots, X_n$ are independent Bernoulli RVs, each with parameter $p$, then $X = X_1 + \cdots + X_n$ is a Binomial RV with parameters $(n, p)$. Because the variance of each Bernoulli RV is $p(1-p)$, the preceding formula for the variance of a sum of IID RVs shows that $\mathrm{Var}(X) = np(1-p)$, confirming our previous calculation using moment generating functions.

Example 5.16. If $X_1, \cdots, X_n$ is a collection of IID (independent, identically-distributed) random variables with mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{X}$ and the sample variance $S^2$ are the quantities defined by
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2.
\]
If we think of the $X_i$'s as the outcomes of a sequence of independent replicates of some experiment, then the sample mean and sample variance are often used to estimate the true mean and true variance of the underlying distribution. Recall that in the previous section we showed that the sample mean is an unbiased estimator for the true mean: $E[\bar{X}] = \mu$. Although unbiasedness is a desirable property of an estimator, we also care about the variance of the estimator about its expectation. For the sample mean, we can calculate
\[
\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{\sigma^2}{n},
\]
which shows that the variance of the sample mean is inversely proportional to the number of independent replicates that have been performed. We can also show that the sample variance is an unbiased estimator of the true variance.
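This unbiasedness claim is easy to check numerically. The sketch below (not part of the original notes; Python standard library only, with arbitrary choices of sample size and distribution) averages many sample variances computed with the $n-1$ denominator and with the $n$ denominator:

```python
import random

random.seed(4)

# Average many sample variances of small normal samples: the (n-1)-denominator
# estimator is unbiased for the true variance 1.0, while the n-denominator
# estimator is biased low by the factor (n-1)/n.
n, reps = 5, 40000
unbiased, biased = 0.0, 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    unbiased += ss / (n - 1)
    biased += ss / n
print(unbiased / reps)  # close to 1.0
print(biased / reps)    # close to (n-1)/n = 0.8
```

The averaged $n$-denominator estimate lands near $0.8$, the bias factor $(n-1)/n$ derived below.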
We first observe that
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \mu + \mu - \bar{X}\right)^2
= \frac{1}{n-1}\left(\sum_{i=1}^n (X_i - \mu)^2 - 2(\bar{X} - \mu)\sum_{i=1}^n (X_i - \mu) + \sum_{i=1}^n (\bar{X} - \mu)^2\right)
\]
\[
= \frac{1}{n-1}\sum_{i=1}^n (X_i - \mu)^2 - \frac{n}{n-1}(\bar{X} - \mu)^2.
\]
Taking expectations gives
\[
E[S^2] = \frac{1}{n-1}\sum_{i=1}^n E\left[(X_i - \mu)^2\right] - \frac{n}{n-1}\mathrm{Var}(\bar{X})
= \frac{n}{n-1}\sigma^2 - \frac{n}{n-1}\cdot\frac{\sigma^2}{n} = \sigma^2.
\]
Informally, the reason that we must divide by $n-1$ rather than $n$ to obtain an unbiased estimate of the variance is that the deviations between the $X_i$'s and the sample mean (which is computed using the observed data) tend to be smaller than the deviations between the $X_i$'s and the true mean.

The next definition provides us with a dimensionless measure of the association between two random variables.

Definition 5.10. The correlation of two random variables $X$ and $Y$ with variances $\sigma_X^2 > 0$ and $\sigma_Y^2 > 0$ is the quantity
\[
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.
\]
Because
\[
\rho(aX, bY) = \frac{ab\,\mathrm{Cov}(X, Y)}{a\sigma_X \cdot b\sigma_Y} = \rho(X, Y)
\]
for any pair of positive scalars $a$ and $b$, the correlation between two random variables does not depend on the units in which they are measured, i.e., the correlation is dimensionless. Another important property of the correlation is that
\[
-1 \le \rho(X, Y) \le 1. \tag{5.1}
\]
This can be proved as follows. First note that
\[
0 \le \mathrm{Var}\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right)
= \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} + 2\,\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
= 2(1 + \rho(X, Y)),
\]
which implies that $\rho(X, Y) \ge -1$. A similar calculation involving the variance of the difference of $X/\sigma_X$ and $Y/\sigma_Y$ shows that $\rho(X, Y) \le 1$. If we interpret the standard deviation of a random variable as a measure of its length or norm, then (5.1) can be interpreted as a version of the Cauchy-Schwarz inequality.

It can be shown that if $\rho(X, Y) = 1$, then with probability one
\[
Y = \mu_Y + \frac{\sigma_Y}{\sigma_X}(X - \mu_X),
\]
while if $\rho(X, Y) = -1$, then with probability one
\[
Y = \mu_Y - \frac{\sigma_Y}{\sigma_X}(X - \mu_X).
\]
Thus the correlation can also be treated as a measure of the collinearity of two random variables.

5.6 The Law of Large Numbers and the Central Limit Theorem

In Example 5.16, we showed that the sample mean $\bar{X}_N$ of a series of $N$ independent and identically-distributed random variables is an unbiased estimator of the true mean $\mu = E[X_1]$ and that the variance of the sample mean is inversely proportional to $N$:
\[
E\left[\bar{X}_N\right] = \mu, \qquad \mathrm{Var}\left(\bar{X}_N\right) = \frac{\sigma^2}{N}.
\]
This suggests that when $N$ is large, the sample mean should be close to $\mu$ and that the difference $\bar{X}_N - \mu$ should typically be of order $N^{-1/2}$ in magnitude. Our aim in this section is to explain in what sense these statements are true.

5.6.1 The Weak and Strong Laws of Large Numbers

Proposition 5.7. (Markov's Inequality) If $X$ is a nonnegative real-valued RV, then for any positive real number $a > 0$,
\[
P\{X \ge a\} \le \frac{E[X]}{a}.
\]

Proof. Let $I_a$ be the indicator function of the event $\{X \ge a\}$:
\[
I_a(\omega) = \begin{cases} 1 & \text{if } X(\omega) \ge a \\ 0 & \text{otherwise.} \end{cases}
\]
Then, since $X \ge 0$, we have $I_a \le X/a$. Taking expectations of both sides of this inequality gives
\[
P\{X \ge a\} = E[I_a] \le \frac{E[X]}{a}.
\]

Proposition 5.8. (Chebyshev's Inequality) If $X$ is a RV with finite mean $\mu$ and finite variance $\sigma^2$, then for any positive real number $k > 0$,
\[
P\{|X - \mu| \ge k\} \le \frac{\sigma^2}{k^2}.
\]

Proof. The result follows by taking $a = k^2$ and applying Markov's inequality to the nonnegative random variable $(X - \mu)^2$:
\[
P\{|X - \mu| \ge k\} = P\{(X - \mu)^2 \ge k^2\} \le \frac{E\left[(X - \mu)^2\right]}{k^2} = \frac{\sigma^2}{k^2}.
\]

While Markov's inequality provides a simple bound on the tail of the distribution of $X$, Chebyshev's inequality allows us to estimate the probability that $X$ differs greatly from its mean. In general, both estimates tend to be crude in the sense that the probabilities in question can be much smaller than the upper bounds given by these inequalities. However, even these crude inequalities can be used to prove some important results. Our next result, called the weak law of large numbers, is one of the celebrated limit theorems of probability theory.
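As a quick numerical aside (not from the notes; a standard-library Python sketch with an arbitrary choice of distribution), one can see how conservative Chebyshev's bound is for an exponential variable with rate 1, which has $\mu = \sigma^2 = 1$:

```python
import random

random.seed(0)

# Compare the empirical tail probability P(|X - mu| >= k) with Chebyshev's
# bound sigma^2 / k^2 for X ~ Exponential(1), where mu = sigma^2 = 1.
mu, var = 1.0, 1.0
samples = [random.expovariate(1.0) for _ in range(50000)]
results = {}
for k in (1.5, 2.0, 3.0):
    p = sum(1 for x in samples if abs(x - mu) >= k) / len(samples)
    results[k] = (p, var / k ** 2)
    print(k, results[k])
```

For each $k$, the empirical probability is several times smaller than the bound $\sigma^2/k^2$, consistent with the remark above that the inequality is crude.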
Before stating this theorem, we introduce one of the modes of convergence of a sequence of random variables. In fact, there are several distinct senses in which a sequence of random variables can be said to converge to a limit, and we will encounter two more of these below.

Definition 5.11. We say that a sequence of random variables $X_1, X_2, \cdots$ converges in probability to a random variable $X$ if for every $\epsilon > 0$,
\[
\lim_{n \to \infty} P\{|X_n - X| > \epsilon\} = 0.
\]
Notice that the variables $X_1, X_2, \cdots$ and $X$ must all be defined on the same probability space for this definition to make sense.

Theorem 5.1. (The Weak Law of Large Numbers) Suppose that $X_1, X_2, \cdots$ is a sequence of i.i.d. variables with finite mean $\mu = E[X_1]$ and let $S_n = X_1 + \cdots + X_n$. Then, for any $\epsilon > 0$,
\[
\lim_{n \to \infty} P\left\{\left|\frac{S_n}{n} - \mu\right| \ge \epsilon\right\} = 0,
\]
i.e., the sample averages $\frac{1}{n}S_n$ converge in probability to the mean.

Proof. We will prove the theorem under the additional assumption that $\mathrm{Var}(X_i) = \sigma^2 < \infty$. In this case we have
\[
E\left[\frac{1}{n}S_n\right] = \mu \quad \text{and} \quad \mathrm{Var}\left(\frac{1}{n}S_n\right) = \frac{\sigma^2}{n},
\]
and so Chebyshev's inequality implies that for any $\epsilon > 0$,
\[
P\left\{\left|\frac{1}{n}S_n - \mu\right| \ge \epsilon\right\} \le \frac{\sigma^2}{n\epsilon^2},
\]
which tends to 0 as $n \to \infty$.

In other words, given any $\epsilon > 0$, the weak law of large numbers asserts that by collecting sufficiently many independent samples, we can make the probability that the sample mean and the true mean differ by more than $\epsilon$ as close to zero as we want. Here we think of $\epsilon$ as an upper bound on the acceptable error in our estimate.

In fact, there is an even stronger sense in which the sample mean converges to the true mean as the number of samples tends to infinity. This is the content of the next theorem, which we state without proof.

Theorem 5.2. (Strong Law of Large Numbers) Let $X_1, X_2, \cdots$ be a sequence of i.i.d. random variables, each with finite mean $\mu = E[X_1]$, and let $S_n = X_1 + \cdots + X_n$. Then
\[
P\left\{\lim_{n \to \infty} \frac{1}{n}S_n = \mu\right\} = 1,
\]
i.e., the sequence $\frac{1}{n}S_n$ converges almost surely to $\mu$.
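The laws of large numbers are easy to see in action. A minimal standard-library sketch (an illustration, with fair-die rolls as an arbitrary choice of i.i.d. sequence):

```python
import random

random.seed(1)

# Sample means of fair-die rolls settle near the true mean 3.5 as the
# number of rolls n grows, illustrating the law of large numbers.
def sample_mean(n):
    return sum(random.randint(1, 6) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, sample_mean(n))
```

The fluctuations around 3.5 shrink roughly like $n^{-1/2}$, the rate suggested by the variance computation above.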
The significance of the strong law of large numbers is that it shows that the frequentist concept of probability can be derived using the measure-theoretic machinery introduced by Kolmogorov in the 1930's. In particular, suppose that $(\Omega, \mathcal{F}, P)$ is a probability space and that $E \in \mathcal{F}$ is an event, and let $X_1, X_2, \cdots$ be independent indicator variables for $E$:
\[
X_i = \begin{cases} 1 & \text{if } E \text{ occurs on the } i\text{'th trial} \\ 0 & \text{otherwise.} \end{cases}
\]
Then $E[X_i] = P(E)$ and so the strong law tells us that
\[
P\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n X_i = P(E)\right\} = 1,
\]
i.e., the limiting frequency of $E$ is equal to its probability.

5.6.2 The Central Limit Theorem

Although the strong law of large numbers tells us that the sample means of a sequence of i.i.d. random variables converge almost surely to the expected value of the sampling distribution, it does not tell us anything about the rate of convergence or about the size of the fluctuations of the sample means about this limit. This information is important because we might like to know, for example, how likely it is that the sample mean of, say, 10,000 independent variates will differ from the true mean by more than 10% of the value of the latter. The Central Limit Theorem is one of several results that give us some insight into the fluctuations of the sequence of sample means under the additional assumption that the sampling distribution has finite variance. Before stating this theorem, we begin by defining a new mode of convergence for a sequence of random variables.

Definition 5.12. Suppose that $F$ and $F_n$, $n \ge 1$, are cumulative distribution functions on $\mathbb{R}$. Then $F_n$ is said to converge weakly to $F$ if $\lim_{n \to \infty} F_n(x) = F(x)$ at every continuity point $x$ of $F$. Likewise, a sequence of random variables $X_n$ is said to converge in distribution to a random variable $X$ if the cumulative distribution functions $F_n(x) = P(X_n \le x)$ converge weakly to $F(x) = P(X \le x)$.

Example 5.17.
To see why we only require pointwise convergence at continuity points, let $X_n$ and $X$ be defined by setting $P(X_n = 1/n) = 1$ and $P(X = 0) = 1$. Then
\[
F_n(x) = \begin{cases} 0 & \text{if } x < 1/n \\ 1 & \text{otherwise} \end{cases}
\qquad \text{and} \qquad
F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{otherwise,} \end{cases}
\]
which shows that $F_n(x)$ converges to $F(x)$ as $n \to \infty$ at every $x \neq 0$. Since $F(x)$ is continuous everywhere except at $x = 0$, it follows that the sequence $X_n$ converges in distribution to $X$ even though the sequence $F_n(0) = 0$ does not converge to $F(0) = 1$.

The next proposition shows how we can use moment generating functions to deduce that a sequence of random variables converges in distribution. The assumption that the moment generating function of the limit $X$ is finite is essential.

Proposition 5.9. Let $X_1, X_2, \cdots$ be a sequence of random variables with moment generating functions $\psi_{X_n}(t)$, and let $X$ be a random variable with moment generating function $\psi_X$. Then the sequence $X_n$ converges in distribution to $X$ if
\[
\lim_{n \to \infty} \psi_{X_n}(t) = \psi_X(t) < \infty
\]
for all $t \in \mathbb{R}$.

In effect, this proposition says that two random variables $X_n$ and $X$ are close (in the sense that their distributions are close) if their moment generating functions are close. This will be key to our proof of the central limit theorem, but we first need to determine the moment generating function of a normally-distributed random variable.

Example 5.18. If $Z$ is a standard normal RV (with mean 0 and variance 1), then
\[
M_Z(t) = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx
= e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x - t)^2/2}\,dx
= e^{t^2/2} = \sum_{n=0}^{\infty} \frac{1}{2^n n!}\, t^{2n}.
\]
Thus, the moments of $Z$ are
\[
M_{2n} = \frac{(2n)!}{2^n n!}, \qquad M_{2n+1} = 0
\]
for all $n \ge 0$. To calculate the moment generating function of a normal RV $X$ with mean $\mu$ and standard deviation $\sigma$, recall that $Z \equiv (X - \mu)/\sigma$ is a standard normal RV. Then, writing $X = \sigma Z + \mu$, we have
\[
\psi_X(t) = E\left[e^{tX}\right] = E\left[e^{t(\sigma Z + \mu)}\right] = e^{t\mu}\,\psi_Z(\sigma t) = \exp\left(\frac{\sigma^2 t^2}{2} + \mu t\right).
\]

Theorem 5.3. (The Central Limit Theorem) Suppose that $X_1, X_2, \cdots$ is a sequence of i.i.d.
variables with finite mean $\mu$ and variance $\sigma^2$, and let $S_n = X_1 + \cdots + X_n$. Then the sequence of random variables
\[
Z_n \equiv \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{n^{1/2}}{\sigma}\left(\frac{1}{n}S_n - \mu\right)
\]
converges in distribution to the standard normal $N(0, 1)$, i.e., for every real number $z$,
\[
\lim_{n \to \infty} P\{Z_n \le z\} = \Phi(z) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\,dx.
\]

Proof. We will prove the C.L.T. under the assumption that the moment generating function of the random variables $X_i$, denoted $\psi(t)$, is finite on the entire real line. Let us first assume that $\mu = 0$ and $\sigma^2 = 1$. Notice that the moment generating function of the scaled random variable $X_i/\sqrt{n}$ is
\[
E\left[\exp\left(\frac{tX_i}{\sqrt{n}}\right)\right] = \psi\left(\frac{t}{\sqrt{n}}\right),
\]
and that the moment generating function of the sum $Z_n = \sum_{i=1}^n X_i/\sqrt{n}$ is equal to
\[
\psi\left(\frac{t}{\sqrt{n}}\right)^n.
\]
Let $L(t) = \log \psi(t)$ and observe that $L(0) = L'(0) = 0$ and $L''(0) = 1$. By Proposition 5.9, it suffices to show that
\[
\lim_{n \to \infty} \psi\left(\frac{t}{\sqrt{n}}\right)^n = e^{t^2/2},
\]
which is equivalent to
\[
\lim_{n \to \infty} n L\left(\frac{t}{\sqrt{n}}\right) = \frac{t^2}{2}.
\]
However, this last identity can be verified using L'Hôpital's rule (substituting $x = n^{-1/2}$):
\[
\lim_{n \to \infty} \frac{L(t/\sqrt{n})}{n^{-1}} = \lim_{x \to 0} \frac{L(tx)}{x^2} = \lim_{x \to 0} \frac{t\,L'(tx)}{2x} = \lim_{x \to 0} \frac{t^2 L''(tx)}{2} = \frac{t^2}{2}.
\]
The general case can then be handled by applying this result to the standardized variables $X_i^* = (X_i - \mu)/\sigma$, which have mean 0 and variance 1.

What is remarkable about the central limit theorem is that it shows that the distribution of a sum of sufficiently many i.i.d. random variables with finite variance is approximately normal no matter what distribution the individual variables have. This may explain why the normal distribution is encountered in so many seemingly unrelated phenomena, such as the distribution of height in a population of adult males, the distribution of particle velocities in a gas, or the distribution of errors in an experiment.

Remark 5.1. The restriction imposed by the use of the moment generating function in the proof of the C.L.T. can be surmounted by instead working with characteristic functions:
\[
\chi_X(t) \equiv E\left[e^{itX}\right].
\]
Provided that $X$ is real-valued, we have $|e^{itX}| = 1$ and so the modulus of the characteristic function is less than or equal to 1:
\[
|\chi_X(t)| \le 1 \quad \text{for all } t \in (-\infty, \infty).
\]
This means that the characteristic function of any real-valued random variable is finite on the entire real line. Furthermore, Proposition 5.9 still holds if we replace pointwise convergence of moment generating functions by pointwise convergence of characteristic functions.

Example 5.19. Let $X_1, \cdots, X_{10}$ be independent random variables, each uniformly distributed on $[0, 1]$, and let $X = X_1 + \cdots + X_{10}$. The central limit theorem can be used to approximate the distribution of $X$. For example,
\[
P\{X > 6\} = P\left\{\frac{X - 5}{\sqrt{10/12}} > \frac{6 - 5}{\sqrt{10/12}}\right\} \approx 1 - \Phi\left(\sqrt{1.2}\right) \approx 0.137.
\]

A version of the central limit theorem holds even when the random variables $X_1, X_2, \cdots$ are independent but not necessarily identically-distributed.

Theorem 5.4. Let $X_1, X_2, \cdots$ be a sequence of independent random variables with $\mu_i = E[X_i]$ and $\sigma_i^2 = \mathrm{Var}(X_i)$. If the $X_i$ are uniformly bounded, i.e., there exists $M < \infty$ such that $P\{|X_i| < M\} = 1$ for all $i \ge 1$, and if $\sum_{i \ge 1} \sigma_i^2 = \infty$, then
\[
\lim_{n \to \infty} P\left\{\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sqrt{\sum_{i=1}^n \sigma_i^2}} \le x\right\} = \Phi(x)
\]
for each $x \in \mathbb{R}$.

This more general result can be used to prove the following surprising statement about the number of prime divisors of a randomly chosen positive integer.

Theorem 5.5. (Erdős-Kac) For each $n \ge 1$, let $X_n$ be uniformly distributed on the set $\{1, \cdots, n\}$, and define $\phi(m)$ to be the number of distinct prime divisors of the integer $m$, e.g., $\phi(2) = 1$, $\phi(3) = 1$, $\phi(4) = 1$, $\phi(5) = 1$, $\phi(6) = 2$, etc. Then, for every real number $x$,
\[
\lim_{n \to \infty} P\left(\frac{\phi(X_n) - \log\log(n)}{\sqrt{\log\log(n)}} \le x\right) = \Phi(x).
\]
In other words, if an integer is chosen at random between 1 and $n$, then for large $n$ the number of distinct prime divisors of that integer is approximately normally distributed with mean and variance both equal to $\log\log(n)$.
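The normal approximation in Example 5.19 can be checked by brute force. A minimal Monte Carlo sketch (an illustration, standard library only; the number of replicates is an arbitrary choice):

```python
import random

random.seed(2)

# Estimate P(X > 6) for X the sum of 10 independent Uniform(0,1) variables
# and compare with the CLT approximation 1 - Phi(sqrt(1.2)) ≈ 0.137.
reps = 200000
hits = sum(1 for _ in range(reps)
           if sum(random.random() for _ in range(10)) > 6)
print(hits / reps)  # close to 0.137
```

Even for a sum of only 10 variables, the empirical tail probability agrees with the normal approximation to about two decimal places, which is typical of how quickly the CLT takes hold for symmetric summands.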
5.7 Multivariate Normal Distributions

It can be shown that the central limit theorem extends to sums of independent, identically-distributed random vectors which, when suitably normalized, converge in distribution to a multivariate normal distribution. There are several equivalent definitions of the MVN distribution, but here we adopt a slightly restricted version based on the density function.

Definition 5.13. A random vector $X = (X_1, \cdots, X_n)$ with values in $\mathbb{R}^n$ is said to have a multivariate normal distribution if there is a vector $\mu = (\mu_1, \cdots, \mu_n)$ and a symmetric positive-definite $n \times n$ matrix $\Sigma = (\sigma_{ij})$ with determinant $|\Sigma| > 0$ such that $X$ has density
\[
f(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).
\]
In this case, we write $X \sim N(\mu, \Sigma)$ and say that $\mu$ is the mean vector and $\Sigma$ is the covariance matrix of $X$.

Random vectors that satisfy the conditions of the preceding definition are said to have a non-degenerate MVN distribution. Degenerate MVN distributions can also be defined by allowing the covariance matrix to have rank less than $n$; in this case, $X$ is supported on a proper subspace of $\mathbb{R}^n$. Because degenerate MVN distributions do not have a density on $\mathbb{R}^n$, the following property, which we state as a theorem, is sometimes taken as a more general definition of the MVN distribution, encompassing both the degenerate and non-degenerate cases.

Theorem 5.6. Suppose that $X \sim N(\mu, \Sigma)$ is an $n$-dimensional MVN random vector and let $c = (c_1, \cdots, c_n) \in \mathbb{R}^n$. Then $Y = c \cdot X = c_1 X_1 + \cdots + c_n X_n$ is normally distributed with
\[
E[Y] = \sum_{i=1}^n c_i \mu_i, \qquad \mathrm{Var}(Y) = \sum_{i=1}^n c_i^2 \sigma_{ii} + 2\sum_{i<j} c_i c_j \sigma_{ij}.
\]
In particular, $E[X_i] = \mu_i$, $\mathrm{Var}(X_i) = \sigma_{ii}$ and $\mathrm{Cov}(X_i, X_j) = \sigma_{ij}$.

Proof. By subtracting the mean vector if necessary, we can assume that $X \sim N(0, \Sigma)$, where $0 = (0, \cdots, 0) \in \mathbb{R}^n$.
Since $\Sigma$ is a symmetric positive-definite matrix, it can be shown that there is an $n \times n$ matrix $A = (a_{ij})$ with inverse $B = A^{-1}$ such that $\Sigma = AA^T$ and $\Sigma^{-1} = B^T B$. Notice that the following relationship exists between the entries of $\Sigma$ and those of $A$:
\[
\sigma_{ij} = \sum_{k=1}^n a_{ik} a_{jk}.
\]
Furthermore, $A$ can be chosen so that $|A| = |\Sigma|^{1/2} > 0$.

We will show that $Y$ has the stated normal distribution by calculating its moment generating function, which we can do by evaluating the following $n$-dimensional integral:
\[
\psi_Y(t) = E\left[e^{tY}\right] = \int_{\mathbb{R}^n} e^{tc \cdot x}\, \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x\right) dx.
\]
Our first step is to make the change of variables $x = Ay$. Since $dx = |A|\,dy$, this gives
\[
\psi_Y(t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \int_{\mathbb{R}^n} e^{tc \cdot Ay}\, e^{-\frac{1}{2} y^T A^T B^T B A y}\, |A|\,dy
= \frac{1}{\sqrt{(2\pi)^n}} \int_{\mathbb{R}^n} e^{tc \cdot Ay - \frac{1}{2}|y|^2}\,dy,
\]
where we have used the identities $AB = BA = I$ and $y^T y = |y|^2 = \sum_{j=1}^n y_j^2$. Furthermore, since
\[
tc \cdot Ay = t\sum_{i=1}^n c_i (Ay)_i = t\sum_{i=1}^n c_i \sum_{j=1}^n a_{ij} y_j = t\sum_{j=1}^n \left(\sum_{i=1}^n c_i a_{ij}\right) y_j = t\sum_{j=1}^n \gamma_j y_j,
\]
the exponential function inside the integral can be factored into a product of exponentials:
\[
e^{tc \cdot Ay - \frac{1}{2}|y|^2} = \prod_{j=1}^n e^{t\gamma_j y_j - \frac{1}{2} y_j^2}.
\]
This allows us to write the $n$-dimensional integral as a product of one-dimensional integrals, each of which can be evaluated using the moment generating function of the normal distribution:
\[
\psi_Y(t) = \prod_{j=1}^n \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{t\gamma_j y_j - \frac{1}{2} y_j^2}\,dy_j = \prod_{j=1}^n e^{\gamma_j^2 t^2/2} = e^{\gamma^2 t^2/2},
\]
where
\[
\gamma^2 = \sum_{j=1}^n \gamma_j^2 = \sum_{j=1}^n \left(\sum_{i=1}^n c_i a_{ij}\right)^2
= \sum_{i=1}^n c_i^2 \sum_{k=1}^n a_{ik}^2 + 2\sum_{i<j} c_i c_j \sum_{k=1}^n a_{ik} a_{jk}
= \sum_{i=1}^n c_i^2 \sigma_{ii} + 2\sum_{i<j} c_i c_j \sigma_{ij}.
\]
This shows that $Y$ is normally distributed with mean 0 and variance $\gamma^2$.

In fact, it can be shown that any affine transformation of a MVN-distributed random vector is also MVN.

Theorem 5.7. Suppose that $X \sim N(\mu, \Sigma)$ takes values in $\mathbb{R}^n$ and let $Y = AX + b$, where $A$ is a full-rank $m \times n$ matrix with $m \le n$ and $b \in \mathbb{R}^m$. Then $Y$ is an $m$-dimensional MVN random vector with distribution $N(A\mu + b, A\Sigma A^T)$.
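Theorems 5.6 and 5.7 are easy to illustrate numerically. The sketch below (not from the notes; standard library only, with an arbitrary choice of matrix $A$ and vector $c$) samples $X = AZ$ with $Z$ standard normal, so that $\mathrm{Cov}(X) = AA^T = \Sigma$, and checks that $Y = c \cdot X$ has variance $c^T \Sigma c$:

```python
import random

random.seed(7)

# Sample a 2-D zero-mean MVN by hand via X = A Z, so Cov(X) = A A^T = Sigma,
# and check that the linear combination Y = c . X has variance c^T Sigma c.
A = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical "square root" of Sigma
c = [1.0, -1.0]
# Sigma = A A^T = [[4, 2], [2, 2]], so c^T Sigma c = 4 - 2 - 2 + 2 = 2.
n = 100000
acc = 0.0
for _ in range(n):
    z = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
    x = [sum(A[i][j] * z[j] for j in range(2)) for i in range(2)]
    y = c[0] * x[0] + c[1] * x[1]
    acc += y * y          # Y has mean 0, so E[Y^2] = Var(Y)
print(acc / n)            # close to 2.0
```

Here the check can also be done by hand: $Y = X_1 - X_2 = 2Z_1 - (Z_1 + Z_2) = Z_1 - Z_2$, whose variance is indeed 2.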
This result is important in its own right, but we will use it in the proof of the following theorem.

Theorem 5.8. Suppose that $X_1, \cdots, X_n$ are independent normal random variables with mean $\mu$ and variance $\sigma^2$ and let $\bar{X} = (X_1 + \cdots + X_n)/n$ be their sample mean. Then $\bar{X}$ is independent of the random variables $X_1 - \bar{X}, \cdots, X_n - \bar{X}$.

Proof. The trick is to notice that the random vector $Y = (\bar{X}, X_2 - \bar{X}, \cdots, X_n - \bar{X})$ can be written as a linear transformation of the random vector $X = (X_1, \cdots, X_n)$, i.e., $Y = AX$, where
\[
A = \begin{pmatrix}
\frac{1}{n} & \frac{1}{n} & \cdots & \frac{1}{n} \\
-\frac{1}{n} & 1 - \frac{1}{n} & \cdots & -\frac{1}{n} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{n} & -\frac{1}{n} & \cdots & 1 - \frac{1}{n}
\end{pmatrix}.
\]
Since $X$ is MVN with mean vector $(\mu, \cdots, \mu)$ and covariance matrix $\sigma^2 I_{n \times n}$, it follows that $Y$ is also MVN with mean vector $(\mu, 0, \cdots, 0)$. To find the covariance matrix of $Y$, we recall that $\bar{X}$ has variance $\sigma^2/n$. Also, since
\[
\mathrm{Cov}(X_i, \bar{X}) = \mathrm{Cov}\left(X_i, \frac{X_i}{n}\right) = \frac{\sigma^2}{n},
\]
it follows that the variance of $X_i - \bar{X}$ is
\[
\mathrm{Var}(X_i - \bar{X}) = \mathrm{Var}(X_i) + \mathrm{Var}(\bar{X}) - 2\,\mathrm{Cov}(X_i, \bar{X}) = \frac{n-1}{n}\sigma^2.
\]
Similarly,
\[
\mathrm{Cov}(\bar{X}, X_i - \bar{X}) = \mathrm{Cov}(\bar{X}, X_i) - \mathrm{Var}(\bar{X}) = 0,
\]
while for $i \neq j$,
\[
\mathrm{Cov}(X_i - \bar{X}, X_j - \bar{X}) = \mathrm{Cov}(X_i, X_j) - \mathrm{Cov}(X_i, \bar{X}) - \mathrm{Cov}(X_j, \bar{X}) + \mathrm{Var}(\bar{X}) = -\frac{\sigma^2}{n}.
\]
From these results it can be shown that the joint density of $Y$ can be written as
\[
f_Y(y_1, \cdots, y_n) = C \exp\left(-\frac{1}{2\sigma^2/n}(y_1 - \mu)^2\right) \exp\left(-\frac{1}{2\sigma^2}\left[\left(\sum_{i=2}^n y_i\right)^2 + \sum_{i=2}^n y_i^2\right]\right),
\]
where $C$ is a normalizing constant. However, since the joint density factors into a product of terms that depend either only on $y_1 = \bar{x}$ or only on $(y_2, \cdots, y_n) = (x_2 - \bar{x}, \cdots, x_n - \bar{x})$, it follows that $\bar{X}$ is independent of $(X_2 - \bar{X}, \cdots, X_n - \bar{X})$. Furthermore, since $\sum_{i=1}^n (X_i - \bar{X}) = 0$, we see that
\[
X_1 - \bar{X} = -\sum_{i=2}^n (X_i - \bar{X}),
\]
which shows that $X_1 - \bar{X}$ is a function of the vector $(X_2 - \bar{X}, \cdots, X_n - \bar{X})$. Consequently, $\bar{X}$ is also independent of $(X_1 - \bar{X}, \cdots, X_n - \bar{X})$.
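A quick simulation (a sketch, standard library only, with arbitrary sample sizes) is consistent with this independence: the empirical covariance between $\bar{X}$ and $X_1 - \bar{X}$ is near zero:

```python
import random

random.seed(8)

# For i.i.d. standard normal samples, the sample mean is uncorrelated with
# (in fact independent of) each deviation X_i - X_bar; since both have mean
# zero here, averaging the product estimates their covariance.
n, reps = 5, 50000
cov_acc = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    cov_acc += xbar * (xs[0] - xbar)
print(cov_acc / reps)  # close to 0
```

Zero covariance alone would not prove independence (recall the counterexample in Section 5.5), but for jointly normal variables it does, which is exactly the content of the theorem.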
Since the sample variance
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2
\]
is a function of the vector $(X_1 - \bar{X}, \cdots, X_n - \bar{X})$, Theorem 5.8 has the following important corollary.

Corollary 5.4. If $X_1, \cdots, X_n$ are i.i.d. normal random variables, then the sample mean $\bar{X}$ is independent of the sample variance $S^2$.

In other words, provided that the data are normally distributed and the measurements are independent, the error in our estimate of the mean is independent of the error in our estimate of the variance.

5.8 Sample Statistics

5.8.1 The χ² Distribution

Suppose that we perform a series of i.i.d. trials that result in measurements $X_1, \cdots, X_n$ and we let $\bar{X}$ and $S^2$ denote the sample mean and sample variance:
\[
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2.
\]
If we assume that each of the variables $X_i$ is normally distributed with unknown mean $\mu$ and unknown variance $\sigma^2$, then $\bar{X}$ is also normally distributed, with mean $\mu$ and variance $\sigma^2/n$. The sample variance, however, is not normally distributed. To identify the distribution of $S^2$, we first observe that if $Z$ is a standard normal random variable, then by the change-of-variables formula (cf. Example 5.8), the density of $Z^2$ on $(0, \infty)$ is
\[
f_{Z^2}(x) = 2 \cdot \frac{1}{2\sqrt{x}} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-x/2} = \frac{1}{\sqrt{2\pi}}\, x^{1/2 - 1} e^{-x/2}.
\]
However, this is just the density of the gamma distribution with shape parameter $\alpha = 1/2$ and rate parameter $\lambda = 1/2$, which is also known as the $\chi^2$-distribution with one degree of freedom, sometimes abbreviated $\chi^2_1$. As an aside, since we can also write this density as
\[
f_{Z^2}(x) = \frac{(1/2)^{1/2}}{\Gamma(1/2)}\, x^{1/2 - 1} e^{-x/2},
\]
this calculation leads to the identity
\[
\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}.
\]
The moment generating function of the $\chi^2_1$ distribution is
\[
m_{Z^2}(t) \equiv E\left[e^{tZ^2}\right] = \int_0^{\infty} e^{tx}\, \frac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2}\,dx
= \frac{1}{\sqrt{2\pi}} \int_0^{\infty} x^{-1/2} e^{-(1/2 - t)x}\,dx
\]
\[
= \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-(1 - 2t)y^2/2}\,dy
= \frac{2}{\sqrt{2\pi}} \cdot \frac{1}{2}\sqrt{\frac{2\pi}{1 - 2t}} = (1 - 2t)^{-1/2},
\]
provided $t < 1/2$, where we substituted $x = y^2$ in the third step.
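This moment generating function is easy to check by simulation. A sketch (standard library only; the evaluation point $t = 0.1$ is an arbitrary choice below $1/2$, kept small so that the Monte Carlo average has finite variance):

```python
import math
import random

random.seed(5)

# Monte Carlo check of E[exp(t Z^2)] = (1 - 2t)^(-1/2) at t = 0.1,
# where Z is standard normal; the exact value is 0.8**(-0.5) ≈ 1.118.
t, n = 0.1, 200000
est = sum(math.exp(t * random.gauss(0.0, 1.0) ** 2) for _ in range(n)) / n
exact = (1 - 2 * t) ** -0.5
print(est, exact)
```

Note that the estimator itself would have infinite variance for $t \ge 1/4$ (since it requires $E[e^{2tZ^2}] < \infty$), so small $t$ is not just a stylistic choice.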
Now suppose that $Z_1, \cdots, Z_n$ are independent standard normal random variables and let $R = Z_1^2 + \cdots + Z_n^2$ be the sum of their squares. Our goal is to identify the distribution of $R$. Because the $Z_i$ are independent, we know that the moment generating function of $R$ is equal to the product of the moment generating functions of the individual terms, i.e.,
\[
m_R(t) = E\left[e^{tR}\right] = (1 - 2t)^{-n/2}.
\]
I claim that this is also the moment generating function of a gamma distribution with shape parameter $\alpha = n/2$ and rate parameter $\lambda = 1/2$. Indeed, if $Y$ is such a random variable, then
\[
m_Y(t) = E\left[e^{tY}\right] = \int_0^{\infty} e^{ty}\, \frac{(1/2)^{n/2}}{\Gamma(n/2)}\, y^{n/2 - 1} e^{-y/2}\,dy
= \left(\frac{1}{2}\right)^{n/2} \frac{1}{\Gamma(n/2)} \int_0^{\infty} y^{n/2 - 1} e^{-(1/2 - t)y}\,dy
\]
\[
= \left(\frac{1}{2}\right)^{n/2} (1/2 - t)^{-n/2}\, \frac{1}{\Gamma(n/2)} \int_0^{\infty} x^{n/2 - 1} e^{-x}\,dx
= (1 - 2t)^{-n/2}\, \frac{\Gamma(n/2)}{\Gamma(n/2)} = (1 - 2t)^{-n/2},
\]
as claimed. This shows that $Y$ and $R$ have the same distribution. In practice, $R$ is said to have the $\chi^2$ distribution with $n$ degrees of freedom, which is written $R \sim \chi^2_n$.

The relevance of this distribution to the sample variance is explained in the following proposition.

Proposition 5.10. If the trial outcomes $X_1, \cdots, X_n$ are i.i.d. normal random variables, then the normalized sample variance $(n-1)S^2/\sigma^2$ is $\chi^2$-distributed with $n - 1$ degrees of freedom.

Proof. In Example 5.16, we showed that the sample variance can be rewritten in the following form:
\[
S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \mu)^2 - \frac{n}{n-1}(\bar{X} - \mu)^2.
\]
If we rearrange these terms and divide each one by $\sigma^2$, we arrive at the following expression:
\[
\frac{(n-1)S^2}{\sigma^2} + \frac{(\bar{X} - \mu)^2}{\sigma^2/n} = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}.
\]
Now we know that the right-hand side is $\chi^2$-distributed with $n$ degrees of freedom, while the second term on the left-hand side is $\chi^2$-distributed with only one degree of freedom. By using moment generating functions (together with the fact that the two terms on the left-hand side are independent, by Theorem 5.8), we can show that this implies that the first term on the left-hand side is $\chi^2$-distributed with $n - 1$ degrees of freedom.

Example 5.20.
Suppose that 16 individuals are sampled at random from a population in which height is normally distributed with unknown mean $\mu$ and variance $\sigma^2$. We know that the sample variance $S^2$ is an unbiased estimator of $\sigma^2$, but we might also be interested in calculating the probability that $S^2$ is unusually small relative to $\sigma^2$, e.g., we would like to calculate $P(S^2 \le 0.25\sigma^2)$. Since $15 S^2/\sigma^2 \sim \chi^2_{15}$, this can be done by evaluating the following probability:
\[
P\left(S^2 \le \frac{1}{4}\sigma^2\right) = P\left(\frac{15 S^2}{\sigma^2} \le \frac{15}{4}\right) = P\left(\chi^2_{15} \le 3.75\right) \approx 0.0016,
\]
which was done with the help of the Matlab function chi2cdf(x, df).

5.8.2 Student's t Distribution

When the measurements $X_1, \cdots, X_n$ are normally distributed, we know that the Z-score
\[
Z = \frac{\bar{X} - \mu}{\sigma/n^{1/2}}
\]
is also normally distributed. However, this observation is only useful in practice if we know the actual value of the variance $\sigma^2$. If not, then we can replace $\sigma$ in this expression by the square root $S$ of the sample variance, but then the distribution of the statistic
\[
T = \frac{\bar{X} - \mu}{S/n^{1/2}}
\]
is no longer normal. Instead, $T$ is said to have Student's t-distribution with $n - 1$ degrees of freedom. This distribution is characterized in the next proposition.

Proposition 5.11. Let $Z$ be a standard normal random variable and let $V$ be an independent $\chi^2$-distributed random variable with $n$ degrees of freedom. Then the test statistic
\[
T = \frac{Z}{\sqrt{V/n}}
\]
has Student's t distribution with $n$ degrees of freedom, and the density of $T$ on $(-\infty, \infty)$ is
\[
f_T(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}.
\]

Proof. The density of $T$ can be found by marginalizing over the joint density of $T$ and $V$,
\[
p_T(t) = \int_0^{\infty} p_{T,V}(t, v)\,dv,
\]
and the joint density can be found with the help of the product rule:
\[
p_{T,V}(t, v) = p_{T|V}(t|v)\,p_V(v).
\]
The first term on the right-hand side can be found by noting that conditional on $V = v$, $T$ is normally distributed with mean 0 and variance $n/v$, i.e.,
\[
p_{T|V}(t|v) = \frac{\sqrt{v}}{\sqrt{2\pi n}}\, e^{-vt^2/2n}.
\]
Also, since $V$ is $\chi^2$-distributed with $n$ degrees of freedom, it follows that the density of $V$ is that of the gamma distribution with shape parameter $n/2$ and rate parameter $1/2$:
\[
p_V(v) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, v^{n/2 - 1} e^{-v/2}.
\]
Multiplying these gives
\[
p_{T,V}(t, v) = \frac{1}{2^{n/2}\sqrt{2\pi n}\,\Gamma(n/2)}\, v^{\frac{n+1}{2} - 1} e^{-\left(\frac{1}{2} + \frac{t^2}{2n}\right)v},
\]
and marginalizing over $v$ then gives
\[
p_T(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}.
\]
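As a closing numerical check (a sketch, standard library only), one can simulate $T = Z/\sqrt{V/n}$ directly, building $V$ as a sum of $n$ squared standard normals, and compare the sample second moment with the variance $n/(n-2)$ of the $t_n$ distribution (a standard fact not derived above, valid for $n > 2$):

```python
import random

random.seed(11)

# Simulate T = Z / sqrt(V/n) with Z standard normal and V a sum of n squared
# standard normals (so V ~ chi2(n)); since E[T] = 0, averaging T^2 estimates
# Var(T), which should be close to n/(n-2) = 1.25 for n = 10.
n, reps = 10, 200000
acc = 0.0
for _ in range(reps):
    z = random.gauss(0.0, 1.0)
    v = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    t = z / (v / n) ** 0.5
    acc += t * t
print(acc / reps)  # close to 1.25
```

The fact that this variance exceeds 1 reflects the heavier tails of the t distribution relative to the standard normal, which is precisely the price paid for estimating $\sigma$ by $S$.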