Introduction to Probability
Motoya Machida
January 17, 2006
This material is designed to provide a foundation in the mathematics of probability. We begin with the basic
concepts of probability, such as events, random variables, independence, and conditional probability. Having developed these concepts, the remainder of the material concentrates on the methods of calculation in probability
and their applications through exercises. The topics treated here are divided into four major parts: (I) probability and distributions, (II) discrete distributions, (III) continuous distributions, and (IV) joint distributions.
When completed, these topics will unify the understanding of density and distribution functions of various kinds,
calculation and interpretation of moments, and distributions related to the normal distribution.
1 Sample space and events
The set of all possible outcomes of an experiment is called the sample space, and is typically denoted by Ω. For
example, if the outcome of an experiment is the order of finish in a race among 3 boys, Jim, Mike and Tom, then
the sample space becomes
Ω = {(J, M, T ), (J, T, M ), (M, J, T ), (M, T, J), (T, J, M ), (T, M, J)}.
In another example, suppose that a researcher is interested in the lifetime of a transistor, and measures it in
minutes. Then the sample space is represented by
Ω = {all nonnegative real numbers}.
Any subset of the sample space is called an event. In the who-wins-the-race example, “Mike wins the race” is
an event, which we denote by A. Then we can write
A = {(M, J, T ), (M, T, J)}.
In the transistor example, “the transistor does not last longer than 2500 minutes” is an event, which we denote
by E. And we can write
E = {x : 0 ≤ x ≤ 2500}.
1.1 Set operations
Once we have events A, B, . . ., we can define a new event from these events by using set operations—union,
intersection, and complement. The event A ∪ B, called the union of A and B, means that either A or B or both
occurs. Consider the “who-wins-the-race” example, and let B denote the event “Jim wins the second place”,
that is,
B = {(M, J, T ), (T, J, M )}.
Then A ∪ B means that “either Mike wins the first, or Jim wins the second, or both,” that is,
A ∪ B = {(M, T, J), (T, J, M ), (M, J, T )}.
The event A ∩ B, called the intersection of A and B, means that both A and B occur. In our example, A ∩ B
means that “Mike wins the first and Jim wins the second,” that is,
A ∩ B = {(M, J, T )}.
The event Aᶜ, called the complement of A, means that A does not occur. In our example, the event Aᶜ means
that "Mike does not win the race", that is,
Aᶜ = {(J, M, T), (J, T, M), (T, J, M), (T, M, J)}.
Now suppose that C is the event “Tom wins the race.” Then what is the event A ∩ C? It is impossible that
Mike wins and Tom wins at the same time. In mathematics, it is called the empty set, denoted by ∅, and can be
expressed in the form
A ∩ C = ∅.
If the two events A and B satisfy A ∩ B = ∅, then they are said to be disjoint. For example, “Mike wins the
race” and “Tom wins the race” are disjoint events.
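These set operations can be checked directly in Python. The following is a small sketch: the tuples abbreviate the race outcomes as above, and the variable names are my own.

from itertools import permutations

# Sample space: all orders of finish among Jim (J), Mike (M), Tom (T).
omega = set(permutations(["J", "M", "T"]))

A = {o for o in omega if o[0] == "M"}   # Mike wins the race
B = {o for o in omega if o[1] == "J"}   # Jim finishes second
C = {o for o in omega if o[0] == "T"}   # Tom wins the race

print(A | B)           # union: Mike first, or Jim second, or both
print(A & B)           # intersection: only ("M", "J", "T")
print(omega - A)       # complement of A: Mike does not win
print(A & C == set())  # True: A and C are disjoint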
1.2 Exercises
1. Two six-sided dice are thrown sequentially, and the face values that come up are recorded.
(a) List the sample space Ω.
(b) List the elements that make up the following events:
i. A = {the sum of the two values is at least 5}
ii. B = {the value of the first die is higher than the value of the second}
iii. C = {the first value is 4}
(c) List the elements of the following events:
i. A ∩ C
ii. B ∪ C
iii. A ∩ (B ∪ C)
2 Probability and combinatorial methods
A probability P is a “function” defined over all the “events” (that is, all the subsets of Ω), and must satisfy the
following properties:
1. If A ⊂ Ω, then 0 ≤ P (A) ≤ 1
2. P (Ω) = 1
3. If A1, A2, . . . are events and Ai ∩ Aj = ∅ for all i ≠ j, then P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
Property 3 is called the "axiom of countable additivity," which is clearly motivated by the property:
P (A1 ∪ · · · ∪ An ) = P (A1 ) + · · · + P (An ) if A1 , A2 , . . . , An are mutually exclusive events.
Quite frankly, there is nothing in one’s intuitive notion that requires this axiom.
2.1 Equally likely outcomes
When the sample space
Ω = {ω1 , . . . , ωN }
consists of N outcomes, it is often natural to assume that P ({ω1 }) = · · · = P ({ωN }). Then we can obtain the
probability of a single outcome by
P({ωi}) = 1/N    for i = 1, . . . , N.
Assuming this, we can compute the probability of any event A by counting the number of outcomes in A, and
obtain
P(A) = (number of outcomes in A) / N.
A problem involving equally likely outcomes can be solved by counting. The mathematics of counting is known
as combinatorial analysis. Here we summarize several basic facts—the multiplication principle, permutations, and
combinations.
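As a small illustration of computing a probability by counting equally likely outcomes (using the race example above; the fractions module keeps the probability exact):

from fractions import Fraction
from itertools import permutations

# Each of the 3! orders of finish is assumed equally likely, with probability 1/6.
omega = list(permutations(["J", "M", "T"]))
A = [o for o in omega if o[0] == "M"]    # "Mike wins the race"

prob_A = Fraction(len(A), len(omega))    # number of outcomes in A over N
print(prob_A)                            # 1/3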
2.2 Multiplication principle
If one experiment has m outcomes, and a second has n outcomes, then there are m × n outcomes for the two
experiments.
The multiplication principle can be extended and used in the following experiment. Suppose that we have n
differently numbered balls in an urn. In the first trial, we pick up a ball from the urn, record its number, and
put it back into the urn. In the second trial, we again pick up, record, and put back a ball. Continue the same
procedure until the r-th trial. The whole process is called a sampling with replacement. By generalizing the
multiplication principle, the number of all possible outcomes becomes
n × · · · × n = n^r    (r factors).
Example 1. How many different 7-place license plates are possible if the first 3 places are to be occupied by
letters and the final 4 by numbers?
Example 2. How many different functions f defined on {1, . . . , n} are possible if the value f (i) takes either 0
or 1?
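A minimal sketch of Examples 1 and 2 under the multiplication principle (the value n = 5 in the second part is an arbitrary choice for illustration):

# Example 1: 3 letter places and 4 digit places, repetition allowed.
plates = 26**3 * 10**4
print(plates)            # 175760000

# Example 2: each of f(1), ..., f(n) is 0 or 1, so there are 2**n such functions.
n = 5                    # n is arbitrary here, chosen only for illustration
print(2**n)              # 32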
2.3 Permutations
Consider again “n balls in an urn.” In this experiment, we pick up a ball from the urn, record its number, but do
not put it back into the urn. In the second trial, we again pick up, record, and do not put back a ball. Continue
the same procedure r-times. The whole process is called a sampling without replacement. Then, the number of
all possible outcomes becomes
n × (n − 1) × (n − 2) × · · · × (n − r + 1)    (r factors).
Given a set of n different elements, the number of all possible "ordered" sets of size r is called the number of r-element permutations of an n-element set. In particular, the number of n-element permutations of an n-element set is expressed as

n × (n − 1) × (n − 2) × · · · × 2 × 1 = n!    (n factors).
The above mathematical symbol “n!” is called “n factorial.” We define 0! = 1, since there is one way to order 0
elements.
Example 3. How many different 7-place license plates are possible if the first 3 places are using different letters
and the final 4 using different numbers?
Example 4. How many different functions f defined on {1, . . . , n} are possible if the function f takes values on {1, . . . , n} and satisfies f(i) ≠ f(j) for all i ≠ j?
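A short sketch of Examples 3 and 4 using Python's math.perm and math.factorial (the value n = 5 is again an arbitrary illustration):

import math

# Example 3: 3 distinct letters followed by 4 distinct digits.
print(math.perm(26, 3) * math.perm(10, 4))   # 15600 * 5040 = 78624000

# Example 4: one-to-one functions from {1, ..., n} onto {1, ..., n}: n! of them.
n = 5                                        # illustrative choice of n
print(math.factorial(n))                     # 120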
2.4 Combinations
In an experiment, we have n differently numbered balls in an urn, and pick up r balls “as a group.” Then there
are
n!/(n − r)!
ways of selecting the group if the order is relevant (r-element permutation). However, the order is irrelevant
when you choose r balls as a group, and any particular r-element group has been counted exactly r! times. Thus,
the number of r-element groups can be calculated as
C(n, r) = n!/((n − r)! r!).

You should read the symbol C(n, r) as "n choose r."
Example 5. A committee of 3 is to be formed from a group of 20 people. How many different committees are
possible?
Example 6. From a group of 5 women and 7 men, how many different committees consisting of 2 women and
3 men can be formed?
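Examples 5 and 6 can be computed directly with math.comb (a small sketch):

import math

# Example 5: committees of 3 from 20 people.
print(math.comb(20, 3))                    # 1140

# Example 6: 2 women out of 5 and 3 men out of 7, by the multiplication principle.
print(math.comb(5, 2) * math.comb(7, 3))   # 10 * 35 = 350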
2.5 Binomial theorem
The term "n choose r" is often referred to as a binomial coefficient, because of the following identity:

(a + b)^n = Σ_{i=0}^n C(n, i) a^i b^{n−i}.
In fact, we can give a proof of the binomial theorem by using combinatorial analysis. For the moment we pretend that the commutative law cannot be used for multiplication. Then, for example, the expansion of (a + b)^2 becomes
aa + ab + ba + bb,
and consists of the 4 terms aa, ab, ba, and bb. In general, how many terms in the expansion of (a + b)^n should contain a's in exactly i places? The answer to this question yields the binomial theorem.
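The identity can also be checked numerically for a few sample values of a, b, and n (an illustrative sketch, not part of the proof):

import math

# Check the binomial theorem for a few sample values of a, b, and n.
for a, b, n in [(2.0, 3.0, 5), (1.5, -0.5, 7)]:
    lhs = (a + b) ** n
    rhs = sum(math.comb(n, i) * a**i * b**(n - i) for i in range(n + 1))
    print(abs(lhs - rhs) < 1e-9)   # True, True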
2.6 Optional topic
A lot of n items contains k defectives, and m are selected randomly and inspected. How should the value of m
be chosen so that the probability that at least one defective item turns up is 0.9? Apply your answer to the
following cases:
1. n = 1000 and k = 10;
2. n = 10,000 and k = 100.
Suppose that we have a lot of size n containing k defectives. If we sample and inspect m random items, what is the
probability that we will find i defectives in our sample? This probability is called a hypergeometric distribution,
and is expressed by

p(i) = C(k, i) C(n − k, m − i) / C(n, m),    i = max{0, m + k − n}, . . . , min{m, k}.
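A brute-force sketch of the optional topic above (the helper functions and the search loop are my own; they use the hypergeometric probability above with i = 0):

import math

def p_no_defective(n, k, m):
    # Hypergeometric probability of i = 0 defectives in a sample of size m.
    return math.comb(n - k, m) / math.comb(n, m)

def smallest_m(n, k, target=0.9):
    # Smallest sample size m with P(at least one defective) >= target.
    for m in range(1, n + 1):
        if 1 - p_no_defective(n, k, m) >= target:
            return m

print(smallest_m(1000, 10))     # case 1: n = 1000, k = 10
print(smallest_m(10_000, 100))  # case 2: n = 10,000, k = 100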
2.7 Exercises
1. How many ways are there to encode the 26-letter English alphabet into 8-bit binary words (sequences of
eight 0’s and 1’s)?
2. What is the coefficient of x^3 y^4 in the expansion of (x + y)^7?
3. A child has six blocks, three of which are red and three of which are green. How many patterns can she
make by placing them all in a line?
4. A deck of 52 cards is shuffled thoroughly. What is the probability that the four aces are all next to each
other? (Hint: Imagine that you have 52 slots lined up to place the four aces. How many different ways
can you “choose” four slots for those aces? How many different ways do you get consecutive slots for those
aces?)
3 Conditional probability and independence
Let A and B be two events where P(B) ≠ 0. Then the conditional probability of A given B can be defined as

P(A|B) = P(A ∩ B) / P(B).
The idea of “conditioning” is that “if we have known that B has occurred, the sample space should have become
B rather than Ω.”
Example 1. Mr. and Mrs. Jones have two children. What is the conditional probability that their children are
both boys, given that they have at least one son?
3.1 Bayes' rule
Suppose that P (A|B) and P (B) are known. Then we can use this to find
P (A ∩ B) = P (A|B)P (B).
Example 2. Celine estimates that her chance to receive an A grade would be 1/2 in a French course, and 2/3 in a chemistry course. If she decides whether to take a French course or a chemistry course this semester based on the flip of a coin, what is the probability that she gets an A in chemistry?
Law of total probability. Let A and B be two events. Then we can write the probability P (A) as
P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = P(A|B)P(B) + P(A|Bᶜ)P(Bᶜ).
In general, suppose that we have a sequence B1, B2, . . . , Bn of mutually disjoint events satisfying ∪_{i=1}^n Bi = Ω, where "mutual disjointness" means that Bi ∩ Bj = ∅ for all i ≠ j. (The events B1, B2, . . . , Bn are called "a partition of Ω.") Then for any event A we have

P(A) = Σ_{i=1}^n P(A ∩ Bi) = Σ_{i=1}^n P(A|Bi)P(Bi).
Example 3. A study shows that an accident-prone person will have an accident in a one-year period with probability 0.4, whereas this probability is only 0.2 for a non-accident-prone person. Suppose that we assume that 30 percent of all the new policyholders of an insurance company are accident prone. Then what is the
probability that a new policyholder will have an accident within a year of purchasing a policy?
Bayes' rule. Let A and B1, . . . , Bn be events such that the Bi's are mutually disjoint, ∪_{i=1}^n Bi = Ω, and P(Bi) > 0 for all i. Then

P(Bj|A) = P(A ∩ Bj) / P(A) = P(A|Bj)P(Bj) / Σ_{i=1}^n P(A|Bi)P(Bi).
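A short numerical sketch of the law of total probability and Bayes' rule applied to Example 3 (the final line, asking how likely a policyholder who had an accident is to be accident prone, is my own added illustration):

p_B = 0.30                   # P(accident prone)
p_A_given_B = 0.4            # P(accident within a year | accident prone)
p_A_given_Bc = 0.2           # P(accident within a year | not accident prone)

# Law of total probability.
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)
print(p_A)                   # 0.26

# Bayes' rule: probability of being accident prone given that an accident occurred.
print(p_A_given_B * p_B / p_A)   # about 0.46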
3.2 Concept of independence
Intuitively we would like to say that A and B are independent if knowing about one event gives us no information about the other. That is, P(A|B) = P(A) and P(B|A) = P(B). We say A and B are independent if
P (A ∩ B) = P (A)P (B).
This definition is symmetric in A and B, and allows P (A) and/or P (B) to be 0. Furthermore, a collection of
events A1 , A2 , . . . , An is said to be mutually independent if it satisfies
P(A_{i1} ∩ · · · ∩ A_{im}) = P(A_{i1}) P(A_{i2}) · · · P(A_{im})

for any subcollection A_{i1}, A_{i2}, . . . , A_{im}.
3.3 Exercises
1. Urn A has three red balls and two white balls, and urn B has two red balls and five white balls. A fair
coin is tossed; if it lands heads up, a ball is drawn from urn A, and otherwise, a ball is drawn from urn B.
(a) What is the probability that a red ball is drawn?
(b) If a red ball is drawn, what is the probability that the coin landed heads up?
2. An urn contains three red and two white balls. A ball is drawn, and then it and another ball of the same
color are placed back in the urn. Finally, a second ball is drawn.
(a) What is the probability that the second ball drawn is white?
(b) If the second ball drawn is white, what is the probability that the first ball drawn was red?
3. Two dice are rolled, and the sum of the face values is six. What is the probability that at least one of the
dice came up a three?
4. A couple has two children. What is the probability that both are girls given that the oldest is a girl? What
is the probability that both are girls given that one of them is a girl?
5. What is the probability that the following system works if each unit fails independently with probability p
(see the figure below)?
4 Random variables
A numerically valued function X of an outcome ω from a sample space Ω
X : Ω → R : ω → X(ω)
is called a random variable (r.v.), and usually determined by an experiment. We conventionally denote random
variables by uppercase letters X, Y, Z, U, V, . . . from the end of the alphabet.
Discrete random variables. A discrete random variable is a random variable that can take values on a finite set
{a1 , a2 , . . . , an } of real numbers (usually integers), or on a countably infinite set {a1 , a2 , a3 , . . .}. The statement
such as “X = ai ” is an event since
{ω : X(ω) = ai }
is a subset of the sample space Ω. Thus, we can consider the probability of the event {X = ai}, denoted by
P (X = ai ). The function
p(ai ) := P (X = ai )
over the possible values of X, say a1 , a2 , . . ., is called a frequency function, or a probability mass function. The
frequency function p must satisfy
Σ_i p(ai) = 1,
where the sum is over the possible values of X. This will completely describe the probabilistic nature of the
random variable X. Another useful function is the cumulative distribution function (c.d.f.), and it is defined by
F (x) = P (X ≤ x),
−∞ < x < ∞.
The c.d.f. of a discrete r.v. is a nondecreasing step function. It jumps wherever p(x) > 0, and the jump at
ai is p(ai ). C.d.f.’s are usually denoted by uppercase letters, while frequency functions are usually denoted by
lowercase letters.
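A small sketch of a frequency function and its cumulative distribution function (the values p(0) = 0.2, p(1) = 0.5, p(2) = 0.3 are assumed only for illustration; fractions keep the arithmetic exact):

from fractions import Fraction as F

# Frequency function of a discrete random variable (values assumed for illustration).
p = {0: F(2, 10), 1: F(5, 10), 2: F(3, 10)}

def cdf(x):
    # F(x) = P(X <= x): sum the frequency function over all values a_i <= x.
    return sum(prob for value, prob in p.items() if value <= x)

print(sum(p.values()) == 1)                      # True: the frequencies sum to one
print([cdf(x) for x in (-1, 0, 0.5, 1, 2, 3)])   # nondecreasing step function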
4.1 Continuous random variables
A continuous random variable is a random variable whose possible values are real values such as 78.6, 5.7, 10.24,
and so on. Examples of continuous random variables include temperature, height, diameter of metal cylinder,
etc. In what follows, a random variable means a “continuous” random variable, unless it is specifically said to
be discrete.
The probability distribution of a random variable X specifies how its values are distributed over the real numbers.
This is completely characterized by the cumulative distribution function (cdf). The cdf F(t) := P(X ≤ t) represents the probability that the random variable X is less than or equal to t. Then we often say that "the
random variable X is distributed as F (t).” The cdf F (t) must satisfy the following properties:
1. The cdf F (t) is a positive function [that is, F (t) ≥ 0].
2. The cdf F (t) increases monotonically [that is, if s < t, then F (s) ≤ F (t)].
3. The cdf F(t) must tend to zero as t goes to negative infinity "−∞", and must tend to one as t goes to positive infinity "+∞" [that is, lim_{t→−∞} F(t) = 0 and lim_{t→+∞} F(t) = 1].
4.2 Probability density function
It is often the case that the probability that the random variable X takes a value in a particular range is given
by the area under a curve over that range of values. This curve is called the probability density function (pdf ) of
the random variable X, denoted by f (x). Thus, the probability that “a ≤ X ≤ b” can be expressed by
P(a ≤ X ≤ b) = ∫_a^b f(x) dx.
Furthermore, we can find the following relationship between cdf and pdf:
F(t) = P(X ≤ t) = ∫_{−∞}^t f(x) dx.
This implies that (i) the cdf F is a continuous function, and (ii) the pdf f (x), if f is continuous at x, can be
obtained as the derivative
f(x) = dF(x)/dx.
Example 1. A random variable X is called a uniform random variable on [a, b], when X takes any real number
in the interval [a, b] equally likely. Then the pdf of X is given by
f(x) = 1/(b − a) if a ≤ x ≤ b, and f(x) = 0 otherwise [that is, if x < a or b < x].
Compute the cdf F (x) for the uniform random variable X.
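A minimal sketch of Example 1 with the illustrative choice a = 0, b = 2; it checks the closed-form cdf F(x) = (x − a)/(b − a) on [a, b] against a Riemann sum of the pdf:

def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    # Integrate the density: F(x) = (x - a)/(b - a) on [a, b], 0 below a, 1 above b.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

a, b = 0.0, 2.0                       # illustrative interval
print(uniform_cdf(1.5, a, b))         # 0.75

# Midpoint-rule check of the cdf at x = 1.5 against the integral of the pdf.
n = 100_000
approx = sum(uniform_pdf(a + (i + 0.5) * (1.5 - a) / n, a, b) for i in range(n)) * (1.5 - a) / n
print(round(approx, 4))               # 0.75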
4.3 Quantile function
Given a cdf F (x), we can define the quantile function Q(u) = inf{x : F (x) ≥ u} for u ∈ (0, 1). In particular, if F
is continuous and strictly increasing, then we have Q(u) = F^{−1}(u). When F(x) is the cdf of a random variable X, and F(x) is continuous, we can find

xp = Q(p) ⇔ P(X ≤ xp) = p

for every p ∈ (0, 1). The value x1/2 = Q(1/2) is called the median of F, and the values x1/4 = Q(1/4) and x3/4 = Q(3/4) are called the lower quartile and the upper quartile, respectively.
Example 2. Compute the quantile function Q(u) for a uniform random variable X on [a, b].
4.4 Joint distributions
When two discrete random variables X and Y are obtained at the same experiment, we can define their joint
frequency function by
p(ai , bj ) = P (X = ai , Y = bj )
where a1 , a2 , . . . and b1 , b2 , . . . are the possible values of X and Y , respectively. The marginal frequency function
of X, denoted by pX , can be calculated by
pX(ai) = P(X = ai) = Σ_j p(ai, bj),
where the sum is over the possible values b1, b2, . . . of Y. Similarly, the marginal frequency function pY(bj) = Σ_i p(ai, bj) of Y is given by summing over the possible values a1, a2, . . . of X.
Joint distributions of two continuous random variables. Consider a pair (X, Y ) of continuous random
variables. A joint density function f (x, y) is a nonnegative function satisfying
∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1,

and is used to compute probabilities constructed from the random variables X and Y simultaneously. For example, we can calculate the probability

P(u1 ≤ X ≤ u2, v1 ≤ Y ≤ v2) = ∫_{u1}^{u2} ∫_{v1}^{v2} f(x, y) dy dx.

In particular,

F(u, v) = P(X ≤ u, Y ≤ v) = ∫_{−∞}^u ∫_{−∞}^v f(x, y) dy dx

is called the joint distribution function.
Given the joint density function f (x, y), the distribution for each of X and Y is called the marginal distribution,
and the marginal density functions of X and Y , denoted by fX (x) and fY (y), are given respectively by
fX(x) = ∫_{−∞}^∞ f(x, y) dy   and   fY(y) = ∫_{−∞}^∞ f(x, y) dx.
Example 3. Consider the following joint density function of random variables X and Y :
f(x, y) = λ^2 e^{−λy} if 0 ≤ x ≤ y, and f(x, y) = 0 otherwise.    (1)
(a) Find P (X ≤ 1, Y ≤ 1). (b) Find the marginal density functions fX (x) and fY (y).
4.5 Conditional density functions
Suppose that two random variables X and Y have a joint density function f(x, y). If fX(x) > 0, then we can define the conditional density function fY|X(y|x) given X = x by

fY|X(y|x) = f(x, y) / fX(x).

Similarly we can define the conditional density function fX|Y(x|y) given Y = y by

fX|Y(x|y) = f(x, y) / fY(y)
if fY (y) > 0. Then, clearly we have the following relation
f (x, y) = fY |X (y|x)fX (x) = fX|Y (x|y)fY (y).
Example 4. Find the conditional density functions fX|Y (x|y) and fY |X (y|x) for the joint density function in
Example 3 (see 4.4).
4.6 Independent random variables
Let X and Y be discrete random variables with joint frequency function p(x, y). Then X and Y are said to be
independent if they satisfy
p(x, y) = pX (x)pY (y)
for all possible values of (x, y).
In the same manner, if the joint density function f (x, y) for continuous random variables X and Y is expressed
in the form
f (x, y) = fX (x)fY (y) for all x, y,
then X and Y are independent.
4.7 Exercises
1. Suppose that X is a discrete random variable with P (X = 0) = .25, P (X = 1) = .125, P (X = 2) = .125,
and P (X = 3) = .5. Graph the frequency function and the cumulative distribution function of X.
2. An experiment consists of throwing a fair coin four times. (1) List the sample space Ω. (2) For each
discrete random variable X as defined in (a)–(d), describe the event “X = 0” as a subset of Ω and compute
P (X = 0). (3) For each discrete random variable X as defined in (a)–(d), find the possible values of X
and compute the frequency function.
(a) X is the number of heads before the first tail.
(b) X is the number of heads following the first tail.
(c) X is the number of heads minus the number of tails.
(d) X is the number of tails times the number of heads.
3. Let F(x) = 1 − exp(−α x^β) for x ≥ 0, α > 0 and β > 0, and F(x) = 0 for x < 0. Show that F is a cdf, and
find the corresponding density.
4. Sketch the pdf and cdf of a random variable X that is uniform on [−1, 1].
5. Suppose that X has the density function f(x) = cx^2 for 0 ≤ x ≤ 1 and f(x) = 0 otherwise.
(a) Find c.
(b) Find the cdf.
(c) What is P (.1 ≤ X < .5)?
6. The joint frequency function of two discrete random variables, X and Y , is given by
         y = 1   y = 2   y = 3   y = 4
x = 1     .10     .05     .02     .02
x = 2     .05     .20     .05     .02
x = 3     .02     .05     .20     .04
x = 4     .02     .02     .04     .10
(a) Find the marginal functions of X and Y .
(b) Are X and Y independent?
(c) Find a joint frequency function whose marginals X and Y are given by the same marginal functions
in (a) but are independent.
7. Suppose that random variables X and Y have the joint density
f(x, y) = (6/7)(x + y)^2 if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the marginal density functions of X and Y .
(b) Find the conditional density functions of X given Y = y and of Y given X = x.
Optional Problems.
1. Let f (x) = (1 + αx)/2 for −1 ≤ x ≤ 1 and f (x) = 0 otherwise, where −1 ≤ α ≤ 1. Show that f is a
density, and find the corresponding cdf. Find the quartiles and the median of the distribution in terms
of α.
2. The Cauchy cumulative distribution function is
F(x) = 1/2 + (1/π) tan^{−1}(x),    −∞ < x < ∞.
(a) Show that this is a cdf.
(b) Find the density function.
(c) Find x such that P (X > x) = .1.
5 Expectation
Let X be a discrete random variable whose possible values are a1, a2, . . ., and let p(x) be the frequency function of X. Then the expectation (expected value or mean) of the random variable X is given by

E[X] = Σ_i ai p(ai).
We often denote the expected value E[X] of X by µ or µX .
Expectation of discrete random variables. For a function g, we can define the expectation of a function of a discrete random variable X by

E[g(X)] = Σ_i g(ai) p(ai).
Suppose that we have two random variables X and Y , and that p(x, y) is their joint frequency function. Then
the expectation of a function g(X, Y) of the two random variables X and Y is defined by

E[g(X, Y)] = Σ_{i,j} g(ai, bj) p(ai, bj),
where the sum is over all the possible values of (X, Y ).
Example 1. A random variable X takes on values 0, 1, 2 with the respective probabilities P (X = 0) = 0.2,
P(X = 1) = 0.5, P(X = 2) = 0.3. Compute E[X] and E[X^2].
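A minimal sketch of Example 1:

from fractions import Fraction as F

# Frequency function from Example 1.
p = {0: F(1, 5), 1: F(1, 2), 2: F(3, 10)}

E_X = sum(a * pa for a, pa in p.items())        # E[X]   = 0(0.2) + 1(0.5) + 2(0.3)
E_X2 = sum(a**2 * pa for a, pa in p.items())    # E[X^2] = 0(0.2) + 1(0.5) + 4(0.3)
print(E_X, E_X2)                                # 11/10 17/10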
5.1 Expectation of continuous random variables
Let X be a continuous random variable, and let f be the pdf of X. Then the expectation of the random variable X is given by

E[X] = ∫_{−∞}^∞ x f(x) dx.    (2)
For a function g, we define the expectation E[g(X)] of the function g(X) of the random variable X by

E[g(X)] = ∫_{−∞}^∞ g(x) f(x) dx,

if g(x) is integrable with respect to f(x) dx, that is, ∫_{−∞}^∞ |g(x)| f(x) dx < ∞. Furthermore, if we have two random variables X and Y with joint density function f(x, y), then the expectation of the function g(X, Y) of the two random variables becomes

E[g(X, Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y) f(x, y) dx dy,

if g is integrable. For example, now we can define E[X^2], E[XY], E[X + Y] and so on.
Example 2. Let X be a uniform random variable on [a, b]. Compute E[X] and E[X^2].
5.2 Properties of expectation
One can think of the expectation E(X) as “an operation on a random variable X” which returns the average
value for X. It is also useful to observe that this is a “linear operator” satisfying the following properties:
1. Let a be a constant, and let X be a random variable having the pdf f (x). Then we can show that
E[a + X] = ∫_{−∞}^∞ (a + x) f(x) dx = a + ∫_{−∞}^∞ x f(x) dx = a + E[X].
2. Let a and b be scalars, and let X and Y be random variables having the joint density function f (x, y) and
the respective marginal density functions fX (x) and fY (y). Then we can show that
E[aX + bY] = ∫_{−∞}^∞ ∫_{−∞}^∞ (ax + by) f(x, y) dx dy = a ∫_{−∞}^∞ x fX(x) dx + b ∫_{−∞}^∞ y fY(y) dy = aE[X] + bE[Y].
3. Let a and b1, . . . , bn be scalars, and let X1, . . . , Xn be random variables. By applying property 1 and then applying property 2 repeatedly, we can obtain

E[a + Σ_{i=1}^n bi Xi] = a + E[Σ_{i=1}^n bi Xi] = a + b1 E[X1] + E[Σ_{i=2}^n bi Xi] = · · · = a + Σ_{i=1}^n bi E[Xi].

5.3 Expectation with independent random variables
Suppose that X and Y are independent random variables. Then the joint density function f (x, y) of X and Y
can be expressed as
f (x, y) = fX (x)fY (y).
And the expectation of the function of the form g1 (X) × g2 (Y ) is given by
E[g1(X) g2(Y)] = ∫_{−∞}^∞ ∫_{−∞}^∞ g1(x) g2(y) f(x, y) dx dy = ∫_{−∞}^∞ g1(x) fX(x) dx × ∫_{−∞}^∞ g2(y) fY(y) dy = E[g1(X)] × E[g2(Y)].
5.4 Variance and covariance
The variance of a random variable X, denoted by Var(X) or σ^2, is the expected value of "the squared difference between the random variable and its expected value E(X)." Thus, it is given by

Var(X) := E[(X − E(X))^2] = E[X^2] − (E[X])^2.

The square root √Var(X) of the variance Var(X) is called the standard error (SE) (or standard deviation (SD)) of the random variable X.
Now suppose that we have two random variables X and Y . Then the covariance of two random variables X and
Y can be defined as
Cov(X, Y ) := E((X − µx )(Y − µy )) = E(XY ) − E(X) × E(Y ),
where µx = E(X) and µy = E(Y ). The covariance Cov(X, Y ) measures the strength of the dependence of the
two random variables.
Example 3. Find Var(X) for a uniform random variable X on [a, b].
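A Monte Carlo sketch of Example 3 (the interval a = 0, b = 2 and the sample size are arbitrary illustrative choices; the closed forms (a + b)/2 and (b − a)^2/12 are the standard answers the simulation should approach):

import random

# Monte Carlo check of E[X] and Var(X) for X uniform on [a, b].
random.seed(0)
a, b = 0.0, 2.0
xs = [random.uniform(a, b) for _ in range(200_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 3), round(var, 3))   # close to (a + b)/2 = 1.0 and (b - a)**2/12 = 0.333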
5.5 Properties of variance and covariance
(a) If X and Y are independent, then Cov(X, Y ) = 0 by observing that E[XY ] = E[X] · E[Y ].
(b) In contrast to the expectation, the variance is not a linear operator. For two random variables X and Y , we
have
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).    (3)
However, if X and Y are independent, by observing that Cov(X, Y ) = 0 in (3), we have
Var(X + Y) = Var(X) + Var(Y).    (4)
In general, when we have a sequence of independent random variables X1 , . . . , Xn , the property (4) is extended
to
Var(X1 + · · · + Xn ) = Var(X1 ) + · · · + Var(Xn ).
Variance and covariance under linear transformation. Let a and b be scalars (that is, real-valued constants), and let X be a random variable. Then the variance of aX + b is given by
Var(aX + b) = a2 Var(X).
Now let a1 , a2 , b1 and b2 be scalars, and let X and Y be random variables. Then similarly the covariance of
a1 X + b1 and a2 Y + b2 can be given by
Cov(a1 X + b1 , a2 Y + b2 ) = a1 a2 Cov(X, Y )
5.6 Moment generating function
Let X be a random variable. Then the moment generating function (mgf ) of X is given by
M(t) = E[e^{tX}].

The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as

M(t) = ∫_{−∞}^∞ e^{tx} f(x) dx.
5.7 Properties of moment generating function
(a) The most significant property of moment generating function is that “the moment generating function
uniquely determines the distribution.”
(b) Let a and b be constants, and let MX (t) be the mgf of a random variable X. Then the mgf of the random
variable Y = a + bX can be given as follows.
MY(t) = E[e^{tY}] = E[e^{t(a+bX)}] = e^{at} E[e^{(bt)X}] = e^{at} MX(bt).
(c) Let X and Y be independent random variables having the respective mgf’s MX (t) and MY (t). Recall that
E[g1 (X)g2 (Y )] = E[g1 (X)]·E[g2 (Y )] for functions g1 and g2 . We can obtain the mgf MZ (t) of the sum Z = X +Y
of random variables as follows.
MZ(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = E[e^{tX}] · E[e^{tY}] = MX(t) · MY(t).
(d) When t = 0, it clearly follows that M (0) = 1. Now by differentiating M (t) r times, we obtain
M^(r)(t) = d^r/dt^r E[e^{tX}] = E[d^r/dt^r e^{tX}] = E[X^r e^{tX}].

In particular, when t = 0, M^(r)(0) generates the r-th moment of X as follows:

M^(r)(0) = E[X^r],    r = 1, 2, 3, . . .
Example 4. Find the mgf M(t) for a uniform random variable X on [a, b]. Then derive the derivative M′(0) at t = 0 by using the definition of the derivative and L'Hôpital's rule.
5.8 Exercises
1. If n men throw their hats into a pile and each man takes a hat at random, what is the expected number
of matches?
Hint: Suppose Y is the number of matches. How can you express the random variable Y by using indicator
random variables X1 , . . . , Xn ? What is the event for the indicator random variable Xi ?
2. Let X be a random variable with the pdf f(x) = 2x if 0 ≤ x ≤ 1, and f(x) = 0 otherwise. Then, (a) find E[X], and (b) find E[X^2] and Var(X).
3. Suppose that E[X] = µ and Var(X) = σ^2. Let Z = (X − µ)/σ. Show that E[Z] = 0 and Var(Z) = 1.
4. Recall that the expectation is a "linear operator." By using the definitions of variance and covariance in
terms of the expectation, show that
(a) Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ), and
(b) Var(X − Y ) = Var(X) + Var(Y ) − 2Cov(X, Y ).
5. If X and Y are independent random variables with equal variances, find Cov(X + Y, X − Y ).
6. If X and Y are independent random variables and Z = Y − X, find expressions for the covariance of X
and Z in terms of the variance of X and Y .
7. Let X and Y be independent random variables, and let α and β be scalars. Find an expression for the mgf
of Z = αX + βY in terms of the mgf’s of X and Y .
Optional Problems.
1. Suppose that n enemy aircraft are shot at simultaneously by m gunners, that each gunner selects an
aircraft to shoot at independently of the other gunners, and that each gunner hits the selected aircraft with
probability p. Find the expected number of aircraft hit by the gunners.
Hint: For each i = 1, . . . , n, let Xi be the indicator random variable of the event that the i-th aircraft was hit. Explain how you can verify P(Xi = 0) = (1 − p/n)^m by yourself, and use it to find the expectation of interest.
6 Binomial and Related Distributions
An experiment involving a trial that results in one of two possible outcomes, say "success" or "failure," and that is
repeatedly performed, is called a binomial experiment. Tossing a coin is an example of such a trial where head
or tail are the two possible outcomes.
6.1 Relation between Bernoulli and binomial random variable
A binomial random variable can be expressed in terms of n Bernoulli random variables. If X1, X2, . . . , Xn are independent Bernoulli random variables with success probability p, then the sum of those random variables

Y = Σ_{i=1}^n Xi

is distributed as the binomial distribution with parameter (n, p).
Sum of independent binomial random variables. If X and Y are independent binomial random variables
with respective parameters (n, p) and (m, p), then the sum X + Y is distributed as the binomial distribution
with parameter (n + m, p).
Rough explanation. Observe that we can express X = Σ_{i=1}^n Zi and Y = Σ_{i=n+1}^{n+m} Zi in terms of independent Bernoulli random variables Zi's with success probability p. Then the resulting sum X + Y = Σ_{i=1}^{n+m} Zi must be a binomial random variable with parameter (n + m, p).
6.2 Expectation and variance of binomial distribution
Let X be a Bernoulli random variable with success probability p. Then the expectation of X becomes
E[X] = 0 × (1 − p) + 1 × p = p.
By using E[X] = p, we can compute the variance as follows:
Var(X) = (0 − p)^2 × (1 − p) + (1 − p)^2 × p = p(1 − p).
Now let Y be a binomial random variable with parameter (n, p). Recall that Y can be expressed as the sum Σ_{i=1}^n Xi of independent Bernoulli random variables X1, . . . , Xn with success probability p. By using the properties of expectation and variance, we obtain

E[Y] = E[Σ_{i=1}^n Xi] = Σ_{i=1}^n E[Xi] = np;

Var(Y) = Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) = np(1 − p).
Moment generating functions. Let X be a Bernoulli random variable with success probability p. Then the
mgf MX (t) is given by
MX(t) = e^{(0)t} × (1 − p) + e^{(1)t} × p = (1 − p) + p e^t.

Then we can derive the mgf of a binomial random variable Y = Σ_{i=1}^n Xi from the mgf of a Bernoulli random variable by applying the mgf property repeatedly:

MY(t) = MX1(t) × MX2(t) × · · · × MXn(t) = [(1 − p) + p e^t]^n.
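A quick numerical check of E[Y] = np and Var(Y) = np(1 − p) against the binomial frequency function C(n, k) p^k (1 − p)^{n−k} (the values n = 10, p = 0.3 are illustrative):

import math

# Direct check of the binomial mean and variance formulas.
n, p = 10, 0.3
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))
var = sum(k**2 * pmf[k] for k in range(n + 1)) - mean**2
print(round(mean, 6), round(var, 6))   # 3.0 and 2.1, i.e. np and np(1 - p)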
6.3 Geometric distribution
For 0 < a < 1, we can establish the following formulas.
Σ_{k=1}^n a^{k−1} = (1 − a^n)/(1 − a);

Σ_{k=1}^∞ a^{k−1} = 1/(1 − a);    (5)

Σ_{k=1}^∞ k a^{k−1} = d/dx [Σ_{k=0}^∞ x^k]|_{x=a} = d/dx [1/(1 − x)]|_{x=a} = 1/(1 − a)^2;    (6)

Σ_{k=1}^∞ k^2 a^{k−1} = d/dx [Σ_{k=1}^∞ k x^k]|_{x=a} = d/dx [x/(1 − x)^2]|_{x=a} = 1/(1 − a)^2 + 2a/(1 − a)^3.
Geometric random variables. In sampling independent Bernoulli trials, each with probability of success p,
the frequency function of the trial number of the first success is
p(k) = (1 − p)^{k−1} p,    k = 1, 2, . . . .
This is called a geometric distribution.
By using geometric series formula (5) we can immediately see the following:
Σ_{k=1}^∞ p(k) = p Σ_{k=1}^∞ (1 − p)^{k−1} = p/(1 − (1 − p)) = 1;

M(t) = Σ_{k=1}^∞ e^{kt} p(k) = p e^t Σ_{k=1}^∞ [(1 − p)e^t]^{k−1} = p e^t/(1 − (1 − p)e^t).
Example 3. Consider a roulette wheel consisting of 38 numbers—1 through 36, 0, and double 0. If Mr. Smith
always bets that the outcome will be one of the numbers 1 through 12, what is the probability that (a) his first
win will occur on his fourth bet; (b) he will lose his first 5 bets?
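A small sketch of Example 3 (the computation assumes the geometric model above with p = 12/38):

from fractions import Fraction

# Example 3: each bet wins with probability p = 12/38.
p = Fraction(12, 38)

first_win_on_4th = (1 - p) ** 3 * p        # geometric frequency function at k = 4
lose_first_5 = (1 - p) ** 5                # no success in the first five trials
print(float(first_win_on_4th), float(lose_first_5))   # about 0.1011 and 0.1500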
Expectation and variance of geometric distributions. Let X be a geometric random variable with success
probability p. To calculate the expectation E[X] and the variance Var(X), we apply geometric series formula
(6) to obtain
E[X] = Σ_{k=1}^∞ k p(k) = p Σ_{k=1}^∞ k(1 − p)^{k−1} = 1/p;

Var(X) = E[X^2] − (E[X])^2 = Σ_{k=1}^∞ k^2 p(k) − 1/p^2 = p[1/p^2 + 2(1 − p)/p^3] − 1/p^2 = (1 − p)/p^2.
6.4 Negative binomial distribution
Given independent Bernoulli trials with probability of success p, the frequency function of the number of trials
until the r-th success is
p(k) = C(k − 1, r − 1) p^r (1 − p)^{k−r},    k = r, r + 1, . . . .
This is called a negative binomial distribution with parameter (r, p).
The geometric distribution is a special case of the negative binomial distribution with r = 1. Moreover, if X1, . . . , Xr are independent and identically distributed (iid) geometric random variables with parameter p, then the sum

Y = Σ_{i=1}^r Xi    (7)
becomes a negative binomial random variable with parameter (r, p).
Expectation, variance and mgf of negative binomial distribution. By using the sum of iid geometric rv's (see Equation 7) we can compute the expectation, the variance, and the mgf of the negative binomial random variable Y.

E[Y] = E[Σ_{i=1}^r Xi] = Σ_{i=1}^r E[Xi] = r/p;

Var(Y) = Var(Σ_{i=1}^r Xi) = Σ_{i=1}^r Var(Xi) = r(1 − p)/p^2;

MY(t) = MX1(t) × MX2(t) × · · · × MXr(t) = [p e^t/(1 − (1 − p)e^t)]^r.
Example 4. What is the average number of times one must throw a die until the outcome “1” has occurred 4
times?
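A minimal sketch of Example 4 (the truncation point in the sanity check is an arbitrary large value):

import math
from fractions import Fraction

# Example 4: throws of a fair die until the outcome "1" occurs r = 4 times.
r, p = 4, Fraction(1, 6)
print(r / p)           # E[Y] = r/p = 24 throws on average

# Sanity check against the negative binomial frequency function, truncated at large k.
mean = sum(k * math.comb(k - 1, r - 1) * float(p) ** r * (1 - float(p)) ** (k - r)
           for k in range(r, 2000))
print(round(mean, 2))  # close to 24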
6.5 Exercises
1. Appending three extra bits to a 4-bit word in a particular way (a Hamming code) allows detection and
correction of up to one error in any of the bits. (a) If each bit has probability .05 of being changed during
communication, and the bits are changed independently of each other, what is the probability that the
word is correctly received (that is, 0 or 1 bit is in error)? (b) How does this probability compare to the
probability that the word will be transmitted correctly with no check bits, in which case all four bits would
have to be transmitted correctly for the word to be correct?
2. Which is more likely: 9 heads in 10 tosses of a fair coin or 18 heads in 20 tosses?
3. Two boys play basketball in the following way. They take turns shooting and stop when a basket is made.
Player A goes first and has probability p1 of making a basket on any throw. Player B, who shoots
second, has probability p2 of making a basket. The outcomes of the successive trials are assumed to be
independent.
(a) Find the frequency function for the total number of attempts.
(b) What is the probability that player A wins?
4. Suppose that in a sequence of independent Bernoulli trials each with probability of success p, the number
of failures up to the first success is counted. (a) What is the frequency function for this random variable?
(b) Find the frequency function for the number of failures up to the rth success.
Hint: In (b) let X be the number of failures up to the rth success, and let Y be the number of trials until
the r-th success. What is the relation between the random variables X and Y ?
5. Find an expression for the cumulative distribution function of a geometric random variable.
6. In a sequence of independent trials with probability p of success, what is the probability that there are
exactly r successes before the kth failure?
Optional Problems.
1. (Banach Match Problem) A pipe smoker carries one box of matches in his left pocket and one in his right.
Initially, each box contains n matches. If he needs a match, the smoker is equally likely to choose either
pocket. What is the frequency function for the number of matches in the other box when he first discovers
that one box is empty?
Hint: Let Ak denote the event that the pipe smoker first discovers that the right-hand box is empty and
that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he
first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the
time. Then use Ak ’s and Bk ’s to express the frequency function of interest.
7 Poisson Distribution
The number e = lim_{n→∞} (1 + 1/n)^n = 2.7182 · · · is known as the base of the natural logarithm (also called Napier's number), and is associated with the Taylor series

e^x = Σ_{k=0}^∞ x^k/k!.    (8)
The Poisson distribution with parameter λ > 0 has the frequency function
p(k) = e^{−λ} λ^k/k!,    k = 0, 1, 2, . . .
By applying Taylor series (8) we can immediately see the following:
Σ_{k=0}^∞ p(k) = e^{−λ} Σ_{k=0}^∞ λ^k/k! = e^{−λ} e^λ = 1;

M(t) = Σ_{k=0}^∞ e^{kt} p(k) = e^{−λ} Σ_{k=0}^∞ (e^t λ)^k/k! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.
Example 1. Suppose that the number of typographical errors per page has a Poisson distribution with parameter
λ = 0.5. Find the probability that there is at least one error on a single page.
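A one-line sketch of Example 1:

import math

# Example 1: number of typographical errors per page is Poisson with lambda = 0.5.
lam = 0.5
p_at_least_one = 1 - math.exp(-lam) * lam**0 / math.factorial(0)   # 1 - p(0)
print(round(p_at_least_one, 4))                                    # 0.3935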
7.1 Poisson approximation to binomial distribution
The Poisson distribution may be used as an approximation for a binomial distribution with parameter (n, p)
when n is large and p is small enough so that np is a moderate number λ. Let λ = np. Then the binomial
frequency function becomes
p(k) = [n!/(k!(n − k)!)] (λ/n)^k (1 − λ/n)^{n−k}
     = (λ^k/k!) · [n!/((n − k)! n^k)] · (1 − λ/n)^n · (1 − λ/n)^{−k}
     → (λ^k/k!) · 1 · e^{−λ} · 1 = e^{−λ} λ^k/k!    as n → ∞.
Example 2. Suppose that a microchip is defective with probability 0.02. Find the probability that a sample
of 100 microchips will contain at most 1 defective microchip.
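A short sketch comparing the exact binomial probability of Example 2 with its Poisson approximation:

import math

# Example 2: n = 100 chips, each defective with probability p = 0.02, so lambda = np = 2.
n, p = 100, 0.02
lam = n * p

binom = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(2))
poisson = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(2))
print(round(binom, 4), round(poisson, 4))   # 0.4033 versus 0.406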
7.2 Expectation and variance
Let Xn , n = 1, 2, . . ., be a sequence of binomial random variables with parameter (n, λ/n). Then we regard the
limit of the sequence as a Poisson random variable Y , and use the limit to find E[Y ] and Var(Y ). Thus, we
obtain
E[Xn] = λ → E[Y] = λ    as n → ∞;

Var(Xn) = λ(1 − λ/n) → Var(Y) = λ    as n → ∞.
Sum of independent Poisson random variables. The sum of independent Poisson random variables is a
Poisson random variable: If X and Y are independent Poisson random variables with respective parameters λ1 and λ2, then Z = X + Y is distributed as the Poisson distribution with parameter λ1 + λ2.
Rough explanation. Suppose that Xn and Yn are independent binomial random variables with respective parameter (kn , pn ) and (ln , pn ) for each n = 1, 2, . . .. Then the sum Xn + Yn of the random variables has the binomial
distribution with parameter (kn + ln , pn ). By letting kn pn → λ1 and ln pn → λ2 with kn , ln → ∞ and pn → 0,
the respective limits of Xn and Yn are Poisson random variables X and Y with respective parameters λ1 and
λ2 . Moreover, the limit of Xn + Yn is the sum of the random variables X and Y , and has a Poisson distribution
with parameter λ1 + λ2 .
7.3 Exercises
1. Use the mgf to show that the expectation and the variance of a Poisson random variable are the same and
equal to the parameter λ.
2. The probability of being dealt a royal straight flush (ace, king, queen, jack, and ten of the same suit) in
poker is about 1.54 × 10−6 . Suppose that an avid poker player sees 100 hands a week, 52 weeks a year, for
20 years.
(a) What is the probability that she never sees a royal straight flush dealt?
(b) What is the probability that she sees at least two royal straight flushes dealt?
3. Professor Rice was told that he has only 1 chance in 10,000 of being trapped in a much-maligned elevator
in the mathematics building. Suppose he goes to work 5 days a week, 52 weeks a year, for 10 years and always rides the elevator up to his office when he first arrives. What is the probability that he will never be
trapped? That he will be trapped once? Twice? Assume that the outcomes on all the days are mutually
independent.
4. Suppose that a rare disease has an incidence of 1 in 1000. Assuming that members of the population are
affected independently, find the probability of k cases in a population of 100,000 for k = 0, 1, 2.
5. Suppose that in a city the number of suicides can be approximated by a Poisson random variable with
λ = 0.33 per month.
(a) What is the distribution for the number of suicides per year? What is the average number of suicides
in one year?
(b) What is the probability of two suicides in one month?
Optional Problems.
1. Let X be a negative binomial random variable with parameter (r, p).
(a) When r = 2 and p = 0.2, find the probability P (X ≤ 4).
(b) Let Y be a binomial random variable with parameter (n, p). When n = 4 and p = 0.2, find the
probability P (Y ≥ 2).
(c) By comparing the results above, what can you find about the relationship between P (X ≤ 4) and
P (Y ≥ 2)? Generalize the relationship to the one between P (X ≤ n) and P (Y ≥ 2), and find the
probability P (X ≤ 10) when r = 2 and p = 0.2.
(d) By using the Poisson approximation, find the probability P (X ≤ 1000) when r = 2 and p = 0.001.
8 Solutions to Exercises
8.1 Binomial and related distributions
1. The number X of bit errors has a binomial distribution with n = 4 and p = 0.05.
(a) P(X ≤ 1) = C(4, 0)(0.95)^4 + C(4, 1)(0.05)(0.95)^3 ≈ 0.986.
(b) P(X = 0) = C(4, 0)(0.95)^4 ≈ 0.815.
2. "9 heads in 10 tosses" [C(10, 9)(1/2)^10 ≈ 9.77 × 10^{−3}] is more likely than "18 heads in 20 tosses" [C(20, 18)(1/2)^20 ≈ 1.81 × 10^{−4}].
3. Let X be the total number of attempts.
(a) P(X = n) = (1 − p1)^{k−1}(1 − p2)^{k−1} p1 if n = 2k − 1, and P(X = n) = (1 − p1)^k(1 − p2)^{k−1} p2 if n = 2k.
(b) P({Player A wins}) = Σ_{k=1}^∞ P(X = 2k − 1) = p1 Σ_{k=1}^∞ [(1 − p1)(1 − p2)]^{k−1} = p1/(p1 + p2 − p1 p2).
4. Let X be the number of failures up to the rth success, and let Y be the number of trials until the r-th
success. Then Y has a negative binomial distribution, and has the relation X = Y − r.
(a) Here r = 1 (that is, Y has a geometric distribution), and P(X = k) = P(Y = k + 1) = (1 − p)^k p.
(b) P(X = k) = P(Y = k + r) = C(k + r − 1, r − 1) p^r (1 − p)^k.
5. Let X be a geometric random variable.
P(X ≤ k) = Σ_{i=1}^k p(1 − p)^{i−1} = 1 − (1 − p)^k.
6. Let X be the number of trials up to the k-th failure. Then X has a negative binomial distribution with
“success probability” (1 − p).
P({exactly r successes before the k-th failure}) = P(X = r + k) = C(k + r − 1, k − 1)(1 − p)^k p^r.
Optional Problem (Banach Match Problem). Let X be the number of matches in the other box when he first
discovered that one box is empty. Furthermore, let Ak denote the event that the pipe smoker first discovers that
the right-hand box is empty and that there are k matches in the left-hand box at the time, and similarly let Bk denote the event that he first discovers that the left-hand box is empty and that there are k matches in the right-hand box at the time. Then we can observe that P(Ak) = P(Bk) and P(X = k) = P(Ak) + P(Bk) = 2P(Ak).
Now in order for Ak to happen he has to make exactly (2n + 1 − k) attempts, choosing the right-hand box on the (n + 1)st of them. Thus, P(Ak) = C(2n − k, n)(1/2)^{2n+1−k}, and therefore we obtain P(X = k) = C(2n − k, n)(1/2)^{2n−k}.
8.2 Poisson distribution
1. We first calculate M′(t) = λe^t e^{λ(e^t−1)} and M″(t) = (λe^t)^2 e^{λ(e^t−1)} + λe^t e^{λ(e^t−1)}. Then we obtain E[X] = M′(0) = λ and E[X^2] = M″(0) = λ^2 + λ. Finally we get Var(X) = E[X^2] − (E[X])^2 = λ.
2. The number X of royal straight flushes has approximately a Poisson distribution with λ = (1.54 × 10^{−6}) × (100 × 52 × 20) ≈ 0.16.
(a) P(X = 0) = e^{−λ} ≈ 0.852.
(b) P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − (e^{−λ} + λe^{−λ}) ≈ 0.0115.
3. The number X of times Professor Rice is trapped has approximately a Poisson distribution with λ = (1/10000) × (5 × 52 × 10) = 0.26. Then P(X = 0) = e^{−λ} ≈ 0.77, P(X = 1) = λe^{−λ} ≈ 0.20, and P(X = 2) = (λ^2/2)e^{−λ} ≈ 0.026.
4. The number X of cases of this disease has approximately a Poisson distribution with λ = (1/1000) × 100000 = 100. Then it has the frequency function p(k) ≈ e^{−100} 100^k/k!. And we obtain p(0) ≈ 3 × 10^{−44}, p(1) ≈ 4 × 10^{−42}, and p(2) ≈ 2 × 10^{−40}.
5. Let Xi be the number of suicides in the i-th month, and let Y be the number of suicides in a year. Then Y = X1 + · · · + X12 has a Poisson distribution with λ = 12 × 0.33 = 3.96. Thus, (a) E[Y] = 3.96 and (b) P(X1 = 2) = (0.33^2/2)e^{−0.33} ≈ 0.039.
Optional Problem.
(a) P(X ≤ 4) = Σ_{k=2}^4 C(k − 1, 1)(0.2)^2(0.8)^{k−2} = 0.1808.
(b) P(Y ≥ 2) = 1 − Σ_{k=0}^1 C(4, k)(0.2)^k(0.8)^{4−k} = 0.1808.
(c) By using P(X ≤ n) = P(Y ≥ 2), P(X ≤ 10) = P(Y ≥ 2) = 1 − Σ_{k=0}^1 C(10, k)(0.2)^k(0.8)^{10−k} ≈ 0.6242.
(d) By using P(X ≤ n) = P(Y ≥ 2) and the Poisson approximation with λ = np = 1, P(X ≤ 1000) = P(Y ≥ 2) ≈ 1 − e^{−λ} − λe^{−λ} ≈ 0.2642.
9 Summary Diagram
The discrete distributions treated so far are summarized below; the numbered relations (1)–(7) among them are explained in the list that follows.

Bernoulli trial: P(X = 1) = p and P(X = 0) = q := 1 − p, where p is the probability of success and q := 1 − p the probability of failure; E[X] = p, Var(X) = pq.

Binomial(n, p): P(X = i) = C(n, i) p^i q^{n−i}, i = 0, . . . , n; E[X] = np, Var(X) = npq.

Geometric(p): P(X = i) = q^{i−1} p, i = 1, 2, . . .; E[X] = 1/p, Var(X) = q/p^2.

Negative binomial(r, p): P(X = i) = C(i − 1, r − 1) p^r q^{i−r}, i = r, r + 1, . . .; E[X] = r/p, Var(X) = rq/p^2.

Poisson(λ): P(X = i) = e^{−λ} λ^i/i!, i = 0, 1, . . .; E[X] = λ, Var(X) = λ.
1. X is the number of Bernoulli trials until the first success occurs.
2. X is the number of Bernoulli trials until the rth success occurs. If X1, · · · , Xr are independent geometric random variables with parameter p, then X = Σ_{k=1}^r Xk becomes a negative binomial random variable with parameter (r, p).
3. r = 1.
4. X is the number of successes in n Bernoulli trials. If X1, · · · , Xn are independent Bernoulli trials, then X = Σ_{k=1}^n Xk becomes a binomial random variable with parameter (n, p).
5. If X1 and X2 are independent binomial random variables with respective parameters (n1 , p) and (n2 , p),
then X = X1 + X2 is also a binomial random variable with (n1 + n2 , p).
6. Poisson approximation is used for binomial distribution by letting n → ∞ and p → 0 while λ = np.
7. If X1 and X2 are independent Poisson random variables with respective parameters λ1 and λ2 , then
X = X1 + X2 is also a Poisson random variable with λ1 + λ2 .
10 Exponential and Gamma Distributions
10.1 Exponential distribution
The exponential density is defined by

f(x) = λe^{−λx} if x ≥ 0, and f(x) = 0 if x < 0.

Then the cdf is computed as

F(t) = 1 − e^{−λt} if t ≥ 0, and F(t) = 0 if t < 0.
Memoryless property. The function S(t) = 1 − F (t) = P (X > t) is called the survival function. When F (t)
is an exponential distribution, we have S(t) = e^{−λt} for t ≥ 0. Furthermore, we can find that

P(X > t + s | X > s) = P(X > t + s) / P(X > s) = S(t + s) / S(s) = S(t) = P(X > t),    (9)

which is referred to as the memoryless property of the exponential distribution.
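A quick numerical sketch of the memoryless property (9) (the values of λ, s, and t are arbitrary illustrative choices):

import math

# Numerical check of the memoryless property for an exponential survival function.
lam, s, t = 0.2, 3.0, 5.0
S = lambda u: math.exp(-lam * u)      # S(u) = P(X > u)

left = S(t + s) / S(s)                # P(X > t + s | X > s)
right = S(t)                          # P(X > t)
print(math.isclose(left, right))      # True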
10.2 Gamma distribution
The gamma function, denoted by Γ(α), is defined as
Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du,    α > 0.

Then the gamma density is defined as

f(x) = [λ^α/Γ(α)] x^{α−1} e^{−λx},    x ≥ 0,
which depends on two parameters α > 0 and λ > 0. We call the parameter α a shape parameter, because changing
α changes the shape of the density. We call the parameter λ a scale parameter, because changing λ merely rescales
the density without changing its shape. This is equivalent to changing the units of measurement (feet to meters,
or seconds to minutes). Suppose that the parameter α is an integer n. Then we have Γ(n) = (n − 1)!. In
particular, the gamma distribution with α = 1 becomes an exponential distribution (with parameter λ).
Chi-square distribution. The gamma distribution with α = n/2 and λ = 1/2 is called the chi-square distribution with n degrees of freedom.
Expectation, variance and mgf of a gamma random variable. Let X be a gamma random variable with parameter (α, λ). For t < λ, the mgf MX(t) of X can be computed as follows.

MX(t) = ∫_0^∞ e^{tx} [λ^α/Γ(α)] x^{α−1} e^{−λx} dx = [λ^α/Γ(α)] ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx
      = [λ^α/((λ − t)^α Γ(α))] ∫_0^∞ ((λ − t)x)^{α−1} e^{−(λ−t)x} (λ − t) dx = [λ/(λ − t)]^α [1/Γ(α)] ∫_0^∞ u^{α−1} e^{−u} du = [λ/(λ − t)]^α.

Then we can find the first and second moments

E[X] = M′X(0) = α/λ   and   E[X^2] = M″X(0) = α(α + 1)/λ^2.

Thus, we can obtain Var(X) = α(α + 1)/λ^2 − α^2/λ^2 = α/λ^2.

10.3 Exercises
1. Let X be a gamma random variable with parameter (α, λ), and let Y = cX with c > 0.
(a) Show that P(Y ≤ t) = ∫_{−∞}^t f(x) dx, where

f(x) = [(λ/c)^α/Γ(α)] x^{α−1} e^{−(λ/c)x}.

(b) Conclude that Y is a gamma random variable with parameter (α, λ/c). Thus, only λ is affected by the transformation Y = cX, which justifies calling λ a scale parameter.
2. Suppose that the lifetime of an electronic component follows an exponential distribution with λ = 0.2.
(a) Find the probability that the lifetime is less than 10.
(b) Find the probability that the lifetime is between 5 and 15.
3. X is an exponential random variable, and P (X < 1) = 0.05. What is λ?
4. The lifetime in months of each component in a microchip is exponentially distributed with parameter µ = 1/10000.
(a) For a particular component, we want to find the probability that the component breaks down before 5 months, and the probability that the component breaks down before 50 months. Use the approximation e^{−x} ≈ 1 − x for a small number x to calculate these probabilities.
(b) Suppose that the microchip contains 1000 such components and is operational until three of these components break down. The lifetimes of these components are independent and exponentially distributed with parameter µ = 1/10000. By using the Poisson approximation, find the probability that the microchip operates functionally for 5 months, and the probability that the microchip operates functionally for 50 months.
5. The length of a phone call can be represented by an exponential distribution. Then answer the following
questions.
(a) A report says that 50% of phone call ends in 20 minutes. Assuming that this report is true, calculate
the parameter λ for the exponential distribution.
(b) Suppose that someone arrived immediately ahead of you at a public telephone booth. Using the result
of (a), find the probability that you have to wait between 10 and 20 minutes.
Optional Problems.
1. Suppose that a non-negative random variable X has the memoryless property
P (X > t + s | X > s) = P (X > t).
By completing the following questions, we will show that X must be an exponential random variable.
(a) Let S(t) be the survival function of X. Show that the memoryless property implies that S(s + t) = S(s)S(t) for s, t ≥ 0.
(b) Let κ = S(1). Show that κ > 0.
(c) Show that S(1/n) = κ^{1/n} and S(m/n) = κ^{m/n}.
Therefore, we can obtain S(t) = κ^t for any t ≥ 0. By letting λ = − ln κ, we can write S(t) = e^{−λt}.
2. The gamma function is a generalized factorial function.
(a) Show that Γ(1) = 1.
(b) Show that Γ(α + 1) = αΓ(α). Hint: Use integration by parts.
(c) Conclude that Γ(n) = (n − 1)! for n = 1, 2, 3, . . ..
(d) Use the formula Γ(1/2) = √π to show that if n is an odd integer then we have

Γ(n/2) = √π (n − 1)! / [2^{n−1} ((n − 1)/2)!].
11 Normal Distributions
A normal distribution is represented by a family of distributions which have the same general shape, sometimes
described as “bell shaped.” The normal distribution has the pdf
f(x) := [1/(σ√(2π))] exp(−(x − µ)^2/(2σ^2)),    (10)

which depends upon two parameters µ and σ^2. Here π = 3.14159 . . . is the famous "pi" (the ratio of the circumference of a circle to its diameter), and exp(u) is the exponential function e^u with the base e = 2.71828 . . . of the natural logarithm.
Some much-celebrated integrations are the following:

∫_{−∞}^∞ [1/(σ√(2π))] e^{−(x−µ)^2/(2σ^2)} dx = 1,    (11)

∫_{−∞}^∞ [1/(σ√(2π))] x e^{−(x−µ)^2/(2σ^2)} dx = µ,

∫_{−∞}^∞ [1/(σ√(2π))] (x − µ)^2 e^{−(x−µ)^2/(2σ^2)} dx = σ^2.
The integration (11) guarantees that the density function (10) always represents a "probability density" no matter what values are chosen for the parameters µ and σ^2. We say that a random variable X is normally distributed with parameter (µ, σ^2) when X has the density function (10). The parameter µ, called a mean (or location
parameter), provides the center of the density, and the density function f (x) is symmetric around µ. The
parameter σ is a standard deviation (or, a scale parameter); small values of σ lead to high peaks but sharp drops.
Larger values of σ lead to flatter densities. The shorthand notation
X ∼ N(µ, σ^2)

is often used to express that X is a normal random variable with parameter (µ, σ^2).
11.1 Linear transform of normal random variable
One of the important properties of the normal distribution is that if X is a normal random variable with parameter (µ, σ^2) [that is, the pdf of X is given by the density function (10)], then Y = aX + b is also a normal random variable having parameter (aµ + b, (aσ)^2). In particular,

(X − µ)/σ    (12)

becomes a normal random variable with parameter (0, 1), called the z-score. Then the normal density φ(x) with parameter (0, 1) becomes

φ(x) := (1/√(2π)) e^{−x^2/2},

which is known as the standard normal density. We define the cdf Φ by

Φ(t) := ∫_{−∞}^t φ(x) dx.
11.2 Moment generating function
Let X be a standard normal random variable. Then we can compute the mgf of X as follows.
MX(t) = (1/√(2π)) ∫_{−∞}^∞ e^{tx} e^{−x^2/2} dx = (1/√(2π)) ∫_{−∞}^∞ e^{−(1/2)(x−t)^2 + t^2/2} dx = e^{t^2/2} (1/√(2π)) ∫_{−∞}^∞ e^{−(x−t)^2/2} dx = exp(t^2/2).

Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf M(t) of Y can be given by

M(t) = e^{µt} MX(σt) = exp(σ^2 t^2/2 + µt).

Then we can find the first and second moments

E[Y] = M′(0) = µ   and   E[Y^2] = M″(0) = µ^2 + σ^2.

Thus, we can obtain Var(Y) = (µ^2 + σ^2) − µ^2 = σ^2.
11.3 Calculating probability
Suppose that a tomato plant height X is normally distributed with parameter (µ, σ^2). Then what is the probability that the tomato plant height is between a and b?
How to calculate. The integration
P(a ≤ X ≤ b) = ∫_a^b [1/(σ√(2π))] e^{−(x−µ)^2/(2σ^2)} dx
seems too complicated. But if we consider the random variable (X − µ)/σ, then the event {a ≤ X ≤ b} is equivalent to the event

(a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ.

In terms of probability, this means that

P(a ≤ X ≤ b) = P((a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ).    (13)

And it is easy to calculate

a′ = (a − µ)/σ   and   b′ = (b − µ)/σ.

Since the z-score (12) is distributed as the standard normal distribution, the above probability becomes

P(a ≤ X ≤ b) = P(a′ ≤ (X − µ)/σ ≤ b′) = ∫_{a′}^{b′} φ(x) dx = Φ(b′) − Φ(a′).    (14)
Then we will be able to look at the values for Φ(a0 ) and Φ(b0 ) from the table of standard normal distribution.
For example, if a0 = −0.19 and b0 = 0.29, then Φ(−0.19) = 0.4247 and Φ(0.29) = 0.6141, and the probability of
interest becomes 0.1894, or approximately 0.19.
Suppose that a random variable X is normally distributed with mean µ and standard deviation σ. Then we can obtain
1. P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ);
2. P(X ≤ b) = Φ((b − µ)/σ);
3. P(a ≤ X) = 1 − Φ((a − µ)/σ).
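The z-score reduction (13)–(14) and the three formulas above are easy to reproduce numerically. The following is a small added sketch, assuming Python with scipy is available; the inputs are hypothetical and chosen so that a′ = −0.19 and b′ = 0.29, matching the worked example.

from scipy.stats import norm

def normal_interval_prob(a, b, mu, sigma):
    # P(a <= X <= b) for X ~ N(mu, sigma^2), via the z-scores a' and b' as in (14)
    a0 = (a - mu) / sigma
    b0 = (b - mu) / sigma
    return norm.cdf(b0) - norm.cdf(a0)

# Hypothetical parameters giving a' = -0.19 and b' = 0.29:
print(normal_interval_prob(a=9.81, b=10.29, mu=10.0, sigma=1.0))   # about 0.1894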
11.4 Exercises
1. Let X be a normal random variable with parameter (µ, σ²), and let Y = aX + b with a ≠ 0.
(a) Show that
P(Y ≤ t) = ∫_{−∞}^{t} f(x) dx,
where
f(x) = (1/(|a|σ√(2π))) exp(−(x − aµ − b)²/(2(aσ)²)).
(b) Conclude that Y is a normal random variable with parameter (aµ + b, (aσ)²).
2. Let X be a normal random variable with parameter (µ, σ²), and let Y = e^X. Find the density function f of Y by forming
P(Y ≤ t) = ∫_0^t f(x) dx.
This density function f is called the lognormal density since ln Y is normally distributed.
3. Show that the standard normal density integrates to unity via (a) and (b).
(a) Show that ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(−(x²+y²)/2) dx dy = ∫_0^∞ ∫_0^{2π} e^(−r²/2) r dθ dr = 2π.
(b) Show that (1/(σ√(2π))) ∫_{−∞}^{∞} e^(−(x−µ)²/(2σ²)) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−y²/2) dy = 1.
4. Suppose that in a certain population, individuals' heights are approximately normally distributed with parameters µ = 70 in and σ = 3 in.
(a) What proportion of the population is over 6ft. tall?
(b) What is the distribution of heights if they are expressed in centimeters?
5. Let X be a normal random variable with µ = 5 and σ = 10. Find (a) P (X > 10), (b) P (−20 < X < 15),
and (c) the value of x such that P (X > x) = 0.05.
6. If X ∼ N (µ, σ 2 ), then show that P (|X − µ| ≤ 0.675σ) = 0.5.
7. If X ∼ N(µ, σ²), find the value of c in terms of σ such that P(µ − c ≤ X ≤ µ + c) = 0.95.
12 Solutions to exercises
12.1 Exponential and gamma distributions
1. (a) P(Y ≤ t) = P(cX ≤ t) = P(X ≤ t/c) = ∫_{−∞}^{t/c} (λ^α/Γ(α)) x^(α−1) e^(−λx) dx = ∫_{−∞}^{t} ((λ/c)^α/Γ(α)) y^(α−1) e^(−(λ/c)y) dy by substituting x = y/c.
(b) The integrand f(y) is a gamma density with (α, λ/c).
2. Let X be the lifetime of the electronic component.
(a) P (X ≤ 10) = 1 − e−(0.2)(10) ≈ 0.8647.
(b) P (5 ≤ X ≤ 15) = P (X ≤ 15) − P (X ≤ 5) = e−(0.2)(5) − e−(0.2)(15) ≈ 0.3181.
3. By solving P (X < 1) = 1 − e−λ = 0.05, we obtain λ = − ln 0.95 ≈ 0.0513.
4. (a) Let p1 be the probability that the component breaks down before 5 months, and let p2 be the probability that the component breaks down before 50 months. Then,
p1 = 1 − e^(−5µ) ≈ 1 − (1 − 0.0005) = 0.0005;
p2 = 1 − e^(−50µ) ≈ 1 − (1 − 0.005) = 0.005.
(b) Let X1 be the number of components breaking down before 5 months, and let X2 be the number of components breaking down before 50 months. Then Xi has the binomial distribution with parameter (pi, 1000) for i = 1, 2. By using the Poisson approximation with λ1 = 1000p1 = 0.5 and λ2 = 1000p2 = 5, we can obtain
P(X1 ≤ 2) = e^(−λ1) (1 + λ1 + λ1²/2) ≈ 0.986;
P(X2 ≤ 2) = e^(−λ2) (1 + λ2 + λ2²/2) ≈ 0.125.
5. Let X be the length of a phone call in minutes.
(a) By solving P(X ≤ 20) = 1 − e^(−20λ) = 0.5, we obtain λ = −(ln 0.5)/20 ≈ 0.0347.
(b) P(10 ≤ X ≤ 20) = (1 − e^(−20λ)) − (1 − e^(−10λ)) ≈ 0.207.
Optional Problems.
1. (a) Since P(X > t + s | X > s) = P(X > t + s)/P(X > s) = S(t + s)/S(s), we obtain S(t) = S(t + s)/S(s).
(b) Let κ = S(1) = S(1/2 + 1/2) = [S(1/2)]² > 0.
(c) Since κ = S(1) = S(1/n + · · · + 1/n) = [S(1/n)]^n, we have S(1/n) = κ^(1/n). Furthermore, S(m/n) = S(1/n + · · · + 1/n) = [S(1/n)]^m = κ^(m/n).
2. (a) Γ(1) = ∫_0^∞ e^(−u) du = 1.
(b) Γ(α + 1) = ∫_0^∞ u^α e^(−u) du = [−u^α e^(−u)]_0^∞ + α ∫_0^∞ u^(α−1) e^(−u) du = αΓ(α).
(c) Γ(n) = (n − 1)Γ(n − 1) = · · · = (n − 1)!
(d) Γ(n/2) = (n/2 − 1)(n/2 − 2) · · · (1/2)Γ(1/2) = (n − 2)(n − 4) · · · (1)Γ(1/2)/2^((n−1)/2) = √π (n − 1)!/(2^(n−1) ((n−1)/2)!).
12.2 Normal distributions
1. (a) Assume a > 0. Then
P(Y ≤ t) = P(aX + b ≤ t) = P(X ≤ (t − b)/a) = ∫_{−∞}^{(t−b)/a} (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx
= ∫_{−∞}^{t} (1/(aσ√(2π))) exp(−(y − aµ − b)²/(2(aσ)²)) dy
by substituting x = (y − b)/a. When a < 0, we can similarly obtain
P(Y ≤ t) = ∫_{−∞}^{t} (1/(|a|σ√(2π))) exp(−(y − aµ − b)²/(2(aσ)²)) dy.
(b) The integrand f(y) is a normal density with (aµ + b, (aσ)²).
2. For t > 0 we obtain
P(Y ≤ t) = P(e^X ≤ t) = P(X ≤ ln t) = ∫_{−∞}^{ln t} (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) dx = ∫_0^t (1/(yσ√(2π))) exp(−(ln y − µ)²/(2σ²)) dy
by substituting x = ln y. Thus, the density is given by f(y) = (1/(yσ√(2π))) exp(−(ln y − µ)²/(2σ²)), y > 0.
3. (a) ∫_0^∞ ∫_0^{2π} e^(−r²/2) r dθ dr = 2π ∫_0^∞ r e^(−r²/2) dr = 2π ∫_0^∞ e^(−u) du = 2π.
(b) Substituting x = σy + µ gives (1/(σ√(2π))) ∫_{−∞}^{∞} exp(−(x − µ)²/(2σ²)) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−y²/2) dy = 1.
4. Let X be the height in inches normally distributed with µ = 70in and σ = 3in.
(a) P (X ≥ (6)(12)) ≈ 1 − Φ(0.67) ≈ 0.25.
(b) (2.54)X is the height in centimeters, and normally distributed with µ = (2.54)(70) = 177.8cm and
σ = (2.54)(3) = 7.62cm.
5. (a) P (X > 10) = 1 − Φ(0.5) ≈ 0.3085. (b) P (−20 < X < 15) = Φ(1) − Φ(−2.5) ≈ 0.8351.
(c) Observe that P (X > x) = 1−Φ((x−5)/10) = 0.05. Since Φ(1.64) ≈ 0.95, we can find (x−5)/10 ≈ 1.64,
and therefore, x ≈ 21.4.
6. P (|X − µ| ≤ 0.675σ) = P (−0.675σ ≤ X − µ ≤ 0.675σ) = P (−0.675 ≤ (X − µ)/σ ≤ 0.675) = Φ(0.675) −
Φ(−0.675) ≈ 0.5.
7. Observing that P (µ − c ≤ X ≤ µ + c) = P (|(X − µ)/σ| ≤ c/σ) = 0.95, we obtain P ((X − µ)/σ ≤ c/σ) =
Φ(c/σ) = 0.975. Since Φ(1.96) ≈ 0.975, we can find c/σ ≈ 1.96, and therefore, c = 1.96σ.
13 Summary Diagram
The original notes display a diagram whose arrows, numbered (1)–(7), correspond to the relationships listed below. The four distributions appearing in the diagram are:
Normal(µ, σ²): f(x) = (1/(√(2π)σ)) e^(−(x−µ)²/(2σ²)), −∞ < x < ∞; E[X] = µ, Var(X) = σ².
Chi-square (with m degrees of freedom): f(x) = (1/2) e^(−x/2) (x/2)^(m/2−1)/Γ(m/2), x ≥ 0; E[X] = m, Var(X) = 2m.
Exponential(λ): f(x) = λ e^(−λx), x ≥ 0; E[X] = 1/λ, Var(X) = 1/λ².
Gamma(α, λ): f(x) = (λ^α/Γ(α)) x^(α−1) e^(−λx), x ≥ 0 († Γ(n) = (n − 1)! when n is a positive integer); E[X] = α/λ, Var(X) = α/λ².
1. If X1 and X2 are independent normal random variables with respective parameters (µ1, σ1²) and (µ2, σ2²), then X = X1 + X2 is a normal random variable with (µ1 + µ2, σ1² + σ2²).
2. If X1, · · · , Xm are iid normal random variables with parameter (µ, σ²), then X = Σ_{k=1}^m ((Xk − µ)/σ)² is a chi-square (χ²) random variable with m degrees of freedom.
3. α = m/2 and λ = 1/2.
4. If X1, . . . , Xn are independent exponential random variables with respective parameters λ1, · · · , λn, then X = min{X1, . . . , Xn} is an exponential random variable with parameter Σ_{i=1}^n λi.
5. If X1, · · · , Xn are iid exponential random variables with parameter λ, then X = Σ_{k=1}^n Xk is a gamma random variable with (α = n, λ).
6. α = 1.
7. If X1 and X2 are independent random variables having gamma distributions with respective parameters (α1, λ) and (α2, λ), then X = X1 + X2 is a gamma random variable with parameter (α1 + α2, λ).
14 Bivariate Normal Distribution
14.1 Review of matrix algebra
Matrices are usually denoted by capital letters, A, B, C, . . ., while scalars (real numbers) are denoted by lower case letters, a, b, c, . . .. A 2-by-2 matrix is written below row by row as [a b; c d].
1. Scalar multiplication: k[a b; c d] = [ka kb; kc kd].
2. Product of two matrices: [a b; c d][e f; g h] = [ae + bg  af + bh; ce + dg  cf + dh].
3. Determinant of a matrix: det[a b; c d] = ad − bc.
4. The inverse of a matrix: If A = [a b; c d] and det A ≠ 0, then the inverse A^(−1) of A is given by
A^(−1) = (1/det A)[d −b; −c a] = (1/(ad − bc))[d −b; −c a],
and satisfies AA^(−1) = A^(−1)A = [1 0; 0 1].
5. The transpose of a matrix: [a b; c d]^T = [a c; b d].
14.2 Quadratic forms
If a 2-by-2 matrix A satisfies A^T = A, then it is of the form [a b; b c], and is said to be symmetric. A 2-by-1 matrix x = [x; y] is called a column vector, usually denoted by boldface letters x, y, . . ., and the transpose x^T = (x  y) is called the row vector. Then, the bivariate function
Q(x, y) = x^T A x = (x  y)[a b; b c][x; y] = ax² + 2bxy + cy²   (15)
is called a quadratic form.
If ac − b² ≠ 0 and ac ≠ 0, then the symmetric matrix can be decomposed into the following forms, known as Cholesky decompositions:
[a b; b c] = [√a  0; b/√a  √((ac − b²)/a)] [√a  b/√a; 0  √((ac − b²)/a)]   (16)
[a b; b c] = [√((ac − b²)/c)  b/√c; 0  √c] [√((ac − b²)/c)  0; b/√c  √c]   (17)
Corresponding to the Cholesky decompositions of the matrix A, the quadratic form Q(x, y) can also be decomposed as follows:
Q(x, y) = (√a x + (b/√a) y)² + (√((ac − b²)/a) y)²   (18)
= (√((ac − b²)/c) x)² + ((b/√c) x + √c y)²   (19)
14.3 Bivariate normal density
The function
f(x, y) = (1/(2πσxσy√(1 − ρ²))) exp(−(1/2) Q(x, y))   (20)
with the quadratic form
Q(x, y) = (1/(1 − ρ²)) [((x − µx)/σx)² − 2ρ(x − µx)(y − µy)/(σxσy) + ((y − µy)/σy)²]   (21)
gives the joint density function of a bivariate normal distribution. Note that the parameters σx², σy², and ρ must satisfy σx² > 0, σy² > 0, and −1 < ρ < 1. By defining the 2-by-2 symmetric matrix (also known as the covariance matrix) and the two column vectors
Σ = [σx²  ρσxσy; ρσxσy  σy²],   x = [x; y],   µ = [µx; µy],
the quadratic form can be expressed as
Q(x, y) = (x − µ)^T Σ^(−1) (x − µ) = (x − µx  y − µy) [σx²  ρσxσy; ρσxσy  σy²]^(−1) [x − µx; y − µy].   (22)
Suppose that two random variables X and Y have the bivariate normal distribution (20). Then we can derive the following properties.
1. Their marginal density functions fX(x) and fY(y) become
fX(x) = (1/(√(2π)σx)) exp{−(1/2)((x − µx)/σx)²}   and   fY(y) = (1/(√(2π)σy)) exp{−(1/2)((y − µy)/σy)²}.
Thus, X and Y are normally distributed with respective parameters (µx, σx²) and (µy, σy²).
2. If ρ = 0, then the quadratic form (21) becomes
Q(x, y) = ((x − µx)/σx)² + ((y − µy)/σy)²,
and consequently we have f(x, y) = fX(x)fY(y). Thus, X and Y are independent if and only if ρ = 0.
3. If X and Y are not independent (that is, ρ ≠ 0), we can compute the conditional density function fY|X(y|x) given X = x as
fY|X(y|x) = (1/(√(2π)σy√(1 − ρ²))) exp{−(1/2) [(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²},
which is the normal density function with parameter (µy + ρ(σy/σx)(x − µx), σy²(1 − ρ²)). Similarly, the conditional density function fX|Y(x|y) given Y = y becomes
fX|Y(x|y) = (1/(√(2π)σx√(1 − ρ²))) exp{−(1/2) [(x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ²))]²},
which is the normal density function with parameter (µx + ρ(σx/σy)(y − µy), σx²(1 − ρ²)).
14.4 Covariance of bivariate normal random variables
Let (X, Y) be bivariate normal random variables with parameters (µx, µy, σx², σy², ρ). Recall that f(x, y) = f(x)f(y|x), and that f(y|x) is the normal density with mean µy + ρ(σy/σx)(x − µx) and variance σy²(1 − ρ²). First we can compute
∫_{−∞}^{∞} (y − µy) f(y|x) dy = ρ(σy/σx)(x − µx).
Then we can apply it to obtain
Cov(X, Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)(y − µy) f(x, y) dy dx = ∫_{−∞}^{∞} (x − µx) [∫_{−∞}^{∞} (y − µy) f(y|x) dy] f(x) dx
= ρ(σy/σx) ∫_{−∞}^{∞} (x − µx)² f(x) dx = ρ(σy/σx) σx² = ρσxσy.
The correlation coefficient of X and Y is defined by
ρ = Cov(X, Y)/√(Var(X)Var(Y)).
It implies that the parameter ρ of the bivariate normal distribution represents the correlation coefficient of X and Y.
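The identity Cov(X, Y) = ρσxσy can be illustrated by simulation. The following is an added sketch assuming numpy; the parameter values are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
mu_x, mu_y, sigma_x, sigma_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7   # hypothetical parameters
mean = [mu_x, mu_y]
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xy = rng.multivariate_normal(mean, cov, size=200_000)
x, y = xy[:, 0], xy[:, 1]

print(np.cov(x, y)[0, 1], rho * sigma_x * sigma_y)   # sample covariance vs rho*sigma_x*sigma_y
print(np.corrcoef(x, y)[0, 1], rho)                  # sample correlation vs rho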
14.5 Exercises
1. Verify the Cholesky decompositions by multiplying the matrices in each of (16) and (17).
2. Show that the right-hand side of (18) equals (15) by calculating
Q(x, y) = (x  y) [√a  0; b/√a  √((ac − b²)/a)] [√a  b/√a; 0  √((ac − b²)/a)] [x; y].
Similarly show that the right-hand side of (19) equals (15).
3. Show that (21) is equal to (22).
4. Show that (22) has the following decompositions
Q(x, y) = [(x − µx − ρ(σx/σy)(y − µy))/(σx√(1 − ρ²))]² + ((y − µy)/σy)²   (E1)
= [(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]² + ((x − µx)/σx)²   (E2)
by completing the following:
(a) Find the Cholesky decompositions of Σ^(−1).
(b) Verify (E1) and (E2) by using the Cholesky decompositions of Σ^(−1).
5. Compute fX(x) by completing the following:
(a) Show that ∫_{−∞}^{∞} f(x, y) dy becomes
(1/(√(2π)σx)) exp{−(1/2)((x − µx)/σx)²} × (1/(√(2π)σy√(1 − ρ²))) ∫_{−∞}^{∞} exp{−(1/2)[(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²} dy.
(b) Show that
(1/(√(2π)σy√(1 − ρ²))) ∫_{−∞}^{∞} exp{−(1/2)[(y − µy − ρ(σy/σx)(x − µx))/(σy√(1 − ρ²))]²} dy = (1/√(2π)) ∫_{−∞}^{∞} exp(−y²/2) dy = 1.
Now find fY|X(y|x) given X = x.
15 Conditional Expectation
Let f(x, y) be a joint density function for random variables X and Y, and let fY|X(y|x) be the conditional density function of Y given X = x (see Lecture Note No. ?? for definition). Then the conditional expectation of Y given X = x, denoted by E(Y|X = x), is defined as the function h(x) of x:
E(Y|X = x) := h(x) = ∫_{−∞}^{∞} y fY|X(y|x) dy.
In particular, h(X) is the function of the random variable X, and is called the conditional expectation of Y given X, denoted by E(Y|X). Similarly, we can define the conditional expectation for a function g(Y) of the random variable Y by
E(g(Y)|X = x) := ∫_{−∞}^{∞} g(y) fY|X(y|x) dy.
Since E(g(Y)|X = x) is a function, say h(x), of x, we can define E(g(Y)|X) as the function h(X) of the random variable X.
15.1 Properties of conditional expectation
(a) Linearity. Let a and b be constants. By the definition of conditional expectation, it clearly follows that
E(aY + b|X = x) = aE(Y |X = x) + b. Consequently,
E(aY + b|X) = aE(Y |X) + b.
(b) Law of total expectation. Since E(g(Y)|X) is a function h(X) of the random variable X, we can consider “the expectation E[E(g(Y)|X)] of the conditional expectation E(g(Y)|X),” and compute it as follows.
E[E(g(Y)|X)] = ∫_{−∞}^{∞} h(x) fX(x) dx = ∫_{−∞}^{∞} [∫_{−∞}^{∞} g(y) fY|X(y|x) dy] fX(x) dx
= ∫_{−∞}^{∞} [∫_{−∞}^{∞} fY|X(y|x) fX(x) dx] g(y) dy = ∫_{−∞}^{∞} g(y) fY(y) dy = E[g(Y)].
Thus, we have E[E(g(Y)|X)] = E[g(Y)].
(c) Conditional variance formula. The conditional variance, denoted by Var(Y|X = x), can be defined as
Var(Y|X = x) := E((Y − E(Y|X = x))²|X = x) = E(Y²|X = x) − (E(Y|X = x))²,
where the second equality can be obtained from the linearity property in (a). Since Var(Y|X = x) is a function, say h(x), of x, we can define Var(Y|X) as the function h(X) of the random variable X. Now compute “the variance Var(E(Y|X)) of the conditional expectation E(Y|X)” and “the expectation E[Var(Y|X)] of the conditional variance Var(Y|X)” as follows.
Var(E(Y|X)) = E[(E(Y|X))²] − (E[E(Y|X)])² = E[(E(Y|X))²] − (E[Y])²   (1)
E[Var(Y|X)] = E[E(Y²|X)] − E[(E(Y|X))²] = E[Y²] − E[(E(Y|X))²]   (2)
By adding (1) and (2) together, we can obtain
Var(E(Y|X)) + E[Var(Y|X)] = E[Y²] − (E[Y])² = Var(Y).
(d) In calculating the conditional expectation E(g1 (X)g2 (Y )|X = x) given X = x, we can treat g1 (X) as the
constant g1 (x); thus, we have E(g1 (X)g2 (Y )|X = x) = g1 (x)E(g2 (Y )|X = x). This justifies
E(g1 (X)g2 (Y )|X) = g1 (X)E(g2 (Y )|X).
15.2 Prediction
Suppose that we have two random variables X and Y . Having observed X = x, one may attempt to “predict”
the value of Y as a function g(x) of x. The function g(X) of the random variable X can be constructed for
this purpose, and is called a predictor of Y . The criterion for the “best” predictor g(X) is to find a function
g to minimize E[(Y − g(X))2 ]. We can compute E[(Y − g(X))2 ] by applying the properties (a)–(d) in Lecture
summary No.21.
E[(Y − g(X))²] = E[(Y − E(Y|X) + E(Y|X) − g(X))²]
= E[(Y − E(Y|X))² + 2(Y − E(Y|X))(E(Y|X) − g(X)) + (E(Y|X) − g(X))²]
= E[E((Y − E(Y|X))²|X)] + E[2(E(Y|X) − g(X)) E((Y − E(Y|X))|X)] + E[(E(Y|X) − g(X))²]
= E[Var(Y|X)] + E[(E(Y|X) − g(X))²],
since E((Y − E(Y|X))|X) = 0 makes the middle term vanish. This is minimized when g(x) = E(Y|X = x). Thus, we can find g(X) = E(Y|X) as the best predictor of Y.
The best predictor for bivariate normal distributions. When X and Y have the bivariate normal distribution with parameter (µx, µy, σx², σy², ρ), we can find the best predictor
E(Y|X) = µy + ρ(σy/σx)(X − µx).
15.3 The best linear predictor
Except for the case of the bivariate normal distribution, the conditional expectation E(Y|X) could be a complicated nonlinear function of X. So we can instead consider the best linear predictor
g(X) = α + βX
of Y by minimizing E[(Y − (α + βX))²]. Since
E[(Y − (α + βX))²] = E[Y²] − 2αµy − 2βE[XY] + α² + 2αβµx + β²E[X²],
where µx = E[X] and µy = E[Y], we can obtain the desired values α and β by solving
∂/∂α E[(Y − (α + βX))²] = −2µy + 2α + 2βµx = 0,
∂/∂β E[(Y − (α + βX))²] = −2E[XY] + 2αµx + 2βE[X²] = 0.
The solution can be expressed by using σx² = Var(X), σy² = Var(Y), and ρ = Cov(X, Y)/(σxσy):
β = (E[XY] − E[X]E[Y])/(E[X²] − (E[X])²) = ρσy/σx,
α = µy − βµx = µy − ρ(σy/σx)µx.
In summary, the best linear predictor g(X) becomes
g(X) = µy + ρ(σy/σx)(X − µx).
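The coefficients α and β can be recovered from data by exactly these formulas. The following is an added sketch assuming numpy; the simulated data and the true coefficients are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=100_000)                  # hypothetical X
y = 3.0 - 0.8 * x + rng.normal(0.0, 0.5, size=x.size)   # hypothetical Y

beta = np.cov(x, y, ddof=0)[0, 1] / np.var(x)   # beta = Cov(X, Y)/Var(X) = rho * sigma_y / sigma_x
alpha = y.mean() - beta * x.mean()              # alpha = mu_y - beta * mu_x
print(alpha, beta)                              # close to the hypothetical values 3.0 and -0.8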
15.4 Exercises
1. Suppose that the joint density function f(x, y) of X and Y is given by
f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the covariance and the correlation of X and Y.
(b) Find the conditional expectations E(X|Y = y) and E(Y|X = x).
(c) Find the best linear predictor of Y in terms of X.
(d) Find the best predictor of Y.
2. Suppose that we have the random variables X and Y having the joint density
f(x, y) = (6/7)(x + y)² if 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(a) Find the covariance and correlation of X and Y.
(b) Find the conditional expectation E(Y|X = x).
(c) Find the best linear predictor of Y.
3. Let X be the age of the pregnant woman, and let Y be the age of the prospective father at a certain hospital.
Suppose that the distribution of X and Y can be approximated by a bivariate normal distribution with
parameters µx = 28, µy = 32, σx = 6, σy = 8 and ρ = 0.8.
(a) Find the probability that the prospective father is over 30.
(b) When the pregnant woman is 31 years old, what is the best guess of the prospective father’s age?
(c) When the pregnant woman is 31 years old, find the probability that the prospective father is over 30.
16 Transformations of Variables
16.1 Transformations of a random variable
For an interval A = [a, b] on the real line with a < b, the integration of a function f(x) over A is formally written as
∫_A f(x) dx = ∫_a^b f(x) dx.   (23)
Let h(u) be a differentiable and strictly monotonic function (that is, either strictly increasing or strictly decreasing) on the real line. Then there is the inverse function g = h^(−1), so that we can define the inverse image h^(−1)(A) of A = [a, b] by
h^(−1)(A) = g(A) = {g(x) : a ≤ x ≤ b}.
The change of variable in the integral (23) provides the formula
∫_A f(x) dx = ∫_{g(A)} f(h(u)) |(d/du) h(u)| du.   (24)
Theorem A. Let X be a random variable with pdf fX(x). Suppose that a function g(x) is differentiable and strictly monotonic. Then the random variable Y = g(X) has the pdf
fY(y) = fX(g^(−1)(y)) |(d/dy) g^(−1)(y)|.   (25)
A sketch of proof. Since g(x) is strictly monotonic, we can find the inverse function h = g^(−1), and apply the formula (24). Thus, by letting A = (−∞, t], we can rewrite the cdf for Y as
FY(t) = P(Y ≤ t) = P(Y ∈ A) = P(g(X) ∈ A) = P(X ∈ h(A))
= ∫_{h(A)} fX(x) dx = ∫_{g(h(A))} fX(h(y)) |(d/dy) h(y)| dy = ∫_A fX(g^(−1)(y)) |(d/dy) g^(−1)(y)| dy.
By comparing it with the form ∫_{−∞}^{t} fY(y) dy, we can obtain (25).
Linear transform of normal random variable. Let X be a normal random variable with parameter (µ, σ²). Then we can define the linear transform Y = aX + b with real values a ≠ 0 and b. To compute the density function fY(y) of the random variable Y, we can apply Theorem A and obtain
fY(y) = (1/(|a|σ√(2π))) exp(−(y − (aµ + b))²/(2(aσ)²)).
This is the normal density function with parameter (aµ + b, (aσ)²).
Example. Let X be a uniform random variable on [0, 1]. Suppose that X takes nonnegative values [and
therefore, f (x) = 0 for x ≤ 0]. Find the pdf fY (y) for the random variable Y = X n .
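As a quick numerical check of Theorem A for this example (an added sketch assuming Python with numpy; the exponent n = 3 is a hypothetical choice), note that for uniform X on [0, 1] the cdf of Y = X^n is FY(t) = P(X ≤ t^(1/n)) = t^(1/n) on [0, 1]:

import numpy as np

rng = np.random.default_rng(0)
n = 3                                       # hypothetical exponent
x = rng.uniform(0.0, 1.0, size=200_000)
y = x**n
# Compare the empirical cdf of Y with F_Y(t) = t^(1/n):
for t in (0.1, 0.5, 0.9):
    print(t, np.mean(y <= t), t**(1.0 / n))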
16.2 Transformations of bivariate random variables
For a rectangle A = {(x, y) : a1 ≤ x ≤ b1, a2 ≤ y ≤ b2} on a plane, the integration of a function f(x, y) over A is formally written as
∫∫_A f(x, y) dx dy = ∫_{a2}^{b2} ∫_{a1}^{b1} f(x, y) dx dy.   (26)
Suppose that a transformation (g1(x, y), g2(x, y)) is differentiable and has the inverse transformation (h1(u, v), h2(u, v)) satisfying
u = g1(x, y) and v = g2(x, y)   if and only if   x = h1(u, v) and y = h2(u, v).   (27)
Then we can state the change of variable in the double integral (26) as follows: (i) define the inverse image of A by
h^(−1)(A) = g(A) = {(g1(x, y), g2(x, y)) : (x, y) ∈ A},
(ii) calculate the Jacobian
Jh(u, v) = det [∂h1/∂u  ∂h1/∂v; ∂h2/∂u  ∂h2/∂v],   (28)
(iii) finally, if it does not vanish for all (u, v), then we can obtain
∫∫_A f(x, y) dx dy = ∫∫_{g(A)} f(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.   (29)
Theorem B. Suppose that (a) random variables X and Y have a joint density function fX,Y(x, y), (b) a transformation (g1(x, y), g2(x, y)) is differentiable, and (c) its inverse transformation (h1(u, v), h2(u, v)) satisfies (27). Then the random variables U = g1(X, Y) and V = g2(X, Y) have the joint density function
fU,V(u, v) = fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)|.   (30)
A sketch of proof. Define A = {(x, y) : x ≤ s, y ≤ t} and the image of A under the inverse transformation (h1, h2) by
h(A) = {(h1(u, v), h2(u, v)) : (u, v) ∈ A}.
Then we can rewrite the joint distribution function FU,V for (U, V) by applying the formula (29).
FU,V(s, t) = P((U, V) ∈ A) = P((g1(X, Y), g2(X, Y)) ∈ A) = P((X, Y) ∈ h(A))
= ∫∫_{h(A)} fX,Y(x, y) dx dy = ∫∫_{g(h(A))} fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv
= ∫∫_A fX,Y(h1(u, v), h2(u, v)) |Jh(u, v)| du dv.
By comparing it with the form ∫_{−∞}^{t} ∫_{−∞}^{s} fU,V(u, v) du dv, we can obtain (30).
16.3 Exercises
1. Suppose that (X1, X2) has a bivariate normal distribution with mean vector µ = [µx; µy] and covariance matrix Σ = [σx²  ρσxσy; ρσxσy  σy²]. Consider the linear transformation
Y1 = g1(X1, X2) = a1X1 + b1;
Y2 = g2(X1, X2) = a2X2 + b2.
(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).
(b) Find the joint density function fY1,Y2(u, v) for (Y1, Y2).
(c) Conclude that the joint density function fY1,Y2(u, v) is the bivariate normal density with mean vector µ = [a1µx + b1; a2µy + b2] and covariance matrix Σ = [(a1σx)²  ρ(a1σx)(a2σy); ρ(a1σx)(a2σy)  (a2σy)²].
2. Let X1 and X2 be iid standard normal random variables. Consider the linear transformation
Y1 = g1(X1, X2) = a11X1 + a12X2 + b1;
Y2 = g2(X1, X2) = a21X1 + a22X2 + b2,
where a11a22 − a12a21 ≠ 0. Find the joint density function fY1,Y2(u, v) for (Y1, Y2), and conclude that it is the bivariate normal density with mean vector µ = [b1; b2] and covariance matrix Σ = [σx²  ρσxσy; ρσxσy  σy²], where σx² = a11² + a12², σy² = a21² + a22², and ρ = (a11a21 + a12a22)/√((a11² + a12²)(a21² + a22²)).
3. Suppose that X and Y are independent random variables with respective pdf's fX(x) and fY(y). Consider the transformation
U = g1(X, Y) = X + Y;
V = g2(X, Y) = X/(X + Y).
(a) Find the inverse transformation (h1(u, v), h2(u, v)), and compute the Jacobian Jh(u, v).
(b) Find the joint density function fX,Y(x, y) for (X, Y).
(c) Now let fX(x) and fY(y) be the gamma density functions with respective parameters (α1, λ) and (α2, λ). Then show that U = X + Y has the gamma distribution with parameter (α1 + α2, λ) and that V = X/(X + Y) has the density function
fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^(α1−1) (1 − v)^(α2−1).   (31)
The density function (31) is called the beta distribution with parameter (α1, α2).
4. Let X and Y be random variables. The joint density function of X and Y is given by
f(x, y) = x e^(−(x+y)) if 0 ≤ x and 0 ≤ y, and f(x, y) = 0 otherwise.
(a) Are X and Y independent? Describe the marginal distributions for X and Y with specific parameters.
(b) Find the distribution for the random variable U = X + Y.
5. Consider an investment for three years in which you plan to invest $500 on average at the beginning of every year. The fund yields a fixed annual interest of 10%; thus, the fund invested at the beginning of
the first year yields 33.1% at the end of term (that is, at the end of the third year), the fund of the second
year yields 21%, and the fund of the third year yields 10%.
(a) How much do you expect to have in your investment account at the end of term?
(b) From your past spending behavior, you assume that the annual investment you decide is independent
every year, and that its standard deviation is $200. (That is, the standard deviation for the amount
of money you invest each year is $200.) Find the standard deviation for the total amount of money
you have in the account at the end of term.
(c) Furthermore, suppose that your annual investment is normally distributed. (That is, the amount of
money you invest each year is normally distributed.) Then what is the probability that you have more
than $2000 in the account at the end of term?
6. Let X be an exponential random variable with parameter λ, and define Y = X^n, where n is a positive integer.
(a) Compute the pdf fY(y) for Y.
(b) Find c so that
f(x) = c x^n e^(−λx) if x ≥ 0, and f(x) = 0 otherwise,
is a pdf. Hint: You should know the name of the pdf f(x).
(c) By using (b), find the expectation E[Y ].
7. Let X and Y be independent normal random variables with mean zero and the respective variances σx² and σy². Then consider the change of variables via
U = a11X + a12Y and V = a21X + a22Y,   (32)
where D = a11a22 − a12a21 ≠ 0.
(a) Find the variances σu² and σv² for U and V, respectively.
(b) Find the inverse transformation (h1(u, v), h2(u, v)) so that X and Y can be expressed in terms of U and V via X = h1(U, V) and Y = h2(U, V). And compute the Jacobian Jh(u, v).
(c) Find ρ so that the joint density function fU,V(u, v) of U and V can be expressed as
fU,V(u, v) = (1/(2πσuσv√(1 − ρ²))) exp(−(1/2) Q(u, v)).
(d) Express Q(u, v) by using σu, σv, and ρ to conclude that the joint density is the bivariate normal density with parameters µu = µv = 0, σu, σv, and ρ.
(e) Let X′ = X + µx and Y′ = Y + µy. Then consider the change of variables via U′ = a11X′ + a12Y′ and V′ = a21X′ + a22Y′. Express U′ and V′ in terms of U and V, and describe the joint distribution for U′ and V′ with specific parameters.
(f) Now suppose that X and Y are independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). Then what is the joint distribution for U and V given by (32)?
8. To complete this problem, you may use the following theorem:
Theorem C. Let X and Y be independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). If a11a22 − a12a21 ≠ 0, then the random variables
U = a11X + a12Y and V = a21X + a22Y
have the bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu² = a11²σx² + a12²σy², σv² = a21²σx² + a22²σy², and
ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
Now suppose that a real-valued data X is transmitted through a noisy channel from location A to location B, and that the value U received at location B is given by
U = X + Y,
where Y is a normal random variable with mean zero and variance σy². Furthermore, we assume that X is normally distributed with mean µ and variance σx², and is independent of Y.
(a) Find the expectation E[U] and the variance Var(U) of the received value U at location B.
(b) Describe the joint distribution for X and U with specific parameters.
(c) Provided that U = t is observed at location B, find the best predictor g(t) of X.
(d) Suppose that we have µ = 1, σx = 1, and σy = 0.5. When U = 1.3 was observed, find the best predictor of X given U = 1.3.
17 Solutions to exercises
17.1 Conditional expectation
1. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = (1/4) − (1/3)(2/3) = 1/36, and ρ = (1/36)/√((1/18)(1/18)) = 1/2.
(b) Since f(x|y) = 1/y for 0 ≤ x ≤ y and f(y|x) = 1/(1 − x) for x ≤ y ≤ 1, we obtain E(X|Y = y) = y/2 and E(Y|X = x) = (x + 1)/2.
(c) The best linear predictor g(X) of Y is given by g(X) = µy + ρ(σy/σx)(X − µx) = (1/2)X + 1/2.
(d) The best predictor of Y is E(Y|X) = (X + 1)/2. [This is the same as (c). Why?]
2. (a) Cov(X, Y) = E[XY] − E[X]E[Y] = 17/42 − (9/14)² = −5/588 and ρ = (−5/588)/(199/2940) = −25/199.
(b) E(Y|X = x) = (6x² + 8x + 3)/(4(3x² + 3x + 1)).
(c) The best linear predictor g(X) of Y is given by g(X) = −(25/199)X + 144/199.
3. (a) P(Y ≥ 30) = P((Y − 32)/8 ≥ −1/4) = 1 − Φ(−0.25) ≈ 0.60.
(b) E(Y | X = 31) = 32 + (0.8)(8/6)(31 − 28) = 35.2.
(c) Since the conditional density fY|X(y|x) is the normal density with mean µ = 35.2 and variance σ² = (8)²(1 − (0.8)²) = 23.04, we can obtain
P(Y ≥ 30 | X = 31) ≈ P((Y − 35.2)/4.8 ≥ −1.08 | X = 31) = 1 − Φ(−1.08) ≈ 0.86.
Introduction to Probability
17.2
Transformations of variables
1
0
v − b2
u − b1
1
a
1
and h2 (u, v) =
. Thus, Jh (u, v) = det
.
1. (a) h1 (u, v) =
=
0 a12
a1
a2
a1 a2
1 1
1
u − b1 v − b 2
p
·
(b) fY1 ,Y2 (u, v) = fX1 ,X2
,
=
exp − Q(u, v) , where
a1
a2
a1 a2 2π(|a1 |σx )(|a2 |σy ) 1 − ρ2
2
#
"
2 2
1
v − (a2 µy + b2 )
(u − (a1 µx + b1 ))(v − (a2 µy + b2 ))
u − (a1 µx + b1 )
Q(u, v) =
+
− 2ρ
.
1 − ρ2
a 1 σx
a 2 σy
(a1 σx )(a2 σy )
(c) From (b), fY1 ,Y2 (u, v) is a bivariate normal distribution with parameter µu = a1 µx + b1 , µv = a2 µy + b2 ,
σu = |a1 |σx , σv = |a2 |σy , and ρ if a1 a2 > 0 (or, −ρ if a1 a2 < 0).
2. (a) h1(u, v) = (a22(u − b1) − a12(v − b2))/(a11a22 − a12a21) and h2(u, v) = (−a21(u − b1) + a11(v − b2))/(a11a22 − a12a21). Thus, Jh(u, v) = 1/(a11a22 − a12a21).
(b) Note that fX1,X2(x, y) = (1/(2π)) exp(−(x² + y²)/2). Then we obtain
fY1,Y2(u, v) = fX1,X2(h1(u, v), h2(u, v)) · |1/(a11a22 − a12a21)| = (1/(2π|a11a22 − a12a21|)) exp(−(1/2) Q(u, v)),
where
Q(u, v) = (u − b1  v − b2) [a11² + a12²  a11a21 + a12a22; a11a21 + a12a22  a21² + a22²]^(−1) [u − b1; v − b2].
(c) By comparing fY1,Y2(u, v) with a bivariate normal density, we can find that fY1,Y2(u, v) is a bivariate normal density with parameters µu = b1, µv = b2, σu² = a11² + a12², σv² = a21² + a22², and ρ = (a11a21 + a12a22)/√((a11² + a12²)(a21² + a22²)).
3. (a) h1(u, v) = uv and h2(u, v) = u − uv. Thus, Jh(u, v) = det[v  u; 1 − v  −u] = −u.
(b) fX,Y(x, y) = fX(x)fY(y).
(c) We can obtain
fU,V(u, v) = |−u| × fX(uv) × fY(u − uv) = u × (1/Γ(α1)) λ^(α1) e^(−λ(uv)) (uv)^(α1−1) × (1/Γ(α2)) λ^(α2) e^(−λ(u−uv)) (u − uv)^(α2−1) = fU(u) fV(v),
where fU(u) = (1/Γ(α1 + α2)) λ^(α1+α2) e^(−λu) u^(α1+α2−1) and fV(v) = (Γ(α1 + α2)/(Γ(α1)Γ(α2))) v^(α1−1)(1 − v)^(α2−1). Thus, fU(u) is a gamma density with parameter (α1 + α2, λ), and fV(v) is a beta density with parameter (α1, α2).
4. (a) Observe that f(x, y) = (xe^(−x)) · (e^(−y)) and that xe^(−x), x ≥ 0, is the gamma density with parameter (2, 1) and e^(−y), y ≥ 0, is the exponential density with λ = 1. Thus, we can obtain
fX(x) = ∫_0^∞ f(x, y) dy = xe^(−x) ∫_0^∞ e^(−y) dy = xe^(−x), x ≥ 0;
fY(y) = ∫_0^∞ f(x, y) dx = e^(−y) ∫_0^∞ xe^(−x) dx = e^(−y), y ≥ 0.
Since f(x, y) = fX(x)fY(y), X and Y are independent.
(b) Notice that the exponential density fY(y) with λ = 1 can be seen as the gamma density with (1, 1). Since the sum of independent gamma random variables with the same rate λ again has a gamma distribution, U has the gamma distribution with (3, 1).
5. Let X1 be the amount of money you invest in the first year, let X2 be the amount in the second year, and let X3 be the amount in the third year. Then the total amount of money you have at the end of term, say Y, becomes
Y = 1.331X1 + 1.21X2 + 1.1X3.
(a) E(1.331X1 + 1.21X2 + 1.1X3) = 1.331E(X1) + 1.21E(X2) + 1.1E(X3) = 3.641 × 500 = 1820.5.
(b) Var(1.331X1 + 1.21X2 + 1.1X3) = (1.331)²Var(X1) + (1.21)²Var(X2) + (1.1)²Var(X3) ≈ 4.446 × 200² ≈ 177800. Thus, the standard deviation is √Var(Y) ≈ √177800 ≈ 421.7.
(c) P(Y ≥ 2000) = P((Y − 1820.5)/421.7 ≥ (2000 − 1820.5)/421.7) ≈ 1 − Φ(0.43) ≈ 0.334.
6. (a) By using the change-of-variable formula with g(x) = x^n, we can obtain
fY(y) = fX(g^(−1)(y)) |(d/dy) g^(−1)(y)| = fX(y^(1/n)) · (1/n) y^(−(n−1)/n) = (λ/n) y^(−(n−1)/n) exp(−λ y^(1/n)), y ≥ 0.
(b) Since the pdf f(x) is of the form of a gamma density with (n + 1, λ), we must have f(x) = (λ^(n+1)/Γ(n + 1)) x^n e^(−λx), x ≥ 0. Therefore, c = λ^(n+1)/Γ(n + 1) (= λ^(n+1)/n!, since n is an integer).
(c) Since f(x) in (b) is a pdf, we have ∫_0^∞ f(x) dx = ∫_0^∞ (λ^(n+1)/Γ(n + 1)) x^n e^(−λx) dx = 1. Thus, we can obtain
E[Y] = E[X^n] = ∫_0^∞ x^n λ e^(−λx) dx = Γ(n + 1)/λ^n (= n!/λ^n, since n is an integer).
7. (a) σu² = Var(U) = a11²σx² + a12²σy² and σv² = Var(V) = a21²σx² + a22²σy².
(b) h1(u, v) = (1/D)(a22u − a12v) and h2(u, v) = (1/D)(−a21u + a11v). And we have Jh(u, v) = 1/D.
(c) ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
(d) Q(u, v) can be expressed as
Q(u, v) = (1/(1 − ρ²)) [(u/σu)² − 2ρ uv/(σuσv) + (v/σv)²].
Thus, the joint density function fU,V(u, v) is the bivariate normal density with parameters µu = µv = 0 together with σu, σv, ρ as calculated in (a) and (c).
(e) U′ = U + a11µx + a12µy and V′ = V + a21µx + a22µy. Then U′ and V′ have the bivariate normal density with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu, σv, and ρ.
(f) By (d) and (e), (U, V) has a bivariate normal distribution with parameters µu = a11µx + a12µy, µv = a21µx + a22µy, σu² = a11²σx² + a12²σy², σv² = a21²σx² + a22²σy², and
ρ = (a11a21σx² + a12a22σy²)/√((a11²σx² + a12²σy²)(a21²σx² + a22²σy²)).
8. (a) E[U] = E[X] + E[Y] = µ and Var(U) = Var(X) + Var(Y) = σx² + σy².
(b) By using Theorem C, we can find that X and U have the bivariate normal distribution with means µx = µ and µu = µ, variances σx² and σu² = σx² + σy², and ρ = σx/√(σx² + σy²).
(c) E(X | U = t) = µx + ρ(σx/σu)(t − µu) = µ + (σx²/(σx² + σy²))(t − µ).
(d) E(X | U = 1.3) = 1 + ((1)²/((1)² + (0.5)²))(1.3 − 1) = 1.24.
18 Order statistics
18.1 Extrema
Let X1, . . . , Xn be independent and identically distributed random variables (iid random variables) with the common cdf F(x) and pdf f(x). Then we can define the random variable V = min(X1, . . . , Xn), the minimum of X1, . . . , Xn. To find the distribution of V, consider the survival function P(V > x) of V and calculate as follows.
P(V > x) = P(X1 > x, . . . , Xn > x) = P(X1 > x) × · · · × P(Xn > x) = [1 − F(x)]^n.
Thus, we can obtain the cdf FV(x) and the pdf fV(x) of V:
FV(x) = 1 − [1 − F(x)]^n   and   fV(x) = (d/dx) FV(x) = n f(x)[1 − F(x)]^(n−1).
In particular, if X1, . . . , Xn are independent exponential random variables with the common parameter λ, then the minimum min(X1, . . . , Xn) has the density (nλ)e^(−(nλ)x), which is again an exponential density function with parameter nλ.
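The exponential-minimum fact is easy to see in simulation. The following is an added sketch assuming numpy; the rate λ and the sample size n are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
lam, n = 0.5, 4                                      # hypothetical rate and number of variables
samples = rng.exponential(scale=1.0 / lam, size=(100_000, n))
v = samples.min(axis=1)                              # V = min(X1, ..., Xn)

print(v.mean(), 1.0 / (n * lam))                     # mean of V vs 1/(n*lambda)
t = 0.3
print(np.mean(v > t), np.exp(-n * lam * t))          # survival P(V > t) vs e^(-(n*lambda)t)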
18.2 Order statistics
Let X1, . . . , Xn be iid random variables with the common cdf F(x) and the pdf f(x). When we sort X1, . . . , Xn as
X(1) < X(2) < · · · < X(n),
the random variable X(k) is called the k-th order statistic. Then we have the following theorem.
Theorem. The pdf of the k-th order statistic X(k) is given by
fk(x) = (n!/((k − 1)!(n − k)!)) f(x) F^(k−1)(x) [1 − F(x)]^(n−k).   (33)
Probability of a small interval—the proof of the theorem. If f is continuous at x and δ > 0 is small, then the probability that a random variable X falls into the small interval [x, x + δ] is approximated by δf(x). In differential notation, we can write
P(x ≤ X ≤ x + dx) = f(x) dx.
Now consider the k-th order statistic X(k) and the event that x ≤ X(k) ≤ x + dx. This event occurs when (k − 1) of the Xi's are less than x and (n − k) of the Xi's are greater than x ≈ x + dx. Since there are n!/((k − 1)! 1! (n − k)!) such arrangements of the Xi's, we can obtain the probability of this event as
fk(x) dx = (n!/((k − 1)!(n − k)!)) (F(x))^(k−1) (f(x) dx) [1 − F(x)]^(n−k),
which implies (33).
18.3 Beta distribution
When X1, . . . , Xn are iid uniform random variables on [0, 1], the pdf of the k-th order statistic X(k) becomes
fk(x) = (n!/((k − 1)!(n − k)!)) x^(k−1)(1 − x)^(n−k), 0 ≤ x ≤ 1,
which is known as the beta distribution with parameters α = k and β = n − k + 1.
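This Beta(k, n − k + 1) claim can be checked against scipy's beta distribution. The following is an added sketch; the sample size n and rank k are hypothetical.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)
n, k = 7, 3                                            # hypothetical sample size and rank
u = rng.uniform(size=(200_000, n))
xk = np.sort(u, axis=1)[:, k - 1]                      # k-th order statistic X_(k)

print(xk.mean(), k / (n + 1))                          # mean of Beta(k, n-k+1) is k/(n+1)
x0 = 0.4
print(np.mean(xk <= x0), beta.cdf(x0, k, n - k + 1))   # empirical cdf vs Beta cdf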
18.4 Exercises
1. Suppose that a system's components are connected in series and have lifetimes that are independent exponential random variables X1, . . . , Xn with respective parameters λ1, . . . , λn.
(a) Let V be the lifetime of the system. Show that V = min(X1, . . . , Xn).
(b) Show that the random variable V is exponentially distributed with parameter Σ_{i=1}^n λi.
2. Each component Ai of the following system has a lifetime Xi, where X1, . . . , X6 are iid exponential random variables with parameter λ.
(Diagram: the six components form three two-component branches, (A1, A2), (A3, A4), and (A5, A6), arranged so that the system lifetime is given by the expression in part (a).)
(a) Let V be the lifetime of the system. Show that V = max(min(X1 , X2 ), min(X3 , X4 ), min(X5 , X6 )).
(b) Find the cdf and the pdf of the random variable V .
3. A card contains n chips and has an error-correcting mechanism such that the card still functions if a single
chip fails but does not function if two or more chips fail. If each chip has an independent and exponentially
distributed lifetime with parameter λ, find the pdf of the card’s lifetime.
4. Suppose that a queue has n servers and that the length of time to complete a job is an exponential random
variable. If a job is at the top of the queue and will be handled by the next available server, what is the
distribution of the waiting time until service? What is the distribution of the waiting time until service of
the next job in the queue?
5. Let U and V be iid uniform random variables on [0, 1], and let X = min(U, V) and Y = max(U, V).
(a) We have discussed that the probability P(x ≤ X ≤ x + dx) is approximated by f(x) dx. Similarly, we can find the following formula in the case of a joint distribution:
P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = f(x, y) dx dy.
In particular, we can obtain
P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) = dx dy.
Now by showing that
P(x ≤ X ≤ x + dx, y ≤ Y ≤ y + dy) = P(x ≤ U ≤ x + dx, y ≤ V ≤ y + dy) + P(x ≤ V ≤ x + dx, y ≤ U ≤ y + dy),
justify that the joint density function f(x, y) of X and Y is given by
f(x, y) = 2 if 0 ≤ x ≤ y ≤ 1, and f(x, y) = 0 otherwise.
(b) Find the conditional expectations E(X|Y = y) and E(Y |X = x).
19 MGF techniques
19.1 Moment generating function
Let X be a random variable. Then the moment generating function (mgf) of X is given by
M(t) = E[e^(tX)].
The mgf M(t) is a function of t defined on some open interval (c0, c1) with c0 < 0 < c1. In particular, when X is a continuous random variable having the pdf f(x), the mgf M(t) can be expressed as
M(t) = ∫_{−∞}^{∞} e^(tx) f(x) dx.
The most significant property of the moment generating function is that it uniquely determines the distribution.
Suppose that X is a standard normal random variable. Then we can compute the mgf of X as follows.
MX(t) = (1/√(2π)) ∫_{−∞}^{∞} e^(tx) e^(−x²/2) dx = (1/√(2π)) ∫_{−∞}^{∞} e^(−(x−t)²/2 + t²/2) dx = e^(t²/2) (1/√(2π)) ∫_{−∞}^{∞} e^(−(x−t)²/2) dx = exp(t²/2).
Since X = (Y − µ)/σ becomes a standard normal random variable and Y = σX + µ, the mgf of Y can be given by
MY(t) = e^(µt) MX(σt) = exp(σ²t²/2 + µt).
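The formula MX(t) = exp(t²/2) can be sanity-checked by estimating E[e^(tX)] from simulated standard normal draws. A brief added sketch assuming numpy:

import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 1.5):
    print(t, np.mean(np.exp(t * x)), np.exp(t * t / 2))   # Monte Carlo estimate vs exp(t^2/2)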
Similarly, suppose now that X is a gamma random variable with parameter (α, λ). For t < λ, the mgf MX(t) of X can be computed as follows.
MX(t) = ∫_0^∞ e^(tx) (λ^α/Γ(α)) x^(α−1) e^(−λx) dx = (λ^α/Γ(α)) ∫_0^∞ x^(α−1) e^(−(λ−t)x) dx
= (λ^α/((λ − t)^α Γ(α))) ∫_0^∞ ((λ − t)x)^(α−1) e^(−(λ−t)x) (λ − t) dx = (λ/(λ − t))^α (1/Γ(α)) ∫_0^∞ u^(α−1) e^(−u) du = (λ/(λ − t))^α.
19.2 Joint moment generating function
Let X and Y be random variables having a joint density function f(x, y). Then we can define the joint moment generating function M(s, t) of X and Y by
M(s, t) = E[e^(sX+tY)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(sx+ty) f(x, y) dx dy.
If X and Y are independent, then the joint mgf M(s, t) becomes
M(s, t) = E[e^(sX+tY)] = E[e^(sX)] · E[e^(tY)] = MX(s) · MY(t).
Moreover, it is known that if the joint mgf M(s, t) of X and Y is of the form M1(s) · M2(t), then X and Y are independent and have the respective mgf's M1(t) and M2(t).
19.3 Sum of independent random variables
Let X and Y be independent random variables having the respective probability density functions fX(x) and fY(y). Then the cumulative distribution function FZ(z) of the random variable Z = X + Y can be given as follows.
FZ(z) = P(X + Y ≤ z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−x} fX(x) fY(y) dy dx = ∫_{−∞}^{∞} fX(x) FY(z − x) dx,
where FY is the cdf of Y. By differentiating FZ(z), we can obtain the pdf fZ(z) of Z as
fZ(z) = (d/dz) FZ(z) = ∫_{−∞}^{∞} fX(x) (d/dz) FY(z − x) dx = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.
The form of integration ∫_{−∞}^{∞} f(x) g(z − x) dx is called the convolution. Thus, the pdf fZ(z) is given by the convolution of the pdf's fX(x) and fY(y). However, the use of the moment generating function makes it easier to find the distribution of the sum of independent random variables.
Let X and Y be independent normal random variables with the respective parameters (µx, σx²) and (µy, σy²). Then the sum Z = X + Y has the mgf
MZ(t) = MX(t) · MY(t) = exp(σx²t²/2 + µxt) · exp(σy²t²/2 + µyt) = exp((σx² + σy²)t²/2 + (µx + µy)t),
which is the mgf of the normal distribution with parameter (µx + µy, σx² + σy²). Since the mgf uniquely determines the distribution, we can find that Z is a normal random variable with parameter (µx + µy, σx² + σy²).
Let X and Y be independent gamma random variables with the respective parameters (α1, λ) and (α2, λ). Then the sum Z = X + Y has the mgf
MZ(t) = MX(t) · MY(t) = (λ/(λ − t))^(α1) · (λ/(λ − t))^(α2) = (λ/(λ − t))^(α1+α2),
which is the mgf of the gamma distribution with parameter (α1 + α2, λ). Thus, Z is a gamma random variable with parameter (α1 + α2, λ).
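The conclusion for the gamma case can be compared against scipy's gamma cdf. The following is an added sketch; α1, α2, and λ are hypothetical.

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
a1, a2, lam = 2.0, 3.5, 1.5                            # hypothetical alpha1, alpha2, lambda
x = rng.gamma(shape=a1, scale=1.0 / lam, size=300_000)
y = rng.gamma(shape=a2, scale=1.0 / lam, size=300_000)
z = x + y                                              # should be gamma(a1 + a2, lambda)

t = 4.0
print(np.mean(z <= t), gamma.cdf(t, a=a1 + a2, scale=1.0 / lam))   # empirical vs theoretical cdf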
20 Chi-square, t-, and F-distributions
20.1 Chi-square distribution
The gamma distribution with parameter (n/2, 1/2),
f(x) = (1/(2^(n/2) Γ(n/2))) x^(n/2−1) e^(−x/2), x ≥ 0,
is called the chi-square distribution with n degrees of freedom.
(a) Let X be a standard normal random variable, and let
Y = X²
be the square of X. Then we can express the cdf FY of Y as FY(t) = P(X² ≤ t) = P(−√t ≤ X ≤ √t) = Φ(√t) − Φ(−√t) in terms of the standard normal cdf Φ. By differentiating the cdf, we can obtain the pdf fY as follows.
fY(t) = (d/dt) FY(t) = (1/√(2πt)) e^(−t/2) = (1/(2^(1/2) Γ(1/2))) t^(−1/2) e^(−t/2),
where we use Γ(1/2) = √π. Thus, the square X² of X is a chi-square random variable with 1 degree of freedom.
(b) Let X1, . . . , Xn be iid standard normal random variables. Since the Xi²'s have the gamma distribution with parameter (1/2, 1/2), the sum
Y = Σ_{i=1}^n Xi²
has the gamma distribution with parameter (n/2, 1/2). That is, the sum Y has the chi-square distribution with n degrees of freedom.
20.2 Student's t-distribution
Let X be a chi-square random variable with n degrees of freedom, and let Y be a standard normal random variable. Suppose that X and Y are independent. Then the distribution of the quotient
Z = Y/√(X/n)
is called the Student's t-distribution with n degrees of freedom. The pdf fZ(z) is given by
fZ(z) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2), −∞ < z < ∞.
20.3 F-distribution
Let U and V be independent chi-square random variables with r1 and r2 degrees of freedom, respectively. Then the distribution of the quotient
Z = (U/r1)/(V/r2)
is called the F-distribution with (r1, r2) degrees of freedom. The pdf fZ(z) is given by
fZ(z) = (Γ((r1 + r2)/2)/(Γ(r1/2)Γ(r2/2))) (r1/r2)^(r1/2) z^(r1/2−1) (1 + (r1/r2)z)^(−(r1+r2)/2), 0 < z < ∞.
20.4 Quotient of two random variables
Let X and Y be independent random variables having the respective pdf's fX(x) and fY(y). Then the cdf FZ(z) of the quotient
Z = Y/X
can be computed as follows.
FZ(z) = P(Y/X ≤ z) = P(Y ≥ zX, X < 0) + P(Y ≤ zX, X > 0)
= ∫_{−∞}^{0} [∫_{xz}^{∞} fY(y) dy] fX(x) dx + ∫_{0}^{∞} [∫_{−∞}^{xz} fY(y) dy] fX(x) dx.
By differentiating, we can obtain
fZ(z) = (d/dz) FZ(z) = ∫_{−∞}^{0} [−x fY(xz)] fX(x) dx + ∫_{0}^{∞} [x fY(xz)] fX(x) dx = ∫_{−∞}^{∞} |x| fY(xz) fX(x) dx.
Let X be a chi-square random variable with n degrees of freedom. Then the pdf of the random variable U = √(X/n) is given by
fU(x) = (d/dx) P(√(X/n) ≤ x) = (d/dx) FX(nx²) = 2nx fX(nx²) = (n^(n/2)/(2^(n/2−1) Γ(n/2))) x^(n−1) e^(−nx²/2)
for x ≥ 0; otherwise, fU(x) = 0. Suppose that Y is a standard normal random variable and independent of X. Then the quotient
Z = Y/√(X/n)
has a t-distribution with n degrees of freedom. The pdf fZ(z) can be computed as follows.
fZ(z) = ∫_{−∞}^{∞} |x| fY(xz) fU(x) dx = (n^(n/2)/(2^((n−1)/2) √π Γ(n/2))) ∫_0^∞ x^n e^(−(z²+n)x²/2) dx
= (1/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2) ∫_0^∞ t^((n−1)/2) e^(−t) dt = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + z²/n)^(−(n+1)/2)
for −∞ < z < ∞.
Let U and V be independent chi-square random variables with r1 and r2 degrees of freedom, respectively. Then the quotient
Z = (U/r1)/(V/r2)
has the F-distribution with (r1, r2) degrees of freedom. By putting Y = U/r1 and X = V/r2, we can compute the pdf fZ(z) as follows.
fZ(z) = ∫_{−∞}^{∞} |x| fY(xz) fX(x) dx = (r1^(r1/2) r2^(r2/2)/(2^((r1+r2)/2) Γ(r1/2)Γ(r2/2))) z^(r1/2−1) ∫_0^∞ x^((r1+r2)/2−1) e^(−(r1z+r2)x/2) dx
= (r1^(r1/2) r2^(r2/2)/(Γ(r1/2)Γ(r2/2))) (r1z + r2)^(−(r1+r2)/2) z^(r1/2−1) ∫_0^∞ t^((r1+r2)/2−1) e^(−t) dt
= (Γ((r1 + r2)/2)/(Γ(r1/2)Γ(r2/2))) (r1/r2)^(r1/2) z^(r1/2−1) (1 + (r1/r2)z)^(−(r1+r2)/2), 0 < z < ∞.
21 Sampling distributions
21.1 Population and random sample
The recorded data values x1, . . . , xn are typically considered as the observed values of iid random variables X1, . . . , Xn having a common probability distribution f(x). To judge the quality of data, it is useful to envisage a population from which the sample should be drawn. A random sample is chosen at random from the population to ensure that the sample is representative of the population.
Let X1, . . . , Xn be iid normal random variables with parameter (µ, σ²). Then the random variable
X̄ = (1/n) Σ_{i=1}^n Xi
is called the sample mean, and the random variable
S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²
is called the sample variance. And they have the following properties.
1. E[X̄] = µ and Var(X̄) = σ²/n.
2. The sample variance S² can be rewritten as
S² = (1/(n − 1)) (Σ_{i=1}^n Xi² − nX̄²) = (1/(n − 1)) (Σ_{i=1}^n (Xi − µ)² − n(X̄ − µ)²).
In particular, we can obtain
E[S²] = (1/(n − 1)) (Σ_{i=1}^n E[(Xi − µ)²] − nE[(X̄ − µ)²]) = (1/(n − 1)) (Σ_{i=1}^n Var(Xi) − nVar(X̄)) = σ².
3. X̄ has the normal distribution with parameter (µ, σ²/n), and X̄ and S² are independent. Furthermore,
W_(n−1) = (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^n (Xi − X̄)²
has the chi-square distribution with (n − 1) degrees of freedom.
4. The random variable
T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √(((n − 1)S²/σ²)/(n − 1))
has the t-distribution with (n − 1) degrees of freedom.
Rough explanation of Property 3. Assuming that X̄ and S² are independent (which is not so easy to prove), we can show that the random variable W_(n−1) has the chi-square distribution with (n − 1) degrees of freedom. By introducing V = (1/σ²) Σ_{i=1}^n (Xi − µ)² and W1 = (n/σ²)(X̄ − µ)², we can write V = W_(n−1) + W1. Since W_(n−1) and W1 are independent, the mgf MV(t) of V can be expressed as
MV(t) = MW_(n−1)(t) × MW1(t),
where MW_(n−1)(t) and MW1(t) are the mgf's of W_(n−1) and W1, respectively. Observe that W1 has the chi-square distribution with 1 degree of freedom. Thus, we can obtain
MW_(n−1)(t) = MV(t)/MW1(t) = (1 − 2t)^(−n/2)/(1 − 2t)^(−1/2) = (1 − 2t)^(−(n−1)/2),
which is the mgf of the chi-square distribution with (n − 1) degrees of freedom.
21.2 Sampling distributions under normal assumption
Suppose that the data X1, . . . , Xn are iid normal random variables with parameter (µ, σ²).
1. The sample mean X̄ is normally distributed with parameter (µ, σ²/n), and the random variable
√n(X̄ − µ)/σ ∼ N(0, 1)   (34)
has the standard normal distribution.
2. The random variable
(1/σ²) Σ_{i=1}^n (Xi − X̄)² = (n − 1)S²/σ² ∼ χ²_(n−1)
is distributed as the chi-square distribution with (n − 1) degrees of freedom, and is independent of the random variable (34).
3. The random variable
(X̄ − µ)/(S/√n) = [√n(X̄ − µ)/σ] / √(((n − 1)S²/σ²)/(n − 1)) ∼ t_(n−1)
has the t-distribution with (n − 1) degrees of freedom, where S = √(S²) is called the sample standard deviation.
21.3 Critical points
Given a random variable X, the critical point for level α is defined as the value c satisfying P (X > c) = α.
1. Suppose that a random variable X has the chi-square distribution with n degrees of freedom. Then the
critical point of chi-square distribution, denoted by χα,n , is defined by
P (X > χα,n ) = α.
2. Suppose that a random variable Z has the t-distribution with m degrees of freedom. Then the critical
point is defined as the value tα,m satisfying
P (Z > tα,m ) = α.
Since the t-distribution is symmetric, the critical point tα,m can be equivalently given by
P (Z < −tα,m ) = α.
Together we can obtain the expression
P (−tα/2,m ≤ Z ≤ tα/2,m ) = 1 − P (Z < −tα/2,m ) − P (Z > tα/2,m ) = 1 − α.
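In practice these critical points are read off with a statistics library rather than a table. The following is an added sketch assuming scipy; the level α and the degrees of freedom n and m are hypothetical. The ppf (inverse cdf) gives the point c with P(X ≤ c) = 1 − α, which is exactly the level-α critical point.

from scipy.stats import chi2, t

alpha, n, m = 0.05, 10, 15                    # hypothetical level and degrees of freedom
chi_crit = chi2.ppf(1 - alpha, df=n)          # P(X > chi_crit) = alpha for X ~ chi-square(n)
t_crit = t.ppf(1 - alpha, df=m)               # P(Z > t_crit) = alpha for Z ~ t(m)
t_two_sided = t.ppf(1 - alpha / 2, df=m)      # P(-c <= Z <= c) = 1 - alpha
print(chi_crit, t_crit, t_two_sided)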
21.4 Exercises
1. Consider a sample X1 , . . . , Xn of normally distributed random variables with variance σ 2 = 5. Suppose
that n = 31.
(a) What is the value of c for which P (S 2 ≤ c) = 0.90?
(b) What is the value of c for which P (S 2 ≤ c) = 0.95?
2. Consider a sample X1 , . . . , Xn of normally distributed random variables with mean µ. Suppose that n = 16.
(a) What is the value of c for which P (|4(X̄ − µ)/S| ≤ c) = 0.95?
(b) What is the value of c for which P (|4(X̄ − µ)/S| ≤ c) = 0.99?
3. Suppose that X1 , . . . , X10 are iid normal random variables with parameters µ = 0 and σ 2 > 0.
(a) Let Z = X1 + X2 + X3 + X4 . Then what is the distribution for Z?
(b) Let W = X5² + X6² + X7² + X8² + X9² + X10². Then what is the distribution for W/σ²?
(c) Describe the distribution for the random variable (Z/2)/√(W/6).
(d) Find the value c for which P(Z/√W ≤ c) = 0.95.
22 Law of large numbers
22.1 Chebyshev's inequality
We can define an indicator function I_A of a subset A of the real line as
I_A(x) = 1 if x ∈ A, and I_A(x) = 0 otherwise.
Let X be a random variable having the pdf f(x). Then we can express the probability
P(X ∈ A) = ∫_A f(x) dx = ∫_{−∞}^{∞} I_A(x) f(x) dx.
Now suppose that X has the mean µ = E[X] and the variance σ² = Var(X), and that we choose the particular subset
A = {x : |x − µ| > t}.
Then we can observe that
I_A(x) ≤ (x − µ)²/t² for all x,
which can be used to derive
P(|X − µ| > t) = P(X ∈ A) = ∫_{−∞}^{∞} I_A(x) f(x) dx ≤ (1/t²) ∫_{−∞}^{∞} (x − µ)² f(x) dx = Var(X)/t² = σ²/t².
The inequality P(|X − µ| > t) ≤ σ²/t² is called Chebyshev's inequality.
22.2 Weak law of large numbers
Let Y1, Y2, . . . be a sequence of random variables. Then we can define the event An = {|Yn − α| > ε} and the probability P(An) = P(|Yn − α| > ε). We say that the sequence Yn converges to α in probability if we have
lim_{n→∞} P(|Yn − α| > ε) = 0
for any choice of ε > 0.
Now let X1, X2, . . . be a sequence of iid random variables with mean µ and variance σ². Then for each n = 1, 2, . . ., the sample mean
X̄n = (1/n) Σ_{i=1}^n Xi
has mean µ and variance σ²/n. Given a fixed ε > 0, we can apply Chebyshev's inequality to obtain
P(|X̄n − µ| > ε) ≤ (σ²/n)/ε² → 0 as n → ∞.
Thus, we have established that “the sample mean X̄n converges to µ in probability,” which is called the weak law of large numbers. A stronger result, “the sample mean X̄n converges to µ almost surely,” was established by Kolmogorov, and is called the strong law of large numbers.
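The weak law is easy to see numerically: as n grows, the sample mean concentrates around µ. A brief added sketch assuming numpy, with hypothetical exponential data of mean 2:

import numpy as np

rng = np.random.default_rng(6)
mu = 2.0                                        # true mean of the hypothetical Exponential data
eps = 0.1
for n in (10, 100, 1_000, 10_000):
    xbar = rng.exponential(scale=mu, size=(5_000, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - mu) > eps))  # estimate of P(|X_bar_n - mu| > eps), shrinking in n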
23 Central limit theorems
23.1 Convergence in distribution
Let X1, X2, . . . be a sequence of random variables having the cdf's F1, F2, . . ., and let X be a random variable having the cdf F. Then we say that the sequence Xn converges to X in distribution (in short, Xn converges to F) if
lim_{n→∞} Fn(x) = F(x) for every x at which F(x) is continuous.   (35)
Note that the convergence in (35) is completely characterized in terms of the distributions F1, F2, . . . and F. Recall that the distributions F1(x), F2(x), . . . and F(x) are uniquely determined by the respective moment generating functions, say M1(t), M2(t), . . . and M(t). Furthermore, we have an “equivalent” version of the convergence in terms of the mgf's.
Continuity Theorem. If
lim_{n→∞} Mn(t) = M(t) for every t (around 0),   (36)
then the corresponding distributions Fn's converge to F in the sense of (35).
Alternative criterion for convergence in distribution. Let Y1, Y2, . . . be a sequence of random variables, and let Y be a random variable. Then the sequence Yn converges to Y in distribution if and only if
lim_{n→∞} E[g(Yn)] = E[g(Y)]
for every bounded continuous function g.
Slutsky’s theorem. Let Z1 , Z2 , . . . and W1 , W2 , . . . be two sequences of random variables, and let c be a constant
value. Suppose that the sequence Zn converges to Z in distribution, and that the sequence Wn converges to c in
probability. Then
1. The sequence Zn + Wn converges to Z + c in distribution.
2. The sequence Zn Wn converges to cZ in distribution.
3. The sequence Zn /Wn converges to Z/c in distribution.
23.2 Central Limit Theorem
Let X1, X2, . . . be a sequence of iid random variables having the common distribution F with mean 0 and variance 1. Then
Zn = (1/√n) Σ_{i=1}^n Xi,  n = 1, 2, . . .
converges to the standard normal distribution.
A sketch of proof. Let M be the mgf for the distribution F, and let Mn be the mgf for the random variable Zn for n = 1, 2, . . .. Since the Xi's are iid random variables, we have
Mn(t) = E[e^(tZn)] = E[e^((t/√n)(X1+···+Xn))] = E[e^((t/√n)X1)] × · · · × E[e^((t/√n)Xn)] = [M(t/√n)]^n.
Observing that M′(0) = E[X1] = 0 and M″(0) = E[X1²] = 1, we can apply a Taylor expansion to M(t/√n) around 0:
M(t/√n) = M(0) + (t/√n) M′(0) + (t²/(2n)) M″(0) + εn(t) = 1 + t²/(2n) + εn(t).
Recall that lim_{n→∞} (1 + a/n)^n = e^a; in a similar manner, we can obtain
lim_{n→∞} Mn(t) = lim_{n→∞} [1 + (t²/2)/n + εn(t)]^n = e^(t²/2),
which is the mgf for the standard normal distribution.
23.3 Normal approximation
Let X1, X2, . . . be a sequence of iid random variables having the common distribution F with mean µ and variance σ². Then
Zn = (Σ_{i=1}^n Xi − nµ)/(σ√n) = (1/√n) Σ_{i=1}^n (Xi − µ)/σ,  n = 1, 2, . . .
converges to the standard normal distribution.
Let X1, X2, . . . , Xn be finitely many iid random variables with mean µ and variance σ². If the size n is adequately large, then the distribution of the sum
Y = Σ_{i=1}^n Xi
can be approximated by the normal distribution with parameter (nµ, nσ²). A general rule for “adequately large” n is about n ≥ 30, but the approximation is often good for much smaller n. Similarly, the sample mean
X̄ = (1/n) Σ_{i=1}^n Xi
has approximately the normal distribution with parameter (µ, σ²/n). Then the following normal approximations can be applied to probability computations.
Random variable: Y = X1 + · · · + Xn.  Normal approximation: N(nµ, nσ²).  Probability: P(a ≤ Y ≤ b) ≈ Φ((b − nµ)/(σ√n)) − Φ((a − nµ)/(σ√n)).
Random variable: X̄ = (X1 + · · · + Xn)/n.  Normal approximation: N(µ, σ²/n).  Probability: P(a ≤ X̄ ≤ b) ≈ Φ((b − µ)/(σ/√n)) − Φ((a − µ)/(σ/√n)).
Example. An insurance company has 10,000 automobile policyholders. The expected yearly claim per policyholder is $240, with a standard deviation of $800. Find approximately the probability that the total yearly claim exceeds $2.7 million. Can you say that such an event is statistically highly unlikely?
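A sketch of the computation for this example (added, assuming scipy is available): under the normal approximation the total claim Y is approximately N(nµ, nσ²) with n = 10,000, µ = 240, and σ = 800.

import numpy as np
from scipy.stats import norm

n, mu, sigma = 10_000, 240.0, 800.0
total_mean = n * mu                       # 2,400,000
total_sd = sigma * np.sqrt(n)             # 80,000
z = (2_700_000 - total_mean) / total_sd   # 3.75
print(1 - norm.cdf(z))                    # about 8.8e-05, so the event is indeed very unlikely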
23.4 Normal approximation to binomial distribution
Suppose that X1, . . . , Xn are iid Bernoulli random variables with mean p = E(X) and variance p(1 − p) = Var(X). If the size n is adequately large, then the distribution of the sum
Y = Σ_{i=1}^n Xi
can be approximated by the normal distribution with parameter (np, np(1 − p)). Thus, the normal distribution N(np, np(1 − p)) approximates the binomial distribution B(n, p). A general rule for “adequately large” n is to satisfy np ≥ 5 and n(1 − p) ≥ 5.
Let Y be a binomial random variable with parameter (n, p), and let Z be a normal random variable with parameter (np, np(1 − p)). Then the distribution of Y can be approximated by that of Z. Since Z is a continuous random variable, the approximation of probability should improve when the following formula of continuity correction is used:
P(i ≤ Y ≤ j) ≈ P(i − 0.5 ≤ Z ≤ j + 0.5) = Φ((j + 0.5 − np)/√(np(1 − p))) − Φ((i − 0.5 − np)/√(np(1 − p))).
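The continuity correction can be compared with the exact binomial probability. The following is an added sketch assuming scipy; the parameters n, p, i, j are hypothetical.

import numpy as np
from scipy.stats import binom, norm

n, p = 20, 0.4                                 # hypothetical parameters
i, j = 6, 10
exact = binom.cdf(j, n, p) - binom.cdf(i - 1, n, p)           # exact P(i <= Y <= j)
mu, sd = n * p, np.sqrt(n * p * (1 - p))
approx = norm.cdf((j + 0.5 - mu) / sd) - norm.cdf((i - 0.5 - mu) / sd)   # continuity correction
print(exact, approx)                            # the two values should be close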
23.5 Exercises
1. The actual voltage of a new 1.5-volt battery has the probability density function
f(x) = 5, 1.4 ≤ x ≤ 1.6.
Estimate the probability that the sum of the voltages from 120 new batteries lies between 170 and 190
volts.
2. The germination time in days of a newly planted seed has the probability density function
f (x) = 0.3e−0.3x ,
x ≥ 0.
If the germination times of different seeds are independent of one another, estimate the probability that
the average germination time of 2000 seeds is between 3.1 and 3.4 days.
3. Calculate the following probabilities by using normal approximation with continuity correction.
(a) Let X be a binomial random variable with n = 10 and p = 0.7. Find P (X ≥ 8).
(b) Let X be a binomial random variable with n = 15 and p = 0.3. Find P (2 ≤ X ≤ 7).
(c) Let X be a binomial random variable with n = 9 and p = 0.4. Find P (X ≤ 4).
(d) Let X be a binomial random variable with n = 14 and p = 0.6. Find P (8 ≤ X ≤ 11).