Lectures 12–21

Random Variables

Definition: A random variable (rv or RV) is a real-valued function defined on the sample space. The term "random variable" is a misnomer, in view of the normal usage of function and variable. Random variables are denoted by capital letters from the end of the alphabet, e.g. X, Y, Z, but other letters are used as well, e.g., B, M, etc. Hence

$$X : S \longmapsto \mathbb{R}$$

and X(e) for e ∈ S is a particular value of X. Since the e are random outcomes of the experiment, the function values X(e) are random as a consequence. Hence the terminology "random variable," although "random function value" might have been less confusing. Random variables are simply a bridge from the sample space to the realm of numbers, where we can perform arithmetic.

A random variable is different from a random function, where the evaluation for each e is a function trajectory, e.g., against time as in a stock market index for a given day. Such random functions are also known as stochastic processes. We will have only limited exposure to them.

Example 1 (Roll of Two Dice): The sum X of the two numbers facing up is an rv.

Example 2 (Toss of Three Coins): The number X of heads in the toss of three coins is a random variable. Compute P(X = i) = P({X = i}) for i = 0, 1, 2, 3. The event {X = i} is short for {e ∈ S : X(e) = i}.

Example 3 (Urn Problem): Three balls are randomly selected (without replacement) from an urn containing 20 balls labeled 1, 2, . . . , 20. We bet that we will get at least one label ≥ 17. What is the probability of winning the bet? This problem could be solved without involving the notion of a random variable. For the sake of working with the concept of a random variable, let X be the maximum number of the three balls drawn.
Hence we are interested in P(X ≥ 17), which is computed as follows. Since X = i requires drawing ball i together with two of the i − 1 balls with smaller labels,

$$P(X = i) = \frac{\binom{i-1}{2}}{\binom{20}{3}} \quad \text{for } i = 3, 4, \ldots, 20$$

$$\Longrightarrow\ P(X \ge 17) = \frac{\binom{16}{2} + \binom{17}{2} + \binom{18}{2} + \binom{19}{2}}{\binom{20}{3}} = \frac{120 + 136 + 153 + 171}{1140} = \frac{580}{1140} = .50877,$$

which can also be obtained as $1 - P(X \le 16) = 1 - \binom{16}{3}\big/\binom{20}{3}$.

Example 4 (Coin Toss with Stopping Rule): A coin (with probability p of heads) is tossed until either a head is obtained or until n tosses are made. Let X be the number of tosses made. Find P(X = i) for i = 1, . . . , n.

Solution: $P(X = i) = (1-p)^{i-1}p$ for i = 1, . . . , n − 1 and $P(X = n) = (1-p)^{n-1}$. Check that the probabilities add to 1.

Example 5 (Coupon Collector Problem): There are N types of coupons. Each time a coupon is obtained it is, independently of previous selections, equally likely to be one of the N types. We are interested in the random variable T = the number of coupons that need to be collected to get a full set of N coupons. Rather than get P(T = n) immediately, we obtain P(T > n).

Let $A_j$ be the event that coupon j is not among the first n collected coupons. By the inclusion-exclusion formula we have

$$P(T > n) = P\left(\bigcup_{i=1}^{N} A_i\right) = \sum_{i=1}^{N} P(A_i) - \sum_{i_1 < i_2} P(A_{i_1}A_{i_2}) + \ldots + (-1)^{N+1}P(A_1 A_2 \cdots A_N)$$

with $P(A_1 A_2 \cdots A_N) = 0$, of course, and for $i_1 < i_2 < \ldots$ we have

$$P(A_i) = \left(\frac{N-1}{N}\right)^n,\quad P(A_{i_1}A_{i_2}) = \left(\frac{N-2}{N}\right)^n,\ \ldots,\ P(A_{i_1}A_{i_2}\cdots A_{i_k}) = \left(\frac{N-k}{N}\right)^n$$

and thus

$$P(T > n) = \binom{N}{1}\left(\frac{N-1}{N}\right)^n - \binom{N}{2}\left(\frac{N-2}{N}\right)^n + \ldots + (-1)^N\binom{N}{N-1}\left(\frac{1}{N}\right)^n = \sum_{i=1}^{N-1}(-1)^{i+1}\binom{N}{i}\left(\frac{N-i}{N}\right)^n.$$

Since $P(T > n-1) = P(T = n) + P(T > n)$, we get $P(T = n) = P(T > n-1) - P(T > n)$.

Distribution Functions

Example 6 (First Heads): Toss a fair coin until the first head lands up. Let X be the number of tosses required. Then

$$P(X \le k) = 1 - P(X \ge k+1) = 1 - P(k \text{ tails in first } k \text{ tosses}) = 1 - 0.5^k.$$
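The urn probability of Example 3 and the coupon-collector tail formula of Example 5 can be checked numerically. A minimal sketch in Python (not part of the original notes; math.comb supplies the binomial coefficients, and the function names are mine):

```python
from math import comb

# Example 3: P(X >= 17) for the maximum X of 3 balls drawn from 1..20
p_win = sum(comb(i - 1, 2) for i in range(17, 21)) / comb(20, 3)  # = 580/1140

def coupon_tail(N, n):
    """P(T > n) by inclusion-exclusion over A_i = {type i missing among n coupons}."""
    return sum((-1) ** (i + 1) * comb(N, i) * ((N - i) / N) ** n
               for i in range(1, N))

def coupon_pmf(N, n):
    """P(T = n) = P(T > n - 1) - P(T > n)."""
    return coupon_tail(N, n - 1) - coupon_tail(N, n)
```

For N = 2 coupon types, coupon_tail(2, n) reduces to 2(1/2)^n, the chance that the first n coupons are all of one type, and the pmf values p(n) for n ≥ N sum to 1, as the derivation above requires.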
Definition: The cumulative distribution function (cdf or CDF), or more simply the distribution function, F of the random variable X is defined for all real numbers b as

$$F(b) = P(X \le b) = P(\{e : X(e) \le b\}).$$

Example 7 (Using a CDF): Suppose the cdf of the random variable X is given by

$$F(x) = \begin{cases} 0 & x < 0 \\ x/2 & 0 \le x < 1 \\ 2/3 & 1 \le x < 2 \\ 11/12 & 2 \le x < 3 \\ 1 & 3 \le x \end{cases}$$

(Figure: graph of F(x), rising from 0 to 1 with jumps at x = 1, 2, 3.)

Compute P(X < 3), P(X = 1), P(X > .5) and P(2 < X ≤ 4):

$$P(X < 3) = \frac{11}{12},\quad P(X = 1) = \frac{2}{3} - \frac{1}{2} = \frac{1}{6},\quad P(X > .5) = 1 - \frac{.5}{2} = .75,\quad P(2 < X \le 4) = 1 - \frac{11}{12} = \frac{1}{12}.$$

L12 ends

Discrete Random Variables

Definition: A random variable X which can take on at most a countable number of values is called a discrete random variable. For such a discrete random variable we define the probability mass function (pmf) p(a) of X by

$$p(a) = P(X = a) = P(\{e : X(e) = a\}) \quad \text{for all } a \in \mathbb{R}.$$

p(a) is positive for at most a countable number of values of a. If X assumes only the values $x_1, x_2, x_3, \ldots$ then $p(x_i) \ge 0$ for i = 1, 2, 3, . . . and p(x) = 0 for all other values of x.

Graphical representation of p(x) (one die, sum of two dice).

Example 8 (Poisson): Suppose the discrete random variable X has pmf $p(i) = c\lambda^i/i!$ for i = 0, 1, 2, . . ., where λ is some positive value and $c = \exp(-\lambda)$ makes the probabilities add to one. Find P(X = 0) and P(X > 2). $P(X = 0) = c = \exp(-\lambda)$ and

$$P(X > 2) = 1 - P(X = 0) - P(X = 1) - P(X = 2) = 1 - \exp(-\lambda)(1 + \lambda + \lambda^2/2).$$

The cdf F of a discrete random variable X can be expressed as

$$F(a) = \sum_{x:\,x \le a} p(x).$$

The cdf of a discrete random variable X is a step function with a possible step at each of its possible values $x_1, x_2, \ldots$ and is flat in between.

Example 9 (Discrete CDF): p(1) = .25, p(2) = .5, p(3) = .125 and p(4) = .125; construct the cdf and graph it. Interpret the step sizes.

Expected Value or Mean of X

A very important concept in probability theory is that of the expected value or mean of a random variable X.
For a discrete RV it is defined as

$$E[X] = E(X) = \mu = \mu_X = \sum_{x:\,p(x)>0} x \cdot p(x) = \sum_x x \cdot p(x),$$

the probability-weighted average of all possible values of X.

If X takes on the two values 0 and 1 with probabilities p(0) = .5 and p(1) = .5, then E[X] = 0 · .5 + 1 · .5 = .5, which is halfway between 0 and 1. When p(1) = 2/3 and p(0) = 1/3, then E[X] = 2/3, twice as close to 1 as to 0. That's because the probability of 1 is twice that of 0. The weight 2/3 at 1 balances the weight 1/3 at 0 when the fulcrum of the balance is set at 2/3 = E[X]:

weight₁ · moment arm₁ = weight₂ · moment arm₂, or 2/3 · 1/3 = 1/3 · 2/3,

where the moment arm is measured as the distance of the weight from the fulcrum, here at 2/3 = E[X]: moment arm₁ = |2/3 − 1| = 1/3 and moment arm₂ = |2/3 − 0| = 2/3.

This is a general property of E[X], not just limited to RVs with two values. If a is the location of the fulcrum, then we get balance when

$$\sum_{x < a}(a - x)p(x) = \sum_{x > a}(x - a)p(x)$$

or

$$0 = -\sum_{x < a}(a - x)p(x) + \sum_{x > a}(x - a)p(x) = \sum_x (x - a)p(x)$$

or

$$0 = \sum_x x\,p(x) - a\sum_x p(x) = E[X] - a, \quad \text{i.e., } a = E[X].$$

The term expectation can again be linked to our long-run frequency motivation for probabilities. If we play the same game repeatedly, say a large number N of times, with payoffs being one of the amounts $x_1, x_2, x_3, \ldots$, then we would roughly see these amounts with approximate relative frequencies $p(x_1), p(x_2), p(x_3), \ldots$, i.e., with approximate frequencies $Np(x_1), Np(x_2), Np(x_3), \ldots$, thus realizing in N such games the total payoff

$$Np(x_1)\,x_1 + Np(x_2)\,x_2 + Np(x_3)\,x_3 + \ldots,$$

i.e., on a per-game basis

$$\frac{Np(x_1)\,x_1 + Np(x_2)\,x_2 + Np(x_3)\,x_3 + \ldots}{N} = p(x_1)x_1 + p(x_2)x_2 + p(x_3)x_3 + \ldots = E[X].$$

On average we expect to win E[X] (or lose |E[X]|, if E[X] < 0).
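The long-run frequency interpretation above can be illustrated by simulation. A sketch in Python, using the two-point example from the text (the function name and the seed are mine, not from the notes):

```python
import random

def empirical_mean(pmf, n_games, seed=1):
    """Average payoff over n_games independent plays of a game whose payoff
    has the given pmf, supplied as a dict {value: probability}."""
    rng = random.Random(seed)
    values = list(pmf)
    weights = [pmf[v] for v in values]
    draws = rng.choices(values, weights=weights, k=n_games)
    return sum(draws) / n_games

pmf = {0: 1 / 3, 1: 2 / 3}                  # the balance example from the text
mu = sum(x * p for x, p in pmf.items())     # E[X] = 2/3
```

For a large number of games the empirical average settles near mu, exactly the per-game payoff argument made above.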
Example 10 (Rolling a Fair Die): If X is the number showing face up on a fair die, we get

$$E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = \frac{7}{2}.$$

Indicator Variable: For any event E we can define the indicator RV $I = I_E(e) = 1$ if e ∈ E and $I_E(e) = 0$ if e ∉ E. Then

$$E[I] = P(E) \cdot 1 + P(E^c) \cdot 0 = P(E).$$

Example 11 (Quiz Show): You are asked two different types of questions, but the second one only when you answer the first correctly. When you answer a question of type i correctly you get a prize of $V_i$ dollars. In which order should you attempt to answer the question types, when you know your chances of answering questions of type i are $P_i$, i = 1, 2, respectively? Or does the order even matter? Assume that the events of answering the questions are independent.

If you choose to answer a question of type 1 first, your winnings are

0 with probability $1 - P_1$
$V_1$ with probability $P_1(1 - P_2)$
$V_1 + V_2$ with probability $P_1 P_2$

with expected winnings $E[W_1] = V_1 P_1(1 - P_2) + (V_1 + V_2)P_1 P_2$. When answering the type 2 question first, you get the same expression with indices exchanged, i.e., $E[W_2] = V_2 P_2(1 - P_1) + (V_1 + V_2)P_1 P_2$. Thus

$$E[W_1] > E[W_2] \iff V_1 P_1(1 - P_2) > V_2 P_2(1 - P_1) \iff \frac{V_1 P_1}{1 - P_1} > \frac{V_2 P_2}{1 - P_2},$$

i.e., the choice should be ordered by odds-weighted payoffs.

Example: $P_1 = .8$, $V_1 = 900$, $P_2 = .4$, $V_2 = 6000$. Then $900 \cdot .8/.2 = 3600 < 6000 \cdot .4/.6 = 4000$, and indeed $E[W_1] = 2640 < E[W_2] = 2688$.

L13 ends

Expectation of g(X): For an RV X and a function g : R → R, we can view Y = g(X) again as a random variable. Find its expectation. Two ways: find the $p_X(x)$-weighted average of all g(x) values, or, given the pmf $p_X(x)$ of X, find the pmf $p_Y(y)$ of Y = g(X) and then its expectation as the $p_Y(y)$-weighted average of all y values.

Example: Let X have values −2, 2, 4 with $p_X(-2) = .25$, $p_X(2) = .25$, $p_X(4) = .5$, respectively.
Then, with $y = x^2$:

x          −2     2     4
pX(x)     .25   .25    .5
x²          4     4    16
x²·pX(x)    1     1     8    → sum = 10

y = x²      4    16
pY(y)      .5    .5          (pY(4) = pX(−2) + pX(2) = .5)
y·pY(y)     2     8          → sum = 10

$$\sum_x x^2 p_X(x) = (-2)^2 \cdot .25 + 2^2 \cdot .25 + 4^2 \cdot .5 = 4 \cdot (.25 + .25) + 16 \cdot .5 = 10 = \sum_y y\,p_Y(y)$$

What we see in this special case holds in general for discrete RVs X and functions Y = g(X):

$$E[Y] = \sum_y y\,p_Y(y) = \sum_x g(x)\,p_X(x) = E[g(X)].$$

The formal proof idea is already contained in the above example, so we skip it; see the book for a notationally formal proof.

Example 12 (Business Planning): A seasonal product (say skis), when sold in a timely fashion, yields a net profit of b dollars for each unit sold, and a net loss of ℓ dollars for each unit that needs to be sold at season's end in a fire sale. Assume that the customer demand X (in units) is an RV with pmf $p_X(x) = p(x)$, and assume s units are stocked. When X > s, the excess orders cannot be filled. Then the realized profit Q(s) is an RV, namely

$$Q(s) = \begin{cases} bX - (s - X)\ell & \text{if } X \le s \\ sb & \text{if } X > s \end{cases}$$

with expected profit

$$E[Q(s)] = \sum_{i=0}^{s}(bi - (s - i)\ell)p(i) + sb\sum_{i=s+1}^{\infty}p(i) = (b + \ell)\sum_{i=0}^{s} i\,p(i) - s\ell\sum_{i=0}^{s}p(i) + sb\left[1 - \sum_{i=0}^{s}p(i)\right] = sb + (b + \ell)\sum_{i=0}^{s}(i - s)p(i).$$

Find the value s that maximizes this expected value. We examine what happens to E[Q(s)] as we increase s to s + 1:

$$E[Q(s+1)] = (s+1)b + (b+\ell)\sum_{i=0}^{s+1}(i - s - 1)p(i) = (s+1)b + (b+\ell)\sum_{i=0}^{s}(i - s - 1)p(i)$$

(the i = s + 1 term vanishes), so

$$E[Q(s+1)] - E[Q(s)] = b - (b+\ell)\sum_{i=0}^{s}p(i) > 0 \iff \sum_{i=0}^{s}p(i) < \frac{b}{b+\ell}.$$

Since $\sum_{i=0}^{s}p(i)$ increases with s and b/(b + ℓ) is constant, there is a largest s, say s*, for which this inequality holds, and thus the maximum expected profit is E[Q(s* + 1)], achieved when stocking s* + 1 items. We need to know p(i), i = 0, 1, . . ., e.g., from past experience.
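Example 12's stocking rule is easy to check numerically. A sketch in Python with a hypothetical uniform demand pmf (the helper names and the numbers b = 3, ℓ = 1 are mine, not from the notes):

```python
def expected_profit(s, b, loss, pmf):
    """E[Q(s)] = s*b + (b + loss) * sum_{i <= s} (i - s) * p(i)."""
    return s * b + (b + loss) * sum((i - s) * p for i, p in pmf.items() if i <= s)

def optimal_stock(b, loss, pmf, s_max=10**6):
    """Keep increasing s while F(s) = sum_{i<=s} p(i) < b/(b+loss); the first s
    where this fails is s* + 1, the stock level maximizing E[Q(s)]."""
    s, cum = 0, pmf.get(0, 0.0)
    while cum < b / (b + loss) and s < s_max:
        s += 1
        cum += pmf.get(s, 0.0)
    return s

demand = {i: 0.1 for i in range(10)}   # hypothetical: demand uniform on 0..9
```

With b = 3 and ℓ = 1 the threshold b/(b + ℓ) = .75, so s* = 6 and stocking 7 units is optimal; a brute-force search over expected_profit agrees.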
Examples of E[g(X)]:

1) Let g(x) = ax + b with constants a, b. Then

$$E[aX + b] = \sum_x (ax + b)p(x) = a\sum_x x\,p(x) + b\sum_x p(x) = aE[X] + b.$$

2) Let $g(x) = x^n$. Then

$$E[X^n] = \sum_x x^n p(x)$$

is called the nth moment of X, and E[X] is also known as the first moment.

The Variance of X

While E[X] is a measure of the center of a distribution given by a pmf p(x), we would also like some measure of the spread or variation of a distribution. While E[X] = 0 for X ≡ 0 with probability 1, for X = ±1 with probability 1/2 each, and for X = ±100 with probability 1/2 each, we would view the variabilities of these three situations quite differently. One plausible measure would be the expected absolute difference of X from its mean, i.e., E[|X − µ|], where µ = E[X]. For the above three situations we would get E[|X − µ|] = 0, 1, 100, respectively. While this was easy enough, it turns out that the absolute value function |X − µ| is not very conducive to manipulations.

L14 ends

We introduce a different measure that can be exploited much more conveniently, as we will see later on.

Definition: The variance of a random variable X with mean µ = E[X] is defined as

$$\mathrm{var}(X) = E[(X - \mu)^2].$$

An alternate formula, an example of the manipulative capability of the variance definition, is

$$\mathrm{var}(X) = E[X^2 - 2\mu X + \mu^2] = \sum_x (x^2 - 2\mu x + \mu^2)p(x) = \sum_x x^2 p(x) - 2\mu\sum_x x\,p(x) + \mu^2\sum_x p(x) = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2 = E[X^2] - (E[X])^2.$$

Example 13 (Variance of a Fair Die): If X denotes the face up on a randomly rolled fair die, then

$$E[X^2] = \frac{1}{6}(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \frac{91}{6} \quad \text{and} \quad \mathrm{var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{35}{12}.$$

Variance of aX + b: For constants a and b we have $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$, since

$$\mathrm{var}(aX + b) = E[\{aX + b - (a\mu + b)\}^2] = E[a^2(X - \mu)^2] = a^2 E[(X - \mu)^2] = a^2\,\mathrm{var}(X).$$

In analogy to the center-of-gravity interpretation of E[X], we can view var(X) as the moment of inertia of the pmf p(x), when viewing p(x) as weight in mechanics.
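The alternate variance formula and Example 13 can be verified directly. A small Python sketch (the function name is mine):

```python
def mean_var(pmf):
    """Return (E[X], var(X)) for a pmf {x: p(x)}, using var = E[X^2] - (E[X])^2."""
    mean = sum(x * p for x, p in pmf.items())
    second = sum(x * x * p for x, p in pmf.items())
    return mean, second - mean ** 2

die = {i: 1 / 6 for i in range(1, 7)}
mu, v = mean_var(die)   # 7/2 and 35/12
```

Transforming the die faces by x ↦ ax + b multiplies the variance by a², matching var(aX + b) = a² var(X) above.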
Squaring the deviation of X around µ in the definition of var(X) creates a distortion, changing any units of measurement to squared units. To bring matters back to the original units we take the square root of the variance, i.e., the standard deviation SD(X), as the appropriate measure of spread:

$$SD(X) = \sigma = \sigma_X = \sqrt{\mathrm{var}(X)}.$$

We now discuss several special discrete distributions.

Bernoulli and Binomial Random Variables

Aside from the constant random variable, which takes on only one value, the next level of simplicity is a random variable with only two values, most often 0 and 1 (the canonical choice).

Definition (Bernoulli Random Variable): A random variable X which can take on only the two values 0 and 1 is called a Bernoulli random variable. We indicate its distribution by X ∼ B(p). In liberal notational usage we also write P(X ≤ x) = P(B(p) ≤ x).

Such random variables are often employed when we focus on an event E in a particular random experiment. Let p = P(E). If E occurs we say the experiment results in a success and otherwise we call it a failure. The Bernoulli rv X is then defined as follows: X(e) = 1 if e ∈ E and X(e) = 0 if e ∉ E. Hence X counts the number of successes in one performance of the experiment. Often the following alternate notation is used: $I_E(e) = 1$ if e ∈ E and $I_E(e) = 0$ otherwise. $I_E$ is then also called the indicator function of E.

The probability mass function of X or $I_E$ is

$$p(0) = P(X = 0) = P(\{e : X(e) = 0\}) = P(E^c) = 1 - p$$
$$p(1) = P(X = 1) = P(\{e : X(e) = 1\}) = P(E) = p$$

where p is usually called the success probability. The mean and variance of X ∼ B(p) are

$$E[X] = (1 - p) \cdot 0 + p \cdot 1 = p \quad \text{and} \quad \mathrm{var}(X) = E[X^2] - (E[X])^2 = E[X] - p^2 = p - p^2 = p(1 - p),$$

where we exploited $X \equiv X^2$.

If we perform n independent repetitions of this basic experiment, i.e. n independent trials, then we can talk of another random variable Y, namely the number of successes in these n trials.
Y is called a binomial random variable and we indicate its distribution by Y ∼ Bin(n, p), again liberally writing P(Y ≤ y) = P(Bin(n, p) ≤ y). For parameters n and p, the probability mass function of Y is (as derived previously)

$$p(i) = P(Y = i) = \binom{n}{i}p^i(1 - p)^{n-i} \quad \text{for } i = 0, 1, 2, \ldots, n.$$

(With appropriate values for i, n and p you get p(i) via the command dbinom(i,n,p) in R, while pbinom(i,n,p) returns P(Y ≤ i). In EXCEL get these via =BINOMDIST(i,n,p,FALSE) and =BINOMDIST(i,n,p,TRUE), respectively.)

Example 14 (Coin Flips): Flip 5 fair coins and denote by X the number of heads in these 5 flips. Get the probability mass function of X.

Example 15 (Quality Assurance): A company produces parts. The probability that any given part will be defective is .01. The parts are shipped in batches of 10, and the promise is made that any batch with two or more defectives will be replaced by two new batches of 10 each. What proportion of the batches will need to be replaced?

Solution: with p = .01,

$$1 - P(X = 0) - P(X = 1) = 1 - (1 - p)^{10} - 10p(1 - p)^9 = .0043.$$

Hence about .4% of the batches will be affected.

Example 16 (Chuck-a-Luck): A player bets on a particular number i = 1, 2, 3, 4, 5, 6 of a fair die. The die is rolled 3 times, and if the chosen number appears k = 1, 2, 3 times the player wins k units, otherwise he loses 1 unit. If X denotes the payoff, what is the expected value E[X] of the game?

$$P(X = -1) = \binom{3}{0}\left(\frac{1}{6}\right)^0\left(\frac{5}{6}\right)^3 = \frac{125}{216},\quad P(X = 1) = \binom{3}{1}\left(\frac{1}{6}\right)^1\left(\frac{5}{6}\right)^2 = \frac{75}{216},$$
$$P(X = 2) = \binom{3}{2}\left(\frac{1}{6}\right)^2\left(\frac{5}{6}\right)^1 = \frac{15}{216},\quad P(X = 3) = \binom{3}{3}\left(\frac{1}{6}\right)^3\left(\frac{5}{6}\right)^0 = \frac{1}{216}$$

$$\Longrightarrow E[X] = -1 \cdot \frac{125}{216} + 1 \cdot \frac{75}{216} + 2 \cdot \frac{15}{216} + 3 \cdot \frac{1}{216} = \frac{-17}{216},$$

an expected loss of 0.0787 units per game in the long run.

Example 17 (Genetics): A particular trait (eye color or left-handedness, say) of a person is governed by a particular gene pair, which can either be {d, d}, {d, r} or {r, r}. The dominant
gene d dominates over the recessive r, i.e., the trait shows whenever there is a d in the gene pair. An offspring from two parents inherits randomly one gene from each gene pair of its parents. If both parents are hybrids ({d, r}), what is the chance that of 4 offspring at least 3 show the outward appearance of the dominant gene?

Solution: p = 3/4 is the probability that any given offspring will have gene pair {d, d} or {d, r}. Hence

$$P = \binom{4}{3}(3/4)^3(1/4) + \binom{4}{4}(3/4)^4 = \frac{189}{256} \approx .74.$$

(For Example 15, the R command is 1-pbinom(1,10,.01), and in EXCEL =1-BINOMDIST(1,10,.01,TRUE). You may also use the spreadsheet available within the free OpenOffice, http://www.openoffice.org/.)

L15 ends

Example 18 (Reliability): On an aircraft we want to compare the reliability (probability of functioning) of a 3-out-of-5 system with a 2-out-of-3 system. A k-out-of-n system functions whenever at least k of its n subsystems function properly; usually n is chosen odd and k = (n + 1)/2, so that the system works when a majority of the subsystems work. We assume that the probability of failure 1 − p is the same for all subsystems and that failures occur independently. A 3-out-of-5 system has higher reliability than a 2-out-of-3 system whenever

$$\binom{5}{3}p^3(1-p)^2 + \binom{5}{4}p^4(1-p) + \binom{5}{5}p^5 > \binom{3}{2}p^2(1-p) + \binom{3}{3}p^3 \iff 3p^2(1-p)^2(2p-1) > 0 \iff p > \frac{1}{2}.$$

(Figure: reliability of the 3-out-of-5 and the 2-out-of-3 system plotted against p; the curves cross at p = 1/2, with the 3-out-of-5 system better for p > 1/2, as a zoomed panel over p ∈ [.95, 1] also shows.)

Mean and Variance of X ∼ Bin(n, p): Using the simple identities

$$i\binom{n}{i} = n\binom{n-1}{i-1} \quad \text{and} \quad i(i-1)\binom{n}{i} = n(n-1)\binom{n-2}{i-2}$$

we get

$$E[X] = \sum_{i=0}^{n} i\binom{n}{i}p^i(1-p)^{n-i} = np\sum_{i=1}^{n}\binom{n-1}{i-1}p^{i-1}(1-p)^{n-1-(i-1)}$$

and, substituting i − 1 = j,

$$E[X] = np\sum_{j=0}^{n-1}\binom{n-1}{j}p^j(1-p)^{n-1-j} = np.$$

Note the connection to Bernoulli RVs $X_i$, indicating success or failure in the ith trial:

$$E[X] = E[X_1 + \ldots + X_n] = E[X_1] + \ldots + E[X_n] = np.$$

Expectation of a sum = sum of the individual (finite) expectations.
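The binomial pmf and Example 16's expected payoff can be reproduced numerically. A sketch in Python mirroring R's dbinom (not part of the original notes; the check values n = 10, p = .3 are mine):

```python
from math import comb

def dbinom(i, n, p):
    """P(Bin(n, p) = i), as in R's dbinom(i, n, p)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

# Example 16 (chuck-a-luck): payoff k for k matches in 3 rolls, -1 for none
payoff = {0: -1, 1: 1, 2: 2, 3: 3}
expected_payoff = sum(payoff[k] * dbinom(k, 3, 1 / 6) for k in range(4))  # -17/216
binom_mean = sum(i * dbinom(i, 10, 0.3) for i in range(11))               # np = 3
```

The probability-weighted sum over i also confirms E[X] = np for the binomial, in line with the derivation above.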
Similarly,

$$E[X(X-1)] = \sum_{i=0}^{n} i(i-1)\binom{n}{i}p^i(1-p)^{n-i} = n(n-1)p^2\sum_{i=2}^{n}\binom{n-2}{i-2}p^{i-2}(1-p)^{n-2-(i-2)}$$

and, substituting i − 2 = j,

$$E[X(X-1)] = n(n-1)p^2\sum_{j=0}^{n-2}\binom{n-2}{j}p^j(1-p)^{n-2-j} = n(n-1)p^2.$$

Since

$$E[X(X-1)] = E[X^2 - X] = \sum_x (x^2 - x)p(x) = \sum_x x^2 p(x) - \sum_x x\,p(x) = E[X^2] - E[X] = E[X^2] - np,$$

it follows that

$$E[X^2] = np + n(n-1)p^2 = np(1-p) + (np)^2 \quad \Longrightarrow \quad \mathrm{var}(X) = E[X^2] - (E[X])^2 = np(1-p).$$

Note again

$$\mathrm{var}(X) = \mathrm{var}(X_1 + \ldots + X_n) = \mathrm{var}(X_1) + \ldots + \mathrm{var}(X_n) = np(1-p).$$

Variance of a sum of independent RVs = sum of the (finite) variances of those RVs.

Qualitative Behavior of the Binomial Probability Mass Function: If X is a binomial random variable with parameters (n, p), then the probability mass function p(x) of X first increases monotonically and then decreases monotonically, reaching its largest value when x is the largest integer ≤ (n + 1)p.

Proof: Look at

$$\frac{p(x+1)}{p(x)} = \frac{p}{1-p}\cdot\frac{n-x}{x+1} > 1 \ \text{or} \ < 1 \iff (n+1)p > x+1 \ \text{or} \ < x+1.$$

Of course it is possible that p(x) is entirely monotone (when?). Illustrate with Pascal's triangle.

L16 ends

The Poisson Random Variable

Definition: A random variable X with possible values 0, 1, 2, . . . is called a Poisson random variable, indicated by X ∼ Pois(λ), if for some constant λ > 0 its pmf is given by

$$p(i) = P(X = i) = P(\mathrm{Pois}(\lambda) = i) = \frac{e^{-\lambda}\lambda^i}{i!} \quad \text{for } i = 0, 1, 2, \ldots$$

(Check summation to 1. In R get p(i) = P(X = i) via the command dpois(i,lambda), while P(X ≤ i) is obtained by ppois(i,lambda). In EXCEL you get the same by =POISSON(i,lambda,FALSE) and =POISSON(i,lambda,TRUE), respectively.)

Approximation to a binomial random variable for small p and large n: Let X be a binomial rv with parameters n and p. Let n get large and let p get small so that λ = np neither degenerates to 0 nor to ∞. Then

$$P(X = i) = \frac{n!}{i!(n-i)!}p^i(1-p)^{n-i} = \frac{n!}{i!(n-i)!}\left(\frac{\lambda}{n}\right)^i\left(1 - \frac{\lambda}{n}\right)^{n-i} = \frac{n(n-1)\cdots(n-i+1)}{n^i}\,\frac{\lambda^i}{i!}\,\frac{(1-\lambda/n)^n}{(1-\lambda/n)^i} \approx e^{-\lambda}\frac{\lambda^i}{i!}.$$
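The approximation just derived can be watched in action term by term. A sketch in Python (dpois mirrors R's function of the same name; the particular values n = 2000, p = .0015 are mine):

```python
from math import comb, exp, factorial

def dbinom(i, n, p):
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def dpois(i, lam):
    """P(Pois(lam) = i) = e^(-lam) * lam^i / i!, as in R's dpois(i, lam)."""
    return exp(-lam) * lam ** i / factorial(i)

# Large n, small p: Bin(n, p) is close to Pois(lam = n*p), term by term
n, p = 2000, 0.0015          # lam = 3
max_gap = max(abs(dbinom(i, n, p) - dpois(i, n * p)) for i in range(30))
```

max_gap shrinks as n grows with λ = np held fixed, consistent with the derivation above.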
Since np represents the expected or average number of successes of the n trials represented by the binomial random variable, it should not be surprising that the Poisson parameter λ should be interpreted as the average or expected count for such a Poisson random variable.

Actually, for the approximation to work it can be shown that small p alone is sufficient. In fact, if for i = 1, 2, 3, . . . , n the $X_i$ are independent Bernoulli random variables with respective success probabilities $p_i$, if $S = X_1 + \cdots + X_n$, and if Y is a Poisson random variable with parameter $\lambda = \sum_{i=1}^{n} p_i$, then

$$|P(S \le x) - P(Y \le x)| \le 3\left(\max(p_1, \ldots, p_n)\right)^{1/3} \quad \text{for all } x,$$

or one can show that

$$|P(S \le x) - P(Y \le x)| \le 2\sum_{i=1}^{n} p_i^2 \quad \text{for all } x.$$

See Poisson-Binomial Approximation on the class web page.

A Poisson random variable often serves as a good model for the count of rare events. Examples:

number of misprints on a page
number of telephone calls coming through an exchange
number of wrong numbers dialed
number of lightning strikes on commercial aircraft
number of bird ingestions into the engine of a jet
number of engine failures on a jet
number of customers coming into a post office on a given day
number of meteoroids striking an orbiting space station
number of discharged α-particles from some radioactive source

Example 19 (Typos): Let X be the number of typos on a single page of a given book. Assume that X is Poisson with parameter λ = .5, i.e., we expect about half an error per page, or about one error per every two pages. Find the probability of at least one error.

Solution: P(X ≥ 1) = 1 − P(X = 0) = 1 − exp(−.5) = .393.

Example 20 (Defectives): A machine produces 10% defective items, i.e., an item coming off the machine has a chance of .1 of being defective. What is the chance that among the next 10 items coming off the machine we find at most one defective item?

Solution: Let X be the number of defective items among the 10.
$$P(X \le 1) = P(X = 0) + P(X = 1) = \binom{10}{0}(.1)^0(.9)^{10} + \binom{10}{1}(.1)^1(.9)^9 = .7361,$$

whereas using a Poisson random variable Y with parameter λ = 10(.1) = 1 we get

$$P(Y \le 1) = P(Y = 0) + P(Y = 1) = e^{-1} + e^{-1} = .7358.$$

Mean and Variance of the Poisson Distribution: Based on the approximation of the Binomial(n, p) by a Poisson(λ = np) distribution when p is small, we would expect that E[Y] ≈ np = λ and var(Y) ≈ np(1 − p) ≈ λ. We now show that these relations are in fact exact:

$$E[Y] = \sum_{i=0}^{\infty} i\,\frac{e^{-\lambda}\lambda^i}{i!} = \lambda\sum_{i=1}^{\infty}\frac{e^{-\lambda}\lambda^{i-1}}{(i-1)!} = \lambda\sum_{j=0}^{\infty}\frac{e^{-\lambda}\lambda^j}{j!} = \lambda$$

$$E[Y^2] = \sum_{i=0}^{\infty} i^2\,\frac{e^{-\lambda}\lambda^i}{i!} = \lambda\sum_{i=1}^{\infty} i\,\frac{e^{-\lambda}\lambda^{i-1}}{(i-1)!} = \lambda\sum_{j=0}^{\infty}(j+1)\frac{e^{-\lambda}\lambda^j}{j!} = \lambda\sum_{j=0}^{\infty} j\,\frac{e^{-\lambda}\lambda^j}{j!} + \lambda\sum_{j=0}^{\infty}\frac{e^{-\lambda}\lambda^j}{j!} = \lambda^2 + \lambda$$

$$\Longrightarrow \mathrm{var}(Y) = E[Y^2] - (E[Y])^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$

Poisson Distribution for Events in Time (Another Justification): Sometimes we observe random incidents occurring in time, e.g. arrival of customers, meteoroids, lightning, etc. Quite often these random phenomena appear to satisfy the following basic assumptions for some positive constant λ:

1. The probability that exactly one incident occurs during an interval of length h is λh + o(h), where o(h) is a function of h which goes to 0 faster than h, i.e., o(h)/h → 0 as h → 0 (e.g. o(h) = h²). The concept/notation of o(h) was introduced by Edmund Landau.

2. The probability that two or more incidents occur in an interval of length h is the same for all such intervals and equal to o(h). No clustering of incidents!

3. For any integers n, j₁, . . ., jₙ and any set of nonoverlapping intervals, the events E₁, . . ., Eₙ, with $E_i$ denoting the occurrence of exactly $j_i$ incidents in the ith interval, are independent.

If N(t) denotes the number of incidents in a given interval of length t, then it can be shown that N(t) is a Poisson random variable with parameter λt, i.e., $P(N(t) = k) = e^{-\lambda t}(\lambda t)^k/k!$.

Proof: Take as time interval [0, t] and divide it into n equal parts.
Then P(N(t) = k) = P(k of the subintervals contain exactly one incident and n − k contain 0 incidents) + P(N(t) = k and at least one subinterval contains two or more incidents). The second probability can be bounded by

$$\sum_{i=1}^{n} P(i\text{th subinterval contains at least two incidents}) \le \sum_{i=1}^{n} o\!\left(\frac{t}{n}\right) = n\,o(t/n) \to 0.$$

The probability of 0 incidents in a particular interval of length t/n is 1 − [λ(t/n) + o(t/n)], so that the first probability above becomes (in cavalier fashion, not quite airtight; see Poisson-Binomial Approximation on the class web page for a clean argument)

$$\frac{n!}{k!(n-k)!}\left[\frac{\lambda t}{n} + o(t/n)\right]^k\left[1 - \frac{\lambda t}{n} - o(t/n)\right]^{n-k},$$

which converges to $\exp(-\lambda t)(\lambda t)^k/k!$.

Example 21 (Space Debris): It is estimated that the space station will be hit by space debris beyond a critical size and velocity on the average about once in 500 years. What is the chance that the station will survive the first 20 years without such a hit?

Solution: With T = 500 we have λT = 1, i.e., λ = 1/500. Now t = 20 and P(N(t) = 0) = exp(−λt) = exp(−20/500) = .9608.

L17 ends

Geometric, Negative Binomial and Hypergeometric Random Variables

Definition: In independent trials with success probability p, the number X of trials required to get the first success is called a geometric random variable. We write X ∼ Geo(p) to indicate its distribution. Its probability mass function is

$$p(n) = P(X = n) = P(\mathrm{Geo}(p) = n) = (1-p)^{n-1}p \quad \text{for } n = 1, 2, 3, \ldots$$

Check summation to 1. Some texts (and software, e.g., R and EXCEL, as a special negative binomial) treat $X_0 = X - 1$ = number of failures before the first success as the geometric RV. Then $P(X_0 = n) = P(X = n+1) = (1-p)^n p$ for n = 0, 1, 2, . . .

Example 22 (Urn Problem): An urn contains N white and M black balls. Balls are drawn with replacement until the first black ball is obtained. Find P(X = n) and P(X ≥ k), the latter in two ways. The probability of success is p = M/(M + N).
Directly,

$$P(X = n) = (1-p)^{n-1}p \quad \text{and} \quad P(X \ge k) = (1-p)^{k-1},$$

since X ≥ k means the first k − 1 draws were failures; or, by summation,

$$P(X \ge k) = \sum_{i=k}^{\infty}(1-p)^{i-1}p = p(1-p)^{k-1}\sum_{j=0}^{\infty}(1-p)^j = p(1-p)^{k-1}\,\frac{1}{1-(1-p)} = (1-p)^{k-1}.$$

Mean and Variance of X ∼ Geo(p):

$$E[X] - 1 = E[X-1] = \sum_{n=1}^{\infty}(n-1)(1-p)^{n-1}p = (1-p)\sum_{n=2}^{\infty}(n-1)(1-p)^{n-2}p = (1-p)\sum_{i=1}^{\infty} i(1-p)^{i-1}p = (1-p)E[X]$$

$$\Longrightarrow E[X](1-(1-p)) = 1 \quad \text{or} \quad E[X] = \frac{1}{p}.$$

This fits intuition: if p = 1/1000, then it takes on average 1/p = 1000 trials to see one success.

L18 ends

$$E[X^2] - 2E[X] + 1 = E[(X-1)^2] = \sum_{n=1}^{\infty}(n-1)^2(1-p)^{n-1}p = (1-p)\sum_{n=2}^{\infty}(n-1)^2(1-p)^{n-2}p = (1-p)\sum_{i=1}^{\infty} i^2(1-p)^{i-1}p = (1-p)E[X^2]$$

$$\Longrightarrow E[X^2](1-(1-p)) = \frac{2}{p} - 1 \quad \text{or} \quad E[X^2] = \frac{2}{p^2} - \frac{1}{p}$$

$$\Longrightarrow \mathrm{var}(X) = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1-p}{p^2}.$$

Definition: In independent trials with success probability p, the number X of trials required to accumulate the first r successes is called a negative binomial random variable. We write X ∼ NegBin(r, p) to indicate its distribution. Its probability mass function is

$$p(n) = P(X = n) = P(\mathrm{NegBin}(r, p) = n) = \binom{n-1}{r-1}(1-p)^{n-r}p^r \quad \text{for } n = r, r+1, r+2, \ldots$$

For r = 1 we get the geometric distribution as a special case.

Exploiting the equivalence of the two statements "it takes at least m trials to get r successes" and "in the first m − 1 trials we have at most r − 1 successes," we have

$$P(\mathrm{NegBin}(r, p) \ge m) = 1 - P(\mathrm{NegBin}(r, p) \le m-1) = P(\mathrm{Bin}(m-1, p) \le r-1). \qquad (1)$$

This facilitates the computation of negative binomial cumulative probabilities in terms of appropriate binomial cumulative probabilities.

We can view X as the sum of independent geometric random variables $Y_1, \ldots, Y_r$, each with success probability p. Here $Y_1$ denotes the number of trials to the first success, $Y_2$ the number of additional trials to the next success thereafter, and so on. Clearly, for $i_1, \ldots, i_r \in \{1, 2, 3, \ldots\}$ we have P(Y_1 = i_1, . . . , Y_r = i_r) = P(Y_1 = i_1) · . . .
· P(Y_r = i_r)   (2)

since the individual statements concern what specifically happens in the first i₁ + . . . + iᵣ trials, all of which are independent: we have i₁ − 1 failures, then a success, then i₂ − 1 failures, then a success, and so on. From (2) it follows that for $E_1, \ldots, E_r \subset \{1, 2, 3, \ldots\}$ we have

$$P(Y_1 \in E_1, \ldots, Y_r \in E_r) = \sum_{i_1 \in E_1}\cdots\sum_{i_r \in E_r} P(Y_1 = i_1, \ldots, Y_r = i_r) = \sum_{i_1 \in E_1}\cdots\sum_{i_r \in E_r} P(Y_1 = i_1)\cdots P(Y_r = i_r)$$

and, by the distributive law of arithmetic,

$$= \sum_{i_1 \in E_1} P(Y_1 = i_1)\cdots\sum_{i_r \in E_r} P(Y_r = i_r) = P(Y_1 \in E_1)\cdots P(Y_r \in E_r).$$

The same holds for any subset of the $Y_1, \ldots, Y_r$, since (2) also holds for any subset. For example, summing the left and right sides over all i₁ = 1, 2, 3, . . . yields

$$\sum_{i_1=1}^{\infty} P(Y_1 = i_1, \ldots, Y_r = i_r) = \sum_{i_1=1}^{\infty} P(Y_1 = i_1)\cdots P(Y_r = i_r)$$
$$P(Y_1 < \infty, Y_2 = i_2, \ldots, Y_r = i_r) = P(Y_1 < \infty)\,P(Y_2 = i_2)\cdots P(Y_r = i_r)$$
$$P(Y_2 = i_2, \ldots, Y_r = i_r) = P(Y_2 = i_2)\cdots P(Y_r = i_r)$$

and similarly by summing over any other and further indices. In particular we get

$$1 = P(Y_1 < \infty)\cdots P(Y_r < \infty) = P(Y_1 < \infty, \ldots, Y_r < \infty) \le P(Y_1 + \ldots + Y_r < \infty) = P(X < \infty).$$

This means that the negative binomial pmf sums to 1, i.e.,

$$1 = P(X < \infty) = \sum_{n=r}^{\infty} P(X = n) = \sum_{n=r}^{\infty}\binom{n-1}{r-1}(1-p)^{n-r}p^r.$$

Some texts (and software such as R and EXCEL) treat $X_0 = X - r$ = number of failures prior to the rth success as a negative binomial RV. Then $P(X_0 = n) = P(X = n+r)$ for n = 0, 1, 2, . . . ($P(X_0 = n)$ and $P(X_0 \le n)$ can be obtained in R by the commands dnbinom(n,r,p) and pnbinom(n,r,p), respectively, while in EXCEL use =NEGBINOMDIST(n,r,p) and =1-BINOMDIST(r-1,n+r,p,TRUE), based on (1). E.g., pnbinom(4,5,.2) and =1-BINOMDIST(4,9,0.2,TRUE) return 0.01958144.)

L19 ends

Example 23 (r Successes Before m Failures): If independent trials are performed with success probability p, what is the chance of getting r successes before m failures?
Solution: Let X be the number of trials required to get the first r successes. Then we need P(X ≤ m + r − 1) = P(X₀ ≤ m − 1), since r successes come before m failures exactly when the rth success arrives within the first m + r − 1 trials.

Mean and Variance of X ∼ NegBin(r, p): Using $n\binom{n-1}{r-1} = r\binom{n}{r}$ and V ∼ NegBin(r + 1, p):

$$E[X^k] = \sum_{n=r}^{\infty} n^k\binom{n-1}{r-1}p^r(1-p)^{n-r} = \frac{r}{p}\sum_{n=r}^{\infty} n^{k-1}\binom{n}{r}p^{r+1}(1-p)^{n-r}$$

$$= \frac{r}{p}\sum_{m=r+1}^{\infty}(m-1)^{k-1}\binom{m-1}{r+1-1}p^{r+1}(1-p)^{m-(r+1)} = \frac{r}{p}\,E[(V-1)^{k-1}]$$

(substituting m = n + 1). Hence

$$E[X] = \frac{r}{p} \quad \text{and} \quad E[X^2] = \frac{r}{p}E[V-1] = \frac{r}{p}\left(\frac{r+1}{p} - 1\right)$$

$$\Longrightarrow \mathrm{var}(X) = \frac{r}{p}\left(\frac{r+1}{p} - 1\right) - \left(\frac{r}{p}\right)^2 = \frac{r(1-p)}{p^2}.$$

If we write X again as X = Y₁ + . . . + Yᵣ with independent Yᵢ ∼ Geo(p), i = 1, . . . , r, we note again

$$E[X] = E[Y_1 + \ldots + Y_r] = E[Y_1] + \ldots + E[Y_r] = \frac{r}{p}$$
$$\mathrm{var}(X) = \mathrm{var}(Y_1 + \ldots + Y_r) = \mathrm{var}(Y_1) + \ldots + \mathrm{var}(Y_r) = r\,\frac{1-p}{p^2}.$$

Definition: If a sample of size n is chosen randomly and without replacement from an urn containing N balls, of which M = Np are white and N − M = N − Np are black, then the number X of white balls in the sample is called a hypergeometric random variable. To indicate its distribution we write X ∼ Hyper(n, M, N). Its possible values are x = 0, 1, . . . , n, with pmf

$$p(k) = P(X = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} \qquad (3)$$

which is positive only if 0 ≤ k ≤ M and 0 ≤ n − k ≤ N − M, i.e., if max(0, n − N + M) ≤ k ≤ min(n, M).

Expression (3) also applies when drawing the n balls one by one without replacement, since then

$$P(X = k) = \binom{n}{k}\,\frac{M(M-1)\cdots(M-k+1)\,(N-M)(N-M-1)\cdots(N-M-(n-k)+1)}{N(N-1)\cdots(N-n+1)} = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}.$$

(In R we can obtain P(X = k) and P(X ≤ k) by the commands dhyper(k,M,N-M,n) and phyper(k,M,N-M,n), respectively. EXCEL only gives P(X = k) directly, via =HYPGEOMDIST(k,n,M,N). For example, for M = 40, N = 100, n = 30 and k = 15, dhyper(15,40,60,30) and =HYPGEOMDIST(15,30,40,100) return P(X = 15) = .07284917, while phyper(15,40,60,30) returns P(X ≤ 15) = .9399093.)
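The hypergeometric pmf (3) and the R values quoted in the note above can be reproduced. A sketch in Python (the function name mirrors R's dhyper, with the parameterization used in these notes):

```python
from math import comb

def dhyper(k, M, N, n):
    """P(Hyper(n, M, N) = k): k white among n balls drawn without replacement
    from N balls of which M are white (R: dhyper(k, M, N - M, n))."""
    if k < max(0, n - (N - M)) or k > min(n, M):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

p15 = dhyper(15, 40, 100, 30)                           # ~ .07284917
cdf15 = sum(dhyper(k, 40, 100, 30) for k in range(16))  # ~ .9399093
```

Summing dhyper over the full support also confirms that (3) is a genuine pmf.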
Example 24 (Animal Counts): r animals are caught, tagged and released. After a reasonable time interval n animals are captured and the number X of tagged ones among them is counted. The total number N of animals is unknown. Then
\[
p_N(i) = P(X = i) = \frac{\binom{r}{i}\binom{N-r}{n-i}}{\binom{N}{n}}
\]
Find the N which maximizes this $p_N(i)$ for the observed value $X = i$. We have
\[
\frac{p_N(i)}{p_{N-1}(i)} = \frac{(N-r)(N-n)}{N(N-r-n+i)} \ge 1
\quad \text{if and only if} \quad N \le rn/i.
\]
Hence our maximum likelihood estimate is $\hat N$ = largest integer $\le rn/i$. Another way of motivating this estimate is to appeal to $r/N \approx i/n$.

Example 25 (Quality Control): Shipments of 1000 items each are inspected by selecting 10 without replacement. If the sample contains more than one defective item, then the whole shipment is rejected. What is the chance of rejecting a shipment if at most 5% of the shipment is bad? For exactly 5% bad items the probability of no rejection is
\[
P(X = 0) + P(X = 1)
= \frac{\binom{50}{0}\binom{950}{10}}{\binom{1000}{10}} + \frac{\binom{50}{1}\binom{950}{9}}{\binom{1000}{10}}
= .91469,
\]
hence the chance of rejecting a shipment is at most $.08531$.

Expectations and Variances of $X_1 + \ldots + X_n$: First we prove a basic alternate formula for the expectation of a single random variable X:
\[
E[X] = \sum_x x\, p_X(x) = \sum_s X(s)\, p(s)
\]
where the first expression involves the pmf $p_X(x)$ of X and sums over all possible values of X, and the second expression involves the probability $p(s) = P(\{s\})$ for all elements s in the sample space S. The equivalence is seen as follows. For any of the possible values x of X let $S_x = \{s \in S : X(s) = x\}$. For different values x the events/sets $S_x$ are disjoint and their union over all x is S. Thus
\[
\sum_x x\, p_X(x) = \sum_x x\, P(\{s : X(s) = x\})
= \sum_x x \sum_{s \in S_x} p(s)
= \sum_x \sum_{s \in S_x} x\, p(s)
= \sum_x \sum_{s \in S_x} X(s)\, p(s)
= \sum_s X(s)\, p(s)
\]
From this we get immediately
\[
E[X_1 + \ldots + X_n] = \sum_s (X_1(s) + \ldots + X_n(s))\, p(s)
= \sum_s [X_1(s) p(s) + \ldots + X_n(s) p(s)]
= \sum_s X_1(s) p(s) + \ldots + \sum_s X_n(s) p(s)
= E[X_1] + \ldots
\]
$+\, E[X_n]$, provided the individual expectations are finite.

Next we address a corresponding formula for the variance of a sum of independent discrete random variables $X_1, \ldots, X_n$, namely
\[
\mathrm{var}(X_1 + \ldots + X_n) = \mathrm{var}(X_1) + \ldots + \mathrm{var}(X_n),
\]
provided the individual variances are finite. First we need to define the concept of independence for a pair of random variables X and Y in concordance with the previously introduced independence of events. X and Y are independent whenever for all possible values x and y of X and Y we have
\[
P(X = x, Y = y) = P(\{s \in S : X(s) = x, Y(s) = y\})
= P(\{s \in S : X(s) = x\})\, P(\{s \in S : Y(s) = y\})
= P(X = x)\, P(Y = y)
\]
As a consequence we have for independent X and Y with finite expectations the following property:
\[
E[XY] = E[X]E[Y], \quad \text{i.e.,} \quad E[XY] - E[X]E[Y] = \mathrm{cov}(X, Y) = 0
\]
where $\mathrm{cov}(X, Y)$ is the covariance of X and Y, equivalently defined as
\[
\mathrm{cov}(X, Y) = E[(X - E[X])(Y - E[Y])]
= E[XY - X E[Y] - Y E[X] + E[X]E[Y]]
\]
\[
= E[XY] - E[X]E[Y] - E[X]E[Y] + E[X]E[Y] = E[XY] - E[X]E[Y]
\]
Proof of independence $\Longrightarrow \mathrm{cov}(X, Y) = 0$: Let $S_{xy} = \{s \in S : X(s) = x, Y(s) = y\}$. Then
\[
E[XY] = \sum_s X(s) Y(s)\, p(s)
= \sum_{x,y} \sum_{s \in S_{xy}} X(s) Y(s)\, p(s)
\qquad \text{(stepwise summation)}
\]
\[
= \sum_{x,y} \sum_{s \in S_{xy}} x y\, p(s)
= \sum_{x,y} x y \sum_{s \in S_{xy}} p(s)
\qquad \text{(distributive law)}
\]
\[
= \sum_{x,y} x y\, P(X = x, Y = y)
= \sum_{x,y} x y\, P(X = x) P(Y = y)
\qquad \text{(by independence)}
\]
\[
= \sum_x x\, P(X = x) \cdot \sum_y y\, P(Y = y) = E[X]E[Y]
\]
Next,
\[
E\big[(X_1 + \ldots + X_n)^2\big]
= E\Big[\sum_{i=1}^n X_i^2 + 2 \sum_{i<j}^n X_i X_j\Big]
= \sum_{i=1}^n E\big[X_i^2\big] + 2 \sum_{i<j}^n E[X_i X_j]
\]
\[
(E[X_1 + \ldots + X_n])^2 = (E[X_1] + \ldots + E[X_n])^2
= \sum_{i=1}^n (E[X_i])^2 + 2 \sum_{i<j}^n E[X_i]E[X_j]
\]
\[
\mathrm{var}(X_1 + \ldots + X_n)
= E\big[(X_1 + \ldots + X_n)^2\big] - (E[X_1 + \ldots + X_n])^2
= \sum_{i=1}^n \big( E[X_i^2] - (E[X_i])^2 \big)
+ 2 \sum_{i<j}^n \big( E[X_i X_j] - E[X_i]E[X_j] \big)
\]
\[
= \sum_{i=1}^n \mathrm{var}(X_i) + 2 \sum_{i<j}^n \mathrm{cov}(X_i, X_j)
= \sum_{i=1}^n \mathrm{var}(X_i)
\]
where the last $=$ holds under pairwise independence of $X_i$ and $X_j$ for $i < j$.

The above rules for mean and variance of $Y = X_1 + \ldots + X_n$ are now illustrated for two situations. Let $X_1, \ldots$
, $X_n$ be indicator RVs indicating success or failure in the $i$th of the n trials. In the first situation we assume these trials are independent and have success probability p each. Then, as observed previously, from the mean and variance results for Bernoulli RVs we get
\[
E[Y] = E(X_1 + \ldots + X_n) = \sum_{i=1}^n E[X_i] = np
\quad \text{and} \quad
\mathrm{var}(Y) = \mathrm{var}(X_1 + \ldots + X_n) = \sum_{i=1}^n \mathrm{var}(X_i) = np(1-p)
\]
In the second situation we view the trials in the hypergeometric context, where $X_i = 1$ when the $i$th ball drawn is white and $X_i = 0$ otherwise. We argued previously that $P(X_i = 1) = M/N = p$ = proportion of white balls in the population, so that
\[
E[Y] = E(X_1 + \ldots + X_n) = \sum_{i=1}^n E[X_i] = \frac{nM}{N}
\]
For $\mathrm{var}(Y)$ we need to involve the covariance terms in our formula for $\mathrm{var}(X_1 + \ldots + X_n)$. We find
\[
E[X_i X_j] = P(X_i = 1, X_j = 1)
= \frac{M(M-1)(N-2) \cdots (N-n+1)}{N(N-1)(N-2) \cdots (N-n+1)}
= \frac{M(M-1)}{N(N-1)}
\]
\[
\mathrm{cov}(X_i, X_j) = E[X_i X_j] - E[X_i]E[X_j]
= \frac{M(M-1)}{N(N-1)} - \frac{M}{N}\cdot\frac{M}{N}
= -\frac{M}{N}\,\frac{N-M}{N}\,\frac{1}{N-1}
= -\frac{p(1-p)}{N-1}
\]
\[
\mathrm{var}(Y) = \mathrm{var}(X_1 + \ldots + X_n)
= \sum_{i=1}^n \mathrm{var}(X_i) + 2 \sum_{i<j}^n \mathrm{cov}(X_i, X_j)
= np(1-p) - 2\binom{n}{2}\frac{p(1-p)}{N-1}
\]
\[
= np(1-p)\left(1 - \frac{n-1}{N-1}\right) = np(1-p)\,\frac{N-n}{N-1}
\]
The factor $1 - (n-1)/(N-1)$ is called the finite population correction factor. For fixed n it gets close to 1 when N is large, in which case it does not matter much whether we draw with or without replacement. One easily shows (exercise, or see Text p. 162) that
\[
P(\mathrm{Hyper}(n, M, N) = k) \longrightarrow P(\mathrm{Bin}(n, p) = k) = \binom{n}{k} p^k (1-p)^{n-k}
\quad \text{as } N \longrightarrow \infty, \text{ where } p = M/N.
\]
We will now pull forward material from Ch. 8, namely the inequalities of Markov and Chebychev⁶.

Markov's Inequality: Let X be a nonnegative discrete RV with finite expectation $E[X]$. Then for any $a > 0$ we have
\[
P(X \ge a) \le \frac{E[X]}{a}
\]
Proof:
\[
E[X] = \sum_{x \ge a} x\, p_X(x) + \sum_{x < a} x\, p_X(x)
\ge \sum_{x \ge a} x\, p_X(x)
\ge \sum_{x \ge a} a\, p_X(x) = a\, P(X \ge a)
\]
Markov's inequality is only meaningful for $a > E[X]$.
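Markov's inequality is easy to verify exactly for a small discrete RV. The following sketch (my own illustration, using exact fractions) checks the bound for X = roll of a fair die, for every $a > E[X] = 7/2$:

```python
from fractions import Fraction

# X = roll of a fair die; check P(X >= a) <= E[X]/a for a > E[X]
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * p for x, p in pmf.items())  # 7/2

checks = []
for a in range(4, 7):  # the integer values a with a > 7/2
    tail = sum(p for x, p in pmf.items() if x >= a)  # P(X >= a)
    checks.append(tail <= EX / a)
    print(a, tail, EX / a)
print(all(checks))  # True
```

For $a = 4$ the bound is $7/8$ while the true tail probability is $1/2$, illustrating how crude the inequality usually is.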
It limits the probability of values far beyond the mean or expectation of X, in concordance with our previous center of gravity interpretation of $E[X]$.

⁶ Scholz ← Lehmann ← Neyman ← Sierpinsky ← Voronoy ← Markov ← Chebychev

While this inequality is usually quite crude, it can be sharp, i.e., result in equality. Namely, let X take the two values 0 and a with probabilities $1-p$ and p. Then $p = P(X \ge a) = E[X]/a$.

Chebychev's Inequality: Let X be a discrete RV with finite variance $E[(X - \mu)^2] = \sigma^2$. Then for any $k > 0$ we have
\[
P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}
\]
Proof by Markov's inequality, using $Y = (X - \mu)^2$ as our nonnegative RV:
\[
P(|X - \mu| \ge k) = P((X - \mu)^2 \ge k^2)
\le \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}
\]
These inequalities hold also for RVs that are not discrete, but why wait that long for the following.

We will now combine the above results into a theorem that proves the long run frequency notion that we have alluded to repeatedly, in particular when introducing the axioms of probability. Let $\bar X = (X_1 + \ldots + X_n)/n$ be the average of n independent and identically distributed random variables (telegraphically expressed as iid RVs), each with the same mean $\mu = E[X_i]$ and variance $\sigma^2 = \mathrm{var}(X_i)$. Such random variables can be the result of repeatedly observing a random variable X in independent repetitions of the same random experiment, like repeatedly tossing a coin or rolling a die, and denoting the resulting RVs by $X_1, \ldots, X_n$. Using the rules of expectation and variance (under independence) we have
\[
E[\bar X] = \frac{1}{n} E[X_1 + \ldots + X_n] = \frac{1}{n}(\mu + \ldots + \mu) = \mu
\]
\[
\mathrm{var}(\bar X) = \frac{1}{n^2}\mathrm{var}(X_1 + \ldots + X_n) = \frac{1}{n^2}(\sigma^2 + \ldots + \sigma^2) = \frac{\sigma^2}{n}
\]
and by Chebychev's inequality applied to $\bar X$ we get for any $\epsilon > 0$
\[
P(|\bar X - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0
\quad \text{as } n \longrightarrow \infty,
\]
i.e., the probability that $\bar X$ will differ from $\mu$ by at least $\epsilon > 0$ becomes vanishingly small. We say $\bar X$ converges to $\mu$ in probability and write $\bar X \stackrel{P}{\longrightarrow} \mu$ as $n \longrightarrow \infty$. This result is called the weak law of large numbers (WLLN, or LLN without emphasis on weak).
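The shrinking Chebychev bound $\sigma^2/(n\epsilon^2)$ can be watched at work in a small simulation; this is only an illustrative sketch, and the choices of $\epsilon$, the sample sizes, and the number of repetitions are arbitrary:

```python
import random

# Averages of n Bernoulli(p) variables: mu = p, sigma^2 = p(1-p).
# Estimate P(|Xbar - p| >= eps) by simulation and compare to sigma^2/(n*eps^2).
random.seed(1)
p, eps, reps = 0.5, 0.05, 500

results = []
for n in (100, 1000, 5000):
    hits = 0
    for _ in range(reps):
        xbar = sum(random.random() < p for _ in range(n)) / n
        hits += abs(xbar - p) >= eps
    bound = p * (1 - p) / (n * eps ** 2)  # Chebychev bound
    results.append((n, hits / reps, bound))
    print(n, hits / reps, round(bound, 4))
```

In every row the simulated frequency stays below the bound, and both tend to 0 as n grows, as the WLLN asserts.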
When our random experiment consists of observing whether a certain event E occurs or not, we observe an indicator variable $X = I_E$ with values 1 and 0. If we repeatedly do this experiment (independently), we observe $X_1, \ldots, X_n$, each with mean $\mu = p = P(E)$ and variance $\sigma^2 = p(1-p)$. In that case $\bar X$ is the proportion of 1's among the $X_1, \ldots, X_n$, i.e., the proportion of times we observe the event E. The above law of large numbers gives us $\bar X \stackrel{P}{\longrightarrow} \mu = p = P(E)$ as $n \longrightarrow \infty$, i.e., in the long run the observed proportion or relative frequency of observing the event E converges to $P(E)$.

Properties of Distribution Functions F:

1. F is nondecreasing, i.e., $F(a) \le F(b)$ for a, b with $a \le b$.
2. $\lim_{b \to \infty} F(b) = 1$
3. $\lim_{b \to -\infty} F(b) = 0$
4. F is right continuous, i.e., if $b_n \downarrow b$ then $F(b_n) \downarrow F(b)$, or $\lim_{n \to \infty} F(b_n) = F(b)$.

Proof: 1. For $a \le b$ we have $\{e : X(e) \le a\} \subset \{e : X(e) \le b\}$. 2., 3. and 4. follow from $P(\lim_{n \to \infty} E_n) = \lim_{n \to \infty} P(E_n)$ for properly chosen monotone sequences $E_n$. E.g., if $b_n \downarrow b$ then $E_n = \{e : X(e) \le b_n\} \downarrow E = \{e : X(e) \le b\} = \bigcap_{n=1}^{\infty} E_n$.

All probability questions about X can be answered in terms of the cdf F of X. For example,

• $P(a < X \le b) = F(b) - F(a)$ for all $a \le b$
• $P(X < b) = \lim_{n \to \infty} F(b - \frac{1}{n}) =: F(b-)$
• $F(b) = P(X \le b) = P(X < b) + P(X = b) = F(b-) + (F(b) - F(b-))$
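These cdf identities can be verified exactly for a concrete discrete RV. The sketch below (my own illustration, using exact fractions) takes X = sum of two fair dice and checks $P(a < X \le b) = F(b) - F(a)$ and $P(X = b) = F(b) - F(b-)$:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice over the 36 equally likely outcomes
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

def F(b):        # F(b) = P(X <= b)
    return sum(p for x, p in pmf.items() if x <= b)

def F_minus(b):  # F(b-) = P(X < b)
    return sum(p for x, p in pmf.items() if x < b)

print(F(9) - F(6) == Fraction(15, 36))       # P(6 < X <= 9), i.e. X in {7, 8, 9}
print(F(7) - F_minus(7) == Fraction(6, 36))  # P(X = 7)
```

For a discrete RV the jump $F(b) - F(b-)$ is exactly the pmf value at b, which is why F here is a step function.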