Binomial and normal distributions
Business Statistics 41000, Fall 2015

Topics
1. Sums of random variables
2. Binomial distribution
3. Normal distribution
4. Vignettes

Topic: sums of random variables

Sums of random variables are important for two reasons:
1. We often care about aggregates and totals (sales, revenue, employees, etc.).
2. Averages are basically sums, and probabilities are basically averages (of dummy variables), so when we go to estimate probabilities we will end up using sums of random variables a lot.

The second point is the topic of the next lecture. For now, we focus on the direct case.

A sum of two random variables

Suppose X is a random variable denoting the profit from one wager and Y is a random variable denoting the profit from another wager. If we want to consider our total profit, we may consider the random variable that is the sum of the two wagers, S = X + Y. To determine the distribution of S, we must first know the joint distribution of (X, Y).

Suppose that (X, Y) has the following joint distribution:

             X = -$200   X = $100   X = $200
  Y = $0         0          1/9        3/9
  Y = $100      1/9         2/9        2/9

So S can take the values {-200, -100, 100, 200, 300}. Notice that there are two ways that S can be $200.

We can directly determine the distribution of S:

  Event                         s      P(S = s)
  -$200 + $0                   -200       0
  -$200 + $100                 -100      1/9
  $100 + $0                     100      1/9
  $100 + $100 or $200 + $0      200    2/9 + 3/9 = 5/9
  $200 + $100                   300      2/9

When determining the distribution of a sum of random variables, we lose information about the individual values and aggregate the probability of all events giving the same sum.

Topic: binomial distribution

A binomial random variable can be constructed as the sum of independent Bernoulli random variables. Familiarity with the binomial distribution eases many practical probability calculations. See OpenIntro sections 3.4 and 3.6.4.

Sums of Bernoulli RVs

When rolling two dice, what is the probability of rolling two ones?
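The aggregation step in the wager example is easy to mechanize. Here is a short sketch (not from the original slides) that tabulates the distribution of S = X + Y from the joint table, using exact fractions:

```python
from fractions import Fraction
from collections import defaultdict

# Joint distribution of (X, Y) from the two-wager example:
# X takes values in {-200, 100, 200}, Y in {0, 100}.
joint = {
    (-200, 0):   Fraction(0, 9), (100, 0):   Fraction(1, 9), (200, 0):   Fraction(3, 9),
    (-200, 100): Fraction(1, 9), (100, 100): Fraction(2, 9), (200, 100): Fraction(2, 9),
}

# Aggregate probability over all (x, y) pairs giving the same sum s = x + y.
dist_S = defaultdict(Fraction)
for (x, y), p in joint.items():
    dist_S[x + y] += p

for s in sorted(dist_S):
    print(s, dist_S[s])   # e.g. P(S = 200) = 5/9, collecting two events
```

Note how the two events {X = 100, Y = 100} and {X = 200, Y = 0} are merged into the single value S = 200, exactly as in the table above.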
By independence we can calculate this probability as P(1, 1) = (1/6)(1/6) = 1/36.

Now with three dice, what is the probability of rolling exactly two 1's?

Sums of Bernoulli RVs (cont'd)

The event A = "rolling a one" can be described by a Bernoulli random variable with p = 1/6. We can denote the three independent rolls by writing Xi ~ Bernoulli(p), iid, for i = 1, 2, 3. The notation "iid" is shorthand for "independent and identically distributed". The probability of rolling exactly two 1's is then P(Y = 2) for the random variable Y = X1 + X2 + X3.

Consider the distribution of Y = X1 + X2 + X3. Determining the probability of a certain number of successes requires knowing 1) the probability of each individual success and 2) the number of ways that many successes can arise:

  Event                  y    P(Y = y)
  000                    0    (1 - p)^3
  001 or 010 or 100      1    p(1-p)(1-p) + (1-p)p(1-p) + (1-p)(1-p)p = 3(1 - p)^2 p
  011 or 101 or 110      2    (1-p)p^2 + p(1-p)p + p^2(1-p) = 3(1 - p) p^2
  111                    3    p^3

Remember that for this example p = 1/6. We find that

  P(Y = 2) = 3p^2 (1 - p) = 3(1/36)(5/6) = 5/72.

Sums of Bernoulli random variables (cont'd)

What if we had four rolls, and the probability of success was 1/3? The 16 possible outcomes are

  0000 1000 0100 1100 0010 1010 0110 1110
  0001 1001 0101 1101 0011 1011 0111 1111

Summing up the probabilities for each of the values of Y, we find:

  y    P(Y = y)
  0    (1 - p)^4
  1    4(1 - p)^3 p
  2    6(1 - p)^2 p^2
  3    4(1 - p) p^3
  4    p^4

Substituting p = 1/3 we can now find P(Y = y) for any y = 0, 1, 2, 3, 4.

Definition: N choose y

The number of ways we can arrange y successes among N trials can be calculated efficiently by a computer. We denote this number with a special expression,

  C(N, y) = N! / ((N - y)! y!),

read "N choose y", which designates the number of ways that y items can be assigned to N possible positions. This notation can be used to summarize the counting entries in the previous tables for various values of N and y.

Definition: Binomial distribution

A random variable Y has a binomial distribution with parameters N and p if its probability distribution function is of the form

  p(y) = C(N, y) p^y (1 - p)^(N - y)

for integer values of y between 0 and N.

Example: drunk batter

What is the probability that our alcoholic major-leaguer gets more than 2 hits in a game in which he has 5 at-bats? Let X = "number of hits". We model X as a binomial random variable with parameters N = 5 and p = 0.316.

  x    P(X = x)
  0    (1 - p)^5
  1    5(1 - p)^4 p
  2    10(1 - p)^3 p^2
  3    10(1 - p)^2 p^3
  4    5(1 - p) p^4
  5    p^5

Substituting p = 0.316 we calculate P(X > 2) = 0.185.

Example: winning a best-of-seven play-off

Assume that the Chicago Bulls have probability 0.4 of beating the Miami Heat in any given game and that the outcomes of individual games are independent. What is the probability that the Bulls win a seven-game series against the Heat?

Consider the number of games won by the Bulls over a full seven games against the Heat. We model this as a binomial random variable Y with parameters N = 7 and p = 0.4, which we express with the notation Y ~ Bin(7, 0.4). The symbol "~" is read "distributed as". "Bin" is short for "binomial". The numbers which follow are the values of the two binomial parameters, the number of independent Bernoulli trials (N) and the probability of success at each trial (p).
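These binomial calculations can be checked directly; a sketch (not part of the slides) using `math.comb` for the choose function:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Bin(n, p): C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Drunk batter: P(X > 2) for X ~ Bin(5, 0.316)
p_hits = sum(binom_pmf(k, 5, 0.316) for k in (3, 4, 5))
print(round(p_hits, 3))    # 0.185

# Bulls series: P(Y >= 4) for Y ~ Bin(7, 0.4)
p_series = sum(binom_pmf(k, 7, 0.4) for k in range(4, 8))
print(round(p_series, 2))  # 0.29
```

Both results match the values quoted in the slides.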
Example: winning a best-of-seven play-off (cont'd)

Although we never see all seven games played (because the series stops as soon as one team wins four games), we note that in this expanded event space:
- any event with at least four Bulls wins corresponds to an observable Bulls series win;
- any event corresponding to an observed Bulls series win has at least four total Bulls wins.

For example, the observable sequence 011011 (where a 1 stands for a Bulls win) has two possible completions, 0110110 or 0110111. Any hypothetical games played beyond the series-ending fourth win can only increase the total number of wins tallied by Y. Conversely, the sequence 1010111 is an event corresponding to Y = 5, and we can associate it with the observable subsequence 101011, a Bulls series win in six games.

Therefore, the events corresponding to "Bulls win the series" are precisely those corresponding to Y >= 4. We may conclude that the probability of a series win for the Bulls is

  P(Y >= 4) = P(Y = 4) + P(Y = 5) + P(Y = 6) + P(Y = 7) = 0.29.

We can arrive at this answer without reference to the binomial random variable Y if we are willing to do our own counting:

  P(Bulls series win) = p^4 + C(4,3) p^4 (1 - p) + C(5,3) p^4 (1 - p)^2 + C(6,3) p^4 (1 - p)^3 = 0.29.

This calculation explicitly accounts for the fact that Bulls series wins necessarily conclude with a Bulls game win.

Example: double lottery winners

In 1971, Jane Adams won the lottery twice in one year! If you read of a double winner in your daily newspaper, how surprised should you be? To answer this question we need to make some assumptions. Consider 40 state lotteries. Assume that each one has a 1 in 18 million chance of winning.
Assume that each lottery has 1 million people who play it daily (say, 250 times a year), and that each player buys 5 tickets. Given these conditions, what is the probability that in one calendar year there is at least one double winner?

Example: double lottery winners (cont'd)

Let Xi be the random variable denoting how many winning tickets person i has:

  Xi ~ Binomial(5(250), p = (1/18) x 10^-6).

Now let Yi be the dummy variable for the event Xi > 1, which is the event that person i is a double (or more) winner: Yi ~ Bernoulli(q). We can compute

  q = 1 - P(Xi = 0) - P(Xi = 1) = 2.4 x 10^-9.

To account for the 1 million people playing the lottery in each of 40 states, we consider Z = Y1 + ... + YN, which is another binomial random variable: Z ~ Binomial(N = 4 x 10^7, q). Finally, the probability that Z > 0 can be found as

  1 - P(Z = 0) = 1 - (1 - q)^N = 1/11.

Not so rare!

Example: rural vs. urban hospitals

About as many boys as girls are born in hospitals. In small Country Hospital only a few babies are born every week; at City General, in the urban center, many babies are born every week. Say that a normal week is one where between 45% and 55% of the babies are female, and an unusual week is one where more than 55% are girls or more than 55% are boys. Which of the following is true?

- Unusual weeks occur equally often at Country Hospital and at City General.
- Unusual weeks are more common at Country Hospital than at City General.
- Unusual weeks are less common at Country Hospital than at City General.

Example: rural vs. urban hospital (cont'd)

We can model the births in the two hospitals as two independent random variables. Let X = "number of baby girls born at Country Hospital" and Y = "number of baby girls born at City General":

  X ~ Binomial(N1, p),   Y ~ Binomial(N2, p).

Assume that p = 0.5. The key difference is that N1 is much smaller than N2. To illustrate, assume that N1 = 20 and N2 = 500.

During a usual week at the rural hospital, between 0.45 N1 = 0.45(20) = 9 and 0.55 N1 = 0.55(20) = 11 baby girls are born. The probability of a usual week is P(9 <= X <= 11) ~ 0.50, so the probability of an unusual week is

  1 - P(9 <= X <= 11) = P(X < 9) + P(X > 11) ~ 0.5.

Note: satisfying the condition X < 9 is the same as not satisfying the condition X >= 9; strict versus non-strict inequalities make a difference.

[Figure: probability distribution of weekly girl births at Country Hospital, Bin(20, 0.5).]

In a usual week at the city hospital, between 0.45 N2 = 0.45(500) = 225 and 0.55 N2 = 0.55(500) = 275 baby girls are born. The probability of a usual week is P(225 <= Y <= 275) = 0.978, so the probability of an unusual week is

  1 - P(225 <= Y <= 275) = P(Y < 225) + P(Y > 275) = 0.022.

[Figure: probability distribution of weekly girl births at City General, Bin(500, 0.5).]

Variance of a sum of independent random variables

A useful fact: a weighted sum/difference of independent random variables, Y = a1 X1 + ... + am Xm, has variance

  V(Y) = sum over i of ai^2 V(Xi).

How can this be used to derive the expression for the variance of a binomial random variable?

Variance of a binomial random variable

A binomial random variable X with parameters N and p has variance V(X) = Np(1 - p).

Variance of a proportion

By dividing through by the total number of babies born each week we can consider the proportion of girl babies. Define the random variables P1 = X/N1 and P2 = Y/N2. Then it follows that

  V(P1) = V(X)/N1^2 = N1 p(1 - p)/N1^2 = p(1 - p)/N1

and

  V(P2) = V(Y)/N2^2 = N2 p(1 - p)/N2^2 = p(1 - p)/N2.

Law of Large Numbers

An arithmetic average of random variables is itself a random variable. As more and more individual random variables are averaged up, the variance decreases but the mean stays the same. As a result, the distribution of the averaged random variable becomes more and more concentrated around its expected value.

[Figures: distribution of the sample proportion with p = 0.7 for N = 10, 20, 50, 150, and 300, showing increasing concentration around 0.7.]

Example: Schlitz Super Bowl taste test

Bell curve approximation to binomial

The binomial distributions can be approximated by a smooth density function for large N.

[Figures: normal approximation to the binomial distribution for N = 20, p = 0.5; N = 60, p = 0.1; and N = 500, p = 0.8.]

What are some reasons that very small p or small N lead to bad approximations?

Central limit theorem

The normal distribution can be "justified" via its relationship to the binomial distribution.
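The Law of Large Numbers pictures can be summarized numerically: the mean of the sample proportion stays at p, while its standard deviation sqrt(p(1 - p)/N) shrinks as N grows. A sketch (not from the slides), using the same N values as the figures:

```python
from math import sqrt

p = 0.7
sds = []
for N in [10, 20, 50, 150, 300]:
    # Standard deviation of the sample proportion: sqrt(p(1 - p)/N)
    sd = sqrt(p * (1 - p) / N)
    sds.append(sd)
    print(N, round(sd, 4))   # the mean stays at 0.7; the sd shrinks with N
```

Each increase in N tightens the distribution around 0.7, which is exactly the concentration seen in the plots.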
Roughly: if a random outcome is the combined result of many individual random events, its distribution will follow a normal curve. The quincunx, or Galton box, is a device which physically simulates such a scenario using ball bearings and pins stuck in a board. [Video: Galton box demonstration.]

The CLT can be stated more precisely, but the practical impact is just this: random variables which arise as sums of many other random variables (not necessarily normally distributed) tend to be normally distributed.

Normal distributions

The normal family of densities has two parameters, typically denoted mu and sigma^2, which govern the location and scale, respectively.

[Figure: Gaussian densities for various location parameters.]

I will use the terms normal distribution, normal density, and normal random variable more or less interchangeably.

[Figure: mean-zero Gaussian densities with differing scale parameters.]

The normal distribution is also called the Gaussian distribution or the bell curve.

Normal means and variances

A normal random variable X with parameters mu and sigma^2 is denoted X ~ N(mu, sigma^2). The mean and variance of X are E(X) = mu and V(X) = sigma^2. The density function is symmetric and unimodal, so the median and mode of X are also given by the location parameter mu. The standard deviation of X is given by sigma.

Normal approximation to binomial

The binomial distributions can be approximated by a normal distribution: a Bin(N, p) distribution can be approximated by a N(Np, Np(1 - p)) distribution for N "large enough". Notice that this just "matches" the mean and variance of the two distributions.

Linear transformation of normal RVs

We can add a fixed number to a normal random variable and/or multiply it by a fixed number and get a new normal random variable. This sort of operation is called a linear transformation.
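The mean-and-variance matching behind the normal approximation can be checked at a point. A sketch (not from the slides) comparing the Bin(20, 0.5) probability mass at k = 10 with the matched N(10, 5) density there:

```python
from math import comb, exp, pi, sqrt

N, p = 20, 0.5
mu, var = N * p, N * p * (1 - p)   # matched mean and variance: 10 and 5

k = 10
pmf = comb(N, k) * p**k * (1 - p)**(N - k)                 # exact binomial mass
pdf = exp(-(k - mu)**2 / (2 * var)) / sqrt(2 * pi * var)   # normal density

print(round(pmf, 4), round(pdf, 4))   # close, though not identical
```

The two values agree to about two decimal places here; for very small p or small N the agreement degrades, as the slides' question suggests.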
Linear transformation of normal random variables: if X ~ N(mu, sigma^2) and Y = a + bX for fixed numbers a and b, then Y ~ N(a + b mu, b^2 sigma^2). For example, if X ~ N(1, 2) and Y = 3 - 5X, then Y ~ N(-2, 50).

Standard normal RV

A standard normal random variable is one with mean 0 and variance 1. It is often denoted by the letter Z: Z ~ N(0, 1). We can write any normal random variable as a linear transformation of a standard normal RV: for X ~ N(mu, sigma^2), we can write X = mu + sigma Z.

The "empirical rule"

It is convenient to characterize where the "bulk" of the probability mass of a normal distribution resides by providing an interval, in terms of standard deviations, about the mean.

[Figures: N(mu, sigma) densities showing 68% of the mass within one standard deviation of the mean, 95% within two, and 99.7% within three.]

The widespread applicability of the normal distribution has led this to be dubbed the empirical rule. It is, for obvious reasons, sometimes called the 68-95-99.7 rule.

To revisit some earlier examples:
- 68% of Chicago daily highs in the winter season are between 19 and 48 degrees.
- 95% of NBA players are between 6 ft and 7 ft 2 in.
- In 99.7% of weeks, the proportion of baby girls born at City General is between 0.4985 and 0.5015.

Sums of normal random variables

Weighted sums of normal random variables are also normally distributed. For example, if X1 ~ N(5, 20) and X2 ~ N(1, 0.5), then Y = 0.1 X1 + 0.9 X2 satisfies Y ~ N(m, v), where m = 0.1(5) + 0.9(1) = 1.4 and v = 0.1^2 (20) + 0.9^2 (0.5) = 0.605.

Linear combinations of normal RVs

Linear combinations of independent normal random variables: for i = 1, ..., n, let Xi ~ N(mu_i, sigma_i^2) independently, and define Y = a1 X1 + ... + an Xn for weights a1, a2, ..., an. Then Y ~ N(m, v) where

  m = a1 mu_1 + ... + an mu_n   and   v = a1^2 sigma_1^2 + ... + an^2 sigma_n^2.

Example: two-stock portfolio

Consider two stocks, A and B, with annual returns (in percent of investment) distributed according to normal distributions XA ~ N(5, 20) and XB ~ N(1, 0.5). What fraction of our investment should we put into stock A, with the remainder put in stock B?

For a given fraction alpha, the total return on our portfolio is Y = alpha XA + (1 - alpha) XB, with distribution Y ~ N(m, v) where

  m = 5 alpha + (1 - alpha)   and   v = 20 alpha^2 + 0.5 (1 - alpha)^2.

Suppose we want to find alpha so that P(Y <= 0) is as small as possible.

[Figure: densities of the portfolio return for varying values of alpha, alongside the stock A and stock B densities.]

We can plot the probability of a loss as a function of alpha.

[Figure: probability of a loss as a function of alpha.]

We see that this probability is minimized when alpha is approximately 11%. This is the LLN at work!

Variance of a sum of correlated random variables

For correlated (dependent) random variables, we have a modified formula: a weighted sum/difference of random variables Y = a1 X1 + a2 X2 has variance

  V(Y) = a1^2 V(X1) + a2^2 V(X2) + 2 a1 a2 Cov(X1, X2).

There is a homework problem that asks you to find the variance of portfolios of stocks, as in the example above, for stocks which are related to one another (in a common industry, for example).

Vignettes
1. Differential dispersion
2. Average number of sex partners
3. Mean reversion

Vignette: a difference in dispersion

In this vignette we observe how selection (in the sense of evolution, or hiring, or admissions) can turn higher variability into over-representation. The analysis uses the ideas of random variables, distribution functions, and conditional probability. For more background, read the article "Sex Ed" from the February 2005 issue of The New Republic (available at the course home page).

A difference in dispersion

Consider two groups of college graduates with "employee fitness scores" following the distributions below:

  Score:     -5     -4     -3     -2     -1      0      1      2      3      4      5
  Group A:  0.043  0.051  0.064  0.085  0.128  0.256  0.128  0.085  0.064  0.051  0.043
  Group B:  0.003  0.008  0.023  0.063  0.171  0.464  0.171  0.063  0.023  0.008  0.003

These distributions have the same mean, the same median, and the same mode. But they differ in their dispersion, or variability.

Let X denote the random variable recording the scores and let A and B denote membership in the respective groups. Then V(X | A) = 5.87 and V(X | B) = 1.666; the corresponding standard deviations are sigma(X | A) = 2.42 and sigma(X | B) = 1.29.

But now consider only elite jobs, for which it is necessary that fitness score X >= 4.
[Figure: the Group A and Group B score distributions, highlighting the region X >= 4.]

We can use Bayes' rule to calculate P(A | X >= 4) and P(B | X >= 4). If we assume a priori that P(A) = P(B) = 1/2, we find

  P(A | X >= 4) = P(X >= 4 | A) P(A) / [P(X >= 4 | A) P(A) + P(X >= 4 | B) P(B)]
                = 0.094(0.5) / [0.094(0.5) + 0.012(0.5)]
                = 0.89.

Why don't we need to calculate P(B | X >= 4) separately?

Larry Summers and women-in-science

"Summers's critics have repeatedly mangled his suggestion that innate differences might be one cause of gender disparities ... into the claim that they must be the only cause. And they have converted his suggestion that the statistical distributions of men's and women's abilities are not identical to the claim that all men are talented and all women are not, as if someone heard that women typically live longer than men and concluded that every woman lives longer than every man. ... In many traits, men show greater variance than women, and are disproportionately found at both the low and high ends of the distribution. Boys are more likely to be learning disabled or retarded but also more likely to reach the top percentiles in assessments of mathematical ability, even though boys and girls are similar in the bulk of the bell curve. ..." (Stephen Pinker in The New Republic)

Example: gender and aptitudes revisited

Assume that job "aptitude" can be represented as a continuous random variable and that the distribution of scores differs by gender.

[Figure: aptitude densities for women and men, with vertical dashed lines marking a central range.]

For women, 93.7% of the scores are between the vertical dashed lines, whereas only 68.6% of the men's scores fall in this range.
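The selection calculation for elite jobs is a one-line application of Bayes' rule. A sketch (not from the slides) using the tail probabilities quoted above:

```python
# Tail probabilities quoted in the slides: P(X >= 4 | A) and P(X >= 4 | B).
tail_A, tail_B = 0.094, 0.012
prior_A = prior_B = 0.5

# Bayes' rule: posterior probability of group A given an elite score.
post_A = tail_A * prior_A / (tail_A * prior_A + tail_B * prior_B)
print(round(post_A, 2))   # 0.89

# P(B | X >= 4) needs no separate calculation: the two posteriors sum to one.
post_B = 1 - post_A
```

With equal group sizes and equal central tendency, the higher-variance group still ends up heavily over-represented above the cutoff.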
Example: gender and aptitudes revisited (cont'd)

The corresponding CDFs reveal the same difference.

[Figure: cumulative distribution functions of the two aptitude distributions.]

These distributions are meant to be illustrative rather than factual.

Sex partners vignette: which average?

Here is a torn-from-the-headlines example of why it pays to know a little probability.

"Everyone knows men are promiscuous by nature...Surveys bear this out. In study after study and in country after country, men report more, often many more, sexual partners than women... But there is just one problem, mathematicians say. It is logically impossible for heterosexual men to have more partners on average than heterosexual women. Those survey results cannot be true."

A sex-partners statistical model

Question: is it possible for men to have more sex partners, on average, than women? To answer this question, we will consider a "toy" probability model for homo sapiens mating behavior. Let's call it the "summer camp" model. Each entry is the probability that the pair partners:

           Sally   Chastity   Maude
  John      0.07     0.5       0.05
  Lenny     0.06     0.5       0.04
  Romeo     0.05     0.5       0.09

A sex-partners random variable

The quantity of interest is the number of sex partners, which in our model is a number between 0 and 3. For each individual we can compute the distribution of this random variable. We will denote individuals by their first initial. We will assume independence: this means, for example, that Sally hooking up with Romeo makes it neither more nor less likely that she will hook up with Lenny.

Sally's sex-partner distribution

  Event                       x    P(Xs = x)
  none of J, L, R             0    (1-0.07)(1-0.06)(1-0.05)
  exactly one of J, L, R      1    (0.07)(1-0.06)(1-0.05) + (1-0.07)(0.06)(1-0.05) + (1-0.07)(1-0.06)(0.05)
  exactly two of J, L, R      2    (0.07)(0.06)(1-0.05) + (1-0.07)(0.06)(0.05) + (0.07)(1-0.06)(0.05)
  all of J, L, R              3    (0.07)(0.06)(0.05)

Can you see the probability laws in action here?
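Sally's table can be generated by enumerating the 2^3 partner outcomes. A sketch (variable names are mine, not from the slides):

```python
from itertools import product

# Sally's partnering probabilities with John, Lenny, Romeo.
probs = [0.07, 0.06, 0.05]

dist = {x: 0.0 for x in range(4)}
for outcome in product([0, 1], repeat=3):    # each 0/1 triple is one event
    p_event = 1.0
    for hooked, p in zip(outcome, probs):
        p_event *= p if hooked else (1 - p)  # independence: multiply factors
    dist[sum(outcome)] += p_event            # aggregate by number of partners

mean = sum(x * p for x, p in dist.items())
print({x: round(p, 4) for x, p in dist.items()})   # {0: 0.8305, 1: 0.1592, ...}
print(round(mean, 2))                              # 0.18
```

The multiplication inside the loop is independence at work; the `+=` into `dist` is the addition rule over disjoint events.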
Sally's sex-partner distribution (cont'd)

Here is what it looks like after the calculation (rounded a bit):

  x    ps(x) = P(Xs = x)
  0       0.83
  1       0.16
  2       0.01
  3       0.0002

We can do similarly for each individual.

[Figure: Sally's sex-partner distribution: 0.8305, 0.1592, 0.0101, 0.0002 at 0, 1, 2, 3 partners.]

The mean is 0(0.83) + 1(0.16) + 2(0.01) + 3(0.0002) = 0.18. What is the mode? What is the median?

Female sex-partner distribution

To get the distribution for all females, we sum over the individual women. We apply the law of total probability using all three conditional distributions:

  p_female(x) = ps(x) P(Sally) + pc(x) P(Chastity) + pm(x) P(Maude).

We assume that the women are selected at random with equal probability, P(Maude) = P(Chastity) = P(Sally) = 1/3. At the end we get a distribution like this:

[Figure: distribution of sex partners for females: 0.5951, 0.2315, 0.1315, 0.0418 at 0, 1, 2, 3 partners.]

The mean is 0.62, the mode is 0, and the median is 0.

Male sex-partner distribution

We can do the same thing for the males, and we get this:

[Figure: distribution of sex partners for males: 0.4417, 0.4983, 0.0583, 0.0017 at 0, 1, 2, 3 partners.]

The mean is 0.62, the mode is 1, and the median is 1.

Sex-partners vignette recap

The narrow lesson is that it pays to be specific about which measure of central tendency you're talking about! The more general lesson is that using probability models and a little bit of algebra can help us see a situation more clearly. This example uses the concepts of random variable, independence, conditional distribution, mean, median, and others.
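The whole "summer camp" model fits in a few lines. A sketch (not the original slides' code) that reproduces the female and male distributions from the partnering table:

```python
from itertools import product

# Partnering probabilities: rows are the men (John, Lenny, Romeo),
# columns are the women (Sally, Chastity, Maude).
P = [[0.07, 0.5, 0.05],
     [0.06, 0.5, 0.04],
     [0.05, 0.5, 0.09]]

def partner_dist(probs):
    """Distribution of number of partners given independent partnering probs."""
    dist = [0.0] * (len(probs) + 1)
    for outcome in product([0, 1], repeat=len(probs)):
        p_event = 1.0
        for hooked, p in zip(outcome, probs):
            p_event *= p if hooked else (1 - p)
        dist[sum(outcome)] += p_event
    return dist

women = [[row[j] for row in P] for j in range(3)]   # columns: each woman's probs
men = P                                             # rows: each man's probs

# Law of total probability: mix the three individual distributions, weight 1/3.
female = [sum(partner_dist(w)[x] for w in women) / 3 for x in range(4)]
male = [sum(partner_dist(m)[x] for m in men) / 3 for x in range(4)]

mean_f = sum(x * p for x, p in enumerate(female))
mean_m = sum(x * p for x, p in enumerate(male))
print([round(p, 4) for p in female], round(mean_f, 2))   # mode 0
print([round(p, 4) for p in male], round(mean_m, 2))     # mode 1
```

This reproduces the slides' punchline: both distributions have mean 0.62, yet the male mode and median are 1 while the female mode and median are 0.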
Idea: statistical "null" hypotheses

The hypothesis that events are independent often makes a nice contrast to other explanations, namely that random events are somehow related. This vantage point allows us to judge whether those other explanations fit the facts any better than the uninteresting "null" explanation that events are independent.

Vignette: making better pilots

Flight instructors have a policy of berating pilots who make bad landings. They notice that good landings met with praise are mostly followed by less-good landings, while bad landings met with harsh criticism are mostly followed by improved landings. Is their causal reasoning necessarily valid?

To stress-test their judgment that "criticism works", we consider the evidence in light of the null hypothesis that subsequent landings are in fact independent of one another, regardless of criticism or praise.

Example: making better pilots (cont'd)

Contrary to the assumptions of the instructors, consider each landing as independent of subsequent landings (irrespective of feedback). Assume that landings can be classified into three types, with the following probabilities:

  Event       Probability
  bad            pb
  adequate       pa
  good           pg

Remember that pb + pa + pg = 1.

Assume that the policy of criticism is judged to work when a bad landing is followed by a not-bad landing. Then

  P(criticism seems to work) = P(not bad_2 | bad_1) = P(not bad_2) = pa + pg

by independence. Conversely, the policy of praise appears to work when a good landing is followed by another good landing, so

  P(praise seems to work) = P(good_2 | good_1) = P(good_2) = pg.

Praise always appears to work less often than criticism!

Remark: null and alternative hypotheses

The previous example shows that the evidence can appear to favor criticism over praise even if criticism and praise are totally irrelevant. Does this mean that criticism does not work?
No, it just means that the observed facts are not compelling evidence that criticism works, because they are entirely consistent with the null hypothesis that landing quality is independent of previous landings and feedback. In cases like this we say we "fail to reject the null hypothesis". We'll revisit this terminology a couple of weeks from now.

Example: making better pilots (continuous version)

What if we want to take pilot skill into account? We will model this situation using normal random variables and see whether the same conclusions (that praise appears to hurt performance and criticism seems to boost it) could arise by chance.

Assume that each pilot has a certain ability level, call it A. Each individual landing score arises as a combination of this ability and a certain random fluctuation, call it epsilon. The landing score at time t can be expressed as

  St = A + epsilon_t.

Assuming epsilon_t ~ N(0, sigma^2), iid, then St ~ N(A, sigma^2).

Denote an average landing score as M. Consider a pilot with A > M. When he makes an exceptional landing, because epsilon_1 > 2 sigma, he is unlikely to best it on his next landing.

[Figure: distribution of landing scores, marking M, A, A + epsilon_1, and the likely location of S2.]

For this reason, praise is unlikely to appear to work even though landings are independent of one another.

For a poor pilot with A < M, a similar argument holds. When he makes a very poor landing, because epsilon_1 < -2 sigma, he is unlikely to do worse on his next landing.

[Figure: distribution of landing scores, marking A + epsilon_1, A, M, and the likely location of S2.]

For this reason, criticism is likely to "work" even though landings are independent.

Idea: mean reversion

The previous example illustrates an idea known as mean reversion.
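The mean-reversion argument can be checked with a small Monte Carlo experiment, a sketch under the model St = A + epsilon_t with parameters chosen purely for illustration: condition on an exceptional first landing and count how often the second is worse.

```python
import random

random.seed(0)
A, sigma = 1.0, 1.0           # pilot ability above the overall average M = 0
trials = worse = 0
while trials < 20000:
    e1 = random.gauss(0, sigma)
    if e1 <= 2 * sigma:       # keep only exceptional first landings (e1 > 2*sigma)
        continue
    s1 = A + e1
    s2 = A + random.gauss(0, sigma)   # second landing: independent of the first
    worse += s2 < s1
    trials += 1
print(worse / trials)         # the vast majority of follow-up landings are worse
```

Even though the two landings are independent by construction, nearly every exceptional landing is followed by a worse one, so praise appears to "fail" purely by chance.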
This name refers to the fact that subsequent observations tend to be "pulled back" towards the overall mean even if the events are independent of one another. Mean reversion describes a probabilistic fact, not a physical process.

What might the flight instructors have done (as an experiment) to really get to the bottom of their question?