5 Joint Probability Distributions and Random Samples
Copyright © Cengage Learning. All rights reserved.

5.1 Jointly Distributed Random Variables

Two Discrete Random Variables
The probability mass function (pmf) of a single discrete rv X specifies how much probability mass is placed on each possible X value. The joint pmf of two discrete rv's X and Y describes how much probability mass is placed on each possible pair of values (x, y).

Definition

Example 5.1
Anyone who purchases an insurance policy for a home or automobile must specify a deductible amount, the amount of loss to be absorbed by the policyholder before the insurance company begins paying out. Suppose that a particular company offers auto deductible amounts of $100, $500, and $1000, and homeowner deductible amounts of $500, $1000, and $5000. Consider randomly selecting someone who has both auto and homeowner insurance with this company, and let X = the amount of the auto policy deductible and Y = the amount of the homeowner policy deductible.

The joint pmf of these two variables appears in the accompanying joint probability table. According to this joint pmf, there are nine possible (X, Y) pairs: (100, 500), (100, 1000), …, and finally (1000, 5000). The probability of (100, 500) is p(100, 500) = P(X = 100, Y = 500) = .30. Clearly p(x, y) ≥ 0, and it is easily confirmed that the sum of the nine displayed probabilities is 1.

The probability P(X = Y) is computed by summing p(x, y) over the two (x, y) pairs for which the two deductible amounts are identical:
P(X = Y) = p(500, 500) + p(1000, 1000) = .15 + .10 = .25
Similarly, the probability that the auto deductible amount is at least $500 is the sum of all probabilities corresponding to (x, y) pairs for which x ≥ 500; this is the sum of the probabilities in the bottom two rows of the joint probability table:
P(X ≥ 500) = .15 + .20 + .05 + .10 + .10 + .05 = .65

Two Discrete Random Variables
Definition

Example 5.2 (Example 5.1 continued)
The possible X values are x = 100, 500, and 1000, so computing row totals in the joint probability table yields the marginal pmf of X:
pX(100) = .35, pX(500) = .40, pX(1000) = .25
From this pmf, P(X ≥ 500) = .40 + .25 = .65, which we already calculated in Example 5.1. Similarly, the marginal pmf of Y is obtained from the column totals.
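The joint probability table itself appeared as a displayed table; the entries used below are reconstructed from the probabilities quoted in Examples 5.1 and 5.2, so treat them as an assumption if your copy of the table differs. The sketch recomputes the marginals and the two probabilities above.

```python
# Joint pmf p(x, y) for the deductible example (entries reconstructed from the
# probabilities quoted in the text; treat them as an assumption).
p = {(100, 500): .30, (100, 1000): .05, (100, 5000): .00,
     (500, 500): .15, (500, 1000): .20, (500, 5000): .05,
     (1000, 500): .10, (1000, 1000): .10, (1000, 5000): .05}

# Marginal pmf of X (row totals) and of Y (column totals).
pX, pY = {}, {}
for (x, y), prob in p.items():
    pX[x] = pX.get(x, 0) + prob
    pY[y] = pY.get(y, 0) + prob

print(pX)                                              # approx {100: .35, 500: .40, 1000: .25}
print(pY)                                              # approx {500: .55, 1000: .35, 5000: .10}
print(sum(v for (x, y), v in p.items() if x == y))     # P(X = Y)    = .25
print(sum(v for (x, y), v in p.items() if x >= 500))   # P(X >= 500) = .65
```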
Two Continuous Random Variables
The probability that the observed value of a continuous rv X lies in a one-dimensional set A (such as an interval) is obtained by integrating the pdf f(x) over the set A. Similarly, the probability that the pair (X, Y) of continuous rv's falls in a two-dimensional set A (such as a rectangle) is obtained by integrating a function called the joint density function.

Definition

We can think of f(x, y) as specifying a surface at height f(x, y) above the point (x, y) in a three-dimensional coordinate system. Then P[(X, Y) ∈ A] is the volume underneath this surface and above the region A, analogous to the area under a curve in the case of a single rv. This is illustrated in Figure 5.1.
Figure 5.1  P[(X, Y) ∈ A] = volume under density surface above A

Example 5.3
A bank operates both a drive-up facility and a walk-up window. On a randomly selected day, let X = the proportion of time that the drive-up facility is in use (at least one customer is being served or waiting to be served) and Y = the proportion of time that the walk-up window is in use. Then the set of possible values for (X, Y) is the rectangle D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}.

Suppose the joint pdf of (X, Y) is as given in the accompanying expression. To verify that this is a legitimate pdf, note that f(x, y) ≥ 0 and that f(x, y) integrates to 1 over D. The probability that neither facility is busy more than one-quarter of the time is obtained by integrating the joint pdf over the region where x ≤ 1/4 and y ≤ 1/4.

Two Continuous Random Variables
The marginal pdf of each variable can be obtained in a manner analogous to what we did in the case of two discrete variables. The marginal pdf of X at the value x results from holding x fixed in the pair (x, y) and integrating the joint pdf over y. Integrating the joint pdf with respect to x gives the marginal pdf of Y.

Definition

Example 5.4
The marginal pdf of X, which gives the probability distribution of busy time for the drive-up facility without reference to the walk-up window, is obtained by integrating f(x, y) over y, for 0 ≤ x ≤ 1 (and is 0 otherwise). The marginal pdf of Y is obtained in the same way by integrating over x, and from it probabilities such as P(.25 ≤ Y ≤ .75) can then be computed.
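The displayed integrals can be checked numerically. The specific density f(x, y) = 1.2(x + y²) on the unit square is an assumption here, a density commonly used for this bank example; substitute your edition's expression in `f` if it differs.

```python
from scipy import integrate

# Assumed joint pdf for Example 5.3: f(x, y) = 1.2(x + y^2) on 0 <= x, y <= 1.
# dblquad integrates its first argument (y) innermost, so f is written as f(y, x).
f = lambda y, x: 1.2 * (x + y**2)

total, _ = integrate.dblquad(f, 0, 1, 0, 1)
print(round(total, 4))                      # 1.0, so f is a legitimate pdf

# P(neither facility busy more than a quarter of the time) = P(X <= .25, Y <= .25)
p_corner, _ = integrate.dblquad(f, 0, 0.25, 0, 0.25)
print(round(p_corner, 4))                   # about .0109

# Marginal pdf of Y (integrate f over x in [0, 1]) and P(.25 <= Y <= .75)
fY = lambda y: 1.2 * y**2 + 0.6
p_mid, _ = integrate.quad(fY, 0.25, 0.75)
print(round(p_mid, 4))                      # about .4625
```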
Independent Random Variables
In many situations, information about the observed value of one of the two variables X and Y gives information about the value of the other variable. In Example 5.1, the marginal probability of X at x = 100 was .35, and at x = 1000 it is .25. If, however, we learn that Y = 5000, the last column of the joint probability table tells us that X can't possibly be 100, and the other two possibilities, 500 and 1000, are now equally likely. Thus knowing the value of Y changes our assessment of the distribution of X: there is a dependence between the two variables. In Chapter 2, we pointed out that one way of defining independence of two events is via the condition P(A ∩ B) = P(A) · P(B). Here is an analogous definition for the independence of two rv's.

Definition

The definition says that two variables are independent if their joint pmf or pdf is the product of the two marginal pmf's or pdf's. Intuitively, independence says that knowing the value of one of the variables does not provide additional information about what the value of the other variable might be.

Example 5.1 continued
The joint pmf of these two variables appears in the accompanying joint probability table. Since
p(1000, 5000) = .05 ≠ (.25)(.10) = pX(1000) · pY(5000)
X and Y are not independent. Independence of X and Y requires that every entry in the joint probability table be the product of the corresponding row and column marginal probabilities.

Independent Random Variables
Independence of two random variables is most useful when the description of the experiment under study suggests that X and Y have no effect on one another. Then once the marginal pmf's or pdf's have been specified, the joint pmf or pdf is simply the product of the two marginal functions. It follows that
P(a ≤ X ≤ b, c ≤ Y ≤ d) = P(a ≤ X ≤ b) · P(c ≤ Y ≤ d)

More Than Two Random Variables
To model the joint behavior of more than two random variables, we extend the concept of a joint distribution of two variables.

Definition

Example 5.9
A binomial experiment consists of n dichotomous (success–failure), homogeneous (constant success probability), independent trials. Now consider a trinomial experiment in which each of the n trials can result in one of three possible outcomes. For example, each successive customer at a store might pay with cash, a credit card, or a debit card. The trials are assumed independent. Let p1 = P(trial results in a type 1 outcome) and define p2 and p3 analogously for type 2 and type 3 outcomes. The random variables of interest here are Xi = the number of trials that result in a type i outcome, for i = 1, 2, 3.

In n = 10 trials, the probability that the first five are type 1 outcomes, the next three are type 2, and the last two are type 3—that is, the probability of the experimental outcome 1111122233—is p1^5 · p2^3 · p3^2. This is also the probability of the outcome 1122311123, and in fact the probability of any outcome that has exactly five 1's, three 2's, and two 3's. Now to determine the probability P(X1 = 5, X2 = 3, and X3 = 2), we have to count the number of outcomes that have exactly five 1's, three 2's, and two 3's.

First, there are (10 choose 5) ways to choose five of the trials to be the type 1 outcomes. From the remaining five trials, we choose three to be the type 2 outcomes, which can be done in (5 choose 3) ways. This determines the remaining two trials, which consist of type 3 outcomes. So the total number of ways of choosing five 1's, three 2's, and two 3's is (10 choose 5)(5 choose 3) = 10!/(5! 3! 2!).

Thus P(X1 = 5, X2 = 3, X3 = 2) = [10!/(5! 3! 2!)] · p1^5 · p2^3 · p3^2. Generalizing this to n trials gives
p(x1, x2, x3) = [n!/(x1! x2! x3!)] · p1^x1 · p2^x2 · p3^x3
for x1 = 0, 1, 2, …; x2 = 0, 1, 2, …; x3 = 0, 1, 2, … such that x1 + x2 + x3 = n. Notice that whereas there are three random variables here, the third variable X3 is actually redundant. For example, in the case n = 10, having x1 = 5 and x2 = 3 implies that x3 = 2 (just as in a binomial experiment there are actually two rv's—the number of successes and the number of failures—but the latter is redundant).

As a specific example, the genetic allele of a pea section can be either AA, Aa, or aa. A simple genetic model specifies P(AA) = .25, P(Aa) = .50, and P(aa) = .25. If the alleles of 10 independently obtained sections are determined, the probability that exactly five of these are Aa and two are AA can be computed from this pmf.

A natural extension of the trinomial scenario is an experiment consisting of n independent and identical trials, in which each trial can result in any one of r possible outcomes. Let pi = P(outcome i on any particular trial), and define random variables by Xi = the number of trials resulting in outcome i (i = 1, …, r). This is called a multinomial experiment, and the joint pmf of X1, …, Xr is called the multinomial distribution. An argument analogous to the one used to derive the trinomial pmf gives the multinomial pmf as
p(x1, …, xr) = [n!/(x1! · … · xr!)] · p1^x1 · … · pr^xr
for xi = 0, 1, 2, … with x1 + … + xr = n (and 0 otherwise).
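The counting argument and the pea-genetics probability can be reproduced with SciPy's multinomial distribution; a minimal sketch based on the model stated above:

```python
from math import factorial
from scipy.stats import multinomial

# Number of orderings of ten trials with five 1's, three 2's, and two 3's:
coeff = factorial(10) // (factorial(5) * factorial(3) * factorial(2))

# Genetics illustration: P(AA) = .25, P(Aa) = .50, P(aa) = .25, n = 10 sections.
# Probability that exactly two sections are AA and five are Aa (hence three are aa):
prob = multinomial.pmf([2, 5, 3], n=10, p=[0.25, 0.50, 0.25])

print(coeff, round(prob, 4))   # 2520, about 0.0769
```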
More Than Two Random Variables
The notion of independence of more than two random variables is similar to the notion of independence of more than two events.

Definition

Thus if the variables are independent with n = 4, then the joint pmf or pdf of any two variables is the product of the two marginals, and similarly for any three variables and all four variables together. Intuitively, independence means that learning the values of some variables doesn't change the distribution of the remaining variables. Most importantly, once we are told that n variables are independent, the joint pmf or pdf is the product of the n marginals.

Conditional Distributions
Suppose X = the number of major defects in a randomly selected new automobile and Y = the number of minor defects in that same auto. If we learn that the selected car has one major defect, what now is the probability that the car has at most three minor defects—that is, what is P(Y ≤ 3 | X = 1)? Similarly, if X and Y denote the lifetimes of the front and rear tires on a motorcycle, and it happens that X = 10,000 miles, what now is the probability that Y is at most 15,000 miles, and what is the expected lifetime of the rear tire "conditional on" this value of X? Questions of this sort can be answered by studying conditional probability distributions.

Definition

Notice that the definition of fY | X(y | x) parallels that of P(B | A), the conditional probability that B will occur, given that A has occurred. Once the conditional pdf or pmf has been determined, questions of the type posed at the outset of this subsection can be answered by integrating or summing over an appropriate set of Y values.

Example 5.12 (Examples 5.3 and 5.4 continued)
Reconsider the situation of Examples 5.3 and 5.4 involving X = the proportion of time that a bank's drive-up facility is busy and Y = the analogous proportion for the walk-up window. The conditional pdf of Y given that X = .8 is f(.8, y)/fX(.8), and the probability that the walk-up facility is busy at most half the time given that X = .8 is obtained by integrating this conditional pdf from 0 to .5. Using the marginal pdf of Y instead gives P(Y ≤ .5) = .350. Also E(Y) = .6, whereas the expected proportion of time that the walk-up facility is busy given that X = .8 (a conditional expectation) differs from this unconditional mean.

5.2 Expected Values, Covariance, and Correlation

Any function h(X) of a single rv X is itself a random variable. However, to compute E[h(X)], it is not necessary to obtain the probability distribution of h(X); instead, E[h(X)] is computed as a weighted average of h(x) values, where the weight function is the pmf p(x) or pdf f(x) of X. A similar result holds for a function h(X, Y) of two jointly distributed random variables.

Proposition

Example 5.13
Five friends have purchased tickets to a certain concert. If the tickets are for seats 1–5 in a particular row and the tickets are randomly distributed among the five, what is the expected number of seats separating any particular two of the five? Let X and Y denote the seat numbers of the first and second individuals, respectively. Possible (X, Y) pairs are {(1, 2), (1, 3), . . . , (5, 4)}, and the joint pmf of (X, Y) is
p(x, y) = 1/20 for x = 1, . . . , 5; y = 1, . . . , 5; x ≠ y, and p(x, y) = 0 otherwise.
The number of seats separating the two individuals is h(X, Y) = |X – Y| – 1. The accompanying table gives h(x, y) for each possible (x, y) pair, and E[h(X, Y)] is the weighted average of these values.
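Because the 20 ordered pairs are equally likely, E[h(X, Y)] can be verified by direct enumeration; a minimal sketch:

```python
from itertools import permutations

# All ordered seat assignments (x, y) for the two friends; each pair has pmf 1/20.
pairs = list(permutations(range(1, 6), 2))          # 20 ordered pairs
h = lambda x, y: abs(x - y) - 1                     # seats separating the two friends
expected = sum(h(x, y) for x, y in pairs) / len(pairs)
print(expected)                                     # 1.0, i.e. one seat apart on average
```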
Covariance
When two random variables X and Y are not independent, it is frequently of interest to assess how strongly they are related to one another.

Definition

That is, since X – μX and Y – μY are the deviations of the two variables from their respective mean values, the covariance is the expected product of deviations. Note that Cov(X, X) = E[(X – μX)²] = V(X). The rationale for the definition is as follows. Suppose X and Y have a strong positive relationship to one another, by which we mean that large values of X tend to occur with large values of Y and small values of X with small values of Y. Then most of the probability mass or density will be associated with (x – μX) and (y – μY) either both positive (both X and Y above their respective means) or both negative, so the product (x – μX)(y – μY) will tend to be positive. Thus for a strong positive relationship, Cov(X, Y) should be quite positive. For a strong negative relationship, the signs of (x – μX) and (y – μY) will tend to be opposite, yielding a negative product. Thus for a strong negative relationship, Cov(X, Y) should be quite negative. If X and Y are not strongly related, positive and negative products will tend to cancel one another, yielding a covariance near 0.

Figure 5.4 illustrates the different possibilities. The covariance depends on both the set of possible pairs and the probabilities. In Figure 5.4, the probabilities could be changed without altering the set of possible pairs, and this could drastically change the value of Cov(X, Y).
Figure 5.4  p(x, y) = 1/10 for each of ten pairs corresponding to the indicated points: (a) positive covariance; (b) negative covariance; (c) covariance near zero

Example 5.15
The joint and marginal pmf's for X = automobile policy deductible amount and Y = homeowner policy deductible amount in Example 5.1 were given in the joint probability table, and Cov(X, Y) then follows from the definition.

Covariance
The following shortcut formula for Cov(X, Y) simplifies the computations.

Proposition

According to this formula, no intermediate subtractions are necessary; only at the end of the computation is μX · μY subtracted from E(XY). The proof involves expanding (X – μX)(Y – μY) and then carrying the summation or integration through to each individual term.

Correlation
Definition

Example 5.17
It is easily verified that in the insurance scenario of Example 5.15, E(X²) = 353,500, σX² = 353,500 – (485)² = 118,275, σX = 343.911, E(Y²) = 2,987,500, σY² = 1,721,875, and σY = 1312.202. This gives the correlation coefficient ρ = Cov(X, Y)/(σX σY).

Correlation
The following proposition shows that ρ remedies the defect of Cov(X, Y) and also suggests how to recognize the existence of a strong (linear) relationship.

Proposition

If we think of p(x, y) or f(x, y) as prescribing a mathematical model for how the two numerical variables X and Y are distributed in some population (height and weight, verbal SAT score and quantitative SAT score, etc.), then ρ is a population characteristic or parameter that measures how strongly X and Y are related in the population. In Chapter 12, we will consider taking a sample of pairs (x1, y1), . . . , (xn, yn) from the population. The sample correlation coefficient r will then be defined and used to make inferences about ρ.

The correlation coefficient ρ is actually not a completely general measure of the strength of a relationship.

Proposition

This proposition says that ρ is a measure of the degree of linear relationship between X and Y, and only when the two variables are perfectly related in a linear manner will ρ be as positive or negative as it can be. However, if |ρ| is much less than 1, there may still be a strong relationship between the two variables, just one that is not linear. And even if |ρ| is close to 1, it may be that the relationship is really nonlinear but can be well approximated by a straight line.
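Using the joint table reconstructed earlier (an assumption if it differs from the book's table), the shortcut formula reproduces Cov(X, Y) and the correlation of Example 5.17:

```python
import math

# Joint pmf reconstructed from the probabilities quoted in the text (assumption).
p = {(100, 500): .30, (100, 1000): .05, (100, 5000): .00,
     (500, 500): .15, (500, 1000): .20, (500, 5000): .05,
     (1000, 500): .10, (1000, 1000): .10, (1000, 5000): .05}

EX  = sum(x * v for (x, y), v in p.items())                          # 485
EY  = sum(y * v for (x, y), v in p.items())                          # 1125
EXY = sum(x * y * v for (x, y), v in p.items())                      # 682,500
cov = EXY - EX * EY                                                  # shortcut formula: 136,875
sx  = math.sqrt(sum(x**2 * v for (x, y), v in p.items()) - EX**2)    # about 343.911
sy  = math.sqrt(sum(y**2 * v for (x, y), v in p.items()) - EY**2)    # about 1312.202
print(cov, round(cov / (sx * sy), 3))                                # correlation about .303
```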
Example 5.18
Let X and Y be discrete rv's with the joint pmf shown in the accompanying display. The points that receive positive probability mass are identified on the (x, y) coordinate system in Figure 5.5.
Figure 5.5  The population of pairs for Example 5.18

It is evident from the figure that the value of X is completely determined by the value of Y and vice versa, so the two variables are completely dependent. However, by symmetry μX = μY = 0 and E(XY) = 0. The covariance is then Cov(X, Y) = E(XY) – μX μY = 0, and thus ρX,Y = 0. Although there is perfect dependence, there is also complete absence of any linear relationship!

Correlation
A value of ρ near 1 does not necessarily imply that increasing the value of X causes Y to increase. It implies only that large X values are associated with large Y values. For example, in the population of children, vocabulary size and number of cavities are quite positively correlated, but it is certainly not true that cavities cause vocabulary to grow. Instead, the values of both these variables tend to increase as the value of age, a third variable, increases. For children of a fixed age, there is probably a low correlation between number of cavities and vocabulary size. In summary, association (a high correlation) is not the same as causation.

The Bivariate Normal Distribution
Just as the most useful univariate distribution in statistical practice is the normal distribution, the most useful joint distribution for two rv's X and Y is the bivariate normal distribution. The pdf is somewhat complicated; a graph of this pdf, the density surface, appears in Figure 5.6. It follows (after some tricky integration) that the marginal distribution of X is normal with mean value μ1 and standard deviation σ1, and similarly the marginal distribution of Y is normal with mean μ2 and standard deviation σ2. The fifth parameter of the distribution is ρ, which can be shown to be the correlation coefficient between X and Y.

It is not at all straightforward to integrate the bivariate normal pdf in order to calculate probabilities. Instead, selected software packages employ numerical integration techniques for this purpose. Many students applying for college take the SAT, which for a few years consisted of three components: Critical Reading, Mathematics, and Writing. While some colleges used all three components to determine admission, many looked only at the first two (reading and math).

Let X and Y denote the Critical Reading and Mathematics scores, respectively, for a randomly selected student. According to the College Board website, the population of students taking the exam in Fall 2012 had the characteristics shown in the accompanying table. Suppose that X and Y have (approximately, since both variables are discrete) a bivariate normal distribution with correlation coefficient ρ = .25. The Matlab software package gives
P(X ≤ 650, Y ≤ 650) = P(both scores are at most 650) = .8097

It can also be shown that the conditional distribution of Y given that X = x is normal. This can be seen geometrically by slicing the density surface with a plane perpendicular to the (x, y) plane and passing through the value x on the x axis; the result is a normal curve sketched out on the slicing plane. The conditional mean value is a linear function of x, and the conditional variance does not depend on x. The closer the correlation coefficient is to +1 or –1, the less variability there is in the conditional distribution. Analogous results hold for the conditional distribution of X given that Y = y.
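Software such as Matlab or SciPy evaluates bivariate normal probabilities by numerical integration. The sketch below shows the SciPy version; the means and standard deviations are illustrative placeholders (assumptions), since the actual Fall 2012 values appear only in the table referenced above, while ρ = .25 is as stated in the text.

```python
from scipy.stats import multivariate_normal

# Placeholder parameters (assumptions): substitute the Fall 2012 SAT means and
# standard deviations from the text's table.
mu1, sigma1 = 496, 115    # Critical Reading
mu2, sigma2 = 514, 118    # Mathematics
rho = 0.25                # correlation coefficient stated in the text

cov = [[sigma1**2, rho * sigma1 * sigma2],
       [rho * sigma1 * sigma2, sigma2**2]]
bvn = multivariate_normal(mean=[mu1, mu2], cov=cov)

# P(X <= 650, Y <= 650); roughly .81 with parameters close to those reported.
print(round(bvn.cdf([650, 650]), 4))
```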
The Bivariate Normal Distribution
The bivariate normal distribution can be generalized to the multivariate normal distribution. Its density function is quite complicated, and the only way to write it compactly is to employ matrix notation. If a collection of variables has this distribution, then the marginal distribution of any single variable is normal, the conditional distribution of any single variable given values of the other variables is normal, the joint marginal distribution of any pair of variables is bivariate normal, and the joint marginal distribution of any subset of three or more of the variables is again multivariate normal.

5.3 Statistics and Their Distributions

Before the data become available, there is uncertainty about what value any particular statistic will take. We therefore view each observation as a random variable and denote the sample by X1, X2, . . . , Xn (uppercase letters for random variables). This variation implies that the value of any function of the sample observations—such as the sample mean, sample standard deviation, or sample fourth spread—also varies from sample to sample. That is, prior to obtaining x1, . . . , xn, there is uncertainty as to the value of the sample mean, the value of s, and so on.

Example 5.20 continued
Table 5.1  Samples from the Weibull Distribution of Example 5.20

Statistics and Their Distributions
Definition

Any statistic, being a random variable, has a probability distribution. In particular, the sample mean has a probability distribution. The probability distribution of a statistic is sometimes referred to as its sampling distribution to emphasize that it describes how the statistic varies in value across all samples that might be selected.

Random Samples
The probability distribution of any particular statistic depends not only on the population distribution (normal, uniform, etc.) and the sample size n but also on the method of sampling. Consider selecting a sample of size n = 2 from a population consisting of just the three values 1, 5, and 10, and suppose that the statistic of interest is the sample variance. If sampling is done "with replacement," then S² = 0 will result if X1 = X2. However, S² cannot equal 0 if sampling is "without replacement." So P(S² = 0) = 0 for one sampling method, and this probability is positive for the other method. Our next definition describes a sampling method often encountered (at least approximately) in practice.

Definition

Conditions 1 and 2 can be paraphrased by saying that the Xi's are independent and identically distributed (iid). If sampling is either with replacement or from an infinite (conceptual) population, Conditions 1 and 2 are satisfied exactly. These conditions will be approximately satisfied if sampling is without replacement and the sample size n is much smaller than the population size N. In practice, if n/N ≤ .05 (at most 5% of the population is sampled), we can proceed as if the Xi's form a random sample.

Deriving a Sampling Distribution
Probability rules can be used to obtain the distribution of a statistic provided that it is a "fairly simple" function of the Xi's and either there are relatively few different X values in the population or else the population distribution has a "nice" form. Our next example illustrates such a situation.
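Before that example, the with/without-replacement contrast described above is easy to verify by enumerating all possible samples of size 2 from the population {1, 5, 10}; a small illustrative sketch:

```python
from itertools import product, permutations
from statistics import variance

pop = [1, 5, 10]

# Sampling with replacement: all 9 ordered pairs are equally likely.
with_rep = [variance(s) for s in product(pop, repeat=2)]
print(sum(v == 0 for v in with_rep) / len(with_rep))        # P(S^2 = 0) = 3/9, about .333

# Sampling without replacement: the two values always differ, so S^2 = 0 is impossible.
without_rep = [variance(s) for s in permutations(pop, 2)]
print(sum(v == 0 for v in without_rep) / len(without_rep))  # 0.0
```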
Example 5.21
A certain brand of MP3 player comes in three configurations: a model with 2 GB of memory, costing $80, a 4 GB model priced at $100, and an 8 GB version with a price tag of $120. If 20% of all purchasers choose the 2 GB model, 30% choose the 4 GB model, and 50% choose the 8 GB model, then the probability distribution of the cost X of a single randomly selected MP3 player purchase is given by (5.2), with μ = 106 and σ² = 244.

Suppose on a particular day only two MP3 players are sold. Let X1 = the revenue from the first sale and X2 = the revenue from the second. Suppose that X1 and X2 are independent, each with the probability distribution shown in (5.2) [so that X1 and X2 constitute a random sample from the distribution (5.2)].

Table 5.2 lists the possible (x1, x2) pairs, the probability of each [computed using (5.2) and the assumption of independence], and the resulting x̄ and s² values. [Note that when n = 2, s² = (x1 – x̄)² + (x2 – x̄)².]
Table 5.2  Outcomes, Probabilities, and Values of x̄ and s² for Example 5.21

The complete sampling distributions of X̄ and S² appear in (5.3) and (5.4). Figure 5.8 pictures a probability histogram for both the original distribution (5.2) and the X̄ distribution (5.3). The figure suggests first that the mean (expected value) of the X̄ distribution is equal to the mean 106 of the original distribution, since both histograms appear to be centered at the same place.
Figure 5.8  Probability histograms for the underlying distribution and the x̄ distribution in Example 5.21

From (5.3),
μX̄ = E(X̄) = (80)(.04) + . . . + (120)(.25) = 106 = μ
Second, it appears that the X̄ distribution has smaller spread (variability) than the original distribution, since probability mass has moved in toward the mean. Again from (5.3),
σX̄² = (80²)(.04) + . . . + (120²)(.25) – (106)² = 122
The variance of X̄ is precisely half the original variance (because n = 2). Using (5.4), the mean value of S² is
μS² = E(S²) = Σ s² · pS²(s²) = (0)(.38) + (200)(.42) + (800)(.20) = 244 = σ²
That is, the X̄ sampling distribution is centered at the population mean μ, and the S² sampling distribution is centered at the population variance σ².

If there had been four purchases on the day of interest, the sample average revenue X̄ would be based on a random sample of four Xi's, each having the distribution (5.2). More calculation eventually yields the pmf of X̄ for n = 4; from this pmf, μX̄ = 106 = μ and σX̄² = 61 = σ²/4. Figure 5.9 is a probability histogram of this pmf.
Figure 5.9  Probability histogram for x̄ based on n = 4 in Example 5.21

Example 5.21 should suggest, first of all, that the computation of the sampling distributions of X̄ and S² can be tedious. If the original distribution (5.2) had allowed for more than three possible values, then even for n = 2 the computations would have been more involved. The example should also suggest, however, that there are some general relationships between E(X̄), V(X̄), E(S²), and the mean μ and variance σ² of the original distribution.
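The enumeration behind (5.2)–(5.4) can be scripted directly; this sketch reproduces the sampling distributions of X̄ and S² for n = 2 and confirms that they are centered at μ = 106 and σ² = 244.

```python
from itertools import product

# Single-purchase revenue distribution (5.2): $80, $100, $120 with probs .2, .3, .5.
pmf = {80: 0.2, 100: 0.3, 120: 0.5}

xbar_dist, s2_dist = {}, {}
for x1, x2 in product(pmf, repeat=2):            # independent sales: multiply probabilities
    prob = pmf[x1] * pmf[x2]
    xbar = (x1 + x2) / 2
    s2 = (x1 - xbar) ** 2 + (x2 - xbar) ** 2     # sample variance when n = 2
    xbar_dist[xbar] = xbar_dist.get(xbar, 0) + prob
    s2_dist[s2] = s2_dist.get(s2, 0) + prob

print(xbar_dist)                                 # approx {80: .04, 90: .12, 100: .29, 110: .30, 120: .25}
print(s2_dist)                                   # approx {0: .38, 200: .42, 800: .20}
print(sum(x * p for x, p in xbar_dist.items()))  # 106 = mu
print(sum(s * p for s, p in s2_dist.items()))    # 244 = sigma^2
```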
Simulation Experiments
The second method of obtaining information about a statistic's sampling distribution is to perform a simulation experiment. This method is usually used when a derivation via probability rules is too difficult or complicated to be carried out. Such an experiment is virtually always done with the aid of a computer. The following characteristics of an experiment must be specified:
1. The statistic of interest (X̄, S, a particular trimmed mean, etc.)
2. The population distribution (normal with μ = 100 and σ = 15, uniform with lower limit A = 5 and upper limit B = 10, etc.)
3. The sample size n (e.g., n = 10 or n = 50)
4. The number of replications k (number of samples to be obtained)
Then use appropriate software to obtain k different random samples, each of size n, from the designated population distribution. For each sample, calculate the value of the statistic and construct a histogram of the k values. This histogram gives the approximate sampling distribution of the statistic. The larger the value of k, the better the approximation will tend to be (the actual sampling distribution emerges as k → ∞). In practice, k = 500 or 1000 is usually sufficient if the statistic is "fairly simple."

The final aspect of the histograms to note is their spread relative to one another. The larger the value of n, the more concentrated is the sampling distribution about the mean value. This is why the histograms for n = 20 and n = 30 are based on narrower class intervals than those for the two smaller sample sizes. For the larger sample sizes, most of the x̄ values are quite close to 8.25. This is the effect of averaging. When n is small, a single unusual x value can result in an x̄ value far from the center. With a larger sample size, any unusual x values, when averaged in with the other sample values, still tend to yield an x̄ value close to μ. Combining these insights yields a result that should appeal to your intuition: X̄ based on a large n tends to be closer to μ than does X̄ based on a small n.
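As a concrete sketch of such a simulation experiment, the code below draws k = 1000 samples of size n from a normal population with μ = 100 and σ = 15 (one of the population choices listed above) and summarizes the resulting x̄ values; the histogram step is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, k = 100, 15, 1000          # population parameters and number of replications

for n in (10, 50):
    samples = rng.normal(mu, sigma, size=(k, n))
    xbars = samples.mean(axis=1)      # one x-bar per replication
    # Center stays near mu; spread shrinks roughly like sigma / sqrt(n).
    print(n, round(xbars.mean(), 2), round(xbars.std(ddof=1), 2))
```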
5.4 The Distribution of the Sample Mean

The importance of the sample mean X̄ springs from its use in drawing conclusions about the population mean μ. Some of the most frequently used inferential procedures are based on properties of the sampling distribution of X̄. A preview of these properties appeared in the calculations and simulation experiments of the previous section, where we noted relationships between E(X̄) and μ and also among V(X̄), σ², and n.

Proposition

According to Result 1, the sampling (i.e., probability) distribution of X̄ is centered precisely at the mean of the population from which the sample has been selected. Result 2 shows that the X̄ distribution becomes more concentrated about μ as the sample size n increases. In marked contrast, the distribution of To becomes more spread out as n increases. Averaging moves probability in toward the middle, whereas totaling spreads probability out over a wider and wider range of values. The standard deviation σX̄ = σ/√n is often called the standard error of the mean; it describes the magnitude of a typical or representative deviation of the sample mean from the population mean.

Example 5.25
In a notched tensile fatigue test on a titanium specimen, the expected number of cycles to first acoustic emission (used to indicate crack initiation) is μ = 28,000, and the standard deviation of the number of cycles is σ = 5000. Let X1, X2, . . . , X25 be a random sample of size 25, where each Xi is the number of cycles on a different randomly selected specimen. Then the expected value of the sample mean number of cycles until first emission is E(X̄) = 28,000, and the expected total number of cycles for the 25 specimens is E(To) = nμ = 25(28,000) = 700,000. The standard deviations of X̄ and of To are
σX̄ = σ/√n = 5000/√25 = 1000 (the standard error of the mean)
σTo = √n σ = √25 (5000) = 25,000
If the sample size increases to n = 100, E(X̄) is unchanged, but σX̄ = 500, half of its previous value (the sample size must be quadrupled to halve the standard deviation of X̄).

The Case of a Normal Population Distribution
Proposition

We know everything there is to know about the X̄ and To distributions when the population distribution is normal. In particular, probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ To ≤ d) can be obtained simply by standardizing. Figure 5.15 illustrates the proposition.
Figure 5.15  A normal population distribution and X̄ sampling distributions

Example 5.26
The distribution of egg weights (g) of a certain type is normal with mean value 53 and standard deviation .3 (consistent with data in the article "Evaluation of Egg Quality Traits of Chickens Reared under Backyard System in Western Uttar Pradesh" (Indian J. of Poultry Sci., 2009: 261–262)). Let X1, X2, . . . , X12 denote the weights of a dozen randomly selected eggs; these Xi's constitute a random sample of size 12 from the specified normal distribution.

The total weight of the 12 eggs is To = X1 + . . . + X12; it is normally distributed with mean value E(To) = nμ = 12(53) = 636 and variance V(To) = nσ² = 12(.3)² = 1.08. The probability that the total weight is between 635 and 640 is now obtained by standardizing and referring to Appendix Table A.3: P(635 < To < 640) ≈ .8315. If cartons containing a dozen eggs are repeatedly selected, in the long run slightly more than 83% of the cartons will have a total weight between 635 g and 640 g. Notice that 635 < To < 640 is equivalent to 52.9167 < X̄ < 53.3333 (divide each term in the original system of inequalities by 12). Thus P(52.9167 < X̄ < 53.3333) ≈ .8315. This latter probability can also be obtained by standardizing X̄ directly.

Now consider randomly selecting just four of these eggs. The sample mean weight X̄ is then normally distributed with mean value μX̄ = μ = 53 and standard deviation σX̄ = σ/√n = .3/√4 = .15. The probability that the sample mean weight exceeds 53.5 g is then found by standardizing; because 53.5 is 3.33 standard deviations (of X̄) larger than the mean value 53, it is exceedingly unlikely that the sample mean will exceed 53.5.
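A quick numerical check of the egg-carton calculations, using SciPy's normal distribution in place of Appendix Table A.3:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 53, 0.3, 12                      # egg weight distribution; a dozen eggs

# Total weight T0 is normal with mean 636 and standard deviation sqrt(1.08).
T0 = norm(loc=n * mu, scale=sqrt(n) * sigma)
print(round(T0.cdf(640) - T0.cdf(635), 4))      # about .832 (table-based value: .8315)

# Sample mean of four eggs: normal with mean 53 and standard deviation .15.
xbar4 = norm(loc=mu, scale=sigma / sqrt(4))
print(round(xbar4.sf(53.5), 4))                 # P(Xbar > 53.5) is about .0004
```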
The Central Limit Theorem
When the Xi's are normally distributed, so is X̄ for every sample size n. The derivations in Example 5.21 and the simulation experiment of Example 5.24 suggest that even when the population distribution is highly nonnormal, averaging produces a distribution more bell-shaped than the one being sampled. A reasonable conjecture is that if n is large, a suitable normal curve will approximate the actual distribution of X̄. The formal statement of this result is the most important theorem of probability.

Theorem

Figure 5.16 illustrates the Central Limit Theorem.
Figure 5.16  The Central Limit Theorem illustrated

According to the CLT, when n is large and we wish to calculate a probability such as P(a ≤ X̄ ≤ b), we need only "pretend" that X̄ is normal, standardize it, and use the normal table. The resulting answer will be approximately correct. The exact answer could be obtained only by first finding the distribution of X̄, so the CLT provides a truly impressive shortcut.

Example 5.27
The amount of a particular impurity in a batch of a certain chemical product is a random variable with mean value 4.0 g and standard deviation 1.5 g. If 50 batches are independently prepared, what is the (approximate) probability that the sample average amount of impurity X̄ is between 3.5 and 3.8 g? According to the rule of thumb to be stated shortly, n = 50 is large enough for the CLT to be applicable. X̄ then has approximately a normal distribution with mean value μX̄ = 4.0 and standard deviation σX̄ = 1.5/√50, so the desired probability is obtained by standardizing 3.5 and 3.8.

Now consider randomly selecting 100 batches, and let To represent the total amount of impurity in these batches. Then the mean value and standard deviation of To are 100(4) = 400 and √100 (1.5) = 15, respectively, and the CLT implies that To has approximately a normal distribution. The probability that this total is at most 425 g is then obtained by standardizing 425.

The Central Limit Theorem
The CLT provides insight into why many random variables have probability distributions that are approximately normal. For example, the measurement error in a scientific experiment can be thought of as the sum of a number of underlying perturbations and errors of small magnitude. A practical difficulty in applying the CLT is in knowing when n is sufficiently large. The problem is that the accuracy of the approximation for a particular n depends on the shape of the original underlying distribution being sampled. If the underlying distribution is close to a normal density curve, then the approximation will be good even for a small n, whereas if it is far from being normal, then a large n will be required. There are population distributions for which even an n of 40 or 50 does not suffice, but such distributions are rarely encountered in practice. On the other hand, the rule of thumb is often conservative; for many population distributions, an n much less than 30 would suffice. For example, in the case of a uniform population distribution, the CLT gives a good approximation for n ≥ 12.
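The two probabilities asked for in Example 5.27 follow directly from the normal approximation; a short sketch:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 4.0, 1.5                            # impurity mean and standard deviation per batch

# n = 50 batches: Xbar is approximately normal by the CLT.
se = sigma / sqrt(50)
print(round(norm.cdf(3.8, mu, se) - norm.cdf(3.5, mu, se), 4))    # about .164

# n = 100 batches: total T0 is approximately normal with mean 400 and sd 15.
print(round(norm.cdf(425, 100 * mu, sqrt(100) * sigma), 4))       # about .952
```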
5.5 The Distribution of a Linear Combination

The sample mean X̄ and sample total To are special cases of a type of random variable that arises very frequently in statistical applications.

Definition

For example, consider someone who owns 100 shares of stock A, 200 shares of stock B, and 500 shares of stock C. Denote the share prices of these three stocks at some particular time by X1, X2, and X3, respectively. Then the value of this individual's stock holdings is the linear combination Y = 100X1 + 200X2 + 500X3. Taking a1 = a2 = . . . = an = 1 gives Y = X1 + . . . + Xn = To, and a1 = a2 = . . . = an = 1/n yields Y = X̄.

Notice that we are not requiring the Xi's to be independent or identically distributed. All the Xi's could have different distributions and therefore different mean values and variances. We first consider the expected value and variance of a linear combination.

Proposition

Proofs are sketched out at the end of the section. A paraphrase of (5.8) is that the expected value of a linear combination is the same as the linear combination of the expected values—for example, E(2X1 + 5X2) = 2μ1 + 5μ2. The result (5.9) in Statement 2 is a special case of (5.11) in Statement 3; when the Xi's are independent, Cov(Xi, Xj) = 0 for i ≠ j and Cov(Xi, Xj) = V(Xi) for i = j (this simplification actually occurs when the Xi's are uncorrelated, a weaker condition than independence). Specializing to the case of a random sample (Xi's iid) with ai = 1/n for every i gives E(X̄) = μ and V(X̄) = σ²/n. A similar comment applies to the rules for To.

Example 5.30
A gas station sells three grades of gasoline: regular, extra, and super. These are priced at $3.00, $3.20, and $3.40 per gallon, respectively. Let X1, X2, and X3 denote the amounts of these grades purchased (gallons) on a particular day. Suppose the Xi's are independent with μ1 = 1000, μ2 = 500, μ3 = 300, σ1 = 100, σ2 = 80, and σ3 = 50. The revenue from sales is Y = 3.0X1 + 3.2X2 + 3.4X3, and
E(Y) = 3.0μ1 + 3.2μ2 + 3.4μ3 = $5620

The Difference Between Two Random Variables
An important special case of a linear combination results from taking n = 2, a1 = 1, and a2 = –1:
Y = a1X1 + a2X2 = X1 – X2
We then have the following corollary to the proposition.

Corollary

The expected value of a difference is the difference of the two expected values, but the variance of a difference between two independent variables is the sum, not the difference, of the two variances. There is just as much variability in X1 – X2 as in X1 + X2 [writing X1 – X2 = X1 + (–1)X2, (–1)X2 has the same amount of variability as X2 itself].

Example 5.31
A certain automobile manufacturer equips a particular model with either a six-cylinder engine or a four-cylinder engine. Let X1 and X2 be fuel efficiencies for independently and randomly selected six-cylinder and four-cylinder cars, respectively. With μ1 = 22, μ2 = 26, σ1 = 1.2, and σ2 = 1.5,
E(X1 – X2) = μ1 – μ2 = 22 – 26 = –4
If we relabel so that X1 refers to the four-cylinder car, then E(X1 – X2) = 4, but the variance of the difference is still 3.69.

The Case of Normal Random Variables
When the Xi's form a random sample from a normal distribution, X̄ and To are both normally distributed. Here is a more general result concerning linear combinations.

Proposition

Example 5.32
The total revenue from the sale of the three grades of gasoline on a particular day was Y = 3.0X1 + 3.2X2 + 3.4X3, and we calculated μY = $5620 and (assuming independence) σY = 429.46. If the Xi's are normally distributed, the probability that revenue exceeds 4500 is obtained by standardizing Y.

The Case of Normal Random Variables
The CLT can also be generalized so it applies to certain linear combinations. Roughly speaking, if n is large and no individual term is likely to contribute too much to the overall value, then Y has approximately a normal distribution.
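A short check of the revenue calculations in Examples 5.30 and 5.32: the standard deviation 429.46 and the exceedance probability follow from the stated means, standard deviations, and independence.

```python
from math import sqrt
from scipy.stats import norm

# Gas-station revenue Y = 3.0 X1 + 3.2 X2 + 3.4 X3 with independent grades.
a = [3.0, 3.2, 3.4]          # prices per gallon (coefficients of the linear combination)
mu = [1000, 500, 300]        # mean gallons purchased
sd = [100, 80, 50]           # standard deviations of gallons purchased

mean_Y = sum(ai * mi for ai, mi in zip(a, mu))              # 5620
sd_Y = sqrt(sum((ai * si) ** 2 for ai, si in zip(a, sd)))   # about 429.46
print(mean_Y, round(sd_Y, 2))

# If the Xi's are normal, Y is normal, so:
print(round(norm.sf(4500, mean_Y, sd_Y), 4))                # P(Y > 4500) is about .995
```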