THE UNIVERSITY OF TEXAS AT AUSTIN
Department of Information, Risk, and Operations Management
BA 386T
Tom Shively

ESTIMATION AND SAMPLING DISTRIBUTIONS

The purpose of these notes is to summarize the concepts regarding estimation and sampling distributions that we discussed in class.

ESTIMATING THE POPULATION MEAN (µ)

Consider an example dealing with the salaries of last spring’s MBA graduates. The random variable X will represent MBA salary. We will assume the population of last spring’s MBA salaries is normally distributed with unknown mean µ and a known variance of \(\sigma_X^2 = (10{,}000)^2\). The distribution is drawn below.

[Figure: normal distribution of X (MBA salary), centered at µ, with standard deviation \(\sigma_X = 10{,}000\).]

The interpretation of µ is the following: µ represents the mean salary of all MBA graduates from last spring. We could determine the exact value of µ (i.e. the exact mean salary of all MBAs in the population) by obtaining the salary of every MBA graduate, adding them up, and dividing the total by the number of MBA graduates. In practice, it is too expensive and time consuming to obtain every MBA salary, which means we cannot determine µ exactly. The important point to understand from this discussion is that the population mean µ is a number, but we do not know what the number is without getting the salary of every MBA graduate in the country.

Note that “mean” and “average” mean the same thing. They can be used interchangeably. For example, “sample mean” and “sample average” mean the same thing, as do “population mean” and “population average.”

Aside
I am assuming the variance of the population of MBA salaries (\(\sigma_X^2\)) is known. In particular, I am assuming that \(\sigma_X^2 = (10{,}000)^2\). In practice, if we don’t know the mean (µ) of the population, we will not know the variance (\(\sigma_X^2\)) of the population either. I am assuming we know the variance to remove one level of complexity from the problem of estimating the population mean µ and determining the quality of the estimate.
In practice, we will have to estimate the population variance \(\sigma_X^2\). This will be discussed below in further detail.
End Aside

Next Aside
It is much easier to interpret the standard deviation \(\sigma_X\) of the population than it is to interpret the variance \(\sigma_X^2\) of the population. The reason for this is that we know about 68% of the probability falls within one standard deviation of the mean, and that about 95% of the probability falls within two standard deviations of the mean. The probability calculation for one standard deviation is given below.

[Figure: normal distribution of X (MBA salary) with \(\mu - \sigma_X\), \(\mu\), and \(\mu + \sigma_X\) marked on the horizontal axis.]

\[
\begin{aligned}
\mathrm{pr}(\mu - \sigma_X < X < \mu + \sigma_X)
&= \mathrm{pr}\!\left(\frac{(\mu - \sigma_X) - \mu}{\sigma_X} < \frac{X - \mu}{\sigma_X} < \frac{(\mu + \sigma_X) - \mu}{\sigma_X}\right) \\
&= \mathrm{pr}(-1.0 < Z < 1.0) \\
&= \mathrm{pr}(Z < 1.0) - \mathrm{pr}(Z < -1.0) \\
&= 0.8413 - 0.1587 = 0.6826
\end{aligned}
\]

Also, the units for the standard deviation in this example are dollars, while the units for the variance are dollars squared. It is considerably easier to think of a measure of dispersion in dollars than in dollars squared.

In the MBA salary example, 68% of the MBA graduates make within \(\sigma_X\) = $10,000 of the population mean salary. An equivalent way to think about this probability is that if we pick an MBA graduate at random from the population, there is a 68% chance that we will get a person who makes within $10,000 of the average salary of all MBA graduates.

Similarly, 95% of the MBA graduates make within \(2\sigma_X\) = $20,000 of the population mean salary. An equivalent way to think about this probability is that if we pick an MBA graduate at random from the population, there is a 95% chance that we will get a person who makes within $20,000 of the average salary of all MBA graduates.
End Aside

To estimate the population mean, we collect a sample of MBA salaries from the population of last spring’s MBA graduates and use the mean of the salaries in the sample as an estimate of the mean of the salaries in the entire population. The steps in the logic underlying this idea are the following:

(1) The sample is representative of the population.
If the sample is relatively large, it will represent the entire spectrum of MBA salaries in the population. For example, a few people in the sample will make low salaries (because a small percentage of the MBA population makes low salaries), a large portion of the people in the sample will make salaries right around the population average (because most of the people in the MBA population make a salary close to the population mean salary), and a few people in the sample will make high salaries (because a small percentage of the MBA population makes high salaries).

(2) The sample mean (\(\bar{X}\)) is representative of the population mean (µ).
This follows from step (1). If the sample is representative of the entire spectrum of MBA salaries, then the sample mean (\(\bar{X}\)) must be representative of the population mean (µ). Another way to say this is that the sample mean (\(\bar{X}\)) is a good proxy for the unobservable population mean (µ).

(3) The sample mean (\(\bar{X}\)) is a good estimator of the population mean (µ).
This follows from step (2). Step (3) just formalizes the idea in step (2), i.e. if \(\bar{X}\) is a good proxy for µ, then we say \(\bar{X}\) provides a good estimator for µ.

DETERMINING THE QUALITY OF \(\bar{X}\) AS AN ESTIMATOR FOR µ

A natural question to ask is how good an estimate of µ we get using \(\bar{X}\). For example, suppose we collect a sample of size n = 100. It is possible (although unlikely) that we get 80 people in the sample who make salaries well above the population mean µ, and only 20 people in the sample who make salaries below the population mean µ. If this happens, then the sample mean \(\bar{X}\) will be far above the population mean µ, and we will get a bad estimate of µ.

The appropriate way to phrase the question concerning the quality of \(\bar{X}\) as an estimator for µ is the following. First, we must define what a good estimate is. I will say any estimate within $900 of the population mean µ is a good estimate. The definition of an accurate estimate (i.e.
saying we want an estimate within $900 of µ) is a subject matter question, not a statistical question. This means you must consider the context of the problem to determine the degree of accuracy that is required to have a good estimate. (The $900 figure I chose is admittedly a bit arbitrary. A more natural figure would be $1000 but I chose $900 to make it easy to differentiate from \(\sigma_{\bar{X}} = 1000\), which is used below.)

The question we want to answer is: What is the probability that we get a sample of MBA salaries from the population of last spring’s MBA graduates that gives an \(\bar{X}\) that is within $900 of µ?

To answer this question we must consider the sampling distribution of \(\bar{X}\). First, note that \(\bar{X}\) is a random variable. The random experiment used to obtain \(\bar{X}\) is the process of collecting a random sample, and the outcome of the random experiment (i.e. the sample mean \(\bar{X}\)) is a numerical value. Therefore, \(\bar{X}\) is a random variable and has a distribution associated with it. This distribution represents the uncertainty regarding the value of \(\bar{X}\) that we obtain due to the uncertainty regarding the random sample we will obtain.

The distribution of \(\bar{X}\) is \(N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2/n)\), where n is the sample size, \(\sigma_X^2 = (10{,}000)^2\) is the variance of the population, and \(\sigma_{\bar{X}}^2 = \mathrm{Var}(\bar{X}) = \sigma_X^2/n\) is the variance of the sample mean \(\bar{X}\). The distribution is drawn below.

[Figure: normal distribution of \(\bar{X}\), centered at µ, with standard deviation \(\sigma_{\bar{X}} = \sigma_X/\sqrt{n}\).]

Intuition underlying the distribution of \(\bar{X}\)

The distribution of \(\bar{X}\) has a mean of µ. The reason is that there is a 50/50 chance we will obtain a sample of MBA salaries from the MBA population that is weighted towards good students (i.e. there are more good students in the sample whose salaries are above the population mean µ than poor students whose salaries are below the population mean µ), and therefore there is a 50/50 chance the sample average \(\bar{X}\) is above the population mean µ. This is represented in the sampling distribution because half the area in the distribution for \(\bar{X}\) is above µ, i.e.
half the time we get a sample that gives an \(\bar{X}\) greater than µ. A similar argument can be used to explain why half the area in the distribution for \(\bar{X}\) is below µ.

Aside
There is a theorem in statistics that says that if \(X \sim N(\mu, \sigma_X^2)\), then \(\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2/n)\). (The Central Limit Theorem goes further: for large n, \(\bar{X}\) is approximately normal even when the population is not.) This theorem backs up the intuition we developed in class. It is a formal statement of the intuition (which is what all theorems are).
End Aside

Suppose we collect a sample of size n = 100. Given that the sampling distribution for \(\bar{X}\) is

\[
N\!\left(\mu,\ \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n} = \frac{(10{,}000)^2}{100} = 1{,}000{,}000 = (1{,}000)^2\right),
\]

we can compute the probability that we get a sample of MBA salaries that gives an \(\bar{X}\) within $900 of µ (i.e. we can compute the probability that we get what we define to be a good estimate of µ).

[Figure: sampling distribution of \(\bar{X}\) with \(\mu - 900\), \(\mu\), and \(\mu + 900\) marked; \(\sigma_{\bar{X}} = \sigma_X/\sqrt{n} = 10{,}000/\sqrt{100} = 1000\).]

\[
\begin{aligned}
\mathrm{pr}(\mu - 900 < \bar{X} < \mu + 900)
&= \mathrm{pr}\!\left(\frac{(\mu - 900) - \mu}{1000} < \frac{\bar{X} - \mu}{1000} < \frac{(\mu + 900) - \mu}{1000}\right) \\
&= \mathrm{pr}(-0.9 < Z < 0.9) \\
&= \mathrm{pr}(Z < 0.9) - \mathrm{pr}(Z < -0.9) \\
&= 0.8159 - 0.1841 = 0.6318
\end{aligned}
\tag{1}
\]

Suppose we collect the following sample of n = 100 salaries from the MBA population:

56496 75416 63516 61981 73506 84008 66390 76538 57154 65582
75998 72854 47156 83319 68730 64246 78871 64879 46674 71946
71169 68624 53100 57686 82651 62452 70288 34509 71315 71786
55572 68847 81244 74021 58491 60610 69942 61855 68703 66174
58173 68386 62209 64545 56881 75401 60446 73382 52375 62371
53678 65182 63114 61985 51936 58156 69112 57161 54847 61671
51208 60074 70832 52879 51615 59239 48304 58531 65702 58701
73883 73227 54859 65904 80033 70814 57770 57757 58249 67506
63484 80587 84081 69471 62943 76397 56920 62409 55329 56845
46387 50703 70761 74632 70901 62259 46217 68240 56122 78082

The notation we will use is the following: \(X_i\) represents the salary of the i-th person in the sample. Thus, \(X_1\) = $56496 is the salary of the first person in the sample, \(X_2\) = $64246 is the salary of the second person in the sample, …, \(X_{100}\) = $71786 is the salary of the last person in the sample.
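As a numerical check on equation (1), the short Python sketch below recomputes the probability without a Z table. It is only an illustration: the helper names `normal_cdf` and `prob_within` are made up for this example and are not part of the course material, and the standard normal probabilities come from Python's `math.erf` rather than from a printed table, so the result can differ from 0.6318 in the fourth decimal place.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, pr(Z < z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(tolerance: float, sigma_x: float, n: int) -> float:
    """pr(mu - tolerance < Xbar < mu + tolerance) when Xbar ~ N(mu, sigma_x^2 / n).

    The answer does not depend on mu, because mu cancels when standardizing.
    """
    sigma_xbar = sigma_x / math.sqrt(n)   # standard deviation of the sample mean
    z = tolerance / sigma_xbar            # e.g. 900 / 1000 = 0.9 when n = 100
    return normal_cdf(z) - normal_cdf(-z)


# Values from the notes: tolerance of $900, sigma_X = 10,000, n = 100.
p_100 = prob_within(900, 10_000, 100)
print(f"pr(Xbar within $900 of mu), n = 100: {p_100:.3f}")  # about 0.632
```

Calling `prob_within(900, 10_000, n)` with a larger `n` shows how the chance of a good estimate grows with the sample size.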
The sample mean is

\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{100}\sum_{i=1}^{100} X_i = 64252
\]

Thus, \(\bar{X}\) = $64252 is our estimate of the population mean µ. We don’t know whether \(\bar{X}\) = $64252 is close to µ or not because we don’t know µ. If we knew µ we would not have to bother estimating it. However, we can make the following statement based on the probability calculation in equation (1).

Of all the possible samples of size n = 100 we could collect from the population of MBA graduates, 63.2% of them give an \(\bar{X}\) within $900 of the population mean µ. The other 36.8% of the possible samples we could collect give an \(\bar{X}\) more than $900 from the population mean µ.

You don’t know which kind of sample you get (i.e. you don’t know whether you get one of the 63.2% that give an \(\bar{X}\) within $900 of µ or one of the samples that gives an \(\bar{X}\) more than $900 from µ). However, in my opinion, a 36.8% chance of failure is too high (i.e. a 36.8% chance of getting a bad estimate is too high).

To increase the probability of getting a good estimate we need to increase the sample size n. If we increase the sample size we are collecting more information about the population mean µ so we should get a better estimate. This will be reflected in a smaller probability of collecting a sample that gives an \(\bar{X}\) more than $900 from µ. Intuitively, with a larger sample (say n = 400) we are less likely to get a “strange” sample. A large sample is more likely to be very representative of the population. For example, we are highly unlikely to get all n = 400 people in the sample from good schools.

Given that the sampling distribution for \(\bar{X}\) is

\[
N\!\left(\mu,\ \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n} = \frac{(10{,}000)^2}{400} = 250{,}000 = (500)^2\right)
\]

when n = 400, we can now compute the probability that we get a sample of 400 MBA salaries that gives an \(\bar{X}\) within $900 of µ (i.e. we can compute the probability that we get what we define to be a good estimate of µ when n = 400).

[Figure: sampling distribution of \(\bar{X}\) with \(\mu - 900\), \(\mu\), and \(\mu + 900\) marked; \(\sigma_{\bar{X}} = \sigma_X/\sqrt{n} = 10{,}000/\sqrt{400} = 500\).]

\[
\begin{aligned}
\mathrm{pr}(\mu - 900 < \bar{X} < \mu + 900)
&= \mathrm{pr}\!\left(\frac{(\mu - 900) - \mu}{500} < \frac{\bar{X} - \mu}{500} < \frac{(\mu + 900) - \mu}{500}\right) \\
&= \mathrm{pr}(-1.8 < Z < 1.8) \\
&= \mathrm{pr}(Z < 1.8) - \mathrm{pr}(Z < -1.8) \\
&= 0.9641 - 0.0359 = 0.9282
\end{aligned}
\tag{2}
\]

We can now make the following statement based on the probability calculation in equation (2).

Of all the possible samples of size n = 400 we could collect from the population of MBA graduates, 92.8% of them give an \(\bar{X}\) within $900 of the population mean µ. The other 7.2% give an \(\bar{X}\) more than $900 from the population mean µ.

Therefore, we can be highly confident that the \(\bar{X}\) we obtain from a sample of size n = 400 will be within $900 of µ.

ESTIMATING THE POPULATION VARIANCE \(\sigma_X^2\)

The logic for estimating the population variance \(\sigma_X^2\) is the same as the logic used to estimate the population mean µ. The three steps are the following:

(1) The sample is representative of the population.

(2) The sample dispersion is representative of the population dispersion (\(\sigma_X^2\)).
The question is how to measure the dispersion of the sample. A natural way to do this is to use the average distance squared that each point in the sample is from the center of the sample. The distance from the i-th point (\(X_i\)) to the center of the sample (\(\bar{X}\)) is \((X_i - \bar{X})\). Thus, the average distance squared is

\[
s_X^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2
\]

Rather than divide by n, we divide by n − 1. Intuition tells us to divide by n but some mathematics (which we will not discuss and you are not responsible for) tells us to divide by n − 1. Therefore, we will use

\[
s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2
\]

to represent the dispersion of the sample. (Technical point: We use average distance squared instead of average distance because if we added up all the signed distances \((X_i - \bar{X})\), they would always add to zero, which means the average distance would also be zero. This would clearly be a bad measure of dispersion. To avoid the problem of positive and negative values canceling out, we use distances squared.)
(3) The sample variance (\(s_X^2\)) is a good estimator of the population variance (\(\sigma_X^2\)).
The reason is that \(s_X^2\) is a measure of the dispersion of the sample, and the dispersion of the sample is representative of the dispersion of the population. Thus, \(s_X^2\) is representative of the population dispersion, and is therefore a good estimator of \(\sigma_X^2\).

Example
Consider the MBA salary example. The salaries in the sample we collected are given above. The following dotplot of the salaries (which is not from Excel output) gives a feel for the dispersion of the sample.

[Figure: dotplot of the 100 sampled salaries; horizontal axis: Salary, with ticks from 30000 to 80000.]

For this sample, \(\bar{X} = 64252\) and

\[
s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{100-1}\sum_{i=1}^{100} (X_i - \bar{X})^2 = 94692361 = (9731)^2.
\]

Thus, \(s_X^2 = (9731)^2\) is our estimate of the population variance \(\sigma_X^2\). The sample standard deviation (called \(s_X\)) is \(s_X = 9731\).

EXPLANATION OF THE THREE TYPES OF VARIANCES

The three types of variances are:

(1) \(\sigma_X^2\): \(\sigma_X^2\) is the population variance. \(\sigma_X\) is the population standard deviation. The population standard deviation and population variance provide a measure of the dispersion of the population. 68% of the probability falls within \(\sigma_X\) of the population mean µ, while 95% of the probability falls within \(2\sigma_X\) of the population mean µ. For example, in the MBA salary example, \(\sigma_X\) = $10,000 so 68% of the MBA population makes within $10,000 of the average salary of all MBAs in the population, while 95% of the MBA population makes within $20,000 of the average salary of all MBAs in the population.

(2) \(s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\): \(s_X^2\) is the sample variance. \(s_X\) is the sample standard deviation. The sample standard deviation and sample variance provide a measure of the dispersion of the sample. Because the sample is representative of the population, the sample variance \(s_X^2\) is representative of the population variance \(\sigma_X^2\), and therefore \(s_X^2\) is an estimator for \(\sigma_X^2\).

(3) \(\sigma_{\bar{X}}^2 = \sigma_X^2/n\): \(\sigma_{\bar{X}}^2\) is the variance of the sample mean. It provides a measure of the uncertainty regarding the value of the sample mean \(\bar{X}\) we obtain from our random sample. \(\sigma_{\bar{X}}\) is the standard deviation of the sample mean. Of all the possible samples we could collect from the population, 68% of them will give an \(\bar{X}\) within \(\sigma_{\bar{X}}\) of the population mean. For example, in the MBA salary example (with n = 100 and \(\sigma_{\bar{X}}^2 = \sigma_X^2/n = (10{,}000)^2/100 = 1{,}000{,}000 = (1{,}000)^2\), so \(\sigma_{\bar{X}} = 1000\)), 68% of the possible samples we could collect from the population will give an \(\bar{X}\) within \(\sigma_{\bar{X}}\) = $1000 of µ. This provides a measure of the quality of \(\bar{X}\) as an estimator of µ.

To summarize, \(\sigma_X^2\) is a measure of the dispersion of the population and \(s_X^2\) is a measure of the dispersion of the sample. \(\sigma_{\bar{X}}^2\) is a measure of the quality of the sample mean \(\bar{X}\) as an estimator of the population mean µ.

The population standard deviation \(\sigma_X\) = $10,000 tells us that the probability that the salary of a single randomly chosen graduate from the population is within $10,000 of the population average salary (µ) is 68%.

The standard deviation of the sample mean \(\sigma_{\bar{X}} = \sqrt{\sigma_X^2/n} = \sqrt{(10000)^2/100} = 1000\) tells us that the probability that we collect a sample of size n = 100 from the population that gives a sample mean \(\bar{X}\) within \(\sigma_{\bar{X}}\) = $1,000 of µ is 68%.

The sample standard deviation \(s_X\) is an estimate of \(\sigma_X\).

NOTATION

(1) \(X \sim N(\mu, \sigma_X^2)\): X is normally distributed with population mean µ and population variance \(\sigma_X^2\).

(2) µ: Population mean. µ represents the center of the population. It is the value of X that we expect on average. For example, in the MBA salary example, µ is the average salary of all MBAs in the population.

(3) \(\sigma_X^2\): Population variance. \(\sigma_X^2\) provides a measure of the dispersion of the population.
See the description in EXPLANATION OF THE THREE TYPES OF VARIANCES above.

(4) \(\sigma_X\): Population standard deviation. \(\sigma_X\) also provides a measure of the dispersion of the population. It is easier to interpret than the population variance because the units are appropriate, e.g. dollars (not dollars squared) in the MBA salary example.

(5) \(X_i\): i-th value in the random sample. For example, in the MBA salary example, \(X_1\) represents the salary of the first person in the sample, \(X_2\) represents the salary of the second person in the sample, etc.

(6) \(\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i\): Sample mean (or equivalently, sample average). \(\bar{X}\) represents the center of the sample. It is an estimator for the population mean µ.

(7) \(\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2/n)\): \(\bar{X}\) is normally distributed with mean µ and variance \(\sigma_{\bar{X}}^2 = \sigma_X^2/n\).

(8) \(\sigma_{\bar{X}}^2 = \sigma_X^2/n\): \(\sigma_{\bar{X}}^2\) is the variance of the sample mean. See the description in EXPLANATION OF THE THREE TYPES OF VARIANCES above.

(9) \(s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\): \(s_X^2\) is the sample variance. It represents the dispersion of the sample and is an estimate of \(\sigma_X^2\). See the description in EXPLANATION OF THE THREE TYPES OF VARIANCES above.

(10) \(s_X = \sqrt{s_X^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}\): \(s_X\) is the sample standard deviation. \(s_X\) also provides a measure of the dispersion of the sample. It is easier to interpret than the sample variance because the units are appropriate, e.g. dollars (not dollars squared) in the MBA salary example.
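The three kinds of variance described in these notes can also be seen in a small simulation. The Python sketch below is illustrative only and rests on assumptions that are not in the notes: a hypothetical population mean `MU = 65_000` (in practice µ is unknown), an arbitrary number of repetitions `REPS`, and a fixed random seed. It draws many samples of size n = 100 from \(N(\mu, (10{,}000)^2)\) and records the sample mean and the sample variance of each one.

```python
import random
import statistics

random.seed(0)       # fixed seed so the run is repeatable (arbitrary choice)

MU = 65_000          # hypothetical population mean; unknown in practice
SIGMA_X = 10_000     # population standard deviation assumed in the notes
N = 100              # sample size
REPS = 5_000         # number of simulated samples (arbitrary choice)

sample_means = []
sample_variances = []
for _ in range(REPS):
    # One random sample of N salaries from the assumed normal population.
    sample = [random.gauss(MU, SIGMA_X) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))
    # statistics.variance divides by n - 1, matching the sample variance s_X^2.
    sample_variances.append(statistics.variance(sample))

# Spread of the sample means across samples: compare with sigma_X / sqrt(n).
sd_of_means = statistics.pstdev(sample_means)

# Share of samples whose mean lands within $900 of the population mean.
frac_good = sum(abs(m - MU) < 900 for m in sample_means) / REPS

# Average of the sample variances: compare with sigma_X^2.
avg_s2 = statistics.fmean(sample_variances)

print(f"standard deviation of sample means: {sd_of_means:.0f}")
print(f"fraction of samples within $900 of mu: {frac_good:.3f}")
print(f"square root of average sample variance: {avg_s2 ** 0.5:.0f}")
```

Under these assumptions the standard deviation of the simulated sample means should come out near \(\sigma_{\bar{X}} = 1000\), the fraction of samples within $900 of µ near 63%, and the square root of the average sample variance near \(\sigma_X = 10{,}000\). The exact figures depend on the seed and on `REPS`.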