Chapter 8 Point and Interval Estimators

Up to this point we have been studying probability theory. We have not looked at statistics at all. In probability theory we asked the following sort of question: "Suppose we have a normal distribution with $\mu = 100$ and a given $\sigma$. What is the probability that a sample of $n = 100$ will produce an $\bar{x}$ in the interval [99, 101]?" In statistics we ask this kind of question: "Suppose we took a sample of $n = 100$ and found that $\bar{x} = 99.5$. What can we say about $\mu$?" In one sense statistics is probability theory stood on its head. What is considered as given in probability theory is what we don't know in statistics. What we consider as known in statistics is the question asked in probability theory.

8.1 Point Estimators

One of the tasks in statistics is to get estimates of population parameters such as $\mu$ or $\sigma$. An estimator is a formula for producing an estimate. The estimate will serve as our best guess for the value of the population parameter we can find using the sample data. If we have the following data:

x = 1.0, 2.0, 3.0

then the sample mean

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

will be the estimator. The value of the mean, $\bar{x} = 2$, is called the estimate. Note that the formula is the estimator and the value that the formula produces for a given set of data is called the estimate. The best point estimators for the population parameters we have considered thus far are shown in Table 8-1.

Population parameter    Best point estimator
$\mu$                   $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
$\sigma^2$              $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
$p$                     $\hat{p} = X / n$

Table 8-1. Best point estimators for some population parameters.

Recall from your mathematics classes that a point on a line can represent a number. So anything that can produce a number will suffice as a point estimator. Generally, estimators will be formulas, but need not be.
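The distinction between an estimator (the formula) and an estimate (the number it produces) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, the data are the three values used in the text, and the counts passed to the proportion estimator are borrowed from the election poll in Example 8.7.

```python
import statistics

# Data from the text: x = 1.0, 2.0, 3.0
data = [1.0, 2.0, 3.0]


def sample_mean(xs):
    """Estimator of mu: (1/n) * sum of the x_i."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Estimator of sigma^2: squared deviations divided by n - 1."""
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)


def sample_proportion(successes, n):
    """Estimator of p: number of successes divided by the sample size."""
    return successes / n


# The functions are the estimators; the numbers below are the estimates.
mean_estimate = sample_mean(data)
variance_estimate = sample_variance(data)
proportion_estimate = sample_proportion(534, 1000)  # counts from Example 8.7

print(mean_estimate, variance_estimate, proportion_estimate)
```

The standard library's `statistics.mean` and `statistics.variance` implement the same two formulas, so in practice there is no need to write them by hand.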
Some formulas that will produce a number using sample data are

Point estimators of $\mu$    Formula
Sample mean                  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Geometric mean               $GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$
Funny formula one            $ff_1 = \sum_{i=1}^{n} \cos x_i$
Funny formula two            $ff_2 = \sum_{i=1}^{n} \log x_i$
Max's estimator              5

Table 8.2 Point estimators of the population mean.

Table 8.2 shows some potential point estimators of the population mean. There is an infinity of possible point estimators of the population mean but we only show 5 of them here. Note that Max's estimator has some distinct advantages: it requires no data, hence it has zero sampling cost, and it can be computed very quickly. Is there a reason then, save for professional jealousy, that other statisticians prefer the sample mean to Max's estimator? The answer involves the word best. By best we mean the following.

----------------------[---*---]------------------------

The drawing above shows the population mean designated as a point on a line (the black dot, drawn here as *). Construct a small interval [ ] about this dot. Take many samples and compute every point estimator for each sample. More of the $\bar{x}$ values will fall in this interval than values of any other point estimator (well, if the population mean just happens to be 5, Max's estimator will win every time). Hence the term best: it just means that, on average, the sample mean will be better than any competing estimator.

Finally, note that best does not necessarily mean good. Any particular value for the sample mean might produce a perfectly rotten estimate of the mean. Interval estimators, discussed next, will help determine if the estimate is a good one.

Example 8.1 A new manufacturing process has been developed to produce artificial diamonds. A sample of $n = 10$ diamonds is taken from the process and weighed. The results of this sample are $\bar{x} = 0.5$ carats and $s^2 = 0.1$. What is the best estimate of the true average weight of diamonds produced by the process?
The true average weight of the process is designated by $\mu$, the average of the weights of all diamonds that will ever be produced. Obviously, this value could only be computed by weighing every diamond, which cannot be done. We can only make a guess of the value of the population mean based on the sample results. The best guess we can make using the sample data is the sample mean, $\bar{x} = 0.5$ carats.

8.2 Interval Estimators when $\sigma$ is known

An estimator is a formula that produces a single value called an estimate. We know that the best point estimator for a population mean is a sample mean. The actual value we get from a sample is called an estimate. What we would like is some way of trying to determine if a particular estimate is a good one. The only way of being sure is to actually compare our estimate with the population mean. But since the reason we are taking the sample in the first place is to get a guess for the unknown value of the population mean, we can never really be sure that a particular guess is a good one. However, we can get some pretty good hints.

First let's look at a probability problem. Consider taking samples of size $n = 100$ from a normal distribution with $\mu = 100$ and $\sigma = 51.02$. Note the CLT holds (why?). Yes, the value for the standard deviation looks funny, but it will be useful in what follows. Now construct a symmetric interval about the population mean such that 95% of all sample means of size $n = 100$ will be in this interval. Consider the formula for the Z-score

$Z = \dfrac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$

which can be rearranged as

$\bar{X} = \mu_{\bar{X}} + Z \sigma_{\bar{X}}$

(recall that $\mu_{\bar{X}} = \mu$). We need to construct an interval $[X_L, X_U]$ such that there is a 95% probability that a sample mean will be in this interval. Such an interval can be computed by choosing the proper values of Z. The interval we need from the Z distribution is one that contains 95% of the area symmetric about the mean. That interval is $[-1.96, 1.96]$. If we use these values we get $X_L = \mu_{\bar{X}} - 1.96 \sigma_{\bar{X}}$ and $X_U = \mu_{\bar{X}} + 1.96 \sigma_{\bar{X}}$.
So

$[X_L, X_U] = [\mu_{\bar{X}} - 1.96 \sigma_{\bar{X}},\ \mu_{\bar{X}} + 1.96 \sigma_{\bar{X}}]$

Now for a sample of size 100

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{51.02}{\sqrt{100}} = 5.102$

so

$[X_L, X_U] = [100 - 1.96(5.102),\ 100 + 1.96(5.102)] = [90, 110]$

So there is a 95% probability that a sample mean will be in the interval $[90, 110]$. This is why the value of 51.02 for the standard deviation was used earlier: so the resulting interval would have values that are easy to work with.

Important observation: Suppose we take 100 samples, each of size $n = 100$, from this normal distribution. Then approximately 95 of the 100 samples will have means in the interval $[\mu_{\bar{X}} - 1.96 \sigma_{\bar{X}},\ \mu_{\bar{X}} + 1.96 \sigma_{\bar{X}}] = [90, 110]$.

Now consider Table 8.3. The two leftmost columns hold the interval $[90, 110]$ that will contain 95% of the sample means. The column labeled $\bar{X}$ contains a sample mean (we will pretend to take samples here). The next two columns contain the interval $[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}] = [\bar{X} - 10,\ \bar{X} + 10]$. The last two columns are labeled A and B. Let these stand for the following questions:

A: Is $\bar{X}$ in $[\mu - 1.96 \sigma_{\bar{X}},\ \mu + 1.96 \sigma_{\bar{X}}]$?
B: Is $\mu$ in $[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}]$?

The first sample has a mean $\bar{X} = 101$. This mean lies in $[90, 110]$. The same size interval about $\bar{X} = 101$ is $[91, 111]$. Note this interval contains $\mu = 100$. So the answer to A is Y and the answer to B is Y.

mu - 1.96 sigma_xbar   mu + 1.96 sigma_xbar   Xbar     Xbar - 1.96 sigma_xbar   Xbar + 1.96 sigma_xbar   A   B
90                     110                    101      91                       111                      Y   Y
90                     110                    104      94                       114                      Y   Y
90                     110                    109      99                       119                      Y   Y
90                     110                    109.99   99.99                    119.99                   Y   Y
90                     110                    110.01   100.01                   120.01                   N   N
90                     110                    113      103                      123                      N   N

Table 8.3

Important observations:
1. Every time there is a Y in column A there will also be a Y in column B.
2. 95% of the time there will be a Y in column A.
3. Thus 95% of the time there will be a Y in column B.
4. So if we take a sample mean and construct an interval $[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}]$ then there is a 95% probability that that interval will contain the population mean.
5. To rephrase 4., if you take a sample, compute the sample mean $\bar{X}$, construct an interval $[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}]$ about that sample mean, and then make a bet that this interval contains $\mu$, then you will win 95% of your bets.

Example 8.2. Suppose that household incomes in Flagstaff are normally distributed with standard deviation $\sigma = 2000$. A sample of $n = 1000$ households is interviewed and data are collected about the income of each household. The average income of the households in the sample is $\bar{X} = 33{,}011$. Construct an interval such that you can be 95% certain that the interval will contain the average income for all Flagstaff households (the latter is the population mean).

From the above discussion, we know that if we find an interval $[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}]$ it will contain $\mu$ with 95% probability. So we have

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2000}{\sqrt{1000}} = 63.25$

$[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}] = [33011 - 1.96(63.25),\ 33011 + 1.96(63.25)] = [32887.03,\ 33134.97]$

Example 8.3. The same as 8.2 except that the sample size is $n = 10$.

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2000}{\sqrt{10}} = 632.46$

$[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}] = [33011 - 1.96(632.46),\ 33011 + 1.96(632.46)] = [31771.38,\ 34250.62]$

Compare the two results. In Example 8.2 we were 95% confident that the population mean lies in an interval of width $\pm 123.97$, while the width of the interval in 8.3 is $\pm 1239.62$. We should feel much better about the interval in 8.2 than the one in 8.3.

Example 8.4 To belabor the point, suppose the sample had been of size $n = 10000$. In that case

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{2000}{\sqrt{10000}} = 20$

$[\bar{X} - 1.96 \sigma_{\bar{X}},\ \bar{X} + 1.96 \sigma_{\bar{X}}] = [33011 - 1.96(20),\ 33011 + 1.96(20)] = [32971.8,\ 33050.2]$

In this case the interval is $\pm 39.20$, so we could say that we are 95% confident we know where the population mean is to within about plus or minus 39.20 dollars. We can make the interval as small as we want by making the sample size bigger. (Actually, if the sample size gets close to the population size we would have to modify the formula for $\sigma_{\bar{X}}$. We will ignore that issue here.)
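The arithmetic in Examples 8.2 through 8.4 can be checked with a short Python sketch. The helper name `z_interval` is made up for this illustration, and it takes the Z value from the standard library's `NormalDist` (about 1.95996) rather than the rounded table value 1.96, so the limits may differ from the hand calculations in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """Interval xbar +/- Z * sigma / sqrt(n) for a known population sigma."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # area alpha/2 in each tail
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Same sample mean and sigma as Examples 8.2-8.4; only n changes.
intervals = {n: z_interval(33011, 2000, n) for n in (10, 1000, 10000)}

for n, (low, high) in intervals.items():
    print(f"n = {n:>5}: [{low:.2f}, {high:.2f}]  half-width = {(high - low) / 2:.2f}")
```

Because the half-width is proportional to $1/\sqrt{n}$, multiplying the sample size by 100 shrinks the interval by a factor of 10, which is the pattern seen between Examples 8.3 and 8.2.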
A $(1 - \alpha)100\%$ confidence interval (CI) for the population mean is given by

$\bar{X} \pm Z_{\alpha/2} \sigma_{\bar{X}}$

where $\alpha$ determines the level of confidence and hence the value of Z to use. You would choose a value of Z so that $\alpha/2$ of the area of the Z distribution lies in each tail.

Suppose you want a 90% CI for the population mean. Then $(1 - \alpha)100\% = 90\%$, so $\alpha = 0.1$ and we want values from the Z distribution with $\alpha/2 = 0.1/2 = 0.05$, or 5% of the area, in each tail of the Z-distribution as shown in Figure 8.1. For the 90% CI the Z values are $\pm 1.65$.

Figure 8.1 A 90% CI for $\mu$.

Excel has a command that computes the half-width of a confidence interval, $Z_{\alpha/2} \sigma_{\bar{X}}$. The command is

=CONFIDENCE(alpha, sigma, n)

If we use this for Examples 8.2 to 8.4 above we find

For Example 8.2    123.9588    =CONFIDENCE(0.05,2000,1000)
For Example 8.3    1239.588    =CONFIDENCE(0.05,2000,10)
For Example 8.4    39.19922    =CONFIDENCE(0.05,2000,10000)

Example 8.5 A sample of $n = 300$ is taken of Flagstaff household incomes with a resulting sample mean of $28,750. It is known that the standard deviation of all Flagstaff household incomes is $1000. Find a 99% CI for the average of all Flagstaff household incomes.

Note: because the distribution is continuous and the sample size is relatively large (>30), we will assume the CLT holds for this problem.

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{1000}{\sqrt{300}} = 57.74$

$Z_{\alpha/2} = Z_{0.01/2} = Z_{0.005} = 2.58$

$\bar{X} \pm Z_{\alpha/2} \sigma_{\bar{X}} = 28{,}750 \pm 2.58(57.74) = 28{,}750 \pm 148.96$

We are 99% confident that the average household income for Flagstaff is in the interval $[28601.04,\ 28898.96]$.

Example 8.6 Suppose that we are interested in estimating the average gasoline usage of a certain brand of automobile. A sample of $n = 150$ cars of that brand is driven and the average gasoline usage of the sample cars is $\bar{X} = 32.5$ mpg. Suppose that we know the population standard deviation is $\sigma = 3$ mpg. What is the best point estimate for the average mpg for the entire fleet of cars of this brand? Find a 90% CI for the population mean mpg.

The best point estimate is 32.5 mpg.
For the 90% CI

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{3}{\sqrt{150}} = 0.245$

$Z_{\alpha/2} = Z_{0.10/2} = Z_{0.05} = 1.65$

$\bar{X} \pm Z_{\alpha/2} \sigma_{\bar{X}} = 32.5 \pm 1.65(0.245) = 32.5 \pm 0.404$

8.3 Confidence Intervals for $p$

The method for finding the CI for a population proportion, $p$, is the same as finding the CI for the population mean. The general form is

point estimator $\pm\ Z_{\alpha/2}$ (standard deviation of the sampling distribution)

A $(1 - \alpha)100\%$ CI for $p$ is

$\hat{p} \pm Z_{\alpha/2} \sigma_{\hat{p}}$ where $\sigma_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}$

Note there is a slight difference between the formula for the standard deviation of the sampling distribution here and the one in Chapter 7. There we used

$\sigma_{\hat{p}} = \sqrt{pq/n}$

and here we use

$\sigma_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n}$

Chapter 7 was a chapter on probability theory. In probability theory we assume we know the parameter values, so in that chapter we assumed we knew $p$. This chapter is a chapter on statistics. We don't know the parameter values; all we have are point estimates, which we use as guesses for the parameter values. So we don't know $p$ here, we only know $\hat{p}$, which we will use as our best guess for $p$. So our test to see if the CLT will hold here is whether $n\hat{p} \ge 5$ and $n\hat{q} \ge 5$. Because we are using guesses here, we might be wrong, but there is no better alternative.

Example 8.7 We have commissioned another election poll. A sample of $n = 1000$ voters is asked whether they prefer candidate A or candidate B. Of these, 534 say they prefer A. We will consider a response for A to be a success. Find a 95% CI for $p$, the proportion of voters who prefer A.

$X = 534$, $n = 1000$

$\hat{p} = \dfrac{X}{n} = \dfrac{534}{1000} = 0.534$, $\hat{q} = 1 - \hat{p} = 0.466$

$\sigma_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.534)(0.466)/1000} = 0.0158$

$\alpha = 0.05$ so $Z_{\alpha/2} = Z_{0.025} = 1.96$

$\hat{p} \pm Z_{\alpha/2} \sigma_{\hat{p}} = 0.534 \pm (1.96)(0.0158) = 0.534 \pm 0.031 = [0.503,\ 0.565]$

Example 8.8 Radio station KFUD claims that 40% of the listeners in its receiving area listen to its 2:00PM music program. A sample of $n = 300$ radio listeners in this area is taken and 88 say they listen to KFUD at 2:00PM. Calculate a 99% CI for $p$. Does it seem likely that KFUD's claim is correct?
$X = 88$, $n = 300$

$\hat{p} = \dfrac{X}{n} = \dfrac{88}{300} = 0.293$, $\hat{q} = 1 - \hat{p} = 0.707$

$\sigma_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.293)(0.707)/300} = 0.0263$

$\alpha = 0.01$ so $Z_{\alpha/2} = Z_{0.005} = 2.58$

$\hat{p} \pm Z_{\alpha/2} \sigma_{\hat{p}} = 0.293 \pm (2.58)(0.0263) = 0.293 \pm 0.068 = [0.225,\ 0.361]$

The claimed value of 0.40 lies well outside this interval, so it is most unlikely that KFUD's claim is correct. I would be prepared to bet fairly big money that this claim is not true.

8.4 Confidence Intervals when $\sigma$ is unknown

Thus far we have assumed that the population standard deviation, $\sigma$, is known. This is a pretty silly assumption in practical work. Recall that the formula for the population variance is

$\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$

If we don't know the population mean, then it is unlikely that we will know the population standard deviation. Instead we will replace the population standard deviation with the best guess we have for it based on the sample data; that is, we will use the value produced by the best point estimator

$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

We can't just replace $\sigma$ with $s$ in the formula; we must also replace Z with a new distribution, the t-distribution.

A $(1 - \alpha)100\%$ confidence interval for $\mu$ when $\sigma$ is not known is

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$ where $s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$

So why do we do this? The value of $s_{\bar{X}}$ we get from a particular sample may be either larger or smaller than the value of $\sigma_{\bar{X}}$. When we say, for example, that we are 95% confident, we really ought to say that we are at least 95% confident that the population mean lies within the interval. Consider a betting analogy. Suppose that we want to construct an interval such that if we bet the population mean lies inside the interval, we will win at least 95% of the bets. Now if $s_{\bar{X}} > \sigma_{\bar{X}}$ there would be no problem: the resulting interval computed using $Z s_{\bar{X}}$ would still work. The interval would be bigger than that calculated using $Z \sigma_{\bar{X}}$, but we could still be at least 95% confident that the population mean would be in the interval. The problem arises when $s_{\bar{X}} < \sigma_{\bar{X}}$. In that case $Z s_{\bar{X}} < Z \sigma_{\bar{X}}$, and we would win something less than 95% of our bets. In statistics, the tendency is to make conservative statements, that is, to err on the side of safety.
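The betting argument can be illustrated with a small simulation. The sketch below is not part of the original development: it assumes a normal population with the same $\mu = 100$ and $\sigma = 51.02$ used earlier, builds intervals of the form $\bar{X} \pm 1.96\, s/\sqrt{n}$ (Z combined with the sample standard deviation), and counts how often they contain $\mu$. The function name, the trial counts, and the seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_using_z_with_s(n, trials, mu=100.0, sigma=51.02, seed=0):
    """Fraction of intervals xbar +/- Z * s / sqrt(n) that contain mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.975)    # about 1.96, for a nominal 95% interval
    wins = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        half_width = z * s / sqrt(n)
        if xbar - half_width <= mu <= xbar + half_width:
            wins += 1
    return wins / trials


coverage_small_n = coverage_using_z_with_s(n=5, trials=20000)
coverage_large_n = coverage_using_z_with_s(n=100, trials=5000)

print(f"n = 5:   won {coverage_small_n:.1%} of the bets")
print(f"n = 100: won {coverage_large_n:.1%} of the bets")
```

With a small sample the fraction of bets won falls noticeably short of 95%, while for a large sample it is close to 95%. That shortfall is what the wider t-based interval is meant to repair.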
So what we want to do is take care of the worst-case scenario: the one where we win less than 95% of the bets. We want something that will cause the interval to get bigger than it would be just using Z. We will replace Z with the t-distribution, which is designed just for that purpose.

df    t0.100    t0.050    t0.025    t0.010    t0.005
1     3.078     6.314     12.706    31.821    63.657
7     1.415     1.895     2.365     2.998     3.499

Table 8.4 A portion of the t-distribution.

Table 8.4 shows a part of the t-distribution. The symbol df stands for degrees of freedom. The concept of a degree of freedom is difficult and requires a fair degree of mathematical sophistication. We will not discuss what this concept is here. You must know, however, that

For the t-distribution, df = n - 1

That is, for the t-distribution the degrees of freedom is equal to the sample size minus one. The t-distribution values change as the sample size changes. The subscript indicates the area of the t-distribution that is to the right of the value in the body of the table. A plot of the t-distribution is shown in Figure 8.2.

If, for example, $n = 8$ so that $df = n - 1 = 8 - 1 = 7$, then the column labeled t0.100 gives a value of 1.415 where that column intersects the row with $df = 7$. This means that 10% of the area under the curve for the t-distribution is to the right of 1.415 and 90% is to the left. We could refer to this as $t_{0.100} = 1.415$.

If $df = 1$ and $t = t_{0.010} = 31.821$, then 1% of the area under the t curve will be to the right of 31.821 and 99% to the left. The t-distribution has mean zero and it is symmetric about zero, just like the standard normal. So if we needed to find values of t for which 98% of the distribution lies between them when $df = 1$, these values are $t = \pm 31.821$. If 1% of the area is to the right of 31.821, then 1% of the area will be to the left of -31.821 and 98% of the area will be between them.

Figure 8.2 The t-distribution (figure labels: alpha, t).
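The meaning of the table subscripts can be checked numerically. The sketch below is an illustration, not part of the original text: it integrates the standard Student t density with Simpson's rule (the function names and the step count are arbitrary choices) and computes the area to the right of each entry in the $df = 7$ row of Table 8.4.

```python
from math import gamma, pi, sqrt


def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    constant = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return constant * (1 + x * x / df) ** (-(df + 1) / 2)


def right_tail_area(t, df, steps=2000):
    """Area to the right of t >= 0: 0.5 minus the area between 0 and t.

    The area between 0 and t is found with Simpson's rule (steps must be even).
    """
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    return 0.5 - total * h / 3


# The df = 7 row of Table 8.4: column subscript -> tabled t value.
row_df7 = {0.100: 1.415, 0.050: 1.895, 0.025: 2.365, 0.010: 2.998, 0.005: 3.499}
areas = {alpha: right_tail_area(t, 7) for alpha, t in row_df7.items()}

for alpha, t in row_df7.items():
    print(f"t = {t}: area to the right = {areas[alpha]:.4f} (column t{alpha:.3f})")
```

Each computed area should agree with its column subscript up to the rounding of the tabled values, which is exactly what "the subscript indicates the area to the right" means.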
Note that the area given in the t-table is the area to the right of a point, not to the left as with the normal distribution.

Remark 1: If n is very large, the t and normal distributions are almost identical.

Remark 2: The t distribution assumes that the sample comes from a normal distribution. In practice this is often assumed as a matter of convenience. We will use that assumption in this class, except when we know that the binomial distribution is being used.

Remark 3: We still need to check to see if the CLT holds. If the distribution the sample comes from is normal, then the CLT will hold automatically. If it is not normal and n > 30 then we will say that the CLT holds (binomial excepted), but that we might not be really justified in using the t distribution. We will use it and hope for the best, however. The good news is that when n gets very large the CLT will almost surely hold and t will be so close to the normal that any errors we make will be small.

Example 8.9 A testing facility has a contract to provide an independent evaluation of the gasoline usage of a particular kind of automobile. Because the test must not be biased by any relationship with the car manufacturer, the testing facility must purchase its own cars. Because this is a very expensive proposition they decide to purchase only 25 cars ($n = 25$). They determine that the sample average gasoline usage for these cars is $\bar{X} = 31.25$ mpg and the sample standard deviation is $s = 2.65$ mpg. Find a 90% CI for the gasoline usage for these automobiles. Assume that gasoline usage is normally distributed.

$\bar{X} = 31.25$, $n = 25$, $s = 2.65$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{2.65}{\sqrt{25}} = 0.53$

$df = n - 1 = 24$

$t_{0.050} = 1.711$ (for a 90% CI we need 5% in each tail)

$\bar{X} \pm t_{0.050} s_{\bar{X}} = 31.25 \pm (1.711)(0.53) = 31.25 \pm 0.91$

Excel has a worksheet function for the t-distribution. The description from the Excel help facility is:

Returns the t-value of the Student's t-distribution as a function of the probability and the degrees of freedom.
Syntax

TINV(probability,degrees_freedom)

Probability is the probability associated with the two-tailed Student's t-distribution.
Degrees_freedom is the number of degrees of freedom to characterize the distribution.

Remarks

If either argument is nonnumeric, TINV returns the #VALUE! error value.
If probability < 0 or if probability > 1, TINV returns the #NUM! error value.
If degrees_freedom is not an integer, it is truncated.
If degrees_freedom < 1, TINV returns the #NUM! error value.
TINV is calculated as TINV = p( t<X ), where X is a random variable that follows the t-distribution.
A one-tailed t-value can be returned by replacing probability with 2*probability. For a probability of 0.05 and degrees of freedom of 10, the two-tailed value is calculated with TINV(0.05,10), which returns 2.228139. The one-tailed value for the same probability and degrees of freedom can be calculated with TINV(2*0.05,10), which returns 1.812462.

Note: In some tables, probability is described as (1-p).

TINV uses an iterative technique for calculating the function. Given a probability value, TINV iterates until the result is accurate to within ± 3x10^-7. If TINV does not converge after 100 iterations, the function returns the #N/A error value.

Example

TINV(0.054645,60) equals 1.96

Here is the bad news. Excel uses the area in both tails, not just the rightmost tail. For the problem we just worked we wanted a 90% CI, so we should have 10% in the tails (a two-tailed value). So we can find the t value that will give us 10% of the area in the tails by the command

1.711    =TINV(0.100,24)

The Excel solution for Example 8.9 is shown below.

X bar             31.25
s                 2.65
s xbar            0.53     =2.65/SQRT(25)
df                24
t-value           1.711    =TINV(0.1,24)
interval width    0.91     =1.711*0.53
CI Upper limit    32.16    =31.25+0.91
CI Lower limit    30.34    =31.25-0.91

Excel solution for Example 8.9

Problems

8.1 The Flagstaff Chamber of Commerce wants to attract a new retail business to Flagstaff.
They wish to impress on the retail establishment that Flagstaff is a prosperous city. They conduct a survey of Flagstaff homes to get an estimate of household income. Suppose that it is known that the standard deviation for the population of household incomes in Flagstaff is $5,000. The sample is conducted for $n = 100$ Flagstaff homes and gives a sample mean of $21,555. Find a 90% confidence interval for mean Flagstaff household income. Should the CLT hold? Compute by hand and also use Excel.

Given: $\bar{X} = 21555$, $\sigma = 5000$, and since we know the population standard deviation we can use Z. Because the distribution is continuous and $n > 30$ we will assume the CLT holds.

$\bar{X} \pm Z_{\alpha/2} \sigma_{\bar{X}}$

$\bar{X} = 21555$

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{5000}{\sqrt{100}} = \dfrac{5000}{10} = 500$

$Z_{\alpha/2} = Z_{0.05} = 1.65$

$\bar{X} \pm Z_{\alpha/2} \sigma_{\bar{X}} = 21555 \pm 1.65(500) = 21555 \pm 825 = [20730,\ 22380]$

X bar          21555
sigma          5000
n              100
sigma x bar    500            =5000/SQRT(100)
confidence     822.4265002    =CONFIDENCE(0.1,5000,100)
UCL            22377.4        =21555 + 822.4
LCL            20732.6        =21555 - 822.4

8.2 Work problem 8.1 assuming that the population standard deviation is not known, but that the sample standard deviation is s = 4,500. Assume that incomes are normally distributed. Then solve using Excel.

Because the data come from a normal distribution the CLT holds, and use of the t distribution is valid as well. We have to use t because we don't know the population standard deviation.

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$

$\bar{X} = 21555$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{4500}{\sqrt{100}} = \dfrac{4500}{10} = 450$

$t_{\alpha/2} = t_{0.05} = 1.671$ (table value used for 99 df; this is the df = 60 entry of standard t tables, while the exact value for 99 df is 1.660, as the Excel result below shows)

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}} = 21555 \pm 1.671(450) = 21555 \pm 752 = [20803,\ 22307]$

xbar      21555
n         100
s         4500
s xbar    450            = 4500/SQRT(100)
T         1.660391717    = TINV(0.1,99)
UCL       22302          = 21555 + 1.66*(450)
LCL       20808          = 21555 - 1.66*(450)

Note that TINV(alpha, df) works with the two-tailed probability (so that half of alpha is in each tail).

8.3 The College of Business Administration is conducting an economic impact study of the effect the University has on the Flagstaff community. One of the items of the study is student spending in the area.
Suppose that a sample of 400 students is taken to determine their spending habits in the town (exclusive of rent). The average spending of the students in the sample is $250 per month with a standard deviation of $60 per month. Find a 99% confidence interval for the mean town spending of all NAU students. Assume spending is normally distributed.

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$

$\bar{X} = 250$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{60}{\sqrt{400}} = \dfrac{60}{20} = 3$

$t_{\alpha/2} = t_{0.005} = 2.617$ (table value used for 399 df; this is the df = 120 entry of standard t tables, while the exact value for 399 df is 2.588, as the Excel result below shows)

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}} = 250 \pm 2.617(3) = 250 \pm 7.85 = [242.15,\ 257.85]$

xbar      250
n         400
s         60
s xbar    3           = 60 / SQRT(400)
t         2.588204    = TINV(0.01,399)
UCL       257.764     = 250 + 2.588 * (3)
LCL       242.236     = 250 - 2.588 * (3)

8.4 A computer chip manufacturer is interested in the proportion of defective central processor chips being produced. A sample of $n = 100$ chips is taken and 8 are found to be defective. Find a 95% confidence interval for the true proportion of defective chips being produced during operations.

$X = 8$, $n = 100$

$\hat{p} = \dfrac{X}{n} = \dfrac{8}{100} = 0.08$, $\hat{q} = 1 - \hat{p} = 0.92$

$n\hat{p} = 100(0.08) = 8 \ge 5$ and $n\hat{q} = 100(0.92) = 92 \ge 5$, so the CLT holds

$\sigma_{\hat{p}} = \sqrt{\hat{p}\hat{q}/n} = \sqrt{(0.08)(0.92)/100} = 0.0271$

$\alpha = 0.05$ so $Z_{\alpha/2} = Z_{0.025} = 1.96$

$\hat{p} \pm Z_{\alpha/2} \sigma_{\hat{p}} = 0.08 \pm (1.96)(0.0271) = 0.08 \pm 0.053 = [0.027,\ 0.133]$

8.5 The specifications for a certain assembly call for bolts with a pitch of 950 mm. A new shipment of bolts for this assembly arrives and $n = 100$ of them are taken for inspection. This sample gives a mean of 960 mm and a standard deviation of 10 mm. Find a 90% confidence interval for the mean pitch of the bolts in the shipment. Does it seem likely that the bolts in the shipment meet the specifications of the assembly? Assume that the bolt pitches are normally distributed.

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$

$\bar{X} = 960$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{10}{\sqrt{100}} = \dfrac{10}{10} = 1$

$t_{\alpha/2} = t_{0.05} = 1.671$ (table value used for 99 df; the exact value for 99 df is 1.660, as the Excel result below shows)

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}} = 960 \pm 1.671(1) = 960 \pm 1.671 = [958.33,\ 961.67]$

xbar      960
n         100
s         10
s xbar    1           = 10 / SQRT(100)
t         1.660392    = TINV(0.1,99)
UCL       961.66      = 960 + 1.66 * (1)
LCL       958.34      = 960 - 1.66 * (1)

Because the population is normally distributed, we can assume that the CLT holds and that we can use the t distribution.
Note that the CI does not include the specified mean of 950 mm. It is highly unlikely that the process is producing bolts with the desired mean.

8.6 A sample of $n = 20$ items is taken from a normal distribution. The sample results are $\bar{X} = 80$ and $s = 10$. Find a 90% confidence interval for $\mu$.

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$

$\bar{X} = 80$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{10}{\sqrt{20}} = 2.236$

$t_{\alpha/2} = t_{0.05} = 1.729$ (for 19 df)

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}} = 80 \pm 1.729(2.236) = 80 \pm 3.866 = [76.134,\ 83.866]$

xbar      80
n         20
s         10
s xbar    2.236067977    = 10/SQRT(20)
T         1.729131327    = TINV(0.1,19)
UCL       83.866044      = 80 + 1.729*(2.236)
LCL       76.133956      = 80 - 1.729*(2.236)

The distribution the sample was taken from is normal, hence the CLT holds and the use of the t distribution is permissible.

8.7 In examining the credit accounts of a department store, an auditor selected a random sample of 10 accounts and found that the average account error was $-\$37.00$ with a standard deviation of $\$15.00$ (these are sample results). Construct a $90\%$ confidence interval for the population mean. Assume that the accounting errors are normally distributed.

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}}$

$\bar{X} = -37$

$s_{\bar{X}} = \dfrac{s}{\sqrt{n}} = \dfrac{15}{\sqrt{10}} = 4.743$

$t_{\alpha/2} = t_{0.05} = 1.833$ (for 9 df)

$\bar{X} \pm t_{\alpha/2} s_{\bar{X}} = -37 \pm 1.833(4.743) = -37 \pm 8.69 = [-45.69,\ -28.31]$

xbar      -37
n         10
s         15
s xbar    4.74341649     = 15/SQRT(10)
T         1.833113856    = TINV(0.1,9)
UCL       -28.306081     = -37 + 1.833*4.743
LCL       -45.693919     = -37 - 1.833*4.743
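For readers who prefer a script to a spreadsheet, the solutions to Problems 8.4, 8.6 and 8.7 can be re-checked with the Python sketch below. It is an illustration under stated assumptions, not a replacement for the Excel solutions: the helper names are invented here, and because Python's standard library has no inverse t function, the t value is passed in by hand from the table (1.729 for 19 df, 1.833 for 9 df).

```python
from math import sqrt
from statistics import NormalDist


def t_interval(xbar, s, n, t_value):
    """Interval xbar +/- t * s / sqrt(n); t_value is looked up for df = n - 1."""
    half_width = t_value * s / sqrt(n)
    return xbar - half_width, xbar + half_width


def proportion_interval(successes, n, confidence=0.95):
    """Interval p_hat +/- Z * sqrt(p_hat * q_hat / n) for a proportion."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width


problem_8_4 = proportion_interval(8, 100, confidence=0.95)
problem_8_6 = t_interval(80, 10, 20, t_value=1.729)    # 19 df, 90% CI
problem_8_7 = t_interval(-37, 15, 10, t_value=1.833)   # 9 df, 90% CI

for name, (low, high) in [("8.4", problem_8_4), ("8.6", problem_8_6), ("8.7", problem_8_7)]:
    print(f"Problem {name}: [{low:.3f}, {high:.3f}]")
```

Passing the t value explicitly keeps the sketch dependency-free and mirrors the hand calculation, where the value is read from the table before the interval is computed.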