CONFIDENCE INTERVALS I

ESTIMATION: the sample mean x̄ is an estimate of the population mean µ
• the point of sampling is to obtain estimates of population values
• Example: of 55 students in Section 105, 45 work: ps = 82%; for those who work, the mean number of hours is x̄ = 14.76
  - inference: 82% of ASU students work, and those who work average 14.76 hours a week; p = 0.82 and µ = 14.76

Problem: the sampling distribution is a continuous distribution ⇒ the probability that x̄ actually equals µ is zero
• x̄ is almost never exactly equal to µ, and we cannot even state the probability that x̄ is accurate
• x̄ is called a point estimate of µ
• statisticians generally prefer to give an interval estimate: "There is a 90% probability that µ is between 12 and 17.5."
• an interval estimate has two features
  - the estimate in interval form
  - a probability statement, taken as an assessment of the reliability or accuracy of the estimate
• the probability hinges on the probabilities found in the sampling distribution of x̄

INTUITION
Consider the sampling distribution of sample means for samples of size 100 drawn from a population of salaries in which µ = 33,000 and σ = 5,000
  - E(x̄) = 33,000
  - σx̄ = 5,000 ÷ √100 = 500
In the z table, find the z values that demarcate the middle 95% of a normal distribution: z = ±1.96
The interval µ − 1.96 × σx̄ to µ + 1.96 × σx̄ contains 95% of the sampling distribution; that is, it contains 95% of all the possible x̄'s that could ever be drawn from this population
• the interval noted is 33,000 ± 1.96 × 500, or 33,000 ± 980
• any sample mean in this interval differs from the actual population mean by no more than 980
⇒ 95% of all possible sample means differ from µ by no more than 980.
⇒ for any x̄, there is a 95% probability that it differs from µ by no more than 980; that is, by no more than the amount 1.96 × σx̄
• choose a sample and calculate x̄; now consider an interval of the form x̄ ± 1.96 × σx̄
• necessarily, the population mean µ lies within the limits of 95% of all such intervals
⇒ there is a 95% probability that the population mean lies within the limits of any interval of the form x̄ ± 1.96 × σx̄

CALCULATING CONFIDENCE INTERVALS FOR THE POPULATION MEAN
A C% confidence interval for the population mean is an interval of the form
    x̄ ± zC × σx̄
where zC is chosen so that C% of the normal distribution lies within the interval −zC to +zC.
  - C is the confidence level
  - zC is found as a z value such that −zC to +zC incorporates the middle C% of a normal distribution; ±zC demarcates a symmetric interval which has area C

Example: Find the appropriate z value for a 92% confidence interval
• the interval must be symmetric: taking out the middle 92% leaves 8% to be split between the upper and lower tails.
• The required z values demarcate the lower 4% and the lower 96% of the z distribution.
• Alternatively, let L = (100 − C)/2; here L = (100 − 92)/2 = 4%, or 0.04. In the cumulative z table find area 0.0400. The closest entry is 0.0401; reading back to the margins, z = −1.75; therefore the required zC = 1.75
Check by finding a z value such that 96% of the distribution is less than that value.

Examples:
• A population of Christmas trees has unknown µ, but it is known that the population is normally distributed with σ = 4. A sample of 25 trees has x̄ = 16.6. Find a 95% confidence interval for the mean height of the population.
  - given n = 25, so that σx̄ = σ ÷ √n = 4/5 = 0.8
  - L = (100 − C)/2 = 5/2 = 2.5%, or 0.025. From the z table, 0.025 of the z distribution is less than −1.96, so zC = 1.96.
  - applying the formula above, x̄ ± zC × σx̄:
        16.6 ± 1.96 × 0.8
        16.6 ± 1.568, or the interval 15.032 to 18.168
  Stating the interval:
  - "A 95% confidence interval for the population mean is 16.6 ± 1.568"
  - "We are 95% confident that the population mean is in the interval 15.032 to 18.168."
  - "There is a 95% probability that µ is at least 15.032 but no more than 18.168."
• Find a 90% confidence interval for the same population, same sample
  - find z90 by reference to the z table
      L = (100 − C)/2 = 5%, or 0.05
      the nearest entry is 0.0505 ⇒ z = −1.64, so z90 = 1.64
  - then we have 16.6 ± 1.64 × 0.8 = 16.6 ± 1.312, or the interval 15.288 to 17.912
• For a sample of 64 drawn from this population, we got the same x̄. Find a 90% confidence interval for the population mean.
  - since n = 64, σx̄ = 4 ÷ √64 = 0.5
  - confidence interval: 16.6 ± 1.64 × 0.5 = 16.6 ± 0.82, or the interval 15.78 to 17.42

Messages:
• the width of the confidence interval varies in the same direction as the confidence level: in our first example (95%), width = 18.168 − 15.032 = 3.136, while in the second example (90%), width = 17.912 − 15.288 = 2.624
  - the width of the interval is 2 × zC × σx̄; it measures the precision of the estimate (the narrower the interval, the more precise the estimate)
  - there is a trade-off between precision and confidence
  - common sense: for very wide intervals, we can be quite confident that we've captured µ, but as the interval narrows, the probability that it includes µ drops to zero
• As the sample size increases, precision increases at the same level of confidence: the third interval above (n = 64) has width 1.64, against 2.624 for n = 25
  - with a sufficiently large sample, we can achieve whatever combination of confidence and precision we desire
  - as n increases, σx̄ decreases

FINDING THE RIGHT SAMPLE SIZE
The distance e = zC × σx̄ is the error in the estimate
  - e is one-half the width of the confidence interval
  - within the limits of our confidence statement, we are sure that the population mean differs from the sample mean by no more than e: we might say we're 90% confident that the true mean differs from
the sample mean by no more than e. Hence, e is the maximum error in the estimate
Suppose that there is some maximum tolerable value for e, or maximum tolerable error
• for a given confidence level, the value of n necessary to keep e within tolerable limits follows from
      e = zC × σ/√n
  solve for n to find
      n = (zC × σ / e)²
• for a given z, chosen for the appropriate confidence level, this formula gives us the sample size necessary to achieve an error of no more than e
• in general, the result of this calculation is not an integer, so the rule is to make the sample size equal to the next larger integer.
NOTE CAREFULLY: This refers to the maximum tolerable error in the sampling procedure, or in the estimate of µ, NOT to the tolerance in a manufacturing process.

Examples:
• Cigarette filters are supposed to have µ = 15 mm in length; σ = 0.1 mm. Machinery will jam if the length of a filter exceeds 15.3 mm, and the probability of such a filter increases as the mean length increases, so we must have an accurate estimate of the mean length of filters. Let us require e ≤ 0.01 mm and 90% confidence intervals. How large must n be?
      n = (zC × σ / e)² = (1.64 × 0.1 / 0.01)² = 268.96
  the next greater integer is the required sample size (that is, ALWAYS round n upwards in these problems); here n = 269
• Ordering T-shirts to give to contestants in a road race; average chest size is unknown, but for all chests everywhere σ = 4 in. Measure a sample of the participants when they register, and require that the sample be accurate to within ±1.5 in. How large must the sample be to have 99% confidence in the result?
  - first, z99 = ? (from the z table, 2.58)
  - n = (2.58 × 4 / 1.5)² = 47.33
  - rounding upwards, we require n = 48

CONFIDENCE INTERVALS II: σ UNKNOWN

WHEN TO USE A z VALUE IN CONSTRUCTING CONFIDENCE INTERVALS
To this point, we have assumed the population standard deviation known.
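As a recap of the σ-known calculations so far, here is a minimal Python sketch. The function names (z_for_confidence, z_interval, sample_size_for_mean) are illustrative and not part of the notes. The z values are rounded to two decimals to mimic reading a printed z table, so the results match the worked examples above; an unrounded z would give slightly larger sample sizes.

```python
import math
from statistics import NormalDist


def z_for_confidence(c):
    """Return zC so that the middle c (e.g. 0.95) of the standard normal lies in -zC..+zC."""
    lower_tail = (1 - c) / 2              # L in the notes, as a decimal fraction
    return -NormalDist().inv_cdf(lower_tail)


def z_interval(xbar, sigma, n, z):
    """Interval xbar ± z × σ/√n for a known population σ."""
    e = z * sigma / math.sqrt(n)          # maximum error e = zC × σx̄
    return xbar - e, xbar + e


def sample_size_for_mean(sigma, e, z):
    """n = (zC × σ / e)², always rounded up to the next integer."""
    return math.ceil((z * sigma / e) ** 2)


# Two-decimal z values, as they would be read from a z table (an assumption
# made so the output agrees with the hand calculations in the notes).
z95 = round(z_for_confidence(0.95), 2)
z90 = round(z_for_confidence(0.90), 2)
z99 = round(z_for_confidence(0.99), 2)

low, high = z_interval(16.6, 4, 25, z95)        # Christmas trees, 95%
print(round(low, 3), round(high, 3))            # 15.032 18.168
print(sample_size_for_mean(0.1, 0.01, z90))     # cigarette filters: 269
print(sample_size_for_mean(4, 1.5, z99))        # T-shirts: 48
```

Everything in this sketch presumes that σ is known.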
If σ is NOT known:
• Population is normally distributed and σ NOT known ⇒ the sampling distribution of x̄, standardized with s in place of σ, is NOT normal but rather conforms to Student's t distribution
• If the population is NOT normal and σ is NOT known but the sample is large (that is, n ≥ 30), then the sampling distribution of x̄ approximates the t distribution
In either of these cases, s, the sample standard deviation, estimates σ.
RECALL: s² = Σ(x − x̄)²/(n − 1) and s = √s²
The standard error of the mean is estimated by sx̄ = s/√n
Confidence intervals have the form
    x̄ ± tC × sx̄
The t values used here are numbers of standard deviations; in this case, numbers of standard deviations on a t distribution

CHARACTERISTICS OF THE t DISTRIBUTION
• Continuous
• Symmetric
• Values near the mean are more probable than values further out, so that the t distribution looks like a bell-shaped curve.
How is that any different from a normal distribution?
1. the t distribution has fatter tails and less mass in the center
   • the probability of falling within a given number of standard deviations of the mean is higher on the normal distribution than on a t distribution
   • put another way, a given probability level will be further from the center of a t distribution than from the center of the normal distribution
   • or, a given probability level will be more standard deviations (t values) away from the mean than would be the case on a normal distribution
   Note: t values will always be larger than z values for the corresponding confidence level ⇒ intervals constructed with t will always be wider (less precise) than those constructed with z
2. there is not one t distribution but a large number, depending on the number of "degrees of freedom"

Digression: the concept of degrees of freedom
• Mechanically df = n − k, where n is the sample size and k the number of parameters that must be estimated from the sample before estimating the standard deviation
  - for example: s, the sample standard deviation, is an estimate of σ. To calculate s, we must estimate µ.
    µ is estimated by x̄, and x̄ is the only statistic we must calculate before we can calculate s. We must thus estimate one parameter, µ, before deriving an estimate of σ, and there are thus n − 1 degrees of freedom in our estimate of σ
• more generally, degrees of freedom represents the number of independent (in the probability sense) random variables in a problem
  - in calculating s we must use x̄. Suppose we are given x̄ and n − 1 of the values in the sample; then the n-th value is already determined and can be derived from what we know

The t distribution: pages E-7 and E-8 in your textbook
How to read the table
• Upper-tail (α) values across the top are the area in one tail of the distribution
• for a confidence interval, use an upper-tail value corresponding to the area in one tail of the distribution
  - this will be only half the difference between the confidence level and 1
    For example: in preparing a 95% confidence interval, there will be 5% in the tails of the distribution, thus 0.025 in each tail: we should use a t value for upper-tail area 0.025 and the appropriate number of degrees of freedom
  - if C is the confidence level, expressed as a decimal fraction, use α = (1 − C)/2
• degrees of freedom are in the left-hand column
  as df → infinity, the t value → the z value
Examples:
• Find the appropriate t value for 20 degrees of freedom and a 90% confidence interval.
  α = (1 − 0.9)/2 = 0.05 ⇒ t = 1.7247
• For a sample of size 37, find the t value for a 99% confidence interval
  d.f. = n − 1 = 36; α = (1 − 0.99)/2 = 0.005 ⇒ t = 2.7195

CONFIDENCE INTERVAL FOR µ WITH NORMAL POPULATION AND σ UNKNOWN
Problem requires use of t with n − 1 degrees of freedom. Confidence intervals will have the form
    x̄ ± tC × sx̄     (tC taken with n − 1 d.f.)
where sx̄ = s/√n, s being the sample standard deviation
note the similarity to the earlier confidence intervals

Examples:
• 7 male students are selected at random and an alcoholic beverage is poured down them in tenth-ounce increments until distinct signs of non-sobriety are observed. The following results were obtained:

    Individual    Amount of Beverage (oz)
        1                 3.7
        2                 2.9
        3                 3.2
        4                 4.1
        5                 4.6
        6                 2.3
        7                 2.5

  Researchers feel safe in assuming that the distribution of ounces until non-sobriety is normal in the population. Construct a 95% confidence interval for the amount of drink it takes to get the average member of the population drunk.
  - calculate x̄ and s:
      x̄ = 3.329
      s² = Σ(x − x̄)²/(n − 1) = [(3.7 − 3.329)² + … + (2.5 − 3.329)²] ÷ (7 − 1) = 0.7157
      s = √0.7157 = 0.846
  - calculate sx̄ = s/√n = 0.846/√7 = 0.846/2.646 = 0.3198
  - find the appropriate t value: for C = 0.95 and 6 d.f., t = 2.4469
  - multiply sx̄ by the t value: 2.4469 × 0.3198 = 0.7824
      x̄ ± t × sx̄ = 3.329 ± 0.7824, or the interval 2.546 to 4.111
• Each of 9 cars in a sample is driven 20,000 miles, the gallons of fuel used recorded, and the fuel mileage calculated. For the sample, mean fuel mileage x̄ = 34.6 and s = 1.2. Assuming that fuel mileage is normally distributed, find a 90% confidence interval for the mileage to be expected from all cars of this make.
  - sx̄ = s/√n = 1.2/3 = 0.4
  - α = (1 − 0.9)/2 = 0.05 and d.f. = n − 1 = 8 ⇒ t = 1.860
      x̄ ± t × sx̄ = 34.6 ± 1.86 × 0.4
      34.6 ± 0.744, or the interval 33.856 to 35.344
• In a sample of 41 students who work, x̄ = 16.561 and s = 5.7128. Find a 95% confidence interval for the average hours worked by all ASU students who work.
  - sx̄ = s ÷ √n = 5.7128 ÷ √41 = 0.892189
  - for 40 degrees of freedom, t95 = 2.0211
  - confidence interval: 16.561 ± 2.0211 × 0.892189, or 16.6 ± 1.8
• We wish to establish the average weight of a population of turkeys; we have chosen a sample of 36, weighed them and have the following results:

    18 13  6  7 26  8 20 12 22 10 19 11
     7 12 14 22 11 21 11 12 18 14  8 16
     9 18 16 17 13 14 21 16 11 15 10 15

  Construct a 98% confidence interval for the population mean of these turkeys
  - first, find tC for 35 d.f. and α = 0.01: tC = 2.438
  - next, find x̄ and s: x̄ = 14.25, s = 4.90
  - find sx̄ = s/√n = 4.90/6 = 0.8169 (computed from the unrounded s)
      x̄ ± tC × sx̄
      14.25 ± 2.438 × 0.8169
      14.25 ± 1.99, or 12.26 to 16.24

SAMPLING DISTRIBUTIONS FOR SMALL SAMPLES
The t distribution is often thought of as primarily of value with small samples
• it applies whenever the population is known to be normal and σ is unknown, no matter how small n
  footnote: who was "Student"? A pseudonym for William Gosset, a brewer at the Guinness brewery in Dublin concerned with controlling biochemical processes in brewing
• with large samples, if the population is not normal, we must rely on the Central Limit Theorem
  and many statisticians and other practitioners will use z procedures with any sample of 30 or more: this is especially prevalent in older practice
Another possibility:
• the sample is small, so that the CLT does not apply
• the population is not normally distributed, or the distribution is unknown
The safest course is to take a larger sample and rely on the CLT
The following schematic may be used to determine the proper distribution to use in constructing confidence intervals.

    Population standard deviation known?
        Yes: Population normal?
            Yes: z value
            No:  Sample size?
                n >= 30: z or t (see note 1)
                n < 30:  ERROR (see note 2)
        No: Population normal?
            Yes: t value
            No:  Sample size?
                n >= 30: z or t (see note 1)
                n < 30:  ERROR (see note 2)

NOTES:
1. For a non-normal population and large samples, different practitioners may proceed differently.
   Some argue that the Central Limit Theorem justifies use of a z value in this case, while others feel that it is more appropriate to use a t value since that gives a more cautious estimate (a wider, less precise confidence interval). For purposes of this course, use a t in such cases.
2. For small samples from non-normal populations: there are techniques which can be used to derive an interval estimate in this case, but they are beyond the scope of this course.

CONFIDENCE INTERVALS III

CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION
Purpose: to use the sample proportion, ps, as the basis of an interval estimate of the population proportion p
Reminders:
• the sample proportion is ps = x/n
• the sampling distribution of ps has parameters
      E(ps) = p
      σps = √[p × (1 − p)/n]
• for sufficiently large n, ps is approximately normally distributed, so that probabilities are found by reference to the z table
• typically p is unknown, so that we must estimate σps by
      sps = √[ps × (1 − ps)/n]
A confidence interval for p then will have the form
    ps ± zC × sps

Examples:
• Of 55 students in a sample, 45 work. Construct a 95% confidence interval for the proportion in the population who work.
  - ps = 45/55 = 0.82
  - sps = √[0.82 × (1 − 0.82)/55] = 0.0518
  - zC = 1.96
  - confidence interval: 0.82 ± 1.96 × 0.0518 → 0.82 ± 0.10
  We are 95% confident that in the population somewhere between 72% and 92% work.
• In a sample of 800 North Carolinians, 51% express the intention to vote for Jesse Helms in the next election. Find a 98% confidence interval for the proportion in the population who intend to vote for Helms.
  - ps = 51%; sps = √[(51 × 49)/800] = 1.76741 (working in percentage points)
  - then we have 51 ± 2.33 × 1.76741 = 51% ± 4.12%, or the interval 46.9% to 55.1%
  - from this, we can say, strictly and properly, "We are 98% sure that the proportion in the population who intend to vote Helms is within 4.12 percentage points of 51%." or, as we might loosely and a bit improperly put it, "Our survey shows that 51% of the population intend to vote Helms, and this result is accurate to within plus or minus 4%."
  - the election is a toss-up or "too close to call": the interval 46.9% to 55.1% includes 50%
• Suppose the same result with a sample of size n = 1600
  - sps = 1.2497, and sps × z = 1.2497 × 2.33 = 2.9%
  - the confidence interval would be 51% ± 3%
  POINT: is the minor increase in precision worth the extra cost?

FINDING THE NECESSARY SAMPLE SIZE IN PROPORTION PROBLEMS
Since we have ± zC × sps, the estimate ps differs from p by at most that amount (at the stated confidence level)
• substituting the definition of the standard error, the error is at most zC × √[p × (1 − p)/n]
• notice the use of p in the above expression; the concepts advanced here involve what we know about the sampling distribution before sampling begins
• for a given confidence level, this error can be reduced by increasing n
  in the last example above, we noted that doubling the sample size would reduce the error from 4% to 3%
• suppose we require e < 0.01, that is, accuracy to within ±1%. How large must n be?
  the maximum error in the estimate:
      e = zC × √[p × (1 − p)/n]
  solve for n, giving
      n = [p × (1 − p) × zC²] / e²
• A major problem: p, the population proportion, is unknown
  - solution 1: assume p = 0.5
      this will give the largest possible value for n, since p × (1 − p) reaches a maximum when p = 0.5
      may result in an unnecessarily large and expensive sample
  - solution 2: use other information
      do a pilot study on a small sample and use the resulting ps to estimate p
      previous experience or knowledge of other populations may give an approximate value for p
      lacks the certainty of solution 1, but may result in a somewhat smaller sample

Examples:
• applying the formula above to solution 1, with z98 = 2.33, we have
      n = [0.5 × (1 − 0.5) × 2.33²] / 0.01² = 13,572.25, or n = 13,573 after rounding up
  this is the sample size necessary to be sure, whatever the value of p, that a 98% confidence interval is accurate to within ±1%
• In the work example above, a 95% confidence interval and a sample of 55 gave accuracy of ±0.10. What sample size is necessary to hold the error to ±0.015 (1.5%)?
  - solution 1: n = [(0.5 × 0.5) × 1.96²] ÷ 0.015² = 4268.44; taking the next greater integer, we have 4269
  - solution 2: for n = 55, we had ps = 0.82. Take that as an estimate of the unknown p. Then
    n = [(0.82 × 0.18) × 1.96²] ÷ 0.015² = 2520.09; rounding upwards, we have 2521
  using the pilot-study approach reduces the required sample size by 1,748 and might save a considerable amount of money

A footnote: in most proportion problems, it doesn't matter whether you use percentages or decimal fractions, as long as you keep them straight. In the sample-size formula above, however, you must use decimal fractions. To use percentages, substitute 100 for 1, so the formula becomes
    n = [p* × (100 − p*) × z²] ÷ e*²
where p* and e* are defined as percentages.

THE z VS. THE t DISTRIBUTION
In constructing confidence intervals, use the z distribution whenever
• the population standard deviation σ is known AND the population is known to be normally distributed
• you wish to calculate a confidence interval for a proportion
  - rule of thumb: n × p ≥ 5 AND n × (1 − p) ≥ 5 for a sufficiently accurate approximation
In constructing confidence intervals, use the t distribution
• if the population is known to be normally distributed AND the population standard deviation σ is UNKNOWN: this holds for any sample size
• if the population's distribution is NOT normal AND the sample size is at least 30 AND the population standard deviation σ is UNKNOWN
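
The σ-unknown and proportion calculations from the last two parts can be sketched in Python along the following lines. This is an illustration under stated assumptions, not a definitive implementation: the function names (t_interval, proportion_interval, sample_size_for_proportion) are hypothetical, and because the sketch sticks to the standard library, the t and z values are passed in as read from the tables rather than computed.

```python
import math
from statistics import mean, stdev


def t_interval(sample, t):
    """x̄ ± t × s/√n, with t read from a t table for n - 1 d.f."""
    n = len(sample)
    xbar = mean(sample)
    se = stdev(sample) / math.sqrt(n)    # stdev uses the n - 1 divisor
    return xbar - t * se, xbar + t * se


def proportion_interval(x, n, z):
    """ps ± z × √[ps(1 - ps)/n] for x successes in a sample of n."""
    ps = x / n
    se = math.sqrt(ps * (1 - ps) / n)
    return ps - z * se, ps + z * se


def sample_size_for_proportion(p, e, z):
    """n = p(1 - p) z² / e², rounded up; p and e are decimal fractions."""
    return math.ceil(p * (1 - p) * z ** 2 / e ** 2)


# Ounces-until-non-sobriety data from the notes; t = 2.4469 for 95%, 6 d.f.
ounces = [3.7, 2.9, 3.2, 4.1, 4.6, 2.3, 2.5]
t_low, t_high = t_interval(ounces, 2.4469)
print(round(t_low, 3), round(t_high, 3))                # 2.546 4.111

# 45 of 55 students work; z = 1.96 for 95% confidence.
p_low, p_high = proportion_interval(45, 55, 1.96)
print(round(p_low, 2), round(p_high, 2))                # 0.72 0.92

# Sample size for e = 0.015 at 95%: worst case p = 0.5, then pilot p = 0.82.
print(sample_size_for_proportion(0.5, 0.015, 1.96))     # 4269
print(sample_size_for_proportion(0.82, 0.015, 1.96))    # 2521
```

Passing the table value in explicitly keeps the choice between z and t visible at the call site, which is the decision the rules above are meant to guide.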