Chapter 6 Contents
- Introduction to estimation (section 6.1)
- Confidence Interval for a Population Mean – Normal (Z) Statistic (section 6.2)
- Confidence Interval for a Population Mean – Student's t Statistic (section 6.3)
- Large Sample Confidence Interval for a Population Proportion (section 6.4)
- Determining the Sample Size (section 6.5)
- Finite Population Correction for Simple Random Sampling (not covered)
- Confidence Interval for a Population Variance (not covered)

The problem of estimation
- Suppose we have a large population and we are interested in gaining some knowledge of the mean of the population
- Want to find the average income of a household in Michigan
- We draw a sample. Use the sample average as an estimate of the population mean
- In general we are interested in some parameter (e.g. mean, median, s.d., number of modes) of an unknown population
- Use a statistic to estimate the parameter (e.g. sample mean, sample median, sample s.d.)
- This would be called the Point Estimate

- The sample average just gives a number as the estimate of the unknown population mean
- The sample mean is called a Point Estimate of the population mean
- A point estimate is almost surely wrong
- If the sample average income is $23288, it is unlikely that the population average is exactly $23288
- More reasonable to say the average income is close to $23288
- Confidence intervals are an attempt to do this

Confidence intervals
- Suppose we have a population that can be modeled as Normal
- The population mean µ is not known
- Using a sample (of size 'n') propose a range of values (Confidence interval) for µ. Typically of the form X̄ ± 'Margin of Error'
- State a measure of accuracy of the proposed interval (Confidence Level). This is the probability that we will get a sample such that the unknown µ is in the proposed interval

Normal mean. Known standard deviation
- The basic model here is: we have a normal population whose mean µ is unknown but the s.d. σ is known
- We plan to draw a sample of size 'n' and calculate X̄ from the sample
- We know from the empirical rule that roughly 95% of the sample averages will be within $2\sigma_{\bar{X}}$ of $\mu_{\bar{X}}$
- Since $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, we can say
- roughly 95% of the sample averages will be within $\mu \pm 2\sigma/\sqrt{n}$
- or roughly 95% of the sample averages will be such that µ is within $\bar{X} \pm 2\sigma/\sqrt{n}$
- We will make all this a bit more general and precise

Recall
- If X is N(µ, σ), then $P(\mu - 1.96\sigma < X < \mu + 1.96\sigma) = P(-1.96 < Z < 1.96) = .95$
[Figure: standard normal curve with area 0.95 between −1.96 and 1.96]

confidence interval for µ
- $P(\mu_{\bar{X}} - 1.96\sigma_{\bar{X}} < \bar{X} < \mu_{\bar{X}} + 1.96\sigma_{\bar{X}}) = .95$
- since $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
- $P(\mu - 1.96\,\sigma/\sqrt{n} < \bar{X} < \mu + 1.96\,\sigma/\sqrt{n}) = .95$
- Rewriting the above using a bit of easy algebra
- $P(\bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n}) = .95$
- We now call $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ a 95% confidence interval for µ

- State the required confidence level, like 80%, 90%, 95%
- Write the confidence level as $1 - \alpha = \text{conf. level}/100$ and solve for α
- If conf. level is 90%, $1 - \alpha = .9$, $\alpha = 0.1$, $\alpha/2 = .05$
- From the standard normal table find $z_{\alpha/2}$ such that the area to the right of $z_{\alpha/2}$ is $\alpha/2$
- If conf. level is 90%, $z_{.05} = 1.65$ (1.645 to three decimals)
- The required confidence interval is $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$

Conf. interval for µ, σ unknown
- To recap
- A confidence interval for µ from a normal population is X̄ ± 'Margin of error'
- If the confidence level or confidence coefficient is 100(1 − α)%, then 'Margin of error' = $z_{\alpha/2}\,\sigma/\sqrt{n}$
- Typically σ is not known
- The sample s.d. (denoted by 's') serves as an estimate of σ
- If n is large (rule of thumb n ≥ 30), s is a reasonably accurate estimate of σ
- Use the confidence interval $\bar{X} \pm z_{\alpha/2}\,s/\sqrt{n}$

Interpretation of confidence interval
- We called $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ a 95% confidence interval for µ because
- $P(\bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n}) = .95$
- What is the random quantity in the above equation?
- X̄.
- It is X̄ that changes from sample to sample

Interpretation of confidence interval
- $P(\bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n}) = .95$
- In repeated sampling, in 95% of cases the conf. interval calculated from the sample will contain the unknown µ
- Say a sample of size 'n' is 'good' if the 95% conf. interval coming from this sample contains the unknown mean.
- Roughly 95% of the samples will be 'good'

Problem 6.4
tophat 3,13
- n = 90, x̄ = 25.9, s = 2.7
- σ unknown but n ≥ 30
- confidence coefficient = 95%. $1 - \alpha = .95$, $\alpha/2 = .025$
- $z_{.025} = 1.96$
- conf. interval = $25.9 \pm 1.96 \cdot 2.7/\sqrt{90} = 25.9 \pm .56$
- confidence coefficient = 90%. $1 - \alpha = .90$, $\alpha/2 = .05$
- $z_{.05} = 1.65$
- conf. interval = $25.9 \pm 1.65 \cdot 2.7/\sqrt{90} = 25.9 \pm .47$
- confidence coefficient = 99%. $1 - \alpha = .99$, $\alpha/2 = .005$
- $z_{.005} = 2.576$
- conf. interval = $25.9 \pm 2.576 \cdot 2.7/\sqrt{90} = 25.9 \pm .73$

Problem 6.11
- n = 307, x̄ = 3.11, s = .66
- (b) confidence coefficient = 98%, $1 - \alpha = .98$, $\alpha/2 = .01$
- $z_{.01} = 2.326$
- conf. interval = $3.11 \pm 2.326 \cdot .66/\sqrt{307} = 3.11 \pm .088 = (3.02, 3.20)$
tophat 24,32

variations
- The basic equation in the σ known case is
- Margin of Error = ME = $z_{\alpha/2}\,\sigma/\sqrt{n}$
- The equation connects three quantities: ME, confidence coefficient (through $z_{\alpha/2}$), n
- If any two are given, we can find the third

Find confidence interval
- We are given 'n' and the confidence coefficient, hence α and so $z_{\alpha/2}$
- Find confidence interval
- Equivalently find ME and set X̄ ± ME

Find Confidence Coefficient
- We are given n, ME
- Find Confidence coefficient
- ME = $z_{\alpha/2}\,\sigma/\sqrt{n}$; so $z_{\alpha/2} = \text{ME}\,\sqrt{n}/\sigma$
- Use normal table to determine $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha$

Example
Beechcraft, Inc. wants to estimate the average time it takes for the Beechjet corporation jet to climb from sea level to 41,000 feet. From previous experience, company engineers believe that the standard deviation of climbing time is 4 minutes. The model is tested in 100 random trials.
1. If the sample mean is 30 minutes, find an 80% confidence interval for the average climbing time from sea level to 41,000 feet.
2. If Beechcraft, Inc. uses 0.515 as the ME, what is the confidence level (in percentage) associated with the resulting confidence interval?
3. If Beechcraft, Inc. wants the 80% confidence interval for the mean to have width 0.55, find the required sample size.

- n = 100, σ = 4
- If the sample mean is 30 minutes, find an 80% confidence interval for the average climbing time from sea level to 41,000 feet.
- x̄ = 30, $1 - \alpha = .8$, $\alpha/2 = .1$, $z_{\alpha/2} = 1.28$
- confidence interval: $30 \pm 1.28 \cdot 4/\sqrt{100} = 30 \pm .512$

- If Beechcraft, Inc. uses 0.515 as the ME, what is the confidence level (in percentage) associated with the resulting confidence interval?
- ME = .515, so $z_{\alpha/2} \cdot 4/\sqrt{100} = .515$
- $z_{\alpha/2} = .515 \cdot \sqrt{100}/4 = 1.29$
- confidence level = $100\,P(-1.29 < Z < 1.29) = 80\%$

- If Beechcraft, Inc. wants the 80% confidence interval for the mean to be of width 0.55, so ME = 0.55/2 = .275, find the required sample size

Sample size determination
- Here the confidence coefficient is given and ME is given. Problem is to find 'n'
- ME = $z_{\alpha/2}\,\sigma/\sqrt{n}$
- $n = \left(z_{\alpha/2}\,\sigma/\text{ME}\right)^2$
- This is called the sample size determination problem
- Sometimes the accuracy is stated in terms of the width of the interval. Note ME = width/2.
- In the last problem $n = (1.29 \cdot 4/.275)^2 = 352.07$
- so 353 samples are required to ensure that a margin of error of .275 has 80% confidence

Relationship between n, α and ME
tophat 84
- ME = $z_{\alpha/2}\,\sigma/\sqrt{n}$
- With confidence level (hence z) fixed, ME decreases as n increases (the larger the sample, the narrower the interval)
- With ME fixed, as n increases, z increases; i.e. the confidence level increases
- With confidence level fixed, as ME decreases, n increases (need more samples to get a narrower interval)
tophat 11,16,95

σ unknown
- The basic model here is: we have a normal population whose mean µ and s.d. σ are both unknown.
- We want to draw a sample of size 'n' and get a confidence interval for the unknown mean µ
- Since σ is not known, we estimate it by the sample standard deviation 's'
- This gives us the "t-statistic" $t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$
- The distribution of the t-statistic is no longer normal. It has a distribution called Student's t-distribution with (n−1) degrees of freedom

Student's t-Statistic
The t-statistic has a sampling distribution very much like that of the z-statistic: mound-shaped, symmetric, with mean 0. The primary difference between the sampling distributions of t and z is that the t-statistic is more variable than the z-statistic.

Degrees of Freedom
The actual amount of variability in the sampling distribution of t depends on the sample size n. A convenient way of expressing this dependence is to say that the t-statistic has (n − 1) degrees of freedom (df).

Student's t Distribution
[Figure: standard normal density compared with t densities for df = 13 and df = 5; all are bell-shaped and symmetric, and the t densities have 'fatter' tails]

t-Table
t-value: If we want the t-value with an area of .025 to its right and 4 df, we look in the table under the column $t_{.025}$ for the entry in the row corresponding to 4 df. This entry is $t_{.025} = 2.776$. The corresponding standard normal z-score is $z_{.025} = 1.96$.

- Note the distribution depends (unlike the known σ case) on 'n'.
- The confidence interval for µ when σ is unknown is given by $\bar{X} \pm t_{(n-1),\,\alpha/2}\; s/\sqrt{n}$, where
- $t_{(n-1),\,\alpha/2}$ is found from the t-table
- 's' is the sample standard deviation
- Degrees of freedom: Before estimating σ by 's' the 'n' observations are used to compute X̄. If 'n' numbers are to have a fixed mean, then only (n−1) of these numbers can be arbitrary.

Given an interval X̄, s, find the confidence level
- Can answer with software
- With tables: only limited answers

Given confidence level and ME, find sample size
- Cannot give an answer like the known σ case
- One solution: Take a preliminary sample, estimate σ. Proceed with this estimate to decide on a sample for the second stage.
- We will not pursue this 'two-stage' sampling here

- The t-table (in the text) does not list all 'degrees of freedom'. So we will work with the closest in the table.
- As n → ∞, the t-distribution converges to the standard normal

Problems 13,29,32,106

Problem 13
- n = 1751, x̄ = 6563, s = 2484
- Find 90% confidence interval
- Since 'n' is large, we do not need the t-distribution. Can work with normal tables
- $1 - \alpha = .9$, $\alpha/2 = .05$, $z_{\alpha/2} = 1.645$
- confidence interval: $6563 \pm 1.645 \cdot 2484/\sqrt{1751} = 6563 \pm 97.65$

Problem 29
- n = 7, x̄ = 89.86, s = 11.63
- n is small and σ is unknown so cannot use normal approximation
- $1 - \alpha = .95$, $\alpha/2 = .025$, $t_{6,\,.025} = 2.447$
- confidence interval: $89.86 \pm 2.447 \cdot 11.63/\sqrt{7} = (79.10, 100.62)$

Problem 32
- n = 20, x̄ = 3.8, s = 1.2
- $1 - \alpha = .9$, $\alpha/2 = .05$, $t_{19,\,.05} = 1.729$
- confidence interval: $3.8 \pm 1.729 \cdot 1.2/\sqrt{20} = 3.8 \pm .464$
- The average LOS for women is 4.6. In this hospital it is less than 4.26. So women in this hospital have a smaller LOS.
tophat 40,41,46

Conf. Int. for PROPORTIONS
- We have a large population with two categories 'S' and 'F'
- p, the proportion of 'S' in the population, is unknown
- Want a confidence interval for p based on a "large" sample
- p̂: sample proportion
- For large n, p̂ is approximately $N\left(p, \sqrt{p(1-p)/n}\right)$
- The confidence interval is $\hat{p} \pm z\sqrt{p(1-p)/n}$
- Since this involves the unknown p, replace it by its estimate p̂: $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$
- z: from normal tables using the prescribed conf. level

Conditions Required for a Valid Large-Sample Confidence Interval for p
- A random sample is selected from the target population.
- The sample size n is large. (This condition will be satisfied if both np̂ ≥ 15 and n(1 − p̂) ≥ 15. Note that np̂ and n(1 − p̂) are simply the number of successes and number of failures, respectively, in the sample.)

Adjusted Confidence interval
- The confidence interval just discussed performs poorly if p is close to 0 or to 1
- A better confidence interval due to Agresti is $\tilde{p} \pm z_{\alpha/2}\sqrt{\tilde{p}(1-\tilde{p})/(n+4)}$, where $\tilde{p} = (x+2)/(n+4)$ and x is the number of successes in the sample.

sample size
- ME = $z\sqrt{p(1-p)/n}$
- $n = \left[z/\text{ME}\right]^2\,\hat{p}(1-\hat{p})$
- p̂ depends on the sample
- ME = width/2
- If we have a prior estimate of p use that in place of p̂
- If nothing is known about p, be conservative and use p = 1/2, i.e., $n = \left[z/\text{ME}\right]^2 \cdot \tfrac{1}{4}$

problems 43,45,54,73

Problem 43
- n = 225, p̂ = .46
- np̂ = 103.5, n(1 − p̂) = 121.5
- Both are larger than 15. So o.k. to use the method
- conf. interval: $z_{.025} = 1.96$, p̂ = .46, (1 − p̂) = .54
- $.46 \pm 1.96\sqrt{.46 \cdot .54/225} = .46 \pm .065$

Problem 45
- n = 2045, p̂ = 818/2045 = .4
- p̂ is approximately $N\left(.4, \sqrt{.4 \cdot .6/2045}\right) = N(.4, .011)$
- conf. interval: $z_{.025} = 1.96$, p̂ = .4, (1 − p̂) = .6
- $.4 \pm 1.96\sqrt{.4 \cdot .6/2045} = (.38, .42)$
- No. Zillow's claim falls outside the conf. interval

Problem 54
- The population: Senior HR executives
- The proportion of HR executives who believe that their managers interview too many people
- n = 502, p̂ = 211/502 = .42
- np̂ = 211, n(1 − p̂) = 291
- Both are larger than 15. So o.k. to use the method
- conf. interval: $z_{.01} = 2.326$, p̂ = .42, (1 − p̂) = .58
- $.42 \pm 2.326\sqrt{.42 \cdot .58/502} = .42 \pm .051$
- narrower

Problem 73
- preliminary estimate of p = 1/3
- ME = .01
- confidence level = 99%, so $z_{.005} = 2.576$
- $n = (2.576/.01)^2 \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} = 14746.2$, so n = 14747

Problem: What is the required sample size for determining the proportion of defective items in a process if the proportion is to be known within 0.05 with 90% confidence? No guess as to the population proportion is available.
- $n = (z_{.05}/.05)^2 \cdot \tfrac{1}{4} = 271$

A polling company wants to estimate the proportion of the population that will vote for party D in the next election. They want to do it with a margin of error of 3% and with 95% confidence. How large a sample should they take?
- $n = (z_{.025}/.03)^2 \cdot \tfrac{1}{4} = 1068$

Problem 63
A company believes its market share is about 14%. Find the minimum required sample size for estimating the actual market share to within 5% with 90% confidence.
- $n = (1.65/.05)^2\,(.14)(.86) = 131.1$, so n = 132
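
The sample-size rule for a proportion used in the last few problems, n = (z/ME)² p(1 − p) rounded up, can be sketched in Python. This is a minimal illustration and not part of the original slides: the function name `sample_size_for_proportion` is hypothetical, and the z-value is taken from the standard library's `statistics.NormalDist` instead of a printed table, so an answer may differ by one or two from a hand calculation that uses a rounded table value such as 1.65.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(me, conf_level, p=0.5):
    """Smallest n with z_{alpha/2} * sqrt(p(1-p)/n) <= me.

    Hypothetical helper for illustration. p defaults to 1/2, the
    conservative choice when nothing is known about the proportion.
    """
    alpha = 1.0 - conf_level
    # Upper alpha/2 point of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Round up: a fractional observation is not possible.
    return math.ceil((z / me) ** 2 * p * (1.0 - p))


# Polling example: ME = 3%, 95% confidence, no guess for p.
print(sample_size_for_proportion(0.03, 0.95))
# Defective-items example: ME = 0.05, 90% confidence, no guess for p.
print(sample_size_for_proportion(0.05, 0.90))
# Market-share example: ME = 0.05, 90% confidence, prior guess p = 0.14.
print(sample_size_for_proportion(0.05, 0.90, p=0.14))
```

The first two calls reproduce the slide answers 1068 and 271. The market-share call gives 131 because the quantile here is about 1.645; using the table value 1.65 by hand gives (1.65/.05)² (.14)(.86) ≈ 131.1, which rounds up to 132.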