Objectives 6.1, 7.1
Estimating with confidence (CIS: Chapter 10)
• Statistical confidence (CIS gives a good explanation of a 95% CI)
• Confidence intervals. Further reading: http://onlinestatbook.com/2/estimation/confidence.html
• Choosing the sample size
• t distributions. Further reading: http://onlinestatbook.com/2/estimation/t_distribution.html
• One-sample t confidence interval for a population mean
• How confidence intervals behave

Overview of Inference
• Sample ≠ population, and sample mean x̄ ≠ population mean µ. But we do not know the value of µ, so if we want to draw any conclusions about µ we have to use x̄ to do so.
• Methods for drawing conclusions about a population from sample data are called statistical inference.
• There are two main types of inference:
  ◦ Confidence intervals: estimating the value of a population parameter, and
  ◦ Tests of significance: assessing evidence for a claim (hypothesis) about a population.
• Inference is appropriate when data are produced by either
  ◦ a random sample or
  ◦ a randomized experiment.

Introducing confidence intervals
• It is very unlikely that the sample mean will ever exactly equal the true mean. Our aim is to construct an interval around the sample mean which is 'likely' to contain the true mean. This is called a confidence interval.
• In 2012, a Gallup poll estimated the proportion of the electorate that would vote for Obama. Gallup predicted that the Obama vote would be in the interval [45%, 51%] with 95% confidence.
• The Obama vote turned out to be 50.5%, so the interval did capture the true proportion.
• You may be asking yourself how to understand the 95%: since 50.5% lies in this interval, there does not appear to be any uncertainty in it.
• In the next few slides, our objective is to understand how a confidence interval is constructed and how to interpret it.

Review: properties of the sample mean
The sample mean x̄ is a unique number for any particular sample.
If you had obtained a different sample (by chance) you almost certainly would have had a different value for your sample mean. In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, µ.
• In Chapter 4, we learnt that if a random variable is normally distributed with mean µ and standard deviation σ, then with 95% probability it lies in the interval
\[ [\,\mu - 1.96\,\sigma,\ \mu + 1.96\,\sigma\,]. \]
• Now our focus is on the sample mean. It has mean µ and standard error σ/√n (Chapter 5), so with 95% probability it lies in the interval
\[ \left[\,\mu - 1.96\,\frac{\sigma}{\sqrt{n}},\ \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\,\right]. \]
• But the mean is unknown; our objective is to locate the true mean based on the sample mean.
• To do this we turn the story around. Saying that the sample mean lies in the interval
\[ \left[\,\mu - 1.96\,\frac{\sigma}{\sqrt{n}},\ \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\,\right] \]
is the same as saying that the mean µ lies in the interval [sample mean − 1.96×σ/√n, sample mean + 1.96×σ/√n].
• Thus 95% of the time, the true mean (that we want to estimate) will be in the following interval (this is called a confidence interval):
\[ \left[\,\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\,\right], \quad \bar{x} = \text{sample mean (average)}. \]

Case 1: Normal data – sample size one
• Human heights are approximately normally distributed. The standard deviation of human height is 3.8 inches.
• Our objective is to construct a confidence interval for the mean height.
• We start with the less than ideal situation that we only have a sample of size one (just one observation!). In this case the standard error is 3.8/√1 = 3.8 (the regular standard deviation).
• We know that the observation is normally distributed, so it is straightforward to construct the 95% confidence interval for the mean height using just one randomly selected height: [height − 1.96×3.8, height + 1.96×3.8] = [height − 7.45, height + 7.45].
• Construct an interval using your height. A large amount of data on heights has been collected and it is known that the mean height of a person is about 67 inches.
Does your interval contain the mean? Most of your intervals will contain the mean, 67 inches. Those of you whose height is in the extremes (very tall or small, more than 1.96 standard deviations from 67) will have an interval that does not contain 67 inches.

Because the sampling distribution of x̄ is narrower than the population distribution, by a factor of √n, the estimates x̄ tend to be closer to the population parameter µ than individual observations are.
[Figure: distribution of sample means x̄ of n subjects (standard deviation σ/√n) drawn over the population distribution of individual subjects x (standard deviation σ); both are centered at µ.]
If the population is normally distributed N(µ, σ), the sampling distribution is N(µ, σ/√n).

Case 1: Normal data – sample size three
• Again we estimate the mean human height, but this time from a random sample of three people. Recall, the standard deviation of human height is 3.8 inches.
• If the sample size is 3, the standard error of an average based on three is 3.8/√3 = 2.19.
• As each randomly selected height is normally distributed, so is the average based on three (recall Chapter 5):
\[ \bar{X} \sim N\!\left(\underbrace{\mu}_{??},\ \frac{3.8}{\sqrt{3}}\right) \]
• The 95% confidence interval is
\[ \left[\,\bar{X} - 1.96 \times \frac{3.8}{\sqrt{3}},\ \bar{X} + 1.96 \times \frac{3.8}{\sqrt{3}}\,\right]. \]
Given any random sample of size three we take its average and plug it in.

Here we illustrate the height example.
• In the shot on the right we draw a sample of size three from the population of all heights. The average (sample mean) is evaluated.
• This average corresponds to one of the green dots on the lower right plot. The green line is the confidence interval centered about the average.
• We did this 100 times and 96 of the intervals contain the true mean 67.
If the sample mean is normally distributed, 100 samples are drawn, and a 95% CI is evaluated for each sample, about 95 of the intervals would contain the true mean of 67. In reality we only have one CI; we are 95% confident it contains the mean.
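The repeated-sampling experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the slides: the function name `coverage_known_sigma` and the seed are made up for the example, and the population is assumed to be exactly N(67, 3.8) with σ treated as known.

```python
import math
import random


def coverage_known_sigma(n, mu=67.0, sigma=3.8, reps=100, z=1.96, seed=0):
    """Fraction of `reps` intervals  xbar +/- z*sigma/sqrt(n)  that contain mu.

    Assumes heights are N(mu, sigma) and that sigma is known (hypothetical setup
    mirroring the height example).
    """
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)  # margin of error, same for every sample
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # 100 intervals, as on the slide: the count will vary around 95.
    print(coverage_known_sigma(n=3, reps=100))
    # Many more intervals: the proportion settles close to 0.95.
    print(coverage_known_sigma(n=3, reps=100_000))
```

With only 100 intervals the observed proportion fluctuates from run to run (96 on the slide is typical); with many more repetitions it should settle near 0.95, which is what "95% confidence" refers to.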
Observations
• We see that the length of the confidence interval when using just one person in the sample is 2×1.96×3.8 = 14.90. This is quite long, and does not really allow us to pinpoint the mean.
• Whereas the length of the confidence interval using three people is only 2×1.96×3.8/√3 = 14.90/√3 = 8.60.
• If ten people were used to calculate the sample mean the corresponding interval length would be 14.90/√10 = 4.7.
• We see that for any given interval either the mean is in this interval or not. The 95% comes into play when we look at the proportion of intervals that contain the mean.
• In reality:
  ◦ We do not know the true mean µ, so we will never know whether the interval contained the mean or not.
  ◦ We only observe one sample of size n, and thus have one CI.
  ◦ This is why we say with 95% confidence the mean lies in it.

Case 2: Skewed data – sample size 3
• In the previous example we looked at height data, which tends to be normal. In this example we consider right skewed data, which is NOT normal. Examples include house prices, salaries etc.
• We randomly draw a sample of size 3 from a right skewed distribution with mean 14 and standard deviation 10.7.
• The sample mean (average) has mean 14 and standard deviation 10.7/√3 = 6.18.
• We construct a 95% confidence interval to locate the mean,
\[ \left[\,\bar{X} - 1.96 \times \frac{10.7}{\sqrt{3}},\ \bar{X} + 1.96 \times \frac{10.7}{\sqrt{3}}\,\right]. \]
The confidence interval is constructed under the assumption that the sample mean is normal. In the next slide we investigate how this influences the 'quality' of the confidence interval.

• We draw a sample of size three from this skewed distribution and take the average.
• The average corresponds to one of the green dots on the plot below. We construct a 95% interval.
• We see that only 93 of the 100 intervals contain the mean. The reason for the difference between the 95% and 93 (though not much) can be found in the green plot of the sample mean. It is slightly skewed and clearly not normal.
The sample size is not large enough for the CLT to work. We do not have 95% confidence in this 95% confidence interval.

Case 2: Skewed data – sample size 50
• In the previous example, it was clear that we did not have the full 95% confidence in the 95% confidence interval we had constructed.
• This was because the sample mean was not normal.
• We need to be careful when constructing confidence intervals using small sample sizes because the normality assumption may not hold. This means our interval is not as reliable as we think it is.
• If the sample size is sufficiently large then we recall from Chapter 5 that the corresponding sample mean will be close to normal. This means that a 95% confidence interval will actually be a 95% confidence interval.
• In this next slide we look at the reliability of the 95% CI (where the data is sampled from a skewed distribution):
\[ \left[\,\bar{X} - 1.96 \times \frac{10.7}{\sqrt{50}},\ \bar{X} + 1.96 \times \frac{10.7}{\sqrt{50}}\,\right]. \]

We observe that the sample mean based on a sample of 50 appears close to normal (though this needs to be checked with a QQ plot). The 'coverage' of the confidence interval (at least over these 100 realizations) is 'about' 95%. We can 'safely' say that we have 95% confidence in the 95% confidence interval. To summarize, a 95% confidence interval is an interval where we are 95% confident it contains the mean (note that for any given interval the mean is either there or not, so there is no probability).

Implications
We do not need to (and cannot, anyway) take a lot of random samples to "rebuild" the sampling distribution and find µ at its center.
All we need is one SRS of size n, and we can rely on the properties of the sampling distribution to infer reasonable values for the population mean µ.
[Figure: population distribution centered at µ, with one sample of size n drawn from it.]

Multiple samples revisited
With 95% confidence, we can say that µ should be within 1.96 standard errors (1.96×σ/√n) of our sample mean x̄.
• In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.
• In only 5% of samples will x̄ be farther from µ.
• "Confidence" = the proportion of possible samples that give us a correct conclusion.

Calculation practice 1
• You want to rent an unfurnished one-bedroom apartment in Dallas. The mean monthly rent for 10 randomly sampled apartments is 980 dollars. Assume that monthly rents follow a normal distribution with standard deviation 280 dollars.
• Question: Construct a 95% confidence interval for the mean monthly rent of a one-bedroom apartment.
• Answer: The standard error for the sample mean is 280/√10 = 88.54. The 95% CI is [980 ± 1.96×88.54] = [806, 1153]. With 95% confidence we believe the mean price of one-bedroom apartments in Dallas lies in this interval.
• It is important to note that this refers to the mean price, not the price of an individual apartment.
• Question: Does the above confidence interval mean that 95% of all rents should lie in this interval?
• Answer: No, this is a confidence interval for the mean, not for the apartment price. An interval where about 95% of apartment prices will lie is [980 ± 1.96×√(88.54² + 280²)] = [980 ± 576] = [404, 1556]. You do not have to understand this calculation, but you will notice this interval is much wider. The reason is that it must capture 95% of all rents, which are extremely varied. This interval will never get much narrower, however large the sample, because its half-width cannot fall below 1.96×280. The CI for the mean captures the mean rent; it is far narrower and will keep getting narrower as the sample size grows.
• Question: A realtor wants to know if the mean price of one-bedroom apartments in Dallas is more than 1100 dollars a month. Based on the confidence interval for the mean, what can you say?
• Answer: We showed that the 95% confidence interval for the mean is [806, 1153] dollars. As this interval contains values both above and below 1100 dollars, we do not know. We do not have enough data to answer her question.

Calculation practice 2
• Hypokalemia is diagnosed when the blood potassium level is below 3.5mEq/dl.
The potassium in a blood sample varies from sample to sample and follows a normal distribution with unknown mean but standard deviation 0.2. A patient's potassium is measured over 4 days. The sample over the 4 days is 3, 3.5, 3.9, 4.4; its sample mean is 3.7.
• Question: Construct a 95% confidence interval for the mean potassium and discuss whether the patient is likely to be diagnosed with hypokalemia.
• Answer: The standard error for the sample mean is 0.2/√4 = 0.1. Thus the 95% confidence interval for the mean potassium level is [3.7 ± 1.96×0.1] = [3.504, 3.896]. This means with 95% confidence we believe the mean lies in this interval. Since the interval contains no values of 3.5 or less, it suggests that the patient does not have low potassium. There is a precise way of answering this specific problem which we discuss in Chapter 7 (called statistical testing).

Confidence interval misunderstandings
• Suppose 400 alumni were asked to rate the University of Olat counseling services on a scale from 1 to 10. The sample mean was found to be 8.6 and it is known that the standard deviation is σ = 2. Ima Bitlost has done the analysis, but has made some mistakes.
• Ima computes the 95% CI for the mean satisfaction score as [8.6 ± 1.96×2]. What is her mistake?
  ◦ Ima has not taken into account that the sample mean has a much smaller standard deviation (standard error) than the population. The standard error is 2/√400 = 0.1. Thus the correct CI is [8.6 ± 1.96×0.1] = [8.404, 8.796].
• After correcting her mistake, she states that "I am 95% confident that the sample mean lies in the interval [8.404, 8.796]". What is wrong with her statement?
  ◦ This is a meaningless statement: the sample mean lies in this interval for sure! It is the population mean that we are 95% confident lies there.
• She quickly realizes her mistake and instead states "the probability that the mean lies in the interval [8.404, 8.796] is 95%". What misinterpretation is she making now?
  ◦ By 95%, we mean that if we repeated the experiment many times over, about 95% of the intervals would contain the mean. For any given interval the mean is either in there or not. There is no probability attached to it. To overcome this issue, we say that we have 95% confidence that the mean lies in this interval.
• Finally, in her defense for using the normal distribution to determine the confidence coefficient (1.96) she says "Because the sample size is quite large, the population of alumni ratings will be close to normal". Explain to Ima her misunderstanding.
  ◦ The distribution of the population always stays the same, regardless of the sample size (in this case, it is clear that variables that take integer values between 1 and 10 cannot be normal). However, the sample mean does get closer to normal as the sample size grows. With a sample size of 400, the distribution of the sample mean will be very close to normal.

Different levels of confidence
• There is no need to restrict ourselves to 95% confidence intervals.
• The level of confidence we use really depends on how much confidence we want. For example, you would expect a 99% confidence interval to be more likely to contain the mean than a 95% confidence interval.
• To construct a 99% confidence interval we use exactly the same prescription as used to construct a 95% confidence interval; the only thing that changes is that 1.96 goes to 2.57 (if you look up −2.57 in the z-tables you will see this corresponds to 0.5%, so 99% of the time the sample mean will lie within 2.57 standard errors of the mean).
• A 99% CI for the mean one-bedroom apartment price is [980 ± 2.57×88.54]. The length of the interval is 2×2.57×88.54 ≈ 455.
• A 90% CI for the mean one-bedroom apartment price is [980 ± 1.64×88.54]. The length of the interval is 2×1.64×88.54 ≈ 290.
What does a 100% confidence interval look like? In a 100% CI we are sure to find the mean, but this interval is so wide it is not informative.
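As a cross-check on the table look-ups, the multipliers for the different confidence levels can be computed rather than read from the z-tables. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_interval` is made up for this illustration, and the inputs are the apartment figures used above (mean 980, σ = 280, n = 10).

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma, n, level):
    """Return (z*, lower, upper) for a known-sigma CI at the given confidence level."""
    tail = (1 - level) / 2                 # area left in each tail, e.g. 0.025 for 95%
    z = NormalDist().inv_cdf(1 - tail)     # standard normal critical value
    moe = z * sigma / math.sqrt(n)         # margin of error
    return z, xbar - moe, xbar + moe


# Apartment example: sample mean 980, sigma 280, n = 10.
for level in (0.90, 0.95, 0.99):
    z, lo, hi = z_interval(980, 280, 10, level)
    print(f"{level:.0%}: z* = {z:.3f}, CI = [{lo:.0f}, {hi:.0f}], length = {hi - lo:.0f}")
```

The computed multipliers are slightly more precise than the two-decimal table values (1.64, 1.96, 2.57), so the lengths may differ from the hand calculations by a dollar or so.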
Sample size and length of the CI
• Let us return to the apartment example. We recall that the 95% confidence interval for the mean price is [980 ± 1.96×88.54] = [806, 1153]. The length of this interval is 2×1.96×88.54 = 347.
• Question: Suppose I take an SRS of 100 apartments in Dallas and the sample mean based on this sample is 1000. What will the CI be?
• Answer: The standard error is 280/√100 = 28 (much smaller than when the sample size is 10), and the CI is [1000 ± 1.96×28] = [945, 1055]. The length of this interval is 2×1.96×28 ≈ 110.
What we observe is:
• The length of the interval does not depend on the sample mean; the sample mean only sets the center. The length depends only on (i) 1.96, (ii) the standard deviation and (iii) the sample size.
• The length of the interval gets smaller as the sample size increases.
• If we want the interval to have a certain length, we can choose the sample size accordingly.

How large an interval
• You read in a newspaper that the proportion of the public that supports gay marriage is now 55% ± 15%.
• This means a survey was done, the proportion in the survey who supported gay marriage was 55%, and the confidence interval for the population proportion is [55−15, 55+15]% = [40%, 70%].
• This is an extremely wide interval. It is so wide that it is really not that informative about the opinion of the public.
• As we will see on the next slide, the reason it is too wide is that the sample size is too small. This experiment was not designed well.
• Typically, before data is collected, we need to decide how large a sample to collect. This is usually done by deciding how much 'above and below' the estimator seems reasonable. For example, [55−3, 55+3]% = [52, 58]% is more informative. The 3% is known as a margin of error. Given a certain margin of error we can then determine the sample size (see formula on next page).

Margin of Error
• Margin of error is the lingo used for the plus and minus part in the confidence interval.
• That is, if the confidence interval is [sample mean ± 1.96×σ/√n], the margin of error is 1.96×σ/√n.
• For example, in the previous example the margin of error for the CI based on 10 apartments is 1.96×88.54 = 173.5. The margin of error for the CI based on 100 apartments is 1.96×28 = 54.9.
• The margin of error is, in some sense, a measure of reliability. For a given confidence level, the smaller the margin of error the more precisely we can pinpoint the true mean.
• Suppose we want the margin of error to be equal to some value; then we can find the sample size that gives that margin of error. Solve the equation MoE = 1.96×σ/√n for n (the margin of error and the standard deviation σ are given):
\[ n = \left(\frac{1.96 \times \sigma}{\text{MoE}}\right)^{2} \]
• See the next few slides for examples.

Calculation practice
• In a study of bone turnover in young women with a medical condition, serum TRAP was measured in 31 subjects. The sample mean was 13.2 units per liter. Assume the standard deviation is known to be 6.5 U/l.
• Question: Find the 80% CI for the mean serum level.
• Answer: An 80% CI leaves 10% in each tail; looking up 10% in the z-tables gives −1.28. The standard error for the sample mean is 6.5/√31 = 1.17. Altogether this gives the CI [13.2 ± 1.28×1.17] = [11.7, 14.7]. This means we believe with 80% confidence that the mean level of serum TRAP for women with this medical condition lies in this interval. By choosing such a low level of confidence our interval is quite narrow, but our confidence in this interval is relatively low. The margin of error is 1.28×1.17 ≈ 1.49.
• Question: How large a sample size should we choose such that the 80% CI for the mean has margin of error 1 U/l?
• Answer: Solve 1.28×6.5/√n = 1, which gives n = (1.28×6.5/1)² = 69.2; rounding up, n = 70.

When the standard deviation is unknown?
• In the previous example we assumed the standard deviation was known. In general, before we collect the data, we will not have much information about the standard deviation. However, we will have some idea on bounds for it.
• For example, the standard deviation for human heights is probably between 2 and 5 inches. Based on this information we can find the sample size whose margin of error is at most a certain length.
• Question: How large a sample size do we require such that the margin of error for a 95% confidence interval for the mean of human heights is at most 0.25 inch, given that σ lies somewhere between 2 and 5 inches?
• Answer: We know that the formula is n = (1.96×σ/0.25)². We need to choose the standard deviation to place in the formula.
  ◦ If we use σ = 2, then the sample size is n = (1.96×2/0.25)² = 246.
  ◦ If we use σ = 5, then the sample size is n = (1.96×5/0.25)² = 1537.
  ◦ For standard deviations between 2 and 5, the sample size will be between 246 and 1537.
• Using the smaller standard deviation means a smaller sample size is required. However, if the standard deviation is actually greater than 2, then the MoE will be larger than the desired maximum:
  ◦ If σ = 5, and we use the minimum sample size n = 246, then putting these numbers into the formula we see that the MoE = 1.96×5/√246 = 0.62, which is larger than the required MoE of 0.25. This is not what we want, as we want to ensure that the MoE is at most 0.25.
  ◦ If σ = 2, and we use the maximum sample size n = 1537, then putting these numbers into the formula we see that the MoE = 1.96×2/√1537 = 0.1, which is less than the required 0.25. This is exactly what we want, as we want an MoE which is at most 0.25.
• To be sure that the MoE is at most 0.25, we need to use a sample size of n = 1537. This means always using what we believe is the maximum standard deviation in the calculation of the sample size, i.e.,
\[ n = \left(\frac{1.96 \times \sigma_{\text{MAX}}}{\text{MoE}}\right)^{2} \]

Calculation practice (tricky)
• Question: A confidence interval for the length of parrots' beaks is [4, 10] inches. It is based on a sample of size n. By what factor should the sample size increase such that the margin of error is 1?
• Answer: This looks like an impossible question because we don't have any obvious information. But we can break the problem into steps:
  ◦ Confidence intervals are centered about the sample mean, so the average of the observed data is 7. The margin of error is half the length of the CI, which is [10−4]/2 = 3 = 1.96×σ/√n.
  ◦ We want to decrease the MoE so that MoE = 1, that is, to a third of its current value. Now some basic maths. Suppose we increase the sample size by a factor of 9 (9 times the original data):
\[ 1.96 \times \frac{\sigma}{\sqrt{9n}} = 1.96 \times \frac{\sigma}{3\sqrt{n}} = \frac{1}{3} \times \underbrace{1.96 \times \frac{\sigma}{\sqrt{n}}}_{=3} = \frac{3}{3} = 1 \]
Thus increasing the sample size by a factor of 9 results in the margin of error reducing to 1. Observe that we need a huge increase in sample size to get a moderate decrease in the MoE!

Calculation (continued)
• Example: If a sample size of 20 gave a confidence interval [4, 10], how large a sample size is required to reduce the margin of error to 1/2 (0.5)?
• Solution: If the confidence interval is [4, 10], from the previous slide we know that the MoE is 3. This means that
\[ 1.96 \times \frac{\sigma}{\sqrt{20}} = 3. \]
If I increase the sample size by a factor of 36, i.e. from n = 20 to n = 20×36 = 720, then I see that the margin of error is
\[ 1.96 \times \frac{\sigma}{\sqrt{36 \times 20}} = 1.96 \times \frac{\sigma}{6\sqrt{20}} = \frac{1}{6} \times 1.96 \times \frac{\sigma}{\sqrt{20}} = \frac{3}{6} = \frac{1}{2}. \]
We see that to decrease the margin of error from 3 to ½ (to a sixth of its value) we need to increase the sample size by a factor of 36!

Analysis with unknown standard deviation
• So far we have assumed that the standard deviation is known, even though the mean is unknown.
• In some situations, this is realistic. For example, in the potassium level example, it seems reasonable to suppose that the amount of variation for everyone is the same, but everyone has their own personal mean level, which is unknown.
• In most situations, however, the standard deviation is unknown as well as the mean.
• Given the data 68, 68.5, 68.9 and 69.4, the sample mean is 68.7. How do we 'get' the standard deviation to construct a confidence interval?
• We do not know the standard deviation, but we know that we can estimate it using the formula
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^{2}} \]
• For our example it is
\[ s = \sqrt{\frac{1}{3}\left([-0.7]^{2} + [-0.2]^{2} + [0.2]^{2} + [0.7]^{2}\right)} = 0.59 \]

Using the z-transform with estimated standard deviation
• Once we have estimated the standard deviation we replace the unknown true standard deviation in the z-transform with the estimated standard deviation:
\[ \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \;\Rightarrow\; \frac{\bar{X}-\mu}{s/\sqrt{n}}, \qquad \bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \;\rightarrow\; \bar{X} \pm 1.96\,\frac{s}{\sqrt{n}} \]
• After this we could conduct the analysis just as before. However, we will show in the next few slides (with the aid of Statcrunch) that this strategy leads to unreliable confidence intervals (when the sample size is small).
We consider two examples:
• The data is normal (we 'draw' samples from a distribution with mean 67 and standard deviation 3.8; however, the confidence interval does not use these specifications) and the sample size is n = 3.
• The data is normal (as above), but the sample size is n = 50.

Case 1: Normal data – sample size 3
In this example we draw samples of size 3.
• The 95% CI using the above data and the normal distribution is
\[ \left[\,69.9 - 1.96 \times \frac{1.73}{\sqrt{3}},\ 69.6 + 1.96 \times \frac{1.73}{\sqrt{3}}\,\right] \]
We see from this example that the estimated standard deviation (1.73) underestimates the true standard deviation (3.8). This in general tends to be true for small sample sizes. This means the 95% CI is too narrow. We see from the plot on the left that only 84% of the '95% CIs' contain the mean. This means it is not a 95% CI. Something has gone wrong.

Case 1: Normal data – sample size 50
In the previous example the sample size was 3; now we consider the case that the sample size is 50. For the example given on the right the 95% CI is
\[ \left[\,68.0 - 1.96 \times \frac{4.07}{\sqrt{50}},\ 68.0 + 1.96 \times \frac{4.07}{\sqrt{50}}\,\right] \]
For this example, the estimated standard deviation 4.07 is far closer to the true 3.8. This in general is true for large sample sizes.
Looking at the number of times the mean is contained within the 95% confidence interval (on the right) we see that it is close to the prescribed level of 95%.

Observations from the experiments
• Simply replacing the true standard deviation with the estimated standard deviation seems to have severe consequences on the confidence interval.
• When the sample size is small there tends to be an underestimation of the standard error, resulting in the 95% CI not really being a 95% CI.
• To see why, consider the z-transforms of the sample mean with known and estimated standard deviations:
  ◦ (sample mean − µ)/(σ/√n)
  ◦ (sample mean − µ)/(s/√n)
In the next few slides we show that when we estimate the standard deviation the z-transform no longer follows a standard normal distribution, but the so-called t-distribution.

Review: σ is unknown
When σ is unknown, we can estimate the standard deviation from the data. The sample standard deviation s provides an estimate of the population standard deviation σ.
• When the sample size is large, the sample is likely to contain elements representative of the whole population. Then s is a good estimate of σ.
• But when the sample size is small, the sample contains only a few individuals. Then s is a mediocre estimate of σ.
• With a small sample, the data is unlikely to contain values in the tails, and s is likely to underestimate σ.
[Figure: population distribution, a large sample, and a small sample.]

Sample means and standard deviations
• Just like the sample mean is random with a distribution, so is the sample standard deviation.
• Here we take a sample of size 10 from a normal distribution and calculate its sample mean and variance.

Estimating the standard deviation
• [Figure: the sampling distribution of the sample standard deviation (n=5).]
• [Figure: the sampling distribution of the sample standard deviation (n=25).]
Observe that as the sample size increases the sample standard deviation becomes a less variable estimator (its spread reduces from 1.70 to 0.65).
A large amount of variability in the sample standard deviation influences the confidence interval.

That nice Mr. Gosset
• Just over 100 years ago, W.S. Gosset was a biometrician who worked for the Guinness Brewery in Dublin, Ireland.
• His hobby was statistics.
• Gosset realized that his inferences with small sample data seemed to be incorrect too often: his true confidence level was less than it was stated to be. We just observed this in the previous simulations.
• He worked out the proper method that took into account substituting s for σ.
• But he had to publish under a pseudonym: Student (probably because Gosset was a sweet and modest person).
• Gosset's theory is based on the distribution of the quantity
\[ t = \frac{\bar{x}-\mu}{s/\sqrt{n}}. \]
This looks like the z-score for x̄, except that s replaces σ in the denominator.

Formal: Student's t distributions
Suppose that an SRS of size n is drawn from a Normal(µ, σ) population.
• When σ is known, the sampling distribution for
\[ z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \]
is Normal(0, 1).
• When σ is estimated by the sample standard deviation s, the sampling distribution for
\[ t = \frac{\bar{x}-\mu}{s/\sqrt{n}} \]
will be very close to normal if the sample size n is large. This is because for large n, s will be a very reliable estimator of σ.
• However, in the case that n is not so large, the variability in s will have an impact on the distribution.
• It is clear that the impact it has depends on the sample size.

Student's t distributions
• When σ is estimated by the sample standard deviation s, the sampling distribution for t = (x̄ − µ)/(s/√n) will depend on the sample size.
• The sampling distribution of t = (x̄ − µ)/(s/√n) is a t distribution with n − 1 degrees of freedom.
• The degrees of freedom (df) is a measure of how well s estimates σ. The larger the degrees of freedom, the better σ is estimated.
• This means we need a new set of tables!
• Further reading: http://onlinestatbook.com/2/estimation/t_distribution.html

When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution.
The t distributions become wider (thicker tailed) for smaller sample sizes, reflecting that s can be smaller than σ, so the corresponding t-transform is more likely to take extreme values than the z-transform.

Impact on confidence intervals
Suppose we want to construct the C% confidence interval for the mean. The standard deviation is unknown, so as well as estimating the mean we also estimate the standard deviation from the sample. The C% confidence interval is:
\[ \left[\,\bar{X} - t_{n-1}\!\left(\frac{100-C}{2}\right) \times \frac{s}{\sqrt{n}},\ \ \bar{X} + t_{n-1}\!\left(\frac{100-C}{2}\right) \times \frac{s}{\sqrt{n}}\,\right] \]
Examples:
• 95%, sample size n = 3:
\[ \left[\,\bar{X} - 4.3 \times \frac{s}{\sqrt{3}},\ \bar{X} + 4.3 \times \frac{s}{\sqrt{3}}\,\right] \]
• 95%, sample size n = 10:
\[ \left[\,\bar{X} - 2.26 \times \frac{s}{\sqrt{10}},\ \bar{X} + 2.26 \times \frac{s}{\sqrt{10}}\,\right] \]
[Figure: Student's t curve with central area C between −t* and t*.]
Example: For a 95% confidence level C, 95% of the area under Student's t curve is contained in the interval [−t*, t*].

Confidence level and the margin of error
The confidence level C determines the value of t* (in Table D). The margin of error also depends on t*:
\[ m = t^{*} \times \frac{s}{\sqrt{n}} \]
• A higher confidence level C implies a larger margin of error m (thus less precision in our estimates).
• A lower confidence level C produces a smaller margin of error m (thus better precision in our estimates).
• We find t* in the line of Table D for df = n−1 and confidence level C.

Table D
When σ is unknown, we use a t distribution with "n−1" degrees of freedom (df), based on t = (x̄ − µ)/(s/√n). Table D shows the z-values and t-values corresponding to landmark P-values/confidence levels. When the sample is very large, we use the normal distribution and the standardized z-value.
• Focus first on 2.5%. For each row n of the table, 2.5% is the area in each of the left and right tails of the t-distribution with n degrees of freedom. Remember a distribution gives the chance/likelihood of certain outcomes.
• Recall that for a normal distribution, the point where we get 2.5% in each of the left and right tails of the distribution is 1.96 (which is the very last row of the table).
• If we go down the table, we see that as n increases the value corresponding to 2.5% goes from 12.71 (for n = 1) to a number that is very close to 1.96 for extremely large n.
• This means that for small n the variability in the standard deviation s makes the chance of the t-transform being extreme relatively large.
• However, as n grows, the estimator of the standard deviation improves, and the t-transform gets closer to a normal distribution.
• You observe the same is true for other percentages:
  ◦ 90% means looking up 5%
  ◦ 99% means looking up 0.5%
DO NOT MIX CONFIDENCE LEVEL WITH SIGNIFICANCE LEVEL

Case 1: Normal data – sample size 3, using t-dist
In this example we draw samples of size 3.
• The 95% CI using the above data and the t-distribution is
\[ \left[\,69.9 - 4.3 \times \frac{1.73}{\sqrt{3}},\ 69.6 + 4.3 \times \frac{1.73}{\sqrt{3}}\,\right] \]
This is the same example as considered previously, but now the t-distribution has been used: 1.96 has been replaced with 4.3.
From the plot on the right we see that when the t-distribution is used to construct the CI, about 95% of the 95% confidence intervals really do contain the population mean. By using the t-distribution we have corrected for the underestimation of the standard deviation.

Non-normal data: A misconception
Using a t-distribution rather than a normal distribution when constructing a confidence interval does not correct for the lack of normality in the data.
In the example on the left, we use the t-distribution to construct the CI. But we observe that only 88 of the 100 95% confidence intervals contain the mean.
Fundamentally, if the data is not normal and the sample size is small, neither the normal nor the t will give the correct 95% confidence interval.
REMEMBER: we only use the t-distribution because we have estimated the standard deviation from the data.
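The comparison between 1.96 and 4.3 for normal data with n = 3 can be reproduced with a small simulation. The sketch below is an illustration only, not the Statcrunch applet used for the slides: the function name `coverage_estimated_sd` is hypothetical, the population is assumed to be N(67, 3.8), and the multiplier 4.303 is the table value for 2.5% with 2 degrees of freedom (rounded to 4.3 on the slides).

```python
import math
import random


def coverage_estimated_sd(crit, n=3, mu=67.0, sigma=3.8, reps=10_000, seed=0):
    """Fraction of intervals  xbar +/- crit*s/sqrt(n)  that contain mu,
    where s is the sample standard deviation estimated from each sample.

    Assumes normal data N(mu, sigma); mu and sigma are used only to generate
    the data and to check coverage, never to build the interval.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        if abs(xbar - mu) <= crit * s / math.sqrt(n):
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("normal multiplier 1.96:", coverage_estimated_sd(1.96))
    print("t multiplier 4.303 (2 df):", coverage_estimated_sd(4.303))
```

Under these assumptions the interval built with 1.96 should cover the mean noticeably less often than 95% of the time, while the interval built with the t multiplier should come out close to 95%. Swapping `rng.gauss` for a skewed generator would be one way to explore the non-normal case discussed above.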
Calculation practice (red wine 1)

It has been suggested that drinking red wine in moderation may protect against heart attacks. This is because red wine contains polyphenols, which act on blood cholesterol.

To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men were assigned to drink half a bottle of red wine daily for two weeks. The percent changes in their blood polyphenol levels are presented here:

0.7  3.5  4.0  4.9  5.5  7.0  7.4  8.1  8.4

Sample average x̄ = 5.50
Sample standard deviation s = 2.517
Degrees of freedom df = n − 1 = 8

We will encounter two problems when doing the analysis. The first is that the sample size is not huge, so we have to hope that the distribution of the sample mean is close to normal. The second is that the standard deviation is unknown and has to be estimated from the data.

- What is the 95% confidence interval for the average percent change?
- First, we determine what t* is. The degrees of freedom are df = n − 1 = 8 and C = 95%. From Table D we get t* = 2.306. (…)
- The margin of error m is: m = t* × s/√n = 2.306 × 2.517/√9 ≈ 1.93. So the 95% confidence interval is 5.50 ± 1.93, or 3.57 to 7.43.
- We can say “With 95% confidence, the mean percent increase is between 3.57% and 7.43%.”
- What if we want a 99% confidence interval instead?
- For C = 99% and df = 8, we find t* = 3.355. Thus m = 3.355 × 2.517/√9 ≈ 2.81.
- Now, with 99% confidence, we can only conclude that the mean is between 2.69 and 8.31. (A big price to pay for the extra confidence.)

Calculation practice (red wine 2)

Let us return to the same study, but this time we increase the sample size to 15 men. The data are now:

0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4, 3.2, 0.8, 4.3, -0.2, -0.6, 7.5

The sample mean in this case is 4.3 and the sample standard deviation is 3.06. Since the sample size has increased, it is likely that the sample standard deviation is a more reliable estimator of the true standard deviation.
The number of degrees of freedom is 14. Just as in the previous example we can construct a 95% confidence interval, but now we use 14 df instead of 8 df.

Solution: Using the t-tables (14 df, 2.5%), the critical value is 2.145, so the 95% CI is

$$\left[4.3 \pm 2.145 \times \frac{3.06}{\sqrt{15}}\right] = [2.6, 6]$$

Confidence intervals using software

- Usually software will construct the confidence interval for you. Therefore it is important to connect the calculations with the statistical output.

The box on the right is the output (it is superimposed on the window used to generate the output). Observe that L. Limit and U. Limit give the confidence interval [2.6, 6] calculated on the previous slide. DF = 14 matches the degrees of freedom.

Calculation practice 3

- Let us return to the example of prices of apartments in Dallas. 10 apartments are randomly sampled. The sample mean and the sample standard deviation based on this sample are 980 dollars and 250 dollars (both are estimators based on a sample of size ten). Construct a 95% confidence interval for the mean:
- The standard error is 250/√10 = 79.
- Looking up the t-tables at 2.5% and 9 degrees of freedom gives 2.262.
- The 95% confidence interval for the mean is [980 ± 2.262×79] = [801, 1159].

Suppose we want to know whether the price of apartments has increased since last year, when the mean price was 850 dollars.

- The value 850 dollars lies inside this interval, as do larger values. This means the mean could be 850 dollars or higher. Therefore, given the sample, it is unclear whether the mean price of apartments has increased since last year or not.

Calculation practice 4

- Let us return to the M&M data. Suppose we want to calculate a 95% confidence interval for the mean number of M&Ms in plain, peanut butter and peanut M&Ms. These can be calculated using the summary statistics output:

Summary statistics for Total:
Group by: Type

Type  n   Mean       Variance  Std. Dev.  Std. Err.   Median  Range  Min  Max  Q1  Q3
M     84  17.297619  8.259753  2.8739786  0.3135768   18      14     7    21   17  19
P     40  8.675      9.814744  3.1328492  0.49534693  8       15     6    21   7   8
PB    46  10.913043  3.325604  1.8236238  0.26887867  11      10     8    18   10  11

Using this output we can calculate the confidence intervals for the mean number of M&Ms in each type. To obtain the t-values we can use Statcrunch (as the appropriate DF are not in the tables).

Using software to obtain confidence intervals

- Go to Stats -> t-statistics -> one-sample -> with data -> select the column you want to analyse (choose the Group by if you want it grouped). On the next page select confidence interval and the level you want it at.

Looking at the intervals, do you think that the mean number of M&Ms of each type could be the same? Hint: Compare the intervals. Compare the margins of error: are they the same?

Calculation practice: coffee shop sales

A marketing firm randomly samples 45 coffee shops and determines their annual sales. The sample has an average of $2.67 million and a standard deviation of $1.03 million. What can we say with 90% confidence about the mean annual sales for the population of all coffee shops?

- The degrees of freedom is 45−1 = 44.
- For 90% confidence, we find t* = 1.680.
- The margin of error is 1.680×1.03/√45 = 0.258.
- So the interval for the true mean, $\bar{x} \pm t^* s/\sqrt{n}$, is 2.67 ± 0.26.

“We conclude that the mean annual sales of all coffee shops is between $2.41 million and $2.93 million, with 90% confidence.”

Summary of confidence interval for µ

- The confidence interval for a population mean µ is $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$.
- t* is obtained from Student’s t distribution using n−1 degrees of freedom. (Table D in the textbook.) t* is the value such that the confidence level C is the area between −t* and t*.
- Confidence is the proportion of samples that lead to a correct conclusion (for a specific method of inference).
- The investigator chooses the confidence level C. Tradeoff: more confidence means bigger margin of error, wider intervals.
- The degrees of freedom is associated with s, the estimate for σ.
- The margin of error $t^* s/\sqrt{n}$ also depends on the sample size: larger samples are better.

Interpretation of confidence, again

- The confidence level C is the proportion of all possible random samples (of size n) that will give results leading to a correct conclusion, for a specific method.
- In other words, if many random samples were obtained and confidence intervals were constructed from their data with C = 95%, then 95% of the intervals would contain the true parameter value.
- In the same way, if an investigator always uses C = 95%, then 95% of the confidence intervals he constructs will contain the parameter value being estimated.
- But he never knows which ones do!
- Changing the method (such as changing the value of t*) will change the confidence level.
- Once computed, any individual confidence interval either will or will not contain the true population parameter value. It is not random.
- It is not correct to say C is the probability that the true value falls in the particular interval you have computed.

Cautions about using $\bar{x} \pm t^* \times s/\sqrt{n}$

- This formula is only for inference about µ, the population mean. Different formulas are used for inference about other parameters.
- The data must be a simple random sample from the population.
- The formula is not quite correct for other sampling designs. (But see a statistician to get the right inference method.)
- Confidence intervals based on t* are not resistant to outliers.
- If n is small and the population is not normal, the true confidence level could be smaller than C. (Usually n ≥ 30 suffices unless the data are highly skewed.)
- This inference cannot rescue sampling bias, badly produced data or computational errors.

Accompanying problems associated with this Chapter

- Quiz 6
- Quiz 7
- Quiz 7a
- Quiz 8
- Homework 4 (part of it)
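
As a recap, the calculation practice examples in this chapter all apply the same formula, x̄ ± t* × s/√n. The sketch below redoes three of them in Python. The function name `t_interval` is our own (hypothetical), and the t* values are simply the Table D values quoted in the examples; they are typed in, not computed.

```python
from math import sqrt
from statistics import mean, stdev


def t_interval(xbar, s, n, t_star):
    """Return (lower, upper) for xbar +/- t_star * s / sqrt(n)."""
    m = t_star * s / sqrt(n)  # margin of error
    return xbar - m, xbar + m


# Red wine 1: nine percent changes in blood polyphenol level, df = 8.
wine = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]
wine_95 = t_interval(mean(wine), stdev(wine), len(wine), t_star=2.306)
wine_99 = t_interval(mean(wine), stdev(wine), len(wine), t_star=3.355)

# Dallas apartments: summary statistics only, n = 10, df = 9.
dallas_95 = t_interval(980, 250, 10, t_star=2.262)

# Coffee shop sales (in $ million): n = 45, df = 44, 90% confidence.
coffee_90 = t_interval(2.67, 1.03, 45, t_star=1.680)

for label, (lo, hi) in [
    ("red wine 95%", wine_95),
    ("red wine 99%", wine_99),
    ("Dallas 95%", dallas_95),
    ("coffee 90%", coffee_90),
]:
    print(f"{label}: [{lo:.2f}, {hi:.2f}]")
```

The printed intervals should agree with the worked answers above up to rounding. Moving from 95% to 99% on the same data only changes t*, which is why the red wine interval widens while its centre stays at the sample mean.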