IV. Estimation and Testing

A. Overview

1. Introduction to Estimation

A parameter is an important characteristic of a population. Examples:
• the true mean outside diameter of the pen barrel;
• the variability of the elastic strengths of polymer yarn; and
• the coefficients which relate the effect of catalyst, temperature, and pressure to the filament's strength.

Problem: How often do we know the true values of parameters? Almost Never! WHY? We never can observe populations in their entirety.

How do we get around this problem? WE TAKE A SAMPLE AND ESTIMATE THE PARAMETERS

An estimator is a statistic used to estimate an unknown parameter of a population. The sample mean, $\bar{y}$, and the sample variance, $s^2$, are examples of estimators.

Two criteria for choosing estimators are:
• accuracy (unbiased)
• precision.

An unbiased estimator of an unknown parameter is one whose expected value is equal to the parameter of interest. Thus, we call $\hat{\theta}$ an unbiased estimator of $\theta$ if

$$E[\hat{\theta}] = \theta.$$

Thus the estimator yields, on the average, an estimate close to the true value. In this case, $\hat{\theta}_1$ is an unbiased estimator of $\theta$; $\hat{\theta}_2$ is a biased estimator.

The concept of precision looks at the variances of the estimators. An estimator is more precise if its sampling distribution has a smaller standard error.

If our data come from a normal distribution, then among the class of unbiased estimators,
• $\bar{y}$ is the most precise estimator of $\mu$
• $s^2$ is the most precise estimator of $\sigma^2$.

We defined $s^2$ using $n-1$ in the denominator because it produces an unbiased estimator of $\sigma^2$.

2. Introduction to Confidence Intervals

We call $\bar{y}$ and $s^2$ point estimators. If we sample from a continuous distribution, then $\bar{y}$ and $s^2$ are continuous random variables. Does anyone sense a problem?

Note:
• $P(\bar{y} = \mu) = 0$
• $P(s^2 = \sigma^2) = 0$

Consequently, statisticians prefer interval estimators. These intervals give a range of plausible values for the parameter of interest. For example, consider the population mean, $\mu$.
If we don't know $\sigma^2$, and if the parent distribution is well-behaved, then

$$\frac{\bar{y} - \mu}{s/\sqrt{n}}$$

follows a t-distribution with $n-1$ degrees of freedom. As a result,

$$P\left(-t_{n-1,\alpha/2} \le \frac{\bar{y} - \mu}{s/\sqrt{n}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha$$

$$P\left(-t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} \le \bar{y} - \mu \le t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

$$P\left(\bar{y} - t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{y} + t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}\right) = 1 - \alpha$$

We call

$$\bar{y} \pm t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}}$$

a $(1-\alpha)\cdot 100\%$ confidence interval for $\mu$.

Interpretation: If we take an infinite number of samples from a well behaved parent distribution, then $(1-\alpha)\cdot 100\%$ of the time, the interval $\bar{y} \pm t_{n-1,\alpha/2}\, s/\sqrt{n}$ will contain $\mu$.

3. Introduction to Testing

The process by which we use data to answer questions about parameters is very similar to how juries evaluate evidence about a defendant.

We start with a nominal claim, which we call a null hypothesis, $H_0$.
$H_0$: the defendant is innocent

The prosecutor seeks to establish an alternative claim, which we call the alternative hypothesis, $H_a$.
$H_a$: the defendant is guilty

Note: the jury makes a decision under the risk of making a mistake.

True State              convict             acquit
Defendant's innocent    Type I error        Correct decision
Defendant's guilty      Correct decision    Type II error

What is the typical standard for a jury's decision? The jury must be convinced beyond a reasonable doubt.
What does that imply about the probability of a Type I error? It should be small.
What does that imply about the probability of a Type II error? It could be large.

Traditionally, we let

$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true}) = P(\text{convict an innocent person}).$$

$\alpha$ is called the significance level of our test.

The power of the test is

$$\text{Power} = P(\text{reject } H_0 \mid H_a \text{ is true}) = P(\text{convict a guilty person}).$$

We want small $\alpha$ and large power.

Note:
• Rejecting $H_0$ is a strong claim since we needed to be convinced beyond a reasonable doubt. We must have substantial evidence before we reject the nominal claim.
• Failing to reject $H_0$ is a weak claim. The evidence may seem to support the alternative, but the jury is not convinced beyond a reasonable doubt.

We do the same thing with engineering decisions.
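The definitions of the significance level and of power above can be checked with a small simulation. The sketch below is only an illustration under loudly stated assumptions: it uses a one-sided z-test with a known standard deviation (simpler than the t-tests used in these notes), normally distributed data, and hypothetical settings (null mean 0, standard deviation 1, sample size 25, alternative mean -0.5) that do not come from the notes.

```python
import math
import random
from statistics import NormalDist, mean


def rejects_h0(sample, mu0, sigma, alpha):
    """One-sided z-test of H0: mu = mu0 against Ha: mu < mu0 (sigma assumed known)."""
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    z_crit = NormalDist().inv_cdf(alpha)  # lower-tail critical value (negative)
    return z < z_crit


def rejection_rate(true_mu, mu0, sigma, n, alpha, n_trials, rng):
    """Fraction of simulated samples of size n that lead to rejecting H0."""
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        if rejects_h0(sample, mu0, sigma, alpha):
            rejections += 1
    return rejections / n_trials


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical settings chosen only for illustration (not from the notes).
MU0, SIGMA, N, ALPHA, TRIALS = 0.0, 1.0, 25, 0.05, 5000

# H0 true: the rejection rate estimates alpha = P(Type I error).
type_i_rate = rejection_rate(MU0, MU0, SIGMA, N, ALPHA, TRIALS, rng)

# Ha true (mean shifted down by half a standard deviation): estimates power.
power = rejection_rate(MU0 - 0.5, MU0, SIGMA, N, ALPHA, TRIALS, rng)

print(f"estimated P(Type I error): {type_i_rate:.3f}")
print(f"estimated power:           {power:.3f}")
```

Under these assumptions the first estimate should land near the chosen significance level of 0.05, and the second near 0.80, the power of this particular z-test at the assumed shift.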
Consider a packaging process for the 10 oz boxes of a popular breakfast cereal. The company has received a number of complaints about underfilled boxes. Suppose the equipment should be set to deliver, on the average, 10.2 oz. If it really is set to that value, the company should have virtually no complaints about underfills.

What would be an appropriate procedure to determine if the machine is set properly or if it will tend to underfill the boxes?

The appropriate hypotheses for testing underfills are:
$H_0: \mu = 10.2$
$H_a: \mu < 10.2$

What is a Type I error and its consequence?
What is a Type II error and its consequence?

The most commonly used values for $\alpha$ are:
• .10
• .05
• .01

If we perform this test once, what seems to be a reasonable $\alpha$?

Let's shift gears. If you have a problem with underfills, how can you correct it? From a stockholder's perspective, is this a wise idea?
• We don't want to underfill.
• Neither do we want to overfill.

What would be an appropriate procedure?
$H_0: \mu = 10.2$
$H_a: \mu \ne 10.2$

This is a two-sided hypothesis, since we care about both $\mu < 10.2$ and $\mu > 10.2$. This is a real problem in industry and will lead to the concept of control charts, which we introduce in the next chapter.

In general, we follow a 5-step procedure for conducting hypothesis tests.
1. State the appropriate hypotheses.
   $H_0$: nominal claim
   $H_a$: alternative claim (what we seek to prove)
2. State the appropriate test statistic. State how we plan to analyze the data.
3. Determine the critical region. Determine the values for the test statistic which support rejecting $H_0$.
4. Conduct the experiment and calculate the test statistic.
5. Reach conclusions and state them in English.

We will learn a statistical jargon:
• reject $H_0$
• fail to reject $H_0$.
THIS IS NOT ENGLISH!

A better way to express our conclusions:
• We should adjust the equipment.
• We shouldn't adjust the equipment.

If we reject the null hypothesis, we should always follow up our test with an appropriate confidence interval.
The idea of the interval: to give a range of plausible values as an alternative to the nominal claim.

4. Relationship of Testing to Confidence Intervals

A two-sided hypothesis test with a significance level of $\alpha$ is equivalent to constructing a $(1-\alpha)\cdot 100\%$ confidence interval for the parameter and using the following decision rule:
• If the interval contains the nominal value specified by $H_0$, then we fail to reject $H_0$.
• If the interval does not contain the nominal value, then we reject $H_0$.

The $\alpha$ we use for the hypothesis test is exactly the same $\alpha$ we use for the confidence interval.

By the way we constructed the confidence interval, each value in the interval is a plausible candidate for the true value. Thus, if the nominal value of the parameter of interest falls within the confidence interval, then we have no evidence to conclude that it is not a plausible value for the parameter. Hence, we cannot reject the null hypothesis.

On the other hand, if our interval does not contain the nominal value, then the nominal value is not plausible, and we do have sufficient evidence to reject the nominal claim.

Many engineers and statisticians prefer to concentrate solely on confidence intervals since
• they clearly estimate the parameter of interest, and
• they can address the interesting questions for which hypothesis tests are designed.

Confidence intervals provide a simple, powerful, and direct basis for addressing both practical and statistical significance.

B. Tests for a Single Mean

1. One Sided Tests

Consider the injection molding process for pen barrels. Suppose the nominal outside diameter is .380 in. Lately, the supervisor in packaging keeps complaining that the caps fall off, jamming his equipment. We need to determine if the outside diameters of these barrels, on the average, have become too small.

What should we do? Collect a sample.
A recent random sample of 15 pen barrels yielded

.379 .380 .378 .379 .381
.379 .380 .378 .379 .379
.381 .379 .380 .380 .380

Is it clear that, on the average, the outside diameter is less than .380 in? Consider a hypothesis test.

Step 1: State the Hypotheses
$H_0: \mu = .380$
$H_a: \mu < .380$

Step 2: State the Test Statistic

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}$$

Step 3: State the Critical or Rejection Region
The critical region depends upon $H_a$.
• For $H_a: \mu < \mu_0$, we reject $H_0$ if $t < -t_{n-1,\alpha}$, where $t_{n-1,\alpha}$ is the appropriate value from the t table in the Appendix.
• For $H_a: \mu > \mu_0$, we reject $H_0$ if $t > t_{n-1,\alpha}$.

Usually, textbook problems give $\alpha$. Typical values for $\alpha$ are:
• .10
• .05 (most popular)
• .01

In our particular case, consider $\alpha = .05$. Thus, we shall reject $H_0$ if

$$t < -t_{n-1,\alpha} = -t_{14,.05} = -1.761.$$

Step 4: Conduct Experiment and Calculate Test Statistic
$\bar{y} = .3795$, $s = .0009$

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{.3795 - .380}{.0009/\sqrt{15}} = -2.152$$

Step 5: Reach Conclusions and State in English
Since $t < -1.761$, we have sufficient evidence to reject $H_0$. We therefore have enough evidence to suggest that the true mean outside diameter is less than .380.

A reasonable question: What are the "plausible" values for the true mean outside diameter? We can construct a 95% confidence interval for $\mu$ by

$$\bar{y} \pm t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} \quad \text{with} \quad t_{n-1,\alpha/2} = t_{14,.025} = 2.145.$$

So

$$.3795 \pm (2.145)\,\frac{.0009}{\sqrt{15}} = .3795 \pm .0005 = (.3790, .3800).$$

Does this interval contain .380?

Note: in some sense, we could have addressed the question of interest directly by the confidence interval.

What did we assume to do this analysis? That our outside diameters follow a well behaved distribution. Are we comfortable with that assumption?

Stem    Leaves    No.   Depth
.378    00        2     2
.379    000000    6
.380    00000     5     7
.381    00        2     2

2. Two-Tailed Tests

An important characteristic of the grapes used to make fine wine is the sugar content. Basically, the wine maker can predict the final alcohol content of the wine by dividing the sugar content of the grapes by 2.
A Napa Valley winery pays a premium to its wine growers if they can deliver shipments with a true mean sugar content of 26%. The winery tests grapes from five different, randomly selected locations in the shipment and determines the sugar content at each location.

What is an appropriate method for determining if a wine grower deserves a premium?

Step 1: State the Hypotheses
$H_0: \mu = 26$
$H_a: \mu \ne 26$

Step 2: State the Test Statistic

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}$$

Step 3: State the Critical Region
For $H_a: \mu \ne \mu_0$, we reject $H_0$ if $|t| > t_{n-1,\alpha/2}$, where $t_{n-1,\alpha/2}$ is the appropriate value from the t table in the Appendix.

In our case, use $\alpha = .05$. Thus, we shall reject the null hypothesis if

$$|t| > t_{n-1,\alpha/2} = t_{4,.025} = 2.776.$$

Step 4: Conduct Experiment and Calculate Test Statistic
Suppose the next wine grower has $\bar{y} = 24.5$ and $s = 1.3$.

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{24.5 - 26}{1.3/\sqrt{5}} = -2.580$$

Step 5: Reach Conclusion, State in English
Since $|t| < 2.776$, we fail to reject $H_0$. Therefore we have insufficient evidence to show that the true sugar content is not 26%. Therefore, we should pay the grower the premium.

Typically, we would want to check our assumptions. In this case, with $n = 5$, we cannot do very much. We must trust that the data come from a very well behaved (nearly normal) distribution.

C. Tests for Proportions

Example: 50 lb. Bags of Graphite

Historically, 1% of the 50 lb. bags of graphite bagged on a certain process have weights outside the specifications of 48-52 lbs. Suppose we wish to monitor this process.

What would be appropriate hypotheses?
$H_0: p = p_0$ ($p = .01$)
$H_a: p \ne p_0$ ($p \ne .01$)

What would be the appropriate test statistic? Let $Y$ be the number of bags which fail to meet the specifications in our sample. We can estimate $p$ by

$$\hat{p} = \frac{Y}{n}.$$

From the normal approximation to the binomial, we obtain

$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}.$$

Note: under $H_0$, we actually know the standard error of $\hat{p}$.

What should be the critical or rejection region? Once again, the rejection region depends on the alternative hypothesis.
• Consider $H_a: p < p_0$. We reject $H_0$ if $Z < -z_\alpha$.
• Now, consider $H_a: p > p_0$. We reject $H_0$ if $Z > z_\alpha$.
• Consider $H_a: p \ne p_0$, which is our specific case. We reject $H_0$ if $|Z| > z_{\alpha/2}$.

We need to determine an appropriate significance level, $\alpha$. Typical choices are
• .10
• .05
• .01
Which should we use? What is our rejection rule?

Next, we need to determine a sample size. From the normal approximation to the binomial, we need $n$ to satisfy:
• $np_0 \ge 5$ (preferably $np_0 \ge 10$), and
• $n(1 - p_0) \ge 5$ (preferably $n(1 - p_0) \ge 10$).
In our case, what does that mean?

Suppose we use $n = 1000$ and that our sample has 15 bags which fail to meet the specifications. The value for our test statistic is

$$Z = \frac{.015 - .01}{\sqrt{\dfrac{(.01)(.99)}{1000}}} = 1.59.$$

What can we conclude?

D. p-values

In testing, $\alpha$ represents our standard of evidence. Once we state our $\alpha$, we determine the appropriate critical region for our test. Any value of our test statistic which is more extreme than our "critical value" is considered sufficient evidence to reject the null hypothesis or nominal claim.

An alternative method looks at the observed significance level, sometimes called the attained significance level, which is the smallest Type I error rate that would allow us to reject the null hypothesis. The observed significance level is the probability of seeing the particular value of our test statistic, or something more extreme, if $H_0$ is true. This probability is usually called a p-value.

Most statistical software packages report p-values since these packages do not know what the researcher wishes to use for $\alpha$. One rejects $H_0$ whenever the p-value is less than $\alpha$. We make extensive use of p-values when performing regression analysis with statistical software.

The p-value depends upon the specific alternative used for our test. Let $z_0$ be the observed value for our test statistic.
• For $H_a: \mu < \mu_0$, the p-value is $P(Z < z_0)$.
• For $H_a: \mu > \mu_0$, the p-value is $P(Z > z_0)$.
• For $H_a: \mu \ne \mu_0$, we must consider both tails of the standard normal distribution, and the p-value is $2 \cdot P(Z > |z_0|)$.

Example: Breaking Strengths of Carbon Fibers

Consider the hypotheses
$H_0: p = .10$
$H_a: p \ne .10$

Suppose the data produced a test statistic value of $z_0 = -1.33$. Thus, the p-value for this test is

$$\text{p-value} = 2 \cdot P(Z > |z_0|) = 2 \cdot P(Z > |-1.33|) = 2 \cdot P(Z > 1.33) = .1836.$$

Suppose our significance level is $\alpha = .01$. Since our p-value is not less than .01, we would fail to reject the null hypothesis. We have insufficient evidence to show that the true proportion of defectives has changed.

E. Hypothesis Tests for Two Means, Independent Groups

There are many occasions when we need to compare two processes or populations. For example, consider two machines which produce erasers with the same nominal outside diameter. For a long time, the supervisor has complained that Machine 1 produces erasers with a larger outside diameter. How can we approach this problem?

Let $\mu_1$ and $\sigma_1^2$ be the population mean and population variance for machine 1.
Let $\mu_2$ and $\sigma_2^2$ be the population mean and population variance for machine 2.

Assume:
• $\sigma_1^2 = \sigma_2^2 = \sigma^2$ (common variance)
• $\sigma^2$ is unknown
• the observations from Machine 1 are independent of those from Machine 2.

Suppose that a random sample of size $n_1$ is taken from machine 1's production. Let $\bar{y}_1$ and $s_1^2$ be the resulting sample mean and sample variance.

Suppose that a random sample of size $n_2$ is taken from machine 2's production. Let $\bar{y}_2$ and $s_2^2$ be the resulting sample mean and sample variance.

Step 1: The Possible Hypotheses

$H_0: \mu_1 - \mu_2 = 0$    $H_a: \mu_1 - \mu_2 < 0$
$H_0: \mu_1 - \mu_2 = 0$    $H_a: \mu_1 - \mu_2 > 0$
$H_0: \mu_1 - \mu_2 = 0$    $H_a: \mu_1 - \mu_2 \ne 0$

This procedure can be generalized to test

$H_0: \mu_1 - \mu_2 = \delta_0$    $H_a: \mu_1 - \mu_2 < \delta_0$
$H_0: \mu_1 - \mu_2 = \delta_0$    $H_a: \mu_1 - \mu_2 > \delta_0$
$H_0: \mu_1 - \mu_2 = \delta_0$    $H_a: \mu_1 - \mu_2 \ne \delta_0$

where $\delta_0$ is a specified difference between the two means.
In our specific case, our hypotheses are:
$H_0: \mu_1 - \mu_2 = 0$
$H_a: \mu_1 - \mu_2 > 0$

Step 2: The Test Statistic

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$

In this case, t follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom.

Step 3: Critical or Rejection Regions
Once again, the rejection regions depend on the alternative hypothesis.
• For $H_a: \mu_1 - \mu_2 < 0$, we reject $H_0$ when $t < -t_{n_1+n_2-2,\alpha}$.
• For $H_a: \mu_1 - \mu_2 > 0$, we reject $H_0$ when $t > t_{n_1+n_2-2,\alpha}$.
• For $H_a: \mu_1 - \mu_2 \ne 0$, we reject $H_0$ when $|t| > t_{n_1+n_2-2,\alpha/2}$.

In our specific case, we reject $H_0$ when $t > t_{n_1+n_2-2,\alpha}$.

Step 4: Collect Data and Calculate the Test Statistic
A single batch of raw materials has been split to provide two production runs: one for machine 1, and one for machine 2.

MACHINE 1
240 243 250 253
238 242 245 251
239 242 246 248

MACHINE 2
241 243 245 248
239 240 242 243
239 240 250 252
241 243 249 255

For Machine 1: $\bar{y}_1 = 244.75$, $s_1^2 = 24.205$, $n_1 = 12$
For Machine 2: $\bar{y}_2 = 244.375$, $s_2^2 = 24.516$, $n_2 = 16$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{11(24.205) + 15(24.516)}{12 + 16 - 2} = 24.384$$

Thus, $s_p = 4.938$.

The value of the test statistic is

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{244.75 - 244.375}{4.938\sqrt{\dfrac{1}{12} + \dfrac{1}{16}}} = 0.199.$$

Step 5: Reach Conclusions
Suppose we use a significance level of 0.10. With $n_1 = 12$ and $n_2 = 16$, the critical value for the t statistic is

$$t_{n_1+n_2-2,\alpha} = t_{26,.10} = 1.315.$$

Because our observed value of the test statistic (0.199) is less than 1.315, we do not have sufficient evidence to reject the null hypothesis. Thus, we cannot show that Machine 1 produces larger outside diameters than Machine 2.

A $(1-\alpha)\cdot 100\%$ confidence interval for $\mu_1 - \mu_2$ is

$$(\bar{y}_1 - \bar{y}_2) \pm t_{n_1+n_2-2,\alpha/2}\; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$

Thus, a 95% confidence interval for the two machines is

$$(244.75 - 244.375) \pm 2.056(4.938)\sqrt{\frac{1}{12} + \frac{1}{16}} = 0.375 \pm 3.877 = (-3.502, 4.252).$$

Note: 0 is a plausible value for the true mean difference.

We need to check our assumptions.

Machine 1
Stem   Leaves      No.   Depth
23•    89          2     2
24*    0223        4     6
24•    568         3     6
25*    013         3     3

[Figure: Normal Probability Plot for Machine 1]

Machine 2
Stem   Leaves      No.   Depth
23•    99          2     2
24*    00112333    8
24•    589         3     6
25*    025         3     3

[Figure: Normal Probability Plot for Machine 2]

F. Paired t-test

1. The Hypothesis Test

Note: The two sample t test assumed that the two samples were independent of each other. There are many occasions where the two samples are not independent because they involve the same sampling unit.

Example: Marketing Pre-Test of a New Ball-Point Pen

The Marketing Department of a pen company determined that the basic ball-point pen needed revision. Marketing commissioned a production lot of a new prototype pen. A group of ten people who work at the production facility were asked to write with the new prototype and with the leading competitor's pen. Each person rated each pen's writing performance on a scale from 1 to 10, with 1 being extremely poor and 10 being excellent.

Note: We should expect significant differences in preference from individual to individual. The two ratings from the same person are not independent of one another!

How can we determine if people prefer the prototype? Let
• $y_{1i}$ be the observed score for the competitor's pen given by the $i$th person
• $y_{2i}$ be the observed score for the prototype pen given by the $i$th person

Define $d_i = y_{1i} - y_{2i}$. Let $\delta$ be the true mean difference in the scores.
• If $\delta = 0$, then there is no difference in the two pens.
• If $\delta > 0$, then the first pen tends to get higher ratings than the second.
• If $\delta < 0$, then the first pen tends to get lower ratings than the second.

We can set up an appropriate hypothesis testing procedure.

Step 1: State the Hypotheses

$H_0: \delta = \delta_0$    $H_a: \delta < \delta_0$
$H_0: \delta = \delta_0$    $H_a: \delta > \delta_0$
$H_0: \delta = \delta_0$    $H_a: \delta \ne \delta_0$

Note: Often, $\delta_0$ will be 0.

In our case, we wish to show that the prototype is better; thus, with $\delta_0 = 0$,
$H_0: \delta = \delta_0$
$H_a: \delta < \delta_0$

Step 2: State the Test Statistic
An appropriate estimate of $\delta$ is

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i.$$

Note: $\bar{d}$ is the sample mean difference.
Let $s_d^2$ be the sample variance for the differences,

$$s_d^2 = \frac{n\sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2}{n(n-1)}.$$

The appropriate test statistic is

$$t = \frac{\bar{d} - \delta_0}{s_d/\sqrt{n}}.$$

Step 3: State the Critical or Rejection Region
Our critical regions are:
• For $H_a: \delta < \delta_0$, we reject $H_0$ when $t < -t_{n-1,\alpha}$.
• For $H_a: \delta > \delta_0$, we reject $H_0$ when $t > t_{n-1,\alpha}$.
• For $H_a: \delta \ne \delta_0$, we reject $H_0$ when $|t| > t_{n-1,\alpha/2}$.

For the marketing pre-test, we should use a .05 significance level. Thus, we reject the null hypothesis if

$$t < -t_{n-1,\alpha} = -t_{9,.05} = -1.833.$$

Step 4: Collect Data and Calculate the Test Statistic
The actual data:

Individual   Competitor   Prototype   Difference
1            7            8           -1
2            6            7           -1
3            8            9           -1
4            10           8            2
5            2            9           -7
6            5            5            0
7            6            6            0
8            6            8           -2
9            4            10          -6
10           6            9           -3

For these data, $\bar{d} = -1.9$ and $s_d = 2.77$. Thus, our test statistic is

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{-1.9}{2.77/\sqrt{10}} = -2.17.$$

Step 5: Reach Conclusions
Since $t < -1.833$, we have sufficient evidence to reject the null hypothesis. As a result, we have evidence to suggest that people who work at this facility really do prefer the prototype.

2. The Confidence Interval

We can construct a 95% confidence interval for the true difference by

$$\bar{d} \pm t_{n-1,\alpha/2}\,\frac{s_d}{\sqrt{n}} = -1.9 \pm t_{9,.025}\,\frac{2.77}{\sqrt{10}} = -1.9 \pm 2.262(0.88) = -1.9 \pm 1.98 = (-3.88, 0.08).$$

The plausible values for this difference range from -3.88 to 0.08, which seems to contradict the results of our hypothesis test. We must keep in mind that we conducted a one-sided hypothesis test; however, our confidence interval is two-sided.

We do need to check our assumptions.

Stem   Leaves   No.   Depth
-s:    67       2     2
-f:
-t:    23       2     4
-*:    111      3
*:     00       2     3
t:     2        1     1

[Figure: The Normal Probability Plot]

3. When to Pair

A reasonable question: When should an experimenter pursue a paired structure?

Pairing works well when the sampling units available for the study differ widely among themselves. In this case, pairing allows us to remove the unit-to-unit variability, which makes our estimate of the standard deviation much smaller. As a result, we are more likely to reject our null hypothesis (we increase the power of our test).
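As a check on the marketing pre-test analysis in Sections F.1 and F.2, the sketch below recomputes the mean difference, its standard deviation, the test statistic, and the 95% confidence interval from the ten pairs of ratings. It uses only the Python standard library, so the t critical values (1.833 and 2.262) are entered by hand from the values quoted in the notes rather than computed; the variable names are illustrative.

```python
import math
from statistics import mean, stdev

# Ratings from the marketing pre-test, one entry per individual.
competitor = [7, 6, 8, 10, 2, 5, 6, 6, 4, 6]
prototype = [8, 7, 9, 8, 9, 5, 6, 8, 10, 9]

# d_i = y_1i - y_2i (competitor minus prototype), as defined in the notes.
differences = [c - p for c, p in zip(competitor, prototype)]
n = len(differences)

d_bar = mean(differences)            # sample mean difference
s_d = stdev(differences)             # sample standard deviation (n - 1 divisor)
standard_error = s_d / math.sqrt(n)
t_stat = (d_bar - 0) / standard_error  # delta_0 = 0

# Tabulated critical values quoted in the notes (9 degrees of freedom).
T_9_05 = 1.833    # one-sided test at alpha = .05
T_9_025 = 2.262   # two-sided 95% confidence interval

reject_h0 = t_stat < -T_9_05
ci = (d_bar - T_9_025 * standard_error, d_bar + T_9_025 * standard_error)

print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.2f}, t = {t_stat:.2f}")
print(f"reject H0 (one-sided, alpha = .05): {reject_h0}")
print(f"95% CI for delta: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The one-sided test rejects while the two-sided interval just covers 0, which is the apparent contradiction discussed in Section F.2: the test uses the critical value 1.833, the interval uses 2.262.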
On the other hand, pairing the data also reduces the number of degrees of freedom available for our analysis. Decreasing the number of degrees of freedom makes the critical value for our test statistic slightly larger in absolute value. As a result, it is slightly more difficult to reject the null hypothesis (we slightly decrease the power of our test).

In general, we should obtain paired data whenever we know that the sampling units differ significantly from one another. The reduction in variability typically more than compensates for the slight increase in the critical value for the test.

G. Transformations

There are times when engineering data do not follow a normal distribution. This violates our distributional assumption for the t-test. One approach for dealing with nonnormal data is to transform the data to a different scale where normality holds. Common transformations in the engineering sciences are the natural log, the square root, and the inverse.

Consider an example of the sealing strength of plastic bags with a target strength of 11 Newtons. A normal probability plot shows that the data depart from normality.

[Figure: normal probability plots (Quantile of Standard Normal) for the Original Data and for the Transformed Data Using Inverse (horizontal axis: Inverse of Data)]

The output from the t-test on the transformed data shows that there is no evidence that the mean has changed from 11 using $\alpha = 0.05$.

Test of mu = 0.0909 vs not = 0.0909

N    Mean       StDev      SE Mean    95% CI                 T       P
20   0.083293   0.018119   0.004051   (0.074813, 0.091773)   -1.88   0.076

It is important to remember to transform the nominal value being tested (11 becomes 1/11 = 0.0909).

An alternative to transforming the data is to apply a methodology that does not rely on the normality assumption. This methodology is known as nonparametric statistics. The analogous nonparametric procedure for the one-sample t-test tests the population median and is called the sign test. Essentially, the sign test counts the number of observations above and below the hypothesized median.
If the null hypothesis is true, we would expect half the observations to be above the hypothesized median and half below. Using the binomial distribution, one can calculate a p-value when the numbers above and below deviate from half the data.

Sign test of median = 11.00 versus not = 11.00

N    Below   Equal   Above   P        Median
20   7       0       13      0.2632   11.70

Note that nonparametric procedures are typically less powerful than t-tests when the normality assumption holds, since we are only concerned with the number of values above and below the median and not their exact values. (If all 13 values above 11 in our example were multiplied by 100, we would get the same p-value in the sign test.)

There are nonparametric procedures for the two-sample independent t-test (rank sum test) and the paired t-test (signed rank test). These procedures can be done very quickly in standard statistical software packages.
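The binomial calculation behind the sign test output can be sketched in a few lines. The function below is a minimal illustration, not the software's actual routine: it assumes a two-sided test, drops observations equal to the hypothesized median (there are none in the example), and doubles the smaller tail of a Binomial(n, 1/2) distribution. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(n_below, n_above):
    """Two-sided sign test p-value from counts below and above the hypothesized median.

    Under H0 each observation falls above or below the median with probability 1/2,
    so the smaller count follows a Binomial(n, 1/2) distribution.
    """
    n = n_below + n_above
    k = min(n_below, n_above)
    # P(X <= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts from the sealing-strength example: 7 below and 13 above the median of 11.
p_value = sign_test_p_value(n_below=7, n_above=13)
print(f"sign test p-value: {p_value:.4f}")
```

With 7 observations below and 13 above, this reproduces the reported p-value of 0.2632. Because only the counts enter the calculation, rescaling the values above 11 leaves the result unchanged, as the note above points out.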