Chapter Five
Hypothesis Testing: Concepts

The Purpose of Hypothesis Testing . . . . . . . . . . . 110
An Initial Look at Hypothesis Testing . . . . . . . . . 112
Formal Hypothesis Testing . . . . . . . . . . . . . . . 114
    Introduction . . . . . . . . . . . . . . . . . . . . 114
    Null and Alternate Hypotheses . . . . . . . . . . . 114
    Procedure for Formal Hypothesis Tests . . . . . . . 115
    Examples . . . . . . . . . . . . . . . . . . . . . . 120
Errors in Hypothesis Testing . . . . . . . . . . . . . . 124
    Introduction . . . . . . . . . . . . . . . . . . . . 124
    False Positive Errors . . . . . . . . . . . . . . . 124
    False Negative Errors . . . . . . . . . . . . . . . 126
    Summary: Choosing the Confidence Level . . . . . . . 130
Chapter Checkpoint . . . . . . . . . . . . . . . . . . . 131

The Purpose of Hypothesis Testing

The purpose of obtaining measurements of a chemical system is usually to draw some conclusions about the properties of the system.
One of the simplest uses of statistics, one that has largely concerned us to this point, is to obtain an estimate of the system properties through the use of confidence intervals. This is an aspect of statistical estimation theory. Now, however, we turn our attention to decision theory, where we learn how we can use measurement statistics to draw general conclusions about chemical systems. The following are examples of situations where we want to draw some kind of conclusion based on measurements:

• Two reactants are mixed, and the concentrations of the products are monitored as a function of time in order to determine the rate constant, k, of the reaction. You want to compare the result of your measurement with a value calculated from theory.

• You have just come up with a new synthetic procedure for a certain commercial product that you believe increases the yield over the currently accepted method. You measure the yield by both methods, and you find that your method gives a 65% yield while the older method gave a 60% yield. You must show that your method is actually superior to the older method, and that the increase in yield is not due to the uncertainty in the measured values.

For a more detailed example, consider the following situation. Let's say we obtain the following measurements of the pH of a particular solution:

    pH measurements: 9.5, 9.9, 9.8

Now we wish to know whether it is possible to state, with confidence, that the pH of the solution is less than 10. If we can assume that the measurements are unbiased, we can restate this question in a form that can be evaluated with statistics, namely: "is it true that pH < 10?" Now, assuming no measurement bias, the fact that none of the measurements of pH is greater than 10 seems to support the notion that the true pH of the solution is less than ten.
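The intuition behind the question can be checked with a one-line calculation. If the true pH really were 10 and each unbiased reading were equally likely to fall above or below it, the chance that all three readings land below 10 is the same as flipping three tails in a row. A minimal sketch (the symmetric fifty-fifty assumption is mine, for illustration only):

```python
# Idealized assumption: each unbiased measurement is equally likely to
# read high or low when the true pH is exactly 10.
p_low = 0.5
p_all_below = p_low ** 3   # chance that three independent readings all fall below 10
print(p_all_below)         # 0.125
```

So even if the true pH were exactly 10, there is a 12.5% chance (under this idealization) of seeing three low readings by luck alone.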
However, since the measurement of pH is a random variable, there is always a chance that the actual pH is indeed greater than ten, and that the three measurements, by random chance, all happen to be less than 10 – just like there is a chance that three coin flips in a row will come up tails, even though there is a fifty-fifty chance of getting heads on any single coin toss. Our problem is this: at what point can we say that random variability is an unlikely explanation for the difference between the measured pH values and a fixed value (e.g., a pH of 10)? In other words, when do the measured values "differ significantly" from the fixed value? The meaning of the word "significantly" must be very clear: a statistically significant difference in the values is a greater difference than could be reasonably explained by random error. This is exactly the type of question that hypothesis testing answers. Hypothesis tests are sometimes called significance tests, since they detect "significant" differences in numbers, differences that are unlikely to be due to random chance.

An Initial Look at Hypothesis Testing

Let's use an example to help us see how we might derive conclusions using random variables (i.e., measurements).

Example 5.1
A cigarette manufacturer states that the nicotine level of its cigarettes is 14 mg per cigarette. You wish to test this claim. You collect a random sample of 5 cigarettes and test for nicotine content. The measured nicotine levels (in mg) of the cigarettes in the sample are 14.05, 14.33, 16.36, 18.55, 14.76. Do these measurements indicate a nicotine level different than that claimed by the manufacturer?

Basically, what we would like to do is test the following statement:

Hypothesis: The true nicotine level of the cigarettes is different from that claimed (14 mg) by the manufacturer.

Let's calculate the mean of the measured nicotine level:

    x := (14.05  14.33  16.36  18.55  14.76)T mg       measurements
    x_bar := mean(x)        x_bar = 15.61 mg

So the mean measured level of nicotine in the five cigarettes was 15.61 mg/cigarette. Obviously, this value is somewhat larger than the nicotine level stated by the manufacturer. The question is, however: is the difference between the nicotine levels "significant?" Do we have any justification for challenging the nicotine level claimed by the manufacturer? In order to answer this question, we need more information than simply the measurement average: we must also make use of the observed variability of the five measurements to construct a confidence interval.

    s_x := stdev(x)
    se := s_x/√5              se = 0.837 mg        standard error of the mean value
    t := 2.7765                                    critical t-value for 4 df's at the 5% level
    width := t·se             width = 2.32 mg
    x_lower := x_bar − t·se   x_lower = 13.29 mg   lower boundary of CI
    x_upper := x_bar + t·se   x_upper = 17.93 mg   upper boundary of CI

In this instance, the 95% confidence interval is 15.61 ± 2.32 mg/cigarette. Recall exactly what this interval represents: assuming no bias, this range of values (13.29 → 17.93 mg) contains the true amount of nicotine in the cigarettes analyzed, with 95% probability.

Since the confidence interval calculated from the measurements on five cigarettes includes 14 mg, we cannot support the original hypothesis that the manufacturer's claimed nicotine level is incorrect. In other words, the difference between the measurement mean of 15.61 mg and the manufacturer's stated level of 14 mg is not significant. Note that we must be very careful in how we phrase our conclusion. Even though the confidence interval includes the value 14 mg, we have not proven that the manufacturer's claim is true. In other words,

• we do not prove that [nicotine] = 14 mg/cigarette. We can only state that there is a 95% probability that the true nicotine content is somewhere between 13.29 and 17.93 mg; our best estimate of the nicotine content is 15.61 mg.
• we cannot prove (with 95% probability) that [nicotine] ≠ 14 mg/cigarette, since the 95% confidence interval contains this value.

We have just had our first brush with hypothesis testing, where we use data (containing random error) from an experiment to test an assertion. This is obviously an important area of statistics, and one that we will discuss in detail.

Formal Hypothesis Testing

Introduction

In the last section, a confidence interval was constructed in order to test a specific hypothesis. In scientific endeavors, there are a wide variety of different types of hypotheses that may need to be tested using the results of one or more experiments. In this section, we will formalize the procedure to be used in hypothesis testing. Although the procedure may seem a little rigid, it can be adapted for almost any situation. The price for the general applicability of the procedure is the use of somewhat abstract language and concepts.

Null and Alternate Hypotheses

All hypothesis tests actually involve at least two statements, called the null hypothesis (H0) and the alternate (or working) hypothesis (H1). A statistical hypothesis is an assertion or conjecture concerning one or more population parameters. Basically, this step is a translation from words to population parameters. The null hypothesis, H0, will generally involve an equality and one or more population parameters. In our nicotine example, the null hypothesis would be:

    H0: µx = 14 mg/cigarette        null hypothesis

In other words, we accept as the null hypothesis the manufacturer's claim that each cigarette contains 14 mg of nicotine. If the null hypothesis is true, and if there is no bias in the measurements, then the population mean µx of all measurements will be 14 mg. As you can see, the null hypothesis involves a population parameter (µx, the population mean of the measurements) and a statement of equality.
As we will stress time and again, the null hypothesis cannot be proven. It is assumed as fact unless the data prove otherwise. The alternate hypothesis, H1, will be a statement involving the same population parameters, in such a way that H1 and H0 cannot both be true. Usually the alternate hypothesis involves one of the following relational operators: ≠, <, or >. For our example,

    H1: µx ≠ 14 mg/cigarette        alternate hypothesis (two-tailed test)

Alternate hypotheses such as this one, with a "not equals" (≠) relationship, result in two-tailed tests. This statement claims that the measurement population mean is not 14 mg; if we assume no measurement bias, this hypothesis disputes the manufacturer's claim of nicotine level. The form of both hypotheses is very important, particularly that of the alternate hypothesis. This is because we are testing the alternate hypothesis in the hypothesis test procedure. Suppose we actually suspect that the manufacturer is underestimating the nicotine level in the cigarettes; in this case, we would use a different alternate hypothesis:

    H1: µx > 14 mg/cigarette        (one-tailed test)

or, in words, H1: "the true nicotine content is greater than 14 mg/cigarette."

This form of H1 would result in a slightly different hypothesis test. Alternate hypotheses such as this one, with a greater than (>) or less than (<) relationship, result in one-tailed tests. In the hypothesis testing procedure, we assume that the null hypothesis is true, and it is not tested. The goal of the procedure is to test the assertion embodied by the alternate hypothesis, H1. If H1 is proven to be true, then obviously H0 will be false. This format is exactly the same as that of the US criminal legal system, as represented in the famous statement "innocent until proven guilty." In statistical hypothesis testing, H0 is assumed to be true unless H1 can be proven to be true with reasonable certainty.
Procedure for Formal Hypothesis Tests

For easy reference, here is a list of the steps in hypothesis testing; each step will be discussed in detail.

1. Form the null hypothesis, H0, and the alternate hypothesis, H1, in terms of statistical population parameters.
2. Choose the desired confidence level (the closely related significance level is discussed below).
3. Choose a test statistic and calculate it.
4. Calculate the critical values; alternately, determine the P-value of the test statistic.
5. State the conclusion clearly, avoiding statistical jargon.

Step 1: State the null hypothesis (H0) and the alternate hypothesis (H1)

We have described the null and alternate hypotheses. Formulating these is the most difficult but crucial part of the test procedure. Remember that we begin with an assumption that H0 is true, and that we are trying to test H1. We may be interested in either proving or disproving H1. The following table gives the null hypotheses for three common statistical tests. Note that the null hypothesis always involves population parameters, and (in these cases) is expressed as an equality.

    Situation: comparison of a random variable, x, and a fixed value, k
    Null hypothesis: H0: µx = k
    Answers the question: Is there a significant difference between the mean of some measurements and some fixed value?

    Situation: comparison of the means of two variables, x and y
    Null hypothesis: H0: µx = µy
    Answers the question: Is there a significant difference between the means of two sets of measurements?

    Situation: comparison of the variances of two variables, x and y
    Null hypothesis: H0: σ²x = σ²y
    Answers the question: Is there a significant difference between the variances of two sets of measurements?

The alternate hypotheses, H1, in these cases may involve an inequality (≠) or a relational operator (< or >). As discussed previously, the form of H1 determines whether we use a one-tailed or a two-tailed test.
Step 2: Choose the desired level of confidence/significance

Remember that any confidence interval has an associated confidence level. The purpose of a confidence interval is to "bracket" the possible values for a population parameter such as µx. Random variables always add a little "spice" (i.e., uncertainty) to any conclusion; there is always a chance that we are wrong, since random variables are, well, random. So the confidence level is needed to state the probability that the population parameter is truly contained within our confidence interval. It is a measure of how much we trust the interval, how "confident" we are in our result. Since confidence intervals play a crucial role in hypothesis testing, it is not surprising that we generally choose a confidence level when testing assertions using the results of experiments, which are almost always random variables. The meaning of the confidence level in hypothesis testing is slightly different than in confidence intervals, however. Consider our example. We have two competing hypotheses: H0: µx = 14 mg and H1: µx ≠ 14 mg. We are testing the alternate hypothesis, H1, and there are two possible outcomes:

1. We succeed in proving that H1 is true, in which case H0 is known to be false.
2. We fail to prove that H1 is true. [Remember! We cannot prove that H0 is true.]

The confidence level in hypothesis testing measures our certainty when we succeed in proving H1. It is the probability that the conclusion that H1 is true and H0 is false is correct. Let's assume that we want to test at the 95% level for our example. That means that, if our test proves that the nicotine level is not 14 mg, there is a 95% probability that our data have led us to the proper conclusion. You might wonder: why wouldn't I want to be very certain in my conclusion? In other words, shouldn't I always choose a high confidence level in hypothesis testing (at least 95%, and maybe 99% or even 99.9%)?
We will defer a discussion of the appropriate confidence level in testing to later in the chapter. But for now, ask yourself this question: why don't you similarly always choose a high confidence level in constructing confidence intervals? A 95% confidence interval is commonly given; why not always use 99%, or 99.9%? What effect would that have on the confidence interval? There are both advantages and disadvantages in choosing high confidence levels, as we will discover.

In statistics, the term significance level is probably more common than confidence level in hypothesis testing. The significance level (SL) is directly related to the confidence level (CL): SL = 100% − CL. Thus, instead of testing at the 95% confidence level, we may instead test at the 5% significance level and arrive at the same conclusions. Although we will tend to use the term "confidence level" in this text, you should be familiar with both terms.

Step 3: Choose a test statistic and calculate its value

The next step in hypothesis testing is to choose a statistic (the test statistic) appropriate for testing the hypotheses. The test statistic (like any statistic) is a value calculated in some manner from the data. Since the data presumably contain random error, the test statistic will likewise be a random variable. There are two requirements for a test statistic:

1. Its probability distribution must be known; preferably, tables of critical values exist for the statistic.
2. The test statistic should result in a reasonably "good" (or "efficient") hypothesis test. What factors might make one test better than another? Let's come back to that point in a little bit.

In example 5.1, the null and alternate hypotheses both deal with the population mean µx of the measurements, so it would seem that we could certainly use the sample mean of the measurements as the basis for the test statistic. In constructing a confidence interval for µx, the t-distribution is used (when σx is not known).
This suggests that the following test statistic, T, could be used in this hypothesis test:

    T = (x̄ − 14 mg)/s(x̄)        possible test statistic

where s(x̄) is the standard error of the sample mean. The test statistic is the studentized sample mean. It has a t-distribution; if H0 is true, then µT = 0. The sample mean is not the only possible basis of the test statistic. Instead, we could use the sample median, or some other form of weighted average. It turns out that for normally distributed data, the studentized sample mean is the best test statistic to use for hypothesis tests such as that of example 5.1. Let's calculate the observed value of the test statistic for the five measurements in example 5.1:

    T_obs := (x_bar − 14 mg)/se        T_obs = 1.9243

This is the "studentized" mean: the number of standard deviations of the mean from 14 mg. In this equation, "se" is the standard error of the sample mean, x_bar. According to the observed test statistic, the mean of the measurements, 15.61 mg/cigarette, is 1.92 standard deviations from the manufacturer's claimed value of 14 mg/cigarette.

Step 4: Calculate the critical value(s) or the P-value

It is important to keep in mind that the null hypothesis, H0, is "innocent until proven guilty." The probability distribution of the test statistic, T, assuming that the null hypothesis is true, is called the null distribution. The next step in hypothesis testing is to calculate the critical value(s) of the null distribution. For two-tailed tests, such as the one we must use for example 5.1, there are two critical values. (One-tailed tests have only a single critical value.) The null distribution of T is a t-distribution with four degrees of freedom and a mean of zero.
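The worked arithmetic for example 5.1 (mean, standard error, confidence interval, and the studentized test statistic) can be mirrored in a few lines of code. This is a sketch assuming Python with scipy is available; it is not part of the original chapter:

```python
from statistics import mean, stdev
from math import sqrt
from scipy import stats  # assumed available, for t-distribution quantiles

x = [14.05, 14.33, 16.36, 18.55, 14.76]     # nicotine measurements (mg)
n = len(x)
x_bar = mean(x)                             # sample mean, 15.61 mg
se = stdev(x) / sqrt(n)                     # standard error of the mean, ~0.837 mg
t_crit = stats.t.ppf(0.975, df=n - 1)       # ~2.7765 for 4 df (95% CI)
ci = (x_bar - t_crit * se, x_bar + t_crit * se)   # ~(13.29, 17.93) mg
t_obs = (x_bar - 14.0) / se                 # studentized mean, ~1.9243
print(round(x_bar, 2), round(se, 3), round(t_obs, 4))
```

Note that `stats.t.ppf(0.975, df=4)` is the upper 2.5% quantile, since a 95% two-sided interval leaves 2.5% in each tail.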
Recalling that we chose 95% as our confidence level, the critical values are

    Tcrit = ±t4,0.025 = ±2.7765

[Figure 5.1: Decision criteria for the hypothesis test of example 5.1. The null distribution is drawn with the lower critical value at T = −2.7765 and the upper critical value at T = +2.7765; 95% of the area lies between them. Outside the critical values, H0 is rejected and H1 accepted; between them, H0 is accepted.] If the observed test statistic is above the upper critical value, or below the lower critical value, then we accept the alternate hypothesis, H1, and reject the null hypothesis, H0. The critical values are the boundaries between two decision-making regions:

• the acceptance region, between the two critical values. If the test statistic assumes a value in this region, then the null hypothesis, H0, is accepted. We cannot prove the alternate hypothesis, H1, with the desired confidence level.

• the rejection region, where Tobs > Tupper or Tobs < Tlower. If the test statistic is in this region, then H0 is rejected and H1 is accepted. We have proven that H1 is true at the desired confidence level.

By inspecting the null distribution, we can see how the critical values are chosen, and we can understand the role of the confidence level in hypothesis testing. Figure 5.1 shows the situation for a two-tailed test at the 95% confidence level. We choose the critical values so that 95% of the area under the null distribution is between the critical values. What this means is that, if the null hypothesis is true, there is a 95% probability that the observed test statistic will be within the acceptance region. It is not strictly necessary to calculate the critical values. An alternative approach makes use of the concept of the P-value, which has been mentioned before.
The P-value can be interpreted in terms of the null distribution; in particular, for a two-tailed test (with Tobs > 0), the P-value is

    Pobs = P(T > Tobs) + P(T < −Tobs) = 2·P(T > Tobs)        two-tailed P-value

Consider example 5.1: the mean of five measurements of nicotine content was 15.61 mg/cigarette, which is 1.92 standard deviations from the manufacturer's claimed value. Most statistical programs and spreadsheets will calculate the P-value; for example 5.1, the two-tailed P-value is

    Pobs = 0.1266

In other words, if the null hypothesis were true, there is a 12.66% probability that we would obtain a sample mean that is farther than 1.92 standard deviations from 14 mg/cigarette (in either direction). The P-value is used instead of (or in addition to) critical values. It indicates the weight of the evidence in favor of the alternate hypothesis: the smaller the P-value, the less likely it is that random variability can account for the observed data. To tie the P-value approach to the "critical region" approach, consider this: the P-value tells us the maximum value of the confidence level that we can adopt and still prove the alternate hypothesis. We calculate this value by

    CL = 100%·(1 − Pobs)        maximum confidence level

where CL is the confidence level as a percentage. For example 5.1, if we choose a confidence level of 87.34% or less, then we can prove that the alternate hypothesis is true. Of course, a smaller confidence level means that we are less confident of our conclusion, so we want a P-value as small as possible. We may more directly interpret the P-value in terms of the significance level. The P-value is the smallest significance level at which we may accept the alternate hypothesis. Thus, in this example, we can prove H1 at the 12.66% significance level, at best. Remember: a smaller significance level means we are more certain of this conclusion.
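These P-value relations can be checked numerically with any t-distribution routine; here is a sketch using scipy's survival function (my tooling choice, not the chapter's):

```python
from scipy import stats

t_obs, df = 1.9243, 4
p_two = 2 * stats.t.sf(t_obs, df)   # two-tailed P-value, ~0.1266
max_cl = 100 * (1 - p_two)          # largest confidence level at which H1 can be proven, ~87.3%
print(round(p_two, 4), round(max_cl, 1))
```

`stats.t.sf(t, df)` returns P(T > t), so doubling it gives the two-tailed P-value for a symmetric null distribution.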
Aside: calculating P-values in Excel

When the null distribution is a t-distribution, the P-value is calculated in Excel by using the TDIST() function:

    Pobs = TDIST(Tobs, df, tails)

where Tobs is the observed value of the test statistic, df is the degrees of freedom of the t-distribution, and tails is either one or two (for one- or two-tailed P-values). For example 5.1, you would enter "=TDIST(1.9243, 4, 2)" into any cell to obtain the two-tailed P-value. Other Excel functions would be needed when the null distribution does not follow a t-distribution.

Step 5: State the conclusion

After we decide whether to accept H0 or H1, we must state our conclusion in a manner that is accurate and yet can be understood by anyone who does not have a background in statistics. Essentially, we must translate our conclusion from "statistic-ese" (e.g., "reject H0, accept H1") to normal language. We should give both our conclusion and the confidence level, even though the confidence level is most properly understood in a statistics framework. For example 5.1, we accepted H0; we couldn't prove H1. In other words, our conclusion would be:

    We cannot prove with 95% confidence that the nicotine level in the cigarettes is different than 14 mg/cigarette.

This statement sounds like poor English (basically a double negative), but the wording was very carefully chosen. We begin with the assumption that the cigarettes have 14 mg of nicotine, and we fail to prove any differently. This is similar to a jury returning a verdict of "Not Guilty" in a criminal trial. Notice that the verdict is not that the defendant was innocent, simply that guilt was not proven beyond a "reasonable doubt." In hypothesis testing, the level of "reasonable doubt" is determined when the confidence level is set.

Examples

Let's try another two-tailed test. This test is similar in nature to example 5.1.
Example 5.2
A certain analytical procedure is being tested for the presence of measurement bias. Twenty measurements are made on a solution whose concentration has been certified at 1.000 µM. The sample mean is 1.010 µM, with an RSD of 5.0% for the individual measurements. Is there any evidence of measurement bias?

First let's set up the null and alternate hypotheses:

    H0: µx = 1.000 µM        There is no bias in the measurements.
    H1: µx ≠ 1.000 µM        Bias exists; two-tailed test.

    ξx := 1.000 µM            certified concentration
    x_bar := 1.010 µM         RSD := 5.0%
    s_x := RSD·x_bar          s_x = 0.0505 µM
    std_err := s_x/√19        std_err = 0.0116 µM

Let's use the studentized mean as the test statistic, and calculate the observed test statistic:

    T_obs := (x_bar − ξx)/std_err        T_obs = 0.8631        sample mean is this many std devs from the certified value
    P_obs = 0.3988                       two-tailed P-value of the observed test statistic

Now we look up the critical values from the t-tables. For 19 degrees of freedom, a 95% confidence level and a two-tailed test, the critical values are −2.0930 and +2.0930. Since the observed value of the test statistic is within the acceptance region, we must accept the null hypothesis. Thus, we cannot prove bias in these measurements at the 95% confidence level. Note: from the observed P-value for this example, we see that we can only prove H1 with 60.12% confidence, at best.

Now let's try a one-tailed test.

Example 5.3
It is suspected that a series of tests of blood alcohol level will prove that the alcohol level is above the legal limit of 0.10%. The measurements are:

    0.106  0.118  0.097  0.127  0.134  0.141

Do these measurements prove legal intoxication with 95% confidence? As always, the first step is to set up the null and alternate hypotheses.
In this case, we should use the following:

    H0: µx = 0.10 %        null: "blood alcohol level at the legal limit (assuming no bias)"
    H1: µx > 0.10 %        alternate: "blood alcohol level above the legal limit"

It may be a little difficult to see why the null hypothesis should be that the blood alcohol level is exactly 0.10 %. In setting up the hypotheses, it is best to always ask yourself: what is it that I want to test? What are the possible conclusions? The answers to these questions determine the form of the alternate hypothesis; the null hypothesis will follow. For this example, we want to test whether or not the alcohol level is above the legal limit. Remember that the purpose of the statistical test procedure is actually to test the alternate hypothesis, so we would propose as the alternate hypothesis that the alcohol level is too high. The nature of the testing procedure is such that we either prove or fail to prove this hypothesis; i.e., our conclusion will be either that we can prove that the alcohol level is too high (a "guilty" verdict) or that we cannot prove an excessive alcohol level ("not guilty"). These conclusions are proper for our intentions in this example. Since we propose µx > 0.10 % as our alternate hypothesis, the corresponding null hypothesis is µx = 0.10 %. The other thing to notice about the form of H1 in this example is that it results in a one-tailed test. This will affect the critical values (and the P-value, if we calculate it). Let's continue with our testing procedure. We can proceed by calculating the observed test statistic:

    x := (0.106  0.118  0.097  0.127  0.134  0.141)T %
    x_bar := mean(x)                     x_bar = 0.1205 %
    std_err := stdev(x)/√6               std_err = 0.0069 %
    T_obs := (x_bar − 0.10 %)/std_err    T_obs = 2.9865        studentized measurement mean
    P_obs = 0.0153                       the probability of seeing a value larger than T_obs is about 1.5%
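The arithmetic of example 5.3 can be verified directly; a sketch assuming Python with scipy (note that the one-tailed P-value for t ≈ 2.99 with 5 degrees of freedom works out to roughly 1.5%):

```python
from statistics import mean, stdev
from math import sqrt
from scipy import stats  # assumed available

x = [0.106, 0.118, 0.097, 0.127, 0.134, 0.141]   # blood alcohol readings (%)
x_bar = mean(x)                                  # 0.1205 %
se = stdev(x) / sqrt(len(x))                     # standard error, ~0.0069 %
t_obs = (x_bar - 0.10) / se                      # studentized mean, ~2.9865
p_one = stats.t.sf(t_obs, df=len(x) - 1)         # one-tailed P-value, ~0.015
print(round(x_bar, 4), round(t_obs, 4), round(p_one, 4))
```

Because H1 is one-sided (µx > 0.10 %), only the upper tail of the null distribution contributes to the P-value, so no factor of two is applied.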
The P-value is standard output for many statistical programs. In this case, the one-tailed P-value is about 1.5%, which means that we could prove H1 at roughly the 98.5% confidence level, if we desired; certainly at the 95% level we may reject H0 and accept H1. However, it is difficult to use t-tables to calculate Pobs, so we will confirm this decision using the critical value approach. For a one-tailed test, there is only a single critical value, as shown in the next figure.

[Figure 5.2: An example of a one-tailed test (H0: µ = k, H1: µ > k). There is only a single critical value. The top panel shows the null distribution; the critical value is chosen so that the area under the curve to the left of it equals the confidence level (95% for this example). The lower panel shows the decision process: if the observed test statistic is larger than the critical value, Tobs > Tcrit, then the null hypothesis is rejected and the alternate hypothesis is proven.]

Recall that the null distribution is the probability distribution of the test statistic, T, assuming that H0 is true. As the upper panel of the figure shows, we must choose the critical value such that, for the null distribution,

    P(Tobs < Tcrit) = CL

where "CL" is the chosen confidence level. For our example, we have chosen a confidence level of 95%. We can determine the critical value from the t-tables:

    Tcrit = tν,α = t5,0.05        one-tailed critical value

where ν is the appropriate degrees of freedom, and α is the area in the right tail of the t-distribution. We determine the value of α from the confidence level: CL = (1 − α)·100%. For our example, the t-tables tell us that the critical value is Tcrit = 2.0150.
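The critical value quoted above can be pulled from the t-distribution's quantile function instead of tables; a scipy-based sketch (my tooling choice, not the chapter's):

```python
from scipy import stats

CL, df = 0.95, 5
t_crit_one = stats.t.ppf(CL, df)                # one-tailed critical value, ~2.0150
t_crit_two = stats.t.ppf(1 - (1 - CL) / 2, df)  # upper two-tailed critical value, ~2.5706, for comparison
print(round(t_crit_one, 4), round(t_crit_two, 4))
```

The comparison line shows why one-tailed tests are more sensitive in the suspected direction: the one-tailed critical value (2.0150) is smaller than the two-tailed one (2.5706) at the same confidence level.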
If you recall, the observed value of the test statistic was 2.9865; since this is larger than the critical value, we reject the null hypothesis and accept the alternate hypothesis. Our conclusion is:

    Assuming no measurement bias, the data show that the blood alcohol level is above the legal limit (at the 95% confidence level).

Errors in Hypothesis Testing

Introduction

Since they involve random variables, there is always an element of uncertainty in hypothesis tests. Specifically, there is always a chance that the conclusion of a test is in error. This uncertainty is the reason that you must specify a confidence level when you perform statistical tests. Choosing the confidence level allows you to determine the degree of the uncertainty in your test: basically, you can control the likelihood that your conclusion is correct. As we will see, the confidence level also indirectly determines the ability of the statistical test to detect and label small differences as "significant." How can the conclusion from a hypothesis test be in error? For tests with a single null hypothesis, H0, and a single alternate hypothesis, H1, the following table shows all the possibilities:

                         decision: accept H0        decision: accept H1
                         ("negative" result)        ("positive" result)
    H1 is not true:      correct                    false positive
    H1 is true:          false negative             correct

Let's illustrate with an example. Let's say someone undergoes a pregnancy test. Now the reality of the matter is that the person is either pregnant or she isn't. The test will either decide in favor of pregnancy (called a "positive" test result) or will decide that the subject is not pregnant (a "negative" result). We can draw an analogy to statistical hypothesis tests. We begin with the assumption (the null hypothesis) that the subject is "not pregnant." The alternate hypothesis, the one we want to test, is that the subject is pregnant.
A conclusion in favor of pregnancy (H1 is accepted) is considered a positive test result; however, if the subject actually is not pregnant (H0 is actually true), then our conclusion is in error. This situation, an incorrect acceptance of H1, is called a false positive. On the other hand, if the conclusion of the test is that the subject is not pregnant (H0 is accepted), and this conclusion is in error (H1 is actually true), then the test gives a false negative. In the remainder of this section, we will describe how to calculate the probability that the result of a hypothesis test is in error (either a false positive or a false negative).

False Positive Errors

All of the hypothesis tests presented so far in this chapter have been of the following type. The null hypothesis is

H0: µx = k      the true measurement mean is some fixed value, k

while the alternate hypothesis is one of the following:

H1: µx ≠ k      the true measurement mean is not the fixed value, k (a two-tailed test)
H1: µx > k      the true measurement mean is larger than the fixed value, k (a one-tailed test)
H1: µx < k      the true measurement mean is smaller than the fixed value, k (a one-tailed test)

The decision criterion of the test is the following: if the observed test statistic, Tobs, is outside of the interval defined by the critical value(s), then we reject H0 and accept H1. A false positive occurs when Tobs is outside the H0 acceptance region when, in fact, H0 is true. The probability of a false positive is controlled by choosing the appropriate confidence level in a statistical test. To be exact,

CL = (1 − α) × 100%

where CL is the chosen confidence level and α is the probability of a false positive. In other words, when testing at the 90% confidence level, there is a 10% chance of falsely accepting H1. Let's imagine that we are comparing a mean value, µx, to a fixed value k. Unknown to us, the null hypothesis is actually true.
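The relationship between α and the confidence level can be checked empirically. The Monte Carlo sketch below is purely illustrative (all of its numbers are chosen for the illustration, not taken from the text): we repeatedly sample data for which H0 really is true and count how often a one-tailed z-test at 90% confidence wrongly accepts H1.

```python
# Illustrative Monte Carlo check: when H0 is true, a test at confidence
# level CL accepts H1 (a false positive) with probability about 1 - CL.
import random
from math import sqrt
from statistics import NormalDist

random.seed(42)                      # reproducible illustration
k, sigma, n = 0.0, 1.0, 5            # H0 is true: the population mean really is k
cl = 0.90                            # chosen confidence level
z_crit = NormalDist().inv_cdf(cl)    # one-tailed critical value (about 1.2816)

trials = 20000
false_positives = 0
for _ in range(trials):
    xs = [random.gauss(k, sigma) for _ in range(n)]
    t_obs = (sum(xs) / n - k) / (sigma / sqrt(n))   # standardized sample mean
    if t_obs > z_crit:               # Tobs beyond the critical value: accept H1
        false_positives += 1

print(false_positives / trials)      # close to alpha = 1 - CL = 0.10
```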
The following figure shows the null distribution of the test statistic, i.e., the probability distribution of the test statistic when the null hypothesis is actually true.

Figure 5.3: Choosing the critical values for a two-tailed test. Each tail beyond a critical value contains an area of α/2. If Tobs occurs between the critical values, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area in the two tails is the probability of a false positive: it is the probability that Tobs does not fall between the critical values, even though it "should," since H0 is true.

Now we can see how the critical values are chosen for two-tailed tests: each tail must contain an area of α/2, so that the total probability of a false positive is α, the desired value.

Now let's consider the probability of a false positive error for a one-tailed test. In such a test, there is only a single critical value. Let's imagine that we are testing for values that are greater than a fixed value, k; in other words, our alternate hypothesis is H1: µx > k. The next figure shows the null distribution, together with the critical value and the probability of a false positive.

Figure 5.4: Choosing the critical value for a one-tailed test. The single tail beyond the critical value contains an area of α. If Tobs is less than the critical value, then the null hypothesis is accepted; if not, then H1 is accepted. The shaded area is the probability of a false positive. Note that the critical value is chosen such that the probability of a false positive, α, is the same as in figure 5.3.

To summarize, we set the probability of a false positive error when we choose the confidence level. We must then choose the critical values according to our desired value of α.
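Assuming a standard normal null distribution, this placement of the critical values can be sketched with Python's standard library (the 95% confidence level here is just an example):

```python
# Sketch: where the critical values sit for two-tailed vs. one-tailed tests
# with a standard normal null distribution.
from statistics import NormalDist

z = NormalDist()                 # standard normal null distribution
alpha = 0.05                     # CL = 95%

# Two-tailed test: an area of alpha/2 goes in each tail.
two_tailed = (z.inv_cdf(alpha / 2), z.inv_cdf(1 - alpha / 2))
# One-tailed test (H1: mu > k): the whole area alpha sits in the right tail.
one_tailed = z.inv_cdf(1 - alpha)

print(round(two_tailed[0], 4), round(two_tailed[1], 4))  # -1.96 1.96
print(round(one_tailed, 4))                              # 1.6449
```

Note that the one-tailed critical value (1.6449) is closer to zero than the two-tailed one (1.96): with all of α concentrated in a single tail, the cutoff does not need to be as far out.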
This means that, for a two-tailed test, the area in each tail of the null distribution must be α/2; for a one-tailed test, the area in the single tail (since there is only one critical value) will be α.

False Negative Errors

A false negative occurs when we incorrectly accept H0 when we should actually reject H0 and accept H1. In other words, the alternate hypothesis is actually true, but the test statistic still falls within the acceptance region (so that the null hypothesis is accepted). The next figure shows the probability distribution of the test statistic when the alternate hypothesis is true.

Figure 5.5: This figure shows the probability distribution (not the null distribution) of the test statistic in a situation when the alternate hypothesis is actually true (in this case, µx > k). However, if the test statistic is less than the critical value shown in the figure, then the null hypothesis will be accepted: this would be a false negative error. The shaded area to the left of the critical value shows the probability, β, of this occurring.

As we see in the figure, even when the alternate hypothesis is true, there is some chance (β) that the test statistic will be less than the critical value. This chance is the probability of a false negative error, β. In order to calculate β, we must know the value of the population parameter, µx. We can always calculate the value of β for some hypothetical situation in which we postulate a value for the population parameter. This type of exercise gives us some idea of how "sensitive" our testing procedure is in situations in which the alternate hypothesis is true. The next example illustrates this point.

Example 5.4

You wish to develop a procedure to test for bias in the analysis of fluoride in water.
During the analytical procedure, three independent measurements are obtained on a sample and averaged to determine the fluoride concentration. The standard solution to be used in the test is known to contain 0.45 w/w% F, and the RSD of the entire analytical procedure is known to be 0.10 (i.e., 10% RSD for the average of the three measurements).

(a) What are the critical values that can be used to determine if there is bias in a measurement?

(b) What values of the population measurement mean, µx, would result in a 90% probability that bias will be detected? In other words, what bias would result in acceptance (with 90% probability) of the alternate hypothesis in part (a)?

The true fluoride concentration, ξx, of the standard solution is 0.45 w/w%. The analytical procedure in this situation consists of obtaining three measurements and averaging them to obtain a point estimate of the fluoride concentration. Since the RSD of the procedure is known, we can calculate the standard error of the mean of three measurements:

σoverall = RSD × ξx = 0.10 × 0.45 w/w% = 0.0450 w/w%

The true standard error (a population parameter) is known. The null and alternate hypotheses will be

H0: µx = ξx      there is no measurement bias
H1: µx ≠ ξx      measurement bias exists (two-tailed test)

One thing is different about this hypothesis test, compared to all the others we have done: the true (i.e., population) standard deviation of the mean, σ(x̄3), is known. Thus, the test statistic will be the standardized difference between the mean of three measurements and the true concentration of the solution:

test statistic: T = (x̄3 − ξx) / σ(x̄3)

where x̄3 is the mean of three measurements. Assuming that x̄3 is distributed normally, T will follow a normal distribution with a standard deviation of one. The null distribution, which assumes that µx = ξx, will follow a z-distribution (i.e., a standard normal distribution). Let's set our confidence level at 99%; in other words, we are limiting the probability of false positives to 1%: α = 0.01.
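This setup can be sketched in a few lines of Python. The observed mean x_bar below is a hypothetical value chosen purely for illustration; it is not a number from the example:

```python
# Sketch of the setup in Example 5.4: known standard error and the
# standardized test statistic for a (hypothetical) observed mean.
xi = 0.45                      # true concentration of the standard, w/w% F
rsd = 0.10                     # known RSD of the mean of three measurements

sigma_mean = rsd * xi          # standard error of the mean: 0.0450 w/w%

x_bar = 0.52                   # hypothetical observed mean, for illustration only
T = (x_bar - xi) / sigma_mean  # standardized difference (the test statistic)
print(round(sigma_mean, 4), round(T, 3))   # 0.045 1.556
```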
Now we can find the critical values. From the z-tables, we see that z0.005 ≈ 2.575 (you should verify this; the actual value is 2.5758, as reported by Excel). Our decision rules for this hypothesis test are:

• if −2.5758 < Tobs < 2.5758, then accept H0. We cannot prove measurement bias with 99% confidence.
• if Tobs < −2.5758 or Tobs > 2.5758, then reject H0 and accept H1. We can prove bias with 99% confidence.

In this instance, it is useful to note that there is an equivalent way of stating these decision rules: if the observed measurement mean, x̄3, is more than 2.5758 standard errors from the true concentration, ξx, then we have evidence of bias. In concentration units, the critical values are

critlower = ξx − zcrit · σoverall = 0.45 − 2.5758 × 0.0450 = 0.3341 w/w%
critupper = ξx + zcrit · σoverall = 0.45 + 2.5758 × 0.0450 = 0.5659 w/w%

Alternate decision rules:

• if 0.3341 w/w% < x̄3 < 0.5659 w/w%, then we must accept H0
• if x̄3 < 0.3341 w/w% or x̄3 > 0.5659 w/w%, then we reject H0 and accept H1 at the 99% confidence level

You should realize that these rules are no different than the first ones; they would result in exactly the same conclusion for a given set of data. These rules just give another way of looking at the hypothesis test process.

Now let's look at part (b). We want to find the measurement population mean, µx, that would result in a 90% chance that measurement bias would be detected. Let's imagine that there is actually a certain amount of positive bias in the measurements. The probability that the bias will actually be detected is the area under the probability distribution curve that is greater than the upper critical value. In other words, if we want to find the minimum amount of positive bias that will be detected with 90% probability, we need to find the measurement mean, µx, that satisfies

P(x̄3 > 0.5659 w/w%) = 0.90

This situation is shown in the following figure.
Figure 5.6: The critical values associated with the decision rules for two-tailed bias detection at the 99% confidence level are represented by dashed vertical lines; means outside them lead to acceptance of H1, and means between them to acceptance of H0. The probability distribution describes the mean of three positively biased measurements and results in β = 0.10; in other words, for measurements described by this distribution, there is a 10% chance of a false negative result when testing for bias at the 99% confidence level.

From the z-tables, we know that z = −1.2816 gives a right-tailed area of 0.90. We must solve for µx in the following expression:

(xcrit − µx) / σ(x̄3) = −1.2816

where xcrit is the upper critical value for the testing procedure and σ(x̄3) is the standard error of the mean of three measurements. Solving for µx gives

µx = xcrit + 1.2816 · σ(x̄3)

This is the mean of the probability distribution shown in the figure. Substituting 0.5659 w/w% for the critical value and a standard error of 0.0450 w/w% gives µx = 0.6236 w/w%. This corresponds to a bias, γx, of

γx = µx − ξx = 0.6236 − 0.45 = 0.1736 w/w%

If you repeat this procedure to find the negative bias that gives β = 0.10, you will find that a bias of γx = −0.1736 w/w% gives the desired false negative probability. In other words, our calculations tell us that when testing for bias at the 99% confidence level under these conditions, we have a 90% chance of detecting a bias of 0.1736 w/w%. This is useful information. If, for example, the "sensitivity" of our hypothesis test for bias detection is unacceptable, then we have two options: lower our confidence level from 99% (which would decrease our critical values) or average more measurements to decrease our standard error.
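As a check on the arithmetic, the numbers in Example 5.4 can be reproduced with Python's standard library NormalDist, standing in for the z-tables and Excel used in the text:

```python
# Sketch verifying Example 5.4: critical values (part a) and the mean
# giving 90% detection probability (part b).
from statistics import NormalDist

z = NormalDist()
xi, sigma_mean = 0.45, 0.045           # true value and standard error, w/w%

# Part (a): two-tailed critical value at the 99% confidence level
z_crit = z.inv_cdf(1 - 0.01 / 2)       # about 2.5758
crit_upper = xi + z_crit * sigma_mean  # about 0.5659 w/w%

# Part (b): the mean giving a 90% chance that x_bar exceeds crit_upper
z_10 = z.inv_cdf(0.10)                 # about -1.2816
mu_x = crit_upper - z_10 * sigma_mean
bias = mu_x - xi
print(round(mu_x, 4), round(bias, 4))  # 0.6236 0.1736

# Check: P(x_bar > crit_upper) for this mu_x is indeed 0.90
p_detect = 1 - z.cdf((crit_upper - mu_x) / sigma_mean)
print(round(p_detect, 2))              # 0.9
```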
We could also try to improve the precision of our method, so that the standard deviation of the individual measurements is smaller.

Summary: Choosing the Confidence Level

Choosing the confidence level directly determines the critical values and the value of α, the probability of a false positive error. Let's consider a two-tailed test:

H0: µx = k
H1: µx ≠ k

for which there are two critical values, one on each side of the null distribution. Choosing a larger confidence level will cause the critical values to move further apart. True, this means that there is less chance of a false positive error; however, the power of the test to detect small differences between µx and k has been decreased. In other words, there is a greater chance of a false negative error (i.e., β has increased). Thus there is always a compromise to consider in choosing the confidence level; values of 95% and 99% are very common. The value chosen may depend on the potential consequences of errors. Consider the following situations:

• In employee drug testing, no employer wants to deal with false accusations. In such a situation, a high confidence level (99% or even higher) might be appropriate, because the consequences of a false positive (wrongly accusing an employee of taking drugs) are perceived to be more severe than missing borderline cases.
• In screening patients for HIV, the consequences of a false negative (incorrectly concluding that the patient is not infected) are very severe. In this case, the confidence level might be set relatively low. To be sure, there will be an increase in false positives, but a separate, independent test can be performed on these patients.
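The compromise can be made concrete with a short sketch. We assume, purely for illustration, a one-tailed test whose test statistic has a true standardized shift of delta = 2.5 under H1, and watch β grow as the confidence level rises:

```python
# Illustrative sketch of the trade-off: raising the confidence level
# lowers alpha but raises beta (the false-negative probability).
from statistics import NormalDist

z = NormalDist()
delta = 2.5   # hypothetical true standardized shift of the statistic under H1

betas = []
for cl in (0.90, 0.95, 0.99):
    z_crit = z.inv_cdf(cl)           # one-tailed critical value at this CL
    beta = z.cdf(z_crit - delta)     # chance the statistic still lands below it
    betas.append(beta)
    print(f"CL={cl:.0%}  alpha={1 - cl:.2f}  beta={beta:.3f}")
```

Each step up in confidence level pushes the critical value to the right, so more of the true (shifted) distribution falls in the acceptance region: β roughly quadruples between the 90% and 99% levels in this example.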
Chapter Checkpoint

The following terms/concepts were introduced in this chapter:

• acceptance region
• alternate hypothesis
• critical value
• false positive
• false negative
• hypothesis test
• null hypothesis
• null distribution
• one-tailed test
• P-value
• rejection region
• significance level
• significance test
• statistical hypothesis
• statistical significance
• test statistic
• two-tailed test

In addition to being able to understand and use these terms, after mastering this chapter, you should:

• use formal hypothesis testing procedures to determine if there is a significant difference between a normally-distributed random variable and a fixed value, using either a one- or two-tailed test
• interpret P-values from a hypothesis test
• explain the trade-offs in choosing a confidence level