MATH 2441 Probability and Statistics for Biological Sciences
Introduction to Hypothesis Testing
© David W. Sabo (1999)

You've already read several times in these notes that the methods of statistical inference fall into two broad categories:

1. statistical estimation -- in which we use information available from a random sample to estimate values of population parameters of interest. We've just finished looking at a variety of confidence interval formulas that can be used to estimate population means, proportions, variances, standard deviations, differences of two population means, differences of two population proportions, etc.

2. hypothesis testing -- in which we determine whether specific statements or claims about the value of a population parameter are supported or not by data (evidence) available from a random sample.

In this document, we describe the general concepts and jargon associated with hypothesis testing. In the next few documents, we describe how the general method is applied to hypotheses involving various population parameters. This is a rather long document because it introduces a number of concepts which are new to most people studying statistical inference for the first time. Hypothesis testing also requires us to examine decision-making logic in some depth, which is difficult at first for many. As a result, the notes below will tend to repeat things more than once to help you see the connections between the concepts, conventions, computations and conclusions involved.

What is meant by "hypothesis?"

The word hypothesis is just a slightly technical or mathematical term for "sentence" or "claim" or "statement." In statistics, a hypothesis is ALWAYS A STATEMENT ABOUT THE VALUE OF A POPULATION PARAMETER. Thus, typical statistical hypotheses are (in some appropriate context, of course)

μ > 5 ppm        π = 0.65        σ² > 2.00        μ1 - μ2 > 0

and so on. These are hypotheses because they all involve population parameters, and because they are statements -- in mathematical notation -- about the values of the population parameters. The following are not statistical hypotheses:

"x̄ > 5 ppm", because x̄ is not a population parameter (it is a sample statistic)

"μ is big enough", because, though it is a statement about the population parameter μ, the statement is not quantitative.

(In a more general sense, the word "hypothesis" can be used to refer to any statement that potentially can be assessed to be supported or not supported by some evidence. In technical applications, even statements that are initially non-quantitative almost always must be reduced to quantitative statements before it is possible to evaluate the degree to which they may or may not be supported by available evidence.)

Hypothesis testing, then, is the operation of deciding whether or not data obtained for a random sample supports or fails to support a particular hypothesis. In practice, the result of testing a hypothesis is a declaration that the hypothesis is supported by the data or that it is not supported by the data (there's a bit more to it than this, as you'll see before we're done, but this is the gist of it). But the matter doesn't end there, because few people or companies are prepared to pay for the costs of an experiment just to find out whether a particular hypothesis is supported or not. The statistical hypothesis test normally determines a course of action -- perhaps prompting a company to switch to a new production method, or to a new supplier of materials, etc.
If the conclusion of the hypothesis test procedure is incorrect, then a mistaken and potentially disadvantageous or costly course of action may result.

Why Can't We Just Look at the "Facts?" Why Do We Need "Hypothesis Testing?"

At first, the formalism described below may seem to be quite an abstract and overly complicated approach to deciding whether a statement about a population parameter is "true" or "false." You may wonder why we can't just look at the data and come to a common-sense conclusion. If you've been keeping up with the last few weeks of work in the course, you already know the answer to this question. We cannot rely on the data in a random sample being a perfect representation of the population from which the random sample was selected. Thus we know that for a population with a mean value of, say, 5, we can select random samples which have a mean value less than 5, exactly equal to 5 (very unlikely), or greater than 5. If you draw two or more independent random samples from the same population, you are extremely unlikely to find that any two (let alone the whole lot) have the same mean value.

Suppose we make the claim that the mean value of a population is greater than 4. Our claim would be correct if it turned out that the actual mean value of the population was 4.1 or 5 or 10, for instance. Now, suppose we take a random sample of that population and find that the sample mean is 3.9. Does this invalidate our original claim? Is this evidence that our original claim that μ > 4 is false? Well, we can't really say. Recall the sampling experiments we did earlier in the course with the populations constructed to have μ = 5 (and so, for those populations, μ > 4 is certainly true). Some groups drew random samples of 30 items and observed a sample mean as small as 2.8, while others drew random samples of size 30 from the same populations and observed a sample mean as large as 7.5. If the people who drew a sample which gave x̄ = 2.8 had concluded that μ > 4 is false, they would have drawn a mistaken conclusion.

On the other hand, we might ask, if observing x̄ = 2.8 is not adequate evidence to conclude that μ > 4 is a false statement, then what kind of evidence would we need? Glad you asked that! That's exactly what the formalism presented below is all about. The whole point of the methodology of statistical hypothesis testing is to decide exactly how contradictory the experimental evidence must be before we can consider a claim about the value of a population parameter to be "disproven."

The real problem here is that the "facts" often tend to be ambiguous or misleading, because they are really only "facts" about a small part (the random sample) of the much larger population about which the hypothesis is making a statement. If we had the capability of including the entire population in our experiment, the simple "facts" would be quite adequate in determining the truth or falseness of the hypothesis. Since we are restricted to information obtained from just a random sample of that population, we must be careful about jumping to an unwarranted conclusion.

You might have noticed that we put quotation marks around words like "true", "false", and "disproven". This is not so much because we're pushing a philosophical agenda which questions whether there are such things as absolute truth or rigorous proof.
Rather, as you'll see when we get deeper into the examination of this problem, it's more that we can never ensure the impossibility of error when drawing conclusions about an entire population based on the observation of a random sample of that population. So, while we can set up our decision-making procedure to make the possibility of drawing an incorrect conclusion as unlikely as we wish (though not without corresponding cost), we can never eliminate the possibility of error entirely. Thus, we will try to avoid categorical words like "true", "false", and "proof" -- using instead phrases that more accurately reflect what we are really able to say: "the claim is supported", "the claim is not supported", etc. We'll return to this matter near the end of this document when we can recap the issue a bit more precisely.

A key issue in hypothesis testing is error avoidance or error control. We attempt to draw a conclusion about the population. That conclusion is potentially correct or incorrect. The goal is to control the likelihood of making a mistake.

What Are the Potential Mistakes We Can Make?

Whenever we attempt to evaluate a hypothesis, there are always four ways the situation can play itself out. To see this more concretely, consider the following example.

Example 1: Recall the SalmonCa experiment described in the standard data sets distributed earlier. As part of the experiment, a technologist analyzed 40 unsanitized salmon fillets, and found that for that random sample of 40 fillets, the mean calcium concentration was 74.28 ppm with a standard deviation of 22.02 ppm. Now, suppose that the intent of the experiment was to determine whether one could state that the mean calcium content of all unsanitized salmon fillets was greater than 65 ppm. That is, suppose the hypothesis to be tested is: μ > 65 ppm.

Now, there are two possible states of reality here, and there are two possible conclusions we could draw from the data. We summarize the alternatives in the following table:

                            The Actual State of the Population
  Our Conclusion            μ ≤ 65 ppm                                μ > 65 ppm
  μ ≤ 65 ppm                no mistake -- μ really is less than or    we make a mistake -- μ is actually
                            equal to 65 ppm, and we conclude that     greater than 65 ppm, but we conclude
                            it is less than or equal to 65 ppm        that it is less than or equal to 65 ppm
  μ > 65 ppm                we make a mistake -- μ really is less     no mistake -- μ really is greater than
                            than or equal to 65 ppm, but we           65 ppm, and we conclude it is greater
                            conclude that it is greater than 65 ppm   than 65 ppm

Since either of our two possible conclusions can occur with either of the two potential realities, there are four different ways the hypothesis test process can play out. As described in the table, two of those four alternatives amount to a correct conclusion:
- the population mean is really less than or equal to 65 ppm and our conclusion (based on the data, of course) is that the population mean is less than or equal to 65 ppm, and
- the population mean is really greater than 65 ppm and our conclusion is that the population mean is greater than 65 ppm.
Unfortunately, the other two of the four alternatives amount to the data leading us to draw a false conclusion, to make a mistake:
- the population mean is really less than or equal to 65 ppm, but the data in the random sample has led us to conclude that the population mean is greater than 65 ppm, and
- the population mean is really greater than 65 ppm, but the data in the random sample has led us to conclude that the population mean is less than or equal to 65 ppm.

The really unfortunate thing is that since the only information we have about the value of the population mean is the data in the random sample, we have no way of telling for sure which of these four scenarios has occurred once we arrive at our conclusion! What we can do, however, is set up our decision-making process to allow us to control the probability of drawing certain wrong conclusions.

The Strategy

People have long realized that it can be harder in principle to prove that a true statement about a population is true than it is to demonstrate that a false statement is false. For example, suppose I make the statement: "Everybody in Canada likes vanilla ice cream." Suppose first that this is a true statement. How could I prove it? Well -- I'd really have to ask every single person in Canada about their views on vanilla ice cream. Even if I asked 10 million people, and found that every single one of the 10 million people I asked claimed to like vanilla ice cream, I still haven't come close to proving that every single person in Canada likes vanilla ice cream. After all, there would still be nearly 20 million people that I haven't talked to, and it is possible that there is perhaps at least one person in those unpolled 20 million who absolutely detests the taste of vanilla ice cream.

On the other hand, if my statement about vanilla ice cream is false, all I need to do to prove it false is to find just one person who does not like vanilla ice cream. It doesn't matter how many people I find who do like vanilla ice cream, since the first person I encounter who does not like vanilla ice cream effectively provides absolute proof that my statement is incorrect. Nor does it matter how many people I do not ask. If the first person I talk to tells me that they hate the taste of vanilla ice cream, I've proven the statement "Everybody in Canada likes vanilla ice cream" to be false, even though I haven't checked with any of the other nearly 30 million or so people living in Canada.

In practice, it would be a rare situation in which we really needed to find evidence for a statement as inclusive as "Everybody in Canada likes vanilla ice cream." If we were thinking of constructing an ice cream factory which could only make one flavor of ice cream, we wouldn't need to pick a flavor that every single person in the country would eat. So, it is more likely that we would be interested in "proving" statements such as "More than 30% of Canadians like vanilla ice cream" or "More Canadians prefer vanilla ice cream than prefer chocolate ice cream", etc. But the general point still applies -- it will turn out to be easier to disprove a false statement than it will be to prove a true statement.

There is a somewhat more subtle but very important issue here as well. Very often, evidence is ambiguous.
This is common in statistics because we are trying to say something about an entire population based on information obtained from a relatively small random sample of that population, and often we are trying to detect very small effects. Suppose I'm interested in determining whether the statement "More than 30% of Canadians like vanilla ice cream" can be substantiated. I'm interested because an earlier study indicated vanilla ice cream was preferred by 28% of all Canadians, and I've just spent a million dollars on an advertising campaign that the agent claimed would increase this flavor preference by at least 2% to over 30% of all Canadians. I'm trying to determine if I got my money's worth out of the advertising. I select a random sample of 100 Canadians, and find that 31 of them state they like vanilla ice cream and the other 69 say they don't.

If 31% of the sample say they like vanilla ice cream, is that proof that more than 30% of the population likes vanilla ice cream? Not really. Using the information in the document on sampling distributions, we can calculate that if π, the proportion of all Canadians who like vanilla ice cream, is, say, 0.29, which is less than 0.30, then there is a probability of 33% that a random sample of 100 Canadians will contain at least 31 people who like vanilla ice cream (a short calculation reproducing this figure appears at the end of this passage). So, the observation that 31 people in a sample of 100 like vanilla ice cream is far from conclusive proof that more than 30% of all Canadians like vanilla ice cream. On the other hand, it would be bizarre to conclude that the statement "More than 30% of Canadians like vanilla ice cream" is false if 31 of the 100 people in our sample stated they liked vanilla ice cream. In this situation, the data we have is inconclusive -- it neither supports nor contradicts the statement made about the population parameter. Our formalism has to have a way of detecting or dealing with this (unfortunately rather common) situation.

Because of all these problems, the statistical hypothesis test procedure is structured as follows.

First: the claim of interest will be called the alternative hypothesis and denoted HA. HA is always a strict inequality involving a population parameter. (Some workers also call HA the research hypothesis, but we will stick with the term "alternative hypothesis" in this course.) The alternative hypothesis is often the claim that the researcher hopes to see supported by the experimental data or evidence. For example, if we were interested in establishing that the mean calcium concentration in unsanitized salmon fillets was greater than 65 ppm, we would write

HA: μ > 65 ppm.

This is a statement about μ, the population mean. It is a strict inequality: "greater than", rather than "greater than or equal to". (This distinction between > and ≥ has no practical consequences, but it is a crucial distinction for the formalism being developed here.) The end result of the hypothesis test procedure in this case will be one of two statements:

the evidence supports HA: μ > 65 ppm …
or
the evidence does not support HA: μ > 65 ppm …

The "…" indicates that there are additional words required to complete these statements, but we need to address a few more issues before they'll make sense.
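Returning to the 33% figure quoted in the vanilla ice cream discussion above: it can be reproduced directly from the sampling distribution of the sample proportion described in the earlier document on sampling distributions. Here is a minimal sketch, in Python with scipy (a hypothetical aid -- the original notes use tables, not software); the exact binomial probability is printed alongside for comparison.

```python
from math import sqrt
from scipy.stats import norm, binom

# Vanilla ice cream example: population proportion pi = 0.29, sample size n = 100.
# What is the probability that 31 or more people in the sample like vanilla ice cream?
pi, n, x_obs = 0.29, 100, 31

# Normal approximation to the sampling distribution of the sample proportion p-hat:
# mean pi, standard deviation sqrt(pi * (1 - pi) / n).
p_hat = x_obs / n
sd_p_hat = sqrt(pi * (1 - pi) / n)
z = (p_hat - pi) / sd_p_hat
print("z =", round(z, 3))                             # about 0.44
print("normal approximation:", round(norm.sf(z), 3))  # about 0.33, the figure quoted above

# Exact binomial probability Pr(X >= 31) when X ~ Binomial(100, 0.29), for comparison.
print("exact binomial:", round(binom.sf(x_obs - 1, n, pi), 3))
```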
In view of the discussion involving the vanilla ice cream example above, the decision over which of these two conclusions is appropriate will result not from looking at the data to see whether or not it supports HA directly, but by looking to see whether the data contradicts the opposite of HA. If the opposite of HA is contradicted by the data, we will conclude HA is supported by the data. If the opposite of HA is not adequately contradicted by the data, we will conclude there is no strong support for HA -- that is, the data is inconclusive.

Second: every alternative hypothesis will be matched with a so-called null hypothesis, H0. H0 is always a statement of equality involving a population parameter. It is obtained by replacing the inequality symbol in HA by an equals sign. For example, with the alternative hypothesis

HA: μ > 65 ppm

we would pair the following null hypothesis

H0: μ = 65 ppm.

Now, in this specific example, the opposite of HA is the statement μ ≤ 65 ppm, so it is not quite true to say that H0 is the opposite of HA. However, of all the situations in which HA is not true, the one described by H0 is the one which is closest to HA. We choose to write H0 as an equality because that simplifies the calculations and analysis detailed below. However, we also recognize that if the data contradicts H0 in favor of HA, then that data will even more strongly contradict any other situation included in the opposite of HA. In determining whether the data convinces us to reject the possibility of μ ≤ 65 ppm in favor of μ > 65 ppm, it will be the situation μ = 65 ppm that will be most difficult to distinguish from μ > 65 ppm. If we can demonstrate that μ > 65 ppm is to be preferred over μ = 65 ppm, then we will have effectively demonstrated that μ > 65 ppm is to be preferred over μ ≤ 65 ppm (which is the desired goal of the whole procedure).

Don't worry if this seems to be getting a bit confusing. You may have to read through this document several times before everything makes sense. What we are saying is that every hypothesis test procedure involves a pair of hypotheses: an HA and an H0. HA is the claim we wish to evaluate. Our decision about HA will be based on whether we find the data sufficiently contradictory to H0 or not. If you stick with us here, you'll eventually see that by writing H0 as an equality as done above, we will get both the soundest possible conclusion and the clearest mathematical analysis.

The Decision Method

We now come to the core of the process: making the decision for or against HA. To see this a bit more concretely, we will work with the SalmonCa0 example. The hypotheses are:

H0: μ = 65 ppm   vs.   HA: μ > 65 ppm        (IH - 1)

The final conclusion will have to rely on the value of x̄ observed for a random sample of all such salmon fillets. We need to bring together several observations here. First, suppose H0 was exactly correct. Then, we know that for sample sizes of 30 or more, the random variable x̄ will be approximately normally distributed, with a mean of 65 ppm and a standard deviation of σ/√n. Thus, the distribution of potential sample means is of the sort shown in the figure to the right. [Figure: sampling distribution of x̄ when H0: μ = 65 ppm is true, centered at 65 ppm.]

Secondly, ask yourself what sort of observed value of x̄ would tend to make you favor HA: μ > 65 ppm as opposed to H0: μ = 65 ppm?
Clearly, if we observed a value of x̄ which was much, much larger than 65 ppm, we would feel quite comfortable in concluding that the evidence strongly supported HA as opposed to H0 in this case. So, in principle, we can devise a decision rule as follows. Pick some critical value of x̄, which will presumably be some distance to the right of the value 65 ppm. If we observe a value of x̄ which is to the right of (or greater than) this critical value, we will reject H0 in favor of HA, and declare HA to be supported by the experimental data. If we observe a value of x̄ to the left of (or less than) this critical value, we declare the evidence to be inconclusive.

Thus, we picture setting up a situation along the lines shown in the figure to the right. [Figure: sampling distribution of x̄ when H0: μ = 65 ppm is true, with a critical value of x̄ dividing the axis; if the observed x̄ falls to the left of the critical value the evidence is inconclusive, and if it falls to the right, reject H0 in favor of HA.] The x̄-axis is divided into two parts:

- a part to the right of some critical value, which we shall call the rejection region. If we observe a value of x̄ which falls in this rejection region, we will take it as adequate evidence to reject H0, and so conclude that HA is supported.

- a part to the left of this critical value, which we might call the non-rejection region. If we observe a value of x̄ which falls in this non-rejection region, we consider the experiment inconclusive. We have not proven that HA is false -- rather, we've found that the evidence is inadequate to support HA.

Notice that the terminology "rejection" and "non-rejection" refers directly to H0. Whatever conclusion we can or cannot draw regarding HA is a side-effect of our conclusion (or lack of conclusion) with regard to H0.

- If the data allows us to reject H0, we regard this as equivalent to concluding that HA is supported. However, the only way we can determine whether HA is supported by the evidence is by finding that the evidence allows us to reject H0.

- If the data does not allow us to reject H0, we regard this as equivalent to finding that we have insufficient evidence to draw a conclusion with regard to HA. It does not mean that HA is false, or that the evidence contradicts HA. The most we can say is that there is not enough evidence to say anything definite about HA. (This is not to say that there is no possibility of HA being false -- but such a conclusion cannot be drawn from the way the present hypothesis test process is set up.)

If you talk about rejecting HA or supporting H0, you are confused and must rethink your statement. You may decide to reject H0 and thereby imply that HA is supported, or you may decide that you can't reject H0, and thereby imply that no conclusion one way or the other can be drawn from the data. But you cannot decide to support H0 thinking that this means you've shown HA to be unsupported, nor can you decide to accept HA without reference to rejecting H0, without making an error in logic.

What remains is to come up with a reasonable way to decide how to compute the critical value of x̄ which forms the boundary between the rejection region and the non-rejection region. In the present example, all we know is that it should be some number quite a bit bigger than 65 ppm. The figure just above displays the principle we can exploit here. The bell curve in the figure is the probability distribution curve for x̄ when H0 is true. Thus, the area under this curve corresponding to the rejection
region (shaded in the figure) is the probability of observing a value of x̄ in the rejection region when H0 is true -- that is, the shaded area gives the probability that the data will result in us rejecting H0 when H0 is true. What is done, then, is to select the critical value of x̄ separating the rejection region from the non-rejection region so that the probability of making this sort of mistake (rejecting a true H0) is what we consider to be an acceptably small value.

Recall that there are two different errors possible in every hypothesis testing procedure. The error highlighted in the figure above -- rejecting a true H0 -- is called a type 1 error. (The other potential error, to fail to reject H0 when it should be rejected, is called a type 2 error.) The area of the shaded region in the figure above, the probability of making a type 1 error, is called the level of significance of the hypothesis test, and is conventionally represented by the Greek letter α. When a hypothesis test is carried out with a small value of α and we are able to reject H0, the result is said to have statistical significance, meaning that there is a small probability that the conclusion obtained is a mistake. This is also why hypothesis tests are often called tests of significance. It is conventional to use α = 0.05 unless there are reasons explicitly justifying some other value.

Thus, the critical value of x̄ is just what standard notation represents as x̄0.05, the value of x̄ which cuts off a right-hand tail of area α = 0.05. In the present example, we have a large sample: n = 40 > 30, and so x̄ is approximately normally distributed with a mean μx̄ = μ, the population mean (and since we start off by assuming H0 is true, we know that μ = 65 ppm), and a standard deviation σx̄ = σ/√n (and while we don't know what σ is here, we do know that s = 22.02 ppm should provide a rough estimate of σ for us). Thus, from our study of the normal distribution we know we can write

x̄0.05 = μx̄ + z0.05 σx̄ ≈ 65 + 1.645 × (22.02/√40) ≈ 70.73 ppm

This means that for this SalmonCa0 experiment and the hypotheses

H0: μ = 65 ppm   vs.   HA: μ > 65 ppm

if we reject H0 in favor of HA whenever we observe a sample of size 40 with a sample mean greater than 70.73 ppm, then the probability that such a conclusion is a mistake will be 0.05 or smaller. In fact, the sample mean for the SalmonCa0 data is 74.28 ppm, which is larger than the critical value 70.73 ppm. Thus, at a level of significance of 0.05, we may reject H0 here, and conclude that the experimental evidence supports the conclusion that the mean calcium level in the unsanitized salmon fillets is greater than 65 ppm.

While we stated the rejection criterion for the hypotheses (IH - 1) in terms of the critical value of x̄, we could also have stated it in terms of values of z, the corresponding standard scores:

x̄ > x̄0.05   is equivalent to   z > z0.05,   where   z = (x̄ - μx̄)/σx̄

For the SalmonCa0 data, the value x̄ = 74.28 corresponds to the standard score

z = (x̄ - μx̄)/σx̄ = (74.28 - 65)/(22.02/√40) ≈ 2.665

Since z = 2.665 > z0.05 = 1.645, we can reject H0 and conclude that the data supports HA: μ > 65 ppm. The two conditions, 74.28 > 70.73 and 2.665 > 1.645, are completely equivalent. In this context, we call z the standardized test statistic, because it is a statistic (its value depends on the data observed for the sample through the values of x̄ and s), and its value determines the outcome of the hypothesis test.
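The arithmetic just carried out is easy to reproduce with software. Here is a minimal sketch, in Python with scipy (a hypothetical aid, not part of the original notes, which use normal tables), that recomputes the critical value of x̄, the standardized test statistic z, and the resulting decision from the SalmonCa0 summary numbers quoted above.

```python
from math import sqrt
from scipy.stats import norm

# Summary numbers quoted in the text for the SalmonCa0 sample.
n, xbar, s = 40, 74.28, 22.02
mu0 = 65.0        # value of mu specified by H0
alpha = 0.05      # level of significance

se = s / sqrt(n)                   # estimate of the standard deviation of x-bar
z_crit = norm.ppf(1 - alpha)       # z_0.05, about 1.645
xbar_crit = mu0 + z_crit * se      # critical value of x-bar, about 70.73 ppm
z = (xbar - mu0) / se              # standardized test statistic, about 2.665

print("critical value of x-bar:", round(xbar_crit, 2), "ppm")
print("z =", round(z, 3), "  z_0.05 =", round(z_crit, 3))
if z > z_crit:
    print("Reject H0: the data supports HA: mu > 65 ppm at alpha = 0.05")
else:
    print("Cannot reject H0: the evidence is inconclusive")
```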
The rejection criterion (that is, the rule for deciding whether or not to reject H0) is stated in terms of the value of this test statistic: reject H0 in favor of HA if z > zα. It turns out to be more convenient to write rejection criteria in terms of values of standard random variables such as z, t, etc., rather than in terms of the original sample statistics such as x̄, p̂, s, etc., and we shall adopt that approach through the remainder of this course.

We've now gone through one basic hypothesis test process in quite some detail with reference to a specific set of hypotheses and a specific set of data. Now we need to rewrite the steps in a more condensed and generic form so that it is clearer how the process can be applied to similar situations.

Summary of the Hypothesis Test Procedure for Large Samples When HA Contains ">"

We write the two hypotheses in the following generic form:

H0: μ = μ0   vs.   HA: μ > μ0        (IH - 2)

where μ0 stands for some specific number, and μ stands for the mean of the population of interest. We note that H0, the null hypothesis, is written as an equality, and that HA, the alternative hypothesis, is written as a strict inequality. The right-hand sides of both hypotheses are (and must be -- why?) the same value, μ0. If data for a large random sample is available, then we will be able to compute the value of the standard test statistic

z = (x̄ - μ0)/σx̄ = (x̄ - μ0)/(σ/√n) ≈ (x̄ - μ0)/(s/√n)        (IH - 3)

(If σ is known, use the second-last expression; if σ is not known, use the last expression, which has s as a point estimate of σ.) Then, H0 can be rejected in favor of HA at a level of significance α if we find that

z > zα        (IH - 4)

If this rejection criterion is met, we can state that the evidence supports HA at a level of significance α, meaning that there is a probability of α that this conclusion is mistaken.

What Do We Really Mean by "Statistical Significance?"

People often state that a conclusion or statement has "statistical significance", or that a "statistically significant" effect has been observed. This phrase is shorthand for the statement, "the conclusion results from rejection of the null hypothesis at a level of significance of 0.05" (or some other appropriate value of α). We can ask what this really means at a somewhat deeper level, however. Why is statistical significance such an important or "significant" thing?

To be specific, look again at the hypotheses (IH - 1). We know that even if HA: μ > 65 ppm is not true, it may well happen that we observe a value x̄ which is larger than 65 ppm as a result of coincidences in the random sampling process. This is the reason why the observation x̄ > 65 ppm is not adequate evidence to conclude that μ > 65 ppm -- there's too great a chance that the observation x̄ > 65 ppm is a coincidental result of the random sampling process and not a reflection of a real property of the target population. [Figure: sampling distributions of x̄ for μ < 65, μ = 65, and μ > 65 ppm, with the boundary between the non-rejection region and the rejection region marked.]

By setting up the rejection criteria in the way outlined above, we limit the probability or likelihood of such a coincidence occurring. Coincidence may still act to cause a mistaken conclusion, but by setting up the rejection region to correspond to a right-tail area α of the sampling distribution of x̄ when μ = 65 ppm, we are ensuring that for situations in which μ ≤ 65 ppm, the probability of coincidence resulting in the observed value of x̄ being in the rejection region is no bigger than α. You can see this in the figure just above to the right.
The boundary between the non-rejection and the rejection region is located so that the shaded region in the right tail of the sampling distribution of x̄ when μ = 65 ppm has area α. If μ < 65 ppm, the area of the right-hand tail of the sampling distribution of x̄ lying in the rejection region will be even smaller, so that we are even less likely to conclude μ > 65 ppm in such a situation. (On the other hand, the more the value of μ exceeds 65 ppm, the greater the area of the right-hand tail of the sampling distribution of x̄ lying in the rejection region, and so the more likely we are to obtain data which will result in the correct conclusion that μ > 65 ppm.) Thus, a statistically significant conclusion is a conclusion which is unlikely to be the result of coincidence (or, as a statistician would probably say, the result of sampling error). Instead, it is a decision which has a high probability of reflecting a true property of the target population.

What About Small Samples?

The formulas (IH - 3) and (IH - 4) apply to the situation (IH - 2) only when data is available from a sample of size 30 or larger. However, the arguments leading up to these formulas are completely general. As a result, when the so-called small sample situation applies:

- the sample size is less than 30, but
- the population is approximately normally distributed,

we know that all that is different in the work above is that x̄ is now t-distributed (with ν = n - 1 degrees of freedom), rather than being normally distributed. Thus, instead of computing the standardized test statistic in (IH - 3), we would calculate the standardized test statistic given by:

t = (x̄ - μ0)/(s/√n)        (IH - 5)

Then, H0 can be rejected in favor of HA at a level of significance α if

t > tα,ν        (IH - 6)

Example 2: Is the data in the BiotinDry data set adequate to support a claim that dry roasted peanuts contain an average of more than 80 micrograms of biotin per 100 g serving?

Solution: Let μ stand for the mean biotin content of the population of dry roasted peanuts (in units of micrograms of biotin per 100 g of peanuts). The claim of interest about this population is that μ > 80. Thus, we must test the hypotheses:

H0: μ = 80   vs.   HA: μ > 80

Biotin concentrations in these units were obtained for 9 randomly selected specimens of dry roasted peanuts (the BiotinDry data set):

58.70   78.20   78.00   91.40   80.90   88.40   96.10   97.40   104.80

From these values, we get x̄ = 85.99 and s = 13.77. With a sample size of n = 9 (< 30), we are clearly in a small sample situation. For the moment, let's assume that the data is not inconsistent with the population being normally distributed (you've already seen a number of examples of how you could "test" that assumption using a normal probability plot). No level of significance is mentioned in the example, so we will use the usual α = 0.05. Thus, we can reject H0 in favor of HA at a level of significance of 0.05 if we find that the test statistic, t, computed using formula (IH - 5), satisfies:

t > t0.05, 8 = 1.860

But, plugging numbers into formula (IH - 5), we get

t = (x̄ - μ0)/(s/√n) = (85.99 - 80)/(13.77/√9) ≈ 1.305

Since 1.305 is not greater than the critical value 1.860, we cannot reject H0 here. Thus, we must state the result as: the data obtained is inconclusive on the issue of whether the mean biotin content of dry roasted peanuts is greater than 80 micrograms/100 grams.
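The same kind of sketch works for the small-sample case. Here is a hypothetical Python/scipy version of Example 2, computed from the nine BiotinDry observations listed above (the original notes use t-tables rather than software).

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t as t_dist

# The nine biotin measurements (micrograms per 100 g) from the BiotinDry data set.
data = [58.70, 78.20, 78.00, 91.40, 80.90, 88.40, 96.10, 97.40, 104.80]
mu0, alpha = 80.0, 0.05

n = len(data)
xbar = mean(data)       # about 85.99
s = stdev(data)         # sample standard deviation, about 13.77
nu = n - 1              # degrees of freedom

t_stat = (xbar - mu0) / (s / sqrt(n))   # about 1.305
t_crit = t_dist.ppf(1 - alpha, nu)      # t_{0.05, 8} = 1.860

print("t =", round(t_stat, 3), "  t_crit =", round(t_crit, 3))
if t_stat > t_crit:
    print("Reject H0: the data supports HA: mu > 80")
else:
    print("Cannot reject H0: the data is inconclusive")
```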
(Be very careful: it would be misleading to say that "the data contradicts the claim that the mean biotin content of dry roasted peanuts is greater than 80 micrograms per 100 g" -- such a conclusion would be interpreted by most people to imply that the experiment has shown the mean biotin content of these peanuts to be less than 80 micrograms per 100 grams. But such an implication is not supported by the data either -- you'll see shortly that had we tested the claim μ < 80, we would also have come up "inconclusive." That's why we say that the test result is "no conclusion" -- you can't say either that μ > 80 or that μ < 80 -- the data gives no "significant" information at all about the relationship between the value of μ and the quantity 80 micrograms/100 g here.)

The p-value

The level of significance is a measure of how likely a conclusion drawn from a hypothesis test is to be a mistake. Recall that the only way we can draw a conclusion is if we decide to reject H0, and so implicitly conclude that HA is supported by the data. If (unknown to us, of course) this result is incorrect -- does not reflect an actual true statement about the population -- then we have made a type 1 error. When the hypothesis test is set up along the lines described already, the probability that any decision to reject H0 is in error is no greater than the level of significance. If we do not reject H0, then we cannot be risking a type 1 error at all, and so the level of significance is not relevant to the soundness of our result. In this case, though, there is a risk that we have made a type 2 error. We will discuss ways of evaluating this risk later.

The value of α, the level of significance, is chosen when the statistical experiment is being planned. Once the value of α is chosen (and almost always, it is chosen as 0.05), the only information to come out of the rest of the procedure is the conclusion "reject H0" or the non-conclusion "unable to reject H0." Thus, the choice of α = 0.05 in Example 1 above led to the rejection criterion "reject H0 if z > 1.645," and since the data gave z = 2.665, which is greater than 1.645, we rejected H0, drawing the implied conclusion about the mean calcium content of unsanitized salmon fillets. In Example 2 above, the choice of α = 0.05 led to the rejection criterion "reject H0 if t > 1.860." Since the data in that example gave t = 1.305, which is not greater than 1.860, we had to declare the data inconclusive.

If someone asked us how confident we were of the conclusion in Example 1, we'd have to answer: "there's no more than a 5% chance that this conclusion is wrong due to random sampling error." If someone asked us why we didn't draw a conclusion in Example 2, we'd have to answer: "because to do so would run a greater than 5% risk of drawing a wrong conclusion due to random sampling error." Both of these responses are correct as far as they go, but it is possible to be a bit more definite about the probabilities of the errors in these two cases. What our responses here don't tell the inquirer is how close the actual probability of error is to the 5% value. In Example 2, we would have had to give the same response even if the data had led to t = 1.850, even though then our rejection criterion would have been missed by just a little bit.
Similarly, in Example 1, we would have declared the same conclusion even if the data had given z = 1.655, even though the data giving z = 2.665 is much stronger support for HA than would be data giving z = 1.655.

An alternative way of stating the outcome of a hypothesis test has been gaining increasing popularity. First, we define a new quantity called the p-value. To do this, you skip the step of choosing a value of α, and move straight to calculating the value of the standardized test statistic from the data. Then

p-value = the area under the sampling distribution curve for the rejection region constructed using the computed value of the standardized test statistic.

The p-value is thus a probability. In fact, it's the probability of making a type 1 error when you set up the rejection region to be just big enough to allow you to reject H0. Then, the result of the hypothesis test is not a statement of "reject H0" or "cannot reject H0", but rather, "for these hypotheses, the p-value is …" It's then up to the recipient to decide whether or not to conclude (for their purposes) that HA is supported. Of course, if the p-value is much larger than 0.05, one would have to have a very good justification for acting as if HA was supported.

The p-value approach is particularly enlightening in two situations.

- If the p-value turns out to be slightly larger than 0.05 (say 0.055 or 0.06 or so), it gives the user the opportunity to take a slightly greater than normal risk of error, but proceed on the conclusion that HA is supported. In the original approach, they would simply have been told that the data does not support HA, with no indication of how close it actually came to supporting HA. Since a number of approximations go into the calculation of the standardized test statistics, allowing for a bit of "fuzz" near the rejection criterion limits is a useful thing.

- Very small p-values mean that there is very little likelihood that the conclusion is the result of coincidence in random sampling. Although conclusions based on levels of significance of 0.05 are considered quite sound for routine work, type 1 errors will still occur if you do enough hypothesis tests. On the other hand, decisions based on p-values which are very small (say 0.001 or smaller) can pretty well be considered free of error for practical purposes, unless you are in a situation in which dozens of hypothesis tests are done every day. (When the p-value of the tests is 0.05, you need to do 14 rejections of H0 before the chance of at least one type 1 error is greater than 50%. When the p-value of the tests is 0.001, you would need to do 693 rejections of H0 before the chance of at least one type 1 error exceeds 50%!)

Example 1: The hypotheses in this case were

H0: μ = 65 ppm   vs.   HA: μ > 65 ppm

indicating the rejection region is a right-hand tail. [Figure: standard normal curve; the rejection region is the right-hand tail bounded by the value of the standardized test statistic computed from the sample data, z = 2.665, and the p-value is the area of that tail.] From the data, we computed the standardized test statistic to be z = 2.665. Thus, the rejection region determined by the computed value of the standardized test statistic in this case is the region z > 2.665 under the standard normal distribution (shown in the figure).
We can compute the area of this region using our standard normal probability tables:

p-value = Pr(z > 2.665) ≈ 0.5 - Pr(0 ≤ z ≤ 2.66) = 0.5 - 0.4961 = 0.0039

Here, we rounded the value of z down to two decimal places to match our probability tables; we rounded down to avoid any hint of over-optimism. From this calculation, we can state our conclusion something like: "The p-value for the test of μ > 65 ppm is 0.0039." The listener familiar with concepts of statistical hypothesis testing would know that this means that the probability of being incorrect in taking the course of action implied by μ > 65 ppm is very small -- less than one chance in 250 -- and so such action can be taken with considerable confidence.

Example 2: In this case, the hypotheses

H0: μ = 80   vs.   HA: μ > 80

again give rise to a right-hand tail rejection region. The data gave the value of the standardized test statistic to be t = 1.305 with 8 degrees of freedom. Thus, we have

p-value = Pr(t > 1.305, ν = 8)

Unfortunately, the usual t-tables cannot be used to calculate this probability very precisely. Looking at the entries of row ν = 8 in the t-table which bracket the value 1.305, we find that

Pr(t > 1.108) = 0.15   and   Pr(t > 1.397) = 0.10

Thus, it appears that Pr(t > 1.305) is a value between 0.15 and 0.10, and probably somewhat closer to 0.10 than to 0.15. [Figure: right-hand tail areas of the t-distribution with ν = 8; the tail area at t = 1.305 lies between the area 0.15 (at t = 1.108) and the area 0.10 (at t = 1.397).] Thus, the best we can do is to say that for these hypotheses, the p-value is greater than 0.10 (which is not very close to 0.05) but is less than 0.15. Because we have a lower estimate of the p-value at 0.10, which is considerably larger than 0.05, few practitioners would feel comfortable in acting as if HA had been supported by the data.

(Incidentally, if you have access to MS Excel/97, you could use the function call "=TDIST(1.305,8,1)" to get the more precise result: p-value = Pr(t > 1.305, ν = 8) = 0.1141, if such precision was important.)

One-Tailed Hypothesis Tests

So far, we have looked only at hypotheses in which HA contains a ">". We explained why it makes sense to set up a rejection region of the form

reject H0 if standardized test statistic > critical value

where the "critical value" was selected to cut off a right-hand tail of the standardized probability distribution of some selected area α, which we called the level of significance of the test. Similar logic applies to testing the hypotheses:

H0: μ = μ0   vs.   HA: μ < μ0        (IH - 7)

Now, it would be the observation of a value of x̄ which is very much less than μ0 which would make us favor rejecting H0 in favor of HA. But x̄ very much smaller than μ0 corresponds to z (or t) very much smaller than 0. Thus, the rejection and non-rejection regions in this case are as shown in the diagram to the right (drawn for the large sample case -- the same diagram would apply to the small sample case if the symbol z is replaced by t). [Figure: standard normal curve when H0: μ = μ0 is true; the left-hand tail of area α, to the left of z = -zα, is the rejection region (reject H0 in favor of HA); if the observed value of z falls to the right of -zα, the evidence is inconclusive.] By rejecting H0 in favor of HA in (IH - 7) when the computed value of z turns out to be smaller than the value -zα, the probability that such a rejection of H0 is a mistake due to sampling error will be α or smaller. Since the rejection region here again corresponds to one tail of the sampling distribution, this is again called a one-tailed hypothesis test.
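When software is available, p-values like the ones above do not need to be bracketed from tables. The following is a minimal sketch in Python with scipy (standing in here for the Excel TDIST call mentioned in the text; the library is an assumption, not part of the original notes). It reproduces the right-tailed p-values of Examples 1 and 2, and shows how a left-tailed p-value for hypotheses of the form (IH - 7) would be obtained.

```python
from scipy.stats import norm, t as t_dist

# Example 1: right-tailed z test, computed test statistic z = 2.665.
z = 2.665
print("Example 1 p-value:", round(norm.sf(z), 4))            # close to the 0.0039 found from tables

# Example 2: right-tailed t test, t = 1.305 with nu = 8 degrees of freedom.
t_stat, nu = 1.305, 8
print("Example 2 p-value:", round(t_dist.sf(t_stat, nu), 4)) # about 0.1141, as TDIST gives

# For a left-tailed test such as (IH - 7), the p-value is the area to the LEFT of the
# computed statistic; e.g. for a hypothetical observed value z = -1.80:
print("left-tailed p-value:", round(norm.cdf(-1.80), 4))
```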
To calculate the p-value for the hypotheses (IH - 7), you would first sketch the left-tail rejection region bounded by the actual value of the standardized test statistic. Then the p-value is just the probability associated with that rejection region.

Two-Tailed Hypothesis Tests

There is one more type of HA to be considered, namely one containing the inequality "≠". This leads to the hypothesis test based on

H0: μ = μ0   vs.   HA: μ ≠ μ0        (IH - 8)

for example. Such a hypothesis test would arise whenever we are simply trying to demonstrate that two populations are different, without knowing (or perhaps without caring) which has the larger mean value and which has the smaller mean value. In this case, H0 is contradicted by values of x̄ which are either much larger than μ0 or much smaller than μ0. The situation is shown in the diagram to the right. [Figure: sampling distribution of x̄ when H0: μ = μ0 is true, with rejection regions in both tails and a do-not-reject region in the middle around μ0.] The rejection region now corresponds to two tails of the sampling distribution, hence the term "two-tailed test".

There is no fundamental reason why the two intervals labeled in the figure should have the same length, or equivalently (since both the normal and the t-distributions are symmetric about the mean) why the two rejection regions should correspond to the same area. However, it is very rare for statisticians not to preserve this symmetry of the underlying sampling distribution. It is also useful to continue to use the symbol α to represent the level of significance of the hypothesis test -- the total area under the sampling distribution density curve corresponding to the rejection region. As a result, each of the two rejection regions must correspond to an area of α/2 (for a total area of α/2 + α/2 = α). Thus, when transformed to the standard normal picture, the null hypothesis in (IH - 8) should be rejected as indicated in the figure to the right. [Figure: standard normal curve when H0: μ = μ0 is true, with rejection regions of area α/2 in each tail, beyond -zα/2 and +zα/2.]

Calculating the p-value for a two-tailed hypothesis test is not much more complicated than for a one-tailed hypothesis test. The computed value of the standardized test statistic is used as the boundary of one of the two identical tails. (Obviously, if the value of the standardized test statistic is positive, it is at the left edge of the right-hand tail, whereas if the value of the standardized test statistic is negative, it is at the right edge of the left-hand tail.) Calculate the area under the standard probability density curve for that single tail, and double the result to get the p-value.

Example 3: Refer to the standard data set entitled JonApples1. Does the data given there support the claim that the mean weight of the apples in the first harvest is different from 210 g? Carry out the appropriate hypothesis test using a level of significance of 0.05, but also state the p-value for the test.

Solution: The hypotheses to be tested here are

H0: μ = 210 g   vs.   HA: μ ≠ 210 g

A sample of n = 60 apples gave a mean, x̄, of 219.73 g and a standard deviation, s, of 42.88 g. Since n = 60 ≥ 30, we are dealing with the large sample case, and so the standardized test statistic obtained is

z = (x̄ - μ0)/(s/√n) = (219.73 - 210)/(42.88/√60) ≈ 1.76

Now, the rejection region is two-tailed here. For a level of significance of α = 0.05, we have α/2 = 0.025, and so require zα/2 = z0.025 = 1.96. Thus, we may reject H0 in favor of HA if either

z = 1.76 > 1.96 (= z0.025)   or   z = 1.76 < -1.96 (= -z0.025)

Since neither of these conditions is satisfied, we cannot reject H0 here at a level of significance of 0.05. The data is inconclusive with regard to the question of whether μ ≠ 210 g or not.

The computed value of the standardized test statistic is z = 1.76. So, one of the two equal-area parts of the rejection region for purposes of calculating the p-value is the region corresponding to z > 1.76 (with the other part being z < -1.76). [Figure: standard normal curve with the two tails beyond z = -1.76 and z = 1.76 shaded; the p-value is the combined area of the two tails.] Thus, for these hypotheses,

p-value = 2 Pr(z > 1.76) = 0.0784

Thus, if you decide to reject H0 on the basis of the given data (and so conclude that the mean weight of these apples is different from 210 g), then there is a 7.84% chance that that conclusion will be a mistake due to sampling error.
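A hypothetical Python/scipy sketch of Example 3, in the same style as the earlier ones (the raw JonApples1 data are not reproduced here, so the summary numbers quoted above are used directly):

```python
from math import sqrt
from scipy.stats import norm

# Summary numbers quoted in the text for the JonApples1 first-harvest sample.
n, xbar, s = 60, 219.73, 42.88
mu0, alpha = 210.0, 0.05

z = (xbar - mu0) / (s / sqrt(n))    # about 1.76
z_crit = norm.ppf(1 - alpha / 2)    # z_0.025 = 1.96

# Two-tailed rejection criterion: reject H0 if z > 1.96 or z < -1.96.
reject = (z > z_crit) or (z < -z_crit)
print("z =", round(z, 2), "  reject H0:", reject)

# Two-tailed p-value: double the single-tail area beyond |z|.
p_value = 2 * norm.sf(abs(z))
print("p-value =", round(p_value, 4))   # about 0.079 (the text's 0.0784 uses z rounded to 1.76)
```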
Type 1 and Type 2 Errors

The outcome of a hypothesis test procedure is always one of two possibilities:

- reject H0, thereby implying that the data supports HA
- do not reject H0, thereby implying that the data is inconclusive as far as the claim HA is concerned. We need to be a bit careful with language here: this outcome in no way means that HA has been disproven or that the opposite of HA has been proven (though it may be possible to do that by testing a new set of hypotheses in which the alternative hypothesis is the opposite claim), but that we simply do not have adequate evidence to consider HA supported.

In a particular situation, the outcome of the test procedure can be a correct one or an incorrect one, in the sense of reflecting the true state of the population under study or not. Thus,

- rejecting H0 is a correct decision when HA is a true statement about the population. When HA is not a true statement about the population, rejecting H0 is a mistake, called a type 1 error. (People often phrase this error as "rejecting a true H0", though this language is not accurate. H0 is a "straw-man" statement about the population -- it is a stand-in for the opposite of HA, used because of two important features: (i) it is the instance of "HA not true" which will be most difficult to distinguish from "HA true", and (ii) it simplifies calculations, since it states a specific characteristic of the population. Not rejecting H0 in no way means that H0 itself accurately reflects the state of the population -- in no way does it mean that H0 is true. And, in fact, if you've understood the discussion so far, not rejecting H0 in no way means that the opposite of HA is a true statement about the population. Not rejecting H0 is a statement that the data is ambiguous as far as the given hypotheses are concerned. Anyway, a type 1 error is more accurately described as the error of rejecting H0 when HA is not true, or the error of supporting a false HA.)

- not rejecting H0 is a correct outcome when HA is not a true statement about the population. When HA is a true statement about the population, failing to reject H0 is a mistake, called a type 2 error. Since HA is a definite statement about a property or characteristic of the population, a type 2 error is an error of not recognizing some actual characteristic of the population (whereas a type 1 error is an error of concluding the population has a characteristic which it actually does not have).

Consider a quick example to illustrate these ideas.
Folic acid has recently been promoted as a significant factor in the prevention of cardiovascular disease (and many other serious health problems). Suppose you have genetically engineered a new variety of bean which you hope to be able to claim contains a mean of more than 200 μg of folic acid per 100 g dry weight. From a hypothesis testing point of view, you need to test the hypotheses:

H0: μ = 200 μg/100 g   vs.   HA: μ > 200 μg/100 g

(Obviously, μ here stands for the mean folic acid content of the population of all beans of this variety, measured in units of μg/100 g.)

Making a type 1 error means rejecting H0 when HA is not true. This amounts to you concluding that the beans do contain more than 200 μg of folic acid per 100 g when that is not true. Such a conclusion may result in legal action or bad publicity against you or your company when buyers of these beans later find that they are not the good source of folic acid that you claimed them to be.

On the other hand, making a type 2 error here would be concluding that HA is not supported, when in fact the beans do average more than 200 μg of folic acid per 100 g dry weight. In this case, you really would have a superior natural source of folic acid, but due to sampling error you would overlook that fact. As a result, you may abandon attempts to sell or promote the use of these beans, with the consequence of you losing potential business or income from that activity, and the world losing access to a valuable nutritional resource.

Notice that in both of these scenarios, a course of action results from the outcome of the hypothesis test. In this example, the action is either to market the beans or abandon plans to market the beans. Either course of action taken on the basis of a mistaken hypothesis test outcome has associated costs of various sorts (not necessarily just monetary costs -- you can have costs in lost opportunity, lost health benefits, etc.).

It is conventional to use the symbol α to represent the probability of making a type 1 error, and the symbol β to represent the probability of making a type 2 error. That is, by definition

α = Pr(make a type 1 error)   and   β = Pr(make a type 2 error)        (IH - 9)

Because the hypothesis test procedure is set up so that we start by assuming H0 is true, the value of α (or an upper bound to the value of α) can be computed. In fact, the rejection criteria are formulated based on the desired value of α. This is the reason for writing H0 as an equality standing in for the opposite of HA. The value of α is called the level of significance of the hypothesis test. When a definite conclusion is obtained, the level of significance is the probability that that conclusion is erroneous. If this value of α is small, it means that the conclusion has little likelihood of being wrong, and hence it is a "significant" conclusion.

The type 2 error occurs when we fail to detect a true property of the population. As a counterpart to the notion of a level of significance, the quantity 1 - β is called the power of the test. It is the likelihood that a true property of the population will be detected. Tests with high power are likely to detect true properties of populations. Unfortunately, because the formalism focuses so directly on the probability of making a type 1 error, we are unable to say much with the same degree of detail about the value of β. The level of significance α is defined to be the probability of rejecting H0 when H0 is a true statement about the population.
But if H0 is a true statement about the population, we have enough information to calculate this probability. On the other hand, a type 2 error can only occur when HA is a true statement about the population. But HA, being an inequality, does not contain enough information to compute probabilities. For instance, you can calculate Pr(x̄ > 200) if you are told that μ = 150 (and you have an estimate of σ), but you cannot calculate this probability if you are told only that μ > 150. Of course, the reason why we cannot compute a specific value of β for a set of hypotheses is because the value of β, the probability of making a type 2 error, depends on what the true state of the population is. However, we can calculate the probability of making a type 2 error for specific hypothetical states of the population. We can illustrate this with a couple of simple examples.

Example 1: This is the example we've used a couple of times above already, involving the hypotheses

H0: μ = 65 ppm   vs.   HA: μ > 65 ppm        (IH - 1)

where μ is the mean calcium concentration in unsanitized salmon fillets. Compute the probability of making a type 2 error when these hypotheses are tested with α = 0.05 and the true value of μ is 70 ppm. You may assume that s = 22.02 ppm is an acceptable estimate of σ for this population.

Solution: The rejection region (and rejection criterion) are computed from the hypotheses and the required value of α. We've already calculated that the critical value of x̄ is

x̄0.05 = μx̄ + z0.05 σx̄ ≈ 65 + 1.645 × (22.02/√40) ≈ 70.73 ppm

Thus, we reject H0 in favor of HA if the sample mean is greater than 70.73 ppm. If the sample mean is smaller than 70.73 ppm, we will not reject H0. Thus, the line dividing the rejection region from the non-rejection region is located to cut off a right-hand tail of area 0.05 under the sampling distribution of x̄ that arises assuming μ = 65 (that is, assuming H0 is true). [Figure: sampling distributions of x̄ for μ = 65 ppm (solid curve) and μ = 70 ppm (dashed curve); the boundary x̄ = 70.73 ppm (location determined by the chosen value of α) separates the non-rejection region from the rejection region; the right-hand tail of the μ = 65 curve beyond the boundary has area α = 0.05, and the area of the μ = 70 curve over the non-rejection region is β.]

But, if the population mean is actually 70 ppm, the sampling distribution of x̄ is the curve shown with a dashed line in the figure: the shape is the same as for μ = 65, but the center of the distribution is shifted rightwards to x̄ = 70 ppm. If μ = 70, then the failure to reject H0: μ = 65 is a type 2 error. In the figure, the probability of not rejecting H0 when μ = 70 is just the area under the μ = 70 density curve corresponding to the non-rejection region:

β(μ = 70 ppm) = Pr(x̄ ≤ 70.73 ppm when μ = 70 ppm) = Pr(z ≤ (70.73 - 70)/(22.02/√40)) = Pr(z ≤ 0.21) = 0.5832

This is quite a large probability. What it says to us is that with the hypothesis test set up as specified (that is, so that α = 0.05 for the hypotheses (IH - 1), and assuming a sample of size 40 is used and that 22.02 ppm is an acceptable estimate of σ), there is almost a 60% chance that H0 will not be rejected even if the true population mean is 70 ppm. (A short calculation reproducing these numbers, and those in the points below, follows.)
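The β calculation just completed, and the behaviour described in the numbered points that follow, can be reproduced in a few lines. Here is a minimal sketch in Python with scipy (hypothetical, using the same SalmonCa0 summary numbers as above) that computes β for several possible true means, shows the α/β trade-off when the critical value is moved, and shows the effect of a larger sample size.

```python
from math import sqrt
from scipy.stats import norm

s, mu0, alpha = 22.02, 65.0, 0.05   # estimate of sigma, H0 value of mu, level of significance

def beta(mu_true, n=40, x_crit=None):
    """Pr(type 2 error): the probability that x-bar falls in the non-rejection region
    (x-bar <= critical value) when the true population mean is mu_true."""
    se = s / sqrt(n)
    if x_crit is None:                          # default: critical value chosen so that alpha = 0.05
        x_crit = mu0 + norm.ppf(1 - alpha) * se
    return norm.cdf((x_crit - mu_true) / se)

# beta when mu is really 70 ppm, using the critical value of about 70.73 ppm: roughly 0.58
print("beta(70) =", round(beta(70), 4))

# beta drops quickly as the true mean moves further above 65 ppm (operating characteristic behaviour)
for mu in (75, 80, 85):
    print("beta(%d) =" % mu, round(beta(mu), 5))    # beta(85) is about 0.00002

# Moving the boundary leftwards to x-bar = 68 ppm trades a smaller beta for a much larger alpha
se40 = s / sqrt(40)
print("with x_crit = 68 ppm: alpha =", round(norm.sf((68 - mu0) / se40), 4),
      " beta(70) =", round(beta(70, x_crit=68), 4))   # roughly 0.19 and 0.28

# Increasing the sample size (with alpha held at 0.05) reduces beta(70)
print("beta(70) with n = 100:", round(beta(70, n=100), 4))   # roughly 0.27, versus 0.58 at n = 40
```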
The figure above illustrates a number of important features of "error control", which we will amplify considerably before we're done with this whole topic:

(i) The value of β will decrease as the true mean value of the population increases. Picture sliding the dashed curve in the figure further rightwards, representing a true mean value μ of, say, 75 ppm, or 80 ppm, etc. The area under the shifted dashed curve corresponding to the non-rejection region will decrease, because the size of the tail to the left of the critical value of x̄ gets smaller. Thus, while there is a high probability here of making a type 2 error when μ = 70 ppm or thereabouts, if the true value of μ were even larger, this probability of making a type 2 error would drop considerably. (For instance, β(μ = 85) = 0.00002, a very acceptable value!) The figure below shows a plot of β as a function of the true value of μ -- curves representing this sort of information are called operating characteristic curves -- and you can see how the value of β starts off very near 1 for μ = 65 or smaller, but drops off in a fairly characteristic reverse S-shape as μ increases, becoming essentially negligible in this case (that is, less than 0.05) by the time μ is 77 ppm or so. (A short computational sketch of this curve appears after these remarks.)

[Figure: "Operating Characteristic Curve" for Example 1 -- Pr(type 2 error), β, plotted against the true value of the population mean, for μ between 60 and 85 ppm.]

(ii) The concern with the figure at the top of this page is that while the value of α is quite acceptable, the value of β(μ = 70) seems to be way too large to be acceptable. The only way to reduce this value of β while keeping the shape of the sampling distribution curves as they are would be to shift the boundary between the rejection and non-rejection regions leftwards. For instance, if we decided to shift that boundary to x̄ = 68 ppm, the area β does indeed decrease quite substantially, to about 0.2843 -- roughly half of what it was when we used the critical value x̄ = 70.73 ppm. Of course, you see immediately what's wrong with this approach to reducing the value of β(μ = 70): by moving the boundary between the rejection and non-rejection regions leftwards to reduce the value of β, we have simultaneously increased the value of α, the probability of making a type 1 error. In fact, using x̄ = 68 ppm as the critical value increases the probability of making a type 1 error by a factor of nearly four, to 0.1949.

[Figure: the same two sampling distributions with the boundary between the non-rejection and rejection regions moved to x̄ = 68 ppm; the right-hand tail of the μ = 65 curve now has area α = 0.1949, and the left-hand tail of the μ = 70 curve has area β = 0.2843.]

What you see from this is that if all other aspects of the experiment remain the same (essentially those things which determine the shape of the sampling distribution: the value of s, and the sample size n), you cannot do anything to reduce the values of both α and β at the same time. The only freedom you have left is to move the boundary between the rejection and non-rejection regions either leftwards or rightwards. Such a change will make one of α or β smaller, but inevitably the other will increase in value.

(iii) Finally, you can see that the only way of reducing the value of β without simultaneously increasing the value of α is to narrow the sampling distributions themselves. If we do that, then the tails of the two distributions falling on the "bad" side of the boundary between the rejection and non-rejection regions will both decrease in size. Since the width of these distributions is determined by the values of σ and n, and we have no control over σ (it is a characteristic of the population), our control must be exercised through increasing the sample size n appropriately.

[Figure: the two sampling distributions redrawn for a larger sample size; both curves are narrower, the boundary that keeps α = 0.05 moves leftwards, and the area β shrinks.]

We'll go into the specifics of estimating an appropriate sample size in later sections of the course. For the moment, the point here is that the only way to exercise simultaneous control over the probabilities of both types of errors is through use of an adequately large sample size.
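(As promised in remark (i), here is a brief sketch of how points on the operating characteristic curve, and the effect of the sample size on β, can be computed. It simply repeats the β calculation from the Solution above for a range of hypothetical true means and sample sizes; the particular grids of values are arbitrary and chosen only for illustration.)

    # Operating characteristic curve for Example 1: beta as a function of the
    # hypothetical true mean, plus the effect of increasing the sample size.
    from math import sqrt
    from scipy.stats import norm

    mu0, s, alpha = 65.0, 22.02, 0.05

    def beta(true_mu, n):
        """Pr(do not reject H0: mu = 65 ppm) when the true mean is true_mu."""
        se = s / sqrt(n)
        x_crit = mu0 + norm.ppf(1 - alpha) * se   # boundary chosen so alpha stays fixed
        return norm.cdf((x_crit - true_mu) / se)

    # a few points on the operating characteristic curve (n = 40, as in Example 1)
    for mu in (65, 70, 75, 80, 85):
        print(f"n = 40, true mean {mu} ppm: beta = {beta(mu, 40):.5f}")

    # effect of a larger sample on beta(mu = 70 ppm), with alpha held at 0.05
    for n in (40, 80, 160, 320):
        print(f"true mean 70 ppm, n = {n}: beta = {beta(70, n):.4f}")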
For a fixed sample size, you cannot simultaneously reduce both error probabilities to some desired value. In the last figure above (point (iii)), the effect of increasing the sample size is seen in the two distribution curves becoming narrower and higher at the center. The narrowing of the distribution for μ = 65 ppm results in the boundary between the rejection and non-rejection regions shifting leftwards so that α remains at 0.05. Thus, the value of β is reduced by a combination of two effects: the boundary between the rejection and non-rejection regions shifts leftwards, and the distribution curve for μ = 70 ppm clusters more tightly about the location 70 ppm on the horizontal axis, reducing the area in any fixed left-hand tail.

How Do You Decide Which HA to Use?

Actually, let's back up one step. How do you know when to use hypothesis testing instead of constructing a confidence interval estimate? Here, you look for key words in the request. If words like "estimate," "predict," or "calculate" are present, almost certainly you are being asked to compute a confidence interval estimate of the indicated population parameter. On the other hand, if you see phrases such as "is this adequate evidence ..." or "does the data support the claim ...", and similar, you are being asked to test hypotheses.

Once you've decided a hypothesis test is required, the next issue is to formulate the hypotheses in the test. Always start by working out the appropriate HA, since you can get H0 from HA by simply replacing the inequality symbol by an equality symbol. There are several clues and principles that one may rely on here.

- Most often, the claim to be tested is stated quite explicitly as an inequality in words. This is the appropriate HA. For instance, to respond to the question "Is this data evidence in support of the claim that the mean folic acid content is more than 200 μg/100 g?", you know immediately that you must use "HA: μ > 200 μg/100 g".

- The type of inequality is indicated by various key relational words. Words such as "different" and its synonyms indicate that a two-tailed hypothesis test is appropriate. Words and phrases such as "greater than," "exceeds," "is bigger," etc. (and their counterparts "less than," etc.) point to a one-tailed test.

There are some more subtle issues that many authors raise, though it may be difficult for you to understand their significance if you are new to the subject of hypothesis testing. We mention two of them briefly here, and may return to them from time to time later in the course.

Some authors caution against using one-tailed hypothesis tests except in situations in which you have a scientific principle that justifies a one-tailed relationship. Thus, unless we have an independent physical, chemical, or biological principle that indicates the appropriate relationship is a "greater than" or a "less than" type, they would recommend all tests be done as two-tailed tests. The point here is subtle, but important. It is considered bad form in statistics to allow the data to determine what kind of analysis you carry out, since that tends to promote error -- it's a bit like shooting the arrow first, and then drawing the target. If the only reason you have for using a one-tailed test is that the data seems to indicate that the value of the population parameter is greater than some number or less than some number, then you would not be justified in using a one-tailed test.
The downside of this advice (though, if you think about it, it may really be an upside!) is that two-tailed tests are more rigorous -- you need much clearer data to be able to reject H0 in a two-tailed test, all other things being equal, than you need for a one-tailed test.

A second principle (which we'll look at in a bit more detail in the next document in this series) revolves around the issue sometimes called burden of proof. (We'll explain this term later as well.) The hypothesis test procedure is formulated so that we have direct control over the probability of making a type 1 error, but relatively less control over, or information about, the probability of making a type 2 error. In situations in which one of the two possible errors that could be made is particularly serious, you may wish to formulate HA so that this more serious error becomes the type 1 error. Then, you can control the probability of making this more serious error.

For example, suppose that a certain pesticide is known to harm people in average concentrations of more than 5 ppb (parts per billion) in, say, apples. If the fruit does contain a mean concentration of more than 5 ppb of the pesticide, it should not be sold as whole fruit. It can be reprocessed into other products in which the mean pesticide residue concentration can be reduced to a safe level. However, because of higher production costs, these alternative products generate lower overall profits, so the company would prefer to sell the apples as whole fruit. The decision to release a crop of apples for sale to consumers might be based on a test of hypotheses involving μ, the mean concentration of the pesticide in the population of these apples, and the quantity 5 ppb. So immediately, we know that the null hypothesis must be H0: μ = 5 ppb. The question is: what should we use for the alternative hypothesis? We know that there are only two reasonable possibilities here, since a two-tailed test would not make much sense (why?):

HA: μ > 5 ppb    or    HA: μ < 5 ppb

In detail, the consequences of the two possible choices of hypotheses are:

Hypotheses: H0: μ = 5 ppb vs HA: μ > 5 ppb
- Action if H0 is rejected: conclude that the mean pesticide concentration exceeds safe limits, and so the apples are not released for sale as whole fruit; instead they are reprocessed into other, less profitable products.
- Action if H0 is not rejected: recognize that there is insufficient evidence to conclude that the mean pesticide level in the apples exceeds safe limits, and so release the apples for sale as whole fruit; there is no proof that the apples are dangerous.
- Mistaken action resulting from a type 1 error: apples are withheld from sale as whole fruit when they are safe; some potential profit is lost.

Hypotheses: H0: μ = 5 ppb vs HA: μ < 5 ppb
- Action if H0 is rejected: conclude that the mean pesticide concentration is less than the safe limit, and so the apples are released for sale as whole fruit.
- Action if H0 is not rejected: recognize that there is insufficient evidence to conclude that the mean pesticide concentration in the apples is less than the safe limit, and so probably do not release the apples for sale; there is no proof that the apples are safe.
- Mistaken action resulting from a type 1 error: apples are released for sale even though the mean pesticide concentration in them exceeds safe levels; consumers will be harmed, and perhaps the producer will be legally liable for this harm.

The last entry in each case describes the mistaken action that would result from a type 1 error in the corresponding hypothesis test. The producer would have to decide which of these two errors is more serious, and then base the decision of whether to release the apples for sale as whole fruit on the outcome of the test of the corresponding hypotheses. Notice that the first set of hypotheses will result in the release of the apples for sale as whole fruit as long as there is no definite proof that they contain harmful concentrations of the pesticide. The second set of hypotheses will result in the apples being released for sale as whole fruit only if there is definite proof that they are safe. When worded in this way, you can see that the distinction between the two approaches is not all that subtle. In one case, the producers sell the apples only if they are quite certain the apples are safe; in the other case the producers sell the apples unless they are quite certain the apples are harmful. The sketch below shows how the two choices of HA can lead to different actions from exactly the same data.
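(To make the contrast concrete, here is a minimal sketch -- not part of the original notes -- that runs both one-tailed z tests on the same hypothetical sample summary: a sample of n = 50 apples with mean residue 4.8 ppb and standard deviation 1.2 ppb, all invented numbers. It then reports the action each formulation would lead to.)

    # Burden of proof: the same data tested against the two possible alternatives.
    # The sample summary below is hypothetical, chosen only for illustration.
    from math import sqrt
    from scipy.stats import norm

    mu0, alpha = 5.0, 0.05
    xbar, s, n = 4.8, 1.2, 50          # hypothetical sample mean, std. dev., size (ppb)

    z = (xbar - mu0) / (s / sqrt(n))   # standardized test statistic
    z_crit = norm.ppf(1 - alpha)       # about 1.645 for alpha = 0.05

    # Test 1 -- HA: mu > 5 ppb: sell unless there is proof the apples are harmful.
    sell_1 = not (z > z_crit)          # H0 rejected only if z is large and positive
    # Test 2 -- HA: mu < 5 ppb: sell only if there is proof the apples are safe.
    sell_2 = (z < -z_crit)             # H0 rejected only if z is large and negative

    print(f"z = {z:.2f}")
    print("HA: mu > 5 ppb -> release apples for sale?", sell_1)
    print("HA: mu < 5 ppb -> release apples for sale?", sell_2)

With these invented numbers, neither null hypothesis is rejected, so the first formulation releases the apples while the second withholds them -- precisely the difference in burden of proof described above.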
Statistical Significance vs. Practical Significance

We need to raise one more general issue before leaving this introduction to hypothesis testing. Recall that we view a decision based on a hypothesis test as "statistically significant" if the value of the standardized test statistic falls within the rejection region. This means that the likelihood of the conclusion being a mistake has been controlled.

The situation for tests of hypotheses involving the population mean exhibits a common feature of hypothesis tests. If we assume the use of a large sample, then the formula for the standardized test statistic is

z = (x̄ - μ0) / (s / √n) = [(x̄ - μ0) / s] √n      (IH - 10)

Now, μ0 is a value fixed in the statement of the hypotheses. x̄ and s are properties of the sample which estimate fixed properties of the population (μ and σ, respectively), and so if we increase the sample size n, we don't expect the values of x̄ and s to change much (though of course, being a bigger and thus different random sample, the values we get for x̄ and s are unlikely to be exactly equal to those for the initial smaller sample). Thus, in increasing the sample size n, we expect that the value of the factor (x̄ - μ0)/s in the formula for z may change slightly, but probably not much. But as n increases, so does the value of √n, and so inevitably the value of z given by formula (IH - 10) will increase simply because n is larger. In this way, we can actually make z as large as desired by simply taking a large enough sample. But large values of z correspond to rejection of H0. Thus, even if μ is only very slightly larger than μ0, we may be able to satisfy the rejection criterion for H0 by choosing a very, very large sample size.

This may sound as if we can "prove" any conclusion we like through hypothesis testing just by taking large enough samples, and so cast doubt on the value of the procedure. And that's almost true. However, think about what it means to reject H0 when testing

H0: μ = 65    vs.    HA: μ > 65

Rejecting H0 just means we have evidence that supports the conclusion that μ > 65. This conclusion is just as true if μ = 65.001 as it is if μ = 165. However, if μ = 65.001 is the true state of affairs, it will likely take an exceedingly large sample to be able to reject the H0 above, whereas even a relatively small sample should result in rejection of H0 when μ = 165. Thus, while taking a large enough sample might result in the difference between μ = 65 and μ = 65.001 becoming statistically significant, in reality the distinction may have little practical significance.
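(A few lines of code make the effect of n in formula (IH - 10) vivid. This sketch assumes, purely for illustration, that samples keep returning x̄ = 65.5 ppm and s = 22.02 ppm no matter how large the sample is, and simply shows how z, and hence the verdict of the test, changes as n grows.)

    # How the standardized test statistic (IH - 10) grows with the sample size
    # when the sample mean barely exceeds the hypothesized value.  The fixed
    # values of xbar and s below are hypothetical, chosen only for illustration.
    from math import sqrt
    from scipy.stats import norm

    mu0, xbar, s, alpha = 65.0, 65.5, 22.02, 0.05
    z_crit = norm.ppf(1 - alpha)                 # about 1.645

    for n in (40, 400, 4_000, 40_000, 400_000):
        z = (xbar - mu0) / (s / sqrt(n))         # formula (IH - 10)
        verdict = "reject H0" if z > z_crit else "do not reject H0"
        print(f"n = {n:>7}: z = {z:6.2f} -> {verdict}")

With these invented numbers, a sample of a few hundred or even a few thousand fillets does not reject H0, but once n exceeds roughly 5200 the half-ppm difference becomes "statistically significant" -- which is exactly the issue discussed next.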
(Would you advise a company to adopt a new way to process salmon fillets if the new way resulted in a mean calcium content of 65.001 ppm rather than the mean 65.000 ppm that the current method gives? Probably not. But if calcium content were an important issue, then perhaps switching to a new method that increased the mean calcium content from 65 ppm to 165 ppm would be worth considering. That is a difference of practical significance.)

What this means is that there are properties of populations that can be demonstrated with statistical significance by taking very large samples, but which have little practical significance. It is a caution against increasing sample sizes arbitrarily just to get a rejection of the null hypothesis. You will get a definite conclusion to your study, but nobody may care about it. In some ways, large samples are good in that they contain a lot of information -- it is unlikely that a very large sample will seriously misrepresent the population from which it has been drawn. On the other hand, if you need a sample of thousands of items just to be able to detect an effect, you need to ask whether the effect is really important -- whether it has any practical significance.

Summary of Generic Test Procedures

The following table summarizes the common features of most of the hypothesis tests that we will cover in this course. In the table:

- θ stands for a population parameter (e.g., μ, p, σ, μ1 - μ2, p1 - p2, etc.)
- θ0 stands for a specific hypothesized numerical value of θ
- f is the symbol for the standardized test statistic (e.g., z, t, χ2, etc.)

We will use the symbol fα to stand for the value of f which cuts off a right-hand tail of area α. Then, the symbol f1-α stands for the value of f that cuts off a left-hand tail of area α. For standardized random variables that are symmetric about the value zero (e.g., z and t), we have f1-α = -fα, of course. Now, the three possible hypothesis tests that can arise are:

Hypotheses: H0: θ = θ0 vs HA: θ > θ0
- rejection criterion: f > fα
- p-value: Pr(f > test statistic)

Hypotheses: H0: θ = θ0 vs HA: θ < θ0
- rejection criterion: f < f1-α (f < -fα for symmetric distributions)
- p-value: Pr(f < test statistic)

Hypotheses: H0: θ = θ0 vs HA: θ ≠ θ0
- rejection criteria: f > fα/2 or f < f1-α/2
- p-value: 2 Pr(f > |test statistic|) for symmetric distributions

This has become a very long document, and we have introduced a lot of important ideas (though in many cases at a very superficial level). We will now look at the application of these basic notions to situations involving quite a variety of specific population parameters. In the process, we will further clarify or detail the general principles described in this document. Before that, though, we will take a short excursion to draw a parallel between statistical hypothesis testing and a situation that the entertainment industry (if not direct contact with the legal system itself) has made quite familiar to most of us -- the criminal justice system.