MATH 2441
Probability and Statistics for Biological Sciences
Introduction to Hypothesis Testing
You've already read several times in these notes that the methods of statistical inference fall into two broad
categories:
1. statistical estimation -- in which we use information available from a random sample to estimate values of population parameters of interest. We've just finished looking at a variety of confidence interval formulas that can be used to estimate population means, proportions, variances, standard deviations, differences of two population means, differences of two population proportions, etc.
2. hypothesis testing -- in which we determine whether specific statements or claims about the value of a population parameter are supported or not by data (evidence) available from a random sample.
In this document, we describe the general concepts and jargon associated with hypothesis testing. In the
next few documents, we describe how the general method is applied to hypotheses involving various
population parameters. This is a rather long document because it introduces a number of concepts which
are new to most people studying statistical inference for the first time. Hypothesis testing requires us to
examine decision-making logic in some depth and this also is difficult at first for many. As a result, the notes
below will also tend to repeat things more than once to help you see the connections between the concepts,
conventions, computations and conclusions involved.
What is meant by "hypothesis?"
The word hypothesis is just a slightly technical or mathematical term for "sentence" or "claim" or
"statement." In statistics, a hypothesis is ALWAYS A STATEMENT ABOUT THE VALUE OF A
POPULATION PARAMETER. Thus, typical statistical hypotheses are (in some appropriate context, of
course)
 > 5 ppm
  0.65
2 > 2.00
1 - 2 > 0
and so on. These are hypotheses because they all involve population parameters, and because they are
statements -- in mathematical notation -- about the values of the population parameters.
The following are not statistical hypotheses:
" x  5 ppm ", because x is not a population parameter (it is a sample statistic)
" is big enough" because, though it is a statement about the population parameter , the
statement is not quantitative
(In a more general sense, the word "hypothesis" can be used to refer to any statement that potentially can
be assessed to be supported or not supported by some evidence. In technical applications, even
statements that are initially non-quantitative almost always must be reduced to quantitative statements
before it is possible to evaluate the degree to which they may or may not be supported by available
evidence. )
Then hypothesis testing is the operation of deciding whether or not data obtained for a random sample
supports or fails to support a particular hypothesis. In practice, the result of testing a hypothesis is a
declaration that the hypothesis is supported by the data or that it is not supported by the data (there's a bit
more to it than this, as you'll see before we're done, but this is the gist of it). But the matter doesn't end
there, because few people or companies are prepared to pay for the costs of an experiment just to find out
whether a particular hypothesis is supported or not. The statistical hypothesis test normally determines a
course of action -- perhaps prompting a company to switch to a new production method, or to a new supplier
of materials, etc. If the conclusion of the hypothesis test procedure is incorrect, then a mistaken and
potentially disadvantageous or costly course of action may result.
Why Can't We Just Look at the "Facts?" Why Do We Need "Hypothesis Testing?"
At first, the formalism described below may seem to be quite an abstract and overly complicated approach
to deciding whether a statement about a population parameter is "true" or "false." You may wonder why we
can't just look at the data and come to a common-sense conclusion.
If you've been keeping up with the last few weeks of work in the course, you already know the answer to this
question. We cannot rely on data in a random sample being a perfect representation of the population from
which the random sample was selected. Thus we know that for a population with a mean value of, say, 5,
we can select random samples which have a mean value less than 5, exactly equal to 5 (very unlikely), or
greater than 5. If you draw two or more independent random samples from the same population, you are
extremely unlikely to find that any two (let alone the whole lot) have the same mean value.
Suppose we make a claim that the mean value of a population is greater than 4. Our claim would be correct
if it turned out to be that the actual mean value of the population was 4.1 or 5 or 10, for instance. Now,
suppose we take a random sample of that population and find that the sample mean is 3.9. Does this
invalidate our original claim? Is this evidence that our original claim that μ > 4 is false?
Well, we can't really say. Recall the sampling experiments we did earlier in the course with the populations constructed to have μ = 5 (and so, for those populations, μ > 4 is certainly true). Some groups drew random samples of 30 items and observed a sample mean as small as 2.8, while others drew random samples of size 30 from the same populations and observed a sample mean as large as 7.5. If the people who drew a sample which gave x̄ = 2.8 had concluded that μ > 4 is false, they would have drawn a mistaken conclusion. On the other hand, we might ask, if observing x̄ = 2.8 is not adequate evidence to conclude that μ > 4 is a false statement, then what kind of evidence would we need?
Glad you asked that! That's exactly what the formalism presented below is all about. The whole point of the
methodology of statistical hypothesis testing is to decide exactly how contradictory the experimental
evidence must be before we can consider a claim about the value of a population parameter to be
"disproven." The real problem here is that the "facts" often tend to be ambiguous or misleading, because
they are really only "facts" about a small part (the random sample) of the much larger population about
which the hypothesis is making a statement. If we had the capability of including the entire population in our
experiment, the simple "facts" would be quite adequate in determining the truth or falseness of the
hypothesis. Since we are restricted to information obtained from just a random sample of that population,
we must be careful about jumping to an unwarranted conclusion.
You might have noticed that we put quotation marks around words like "true", "false", and "disproven". This
is not so much because we're pushing a philosophical agenda which questions whether there are such
things as absolute truth or rigorous proof. Rather, as you'll see when we get deeper into the examination of this problem, it's more that we can never ensure the impossibility of error when drawing conclusions about
an entire population based on the observation of a random sample of that population. So, while we can set
up our decision-making procedure to make the possibility of drawing an incorrect conclusion as unlikely as
we wish (though not without corresponding cost), we can never eliminate the possibility of error entirely.
Thus, we will try to avoid categorical words like "true", "false", and "proof" -- using instead phrases that more
accurately reflect what we are really able to say: "the claim is supported", "the claim is not supported", etc.
We'll return to this matter near the end of this document when we can recap the issue a bit more precisely.
A key issue in hypothesis testing is error avoidance or error control. We attempt to draw a conclusion
about the population. That conclusion is potentially correct or incorrect. The goal is to control the likelihood
of making a mistake.
What Are the Potential Mistakes We Can Make?
Whenever we attempt to evaluate a hypothesis, there are always four ways the situation can play itself out.
To see this more concretely, consider the following example.
Example 1: Recall the SalmonCa experiment described in the standard data sets distributed earlier. As
part of the experiment, a technologist analyzed 40 unsanitized salmon fillets, and found for that random
sample of 40 fillets, the mean calcium concentration was 74.28 ppm with a standard deviation of 22.02 ppm.
Now, suppose that the intent of the experiment was to determine whether one could state that the mean
calcium content of all unsanitized salmon fillets was greater than 65 ppm. That is, suppose the hypothesis
to be tested is: μ > 65 ppm.
Now, there are two possible states of reality here, and there are two possible conclusions we could draw
from the data. We summarize the alternatives in the following table:
                                        Our Conclusion:
                                        μ ≤ 65 ppm                              μ > 65 ppm

The Actual State      μ ≤ 65 ppm       no mistake -- μ is really less          we make a mistake -- μ is really
of the Population                      than or equal to 65 ppm, and we         less than or equal to 65 ppm, but
                                       conclude that it is less than or        we conclude that it is greater
                                       equal to 65 ppm.                        than 65 ppm.

                      μ > 65 ppm       we make a mistake -- μ is actually      no mistake -- μ really is greater
                                       greater than 65 ppm, but we             than 65 ppm and we conclude it is
                                       conclude that it is less than or        greater than 65 ppm.
                                       equal to 65 ppm.
Since either of our two possible conclusions can occur with either of the two potential realities, there are four
different ways the hypothesis test process can play out. As described in the table, two of those four
alternatives amount to a correct conclusion:
• the population mean is really less than or equal to 65 ppm and our conclusion (based on the data, of course) is that the population mean is less than or equal to 65 ppm, and
• the population mean is really greater than 65 ppm and our conclusion is that the population mean is greater than 65 ppm.
Unfortunately, two of the four alternatives amount to the data leading us to draw a false conclusion; to make a mistake:
• the population mean is really less than or equal to 65 ppm, but the data in the random sample has led us to conclude that the population mean is greater than 65 ppm, and
• the population mean is really greater than 65 ppm, but the data in the random sample has led us to conclude that the population mean is less than or equal to 65 ppm.
The really unfortunate thing is that since the only information we have about the value of the population
mean is the data in the random sample, we have no way of telling for sure which of these four scenarios has
occurred once we arrive at our conclusion! What we can do, however, is set up our decision-making
process to allow us to control the probability of drawing certain wrong conclusions.
The Strategy
People have long realized that it can be harder in principle to prove that a true statement about a population
is true, than it is to demonstrate that a false statement is false.
For example, suppose I make the statement: "Everybody in Canada likes vanilla ice cream." Suppose first
that this is a true statement. How could I prove it? Well -- I'd really have to ask every single person in
Canada about their views on vanilla ice cream. Even if I asked 10 million people, and found that every
single one of the 10 million people I asked claimed to like vanilla ice cream, I still haven't come close to
proving that every single person in Canada likes vanilla ice cream. After all, there would still be nearly 20
million people that I haven't talked to, and it is possible that there is perhaps at least one person in those
unpolled 20 million who absolutely detests the taste of vanilla ice cream.
On the other hand, if my statement about vanilla ice cream is false, all I need to do to prove it false is to find
just one person who does not like vanilla ice cream. It doesn't matter how many people I find who do like
vanilla ice cream, since the first person I encounter who does not like vanilla ice cream effectively provides
absolute proof that my statement is incorrect. Nor does it matter how many people I do not ask. If the first
person I talk to tells me that they hate the taste of vanilla ice cream, I've proven the statement "Everybody in
Canada likes vanilla ice cream" to be false, even though I haven't checked with any of the other nearly 30
million or so people living in Canada.
In practice, it would be a rare situation in which we really needed to find evidence for a statement as
inclusive as "Everybody in Canada likes vanilla ice cream." If we were thinking of constructing an ice cream
factory which could only make one flavor of ice cream, then we wouldn't need to pick a flavor that every
single person in the country would eat. So, it is more likely that we would be interested in "proving"
statements such as "More than 30% of Canadians like vanilla ice cream" or "More Canadians prefer vanilla
ice cream than prefer chocolate ice cream", etc. But the general point still applies -- it will turn out to be
easier to disprove a false statement than it will be to prove a true statement.
There is a somewhat more subtle but very important issue here as well. Very often, evidence is ambiguous.
This is common in statistics because we are trying to say something about an entire population based on
information obtained from a relatively small random sample of that population and often we are trying to
detect very small effects. Suppose I'm interested in determining whether the statement "More than 30% of
Canadians like vanilla ice cream." can be substantiated. I'm interested because an earlier study indicated
vanilla ice cream was preferred by 28% of all Canadians and I've just spent a million dollars on an
advertising campaign that the agent claimed would increase this flavor preference by at least 2% to over
30% of all Canadians. I'm trying to determine if I got my money's worth out of the advertising.
I select a random sample of 100 Canadians, and find that 31 of them state they like vanilla ice cream and
the other 69 say they don't. If 31% of the sample say they like vanilla ice cream, is that proof that more than
30% of the population likes vanilla ice cream? Not really. Using the information in the document on
sampling distributions, we can calculate that if π, the proportion of all Canadians who like vanilla ice cream,
is, say, 0.29, which is less than 0.30, then there is a probability of 33% that a random sample of 100
Canadians will contain at least 31 people who like vanilla ice cream. So, the observation that 31 people in a
sample of 100 like vanilla ice cream is far from conclusive proof that more than 30% of all Canadians like
vanilla ice cream. On the other hand, it would be bizarre to conclude that the statement that "More than
30% of Canadians like vanilla ice cream" is false if 31 of the 100 people in our sample stated they liked
vanilla ice cream. In this situation, the data we have is inconclusive -- it neither supports nor contradicts the statement made about the population parameter. Our formalism has to have a way of detecting or
dealing with this (unfortunately rather common) situation.
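For readers who want to check that 33% figure, here is a small sketch (not part of the original notes) in Python with scipy. It uses the normal approximation to the sampling distribution of the sample proportion, as described in the document on sampling distributions, and compares the result with the exact binomial probability.

from math import sqrt
from scipy.stats import norm, binom

pi = 0.29        # hypothetical true proportion of Canadians who like vanilla
n = 100          # sample size
p_hat = 31 / n   # observed sample proportion

# Normal approximation: p-hat is approximately N(pi, sqrt(pi*(1 - pi)/n)) for large n.
sd_p_hat = sqrt(pi * (1 - pi) / n)
z = (p_hat - pi) / sd_p_hat
print(norm.sf(z))        # upper-tail area, about 0.33 -- the 33% quoted above

# Cross-check with the exact binomial probability Pr(X >= 31) when X ~ Bin(100, 0.29);
# it is somewhat larger because the approximation above uses no continuity correction.
print(binom.sf(30, n, pi))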
Because of all these problems, the statistical hypothesis test procedure is structured as follows.
First:
the claim of interest will be called the alternative hypothesis and denoted HA. HA is always a
strict inequality involving a population parameter. (Some workers also call HA the research
hypothesis, but we will stick with the term "alternative hypothesis" in this course.) The alternative
hypothesis is often the claim that the researcher hopes to see supported by the experimental data or
evidence.
For example, if we were interested in establishing that the mean calcium concentration in unsanitized
salmon fillets was greater than 65 ppm, we would write
HA: μ > 65 ppm.
This is a statement about μ, the population mean. It is a strict inequality: "greater than", rather than "greater than or equal to". (This distinction between > and ≥ has no practical consequences, but it is a crucial distinction for the formalism being developed here.) The end result of the hypothesis test procedure in this case will be one of two statements:
• the evidence supports HA: μ > 65 ppm. … , or
• the evidence does not support HA: μ > 65 ppm. …
The "…" indicates that there are additional words required to complete these statements, but we need to
address a few more issues before they'll make sense.
In view of the discussion involving the vanilla ice cream example above, the decision over which of these
two conclusions is appropriate will result not from looking at the data to see whether or not it supports HA
directly, but by looking to see whether the data contradicts the opposite of HA. If the opposite of HA is
contradicted by the data, we will conclude HA is supported by the data. If the opposite of HA is not
adequately contradicted by the data, we will conclude there is no strong support for HA -- that is, the data is
inconclusive.
Second:
every alternative hypothesis will be matched with a so-called null hypothesis, H0. H0 is
always a statement of equality involving a population parameter. It is obtained by replacing the
inequality symbol in HA by an equals sign.
For example, with the alternative hypothesis
HA: μ > 65 ppm
we would pair the following null hypothesis
H0: μ = 65 ppm.
Now, in this specific example, the opposite of HA is the statement: μ ≤ 65 ppm, so it is not quite true to say that H0 is the opposite of HA. However, of all the situations in which HA is not true, the one described by H0 is the one which is closest to HA. We choose to write H0 as an equality because that simplifies the calculations or analysis detailed below. However, we also recognize that if the data contradicts H0 in favor of HA, then that data will even more strongly contradict any other situation contained in the opposite of HA. In determining whether the data convinces us to reject the possibility of μ ≤ 65 ppm in favor of μ > 65 ppm, it will be the situation μ = 65 ppm that will be most difficult to distinguish from μ > 65 ppm. If we can demonstrate that μ > 65 ppm is to be preferred over μ = 65 ppm, then we will have effectively demonstrated that μ > 65 ppm is to be preferred over μ ≤ 65 ppm (which is the desired goal of the whole procedure).
Don't worry if this seems to be getting a bit confusing. You may have to read through this document several
times before everything makes sense. What we are saying is that every hypothesis test procedure involves
a pair of hypotheses: an HA and an H0. HA is the claim we wish to evaluate. Our decision about HA will be
based on whether we find the data sufficiently contradictory to H0 or not. If you stick with us here, you'll
eventually see that by writing H0 as an equality as done above, we will get both the soundest possible
conclusion, and the clearest mathematical analysis.
The Decision Method
We now come to the core of the process: making the decision for or against HA.
To see this a bit more concretely, we will work with the SalmonCa0 example. The hypotheses are:

H0: μ = 65 ppm    vs.    HA: μ > 65 ppm      (IH - 1)
The final conclusion will have to rely on the value of x̄ observed for a random sample of all such salmon fillets.
We need to bring together several observations here. First, suppose H0 was exactly correct. Then, we know that for sample sizes of 30 or more, the random variable x̄ will be approximately normally distributed, with a mean of 65 ppm, and a standard deviation of σ/√n. Thus, the distribution of potential sample means is of the sort shown in the figure to the right.
[Figure: the sampling distribution of x̄ when H0: µ = 65 is true -- an approximately normal curve centered at 65 ppm.]
Secondly, ask yourself what sort of observed value of x̄ would tend to make you favor HA: μ > 65 ppm as opposed to H0: μ = 65 ppm? Clearly, if we observed a value of x̄ which was much, much larger than 65 ppm, we would feel quite comfortable in concluding that the evidence strongly supported HA as opposed to H0 in this case. So, in principle, we can devise a decision rule as follows.
Pick some critical value of x̄, which will presumably be some distance to the right of the value 65 ppm. If we observe a value of x̄ which is to the right of (or greater than) this critical value, we will reject H0 in favor of HA, and declare HA to be supported by the experimental data. If we observe a value of x̄ to the left of (or less than) this critical value, we declare the evidence to be inconclusive.
Thus, we picture setting up a situation along the lines shown in the figure to the right. The x̄-axis is divided into two parts:
• a part to the right of some critical value, which we shall call the rejection region. If we observe a value of x̄ which falls in this rejection region, we will take it as adequate evidence to reject H0, and so conclude that HA is supported.
• a part to the left of this critical value, which we might call the non-rejection region. If we observe a value of x̄ which falls in this non-rejection region, we consider the experiment inconclusive. We have not proven that HA is false -- rather, we've found that the evidence is inadequate to support HA.
[Figure: the sampling distribution of x̄ when H0: µ = 65 is true, with the x̄-axis split at the critical value of x̄: if the observed value of x̄ falls to the left of the critical value, the evidence is inconclusive; if it falls to the right, reject H0 in favor of HA.]
Notice that the terminology "rejection" and "non-rejection" refers directly to H0. Whatever conclusion we can
or cannot draw regarding HA is a side-effect of our conclusion (or lack of conclusion) with regard to H0.
• if the data allows us to reject H0, we regard this as equivalent to concluding that HA is supported. However, the only way we can determine whether HA is supported by the evidence is by finding that the evidence allows us to reject H0.
• if the data does not allow us to reject H0, we regard this as equivalent to finding that we have insufficient evidence to draw a conclusion with regard to HA. It does not mean that HA is false, or that the evidence contradicts HA. The most we can say is that there is not enough evidence to say anything definite about HA. (This is not to say that there is no possibility of HA being false -- but such a conclusion cannot be drawn from the way the present hypothesis test process is set up.)
If you talk about rejecting HA or supporting H0, you are confused and must rethink your statement. You may
decide to reject H0 and thereby imply that HA is supported, or you may decide that you can't reject H0, and
thereby imply that no conclusion one way or the other can be drawn from the data. But you cannot decide to
support H0 thinking that this means you've shown HA to be unsupported, nor can you decide to accept HA without reference to rejecting H0 -- either way, you are making an error in logic.
What remains is to come up with a reasonable way to decide how to compute the critical value of x̄ which
forms the boundary between the rejection region and the non-rejection region. In the present example, all
we know is that it should be some number quite a bit bigger than 65 ppm.
The figure just above displays the principle we can exploit here. The bell curve in the figure is the probability distribution curve for x̄ when H0 is true. Thus, the area under this curve corresponding to the rejection region (shaded in the figure) is the probability of observing a value of x̄ in the rejection region when H0 is true -- that is, the shaded area gives the probability that the data will result in us rejecting H0 when H0 is true. What is done then is to select the critical value of x̄ separating the rejection region from the non-rejection region so that the probability of making this sort of mistake (rejecting a true H0) is what we consider to be an acceptably small value.
Recall that there are two different errors possible in every hypothesis testing procedure. The error
highlighted in the figure above -- rejecting a true H0 -- is called a type 1 error. (The other potential error, to
fail to reject H0 when it should be rejected, is called a type 2 error.) The area of the shaded region in the
figure above, the probability of making a type 1 error, is called the level of significance of the hypothesis
test, and is conventionally represented by the Greek letter α. When a hypothesis test is carried out with a small value of α and we are able to reject H0, the result is said to have statistical significance,
meaning that there is a small probability that the conclusion obtained is a mistake. This is also why
hypothesis tests are often called tests of significance.
It is conventional to use α = 0.05 unless there are reasons explicitly justifying some other value. Thus, the critical value of x̄ is just what standard notation represents as x̄0.05, the value of x̄ which cuts off a right-hand tail of area α = 0.05. In the present example, we have a large sample: n = 40 > 30, and so x̄ is approximately normally distributed with a mean μx̄ = μ, the population mean (and since we start off by assuming H0 is true, we know that μ = 65 ppm), and a standard deviation σx̄ = σ/√n (and while we don't know what σ is here, we do know that s = 22.02 ppm should provide a rough estimate of σ for us). Thus, from our study of the normal distribution we know we can write

x̄0.05 = μx̄ + z0.05 σx̄ = 65 + 1.645 × (22.02/√40) = 70.73 ppm
This means that for this SalmonCa0 experiment and the hypotheses
H0: μ = 65 ppm    vs.    HA: μ > 65 ppm
if we reject H0 in favor of HA whenever we observe a sample of size 40 with a sample mean greater than
70.73 ppm, then the probability that such a conclusion is a mistake will be 0.05 or smaller.
In fact, the sample mean for the SalmonCa0 data is 74.28 ppm which is larger than the critical value 70.73
ppm. Thus, at a level of significance of 0.05, we may reject H0 here, and conclude that the experimental
evidence supports the conclusion that the mean calcium level in the unsanitized salmon fillets is greater than
65 ppm.
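The arithmetic just carried out can be reproduced with a few lines of Python (a sketch added here for convenience, not part of the original notes); scipy's norm.ppf supplies the value z0.05 = 1.645.

from math import sqrt
from scipy.stats import norm

mu0, x_bar, s, n, alpha = 65.0, 74.28, 22.02, 40, 0.05

sigma_x_bar = s / sqrt(n)                           # estimated standard deviation of x-bar
x_crit = mu0 + norm.ppf(1 - alpha) * sigma_x_bar    # critical value of x-bar, about 70.73 ppm

print(round(x_crit, 2))
if x_bar > x_crit:
    print("reject H0: the evidence supports HA: mu > 65 ppm at alpha = 0.05")
else:
    print("cannot reject H0: the evidence is inconclusive")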
While we stated the rejection criterion for the hypotheses (IH - 1) in terms of the critical value of x̄, we could also have stated it in terms of values of z, the corresponding standard scores:

x̄ > x̄0.05    is equivalent to    z > z0.05

where

z = (x̄ - μx̄)/σx̄
For the SalmonCa0 data, the value x̄ = 74.28 corresponds to the standard score

z = (x̄ - μx̄)/σx̄ = (74.28 - 65)/(22.02/√40) = 2.665
Since z = 2.665 > z0.05 = 1.645, we can reject H0 and conclude that the data supports HA: μ > 65 ppm. The two conditions: 74.28 > 70.73 and 2.665 > 1.645 are completely equivalent. In this context, we call z the standardized test statistic because it is a statistic (its value depends on the data observed for the sample through the values of x̄ and s), and its value determines the outcome of the hypothesis test. The rejection criterion (that is, the rule for deciding whether or not to reject H0) is stated in terms of the value of this test statistic: reject H0 in favor of HA if z > zα.
It turns out to be more convenient to write rejection criteria in terms of values of standard random variables such as z, t, etc. rather than in terms of the original sample statistics such as x̄, p, s, etc., and we shall adopt that approach through the remainder of this course.
We've now gone through one basic hypothesis test process in quite some detail with reference to a specific
set of hypotheses and a specific set of data. Now we need to rewrite the steps in a more condensed and
generic form so that it is clearer how the process can be applied to similar situations.
Summary of the Hypothesis Test Procedure for Large Samples When HA Contains ">"
We write the two hypotheses in the following generic form:
H0: μ = μ0    vs.    HA: μ > μ0      (IH - 2)

where μ0 stands for some specific number, and μ stands for the mean of the population of interest. We note that H0, the null hypothesis, is written as an equality, and that HA, the alternative hypothesis, is written as a strict inequality. The right hand sides of both hypotheses are (and must be -- why?) the same value, μ0.
If data for a large random sample is available, then we will be able to compute the value of the standard test statistic

z = (x̄ - μ0)/σx̄ = (x̄ - μ0)/(σ/√n) = (x̄ - μ0)/(s/√n)      (IH - 3)

(If σ is known, use the second-last expression; if σ is not known, use the last expression, which has s as a point estimate of σ.)
Then, H0 can be rejected in favor of HA at a level of significance α if we find that

z > zα      (IH - 4)

If this rejection criterion is met, we can state that the evidence supports HA at a level of significance α, meaning that there is a probability of α that this conclusion is mistaken.
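The recipe (IH - 2) to (IH - 4) is compact enough to package as a single function. The sketch below is one possible arrangement (the function name is my own, not from the notes); it uses s as a point estimate of σ, as in the last expression of (IH - 3).

from math import sqrt
from scipy.stats import norm

def large_sample_right_tailed_test(x_bar, s, n, mu0, alpha=0.05):
    """Return the standardized test statistic z of (IH - 3) and whether H0: mu = mu0
    can be rejected in favor of HA: mu > mu0 at level of significance alpha (IH - 4)."""
    z = (x_bar - mu0) / (s / sqrt(n))    # s used as a point estimate of sigma
    z_alpha = norm.ppf(1 - alpha)        # critical value cutting off a right-hand tail of area alpha
    return z, z > z_alpha

# Example 1 numbers: z = 2.665 > 1.645, so H0 is rejected.
z, reject = large_sample_right_tailed_test(x_bar=74.28, s=22.02, n=40, mu0=65)
print(round(z, 3), reject)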
What Do We Really Mean by "Statistical Significance?"
People often state that a conclusion or statement has "statistical significance", or that a "statistically
significant" effect has been observed. This phrase is shorthand for the statement, "the conclusion results
from rejection of the null hypothesis at a level of significance of 0.05" (or some other appropriate value of α).
We can ask what this really means at a somewhat deeper level however. Why is statistical significance
such an important or "significant" thing?
To be specific, look again at the hypotheses (IH - 1). We know that even if HA: μ > 65 ppm is not true, it may well happen that we observe a value x̄ which is larger than 65 ppm as a result of coincidences in the random sampling process. This is the reason why the observation x̄ > 65 ppm is not adequate evidence to conclude that μ > 65 ppm -- there's too great a chance that the observation x̄ > 65 ppm is a coincidental result of the random sampling process and not a reflection of a real property of the target population.
[Figure: sampling distributions of x̄ for µ < 65, µ = 65 and µ > 65, with the x̄-axis divided into the non-rejection region and the rejection region.]
By setting up the rejection criteria in the way outlined above, we limit the probability or likelihood of such a coincidence occurring. Coincidence may still act to cause a mistaken conclusion, but by setting up the rejection region to correspond to a right-tail area of the sampling distribution of x̄ when μ = 65 ppm, we are ensuring that for situations in which μ ≤ 65 ppm, the probability of coincidence resulting in the observed value of x̄ being in the rejection region is no bigger than α. You can see this in the figure just above to the right. The boundary between the non-rejection and the rejection region is located so that the shaded region in the right tail of the sampling distribution of x̄ when μ = 65 ppm is α. If μ < 65 ppm, the area in the rejection region right-hand tail of the sampling distribution of x̄ will be even smaller, so that we are even less likely to conclude μ > 65 ppm in such a situation. (On the other hand, the more the value of μ exceeds 65 ppm, the greater the area in the rejection region right-hand tail of the sampling distribution of x̄, and so the more likely we are to obtain data which will result in the correct conclusion that μ > 65 ppm.)
Thus, a statistically significant conclusion is a conclusion which is unlikely to be the result of coincidence (or,
as a statistician would probably say, the result of sampling error). Instead, it is a decision which has a high
probability of reflecting a true property of the target population.
What About Small Samples?
The formulas (IH - 3) and (IH - 4) apply to the situation (IH - 2) only when data is available from a sample of
size 30 or larger. However, the arguments leading up to these formulas are completely general. As a result,
when the so-called small sample situation applies:
• the sample size is less than 30,
• but the population is approximately normally distributed,
we know that all that is different in the work above is that x̄ is now t-distributed (with ν = n - 1 degrees of freedom), rather than being normally distributed. Thus, instead of computing the standardized test statistic in (IH - 3), we would calculate the standardized test statistic given by:

t = (x̄ - μ0)/(s/√n)      (IH - 5)

Then, H0 can be rejected in favor of HA at a level of significance α if

t > tα,ν      (IH - 6)
Example 2: Is the data in the BiotinDry data set adequate to support a claim that dry roasted peanuts
contain an average of more than 80 micrograms of biotin per 100 g serving?
Solution
Let μ stand for the mean biotin content of the population of dry roasted peanuts (in units of micrograms of biotin per 100 g of peanuts). The claim of interest about this population is that μ > 80. Thus, we must test the hypotheses:
H0: μ = 80
HA: μ > 80
Biotin concentrations in these units were obtained for 9 randomly selected specimens of dry roasted
peanuts:
58.70   78.20   78.00   91.40   80.90   88.40   96.10   97.40   104.80      (BiotinDry)
From these values, we get x̄ = 85.99 and s = 13.77. With a sample size of n = 9 (< 30), we are clearly in a
small sample situation. For the moment, let's assume that the data is not inconsistent with the population
being normally distributed (you've already seen a number of examples of how you could "test" that
assumption using a normal probability plot).
No level of significance is mentioned in the example, so we will use the usual α = 0.05. Thus, we can reject
H0 in favor of HA at a level of significance of 0.05 if we find that the test statistic, t, computed using formula
(IH - 5) satisfies:
t > t0.05, 8 = 1.860
But, plugging numbers into formula (IH - 5), we get

t = (x̄ - μ0)/(s/√n) = (85.99 - 80)/(13.77/√9) = 1.305
Since 1.305 is not greater than the critical value 1.860, we cannot reject H0 here. Thus, we must state the
result as: the data obtained is inconclusive on the issue of whether the mean biotin content of dry roasted
peanuts is greater than 80 micrograms/100 grams.
(Be very careful: it would be misleading to say that "the data contradicts the claim that the mean biotin content of dry roasted peanuts is greater than 80 micrograms per 100 g" -- such a conclusion would be interpreted by most people to imply that the experiment has shown the mean biotin content of these peanuts to be less than 80 micrograms per 100 grams. But such an implication is not supported by the data either -- you'll see shortly that had we tested the claim μ < 80, we would also have come up "inconclusive." That's why we say that the test result is "no conclusion" -- you can't say either that μ > 80 or that μ < 80 -- the data gives no "significant" information at all about the relationship between the value of μ and the quantity 80 micrograms/100 g here.)
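For completeness, here is a sketch (not part of the original notes) of the whole Example 2 calculation in Python, assuming the nine values listed above really are the BiotinDry sample.

from math import sqrt
from statistics import mean, stdev
from scipy.stats import t as t_dist

data = [58.70, 78.20, 78.00, 91.40, 80.90, 88.40, 96.10, 97.40, 104.80]
mu0, alpha = 80.0, 0.05
n = len(data)

x_bar = mean(data)                        # 85.99
s = stdev(data)                           # 13.77 (sample standard deviation, n - 1 in the denominator)
t_stat = (x_bar - mu0) / (s / sqrt(n))    # formula (IH - 5): about 1.305
t_crit = t_dist.ppf(1 - alpha, df=n - 1)  # t(0.05, 8) = 1.860

print(round(t_stat, 3), round(t_crit, 3))
if t_stat > t_crit:
    print("reject H0: the data supports HA: mu > 80")
else:
    print("cannot reject H0: the data is inconclusive")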

The p-value
The level of significance is a measure of how likely a conclusion drawn from a hypothesis test is to be a
mistake. Recall that the only way we can draw a conclusion is if we decide to reject H0, and so implicitly
conclude that HA is supported by the data. If (unknown to us, of course) this result is incorrect -- does not reflect an actual true statement about the population -- then we have made a type 1 error. When the hypothesis test is set up along the lines described already, the probability that any decision to reject H0 is in error is no greater than the level of significance. If we do not reject H0, then we cannot be risking a type 1
error at all and so the level of significance is not relevant to the soundness of our result. In this case,
though, there is a risk that we have made a type 2 error. We will discuss ways of evaluating this risk later.
The value of α = level of significance is chosen when the statistical experiment is being planned. Once the value of α is chosen (and almost always, it is chosen as 0.05), the only information to come out of the rest of the procedure is the conclusion "reject H0" or the non-conclusion "unable to reject H0." Thus, the choice of α = 0.05 in Example 1 above led to the rejection criterion "reject H0 if z > 1.645," and since the data gave z = 2.665 which is greater than 1.645, we rejected H0, drawing the implied conclusion about mean calcium content of unsanitized salmon fillets. In Example 2 above, the choice of α = 0.05 led to the rejection criterion
"reject H0 if t > 1.860." Since the data in that example gave t = 1.305, which is not greater than 1.860, we
had to declare the data inconclusive.
If someone asked us how confident we were of the conclusion in Example 1, we'd have to answer: "there's
no more than a 5% chance that this conclusion is wrong due to random sampling error." If someone asked
us why we didn't draw a conclusion in Example 2, we'd have to answer: "because to do so would run a
greater than 5% risk of drawing a wrong conclusion due to random sampling error." Both of these
responses are correct as far as they go, but it is possible to be a bit more definite about the probabilities of
the errors in these two cases. What our responses here don't tell the inquirer is how close to the 5% value
the actual probability of the errors is. In example 2, we would have had to give the same response even if
the data had led to t = 1.850, even though now our rejection criterion has been missed by just a little bit.
Similarly, in Example 1, we would have declared the same conclusion even if the data had given z = 1.655, even though again, the data giving z = 2.665 is much stronger support for HA than would be data giving z = 1.655.
An alternative way of stating the outcome of a hypothesis test has been gaining increasing popularity. First,
we define a new quantity called the p-value. To do this, you skip the step of choosing a value of α, and move straight to calculating the value of the standardized test statistic from the data. Then
p-value = area under the sampling distribution curve for the rejection region constructed using the
computed value of the standardized test statistic.
The p-value is thus a probability. In fact, it's the probability of making a type 1 error when you set up the
rejection region to be just big enough to allow you to reject H0. Then, the result of the hypothesis test is not a statement of "reject H0" or "cannot reject H0", but rather, "for these hypotheses, the p-value is …" It's then up to the recipient to decide (for their purposes) whether HA is supported or not.
Of course, if the p-value is much larger than 0.05, one would have to have a very good justification for acting
as if HA was supported.
The p-value approach is particularly enlightening in two situations. If the p-value turns out to be slightly
larger than 0.05 (say 0.055 or 0.06 or so), it gives the user the opportunity to take a slightly greater than
normal risk of error, but proceed on the conclusion that HA is supported. In the original approach, they
would simply have been told that the data does not support HA with no indication of how close it actually came to supporting HA. Since there are a number of approximations that go into the calculation of the
standardized test statistics, allowing for a bit of "fuzz" near the rejection criterion limits is a useful thing.
Very small p-values mean that there is very little likelihood that the conclusion is the result of coincidence in
random sampling. Although conclusions based on levels of significance of 0.05 are considered quite sound
for routine work, type 1 errors will still occur if you do enough hypothesis tests. On the other hand, decisions
based on p-values which are very small (say 0.001 or smaller) can pretty well be considered free of error for
practical purposes unless you are in a situation in which dozens of hypothesis tests are done every day.
(When the p-value of the tests is 0.05, you need to do 14 rejections of H0 before the chance of at least one type 1 error is greater than 50%. When the p-value of the tests is 0.001, you would need to do 693 rejections of H0 before the chance of at least one type 1 error exceeds 50%!)

Example 1:

The hypotheses in this case were

H0: μ = 65 ppm    vs.    HA: μ > 65 ppm

[Figure: the hypotheses H0: µ = 65 ppm, HA: µ > 65 ppm give a right-tailed rejection region under the standard normal curve, bounded by the value of the standardized test statistic computed from the sample data, z = 2.665; the p-value is the area of this tail.]

indicating the rejection region is a right-hand tail. From the data, we computed the standardized test statistic to be z = 2.665. Thus, the rejection region determined by the computed value of the standardized
test statistic in this case is the region z > 2.665 under the standard normal distribution (shown in the figure).
We can compute the area of this region using our standard normal probability tables:
p-value = Pr(z > 2.665) ≈ 0.5 - Pr(0 ≤ z ≤ 2.66) = 0.5 - 0.4961 = 0.0039
Here, we rounded the value of z down to two decimal places to match our probability tables; we rounded down to avoid any hint of over-optimism. From this calculation, we can state our conclusion something like
"The p-value for the test of  > 65 ppm is 0.0039." The listener familiar with concepts of statistical
hypothesis testing would know that this means that the probability of being incorrect in taking the course of
action implied by  > 65 ppm is very small -- less than one chance in 250, and so such action can be taken
with considerable confidence.
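The same p-value can also be obtained without tables. The following sketch (not part of the original notes) uses scipy's upper-tail function instead of a table lookup, so it does not round z down to 2.66 first.

from scipy.stats import norm

z = 2.665
p_value = norm.sf(z)       # Pr(z > 2.665), the upper-tail area
print(round(p_value, 4))   # about 0.0038; the table calculation above gave 0.0039 after rounding z down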

Example 2:
In this case, the hypotheses
H0: μ = 80
HA: μ > 80
again give rise to a right-hand tail rejection region. The data gave the value of the standardized test statistic to be t = 1.305 with 8 degrees of freedom. Thus, we have

p-value = Pr(t > 1.305, ν = 8)
Unfortunately, the usual t-tables cannot be used to calculate this probability very precisely. Looking at the elements of row ν = 8 in the t-table which bracket the value 1.305, we find that

Pr(t > 1.108) = 0.15    and    Pr(t > 1.397) = 0.10

Thus, it appears that Pr(t > 1.305) is a value between 0.15 and 0.10, and probably somewhat closer to 0.10 than to 0.15. Thus, the best we can do is to say that for these hypotheses, the p-value is greater than 0.10 (which is not very close to 0.05) but is less than 0.15. Because we have a lower estimate of the p-value at 0.10 which is considerably larger than 0.05, few practitioners would feel comfortable in acting as if HA had been supported by the data.

[Figure: areas of right-hand tails of the t-distribution with ν = 8; the tail area bounded by t = 1.305 must lie between the area 0.15 bounded by t = 1.108 and the area 0.10 bounded by t = 1.397.]
(Incidentally, if you have access to MS Excel/97, you could use the function call "=TDIST(1.305,8,1)" to get the more precise result: p-value = Pr(t > 1.305, ν = 8) = 0.1141, if such precision was important.)
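The equivalent calculation in Python (a sketch, assuming scipy is available) is:

from scipy.stats import t as t_dist

p_value = t_dist.sf(1.305, df=8)   # Pr(t > 1.305, nu = 8)
print(round(p_value, 4))           # about 0.114, matching the Excel result 0.1141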

One-Tailed Hypothesis Tests
So far, we have looked only at hypotheses in which HA contains a ">". We explained why it makes sense to
set up a rejection region of the form

reject H0 if    standardized test statistic > critical value

where the "critical value" was selected to cut off a right-hand tail area of the standardized probability distribution of some selected area α, which we called the level of significance of the test.
Similar logic applies to testing the hypotheses:

H0: μ = μ0    vs.    HA: μ < μ0      (IH - 7)

Now, it would be the observation of a value of x̄ which is very much less than μ0 which would make us favor rejecting H0 in favor of HA. But, x̄ very much smaller than μ0 corresponds to z (or t) very much smaller than 0. Thus, the rejection/non-rejection region in this case is as shown in the diagram to the right (drawn for the large sample case -- the same diagram would apply to the small sample case if the symbol z is replaced by t).

[Figure: the standard normal curve that applies when H0: µ = µ0 is true; if the observed value of z falls to the left of z = -zα (a left-hand tail of area α), reject H0 in favor of HA; if it falls to the right, the evidence is inconclusive.]

By rejecting H0 in favor of HA in (IH - 7) when the computed value of z turns out to be smaller than the value of -zα, the probability that such a rejection of H0 is a mistake due to sampling error will be α or smaller.
Since the rejection region here again corresponds to one tail of the sampling distribution, this is again called
a one-tailed hypothesis test.
To calculate the p-value for the hypotheses (IH - 7), you would first sketch the left-tail rejection region
bounded by the actual value of the standardized test statistic. Then the p-value is just the probability
associated with that rejection region.
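As a sketch of how (IH - 7) plays out in practice, the following Python fragment (the numbers are invented purely for illustration, not taken from any of the standard data sets) computes the standardized test statistic, the left-tailed rejection check, and the left-tail p-value.

from math import sqrt
from scipy.stats import norm

x_bar, s, n, mu0, alpha = 61.2, 22.02, 40, 65.0, 0.05   # hypothetical sample figures

z = (x_bar - mu0) / (s / sqrt(n))    # standardized test statistic
z_crit = -norm.ppf(1 - alpha)        # reject H0 if z < -z_alpha = -1.645
p_value = norm.cdf(z)                # left-tail area bounded by the computed z

print(round(z, 3), z < z_crit, round(p_value, 4))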
Two-Tailed Hypothesis Tests
There is one more type of HA to be considered, namely one containing the inequality "≠". This leads to the hypothesis test based on

H0: μ = μ0    vs.    HA: μ ≠ μ0      (IH - 8)
for example. Such a hypothesis test would arise
whenever we are simply trying to demonstrate
that two populations are different, without
knowing (or perhaps without caring) which has
the larger mean value and which has the smaller
mean value.
In this case, H0 is contradicted by values of x̄ which are either much larger than μ0 or by values of x̄ which are much smaller than μ0. The situation is shown in the diagram to the right. The rejection region now corresponds to two tails of the sampling distribution, hence the term "two-tailed test".

[Figure: the sampling distribution of x̄ when H0: µ = µ0 is true, centered at µ0; reject H0 if x̄ falls in either tail, do not reject H0 if x̄ falls in the central region.]
There is no fundamental reason why the two tail intervals marked in the figure should have the same length, or equivalently (since both the normal and the t-distributions are symmetric about the mean) why the two rejection regions should correspond to the same area. However, it is very rare for statisticians not to preserve this symmetry of the underlying sampling distribution. It is also useful to continue to use the symbol α to represent the level of significance of the hypothesis test -- the total area under the sampling distribution density curve corresponding to the rejection region. As a result, each of the two rejection regions must correspond to an area of α/2 (for a total area of α/2 + α/2 = α). Thus, when transformed to the standard normal picture, the null hypothesis in (IH - 8) should be rejected as indicated in the figure to the right.
[Figure: the standard normal curve that applies when H0: µ = µ0 is true, with rejection regions of area α/2 in each tail: reject H0 if z < -zα/2 or z > zα/2, do not reject H0 otherwise.]

Calculating the p-value for a two-tailed hypothesis test is not much more complicated than for a one-tailed hypothesis test. The computed value of the standardized test statistic is used as the boundary of one of the two identical tails. (Obviously, if the value of the standardized test statistic is positive, it is at the left edge of the right-hand tail, whereas if the value of the standardized test statistic is negative, it is at the right edge of the left-hand tail.) Calculate the area under the standard probability density curve for that single tail, and double the result to get the p-value.
Example 3: Refer to the standard data sets entitled JonApples1. Does the data given there support the
claim that the mean weight of the apples in the first harvest is different from 210 g? Carry out the
appropriate hypothesis test using a level of significance of 0.05, but also state the p-value for the test.
Solution:
The hypotheses to be tested here are
H0: μ = 210 g    vs.    HA: μ ≠ 210 g
A sample of n = 60 apples gave a mean, x̄, of 219.73 g and a standard deviation, s, of 42.88 g. Since n = 60 ≥ 30, we are dealing with the large sample case, and so the standardized test statistic obtained is

z = (x̄ - μ0)/(s/√n) = (219.73 - 210)/(42.88/√60) = 1.76
Now, the rejection region is two-tailed here. For a level of significance of α = 0.05, we have α/2 = 0.025, and so require zα/2 = z0.025 = 1.96. Thus, we may reject H0 in favor of HA if either

z = 1.76 > 1.96 (= z0.025)    or    z = 1.76 < -1.96 (= -z0.025)

Since neither of these conditions is satisfied, we cannot reject H0 here at a level of significance of 0.05. The data is inconclusive with regard to the question of whether μ ≠ 210 g or not.
The computed value of the standardized test statistic is z = 1.76. So, one of the two equal-area parts of the rejection region for purposes of calculating the p-value is the region corresponding to z > 1.76 (with the other part being z < -1.76). Thus, for these hypotheses

p-value = 2 Pr(z > 1.76) = 0.0784.

[Figure: the standard normal curve with the two rejection-region tails bounded by z = -1.76 and z = 1.76; the p-value is the combined areas of the two tails.]

Thus, if you decide to reject H0 on the basis of the given data (and so conclude that the mean weight of these apples is different from 210 g), then there is a 7.84% chance that that conclusion will be a mistake due to sampling error.
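The two-tailed arithmetic in Example 3 can be sketched in Python as follows (using the summary figures quoted above; small differences in the last decimal place come from not rounding z to 1.76 first).

from math import sqrt
from scipy.stats import norm

x_bar, s, n, mu0, alpha = 219.73, 42.88, 60, 210.0, 0.05

z = (x_bar - mu0) / (s / sqrt(n))    # about 1.76
z_half = norm.ppf(1 - alpha / 2)     # z(alpha/2) = 1.96
reject = abs(z) > z_half             # two-tailed rejection criterion
p_value = 2 * norm.sf(abs(z))        # double the single-tail area: about 0.078

print(round(z, 2), reject, round(p_value, 4))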

Type 1 and Type 2 Errors
The outcome of a hypothesis test procedure is always one of two possibilities:
• reject H0, thereby implying that the data supports HA
• do not reject H0, thereby implying that the data is inconclusive as far as the claim HA is concerned. We need to be a bit careful with language here: this outcome in no way means that HA has been disproven or that the opposite of HA has been proven (though it may be possible to do that by testing a new set of hypotheses in which the alternative hypothesis is the opposite claim), but that we simply do not have adequate evidence to consider HA supported.
In a particular situation, the outcome of the test procedure can be a correct one or an incorrect one, in the sense of reflecting the true state of the population under study or not. Thus,
• rejecting H0 is a correct decision when HA is a true statement about the population. When HA is not a true statement about the population, rejecting H0 is a mistake, called a type 1 error. (People often phrase this error as "rejecting a true H0", though this language is not accurate. H0 is a "straw-man" statement about the population -- it is a stand-in for the opposite of HA, used because of two important features: (i) it is the instance of "HA not true" which will be most difficult to distinguish from "HA true", and (ii) it simplifies calculations, since it states a specific characteristic of the population. Not rejecting H0 in no way means that H0 itself accurately reflects the state of the population -- in no way does it mean that H0 is true. And, in fact, if you've understood the discussion so far, not rejecting H0 in no way means that the opposite of HA is a true statement about the population. Not rejecting H0 is a statement that the data is ambiguous as far as the given hypotheses are concerned. Anyway, a type 1 error is more accurately described as the error of rejecting H0 when HA is not true, or the error of supporting a false HA.)
• not rejecting H0 is a correct outcome when HA is not a true statement about the population. When HA is a true statement about the population, failing to reject H0 is a mistake, called a type 2 error. Since HA is a definite statement about a property or characteristic of the population, a type 2 error is an error of not recognizing some actual characteristic of the population (whereas a type 1 error is an error of concluding the population has a characteristic which it actually does not have).
Consider a quick example to illustrate these ideas. Folic acid has recently been promoted as a significant
factor in the prevention of cardiovascular disease (and many other serious health problems). Suppose you
have genetically engineered a new variety of bean which you hope to be able to claim contains a mean of
more than 200 μg of folic acid per 100 g dry weight. From a hypothesis testing point of view, you need to
test the hypotheses:
H0: μ = 200 μg/100 g    vs.    HA: μ > 200 μg/100 g
(Obviously, here μ stands for the mean folic acid content of the population of all beans of this variety, measured in units of μg/100 g.) Making a type 1 error means rejecting H0 when HA is not true. This amounts to you concluding that the beans do contain more than 200 μg of folic acid per 100 g when that is
not true. Such a conclusion may result in legal action or bad publicity against you or your company when
buyers of these beans later find that they are not the good source of folic acid that you claimed them to be.
On the other hand, making a type 2 error here would be concluding that HA is not supported, when in fact
the beans do average more than 200 μg of folic acid per 100 g dry weight. In this case, you really would
have a superior natural source of folic acid, but due to sampling error you would overlook that fact. As a
result, you may abandon attempts to sell or promote the use of these beans, with the consequence of you
losing potential business or income from that activity, and the world losing access to a valuable nutritional
resource.
Notice that in both of these scenarios, a course of action results from the outcome of the hypothesis test. In
this example, the action is either to market the beans or abandon plans to market the beans. Either course
of action taken on the basis of a mistaken hypothesis test outcome has associated costs of various sorts
(not necessarily just monetary costs -- you can have costs in lost opportunity, lost health benefits, etc.).

It is conventional to use the symbol α to represent the probability of making a type 1 error, and the symbol β to represent the probability of making a type 2 error. That is, by definition

α = Pr(make a type 1 error)    and    β = Pr(make a type 2 error)      (IH - 9)
Because the hypothesis test procedure is set up so that we start by assuming H0 is true, the value of α (or an upper bound to the value of α) can be computed. In fact, the rejection criteria are formulated based on the desired value of α. This is the reason for writing H0 as an equality standing in for the opposite of HA. The value of α is called the level of significance of the hypothesis test. When a definite conclusion is obtained, the level of significance is the probability that that conclusion is erroneous. If this value of α is small, it means that the conclusion has little likelihood of being wrong, and hence it is a "significant" conclusion.
The type 2 error occurs when we fail to detect a true property of the population. As a counterpart to the
notion of a level of significance, the quantity 1 - β is called the power of the test. It is the likelihood that a
true property of the population will be detected. Tests with high powers are likely to detect true properties of
populations.
Unfortunately, because the formalism focuses so directly on the probability of making a type 1 error, we are unable to say much with the same degree of detail about the value of β. The level of significance is defined to be the probability of rejecting H0 when H0 is a true statement about the population. If H0 is a true statement about the population, we have enough information to calculate this probability. On the other hand, a type 2 error can only occur when HA is a true statement about a population. But HA, being an inequality, does not contain enough information to compute probabilities. For instance, you can calculate Pr(x̄ > 200) if you are told that μ = 150 (and you have an estimate of σ), but you cannot calculate this probability if you are told only that μ > 150.
Of course, the reason why we cannot compute a specific value of β for a set of hypotheses is that the value of β, the probability of making a type 2 error, depends on what the true state of the population is. However, we can calculate the probability of making a type 2 error for specific hypothetical states of the population. We can illustrate this with a couple of simple examples.
Example 1: This is the example we've used a couple of times above already, involving the hypotheses

H0: μ = 65 ppm    vs.    HA: μ > 65 ppm      (IH - 1)

where μ is the mean calcium concentration in unsanitized salmon fillets. Compute the probability of making a type 2 error when these hypotheses are tested with α = 0.05 and the true value of μ is 70 ppm. You may assume that s = 22.02 ppm is an acceptable estimate of σ for this population.
Solution
The rejection region (and rejection criterion) are computed from the hypotheses and the required value of α. We've already calculated that the critical value of x̄ is

x̄0.05 = μx̄ + z0.05 σx̄ = 65 + 1.645 × (22.02/√40) = 70.73 ppm
Thus, we reject H0 in favor of HA if the sample mean is greater than 70.73 ppm. If the sample mean is smaller than 70.73 ppm, we will not reject H0. Thus, the line dividing the rejection region from the non-rejection region is located to cut off a right-hand tail of area 0.05 under the sampling distribution of x̄ that arises assuming μ = 65 (that is, assuming H0 is true).
[Figure: sampling distributions of x̄ for μ = 65 ppm (solid) and μ = 70 ppm (dashed); the critical value x̄ = 70.73 ppm (its location determined by the chosen value of α) separates the "do not reject H0" region from the "reject H0" region, cutting off a right-hand tail of area α = 0.05 under the μ = 65 curve; the area β lies under the μ = 70 curve to the left of 70.73 ppm.]
But, if the population mean is actually 70 ppm, the sampling distribution of x̄ is the curve shown with a dashed line in the figure: the shape is the same as for μ = 65, but the center of the distribution is shifted rightwards to x̄ = 70 ppm. If μ = 70, then the failure to reject H0: μ = 65 is a type 2 error. In the figure, the probability of not rejecting H0 when μ = 70 is just the area under the μ = 70 density curve corresponding to the non-rejection region:
β(μ = 70 ppm) = Pr(x̄ ≤ 70.73 ppm when μ = 70 ppm) = Pr(z ≤ (70.73 - 70)/(22.02/√40)) = Pr(z ≤ 0.21) = 0.5832
This is quite a large probability. What it says to us is that with the hypothesis test set up as specified (that is, so that α = 0.05 for the hypotheses (IH - 1), assuming a sample of size 40 is used, and that 22.02 ppm is an acceptable estimate of σ), there is almost a 60% chance that H0 will not be rejected even if the true population mean is 70 ppm.
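If you want to check this arithmetic yourself, the following minimal Python sketch (not part of the original notes; it assumes the scipy library is available, and the variable names are our own) reproduces the critical value and β(μ = 70) calculation of Example 1.

# A minimal sketch reproducing Example 1 with scipy (our own code, not the notes').
from scipy.stats import norm

mu0, mu_true = 65.0, 70.0      # hypothesized and (hypothetical) true means, ppm
sigma, n = 22.02, 40           # estimated population std. dev. and sample size
alpha = 0.05

se = sigma / n ** 0.5                           # standard error of the sample mean
x_crit = mu0 + norm.ppf(1 - alpha) * se         # cuts off a right-hand tail of area alpha
beta = norm.cdf(x_crit, loc=mu_true, scale=se)  # Pr(do not reject H0 | mu = mu_true)

print(f"critical value = {x_crit:.2f} ppm")     # about 70.73 ppm
print(f"beta(mu = 70)  = {beta:.4f}")           # about 0.58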
The figure above illustrates a number of important features of "error control", which we will amplify
considerably before we're done with this whole topic:
(i) The value of β will decrease as the true mean value of the population increases. Picture sliding the dashed curve in the figure further rightwards, representing a true mean value, μ, which is say 75 ppm, or 80 ppm, etc. The area under the shifted dashed curve corresponding to the non-rejection region will decrease, because the size of the tail to the left of the critical value of x̄ gets smaller. Thus, while there is a high probability here of making a type 2 error when μ = 70 ppm or thereabouts, if the true value of μ were even larger, this probability of making a type 2 error drops considerably. (For instance, β(μ = 85) = 0.00002, a very acceptable value! The figure to the right shows a plot of β as a function of the true value of μ -- curves representing this sort of information are called operating characteristic curves -- and you can see how the value of β starts off very near 1 for μ = 65 or smaller, but drops off in a fairly characteristic reverse S-shape as μ increases, becoming essentially negligible in this case (that is, less than 0.05) by the time μ ≈ 77 ppm or so.)
[Figure: "Operating Characteristic Curve" for Example 1 -- Pr(type 2 error) plotted against the true value of the population mean, for true means from 60 to 85 ppm.]
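The operating characteristic curve is easy to trace numerically. The sketch below (again our own Python code, not from the notes, assuming scipy) simply repeats the β calculation over a grid of hypothetical true means.

# Tracing the operating characteristic curve for Example 1: beta as a function of the true mean.
from scipy.stats import norm

mu0, sigma, n, alpha = 65.0, 22.02, 40, 0.05
se = sigma / n ** 0.5
x_crit = mu0 + norm.ppf(1 - alpha) * se        # about 70.73 ppm

for mu_true in (65, 67.5, 70, 72.5, 75, 77, 80, 85):
    beta = norm.cdf(x_crit, loc=mu_true, scale=se)
    print(f"mu = {mu_true:5.1f} ppm  ->  beta = {beta:.5f}")
# beta falls from about 0.95 at mu = 65 toward 0.00002 at mu = 85,
# the reverse S-shape described above.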
(ii) The concern with the figure above is that while the value of α is quite acceptable, the value of β(μ = 70) seems to be way too large to be acceptable. The only way to reduce this value of β while keeping the shape of the sampling distribution curves as they are would be to shift the boundary between the rejection and non-rejection regions leftwards. For instance, if we decided to shift that boundary to x̄ = 68 ppm, as shown in the figure to the right, the area β does indeed decrease quite substantially, to a value of about half what it was when we used the critical value x̄ = 70.73 ppm. Of course, you see immediately what's wrong with this approach to reducing the value of β(μ = 70) -- by moving the boundary between the rejection and non-rejection regions leftwards to reduce the value of β, we've simultaneously increased the value of α, the probability of making a type 1 error. In fact, using x̄ = 68 ppm as the critical value has increased the probability of making a type 1 error by a factor of nearly four, to 0.1949.
[Figure: sampling distributions centered at 65 and 70 ppm with the boundary between "do not reject H0" and "reject H0" moved to x̄ = 68 ppm; the resulting tail areas are β = 0.2843 and α = 0.1949.]
What you see from this is that if all other aspects of the experiment remain the same (essentially those things which determine the shape of the sampling distribution: the value of σ ≈ s, and the sample size n), you cannot do anything to reduce the values of both α and β at the same time. The only freedom you have left is to move the boundary between the rejection and non-rejection regions either leftwards or rightwards. Such a change will result in one of α or β becoming smaller, but inevitably the other will increase in value.
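This trade-off is easy to see numerically. A hedged sketch (our own Python code, assuming scipy and the Example 1 numbers) compares the two choices of critical value discussed above:

# Moving the critical value from 70.73 ppm to 68 ppm shrinks beta but inflates alpha.
from scipy.stats import norm

mu0, mu_true, sigma, n = 65.0, 70.0, 22.02, 40
se = sigma / n ** 0.5

for x_crit in (70.73, 68.0):
    alpha = 1 - norm.cdf(x_crit, loc=mu0, scale=se)   # Pr(reject H0 | H0 true)
    beta = norm.cdf(x_crit, loc=mu_true, scale=se)    # Pr(do not reject H0 | mu = 70)
    print(f"critical value {x_crit:.2f} ppm: alpha ~ {alpha:.4f}, beta ~ {beta:.4f}")
# Output: roughly alpha = 0.05, beta = 0.58 at 70.73 ppm,
#         but alpha = 0.19, beta = 0.28 at 68 ppm.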
(iii) Finally, you can see that the only way of reducing the value of β without simultaneously increasing the value of α is to narrow the sampling distributions themselves. If we do that, then the tails of the two distributions falling on the "bad" side of the boundary between the rejection and non-rejection regions will decrease in size. Since the width of these distributions is determined by the value of σ and n, and we have no control over σ (it is a characteristic of the population), our control must be through increasing the sample size n appropriately. We'll go into the specifics of estimating an appropriate sample size in later sections of the course. For the moment, the point here is that the only way to exercise simultaneous control over the probabilities of both types of errors is through use of an adequately large sample size. For a fixed sample size, you cannot simultaneously reduce both error probabilities to some desired value.
[Figure: narrower sampling distributions centered at 65 and 70 ppm; the boundary between "do not reject H0" and "reject H0" again cuts off a right-hand tail of area α = 0.05 under the μ = 65 curve, leaving a much smaller area β under the μ = 70 curve.]
In the figure just above, the effect of increasing the sample size is seen in the two distribution
curves becoming narrower and higher at the center. The narrowing of the distribution for
 = 65 ppm results in the boundary between the rejection and non-rejection region shifting
leftwards so that  remains at 0.05. Thus, the value of  is reduced due to the combination of
two effects: the boundary between the rejection/non-rejection regions shifts leftwards, and the
distribution curve for  = 70 ppm clusters more tightly about the location 70 ppm on the
horizontal axis, reducing the area in any fixed left-hand tail.
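The same effect can be verified numerically. The short sketch below (our own Python code, assuming scipy, α = 0.05, σ ≈ 22.02 ppm, and a true mean of 70 ppm as in Example 1) shows β shrinking as n grows while α stays fixed.

# Larger sample sizes reduce beta while alpha is held at 0.05.
from scipy.stats import norm

mu0, mu_true, sigma, alpha = 65.0, 70.0, 22.02, 0.05

for n in (40, 100, 200, 300):
    se = sigma / n ** 0.5
    x_crit = mu0 + norm.ppf(1 - alpha) * se           # boundary shifts leftwards as n grows
    beta = norm.cdf(x_crit, loc=mu_true, scale=se)    # and the mu = 70 curve narrows
    print(f"n = {n:3d}: critical value {x_crit:.2f} ppm, beta ~ {beta:.3f}")
# beta drops from about 0.58 at n = 40 to about 0.01 at n = 300.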
How Do You Decide Which HA to Use?
Actually, let's back up one step. How do you know when to use hypothesis testing instead of
constructing a confidence interval estimate? Here, you look for key words in the request. If words like
"estimate", "predict," or "calculate," are present, almost certainly you are being asked to compute a
confidence interval estimate of the indicated population parameter. On the other hand, if you see phrases
such as "is this adequate evidence…" or "does the data support the claim …", and similar, you are being
asked to test hypotheses.
Once you've decided a hypothesis test is required, the next issue is to formulate the hypotheses in the test. Always start by working out the appropriate HA, since you can get H0 from HA by simply replacing the inequality symbol by an equality symbol. There are several clues and principles that one may rely on here.
•  Most often, the claim to be tested is stated quite explicitly as an inequality in words. This is the appropriate HA. For instance, to respond to the question "Is this data evidence in support of the claim that the mean folic acid content is more than 200 μg/100 g?", you know immediately that you must use "HA: μ > 200 μg/100 g".
•  The type of inequality is indicated by various key relational words. Words such as "different" and its synonyms indicate that a two-tailed hypothesis test is appropriate. Words and phrases such as "greater than," "exceeds," "is bigger," etc. (and their counterparts "less than," etc.) point to a one-tailed test.
There are some more subtle issues that many authors raise, though it may be difficult for you to understand
their significance if you are new to the subject of hypothesis testing. We mention two of them briefly here,
and may return to them from time to time later in the course.
Some authors caution against using one-tailed hypothesis tests except for situations in which you have a
scientific principle that justifies a one-tailed relationship. Thus, unless we have an independent physical,
chemical, or biological principle that indicates the appropriate relationship is a "greater than" or a "less than"
type, they would recommend all tests be done as two-tailed tests. The point here is subtle, but important. It
is considered bad form in statistics to allow the data to determine what kind of analysis you carry out, since
that tends to promote error -- it's a bit like shooting the arrow first, and then drawing the target. If the only
reason you have for using a one-tailed test is because the data seems to indicate that the value of the
population parameter is greater than some number or less than some number, then you would not be
justified in using a one-tailed test. The downside of this advice (though, if you think about it, it may really be
an upside!) is that two-tailed tests are more rigorous -- you need to have much clearer data to be able to
reject H0 in a two-tailed test, all other things being equal, than you need for a one-tailed test.
A second principle (which we'll look at in a bit more detail in the next document in this series) revolves
around the issue sometimes called burden of proof. (We'll explain this term later as well.) The hypothesis
test procedure is formulated so that we have direct control over the probability of making a type 1 error, but
relatively less control or information about the probability of making a type 2 error. In situations in which one
of the two possible errors that could be made is particularly serious, you may wish to formulate HA so that
this more serious error becomes the type 1 error. Then, you can control the probability of making this more
serious error.
For example, suppose that a certain pesticide is known to harm people in average concentrations of more
than 5 ppb (parts per billion) in, say, apples. If the fruit does contain a mean concentration of more than 5
ppb of the pesticide, it should not be sold as whole fruit. It can be reprocessed into other products in which
the mean pesticide residue concentration can be reduced to a safe level. However, because of higher
production costs, these alternative products generate lower overall profits, so the company would prefer to
sell the apples as whole fruit. The decision to release a crop of apples for sale to consumers might be
based on a test of hypotheses involving μ, the mean concentration of the pesticide in the population of these apples, and the quantity 5 ppb. So immediately, we know that the null hypothesis must be
H0: μ = 5 ppb.
The question is: what should we use for the alternative hypothesis? We know that there are only two
reasonable possibilities here, since a two-tailed test would not make much sense (why?):
HA: μ > 5 ppb     or     HA: μ < 5 ppb
In detail, the consequences of the two possible choices of hypotheses are:
Hypotheses: H0: μ = 5 ppb vs. HA: μ > 5 ppb
  action if H0 is rejected: conclude that the mean pesticide concentration exceeds safe limits, and so the apples are not released for sale as whole fruit; instead they are reprocessed into other less profitable products.
  action if H0 is not rejected: recognize that there is insufficient evidence to conclude that the mean pesticide level in the apples exceeds safe limits, and so release the apples for sale as whole fruit; there is no proof that the apples are dangerous.
  mistaken action resulting from the type 1 error: apples are withheld from sale as whole fruit when they are safe; some potential profit is lost.

Hypotheses: H0: μ = 5 ppb vs. HA: μ < 5 ppb
  action if H0 is rejected: conclude that the mean pesticide concentration is less than the safe limits, and so the apples are released for sale as whole fruit.
  action if H0 is not rejected: recognize that there is insufficient evidence to conclude that the mean pesticide concentration in the apples is less than the safe limit, and so probably do not release the apples for sale; there is no proof that the apples are safe.
  mistaken action resulting from the type 1 error: apples are released for sale even though the mean pesticide concentration in them exceeds safe levels; consumers will be harmed, and perhaps the producer will be legally liable for this harm.
The last entry for each set of hypotheses describes the mistaken action that will result from a type 1 error in the corresponding hypothesis test. The producer would have to decide which of these two errors is most serious, and then base the decision of whether to release the apples for sale as whole fruit on the outcome of the test of the corresponding hypotheses.
Notice that the first set of hypotheses will result in the release of the apples for sale as whole fruit as long as
there is no definite proof that they contain harmful concentrations of the pesticide. The second set of
hypotheses will result in the apples being released for sale as whole fruit only if there is definite proof that
they are safe. When worded in this way, you can see that the distinction between the two approaches is not
all that subtle. In one case, the producers sell the apples only if they are quite certain the apples are safe; in
the other case the producers sell the apples unless they are quite certain the apples are harmful.
Statistical Significance vs. Practical Significance
We need to raise one more general issue before leaving this introduction to hypothesis testing. Recall that
we view a decision based on a hypothesis test as "statistically significant" if the value of the standardized
test statistic falls within the rejection region. This means that the likelihood of the conclusion being a
mistake has been controlled.
The situation for tests of hypotheses involving the population mean exhibits a common feature of hypothesis
tests. If we assume the use of a large sample, then the formula for the standardized test statistic is
z = (x̄ - μ0) / (s/√n) = ((x̄ - μ0)/s) · √n        (IH - 10)
Now, μ0 is a value fixed in the statement of the hypotheses. x̄ and s are properties of the sample which estimate fixed properties of the population (μ and σ, respectively), and so if we increase the sample size n, we don't expect the values of x̄ and s to change much (though of course, being a bigger and thus different random sample, the values we get for x̄ and s are unlikely to be exactly equal to those for the initial smaller sample). Thus, in increasing the sample size, n, we expect that the value of the factor
(x̄ - μ0)/s
in the formula for z may change slightly, but probably not much. But as n increases, so does the value of √n, and so inevitably the value of z given by formula (IH - 10) will increase simply because n is larger. In this way, we can actually make z as large as desired by simply taking a large enough sample. But large values of z correspond to rejection of H0. Thus, even if μ is only very slightly larger than μ0, we may be able to satisfy the rejection criterion for H0 by choosing a very large sample size.
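To make this concrete, here is an illustrative sketch (our own Python code and invented numbers, not from the notes) showing how z grows with √n even when the sample mean barely differs from the hypothesized value μ0 = 65.

# z grows with sqrt(n) for a fixed, tiny difference between x-bar and mu0.
from math import sqrt

mu0, x_bar, s = 65.0, 65.001, 22.02   # suppose every sample gives roughly these values

for n in (40, 10_000, 1_000_000, 2_000_000_000):
    z = (x_bar - mu0) / (s / sqrt(n))
    print(f"n = {n:>13,}: z = {z:7.3f}")
# z only exceeds the alpha = 0.05 critical value of 1.645 for an absurdly large n,
# even though a 0.001 ppm difference has no practical significance.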
This may sound as if we can "prove" any conclusion we like through hypothesis testing just by taking large
enough samples, and so cast doubt on the value of the procedure. And that's almost true. However, think
about what it means to reject H0 when testing
H0: μ = 65   vs.   HA: μ > 65
Rejecting H0 just means we have evidence that supports the conclusion that μ > 65. This conclusion is just as true if μ = 65.001 as it is if μ = 165. However, if μ = 65.001 is the true state of affairs, it will likely take an exceedingly large sample to be able to reject the H0 above, whereas even a relatively small sample should result in rejection of H0 when μ = 165. Thus, while taking a large enough sample might result in the difference between μ = 65 and μ = 65.001 becoming statistically significant, in reality the distinction may have little practical significance. (Would you advise a company to adopt a new way to process salmon fillets if the new way resulted in a mean calcium content of 65.001 ppm rather than the mean 65.000 ppm that the current method gives? Probably not. But if calcium content was an important issue, then perhaps switching to a new method that increased the mean calcium content from 65 ppm to 165 ppm would be worth considering. It is a difference of practical significance here.)
What this means is that there are properties of populations that can be demonstrated with statistical
significance by taking very large samples, but which have little practical significance. It is a caution against
increasing sample sizes arbitrarily just to get a rejection of the null hypothesis. You will get a definite
conclusion to your study, but nobody may care about it.
In some ways, large samples are good in that they contain a lot of information -- it is unlikely that a very large sample will seriously misrepresent the population from which it has been drawn. On the other hand, if you need a sample of thousands of items to be able to just detect an effect, you need to ask whether the effect is really important -- whether it has any practical significance.
Summary of Generic Test Procedures
The following table summarizes the common features in most hypothesis tests that we will cover in this
course. In the table:
θ stands for a population parameter (e.g. μ, p, σ, μ1 - μ2, p1 - p2, etc.)
θ0 stands for a specific hypothesized numerical value of θ
f is the symbol for the standardized test statistic (e.g. z, t, χ2, etc.)
We will use the symbol fα to stand for the value of f which cuts off a right-hand tail of area α. Then, the symbol f1-α stands for the value of f that cuts off a left-hand tail of area α. For standardized random variables which are symmetric about the value zero (e.g., z and t), we have that f1-α = -fα, of course.
Now, the three possible hypothesis tests that can arise are:
hypotheses                         rejection criteria                         p-value
H0: θ = θ0  vs.  HA: θ > θ0        f > fα                                     Pr(f > test statistic)
H0: θ = θ0  vs.  HA: θ < θ0        f < f1-α (f < -fα for symmetric            Pr(f < test statistic)
                                   distributions)
H0: θ = θ0  vs.  HA: θ ≠ θ0        f > fα/2 or f < f1-α/2                     2·Pr(f > |test statistic|) for
                                                                              symmetric distributions
This has become a very long document, and we have introduced a lot of important ideas (though in many
cases at a very superficial level). We will now look at the application of these basic notions to situations
involving quite a variety of specific population parameters. In the process, we will further clarify or detail the
general principles described in this document. Before that, though, we will take a short excursion to draw a
parallel between statistical hypothesis testing and a situation that the entertainment industry (if not direct
contact with the legal system itself) has made quite familiar to most of us -- the criminal justice system.