YALE School of Management
EMBA MGT511- HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Sessions 1 and 2
Hypothesis Testing
1. Introduction to Hypothesis Testing
A hypothesis is a statement about the population. The hypothesis (statement about the
population) may be true or false.
Examples of Hypothesis:
1. The average salary of SOM MBA students who finished their MBAs in 2001 is
$110,000.
2. The proportion of 2001 SOM MBA graduates who had jobs at the time of graduation is
0.9.
For these hypotheses about the population (SOM MBA students who finished their
MBAs in 2001), it is easy to verify whether the statement is true by asking every student
who is graduating (i.e., take a census) their starting salaries as well as whether they had a
job at the time of graduation. In that case, we can categorically say whether the hypothesis
is true or false.
In most practical situations, however, it is not possible to conduct a census. Suppose we
had not collected the data from the students at the time of graduation, but now need this
information. We could send a survey to these alumni and ask them. It is likely that only a
fraction of the alumni would respond. Assuming we obtained a representative set of
responses, we could still use this sample to assess whether our statement about the
population is true or false. However, we have to recognize that there is likely to be some
probability with which we could make errors, because the sample mean would be
different from the population mean.
The approach of using sample data to assess whether a hypothesis is true or false is the
essence of hypothesis testing.
There are two types of hypotheses: the null and alternative (research) hypothesis.
The null hypothesis is usually the default belief about the population parameter.
The alternative hypothesis reflects a research claim.
Consider the following null and alternative hypotheses.

Null                                            Alternative
Defendant is innocent                           Defendant is guilty
Machine is “in control”                         Machine is “out of control”
(working according to specs)
The new drug is no better than a placebo        The new drug is better than a placebo
The portfolio manager’s performance is          The portfolio manager’s performance
equal to the S&P 500 performance                exceeds the S&P 500 performance
In hypothesis testing, the null hypothesis is assumed as the default unless the evidence is
strong enough to claim support for the alternative. It is a method of “proof by
contradiction.” More precisely, it offers proof of the alternative hypothesis by
contradicting the null hypothesis.
1. Defendant is assumed innocent until proven guilty
2. Machine is assumed to be “in control” until the evidence suggests otherwise.
3. A new drug is assumed to be ineffective unless it is shown to be better than a
placebo (or some other benchmark).
4. A portfolio manager is assumed to perform no better than the S&P 500 unless…
We use the weight of the evidence from the sample to see if we can reject the null
hypothesis. If the weight of the sample evidence is such that the null hypothesis is
unlikely, we reject the null.
Type I and Type II Errors
Since we use sample evidence to reject or not reject the null hypothesis, we face the
possibility that there will be some errors in the decisions due to sampling variation. There
are two types of errors: Type I and Type II Errors.
A Type I error occurs if the null hypothesis is true, but it is rejected. This is akin to
convicting an innocent defendant.
A Type II error occurs if the null hypothesis is false, but it is not rejected. This is akin to
acquitting a guilty defendant. These errors are well summarized in the table below:
                     Ho is True           Ho is False
Reject Ho            Type I error         Correct decision
Do not reject Ho     Correct decision     Type II error
We want to keep both types of errors to a minimum. However, for a constant sample size,
these errors are negatively correlated. Reducing one increases the likelihood of the other.
Carrying forward the innocent-guilty example: A low probability of convicting someone
who is innocent (Type I Error) implies a very high threshold for conviction of any
defendant. But such a high threshold implies that you are likely to acquit a guilty
defendant (Type II Error). So when you set the acceptable level of one type of error, you
automatically set the level for the other type of error, unless you change the sample size.
In practice, we typically control for Type I error. We often allow for a 5% level of Type I
error. But the level of error is in fact a managerial decision when doing hypothesis
testing. If Type I Error is more costly than Type II error, then managers will want to keep
Type I error to a minimum. If Type II error is more costly than Type I error, then managers
may accept higher Type I errors.
Practice Exercise: Think of situations where Type I errors may be costlier than Type II
errors and vice versa.
2. The Hypothesis Testing Process
Hypothesis testing involves a series of steps as shown by the following example problem.
Step 1: Defining the Null and Alternative Hypotheses
A machine is designed to fill bottles with an average content of 12 oz. and a standard
deviation of 0.1 oz. Periodically, random samples are taken to determine if the machine
might be “out of control”. Define the null and alternative hypotheses for this problem
using symbols and in words.
Note: When defining the hypotheses, it is critical to define the population precisely.
Null Hypothesis:
Ho: μ = 12 oz.
The true average content of all bottles filled by the machine during the time period of
sampling is 12 oz. (note the precise definition of the population of interest)
Alternative Hypothesis:
HA: μ ≠ 12 oz.
The true average content of all bottles filled by the machine during the time period of
sampling is not 12 oz.
Step 2: Specify the appropriate probability of Type I Error (alpha)
Since we use sample evidence to reject or not reject the null hypothesis, we face the
possibility of errors in the decisions due to sampling variation. There are two types of
errors: Type I and Type II Errors. Recall that, for any given sample size, reducing the
probability of Type I error increases the likelihood of Type II error. We typically control
(minimize) the probability of Type I error.
                     Ho is True           Ho is False
Reject Ho            Type I error         Correct decision
Do not reject Ho     Correct decision     Type II error
The implication of specifying a 5% type I error probability (alpha=0.05) is that we will
reject the null hypothesis (Ho) even when it is true 5% of the time. Thus, the sample
outcomes for which we reject Ho are determined under the assumption that Ho is true.
We make our decision to reject or not reject the null as follows. If the sample outcome is
among the 5% least likely outcomes assuming that the Null hypothesis is true, then we
reject the null. The basic logic of this decision is that the 5% least likely outcomes
under the null hypothesis are more likely to occur if Ho is false.
Note the similarity with a court procedure. A defendant is charged with a crime. The
judge and jury assume the defendant is innocent until proven otherwise. If the sample
evidence presented by the prosecutor is very unlikely to occur under the assumption that
the defendant is innocent, the decision of guilt is favored. Thus, the decision is based on
inconsistency (highly unlikely evidence to occur under the assumption of innocence)
between the evidence and the assumption of innocence.
How do we know what are the 5% least likely outcomes? The sampling distribution of
the sample statistic of interest will help us to identify the least likely (most extreme)
outcomes. So the next step is to set up an appropriate test statistic for this problem and
identify what values of the statistic cause us to reject the null hypothesis.
Step 3: Defining and Justifying the Relevant Test Statistic
In the above problem, we know the population standard deviation (σ). Given random
sampling, the sample mean Ȳ will tend to follow a normal distribution with mean equal
to the population mean (μ) and standard deviation σ/√n. It is conceivable that the
content of individual bottles is actually normally distributed. But if it is not, the Central
Limit Theorem allows us to claim that the theoretical distribution of all possible sample
means, of a given sample size, will tend to a bell-shaped curve (as the sample size
increases). Given this knowledge of the distribution, we can find out what the 5% least
likely values of Ȳ are. However, it is conventional to specify the rejection region in terms
of extreme values for the standardized test statistic. If the null hypothesis is true, then the
population mean should be μ₀. So we specify the rejection region in terms of
standardized units, i.e. we follow the usual procedure of standardizing the variable of
interest when we need to compute probabilities.

Z = (Ȳ − μ₀) / σ_Ȳ,  where σ_Ȳ = σ/√n
Step 4: Determining the Rejection Region
Having defined the test statistic, we now decide for what values of the test-statistic we
reject the null hypothesis. The selection of the rejection region depends on whether the
hypotheses are stated as one- or two-tailed tests.
[Figure: standard normal distribution with the two-tailed rejection region. 95% of values lie in the null acceptance region between −1.96 and 1.96; 2.5% of values lie in each rejection tail.]
For a two-tailed test, the five percent of the least likely values are split equally between
the two tails as in the figure above. So Z ≤ −1.96 and Z ≥ 1.96 are the 5% least likely
values if the null hypothesis is true. Therefore, we will reject Ho if the computed Z-value
based on the sample data is in this rejection region.
Steps 5 and 6: Computing the test statistics and drawing statistical and managerial
conclusions
Exercise: For the above bottling machine problem, suppose a simple random sample of
100 bottles is taken and the sample mean is 11.982 oz. Is this sample result among the
five percent least likely to occur under the null hypothesis (Ho)?
From the null hypothesis, μ₀ = 12.

Z = (Ȳ − μ₀) / σ_Ȳ = (11.982 − 12) / (0.1/√100) = −0.018 / 0.01 = −1.8

Since Z ≤ −1.96 and Z ≥ 1.96 is the rejection region with the 5% least likely values, this
computed value from the sample does not fall in the rejection region. Hence we cannot
reject the null hypothesis (statistical conclusion).
Strictly speaking, we conclude that the machine is functioning according to bottling
content specifications at the time the sample was taken (managerial conclusion).
However, one might argue that it is possible for the machine to be “out of control” but
that we have insufficient evidence to reject Ho at the 5% level. Compare a jury’s verdict
of “not guilty,” which does not establish innocence.
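To make the steps concrete, here is a minimal sketch in Python (scipy is assumed to be available; the numbers are those of the exercise above) that computes the Z statistic and checks it against the two-tailed rejection region:

```python
from scipy.stats import norm

mu_0 = 12.0      # hypothesized population mean (oz), from Ho
sigma = 0.1      # known population standard deviation (oz)
n = 100          # sample size
y_bar = 11.982   # observed sample mean (oz)

# Z = (Ybar - mu_0) / (sigma / sqrt(n))
z = (y_bar - mu_0) / (sigma / n ** 0.5)   # -1.8

# Two-tailed critical value at alpha = 0.05
z_crit = norm.ppf(1 - 0.05 / 2)           # about 1.96

if abs(z) >= z_crit:
    print(f"Z = {z:.2f}: reject Ho")
else:
    print(f"Z = {z:.2f}: cannot reject Ho")   # here |Z| = 1.8 < 1.96
```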
Summary of Steps in Hypothesis Testing
1. Identify the appropriate null and alternative hypotheses (Ho and HA). Be precise about
the interpretation of hypotheses, by carefully identifying the population of interest.
2. Choose an acceptable probability of Type I error (alpha).
3. Define a relevant test statistic. Justify it.
4. Determine the rejection region (for what values of the test statistic will Ho be
rejected?).
5. Collect the data (in practice, we need to consider the proper sample size so that the
probability of a type II error is controlled), compute sample results and calculate the
test statistic value.
6. Draw statistical and managerial conclusions.
P-Values
An interesting issue with deciding on the level of Type I error is: who should decide what
type I error probability is tolerable? What if the person conducting the test does not know
the decision maker’s tolerance?
One solution to this problem is the following. Instead of testing H0 at a specified Type I
error probability (alpha), we can report the probability of a Type I error if H0 is rejected.
This probability is called the p-value.
In the example above, what is the probability of Type I error if Ho is rejected?
We can answer this directly by looking at what fraction of the values of Z lies below −1.8
and above +1.8.
[Figure: standard normal distribution with 3.6% of values below −1.8 and 3.6% of values above 1.8, marking the two tails.]
We can do this either by looking at the normal tables in a statistics textbook or using
Excel.
In Excel, the function =NORMSDIST(Z) (Hint: this is short for the Normal Standardized
Distribution) can be used to find P(X<Z). Plugging this function into Excel tells us
that P(X<−1.8) is 0.036; i.e., 3.6% of the values lie below −1.8.
Since the rejection region includes both P(X<−1.8) and P(X>1.8), the probability of Type
I error will be 0.036 × 2 = 0.072. Hence the p-value is 0.072.
Question: What should we do to compute the p-value when Z is positive?
When Z is positive we need to compute P(X>Z). Since P(X>Z) = 1 − P(X<Z), simply
compute 1 − NORMSDIST(Z) when Z is positive and get the p-value by multiplying that
number by 2.
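Outside Excel, the same lookup can be done in Python; a minimal sketch in which scipy's norm.cdf plays the role of =NORMSDIST:

```python
from scipy.stats import norm

z = -1.8

lower_tail = norm.cdf(z)           # P(X < -1.8), about 0.036
p_value = 2 * norm.cdf(-abs(z))    # doubles the tail area; works for positive or negative z

print(f"P(X < {z}) = {lower_tail:.3f}")   # 0.036
print(f"p-value = {p_value:.3f}")         # 0.072
```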
3. One-tailed versus Two-tailed tests
Two-tailed test:
Recall: In the example in the previous section, we conducted a two-tailed test. The null
and the alternative hypotheses were:
Null Hypothesis:
Ho: μ = 12 oz.
Alternative Hypothesis:
HA: μ ≠ 12 oz.
We rejected the null if Z ≤ −1.96 or Z ≥ 1.96. This implies that the person conducting
the test wants to stop the machine if it is out of control in either direction. That is, there is
a cost associated with having an excessive amount of liquid as well as with an insufficient
amount in the bottles.
One-tailed test:
Suppose in the machine-bottling problem, we take the perspective of a distributor, who
does not care if the bottles truly have more than 12 oz. on average (or the bottles have a
maximum capacity of 12 oz). That is, the distributor is only concerned about insufficient
content on average. So the distributor takes samples out of batches of items and returns
the entire batch if the hypothesis test shows that the contents are less than 12 oz on
average. Now the null and the alternative hypotheses are:
Null Hypothesis:
Ho: μ ≥ 12 oz.
Alternative Hypothesis:
HA: μ < 12 oz.
For a one-tailed test, the alternative hypothesis expresses the values of the parameter for
which we want to reject the null hypothesis. If we create mutually exclusive and
collectively exhaustive hypotheses, the null hypothesis must then be the complement of
the alternative. Note that it is usually easier, in practice, to start with the alternative
hypothesis. It represents what management is concerned about or what a researcher might
believe based on theory. The test proceeds under the assumption that the equality case
under the null hypothesis applies. In other words, the machine is assumed to be in control,
i.e. the true average is assumed to be 12 oz. But now only the 5% extreme cases in the left
tail will result in rejection of the null hypothesis.
As before, we use the Z statistic because we know the population standard deviation. As
argued above, the Z-statistic is still computed at the boundary of the null hypothesis (12
oz).
Z = (Ȳ − μ₀) / σ_Ȳ = (11.982 − 12) / (0.1/√100) = −0.018 / 0.01 = −1.8

Since now all of the extreme values for rejection are concentrated in the left tail, the 5%
rejection region is for computed Z ≤ −1.645. (See the figures below.)
The interesting finding is that for the same test result, we cannot reject the null hypothesis
at the 5% level if the test is two-tailed but we can if it is one-tailed.
Therefore, for the one-tailed test, we reject the null hypothesis. (Statistical Conclusion)
The batch of bottles received by the distributor from this manufacturer has average
contents lower than the specified 12 oz and therefore must be returned to the
manufacturer. (Managerial Conclusion)
Note that the two-tailed test is more conservative in rejecting the null hypothesis. While
the z-score needs to be below −1.96 to reject the null in the two-tailed test, it needs to be
only below −1.645 to reject the null in the one-tailed test. In this case, the distributor thus
rejects the null hypothesis with a lower threshold of evidence than the manufacturer.
[Figure: one-tailed test with the 5% rejection region in the right tail, beyond Z = 1.645; 95% of values lie in the null acceptance region.]
[Figure: one-tailed test with the 5% rejection region in the left tail, below Z = −1.645; 95% of values lie in the null acceptance region.]
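The one- and two-tailed critical values can be compared in a few lines; a minimal sketch showing that the same Z = −1.8 falls inside the one-tailed rejection region but outside the two-tailed one:

```python
from scipy.stats import norm

z = -1.8          # computed test statistic from the sample
alpha = 0.05

z_two = norm.ppf(1 - alpha / 2)   # 1.96: alpha split across both tails
z_one = norm.ppf(1 - alpha)       # 1.645: all of alpha in the left tail here

print(f"two-tailed: reject if |Z| >= {z_two:.3f} -> {abs(z) >= z_two}")   # False
print(f"one-tailed: reject if Z <= {-z_one:.3f} -> {z <= -z_one}")        # True
```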
4. What are appropriate test-statistics for hypothesis testing?
The choice of the appropriate test-statistic is critical in hypothesis testing. In the example
above we were testing a hypothesis about a population mean. We used a Z-statistic,
because we knew the population standard deviation and we knew Ȳ was normally
distributed. Even if Y is not normally distributed, we know by the Central Limit Theorem
that Ȳ will be approximately normally distributed for sufficiently large samples. Therefore we can use
the Z-statistic. However, if the population standard deviation is not known, we need to
use a t-statistic instead to compensate for additional uncertainty due to the use of the
sample standard deviation instead of the population value. Note that we still require the
theoretical distribution of all possible sample means to be normal.
We now discuss appropriate test statistics for means and proportions under different
conditions.
Test of One Mean
Population Standard Deviation is known:
H0: μ = μ₀
HA: μ ≠ μ₀
Condition: If (1) simple random sampling,
(2) Ȳ is normally distributed (because Y is normal or the Central Limit
Theorem applies), and
(3) σ is known,
then use the Z statistic:

Z = (Ȳ − μ₀) / σ_Ȳ,  where σ_Ȳ = σ/√n

(assuming N is large so that we can ignore the finite population correction factor)
Population Standard Deviation is unknown
Condition: If (1) simple random sampling,
(2) Ȳ is normally distributed (because Y is normal or the Central Limit
Theorem applies), and
(3) σ is unknown,
then use the t statistic:

t = (Ȳ − μ₀) / s_Ȳ  with (n−1) df,  where s_Ȳ = s/√n

(assuming N is large; re: finite population correction)
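As an illustration, when σ is unknown the t-test can be run directly from raw data. A minimal sketch using scipy's ttest_1samp; the sample values below are hypothetical:

```python
from scipy.stats import ttest_1samp

# Hypothetical sample of bottle contents (oz); population sigma unknown
sample = [11.98, 12.03, 11.95, 12.01, 11.97, 12.00, 11.96, 12.02]

# Computes t = (Ybar - mu_0) / (s / sqrt(n)) with (n-1) df and a two-tailed p-value
t_stat, p_value = ttest_1samp(sample, popmean=12.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```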
Test of one proportion
H0: π = π₀
HA: π ≠ π₀
Condition: If (1) simple random sampling,
(2) π̂ is approximately normally distributed (for large n),
then we can use the Z statistic:

Z = (π̂ − π₀) / σ_π̂,  where σ_π̂ = √( π₀(1 − π₀) / n )

However, this statistic can be used only if the following conditions are satisfied:
nπ₀ > 5 and n(1 − π₀) > 5
Example:
John Rowland and Bill Curry are candidates for CT Governor. We are interested in
knowing who is likely to win the election on Nov 5, 2002. A survey of 900 CT “likely
voters” on October 30, 2002 asked whom they intended to vote for on Election Day. 56% of
respondents said they intended to vote for Rowland and 44% for Bill Curry. Test the
hypothesis that one of the candidates is more likely to win the election.
Let π be the proportion of likely voters who intend to vote for Bill Curry on Nov 5, as of
October 30. (You could just as well have written the hypothesis in terms of voters
intending to vote for Rowland and done this test.)
Hypothesis testing
Ho: π = .5
(we assume, until proven otherwise, that Curry (and therefore Rowland)
has 50 percent of the vote among all likely voters)
HA: π ≠ .5
Random sampling:
E(π̂) = π,  Var(π̂) = π(1 − π)/n
π̂ is approximately normal if nπ and n(1 − π) are both > 5

z = (π̂ − π₀) / σ_π̂ = (π̂ − π₀) / √( π₀(1 − π₀) / n )
At α = .05, reject Ho if Z (calculated) ≥ 1.96 or ≤ −1.96.

Suppose n = 900 and π̂ = 0.44 (this would be 0.56, if you had written π in terms of
votes for Rowland).
Calculate the z-value:

z = (.44 − .50) / √( (.5 × .5) / 900 ) = −.06 / .0167 = −3.59
Since -3.59 < -1.96, reject Ho
Or, calculate the p-value:
Prob [Z ≤ −3.59] < .001
p-value < .001 × 2 = .002
Since the p-value is less than .05, reject Ho.
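The entire test maps to a few lines of code. A minimal sketch with the survey numbers above (scipy assumed):

```python
from math import sqrt
from scipy.stats import norm

n, pi_hat, pi_0 = 900, 0.44, 0.50

# Normal approximation requires n*pi_0 > 5 and n*(1 - pi_0) > 5
assert n * pi_0 > 5 and n * (1 - pi_0) > 5

se = sqrt(pi_0 * (1 - pi_0) / n)   # about 0.0167
z = (pi_hat - pi_0) / se           # about -3.6

p_value = 2 * norm.cdf(-abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # well below .05: reject Ho
```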
Test of Two Samples
Independent versus paired samples
1. Suppose we wish to test which of two ads consumers like better. We can use either
independent samples or paired samples. If we use independent samples, then we can show
the two ads to two different samples of consumers and ask them to rate the ads. We can
then compare the average ratings of the two ads in performing the hypothesis test.
Alternatively, we can take one sample of consumers and show both ads to the consumers.
We can then compare the ratings of each individual for the two ads. This would be an
example of a paired sample.
2. Suppose we want to test whether married men or women are happier. Here again we
could use independent samples or paired samples. If we use independent samples, then
we can ask a sample of married men and an independent sample of married women about
their happiness. We can then compare the average happiness ratings of the two groups in
performing the hypothesis test.
Alternatively, we can pick married couples and ask both the men and the women about
their happiness. Then we can look at the difference in happiness reported by each man
and his wife. We can test whether this difference (one number for each couple) is
significantly different from zero. This would be a paired sample test.
Independent samples: each sample is randomly drawn from a separate population, and
there is no linkage between successive draws.
Paired samples: the population of interest is defined in terms of pairs in such a way that
the paired observations have something in common.
Comparing Means for Paired Samples: Population Standard Deviation is unknown
Paired samples: As discussed earlier, the population of interest is defined in terms of
pairs in such a way that the paired observations have something in common.
Note that in these examples the pairs can either be two different individuals or two
different measures on each individual.
Given random sampling, E(Ȳ₁ − Ȳ₂) = μ₁ − μ₂.
With paired samples,
Var(Ȳ₁ − Ȳ₂) = Var(Ȳ₁) + Var(Ȳ₂) − 2Cov(Ȳ₁, Ȳ₂)
Thus, if there is a positive covariance, pairing reduces the variance of the difference
between the sample means.
Instead of accommodating the covariance, we can create differences between the paired
observations; the greater the positive covariance between Y1 and Y2, the smaller the
variance of the difference (compared to the variance of Y1 and the variance of Y2).
In fact, for paired samples, we take the differences and then perform the single-sample
hypothesis test on the differences.
If Y1 and Y2 are a paired sample, then create a difference variable YD = Y1 – Y2.
Given the following null and alternative hypotheses:
Null: H0: μ_D = D₀ (true average difference in the population is D₀)
Alternative: HA: μ_D ≠ D₀
If random sampling, and if Ȳ_D is normally distributed (conditions?), then

t = (Ȳ_D − D₀) / s_(Ȳ_D)  with (n−1) df, where n is the number of pairs
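A minimal sketch of the paired procedure: scipy's ttest_rel is equivalent to taking differences and running the one-sample t-test on them. The ratings below are hypothetical, echoing the two-ads example:

```python
from scipy.stats import ttest_rel

# Hypothetical paired data: each consumer rates both ads
ad1 = [7, 5, 8, 6, 7, 9, 6, 8]
ad2 = [6, 5, 7, 4, 6, 8, 6, 7]

# Equivalent to a one-sample t-test on the differences ad1 - ad2, with (n-1) df
t_stat, p_value = ttest_rel(ad1, ad2)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```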
Comparing Means for Independent Samples: Population Standard Deviation is known
Two samples, sizes n₁ and n₂.
We are interested in the difference between, say, two population means, μ₁ − μ₂.
Define two random variables, Ȳ₁ and Ȳ₂.
Given random sampling, E(Ȳ₁ − Ȳ₂) = μ₁ − μ₂.
With independent sampling, Var(Ȳ₁ − Ȳ₂) = Var(Ȳ₁) + Var(Ȳ₂) = σ₁²/n₁ + σ₂²/n₂.

Hence σ_(Ȳ₁−Ȳ₂) = √( σ₁²/n₁ + σ₂²/n₂ )

Given the following null and alternative hypotheses:
Null: H0: μ₁ − μ₂ = D₀
Alternative: HA: μ₁ − μ₂ ≠ D₀
If normality applies, then the test statistic is

Z = ( (Ȳ₁ − Ȳ₂) − D₀ ) / σ_(Ȳ₁−Ȳ₂)

If Ho: μ₁ − μ₂ = 0, then

Z = (Ȳ₁ − Ȳ₂) / σ_(Ȳ₁−Ȳ₂) = (Ȳ₁ − Ȳ₂) / √( σ₁²/n₁ + σ₂²/n₂ )
Independent Samples: Population Standard Deviation is unknown
Everything else is the same as above, but if σ₁ and σ₂ are unknown, replace them with s₁
and s₂ and compute the t-statistic.
Two samples, sizes n₁ and n₂.
We are interested in the difference between, say, two population means, μ₁ − μ₂.
Define two random variables, Ȳ₁ and Ȳ₂.
Given random sampling, E(Ȳ₁ − Ȳ₂) = μ₁ − μ₂.
With independent sampling, Var(Ȳ₁ − Ȳ₂) = Var(Ȳ₁) + Var(Ȳ₂), estimated by s₁²/n₁ + s₂²/n₂.

Hence s_(Ȳ₁−Ȳ₂) = √( s₁²/n₁ + s₂²/n₂ )

Given the following null and alternative hypotheses:
Null: H0: μ₁ − μ₂ = D₀
Alternative: HA: μ₁ − μ₂ ≠ D₀
If normality applies, then the test statistic is

t = ( (Ȳ₁ − Ȳ₂) − D₀ ) / s_(Ȳ₁−Ȳ₂)  with (n₁ + n₂ − 2) df

If Ho: μ₁ − μ₂ = 0, then

t = (Ȳ₁ − Ȳ₂) / s_(Ȳ₁−Ȳ₂) = (Ȳ₁ − Ȳ₂) / √( s₁²/n₁ + s₂²/n₂ )  with (n₁ + n₂ − 2) df
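For comparison, a minimal sketch of the independent-samples version with unknown standard deviations, using scipy's ttest_ind on hypothetical data. With equal_var=False, scipy uses the same standard error √(s₁²/n₁ + s₂²/n₂) as above, though it applies the Welch degrees-of-freedom correction rather than n₁ + n₂ − 2:

```python
from scipy.stats import ttest_ind

# Hypothetical independent samples: each ad shown to a different group
group1 = [7, 5, 8, 6, 7, 9, 6, 8]
group2 = [6, 4, 7, 5, 6, 7, 5, 6]

# Unpooled standard error sqrt(s1^2/n1 + s2^2/n2); Welch df rather than n1+n2-2
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```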
Example:
The table below provides the salaries of MBA students at a mid-western school before
they started their MBA and after they finished their MBA. Test the hypothesis that their
salaries “after MBA” are different from their salaries “before MBA”.
            Before MBA    After MBA    After-Before    Percentage Change
                60            75            15              0.25
                40            45             5              0.13
                35            50            15              0.43
                75            90            15              0.20
                52            70            18              0.35
                35            45            10              0.29
                50            65            15              0.30
                40            55            15              0.38
                35            50            15              0.43
               140           130           -10             -0.07
                45            55            10              0.22
                35            55            20              0.57
                45            55            10              0.22
               110           110             0              0.00
               130           130             0              0.00

N               15            15            15             15
Average         62            72            10              0.25
Std Devn        36            29             8              0.18
The data represent paired samples. By taking differences, we eliminate the common
component that produces the large positive covariance. As a result, the standard deviation
of the Difference is much smaller than that of Before or After (as can be seen from the table).
If random sampling,
Ho: μ_D = 0 (true average difference in the population is zero)
HA: μ_D ≠ 0
If Ȳ_D is normally distributed (conditions?),

t = (Ȳ_D − D₀) / s_(Ȳ_D) = 10 / (8/√15) = 4.84  with (n−1) df, where n is the number of pairs
Since t(14, 0.025) = 2.14 and 4.84 > 2.14, we can reject the null.
Prob [t ≥ 4.84] < .005
So, p-value < .01
Contrast this with the result if we assumed (wrongly) that the data come from two
independent samples:

s_(Ȳ₁−Ȳ₂) = √( 36²/15 + 29²/15 ) = 11.93

t = ( Ȳ₁ − Ȳ₂ − 0 ) / s_(Ȳ₁−Ȳ₂) = 10 / 11.93 = 0.84
Since t(28, 0.025) = 2.04 and 0.84 < 2.04, with the mistaken assumption of independent
sampling we cannot reject the null. This example illustrates the importance of designing
a statistical study based on an understanding of statistical principles.
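Both calculations can be reproduced from the summary statistics in the table; a minimal sketch:

```python
from math import sqrt
from scipy.stats import t

n = 15
mean_diff, sd_diff = 10, 8     # After - Before: average and standard deviation
sd_before, sd_after = 36, 29   # per-column standard deviations

# Paired test: one-sample t on the differences, (n - 1) = 14 df
t_paired = mean_diff / (sd_diff / sqrt(n))           # about 4.84

# (Mistaken) independent-samples test, (n1 + n2 - 2) = 28 df
se_ind = sqrt(sd_before**2 / n + sd_after**2 / n)    # about 11.93
t_ind = mean_diff / se_ind                           # about 0.84

print(f"paired:      t = {t_paired:.2f} vs t(14, .025) = {t.ppf(0.975, 14):.2f}")  # reject
print(f"independent: t = {t_ind:.2f} vs t(28, .025) = {t.ppf(0.975, 28):.2f}")     # cannot reject
```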