Chapter Six
Hypothesis Testing: Applications
Introduction
Comparison of Measurements with a Fixed Value
    Hypotheses
    Test Statistic
Comparison of Two Measurement Means
    Independent Sets of Measurements
    Paired Measurements
Comparison of Many Measurement Means
Comparison of Measurement Precision (F-test)
    Hypotheses
    Test Statistic
Hypothesis Testing in Regression Analysis
Outlier Testing
    Introduction
    Identification of Contaminants
    Testing Single Outliers
    Testing Multiple Outliers
    Advanced Topic: Data Contamination Models
    Choosing Test Statistics
Chapter Checkpoint
Introduction
In the last chapter, we described the most important concepts of statistical hypothesis testing; in
this chapter, we will apply the formal procedure of hypothesis testing to a variety of situations.
Let's very quickly review the steps in the testing procedure (as applied to measurement data):
1. Construct the null hypothesis, H0, and the alternate hypothesis, H1, in terms of measurement
population parameters.
2. Choose the confidence level to give the desired levels of false positive and false negative errors.
3. Choose a test statistic and calculate its value from the measurements.
4. Determine the critical value(s) from the null distribution of the test statistic, and make the
decision to accept H0 or H1.
5. State the conclusion in plain English (along with the confidence level). Remember that a
decision to accept H0 does not imply proof.
Before proceeding further, a word of caution is necessary. The limitations of hypothesis testing
must be kept firmly in mind:
• a positive conclusion (proof of the alternate hypothesis) is no more valid than the data on
which it is based. As the saying goes, “garbage in, garbage out.” The presence of bias in the
measurements may render meaningless the results of a hypothesis test.
• the assumptions behind any specific hypothesis test must be kept firmly in mind. For example,
in order to use the testing procedure outlined above, the form of the probability distribution of
the test statistic must be known.
This last point especially needs to be emphasized, because many statistical tests are used (and
accepted as fact) even when the assumptions on which they are based are not met. Once you get
the hang of it, hypothesis testing is deceptively easy, especially with the abundance of
computer programs that are available for assistance. It is too easy to get caught up in number
crunching without examining the validity of the test procedure. That is why it is important to
understand the previous chapter, and to read the "fine print" in any testing procedure.
In spite of these caveats, you should still appreciate the scope of the hypothesis testing procedure.
Any assertion that can be stated in terms of population parameters can be tested; for some test
procedures (called nonparametric tests), it is not even necessary to know the underlying
distribution of the data. In this chapter, we will discuss the following hypothesis tests:
1. comparison of the mean of a set of measurements with a fixed value;
2. comparison of the mean of two sets of measurements;
3. comparison of the precision of two sets of measurements;
4. testing in regression analysis; and
5. testing of outliers.
In all of these cases, we are testing to see if there is a significant difference between two
numbers. A difference is considered “significant” if it is greater than can be reasonably explained
by the random variability of the measurements.
Comparison of Measurements with a Fixed Value
We covered this test in some detail in chapter 5, where we compared the measured nicotine
levels of cigarettes with the manufacturer's claimed level. This test is used when we want to
test whether there is a significant difference between a group of measurements and some
postulated value.
Hypotheses
• H0: µx = k
• H1: µx ≠ k (two-tailed), or
      µx > k (one-tailed), or
      µx < k (one-tailed)
Test Statistic
The following table gives the test statistics for these types of hypothesis tests.
σ is known:
    T = (x̄ − k)/σ(x̄)
    Null distribution: the z-distribution, assuming the measurement mean is normally distributed.
σ is unknown:
    T = (x̄ − k)/s(x̄)
    Null distribution: a Student's t-distribution with n − 1 degrees of freedom and a mean of zero,
    assuming the measurement mean is normally distributed.
A hypothesis test is often referred to by the name of the null distribution of the test statistic.
Thus, to test the significance between the measurement mean and a fixed value, we use a
z-test if the standard deviation, σx, of the measurements is known, and a t-test if σx is
unknown.
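To make the procedure concrete, here is a minimal sketch of the t-test case in Python (my illustration, not part of the original notes; the data and the postulated value k are hypothetical, and numpy/scipy are assumed to be installed):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and postulated value k (sigma unknown, so this is a t-test)
x = np.array([1.52, 1.61, 1.55, 1.58, 1.63])
k = 1.50

# Studentized test statistic: T = (xbar - k)/s(xbar), where s(xbar) = s/sqrt(n)
T = (x.mean() - k) / (x.std(ddof=1) / np.sqrt(len(x)))

# Two-tailed critical value from a t-distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(x) - 1)   # 95% confidence level
print(T, t_crit)                             # reject H0 if |T| > t_crit

# scipy runs the same test directly and also reports the P-value:
print(stats.ttest_1samp(x, popmean=k))
```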
Comparison of Two Measurement Means
Independent Sets of Measurements
Hypotheses
We want to compare two sets of measurements. The first set contains nx measurements, and is a
sample of the population of a random variable x. The second set contains ny measurements of a
random variable y. Since random measurement error will be present in all of the values, the two
measurement means x̄ and ȳ will almost certainly not be equal; however, we want to see if the
observed difference between them is significant. So our null and alternate hypotheses will be
• H0: µx = µy
• H1: µx ≠ µy (two-tailed), or
      µx > µy (one-tailed), or
      µx < µy (one-tailed)
Test Statistic
We have two groups of measurements, group x and group y. Let’s define d as the difference
between the sample means of the two groups:
d ≡ x̄ − ȳ
If the sample means are normally-distributed variables, then d will also be normally distributed,
with a mean given by
µd = µx − µy
and a standard deviation (assuming the groups of measurements are independent)
σd = √(σx²/nx + σy²/ny)      [6.1]
[Note: you should be able to derive this equation by propagation of error.]
Equation 6.1 can be simplified under one important condition. If the population variances of the
two sets of measurements are equal (i.e., homogeneous variance), then σx = σy = σ, and the equation
becomes

σd = σ·√(1/nx + 1/ny)      homogeneous variance      [6.2]

where σ is the common population standard deviation of the two sets of measurements.
The test statistic that we will use in the comparison of measurement means will be either the
standardized or studentized value of d. Thus, if the variances of the groups are known (which is
not very common), we use

T = d/σd      test statistic, known variances      [6.3]
In other words, in comparing the means of two independent sets of measurements, we can use the
standardized difference of sample means.
Example 6.1
The composition of a flake of paint found on the clothes of the victim of a hit-and-run
accident was compared with that of paint from the car suspected of causing the accident. Do
the following data for the spectroscopic determination of titanium in the paint suggest a
difference in composition between the two materials? From previous experience, the standard
deviation for the method is known to be 0.35% Ti.
Paint from clothes:
%Ti = 4.0, 4.6
Paint from car:
%Ti = 4.5, 5.3, 5.5, 5.0, 4.9
The null hypothesis will be that the concentration of Ti in all the paint samples is the same;
assuming no bias in the measurements, this can be stated in terms of the population parameters as
H0: µclothes = µcar
while the alternate hypothesis will simply be that they are different:
H1: µclothes ≠ µcar (2-tailed test)
Let’s set the confidence level at 80%, a fairly low value, because we are less concerned with the
consequences of a false positive than with a false negative (which would result in wrongfully
accusing an innocent man).
clothes = (4.0, 4.6) %Ti      car = (4.5, 5.3, 5.5, 5.0, 4.9) %Ti      σ = 0.35 %
d = mean(car) − mean(clothes) = 0.7400 %      (difference in sample means)
σd = σ·√(1/2 + 1/5) = 0.2928 %                (standard deviation of the difference in means)
Tobs = d/σd = 2.5271                          (standardized difference in means; seems pretty large)
Looking in the z-tables (which describe the null distribution of T), we find that Tcrit = ±1.2816.
Thus, we reject H0 at the 80% confidence level. Assuming no measurement bias, we can say with
80% confidence that the paints contain different levels of Ti. Note that the observed two-tailed
P-value is Pobs = 0.0115; thus, we could still reject H0 at the 98% confidence level, if we so chose.
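The same calculation is easy to script; the following Python sketch (my illustration; the variable names are mine) reproduces the numbers above and pulls the critical value and P-value from the z-distribution:

```python
import numpy as np
from scipy import stats

clothes = np.array([4.0, 4.6])                # %Ti in paint from clothes
car = np.array([4.5, 5.3, 5.5, 5.0, 4.9])     # %Ti in paint from car
sigma = 0.35                                  # known method std deviation, %Ti

d = car.mean() - clothes.mean()                           # 0.7400
sigma_d = sigma * np.sqrt(1/len(clothes) + 1/len(car))    # eqn 6.2 -> 0.2928
T = d / sigma_d                                           # 2.5271

z_crit = stats.norm.ppf(0.90)         # two-tailed test at 80% confidence (alpha/2 = 0.10)
p_obs = 2 * stats.norm.sf(abs(T))     # two-tailed P-value -> 0.0115
print(T, z_crit, p_obs)               # |T| > 1.2816, so reject H0
```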
In the last example, the measurement variance was known and homogeneous, so we used
eqn 6.2 to calculate σd, the standard error of the difference of sample means. If the two sets of
measurements had not shared the same variance, we would have had to use eqn. 6.1.
Normally, of course, the standard deviation of the measurements will not be known. In this case,
our test statistic will be the studentized difference in sample means:
T = d/sd      test statistic, unknown variances      [6.4]
where d is the difference in sample means. In the case of homogeneous (but unknown) variance
of the two sets of measurements, we calculate the standard deviation, sd, of the difference of
means by using an analog of eqn. 6.2:

sd = spool·√(1/nx + 1/ny)      homogeneous (but unknown) variance      [6.5]

Since the two groups share the same variance, we may pool the data to obtain a standard deviation
with more degrees of freedom. In this case, the test statistic will follow a Student's t-distribution
with ν = nx + ny − 2 degrees of freedom. Since T follows a t-distribution, this test is a type of
t-test.
The next example illustrates the t-test for the comparison of means in the case of homogeneous
variance.
Example 6.2
A procedure is developed for the analysis of the trace iron content of water. Five
measurements of the blank yield the following measurements (in ppm):
0.53, 0.56, 0.51, 0.53, 0.50 ppm Fe
blank measurements
A sample is analyzed three times, giving the following measurements:
0.56, 0.58, 0.56 ppm Fe
sample measurements
Can we say with some degree of certainty that there is iron in the sample?
The null hypothesis is that there is no iron in the sample; assuming no measurement bias, we can
state this as
H0: µsample = µblank
The alternate hypothesis is that there is indeed iron in the sample, which would result in
measurements significantly larger than those on the blank:
H1: µsample > µblank (1-tailed test)
Let’s use a 95% confidence level. Calculating the difference between the sample and the blank,
we obtain
blank = (0.53, 0.56, 0.51, 0.53, 0.50) ppm      sample = (0.56, 0.58, 0.56) ppm
d = mean(sample) − mean(blank) = 0.0407 ppm     (we want to test whether this value is
                                                 significantly larger than zero)
The test statistic is given by eqn. 6.4; we must obtain an estimate, sd, of the standard error of d. It
is common in analytical chemistry to assume that the measurements on the sample and the blank,
when they are similar in magnitude, are of the same precision. In other words, we may assume
homogeneous variance, and calculate the pooled variance.
spool = √[(4·var(blank) + 2·var(sample))/(4 + 2)] = 0.0199 ppm
Now we can estimate the standard deviation of the difference in sample means:

sd = spool·√(1/5 + 1/3) = 0.0146 ppm
Tobs = d/sd = 2.7920      (the test statistic follows a t-distribution with 6 degrees of freedom)
Pobs = 0.0157             (one-tailed P-value)
The one-tailed critical value for testing at 95% confidence is obtained from the t-tables:
Tcrit = 1.9432. Since Tobs > Tcrit, we can reject H0 and accept H1. We can say with 95% confidence
that there is iron in the sample, assuming no measurement bias. Note that the one-tailed P-value
is Pobs = 0.0157. Thus, we would be able to prove that there is iron in the sample even at the 98%
confidence level.
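As a check on the arithmetic, here is a Python sketch of the pooled t-test (my illustration; scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

blank = np.array([0.53, 0.56, 0.51, 0.53, 0.50])   # ppm Fe
sample = np.array([0.56, 0.58, 0.56])              # ppm Fe

# Pool the variances, weighting by degrees of freedom (homogeneous variance assumed)
s_pool = np.sqrt((4*blank.var(ddof=1) + 2*sample.var(ddof=1)) / (4 + 2))   # 0.0199
s_d = s_pool * np.sqrt(1/len(blank) + 1/len(sample))                       # eqn 6.5 -> 0.0146
T = (sample.mean() - blank.mean()) / s_d                                   # eqn 6.4 -> 2.792

t_crit = stats.t.ppf(0.95, df=6)     # one-tailed critical value, 95% confidence -> 1.9432
p_obs = stats.t.sf(T, df=6)          # one-tailed P-value -> 0.0157
print(T, t_crit, p_obs)

# Equivalently, scipy's pooled two-sample t-test:
print(stats.ttest_ind(sample, blank, equal_var=True, alternative='greater'))
```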
If two groups of measurements do not share the same variance, then we cannot pool the data. In
this case, we say that the variance of the two populations is nonhomogeneous, and we must
estimate σd with sd calculated from the following expression:
sd = √(sx²/nx + sy²/ny)      nonhomogeneous (unknown) variance      [6.6]
Using this estimate, T will follow a t-distribution with ν degrees of freedom, where ν must be
calculated using the following expression

νd = νx·νy/(νy·c² + νx·(1 − c)²)      [6.7]

where νx and νy are the degrees of freedom in sx and sy, respectively, and the value of c in this
equation is calculated from

c = (sx²/nx)/sd²
Note that sd in this expression is calculated by eqn. 6.6. If necessary, you should round down the
value of ν calculated from eqn. 6.7 to the nearest integer (this gives a slightly more conservative
test).
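A small helper function makes eqns 6.6 and 6.7 easy to apply; the following is a sketch of mine (numpy only; the function name welch_df is not from the text), checked against the numbers of example 6.3 below:

```python
import numpy as np

def welch_df(s_x, n_x, s_y, n_y):
    """Std error (eqn 6.6) and degrees of freedom (eqn 6.7) for
    two groups with nonhomogeneous, unknown variance."""
    s_d2 = s_x**2/n_x + s_y**2/n_y                     # squared std error, eqn 6.6
    c = (s_x**2/n_x) / s_d2
    nu_x, nu_y = n_x - 1, n_y - 1
    nu = nu_x*nu_y / (nu_y*c**2 + nu_x*(1 - c)**2)     # eqn 6.7
    return np.sqrt(s_d2), nu                           # round nu down for a conservative test

s_d, nu = welch_df(0.065, 6, 0.080, 4)   # values from example 6.3
print(s_d, nu)                           # 0.0480, 5.57 -> use 5 degrees of freedom
```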
Example 6.3
Two barrels of wine were analyzed for alcohol content. On the basis of 6 analyses, the mean
content of the first barrel was 12.61% ethanol (s = 0.065%). Four analyses of the second
barrel gave a mean of 12.50% ethanol (s = 0.080%). Do the data indicate a difference
between the wines?
The null and alternate hypotheses are
H0: µfirst = µsecond
“same alcohol content in both barrels (assuming no bias)”
H1: µfirst ≠ µsecond
“different alcohol content, 2-tailed test”
Let’s test at the 95% level. Normally in this case I would assume homogeneous variance:
presumably, the same analytical technique was used in both cases, and the signal levels, and
precisions, seem about the same. However, for illustrative purposes, let’s assume that the
variance is nonhomogeneous.
nfirst = 6      nsecond = 4
sfirst = 0.065 %      ssecond = 0.080 %
d = 12.61 % − 12.50 % = 0.1100 %
sd = √(sfirst²/nfirst + ssecond²/nsecond) = 0.0480 %      (eqn. 6.6)
c = (sfirst²/nfirst)/sd² = 0.3056
ν = νfirst·νsecond/(νsecond·c² + νfirst·(1 − c)²) = 5.5739      (eqn. 6.7; this will
                                                                 approximately follow a t-distribution
                                                                 with 5 degrees of freedom)
Tobs = d/sd = 2.2916
Pobs = 0.0705      (2-tailed P-value)
Based on the observed P-value, we cannot prove H1 at the 95% level. The 2-tailed critical values
for testing at the 95% level are ±2.5706; since Tobs = 2.2916 lies between them, we must accept H0.
We cannot prove a significant difference in alcohol content between the barrels with 95% confidence.
Summary
The following table summarizes the test statistics used in comparing the means of two
independent sets of measurements. Note that the statements about the null distribution assume
that both measurement means x̄ and ȳ are normally distributed.
σ known?   σx = σy?   test statistic   std error                        null distribution
yes        yes        T = d/σd         σd = σ·√(1/nx + 1/ny)            z-distribution
yes        no         T = d/σd         σd = √(σx²/nx + σy²/ny)          z-distribution
no         yes        T = d/sd         sd = spool·√(1/nx + 1/ny)        t-distribution with ν = nx + ny − 2
no         no         T = d/sd         sd = √(sx²/nx + sy²/ny)          t-distribution with ν given by eqn. 6.7
The most common situation is that you have two sets of measurements of unknown precision
(i.e., neither σx nor σy are known). You then must decide whether to assume homogeneous
variance (σx = σy?). The questions to be aware of are:
1. When can we assume homogeneous variance?
2. What are the consequences of incorrectly assuming homogeneous variance (or of incorrectly
assuming nonhomogeneous variance)?
In real life, the first question is usually a judgment call. Based on a knowledge of the experiment
and of the likely sources of error, you must decide whether the magnitudes of random
measurement error are comparable. In analytical chemistry, this is more likely to be the case
when the same analytical measurement technique is used by the same analyst on samples that
contain comparable quantities of analyte in similar sample matrices. Homogeneous variance is
not likely when comparing two dissimilar measurement techniques, or in the analysis of very
different types of samples.
The assumption of homogeneous variance will always result in the greatest degrees of freedom,
resulting in more powerful statistical tests (which are able to label smaller differences as
“significant”). If sample sizes and the standard deviations are comparable (nx = ny and sx = sy),
then eqn. 6.7 yields a value of ν that is comparable in magnitude to nx+ny−2, the value obtained
when the data is pooled. The following rule of thumb expresses this idea
• if the sample sizes are equal (nx = ny) and the population variances do not differ by a factor of
more than three (in other words, σx < 3σy, where group x is the group with the greater
variability), then homogeneous variance may be assumed with reasonable justification.
On the other hand,
• when the sample size is different, the most serious case is when the group with the smaller
sample size is associated with the larger variance: in other words, if nx < ny and σx > σy . In
this case we must use equations 6.6 and 6.7 in calculating the sample statistic.
If we make an incorrect decision (homogeneous vs nonhomogeneous) then our critical values,
and the calculated P-value, are not quite what they should be. If we assume homogeneous
variance incorrectly, then the confidence level of our test procedure is actually not as high as we
think it is; in other words, the probability of false positive (α) is higher than we want. We can
minimize this effect by designing the experiment so that nx = ny (if possible). In any event, this
consequence is more serious than incorrectly assuming nonhomogeneous variance, which results
in a more conservative test (lower α) than we expect.
Based on this argument, a very conservative approach would be to always assume
nonhomogeneous variance. However, this approach is needlessly restrictive, since there are many
situations where we can be reasonably certain that the measurement precision of two data sets is
the same. If we assume nonhomogeneous variance in these cases, we are not making full use of
our data, and our test will not be as powerful.
In example 6.3, we assumed nonhomogeneous variance and accepted H0 for a test at the 95%
level. Let’s try this example again, this time making the reasonable assumption of equal
measurement precision.
spool = √[(5·sfirst² + 3·ssecond²)/8] = 0.0710 %      (we can pool our variances now)
sd = spool·√(1/nfirst + 1/nsecond) = 0.0458 %         (revised estimate of the std error)
Tobs = d/sd = 2.4002      (follows a t-distribution with 8 degrees of freedom)
Pobs = 0.0432             (the two-tailed P-value indicates that we can reject H0 at the
                           95% level. Let's be sure.)
Since we now have 8 degrees of freedom, the critical values are ±2.3060; since Tobs is outside of
this range, we may reject H0 and accept H1 at the 95% level. For this particular case, the decision
of homogeneous vs nonhomogeneous variance changes our conclusion. Note that the
P-value changed from 0.0705 to 0.0432, which doesn't seem a remarkable change; it is enough,
however, to make a difference when testing at the 95% level.
Paired Measurements
Throughout the last section, we assumed that there was no correlation between the two groups
of measurements (i.e., that they were independent). Now consider the following example:
Example 6.4
A new flame atomic-absorption spectroscopic method of determining antimony in the
atmosphere was compared with the recommended colorimetric method. For samples from an
urban atmosphere, the following results were obtained:
antimony found (mg/m³)
sample #    new method    standard method
1           22.2          25.1
2           19.2          19.5
3           15.7          16.6
4           20.4          21.3
5           19.6          20.7
6           15.7          16.8
Do the results obtained by the two methods differ significantly?
This is a good example of paired measurements, when two measurements are made on the same
object. The two sets of values from paired measurements are often correlated, particularly when
some changing property of the object strongly affects both measurements.
In this particular example, we are comparing the ability of two methods to analyze Sb.
Presumably the standard method has been well characterized, and is free of measurement bias.
The new method may be cheaper, faster and/or more sensitive, but we need to test it against the
standard method to ensure that it gives comparable (i.e., bias-free) results. This has been done by
measuring [Sb] in a number of samples, each of which may contain different levels of analyte,
using both methods.
Each measurement contains random error, but it is also strongly affected (we hope!) by the
concentration of analyte in the sample. This being the case, it is likely that there will be
significant correlation between the two methods: when a measurement by one method gives a
high result, a measurement by the other method on the same sample will also tend to give a high
result, since [Sb] for the sample is presumably relatively high.
The sample linear correlation coefficient between the two sets of measurements is r = 0.9716.
Recall that truly independent variables will have ρ = 0, and that values of ρ close to one indicate
a strong linear relationship between variables − a fact that is confirmed by a glance at the
following plot of the measurements.
[scatter plot: results from standard method (mg/m³) versus results from new method (mg/m³)]
Figure 6.1: a plot of the “paired” measurement data − a strong linear correlation between
the measurement pairs is evident. Note that this is a good example of how correlation
does not imply causation: just because two variables are correlated doesn’t mean that one
variable “causes” the other to occur. In this case, these variables are affected by a third
factor − the concentration of Sb in the sample.
So how do we test whether measurements from the two methods differ significantly? Let's define a
variable e such that

ei = xi − yi

where xi and yi are the ith measurement pair. In other words, the e values are the differences
between the paired measurements. Let's calculate the values of e observed in example 6.4:
x = (22.2, 19.2, 15.7, 20.4, 19.6, 15.7) mg/m³      (new method)
y = (25.1, 19.5, 16.6, 21.3, 20.7, 16.8) mg/m³      (standard method)
e = x − y = (−2.9, −0.3, −0.9, −0.9, −1.1, −1.1) mg/m³
Looking at the measurements, and the values of e, we see that the second set of measurements,
using the standard analytical method, gave results that were consistently larger than those of the
new method. This observation suggests that the methods give different results for the same
samples, but we need to test whether the difference is statistically significant.
Even though the original two sets of data, x and y, are correlated, the difference between them is
a random variable whose values are not correlated with one another. We have changed the
problem back to one with which we are more familiar. The null and alternate hypothesis will be
H0: µe = 0
“the methods give the same results”
H1: µe ≠ 0
“the methods give different results (2-tailed)”
Let's test at the 90% confidence level. We can use

T = ē/s(ē)

as our test statistic; by the central limit theorem, ē will tend to be normally distributed, even if e
does not follow a normal probability distribution. Note that if the random errors in the x and y
measurements are normally distributed, then e will be normally distributed.
Let's solve the problem now, shall we? First let's calculate the observed value of the test statistic:

ē = mean(e) = −1.2000 mg/m³
std_err = stdev(e)/√6 = 0.3606 mg/m³
Tobs = ē/std_err = −3.3282
Remember that if e is normally distributed, then the null distribution of T will follow a Student's
t-distribution, and we find our critical values from the t-tables: Tcrit = ±2.0150 for a two-tailed
test at the 90% confidence level. Since the observed value Tobs = −3.3282 falls outside this range,
we may reject H0 and accept H1. Thus, we can state with 90% certainty that there is a significant
difference in the measurements yielded by the two methods. Note that the observed P-value is
Pobs = 0.02082, so we can prove H1 at the 97.9% level from this data.
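For reference, here is a Python sketch of the paired test (my illustration; the variable names are mine):

```python
import numpy as np
from scipy import stats

new = np.array([22.2, 19.2, 15.7, 20.4, 19.6, 15.7])   # mg/m3, new method
std = np.array([25.1, 19.5, 16.6, 21.3, 20.7, 16.8])   # mg/m3, standard method

e = new - std                                      # paired differences
T = e.mean() / (e.std(ddof=1) / np.sqrt(len(e)))   # -3.3282

t_crit = stats.t.ppf(0.95, df=len(e) - 1)       # two-tailed at 90% confidence -> 2.0150
p_obs = 2 * stats.t.sf(abs(T), df=len(e) - 1)   # two-tailed P-value -> 0.0208
print(T, t_crit, p_obs)

# scipy's paired t-test gives the same statistic and P-value:
print(stats.ttest_rel(new, std))
```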
Comparison of Many Measurement Means
Note to students
We did not cover this in class, and you will not be expected to know it. This section is just a
“placeholder” to introduce this (important) subject in a future version of this class.
• introduction - mention the term analysis of variance (ANOVA), and that this is only the tip of
the iceberg
• multiple sets of measurements. Illustrate by example. Restrictions on this test: (i)
homogeneous variance, and (ii) same # of measurements in each set
• solve the problem, explaining the entire time
Comparison of Measurement Precision (F-test)
Let’s say that we have two independent sets of measurements, x and y, and we want to compare
the precision (i.e., the sample standard deviations) of the two groups. In this section, we will
describe the test that can be used in this situation. Note that this test can be used to disprove (but
never to prove!) homogeneity of variance.
Hypotheses
H0: σx = σy      "two sets of measurements have the same precision"
H1: σx ≠ σy (2-tailed), or
    σx > σy (1-tailed), or
    σx < σy (1-tailed)
Test Statistic
We have two sets of measurements, x and y; we will use as our test statistic the ratio of the sample
variances of the two sets:

T = sx²/sy²

If we assume that the x and y measurements follow a normal distribution, then the probability
distribution of T is called an F-distribution. The actual distribution of T will depend on the
degrees of freedom in sx and sy, so that there is a family of F-distributions, F(νx,νy). The null
hypothesis is that σx = σy, so that the null distribution of T will be an F(νx,νy)-distribution with a
mean µF = 1.
The appendix contains tables with critical values for the F-test, which is used to compare
measurement precision. These tables are for right-tailed areas, so that in constructing T, we must
be sure to put the larger variance in the numerator: in the F-test, the test statistic, T, must be
larger than one. Of course, for two-tailed tests, the area in the right-tail of the F-distribution
must be α/2, so that (for example) we would use the 97.5% F-tables for two-tailed testing at the
95% confidence level.
Example 6.5
One analyst who analyzed a sample 16 times calculated a sample standard deviation of 3.07.
Another analyst measured the same sample 11 times and calculated a standard deviation of
1.20. Are we justified in assuming that the second analyst is more precise than the first?
The null and alternate hypotheses are
H0: σx = σy
“analysts have the same measurement precision”
H1: σx > σy
“analyst y is more precise than analyst x (1-tailed test)”
Let’s test at the 95% level. Our test statistic will be
T = sx²/sy²
Note that the numerator must be larger than the denominator. Let’s calculate the observed value
of the test statistic:
sx = 3.07      (n = 16; 15 degrees of freedom)
sy = 1.20      (n = 11; 10 degrees of freedom)
Tobs = sx²/sy² = 6.5451
The null distribution will be an F(15,10)-distribution with a mean of one. Since we are interested
in examining whether or not the y measurements are significantly more precise than the x
measurements, we will obtain a one-tailed critical value on the right side of the null distribution.
The critical value turns out to be Tcrit = 2.85. Thus, we may reject H0 and accept H1 at the 95%
level. We can state with 95% certainty that analyst 2’s measurements are more precise than
analyst 1’s measurements.
For this example, the observed P-value (1-tailed) is Pobs = 0.002452. Thus, we could test at the
99.7% confidence level and still accept H1.
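If you have software at hand rather than F-tables, the critical value and P-value can be computed directly; a sketch of mine (not from the text):

```python
from scipy import stats

s_x, nu_x = 3.07, 15   # first analyst:  n = 16
s_y, nu_y = 1.20, 10   # second analyst: n = 11

T = s_x**2 / s_y**2                      # larger variance in the numerator -> 6.5451
F_crit = stats.f.ppf(0.95, nu_x, nu_y)   # one-tailed critical value at 95% -> 2.85
p_obs = stats.f.sf(T, nu_x, nu_y)        # one-tailed P-value -> 0.0025
print(T, F_crit, p_obs)                  # T > F_crit, so reject H0
```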
Hypothesis Testing in Regression Analysis
We have applied the z-test and t-tests to measurement means; however, these tests can be used
for any normally-distributed variable. The defining component of the z- and t-test is that the test
statistic follows a z- or t-distribution, respectively. We can easily use either of these tests in
regression analysis, as will now be demonstrated.
Hypothesis testing in regression analysis usually involves the least-squares estimates in linear
regression. In chemistry, one can easily imagine cases in which you would want to
• compare the value of the slope to some theoretical or established value.
• test to see if the slope is significantly different than zero. This is a way to eliminate extra
independent variables: if the slope is not significantly different than zero, then you can make a
case for stating that there is no discernible effect of x on µy.
• test to see if the intercept is significantly different than zero. In quantitative analysis, for
example, this test can be used to check for the presence of background (“additive”)
interferences.
If we can assume homogeneous variance in the regression data, we can easily calculate the
standard deviation of the least-squares estimates; the equations in chapter 4 can be used for the
first-order linear model. Further, if we assume that the measurement error of the dependent
variable is normally distributed, then the least-squares estimates themselves follow a normal
distribution. Thus, we can form a test statistic that can be used in a t-test (or a z-test, if the
magnitude of the homogeneous variance, σ, is known), as shown in the following example.
Example 6.6
Municipal drinking water is usually fluoridated during water treatment processes. A F− ISE
electrode is to be used to measure the fluoride concentration in a drinking water sample,
using a method based on the measurement of the voltage response of the ISE to a set of
calibration standards (i.e., the calibration curve method). The following data was collected:
[F−] (ppm)    log[F−]    signal (mV)
10.55         1.0232     87.63
50.21         1.7008     62.18
130.2         2.1146     29.63
361.2         2.5577     0.22
682.3         2.8339     −10.39
unknown                  39.28
(a) Construct a calibration curve and obtain a confidence interval for the slope. Also use the
calibration curve to obtain an estimate of the measurement noise on each measurement.
Remember that the Nernst equation predicts a linear relationship between voltage and the
logarithm of the concentration.
(b) The theoretical value of the slope is −59.2 mV/decade. Is there a significant
difference between the theoretical and measured sensitivities?
part (a)

Sxy = Σ xi·yi − 5·mean(x)·mean(y) = −117.1428
Sxx = 4·var(x) = 2.0527
b1 = Sxy/Sxx = −57.0682                    (LS estimate of the slope)
b0 = mean(y) − b1·mean(x) = 150.6198       (LS estimate of the intercept)
res = y − (b1·x + b0)
sres = √(Σ resi²/3) = 6.2105               (estimate of the homogeneous measurement noise)
s_b1 = sres/√Sxx = 4.3348                  (std error of the slope)
t = 3.1824                                 (from the t-tables: 95% confidence, 3 degrees of freedom)
t·s_b1 = 13.795
95% CI for slope: −57 ± 14 mV/decade
part (b)

Simply by examining the confidence interval of the slope from part (a), we can see that there
is no significant difference between this value and the theoretical value. But let's do the test
anyway.

H0: β1 = −59.2      H1: β1 ≠ −59.2      (let's test at the 90% level)

Tobs = (b1 − (−59.2))/s_b1 = 0.4918      (difference between the values, in units of std deviations)

The 2-tailed critical value is ±2.3534 (3 degrees of freedom). Since the observed test statistic is
between the two critical values, we must accept H0. Thus, we cannot prove with 90% certainty that
the measured slope is significantly different than the theoretical value.
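The whole regression test can be reproduced with a short script; this is my sketch of the calculation in Python (numpy/scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.array([1.0232, 1.7008, 2.1146, 2.5577, 2.8339])   # log[F-]
y = np.array([87.63, 62.18, 29.63, 0.22, -10.39])        # signal, mV

n = len(x)
Sxx = np.sum((x - x.mean())**2)                       # 2.0527
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx    # slope -> -57.07
b0 = y.mean() - b1 * x.mean()                         # intercept -> 150.6

res = y - (b1*x + b0)                        # residuals of the linear fit
s_res = np.sqrt(np.sum(res**2) / (n - 2))    # measurement noise -> 6.21
s_b1 = s_res / np.sqrt(Sxx)                  # std error of the slope -> 4.33

T = (b1 - (-59.2)) / s_b1              # studentized deviation from theory -> 0.49
t_crit = stats.t.ppf(0.95, df=n - 2)   # two-tailed at 90% confidence -> 2.3534
print(T, t_crit)                       # |T| < t_crit, so accept H0
```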
Outlier Testing
Introduction
Have you ever had a data set in which one particular measurement seemed to stick out like a sore
thumb? Consider the following sequence of measurements (which might, for example, be
measurements of analyte concentration in a single sample):

24, 21, 62, 16, 19
The third measurement, 62, doesn’t seem to belong with the others. Its presence has a serious
effect on the value of sample statistics, including sample mean, sample standard deviation, and
any test statistics for hypothesis testing. Measurements that are “surprising” or “unexpected” in
some way are called outlier measurements, or simply outliers (because the measurement “lies
outside” the rest of the data).
What do we do with outliers? There are three basic tactics: accommodation, investigation and
rejection.
• outlier accommodation
Outliers are a concern because they can have a huge effect on sample statistics (and hence
confidence intervals and hypothesis tests). So we just choose sample statistics that are not as
sensitive to the presence of outliers; these are called robust statistics. An example is the sample
median, x̃ , as an indicator of location. For symmetric distributions, x̃ is an unbiased estimate of
µx; however, the standard error of the sample median is somewhat larger than that of the sample
mean, and so it is not considered as good an estimate of µx.
However, x̃ is much less affected by outliers than x̄. Let's consider the above data set, with and
without the outlier:

with the outlier:       x = (24, 21, 62, 16, 19)      mean(x) = 28.4      median(x) = 21.0
without the outlier:    x = (24, 21, 16, 19)          mean(x) = 20.0      median(x) = 20.0
As you can see, removing the outlier had a much greater effect on the sample mean than on the
sample median, which is more “tolerant” of the outlier. There are a number of robust estimates of
location (e.g., µx) and dispersion (e.g., σx); an interesting example of the latter is the median
deviation, sm:
sm = median{|xi − x̃|}
The use of such robust statistics is becoming more common. Robust linear regression is another
useful subject along these lines.
• outlier investigation
At the very least, the occurrence of an outlier sends most people scrambling for their laboratory
books, or has them racking their brains to remember if there was something done differently
when that particular measurement was obtained! The first thought that crosses most people’s
mind is that there must be some gross error in that particular measurement. Therefore the
measurement procedure comes under scrutiny to try to explain the presence of the supposed
“error.”
In many situations, however, the outlier does not signify a gross error at all, but rather is a cause
for concern about the assumptions of a particular data-generating mechanism. For example,
consider the measurements in the following plot, which shows a calibration curve such as might
be observed in quantitative analysis:
[calibration plot: Signal versus Concentration]
Figure 6.2: a typical calibration curve in quantitative analysis, where instrument response
(the “signal”) is plotted as a function of analyte concentration. The circled data point is an
outlier; however, it might not be due to any “gross error,” but a failure of the linear model
for that particular data point.
The last data point is an outlier − it doesn't fall on the line, like the first five points. However, it
may be that the dependence of signal on concentration is not linear at high concentrations: the
model is not valid for this last data point.
The point is, don’t just assume that an outlier is “wrong.” Very often, the experiment that
generated the outlier might need to be repeated; if outliers still occur, then the assumptions about
the nature of the experiment might need to be modified. In the calibration curve shown in figure
6.2, if more data points were collected at higher concentrations, the line might be seen to “curve
off” for these measurements.
Finally, the last common response to outliers is:
• outlier rejection
Many (some might say most) researchers simply delete outliers from the data set, with or without
justification. Certainly if some gross error in the outlying data is identified, then deletion of the
data is justified. All too often, however, an experimenter will adopt a “delete and repeat” attitude
to outliers. Don’t do this. The data you are rejecting may very well be representative of the
population you are trying to sample, and by the “delete and repeat” approach, you are simply
pruning the data to match your expectations more closely. One purpose of including a section on
outliers in a chapter on hypothesis testing is to provide some measure of statistical justification
for outlier rejection, if that's what you want to do.
Identification of Contaminants
Let's clarify two closely related concepts through the following definitions¹:
• an outlier is a measurement that is surprising or unexpected in some respect.
• a contaminant is a measurement that is generated by some different data-generating
mechanism than the bulk of a data set. What this means in practice is that contaminants belong
to a different probability distribution than the rest of the data.
Not all outliers are contaminants, just as all contaminants do not manifest themselves as outlying
measurements. Consider the following figure:
[diagram: overlapping probability distributions of "normal" and contaminant data points, with the
outlying data marked]
Figure 6.3: distinction between contaminants and outliers. All of the data in this group of
measurements originates either from the “normal” probability distribution or the
“contaminating” probability distribution. The three circled data points might be identified
as “outliers” in this data set. However, only two of these points are true contaminants in
the sense of belonging to the contaminating probability distribution. In addition, not all
contaminant data points are identified as outliers.
In general, there are two distinct phases to dealing with outliers, particularly when deciding
whether or not to reject them from a data set:
¹ The distinction between outlier (a "surprising" measurement) and contaminant (a measurement
from a different probability distribution) is an important one. However, the terms "outlier" and
"contaminant" are not universally used in this manner. Many texts refer to discordant values as
"potential" outliers and contaminants as "true" outliers.
1. Identification of outliers. For univariate data, this means sorting the data into ascending
order; any outliers will be at either the lower or upper extreme of the data set. For
multivariate data (particularly when more than two variables are involved), outlier
identification is sometimes not nearly so straightforward.
2. Testing of outliers. This is a process, sometimes called a discordancy test, of deciding
whether to label one or more outliers as contaminants. For truly objective outlier testing, the
principles of hypothesis testing must be applied.
In the hypothesis testing of a single outlier, the general forms of the null and alternate hypotheses
are
H0: the outlier belongs to the same probability distribution as the rest of the data (i.e., the outlier
is not a contaminant).
H1: the outlier does not belong to the same probability distribution as the rest of the data (the
outlier is a contaminant).
Of course, hypotheses involving specific probability distributions in the null and alternate
hypotheses can (and usually are) adopted. For our purposes, however, these generic statements
will do just fine.
After deciding on the appropriate confidence level, the next step in outlier testing would be to
choose a test statistic. There is a bewildering array of statistics from which to choose;
interestingly, however, most commercial statistical programs, such as SAS or SPSS, do not treat
outlier testing in a comprehensive or consistent manner. Likewise, outlier testing is not a topic
that is covered in most beginning statistical textbooks. This omission is a little puzzling, to say
the least, since any scientist who deals with quantitative data must eventually deal with outliers.
A good reference is: V. Barnett, T. Lewis, “Outliers in Statistical Data,” 3rd ed, Wiley Series in
Probability and Mathematical Statistics, 1995; much of the material in these notes, as well as the
tables for outlier testing in the Appendix, are from this source.
Table 6.1: five test statistics (labeled T1–T5) for the testing of univariate, normally distributed data;
see the appendix for critical values. See the text for more detail.

"t-like" test statistics:
• T1 = (xn − x̄)/sx for an upper outlier, or T1 = (x̄ − x1)/sx for a lower outlier.
  Tests a single outlier; sometimes called the Tn test.
• T2 = Σ(xi − x̄)/sx for k upper outliers, or T2 = Σ(x̄ − xi)/sx for k lower outliers,
  with the sum taken over the k outliers. Block test for k upper or lower outliers.
• T3 = (xn − x1)/sx. Block test for one upper and one lower outlier.

"Dixon-like" statistics:
• T4 = (xn − xn−1)/(xn − x1) for an upper outlier, or T4 = (x2 − x1)/(xn − x1) for a lower outlier.
  Tests for a single outlier; sometimes called the Q test.
• T5 = (xn − xn−2)/(xn − x2) for an upper outlier, or T5 = (x3 − x1)/(xn−1 − x1) for a lower outlier.
  Tests for a single outlier; a form of the Q-test that provides some protection from masking.
Barnett and Lewis present a large number of test statistics for a variety of purposes: by my count,
at least 56 different test statistics are presented for univariate data, along with the appropriate
distribution tables. From these I have culled 5 test statistics that will be enough to get us started.
These statistics are collected in the following table for easy reference, along with a brief
description; we will encounter and use all of them in the following discussion.
It is worth mentioning at this point that an outlier can be classified as either an upper or lower
outlier, depending on its position with respect to the main mass of data. The form of a particular
test statistic will be slightly different in testing upper or lower outliers.
The five test statistics have been grouped into two classifications: the three “t-like” statistics and
the two “Dixon-like” statistics. The t-like statistics are all studentized deviations of some sort,
where the deviations follow a normal distribution (assuming that the original data itself is
normally distributed). The Dixon-like statistics get their name from the man who first proposed
their use; they are easy to remember and calculate, and are widely used in analytical chemistry.
The test statistics in the table are used to test outlying data points under the assumption (i.e., the
null hypothesis) that the data should be described by a normal probability distribution. The
identity of the distribution assumed in the null hypothesis is important, since it will determine the
null distribution of the test statistic − indeed, it will likely determine which test statistic is best
used in the hypothesis test. The assumed distribution also affects outlier identification, since
values that might be considered to be outliers from a normal distribution might not be considered
so surprising from, say, exponential or Poisson distributions.
Testing Single Outliers
A Simple Test
Let’s take a few of these outlier test statistics out for a spin.
Example 6.7
Test the following measurements for contaminants: 18, 23, 55, 13, 17, 20, 14
First, let’s sort the seven measurements: x = 13, 14, 17, 18, 20, 23, 55. Obviously, the
measurement value 55 is an upper outlier, which we will test for discordancy.
H0: all the measurements are sampled from a population described by a single normal
probability distribution.
H1: the largest value is a contaminant: it originates from a different probability distribution than
the rest of the measurements.
We will use the following statistics to test H1:
T1 = (xn − x̄)/sx      T4 = (xn − xn−1)/(xn − x1)      [6.8]
The first statistic is sometimes referred to as Tn, while the second is usually called Dixon’s Q by
analytical chemists. T1 is the studentized deviation of the outlier from the sample mean; we
would expect large values to be indicative of data contamination. Dixon-like statistics such as T4
will always be a fraction; as the outlier becomes more separated from the main body of data, the
value of a Dixon-like statistic will approach unity.
Let's calculate the values of these test statistics. [Note: I use the software package MathCAD to
work out these problems; the results below are reproduced from those worksheets using the
notation of table 6.1. If you become confused, work out the problem yourself, using the
expressions given in table 6.1; you should get the same numbers for the test statistics.]
x = (18, 23, 55, 13, 17, 20, 14)
sorted: x = (13, 14, 17, 18, 20, 23, 55)

T1,obs = (x7 − mean(x))/stdev(x) = 2.2051      critical values: 1.94 (95%) and 2.10 (99%)
T4,obs = (x7 − x6)/(x7 − x1) = 0.7619          critical values: 0.507 (95%) and 0.637 (99%)
Comparing the observed values of either test statistic results in rejection of the null hypothesis at
the 99% confidence level. Thus, we can be 99% certain that the largest measurement value is a
contaminant.
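Both statistics are a one-liner to compute; here is a sketch of mine in Python (the function names are not from the text, and the critical values must still come from the tables):

```python
import numpy as np

def t1_upper(x):
    """T1: studentized deviation of the largest value from the mean (the Tn test)."""
    x = np.sort(np.asarray(x, dtype=float))
    return (x[-1] - x.mean()) / x.std(ddof=1)

def t4_upper(x):
    """T4: Dixon's Q for an upper outlier."""
    x = np.sort(np.asarray(x, dtype=float))
    return (x[-1] - x[-2]) / (x[-1] - x[0])

x = [18, 23, 55, 13, 17, 20, 14]
print(t1_upper(x), t4_upper(x))   # 2.2051 and 0.7619; compare with the tabulated
                                  # critical values for n = 7 (1.94/2.10 and 0.507/0.637)
```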
Important Warning!
The null distributions (i.e., the critical values) of all of the test statistics have been calculated by
assuming that the measurements should be normally distributed. The labeling of an outlier as a
“contaminant” using these critical values rests upon the assumption that the outlier isn’t likely to
be from the same population as the other measurements, assuming those other measurements are
normally distributed. The critical values would be much different if the measurements were
distributed according to, for example, an exponential distribution.
Instead of deciding that the outlier is from a different probability distribution, another possible
explanation for the outlier is that the measurements are not normally distributed at all, but are
instead described by a probability distribution with “heavier tails” (i.e., a higher probability of
extreme values) than a normal distribution. This is one reason that you must always be very
careful in deleting measurements that are labeled as contaminants. A better approach is to use
robust statistics, or simply to collect more measurements (remember that the sample mean will
always tend towards a normal distribution as n increases).
A Complication: Outlier Masking
One problem in testing for a single outlier is susceptibility to the phenomenon of masking. The
following example illustrates the problem.
Example 6.8
Test the following measurements for contaminants: 18, 23, 55, 38, 17, 20, 14
x = (18, 23, 55, 38, 17, 20, 14)
sorted: x = (14, 17, 18, 20, 23, 38, 55)
H0: all the measurements are described by the same normal probability distribution.
H1: the largest measurement is a contaminant.
Note: this set of measurements is almost identical to the measurements in the last example; only
a single value has changed.
T1,obs = (x7 − mean(x))/stdev(x) = 1.9280      critical values: 1.94 (95%) and 2.10 (99%)
T4,obs = (x7 − x6)/(x7 − x1) = 0.4146          critical values: 0.507 (95%) and 0.637 (99%)
For both of these test statistics, the second-largest measurement value has masked the largest
value. The largest value is not identified as a contaminant by either test statistic at the 95%
confidence level. Dixon’s Q statistic (i.e., T4) is especially susceptible to masking effects,
although the Tn statistic (i.e., T1) is not too much better.
Masking can occur when a second outlier is observed with a value similar to the first. In this
example, the second largest value has the effect of decreasing the denominator of both test
statistics; it also increases the numerator of T1.
One solution to this problem is to use test statistics that are less sensitive to masking. There are a
number of Dixon-like statistics for this purpose; here is a good general purpose statistic to protect
against masking (see table 6.1 for the corresponding statistic for a lower outlier).
T5 = (xn − xn−2)/(xn − x2)
In this case, the numerator is the difference between the outlier and its second-nearest neighbor;
no masking is observed by the measurement value xn−1, as we can see:
T5,obs = (x7 − x5)/(x7 − x2) = 0.8421      critical values: 0.780 (95%) and 0.885 (99%)
The T5 Dixon-like statistic identifies the largest value as a contaminant at the 95% confidence level
(but not at the 99% confidence level).
Masking can also occur when an upper and a lower outlier are observed in the same data
set, as seen in the following example.
Example 6.9
Test the following measurements for contaminants: 18, 39, −15, 17, 20, 14
Let’s work out this problem (remember that indexing in MathCAD begins at 0, not 1).
x = (18, 39, −15, 17, 20, 14)
sorted: x = (−15, 14, 17, 18, 20, 39)
It looks like both the smallest and the largest values are outliers. Let's test the smallest value to see
if it is a contaminant.
H0: all the measurements are described by the same normal probability distribution
H1: the smallest value is a contaminant
T1,obs = (mean(x) − x1)/stdev(x) = 1.7530      critical values: 1.82 (95%) and 1.94 (99%)
T4,obs = (x2 − x1)/(x6 − x1) = 0.5370          critical values: 0.560 (95%) and 0.698 (99%)
The observed values of both of these statistics is too small to allow rejection of H0 at the 95% level.
The upper outlier has masked the lower outlier. Let's try the Dixon-like T5.
T5,obs = (x3 − x1)/(x5 − x1) = 0.9143      critical values: 0.872 (95%) and 0.995 (99%)
T5 allows rejection of H0 at the 95% (but not 99%) confidence level. Thus, we can state with 95%
confidence that the lowest value is a contaminant.
From this example, we see that the upper outlier can mask the lower outlier (and vice versa, if we
are testing the upper outlier). As before, the Dixon-like T5 statistic is not vulnerable to this
effect; it protects against masking of a lower outlier x1 by either x2 (the nearest neighbor) or xn (an
upper outlier). Likewise, T5 protects against masking of an upper outlier xn by either xn−1 or x1.
Testing Multiple Outliers
Introduction
Outlier masking happens because there are really two (or more) outliers, and the T1 and T4
statistics work best when testing data sets with a single outlier. If a data set contains more than
one outlier, we should modify our approach somewhat, depending on the nature of the outliers.
Let’s list how multiple outliers might occur: we may have a data set with
• two or more upper outliers
• two or more lower outliers
• a combination of one (or more) upper outlier(s) and one or more lower outlier(s).
Let’s say that k outliers have been identified in a data set, where k > 1. There are two approaches
to testing multiple outliers:
1. the consecutive testing approach, where a test statistic such as T1 or T4 is applied repeatedly
to a data set (one outlier at a time); or
2. the block testing approach, where a statistic simultaneously tests all k outliers in the data set.
Consecutive Testing of Multiple Outliers
Consecutive testing is conceptually easy, and can occur in two varieties: the inward approach,
and the outward approach. In the inward approach, the most extreme outlier is tested; if it gives a
positive test, it is removed from the data set, and then the most extreme remaining outlier is
tested (we are moving “inward” towards the sample mean). This procedure is repeated until all
the outliers are tested, or until an outlier gives a negative test. The following flow chart illustrates
the procedure in testing for multiple upper outliers.
[flow chart: Data Set → test largest value → contaminant? Yes: remove it from the data set and
test the next largest value; No: stop the testing procedure]
Figure 6.4: The inward consecutive outlier testing procedure. A test statistic like T1 or
T4 is used to test measurements one at a time as we move from the most extreme outlier
“inward” toward the mean of the data set. Note that contaminants are not permanently
removed from the data set, but only temporarily removed for the purposes of the testing
procedure. In other words, the purpose of the test is to label measurements as
contaminants, not to make a decision to permanently reject these values.
The advantage of the inward approach is that it is not necessary to determine the value of k prior
to testing; indeed, it is not even necessary to identify the outliers at all. However, the biggest
disadvantage is the susceptibility of the procedure to masking effects. T1 and T4 are certainly
fairly poor candidates for this type of testing procedure.
The outward approach is as follows: after identifying the k outliers, the “innermost”
(least-extreme) outlier is tested, with the remaining outliers removed from the data set. If this
inner outlier gives a positive test, then all of the remaining outliers are considered to be
contaminants. If, however, the inner outlier gives a negative test, then the next innermost outlier
is tested, again with the more extreme outliers removed from the data set. This procedure is
repeated until all of the outliers are tested (moving “outward” throughout the procedure), or until
the first positive test, whichever comes first. When an outlier gives a positive test, that outlier,
and all of the more extreme outliers, are considered to be contaminants. The flow chart in figure
6.5 describes the outward consecutive testing process.
[flow chart: Data Set, with all but the least extreme outlier removed → test the least extreme
outlier → contaminant? No: add the next most extreme outlier to the data set and repeat; Yes: stop
the testing procedure]
Figure 6.5: The outward consecutive outlier testing procedure. A test statistic like T1 or
T4 is used to test measurements one at a time as we move from the least extreme outlier
“outward” from the mean of the data set. Initially, all but one outlier is removed from the
data set, and these are added until a positive test result (or until we run out of outliers).
Again, measurements are not permanently removed from the data set, but only
temporarily removed for the purposes of the testing procedure.
The advantage of this procedure over the outside-in approach is the reduced susceptibility to
masking; the test of the innermost outlier should not exhibit masking at all. However, the k
outliers must be identified beforehand. Another, more fundamental, objection is that removing
the extreme outliers amounts to outlier rejection before the outliers have been tested at all. The
data subset that consists of only the more central data points might not be representative of the
parent population, and the outlier test would be more apt to give a positive result.
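A matching Python sketch of the outward procedure, under the same caveats:

from statistics import mean, stdev

def t1(data):
    # T1: studentized deviation of the largest value from the sample mean
    return (max(data) - mean(data)) / stdev(data)

def outward_test(data, k, critical):
    # Outward consecutive test for k upper outliers using T1.  Start from
    # the core data plus the innermost outlier and work outward; the first
    # positive test labels that outlier and all more extreme values as
    # contaminants.
    ordered = sorted(data)
    n = len(ordered)
    for j in range(1, k + 1):
        subset = ordered[:n - k + j]  # core data plus the j innermost outliers
        if t1(subset) > critical:
            return ordered[n - k + j - 1:]  # this outlier and everything beyond
    return []  # every test was negative

For instance, outward_test([18, 23, 55, 38, 17, 20, 14], 2, 1.82) returns [38, 55], in agreement with example 6.10 below.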
Let’s look at the next example to see how outward (inside-out) consecutive testing works.
Example 6.10
Test the two outliers of the data set in example 6.8 using a consecutive test procedure.
From the answer to example 6.8, we already know that the inward consecutive test procedure
using either T1 or T4 would fail to identify any contamination, since neither statistic could give a
positive test for the most extreme outlier value. So let’s try the outward test procedure.
x := (18 23 55 38 17 20 14)^T
x := sort(x)
x^T = (14 17 18 20 23 38 55)
The largest two values are upper outliers. Let's create a subset of this data that omits the largest
outlier so that we may test the "inner" upper outlier.
y := submatrix(x, 0, 5, 0, 0)
y^T = (14 17 18 20 23 38)
H0: all the measurements in this subset of the data are described by the same normal probability distribution
H1: the largest value in this data subset is a contaminant
T1obs := (y[last(y)] - mean(y)) / stdev(y)
T1obs = 1.9108        critical values: 1.82 (95%) and 1.94 (99%)

T4obs := (y[last(y)] - y[last(y)-1]) / (y[last(y)] - y[0])
T4obs = 0.6250        critical values: 0.560 (95%) and 0.698 (99%)
In both cases, the innermost outlier tests positive at the 95% level. Thus, using the outward
consecutive test procedure, we identify both the upper outliers as contaminants at the 95% level.
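As a quick numerical check of this example (an illustrative Python snippet, not part of the original worksheet):

from statistics import mean, stdev

y = [14, 17, 18, 20, 23, 38]               # subset with the largest outlier removed
t1_obs = (y[-1] - mean(y)) / stdev(y)      # studentized extreme deviation
t4_obs = (y[-1] - y[-2]) / (y[-1] - y[0])  # Dixon-type gap ratio
print(round(t1_obs, 4), round(t4_obs, 4))  # prints: 1.9108 0.625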
A problem with any consecutive test procedure is in the interpretation of the confidence level,
because the confidence levels associated with the critical values of the test statistics T1, T4 and
T5 have been determined for the testing of a single outlier. A conservative approach is advisable
in choosing the confidence level for consecutive tests, particularly if the goal is to identify
contaminants for subsequent removal from the data set.
Block Testing of Multiple Outliers
An alternative to consecutive testing is block testing, where a statistic is used to test all k outliers
at once. T2 is a t-like statistic used to test for k upper or k lower outliers:
k upper outliers:   T2 = Σ(x_i - x̄) / s_x   (sum over the k largest values)
k lower outliers:   T2 = Σ(x̄ - x_i) / s_x   (sum over the k smallest values)
The T2 statistic is the studentized sum of the deviations of all k outliers from the mean.
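In Python, T2 for upper outliers might be sketched as follows (illustrative only; the function name is mine):

from statistics import mean, stdev

def t2_upper(data, k):
    # T2: studentized sum of the deviations of the k largest values
    # from the mean of the full data set
    m, s = mean(data), stdev(data)
    return sum(x - m for x in sorted(data)[-k:]) / s

# e.g. t2_upper([18, 23, 55, 38, 17, 20, 14], 2) gives about 2.709
# (cf. example 6.11 below)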
Example 6.11
Test the two outliers of the data set in example 6.8 using a block test procedure.
x := (18 23 55 38 17 20 14)^T
x := sort(x)
x^T = (14 17 18 20 23 38 55)
H0: all measurements are described by the same normal probability distribution
H1: the two largest measurements are contaminants
We will use a block-testing procedure.
T2obs := ((x[last(x)] - mean(x)) + (x[last(x)-1] - mean(x))) / stdev(x)
T2obs = 2.7088
The critical values are T2(.05) = 2.66 and T2(.01) = 2.79.
Thus, we can reject H0 and accept H1 at the 95% level: we can say with 95% confidence that
the two upper outliers are contaminants.
The statistic T3 is another block-testing statistic that is intended to be used in situations where an
upper and lower outlier are both present in the data set.
T3 = (x_n - x_1) / s_x
T3 is simply the studentized difference between the two extreme values of the data set.
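In Python (again, an illustrative sketch with a function name of my own choosing):

from statistics import stdev

def t3(data):
    # T3: studentized range; tests one upper and one lower outlier together
    return (max(data) - min(data)) / stdev(data)

# e.g. t3([18, 39, -15, 17, 20, 14]) gives about 3.104 (cf. example 6.12 below)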
Example 6.12
Test the two outliers (one upper, one lower) of the following data set using a block test procedure.
x := (18 39 -15 17 20 14)^T
x := sort(x)
x^T = (-15 14 17 18 20 39)
H0: all the measurements are described by a single normal probability distribution
H1: the smallest and the largest measurement values are contaminants
Let's use a block test procedure:
T3obs := (x[last(x)] - x[0]) / stdev(x)
T3obs = 3.1038
critical values: 3.22 (95%) and 3.34 (99%)
Thus, we must accept H0 at the 95% level. We cannot say with 95% confidence that the largest
and smallest measurements are contaminants.
Assuming that all of the outliers have been identified, statistics intended for block tests are not
susceptible to outlier masking. That’s a good thing, but we have also created another problem,
that of outlier swamping. In block testing, all k outliers are either labeled as contaminants or they
are accepted as “normal” measurements. There is no middle ground, as there is in consecutive
testing, where we can say that some outliers are contaminants and some are not. Thus, there is the
possibility that a marginal outlier might be falsely declared a contaminant because it is “carried
along” in the block testing procedure by other, more extreme, outliers. Or perhaps a few marginal
outliers cause the block testing procedure to fail, which means that the contaminants that are in
the block will not be identified.
Advanced Topic: Data Contamination Models
Note to students
We did not cover this in class, and you will not be expected to know it. This section is just a
“placeholder” to discuss this subject in a future version of this class.
• discuss how outliers can arise. Use actual numbers, and pictures of probability distribution. See
book for a good example of some pictures. Contaminants are presumed to be due to a different
data-generating mechanism than the rest of the data
• describe the slippage model, and maybe the “inherent alternative” model.
Choosing Test Statistics
Note to students
We did not cover this in class, and you will not be expected to know it. This section is just a
“placeholder” to discuss this subject in a future version of this class.
Introduction
• which test statistic? Experience, common sense. Can use more than one; they only differ in
terms of false negatives, not false positives.
• I will present more alternatives (didn’t want to confuse the issue earlier in this section)
Variations on a Theme
• the notion of one-tailed and two-tailed outlier tests is a little hard to define
• give T6 and T7, the two-sided versions of T1 and T4, and explain why they might be preferred
to their one-sided brethren.
• give T8 and T9, the externally studentized versions of T1 and T6
• give T10 as an example of a Dixon-like stat that corrects for a specific masking effect
More Choices
• give three more test stats: T11 (N4), T12 (N5), and T13 (N15), along with their properties.
Summary
• summarize the advantages of the t-like stats: optimal if k is correct
• summarize the advantages of the Dixon-like stats. Easy to calculate; easy to choose one to
correct for a specific instance of masking.
• give the advantages of the F-like stats
• give the advantages of the sample kurtosis
• block-testing vs consecutive testing. Give the problems with either, and when it is best to use
one or the other (block: multiple outliers of a similar size; consecutive: multiple outliers of
different sizes). You must be especially careful in rejecting multiple outliers. In using the
outward test, you are essentially trimming the data before you begin. One possibility: use
inward, but apply a Dixon-like stat that guards against masking for the first test. Then I would
test at a fairly high CL for the second test to assuage my guilt. Maybe combine with a block
test to further strengthen the argument.
Chapter Checkpoint
The following terms/concepts were introduced in this chapter:
block testing of outliers
consecutive testing of outliers
contaminant
F-test
nonparametric test
outlier
outlier masking
outlier swamping
paired measurements
robust statistic
t-test
z-test
In addition to being able to understand and use these terms, after mastering this chapter, you
should
• be able to perform z-tests (1- or 2-tailed)
• be able to perform t-tests (1- or 2-tailed)
• be able to perform F-tests (1- or 2-tailed)
• know when to apply the above, and to be able to apply them to linear regression
• identify outliers in a data set
• objectively determine whether outliers are contaminants by a hypothesis test using the
appropriate test statistic (to guard against masking or swamping effects, when appropriate)