CHAPTER 2: SOME TRULY USEFUL BASIC TESTS FOR QUANTITATIVE
VARIABLES
TOPICS
RANDOM SAMPLES
INDEPENDENT AND DEPENDENT VARIABLES
PAIRED T-TESTS (Before/After comparisons)
INDEPENDENT SAMPLE T-TESTS (Separate group comparisons)
A. Large samples or unequal variances
B. Equal variances
TESTS TO COMPARE VARIANCES
NORMALITY ASSUMPTION
Section 1. Introduction
Many important experimental results are based on statistical analyses no more difficult
than those we will review in this chapter. They are among the most useful test statistics ever
devised, simply because the experimental designs which they match are easy, powerful, and
popular.
All statistical tests require that the sample studied be a random sample from the
population of interest. This is an extremely stringent requirement. It means that every person or
item in the population had an equal chance of making it into the sample. These are the ONLY
conditions under which sampling variability can be calculated. It ensures that sampling
variability is the sole source of error in your results. If samples are not selected
randomly, biases can easily contaminate the results. At the very least, if random samples are not
possible, randomization has to be used to create comparison groups.
We should note that most hypothesis tests are loosely stated as questions of the form
"Does variable X affect variable Y?" For instance, we might ask if Gender affects a person's
opinion on Abortion, or whether Age affects a person's Blood Pressure. In questions where one
variable can be thought of as "affecting" the other variable, the "cause" is referred to as the
independent variable. The outcome is referred to as the dependent variable. Hence, in the
preceding examples, Gender and Age are independent variables possibly affecting the dependent
variables Opinion and Blood Pressure.
The tools discussed in this chapter are only applicable when the dependent variable is
quantitative. Basically, that is because all these methods focus on the effect of the independent
variable on the mean and standard deviation of the dependent variable. These parameters only
make sense if the variable is quantitative. There are further mathematical requirements, which
we will list at the end of the chapter.
Section 2. Paired t-test
The most common experimental designs are in the form of a comparison. Often, the
comparison is on values collected on the same experimental subjects. For instance, we may have
reading proficiency scores for children before and after they undergo a six-week training
program. We may have strength scores on right and left arms of the same person. We may have
yields from tomato plants of types A and B, when one of each was planted in the same pot. In
each of these examples, the key feature is that there is a "matching" mechanism which pairs an
observation of one type unambiguously with an observation of the other type. The statistical
technique we discuss will preserve the information due to the pairing by using one of the
observations as a "baseline" against which the other is measured.
The method is simple. Consider the before and after scores for reading proficiency. If we
are interested in whether the program (independent variable) affected the reading proficiency
(dependent variable), we are really interested in whether there was typically a change in the
scores from before to after. We will calculate the individual changes for each child and use the
one-sample t-test (Chapter 1) to test the null hypothesis that the mean change is zero (no effect
exists.)
Recipe for Paired t-test
Data structure: For n individuals, we have measurement 1 (X1) and measurement 2 (X2)
which we wish to compare. X1 and X2 must be quantitative variables. Form a new
column D = X1 - X2 which contains the differences in the two measurements for each
individual.
Perform one sample t-test on D
1) Ho: μ_D = 0 (typical values do not differ for measurements 1 and 2)
H1: μ_D ≠ 0 (typical values do differ for measurements 1 and 2)
2) Since the sample of D which we have observed has n observations, the t-statistic will have n-1 d.f.
t = (x̄_D - 0) / (s_D / √n) = x̄_D / (s_D / √n)
The subscript "D" is to remind you that these statistics are calculated from the column of
D=Differences. Form your critical region using the table of the t-distribution with n-1 df
where n is the number of pairs.
3) Calculate the value of t for your sample. If you use a statistical computer package, it
may give you the p-value for this test automatically.
4) Write the appropriate conclusion.
Example of a paired comparison. Notice the presence of a natural pairing mechanism between
observations with the different "treatments". What are the advantages of such a mechanism?
The data below are from Darwin's study of cross- and self-fertilization.
Pairs of seedlings of the same age, one produced by cross-fertilization and
the other by self-fertilization, were grown together so that the members of
each pair were reared under nearly identical conditions. The data are the
final heights of each plant after a fixed period of time, in inches. Darwin
consulted the famous 19th century statistician Francis Galton about the
analysis of these data. The summary information was produced by the
statistical package SAS for Windows.
PAIR    CROSS    SELF    DIFF = cross-self
  1      23.5    17.4          6.1
  2      12.0    20.4         -8.4
  3      21.0    20.0          1.0
  4      22.0    20.0          2.0
  5      19.1    18.4          0.7
  6      21.5    18.6          2.9
  7      22.1    18.6          3.5
  8      20.4    15.3          5.1
  9      18.3    16.5          1.8
 10      21.6    18.0          3.6
 11      23.3    16.3          7.0
 12      21.0    18.0          3.0
 13      22.1    12.8          9.3
 14      23.0    15.5          7.5
 15      12.0    18.0         -6.0

Summary on variable DIFF
   N        Mean         Std Dev       Minimum       Maximum
-------------------------------------------------------------
  15     2.6066667     4.7128194    -8.4000000     9.3000000
The null hypothesis is that the mean difference in the population is 0,
implying that mean heights of cross and self-fertilized plants would not
differ. In symbols,
Ho: μ_D = 0    vs    H1: μ_D ≠ 0
There are 15 observations in the data set, so 14 d.f. If we use α = 5%, then
the critical region would be "Reject Ho if t < -2.145 or t > 2.145". In this
sample, t = 2.142. Hence, there is no significant evidence, at α = 5%, that
cross- and self-fertilized seedlings differ in mean height.
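For readers who want to reproduce this arithmetic, the sketch below (Python with scipy is assumed here; the course itself uses SAS and SPSS) forms the DIFF column and runs the one-sample t-test on D exactly as the recipe describes.

from scipy import stats

# Darwin's data: final heights (inches) of cross- and self-fertilized seedlings, by pair.
cross = [23.5, 12.0, 21.0, 22.0, 19.1, 21.5, 22.1, 20.4, 18.3, 21.6, 23.3, 21.0, 22.1, 23.0, 12.0]
self_ = [17.4, 20.4, 20.0, 20.0, 18.4, 18.6, 18.6, 15.3, 16.5, 18.0, 16.3, 18.0, 12.8, 15.5, 18.0]

diff = [c - s for c, s in zip(cross, self_)]   # DIFF = cross - self
t, p = stats.ttest_1samp(diff, 0.0)            # one-sample t-test on the differences, 14 d.f.
print(t, p)   # t is about 2.14; the p-value is just over .05, so Ho is not rejected at alpha = 5%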
Section 3. Two-sample t-test (also called the independent samples t-test)
Frequently, we have two separate groups on which we wish to make comparisons. We
may be interested in comparing mean salaries for male and female entry-level employees, or
length of hospital stays for HMO and PPO plan insurees. In the first case, our independent
variable is gender while the dependent variable is salary. Salary is a quantitative variable for
which we summarize typical values using the mean. Unlike the paired t-test, where the values in
each group are naturally matched, here we assume the two groups are completely independent.
Diagram 1 gives a schematic of the statistical situation. We have two populations
summarized by the means in each (μ1 and μ2) and the standard deviations (σ1 and σ2). Our
hypotheses are
Ho: μ1 = μ2 (μ1 - μ2 = 0) "group" has no effect on mean
Ha: μ1 ≠ μ2 (μ1 - μ2 ≠ 0) "group" has an effect on mean
Since we cannot observe μ1 and μ2, we must use our sample data to reach conclusions. Looking
at the hypotheses, our natural move is to compare the two sample means to each other, or
equivalently, their difference to 0.
Population 1                              Population 2
Parameters μ1, σ1                         Parameters μ2, σ2

                    Ho: μ1 = μ2

Sample 1                                  Sample 2
Stats: n1, mean x̄1, s1                    Stats: n2, mean x̄2, s2

DIAGRAM 1. Comparing two populations
If the population variances are known, probability theory shows that the appropriate statistic
would be
Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
The two sample t-test has two versions, which differ in how they "doctor up" the
denominator of this statistic since the population variance is hardly ever known. The versions
differ depending on whether the two population variances can be assumed equal or unequal. In
section 4 we cover a method for checking this assumption.
Section 3A. Large samples or unequal variances
When the variances (or standard deviations) in the two groups appear very dissimilar, the
best method may be the unequal variance version. This method does not require the assumption
of equal variances. The disadvantage of this method is that the degrees of freedom are
sometimes small, and they are always difficult to calculate (this is referred to as Satterthwaite's
approximation). While the test statistic itself is easy to calculate, the degrees of freedom are best
computed by a statistical package. Without the computer, it helps to know that the d.f. are
always between n_s - 1, where n_s is the size of the smallest sample, and n1 + n2 - 2, so if you get
the same conclusion using both those d.f., you are safe. If both samples are large (at least 50), it
is probably safe to use infinite (∞) d.f. The value of the test statistic is computed by:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
Section 3B. Equal variance t-test
When the variances in the two samples appear similar, it is advantageous to "pool" the
two estimates into an estimate of the alleged single underlying variance. This allows us to pool
the degrees of freedom in the two groups as well, giving more sensitive critical regions.
s_p² = [(n1 - 1)s1² + (n2 - 1)s2²] / (n1 + n2 - 2) = pooled variance

t = (x̄1 - x̄2) / (s_p √(1/n1 + 1/n2))    (note s_p, not s_p²)

d.f. = n1 + n2 - 2
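The following sketch (hypothetical data; Python is assumed, not part of the original notes) applies the pooled-variance formulas above by hand and then checks the result against a library routine.

import math
from scipy import stats

g1 = [12.1, 14.3, 11.8, 13.5, 12.9, 13.0]        # hypothetical data
g2 = [10.2, 11.9, 10.8, 12.4, 11.1, 10.5, 11.6]  # hypothetical data

n1, n2 = len(g1), len(g2)
xbar1, xbar2 = sum(g1) / n1, sum(g2) / n2
s1_sq = sum((x - xbar1) ** 2 for x in g1) / (n1 - 1)   # sample variances
s2_sq = sum((x - xbar2) ** 2 for x in g2) / (n2 - 1)

sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)         # pooled variance
t = (xbar1 - xbar2) / (math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2))
print(t, n1 + n2 - 2)                                  # t and d.f. = n1 + n2 - 2
print(stats.ttest_ind(g1, g2, equal_var=True))         # should reproduce the same t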
Section 4. Comparing two standard deviations
Some authors now argue that we should always use the unequal variance version of the
test. Traditionally, however, the equal variance version was preferred both because of the
potentially greater degrees of freedom and because its relation to more advanced topics (like the
one-way ANOVA) is well understood. In this tradition, before we decide which version of the
two-sample t-test to use, we need a tool for deciding whether the variances in two groups are
equal or different. This amounts to a hypothesis test for the hypotheses
Ho: σ1 = σ2 (or equivalently, σ1² = σ2²)
vs
Ha: σ1 ≠ σ2 (or equivalently, σ1² ≠ σ2²)
Note the restatement of the hypotheses in terms of the variances. There are many tests available
for testing these hypotheses. The most commonly cited are Fisher's test (F-test classic!) and
Levene's test.
Section 4a. Fisher's test
The test statistic used to compare the variances is the F-statistic. F is for Sir R. A. Fisher,
who pioneered many classic statistical techniques.
F  s12 / s 22
or
F'  s 2max / s 2min
F' differs from F only in that it always places the larger of the two sample variances in the
numerator. If the null hypothesis is true, we expect F (or F') to be near 1. If F is either very
much larger or very much smaller than 1 (F' very much larger than 1), we would believe Ha is
true. As always, the question is where to draw the line (critical region).
The table of the F-distribution is provided in most statistics texts. Most tables only give
the cutpoint which marks off the lower 1-A of area from the upper A of area in the righthand tail.
It can be quite confusing to understand how to use this to get the critical values for all the
varieties of test.
[Figure: generic shape of an F distribution, with area 1-A to the left of the cutpoint and area A in the righthand tail.]
a)
F is explicitly a two-tailed test. So we need cutpoints which mark off the lower α/2 area
in the lefthand tail, and α/2 in the righthand tail. Most tables only give the righthand cutpoint.
To get the lefthand cutpoint, you use
Lefthand cutpoint for lower α/2 with M,N df = 1 / (Righthand cutpoint for upper α/2 with N,M df)
Example: Suppose we are using α = 5%, and sample 1 has n=10 while sample 2 has n=6 (9 and
5 df, respectively). We should put 2.5% in each tail. From the table, we see that the cutpoint for
the upper tail is 6.68. To get the lower cutpoint, we need to reverse the order of the d.f. (now 5
and 9), then take the reciprocal of the upper cutpoint. That is, the lower cutpoint is 1/4.48 = .223.
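If software is available, both cutpoints can be taken directly from the F distribution instead of a printed table. A minimal sketch for the example above, assuming Python with scipy (table values are rounded, so expect tiny differences):

from scipy import stats

alpha = 0.05
df1, df2 = 9, 5                                     # numerator and denominator d.f.

upper = stats.f.ppf(1 - alpha / 2, df1, df2)        # righthand cutpoint, about 6.68
lower = 1 / stats.f.ppf(1 - alpha / 2, df2, df1)    # 1 / (righthand cutpoint with the d.f. reversed)
print(upper, lower)                                 # about 6.68 and 0.223

print(stats.f.ppf(alpha / 2, df1, df2))             # the lefthand cutpoint directly; same value as 'lower'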
b)
F' is also two-tailed, but it finesses the problem of getting the lower cutpoint by arranging
to always put the largest variance on top. Hence, if we had sample 1 with n=10 and sample 2
with n=6, our critical region would be: If s1 is the largest, reject if F' > 6.68 (9 and 5 df, with
α/2 in the upper tail); if s2 is the largest, reject if F' > 4.48 (5 and 9 df, with α/2 in the upper
tail).
Example of an F test.
Notice that sometimes hypothesis tests about the variances (or standard deviations) are of
interest in their own right.
Drill press operators in a manufacturing plant must drill holes of specified
diameter in sheets of metal. One goal is that all holes should have the same
diameter (small variability in the individual diameters). Actual diameters
are measured for 20 holes drilled by inexperienced operators, and 10 holes
drilled by experienced operators. The data is summarized below. Is there
significant evidence, at α = 5%, that the population variances differ for
experienced and inexperienced operators?

    Inexperienced:  n = 20,  s = .52 mm
    Experienced:    n = 10,  s = .21 mm
1) Ho: 1 = E
(variability is the same for experienced and inexperienced
operators)
Ha: 1  E
(variability is not the same)
2) We will reject Ho if F' > 3.69, using F table for upper area of .025, 19
df in numerator and 9 in denominator.
2
2
3) F' = .52 / .21 = 6.13
4) There is significant evidence that the variability in the diameters
differs for experienced and inexperienced operators. Inexperienced operators
have larger variability (less consistency) in the diameters of the holes.
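The same F' calculation is easy to check with software. A sketch under the assumption that Python with scipy is available (the critical value comes from the F distribution rather than the printed table, so it may differ slightly from 3.69):

from scipy import stats

s_inexp, n_inexp = 0.52, 20    # inexperienced operators (summary statistics from the example)
s_exp, n_exp = 0.21, 10        # experienced operators

f_prime = max(s_inexp, s_exp) ** 2 / min(s_inexp, s_exp) ** 2   # larger sample variance on top
cutpoint = stats.f.ppf(1 - 0.05 / 2, n_inexp - 1, n_exp - 1)    # upper 2.5% point with 19 and 9 d.f.
print(f_prime, cutpoint)   # about 6.13 and roughly 3.69; F' exceeds the cutpoint, so reject Ho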
Section 4b. Levene's Test (used by SPSS)
Levene's test actually tests the null hypothesis that the mean magnitude of the
distances from the individual observations to their group mean is the same in each group. Instead of defining
dispersion in terms of 'squared distances' as variances do, it uses absolute values of distances.
The actual algorithm is as follows:
1) Within each group, compute the difference between the individual observations and
the group mean.
2) Take the absolute value of these differences.
3) Do an independent sample t-test (equal variance version) of the null hypothesis that the
means of the absolute differences are equal.
4) Square the t-value from the t-test. (Under Ho, the square of a t should have an F
distribution with 1 df in the numerator and n1 + n2 - 2 in the denominator.) Compare it to the
cutpoint which places α (usually 5%) area in the upper tail of the distribution. You are only
interested in large values of F, because only large values would indicate that the variances are
different. (Note the difference between this and the cutpoints for Fisher's test, which place α/2 in
each tail.)
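A small sketch of this four-step algorithm follows (two hypothetical samples; Python with scipy is assumed and is not part of the original notes). scipy's built-in levene routine with center='mean' computes the same statistic directly.

from scipy import stats

g1 = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5]   # hypothetical data
g2 = [13.0, 17.8, 12.1, 18.4, 11.5, 16.9]   # hypothetical data

# Steps 1-2: absolute deviations of each observation from its own group mean.
m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
d1 = [abs(x - m1) for x in g1]
d2 = [abs(x - m2) for x in g2]

# Step 3: equal-variance t-test on the absolute deviations.
t, p = stats.ttest_ind(d1, d2, equal_var=True)

# Step 4: square t; under Ho this is F with 1 and n1 + n2 - 2 d.f.  Use all of alpha in the upper tail.
F = t ** 2
cutpoint = stats.f.ppf(0.95, 1, len(g1) + len(g2) - 2)
print(F, cutpoint)

print(stats.levene(g1, g2, center='mean'))   # cross-check: same F, with its p-value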
Large values of F indicate that one of the means must be different from the other (Ha
true). Bear in mind at this point that we are no longer talking about the means of the raw data,
but of the distance of the raw values around their group means. In the example above, a large
value for F would indicate that the typical (mean) distance of individual diameters from the
group mean was larger in one group than in another, indicating more variability in one group.
Levene's Test and Fisher's Test do not give exactly the same result. Except in borderline
cases, however, they usually give comparable values. There is some intuitive evidence that
Levene's Test is less sensitive to departures from the normality assumption, and I think that is
why it is the default in SPSS.
Example of Levene's Test
The following data show test scores for five freshmen and five juniors
on an assessment test for critical thinking. Does variability differ in the
two groups, using α = 5%?

    Freshmen:  28  32  21  36  33    (sample mean = 30.0)
    Juniors:   34  49  43  32  27    (sample mean = 37.0)
Ho: σ1² = σ2², vs Ha: σ1² ≠ σ2².
Reject Ho if F > 5.32 (using F table with 1 and 8 d.f., and 5% in the upper tail.)
Absolute values of differences from mean within each group:

    Freshmen:  2   2   9   6   3    (sample mean = 4.4, s = 3.05)
    Juniors:   3  12   6   5  10    (sample mean = 7.2, s = 3.70)

Sp = 3.39, df = 8, t = 1.31, F = 1.71. Since 1.71 is less than the
cutpoint of 5.32, there is no significant evidence that the variances are
different in the two groups. When we compute the t-test to compare the means
of the test scores, we can use the equal variance version.
Section 5. A Full Example!!
Recipe for a two sample t-test:
Data structure: Two separate groups are measured for a quantitative variable Y.
Performing the two sample t-test
1) State the null and alternative hypotheses in terms of the means in the two groups.
2) Decide which version of the t-test to use by using the F-test or Levene's test to examine the
variances within the two samples.
3) Calculate the number of degrees of freedom for the appropriate version of the test. Use your α
and a table of the t-distribution to set the critical region.
4) Calculate the appropriate version of t.
5) Write your conclusion.
Most computer programs will automatically calculate both versions of t as well as F, along with their p-values, saving you a lot of effort.
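As an illustration of the whole recipe (hypothetical data; Python with scipy is assumed rather than the SAS and SPSS output shown in these notes), the variance check and both versions of the t-test are each one call:

from scipy import stats

g1 = [245, 232, 268, 251, 240, 259, 248, 236]        # hypothetical data for group 1
g2 = [210, 205, 231, 198, 224, 215, 208, 219, 202]   # hypothetical data for group 2

# Step 2: examine the variances (Levene's test here; Fisher's F' would also work).
print(stats.levene(g1, g2, center='mean'))

# Steps 3-5: both versions of t; report the one the variance check supports, then state a conclusion.
print(stats.ttest_ind(g1, g2, equal_var=True))    # pooled (equal variance) version
print(stats.ttest_ind(g1, g2, equal_var=False))   # unequal variance (Satterthwaite) version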
Example for two-sample test (two independent samples)
Notice that the two groups of patients are completely separate, with no natural pairing.
In small to moderate samples, the particular version of the two-sample t-test depends on whether
the variances within the two groups seem similar. SAS for Windows computes an F-test to help
you decide which version is appropriate.
The data summarized below show cholesterol values for the 39 heaviest
men in the Western Collaborative Group Study. (This study was carried out in
California in 1960-1961 and involved 3,154 middle-aged men. The purpose was
to study behaviour patterns and risk of coronary heart disease.) All the
cholesterols summarized below are for men weighing more than 225 pounds.
Cholesterols are given in mg per 100 ml. Each man was rated as generally
having Behaviour Type A (urgency, aggression, ambition) or Behaviour Type B
(relaxed, non-competitive, less hurried.) In heavy, middle-aged men, is
cholesterol level related to behaviour type?
1) The null hypothesis is that behavior type has no effect on mean
cholesterol, while the alternative hypothesis is that it does have an effect
on mean cholesterol. In symbols:
Ho: A = B
vs
Ha: A  B
2) Since the hypotheses concern the means in two separate groups, we will use
the two sample t-test. To decide which version, we notice that the program
has printed the value of F', along with the p-value (labeled Prob>F').
Recall that this statistic tests the null hypothesis that the two population
variances are equal. Since the p-value of .2927 is greater than any
reasonable α (.1 to .01), it is reasonable to assume that the variances are
equal and use that version of the t-test.
3) For the pooled (equal) variance version, the d.f.=37. With a significance
level of 5%, we would reject Ho if t < -2.021 or t > 2.021. Alternatively, we
reject if the p-value is less than .05.
4) For the equal variance version, t=2.5191, df=37 and the p-value is .0162.
You should use the table of sample means and standard deviations to check
these results.
5) If we are using a significance level of .05, we would reject the null
hypothesis. Hence, we can say there is significant evidence that behaviour
type is associated with differences in mean cholesterol.
COMPUTER PRINTOUT

                          TTEST PROCEDURE

Variable: CHOL

TYPE        N          Mean            Std Dev        Std Error
-----------------------------------------------------------------
A          19      245.36842105     37.61384279      8.62920735
B          20      210.30000000     48.33991486     10.80913356

Variances       T        DF     Prob>|T|
------------------------------------------
Unequal      2.5355     35.7     0.0158
Equal        2.5191     37.0     0.0162        <----- note how SAS labels the p-values

For H0: Variances are equal, F' = 1.65    DF = (19,18)    Prob>F' = 0.2927    <----- note how SAS labels the p-values
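As step 4 suggests, the printed equal-variance t can be checked from the summary statistics alone. A sketch (Python assumed; only the means, standard deviations, and sample sizes from the printout are used):

import math

n_a, mean_a, sd_a = 19, 245.36842105, 37.61384279   # Type A, from the SAS printout
n_b, mean_b, sd_b = 20, 210.30000000, 48.33991486   # Type B

sp = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))   # pooled s
t = (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))
print(t, n_a + n_b - 2)   # about 2.519 with 37 d.f., matching the Equal-variances line above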
Boxplot for CHOL by TYPE
[Figure: side-by-side boxplots of cholesterol (CHOL) for behaviour types A and B.]
CASE STUDY
Jerrold et al. (2009) compared typically developing children to young adults who had Down
Syndrome, with respect to a number of psychological measures thought to be related to the ability to learn
new words. Data on two of the measures is summarized in the table below. Recall Score is a measure of verbal
short-term memory. Raven’s CPM is a task in which the participant must correctly identify an image which
completes a central pattern.
The authors used the pooled t test to compare the typical scores in the two groups. For Raven’s
CPM, t = .485, p-value = .629. For Recall Score, t = 7.007, p-value < .0001. Hence, the two
groups did not differ significantly with respect to mean Raven’s CPM, but the Down’s Syndrome group
scored significantly differently (apparently lower) on Recall Score. Based on this and a number of other
comparisons, the authors conclude that verbal short-term memory is a primary factor in the ability to learn
new words.
The authors' choice of the pooled t test rather than the unequal-variance t-test appears
reasonable here. For Raven’s CPM, F = 0.700, p-value = .379. For Recall Score,
F = 0.691, p-value = .361. Neither variable showed a significant difference in the variances within the
groups.
The other distributional assumption underlying t tests is that the data comes from normal
distributions. Journal publications rarely have space in which to present graphical evidence with which the
reader can check this assumption. However, the discussion will often include a sentence addressing this
issue, and remark on any transformations (e.g. logarithms) used to make the variable more nearly normal.
The authors actually presented the results of the pooled t test (with 80 degrees of freedom) as an
F test with 1 degree of freedom in the numerator and 80 in the denominator. The relation between these
two test statistics will be explained in Chapter 4.
Summary statistics from Jerrold (2009).

                   Down Syndrome young adults     Typically developing children
                           (n = 21)                         (n = 61)
                      Mean        S.D.                 Mean        S.D.
Raven’s CPM          19.33        4.04                19.90        4.83
Recall Score         12.00        3.05                18.25        3.67
(Source: Jerrold, C., Thorn, A. S. C., and Stephens, E. (2009). The relationship among verbal short-term
memory, phonological awareness, and new word learning: evidence from typical development and Down
syndrome. J. Experimental Child Psychology, 102(2), 196-218.)
Section 6. Nasty mathematical assumptions
We already know of two fundamental assumptions underlying the tests in this chapter,
and that of the t-test in Chapter 1.
1) The sample must be random.
2) The dependent variable must be quantitative.
In addition, the derivations of the t and F-distributions have a nasty mathematical assumption:
that the distribution of the variable in the population must follow a normal distribution. In plain
language, if you could draw a histogram of the values for all the observations in the entire
population, you should see the famous "bell curve". So we have a third assumption:
3) The distribution for the individual values is normal.
It is not very likely that we will ever know for sure whether assumption 3 is met. What can we
do to check and how important is it anyway? There are several graphical techniques we can use
to check for normality. So far, we have seen dotplots and boxplots, though we have not
discussed them. (See your elementary text.) In chapter 4 we will meet a tool called a normal
probability plot which gives a more sensitive check. What are we really looking for? An
immediate cause of trouble in a small or moderate data set would be when one or two values are
very far away from the rest. The self/cross-fertilization data used as an example of the paired t-test may be a case where the data contains two "outliers". Outliers should be rare in normally
distributed data. Outliers can cause the p-values and critical regions to be only approximate.
The most common problem is to make the p-value larger than what it should be.
If there are no outliers, and the data show a nearly symmetric pattern with the points
clustering in the middle of the range, then it is unlikely that non-normality is a serious problem.
If you do seriously suspect nonnormality in your data, consult a statistician on a variety of
"nonparametric" statistical tests which do not require the normality assumption.
There is one frequent case in social science data where normality is very questionable. If
you have data collected on an ordinal scale (e.g. 0 = strongly disagree to 4 = strongly agree), it is
unlikely to be normally distributed. Recall that the normal distribution is for continuous or
nearly continuous random variables, and data on a five point scale is quite discrete. This is
especially true if the values cluster at one end or the other end of the scale (e.g. almost all agree
or strongly agree). In this case, one of the techniques of Chapter 3 might be appropriate.
Furthermore, it is questionable as to whether one can legitimately average values on this kind of
scale — does a ("disagree" + "strongly agree")/2 = "agree"? Nevertheless, treating this ordinal
data AS IF it were numerical on a 0-4 scale, and conducting averaging operations, is a sloppy but
common practice in the social sciences. Averages over several questions frequently produce
values which appear reasonably normally distributed.
Finally, when comparing two population means, the choice of the version of the test
depends on whether variances can be assumed equal. Since this assumption, referred to as
"homogeneity of variance", underlies much of the Analysis of Variance, we list it as a fourth
assumption:
4) variances in the two groups are equal.
EXERCISES FOR CHAPTER 2
*Exercise 1. Data below show blood pressures for 5 subjects. The first value was taken while the
subject was resting. The second was taken while the subject was resting, but asked to work a mental
arithmetic problem. Does math affect mean blood pressure? Use α = 5%.

Subject number      1     2     3     4     5
Resting BP         115   125   110   120   110
During Math BP     125   125   130   115   125
*Exercise 2. Occupancy rates (average annual percentage of beds filled) are compared for randomly
selected urban and suburban hospitals in a state.
a. Is there evidence of difference in variability between the two groups? Compute both Levene's Test
and Fisher's Test to answer this. Use α = 5%.
b. Is there evidence of a difference in the mean occupancy rates? Use the results of A to help you
decide on an appropriate version of the t-test. Use α = 5%.

Urban:     76.5   79.6   77.5   79.4   79.3   78.1
Suburban:  71.5   73.4   71.2   67.8   63.0   76.5
Exercise 3. Eight students volunteer to participate in a test of the effect of caffeine on the speed with
which they can respond to a flashing light. Each student takes the test on a morning when they have
had no caffeine, then again a week later on a morning after having had the equivalent of two cups of
coffee. The data is given below, in hundredths of seconds to respond to the light.
Subject             1    2    3    4    5    6    7    8
Without caffeine   12   18   22    9   14   24   21   16
With caffeine      10   14   20    8   14   21   19   14

Does caffeine have an effect? Use α = 5%.
Exercise 4. Do HMO's really reduce costs of care? 40 adults aged 55-60 enrolled in HMO's are
questioned on their health care within the last 2 years. They report an average of 1.19 days hospitalized
during that period, with a standard deviation of 1.4 days. A similar sample of 40 adults
with ordinary healthcare insurance reports an average of 1.35 days with a standard deviation of 1.7
days.
a) Is there evidence of a difference in the variability within the groups? (You don't have enough
information to do Levene's test here, you must use Fisher's.)
b) Is there evidence of a difference in the means for the groups?
Use α = 5% for each test.
Exercise 5. We are comparing math FCAT scores for rural and urban high schools. We have a
random sample of 20 urban high schools and 20 rural high schools. Their school-aggregated math
FCAT scores for 10th graders are summarized below.

Location    n    sample mean    sample standard deviation
Rural      20       1925                  252
Urban      20       1982                  212

a. Use the F' test to say whether it is reasonable to assume that the two populations have the same
variance. Why can you not prove that the variances are equal?
b. Do the means differ significantly in the two groups? Use α = 1%.
Exercise 6. From each of 4 different litters of mice, a researcher chooses two female mice (for a total
of 8 mice). Within each pair of sisters, one is chosen to be fed a standard diet, and the other is fed a
high-protein diet. Their weight in grams, at the end of 6 weeks, is shown below.
Diet            Pair 1   Pair 2   Pair 3   Pair 4
Standard         19.4     18.2     18.5     19.8
High-protein     17.6     19.4     17.2     19.2

Do the different diets seem to affect mean weight? Use α = 5%.
Exercise 7. (A development from Exercise 1.) The researcher wishes to know whether girls and boys
differ in their reaction to arithmetic. 5 girls are recruited, and their blood pressures are tested
resting, and again resting but doing mental arithmetic. 5 boys are tested under the same
circumstances. The data is given below. Is there significant evidence, at α = 5%, that boys and girls
differ in the mean change in BP experienced while doing arithmetic? Note: this experimental design
uses ideas from both paired and two-sample experiments!
             Girls                        Boys
    Resting    During Math      Resting    During Math
      120          130            120          125
      110          115            110          125
      115          115            105          115
      110          120            110          120
      110          115            120          115
Exercise 8. Pedersen (2007, Perceptual and Motor Skills, 104(1), pp 201-211) interviewed a sample
of students enrolled in psychology courses in a large private university in the western U.S. regarding
their attitudes towards sports. Each student was asked to self-rate his or her degree of sport
participation, on a scale of 1 to 5. The 112 men in the sample had M = 4.3 and SD = 1.7. The 173
women had M = 3.6 and S.D. = 1.7. (M is a common abbreviation for the sample mean, and SD a
common abbreviation for the standard deviation.) Is there significant evidence, at α = 1%, that men
and women at this university differ in their mean self-rankings of sport participation?
Exercise 9. Martinussen et al. (2007, J. Criminal Justice 35, 239-249) compared 'burnout' among a
sample of Norwegian police officers to a comparison group of air traffic controllers, journalists and
building constructors. Burnout was measured on three scales: exhaustion, cynicism, and efficacy. The
data is summarized in the table below. The authors state
The overall level of burnout was not high among police compared to other occupational
groups sampled from Norway. In fact, police scored significantly lower on exhaustion and
cynicism than the comparison group, and the difference between groups was largest for
exhaustion.
Substantiate the authors' claim regarding Exhaustion. That is, check that it does show a significant
difference between the two groups.
Summary Statistics for Exercise 9

              Police, n = 222       Comparison group, n = 473
              Mean     std dev        mean      std dev
Exhaustion    1.38      1.14          2.20       1.46
Cynicism      1.50      1.33          1.75       1.34
Efficacy      4.72      0.97          4.69       0.89
SOLUTIONS TO STARRED PROBLEMS
Exercise 1. Notice the existence of a pairing mechanism between items. Each experimental unit (a subject) has two
blood pressures--a resting and a 'during math' blood pressure. This should be done via a paired t-test.
a) D = mean difference in during math – resting blood pressure in the population
Ho: D = 0 versus Ha: D  0.
b) The 5 differences in the sample are: 10 0 20 –5 15. There will be 4 degrees of freedom. We will reject Ho if t
< -2.776 or t> 2.776
c) d̄ = 8 and s_D = 10.368, so t = 8.0 / (10.368 / √5) = 1.725
d) Do not reject Ho. There is no significant evidence that math affects mean blood pressure, at α = 5%.
Further Note. A computer package would not tell you the cutpoints for t. Instead, it would report that the p-value for this
data was .1595. Since .1595 > .05 (your α), you would not reject Ho.
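For readers working in Python rather than a point-and-click package, the same paired test is a single call to scipy's ttest_rel (equivalent to the one-sample t-test on the differences); this is only a cross-check, not part of the original solution.

from scipy import stats

during_math = [125, 125, 130, 115, 125]
resting = [115, 125, 110, 120, 110]

print(stats.ttest_rel(during_math, resting))   # t about 1.725, p about .16; do not reject Ho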
Exercise 2.
Urban group had mean = 78.4, s.d. 1.2458
Suburban group had mean = 70.5667 and s.d. = 4.6779
a) Difference in variability: Ho: σ_S² = σ_U² versus Ha: σ_S² ≠ σ_U²
Fisher's test. Reject if F' > F-table value with 6-1=5 and 6-1=5 df and 2.5% in tail.
F' = 4.6779² / 1.2458² = 14.1. Cutpoint in table is 7.15. Since F' > 7.15, we reject Ho. There is significant evidence of
a difference in variability, at α = 5%. (Note, the tail value is half the desired α for the F' version.)
Levene's test. Absolute values of difference of individual scores from group mean—
Urban: 1.9  1.2  0.9  1.0  0.9  0.3
Suburban: .93  2.83  .63  2.77  7.57  5.93
Running an independent samples t-test (equal variance version) on this data gives t = 2.099. F = 2.099² = 4.406 with 1 and
10 df. Since the cutpoint in the F-table is 4.96 for α = 5% (don't split the α!) we would not reject Ho: that is, we have no
significant evidence of a difference in variability.
b) Fisher's test and Levene's test differ on the advisability of using the equal variance/unequal variance version.
Fortunately, in this case, the answers don't differ. Both versions of t come out to 3.96, which would be significant
whether you use df=10 (equal variance version) or df=5 (smallest df possible under unequal variance version).
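A sketch that reproduces these Exercise 2 calculations, assuming Python with scipy (not part of the original solution):

from scipy import stats

urban = [76.5, 79.6, 77.5, 79.4, 79.3, 78.1]
suburban = [71.5, 73.4, 71.2, 67.8, 63.0, 76.5]

# Fisher's F': larger sample variance over the smaller.
v_u, v_s = stats.tvar(urban), stats.tvar(suburban)   # sample variances
f_prime = max(v_u, v_s) / min(v_u, v_s)
print(f_prime, stats.f.ppf(0.975, 5, 5))             # about 14.1 versus the cutpoint 7.15

# Levene's test with center='mean' matches the hand calculation (F about 4.41).
print(stats.levene(urban, suburban, center='mean'))

# Both versions of the t-test give t of about 3.96.
print(stats.ttest_ind(urban, suburban, equal_var=True))
print(stats.ttest_ind(urban, suburban, equal_var=False))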
SPSS NOTES IF YOU WANT TO GET STARTED ON YOUR OWN!
Step 1. Deciding how to set up your data.
When you double-click on the SPSS icon, the first thing you see is a spreadsheet-like grid for entering your data. This is
called the Data Editor. Before you charge in and start typing, you have to think about how the data is structured. The
format in which you enter the data must follow that structure.
The basic rule-of-thumb is that entries on the same row, or line, are from the SAME subject, or experimental unit.
Things on different lines are from different subjects. Things in different columns are different measurements. Let's see
how that plays out in the Starred Exercises. By double-clicking on the heading of each column, you can change the name
to something sensible, and also indicate whether your data is nominal ('string') or numerical.
Exercise 1. There are 5 different subjects, so our data entry will have five rows. There should be one column for subject
number ('SUBJ'), one for the resting BP ('REST') and one for the during-math BP ('MATH'). In other words, the data
entry will look very like the table given in the problem.
Exercise 2. There are 14 different hospitals. Each will have its own row in the data entry. In addition to a column for
occupancy rate ('O_RATE'), I will need a column which tells me whether it is an Urban or Suburban hospital. I will call
this column 'LOCATION'. Many of the ANOVA and T-test routines in SPSS want group variables to be coded AS IF
they were numeric. I am going to code Urban=0, Suburban=1. Keep notes of the codes you define.
LOCATION    O_RATE
    0        76.5
    0        79.6
   ...        ...
    1        76.5
Step 2. Request the appropriate t-test
Exercise 1.
In SPSS, click on the ANALYZE option at the top. From the drop-down menu, request COMPARE MEANS. Choose the
type of T-test you need, in this case, the PAIRED SAMPLES T-TEST. You will see a 'Dialog Box'.
You need to click on the column names with the two variables you are trying to compare (REST and MATH), and move
them into the big box on the right using the key that looks like an arrow >. Then hit the OK button. You will see printout
like that shown below.
Paired Samples Statistics

                      Mean       N    Std. Deviation    Std. Error Mean
Pair 1   REST      116.0000      5       6.51920            2.91548
         MATH      124.0000      5       5.47723            2.44949

Paired Samples Correlations

                          N    Correlation    Sig.
Pair 1   REST & MATH      5       -.490       .402

Paired Samples Test

Pair 1: REST - MATH
  Paired Differences:  Mean = -8.0000,  Std. Deviation = 10.36822,  Std. Error Mean = 4.63681
  95% Confidence Interval of the Difference:  Lower = -20.8738,  Upper = 4.8738
  t = -1.725,   df = 4,   Sig. (2-tailed) = .160
The first panel gives you some of the summary statistics within each group. The last panel reports the results of the t-test.
The p-value is labeled Sig., which is short for 'Observed Significance Level'. Since the .16 is greater than your α, you do
not have significant evidence of a math effect.
Exercise 2. From the ANALYZE / COMPARE MEANS menu, choose INDEPENDENT SAMPLES T-TEST. You need
to click on O_RATE and use the > key to move it into the Test Variable(s) box. You need to click on LOCATION and move it
into the Grouping Variable box. Then hit OK.
Group Statistics

          LOCATION     N      Mean      Std. Deviation    Std. Error Mean
O_RATE       0         6    78.4000        1.24579            .50859
             1         6    70.5667        4.67789           1.90974

Independent Samples Test

Levene's Test for Equality of Variances:  F = 4.406,  Sig. = .062

t-test for Equality of Means (O_RATE):
                                                                                    95% Confidence Interval
                              t       df     Sig. (2-tailed)   Mean        Std. Error    of the Difference
                                                               Difference  Difference    Lower       Upper
Equal variances assumed     3.964     10          .003          7.8333      1.97630     3.42985    12.23681
Equal variances not assumed 3.964    5.706        .008          7.8333      1.97630     2.93638    12.73029
Note that SPSS automatically gives you Levene's test to help you choose the version of the t-test. The p-value is once
again labeled Sig. in SPSS.
BOXPLOTS
Simple boxplots use a box to mark off the middle 50% of the data. The box extends from the first
quartile to the third quartile, with a thick mark at the median. The purpose of the box is to draw your
eye to the central 'typical' half of the data. The lowest 25% of the data is marked off by a whisker that
extends from the minimum value to the first quartile. The highest 25% of the data is marked off by a
whisker that extends from the third quartile (75th percentile) to the maximum.
Modified boxplots alter the whiskers to draw your attention to outliers, or wild values in the data set.
To define outliers, the computer calculates a value called the hinge width, which is 1.5 x (75th
percentile - 25th percentile ) = 1.5 x length of 'box'. Any value lying more than one hinge width
ABOVE the 75th percentile is an outlier on the high side. Any value lying more than one hinge width
BELOW the 25th percentile is an outlier on the low side. Modified boxplots draw the whiskers from
the quartile to the most extreme value that is not an outlier. Any outliers are marked off with a
separate symbol.
Boxplots give you a quick view as to whether typical values (denoted by the boxes) are changing.
They also help you see whether the spreads (variances) are relatively stable. They can also help you
diagnose non-normality, by helping you spot asymmetries or outliers.
Outliers are wild, unusual values. Normally distributed data should have very few, if any, outliers.
A very large data set might reasonably have a few (1%?) outliers without causing harm, but very
severe or frequent outliers can cause statistical trouble. Moreover, outliers are of interest in their
own right -- what causes these people to be so different from the rest?
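The outlier rule above is easy to apply by hand or in software. A sketch with a small hypothetical data set, assuming Python with numpy (packages differ slightly in how they compute the quartiles themselves):

import numpy as np

x = np.array([3.1, 4.2, 4.8, 5.0, 5.3, 5.5, 5.9, 6.2, 6.4, 14.0])   # hypothetical data

q1, q3 = np.percentile(x, [25, 75])            # 25th and 75th percentiles
hinge_width = 1.5 * (q3 - q1)                  # 1.5 x length of the 'box'
low_fence, high_fence = q1 - hinge_width, q3 + hinge_width

outliers = x[(x < low_fence) | (x > high_fence)]
print(q1, q3, low_fence, high_fence, outliers)   # 14.0 is flagged as a high-side outlier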
Example of Boxplots The typical values are higher in Group 1 than in Group 2. The spreads are
similar, except that Group 2 has an outlier with an unusually large value.
[Figure: SPSS side-by-side boxplots of X by GROUP, n = 15 in each group.]