CSS 590 Experimental Design in Agriculture
Lab exercise – 5th week
Testing ANOVA assumptions
SAS On-line Documentation
Univariate Procedure
GLM Procedure
Part I – Data input formats
Up to this point we have entered data into SAS in the format of a SAS dataset. If
you are working with large datasets that are in a different format, you may
prefer to write a short program to rearrange the data in SAS. This can be
achieved using Do loops and the trailing ‘@’ symbol, which tells SAS to hold the
current line so that the next Input statement reads another data point from the
same line. A double trailing ‘@@’ tells SAS to continue reading from the same
line, even across iterations of the data step, until there are no more data
points to be read. Run the program below, and note how the data are reformatted.
This is basically an input format that consolidates the data. The “Do” loop tells SAS to
take the next 10 values on the line as weedcounts, writing one observation for each, and
then go on to the next line and its herbicide, and so on.
Data one;
Input herbicide $ @;     /* the herbicide; the @ sign means read from the same data line */
Do i=1 to 10;
Input weedcount @;       /* the weedcounts for each herbicide */
Output;
End;
Datalines;
A 4 5 2 5 4 1 3 4 2 6
B 8 11 9 13 6 5 9 7 6 12
C 25 28 20 15 14 30 27 17 23 13
D 33 21 48 18 53 31 39 26 44 25
;
Proc Print;
Run;
Obs    herbicide     i    weedcount
  1        A         1         4
  2        A         2         5
  3        A         3         2
  4        A         4         5
  5        A         5         4
  6        A         6         1
  7        A         7         3
  8        A         8         4
  9        A         9         2
 10        A        10         6
 11        B         1         8
 12        B         2        11
 13        B         3         9
 14        B         4        13
 15        B         5         6
 16        B         6         5
 17        B         7         9
 18        B         8         7
 19        B         9         6
 20        B        10        12
 21        C         1        25
 22        C         2        28
 23        C         3        20
 24        C         4        15
 25        C         5        14
 26        C         6        30
 27        C         7        27
 28        C         8        17
 29        C         9        23
 30        C        10        13
 31        D         1        33
 32        D         2        21
 33        D         3        48
 34        D         4        18
 35        D         5        53
 36        D         6        31
 37        D         7        39
 38        D         8        26
 39        D         9        44
 40        D        10        25
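For comparison, here is a minimal sketch of the double trailing ‘@@’ described in Part I. It assumes a different, hypothetical layout in which each weedcount is paired with its herbicide label and several pairs share a line; the dataset name pairs and the shortened data lines are for illustration only.

```sas
/* Hypothetical layout: herbicide/weedcount pairs strung along each line.   */
/* The double trailing @@ holds the line across iterations of the data step, */
/* so a single Input statement reads pair after pair until the line is used  */
/* up, and then moves on to the next line.                                   */
Data pairs;                          /* 'pairs' is an illustrative name */
  Input herbicide $ weedcount @@;
Datalines;
A 4 A 5 A 2 A 5 A 4
B 8 B 11 B 9 B 13 B 6
;
Proc Print data=pairs;
Run;
```

No Do loop or Output statement is needed here, because every pass through the data step reads one pair and writes one observation.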
Part II. Testing the assumptions for ANOVA
Conduct a one-way ANOVA of the above data set using herbicide as the
independent variable. Request a Bartlett’s test for Homogeneity of Variances or
use the default which is Levene’s test. Output the residuals and predicted values
to a new data set for further diagnosis.
PROC GLM;
Class herbicide;
Model weedcount = herbicide;
Means herbicide / hovtest=bartlett;
Means herbicide / hovtest;
output out=new r=residual p=predicted;
Run;
The GLM Procedure

Class Level Information
Class        Levels    Values
herbicide         4    A B C D

Number of Observations Read    40
Number of Observations Used    40

Dependent Variable: weedcount

                                Sum of
Source             DF          Squares    Mean Square    F Value    Pr > F
Model               3      5498.400000    1832.800000      38.77    <.0001
Error              36      1702.000000      47.277778
Corrected Total    39      7200.400000

R-Square    Coeff Var    Root MSE    weedcount Mean
0.763624     40.92788    6.875884          16.80000

Source       DF      Type I SS    Mean Square    F Value    Pr > F
herbicide     3    5498.400000    1832.800000      38.77    <.0001

Source       DF    Type III SS    Mean Square    F Value    Pr > F
herbicide     3    5498.400000    1832.800000      38.77    <.0001
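As a quick check on how the summary statistics in this ANOVA output relate to one another, the arithmetic below recomputes them from the sums of squares and mean squares reported above. Only values printed in the output are used; the subscripted symbols are shorthand introduced here.

```latex
\begin{align*}
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{1832.8}{47.2778} \approx 38.77 \\
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Corrected Total}}} = \frac{5498.4}{7200.4} \approx 0.7636 \\
\text{Root MSE} &= \sqrt{MS_{\text{Error}}} = \sqrt{47.2778} \approx 6.876 \\
\text{Coeff Var} &= 100 \times \frac{\text{Root MSE}}{\text{weedcount Mean}} = 100 \times \frac{6.876}{16.8} \approx 40.93
\end{align*}
```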
Bartlett's Test for Homogeneity of weedcount Variance

Source       DF    Chi-Square    Pr > ChiSq
herbicide     3       33.5957        <.0001

Bartlett’s Test: this is a Chi-Square test.
H0 = homogeneous variance among the treatment groups.

Level of              ----------weedcount----------
herbicide     N             Mean          Std Dev
A            10        3.6000000        1.5776213
B            10        8.6000000        2.7162065
C            10       21.2000000        6.2503333
D            10       33.8000000       11.8396697
Levene's Test for Homogeneity of weedcount Variance
ANOVA of Squared Deviations from Group Means

                    Sum of        Mean
Source       DF    Squares      Square    F Value    Pr > F
herbicide     3    99596.7     33198.9       8.89    0.0002
Error        36     134410      3733.6

Levene’s Test: this is an F test.
H0 = homogeneous variance among the treatment groups.

Level of              ----------weedcount----------
herbicide     N             Mean          Std Dev
A            10        3.6000000        1.5776213
B            10        8.6000000        2.7162065
C            10       21.2000000        6.2503333
D            10       33.8000000       11.8396697

We reject the H0 and conclude that at least one of the treatment groups has a
different variance.
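The heading “ANOVA of Squared Deviations from Group Means” shows which form of Levene’s test the default uses. As a hedged sketch, the default can be written out explicitly and compared with the absolute-deviation form. It assumes the dataset one from Part I is the data being analyzed.

```sas
/* Sketch: the default Levene test written out explicitly, plus the   */
/* absolute-deviation form. Assumes dataset 'one' from Part I.        */
Proc Glm data=one;
  Class herbicide;
  Model weedcount = herbicide;
  Means herbicide / hovtest=levene(type=square);  /* squared deviations, same as plain hovtest */
  Means herbicide / hovtest=levene(type=abs);     /* ANOVA of absolute deviations from group means */
Run;
Quit;
```

The two forms can give somewhat different p-values, so it is worth stating which one was used when reporting the result.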
Obtain residual plots, using the herbicide letter as the plotting symbol:
PROC PLOT data=new;
plot residual*predicted=herbicide;
Run;
Plot of residual*predicted. Symbol is value of herbicide.
[Scatter plot: residual on the vertical axis (-20 to 20) against predicted on the
horizontal axis (0 to 35), with each point marked by its herbicide letter.]
Notes on the plot:
0 is the mean for residuals.
The predicted values are the mean of each group.
The mean of group A is 3.6.
The mean of group D is 33.8, and this group shows a huge variation.
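Where ODS Graphics is available, the same residual plot can be drawn at higher resolution. The following is a sketch rather than part of the original exercise; it assumes the dataset new and the variables residual and predicted created by the Output statement in PROC GLM.

```sas
/* Sketch: residuals against predicted values, drawn with PROC SGPLOT. */
/* Assumes dataset 'new' from the OUTPUT statement in PROC GLM.        */
Proc Sgplot data=new;
  Scatter x=predicted y=residual / group=herbicide;  /* one marker style per herbicide */
  Refline 0 / axis=y;                                /* reference line at the residual mean of 0 */
Run;
```

A spread of points that widens from left to right, as it does for these data, is the visual counterpart of the significant Bartlett and Levene tests.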
Proc Univariate can be used to test for normality (normal option) and to
obtain a variety of descriptive plots, including stem and leaf diagrams, boxplots
and normal probability plots (plots option).
PROC UNIVARIATE data=new normal plots;
QQPLOT residual /NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
var residual;
Run;
The UNIVARIATE Procedure
Variable: residual

Moments
N                          40    Sum Weights                40
Mean                        0    Sum Observations            0
Std Deviation      6.60613545    Variance           43.6410256
Skewness           0.37278049    Kurtosis           1.56667458
Uncorrected SS           1702    Corrected SS             1702
Coeff Variation             .    Std Error Mean     1.04452173

Basic Statistical Measures
Location                  Variability
Mean       0.00000        Std Deviation            6.60614
Median    -0.10000        Variance                43.64103
Mode       0.40000        Range                   35.00000
                          Interquartile Range      5.60000

Tests for Location: Mu0=0
Test            -Statistic-      -----p Value------
Student's t     t       0        Pr > |t|     1.0000
Sign            M       0        Pr >= |M|    1.0000
Signed Rank     S   -20.5        Pr >= |S|    0.7868

Tests for Normality
Test                  --Statistic---       -----p Value------
Shapiro-Wilk          W       0.965268     Pr < W       0.2524
Kolmogorov-Smirnov    D       0.110838     Pr > D      >0.1500
Cramer-von Mises      W-Sq    0.109167     Pr > W-Sq    0.0852
Anderson-Darling      A-Sq    0.599428     Pr > A-Sq    0.1139

These normality tests all fail to reject the H0 that the residuals are normally
distributed.

Quantiles (Definition 5)
Quantile       Estimate
100% Max           19.2
99%                19.2
95%                12.2
90%                 7.8
75% Q3              2.9
50% Median         -0.1
25% Q1             -2.7
10%                -8.0
5%                -10.8
1%                -15.8
0% Min            -15.8

Extreme Observations
----Lowest----        ----Highest---
Value      Obs        Value      Obs
-15.8       34          6.8       22
-12.8       32          8.8       26
 -8.8       40         10.2       39
 -8.2       30         14.2       33
 -7.8       38         19.2       35

Stem Leaf          #    Boxplot
  18 2             1       0
  16
  14 2             1       0
  12
  10 2             1       |
   8 8             1       |
   6 8             1       |
   4 428           3       |
   2 4448          4    +-----+
   0 44444448      8    |  +  |
  -0 6662866       7    *-----*
  -2 68666         5    +-----+
  -4 2             1       |
  -6 822           3       |
  -8 82            2       |
 -10
 -12 8             1       0
 -14 8             1       0
     ----+----+----+----+

We want the stem and leaf diagram and boxplot to show even tails. So, these plots
look fine.

Normal Probability Plot
[Line-printer plot: observed residuals (*) against the normal reference line (+++),
vertical axis from -15 to 19, horizontal axis from -2 to +2.]

We want the observed values on the normal probability plot (*) to follow the
straight line prediction (+++). So this plot looks reasonable.
This is the Q-Q plot that we requested. The interpretation is the same as for
the normal probability plot.
Are the residuals for the variable ‘weedcounts’ normally distributed? Do they
have homogeneous variance? What is your proof? If not, can you determine
what transformation is needed? Rerun your analysis on the transformed data and
recheck the ANOVA assumptions.
One way to solve this variance problem is by doing a transformation. A log
transformation works here because the standard deviation is roughly proportional
to the mean: both increase steadily from herbicide A to herbicide D.
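One way to check that proportionality directly is to print the coefficient of variation for each herbicide. This is a minimal sketch, assuming the dataset one from Part I; it is an added illustration, not a step in the original exercise.

```sas
/* Sketch: does the standard deviation rise in proportion to the mean? */
/* Assumes dataset 'one' from Part I. CV is printed as a percentage.   */
Proc Means data=one n mean std cv maxdec=2;
  Class herbicide;
  Var weedcount;
Run;
```

From the group means and standard deviations reported with the homogeneity tests, the ratio of standard deviation to mean is about 0.44, 0.32, 0.29 and 0.35 for herbicides A to D. That is fairly stable given that the means differ almost tenfold, which is the pattern a log transformation is suited to.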
Data two;
Input herbicide $ @;
Do i=1 to 10;
Input weedcount @;
weedtr=log(weedcount);
Output;
End;
Datalines;
A 4 5 2 5 4 1 3 4 2 6
B 8 11 9 13 6 5 9 7 6 12
C 25 28 20 15 14 30 27 17 23 13
D 33 21 48 18 53 31 39 26 44 25
;
So now we’ll run these analyses with the log transformation and look at the tests for
homogeneous variance and normality.
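The statements for the rerun are not reproduced in the handout. A sketch consistent with the output that follows might look like this: the variable names resid2 and pred2 are taken from that output, while the dataset name new2 is an assumption made here.

```sas
/* Sketch of the rerun on the log scale. 'new2' is an assumed dataset name; */
/* resid2 and pred2 match the variable names in the output that follows.    */
Proc Print data=two;
Run;

Proc Glm data=two;
  Class herbicide;
  Model weedtr = herbicide;
  Means herbicide / hovtest=bartlett;
  Means herbicide / hovtest;
  Output out=new2 r=resid2 p=pred2;
Run;
Quit;

Proc Plot data=new2;
  Plot resid2*pred2=herbicide;     /* herbicide letter as the plotting symbol */
Run;

Proc Univariate data=new2 normal plots;
  Var resid2;
  Qqplot resid2 / normal(mu=est sigma=est);
Run;
```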
Obs    herbicide     i    weedcount     weedtr
  1        A         1         4       1.38629
  2        A         2         5       1.60944
  3        A         3         2       0.69315
  4        A         4         5       1.60944
  5        A         5         4       1.38629
  6        A         6         1       0.00000
  7        A         7         3       1.09861
  8        A         8         4       1.38629
  9        A         9         2       0.69315
 10        A        10         6       1.79176
 11        B         1         8       2.07944
 12        B         2        11       2.39790
 13        B         3         9       2.19722
 14        B         4        13       2.56495
 15        B         5         6       1.79176
 16        B         6         5       1.60944
 17        B         7         9       2.19722
 18        B         8         7       1.94591
 19        B         9         6       1.79176
 20        B        10        12       2.48491
 21        C         1        25       3.21888
 22        C         2        28       3.33220
 23        C         3        20       2.99573
 24        C         4        15       2.70805
 25        C         5        14       2.63906
 26        C         6        30       3.40120
 27        C         7        27       3.29584
 28        C         8        17       2.83321
 29        C         9        23       3.13549
 30        C        10        13       2.56495
 31        D         1        33       3.49651
 32        D         2        21       3.04452
 33        D         3        48       3.87120
 34        D         4        18       2.89037
 35        D         5        53       3.97029
 36        D         6        31       3.43399
 37        D         7        39       3.66356
 38        D         8        26       3.25810
 39        D         9        44       3.78419
 40        D        10        25       3.21888
The GLM Procedure

Class Level Information
Class        Levels    Values
herbicide         4    A B C D

Number of Observations Read    40
Number of Observations Used    40

Dependent Variable: weedtr

                                Sum of
Source             DF          Squares    Mean Square    F Value    Pr > F
Model               3      31.10546574    10.36848858      65.51    <.0001
Error              36       5.69826960     0.15828527
Corrected Total    39      36.80373535

So the herbicide treatments are still significant.

R-Square    Coeff Var    Root MSE    weedtr Mean
0.845171     16.32692    0.397851       2.436779

Source       DF      Type I SS    Mean Square    F Value    Pr > F
herbicide     3    31.10546574    10.36848858      65.51    <.0001

Source       DF    Type III SS    Mean Square    F Value    Pr > F
herbicide     3    31.10546574    10.36848858      65.51    <.0001
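Bartlett's Test for Homogeneity of weedtr Variance

Source       DF    Chi-Square    Pr > ChiSq
herbicide     3        4.1134        0.2495

Level of              ------------weedtr-----------
herbicide     N             Mean          Std Dev
A             6       1.11410195       0.64145443
B             6       2.10678469       0.36060559
C             6       3.04918625       0.32256552
D             6       3.45114698       0.43084133

We now fail to reject the H0 that there is homogeneity of variance among the
groups.
Note that the group summary printed with this test lists N = 6 per group, whereas
the ANOVA and the Levene output for weedtr are based on 10 observations per
group; the means in this summary match the first six counts on each data line.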
Levene's Test for Homogeneity of weedtr Variance
ANOVA of Squared Deviations from Group Means

                    Sum of        Mean
Source       DF    Squares      Square    F Value    Pr > F
herbicide     3     0.2369      0.0790       1.72    0.1793
Error        36     1.6486      0.0458

Level of              ------------weedtr-----------
herbicide     N             Mean          Std Dev
A            10       1.16544250       0.55193716
B            10       2.10605090       0.32084155
C            10       3.01246113       0.30843161
D            10       3.46316055       0.36116075

We now fail to reject Levene’s H0 that there is homogeneity of variance among
the treatment groups.

Plot of resid2*pred2. Symbol is value of herbicide.
[Scatter plot: resid2 on the vertical axis (-1.25 to 0.75) against pred2 on the
horizontal axis (1.0 to 3.5), with each point marked by its herbicide letter.]
As you can see in the above plot, the variances are more similar among groups. This
suggests the log transformation was successful at equalizing the variation among group
residuals.
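The improvement can be put in numbers by comparing the ratio of the largest to the smallest group standard deviation before and after the transformation. The values are the group standard deviations (10 observations per group) reported with the Levene tests; the symbols s_max and s_min are shorthand introduced here.

```latex
\begin{align*}
\text{weedcount:}\quad \frac{s_{\max}}{s_{\min}} &= \frac{11.840}{1.578} \approx 7.50 \\
\text{weedtr:}\quad \frac{s_{\max}}{s_{\min}} &= \frac{0.552}{0.308} \approx 1.79
\end{align*}
```

On the original scale the most variable group (D) has a standard deviation about 7.5 times that of the least variable group (A); on the log scale the largest ratio is under 2.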
Now let’s see what happened to normality.
The UNIVARIATE Procedure
Variable: resid2

Moments
N                          40    Sum Weights                40
Mean                        0    Sum Observations            0
Std Deviation      0.38224269    Variance           0.14610948
Skewness           -0.6997911    Kurtosis           0.53339561
Uncorrected SS      5.6982696    Corrected SS        5.6982696
Coeff Variation             .    Std Error Mean     0.06043788

Basic Statistical Measures
Location                  Variability
Mean      0.000000        Std Deviation            0.38224
Median    0.062260        Variance                 0.14611
Mode      0.220852        Range                    1.79176
                          Interquartile Range      0.61515

Tests for Location: Mu0=0
Test            -Statistic-      -----p Value------
Student's t     t       0        Pr > |t|     1.0000
Sign            M       1        Pr >= |M|    0.8746
Signed Rank     S      19        Pr >= |S|    0.8021

Tests for Normality
Test                  --Statistic---       -----p Value------
Shapiro-Wilk          W       0.949167     Pr < W       0.0710
Kolmogorov-Smirnov    D       0.124957     Pr > D       0.1150
Cramer-von Mises      W-Sq    0.08222      Pr > W-Sq    0.1956
Anderson-Darling      A-Sq    0.554735     Pr > A-Sq    0.1465

All tests for normality are not significant. We can assume that the residuals are
normally distributed.

Quantiles (Definition 5)
Quantile         Estimate
100% Max        0.6263170
99%             0.6263170
95%             0.4830149
90%             0.4439954
75% Q3          0.3057939
50% Median      0.0622603
25% Q1         -0.3093512
10%            -0.4722953
5%             -0.5347009
1%             -1.1654425
0% Min         -1.1654425

Extreme Observations
------Lowest------        ------Highest-----
    Value      Obs            Value      Obs
-1.165443        6         0.443995        2
-0.572789       34         0.443995        4
-0.496613       16         0.458898       14
-0.472295        9         0.507131       35
-0.472295        3         0.626317       10

Stem Leaf         #    Boxplot
   6 3            1       |
   5 1            1       |
   4 1446         4       |
   3 2289         4    +-----+
   2 0122289      7    |     |
   1 2            1    |     |
   0 399          3    *--+--*
  -0 7332         4    |     |
  -1 86           2    |     |
  -2 41           2    |     |
  -3 7110         4    +-----+
  -4 7752         4       |
  -5 70           2       |
  -6                      |
  -7                      |
  -8                      |
  -9                      |
 -10                      |
 -11 7            1       |
     ----+----+----+----+
Multiply Stem.Leaf by 10**-1

However, our stem and leaf and boxplot diagrams are less desirable than previous.
The distribution has been skewed to some extent, but by using the above normality
tests,

Normal Probability Plot
[Line-printer plot: observed residuals (*) against the normal reference line (+++),
vertical axis from -1.15 to 0.65, horizontal axis from -2 to +2.]

The normal probability plot still looks linear, further validating our assumption
of normality.