CSS 590 Experimental Design in Agriculture
Lab exercise – 5th week
Testing ANOVA assumptions

SAS On-line Documentation: Univariate Procedure, GLM Procedure

Part I – Data input formats

Up to this point we have entered data into SAS in the format of a SAS dataset. If you are working with large datasets that are in a different format, you may prefer to write a short program to rearrange the data in SAS. This can be achieved using Do loops and the trailing '@' symbol, which holds the current data line so that the next Input statement in the same iteration of the Data step keeps reading from it. A double trailing '@@' also holds the line across iterations of the Data step, so SAS continues reading from the same line until there are no more data points to be read. Run the program below, and note how the data is reformatted.

This input format consolidates the data: each data line starts with a herbicide code, followed by the weedcounts for that herbicide. The Do loop takes the next 10 values on the line as weedcounts, writing one observation per value, and then SAS goes on to the next line and the next herbicide.

    Data one;
        Input herbicide $ @;      /* read the herbicide code; @ holds the data line */
        Do i=1 to 10;             /* 10 weedcounts follow each herbicide code */
            Input weedcount @;    /* read the next weedcount from the same line */
            Output;               /* write one observation per weedcount */
        End;
    Datalines;
    A 4 5 2 5 4 1 3 4 2 6
    B 8 11 9 13 6 5 9 7 6 12
    C 25 28 20 15 14 30 27 17 23 13
    D 33 21 48 18 53 31 39 26 44 25
    ;
    Proc Print;
    Run;

    Obs    herbicide     i    weedcount
      1    A             1        4
      2    A             2        5
      3    A             3        2
      4    A             4        5
      5    A             5        4
      6    A             6        1
      7    A             7        3
      8    A             8        4
      9    A             9        2
     10    A            10        6
     11    B             1        8
     12    B             2       11
     13    B             3        9
     14    B             4       13
     15    B             5        6
     16    B             6        5
     17    B             7        9
     18    B             8        7
     19    B             9        6
     20    B            10       12
     21    C             1       25
     22    C             2       28
     23    C             3       20
     24    C             4       15
     25    C             5       14
     26    C             6       30
     27    C             7       27
     28    C             8       17
     29    C             9       23
     30    C            10       13
     31    D             1       33
     32    D             2       21
     33    D             3       48
     34    D             4       18
     35    D             5       53
     36    D             6       31
     37    D             7       39
     38    D             8       26
     39    D             9       44
     40    D            10       25

Part II. Testing the assumptions for ANOVA

Conduct a one-way ANOVA of the above data set using herbicide as the independent variable. Request a Bartlett's test for Homogeneity of Variances or use the default, which is Levene's test. Output the residuals and predicted values to a new data set for further diagnosis.
    PROC GLM;
        Class herbicide;
        Model weedcount = herbicide;
        Means herbicide / hovtest=bartlett;      /* Bartlett's test */
        Means herbicide / hovtest;               /* default: Levene's test */
        output out=new r=residual p=predicted;   /* save residuals and predicted values */
    Run;

    The GLM Procedure

    Class Level Information
    Class        Levels    Values
    herbicide         4    A B C D

    Number of Observations Read    40
    Number of Observations Used    40

    Dependent Variable: weedcount

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               3       5498.400000    1832.800000      38.77    <.0001
    Error              36       1702.000000      47.277778
    Corrected Total    39       7200.400000

    R-Square    Coeff Var    Root MSE    weedcount Mean
    0.763624     40.92788    6.875884          16.80000

    Source       DF      Type I SS    Mean Square    F Value    Pr > F
    herbicide     3    5498.400000    1832.800000      38.77    <.0001

    Source       DF    Type III SS    Mean Square    F Value    Pr > F
    herbicide     3    5498.400000    1832.800000      38.77    <.0001

    Bartlett's Test for Homogeneity of weedcount Variance
    Source       DF    Chi-Square    Pr > ChiSq
    herbicide     3       33.5957        <.0001

    Level of            ----------weedcount----------
    herbicide     N          Mean          Std Dev
    A            10     3.6000000        1.5776213
    B            10     8.6000000        2.7162065
    C            10    21.2000000        6.2503333
    D            10    33.8000000       11.8396697

Bartlett's test is a chi-square test. H0: the variances are homogeneous among the treatment groups.

    Levene's Test for Homogeneity of weedcount Variance
    ANOVA of Squared Deviations from Group Means
    Source       DF    Sum of Squares    Mean Square    F Value    Pr > F
    herbicide     3           99596.7        33198.9       8.89    0.0002
    Error        36            134410         3733.6

    Level of            ----------weedcount----------
    herbicide     N          Mean          Std Dev
    A            10     3.6000000        1.5776213
    B            10     8.6000000        2.7162065
    C            10    21.2000000        6.2503333
    D            10    33.8000000       11.8396697

Levene's test is an F test. H0: the variances are homogeneous among the treatment groups.

With both tests we reject the H0 and conclude that at least one of the treatment groups has a different variance.

Obtain residual plots:

    PROC PLOT data=new;
        plot residual*predicted=herbicide;   /* plotting symbol is the herbicide code */
    Run;

    [Plot of residual*predicted. Symbol is value of herbicide. Vertical axis: residual, -20 to 20. Horizontal axis: predicted, 0 to 35.]

The predicted values on the horizontal axis are the means of each group, and 0 is the mean of the residuals. The residuals of group A (mean 3.6) are tightly clustered around 0, while those of group D (mean 33.8) show a huge variation.

Proc Univariate can be used to test for normality (normal option) and to obtain a variety of descriptive plots, including normal probability plots (plots option).

    PROC UNIVARIATE data=new normal plots;
        QQPLOT residual / NORMAL(MU=EST SIGMA=EST COLOR=RED L=1);
        var residual;
    Run;

    The UNIVARIATE Procedure
    Variable: residual

    Moments
    N                          40    Sum Weights                40
    Mean                        0    Sum Observations            0
    Std Deviation      6.60613545    Variance           43.6410256
    Skewness           0.37278049    Kurtosis           1.56667458
    Uncorrected SS           1702    Corrected SS             1702
    Coeff Variation             .    Std Error Mean     1.04452173

    Basic Statistical Measures
    Location                 Variability
    Mean       0.00000       Std Deviation           6.60614
    Median    -0.10000       Variance               43.64103
    Mode       0.40000       Range                  35.00000
                             Interquartile Range     5.60000

    Tests for Location: Mu0=0
    Test             -Statistic-     -----p Value------
    Student's t      t       0       Pr > |t|     1.0000
    Sign             M       0       Pr >= |M|    1.0000
    Signed Rank      S   -20.5       Pr >= |S|    0.7868

    Tests for Normality
    Test                   --Statistic---      -----p Value------
    Shapiro-Wilk           W       0.965268    Pr < W       0.2524
    Kolmogorov-Smirnov     D       0.110838    Pr > D      >0.1500
    Cramer-von Mises       W-Sq    0.109167    Pr > W-Sq    0.0852
    Anderson-Darling       A-Sq    0.599428    Pr > A-Sq    0.1139

None of these normality tests rejects the H0 that the residuals are normally distributed (all P > 0.05).

    Quantiles (Definition 5)
    Quantile      Estimate
    100% Max          19.2
    99%               19.2
    95%               12.2
    90%                7.8
    75% Q3             2.9
    50% Median        -0.1
    25% Q1            -2.7
    10%               -8.0
    5%               -10.8
    1%               -15.8
    0% Min           -15.8

    Extreme Observations
    ----Lowest----      ----Highest---
    Value     Obs       Value     Obs
    -15.8      34         6.8      22
    -12.8      32         8.8      26
     -8.8      40        10.2      39
     -8.2      30        14.2      33
     -7.8      38        19.2      35

    [Stem-and-leaf plot and boxplot of residual]

We want the stem-and-leaf diagram and the boxplot to show even tails, so these plots look fine.

    [Normal probability plot of residual. Vertical axis: residual, -15 to 19. Horizontal axis: normal quantiles, -2 to +2.]

We want the observed values on the normal probability plot (*) to follow the straight-line prediction (+++), so this plot looks reasonable.

    [Q-Q plot of residual]

This is the Q-Q plot that we requested. The interpretation is the same as for the normal probability plot.

Are the residuals for the variable 'weedcount' normally distributed? Do they have homogeneous variance? What is your proof? If not, can you determine what transformation is needed? Rerun your analysis on the transformed data and recheck the ANOVA assumptions.

One way to solve this variance problem is by doing a transformation. A log transformation works here because the standard deviation is proportional to the mean.

    Data two;
        Input herbicide $ @;
        Do i=1 to 10;
            Input weedcount @;
            weedtr=log(weedcount);   /* natural log of the weedcount */
            Output;
        End;
    Datalines;
    A 4 5 2 5 4 1 3 4 2 6
    B 8 11 9 13 6 5 9 7 6 12
    C 25 28 20 15 14 30 27 17 23 13
    D 33 21 48 18 53 31 39 26 44 25
    ;

So now we'll run these analyses with the log transformation and look at the tests for homogeneous variance and normality.
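The handout shows only the output of the rerun, not the statements behind it. The following is a minimal sketch of what they might look like, assuming the data set two created above. The names resid2 and pred2 are taken from the output labels that follow; the data set name new2 is an assumption made for illustration.

```sas
/* Sketch only: assumes data set TWO (with weedtr) exists.          */
/* NEW2 is an assumed name; resid2 and pred2 match the output below. */
Proc Print data=two;
Run;

Proc GLM data=two;
    Class herbicide;
    Model weedtr = herbicide;
    Means herbicide / hovtest=bartlett;   /* Bartlett's test */
    Means herbicide / hovtest;            /* default: Levene's test */
    Output out=new2 r=resid2 p=pred2;     /* residuals and predicted values */
Run;
Quit;

Proc Plot data=new2;
    Plot resid2*pred2=herbicide;          /* plotting symbol is the herbicide code */
Run;
Quit;

Proc Univariate data=new2 normal plots;
    Var resid2;
Run;
```

Writing the residuals to a separate data set keeps the untransformed residuals in new available for comparison.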
    Obs    herbicide     i    weedcount     weedtr
      1    A             1        4        1.38629
      2    A             2        5        1.60944
      3    A             3        2        0.69315
      4    A             4        5        1.60944
      5    A             5        4        1.38629
      6    A             6        1        0.00000
      7    A             7        3        1.09861
      8    A             8        4        1.38629
      9    A             9        2        0.69315
     10    A            10        6        1.79176
     11    B             1        8        2.07944
     12    B             2       11        2.39790
     13    B             3        9        2.19722
     14    B             4       13        2.56495
     15    B             5        6        1.79176
     16    B             6        5        1.60944
     17    B             7        9        2.19722
     18    B             8        7        1.94591
     19    B             9        6        1.79176
     20    B            10       12        2.48491
     21    C             1       25        3.21888
     22    C             2       28        3.33220
     23    C             3       20        2.99573
     24    C             4       15        2.70805
     25    C             5       14        2.63906
     26    C             6       30        3.40120
     27    C             7       27        3.29584
     28    C             8       17        2.83321
     29    C             9       23        3.13549
     30    C            10       13        2.56495
     31    D             1       33        3.49651
     32    D             2       21        3.04452
     33    D             3       48        3.87120
     34    D             4       18        2.89037
     35    D             5       53        3.97029
     36    D             6       31        3.43399
     37    D             7       39        3.66356
     38    D             8       26        3.25810
     39    D             9       44        3.78419
     40    D            10       25        3.21888

    The GLM Procedure

    Class Level Information
    Class        Levels    Values
    herbicide         4    A B C D

    Number of Observations Read    40
    Number of Observations Used    40

    Dependent Variable: weedtr

    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               3       31.10546574    10.36848858      65.51    <.0001
    Error              36        5.69826960     0.15828527
    Corrected Total    39       36.80373535

    R-Square    Coeff Var    Root MSE    weedtr Mean
    0.845171     16.32692    0.397851       2.436779

So the herbicide treatments are still significant.

    Source       DF      Type I SS    Mean Square    F Value    Pr > F
    herbicide     3    31.10546574    10.36848858      65.51    <.0001

    Source       DF    Type III SS    Mean Square    F Value    Pr > F
    herbicide     3    31.10546574    10.36848858      65.51    <.0001

    Bartlett's Test for Homogeneity of weedtr Variance
    Source       DF    Chi-Square    Pr > ChiSq
    herbicide     3        4.1134        0.2495

    Level of            ------------weedtr-----------
    herbicide     N          Mean          Std Dev
    A            10    1.16544250       0.55193716
    B            10    2.10605090       0.32084155
    C            10    3.01246113       0.30843161
    D            10    3.46316055       0.36116075

We now fail to reject the H0 that the variances are homogeneous among the groups (P = 0.2495). Failing to reject is not proof that the variances are equal, but there is no longer evidence against the assumption.
    Levene's Test for Homogeneity of weedtr Variance
    ANOVA of Squared Deviations from Group Means
    Source       DF    Sum of Squares    Mean Square    F Value    Pr > F
    herbicide     3            0.2369         0.0790       1.72    0.1793
    Error        36            1.6486         0.0458

    Level of            ------------weedtr-----------
    herbicide     N          Mean          Std Dev
    A            10    1.16544250       0.55193716
    B            10    2.10605090       0.32084155
    C            10    3.01246113       0.30843161
    D            10    3.46316055       0.36116075

We now fail to reject Levene's H0 that there is homogeneity of variance among the treatment groups.

    [Plot of resid2*pred2. Symbol is value of herbicide. Vertical axis: resid2, -1.25 to 0.75. Horizontal axis: pred2, 1.0 to 3.5.]

As you can see in the above plot, the variances are more similar among groups. This suggests the log transformation was successful at equalizing the variation among group residuals. Now let's see what happened to normality.

    The UNIVARIATE Procedure
    Variable: resid2

    Moments
    N                          40    Sum Weights                40
    Mean                        0    Sum Observations            0
    Std Deviation      0.38224269    Variance           0.14610948
    Skewness           -0.6997911    Kurtosis           0.53339561
    Uncorrected SS      5.6982696    Corrected SS        5.6982696
    Coeff Variation             .    Std Error Mean     0.06043788

    Basic Statistical Measures
    Location                 Variability
    Mean      0.000000       Std Deviation           0.38224
    Median    0.062260       Variance                0.14611
    Mode      0.220852       Range                   1.79176
                             Interquartile Range     0.61515

    Tests for Location: Mu0=0
    Test             -Statistic-     -----p Value------
    Student's t      t       0       Pr > |t|     1.0000
    Sign             M       1       Pr >= |M|    0.8746
    Signed Rank      S      19       Pr >= |S|    0.8021

    Tests for Normality
    Test                   --Statistic---      -----p Value------
    Shapiro-Wilk           W       0.949167    Pr < W       0.0710
    Kolmogorov-Smirnov     D       0.124957    Pr > D       0.1150
    Cramer-von Mises       W-Sq    0.08222     Pr > W-Sq    0.1956
    Anderson-Darling       A-Sq    0.554735    Pr > A-Sq    0.1465

None of the tests for normality is significant at the 0.05 level.
We can therefore proceed under the assumption that the residuals are normally distributed.

    Quantiles (Definition 5)
    Quantile        Estimate
    100% Max       0.6263170
    99%            0.6263170
    95%            0.4830149
    90%            0.4439954
    75% Q3         0.3057939
    50% Median     0.0622603
    25% Q1        -0.3093512
    10%           -0.4722953
    5%            -0.5347009
    1%            -1.1654425
    0% Min        -1.1654425

    Extreme Observations
    ------Lowest------      ------Highest-----
    Value         Obs       Value         Obs
    -1.165443       6       0.443995        2
    -0.572789      34       0.443995        4
    -0.496613      16       0.458898       14
    -0.472295       9       0.507131       35
    -0.472295       3       0.626317       10

    [Stem-and-leaf plot and boxplot of resid2. Multiply Stem.Leaf by 10**-1.]

However, our stem-and-leaf and boxplot diagrams are less desirable than the previous ones. The distribution has been skewed to some extent, but the normality tests above did not reject normality.

    [Normal probability plot of resid2. Vertical axis: resid2, -1.15 to 0.65. Horizontal axis: normal quantiles, -2 to +2.]

The normal probability plot still looks linear, further validating our assumption of normality.
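The choice of a log transformation rested on the standard deviation being proportional to the mean. One way to check that on the untransformed data is to compare the coefficient of variation across groups. This sketch is not part of the original exercise and assumes the data set one from Part I is still available.

```sas
/* Sketch only: assumes data set ONE from Part I exists.           */
/* CV is reported as a percentage: 100 * Std Dev / Mean per group. */
Proc Means data=one n mean std cv;
    Class herbicide;
    Var weedcount;
Run;
```

From the group means and standard deviations reported earlier, the ratio of standard deviation to mean is about 0.44, 0.32, 0.29, and 0.35 for herbicides A through D. The standard deviation itself rises from 1.58 to 11.84, so the ratio is roughly constant by comparison, which is the pattern a log transformation is suited to.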