UC Davis Plant Sciences

PLS205
Lab 1
January 9, 2014
Laboratory Topics 1 & 2
∙ Welcome, introduction, logistics, and organizational matters
∙ Introduction to SAS
Writing and running programs; saving results; checking for errors
Different ways to input/import data
Proc Means, Proc Univariate (testing for normality)
∙ Hypothesis testing
t-Tests:
One sample
Two sample (Independent)
Two sample (Paired)
∙ Power Calculations using Proc Power
∙ Proc Sort, Proc Print, Proc Means
∙ Nifty SAS Program: Critical values generator
∙ Niftier webpage
∙ APPENDIX: Data input examples
Logistics and Organizational Matters
1. Homework is due at the beginning of lab, with 10 points off for every day it's late. If you don't
submit it by the time the homework key is posted (usually 24 hours later), you will receive a zero.
2. Print the lab handouts before coming to lab; they will be posted on the class website each week by
Wednesday night at the latest.
3. To log on to the lab computers, you need a UCD user ID and password.
4. Bring a diskette/flashdrive to lab to copy examples from the class directory (G:\PLS205\*.*).
5. This is a demanding class, so make use of all your resources – office hours, lab handouts, homework
keys, each other (the 205 Buddy System).
Introduction to SAS (your new best friend?)
To open SAS Version 9.3: START → All Programs → Class Software → SAS → SAS 9.3 (English)
The SAS Display Manager
There are three basic Windows, listed in the order you should view them:
1) The Program Editor window: Where you tell SAS what to do.
2) The Log window: Where SAS tells you what it did and (usually) what you did wrong.
3) The Output window: Where you find the results of your analysis (i.e. the good stuff).
Lab 1.1
Example 1
From ST&D p. 29 [Lab1ex1.sas]
Data BirdCount;            * Creates a new data set called "BirdCount";
Input Field Birds;         * Tells SAS the names of variables;
* "Cards" is a throwback to the old days. The data start on the line right after it;
Cards;
1 210
2 221
3 218
4 228
5 220
6 227
7 223
8 224
9 192
;
* SEMICOLON! SEMICOLON! SEMICOLON! The lone semicolon above ends the data;
Proc Means mean var std stderr cv Data = BirdCount;
Var Birds;
* Generate these requested statistics
for the variable "Birds" in the dataset "BirdCount";
Run;
Quit;
Output
Analysis Variable : Birds
                                                          Coeff of
       Mean      Variance       Std Dev     Std Error    Variation
------------------------------------------------------------------
218.1111111   124.3611111    11.1517313     3.7172438    5.1128671
------------------------------------------------------------------
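Two of the statistics in this output can be recomputed by hand from the others, which is a useful habit when checking results. The sketch below assumes the usual definitions (standard error = standard deviation / square root of n, and coefficient of variation = 100 * standard deviation / mean). The variable names are made up for illustration, and the numbers are copied from the output above.

```sas
* Sketch only: recompute Std Error and Coeff of Variation by hand;
Data _null_;
NFields  = 9;                             * number of observations in BirdCount;
BirdMean = 218.1111111;                   * Mean from the Proc Means output;
BirdStd  = 11.1517313;                    * Std Dev from the Proc Means output;
StdErrCheck = BirdStd / sqrt(NFields);    * standard error of the mean;
CVCheck     = 100 * BirdStd / BirdMean;   * coefficient of variation, in percent;
Put StdErrCheck= CVCheck=;                * writes both values to the Log window;
Run;
```

The two values written to the Log should agree with the Std Error and Coeff of Variation columns above.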
Things to Learn
1. Run (submit) a SAS program with a simple click on the running man icon.
2. Move between windows to scan for red-type errors (Log) and then view results (Output).
3. Clear Log and Output windows with a simple click on the blank page icon.
4. Save program to disk. From Program Editor window: File → Save as.
5. Save output to disk. From Output window: File → Save as.
6. Set the line size for output to 76 characters (the perfect fit for 10 point Courier font on a page with 1"
margins): Tools → Options → System → Log and procedure output control → SAS log and
procedure output → Double click linesize.
Example 2 PROC UNIVARIATE test of Normality
From ST&D pg. 30 [Lab1ex2.sas]
Data Barley;
Input Extract @@;          * @@ tells SAS to please read to the end of the line;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
Proc Univariate normal plot Data = Barley;
var Extract;
* Test for normality and generate plots
for the variable "Extract" in the dataset "Barley";
Run; Quit;
Comments on the code
1. Use @@ in the input statement when a data line holds more values than there are input variables; it
tells SAS to keep reading values from the same line.
2. The word "plot" in Proc Univariate is an example of an option. Its function is to generate several
graphical displays of the data, including a stem-and-leaf display, a boxplot, and a normal probability
plot (a.k.a. quantile-quantile or Q-Q plot) [see ST&D for interpretation of these displays: pages 30-32, 566-567].
3. The word "normal" in Proc Univariate is another option. Its function is to carry out tests for
normality. In this class, we will be using the Shapiro-Wilk test for normality.
Output
Variable: Extract

Moments
N                         14    Sum Weights               14
Mean              75.9428571    Sum Observations      1063.2
Std Deviation      1.2270755    Variance          1.50571429
Skewness          -0.2898702    Kurtosis          -1.0921714
Uncorrected SS      80762.02    Corrected SS      19.5742857
Coeff Variation   1.61578791    Std Error Mean    0.32794972

Basic Statistical Measures
     Location                  Variability
Mean      75.94286    Std Deviation          1.22708
Median    76.00000    Variance               1.50571
Mode      76.00000    Range                  3.80000
                      Interquartile Range    2.20000

Tests for Location: Mu0=0
Test           -Statistic-      -----p Value------
Student's t    t   231.5686     Pr > |t|    <.0001
Sign           M          7     Pr >= |M|   0.0001
Signed Rank    S       52.5     Pr >= |S|   0.0001

Tests for Normality
Test                  --Statistic---      -----p Value------
Shapiro-Wilk          W      0.945784     Pr < W       0.4974
Kolmogorov-Smirnov    D      0.161429     Pr > D      >0.1500
Cramer-von Mises      W-Sq   0.046718     Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq   0.297241     Pr > A-Sq   >0.2500

Stem Leaf    #
  77 7       1
  77 34      2
  76 569     3
  76 000     3
  75
  75 4       1
  74 67      2
  74 2       1
  73 9       1
     ----+----+----+----+

[Boxplot and Normal Probability Plot (Extract, 73.75 to 77.75, against normal quantiles from -2 to +2) not reproduced here.]
NOTE: The Shapiro-Wilk "W" statistic measures the linear correlation between the data and their
normal scores. The closer W is to 1, the more closely the data follow a normal distribution.
Normality is rejected when W is sufficiently smaller than 1, that is, when the value Pr < W is less than
0.05. In this example, p = 0.4974 > 0.05, so we do not reject normality and we treat the data as normally distributed.
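When only the normality tests are of interest, the long Proc Univariate listing can be trimmed. The sketch below is one way to do it, assuming the Output Delivery System (ODS) is available and that the table is named TestsForNormality, which is the name Proc Univariate gives to its normality table. The ODS statement is not part of the original lab files.

```sas
* Sketch only: same data as Example 2, showing only the normality tests;
Data Barley;
Input Extract @@;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
ODS Select TestsForNormality;     * keep only this table from the next Proc;
Proc Univariate normal Data = Barley;
var Extract;
Run; Quit;
```

The Shapiro-Wilk line of that table is the one used in this class.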
Example 3 PROC TTEST One sample
[Lab1ex3.sas]
One sample TTEST
One way to use Proc TTEST to test whether the mean equals a specific value (e.g. testing if μ = xx) is to create:
new variable = old variable - expected μ
In the following example, we will test the hypothesis that μ = 78 by creating a new variable Extract78 =
Extract - 78.0. We will then perform a t-test for the new variable against the hypothesis μ = 0 (see similar
example ST&D pg. 96-97).
Data Barley;
Input Extract @@;
Extract78 = Extract - 78.0;
* Here's that new variable;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
Proc Print;
* Proc Print displays the inputted data, a nice check;
Proc TTEST;
Var Extract; * T TEST original variety;
Var Extract78; * T TEST new variable Extract78;
Run; Quit;
Output
[Note: In your work, you would accompany this output with a line of interpretation.]
The one sample PROC TTEST produces a nice Q-Q graph and tests the null hypothesis that the mean is 0 (in
this case P<0.0001, and 0 is not included in the confidence interval). For the original Extract variable:
Variable: Extract
N     Mean      Std Dev   Std Err
14    75.9429   1.2271    0.3279

Mean      95% CL Mean          Std Dev
75.9429   75.2344   76.6513    1.2271

DF    t Value   Pr > |t|
13    231.57    <.0001
To test if the mean is equal to a certain value, we generate a new variable Extract78 = Extract - 78
(78 is the mean we want to test).
TTEST of this new variable produces the same nice graphs as before (not shown).
Variable: Extract78
N     Mean      Std Dev   Std Err   Minimum   Maximum
14    -2.0571   1.2271    0.3279    -4.1000   -0.3000

Mean      95% CL Mean          Std Dev
-2.0571   -2.7656   -1.3487    1.2271

DF    t Value   Pr > |t|
13    -6.27     <.0001
Things to Notice
1. The t-test is highly significant (p < 0.0001), so we reject H0.
2. The 95% confidence interval of the mean for Extract is [75.23, 76.65]. The value 78 is far
above the upper limit of this confidence interval. That is why the test is highly significant.
3. The 95% confidence interval of the mean for Extract78 is [-2.77, -1.35], which does not include
0, so the mean of this sample is significantly different from 78 (P<0.0001).
Try repeating the exercise using 75.234 (the lower limit of the confidence interval) as the null
mean. What p-value do you expect from the t-test?
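The subtraction step can also be left to SAS. The sketch below uses the H0= option of Proc TTEST (the same option that appears in the last example of the Appendix) to state the null mean directly, so no helper variable is needed. It is an alternative to the Extract78 approach, not the code in Lab1ex3.sas.

```sas
* Sketch only: same data as Example 3, with the null mean given by H0=;
Data Barley;
Input Extract @@;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
Proc TTEST Data = Barley H0 = 78;    * H0= sets the null mean (the default is 0);
Var Extract;
Run; Quit;
```

The t Value should match the one shown for Extract78 (-6.27), while the mean and confidence limits are reported on the original Extract scale.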
Example 4 PROC TTEST 2 independent samples
[Lab1ex4.sas]
A classification variable (named in this case Trt) is required to tell SAS which values belong to each group.
Alphanumeric variables are indicated by $ after the name.
Data Barley;
Input Trt $ Extract @@;
Cards;
Var1 77.7 Var1 76.0 Var1 76.9 Var1 74.6 Var1 74.7 Var1 76.5 Var1 74.2
Var1 75.4 Var1 76.0 Var1 76.0 Var1 73.9 Var1 77.4 Var1 76.6 Var1 77.3
Var2 79.5 Var2 77.3 Var2 77.9 Var2 75.0 Var2 74.0 Var2 75.9 Var2 73.5
Var2 76.3 Var2 77.8 Var2 77.4 Var2 75.7 Var2 79.3 Var2 76.8 Var2 78.7
;
Proc TTEST;
class trt; * assumes independent samples;
var extract;
Proc sort;
by Trt;
Proc Univariate normal plot Data = Barley;
var Extract;
by Trt;
Run; Quit;
Trt     N     Mean      Std Dev   Std Err
Var1    14    75.9429   1.2271    0.3279
Var2    14    76.7929   1.8441    0.4929

Confidence Interval
Trt     Mean      95% CL Mean
Var1    75.9429   75.2344   76.6513
Var2    76.7929   75.7281   77.8576

TTEST
Method          Variances   DF       t Value   Pr > |t|
Pooled          Equal       26       -1.44     0.1630
Satterthwaite   Unequal     22.625   -1.44     0.1647

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F   13       13       2.26      0.1550
In this case there is no significant difference in malt extract between the two varieties (P=0.16). The
test for the equality of variances is not significant (NS), so we use the P value from the Equal variances
(Pooled) row. If the test for the equality of variances is significant, use the P value from the Unequal
variances (Satterthwaite) row.
Tests for Normality using the by statement
Trt    Test           Statistic     p Value
Var1   Shapiro-Wilk   W 0.945784    Pr < W 0.4974   NS (then we do not reject Normality)
Var2   Shapiro-Wilk   W 0.968419    Pr < W 0.8548   NS
Example 5 PROC TTEST paired samples
[Lab1ex5.sas]
The two values are taken from the same experimental unit (NOT INDEPENDENT). For example, assume that Var2
is the same sample extracted at a different temperature. The code for paired TTEST is different.
Data Barley;
Input Var1 Var2 @@;
*paired samples;
Cards;
77.7 78.4
76.0 78.0
76.9 78.5
74.6 75.0
74.7 75.3
76.5 77.1
74.2 74.6
75.4 75.8
76.0 79.0
76.0 78.5
73.9 74.3
77.4 79.5
76.6 78.1
77.3 78.6
;
Proc TTEST;
paired var1*Var2; * assumes paired samples;
Run;
Quit;
The Paired TTEST Procedure generates a new variable equal to the difference Var1 - Var2 and then performs a
one sample TTEST to see if that difference is 0. Note that the 95% confidence interval [-1.76, -0.74] does not
include 0. This agrees with the high significance of this test (P = 0.0001).
N     Mean      Std Dev   Std Err   Minimum   Maximum
14    -1.2500   0.8830    0.2360    -3.0000   -0.4000

Mean      95% CL Mean          Std Dev   95% CL Std Dev
-1.2500   -1.7598   -0.7402    0.8830    0.6401   1.4225

DF    t Value   Pr > |t|
13    -5.30     0.0001
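To see that the paired test really is a one sample test on the differences, the difference can be computed explicitly and tested against 0. This is a sketch for comparison only; the data set name BarleyDiff and the variable Diff are made-up names, not part of Lab1ex5.sas.

```sas
* Sketch only: the paired t-test redone as a one sample test on the differences;
Data BarleyDiff;
Input Var1 Var2 @@;
Diff = Var1 - Var2;          * same difference the PAIRED statement builds;
Cards;
77.7 78.4 76.0 78.0 76.9 78.5 74.6 75.0 74.7 75.3 76.5 77.1 74.2 74.6
75.4 75.8 76.0 79.0 76.0 78.5 73.9 74.3 77.4 79.5 76.6 78.1 77.3 78.6
;
Proc TTEST Data = BarleyDiff;
Var Diff;                    * tests H0: mean difference = 0;
Run; Quit;
```

The mean, confidence limits, and t Value for Diff should match the paired output above.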
Example 6 Power calculation with PROC POWER
[Lab1ex6.sas]
One Sample power test. What is the power of a test to detect a difference between the observed mean of
75.94 and null means of 75.94 (the same value), 76.79, 77, and 78?
proc power;
onesamplemeans
mean = 75.94
ntotal = 14
stddev = 1.23
nullmean = 75.94 76.79 77 78
alpha = 0.05
power = .;
run; quit;
The POWER Procedure
One-sample t Test for Mean
Fixed Scenario Elements
Distribution          Normal
Method                Exact
Alpha                 0.05
Mean                  75.94
Standard Deviation    1.23
Total Sample Size     14
Number of Sides       2

Computed Power
Index   Null Mean   Power
1       75.9        0.050
2       76.8        0.667
3       77.0        0.846
4       78.0        >.999
Things to Notice
1. The "." after power indicates that you are requesting the power.
2. The onesamplemeans statement is one line of code up to the ";". It is split across multiple lines to
make it easier to read.
3. The power to detect a difference from a null mean of 77 is 0.846, and the power increases to almost 1
when the null mean is 78. The minimum value of the power equals alpha (0.05), which occurs when the
null mean is the same as the observed mean. You generally want a power of at least 0.80 (80%).
Notice that the 95% confidence interval of the mean, [75.23, 76.65], excludes both 77 and 78. The value
78 is far above the upper limit of this confidence interval, which is why the t-test against 78 is highly
significant.
Proc Power can also be used to estimate the number of samples required to obtain a certain power:
proc power;
onesamplemeans
mean = 75.94
ntotal = .
stddev = 1.23
nullmean = 77
alpha = 0.05
power = 0.80 0.90 0.95 0.99 0.846 0.845;
run;
The POWER Procedure
One-sample t Test for Mean
Fixed Scenario Elements
Distribution          Normal
Method                Exact
Null Mean             77
Alpha                 0.05
Mean                  75.94
Standard Deviation    1.23
Number of Sides       2

Computed N Total
Index   Nominal Power   Actual Power   N Total
1       0.800           0.814          13
2       0.900           0.915          17
3       0.950           0.955          20
4       0.990           0.991          27
5       0.846           0.873          15
6       0.845           0.846          14
SAS rounds the estimated sample size up to the next whole number when it has decimals, to guarantee
at least the requested power. That is why the actual power is never below the nominal power.
Two sample power test. What is the power of a test to detect a difference between two samples with the
following means and variances?
            Mean   Variance   N
Sample 1    90     15         6
Sample 2    95     17         6
Mean difference = 5
Pooled s = SQRT((15+17)/2) = 4 (not the same as the average of the standard deviations)
proc power;
twosamplemeans test=diff
meandiff = 5
stddev = 4
npergroup = 6 10 20
power = .;
run; quit;
The POWER Procedure
Two-sample t Test for Mean Difference
Distribution          Normal
Method                Exact
Mean Difference       5
Standard Deviation    4
Number of Sides       2
Null Difference       0
Alpha                 0.05

Computed Power
Index   N Per Group   Power
1       6             0.498
2       10            0.753
3       20            0.971
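As in the one sample case, the question can be turned around to ask how many samples per group are needed. The sketch below reuses the same scenario and puts the "." on npergroup instead of power. The target powers of 0.80 and 0.90 are chosen here only for illustration.

```sas
* Sketch only: same two-sample scenario, now solving for the group size;
proc power;
twosamplemeans test=diff
meandiff = 5
stddev = 4
npergroup = .          /* "." asks SAS to solve for N per group */
power = 0.80 0.90;     /* illustrative target powers */
run; quit;
```

Judging from the table above, the answer for a power of 0.80 should fall between 10 and 20 per group.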
Example 7
[Lab1ex7.sas]
This next example illustrates the use of Proc Sort, Proc Print, and Proc Means:
Data Grades;
Input StudentNo GradUG $ HWGrade Midterm Final;   * $ indicates a non-numeric class variable;
FinalGrade = 0.25*HWgrade + 0.35*Midterm + 0.40*Final;
Cards;
13 G  92 84 89
9  G  85 65 80
47 G  90 81 92
21 UG 82 73 86
60 G  94 96 98
4  UG 89 82 90
;
Proc Sort;                   * Orders the data by the variable named below;
By StudentNo;
Proc Print;                  * Displays the inputted data in whatever order you wish;
Title 'Roster in order of Student Number';
ID StudentNo;
Var HWGrade Midterm Final FinalGrade;
Proc Means n mean std var stderr maxdec=1;   * MaxDec limits all numbers to 1 decimal place;
Title 'Descriptive statistics';
Var HWGrade Midterm Final FinalGrade;
Proc Sort;
By GradUG;                   * Sorting is needed because of the Proc Means below;
Proc Means n mean std var stderr maxdec=1;
Title 'Descriptive statistics by student level';
Var HWGrade Midterm Final FinalGrade;
By GradUG;                   * Without Proc Sort above, this would confuse SAS;
Proc Plot;
Plot Final*FinalGrade;       * Generates plot of Final (y) vs. FinalGrade (x);
Run;
Quit;
Note: A title stays in effect until you change it. If you add a title to one Proc statement but not to
the others, all the later Proc outputs will carry the same title. In fact, titles carry over to future
programs run in the same SAS session! To avoid confusion, you should title everything, especially as
your programs become more complicated and the output more profuse.
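If a leftover title is getting in the way, it can be cleared rather than replaced. The sketch below reuses the BirdCount data from Example 1 and relies on the fact that a Title statement with no text removes the current titles. It is an illustration, not one of the lab files.

```sas
* Sketch only: how a title persists, and how to clear it;
Data BirdCount;
Input Field Birds;
Cards;
1 210
2 221
3 218
4 228
5 220
6 227
7 223
8 224
9 192
;
Title 'Bird counts by field';
Proc Print Data = BirdCount;      * printed with the title;
Run;
Proc Means mean Data = BirdCount; * no new Title, so the old one is reused here;
Var Birds;
Run;
Title;                            * a Title statement with no text clears the title;
Proc Print Data = BirdCount;      * printed without the old title;
Run; Quit;
```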
Nifty SAS Program
[SASCritValues.sas]
Tables of critical values rarely contain the exact values you are looking for.
Here's a way to use SAS to find critical values and p-values with precision:
Data ValueFinder;
TITLE 'CRITICAL VALUES';
* The functions below find the critical value for a specified probability 'p';
* where 'p' is the proportion of the area to the **LEFT** of the critical value;
* [e.g. 0.975 will be the 'p' for a 5% two-tailed test];
Nvalue = PROBIT (0.975);          * This is Z;
Tvalue = TINV (0.975, 20);        * This is t (p, df);
Chivalue = CINV (0.975, 20);      * This is chi-square (p, df);
Fvalue = FINV (0.975, 20, 4);     * This is F (p, NUM df, DEN df);
TITLE 'PROBABILITY';
* These functions return the probability that an observation is < x;
Nprob = PROBNORM (1.96);          * Z;
Tprob = PROBT (2.086, 20);        * t;
Chiprob = PROBCHI (34.2, 20);     * chi-square;
Fprob = PROBF (8.56, 20, 4);      * F;
Proc Print;
Run;
Quit;
Very very handy; but if you use this, please be aware of what SAS is telling you, namely that it is the areas
to the LEFT of the critical values that are being considered. Double-check your results with a table until
you get the hang of it.
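Because these functions return the area to the LEFT, a two-tailed p-value takes one extra step. The sketch below works it out for the t Value reported for Extract78 in Example 3 (t = -6.27 with 13 df). The data set and variable names are made up for illustration.

```sas
* Sketch only: two-tailed p-value from a t statistic;
Data PValueCheck;
tObs = -6.27;                          * t Value reported for Extract78;
dfObs = 13;                            * degrees of freedom, n - 1;
LeftArea = PROBT (abs(tObs), dfObs);   * area to the LEFT of |t|;
TwoTailP = 2 * (1 - LeftArea);         * area in both tails together;
Proc Print;
Run;
Quit;
```

The resulting TwoTailP should be consistent with the "<.0001" that Proc TTEST printed for this test.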
Niftier Website
There are a lot of free critical values calculators available on-line as well. Feel free to use them, but be
sure you understand how they work. The best way to do this is by checking some test values against
the tables in the book (or on the class webpage). A good site:
http://www.graphpad.com/quickcalcs/DistMenu.cfm
Caution: Be aware of what these calculators are telling you, namely that it is the areas to the LEFT or
RIGHT of the critical values that are being considered. Double-check your results with a table until you
get the hang of it.
APPENDIX: Data Input Examples
Students lose a shocking number of points on homeworks and exams due to incorrect data input (i.e.
careless typographical errors). Very rarely should you ever have to input data number-by-number
because almost all the datasets will be provided to you already typed into Word documents. The
challenge you have is to structure your data input routine in SAS such that it will read correctly whatever
you cut-and-paste into your code. The "Do-End-loops" illustrated below may look complicated, but it is
worth your time to understand how they work, especially as our data sets become bigger and bigger.
Example dataset 1
5 treatments with 5 replications each
A      B      C      D      E
3.08   3.30   5.73   1.87   2.25
5.51   3.19   5.18   3.30   4.78
5.07   4.29   5.06   2.64   3.13
4.41   1.87   3.96   3.08   2.91
3.85   1.32   3.74   3.85   2.58
Possible SAS data entry code:
Data Example1;
Input Treatment $ @@;
Do Replication = 1 to 5;
Input Response @@;
Output;
End;
Cards;
A 3.08 5.51 5.07 4.41 3.85
B 3.30 3.19 4.29 1.87 1.32
C 5.73 5.18 5.06 3.96 3.74
D 1.87 3.30 2.64 3.08 3.85
E 2.25 4.78 3.13 2.91 2.58
;
If this is scary, you can also paste the above table into Excel and manipulate it (again, by cutting and
pasting and transposing, not by retyping numbers) to give you something like this:
A 3.08
A 5.51
A 5.07
A 4.41
A 3.85
B 3.3
B 3.19
B 4.29
B 1.87
B 1.32
C 5.73
C 5.18
C 5.06
C 3.96
C 3.74
D 1.87
D 3.3
D 2.64
D 3.08
D 3.85
E 2.25
E 4.78
E 3.13
E 2.91
E 2.58
Once you are here, the SAS code is straightforward:
Data Example1;
Input Treatment $ Response;     * $ is needed because Treatment holds letters;
Cards;
A 3.08
A 5.51
.
.
.
E 2.91
E 2.58
;
The two approaches are equivalent, but as the data sets become bigger, the Excel manipulations needed
for the second approach will become more and more cumbersome.
Example data set 2
Combinations of treatments with 10 replications each
Trt1A   Trt2A   131  109  133  136  142  126  150  142  167  145
        Trt2B   136  103   78  154  132  122  114  107  120  127
        Trt2C   101  142  164  144  113  149  139  121  162  154
Trt1B   Trt2A    68  114  101  120  113  134  147   97   85  114
        Trt2B   149  132  134  111  136  103  103   92  124   80
        Trt2C   125  106  125   89  120   71  100  125  132  108
Possible SAS data entry code:
Data Example2;
Do Trt1 = 1 to 2;
Do Trt2 = 1 to 3;
Do Rep = 1 to 10;
Input Response @@;
Output;
End;
End;
End;
Cards;
131  109  133  136  142  126  150  142  167  145
136  103   78  154  132  122  114  107  120  127
101  142  164  144  113  149  139  121  162  154
 68  114  101  120  113  134  147   97   85  114
149  132  134  111  136  103  103   92  124   80
125  106  125   89  120   71  100  125  132  108
;
Here we’ve set up the input routine in such a way that we could just cut-and-paste the data table into SAS.
No chance for typographical errors.
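Pasting removes typing errors, but it is still worth confirming that the loops labeled the values the way you intended. The sketch below is one possible check, using Proc Freq to count the observations in each Trt1 by Trt2 combination. The Proc Freq step is an addition for illustration and is not part of the lab files.

```sas
* Sketch only: confirm that the Do-End loops labeled every value;
Data Example2;
Do Trt1 = 1 to 2;
Do Trt2 = 1 to 3;
Do Rep = 1 to 10;
Input Response @@;
Output;
End;
End;
End;
Cards;
131  109  133  136  142  126  150  142  167  145
136  103   78  154  132  122  114  107  120  127
101  142  164  144  113  149  139  121  162  154
 68  114  101  120  113  134  147   97   85  114
149  132  134  111  136  103  103   92  124   80
125  106  125   89  120   71  100  125  132  108
;
Proc Freq Data = Example2;
Tables Trt1*Trt2 / nopercent norow nocol;   * show counts only;
Run; Quit;
```

Each of the six cells should show a frequency of 10. Following up with Proc Print lets you spot-check a few values against the original table.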
Example data set 3
Each data point identified by four classification variables
                 C1                C2                C3                C4
            D1   D2   D3      D1   D2   D3      D1   D2   D3      D1   D2   D3
A1   B1    121  121  116     107  104  110     119  116  108      92  101  121
     B2    123  131  125     113  138  119     118  116  118     113  107  123
     B3    107  160  160     129  114  107     119  114  107     131  103   86
     B4    123  119  127     129  131  100     121   99  111     105   92  108
A2   B1    123  118  138     151  104  127     108  118  108     136  116  114
     B2    129  131  140     157  127  133     119  121   99     123  100   95
     B3    131  129  131     136  143  121     131  131  108     131  127  110
     B4    131  131  129     151  131  118     118  114  119     125  127   90
Possible SAS data entry code:
Data Example3;
Do ClassA = 1 to 2;
Do ClassB = 1 to 4;
Do ClassC = 1 to 4;
Do ClassD = 1 to 3;
Input Response @@;
Output;
End;
End;
End;
End;
Cards;
121  121  116     107  104  110     119  116  108      92  101  121
123  131  125     113  138  119     118  116  118     113  107  123
107  160  160     129  114  107     119  114  107     131  103   86
123  119  127     129  131  100     121   99  111     105   92  108
123  118  138     151  104  127     108  118  108     136  116  114
129  131  140     157  127  133     119  121   99     123  100   95
131  129  131     136  143  121     131  131  108     131  127  110
131  131  129     151  131  118     118  114  119     125  127   90
;
Voila! Without the Do-End loops, the same dataset would be five times as large because you would have
to input the individual classification address for each and every data point (e.g. A2, B3, C2, D1). Again,
this may seem unnecessary to you now; but please take the time to learn it. And if you have any
questions, just ask.
Example data set 4
Each score is entered together with a count of how many times it occurred
data read;
input score count @@;
datalines;
40 2   47 2   52 2   26 1   19 2
25 2   35 4   39 1   26 1   48 1
14 2   22 1   42 1   34 2   33 2
18 1   15 1   29 1   41 2   44 1
51 1   43 1   27 2   46 2   28 1
49 1   31 1   28 1   54 1   45 1
;
The following statements invoke the TTEST procedure to test if the mean test score is
equal to 30.
proc ttest data=read h0=30;
var score;
freq count;
run;