PLS205 Lab 1
January 9, 2014

Laboratory Topics 1 & 2

∙ Welcome, introduction, logistics, and organizational matters
∙ Introduction to SAS
     Writing and running programs; saving results; checking for errors
     Different ways to input/import data
     Proc Means, Proc Univariate (testing for normality)
∙ Hypothesis testing
     t-Tests: One sample
              Two sample (Independent)
              Two sample (Paired)
∙ Power Calculations using Proc Power
∙ Proc Sort, Proc Print, Proc Means
∙ Nifty SAS Program: Critical values generator
∙ Niftier webpage
∙ APPENDIX: Data input examples

Logistics and Organizational Matters

1. Homework is due at the beginning of lab, with 10 points off for every day it's late. If you don't submit it by the time the homework key is posted (usually 24 hours later), you will receive a zero.
2. Print the lab handouts before coming to lab; they will be posted on the class website each week by Wednesday night at the latest.
3. To log on to the lab computers, you need a UCD user ID and password.
4. Bring a diskette or flash drive to lab to copy examples from the class directory (G:\PLS205\*.*).
5. This is a demanding class, so make use of all your resources: office hours, lab handouts, homework keys, each other (the 205 Buddy System).

Introduction to SAS (your new best friend?)

To open SAS Version 9.3:
     START > All Programs > Class Software > SAS > SAS 9.3 (English)

The SAS Display Manager

There are three basic Windows, listed in the order you should view them:

1) The Program Editor window: Where you tell SAS what to do.
2) The Log window: Where SAS tells you what it did and (usually) what you did wrong.
3) The Output window: Where you find the results of your analysis (i.e. the good stuff).

Example 1     From ST&D p. 29     [Lab1ex1.sas]

Data BirdCount;          * Creates a new data set called "BirdCount";
Input Field Birds;       * Tells SAS the names of variables;
Cards;                   * A throwback to the old days;
1 210
2 221
3 218
4 228
5 220
6 227
7 223
8 224
9 192
;                        * SEMICOLON! SEMICOLON!
SEMICOLON!;
Proc Means mean var std stderr cv Data = BirdCount;
Var Birds;               * Generate these requested statistics for the variable "Birds" in the dataset "BirdCount";
Run; Quit;

Output

Analysis Variable : Birds

                                                       Coeff of
       Mean      Variance       Std Dev    Std Error   Variation
---------------------------------------------------------------
218.1111111   124.3611111    11.1517313    3.7172438   5.1128671
---------------------------------------------------------------

Things to Learn

1. Run (submit) a SAS program with a simple click on the running man icon.
2. Move between windows to scan for red-type errors (Log) and then view results (Output).
3. Clear Log and Output windows with a simple click on the blank page icon.
4. Save program to disk. From Program Editor window: File > Save as.
5. Save output to disk. From Output window: File > Save as.
6. Set the line size for output to 76 characters (the perfect fit for 10 point Courier font on a page with 1" margins):
     Tools > Options > System > Log and procedure output control > SAS log and procedure output > Double click linesize

Example 2     PROC UNIVARIATE test of Normality     From ST&D pg. 30     [Lab1ex2.sas]

Data Barley;
Input Extract @@;        * @@ tells SAS to please read to the end of the line;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
Proc Univariate normal plot Data = Barley;
var Extract;             * Test for normality and generate plots for the variable "Extract" in the dataset "Barley";
Run; Quit;

Comments on the code

1. Use @@ in the input statement when you have more "Cards" on a row than input variables.
2. The word "plot" in Proc Univariate is an example of an option. Its function is to generate several graphical displays of the data, including a stem-and-leaf display, a boxplot, and a normal probability plot (a.k.a. quantile-quantile or Q-Q plot) [see ST&D for interpretation of these displays: pages 30-32, 566-567].
3. The word "normal" in Proc Univariate is another option.
   Its function is to carry out tests for normality. In this class, we will be using the Shapiro-Wilk test for normality.

Output

Variable: Extract

                           Moments
N                         14    Sum Weights               14
Mean              75.9428571    Sum Observations      1063.2
Std Deviation      1.2270755    Variance          1.50571429
Skewness          -0.2898702    Kurtosis          -1.0921714
Uncorrected SS      80762.02    Corrected SS      19.5742857
Coeff Variation   1.61578791    Std Error Mean    0.32794972

               Basic Statistical Measures
    Location                    Variability
Mean     75.94286     Std Deviation            1.22708
Median   76.00000     Variance                 1.50571
Mode     76.00000     Range                    3.80000
                      Interquartile Range      2.20000

          Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t  231.5686     Pr > |t|    <.0001
Sign           M         7     Pr >= |M|   0.0001
Signed Rank    S      52.5     Pr >= |S|   0.0001

               Tests for Normality
Test                  --Statistic---     -----p Value------
Shapiro-Wilk          W     0.945784     Pr < W      0.4974
Kolmogorov-Smirnov    D     0.161429     Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.046718     Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.297241     Pr > A-Sq  >0.2500

Stem Leaf        #
  77 7           1
  77 34          2
  76 569         3
  76 000         3
  75
  75 4           1
  74 67          2
  74 2           1
  73 9           1
     ----+----+----+----+

[Boxplot and Normal Probability Plot of Extract: vertical axis from 73.75 to 77.75, horizontal axis of normal quantiles from -2 to +2; plots not reproduced here.]

NOTE: The Shapiro-Wilk "W" statistic measures the linear correlation between the data and their normal scores. The closer W is to 1, the more closely the data follow a normal distribution. Normality is rejected when W is sufficiently smaller than one, that is, when the value Pr < W is less than 0.05. In this example, p = 0.4974 > 0.05, so we do not reject normality: the data are consistent with a normal distribution.

Example 3     PROC TTEST One sample     [Lab1ex3.sas]

One sample TTEST

One way to use Proc TTEST for a one-sample t-test (e.g. testing whether μ = xx) is to create:

     new variable = old variable - expected

In the following example, we will test the hypothesis that μ = 78 by creating a new variable Extract78 = Extract - 78.0.
We will then perform a t-test for the new variable against the hypothesis μ = 0 (see similar example ST&D pg. 96-97).

Data Barley;
Input Extract @@;
Extract78 = Extract - 78.0;   * Here's that new variable;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2 75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
Proc Print;                   * Proc Print displays the inputted data, a nice check;
Proc TTEST;
Var Extract Extract78;        * T TEST the original variable and the new variable Extract78;
Run; Quit;

Output
[Note: In your work, you would accompany this output with a line of interpretation.]

The one sample PROC TTEST produces a nice Q-Q graph and tests the hypothesis that the mean is 0 (in this case P<0.0001, consistent with 0 not being included in the confidence interval).

For the original Extract variable:

Variable: Extract

 N       Mean    Std Dev    Std Err
14    75.9429     1.2271     0.3279

   Mean       95% CL Mean       Std Dev
75.9429    75.2344   76.6513     1.2271

DF    t Value    Pr > |t|
13     231.57      <.0001

To test if the mean is equal to a certain value, we generate a new variable Extract78 = Extract - 78 (the mean we want to test). TTEST of this new variable produces the same nice graphs as before (not shown).

Variable: Extract78

 N       Mean    Std Dev    Std Err    Minimum    Maximum
14    -2.0571     1.2271     0.3279    -4.1000    -0.3000

   Mean       95% CL Mean       Std Dev
-2.0571    -2.7656   -1.3487     1.2271

DF    t Value    Pr > |t|
13      -6.27      <.0001

Things to Notice

1. The t-test is highly significant (p < 0.0001), so we reject H0.
2. The 95% confidence interval of the mean for Extract is [75.23, 76.65]. See that the value 78 is far above the upper limit of this confidence interval. That is why the test is highly significant.
3. The 95% confidence interval of the mean for Extract78 is [-2.77, -1.35], which does not include 0, so the mean of this sample is significantly different from 78 (P<0.0001).

Try repeating the exercise using 75.234 (the lower extreme of the confidence interval) as the Null Mean. What is the expected probability of the t-test?
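The same test can also be requested without building a new variable: PROC TTEST accepts a null mean through its H0= option, which is also how the last example in the appendix of this handout tests a mean against 30. The sketch below is only an illustration; it assumes the Barley data set from Example 3 has already been created in the current SAS session.

```sas
* Sketch only: assumes the data set Barley from Example 3 already exists;

Proc TTEST Data = Barley H0 = 78;       * Tests H0: mean of Extract = 78;
   Var Extract;
Run;

Proc TTEST Data = Barley H0 = 75.234;   * Null mean set at the lower 95% confidence limit;
   Var Extract;
Run; Quit;
```

The first call should reproduce the t value of -6.27 with 13 DF reported above for Extract78, since subtracting 78 from every observation and testing against 0 is the same comparison. The second call sets up the exercise posed above.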
Example 4     PROC TTEST 2 independent samples     [Lab1ex4.sas]

A classification variable (named in this case Trt) is required to tell SAS which values belong to each group. Alphanumeric variables are indicated by $ after the name.

Data Barley;
Input Trt $ Extract @@;
Cards;
Var1 77.7 Var1 76.0 Var1 76.9 Var1 74.6 Var1 74.7 Var1 76.5 Var1 74.2
Var1 75.4 Var1 76.0 Var1 76.0 Var1 73.9 Var1 77.4 Var1 76.6 Var1 77.3
Var2 79.5 Var2 77.3 Var2 77.9 Var2 75.0 Var2 74.0 Var2 75.9 Var2 73.5
Var2 76.3 Var2 77.8 Var2 77.4 Var2 75.7 Var2 79.3 Var2 76.8 Var2 78.7
;
Proc TTEST;
class trt;               * assumes independent samples;
var extract;
Proc sort;
by Trt;
Proc Univariate normal plot Data = Barley;
var Extract;
by Trt;
Run; Quit;

Output

Trt      N       Mean    Std Dev    Std Err
Var1    14    75.9429     1.2271     0.3279
Var2    14    76.7929     1.8441     0.4929

Confidence Interval
Trt        Mean       95% CL Mean
Var1    75.9429    75.2344   76.6513
Var2    76.7929    75.7281   77.8576

TTEST
Method           Variances        DF    t Value    Pr > |t|
Pooled           Equal            26      -1.44      0.1630
Satterthwaite    Unequal      22.625      -1.44      0.1647

Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        13        13       2.26    0.1550

In this case there are no significant differences in malt extract between the two varieties (P = 0.16). The test for the equality of variances is not significant (NS), so we use the P value for Equal variances. If the test for the equality of variances is significant, use the P value for Unequal variances.

Tests for Normality using the by statement

Trt     Test            -Statistic--     ---p Value----
Var1    Shapiro-Wilk    W   0.945784     Pr < W  0.4974     NS (so we do not reject Normality)
Var2    Shapiro-Wilk    W   0.968419     Pr < W  0.8548     NS

Example 5     PROC TTEST paired samples     [Lab1ex5.sas]

The two values are taken from the same experimental unit (NOT INDEPENDENT). For example, assume that Var2 is the same sample extracted at a different temperature. The code for paired TTEST is different.
Data Barley;
Input Var1 Var2 @@;      * paired samples;
Cards;
77.7 78.4 76.0 78.0 76.9 78.5 74.6 75.0 74.7 75.3 76.5 77.1 74.2 74.6
75.4 75.8 76.0 79.0 76.0 78.5 73.9 74.3 77.4 79.5 76.6 78.1 77.3 78.6
;
Proc TTEST;
paired var1*Var2;        * assumes paired samples;
Run; Quit;

The Paired TTEST Procedure generates a new variable equal to the difference Var1 - Var2 and then performs a one sample TTEST to see if that difference is 0. Note that the 95% confidence interval [-1.76, -0.74] does not include 0. This agrees with the high significance of this test (P = 0.0001).

 N       Mean    Std Dev    Std Err    Minimum    Maximum
14    -1.2500     0.8830     0.2360    -3.0000    -0.4000

   Mean       95% CL Mean       Std Dev    95% CL Std Dev
-1.2500    -1.7598   -0.7402     0.8830    0.6401   1.4225

DF    t Value    Pr > |t|
13      -5.30      0.0001

Example 6     Power calculation with PROC POWER     [Lab1ex6.sas]

One Sample power test. What is the power of a test to detect a difference between the observed mean of 75.94 and null means of 75.94 (the same value), 76.79, 77, and 78?

proc power;
onesamplemeans
   mean = 75.94
   ntotal = 14
   stddev = 1.23
   nullmean = 75.94 76.79 77 78
   alpha = 0.05
   power = .;
run; quit;

The POWER Procedure
One-sample t Test for Mean

Fixed Scenario Elements
Distribution            Normal
Method                  Exact
Alpha                   0.05
Mean                    75.94
Standard Deviation      1.23
Total Sample Size       14
Number of Sides         2

Computed Power
Index    Null Mean    Power
1             75.9    0.050
2             76.8    0.667
3             77.0    0.846
4             78.0    >.999

Things to Notice

1. The "." after power indicates that you are requesting the power.
2. The onesamplemeans statement is one line of code up to the ";". It is split over multiple lines to make it easier to read.
3. The power to detect a difference from a null mean of 77 is 0.846, and the power increases to almost 1 when the null mean is 78. The power takes its minimum value, equal to alpha, when the null mean is the same as the observed mean. You generally want a power of at least 0.80 (80%). Notice that the 95% confidence interval of the mean, [75.23, 76.65], excludes both 77 and 78.
   See that the value 78 is far above the upper limit of this confidence interval. That is why the test is highly significant.

Proc Power can also be used to estimate the number of samples required to obtain a certain power:

proc power;
onesamplemeans
   mean = 75.94
   ntotal = .
   stddev = 1.23
   nullmean = 77
   alpha = 0.05
   power = 0.80 0.90 0.95 0.99 0.846 0.845;
run;

The POWER Procedure
One-sample t Test for Mean

Fixed Scenario Elements
Distribution            Normal
Method                  Exact
Null Mean               77
Alpha                   0.05
Mean                    75.94
Standard Deviation      1.23
Number of Sides         2

Computed N Total
         Nominal    Actual
Index    Power      Power     N Total
1        0.800      0.814     13
2        0.900      0.915     17
3        0.950      0.955     20
4        0.990      0.991     27
5        0.846      0.873     15
6        0.845      0.846     14

When the estimated sample size is not a whole number, SAS rounds it up to the next whole number, to guarantee at least the requested power.

Two sample power test. What is the power of a test to detect a difference between two samples with the following means and variances?

             Mean    Variance    N
Sample 1     90      15          6
Sample 2     95      17          6

Mean difference = 5
Pooled s = SQRT((15 + 17)/2) = 4 (not the same as the average of the standard deviations)

proc power;
twosamplemeans test=diff
   meandiff = 5
   stddev = 4
   npergroup = 6 10 20
   power = .;
run; quit;

The POWER Procedure
Two-sample t Test for Mean Difference

Distribution            Normal
Method                  Exact
Mean Difference         5
Standard Deviation      4
Number of Sides         2
Null Difference         0
Alpha                   0.05

Computed Power
         N Per
Index    Group    Power
1         6       0.498
2        10       0.753
3        20       0.971

Example 7     [Lab1ex7.sas]

This next example illustrates the use of Proc Sort, Proc Print, and Proc Means:

Data Grades;
Input StudentNo GradUG $ HWGrade Midterm Final;   * $ indicates a non-numeric class variable;
FinalGrade = 0.25*HWgrade + 0.35*Midterm + 0.40*Final;
Cards;
13 G 92 84 89
9 G 85 65 80
47 G 90 81 92
21 UG 82 73 86
60 G 94 96 98
4 UG 89 82 90
;
Proc Sort;               * Orders the data by the variable named below;
By StudentNo;
Proc Print;              * Displays the inputted data in whatever order you wish;
Title 'Roster in order of Student Number';
ID StudentNo;
Var HWGrade Midterm Final FinalGrade;
Proc Means n mean std var stderr maxdec=1;   * MaxDec limits all numbers to 1 decimal place;
Title 'Descriptive statistics';
Var HWGrade Midterm Final FinalGrade;
Proc Sort;
By GradUG;               * Sorting is needed because of the Proc Means below;
Proc Means n mean std var stderr maxdec=1;
Title 'Descriptive statistics by student level';
Var HWGrade Midterm Final FinalGrade;
By GradUG;               * Without Proc Sort above, this would confuse SAS;
Proc Plot;
Plot Final*FinalGrade;   * Generates plot of Final (y) vs. FinalGrade(x);
Run; Quit;

Note: If you add a title to one Proc statement but not to the others, all the Proc outputs will have the same label. In fact, titles will carry over to future programs! To avoid confusion, you should label everything, especially as your programs become more complicated and the output more profuse.

Nifty SAS Program     [SASCritValues.sas]

Tables of critical values rarely contain the exact values you are looking for. Here's a way to use SAS to find critical values and p-values with precision:

Data ValueFinder;
TITLE 'CRITICAL VALUES';
* The functions below find the critical value for a specified probability 'p';
* where 'p' is the proportion of the area to the **LEFT** of the critical value;
* [e.g. 0.975 will be the 'p' for a 5% two-tailed test];
Nvalue = PROBIT (0.975);          * This is Z;
Tvalue = TINV (0.975, 20);        * This is t (p, df);
Chivalue = CINV (0.975, 20);      * This is chi-square (p, df);
Fvalue = FINV (0.975, 20, 4);     * This is F (p, NUM df, DEN df);
TITLE 'PROBABILITY';
* These functions return the probability that an observation is < x;
Nprob = PROBNORM (1.96);          * Z;
Tprob = PROBT (2.086, 20);        * t;
Chiprob = PROBCHI (34.2, 20);     * chi-square;
Fprob = PROBF (8.56, 20, 4);      * F;
Proc Print;
Run; Quit;

Very handy; but if you use this, please be aware of what SAS is telling you, namely that it is the areas to the LEFT of the critical values that are being considered.
Double-check your results with a table until you get the hang of it.

Niftier Website

There are a lot of free critical values calculators available on-line as well. Feel free to use them, but be sure you understand how they work. The best way to do this is by checking some test values against the tables in the book (or on the class webpage). A good site:

     http://www.graphpad.com/quickcalcs/DistMenu.cfm

Caution: Be aware of what these calculators are telling you, namely whether it is the area to the LEFT or to the RIGHT of the critical value that is being considered. Double-check your results with a table until you get the hang of it.

APPENDIX: Data Input Examples

Students lose a shocking number of points on homework and exams due to incorrect data input (i.e. careless typographical errors). Very rarely should you ever have to input data number-by-number because almost all the datasets will be provided to you already typed into Word documents. The challenge you have is to structure your data input routine in SAS such that it will read correctly whatever you cut-and-paste into your code. The "Do-End loops" illustrated below may look complicated, but it is worth your time to understand how they work, especially as our data sets become bigger and bigger.
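As a warm-up for the Do-End loops, the following minimal sketch shows what a nested loop with an Output statement does on its own, before any data are read. The data set name LoopDemo and the loop sizes are made up for illustration and are not part of the class files.

```sas
* Sketch only: LoopDemo is a hypothetical data set name used for illustration;

Data LoopDemo;
   Do Treatment = 1 to 2;         * Outer loop: one pass per treatment;
      Do Replication = 1 to 3;    * Inner loop: one pass per replication;
         Output;                  * Writes one observation on every pass;
      End;
   End;
Run;

Proc Print Data = LoopDemo;       * Lists the 2 x 3 = 6 observations created;
Run; Quit;
```

Each pass through the inner loop writes one observation carrying the current values of Treatment and Replication. Adding an Input statement with @@ inside the loop, as in the appendix examples, attaches one data value to each of those observations.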
Example data set 1     5 treatments with 5 replications each

   A       B       C       D       E
3.08    3.30    5.73    1.87    2.25
5.51    3.19    5.18    3.30    4.78
5.07    4.29    5.06    2.64    3.13
4.41    1.87    3.96    3.08    2.91
3.85    1.32    3.74    3.85    2.58

Possible SAS data entry code:

Data Example1;
Input Treatment $ @@;
Do Replication = 1 to 5;
   Input Response @@;
   Output;
End;
Cards;
A 3.08 5.51 5.07 4.41 3.85
B 3.30 3.19 4.29 1.87 1.32
C 5.73 5.18 5.06 3.96 3.74
D 1.87 3.30 2.64 3.08 3.85
E 2.25 4.78 3.13 2.91 2.58
;

If this is scary, you can also paste the above table into Excel and manipulate it (again, by cutting and pasting and transposing, not by retyping numbers) to give you something like this:

A    3.08
A    5.51
A    5.07
A    4.41
A    3.85
B    3.3
B    3.19
B    4.29
B    1.87
B    1.32
C    5.73
C    5.18
C    5.06
C    3.96
C    3.74
D    1.87
D    3.3
D    2.64
D    3.08
D    3.85
E    2.25
E    4.78
E    3.13
E    2.91
E    2.58

Once you are here, the SAS code is straightforward:

Data Example1;
Input Treatment $ Response;
Cards;
A 3.08
A 5.51
. . .
E 2.91
E 2.58
;

The two approaches are equivalent, but as the data sets become bigger, the Excel manipulations needed for the second approach will become more and more cumbersome.

Example data set 2     Combinations of treatments with 10 replications each

Trt1A    Trt2A    131   109   133   136   142   126   150   142   167   145
         Trt2B    136   103    78   154   132   122   114   107   120   127
         Trt2C    101   142   164   144   113   149   139   121   162   154
Trt1B    Trt2A     68   114   101   120   113   134   147    97    85   114
         Trt2B    149   132   134   111   136   103   103    92   124    80
         Trt2C    125   106   125    89   120    71   100   125   132   108

Possible SAS data entry code:

Data Example2;
Do Trt1 = 1 to 2;
   Do Trt2 = 1 to 3;
      Do Rep = 1 to 10;
         Input Response @@;
         Output;
      End;
   End;
End;
Cards;
131 109 133 136 142 126 150 142 167 145
136 103  78 154 132 122 114 107 120 127
101 142 164 144 113 149 139 121 162 154
 68 114 101 120 113 134 147  97  85 114
149 132 134 111 136 103 103  92 124  80
125 106 125  89 120  71 100 125 132 108
;

Here we've set up the input routine in such a way that we could just cut-and-paste the data table into SAS.
No chance for typographical errors.

Example data set 3     Each data point identified by four classification variables

                  C1                 C2                 C3                 C4
             D1    D2    D3     D1    D2    D3     D1    D2    D3     D1    D2    D3
A1    B1    121   121   116    107   104   110    119   116   108     92   101   121
      B2    123   131   125    113   138   119    118   116   118    113   107   123
      B3    107   160   160    129   114   107    119   114   107    131   103    86
      B4    123   119   127    129   131   100    121    99   111    105    92   108
A2    B1    123   118   138    151   104   127    108   118   108    136   116   114
      B2    129   131   140    157   127   133    119   121    99    123   100    95
      B3    131   129   131    136   143   121    131   131   108    131   127   110
      B4    131   131   129    151   131   118    118   114   119    125   127    90

Possible SAS data entry code:

Data Example3;
Do ClassA = 1 to 2;
   Do ClassB = 1 to 4;
      Do ClassC = 1 to 4;
         Do ClassD = 1 to 3;
            Input Response @@;
            Output;
         End;
      End;
   End;
End;
Cards;
121 121 116 107 104 110 119 116 108  92 101 121
123 131 125 113 138 119 118 116 118 113 107 123
107 160 160 129 114 107 119 114 107 131 103  86
123 119 127 129 131 100 121  99 111 105  92 108
123 118 138 151 104 127 108 118 108 136 116 114
129 131 140 157 127 133 119 121  99 123 100  95
131 129 131 136 143 121 131 131 108 131 127 110
131 131 129 151 131 118 118 114 119 125 127  90
;

Voila! Without the Do-End loops, the same dataset would be five times as large because you would have to input the individual classification address for each and every data point (e.g. A2, B3, C2, D1). Again, this may seem unnecessary to you now; but please take the time to learn it. And if you have any questions, just ask.

Example data set 4     Each score entered together with its frequency count

data read;
   input score count @@;
   datalines;
40 2   47 2   52 2   26 1   19 2
25 2   35 4   39 1   26 1   48 1
14 2   22 1   42 1   34 2   33 2
18 1   15 1   29 1   41 2   44 1
51 1   43 1   27 2   46 2   28 1
49 1   31 1   28 1   54 1   45 1
;

The following statements invoke the TTEST procedure to test if the mean test score is equal to 30.

proc ttest data=read h0=30;
   var score;
   freq count;
run;
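The freq statement tells SAS to treat each observation as if it appeared count times. The sketch below makes that explicit with a Do-End loop of the kind used throughout this appendix: it writes each score out count times and then runs the same test without a freq statement. It assumes the data set read created above; the names expanded and i are made up for illustration.

```sas
* Sketch only: assumes the data set "read" created above;
* "expanded" and "i" are hypothetical names used for illustration;

data expanded;
   set read;
   do i = 1 to count;       * Repeat each score "count" times;
      output;
   end;
   drop i count;            * Keep only the score itself;
run;

proc ttest data=expanded h0=30;
   var score;
run;
```

Both versions should report the same N, mean, and t value, which makes this a convenient check that the frequency counts were read in as intended.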