Lecture 2

1. SAS Procedures: Proc Ttest, Proc NPar1Way, Proc Freq
2. SAS Data step
3. Data step arithmetic
4. Data step comparisons and logical conditions: IF-THEN-ELSE, subsetting IF
5. Indicator variables
6. Missing values

Class email group: did you get an email?

Reading: much of Lecture 1 material reviewed in LSB §1.6-1.13, 2.3

Proc TTEST

Compare mean IQ between children of HS graduates and children of non-HS graduates with a two-sample t-test.

    Proc TTEST data= dataset ci=none ;     /* ci=none: omit CI for standard deviation */
      class group-variable ;               /* gives a two-sample test */
      var response-variable(s) ;

    Proc TTEST data = ph6470.child_iq ci = none ;
      class mom_HS_grad;
      var child_iq;

Note use of permanent data: no need to import it again.

SAS performs 2 t-tests:

Pooled t-test assumes two population standard deviations are equal, uses pooled standard deviation, which is Root Mean Square Error in one-factor ANOVA:

$$\hat\sigma = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}}$$

$\hat\sigma$ is used for the test and for confidence intervals. The test statistic is:

$$t = \frac{\bar X_1 - \bar X_2}{\hat\sigma \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad \text{with } df = n_1 + n_2 - 2.$$

Satterthwaite t-test does not assume two population standard deviations are equal. Uses unpooled standard error:

$$SE_U = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}}$$

Adjusts degrees of freedom for differences in the group SDs:

$$df_U = SE_U^4 \left( \frac{SD_1^4}{n_1^2 (n_1 - 1)} + \frac{SD_2^4}{n_2^2 (n_2 - 1)} \right)^{-1}$$

When $SD_1$ and $SD_2$ are different, $df_U < (n_1 + n_2 - 2)$.

The Satterthwaite test statistic:

$$t = \frac{\bar X_1 - \bar X_2}{SE_U}, \qquad \text{with } df = df_U.$$
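The two test statistics can be checked by hand in a Data step. The following is a minimal sketch, not part of the original lecture: it assumes the group sizes, means, and standard deviations reported by Proc TTEST for the child IQ data, and the data set name `ttest_check` and all variable names are made up for illustration.

```sas
/* Hand-check of the pooled and Satterthwaite t-tests from summary statistics. */
/* The numbers are copied from the Proc TTEST output for the child IQ data.    */
/* Data set and variable names here are hypothetical.                          */
data ttest_check;
  n1 = 93;  mean1 = 77.5484; sd1 = 22.5738;    /* mom_HS_grad = 0 */
  n2 = 341; mean2 = 89.3196; sd2 = 19.0495;    /* mom_HS_grad = 1 */

  /* Pooled test: assumes equal population SDs */
  df_pooled = n1 + n2 - 2;
  sd_pooled = sqrt(((n1 - 1)*sd1**2 + (n2 - 1)*sd2**2) / df_pooled);
  t_pooled  = (mean1 - mean2) / (sd_pooled * sqrt(1/n1 + 1/n2));
  p_pooled  = 2 * (1 - probt(abs(t_pooled), df_pooled));

  /* Satterthwaite test: unpooled standard error and adjusted df */
  se_u = sqrt(sd1**2/n1 + sd2**2/n2);
  df_u = se_u**4 / (sd1**4/(n1**2*(n1 - 1)) + sd2**4/(n2**2*(n2 - 1)));
  t_u  = (mean1 - mean2) / se_u;
  p_u  = 2 * (1 - probt(abs(t_u), df_u));
run;

proc print data=ttest_check;
  var sd_pooled t_pooled df_pooled p_pooled se_u t_u df_u p_u;
run;
```

Up to rounding of the inputs, the printed values should agree with the Proc TTEST output: a pooled SD near 19.85 with t near -5.07 on 432 df, and a Satterthwaite t near -4.60 on about 129.88 df.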
    The TTEST Procedure

    Variable: child_IQ (child IQ)

    mom_HS_grad     N       Mean     Std Dev    Std Err    Minimum    Maximum
    0              93    77.5484     22.5738     2.3408    20.0000      136.0
    1             341    89.3196     19.0495     1.0316    38.0000      144.0
    Diff (1-2)          -11.7713     19.8525     2.3224

    mom_HS_grad   Method               Mean        95% CL Mean
    0                               77.5484     72.8994    82.1974
    1                               89.3196     87.2906    91.3487
    Diff (1-2)    Pooled           -11.7713    -16.3359    -7.2066
    Diff (1-2)    Satterthwaite    -11.7713    -16.8321    -6.7105

    Method           Variances        DF    t Value    Pr > |t|
    Pooled           Equal           432      -5.07      <.0001
    Satterthwaite    Unequal      129.88      -4.60      <.0001

    Equality of Variances
    Method      Num DF    Den DF    F Value    Pr > F
    Folded F        92       340       1.40    0.0326

When are the variances unequal?

    Variable: child_IQ (child IQ)

    mom_HS_grad     N       Mean     Std Dev    Std Err    Minimum    Maximum
    0              93    77.5484     22.5738     2.3408    20.0000      136.0
    1             341    89.3196     19.0495     1.0316    38.0000      144.0

    Equality of Variances
    Method      Num DF    Den DF    F Value    Pr > F
    Folded F        92       340       1.40    0.0326

"Folded F-test" strongly depends on Gaussian assumption, not a reliable test.

Better: when the ratio

    (larger standard deviation) / (smaller standard deviation) > 3

then the SDs are probably different, and different enough to matter.

    mom_HS_grad     N       Mean     Std Dev    Std Err    Minimum    Maximum
    0              93    77.5484     22.5738     2.3408    20.0000      136.0
    1             341    89.3196     19.0495     1.0316    38.0000      144.0

Here the ratio is 22.5738/19.0495 ≈ 1.19, well below 3.

If the SDs differ:

• Report the Satterthwaite p-value
• Take logs (in a Data step) and perform t-test on logged data
• Wilcoxon rank-sum test (non-parametric)
• Bootstrap t-test, customized to observed data

Wilcoxon rank-sum test

Rank-sum test compares two samples: combine all the data and order from smallest to largest. Test statistic is sum of ranks in one sample.

H0: two population distributions equal (same center). If so, neither sample should be a majority in lower ranks.

HA: two population distributions not equal, with different centers at least. (Approximately: different medians.)
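The ranking idea described above can be carried out by hand before turning to the procedure that does it automatically. This is a rough sketch only: it assumes the permanent data set `pubh.child_iq` used elsewhere in the lecture is available, and the output data set `ranked` and the variable `iq_rank` are hypothetical names.

```sas
/* Rank all child IQ values together, ignoring group membership.   */
/* ties=mean gives tied values the average of their ranks.         */
/* "ranked" and "iq_rank" are made-up names for illustration.      */
proc rank data=pubh.child_iq out=ranked ties=mean;
  var child_iq;
  ranks iq_rank;
run;

/* Sum the ranks within each group: the rank sum for one group is  */
/* the Wilcoxon test statistic.                                    */
proc means data=ranked n sum mean;
  class mom_HS_grad;
  var iq_rank;
run;
```

The two rank sums should match the "Sum of Scores" column that Proc NPar1Way reports, provided no child_iq values are missing.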
[Photo: Frank Wilcoxon (1892–1965). Source: Wikipedia]

    Proc NPar1Way data = pubh.child_iq wilcoxon ;
      class mom_HS_grad;
      var child_iq;

    The NPAR1WAY Procedure

    Wilcoxon Scores (Rank Sums) for Variable child_IQ
    Classified by Variable mom_HS_grad

    mom_HS_             Sum of     Expected       Std Dev          Mean
    grad        N       Scores     Under H0      Under H0         Score
    -------------------------------------------------------------------
    1         341     78993.50     74167.50    1071.98173    231.652493
    0          93     15401.50     20227.50    1071.98173    165.607527

    Average scores were used for ties.

    Wilcoxon Two-Sample Test

    Statistic                 15401.5000

    Normal Approximation
    Z                            -4.5015
    One-Sided Pr <  Z             <.0001
    Two-Sided Pr > |Z|            <.0001

    t Approximation
    One-Sided Pr <  Z             <.0001
    Two-Sided Pr > |Z|            <.0001

Chi-square test in Proc Freq

Proc Freq (for frequency) makes tables of counts, performs chi-square test of association, also calculates relative risk and odds ratio, tests of trend and measures of association.

    Proc FREQ data= dataset ;
      tables row-variable * column-variable ;

    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100;

    The FREQ Procedure

    Table of mom_HS_grad by IQ_over_100

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency|
    Percent  |
    Row Pct  |
    Col Pct  |       0|       1|  Total
    ---------+--------+--------+
           0 |     80 |     13 |     93
             |  18.43 |   3.00 |  21.43
             |  86.02 |  13.98 |
             |  25.24 |  11.11 |
    ---------+--------+--------+
           1 |    237 |    104 |    341
             |  54.61 |  23.96 |  78.57
             |  69.50 |  30.50 |
             |  74.76 |  88.89 |
    ---------+--------+--------+
    Total         317      117      434
                73.04    26.96   100.00

Better to omit unwanted percents:

    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100 / nopercent nocol chisq ;

    nopercent = omit percents of grand total
    nocol     = omit column percents
    norow     = omit row percents
    chisq     = calculate chi-square test

    Table of mom_HS_grad by IQ_over_100

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency|
    Row Pct  |       0|       1|  Total
    ---------+--------+--------+
           0 |     80 |     13 |     93
             |  86.02 |  13.98 |
    ---------+--------+--------+
           1 |    237 |    104 |    341
             |  69.50 |  30.50 |
    ---------+--------+--------+
    Total         317      117      434

    Statistics for Table of mom_HS_grad by IQ_over_100

    Statistic                     DF       Value      Prob
    ------------------------------------------------------
    Chi-Square                     1     10.1275    0.0015
    Likelihood Ratio Chi-Square    1     11.2096    0.0008
    Continuity Adj. Chi-Square     1      9.3059    0.0023
    Mantel-Haenszel Chi-Square     1     10.1042    0.0015
    Phi Coefficient                       0.1528
    Contingency Coefficient               0.1510
    Cramer's V                            0.1528

    Fisher's Exact Test
    ----------------------------------
    Cell (1,1) Frequency (F)        80
    Left-sided Pr <= F          0.9998
    Right-sided Pr >= F      7.236E-04

    Table Probability (P)    4.742E-04
    Two-sided Pr <= P           0.0014

    Sample Size = 434

Pearson's Chi-square Test for No Association

H0: no association between row probabilities and column probabilities.

1. Individual's chance of being in a particular column does not depend on which row they belong to. Within a column, all row percents should be roughly equal, except for sampling variability.
2. Equivalently, individual's chance of being in a particular row does not depend on which column they belong to. Within a row, all column percents should be roughly equal, except for sampling variability.

Using this no-association assumption, we can compute an expected count for each cell:

$$\text{expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

The test statistic compares expected to observed counts:

$$X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$

$$\text{degrees of freedom} = (\text{number of rows} - 1) \times (\text{number of columns} - 1).$$

Each cell's contribution to the chi-square sum measures its departure from the null hypothesis (no association). Pearson residual is the square root of the cell's contribution to chi-square, with sign of (observed − expected).

In a large table (bigger than 2 × 2), examine Pearson residuals for each cell to find which cells depart from no-association.
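The expected counts and cell contributions defined above can be computed directly in a Data step. The sketch below is an illustration only: it types in the observed counts and margins from the mom_HS_grad by IQ_over_100 table, and the data set name `chisq_check` and its variable names are hypothetical.

```sas
/* Expected counts, deviations, and cell chi-square contributions  */
/* for the 2 x 2 table. Counts and totals are typed in from the    */
/* Proc Freq table; all names here are made up for illustration.   */
data chisq_check;
  input row col observed row_total col_total;
  grand_total   = 434;
  expected      = row_total * col_total / grand_total;
  deviation     = observed - expected;
  cell_chi2     = deviation**2 / expected;
  pearson_resid = deviation / sqrt(expected);   /* signed square root of cell_chi2 */
  datalines;
0 0  80  93 317
0 1  13  93 117
1 0 237 341 317
1 1 104 341 117
;
run;

/* The SUM statement prints the column total of cell_chi2, */
/* which is the Pearson chi-square statistic.              */
proc print data=chisq_check;
  sum cell_chi2;
run;
```

The total of `cell_chi2` should come out close to the Chi-Square value of 10.1275 reported by Proc Freq, and the four deviations should be about ±12.07.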
    Proc Freq data = one;
      tables mom_HS_grad * IQ_over_100 / nopercent nocol norow cellchi2 deviation chisq ;

    mom_HS_grad(mom HS grad)     IQ_over_100

    Frequency      |
    Deviation      |
    Cell Chi-Square|       0|       1|  Total
    ---------------+--------+--------+
                 0 |     80 |     13 |     93
                   | 12.071 | -12.07 |
                   | 2.1452 | 5.8122 |
    ---------------+--------+--------+
                 1 |    237 |    104 |    341
                   | -12.07 | 12.071 |
                   | 0.5851 | 1.5851 |
    ---------------+--------+--------+
    Total               317      117      434

deviation = (observed count − expected count), gives sign for squared residual:

    Signed         |
    Squared Pearson|
    Residual       |        0|        1|  Total
    ---------------+---------+---------+
                 0 |      80 |      13 |     93
                   |  2.1452 | -5.8122 |
    ---------------+---------+---------+
                 1 |     237 |     104 |    341
                   | -0.5851 |  1.5851 |
    ---------------+---------+---------+
    Total                317       117      434

The DATA step (LSB §1.4)

    Data B;
      set A;
      statement1;
      statement2;
      statement3;

1. Read observation 1 (row 1) from A. Perform statement1, statement2, statement3 using observation 1.
2. Write output to data B.
3. Set variables to missing values.
4. Repeat steps 1, 2, 3 using observation 2 from A.
5. Repeat steps 1, 2, 3 using observation 3 from A.
6. Continue through all rows of A.

    Obs    ID    child_IQ    mom_HS_grad    mom_age    mom_IQ    male
      1     1          65              1         27       121       1
      2     2          98              1         25        89       1
      3     3          85              1         27       115       0

From Lecture 1 example:

    data A;
      set child_iq;
      if (male=1) then gender="M";
      if (male=0) then gender="F";

1. Read observation 1 (row 1) from child_iq. Calculate gender for observation 1.
2. Write output to data A, row 1.
3. Set ID, child_IQ, mom_HS_grad, mom_age, mom_IQ, male, gender to missing values.
4. Read observation 2 (row 2) from child_iq. Calculate gender for observation 2.
5. Write output to data A, row 2. Repeat.

Data step is sequential, only looks at one observation at a time. _N_ is the internal variable that counts observations.

There is an implicit DO-loop in every data step:

    Data B;
      DO from _N_ = 1 to last-observation;
        set A;
        statement1;
        statement2;
        statement3;
      END;

Calculations in the DATA step: arithmetic

Calculation with data to create new variables is done in the data step:

    Data Z;
      height_inches = height_cm/2.54;     /* convert cm to inches */

SAS arithmetic symbols:

    *     multiplication
    /     division
    +     addition
    -     subtraction
    **    exponentiation

(Operator table from the SAS online documentation, WHERE Statement Operators: http://support.sas.com/onlinedoc/913/docMainpage.jsp)

Expressions in parentheses are evaluated first, starting with the innermost of nested groups.
Here is the order of evaluating operations, from left to right in an expression:

1. Exponents
2. Multiplication and division
3. Addition and subtraction

Evaluate:

    2 × 3 + 4/5 − 1
    2 × (3 + 4)/(5 − 1)
    2 × (3 + 4/5) − 1

    Data A;
      C = 2*3+4/5-1;
      D = 2*(3+4)/(5-1);
      E = 2*(3+4/5) -1;
    proc print data=A;

    Obs      C      D      E
      1    5.8    3.5    6.6

Exponents: Don't use ** for any exponents except small integers (2, 3, or 4).

To compute x = a^b for other values of b, don't use x = a**b. Instead, for a > 0 use the more numerically stable log and exp functions:

    x = exp(b * log(a));

In SAS, log is the natural log function. Inverse of log is exp. log10 is common log or base 10 log. Inverse of b = log10(c) is c = exp(b*log(10.0)).

See LSB §3.3 for a short list of SAS functions. For a complete list, see the SAS Documentation:

SAS Products > Base SAS > SAS 9.3 Functions and CALL Routines > Dictionary of Functions and CALL Routines > Functions and CALL Routines by Category

In SAS 9.2: SAS Products > Base SAS > SAS Language Dictionary > Dictionary of Language Elements > Functions and CALL Routines

IF-THEN-ELSE (LSB §3.5-3.6)

In the data step, evaluation of a command may depend on a condition:

    IF ( condition ) THEN statement A ;

When condition is true, statement A is performed. Parentheses around condition are not required, but make code easier to read.

    detection_limit = 0.025;
    IF (0 < x < detection_limit) THEN x = detection_limit/2.0;

In the data step, you can make a branch depend on a condition:

    IF ( condition ) THEN statement A ;
    ELSE statement B ;

When condition is true, statement A is performed. When condition is false, statement B is performed.

    if (score > 70) then grade = 'S';
    else grade = 'N';

Here are the SAS symbols for comparisons and logical relations. Letters are easier to read and remember.

    Symbol    Letters    Meaning
    =         EQ         equal to
    ^=        NE         not equal to
    >         GT         greater than
    <         LT         less than
    >=        GE         greater than or equal to
    <=        LE         less than or equal to
    &         AND        both conditions true
    |         OR         either condition true
    ^         NOT        negation

Subsetting IF

    Data B;
      set A;
      IF ( condition );

If condition is true, the observation is kept.
This data step makes a copy of A, names it B, and includes in B only those observations that satisfy the condition.

Equivalent:

    if ( NOT condition ) then delete;

So use whichever form is simpler: the condition or its negation.

Indicator variables

Create an indicator (0/1) variable:

    x = (condition);

x = 1 when condition is true, x = 0 when false.

    data one;
      set pubh.child_iq;
      IQ_over_100 = (child_iq > 100.0);
      female = (gender = 'F');

Let the variable name define the condition. Use 1 = Yes, 0 = No.

Missing values

Numeric variables: missing is indicated by a period, x = .

Comparisons with missing values: in comparisons and in a sort of a numeric variable, missing values are treated as −∞.

    detection_limit = 0.025;
    IF (x < detection_limit) THEN x = detection_limit/2.0;

What happens to a subject who is missing x? Common source of errors. A safer version excludes missing values from the condition:

    detection_limit = 0.025;
    IF ( 0 LE x < detection_limit) THEN x = detection_limit/2.0;

Create an indicator 0/1 variable:

    IQ_under_100 = (0 LE child_IQ < 100);

What is the value for a child with missing IQ score? 0 or 1? What should it be?

Indicator variables created by logical conditions are never missing. Need to fix explicitly:

    IQ_under_100 = (0 LE child_IQ < 100);
    IF (child_IQ = . ) THEN IQ_under_100 = . ;

Arithmetic with missing values

Find mean diastolic blood pressure (DBP) measured at 4 clinic visits. Data from 2 subjects in visits:

    ID    DBP1    DBP2    DBP3    DBP4
    11      95      90      98      92
    14      94       .      91      95

    data G;
      set visits;
      DBP_mean = (DBP1 + DBP2 + DBP3 + DBP4)/4.0 ;

Results:

    Obs    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean
      1    11      95      90      98      92       93.75
      2    14      94       .      91      95           .

Arithmetic with a missing value has a missing result. Usually we want to ignore missing values and average the rest of the numbers, not have the mean be missing.

SAS procedures generally omit observations (rows) with missing values.
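The earlier point about indicator variables and missing values can be checked on a tiny made-up data set. This is only a sketch: the three observations, the data set name `iq_demo`, and the variable `naive_flag` are invented for illustration and do not come from the child IQ data.

```sas
/* Three made-up children: one low score, one high score, one missing. */
data iq_demo;
  input ID child_IQ;

  /* Naive indicator: a missing value compares as smaller than any */
  /* number, so the child with missing IQ is flagged as under 100. */
  naive_flag = (child_IQ < 100);

  /* Bounded condition: a missing IQ gives 0, which is still not   */
  /* the right answer, so it is reset to missing explicitly.       */
  IQ_under_100 = (0 LE child_IQ < 100);
  if (child_IQ = .) then IQ_under_100 = .;
  datalines;
1 85
2 120
3 .
;
run;

proc print data=iq_demo;
run;
```

For the child with missing IQ, `naive_flag` is 1 while `IQ_under_100` is missing, which is the behavior the explicit fix is meant to produce.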
Many SAS functions correctly handle missing values; see the Documentation:

MEAN (argument list) returns the average of the non-missing values; for example, MEAN(3, ., ., 1) = 2

    DBP_mean1 = mean (DBP1, DBP2, DBP3, DBP4);
    DBP_mean1 = mean(OF DBP1 - DBP4);     /* short form for sequential variables */

Results:

    ID    DBP1    DBP2    DBP3    DBP4    DBP_mean    DBP_mean1
    11      95      90      98      92       93.75      93.7500
    14      94       .      91      95           .      93.3333
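Because MEAN silently averages whatever is available, it can help to record how many readings went into each mean. The sketch below is one possible approach, not something from the lecture: it re-enters the two subjects as in-line data so it runs on its own, and the rule of requiring at least 3 non-missing readings, the data set `G2`, and the new variable names are all assumptions made for illustration.

```sas
/* Re-create the two-subject visits data as in-line data. */
data visits;
  input ID DBP1 DBP2 DBP3 DBP4;
  datalines;
11 95 90 98 92
14 94 . 91 95
;
run;

data G2;
  set visits;
  n_readings = n(of DBP1-DBP4);       /* count of non-missing readings */
  n_missing  = nmiss(of DBP1-DBP4);   /* count of missing readings     */

  /* Hypothetical rule: report a mean only with 3 or more readings. */
  if (n_readings GE 3) then DBP_mean2 = mean(of DBP1-DBP4);
  else DBP_mean2 = .;
run;

proc print data=G2;
run;
```

With these data both subjects meet the hypothetical 3-reading rule, so `DBP_mean2` equals `DBP_mean1`; the cutoff only matters for subjects with more missing visits.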