Lecture 2
1. SAS Procedures: Proc Ttest, Proc NPar1Way, Proc Freq
2. SAS Data step
3. Data step arithmetic
4. Data step comparisons and logical conditions: IF-THEN-ELSE, subsetting IF
5. Indicator variables
6. Missing values
Class email group: did you get an email?
Reading: much of Lecture 1 material reviewed in LSB §1.6-1.13, 2.3
Proc TTEST
Compare mean IQ between children of HS graduates and children of non-HS
graduates with a two-sample t-test.

Proc TTEST data = dataset ci = none ;     ci=none: omit CI for standard deviation
  class group-variable ;                  gives a two-sample test
  var response-variable(s) ;

Proc TTEST data = ph6470.child_iq ci = none ;
  class mom_HS_grad;
  var child_iq;

Note use of permanent data: no need to import it again.
SAS performs 2 t-tests:
Pooled t-test assumes two population standard deviations are equal,
uses pooled standard deviation, which is Root Mean Square Error in one-factor
ANOVA:
\[ \hat{\sigma} = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{n_1 + n_2 - 2}} \]
$\hat{\sigma}$ is used for the test and for confidence intervals. The test statistic is:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \]
with df = $(n_1 + n_2 - 2)$.
Satterthwaite t-test does not assume two population standard deviations are equal.
Uses unpooled standard error:
\[ SE_U = \sqrt{\frac{SD_1^2}{n_1} + \frac{SD_2^2}{n_2}} \]
Adjusts degrees of freedom for differences in the group SDs:
\[ df_U = SE_U^4 \left( \frac{SD_1^4}{n_1^2\,(n_1 - 1)} + \frac{SD_2^4}{n_2^2\,(n_2 - 1)} \right)^{-1} \]
When $SD_1$ and $SD_2$ are different, $df_U < (n_1 + n_2 - 2)$.
The Satterthwaite test statistic:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{SE_U}, \]
with df = $df_U$.
The TTEST Procedure

Variable: child_IQ (child IQ)

mom_HS_grad     N       Mean    Std Dev   Std Err   Minimum   Maximum
0              93    77.5484    22.5738    2.3408   20.0000     136.0
1             341    89.3196    19.0495    1.0316   38.0000     144.0
Diff (1-2)          -11.7713    19.8525    2.3224

mom_HS_grad   Method               Mean        95% CL Mean
0                               77.5484    72.8994    82.1974
1                               89.3196    87.2906    91.3487
Diff (1-2)    Pooled           -11.7713   -16.3359    -7.2066
Diff (1-2)    Satterthwaite    -11.7713   -16.8321    -6.7105

Method          Variances        DF    t Value    Pr > |t|
Pooled          Equal           432      -5.07      <.0001
Satterthwaite   Unequal      129.88      -4.60      <.0001

Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        92       340       1.40    0.0326
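As a check on the formulas, the pooled and Satterthwaite quantities can be recomputed by hand from the group summaries in the output above. The DATA step below is a minimal sketch, not part of the lecture; the variable names (sd_pooled, se_u, df_u, and so on) are illustrative. Because the inputs are the rounded values printed by Proc TTEST, the results should agree with the listing only up to rounding.

```sas
data _null_;
  /* group summaries copied from the Proc TTEST output */
  n1 = 93;   mean1 = 77.5484;   sd1 = 22.5738;
  n2 = 341;  mean2 = 89.3196;   sd2 = 19.0495;

  /* pooled SD, pooled standard error, and pooled t statistic */
  sd_pooled = sqrt( ((n1-1)*sd1**2 + (n2-1)*sd2**2) / (n1 + n2 - 2) );
  se_pooled = sd_pooled * sqrt(1/n1 + 1/n2);
  t_pooled  = (mean1 - mean2) / se_pooled;
  df_pooled = n1 + n2 - 2;

  /* unpooled SE, Satterthwaite df, and Satterthwaite t statistic */
  se_u = sqrt(sd1**2/n1 + sd2**2/n2);
  df_u = se_u**4 / ( sd1**4/(n1**2*(n1-1)) + sd2**4/(n2**2*(n2-1)) );
  t_u  = (mean1 - mean2) / se_u;

  /* two-sided p-values from the t distribution */
  p_pooled = 2 * (1 - probt(abs(t_pooled), df_pooled));
  p_u      = 2 * (1 - probt(abs(t_u), df_u));

  /* write the results to the SAS log */
  put sd_pooled= se_pooled= t_pooled= df_pooled= p_pooled=;
  put se_u= df_u= t_u= p_u=;
run;
```

The values written to the log can be compared with the Std Dev and Std Err on the Diff (1-2) line and with the DF and t Value columns of the two test rows.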
When are the variances unequal?

Variable: child_IQ (child IQ)

mom_HS_grad     N       Mean    Std Dev   Std Err   Minimum   Maximum
0              93    77.5484    22.5738    2.3408   20.0000     136.0
1             341    89.3196    19.0495    1.0316   38.0000     144.0

Equality of Variances
Method      Num DF    Den DF    F Value    Pr > F
Folded F        92       340       1.40    0.0326

The “Folded F-test” depends strongly on the Gaussian assumption, so it is not a reliable test.
Better: When the ratio

    (larger standard deviation) / (smaller standard deviation) > 3

then the SDs are probably different, and different enough to matter.

mom_HS_grad     N       Mean    Std Dev   Std Err   Minimum   Maximum
0              93    77.5484    22.5738    2.3408   20.0000     136.0
1             341    89.3196    19.0495    1.0316   38.0000     144.0

Here the ratio is 22.5738/19.0495, about 1.2, well below 3.

Options when the SDs are different:
• Report the Satterthwaite p-value
• Take logs (in a Data step) and perform t-test on logged data
• Wilcoxon rank-sum test (non-parametric)
• Bootstrap t-test, customized to observed data
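The log-transform option might look like the following sketch. It assumes the ph6470 library from the Proc TTEST example is already assigned and that child_iq is strictly positive; the data set name log_iq and the variable name log_child_iq are illustrative, not from the lecture.

```sas
/* Take natural logs in a DATA step */
data log_iq;
  set ph6470.child_iq;
  /* log is undefined for values <= 0; such rows keep a missing log */
  if child_iq > 0 then log_child_iq = log(child_iq);
run;

/* Two-sample t-test on the logged response */
proc ttest data = log_iq ci = none;
  class mom_HS_grad;
  var log_child_iq;
run;
```

A difference in means on the log scale corresponds to a ratio on the original scale, so the result is interpreted as a relative rather than an absolute difference.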
Wilcoxon rank-sum test
Rank-sum test compares two samples: combine all the
data and order from smallest to largest. Test statistic is
sum of ranks in one sample.
H0: two population distributions equal (same center).
If so, neither sample should be a majority in lower ranks.
HA: two population distributions not equal, with different
centers at least. (Approximately: different medians.)
[Photo: Frank Wilcoxon (1892–1965). Source: Wikipedia]

Proc NPar1Way data = pubh.child_iq wilcoxon ;
  class mom_HS_grad;
  var child_iq;
The NPAR1WAY Procedure

Wilcoxon Scores (Rank Sums) for Variable child_IQ
Classified by Variable mom_HS_grad

mom_HS_           Sum of      Expected       Std Dev          Mean
grad       N      Scores      Under H0      Under H0         Score
--------------------------------------------------------------------
1        341    78993.50      74167.50    1071.98173    231.652493
0         93    15401.50      20227.50    1071.98173    165.607527

Average scores were used for ties.

Wilcoxon Two-Sample Test
Statistic                15401.5000

Normal Approximation
Z                           -4.5015
One-Sided Pr < Z             <.0001
Two-Sided Pr > |Z|           <.0001

t Approximation
One-Sided Pr < Z             <.0001
Two-Sided Pr > |Z|           <.0001
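The "Expected Under H0" column and the Z statistic can be checked by hand. The sketch below is not from the lecture. It assumes the usual rank-sum expectation n1(N+1)/2 and a continuity correction of 0.5 in the normal approximation, and it takes the tie-corrected standard deviation directly from the listing rather than recomputing it. Variable names are illustrative.

```sas
data _null_;
  n1 = 93;  n2 = 341;  n = n1 + n2;

  /* values copied from the Proc NPar1Way output */
  rank_sum1 = 15401.5;        /* sum of ranks in the mom_HS_grad = 0 group */
  sd_h0     = 1071.98173;     /* Std Dev Under H0 (tie-corrected) */

  /* expected rank sums when the two distributions are equal */
  expected1 = n1 * (n + 1) / 2;
  expected2 = n2 * (n + 1) / 2;

  /* SD under H0 ignoring ties, for comparison with the tie-corrected value */
  sd_no_ties = sqrt(n1 * n2 * (n + 1) / 12);

  /* normal approximation; 0.5 is added because rank_sum1 is below expected1 */
  z = (rank_sum1 - expected1 + 0.5) / sd_h0;
  p_two_sided = 2 * probnorm(-abs(z));

  put expected1= expected2= sd_no_ties=;
  put z= p_two_sided=;
run;
```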
Chi-square test in Proc Freq
Proc Freq (for frequency) makes tables of counts, performs chi-square test of
association, also calculates relative risk and odds ratio, tests of trend and measures
of association.

Proc FREQ data = dataset ;
  tables row-variable * column-variable ;

Proc Freq data = one;
  tables mom_HS_grad * IQ_over_100;
The FREQ Procedure

Table of mom_HS_grad by IQ_over_100

mom_HS_grad(mom HS grad)     IQ_over_100

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |     80 |     13 |     93
         |  18.43 |   3.00 |  21.43
         |  86.02 |  13.98 |
         |  25.24 |  11.11 |
---------+--------+--------+
       1 |    237 |    104 |    341
         |  54.61 |  23.96 |  78.57
         |  69.50 |  30.50 |
         |  74.76 |  88.89 |
---------+--------+--------+
Total         317      117      434
            73.04    26.96   100.00
Better to omit unwanted percents:

Proc Freq data = one;
  tables mom_HS_grad * IQ_over_100 / nopercent nocol chisq ;

nopercent = omit percents of grand total
nocol     = omit column percents
norow     = omit row percents
chisq     = calculate chi-square test
Table of mom_HS_grad by IQ_over_100

mom_HS_grad(mom HS grad)     IQ_over_100

Frequency|
Row Pct  |       0|       1|  Total
---------+--------+--------+
       0 |     80 |     13 |     93
         |  86.02 |  13.98 |
---------+--------+--------+
       1 |    237 |    104 |    341
         |  69.50 |  30.50 |
---------+--------+--------+
Total         317      117      434

Statistics for Table of mom_HS_grad by IQ_over_100

Statistic                      DF       Value      Prob
------------------------------------------------------
Chi-Square                      1     10.1275    0.0015
Likelihood Ratio Chi-Square     1     11.2096    0.0008
Continuity Adj. Chi-Square      1      9.3059    0.0023
Mantel-Haenszel Chi-Square      1     10.1042    0.0015
Phi Coefficient                        0.1528
Contingency Coefficient                0.1510
Cramer’s V                             0.1528

Fisher’s Exact Test
----------------------------------
Cell (1,1) Frequency (F)           80
Left-sided Pr <= F             0.9998
Right-sided Pr >= F         7.236E-04

Table Probability (P)       4.742E-04
Two-sided Pr <= P              0.0014

Sample Size = 434
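When only the cell counts are available, as in the table above, the same analysis can be run by entering one row per cell and using a WEIGHT statement. This is a minimal sketch; the data set name iq_counts and the variable name count are illustrative, not from the lecture.

```sas
/* One row per cell of the 2 x 2 table; count holds the cell frequency */
data iq_counts;
  input mom_HS_grad IQ_over_100 count;
  datalines;
0 0 80
0 1 13
1 0 237
1 1 104
;
run;

/* weight tells Proc Freq that each row represents count observations */
proc freq data = iq_counts;
  weight count;
  tables mom_HS_grad * IQ_over_100 / nopercent nocol chisq;
run;
```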
Pearson’s Chi-square Test for No Association.
H0: no association between the row variable and the column variable.
1. Individual’s chance of being in a particular column does not depend on which
row they belong to.
Within a column, all row percents should be roughly equal, except for sampling
variability.
2. Equivalently, individual’s chance of being in a particular row does not depend
on which column they belong to.
Within a row, all column percents should be roughly equal, except for sampling
variability.
Using this no-association assumption, we can compute an expected count for each
cell:
\[ \text{expected count} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}} \]
The test statistic compares expected to observed counts:
\[ X^2 = \sum_{\text{all cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}} \]
\[ \text{degrees of freedom} = (\text{number of rows} - 1) \times (\text{number of columns} - 1). \]
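The expected counts and the chi-square statistic can be worked through for the 2 × 2 table of mom_HS_grad by IQ_over_100. The DATA step below is a sketch under the formulas above, using the observed counts from the Proc Freq output. The array and variable names are illustrative.

```sas
data _null_;
  /* observed counts, row by row: cells (0,0) (0,1) (1,0) (1,1) */
  array observed{2,2} _temporary_ (80 13 237 104);
  array rowtot{2}     _temporary_ (0 0);
  array coltot{2}     _temporary_ (0 0);
  grand = 0;

  /* accumulate row totals, column totals, and the grand total */
  do i = 1 to 2;
    do j = 1 to 2;
      rowtot{i} = rowtot{i} + observed{i,j};
      coltot{j} = coltot{j} + observed{i,j};
      grand     = grand + observed{i,j};
    end;
  end;

  /* expected count for each cell and its contribution to chi-square */
  chisq = 0;
  do i = 1 to 2;
    do j = 1 to 2;
      expected = rowtot{i} * coltot{j} / grand;
      chisq    = chisq + (observed{i,j} - expected)**2 / expected;
      put i= j= expected=;
    end;
  end;

  df = (2 - 1) * (2 - 1);
  p  = 1 - probchi(chisq, df);   /* upper-tail chi-square probability */
  put chisq= df= p=;
run;
```

The value of chisq written to the log can be compared with the Chi-Square row of the Proc Freq statistics table.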
Each cell’s contribution to the chi-square sum measures its departure from the
null hypothesis (no association).
Pearson residual is the square root of the cell’s contribution to chi-square, with
the sign of (observed - expected).
In a large table (bigger than 2 × 2), examine Pearson residuals for each cell to find
which cells depart from no-association.

Proc Freq data = one;
  tables mom_HS_grad * IQ_over_100 / nopercent nocol norow
         cellchi2 deviation chisq ;
mom_HS_grad(mom HS grad)     IQ_over_100

Frequency      |
Deviation      |
Cell Chi-Square|       0|       1|  Total
---------------+--------+--------+
             0 |     80 |     13 |     93
               | 12.071 | -12.07 |
               | 2.1452 | 5.8122 |
---------------+--------+--------+
             1 |    237 |    104 |    341
               | -12.07 | 12.071 |
               | 0.5851 | 1.5851 |
---------------+--------+--------+
Total               317      117      434

deviation = (observed count - expected count), gives sign for squared residual:

Signed         |
Squared Pearson|
Residual       |        0|        1|  Total
---------------+---------+---------+
             0 |      80 |      13 |     93
               |  2.1452 | -5.8122 |
---------------+---------+---------+
             1 |     237 |     104 |    341
               | -0.5851 |  1.5851 |
---------------+---------+---------+
Total                317       117      434
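The deviation, cell chi-square, and signed Pearson residual can also be computed directly, one row per cell. This sketch takes the observed counts and marginal totals from the table above. The data set name residuals and the variable names are illustrative.

```sas
data residuals;
  input mom_HS_grad IQ_over_100 observed row_total col_total;
  grand_total = 434;

  expected  = row_total * col_total / grand_total;
  deviation = observed - expected;
  cell_chi2 = deviation**2 / expected;

  /* Pearson residual: square root of cell_chi2 carrying the sign of deviation */
  pearson_resid = deviation / sqrt(expected);
  datalines;
0 0  80  93 317
0 1  13  93 117
1 0 237 341 317
1 1 104 341 117
;
run;

proc print data = residuals noobs;
  var mom_HS_grad IQ_over_100 observed expected deviation cell_chi2 pearson_resid;
run;
```

The deviation and cell_chi2 columns should match the Proc Freq cell entries up to rounding. Residuals that are large in absolute value mark the cells that depart most from no association.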
The DATA step (LSB §1.4)
Data B;
set A;
statement1;
statement2;
statement3;
1. Read observation 1 (row 1) from A. Perform statement1, statement2,
statement3 using observation 1.
2. Write output to data B.
3. Set variables created in the data step to missing values (variables read by SET are replaced when the next observation is read).
4. Repeat steps 1, 2, 3 using observation 2 from A.
5. Repeat steps 1, 2, 3 using observation 3 from A.
6. Continue through all rows of A.
Obs   ID   child_IQ   mom_HS_grad   mom_age   mom_IQ   male
  1    1         65             1        27      121      1
  2    2         98             1        25       89      1
  3    3         85             1        27      115      0
From Lecture 1 example:
data A;
set child_iq;
if (male=1) then gender="M";
if (male=0) then gender="F";
1. Read observation 1 (row 1) from child_iq. Calculate gender for observation 1.
2. Write output to data A, row 1.
3. Set gender (created in this step) to missing. The variables read by SET
(ID, child_IQ, mom_HS_grad, mom_age, mom_IQ, male) are replaced when the next observation is read.
4. Read observation 2 (row 2) from child_iq. Calculate gender for observation 2.
5. Write output to data A, row 2. Repeat.
Data step is sequential, only looks at one observation at a time.
_N_ is the internal variable that counts observations.
There is an implicit DO-loop in every data step (conceptual pseudocode, not SAS syntax):

Data B;
  DO from _N_ = 1 to last-observation;
    set A;
    statement1;
    statement2;
    statement3;
  END;
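The one-observation-at-a-time processing can be made visible by writing _N_ to the log on each pass. The sketch below is self-contained: it re-enters three rows of the child IQ data with datalines (only ID, child_IQ, and male) instead of reading the course data set.

```sas
data A;
  input ID child_IQ male;
  datalines;
1 65 1
2 98 1
3 85 0
;
run;

data B;
  set A;                          /* reads one observation per pass */
  if (male=1) then gender="M";
  if (male=0) then gender="F";
  put _N_= ID= gender=;           /* one log line per pass through the step */
run;
```

The log should show one line per observation, with _N_ increasing by 1 each time. _N_ itself is not written to data set B.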
Calculations in the DATA step: arithmetic
Calculation with data to create new variables is done in the data step:

Data Z;
  height_inches = height_cm/2.54;     convert cm to inches

SAS arithmetic symbols:

Symbol   Description
*        multiplication
/        division
+        addition
-        subtraction
**       exponentiation
Expressions in parentheses are evaluated first, starting with the innermost of
nested groups. Here is the order of evaluating operations, from left to right in an
expression:
1. Exponents
2. Multiplication and division
3. Addition and subtraction
Evaluate:
  2 * 3 + 4/5 - 1
  2 * (3 + 4)/(5 - 1)
  2 * (3 + 4/5) - 1
Data A;
  C = 2*3+4/5-1;
  D = 2*(3+4)/(5-1);
  E = 2*(3+4/5) -1;
proc print data=A;

Obs      C      D      E
  1    5.8    3.5    6.6
Exponents: Don’t use ** for any exponents except small integers (2, 3, or 4).
To compute x = a^b for b ≠ 2 or 3, don’t use x = a**b.
Instead, for a > 0 use the more numerically stable
log and exp functions:
  x = exp(b * log(a));
In SAS, log is the natural log function. Inverse of log is exp.
log10 is common log or base 10 log.
Inverse of b = log10(c) is c = exp(b*log(10.0)).
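A short sketch of these identities, with arbitrary illustrative values for a, b, and c (the variable names are not from the lecture):

```sas
data _null_;
  a = 2.5;  b = 1.7;                 /* a must be positive for log(a) */

  x_power  = a**b;                   /* direct exponentiation, for comparison */
  x_explog = exp(b * log(a));        /* the form recommended above */

  c    = 1000;
  b10  = log10(c);                   /* base 10 log */
  back = exp(b10 * log(10.0));       /* inverse of log10 */

  put x_power= x_explog=;
  put b10= back=;
run;
```

The two versions of x should agree to many digits, and back should recover c up to floating-point rounding.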
See LSB §3.3 for a short list of SAS functions.
For a complete list, see the SAS Documentation:
SAS Products > Base SAS > SAS 9.3 Functions and CALL Routines>
Dictionary of Functions and CALL Routines > Functions and CALL Routines by Category
In SAS 9.2:
SAS Products > Base SAS > SAS Language Dictionary >
Dictionary of Language Elements > Functions and CALL Routines
IF-THEN-ELSE (LSB §3.5-3.6)
In the data step, evaluation of a command may depend on a condition:
IF ( condition ) THEN statement A ;
When condition is true, statement A is performed.
Parentheses around condition are not required, but make code easier to read.
detection_limit = 0.025;
IF (0 < x < detection_limit) THEN x = detection_limit/2.0;
In the data step, you can make a branch depend on a condition:
IF ( condition ) THEN statement A ;
ELSE statement B ;
When condition is true, statement A is performed.
When condition is false, statement B is performed.
if (score > 70) then grade = 'S';
else grade = 'N';
Here are the SAS symbols for comparisons and logical relations.
Letters are easier to read and remember.
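As a brief illustration, the standard SAS comparison operators have letter forms EQ (=), NE (^=), GT (>), LT (<), GE (>=), and LE (<=), and the logical operators are AND (&), OR (|), and NOT (^). The sketch below writes the same condition both ways. The variables score and age and their values are made up for the example.

```sas
data _null_;
  score = 85;  age = 12;

  /* the same condition written with symbols and with letters */
  with_symbols = (score >= 70 & age ^= 10);
  with_letters = (score GE 70 AND age NE 10);

  either  = (score LT 70 OR age GT 10);   /* true if at least one part is true */
  negated = NOT (score EQ 85);            /* reverses true and false */

  put with_symbols= with_letters= either= negated=;
run;
```

Each result is 1 when the condition is true and 0 when it is false, so with_symbols and with_letters always agree.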
Subsetting IF
Data B;
set A;
IF ( condition );
If condition is true, the observation is kept.
This data step makes a copy of A, names it B, and includes in B only those
observations that satisfy the condition. Equivalent:
if ( NOT condition ) then delete;
So use the form depending on which is simpler: condition or non-condition.
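The two equivalent forms can be compared on a small made-up data set. The data values and the data set name B2 are illustrative.

```sas
data A;
  input ID child_IQ;
  datalines;
1 65
2 98
3 .
4 112
;
run;

/* Subsetting IF: keep only observations that satisfy the condition */
data B;
  set A;
  if (child_IQ > 100);
run;

/* Equivalent: delete observations that do not satisfy the condition */
data B2;
  set A;
  if NOT (child_IQ > 100) then delete;
run;

proc print data = B;
run;
proc print data = B2;
run;
```

Both B and B2 keep only ID 4. The observation with missing child_IQ is dropped in both, because a missing value does not satisfy child_IQ > 100.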
Indicator variables
Create an indicator (0/1) variable:
  x = (condition);      x = 1 when condition is true, x = 0 when false

data one;
  set pubh.child_iq;
  IQ_over_100 = (child_iq > 100.0);
  female = (gender = 'F');

Let the variable name define the condition. Use 1 = Yes, 0 = No.
Missing values
Numeric variables: missing is indicated by a period, x = .
Comparisons with missing values: in a sort or a comparison of a numeric variable,
missing values are treated as -∞.

  detection_limit = 0.025;
  IF (x < detection_limit) THEN x = detection_limit/2.0;

What happens to a subject who is missing x? Common source of errors: the condition
is true for a missing x, so the missing value is replaced by detection_limit/2.0.
Safer:

  detection_limit = 0.025;
  IF (0 LE x < detection_limit) THEN x = detection_limit/2.0;
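The behavior of the two conditions for a missing x can be checked directly. This is a minimal sketch; the indicator names naive and guarded are illustrative.

```sas
data _null_;
  detection_limit = 0.025;
  x = .;                                    /* a subject with missing x */

  naive   = (x < detection_limit);          /* missing is below any number */
  guarded = (0 LE x < detection_limit);     /* lower bound excludes missing */

  put naive= guarded=;                      /* log shows naive=1 guarded=0 */
run;
```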
Create an indicator 0/1 variable:
  IQ_under_100 = (0 LE child_IQ < 100);
What is the value for a child with missing IQ score? 0 or 1? What should it be?
Indicator variables created by logical conditions are never missing.
Need to fix explicitly:
  IQ_under_100 = (0 LE child_IQ < 100);
  IF (child_IQ = . ) THEN IQ_under_100 = . ;
Arithmetic with missing values
Find mean diastolic blood pressure (DBP) measured at 4 clinic visits.
Data from 2 subjects in visits:

ID   DBP1   DBP2   DBP3   DBP4
11     95     90     98     92
14     94      .     91     95

data G;
  set visits;
  DBP_mean = (DBP1 + DBP2 + DBP3 + DBP4)/4.0 ;
Results:

Obs   ID   DBP1   DBP2   DBP3   DBP4   DBP_mean
  1   11     95     90     98     92      93.75
  2   14     94      .     91     95          .

Arithmetic with a missing value has a missing result.
Usually we want to ignore missing values and average the rest of the numbers,
not have the mean be missing.
SAS procedures generally omit observations (rows) with missing values.
Many SAS functions correctly handle missing values; see the Documentation:
MEAN(argument list) returns the average of the non-missing values;
for example, MEAN(3, ., ., 1) = 2

  DBP_mean1 = mean(DBP1, DBP2, DBP3, DBP4);
  DBP_mean1 = mean(OF DBP1 - DBP4);        short form for sequential variables

Results:

ID   DBP1   DBP2   DBP3   DBP4   DBP_mean   DBP_mean1
11     95     90     98     92      93.75     93.7500
14     94      .     91     95          .     93.3333
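Because MEAN silently averages whatever is present, it can help to record how many values went into each mean. The sketch below uses the N and NMISS functions for this. The rule of requiring at least 3 non-missing visits is an arbitrary illustration, not a recommendation from the lecture, and the names n_visits and n_missing are illustrative.

```sas
data visits;
  input ID DBP1 DBP2 DBP3 DBP4;
  datalines;
11 95 90 98 92
14 94 . 91 95
;
run;

data G;
  set visits;
  n_visits  = n(of DBP1-DBP4);       /* number of non-missing values */
  n_missing = nmiss(of DBP1-DBP4);   /* number of missing values */

  /* compute the mean only when enough visits were observed */
  if n_visits >= 3 then DBP_mean1 = mean(of DBP1-DBP4);
run;

proc print data = G noobs;
run;
```

For subject 14, n_visits is 3 and n_missing is 1, so DBP_mean1 is the average of the three observed values.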