Lecture 9
1. Correlation and Proc Corr
2. Partial correlation
3. Linear regression: assessing predictors
4. Linear regression: Proc Reg
Correlation
For random variables X and Y, association is measured by covariance,

    Cov(X, Y) = mean of [(X − µX)(Y − µY)],

where µX and µY are the population means of X and Y, respectively.
The population correlation ρ (rho) is a scaled version of covariance:

    ρXY = Cov(X, Y) / √((Var X)(Var Y))

Scaling guarantees that −1 ≤ ρ ≤ 1.
If we substitute sample estimates for population parameters in the formula for
correlation, we get Pearson’s product moment correlation: for ordered pairs
{(xi, yi)},

    rXY = mean of [ (xi − x̄)(yi − ȳ) / (SD(x) SD(y)) ].
Thus Pearson’s r is the average of the areas of rectangles with one corner at (xi, yi)
and the other corner at (x̄, ȳ), scaled by the standard deviations.
This is why Pearson’s r is sensitive to outliers.
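The “mean of scaled rectangle areas” recipe can be checked numerically. Below is a small Python sketch (an illustration, not the course’s SAS code); it uses population (divide-by-n) means and SDs throughout, so that with matching denominators the formula agrees exactly with the usual Pearson correlation.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson's r as the mean of the scaled rectangle areas
    # (xi - xbar)(yi - ybar) / (SD(x) SD(y)), with divide-by-n
    # means and SDs so the two forms agree exactly.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sdy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) / (sdx * sdy)
               for xi, yi in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(pearson_r(x, y), 4))
```

One large product (xi − x̄)(yi − ȳ) from a single outlying pair can dominate the average, which is the sensitivity to outliers noted above.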
Spearman’s correlation is less sensitive:
1. Rank the {xi} from largest to smallest.
2. Replace the {xi} with their ranks; use average ranks in case of ties.
3. Do the same for the {yi}.
4. Finally, compute the Pearson correlation of the ordered pairs of ranks.
Ranking is similar to taking logs: it pulls in outliers.
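The four steps above are easy to sketch in Python (an illustration, not the SAS implementation): tied values get the average of their ranks, and Spearman’s correlation is just the Pearson correlation of the two rank vectors. The tiny example compares the two statistics when one y-value is an outlier.

```python
from math import sqrt

def avg_ranks(v):
    # rank 1 = smallest value; tied values share the average of their ranks
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) *
               sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks
    return pearson(avg_ranks(x), avg_ranks(y))

# a single extreme y-value drags Pearson down but leaves Spearman alone
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 500]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```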
The test of the null hypothesis H0: ρ = 0 based on Pearson’s correlation r has test
statistic

    t = √( (n − 2) r² / (1 − r²) ),

which has a t-distribution with n − 2 degrees of freedom.
The same test is used when r is a Spearman correlation.
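As a quick numerical check (Python, not SAS), plug the SBP–DBP correlation from the output later in these notes (r = 0.55098, n = 3493) into the formula, written here in the sign-preserving form t = r·√((n − 2)/(1 − r²)):

```python
from math import sqrt

def corr_t(r, n):
    # t statistic for H0: rho = 0; compare with a t distribution on n - 2 df.
    # Written as r * sqrt(...) so t keeps the sign of r.
    return r * sqrt((n - 2) / (1 - r ** 2))

print(round(corr_t(0.55098, 3493), 1))
```

With n in the thousands, even modest correlations give enormous t statistics, which is why nearly every p-value in the NHANES output below is < .0001.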
Proc CORR
Proc Corr computes Pearson’s correlation by default, but also offers Spearman’s
correlation and other measures of association, such as Cronbach’s alpha and
Kendall’s tau.
Proc Corr pearson spearman data=nhanes;   * asks for both types of correlation;
  var sbp dbp educ_yrs ht_in wt_lbs;
run;
The CORR Procedure

5 Variables: sbp dbp educ_yrs ht_in wt_lbs

Simple Statistics

Variable      N       Mean      Std Dev     Median     Minimum     Maximum
sbp        3494   112.75301   10.65539   112.00000   83.00000   174.00000
dbp        3493    69.07902    9.74592    69.00000   32.00000   115.00000
educ_yrs   3484    11.57147    3.05088    12.00000    0          17.00000
ht_in      3507    65.81600    3.78504    65.70000   53.70000    79.40000
wt_lbs     3507   157.66935   38.97120   150.90000   81.50000   481.60000
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

            sbp        dbp        educ_yrs   ht_in      wt_lbs
sbp        1.00000    0.55098   -0.00048    0.40235    0.41690
                      <.0001     0.9775     <.0001     <.0001
           3494       3493       3472       3494       3494
dbp        0.55098    1.00000    0.07429    0.27735    0.32658
           <.0001                <.0001     <.0001     <.0001
           3493       3493       3471       3493       3493
educ_yrs  -0.00048    0.07429    1.00000    0.21637    0.06546
           0.9775     <.0001                <.0001     0.0001
           3472       3471       3484       3484       3484
ht_in      0.40235    0.27735    0.21637    1.00000    0.47679
           <.0001     <.0001     <.0001                <.0001
           3494       3493       3484       3507       3507
wt_lbs     0.41690    0.32658    0.06546    0.47679    1.00000
           <.0001     <.0001     0.0001     <.0001
           3494       3493       3484       3507       3507
Spearman Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

            sbp        dbp        educ_yrs   ht_in      wt_lbs
sbp        1.00000    0.53673   -0.00757    0.42194    0.42739
                      <.0001     0.6555     <.0001     <.0001
           3494       3493       3472       3494       3494
dbp        0.53673    1.00000    0.07594    0.28834    0.32659
           <.0001                <.0001     <.0001     <.0001
           3493       3493       3471       3493       3493
educ_yrs  -0.00757    0.07594    1.00000    0.18753    0.04258
           0.6555     <.0001                <.0001     0.0120
           3472       3471       3484       3484       3484
ht_in      0.42194    0.28834    0.18753    1.00000    0.51977
           <.0001     <.0001     <.0001                <.0001
           3494       3493       3484       3507       3507
wt_lbs     0.42739    0.32659    0.04258    0.51977    1.00000
           <.0001     <.0001     0.0120     <.0001
           3494       3493       3484       3507       3507
By default, Proc Corr computes the correlation between every pair of variables in
the list. Change this by specifying both WITH and VAR:
Proc Corr pearson spearman data=nhanes;
  var educ_yrs ht_in wt_lbs;
  with sbp dbp;
run;
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

         educ_yrs   ht_in      wt_lbs
sbp     -0.00048    0.40235    0.41690
         0.9775     <.0001     <.0001
         3472       3494       3494
dbp      0.07429    0.27735    0.32658
         <.0001     <.0001     <.0001
         3471       3493       3493
Partial correlation
The idea: find the correlation between variables X and Y after adjusting both for
Z, which may be one or more variables.
1. Adjust X for Z by linear regression of X on Z, and calculate the residuals
   êX,i = xi − x̂i from this regression.
2. Adjust Y for Z by linear regression of Y on Z, and calculate the residuals
   êY,i = yi − ŷi from this regression.
3. The partial correlation adjusted for Z is the correlation between êX and êY.
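The three steps can be sketched in Python for a single adjusting variable z (an illustration of the recipe, not what Proc Corr runs internally):

```python
from math import sqrt

def residuals(v, z):
    # step 1/2: residuals from the simple linear regression of v on z
    n = len(v)
    mz, mv = sum(z) / n, sum(v) / n
    slope = (sum((a - mz) * (b - mv) for a, b in zip(z, v))
             / sum((a - mz) ** 2 for a in z))
    intercept = mv - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, v)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) *
               sum((b - my) ** 2 for b in y))
    return num / den

def partial_corr(x, y, z):
    # step 3: correlate the two residual series
    return pearson(residuals(x, z), residuals(y, z))
```

For several adjusting variables, replace the simple regression inside `residuals` with a multiple regression of v on all of them; the recipe is otherwise unchanged.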
Partial correlation of sbp with dbp, adjusting for ht_in and wt_lbs:

Proc Corr data=nhanes;
  var sbp dbp;
  partial ht_in wt_lbs;   * adjusting variable(s);
run;
Pearson Partial Correlation Coefficients, N = 3493
Prob > |r| under H0: Partial Rho=0

        sbp       dbp
sbp    1.00000   0.46583
                 <.0001
dbp    0.46583   1.00000
       <.0001
The unadjusted Pearson correlation of SBP with DBP was r = 0.55098,
so adjusting has reduced the association.
Linear Regression Example: Minnesota 8th-grade math scores, 2000
Each year, all the eighth grade students in Minnesota take reading and math tests.
The Department of Children, Families, and Learning publishes the average scores
for each school, along with other characteristics at the school level. In 2000, the
passing score was 50 correct out of 68 questions.
We will consider regression models for the math test scores on these school-level
covariates:
• pctlep percent of students with Limited English Proficiency
• pctspe percent of students in Special Education
• pctfre percent of students receiving free or reduced-price lunch
• pctmob mobility index (percent)
• pctdrp percent of students dropping out of school
• opexp district operating expenditure per student (district level)
• totexp district total expenditure per student (district level)
• tot8enr total eighth grade enrollment
• k12enr kindergarten through 12th grade enrollment
Proc Corr data=pubh.grade8_2000;
  var math pctlep pctspe pctfre pctmob pctdrp
      opexp totexp tot8enr k12enr;
run;
Simple Statistics

Variable     N       Mean      Std Dev        Sum      Minimum    Maximum
math        435   53.45977     4.20008      23255    35.00000   61.00000
pctlep      435    2.70805     7.42984       1178     0         50.00000
pctspe      435   12.18161     3.96722       5299     0         40.00000
pctfre      435   28.85977    17.94349      12554     0         97.00000
pctmob      435   10.57241     8.22081       4599     0         63.00000
pctdrp      435    0.67356     1.26057   293.00000    0         14.00000
opexp       435       6724        1069    2925119     4836      11800
totexp      435       7708        1329    3352914     5417      13187
tot8enr     435  151.11724   136.73599      65736     0         900.00000
k12enr      435  538.65057   339.87658     234313     0         1754
Which variables should be log-transformed?
Any problems here?
data a;
  set pubh.grade8_2000;
  log_LEP         = log(pctlep + 1);   * log is not defined at zero;
  log_special_ed  = log(pctspe + 1);
  log_free_lunch  = log(pctfre + 1);
  log_mobility    = log(pctmob + 1);
  log_dropout     = log(pctdrp + 1);
  log_8th_grade_n = log(tot8enr + 1);
  log_K12_n       = log(k12enr + 1);
run;
It is very common to screen predictors by looking at all pairwise correlations
after transforming (note that the transformed variables live in dataset a, not in
pubh.grade8_2000):

Proc Corr data=a;
  var math
      log_LEP log_special_ed log_free_lunch log_mobility
      opexp totexp log_dropout log_8th_grade_n log_K12_n;
run;
Pearson Correlation Coefficients, N = 435
Prob > |r| under H0: Rho=0

                    math      log_LEP   log_special_ed  log_free_lunch  log_mobility
math              1.00000    -0.40383     -0.09783        -0.53627       -0.40425
                              <.0001       0.0414          <.0001         <.0001
log_LEP          -0.40383     1.00000      0.11383         0.30059        0.32629
                  <.0001                   0.0176          <.0001         <.0001
log_special_ed   -0.09783     0.11383      1.00000         0.31232        0.33971
                  0.0414      0.0176                       <.0001         <.0001
log_free_lunch   -0.53627     0.30059      0.31232         1.00000        0.38278
                  <.0001      <.0001       <.0001                         <.0001
log_mobility     -0.40425     0.32629      0.33971         0.38278        1.00000
                  <.0001      <.0001       <.0001          <.0001
opexp            -0.48856     0.44055      0.20747         0.48541        0.41365
                  <.0001      <.0001       <.0001          <.0001         <.0001
totexp           -0.35819     0.32945      0.12581         0.29797        0.32846
                  <.0001      <.0001       0.0086          <.0001         <.0001
log_dropout      -0.39055     0.02951      0.06239         0.33239        0.41009
                  <.0001      0.5393       0.1940          <.0001         <.0001
log_8th_grade_n   0.14321     0.26603      0.15549        -0.29034        0.10954
                  0.0028      <.0001       0.0011          <.0001         0.0223
log_K12_n         0.13690     0.28239      0.19046        -0.21826        0.20722
                  0.0042      <.0001       <.0001          <.0001         <.0001

                    opexp     totexp    log_dropout  log_8th_grade_n  log_K12_n
math             -0.48856   -0.35819     -0.39055        0.14321       0.13690
                  <.0001     <.0001       <.0001         0.0028        0.0042
log_LEP           0.44055    0.32945      0.02951        0.26603       0.28239
                  <.0001     <.0001       0.5393         <.0001        <.0001
log_special_ed    0.20747    0.12581      0.06239        0.15549       0.19046
                  <.0001     0.0086       0.1940         0.0011        <.0001
log_free_lunch    0.48541    0.29797      0.33239       -0.29034      -0.21826
                  <.0001     <.0001       <.0001         <.0001        <.0001
log_mobility      0.41365    0.32846      0.41009        0.10954       0.20722
                  <.0001     <.0001       <.0001         0.0223        <.0001
opexp             1.00000    0.81509      0.26601       -0.08382      -0.04614
                             <.0001       <.0001         0.0808        0.3370
totexp            0.81509    1.00000      0.21699       -0.01163       0.03027
                  <.0001                  <.0001         0.8089        0.5289
log_dropout       0.26601    0.21699      1.00000       -0.28248      -0.15190
                  <.0001     <.0001                      <.0001        0.0015
log_8th_grade_n  -0.08382   -0.01163     -0.28248        1.00000       0.81324
                  0.0808     0.8089       <.0001                       <.0001
log_K12_n        -0.04614    0.03027     -0.15190        0.81324       1.00000
                  0.3370     0.5289       0.0015         <.0001
Better: [figures not reproduced]
Reconsider some of the log transformations:
• leave pctspe (% in special education) on original scale
• increase the amount added to the two enrollment variables
data a;
  set pubh.grade8_2000;
  log_LEP         = log(pctlep + 1);
  log_free_lunch  = log(pctfre + 1);
  log_mobility    = log(pctmob + 1);
  log_dropout     = log(pctdrp + 1);
  log_8th_grade_n = log(tot8enr + 25);
  log_K12_n       = log(k12enr + 100);
run;
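The effect of the added constant can be seen directly in a small Python sketch (not part of the SAS program): log(0 + c) is defined for any c > 0, and a larger c compresses the gap between zero and moderate values, which is presumably why the larger-scale enrollment counts get larger offsets than the percentages.

```python
from math import log

# How the offset changes a log transform of counts that include zeros:
# a bigger offset pulls zeros less far away from the rest of the data.
for c in (1, 25, 100):
    print(c, [round(log(v + c), 2) for v in (0, 10, 100, 1000)])
```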
Linear regression with Proc REG
This is another procedure that performs linear regression using ordinary least
squares estimation:
Proc REG data=a;
  model math = log_LEP log_special_ed
               log_free_lunch log_mobility
               opexp totexp log_dropout
               log_8th_grade_n log_K12_n;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: math

Number of Observations Read   435
Number of Observations Used   435

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               9       3444.19733     382.68859     38.62   <.0001
Error             425       4211.84865       9.91023
Corrected Total   434       7656.04598

Root MSE          3.14805    R-Square   0.4499
Dependent Mean   53.45977    Adj R-Sq   0.4382
Coeff Var         5.88864
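The two R-square values can be reproduced from the ANOVA table (a Python check, using only numbers in the output): R² = SS(model)/SS(total), and the adjusted version penalizes for the p = 9 predictors.

```python
# Reproduce R-square and Adj R-square from the ANOVA table above:
# R2 = SS_model / SS_total, adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1).
ss_model, ss_total = 3444.19733, 7656.04598
n, p = 435, 9
r2 = ss_model / ss_total
adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 4), round(adj, 4))
```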
Estimated regression coefficients:

Parameter Estimates

Variable          DF   Parameter     Standard     t Value   Pr > |t|
                       Estimate      Error
Intercept          1   60.87064      2.73391       22.27    <.0001
log_LEP            1   -0.93064      0.19947       -4.67    <.0001
log_special_ed     1    1.33401      0.48420        2.76    0.0061
log_free_lunch     1   -2.00068      0.31848       -6.28    <.0001
log_mobility       1   -0.70716      0.32402       -2.18    0.0296
opexp              1   -0.00072241   0.00028499    -2.53    0.0116
totexp             1    0.00005191   0.00020125     0.26    0.7966
log_dropout        1   -1.81213      0.37291       -4.86    <.0001
log_8th_grade_n    1   -0.66408      0.44044       -1.51    0.1324
log_K12_n          1    0.97201      0.55827        1.74    0.0824
The test of H0: β = 0 is a t-test,

    t = β̂ / SE(β̂),

with the error degrees of freedom (425 here).
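For example (a Python check, using the log_LEP row of the parameter estimates):

```python
# t = estimate / standard error for the log_LEP coefficient above;
# compare against a t distribution on 425 (error) df.
est, se = -0.93064, 0.19947
t = est / se
print(round(t, 2))
```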