Lecture 9
1. Correlation and Proc Corr
2. Partial correlation
3. Linear regression: assessing predictors
4. Linear regression: Proc Reg
Correlation
For random variables X and Y, association is measured by covariance,

    Cov(X, Y) = mean of [(X − µX)(Y − µY)],

where µX and µY are the population means of X and Y, respectively.
The population correlation ρ (rho) is a scaled version of covariance:

    ρXY = Cov(X, Y) / √((Var X)(Var Y))

Scaling guarantees that −1 ≤ ρ ≤ 1.
If we substitute sample estimates for population parameters in the formula for
correlation, we get Pearson’s product moment correlation: for ordered pairs
{(xi, yi)},

    rXY = mean of [ (xi − x̄)(yi − ȳ) / (SD(x) SD(y)) ].
Thus Pearson’s r is the average of the areas of rectangles with one corner at (xi, yi)
and the other corner at (x̄, ȳ), scaled by the standard deviations.
This is why Pearson’s r is sensitive to outliers.
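The “mean of scaled rectangle areas” recipe can be checked numerically. Below is a small Python sketch (an illustration, not the course’s SAS code); it uses population (divide-by-n) means and SDs throughout, so that with matching denominators the formula agrees exactly with the usual Pearson correlation.

```python
from math import sqrt

def pearson_r(x, y):
    # Pearson's r as the mean of the scaled rectangle areas
    # (xi - xbar)(yi - ybar) / (SD(x) SD(y)), with divide-by-n
    # means and SDs so the two forms agree exactly.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sdx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sdy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return sum((xi - mx) * (yi - my) / (sdx * sdy)
               for xi, yi in zip(x, y)) / n

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(round(pearson_r(x, y), 4))
```

One large product (xi − x̄)(yi − ȳ) from a single outlying pair can dominate the average, which is the sensitivity to outliers noted above.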
Spearman’s correlation is less sensitive:
1. Rank the {xi} from largest to smallest.
2. Replace the {xi} with their ranks; use average ranks in case of ties.
3. Do the same for the {yi}.
4. Finally, compute the Pearson correlation of the ordered pairs of ranks.
Ranking is similar to taking logs: it pulls in outliers.
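The four steps above are easy to sketch in Python (an illustration, not the SAS implementation): tied values get the average of their ranks, and Spearman’s correlation is just the Pearson correlation of the two rank vectors. The tiny example compares the two statistics when one y-value is an outlier.

```python
from math import sqrt

def avg_ranks(v):
    # rank 1 = smallest value; tied values share the average of their ranks
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) *
               sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    # Spearman = Pearson correlation of the ranks
    return pearson(avg_ranks(x), avg_ranks(y))

# a single extreme y-value drags Pearson down but leaves Spearman alone
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 500]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```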
The test of the null hypothesis H0: ρ = 0 based on Pearson’s correlation r has test
statistic

    t = √( (n − 2) r² / (1 − r²) ),

which has a t-distribution with n − 2 degrees of freedom.
The same test is used when r is a Spearman correlation.
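As a quick numerical check (Python, not SAS), plug the SBP–DBP correlation from the output later in these notes (r = 0.55098, n = 3493) into the formula, written here in the sign-preserving form t = r·√((n − 2)/(1 − r²)):

```python
from math import sqrt

def corr_t(r, n):
    # t statistic for H0: rho = 0; compare with a t distribution on n - 2 df.
    # Written as r * sqrt(...) so t keeps the sign of r.
    return r * sqrt((n - 2) / (1 - r ** 2))

print(round(corr_t(0.55098, 3493), 1))
```

With n in the thousands, even modest correlations give enormous t statistics, which is why nearly every p-value in the NHANES output below is < .0001.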
Proc CORR
Proc Corr computes Pearson’s correlation by default, but also offers Spearman’s
correlation and other measures of association, such as Cronbach’s alpha and
Kendall’s tau.
Proc Corr pearson spearman data=nhanes;   * asks for both types of correlation;
  var sbp dbp educ_yrs ht_in wt_lbs;
run;
The CORR Procedure

5 Variables: sbp dbp educ_yrs ht_in wt_lbs

Simple Statistics

Variable      N       Mean      Std Dev     Median     Minimum     Maximum
sbp        3494   112.75301   10.65539   112.00000   83.00000   174.00000
dbp        3493    69.07902    9.74592    69.00000   32.00000   115.00000
educ_yrs   3484    11.57147    3.05088    12.00000    0          17.00000
ht_in      3507    65.81600    3.78504    65.70000   53.70000    79.40000
wt_lbs     3507   157.66935   38.97120   150.90000   81.50000   481.60000
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

            sbp        dbp        educ_yrs   ht_in      wt_lbs
sbp        1.00000    0.55098   -0.00048    0.40235    0.41690
                      <.0001     0.9775     <.0001     <.0001
           3494       3493       3472       3494       3494
dbp        0.55098    1.00000    0.07429    0.27735    0.32658
           <.0001                <.0001     <.0001     <.0001
           3493       3493       3471       3493       3493
educ_yrs  -0.00048    0.07429    1.00000    0.21637    0.06546
           0.9775     <.0001                <.0001     0.0001
           3472       3471       3484       3484       3484
ht_in      0.40235    0.27735    0.21637    1.00000    0.47679
           <.0001     <.0001     <.0001                <.0001
           3494       3493       3484       3507       3507
wt_lbs     0.41690    0.32658    0.06546    0.47679    1.00000
           <.0001     <.0001     0.0001     <.0001
           3494       3493       3484       3507       3507
Spearman Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

            sbp        dbp        educ_yrs   ht_in      wt_lbs
sbp        1.00000    0.53673   -0.00757    0.42194    0.42739
                      <.0001     0.6555     <.0001     <.0001
           3494       3493       3472       3494       3494
dbp        0.53673    1.00000    0.07594    0.28834    0.32659
           <.0001                <.0001     <.0001     <.0001
           3493       3493       3471       3493       3493
educ_yrs  -0.00757    0.07594    1.00000    0.18753    0.04258
           0.6555     <.0001                <.0001     0.0120
           3472       3471       3484       3484       3484
ht_in      0.42194    0.28834    0.18753    1.00000    0.51977
           <.0001     <.0001     <.0001                <.0001
           3494       3493       3484       3507       3507
wt_lbs     0.42739    0.32659    0.04258    0.51977    1.00000
           <.0001     <.0001     0.0120     <.0001
           3494       3493       3484       3507       3507
By default, Proc Corr computes the correlation between every pair of variables in
the list. Change this by specifying both WITH and VAR:
Proc Corr pearson spearman data=nhanes;
  var educ_yrs ht_in wt_lbs;
  with sbp dbp;
run;
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

         educ_yrs   ht_in      wt_lbs
sbp     -0.00048    0.40235    0.41690
         0.9775     <.0001     <.0001
         3472       3494       3494
dbp      0.07429    0.27735    0.32658
         <.0001     <.0001     <.0001
         3471       3493       3493
Partial correlation
The idea: find the correlation between variables X and Y after adjusting both for
Z, which may be one or more variables.
1. Adjust X for Z by linear regression of X on Z, and calculate the residuals
   êX,i = xi − x̂i from this regression.
2. Adjust Y for Z by linear regression of Y on Z, and calculate the residuals
   êY,i = yi − ŷi from this regression.
3. The partial correlation adjusted for Z is the correlation between êX and êY.
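The three steps can be sketched in Python for a single adjusting variable z (an illustration of the recipe, not what Proc Corr runs internally):

```python
from math import sqrt

def residuals(v, z):
    # step 1/2: residuals from the simple linear regression of v on z
    n = len(v)
    mz, mv = sum(z) / n, sum(v) / n
    slope = (sum((a - mz) * (b - mv) for a, b in zip(z, v))
             / sum((a - mz) ** 2 for a in z))
    intercept = mv - slope * mz
    return [b - (intercept + slope * a) for a, b in zip(z, v)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) *
               sum((b - my) ** 2 for b in y))
    return num / den

def partial_corr(x, y, z):
    # step 3: correlate the two residual series
    return pearson(residuals(x, z), residuals(y, z))
```

For several adjusting variables, replace the simple regression inside `residuals` with a multiple regression of v on all of them; the recipe is otherwise unchanged.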
Partial correlation of sbp with dbp, adjusting for ht_in and wt_lbs:

Proc Corr data=nhanes;
  var sbp dbp;
  partial ht_in wt_lbs;   * adjusting variable(s);
run;
Pearson Partial Correlation Coefficients, N = 3493
Prob > |r| under H0: Partial Rho=0

        sbp       dbp
sbp    1.00000   0.46583
                 <.0001
dbp    0.46583   1.00000
       <.0001
The unadjusted Pearson correlation of SBP with DBP was r = 0.55098,
so adjusting has reduced the association.
Linear Regression Example: Minnesota 8th-grade math scores, 2000
Each year, all the eighth grade students in Minnesota take reading and math tests.
The Department of Children, Families, and Learning publishes the average scores
for each school, along with other characteristics at the school level. In 2000, the
passing score was 50 correct out of 68 questions.
We will consider regression models for the math test scores on these school-level
covariates:
• pctlep percent of students with Limited English Proficiency
• pctspe percent of students in Special Education
• pctfre percent of students receiving free or reduced-price lunch
• pctmob mobility index (percent)
• pctdrp percent of students dropping out of school
• opexp district operating expenditure per student (district level)
• totexp district total expenditure per student (district level)
• tot8enr total eighth grade enrollment
• k12enr kindergarten through 12th grade enrollment
Proc Corr data=pubh.grade8_2000;
  var math pctlep pctspe pctfre pctmob pctdrp
      opexp totexp tot8enr k12enr;
run;
Simple Statistics

Variable     N       Mean      Std Dev        Sum      Minimum    Maximum
math        435   53.45977     4.20008      23255    35.00000   61.00000
pctlep      435    2.70805     7.42984       1178     0         50.00000
pctspe      435   12.18161     3.96722       5299     0         40.00000
pctfre      435   28.85977    17.94349      12554     0         97.00000
pctmob      435   10.57241     8.22081       4599     0         63.00000
pctdrp      435    0.67356     1.26057   293.00000    0         14.00000
opexp       435       6724        1069    2925119     4836      11800
totexp      435       7708        1329    3352914     5417      13187
tot8enr     435  151.11724   136.73599      65736     0         900.00000
k12enr      435  538.65057   339.87658     234313     0         1754
Which variables should be log-transformed?
Any problems here?
data a;
  set pubh.grade8_2000;
  log_LEP         = log(pctlep + 1);   * log is not defined at zero;
  log_special_ed  = log(pctspe + 1);
  log_free_lunch  = log(pctfre + 1);
  log_mobility    = log(pctmob + 1);
  log_dropout     = log(pctdrp + 1);
  log_8th_grade_n = log(tot8enr + 1);
  log_K12_n       = log(k12enr + 1);
run;
It is very common to screen predictors by looking at all pairwise correlations
after transforming (note that the transformed variables live in dataset a, not in
pubh.grade8_2000):

Proc Corr data=a;
  var math
      log_LEP log_special_ed log_free_lunch log_mobility
      opexp totexp log_dropout log_8th_grade_n log_K12_n;
run;
Pearson Correlation Coefficients, N = 435
Prob > |r| under H0: Rho=0

                    math      log_LEP   log_special_ed  log_free_lunch  log_mobility
math              1.00000    -0.40383     -0.09783        -0.53627       -0.40425
                              <.0001       0.0414          <.0001         <.0001
log_LEP          -0.40383     1.00000      0.11383         0.30059        0.32629
                  <.0001                   0.0176          <.0001         <.0001
log_special_ed   -0.09783     0.11383      1.00000         0.31232        0.33971
                  0.0414      0.0176                       <.0001         <.0001
log_free_lunch   -0.53627     0.30059      0.31232         1.00000        0.38278
                  <.0001      <.0001       <.0001                         <.0001
log_mobility     -0.40425     0.32629      0.33971         0.38278        1.00000
                  <.0001      <.0001       <.0001          <.0001
opexp            -0.48856     0.44055      0.20747         0.48541        0.41365
                  <.0001      <.0001       <.0001          <.0001         <.0001
totexp           -0.35819     0.32945      0.12581         0.29797        0.32846
                  <.0001      <.0001       0.0086          <.0001         <.0001
log_dropout      -0.39055     0.02951      0.06239         0.33239        0.41009
                  <.0001      0.5393       0.1940          <.0001         <.0001
log_8th_grade_n   0.14321     0.26603      0.15549        -0.29034        0.10954
                  0.0028      <.0001       0.0011          <.0001         0.0223
log_K12_n         0.13690     0.28239      0.19046        -0.21826        0.20722
                  0.0042      <.0001       <.0001          <.0001         <.0001

                    opexp     totexp    log_dropout  log_8th_grade_n  log_K12_n
math             -0.48856   -0.35819     -0.39055        0.14321       0.13690
                  <.0001     <.0001       <.0001         0.0028        0.0042
log_LEP           0.44055    0.32945      0.02951        0.26603       0.28239
                  <.0001     <.0001       0.5393         <.0001        <.0001
log_special_ed    0.20747    0.12581      0.06239        0.15549       0.19046
                  <.0001     0.0086       0.1940         0.0011        <.0001
log_free_lunch    0.48541    0.29797      0.33239       -0.29034      -0.21826
                  <.0001     <.0001       <.0001         <.0001        <.0001
log_mobility      0.41365    0.32846      0.41009        0.10954       0.20722
                  <.0001     <.0001       <.0001         0.0223        <.0001
opexp             1.00000    0.81509      0.26601       -0.08382      -0.04614
                             <.0001       <.0001         0.0808        0.3370
totexp            0.81509    1.00000      0.21699       -0.01163       0.03027
                  <.0001                  <.0001         0.8089        0.5289
log_dropout       0.26601    0.21699      1.00000       -0.28248      -0.15190
                  <.0001     <.0001                      <.0001        0.0015
log_8th_grade_n  -0.08382   -0.01163     -0.28248        1.00000       0.81324
                  0.0808     0.8089       <.0001                       <.0001
log_K12_n        -0.04614    0.03027     -0.15190        0.81324       1.00000
                  0.3370     0.5289       0.0015         <.0001
Better: [figures not reproduced]
Reconsider some of the log transformations:
• leave pctspe (% in special education) on original scale
• increase the amount added to the two enrollment variables
data a;
  set pubh.grade8_2000;
  log_LEP         = log(pctlep + 1);
  log_free_lunch  = log(pctfre + 1);
  log_mobility    = log(pctmob + 1);
  log_dropout     = log(pctdrp + 1);
  log_8th_grade_n = log(tot8enr + 25);
  log_K12_n       = log(k12enr + 100);
run;
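The effect of the added constant can be seen directly in a small Python sketch (not part of the SAS program): log(0 + c) is defined for any c > 0, and a larger c compresses the gap between zero and moderate values, which is presumably why the larger-scale enrollment counts get larger offsets than the percentages.

```python
from math import log

# How the offset changes a log transform of counts that include zeros:
# a bigger offset pulls zeros less far away from the rest of the data.
for c in (1, 25, 100):
    print(c, [round(log(v + c), 2) for v in (0, 10, 100, 1000)])
```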
Linear regression with Proc REG
This is another procedure that performs linear regression using ordinary least
squares estimation:
Proc REG data=a;
  model math = log_LEP log_special_ed
               log_free_lunch log_mobility
               opexp totexp log_dropout
               log_8th_grade_n log_K12_n;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: math

Number of Observations Read   435
Number of Observations Used   435

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               9       3444.19733     382.68859     38.62   <.0001
Error             425       4211.84865       9.91023
Corrected Total   434       7656.04598

Root MSE          3.14805    R-Square   0.4499
Dependent Mean   53.45977    Adj R-Sq   0.4382
Coeff Var         5.88864
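The two R-square values can be reproduced from the ANOVA table (a Python check, using only numbers in the output): R² = SS(model)/SS(total), and the adjusted version penalizes for the p = 9 predictors.

```python
# Reproduce R-square and Adj R-square from the ANOVA table above:
# R2 = SS_model / SS_total, adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1).
ss_model, ss_total = 3444.19733, 7656.04598
n, p = 435, 9
r2 = ss_model / ss_total
adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 4), round(adj, 4))
```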
Estimated regression coefficients:

Parameter Estimates

Variable          DF   Parameter     Standard     t Value   Pr > |t|
                       Estimate      Error
Intercept          1   60.87064      2.73391       22.27    <.0001
log_LEP            1   -0.93064      0.19947       -4.67    <.0001
log_special_ed     1    1.33401      0.48420        2.76    0.0061
log_free_lunch     1   -2.00068      0.31848       -6.28    <.0001
log_mobility       1   -0.70716      0.32402       -2.18    0.0296
opexp              1   -0.00072241   0.00028499    -2.53    0.0116
totexp             1    0.00005191   0.00020125     0.26    0.7966
log_dropout        1   -1.81213      0.37291       -4.86    <.0001
log_8th_grade_n    1   -0.66408      0.44044       -1.51    0.1324
log_K12_n          1    0.97201      0.55827        1.74    0.0824
The test of H0: β = 0 is a t-test,

    t = β̂ / SE(β̂),

with the error degrees of freedom (425 here).
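For example (a Python check, using the log_LEP row of the parameter estimates):

```python
# t = estimate / standard error for the log_LEP coefficient above;
# compare against a t distribution on 425 (error) df.
est, se = -0.93064, 0.19947
t = est / se
print(round(t, 2))
```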