The Pearson Product-Moment Correlation Coefficient

The regression coefficient is an asymmetrical statistic, one that gives different values for the model Y = f(X) and the model X = f(Y). The other major measure of bivariate association is the Pearson product-moment correlation coefficient (sometimes called "little r" for short). The correlation coefficient is a symmetrical statistic. That is, it simply describes the association between X and Y without regard to whether Y = f(X) or X = f(Y); it produces the same result in either case. Unlike the regression coefficient, whose values range from 0.0 to ±∞, the correlation coefficient ranges from 0.0 when there is NO association between X and Y to ±1.00 when there is PERFECT association (either direct or inverse).

To generate the second set of statistics describing association from the linear model, we partition the sum of squares. Graphically, we begin with a single data point, i, in two-dimensional space. Yi is its location on the scale of y (on the y-axis); below that is the predicted location of Y, Yi-hat. The dotted horizontal line (- - - -) is the location of the mean of Y. (When there is no association between X and Y, b = 0.0 and therefore a = Y-bar.)

$$a = \bar{Y} - b\bar{X}; \quad \text{where } b = 0,\ a = \bar{Y}$$

[Figure: a single data point i plotted at Xi, with Yi above the predicted value Yi-hat on the line of best fit and the dotted line at Y-bar below; braces mark the distances Yi - Yi-hat and Yi-hat - Y-bar.]

The vertical line represents the deviation of the ith observation from the mean of Y (i.e., the difference between Yi and Y-bar). The line of best fit divides the deviation into its two mathematical components. The component ABOVE the line of best fit is the difference between Yi and Yi-hat, that is, between the actual location of the ith observation on the y-axis and its predicted location on the y-axis. This is the error (or residual) component. The component BELOW the line of best fit is new. It is the difference between the predicted Y-value, Yi-hat, and the mean of Y (Y-bar). This component is called the regression component.
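The decomposition just described can be checked numerically. The following is a minimal Python sketch, not part of the original notes: it borrows the ten-family income and children data from the Correlation Example later in these notes, and the helper names `fit_line` and `decompose` are hypothetical. It fits the least-squares line and prints, for each observation, the deviation from the mean alongside its regression and residual components.

```python
# Ten-family data taken from the Correlation Example in these notes.
income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X, annual income in $1,000
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y, number of children


def fit_line(x, y):
    """Return the least-squares intercept a and slope b for Y-hat = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b = sp_xy / ss_x
    a = y_bar - b * x_bar
    return a, b


def decompose(x, y):
    """Split each deviation (Yi - Y-bar) into (regression, residual) parts."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    rows = []
    for xi, yi in zip(x, y):
        y_hat = a + b * xi
        rows.append((yi - y_bar, y_hat - y_bar, yi - y_hat))
    return rows


# Each printed row reads: deviation = regression component + residual.
for deviation, regression, residual in decompose(income, children):
    print(f"{deviation:7.3f} = {regression:7.3f} + {residual:7.3f}")
```

For any single observation the residual may be positive or negative, depending on whether the point lies above or below the line of best fit; the identity holds in both cases.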
Since these two components combined are the parts of the deviation of the ith observation from the mean of Y, the following is merely an algebraic summary of this relationship:

deviation = regression component + error (residual)

$$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

Squaring both sides and summing across all observations yields (the cross-product term sums to zero for the least-squares line)

$$\sum_{i=1}^{N} (Y_i - \bar{Y})^2 = \sum_{i=1}^{N} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$$

or

SSTotal = SSRegression + SSError

We can express the amount of association between X and Y as a ratio of the variance explained by the linear model to the total variance in Y to be explained. SSTotal is the variance to be explained and SSRegression the variance accounted for by Y's relationship with X:

$$R^2_{YX} = \frac{SS_{Regression}}{SS_{Total}}$$

This is the Coefficient of Determination. Its values range from 0.0 when X and Y are independent (i.e., when Y-hat minus Y-bar = 0.0) to 1.0 with perfect association (i.e., SSRegression = SSTotal). It is interpreted as the percentage of the total variance in Y explained by Y's association with X. In algebraic form, the Coefficient of Determination is calculated as

$$R^2_{YX} = \frac{s^2_{XY}}{s^2_X \, s^2_Y}$$

The denominator is the product of the variance (standard deviation squared) of X and the variance of Y. The numerator is the square of the covariance and can be obtained by squaring the value from the following short-cut equation

$$s_{XY} = \frac{N \sum_{i=1}^{N} X_i Y_i - \left(\sum_{i=1}^{N} X_i\right)\left(\sum_{i=1}^{N} Y_i\right)}{N(N - 1)}$$

In the time and temperature example, N = 3, the sum of X (time) was 23.5, the sum of the squared time values was 194.25, the square of the sum of the time values was 552.25, the sum of Y (temperature) was 248, and the sum of the cross-products was 1,911.

sXY = [(3)(1911) - (248)(23.5)] / [(3)(3 - 1)]
sXY = (5733 - 5828) / 6
sXY = -95 / 6
sXY = -15.833

Squaring to get the covariance squared,

s²XY = 250.694

Next, we can use the short-hand equation to calculate the two variances:

$$s^2_X = \frac{N \sum X^2 - \left(\sum X\right)^2}{N(N - 1)}$$

(Here, the absence of an index and counter on the summation sign implies summing from the first to the last value.)
s²X = [(3)(194.25) - (23.5)²] / [(3)(3 - 1)]
s²X = (582.75 - 552.25) / [(3)(2)]
s²X = 30.5 / 6
s²X = 5.083

And for the variance of Y:

$$s^2_Y = \frac{N \sum Y^2 - \left(\sum Y\right)^2}{N(N - 1)}$$

s²Y = [(3)(20,600) - (248)²] / [(3)(3 - 1)]
s²Y = (61,800 - 61,504) / 6
s²Y = 296 / 6
s²Y = 49.333

Now we can solve for the Coefficient of Determination:

R²YX = s²XY / (s²X s²Y)
R²YX = 250.694 / [(5.083)(49.333)]
R²YX = 250.694 / 250.760
R²YX = 0.9997

This is interpreted as meaning that about 99.97 percent of the variance in afternoon high temperature is statistically explained by the association of this variable with the time of the sun's first appearance. This is an extremely high (and extremely unlikely) value, since R²YX varies from a minimum of 0.0 (no variance explained) to a maximum of 1.0 (100 percent, if ALL the variance is explained).

If the Coefficient of Determination is the percentage of the variance in Y explained by its association with X, then its complement is the percentage of variance in Y NOT explained by its association with X. This is called the Coefficient of Nondetermination, simply

KYX = 1 - R²YX

In this example, the percentage of variance NOT explained is 1 - 0.9997 = 0.0003, or about 0.03 percent.

Conceptually, the Pearson product-moment correlation coefficient is the square root of the Coefficient of Determination (the sign is that of the covariance):

$$r_{XY} = \pm\sqrt{R^2_{YX}}$$

For raw data, the correlation coefficient is found by

$$r_{XY} = \frac{s_{XY}}{s_X \, s_Y}$$

where the numerator is the covariance and the denominator is the product of the standard deviations of X and Y. In our example,

rXY = -15.833 / [(2.255)(7.024)]
rXY = -15.833 / 15.839
rXY = -0.9996

Notice that, unlike the Coefficient of Determination, which takes only non-negative values, the correlation coefficient varies between -1.00 and +1.00. Here, a correlation of -0.9996 shows an extremely STRONG INVERSE relationship.
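The hand calculation above can be reproduced from the reported sums alone. The following Python sketch is an illustration rather than part of the original notes; the function names `covariance` and `variance` are hypothetical, and only the six summary values quoted in the text are used (the raw observations are not given in this section).

```python
import math

# Summary values reported in the notes for the time and temperature example.
N = 3
SUM_X = 23.5      # sum of time values
SUM_X2 = 194.25   # sum of squared time values
SUM_Y = 248       # sum of temperature values
SUM_Y2 = 20600    # sum of squared temperature values
SUM_XY = 1911     # sum of cross-products


def covariance(n, sum_x, sum_y, sum_xy):
    """Short-cut covariance: [N*sum(XY) - sum(X)*sum(Y)] / [N(N - 1)]."""
    return (n * sum_xy - sum_x * sum_y) / (n * (n - 1))


def variance(n, sum_v, sum_v2):
    """Short-cut variance: [N*sum(V^2) - (sum V)^2] / [N(N - 1)]."""
    return (n * sum_v2 - sum_v ** 2) / (n * (n - 1))


s_xy = covariance(N, SUM_X, SUM_Y, SUM_XY)
s2_x = variance(N, SUM_X, SUM_X2)
s2_y = variance(N, SUM_Y, SUM_Y2)

r_squared = s_xy ** 2 / (s2_x * s2_y)            # Coefficient of Determination
k = 1 - r_squared                                # Coefficient of Nondetermination
r = s_xy / (math.sqrt(s2_x) * math.sqrt(s2_y))   # correlation coefficient

print(f"s_xy = {s_xy:.3f}, s2_x = {s2_x:.3f}, s2_y = {s2_y:.3f}")
print(f"R2 = {r_squared:.4f}, K = {k:.4f}, r = {r:.5f}")
```

Because this sketch carries full precision through every step, its value of r differs slightly from the hand result of -0.9996, which was computed from standard deviations rounded to three decimals.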
Finally, in the bivariate situation, the regression coefficient (i.e., slope, b) and the correlation coefficient (rXY) are related, as follows:

$$b = r_{XY} \left(\frac{s_Y}{s_X}\right) \quad \text{and} \quad r_{XY} = b \left(\frac{s_X}{s_Y}\right)$$

In the present little example,

b = (-0.9996)(7.024 / 2.255)
b = (-0.9996)(3.115)
b = -3.114

and

rXY = -3.115 (2.255 / 7.024)
rXY = -3.115 (0.321)
rXY = -0.9999

(The small differences from b = -3.115 and rXY = -0.9996 reflect rounding in the intermediate values.)

SAS Time and Temperature Example

    /* Point both library references at the a:\ drive */
    LIBNAME perm 'a:\';
    LIBNAME library 'a:\';
    /* Suppress the date and page numbers; 66 lines per page */
    OPTIONS NODATE NONUMBER PS=66;
    /* Correlate temp and time; NOSIMPLE omits the descriptive statistics */
    PROC CORR DATA=perm.weather NOSIMPLE;
        VAR temp time;
        TITLE1 'Time and Temperature Example';
    RUN;

Time and Temperature Example

Correlation Analysis

2 'VAR' Variables: TIME TEMP

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 3

                TIME         TEMP
    TIME     1.00000     -0.99983
             0.0          0.0116
    TEMP    -0.99983      1.00000
             0.0116       0.0

Correlation Example

For the following data on ten families, answer the questions below.

----------------------------------------------------------------------------------------------
Family   Annual Income   (Xi - X-bar)²   Number of    (Yi - Y-bar)²   (Xi - X-bar)(Yi - Y-bar)
         (in $1,000) X                   Children Y
----------------------------------------------------------------------------------------------
   1          25                             0
   2          17                             0
   3          20                             1
   4          14                             2
   5          11                             2
   6          10                             3
   7           6                             4
   8           8                             5
   9           8                             6
  10           4                             7
----------------------------------------------------------------------------------------------
Sum of X = ______      Sum of Y = ______
X-bar = ______         Y-bar = ______
----------------------------------------------------------------------------------------------

1. What is the value of the correlation coefficient? ______________
2. What is the value of the Coefficient of Determination? ______________
3. What is the value of the Coefficient of Nondetermination? ______________

Correlation Example Answers

For the following data on ten families, answer the questions below.
----------------------------------------------------------------------------------------------
Family   Annual Income   (Xi - X-bar)²   Number of    (Yi - Y-bar)²   (Xi - X-bar)(Yi - Y-bar)
         (in $1,000) X                   Children Y
----------------------------------------------------------------------------------------------
   1          25            161.29           0              9                 -38.1
   2          17             22.09           0              9                 -14.1
   3          20             59.29           1              4                 -15.4
   4          14              2.89           2              1                  -1.7
   5          11              1.69           2              1                   1.3
   6          10              5.29           3              0                   0.0
   7           6             39.69           4              1                  -6.3
   8           8             18.49           5              4                  -8.6
   9           8             18.49           6              9                 -12.9
  10           4             68.89           7             16                 -33.2
----------------------------------------------------------------------------------------------
Sum          123            398.1           30             54                -129
Mean          12.3                           3.0
----------------------------------------------------------------------------------------------

1. What is the value of the correlation coefficient? -0.880
2. What is the value of the Coefficient of Determination? 0.774
3. What is the value of the Coefficient of Nondetermination? 0.226
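The three answers can be checked with a short script. The sketch below is an illustration, not part of the original handout; the helper name `deviation_sums` is hypothetical. It computes r from the column totals of the table (the N - 1 divisors in the covariance and the two variances cancel, so sums of squares and cross-products suffice), and it also confirms on these data that SSTotal = SSRegression + SSError and that SSRegression / SSTotal equals r squared.

```python
import math

income = [25, 17, 20, 14, 11, 10, 6, 8, 8, 4]   # X, annual income in $1,000
children = [0, 0, 1, 2, 2, 3, 4, 5, 6, 7]       # Y, number of children


def deviation_sums(x, y):
    """Return sum (Xi - X-bar)^2, sum (Yi - Y-bar)^2, and the cross-product sum."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return ss_x, ss_y, sp_xy


ss_x, ss_y, sp_xy = deviation_sums(income, children)

r = sp_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient
r_squared = r ** 2                   # Coefficient of Determination
k = 1 - r_squared                    # Coefficient of Nondetermination

# Partition the total sum of squares using the least-squares line.
b = sp_xy / ss_x
y_bar = sum(children) / len(children)
a = y_bar - b * (sum(income) / len(income))
predicted = [a + b * xi for xi in income]
ss_regression = sum((yh - y_bar) ** 2 for yh in predicted)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(children, predicted))

print(f"r = {r:.3f}, R2 = {r_squared:.3f}, K = {k:.3f}")
print(f"SSTotal = {ss_y:.3f}, SSRegression + SSError = {ss_regression + ss_error:.3f}")
```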