Topic 6. Bivariate correlation and regression

Introduction: Relationships

1. Is there a relationship between two variables in the population?
   - Nominal with nominal – chi-square
   - Nominal with ordinal – chi-square
   - Ordinal with ordinal – chi-square
   - Interval/ratio DV with a 2-category IV – independent samples t test
   - Interval/ratio DV with a >2-category IV (nominal or ordinal) – ANOVA (F test)
   - Interval/ratio with interval/ratio – t test (in bivariate correlation and regression)

2. How strong is the relationship? (Measures of association)
   - Nominal with nominal – Lambda, etc.
   - Nominal with ordinal – Lambda, etc.
   - Ordinal with ordinal – Gamma, etc.
   - Interval/ratio DV with a >2-category IV (nominal or ordinal) – Eta-squared
   - Interval/ratio with interval/ratio – Pearson's r (a.k.a. bivariate correlation)

3. How can we characterize the relationship/what is its direction?
   - Nominal with nominal – cross-tabulation or clustered bar chart
   - Nominal with ordinal – cross-tabulation or clustered bar chart
   - Ordinal with ordinal – cross-tabulation or clustered bar chart; Gamma, etc.
   - Interval/ratio DV with a 2-category IV – comparison of means table
   - Interval/ratio DV with a >2-category IV (nominal or ordinal) – comparison of means table or means plot
   - Interval/ratio with interval/ratio – scatterplot; bivariate correlation and regression

Scatterplots

Figure 1. Infant Mortality and Literacy (N=107). [Scatterplot: infant mortality (deaths per 1,000 live births, y axis, 0-200) against people who read (%, x axis, 0-100).] Source: World95.sav.

Strength is indicated by the spread of the cases. Direction is indicated by the pattern of joint scores. You can even use scatterplots to identify nonlinear relationships. Scatterplots are very useful descriptive tools, but they don't help us make predictions about the population.
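The decision rules above can be collected into a small lookup function. This is only an illustrative sketch of the three lists; the function name and the level labels are mine, not standard terminology.

```python
# A compact version of the decision rules above: given the measurement
# levels of the DV and IV, return (significance test, measure of
# association, way to characterize the relationship).
def bivariate_tools(dv_level, iv_level):
    """Levels: 'nominal', 'ordinal', '2-category', '>2-category',
    or 'interval-ratio' (labels are illustrative)."""
    if dv_level == "interval-ratio" and iv_level == "interval-ratio":
        return ("t test (bivariate correlation/regression)",
                "Pearson's r",
                "scatterplot; bivariate correlation and regression")
    if dv_level == "interval-ratio" and iv_level == "2-category":
        return ("independent samples t test", None,
                "comparison of means table")
    if dv_level == "interval-ratio" and iv_level == ">2-category":
        return ("ANOVA (F test)", "eta-squared",
                "comparison of means table or means plot")
    # All nominal/ordinal pairings use chi-square for significance.
    return ("chi-square",
            "Gamma, etc." if dv_level == iv_level == "ordinal" else "Lambda, etc.",
            "cross-tabulation or clustered bar chart")

print(bivariate_tools("interval-ratio", "interval-ratio")[1])  # Pearson's r
```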
Covariation

- The most elementary measure for identifying a bivariate relationship between two interval-ratio variables
- The covariance is a building block for other statistics, including r (Pearson's correlation) and b (the regression slope)
- The formula is:

  Cov(x, y) = S_xy = Σ(x - x̄)(y - ȳ) / n

  The numerator is referred to as the sum of the cross-products.

How does it work? Here are three examples:

1. Negative covariance – scores below the mean on one variable are above the mean on the other:

        x     y    (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
        1    10      -2         2           -4
        2     9      -1         1           -1
        3     8       0         0            0
        4     7       1        -1           -1
        5     6       2        -2           -4
  Sum                 0         0          -10
  Mean  3     8                      Covariance = -10/5 = -2.0

2. Positive covariance – scores below the mean on one variable are below the mean on the other; scores above the mean on one variable are above the mean on the other:

        x     y    (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
        1     6      -2        -2            4
        2     7      -1        -1            1
        3     8       0         0            0
        4     9       1         1            1
        5    10       2         2            4
  Sum                 0         0           10
  Mean  3     8                      Covariance = 10/5 = 2.0

3. No covariance:

        x     y    (x - x̄)   (y - ȳ)   (x - x̄)(y - ȳ)
        1     6      -2        -2            4
        1    10      -2         2           -4
        2     7      -1        -1            1
        2     9      -1         1           -1
        3     8       0         0            0
        3     8       0         0            0
        4     9       1         1            1
        4     7       1        -1           -1
        5    10       2         2            4
        5     6       2        -2           -4
  Sum                 0         0            0
  Mean  3     8                      Covariance = 0

The downside to the covariance is that the number doesn't have any meaning in and of itself (it depends on the 'metric' of the variables). If only there were some way to convert it into something else...

The bivariate correlation coefficient (Pearson's r)

- Indicates the strength and direction of a straight-line (linear) relationship
- Is a symmetrical measure of association (i.e., it doesn't matter which is the DV and which is the IV)
- Is a single number ranging from -1 to 1, with 0 indicating no relationship
- A formula (different from the one in our book, but using things we know):

  r = S_xy / (S_x S_y) = [Σ(x - x̄)(y - ȳ)/n] / [√(Σ(x - x̄)²/n) × √(Σ(y - ȳ)²/n)]

Let's work through an example (I took a random sample of 20 countries from the World95.sav data file):

      Country        Infant         Literacy   (y - ȳ)   (y - ȳ)²   (x - x̄)   (x - x̄)²   (x - x̄)(y - ȳ)
                     mortality (y)    (x)
  1   Bangladesh        106            35        51.7     2668.0     -30.9      957.7       -1598.5
  2   Burkina Faso      118            18        63.7     4051.7     -47.9     2299.0       -3052.0
  3   Cent. Afri.R      137            27        82.7     6831.5     -38.9     1516.9       -3219.1
  4   Czech Rep. (a)      –             –          –          –         –          –             –
  5   Ethiopia          110            24        55.7     3097.2     -41.9     1759.6       -2334.5
  6   Finland             5.3         100       -49.0     2405.6      34.1     1159.6       -1670.2
  7   Georgia            23            99       -31.3      982.7      33.1     1092.5       -1036.1
  8   Hong Kong           5.8          77       -48.5     2356.8      11.1      122.2        -536.6
  9   India              79            52        24.7      607.8     -13.9      194.5        -343.8
 10   Iran               60            54         5.7       32.0     -11.9      142.7         -67.5
 11   Lebanon            39.5          80       -14.8      220.4      14.1      197.5        -208.6
 12   Libya              63            64         8.7       74.9      -1.9        3.8         -16.8
 13   Lithuania          17            99       -37.3     1394.8      33.1     1092.5       -1234.4
 14   Malaysia           25.6          78       -28.7      826.4      12.1      145.3        -346.5
 15   N. Korea           27.7          99       -26.6      710.1      33.1     1092.5        -880.8
 16   Netherlands         6.3          99       -48.0     2308.5      33.1     1092.5       -1588.1
 17   New Zealand         8.9          99       -45.4     2065.5      33.1     1092.5       -1502.2
 18   Nicaragua          52.5          57        -1.8        3.4      -8.9       80.1          16.5
 19   Somalia           126            24        71.7     5134.1     -41.9     1759.6       -3005.6
 20   U.Arab Em.         22            68       -32.3     1046.4       2.1        4.2         -66.4
 Sum                   1032.6        1253.0       0.0    36817.7       0.0    15804.9      -22691.3
 Mean                    54.3          65.9
a. The Czech Republic has valid data for infant mortality, but not for literacy. I have used listwise deletion – that is, the results are based on the 19 countries with valid data on both variables.

  S_xy = -22,691.3 / 19 = -1,194.3 ;  S_y = √(36,817.7 / 19) = 44.0 ;  S_x = √(15,804.9 / 19) = 28.8

  r = -1,194.3 / (44.0 * 28.8) = -0.942 ; a strong negative relationship

But is there a relationship in the population? In other words, is r sufficiently different from 0 in my sample for me to assume it is different from 0 in the population? What is our population? Let's perform a hypothesis test:

Requirements (assumptions) for Pearson's correlation coefficient:
1. A straight-line relationship
2. Interval-ratio level data
3. Random sampling
4. Normally distributed characteristics – "Testing the significance of Pearson's r requires both X and Y variables to be normally distributed in the population. In small samples, failure to meet the requirement of normally distributed characteristics may seriously impair the validity of the test.
However, this requirement is of minor importance when the sample size equals or exceeds 30 cases" (p. 329).

Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: ρ = 0
H1: ρ ≠ 0

The test statistic and sampling distribution: t. The formula for converting our correlation into a t value:

  t = r√(n - 2) / √(1 - r²) = (-0.942 × √17) / √(1 - (-0.942)²) = -3.884 / .336 = -11.560

  df = n - 2 = 17
  Critical value = ±2.110 (df=17, 2-tailed, alpha=.05)

We reject H0. But be careful!
1. The correlation coefficient is sensitive to unusual combinations of mean deviations (outliers on both X and Y) as well as outliers on X and Y separately (which influence the standard deviations).
2. It can only detect linear relationships.
You should always check univariate descriptive statistics as well as scatterplots before relying on the correlation.

Results from SPSS (using the complete data):

Statistics
                           BABYMORT Infant mortality    LITERACY People
                           (deaths per 1000 births)     who read (%)
N        Valid                    109                       107
         Missing                    0                         2
Mean                             42.313                    78.34
Std. Error of Mean                3.6473                    2.212
Median                           27.700                    88.00
Mode                              5.7 (a)                  99
Std. Deviation                   38.0792                   22.883
Variance                       1450.0274                  523.640
Skewness                          1.090                     -.994
Std. Error of Skewness             .231                      .234
Kurtosis                           .365                     -.160
Std. Error of Kurtosis             .459                      .463
Range                           164.0                      82
Minimum                           4.0                      18
Maximum                         168.0                     100
Sum                            4612.1                    8382
a. Multiple modes exist. The smallest value is shown.

[Histograms omitted: infant mortality is right-skewed; literacy is left-skewed.]

[Scatterplot of infant mortality against people who read (%) omitted: a strong negative relationship.]

Correlations
                                         BABYMORT    LITERACY
BABYMORT   Pearson Correlation             1          -.900**
           Sig. (2-tailed)                 .           .000
           N                              109          107
LITERACY   Pearson Correlation           -.900**        1
           Sig. (2-tailed)                .000          .
           N                              107          107
**. Correlation is significant at the 0.01 level (2-tailed).

We can reject the null hypothesis (H0: ρ = 0) because p ('Sig.') is less than alpha (.05).

Correlation matrices

Later on, when we are working with more than two variables, you will see correlation matrices. These display all possible bivariate correlations. SPSS produces an unedited matrix (for infant mortality, literacy, GDP per capita, daily calorie intake, birth rate, and fertility; listwise N=74) that repeats every correlation twice and clutters it with significance values. You should improve the matrix for presentation:

Table 1. Bivariate correlations (N=74).
                        (1)      (2)      (3)      (4)      (5)      (6)
Infant mortality (1)   1.000
Literacy (2)           -.920*   1.000
GDP (3)                -.692*    .627*   1.000
Calories (4)           -.774*    .682*    .760*   1.000
Birth rate (5)          .865*   -.871*   -.741*   -.757*   1.000
Fertility (6)           .843*   -.863*   -.652*   -.691*    .974*   1.000
* p < .01 (2-tailed)
Source: World95.sav.

Scatterplots and bivariate correlation in SPSS
Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
Analyze → Correlate → Bivariate

Bivariate regression

Bivariate correlation is not terribly helpful for identifying how meaningful a relationship is (i.e., what does -.942 really mean?). Bivariate regression is another way to describe the linear relationship between two interval-ratio variables. With regression, we calculate the best-fitting line (a.k.a. the slope) and an intercept to summarize the relationship; the slope conveys more meaning because of its interpretation – each one-unit increase in X leads to a b-unit increase/decrease in Y.

  ŷᵢ = a + bxᵢ ; the predicted score for the ith case is equal to the y-intercept (a) plus the product of the slope (b) and the ith case's score on x.

The following formula gives us the best-fitting line (when certain assumptions hold):

  b = S_yx / S_x² ; the slope is equal to the covariance between y and x divided by the variance of x

  Covariance: S_yx = Σ(x - x̄)(y - ȳ) / n ;  Variance of x: S_x² = Σ(x - x̄)² / n

  a = ȳ - (b * x̄) ; a is the y-intercept

Note that the slope is an asymmetrical measure of association because the denominator includes only the variance of x (i.e., the independent variable), whereas Pearson's correlation is symmetrical because it includes the standard deviations of both x and y.

What numbers do you need to draw a line?
You need a point and a slope.

- The point that we are interested in is called the y-intercept
  o The y-intercept is the value of y (the dependent variable) when the independent variable equals 0
  o This is the point at which the line crosses the y axis
  o SPSS refers to the y-intercept as the "(Constant)"
  o Interpretation: the predicted number of deaths per 1,000 live births is 160.732 for a country with a literacy rate of 0 (i.e., a country in which nobody can read or write – this country doesn't exist; the minimum literacy rate is 18% for Burkina Faso; we'll deal with this in a bit)

Coefficients (a)
                       Unstandardized          Standardized
                       B         Std. Error    Beta        t         Sig.    95% CI for B (Lower, Upper)
(Constant)           160.732       5.794                  27.740     .000    (149.243, 172.221)
LITERACY People
who read (%)          -1.507        .071       -.900     -21.219     .000    (-1.648, -1.366)
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

- The second thing you need to draw a line is the slope
  o The slope (or "b") is the change in the dependent variable (y) for each one-unit change in the independent variable (x)
  o SPSS includes the slope in the 'B' column of the table above
  o Interpretation: every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births
  o What would 'no relationship' look like?
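(No relationship would be a flat, horizontal line – a slope of 0.) The building blocks so far – covariance, Pearson's r, the slope, and the intercept – can be checked with a few lines of Python. This is a sketch using a tiny made-up dataset (mine, not the country data), with the divide-by-n formulas from these notes:

```python
import math

# Covariance, Pearson's r, and the OLS slope/intercept, using the
# divide-by-n formulas from the notes.  Toy data: y falls by exactly
# one unit for each unit of x, so the fit should be perfect.
x = [1, 2, 3, 4, 5]
y = [10, 9, 8, 7, 6]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n  # covariance
s2_x = sum((xi - xbar) ** 2 for xi in x) / n                       # variance of x
s_x = math.sqrt(s2_x)                                              # SD of x
s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / n)             # SD of y

r = s_xy / (s_x * s_y)   # Pearson's r
b = s_xy / s2_x          # slope = covariance / variance of x
a = ybar - b * xbar      # intercept

print(s_xy)   # -2.0
print(b, a)   # -1.0 11.0  (the line is exactly y = 11 - x)
print(r)      # ≈ -1.0: a perfect negative linear relationship
```

Note that the n in the numerator and denominators of r cancels, which is why it does not matter here whether you divide by n or n - 1, as long as you are consistent.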
If we have these two things (the y-intercept and the slope), we can draw a line:

  ŷᵢ = a + bxᵢ

- a is the y-intercept (the value of the dependent variable when X = 0)
- b is the slope
- xᵢ is the value of the independent variable for case i
- ŷᵢ is the predicted value of the dependent variable for case i

We could write the equation for our example as:

  ŷᵢ = 160.732 - (1.507 * xᵢ)

You can use this equation to predict scores of the dependent variable for individual cases:

Country        Infant mortality   Literacy    Predicted infant mortality        Prediction error
                    (yᵢ)            (xᵢ)      ŷᵢ = 160.732 - (1.507 * xᵢ)       eᵢ = yᵢ - ŷᵢ
Afghanistan         168              29       160.732 - 1.507 * 29  = 117.029        51.0
Argentina            25.6            95       160.732 - 1.507 * 95  =  17.567         8.0
Armenia              27              98       160.732 - 1.507 * 98  =  13.046        14.0
Australia             7.3           100       160.732 - 1.507 * 100 =  10.032        -2.7
Austria               6.7            99       160.732 - 1.507 * 99  =  11.539        -4.8
Azerbaijan           35              98       160.732 - 1.507 * 98  =  13.046        22.0
Bahrain              25              77       160.732 - 1.507 * 77  =  44.693       -19.7
Bangladesh          106              35       160.732 - 1.507 * 35  = 107.987        -2.0
Barbados             20.3            99       160.732 - 1.507 * 99  =  11.539         8.8
Belarus              19              99       160.732 - 1.507 * 99  =  11.539         7.5
…

Notice that the predicted value does not equal each case's observed value. These differences are called prediction errors. We will always have prediction errors – this is expected. Remember that the regression line is meant to summarize the relationship between two variables. Anytime you summarize, you simplify, and simplification leads to the loss of information. So our picture is not perfect, but it gives us a general sense of the relationship.

Other ways to write the regression equation:

  yᵢ = a + bxᵢ + eᵢ ; where yᵢ is the actual score and eᵢ is the prediction error; or
  yᵢ = ŷᵢ + eᵢ ; the actual score is equal to the predicted score plus the prediction error.

The method that we are using to estimate the slope is referred to as ordinary least squares (OLS).
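A few rows of the prediction table above can be reproduced directly from the fitted equation (the intercept, slope, and country values are taken from the notes; everything else is a sketch):

```python
# yhat = 160.732 - 1.507 * x, with intercept and slope from the notes.
a, b = 160.732, -1.507

countries = {            # country: (observed infant mortality, literacy %)
    "Afghanistan": (168, 29),
    "Argentina": (25.6, 95),
    "Australia": (7.3, 100),
}

for name, (y, x) in countries.items():
    yhat = a + b * x     # predicted infant mortality
    e = y - yhat         # prediction error (residual)
    print(f"{name}: predicted {yhat:.3f}, error {e:.1f}")
# First line: Afghanistan: predicted 117.029, error 51.0
```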
When certain assumptions hold (we'll discuss these in a few weeks), the OLS slope is the 'best' (i.e., it is better than the slopes calculated by other means). It is best because it minimizes the residual sum of squares (RSS):

  RSS = Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

In other words, the best-fitting line is the one that minimizes the squared prediction errors – the squared differences between each case's actual score and the score predicted by the regression model.

How meaningful is the slope?

- Every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births
- The minimum value of literacy is 18 and the maximum is 100 (see the descriptive statistics shown earlier)
  o The predicted infant mortality rate for a literacy rate of 18% = 133.606
  o The predicted infant mortality rate for a literacy rate of 100% = 10.032
  o This seems to be a pretty meaningful difference
  o You could (and should) compute predicted values for other, less extreme scores (perhaps the lower and upper quartiles, or ±1 standard deviation)

Hypothesis testing

Meaningful, however, is not the same thing as statistically significant. Is our slope coefficient different enough from 0 that we can assume the slope is not equal to 0 in the population? We need a statistical hypothesis test to answer this question.

Some requirements (assumptions) for regression:
1. Both variables are measured at the interval-ratio level
2. There is a straight-line relationship (there are ways to get around this that we'll discuss later). Regression is sensitive to outliers (we'll also deal with this later).
3. Random sample – do we have a random sample in this example? We'll pretend just for now…
4. To test significance, you must assume the variables have normal distributions in the population, or you must have a large sample.

Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: β = 0
H1: β ≠ 0

Test statistic and distribution:

  t = b / s_b ;  df = n - 2 ;  s_b = s_e / (s_x √n)

Like all standard errors, the standard error of the regression slope (s_b) describes variability across all possible samples of the same size from the population – here, variability in the estimate of the slope. By using this formula, we are converting the slope coefficient to a t score so that we can see how unusual our sample slope would be if the null hypothesis were true.

Descriptive Statistics
                                                  Mean       Std. Deviation     N
BABYMORT Infant mortality (deaths
per 1000 live births)                            42.674         38.2972        107
LITERACY People who read (%)                     78.34          22.883         107

[Coefficients table as shown earlier: b = -1.507, s_b = .071, t = -21.219, 95% CI (-1.648, -1.366).]

Critical value ≈ ±1.980 (α = .05; 2-tailed; df = 105)
Observed t = -21.219

We reject the null hypothesis and conclude that there probably is a relationship between infant mortality and literacy in the population. We would not obtain a slope as far from 0 as -1.507 very often if the population slope were equal to 0. The approximate probability of our data if the null hypothesis is true is:

  p = 9.176 × 10⁻⁴⁰

So p < α, which also allows us to reject the null hypothesis.
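The t statistic for the slope, and the 95% confidence interval SPSS reports, can be reproduced from the output above (a sketch; the critical t of 1.98 for df = 105 is the value used in the notes):

```python
# Convert the slope to a t score (t = b / s_b) and build the 95%
# confidence interval b ± t_crit * s_b, using the SPSS output above.
b, s_b = -1.507, 0.071

t = b / s_b
print(round(t, 2))  # -21.23 (SPSS reports -21.219 from unrounded inputs)

t_crit = 1.98  # critical t: alpha = .05, two-tailed, df = 105
lower, upper = b - t_crit * s_b, b + t_crit * s_b
print(round(lower, 3), round(upper, 3))  # -1.648 -1.366, matching SPSS
```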
Note that the sample slope is a point estimate of the population slope. We can also calculate a confidence interval: CI = b ± (t × s_b), where t is the critical value. Ninety-five out of 100 such confidence intervals should contain the true population slope. Thus, we can be pretty confident that the population slope is between -1.648 and -1.366. Notice that this interval does not contain 0 – this is a third way to test the null hypothesis.

A potential problem…and a solution

The y-intercept is not terribly meaningful because no countries have a score of 0% on literacy. There is a trick that we can use to get around this – it is known as grand mean centering. You simply subtract the mean value of the independent variable from every score and use this centered variable in the regression.

SPSS Syntax:

* Grand mean centering.
freq vars=literacy /stats=mean.
compute lit_c=literacy-78.33644859813.
freq vars=literacy lit_c /stats=all /histogram.

[Frequency tables omitted: LITERACY runs from 18 to 100; LIT_C runs from -60.34 to 21.66 (N=107, 2 missing in both).]

Statistics
                          LITERACY People who read (%)       LIT_C
N        Valid                     107                        107
         Missing                     2                          2
Mean                              78.34                       .0000
Std. Error of Mean                 2.212                     2.21220
Median                            88.00                      9.6636
Mode                              99                        20.66
Std. Deviation                    22.883                    22.88319
Variance                         523.640                   523.64045
Skewness                           -.994                      -.994
Std. Error of Skewness              .234                       .234
Kurtosis                           -.160                      -.160
Std. Error of Kurtosis              .463                       .463
Range                             82                         82.00
Minimum                           18                        -60.34
Maximum                          100                         21.66
Sum                             8382                           .00

Centering shifts the distribution but changes nothing else: the standard deviation, variance, skewness, and kurtosis are identical; only the mean (now 0), minimum, and maximum change.

[Before/after scatterplots omitted: the point cloud is identical; only the x axis shifts from 0-100 to roughly -60 to 22. Likewise, the before/after correlation tables both show r = -.900 between infant mortality and (centered or uncentered) literacy.]

Before:
Coefficients (a)
                  B        Std. Error    Beta       t        Sig.    95% CI for B
(Constant)     160.732      5.794                 27.740     .000    (149.243, 172.221)
LITERACY        -1.507       .071       -.900    -21.219     .000    (-1.648, -1.366)
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

After:
Coefficients (a)
                  B        Std. Error    Beta       t        Sig.    95% CI for B
(Constant)      42.674      1.618                 26.380     .000    (39.466, 45.881)
LIT_C           -1.507       .071       -.900    -21.219     .000    (-1.648, -1.366)
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Little has changed… Our intercept, however, is now 42.674:
- 42.674 is the predicted number of infant deaths for a country with '0' literacy
- We have, however, changed what 0 means by centering the variable
  o On the original literacy variable, 0 meant 0% literate
  o On the centered variable (lit_c), 0 is equal to the mean literacy rate (78.34%)
- It is accurate to say that the predicted number of infant deaths for a country with mean literacy is 42.674 per 1,000 live births

You may have noticed that the new intercept is the average infant mortality rate. This will always be true in a bivariate regression with a mean-centered IV. The real benefit will come when we do multivariate regression with more than one independent variable. Centering will also come in very handy later on when we discuss interaction effects.

Measures of association

Notice that I haven't said anything about determining the strength of the relationship from the y-intercept and slope. These tell us nothing about strength – they only tell us whether there is a relationship, and they help us describe and understand it. There are three measures of association for interval-ratio variables that tell us about strength (in addition to other things): Pearson's product-moment correlation coefficient (r), the coefficient of determination (r²), and the standardized slope coefficient (Beta).

The coefficient of determination

- Another measure of association for interval-ratio variables (a PRE – proportional reduction in error – measure)
- It is equal to the correlation squared
- Because it is squared, it cannot tell us the direction of the relationship
- However, it has a more meaningful interpretation than r
  o It tells us how much our errors in predicting the dependent variable are reduced by taking into account the independent variable
  o It reflects the proportion of the total variation in the dependent variable explained by the independent variable

The logic of r²: Imagine that I am trying to predict the infant mortality rate for South Africa.
What is my best guess without knowing anything about South Africa? My best guess would be the mean infant mortality rate across all countries: 42.674 deaths per 1,000 live births. Now, imagine that I am allowed to use one variable to improve my prediction. I decide to use the literacy rate (uncentered, to simplify the example). 76% of the population of South Africa is literate. I can plug 76 into the regression equation to generate a new estimate:

  ŷ = 160.732 - (1.507 * 76) = 46.2

South Africa has an actual infant mortality rate of 47.1. My prediction errors:
- Using only the grand mean: 4.426 = 47.1 - 42.674
- Using the literacy rate: 0.9 = 47.1 - 46.2

I have reduced my prediction error substantially – by a proportion of

  (4.426 - 0.9) / 4.426 = 3.526 / 4.426 = 0.797, or 79.7%

The coefficient of determination (r²) is a measure of association that summarizes just how wrong we are in our predictions across all cases. It tells us how much better our prediction is when we use the independent variable to predict the dependent variable rather than guessing the mean. The good news is that once you have calculated the correlation coefficient, it is really easy to get the coefficient of determination: all you have to do is square the correlation. It is also included in SPSS output:

Model Summary
Model      R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .900(a)     .811            .809                   16.7334
a. Predictors: (Constant), LIT_C

In this example, we reduce our prediction errors by a proportion of .811, or by 81.1%, when we use literacy to predict infant mortality. We can also say that literacy explains 81.1% of the variation in infant mortality. This is clearly a strong relationship.

Other measures of strength

We've already discussed the correlation coefficient. In a bivariate regression, the standardized slope coefficient (Beta) is equal to the bivariate correlation. This will not be the case in multivariate regression. There is some controversy surrounding standardized slope coefficients.
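As a check on that equality, the standardized slope can be recovered from the unstandardized slope and the two standard deviations in the descriptive statistics above (Beta = b × s_x / s_y). This is a sketch using the N = 107 values from the notes:

```python
# Beta = b * (s_x / s_y): the slope re-expressed in standard-deviation
# units.  Values taken from the SPSS output above (N = 107).
b = -1.507      # unstandardized slope
s_x = 22.883    # SD of literacy
s_y = 38.2972   # SD of infant mortality

beta = b * s_x / s_y
print(round(beta, 4))  # ≈ -0.9004, i.e. Pearson's r of -.900
```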
I am not a big fan of them, but we will talk about them more when we discuss multivariate regression.

Bivariate correlation and regression in SPSS – Example 2

Our research questions: Are there relationships between anti-Black stereotyping (the DV) and age and education among Whites in the US?

Our hypotheses (alpha = .01):
H0: β_age ≤ 0        H0: β_education ≥ 0
H1: β_age > 0        H1: β_education < 0

The variables: Stereotyping is an index that I created from four items:
1. "Now I have some questions about different groups in our society. I'm going to show you a seven-point scale on which the characteristics of people in a group can be rated. In the first statement a score of 1 means that you think almost all of the people in that group are 'rich.' A score of 7 means that you think almost everyone in the group is 'poor.' A score of 4 means you think that the group is not towards one end or another, and of course you may choose any number in between that comes closest to where you think people in the group stand. Jews?"
2. Hard-working to lazy
3. Not prone to violence to prone to violence
4. Intelligent to unintelligent

I have separate indexes for four target groups: Jews, Blacks, Hispanics, and Asians. We'll focus on anti-Black stereotyping. Each index can range from 4 to 28 (since the original items range from 1 to 7). Scores above 16 indicate negative stereotypes; scores of 16 indicate neutrality (since 4 is the neutral point and there are 4 questions); scores below 16 indicate positive stereotypes.

I have used a filter to select only the White respondents for the analysis:
Data → Select Cases

SPSS Syntax:

freq vars=race.
USE ALL.
COMPUTE filter_$=(race=1).
VARIABLE LABEL filter_$ 'race=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
freq vars=race.

Univariate statistics for anti-Black stereotyping:

Statistics: STEREOB
N        Valid              1025
         Missing            1213
Mean                       17.6429
Std. Error of Mean           .09485
Median                     17.0000
Mode                       16.00
Std. Deviation              3.03677
Variance                    9.22198
Skewness                     .274
Std. Error of Skewness       .076
Kurtosis                    1.001
Std. Error of Kurtosis       .153
Range                      24.00
Minimum                     4.00
Maximum                    28.00
Sum                     18084.00

[Histogram of STEREOB omitted.]

You can see that the mean level of anti-Black stereotyping is 17.6429. Considering that there are 4 questions, this amounts to an average of about 4.4 per question, which is slightly into the negative side (i.e., lazy, prone to violence, etc.).

Scatterplots:
[Scatterplots of anti-Black stereotyping against age (18-90) and years of education (0-20) omitted.]
- Age: a weak positive relationship
- Education: a weak negative relationship

Bivariate correlations and regressions – annotated output for age:

Descriptive Statistics
             Mean       Std. Deviation      N
STEREOB     17.6416        3.03796        1024
AGE_C       -1.2563       17.53322        1024

Correlations
                               STEREOB     AGE_C
Pearson Correlation  STEREOB    1.000       .208
                     AGE_C       .208      1.000
Sig. (1-tailed)      STEREOB      .         .000
                     AGE_C       .000        .
N                               1024       1024

H0: β_age ≤ 0 ; H1: β_age > 0 ; alpha = .01

Based on the p values listed in the correlations table above, we can reject the null hypothesis and conclude that there probably is a positive relationship in the population (because p < α). The correlation between anti-Black stereotyping and age is weak.

Variables Entered/Removed: AGE_C entered (Method: Enter).

Model Summary (b)
Model      R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .208(a)     .043            .043                   2.97269
a. Predictors: (Constant), AGE_C
b. Dependent Variable: STEREOB

R square suggests that age explains about 4.3% of the variation in anti-Black stereotyping.
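There are two routes to this r² of .043: square the correlation, or divide the regression sum of squares by the total sum of squares in the ANOVA table that SPSS also reports (a quick sketch using values from this output):

```python
# Two routes to r-squared for the age model, using values from the
# SPSS output in these notes.
r = 0.208
print(round(r ** 2, 3))  # 0.043

ss_regression, ss_total = 410.187, 9441.468  # from the ANOVA table
print(round(ss_regression / ss_total, 3))    # 0.043
```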
ANOVA (b)
Model 1          Sum of Squares      df      Mean Square       F        Sig.
Regression           410.187           1       410.187       46.418    .000(a)
Residual            9031.280        1022         8.837
Total               9441.468        1023
a. Predictors: (Constant), AGE_C
b. Dependent Variable: STEREOB

We will discuss the F test in the ANOVA table later in the semester.

Coefficients (a)
                B       Std. Error    Beta        t        Sig.    95% CI for B
(Constant)    17.687      .093                 189.907     .000    (17.504, 17.870)
AGE_C           .036      .005        .208       6.813     .000    (.026, .047)
a. Dependent Variable: STEREOB

Slope: Anti-Black stereotyping increases by .036 points with each additional year of age. The critical t value is 2.326 (df = 1,022, 1-tailed, alpha = .01). My observed t value of 6.813 exceeds the critical t, so I reject the null hypothesis and conclude that there is probably a positive relationship between anti-Black stereotyping and age in the population. The p value is 1.63 × 10⁻¹¹ (it is listed as .000 above), which is also less than alpha (.01). I am 95% certain that the slope in the population is between .026 and .047. The fact that this confidence interval does not contain 0 also suggests that we can reject the null hypothesis (although notice that it is a 95%, not a 99%, confidence interval).

Intercept: The predicted level of anti-Black stereotyping for a person of average age (about 47) is 17.687 (this corresponds to an average score on the four items of about 4.4, which is just slightly into the negative stereotype range).

Other predicted values for context: The predicted levels of anti-Black stereotyping for 29- and 65-year-olds are 17.0 and 18.3, respectively (I selected these values of age because they are one standard deviation below and one standard deviation above the mean; see the calculations below).
Note: a = 17.687 ; b = 0.036 ; x̄ = 47.14 ; s_x = 17.725

      x        x centered     Predicted y
    47.14          0             17.687
    29.415      -17.725          17.0489
    64.865       17.725          18.3251

Although the slope is statistically significant, it does not appear to be substantively meaningful. You can see this when you consider that there is not much of a difference between 29- and 65-year-olds in their levels of anti-Black stereotyping. Because the slope is 'small' at .036, a 36-year increase in age (from 29 to 65) only increases anti-Black stereotyping by 1.296 points (36 * .036), and the anti-Black stereotyping index can range from 4 to 28.

Annotated output for education:

Descriptive Statistics
                                             Mean       Std. Deviation      N
STEREOB                                     17.6429        3.03677        1025
EDUC HIGHEST YEAR OF SCHOOL COMPLETED       13.52          2.730          1025

Correlations
                               STEREOB     EDUC
Pearson Correlation  STEREOB    1.000      -.137
                     EDUC       -.137      1.000
Sig. (1-tailed)      STEREOB      .         .000
                     EDUC        .000        .
N                               1025       1025

H0: β_education ≥ 0 ; H1: β_education < 0 ; alpha = .01

Based on the p values listed in the correlations table above, we can reject the null hypothesis and conclude that there probably is a negative relationship in the population. The correlation between anti-Black stereotyping and education is weak.

Model Summary
Model      R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .137(a)     .019            .018                   3.00944
a. Predictors: (Constant), EDUC HIGHEST YEAR OF SCHOOL COMPLETED

R square suggests that education explains about 1.9% of the variation in anti-Black stereotyping.

Coefficients (a)
                                B       Std. Error    Beta        t        Sig.    95% CI for B
(Constant)                    19.709      .475                  41.487     .000    (18.777, 20.641)
EDUC HIGHEST YEAR OF
SCHOOL COMPLETED               -.153      .034       -.137      -4.437     .000    (-.220, -.085)
a. Dependent Variable: STEREOB

Slope: Each additional year of education decreases anti-Black stereotyping by .153 units. The critical t value is -2.326 (df = 1,023, 1-tailed, alpha = .01). My observed t value of -4.437 exceeds the critical t, so I reject the null hypothesis and conclude that there is probably a negative relationship between anti-Black stereotyping and education in the population. The p value is 1.01 × 10⁻⁵ (it is listed as .000 above), which is also less than alpha (.01). I am 95% certain that the slope in the population is between -.220 and -.085. The fact that this confidence interval does not contain 0 also suggests that we can reject the null hypothesis (although notice that it is a 95%, not a 99%, confidence interval).

Intercept: The predicted level of anti-Black stereotyping for a person with zero years of education (which is a possible value, so I did not center education) is 19.709.

Other predicted values for context: The predicted levels of anti-Black stereotyping for those with roughly 11 and 16 years of education are 18.1 and 17.2, respectively (I selected these values because they are roughly one standard deviation below and one standard deviation above the mean; see the calculations below).

Note: a = 19.709 ; b = -0.153 ; x̄ = 13.52 ; s_x = 2.73

      x        Predicted y
     0            19.709
    10.79         18.05813
    16.25         17.22275

Although the slope is statistically significant, it does not appear to be substantively meaningful. You can see this when you consider that there is not much of a difference between those with 11 and 16 years of education in their levels of anti-Black stereotyping. Because the slope is 'small' at -.153, a 5-year increase in education (from 11 to 16 years) only decreases anti-Black stereotyping by about .765 points (5 * .153), and the anti-Black stereotyping index can range from 4 to 28.
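As a final check, the predicted values at ±1 standard deviation of education can be reproduced from the coefficients and descriptives above (a sketch):

```python
# Predicted anti-Black stereotyping one SD below and above mean
# education, using a, b, the mean, and the SD from the output above.
a, b = 19.709, -0.153
mean_educ, sd_educ = 13.52, 2.73

for x in (mean_educ - sd_educ, mean_educ + sd_educ):
    print(round(x, 2), round(a + b * x, 1))
# 10.79 18.1
# 16.25 17.2
```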