Topic 6. Bivariate correlation and regression
Introduction
Relationships
1. Is there a relationship between two variables in the population?
Nominal with nominal – chi-square
Nominal with ordinal – chi-square
Ordinal with ordinal – chi-square
Interval/ratio DV with a 2 category IV – independent samples t test
Interval/ratio DV with a >2 category IV (nominal or ordinal) – ANOVA (F test)
Interval/ratio with interval/ratio – t test (in bivariate correlation and regression)
2. How strong is the relationship? (Measures of association)
Nominal with nominal – Lambda, etc.
Nominal with ordinal – Lambda, etc.
Ordinal with ordinal – Gamma, etc.
Interval/ratio DV with a >2 category IV (nominal or ordinal) – Eta-squared
Interval/ratio with interval/ratio – Pearson’s r (a.k.a. bivariate correlation)
3. How can we characterize the relationship, and what is its direction?
Nominal with nominal – cross-tabulation or clustered bar chart
Nominal with ordinal – cross-tabulation or clustered bar chart
Ordinal with ordinal – cross-tabulation or clustered bar chart; Gamma, etc.
Interval/ratio DV with a 2 category IV – comparison of means table
Interval/ratio DV with a >2 category IV (nominal or ordinal) – comparison of means table or means plot
Interval/ratio with interval/ratio – scatterplot; bivariate correlation and regression
Scatterplots
Figure 1. Infant Mortality and Literacy (N=107).
[Scatterplot: infant mortality (deaths per 1000 live births, 0–200) on the y-axis by People who read (%) (0–100) on the x-axis.]
Source: World95.sav.
Strength is determined by the spread of the cases. Direction is determined by the pattern of joint scores. You
can even use scatterplots to identify nonlinear relationships. Scatterplots are very useful descriptive tools, but
they don’t help us with making predictions about the population.
Covariation
• The most elementary measure for identifying a bivariate relationship between two interval-ratio variables
• The covariance is a building-block for other statistics including r (Pearson's correlation) and b (the regression slope)
• The formula is:
$$S_{xy} = \mathrm{Cov}(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n}$$
The numerator is referred to as the sum of the cross-products.
How does it work? Here is an example:
1. Negative covariance – scores below the mean on one variable are above the mean on the other

x | y | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ)
1 | 10 | -2 | 2 | -4
2 | 9 | -1 | 1 | -1
3 | 8 | 0 | 0 | 0
4 | 7 | 1 | -1 | -1
5 | 6 | 2 | -2 | -4
Sum | | 0 | 0 | -10
Mean | 3 | 8 | |
Covariance: -2.0
2. Positive covariance – scores below the mean on one variable are below the mean on the other; scores above the mean on one variable are above the mean on the other

x | y | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ)
1 | 6 | -2 | -2 | 4
2 | 7 | -1 | -1 | 1
3 | 8 | 0 | 0 | 0
4 | 9 | 1 | 1 | 1
5 | 10 | 2 | 2 | 4
Sum | | 0 | 0 | 10
Mean | 3 | 8 | |
Covariance: 2.0
3. No covariance

x | y | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ)
1 | 6 | -2 | -2 | 4
1 | 10 | -2 | 2 | -4
2 | 7 | -1 | -1 | 1
2 | 9 | -1 | 1 | -1
3 | 8 | 0 | 0 | 0
3 | 8 | 0 | 0 | 0
4 | 9 | 1 | 1 | 1
4 | 7 | 1 | -1 | -1
5 | 10 | 2 | 2 | 4
5 | 6 | 2 | -2 | -4
Sum | | 0 | 0 | 0
Mean | 3 | 8 | |
Covariance: 0
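If you want SPSS to do this arithmetic for you, the CORRELATIONS procedure can print the sums of cross-products and covariances alongside r. SPSS syntax (a minimal sketch, assuming two numeric variables named x and y in the active dataset):

* Cross-products and covariances for x and y (illustrative sketch).
CORRELATIONS
  /VARIABLES=x y
  /STATISTICS=XPROD.

Note that SPSS divides the sum of cross-products by n - 1 rather than n, so its covariance will be slightly larger in absolute value than the hand calculations above.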
The downside to the covariance is that the number doesn’t have any meaning in and of itself (it depends on the
‘metric’ of the variables). If only there was some way to convert it into something else...
The bivariate correlation coefficient (Pearson’s r)
• Indicates the strength and direction of a straight line (linear) relationship
• Is a symmetrical measure of association (i.e., it doesn’t matter which is the DV and which is the IV)
• Is a single number ranging from -1 to 1 with 0 indicating no relationship
• A formula (different than the one in our book, but using stuff we know):
$$r = \frac{S_{xy}}{S_y S_x} = \frac{\sum (x - \bar{x})(y - \bar{y})/n}{\sqrt{\sum (y - \bar{y})^2 / n}\;\sqrt{\sum (x - \bar{x})^2 / n}}$$
Let’s work through an example (I took a random sample of 20 countries from the World95.sav data file):
No. | Country | Infant mortality (y) | Literacy (x) | y - ȳ | (y - ȳ)² | x - x̄ | (x - x̄)² | (x - x̄)(y - ȳ)
1 | Bangladesh | 106 | 35 | 51.7 | 2668.0 | -30.9 | 957.7 | -1598.5
2 | Burkina Faso | 118 | 18 | 63.7 | 4051.7 | -47.9 | 2299.0 | -3052.0
3 | Cent. Afri.R | 137 | 27 | 82.7 | 6831.5 | -38.9 | 1516.9 | -3219.1
4 | Czech Rep. a | | | | | | |
5 | Ethiopia | 110 | 24 | 55.7 | 3097.2 | -41.9 | 1759.6 | -2334.5
6 | Finland | 5.3 | 100 | -49.0 | 2405.6 | 34.1 | 1159.6 | -1670.2
7 | Georgia | 23 | 99 | -31.3 | 982.7 | 33.1 | 1092.5 | -1036.1
8 | Hong Kong | 5.8 | 77 | -48.5 | 2356.8 | 11.1 | 122.2 | -536.6
9 | India | 79 | 52 | 24.7 | 607.8 | -13.9 | 194.5 | -343.8
10 | Iran | 60 | 54 | 5.7 | 32.0 | -11.9 | 142.7 | -67.5
11 | Lebanon | 39.5 | 80 | -14.8 | 220.4 | 14.1 | 197.5 | -208.6
12 | Libya | 63 | 64 | 8.7 | 74.9 | -1.9 | 3.8 | -16.8
13 | Lithuania | 17 | 99 | -37.3 | 1394.8 | 33.1 | 1092.5 | -1234.4
14 | Malaysia | 25.6 | 78 | -28.7 | 826.4 | 12.1 | 145.3 | -346.5
15 | N. Korea | 27.7 | 99 | -26.6 | 710.1 | 33.1 | 1092.5 | -880.8
16 | Netherlands | 6.3 | 99 | -48.0 | 2308.5 | 33.1 | 1092.5 | -1588.1
17 | New Zealand | 8.9 | 99 | -45.4 | 2065.5 | 33.1 | 1092.5 | -1502.2
18 | Nicaragua | 52.5 | 57 | -1.8 | 3.4 | -8.9 | 80.1 | 16.5
19 | Somalia | 126 | 24 | 71.7 | 5134.1 | -41.9 | 1759.6 | -3005.6
20 | U.Arab Em. | 22 | 68 | -32.3 | 1046.4 | 2.1 | 4.2 | -66.4
Sum | | 1032.6 | 1253.0 | 0.0 | 36817.7 | 0.0 | 15804.9 | -22691.3
Mean | | 54.3 | 65.9 | | | | |
a. The Czech Republic has valid data for infant mortality, but not for literacy. I have used listwise deletion – that is, the results are based on the 19 countries with valid data on both variables.
 22,691.3
36,817.7
15,804.9
 1,194.3 ; S y 
 44.0 ; S x 
 28.8
19
19
19
 1,194.3
r
 0.942 ; a strong negative relationship
44.0 * 28.8
S xy 
But is there a relationship in the population? In other words, is r sufficiently different from 0 in my sample for
me to assume it is different from 0 in the population? What is our population? Let’s perform a hypothesis test:
Requirements (assumptions) for Pearson’s correlation coefficient:
1. A straight line relationship
2. Interval-ratio level data
3. Random sampling
4. Normally distributed characteristics – “Testing the significance of Pearson’s r requires both X and Y
variables to be normally distributed in the population. In small samples, failure to meet the requirement of
normally distributed characteristics may seriously impair the validity of the test. However, this requirement is
of minor importance when the sample size equals or exceeds 30 cases” (p. 329).
Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: ρ = 0
H1: ρ ≠ 0
The test statistic and sampling distribution: t
The formula for converting our correlation into a t value:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.942\sqrt{19-2}}{\sqrt{1-(-0.942)^2}} = \frac{-3.884}{0.336} = -11.560$$
df=n-2=17
Critical value=2.110 (df=17, 2 tailed, alpha=0.05)
We reject H0.
But be careful!
1. The correlation coefficient is sensitive to unusual combinations of mean deviations (outliers on both X and
Y) as well as outliers on X and Y separately (which influence the standard deviations).
2. It can only detect linear relationships.
You should always check univariate descriptive statistics as well as scatterplots before relying on the
correlation.
Results from SPSS (using the complete data):
Statistics

Statistic | BABYMORT Infant mortality (deaths per 1000 live births) | LITERACY People who read (%)
N Valid | 109 | 107
N Missing | 0 | 2
Mean | 42.313 | 78.34
Std. Error of Mean | 3.6473 | 2.212
Median | 27.700 | 88.00
Mode | 5.7 a | 99
Std. Deviation | 38.0792 | 22.883
Variance | 1450.0274 | 523.640
Skewness | 1.090 | -.994
Std. Error of Skewness | .231 | .234
Kurtosis | .365 | -.160
Std. Error of Kurtosis | .459 | .463
Range | 164.0 | 82
Minimum | 4.0 | 18
Maximum | 168.0 | 100
Sum | 4612.1 | 8382
a. Multiple modes exist. The smallest value is shown
[Histograms: Infant mortality (deaths per 1000 live births) and People who read (%), with frequency on the y-axis.]
A strong negative relationship

[Scatterplot: infant mortality (deaths per 1000 live births) by People who read (%).]

Correlations
 | BABYMORT Infant mortality (deaths per 1000 live births) | LITERACY People who read (%)
BABYMORT Infant mortality (deaths per 1000 live births): Pearson Correlation | 1 | -.900**
  Sig. (2-tailed) | . | .000
  N | 109 | 107
LITERACY People who read (%): Pearson Correlation | -.900** | 1
  Sig. (2-tailed) | .000 | .
  N | 107 | 107
**. Correlation is significant at the 0.01 level (2-tailed).

We can reject the null hypothesis (H0: ρ = 0) because p (‘Sig.’) is less than alpha (.05)
Correlation matrices
Later on, when we are working with more than two variables, you will see correlation matrices. These display
all possible bivariate correlations. Here is an example of unedited output from SPSS:
Correlations a

Pearson Correlation | BABYMORT Infant mortality (deaths per 1000 live births) | LITERACY People who read (%) | GDP_CAP Gross domestic product / capita | CALORIES Daily calorie intake | BIRTH_RT Birth rate per 1000 people | FERTILTY Fertility: average number of kids
BABYMORT Infant mortality (deaths per 1000 live births) | 1 | -.920** | -.692** | -.774** | .865** | .843**
LITERACY People who read (%) | -.920** | 1 | .627** | .682** | -.871** | -.863**
GDP_CAP Gross domestic product / capita | -.692** | .627** | 1 | .760** | -.741** | -.652**
CALORIES Daily calorie intake | -.774** | .682** | .760** | 1 | -.757** | -.691**
BIRTH_RT Birth rate per 1000 people | .865** | -.871** | -.741** | -.757** | 1 | .974**
FERTILTY Fertility: average number of kids | .843** | -.863** | -.652** | -.691** | .974** | 1
(Sig. (2-tailed) = .000 for every off-diagonal correlation.)
**. Correlation is significant at the 0.01 level (2-tailed).
a. Listwise N=74
You should improve the matrix for presentation:
Table 1. Bivariate correlations (N=74).
 | (1) | (2) | (3) | (4) | (5) | (6)
Infant mortality (1) | 1.000 | | | | |
Literacy (2) | -.920* | 1.000 | | | |
GDP (3) | -.692* | .627* | 1.000 | | |
Calories (4) | -.774* | .682* | .760* | 1.000 | |
Birth rate (5) | .865* | -.871* | -.741* | -.757* | 1.000 |
Fertility (6) | .843* | -.863* | -.652* | -.691* | .974* | 1.000
* p <.01 (2-tailed)
Source: World95.sav.
Scatterplots and bivariate correlation in SPSS
Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
Analyze → Correlate → Bivariate
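If you prefer syntax to the menus, the same scatterplot and correlation can be requested directly. SPSS syntax (an illustrative sketch using the World95.sav variable names; the subcommand choices shown are common defaults, not the only options):

* Scatterplot and bivariate correlation via syntax (illustrative sketch).
GRAPH
  /SCATTERPLOT(BIVAR)=literacy WITH babymort.
CORRELATIONS
  /VARIABLES=babymort literacy
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.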
Bivariate regression
• Bivariate correlation is not terribly helpful for identifying how meaningful a relationship is (i.e., what does -.942 really mean?).
• Bivariate regression is another way to describe the linear relationship between two interval-ratio variables
• With regression, we calculate the best-fitting line (a.k.a. the slope) and an intercept to summarize the relationship; the slope conveys more meaning because of the interpretation – each one unit increase in X leads to a B unit increase/decrease in Y
• $\hat{y}_i = a + bx_i$; the predicted score for the ith case is equal to the y-intercept (a) plus the product of the slope (b) and the ith case's score on x.
• The following formula gives us the best-fitting line (when certain assumptions hold):
$$b = \frac{S_{yx}}{S_x^2}$$
the slope is equal to the covariance between y and x divided by the variance of x
Covariance: $S_{yx} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n}$; Variance of x: $S_x^2 = \dfrac{\sum (x - \bar{x})^2}{n}$
$a = \bar{y} - (b \times \bar{x})$; a is the y-intercept
Note that the slope is an asymmetrical measure of association because the denominator includes only the
variance for x (i.e., the independent variable); whereas Pearson’s correlation is symmetrical because it includes
the standard deviations for both x and y.
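As a quick check on these formulas, plugging in the sums from the 20-country worked example above (n = 19 after listwise deletion) gives:
$$b = \frac{S_{yx}}{S_x^2} = \frac{-1{,}194.3}{15{,}804.9/19} = \frac{-1{,}194.3}{831.8} \approx -1.44$$
$$a = \bar{y} - b\bar{x} = 54.3 - (-1.44)(65.9) \approx 149.2$$
These are close to, but not identical to, the SPSS estimates used in the rest of this section (a = 160.732, b = -1.507), which are based on all of the countries with valid data rather than the small subsample.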
What numbers do you need to draw a line? You need a point and a slope
 The point that we are interested in is called the y-intercept
o The y-intercept is the value of y (the dependent variable) when the independent variable equals 0
o This is the point at which the line crosses the y axis
o SPSS refers to the y-intercept as the “(Constant)” in the Coefficients table below
o Interpretation: the predicted number of deaths per 1,000 live births is 160.732 for a country with
a literacy rate of 0 (i.e., a country in which nobody can read or write – this country doesn’t exist;
the minimum literacy rate is 18% for Burkina Faso; we’ll deal with this in a bit)
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 160.732 | 5.794 | | 27.740 | .000 | 149.243 | 172.221
LITERACY People who read (%) | -1.507 | .071 | -.900 | -21.219 | .000 | -1.648 | -1.366
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)
• The second thing you need to draw a line is the slope
o The slope (or “b”) is the change in the dependent variable (y) for each one unit change in the independent variable (x).
o SPSS includes the slope in the ‘B’ column of the Coefficients table above
o Interpretation: Every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births.
o What would ‘no relationship’ look like?
• If we have these two things (the y-intercept and the slope) we can draw a line: $\hat{y}_i = a + bx_i$
o $a$ is the y-intercept (the value of the dependent variable when X = 0)
o $b$ is the slope
o $x_i$ is the value of the independent variable for case i
o $\hat{y}_i$ is the predicted value of the dependent variable for case i
• We could write the equation for our example as: $\hat{y}_i = 160.732 + (-1.507 \times x_i)$
You can use this equation to predict scores of the dependent variable for individual cases:
Country | Infant mortality (yi) | Literacy (xi) | Predicted infant mortality (ŷi = 160.732 - 1.507 × xi) | Prediction error (ei = yi - ŷi)
Afghanistan | 168 | 29 | 160.732 - 1.507 × 29 = 117.029 | 51.0
Argentina | 25.6 | 95 | 160.732 - 1.507 × 95 = 17.567 | 8.0
Armenia | 27 | 98 | 160.732 - 1.507 × 98 = 13.046 | 14.0
Australia | 7.3 | 100 | 160.732 - 1.507 × 100 = 10.032 | -2.7
Austria | 6.7 | 99 | 160.732 - 1.507 × 99 = 11.539 | -4.8
Azerbaijan | 35 | 98 | 160.732 - 1.507 × 98 = 13.046 | 22.0
Bahrain | 25 | 77 | 160.732 - 1.507 × 77 = 44.693 | -19.7
Bangladesh | 106 | 35 | 160.732 - 1.507 × 35 = 107.987 | -2.0
Barbados | 20.3 | 99 | 160.732 - 1.507 × 99 = 11.539 | 8.8
Belarus | 19 | 99 | 160.732 - 1.507 × 99 = 11.539 | 7.5
…
Notice that the predicted value does not equal each case’s observed value. These differences are called
prediction errors. We will always have prediction errors – this is expected. Remember that the regression line
is meant to summarize the relationship between two variables. Anytime that you summarize, you simplify and
simplification leads to the loss of information. So our picture is not perfect, but it gives us a general sense of
the relationship.
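If you want SPSS to compute the predicted values and prediction errors (residuals) for every case, the REGRESSION procedure can save them as new variables. SPSS syntax (an illustrative sketch; by default the /SAVE subcommand adds variables named PRE_1 and RES_1 to the data file):

* Save predicted values and residuals from the bivariate regression (illustrative sketch).
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT babymort
  /METHOD=ENTER literacy
  /SAVE PRED RESID.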
Other ways to write the regression equation:
• $y_i = a + bx_i + e_i$; where $y_i$ is the actual score and $e_i$ is the prediction error; or
• $y_i = \hat{y}_i + e_i$; where the actual score is equal to the predicted score plus the prediction error.
The method that we are using to estimate the slope is referred to as ordinary least squares (OLS). When certain
assumptions hold (we’ll discuss these in a few weeks), the OLS slope is the ‘best’ (i.e., it is better than the
slopes calculated by other means). It is best because it minimizes the residual sum of squares (RSS):
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
In other words, the best-fitting line is the one that minimizes the squared prediction errors or the squared
difference between each case’s actual score and their score predicted by the regression model.
How meaningful is the slope?
Statistics

Statistic | BABYMORT Infant mortality (deaths per 1000 live births) | LITERACY People who read (%)
N Valid | 109 | 107
N Missing | 0 | 2
Mean | 42.313 | 78.34
Std. Error of Mean | 3.6473 | 2.212
Median | 27.700 | 88.00
Mode | 5.7 a | 99
Std. Deviation | 38.0792 | 22.883
Variance | 1450.0274 | 523.640
Skewness | 1.090 | -.994
Std. Error of Skewness | .231 | .234
Kurtosis | .365 | -.160
Std. Error of Kurtosis | .459 | .463
Range | 164.0 | 82
Minimum | 4.0 | 18
Maximum | 168.0 | 100
Sum | 4612.1 | 8382
a. Multiple modes exist. The smallest value is shown
• Every 1% increase in literacy reduces the number of deaths by 1.507 per 1,000 live births
• The minimum value of literacy is 18 and the maximum is 100
o The predicted infant mortality rate for a literacy rate of 18% = 133.606
o The predicted infant mortality rate for a literacy rate of 100% = 10.032
o This seems to be a pretty meaningful difference
o You could (and should) compute predicted values for other, less extreme values (perhaps the lower and upper quartiles or +/- 1 standard deviation); the two predictions above are worked out just below
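Both predictions come straight from the regression equation:
$$\hat{y}_{(x=18)} = 160.732 - 1.507(18) = 160.732 - 27.126 = 133.606$$
$$\hat{y}_{(x=100)} = 160.732 - 1.507(100) = 160.732 - 150.700 = 10.032$$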
Hypothesis testing
Meaningful, however, is not the same thing as statistically significant. Our slope coefficient is pretty close to 0.
Is it different enough from 0 that we can assume that the slope is not equal to 0 in the population? We need a
statistical hypothesis test to answer this question.
Some requirements (assumptions) for regression:
1. Both variables are measured at the interval-ratio level
2. There is a straight line relationship (There are ways to get around this that we’ll discuss later). Regression is
sensitive to outliers (we’ll also deal with this later).
3. Random sample – do we have a random sample in this example? We’ll pretend just for now…
4. To test significance, you must assume the variables have normal distributions in the population or you must
have a large sample.
Two-tailed hypotheses and alpha (.05) [you can also test one-tailed hypotheses]:
H0: β = 0
H1: β ≠ 0
Test statistic and distribution: t
$$t = \frac{b}{s_b}; \qquad df = n - 2; \qquad s_b = \frac{s_e}{s_x \sqrt{n}}$$
Like all standard errors, the standard error of the regression slope describes variability across all possible
samples of the same size from the population. The standard error of the slope describes variability in the
estimate of the slope across all possible samples. By using this formula (for t), we are converting the slope
coefficient to a t score so that we can see how unusual our sample slope is if the null hypothesis is true.
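Plugging in the estimates from the Coefficients table below (b = -1.507, s_b = .071) reproduces, up to rounding, the t value that SPSS reports:
$$t = \frac{b}{s_b} = \frac{-1.507}{0.071} \approx -21.2$$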
Descriptive Statistics
 | Mean | Std. Deviation | N
BABYMORT Infant mortality (deaths per 1000 live births) | 42.674 | 38.2972 | 107
LITERACY People who read (%) | 78.34 | 22.883 | 107
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 160.732 | 5.794 | | 27.740 | .000 | 149.243 | 172.221
LITERACY People who read (%) | -1.507 | .071 | -.900 | -21.219 | .000 | -1.648 | -1.366
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)
Critical value ≈ ± 1.980 (α = .05; 2-tailed hypotheses; df = 105)
Observed t=-21.219
We reject the null hypothesis and conclude that there probably is a relationship between infant mortality and
literacy in the population. We would not obtain a slope as far away from 0 as -1.507 very often if the
population slope is equal to 0. The approximate probability of our data if the null hypothesis is true is:
p = 9.176113601021e-040 = .0000000000000000000000000000000000000009176113601021
So p < α, which would also allow us to reject the null hypothesis.
Note that the sample slope is a point estimate of the population slope. We can also calculate a confidence interval: CI = b ± (t × s_b), where t is the critical value. 95 out of 100 confidence intervals should contain the true population slope. Thus, we can be pretty confident that the population slope is between -1.648 and -1.366. Notice that this interval does not contain 0 – this is a third way to test the null hypothesis.
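Using the critical t of about 1.980 from above, SPSS's interval can be reproduced (up to rounding):
$$CI = -1.507 \pm (1.980)(0.071) = -1.507 \pm 0.141 = (-1.648, -1.366)$$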
A potential problem…and a solution
The y-intercept is not terribly meaningful because no countries have a score of 0% on literacy. There is a trick
that we can use to get around this – it is known as grand mean centering. You simply subtract the mean value
of the independent variable from every score and use this centered variable in the regression. SPSS Syntax:
* Grand mean centering.
freq vars=literacy /stats=mean.
compute lit_c=literacy-78.33644859813.
freq vars=literacy lit_c /stats=all /histogram.
Frequencies for LITERACY and LIT_C (abridged; frequencies in parentheses):
LITERACY People who read (%): Valid 18 (1), 24 (2), …, 99 (22), 100 (3); Total 107; Missing System 2; Total 109
LIT_C: Valid -60.34 (1), -54.34 (2), …, 20.66 (22), 21.66 (3); Total 107; Missing System 2; Total 109

Statistics
Statistic | LITERACY People who read (%) | LIT_C
N Valid | 107 | 107
N Missing | 2 | 2
Mean | 78.34 | .0000
Std. Error of Mean | 2.212 | 2.21220
Median | 88.00 | 9.6636
Mode | 99 | 20.66
Std. Deviation | 22.883 | 22.88319
Variance | 523.640 | 523.64045
Skewness | -.994 | -.994
Std. Error of Skewness | .234 | .234
Kurtosis | -.160 | -.160
Std. Error of Kurtosis | .463 | .463
Range | 82 | 82.00
Minimum | 18 | -60.34
Maximum | 100 | 21.66
Sum | 8382 | .00
Before: [Scatterplot of infant mortality (deaths per 1000 live births) by People who read (%).]
After: [Scatterplot of infant mortality (deaths per 1000 live births) by Literacy Centered.]

Correlations (before)
 | BABYMORT Infant mortality (deaths per 1000 live births) | LITERACY People who read (%)
BABYMORT Infant mortality (deaths per 1000 live births): Pearson Correlation | 1.000 | -.900
  Sig. (1-tailed) | . | .000
  N | 107 | 107
LITERACY People who read (%): Pearson Correlation | -.900 | 1.000
  Sig. (1-tailed) | .000 | .
  N | 107 | 107

Correlations (after)
 | BABYMORT Infant mortality (deaths per 1000 live births) | LIT_C
BABYMORT Infant mortality (deaths per 1000 live births): Pearson Correlation | 1.000 | -.900
  Sig. (1-tailed) | . | .000
  N | 107 | 107
LIT_C: Pearson Correlation | -.900 | 1.000
  Sig. (1-tailed) | .000 | .
  N | 107 | 107
Before:
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 160.732 | 5.794 | | 27.740 | .000 | 149.243 | 172.221
LITERACY People who read (%) | -1.507 | .071 | -.900 | -21.219 | .000 | -1.648 | -1.366
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

After:
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 42.674 | 1.618 | | 26.380 | .000 | 39.466 | 45.881
LIT_C | -1.507 | .071 | -.900 | -21.219 | .000 | -1.648 | -1.366
a. Dependent Variable: BABYMORT Infant mortality (deaths per 1000 live births)

Little has changed…
Our intercept, however, is now 42.674:
• 42.674 is the predicted number of infant deaths for a country with ‘0’ literacy
• We have, however, changed what 0 means by centering the variable
o On the original literacy variable, 0 meant 0% literate
o On the centered variable (lit_c), 0 is equal to the mean literacy rate (78.34%)
• It is accurate to say that the predicted number of infant deaths for a country with mean literacy is 42.674 per 1,000 live births
• You may have noticed that the new intercept is the average infant mortality rate. This will always be true in bivariate regression when the independent variable is mean-centered. The real benefit will come when we do multivariate regression with more than one independent variable. Centering will also come in very handy later on when we discuss interaction effects.
Measures of association
Notice that I haven’t said anything about determining the strength of the relationship from the y-intercept and
slope. These tell us nothing about the strength of the relationship – they only tell us if there is a relationship and
they help us to describe and understand the relationship.
There are three measures of association for interval-ratio variables that tell us about strength (in addition to
other things). These are Pearson’s product moment correlation coefficient (r), the coefficient of determination,
(r2), and the standardized slope coefficient (Beta).
The coefficient of determination
• Another measure of association for interval-ratio variables (PRE)
• It is equal to the correlation squared
• Because it is squared, it cannot tell us the direction of the relationship
• However, it has a more meaningful interpretation than r
o It tells us how much our errors in predicting the dependent variable are reduced by taking into account the independent variable
o It reflects the total variation in the dependent variable explained by the independent variable
The logic of r2:
Imagine that I am trying to predict the infant mortality rate for South Africa. What is my best guess without
knowing anything about South Africa? My best guess would be the mean infant mortality rate across all
countries: 42.674 deaths per 1,000 live births
Now, imagine that I am allowed to use one variable to improve my prediction. I decide to use the literacy rate
(uncentered to simplify the example). 76% of the population of South Africa is literate. I can plug 76 into the regression equation to generate a new estimate: $\hat{y}_i = 160.732 - (1.507 \times 76) = 46.2$
South Africa has an actual infant mortality rate of 47.1.
My prediction errors:
 Using only the grand mean: 4.426 = 47.1 – 42.674
 Using the literacy rate: 0.9 = 47.1 – 46.2
I have reduced my prediction error substantially – by a proportion of
$$\frac{4.426 - 0.9}{4.426} = \frac{3.526}{4.426} = 0.797 \text{ or } 79.7\%$$
The coefficient of determination (r2) is a measure of association that summarizes just how wrong we are in our
predictions across all cases. It tells us how much better our prediction is when we use the independent variable
to predict the dependent variable rather than guessing the mean.
The good news is that once you have calculated the correlation coefficient, it is really easy to get the coefficient
of determination. All you have to do is square the correlation. It is also included in SPSS output:
Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .900 a | .811 | .809 | 16.7334
a. Predictors: (Constant), LIT_C
In this example, we reduce our prediction errors by a proportion of .811 or by 81.1% when we use literacy to
predict infant mortality. We can also say that literacy explains 81.1% of the variation in infant mortality. This
is clearly a strong relationship.
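The same number follows directly from the correlation reported earlier; the small difference from .811 is only rounding, because SPSS squares the unrounded r:
$$r^2 = (-.900)^2 = .81$$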
Other measures of strength
We’ve already discussed the correlation coefficient. In a bivariate regression, the standardized slope coefficient
is equal to the bivariate correlation. This will not be the case in multivariate regression. There is some
controversy surrounding standardized slope coefficients. I am not a big fan of them, but we will talk about
them more when we discuss multivariate regression.
Bivariate correlation and regression in SPSS – Example 2
Our research questions: Are there relationships between anti-Black stereotyping (the DV), age and education
among Whites in the US?
Our hypotheses (alpha = .01):
H0: βage ≤ 0    H0: βeducation ≥ 0
H1: βage > 0    H1: βeducation < 0
The variables:
Stereotyping is an index that I created from four variables:
1. “Now I have some questions about different groups in our society. I’m going to show you a seven-point scale on which the characteristics of people in a group can be rated. In the first statement a score of 1 means that you think almost all of the people in that group are “rich.” A score of 7 means that almost everyone in the group is “poor.” A score of 4 means you think that the group is not towards one end or another, and of course you may choose any number in between that comes closest to where you think people in the group stand. Jews?”
2. Hard-working to Lazy
3. Not prone to violence to prone to violence
4. Intelligent to unintelligent
I have separate indexes for four target groups: Jews, Blacks, Hispanics, and Asians. We’ll focus on anti-Black
stereotyping.
Each index can range from 4 to 28 (since the original items range from 1 to 7). Scores above 16 indicate
negative stereotypes. Scores of 16 indicate neutrality (since 4 is neutral and there are 4 questions). Scores
below 16 indicate positive stereotypes.
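The index is simply the sum of the four items; in SPSS syntax the computation would look roughly like this (stereo1 to stereo4 are hypothetical placeholders, not the actual GSS variable names):

* Additive anti-Black stereotyping index from four 1-7 items (hypothetical variable names).
COMPUTE stereob = stereo1 + stereo2 + stereo3 + stereo4.
EXECUTE.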
I have used a filter to select only the White respondents for the analysis: Data → Select Cases
SPSS Syntax:
freq vars=race.
USE ALL.
COMPUTE filter_$=(race=1).
VARIABLE LABEL filter_$ 'race=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE .
freq vars=race.
Univariate statistics for anti-Black stereotyping:
[Histogram of STEREOB (anti-Black stereotyping index, 4.0–28.0), with frequency on the y-axis.]

Statistics: STEREOB
N Valid | 1025
N Missing | 1213
Mean | 17.6429
Std. Error of Mean | .09485
Median | 17.0000
Mode | 16.00
Std. Deviation | 3.03677
Variance | 9.22198
Skewness | .274
Std. Error of Skewness | .076
Kurtosis | 1.001
Std. Error of Kurtosis | .153
Range | 24.00
Minimum | 4.00
Maximum | 28.00
Sum | 18084.00
You can see that the mean level of anti-Black stereotyping is 17.6429. If you consider that there are 4
questions, this amounts to an average of about 4.4 per question, which is slightly into the negative side (i.e.,
lazy, prone to violence, etc.)
Scatterplots:
Age: A weak positive relationship
[Scatterplot of Anti-Black Stereotyping (4–28) by Age (18–90).]
Education: A weak negative relationship
[Scatterplot of Anti-Black Stereotyping (4–28) by Years of Education (0–20).]
Bivariate correlations and regressions:
Regression – Annotated output
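The output below could be produced with syntax along these lines (an illustrative sketch; age_c is the mean-centered age variable that appears in the output, and the subcommands shown are not the only options):

* One-tailed correlation and bivariate regression of stereotyping on centered age (illustrative sketch).
CORRELATIONS
  /VARIABLES=stereob age_c
  /PRINT=ONETAIL NOSIG.
REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT stereob
  /METHOD=ENTER age_c.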
Descriptive Statistics
 | Mean | Std. Deviation | N
STEREOB | 17.6416 | 3.03796 | 1024
AGE_C | -1.2563 | 17.53322 | 1024

Correlations
 | STEREOB | AGE_C
Pearson Correlation STEREOB | 1.000 | .208
Pearson Correlation AGE_C | .208 | 1.000
Sig. (1-tailed) STEREOB | . | .000
Sig. (1-tailed) AGE_C | .000 | .
N STEREOB | 1024 | 1024
N AGE_C | 1024 | 1024
H0: βage ≤ 0
H1: βage > 0
Alpha=.01
Based on the p values listed in the ‘correlations’ table above, we can reject the null hypothesis and conclude that there probably is a positive relationship in the population (because p < α). The correlation between anti-Black stereotyping and age is weak.
Variables Entered/Removed b
Model | Variables Entered | Variables Removed | Method
1 | AGE_C a | . | Enter
a. All requested variables entered.
b. Dependent Variable: STEREOB

Model Summary b
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .208 a | .043 | .043 | 2.97269
a. Predictors: (Constant), AGE_C
b. Dependent Variable: STEREOB
R square suggests that age explains about 4.3% of the variation in anti-Black stereotyping.
ANOVA b
Model 1 | Sum of Squares | df | Mean Square | F | Sig.
Regression | 410.187 | 1 | 410.187 | 46.418 | .000 a
Residual | 9031.280 | 1022 | 8.837 | |
Total | 9441.468 | 1023 | | |
a. Predictors: (Constant), AGE_C
b. Dependent Variable: STEREOB
We will discuss the F test in the ANOVA table later in the semester.
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 17.687 | .093 | | 189.907 | .000 | 17.504 | 17.870
AGE_C | .036 | .005 | .208 | 6.813 | .000 | .026 | .047
a. Dependent Variable: STEREOB
Slope: Anti-Black stereotyping increases by .036 points with each additional year of age. The critical t value is
2.326 (df=1,022, 1 tailed, alpha=.01). My observed t value of 6.813 exceeds the critical t so I reject the null
hypothesis and conclude that there is probably a positive relationship between anti-Black stereotyping and age
in the population. The p value is 1.629166466766e-011 (it is listed as .000 above), which is also less than alpha
(.01). I am 95% certain that the slope in the population is between .026 and .047. The fact that the confidence
interval does not contain 0 also suggests that we can reject the null hypothesis (although notice that this is a
95% confidence interval).
Intercept: The predicted level of anti-Black stereotyping for a person of average age (about 47) is 17.687 (this
corresponds to an average score on the four items of about 4.4, which is just slightly into the negative stereotype
range).
Other predicted values for context: The predicted level of anti-Black stereotyping for 29 and 65 year olds are
17.0 and 18.3, respectively (I selected these values on age because they are one standard deviation below and
one standard deviation above the mean; see the calculations below).
a = 17.687; b = 0.036
x | x centered | Predicted y
47.14 | 0 | 17.687
29.415 | -17.725 | 17.0489
64.865 | 17.725 | 18.3251
Note: x̄ = 47.14; s_x = 17.725
Although the slope is statistically significant, it does not appear to be substantively meaningful. You can see
this when you consider that there is not much of a difference between 29 and 65 year olds in their levels of anti-Black stereotyping. Because the slope is ‘small’ at .036, a 36 year increase in age (from 29 to 65) only
increases anti-Black stereotyping by 1.296 (36 * .036) and the anti-Black stereotyping index can range from 4
to 28.
Annotated output for education:
Descriptive Statistics
 | Mean | Std. Deviation | N
STEREOB | 17.6429 | 3.03677 | 1025
EDUC HIGHEST YEAR OF SCHOOL COMPLETED | 13.52 | 2.730 | 1025

Correlations
 | STEREOB | EDUC HIGHEST YEAR OF SCHOOL COMPLETED
Pearson Correlation STEREOB | 1.000 | -.137
Pearson Correlation EDUC | -.137 | 1.000
Sig. (1-tailed) STEREOB | . | .000
Sig. (1-tailed) EDUC | .000 | .
N STEREOB | 1025 | 1025
N EDUC | 1025 | 1025
H0: βeducation ≥ 0
H1: βeducation < 0
Alpha=.01
Based on the p values listed in the ‘correlations’ table above, we can reject the null hypothesis and conclude that
there probably is a negative relationship in the population. The correlation between anti-Black stereotyping and
education is weak.
Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .137 a | .019 | .018 | 3.00944
a. Predictors: (Constant), EDUC HIGHEST YEAR OF SCHOOL COMPLETED
R square suggests that education explains about 1.9% of the variation in anti-Black stereotyping.
Coefficients a
Model 1 | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B (Lower) | 95% CI for B (Upper)
(Constant) | 19.709 | .475 | | 41.487 | .000 | 18.777 | 20.641
EDUC HIGHEST YEAR OF SCHOOL COMPLETED | -.153 | .034 | -.137 | -4.437 | .000 | -.220 | -.085
a. Dependent Variable: STEREOB
Slope: Each additional year of education decreases anti-Black stereotyping by .153 units. The critical t value is
-2.326 (df=1,022, 1 tailed, alpha=.01). My observed t value of -4.437 exceeds the critical t so I reject the null
hypothesis and conclude that there is probably a negative relationship between anti-Black stereotyping and
education in the population. The p value is 1.013266718346e-005 (it is listed as .000 above), which is also less
than alpha (.01). I am 95% certain that the slope in the population is between -.220 and -.085. The fact that the
confidence interval does not contain 0 also suggests that we can reject the null hypothesis (although notice that
this is a 95% confidence interval).
Intercept: The predicted level of anti-Black stereotyping for a person with zero years of education (which is a
possible value, so I did not center education) is 19.709.
Other predicted values for context: The predicted level of anti-Black stereotyping for those with roughly 11 and
16 years of education are 18.1 and 17.2, respectively (I selected these values on education because they are one
standard deviation below and one standard deviation above the mean; see the calculations below).
a = 19.709; b = -0.153
x | Predicted y
0 | 19.709
10.79 | 18.05813
16.25 | 17.22275
Note: x̄ = 13.52; s_x = 2.73
Although the slope is statistically significant, it does not appear to be substantively meaningful. You can see
this when you consider that there is not much of a difference between those with 11 and 16 years of education
in their levels of anti-Black stereotyping. Because the slope is ‘small’ at -.153, a 5 year increase in education
(from 11 to 16 years) only decreases anti-Black stereotyping by .765 (5 * .153) and the anti-Black stereotyping
index can range from 4 to 28.