Correlation and Regression
Testing hypotheses: Continuous variables
Review: testing a hypothesis using categorical variables
Hypothesis: Lower income → higher murder rate
To test this hypothesis we can build tables and calculate percentages.
Income   Murder rate
L        L
H        L
H        L
H        L
L        H
L        H
H        L
H        H
L        H
H        H
Note that we recoded two continuous variables - income and murder rate - so they became categorical.
Frequencies table
               High Murder   Low Murder
Low Income     3             1
High Income    2             4

Percentages table
               High Murder   Low Murder
Low Income     75%           25%
High Income    33%           67%
Interpreting percentages is a bit “loosey goosey.” For a more precise estimate of the relationship between the variables we can use the frequencies table to calculate the “Chi-Square” (χ²) statistic. We’ll do that later…
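As a preview of that later step, here is a minimal Python sketch that runs a chi-square test on the 2 x 2 frequencies table above; it assumes SciPy is installed, and the variable names are purely illustrative.

```python
# Chi-square test of independence on the frequencies table above.
# Rows are income level (low, high); columns are murder rate (high, low).
from scipy.stats import chi2_contingency

observed = [
    [3, 1],  # Low income:  3 high-murder cities, 1 low-murder city
    [2, 4],  # High income: 2 high-murder cities, 4 low-murder cities
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}, df = {dof}")
print("expected frequencies under independence:")
print(expected)
```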
CORRELATION
r statistic
scattergram
cloud of dots
line of best fit
intercorrelations
“control” variables
Correlation statistic - r
• Correlation: a measure of the strength of an association (relationship) between continuous variables
• Values of r range from -1 to +1
• -1 is a perfect negative association (correlation), meaning that as the scores of one variable increase, the scores of the other variable decrease at exactly the same rate
• +1 is a perfect positive association, meaning that both variables go up or down together, in perfect harmony
• Intermediate values of r (close to zero) indicate a weak or no relationship
• Zero r (never in real life) means no relationship – that the variables do not change or “vary” together, except as what might happen through chance alone
• Remember that “negative” doesn’t mean “no” relationship. A negative relationship is just as much a relationship as a positive relationship.
[Scale: r = +1, perfect positive relationship; r = 0, no relationship; r = -1, perfect negative relationship]
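To make the r statistic described above concrete, here is a minimal Python sketch that computes r for two continuous variables. SciPy is assumed to be available, and the income and murder-rate numbers are made up purely for illustration (they are not the course data).

```python
# Pearson's correlation coefficient (r) for two continuous variables.
# The data below are hypothetical, used only to illustrate the computation.
from scipy.stats import pearsonr

median_income = [28, 35, 41, 47, 52, 58, 63, 70]   # e.g., median income in $1,000s
murder_rate   = [14, 12, 11,  9,  8,  6,  5,  3]   # e.g., murders per 100,000

r, p_value = pearsonr(median_income, murder_rate)
print(f"r = {r:.2f}")   # a value near -1 indicates a strong negative relationship
```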
Scattergrams
Hypothesis: Lower income → higher murder rate
• Depict the distribution of two continuous variables
• Each case will have a score for each variable. A “dot” is placed where these scores intersect.
• When part of a hypothesis, the independent variable scores go on the X (horizontal) axis, and the dependent variable scores go on the Y (vertical) axis
• Lowest actual scores (not necessarily a zero) and highest actual scores go on the extremes (using scales with regular intervals is fine)
• The dots form a “cloud”
[Scattergram: murder rate on the Y axis, median income on the X axis]
Testing a hypothesis using continuous variables
Lower income → higher murder rate
[Two “scattergrams” – each with a “cloud” of dots: the distribution of cities by median income and the distribution of cities by murder rate, with murder rate on the Y axis and median income on the X axis]
[Two example scattergrams, each with values 1 through 6 on the axes: one showing r = -1, the other showing r = +1]
NOTE: The dependent variable (Y) is always placed on the vertical axis.
NOTE: The independent variable (X) is always placed on the horizontal axis.
Can changes in one variable be predicted by changes in the other?
As X changes in value, does Y move correspondingly, either in the same or opposite direction?
Here there seems to be no connection between X and Y. One cannot predict values of Y from values of X.
[Scattergram: r = 0]
Can changes in one variable be predicted by changes in the other?
Here as X changes in value by one unit, Y also changes in value by one unit. Knowing the value of X, one can predict the value of Y. X and Y go up and down together, meaning a positive relationship.
[Scattergram: r = +1]
Can changes in one variable be predicted by changes in the other?
Here as X changes in value by one unit, Y also changes in value by one unit. Knowing the value of X, one can predict the value of Y. X and Y go up and down in opposite directions, meaning a negative relationship.
[Scattergram: r = -1]
Computing r using the “Line of best fit”
• To arrive at a value of r, a straight line is placed through the cloud of dots (the actual, “observed” data)
• This line is placed so that the cumulative distance between itself and the dots is minimized
• The smaller this distance, the higher the r
• r’s are normally calculated with computers. Paired scores (each X/Y combination) and the means of X and Y are used to compute:
  • a, where the line crosses the Y axis
  • b, the slope of the line
• When relationships are very strong or very weak, one can estimate the r value by simply examining the graph
[Scattergram with a line of best fit; its intercept (a) and slope (b) are labeled]
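A minimal sketch of the computation the slide describes: the paired X/Y scores and the means of X and Y are used to obtain b (the slope), a (the Y intercept), and r. NumPy is assumed, and the five paired scores are illustrative only.

```python
# Slope (b), intercept (a), and r from paired X/Y scores (illustrative data).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9])

x_dev = x - x.mean()   # deviations of X from its mean
y_dev = y - y.mean()   # deviations of Y from its mean

b = (x_dev * y_dev).sum() / (x_dev ** 2).sum()   # slope of the line of best fit
a = y.mean() - b * x.mean()                      # where the line crosses the Y axis
r = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(f"a (intercept) = {a:.2f}, b (slope) = {b:.2f}, r = {r:.2f}")
```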
“Line of best fit”
• The line of best fit predicts a value for one variable given the value of the other
• There will be a difference between these estimated values and the actual, known (“observed”) values. This difference is called a “residual” or an “error of the estimate.”
• As the error between the known and predicted values decreases – as the dots cluster more tightly around the line – the absolute value of r (whether + or –) increases
[Scattergram with a line of best fit and two predictions read off the line: if Y = 5, X = 3.4; if X = .5, Y = 2.3]
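Continuing the illustrative numbers above, here is a short sketch of the residuals (“errors of the estimate”) the slide describes. For a simple two-variable line of best fit, r² equals 1 - SS_residual / SS_total, so tighter clustering around the line (smaller residuals) means a larger absolute value of r. NumPy is assumed.

```python
# Residuals ("errors of the estimate") around the line of best fit (illustrative data).
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9])

b, a = np.polyfit(x, y, deg=1)    # slope and intercept of the line of best fit
predicted = a + b * x             # the Y value the line predicts for each X
residuals = y - predicted         # observed minus predicted

ss_residual = (residuals ** 2).sum()
ss_total = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_residual / ss_total   # smaller residuals -> larger r-squared

print("residuals:", np.round(residuals, 2))
print(f"r-squared = {r_squared:.2f}")
```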
A perfect fit: the line of best fit goes “through” each dot
[Two scattergrams: r = -1.0, a perfect fit; r = +1.0, a perfect fit]
Moderate cumulative distance between line of best fit and “cloud” of dots
[Scattergram: r = +.65. An intermediate fit yields an intermediate value of r.]
Large cumulative distance between line of best fit and “cloud” of dots
[Scattergram: r = -.19. A poor fit yields a low value of r.]
HYPOTHESIS TESTING
r² and R² - regression coefficient
extreme scores
restricted range
partial correlation and control variables
other correlation techniques
R-squared (r² or R²), the regression coefficient (aka coefficient of determination)
• Proportion of the change in the dependent variable (also known as the “effect” variable), in percentage terms, that is accounted for by change in the independent variable (also known as the “predictor” variable)
• Taken by squaring the correlation coefficient (r)
• “Little” r squared (r²) depicts the explanatory power of a single independent/predictor variable
• “Big” R squared (R²) combines the effects of multiple independent/predictor variables. It’s the more commonly used.
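A minimal sketch of the distinction, with made-up numbers: “little” r² is just the square of the correlation between the dependent variable and one predictor, while “big” R² can be computed from a multiple-regression fit as 1 - SS_residual / SS_total. NumPy is assumed.

```python
# "Little" r-squared (one predictor) and "big" R-squared (two predictors); illustrative data.
import numpy as np

y  = np.array([10., 12., 15., 18., 21., 25.])   # dependent ("effect") variable
x1 = np.array([1.,  2.,  3.,  4.,  5.,  6.])    # first predictor
x2 = np.array([3.,  1.,  4.,  2.,  6.,  5.])    # second predictor

# Little r-squared: square the bivariate correlation with a single predictor.
r = np.corrcoef(x1, y)[0, 1]
print(f"r = {r:.2f}, r-squared = {r ** 2:.2f}")

# Big R-squared: regress y on both predictors (plus an intercept column),
# then compare residual variation to total variation.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coefs
R_squared = 1 - ((y - predicted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"R-squared = {R_squared:.2f}")
```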
Hypothesis: Lower income → higher murder rate
How to “read” a scattergram
• Move along the IV. Do the values of the DV change in a consistent direction?
• Look across the IV. Does knowing the value of the IV help you predict the value of the DV?
• Place a straight line through the cloud of dots, trying to minimize the overall distance between the line and the dots. Is the line at a pronounced angle?
To the extent that you can answer “yes” to each of these, there is a relationship.
[Scattergram: r = -.6, r² = .36]
Change in the IV accounts for thirty-six percent of the change in the DV. A moderate-to-strong relationship, in the hypothesized direction – hypothesis confirmed!
Class exercise
Hypothesis 1: Height → Weight
Hypothesis 2: Age → Weight
• Build a scattergram for your assigned hypothesis
• Be sure that the independent variable is on the X axis, smallest value on the left, largest on the right, just like when graphing any distribution
• Be sure that the dependent variable is on the Y axis, smallest value on the bottom, largest on top
• Place a dot representing a case at the intersection of its values on X and Y
• Place a STRAIGHT line where it minimizes the overall distance between itself and the cloud of dots
• Use this overall distance to estimate a possible value of r, from -1 (perfect negative relationship), to 0 (no relationship), to +1 (perfect positive relationship)
• Remember that “negative” doesn’t mean “no” relationship. Negative relationships are just as much a relationship as positive relationships.
Height (inches)   Weight   Age
62                130      23
62                167      26
64                145      30
64                150      28
68                145      28
60                122      26
63                125      31
66                125      20
69                236      40
62                115      20
69                150      21
64                115      23
64                175      22
65                150      29
68                208      40
66                190      26
63                150      28
74                230      25
67                150      34
64                117      27
71                195      21
71                230      24
65                175      26
69                180      27
69                220      28
70                150      20
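The exercise is meant to be done by hand, but as a cross-check, here is a hedged Python sketch for Hypothesis 1 that plots the height/weight pairs from the table above, draws a least-squares line, and prints r. It assumes NumPy and Matplotlib are available.

```python
# Scattergram with a line of best fit for Hypothesis 1: Height -> Weight,
# using the 26 cases listed in the class-exercise table.
import numpy as np
import matplotlib.pyplot as plt

height = np.array([62, 62, 64, 64, 68, 60, 63, 66, 69, 62, 69, 64, 64,
                   65, 68, 66, 63, 74, 67, 64, 71, 71, 65, 69, 69, 70], dtype=float)
weight = np.array([130, 167, 145, 150, 145, 122, 125, 125, 236, 115, 150, 115, 175,
                   150, 208, 190, 150, 230, 150, 117, 195, 230, 175, 180, 220, 150], dtype=float)

r = np.corrcoef(height, weight)[0, 1]                 # correlation coefficient
slope, intercept = np.polyfit(height, weight, deg=1)  # line of best fit

plt.scatter(height, weight)                           # the "cloud" of dots
xs = np.linspace(height.min(), height.max(), 100)
plt.plot(xs, intercept + slope * xs)                  # straight line through the cloud
plt.xlabel("Height (inches)")    # independent variable on the X axis
plt.ylabel("Weight")             # dependent variable on the Y axis
plt.title(f"Height vs. weight (r = {r:.2f})")
plt.show()
```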
Impact of extreme scores
[Two scattergrams]
With all cases: r = .35, r² = .12, a weak to moderate positive relationship.
With the less extreme cases only: r = -.17, r² = .03, a very weak negative relationship.
Extreme scores can be produced by measurement errors or other circumstances (here, it could be chronic illness or a hereditary disorder). To prevent confusion, such cases are often dropped, but notice should always be given.
Effects of restricted range
Hypothesis: Age → Height. People get taller as they age, right?
[Scattergram: r = .04, r² = .00]
In this sample, age has no relationship with height. Why? Because the range for age is severely restricted: each case is already an adult!
What do we learn from this? KNOW YOUR DATA!
Intercorrelations
• Might associations between variables be distorted by their relationship with other variables (“intercorrelations”)?
  – Issue: Whenever we measure the effect of a variable, we inevitably include the effects of other variables with which our variable of interest is related
  – Example: When we measure the effect of poverty on crime, part of the effect reflects the variable education, with which poverty is related
• Research articles often begin data analysis with a “correlation matrix” that displays the bivariate (two-variable) correlations between all continuous variables
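A minimal sketch of what such a correlation matrix looks like when computed; it assumes pandas is available, and the three variables and their values are made up solely for illustration.

```python
# A bivariate correlation matrix for several continuous variables (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "poverty_rate": [11.2, 14.5,  9.8, 17.3, 12.6, 15.9],   # hypothetical values
    "education":    [88.1, 82.4, 90.3, 79.5, 85.7, 81.2],   # hypothetical % HS graduates
    "crime_rate":   [310,  420,  275,  505,  360,  455],    # hypothetical rates
})

print(df.corr())   # Pearson correlation between every pair of variables
```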
Hypothesis: fewer gun laws → more gun homicides (from Police Issues)
• Poverty is strongly associated with law scores and with gun homicides
• Could the association between law scores and gun homicides reflect, at least in part, the relationship between poverty and gun homicides?
• We use “partial correlation” to remove the influence of poverty from the relationship between law scores and gun homicides
  – We do so by statistically “controlling” for poverty. Poverty becomes a “control” variable
Controlling for poverty using “partial correlation”
• Sure enough, when we control for poverty, the relationship between law score and gun homicides (originally -.366*) becomes non-significant. Poverty was exaggerating the influence of law score on gun homicide.
• To be fair, is it also working the other way around? Let’s test the relationship between poverty and gun homicides, controlling by law score.
• The original relationship between poverty and gun homicides (-.397*) decreases only slightly. So the more likely cause of changes in gun homicides is changes in poverty, not changes in law scores.
• THINK BACK! This process accomplishes the same for continuous variables as first-order partial tables did for categorical variables.
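One standard way to compute a first-order partial correlation is from the three bivariate r’s, using r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)). The sketch below applies that formula with NumPy; the law-score, gun-homicide, and poverty numbers are invented for illustration and are not the Police Issues data.

```python
# First-order partial correlation: law score and gun homicides, controlling for poverty.
# Illustrative data only (not the Police Issues figures).
import numpy as np

law_score    = np.array([12., 30., 45.,  8., 22., 50., 15., 38.])
gun_homicide = np.array([7.5, 4.2, 3.1, 8.8, 5.9, 2.6, 6.8, 3.9])
poverty      = np.array([16., 12., 10., 18., 14.,  9., 15., 11.])

r_xy = np.corrcoef(law_score, gun_homicide)[0, 1]   # zero-order r of interest
r_xz = np.corrcoef(law_score, poverty)[0, 1]
r_yz = np.corrcoef(gun_homicide, poverty)[0, 1]

partial_r = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(f"zero-order r = {r_xy:.3f}")
print(f"partial r, controlling for poverty = {partial_r:.3f}")
```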
So, what about age and weight?
Hypothesis: Age → weight (older → heavier)
[Tables: zero-order correlation between age and weight, and partial correlation of age → weight controlling for height (sample of 19 youths, ages 2-20)]
• The relationship between age and weight decreases only slightly, from .990 to .850. So our hypothesis remains well confirmed.
Miscellaneous stuff…
• “Spearman’s r”: Correlation technique for ordinal categorical variables (e.g., Low/Medium/High)
• Changing the level of measurement from continuous to categorical:
[Scattergram of weight (100-240) by height (58-76 inches), divided into SHORT/TALL and LIGHT/HEAVY quadrants, with case counts:]
          SHORT   TALL
HEAVY     3       7
LIGHT     12      4
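For the Spearman’s r bullet above, a minimal sketch using SciPy; the ordinal codes (1 = Low, 2 = Medium, 3 = High) and the two variables are made up for illustration.

```python
# Spearman's rank-order correlation for ordinal variables (illustrative codes).
from scipy.stats import spearmanr

income_level = [1, 1, 2, 2, 2, 3, 3, 3, 1, 2]   # 1 = Low, 2 = Medium, 3 = High
murder_level = [3, 2, 2, 3, 1, 1, 2, 1, 3, 2]

rho, p_value = spearmanr(income_level, murder_level)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```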
Some parting thoughts
• If we do not use probability (e.g., random) sampling
  – Our results apply only to the cases we “observed” and coded
  – Accounting for the influence of other variables can be tricky
  – R and related statistics are often unimpressive; describing what they mean can be tricky
• If we use probability sampling
  – Our results can be extended to the population
  – But how accurate will our results be? After all, statistics (e.g., r, R²) will vary from sample to sample.
  – That, actually, is a good thing. If we sample correctly, procedures we will learn (i.e., “inferential statistics”) will allow us to estimate the difference between sample statistics and the actual population parameters. That’s called “error.”
  – These together - the statistical results, and the error - will allow us to interpret our results with far greater precision than is possible without probability sampling. Stand by!
Exam preview
1. You will apply what you have learned about populations, samples, sampling methods and
building a scattergram to the “College Education and Police Job Performance” article.
2. You will be given a hypothesis and data from a sample. There will be two variables – the
dependent variable, and the independent variable. Both will be categorical, and each will
have two levels (e.g., low/high, etc.)
A. You will build a table containing the frequencies (number of cases).
B. You will build another table with the percentages.
C. You will analyze the results. Are they consistent with the hypothesis?
3. You will be given the same data as above, broken down by a control variable. It will also
be categorical, with two levels.
A. You will build first order partial tables, one with frequencies (number of cases), the
other with percentages, for each level of the control variable.
B. You will be asked whether introducing the control variable affects your assessment
of the hypothesized zero-order relationship. This requires that you separately
compare the results for each level of the control variable to the zero-order table.
Does introducing the control variable tell us anything new?
4. You will be given another hypothesis and data. There will be two variables – the
dependent variable and the independent variable. Both are continuous variables.
A. You will build a scattergram and draw in a line of best fit.
B. You will state whether the scattergram supports the hypothesis. Be careful! First, is
there a relationship between variables? Second, is it in the same direction (positive
or negative) as the hypothesized relationship?