```
Association for Interval Level Variables
Chapter 15
Introduction
• When referring to interval-ratio variables, a
commonly used synonym for association is
correlation
• We will be looking for the existence,
strength, and direction of the relationship
• We will only look at bivariate relationships
in this chapter
Scattergrams
• The first step is to construct and examine
a scattergram
• Example in the book
– Analysis of how dual wage-earner families
cope with housework
– They want to know if the number of children
in the family is related to the amount of time
the husband contributes to housekeeping
chores
Scattergram of Relationship Between the Two Variables
[Figure: scattergram with regression line, Husband’s Hours of Housework by the Number of Children in the Family. X axis: Number of Children (0 to 6). Y axis: Hours Per Week Husband Spends on Housework (-2 to 8).]
Construction of a Scattergram
• Draw two axes of about equal length and at right
angles to each other
• Put the independent (X) variable along the horizontal
axis (the abscissa) and the dependent (Y) variable
along the vertical axis (the ordinate)
• For each person, locate the point along the abscissa
that corresponds to the scores of that person on the
X variable
– Draw a straight line up from that point and at right angles to
the axis
– Then locate the point along the ordinate that corresponds to
the score of that same case on the Y variable
– Place a dot there to represent the case, and then repeat
with all cases
Regression Line and its Purpose
• It checks for linearity of the data points on
the scattergram
• It gives information about the existence,
strength, and direction of the association
• It is used to predict the score of a case on
one variable from the score of that case
on the other variable
• It is a floating mean through all the data
points
Existence of a Relationship
• Two variables are associated if the
distributions of Y change for the various
conditions of X
– The scores along the abscissa (number of
children) are the conditions, or values, of X
– The dots above each X value can be
thought of as the conditional distributions
of Y (scores on Y for each value of X)
• In other words, Y tends to increase as X
increases
Existence of a Relationship
• The existence of a relationship is
reinforced by the fact that the regression
line lies at an angle to the X axis (the
abscissa)
– There is no linear relationship between two
interval-level variables when the regression
line on a scattergram is parallel to the
horizontal axis
Strength of the Association
• The strength of the association is judged by
observing the spread of the dots around the
regression line
– A perfect association between variables can be seen
on a scattergram when all dots lie on the regression
line
– The closer the dots to the regression line, the
stronger the association
– So, for a given X, there should not be much variation in
the Y scores
Direction of the Relationship
• The direction of the relationship can be
judged by observing the angle of the
regression line with respect to the abscissa
– The relationship is positive when the line slopes
upward from left to right
– The association is negative when it slopes down
– Your book shows a positive relationship, because
cases with high scores on X also tend to have high
scores on Y
– For a negative relationship, cases with high scores on X
would tend to have low scores on Y, and vice
versa
• Your book also shows a zero relationship: there is no
association between the variables, so scores on Y
vary randomly with respect to scores on X
Linearity
• The key assumption (first step in model) with
correlation and regression is that the two
variables have an essentially linear relationship
– The points or dots must form a pattern of a straight
line
– It is important to begin with a scattergram before
doing correlations and regressions
– If the relationship is nonlinear, you may need to treat
the variables as if they were ordinal rather than
interval-ratio
Regression and Prediction
• The final use of the scattergram is to predict
scores of cases on one variable from their score
on the other
• You may want to predict the number of hours of
housework a husband with a family of four
children would do each week
• Use regression to predict outside the range
of the data only with caution, since you do not
have any data to show what happens beyond
that range; the relationship may change there
(for example, Y may suddenly go down)
The Predicted Score on Y
• The symbol for this is Y’, or Y prime, though in other
books it is most often Y hat; that symbol is
difficult to do on a computer or to print in books
• It is found by first locating the score on X (X = 4, for
four children) and then drawing a straight line from
that point on the abscissa to the regression line
• From the regression line, another straight line parallel
to the abscissa is drawn across to the Y axis or
ordinate
• Y’ is found at the point where the line from the
regression line crosses the Y axis
• Or, you can compute Y’ = a + bX
– Y’ is the expected Y value for a given X
Formula for the Regression Line
• The formula for a straight line that fits
closest to the conditional means of Y
• Y = a + bX
• Where Y = score on the dependent variable
• a = the Y intercept or the point where the
regression line crosses the Y axis
• b = the slope of the regression line or the amount
of change produced in Y by a unit change in X
• X = score on the independent variable
Regression Line
• The position of the least-squares regression line
is defined by two elements
– The Y intercept and the slope of the line
• The weaker the effect of X on Y (the weaker the
association between the variables) the closer the
value of the slope (b) is to zero
• If the two variables are unrelated, the least-squares
regression line would be parallel to the
abscissa, and b would be 0 (the line would have
no slope)
Equations for the Slope of the
Regression Line
• You need to compute b first, since it is
needed in the formula for “a”
• Slope:
$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$
• Which is the covariance of X and Y
divided by the variance of X
Interpretation of the Value of the
Slope
• If you put your scattergram on graph paper,
you can see that as X increases one box, “b”
is how many units Y increases on the
regression line
• So, a slope of .69 indicates that, for each unit
increase in X, there is an increase of .69 units
in Y
– If the slope is 1.5, for every unit of change in X
there is an increase of 1.5 units in Y
– They refer to units, since correlation and
regression allow you to compare apples and
oranges, that is, two completely different variables
Interpretation of “b” cont.
• So, to find what one unit of X is or one
unit of Y is, you have to go back to the
labels for each variable
• For the example in your book which has
a “b” (beta) of .69
– The addition of each child (an increase of
one unit in X, where one unit is one child)
results in an increase of .69 hours of
housework being done by the husband (an
increase of .69 units, or hours, in Y)
Formula for the Intercept of the
Regression Line
a  Y  bX
Interpretation of the Intercept
• The intercept for the example in the book is 1.49
• The least-squares regression line will cross the Y
axis at the point where Y equals 1.49
• You need a second point to draw the regression
line
– You can begin at Y of 1.49, and for the next value of
X, which is 1 child, you will go up .69 units of Y
(to Y = 1.49 + .69 = 2.18)
– Or, you can use the intersection of the mean of X and
the mean of Y, since the regression line always goes
through this point
Interpretation of “a” cont.
• Most of the time, you can’t interpret the value of
the intercept
– Technically, it is the value that Y would take if X were
zero
• But, most often a zero X is not meaningful
• Or, in the case in your book, zero is outside the range of the
data
• You don’t have any information about the hours of
housework that husbands do when they have no children
• Technically, the intercept of 1.49 is the predicted amount of
housework a husband with zero children would do, but you
can’t say that with certainty
Least Squares Regression Line
• Now that you know “a” and “b”, you can fill in
the full least-squares regression line
• Y = a + bX
• Y = (1.49) + (.69) X
– This formula can be used to predict scores on Y as
was mentioned earlier
• For any value of X, it will give you the predicted
value of Y (Y’)
– For a family with four children (X = 4),
Y’ = 1.49 + (.69)(4) = 4.25 hours per week
• The predictions of husband’s housework are
“educated guesses”
– The accuracy of our predictions will increase as
relationships become stronger (as dots are closer
to the regression line)
The Correlation Coefficient
(Pearson’s r)
• Pearson’s r varies from 0 to plus or minus 1
– With 0 indicating no association
– And + 1 and – 1 indicating perfect positive and
perfect negative relationships
– The definitional formula for Pearson’s r is in your book
– Similar to the formula for b (beta), the numerator is
the covariation between X and Y (usually called the
covariance)
Interpretation of r and r-squared
• Interpretation of “r” will be the same as all
the other measures of association
– An “r” of .5 would be a moderate positive linear
relationship between the variables
• Interpretation of the Coefficient of
Determination (r-squared)
– The square of Pearson’s r is also called the
coefficient of determination
– “r” measures the strength of the linear
relationship between two variables, but values
of r between 0 and 1 or -1 have no direct
interpretation
Interpretation, cont.
• The coefficient of determination can be interpreted
with the logic of PRE (proportional reduction in error)
– First Y is predicted while ignoring the information supplied
by X
– Second the independent variable is taken into account when
predicting the dependent
• When working with variables measured at the
interval-ratio level, the predictions of Y under the first
condition (while ignoring X) will be the mean of the Y
scores (Y bar) for every case
– We know that the scores of any distribution vary less around
the mean than around any other point (the sum of squared
deviations is smallest around the mean)
Interpretation, cont.
• Predicting the mean of Y for every case will produce many errors
• The amount of error is shown in Figure 16.6
– The formula for the error is the sum of (Y minus Y
bar) squared
– This is called the total variation in Y, meaning the
total amount that all the points are off the mean
of Y
• The next step will be to find the extent to
which knowledge of X improves our ability to
predict Y (will we make predictions that come
closer to the actual points than will the mean
of Y?)
Interpretation, cont.
• If the two variables have a linear
relationship, then predicting scores on Y
from the least-squares regression equation
will use knowledge of X and reduce our
errors of prediction
• The formula for the predicted Y score for
each value of X will be: Y’ = a + bX
– This is also the formula for the regression line
Interpretation, cont.
• In Figure 16.7, we measure the distance of the
actual data points from the regression line
– If there is less distance (smaller errors) here than in
the distance of the actual points from the mean of Y,
then there is an association between the two
variables
– The vertical lines from each data point to the
regression line represent the amount of error in
predicting Y that remains even after X has been taken
into account
– We can calculate that precisely by looking at r-squared
– r-squared is the proportion, or when multiplied by
100, is the percentage of variation in Y that is
explained by X
Unexplained Variation
• Unless r-squared equals 1.00, some of the variation in Y is
unexplained by X
– The vertical lines in your book represent the
unexplained variation
• The amount of error in predicting Y that remains after X has
been taken into account is the unexplained variation of Y
• The unexplained variation is the scattering of the actual
scores around the regression line
• The formula for this is the sum of (Y minus Y prime) squared
– The proportion of the total variation in Y unexplained
by X can also be found by subtracting the value of
r-squared from 1.00
Unexplained Variation, cont.
• Unexplained variation is usually
attributed to the influence of three
things
– Some combination of other variables, as in
the example of the husband’s housework
– Measurement error
• People over or under estimate how much time
they spend doing housework
– Random chance
• Your sample may be unrepresentative by chance,
particularly if it is small
Testing Pearson’s r for Significance
• When “r” is based on data from a random
sample, you need to test r for its statistical
significance
• When testing Pearson’s r for significance, the
null hypothesis is that there is no linear
association between the variables in the
population from which the sample was drawn
– We will use the t distribution for this test
Assumptions for the Significance
Test
• We make some additional assumptions
in Step 1
• Need to assume that both variables are
normal in distribution
• Need to assume that the relationship
between the two variables is roughly
linear in form
• The third new assumption involves the
concept of homoscedasticity
Homoscedasticity
• A homoscedastic relationship is one where the
variance of the Y scores is uniform for all values of X
– If the Y scores are evenly spread above and below the
regression line for the entire length of the line, the
relationship is homoscedastic
– If the variance around the regression line is greater at one
end or the other, the relationship is heteroscedastic
– A visual inspection of the scattergram is usually sufficient to
find the extent to which the relationship conforms to the assumptions
of linearity and homoscedasticity
– If the data points fall in a roughly symmetrical, cigar-shaped
pattern, whose shape can be approximated with a straight
line, then it is appropriate to proceed with this test of
significance
Formula for t (obtained)
• The formula is in your book; you just
plug the values of r, r-squared, and the
sample size (N) into it
```
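The calculations described in the slides (slope, intercept, Pearson's r, r-squared, unexplained variation, and the obtained t) can be sketched in a few lines of Python. This is a minimal illustration, not the book's worked example. The function name `least_squares` and the five data points are hypothetical, chosen only so the arithmetic is easy to check by hand. The t statistic uses the standard form t = r * sqrt((N - 2) / (1 - r^2)), which is assumed here and may be written differently in the book.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return slope, intercept, r, r-squared, variation terms and t for paired scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Covariation of X and Y, and the variation of each variable around its mean
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)  # total variation in Y

    b = sp_xy / ss_x               # slope: computed first, since "a" needs it
    a = mean_y - b * mean_x        # intercept: a = mean(Y) - b * mean(X)
    r = sp_xy / sqrt(ss_x * ss_y)  # Pearson's r

    # Unexplained variation: sum of (Y - Y') squared, with Y' = a + bX
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    # Obtained t for testing r (undefined when r is exactly +1 or -1)
    t = r * sqrt((n - 2) / (1 - r ** 2))

    return {"a": a, "b": b, "r": r, "r2": r ** 2,
            "total": ss_y, "unexplained": unexplained, "t": t}

# Hypothetical data: number of children (X) and husband's weekly housework hours (Y)
children = [1, 2, 3, 4, 5]
hours = [2, 3, 5, 4, 6]

result = least_squares(children, hours)
predicted_for_four = result["a"] + result["b"] * 4  # Y' for a family with four children

print(f"b = {result['b']:.2f}, a = {result['a']:.2f}")
print(f"r = {result['r']:.2f}, r-squared = {result['r2']:.2f}")
print(f"unexplained / total = {result['unexplained'] / result['total']:.2f}")
print(f"t (obtained) = {result['t']:.2f}, Y' at X = 4: {predicted_for_four:.2f}")
```

With these hypothetical scores, the proportion of unexplained variation equals 1 minus r-squared, which matches the relationship stated on the Unexplained Variation slide. The obtained t would then be compared against a critical t with N - 2 degrees of freedom.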