Linear Regression and Correlation
GOALS
1. Understand and interpret the terms dependent and independent variable.
2. Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate.
3. Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero.
4. Calculate the least squares regression line.
5. Construct and interpret confidence and prediction intervals for the dependent variable.
Regression Analysis - Uses
Some examples:
• Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month?
• Can we base an estimate of the cost to heat a home in January on the number of square feet in the home?
• Is there a relationship between the miles per gallon achieved by large pickup trucks and the size of the engine?
• Is there a relationship between the number of hours that students studied for an exam and the score earned?
Correlation Analysis and Scatter Diagram
• Correlation Analysis is the study of the relationship between variables. It is also defined as a group of techniques to measure the association between two variables.
• A Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis.
Dependent vs. Independent Variable
DEPENDENT VARIABLE: The variable that is being predicted or estimated. It is scaled on the Y-axis.
INDEPENDENT VARIABLE: The variable that provides the basis for estimation. It is the predictor variable. It is scaled on the X-axis.
Regression Example
The sales manager of Copier Sales of America, which has a large sales
force throughout the United States and Canada, wants to determine
whether there is a relationship between the number of sales calls
made in a month and the number of copiers sold that month. The
manager selects a random sample of 10 representatives and
determines the number of sales calls each representative made last
month and the number of copiers sold.
Scatter Diagram
[Figure: scatter diagram of sales calls (X-axis) versus copiers sold (Y-axis) for the 10 representatives]
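The scatter-diagram image itself is not in this transcript. As a stand-in, a minimal Python sketch; the ten (calls, copiers) pairs are reconstructed sample values that reproduce the results quoted on later slides (r ≈ 0.759, Ŷ = 18.9476 + 1.1842X), so treat them as illustrative:

```python
# Minimal sketch of the scatter diagram for the Copier Sales example.
# The data pairs are reconstructed/illustrative, not quoted from the deck.
import matplotlib.pyplot as plt

calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]    # independent variable (X)
copiers = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]  # dependent variable (Y)

plt.scatter(calls, copiers)
plt.xlabel("Sales calls (X)")
plt.ylabel("Copiers sold (Y)")
plt.title("Copier Sales of America")
plt.show()
```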
The Coefficient of Correlation, r
The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Values of -1.00 or 1.00 indicate perfect correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.
Perfect Correlation
[Figure: scatter diagrams of perfect negative (r = -1.00) and perfect positive (r = 1.00) correlation]
Correlation Coefficient - Interpretation
[Figure: scale from -1.00 to 1.00 labeling perfect, strong, moderate, and weak negative and positive correlation]
Correlation Coefficient - Formula
The formula image is not in this transcript; the computational form consistent with the least squares formulas used later is:

r = [n(ΣXY) − (ΣX)(ΣY)] / √{ [n(ΣX²) − (ΣX)²][n(ΣY²) − (ΣY)²] }
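A minimal sketch applying the computational formula above, assuming the same reconstructed copier data; NumPy's corrcoef is used as a cross-check:

```python
# Applies r = [n*SXY - SX*SY] / sqrt([n*SXX - SX^2][n*SYY - SY^2]),
# then cross-checks with NumPy. Data are reconstructed/illustrative.
import math
import numpy as np

X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]  # sales calls
Y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]  # copiers sold
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)
syy = sum(y * y for y in Y)

r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))                         # 0.759
print(round(np.corrcoef(X, Y)[0, 1], 3))   # same value
```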
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X).
• It is the square of the coefficient of correlation.
• It ranges from 0 to 1.
• It does not give any information on the direction of the relationship between the variables.
Correlation Coefficient - Example
Using the Copier Sales of America data for which a scatter plot was developed earlier, compute the correlation coefficient and the coefficient of determination.
[Worked computation slide: the calculation yields r = 0.759]
Correlation Coefficient - Example
How do we interpret a correlation of 0.759?
• First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold.
• The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong.
However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables—sales calls and copiers sold—are related.
Coefficient of Determination (r²) - Example
• The coefficient of determination, r², is 0.576, found by (0.759)².
• This is a proportion or a percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
Testing the Significance of the Correlation Coefficient
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if (in SPSS) the p-value < .05.
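The slides defer to SPSS's p-value. For reference, the usual test statistic here is t = r·√(n − 2) / √(1 − r²) with n − 2 degrees of freedom; a minimal sketch with the values from the copier example:

```python
# t-test for H0: rho = 0, using t = r*sqrt(n-2)/sqrt(1-r^2), df = n - 2.
# r and n are the values from the copier example in this deck.
import math
from scipy import stats

r, n = 0.759, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed

print(round(t, 3), round(p_value, 3))  # t ≈ 3.297, p ≈ 0.011 -> reject H0 at .05
```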
Correlation and Cause
• High correlation does not mean cause and effect.
• For example, it can be shown that the consumption of Georgia peanuts and the consumption of aspirin have a strong correlation. However, this does not indicate that an increase in the consumption of peanuts caused the consumption of aspirin to increase.
• Likewise, the incomes of professors and the number of inmates in mental institutions have increased proportionately. Further, as the population of donkeys has decreased, there has been an increase in the number of doctoral degrees granted.
• Relationships such as these are called spurious correlations.
Practice
• Home sales.sav
• Correlation
  – Analyze → Correlate → Bivariate
  – Select:
    • Appraised Land Value
    • Appraised value of improvements
    • Total appraised value
    • Sale price
Regression Analysis
In regression analysis we use the independent variable (X) to estimate the dependent variable (Y).
• The relationship between the variables is linear.
• Both variables must be at least interval scale.
• The least squares criterion is used to determine the equation.
Linear Regression Model
The model image is not in this transcript; consistent with the later slides, the fitted line has the form Ŷ = a + bX, where a is the Y-intercept and b is the slope of the line.
Regression Analysis – Least Squares Principle
The least squares principle is used to obtain a and b. The equations to determine a and b are:

b = [n(ΣXY) − (ΣX)(ΣY)] / [n(ΣX²) − (ΣX)²]
a = (ΣY)/n − b(ΣX)/n
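A minimal sketch applying the two formulas above to the reconstructed copier data:

```python
# Least squares slope and intercept via the formulas on this slide.
# Data are reconstructed/illustrative values for the copier example.
X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
Y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
n = len(X)

sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sxx = sum(x * x for x in X)

b = (n * sxy - sx * sy) / (n * sxx - sx**2)
a = sy / n - b * sx / n
# b ≈ 1.1842; a ≈ 18.9474 (the deck's 18.9476 comes from using the
# already-rounded slope 1.1842 when computing a).
print(round(b, 4), round(a, 4))
```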
Illustration of the Least Squares Regression Principle
[Figure: observed points and the fitted line, with the squared vertical deviations being minimized]
Computing the Slope of the Line
[Worked computation: b = 1.1842]
Computing the Y-Intercept
[Worked computation: a = 18.9476]
Regression Equation - Example
Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?
Finding the Regression Equation - Example
The regression equation is:
Ŷ = a + bX
Ŷ = 18.9476 + 1.1842X
Ŷ = 18.9476 + 1.1842(20) = 42.6316
Computing the Estimates of Y
Step 1 – Using the regression equation, substitute the value of each X to solve for the estimated sales.
Tom Keller (X = 20):
Ŷ = 18.9476 + 1.1842(20) = 42.6316
Soni Jones (X = 30):
Ŷ = 18.9476 + 1.1842(30) = 54.4736
Plotting the Estimated and the Actual Y’s
[Figure: the actual Y values and the estimated Ŷ values plotted against X, with the fitted line]
The Standard Error of Estimate
• The standard error of estimate (s_y·x) measures the scatter, or dispersion, of the observed values around the line of regression.
• A formula that can be used to compute the standard error:

s_y·x = √[ Σ(Y − Ŷ)² / (n − 2) ]
Standard Error of the Estimate - Example
Recall the example involving Copier Sales of America. The sales manager determined that the least squares regression equation is given below. Determine the standard error of estimate as a measure of how well the values fit the regression line.

Ŷ = 18.9476 + 1.1842X
s_y·x = √[ Σ(Y − Ŷ)² / (n − 2) ] = √[ 784.211 / (10 − 2) ] = 9.901
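A minimal sketch that reproduces this computation, assuming the reconstructed copier data and the fitted equation from the slides:

```python
# Standard error of estimate: s = sqrt(sum((Y - Yhat)^2) / (n - 2)).
# Data are reconstructed/illustrative; a and b are from the slides.
import math

X = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
Y = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]
a, b = 18.9476, 1.1842

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s_yx = math.sqrt(sse / (len(X) - 2))
print(round(sse, 2), round(s_yx, 3))  # ≈ 784.21, 9.901
```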
Graphical Illustration of the Differences between Actual Y and Estimated Y (Y − Ŷ)
[Figure: the regression line with the vertical deviations Y − Ŷ drawn from each observed point]
Assumptions Underlying Linear Regression
1. For each value of X, there is a group of Y values, and these Y values are normally distributed.
2. The means of these normal distributions of Y values all lie on the straight line of regression.
3. The standard deviations of these normal distributions are equal.
4. The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
Practice
• Home sales.sav
• Regression
  – Analyze → Regression → Linear
  – Select:
    • Independent variables
      – Appraised Land Value
      – Appraised value of improvements
      – Total appraised value
    • Dependent variable
      – Sale price
Multiple Linear Regression and Correlation Analysis
GOALS
1. Describe the relationship between several independent variables and a dependent variable using multiple regression analysis.
2. Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination.
3. Conduct a test of hypothesis on each of the regression coefficients.
4. Use and understand qualitative independent variables.
Multiple Regression Analysis
The general multiple regression equation with k independent variables (the equation image is not in this transcript; the form below matches the two-variable case on the next slide) is:

Ŷ = a + b1X1 + b2X2 + … + bkXk

The least squares criterion is used to develop this equation. Because determining b1, b2, etc. is very tedious, a software package such as Excel or MINITAB is recommended.
Multiple Regression Analysis
For two independent variables, the general form of the multiple regression equation is:

Ŷ = a + b1X1 + b2X2

• X1 and X2 are the independent variables.
• a is the Y-intercept.
• b1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
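A minimal sketch of fitting such an equation by least squares with NumPy; the data here are made up purely for illustration:

```python
# Fit Yhat = a + b1*X1 + b2*X2 by least squares. Data are invented.
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix with a column of ones for the intercept a.
A = np.column_stack([np.ones_like(X1), X1, X2])
(a, b1, b2), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(round(a, 3), round(b1, 3), round(b2, 3))
```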
Regression Plane for a 2-Independent Variable Linear Regression Equation
[Figure: the fitted regression plane in (X1, X2, Y) space]
Multiple Linear Regression - Example
Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes.
Multiple Linear Regression - Example
Three variables are thought to relate to the heating costs:
(1) the mean daily outside temperature,
(2) the number of inches of insulation in the attic, and
(3) the age in years of the furnace (a device used for heating).
To investigate, Salsberry’s research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well as the values of these three variables (data in the next slide).
[Table: January heating cost, mean outside temperature, attic insulation, and furnace age for each of the 20 homes]
The Multiple Regression Equation – Interpreting the Regression Coefficients
The regression coefficient for mean outside temperature (X1) is −4.583. The coefficient is negative and shows an inverse relationship between heating cost and temperature. As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information: if we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of $4.583 in monthly heating cost.
The Multiple Regression Equation – Interpreting the Regression Coefficients
The attic insulation variable (X2) also shows an inverse relationship: the more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline $14.83 per month, regardless of the outside temperature or the age of the furnace.
The Multiple Regression Equation – Interpreting the Regression Coefficients
The age of the furnace variable (X3) shows a direct relationship. With an older furnace, the cost to heat the home increases. Specifically, for each additional year older the furnace is, we expect the cost to increase $6.10 per month.
Applying the Model for Estimation
What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?
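A hedged sketch of the plug-in computation. The slopes come from the preceding slides; the intercept is not shown anywhere in this transcript, so the 427.19 below is the value reported in the underlying textbook example and should be treated as an assumption:

```python
# Plug values into the fitted equation. Slopes (-4.583, -14.83, +6.10)
# are from the slides above; the intercept 427.19 is NOT in this
# transcript -- it is assumed from the underlying textbook example.
def heating_cost(temp_f, insulation_in, furnace_age,
                 intercept=427.19, b_temp=-4.583,
                 b_insul=-14.83, b_age=6.10):
    return intercept + b_temp * temp_f + b_insul * insulation_in + b_age * furnace_age

print(round(heating_cost(30, 5, 10), 2))  # ≈ 276.55 per month with these rounded coefficients
```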
Multiple Standard Error of Estimate
The multiple standard error of estimate is a measure of the effectiveness of the regression equation.
• It is measured in the same units as the dependent variable.
• It is difficult to determine what is a large value and what is a small value of the standard error.
• The formula (the image is not in this transcript; this is the standard form the slide describes) is:

s_Y·12…k = √[ Σ(Y − Ŷ)² / (n − (k + 1)) ]
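A minimal sketch under stated assumptions: taking SSR = 171,220 and SS total = 212,916 from the ANOVA slide later in this deck, SSE = 212,916 − 171,220 = 41,696; with n = 20 homes and k = 3 predictors:

```python
# Multiple standard error from the ANOVA quantities quoted in this deck.
import math

ss_total, ssr = 212_916, 171_220
sse = ss_total - ssr      # 41,696
n, k = 20, 3

s = math.sqrt(sse / (n - (k + 1)))
print(round(s, 2))        # ≈ 51.05, in the units of the dependent variable
```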
Multiple Regression and Correlation Assumptions
• The independent variables and the dependent variable have a linear relationship. The dependent variable must be continuous and at least interval scale.
• The variation in the residuals must be the same for all values of Y. When this is the case, we say the residuals exhibit homoscedasticity.
• The residuals should follow the normal distribution with mean 0.
• Successive values of the dependent variable must be uncorrelated.
The ANOVA Table
The ANOVA table reports the variation in the dependent variable. The variation is divided into two components.
• The Explained Variation is that accounted for by the set of independent variables.
• The Unexplained or Random Variation is not accounted for by the independent variables.
Minitab – the ANOVA Table
[Minitab ANOVA output for the heating-cost regression]
Coefficient of Multiple Determination (R²)
Characteristics of the coefficient of multiple determination:
1. It is symbolized by a capital R squared. In other words, it is written as R² because it behaves like the square of a correlation coefficient.
2. It can range from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable. A value near 1 means a strong association.
3. It cannot assume negative values. Any number that is squared or raised to the second power cannot be negative.
4. It is easy to interpret. Because R² is a value between 0 and 1, it is easy to interpret, compare, and understand.
Coefficient of Multiple Determination (R²) - Formula

R² = SSR / SS total
Minitab – the ANOVA Table

R² = SSR / SS total = 171,220 / 212,916 = 0.804
Adjusted Coefficient of Determination
• Adding independent variables to a multiple regression equation makes the coefficient of determination larger.
• If the number of variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0.
• To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages use an adjusted coefficient of multiple determination.
Adjusted Coefficient of Determination - Example
[Worked computation for the heating-cost regression]
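The usual adjustment is R²adj = 1 − [SSE/(n − (k + 1))] / [SS total/(n − 1)]. A minimal sketch using the ANOVA quantities quoted in this deck (n = 20, k = 3):

```python
# Adjusted R^2 from the ANOVA quantities quoted in this deck.
ss_total, ssr = 212_916, 171_220
sse = ss_total - ssr
n, k = 20, 3

r2 = ssr / ss_total
r2_adj = 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))
print(round(r2, 3), round(r2_adj, 3))  # 0.804, 0.767
```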
Correlation Matrix
A correlation matrix is used to show all possible simple correlation coefficients among the variables.
• The matrix is useful for locating correlated independent variables.
• It shows how strongly each independent variable is correlated with the dependent variable.
Global Test: Testing the Multiple Regression Model
The global test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are:

H0: β1 = β2 = … = βk = 0
H1: Not all β’s equal 0
Global Test (continued)
The test statistic is the F distribution with k (the number of independent variables) and n − (k + 1) degrees of freedom, where n is the sample size.
• Decision rule: Reject H0 if F > F(α, k, n − (k + 1)), or, in SPSS, when the p-value in the ANOVA table is < .05.
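A minimal sketch of the global F test, F = (SSR/k) / (SSE/(n − (k + 1))), using the ANOVA quantities quoted in this deck:

```python
# Global F test for the heating-cost regression (n = 20, k = 3).
from scipy import stats

ss_total, ssr = 212_916, 171_220
sse = ss_total - ssr
n, k = 20, 3

F = (ssr / k) / (sse / (n - (k + 1)))
p_value = stats.f.sf(F, k, n - (k + 1))
print(round(F, 2), p_value)  # ≈ 21.90, p well below .05 -> reject H0
```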
Interpretation
• The computed F is 21.90, with a p-value < .05, so we can reject H0.
• The null hypothesis that all the multiple regression coefficients are zero is therefore rejected.
• Interpretation: some of the independent variables (amount of insulation, etc.) do have the ability to explain the variation in the dependent variable (heating cost).
Evaluating the Assumptions of Multiple Regression
1. There is a linear relationship. That is, there is a straight-line relationship between the dependent variable and the set of independent variables.
2. The variation in the residuals is the same for both large and small values of the estimated Y. To put it another way, the size of the residual is unrelated to whether the estimated Y is large or small.
3. The residuals follow the normal probability distribution.
4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated.
5. The residuals are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.
Analysis of Residuals
A residual is the difference between the actual value of Y and the predicted value of Y.
• Residuals should be approximately normally distributed. Histograms are useful in checking this requirement.
• A plot of the residuals against their corresponding Ŷ values is used for showing that there are no trends or patterns in the residuals.
Multicollinearity
• Multicollinearity exists when independent variables (X’s) are correlated.
• Correlated independent variables make it difficult to make inferences about the individual regression coefficients (slopes) and their individual effects on the dependent variable (Y).
• However, correlated independent variables do not affect a multiple regression equation’s ability to predict the dependent variable (Y).
Effects of Multicollinearity
• Not a Problem: Multicollinearity does not affect a multiple regression equation’s ability to predict the dependent variable.
• A Problem: Multicollinearity may show unexpected results in evaluating the relationship between each independent variable and the dependent variable (a.k.a. partial correlation analysis).
Variance Inflation Factor
• A general rule is that if the correlation between two independent variables is between −0.70 and 0.70, there likely is not a problem using both of the independent variables.
• A more precise test is to use the variance inflation factor (VIF). Its value (the formula image is not in this transcript; this is the standard form the slide describes) is found as follows:

VIF = 1 / (1 − R²j)

• The term R²j refers to the coefficient of determination when the selected independent variable is used as a dependent variable and the remaining independent variables are used as independent variables.
• A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
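A minimal sketch of the VIF computation: regress each predictor on the remaining predictors to get R²j, then apply VIF = 1/(1 − R²j). The data are made up for illustration:

```python
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors. Data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = 0.8 * X1 + rng.normal(scale=0.5, size=50)  # deliberately correlated with X1
X3 = rng.normal(size=50)
X = np.column_stack([X1, X2, X3])

def vif(X, j):
    y = X[:, j]                                  # predictor j as the "dependent" variable
    others = np.delete(X, j, axis=1)             # the remaining predictors
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2_j = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2_j)

print([round(vif(X, j), 2) for j in range(3)])   # X1 and X2 inflate each other
```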
Independence Assumption
• The fifth assumption about regression and correlation analysis is that successive residuals should be independent.
• When successive residuals are correlated, we refer to this condition as autocorrelation. Autocorrelation frequently occurs when the data are collected over a period of time.
Residual Plot versus Fitted Values
• The graph shows the residuals plotted on the vertical axis and the fitted values on the horizontal axis.
• Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.
[Figure: residuals versus fitted values, with runs above and below the mean of the residuals]
Durbin–Watson statistic
• The Durbin–Watson statistic is a test statistic used to detect the presence of autocorrelation in the residuals from a regression analysis. It is named after James Durbin and Geoffrey Watson.
• If e_t is the residual associated with the observation at time t, then the test statistic (the formula image is not in this transcript; this is the standard definition) is:

d = Σ(t = 2 to n) (e_t − e_{t−1})² / Σ(t = 1 to n) e_t²
Durbin–Watson statistic (continued)
• Since d is approximately equal to 2(1 − r), where r is the sample autocorrelation of the residuals, d = 2 indicates that there appears to be no autocorrelation. The value of d always lies between 0 and 4.
• If the Durbin–Watson statistic is substantially less than 2, there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin–Watson is less than 1.0, there may be cause for alarm.
• Small values of d indicate that successive error terms are, on average, close in value to one another, or positively correlated. If d > 2, successive error terms are, on average, much different in value from one another, i.e., negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance.
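A minimal sketch of the computation on an illustrative residual series (statsmodels ships an equivalent durbin_watson function):

```python
# d = sum((e_t - e_{t-1})^2) / sum(e_t^2) for a made-up residual series.
import numpy as np

e = np.array([1.2, 0.8, 0.5, -0.3, -0.9, -0.4, 0.2, 0.7])  # illustrative residuals
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 3))  # values well below 2 suggest positive autocorrelation
```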
Qualitative Independent Variables
• Frequently we wish to use nominal-scale variables—such as gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team—in our analysis. These are called qualitative variables.
• To use a qualitative variable in regression analysis, we use a scheme of dummy variables in which one of the two possible conditions is coded 0 and the other 1.
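A minimal sketch of 0/1 dummy coding with pandas; the column names are made up for illustration:

```python
# Dummy coding: one of the two conditions is coded 0, the other 1.
import pandas as pd

homes = pd.DataFrame({"garage": ["yes", "no", "yes", "no"]})
homes["garage_dummy"] = (homes["garage"] == "yes").astype(int)  # yes -> 1, no -> 0
print(homes)
```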
Qualitative Variable - Example
Suppose in the Salsberry Realty example that the independent variable “garage” is added. For those homes without an attached garage, 0 is used; for homes with an attached garage, a 1 is used. We will refer to the “garage” variable as X4. The data shown on the table are entered into the MINITAB system.
Qualitative Variable - Minitab
[Minitab regression output including the garage dummy variable X4]
Using the Model for Estimation
What is the effect of the garage variable? Suppose we have two houses exactly alike next to each other in Buffalo, New York; one has an attached garage, and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Buffalo is 20 degrees.
For the house without an attached garage, a 0 is substituted for X4 in the regression equation. The estimated heating cost is $280.90.
For the house with an attached garage, a 1 is substituted for X4 in the regression equation. The estimated heating cost is $358.30.
Testing the Model for Significance
• We have shown the difference between the two types of homes (with/without garage) to be $77.40, but is the difference significant?
• We conduct the following test of hypothesis:
H0: βi = 0
H1: βi ≠ 0
Reject H0 if the p-value < .05.
Conclusion: if the regression coefficient is not zero, the independent variable garage should be included in the analysis.
Practice
• Use employee data.sav
  – Analyze → Regression → Linear
• Dependent variable: Current salary
• Independent variables: beginning salary; Months since hired; Minority classification (qualitative data)
• In the “Statistics” box, choose “Collinearity diagnostics”
• In “Residuals”, choose “Durbin-Watson”
End of Chapter