Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Confidence interval wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Transcript
13- 1
Chapter
Thirteen
McGraw-Hill/Irwin
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved.
Chapter Thirteen
13- 2
Linear Regression and Correlation
GOALS
When you have completed this chapter, you will be able to:
ONE
Draw a scatter diagram.
TWO
Understand and interpret the terms dependent variable and
independent variable.
THREE
Calculate and interpret the coefficient of correlation, the coefficient
of determination, and the standard error of estimate.
FOUR
Conduct a test of hypothesis to determine if the population
coefficient of correlation is different from zero.
Goals
Chapter Thirteen
13- 3
continued
Linear Regression and Correlation
GOALS
When you have completed this chapter, you will be able to:
FIVE
Calculate the least squares regression line and interpret the slope
and intercept values.
SIX
Construct and interpret a confidence interval and prediction
interval for the dependent variable.
SEVEN
Set up and interpret an ANOVA table.
Goals
13- 4
Correlation Analysis is a group of statistical techniques to
measure the association between two variables.
Advertising Minutes and $ Sales
The Dependent Variable
is the variable being
predicted or estimated.
30
Sales ($thousands)
A Scatter Diagram
is a chart that portrays
the relationship between
two variables.
25
20
15
10
5
0
70
90
110
130
150
170
190
Advertising Minutes
The Independent
Variable provides the
basis for estimation. It
is the predictor variable.
Correlation Analysis
13- 5
The Coefficient of Correlation (r) is a measure of the
strength of the relationship between two variables.
Also called Pearson’s r and
It requires interval or ratioPearson’s product moment
scaled data.
correlation coefficient.
Pearson's r
It can range from
-1.00 to 1.00.
Values of -1.00 or 1.00
indicate perfect and strong
-1
correlation.
0
1
Negative values indicate an
Values close to 0.0 indicate
inverse relationship and
weak correlation.
positive values indicate a
The Coefficient of Correlation, r
direct relationship.
13- 6
Y
10
9
8
7
6
5
4
3
2
1
0
0
1
2
3
4
5
X
6
7
8
9
10
Perfect Negative Correlation
13- 7
Y
10
9
8
7
6
5
4
3
2
1
0
0
1
2
3
4
5
X
6
7
8
9
10
Perfect Positive Correlation
13- 8
Y
10
9
8
7
6
5
4
3
2
1
0
0
1
2
3 4
5
X
6
7
8
9
10
Zero Correlation
13- 9
Y
10
9
8
7
6
5
4
3
2
1
0
0
1
2
3
4
5
X
6
7
8
9
10
Strong Positive Correlation
13- 10
We calculate the coefficient of correlation from the
following formula.
r=
S(X – X)(Y – Y)
(n-1)sxsy
Formula for r
13- 11
The coefficient of determination (r2) is the
proportion of the total variation in the dependent variable
(Y) that is explained or accounted for by the variation in
the independent variable (X).
It is the square of the coefficient of correlation.
It ranges from 0 to 1.
It does not give any information on the direction
of the relationship between the variables.
Coefficient of Determination
Dan Ireland, the student body
president at Toledo State
University, is concerned about
the cost to students of
textbooks. He believes there is
a relationship between the
number of pages in the text and
the selling price of the book.
To provide insight into the
problem he selects a sample of
eight textbooks currently on
sale in the bookstore. Draw a
scatter diagram. Compute the
correlation coefficient.
13- 12
Example 1
13- 13
Book
Page
Price($)
Introduction to History
500
84
Basic Algebra
700
75
Introduction to Psychology
800
99
Introduction to Sociology
600
72
Business Management
400
69
Introduction to Biology
500
81
Fundamentals of Jazz
600
63
Principles of Nursing
800
93
Example 1 continued
13- 14
Scatter Diagram of Number of Pages and Selling Price of Text
100
90
Price ($)
80
70
60
400
500
600
700
800
Page
Example 1 continued
13- 15
Page
E
X
C
E
L
Price
Mean
625 Mean
79.50
Standard Error
49 Standard Error
4.32
Median
600 Median
78
Mode
600 Mode
#N/A
Standard Deviation
139 Standard Deviation 12.21
Sample Variance
19,286 Sample Variance
149.14
Kurtosis
-0.55 Kurtosis
-0.77
Skewness
-0.16 Skewness
0.40
Range
400 Range
36
Minimum
400 Minimum
63
Maximum
800 Maximum
99
Sum
5,000 Sum
636
Count
8 Count
8
Example 1 continued
13- 16
S(X – X)(Y – Y)
(a)
Page
500
700
800
600
400
600
600
800
(b)
Price
84
75
99
72
69
81
63
93
(c)
(d)
Page - Mean(Page) Price-Mean(Price)
-125
4.5
75
-4.5
175
19.5
-25
-7.5
-225
-10.5
-25
1.5
-25
-16.5
175
13.5
(c)*(d)
.
(562.5)
(337.5)
3,412.5
187.5
2,362.5
(37.5)
412.5
2,362.5
7,800.0
Example 1 continued
13- 17
r=
S(X – X)(Y – Y)
(n-1)sxsy
7800
= 7(138.87)(12.21)
= .657
r2 = .6572 = .432
The correlation between the number of pages and the
selling price of the book is 0.657. This indicates a
moderate association between the variable.
Example 1 continued
13- 18
Did a computed r come from a
population of paired observations
with zero correlation?
Ho: r = 0 (The correlation in the population is zero.)
H1: r ≠ 0 (The correlation in the population is
different from zero.
t test for the
coefficient of
correlation
t = r  n- 2 With n-2 d.f.
 1- r2
T-test of significance of r
13- 19
Computed r = .657. Test the hypothesis that
there is no correlation in the population. Use a
.02 significance level.
Step 1
H0: the correlation in the
population is zero.
H1:The correlation in the
population is not zero.
Step 3
The statistic to
use follows the
t distribution.
Step 2
Significance
level is .02.
Step 4
H0 is rejected if
t>3.143 or if t<-3.143
or if p < .02. There are
6 degrees of freedom,
found by n – 2 = 8 – 2 = 6.
13- 20
Step 5
Find the value of the
test statistic.
t = r  n- 22 = .657  8 – 2 = 2.135
 1- r
 1 - .6572
p(t > 2.135) = .077
H0 is not rejected. We cannot
reject the hypothesis that there is
no correlation in the population.
The amount of association could
be due to chance.
Example 1 continued
13- 21
In Regression Analysis we use the independent
variable (X) to estimate the dependent variable (Y).
The relationship
between the
variables is linear.
Both variables must
be at least interval
scale.
The least squares criterion
is used to determine the
equation. That is the term
S(Y – Y’)2 is minimized.
Regression Analysis
The regression equation is Y’= a + bX
13- 22
where
Y’ is the average predicted value of Y for any X.
a is the Y-intercept.
It is the estimated Y value when X=0
b is the slope of the line, or the average change
in Y’ for each change of one unit in X
The least squares principle is used to obtain a
and b.
Regression Analysis
13- 23
The least squares principle is used to obtain a and
b. The equations to determine a and b are:
sy
b=r
sx
a = Y – bX
Regression Analysis
13- 24
Develop a regression
equation for the
information given in
example 1 that can be
used to estimate the
selling price based on
the number of pages.
sy
b=r
sx
12.21
= (.657)
138.87
= .0578
a = Y – bX = 79.5 - .0578*625 = 43.39
Example 1 revisited
13- 25
The regression equation
is:
Y’ = 43.39 + .0578X
The slope of the line is
.0578. Each addition
page costs about a
nickel.
The sign of the b
value and the sign
of r will always be
the same.
The equation crosses
the Y-axis at $43.39. A
book with no pages
would cost $43.39.
Example 1 revisited
13- 26
We can use the
regression equation
to estimate values
of Y.
The estimated selling price of an
800 page book is $89.61, found by
Price = $43.39 + .0578(Number of Pages)
= $43.39 + .0578(800)
= $89.61
Example 1 revisited
13- 27
The Standard Error of Estimate measures the
scatter, or dispersion, of the observed values around the
line of regression
The formula
that is used to
compute the
standard error:
sy x =

 (Y-Y')2
n-2
The Standard Error of Estimate
13- 28
Find the standard error of estimate for the problem involving
the number of pages in a book and the selling price.
Deviation
Estimated
Actual price
price
Deviation Squared
2
(Y)
(Y')
(Y-Y')
(Y-Y')
84
72.28
11.72
137.41
75
83.83
-8.83
78.03
99
89.61
9.39
88.15
72
78.06
-6.06
36.67
69
66.50
2.50
6.25
81
78.06
2.94
8.67
63
78.06
-15.06
226.67
93
89.61
3.39
11.48
0.00
593.33
sy x =

 (Y-Y')2
n-2
=

593.33
8-2
= 9.944
Example 1 revisited
13- 29
Assumptions Underlying Linear Regression
For each value of X, there is a group of Y values, and
these Y values are normally distributed.
The Y values are statistically independent. This means
that in the selection of a sample, the Y values chosen for
a particular X value do not depend on the Y values for
any other X values.
The means of these normal
distributions of Y values all
lie on the straight line of
regression.
The standard deviations
of these normal
distributions are the same.
13- 30
The confidence interval for the mean value of Y for a
given value of X is given by:

1 + (X-X)2
Y' + t(sy x)
n  (X-X)2
Y’is the predicted value for any selected X value
X is an selected value of X
X is the mean of the Xs
n is the number of observations
Sy.x is the standard error of the estimate
t is the value of t at n-2 degrees of freedom
Confidence Interval
13- 31
For our earlier price estimate of $89.61, the
confidence interval, assuming a desired 95%
confidence, is calculated as follows.
Page - Mean(Page) (Page - Mean(Page))2
-125
15625
75
5625
175
30625
-25
625
-225
50625
-25
625
-25
625
175
30625
135000
Example 1
Y’ the predicted value, is $89.61
13- 32
X is 800 pages
X is 625, the mean of the pages
n is 8, the number of observations
Sy.x is 9.944, the standard error of the estimate
t is 2.447 at 8-2 degrees of freedom and 95% confidence
89.61 + 9.944(2.447)

1
(800 - 625)2
+
8
135,000
=89.61 + 14.433

1 + (X-X)2
Y' + t(sy x)
n  (X-X)2
Example 1 revisited
13- 33
The prediction interval for an individual value
of Y for a given value of X
Y' + t(sy x)

1 + (X-X)2
1+
n  (X-X)2

1
(800 - 625)2
89.61 + 9.944(2.447) 1 + 8 + 135,000
=89.61 + 28.292
Prediction Interval
13- 34
Summarizing The Results
 The estimated selling price for a book with 800 pages
is $89.61.
 The standard error of estimate is $9.94.
 The 95 percent confidence interval for all books with
800 pages is $89.61 + $14.43. This means the limits
are between $75.18 and $104.04.
 The 95 percent prediction interval for a particular
book with 800 pages is $89.61+ $28.29. The means
the limits are between $61.32 and $117.90.
 These results appear in the following Minitab and
Excel outputs.
Example 1 revisited
13- 35
Regression Analysis
The regression equation is
Price = 43.4 + 0.0578 No of Pages
Predictor
Constant
No of Pages
S = 9.944
Coef
43.39
0.05778
R-Sq = 43.2%
Analysis of Variance
Source
DF
SS
Regression 1
450.67
Error
6
593.33
Total
7 1044.00
StDev
17.28
0.02706
T
2.51
2.13
P
0.046
0.077
R-Sq(adj) = 33.7%
MS
450.67
98.89
F
4.56
P
0.077
Fit StDev Fit
95.0% CI
95.0% PI
89.61 5.90 ( 75.17, 104.05) ( 61.31, 117.91)
Example 1 revisited
M
I
N
I
T
A
B
13- 36
Regression Statistics
Multiple R
0.657
R Square
0.432
Adjusted R Square
0.337
Standard Error
9.944
Observations
8
ANOVA
df
SS
MS
F
Significance
F
Regression
1
450.67
450.67
4.5573034
0.0767
Residual
6
593.33
98.89
Total
7
1044
Intercept
Page
Coefficients
43.3889
0.0578
Standard
Error
17.277
0.027
t Stat
2.511
2.135
E
X
C
E
L
P-value
0.0458193
0.0767009
Example 1 revisited