Chapter 24 ~
Linear Correlation & Regression Analysis
[Figure: scatter diagram of Waist Size (vertical axis) versus Weight (horizontal axis)]
1
Chapter Goals
• More detailed look at linear correlation and
regression analysis
• Develop a hypothesis test to determine the
strength of a linear relationship
• Consider the line of best fit. Use this to make
confidence interval estimations.
2
Linear Correlation Analysis
The coefficient of linear correlation, r, is a measure
of the strength of a linear relationship
Consider another measure of dependence: covariance
Recall: bivariate data - ordered pairs of numerical
values
3
Derivation of the Covariance
Goal: a measure of the linear relationship between two variables
Consider the following set of bivariate data:
{(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)}
$\bar{x} = 10.25$, $\bar{y} = 20.50$
Consider a graph of the data:
1. The point $(\bar{x}, \bar{y})$ is the centroid of the data
2. A vertical and a horizontal line through the centroid divide
the graph into four sections
4
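Graph of the Data with Centroid
[Figure: scatter diagram of the data (x on the horizontal axis, y on the vertical axis) with the centroid $(\bar{x}, \bar{y}) = (10.25, 20.5)$ marked and the deviations $(x - \bar{x})$ and $(y - \bar{y})$ labeled]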
5
Notes
1. Each point (x, y) lies a certain distance from each of the two lines
2. $(x - \bar{x})$: the horizontal distance from (x, y) to the vertical line passing
through the centroid
3. $(y - \bar{y})$: the vertical distance from (x, y) to the horizontal line passing
through the centroid
4. The distances may be positive, negative, or zero
5. Consider the product: $(x - \bar{x})(y - \bar{y})$
a. If the graph has lots of points to the upper right and lower left of
the centroid (positive linear relationship), most products will be
positive
b. If the graph has lots of points to the upper left and lower right of
the centroid (negative linear relationship), most products will be
negative
6
Covariance of x and y
The covariance of x and y is defined as the sum of the products of
the distances of all values x and y from the centroid, divided by $n - 1$:
$$\operatorname{covar}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Note: $\sum (x - \bar{x}) = 0$ and $\sum (y - \bar{y}) = 0$, always!
7
Calculations for Finding Covar (x, y)
Points      $x - \bar{x}$   $y - \bar{y}$   $(x - \bar{x})(y - \bar{y})$
(8, 22)     -2.25           1.5             -3.375
(5, 28)     -5.25           7.5             -39.375
(8, 18)     -2.25           -2.5            5.625
(4, 16)     -6.25           -4.5            28.125
(13, 27)    2.75            6.5             17.875
(15, 23)    4.75            2.5             11.875
(17, 17)    6.75            -3.5            -23.625
(12, 13)    1.75            -7.5            -13.125
Total       0.00            0.0             -16.000
$$\operatorname{covar}(x, y) = \frac{-16}{7} = -2.2857$$
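The table can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the names `data` and `covariance` are chosen here for illustration and are not part of the slides.

```python
# Bivariate data from the "Derivation of the Covariance" slide
data = [(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)]

def covariance(pairs):
    """Sample covariance: sum of products of deviations from the centroid, over n - 1."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    products = [(x - x_bar) * (y - y_bar) for x, y in pairs]
    return sum(products) / (n - 1)

covar_xy = covariance(data)
print(round(covar_xy, 4))  # -2.2857
```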
8
Data & Covariance
Positive covariance:
[Figure: scatter diagram of data with positive covariance, centroid $(\bar{x}, \bar{y})$ marked]
9
Data & Covariance
Negative covariance:
[Figure: scatter diagram of data with negative covariance, centroid $(\bar{x}, \bar{y})$ marked]
10
Data & Covariance
Covariance near 0:
[Figure: scatter diagram of data with covariance near 0, centroid $(\bar{x}, \bar{y})$ marked]
11
Problems
1. The covariance does not have a standardized unit of measure
2. Suppose we multiply each x- and y-value in the example in this
section by 15
The covariance of the new data set is -514.29, which is $15^2 = 225$ times the original -2.2857
3. The dependency between x and y seems
stronger, but the relationship is really the same
4. We must find a way to eliminate the effect of the spread of
the data when we measure the strength of a linear
relationship
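The effect of spread on the covariance can be reproduced directly. The sketch below is illustrative only (the helper name `covariance` is an assumption, not from the slides): it rescales every x and y by 15 and compares the two covariances.

```python
data = [(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)]

def covariance(pairs):
    """Sample covariance with the n - 1 divisor."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in pairs) / (n - 1)

# Multiply every coordinate by 15: the pattern of the points is unchanged
scaled = [(15 * x, 15 * y) for x, y in data]

original = covariance(data)
rescaled = covariance(scaled)
print(round(original, 4))             # -2.2857
print(round(rescaled, 2))             # -514.29
print(round(rescaled / original, 6))  # 225.0, i.e. 15 * 15
```

The covariance grows by the product of the two scale factors even though the scatter diagram has the same shape, which is why a standardized measure is needed.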
12
Solution
1. Standardize x and y:
$$x' = \frac{x - \bar{x}}{s_x} \quad \text{and} \quad y' = \frac{y - \bar{y}}{s_y}$$
2. Compute the covariance of $x'$ and $y'$
3. This covariance is not affected by the spread of the data
4. This is exactly what is accomplished by the coefficient of linear
correlation:
$$r = \operatorname{covar}(x', y') = \frac{\operatorname{covar}(x, y)}{s_x \cdot s_y}$$
13
Notes
1. The coefficient of linear correlation standardizes the measure of
dependency and allows us to compare the relative strengths of
dependency of different sets of data
2. Also commonly called Pearson’s product moment correlation coefficient, r
Calculation of r (for the data presented in this section):
$s_x = 4.71$ and $s_y = 5.37$
$$r = \frac{\operatorname{covar}(x, y)}{s_x \cdot s_y} = \frac{-2.2857}{(4.71)(5.37)} = -0.0904$$
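As a hedged illustration of the standardizing step, the sketch below computes r as the covariance of the standardized values, using Python's `statistics` module for the sample mean and standard deviation. Variable names are illustrative. Carrying full precision gives -0.0903; the slide's -0.0904 comes from rounding $s_x$ and $s_y$ to two decimals before dividing.

```python
from statistics import mean, stdev

data = [(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)]
xs = [x for x, _ in data]
ys = [y for _, y in data]
n = len(data)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)

# Standardize each variable: subtract the mean, divide by the standard deviation
x_std = [(x - x_bar) / s_x for x in xs]
y_std = [(y - y_bar) / s_y for y in ys]

# Covariance of the standardized values (their means are 0, so no re-centering is needed)
r = sum(a * b for a, b in zip(x_std, y_std)) / (n - 1)

print(round(s_x, 2), round(s_y, 2))  # 4.71 5.37
print(round(r, 4))                   # -0.0903
```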
14
Alternative (Computational) Formula for r
$$r = \frac{\operatorname{covar}(x, y)}{s_x \cdot s_y} = \frac{\dfrac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}}{s_x \cdot s_y} = \frac{SS(xy)}{\sqrt{SS(x) \cdot SS(y)}}$$
1. This formula avoids the separate calculations of the means,
standard deviations, and the deviations from the means
2. This formula is easier and more accurate: minimizes
round-off error
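A minimal sketch of the computational formula, assuming the usual shortcut definitions $SS(x) = \sum x^2 - (\sum x)^2/n$, $SS(y) = \sum y^2 - (\sum y)^2/n$ and $SS(xy) = \sum xy - (\sum x)(\sum y)/n$. Only the five sums are needed; variable names are illustrative.

```python
from math import sqrt

data = [(8, 22), (5, 28), (8, 18), (4, 16), (13, 27), (15, 23), (17, 17), (12, 13)]
n = len(data)

# The five sums that the computational formula needs
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_y2 = sum(y * y for _, y in data)
sum_xy = sum(x * y for x, y in data)

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

r = ss_xy / sqrt(ss_x * ss_y)
print(ss_x, ss_y, ss_xy)  # 155.5 202.0 -16.0
print(round(r, 4))        # -0.0903
```

No intermediate value is rounded, so the result does not pick up the round-off seen when $s_x$ and $s_y$ are rounded first.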
15
Inferences About
the Linear Correlation Coefficient
• Use the calculated value of the coefficient of linear
correlation, r*, to make an inference about the
population correlation coefficient, $\rho$
• Consider a confidence interval for $\rho$ and a hypothesis
test concerning $\rho$
16
Assumptions...
Assumptions for inferences about linear correlation coefficient:
The set of (x, y) ordered pairs forms a random sample and the
y-values at each x have a normal distribution. Inferences use the
t-distribution with $n - 2$ degrees of freedom.
Caution:
The inferences about the linear correlation coefficient are about the
pattern of behavior of the two variables involved and the usefulness
of one variable in predicting the other. Significance of the linear
correlation coefficient does not mean there is a direct cause-and-effect relationship.
17
Confidence Interval Procedure
1. A confidence interval may be used to estimate the value
of the population correlation coefficient, $\rho$
2. Use a table showing confidence belts
3. Table 10, Appendix B: confidence belts for 95%
confidence intervals
4. Table 10 utilizes n, the sample size
18
Example
Example: A random sample of 25 ordered pairs of data has a calculated
value of r = 0.45. Find a 95% confidence interval for $\rho$, the
population linear correlation coefficient.
Solution:
1. Population Parameter of Concern
The linear correlation coefficient for the population, $\rho$
2. The Confidence Interval Criteria
a. Assumptions: The ordered pairs form a random sample, and for each
x, the y-values have a mounded distribution
b. Test statistic: The calculated value of r
c. Confidence level: $1 - \alpha = 0.95$
3. Sample Evidence
n = 25 and r = 0.45
19
Solution Continued
4. The Confidence Interval
The confidence interval is read from Table 10, Appendix B
Find r = 0.45 at the bottom of Table 10
Visualize a vertical line through that point
Find the two points where the belts marked for the correct
sample size cross the vertical line
Draw a horizontal line through each point to the vertical
scale on the left and read the confidence interval
The values are 0.68 and 0.12
5. The Results
0.68 to 0.12 is the 95% confidence interval for r
20
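Table 10
The numbers on the curves are sample sizes.
[Figure: confidence belts from Table 10. Horizontal axis: scale of r (sample correlation); vertical axis: scale of $\rho$ (population correlation coefficient). A vertical line at r = 0.45 meets the belts for the sample size at 0.12 and 0.68.]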
21
Hypothesis Testing Solution
1. Null hypothesis: the two variables are linearly unrelated,
$\rho = 0$
2. Alternative hypothesis: one- or two-tailed, usually $\rho \neq 0$
3. Test statistic: calculated value of r
4. Probability bounds or critical values for r: Table 11,
Appendix B
5. Number of degrees of freedom for the r-statistic: $n - 2$
22
Example
Example: In a study of 32 randomly selected ordered pairs,
r = 0.421. Is there any evidence to suggest the linear
correlation coefficient is different from 0 at the 0.05
level of significance?
Solution:
1. The Set-up
a. Population parameter of concern: The linear correlation
coefficient for the population, $\rho$
b. The null and the alternative hypothesis:
Ho: $\rho = 0$
Ha: $\rho \neq 0$
23
Solution Continued
2. The Hypothesis Test Criteria
a. Assumptions: The ordered pairs form a random sample
and we will assume that the y-values at each x have a
mounded distribution
b. Test statistic:
r* (calculated value of r) with df = $32 - 2 = 30$
c. Level of significance: $\alpha = 0.05$
3. The Sample Evidence
n = 32 and r* = r = 0.421
24
Solution Continued
4. The Probability Distribution (p-Value Approach)
a. The p-value: Use Table 11: 0.01 < P < 0.02
b. The p-value is smaller than the level of significance, $\alpha$
~ or ~
4. The Probability Distribution (Classical Approach)
a. Critical Value: The critical value is found at the intersection of
the df = 30 row and the two-tailed 0.05 column of Table 11:
0.349
b. r* is in the critical region
5. The Results
a. Decision: Reject Ho
b. Conclusion: At the 0.05 level of significance, there is
evidence to suggest x and y are correlated
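Table 11 is not reproduced in this transcript. Under the null hypothesis $\rho = 0$, a standard equivalent is to convert r into a t statistic with $n - 2$ degrees of freedom, $t = r\sqrt{n - 2}/\sqrt{1 - r^2}$. The sketch below assumes that conversion, which is not shown on the slides. The function names are illustrative, and 2.042 is the two-tailed 0.05 critical value of t for df = 30 taken from a standard t-table.

```python
from math import sqrt

def r_to_t(r, n):
    """t statistic equivalent to a sample correlation r when testing rho = 0."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def critical_r(t_crit, df):
    """Critical value of r implied by a critical value of t with df degrees of freedom."""
    return t_crit / sqrt(t_crit ** 2 + df)

n, r_star = 32, 0.421
t_star = r_to_t(r_star, n)

t_crit = 2.042  # assumed from a t-table: two-tailed alpha = 0.05, df = 30
r_crit = critical_r(t_crit, n - 2)

print(round(t_star, 2))      # 2.54
print(round(r_crit, 3))      # 0.349
print(abs(r_star) > r_crit)  # True, so Ho is rejected
```

The implied critical value of r agrees with the 0.349 read from Table 11, and the decision is the same either way.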
25
Linear Regression Analysis
• Line of best fit results from an analysis of two (or
more) related variables
• Try to predict the value of the dependent, or
output, variable
• The variable we control is the independent, or
input, variable
26
Method of Least Squares
The line of best fit: $\hat{y} = b_0 + b_1 x$
The slope: $b_1 = \dfrac{SS(xy)}{SS(x)}$
The y-intercept: $b_0 = \dfrac{1}{n}\left(\sum y - b_1 \sum x\right)$
Notes:
1. A scatter diagram may suggest curvilinear regression
2. If two or more input variables are used: multiple regression
27
Linear Model
The Linear Model: $y = \beta_0 + \beta_1 x + \varepsilon$
This equation represents the linear relationship between the two variables
in a population
$\beta_0$: The y-intercept, estimated by $b_0$
$\beta_1$: The slope, estimated by $b_1$
$\varepsilon$: Experimental error, estimated by $e = y - \hat{y}$
The random variable e is called the residual
e is the difference between the observed value of y and the predicted
value of y at a given x
The sum of the residuals is exactly zero
Mean value of experimental error is zero: $\mu_\varepsilon = 0$
Variance of experimental error: $\sigma_\varepsilon^2$
28
Estimating the Variance of the Experimental Error
Assumption: The distribution of y’s is approximately normal and
the variances of the distributions of y at all values of x are the same
(The standard deviation of the distribution of y about $\hat{y}$ is the same
for all values of x)
Consider the sample variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$
1. The variance of y involves an additional complication: there is a
different mean for y at each value of x
2. Each “mean” is actually the predicted value, $\hat{y}$
3. Variance of the error e estimated by: $s_e^2 = \dfrac{\sum (y - \hat{y})^2}{n - 2}$
Degrees of freedom: $n - 2$
29
Alternative (Computational) Formula
for Variance of Experimental Error
Rewriting $s_e^2$:
$$s_e^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{\sum (y - b_0 - b_1 x)^2}{n - 2} = \frac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2} = \frac{SSE}{n - 2}$$
SSE = sum of squares for error
30
Example
Example: A recent study was conducted to determine the relation
between advertising expenditures and sales of statistics
texts (for the first year in print). The data are given
below (in thousands). Find the line of best fit and the
variance of y about the line of best fit.
Adv. Costs (x)   Sales (y)   Adv. Costs (x)   Sales (y)
40               289         60               470
55               423         52               408
35               250         39               320
50               400         47               415
43               335         38               389
31
Solution
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 21677 - \frac{(459)^2}{10} = 608.9$$
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 174163 - \frac{(459)(3699)}{10} = 4378.9$$
$$b_1 = \frac{SS(xy)}{SS(x)} = \frac{4378.9}{608.9} = 7.1915$$
$$b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{3699 - (7.1915)(459)}{10} = 39.8105$$
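The same sums can be formed programmatically. The following is a minimal sketch of the least-squares computation for the advertising data. The list names `adv_costs` and `sales` are illustrative, and the (x, y) pairs are read row by row from the data table.

```python
# Advertising costs (x) and first-year sales (y), both in thousands
adv_costs = [40, 60, 55, 52, 35, 39, 50, 47, 43, 38]
sales = [289, 470, 423, 408, 250, 320, 400, 415, 335, 389]
n = len(adv_costs)

sum_x = sum(adv_costs)                                 # 459
sum_y = sum(sales)                                     # 3699
sum_x2 = sum(x * x for x in adv_costs)                 # 21677
sum_xy = sum(x * y for x, y in zip(adv_costs, sales))  # 174163

ss_x = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

b1 = ss_xy / ss_x              # slope
b0 = (sum_y - b1 * sum_x) / n  # y-intercept

print(round(ss_x, 1), round(ss_xy, 1))  # 608.9 4378.9
print(round(b1, 4), round(b0, 4))       # 7.1915 39.8105
```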
32
Solution Continued
• The equation for the line of best fit: $\hat{y} = 39.81 + 7.19x$
• The variance of y about the regression line:
$$s_e^2 = \frac{\sum y^2 - b_0 \sum y - b_1 \sum xy}{n - 2} = \frac{(1410485) - (39.81)(3699) - (7.1915)(174163)}{8} = \frac{10734.5955}{8} = 1341.8244$$
Note: Extra decimal places are often needed for this type of
calculation
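The note about extra decimal places can be illustrated numerically. This sketch (variable names are assumptions for illustration) computes $s_e^2$ three ways: from the residuals, from the computational formula with unrounded coefficients, and from the same formula with both coefficients rounded to two decimals.

```python
adv_costs = [40, 60, 55, 52, 35, 39, 50, 47, 43, 38]
sales = [289, 470, 423, 408, 250, 320, 400, 415, 335, 389]
n = len(adv_costs)

sum_x, sum_y = sum(adv_costs), sum(sales)
sum_x2 = sum(x * x for x in adv_costs)
sum_y2 = sum(y * y for y in sales)
sum_xy = sum(x * y for x, y in zip(adv_costs, sales))

# Unrounded least-squares coefficients
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = (sum_y - b1 * sum_x) / n

# SSE from the definition: sum of squared residuals
sse_definition = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(adv_costs, sales))
# SSE from the computational formula, full precision
sse_shortcut = sum_y2 - b0 * sum_y - b1 * sum_xy
# Same formula with both coefficients rounded to two decimals
sse_rounded = sum_y2 - 39.81 * sum_y - 7.19 * sum_xy

print(round(sse_definition / (n - 2), 2))  # 1341.76
print(round(sse_shortcut / (n - 2), 2))    # 1341.76
print(round(sse_rounded / (n - 2), 2))     # 1374.48
```

The hand calculation's 1341.8244 uses $b_0 = 39.81$ and $b_1 = 7.1915$ and lands close to the full-precision value. Rounding the slope to 7.19 shifts the estimate noticeably, because the formula subtracts large, nearly equal quantities.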
33
Illustration
• Scatter diagram, regression line, and random errors as line segments:
[Figure: scatter diagram of Sales (vertical axis) versus Advertising Costs (horizontal axis), with the regression line and the random errors drawn as line segments]
34
Minitab Output
Regression Analysis
The regression equation is
C2 = 39.8 + 7.19 C1

Predictor   Coef    StDev   T      P
Constant    39.81   69.11   0.58   0.580
C1          7.191   1.484   4.84   0.001

S = 36.63   R-Sq = 74.6%   R-Sq(adj) = 71.4%

Analysis of Variance
Source           DF   SS      MS      F       P
Regression       1    31491   31491   23.47   0.001
Residual Error   8    10734   1342
Total            9    42225
35
Inferences Concerning
the Slope of the Regression Line
• Confidence Interval for $\beta_1$: $1 - \alpha$ confidence interval
estimate for the population slope of the line of best
fit
• Hypothesis Test for $\beta_1$: Tests the null hypothesis,
$\beta_1 = 0$, the slope of the line of best fit is equal to 0,
that is, the line is of no use in predicting y for a given
value of x
36
Sampling Distribution of the Slope b1
Assume: Random samples of size n are repeatedly taken from a
bivariate population
1. $b_1$ has a sampling distribution that is approximately normal
2. The mean of $b_1$ is $\beta_1$
3. The variance of $b_1$ is: $\sigma_{b_1}^2 = \dfrac{\sigma_\varepsilon^2}{\sum (x - \bar{x})^2}$
provided there is no lack of fit
37
Standard Error of Regression
Estimator for $\sigma_{b_1}^2$:
$$s_{b_1}^2 = \frac{s_e^2}{\sum (x - \bar{x})^2} = \frac{s_e^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}} = \frac{s_e^2}{SS(x)}$$
The standard error of regression (slope) is $\sigma_{b_1}$
and is estimated by $s_{b_1}$
Example (continued): For the advertising costs and sales data:
$$s_{b_1}^2 = \frac{s_e^2}{SS(x)} = \frac{1341.8244}{608.9} = 2.2037$$
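A short numerical check, taking $s_e^2$ and $SS(x)$ as given in the worked example (variable names are illustrative). The square root of the variance estimate is the standard error of the slope, which matches the StDev of 1.484 reported for C1 in the Minitab output.

```python
from math import sqrt

s_e2 = 1341.8244  # variance of y about the line, from the worked example
ss_x = 608.9      # SS(x), from the worked example

var_b1 = s_e2 / ss_x  # estimated variance of the slope
se_b1 = sqrt(var_b1)  # estimated standard error of the slope

print(round(var_b1, 4))  # 2.2037
print(round(se_b1, 4))   # 1.4845
```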
38
Inferences About Slope Continued
Assumptions for inferences about the slope parameter $\beta_1$:
The set of (x, y) ordered pairs forms a random sample and the y-values
at each x have a normal distribution. Since the population
standard deviation is unknown and replaced with the sample
standard deviation, the t-distribution will be used with $n - 2$ degrees
of freedom.
Confidence Interval Procedure:
The $1 - \alpha$ confidence interval for $\beta_1$ is given by $b_1 \pm t(n - 2, \alpha/2) \cdot s_{b_1}$
39
Example
Example: Find the 95% confidence interval for the population
slope $\beta_1$ for the advertising costs and sales example
Solution:
1. Population Parameter of Interest
The slope, $\beta_1$, for the line of best fit for the population
2. The Confidence Interval Criteria
a. Assumptions: The ordered pairs form a random sample and we will
assume the y-values (sales) at each x (advertising costs) have a
mounded distribution
b. Test statistic: t with df = $10 - 2 = 8$
c. Confidence level: $1 - \alpha = 0.95$
3. Sample Evidence
Sample information: $n = 10$, $b_1 = 7.1915$, $s_{b_1}^2 = 2.2037$
40
Solution Continued
4. The Confidence Interval
a. Confidence coefficients:
$t(df, \alpha/2) = t(8, 0.025) = 2.31$
b. Interval:
$$b_1 \pm t(n - 2, \alpha/2) \cdot s_{b_1} = 7.1915 \pm (2.31)\sqrt{2.2037} = 7.1915 \pm (2.31)(1.4845) = 7.1915 \pm 3.4292$$
3.762 to 10.621
5. The Results
The slope of the line of best fit of the population from
which the sample was drawn is between 3.762 and 10.621
with 95% confidence
41
Hypothesis-Testing Procedure
1. Null hypothesis is always Ho: $\beta_1 = 0$
2. Use the Student's t distribution with df = $n - 2$
3. The test statistic: $t^* = \dfrac{b_1 - \beta_1}{s_{b_1}}$
42
Example
Example: In the previous example, is the slope for the line of best
fit significant enough to show that advertising cost is
useful in predicting the first year sales? Use $\alpha = 0.05$
Solution:
1. The Set-up
a. Population parameter of concern: The parameter of concern is
$\beta_1$, the slope of the line of best fit for the population
b. The null and alternative hypothesis:
Ho: $\beta_1 = 0$ (x is of no use in predicting y)
Ha: $\beta_1 > 0$ (we expect sales to increase as costs increase)
43
Solution Continued
2. The Hypothesis Test Criteria
a. Assumptions: The ordered pairs form a random sample
and we will assume the y-values (sales) at each x
(advertising costs) have a mounded distribution
b. Test statistic: t* with df = $n - 2 = 8$
c. Level of significance: $\alpha = 0.05$
3. The Sample Evidence
a. Sample information: $n = 10$, $b_1 = 7.1915$, $s_{b_1}^2 = 2.2037$
b. Calculate the value of the test statistic:
$$t^* = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{7.1915 - 0.0}{\sqrt{2.2037}} = 4.8444$$
44
Solution Continued
4. The Probability Distribution (p-Value Approach)
a. The p-value: P = P(t* > 4.8444, with df = 8) < 0.001
b. The p-value is smaller than the level of significance, $\alpha$
~ or ~
4. The Probability Distribution (Classical Approach)
a. Critical value: t(8, 0.05) = 1.86
b. t* is in the critical region
5. The Results
a. Decision: Reject Ho
b. Conclusion: At the 0.05 level of significance, there is evidence to
suggest the slope of the line of best fit is greater than
zero. The evidence indicates there is a linear
relationship and that advertising cost (x) is useful in
predicting the first year sales (y).
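The classical-approach decision can be sketched in a few lines. The inputs are the values quoted in the example, the critical value 1.86 is the one given for t(8, 0.05), and the variable names are illustrative.

```python
from math import sqrt

b1 = 7.1915       # sample slope
var_b1 = 2.2037   # estimated variance of the slope
beta1_null = 0.0  # slope claimed by the null hypothesis

# Test statistic: distance of b1 from the hypothesized slope, in standard errors
t_star = (b1 - beta1_null) / sqrt(var_b1)

t_crit = 1.86  # one-tailed critical value t(8, 0.05), as quoted in the example

print(round(t_star, 4))  # 4.8444
print(t_star > t_crit)   # True: t* falls in the critical region, reject Ho
```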
45
Confidence
Interval Estimates for Regression
• Use the line of best fit to make predictions
• Predict the population mean y-value at a given x
• Predict the individual y-value selected at random
that will occur at a given value of x
• The best point estimate, or prediction, for both
is $\hat{y}$
46
Notation & Background
Notation:
1. Mean of the population y-values at a given value of x: $\mu_{y|x_0}$
2. The individual y-value selected at random for a given
value of x: $y_{x_0}$
Background:
1. Recall: the development of confidence intervals for the
population mean $\mu$ when the variance was known and
when the variance was estimated
2. The confidence interval for $\mu_{y|x_0}$ and the prediction interval
for $y_{x_0}$ are constructed in a similar fashion
3. $\hat{y}$ replaces $\bar{x}$ as the point estimate
4. The sampling distribution of $\hat{y}$ is normal
47
Background Continued
5. The standard deviation in both cases is computed by multiplying the
square root of the variance of the error by an appropriate correction
factor
6. The line of best fit passes through the centroid: $(\bar{x}, \bar{y})$
Consider a confidence interval for the slope $\beta_1$
If we draw lines with slopes equal to the extremes of that
confidence interval through the centroid, the value for y
fluctuates considerably for different values of x (See the
Figure on the next slide.)
It is reasonable to expect a wider confidence interval as we consider
values of x further from $\bar{x}$
We need a correction factor to adjust for the distance between $x_0$ and $\bar{x}$
This factor must also adjust for the variation of the y-values about $\hat{y}$
48
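Confidence Interval for Slope
[Figure: scatter diagram of Sales (vertical axis) versus Advertising Costs (horizontal axis) with two lines drawn through the centroid $(\bar{x}, \bar{y})$, one with slope equal to each limit of the confidence interval for the slope]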
49
Confidence Interval
Confidence interval for the mean value of y at a given value
of x, $\mu_{y|x_0}$:
$$\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum (x - \bar{x})^2}} = \hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$$
The product of $s_e$ and the radical is the standard error of $\hat{y}$
Notes:
1. The numerator of the second term under the radical sign is
the square of the distance of $x_0$ from $\bar{x}$
2. The denominator is closely related to the variance of x and
has a standardizing effect on this term
50
Example
Example: It is believed that the amount of nitrogen fertilizer used per
acre has a direct effect on the amount of wheat produced.
The data below shows the amount of nitrogen fertilizer used
per test plot and the amount of wheat harvested per test plot.
a. Find the line of best fit
b. Construct a 95% confidence interval for the mean amount
of wheat harvested for 45 pounds of fertilizer
Pounds of        100 Pounds     Pounds of        100 Pounds
Fertilizer (x)   of Wheat (y)   Fertilizer (x)   of Wheat (y)
30               14             74               20
36               9              76               24
41               18             81               29
49               16             88               35
53               23             93               34
55               17             94               39
60               28             101              28
65               33             109              33
51
Solution
• Using Minitab, the line of best fit: $\hat{y} = 4.42 + 0.298x$
Confidence Interval:
1. Population Parameter of Interest
The mean amount of wheat produced for 45 pounds of fertilizer, $\mu_{y|x=45}$
2. The Confidence Interval Criteria
a. Assumptions: The ordered pairs form a random sample and the
y-values at each x have a mounded distribution
b. Test statistic: t with df = $16 - 2 = 14$
c. Confidence level: $1 - \alpha = 0.95$
3. Sample Information:
$s_e^2 = 25.97$, $s_e = \sqrt{25.97} = 5.096$
$\hat{y}_{x=45}$: $\hat{y} = 4.42 + 0.298(45) = 17.83$
52
Solution Continued
4. The Confidence Interval:
$$\begin{aligned}
\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}
&= 17.83 \pm (2.14)(5.096)\sqrt{\frac{1}{16} + \frac{(45 - 69.06)^2}{8746.94}} \\
&= 17.83 \pm (2.14)(5.096)\sqrt{0.0625 + 0.0662} \\
&= 17.83 \pm (2.14)(5.096)(0.3587) \\
&= 17.83 \pm 3.91
\end{aligned}$$
13.92 to 21.74, 95% confidence interval for $\mu_{y|x=45}$
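The interval can be recomputed from the raw fertilizer data. This is a sketch under stated assumptions: the helper names `fit_line` and `mean_interval` are invented for illustration, the (x, y) pairs are read from the data table in order, and the t-value 2.14 is the one used in the hand calculation rather than one looked up by the code.

```python
from math import sqrt

fertilizer = [30, 36, 41, 49, 53, 55, 60, 65, 74, 76, 81, 88, 93, 94, 101, 109]
wheat = [14, 9, 18, 16, 23, 17, 28, 33, 20, 24, 29, 35, 34, 39, 28, 33]

def fit_line(xs, ys):
    """Least-squares fit; returns (b0, b1, s_e, x_bar, ss_x)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))  # standard deviation of y about the line
    return b0, b1, s_e, x_bar, ss_x

def mean_interval(xs, ys, x0, t_crit):
    """Confidence interval for the mean of y at x0."""
    b0, b1, s_e, x_bar, ss_x = fit_line(xs, ys)
    n = len(xs)
    y_hat = b0 + b1 * x0
    margin = t_crit * s_e * sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - margin, y_hat + margin

lower, upper = mean_interval(fertilizer, wheat, x0=45, t_crit=2.14)
print(round(lower, 2), round(upper, 2))  # 13.92 21.74
```

Calling `mean_interval` with values of `x0` further from the mean of x widens the interval, which is the behavior the correction factor is meant to capture.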
53
Confidence Belts for $\mu_{y|x_0}$
• Confidence interval: green vertical line
• Confidence interval belt: upper and lower boundaries of all 95% confidence
intervals
[Figure: Wheat (vertical axis) versus Fertilizer (horizontal axis) with the line of best fit and the upper and lower boundaries for $\mu_{y|x_0}$]
54
Prediction Interval
Prediction interval of the value of a single randomly selected y:
$$\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}$$
Example: Find the 95% prediction interval for the amount of
wheat harvested for 45 pounds of fertilizer
Solution:
1. Population Parameter of Interest
$y_{x=45}$, the amount of wheat harvested for 45 pounds of
fertilizer
55
Solution Continued
2. The Confidence Interval Criteria
a. Assumptions: The ordered pairs form a random sample
and the y-values at each x have a mounded distribution
b. Test statistic: t with df = $16 - 2 = 14$
c. Confidence level: $1 - \alpha = 0.95$
3. Sample Information
$s_e^2 = 25.97$, $s_e = \sqrt{25.97} = 5.096$
$\hat{y}_{x=45}$: $\hat{y} = 4.42 + 0.298(45) = 17.83$
56
Solution Continued
4. The Confidence Interval
$$\begin{aligned}
\hat{y} \pm t(n - 2, \alpha/2) \cdot s_e \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(x)}}
&= 17.83 \pm (2.14)(5.096)\sqrt{1 + \frac{1}{16} + \frac{(45 - 69.06)^2}{8746.94}} \\
&= 17.83 \pm (2.14)(5.096)\sqrt{1 + 0.0625 + 0.0662} \\
&= 17.83 \pm (2.14)(5.096)\sqrt{1.1287} \\
&= 17.83 \pm (2.14)(5.096)(1.0624) \\
&= 17.83 \pm 11.5859
\end{aligned}$$
6.24 to 29.42, 95% prediction interval for $y_{x=45}$
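A companion sketch for the prediction interval. The only change from the interval for the mean is the extra 1 under the radical, which accounts for the variation of an individual y about the line. As before, the function name `prediction_interval` is illustrative and the t-value 2.14 is taken from the hand calculation. Agreement with the hand calculation is up to rounding in the last digit.

```python
from math import sqrt

fertilizer = [30, 36, 41, 49, 53, 55, 60, 65, 74, 76, 81, 88, 93, 94, 101, 109]
wheat = [14, 9, 18, 16, 23, 17, 28, 33, 20, 24, 29, 35, 34, 39, 28, 33]

def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for a single y-value at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x0
    # The leading 1 is what distinguishes this from the interval for the mean
    margin = t_crit * s_e * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat - margin, y_hat + margin

lower, upper = prediction_interval(fertilizer, wheat, x0=45, t_crit=2.14)
print(round(lower, 2), round(upper, 2))  # 6.24 29.42
```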
57
Prediction Belts for $y_{x_0}$
[Figure: Wheat (vertical axis) versus Fertilizer (horizontal axis) with the line of best fit and the upper and lower boundaries of the 95% prediction interval on individual y-values at any x; $x_0 = 45$ is marked]
58
Precautions
1. The regression equation is meaningful only in the domain of
the x variable studied. Estimation outside this domain is
risky; it assumes the relationship between x and y is the same
outside the domain of the sample data.
2. The results of one sample should not be used to make
inferences about a population other than the one from which
the sample was drawn
3. Correlation (or association) does not imply causation. A
significant regression does not imply x causes y to change.
Most common problem: missing, or third, variable effect.
59
13.6 ~ Understanding the Relationship
Between Correlation & Regression
• We have considered correlation and regression
analysis
• When do we use these techniques?
• Is there any duplication of work?
60
Remarks
1. The primary use of the linear correlation coefficient is in
answering the question “Are these two variables related?”
2. The linear correlation coefficient may be used to indicate the
usefulness of x as a predictor of y (if the linear model is
appropriate)
The test concerning the slope of the regression line
(Ho: $\beta_1 = 0$) tests the same basic concept
3. Lack-of-fit test: Is the linear model appropriate?
Consider the scatter diagram
61
Conclusions
1. Linear correlation and regression measure different
characteristics. It is possible to have a strong linear
correlation and have the wrong model.
2. Regression analysis should be used to answer questions
about the relationship between two variables:
a. What is the relationship?
b. How are the two variables related?
62