Least Squares Regression Line

Chapter 3
LSRL
Bivariate data
• x – variable: is the independent or
explanatory variable
• y- variable: is the dependent or
response variable
• Use x to predict y
$\hat{y} = a + bx$
ŷ (y-hat) means the predicted y. Be sure to put the hat on the y.
b – is the slope
– it is the amount by which y increases when x increases by 1 unit
a – is the y-intercept
– it is the height of the line when x = 0
– in some situations, the y-intercept has no
meaning
Least Squares Regression Line
LSRL
• The line that gives the best fit to
the data set
• The line that minimizes the sum of
the squares of the deviations from
the line
Example: the points (0,0), (3,10), and (6,2) with the line $\hat{y} = .5x + 4$
At x = 0: ŷ = .5(0) + 4 = 4, so y – ŷ = 0 – 4 = -4
At x = 3: ŷ = .5(3) + 4 = 5.5, so y – ŷ = 10 – 5.5 = 4.5
At x = 6: ŷ = .5(6) + 4 = 7, so y – ŷ = 2 – 7 = -5
Sum of the squares = 61.25
What is the sum of the deviations from the line? Will it always be zero?
Use a calculator to find the line of best fit for the points (0,0), (3,10), and (6,2):
$\hat{y} = \frac{1}{3}x + 3$
Find y – ŷ: the deviations are -3 at (0,0), 6 at (3,10), and -3 at (6,2).
Sum of the squares = 54
The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
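The comparison between the two lines can be checked numerically. The following is a minimal sketch in Python (the function names are illustrative, not from the slides) that computes the deviations and the sum of squares for both lines on the three points.

```python
# Three points from the slides: (0, 0), (3, 10), (6, 2)
points = [(0, 0), (3, 10), (6, 2)]

def residuals(points, intercept, slope):
    """Vertical deviations y - y_hat for the line y_hat = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in points]

def sum_of_squares(points, intercept, slope):
    """Sum of the squared deviations from the line."""
    return sum(e ** 2 for e in residuals(points, intercept, slope))

sse_first = sum_of_squares(points, 4, 0.5)     # y_hat = .5x + 4
sse_lsrl = sum_of_squares(points, 3, 1 / 3)    # y_hat = (1/3)x + 3

print(sse_first)                               # 61.25
print(round(sse_lsrl, 6))                      # 54.0
print(sum(residuals(points, 4, 0.5)))          # -4.5
```

The first line has the larger sum of squares, and its deviations do not sum to zero; the deviations from the LSRL (-3, 6, -3) do.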
Interpretations
Slope:
For each unit increase in x, there is an
approximate increase/decrease of b in y.
Correlation coefficient:
There is a direction, strength, type of
association between x and y.
The ages (in months) and heights (in
inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
Find the LSRL.
Interpret the slope and correlation
coefficient in the context of the problem.
Correlation coefficient:
There is a strong, positive, linear
association between the age and
height of children.
Slope:
For an increase in age of one month,
there is an approximate increase of .34
inches in heights of children.
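The slope and correlation quoted above can be reproduced without a calculator. This is a sketch in plain Python using the usual sums of deviations; the helper name `lsrl` is an assumption made for illustration.

```python
from math import sqrt

ages = [16, 24, 42, 60, 75, 102, 120]      # x, months
heights = [24, 30, 35, 40, 48, 56, 60]     # y, inches

def lsrl(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

a, b, r = lsrl(ages, heights)
print(round(b, 2))    # 0.34
print(round(a, 1), round(r, 3))
```

The slope rounds to .34 inches per month, and r is close to 1, which matches the "strong, positive, linear" description.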
The ages (in months) and heights (in
inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
Predict the height of a child who is 4.5
years old.
Predict the height of someone who is 20
years old.
Extrapolation
• The LSRL should not be used to
predict y for values of x outside the
data set.
• It is unknown whether the pattern
observed in the scatterplot
continues outside this range.
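One way to make the extrapolation warning concrete is to have the prediction report whether the x-value lies inside the observed range. The sketch below assumes the ages must be converted to months first (4.5 years = 54 months, 20 years = 240 months); `predict_height` is a hypothetical helper.

```python
ages = [16, 24, 42, 60, 75, 102, 120]      # x, months
heights = [24, 30, 35, 40, 48, 56, 60]     # y, inches

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(heights) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, heights))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

def predict_height(age_months):
    """Return (predicted height, True if the age lies inside the observed x range)."""
    inside = min(ages) <= age_months <= max(ages)
    return intercept + slope * age_months, inside

for years in (4.5, 20):
    months = years * 12
    y_hat, inside = predict_height(months)
    note = "" if inside else "  <- extrapolation, not trustworthy"
    print(f"{years} years ({months:.0f} months): {y_hat:.1f} inches{note}")
```

The 20-year-old comes out at over 100 inches tall, which shows why the line should not be trusted outside ages 16 to 120 months.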
The ages (in months) and heights (in
inches) of seven children are given.
x (age, months):     16  24  42  60  75  102  120
y (height, inches):  24  30  35  40  48   56   60
Calculate x̄ & ȳ.
Plot the point (x̄, ȳ) on the LSRL.
Will this point always be on the LSRL?
The correlation coefficient
and the LSRL are both
non-resistant measures.
Formulas – on chart
$\hat{y} = b_0 + b_1 x$
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = r \frac{s_y}{s_x}$
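The chart gives two expressions for the slope, one from sums of deviations and one from r and the standard deviations. A short derivation showing they agree, assuming n is the number of data pairs and s_x, s_y are the sample standard deviations (n - 1 in the denominator):

```latex
\begin{align*}
r &= \frac{1}{n-1}\sum\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
   = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x s_y}, \\
s_x^2 &= \frac{\sum (x_i-\bar{x})^2}{n-1}, \\
r\,\frac{s_y}{s_x} &= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\,s_x^2}
   = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = b_1 .
\end{align*}
```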
The following statistics are found for the
variables posted speed limit and the
average number of accidents.
$\bar{x} = 40$, $s_x = 11.6$, $\bar{y} = 18$, $s_y = 8.4$, $r = .9981$
Find the LSRL & predict the number of accidents for a posted speed limit of 50 mph.
$\hat{y} = .723x - 10.92$
$\hat{y} = 25.23$ accidents
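When only summary statistics are available, the chart formulas are enough to build the line. A minimal sketch in Python using the five statistics given for speed limit and accidents (variable names are illustrative):

```python
# Summary statistics from the slide
x_bar, s_x = 40, 11.6     # posted speed limit (mph)
y_bar, s_y = 18, 8.4      # average number of accidents
r = 0.9981

b1 = r * s_y / s_x        # slope: b1 = r * (s_y / s_x)
b0 = y_bar - b1 * x_bar   # intercept: b0 = y_bar - b1 * x_bar
y_hat_50 = b0 + b1 * 50   # prediction at 50 mph

print(round(b1, 3), round(b0, 2), round(y_hat_50, 2))   # 0.723 -10.91 25.23
```

The intercept prints as -10.91 here because the slope is not rounded before use; working with the rounded slope .723 gives -10.92. The prediction is 25.23 accidents either way.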
Correlation
Suppose we found the age and weight
of a sample of 10 adults.
Create a scatterplot of the data below.
Is there any relationship between the
age and weight of these adults?
Age:  24   30   41   28   50   46   49   35   20   39
Wt:   256  124  320  185  158  129  103  196  110  130
Suppose we found the height and weight of a
sample of 10 adults.
Create a scatterplot of the data below.
Is there any relationship between the height
and weight of these adults?
Is it positive or negative? Weak or strong?
Ht:   74   65   77   72   68   60   62   73   61   64
Wt:   256  124  320  185  158  129  103  196  110  130
The closer the points in a scatterplot are to a straight line - the stronger the relationship.
The farther away from a straight line – the weaker the relationship.
Identify as having a positive association,
a negative association, or no association.
1. Heights of mothers & heights of their adult daughters: +
2. Age of a car in years and its current value
3. Weight of a person and calories consumed: +
4. Height of a person and the person’s birth month: NO
5. Number of hours spent in safety training and the number of accidents that occur
Correlation Coefficient (r)
• A quantitative assessment of the strength
& direction of the linear relationship
between bivariate, quantitative data
• Pearson’s sample correlation is used most
• parameter – ρ (rho)
• statistic – r
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
Speed limit (mph):             55  50  45  40  30  20
Avg. # of accidents (weekly):  28  25  21  17  11   6
Calculate r. Interpret r in context.
There is a strong, positive, linear relationship
between speed limit and average number of
accidents per week.
Properties of r
(correlation coefficient)
• legitimate values of r are [-1,1]
[Figure: scale for r running from -1 to 1. r = 0: no correlation. |r| from 0 to .5: weak correlation. |r| from .5 to .8: moderate correlation. |r| from .8 to 1: strong correlation.]
• value of r does not depend on the unit
of measurement for either variable
x (in mm):  12  15  21  32  26  19  24
y:           4   7  10  14   9   8  12
Find r.
Change to cm & find r.
The correlations are the same.
• value of r does not depend on which
of the two variables is labeled x
x:  12  15  21  32  26  19  24
y:   4   7  10  14   9   8  12
Switch x & y & find r.
The correlations are the same.
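Both properties (units do not matter, and the labels x and y can be swapped) can be checked directly. The sketch below computes r as the sum of products of z-scores divided by n - 1, which is the formula form used in these notes; the function name `correlation` is an assumption.

```python
from math import sqrt

def correlation(x, y):
    """Pearson's r: sum of products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
               for xi, yi in zip(x, y)) / (n - 1)

x_mm = [12, 15, 21, 32, 26, 19, 24]
y = [4, 7, 10, 14, 9, 8, 12]
x_cm = [v / 10 for v in x_mm]        # same measurements in centimeters

r_mm = correlation(x_mm, y)
r_cm = correlation(x_cm, y)          # units changed
r_swapped = correlation(y, x_mm)     # x and y swapped

print(round(r_mm, 3), round(r_cm, 3), round(r_swapped, 3))
```

All three values agree because z-scores have no units and the product of z-scores does not care which variable comes first.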
• value of r is non-resistant
x:  12  15  21  32  26  19  24
y:   4   7  10  14   9   8  22
Find r.
Outliers affect the correlation
coefficient
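To see how much one unusual point moves r, compare the data set above (last y-value 22) with the earlier version of the same data (last y-value 12). A minimal sketch in Python; the helper name is illustrative.

```python
from math import sqrt

def correlation(x, y):
    """Pearson's r from sums of deviations."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

x = [12, 15, 21, 32, 26, 19, 24]
y_original = [4, 7, 10, 14, 9, 8, 12]
y_outlier = [4, 7, 10, 14, 9, 8, 22]   # only the last value differs

r_original = correlation(x, y_original)
r_outlier = correlation(x, y_outlier)
print(round(r_original, 2), round(r_outlier, 2))
```

Changing a single y-value drops r from roughly .92 to roughly .63, which is what "non-resistant" means here.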
•value of r is a measure of the extent
to which x & y are linearly related
A value of r close to zero does not rule out
any strong relationship between x and y.
r = 0, but has a definite
relationship!
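The slide's own data for this picture are not in the transcript, so the example below uses made-up points on the curve y = x², chosen symmetric about zero. It is only an illustration that a perfect (nonlinear) relationship can still give r = 0.

```python
from math import sqrt

# Hypothetical data (not from the slides): y depends exactly on x, but not linearly
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / sqrt(sxx * syy)

print(abs(round(r, 6)))   # 0.0
```

The positive and negative products of deviations cancel, so r measures no linear association even though y is completely determined by x.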
Minister data:
r = .9999
(Data on Elmo)
So does an increase in ministers
cause an increase in consumption of
rum?
Correlation does not imply
causation
Residuals, Residual Plots,
& Influential points
Residuals (error)
• The vertical deviation between the
observations & the LSRL
• the sum of the residuals is always zero
• error = observed - expected
$\text{residual} = y - \hat{y}$
Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other
statistics besides x
• Purpose is to tell if a linear association
exists between the x & y variables
• If no pattern exists between the points in
the residual plot, then the association is
linear.
[Figure: two residual plots of residuals against x, one labeled "Linear" and one labeled "Not linear"]
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
One measure of the success of knee
surgery is post-surgical range of motion
for the knee joint following a knee
dislocation. Is there a linear
relationship between age & range of
motion?
Sketch a residual plot.
[Figure: residual plot of residuals against age (x)]
Since there is no pattern in the
residual plot, there is a linear
relationship between age and
range of motion
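The residuals behind that plot can be listed rather than sketched. The following is a sketch in plain Python that fits the LSRL to the knee data and prints each (age, residual) pair; plotting is left out so the example has no dependencies.

```python
ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(rom) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, rom))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# residual = observed y minus predicted y
residuals = [y - (intercept + slope * x) for x, y in zip(ages, rom)]

for x, e in sorted(zip(ages, residuals)):
    print(f"age {x:2d}  residual {e:7.2f}")

print(abs(round(sum(residuals), 6)))   # sum of the residuals
```

Reading down the list, the signs of the residuals are mixed across the ages with no obvious curve, and their sum is zero up to rounding.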
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
Plot the residuals against the yhats. How does this residual plot
compare to the previous one?
[Figure: residual plots of residuals against age (x) and of residuals against ŷ]
Residual plots are the same no matter if
plotted against x or y-hat.
Coefficient of determination
• r²
• gives the proportion of variation in y
that can be attributed to an approximate
linear relationship between x & y
• remains the same no matter which
variable is labeled x
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
Let’s examine r².
Suppose you were going to
predict a future y but you didn’t
know the x-value. Your best guess
would be the overall mean of the
existing y’s.
Now, find the sum of the squared
residuals (errors) using the mean of y.
L3 = (L2-130.0833)^2. Do 1VARSTAT on
L3 to find the sum.
SSEy = 1564.917
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
Now suppose you were going
to predict a future y but you DO
know the x-value. Your best
guess would be the point on the
LSRL for that x-value (y-hat).
Find the LSRL & store in Y1.
In L3 = Y1(L1) to calculate the
predicted y for each x-value.
Now, find the sum of the
squared residuals (errors) using the LSRL. In
L4 = (L2-L3)^2. Do
1VARSTAT on L4 to find the
sum.
SSEŷ = 1085.735
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
SSEy = 1564.917 (using the mean of y)
SSEŷ = 1085.735 (using the LSRL)
By what percent did the sum of
the squared error go down
when you went from just an
“overall mean” model to the
“regression on x” model?
$\frac{SSE_y - SSE_{\hat{y}}}{SSE_y} = \frac{1564.91667 - 1085.735}{1564.91667} = .3062$
This is r² – the amount of the
variation in the y-values that is
explained by the x-values.
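The calculator steps can be mirrored in a few lines of code. This sketch in plain Python computes the sum of squared errors around the mean of y, the sum of squared errors around the LSRL, and the proportional reduction between them (variable names are illustrative).

```python
ages = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(rom) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, rom))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# "Overall mean" model: every prediction is y_bar
sse_mean = sum((y - y_bar) ** 2 for y in rom)

# "Regression on x" model: predictions come from the LSRL
sse_lsrl = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ages, rom))

r_squared = (sse_mean - sse_lsrl) / sse_mean
print(round(sse_mean, 3), round(sse_lsrl, 2), round(r_squared, 3))
```

The reduction comes out near .306, so about 30.6% of the squared error disappears when age is used to predict range of motion.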
Age:              35   24   40   31   28   25   26   16   14   20   21   30
Range of Motion:  154  142  137  133  122  126  135  135  108  120  127  122
How well does age predict the
range of motion after knee
surgery?
Approximately 30.6% of the
variation in range of motion
after knee surgery can be
explained by the linear
regression of age and range
of motion.
Interpretation of r²
Approximately r² (written as a percent) of the
variation in y can be explained
by the LSRL of x & y.
Computer-generated regression analysis of knee surgery data:
Predictor   Coef     Stdev    T      P
Constant    107.58   11.12    9.67   0.000
Age         0.8710   0.4146   2.10   0.062
s = 10.42   R-sq = 30.6%   R-sq(adj) = 23.7%
What is the equation of the LSRL? Find the slope & y-intercept.
$\hat{y} = 107.58 + .8710x$
What are the correlation coefficient and the coefficient of determination?
$r = .5532$
Be sure to convert r² to decimal before taking the square root!
NEVER use adjusted r²!
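Getting r from computer output takes two steps: convert R-sq from a percent to a decimal, then take the square root and attach the sign of the slope. A small sketch, with the numbers read off the output for the knee data:

```python
from math import sqrt, copysign

r_sq_percent = 30.6    # R-sq from the output (not R-sq(adj))
slope = 0.8710         # coefficient on Age

r_squared = r_sq_percent / 100           # convert to a decimal first
r = copysign(sqrt(r_squared), slope)     # r takes the sign of the slope

print(round(r, 4))   # 0.5532
```

Taking the square root of 30.6 directly would give about 5.53, which is not a legitimate value of r.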
Outlier –
• In a regression setting, an
outlier is a data point with a
large residual
Influential point
• A point that influences where the LSRL
is located
• If removed, it will significantly change
the slope of the LSRL
Racket   Resonance (Hz)   Acceleration (m/sec/sec)
1        105              36.0
2        106              35.0
3        110              34.5
4        111              36.8
5        112              37.0
6        113              34.0
7        113              34.2
8        114              33.8
9        114              35.0
10       119              35.0
11       120              33.6
12       121              34.2
13       126              36.2
14       189              30.0
One factor in the
development of tennis elbow
is the impact-induced
vibration of the racket and
arm at ball contact.
Sketch a scatterplot of these
data.
Calculate the LSRL &
correlation coefficient.
Does there appear to be an
influential point? If so,
remove it and then calculate
the new LSRL &
correlation coefficient.
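The before-and-after comparison can be run as follows. This is a sketch in plain Python that treats racket 14 (resonance 189 Hz) as the candidate influential point; that choice is an assumption based on how far its x-value sits from the rest, and the helper name `fit` is illustrative.

```python
from math import sqrt

resonance = [105, 106, 110, 111, 112, 113, 113, 114, 114, 119, 120, 121, 126, 189]
acceleration = [36.0, 35.0, 34.5, 36.8, 37.0, 34.0, 34.2, 33.8, 35.0, 35.0,
                33.6, 34.2, 36.2, 30.0]

def fit(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)

a_all, b_all, r_all = fit(resonance, acceleration)
# Drop the last racket (resonance 189 Hz) and refit
a_13, b_13, r_13 = fit(resonance[:-1], acceleration[:-1])

print(f"all 14 rackets:    slope {b_all:.4f}, r {r_all:.3f}")
print(f"without racket 14: slope {b_13:.4f}, r {r_13:.3f}")
```

With all 14 rackets the correlation looks fairly strong and negative; without racket 14 it is weak, and the slope is roughly half as steep. One point is driving most of the apparent relationship.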
Which of these measures are
resistant?
• LSRL
• Correlation coefficient
• Coefficient of determination
NONE – all are affected by outliers