Ch4 Describing Relationships Between Variables
[Figure, Ceramic Item, page 125: scatter plot of Density (2450-2900) versus Pressure (0-12,000).]
Section 4.1: Fitting a Line by Least Squares
• Often we want to fit a straight line to data.
• For example from an experiment we might have
the following data showing the relationship of
density of specimens made from a ceramic
compound at different pressures.
• By fitting a line to the data we can predict what
the average density would be for specimens
made at any given pressure, even pressures we
did not investigate experimentally.
• For a straight line we assume a model which says that, in the whole population of possible specimens, the average density y is related to the pressure x by the equation
y = β0 + β1 x
• The population (true) intercept and slope are represented by Greek symbols (β0 and β1), just as the population mean and standard deviation are written μ and σ.
How to choose the best line?
----Principle of least squares
• To apply the principle of least squares in the
fitting of an equation for y to an n-point data
set, values of the equation parameters are
chosen to minimize
Σ (yi - ŷi)^2   (sum over i = 1, …, n)
where y1, y2, …, yn are the observed responses and ŷ1, ŷ2, …, ŷn are the corresponding responses predicted (fitted) by the equation.
In other words
• We want to choose a slope and intercept so as
to minimize the sum of squared vertical
distances from the data points to the line in
question.
• A least squares fit minimizes the sum of squared
deviations from the fitted line ŷ
minimize Σ (yi - ŷi)^2 = Σ (yi - (β0 + β1 xi))^2
• Deviations from the fitted line are called
“residuals”
• We are minimizing the sum of squared residuals,
called the “residual sum of squares.”
Come again
We need to minimize
S(β0, β1) = Σ (yi - (β0 + β1 xi))^2
over all possible values of β0 and β1.
This is a calculus problem (take partial
derivatives).
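Written out, the two partial derivatives of S set to zero give the so-called normal equations:

```latex
\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n}\bigl(y_i-\beta_0-\beta_1 x_i\bigr)=0,
\qquad
\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n}x_i\bigl(y_i-\beta_0-\beta_1 x_i\bigr)=0.
```

Solving this pair of equations simultaneously for β0 and β1 produces the least squares formulas for the slope and intercept.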
• The resulting formulas for the least squares
estimates of the intercept and slope are
b1 = Σ (xi - x_bar)(yi - y_bar) / Σ (xi - x_bar)^2

b0 = y_bar - b1 * x_bar
• Notice the notation: we use b1 and b0 to denote the particular estimated values of the parameters β1 and β0.
The fitted line
• For the measured data we fit a straight line
ŷ = b0 + b1 x
• For the ith point, the fitted or predicted value is
ŷi = b0 + b1 xi
which represents likely y behavior at that x value.
Ceramic Compound data

x        y        (x-x_bar)  (y-y_bar)  (x-x_bar)*(y-y_bar)  (x-x_bar)^2
2000     2.486    -4000      -0.181     724                  16000000
2000     2.479    -4000      -0.188     752                  16000000
2000     2.472    -4000      -0.195     780                  16000000
4000     2.558    -2000      -0.109     218                  4000000
4000     2.570    -2000      -0.097     194                  4000000
4000     2.580    -2000      -0.087     174                  4000000
6000     2.646    0          -0.021     0                    0
6000     2.657    0          -0.010     0                    0
6000     2.653    0          -0.014     0                    0
8000     2.724    2000       0.057      114                  4000000
8000     2.774    2000       0.107      214                  4000000
8000     2.808    2000       0.141      282                  4000000
10000    2.861    4000       0.194      776                  16000000
10000    2.879    4000       0.212      848                  16000000
10000    2.858    4000       0.191      764                  16000000
sum                                     5840                 120000000

x_bar = 6000, y_bar = 2.667
b1 = 5840 / 120000000 = 4.87E-05
b0 = y_bar - b1*x_bar = 2.667 - (4.87E-05)(6000) = 2.375
ŷ = 2.375 + 0.0000487 x
[Figure, Ceramic Item, page 125: Density (2.45-2.9) versus Pressure (0-12,000), with the fitted line "Linear (Density)".]
Interpolation
• Consider x = 5000 psi: there are no data with this x value.
• If interpolation is sensible from a physical viewpoint, the fitted value
ŷ = 2.375 + 0.0000487(5000) ≈ 2.6183 g/cc
can be used to represent the density at 5,000 psi pressure.
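As a check on the hand computation, the least squares formulas can be applied directly in a few lines of Python. This is a sketch using the fifteen ceramic density measurements from the table above; the variable names are ours, not from the text:

```python
# Ceramic compound data: pressure (psi) and density (g/cc)
x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2.486, 2.479, 2.472, 2.558, 2.570, 2.580,
     2.646, 2.657, 2.653, 2.724, 2.774, 2.808,
     2.861, 2.879, 2.858]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope: b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
# Intercept: b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

# Interpolate at x = 5000 psi
y_hat_5000 = b0 + b1 * 5000
print(b0, b1, y_hat_5000)  # roughly 2.375, 4.87e-05, 2.6183
```

The printed values match the slide: b1 ≈ 4.87E-05, b0 ≈ 2.375, and a predicted density of about 2.6183 g/cc at 5,000 psi.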
• “Least squares” is the optimal method to use
for fitting the line if
– The relationship is in fact linear.
– For a fixed value of x the resulting values of y are
• normally distributed with
• the same constant variance at all x values.
• If these assumptions are not met, then we are
not using the best tool for the job.
• For any statistical tool, know when that tool is
the right one to use.
4.1.2 The Sample Correlation and
Coefficient of Determination
The sample (linear) correlation coefficient, r, is a measure of how "correlated" the x and y variables are.
The correlation is between -1 and 1:
• +1 means perfect positive linear correlation
• 0 means no linear correlation
• -1 means perfect negative linear correlation
• The sample correlation is computed by

r = Σ (xi - x_bar)(yi - y_bar) / sqrt[ Σ (xi - x_bar)^2 · Σ (yi - y_bar)^2 ]
• It is a measure of the strength of an apparent
linear relationship.
Coefficient of Determination
It is another measure of the quality of a fitted equation.

R^2 = [ Σ (yi - y_bar)^2 - Σ (yi - ŷi)^2 ] / Σ (yi - y_bar)^2
Interpretation of R^2
R^2 = fraction of variation accounted for (explained) by the fitted line:

R^2 = [ Σ (yi - y_bar)^2 - Σ (yi - ŷi)^2 ] / Σ (yi - y_bar)^2
[Figure, Ceramic Items, page 124: scatter plot of Density (2450-2900) versus Pressure (0-12,000).]
Pressure   y = Density   y - mean   (y-mean)^2
2000       2486          -181       32761
2000       2479          -188       35344
2000       2472          -195       38025
4000       2558          -109       11881
4000       2570          -97        9409
4000       2580          -87        7569
6000       2646          -21        441
6000       2657          -10        100
6000       2653          -14        196
8000       2724          57         3249
8000       2774          107        11449
8000       2808          141        19881
10000      2861          194        37636
10000      2879          212        44944
10000      2858          191        36481
mean       6000          2667       0 (sum)   289366 (sum)
st dev     2927.7        143.767
correlation = 0.991
correl^2 = 0.982

(Density is recorded here in units of 10^-3 g/cc.)
Observation   Predicted Density   Residual   Residual^2
1             2472.333            13.667     186.778
2             2472.333            6.667      44.444
3             2472.333            -0.333     0.111
4             2569.667            -11.667    136.111
5             2569.667            0.333      0.111
6             2569.667            10.333     106.778
7             2667.000            -21.000    441.000
8             2667.000            -10.000    100.000
9             2667.000            -14.000    196.000
10            2764.333            -40.333    1626.778
11            2764.333            9.667      93.444
12            2764.333            43.667     1906.778
13            2861.667            -0.667     0.444
14            2861.667            17.333     300.444
15            2861.667            -3.667     13.444
sum                                          5152.667
• If we don't use the pressures to predict density
– We use y_bar to predict every yi
– Our sum of squared errors is
Σ (yi - y_bar)^2 = 289,366 = SS Total
• If we do use the pressures to predict density
– We use ŷi = b0 + b1 xi to predict yi
– Σ (yi - ŷi)^2 = 5152.67 = SS Residual
The percent reduction in our error sum of squares is

R^2 = [ Σ (yi - y_bar)^2 - Σ (yi - ŷi)^2 ] / Σ (yi - y_bar)^2 × 100%
    = (SS_Total - SS_Residual) / SS_Total = SS_Regression / SS_Total
    = (289,366 - 5152.67) / 289,366 × 100 = 284,213.33 / 289,366 × 100

R^2 = 98.2%
Using x to predict y decreases the error sum of
squares by 98.2%.
The reduction in error sum of squares from
using x to predict y is
– Sum of squares explained by the regression
equation
– 284,213.33 = SS Regression
This is also the correlation squared: r^2 = 0.991^2 = 0.982 = R^2.
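The sums of squares above, and the identity r^2 = R^2, can be verified numerically. This is a sketch using the density data in 10^-3 g/cc units, as in the tables; the variable names are ours:

```python
import math

x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2486, 2479, 2472, 2558, 2570, 2580, 2646, 2657, 2653,
     2724, 2774, 2808, 2861, 2879, 2858]  # density, 10^-3 g/cc

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # SS Total

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_reg = syy - ss_resid                    # SS Regression
r = sxy / math.sqrt(sxx * syy)             # sample correlation
r_squared = ss_reg / syy                   # coefficient of determination

print(syy, ss_resid, r, r_squared)  # ~289366, ~5152.67, ~0.991, ~0.982
```

The output reproduces SS Total = 289,366, SS Residual ≈ 5152.67, r ≈ 0.991, and R^2 ≈ 0.982, with r^2 equal to R^2 as claimed.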
For a perfectly straight line
• All residuals are zero.
– The line fits the points exactly.
• SS Residual = 0
• SS Regression = SS Total
– The regression equation explains all variation
• R2 = 100%
• r = ±1
r2 = 1
If r=0, then there is no linear relationship between x and y
• R2 = 0%
• Using x to predict y with a linear model does not help at
all.
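These two extreme cases are easy to demonstrate on synthetic data. The sketch below (the helper `fit_stats` and both data sets are our own, chosen for illustration) fits a perfectly linear data set and a symmetric parabola whose sample correlation is exactly zero:

```python
import math

def fit_stats(x, y):
    """Return (b0, b1, r, R^2) for a least squares line fit."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ss_resid = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return b0, b1, r, 1 - ss_resid / syy

# Perfectly linear data: all residuals zero, r = 1, R^2 = 100%
x1 = [0, 1, 2, 3, 4]
y1 = [3 + 2 * xi for xi in x1]
_, _, r1, R2_1 = fit_stats(x1, y1)

# Symmetric parabola: a strong relationship, yet r = 0 and R^2 = 0%
x2 = [-2, -1, 0, 1, 2]
y2 = [xi ** 2 for xi in x2]
_, _, r2, R2_2 = fit_stats(x2, y2)
print(r1, R2_1, r2, R2_2)
```

The parabola also illustrates a caution: r = 0 rules out a *linear* relationship only; here y is perfectly determined by x, but not linearly.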
4.1.3 Computing and Using Residuals
• Does our linear model extract the main
message of the data?
• What is left behind? (hopefully only small
fluctuations explainable only as random
variation)
• Residuals!
ei = yi - ŷi
Good Residuals: Patternless
• They should look randomly scattered if a fitted
equation is telling the whole story.
• When there is a pattern, something has gone unaccounted for in the fitting.
Plotting residuals will be most crucial in section 4.2
with multiple x variables
– But residual plots are still of use here.
Plot residuals versus:
– predicted values
– x
– run order
– other potentially influential variables, e.g. technician
Also make a normal plot of the residuals.
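One useful fact when reading these plots: for a least squares line with an intercept, the residuals always sum to zero and are uncorrelated with x, so any leftover pattern reflects model inadequacy, not the fitting method. A quick numerical check on the ceramic data (a sketch; variable names are ours):

```python
x = [2000]*3 + [4000]*3 + [6000]*3 + [8000]*3 + [10000]*3
y = [2486, 2479, 2472, 2558, 2570, 2580, 2646, 2657, 2653,
     2724, 2774, 2808, 2861, 2879, 2858]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals e_i = y_i - y_hat_i
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(e)                                   # should be ~0
sum_xe = sum(xi * ei for xi, ei in zip(x, e))    # should be ~0
print(sum_e, sum_xe)
```

Both sums are zero up to rounding; these are exactly the two normal equations from the calculus step, evaluated at the fitted b0 and b1.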
Read page 135 of the book for more residual plots.
Checking Model Adequacy
With only a single x variable, we can tell most of what we need from a plot of the data with the fitted line.
[Figure, "Original Scale": scatter plot of Y (0-30) versus X (0-20) with a fitted line.]
A residual plot gives us a magnified view of
the increasing variance and curvature.
[Figure, "Original Scale": residuals (-6 to 10) plotted versus predicted values.]
This residual plot indicates 2 problems with this
linear least squares fit
• The relationship is not linear
– Indicated by the curvature in the residual plot
• The variance is not constant
– So the least squares method isn't the best approach
even if we handle the nonlinearity.
Some Cautions
• Don't fit a linear function to these data
directly with least squares.
• With increasing variability, not all squared
errors should count equally.
Some Study Questions
• What does it mean to say that a line fit to data is the "least
squares" line? Where do the terms least and squares come
from?
• We are fitting data with a straight line. What 3 assumptions
(conditions) need to be true for a linear least squares fit to
be the optimal way of fitting a line to the data?
• What does it mean if the correlation between x and y is -1?
What is the residual sum of squares in this situation?
• If the correlation between x and y is 0, what is the
regression sum of squares, SS Regression, in this situation?
• Consider the following data.
[Figure: scatter plot of Y (0-16) versus X (0-12), Series1.]
ANOVA
            df   SS        MS      F      Significance F
Regression  1    124.0333  124.03  15.85  0.016383072
Residual    4    31.3      7.825
Total       5    155.3333

            Coefficients  Standard Error  t Stat   P-value  Lower 95%    Upper 95%
Intercept   -0.5          2.79732         -0.1787  0.867    -8.26660583  7.2666058
X           1.525         0.383039        3.9813   0.016    0.461513698  2.5884863
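Much of what the study questions ask for can be read directly off this output. As one worked example (a sketch; only the numbers printed in the output above are used), R^2 is SS Regression over SS Total, and the fitted equation comes from the Coefficients column:

```python
# Values taken from the Excel ANOVA output above
ss_regression = 124.0333
ss_residual = 31.3
ss_total = 155.3333

# R^2 = SS Regression / SS Total
r_squared = ss_regression / ss_total
print(round(r_squared, 3))  # ~0.798

# Fitted equation from the Coefficients column: y_hat = -0.5 + 1.525 x
b0, b1 = -0.5, 1.525
y_hat_at_2 = b0 + b1 * 2   # predicted y at x = 2, equals 2.55
```

Computing the residual at x = 2 would additionally require the observed y value there, which the ANOVA table alone does not give.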
• What is the value of R2?
• What is the least squares regression equation?
• How much does y increase on average if x is increased by 1.0?
• What is the sum of squared residuals? Do not compute the
residuals; find the answer in the Excel output.
• What is the sum of squared deviations of y from y bar?
• By how much is the sum of squared errors reduced by using x to
predict y compared to using only y bar to predict y?
• What is the residual for the point with x = 2?