Stat 112 Notes 11
• Today:
– Fitting Curvilinear Relationships (Chapter 5)
• Homework 3 due Friday.
• I will e-mail Homework 4 tonight, but it will
not be due for two weeks (October 26th).
Curvilinear Relationships
• Relationship between Y and X is
curvilinear if E(Y|X) is not a straight line.
• The linearity assumption of the simple
linear regression model is violated for a
curvilinear relationship.
• Approaches to estimating E(Y|X) for a
curvilinear relationship
– Polynomial Regression
– Transformations
Transformations
• Curvilinear relationship: E(Y|X) is not a straight
line.
• Another approach to fitting curvilinear
relationships is to transform Y or X.
• Transformations: Perhaps E(f(Y)|g(X)) is a
straight line, where f(Y) and g(X) are
transformations of Y and X, and a simple linear
regression model holds for the response
variable f(Y) and explanatory variable g(X).
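For concreteness, here is a minimal sketch in Python (not part of the JMP workflow in these notes) of fitting a simple linear regression after transforming the explanatory variable, using g(X) = log X and a few hypothetical data points:

```python
# A minimal sketch (not from the notes) of fitting a simple linear regression
# after transforming X, here g(X) = log(X); the data arrays are hypothetical.
import numpy as np

x = np.array([500., 2000., 8000., 20000., 30000.])   # hypothetical per capita GDPs
y = np.array([48., 58., 68., 76., 79.])               # hypothetical life expectancies

g_x = np.log(x)                   # transform the explanatory variable
b1, b0 = np.polyfit(g_x, y, 1)    # least squares fit of y on log(x); returns slope, intercept
print(f"fitted line: E(Y | X) ~ {b0:.2f} + {b1:.2f} * log(X)")
```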
Curvilinear Relationship
Bivariate Fit of Life Expectancy By Per Capita GDP
Y = Life Expectancy in 1999
X = Per Capita GDP (in US Dollars) in 1999
Data in gdplife.JMP
[Scatterplot of Life Expectancy vs. Per Capita GDP, with a residual plot of residuals vs. Per Capita GDP]
The linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is smaller for large GDPs than for small GDPs: there are decreasing returns to increases in GDP.
Bivariate Fit of Life Expectancy By log Per Capita GDP
[Scatterplot of Life Expectancy vs. log Per Capita GDP, with a residual plot of residuals vs. log Per Capita GDP]
Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
The mean of Life Expectancy | log Per Capita GDP appears to be approximately a straight line.
How do we use the transformation?
Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             -7.97718    3.943378    -2.02     0.0454
log Per Capita GDP     8.729051   0.474257    18.41     <.0001
• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life
expectancy to be for a country with a per capita GDP of $20,000?
$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$
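The same prediction can be checked with a couple of lines of Python; the sketch below simply plugs X = 20,000 into the fitted line on the log scale (coefficients taken from the JMP output above):

```python
# Sketch of the prediction above, plugging X = 20,000 into the fitted line on the
# log scale (coefficients from the JMP output; log means natural log).
import math

b0, b1 = -7.97718, 8.729051
x = 20000
log_x = math.log(x)                 # ~ 9.9035
y_hat = b0 + b1 * log_x             # ~ 78.47 years
print(f"log(20000) = {log_x:.4f}, predicted life expectancy = {y_hat:.2f}")
```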
How do we choose a
transformation?
• Tukey’s Bulging Rule.
• See Handout.
• Match curvature in data to the shape of
one of the curves drawn in the four
quadrants of the figure in the handout.
Then use the associated transformations,
selecting one for either X, Y or both.
Transformations in JMP
1. Use Tukey’s Bulging rule (see handout) to determine
transformations which might help.
2. After Fit Y by X, click red triangle next to Bivariate Fit and
click Fit Special. Experiment with transformations
suggested by Tukey’s Bulging rule.
3. Make residual plots of the residuals for transformed
model vs. the original X by clicking red triangle next to
Transformed Fit to … and clicking plot residuals.
Choose transformations which make the residual plot
have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for the
transformation with the smallest root mean square error on
the original y-scale. If a transformation involves
transforming Y, use the root mean square error for the fit
measured on the original scale (a rough sketch of this
comparison follows this list).
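As a rough illustration of step 4 outside JMP, the sketch below (with hypothetical data and a generic helper function) fits each candidate transformation by least squares and compares the root mean square errors computed on the original y-scale:

```python
# A rough sketch (outside JMP) of step 4: compare candidate transformations by the
# root mean square error computed on the original y-scale. Data arrays are hypothetical;
# each model is a simple linear regression of (possibly transformed) y on transformed x.
import numpy as np

def rmse_original_scale(x, y, g=lambda v: v, f=lambda v: v, f_inv=lambda v: v):
    """Fit f(y) = b0 + b1*g(x) by least squares, then compute RMSE of y - f_inv(fitted)."""
    b1, b0 = np.polyfit(g(x), f(y), 1)
    y_hat = f_inv(b0 + b1 * g(x))     # back-transform predictions to the original y-scale
    # Note: JMP's Root Mean Square Error divides by n - 2 (degrees of freedom);
    # a plain mean is used here, which is enough for ranking the transformations.
    return np.sqrt(np.mean((y - y_hat) ** 2))

x = np.array([500., 2000., 5000., 8000., 15000., 20000., 30000.])
y = np.array([48., 56., 62., 67., 73., 76., 79.])

print("linear   :", rmse_original_scale(x, y))
print("log x    :", rmse_original_scale(x, y, g=np.log))
print("sqrt x   :", rmse_original_scale(x, y, g=np.sqrt))
print("square y :", rmse_original_scale(x, y, f=np.square, f_inv=np.sqrt))
```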
Bivariate Fit of Life Expectancy By Per Capita GDP
[Scatterplot of Life Expectancy vs. Per Capita GDP with four fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, and Transformed Fit Square]
Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
Summary of Fit: RSquare 0.515026, RSquare Adj 0.510734, Root Mean Square Error 8.353485, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
Summary of Fit: RSquare 0.636551, RSquare Adj 0.633335, Root Mean Square Error 7.231524, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
Summary of Fit: RSquare 0.749874, RSquare Adj 0.74766, Root Mean Square Error 5.999128, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
Fit Measured on Original Scale: Sum of Squared Error 7597.7156, Root Mean Square Error 8.1997818, RSquare 0.5327083, Sum of Residuals -70.29942
By looking at the root mean square error on the original y-scale, we see that
all of the transformations improve upon the untransformed model and that the
transformation to log x is by far the best.
[Residual plots of residuals vs. Per Capita GDP for the Linear Fit, the Transformation to √X, the Transformation to Log X, and the Transformation to Y²]
The transformation to Log X appears to have mostly removed a trend in the mean of the residuals. This means that $E(Y \mid X) = \beta_0 + \beta_1 \log X$. There is still a problem of nonconstant variance.
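A quick way to reproduce this kind of residual check outside JMP is sketched below (hypothetical data; the fit and residuals mirror the log X transformation used above):

```python
# Sketch of the residual check described above: fit y on log(x), then plot the
# residuals against the original x and look for any remaining trend in their mean.
# (Hypothetical arrays; in the notes this is done through JMP's "Plot Residuals".)
import numpy as np
import matplotlib.pyplot as plt

x = np.array([500., 2000., 5000., 8000., 15000., 20000., 30000.])
y = np.array([48., 56., 62., 67., 73., 76., 79.])

b1, b0 = np.polyfit(np.log(x), y, 1)
residuals = y - (b0 + b1 * np.log(x))

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Per Capita GDP")
plt.ylabel("Residual")
plt.show()
```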
Comparing models for
curvilinear relationships
• In comparing two transformations, use the transformation
with the lower RMSE; if Y was transformed, use the RMSE
from the fit measured on the original y-scale.
• In comparing transformations to polynomial regression
models, compare RMSE of best transformation to best
polynomial regression model (selected using the criterion
from Note 10).
• If the transformation's RMSE is close to (e.g., within 1% of)
but not as small as the polynomial regression's, it is still
reasonable to use the transformation on the grounds of
parsimony.
Transformations and Polynomial Regression for
Display.JMP
Fourth order polynomial is the best polynomial regression model
using the criterion on slide 10
Transformation / Model    RMSE
Linear                    51.59
log x                     41.31
1/x                       40.04
√x                        46.02
Fourth order poly.        37.79
Fourth order polynomial is the best model – it has the smallest RMSE by
a considerable amount (more than a 1% advantage over the best transformation, 1/x).
Interpreting the Coefficient on Log X
Suppose $E(Y \mid X) = \beta_0 + \beta_1 \log X$.
Then, using the properties of logarithms,
$E(Y \mid 2X) - E(Y \mid X) = (\beta_0 + \beta_1 \log 2X) - (\beta_0 + \beta_1 \log X) = \beta_1 \log \frac{2X}{X} = \beta_1 \log 2 \approx 0.69\,\beta_1$
Thus, the interpretation of $\beta_1$ is that a doubling of X is associated
with a $\beta_1 \log 2 \approx 0.69\,\beta_1$ increase in the mean of Y.
Similarly, a tripling of X is associated with a $\beta_1 \log 3$ increase in
the mean of Y.
For the life expectancy data:
Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
A doubling of GDP is associated with an 8.73 × log 2 = 8.73 × 0.69 = 6.02
year increase in mean life expectancy.
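The doubling calculation can be verified directly; the snippet below uses the fitted slope from the JMP output (with the exact value of log 2 the effect is about 6.05 years; the slide's rounding of log 2 to 0.69 gives 6.02):

```python
# Sketch of the doubling calculation above, using the fitted slope from the notes.
import math

b1 = 8.729051
doubling_effect = b1 * math.log(2)   # ~ 6.05 years (rounding log 2 to 0.69 gives 6.02)
print(f"doubling GDP -> about {doubling_effect:.2f} more years of mean life expectancy")
```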
Log Transformation of Both X and
Y variables
• It is sometimes useful to transform both
the X and Y variables.
• A particularly common transformation is to
transform X to log(X) and Y to log(Y)
$E(\log Y \mid X) = \beta_0 + \beta_1 \log X$
$E(Y \mid X) \approx \exp(\beta_0 + \beta_1 \log X)$
Heart Disease-Wine Consumption Data (heartwine.JMP)
Bivariate Fit of Heart Disease Mortality By Wine Consumption
[Scatterplot of Heart Disease Mortality vs. Wine Consumption with the Linear Fit and the Transformed Fit Log to Log]
[Residual Plot for Simple Linear Regression Model: residuals vs. Wine Consumption]
[Residual Plot for Log-Log Transformed Model: residuals vs. Wine Consumption]
Evaluating Transformed Y Variable Models
The residuals for a log-log transformation model on the original Y-scale are
$\hat{e}_i = Y_i - \hat{E}(Y \mid X_i) = Y_i - \exp(b_0 + b_1 \log X_i)$
The root mean square error and R² on the original Y-scale are shown in JMP under Fit Measured on Original Scale.
To evaluate models with transformed Y variables and compare their R²'s and root mean square errors to models with untransformed Y variables, use the root mean square error and R² on the original Y-scale for the transformed Y variables.
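As a sketch of this calculation outside JMP (with hypothetical stand-in data, since the heartwine.JMP values are not reproduced here), one can back-transform the fitted values with exp and compute the root mean square error and R² on the original Y-scale:

```python
# Sketch of the calculation described above for a log-log fit: back-transform the
# fitted values with exp(), form residuals on the original y-scale, and compute the
# RMSE and R^2 on that scale. Arrays are hypothetical stand-ins for the JMP data.
import numpy as np

x = np.array([2., 5., 10., 20., 40., 60., 75.])    # hypothetical wine consumption
y = np.array([11., 9., 7.5, 6., 5., 4.2, 3.8])     # hypothetical heart disease mortality

b1, b0 = np.polyfit(np.log(x), np.log(y), 1)       # fit log(y) = b0 + b1*log(x)
y_hat = np.exp(b0 + b1 * np.log(x))                # predictions back on the original y-scale

resid = y - y_hat
rmse = np.sqrt(np.mean(resid ** 2))                # (JMP's RMSE adjusts for degrees of freedom)
r_squared = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
print(f"RMSE (original scale) = {rmse:.3f}, R^2 (original scale) = {r_squared:.3f}")
```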
Linear Fit
Heart Disease Mortality = 7.6865549 - 0.0760809 Wine Consumption
Summary of Fit: RSquare 0.555872, RSquare Adj 0.528114, Root Mean Square Error 1.618923

Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
Fit Measured on Original Scale: Sum of Squared Error 41.557487, Root Mean Square Error 1.6116274, RSquare 0.5598656

The log-log transformation provides slightly better predictions than the simple linear regression model.
Interpreting Coefficients in Log-Log
Models
$E(\log Y \mid X) = \beta_0 + \beta_1 \log X$
$E(Y \mid X) \approx \exp(\beta_0 + \beta_1 \log X)$
Assuming that
$E(\log Y \mid \log X) = \beta_0 + \beta_1 \log X$
satisfies the simple linear regression model assumptions, then
$\mathrm{Median}(Y \mid X) = \exp(\beta_0)\exp(\beta_1 \log X)$
Thus,
$\dfrac{\mathrm{Median}(Y \mid \log 2X)}{\mathrm{Median}(Y \mid \log X)} = \dfrac{\exp(\beta_0)\exp(\beta_1 \log 2X)}{\exp(\beta_0)\exp(\beta_1 \log X)} = 2^{\beta_1}$
Thus, a doubling of X is associated with a multiplicative change of $2^{\beta_1}$ in the
median of Y.
Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
Doubling wine consumption is associated with multiplying median heart disease
mortality by $2^{-0.356} \approx 0.781$.
Another interpretation of
coefficients in log-log models
For a 1% increase in X,
$\dfrac{\mathrm{Median}(Y \mid \log 1.01X)}{\mathrm{Median}(Y \mid \log X)} = \dfrac{\exp(\beta_0)\exp(\beta_1 \log 1.01X)}{\exp(\beta_0)\exp(\beta_1 \log X)} = 1.01^{\beta_1}$
Because $1.01^{\beta_1} \approx 1 + 0.01\,\beta_1$,
a 1% increase in X is associated with a $\beta_1$ percent increase in the median (or mean) of Y.
Transformed Fit Log to Log
Log(Heart Disease Mortality) = 2.5555519 - 0.3555959 Log(Wine Consumption)
Increasing wine consumption by 1% is associated with a 0.36% decrease in mean heart
disease mortality.
Similarly, a 10% increase in X is associated with a $10\,\beta_1$ percent increase in the
mean (or median) of Y.
So increasing wine consumption by 10% is associated with a 3.6% decrease in mean heart
disease mortality.
For large percentage changes (e.g., 50%, 100%), this interpretation is not accurate; a comparison is sketched below.
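The snippet below, using the fitted slope from the JMP output, compares the approximate interpretation (a k% change in X goes with roughly a k·β₁ percent change in the median of Y) with the exact multiplicative change (1 + k/100)^β₁, which is where the inaccuracy for large percentage changes shows up:

```python
# Sketch of the percentage-change interpretation with the fitted slope from the notes.
# The approximation (k% change in X ~ k*b1 % change in median Y) is compared with the
# exact multiplicative change (1 + k/100)**b1 implied by the log-log model.
b1 = -0.3555959

for pct in (1, 10, 100):
    approx = pct * b1                               # approximate % change in median Y
    exact = ((1 + pct / 100) ** b1 - 1) * 100       # exact % change in median Y
    print(f"{pct:>3}% more wine: approx {approx:+.2f}%, exact {exact:+.2f}%")
```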
Another Example of Transformations: Y = Count of Tree Seeds, X = Seed Weight (mg)
Bivariate Fit of Seed Count By Seed weight (mg)
[Scatterplot of Seed Count vs. Seed weight (mg) with the Linear Fit, the Transformed Fit to Log, and the Transformed Fit Log to Log]
Linear Fit
Seed Count = 6751.7179 - 2.1076776 Seed weight (mg)
Summary of Fit: RSquare 0.220603, RSquare Adj 0.174756, Root Mean Square Error 6199.931, Mean of Response 4398.474, Observations (or Sum Wgts) 19

Transformed Fit to Log
Seed Count = 12174.621 - 1672.3962 Log(Seed weight (mg))
Summary of Fit: RSquare 0.566422, RSquare Adj 0.540918, Root Mean Square Error 4624.247, Mean of Response 4398.474, Observations (or Sum Wgts) 19
Transformed Fit Log to Log
Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg))
Fit Measured on Original Scale: Sum of Squared Error 161960739, Root Mean Square Error 3086.6004, RSquare 0.8068273, Sum of Residuals 3142.2066
By looking at the root mean square error on the original y-scale, we see that
both of the transformations improve upon the untransformed model and that the
transformation to log y and log x is by far the best.
Comparison of Transformations to
Polynomials for Tree Data
Bivariate Fit of Seed Count By Seed weight (mg)
[Scatterplot of Seed Count vs. Seed weight (mg) with the Transformed Fit Log to Log and the Polynomial Fit Degree=6]

Transformed Fit Log to Log
Log(Seed Count) = 9.758665 - 0.5670124*Log(Seed weight (mg))
Fit Measured on Original Scale: Root Mean Square Error 3086.6004

Polynomial Fit Degree=6
Seed Count = 1539.0377 + 2.453857*Seed weight (mg)
- 0.0139213*(Seed weight (mg)-1116.51)^2
+ 1.2747e-6*(Seed weight (mg)-1116.51)^3
+ 1.0463e-8*(Seed weight (mg)-1116.51)^4
- 5.675e-12*(Seed weight (mg)-1116.51)^5
+ 8.269e-16*(Seed weight (mg)-1116.51)^6
Summary of Fit: Root Mean Square Error 6138.581
For the tree data, the log-log transformation is
much better than polynomial regression.
Prediction using the log y/log x
transformation
• What is the predicted seed count of a tree
with seed weight 50 mg?
• Math trick: exp{log(y)} = y (remember, by log we always mean the natural log, ln), i.e., $e^{\log 10} = 10$.
$\hat{E}(Y \mid X = 50) = \exp\{\hat{E}(\log Y \mid X = 50)\} = \exp\{\hat{E}(\log Y \mid \log X = \log 50)\} = \exp\{\hat{E}(\log Y \mid \log X = 3.912)\} = \exp\{9.7587 - 0.5670 \times 3.912\} = \exp\{7.5406\} = 1882.96$
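The same prediction in a few lines of Python, using the log-log coefficients from the JMP output (natural logs throughout):

```python
# Sketch of the prediction above, using the log-log coefficients from the JMP output.
import math

b0, b1 = 9.758665, -0.5670124
x = 50
log_y_hat = b0 + b1 * math.log(x)     # ~ 7.54
y_hat = math.exp(log_y_hat)           # ~ 1883 seeds
print(f"predicted seed count at 50 mg: {y_hat:.0f}")
```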