Class 11: Thurs., Oct. 14
• Finish transformations
• Example Regression Analysis
• Next Tuesday: Review for Midterm (I will
take questions and go over practice
midterm if there are no questions)
• Next Thursday: Midterm
• HW5 due Tuesday.
• I will e-mail review notes and a practice
midterm to you tomorrow.
Transformations in JMP
1. Use Tukey’s Bulging rule (see handout) to determine
transformations which might help.
2. After Fit Y by X, click red triangle next to Bivariate Fit and
click Fit Special. Experiment with transformations
suggested by Tukey’s Bulging rule.
3. Make residual plots of the residuals for transformed
model vs. the original X by clicking red triangle next to
Transformed Fit to … and clicking plot residuals.
Choose transformations which make the residual plot
have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for the
transformation with the smallest root mean square error on the
original y-scale. If a transformation involves transforming y,
use the root mean square error reported under Fit Measured on
Original Scale, not the one on the transformed scale.
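Outside JMP, the comparison in step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the JMP procedure itself: the helper names `fit_line` and `rmse_original_scale` are hypothetical, the data are made up for the example (they are not the course data), and the error is divided by n − 2 on the assumption that this matches JMP's root mean square error for a two-coefficient fit.

```python
import math

def fit_line(u, v):
    """Ordinary least squares fit of v = b0 + b1 * u; returns (b0, b1)."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    b1 = sxy / sxx
    return v_bar - b1 * u_bar, b1

def rmse_original_scale(x, y, g=lambda t: t, f=lambda t: t, f_inv=lambda t: t):
    """Fit f(y) on g(x), back-transform the fitted values with f_inv, and
    return sqrt(SSE / (n - 2)) with SSE computed on the original y-scale."""
    b0, b1 = fit_line([g(xi) for xi in x], [f(yi) for yi in y])
    fitted = [f_inv(b0 + b1 * g(xi)) for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(sse / (len(x) - 2))

# Made-up illustrative data (NOT the course data): y is exactly linear in log(x).
x = [100, 300, 1000, 3000, 10000, 30000]
y = [40.0 + 4.0 * math.log(xi / 100) for xi in x]

candidates = {
    "linear": rmse_original_scale(x, y),
    "log x": rmse_original_scale(x, y, g=math.log),
    "sqrt x": rmse_original_scale(x, y, g=math.sqrt),
    "y squared": rmse_original_scale(x, y, f=lambda t: t * t, f_inv=math.sqrt),
}
best = min(candidates, key=candidates.get)
print(best, candidates)
```

Because the toy data were generated from a log relationship, the log x fit has essentially zero error here. With real data the ranking has to be read from the computed values.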
[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP. Scatterplot of Life Expectancy (40 to 80) against Per Capita GDP (0 to 30000) with four fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square.]
Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)

Summary of Fit
                             Linear Fit   Fit to Log   Fit to Sqrt
RSquare                      0.515026     0.749874     0.636551
RSquare Adj                  0.510734     0.74766      0.633335
Root Mean Square Error       8.353485     5.999128     7.231524
Mean of Response             63.86957     63.86957     63.86957
Observations (or Sum Wgts)   115          115          115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
Fit Measured on Original Scale
Sum of Squared Error         7597.7156
Root Mean Square Error       8.1997818
RSquare                      0.5327083
Sum of Residuals             -70.29942
By looking at the root mean square error on the original y-scale, we see that
all of the transformations improve upon the untransformed model and that the
transformation to log x is by far the best.
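The root mean square error that JMP reports under Fit Measured on Original Scale can be checked against the reported sum of squared error. The sketch below assumes the usual formula sqrt(SSE / (n − 2)) for a fit with two coefficients. The function name `rmse_from_sse` is hypothetical, and the pairing of each root mean square error with its fit is read off the JMP output as best it can be recovered from the slide.

```python
import math

def rmse_from_sse(sse, n, n_params=2):
    """Root mean square error: sqrt(SSE / (n - number of fitted coefficients))."""
    return math.sqrt(sse / (n - n_params))

# SSE and n from the "Fit Measured on Original Scale" report for the Square(Y) fit.
rmse_square = rmse_from_sse(7597.7156, 115)
print(round(rmse_square, 4))  # 8.1998

# Root mean square errors on the original y-scale, as reported on the slide.
rmse_by_fit = {
    "linear": 8.353485,
    "sqrt x": 7.231524,
    "log x": 5.999128,
    "y squared": rmse_square,
}
print(min(rmse_by_fit, key=rmse_by_fit.get))  # log x
```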
[Figure: Four residual plots, each showing Residual (-25 to 15) against Per Capita GDP (0 to 30000). Panels: Linear Fit, Transformation to √X, Transformation to Log X, Transformation to Y².]
The transformation to Log X appears to have mostly removed the trend in the mean
of the residuals. This suggests that $E(Y \mid X) = \beta_0 + \beta_1 \log X$ is a
reasonable model for the mean. There is still a problem of nonconstant variance.
How do we use the transformation?
Linear Fit
• Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP
Parameter Estimates
Term                 Estimate   Std Error   t Ratio   Prob>|t|
Intercept            -7.97718   3.943378    -2.02     0.0454
log Per Capita GDP   8.729051   0.474257    18.41     <.0001
• Testing for association between Y and X: If the simple linear
regression model holds for f(Y) and g(X), then Y and X are
associated if and only if the slope in the regression of f(Y) on g(X)
does not equal zero. P-value for test that slope is zero is <.0001:
Strong evidence that per capita GDP and life expectancy are
associated.
• Prediction and mean response: What would you predict the life
expectancy to be for a country with a per capita GDP of $20,000?
$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$
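The t ratio and the predicted mean response above can be reproduced directly from the Parameter Estimates table. This is a minimal sketch; the function names `t_ratio` and `predict_life_expectancy` are hypothetical, and `math.log` is the natural log, which is the convention used in these notes.

```python
import math

# Coefficients and standard errors from the JMP "Parameter Estimates" table (log x fit).
B0, B1 = -7.97718, 8.729051
SE_B0, SE_B1 = 3.943378, 0.474257

def t_ratio(estimate, std_error):
    """t statistic for testing that a coefficient equals zero."""
    return estimate / std_error

def predict_life_expectancy(gdp):
    """Estimated mean life expectancy at a given per capita GDP under the log x model."""
    return B0 + B1 * math.log(gdp)

print(round(t_ratio(B1, SE_B1), 2))              # 18.41
print(round(t_ratio(B0, SE_B0), 2))              # -2.02
print(round(predict_life_expectancy(20000), 2))  # 78.47
```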
More Examples of finding E(Y|X)
using a transformation
• Suppose simple linear regression model
holds for $E(Y \mid \sqrt{X})$:
Then $\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \sqrt{X} = \sqrt{20{,}000}) = \hat{E}(Y \mid \sqrt{X} = 141.42) = 47.93 + 0.219 \times 141.42 = 78.90$
• Suppose simple linear regression model
holds for $E(Y^2 \mid X)$:
Then $\hat{E}(Y \mid X = 20{,}000) = \sqrt{\hat{E}(Y^2 \mid X = 20{,}000)} = \sqrt{3232.129 + 0.137483 \times 20{,}000} = \sqrt{5981.79} = 77.34$
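Both back-transformations can be checked numerically. The sketch below uses the full-precision coefficients from the JMP fits for Sqrt(Per Capita GDP) and Square(Life Expectancy); the function names are hypothetical. With unrounded coefficients the square-root model gives about 78.87, slightly below the hand calculation with rounded coefficients.

```python
import math

def predict_sqrt_x(x, b0=47.925383, b1=0.2187935):
    """Model fitted as Y = b0 + b1 * sqrt(X): plug sqrt(x) into the fitted line."""
    return b0 + b1 * math.sqrt(x)

def predict_square_y(x, b0=3232.1292, b1=0.1374831):
    """Model fitted as Y^2 = b0 + b1 * X: predict Y^2, then take the square root."""
    return math.sqrt(b0 + b1 * x)

print(round(predict_sqrt_x(20000), 2))    # 78.87
print(round(predict_square_y(20000), 2))  # 77.34
```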
CIs for Mean Response and
Prediction Intervals
[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP. Life Expectancy (40 to 100) against Per Capita GDP (0 to 30000), showing the Transformed Fit to Log with interval bands.]
Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
95% Confidence Interval for Mean Response E(Y|X=20,000) = (76.89, 80.50)
95% Prediction Interval for Y|X=20,000 = (66.42, 91.69)
Note: To expand Y-axis or X-axis, right click
on the X-axis, click Axis Settings and change
the minimum/maximum. In order to fully see
the prediction intervals, I needed to expand
the Y-axis.
Another Example of
Transformations: Y = Count of tree
seeds, X = Seed weight (mg)
[Figure: Bivariate Fit of Seed Count By Seed weight (mg). Scatterplot of Seed Count (-5000 to 30000) against Seed weight (mg) (-1000 to 5000), shown first without fits and then with three fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit Log to Log.]
Linear Fit
Seed Count = 6751.7179 - 2.1076776 Seed weight (mg)

Transformed Fit to Log
Seed Count = 12174.621 - 1672.3962 Log(Seed weight (mg))

Summary of Fit
                             Linear Fit   Fit to Log
RSquare                      0.220603     0.566422
RSquare Adj                  0.174756     0.540918
Root Mean Square Error       6199.931     4624.247
Mean of Response             4398.474     4398.474
Observations (or Sum Wgts)   19           19

Transformed Fit Log to Log
Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg))
Fit Measured on Original Scale
Sum of Squared Error         161960739
Root Mean Square Error       3086.6004
RSquare                      0.8068273
Sum of Residuals             3142.2066
By looking at the root mean square error on the original y-scale, we see that
both of the transformations improve upon the untransformed model and that the
transformation to log y and log x is by far the best.
Prediction using the log y/log x
transformation
• What is the predicted seed count of a tree
whose seeds weigh 50 mg?
• Math trick: exp{log(y)} = y (Remember by
log, we always mean the natural log, ln),
e.g., $e^{\log 10} = 10$
$\hat{E}(Y \mid X = 50) = \exp\{\hat{E}(\log Y \mid X = 50)\} = \exp\{\hat{E}(\log Y \mid \log X = \log 50)\} = \exp\{\hat{E}(\log Y \mid \log X = 3.912)\} = \exp\{9.7587 - 0.5670 \times 3.912\} = \exp\{7.5406\} = 1882.96$
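The same back-transformation can be sketched in Python. The function name `predict_seed_count` is hypothetical, and the defaults are the full-precision coefficients from the Log to Log fit, so the result (about 1882.8) differs slightly from the hand calculation, which rounds the coefficients and log 50.

```python
import math

def predict_seed_count(seed_weight_mg, b0=9.758665, b1=-0.5670124):
    """Log-log model: predict log(Seed Count) from log(Seed weight), then exponentiate."""
    log_count = b0 + b1 * math.log(seed_weight_mg)
    return math.exp(log_count)

print(round(predict_seed_count(50), 1))
```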
Assumptions for linear regression and their importance to inferences

Inference: Point prediction, point estimation
Assumptions that are important: Linearity (specification of mean of Y|X is correct), independence

Inference: Confidence interval for slope, hypothesis test for slope, confidence interval for mean response
Assumptions that are important: Linearity, constant variance, independence, normality (only if n<30)

Inference: Prediction interval
Assumptions that are important: Linearity, constant variance, independence, normality
Transformations to Remedy
Nonconstant Variance and Nonnormality
Nonconstant Variance
• When the variance of Y|X increases with X, try
transforming Y to log Y or Y to $\sqrt{Y}$
• When the variance of Y|X decreases with X, try
transforming Y to 1/Y or Y to $Y^2$
Nonnormality
• When the distribution of the residuals is
skewed to the right, try transforming Y to log Y.
• When the distribution of the residuals is
skewed to the left, try transforming Y to $Y^2$
Steps in Regression Analysis
1. Define the question of interest. Review the design of
the study to see if it can answer question of interest.
Correct errors in the data.
2. Explore the data using a scatterplot.
3. Fit an initial regression model (possibly using a
transformation). Check the assumptions of the
regression model.
4. Investigate influential points.
5. Infer answers to the questions of interest using
appropriate tools (e.g., confidence intervals,
hypothesis tests, prediction intervals).
6. Communicate the results to the intended audience.