ST 370
Probability and Statistics for Engineers
Hand Calculations
Calculation of β̂0, β̂1, and SSE from the formulas is tedious; these
shortcuts can help:

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$$

$$SST = S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$

$$SSE = SST - \hat{\beta}_1 S_{xy}$$
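These shortcuts are easy to check numerically. Below is a minimal Python sketch with made-up data (the course's own examples use R); it computes each quantity both from the definition and from the shortcut:

```python
# Sketch: check the computational shortcuts against the definitional
# forms on a small made-up data set (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# definitional forms
Sxx_def = sum((x - xbar) ** 2 for x in xs)
Sxy_def = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST_def = sum((y - ybar) ** 2 for y in ys)

# shortcut forms: raw sums only, no centering required
Sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
Sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
SST = sum(y * y for y in ys) - sum(ys) ** 2 / n

beta1 = Sxy / Sxx
SSE = SST - beta1 * Sxy
print(Sxx, Sxy, SST, SSE)  # shortcut values agree with the definitions
```

The shortcut forms only need running sums of x, y, x², y², and xy, which is why they are convenient for hand calculation.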
Standard errors
Calculations of standard errors begin with an estimate of σ, the
standard deviation of the noise term ε.
The least squares residuals are ei = yi − (β̂0 + β̂1 xi ) so the residual
sum of squares is
$$SSE = \sum_{i=1}^{n} e_i^2.$$
Because two parameters were estimated in finding the residuals, the
residual degrees of freedom are n − 2, and the estimate of σ 2 is
$$\hat{\sigma}^2 = MSE = \frac{SSE}{n-2}.$$
The estimated standard errors of the least squares estimates are
$$se(\hat{\beta}_1) = \hat{\sigma}\sqrt{\frac{1}{S_{xx}}}$$

and

$$se(\hat{\beta}_0) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}.$$
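The whole chain of hand calculations, from Sxx through MSE to the standard errors, can be sketched in a few lines of Python (made-up data, not the lecture's oxygen example):

```python
# Sketch: hand-calculate sigma-hat and the standard errors of the
# least squares estimates for a small made-up data set.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST = sum((y - ybar) ** 2 for y in ys)

beta1 = Sxy / Sxx
beta0 = ybar - beta1 * xbar
SSE = SST - beta1 * Sxy

# residual degrees of freedom are n - 2 (two parameters estimated)
sigma2_hat = SSE / (n - 2)            # MSE
sigma_hat = math.sqrt(sigma2_hat)

se_beta1 = sigma_hat * math.sqrt(1 / Sxx)
se_beta0 = sigma_hat * math.sqrt(1 / n + xbar ** 2 / Sxx)
print(sigma_hat, se_beta1, se_beta0)
```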
These estimated standard errors, especially se(β̂1 ), are used to set up
confidence intervals like
β̂1 ± tα/2,ν × estimated standard error
and test statistics like
$$t_{\text{obs}} = \frac{\hat{\beta}_1}{\text{estimated standard error}}.$$
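A Python sketch of the interval and test statistic, with the critical value t₀.₀₂₅,₃ = 3.182 read from a t table (made-up data, not the lecture's example):

```python
# Sketch: a 95% confidence interval for beta1 and the t statistic for
# testing H0: beta1 = 0, on made-up data; t-critical value from a table.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST = sum((y - ybar) ** 2 for y in ys)
beta1 = Sxy / Sxx
SSE = SST - beta1 * Sxy
sigma_hat = math.sqrt(SSE / (n - 2))
se_beta1 = sigma_hat * math.sqrt(1 / Sxx)

t_crit = 3.182                       # t_{0.025, nu} for nu = n - 2 = 3
ci = (beta1 - t_crit * se_beta1, beta1 + t_crit * se_beta1)
t_obs = beta1 / se_beta1             # compare against t_crit
print(ci, t_obs)
```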
Predicting a New Observation
Regression equations are often used to predict what the response
would be in a new experiment, in which the predictor takes some
value, say xnew .
If we carried out many experiments with x = xnew , we would expect
the average response to be
β0 + β1 xnew ,
which we estimate by
Ŷnew = β̂0 + β̂1 xnew .
When this mean response is the quantity of interest, you use its
estimated standard error
$$se(\hat{Y}_{\text{new}}) = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}}$$
to set up confidence intervals.
Note that if xnew = 0, Ŷnew is just β̂0 , and this expression reduces to
the estimated standard error given earlier.
Note also that se(Ŷnew ) is always at least σ̂/√n, and increases as
xnew gets farther from x̄.
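Both remarks are easy to verify numerically; a Python sketch with made-up data:

```python
# Sketch: se(Yhat_new) equals sigma_hat / sqrt(n) at x_new = xbar and
# grows as x_new moves away from xbar; made-up data.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST = sum((y - ybar) ** 2 for y in ys)
beta1 = Sxy / Sxx
SSE = SST - beta1 * Sxy
sigma_hat = math.sqrt(SSE / (n - 2))

def se_fit(x_new):
    """Estimated standard error of the mean response at x_new."""
    return sigma_hat * math.sqrt(1 / n + (x_new - xbar) ** 2 / Sxx)

print(se_fit(xbar))                     # the minimum, sigma_hat / sqrt(n)
print(se_fit(xbar + 2) > se_fit(xbar))  # farther from xbar => larger se
```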
Often, however, what is wanted is not a confidence interval for the
mean response, but a prediction interval for a single new observation
Ynew at x = xnew .
Now
Ynew = β0 + β1 xnew + εnew ,
so our uncertainty about Ynew comes from two sources:
We have to use estimates β̂0 and β̂1 instead of the true values;
We have no information about the value of εnew .
The best prediction of Ynew is still Ŷnew , but the estimated prediction
standard error is
$$pse(Y_{\text{new}}) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}}.$$
We use the estimated prediction standard error to set up a
100(1 − α)% prediction interval for Ynew :
Ŷnew ± tα/2,ν × pse(Ynew ).
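A Python sketch comparing the two standard errors at the same xnew (made-up data): the extra 1 under the square root makes the prediction standard error strictly larger, so prediction intervals are always wider than the corresponding confidence intervals.

```python
# Sketch: the prediction standard error pse(Y_new) always exceeds the
# standard error of the mean response at the same x_new; made-up data.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST = sum((y - ybar) ** 2 for y in ys)
beta1 = Sxy / Sxx
SSE = SST - beta1 * Sxy
sigma_hat = math.sqrt(SSE / (n - 2))

def se_fit(x_new):
    # uncertainty about the mean response only
    return sigma_hat * math.sqrt(1 / n + (x_new - xbar) ** 2 / Sxx)

def pse(x_new):
    # the extra "1 +" accounts for the new observation's own noise term
    return sigma_hat * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / Sxx)

x_new = 3.5
print(se_fit(x_new), pse(x_new))  # pse is the larger of the two
```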
In R
The predict() method can produce either
A confidence interval, for the mean response;
A prediction interval, for a single new observation.
oxygenLm <- lm(Purity ~ HC, oxygen)
# confidence interval for mean response:
predict(oxygenLm, newdata = data.frame(HC = 1.0),
        interval = "confidence")
# prediction interval for single observation:
predict(oxygenLm, newdata = data.frame(HC = 1.0),
        interval = "prediction")
Indicator variables
Typically, the predictor variable x is a controllable or measured
variable; however, sometimes it is an artificial quantity, constructed
to convey some information.
An indicator variable is a variable that takes only the values 0 and 1;
the observed data can then be divided into two groups: those where
x = 0 and those where x = 1.
When x = 0, the regression model
Y = β0 + β1 x + ε
simplifies to
Y = β0 + ε,
so the mean response for this group is β0 .
When x = 1, the regression model becomes
Y = β0 + β1 + ε,
so the mean response for this group is β0 + β1 .
So the interpretation of the coefficients β0 and β1 is:
β0 is the mean response for group 0;
β1 is the difference between the mean response for group 1 and
the mean response for group 0.
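This interpretation can be checked directly with the least squares formulas; a Python sketch with made-up two-group data:

```python
# Sketch: with an indicator predictor (x = 0 or 1), the least squares
# coefficients recover the group means; made-up two-group data.
xs = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
ys = [10.0, 11.0, 12.0, 15.0, 16.0, 17.0]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
beta1 = Sxy / Sxx
beta0 = ybar - beta1 * xbar

mean0 = sum(y for x, y in zip(xs, ys) if x == 0.0) / 3
mean1 = sum(y for x, y in zip(xs, ys) if x == 1.0) / 3
print(beta0, mean0)          # beta0-hat is the group-0 mean
print(beta1, mean1 - mean0)  # beta1-hat is the difference of means
```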
When we discussed the factorial design with one factor having two
levels, we used the model
Yi,j = µ + τi + εi,j ,    i = 1 or 2,    j = 1, 2, . . . , n
with the constraint τ1 = 0; so:
µ is the mean response for the first (baseline) level of the factor;
τ2 is the difference between the mean response for the second
level and the mean response for the first level.
The two models represent the same features, in a different notation.
That is:
the factorial model with a single factor having two levels is
essentially the same as the regression model with an
indicator variable as the predictor.
That insight is not especially helpful in this context, but when we
move on to regression models with more than one predictor it is very
useful.
Using indicator variables allows us to put regression models and
factorial models into a unified framework, called the general linear
model.
Coefficient of Determination
How well does a regression model fit a given set of data?
A careful answer must look at the context: does the model give
usefully precise predictions? What counts as “usefully precise” varies
from one context to another.
A quick but not careful answer is to look at the coefficient of
determination, R 2 :
$$R^2 = 1 - \frac{SSE}{SST},$$
the fraction of variability in Y explained by the predictor x.
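A Python sketch of the hand calculation (made-up data); for simple linear regression, R² also equals Sxy²/(Sxx · SST), which makes a handy cross-check:

```python
# Sketch: compute R^2 = 1 - SSE/SST by hand on made-up data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
SST = sum((y - ybar) ** 2 for y in ys)
beta1 = Sxy / Sxx
SSE = SST - beta1 * Sxy

R2 = 1 - SSE / SST   # fraction of variability in Y explained by x
print(R2)
```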
In R
For the Oxygen example, part of the regression output is:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   74.283      1.593   46.62  < 2e-16 ***
HC            14.947      1.317   11.35 1.23e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.087 on 18 degrees of freedom
Multiple R-squared: 0.8774,    Adjusted R-squared: 0.8706
F-statistic: 128.9 on 1 and 18 DF,  p-value: 1.227e-09
The R 2 is reported as the Multiple R-squared, 0.8774. We might
report that “87.74% of the variability in oxygen purity is explained by
variations in hydrocarbon levels”.
Regression with Transformed Variables
In some applications of regression, either the response Y or the
predictor x may need to be transformed, and sometimes both.
Example: wind turbine
Response: DC Output
Predictor: Wind speed
In R
turbine <- read.csv("Data/Table-11-05.csv")
with(turbine, plot(WindSpeed, Output))
with(turbine, plot(-1 / WindSpeed, Output))
No straight line will give a good fit to the plot of Output versus Wind
Speed. The plot of Output versus (-1 / Wind Speed) looks more
appropriate for a straight line fit.
Try both:
summary(lm(Output ~ WindSpeed, turbine))
summary(lm(Output ~ I(-1 / WindSpeed), turbine))
Note: many symbols like “-” and “/” that usually represent
arithmetic operations have special meanings in a formula, so an
expression like -1 / WindSpeed must be “wrapped” in the identity
function I(.).
Note that the second model has R 2 = 0.98, while the first has
R 2 = 0.8745.
That is, using -1 / WindSpeed as the predictor accounts for much
more of the variability in Output than using WindSpeed.
Interestingly, using -1 / WindSpeed also results in the model equation

$$\hat{y} = 2.98 - \frac{6.93}{x},$$

which predicts that the Output will never be higher than 2.98, no
matter how high the Wind Speed.
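The bound is easy to see numerically; a small Python sketch of the fitted equation:

```python
# Sketch: the fitted curve y-hat = 2.98 - 6.93/x rises toward its
# asymptote 2.98 as wind speed x grows, but never reaches it.
def yhat(wind_speed):
    return 2.98 - 6.93 / wind_speed

for x in (2.0, 5.0, 10.0, 100.0):
    print(x, yhat(x))  # increasing in x, always below 2.98
```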