Chapter 14: Simple Linear Regression
§14.1 Simple Linear Regression Model
Dependent variable (Y): The variable that is being predicted or explained by the regression equation.
Independent variable (X): The variable that is doing the predicting or explaining.
Simple linear regression: Regression involving only two variables, X and Y. The relationship between the two variables is approximated by a straight line.
Regression model: Y = β0 + β1X + ε
Describes how the variable Y is related to the variable X in simple linear regression.
Regression equation: E(Y) = β0 + β1X
Estimated regression equation: Ŷ = b0 + b1X
Estimated from sample data. Also called the "fitted" line.
§14.2 Least Squares Method
Let Yi = ith observation of the response variable
Ŷi = estimated value of Yi = b0 + b1Xi
Then the regression error for the ith observation is ei = Yi − Ŷi = Yi − (b0 + b1Xi).
ei is also called simply the residual.
Sum of squares due to error: SSE = Σei² = Σ(Yi − Ŷi)²
The objective of the least squares method is to determine the values of b0 and b1 that minimize SSE. The following formulas for b0 and b1 give the smallest value of SSE and are therefore called the least squares estimates.
b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)², i.e., b1 = [ΣXiYi − (ΣXi)(ΣYi)/n] / [ΣXi² − (ΣXi)²/n], and
b0 = Ȳ − b1X̄
Excel functions:
b1 = SLOPE(y_range,x_range), and b0 = INTERCEPT(y_range,x_range)
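For concreteness, here is the same computation as a minimal Python sketch (NumPy assumed; the x and y arrays are illustrative, not data from the text):

```python
import numpy as np

# Illustrative sample data (not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# b1 = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# b0 = Ybar - b1 * Xbar
b0 = y.mean() - b1 * x.mean()

print(b1, b0)  # matches what Excel's SLOPE(...) and INTERCEPT(...) would return
```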
§14.3 Coefficient of determination (r²)
Coefficient of determination is a measure of the goodness of fit of the estimated
regression equation. It can be interpreted as the proportion of the variability in the
dependent variable Y that is explained by the estimated regression equation.
[Figure: Sums of Squares. Scatter plot of Y against X with the fitted line Ŷ and the mean line Ȳ, showing the deviations Y − Ȳ, Y − Ŷ, and Ŷ − Ȳ for a sample point.]
SST = Total Sum of Squares = Σ(Yi − Ȳ)²
SSR = Sum of Squares due to Regression = Σ(Ŷi − Ȳ)²
SSE = Sum of Squares due to Error = Σ(Yi − Ŷi)²
SST = SSR + SSE
ANOVA Table

Source       Sum of Squares   df      Mean Square         F-test
Regression   SSR              1       MSR = SSR/1         F = MSR/MSE
Error        SSE              n − 2   MSE = SSE/(n − 2)
Total        SST              n − 1

r² = Coefficient of determination = SSR/SST
Correlation Coefficient (r)
Correlation coefficient is a measure of the strength of the linear association between two variables, X and Y. The value of r is always between −1 and +1. Negative values of r indicate a negative linear relationship and positive values of r indicate a positive linear relationship. The magnitude of r indicates the strength of the linear relationship, with a value of 1 or −1 indicating a perfect relationship. Values of r close to zero indicate that X and Y are not linearly related.
r = (sign of b1) √Coefficient of determination = (sign of b1) √r²
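These quantities are easy to verify in Python (same illustrative data as the earlier sketch; np.polyfit is used here only as a shortcut for the least squares fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)             # least squares slope and intercept

y_hat = b0 + b1 * x                      # fitted values
sse = np.sum((y - y_hat) ** 2)           # sum of squares due to error
ssr = np.sum((y_hat - y.mean()) ** 2)    # sum of squares due to regression
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares; sst = ssr + sse
r2 = ssr / sst                           # coefficient of determination
r = np.sign(b1) * np.sqrt(r2)            # correlation coefficient
```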
§14.4 Model assumptions
Given the regression model: Y = β0 + β1X + ε
1. The error term ε is a random variable with a mean or expected value of zero; that is, E(ε) = 0. This also means E(Y) = β0 + β1X.
2. The variance of ε, denoted by σ², is the same for all values of X.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable. Therefore Y, being a linear function of ε, is also a normally distributed random variable.
§14.5 Testing for significance
H0: β1 = 0
Ha: β1 ≠ 0
t-Test
Estimate of σ²: s² = MSE = SSE/(n − 2), with degrees of freedom = n − 2
s = √MSE = √(SSE/(n − 2))
Sb1 = standard error of b1 = s / √(Σ(Xi − X̄)²)
Test statistic: tcalc = b1 / Sb1
p-value = T.DIST.2T(ABS(tcalc),df), where df = n – 2
Confidence interval for β1: b1 ± tα/2 Sb1
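A minimal Python sketch of the t-test (SciPy assumed; same illustrative data as before; alpha = 0.05 is an assumed choice):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))          # s = sqrt(MSE)
s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))    # standard error of b1
t_calc = b1 / s_b1
p_value = 2 * stats.t.sf(abs(t_calc), df=n - 2)    # same as T.DIST.2T(ABS(tcalc), n-2)

t_crit = stats.t.ppf(0.975, df=n - 2)              # alpha = 0.05 (assumed)
ci = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)      # confidence interval for beta_1
```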
F-test for significance of regression
Mean square due to regression: MSR = SSR / (degrees of freedom for regression)
degrees of freedom for regression = number of independent variables
Then the test statistic is Fcalc = MSR / MSE
p-value = F.DIST.RT(Fcalc,df1,df2), where df1 = df for MSR and df2 = df for MSE
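The F-test can be sketched the same way (SciPy assumed; same illustrative data; for simple regression Fcalc equals tcalc², so the two tests give the same p-value):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
msr = np.sum((y_hat - y.mean()) ** 2) / 1          # df1 = 1 independent variable
mse = np.sum((y - y_hat) ** 2) / (n - 2)           # df2 = n - 2
f_calc = msr / mse
p_value = stats.f.sf(f_calc, 1, n - 2)             # same as F.DIST.RT(Fcalc, 1, n-2)
```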
Caveat: Regression analysis, which can be used to identify how variables are associated with one another, cannot be used as evidence of a cause-and-effect relationship.
§14.6 Estimation and prediction
Point estimate of Y for a given value Xp: Ŷp = b0 + b1Xp
Confidence interval for the mean of Y at Xp: Ŷp ± tα/2 SŶp
where SŶp = s √(1/n + (Xp − X̄)² / Σ(Xi − X̄)²)
Prediction interval for an individual value of Y at Xp: Ŷp ± tα/2 SŶind
where SŶind = s √(1 + 1/n + (Xp − X̄)² / Σ(Xi − X̄)²)
Estimation and prediction in Excel
Prediction interval for an individual Y at a given Xp: take the Lower 95% and Upper 95% values for the prediction from the special regression output.
Confidence interval estimate of μYp: Ŷp ± tα/2 SŶp
where SŶp = √(Sind² − MSE), df = df of MSE, and Sind is taken from the special regression output.
§14.8 Residual Analysis: Validating Model Assumptions
Residual = Yi − Ŷi
Given the regression model: Y = β0 + β1X + ε
Model assumptions:
1. The error term ε is a random variable with a mean or expected value of zero; that is, E(ε) = 0. This also means E(Y) = β0 + β1X.
2. The variance of ε, denoted by σ², is the same for all values of X.
3. The values of ε are independent.
4. The error term ε is a normally distributed random variable. Therefore Y, being a linear function of ε, is also a normally distributed random variable.
Four plots:
1. A plot of the residuals against values of the independent variable X
2. A plot of the residuals against the predicted values of the dependent variable
3. A standardized residual plot
4. A normal probability plot
Plot #1: Residual Plot Against X
[Figure: example residual plots against X (Panel A, Panel B, Panel C).]
Plot #2: Residual Plot Against Ŷ
For simple regression, this plot is the same as Plot #1, since Ŷ is a linear function of X.
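A short Matplotlib sketch of Plot #1 (illustrative data; Plot #2 is obtained by putting the fitted values on the horizontal axis instead of x):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)                # residuals e_i = Y_i - Yhat_i

plt.scatter(x, resid)                    # Plot #1: residuals against X
plt.axhline(0, linestyle="--")           # reference line at zero
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()
```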
Plot #3: Standardized Residuals Against X
Residual: ei = Yi − Ŷi
Standardized residual = (Yi − Ŷi) / S(Yi − Ŷi)
where S(Yi − Ŷi) = standard deviation of the ith residual = s √(1 − hi)
and hi = 1/n + (Xi − X̄)² / Σ(Xi − X̄)²
The standardized residual plot can provide insight about the assumption that the error term ε has a normal distribution. If this assumption is satisfied, we should expect to see approximately 95% of the standardized residuals between −2 and +2.
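In Python, the standardized residuals follow directly from these formulas (same illustrative data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))          # s = sqrt(MSE)

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage h_i
std_resid = resid / (s * np.sqrt(1 - h))           # standardized residuals
print(np.abs(std_resid) > 2)                       # flags candidate outliers
```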
Plot #4: Normal Probability Plot
Another approach for determining the validity of the assumption that the error term ε has a normal distribution is the normal probability plot.
§14.9 Residual Analysis: Outliers and Influential observations
Outlier
A data point or observation that is unusual compared to the remaining data and should be carefully examined.
The standardized residuals can be used to identify outliers. If an observation deviates greatly from the pattern of the rest of the data (i.e., an outlier), the corresponding standardized residual will be large in absolute value.
i. Outliers may represent erroneous data; if so, the data should be corrected.
ii. Outliers may signal a violation of model assumptions; if so, another model should be considered.
iii. Outliers may simply be unusual values that occurred by chance. In this case, they should be retained.
Influential Observation
An observation that has a strong influence or effect on the regression results. Such observations can have a dramatic effect on the estimated regression equation.
Leverage
A measure of the influence an observation has on the regression results. Influential
observations, i.e., observations with extreme values for the independent variables, have
high leverage and are called high leverage points.
The leverage of an observation is given by hi = 1/n + (Xi − X̄)² / Σ(Xi − X̄)²
If hi > 6/n or 0.99, whichever is smaller, the observation is considered a high-leverage point.
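A small sketch of this rule (illustrative X values with one deliberately extreme point appended; the threshold follows the rule of thumb above):

```python
import numpy as np

# Illustrative data; the last X value is deliberately extreme
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
n = len(x)

h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverage h_i
threshold = min(6 / n, 0.99)              # rule of thumb from the text
print(np.where(h > threshold)[0])         # indices of high-leverage points -> [9]
```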
In these cases:
i. If the observation is invalid, correct it and find a new estimated regression equation.
ii. If the observation is valid, obtain data on intermediate values of x to better understand the relationship between x and y.