Chapter 13: Introduction to Regression Analysis
• Regression analysis is used to:
– Predict values of a dependent variable, Y, based on its relationship with values of at least one independent variable, X.
– Explain the impact of changes in an independent variable on the dependent variable by estimating the numerical value of the relationship.
• Dependent variable: the variable we wish to explain
• Independent variable: the variable used to explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X (thus, "simple")
• The relationship between X and Y is described by a linear function
• Changes in Y are assumed to be caused by changes in X; that is, a change in X causes a change in Y
Important points before we start a regression analysis:
• The most important consideration in deciding whether there is a relationship between X and Y is to have a systematic model that is based on logical reasons.
• Investigate the nature of the relationship between X and Y (use a scatter diagram, the covariance, and the correlation coefficient); a sketch of this step follows this list.
• Remember that regression is not an exact or deterministic mathematical equation. It is a behavioral relationship that is subject to randomness.
• Remember that X is not the only thing that explains the behavior of Y. There are other factors that you may not have information about.
• All you are trying to do is to estimate the relationship using the best linear fit possible.
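As an illustration of that exploratory step, here is a minimal Python sketch; the x and y arrays are hypothetical, invented only for the example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: lot size (1000s of sq. ft.) and appraised value ($1000s)
x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])

# Covariance and correlation coefficient between X and Y
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = np.corrcoef(x, y)[0, 1]
print(f"covariance = {cov_xy:.2f}, correlation = {r:.3f}")

# Scatter diagram to inspect the nature of the relationship
plt.scatter(x, y)
plt.xlabel("Lot size (1000s of sq. ft.)")
plt.ylabel("Appraised value ($1000s)")
plt.title("Scatter diagram of Y vs. X")
plt.show()
```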
Types of Relationships
[Figure: scatter plots illustrating linear relationships (positive and negative slope) and curvilinear relationships]
Types of Relationships (continued)
[Figure: scatter plots illustrating strong relationships and weak relationships]
Types of Relationships (continued)
[Figure: scatter plots illustrating no relationship between X and Y]
Simple Linear Regression Conceptual Model
The population regression model: this is a conceptual model, a hypothesis, or a postulation.

Yi = β0 + β1Xi + εi

– Yi = dependent variable
– β0 = population Y intercept
– β1 = population slope coefficient
– Xi = independent variable
– εi = random error term
• β0 + β1Xi is the linear component; εi is the random error component.
• The model to be estimated from sample data is:

Yi = b0 + b1Xi + ei

• The actual line estimated from the sample is:

Ŷi = b0 + b1Xi

– where Ŷi = estimated (or predicted) Y value for observation i, b0 = estimate of the regression intercept, b1 = estimate of the regression slope, and Xi = value of X for observation i. The residual (random error from the sample) is

ei = Yi − Ŷi
• The individual random error terms, ei, have a mean of zero, i.e.,

Σ ei = 0   (sum over i = 1, …, n)

• Since the sum of the random errors is zero, we estimate the regression line such that the sum of squared differences is minimized; hence the name Ordinary Least Squares (OLS) method, i.e.,

min Σ ei² = min Σ (Yi − Ŷi)² = min Σ (Yi − (b0 + b1Xi))²

• The values of b0 and b1 estimated by OLS are the only possible values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ, as sketched below.
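As a minimal sketch of what OLS does, the closed-form formulas b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)² and b0 = Ȳ − b1X̄ can be computed directly; the data arrays are hypothetical:

```python
import numpy as np

# Hypothetical sample data
x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])

x_bar, y_bar = x.mean(), y.mean()

# OLS slope: b1 = sum((Xi - X̄)(Yi - Ȳ)) / sum((Xi - X̄)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# OLS intercept: b0 = Ȳ - b1 * X̄
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x   # fitted values Ŷi
e = y - y_hat         # residuals ei = Yi - Ŷi
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {e.sum():.2e}")  # ~0, as expected
```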
Simple Linear Regression Model
[Figure: scatter of observed Y values around the fitted line Ŷi = b0 + b1Xi, showing the intercept b0, the slope b1, the observed value Yi at Xi, the predicted value Ŷi at Xi, and the residual ei = Yi − Ŷi]
Interpretation of the Slope and the Intercept
β0 = E(Y | X = 0);  β1 = ∂E(Y | X) / ∂X
• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X
• Units of measurement of X and Y are very important for the correct interpretation of the slope and the intercept
• Example (appraised value in $1000s, lot size in 1000s of square feet):

App Val = 165.03 + 6.93 (Lot size)

• Predict the appraised value of a house with a 10,000 sq. ft. lot size:

App Val = 165.03 + 6.93 (10) = 234.33, i.e., $234,330

• How good is this prediction?
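A one-line check of that arithmetic; the coefficients come from the example equation above, and the helper name is ours:

```python
def predict_app_val(lot_size_k):
    """Appraised value ($1000s) from lot size (1000s of sq. ft.),
    using the example equation App Val = 165.03 + 6.93 * (lot size)."""
    return 165.03 + 6.93 * lot_size_k

print(predict_app_val(10))  # 234.33, i.e., $234,330
```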
How Good is the Model's Prediction Power?
• Total variation is made up of two parts:

SST = SSR + SSE

where SST is the total sum of squares, SSR the regression sum of squares, and SSE the error sum of squares:

SST = Σ (Yi − Ȳ)²
SSR = Σ (Ŷi − Ȳ)²
SSE = Σ (Yi − Ŷi)²

where:
Ȳ = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
• SST = total sum of squares
– Measures total variation of the Yi values around their mean
• SSR = regression sum of squares (Explained)
– Explained portion of total variation, attributed to Y's relationship with X
• SSE = error sum of squares (Unexplained)
– Variation of the Y values attributable to factors other than their relationship with X
[Figure: at a given Xi, the deviation of Yi from Ȳ decomposes into the pieces summed in SSE = Σ(Yi − Ŷi)², SSR = Σ(Ŷi − Ȳ)², and SST = Σ(Yi − Ȳ)²; Ȳ represents Y without the effect of X]
How Good is the Model's Prediction Power?
• The coefficient of determination is the portion of the total variation in the dependent variable, Y, that is explained by variation in the independent variable, X
• The coefficient of determination is also called r-squared and is denoted r²:

r² = SSR / SST = regression sum of squares / total sum of squares

0 ≤ r² ≤ 1
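A minimal sketch of the decomposition and r², continuing the same hypothetical data used earlier:

```python
import numpy as np

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation

r2 = ssr / sst                          # coefficient of determination
print(f"SST = {sst:.2f} = SSR ({ssr:.2f}) + SSE ({sse:.2f})")
print(f"r-squared = {r2:.3f}")
```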
Standard Error of the Estimate
• The standard deviation of the variation of observations around the regression line is estimated by

S_YX = √(SSE / (n − 2)) = √(Σ (Yi − Ŷi)² / (n − 2)) = √MSE   (sum over i = 1, …, n)

– where SSE = error sum of squares and n = sample size
– The concept is the same as the standard deviation (average difference) around the mean of a univariate variable
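On the same hypothetical data, S_YX is then one extra line:

```python
import numpy as np

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

# Standard error of the estimate: S_YX = sqrt(SSE / (n - 2)) = sqrt(MSE)
s_yx = np.sqrt(sse / (n - 2))
print(f"S_YX = {s_yx:.3f}")
```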
Comparing Standard Errors
• S_YX is a measure of the variation of observed Y values around the regression line
[Figure: two scatter plots around fitted lines, one with small S_YX (points tight around the line) and one with large S_YX (points widely dispersed)]
• The magnitude of S_YX should always be judged relative to the size of the Y values in the sample data; e.g., S_YX = $36.34K is moderately small relative to house prices in the $200K–$300K range (average $215K)
Assumptions of Regression
• Normality of Error
– Error values (ε) are normally distributed for any
given value of X
• Homoscedasticity
– The probability distribution of the errors has
constant variance
• Independence of Errors
– Error values are statistically independent
How to Investigate the Appropriateness of the Fitted Model
• The residual for observation i, ei, is the difference between its observed and predicted value:

ei = Yi − Ŷi

• Check the assumptions of regression by examining the residuals:
– Examine the linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)
– Evaluate the normal distribution assumption
– Evaluate the independence assumption
• Graphical analysis of residuals
– Can plot residuals vs. X, as in the sketch below
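A minimal residual-plot sketch along those lines (hypothetical data; the hoped-for pattern is a random scatter around zero):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Plot residuals vs. X; a patternless band around 0 supports
# the linearity and constant-variance assumptions
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residuals")
plt.title("Residuals vs. X")
plt.show()
```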
Residual Analysis for Linearity
[Figure: paired Y-vs-X and residuals-vs-X plots; a curved residual pattern signals a non-linear relationship, while a random scatter signals linearity]
Residual Analysis for Homoscedasticity
[Figure: paired Y-vs-X and residuals-vs-X plots; a fan-shaped residual spread signals non-constant variance, while an even band signals constant variance]
Residual Analysis for Independence
[Figure: residuals-vs-X plots; a systematic pattern signals residuals that are not independent, while a random scatter signals independent residuals]
Measuring Autocorrelation: the Durbin-Watson Statistic (DO NOT COVER)
• Used when data are collected over time, to detect whether autocorrelation is present. It can also be useful for cross-sectional data.
• Autocorrelation exists if residuals in one time period are related to residuals in another period
[Figure: "Time (t) Residual Plot" — here, the residuals show a cyclic pattern, not a random one]
The Durbin-Watson Statistic
• The Durbin-Watson statistic is used to test for autocorrelation:
H0: residuals are not correlated
H1: autocorrelation is present
• The possible range is 0 ≤ D ≤ 4
• D should be close to 2 if H0 is true
• D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation

D = Σ (ei − ei−1)² / Σ ei²

where the numerator sum runs over i = 2, …, n and the denominator sum over i = 1, …, n.
• Calculate the Durbin-Watson test statistic, D
• Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)
• Test for positive autocorrelation (H0: positive autocorrelation does not exist; H1: positive autocorrelation is present):
– Reject H0 if D < dL; the test is inconclusive if dL ≤ D ≤ dU; do not reject H0 if D > dU
• Test for negative autocorrelation (H0: negative autocorrelation does not exist; H1: negative autocorrelation is present):
– Reject H0 if D > 4 − dL; the test is inconclusive if 4 − dU ≤ D ≤ 4 − dL; do not reject H0 if D < 4 − dU
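A minimal sketch of computing D from a residual series (the residuals below are hypothetical; the critical values dL and dU still come from a Durbin-Watson table):

```python
import numpy as np

# Hypothetical residuals from a regression on time-ordered data
e = np.array([1.2, 0.8, -0.3, -1.1, -0.6, 0.4, 1.0, 0.7, -0.5, -0.9])

# D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(f"D = {d:.3f}")  # a value close to 2 suggests no autocorrelation
```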
Inferences about the Estimated Parameters
• t test for a population slope
– Is there a linear relationship between X and Y?
• Null and alternative hypotheses
– H0: β1 = 0 (no linear relationship)
– H1: β1 ≠ 0 (a linear relationship does exist)
• Test statistic (d.f. = n − 2):

t = (b1 − β1) / Sb1

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = estimate of the standard error of the slope
• The standard error of the regression slope coefficient (b1) is estimated by

Sb1 = S_YX / √SSX = S_YX / √(Σ (Xi − X̄)²)

where S_YX = √(SSE / (n − 2))

• Sb1 is a measure of the variation in the slope of regression lines fitted to different possible samples
[Figure: two sets of fitted lines, one with small Sb1 (lines nearly coincide) and one with large Sb1 (lines vary widely)]
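A minimal sketch of the slope t test under those formulas (hypothetical data; scipy supplies the t distribution):

```python
import numpy as np
from scipy import stats

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

s_yx = np.sqrt(sse / (n - 2))                        # standard error of the estimate
s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))   # standard error of the slope

t_stat = (b1 - 0) / s_b1                             # test H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value, d.f. = n - 2
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```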
F-Test for Significance
• A second approach to testing for the existence of a significant relationship is the ratio of explained to unexplained variance. For multiple regression, this is also a test of the entire model.
• H0: β1 = 0; H1: β1 ≠ 0
• The ratio is

F = MSR / MSE

where F follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom (k = the number of independent variables in the regression model).
Source             df           Sum of Squares   Mean Square           Actual F
Regression         k            SSR              MSR = SSR/k           F(k, n−k−1) = MSR/MSE
Residual (error)   n − k − 1    SSE              MSE = SSE/(n−k−1)
Total              n − 1        SST
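A sketch of the F ratio for the simple-regression (k = 1) case, on the same hypothetical data:

```python
import numpy as np
from scipy import stats

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])
n, k = len(x), 1

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

msr = np.sum((y_hat - y.mean()) ** 2) / k       # MSR = SSR / k
mse = np.sum((y - y_hat) ** 2) / (n - k - 1)    # MSE = SSE / (n - k - 1)

f_stat = msr / mse
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")   # F equals t^2 when k = 1
```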
Confidence Interval Estimates and Prediction of Individual Values of Y
1. Confidence interval estimate for the slope:

β1 = b1 ± t(n−2) Sb1

2. Confidence interval estimate for the mean value of Y given a particular Xi (an interval for μY|X=Xi):

Ŷ ± t(n−2) S_YX √hi

where

hi = 1/n + (Xi − X̄)² / SSX = 1/n + (Xi − X̄)² / Σ(Xi − X̄)²

– Note that the size of the interval varies according to the distance of Xi from the mean, X̄
3. Confidence interval estimate for an individual value of Y given a particular Xi (an interval for Y|X=Xi):

Ŷ ± t(n−2) S_YX √(1 + hi)

– The extra term, 1, adds to the interval width to reflect the added uncertainty of predicting an individual case
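A minimal sketch of both intervals at a given Xi (same hypothetical data; 95% confidence level):

```python
import numpy as np
from scipy import stats

x = np.array([4.5, 6.0, 7.2, 8.1, 9.5, 10.0, 11.3, 12.8])
y = np.array([190.0, 205.5, 210.0, 222.4, 230.1, 234.0, 245.2, 255.8])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

xi = 10.0                                  # the X value of interest
y_hat = b0 + b1 * xi
hi = 1 / n + (xi - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)      # 95% two-sided critical value

# Confidence interval for the mean of Y given Xi, and the wider
# prediction interval for an individual Y given Xi
ci = (y_hat - t_crit * s_yx * np.sqrt(hi), y_hat + t_crit * s_yx * np.sqrt(hi))
pi = (y_hat - t_crit * s_yx * np.sqrt(1 + hi), y_hat + t_crit * s_yx * np.sqrt(1 + hi))
print(f"CI for mean Y | X = {xi}: {ci}")
print(f"Prediction interval for individual Y | X = {xi}: {pi}")
```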
• Graphical Presentation of Confidence Interval Estimation
[Figure: fitted line Ŷ = b0 + b1Xi, with the confidence interval for the mean of Y given Xi lying inside the wider prediction interval for an individual Y given Xi; both intervals widen as Xi moves away from X̄]
Pitfalls of Regression Analysis
1. Lacking an awareness of the assumptions underlying least-squares regression
2. Not knowing how to evaluate the assumptions
3. Not knowing the alternatives to least-squares regression if a particular assumption is violated
4. Using a regression model without knowledge of the subject matter
5. Extrapolating outside the relevant range
6. Let's look at 4 different data sets, with their scatter diagrams and residual plots, on pages 509-510.
Strategies:
1. Start with a scatter plot of Y versus X to observe the possible relationship
2. Perform residual analysis to check the assumptions
– Plot the residuals vs. X to check for violations of assumptions such as homoscedasticity
– Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality
3. If any assumption is violated, use alternative methods or models
4. If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
5. Avoid making predictions or forecasts outside the relevant range