Chapter Topics
• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation in Regression and Correlation
• Assumptions of Regression and Correlation
• Residual Analysis and the Durbin-Watson Statistic
• Estimation of Predicted Values
• Correlation - Measuring the Strength of the Association
Purpose of Regression and Correlation Analysis
• Regression Analysis is Used Primarily for Prediction
  – A statistical model used to predict the values of a dependent or response variable based on values of at least one independent or explanatory variable
• Correlation Analysis is Used to Measure Strength of the Association Between Numerical Variables
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Simple Linear Regression Model
• Relationship Between Variables Is a Linear Function
• The Straight Line that Best Fits the Data
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i$ = Dependent (Response) Variable
$\beta_0$ = Y intercept
$\beta_1$ = Slope
$X_i$ = Independent (Explanatory) Variable
$\varepsilon_i$ = Random Error
Sample Linear Regression Model
$\hat{Y}_i = b_0 + b_1 X_i$
$\hat{Y}_i$ = Predicted Value of Y for observation i
$X_i$ = Value of X for observation i
$b_0$ = Sample Y-intercept used as estimate of the population $\beta_0$
$b_1$ = Sample Slope used as estimate of the population $\beta_1$
Simple Linear Regression Equation: Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Equation for the Best Straight Line
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
From Excel Printout:
               Coefficients
Intercept      1636.414726
X Variable 1   1.486633657
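The coefficients in the printout can be reproduced by hand. The following is a minimal sketch, assuming the ordinary least-squares formulas $b_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2$ and $b_0 = \bar{Y} - b_1 \bar{X}$, which the slides do not state explicitly; the function name `fit_line` is illustrative only.

```python
# Produce-store data from the slides (annual sales are in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squared deviations of X.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(square_feet, annual_sales)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Rounded to three decimals, the result should agree with the Intercept and X Variable 1 coefficients in the Excel printout.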
Graph of the Best Straight Line
[Figure: scatter plot of Annual Sales ($000) versus Square Feet with the fitted regression line]
Interpreting the Results
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units.
Because annual sales are measured in $000, the model predicts that for each increase of 1 square foot in the size of the store, expected annual sales increase by $1,487.
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the $Y_i$ values around their mean $\bar{Y}$
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• variation attributable to factors other than the relationship between X and Y
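Measures of Variation: The Sum of Squares
$SST = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
[Figure: Y versus X at a given $X_i$, showing the distances between $Y_i$, $\hat{Y}_i$ and $\bar{Y}$]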
The Sum of Squares: Example
Excel Output for Produce Stores
             df   SS
Regression   1    30380456.12   (SSR)
Residual     5    1871199.595   (SSE)
Total        6    32251655.71   (SST)
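The three sums of squares can be computed directly from their definitions. The sketch below assumes the produce-store data and a least-squares fit; the helper name `sums_of_squares` is hypothetical. It also illustrates the standard identity SST = SSR + SSE, which the Excel table reflects (Total = Regression + Residual).

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]  # predicted values

    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
    return sst, ssr, sse


sst, ssr, sse = sums_of_squares(square_feet, annual_sales)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```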
The Coefficient of Determination
$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$
Measures the proportion of variation that is explained by the independent variable X in the regression model
Coefficients of Determination ($r^2$) and Correlation (r)
[Figure: four scatter plots of Y versus X, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$. Panels: $r^2 = 1$, $r = +1$; $r^2 = 1$, $r = -1$; $r^2 = .8$, $r = +0.9$; $r^2 = 0$, $r = 0$]
Standard Error of Estimate
$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$
The standard deviation of the variation of observations around the regression line
Measures of Variation: Example
Excel Output for Produce Stores
Regression Statistics
Multiple R          0.9705572
R Square            0.94198129   ($r^2$ = .94)
Adjusted R Square   0.93037754
Standard Error      611.751517   ($S_{YX}$)
Observations        7
94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage
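As a quick check on the regression statistics, the coefficient of determination and the standard error of the estimate follow from the sums of squares in the Excel output. This is a small sketch using those printed values as inputs; taking the positive square root for Multiple R assumes a positive slope, as in this example.

```python
import math

# Values copied from the Excel output for the produce stores.
ssr = 30380456.12   # regression sum of squares
sse = 1871199.595   # error (residual) sum of squares
sst = 32251655.71   # total sum of squares
n = 7               # number of stores

r_squared = ssr / sst                 # proportion of variation explained by X
multiple_r = math.sqrt(r_squared)     # positive root, since the slope is positive
s_yx = math.sqrt(sse / (n - 2))       # standard error of the estimate

print(f"r^2 = {r_squared:.4f}")
print(f"r   = {multiple_r:.4f}")
print(f"Syx = {s_yx:.3f}")
```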
Linear Regression Assumptions
For Linear Models
1. Normality
  – Y Values Are Normally Distributed For Each X
  – Probability Distribution of Error is Normal
2. Homoscedasticity (Constant Variance)
3. Independence of Errors
Variation of Errors Around the Regression Line
• y values are normally distributed around the regression line.
• For each x value, the "spread" or variance around the regression line is the same.
[Figure: error distributions f(e) drawn at X1 and X2 around the regression line; axes X and Y]
Residual Analysis
• Purposes
  – Examine Linearity
  – Evaluate violations of assumptions
• Graphical Analysis of Residuals
  – Plot residuals vs. $X_i$ values
    • Residual = difference between actual $Y_i$ and predicted $\hat{Y}_i$
  – Studentized residuals:
    • Allow consideration for the magnitude of the residuals
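To make the residual definition concrete, the sketch below computes $e_i = Y_i - \hat{Y}_i$ for the produce-store data and lists the residuals in order of X, which is the tabular equivalent of a residual plot. The function name `residuals` is illustrative; studentized residuals are not computed here.

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def residuals(x, y):
    """Return e_i = Y_i - Yhat_i for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


resid = residuals(square_feet, annual_sales)

# List residuals by increasing X; a curved or fanning pattern would
# suggest non-linearity or non-constant variance.
for xi, ei in sorted(zip(square_feet, resid)):
    print(f"X = {xi:5d}  residual = {ei:9.2f}")
```

With a least-squares fit that includes an intercept, the residuals sum to zero, so the pattern around zero is what matters, not the average level.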
Residual Analysis for Linearity
[Figure: residual plots of e versus X. Panels: Not Linear; Linear]
Residual Analysis for Homoscedasticity
Using Standardized Residuals
[Figure: plots of standardized residuals (SR) versus X. Panels: Heteroscedasticity; Homoscedasticity]
The Durbin-Watson Statistic
• Used when data is collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
• Measures violation of the independence assumption
$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
Should be close to 2. If not, examine the model for autocorrelation.
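The statistic is straightforward to compute from a time-ordered list of residuals. The sketch below implements the formula as written; the two residual series are made-up illustrations, not data from the slides.

```python
def durbin_watson(residuals):
    """Durbin-Watson D for residuals listed in time order."""
    # Numerator: squared differences of successive residuals (i = 2..n).
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    # Denominator: sum of squared residuals (i = 1..n).
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series, for illustration only.
drifting = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]       # neighbors are similar
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # neighbors flip sign

print(f"drifting:    D = {durbin_watson(drifting):.3f}")
print(f"alternating: D = {durbin_watson(alternating):.3f}")
```

Residuals that drift slowly (positive autocorrelation) push D well below 2, while residuals that flip sign each period (negative autocorrelation) push it above 2.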
Residual Analysis for Independence
[Figure: plots of standardized residuals (SR) versus X. Panels: Not Independent; Independent]
Inferences about the Slope: t Test
• t Test for a Population Slope
  Is There a Linear Relationship Between X & Y?
• Null and Alternative Hypotheses
  H0: $\beta_1 = 0$ (No Linear Relationship)
  H1: $\beta_1 \neq 0$ (Linear Relationship)
• Test Statistic:
  $t = \frac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$ and df = n - 2
Example: Produce Stores
Data for 7 Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Regression Model Obtained: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is there a linear relationship between the square footage of a store and its annual sales?
Inferences about the Slope: t Test
• H0: $\beta_1 = 0$
• H1: $\beta_1 \neq 0$
• $\alpha = .05$
• df = 7 - 2 = 5
• Critical Value(s): ±2.5706 (rejection region of .025 in each tail)
Test Statistic, From Excel Printout:
               t Stat      P-value
Intercept      3.6244333   0.0151488
X Variable 1   9.009944    0.0002812
Decision: Reject H0, since t = 9.0099 > 2.5706
Conclusion: There is evidence of a linear relationship.
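The t statistic for the slope can be rebuilt from the data using $S_{b_1} = S_{YX} / \sqrt{\sum (X_i - \bar{X})^2}$. This is a sketch under the assumption that the critical value 2.5706 is read from a t table for n - 2 degrees of freedom, as on the slide; it is hard-coded rather than computed.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]

b1 = 1.486633657     # sample slope from the Excel printout
s_yx = 611.751517    # standard error of the estimate from the Excel printout
t_crit = 2.5706      # t table value, .025 in each tail, df = n - 2

n = len(square_feet)
df = n - 2
x_bar = sum(square_feet) / n
ssx = sum((xi - x_bar) ** 2 for xi in square_feet)  # sum of squared X deviations

s_b1 = s_yx / math.sqrt(ssx)   # standard error of the slope
t_stat = (b1 - 0) / s_b1       # hypothesized slope under H0 is 0

reject_h0 = abs(t_stat) > t_crit
print(f"df = {df}, S_b1 = {s_b1:.4f}, t = {t_stat:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```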
Inferences about the Slope: Confidence Interval Example
Confidence Interval Estimate of the Slope
$b_1 \pm t_{n-2} S_{b_1}$
Excel Printout for Produce Stores
               Lower 95%     Upper 95%
Intercept      475.810926    2797.01853
X Variable 1   1.06249037    1.91077694
At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.
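The interval can be cross-checked from numbers already in the printouts. The sketch below backs out $S_{b_1}$ from the reported slope and its t Stat (since $t = b_1 / S_{b_1}$ when the hypothesized slope is 0) and then applies $b_1 \pm t_{n-2} S_{b_1}$; the table value 2.5706 is taken from the slides.

```python
b1 = 1.486633657    # slope from the Excel printout
t_stat = 9.009944   # t Stat for X Variable 1 from the Excel printout
t_crit = 2.5706     # t table value for df = 5, .025 in each tail

s_b1 = b1 / t_stat               # standard error of the slope
half_width = t_crit * s_b1
lower = b1 - half_width
upper = b1 + half_width

excludes_zero = not (lower <= 0 <= upper)
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
print("Interval excludes 0" if excludes_zero else "Interval includes 0")
```

Small differences from the Excel limits in the later decimal places are expected because the table value of t is rounded.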
Estimation of Predicted Values
Confidence Interval Estimate for $\mu_{Y|X}$, the Mean of Y given a particular $X_i$
$\hat{Y}_i \pm t_{n-2} \cdot S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
• $S_{YX}$: standard error of the estimate
• $t_{n-2}$: t value from table with df = n - 2
• Size of interval varies according to distance away from the mean, $\bar{X}$.
Estimation of Predicted Values
Confidence Interval Estimate for Individual Response $Y_i$ at a Particular $X_i$
$\hat{Y}_i \pm t_{n-2} \cdot S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
Addition of this 1 under the square root increases the width of the interval from that for the mean of Y
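Interval Estimates for Different Values of X
[Figure: regression line of Y versus X with two bands, the confidence interval for the mean of Y (inner) and the confidence interval for an individual $Y_i$ (outer), marked at a given X and at $\bar{X}$]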
Example: Produce Stores
Data for 7 Stores:
Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Predict the annual sales for a store with 2000 square feet.
Regression Model Obtained: $\hat{Y}_i = 1636.415 + 1.487 X_i$
Estimation of Predicted Values: Example
Confidence Interval Estimate for $\mu_{Y|X}$
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet
Predicted Sales $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$
$\hat{Y}_i \pm t_{n-2} \cdot S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 612.66$
Confidence interval for mean Y
Estimation of Predicted Values: Example
Confidence Interval Estimate for Individual Y
Find the 95% confidence interval for annual sales of one particular store of 2,000 square feet
Predicted Sales $\hat{Y}_i = 1636.415 + 1.487 X_i = 4610.45$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$
$\hat{Y}_i \pm t_{n-2} \cdot S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.45 \pm 1687.69$
Confidence interval for individual Y
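Both interval half-widths at X = 2,000 can be evaluated from the data with the two formulas for the mean of Y and for an individual Y. The following is a sketch under the slides' stated inputs ($S_{YX} = 611.75$, $t_5 = 2.5706$); the function name `interval_half_widths` is hypothetical.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]


def interval_half_widths(x, x_new, s_yx, t_crit):
    """Return (mean_half_width, individual_half_width) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    # Term shared by both formulas: 1/n + (X - Xbar)^2 / sum of squares of X.
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    mean_hw = t_crit * s_yx * math.sqrt(h)           # interval for the mean of Y
    individual_hw = t_crit * s_yx * math.sqrt(1 + h)  # interval for one Y
    return mean_hw, individual_hw


y_hat = 1636.415 + 1.487 * 2000   # point prediction from the fitted equation
mean_hw, individual_hw = interval_half_widths(
    square_feet, x_new=2000, s_yx=611.75, t_crit=2.5706)

print(f"Predicted sales: {y_hat:.2f} ($000)")
print(f"Half-width, mean of Y:    {mean_hw:.2f}")
print(f"Half-width, individual Y: {individual_hw:.2f}")
```

The individual-response interval is always the wider of the two because of the extra 1 under the square root, and both widen as X moves away from $\bar{X}$.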
Correlation: Measuring the Strength of Association
• Answer 'How Strong Is the Linear Relationship Between 2 Variables?'
• Coefficient of Correlation Used
  – Population correlation coefficient denoted $\rho$ ('Rho')
  – Values range from -1 to +1
  – Measures degree of association
• Is the Square Root of the Coefficient of Determination
Test of Coefficient of Correlation
• Tests If There Is a Linear Relationship Between 2 Numerical Variables
• Same Conclusion as Testing Population Slope $\beta_1$
• Hypotheses
  – H0: $\rho = 0$ (No Correlation)
  – H1: $\rho \neq 0$ (Correlation)
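To illustrate why the correlation test reaches the same conclusion as the slope test, the sketch below computes the sample correlation r for the produce-store data and the test statistic $t = r \sqrt{n-2} / \sqrt{1 - r^2}$ with n - 2 degrees of freedom. This formula is the standard one for testing $\rho = 0$ and is an addition here, since the slides do not show it.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def correlation(x, y):
    """Sample coefficient of correlation r between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


n = len(square_feet)
r = correlation(square_feet, annual_sales)
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
print(f"t = {t_stat:.4f} with df = {n - 2}")
```

The resulting t value should match the t Stat reported for the slope in the Excel printout, which is why the two tests lead to the same decision.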