Download Binus Repository

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Linear least squares (mathematics) wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Psychometrics wikipedia , lookup

Taylor's law wikipedia , lookup

Confidence interval wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Omnibus test wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Regression toward the mean wikipedia , lookup

Transcript
Matakuliah
Tahun
: A0392 – Statistik Ekonomi
: 2006
Pertemuan 13
Uji Koefisien Korelasi dan Regresi
1
Outline Materi :
 Uji koefisien korelasi
 Uji koefisien regresi
 Uji parameter intersep
2
Inference About The Slope and
Coefficient Of Correlation
Inference about the slope with
Analysis of Variance
3
Measures of Variation:
The Sum of Squares
SST
=
Total
=
Sample
Variability
SSR
Explained
Variability
+
SSE
+
Unexplained
Variability
4
Measures of Variation:
The Sum of Squares
(continued)
• SST = Total Sum of Squares
– Measures the variation of the Yi values
around their mean,
Y
• SSR = Regression Sum of Squares
– Explained variation attributable to the
relationship between X and Y
• SSE = Error Sum of Squares
– Variation attributable to factors other than the
relationship between X and Y
5
Measures of Variation:
The Sum of Squares
(continued)

SSE =(Yi - Yi )2
Y
_
SST = (Yi - Y)2
 _
SSR = (Yi - Y)2
Xi
_
Y
X
6
Venn Diagrams and
Explanatory Power of
Regression
Variations in
store Sizes not
used in
explaining
variation in
Sales
Sizes
Sales
Variations in Sales
explained by the
error term or
unexplained by
Sizes  SSE 
Variations in Sales
explained by Sizes
or variations in Sizes
used in explaining
variation in Sales
 SSR 
7
The ANOVA Table in Excel
ANOVA
df
Regressio
k
n
Residuals
Total
SS
MS
SS
R
MSR
=SSR/k
n-k- SS
1
E
n-1
Significanc
e
F
F
P-value of
MSR/MSE
the F Test
MSE
=SSE/(n-k1)
SS
T
8
Measures of Variation
The Sum of Squares: Example
Excel Output for Produce Stores
Degrees of freedom
ANOVA
df
SS
MS
Regression
1
30380456.12
30380456
Residual
5
1871199.595 374239.92
Total
6
32251655.71
F
81.17909
Regression (explained) df
Error (residual) df
Total df
SSE
SSR
Significance F
0.000281201
SST
9
Venn Diagrams and
Explanatory Power of
Regression
r 
2
Sales
Sizes
SSR

SSR  SSE
10
Standard Error of Estimate

n
•
SYX
SSE


n2
i 1
Y  Yˆi

2
n2
• Measures the standard deviation
(variation) of the Y values around the
regression equation
11
Measures of Variation:
Produce Store Example
Excel Output for Produce Stores
R e g r e ssi o n S ta ti sti c s
M u lt ip le R
R S q u a re
0 .9 4 1 9 8 1 2 9
A d ju s t e d R S q u a re
0 .9 3 0 3 7 7 5 4
S t a n d a rd E rro r
6 1 1 .7 5 1 5 1 7
O b s e r va t i o n s
r2 = .94
0 .9 7 0 5 5 7 2
n
7
94% of the variation in annual sales can be
explained by the variability in the size of the
store as measured by square footage.
Syx
12
Linear Regression
Assumptions
• Normality
– Y values are normally distributed for each X
– Probability distribution of error is normal
• Homoscedasticity (Constant Variance)
• Independence of Errors
13
Consequences of Violation
of the Assumptions
• Violation of the Assumptions
– Non-normality (error not normally distributed)
– Heteroscedasticity (variance not constant)
• Usually happens in cross-sectional data
– Autocorrelation (errors are not independent)
• Usually happens in time-series data
• Consequences of Any Violation of the Assumptions
– Predictions and estimations obtained from the
sample regression line will not be accurate
– Hypothesis testing results will not be reliable
• It is Important to Verify the Assumptions
14
Variation of Errors Around
the Regression Line
f(e)
• Y values are normally distributed
around the regression line.
• For each X value, the “spread” or
variance around the regression line is
the same.
Y
X2
X1
X
Sample Regression Line
15
Inference about the Slope:
t Test
• t Test for a Population Slope
– Is there a linear dependency of Y on X ?
• Null and Alternative Hypotheses
– H0: 1 = 0
– H1: 1  0
• Test Statistic
b1  1
–
t
Sb1
(no linear dependency)
(linear dependency)
where Sb1 
SYX
n
(X
i 1
–
d. f .  n  2
i
 X)
2
16
Example: Produce Store
Data for 7 Stores:
Store
Square
Feet
Annual
Sales
($000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
Estimated Regression
Equation:
Yˆi  1636.415  1.487X i
The slope of this
model is 1.487.
Does square footage
affect annual sales?
17
Inferences about the Slope:
t Test Example
Test Statistic:
H0: 1 = 0
From Excel Printout b
1
H1: 1  0
Coefficients Standard Error
  .05
Intercept
1636.4147
451.4953
df  7 - 2 = 5
Footage
1.4866
0.1650
Decision:
Critical Value(s):
Reject
Reject
Reject H0.
.025
.025
-2.5706 0 2.5706
t
Sb1
t
t Stat P-value
3.6244 0.01515
9.0099 0.00028
p-value
Conclusion:
There is evidence that
square footage affects
18
annual sales.
Inferences about the Slope:
Confidence Interval Example
Confidence Interval Estimate of the Slope:
b1  tn  2 Sb1
Excel Printout for Produce Stores
Intercept
Footage
Lower 95% Upper 95%
475.810926 2797.01853
1.06249037 1.91077694
At 95% level of confidence, the confidence interval
for the slope is (1.062, 1.911). Does not include 0.
Conclusion: There is a significant linear dependency
of annual sales on the size of the store.
19
Inferences about the Slope:
F Test
• F Test for a Population Slope
– Is there a linear dependency of Y on X ?
• Null and Alternative Hypotheses
– H0: 1 = 0
– H1: 1  0
(no linear dependency)
(linear dependency)
• Test Statistic
SSR
1
– F 
SSE
 n  2
20
Relationship between a t Test
and an F Test
• Null and Alternative Hypotheses
– H0: 1 = 0
– H1: 1  0
•
t 
n2
2
(no linear dependency)
(linear dependency)
 F1,n 2
• The p –value of a t Test and the p –value
of an F Test are Exactly the Same
• The Rejection Region of an F Test is
Always in the Upper Tail
21
Inferences about the Slope:
F Test Example
H0: 1 = 0
ANOVA
H1: 1  0
  .05
Regression
numerator Residual
df = 1
Total
denominator
df  7 - 2 = 5
Test Statistic:
From Excel Printout
df
1
5
6
Reject
 = .05
0
6.61
F1, n  2
SS
MS
F Significance F
30380456.12 30380456.12 81.179
0.000281
1871199.595 374239.919
p-value
32251655.71
Decision: Reject H0.
Conclusion:
There is evidence that
square footage affects
annual sales.
22
Purpose of Correlation
Analysis
• Correlation Analysis is Used to Measure
Strength of Association (Linear
Relationship) Between 2 Numerical
Variables
– Only strength of the relationship is concerned
– No causal effect is implied
23
Purpose of Correlation
Analysis
(continued)
• Population Correlation Coefficient  (Rho)
is Used to Measure the Strength between
the Variables
 XY

 X Y
24
Purpose of Correlation
Analysis
(continued)
• Sample Correlation Coefficient r is an
Estimate of  and is Used to Measure the
Strength of the Linear Relationship in the
Sample Observations
n
r
 X
i 1
n
 X
i 1
i
i
 X Yi  Y 
X
2
n
 Y  Y 
i 1
2
i
25
Sample Observations from
Various r Values
Y
Y
Y
X
r = -1
X
r = -.6
Y
X
r=0
Y
r = .6
X
r=1
X
26
Features of  and r
• Unit Free
• Range between -1 and 1
• The Closer to -1, the Stronger the
Negative Linear Relationship
• The Closer to 1, the Stronger the Positive
Linear Relationship
• The Closer to 0, the Weaker the Linear
Relationship
27
t Test for Correlation
• Hypotheses
– H0:  = 0 (no correlation)
– H1:   0 (correlation)
• Test Statistic
t
–
r
where
 r
n2
2
n
r  r2 
 X
i 1
n
 X
i 1
i
i
 X Yi  Y 
X
2
n
 Y  Y 
i 1
2
i
28
Example: Produce Stores
r
From Excel Printout
Is there any
evidence of linear
relationship between
annual sales of a
store and its square
footage at .05 level
of significance?
R e g re ssi o n S ta ti sti c s
M u lt ip le R
R S q u a re
0 .9 7 0 5 5 7 2
0 .9 4 1 9 8 1 2 9
A d ju s t e d R S q u a re 0 . 9 3 0 3 7 7 5 4
S t a n d a rd E rro r
6 1 1 .7 5 1 5 1 7
O b s e rva t io n s
H0:  = 0 (no
association)
H1:   0 (association)
  .05
7
29
Example: Produce Stores
Solution
r
.9706
t

 9.0099
2
1  .9420
 r
5
n2
Critical Value(s):
Reject
.025
Reject
.025
-2.5706 0 2.5706
Decision:
Reject H0.
Conclusion:
There is evidence of a
linear relationship at 5%
level of significance.
The value of the t statistic is
exactly the same as the t
statistic value for test on the
slope coefficient.
30
Estimation of Mean Values
Confidence Interval Estimate for
Y | X  X
:
i
The Mean of Y Given a Particular Xi
Standard error
of the estimate
Size of interval varies according
to distance away from mean, X
Yˆi  tn 2 SYX
t value from table
with df=n-2
(Xi  X )
1
 n
n
2
(Xi  X )
2
i 1
31
Prediction of Individual
Values
Prediction Interval for Individual Response
Yi at a Particular Xi
Addition of 1 increases width of interval
from that for the mean of Y
Yˆi  tn2 SYX
1 (Xi  X )
1  n
n
2
(Xi  X )
2
i 1
32
Interval Estimates for Different
Values of X
Y
Confidence
Interval for the
Mean of Y
Prediction Interval
for a Individual Yi
X
X
a given X
33
Example: Produce Stores
Data for 7 Stores:
Store
Square
Feet
Annual
Sales
($000)
1
2
3
4
5
6
7
1,726
1,542
2,816
5,555
1,292
2,208
1,313
3,681
3,395
6,653
9,543
3,318
5,563
3,760
Consider a store
with 2000 square
feet.
Regression Model Obtained:

Yi = 1636.415 +1.487Xi
34
Estimation of Mean Values:
Example
Confidence Interval Estimate for
Y | X  X
i
Find the 95% confidence interval for the average annual
sales for stores of 2,000 square feet.

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)
X = 2350.29
SYX = 611.75
Yˆi  tn 2 SYX
tn-2 = t5 = 2.5706
( X i  X )2
1
 n
 4610.45  612.66
n
2
(Xi  X )
i 1
3997.02  Y |X  X i  5222.34
35
Prediction Interval for Y :
Example
Prediction Interval for Individual YX  X i
Find the 95% prediction interval for annual sales of one
particular store of 2,000 square feet.

Predicted Sales Yi = 1636.415 +1.487Xi = 4610.45 ($000)
X = 2350.29
SYX = 611.75
Yˆi  tn 2 SYX
tn-2 = t5 = 2.5706
1 ( X i  X )2
1  n
 4610.45  1687.68
n
2
(
X

X
)
 i
i 1
2922.00  YX  X i  6297.37
36
Estimation of Mean Values and Prediction
of Individual Values in PHStat
• In Excel, use PHStat | Regression | Simple
Linear Regression …
– Check the “Confidence and Prediction Interval
for X=” box
• Excel Spreadsheet of Regression Sales
on Footage
37
Pitfalls of Regression Analysis
• Lacking an Awareness of the Assumptions
Underlining Least-Squares Regression
• Not Knowing How to Evaluate the
Assumptions
• Not Knowing What the Alternatives to
Least-Squares Regression are if a
Particular Assumption is Violated
• Using a Regression Model Without
Knowledge of the Subject Matter
38
Strategy for Avoiding the
Pitfalls of Regression
• Start with a scatter plot of X on Y to
observe possible relationship
• Perform residual analysis to check the
assumptions
• Use a histogram, stem-and-leaf display,
box-and-whisker plot, or normal
probability plot of the residuals to uncover
possible non-normality
39
Strategy for Avoiding the
Pitfalls of Regression
(continued)
• If there is violation of any assumption, use
alternative methods (e.g., least absolute
deviation regression or least median of
squares regression) to least-squares
regression or alternative least-squares models
(e.g., curvilinear or multiple regression)
• If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients and construct confidence intervals
and prediction intervals
40
Chapter Summary
• Introduced Types of Regression Models
• Discussed Determining the Simple Linear
Regression Equation
• Described Measures of Variation
• Addressed Assumptions of Regression
and Correlation
• Discussed Residual Analysis
• Addressed Measuring Autocorrelation
41
Chapter Summary
(continued)
• Described Inference about the Slope
• Discussed Correlation - Measuring the
Strength of the Association
• Addressed Estimation of Mean Values and
Prediction of Individual Values
• Discussed Pitfalls in Regression and
Ethical Issues
42