Class 23 notes – Darden Faculty

Class 23
• The most over-rated statistic
• The four assumptions
• The most important hypothesis test yet
• Using yes/no variables in regressions
• Adjusted R-square
Hours: 2, 4.17, 4.42, 4.75, 4.83, 6.67, 7, 7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08

Pg 9-12 Pfeifer note
Hours (descriptive statistics)

Mean                 7.900667
Standard Error       1.003487
Median               7.08
Mode                 7.17
Standard Deviation   3.886488
Sample Variance      15.10479
Kurtosis             -0.75506
Skewness             0.524811
Range                13.08
Minimum              2
Maximum              15.08
Sum                  118.51
Count                15
Our better method of forecasting hours would use a mean of 7.9 and a standard deviation of 3.89 (and the t-distribution with 14 dof).
The sample variance is

$$s^2 = \frac{\sum (Y - \bar{Y})^2}{n - 1} = 3.89^2 = 15.1$$

This is the variation in Hours that regression will try to explain.
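To make the slide's numbers concrete, here is a minimal Python sketch of this "no-regression" forecast method, with the Hours data typed in from the slide. The 95% interval at the end is an illustrative use of the t-distribution with 14 dof, not a number computed on the slide.

```python
import numpy as np
from scipy import stats

# Hours data from the slide (n = 15)
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

n = len(hours)
mean = hours.mean()
s = hours.std(ddof=1)            # sample std dev (n - 1 divisor)

print(mean, s, s**2)             # ~7.90, ~3.89, ~15.1

# Probability forecast of hours: t-distribution with n - 1 = 14 dof,
# e.g. a central 95% interval
lo, hi = stats.t.interval(0.95, n - 1, loc=mean, scale=s)
print(lo, hi)
```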
Adjusted R-square
Pg 9-12 Pfeifer note
SUMMARY OUTPUT

MSF     Hours
26      2
34.2    4.17
29      4.42
34.3    4.75
85.9    4.83
143.2   6.67
85.5    7
140.6   7.08
140.6   7.17
40.4    7.17
101     10
239.7   12
179.3   12.5
126.5   13.67
140.8   15.08

Regression Statistics
Multiple R          0.7260033
R Square            0.5270808
Adjusted R Square   0.4907024
Standard Error      2.7735959
Observations        15
ANOVA
             df
Regression    1
Residual     13
Total        14

            Coefficients
Intercept   3.312316
MSF         0.0444895

Ŷ = 3.31 + 0.044 × 157.3 = 10.51

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof).

The squared standard error is

$$s_e^2 = \frac{\sum (Y - \hat{Y})^2}{n - 2} = 2.77^2 = 7.69$$

This is the variation in Hours that regression leaves unexplained.
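Here is a minimal Python sketch of the same fit, assuming (as the slide does) that job A has MSF = 157.3; scipy.stats.linregress stands in for Excel's regression tool:

```python
import numpy as np
from scipy import stats

msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5,
                140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

fit = stats.linregress(msf, hours)
print(fit.intercept, fit.slope)            # ~3.31, ~0.0445

# Point forecast for job A (MSF = 157.3)
print(fit.intercept + fit.slope * 157.3)   # ~10.51

# Residual standard error: sqrt(SSE / (n - 2))
resid = hours - (fit.intercept + fit.slope * msf)
se = np.sqrt((resid ** 2).sum() / (len(hours) - 2))
print(se, se ** 2)                         # ~2.77, ~7.69
```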
Adjusted R-square (Pg 9-12 Pfeifer note)
• Adjusted R-square is the percentage of variation explained.
• The initial variation is s² = 15.1.
• The variation left unexplained (after using MSF in a regression) is (standard error)² = 7.69.
• Adjusted R-square = [s² − (standard error)²] / s²
• Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49
• The regression using MSF explained 49% of the variation in hours.
• The "adjusting" already happened in the calculations of s and the standard error (each divides by its degrees of freedom).
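A two-line check of the slide's arithmetic, using the numbers above:

```python
s2 = 15.1     # total variation in Hours (sample variance)
se2 = 7.69    # variation left unexplained (squared standard error)

print((s2 - se2) / s2)   # ~0.49, matching Excel's "Adjusted R Square"
```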
From the Pfeifer note (figure): adjusted R-square runs on a scale from 1.0 (standard error = 0) through 0.5 down to 0.0 (standard error = s).
Why Pfeifer says R² is over-rated
• There is no standard for how large it should be.
  – In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING.
• It has no real use.
  – Unlike the standard error, which is needed to make probability forecasts.
• It is usually redundant.
  – When comparing models, lower standard errors mean higher adjusted R².
  – The correlation coefficient (which shares the same sign as b) ≈ the square root of adjusted R².
The Coal Pile Example
• The firm needed a way to estimate the weight of a coal pile (based on its dimensions).

96% of the variation in W is explained by this regression. We just used MULTIPLE regression.

SUMMARY OUTPUT

W     D    h    d
56    20   10   15
93    25   10   20
161   30   12   24
31    15   12   10
70    20   14   13
76    20   14   13
375   40   16   32
34    15   14   8
45    20   8    16
58    20   10   15

Regression Statistics
Multiple R          0.986792416
R Square            0.973759272
Adjusted R Square   0.960638908
Standard Error      20.56622179
Observations        10

ANOVA
             df
Regression    3
Residual      6
Total         9

            Coefficients
Intercept   -294.6954733
D           -15.12016461
h           23.02366255
d           27.62139918
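A minimal Python sketch reproducing these multiple-regression coefficients; np.linalg.lstsq plays the role of Excel's regression tool:

```python
import numpy as np

# Coal pile data from the slide: weight W and dimensions D, h, d
W = np.array([56, 93, 161, 31, 70, 76, 375, 34, 45, 58], dtype=float)
D = np.array([20, 25, 30, 15, 20, 20, 40, 15, 20, 20], dtype=float)
h = np.array([10, 10, 12, 12, 14, 14, 16, 14, 8, 10], dtype=float)
d = np.array([15, 20, 24, 10, 13, 13, 32, 8, 16, 15], dtype=float)

# Design matrix: intercept column plus the three predictors
X = np.column_stack([np.ones_like(W), D, h, d])
coef, *_ = np.linalg.lstsq(X, W, rcond=None)
print(coef)   # ~[-294.70, -15.12, 23.02, 27.62]
```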
The Coal Pile Example
• Engineer Bob calculated the Volume of each pile and used simple regression…

Essentially 100% of the variation in W is explained by this regression. The standard error went from 20.6 to 2.8!!!

SUMMARY OUTPUT

W     Vol
56    2421.64
93    3992.44
161   6898.94
31    1492.26
70    3038.44
76    3038.44
375   16353.04
34    1499.06
45    2044.13
58    2421.64

Regression Statistics
Multiple R          0.999673782
R Square            0.99934767
Adjusted R Square   0.999266129
Standard Error      2.808218162
Observations        10

ANOVA
             df
Regression    1
Residual      8
Total         9

            Coefficients
Intercept   0.668821408
Vol         0.022970159
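The transcript doesn't say how Bob computed Volume, but the slide's Vol column matches the volume of a conical frustum (a cone with the top cut off) computed from D, d, and h, so here is a hedged reconstruction:

```python
import numpy as np

# Conical-frustum volume: V = pi * h * (D^2 + D*d + d^2) / 12,
# with bottom diameter D, top diameter d, and height h
D = np.array([20, 25, 30, 15, 20, 20, 40, 15, 20, 20], dtype=float)
h = np.array([10, 10, 12, 12, 14, 14, 16, 14, 8, 10], dtype=float)
d = np.array([15, 20, 24, 10, 13, 13, 32, 8, 16, 15], dtype=float)

vol = np.pi * h * (D**2 + D * d + d**2) / 12
print(np.round(vol, 2))   # 2421.64, 3992.44, 6898.94, ... as on the slide
```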
Sec 5 of Pfeifer note
Sec 12.4 of EMBS
The Four Assumptions
• Linearity
  – μ(Y│X) = a + b × X
• Independence
  – The n observations were sampled independently from the same population.
• Homoskedasticity
  – All Y's given X share a common σ.
• Normality
  – The probability distribution of Y│X is normal.
  – Errors are normal. Y's don't have to be.

[The slide illustrates these with three plots: a Y-vs-X scatter with the fitted line, a residual-vs-X plot, and a frequency histogram of the residuals.]
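A short sketch of how one might check these assumptions for the MSF/Hours regression, using matplotlib for plots in the spirit of the slide's (the plotting choices here are assumptions, not the slide's exact charts):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

msf = np.array([26, 34.2, 29, 34.3, 85.9, 143.2, 85.5,
                140.6, 140.6, 40.4, 101, 239.7, 179.3, 126.5, 140.8])
hours = np.array([2, 4.17, 4.42, 4.75, 4.83, 6.67, 7,
                  7.08, 7.17, 7.17, 10, 12, 12.5, 13.67, 15.08])

fit = stats.linregress(msf, hours)
resid = hours - (fit.intercept + fit.slope * msf)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))

# Linearity / homoskedasticity: residuals vs X should show no curve
# and a roughly constant spread
ax1.scatter(msf, resid)
ax1.axhline(0, color="gray")
ax1.set(xlabel="MSF", ylabel="Residual")

# Normality: the residual histogram should look roughly bell-shaped
ax2.hist(resid, bins=7)
ax2.set(xlabel="Residual", ylabel="Frequency")

plt.tight_layout()
plt.show()
```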
Sec 5 of Pfeifer note
Sec 12.4 of EMBS
The four assumptions
The regression of Hours on MSF (same SUMMARY OUTPUT as above) shows where each assumption enters:

Ŷ = 3.31 + 0.044 × 157.3 = 10.51

Our better method of forecasting hours for job A would use a mean of 10.51 and a standard deviation of 2.77 (and the t-distribution with 13 dof). This forecast relies on:
• Independence (all 15 points count equally)
• Linearity
• Homoskedasticity
• Normality
Hypotheses
P 13 of Pfeifer note
Sec 12.5 of EMBS
• H0: P = 0.5 (LTT, wunderdog)
• H0: independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome)
• H0: μ = 100 (IQ)
• H0: μM = μF (heights, weights, batting averages)
• H0: μcompact = μmid = μlarge (displacement)
H0: b = 0
P 13 of Pfeifer note
Sec 12.5 of EMBS
• b = 0 means X and Y are independent.
  – In this way it's like the chi-squared independence test… for numerical variables.
• b = 0 means don't use X to forecast Y.
  – Don't put X in the regression equation.
• b = 0 means just use Ȳ to forecast Y.
• b = 0 means the "true" adjusted R-square is zero.
P 13 of Pfeifer note
Sec 12.5 of EMBS
Testing b=0 is EASY!!!

Recall the test of H0: μ = 100 (IQ):

$$t = \frac{\bar{Y} - 100}{s/\sqrt{n}}$$

with the p-value from the t-distribution with n − 1 dof.

Testing H0: b = 0 works the same way:

$$t = \frac{b - 0}{\text{se of coef}}$$

with the p-value from the t-distribution with n − 2 dof. Excel's regression output gives the standard error of each coefficient, the t-stat to test b = 0, and the 2-tailed p-value:

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   3.3123         1.4021           2.3624   0.0344    0.2832      6.3414
MSF         0.0445         0.0117           3.8064   0.0022    0.0192      0.0697
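A quick check of the MSF row in Python, with n = 15 as on the slide (the small difference in the t-stat comes from rounding the standard error):

```python
from scipy import stats

b = 0.0444895    # slope on MSF
se_b = 0.0117    # standard error of the coefficient (Excel output)
n = 15

t_stat = (b - 0) / se_b
p_two_tail = 2 * stats.t.sf(abs(t_stat), n - 2)   # two-tailed p-value
print(t_stat, p_two_tail)                         # ~3.80, ~0.0022
```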
Using Yes/No variables in Regression

n = 60 cars. Class and Fuel Type are categorical; Displacement and Hwy MPG are numerical:

Car   Class     Displacement   Fuel Type   Hwy MPG
1     Midsize   3.5            R           28
2     Midsize   3              R           26
3     Large     3              P           26
4     Large     3.5            P           25
.     .         .              .           .
58    Compact   6              P           20
59    Midsize   2.5            R           30
60    Midsize   2              R           32

Does MPG "depend" on fuel type?

Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Fuel type (yes/no) and mpg (numerical)
• Un-stack the data so there are two columns of MPG data.
• Data Analysis, t-Test: Two-Sample.

H0: μP = μR, or H0: μP − μR = 0

t-Test: Two-Sample Assuming Equal Variances

                               P          R
Mean                           24.33333   27.70833
Variance                       12.4       9.519928
Observations                   36         24
Pooled Variance                11.2579
Hypothesized Mean Difference   0
df                             58
t Stat                         -3.81704
P(T<=t) one-tail               0.000165
t Critical one-tail            1.671553
P(T<=t) two-tail               0.000331
t Critical two-tail            2.001717

The same test can be reproduced from these summary statistics alone; see the sketch below.
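A sketch reproducing the equal-variances t-test from the summary statistics on the slide (the raw 60-car data set isn't in the transcript):

```python
import numpy as np
from scipy import stats

# Summary statistics from the Excel t-test output
mean_p, var_p, n_p = 24.33333, 12.4, 36
mean_r, var_r, n_r = 27.70833, 9.519928, 24

df = n_p + n_r - 2
pooled = ((n_p - 1) * var_p + (n_r - 1) * var_r) / df
t_stat = (mean_p - mean_r) / np.sqrt(pooled * (1 / n_p + 1 / n_r))
p_two_tail = 2 * stats.t.sf(abs(t_stat), df)

print(pooled, t_stat, p_two_tail)   # ~11.26, ~-3.82, ~0.00033
```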
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Using Yes/No variables in Regression
1. Convert the categorical variable into a 1/0 DUMMY variable.
   – Use an IF statement to do this.
   – It won't matter which category is assigned 1 and which 0.
   – It doesn't even matter which two numbers you assign to the two categories (regression will adjust).
2. Regress MPG (numerical) on DUMMY (1/0 numerical).
3. Test H0: b = 0 using the regression output. (A sketch of all three steps follows below.)
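A minimal Python sketch of the three steps on a tiny hypothetical sample (the full 60-car data set isn't reproduced in the transcript):

```python
import numpy as np
from scipy import stats

# Hypothetical mini-sample: fuel type and highway MPG
fuel = np.array(["R", "R", "P", "P", "P", "R"])
mpg = np.array([28.0, 26.0, 26.0, 25.0, 20.0, 30.0])

# Step 1: the IF statement -- Dprem = 1 for premium, 0 for regular
dprem = (fuel == "P").astype(float)

# Step 2: regress MPG on the dummy
fit = stats.linregress(dprem, mpg)

# Step 3: the p-value on the slope tests H0: b = 0
print(fit.intercept, fit.slope, fit.pvalue)

# The intercept is the mean of the D = 0 group; the slope is
# mean(P) - mean(R)
print(mpg[dprem == 0].mean(),
      mpg[dprem == 1].mean() - mpg[dprem == 0].mean())
```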
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Using Yes/No variables in Regression

Fuel Type   Dprem   Hwy MPG
R           0       28
R           0       26
P           1       26
P           1       25
.           .       .
P           1       21
P           1       25
P           1       20
R           0       30
R           0       32

SUMMARY OUTPUT

Regression Statistics
Adj R Square     0.1870
Standard Error   3.3553
Observations     60

ANOVA
             df   SS        MS        F        Sig F
Regression    1   164.025   164.025   14.570   3.306E-04
Residual     58   652.958   11.258
Total        59   816.983

            Coeff    Std Error   t Stat    P-value
Intercept   27.708   0.6849      40.4564   3.321E-44
Dprem       -3.375   0.8842      -3.8170   3.306E-04
Sec 8 of Pfeifer note
Sec 13.7 of EMBS
Regression with one Dummy variable

MPG-hat = 27.7 − 3.4 × Dprem   (that is, Ŷ = a + b × D)

For Regular, when D = 0: Ŷ = a, so MPG-hat = 27.7.
For Premium, when D = 1: Ŷ = a + b, so MPG-hat = 27.7 − 3.4 = 24.3.

Testing H0: μP = μR (or H0: μP − μR = 0) is the same as testing H0: b = 0.
What we learned today
• We learned about "adjusted R-square"
  – The most over-rated statistic of all time.
• We learned the four assumptions required to use regression to make a probability forecast of Y│X.
  – And how to check each of them.
• We learned how to test H0: b = 0.
  – And why this is such an important test.
• We learned how to use a yes/no variable in a regression.
  – Create a dummy variable.