Regression Analysis
Defense Resources Management Institute
Unscheduled Maintenance Issue:

36 flight squadrons

Each experiences unscheduled maintenance actions (UMAs)

Each UMA costs $1000 to repair, on average.
You’ve got the Data…
Now What?
Unscheduled Maintenance Actions (UMAs)
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 36 53 51 61 63 54 50 65 62 51 68 45
104 60 42 56 63 39 65 63 67 66 52 59 60
108 53 61 59 87 61 46 52 85 84 75 78 68
What do you want to know?
How many UMAs will there be next month?
What is the average number of UMAs?

Sample Mean
$\bar{x} = \frac{\sum x_i}{n} = 60$

Sample Standard Deviation
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 12.05$
UMA Sample Statistics
Statistic                 UMAs
Mean                      60
Standard Error of Mean    2.01
Median                    60.5
Mode                      61
Standard Deviation        12.05
Minimum                   36
Maximum                   87
Count                     36
UMAs Next Month
95% Confidence Interval
$x = 60 \pm 2(12)$
$36 \le x \le 84$
Average UMAs
95% Confidence Interval
$\mu = 60 \pm 2\left(\frac{12}{\sqrt{36}}\right)$
$56 \le \mu \le 64$
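The sample statistics and the two rough 95% intervals can be reproduced directly from the UMA table. The following is a minimal Python sketch, assuming the 36 monthly counts as transcribed from that table and the slides' rule-of-thumb multiplier of 2 (rather than an exact t or z critical value). The variable names are illustrative and do not come from the slides.

```python
import math
import statistics

# UMA counts transcribed from the table (squadrons 101, 104, 108; Jan to Dec).
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

n = len(umas)
mean = statistics.mean(umas)
sd = statistics.stdev(umas)        # sample standard deviation (n - 1 divisor)
sem = sd / math.sqrt(n)            # standard error of the mean
median = statistics.median(umas)

# Rule-of-thumb 95% intervals with a multiplier of 2, as in the slides.
next_month = (mean - 2 * sd, mean + 2 * sd)    # UMAs in a single month
average = (mean - 2 * sem, mean + 2 * sem)     # mean number of UMAs

print(f"n={n} mean={mean:.2f} sd={sd:.2f} sem={sem:.2f} median={median}")
print(f"UMAs next month: {next_month[0]:.0f} to {next_month[1]:.0f}")
print(f"Average UMAs:    {average[0]:.0f} to {average[1]:.0f}")
```

The interval for a single month uses the standard deviation, while the interval for the average uses the much smaller standard error, which is why the second range is so much narrower.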
Model: Cost of UMAs for one squadron
If the cost per UMA = $1000, the expected cost for one squadron = 60 * $1000 = $60,000
Model: Total Cost of UMAs
Expected Cost for all squadrons
= 60 * $1000 * 36 = $2,160,000
How confident are we about this
estimate?
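[Figure: normal curve for the sample mean, centered on the mean (= 60) with standard error = 12/√36 = 2, so 1 standard unit = 2. Areas under the curve on each side of the mean: .3413 between 0 and 1 standard units, .1359 between 1 and 2, .0215 between 2 and 3. About 95% of the area lies within 2 standard units of the mean, between ~56 and ~64.]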
95% Confidence Interval on our
estimate of UMAs and costs

60 ± 2(2) = [56, 64]

low cost: 56 * $1000 * 36 = $2,016,000

high cost: 64 * $1000 * 36 = $2,304,000
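The cost range follows mechanically from the interval on the mean. A small Python sketch of that arithmetic, assuming the slides' figures ($1000 per UMA, 36 squadrons, mean 60, standard error 2); the function name `total_cost` is illustrative only:

```python
COST_PER_UMA = 1_000   # dollars, average repair cost from the slides
N_SQUADRONS = 36       # number of squadrons in the model

def total_cost(umas_per_squadron):
    """Total UMA cost across all squadrons for a given UMA count per squadron."""
    return umas_per_squadron * COST_PER_UMA * N_SQUADRONS

mean_umas, std_error = 60, 2
low = mean_umas - 2 * std_error
high = mean_umas + 2 * std_error

for label, count in [("low", low), ("expected", mean_umas), ("high", high)]:
    print(f"{label:>8}: {count} UMAs -> ${total_cost(count):,.0f}")
```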
What do you want to know?
How many UMAs will there be next month?
What is the average number of UMAs?
Is there a relationship between UMAs and some other variable that may be used to predict UMAs?
What is that relationship?

Relationships

What might be related to UMAs?
  Pilot experience?
  Flight hours?
  Sorties flown?
  Mean time to failure (for specific parts)?
  Number of landings / takeoffs?
Regression:

To estimate the expected or mean value of UMAs for next month:
  Look for a linear relationship between UMAs and a “predictive” variable
  If a linear relationship exists, use regression analysis
Regression analysis:
Describes and evaluates relationships between one variable (the dependent or explained variable) and one or more other variables (the independent or explanatory variables).
What is a good estimating variable
for UMAs?
Quantifiable
Predictable
Logical relationship with dependent variable
Must be a linear relationship:
Y = a + bX
Sorties
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 100 120 114 132 146 124 110 138 140 114 157 106
104 130 106 124 140 100 146 142 141 148 118 128 130
108 122 134 126 190 136 110 120 196 184 154 172 157
Pilot Experience
Sq Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
101 6.06 2.81 3.37 3.87 4.22 6.67 2.61 1.96 2.96 2.45 3.29 3.73
104 4.61 2.45 4.65 5.71 7.23 3.01 2.53 1.54 4.49 1.73 4.81 5.17
108 1.11 5.75 4.9 3.59 6.88 1.17 2.59 5.87 7.28 7.79 5.87 2.47
Sample Statistics
Statistic                 Sorties   Exp
Mean                      135       4.06
Standard Error of Mean    3.99      0.31
Median                    131       3.80
Mode                      100       #N/A
Standard Deviation        23.92     1.84
Minimum                   100       1.11
Maximum                   196       7.79
Count                     36        36
Describing the Relationship

Is there a relationship?
  Do the two variables (UMAs and sorties or experience) move together?
  Do they move in the same direction or in opposite directions?

How strong is the relationship?
  How closely do they move together?
Positive Relationship
[Figure: scatter plot of Y (0 to 60) against X (0 to 60).]
Strong Positive Relationship
[Figure: scatter plot, both axes 0 to 60.]
Negative Relationship
[Figure: scatter plot of Y (0 to 50) against X (0 to 50).]
Strong Negative Relationship
[Figure: scatter plot, both axes 0 to 60.]
No Relationship
[Figure: scatter plot, vertical axis 0 to 25, horizontal axis 0 to 60.]
Relationship?
[Figure: scatter plot of Y (0 to 400) against X (0 to 60).]
Correlation Coefficient

Statistical measure of how closely two variables move together in a coordinated fashion

Measures strength and direction
Value ranges from -1.0 to +1.0
  +1.0 indicates “perfect” positive linear relation
  -1.0 indicates “perfect” negative linear relation
  0 indicates no linear relation between the two variables
Correlation Coefficient
$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}}$
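The correlation formula translates almost term for term into code. The sketch below is a plain-Python illustration (the function name `pearson_r` is an assumption, not from the slides) applied to the Sorties and UMA tables as transcribed here. Because the inputs are read off the slide tables, the result should land close to, though not necessarily exactly on, the r = .9788 reported in the slides.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient using the sums-based formula for r."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Monthly values for squadrons 101, 104, 108 (Jan to Dec), as transcribed.
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

r = pearson_r(sorties, umas)
print(f"r (sorties vs. UMAs) = {r:.4f}")
```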


Sorties vs. UMAs
[Figure: scatter plot of UMAs (0 to 90) against Sorties (0 to 200); r = .9788]
Experience vs. UMAs
[Figure: scatter plot of UMAs (0 to 90) against Pilot Experience (0.00 to 10.00); r = .1896]
Correlation Matrix
Correlation   UMAs        Sorties    Exp
UMAs          1
Sorties       0.9787613   1
Exp           0.1895905   0.198641   1
A Word of Caution...

Correlation does NOT imply causation
  It simply measures the coordinated movement of two variables
Variation in two variables may be due to a third common variable
The observed relationship may be due to chance alone

What is the Relationship?
In order to use the correlation information to help describe the relationship between two variables, we need a model
The simplest one is a linear model:

Y = a + bX
Fitting a Line to the Data
[Figure: scatter plot of Y (0 to 10) against X (0 to 14).]
One Possibility
[Figure: scatter plot of Y against X with one candidate line. Sum of errors = 0]
Another Possibility
[Figure: scatter plot of Y against X with a different candidate line. Sum of errors = 0]
Which is Better?
Both have sum of errors = 0
Compare sum of absolute errors:
Y     Y1    Error   Abs err   Y2    Error   Abs err
8     6     2       2         2     6       6
1     5     -4      4         5     -4      4
6     4     2       2         8     -2      2
4     5.5   -1.5    1.5       3.5   0.5     0.5
6     4.5   1.5     1.5       6.5   -0.5    0.5
Sum         0       11              0       13
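The comparison in this table is easy to check by hand or in code. A minimal Python sketch using the table's values (the helper names are illustrative):

```python
y = [8, 1, 6, 4, 6]          # observed values
y1 = [6, 5, 4, 5.5, 4.5]     # fitted values from the first line
y2 = [2, 5, 8, 3.5, 6.5]     # fitted values from the second line

def sum_errors(observed, fitted):
    """Signed errors: positive and negative misses cancel each other."""
    return sum(o - f for o, f in zip(observed, fitted))

def sum_abs_errors(observed, fitted):
    """Absolute errors: misses in either direction both count."""
    return sum(abs(o - f) for o, f in zip(observed, fitted))

for name, fitted in [("Y1", y1), ("Y2", y2)]:
    print(name, sum_errors(y, fitted), sum_abs_errors(y, fitted))
```

Both lines have signed errors that cancel to zero, so that criterion cannot separate them; the absolute-error totals (11 versus 13) favor the first line.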
Fitting a Line to the Data
[Figure: scatter plot of Y (0 to 10) against X (0 to 12).]
One Possibility
[Figure: scatter plot of Y against X with one candidate line. Sum of absolute errors = 6]
Another Possibility
[Figure: scatter plot of Y against X with a different candidate line. Sum of absolute errors = 6]
Which is Better?
The sums of the absolute errors are equal
Compare sum of errors squared:
Y     Y1    Abs err   Sq err   Y2    Abs err   Sq err
4     4     0         0        5.6   1.6       2.56
7     3     4         16       3.8   3.2       10.24
2     2     0         0        2     0         0
5     3.5   1.5       2.25     4.7   0.3       0.09
2     2.5   0.5       0.25     2.9   0.9       0.81
Sum         6         18.5           6         13.7
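Squaring the errors breaks the tie that absolute errors leave. The following Python sketch recomputes both criteria from the table's values; the helper names are illustrative, and results are rounded on output only because 5.6, 3.8, and similar values are not exact in binary floating point.

```python
y = [4, 7, 2, 5, 2]              # observed values
y1 = [4, 3, 2, 3.5, 2.5]         # fitted values from the first line
y2 = [5.6, 3.8, 2, 4.7, 2.9]     # fitted values from the second line

def sum_abs_errors(observed, fitted):
    return sum(abs(o - f) for o, f in zip(observed, fitted))

def sum_sq_errors(observed, fitted):
    """Squared errors weight one large miss more than several small ones."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))

for name, fitted in [("Y1", y1), ("Y2", y2)]:
    abs_total = sum_abs_errors(y, fitted)
    sq_total = sum_sq_errors(y, fitted)
    print(f"{name}: abs = {abs_total:.2f}, squared = {sq_total:.2f}")
```

The first line misses one point by 4, which alone contributes 16 of its 18.5 squared-error total; the second line spreads its misses more evenly and scores 13.7.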
The Correct Relationship: Y = a + bX + U
a + bX is the systematic part; U is the random part
[Figure: Y (50 to 100) plotted against X (100 to 130).]
Least-Squares Method

Penalizes large absolute errors

Slope: $b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$

Y-intercept: $a = \bar{Y} - b\bar{X}$
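As an illustration of the least-squares formulas, the sketch below computes the slope b and intercept a for the line Y = a + bX in plain Python and applies them to the Sorties and UMA tables as transcribed here. The function name `least_squares` and the 150-sortie example are assumptions for illustration; since the inputs are read off the slide tables, the fitted values should be close to, but may not exactly match, the coefficients in the regression output.

```python
def least_squares(x, y):
    """Return (a, b) for the line Y = a + bX using the sums-based formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (sum_xy - n * x_bar * y_bar) / (sum_xx - n * x_bar ** 2)   # slope
    a = y_bar - b * x_bar                                          # intercept
    return a, b

# Monthly values for squadrons 101, 104, 108 (Jan to Dec), as transcribed.
sorties = [
    100, 120, 114, 132, 146, 124, 110, 138, 140, 114, 157, 106,
    130, 106, 124, 140, 100, 146, 142, 141, 148, 118, 128, 130,
    122, 134, 126, 190, 136, 110, 120, 196, 184, 154, 172, 157,
]
umas = [
    36, 53, 51, 61, 63, 54, 50, 65, 62, 51, 68, 45,
    60, 42, 56, 63, 39, 65, 63, 67, 66, 52, 59, 60,
    53, 61, 59, 87, 61, 46, 52, 85, 84, 75, 78, 68,
]

a, b = least_squares(sorties, umas)
print(f"UMAs = {a:.2f} + {b:.4f} * Sorties")
print(f"Predicted UMAs at 150 sorties: {a + b * 150:.1f}")
```

By construction the fitted line passes through the point (mean of X, mean of Y), which is a quick sanity check on any implementation.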
Assumptions
Linear relationship: Y = a + bX + U
Errors are random and normally distributed with mean = 0 and variance = σ²
  Supported by Central Limit Theorem
Least Squares Regression for Sorties and UMAs
[Figure: UMAs (0 to 100) against Sorties (0 to 200) with the least-squares line.]
Regression Calculations
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.978761339
R Square             0.957973758
Adjusted R Square    0.956737692
Standard Error       2.505836188
Observations         36

ANOVA
             df   SS           MS            F             Significance F
Regression   1    4866.50669   4866.50669    775.0183246   5.51636E-25
Residual     34   213.49331    6.279215001
Total        35   5080

            Coefficients    Standard Error   t Stat         P-value       Lower 95%     Upper 95%
Intercept   -6.542935597    2.426476306      -2.696476195   0.01082052    -11.4741255   -1.611745688
Sorties     0.492910634     0.017705663      27.83915093    5.51636E-25   0.456928421   0.528892848
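Several entries in the summary output are derived from others, which makes the table easy to sanity-check. A short Python sketch, using only numbers copied from the output (variable names are illustrative):

```python
import math

# Values copied from the SUMMARY OUTPUT.
n = 36
ss_regression = 4866.50669
ss_residual = 213.49331
df_regression = 1
df_residual = n - 2

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression   # mean square = SS / df
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual            # F = MS regression / MS residual
std_error = math.sqrt(ms_residual)              # standard error of the estimate

print(f"SS total       = {ss_total:.2f}")
print(f"MS residual    = {ms_residual:.6f}")
print(f"F              = {f_stat:.4f}")
print(f"Standard error = {std_error:.6f}")
```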
Sorties vs. UMAs
Y = -6.54 + .49X
[Figure: UMAs (0 to 100) against Sorties (0 to 200) with the fitted regression line.]
Regression Calculations:
Confidence in the predictions
SUMMARY OUTPUT (same regression output as under Regression Calculations)
Standard Error       2.505836188
Observations         36
Confidence Interval for Estimate
$Y = a + bX \pm (t_{\alpha/2})\, s_e$
[Figure: UMAs (30 to 100) against Sorties (90 to 200).]
95% Confidence Interval for the model (b)
[Figure: Y against X.]
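The interval formula Y = a + bX ± (t) s_e can be applied directly to the regression output. The sketch below is a simplified illustration: it takes the coefficients and standard error from the summary output, assumes a two-tailed 95% critical t value of about 2.032 for 34 degrees of freedom (from standard t tables, not from the slides), and uses a hypothetical month with 150 sorties. It follows the constant-width form of the slide's formula; exact regression intervals also widen as X moves away from its mean.

```python
A = -6.542935597        # intercept from the regression output
B = 0.492910634         # slope from the regression output
STD_ERROR = 2.505836188 # standard error of the estimate
T_CRIT = 2.032          # assumed 95% two-tailed t value, 34 degrees of freedom

def predict_interval(sorties):
    """Return (low, center, high) for predicted UMAs at a given sortie count."""
    center = A + B * sorties
    half_width = T_CRIT * STD_ERROR
    return center - half_width, center, center + half_width

low, center, high = predict_interval(150)   # 150 sorties is a hypothetical input
print(f"Predicted UMAs: {center:.1f} (about {low:.1f} to {high:.1f})")
```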
Testing Model Parameters
How well does the model explain the variation in the dependent variable?
Does the independent variable really seem to matter?
Is the intercept constant statistically significant?

Variation
[Figure: UMAs (30 to 100) against Sorties (90 to 200), with three labelled Y values.]
Coefficient of Determination
$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$
Values between 0 and 1
  R² = 1 when all data on line (r = +1 or -1)
  R² = 0 when no correlation (r = 0)
Regression Calculations: How well
does the model explain the variation?
SUMMARY OUTPUT (same regression output as under Regression Calculations)
Multiple R           0.978761339
R Square             0.957973758
Adjusted R Square    0.956737692
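R Square is the explained (regression) sum of squares divided by the total sum of squares, so it can be recomputed from the ANOVA figures in the regression output. The Python sketch below does that; the adjusted R Square formula used here is the standard textbook one and is an addition, since the slides report the value without showing how it is computed.

```python
ss_regression = 4866.50669   # explained variation (Regression SS)
ss_total = 5080.0            # total variation (Total SS)
n = 36                       # observations
k = 1                        # explanatory variables (sorties only)

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
multiple_r = r_squared ** 0.5   # equals |r| when there is one explanatory variable

print(f"R Square          = {r_squared:.6f}")
print(f"Adjusted R Square = {adj_r_squared:.6f}")
print(f"Multiple R        = {multiple_r:.6f}")
```

Roughly 96% of the month-to-month variation in UMAs is accounted for by sorties in this model.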
Does the Independent
Variable Matter?
Y = a + bX
If sorties do not help predict UMAs, we expect b = 0
If b is not 0, is it statistically significant?

Regression Calculations: Does the
Independent Variable Matter?
SUMMARY OUTPUT (same regression output as under Regression Calculations)
            Coefficients    Standard Error   t Stat         P-value       Lower 95%     Upper 95%
Sorties     0.492910634     0.017705663      27.83915093    5.51636E-25   0.456928421   0.528892848
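The significance check for a coefficient amounts to dividing it by its standard error and comparing the result with a critical t value; the 95% interval uses the same ingredients. A Python sketch under stated assumptions: the critical value 2.0322 (two-tailed 95%, 34 degrees of freedom) comes from standard t tables rather than the slides, and the function name `coefficient_test` is illustrative.

```python
T_CRIT = 2.0322   # assumed two-tailed 95% t value for 34 degrees of freedom

def coefficient_test(coef, std_err, t_crit=T_CRIT):
    """Return (t statistic, lower 95%, upper 95%, significant?) for a coefficient."""
    t_stat = coef / std_err
    lower = coef - t_crit * std_err
    upper = coef + t_crit * std_err
    significant = abs(t_stat) > t_crit   # equivalent to the interval excluding 0
    return t_stat, lower, upper, significant

# Coefficients and standard errors copied from the regression output.
slope = coefficient_test(0.492910634, 0.017705663)
intercept = coefficient_test(-6.542935597, 2.426476306)

for name, (t_stat, lower, upper, significant) in [("Sorties", slope), ("Intercept", intercept)]:
    print(f"{name:>9}: t = {t_stat:.3f}, 95% CI = [{lower:.4f}, {upper:.4f}], "
          f"significant = {significant}")
```

Neither interval contains zero, which matches the small P-values in the output for both the slope and the intercept.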
95% Confidence Interval for the slope (a)
[Figure: Y against X, with Mean of X and Mean of Y marked.]
Confidence Interval for Slope
[Figure: UMAs (30 to 100) against Sorties (90 to 200).]
Is the Intercept
Statistically Significant?
SUMMARY OUTPUT (same regression output as under Regression Calculations)
            Coefficients    Standard Error   t Stat         P-value       Lower 95%     Upper 95%
Intercept   -6.542935597    2.426476306      -2.696476195   0.01082052    -11.4741255   -1.611745688
Confidence Interval for Y-intercept
[Figure: UMAs (30 to 100) against Sorties (90 to 210).]
Basic Steps of
Regression Analysis
Formulate the model
Plot scatter diagram for visual inspection
Compute correlation coefficient
Fit the regression line
Test the model

Factors affecting estimation accuracy
Sample size (larger is better)
Range of X values (wider is better)
Standard deviation of U (smaller is better)
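One way to see why these three factors matter is the standard textbook expression for the standard error of the slope in simple regression. It is not shown in the slides; here $s_e$ denotes the standard error of the estimate (the estimated standard deviation of U) and $s_b$ the standard error of the slope b.

```latex
s_b = \frac{s_e}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}},
\qquad
s_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - a - bX_i)^2}{n - 2}}
```

More observations and a wider spread of X values both enlarge the sum in the denominator, and a smaller standard deviation of U shrinks the numerator; each change tightens the estimate of b.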

Uses and Limitations
of Regression Analysis

Identifying relationships
  Not necessarily cause
  May be due to chance only

Forecasting future outcomes
  Only valid over the range of the data
  Past may not be good predictor of future
Common pitfalls in regression
Failure to draw scatter diagrams
Omitting important variables from the model
The “two point” phenomenon
Unfounded claims of model sophistication
Insufficient attention to interval estimates and predictions
Predicting too far outside of known range
Lines can be deceiving...
[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20); R² = .6662]
Nonlinear Relationship
[Figure: Y (0 to 14) against X (0 to 20) with fitted curve y = -0.1267x² + 2.7808x - 5.9957; R² = 1]
Best fit?
[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20).]
Misleading data
[Figure: X Variable 1 Line Fit Plot. Y (0 to 14) against X Variable 1 (0 to 20).]
Summary

Regression Analysis is a useful tool
  Helps quantify relationships

But be careful
  Does not imply cause and effect
  Don’t go outside range of data
  Check linearity assumptions
  Use common sense!
Non-linear relationship between output and cost
[Figure: Cost (0 to 50) against Output (0 to 20); r = 0.0]
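A correlation of zero does not mean two variables are unrelated, only that they have no linear relationship. The Python sketch below illustrates this with a made-up, perfectly U-shaped cost curve; the data are hypothetical and are not the values behind the slide's figure.

```python
import math

def pearson_r(x, y):
    """Correlation coefficient from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: cost depends entirely on output, but along a symmetric curve.
output = list(range(0, 21))
cost = [(q - 10) ** 2 / 2 for q in output]

r = pearson_r(output, cost)
print(f"r = {r:.4f}")
```

Cost here is completely determined by output, yet the rising and falling halves of the curve cancel in the correlation calculation, which is why a scatter diagram should come before any regression.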