Regression Analysis
How to develop and assess a CER

"All models are wrong, but some are useful." - George Box

"In mathematics, context obscures structure. In data analysis, context provides meaning." - George Cobb

"Mathematical theorems are true; statistical methods are sometimes effective when used with skill." - David Moore

Unit III - Module 8
© 2002-2010 SCEA. All rights reserved.
Regression Overview

• Key Ideas
  – Correlation
  – Best fit / minimum error
  – Statistical significance
  – Quantification of uncertainty
  – Homoscedasticity!

  Model:    Y = a + bX + ε
  Estimate: Ŷ = â + b̂X

• Analytical Constructs
  – OLS Regression
  – Analysis of Variance (ANOVA)
  – Confidence Intervals
  – Linear algebra

• Practical Applications
  – CER Development
  – Learning Curves

• Related Topics
  – Parametrics
  – Distributions (Normal, Chi-squared, t, F)
  – Hypothesis testing
  – Risk Analysis
Bivariate Data Analysis

• One independent variable and one dependent variable (i.e., y is a function of x)
• Visual display of information – scatter plot, residual plots ("What does it look like?")
• Measures of central tendency – Ŷ from the regression equation ("What's your best guess?")
• Measures of variability – Standard Error of the Estimate (SEE), R² ("How much remains unexplained?")
• Measures of uncertainty – Confidence and Prediction Intervals ("How precise are you?")
• Statistical Tests – F test, t test, ANOVA ("How can you be sure?")
Definition of Regression

• Regression Analysis is used to describe a statistical relationship between variables
• Specifically, it is the process of estimating the "best fit" parameters of a specified function that relates a dependent variable to one or more independent variables (including implicit uncertainty)

[Figure: a scatter plot of raw data (y vs. x), then the same data with the fitted regression line y = a + bx]
Regression Analysis in Cost Estimating

• If the dependent variable is a cost, the regression equation is often referred to as a Cost Estimating Relationship, or CER
  – The independent variable in a CER is often called a cost driver

Examples of cost drivers (single driver):

  Item             Cost Driver
  Aircraft Design  # of Drawings
  Software         Lines of Code
  Power Cable      Linear Feet

[Figure: a single-driver CER, with Linear Feet as the input and Power Cable Cost as the output]

• A CER may have multiple cost drivers:

[Figure: a multiple-driver CER, with Linear Feet and Power as inputs and Power Cable Cost as the output]
Preliminary Concepts – Correlation

• Correlation is quantified by a correlation coefficient (r)
  – r can range from -1 to +1

[Figure: five scatter plots illustrating r = -1, r = -0.8, r = 0, r = +0.8, and r = +1]

Tip: The stronger the correlation, the farther r is from zero

The presence of correlation is what leads you to regression analysis. Scatter plotting is the best way to detect it. Explicit estimation of r will be addressed later...
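As a preview of that explicit estimation, here is a minimal Python sketch (an illustration added to this transcript, not part of the original deck) computing r for the toy data set used in the worked example later in this module:

```python
import numpy as np

# Toy data from the worked example later in this module
x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])

# Pearson correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.2f}")  # ~0.79, matching "Multiple R" in the Excel demo
```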
Preliminary Concepts – Types of Models

• A mathematical function must be specified before regression analysis is performed
  – The specified function is called a regression model
  – Many types of models may be considered:

    linear:       y = a + bx
    power:        y = ax^b  (shape varies with b: b > 1, 0 < b < 1, b < 0)
    exponential:  y = ae^(bx)  (b > 0 or b < 0)
    quadratic:    y = a + bx + cx²
    logarithmic:  y = a + b ln x

[Figure: example curve shapes for each model form]
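This module fits linear models with OLS, but it is worth noting that some of the forms above can be fit with the same machinery after a transform. A minimal sketch, assuming the power form y = ax^b with strictly positive data (the data values here are made up purely for illustration):

```python
import numpy as np

# Hypothetical positive-valued data, for illustration only
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.1, 4.9, 8.2, 12.8, 21.0])

# y = a*x^b is linear in logs: ln y = ln a + b ln x,
# so OLS on (ln x, ln y) recovers b and ln a
b, ln_a = np.polyfit(np.log(x), np.log(y), 1)
print(f"y ~= {np.exp(ln_a):.2f} * x^{b:.2f}")
```

Note that this minimizes squared error in log space (a multiplicative error model), which is not identical to minimizing squared error in the original units.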
Key Concept of Regression

• The regression procedure uses the "method of least squares" to find the "best fit" parameters of a specified function
  – We will focus on Ordinary Least Squares (OLS) Regression¹
  – The idea is to minimize the sum of squared deviations (called "errors" or "residuals") between the Y data and the regression equation
  – This sum is called the "Sum of Squared Errors," or "SSE"

[Figure: scatter plot with the fitted line y = a + bx; vertical segments connect each data point to the line. In other words: find the equation that minimizes the sum of the squared lengths of those segments. Let's look at the math...]

¹ OLS Regression is most commonly used and is the foundation to the understanding of all regression techniques. To learn about variations and more advanced techniques, see Resources and Advanced Topics.
Finding the Regression Equation

• Problem: Find â and b̂ such that the SSE is minimized...

The SSE is minimized when the slope is equal to the correlation coefficient, r, multiplied by the quotient of standard deviations, sY/sX, and the line passes through the means of X and Y¹

This formulation reduces to the following set of equations:

  b̂ = (ΣXY – nX̄Ȳ) / (ΣX² – nX̄²)
  â = Ȳ – b̂X̄

where X and Y are the raw values of the 2 variables, and X̄ and Ȳ are the means of the 2 variables

Data:
  X:  4,  7,  8, 12, 16
  Y:  5,  5, 10,  7, 13

[Figure: scatter plot of the data with the fitted line ŷ = â + b̂x]

¹ The geometry of this formulation is provided under Related and Advanced Topics
Finding the Regression Equation: Example

Calculations (n = 5):

   X     Y    XY    X²
   4     5    20    16
   7     5    35    49
   8    10    80    64
  12     7    84   144
  16    13   208   256

  ΣXY = 427,  ΣX² = 529
  X̄ = avg of X data = 9.4
  Ȳ = avg of Y data = 8

Example Calculation:

  b̂ = (ΣXY – nX̄Ȳ) / (ΣX² – nX̄²) = (427 – 5·9.4·8) / (529 – 5·9.4²) = 51 / 87.2 ≈ 0.585 ≈ 0.6
  â = Ȳ – b̂X̄ = 8 – 0.585·9.4 ≈ 2.5

  Ŷ = 2.5 + 0.6 X

[Figure: scatter plot of the data with the fitted line ŷ = â + b̂x]
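A minimal Python sketch (illustration only) reproducing the example calculation and cross-checking it against NumPy's built-in least-squares fit:

```python
import numpy as np

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
n = len(x)

# Slope and intercept from the slide's formulas
b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
a = y.mean() - b * x.mean()
print(f"b_hat = {b:.3f}, a_hat = {a:.3f}")  # ~0.585 and ~2.502

# Cross-check: numpy.polyfit performs the same OLS fit
b_np, a_np = np.polyfit(x, y, 1)
print(f"polyfit: b_hat = {b_np:.3f}, a_hat = {a_np:.3f}")
```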
Statistical Error and Residual Analysis

• In addition to the equation we have just found, we must describe the statistical error – the "fuzz" or "noise" – in the data
• This is done by adding an "error term" (ε) to the basic regression equation to model the residuals:

  y = a + bx + ε

• There are two key assumptions in OLS regression regarding the error term:
  1. It is independent of X
  2. It is normally distributed with a mean of zero and constant variance for all X (this is called homoscedasticity!)
• These assumptions need to be checked, as the entire analysis hinges on their validity
  – A "residual plot" is useful in determining whether these assumptions apply to the data

[Figure: the error about the "true underlying" line generates a data cloud whose best-fit line is slightly different]

The error term is what makes Regression more than just Curve Fitting
Residual Analysis: Example

Create a Residual Plot to verify that the OLS assumptions hold: for each data point, plot its distance from the regression line y = 2.5 + 0.6x + ε against x.

[Figure: the fitted scatter plot and the corresponding residual plot, residuals vs. x]

Questions to ask:
1. Does the residual plot show independence? (yes)
2. Are the points symmetric about the x-axis with constant variability for all x? (yes)

The OLS assumptions are reasonable. This tells us:
• That our assumption of linearity is reasonable
• The error term ε can be modeled as Normal with mean of zero and constant variance
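The residuals and the plot can be reproduced with a short Python sketch (illustration only; matplotlib assumed for the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])

# Fitted values and residuals for Y_hat = 2.5 + 0.6X (unrounded coefficients)
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
print(np.round(a + b * x, 1))   # [ 4.8  6.6  7.2  9.5 11.9]
print(np.round(residuals, 1))   # [ 0.2 -1.6  2.8 -2.5  1.1]

# Residual plot: look for independence from x and constant spread about zero
plt.scatter(x, residuals)
plt.axhline(0)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```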
Example Residual Patterns

[Figure: four example residual plots, described below]

• Good residual pattern: independent of x, constant variation
• Residuals do not have constant variation: a Weighted Least Squares or Multiplicative Error approach should be examined
• Residuals not independent of x (curved pattern): a curvilinear model is probably more appropriate in this case
• Residuals not independent of x (shift or break in the pattern): e.g., in learning curve analysis, this might indicate loss of learning or injection of new work

Tip: A residual plot is the primary way of indicating whether a non-linear model (and which one) might be appropriate

Usually the residual plot provides enough visual insight to determine whether or not linear OLS regression is appropriate. If the picture is inconclusive, statistical tests exist to help determine if the OLS assumptions hold¹.

¹ See references to learn about Weighted Least Squares Regression and statistical tests for residuals
Excel Demo – Parameters and Residuals

SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.79
  R Square            0.62
  Adjusted R Square   0.50
  Standard Error      2.5
  Observations        5

ANOVA
              df     SS     MS     F    Significance F
  Regression   1   29.8   29.8   4.9        0.11
  Residual     3   18.2    6.1
  Total        4   48

Equation Parameters (Ŷ = 2.5 + 0.6 X):
             Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
  Intercept      2.5            2.7          0.9     0.42      -6.1       11.1
  x              0.6            0.3          2.2     0.11      -0.3        1.4

RESIDUAL OUTPUT
  Observation  Predicted y  Residuals  Standard Residuals
       1           4.8          0.2          0.1
       2           6.6         -1.6         -0.7
       3           7.2          2.8          1.3
       4           9.5         -2.5         -1.2
       5          11.9          1.1          0.5

[Figure: scatter plot of the data with the fitted line]
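The same output can be generated outside Excel. A minimal sketch using Python's statsmodels (a tooling assumption added to this transcript, not part of the original deck):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])

# Same regression as the Excel demo: y = a + b*x
X = sm.add_constant(x)       # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.summary())       # R Square, ANOVA F, t Stats, 95% intervals
print(model.resid.round(1))  # matches the RESIDUAL OUTPUT above
```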
Application

• Suppose our toy problem defines X and Y as follows:
  – Y is the cost of a seeker in a missile
  – X is the weight of the seeker
  (Note: A seeker is a component of the missile that is used to detect a target)
• Assuming the regression is a good fit¹, the following is true:
  – The cost estimate for a new seeker with a weight of 11 is 2.5 + 0.6·(11) = $9.1M
  – The result is the estimated mean of the distribution of all possible costs, which is assumed to be a Normal distribution²

[Figure: Seeker Cost ($M) vs. Weight, with the fitted line]

Tip: The mean is usually calculated in the cost model. The distribution is usually accounted for in the risk analysis.

¹ Goodness of Fit is described in the next section
² Standard deviation of the cost distribution will be discussed later
Analysis of Variance (ANOVA)

Measures of Variation:

1. Total Sum of Squares (SST): the sum of the squared deviations between the data and the average
   SST = Σ(Y – Ȳ)²   (scaled by σ², ~χ² with n – 1 degrees of freedom)

2. Residual or Error Sum of Squares (SSE): the sum of the squared deviations between the data and the regression line; "the unexplained variation"
   SSE = Σ(Y – Ŷ)²   (scaled by σ², ~χ² with n – k degrees of freedom)

3. Regression Sum of Squares (SSR): the sum of the squared deviations between the regression line and the average; "the explained variation"
   SSR = Σ(Ŷ – Ȳ)²   (scaled by σ², ~χ² with k – 1 degrees of freedom)

[Figure: three scatter plots of the toy data illustrating the deviations behind SST, SSE, and SSR]

SST = SSE + SSR
"total" = "unexplained" + "explained"
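A minimal Python sketch (illustration only) computing the three sums of squares for the toy problem and confirming the identity:

```python
import numpy as np

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean())**2)      # total variation
sse = np.sum((y - y_hat)**2)         # unexplained variation
ssr = np.sum((y_hat - y.mean())**2)  # explained variation
print(f"SST = {sst:.1f}, SSE = {sse:.1f}, SSR = {ssr:.1f}")  # 48.0, 18.2, 29.8
print(f"SSE + SSR = {sse + ssr:.1f}")                        # equals SST
```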
Uncertainty of Coefficients

• Each estimated coefficient has an associated standard error
  – As a result of previous assumptions, coefficients are normally distributed
  – The true coefficients then have a t distribution about the estimated values

From the Excel output for Ŷ = 2.5 + 0.6 X:
  Intercept: 2.5, with standard error 2.7
  Slope:     0.6, with standard error 0.3

[Figure: t distributions centered on the estimates, spanning 2.5 ± 2.7 for the intercept and 0.6 ± 0.3 for the slope]

The standard error values are affected by the MSE as well as variability within the x data points, as the sketch below shows
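The slide reads the standard errors off the Excel output; for reference, here is a minimal sketch computing them from the standard textbook OLS formulas (the formulas themselves are not shown in the deck):

```python
import numpy as np

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
n = len(x)
b, a = np.polyfit(x, y, 1)
mse = np.sum((y - (a + b * x))**2) / (n - 2)  # MSE with n - k = 3 df
sxx = np.sum((x - x.mean())**2)

se_b = np.sqrt(mse / sxx)                       # slope standard error
se_a = np.sqrt(mse * np.sum(x**2) / (n * sxx))  # intercept standard error
print(f"se(b) = {se_b:.1f}, se(a) = {se_a:.1f}")  # ~0.3 and ~2.7
```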
Uncertainty of the Estimate

• The estimated regression equation has a standard error of the estimate, or SEE:

  SEE = √MSE

The SEE is the estimated standard deviation of the Normal distribution that models the residuals.

From the Excel output: SEE = √6.1 ≈ 2.5 (reported as "Standard Error" under Regression Statistics)

[Figure: scatter plot with the fitted line; the SEE measures the typical vertical scatter of the data about the line]
Coefficient of Variation

• The coefficient of variation, or CV, is the ratio of the SEE to the mean of the dependent variable:

  CV = SEE / Ȳ

Tip: CV of less than 15% is desirable

Example: From our sample problem, SEE = 2.5 and Ȳ = 8, so CV = 2.5 / 8 = 31%

The CV expresses the standard error of the estimate as a percent (of the mean). The mean is the only value not found on the regression output.
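A minimal sketch (illustration only) computing the SEE and CV for the toy problem:

```python
import numpy as np

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
b, a = np.polyfit(x, y, 1)

# SEE = sqrt(MSE); the model has k = 2 estimated parameters
see = np.sqrt(np.sum((y - (a + b * x))**2) / (len(x) - 2))
cv = see / y.mean()
print(f"SEE = {see:.2f}, CV = {cv:.0%}")  # ~2.46 and ~31%
```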
Statistical Significance in Regression

• Statistical significance means the probability is "acceptably low" that a stated hypothesis is true
  – The "acceptable" probability is referred to as a significance level, or α. Typical significance levels are α = 0.01 and 0.05
  – In regression analysis, the standard hypothesis that is checked is that one or more variables has a coefficient with an actual value of zero
• The statistical tests of interest in Regression Analysis involve the x coefficient(s) in the regression equation y = a + bx

Tip: To say b = 0 implies there is no relationship between x and y

Are the statistics good enough to convince us that b is not zero? That is, is the probability that the hypothesis "b = 0" is true less than α?
The t statistic

• For a regression coefficient, the determination of statistical significance is based on a t test
  – The test depends on the ratio of the coefficient's estimated value to its standard error, called a t statistic

Example Setup (for y = a + bx):
  Set α = 0.05
  Hypotheses: H₀: b = 0 vs. Hₐ: b ≠ 0
  Test Statistic: t = Estimated Coefficient / Standard Error = 0.6 / 0.3 ≈ 2.2
  p-value: 0.11 (from the Excel output for the x coefficient)

Decision: We reject H₀ if the p-value is less than the chosen significance level (0.05). Since 0.11 > 0.05, we cannot reject H₀. Therefore, this coefficient is not statistically significant, and we cannot conclude there is a relationship between x and y.
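A minimal sketch (illustration only) reproducing the t statistic and its two-sided p-value, using the unrounded slope and standard error:

```python
from scipy import stats

b_hat, se_b = 0.585, 0.264  # unrounded slope and standard error
df = 3                      # n - k = 5 - 2
t_stat = b_hat / se_b
p_value = 2 * stats.t.sf(abs(t_stat), df)  # two-sided test of H0: b = 0
print(f"t = {t_stat:.1f}, p = {p_value:.2f}")  # ~2.2 and ~0.11
```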
Calculating R²

How much of the total variability is explained by adding x as an independent variable?

Remember:
  "total variation" = SST
  "explained variation" = SSR

  R² = explained variation / total variation = SSR / SST = 1 – SSE / SST

Data and Calculations (Ŷ = 0.6X + 2.5):

   X     Y    Ȳ   (Y – Ȳ)²    Ŷ   (Ŷ – Ȳ)²
   4     5    8      9       4.8     10
   7     5    8      9       6.6      2
   8    10    8      4       7.2      1
  12     7    8      1       9.5      2
  16    13    8     25      11.9     15
            Ȳ = 8   Σ = 48           Σ = 30

Toy Problem Calculation:

  R² = SSR / SST = Σ(Ŷ – Ȳ)² / Σ(Y – Ȳ)² = 30 / 48 ≈ 0.62

[Figure: scatter plots illustrating the SST deviations (data about the mean) and the SSR deviations (regression line about the mean)]
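A minimal sketch (illustration only) computing R² both ways for the toy problem:

```python
import numpy as np

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

sst = np.sum((y - y.mean())**2)
ssr = np.sum((y_hat - y.mean())**2)
sse = np.sum((y - y_hat)**2)
print(f"R^2 = {ssr / sst:.2f} = {1 - sse / sst:.2f}")  # 0.62 both ways
```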
Confidence Intervals
Introduction to Confidence Intervals

• A confidence interval suggests to us that we are (1 – α)·100% confident that the true value of the random variable is contained within the calculated range*
• Confidence intervals are calculated for:
  – Regression Equation Parameters
  – The Regression Equation at a given value of x
• The calculation combines the Student t distribution with the associated standard error
  – The general formula is:

  (mean estimate) ± t(α/2, df) × stdev

"t" is the Student t distribution with n – k degrees of freedom. Critical values can be looked up in a table or calculated in Excel.

[Figure: a t distribution; the middle 1 – α of the probability lies between the two critical values, with α/2 in each tail]

* Note this statement provides a general sense of what a confidence interval does for us in concise language for ease of understanding. The specific statistical interpretation is that if many independent samples are taken where the levels of the predictor variable are the same as in the data set, and a (1 – α)·100% confidence interval is constructed for each sample, then (1 – α)·100% of the intervals will contain the true value of the parameter (or value of the dependent variable at a given value of x, depending on which interval is being calculated).
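As a worked instance of the general formula, a minimal sketch (illustration only) computing the 95% confidence interval for the slope, which reproduces the Lower/Upper 95% columns of the Excel output:

```python
from scipy import stats

b_hat, se_b = 0.585, 0.264         # unrounded slope and standard error
t_crit = stats.t.ppf(0.975, df=3)  # ~3.18 for alpha = 0.05, df = n - k = 3
lo, hi = b_hat - t_crit * se_b, b_hat + t_crit * se_b
print(f"({lo:.1f}, {hi:.1f})")     # ~(-0.3, 1.4)
```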
Confidence Bands

• When we plot the confidence intervals for all X, we get a pair of hyperbolic curves called the "Confidence Band"

[Figure: toy problem 95% confidence band, Ŷ ± 3.18 standard deviations, plotted vs. X]

Note that the interval becomes larger as X moves away from the mean, as the sketch below shows
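The curves can be computed with the standard OLS formula for a confidence interval on the mean response (the formula itself is standard textbook material, not shown in the deck). A minimal sketch:

```python
import numpy as np
from scipy import stats

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
n = len(x)
b, a = np.polyfit(x, y, 1)
see = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))
sxx = np.sum((x - x.mean())**2)
t_crit = stats.t.ppf(0.975, df=n - 2)  # ~3.18

# Half-width grows with (x - x_bar)^2, giving the hyperbolic band shape
xg = np.linspace(0, 20, 101)
half = t_crit * see * np.sqrt(1 / n + (xg - x.mean())**2 / sxx)
upper, lower = a + b * xg + half, a + b * xg - half  # plot these vs. xg
```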
Prediction Intervals

• It is more common that we would want a confidence interval for a new observation of Y at a given value of X
  – This is referred to as a Prediction Interval and has the following formula:

  Ŷ ± t(α/2, df) × SEE × √( (n + 1)/n + (X – X̄)² / (ΣX² – nX̄²) )

The t distribution critical values remain the same as previously calculated. The SEE-based term is the standard deviation of Y for a fixed X.

Tip: The Prediction Interval provides an interval for y as opposed to ŷ

[Figure: prediction interval vs. X for the toy problem]

Note that prediction intervals are wider than confidence intervals
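A minimal sketch (illustration only) evaluating the prediction interval at X = 11, the weight from the seeker application:

```python
import numpy as np
from scipy import stats

x = np.array([4, 7, 8, 12, 16])
y = np.array([5, 5, 10, 7, 13])
n = len(x)
b, a = np.polyfit(x, y, 1)
see = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)

x0 = 11  # new observation
half = t_crit * see * np.sqrt((n + 1) / n
                              + (x0 - x.mean())**2 / (np.sum(x**2) - n * x.mean()**2))
print(f"{a + b * x0:.1f} +/- {half:.1f}")  # ~8.9 +/- 8.7; wide because n is tiny
```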
Confidence Intervals when n is large

• When n is large (at least 30), the t distribution approaches a Normal distribution
• This means critical values from the Normal can be used, which produce the following standard confidence intervals:

  Ŷ ± 1 stdev:  68% confidence interval
  Ŷ ± 2 stdevs: 95% confidence interval
  Ŷ ± 3 stdevs: 99.7% confidence interval

e.g., Ŷ ± 1 stdev provides a 68% confidence interval for the mean

[Figure: fitted line with bands at ±1, ±2, and ±3 standard deviations]
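A minimal sketch (illustration only) showing the t critical values converging to the Normal value as the degrees of freedom grow:

```python
from scipy import stats

# Two-sided 95% critical values
for df in (3, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 2))   # 3.18, 2.23, 2.04, 1.98
print("Normal:", round(stats.norm.ppf(0.975), 2))  # 1.96
```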