Forecasting with Regression Analysis
Richard S. Barr

Causal, Explanatory Forecasting
• Assumes a cause-and-effect relationship between system inputs and its output
• The job of forecasting:
  [Diagram: Inputs → System (cause + effect relationship) → Output]
Regression Analysis
• Determines and measures the relationship between two or more variables
  – "Simple" linear regression: 2 variables
  – Multiple linear regression: 3+ variables

Simple Linear Regression
• Evaluates the relationship (going-together) of two variables
  – Dependent variable (Y)
  – Independent variable (X)
• Relationship depicted by a straight-line model: Y = a + bX

Which is Independent?
• Sales ↔ Advertising
• Equipment age ↔ wear
• Demand ↔ Time
• Price ↔ Units sold

Forecasting
• Build the model using historical data
• Then use knowledge of the independent variable (X) to forecast the value of the dependent variable (Y)
• Assumptions:
  – The relationship between X and Y is strong
  – The future follows the past
Regression Forecasting Steps
1. Plot the scatter diagram
2. Compute the regression equation
3. Forecast Y using the regression model and estimates of X

Scatter Diagram
• The first step for simple regression modeling
• Used to
  – Display historical raw data
  – Spot patterns of relationships
• Will help you determine if regression is appropriate
Types of Relationships
Direct linear
• Positive relationship
• As X increases, Y tends to increase by a constant amount

Inverse linear
• Negative relationship
• As X increases, Y tends to decrease by a constant amount

No correlation
• Change in X tells nothing about Y

Nonlinear relationship
• As X increases, Y changes by a varying amount
Regression Model
• Expresses the relationship between X and Y as a straight line:
  Yc = a + bX (the regression line)
  where
  – Yc = estimated average Y for a given X
  – X = actual value of independent variable
  – a = estimated Y-intercept (value of Yc when X=0)
  – b = estimated slope of regression line

Regression Line
[Diagram: the line Yc = a + bX crosses the Y axis at a; b = slope = change in Y / change in X]
Purposes for the Regression
• Provides a mathematical definition of the relationship
  – Precise; accuracy depends on data fit
• Is a standard of perfect correlation
  – Can compare line with actual data values
  – If all values fall on the line, correlation is perfect
• Is a model for forecasting Y using X
  – Plug an X-value into: Yc = a + bX

Which Line is Best?
• There are many possibilities for a and b
  – Each defines a different line and model
• To evaluate mathematically, let:
  – Yi = historical value of Y for a given Xi
  – Yc = calculated Y using Xi in the regression line
  – (Yi − Yc) = deviation, the error between actual and model forecast
Measuring Goodness of Fit
• Measuring the fit of the line to the data:
  – Sum of the deviations:
      Σ (Yi − Yc), summed over the n observations
    • Is 0 for any line going through (X̄, Ȳ), due to +/− cancellations
  – Sum of the squared deviations:
      Σ (Yi − Yc)²
    • Eliminates the sign problem
    • Is the generally accepted least-squares criterion
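The +/− cancellation can be checked numerically. A minimal sketch (with made-up data, not from the slides): every line through (X̄, Ȳ) has a zero sum of deviations, so only the squared sum can rank candidate lines.

```python
# Made-up illustration data: the plain sum of deviations is ~0 for ANY
# line through (x_bar, y_bar), so it cannot rank candidate lines;
# the sum of squared deviations differs from line to line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

def fit_stats(slope):
    # Force the line through (x_bar, y_bar): a = y_bar - slope * x_bar
    a = y_bar - slope * x_bar
    errors = [y - (a + slope * x) for x, y in zip(xs, ys)]
    return sum(errors), sum(e * e for e in errors)

for b in (0.5, 1.4, 3.0):
    dev_sum, sq_sum = fit_stats(b)
    # dev_sum is ~0 for every slope; sq_sum discriminates between slopes
    print(f"b={b}: sum of deviations={dev_sum:.1e}, sum of squares={sq_sum:.3f}")
```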
Least-Squares Regression Line
• To minimize the squared deviations, use:

  b = [Σ(XY) − n X̄ Ȳ] / [Σ(X²) − n (X̄)²]
  a = Ȳ − b X̄

  where:
  n = number of data points
  X̄, Ȳ = means of the Xi's and Yi's
  Σ(XY) = sum of {Xi × Yi}
  Σ(X²) = sum of {Xi's squared}

Mail Order Sales vs. Advertising

  Date of Advertising   $ Spent on Advertising   $ Sales in Next Week
  Sept. 9               $1,700                   $60,000
  Sept. 26              3,000                    110,000
  Oct. 2                2,000                    85,000
  Oct. 9                1,500                    55,000
  Oct. 16               600                      30,000
  Oct. 23               1,500                    60,000
Scatter Plot
[Scatter plot of the data: X = Advertising ($000s), Y = Sales ($1000s)]
Computing the Regression Line

  (1) Xi          (2) Yi
  Advert $1000s   Sales $1000s
  1.7             60
  3.0             110
  2.0             85
  1.5             55
  0.6             30
  1.5             60

Step 1: Sum Column 1 for ΣX
Step 2: Sum Column 2 for ΣY
Step 3: (1)×(2) = (3), Sum for ΣXY
Step 4: (1)² = (4), Sum for ΣX²

  (1) Xi   (2) Yi   (3) XY    (4) X²
  Advert   Sales    (1)×(2)   (1)²
  1.7      60       ______    ______
  3.0      110      ______    ______
  2.0      85       ______    ______
  1.5      55       ______    ______
  0.6      30       ______    ______
  1.5      60       ______    ______

Step 5: Compute the Mean of X
  X̄ = ΣXi / n

Step 6: Compute the Mean of Y
  Ȳ = ΣYi / n

Compute b
  b = [Σ(XY) − n X̄ Ȳ] / [Σ(X²) − n (X̄)²]

Compute a
  a = Ȳ − b X̄
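Steps 1 through 6 can be checked end-to-end in a few lines. A sketch in Python using the six observations from the table:

```python
# The six (advertising, sales) observations from the slides, in $1000s.
xs = [1.7, 3.0, 2.0, 1.5, 0.6, 1.5]
ys = [60, 110, 85, 55, 30, 60]

n = len(xs)
x_bar = sum(xs) / n                           # Step 5
y_bar = sum(ys) / n                           # Step 6
sum_xy = sum(x * y for x, y in zip(xs, ys))   # Step 3: Sigma(XY)
sum_x2 = sum(x * x for x in xs)               # Step 4: Sigma(X^2)

b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
a = y_bar - b * x_bar
print(f"Yc = {a:.3f} + {b:.2f} X")   # matches the slides: Yc = 7.455 + 34.49X
```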
The Regression Equation
• The resultant equation:
  – Yc = 7.455 + 34.49X
• Interpretation and reasonableness check:
  – a = 7.455 =
  – b = 34.49 =
• Forecast sales with $1800 advertising:

Evaluating the Model
How Well Did We Do?
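The forecast exercise works out by plugging X = 1.8 (advertising in $1000s) into the fitted line; a sketch using the slides' rounded coefficients:

```python
# Forecast using the slides' fitted equation Yc = 7.455 + 34.49X,
# with X in $1000s of advertising and Yc in $1000s of sales.
a, b = 7.455, 34.49
x = 1.8                  # $1800 of advertising
yc = a + b * x           # about 69.54, i.e. roughly $69,500 in next-week sales
print(f"${yc * 1000:,.0f}")
```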
Compare Actuals with Estimates

  Xi    Yi    Model Estimate Yc   Error (Y−Yc)   Error² (Y−Yc)²
  1.7   60    66.09               −6.09          37.11
  3.0   110   110.93              −0.93          0.87
  2.0   85    76.44               8.56           73.28
  1.5   55    59.19               −4.19          17.58
  0.6   30    28.15               1.85           3.42
  1.5   60    59.19               0.81           0.65

Correlation Analysis
Measures the degree of association between two variables

Measuring Correlation
• We compare two approaches to estimating or forecasting Y for a given X:
  – Using the mean of Y
  – Using our least-squares regression line
Variation Analysis
• We could use Ȳ to estimate Y (for any X) and, on average, be OK
• Can regression do better?
Variation Analysis
• Let's look at variations around the regression line to see how much better it explains the Y's than the mean Ȳ
  [Diagram: a point (x1, y1) shown against the regression line Yc and the mean line Ȳ]

Explained Deviation
• Explained deviation from the mean: (Yc − Ȳ)
  – Deviation "explained" by the regression line
  [Diagram: at x1, the gap between Yc and Ȳ is the explained deviation]
Unexplained Deviation
• Deviation from the mean not explained by the regression line: (y1 − Yc)
  [Diagram: at x1, the gap between y1 and Yc is the unexplained deviation; between Yc and Ȳ, the explained deviation]

Total Deviation
• The total deviation from the mean = explained + unexplained
  [Diagram: at x1, the gap between y1 and Ȳ is the total deviation]
Variation
• Variation is the square of deviations from the mean of Y
• Total variation = Explained + Unexplained variation:
  Σ(Yi − Ȳ)² = Σ(Yc − Ȳ)² + Σ(Yi − Yc)²

Portion Explained, r²
• Sample coefficient of determination:
  r² = Explained variation / Total variation = Σ(Yc − Ȳ)² / Σ(Yi − Ȳ)²
• The fraction of variation from the mean explained by the regression line
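For the mail-order example, r² can be computed directly from these definitions; a sketch that refits a and b as before:

```python
# Coefficient of determination for the mail-order example, using the
# data and formulas from the slides.
xs = [1.7, 3.0, 2.0, 1.5, 0.6, 1.5]
ys = [60, 110, 85, 55, 30, 60]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
    (sum(x * x for x in xs) - n * x_bar ** 2)
a = y_bar - b * x_bar

explained = sum((a + b * x - y_bar) ** 2 for x in xs)   # Sigma(Yc - Ybar)^2
total = sum((y - y_bar) ** 2 for y in ys)               # Sigma(Yi - Ybar)^2
r2 = explained / total
r = r2 ** 0.5 if b > 0 else -(r2 ** 0.5)  # sign of r follows the slope
print(f"r2 = {r2:.3f}, r = {r:.3f}")      # r2 is about 0.96 for this data
```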
Extreme Values of r²
• r² = 1
  – Perfect linear correlation
  – All points are explained by the line
  – All points are on the line
• r² = 0
  – No correlation
  – The regression does not explain the data any better than the mean of Y
  – X provides no useful information about Y in this context

Correlation
• The correlation coefficient, r:
  r = ±√r²
• Unitless
• Sign: + if b > 0, − if b < 0
• Simply a different way of expressing the relationship (correlation) between two variables

Correlation Coefficient
• r = ±1
  – Only if a perfect linear relationship Y = a + bX exists
  – All points on the line
• Some think that it "looks better" than r²
  – r² = 0.36
  – r = 0.60
Example Scatterplot A
  x: 51 42 65 52 45 24 45 31 51 60 67 44 40 53 52 59 41 51 55 42
  y: 58 32 67 54 39 38 55 31 60 54 62 72 46 36 44 63 38 48 43 38
  [Scatter plot of the 20 (x, y) points]

Example Scatterplot B
  x: 45 58 40 41 61 56 64 40 65 57 55 45 55 56 54 34 55 48 55 55
  y: 38 52 52 62 57 34 70 50 56 25 53 43 60 54 35 63 52 40 35 51
  [Scatter plot of the 20 (x, y) points]
Correlation Coefficient
• Shows
  – The direction of the relationship
  – The strength of association
• Cautions
  – It only measures linear association
  – It is unstable with a small sample size
  – It is distorted by extreme values or by including different data sets in the analysis
Nonlinear Relationship

Monkey Data (observations 1–20)
  Wt: 55 45 35 39 53 41 51 35 57 57 45 47 35 49 43 51 31 53 47 51
  Ht: 29 27 17 29 31 21 31 13 37 41 45 35 25 25 31 33 29 27 17 45

Monkey & King Kong Data (the same 20 observations plus King Kong)
  Wt: 55 45 35 39 53 41 51 35 57 57 45 47 35 49 43 51 31 53 47 51; KK: 130
  Ht: 29 27 17 29 31 21 31 13 37 41 45 35 25 25 31 33 29 27 17 45; KK: 150
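The data above make the "distorted by extreme values" caution concrete. A sketch computing Pearson's r with and without the King Kong point (assuming, as the table suggests, King Kong is the single extra observation at Wt = 130, Ht = 150):

```python
# Pearson r for the monkey data, with and without the King Kong outlier,
# illustrating how one extreme value can inflate the correlation.
wt = [55, 45, 35, 39, 53, 41, 51, 35, 57, 57,
      45, 47, 35, 49, 43, 51, 31, 53, 47, 51]
ht = [29, 27, 17, 29, 31, 21, 31, 13, 37, 41,
      45, 35, 25, 25, 31, 33, 29, 27, 17, 45]

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r_monkeys = pearson_r(wt, ht)                    # moderate, about 0.5
r_with_kk = pearson_r(wt + [130], ht + [150])    # the outlier inflates r past 0.9
print(r_monkeys, r_with_kk)
```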
Multiple Regression
Same concept, more variables

Multiple Regression Models
• An extension of the simple case
• Permits use of more variables to try to explain more variation
• Example model:
  Y = a + b1X1 + b2X2 + …

Real Estate Example
• Monthly sales (Y) are related to
  – Mortgage rates (X1)
  – Number of salespersons (X2)
• With simple regression models:
  – Y = a + bX1, r² = 0.36
  – Y = a + bX2, r² = 0.25
• Multiple regression model:
  – Y = a + b1X1 + b2X2, r² = 0.49, not 0.61!
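The 0.49-versus-0.61 effect can be reproduced in miniature. A sketch with hypothetical numbers (not the slides' real-estate data): two nearly identical predictors each score a high r² alone, but the combined r² is far less than the sum, because the explained variation overlaps.

```python
# Hypothetical illustration of multicollinearity: x2 is nearly the same
# variable as x1, so their individual r2 values overlap heavily.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = x1 + np.array([0.2, -0.1, 0.1, 0.0, 0.2, -0.2, 0.1, 0.0])
y = 2.0 * x1 + np.array([0.1, -0.1, 0.2, -0.2, 0.1, 0.2, -0.2, 0.1])

def r_squared(design, y):
    # Least-squares fit with an intercept column, then r2 = 1 - SSE/SST
    A = np.column_stack([np.ones(len(y))] + design)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

r2_x1 = r_squared([x1], y)
r2_x2 = r_squared([x2], y)
r2_both = r_squared([x1, x2], y)
# Because x1 and x2 overlap, r2_both is far less than r2_x1 + r2_x2
print(r2_x1, r2_x2, r2_both)
```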
Real Estate Example
• Why isn't more variation explained?
• Multicollinearity exists:
  – X1 is correlated with X2
  – We want independence of the X's (uncorrelated)
  [Venn diagram: total variation, with overlapping regions explained by X1 and by X2]

MLR Software
MLR Input
• Title line
• Variables and observations
• Labels for variables, dependent last
• For each observation
  – Xij values, followed by Yj
  – Xij's in label order
• Blanks separate all values and labels

MLR Reports
• Descriptive statistics
• Correlation matrix and determinant
• Regression equation, for each variable:
  – Label
  – Coefficient
  – Beta value
  – Standard error of the coefficient
  – t-statistic and probability that bi = 0
MLR Reports (continued)
• Analysis of variance
  – P(insignificant regression model)
• Summary statistics
  – r²
  – sy,x
• Residual summary (optional)
  – Residuals (errors)
  – Graph

Standard Error of the Estimate
• The standard deviation of the observed values of Y from the regression line:

  sy,x = √[ Σ(Y − Yc)² / (n − 2) ] = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ]

• On average, how the data vary around the regression line
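Both forms of sy,x give the same value on the mail-order data; a sketch:

```python
# Standard error of the estimate for the mail-order data, computed from
# the definition and from the slide's computational shortcut.
xs = [1.7, 3.0, 2.0, 1.5, 0.6, 1.5]
ys = [60, 110, 85, 55, 30, 60]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sum_xy = sum(x * y for x, y in zip(xs, ys))
b = (sum_xy - n * x_bar * y_bar) / (sum(x * x for x in xs) - n * x_bar ** 2)
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # Sigma(Y - Yc)^2
s_def = (sse / (n - 2)) ** 0.5

shortcut = sum(y * y for y in ys) - a * sum(ys) - b * sum_xy
s_short = (shortcut / (n - 2)) ** 0.5
print(s_def, s_short)   # the two forms agree, about 5.76 here
```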
Confidence Intervals
• Using the 68-95-99.7 Rule of Normality
  – µ ± 1σ includes 68% of all values
  – µ ± 2σ includes 95%
  – µ ± 3σ includes 99.7%
• b ± Z·sy,x gives a confidence interval for a given probability and associated Z-value
  – If Z = 1, a 68% confidence that the interval contains the true regression coefficient