Lecture 3-2
Summarizing
Relationships among
variables
Numerical measures of
summarizing the relationship
between two variables
To think of what numerical measures we
need to represent relationships between
variables, see the following three pairs of
scatter plots.
Example 1: Relationships between
the returns of different stocks
[Figure: Scatter Plot I (Stock A return and Stock B return) and Scatter Plot II (Stock C return and Stock D return)]
Example 1 (Continued)
Scatter Plot I shows a positive relationship
while scatter plot II shows a negative
relationship.
We need a numerical measure that
shows the direction of the relationship.
For this purpose, we use “Covariance”.
Example 2: Relationships between
advertisement spending and revenue
[Figure: two scatter plots of revenue against advertisement spending, titled “Advertisement and revenue Product I” and “Advertisement and revenue Product II”]
Product I shows a clear linear relationship between the advertisement spending
and revenue, while product II does not show much of a relationship. We need to
have a numerical measure that shows the strength of linear relationship between
two variables.
We use “Correlation Coefficient”
Example 3: Number of
promotions and sales
[Figure: two scatter plots of sales against number of promotions, titled “Product A: Promotion and sales” and “Product B: Promotion and Sales”]
Promotion seems to be more effective for Product A than for
Product B, in the sense that an additional promotion brings a greater
increase in revenue (i.e., the “slope” is steeper). To measure the
effectiveness of the promotion, we use “Regression Analysis”.
Numerical measures of
summarizing relationships
This lecture covers the following topics:
1. Covariance
2. Correlation coefficient
3. Regression Analysis
Covariance
Covariance is a numerical measure that shows
the direction of the relationship between two
variables.
Covariance is one of the most fundamental
numerical measures of the relationship between
two variables. It will appear in many areas (e.g.,
in computations involving the returns of a portfolio of stocks).
In the following slides, we will learn the logic
behind the derivation of covariance.
How to measure the direction of
the relationship
[Figure: two scatter plots, each divided into Box I, Box II, Box III and Box IV. Left: Y against X, “Positive Relationship”. Right: Z against W, “Negative relationship”.]
How to measure the direction of
the relationship
From the previous two scatter plots, notice
that:
1. When two variables show a positive
relationship, there are more data points in Box
I and Box III than in Box II and Box IV.
2. When two variables show a negative
relationship, there are more data points in Box
II and Box IV than in Box I and Box III.
We use these facts to measure the direction of
the relationship.
How to measure the direction of
the relationship: Example
Number of promotions    Revenue from product A (in 1000 yen)
5                       600
10                      1000
8                       1100
9                       900
10                      1500
12                      750
20                      2200
18                      2000
17                      1700
•The data show the relationship between
the number of promotions and revenue. (It
is the same data set used in the previous
handout. Revenue is now denoted in 1000
yen.)
•Suppose you want to know if there is a
positive relationship between these two
variables. The next slide shows the scatter plot of
this relationship.
How to measure the direction of
the relationship: Example, contd
[Figure: “Relationship between Number of promotions and revenue from product A”. Scatter plot of revenue in 1000 yen against number of promotions, divided into Box I to Box IV by the mean of the number of promotions (12.11) and the mean of revenue (1305.6).]
•The number of promotions and revenue appear to have a
positive relationship.
•Notice that most of the data points are in either Box I or
Box III.
•What can we say about Box I and Box III? See the next
slide.
How to measure the direction of
the relationship: Example, contd
[Figure: “Relationship between Number of promotions and revenue”. Scatter plot of Y (revenue in 1000 yen) against X (number of promotions), divided into Box I to Box IV by the mean of X (12.11) and the mean of Y (1305.6). For a data point, the figure marks the distance from the mean of X = (X - the mean of X) and the distance from the mean of Y = (Y - the mean of Y).]
•For each data point, you can compute the distances from
the means.
•Then we can notice that, for any data point in Box I,
both of the distances are positive.
•For any data point in Box III, both of the distances are
negative.
See the next slide.
[Figure: “Relationship between Number of promotions and revenue”. The same scatter plot, annotated: “Box I: Distances from the means are both positive” and “Box III: Distances from the means are both negative”.]
How to measure the direction of
the relationship: Example, contd
For a data point in Box I, the distances from the
means are both positive. That is, both $(X - \bar{X})$ and
$(Y - \bar{Y})$ are positive.
Therefore, if we multiply the two distances
together, we will have a positive number.
For a data point in Box III, the distances from the
means are both negative. That is, $(X - \bar{X})$ and $(Y - \bar{Y})$ are both negative.
Therefore, if we multiply the two distances
together, we will again have a positive number.
Now, what can we say about Box II and Box IV?
See the next slide.
[Figure: “Relationship between Number of promotions and revenue”. The same scatter plot with Box I to Box IV, marking one positive distance and one negative distance from the means.]
How to measure the direction of
the relationship: Example, contd
For any points in box II and box IV, one
distance will be positive and the other
distance will be negative. So if we
multiply them together, we will have a
negative number.
How to measure the direction of
the relationship: Example, contd
Suppose that, for each data point, you compute the
distances from the means, then multiply them together.
Further, suppose you sum all the multiplied distances
together. If the resulting number is positive, this
roughly indicates that there are more data points in Box
I and Box III than in Box II and Box IV. This in turn
indicates that the data show a positive relationship. If the
resulting number is negative, this indicates a negative
relationship.
This is the basic idea of measuring the direction of the
relationship between two variables, and this is the first
step to compute “Covariance”.
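The box-counting idea can be tried on the promotion data from the earlier table. The following is only a sketch: it assumes Box II is the lower-right box and Box IV the upper-left box (read off the figure layout), and the helper name `box_of` is made up for illustration.

```python
# Data from the promotion example (revenue in 1000 yen).
promotions = [5, 10, 8, 9, 10, 12, 20, 18, 17]
revenue = [600, 1000, 1100, 900, 1500, 750, 2200, 2000, 1700]

mean_x = sum(promotions) / len(promotions)
mean_y = sum(revenue) / len(revenue)


def box_of(dx, dy):
    """Label a point by the signs of its distances from the means.

    Assumption: Box II is lower right, Box IV is upper left.
    """
    if dx > 0 and dy > 0:
        return "I"
    if dx > 0 and dy < 0:
        return "II"
    if dx < 0 and dy < 0:
        return "III"
    if dx < 0 and dy > 0:
        return "IV"
    return "on a mean line"


box_counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
sum_of_products = 0.0
for x, y in zip(promotions, revenue):
    dx, dy = x - mean_x, y - mean_y
    label = box_of(dx, dy)
    if label in box_counts:
        box_counts[label] += 1
    # Positive in Box I and III, negative in Box II and IV.
    sum_of_products += dx * dy

print(box_counts)
print(sum_of_products > 0)
```

Most points land in Box I and Box III, and the sum of the multiplied distances comes out positive, which is the pattern the slide describes for a positive relationship.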
Computation of the Sample
Covariance
The sample covariance is computed in the following
way.
1. Compute the mean for each variable.
2. For each observation, and for each variable, compute
the distances from the means, i.e. compute $(X - \bar{X})$
and $(Y - \bar{Y})$. Then multiply them together.
3. Sum all the multiplied differences.
4. Divide the sum of the multiplied differences by n-1
(that is, the number of observations minus 1).
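The four steps above can be sketched in a few lines of Python. This is an illustration rather than the spreadsheet procedure used in class; the function name `sample_covariance` is an assumption, and the data are the nine observations from the promotion example (revenue in 1000 yen).

```python
def sample_covariance(xs, ys):
    """Sample covariance following the four steps on the slide."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 observations")
    n = len(xs)
    # Step 1: the mean of each variable.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Steps 2 and 3: multiply the distances from the means, then sum.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Step 4: divide by n - 1.
    return total / (n - 1)


promotions = [5, 10, 8, 9, 10, 12, 20, 18, 17]
revenue_1000_yen = [600, 1000, 1100, 900, 1500, 750, 2200, 2000, 1700]

cov_data1 = sample_covariance(promotions, revenue_1000_yen)
print(round(cov_data1, 1))
```

With these nine observations the result rounds to 2561.8.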
Computation of Sample
covariance
Exercise
Open “Computation of Covariance” data
set.
Using data on the sheet “data 1”, compute
the covariance between the number of
promotions and the revenue.
Exercise, contd
The covariance between the number of
promotions and revenue is 2561.8
Positive covariance indicates that the
number of promotions and revenue have
a positive relationship.
Characteristics of Covariance
1. If covariance is positive, the two
variables have a positive relationship
2. If covariance is negative, the two
variables have a negative relationship.
3. A large value of covariance does not
indicate that the two variables have a
strong linear relationship.
A note on Covariance
One may be tempted to conclude that if
the covariance is larger, the relationship
between two variables is stronger (in the
sense that they have stronger linear
relationship)
However, this is not true. To see this, go
over the next example.
A note on Covariance, example
Open the data “Computation of
Covariance”, work sheet “data 2”.
Compute the covariance between variable
X and Y.
(Data 2 is in fact the same as data 1. The only
difference is that the revenue is measured in 1000 yen
for data 1, while it is measured in 1 yen for data
2.)
Example, contd
The covariance for data 2 is 2561805. Compare this with the
covariance for data 1, which was 2561.8.
Even though data 1 and data 2 show exactly the same
relationship, the covariance for data 2 is much larger. This is
simply because the unit of measurement for revenue is
different between data 1 and data 2.
This shows that a larger covariance does not mean a
stronger relationship. (In this particular example, the
relationship is exactly the same.)
To show the strength of the relationship, we use the
“Correlation coefficient”.
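A small sketch of the unit effect, assuming data 2 is simply data 1 with revenue multiplied by 1000 (as the note on the previous slide says). The function name `sample_covariance` is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of multiplied distances from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


promotions = [5, 10, 8, 9, 10, 12, 20, 18, 17]
revenue_1000_yen = [600, 1000, 1100, 900, 1500, 750, 2200, 2000, 1700]
# Same revenue figures, expressed in 1 yen instead of 1000 yen.
revenue_yen = [1000 * r for r in revenue_1000_yen]

cov_data1 = sample_covariance(promotions, revenue_1000_yen)
cov_data2 = sample_covariance(promotions, revenue_yen)

# The covariance grows by the same factor as the unit change.
print(cov_data2 / cov_data1)
```

Changing the unit of one variable rescales the covariance by the same factor, so the size of the covariance says nothing about how tightly the points follow a line.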
Sample Correlation Coefficient
The measure of the strength of
linear relationship
Correlation coefficient between X and Y,
denoted as $r_{xy}$, is computed as
$$r_{xy} = \frac{\text{Covariance between X and Y}}{(\text{Standard deviation of X}) \times (\text{Standard deviation of Y})}$$
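As a sketch of this formula, the code below computes the correlation coefficient for the promotion data in both units (1000 yen and 1 yen). It uses `statistics.stdev` from the Python standard library for the sample standard deviation; the helper names are illustrative.

```python
from statistics import mean, stdev


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    mean_x, mean_y = mean(xs), mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (stdev(xs) * stdev(ys))


promotions = [5, 10, 8, 9, 10, 12, 20, 18, 17]
revenue_data1 = [600, 1000, 1100, 900, 1500, 750, 2200, 2000, 1700]  # 1000 yen
revenue_data2 = [1000 * r for r in revenue_data1]                    # 1 yen

r_data1 = correlation(promotions, revenue_data1)
r_data2 = correlation(promotions, revenue_data2)
print(r_data1, r_data2)
```

Dividing by the standard deviations cancels the unit of measurement, so both versions of the data give the same correlation coefficient even though their covariances differ by a factor of 1000.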
Characteristics of Correlation
Coefficient
1. The correlation coefficient ranges from -1 to +1, where
• $r_{xy}$ = +1 indicates a perfect positive linear relationship: the X and
Y points would plot an increasing straight line.
• $r_{xy}$ = 0 indicates no linear relationship between X and Y.
• $r_{xy}$ = -1 indicates a perfect negative linear relationship: the X and
Y points would plot a decreasing straight line.
2. Positive correlations indicate positive or increasing linear
relationships, with values closer to +1 indicating data points
closer to a straight line and values closer to 0 indicating greater
deviations from a straight line.
3. Negative correlations indicate decreasing linear relationships,
with values closer to -1 indicating points closer to a straight
line and values closer to 0 indicating greater deviations from a
straight line.
4. The correlation coefficient is not the slope of the relationship.
Correlation Coefficient
Exercise
Open “Computation of Covariance”.
Compute the correlation coefficient between
the number of promotions and revenue for
both data 1 and data 2.
Correlation Coefficient exercise
Exercise 1: Open the data set “Correlation Coefficient
Exercise 1”. This data set shows the relationship
between advertisement cost and revenue for two
different products. First, produce a scatter plot for each
product. Then compute the correlation coefficient for each
product.
Exercise 2: Open the data set “Correlation Coefficient
Exercise 2”. This data set contains two pairs of variables.
First, make a scatter plot for each pair in a single graph.
Second, compute the correlation coefficient for each pair of
variables.
Exercise 1, Answer
[Figure: two scatter plots of revenue against ad cost. “Product I: Advertisement cost and Revenue”, correlation coefficient = 0.95. “Product II: Advertisement cost and Revenue”, correlation coefficient = 0.05.]
Product I shows a strong positive linear relationship between
advertisement cost and revenue. The correlation coefficient is
0.95, which is close to 1. Product II does not show much of a
linear relationship. Its correlation coefficient is 0.05,
which is close to 0.
Exercise 2 (Answer)
[Figure: “Correlation Coefficient Exercise 2”. Pair I and Pair II plotted on one graph, each labelled “Correlation Coefficient = -1”.]
Correlation Coefficient Exercise
2 (Answer)
First, for both pairs, the correlation
coefficients are -1. This means that the
relationships are perfectly (negatively)
linear for both pairs of variables.
Also note that, even though the slope for
the pair I is much steeper, the correlation
coefficients are the same for both pairs.
This shows that correlation coefficient is
not the slope of the relationship.
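The same point can be checked with made-up numbers. The two pairs below are hypothetical (they are not the values in the exercise data set): both lie exactly on a downward-sloping line, one with slope -3 and one with slope -0.5.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample covariance divided by the product of sample standard deviations."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))


xs = list(range(0, 20, 2))                 # 0, 2, ..., 18
pair_one = [30 - 3.0 * x for x in xs]      # hypothetical steep line, slope -3
pair_two = [5 - 0.5 * x for x in xs]       # hypothetical flat line, slope -0.5

r_one = correlation(xs, pair_one)
r_two = correlation(xs, pair_two)
print(r_one, r_two)
```

Both coefficients come out at -1 (up to floating-point rounding) although the slopes differ by a factor of six.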
Correlation Coefficient
To get a better idea of the correlation
coefficient, see the following slides.
Scatter Plots and Correlation
(Figure 3.6)
[Figure: three scatter plots of Y against X: (a) r = .8, (b) r = -.8, (c) r = 0]
Understanding the mathematical
notation for the covariance and
correlation coefficient.
Obs ID    Variable X    Variable Y
1         X1            Y1
2         X2            Y2
:         :             :
n         Xn            Yn
•This is a typical data format for
describing two variables.
•Using this format, we would like
to represent the covariance and
the correlation coefficient using
mathematical notation.
Understanding the mathematical notation
for the sample covariance and sample
correlation coefficient.
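Obs ID     Variable X    Variable Y    Each X - the mean of X    Each Y - the mean of Y    (Each X - the mean of X) * (Each Y - the mean of Y)
1          X1            Y1            (X1 - $\bar{X}$)          (Y1 - $\bar{Y}$)          (X1 - $\bar{X}$)*(Y1 - $\bar{Y}$)
2          X2            Y2            (X2 - $\bar{X}$)          (Y2 - $\bar{Y}$)          (X2 - $\bar{X}$)*(Y2 - $\bar{Y}$)
:          :             :             :                         :                         :
n          Xn            Yn            (Xn - $\bar{X}$)          (Yn - $\bar{Y}$)          (Xn - $\bar{X}$)*(Yn - $\bar{Y}$)
The mean   $\bar{X}$     $\bar{Y}$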
Covariance is computed by summing the last column, then dividing the
sum by (n-1). The mathematical notation for the covariance
is given on the next slide.
Mathematical Notation for the
sample covariance
The mathematical notation for covariance between variable X and
variable Y, denoted by either Cov(x, y) or $s_{xy}$, is given as
$$\mathrm{Cov}(x, y) = s_{xy} = \frac{(x_1 - \bar{X})(y_1 - \bar{Y}) + (x_2 - \bar{X})(y_2 - \bar{Y}) + \cdots + (x_n - \bar{X})(y_n - \bar{Y})}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1}$$
where $x_i$ and $y_i$ are the observed values, $\bar{X}$ and $\bar{Y}$ are the
sample means, and n is the sample size.
Mathematical Notation for the
sample correlation coefficient
The sample correlation coefficient, $r_{xy}$, is computed by the
equation
$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$
where $s_x$ is the standard deviation of variable X and $s_y$ is the standard
deviation of variable Y.
3. Ordinary Least Square
estimation
-A Regression Analysis-
[Figure: scatter plot of revenue against number of promotions for Product A, Product B and Product C, with a trend line for each product]
•This is the scatter plot we saw in
Lecture 3-1. From the graph, we
can see that promotion is more
effective for product A than
product B.
•Then, how do we measure the
effectiveness of promotions?
•Correlation coefficient cannot be
used for this purpose since it is
not the measure of the slope.
Ordinary Least Square
estimation
A Regression Analysis
[Figure: scatter plot of revenue against number of promotions for Product A, Product B and Product C, with a trend line for each product]
To measure the effectiveness of
promotion for each product, we use
regression analysis.
In this handout, we will talk about a
type of regression analysis called
“Ordinary Least Square Estimation”.
Ordinary Least Square
Estimation
[Figure: two panels titled “Product A: Promotion and sales”, revenue against number of promotions. Left: scatter plot only. Right: the same scatter plot with the fitted line y = 99060x + 105827.]
•Ordinary Least Square (OLS) estimation is a method to find a linear equation
that best fits the data. The left-hand graph is a simple scatter plot of the relationship
between the number of promotions and the revenue from product A. The
right-hand graph shows the OLS estimate of the linear relationship between the
number of promotions and revenue for product A.
•The next several slides show the logic behind the OLS estimation.
Ordinary Least Square Estimation
(Two variable case)
Ordinary Least Square Estimation assumes that the number of
promotions and the revenue from the product have the following
relationship.
$$(\text{Revenue}) = \beta_0 + \beta_1 (\text{Number of promotions})$$
More generally, ordinary least square estimation assumes that,
between variable Y and variable X, there is the following linear
relationship.
$$Y = \beta_0 + \beta_1 X$$
An equation like this, which describes a relationship among variables, is
called a “model” or “regression equation”. The model above contains
two parameters, β0 and β1. They are called the model coefficients. The
coefficient β0 is the intercept on the Y-axis and the coefficient β1 is the
slope. (The slope is the change in Y for every unit change in X.)
Ordinary Least Square Estimation
(Two variable case)
Ordinary Least Square Estimation is a
method to find (estimate) the values for β0
and β1 that fit the equation to the data
“best”.
The criteria to choose (estimate) the
values for β0 and β1 is described in the
following slides.
Ordinary Least Square Estimation
Criteria to estimate the parameter values
We choose (estimate) the values for β0 and β1 so that the sum of the
squared (vertical) distances from the equation to each data point is
minimized. (Therefore, this estimation is called ordinary least square
estimation.) Excel automatically estimates these values.
[Figure: plot of Y (sales from Product A) against X (number of promotions), showing the line Y = β0 + β1X and, for the ith data point (xi, yi), the vertical distance ei from the equation to the data point]
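For the two-variable case, the values that minimize the sum of squared vertical distances have a well-known closed form: the slope is the sample covariance divided by the sample variance of X, and the intercept is the mean of Y minus the slope times the mean of X. The sketch below applies it to the nine Product A observations listed earlier, with revenue converted to yen. It is not a description of how Excel computes the estimates internally, and the function name `sum_squared_errors` is illustrative.

```python
promotions = [5, 10, 8, 9, 10, 12, 20, 18, 17]
revenue_yen = [600_000, 1_000_000, 1_100_000, 900_000, 1_500_000,
               750_000, 2_200_000, 2_000_000, 1_700_000]

n = len(promotions)
mean_x = sum(promotions) / n
mean_y = sum(revenue_yen) / n

# Sample covariance of X and Y, and sample variance of X (both divide by n - 1).
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(promotions, revenue_yen)) / (n - 1)
s_xx = sum((x - mean_x) ** 2 for x in promotions) / (n - 1)

slope = s_xy / s_xx                   # estimate of beta_1
intercept = mean_y - slope * mean_x   # estimate of beta_0


def sum_squared_errors(b0, b1):
    """Sum of squared vertical distances from the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2
               for x, y in zip(promotions, revenue_yen))


best = sum_squared_errors(intercept, slope)
nudged = sum_squared_errors(intercept, slope * 1.05)  # a slightly different line

print(round(slope), round(intercept))
print(best < nudged)
```

The estimates agree with the line y = 99060x + 105827 shown in the handout's figure, and moving the slope away from the estimate increases the sum of squared distances.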
Ordinary Least Square
Estimation Using Excel, Example
[Figure: “Product A: Promotion and sales”. Scatter plot of revenue against number of promotions with the fitted line y = 99060x + 105827.]
Excel can estimate the linear equation model, and draw the line
at the same time. The estimated β0 =105827, and β1=99060.
Exercise: Open Data “OLS Exercise 1-Promotion and Sales”
and reproduce this figure.
Things we can do with OLS
Using the estimated equation, we can
1. Find the effect of promotion on the
revenue for product A.
2. Forecast revenue for different numbers of
promotions.
3. Find the number of promotions
necessary to achieve your sales goal.
Effect of promotion on the sales
of product A
[Figure: “Product A: Promotion and sales”. Scatter plot of revenue against number of promotions with the fitted line y = 99060x + 105827.]
The estimated slope parameter β1 is the estimated effect of
promotion on the revenue from product A. β1 = 99,060 means
that if you increase the number of promotions by one, the
revenue would increase by 99,060 yen on average.
Forecasting Revenue
[Figure: “Product A: Promotion and sales”. Scatter plot of revenue against number of promotions with the fitted line y = 99060x + 105827.]
The estimated equation can be used to forecast revenue for
different numbers of promotions.
Suppose that you would like to know what the
expected revenue from product A would be if the number of
promotions is 12. The expected revenue for 12 promotions
can be computed as
(Expected revenue when the number of promotions is 12)
= 99060*12 + 105827 = 1,294,547
So you would expect the revenue to be roughly 1.3 million
yen.
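The forecast is just the estimated equation evaluated at a chosen number of promotions. A minimal sketch, using the coefficients from the handout (the function name `forecast_revenue` is made up for illustration):

```python
# Estimated coefficients from the handout: y = 99060x + 105827 (revenue in yen).
INTERCEPT = 105827
SLOPE = 99060


def forecast_revenue(number_of_promotions):
    """Expected revenue in yen for a given number of promotions."""
    return SLOPE * number_of_promotions + INTERCEPT


forecast_12 = forecast_revenue(12)
print(forecast_12)  # 1294547
```

The same function can be called with any other number of promotions, for example `forecast_revenue(15)`.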
Finding the number of
promotions that achieve sales
goal
[Figure: “Product A: Promotion and sales”. Scatter plot of revenue against number of promotions with the fitted line y = 99060x + 105827.]
Suppose that you would like to achieve sales of 3,000,000. How
many promotions are necessary to achieve this goal?
To answer this question, simply solve the following equation for
X.
3,000,000 = 99060X + 105827
X = 29.2
Therefore, if you would like to achieve at least 3,000,000, you
would need to run 30 promotions.
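Solving for X and rounding up to a whole number of promotions can be sketched as follows. The coefficients are the ones from the handout; the function name `promotions_needed` is illustrative.

```python
import math

# Estimated coefficients from the handout: y = 99060x + 105827 (revenue in yen).
INTERCEPT = 105827
SLOPE = 99060


def promotions_needed(sales_goal):
    """Return (exact solution for X, smallest whole number of promotions)."""
    exact = (sales_goal - INTERCEPT) / SLOPE
    # Promotions come in whole units, so round up to reach the goal.
    return exact, math.ceil(exact)


exact_x, whole_promotions = promotions_needed(3_000_000)
print(round(exact_x, 1), whole_promotions)  # 29.2 30
```

Rounding up rather than to the nearest integer matters here: with 29 promotions the estimated revenue is still below 3,000,000, while 30 promotions exceed it.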
Exercise
Open the data “OLS Exercise 1-Promotion and Sales”. Plot
the relationship between the number of promotions and
revenue for product A and product B.
Estimate the following equation
(revenue) = β0 + β1(number of promotions)
separately for product A and product B using OLS.
Is the effect of promotion different for product A and
product B?
What would be the revenue from product B if the
number of promotions is 12?
Suppose the sales goal for product B is 1,000,000. How
many promotions are necessary to achieve this goal?
More Topics on Ordinary Least
Square Estimations
[Figure: “Advertisement and revenue Product II”. Scatter plot of revenue in 1000 yen against advertisement spending in 1000 yen, with the fitted line y = 13.451x + 15440.]
•The graph above shows the relationship between advertisement cost
and revenue, along with the estimated linear equation.
•The estimated slope coefficient is about 13.45, which means that for every
additional 1000 yen you spend on advertisement, revenue increases by about
13.45 thousand yen on average.
More Topics on Ordinary Least
Square Estimations
[Figure: “Advertisement and revenue Product II”. The same scatter plot with the fitted line y = 13.451x + 15440.]
However, the graph seems to indicate that there is not much of a relationship
between advertisement spending and revenue.
When we estimate a linear equation, we typically would like to know if
advertisement has any effect on the revenue at all. To answer such a question,
just estimating β0 and β1 is not enough. We need more information.
More Topics on Ordinary Least
Square Estimations
[Figure: “Advertisement and revenue Product II”. The same scatter plot with the fitted line y = 13.451x + 15440.]
To answer the following question, “Would the advertisement have any impact
on the revenue?”, we use the concept of “hypothesis testing” using “t-statistics”.
This is the topic for the next class.
Topics to be covered next week
We will cover several more topics on ordinary least
square estimation, which include
1. Testing whether advertisement spending has any
effect on revenue, using t-statistics.
2. Ordinary Least Square estimation when there are
more explanatory variables.
3. Ordinary Least Square estimation when you have
panel data (repeated observations over time).
4. Analyzing the effect of a policy change (e.g., the
introduction of a new tax, a change in compensation scheme,
etc.) using OLS.