Econ107 Applied Econometrics
Topic 1: An Overview of Regression Analysis
(Studenmund, Chapter 1)
I. The Nature and Scope of Econometrics.
There are many definitions of econometrics.
• Nobel Prize Committee
• Paul Samuelson, et al.: “Econometrics may be defined as quantitative analysis of actual economic phenomena.”
• Goldberger: “... application of economic theory, mathematics and statistical inference to the analysis of economic phenomena.”
• (Joke) E.E. Leamer: “There are two things you don’t want to see in the making – sausage and econometric research.”
II. Major Uses of Econometrics.
1. Describing economic reality
2. Testing hypotheses about economic theory
3. Forecasting future economic activity
III. Econometric Methodology – Regression Analysis
An important methodology in econometrics is regression analysis, which typically
follows the steps below. We use a famous example to illustrate.
1. State the hypotheses.
Keynes in the General Theory said a $1 increase in income will lead to less than
a $1 increase in overall consumption.
Page2
We want to test this hypothesis, namely that the marginal propensity to consume (MPC) is less than 1.
2. Specify the mathematical model of the theory.
Keynes didn’t specify the exact nature of the relationship, but we might
suggest a simple linear one:
\[ C = \beta_0 + \beta_1 DI, \qquad 0 < \beta_1 < 1 \]
where C = aggregate consumption and DI = aggregate disposable income.
3. Specify the econometric model.
This purely mathematical model is of limited interest to the econometrician, because it
assumes an exact or deterministic relationship between C and DI. We rewrite the
equation with a disturbance or error term:
\[ C = \beta_0 + \beta_1 DI + \epsilon \]
This is now an econometric model, or more precisely a linear regression model.
Page3
4. Obtain the Data.
The only way to estimate the parameters of interest in this model is to obtain the
necessary data. The data source could involve time series, cross-sectional or panel
data.
Time series data are collected over time for the same country or other single
aggregate economic unit (e.g., aggregate C and DI could be obtained for
Singapore from 1950 to 2000). In this case, we’d normally rewrite the equation
with a ‘t’ subscript on the variables and disturbance term to denote ‘time’.
\[ C_t = \beta_0 + \beta_1 DI_t + \epsilon_t \]
Cross-sectional data are collected for a sample of individuals, households,
firms or other disaggregate economic entities at a point in time (e.g., C and DI could
be obtained for a sample of 1,000 Singapore families during 2000). In this case,
we’d normally rewrite the equation with an ‘i’ subscript on the variables and
disturbance term to denote ‘individual’.
\[ C_i = \beta_0 + \beta_1 DI_i + \epsilon_i \]
Finally, panel data contain elements of both time series and cross-sectional data
(e.g., C and DI could be obtained for all countries in the OECD during the period
1950-2000). Note that we have variation across countries at any single point in
time, as well as variation across time. In this case, we’d normally rewrite the
equation with both an ‘i’ and a ‘t’ subscript on the variables and disturbance term to
denote ‘country’ and ‘time’.
\[ C_{it} = \beta_0 + \beta_1 DI_{it} + \epsilon_{it} \]
Time series or cross-sectional data can be plotted as a ‘scatter diagram’.
Page4
5. Estimate the parameters in the econometric model.
Now it’s time to estimate the coefficients in the model. The basic idea is to come
up with a ‘line’ that best ‘fits’ the data points. Imagine that this ‘regression
analysis’ yields the following consumption function.
\[ \hat{C} = 336.9 + 0.820\,DI \]
These are the estimates of the 2 coefficients. The ‘hat’ on C indicates that this is
an ‘estimated’ consumption function or regression model.
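A minimal sketch of how such a ‘best-fitting’ line might be computed, assuming ordinary least squares is the fitting criterion (the notes only say the line should ‘fit’ the data). The underlying observations are not given here, so the data below are simulated from an assumed consumption function; the seed, sample size and ‘true’ coefficients (300 and 0.8) are hypothetical and will not reproduce 336.9 and 0.820.

```python
import numpy as np

# Hypothetical data: simulate DI and C from an assumed consumption function,
# because the notes do not supply the actual observations.
rng = np.random.default_rng(0)
di = rng.uniform(10_000, 60_000, size=50)            # disposable income
c = 300.0 + 0.8 * di + rng.normal(0, 500, size=50)   # consumption plus noise

# Least-squares formulas for a single explanatory variable:
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar
x_dev = di - di.mean()
b1 = np.sum(x_dev * (c - c.mean())) / np.sum(x_dev ** 2)
b0 = c.mean() - b1 * di.mean()

c_hat = b0 + b1 * di  # fitted values on the estimated line
print(f"C_hat = {b0:.1f} + {b1:.3f} DI")
```

The estimated slope should land close to the assumed value of 0.8, but not exactly on it, because the simulated disturbances move each observation off the line.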
6. Test the hypothesis.
Recall that we wanted to test Keynes’ hypothesis that the MPC was between zero
and 1. The estimate of 0.820 looks reasonable, but we are unsure whether there is any
‘statistical’ evidence that it’s below 1.
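One way such evidence might be assessed is a one-sided t-test of the null hypothesis that the slope is at least 1 against the alternative that it is below 1. The sketch below is only illustrative: the test itself is not developed in this topic, the data are simulated from an assumed consumption function, and the large-sample normal critical value is an assumption made to keep the example dependency-free.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical data, simulated with an assumed slope of 0.8.
rng = np.random.default_rng(1)
n = 200
di = rng.uniform(10_000, 60_000, size=n)
c = 300.0 + 0.8 * di + rng.normal(0, 500, size=n)

# Least-squares estimates of intercept and slope.
x_dev = di - di.mean()
sxx = np.sum(x_dev ** 2)
b1 = np.sum(x_dev * (c - c.mean())) / sxx
b0 = c.mean() - b1 * di.mean()

# Standard error of the slope: sqrt(s^2 / Sxx), with s^2 = SSR / (n - 2).
resid = c - (b0 + b1 * di)
s2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / sxx)

# t-statistic for H0: slope >= 1 versus HA: slope < 1.
t_stat = (b1 - 1.0) / se_b1
crit = NormalDist().inv_cdf(0.05)   # 5% one-sided, large-sample approximation
reject_h0 = t_stat < crit
print(f"t = {t_stat:.2f}, critical value = {crit:.3f}, reject H0: {reject_h0}")
```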
Page5
7. Forecast or predict economic behaviour.
One of the other uses of this model is forecasting or predicting future economic
behaviour. To predict C, however, we need to know future values of DI. Suppose
you know that DI is going to be $65,000 (millions).
\[ \hat{C} = 336.9 + 0.820(65{,}000) = 53{,}636.9 \]
This also allows you to predict savings of $11,363.1 (millions), which is just the difference
between DI and C.
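The forecast arithmetic can be checked with a few lines of code. The coefficients and the future DI value are taken from the notes; the variable names are illustrative only.

```python
# Estimated consumption function from the notes: C_hat = 336.9 + 0.820 * DI
b0_hat, b1_hat = 336.9, 0.820

di_future = 65_000.0                          # assumed future DI, in $ millions
c_forecast = b0_hat + b1_hat * di_future      # predicted consumption
savings_forecast = di_future - c_forecast     # savings = DI - C

print(round(c_forecast, 1), round(savings_forecast, 1))
```

Rounded to one decimal place, the two values agree with the figures of 53,636.9 and 11,363.1 given above.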
8. Use the model for policy purposes.
The model can also be used for ‘control’ purposes. Suppose that C of $53.6 billion is
insufficient to maintain full employment because households are not spending enough.
The government could consider increasing DI through tax cuts to achieve a higher
target. Suppose $62 billion is needed.
\[ 62{,}000 = 336.9 + 0.820\,DI \]
\[ DI = 75{,}198.9 \]
Thus, taxes need to be cut by just over $10 billion from forecasted levels.
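The ‘control’ calculation simply inverts the estimated consumption function. A short sketch, using the figures from the notes (variable names are illustrative):

```python
# Estimated consumption function from the notes: C_hat = 336.9 + 0.820 * DI
b0_hat, b1_hat = 336.9, 0.820

c_target = 62_000.0      # consumption needed for full employment, $ millions
di_forecast = 65_000.0   # forecast DI before any policy change, $ millions

# Solve c_target = b0_hat + b1_hat * DI for DI.
di_required = (c_target - b0_hat) / b1_hat
tax_cut = di_required - di_forecast   # extra DI that tax cuts must deliver

print(round(di_required, 1), round(tax_cut, 1))
```

The required increase in DI is a little under $10.2 billion, which is the ‘just over $10 billion’ tax cut mentioned above.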
IV. Types of Econometrics and Names of Variables in Regression
Econometrics is split into ‘theoretical’ and ‘applied’ fields, and we end up ‘straddling’ these 2
approaches. Theoretical econometrics concerns the development of basic
estimation approaches, properties of estimators, etc. It is more closely related to
mathematical statistics (e.g., proofs, axioms, ...).
Applied econometrics is built on this theoretical foundation. It applies estimation
techniques to various areas of economic enquiry. Examples: Where to open a new
restaurant? How much to spend on advertising? Should we fix the target interest rate? How many hours
to spend studying for Econ107? Academics and the private and government sectors have
increasingly used econometrics.
Page6
Regression analysis is the study of the relationship between a ‘Dependent
Variable’ and one or more ‘Independent’ or ‘Explanatory Variables’.
In the linear regression model (or true regression line or population regression
function)
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_K X_{Ki} + \epsilon_i \]
\(Y_i\) is called the dependent or left-hand-side variable or regressand and is random;
\(X_{ki}\) \((k = 1, \ldots, K)\) is called an independent or explanatory or right-hand-side variable or
regressor, and it can be fixed or random; \(\epsilon_i\) is called the error or disturbance term and is
random; the \(\beta\)’s are called regression coefficients, and they are unknown and fixed; \(\beta_0\) is
the intercept coefficient; the \(\beta_k\) \((k = 1, \ldots, K)\) are the slope coefficients. The meaning of
\(\beta_1\) is the impact of a one-unit increase in \(X_1\) on \(Y\), holding constant the other
independent variables.
The estimated regression line (or sample regression function) is written as
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_K X_{Ki} \]
\(\hat{Y}_i\) is called the ‘estimated’ or fitted value of \(Y_i\); \(\hat{\beta}_k\) \((k = 0, \ldots, K)\) is called an estimated
regression coefficient. Define \(e_i = Y_i - \hat{Y}_i\) and call \(e_i\) the residual.
When K=1, the regression model is the Simple Linear Regression (SLR) model.
When K>1, the regression model is the Multiple Linear Regression (MLR) model.
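To make the notation concrete, here is a small sketch of a multiple linear regression with K=2, assuming least squares as the estimation method. All data and the ‘true’ coefficients are simulated and hypothetical; the point is only to show the fitted values, the estimated coefficients and the residuals as defined above.

```python
import numpy as np

# Hypothetical population: Y = beta_0 + beta_1*X1 + beta_2*X2 + eps
rng = np.random.default_rng(2)
n, K = 100, 2
X = rng.normal(size=(n, K))               # explanatory variables X1, X2
eps = rng.normal(scale=0.5, size=n)       # disturbance term
beta = np.array([1.0, 2.0, -3.0])         # assumed beta_0, beta_1, beta_2
y = beta[0] + X @ beta[1:] + eps

# Add a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ beta_hat   # fitted values
e = y - y_hat                 # residuals e_i = Y_i - Y_hat_i

print("estimated coefficients:", np.round(beta_hat, 3))
```

Note the distinction the code makes visible: `eps` is the unobservable disturbance from the assumed population model, while `e` is the residual computed from the estimated line.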
V. Statistical vs. Deterministic Relationships
Regression analysis is concerned with a Statistical, not a Functional or
Deterministic dependence among variables. In statistical relationships, the
variables are Random or Stochastic.
VI. Regression vs. Causation
Although regression analysis deals with the dependence of one variable on other
variables, it doesn’t necessarily imply causation. A causal relationship must come
from outside of statistics. Economic theory is supposed to provide the compelling
evidence of causation.
Page7
VII. The True (or Population) Regression Function (PRF)
Suppose we have a small community of 12 families. We’re interested in studying
the relationship between their weekly disposable income (X) and expenditure on
food (Y). We want to predict the population mean of food expenditures, given
some level of family income.
The 12 families can be grouped into four income groups. Each family within a
group has the same disposable income. This is the entire population, not a sample.
Disposable Income (X)   Individual Food Expenditures (Y)   Average Food Expenditures
250                     78.00, 88.50, 96.00                87.50
300                     77.50, 89.00, 96.50, 109.00        93.00
350                     90.50, 106.50                      98.50
400                     99.00, 103.00, 110.00              104.00
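The averages in the last column can be recomputed directly from the individual expenditures. A short sketch, with the population stored as a dictionary keyed by income (the data structure and names are illustrative choices):

```python
# The full population of 12 families from the table: income -> food expenditures.
population = {
    250: [78.00, 88.50, 96.00],
    300: [77.50, 89.00, 96.50, 109.00],
    350: [90.50, 106.50],
    400: [99.00, 103.00, 110.00],
}

# Average food expenditure within each income group.
cond_mean = {x: sum(ys) / len(ys) for x, ys in population.items()}

for x, m in cond_mean.items():
    print(f"X = {x}: average Y = {m:.2f}")
```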
These data points can be plotted on a Scatter Diagram, in which the ‘solid’ dots
are the actual observations. The Conditional Mean or Conditional Expectation is
\[ E(Y \mid X = X_i) \]
The ‘circles’ are the conditional means. Clearly, food expenditures ‘on average’
increase with disposable income.
This can be seen even more clearly by ‘connecting’ these conditional means with
a straight line. This is the True (or Population) Regression Line. Note that in general it
could also be a True (or Population) Regression Curve.
Page8
Geometrically, a population regression line or curve is simply the locus of the
conditional means or expectations of the dependent variable for fixed values of the
explanatory variable(s).
In general, we could write the Population Regression Function (PRF) as:
\[ E(Y \mid X_i) = f(X_i) \]
where \(f\) is some function of the explanatory variable.
We might anticipate that food consumption will be linearly related to disposable
income. This is an initial assumption of our estimation. We could narrow this
functional form to:
\[ E(Y \mid X_i) = \beta_0 + \beta_1 X_i \]
This is known as the linear PRF (or PR Line).
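For the 12-family example, the linearity assumption can be checked against the four conditional means from the table. The sketch below computes the line through the first and last conditional means and verifies that the other two lie on it; the intercept and slope it produces are implied by the table rather than stated in the notes.

```python
# Conditional means of food expenditure from the 12-family table.
cond_mean = {250: 87.5, 300: 93.0, 350: 98.5, 400: 104.0}
xs = sorted(cond_mean)

# Line through the conditional means at the lowest and highest income.
beta1 = (cond_mean[xs[-1]] - cond_mean[xs[0]]) / (xs[-1] - xs[0])
beta0 = cond_mean[xs[0]] - beta1 * xs[0]

# Check that every conditional mean lies on this line (up to rounding error).
on_line = all(abs(beta0 + beta1 * x - cond_mean[x]) < 1e-9 for x in xs)

print(f"E(Y | X) = {beta0:.2f} + {beta1:.2f} X, all means on line: {on_line}")
```

In this population the conditional means rise by 5.50 for every 50 of extra income, so a linear PRF fits them exactly.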
Page9
VIII. ‘Linearity’ in Regression Analysis
What do we mean when we say that our regression model is linear? There are two ways
in which a model can fail to be linear. One possibility is that the model is nonlinear in terms of the variables.
\[ E(Y \mid X_i) = \beta_0 + \beta_1 X_i^2 \]
The second possibility is that the PRF is nonlinear in terms of the coefficients.
\[ E(Y \mid X_i) = \beta_0 + \sqrt{\beta_1}\, X_i \]
Regression functions that are nonlinear in the coefficients will not be considered in this paper,
but those that are nonlinear only in the variables, like the first example above, will be.
From now on, ‘linear regression models’ should be read as linear in terms of the parameters.
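To see why nonlinearity in the variables causes no difficulty, consider a model in which Y depends on the square of X. Defining a new variable Z equal to X squared turns it into an ordinary linear regression of Y on Z. The sketch below assumes least squares as the estimator and uses simulated data with hypothetical coefficients (2.0 and 0.5).

```python
import numpy as np

# Hypothetical data from a model that is nonlinear in X but linear in the
# coefficients: Y = 2.0 + 0.5 * X^2 + disturbance.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=200)

# Transform the variable, then fit a straight line in (Z, Y) space.
z = x**2
Z = np.column_stack([np.ones_like(z), z])
(b0, b1), *_ = np.linalg.lstsq(Z, y, rcond=None)

print(f"Y_hat = {b0:.2f} + {b1:.3f} X^2")
```

No such relabelling of variables would work if a coefficient itself entered nonlinearly, which is why linearity in the parameters is the property that matters.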
IX. Adding the Disturbance Term to Our PRF
The PRF tells us the 'average' food expenditures for a given level of household
income. But we know that any 'particular' household is unlikely to be on this
function. For this reason we rewrite the PRF as
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]
where \(\epsilon_i\) is a random variable with mean 0. There are lots of reasons why \(\epsilon_i\) might exist.
• Minor influences on Y are omitted.
• The underlying theoretical equation might have a different functional form than the one chosen for the regression.
• Some purely random variation is always there.
• Measurement error in Y or X.
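In the 12-family population, the disturbance for each family is simply its food expenditure minus the conditional mean for its income group. The sketch below computes these disturbances from the table and confirms that they average to zero within each group (names and data structure are illustrative).

```python
# The full population of 12 families: income -> food expenditures.
population = {
    250: [78.00, 88.50, 96.00],
    300: [77.50, 89.00, 96.50, 109.00],
    350: [90.50, 106.50],
    400: [99.00, 103.00, 110.00],
}

# Conditional mean E(Y | X) for each income group.
cond_mean = {x: sum(ys) / len(ys) for x, ys in population.items()}

# Disturbance for each family: actual expenditure minus the conditional mean.
disturbances = {x: [y - cond_mean[x] for y in ys] for x, ys in population.items()}

# Within each income group the disturbances average to zero.
group_mean = {x: sum(e) / len(e) for x, e in disturbances.items()}

for x in population:
    print(x, disturbances[x], group_mean[x])
```

Individual families sit above or below the PRF, sometimes by a wide margin, even though the average deviation at each income level is zero.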
Page10
X. The Sample (Estimated) Regression Function
Thus far, we've dealt with the entire population and the PRF, and have avoided any
consideration of sampling. In most cases, we will never observe the entire
population. We have to infer from a sample or samples what the PRF might look
like. Note that we're unlikely to know just how close we get to the truth.
Each sample we draw can be used to produce a Sample (Estimated) Regression
Function (SRF), that is, the estimated regression function:
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \]
Of course, we can replace the fitted value (\(\hat{Y}_i\)) on the LHS with the actual value of the
dependent variable (\(Y_i\)). The LHS is then no longer an estimate; it's the actual value.
The RHS now includes the residual term \(e_i\).
\[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i \]
This means that the actual dependent variable can be decomposed into its fitted
value and the residual.
\[ Y_i = \hat{Y}_i + e_i \]
This residual, like the disturbance, can be either positive or negative. We can
either overestimate the true value of \(Y_i\):
\[ Y_i - \hat{Y}_i = e_i < 0 \quad \text{if } Y_i < \hat{Y}_i \]
or underestimate it:
\[ Y_i - \hat{Y}_i = e_i > 0 \quad \text{if } Y_i > \hat{Y}_i \]
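As an illustration, the sketch below takes a sample of six of the 12 families from the earlier food expenditure table, estimates an SRF, and splits each observation into fitted value and residual. The particular six families are a hand-picked, hypothetical sample, and least squares is assumed as the estimation method.

```python
import numpy as np

# A hypothetical sample of 6 families drawn from the population of 12.
x = np.array([250, 300, 300, 350, 400, 400], dtype=float)   # income
y = np.array([88.5, 77.5, 109.0, 106.5, 99.0, 110.0])       # food expenditure

# Least-squares estimates of the SRF coefficients.
x_dev = x - x.mean()
b1_hat = np.sum(x_dev * (y - y.mean())) / np.sum(x_dev ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

y_hat = b0_hat + b1_hat * x   # fitted values
e = y - y_hat                 # residuals, so that y = y_hat + e

overestimated = e < 0    # fitted value above the actual value
underestimated = e > 0   # fitted value below the actual value

print(f"SRF: Y_hat = {b0_hat:.2f} + {b1_hat:.4f} X")
print("residuals:", np.round(e, 2))
```

A different sample of families would generally give a different SRF, which is why the SRF is only an estimate of the PRF.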
XI. Questions for discussion: Q1.10
XII. Run the height regression (Section 1.4) using the data file provided.
Do further exploration according to Q1.4 and Q1.5.