Review for Mid-Term Exam
What is this course about?
Estimate how changes in one or more
“explanatory” or “independent”
variables, X1,…,Xk, change the value of
a variable of interest, the “dependent”
variable, Y, all else equal.
Why?
- To provide quantitative answers to
quantitative questions:
  - Policy evaluation/design
  - Forecasting/prediction
- To evaluate the plausibility of economic
hypotheses
Approach?
Scientific approach based on the
principles of probability and statistics.
Major Topics Covered So Far
1. Estimating the Population Mean
and the difference between two
Population Means. (This also provided
the opportunity to review basic
ideas from probability and
statistics.)
Chs. 2 and 3
Problem Set 1
2. Estimating the Simple Linear
Regression Model
Ch. 4
Problem Sets 2 and 3
3. Estimating the Multiple Linear
Regression Model
Ch. 5
Problem Set 4
The same approach was taken to address
each of these topics:
1. There is a population from which Y is
drawn, and we are interested in the
population mean or the conditional
mean of Y:
E(Y), E(Y | X), E(Y | X1,…,Xk)
The linear regression model is
defined by the assumption, which
may or may not be correct, that the
conditional mean of Y given X (or
X1,…,Xk) is a linear function of X
(or X1,…,Xk).
2. We draw what we assume is an
i.i.d. sample from the population –
Y1,…,Yn
(Y1,X1),…,(Yn,Xn)
(Y1,X11,…,Xk1),…,(Yn,X1n,…,Xkn)
3. Use the sample to construct:
- an estimator of the mean
- test statistics
- confidence intervals
- descriptive statistics
An estimator is a procedure that is
applied to compute an estimate of the
parameters of interest; in our case, either
the population mean or the conditional
population mean.
An estimator is a random variable and its
quality is evaluated by its sampling
distribution.
The estimator we applied in all three
cases: the OLS estimator.
Sampling distribution of the OLS
estimator (under the appropriate
assumptions):
- Unbiased estimator
- Consistent estimator
- $(\hat{\beta} - \beta) / \mathrm{se}(\hat{\beta}) \sim N(0,1)$, approximately, in large samples
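A small simulation can make these three properties concrete. The sketch below is only an illustration under assumed settings (uniform data, population mean 5.0, n = 100, 2000 replications, and the function name simulate_sample_mean are all choices made here, not taken from the course). It uses the sample mean, which is the OLS estimator of the population mean.

```python
import math
import random
import statistics

def simulate_sample_mean(mu=5.0, half_width=3.0, n=100, reps=2000, seed=0):
    """Repeatedly draw i.i.d. samples and record how the estimator behaves.

    Returns the list of sample means and the list of standardized values
    (Y-bar - mu) / se(Y-bar), one entry per simulated sample.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates, standardized = [], []
    for _ in range(reps):
        # Uniform data are deliberately non-normal: any normality of the
        # standardized estimator comes from the sample size, not the data.
        sample = [rng.uniform(mu - half_width, mu + half_width) for _ in range(n)]
        ybar = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        estimates.append(ybar)
        standardized.append((ybar - mu) / se)
    return estimates, standardized

estimates, standardized = simulate_sample_mean()

# Unbiasedness: the estimates should average out close to the true mean.
avg_estimate = statistics.fmean(estimates)

# Approximate N(0,1): about 95% of standardized values should lie in (-1.96, 1.96).
share_inside = sum(abs(z) < 1.96 for z in standardized) / len(standardized)

print(f"average of the estimates: {avg_estimate:.3f} (population mean is 5.0)")
print(f"share with |standardized value| < 1.96: {share_inside:.3f}")
```

Raising n in the call should also shrink the spread of the estimates around 5.0, which is what consistency means in practice.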
A test statistic is a random variable
whose value is computed from the
sample and whose distribution under the
“null hypothesis” is known and has a
simple form.
Consider null hypotheses of the form:
H0: $\beta = b$, where b is some known number.
Under the null hypothesis, the t-statistic
$t = (\hat{\beta} - b) / \mathrm{se}(\hat{\beta})$
is drawn from a N(0,1) distribution, for
sufficiently large sample sizes. For a given
alternative hypothesis
HA: $\beta \neq b$ or HA: $\beta < b$ or HA: $\beta > b$
we can compare the calculated t-statistic
against the percentiles of the N(0,1) to
determine the “p-value” of the test and,
for a given test size (significance level),
whether to reject or not reject the null.
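The testing procedure just described can be sketched in a few lines. This is a minimal illustration, assuming the large-sample N(0,1) approximation holds; the function name t_test and the numbers (estimate 1.2, standard error 0.5, b = 0) are hypothetical and not from the course.

```python
from statistics import NormalDist

def t_test(estimate, std_error, b, alternative="two-sided"):
    """Return the t-statistic and its large-sample p-value for H0: beta = b."""
    t = (estimate - b) / std_error
    cdf = NormalDist().cdf(t)  # standard normal CDF evaluated at t
    if alternative == "two-sided":   # HA: beta != b
        p = 2 * (1 - NormalDist().cdf(abs(t)))
    elif alternative == "less":      # HA: beta < b
        p = cdf
    elif alternative == "greater":   # HA: beta > b
        p = 1 - cdf
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    return t, p

# Hypothetical numbers, for illustration only.
t_stat, p_value = t_test(estimate=1.2, std_error=0.5, b=0.0)
reject_at_5_percent = p_value < 0.05  # reject H0 if the p-value is below the test size
print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.4f}, reject at 5%: {reject_at_5_percent}")
```

The decision rule is the same for every alternative: reject the null when the p-value is smaller than the chosen significance level.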
In the multiple regression model, we
encountered the F-statistic (and the
closely related chi-square statistic) to test
restrictions involving more than one
coefficient:
- joint hypotheses (e.g., are the
values of a group of coefficients
all equal to zero?)
- single restriction on the
relationship among coefficients
(e.g., is $\beta_2 = \beta_3$?)
Under the null hypothesis and for
sufficiently large sample size,
$F \sim F(q, \infty)$
or, equivalently,
$qF \sim \chi^2(q)$
where q is the number of restrictions that
make up H0.
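Given an F-statistic reported by regression software, the large-sample p-value follows from the chi-square relationship above. The sketch below is an illustration only: chi2_sf is a small standard-library helper written for this example (a statistics library would normally supply the chi-square distribution), and the values F = 4.0 and q = 2 are hypothetical.

```python
import math

def chi2_sf(x, q):
    """P(chi-square with q degrees of freedom > x), via a power series
    for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = q / 2.0, x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, 1000):
        term *= z / (a + k)          # next term of the series
        total += term
        if term < 1e-15 * total:     # stop once the terms are negligible
            break
    cdf = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return max(0.0, 1.0 - cdf)

def large_sample_f_test(f_stat, q):
    """p-value of an F-statistic with q restrictions, using qF ~ chi-square(q)."""
    return chi2_sf(q * f_stat, q)

# Hypothetical example: F = 4.0 when testing q = 2 restrictions jointly.
p_value = large_sample_f_test(f_stat=4.0, q=2)
print(f"large-sample p-value: {p_value:.4f}")
```

With q = 1 the test reduces to the two-sided t-test, since the F-statistic is then the square of the t-statistic.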
A confidence interval (or interval
estimate) is a random interval derived
from the sample. That is, it is an interval
whose endpoints depend on the
particular sample that was drawn and,
therefore, will vary from sample to
sample.
An x% confidence interval for the
parameter $\beta$ (0 < x < 100) will contain
the actual population value of $\beta$ in x% of
repeated samples.
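Under our usual assumptions, a $(1-2\alpha)$
confidence interval for $\beta$ is given by:
$\hat{\beta} \pm Z_{1-\alpha} \cdot \mathrm{se}(\hat{\beta})$
where $Z_{1-\alpha}$ is the $100(1-\alpha)$th percentile of the
N(0,1) distribution. For example, $\alpha = 0.025$ gives
$Z_{0.975} \approx 1.96$ and a 95% confidence interval.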
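The interval formula translates directly into code. This is a minimal sketch assuming the large-sample normal approximation; the function name confidence_interval and the inputs (estimate 1.2, standard error 0.5) are hypothetical.

```python
from statistics import NormalDist

def confidence_interval(estimate, std_error, alpha):
    """Return the (1 - 2*alpha) confidence interval: estimate +/- z * se,
    where z is the 100(1 - alpha)th percentile of N(0,1)."""
    z = NormalDist().inv_cdf(1 - alpha)
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical numbers: alpha = 0.025 in each tail gives a 95% interval.
lower, upper = confidence_interval(estimate=1.2, std_error=0.5, alpha=0.025)
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

Because the endpoints are built from the estimate and its standard error, a different sample would give a different interval, which is why the interval itself is random.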
Note:
Heteroskedasticity vs. Homoskedasticity
The assumptions we have made allow
the variance of the Y’s for a given value
of X (or for given values of X1,…,Xk) to
depend on the value of X. That is, we
allow the errors in the regression model
to be heteroskedastic.
Experience suggests that the more
restrictive assumption of
homoskedasticity is not very plausible in
most applications with cross-sectional
data.
The formulas for the standard error of
beta-hat (and, therefore, the t-statistics
that depend on this standard error) and for the
F-statistic differ according to whether
heteroskedasticity or homoskedasticity is
assumed. Most regression software
computes these standard errors and
statistics under the default setting of
homoskedasticity, but provides an option
to override the default and compute them
appropriately under the assumption of
heteroskedasticity.
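To see how much the two sets of formulas can differ, the sketch below computes both standard errors for the slope of a simple regression on simulated data. Everything about the setup is an assumption made for illustration: the function name slope_standard_errors, the data-generating process (error standard deviation growing with X), and the particular robust formula, which uses an (n - 2) degrees-of-freedom adjustment.

```python
import math
import random

def slope_standard_errors(x, y):
    """OLS slope for Y on X, with homoskedasticity-only and
    heteroskedasticity-robust standard errors."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only: one common error variance for all observations.
    s2 = sum(u * u for u in resid) / (n - 2)
    se_homo = math.sqrt(s2 / sxx)

    # Heteroskedasticity-robust: each squared residual is weighted by
    # its own squared deviation of X from the sample mean.
    weighted = sum((d * u) ** 2 for d, u in zip(dx, resid)) / (n - 2)
    se_robust = math.sqrt(weighted / n) / (sxx / n)
    return slope, se_homo, se_robust

# Simulated data (assumed design): error spread increases with X.
rng = random.Random(0)
n = 2000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.1 * xi ** 2) for xi in x]

slope, se_homo, se_robust = slope_standard_errors(x, y)
print(f"slope = {slope:.3f}")
print(f"homoskedasticity-only se = {se_homo:.4f}, robust se = {se_robust:.4f}")
```

In this design the homoskedasticity-only standard error understates the sampling uncertainty, so t-statistics based on it would be too large.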
Descriptive statistics, including the
standard error of the regression (SER),
the $R^2$, and the adjusted $R^2$, are used to
measure the amount of the variation in
the observed values of the Y’s accounted
for by the variation in the X’s.
A high $R^2$ does not imply a good and/or
meaningful regression; a low $R^2$ does
not imply a bad and/or meaningless
regression.
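These descriptive statistics are straightforward to compute once the regression has been fitted. The sketch below does so for a simple regression (k = 1 regressor) on a tiny made-up data set; the function name simple_ols_fit_statistics and the data are hypothetical, chosen so the results can be checked by hand.

```python
import math

def simple_ols_fit_statistics(x, y):
    """Fit Y on a constant and X by OLS; return SER, R-squared, adjusted R-squared."""
    n, k = len(x), 1
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar

    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))  # sum of squared residuals
    tss = sum((yi - ybar) ** 2 for yi in y)                                # total sum of squares

    return {
        "slope": slope,
        "intercept": intercept,
        "SER": math.sqrt(ssr / (n - k - 1)),
        "R2": 1 - ssr / tss,
        "adj_R2": 1 - (n - 1) / (n - k - 1) * ssr / tss,
    }

# Made-up data: by hand, slope = 0.6, intercept = 2.2, SSR = 2.4, TSS = 6.
fit = simple_ols_fit_statistics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({name: round(value, 4) for name, value in fit.items()})
```

The adjusted $R^2$ is smaller than the $R^2$ here because it penalizes the loss of a degree of freedom for each added regressor.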