Introduction to Statistical Inference
Ping Yu
Department of Economics
University of Hong Kong
1 Point Estimation
2 Hypothesis Testing
The Objective of Statistics
The objective of statistics is to infer (characteristics of) the underlying probability
law from observed data, and then use the obtained knowledge to explain what has
happened (i.e., internal validity), and predict what will happen (i.e., external
validity).
Internal validity concerns three problems:
- What is a plausible value for the parameter? (point estimation)
- What is a plausible set of values for the parameter? (set/interval estimation)
- Is some preconceived notion or economic theory on the parameter "consistent" with the data? (hypothesis testing).
In other words, the objectives of statistics are estimation, inference (including hypothesis testing and confidence interval (CI) construction), and prediction.
Point Estimation
There are two econometric traditions: the frequentist approach and the Bayesian
approach.
- the former treats the parameter as fixed (i.e., there is only one true value) and
the samples as random.
- the latter treats the parameter as random and the samples as fixed.
This course will concentrate on the frequentist approach.
Two main methods in the frequentist approach are the likelihood method and the
method of moments (MoM).
The Maximum Likelihood Estimator
The MLE was popularized by R.A. Fisher (1890-1962).
The basic idea of the MLE is to pick, as our guess of the truth, the parameter value most likely to have generated the phenomenon we observed (practical examples here).
Mathematically,
$$\theta_{\mathrm{MLE}} = \arg\max_{\theta\in\Theta} E[\ln f(X|\theta)] = \arg\max_{\theta\in\Theta} \int f(x)\ln f(x|\theta)\,dx = \arg\max_{\theta\in\Theta} \int \ln f(x|\theta)\,dF(x), \qquad (1)$$
where X is a random vector, f(x) is the true pdf or the true pmf, f(x|θ) is the specified parameterized pdf or pmf, Θ is the parameter space, and F(x) is the true cdf.
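As a quick numerical illustration of (1) (not part of the original slides), the following sketch approximates the population criterion E[ln f(X|θ)] by a Monte Carlo average for an assumed N(µ, 1) model with true mean 2; the maximizer of the approximated criterion lies close to the truth. The model, true mean, and sample size are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)   # draws from the (assumed) true f(x)

# Approximate E[ln f(X|mu)] for a N(mu, 1) model over a grid of candidate mu values.
mus = np.linspace(0.0, 4.0, 401)
criterion = [norm.logpdf(x, loc=mu, scale=1.0).mean() for mu in mus]

mu_hat = mus[int(np.argmax(criterion))]
print(mu_hat)   # close to the true mean 2.0, as (1) suggests
```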
History of the MLE
Ronald A. Fisher (1890-1962), UCL
Ronald A. Fisher (1890-1962) is an iconic founder of modern statistical theory.
The name of the F-distribution was coined by G.W. Snedecor, in honor of R.A. Fisher.
The p-value is also credited to him.
The MoM Estimator
The MoM estimator was introduced by Karl Pearson (1857-1936).
The original problem is to estimate k unknown parameters, say θ = (θ1, …, θk), in f(x). But we are not fully sure about the functional form of f(x).
Nevertheless, we know the functional form of the moments of X ∈ R as a function of θ:
$$E[X] = g_1(\theta),\quad E[X^2] = g_2(\theta),\quad \ldots,\quad E[X^k] = g_k(\theta). \qquad (2)$$
There are k equations with k unknowns, so we can solve for θ uniquely in principle.
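As a minimal sketch (not from the slides), suppose X ~ N(µ, σ²), so that E[X] = g1(θ) = µ and E[X²] = g2(θ) = µ² + σ². Plugging the sample moments into (2) and solving the two equations, anticipating the analog method discussed later, gives the classic method-of-moments estimates. The distribution and parameter values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=2.0, size=5_000)   # sample from N(mu=1.5, sigma=2)

# Moment functions for theta = (mu, sigma^2):
#   E[X]   = g1(theta) = mu
#   E[X^2] = g2(theta) = mu^2 + sigma^2
m1, m2 = x.mean(), (x**2).mean()

# Solve the two equations for the two unknowns.
mu_hat = m1
sigma2_hat = m2 - m1**2

print(mu_hat, sigma2_hat)   # roughly (1.5, 4.0)
```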
History of the MoM
Karl Pearson (1857-1936), UCL
Karl Pearson (1857-1936) is also the inventor of the correlation coefficient in
Chapter 5, so the correlation coefficient is also called the Pearson correlation
coefficient.
Efficiency and Robustness
The MoM estimator uses only the moment information in X , while the MLE uses
"all" information in X , so the MLE is more efficient than the MoM estimator.
However, the MoM estimator is more robust than the MLE since it does not rely on
the correctness of the full distribution but relies only on the correctness of the
moment functions.
Efficiency and robustness are a common trade-off among econometric methods.
A Microeconomic Example of the MoM Estimator
Moment conditions often originate from the first-order conditions (FOCs) in an optimization problem.
Suppose firms maximize their profits conditional on the information in hand; then the problem for firm i is
$$\max_{d_i} E_{\nu|z}[\pi(d_i, z_i, \nu_i; \theta)]. \qquad (3)$$
π is the profit function, e.g.,
$$\pi(d_i, z_i, \nu_i; \theta) = p_i f(L_i, \nu_i; \theta) - w_i L_i,$$
where zi = (pi, wi) is all information used in the decision and can be observed by both the firm and the econometrician, pi is the output price and wi is the wage, νi is the exogenous random error (e.g., weather, financial crisis, etc.) and cannot be observed or controlled by either the firm or the econometrician, and di = Li is the decision of labor input.
θ is the technology parameter, e.g., if f(Li, νi; θ) = Li^φ exp(νi), then θ = φ; θ is known to the firm but unknown to the econometrician.
Our goal is to estimate θ, which is relevant for measuring the causal effect - the effect of labor input on profit.
continue...
The FOCs of (3) are
$$E_{\nu|z}\!\left[\frac{\partial \pi(d_i, z_i, \nu_i; \theta)}{\partial d_i}\right] = m(d_i, z_i|\theta) = 0.$$
When there is randomness even in zi,¹ then the objective function changes to
$$\max_{d_i} E[\pi(d_i, z_i, \nu_i; \theta)],$$
and the FOCs change to
$$E[m(d_i, z_i|\theta)] = 0, \qquad (4)$$
which are a special set of moment conditions.
¹ The difference between zi and νi is that zi can be observed ex post while νi cannot. That zi is random means that the decision is made before zi is revealed, or the decision is made ex ante.
A Macroeconomic Example of the MoM Estimator (*)
Consider
$$\max_{\{c_t\}_{t=1}^{\infty}} \sum_{t=1}^{\infty} \rho^t E_0[u(c_t)] \quad \text{s.t.} \;\; c_{t+1} + k_{t+1} = k_t R_{t+1}, \; k_0 \text{ is known},$$
where ρ is the discount factor, E0[u(·)] is the conditional expected utility based on the information at t = 0, kt is the capital accumulation at time period t, ct is the consumption at t, and Rt is the gross return rate at t.
From dynamic programming, we have the Euler equation
$$E_0\!\left[\rho\,\frac{u'(c_{t+1})}{u'(c_t)}\,R_{t+1}\right] = 1.$$
If u(c) = (c^{1−α} − 1)/(1 − α), α > 0, then we get
$$E_0\!\left[\rho\left(\frac{c_t}{c_{t+1}}\right)^{\alpha} R_{t+1}\right] = 1. \qquad (5)$$
Suppose ρ is known while α is unknown; then (5) is a moment condition for α.
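As an illustrative sketch (not from the slides), the sample analogue of (5), (1/T) ∑t ρ (ct/ct+1)^α Rt+1 − 1 = 0, can be solved for α by one-dimensional root-finding. The simulated consumption-growth and return series below are constructed so that (5) holds exactly at an assumed α0 = 3; ρ, α0, and the data-generating choices are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
rho, alpha0, T = 0.95, 3.0, 500                   # assumed values for illustration

# Simulate consumption growth and construct returns so that (5) holds exactly at alpha0.
growth = np.exp(rng.normal(0.02, 0.05, size=T))   # c_{t+1} / c_t
R = growth**alpha0 / rho                          # then rho * (c_t/c_{t+1})^alpha0 * R_{t+1} = 1

def sample_moment(alpha):
    """Sample analogue of (5): mean of rho * (c_t/c_{t+1})^alpha * R_{t+1}, minus 1."""
    return np.mean(rho * growth**(-alpha) * R) - 1.0

alpha_hat = brentq(sample_moment, 0.1, 10.0)      # root of the sample moment condition
print(alpha_hat)                                  # recovers alpha0 = 3.0
```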
Population Version vs Sample Version of Moment Conditions
Equations (2), (4) and (5) are the population version of moment conditions.
Although some econometricians treat the "population" as a physical population in the real world (e.g., all individuals in the US census), the term "population" is often treated abstractly, as a potentially infinitely large collection.
Since the population distribution is unknown, we cannot solve the population
moment conditions to estimate the parameters.
In practice, we often have a finite set of data points from the population, so we can replace the population distribution in the moment conditions with the empirical distribution of the data, which generates the sample version of the moment conditions.
This is called the analog method.
History of the Analog Method
Charles F. Manski (1948-), Northwestern
(The Sample Version of) the MoM Estimator
Suppose the true distribution of X satisfies
$$E[m(X|\theta_0)] = 0 \quad \text{or} \quad \int m(x|\theta_0)\,dF(x) = 0,$$
where m : Θ ⊆ R^k → R^k, and F(·) is the true cdf of X.
The essence of the MoM estimator is to substitute the true distribution F(·) by the empirical distribution F̂n(x) = (1/n) ∑_{i=1}^{n} 1(Xi ≤ x):
$$\int m(x|\theta)\,d\widehat{F}_n(x) = 0,$$
which is equivalent to
$$\frac{1}{n}\sum_{i=1}^{n} m(X_i|\theta) = 0. \qquad (6)$$
The MoM estimator θ̂(X1, …, Xn) is the solution to (6).
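A minimal sketch of solving (6) numerically (not from the slides): assume a Gamma(k, s) model with moment functions m(x|θ) = (x − ks, x² − k(k+1)s²) and solve the two sample moment conditions with a generic root-finder. The model and parameter values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=1.5, size=5_000)    # Gamma data, true theta0 = (k, s) = (2, 1.5)

def m_bar(theta):
    """Sample version (6): averages of m(X_i|theta) built from the first two Gamma moments."""
    k, s = theta
    return [np.mean(x) - k * s,                    # E[X]   = k*s
            np.mean(x**2) - k * (k + 1) * s**2]    # E[X^2] = k*(k+1)*s^2

k_hat, s_hat = fsolve(m_bar, x0=[1.0, 1.0])        # solve the two sample moment conditions
print(k_hat, s_hat)                                # roughly (2.0, 1.5)
```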
(The Sample Version of) the MLE
Similarly, the MLE can be constructed as the maximizer of the average log-likelihood function
$$\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ln f(X_i|\theta),$$
which is equivalent to the maximizer of the log-likelihood function
$$L_n(\theta) = \sum_{i=1}^{n} \ln f(X_i|\theta)$$
or the likelihood function
$$\mathcal{L}_n(\theta) = \exp\{L_n(\theta)\} = \prod_{i=1}^{n} f(X_i|\theta).$$
If f(x|θ) is smooth in θ, the FOCs for the MLE are
$$\frac{1}{n}\sum_{i=1}^{n} s(X_i|\theta) = 0,$$
where s(·|θ) = ∂ ln f(·|θ)/∂θ is called the score function.² So the MLE is a special MoM estimator in this case.
² More often, ∑_{i=1}^{n} s(Xi|θ) is called the score function.
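A minimal numerical sketch (not from the slides): for an assumed Poisson(λ) model, minimize the negative average log-likelihood with a generic optimizer and verify that the average score is approximately zero at the maximizer, so the MLE here coincides with the MoM estimator based on E[X] = λ. The model, true rate, and sample size are assumptions for illustration.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.0, size=2_000)          # Poisson data with an assumed true rate of 3.0

def neg_avg_loglik(lam):
    """Minus the average log-likelihood l_n(lambda) for a Poisson(lambda) model."""
    return -np.mean(x * np.log(lam) - lam - gammaln(x + 1))

lam_hat = minimize_scalar(neg_avg_loglik, bounds=(0.1, 10.0), method="bounded").x

# FOC check: the average score (1/n) sum_i s(X_i|lambda) = mean(x)/lambda - 1 is ~0 at lam_hat,
# so the MLE coincides with the MoM estimator based on E[X] = lambda (lam_hat ~ mean(x)).
print(lam_hat, np.mean(x) / lam_hat - 1.0)
```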
Hypothesis Testing
Hypotheses: The Null and Alternative
Unlike in an estimation problem, where nothing is known about the true parameter, in hypothesis testing some restrictions on the true parameter are assessed. In other words, there is already a target to attack. Nevertheless, hypothesis testing and estimation are closely related since some test statistics are based on estimators.
The null hypothesis, written as H0, is often a point hypothesis θ = θ0.
The complement of the null hypothesis is called the alternative hypothesis. So the alternative hypothesis, written as H1, is θ ≠ θ0.
More generally, we express a null hypothesis as H0 : θ ∈ Θ0 and the alternative hypothesis as H1 : θ ∈ Θ1, where Θ0 is a proper subset of Θ, Θ0 ∩ Θ1 = ∅, and Θ0 ∪ Θ1 = Θ.
For simplicity, we often refer to the hypotheses as "the null" and "the alternative".
Decisions
A hypothesis test either accepts the null, or rejects the null in favor of the
alternative. We can describe these two decisions as “Accept H0 ” and “Reject H0 ”.
Given the two possible states of the world (H0 or H1 ) and the two possible
decisions (Accept H0 or Reject H0 ), there are four possible pairings of states and
decisions:
State of Nature \ Decision    Accept H0            Reject H0
H0 is true                    Correct Decision     Type I Error
H1 is true                    Type II Error        Correct Decision

Table: Hypothesis Testing Decisions
Acceptance Region and Rejection Region
The decision is based on the data, and so is a mapping from the sample space to
the decision set. [Check the MoM estimator]
This splits the sample space into two regions A and R such that if the observed
sample falls into A we accept H0 , while if the sample falls into R we reject H0 .
The set A can be called the acceptance region and the set R the rejection or
critical region.
It is convenient to express this mapping as a real-valued function called a test
statistic
Tn = Tn(X1, …, Xn)
relative to a critical value c.
The hypothesis test then consists of the decision rule
1 Accept H0 if Tn ≤ c,
2 Reject H0 if Tn > c.
A test statistic Tn should be designed so that small values are likely when H0 is
true and large values are likely when H1 is true.
Type I Error and Type II Error
A false rejection of the null hypothesis H0 (rejecting H0 when H0 is true) is called a
Type I error.
The probability of a Type I error is
$$P(\text{Reject } H_0 \mid H_0 \text{ is true}) = P(T_n > c \mid H_0 \text{ is true}). \qquad (7)$$
The size of the test is defined as the supremum of (7) across all data distributions
which satisfy H0 .
(**) For a set A ⊆ R, the supremum or least upper bound of A, denoted as sup A, is the smallest number y such that y ≥ x for every x ∈ A, and the infimum or greatest lower bound of A, denoted as inf A, is the largest number y such that x ≥ y for every x ∈ A.
- Although min A and max A may not exist, inf A and sup A always exist. For example, min{1/n, n = 1, 2, …} does not exist, but inf{1/n, n = 1, 2, …} = 0. Of course, if min A and max A exist, then inf A = min A and sup A = max A. (**)
A primary goal of test construction is to limit the incidence of Type I error by
bounding the size of the test.
A false acceptance of the null hypothesis H0 (accepting H0 when H1 is true) is
called a Type II error.
Power
The rejection probability under the alternative hypothesis is called the power of the
test, and equals 1 minus the probability of a Type II error:
$$\pi_n(\theta) = P(\text{Reject } H_0 \mid H_1 \text{ is true}) = P(T_n > c \mid H_1 \text{ is true}).$$
We call πn(θ) the power function; it is written as a function of θ to indicate its dependence on the true value of the parameter θ under H1.
In the dominant approach to hypothesis testing, the goal of test construction is to
have high power, subject to the constraint that the size of the test is lower than the
pre-specified significance level.
Generally, the power of a test depends on the true value of the parameter θ , and
for a well behaved test the power is increasing both as θ moves away from the null
hypothesis θ 0 and as the sample size n increases.
Trade-off Between Size and Power
Given a test statistic Tn , increasing the critical value c increases the acceptance
region A while decreasing the rejection region R.
This decreases the likelihood of a Type I error (decreases the size) but increases the likelihood of a Type II error (decreases the power). (why?) Thus the choice of c involves a trade-off between size and power.
This is why the significance level of the test cannot be set arbitrarily small.
(Otherwise the test will not have meaningful power.)
It is important to consider the power of a test when interpreting hypothesis tests,
as an overly narrow focus on size can lead to poor decisions.
For example, it is trivial to design a test which has perfect size yet has trivial power.
Specifically, for any hypothesis we can use the following test: Generate a random variable U ~ U[0, 1] and reject H0 if U < α.
This test has exact size of α. Yet the test also has power precisely equal to α.
When the power of a test equals the size, we say that the test has trivial power.
Nothing is learned from such a test.
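A quick simulation (not from the slides) of this randomized test: since the decision ignores the data entirely, the rejection frequency is about α whether H0 or H1 is true, so the power equals the size. The number of replications and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, reps = 0.05, 100_000

# The "test" ignores the data: reject H0 whenever an independent U[0, 1] draw is below alpha.
u = rng.uniform(0.0, 1.0, size=reps)
reject = u < alpha

# The rejection frequency is about 5% regardless of which hypothesis is true,
# so the size and the power are both alpha: trivial power.
print(reject.mean())
```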
Scientific Reasoning of Hypothesis Testing
To determine the critical value c, we need to pre-select a significance level α such that P(Tn > c | H0 is true) = α, yet there is no objective scientific basis for the choice of α.
Nevertheless, the common practice is to set α = 0.05 (5%). Alternative values are
α = 0.10 (10%) and α = 0.01 (1%). These choices are somewhat the by-product
of traditional tables of critical values and statistical software.
The informal reasoning behind the choice of a 5% critical value is to ensure that
Type I errors should be relatively unlikely - that the decision “Reject H0 ” has
scientific strength - yet the test retains power against reasonable alternatives.
The decision “Reject H0 ” means that the evidence is inconsistent with the null
hypothesis, in the sense that it is relatively unlikely (1 in 20) that data generated by
the null hypothesis would yield the observed test result.
In contrast, the decision “Accept H0 ” is not a strong statement. It does not mean
that the evidence supports H0 , only that there is insufficient evidence to reject H0 .
So it is more accurate to use the label “Do not Reject H0 ” instead of “Accept H0 ”.
Statistically Significant
When a test rejects H0 at the 5% significance level, it is common to say that the statistic is statistically significant; if the test accepts H0, it is common to say that the statistic is not statistically significant or that it is statistically insignificant.
It is helpful to remember that this is simply a way of saying “Using the statistic Tn ,
the hypothesis H0 can [cannot] be rejected at the 5% level.”
When the null hypothesis H0 : θ = 0 is rejected it is common to say that the
coefficient θ is statistically significant, because the test has rejected the
hypothesis that the coefficient is equal to zero.
An Example
Suppose we have only one data point z in hand and we know z ~ N(µ, 1). We want to test H0 : µ = 0 against H1 : µ > 0.
A natural test is to reject H0 if z is large. Rigorously, the test is 1(z > c), where 1(·) is the indicator function, which equals 1 if the event in the parentheses is true and 0 otherwise; 1 indicates rejection and 0 acceptance; the test statistic is Tn = z; and the critical value is c.
Set the significance level α = 0.05; then c is chosen such that P(z > c | µ = 0) = E[1(z > c) | µ = 0] = 1 − Φ(c) = 0.05, i.e., c = 1.645. So if z > 1.645, we will reject H0; otherwise, we cannot reject H0.
The power function is π(µ) = P(z > c | µ) = P(z − µ > c − µ) = 1 − Φ(c − µ), which is an increasing function of µ and a decreasing function of c. [Figure here]
It is understandable that π ( µ ) is increasing with µ since when µ is larger, it is
easier to detect µ > 0.
That π(µ) is decreasing in c indicates a trade-off between size and power.
Since the power equals 1 minus the probability of the Type II error, it is equivalent
to study the trade-off between the probabilities of the Type I error and Type II error.
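A minimal sketch (not from the slides) of the computations in this example, using scipy's normal cdf and quantile functions: the critical value solves 1 − Φ(c) = 0.05, and the power function is 1 − Φ(c − µ).

```python
from scipy.stats import norm

alpha = 0.05
c = norm.ppf(1 - alpha)                      # critical value: 1 - Phi(c) = 0.05  =>  c ~ 1.645

def power(mu, c=c):
    """Power function pi(mu) = P(z > c | mu) = 1 - Phi(c - mu) for one observation z ~ N(mu, 1)."""
    return 1 - norm.cdf(c - mu)

print(c)                                     # ~ 1.645
print(power(0.0), power(1.0), power(3.0))    # 0.05 at the null, increasing in mu
```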
Figure: Trade-Off Between the Type I Error and Type II Error
continue...
The acceptance region is z ≤ c and the critical region is z > c, which are quite trivial.
To illustrate these regions in a more complicated example, suppose two data
points y1 and y2 are observed and they follow N ( µ, 2).
We want to test H0 : µ = 0 against H1 : µ 6= 0.
A natural test is to reject H0 if the absolute value of ȳ = (y1 + y2)/2 is large.
Given that ȳ follows N(µ, 1), the 5% critical value is 1.96 since P(|ȳ| > 1.96 | µ = 0) = 0.05.
The acceptance region is
{(y1, y2) : |ȳ| ≤ 1.96}
or
{(y1, y2) : −3.92 − y1 ≤ y2 ≤ 3.92 − y1}.
[Figure here]
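A tiny check (not from the slides) that the two descriptions of the acceptance region agree, using a couple of hypothetical data points:

```python
def in_acceptance_region(y1, y2, c=1.96):
    """Check |ybar| <= c and the equivalent band -2c - y1 <= y2 <= 2c - y1, where ybar = (y1 + y2)/2."""
    ybar = (y1 + y2) / 2
    band = (-2 * c - y1 <= y2 <= 2 * c - y1)
    return abs(ybar) <= c, band

print(in_acceptance_region(1.0, 2.5))   # (True, True): ybar = 1.75 <= 1.96, accept H0
print(in_acceptance_region(3.0, 2.0))   # (False, False): ybar = 2.5 > 1.96, reject H0
```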
Figure: Acceptance Region Based on (y1, y2) and ȳ
Summary
A hypothesis test includes the following steps:
1 specify the null and the alternative.
2 construct the test statistic.
3 derive the distribution of the test statistic under the null.
4 determine the decision rule (acceptance and rejection regions) by specifying a level of significance.
5 study the power of the test.
Steps 2, 3 and 5 are key, and steps 1 and 4 are usually trivial.
Of course, in some cases, how to specify the null and the alternative is also subtle,
and in some cases, the critical value is not easy to determine if the asymptotic
distribution is complicated.