Introduction to Statistical Inference
Ping Yu
Department of Economics, University of Hong Kong

Outline
1 Point Estimation
2 Hypothesis Testing

The Objective of Statistics

The objective of statistics is to infer (characteristics of) the underlying probability law from observed data, and then use the obtained knowledge to explain what has happened (i.e., internal validity) and to predict what will happen (i.e., external validity).

Internal validity concerns three problems:
- What is a plausible value for the parameter? (point estimation)
- What is a plausible set of values for the parameter? (set/interval estimation)
- Is some preconceived notion or economic theory about the parameter "consistent" with the data? (hypothesis testing)

In other words, the objectives of statistics are estimation, inference (including hypothesis testing and confidence interval (CI) construction), and prediction.

Point Estimation

There are two econometric traditions: the frequentist approach and the Bayesian approach.
- The former treats the parameter as fixed (i.e., there is only one true value) and the samples as random.
- The latter treats the parameter as random and the samples as fixed.

This course concentrates on the frequentist approach. The two main methods in the frequentist approach are the likelihood method and the method of moments (MoM).

The Maximum Likelihood Estimator

The MLE was popularized by R.A. Fisher (1890-1962). The basic idea of the MLE is to guess the parameter value under which the observed phenomenon is most likely to have been generated (practical examples here). Mathematically,

θ_MLE = argmax_{θ∈Θ} E[ln f(X|θ)] = argmax_{θ∈Θ} ∫ f(x) ln f(x|θ) dx = argmax_{θ∈Θ} ∫ ln f(x|θ) dF(x),   (1)

where X is a random vector, f(x) is the true pdf or pmf, f(x|θ) is the specified parameterized pdf or pmf, Θ is the parameter space, and F(x) is the true cdf.

History of the MLE

Ronald A. Fisher (1890-1962), UCL

Ronald A. Fisher is one iconic founder of modern statistical theory. The name of the F-distribution was coined by G.W. Snedecor in honor of R.A. Fisher. The p-value is also credited to him.

The MoM Estimator

The MoM estimator was introduced by Karl Pearson (1857-1936). The original problem is to estimate k unknown parameters, say θ = (θ_1, ..., θ_k), in f(x). But we are not fully sure about the functional form of f(x). Nevertheless, we know the functional form of the moments of X ∈ R as functions of θ:

E[X] = g_1(θ),
E[X^2] = g_2(θ),
...
E[X^k] = g_k(θ).   (2)

There are k equations with k unknowns, so we can solve for θ uniquely in principle.
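As a concrete illustration of (2) (the N(µ, σ²) case below is an added example, not taken from the slides), take k = 2 and X ~ N(µ, σ²), so that g_1(θ) = µ and g_2(θ) = µ² + σ². Inverting the two equations gives µ = E[X] and σ² = E[X²] − E[X]². The minimal sketch below checks this numerically by replacing the population moments with sample averages, anticipating the analog method discussed later.

```python
import numpy as np

# Hypothetical illustration: X ~ N(mu, sigma^2), theta = (mu, sigma^2).
# Population moment equations (2): E[X] = mu, E[X^2] = mu^2 + sigma^2.
rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5
x = rng.normal(mu_true, sigma_true, size=10_000)   # stand-in for the population

# Replace population moments by sample moments and invert the two equations.
m1 = x.mean()             # estimate of E[X]
m2 = (x**2).mean()        # estimate of E[X^2]
mu_mom = m1
sigma2_mom = m2 - m1**2   # sigma^2 = E[X^2] - E[X]^2

print(mu_mom, sigma2_mom)  # close to (2.0, 2.25)
```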
History of the MoM

Karl Pearson (1857-1936), UCL

Karl Pearson is also the inventor of the correlation coefficient in Chapter 5, so the correlation coefficient is also called the Pearson correlation coefficient.

Efficiency and Robustness

The MoM estimator uses only the moment information in X, while the MLE uses "all" information in X, so the MLE is more efficient than the MoM estimator. However, the MoM estimator is more robust than the MLE, since it does not rely on the correctness of the full distribution but only on the correctness of the moment functions. Efficiency and robustness are a common trade-off among econometric methods.

A Microeconomic Example of the MoM Estimator

Moment conditions often originate from the FOCs of an optimization problem. Suppose firms maximize their profits conditional on the information in hand; then the problem for firm i is

max_{d_i} E_{ν|z}[π(d_i, z_i, ν_i; θ)].   (3)

π is the profit function, e.g., π(d_i, z_i, ν_i; θ) = p_i f(L_i, ν_i; θ) − w_i L_i, where z_i = (p_i, w_i) is all the information used in the decision and can be observed by both the firm and the econometrician, p_i is the output price and w_i is the wage, ν_i is the exogenous random error (e.g., weather, financial crisis, etc.) and cannot be observed or controlled by either the firm or the econometrician, and d_i = L_i is the decision of labor input. θ is the technology parameter, e.g., if f(L_i, ν_i; θ) = L_i^φ exp(ν_i), then θ = φ; it is known to the firm but unknown to the econometrician. Our goal is to estimate θ, which is relevant for measuring the causal effect of labor input on profit.

continue...

The FOCs of (3) are

E_{ν|z}[∂π(d_i, z_i, ν_i; θ)/∂d_i] ≡ m(d_i, z_i|θ) = 0.

When there is randomness even in z_i,[1] the objective function changes to

max_{d_i} E[π(d_i, z_i, ν_i; θ)],

and the FOCs change to

E[m(d_i, z_i|θ)] = 0,   (4)

which are a special set of moment conditions.

[1] The difference between z_i and ν_i is that z_i can be observed ex post while ν_i cannot. That z_i is random means that the decision is made before z_i is revealed, i.e., the decision is made ex ante.

A Macroeconomic Example of the MoM Estimator (*)

max_{{c_t}_{t=1}^∞} Σ_{t=1}^∞ ρ^t E_0[u(c_t)]
s.t. c_{t+1} + k_{t+1} = k_t R_{t+1}, k_0 known,

where ρ is the discount factor, E_0[u(·)] is the conditional expected utility based on the information at t = 0, k_t is the capital accumulation in period t, c_t is the consumption at t, and R_t is the gross return rate at t. From dynamic programming, we have the Euler equation

E_0[ρ u'(c_{t+1})/u'(c_t) R_{t+1}] = 1.

If u(c) = (c^{1−α} − 1)/(1 − α), α > 0, then we get

E_0[ρ (c_t/c_{t+1})^α R_{t+1}] = 1.   (5)

Suppose ρ is known while α is unknown; then (5) is a moment condition for α.

Population Version vs Sample Version of Moment Conditions

Equations (2), (4) and (5) are the population version of moment conditions. Although some econometricians treat "population" as a physical population (e.g., all individuals in the US census) in the real world, the term "population" is often treated abstractly, and is potentially infinitely large. Since the population distribution is unknown, we cannot solve the population moment conditions to estimate the parameters. In practice, we have a finite set of data points from the population, so we can substitute the empirical distribution of the data for the population distribution in the moment conditions, which generates the sample version of the moment conditions. This is called the analog method.

History of the Analog Method

Charles F. Manski (1948-), Northwestern
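To make the analog method concrete for the Euler-equation condition (5), the sketch below replaces E_0[·] with a sample average and solves (1/n) Σ ρ (c_t/c_{t+1})^α R_{t+1} − 1 = 0 for α by a root search. The consumption-growth and return series are simulated here purely for illustration, and the search bracket for α and ρ = 0.96 are assumptions; with real data one would plug in the observed series.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical data: consumption growth c_{t+1}/c_t and gross returns R_{t+1}
# are simulated for illustration only (an assumption, not from the slides).
rng = np.random.default_rng(1)
n = 5_000
rho = 0.96                                   # discount factor, assumed known
growth = np.exp(rng.normal(0.02, 0.05, n))   # c_{t+1}/c_t
R = np.exp(rng.normal(0.06, 0.15, n))        # R_{t+1}

def sample_moment(alpha):
    """Sample analog of (5): (1/n) * sum of rho * (c_t/c_{t+1})^alpha * R_{t+1}, minus 1."""
    return np.mean(rho * growth ** (-alpha) * R) - 1.0

# MoM estimate of alpha: the root of the sample moment condition, searched over an
# assumed bracket [0.1, 5]. With simulated data there is no built-in "true" alpha;
# the estimate is whatever alpha reconciles these data with (5) on average.
alpha_hat = brentq(sample_moment, 0.1, 5.0)
print(alpha_hat)
```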
(The Sample Version of) the MoM Estimator

Suppose the true distribution of X satisfies

E[m(X|θ_0)] = 0   or   ∫ m(x|θ_0) dF(x) = 0,

where m: R^k × Θ → R^k and F(·) is the true cdf of X. The essence of the MoM estimator is to substitute for the true distribution F(·) the empirical distribution F̂_n(x) = (1/n) Σ_{i=1}^n 1(X_i ≤ x):

∫ m(x|θ) dF̂_n(x) = 0,

which is equivalent to

(1/n) Σ_{i=1}^n m(X_i|θ) = 0.   (6)

The MoM estimator θ̂(X_1, ..., X_n) is the solution to (6).

(The Sample Version of) the MLE

Similarly, the MLE can be constructed as the maximizer of the average log-likelihood function

ℓ_n(θ) = (1/n) Σ_{i=1}^n ln f(X_i|θ),

which is equivalent to the maximizer of the log-likelihood function

L_n(θ) = Σ_{i=1}^n ln f(X_i|θ)

or of the likelihood function

exp{L_n(θ)} = Π_{i=1}^n f(X_i|θ).

If f(x|θ) is smooth in θ, the FOCs for the MLE are

(1/n) Σ_{i=1}^n s(X_i|θ) = 0,

where s(·|θ) = ∂ ln f(·|θ)/∂θ is called the score function.[2] So the MLE is a special MoM estimator in this case.

[2] More often, Σ_{i=1}^n s(X_i|θ) is called the score function.
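For a concrete case (the exponential model below is an added illustration, not from the slides), take f(x|λ) = λ exp(−λx). Then s(x|λ) = 1/λ − x, so the sample score equation gives λ̂ = 1/X̄ in closed form, and a numerical maximizer of ℓ_n recovers the same value:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical model: X ~ Exponential(lambda), f(x|lambda) = lambda * exp(-lambda * x).
rng = np.random.default_rng(2)
lam_true = 0.5
x = rng.exponential(scale=1 / lam_true, size=2_000)

def avg_loglik(lam):
    """Average log-likelihood l_n(lambda) = (1/n) * sum of [ln(lambda) - lambda * X_i]."""
    return np.mean(np.log(lam) - lam * x)

# MLE: maximize l_n (i.e., minimize its negative) over an assumed bracket.
res = minimize_scalar(lambda lam: -avg_loglik(lam), bounds=(1e-6, 10.0), method="bounded")
lam_mle = res.x

# Score s(x|lambda) = 1/lambda - x, so the FOC (1/n) * sum of s(X_i|lambda) = 0
# is a sample moment condition with closed-form solution 1 / sample mean.
lam_score = 1.0 / x.mean()

print(lam_mle, lam_score)   # both close to 0.5: the MLE is a special MoM estimator here
```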
Hypothesis Testing

Hypotheses: The Null and Alternative

Different from an estimation problem, where nothing is known about the true parameter, in hypothesis testing some restrictions on the true parameter are assessed. In other words, there is already a target to attack. Nevertheless, hypothesis testing and estimation are closely related, since some test statistics are based on estimators.

The null hypothesis, written H_0, is often a point hypothesis θ = θ_0. The complement of the null hypothesis is called the alternative hypothesis, so the alternative hypothesis, written H_1, is θ ≠ θ_0. More generally, we express a null hypothesis as H_0: θ ∈ Θ_0 and the alternative hypothesis as H_1: θ ∈ Θ_1, where Θ_0 is a proper subset of Θ, Θ_0 ∩ Θ_1 = ∅, and Θ_0 ∪ Θ_1 = Θ. For simplicity, we often refer to the hypotheses as "the null" and "the alternative".

Decisions

A hypothesis test either accepts the null, or rejects the null in favor of the alternative. We can describe these two decisions as "Accept H_0" and "Reject H_0". Given the two possible states of the world (H_0 or H_1) and the two possible decisions (Accept H_0 or Reject H_0), there are four possible pairings of states and decisions:

State of Nature \ Decision    Accept H_0          Reject H_0
H_0 is true                   Correct Decision    Type I Error
H_1 is true                   Type II Error       Correct Decision

Table: Hypothesis Testing Decisions

Acceptance Region and Rejection Region

The decision is based on the data, and so is a mapping from the sample space to the decision set. [Check the MoM estimator] This splits the sample space into two regions A and R such that if the observed sample falls into A we accept H_0, while if the sample falls into R we reject H_0. The set A is called the acceptance region and the set R the rejection or critical region.

It is convenient to express this mapping as a real-valued function called a test statistic, T_n = T_n(X_1, ..., X_n), relative to a critical value c. The hypothesis test then consists of the decision rule:
1 Accept H_0 if T_n ≤ c,
2 Reject H_0 if T_n > c.

A test statistic T_n should be designed so that small values are likely when H_0 is true and large values are likely when H_1 is true.

Type I Error and Type II Error

A false rejection of the null hypothesis H_0 (rejecting H_0 when H_0 is true) is called a Type I error. The probability of a Type I error is

P(Reject H_0 | H_0 is true) = P(T_n > c | H_0 is true).   (7)

The size of the test is defined as the supremum of (7) across all data distributions which satisfy H_0.

(**) For a set A ⊂ R, the supremum or least upper bound of A, denoted sup A, is the smallest number y such that y ≥ x for every x ∈ A, and the infimum or greatest lower bound of A, denoted inf A, is the largest number y such that x ≥ y for every x ∈ A. Although min A and max A may not exist, inf A and sup A always exist. For example, min{1/n : n = 1, 2, ...} does not exist, but inf{1/n : n = 1, 2, ...} = 0. Of course, if min A and max A exist, then inf A = min A and sup A = max A. (**)

A primary goal of test construction is to limit the incidence of Type I errors by bounding the size of the test. A false acceptance of the null hypothesis H_0 (accepting H_0 when H_1 is true) is called a Type II error.

Power

The rejection probability under the alternative hypothesis is called the power of the test, and equals 1 minus the probability of a Type II error:

π_n(θ) = P(Reject H_0 | H_1 is true) = P(T_n > c | H_1 is true).

We call π_n(θ) the power function; it is written as a function of θ to indicate its dependence on the true value of the parameter θ under H_1. In the dominant approach to hypothesis testing, the goal of test construction is to have high power, subject to the constraint that the size of the test is no larger than the pre-specified significance level. Generally, the power of a test depends on the true value of the parameter θ, and for a well-behaved test the power increases both as θ moves away from the null hypothesis θ_0 and as the sample size n increases.

Trade-off Between Size and Power

Given a test statistic T_n, increasing the critical value c enlarges the acceptance region A while shrinking the rejection region R. This decreases the likelihood of a Type I error (decreases the size) but increases the likelihood of a Type II error (decreases the power). (why?) Thus the choice of c involves a trade-off between size and power. This is why the significance level of the test cannot be set arbitrarily small. (Otherwise the test will not have meaningful power.)

It is important to consider the power of a test when interpreting hypothesis tests, as an overly narrow focus on size can lead to poor decisions. For example, it is trivial to design a test which has perfect size yet has trivial power. Specifically, for any hypothesis we can use the following test: generate a random variable U ~ U[0, 1] and reject H_0 if U < α. This test has exact size α. Yet the test also has power precisely equal to α. When the power of a test equals the size, we say that the test has trivial power. Nothing is learned from such a test.
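A quick numerical check of these two points (added code; the N(0,1)/N(1,1) setting in the second part is an assumed illustration, not from the slides): the first part verifies by simulation that the U < α test rejects with probability α no matter what the data say, and the second part shows how raising c lowers both the size and the power of a normally distributed test statistic.

```python
import numpy as np
from scipy.stats import norm

# 1) The "trivial power" test from the slide: reject H0 whenever U < alpha,
#    where U ~ U[0,1] is independent of the data, so the data never enter the decision.
rng = np.random.default_rng(3)
alpha = 0.05
u = rng.uniform(size=100_000)
print((u < alpha).mean())            # about 0.05 whether H0 or H1 is true: size = power = alpha

# 2) Size-power trade-off as c varies (assumed setting: Tn ~ N(0,1) under H0
#    and Tn ~ N(1,1) under one particular alternative).
for c in [1.0, 1.645, 1.96, 2.576]:
    size = 1 - norm.cdf(c)           # P(Tn > c | H0)
    power = 1 - norm.cdf(c - 1.0)    # P(Tn > c | H1)
    print(c, round(size, 4), round(power, 4))
# Raising c lowers the size but also lowers the power: the trade-off described above.
```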
Scientific Reasoning of Hypothesis Testing

To determine the critical value c, we need to pre-select a significance level α such that

P(T_n > c | H_0 is true) = α,

yet there is no objective scientific basis for the choice of α. Nevertheless, the common practice is to set α = 0.05 (5%). Alternative values are α = 0.10 (10%) and α = 0.01 (1%). These choices are somewhat the by-product of traditional tables of critical values and statistical software.

The informal reasoning behind the choice of a 5% critical value is to ensure that Type I errors are relatively unlikely, so that the decision "Reject H_0" has scientific strength, while the test still retains power against reasonable alternatives. The decision "Reject H_0" means that the evidence is inconsistent with the null hypothesis, in the sense that it is relatively unlikely (1 in 20) that data generated under the null hypothesis would yield the observed test result.

In contrast, the decision "Accept H_0" is not a strong statement. It does not mean that the evidence supports H_0, only that there is insufficient evidence to reject H_0. So it is more accurate to use the label "Do not Reject H_0" instead of "Accept H_0".

Statistically Significant

When a test rejects H_0 at the 5% significance level, it is common to say that the statistic is statistically significant; when the test accepts H_0, it is common to say that the statistic is not statistically significant or that it is statistically insignificant. It is helpful to remember that this is simply a way of saying "Using the statistic T_n, the hypothesis H_0 can [cannot] be rejected at the 5% level." When the null hypothesis H_0: θ = 0 is rejected, it is common to say that the coefficient θ is statistically significant, because the test has rejected the hypothesis that the coefficient is equal to zero.

An Example

Suppose we have only one data point z in hand and we know z ~ N(µ, 1). We want to test H_0: µ = 0 against H_1: µ > 0.

A natural test is to reject H_0 if z is large. Rigorously, the test is 1(z > c), where 1(·) is the indicator function which equals 1 if the event in the parentheses is true and 0 otherwise, 1 indicates rejection and 0 acceptance, the test statistic is T_n = z, and the critical value is c. Set the significance level α = 0.05; then c is chosen such that

P(z > c | µ = 0) = E[1(z > c) | µ = 0] = 1 − Φ(c) = 0.05,

i.e., c = 1.645. So if z > 1.645, we reject H_0; otherwise, we cannot reject H_0.

The power function is

π(µ) = P(z > c | µ) = P(z − µ > c − µ) = 1 − Φ(c − µ),

which is increasing in µ and decreasing in c. [Figure here] It is understandable that π(µ) is increasing in µ: when µ is larger, it is easier to detect µ > 0. That π(µ) is decreasing in c indicates the trade-off between size and power. Since the power equals 1 minus the probability of a Type II error, this is equivalent to the trade-off between the probabilities of the Type I error and the Type II error.

Figure: Trade-Off Between the Type I Error and the Type II Error

continue...

The acceptance region is z ≤ c and the critical region is z > c, which are quite trivial. To illustrate these regions in a more complicated example, suppose two data points y_1 and y_2 are observed and they follow N(µ, 2). We want to test H_0: µ = 0 against H_1: µ ≠ 0. A natural test is to reject H_0 if the absolute value of ȳ = (y_1 + y_2)/2 is large. Given that ȳ follows N(µ, 1), the 5% critical value is 1.96 since P(|ȳ| > 1.96 | µ = 0) = 0.05. The acceptance region is {(y_1, y_2) : |ȳ| ≤ 1.96}, or equivalently {(y_1, y_2) : −3.92 − y_1 ≤ y_2 ≤ 3.92 − y_1}.
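The numbers in the two examples above can be reproduced directly. The short sketch below (added code; it only uses the normal cdf Φ and its inverse) computes the one-sided critical value 1.645 and the power function 1 − Φ(c − µ) for the single-observation test, and the two-sided critical value 1.96 for ȳ.

```python
from scipy.stats import norm

alpha = 0.05

# One observation z ~ N(mu, 1), H0: mu = 0 vs H1: mu > 0.
# The critical value solves 1 - Phi(c) = alpha.
c_one_sided = norm.ppf(1 - alpha)                 # 1.645

def power(mu):
    """Power function pi(mu) = 1 - Phi(c - mu)."""
    return 1 - norm.cdf(c_one_sided - mu)

print(c_one_sided, power(0.0), power(1.0), power(3.0))
# power(0) = 0.05 is the size; the power rises toward 1 as mu moves away from 0.

# Two observations y1, y2 ~ N(mu, 2), so ybar ~ N(mu, 1), H0: mu = 0 vs H1: mu != 0.
# The two-sided critical value solves P(|ybar| > c | mu = 0) = alpha.
c_two_sided = norm.ppf(1 - alpha / 2)             # 1.96
print(c_two_sided)
```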
Figure: Acceptance Region Based on (y_1, y_2) and ȳ

Summary

A hypothesis test includes the following steps.
1 Specify the null and the alternative.
2 Construct the test statistic.
3 Derive the distribution of the test statistic under the null.
4 Determine the decision rule (acceptance and rejection regions) by specifying a significance level.
5 Study the power of the test.

Steps 2, 3 and 5 are key, while steps 1 and 4 are usually trivial. Of course, in some cases how to specify the null and the alternative is also subtle, and in some cases the critical value is not easy to determine if the asymptotic distribution is complicated. A numerical sketch of these steps, for the two-observation example above, is given below.
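As a closing illustration (added code; the data drawn under H_0 and the alternative µ = 2 used to evaluate power are assumptions, not from the slides), the sketch below walks through the five steps for the two-observation example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
alpha = 0.05

# Step 1: H0: mu = 0 vs H1: mu != 0, with y1, y2 ~ N(mu, 2).
# Step 2: test statistic Tn = |ybar|, where ybar = (y1 + y2) / 2.
# Step 3: under H0, ybar ~ N(0, 1), so P(|ybar| > c | H0) = 2 * (1 - Phi(c)).
# Step 4: decision rule: reject H0 if |ybar| > c, with c chosen so the size is alpha.
c = norm.ppf(1 - alpha / 2)                        # 1.96

y = rng.normal(loc=0.0, scale=np.sqrt(2), size=2)  # hypothetical data drawn under H0
ybar = y.mean()
print("reject H0" if abs(ybar) > c else "do not reject H0")

# Step 5: power at a particular alternative,
#         pi(mu) = P(|ybar| > c | mu) = 1 - Phi(c - mu) + Phi(-c - mu), here at mu = 2.
mu_alt = 2.0
power = 1 - norm.cdf(c - mu_alt) + norm.cdf(-c - mu_alt)
print(power)                                       # about 0.52
```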