Statistics (Quantitative Techniques II)

Ali C. Tasiran
School of Economics, Mathematics and Statistics
Birkbeck College, Malet Street, London WC1E 7HX, UK
Weeks 2 to 4, September 2007

Contents

1 Introduction
  1.1 Course Aims
  1.2 Course Objectives
  1.3 Outline of Topics
  1.4 Teaching arrangements and Assessment
  1.5 Textbooks
  1.6 Some preliminaries
  Problems

2 Probability
  2.1 Probability definitions and concepts
    2.1.1 Classical definition of probability
    2.1.2 Frequency definition of probability
    2.1.3 Subjective definition of probability
    2.1.4 Axiomatic definition of probability
  Problems

3 Random variables and probability distributions
  3.1 Random variables, densities, and cumulative distribution functions
    3.1.1 Discrete Distributions
    3.1.2 Continuous Distributions
    3.1.3 Example
  3.2 Problems

4 Expectations and moments
  4.1 Mathematical Expectation and Moments
    4.1.1 Mathematical Expectation
    4.1.2 Moments
  Problems

5 Some univariate distributions
  5.1 Discrete Distributions
    5.1.1 The Bernoulli Distribution
    5.1.2 The Binomial Distribution
    5.1.3 Example
    5.1.4 Simple Random Walk
    5.1.5 Geometric Distribution
    5.1.6 Hypergeometric Distribution
    5.1.7 Negative Binomial Distribution
    5.1.8 Poisson Distribution
  5.2 Continuous Distributions
    5.2.1 Uniform Distribution on an Interval
    5.2.2 Beta Distribution
    5.2.3 Cauchy Distribution
    5.2.4 Chi-Square Distribution
    5.2.5 The Exponential Distribution
    5.2.6 Extreme Value Distribution (Gompertz Distribution)
    5.2.7 F Distribution
    5.2.8 Gamma Distribution
    5.2.9 Geometric Distribution
    5.2.10 Logistic Distribution
    5.2.11 Lognormal Distribution
    5.2.12 The Normal Distribution
    5.2.13 Pareto Distribution
    5.2.14 Student's t Distribution
    5.2.15 Weibull Distribution
  Problems

6 Multivariate distributions
  6.1 Bivariate Distributions
    6.1.1 The Bivariate Normal Distribution
    6.1.2 Mixture Distributions
  6.2 Multivariate Density Functions
    6.2.1 The Multivariate Normal Distribution
    6.2.2 Standard multivariate normal density
    6.2.3 Marginal and Conditional Distributions of N(μ, Σ)
    6.2.4 The Chi-Square Distribution
  Problems

7 Sampling, sample moments, sampling distributions, and simulation
  7.1 Independent, Dependent, and Random Samples
  7.2 Sample Statistics
  7.3 Sampling Distributions
  Problems

8 Large sample theory
  8.1 Different Types of Convergence
  8.2 The Weak Law of Large Numbers
  8.3 The Strong Law of Large Numbers
  8.4 The Central Limit Theorem
  Problems

9 Estimation and properties of estimators
  9.1 Point Estimation
    9.1.1 Small Sample Criteria for Estimators
    9.1.2 Large Sample Properties of Estimators
  9.2 Interval Estimation
    9.2.1 Pivotal-quantity method of finding CI
    9.2.2 CI for the mean of a normal population
    9.2.3 CI for the variance of a normal population
  9.3 Problems

10 Tests of statistical hypotheses
  10.1 Basic Concepts in Hypothesis Testing
    10.1.1 Null and Alternative Hypotheses
    10.1.2 Simple and Composite Hypotheses
    10.1.3 Statistical Test
    10.1.4 Type I and Type II Errors
    10.1.5 Power of a Test
    10.1.6 Operating Characteristics
    10.1.7 Level of Significance and the Size of a Test
  Problems

11 Examination 1
  11.1 Definition Questions
  11.2 Calculation questions
  11.3 Discussion questions
  11.4 Multiple choice questions

12 Examination 2
  12.1 Definition Questions
  12.2 Calculation questions
  12.3 Discussion questions
  12.4 Multiple choice questions

Chapter 1 Introduction

1.1 Course Aims

This is a refresher course in mathematical statistics, and it also includes some new modules for the forthcoming econometric studies.

1.2 Course Objectives

The course is an introduction to mathematical statistics and covers the foundations of the theory of probability and statistical inference. It is intended to provide the necessary statistical background for the Econometrics courses. It begins with basic facts about random variables and their distributions, and then provides an introduction to statistical inference. These tools are arranged with a view to applying them to econometric methodology. Thus, the emphasis is on probability and distribution theories, together with estimation and hypothesis testing involving several parameters.

1.3 Outline of Topics

There are two main parts in the course.

Probability and Distribution Theories
1. Probability
2. Random variables and probability distributions
3. Expectations and moments
4. Some univariate distributions
5. Multivariate distributions

Statistical Inference
7. Sampling, sample moments and sampling distributions
8. Large sample theory
9. Estimation and properties of estimators
10.
Tests of statistical hypotheses

1.4 Teaching arrangements and Assessment

There will be two lectures and one class for problem solutions each week over three weeks. Performance in this course is assessed through a written examination. You are required to pass this examination to continue on the MSc programme. No resits are held.

1.5 Textbooks

Lecture notes are provided. However, these are not a substitute for a textbook. I do not recommend any particular text, but in the past students have found the following useful.

• Greene, W.H., (2004) Econometric Analysis, 5th edition, Prentice-Hall. A good summary of much of the material can be found in the Appendix.

• Hogg, R.V. and Craig, A.T., (1995) Introduction to Mathematical Statistics, 5th edition, Prentice Hall. A popular textbook, even though it is slightly dated.

• Mittelhammer, R.C., (1999) Mathematical Statistics for Economics and Business, Springer Verlag. A good mathematical statistics textbook for economists, especially useful for further econometric studies.

• Mood, A.M., Graybill, F.A., and Boes, D.C., (1974) Introduction to the Theory of Statistics, 3rd edition, McGraw-Hill.

• Spanos, A., (1999) Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press.

• Wackerly, D., Mendenhall, W., and Scheaffer, R., (1996) Mathematical Statistics with Applications, 5th edition, Duxbury Press.

Those who plan to take the forthcoming courses in Econometrics may buy the book by Greene (2004).

Ali Tasiran
[email protected]

1.6 Some preliminaries

Statistics is the science of observing data and making inferences about the characteristics of the random mechanism that has generated the data. It is also called the science of uncertainty. In Economics, theoretical models are used to analyze economic behavior.
Economic theoretical models are deterministic functions, but in the real world the relationships are not exact and deterministic but rather uncertain and stochastic. We thus employ distribution functions to make approximations to the actual processes that generate the observed data. The process that generates the data is known as the data generating process (DGP, or Super Population). In Econometrics, to study economic relationships, we estimate statistical models, which are built under the guidance of theoretical economic models and by taking into account the properties of the data generating process.

Using the parameters of estimated statistical models, we make generalisations about the characteristics of the random mechanism that has generated the data. In Econometrics, we use observed data in samples to draw conclusions about populations. Populations are either real, from which the data came, or conceptual, as processes by which the data were generated. The inference in the first case is called design-based (for experimental data) and is used mainly to study samples from populations with known frames. The inference in the second case is called model-based (for observational data) and is used mainly to study stochastic relationships. The statistical theory used for such analyses is called Classical inference and is the one that will be followed in this course. It is based on two premises:

1. The sample data constitute the only relevant information.
2. The construction and assessment of the different procedures for inference are based on long-run behavior under similar circumstances.

The starting point of an investigation is an experiment. An experiment is a random experiment if it satisfies the following conditions:
- all possible distinct outcomes are known ahead of time,
- the outcome of a particular trial is not known a priori,
- the experiment can be duplicated.
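The duplication condition is what gives long-run behavior (premise 2 above) its meaning: repeating an experiment many times lets relative frequencies settle down toward a fixed number. A minimal Python sketch; the fair-die experiment and the trial count are illustrative assumptions, not from the text:

```python
import random

random.seed(0)

def relative_frequency(event, trials):
    """Repeat the random experiment `trials` times and return the
    proportion of trials in which `event` occurred."""
    hits = sum(event(random.randint(1, 6)) for _ in range(trials))
    return hits / trials

# Event A: the die shows an even number. With equally likely outcomes,
# the classical definition gives P(A) = 3/6 = 0.5; the relative
# frequency over many duplicated trials should settle near that value.
p_hat = relative_frequency(lambda outcome: outcome % 2 == 0, 100_000)
print(p_hat)
```

Increasing `trials` shrinks the typical gap between the relative frequency and 0.5, which is exactly the limit used in the frequency definition of probability in the next chapter.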
The totality of all possible outcomes of the experiment is referred to as the sample space (denoted by S), and its distinct individual elements are called the sample points or elementary events. An event is a subset of a sample space: a set of sample points that represents several possible outcomes of an experiment. A sample space with a finite or countably infinite number of sample points (in one-to-one correspondence with the positive integers) is called a discrete space. A continuous space is one with an uncountably infinite number of sample points (that is, it has as many elements as there are real numbers). Events are generally represented by sets, and some important concepts can be explained by using the algebra of sets (known as Boolean Algebra).

Definition 1 The sample space is denoted by S. A = S implies that the events in A must always occur. The empty set is a set with no elements and is denoted by ∅. A = ∅ implies that the events in A do not occur. The set of all elements not in A is called the complement of A and is denoted by Ā. Thus, Ā occurs if and only if A does not occur. The set of all points in either a set A or a set B or both is called the union of the two sets and is denoted by ∪. A ∪ B means that either the event A or the event B or both occur. Note: A ∪ Ā = S. The set of all elements in both A and B is called the intersection of the two sets and is represented by ∩. A ∩ B means that both the events A and B occur simultaneously. A ∩ B = ∅ means that A and B cannot occur together; A and B are then said to be disjoint or mutually exclusive. Note: A ∩ Ā = ∅. A ⊂ B means that A is contained in B, or that A is a subset of B; that is, every element of A is an element of B. In other words, if an event A has occurred, then B must have occurred also.

Sometimes it is useful to divide the elements of a set A into several subsets that are disjoint. Such a division is known as a partition.
If A1 and A2 are such partitions, then A1 ∩ A2 = ∅ and A1 ∪ A2 = A. This can be generalized to n partitions: A = ∪_{i=1}^{n} Ai with Ai ∩ Aj = ∅ for i ≠ j.

Some postulates of Boolean Algebra:

Identity: There exist unique sets ∅ and S such that, for every set A, A ∩ S = A and A ∪ ∅ = A.
Complementation: For each A we can define a unique set Ā such that A ∩ Ā = ∅ and A ∪ Ā = S.
Closure: For every pair of sets A and B, we can define unique sets A ∪ B and A ∩ B.
Commutative: A ∪ B = B ∪ A; A ∩ B = B ∩ A.
Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C). Also (A ∩ B) ∩ C = A ∩ (B ∩ C).
Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). Also A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan's Laws: the complement of A ∪ B equals Ā ∩ B̄, and the complement of A ∩ B equals Ā ∪ B̄.

Problems

1. Let the set S contain the ordered combinations of sexes of two children, S = {FF, FM, MF, MM}. Let A denote the subset of possibilities containing no males, B the subset containing two males, and C the subset containing at least one male. List the elements of A, B, C, A ∩ B, A ∪ B, A ∩ C, A ∪ C, B ∩ C, B ∪ C, and C ∩ B̄.

2. Verify De Morgan's Laws by drawing Venn diagrams: the complement of A ∪ B equals Ā ∩ B̄, and the complement of A ∩ B equals Ā ∪ B̄.

Chapter 2 Probability

2.1 Probability definitions and concepts

2.1.1 Classical definition of probability

If an experiment has n (n < ∞) mutually exclusive and equally likely outcomes, and if nA of these outcomes have an attribute A (that is, the event A occurs in nA possible ways), then the probability of A is nA/n, denoted as P(A) = nA/n.

2.1.2 Frequency definition of probability

Let nA be the number of times the event A occurs in n trials of an experiment. If there exists a real number p such that p = lim_{n→∞}(nA/n), then p is called the probability of A and is denoted by P(A). (Examples are histograms for the frequency distribution of variables.)

2.1.3 Subjective definition of probability

Personal judgments are used to assess the relative likelihood of various outcomes. They are based on our "educated guesses" or intuitions.
"The weather will be rainy with probability 0.6 tomorrow."

2.1.4 Axiomatic definition of probability

The probability of an event A ∈ ℱ is a real number such that 1) P(A) ≥ 0 for every A ∈ ℱ, 2) the probability of the entire sample space S is 1, that is, P(S) = 1, and 3) if A1, A2, ..., An are mutually exclusive events (that is, Ai ∩ Aj = ∅ for all i ≠ j), then P(A1 ∪ A2 ∪ ... ∪ An) = Σ_i P(Ai), and this holds for n = ∞ also. Here ℱ is the set of all subsets (events) of the sample space S. The triple (S, ℱ, P(·)) is referred to as the probability space, and P(·) is a probability measure.

We can derive the following theorems from the axiomatic definition of probability.

Theorem 1 P(Ā) = 1 − P(A).

Theorem 2 P(A) ≤ 1.

Theorem 3 P(∅) = 0.

Theorem 4 If A ⊂ B, then P(A) ≤ P(B).

Theorem 5 P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Definition 2 Let A and B be two events in a probability space (S, ℱ, P(·)) such that P(B) > 0. The conditional probability of A given that B has occurred, denoted by P(A | B), is given by P(A ∩ B)/P(B). (It should be noted that the original probability space (S, ℱ, P(·)) remains unchanged even though we focus our attention on the subspace; this is (S, ℱ, P(· | B)).)

Theorem 6 Bonferroni's Theorem: Let A and B be two events in a sample space S. Then P(A ∩ B) ≥ 1 − P(Ā) − P(B̄).

Theorem 7 Bayes' Theorem: If A and B are two events with positive probabilities, then

P(A | B) = P(B | A) P(A) / P(B)

Law of total probability: Assume that S = A1 ∪ A2 ∪ ... ∪ An, where Ai ∩ Aj = ∅ for i ≠ j. Then for any event B ⊂ S,
P(B) = Σ_{i=1}^{n} P(Ai) P(B | Ai).

Theorem 8 Extended Bayes' Theorem: If A1, A2, ..., An constitute a partition of the sample space, so that Ai ∩ Aj = ∅ for i ≠ j and ∪_i Ai = S, and P(Ai) ≠ 0 for any i, then for a given event B with P(B) > 0,

P(Ai | B) = P(Ai) P(B | Ai) / Σ_i P(Ai) P(B | Ai)

Definition 3 Two events A and B with positive probabilities are said to be statistically independent if and only if P(A | B) = P(A). Equivalently, P(B | A) = P(B) and P(A ∩ B) = P(A)P(B).

The other type of statistical inference is called Bayesian inference, where sample information is combined with prior information. The prior information is expressed in the form of a probability distribution known as the prior distribution. When it is combined with the sample information, a posterior distribution of the parameters is obtained. It can be derived by using Bayes' Theorem. If we substitute Model (the model that generated the observed data) for A and Data (the observed data) for B, then we have

P(Model | Data) = P(Data | Model) P(Model) / P(Data)     (2.1)

where P(Data | Model) is the probability of observing the data given that the Model is true. This is usually called the likelihood (sample information). P(Model) is the probability that the Model is true before observing the data (usually called the prior probability). P(Model | Data) is the probability that the Model is true after observing the data (usually called the posterior probability). P(Data) is the unconditional probability of observing the data (whether the Model is true or not). Hence, the relation can be written

P(Model | Data) ∝ P(Data | Model) P(Model)     (2.2)

That is, the posterior probability is proportional to the likelihood (sample information) times the prior probability. The inverse of an estimator's variance is called the precision. In Classical Inference, we use only the parameters' variances, but in Bayesian Inference we have both sample precision and prior precision.
Also, the precision (the inverse of the variance) of the posterior distribution of a parameter is the sum of the sample precision and the prior precision. For example, the posterior mean will lie between the sample mean and the prior mean, and the posterior variance will be less than both the sample and prior variances. These are the reasons behind the increasing popularity of Bayesian Inference in practical econometric applications.

When we speak in econometrics of models to be estimated or tested, we refer to sets of DGPs in the Classical Inference context. In design-based inference, we restrict our attention to a particular sample size and characterize a DGP by the law of probability that governs the random variables in a sample of that size. In model-based inference, where we refer to a limiting process in which the sample size goes to infinity, it is clear that such a restricted characterization will no longer suffice. When we indulge in asymptotic theory, the DGPs in question must be stochastic processes. A stochastic process is a collection of random variables indexed by some suitable index set. This index set may be finite, in which case we have no more than a vector of random variables, or it may be infinite, with either a discrete or a continuous infinity of elements. In order to define a DGP, we must be able to specify the joint distribution of the set of random variables corresponding to the observations contained in a sample of arbitrarily large size. This is a very strong requirement. In econometrics, or any other empirical discipline for that matter, we deal with finite samples. How then can we, even theoretically, treat infinite samples? We must in some way create a rule that allows one to generalize from finite samples to an infinite stochastic process. Unfortunately, for any observational framework, there is an infinite number of ways in which such a rule can be constructed, and different rules can lead to widely different asymptotic conclusions.
In the process of estimating an econometric model, what we are doing is trying to obtain an estimated characterization of the DGP that actually did generate the data. Let us denote an econometric model that is to be estimated, tested, or both, as M, and a typical DGP belonging to M as μ. The simplest model in econometrics is the linear regression model; one possibility is to write

y = Xβ + u, u ∼ N(0, σ²Iₙ)     (2.3)

where y and u are n-vectors and X is a nonrandom n × k matrix, so that y follows the N(Xβ, σ²Iₙ) distribution. This distribution is unique if the parameters β and σ² are specified. We may therefore say that the DGP is completely characterized by the model parameters: knowledge of the model parameters β and σ² uniquely identifies an element μ of M.

On the other hand, the linear regression model can also be written as

y = Xβ + u, u ∼ IID(0, σ²Iₙ)     (2.4)

with no assumption of normality. Many aspects of the theory of linear regressions are just as applicable: the OLS estimator is unbiased, and its covariance matrix is σ²(X′X)⁻¹. But the distribution of the vector u, and hence also that of y, is now only partially characterized even when β and σ² are known. For example, the errors u could be skewed to the left or to the right, or could have fourth moments larger or smaller than 3σ⁴. Let us call the sets of DGPs associated with these regressions M1 and M2, respectively, M1 being in fact a proper subset of M2. For given β and σ² there is an infinite number of DGPs in M2 (only one of which is in M1) that all correspond to the same β and σ². Thus we must consider these models as different models even though the parameters used in them are the same. In either case, it must be possible to associate a parameter vector in a unique way with any DGP μ in the model M, even if the same parameter vector is associated with many DGPs.
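The distinction between M1 and M2 can be illustrated by simulation: two DGPs sharing the same β and σ², one with normal errors and one with skewed errors, both leave the OLS estimator unbiased. A minimal Python sketch using a one-regressor version of the model; the sample size, parameter values, and error laws are illustrative assumptions, not from the text:

```python
import random
import statistics

random.seed(7)
n, alpha, beta = 200, 1.0, 2.0
x = [random.uniform(0.0, 1.0) for _ in range(n)]  # nonrandom regressor, held fixed across samples
xbar = statistics.fmean(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

def ols_slope(y):
    """OLS slope estimate for the simple regression y = alpha + beta*x + u."""
    ybar = statistics.fmean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

def simulate(error_draw, reps=2000):
    """Average the OLS slope over `reps` samples from one DGP."""
    estimates = []
    for _ in range(reps):
        y = [alpha + beta * xi + error_draw() for xi in x]
        estimates.append(ols_slope(y))
    return statistics.fmean(estimates)

# A DGP in M1: normal errors. A DGP in M2 only: skewed errors with the
# same mean 0 and variance 1 (a centred exponential).
b_normal = simulate(lambda: random.gauss(0.0, 1.0))
b_skewed = simulate(lambda: random.expovariate(1.0) - 1.0)
print(b_normal, b_skewed)
```

Both averages settle near β = 2: unbiasedness needs only the IID(0, σ²) assumption of (2.4), even though the full sampling distribution of the estimator differs between the two DGPs.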
We call the model M together with its associated parameter-defining mapping θ a parametrized model. The main task in our practical work is to build the association between the DGPs of a model and the model parameters. For example, in the Generalized Method of Moments (GMM) context, there are many possible ways of choosing the econometric model, i.e., the underlying set of DGPs. One of the advantages of GMM as an estimation method is that it permits models which consist of a very large number of DGPs. In striking contrast to Maximum Likelihood estimation, where the model must be completely specified, any DGP is admissible if it satisfies a relatively small number of restrictions or regularity conditions. Sometimes, the existence of the moments used to define the parameters is the only requirement needed for a model to be well defined.

Problems

1. A sample space consists of five simple events E1, E2, E3, E4, and E5.
(a) If P(E1) = P(E2) = 0.15, P(E3) = 0.4, and P(E4) = 2P(E5), find the probabilities of E4 and E5.
(b) If P(E1) = 3P(E2) = 0.3, find the probabilities of the remaining simple events if you know that they are equally probable.

2. A business office orders paper supplies from one of three vendors, V1, V2, and V3. Orders are to be placed on two successive days, one order per day. Thus (V2, V3) might denote that vendor V2 gets the order on the first day and vendor V3 gets the order on the second day.
(a) List the sample points in this experiment of ordering paper on two successive days.
(b) Assume the vendors are selected at random each day and assign a probability to each sample point.
(c) Let A denote the event that the same vendor gets both orders and B the event that V2 gets at least one order. Find P(A), P(B), P(A ∩ B), and P(A ∪ B) by summing the probabilities of the sample points in these events.
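The law of total probability and the extended Bayes' theorem above can be checked numerically by direct enumeration. A small Python sketch; the three-event partition and its probabilities are hypothetical numbers chosen for illustration:

```python
# Hypothetical partition A1, A2, A3 of the sample space, with prior
# probabilities P(Ai) and conditional probabilities P(B | Ai).
prior = {"A1": 0.5, "A2": 0.3, "A3": 0.2}       # sums to 1
likelihood = {"A1": 0.1, "A2": 0.4, "A3": 0.5}  # P(B | Ai)

# Law of total probability: P(B) = sum_i P(Ai) P(B | Ai).
p_b = sum(prior[a] * likelihood[a] for a in prior)

# Extended Bayes' theorem: P(Ai | B) = P(Ai) P(B | Ai) / P(B).
posterior = {a: prior[a] * likelihood[a] / p_b for a in prior}
print(p_b, posterior)
```

The posterior probabilities necessarily sum to one, since the denominator is exactly the total probability of B.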
Chapter 3 Random variables and probability distributions

3.1 Random variables, densities, and cumulative distribution functions

A random variable X is a function whose domain is the sample space and whose range is a set of real numbers.

Definition 4 In simple terms, a random variable (also referred to as a stochastic variable) is a real-valued set function whose value is a real number determined by the outcome of an experiment. The range of a random variable is the set of all the values it can assume. The particular values observed are called realisations x. If these are countable, x1, x2, ..., the random variable is said to be discrete, with associated probabilities

P(X = x_i) = p(x_i) ≥ 0,  Σ_i p(x_i) = 1;     (3.1)

and cumulative distribution P(X ≤ x_j) = Σ_{i=1}^{j} p(x_i). For a continuous random variable, defined over the real line, the cumulative distribution function is

F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(u) du,     (3.2)

where f denotes the probability density function

f(x) = dF(x)/dx     (3.3)

and ∫_{-∞}^{+∞} f(x) dx = 1. Also note that the cumulative distribution function satisfies lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.

Definition 5 The real-valued function F(x) such that F(x) = P_X{(−∞, x]} for each x ∈ ℝ is called the distribution function, also known as the cumulative distribution (or cumulative density) function, or CDF.

Theorem 9 P(a < X ≤ b) = F(b) − F(a).

Theorem 10 For each x ∈ ℝ, F(x) is continuous to the right of x.

Theorem 11 If F(x) is continuous at x ∈ ℝ, then P(X = x) = 0.

Although f(x) is defined at a point, P(X = x) = 0 for a continuous random variable. The support of a distribution is the range over which f(x) ≠ 0.

Let f be a function from ℝᵏ to ℝ. Let x₀ be a vector in ℝᵏ and let y = f(x₀) be its image. The function f is continuous at x₀ if whenever {x_n}_{n=1}^{∞} is a sequence in ℝᵏ which converges to x₀, the sequence {f(x_n)}_{n=1}^{∞} converges to f(x₀).
The function f is said to be continuous if it is continuous at each point in its domain. All polynomial functions are continuous. As an example of a function that is not continuous, consider

f(x) = 1 if x > 0, and f(x) = 0 if x ≤ 0.

If both g and f are continuous functions, then g(f(x)) is continuous.

3.1.1 Discrete Distributions

Definition 6 For a discrete random variable X, let f(x) = P(X = x). The function f(x) is called the probability function (or probability mass function).

The Bernoulli Distribution

f(x; θ) = f(x; p) = pˣ(1 − p)^{1−x} for x = 0, 1 (failure, success) and 0 ≤ p ≤ 1.

The Binomial Distribution

f(x; θ) = B(x; n, p) = (n choose x) pˣ(1 − p)^{n−x} = [n!/(x!(n − x)!)] pˣ(1 − p)^{n−x}     (3.4)

for x = 0, 1, ..., n (X is the number of successes in n trials) and 0 ≤ p ≤ 1.

3.1.2 Continuous Distributions

Definition 7 For a random variable X, if there exists a nonnegative function f(x), defined on the real line, such that for any interval B,

P(X ∈ B) = ∫_B f(x) dx     (3.5)

then X is said to have a continuous distribution, and the function f(x) is called the probability density function, or simply density function (or pdf).

The following can be written for continuous random variables:

F(x) = ∫_{-∞}^{x} f(u) du     (3.6)

f(x) = F′(x) = ∂F(x)/∂x     (3.7)

∫_{-∞}^{+∞} f(u) du = 1     (3.8)

F(b) − F(a) = ∫_{a}^{b} f(u) du     (3.9)

Uniform Distribution on an Interval

A random variable X with the density function

f(x; a, b) = 1/(b − a)     (3.10)

on the interval a ≤ X ≤ b is said to have the uniform distribution on an interval.

The Normal Distribution

A random variable X with the density function

f(x; μ, σ) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)]     (3.11)

is called a Normal (Gaussian) distributed variable.

3.1.3 Example

1. Toss of a single fair coin.
Let X = number of heads. Then

F(x) = 0 if x < 0; F(x) = 1/2 if 0 ≤ x < 1; F(x) = 1 if x ≥ 1.

The cumulative distribution function (cdf) of a discrete random variable is always a step function, because the cdf increases only at a countable number of points.

f(x) = 1/2 if x = 0, and f(x) = 1/2 if x = 1.

F(x) = Σ_{x_j ≤ x} f(x_j)

3.2 Problems

1. Write P(a ≤ x ≤ b) in terms of integrals and draw a picture for it.

2. Assume the probability density function for x is:

f(x) = cx if 0 ≤ x ≤ 2, and f(x) = 0 elsewhere.

(a) Find the value of c for which f(x) is a pdf. (b) Compute F(x). (c) Compute P(1 ≤ x ≤ 2).

3. A large lot of electrical fuses is supposed to contain only 5 percent defectives, assuming a binomial model. If n = 20 fuses are randomly sampled from this lot, find the probability that at least three defectives will be observed.

4. Let the distribution function of a random variable X be given by

F(x) = 0 for x < 0; F(x) = x/8 for 0 ≤ x < 2; F(x) = x²/16 for 2 ≤ x < 4; F(x) = 1 for x ≥ 4.

(a) Find the density function (i.e., pdf) of x. (b) Find P(1 ≤ x ≤ 3). (c) Find P(x ≤ 3). (d) Find P(x ≥ 1 | x ≤ 3).

Chapter 4 Expectations and moments

4.1 Mathematical Expectation and Moments

The probability density and the cumulative distribution functions determine the probabilities of random variables at various points or in different intervals. Very often we are interested in summary measures of where the distribution is located, how it is dispersed around some average measure, whether it is symmetric around some point, and so on.

4.1.1 Mathematical Expectation

Definition 8 Let X be a random variable with f(x) as its PMF or PDF, and let g(x) be a single-valued function. The following integral is the expected value (or mathematical expectation) of g(X) and is denoted by E[g(X)].
In the case of a discrete random variable this takes the form E[g(X)] = Σ_i g(x_i) f(x_i), and in the continuous case E[g(X)] = ∫_{−∞}^{+∞} g(x) f(x) dx.

Mean of a Distribution

For the special case g(X) = X, the mean of a distribution is μ = E(X).

Theorem 12 If c is a constant, E(c) = c.

Theorem 13 If c is a constant, E[c g(X)] = c E[g(X)].

Theorem 14 E[u(X) + v(X)] = E[u(X)] + E[v(X)].

Theorem 15 E(X − μ) = 0, where μ = E(X).

Examples:

Ex1: Let X have the probability function

x:     1     2     3     4
f(x): 4/10  1/10  3/10  2/10

E(X) = Σ_x x f(x) = 1(4/10) + 2(1/10) + 3(3/10) + 4(2/10) = 23/10.

Ex2: Let X have the pdf

f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.

E(X) = ∫_{−∞}^{+∞} x f(x) dx = ∫₀¹ x(4x³) dx = 4 ∫₀¹ x⁴ dx = 4 [x⁵/5]₀¹ = 4(1/5) = 4/5.

Moments of a Distribution

The mean of a distribution is the expected value of the random variable X. If the integral

μ′_m = E(X^m) = ∫_{−∞}^{+∞} x^m dF   (4.1)

exists, it is called the mth moment around the origin and is denoted by μ′_m. Moments can also be taken around the mean; these are the central moments, denoted by μ_m:

μ_m = E[(X − μ)^m] = ∫_{−∞}^{+∞} (x − μ)^m dF   (4.2)

Variance and Standard Deviation

The central moment of a distribution corresponding to m = 2 is called the variance of the distribution and is denoted by σ² or Var(X). The positive square root of the variance is called the standard deviation and is denoted by σ or Std(X). The variance is an average of the squared deviations from the mean. There are many deviations from the mean, but only one standard deviation. The variance shows the dispersion of a distribution; by squaring the deviations, it treats positive and negative deviations symmetrically.

Mean and Variance of a Normal Distribution

If a random variable X is normally distributed as N(μ, σ²), its mean is μ and its variance is σ². The operation of subtracting the mean and dividing by the standard deviation is called standardizing.
Then the standardized variable Z = (X − μ)/σ is standard normal, N(0, 1).

Mean and Variance of a Binomial Distribution

A binomially distributed random variable X ~ B(n, p) has mean np and variance np(1 − p). (Show this!)

Theorem 16 If E(X) = μ and Var(X) = σ², and a and b are constants, then Var(a + bX) = b²σ². (Show this!)

Example:

Ex3: Let X have the probability density function

f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.

E(X) = 4/5.

Var(X) = E(X²) − [E(X)]² = ∫₀¹ x²(4x³) dx − (4/5)² = 4[x⁶/6]₀¹ − 16/25 = 4/6 − 16/25 = 2/75 ≈ 0.0267.

Expectations and Probabilities

Any probability can be interpreted as an expectation. Define the variable Z which equals 1 if event A occurs and 0 if event A does not occur. Then it is easy to see that Pr(A) = E(Z).

How much information about the probability distribution of a random variable X is provided by the expectation and variance of X? There are three useful theorems here.

Theorem 17 Markov's Inequality: If X is a nonnegative random variable, that is, if Pr(X < 0) = 0, and k is any positive constant, then Pr(X ≥ k) ≤ E(X)/k.

Theorem 18 Chebyshev's Inequality: Let b be a positive constant and h(X) be a nonnegative measurable function of the random variable X. Then

Pr(h(X) ≥ b) ≤ (1/b) E[h(X)]

For any constant c > 0 and σ² = Var(X):

Corollary 19 Pr(|X − μ| ≥ c) ≤ σ²/c²

Corollary 20 Pr(|X − μ| ≤ c) ≥ 1 − σ²/c²

Corollary 21 Pr(|X − μ| ≥ kσ) ≤ 1/k²

For linear functions, the expectation of the function is the function of the expectation. But if Y = h(X) is nonlinear, then in general E(Y) ≠ h[E(X)]. The direction of the inequality may depend on the distribution of X. For certain functions, we can be more definite.

Theorem 22 Jensen's Inequality: If Y = h(X) is concave and E(X) = μ, then E(Y) ≤ h(μ). For example, the logarithmic function is concave, so E[log(X)] ≤ log[E(X)] regardless of the distribution of X.
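These inequalities are easy to verify numerically. The Python sketch below uses the discrete distribution from Ex1 above and checks Corollary 19 (Chebyshev) and Jensen's inequality for the concave log function; the cutoff c = 1.5 is an arbitrary illustrative choice.

```python
import math

# Distribution from Ex1: P(X=1)=4/10, P(X=2)=1/10, P(X=3)=3/10, P(X=4)=2/10
xs = [1, 2, 3, 4]
ps = [0.4, 0.1, 0.3, 0.2]

mu = sum(x * p for x, p in zip(xs, ps))              # E(X) = 23/10
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))  # Var(X)

# Corollary 19 (Chebyshev): P(|X - mu| >= c) <= var / c^2 for any c > 0
c = 1.5
p_tail = sum(p for x, p in zip(xs, ps) if abs(x - mu) >= c)
assert p_tail <= var / c ** 2

# Jensen (log is concave): E[log X] <= log E[X]
e_log = sum(math.log(x) * p for x, p in zip(xs, ps))
assert e_log <= math.log(mu)
```

Here p_tail = 0.2 while the Chebyshev bound is var/c² ≈ 0.63, illustrating that the bound holds but can be quite loose.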
Similarly, if Y = h(X) is convex, so that it lies everywhere above its tangent line, then E(Y) ≥ h(μ). For example, the square function is convex, so E(X²) ≥ [E(X)]² regardless of the distribution of X.

Approximate Mean and Variance of g(X)

Suppose X is a random variable defined on a probability space (S, 𝒵, P(·)) with E(X) = μ and Var(X) = σ², and let g(X) be a differentiable and measurable function of X. We first take a linear approximation of g(X) in the neighborhood of μ:

g(X) ≈ g(μ) + g′(μ)(X − μ)   (4.3)

provided g(μ) and g′(μ) exist. Since the second term has zero expectation, E[g(X)] ≈ g(μ), and the variance is Var[g(X)] ≈ σ²[g′(μ)]².

Mode of a Distribution

The point(s) at which f(x) is maximized are called the mode(s). The mode is the most frequently observed value of X.

Median, Upper and Lower Quartiles, and Percentiles

A value of x such that P(X < x) ≤ 1/2 and P(X ≤ x) ≥ 1/2 is called a median of the distribution. If the point is unique, it is the median. Thus the median is the point on either side of which lies 50 percent of the distribution. We often prefer the median as an "average" measure because the arithmetic average can be misleading if extreme values are present. The point(s) with area 1/4 to the left is (are) called the lower quartile(s), and the point(s) corresponding to 3/4 is (are) called the upper quartile(s). For any probability p, the values of X for which the area to the right is p are called the upper pth percentiles (also referred to as quantiles).

Coefficient of Variation

The coefficient of variation is defined as the ratio (σ/μ)100, where the numerator is the standard deviation and the denominator is the mean. It is a measure of the dispersion of a distribution relative to its mean and is useful in the estimation of relationships. We usually say that the variable X does not vary much if the coefficient of variation is less than 5 percent.
This is also helpful for making comparisons between two variables that are measured on different scales.

Skewness and Kurtosis

If a continuous density f(x) has the property that f(μ + a) = f(μ − a) for all a (μ being the mean of the distribution), then f(x) is said to be symmetric around the mean. If a distribution is not symmetric about the mean, it is called skewed. A commonly used measure of skewness is α₃ = E[(X − μ)³/σ³]. For a symmetric distribution such as the normal, this is zero (α₃ = 0). A positively skewed distribution (α₃ > 0) has a long tail to the right; a negatively skewed distribution (α₃ < 0) has a long tail to the left.

The peakedness of a distribution is called kurtosis. One measure of kurtosis is α₄ = E[(X − μ)⁴/σ⁴]. A distribution with the kurtosis of the normal (α₄ = 3) is called mesokurtic. A narrow distribution is called leptokurtic (α₄ > 3) and a flat distribution is called platykurtic (α₄ < 3). The value E[(X − μ)⁴/σ⁴] − 3 is often referred to as excess kurtosis.

4.1.2 Moments

Mathematical Expectation

The concept of mathematical expectation is easily extended to bivariate random variables. We have

E[g(X, Y)] = ∫∫ g(x, y) dF(x, y)   (4.4)

where the integral is over the (X, Y) space.

Moments

The rth moment of X is

E(X^r) = ∫ x^r dF(x)   (4.5)

Joint Moments

E(X^r Y^s) = ∫∫ x^r y^s dF(x, y)

Let X and Y be independent random variables, and let u(X) be a function of X only and v(Y) be a function of Y only.
Then

E[u(X)v(Y)] = E[u(X)] E[v(Y)]   (4.6)

Covariance

The covariance between X and Y is defined as

σ_XY = Cov(X, Y) = E[(X − μ_x)(Y − μ_y)] = E(XY) − μ_x μ_y   (4.7)

In the continuous case this takes the form

σ_XY = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_x)(y − μ_y) f(x, y) dx dy   (4.8)

and in the discrete case it is

σ_XY = Σ_x Σ_y (x − μ_x)(y − μ_y) f(x, y)   (4.9)

Although the covariance measure is useful in identifying the nature of the association between X and Y, it has a serious problem: its numerical value is very sensitive to the units of measurement. To avoid this problem, a "normalized" covariance measure is used. This measure is called the correlation coefficient.

Correlation

The quantity

ρ_XY = σ_XY/(σ_X σ_Y) = Cov(X, Y)/√(Var(X) Var(Y))   (4.10)

is called the correlation coefficient between X and Y. If Cov(X, Y) = 0, then Cor(X, Y) = 0, in which case X and Y are said to be uncorrelated. If two random variables are independent, then σ_XY = 0 and ρ_XY = 0. The converse need not be true.

Theorem 23 |ρ_XY| ≤ 1, that is, −1 ≤ ρ_XY ≤ 1. The inequality [Cov(X, Y)]² ≤ Var(X)Var(Y) is called the Cauchy-Schwarz Inequality; equivalently, ρ²_XY ≤ 1.

It should be emphasized that ρ_XY measures only a linear relationship between X and Y. It is possible to have an exact relation but a correlation less than 1, even 0.

Example: To illustrate, consider a random variable X which is distributed as Uniform[−θ, θ] and the transformation Y = X². Then Cov(X, Y) = E(X³) − E(X)E(X²) = 0, because the distribution is symmetric around the origin and hence all the odd moments about the origin are zero. It follows that X and Y are uncorrelated even though there is an exact relation between them. In fact, this result holds for any distribution that is symmetric around the origin.

Definition 9 Conditional Expectation: Let X and Y be continuous random variables and g(Y) be a continuous function.
Then the conditional expectation (or conditional mean) of g(Y) given X = x, denoted by E_{Y|X}[g(Y) | X], is given by ∫_{−∞}^{∞} g(y) f(y | x) dy, where f(y | x) is the conditional density of Y given X. Note that E[g(Y) | X = x] is a function of x and is not a random variable, because x is fixed. The special case E(Y | X) is called the regression of Y on X.

Theorem 24 Law of Iterated Expectation: E_XY[g(Y)] = E_X[E_{Y|X}{g(Y) | X}]. That is, the unconditional expectation is the expectation of the conditional expectation.

Definition 10 Conditional Variance: Let μ_{Y|X} = E(Y | X) = μ*(X) be the conditional mean of Y given X. Then the conditional variance of Y given X is defined as Var(Y | X) = E_{Y|X}[(Y − μ*)² | X]. This is a function of X.

Theorem 25 Var(Y) = E_X[Var(Y | X)] + Var_X[E(Y | X)], that is, the variance of Y is the mean of its conditional variance plus the variance of its conditional mean.

Theorem 26 Var(aX + bY) = a²Var(X) + 2ab Cov(X, Y) + b²Var(Y).

Approximate Mean and Variance of g(X, Y)

After obtaining a linear approximation of the function g(X, Y),

g(X, Y) ≈ g(μ_X, μ_Y) + [∂g/∂X](X − μ_X) + [∂g/∂Y](Y − μ_Y)   (4.11)

its mean can be written E[g(X, Y)] ≈ g(μ_X, μ_Y). Its variance is

Var[g(X, Y)] ≈ σ²_X [∂g/∂X]² + σ²_Y [∂g/∂Y]² + 2σ_XY [∂g/∂X][∂g/∂Y]   (4.12)

Note that these approximations may be grossly in error. You should be especially careful with the variance and covariance approximations.

Problems

1. For certain ore samples the proportion Y of impurities per sample is a random variable with density function

f(y) = (3/2)y² + y for 0 ≤ y ≤ 1, and 0 elsewhere.

The dollar value of each sample is W = 5 − 0.5Y. Find the mean and variance of W.

2. The random variable Y has the probability density function

f(y) = (3/8)(7 − y)² for 5 ≤ y ≤ 7, and 0 elsewhere.

(a) Find E(Y) and Var(Y).
(b) Find an interval shorter than (5, 7) in which at least 3/4 of the Y values must lie.
(c) Would you expect to see a measurement below 5.5 very often? Why?

Chapter 5 Some univariate distributions

5.1 Discrete Distributions

A random variable X is said to have a discrete distribution if it can take only a finite number of different values x₁, x₂, ..., x_n, or a countably infinite number of distinct points.

5.1.1 The Bernoulli Distribution

We have this distribution when there are only two possible outcomes to an experiment, one labeled a success (probability p) and the other labeled a failure (probability 1 − p = q). If there is only one trial of the experiment, then we have the Bernoulli Distribution with probability density function

f(x; θ) = f(x; p) = p^x (1 − p)^(1−x)   (5.1)

for x = 0, 1 (failure, success) and 0 ≤ p ≤ 1.

E(X) = Σ_{x=0}^{1} x f(x) = 0(1 − p) + 1·p = p.   (5.2)

Var(X) = E(X²) − [E(X)]² = 0(1 − p) + 1·p − p² = p(1 − p) = pq.   (5.3)

5.1.2 The Binomial Distribution

This is also built from Bernoulli trials, but here we have n independent trials:

f(x; θ) = B(x; n, p) = C(n, x) p^x (1 − p)^(n−x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n−x)   (5.4)

for x = 0, 1, ..., n (X is the number of successes in n trials) and 0 ≤ p ≤ 1.

E(X) = np.  Var(X) = npq.   (5.5)

5.1.3 Example

Ex1: Assume a student is given a test with 10 true-false questions. Also assume that the student is totally unprepared for the test and guesses at the answer to every question. What is the probability that the student will answer 7 or more questions correctly?

Let X be the number of questions answered correctly. The test represents a binomial experiment with n = 10 and p = 1/2, so X ~ Bin(n = 10, p = 1/2).

P(X ≥ 7) = P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10)
= Σ_{k=7}^{10} C(10, k) (1/2)^k (1/2)^(10−k) = Σ_{k=7}^{10} C(10, k) (1/2)^10
≈ 0.17

5.1.4 Simple Random Walk

This is a process often used to describe the behavior of stock prices. Suppose that {ε_t} is a purely random series with mean μ and variance σ².
Then a process {X_t} is said to be a random walk if

X_t = X_{t−1} + ε_t   (5.6)

Let us assume that X₀ is equal to zero. Then the process evolves as follows:

X₁ = ε₁   (5.7)

X₂ = X₁ + ε₂ = ε₁ + ε₂   (5.8)

and so on. By successive substitution we have

X_t = Σ_{i=1}^{t} ε_i   (5.9)

Hence E(X_t) = tμ and Var(X_t) = tσ². Since the mean and variance change with t, the process is nonstationary, but its first difference is stationary. Referring to share prices, this says that the changes in a share price will be a purely random process.

5.1.5 Geometric Distribution

Let X be the number of the trial at which the first success occurs. The distribution of X is known as the geometric distribution. It has the density function

f(x; p) = p(1 − p)^(x−1),  x = 1, 2, 3, ...   (5.10)

5.1.6 Hypergeometric Distribution

The binomial distribution is often referred to as sampling with replacement, which is needed to maintain the same probabilities across the trials. Let there be a objects in a certain class A (defective) and b objects in another class B (nondefective). If we draw a random sample of size n without replacement, then there are C(a, x) possible ways to get x objects from class A. For each such outcome, there are C(b, n − x) possible ways of drawing the remaining objects from B. Thus the probability density function of a hypergeometrically distributed variable X is

f(x; n, a, b) = C(a, x) C(b, n − x) / C(a + b, n)   (5.11)

5.1.7 Negative Binomial Distribution

In a binomial experiment, let Y be the number of trials needed to get exactly k successes. To get exactly k successes, there must be k − 1 successes in the first y − 1 trials, and the next outcome must be a success. Let X = Y − k be the number of failures until k successes have been obtained. The density function of X is known as the negative binomial:

f(x; k, p) = C(x + k − 1, k − 1) p^k (1 − p)^x,  x = 0, 1, 2, ...   (5.12)

5.1.8 Poisson Distribution

Consider a binomially distributed variable, and let n → ∞ and p → 0 with np = λ (> 0) held fixed.
The probability of success is very small and the number of trials is large. The limiting distribution is known as the Poisson distribution:

f(x; λ) = e^(−λ) λ^x / x!,  x = 0, 1, 2, ...   (5.13)

We use this distribution in queueing theory, for example in modeling the arrival of the next customer at a checkout line or the making of a phone call in a specific small interval.

5.2 Continuous Distributions

Definition 11 For a random variable X, if there exists a nonnegative function f(x), defined on the real line, such that for any interval B,

P(X ∈ B) = ∫_B f(x) dx

then X is said to have a continuous distribution, and the function f(x) is called the probability density function or simply density function (PDF).

5.2.1 Uniform Distribution on an Interval

A random variable X with the density function

f(x; a, b) = 1/(b − a)   (5.14)

on the interval a ≤ X ≤ b is said to have the uniform distribution on an interval.

5.2.2 Beta Distribution

The density function for this distribution has the form

f(x) = x^(m−1)(1 − x)^(n−1) / ∫₀¹ x^(m−1)(1 − x)^(n−1) dx,  0 < x < 1,  m, n > 0   (5.15)

The denominator is known as the Beta function. This distribution, B(m, n), reduces to the uniform distribution for m = n = 1.

5.2.3 Cauchy Distribution

The standard Cauchy distribution has the density function

f(x) = 1/[π(1 + x²)],  −∞ < x < ∞   (5.16)

The Cauchy distribution arises when the ratio of two independent normal variates is computed.

5.2.4 Chi-Square Distribution

If Z₁, Z₂, ..., Z_n are independent N(0, 1) variables, and X = Σ_{i=1}^{n} Z_i², then the probability density function of the chi-squared distributed variable X is

f(x) = (x/2)^(n/2 − 1) e^(−x/2) / [2Γ(n/2)],  x > 0,  n = 1, 2, ...   (5.17)

where Γ(n) is the Gamma function:

Γ(1/2) = √π   (5.18)

Γ(1) = 1   (5.19)

Γ(n) = ∫₀^∞ u^(n−1) e^(−u) du = (n − 1)Γ(n − 1)   (5.20)

Γ(n + 1) = n!   (5.21)

5.2.5 The Exponential Distribution

The distribution

f(x; θ) = (1/θ) e^(−x/θ),  x > 0,  θ > 0   (5.22)

is called the exponential distribution.
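A distinctive feature of the exponential distribution is that it is memoryless: having survived to time s tells us nothing about survival beyond s. This follows directly from the survival function P(X > x) = e^(−x/θ) and can be checked in a few lines of Python (θ, s, and t below are arbitrary illustrative values):

```python
import math

# Exponential survival function: P(X > x) = exp(-x / theta)
theta = 2.0
def survival(x):
    return math.exp(-x / theta)

# Memoryless property: P(X > s + t | X > s) = P(X > t)
s, t = 1.3, 0.7
lhs = survival(s + t) / survival(s)   # conditional survival probability
rhs = survival(t)                     # unconditional survival probability
assert abs(lhs - rhs) < 1e-12
```

The identity holds exactly because e^(−(s+t)/θ) / e^(−s/θ) = e^(−t/θ); it is this property that makes the exponential distribution unsuitable for lifetimes that age, a point taken up with the Weibull distribution below.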
5.2.6 Extreme Value Distribution (Gompertz Distribution)

For modeling extreme values such as the peak electricity demand in a day, maximum rainfall, and so on, we can use the extreme value distribution which, in its standard form, has the density

f(x) = e^(−x) exp[−e^(−x)],  −∞ < x < ∞   (5.23)

5.2.7 F Distribution

If X = (W₁/m)/(W₂/n), where W₁ ~ χ²(m) and W₂ ~ χ²(n) are independent, then X ~ F(m, n) with the density function

f(x) = {Γ[(m + n)/2] / (Γ(m/2) Γ(n/2))} (m/n)^(m/2) x^((m−2)/2) / [1 + (m/n)x]^((m+n)/2),  x > 0   (5.24)

That is, the ratio of two independent chi-square variables, each divided by its degrees of freedom, has the Snedecor F distribution, with numerator and denominator degrees of freedom equal to those of the respective chi-squares.

5.2.8 Gamma Distribution

This distribution has the density function

f(x; α, β) = [1/(β^α Γ(α))] x^(α−1) e^(−x/β),  x > 0,  α, β > 0   (5.25)

When α = 1, the Gamma distribution reduces to the exponential distribution.

5.2.9 Power-Function Distribution

The density function for this distribution is

f(x) = θx^(θ−1),  0 < x < 1,  θ > 0   (5.26)

(Note that this is a continuous distribution on (0, 1), distinct from the discrete geometric distribution of Section 5.1.5; it is the Beta distribution with m = θ, n = 1.)

5.2.10 Logistic Distribution

This distribution has the density function

f(x) = e^(−x)/(1 + e^(−x))²,  −∞ < x < ∞   (5.27)

5.2.11 Lognormal Distribution

A random variable X is said to have the standard lognormal distribution if Z = ln X has the standard normal distribution

f_Z(z) = (1/√(2π)) e^(−z²/2),  −∞ < z < ∞   (5.28)

By the change of variables back to x, we can write its density function as

f_X(x) = (1/(x√(2π))) exp[−(ln x)²/2],  0 < x < ∞   (5.29)

Because ln X is defined only for positive X, and most economic variables take only positive values, this distribution is very popular in economics. It has been used to model the size of firms, stock prices at the end of a trading day, income distributions, expenditure on particular commodities, and certain commodity prices.
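As a numerical sanity check on the change of variables in (5.29), the sketch below integrates the standard lognormal density over a grid; the grid bounds (0, 50] and step 0.001 are arbitrary choices that leave only negligible probability mass outside the grid.

```python
import math

# Standard normal density
def phi(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Standard lognormal density via change of variables: f_X(x) = phi(ln x) / x
def f_lognormal(x):
    return phi(math.log(x)) / x

# Crude Riemann sum over (0, 50]; should be close to 1
dx = 0.001
total = sum(f_lognormal(i * dx) * dx for i in range(1, int(50 / dx) + 1))
assert abs(total - 1) < 1e-2
```

The mass beyond x = 50 corresponds to P(Z > ln 50) ≈ P(Z > 3.9), which is negligible, so the Riemann sum comes out very close to 1.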
5.2.12 The Normal Distribution

A random variable X with the density function

f(x; μ, σ) = [1/(σ√(2π))] exp[−(x − μ)²/(2σ²)]   (5.30)

is called a normally (Gaussian) distributed variable. The integral in the cdf of the standard normal distribution does not have a closed-form solution but requires numerical integration.

5.2.13 Pareto Distribution

The density function for this distribution is

f(x) = (θ/x₀)(x₀/x)^(θ+1),  x > x₀,  θ > 0   (5.31)

Although the lognormal distribution is often used to model the distribution of incomes, it has been found to approximate incomes in the middle range very well but to fail in the upper tail. A more appropriate distribution for this purpose is the Pareto distribution.

5.2.14 Student's t Distribution

If Z ~ N(0, 1) and W ~ χ²(n), with Z and W independent, and X = Z/√(W/n), then the probability density function of X is

f(x) = {Γ[(n + 1)/2] / [√n Γ(n/2) Γ(1/2)]} [1 + x²/n]^(−(n+1)/2)   (5.32)

The probability density function is symmetric, centered at zero, and similar in shape to a standard normal probability density function.

5.2.15 Weibull Distribution

In some more general situations the conditions for the exponential distribution are not met. An exponential distribution provides an appropriate model for the lifetime of equipment, but, because the exponential distribution is memoryless, it is not suitable for the lifetime of a human population. The Weibull distribution has the density function

f(x; a, b) = ab x^(b−1) e^(−ax^b),  x > 0,  a, b > 0   (5.33)

Note that when b = 1, this reduces to the exponential distribution.

Problems

1. Find the values z₀ in the following probabilities.

(a) P(Z > z₀) = 0.50
(b) P(Z < z₀) = 0.8643
(c) P(−z₀ < Z < z₀) = 0.90
(d) P(−z₀ < Z < z₀) = 0.99

2. A soft drink machine can be regulated so that it discharges an average of μ ounces per cup.
If the ounces of fill are normally distributed with σ² = (0.3)², give the setting for μ so that 8-ounce cups will overflow only 1 percent of the time. (Note: P(Z > 2.33) = 0.01.)

3. Let f₁(y) and f₂(y) be density functions, and let a be a constant such that 0 ≤ a ≤ 1. Consider the function

f(y) = a f₁(y) + (1 − a) f₂(y).

(a) Show that f(y) is a density function. Such a density function is often referred to as a mixture of two density functions.
(b) Suppose that Y₁ is a random variable with density function f₁(y), with E(Y₁) = μ₁ and Var(Y₁) = σ₁²; similarly, suppose that Y₂ is a random variable with density function f₂(y), with E(Y₂) = μ₂ and Var(Y₂) = σ₂². Assume that Y is a random variable whose density is a mixture of the densities corresponding to Y₁ and Y₂.
(i) Show that E(Y) = aμ₁ + (1 − a)μ₂.
(ii) Show that Var(Y) = aσ₁² + (1 − a)σ₂² + a(1 − a)[μ₁ − μ₂]². [Hint: E(Y_i²) = μ_i² + σ_i², i = 1, 2.]

Chapter 6 Multivariate distributions

6.1 Bivariate Distributions

In most cases, the outcome of an experiment may be characterized by more than one variable. For instance, X may be the income, Y the total expenditures of a household, and Z the family size. We observe (X, Y, Z).

Definition 12 Joint Distribution Function: Let X and Y be two random variables. Then the function F_XY(x, y) = P(X ≤ x and Y ≤ y) is called the joint distribution function. It satisfies:

1) F_XY(x, ∞) = F_X(x) and F_XY(∞, y) = F_Y(y).
2) F_XY(−∞, y) = F_XY(x, −∞) = 0.

Definition 13 Joint Probability Density Function:

Discrete probability function: f_XY(x, y) = P(X = x, Y = y)   (6.1)

Continuous probability function: f_XY(x, y) = ∂²F(x, y)/∂x∂y   (6.2)

and hence

F_XY(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_XY(u, v) du dv   (6.3)

In the univariate case, if ∆x is a small increment of x, then f_X(x)∆x is the approximate probability that x − (1/2)∆x < X ≤ x + (1/2)∆x.
Similarly, in a bivariate distribution, f_XY(x, y)∆x∆y is the approximate probability that x − (1/2)∆x < X ≤ x + (1/2)∆x and y − (1/2)∆y < Y ≤ y + (1/2)∆y. The bivariate density function satisfies the conditions f_XY(x, y) ≥ 0 and ∫_{−∞}^{∞} ∫_{−∞}^{∞} dF(x, y) = 1, where dF(x, y) is the bivariate analog of dF(x).

Definition 14 Marginal Density Function: If X and Y are discrete random variables, then f_X(x) = Σ_y f_XY(x, y) is the marginal density of X, and f_Y(y) = Σ_x f_XY(x, y) is the marginal density of Y. In the continuous case, f_X(x) = ∫ f_XY(x, y) dy is the marginal density of X, and f_Y(y) = ∫ f_XY(x, y) dx is the marginal density of Y.

Definition 15 Conditional Density Function: The conditional density of Y given X = x is defined as f(y | x) = f(x, y)/f(x), provided f(x) ≠ 0. The conditional density of X given Y = y is defined as f(x | y) = f(x, y)/f(y), provided f(y) ≠ 0. This definition holds for both discrete and continuous random variables.

Definition 16 Statistical Independence: The random variables X and Y are said to be statistically independent if and only if f(y | x) = f(y) for all values of X and Y for which f(x, y) is defined. Equivalently, f(x | y) = f(x) and f(x, y) = f(x)f(y).

Theorem 27 Random variables X and Y with joint density function f(x, y) will be statistically independent if and only if f(x, y) can be written as a product of two nonnegative functions, one in X alone and another in Y alone.

Theorem 28 If X and Y are statistically independent and a, b, c, d are real constants with a < b and c < d, then P(a < X < b, c < Y < d) = P(a < X < b)P(c < Y < d).

6.1.1 The Bivariate Normal Distribution

Let (X, Y) have the joint density

f(x, y) = {1/[2πσ_x σ_y √(1 − ρ²)]} exp{−[1/(2(1 − ρ²))][((x − μ_x)/σ_x)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y) + ((y − μ_y)/σ_y)²]}   (6.4)

for −∞ < x < ∞, −∞ < y < ∞, −∞ < μ_X < ∞, −∞ < μ_Y < ∞, σ_X, σ_Y > 0, and −1 < ρ < 1. Then (X, Y) is said to have the bivariate normal distribution.
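Definitions 14–16 are easy to illustrate with a small discrete joint table. The Python sketch below uses hypothetical cell probabilities, computes both marginals and a conditional density, and tests the independence condition f(x, y) = f_X(x)f_Y(y) cell by cell.

```python
# Joint pmf of (X, Y) over {0,1} x {0,1}; the numbers are hypothetical
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

# Marginal densities: f_X(x) = sum_y f(x, y), f_Y(y) = sum_x f(x, y)
fx = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
fy = {y: sum(p for (x, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Conditional density f(y | x=0) = f(0, y) / f_X(0); it must sum to 1
f_y_given_0 = {y: joint[(0, y)] / fx[0] for y in (0, 1)}
assert abs(sum(f_y_given_0.values()) - 1) < 1e-12

# Independence requires f(x, y) = f_X(x) f_Y(y) in every cell
independent = all(abs(joint[(x, y)] - fx[x] * fy[y]) < 1e-12
                  for x in (0, 1) for y in (0, 1))
assert not independent   # here f(0,0)=0.1 but f_X(0)f_Y(0)=0.4*0.3=0.12
```

Note that the marginals alone cannot tell you whether the variables are independent; the check has to be made against the full joint table.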
Theorem 29 If (X, Y) is bivariate normal, then the marginal distribution of X is N(μ_X, σ_X²) and that of Y is N(μ_Y, σ_Y²). The converse need not be true; that is, if the marginal distribution of X is univariate normal, the joint density of X and Y need not be bivariate normal.

Theorem 30 For a bivariate normal, the conditional density of Y given X = x is univariate normal with mean μ_Y + (ρσ_Y/σ_X)(x − μ_X) and variance σ_Y²(1 − ρ²). The conditional density of X given Y = y is also normal with mean μ_X + (ρσ_X/σ_Y)(y − μ_Y) and variance σ_X²(1 − ρ²).

In the case of the bivariate normal density, the conditional expectation E(Y | X) is of the form α + βX, where α and β depend on the respective means, standard deviations, and the correlation coefficient. This is a simple linear regression in which the conditional expectation is a linear function of X.

6.1.2 Mixture Distributions

If the distribution of a random variable depends on parameters or variables which themselves depend on other random variables, then we say that we have a mixture distribution. This might take the form f(x; θ), where θ depends on a random variable, or the form f(x | y), where Y is another random variable. The unobserved heterogeneity in hazard models is an example of the latter: the density function for the duration t, f(t | v), is conditional on an unobserved heterogeneity term which is itself a random variable.

6.2 Multivariate Density Functions

The joint density function of X₁, X₂, ..., X_n has the form f(x₁, x₂, ..., x_n). If the X's are continuous random variables,

f_X(x₁, x₂, ..., x_n) = ∂ⁿF_X(x₁, x₂, ..., x_n)/(∂x₁∂x₂···∂x_n)   (6.5)

6.2.1 The Multivariate Normal Distribution

Definition 17 Mean vector: Let X′ = (X₁, X₂, ..., X_n) be an n-dimensional vector random variable defined on ℝⁿ with density function f(x), E(X_i) = μ_i, and μ′ = (μ₁, μ₂, ..., μ_n).
Then the mean of the distribution is μ = E(X), where μ and E(X) are n×1 vectors, and hence E(X − μ) = 0.

Definition 18 Variance-Covariance Matrix: The covariance between X_i and X_j is defined as σ_ij = E[(X_i − μ_i)(X_j − μ_j)], where μ_i = E(X_i). The matrix

Var(X) = Σ = [ σ11 σ12 ... σ1n
               σ21 σ22 ... σ2n
                .    .  ...  .
               σn1 σn2 ... σnn ]   (6.6)

is called the covariance matrix of X. In matrix notation, this can be expressed as Σ = E[(X − μ)(X − μ)′], where (X − μ) is n×1. Note that the diagonal elements are variances.

Properties:

1) If Y (m×1) = A (m×n) X (n×1) + b (m×1), then E(Y) = Aμ + b.
2) Σ is a symmetric positive semi-definite matrix.
3) Σ is positive definite if and only if it is nonsingular.
4) Σ = E[XX′] − μμ′.
5) If Y = AX + b, then the covariance matrix of Y is AΣA′.

6.2.2 Standard Multivariate Normal Density

Let X₁, X₂, ..., X_n be n independent random variables, each of which is N(0, 1). Then their joint density function is the product of the individual density functions and is the standard multivariate normal density:

f_X(x₁, x₂, ..., x_n) = [1/√(2π)]ⁿ exp[−(Σ_{i=1}^{n} x_i²)/2] = [1/(2π)]^(n/2) exp[−x′x/2]   (6.7)

The density function of the general multivariate normal distribution N(μ, Σ) is

f_Y(y) = [1/((2π)^(n/2) |Σ|^(1/2))] exp[−(y − μ)′Σ⁻¹(y − μ)/2]   (6.8)

Properties:

1) If Y is multivariate normal, then Y₁, Y₂, ..., Y_n will be independent if and only if Σ is diagonal.
2) A linear combination of multivariate normal random variables is also multivariate normal. More specifically, let Y ~ N(μ, Σ). Then Z = AY ~ N(Aμ, AΣA′), where A is an n×n matrix.
3) If Y ~ N(μ, Σ) and Σ has rank k < n, then there exists a nonsingular k×k matrix A such that the k×1 vector X = [A⁻¹ O](Y − μ) is a k-variate normal with zero mean and covariance matrix I_k, where O is a k×(n − k) matrix of zeros.
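Property 5, Var(AX + b) = AΣA′, can be checked with a small worked example in plain Python; the 2×2 matrices below are hypothetical, and the diagonal of the result is cross-checked against Theorem 26.

```python
# Check: if Y = AX + b, then Var(Y) = A Sigma A'  (Property 5)
Sigma = [[2.0, 0.5],    # hypothetical covariance matrix of X
         [0.5, 1.0]]
A = [[1.0, 2.0],        # hypothetical transformation matrix
     [0.0, 3.0]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

V = matmul(matmul(A, Sigma), transpose(A))   # covariance matrix of Y

# Y1 = X1 + 2*X2, so by Theorem 26 with a = 1, b = 2:
# Var(Y1) = 1*Var(X1) + 2*1*2*Cov(X1,X2) + 4*Var(X2) = 2 + 2 + 4 = 8
assert V[0][0] == 8.0
# A covariance matrix is symmetric
assert V[0][1] == V[1][0]
```

The same computation with Σ diagonal would make V's off-diagonal entry equal to A's cross terms only, which is why independence of the components of Y requires AΣA′, not just Σ, to be diagonal.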
6.2.3 Marginal and Conditional Distributions of N(μ, Σ)

Let Y ~ N(μ, Σ), and consider the partition

Y = [Y₁; Y₂],  μ = [μ₁; μ₂],  Σ = [Σ₁₁ Σ₁₂; Σ₂₁ Σ₂₂]   (6.9)

where the n random variables are partitioned into n₁ and n₂ variates (n₁ + n₂ = n).

Theorem 31 Given the above partition, the marginal distribution of Y₁ is N(μ₁, Σ₁₁), and the conditional density of Y₂ given Y₁ is multivariate normal with mean μ₂ + Σ₂₁Σ₁₁⁻¹(Y₁ − μ₁) and covariance matrix Σ₂₂ − Σ₂₁Σ₁₁⁻¹Σ₁₂.

6.2.4 The Chi-Square Distribution

Theorem: If Z₁, Z₂, ..., Z_n are independent N(0, 1) variables, and X = Σ_{i=1}^{n} Z_i², then the probability density function of the chi-squared distributed variable X is

f(x) = (x/2)^(n/2 − 1) e^(−x/2) / [2Γ(n/2)],  x > 0,  n = 1, 2, ...   (6.10)

where Γ(n) is the Gamma function.

The chi-square distribution has the additive property: if X ~ χ²_m, Y ~ χ²_n, and X and Y are independent, then their sum (X + Y) ~ χ²_{m+n}. Thus the sum of independent chi-squares is also chi-square, with degrees of freedom equal to the sum of the degrees of freedom.

Theorem 32 If X_i ~ N(μ_i, σ_i²), i = 1, 2, ..., n, and X₁, X₂, ..., X_n are all independent, then Y = Σ_{i=1}^{n} [(X_i − μ_i)/σ_i]² has the chi-square distribution with n degrees of freedom.

Problems

1. Let Y₁ and Y₂ have the bivariate uniform distribution

f(y₁, y₂) = 1 if 0 ≤ y₁ ≤ 1 and 0 ≤ y₂ ≤ 1, and 0 otherwise.

(a) Sketch the probability density surface.
(b) Find F(0.2, 0.4).
(c) Find P(0.1 ≤ Y₁ ≤ 0.3, 0 ≤ Y₂ ≤ 0.5).

2. Let

f(y₁, y₂) = 2y₁ if 0 ≤ y₁ ≤ 1 and 0 ≤ y₂ ≤ 1, and 0 otherwise.

(a) Sketch f(y₁, y₂).
(b) Find the marginal density functions for Y₁ and Y₂.

Chapter 7 Sampling, sample moments, sampling distributions, and simulation

7.1 Independent, Dependent, and Random Samples

The totality of elements about which some information is desired is called a population.
We often use a small proportion of a population (known as a sample), measure its attributes, and draw conclusions or make policy decisions based on the data obtained. By statistical inference, we estimate the unknown parameters underlying statistical distributions, measure their precision, test hypotheses about them, and use them in generating forecasts of random variables.

Definition 19 Independent Sample: The observations x₁, x₂, ..., x_n are said to form an independent sample if the joint density function of the x_i's has the form

f_X(x₁, x₂, ..., x_n) = Π_{i=1}^{n} f_{X_i}(x_i; θ_i)   (7.1)

Since f_{X_i} might be different across i, here we are not assuming that the x's have the same distribution.

Definition 20 Random Sample: A random sample from a population is a set of independent, identically distributed (abbreviated iid) random variables x₁, x₂, ..., x_n, each of which has the same distribution as X:

f_X(x₁, x₂, ..., x_n) = f(x₁)f(x₂)···f(x_n) = Π_{i=1}^{n} f(x_i)   (7.2)

Definition 21 Dependent Sample: If the observations are obtained over time, or if there is a dependency between observations in cross-section data, then we have a dependent sample. The joint density f_X(x₁, x₂, ..., x_n; θ) can be factored as f(x_n | x₁, x₂, ..., x_{n−1}; θ) f(x₁, x₂, ..., x_{n−1}; θ).

7.2 Sample Statistics

Definition 22 Statistic: A statistic is a function of observable random variable(s) that does not contain any unknown parameters. Examples are the sample mean, sample variance, sample moments, sample covariance, and sample correlation coefficient.

Theorem 33 If x₁, x₂, ..., x_n is a random sample from a population with mean μ and variance σ², and the c_i's are constants, then Y = c₁x₁ + c₂x₂ + ··· + c_n x_n = c′x has the following expectation and variance:

E(Y) = (Σ_{i=1}^{n} c_i) μ = (c₁ + c₂ + ··· + c_n)μ   (7.3)

Var(Y) = (c₁² + c₂² + ··· + c_n²) σ² = σ² c′c   (7.4)

Corollary 34 E(x̄) = μ.   (7.5)
Var(x̄) = σ²/n.  (7.6)

7.3 Sampling Distributions

Because a sample statistic is a function of random variables, it has a statistical distribution of its own. This is called the sampling distribution of the statistic. If we obtain a sample of n observations and compute the statistic, we obtain a numerical value; by repeating this process we get a sequence of values of the statistic, which can be tabulated in the form of a frequency distribution.

As an example, take the normal case. Linear combinations of normal variates are themselves normally distributed, so if the parent distribution is normal, Z = √n(x̄ − μ)/σ ∼ N(0, 1) exactly. Moreover, Z converges to the N(0, 1) distribution as n → ∞ even when the parent distribution is not normal (the central limit theorem of Chapter 8).

Problems

1. Suppose that X1, X2, ..., Xm and Y1, Y2, ..., Yn are independent random samples, with the variables Xi normally distributed with mean μ1 and variance σ1², and the variables Yi normally distributed with mean μ2 and variance σ2². The difference between the sample means, X̄ − Ȳ, is then a linear combination of m + n normal random variables, and is itself normally distributed.

(a) Find E(X̄ − Ȳ).
(b) Find Var(X̄ − Ȳ).
(c) Suppose that σ1² = 2, σ2² = 2.5, and m = n. Find the sample size so that (X̄ − Ȳ) will be within one unit of (μ1 − μ2) with probability 0.95.

2. If Y is a random variable that has an F distribution with ν1 numerator and ν2 denominator degrees of freedom, show that U = 1/Y has an F distribution with ν2 numerator and ν1 denominator degrees of freedom.

3. If T has a t distribution with ν degrees of freedom, show that U = T² has an F distribution with 1 numerator degree of freedom and ν denominator degrees of freedom.

Chapter 8 Large sample theory

8.1 Different Types of Convergence

In many situations it is not possible to derive the exact distributions of statistics based on a random sample of observations.
In most cases, however, the problem disappears if the sample size is large, because we can then derive approximate distributions. Hence the need for large-sample, or asymptotic, distribution theory.

Definition 23 Limit of a sequence: Suppose a1, a2, ..., an constitute a sequence of real numbers. If there exists a real number a such that for every real ε > 0 there exists an integer N(ε) with the property that |an − a| < ε for all n > N(ε), then we say that a is the limit of the sequence {an} and write lim_{n→∞} an = a. Intuitively, if an lies in an ε-neighbourhood (a − ε, a + ε) of a for all n > N(ε), then a is said to be the limit of the sequence {an}. Examples of sequences whose limits exist are

lim_{n→∞} [1 + (1/n)] = 1 and  (8.1)
lim_{n→∞} [1 + (a/n)]^n = e^a.  (8.2)

The notion of convergence is easily extended to that of a function f(x).

Definition 24 Limit of a function: The function f(x) has the limit A at the point x0 if for every ε > 0 there exists a δ(ε) > 0 such that |f(x) − A| < ε whenever 0 < |x − x0| < δ(ε).

Definition 25 Convergence in Distribution: Given a sequence of random variables Xn with CDFs Fn(x), and a CDF FX(x) corresponding to the random variable X, we say that Xn converges in distribution to X, and write Xn →d X, if lim_{n→∞} Fn(x) = FX(x) at all points x at which FX(x) is continuous. Intuitively, convergence in distribution occurs when the distribution of Xn comes closer and closer to that of X as n increases indefinitely. Thus FX(x) can be taken as an approximation to the distribution of Xn when n is large. (Recall, for example, the Poisson approximation to the binomial distribution.)

Definition 26 Convergence in Probability: The sequence of random variables Xn is said to converge in probability to the real number x if lim_{n→∞} P[|Xn − x| > ε] = 0 for each ε > 0. Thus it becomes less and less likely that Xn − x lies outside the interval (−ε, +ε). Equivalent definitions are given below.

1. lim_{n→∞} P[|Xn − x| < ε] = 1 for every ε > 0.

2.
Given ε > 0 and δ > 0, there exists N(ε, δ) such that P[|Xn − x| > ε] < δ for all n > N.

3. P[|Xn − x| < ε] > 1 − δ for all n > N; that is, P[|X_{N+1} − x| < ε] > 1 − δ, P[|X_{N+2} − x| < ε] > 1 − δ, and so on.

We write Xn →p x or plim Xn = x. The sequence of random variables Xn is said to converge in probability to the random variable X if the sequence of differences (Xn − X) converges in probability to 0. Applied to the sample mean, this result is known as the weak law of large numbers.

Definition 27 Convergence in mean of order r: The sequence of random variables Xn is said to converge in mean of order r to X (r ≥ 1), written Xn →(r) X, if E[|Xn − X|^r] exists and lim_{n→∞} E[|Xn − X|^r] = 0, that is, if the rth moment of the difference tends to zero. The most commonly used version is mean-square convergence (r = 2). For example, the sample mean x̄n converges in mean square to μ, because E[(x̄n − μ)²] = Var(x̄n) = σ²/n tends to zero as n goes to infinity.

Definition 28 Almost Sure Convergence: The sequence of random variables Xn is said to converge almost surely to the real number x, written Xn →a.s. x, if P[lim_{n→∞} Xn = x] = 1. In other words, the sequence Xn may fail to converge to x, but only on a set of probability measure zero. More formally, given ε, δ > 0, there exists N such that P[|X_{N+1} − x| < ε, |X_{N+2} − x| < ε, ...] > 1 − δ; that is, the probability of these events jointly occurring can be made arbitrarily close to 1. Xn is said to converge almost surely to the random variable X if (Xn − X) →a.s. 0.

Theorem 35 If Xn →d X and Yn →p c, where c (≠ 0) is a constant, then (a) (Xn + Yn) →d (X + c), and (b) (Xn/Yn) →d (X/c). Note that c is a constant.

Theorem 36 If Xn →p X and Yn →p Y, then (a) (Xn + Yn) →p (X + Y), (b) (XnYn) →p XY, and (c) if Yn ≠ 0 and Y ≠ 0, then (Xn/Yn) →p X/Y.

Theorem 37 If g(·) is a continuous function, then Xn →p X implies that g(Xn) →p g(X).
In other words, convergence in probability is preserved under continuous transformations.

Theorem 38 Convergence in probability implies convergence in distribution, that is, Xn →p X ⟹ Xn →d X, but the converse need not be true.

Theorem 39 Convergence in mean of order r implies convergence in mean of any order less than r, that is, Xn →(r) X ⟹ Xn →(s) X for r > s, but the converse need not be true.

Theorem 40 Convergence in mean of order r ≥ 1 implies convergence in probability, but the converse need not be true.

Theorem 41 Almost sure convergence implies convergence in probability, but the converse need not be true.

Relationships among the modes of convergence:

Xn →a.s. X ⟹ Xn →p X ⟹ Xn →d X
Xn →(r) X ⟹ Xn →(s) X ⟹ Xn →p X  (r > s ≥ 1)

Theorem 42 Xn →a.s. X if and only if P[sup_{j≥n} |Xj − X| > ε] → 0 as n → ∞ for any ε > 0.

Theorem 43 If Σ_{n=1}^∞ P(|Xn − X| > ε) is finite for each ε > 0, then Xn →a.s. X.

Theorem 44 If Σ_{n=1}^∞ E(|Xn − X|^r) is finite for some r > 0, then Xn →a.s. X.

Theorem 45 If the function g(·) is continuous at X and Xn →a.s. X, then g(Xn) →a.s. g(X).

8.2 The Weak Law of Large Numbers

We know that the sample mean approaches the population mean as the sample size becomes large. This is the weak law of large numbers (WLLN), which holds under a variety of different assumptions.

Theorem 46 Khinchin's theorem: Let {Xn, n ≥ 1} be a sequence of iid random variables with finite mean μ, and let X̄n = (Σ_{i=1}^n Xi)/n. Then lim_{n→∞} P[|X̄n − μ| > ε] = 0 or, equivalently, lim_{n→∞} P[|X̄n − μ| ≤ ε] = 1. In other words, plim(X̄n) = μ.

8.3 The Strong Law of Large Numbers

The WLLN states that under certain conditions the sample mean converges in probability to the population mean. We can in fact derive a stronger result, namely that the sample mean converges almost surely to the population mean. This is the strong law of large numbers (SLLN).
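Khinchin's WLLN above lends itself to a quick numerical check. The following is an illustrative sketch, not part of the notes: the exponential population (μ = 1), the tolerance ε = 0.1, the sample sizes, and the number of Monte Carlo replications are all arbitrary assumptions; the estimated probability P(|X̄n − μ| > ε) should shrink toward zero as n grows.

```python
import numpy as np

# Numerical sketch of Khinchin's WLLN (Theorem 46). All choices here are
# illustrative assumptions: an exponential population with mu = 1,
# eps = 0.1, and 5,000 Monte Carlo replications per sample size.
rng = np.random.default_rng(2)
mu, eps, reps = 1.0, 0.1, 5_000

def tail_prob(n):
    """Monte Carlo estimate of P(|Xbar_n - mu| > eps)."""
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    return np.mean(np.abs(xbar - mu) > eps)

p_small, p_large = tail_prob(20), tail_prob(2000)
# p_large should be far below p_small: the tail probability shrinks with n.
```

Note that the population here is deliberately non-normal; the WLLN needs only a finite mean.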
As before, let X1, X2, ..., Xn be a sequence of random variables with E(Xi) = μi < ∞, and let X̄n = (Σ_{i=1}^n Xi)/n and μ̄n = (Σ_{i=1}^n μi)/n. Then under certain conditions we can show that (X̄n − μ̄n) →a.s. 0.

Theorem 47 If the Xi's are iid, then (X̄n − μ̄n) →a.s. 0.

Theorem 48 Kolmogorov's Theorem on the SLLN: If the Xi's are independent with finite variances, and if Σ_{n=1}^∞ Var(Xn)/n² < ∞, then (X̄n − μ̄n) →a.s. 0.

Theorem 49 If the Xi's are iid, then a necessary and sufficient condition for (X̄n − μ̄n) →a.s. 0 is that E|Xi − μi| < ∞ for all i.

8.4 The Central Limit Theorem

Perhaps the most important theorem in large sample theory is the central limit theorem, which states that, under quite general conditions, the standardized mean of a sequence of random variables (such as the sample mean) converges in distribution to a normal distribution even though the population is not normal. Thus, even if we do not know the statistical distribution of the population from which a sample is drawn, with a large sample we can approximate the distribution of the sample mean quite well by the normal distribution.

Theorem 50 Central Limit Theorem: Let X1, X2, ..., Xn be a sequence of random variables, let Sn = Σ_{i=1}^n Xi be their sum, and let X̄n = Sn/n be their mean. Define the standardized mean

Zn = (Sn − E(Sn))/√Var(Sn) = (X̄n − E(X̄n))/√Var(X̄n).  (8.3)

Then, under a variety of alternative assumptions (stated below), Zn →d N(0, 1).

Problems

1. Explain the relationships among the modes of convergence.

2. The service times for customers coming through a checkout counter in a retail store are independent random variables with mean 1.5 minutes and variance 1.0. Approximate the probability that 100 customers can be serviced in less than 2 hours of total service time.
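Problem 2 above can be sketched in code. The CLT approximation uses E(S100) = 150 and Var(S100) = 100, giving Z = (120 − 150)/10 = −3. The problem does not specify the service-time distribution, so, purely as an assumption for the simulation, we use gamma service times with the stated mean and variance; the Monte Carlo answer should then be of the same order as Φ(−3).

```python
import numpy as np
from math import erf, sqrt

# CLT sketch for the checkout problem: P(total time for 100 customers
# < 120 minutes), with E(Sn) = 150 and Var(Sn) = 100, so Z = -3.
def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

normal_approx = phi((120 - 100 * 1.5) / sqrt(100 * 1.0))  # Phi(-3)

# ASSUMPTION: the service-time distribution is not given, so we simulate
# gamma times with mean 1.5 and variance 1.0 (k*theta = 1.5, k*theta^2 = 1).
rng = np.random.default_rng(3)
shape, scale = 1.5 ** 2 / 1.0, 1.0 / 1.5
totals = rng.gamma(shape, scale, size=(100_000, 100)).sum(axis=1)
sim_prob = np.mean(totals < 120)
```

The simulated probability depends slightly on the assumed shape of the service-time distribution, but for any distribution with these two moments the CLT forces it toward Φ(−3) ≈ 0.0013 as the number of customers grows.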
Chapter 9 Estimation and properties of estimators

The formula used to obtain the estimate of a parameter is referred to as an estimator; it is a function of the observations x1, x2, ..., xn, and the numerical value it produces is called an estimate. There are two types of parametric estimation: point estimation and interval estimation.

9.1 Point Estimation

A point estimation procedure uses the information in the sample to arrive at a single number that is intended to be close to the true value of the target parameter in the population. For example, the sample mean

Ȳ = (Σ_{i=1}^n Yi)/n  (9.1)

is one possible point estimator of the population mean μ.

9.1.1 Small Sample Criteria for Estimators

The standard notation for an unknown parameter is θ, an estimator of θ is denoted by θ̂, and the parameter space is denoted by Θ. A function g(θ) is called estimable if there exists a statistic u(x) such that E[u(x)] = g(θ).

Unbiasedness

An estimator θ̂ is called an unbiased estimator of θ if E(θ̂) = θ. If E(θ̂) − θ = b(θ) and b(θ) is nonzero, then b(θ) is called the bias.

Mean Square Error

A commonly used measure of the adequacy of an estimator is E[(θ̂ − θ)²], which is called the mean square error (MSE). It measures how close θ̂ is, on average, to the true θ. It can be decomposed as follows:

MSE = E[(θ̂ − θ)²] = E[(θ̂ − E(θ̂) + E(θ̂) − θ)²] = Var(θ̂) + bias²(θ).  (9.2)

Relative Efficiency

Let θ̂1 and θ̂2 be two alternative estimators of θ. The ratio of their respective MSEs, E[(θ̂1 − θ)²]/E[(θ̂2 − θ)²], is called the relative efficiency of θ̂1 with respect to θ̂2.

Uniformly Minimum Variance Estimators

An estimator θ̂ of θ is called a uniformly minimum variance unbiased (UMVU) estimator if E(θ̂) = θ and, for any other unbiased estimator θ*, Var(θ̂) ≤ Var(θ*) for every θ. Thus, among the class of unbiased estimators, a UMVU estimator has the smallest variance.
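The decomposition MSE = Var + bias² in (9.2) can be verified numerically. A minimal sketch, not from the notes: we compare the two usual variance estimators for a normal sample, with divisor n − 1 (unbiased) and divisor n (biased downward by σ²/n); the population values μ = 0, σ² = 4 and the sample size n = 10 are arbitrary assumptions.

```python
import numpy as np

# Simulation check of MSE = Var + bias^2 from (9.2), using the variance
# estimator with divisor n (biased). Population values are assumptions.
rng = np.random.default_rng(4)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # divisor n-1: E(s^2) = sigma^2
s2_biased = x.var(axis=1, ddof=0)     # divisor n: bias = -sigma^2/n

bias = s2_biased.mean() - sigma2              # should be near -0.4
mse = np.mean((s2_biased - sigma2) ** 2)      # Monte Carlo MSE
decomp = s2_biased.var() + bias ** 2          # Var + bias^2, should equal mse
```

The identity holds exactly for the Monte Carlo quantities themselves (it is the usual variance decomposition about a constant), so `mse` and `decomp` agree up to floating-point error, while `bias` estimates −σ²/n.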
Sufficiency

Definition: Let θ̂ be a sample statistic and θ* any other statistic that is not a function of θ̂, and let f(x; θ) be the density function. θ̂ is said to be a sufficient statistic for θ if and only if the conditional density of θ* given θ̂ is independent of θ, for every choice of θ*. Equivalently, the conditional density of the sample given θ̂, that is f(x1, x2, ..., xn | θ̂), is independent of θ.

Minimal Sufficiency

θ̂ is minimal sufficient if, for any other sufficient statistic θ*, we can find a function h(·) such that θ̂ = h(θ*).

9.1.2 Large Sample Properties of Estimators

Asymptotic Unbiasedness

An estimator is said to be asymptotically unbiased if its bias, E(θ̂n) − θ, tends to zero as the sample size increases.

Consistency

Another desirable property of θ̂ is that, as the sample size n increases, θ̂ should approach the true θ. This property is called consistency. There are three types of consistency.

Simple Consistency

Let θ̂1, θ̂2, ..., θ̂n be a sequence of estimators of θ. The sequence is a simply consistent estimator of θ if, for every ε > 0,

lim_{n→∞} P(|θ̂n − θ| < ε) = 1, θ ∈ Θ.  (9.3)

Thus θ̂n is a simply consistent estimator if plim θ̂n = θ; this is convergence in probability, θ̂n →p θ.

Squared-error Consistency

The sequence (θ̂n) is a squared-error consistent estimator if

lim_{n→∞} E[(θ̂n − θ)²] = 0.  (9.4)

This is convergence in mean square, θ̂n →m.s. θ.

Strong Consistency

θ̂n is said to be strongly consistent if

P[lim_{n→∞} θ̂n = θ] = 1.  (9.5)

This is almost sure convergence, θ̂n →a.s. θ.

Asymptotic Efficiency

Definition: Let θ̂n be a consistent estimator of θ. θ̂n is said to be asymptotically efficient if there is no other consistent estimator θn* for which

limsup_{n→∞} {E[(θ̂n − θ)²]/E[(θn* − θ)²]} > 1  (9.6)

for all θ in some open interval.
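Squared-error consistency can be illustrated for the sample mean, for which E[(X̄n − μ)²] = σ²/n → 0. The sketch below is an illustrative assumption, not part of the notes: the uniform(0, 1) population (μ = 0.5, σ² = 1/12) and the sample sizes are arbitrary choices.

```python
import numpy as np

# Squared-error consistency sketch for the sample mean:
# E[(Xbar_n - mu)^2] = sigma^2/n, which tends to 0 as n grows.
# ASSUMPTION: a uniform(0,1) population, so mu = 0.5 and sigma^2 = 1/12.
rng = np.random.default_rng(5)
mu, reps = 0.5, 10_000

def mean_squared_error(n):
    """Monte Carlo estimate of E[(Xbar_n - mu)^2]."""
    xbar = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
    return np.mean((xbar - mu) ** 2)

mses = [mean_squared_error(n) for n in (10, 100, 1000)]
# Each value should be close to (1/12)/n, shrinking toward zero.
```

Since mean-square convergence implies convergence in probability (Theorem 40), this also illustrates the simple consistency of X̄n.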
Best Asymptotic Normality

The sequence of estimators (θ̂n) is a best asymptotically normal (BAN) estimator if all of the following conditions are satisfied:

1. θ̂n →p θ for every θ ∈ Θ, that is, θ̂n is consistent.
2. The distribution of √n(θ̂n − θ) →d N[0, σ²(θ)], where σ²(θ) = lim Var(θ̂n).
3. There is no other sequence (θn*) that satisfies (1) and (2) and is such that σ²(θ) > σ*²(θ) for every θ in some open interval, where σ*²(θ) = lim Var(θn*).

9.2 Interval Estimation

Instead of obtaining a point estimate of a parameter, we may estimate an interval within which the value of the parameter is contained with some probability. Let X1, X2, ..., Xn be a random sample from f(x; θ), and let T1 = t1(X1, X2, ..., Xn) and T2 = t2(X1, X2, ..., Xn) be two statistics satisfying T1 ≤ T2 for which

Pr(T1 < τ(θ) < T2) = α.  (9.7)

Then (T1, T2) is a 100α percent confidence interval for τ(θ); T1 and T2 are the lower and upper confidence limits, and α is the confidence coefficient.

Pr[T1 < τ(θ)] = α  (9.8)

gives a one-sided lower confidence interval for τ(θ), and

Pr[τ(θ) < T2] = α  (9.9)

gives a one-sided upper confidence interval for τ(θ).

9.2.1 Pivotal-quantity method of finding a CI

Let Q = q(X1, X2, ..., Xn). If the distribution of Q does not depend on θ, Q is defined to be a pivotal quantity. For example, if X1, X2, ..., Xn is a random sample from N(μ, 1), then x̄ − μ is a pivotal quantity, since x̄ − μ ∼ N(0, 1/n), whereas x̄/μ is not, since x̄/μ ∼ N(1, 1/(μ²n)). If Q = q(X1, X2, ..., Xn; θ) is a pivotal quantity with a pdf, then for any fixed α ∈ (0, 1) there exist q1 and q2 such that P(q1 < Q < q2) = α. But P(cq1 < cQ < cq2) = α = P(d + cq1 < d + cQ < d + cq2); that is, the probability of the event {q1 < Q < q2} is unaffected by a change of scale or a translation of Q. Thus, if we know the pdf of Q, it may be possible to use these operations to form the desired confidence interval.
For example, assume that f(x; μ) ≡ N(μ, 1), so that x̄ ∼ N(μ, 1/n) and

Q = (x̄ − μ)/(1/√n) ∼ N(0, 1)  (9.10)

is a pivotal quantity. Then

α = P(q1 < (x̄ − μ)/(1/√n) < q2)
  = P(q1/√n < x̄ − μ < q2/√n)
  = P(q1/√n − x̄ < −μ < q2/√n − x̄)
  = P(x̄ − q2/√n < μ < x̄ − q1/√n),  (9.11)

so (x̄ − q2/√n, x̄ − q1/√n) is a 100α percent confidence interval for μ.  (9.12)

9.2.2 CI for the mean of a normal population

Consider the case where both μ and σ² are unknown. Although

(x̄ − μ)/(σ/√n) ∼ N(0, 1)  (9.13)

is a pivotal quantity, the presence of σ means that we cannot compute a CI from it. We know, however, that

(x̄ − μ)/(s/√n) ∼ t(n−1)  (9.14)

for all σ² > 0. Using the table of the t-distribution, we can always find a number b such that

P(−b < (x̄ − μ)/(s/√n) < b) = α,  (9.15)

which can be written as

P(x̄ − b s/√n < μ < x̄ + b s/√n) = α,  (9.16)

where

b = t((1−α)/2, (n−1)).  (9.17)

Example

Ex1: n = 10, x̄ = 3.22, s = 1.17, α = 0.95. The 95 percent CI for μ is

(3.22 − (2.262)(1.17)/√10, 3.22 + (2.262)(1.17)/√10), or (2.38, 4.06).

9.2.3 CI for the variance of a normal population

We know that

Q = (n − 1)s²/σ² ∼ χ²(n−1),  (9.18)

hence Q is a pivotal quantity. From P(q1 < Q < q2) = α,

α = P(q1 < (n − 1)s²/σ² < q2) = P((n − 1)s²/q2 < σ² < (n − 1)s²/q1),  (9.19)

so ((n − 1)s²/q2, (n − 1)s²/q1) is a 100α percent CI for σ².

Example

Ex2: n = 10, x̄ = 3.22, s = 1.17, α = 0.95. The 95 percent CI for σ² is

((n − 1)s²/χ²((1−α)/2, (n−1)), (n − 1)s²/χ²((1+α)/2, (n−1))).

With χ²(0.025, (9)) = 19.02 and χ²(0.975, (9)) = 2.70, the interval is (0.65, 4.56).

9.3 Problems

1. Suppose that Y1, Y2, ..., Yn denote a random sample from the density function

f(yi) = (1/θ) e^{−yi/θ} for yi > 0, and 0 elsewhere,

with mean θ. Find the MLE of the population variance θ².

2. Suppose that Y is normally distributed with mean 0 and unknown variance σ². Then Y²/σ² has a χ² distribution with 1 degree of freedom.
Use this distribution to find:

(a) a 95 percent confidence interval for σ²;
(b) a 95 percent upper confidence limit for σ²;
(c) a 95 percent lower confidence limit for σ².

3. Suppose that E(θ̂1) = E(θ̂2) = θ, Var(θ̂1) = σ1², and Var(θ̂2) = σ2². A new unbiased estimator θ̂3 is to be formed by θ̂3 = aθ̂1 + (1 − a)θ̂2.

(a) How should the constant a be chosen in order to minimise the variance of θ̂3? Assume that θ̂1 and θ̂2 are independent.
(b) How should the constant a be chosen in order to minimise the variance of θ̂3? Assume that θ̂1 and θ̂2 are not independent but are such that Cov(θ̂1, θ̂2) = c ≠ 0.

Chapter 10 Tests of statistical hypotheses

The testing of statistical hypotheses about the unknown parameters of a probability model is one of the most important steps in any empirical study. Three test situations can be mentioned: tests between alternative models, for drawing conclusions that are not model sensitive; tests of the effects of policy changes; and tests of the validity of an economic theory.

10.1 Basic Concepts in Hypothesis Testing

Consider a family of distributions represented by the density function f(x; θ), θ ∈ Θ. The term hypothesis stands for a statement or conjecture regarding the values that θ might take. The testing of a hypothesis consists of three basic steps: 1) formulate two opposing hypotheses, 2) derive a test statistic and identify its sampling distribution, and 3) derive a decision rule and choose one of the opposing hypotheses.

10.1.1 Null and Alternative Hypotheses

A hypothesis can be thought of as a binary partition of the parameter space Θ into two sets Θ0 and Θ1 such that Θ0 ∩ Θ1 = ∅ and Θ0 ∪ Θ1 = Θ. The set Θ0, which corresponds to the statement of the hypothesis, is called the null hypothesis and denoted by H0; Θ1, the class of alternatives to the null hypothesis, is called the alternative hypothesis and denoted by H1.
10.1.2 Simple and Composite Hypotheses

If the null hypothesis is of the form H0: θ = θ0 and the alternative is H1: θ = θ1, then we have a simple hypothesis and a simple alternative. If either H0 or H1 specifies a range of values for θ (for example, H1: θ ≠ θ0), then we have a composite hypothesis. With a simple hypothesis against a simple alternative, the problem reduces to one of choosing between the two density functions f(x; θ0) and f(x; θ1).

10.1.3 Statistical Test

A decision rule that selects one of the inferences "accept the null hypothesis" or "reject the null hypothesis" is called a statistical test, or simply a test. A test procedure is usually described by a sample statistic T(x) = T(x1, x2, ..., xn), which is called the test statistic. The range of values of T for which the test procedure recommends rejection of the hypothesis is called the critical region, and the range for which it recommends acceptance is called the acceptance region.

10.1.4 Type I and Type II Errors

In performing a test one may arrive at the correct decision or commit one of two types of errors, labelled Type I and Type II:

Type I error: rejecting H0 when it is true.
Type II error: accepting H0 when it is false.

10.1.5 Power of a Test

The probability of rejecting the null hypothesis H0 under a given test procedure is called the power of the test. This probability obviously depends on the value of the parameter θ about which the hypothesis is formulated; it is a function of θ. This power function is denoted by π(θ).

10.1.6 Operating Characteristic

The probability of accepting the null hypothesis is known as the operating characteristic and is represented by 1 − π(θ). This concept is widely used in statistical quality control theory.

10.1.7 Level of Significance and the Size of a Test

When θ is in Θ0, π(θ) gives the probability of a Type I error.
This probability, denoted by P(I), will also depend on θ. The maximum value of P(I) over θ ∈ Θ0 is called the level of significance of the test, denoted by α. It is also known as the size of the test. Thus,

α = max_{θ∈Θ0} P(I) = max_{θ∈Θ0} π(θ).  (10.1)

The level of significance is hence the largest probability of a Type I error. The commonly used sizes are 0.01, 0.05, and 0.10. The probability of a Type II error is denoted by β(θ); it is readily seen to be 1 − π(θ) when θ ∈ Θ1. Thus,

β(θ) = P(II) = 1 − π(θ), θ ∈ Θ1.  (10.2)

Ideally we would want to keep both P(I) and P(II) to a minimum no matter what the value of θ is. But this is impossible, because an attempt to reduce P(I) generally increases P(II). For instance, the decision rule "always reject H0", regardless of x, has P(II) = 0 but P(I) = 1 if θ ∈ Θ0. Similarly, the rule "always accept H0" implies that P(I) = 0 but P(II) = 1 when θ ∈ Θ1. Thus for some values of θ one decision rule will be better than another. The classical decision procedure chooses an acceptable value for α and then selects a decision rule (that is, a test procedure) that minimizes P(II). In other words, given α, among the class of decision rules for which P(I) ≤ α, choose the one for which P(II) is minimized or, equivalently, for which π(θ) is maximized. Thus the test procedure selects the decision rule that maximizes π(θ) subject to P(I) ≤ α. Such a test is called a most powerful (MP) test. If the critical region obtained this way is independent of the alternative H1, then we have a uniformly most powerful (UMP) test.

Problems

1. The output voltage for a certain electric circuit is specified to be 130. A sample of 40 independent readings on the voltage for this circuit gave a sample mean of 128.6 and a standard deviation of 2.1. Test the hypothesis that the average output voltage is 130 against the alternative that it is less than 130. Use a test with level 0.05.

2.
Let Y1, Y2, ..., Yn be a random sample of size n = 20 from a normal distribution with unknown mean μ and known variance σ² = 5. We wish to test H0: μ = 7 versus H1: μ > 7.

(a) Find the uniformly most powerful test with significance level 0.05.
(b) For the test in (a), find the power at each of the following alternative values for μ: μ1 = 7.5, μ1 = 8.0, μ1 = 8.5, and μ1 = 9.0.

Chapter 11 Examination 1

11.1 Definition Questions

Define the following statistical terms.

1. (5 marks) a. (3) Statistical modelling. b. (2) Variable and random variable.
2. (5 marks) The Bayes Theorem.
3. (5 marks) a. The Weak Law of Large Numbers. b. The Strong Law of Large Numbers.
4. (5 marks) The Central Limit Theorem.
5. (5 marks) a. The property of unbiasedness of an estimator. b. The property of sufficiency of an estimator.

11.2 Calculation questions

Compute the following values.

6. (10 marks) Let X1, X2, ..., Xn be a random sample with the probability density function

f(x) = 2x⁻³ for x > 1.

Compute E(X) and Median(X).

7. (10 marks) Suppose a manufacturer of TV tubes draws a random sample of 10 tubes. The probability that a single tube is defective is 10 percent. Calculate a. the probability of having exactly 3 defective tubes, and b. the probability of having no more than 2 defectives.

8. (10 marks) Let A and B be two events such that P(A ∪ B) = 0.9, P(A | B) = 0.625, and P(A | B̄) = 0.5. Calculate P(A).

9. (10 marks) The ages of a group of executives attending a convention are uniformly distributed between 35 and 65 years. If X denotes age in years, the probability density function is

f(x) = 1/30 for 35 < X < 65, and 0 otherwise.

(2) a. Draw the probability density function for this random variable.
(2) b. Find and draw the cumulative distribution function for this random variable.
(3) c. Find the probability that the age of a randomly chosen executive in this group is between 40 and 50 years.
(4) d.
Find the mean age of the executives in the group.

11.3 Discussion questions

Determine whether the following statements are true, false, or uncertain. Full marks require an explanation.

10. (5 marks) The probability of the union of the events A and B can be written P(A ∪ B) = P(A) + P(B)[1 − P(A | B)].

11. (5 marks) In a sample, the observations of a random variable can be seen as degenerated values of its marginal distribution (not all equal constants to each other, of course).

12. (5 marks) Consider a bivariate sample of size n drawn independently from a population. Independence in this process runs across the n observations, not within each observation.

13. (5 marks) Unbiasedness is related to the number of observations in each sample, while consistency is related to the number of samples.

11.4 Multiple choice questions

Select the correct answers to the following questions and write them in your paper.

14. (5 marks) If P(A ∩ B) ≥ 0, then the following inequality can be written:
a. P(A ∪ B) ≥ P(A) + P(B)
b. P(A) + P(B) ≥ P(A ∪ B)
c. P(A ∪ B) − P(A) ≥ P(B)
d. P(A) ≤ P(A ∪ B) − P(B).

15. (5 marks) For a negatively skewed distributed random variable X, the following can be written:
a. Mean(X) < Median(X)
b. Mean(X) > Median(X)
c. Mean(X) > Mode(X)
d. Median(X) > Mode(X).

16. (5 marks) For a conditional probability P(A|B), the following expressions can be written:
a. P(A|B) = P(A ∩ B|B)
b. P(A|B) = P(A ∩ B|B) + P(A ∩ B̄|B)
c. P(A|B) > P(A ∩ B̄|B)
d. All of the above.

Chapter 12 Examination 2

12.1 Definition Questions

Define the following statistical terms.

1. (5 marks) a. Statistical model. b. Probability space.
2. (5 marks) Axiomatic definition of probability.
3. (5 marks) a. Limiting distribution of the sample mean, X̄. b. Approximate distribution of the sample mean, X̄.
4. (5 marks) Central Limit Theorem.
5. (5 marks) a. The property of sufficiency of an estimator. b.
Convergence in probability.

12.2 Calculation questions

Compute the following values.

6. (10 marks) Let X1, X2, ..., Xn be a random sample from a Poisson distribution with probability function

f(x) = e^{−λ} λ^x / x! for λ > 0 and x = 0, 1, 2, ...

Determine λ by using the maximum likelihood estimation method and discuss the estimator's unbiasedness and sufficiency. Note that E(X) = Var(X) = λ for a Poisson distributed variable X.

7. (10 marks) The following conditional density function is given:

f(y | x) = c1 y / x² for 0 ≤ x ≤ 1 and 0 ≤ y ≤ x.

Calculate a. c1, and b. P[(1/4) < Y < (1/2) | X = (5/8)].

8. (10 marks) From the frequency table below for the random variables X and Y, calculate a. E(X | Y = 2) and b. E(Y | X = 40).

Frequency   Y = 1   Y = 2   Y = 3   Y = 4   Total
X = 20        218     302     198     660    1378
X = 30        125     411     305     310    1151
X = 40        201     256     287     327    1071
Total         544     969     790    1297    3600

12.3 Discussion questions

Determine whether the following statements are true, false, or uncertain. Full marks require an explanation.

9. (5 marks) The following equality is correct: F(y) = P(Y ≤ y | X ≤ x)F(x) + P(Y ≤ y | X > x)(1 − F(x)).

10. (5 marks) One can try to minimise the probability of committing a Type II error after choosing an acceptable value for the probability of a Type I error in the testing procedure.

11. (5 marks) The Cramer-Rao Inequality establishes a lower bound for the variance of an unbiased estimator of θ. However, it does not necessarily imply that the variance of the minimum variance unbiased estimator of θ has to be equal to the Cramer-Rao Lower Bound.

12. (5 marks) Non-random samples may not be used in scientific inference.

12.4 Multiple choice questions

Select the correct answers to the following questions and write them in your paper.

13. (5 marks) Choose the correct expression(s) among the following:
a. Use harmonic means for ratios
b. Use arithmetic means for values including extremely low or high values
c. Use geometric means for proportions
d. None of the above.

14.
(5 marks) A ∩ (B ∪ C) can be expressed as:
a. (A ∪ B) ∩ (A ∪ C)
b. (A ∪ B) ∪ (A ∪ C)
c. (A ∩ B) ∪ (A ∩ C)
d. (A ∩ B) ∩ (A ∩ C).

15. (5 marks) For a positively skewed distributed random variable X, the following can be written:
a. Mean(X) < Median(X)
b. Mean(X) > Median(X)
c. Mean(X) < Mode(X)
d. Median(X) < Mode(X).

16. (5 marks) For a pair of correlated random variables X and Y, the following relationship is correct:
a. Var(Y) < Var(X)Cov(X, Y)
b. Cov²(X, Y) < Var(X)Var(Y)
c. Cov(X, Y) < Var(X)Var(Y)
d. Cov²(X, Y) > Var(X)Var(Y).

17. (5 marks) For jointly distributed random variables X and Y, the following can be written:
a. F(y) = FXY(∞, y)
b. FXY(x, −∞) = 0
c. FXY(x, ∞) = F(x)
d. All of the above.