September Statistics for MSc
Weeks 1 - 2
Probability and Distribution Theories

Ali C. Tasiran
Department of Economics, Mathematics and Statistics
Malet Street, London WC1E 7HX
September 2014

Contents

1 Introduction
  1.1 Textbooks
  1.2 Some preliminaries
  Problems

2 Probability
  2.1 Probability definitions and concepts
    2.1.1 Classical definition of probability
    2.1.2 Frequency definition of probability
    2.1.3 Subjective definition of probability
    2.1.4 Axiomatic definition of probability
  Problems

3 Random variables and probability distributions
  3.1 Random variables, densities, and cumulative distribution functions
    3.1.1 Discrete Distributions
    3.1.2 Continuous Distributions
    3.1.3 Example
  3.2 Problems

4 Expectations and moments
  4.1 Mathematical Expectation and Moments
    4.1.1 Mathematical Expectation
    4.1.2 Moments
  Problems

Chapter 1
Introduction

1.1 Textbooks

Lecture notes are provided. However, these are not a substitute for a textbook. I do not recommend any particular text, but in the past students have found the following useful.

• Greene, W. H. (2004) Econometric Analysis, 5th edition, Prentice-Hall. A good summary of much of the material can be found in the Appendix.
• Hogg, R. V. and Craig, A. T. (1995) Introduction to Mathematical Statistics, 5th edition, Prentice Hall. A popular textbook, even though it is slightly dated.
• Mittelhammer, R. C. (1999) Mathematical Statistics for Economics and Business, Springer Verlag. A good mathematical statistics textbook for economists; it is especially useful for further econometric studies.
• Mood, A. M., Graybill, F. A., and Boes, D. C. (1974) Introduction to the Theory of Statistics, 3rd edition, McGraw-Hill.
• Spanos, A. (1999) Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press.
• Wackerly, D., Mendenhall, W., and Scheaffer, R. (1996) Mathematical Statistics with Applications, 5th edition, Duxbury Press.

Those who plan to take forthcoming courses in Econometrics may buy the book by Greene (2004).

Welcome to this course.
Ali Tasiran
[email protected]

1.2 Some preliminaries

Statistics is the science of observing data and making inferences about the characteristics of the random mechanism that has generated the data. It is also called the science of uncertainty.

In Economics, theoretical models are used to analyze economic behavior. Economic theoretical models are deterministic functions, but in the real world the relationships are not exact and deterministic; rather, they are uncertain and stochastic. We thus employ distribution functions to approximate the actual processes that generate the observed data. The process that generates the data is known as the data generating process (DGP or Super Population).
In Econometrics, to study economic relationships, we estimate statistical models, which are built under the guidance of theoretical economic models and by taking into account the properties of the data generating process. Using the parameters of estimated statistical models, we make generalisations about the characteristics of the random mechanism that has generated the data. In Econometrics, we use the observed data in samples to draw conclusions about populations. Populations are either real, from which the data came, or conceptual, as processes by which the data were generated. Inference in the first case is called design-based (for experimental data) and is used mainly to study samples from populations with known frames. Inference in the second case is called model-based (for observational data) and is used mainly to study stochastic relationships.

The statistical theory used for such analyses is called Classical inference, and it is the one followed in this course. It is based on two premises:

1. The sample data constitute the only relevant information.
2. The construction and assessment of the different procedures for inference are based on long-run behavior under similar circumstances.

The starting point of an investigation is an experiment. An experiment is a random experiment if it satisfies the following conditions:

- all possible distinct outcomes are known ahead of time,
- the outcome of a particular trial is not known a priori,
- the experiment can be duplicated.

The totality of all possible outcomes of the experiment is referred to as the sample space (denoted by $S$), and its distinct individual elements are called the sample points or elementary events. An event is a subset of a sample space, that is, a set of sample points that represents several possible outcomes of an experiment.

A sample space with a finite or countably infinite number of sample points (with a one-to-one correspondence to the positive integers) is called a discrete space.
A continuous space is one with an uncountably infinite number of sample points (that is, it has as many elements as there are real numbers).

Events are generally represented by sets, and some important concepts can be explained by using the algebra of sets (known as Boolean Algebra).

Definition 1 The sample space is denoted by $S$. $A = S$ implies that the events in $A$ must always occur. The empty set is a set with no elements and is denoted by $\emptyset$. $A = \emptyset$ implies that the events in $A$ do not occur.

The set of all elements not in $A$ is called the complement of $A$ and is denoted by $\bar{A}$. Thus, $\bar{A}$ occurs if and only if $A$ does not occur.

The set of all points in either a set $A$ or a set $B$ or both is called the union of the two sets and is denoted by $\cup$. $A \cup B$ means that either the event $A$ or the event $B$ or both occur. Note: $A \cup \bar{A} = S$.

The set of all elements in both $A$ and $B$ is called the intersection of the two sets and is represented by $\cap$. $A \cap B$ means that both the events $A$ and $B$ occur simultaneously. $A \cap B = \emptyset$ means that $A$ and $B$ cannot occur together; $A$ and $B$ are then said to be disjoint or mutually exclusive. Note: $A \cap \bar{A} = \emptyset$.

$A \subset B$ means that $A$ is contained in $B$ or that $A$ is a subset of $B$, that is, every element of $A$ is an element of $B$. In other words, if an event $A$ has occurred, then $B$ must have occurred also.

Sometimes it is useful to divide the elements of a set $A$ into several subsets that are disjoint. Such a division is known as a partition. If $A_1$ and $A_2$ form such a partition, then $A_1 \cap A_2 = \emptyset$ and $A_1 \cup A_2 = A$. This can be generalized to $n$ subsets: $A = \bigcup_{i=1}^{n} A_i$ with $A_i \cap A_j = \emptyset$ for $i \neq j$.

Some postulates according to the Boolean Algebra:

Identity: There exist unique sets $\emptyset$ and $S$ such that, for every set $A$, $A \cap S = A$ and $A \cup \emptyset = A$.
Complementation: For each $A$ we can define a unique set $\bar{A}$ such that $A \cap \bar{A} = \emptyset$ and $A \cup \bar{A} = S$.
Closure: For every pair of sets $A$ and $B$, we can define unique sets $A \cup B$ and $A \cap B$.
Commutative: $A \cup B = B \cup A$; $A \cap B = B \cap A$.
Associative: $(A \cup B) \cup C = A \cup (B \cup C)$. Also $(A \cap B) \cap C = A \cap (B \cap C)$.
Distributive: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$. Also $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.
De Morgan's Laws: $\overline{A \cup B} = \bar{A} \cap \bar{B}$ and $\overline{A \cap B} = \bar{A} \cup \bar{B}$.

Problems

1. Let the set $S$ contain the ordered combinations of sexes of two children, $S = \{FF, FM, MF, MM\}$. Let $A$ denote the subset of possibilities containing no males, $B$ the subset of two males, and $C$ the subset containing at least one male. List the elements of $A$, $B$, $C$, $A \cap B$, $A \cup B$, $A \cap C$, $A \cup C$, $B \cap C$, $B \cup C$, and $C \cap \bar{B}$.

2. Verify De Morgan's Laws by drawing Venn diagrams: $\overline{A \cup B} = \bar{A} \cap \bar{B}$ and $\overline{A \cap B} = \bar{A} \cup \bar{B}$.

Chapter 2
Probability

2.1 Probability definitions and concepts

2.1.1 Classical definition of probability

If an experiment has $n$ ($n < \infty$) mutually exclusive and equally likely outcomes, and if $n_A$ of these outcomes have an attribute $A$ (that is, the event $A$ occurs in $n_A$ possible ways), then the probability of $A$ is $n_A/n$, denoted as $P(A) = n_A/n$.

2.1.2 Frequency definition of probability

Let $n_A$ be the number of times the event $A$ occurs in $n$ trials of an experiment. If there exists a real number $p$ such that $p = \lim_{n \to \infty} (n_A/n)$, then $p$ is called the probability of $A$ and is denoted as $P(A)$. (Examples are histograms for the frequency distribution of variables.)

2.1.3 Subjective definition of probability

Subjective probabilities are our personal judgments of the relative likelihood of various outcomes. They are based on our "educated guesses" or intuitions, for example: "The weather will be rainy tomorrow with probability 0.6."

2.1.4 Axiomatic definition of probability

The probability of an event $A \in \mathcal{F}$ is a real number such that

1) $P(A) \geq 0$ for every $A \in \mathcal{F}$,
2) the probability of the entire sample space $S$ is 1, that is, $P(S) = 1$, and
3) if $A_1, A_2, \ldots, A_n$ are mutually exclusive events (that is, $A_i \cap A_j = \emptyset$ for all $i \neq j$), then $P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_i P(A_i)$, and this holds for $n = \infty$ also,

where $\mathcal{F}$ is a collection of subsets of the sample space $S$.
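
As a small illustration of the set algebra of Chapter 1 and the classical definition in 2.1.1, the following sketch represents events as Python sets. The die example, the events `A` and `B`, and the helper names `complement` and `prob` are assumptions made for illustration; they do not come from the notes.

```python
from fractions import Fraction

# Hypothetical experiment: one roll of a fair six-sided die.
S = frozenset(range(1, 7))   # sample space
A = frozenset({2, 4, 6})     # event "even number"
B = frozenset({4, 5, 6})     # event "greater than 3"


def complement(event, space=S):
    """Complement of an event relative to the sample space."""
    return space - event


def prob(event, space=S):
    """Classical probability n_A / n for equally likely outcomes."""
    return Fraction(len(event), len(space))


# De Morgan's laws, checked on this particular pair of events.
de_morgan_union = complement(A | B) == (complement(A) & complement(B))
de_morgan_intersection = complement(A & B) == (complement(A) | complement(B))

# Axioms 1 to 3 on this example: nonnegativity, P(S) = 1, and
# additivity for the disjoint events A and its complement.
p_A = prob(A)
p_S = prob(S)
additive = prob(A | complement(A)) == prob(A) + prob(complement(A))

print(de_morgan_union, de_morgan_intersection, p_A, p_S, additive)
```

Using `Fraction` keeps the probabilities exact, so the equalities can be checked without rounding error. A check on one pair of events is of course not a proof of the laws.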
The triple $(S, \mathcal{F}, P(\cdot))$ is referred to as the probability space, and $P(\cdot)$ is a probability measure.

We can derive the following theorems by using the axiomatic definition of probability.

Theorem 2 $P(\bar{A}) = 1 - P(A)$.

Theorem 3 $P(A) \leq 1$.

Theorem 4 $P(\emptyset) = 0$.

Theorem 5 If $A \subset B$, then $P(A) \leq P(B)$.

Theorem 6 $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Definition 7 Let $A$ and $B$ be two events in a probability space $(S, \mathcal{F}, P(\cdot))$ such that $P(B) > 0$. The conditional probability of $A$ given that $B$ has occurred, denoted by $P(A \mid B)$, is given by $P(A \cap B)/P(B)$. (It should be noted that the original probability space $(S, \mathcal{F}, P(\cdot))$ remains unchanged even though we focus our attention on the subspace $(S, \mathcal{F}, P(\cdot \mid B))$.)

Theorem 8 Bonferroni's Theorem: Let $A$ and $B$ be two events in a sample space $S$. Then $P(A \cap B) \geq 1 - P(\bar{A}) - P(\bar{B})$.

Theorem 9 Bayes Theorem: If $A$ and $B$ are two events with positive probabilities, then

$$P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}.$$

Law of total probability Assume that $S = A_1 \cup A_2 \cup \cdots \cup A_n$ where $A_i \cap A_j = \emptyset$ for $i \neq j$. Then for any event $B \subset S$,

$$P(B) = \sum_{i=1}^{n} P(A_i)\, P(B \mid A_i).$$

Theorem 10 Extended Bayes Theorem: If $A_1, A_2, \ldots, A_n$ constitute a partition of the sample space, so that $A_i \cap A_j = \emptyset$ for $i \neq j$ and $\bigcup_i A_i = S$, and $P(A_i) \neq 0$ for every $i$, then for a given event $B$ with $P(B) > 0$,

$$P(A_i \mid B) = \frac{P(A_i)\, P(B \mid A_i)}{\sum_i P(A_i)\, P(B \mid A_i)}.$$

Definition 11 Two events $A$ and $B$ with positive probabilities are said to be statistically independent if and only if $P(A \mid B) = P(A)$. Equivalently, $P(B \mid A) = P(B)$ and $P(A \cap B) = P(A)P(B)$.

The other type of statistical inference is called Bayesian inference, where sample information is combined with prior information. The prior information is expressed as a probability distribution known as the prior distribution. When it is combined with the sample information, a posterior distribution of the parameters is obtained. It can be derived by using Bayes Theorem.
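
The law of total probability and the Extended Bayes Theorem can be checked numerically on a small partition. The sketch below is only an illustration: the three-event partition and all of its probabilities are made-up numbers, and the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical partition A1, A2, A3 of the sample space with prior
# probabilities P(Ai), and conditional probabilities P(B | Ai).
prior = {"A1": Fraction(1, 2), "A2": Fraction(3, 10), "A3": Fraction(1, 5)}
likelihood = {"A1": Fraction(1, 100), "A2": Fraction(2, 100), "A3": Fraction(3, 100)}


def total_probability(prior, likelihood):
    """P(B) = sum_i P(Ai) P(B | Ai)."""
    return sum(prior[a] * likelihood[a] for a in prior)


def posterior(prior, likelihood):
    """P(Ai | B) = P(Ai) P(B | Ai) / sum_i P(Ai) P(B | Ai)."""
    p_b = total_probability(prior, likelihood)
    return {a: prior[a] * likelihood[a] / p_b for a in prior}


p_b = total_probability(prior, likelihood)
post = posterior(prior, likelihood)

print(p_b)    # 17/1000
print(post)   # posterior probabilities 5/17, 6/17, 6/17
```

The posterior probabilities sum to one because the denominator is exactly $P(B)$ from the law of total probability.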
If we substitute Model (the model that generated the observed data) for $A$ and Data (the observed data) for $B$, then we have

$$P(\text{Model} \mid \text{Data}) = \frac{P(\text{Data} \mid \text{Model})\, P(\text{Model})}{P(\text{Data})} \qquad (2.1)$$

where $P(\text{Data} \mid \text{Model})$ is the probability of observing the data given that the Model is true; this is usually called the likelihood (sample information). $P(\text{Model})$ is the probability that the Model is true before observing the data (usually called the prior probability). $P(\text{Model} \mid \text{Data})$ is the probability that the Model is true after observing the data (usually called the posterior probability). $P(\text{Data})$ is the unconditional probability of observing the data (whether the Model is true or not). Hence, the relation can be written

$$P(\text{Model} \mid \text{Data}) \propto P(\text{Data} \mid \text{Model})\, P(\text{Model}) \qquad (2.2)$$

That is, the posterior probability is proportional to the likelihood (sample information) times the prior probability.

The inverse of an estimator's variance is called the precision. In Classical Inference, we use only the variances of the parameter estimates, but in Bayesian Inference we have both sample precision and prior precision. Moreover, the precision (or inverse of the variance) of the posterior distribution of a parameter is the sum of the sample precision and the prior precision. For example, the posterior mean will lie between the sample mean and the prior mean, and the posterior variance will be less than both the sample and the prior variances. These are the reasons behind the increasing popularity of Bayesian Inference in practical econometric applications.

When we speak in econometrics of models to be estimated or tested, we refer to sets of DGPs in the Classical Inference context. In design-based inference, we restrict our attention to a particular sample size and characterize a DGP by the law of probability that governs the random variables in a sample of that size.
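
The statement above that posterior precision is the sum of sample precision and prior precision can be made concrete under one specific assumption: a normal prior for a mean parameter combined with a normally distributed sample mean whose variance is known. That setting is an assumption of this sketch, not something the notes specify, and the numbers and the function name `normal_posterior` are hypothetical.

```python
from fractions import Fraction


def normal_posterior(prior_mean, prior_var, sample_mean, sample_var):
    """Combine a normal prior with a normal sample mean (known variance).

    sample_var is the variance of the sample mean (sigma^2 / n).
    Returns (posterior mean, posterior variance).
    """
    prior_precision = 1 / Fraction(prior_var)
    sample_precision = 1 / Fraction(sample_var)
    # Posterior precision is the sum of the two precisions.
    post_precision = prior_precision + sample_precision
    # Posterior mean is the precision-weighted average of the two means.
    post_mean = (prior_precision * prior_mean
                 + sample_precision * sample_mean) / post_precision
    return post_mean, 1 / post_precision


# Hypothetical numbers: prior N(0, 4); sample mean 2 with variance 1.
post_mean, post_var = normal_posterior(0, 4, 2, 1)
print(post_mean, post_var)   # 8/5 4/5
```

In this example the posterior mean 8/5 lies between the prior mean 0 and the sample mean 2, and the posterior variance 4/5 is smaller than both 4 and 1, in line with the text.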
In model-based inference, where we refer to a limiting process in which the sample size goes to infinity, such a restricted characterization will no longer suffice. When we indulge in asymptotic theory, the DGPs in question must be stochastic processes. A stochastic process is a collection of random variables indexed by some suitable index set. This index set may be finite, in which case we have no more than a vector of random variables, or it may be infinite, with either a discrete or a continuous infinity of elements.

In order to define a DGP, we must be able to specify the joint distribution of the set of random variables corresponding to the observations contained in a sample of arbitrarily large size. This is a very strong requirement. In econometrics, or any other empirical discipline for that matter, we deal with finite samples. How then can we, even theoretically, treat infinite samples? We must in some way create a rule that allows one to generalize from finite samples to an infinite stochastic process. Unfortunately, for any observational framework, there is an infinite number of ways in which such a rule can be constructed, and different rules can lead to widely different asymptotic conclusions.

In the process of estimating an econometric model, what we are doing is trying to obtain some estimated characterization of the DGP that actually did generate the data. Let us denote an econometric model that is to be estimated, tested, or both, as $M$ and a typical DGP belonging to $M$ as $\mu$.

The simplest model in econometrics is the linear regression model. One possibility is to write

$$y = X\beta + u, \qquad u \sim N(0, \sigma^2 I_n) \qquad (2.3)$$

where $y$ and $u$ are $n$-vectors and $X$ is a nonrandom $n \times k$ matrix, so that $y$ follows the $N(X\beta, \sigma^2 I_n)$ distribution. This distribution is unique if the parameters $\beta$ and $\sigma^2$ are specified. We may therefore say that the DGP is completely characterized by the model parameters.
In other words, knowledge of the model parameters $\beta$ and $\sigma^2$ uniquely identifies an element $\mu$ of $M$.

On the other hand, the linear regression model can also be written as

$$y = X\beta + u, \qquad u \sim IID(0, \sigma^2 I_n) \qquad (2.4)$$

with no assumption of normality. Many aspects of the theory of linear regressions are just as applicable: the OLS estimator is unbiased, and its covariance matrix is $\sigma^2 (X'X)^{-1}$. But the distribution of the vector $u$, and hence also that of $y$, is now only partially characterized even when $\beta$ and $\sigma^2$ are known. For example, the errors $u$ could be skewed to the left or to the right, and could have fourth moments larger or smaller than $3\sigma^4$. Let us call the sets of DGPs associated with these regressions $M_1$ and $M_2$, respectively, $M_1$ being in fact a proper subset of $M_2$. For a given $\beta$ and $\sigma^2$ there is an infinite number of DGPs in $M_2$ (only one of which is in $M_1$) that all correspond to the same $\beta$ and $\sigma^2$. Thus we must consider these models as different models even though the parameters used in them are the same. In either case, it must be possible to associate a parameter vector in a unique way with any DGP $\mu$ in the model $M$, even if the same parameter vector is associated with many DGPs.

We call the model $M$ with its associated parameter-defining mapping $\theta$ a parametrized model.

The main task in our practical work is to build the association between the DGPs of a model and the model parameters. For example, in the Generalized Method of Moments (GMM) context, there are many possible ways of choosing the econometric model, i.e., the underlying set of DGPs. One of the advantages of GMM as an estimation method is that it permits models which consist of a very large number of DGPs. In striking contrast to Maximum Likelihood estimation, where the model must be completely specified, any DGP is admissible if it satisfies a relatively small number of restrictions or regularity conditions.
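
The claim that OLS remains unbiased in model (2.4) without normality can be illustrated exactly on a tiny example. The sketch below assumes a simple regression with an intercept, three fixed regressor values, and a skewed two-point error distribution with mean zero; all of these choices, and the helper name `ols`, are hypothetical. Because the error distribution is finite, the expectation of the estimator is computed by enumerating every possible error vector rather than by simulation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical design: y_i = beta0 + beta1 * x_i + u_i, with fixed x.
x = [Fraction(0), Fraction(1), Fraction(2)]
beta0, beta1 = Fraction(1), Fraction(2)

# Skewed IID errors: u = -1 with prob. 2/3 and u = 2 with prob. 1/3,
# so E(u) = 0 and Var(u) = sigma^2 = 2, but u is clearly not normal.
error_dist = {Fraction(-1): Fraction(2, 3), Fraction(2): Fraction(1, 3)}


def ols(xs, ys):
    """Intercept and slope of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


mean_b0 = mean_b1 = second_b1 = Fraction(0)
for errors in product(error_dist, repeat=len(x)):
    # Probability of this error vector under independence.
    weight = Fraction(1)
    for u in errors:
        weight *= error_dist[u]
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, errors)]
    b0, b1 = ols(x, y)
    mean_b0 += weight * b0
    mean_b1 += weight * b1
    second_b1 += weight * b1 ** 2

var_b1 = second_b1 - mean_b1 ** 2
print(mean_b0, mean_b1, var_b1)   # 1 2 1
```

Here the exact expectations equal the true coefficients, and the slope variance equals $\sigma^2 / \sum_i (x_i - \bar{x})^2 = 2/2 = 1$, which is the corresponding diagonal element of $\sigma^2 (X'X)^{-1}$ for this design.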
Sometimes, the existence of the moments used to define the parameters is the only requirement needed for a model to be well defined.

Problems

1. A sample space consists of five simple events $E_1$, $E_2$, $E_3$, $E_4$, and $E_5$.
(a) If $P(E_1) = P(E_2) = 0.15$, $P(E_3) = 0.4$ and $P(E_4) = 2P(E_5)$, find the probabilities of $E_4$ and $E_5$.
(b) If $P(E_1) = 3P(E_2) = 0.3$, find the probabilities of the remaining simple events if you know that the remaining events are equally probable.

2. A business office orders paper supplies from one of three vendors, $V_1$, $V_2$, and $V_3$. Orders are to be placed on two successive days, one order per day. Thus $(V_2, V_3)$ might denote that vendor $V_2$ gets the order on the first day and vendor $V_3$ gets the order on the second day.
(a) List the sample points in this experiment of ordering paper on two successive days.
(b) Assume the vendors are selected at random each day and assign a probability to each sample point.
(c) Let $A$ denote the event that the same vendor gets both orders and $B$ the event that $V_2$ gets at least one order. Find $P(A)$, $P(B)$, $P(A \cap B)$, and $P(A \cup B)$ by summing the probabilities of the sample points in these events.

Chapter 3
Random variables and probability distributions

3.1 Random variables, densities, and cumulative distribution functions

A random variable $X$ is a function whose domain is the sample space and whose range is a set of real numbers.

Definition 12 In simple terms, a random variable (also referred to as a stochastic variable) is a real-valued set function whose value is a real number determined by the outcome of an experiment. The range of a random variable is the set of all the values it can assume. The particular values observed are called realisations $x$. If these are countable, $x_1, x_2, \ldots$, the random variable is said to be discrete, with associated probabilities

$$P(X = x_i) = p(x_i) \geq 0, \qquad \sum_i p(x_i) = 1; \qquad (3.1)$$

and cumulative distribution $P(X \leq x_j) = \sum_{i=1}^{j} p(x_i)$.
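
To make the idea of a random variable as a function on the sample space concrete, the sketch below builds the probability function and the cumulative distribution in (3.1) for a hypothetical experiment: two tosses of a fair coin, with $X$ the number of heads. The experiment and the names `X`, `pmf` and `cdf` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical experiment: two tosses of a fair coin.
sample_space = list(product("HT", repeat=2))   # ('H','H'), ('H','T'), ...


def X(outcome):
    """Random variable: maps a sample point to its number of heads."""
    return outcome.count("H")


# p(x_i) = (number of sample points with X = x_i) / (number of sample points)
counts = Counter(X(s) for s in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}


def cdf(x):
    """P(X <= x) as the sum of p(x_i) over all x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), Fraction(0))


print(pmf)                      # probabilities 1/4, 1/2, 1/4 for x = 0, 1, 2
print(cdf(0), cdf(1), cdf(2))   # 1/4 3/4 1
```

The probabilities are nonnegative and sum to one, as required by (3.1), and the cumulative distribution reaches one at the largest value of $X$.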

For a continuous random variable, defined over the real line, the cumulative distribution function is

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(u)\, du, \qquad (3.2)$$

where $f$ denotes the probability density function

$$f(x) = \frac{dF(x)}{dx} \qquad (3.3)$$

and $\int_{-\infty}^{\infty} f(x)\, dx = 1$. Also note that the cumulative distribution function satisfies $\lim_{x \to \infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$.

Definition 13 The real-valued function $F(x)$ such that $F(x) = P_X\{(-\infty, x]\}$ for each $x \in \mathbb{R}$ is called the distribution function, also known as the cumulative distribution (or cumulative density) function, or CDF.

Theorem 14 $P(a < X \leq b) = F(b) - F(a)$. For a continuous random variable the same value is obtained for $P(a \leq X \leq b)$.

Theorem 15 For each $x \in \mathbb{R}$, $F(x)$ is continuous from the right at $x$.

Theorem 16 If $F(x)$ is continuous at $x \in \mathbb{R}$, then $P(X = x) = 0$.

Although $f(x)$ is defined at a point, $P(X = x) = 0$ for a continuous random variable. The support of a distribution is the range over which $f(x) \neq 0$.

Let $f$ be a function from $\mathbb{R}^k$ to $\mathbb{R}$. Let $x_0$ be a vector in $\mathbb{R}^k$ and let $y = f(x_0)$ be its image. The function $f$ is continuous at $x_0$ if, whenever $\{x_n\}_{n=1}^{\infty}$ is a sequence in $\mathbb{R}^k$ which converges to $x_0$, the sequence $\{f(x_n)\}_{n=1}^{\infty}$ converges to $f(x_0)$. The function $f$ is said to be continuous if it is continuous at each point in its domain.

All polynomial functions are continuous. As an example of a function that is not continuous, consider

$$f(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{if } x \leq 0. \end{cases}$$

If both $g$ and $f$ are continuous functions, then $g(f(x))$ is continuous.

3.1.1 Discrete Distributions

Definition 17 For a discrete random variable $X$, let $f(x) = P(X = x)$. The function $f(x)$ is called the probability function (or probability mass function).

The Bernoulli Distribution

$$f(x; \theta) = f(x; p) = p^x (1 - p)^{1 - x}$$

for $x = 0, 1$ (failure, success) and $0 \leq p \leq 1$.

The Binomial Distribution

$$f(x; \theta) = B(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x} = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x} \qquad (3.4)$$

for $x = 0, 1, \ldots, n$ ($X$ is the number of successes in $n$ trials) and $0 \leq p \leq 1$.
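
The two probability functions above translate directly into code. The following sketch evaluates them with exact fractions; the parameter values ($n = 3$, $p = 1/2$) are hypothetical and chosen only to keep the numbers small.

```python
from fractions import Fraction
from math import comb


def bernoulli_pmf(x, p):
    """f(x; p) = p^x (1 - p)^(1 - x) for x in {0, 1}, zero elsewhere."""
    if x not in (0, 1):
        return Fraction(0)
    return p ** x * (1 - p) ** (1 - x)


def binomial_pmf(x, n, p):
    """B(x; n, p) = C(n, x) p^x (1 - p)^(n - x) for x = 0, ..., n."""
    if not 0 <= x <= n:
        return Fraction(0)
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


p = Fraction(1, 2)
binomial_table = {x: binomial_pmf(x, 3, p) for x in range(4)}
print(binomial_table)   # probabilities 1/8, 3/8, 3/8, 1/8

# The Bernoulli distribution is the binomial distribution with n = 1.
same_as_bernoulli = all(
    bernoulli_pmf(x, p) == binomial_pmf(x, 1, p) for x in (0, 1)
)
print(same_as_bernoulli)   # True
```

Summing `binomial_table.values()` gives exactly one, which is a quick sanity check that (3.4) defines a probability function.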

3.1.2 Continuous Distributions

Definition 18 For a random variable $X$, if there exists a nonnegative function $f(x)$, defined on the real line, such that for any interval $B$,

$$P(X \in B) = \int_{B} f(x)\, dx \qquad (3.5)$$

then $X$ is said to have a continuous distribution and the function $f(x)$ is called the probability density function or simply the density function (or pdf).

The following can be written for continuous random variables:

$$F(x) = \int_{-\infty}^{x} f(u)\, du \qquad (3.6)$$

$$f(x) = F'(x) = \frac{\partial F(x)}{\partial x} \qquad (3.7)$$

$$\int_{-\infty}^{+\infty} f(u)\, du = 1 \qquad (3.8)$$

$$F(b) - F(a) = \int_{a}^{b} f(u)\, du \qquad (3.9)$$

Uniform Distribution on an Interval

A random variable $X$ with the density function

$$f(x; a, b) = \frac{1}{b - a} \qquad (3.10)$$

on the interval $a \leq x \leq b$ is said to have the uniform distribution on an interval.

The Normal Distribution

A random variable $X$ with the density function

$$f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}} \qquad (3.11)$$

is called a Normal (Gaussian) distributed variable.

3.1.3 Example

1. Toss of a single fair coin. $X$ = number of heads.

$$F(x) = \begin{cases} 0, & \text{if } x < 0 \\ \tfrac{1}{2}, & \text{if } 0 \leq x < 1 \\ 1, & \text{if } x \geq 1 \end{cases}$$

The cumulative distribution function (cdf) of a discrete random variable is always a step function, because the cdf increases only at a countable number of points.

$$f(x) = \begin{cases} \tfrac{1}{2}, & \text{if } x = 0 \\ \tfrac{1}{2}, & \text{if } x = 1 \end{cases} \qquad F(x) = \sum_{x_j \leq x} f(x_j)$$

3.2 Problems

1. Write $P(a \leq x \leq b)$ in terms of integrals and draw a picture for it.

2. Assume the probability density function for $x$ is:

$$f(x) = \begin{cases} cx, & \text{if } 0 \leq x \leq 2 \\ 0, & \text{elsewhere} \end{cases}$$

(a) Find the value of $c$ for which $f(x)$ is a pdf.
(b) Compute $F(x)$.
(c) Compute $P(1 \leq x \leq 2)$.

3. A large lot of electrical fuses is supposed to contain only 5 percent defectives. Assuming a binomial model, if $n = 20$ fuses are randomly sampled from this lot, find the probability that at least three defectives will be observed.

4. Let the distribution function of a random variable $X$ be given by

$$F(x) = \begin{cases} 0, & x < 0 \\ \dfrac{x}{8}, & 0 \leq x < 2 \\ \dfrac{x^2}{16}, & 2 \leq x < 4 \\ 1, & x \geq 4 \end{cases}$$

(a) Find the density function (i.e., pdf) of $x$.
(b) Find $P(1 \leq x \leq 3)$.
(c) Find $P(x \leq 3)$.
(d) Find $P(x \geq 1 \mid x \leq 3)$.

Chapter 4
Expectations and moments

4.1 Mathematical Expectation and Moments

The probability density and cumulative distribution functions determine the probabilities of random variables at various points or in different intervals. Very often we are interested in summary measures of where the distribution is located, how it is dispersed around some average measure, whether it is symmetric around some point, and so on.

4.1.1 Mathematical Expectation

Definition 19 Let $X$ be a random variable with $f(x)$ as the PMF or PDF, and let $g(x)$ be a single-valued function. The expected value (or mathematical expectation) of $g(X)$ is denoted by $E[g(X)]$. In the case of a discrete random variable it takes the form $E[g(X)] = \sum_i g(x_i) f(x_i)$, and in the continuous case, $E[g(X)] = \int_{-\infty}^{+\infty} g(x) f(x)\, dx$.

Mean of a Distribution

For the special case of $g(X) = X$, the mean of a distribution is $\mu = E(X)$.

Theorem 20 If $c$ is a constant, $E(c) = c$.

Theorem 21 If $c$ is a constant, $E[c\,g(X)] = c\,E[g(X)]$.

Theorem 22 $E[u(X) + v(X)] = E[u(X)] + E[v(X)]$.

Theorem 23 $E(X - \mu) = 0$, where $\mu = E(X)$.

Examples:

Ex1: Let $X$ have the probability function

x       1      2      3      4
f(x)   4/10   1/10   3/10   2/10

$$E(X) = \sum_x x f(x) = 1 \cdot \tfrac{4}{10} + 2 \cdot \tfrac{1}{10} + 3 \cdot \tfrac{3}{10} + 4 \cdot \tfrac{2}{10} = \tfrac{23}{10}.$$

Ex2: Let $X$ have the pdf

$$f(x) = \begin{cases} 4x^3, & 0 < x < 1 \\ 0, & \text{elsewhere.} \end{cases}$$

$$E(X) = \int_{-\infty}^{+\infty} x f(x)\, dx = \int_0^1 x (4x^3)\, dx = 4 \int_0^1 x^4\, dx = 4 \left[ \frac{x^5}{5} \right]_0^1 = 4 \cdot \tfrac{1}{5} = \tfrac{4}{5}.$$

Moments of a Distribution

The mean of a distribution is the expected value of the random variable $X$. If the following integral exists

$$\mu'_m = E(X^m) = \int_{-\infty}^{+\infty} x^m\, dF \qquad (4.1)$$

it is called the $m$th moment around the origin, and it is denoted by $\mu'_m$.
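
The two worked examples Ex1 and Ex2 can be reproduced numerically. In the sketch below the discrete expectation is computed exactly, while the continuous one is approximated with a midpoint rule; the function names and the number of integration steps are assumptions of this illustration, and a numerical integral is only an approximation of the exact value 4/5.

```python
from fractions import Fraction

# Ex1: the discrete distribution from the notes, stored as {x: f(x)}.
pmf_ex1 = {1: Fraction(4, 10), 2: Fraction(1, 10),
           3: Fraction(3, 10), 4: Fraction(2, 10)}


def expect_discrete(pmf, g=lambda x: x):
    """E[g(X)] as the sum of g(x) f(x) over the support."""
    return sum(g(x) * p for x, p in pmf.items())


def expect_midpoint(g, f, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of g(x) f(x) dx."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width
        total += g(x) * f(x)
    return total * width


mean_ex1 = expect_discrete(pmf_ex1)
# Theorem 23: the expected deviation from the mean is zero.
mean_deviation = expect_discrete(pmf_ex1, lambda x: x - mean_ex1)

# Ex2: f(x) = 4 x^3 on (0, 1).
mean_ex2 = expect_midpoint(lambda x: x, lambda x: 4 * x ** 3, 0.0, 1.0)

print(mean_ex1, mean_deviation)   # 23/10 0
print(round(mean_ex2, 6))         # 0.8
```

Passing a different `g`, for example `lambda x: x ** 2`, gives higher moments around the origin with the same two helpers.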
Moments can also be obtained around the mean; these are the central moments (denoted by $\mu_m$):

$$\mu_m = E[(X - \mu)^m] = \int_{-\infty}^{+\infty} (x - \mu)^m\, dF \qquad (4.2)$$

Variance and Standard Deviation

The central moment of a distribution that corresponds to $m = 2$ is called the variance of this distribution, and is denoted by $\sigma^2$ or $Var(X)$. The positive square root of the variance is called the standard deviation and is denoted by $\sigma$ or $Std(X)$. The variance is an average of the squared deviations from the mean. There are many deviations from the mean but only one standard deviation. The variance shows the dispersion of a distribution, and by squaring deviations one treats positive and negative deviations symmetrically.

Mean and Variance of a Normal Distribution

If a random variable $X$ is normally distributed as $N(\mu, \sigma^2)$, its mean is $\mu$ and its variance is $\sigma^2$. The operation of subtracting the mean and dividing by the standard deviation is called standardizing. The standardized variable $Z = (X - \mu)/\sigma$ is then standard normal, $Z \sim N(0, 1)$.

Mean and Variance of a Binomial Distribution

If the random variable $X$ is binomially distributed as $B(n, p)$, its mean is $np$ and its variance is $np(1 - p)$. (Show this!)

Theorem 24 If $E(X) = \mu$ and $Var(X) = \sigma^2$, and $a$ and $b$ are constants, then $Var(a + bX) = b^2 \sigma^2$. (Show this!)

Example:

Ex3: Let $X$ have the probability density function

$$f(x) = \begin{cases} 4x^3, & 0 < x < 1 \\ 0, & \text{elsewhere.} \end{cases}$$

$E(X) = \tfrac{4}{5}$, and

$$Var(X) = E(X^2) - E^2(X) = \int_0^1 x^2 (4x^3)\, dx - \left( \tfrac{4}{5} \right)^2 = 4 \left[ \frac{x^6}{6} \right]_0^1 - \left( \tfrac{4}{5} \right)^2 = \tfrac{4}{6} - \tfrac{16}{25} = \tfrac{2}{75} \approx 0.0267.$$

Expectations and Probabilities

Any probability can be interpreted as an expectation. Define the variable $Z$ which is equal to 1 if event $A$ occurs, and equal to zero if event $A$ does not occur. Then it is easy to see that $\Pr(A) = E(Z)$.

How much information about the probability distribution of a random variable $X$ is provided by the expectation and variance of $X$? Three theorems are useful here: Markov's Inequality, Chebyshev's Inequality, and Jensen's Inequality.
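
The "Show this!" statements about the binomial distribution and Theorem 24 ask for proofs, which the following sketch does not replace. It only checks the formulas numerically for one hypothetical choice of parameters ($n = 4$, $p = 1/3$, $a = 3$, $b = -2$) and confirms the arithmetic of Ex3; the helper names are illustrative.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Binomial probability function as a dictionary {x: f(x)}."""
    return {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}


def mean(pmf):
    return sum(x * q for x, q in pmf.items())


def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * q for x, q in pmf.items())


n, p = 4, Fraction(1, 3)
pmf = binomial_pmf(n, p)
print(mean(pmf), n * p)                  # 4/3 4/3
print(variance(pmf), n * p * (1 - p))    # 8/9 8/9

# Theorem 24: Var(a + bX) = b^2 Var(X), checked for a = 3, b = -2.
a, b = 3, -2
pmf_linear = {a + b * x: q for x, q in pmf.items()}
print(variance(pmf_linear) == b ** 2 * variance(pmf))   # True

# Ex3: E(X^2) = 4/6 for f(x) = 4x^3 on (0, 1), and E(X) = 4/5.
var_ex3 = Fraction(4, 6) - Fraction(4, 5) ** 2
print(var_ex3)                           # 2/75
```

The constant `a` shifts every value of the variable but leaves the deviations from the mean unchanged, which is why it does not appear in the variance.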

Theorem 25 Markov's Inequality: If $X$ is a nonnegative random variable, that is, if $\Pr(X < 0) = 0$, and $k$ is any positive constant, then $\Pr(X \geq k) \leq E(X)/k$.

Theorem 26 Chebyshev's Inequality: Let $b$ be a positive constant and $h(X)$ be a nonnegative measurable function of the random variable $X$. Then

$$\Pr(h(X) \geq b) \leq \frac{1}{b} E[h(X)].$$

For any constant $c > 0$ and $\sigma^2 = Var(X)$:

Corollary 27 $\Pr(|X - \mu| \geq c) \leq \dfrac{\sigma^2}{c^2}$

Corollary 28 $\Pr(|X - \mu| \leq c) \geq 1 - \dfrac{\sigma^2}{c^2}$

Corollary 29 $\Pr(|X - \mu| \geq k\sigma) \leq \dfrac{1}{k^2}$

For linear functions the expectation of the function is the function of the expectation. But if $Y = h(X)$ is nonlinear, then in general $E(Y) \neq h[E(X)]$. The direction of the inequality may depend on the distribution of $X$. For certain functions, we can be more definite.

Theorem 30 Jensen's Inequality: If $Y = h(X)$ is concave and $E(X) = \mu$, then $E(Y) \leq h(\mu)$.

For example, the logarithmic function is concave, so $E[\log(X)] \leq \log[E(X)]$ regardless of the distribution of $X$. Similarly, if $Y = h(X)$ is convex, so that it lies everywhere above its tangent line, then $E(Y) \geq h(\mu)$. For example, the square function is convex, so $E(X^2) \geq [E(X)]^2$ regardless of the distribution of $X$.

Approximate Mean and Variance of g(X)

Suppose $X$ is a random variable defined on $(S, \mathcal{F}, P(\cdot))$ with $E(X) = \mu$ and $Var(X) = \sigma^2$, and let $g(X)$ be a differentiable and measurable function of $X$. We first take a linear approximation of $g(X)$ in the neighborhood of $\mu$. This is given by

$$g(X) \approx g(\mu) + g'(\mu)(X - \mu) \qquad (4.3)$$

provided $g(\mu)$ and $g'(\mu)$ exist. Since the second term has zero expectation, $E[g(X)] \approx g(\mu)$, and the variance is $Var[g(X)] \approx \sigma^2 [g'(\mu)]^2$.

Mode of a Distribution

The point(s) for which $f(x)$ is maximum are called the mode. It is the most frequently observed value of $X$.

Median, Upper and Lower Quartiles, and Percentiles

A value of $x$ such that $P(X < x) \leq 1/2$ and $P(X \leq x) \geq 1/2$ is called a median of the distribution. If the point is unique, then it is the median.
Thus the median is the point on either side of which lies 50 percent of the distribution. We often prefer the median as an "average" measure because the arithmetic average can be misleading if extreme values are present.

The point(s) with an area of 1/4 to the left is (are) called the lower quartile(s), and the point(s) corresponding to 3/4 is (are) called the upper quartile(s).

For any probability $p$, the values of $X$ for which the area to the right is $p$ are called the upper $p$th percentiles (also referred to as quantiles).

Coefficient of Variation

The coefficient of variation is defined as the ratio $(\sigma/\mu) \times 100$, where the numerator is the standard deviation and the denominator is the mean. It is a measure of the dispersion of a distribution relative to its mean and is useful in the estimation of relationships. We usually say that the variable $X$ does not vary much if the coefficient of variation is less than 5 percent. It is also helpful for comparing two variables that are measured on different scales.

Skewness and Kurtosis

If a continuous density $f(x)$ has the property that $f(\mu + a) = f(\mu - a)$ for all $a$ ($\mu$ being the mean of the distribution), then $f(x)$ is said to be symmetric around the mean. If a distribution is not symmetric about the mean, then it is called skewed. A commonly used measure of skewness is $\alpha_3 = E[(X - \mu)^3 / \sigma^3]$. For a symmetric distribution such as the normal, this is zero ($\alpha_3 = 0$). A distribution is positively skewed ($\alpha_3 > 0$) if it has a long tail to the right, and negatively skewed ($\alpha_3 < 0$) if it has a long tail to the left.

The peakedness of a distribution is called kurtosis. One measure of kurtosis is $\alpha_4 = E[(X - \mu)^4 / \sigma^4]$. A normal distribution is called mesokurtic ($\alpha_4 = 3$). A narrow distribution is called leptokurtic ($\alpha_4 > 3$) and a flat distribution is called platykurtic ($\alpha_4 < 3$). The value $E[(X - \mu)^4 / \sigma^4] - 3$ is often referred to as excess kurtosis.
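
The summary measures of this section can be computed side by side for a small distribution. The sketch below reuses the discrete distribution of Ex1 as its input, which is a choice made here for illustration; the helper names `central_moment`, `is_median` and `tail_prob` are hypothetical. It also checks Corollary 29 for a few values of $k$.

```python
from fractions import Fraction
from math import sqrt

# Distribution of Ex1, stored as {x: f(x)}.
pmf = {1: Fraction(4, 10), 2: Fraction(1, 10),
       3: Fraction(3, 10), 4: Fraction(2, 10)}


def central_moment(pmf, m, mu):
    """E[(X - mu)^m] for a discrete distribution."""
    return sum((x - mu) ** m * p for x, p in pmf.items())


mu = sum(x * p for x, p in pmf.items())   # 23/10
var = central_moment(pmf, 2, mu)          # 141/100
sigma = sqrt(var)

cv = 100 * sigma / float(mu)                              # coefficient of variation
skewness = float(central_moment(pmf, 3, mu)) / sigma ** 3
kurtosis = float(central_moment(pmf, 4, mu)) / sigma ** 4
excess_kurtosis = kurtosis - 3


def is_median(x):
    """Check P(X < x) <= 1/2 and P(X <= x) >= 1/2."""
    below = sum(p for xi, p in pmf.items() if xi < x)
    at_or_below = sum(p for xi, p in pmf.items() if xi <= x)
    return below <= Fraction(1, 2) <= at_or_below


def tail_prob(k):
    """P(|X - mu| >= k * sigma)."""
    return float(sum(p for x, p in pmf.items() if abs(float(x - mu)) >= k * sigma))


chebyshev_holds = all(tail_prob(k) <= 1 / k ** 2 for k in (1.0, 1.2, 1.5, 2.0))

print(mu, var, skewness > 0, kurtosis < 3)
print([x for x in pmf if is_median(x)])   # [2, 3]
print(chebyshev_holds)                    # True
```

For this distribution both 2 and 3 satisfy the median condition, so the median is not unique, which is the case the definition above allows for. The positive skewness and a kurtosis below 3 describe this particular example only.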

4.1.2 Moments

Mathematical Expectation

The concept of mathematical expectation is easily extended to bivariate random variables. We have

$$E[g(X, Y)] = \int \int g(x, y)\, dF(x, y) \qquad (4.4)$$

where the integral is over the $(X, Y)$ space.

Moments

The $r$th moment of $X$ is

$$E(X^r) = \int x^r\, dF(x) \qquad (4.5)$$

Joint Moments

$$E(X^r Y^s) = \int \int x^r y^s\, dF(x, y)$$

Let $X$ and $Y$ be independent random variables and let $u(X)$ be a function of $X$ only and $v(Y)$ be a function of $Y$ only. Then,

$$E[u(X) v(Y)] = E[u(X)]\, E[v(Y)] \qquad (4.6)$$

Covariance

The covariance between $X$ and $Y$ is defined as

$$\sigma_{XY} = Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E(XY) - \mu_x \mu_y \qquad (4.7)$$

In the continuous case this takes the form

$$\sigma_{XY} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_x)(y - \mu_y) f(x, y)\, dx\, dy \qquad (4.8)$$

and in the discrete case it is

$$\sigma_{XY} = \sum_x \sum_y (x - \mu_x)(y - \mu_y) f(x, y) \qquad (4.9)$$

Although the covariance measure is useful in identifying the nature of the association between $X$ and $Y$, it has a serious problem, namely, its numerical value is very sensitive to the units of measurement. To avoid this problem, a "normalized" covariance measure is used. This measure is called the correlation coefficient.

Correlation

The quantity

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} = \frac{Cov(X, Y)}{\sqrt{Var(X)\, Var(Y)}} \qquad (4.10)$$

is called the correlation coefficient between $X$ and $Y$. If $Cov(X, Y) = 0$, then $Cor(X, Y) = 0$, in which case $X$ and $Y$ are said to be uncorrelated. If two random variables are independent, then $\sigma_{XY} = 0$ and $\rho_{XY} = 0$. The converse need not be true.

Theorem 31 $|\rho_{XY}| \leq 1$, that is, $-1 \leq \rho_{XY} \leq 1$.

The inequality $[Cov(X, Y)]^2 \leq Var(X)\, Var(Y)$ is called the Cauchy-Schwarz Inequality; equivalently, $\rho_{XY}^2 \leq 1$, that is, $-1 \leq \rho_{XY} \leq 1$. It should be emphasized that $\rho_{XY}$ measures only a linear relationship between $X$ and $Y$. It is possible to have an exact relation but a correlation less than 1, even 0.

Example: To illustrate, consider a random variable $X$ which is distributed as Uniform $[-\theta, \theta]$ and the transformation $Y = X^2$.
$Cov(X, Y) = E(X^3) - E(X)E(X^2) = 0$ because the distribution is symmetric around the origin and hence all the odd moments about the origin are zero. It follows that $X$ and $Y$ are uncorrelated even though there is an exact relation between them. In fact, this result holds for any distribution that is symmetric around the origin.

Definition 32 Conditional Expectation: Let $X$ and $Y$ be continuous random variables and $g(Y)$ be a continuous function. Then the conditional expectation (or conditional mean) of $g(Y)$ given $X = x$, denoted by $E_{Y|X}[g(Y) \mid X]$, is given by $\int_{-\infty}^{\infty} g(y) f(y \mid x)\, dy$, where $f(y \mid x)$ is the conditional density of $Y$ given $X$.

Note that $E[g(Y) \mid X = x]$ is a function of $x$ and is not a random variable because $x$ is fixed. The special case of $E(Y \mid X)$ is called the regression of $Y$ on $X$.

Theorem 33 Law of Iterated Expectation: $E_{XY}[g(Y)] = E_X[E_{Y|X}\{g(Y) \mid X\}]$. That is, the unconditional expectation is the expectation of the conditional expectation.

Definition 34 Conditional Variance: Let $\mu_{Y|X} = E(Y \mid X) = \mu^*(X)$ be the conditional mean of $Y$ given $X$. Then the conditional variance of $Y$ given $X$ is defined as $Var(Y \mid X) = E_{Y|X}[(Y - \mu^*)^2 \mid X]$. This is a function of $X$.

Theorem 35 $Var(Y) = E_X[Var(Y \mid X)] + Var_X[E(Y \mid X)]$, that is, the variance of $Y$ is the mean of its conditional variance plus the variance of its conditional mean.

Theorem 36 $Var(aX + bY) = a^2 Var(X) + 2ab\, Cov(X, Y) + b^2 Var(Y)$.

Approximate Mean and Variance for g(X, Y)

After obtaining a linear approximation of the function $g(X, Y)$,

$$g(X, Y) \approx g(\mu_X, \mu_Y) + \frac{\partial g}{\partial X}(X - \mu_X) + \frac{\partial g}{\partial Y}(Y - \mu_Y) \qquad (4.11)$$

its mean can be written $E[g(X, Y)] \approx g(\mu_X, \mu_Y)$. Its variance is

$$Var[g(X, Y)] \approx \sigma_X^2 \left( \frac{\partial g}{\partial X} \right)^2 + \sigma_Y^2 \left( \frac{\partial g}{\partial Y} \right)^2 + 2 \rho\, \sigma_X \sigma_Y \left( \frac{\partial g}{\partial X} \right) \left( \frac{\partial g}{\partial Y} \right) \qquad (4.12)$$

Note that the approximations may be grossly in error. You should be especially careful with the variance and covariance approximations.

Problems

1. For certain ore samples the proportion $Y$ of impurities per sample is a random variable with density function given by

$$f(y) = \begin{cases} \tfrac{3}{2} y^2 + y, & 0 \leq y \leq 1 \\ 0, & \text{elsewhere.} \end{cases}$$

The dollar value of each sample is $W = 5 - 0.5Y$. Find the mean and variance of $W$.

2. The random variable $Y$ has the following probability density function

$$f(y) = \begin{cases} \tfrac{3}{8} (7 - y)^2, & 5 \leq y \leq 7 \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Find $E(Y)$ and $Var(Y)$.
(b) Find an interval shorter than $(5, 7)$ in which at least 3/4 of the $Y$ values must lie.
(c) Would you expect to see a measurement below 5.5 very often? Why?
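
As a numerical companion to the covariance example and to Theorems 33 and 35 in section 4.1.2, the sketch below works with two small joint probability functions. Both are hypothetical: the first is a discrete analogue of the symmetric example with $Y = X^2$ (the notes use a continuous uniform distribution), and the second is an arbitrary 2 by 2 table. The helper names are illustrative.

```python
from fractions import Fraction


def marginal_x(joint):
    """Marginal probability function of X from {(x, y): f(x, y)}."""
    out = {}
    for (x, _), p in joint.items():
        out[x] = out.get(x, Fraction(0)) + p
    return out


def expect(joint, g):
    """E[g(X, Y)] as the double sum of g(x, y) f(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def cov(joint):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    return (expect(joint, lambda x, y: x * y)
            - expect(joint, lambda x, y: x) * expect(joint, lambda x, y: y))


def cond_mean_var(joint, x0):
    """E(Y | X = x0) and Var(Y | X = x0)."""
    px = marginal_x(joint)[x0]
    cond = {y: p / px for (x, y), p in joint.items() if x == x0}
    m = sum(y * q for y, q in cond.items())
    v = sum((y - m) ** 2 * q for y, q in cond.items())
    return m, v


# X symmetric around zero and Y = X^2: exact relation, zero covariance.
third = Fraction(1, 3)
joint_sq = {(-1, 1): third, (0, 0): third, (1, 1): third}
cov_sq = cov(joint_sq)

# Arbitrary joint table for the iterated-expectation checks.
joint = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
         (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8)}
px = marginal_x(joint)
mean_y = expect(joint, lambda x, y: y)
var_y = expect(joint, lambda x, y: (y - mean_y) ** 2)
iterated_mean = sum(px[x] * cond_mean_var(joint, x)[0] for x in px)
mean_cond_var = sum(px[x] * cond_mean_var(joint, x)[1] for x in px)
var_cond_mean = sum(px[x] * (cond_mean_var(joint, x)[0] - mean_y) ** 2 for x in px)

print(cov_sq)                                  # 0
print(mean_y, iterated_mean)                   # 5/8 5/8
print(var_y, mean_cond_var + var_cond_mean)    # 15/64 15/64
```

In the first table $P(X = 0, Y = 0) = 1/3$ while $P(X = 0)P(Y = 0) = 1/9$, so the variables are dependent even though their covariance is zero.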