Notes on Set Theory and Probability Theory
Michelle Alexopoulos
August 2003
0.1. Set Theory and Probability Theory
Before we talk about probability, it is useful to review some basic definitions and
theorems from set theory.
Definition 1. A set is a collection of objects.
Definition 2. If every element in a set A is also a member of set B then A is a
subset of B, i.e., A ⊂ B.
Definition 3. Two sets, A and B, are equal, denoted A = B, if and only if every
element in A belongs to the set B and every element in B belongs to set A, i.e.,
A ⊆ B and A ⊇ B.
Definition 4. B is a proper subset of A if B is a subset of A, but B does not
equal A.
Definition 5. The empty set, or null set, is a set which contains no elements,
and is denoted by the symbol ∅.
Definition 6. Suppose that A ⊂ S. The complement of set A, denoted Ā or
Aᶜ, is the set containing all elements in S that are not in A, i.e.,
Aᶜ = {γ : γ ∈ S and γ ∉ A}
Definition 7. The union of sets A and B, denoted A ∪ B, is the set containing
all elements in either A or B or both, i.e.,
A ∪ B = {γ : γ ∈ A or γ ∈ B}.
Definition 8. The intersection of sets A and B, denoted A ∩ B, is the set containing all elements in both A and B, i.e.,
A ∩ B = {γ : γ ∈ A and γ ∈ B}.
Definition 9. Two sets, A and B, are called disjoint or mutually exclusive if they
contain no common elements, i.e., if A ∩ B = ∅.
Definition 10. The set of all possible outcomes of a random experiment is called
the sample space (or universal set) and is denoted here by S (some texts write U).
Some Theorems involving sets:
Theorem 11. A ∪ B = B ∪ A (commutative law for unions)
Theorem 12. (A ∪ B) ∪ C = A ∪ (B ∪ C) (associative law for unions)
Theorem 13. A ∩ B = B ∩ A (commutative law for intersections)
Theorem 14. (A ∩ B) ∩ C = A ∩ (B ∩ C) (associative law for intersections)
Theorem 15. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) (First distributive law)
Theorem 16. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) (Second distributive law)
Theorem 17. If A ⊂ B, then Aᶜ ⊃ Bᶜ
Theorem 18. A ∪ ∅ = A and A ∩ ∅ = ∅
Theorem 19. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ (De Morgan's first law)
Theorem 20. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ (De Morgan's second law)
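These identities are easy to verify numerically. Below is a small sketch (not part of the original notes) using Python's built-in set type; the sets S, A, B, and C are arbitrary examples chosen for illustration:

    # Illustrative check of the set identities above, using Python sets.
    # S, A, B, and C are arbitrary example sets for this sketch.
    S = set(range(10))              # the universal set
    A = {0, 1, 2, 3}
    B = {2, 3, 4, 5}
    C = {1, 5, 7}

    def complement(X):
        return S - X

    # De Morgan's first law: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
    assert complement(A | B) == complement(A) & complement(B)
    # De Morgan's second law: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
    assert complement(A & B) == complement(A) | complement(B)
    # First distributive law: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
    assert A & (B | C) == (A & B) | (A & C)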
0.2. Probability Space
Basic probability theory is defined using a triple (S, F, P), where S is the sample
space, F is the collection of events, and P is a function that maps F into the
interval [0,1]. P is the probability measure and, intuitively, F is the set of all
events that can be verified to have occurred or not occurred. I will discuss these
objects in more detail below. However, for the most part, I will follow the notation
used in standard textbooks like Greene's Econometric Analysis.
Definition 21. Sample space S: a set of elements of interest.
In elementary probability theory, we usually associate these elements with
outcomes of an experiment. For example, consider a simple experiment of a single
toss of a coin. The sample space of this experiment is
S = {H, T },
where H stands for “head” and T for “tail”. For an experiment of three tosses of
a coin, then the sample space is
S = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
More generally, the sample space of an experiment of n tosses of a coin is
S = {ω : ω = (a_1, ..., a_n), a_i = H or T}.
An alternate example is one where an individual rolls a die once. The sample
space of this experiment is:
S = {1, 2, 3, 4, 5, 6}
and if we rolled the die n times, the sample space of this experiment would be:
S = {ω : ω = (a_1, ..., a_n), a_i = 1, 2, 3, 4, 5, or 6}.
In modern probability theory, the sample space can be fairly general and abstract. For example, it can be the collection of all real numbers, R, or the collection
of all n-dimensional vectors, Rⁿ, or any subset of these collections.
The Axioms of Probability: Suppose we have a sample space S. If S is discrete, then all subsets correspond to events, but if S is continuous, only measurable
subsets correspond to events.
To each event A in the class of events C, we associate a real number, P(A), i.e.,
P is a real-valued function defined on C. Then P is the probability function and
P(A) is the probability of the event A, if the following axioms are satisfied:
Axiom 1: For every event A, P (A) ≥ 0.
Axiom 2: For the sure or certain event S, P (S) = 1.
Axiom 3: For any countable collection of mutually exclusive events A1, A2, ...,
P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ⋯
Some Theorems on Probability:
Theorem 22. If A1 ⊆ A2, then P(A1) ≤ P(A2) and P(A2 − A1) = P(A2) − P(A1)
Theorem 23. For every event A, 0 ≤ P (A) ≤ 1.
Theorem 24. P (∅) = 0 (i.e., the impossible event has probability 0)
Theorem 25. If Aᶜ is the complement of A, then P(Aᶜ) = 1 − P(A)
Theorem 26. If A = A1 ∪ A2 ∪ ... ∪ An, where A1, ..., An are mutually exclusive
events, then P(A) = Σ_{i=1}^n P(A_i).
Theorem 27. If A and B are any two events, then P (A ∪ B) = P (A) + P (B) −
P (A ∩ B).
Conditional Probability: Let A and B be two events such that P (A) > 0.
Let P (B|A) denote the probability of B given that A has occurred. Since A has
already occurred, it becomes the new sample space. From this, we are led to the
definition of P (B|A):
P(B|A) = P(A ∩ B)/P(A), or equivalently P(A ∩ B) = P(A)P(B|A)
Definition 28. If P(B|A)=P(B), then A and B are independent events.
This is equivalent to P(A ∩ B) = P (A)P (B).
Theorem 29. Bayes Rule: Suppose that A1 , A2 , ..., An are mutually exclusive
events whose union is the sample space, S. Then if A is any event:
P(A_k | A) = P(A_k)P(A|A_k) / Σ_{i=1}^n P(A_i)P(A|A_i)
Definition 30. Two events A and B are said to be independent if and only if
P (A ∩ B) = P (A)P (B)
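As a numerical illustration of Bayes' rule, consider the following sketch (the priors and likelihoods are made-up numbers, not from the notes):

    # Bayes' rule with hypothetical numbers: three mutually exclusive events
    # A1, A2, A3 that partition S, and an observed event A.
    priors = [0.5, 0.3, 0.2]        # P(A1), P(A2), P(A3); they sum to 1
    likelihoods = [0.1, 0.4, 0.8]   # P(A|A1), P(A|A2), P(A|A3)

    # Denominator: P(A) = Σ P(Ai)P(A|Ai)
    p_a = sum(p * l for p, l in zip(priors, likelihoods))
    # Posterior: P(Ak|A) = P(Ak)P(A|Ak) / P(A)
    posteriors = [p * l / p_a for p, l in zip(priors, likelihoods)]
    print(posteriors)               # sums to 1; e.g. P(A3|A) = 0.16/0.33 ≈ 0.485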
Counting: Suppose that we are given n distinct objects.
Definition 31. nPr is the number of permutations of n objects taken r at a time:
nPr = n!/(n − r)!
Definition 32. nCr is the number of combinations of n objects taken r at a time:
nCr = (n choose r) = n!/(r!(n − r)!)
nPr is used when order matters. For example, if we want to find out how many
different permutations consisting of three letters each can be formed from the 4
letters A, B, C, and D, the answer is given by
4P3 = 4!/(4 − 3)! = 4 · 3 · 2 · 1 = 24
In this case order matters, i.e., ABC is a different permutation than ACB, or
BAC, or CAB, etc.
If we only want to know how many ways three letters can be chosen from the
set of letters A, B, C, and D, then the answer is given by
4C3 = 4!/(3!(4 − 3)!) = 4
In this case order does not matter, so we only have 4 possibilities, i.e., (1)
A, B, and C, (2) A, C, and D, (3) A, B, and D, and (4) B, C, and D.
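The Python standard library reproduces both counts; a minimal sketch (math.perm and math.comb assume Python 3.8 or later):

    import math
    from itertools import combinations, permutations

    letters = ["A", "B", "C", "D"]
    # 4P3 = 4!/(4 - 3)! = 24 ordered arrangements of three letters
    assert math.perm(4, 3) == len(list(permutations(letters, 3))) == 24
    # 4C3 = 4!/(3!(4 - 3)!) = 4 unordered selections of three letters
    assert math.comb(4, 3) == len(list(combinations(letters, 3))) == 4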
Other commonly used definitions: If you are reading more advanced books
on probability theory, you will often find terms like sigma-field and sigma-algebra,
and references to measurable functions and measure spaces. I will now briefly
turn to these.
Partition and Information: A partition of a set S is a finite collection A =
{A1, A2, ..., AN} of disjoint subsets of S whose union is S.
Examples: S = {−0.05, −0.01, 0, 0.01, 0.05}, and
A0 = {S},
A1 = {{−0.05, −0.01}, {0}, {0.01, 0.05}},
A2 = {{−0.05}, {−0.01}, {0}, {0.01}, {0.05}}.
Ai, i = 0, 1, 2, can each be thought of as representing the information an agent
may have. Suppose that the numbers in S represent all possible returns in the
stock market in a day. Then, after the return is realized, an agent with information
partition A0 has effectively no information about the return, while an agent with
information A1 can tell whether the return is positive, zero, or negative, and an
agent with information A2 knows exactly what the return is. So, these three
partitions represent progressively more information.
Given a partition A = {A1, A2, ..., AN}, an agent may assign different probabilities to each of the events in the partition, P1, ..., PN. Based on these probabilities,
the agent should also be able to decide the probabilities of events such as A1 ∪ A2
or A1ᶜ:
P(A1 ∪ A2) = P(A1) + P(A2),
P(A1ᶜ) = 1 − P(A1).
This motivates the following definition of measurable sets, which can be thought of
as all events that can be assigned a probability.
Above, I mentioned that F includes all events on the sample space that
can be verified to have occurred or not occurred. Basically this means that if
set A is an event, then its complement, Aᶜ (i.e., not A), must also be an event.
Furthermore, if A and B are events then we need to be able to determine: (a)
if both A and B happened and (b) if either A or B (or both) occurred. Thus,
A ∩ B and A ∪ B are also events. An algebra (or field) is a collection of subsets
that is closed under complementation, intersection and union. For our purposes,
we will also require F to be closed under countable unions/intersections, and we
will refer to F as a σ-algebra (σ-field). This next definition states these ideas
more formally.
Definition 33. A σ-field F is a collection of subsets of a sample space S with
the following properties:
(i) The empty set ∅ ∈ F.
(ii) If A ∈ F, then the complement Aᶜ ∈ F.
(iii) If Ai ∈ F, i = 1, 2, ..., then their union ∪_i Ai ∈ F.
Note that if Ai ∈ F for i = 1, 2, then Aiᶜ ∈ F, which implies that A1ᶜ ∪ A2ᶜ ∈ F.
Thus, (A1ᶜ ∪ A2ᶜ)ᶜ ∈ F. However,
(A1ᶜ ∪ A2ᶜ)ᶜ ≡ {ω ∈ S : ω ∉ A1ᶜ and ω ∉ A2ᶜ}
= {ω ∈ S : ω ∈ A1 and ω ∈ A2}
≡ A1 ∩ A2.
So, A1 ∩ A2 ∈ F.
Definition 34. A pair (S, F) is called a measurable space, and any subset in F
is called a measurable set or event.
Examples:
(i) F = {∅, S}
(ii) F = {∅, A, Aᶜ, S} = {∅, {−0.05, −0.01}, {0, 0.01, 0.05}, S}
Definition 35. σ(C): smallest σ-field that contains the collection of subsets, C.
Examples:
(i) σ(A) = {∅, A, Aᶜ, S}.
(ii) C = {A1, A2}: σ(C) = {∅, A1, A2, A1ᶜ, A2ᶜ, A1 ∪ A2, A1ᶜ ∪ A2ᶜ, A1 ∩ A2,
A1ᶜ ∩ A2, A1 ∩ A2ᶜ, A1 ∪ A2ᶜ, A1ᶜ ∪ A2, S}
(iii) B, the σ-field generated by all the open intervals in R. We call all the
subsets in B Borel sets.
Let A and A′ be two partitions of S. We say that the information represented by
A is finer than that represented by A′ if σ(A′) ⊂ σ(A).
Definition 36. A measure is a set function v defined on F such that:
(i) 0 ≤ v(A) ≤ ∞ for any A ∈ F.
(ii) v(∅) = 0.
(iii) If Ai ∈ F, i = 1, 2, ..., and Ai ∩ Aj = ∅ for any i ≠ j, then
v(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ v(Ai).
Examples:
Counting measure: S = {a1, a2, a3, ...}, F contains all the subsets of S, and
v(A) = number of elements in subset A.
Lebesgue measure: S = R, F = B, and
v((a, b)) = b − a.
Proposition 37. For a measure space (S, F, v), we have
(i) If A ⊂ B, then v(A) ≤ v(B).
(ii) For any sequence A1, A2, ...,
v(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ v(Ai).
(iii) If A1 ⊂ A2 ⊂ A3 ⊂ ... (or A1 ⊃ A2 ⊃ A3 ⊃ ...), then
v(lim_{n→∞} An) = v(∪_{i=1}^∞ Ai) = lim_{n→∞} v(An)
(or
v(lim_{n→∞} An) = v(∩_{i=1}^∞ Ai) = lim_{n→∞} v(An)
if v(A1) < ∞).
Proof: (i) Let C = B ∩ Aᶜ. Then C ∈ F and
v(B) = v(A ∪ C) = v(A) + v(C) ≥ v(A)
because v(C) ≥ 0.
(ii) Let C1 = A1, C2 = A2 ∩ C1ᶜ, C3 = A3 ∩ C2ᶜ ∩ C1ᶜ, .... Then Ci, i = 1, 2, ...
is a sequence of disjoint sets such that ∪_{i=1}^∞ Ai = ∪_{i=1}^∞ Ci and, by (i),
v(Ci) ≤ v(Ai). Thus, we have
v(∪_{i=1}^∞ Ai) = v(∪_{i=1}^∞ Ci) = Σ_{i=1}^∞ v(Ci) ≤ Σ_{i=1}^∞ v(Ai).
(iii) If An is an increasing sequence, let A0 = ∅, and Dn = An − An−1 ≡
An ∩ A_{n−1}ᶜ for n ≥ 1. Then Dn, n = 1, 2, ... is a sequence of disjoint sets such
that ∪_{n=1}^∞ An = ∪_{n=1}^∞ Dn. By the definition of measure, we have
v(∪_{n=1}^∞ An) = v(∪_{n=1}^∞ Dn) = Σ_{n=1}^∞ v(Dn)
= lim_{n→∞} Σ_{i=1}^n v(Di)
= lim_{n→∞} Σ_{i=1}^n [v(Ai) − v(A_{i−1})]
= lim_{n→∞} v(An).
Now, if An is a decreasing sequence such that v(A1) < ∞, then Bn = A1 − An
is an increasing sequence. From what we just proved, we have
v(∪_{n=1}^∞ Bn) = lim_{n→∞} v(Bn) = v(A1) − lim_{n→∞} v(An).
However, ∩_{n=1}^∞ An = A1 − (∪_{n=1}^∞ Bn), so
v(∩_{n=1}^∞ An) = v(A1) − v(∪_{n=1}^∞ Bn) = lim_{n→∞} v(An).
Q.E.D.
If v(S) = 1, then v is called a probability measure.
Proposition 38. Let (S, F) be a measurable space.
(i) If f and g are measurable, then so are f g and af + bg, where a and b are
two real numbers; also, f /g is measurable provided g(ω) ≠ 0 for any ω ∈ S.
(ii) If f1, f2, ... are measurable, then so are sup_n fn and inf_n fn. Furthermore,
if lim_{n→∞} fn exists, then it is also measurable.
(iii) Suppose that f is a measurable function on (S, F) and g a measurable
function on (R, B); then the composite function g ∘ f defined by g ∘ f(ω) =
g(f(ω)) is also a measurable function.
(iv) If f is a continuous function on (R, B), then f is measurable.
Proposition 39. Let f and g be measurable functions on a measure space (S, F, v).
(i) ∫(af + bg)dv = a∫f dv + b∫g dv.
(ii) If f = g a.e., then ∫f dv = ∫g dv.
(iii) If f ≤ g a.e., then ∫f dv ≤ ∫g dv.
(iv) If f ≥ 0 a.e. and ∫f dv = 0, then f = 0 a.e.
(v) If f ≥ 0 a.e. and ∫f dv = 1, then the set function
P(B) = ∫_B f dv
is a probability measure on (S, F). The function f is called the probability
density function (p.d.f.) of P with respect to the measure v.
(vi) If fn → f a.e., |fn| ≤ g, and ∫g dv < ∞, then
lim_{n→∞} ∫fn dv = ∫f dv.
(vii) If |∂f(ω, θ)/∂θ| ≤ g(ω) a.e. and ∫g dv < ∞, then
(d/dθ)[∫f(ω, θ)dv] = ∫[∂f(ω, θ)/∂θ]dv.
0.3. Random Variables and Probability Distribution
Definition 40. Consider a random experiment with sample space S. A random
variable X(ξ) is a single-valued real function that assigns a real number to each
sample point ξ of S. Often we use a single letter X for this function in place of
X(ξ).
Probability Distribution:
Definition 41. A listing of the values x taken by a random variable X and their
associated probabilities is a probability distribution, f(x).
Definition 42. The distribution function [or cumulative distribution function
(c.d.f.)] of X is the function defined by:
F_X(x) ≡ P(X ≤ x), −∞ < x < ∞.
Properties of F_X(x):
1. 0 ≤ F_X(x) ≤ 1
2. F_X(x1) ≤ F_X(x2) if x1 < x2 (i.e., F_X is non-decreasing)
3. lim_{x→∞} F_X(x) = F_X(∞) = 1
4. lim_{x→−∞} F_X(x) = F_X(−∞) = 0
5. lim_{x→a⁺} F_X(x) = F_X(a⁺) = F_X(a), where a⁺ = lim_{0<ε→0}(a + ε) (i.e., F_X
is right continuous)
**Note that a distribution function may not be left continuous.
Definition 43. Let X be a random variable with cdf F_X(x). X is a discrete
random variable only if its range contains a finite or countably infinite number of
points. Alternatively, if F_X(x) changes values only in jumps (at most a countable
number of them) and is constant between jumps, then X is called a discrete random
variable.
Definition 44. Suppose that the jumps in F_X(x) of a discrete random variable X
occur at the points x1, x2, ..., where the sequence may be either finite or countably
infinite, and we assume xi < xj if i < j. Then:
F_X(xi) − F_X(x_{i−1}) = P(X ≤ xi) − P(X ≤ x_{i−1}) = P(X = xi)
Let p_X(x) = P(X = x). The function p_X(x) is called the probability mass
function (pmf) of the discrete random variable X.
Properties of p_X(x):
1. 0 ≤ p_X(xk) ≤ 1 for k = 1, 2, ...
2. p_X(x) = 0 if x ≠ xk for k = 1, 2, ...
3. Σ_k p_X(xk) = 1
The probability distribution for a discrete random variable is f(x) = p_X(x) =
P(X = x), and the c.d.f. F_X(x) of a discrete random variable X can be obtained
by:
F_X(x) = P(X ≤ x) = Σ_{xk ≤ x} p_X(xk)
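As a concrete sketch (not in the original notes), the cdf of a fair six-sided die can be built from its pmf exactly as in the formula above:

    # pmf of a fair six-sided die: p(x) = 1/6 for x = 1, ..., 6
    pmf = {x: 1 / 6 for x in range(1, 7)}

    def cdf(x):
        # F(x) = P(X <= x) = sum of the pmf over the points xk <= x
        return sum(p for xk, p in pmf.items() if xk <= x)

    assert cdf(0) == 0                      # no mass below 1
    assert abs(cdf(3) - 0.5) < 1e-12        # P(X <= 3) = 1/2
    assert abs(cdf(6) - 1.0) < 1e-12        # all mass at or below 6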
Definition 45. Let X be a random variable with cdf F_X(x). X is a continuous
random variable only if its range contains an interval (either finite or infinite)
of real numbers. Alternatively, if F_X(x) is continuous and also has a derivative
dF_X(x)/dx which exists everywhere except at possibly a finite number of points
and is piecewise continuous, then X is called a continuous random variable.
For the case of a continuous random variable, the probability associated with
any particular point is zero (i.e., P(X = x) = 0). However, we can assign a positive
probability to intervals in the range of X.
Definition 46. Let f(x) = dF_X(x)/dx. The function f(x) is called the probability
density function (pdf) of the continuous random variable X.
Properties of f(x):
1. f(x) ≥ 0
2. ∫_{−∞}^∞ f(x)dx = 1
3. f(x) is piecewise continuous
4. P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) = ∫_a^b f(x)dx
The cumulative distribution function for the continuous random variable X is
F_X(x) = P(X ≤ x) = ∫_{−∞}^x f(t)dt
Furthermore, from the definition of the cdf we know that
P(a < X ≤ b) = F_X(b) − F_X(a)
**Note that many books write F_X(x) as F(x).
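The following sketch illustrates these relationships numerically, assuming for concreteness an Exp(1) random variable with pdf f(x) = e^(−x) for x ≥ 0 (this example is not from the original notes):

    import math

    f = lambda x: math.exp(-x)          # pdf of X ~ Exp(1), x >= 0
    F = lambda x: 1 - math.exp(-x)      # its cdf F(x) = P(X <= x)

    def integral(g, a, b, n=10_000):
        # midpoint-rule approximation of the integral of g over [a, b]
        h = (b - a) / n
        return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

    a, b = 0.5, 2.0
    # P(a < X <= b) computed two ways: integrating f, and F(b) - F(a)
    assert abs(integral(f, a, b) - (F(b) - F(a))) < 1e-6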
0.4. Expectation of a Random Variable
Definition 47. Mean of a Random Variable: The mean, or expected value, of a
random variable is:
E[x] = Σ_x x f(x) if x is discrete,
E[x] = ∫_{−∞}^∞ x f(x)dx if x is continuous.
The mean is normally denoted by µ.
Proposition 48. Let g(X) be a function of X. The expected value of g(X) is:
E[g(X)] = Σ_x g(x)f(x) if X is discrete,
E[g(X)] = ∫_{−∞}^∞ g(x)f(x)dx if X is continuous.
Definition 49. Variance of a Random Variable: The variance of a random variable is:
Var[x] = E[(x − µ)²] = Σ_x (x − µ)² f(x) if x is discrete,
Var[x] = E[(x − µ)²] = ∫_{−∞}^∞ (x − µ)² f(x)dx if x is continuous,
where µ = E(x).
The variance is usually denoted by σ².
The variance is conveniently computed according to the following equation:
Var(x) = σ² = E(x²) − µ²
Properties of expectations and variances: Let a and b be constants, and
let X and Y be random variables.
1. E(a) = a and Var(a) = 0
2. E(aX) = aE(X)
3. E(X + Y) = E(X) + E(Y)
4. E(XY) = E(X)E(Y) if X and Y are independent random variables.
5. Var(aX) = a²Var(X)
6. If X and Y are independent random variables,
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)
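These properties can be spot-checked by simulation. A sketch, assuming independent Uniform(0, 1) draws (each with variance 1/12); the sample size is an arbitrary choice:

    import random
    random.seed(0)

    n = 200_000
    X = [random.random() for _ in range(n)]   # Uniform(0,1), Var = 1/12
    Y = [random.random() for _ in range(n)]   # independent of X

    def var(v):
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)

    # Var(X + Y) ≈ Var(X) + Var(Y) for independent X and Y (both near 1/6)
    print(var([x + y for x, y in zip(X, Y)]), var(X) + var(Y))
    # Var(aX) = a² Var(X): scaling by 3 multiplies the variance by about 9
    print(var([3 * x for x in X]) / var(X))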
The Normal distribution: In econometrics you will often use the Normal
distribution. The general form of a normal distribution with mean µ and variance
σ² is
f(x|µ, σ²) = (1/(σ√(2π))) exp{−(1/2)(x − µ)²/σ²}
We usually denote the fact that x has a normal distribution by writing x ∼
N[µ, σ²], which reads: x is normally distributed with mean µ and variance σ².
Properties of a normal distribution:
1. If x ∼ N[µ, σ²], then a + bx ∼ N[a + bµ, b²σ²], where a and b are constants.
If a = −µ/σ and b = 1/σ, then letting z = a + bx we find that z ∼ N[0, 1]. N[0, 1]
is called the standard normal distribution and has the density function
φ(z) = (1/√(2π)) exp{−z²/2}
The notation φ(z) is often used to denote the standard normal density,
and Φ(z) is often used for its cdf.
2. If z ∼ N[0, 1], then z² ∼ χ²[1], where χ²[1] is the chi-squared distribution
with one degree of freedom.
3. If z1, z2, ..., zn are independent random variables and zi ∼ N[0, 1] for all i,
then
Σ_{i=1}^n zi² ∼ χ²[n]
You will also find that the t-distribution converges to the normal distribution
as its degrees of freedom grow.
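A simulation sketch of properties 1 and 2 (the values of µ and σ below are arbitrary choices for illustration):

    import random
    random.seed(1)

    mu, sigma, n = 2.0, 3.0, 200_000
    x = [random.gauss(mu, sigma) for _ in range(n)]

    # Property 1: z = (x - mu)/sigma should be (approximately) N[0, 1]
    z = [(xi - mu) / sigma for xi in x]
    mean_z = sum(z) / n
    var_z = sum(zi ** 2 for zi in z) / n - mean_z ** 2
    print(mean_z, var_z)                  # near 0 and 1

    # Property 2: z² ~ χ²[1], whose mean is 1
    print(sum(zi ** 2 for zi in z) / n)   # near 1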
0.5. Distribution and Expectation of Random Vectors
Above we discussed the case where we have one random variable (i.e., the univariate case). However, in many instances we will want to consider the case where
we have multiple random variables. The good news is that the concepts described
above can be fairly easily extended to the case of n random variables (the
multivariate case).
Definition 50. Given an experiment, the n-tuple of random variables (X1, X2, ..., Xn)
is referred to as an n-dimensional random vector (or n-variate random variable)
if each Xi associates a real number with every sample point ξ in S.
Let X denote a random vector in Rⁿ. The vector X = (X1, X2, ..., Xn) takes
on values in Rⁿ according to the following joint probability distribution (cdf):
F(x) = P(X ≤ x), where this equality is given by
F_{X1X2...Xn}(x1, x2, ..., xn) = P(X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn)
For this case we have that F_{X1X2...Xn}(∞, ∞, ..., ∞) = 1.
The marginal joint cdfs are obtained from this one by setting the appropriate
Xi's to ∞. For example, the bivariate distribution for x1 and x2 is given by
F_{X1X2}(x1, x2) = F_{X1X2...Xn}(x1, x2, ∞, ∞, ..., ∞)
For the discrete n-variate random variable, the joint pmf is defined by:
p_{X1X2...Xn}(x1, x2, ..., xn) = P(X1 = x1, X2 = x2, ..., Xn = xn)
Properties of p_{X1X2...Xn}(x1, x2, ..., xn):
1. 0 ≤ p_{X1X2...Xn}(x1, x2, ..., xn) ≤ 1
2. Σ_{x1} ⋯ Σ_{xn} p_{X1X2...Xn}(x1, x2, ..., xn) = 1
3. The marginal pmf of one random variable (or set of random variables) is
found by summing p_{X1X2...Xn}(x1, x2, ..., xn) over the ranges of the other
variables, e.g.,
p_{X1X2...X_{n−k}}(x1, x2, ..., x_{n−k}) = Σ_{x_{n−k+1}} Σ_{x_{n−k+2}} ⋯ Σ_{xn} p_{X1X2...Xn}(x1, x2, ..., xn)
4. Conditional pmfs are then defined in a straightforward manner. For example:
p_{Xn|X1,X2...X_{n−1}}(xn|x1, x2, ..., x_{n−1}) = p_{X1X2...Xn}(x1, x2, ..., xn) / p_{X1X2...X_{n−1}}(x1, x2, ..., x_{n−1})
If we are dealing with a continuous n-variate random variable whose cdf F_X has
a pdf f, that is,
F_X(x1, ..., xn) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xn} f(z1, ..., zn)dz1...dzn
for some function f, then we can generally find the joint pdf for a continuous
n-variate random variable by:
f_{X1X2...Xn}(x1, ..., xn) = ∂ⁿF_{X1X2...Xn}(x1, ..., xn) / (∂x1∂x2...∂xn)
If we know the joint distribution function of a random vector X, then we also know
the joint distribution of any subvector of X. For example, the joint distribution
of X(k) = (X1, ..., Xk)′, k < n, is
F_{X(k)}(x1, ..., xk) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xk} [∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f(z1, ..., zn)dz_{k+1}...dzn] dz1...dzk.
Properties of f_{X1X2...Xn}(x1, x2, ..., xn):
1. f_{X1X2...Xn}(x1, x2, ..., xn) ≥ 0
2. ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f_{X1X2...Xn}(x1, x2, ..., xn)dx1...dxn = 1
3. The marginal pdf of one random variable (or set of random variables) is
found by integrating f_{X1X2...Xn}(x1, x2, ..., xn) over the ranges of the other
variables, e.g.,
f_{X1X2...X_{n−k}}(x1, x2, ..., x_{n−k}) = ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f_{X1X2...Xn}(x1, x2, ..., xn)dx_{n−k+1}dx_{n−k+2}...dxn
4. Conditional pdfs are then easily computed. For example:
f_{Xn|X1,X2...X_{n−1}}(xn|x1, x2, ..., x_{n−1}) = f_{X1X2...Xn}(x1, x2, ..., xn) / f_{X1X2...X_{n−1}}(x1, x2, ..., x_{n−1})
Proposition 51. If the joint distribution function of X and Y has a p.d.f. f(x, y),
then the conditional distribution function of X given Y has a p.d.f. f(x|y) given
by the following:
f(x|y) = f(x, y)/f_Y(y)
where f_Y(y) ≡ ∫ f(x, y)dx is the marginal density of the random variable Y.
0.5.1. Expectations:
The definition of an expectation is also easily generalizable to the case where we
have multiple random variables:
E[g(X)] = ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ g(x1, ..., xn)f(x1, ..., xn)dx1...dxn if the variables are continuous,
E[g(X)] = Σ ⋯ Σ g(x1, ..., xn)f(x1, ..., xn) if the variables are discrete.
Let E_X(g(X)) and E_{X,Y}(g(X)) denote the expectation of the function g(X)
with respect to the marginal and the joint distributions, respectively. It is easy to
show that
E_X(g(X)) = E_{X,Y}(g(X)),
since E_{X,Y}(g(X)) = Σ_{i,j} g(xi)f(xi, yj) = Σ_i g(xi)[Σ_j f(xi, yj)] = Σ_i g(xi)f(xi) =
E_X(g(X)).
Theorem 52. If X and Y are independent, then E_{X,Y}[g(X)h(Y)] = E_X[g(X)]E_Y[h(Y)].
Definition 53. Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
Definition 54. The conditional expectation of Y given X = x is defined as:
E(Y|X = x) = Σ_i yi f(yi|x) in the discrete case,
E(Y|X = x) = ∫ y f(y|x)dy in the continuous case.
Definition 55. The conditional expectation of g(X, Y) given X = x is:
E(g(X, Y)|X = x) = Σ_i g(x, yi)f(yi|X = x) in the discrete case,
E(g(X, Y)|X = x) = ∫ g(x, y)f(y|X = x)dy in the continuous case.
Theorem 56. Law of iterated expectations: E[y] = E_x[E[y|x]], where the notation
E_x[·] indicates the expectation over the values of x.
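The law of iterated expectations can be checked by simulation. A sketch under an assumed joint model (not from the notes): y = x + ε with x ∼ Uniform(0, 1) and ε ∼ N[0, 1], so that E[y|x] = x:

    import random
    random.seed(2)

    n = 200_000
    x = [random.random() for _ in range(n)]       # x ~ Uniform(0,1)
    y = [xi + random.gauss(0, 1) for xi in x]     # E[y|x] = x

    print(sum(y) / n)    # direct estimate of E[y], near 0.5
    # E_x[E[y|x]]: since E[y|x] = x here, averaging the conditional
    # means over the distribution of x gives E[x] = 0.5 as well
    print(sum(x) / n)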
Properties of Conditional Expectation: The conditional expectation can be
defined given any σ-field A such that A ⊂ F, or any random variable Y. The
following proposition gives some of the properties of the conditional expectation
using this formal language.
Proposition 57. Let X and Y be two integrable random variables on a probability space (Ω, F, P) and let F1 be a σ-field such that F1 ⊂ F.
(i) If X = c a.s. for some real number c, then E[X|F1] = c a.s.
(ii) If X ≤ Y a.s., then E[X|F1] ≤ E[Y|F1] a.s.
(iii) If a and b are real numbers, then E[aX + bY|F1] = aE[X|F1] + bE[Y|F1]
a.s.
(iv) E[E[X|F1]] = E[X].
(v) If F0 ⊂ F1, then E[E[X|F1]|F0] = E[X|F0] = E[E[X|F0]|F1] a.s.
(vi) If σ(Y) ⊂ F1 and E[|XY|] < ∞, then E[XY|F1] = Y E[X|F1] a.s.
(vii) If E[|g(X, Y)|] < ∞, then E[g(X, Y)|Y = y] = E[g(X, y)|Y = y] a.s.
Variance-Covariance Matrix of a Random Vector:
The expectation of a random vector is defined as the vector consisting of
the expected value of each individual random variable:
E[X] = (E[X1], ..., E[Xn])′.
The variance-covariance matrix of a random vector X is defined as
Var(X) = E[(X − E[X])(X − E[X])′].
Here, the expectation is taken element by element.
Proposition 58. Let X be a random vector.
(i) For any vector c, E[c′X] = c′E[X], and Var(c′X) = c′Var(X)c.
(ii) The variance-covariance matrix of X is positive semidefinite.
Proof: (i) For any vector c, we have
E[c′X] ≡ E[c1X1 + ... + cnXn]
= c1E[X1] + ... + cnE[Xn]
≡ c′E[X].
Var(c′X) ≡ E[(c′X − E[c′X])(c′X − E[c′X])′]
= E[c′(X − E[X])(c′(X − E[X]))′]
= E[c′(X − E[X])(X − E[X])′c]
= c′E[(X − E[X])(X − E[X])′]c
≡ c′Var(X)c.
(ii) For any vector c,
(c′X − E[c′X])(c′X − E[c′X])′ = (c′X − E[c′X])² ≥ 0.
So,
c′Var(X)c = E[(c′X − E[c′X])(c′X − E[c′X])′] ≥ 0.
Q.E.D.
Transformation of random variables: Let X be a continuous random variable with pdf f_X(x). If the transformation y = g(x) is one-to-one and has the
inverse transformation x = g⁻¹(y) = h(y), then the pdf of Y is given by
f_Y(y) = f_X(x)|dx/dy| = f_X[h(y)]|dh(y)/dy|.
Let Z = g(X, Y) and W = h(X, Y), where X and Y are random variables and
f_{X,Y}(x, y) is the joint pdf of X and Y. If the transformation z = g(x, y),
w = h(x, y) is one-to-one and has the inverse transformation x = q(z, w),
y = r(z, w), then the joint pdf for Z and W is given by:
f_{Z,W}(z, w) = f_{X,Y}(x, y)|J(x, y)|⁻¹, where x = q(z, w), y = r(z, w), and
J(x, y) = det[∂g/∂x ∂g/∂y ; ∂h/∂x ∂h/∂y] = det[∂z/∂x ∂z/∂y ; ∂w/∂x ∂w/∂y]
which is the Jacobian of the transformation z = g(x, y), w = h(x, y). If we then
define
J(z, w) = det[∂q/∂z ∂q/∂w ; ∂r/∂z ∂r/∂w] = det[∂x/∂z ∂x/∂w ; ∂y/∂z ∂y/∂w]
then |J(z, w)| = |J(x, y)|⁻¹ and f_{Z,W}(z, w) = f_{X,Y}[q(z, w), r(z, w)]|J(z, w)|.
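A univariate sketch of the change-of-variables formula (an illustration, not from the notes, assuming X ∼ Exp(1) and the transformation y = g(x) = x², which is one-to-one on x ≥ 0 with inverse h(y) = √y): the implied density f_Y(y) = f_X[h(y)]|dh(y)/dy| should match simulated frequencies.

    import math
    import random
    random.seed(3)

    f_X = lambda x: math.exp(-x)                 # pdf of X ~ Exp(1)
    h = lambda y: math.sqrt(y)                   # inverse of y = x² on x >= 0
    f_Y = lambda y: f_X(h(y)) * abs(0.5 / math.sqrt(y))   # f_X[h(y)] |dh/dy|

    n = 200_000
    ys = [random.expovariate(1.0) ** 2 for _ in range(n)]

    a, b = 1.0, 2.0
    empirical = sum(a < y <= b for y in ys) / n
    # midpoint-rule integral of f_Y over (a, b]
    m = 10_000
    step = (b - a) / m
    implied = sum(f_Y(a + (i + 0.5) * step) for i in range(m)) * step
    print(empirical, implied)    # both near 0.125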
The multivariate normal distribution: Let x be a set of random variables,
x = (x1, ..., xn), with mean vector µ and covariance matrix Σ. The general
form of the joint density for the multivariate normal is
f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)′Σ⁻¹(x − µ)}
Properties of the multivariate normal: Let x1 be any subset of the variables,
including a single variable, and let x2 be the remaining variables. Partition µ and
Σ likewise, so
µ = [µ1 ; µ2] and Σ = [Σ11 Σ12 ; Σ21 Σ22]
1. If [x1, x2] have a joint multivariate normal distribution, then the marginal
distributions are
x1 ∼ N(µ1, Σ11) and
x2 ∼ N(µ2, Σ22)
2. If [x1, x2] have a joint multivariate normal distribution, then the conditional
distribution of x1 given x2 is also normal:
x1|x2 ∼ N(µ_{1.2}, Σ_{11.2}), where
µ_{1.2} = µ1 + Σ12 Σ22⁻¹ (x2 − µ2)
Σ_{11.2} = Σ11 − Σ12 Σ22⁻¹ Σ21
0.6. Markov Process
Stochastic Process:
A stochastic process is a sequence of random variables {Zt, t = 0, 1, ...} on
a fixed probability space (S, F, P) such that Zt is measurable with respect to a
σ-field Ft for all t = 0, 1, ... and the sequence of σ-fields {Ft, t = 0, 1, ...}
is a filtration, i.e., Ft ⊂ F_{t+1} for any t = 0, 1, ....
Note that for any sequence of random variables, we can always set Ft =
σ(Z0, ..., Zt). Then clearly {Ft, t = 0, 1, ...} is a filtration and Zt is measurable
with respect to Ft. Most of the time this is the natural filtration that we work
with for stochastic processes. However, there may be other filtrations such that
{Zt, t = 0, 1, ...} is also a stochastic process. Which filtration to use depends on
the problem we want to study.
Sample Path:
For any given ω ∈ S, we call the sequence of real numbers {Zt(ω), t = 0, 1, ...}
a sample path of the stochastic process.
Markov Process:
A stochastic process is called a Markov process if for any A ∈ F, t ≥ 1, and
1 ≤ n ≤ t, we have
P(A|σ(Zt, Z_{t−1}, ..., Z_{t−n})) = P(A|Zt).
Another way to put this is that a random process {X(t), t ∈ T} is a Markov
process if
P(X(t_{n+1}) ≤ x_{n+1}|X(t1) = x1, X(t2) = x2, ..., X(tn) = xn) = P(X(t_{n+1}) ≤ x_{n+1}|X(tn) = xn)
whenever t1 < t2 < ... < tn < t_{n+1}.
This type of process has a memoryless property, since the future state of the
process depends only on the present state and not on the past history.
A discrete-state Markov process is called a Markov chain.
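A minimal sketch of a two-state Markov chain (the transition matrix is a made-up example, not from the notes); the next state is drawn using only the current state, which is exactly the memoryless property:

    import random
    random.seed(4)

    # Hypothetical two-state transition matrix: P[i][j] = P(Z_{t+1} = j | Z_t = i)
    P = [[0.9, 0.1],
         [0.5, 0.5]]

    def simulate(steps, state=0):
        path = [state]
        for _ in range(steps):
            # the next state depends only on the current state
            state = random.choices([0, 1], weights=P[state])[0]
            path.append(state)
        return path

    print(simulate(20))    # one sample path of the chain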
Acknowledgements: The notes for this section have been based on materials
provided by Prof. Xiaodong Zhu, Prof. Angelo Melino, Prof. Bruce Hansen, and
on chapters in Econometric Analysis by W. Greene, and many of the definitions are
taken from Probability, random variables and random processes by Hsu. Please
do not circulate these notes without permission of the author.