Notes on Set Theory and Probability Theory
Michelle Alexopoulos
August 2003
0.1. Set Theory and Probability Theory
Before we talk about probability, it is useful to review some basic definitions and
theorems from set theory.
Definition 1. A set is a collection of objects.
Definition 2. If every element in a set A is also a member of set B then A is a
subset of B, i.e., A ⊂ B.
Definition 3. Two sets, A and B, are equal, denoted A = B, if and only if every
element in A belongs to the set B and every element in B belongs to set A, i.e.,
A ⊆ B and A ⊇ B.
Definition 4. B is a proper subset of A if B is a subset of A, but B does not
equal A.
Definition 5. The empty set, or null set, is a set which contains no elements,
and is denoted by the symbol ∅.
Definition 6. Suppose that A ⊂ S. The complement of set A, denoted Ā or
Aᶜ, is the set containing all elements in S that are not in A, i.e.,
Aᶜ = {γ : γ ∈ S and γ ∉ A}
Definition 7. The union of sets A and B, denoted A ∪ B, is the set containing
all elements in either A or B or both, i.e.,
A ∪ B = {γ : γ ∈ A or γ ∈ B}.
Definition 8. The intersection of sets A and B, denoted A ∩ B, is the set containing all elements in both A and B, i.e.,
A ∩ B = {γ : γ ∈ A and γ ∈ B}.
Definition 9. Two sets, A and B, are called disjoint or mutually exclusive if they
contain no common elements, i.e., if A ∩ B = ∅.
Definition 10. The set of all possible outcomes of a random experiment is called
the sample space (or universal set) and is denoted here by S (some texts write U).
Some Theorems involving sets:
Theorem 11. A ∪ B = B ∪ A (commutative law for unions)
Theorem 12. (A ∪ B) ∪ C = A ∪ (B ∪ C) (associative law for unions)
Theorem 13. A ∩ B = B ∩ A (commutative law for intersections)
Theorem 14. (A ∩ B) ∩ C = A ∩ (B ∩ C) (associative law for intersections)
Theorem 15. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) (First distributive law)
Theorem 16. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) (Second distributive law)
Theorem 17. If A ⊂ B, then Aᶜ ⊃ Bᶜ
Theorem 18. A ∪ ∅ = A and A ∩ ∅ = ∅
Theorem 19. (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ (De Morgan's first law)
Theorem 20. (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ (De Morgan's second law)
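These identities are easy to verify numerically. Below is a small sketch (not part of the original notes) using Python's built-in set type; the sets S, A, B, and C are arbitrary examples chosen for illustration:

    # Illustrative check of the set identities above, using Python sets.
    # S, A, B, and C are arbitrary example sets for this sketch.
    S = set(range(10))              # the universal set
    A = {0, 1, 2, 3}
    B = {2, 3, 4, 5}
    C = {1, 5, 7}

    def complement(X):
        return S - X

    # De Morgan's first law: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
    assert complement(A | B) == complement(A) & complement(B)
    # De Morgan's second law: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
    assert complement(A & B) == complement(A) | complement(B)
    # First distributive law: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
    assert A & (B | C) == (A & B) | (A & C)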
0.2. Probability Space
Basic probability theory is defined using a triple (S, F, P), where S is the sample
space, F is the collection of events, and P is a function that maps F into the
interval [0,1]. P is the probability measure and, intuitively, F is the set of all
events that can be verified to have occurred or not occurred. I will discuss these
objects in more detail below. However, for the most part, I will follow the notation
used in standard textbooks like Greene's Econometric Analysis.
Definition 21. Sample space S: a set of elements of interest.
In elementary probability theory, we usually associate these elements with
outcomes of an experiment. For example, consider a simple experiment of a single
toss of a coin. The sample space of this experiment is
S = {H, T },
where H stands for “head” and T for “tail”. For an experiment of three tosses of
a coin, then the sample space is
S = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
More generally, the sample space of an experiment of n tosses of a coin is
S = {ω : ω = (a_1, ..., a_n), a_i = H or T}.
An alternate example is one where an individual rolls a die once. The sample
space of this experiment is:
S = {1, 2, 3, 4, 5, 6}
and if we rolled the die n times, the sample space of this experiment would be:
S = {ω : ω = (a_1, ..., a_n), a_i = 1, 2, 3, 4, 5, or 6}.
In modern probability theory, the sample space can be fairly general and abstract. For example, it can be the collection of all real numbers, R, or the collection
of all n-dimensional vectors, Rⁿ, or any subset of these collections.
The Axioms of Probability: Suppose we have a sample space S. If S is discrete, then all subsets correspond to events, but if S is continuous, only measurable
subsets correspond to events.
To each event A in the class of events C, we associate a real number, P(A), i.e.,
P is a real-valued function defined on C. Then P is the probability function and
P(A) is the probability of the event A, if the following axioms are satisfied:
Axiom 1: For every event A, P (A) ≥ 0.
Axiom 2: For the sure or certain event S, P (S) = 1.
Axiom 3: For any countable collection of mutually exclusive events A1, A2, ...,
P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ⋯
Some Theorems on Probability:
Theorem 22. If A1 ⊆ A2, then P(A1) ≤ P(A2) and P(A2 − A1) = P(A2) − P(A1)
Theorem 23. For every event A, 0 ≤ P (A) ≤ 1.
Theorem 24. P (∅) = 0 (i.e., the impossible event has probability 0)
Theorem 25. If Aᶜ is the complement of A, then P(Aᶜ) = 1 − P(A)
Theorem 26. If A = A1 ∪ A2 ∪ ... ∪ An, where A1, ..., An are mutually exclusive
events, then P(A) = Σ_{i=1}^n P(A_i).
Theorem 27. If A and B are any two events, then P (A ∪ B) = P (A) + P (B) −
P (A ∩ B).
Conditional Probability: Let A and B be two events such that P (A) > 0.
Let P (B|A) denote the probability of B given that A has occurred. Since A has
already occurred, it becomes the new sample space. From this, we are led to the
definition of P (B|A):
P(B|A) = P(A ∩ B)/P(A), or equivalently P(A ∩ B) = P(A)P(B|A)
Definition 28. If P(B|A)=P(B), then A and B are independent events.
This is equivalent to P(A ∩ B) = P (A)P (B).
Theorem 29. Bayes Rule: Suppose that A1 , A2 , ..., An are mutually exclusive
events whose union is the sample space, S. Then if A is any event:
P(A_k | A) = P(A_k)P(A|A_k) / Σ_{i=1}^n P(A_i)P(A|A_i)
Definition 30. Two events A and B are said to be independent if and only if
P (A ∩ B) = P (A)P (B)
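As a numerical illustration of Bayes' rule, consider the following sketch (the priors and likelihoods are made-up numbers, not from the notes):

    # Bayes' rule with hypothetical numbers: three mutually exclusive events
    # A1, A2, A3 that partition S, and an observed event A.
    priors = [0.5, 0.3, 0.2]        # P(A1), P(A2), P(A3); they sum to 1
    likelihoods = [0.1, 0.4, 0.8]   # P(A|A1), P(A|A2), P(A|A3)

    # Denominator: P(A) = Σ P(Ai)P(A|Ai)
    p_a = sum(p * l for p, l in zip(priors, likelihoods))
    # Posterior: P(Ak|A) = P(Ak)P(A|Ak) / P(A)
    posteriors = [p * l / p_a for p, l in zip(priors, likelihoods)]
    print(posteriors)               # sums to 1; e.g. P(A3|A) = 0.16/0.33 ≈ 0.485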
Counting: Suppose that we are given n distinct objects.
Definition 31. nPr is the number of permutations of n objects taken r at a time:
nPr = n!/(n − r)!
Definition 32. nCr is the number of combinations of n objects taken r at a time:
nCr = (n choose r) = n!/(r!(n − r)!)
nPr is used when order matters. For example, if we want to find out how many
different permutations consisting of three letters each can be formed from the 4
letters A, B, C, and D, the answer is given by
4P3 = 4!/(4 − 3)! = 4 · 3 · 2 · 1 = 24
In this case order matters, i.e., ABC is a different permutation than ACB, or
BAC, or CAB, etc.
If we only want to know how many ways three letters can be chosen from the
set of letters A, B, C, and D, then the answer is given by
4C3 = 4!/(3!(4 − 3)!) = 4
In this case order does not matter, so we only have 4 possibilities, i.e., (1)
A, B, and C, (2) A, C, and D, (3) A, B, and D, and (4) B, C, and D.
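The Python standard library reproduces both counts; a minimal sketch (math.perm and math.comb assume Python 3.8 or later):

    import math
    from itertools import combinations, permutations

    letters = ["A", "B", "C", "D"]
    # 4P3 = 4!/(4 - 3)! = 24 ordered arrangements of three letters
    assert math.perm(4, 3) == len(list(permutations(letters, 3))) == 24
    # 4C3 = 4!/(3!(4 - 3)!) = 4 unordered selections of three letters
    assert math.comb(4, 3) == len(list(combinations(letters, 3))) == 4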
Other commonly used definitions: If you are reading more advanced books
on probability theory, you will often find terms like sigma-field and sigma-algebra,
and references to measurable functions and measure spaces. I will now briefly
turn to these.
Partition and Information: A partition of a set S is a finite collection A =
{A1, A2, ..., AN} of disjoint subsets of S whose union is S.
Examples: S = {−0.05, −0.01, 0, 0.01, 0.05}, and
A0 = {S},
A1 = {{−0.05, −0.01}, {0}, {0.01, 0.05}},
A2 = {{−0.05}, {−0.01}, {0}, {0.01}, {0.05}}.
Ai, i = 0, 1, 2, can each be thought of as representing the information an agent
may have. Suppose that the numbers in S represent all possible returns in the
stock market in a day. Then, after the return is realized, an agent with information
partition A0 has effectively no information about the return, while an agent with
information A1 can tell whether the return is positive, zero, or negative, and an
agent with information A2 knows exactly what the return is. So, these three
partitions represent progressively more information.
Given a partition A = {A1, A2, ..., AN}, an agent may assign different probabilities to each of the events in the partition, P1, ..., PN. Based on these probabilities,
the agent should also be able to decide the probabilities of events such as A1 ∪ A2
or A1ᶜ:
P(A1 ∪ A2) = P(A1) + P(A2),
P(A1ᶜ) = 1 − P(A1).
This motivates the following definition of measurable sets, which can be thought of
as all events that can be assigned a probability.
Above, I mentioned that F includes all events on the sample space that
can be verified to have occurred or not occurred. Basically this means that if
set A is an event, then its complement, Aᶜ (i.e., not A), must also be an event.
Furthermore, if A and B are events then we need to be able to determine: (a)
if both A and B happened and (b) if either A or B (or both) occurred. Thus,
A ∩ B and A ∪ B are also events. An algebra (or field) is a collection of subsets
that is closed under complementation, intersection and union. For our purposes,
we will also require F to be closed under countable unions/intersections, and we
will refer to F as a σ-algebra (σ-field). This next definition states these ideas
more formally.
Definition 33. A σ-field F is a collection of subsets of a sample space S with
the following properties:
(i) The empty set ∅ ∈ F.
(ii) If A ∈ F, then the complement Aᶜ ∈ F.
(iii) If Ai ∈ F, i = 1, 2, ..., then their union ∪_i Ai ∈ F.
Note that if Ai ∈ F for i = 1, 2, then Aiᶜ ∈ F, which implies that A1ᶜ ∪ A2ᶜ ∈ F.
Thus, (A1ᶜ ∪ A2ᶜ)ᶜ ∈ F. However,
(A1ᶜ ∪ A2ᶜ)ᶜ ≡ {ω ∈ S : ω ∉ A1ᶜ and ω ∉ A2ᶜ}
= {ω ∈ S : ω ∈ A1 and ω ∈ A2}
≡ A1 ∩ A2.
So, A1 ∩ A2 ∈ F.
Definition 34. A pair (S, F) is called a measurable space, and any subset in F
is called a measurable set or event.
Examples:
(i) F = {∅, S}
(ii) F = {∅, A, Aᶜ, S} = {∅, {−0.05, −0.01}, {0, 0.01, 0.05}, S}
Definition 35. σ(C): smallest σ-field that contains the collection of subsets, C.
Examples:
(i) σ(A) = {∅, A, Aᶜ, S}.
(ii) C = {A1, A2}: σ(C) = {∅, A1, A2, A1ᶜ, A2ᶜ, A1 ∪ A2, A1ᶜ ∪ A2ᶜ, A1 ∩ A2,
A1ᶜ ∩ A2, A1 ∩ A2ᶜ, A1 ∪ A2ᶜ, A1ᶜ ∪ A2, S}
(iii) B, the σ-field generated by all the open intervals in R. We call all the
subsets in B Borel sets.
Let A and A′ be two partitions of S. We say that the information represented by
A is finer than that represented by A′ if σ(A′) ⊂ σ(A).
Definition 36. A measure is a set function v defined on F such that:
(i) 0 ≤ v(A) ≤ ∞ for any A ∈ F.
(ii) v(∅) = 0.
(iii) If Ai ∈ F, i = 1, 2, ..., and Ai ∩ Aj = ∅ for any i ≠ j, then
v(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ v(Ai).
Examples:
Counting measure: S = {a1, a2, a3, ...}, F contains all the subsets of S, and
v(A) = number of elements in subset A.
Lebesgue measure: S = R, F = B, and
v((a, b)) = b − a.
Proposition 37. For a measure space (S, F, v), we have
(i) If A ⊂ B, then v(A) ≤ v(B).
(ii) For any sequence A1, A2, ...,
v(∪_{i=1}^∞ Ai) ≤ Σ_{i=1}^∞ v(Ai).
(iii) If A1 ⊂ A2 ⊂ A3 ⊂ ... (or A1 ⊃ A2 ⊃ A3 ⊃ ...), then
v(lim_{n→∞} An) = v(∪_{i=1}^∞ Ai) = lim_{n→∞} v(An)
(or
v(lim_{n→∞} An) = v(∩_{i=1}^∞ Ai) = lim_{n→∞} v(An)
if v(A1) < ∞).
Proof: (i) Let C = B ∩ Aᶜ. Then C ∈ F and
v(B) = v(A ∪ C) = v(A) + v(C) ≥ v(A)
because v(C) ≥ 0.
(ii) Let C1 = A1, C2 = A2 ∩ C1ᶜ, C3 = A3 ∩ C2ᶜ ∩ C1ᶜ, .... Then Ci, i = 1, 2, ...
is a sequence of disjoint sets such that ∪_{i=1}^∞ Ai = ∪_{i=1}^∞ Ci and, by (i),
v(Ci) ≤ v(Ai). Thus, we have
v(∪_{i=1}^∞ Ai) = v(∪_{i=1}^∞ Ci) = Σ_{i=1}^∞ v(Ci) ≤ Σ_{i=1}^∞ v(Ai).
(iii) If An is an increasing sequence, let A0 = ∅, and Dn = An − An−1 ≡
An ∩ A_{n−1}ᶜ for n ≥ 1. Then Dn, n = 1, 2, ... is a sequence of disjoint sets such
that ∪_{n=1}^∞ An = ∪_{n=1}^∞ Dn. By the definition of measure, we have
v(∪_{n=1}^∞ An) = v(∪_{n=1}^∞ Dn) = Σ_{n=1}^∞ v(Dn)
= lim_{n→∞} Σ_{i=1}^n v(Di)
= lim_{n→∞} Σ_{i=1}^n [v(Ai) − v(A_{i−1})]
= lim_{n→∞} v(An).
Now, if An is a decreasing sequence such that v(A1) < ∞, then Bn = A1 − An
is an increasing sequence. From what we just proved, we have
v(∪_{n=1}^∞ Bn) = lim_{n→∞} v(Bn) = v(A1) − lim_{n→∞} v(An).
However, ∩_{n=1}^∞ An = A1 − (∪_{n=1}^∞ Bn), so
v(∩_{n=1}^∞ An) = v(A1) − v(∪_{n=1}^∞ Bn) = lim_{n→∞} v(An).
Q.E.D.
If v(S) = 1, then v is called a probability measure.
Proposition 38. Let (S, F) be a measurable space.
(i) If f and g are measurable, then so are f g and af + bg, where a and b are
two real numbers; also, f /g is measurable provided g(ω) ≠ 0 for any ω ∈ S.
(ii) If f1, f2, ... are measurable, then so are sup_n fn and inf_n fn. Furthermore,
if lim_{n→∞} fn exists, then it is also measurable.
(iii) Suppose that f is a measurable function on (S, F) and g a measurable
function on (R, B); then the composite function g ∘ f defined by g ∘ f(ω) =
g(f(ω)) is also a measurable function.
(iv) If f is a continuous function on (R, B), then f is measurable.
Proposition 39. Let f and g be measurable functions on a measure space (S, F, v).
(i) ∫(af + bg)dv = a∫f dv + b∫g dv.
(ii) If f = g a.e., then ∫f dv = ∫g dv.
(iii) If f ≤ g a.e., then ∫f dv ≤ ∫g dv.
(iv) If f ≥ 0 a.e. and ∫f dv = 0, then f = 0 a.e.
(v) If f ≥ 0 a.e. and ∫f dv = 1, then the set function
P(B) = ∫_B f dv
is a probability measure on (S, F). The function f is called the probability
density function (p.d.f.) of P with respect to the measure v.
(vi) If fn → f a.e., |fn| ≤ g, and ∫g dv < ∞, then
lim_{n→∞} ∫fn dv = ∫f dv.
(vii) If |∂f(ω, θ)/∂θ| ≤ g(ω) a.e. and ∫g dv < ∞, then
(d/dθ)[∫f(ω, θ)dv] = ∫[∂f(ω, θ)/∂θ]dv.
0.3. Random Variables and Probability Distribution
Definition 40. Consider a random experiment with sample space S. A random
variable X(ξ) is a single-valued real function that assigns a real number to each
sample point ξ of S. Often we use a single letter X for this function in place of
X(ξ).
Probability Distribution:
Definition 41. A listing of the values x taken by a random variable X and their
associated probabilities is a probability distribution, f(x).
Definition 42. The distribution function [or cumulative distribution function
(c.d.f.)] of X is the function defined by:
F_X(x) ≡ P(X ≤ x), −∞ < x < ∞.
Properties of F_X(x):
1. 0 ≤ F_X(x) ≤ 1
2. F_X(x1) ≤ F_X(x2) if x1 < x2 (i.e., F_X is non-decreasing)
3. lim_{x→∞} F_X(x) = F_X(∞) = 1
4. lim_{x→−∞} F_X(x) = F_X(−∞) = 0
5. lim_{x→a⁺} F_X(x) = F_X(a⁺) = F_X(a), where a⁺ = lim_{0<ε→0}(a + ε) (i.e., F_X
is right continuous)
**Note that a distribution function may not be left continuous.
Definition 43. Let X be a random variable with cdf F_X(x). X is a discrete
random variable only if its range contains a finite or countably infinite number of
points. Alternatively, if F_X(x) changes values only in jumps (at most a countable
number of them) and is constant between jumps, then X is called a discrete random
variable.
Definition 44. Suppose that the jumps in F_X(x) of a discrete random variable X
occur at the points x1, x2, ..., where the sequence may be either finite or countably
infinite, and we assume xi < xj if i < j. Then:
F_X(xi) − F_X(x_{i−1}) = P(X ≤ xi) − P(X ≤ x_{i−1}) = P(X = xi)
Let p_X(x) = P(X = x). The function p_X(x) is called the probability mass
function (pmf) of the discrete random variable X.
Properties of p_X(x):
1. 0 ≤ p_X(xk) ≤ 1 for k = 1, 2, ...
2. p_X(x) = 0 if x ≠ xk for k = 1, 2, ...
3. Σ_k p_X(xk) = 1
The probability distribution for a discrete random variable is f(x) = p_X(x) =
P(X = x), and the c.d.f. F_X(x) of a discrete random variable X can be obtained
by:
F_X(x) = P(X ≤ x) = Σ_{xk ≤ x} p_X(xk)
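As a concrete sketch (not in the original notes), the cdf of a fair six-sided die can be built from its pmf exactly as in the formula above:

    # pmf of a fair six-sided die: p(x) = 1/6 for x = 1, ..., 6
    pmf = {x: 1 / 6 for x in range(1, 7)}

    def cdf(x):
        # F(x) = P(X <= x) = sum of the pmf over the points xk <= x
        return sum(p for xk, p in pmf.items() if xk <= x)

    assert cdf(0) == 0                      # no mass below 1
    assert abs(cdf(3) - 0.5) < 1e-12        # P(X <= 3) = 1/2
    assert abs(cdf(6) - 1.0) < 1e-12        # all mass at or below 6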
Definition 45. Let X be a random variable with cdf F_X(x). X is a continuous
random variable only if its range contains an interval (either finite or infinite)
of real numbers. Alternatively, if F_X(x) is continuous and also has a derivative
dF_X(x)/dx which exists everywhere except at possibly a finite number of points
and is piecewise continuous, then X is called a continuous random variable.
For the case of a continuous random variable, the probability associated with
any particular point is zero (i.e., P(X = x) = 0). However, we can assign a positive
probability to intervals in the range of X.
Definition 46. Let f(x) = dF_X(x)/dx. The function f(x) is called the probability
density function (pdf) of the continuous random variable X.
Properties of f(x):
1. f(x) ≥ 0
2. ∫_{−∞}^∞ f(x)dx = 1
3. f(x) is piecewise continuous
4. P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) = ∫_a^b f(x)dx
The cumulative distribution function for the continuous random variable X is
F_X(x) = P(X ≤ x) = ∫_{−∞}^x f(t)dt
Furthermore, from the definition of the cdf we know that
P(a < X ≤ b) = F_X(b) − F_X(a)
**Note that many books write F_X(x) as F(x).
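The following sketch illustrates these relationships numerically, assuming for concreteness an Exp(1) random variable with pdf f(x) = e^(−x) for x ≥ 0 (this example is not from the original notes):

    import math

    f = lambda x: math.exp(-x)          # pdf of X ~ Exp(1), x >= 0
    F = lambda x: 1 - math.exp(-x)      # its cdf F(x) = P(X <= x)

    def integral(g, a, b, n=10_000):
        # midpoint-rule approximation of the integral of g over [a, b]
        h = (b - a) / n
        return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

    a, b = 0.5, 2.0
    # P(a < X <= b) computed two ways: integrating f, and F(b) - F(a)
    assert abs(integral(f, a, b) - (F(b) - F(a))) < 1e-6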
0.4. Expectation of a Random Variable
Definition 47. Mean of a Random Variable: The mean, or expected value, of a
random variable is:
E[x] = Σ_x x f(x) if x is discrete,
E[x] = ∫_{−∞}^∞ x f(x)dx if x is continuous.
The mean is normally denoted by µ.
Proposition 48. Let g(X) be a function of X. The expected value of g(X) is:
E[g(X)] = Σ_x g(x)f(x) if X is discrete,
E[g(X)] = ∫_{−∞}^∞ g(x)f(x)dx if X is continuous.
Definition 49. Variance of a Random Variable: The variance of a random variable is:
Var[x] = E[(x − µ)²] = Σ_x (x − µ)² f(x) if x is discrete,
Var[x] = E[(x − µ)²] = ∫_{−∞}^∞ (x − µ)² f(x)dx if x is continuous,
where µ = E(x).
The variance is usually denoted by σ².
The variance is conveniently computed according to the following equation:
Var(x) = σ² = E(x²) − µ²
Properties of expectations and variances: Let a and b be constants, and
let X and Y be random variables.
1. E(a) = a and Var(a) = 0
2. E(aX) = aE(X)
3. E(X + Y) = E(X) + E(Y)
4. E(XY) = E(X)E(Y) if X and Y are independent random variables.
5. Var(aX) = a²Var(X)
6. If X and Y are independent random variables,
Var(X + Y) = Var(X) + Var(Y)
Var(X − Y) = Var(X) + Var(Y)
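These properties can be spot-checked by simulation. A sketch, assuming independent Uniform(0, 1) draws (each with variance 1/12); the sample size is an arbitrary choice:

    import random
    random.seed(0)

    n = 200_000
    X = [random.random() for _ in range(n)]   # Uniform(0,1), Var = 1/12
    Y = [random.random() for _ in range(n)]   # independent of X

    def var(v):
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)

    # Var(X + Y) ≈ Var(X) + Var(Y) for independent X and Y (both near 1/6)
    print(var([x + y for x, y in zip(X, Y)]), var(X) + var(Y))
    # Var(aX) = a² Var(X): scaling by 3 multiplies the variance by about 9
    print(var([3 * x for x in X]) / var(X))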
The Normal distribution: In econometrics you will often use the Normal
distribution. The general form of a normal distribution with mean µ and variance
σ² is
f(x|µ, σ²) = (1/(σ√(2π))) exp{−(1/2)(x − µ)²/σ²}
We usually denote the fact that x has a normal distribution by writing x ∼
N[µ, σ²], which reads: x is normally distributed with mean µ and variance σ².
Properties of a normal distribution:
1. If x ∼ N[µ, σ²], then a + bx ∼ N[a + bµ, b²σ²], where a and b are constants.
If a = −µ/σ and b = 1/σ, then letting z = a + bx we find that z ∼ N[0, 1]. N[0, 1]
is called the standard normal distribution and has the density function
φ(z) = (1/√(2π)) exp{−z²/2}
The notation φ(z) is often used to denote the standard normal density,
and Φ(z) is often used for its cdf.
2. If z ∼ N[0, 1], then z² ∼ χ²[1], where χ²[1] is the chi-squared distribution
with one degree of freedom.
3. If z1, z2, ..., zn are independent random variables and zi ∼ N[0, 1] for all i,
then
Σ_{i=1}^n zi² ∼ χ²[n]
You will also find that the t-distribution converges to the normal distribution
as its degrees of freedom grow.
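A simulation sketch of properties 1 and 2 (the values of µ and σ below are arbitrary choices for illustration):

    import random
    random.seed(1)

    mu, sigma, n = 2.0, 3.0, 200_000
    x = [random.gauss(mu, sigma) for _ in range(n)]

    # Property 1: z = (x - mu)/sigma should be (approximately) N[0, 1]
    z = [(xi - mu) / sigma for xi in x]
    mean_z = sum(z) / n
    var_z = sum(zi ** 2 for zi in z) / n - mean_z ** 2
    print(mean_z, var_z)                  # near 0 and 1

    # Property 2: z² ~ χ²[1], whose mean is 1
    print(sum(zi ** 2 for zi in z) / n)   # near 1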
0.5. Distribution and Expectation of Random Vectors
Above we discussed the case where we have one random variable (i.e., the univariate case). However, in many instances we will want to consider the case where
we have multiple random variables. The good news is that the concepts described
above can be fairly easily extended to the case of n random variables (the
multivariate case).
Definition 50. Given an experiment, the n-tuple of random variables (X1, X2, ..., Xn)
is referred to as an n-dimensional random vector (or n-variate random variable)
if each Xi associates a real number with every sample point ξ in S.
Let X denote a random vector in Rⁿ. The vector X = (X1, X2, ..., Xn) takes
on values in Rⁿ according to the following joint probability distribution (cdf):
F(x) = P(X ≤ x), where this equality is given by
F_{X1X2...Xn}(x1, x2, ..., xn) = P(X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn)
For this case we have that F_{X1X2...Xn}(∞, ∞, ..., ∞) = 1.
The marginal joint cdfs are obtained from this one by setting the appropriate
Xi's to ∞. For example, the bivariate distribution for x1 and x2 is given by
F_{X1X2}(x1, x2) = F_{X1X2...Xn}(x1, x2, ∞, ∞, ..., ∞)
For the discrete n-variate random variable, the joint pmf is defined by:
p_{X1X2...Xn}(x1, x2, ..., xn) = P(X1 = x1, X2 = x2, ..., Xn = xn)
Properties of p_{X1X2...Xn}(x1, x2, ..., xn):
1. 0 ≤ p_{X1X2...Xn}(x1, x2, ..., xn) ≤ 1
2. Σ_{x1} ⋯ Σ_{xn} p_{X1X2...Xn}(x1, x2, ..., xn) = 1
3. The marginal pmf of one random variable (or set of random variables) is
found by summing p_{X1X2...Xn}(x1, x2, ..., xn) over the ranges of the other
variables, e.g.,
p_{X1X2...X_{n−k}}(x1, x2, ..., x_{n−k}) = Σ_{x_{n−k+1}} Σ_{x_{n−k+2}} ⋯ Σ_{xn} p_{X1X2...Xn}(x1, x2, ..., xn)
4. Conditional pmfs are then defined in a straightforward manner. For example:
p_{Xn|X1,X2...X_{n−1}}(xn|x1, x2, ..., x_{n−1}) = p_{X1X2...Xn}(x1, x2, ..., xn) / p_{X1X2...X_{n−1}}(x1, x2, ..., x_{n−1})
If we are dealing with a continuous n-variate random variable whose cdf F_X has
a pdf f, that is,
F_X(x1, ..., xn) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xn} f(z1, ..., zn)dz1...dzn
for some function f, then we can generally find the joint pdf for a continuous
n-variate random variable by:
f_{X1X2...Xn}(x1, ..., xn) = ∂ⁿF_{X1X2...Xn}(x1, ..., xn) / (∂x1∂x2...∂xn)
If we know the joint distribution function of a random vector X, then we also know
the joint distribution of any subvector of X. For example, the joint distribution
of X(k) = (X1, ..., Xk)′, k < n, is
F_{X(k)}(x1, ..., xk) = ∫_{−∞}^{x1} ⋯ ∫_{−∞}^{xk} [∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f(z1, ..., zn)dz_{k+1}...dzn] dz1...dzk.
Properties of f_{X1X2...Xn}(x1, x2, ..., xn):
1. f_{X1X2...Xn}(x1, x2, ..., xn) ≥ 0
2. ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f_{X1X2...Xn}(x1, x2, ..., xn)dx1...dxn = 1
3. The marginal pdf of one random variable (or set of random variables) is
found by integrating f_{X1X2...Xn}(x1, x2, ..., xn) over the ranges of the other
variables, e.g.,
f_{X1X2...X_{n−k}}(x1, x2, ..., x_{n−k}) = ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ f_{X1X2...Xn}(x1, x2, ..., xn)dx_{n−k+1}dx_{n−k+2}...dxn
4. Conditional pdfs are then easily computed. For example:
f_{Xn|X1,X2...X_{n−1}}(xn|x1, x2, ..., x_{n−1}) = f_{X1X2...Xn}(x1, x2, ..., xn) / f_{X1X2...X_{n−1}}(x1, x2, ..., x_{n−1})
Proposition 51. If the joint distribution function of X and Y has a p.d.f. f(x, y),
then the conditional distribution function of X given Y has a p.d.f. f(x|y) given
by the following:
f(x|y) = f(x, y)/f_Y(y)
where f_Y(y) ≡ ∫ f(x, y)dx is the marginal density of the random variable Y.
0.5.1. Expectations:
The definition of an expectation is also easily generalizable to the case where we
have multiple random variables:
E[g(X)] = ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ g(x1, ..., xn)f(x1, ..., xn)dx1...dxn if the variables are continuous,
E[g(X)] = Σ ⋯ Σ g(x1, ..., xn)f(x1, ..., xn) if the variables are discrete.
Let E_X(g(X)) and E_{X,Y}(g(X)) denote the expectation of the function g(X)
with respect to the marginal and the joint distributions, respectively. It is easy to
show that
E_X(g(X)) = E_{X,Y}(g(X)),
since E_{X,Y}(g(X)) = Σ_{i,j} g(xi)f(xi, yj) = Σ_i g(xi)[Σ_j f(xi, yj)] = Σ_i g(xi)f(xi) =
E_X(g(X)).
Theorem 52. If X and Y are independent, then E_{X,Y}[g(X)h(Y)] = E_X[g(X)]E_Y[h(Y)].
Definition 53. Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
Definition 54. The conditional expectation of Y given X = x is defined as:
E(Y|X = x) = Σ_i yi f(yi|x) in the discrete case,
E(Y|X = x) = ∫ y f(y|x)dy in the continuous case.
Definition 55. The conditional expectation of g(X, Y) given X = x is:
E(g(X, Y)|X = x) = Σ_i g(x, yi)f(yi|X = x) in the discrete case,
E(g(X, Y)|X = x) = ∫ g(x, y)f(y|X = x)dy in the continuous case.
Theorem 56. Law of iterated expectations: E[y] = E_x[E[y|x]], where the notation
E_x[·] indicates the expectation over the values of x.
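The law of iterated expectations can be checked by simulation. A sketch under an assumed joint model (not from the notes): y = x + ε with x ∼ Uniform(0, 1) and ε ∼ N[0, 1], so that E[y|x] = x:

    import random
    random.seed(2)

    n = 200_000
    x = [random.random() for _ in range(n)]       # x ~ Uniform(0,1)
    y = [xi + random.gauss(0, 1) for xi in x]     # E[y|x] = x

    print(sum(y) / n)    # direct estimate of E[y], near 0.5
    # E_x[E[y|x]]: since E[y|x] = x here, averaging the conditional
    # means over the distribution of x gives E[x] = 0.5 as well
    print(sum(x) / n)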
Properties of Conditional Expectation: The conditional expectation can be
defined given any σ-field A such that A ⊂ F, or any random variable Y. The
following proposition gives some of the properties of the conditional expectation
using this formal language.
Proposition 57. Let X and Y be two integrable random variables on a probability space (Ω, F, P) and let F1 be a σ-field such that F1 ⊂ F.
(i) If X = c a.s. for some real number c, then E[X|F1] = c a.s.
(ii) If X ≤ Y a.s., then E[X|F1] ≤ E[Y|F1] a.s.
(iii) If a and b are real numbers, then E[aX + bY|F1] = aE[X|F1] + bE[Y|F1]
a.s.
(iv) E[E[X|F1]] = E[X].
(v) If F0 ⊂ F1, then E[E[X|F1]|F0] = E[X|F0] = E[E[X|F0]|F1] a.s.
(vi) If σ(Y) ⊂ F1 and E[|XY|] < ∞, then E[XY|F1] = Y E[X|F1] a.s.
(vii) If E[|g(X, Y)|] < ∞, then E[g(X, Y)|Y = y] = E[g(X, y)|Y = y] a.s.
Variance-Covariance Matrix of a Random Vector:
The expectation of a random vector is defined as the vector consisting of
the expected value of each individual random variable:
E[X] = (E[X1], ..., E[Xn])′.
The variance-covariance matrix of a random vector X is defined as
Var(X) = E[(X − E[X])(X − E[X])′].
Here, the expectation is taken element by element.
Proposition 58. Let X be a random vector.
(i) For any vector c, E[c′X] = c′E[X], and Var(c′X) = c′Var(X)c.
(ii) The variance-covariance matrix of X is positive semidefinite.
Proof: (i) For any vector c, we have
E[c′X] ≡ E[c1X1 + ... + cnXn]
= c1E[X1] + ... + cnE[Xn]
≡ c′E[X].
Var(c′X) ≡ E[(c′X − E[c′X])(c′X − E[c′X])′]
= E[c′(X − E[X])(c′(X − E[X]))′]
= E[c′(X − E[X])(X − E[X])′c]
= c′E[(X − E[X])(X − E[X])′]c
≡ c′Var(X)c.
(ii) For any vector c,
(c′X − E[c′X])(c′X − E[c′X])′ = (c′X − E[c′X])² ≥ 0.
So,
c′Var(X)c = E[(c′X − E[c′X])(c′X − E[c′X])′] ≥ 0.
Q.E.D.
Transformation of random variables: Let X be a continuous random variable with pdf f_X(x). If the transformation y = g(x) is one-to-one and has the
inverse transformation x = g⁻¹(y) = h(y), then the pdf of Y is given by
f_Y(y) = f_X(x)|dx/dy| = f_X[h(y)]|dh(y)/dy|.
Let Z = g(X, Y) and W = h(X, Y), where X and Y are random variables and
f_{X,Y}(x, y) is the joint pdf of X and Y. If the transformation z = g(x, y),
w = h(x, y) is one-to-one and has the inverse transformation x = q(z, w),
y = r(z, w), then the joint pdf for Z and W is given by:
f_{Z,W}(z, w) = f_{X,Y}(x, y)|J(x, y)|⁻¹, where x = q(z, w), y = r(z, w), and
J(x, y) = det[∂g/∂x ∂g/∂y ; ∂h/∂x ∂h/∂y] = det[∂z/∂x ∂z/∂y ; ∂w/∂x ∂w/∂y]
which is the Jacobian of the transformation z = g(x, y), w = h(x, y). If we then
define
J(z, w) = det[∂q/∂z ∂q/∂w ; ∂r/∂z ∂r/∂w] = det[∂x/∂z ∂x/∂w ; ∂y/∂z ∂y/∂w]
then |J(z, w)| = |J(x, y)|⁻¹ and f_{Z,W}(z, w) = f_{X,Y}[q(z, w), r(z, w)]|J(z, w)|.
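A univariate sketch of the change-of-variables formula (an illustration, not from the notes, assuming X ∼ Exp(1) and the transformation y = g(x) = x², which is one-to-one on x ≥ 0 with inverse h(y) = √y): the implied density f_Y(y) = f_X[h(y)]|dh(y)/dy| should match simulated frequencies.

    import math
    import random
    random.seed(3)

    f_X = lambda x: math.exp(-x)                 # pdf of X ~ Exp(1)
    h = lambda y: math.sqrt(y)                   # inverse of y = x² on x >= 0
    f_Y = lambda y: f_X(h(y)) * abs(0.5 / math.sqrt(y))   # f_X[h(y)] |dh/dy|

    n = 200_000
    ys = [random.expovariate(1.0) ** 2 for _ in range(n)]

    a, b = 1.0, 2.0
    empirical = sum(a < y <= b for y in ys) / n
    # midpoint-rule integral of f_Y over (a, b]
    m = 10_000
    step = (b - a) / m
    implied = sum(f_Y(a + (i + 0.5) * step) for i in range(m)) * step
    print(empirical, implied)    # both near 0.125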
The multivariate normal distribution: Let x be a set of random variables,
x = (x1, ..., xn), with mean vector µ and covariance matrix Σ. The general
form of the joint density for the multivariate normal is
f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp{−(1/2)(x − µ)′Σ⁻¹(x − µ)}
Properties of the multivariate normal: Let x1 be any subset of the variables,
including a single variable, and let x2 be the remaining variables. Partition µ and
Σ likewise, so
µ = [µ1 ; µ2] and Σ = [Σ11 Σ12 ; Σ21 Σ22]
1. If [x1, x2] have a joint multivariate normal distribution, then the marginal
distributions are
x1 ∼ N(µ1, Σ11) and
x2 ∼ N(µ2, Σ22)
2. If [x1, x2] have a joint multivariate normal distribution, then the conditional
distribution of x1 given x2 is also normal:
x1|x2 ∼ N(µ_{1.2}, Σ_{11.2}), where
µ_{1.2} = µ1 + Σ12 Σ22⁻¹ (x2 − µ2)
Σ_{11.2} = Σ11 − Σ12 Σ22⁻¹ Σ21
0.6. Markov Process
Stochastic Process:
A stochastic process is a sequence of random variables {Zt, t = 0, 1, ...} on
a fixed probability space (S, F, P) such that Zt is measurable with respect to a
σ-field Ft for all t = 0, 1, ... and the sequence of σ-fields {Ft, t = 0, 1, ...}
is a filtration, i.e., Ft ⊂ F_{t+1} for any t = 0, 1, ....
Note that for any sequence of random variables, we can always set Ft =
σ(Z0, ..., Zt). Then clearly {Ft, t = 0, 1, ...} is a filtration and Zt is measurable
with respect to Ft. Most of the time this is the natural filtration that we work
with for stochastic processes. However, there may be other filtrations such that
{Zt, t = 0, 1, ...} is also a stochastic process. Which filtration to use depends on
the problem we want to study.
Sample Path:
For any given ω ∈ S, we call the sequence of real numbers {Zt(ω), t = 0, 1, ...}
a sample path of the stochastic process.
Markov Process:
A stochastic process is called a Markov process if for any A ∈ F, t ≥ 1, and
1 ≤ n ≤ t, we have
P(A|σ(Zt, Z_{t−1}, ..., Z_{t−n})) = P(A|Zt).
Another way to put this is that a random process {X(t), t ∈ T} is a Markov
process if
P(X(t_{n+1}) ≤ x_{n+1}|X(t1) = x1, X(t2) = x2, ..., X(tn) = xn) = P(X(t_{n+1}) ≤ x_{n+1}|X(tn) = xn)
whenever t1 < t2 < ... < tn < t_{n+1}.
This type of process has a memoryless property, since the future state of the
process depends only on the present state and not on the past history.
A discrete-state Markov process is called a Markov chain.
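A minimal sketch of a two-state Markov chain (the transition matrix is a made-up example, not from the notes); the next state is drawn using only the current state, which is exactly the memoryless property:

    import random
    random.seed(4)

    # Hypothetical two-state transition matrix: P[i][j] = P(Z_{t+1} = j | Z_t = i)
    P = [[0.9, 0.1],
         [0.5, 0.5]]

    def simulate(steps, state=0):
        path = [state]
        for _ in range(steps):
            # the next state depends only on the current state
            state = random.choices([0, 1], weights=P[state])[0]
            path.append(state)
        return path

    print(simulate(20))    # one sample path of the chain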
Acknowledgements: The notes for this section have been based on materials
provided by Prof. Xiaodong Zhu, Prof. Angelo Melino, Prof. Bruce Hansen, and
on chapters in Econometric Analysis by W. Greene, and many of the definitions are
taken from Probability, random variables and random processes by Hsu. Please
do not circulate these notes without permission of the author.