Econ 514: Probability and Statistics
Lecture 6: Special probability distributions
Summarizing probability distributions
• Let X be a random variable with probability distribution PX .
• We consider two types of probability distributions
– Discrete distributions: PX is absolutely continuous with respect to the counting measure.
– Continuous distributions: PX is absolutely continuous with respect to the Lebesgue measure.
• In both cases there is a density fX .
• Initially we consider X scalar and later random vectors X.
How do we summarize the probability distribution PX ?
• Obvious method: Make graph of density.
• Figure 1 for discrete case and figures 2,3 for continuous case.
• Graph can be used to visualize support and to compute probabilities.
• Intervals where fX is large have a high probability.
Summarizing using moments
• We can also try to summarize PX by a few numbers.
• This never gives a complete picture, because we summarize a whole function fX by a handful of numbers.
• The obvious choice is E(X), the expected value of X and the mean of the distribution PX .
• Interpretation: average over repetitions.
– Repeat the random experiment N times and call the outcomes X1 , X2 , X3 , . . . , XN .
– If N is large then (1/N) Σ_{i=1}^{N} Xi ≈ E(X), i.e. the mean is the average over repetitions.
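• A minimal simulation sketch of this interpretation (Python with numpy; the fair-die example is an added illustration, not part of the lecture):

```python
import numpy as np

# Hypothetical illustration: X is the outcome of a fair die, so E(X) = 3.5.
rng = np.random.default_rng(seed=0)

for N in (10, 1_000, 100_000):
    draws = rng.integers(1, 7, size=N)   # N independent repetitions X_1, ..., X_N
    print(N, draws.mean())               # the sample mean approaches E(X) = 3.5 as N grows
```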
• Interpretation: optimal prediction.
– Consider a predictor m of the outcome X.
– The prediction error of this predictor is X − m.
– Assume that the loss is proportional to (X − m)², i.e. proportional to the squared prediction error.
– The optimal predictor minimizes the expected loss
E((X − m)²) = E[((X − E(X)) + (E(X) − m))²] =
= E[(X − E(X))²] + 2 E[X − E(X)](E(X) − m) + (E(X) − m)² =
= E[(X − E(X))²] + (E(X) − m)²
because E[X − E(X)] = 0. This is minimal if m = E(X).
• Special case: If fX is symmetric around µ, i.e. fX(µ + x) = fX(µ − x), then, if E(|X|) < ∞, we have E(X) = µ.
• E(X) can be outside the support of X: see figure 3.
Implication for prediction?
• The mean E(X) is a measure of location of the distribution PX .
• A measure of dispersion is the variance of X defined by
Var(X) = E[(X − E(X))²]
• The interpretation is clear in the discrete case
Var(X) = Σ_i (xi − E(X))² fX(xi)
with xi − E(X) the deviation from the mean and fX(xi) the weight, i.e. the probability of that deviation.
• We have
Var(X) = E[(X − E(X))²] = E[X² − 2X E(X) + E(X)²] = E(X²) − 2E(X)² + E(X)² = E(X²) − E(X)²
Useful in computations.
• Often we write µ or µX for E(X) and σ² or σX² for Var(X).
• The standard deviation of (the distribution of) X, often denoted by σX, is defined by
σX = √Var(X)
• Example: Picking a number at random from [−1, 1].
– fX(x) = (1/2) I_[−1,1](x)
– By symmetry E(X) = 0.
– The variance is equal to E(X²)
Var(X) = σX² = ∫_{−1}^{1} x²/2 dx = [x³/6]_{−1}^{1} = 1/3
– Standard deviation
σX = 1/√3
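• A quick numerical check of this example (a Python sketch, assuming scipy is available):

```python
from scipy.integrate import quad

# Density of X picked at random from [-1, 1]
f = lambda x: 0.5

# E(X) = 0 by symmetry, so Var(X) = E(X^2)
mean, _ = quad(lambda x: x * f(x), -1, 1)
var, _ = quad(lambda x: x**2 * f(x), -1, 1)

print(mean)             # ~0
print(var, 1 / 3)       # both ~0.3333
print(var ** 0.5)       # sigma_X = 1/sqrt(3) ~ 0.577
```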
• Mean and variance are determined by E(X) and E(X²).
• These are the first two moments of the distribution of X.
• In general the k-th moment, often denoted by µk, is
µk = E(X^k)
• We can also define the central moments, i.e. the moments around E(X) = µ
mk = E[(X − µ)^k]
• The (standardized) third central moment is called the skewness and the (standardized) fourth the kurtosis.
• If the distribution is symmetric then the skewness is 0 (see exercise). Kurtosis is a measure of peakedness (useful if the distribution is symmetric).
• More moments means more knowledge about the distribution. What if we know all moments µk, k = 1, 2, . . .?
• A useful tool to obtain moments is the moment generating function (mgf) of X, denoted by MX(t) and defined by
MX(t) = E(e^{tX})
if this expectation exists for −h < t < h, h > 0.
• Obviously MX (0) = 1.
• Take the derivative with respect to t and interchange integration and differentiation
dMX/dt (t) = ∫_{−∞}^{∞} x e^{tx} fX(x) dx
• For a non-negative random variable this is allowed if
E(X e^{hX}) < ∞
Why?
• Hence
dMX/dt (0) = E(X)
• In general
d^k MX/dt^k (t) = ∫_{−∞}^{∞} x^k e^{tx} fX(x) dx
so that
d^k MX/dt^k (0) = E(X^k)
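• A small symbolic sketch of this recipe (Python with sympy; the mgf e^{t²/2} used here is that of the standard normal distribution, which appears later in the lecture):

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)   # example mgf: standard normal

# E(X^k) is the k-th derivative of the mgf evaluated at t = 0
for k in range(1, 5):
    print(k, sp.diff(M, t, k).subs(t, 0))   # 0, 1, 0, 3
```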
• The moments E(X^k) do not uniquely determine the distribution of X. Casella and Berger give a counterexample.
• With some further assumptions the moments do determine the distribution
– If the distributions of X and Y have bounded
support, then they are the same if and only if
the moments are the same.
– If the moment generating functions MX , MY exist and if they are equal for −h < t < h, then X
and Y have the same distribution.
• We can also consider the characteristic function mX(t)
mX(t) = E(e^{itX})
• Because
e^{itx} = cos(tx) + i sin(tx)
is bounded (|e^{itx}| = 1), the characteristic function always exists.
• There is a 1-1 correspondence between characteristic
functions and distributions.
Special distributions
• There is a catalogue of ’standard’ distributions PX
of random variables X.
• Often a random experiment that we encounter in practice is such that the associated random variable X has one of these standard distributions.
• Choosing such a standard distribution is the selection of a mathematical model for the random experiment, described by the probability space (ℝ, B, PX).
• Often PX depends on parameters that have to be
chosen in order to have a fully specified mathematical model.
Description of special distributions
(i) In what type of random experiments can the standard distribution be used?
(ii) Mean, variance, mgf (if exists).
(iii) Shape of density, i.e. graph of density.
Discrete distributions
Discrete uniform distribution
• Consider a random experiment with a finite number
of outcomes that without loss of generality can be
labeled 1, . . . , N .
• If the outcomes are equally likely, PX has a density with respect to the counting measure
fX(x) = Pr(X = x) = 1/N , x = 1, . . . , N
= 0 , elsewhere
• This is the discrete uniform distribution.
• This distribution has one parameter N .
• Moments etc. only have meaning if the outcomes
1, . . . , N are not just labels, but are a count.
• Moment generating function
MX(t) = (1/N) Σ_{k=1}^{N} e^{tk} = (1/N) e^t (1 − e^{tN})/(1 − e^t)
• Using Σ_{k=1}^{N} k = N(N + 1)/2 and Σ_{k=1}^{N} k² = N(N + 1)(2N + 1)/6 we have
E(X) = (1/N) Σ_{k=1}^{N} k = (N + 1)/2
E(X²) = (1/N) Σ_{k=1}^{N} k² = (N + 1)(2N + 1)/6
so that
Var(X) = E(X²) − E(X)² = (N + 1)(N − 1)/12
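• A quick check of these formulas by direct enumeration (Python sketch; N = 10 is an arbitrary choice):

```python
import numpy as np

N = 10
x = np.arange(1, N + 1)                # the equally likely outcomes 1, ..., N
mean = x.mean()
var = ((x - mean) ** 2).mean()

print(mean, (N + 1) / 2)               # both 5.5
print(var, (N + 1) * (N - 1) / 12)     # both 8.25
```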
Bernoulli distribution
• Random experiment has two outcomes that we label
0 and 1.
• Denote Pr(X = 1) = p.
• PX has a density with respect to the counting measure
fX(x) = p^x (1 − p)^{1−x} , x = 0, 1
= 0 , elsewhere
• This is the Bernoulli distribution.
• There is one parameter p with 0 ≤ p ≤ 1.
• The mgf is
MX(t) = p e^t + 1 − p
and
E(X) = p
E(X²) = E(X) = p
Var(X) = E(X²) − E(X)² = p − p² = p(1 − p)
Binomial distribution
• Consider sequence of independent Bernoulli random
experiments (or trials).
• Define X as the number of 1-s in n trials.
• Consider the event X = x.
– For this event x trials must have outcome 1 and n − x outcome 0.
– A sequence with x 1-s and n − x 0-s is e.g.
1001 . . . 1100
– The probability of this sequence is p^x (1 − p)^{n−x}.
– There are C(n, x) = n!/(x!(n − x)!) sequences of 0-s and 1-s that have the same probability, so that
Pr(X = x) = C(n, x) p^x (1 − p)^{n−x}
– Hence PX has a density with respect to the counting measure
fX(x) = C(n, x) p^x (1 − p)^{n−x} , x = 0, 1, . . . , n
= 0 , elsewhere
• This is the Binomial distribution. Notation
X ∼ B(n, p)
• Binomial formula
(a + b)^n = Σ_{k=0}^{n} C(n, k) a^k b^{n−k}
• We use this formula to establish
– The density sums to 1
Σ_{x=0}^{n} C(n, x) p^x (1 − p)^{n−x} = (p + (1 − p))^n = 1
– The mgf is
MX(t) = Σ_{x=0}^{n} e^{tx} C(n, x) p^x (1 − p)^{n−x} =
= Σ_{x=0}^{n} C(n, x) (p e^t)^x (1 − p)^{n−x} = (p e^t + 1 − p)^n
• Using the mgf we find
E(X) = dMX/dt (0) = [n (p e^t + 1 − p)^{n−1} p e^t]_{t=0} = np
E(X²) = d²MX/dt² (0) = [n (p e^t + 1 − p)^{n−1} p e^t + n(n − 1)(p e^t + 1 − p)^{n−2} p² e^{2t}]_{t=0} = np + n(n − 1)p²
so that
Var(X) = E(X²) − E(X)² = np(1 − p)
• Let Yk be the outcome of the k-th Bernoulli trial, so that
X = Σ_{k=1}^{n} Yk
with Y1, . . . , Yn stochastically independent.
• This implies that
E(X) = n E(Y1) = np
Var(X) = n Var(Y1) = np(1 − p)
MX(t) = (MY1(t))^n = (p e^t + 1 − p)^n
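• A simulation sketch of X as a sum of independent Bernoulli trials (Python; n = 20, p = 0.3 are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n, p, reps = 20, 0.3, 200_000

# Each row is one repetition of n Bernoulli(p) trials; X is the row sum.
trials = rng.random((reps, n)) < p
X = trials.sum(axis=1)

print(X.mean(), n * p)              # ~6.0
print(X.var(), n * p * (1 - p))     # ~4.2
```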
• Shape of the density fX
– We have
fX(x)/fX(x − 1) = 1 + ((n + 1)p − x)/(x(1 − p))
– We conclude that fX is increasing for x < (n + 1)p and decreasing for x > (n + 1)p.
– If p > n/(n + 1) then fX is increasing for x = 0, . . . , n and if p < 1/(n + 1) then fX is decreasing for x = 0, . . . , n. Otherwise fX is first increasing and then decreasing.
– The value of x that maximizes fX is called the mode of the distribution of X.
– For the binomial distribution the mode is the largest integer not exceeding (n + 1)p; if (n + 1)p is an integer, both (n + 1)p − 1 and (n + 1)p maximize fX.
• The binomial distribution has two parameters n, p
with n a positive integer and 0 ≤ p ≤ 1.
• Example: sampling
– Let p be the fraction of households in the US with income less than $15000 per year.
– Select n households at random from the population.
– Define X as the number of households among the n selected with income less than $15000.
– The distribution of X is binomial if the selections of households are independent.
– This is true if the selection is done with replacement and approximately true if the population is sufficiently large.
– Assume n = 100 and 16 households have an income less than $15000.
– Now 16 is an estimate of E(X) = np and this suggests that it is reasonable to guess that
p̂ = 16/n = .16
or 16% of the US households have an income less than $15000.
Hypergeometric distribution
• In the example we assumed (counterfactually) that
selection was with replacement.
• Now consider a population of size N from which we select a sample of size n without replacement. In the population M households have an income of less than $15000.
• X is the number of households among the n selected with income less than $15000.
• X = x iff
– we select x households from the M with an income of less than $15000: this can be done in C(M, x) ways.
– we select the remaining n − x households from the N − M with an income greater than or equal to $15000: this can be done in C(N − M, n − x) ways.
• The total number of selections (without replacement) of n households from the population of N households is C(N, n).
• Combining these results we have
Pr(X = x) = C(M, x) C(N − M, n − x) / C(N, n)
• The distribution PX has a density with respect to the counting measure
fX(x) = C(M, x) C(N − M, n − x) / C(N, n) , x = 0, . . . , n
= 0 , otherwise
• The distribution PX is the Hypergeometric distribution.
• It can be shown (see Casella and Berger) that
E(X) = n M/N
Var(X) = n (M/N)(1 − M/N) (N − n)/(N − 1)
• Compare these results to those for the binomial distribution.
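• A simulation sketch of this comparison (Python with numpy; the population numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N, M, n, reps = 1_000, 160, 100, 100_000   # population size, households below $15000, sample size

# Sampling without replacement: hypergeometric counts.
X = rng.hypergeometric(ngood=M, nbad=N - M, nsample=n, size=reps)

p = M / N
print(X.mean(), n * p)                                   # same mean as the binomial
print(X.var(), n * p * (1 - p) * (N - n) / (N - 1))      # variance shrunk by (N - n)/(N - 1)
print(n * p * (1 - p))                                   # binomial variance, for comparison
```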
Geometric distribution
• Consider a sequence of independent Bernoulli random experiments with probability of outcome 1 equal
to p.
• Call outcome 1 a ’success’ and outcome 0 a ’failure’.
• Define X as the number of experiments before the
first success.
• X = x iff the outcomes for x + 1 Bernoulli experiments are
000 . . . 01
where there are x leading 0-s.
• Hence
Pr(X = x) = (1 − p)^x p
• PX has a density with respect to the counting measure
fX(x) = (1 − p)^x p , x = 0, 1, . . .
= 0 , otherwise
• The distribution PX is called the Geometric distribution.
• This distribution has one parameter p with 0 < p ≤ 1.
• The mgf is
MX(t) = E(e^{tX}) = Σ_{x=0}^{∞} e^{tx} (1 − p)^x p =
= p Σ_{x=0}^{∞} ((1 − p)e^t)^x = p / (1 − (1 − p)e^t)
• From the mgf we find
E(X) = dMX/dt (0) = (1 − p)/p
E(X²) = d²MX/dt² (0) = 2(1 − p)²/p² + (1 − p)/p
so that
Var(X) = (1 − p)/p²
• Sometimes we define X1 as the number of Bernoulli experiments needed for the first success.
• Then
X1 = X + 1
and e.g.
MX1(t) = E(e^{tX1}) = e^t E(e^{tX}) = e^t MX(t)
• Example of geometric distribution:
– Consider a job seeker and let p be the probability of receiving a job offer in any given week.
– The week in which the first offer is received has the distribution PX1 .
– We have for x2 ≥ x1
Pr(X1 > x2 | X1 > x1) = Pr(X1 > x2)/Pr(X1 > x1) =
= [Σ_{x=x2+1}^{∞} (1 − p)^{x−1} p] / [Σ_{x=x1+1}^{∞} (1 − p)^{x−1} p] = (1 − p)^{x2}/(1 − p)^{x1} =
= (1 − p)^{x2−x1} = Pr(X1 > x2 − x1)
– Conclusion: If the job seeker has waited x1 weeks, the probability of having to wait more than another x2 − x1 weeks is the same as the probability of waiting more than x2 − x1 weeks from the beginning of the job search. The geometric distribution has no memory.
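• A simulation sketch of the no-memory property (Python with numpy; note that numpy's geometric distribution counts trials up to and including the first success, i.e. it simulates X1):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
p, reps = 0.2, 1_000_000

X1 = rng.geometric(p, size=reps)            # week of the first offer

x1, x2 = 3, 7
lhs = (X1 > x2).sum() / (X1 > x1).sum()     # Pr(X1 > x2 | X1 > x1)
rhs = (X1 > x2 - x1).mean()                 # Pr(X1 > x2 - x1)
print(lhs, rhs, (1 - p) ** (x2 - x1))       # all close to 0.4096
```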
Negative binomial distribution
• Setup as for the geometric distribution.
• Define X as the number of failures before the r-th
success.
• X = x iff trial x + r is a success (event A) and in the previous x + r − 1 trials there are r − 1 successes and x failures (event B).
• Because the events A and B depend on independent random variables, P(A ∩ B) = P(A)P(B).
• P (A) = p
• A sequence with r − 1 successes and x failures has probability p^{r−1}(1 − p)^x. Because we can choose the r − 1 successes in the x + r − 1 trials in C(x + r − 1, r − 1) ways, this is the number of such sequences. Hence
P(B) = C(x + r − 1, r − 1) p^{r−1} (1 − p)^x
• Combining
Pr(X = x) = C(x + r − 1, r − 1) p^r (1 − p)^x
• PX has density with respect to the counting measure
fX(x) = C(x + r − 1, r − 1) p^r (1 − p)^x , x = 0, 1, . . .
= 0 , otherwise
• This is the Negative binomial distribution.
• The parameters are r (integer) and p with 0 ≤ p ≤ 1.
Poisson distribution
• Poisson distribution applies to number of occurrences
of some event in a time interval of finite length,
e.g. number of job offers received by job seeker in
a month.
• Offers can arrive at any moment (in continuous time).
Compare with the geometric distribution.
• Define X(a, b) as the number of offers in [a, b).
• The symbol o(h) (small o of h) denotes any function with lim_{h→0} o(h)/h = 0.
• Assumptions
(i) Pr(X(s, s + h) = 1) = λh + o(h)
(ii) Pr(X(s, s + h) ≥ 2) = o(h)
(iii) X(a, b) and X(c, d) are independent if [a, b) ∩ [c, d) = ∅.
• Consider [0, t) and divide it into n intervals of length h = t/n. Then (neglecting probabilities that are of order o(h))
Pr(X(0, t) = k) = C(n, k) (λh)^k (1 − λh)^{n−k} =
= C(n, k) (λt/n)^k (1 − λt/n)^{n−k} =
= (1/k!) (λt)^k [n(n − 1) · · · (n − k + 1)/n^k] (1 − λt/n)^{n−k}
Now
lim_{n→∞} n(n − 1) · · · (n − k + 1)/n^k = 1
lim_{n→∞} (1 − λt/n)^{n−k} = lim_{n→∞} (1 − λt/n)^n · lim_{n→∞} (1 − λt/n)^{−k} = e^{−λt}
• Conclusion: for n → ∞ and if we write X for X(0, t)
Pr(X = k) = e^{−λt} (λt)^k / k!
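• A numerical sketch of this limit (Python, assuming scipy; the binomial pmf with p = λt/n approaches the Poisson pmf as n grows):

```python
from scipy.stats import binom, poisson

lam, t, k = 2.0, 1.0, 3      # arbitrary illustrative values
theta = lam * t

for n in (10, 100, 10_000):
    print(n, binom.pmf(k, n, theta / n), poisson.pmf(k, theta))
    # both converge to e^{-2} 2^3 / 3! ~ 0.1804
```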
• The distribution PX has a density with respect to the counting measure
fX(x) = e^{−θ} θ^x / x! , x = 0, 1, . . .
= 0 , otherwise
• The distribution PX is the Poisson distribution. It has one parameter θ > 0 (in the derivation above θ = λt). Notation
X ∼ Poisson(θ)
• The mgf is
MX(t) = Σ_{x=0}^{∞} e^{tx} e^{−θ} θ^x / x! = e^{−θ} Σ_{x=0}^{∞} (e^t θ)^x / x! = e^{(e^t − 1)θ}
so that
E(X) = θ
E(X²) = θ² + θ
and
Var(X) = θ
Note E(X) = Var(X).
Continuous distributions
Uniform distribution
• Random experiment: pick a number at random from [a, b].
• PX([a, x]) = (x − a)/(b − a) = ∫_a^x 1/(b − a) ds
• Hence PX has a density with respect to Lebesgue measure
fX(x) = 1/(b − a) , a ≤ x ≤ b
= 0 , otherwise
• PX is the Uniform distribution on [a, b]. Notation
X ∼ U [a, b]
• We have
MX(t) = (e^{bt} − e^{at}) / ((b − a)t)
E(X) = (a + b)/2
Var(X) = (b − a)²/12
• Graph of density
Normal distribution
• The distribution PX has density with respect to the
Lebesgue measure
fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} , −∞ < x < ∞
• The mgf is
MX(t) = E(e^{tX}) = e^{tµ} E(e^{t(X−µ)}) =
= e^{tµ} ∫_{−∞}^{∞} (1/(σ√(2π))) e^{t(x−µ)} e^{−(x−µ)²/(2σ²)} dx =
= e^{tµ} ∫_{−∞}^{∞} (1/(σ√(2π))) e^{−((x−µ)² − 2σ²t(x−µ))/(2σ²)} dx
Now
(x − µ)² − 2σ²t(x − µ) = (x − µ)² − 2σ²t(x − µ) + σ⁴t² − σ⁴t² = (x − µ − σ²t)² − σ⁴t²
so that
MX(t) = e^{tµ + σ²t²/2} ∫_{−∞}^{∞} (1/(σ√(2π))) e^{−(x−µ−σ²t)²/(2σ²)} dx = e^{tµ + σ²t²/2}
• From the mgf
E(X) = µ
E(X²) = σ² + µ²
so that
Var(X) = σ²
• The distribution PX is the Normal distribution with
mean µ and variance σ 2 . Notation
X ∼ N (µ, σ 2 )
• Define
Z = (X − µ)/σ
Then
E(Z) = 0
Var(Z) = 1
Hence Z has a normal distribution with µ = 0, σ² = 1. This is the standard normal distribution with density
φ(x) = (1/√(2π)) e^{−x²/2} , −∞ < x < ∞
and cdf
Φ(x) = ∫_{−∞}^{x} φ(s) ds
We can compute the probability of an interval [a, b] with the standard normal cdf
Pr(a ≤ X ≤ b) = Pr((a − µ)/σ ≤ Z ≤ (b − µ)/σ) = Φ((b − µ)/σ) − Φ((a − µ)/σ)
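• A short sketch of this computation (Python, assuming scipy; µ, σ, a, b are arbitrary illustrative values):

```python
from scipy.stats import norm

mu, sigma = 1.0, 2.0
a, b = 0.0, 3.0

# Pr(a <= X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
prob = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
print(prob)
print(norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma))   # same number
```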
• Shape of normal density: bell curve
• Why is the normal distribution so popular?
– Galton’s quincunx or dropping board.
– Define Xn as the position (relative to 0) after n rows of pins.
– If Zn takes values −1 and 1 and gives the direction at row n, then
Xn = Z1 + . . . + Zn
– If n is large then Xn has approximately the normal distribution.
– Central limit theorem: Sum of many independent
small effects gives normal distribution.
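• A minimal simulation of the dropping board (Python with numpy):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
n, balls = 50, 100_000

# Each ball takes n left/right steps of -1 or +1; X_n is its final position.
Z = rng.choice([-1, 1], size=(balls, n))
Xn = Z.sum(axis=1)

# Mean ~ 0 and variance ~ n; a histogram of Xn is approximately the N(0, n) bell curve.
print(Xn.mean(), Xn.var(), n)
```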
Exponential distribution
• Consider waiting time to an event that can occur at
any time (compare with geometric distribution).
• Define the hazard or failure rate by
Pr(event in [t, t + dt] | event after t) = Pr(t ≤ X < t + dt | X ≥ t) = fX(t)dt / (1 − FX(t))
• Assume
fX(t) / (1 − FX(t)) = λ
Then the solution to this differential equation (with FX(0) = 0) is obtained by integration
fX(t) = λe^{−λt}
• The distribution PX has a density with respect to
the Lebesgue measure
fX(x) = λe^{−λx} , x ≥ 0
= 0 , otherwise
• PX has the Exponential distribution. There is one
parameter λ > 0 and the notation is
X ∼ Exp(λ)
• The mgf is
MX(t) = λ/(λ − t) , t < λ
and hence
E(X) = 1/λ
Var(X) = 1/λ²
• Note for t ≥ s
Pr(X > t | X > s) = Pr(X > t)/Pr(X > s) = e^{−λt}/e^{−λs} = e^{−λ(t−s)}
If you have waited s, the probability of an additional wait of t − s is the same as if the wait had started at time 0.
• Like the geometric distribution, the exponential distribution has no memory.
• If X is length of human life, compare Pr(X > 40|X >
30) and Pr(X > 70|X > 60)
• Connection with Poisson distribution: If event is recurrent and waiting time has exponential distribution with parameter λ, then number of occurrences
in [0, t] has a Poisson distribution with parameter λt.
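• A simulation sketch of this connection (Python with numpy; λ and t are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
lam, t, reps = 1.5, 2.0, 100_000

# Cumulative sums of exponential waiting times give the arrival times of the events;
# 50 arrivals per repetition is far more than ever fall in [0, t] here.
arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=(reps, 50)), axis=1)
counts = (arrivals <= t).sum(axis=1)        # number of occurrences in [0, t]

print(counts.mean(), lam * t)               # ~3
print(counts.var(), lam * t)                # Poisson: mean = variance
```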
Gamma distribution
• The Gamma distribution is the distribution of
X = Y1 + . . . + Yr
with Yk independent exponential random variables
with parameter λ.
• X is the waiting time to the r-th occurrence of the event. Compare with the negative binomial distribution.
• The distribution PX has a density with respect to the Lebesgue measure
fX(x) = (λ/Γ(r)) (λx)^{r−1} e^{−λx} , x ≥ 0
= 0 , otherwise
with Γ the Γ function. Γ(r) = (r − 1)! if r is a
positive integer and otherwise it has to be computed
numerically.
• This is the Gamma distribution with parameters λ, r >
0. r need not be an integer. Notation
X ∼ Γ(λ, r)
• The mgf is
MX(t) = (λ/(λ − t))^r , t < λ
so that
E(X) = r/λ
Var(X) = r/λ²
Lognormal distribution
• Let Y ∼ N (µ, σ 2 ) and define X = eY .
• The distribution PX has density
fX(x) = (1/(xσ√(2π))) e^{−(ln x − µ)²/(2σ²)} , x > 0
= 0 , otherwise
Derive this density.
• This is the Lognormal distribution with parameters
µ and σ 2 .
• The mean and variance can be derived from the mgf of the normal distribution, since E(X) = E(e^Y) = MY(1) and E(X²) = E(e^{2Y}) = MY(2):
E(X) = e^{µ + σ²/2}
Var(X) = e^{2µ + 2σ²} − e^{2µ + σ²}
What are E(ln X) and Var(ln X)?
Cauchy distribution
• A random variable that has a distribution with density with respect to the Lebesgue measure
fX(x) = 1/(πβ[1 + ((x − α)/β)²]) , −∞ < x < ∞
has a Cauchy distribution with parameters α and
β > 0.
• The density is symmetric around α. This is the median of X.
• E(X) does not exist and Var(X) = ∞.
• The mgf is ∞ for t ≠ 0.
Chi-squared distribution
• The chi-squared distribution is a special case of the Γ distribution: set r = k/2 and λ = 1/2.
• The density is
fX(x) = (1/(Γ(k/2) 2^{k/2})) x^{k/2 − 1} e^{−x/2} , x ≥ 0
= 0 , otherwise
• The parameter k is called the degrees of freedom of the distribution.
• The chi-squared distribution is important because
of the following result: If X has a standard normal
distribution, then Y = X 2 has a chi-squared distribution with k = 1.
• We derive the mgf
MY(t) = E(e^{tX²}) = ∫_{−∞}^{∞} (1/√(2π)) e^{tx² − x²/2} dx =
= (1/√(1 − 2t)) ∫_{−∞}^{∞} (√(1 − 2t)/√(2π)) e^{−(1/2)(1 − 2t)x²} dx =
= (1/(1 − 2t))^{1/2} = ((1/2)/((1/2) − t))^{1/2}
which is the mgf of the Γ distribution with r = 1/2 and λ = 1/2, i.e. the chi-squared distribution with k = 1.
Exponential family of distributions
• The exponential family of densities are the densities that can be expressed as
fX(x) = h(x) c(θ) exp(Σ_{i=1}^{k} wi(θ) ti(x)) , −∞ < x < ∞
• Note that c, wi, i = 1, . . . , k do not depend on x and h, ti, i = 1, . . . , k do not depend on θ. θ is the vector of parameters of the distribution.
• Why useful: We will see that if we have data from
an exponential family distribution, the information
can be summarized by ti , i = 1, . . . , k.
• Examples
(i) Binomial distribution: For x = 0, . . . , n
fX(x) = C(n, x) p^x (1 − p)^{n−x} = C(n, x) (1 − p)^n e^{x ln(p/(1−p))}
Hence
h(x) = C(n, x)   t(x) = x   c(θ) = (1 − p)^n   w(θ) = ln(p/(1 − p))
(ii) Normal distribution: For −∞ < x < ∞
fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)} = (1/(σ√(2π))) e^{−µ²/(2σ²)} e^{−x²/(2σ²) + (µ/σ²)x}
Hence
h(x) = 1   t1(x) = x²   t2(x) = x
c(θ) = (1/(σ√(2π))) e^{−µ²/(2σ²)}   w1(θ) = −1/(2σ²)   w2(θ) = µ/σ²
• Other exponential family distributions: Poisson, exponential, Gamma.
• The density of the uniform distribution is
fX(x) = (1/(b − a)) I(a ≤ x ≤ b)
The indicator I(a ≤ x ≤ b) cannot be factored into a function of x alone times a function of the parameters a, b (and an exponential term). Hence the uniform distribution does not belong to the exponential family.
Multivariate distributions: recapitulation
• Consider a probability space (Ω, A, P) and define a vector of random variables or random vector X as a function X : Ω → ℝ^K, i.e.
X(ω) = (X1(ω), . . . , XK(ω))′
• The distribution of X is a probability measure PX : B^K → [0, 1]. This is usually called the joint distribution of the random vector X.
• We consider the case that PX has a density with respect to the counting measure (discrete distribution)
or with respect to the Lebesgue measure (continuous
distribution).
• The density fX (x1 , . . . , xK ) is called the joint density
of X.
• We have
Pr(X1 ∈ B) = PX(B × ℝ × . . . × ℝ) =
= ∫_B [∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} fX(x1, . . . , xK) dx2 . . . dxK] dx1 =
= ∫_B fX1(x1) dx1
with
fX1(x1) = ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} fX(x1, x2, . . . , xK) dx2 . . . dxK
• fX1 is called the marginal density of X1 .
• The marginal density of Xk for any k is obtained
in the same way. For discrete distributions replace
integration by summation.
• Consider subvectors X1 , . . . , XK1 and XK1 +1 , . . . , XK .
• The distributions of these subvectors are independent if and only if
fX (x1 , . . . , xK ) = fX1 ...XK1 (x1 , . . . , xK1 )fXK1 +1 ...XK (xK1 +1 , . . . , xK )
i.e. the joint density is the product of the marginal
densities.
• The conditional distribution of X1, . . . , XK1 given XK1+1, . . . , XK has density
fX1...XK1|XK1+1...XK(x1, . . . , xK1 | xK1+1, . . . , xK) = fX(x1, . . . , xK) / fXK1+1...XK(xK1+1, . . . , xK)
i.e. it is the ratio of the joint density and the marginal
density of the variables on which we condition.
• If X̃ is any subvector of X that does not have X1
as a component, then the conditional mean of X1
given X̃ = x̃ can be computed using the conditional
density of X1 given X̃
E(X1 | X̃ = x̃) = ∫_ℝ x1 fX1|X̃(x1 | x̃) dx1
For a discrete distribution replace integration by summation.
• The conditional variance of X1 given X̃ is
Var(X1 | X̃ = x̃) = E[(X1 − E(X1 | X̃ = x̃))² | X̃ = x̃]
• We have
Var(X1 | X̃ = x̃) =
= E[X1² − 2X1 E(X1 | X̃ = x̃) + E(X1 | X̃ = x̃)² | X̃ = x̃] =
= E(X1² | X̃ = x̃) − 2E(X1 | X̃ = x̃)² + E(X1 | X̃ = x̃)² =
= E(X1² | X̃ = x̃) − E(X1 | X̃ = x̃)²
Compare this result to that for the unconditional variance.
• Law of iterated expectations:
E(X1) = EX̃(EX1|X̃(X1 | X̃))
Remember that on the rhs we just integrate E(X1 | X̃ = x̃) with respect to the distribution of X̃.
• For the variance note
EX̃[Var(X1 | X̃)] = EX̃[E(X1² | X̃)] − EX̃[E(X1 | X̃)²]
and because E(X1 | X̃) is a random variable that is a function of X̃
Var(E(X1 | X̃)) = EX̃[E(X1 | X̃)²] − [EX̃(E(X1 | X̃))]²
If we add these equations we obtain
E[Var(X1 | X̃)] + Var(E(X1 | X̃)) = E(X1²) − (E(X1))² = Var(X1)
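• A simulation sketch of this variance decomposition (Python with numpy; the two-stage experiment is a hypothetical illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(seed=7)
reps = 1_000_000

# Xtilde ~ Poisson(4), then X1 | Xtilde ~ N(Xtilde, 9).
Xtilde = rng.poisson(4, size=reps)
X1 = rng.normal(loc=Xtilde, scale=3)

# Var(X1) = E[Var(X1|Xtilde)] + Var(E(X1|Xtilde)) = 9 + 4 = 13
print(X1.var(), 9 + 4)
```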
Summary measures associated with multivariate distributions, i.e. distribution of a random vector X
• Obvious: Means and variances of the random variables in X (marginal means and variances).
• In random vectors we also consider the covariance of
any two components of X, say X1 and X2
Cov(X1 , X2 ) = E [(X1 − E(X1 ))(X2 − E(X2 ))]
• The covariance is informative on the relation between X1 and X2, e.g. for a discrete distribution
Cov(X1, X2) = Σ_{x1} Σ_{x2} (x1 − E(X1))(x2 − E(X2)) fX1X2(x1, x2)
If outcomes with x1 − E(X1) > 0 and x2 − E(X2) > 0 or x1 − E(X1) < 0 and x2 − E(X2) < 0 (deviations go in the same direction) are more likely than outcomes with x1 − E(X1) > 0 and x2 − E(X2) < 0 or x1 − E(X1) < 0 and x2 − E(X2) > 0 (deviations go in opposite directions), then Cov(X1, X2) > 0.
• In that case there is a positive association between
X1 and X2 .
• If the second type of outcome is more likely, Cov(X1, X2) < 0 and the association is negative.
• Note for constants c, d
Cov(cX1 , dX2 ) = cdCov(X1 , X2 )
so that the size of Cov(X1 , X2 ) is not a good measure
of the strength of the association.
• To measure the strength we define the correlation coefficient of X1, X2 by
ρX1X2 = Cov(X1, X2) / (√Var(X1) √Var(X2))
• To derive its properties we need the Cauchy-Schwarz inequality
|E(X1X2)| ≤ √E(X1²) √E(X2²)
Proof: Consider
0 ≤ E[(tX1 + X2)²] = t² E(X1²) + 2t E(X1X2) + E(X2²)
The rhs is a quadratic function of t with at most one zero. The discriminant therefore satisfies
4E(X1X2)² − 4E(X1²)E(X2²) ≤ 0
Dividing by 4 and taking the square root gives the inequality. If
E[(tX1 + X2)²] = 0
then
Pr(tX1 + X2 = 0) = 1
i.e. the joint distribution is concentrated on the line tx1 + x2 = 0. □
• Properties of the correlation coefficient
– ρcX1,dX2 = ρX1X2 for constants c, d > 0 (if cd < 0 the sign flips).
– By Cauchy-Schwarz
|Cov(X1, X2)| = |E[(X1 − E(X1))(X2 − E(X2))]| ≤ √E((X1 − E(X1))²) √E((X2 − E(X2))²)
so that
|ρX1X2| ≤ 1
– Note |ρX1X2| = 1 ⇔ Pr(X2 − E(X2) = t(X1 − E(X1))) = 1. Hence
Pr(X2 = a + bX1) = 1
with a = E(X2) − tE(X1) and b = t. Note that
Pr(X2 − E(X2) = t(X1 − E(X1))) = 1 ⇒ Cov(X1, X2) = bVar(X1)
so that
sign(ρX1X2) = sign(Cov(X1, X2)) = sign(b).
Conclusion:
|ρX1X2| = 1 ⇔ Pr(X2 = a + bX1) = 1
for b ≠ 0. If ρX1X2 = 1 then b > 0 and if ρX1X2 = −1 then b < 0.
• In the case of a multivariate distribution we organize the variances and covariances in a matrix, the variance(-covariance) matrix of X
Var(X) =
[ Var(X1)        Cov(X1, X2)   · · ·   Cov(X1, XK) ]
[ Cov(X1, X2)    Var(X2)       · · ·   Cov(X2, XK) ]
[ ...            ...           · · ·   ...         ]
[ Cov(X1, XK)    Cov(X2, XK)   · · ·   Var(XK)     ]
Note that this is a symmetric K × K matrix
Var(X) = Var(X)′
Often we use the notation
Var(X) = Σ
• Remember if X is a K vector, then
(X − µ)(X − µ)′ = (X1 − µ1, . . . , XK − µK)′ (X1 − µ1, . . . , XK − µK) =
[ (X1 − µ1)²            (X1 − µ1)(X2 − µ2)   · · ·   (X1 − µ1)(XK − µK) ]
[ (X1 − µ1)(X2 − µ2)    (X2 − µ2)²           · · ·   (X2 − µ2)(XK − µK) ]
[ ...                   ...                  · · ·   ...                ]
[ (X1 − µ1)(XK − µK)    (X2 − µ2)(XK − µK)   · · ·   (XK − µK)²         ]
so that if we denote µ = E(X)
Σ = Var(X) = E((X − µ)(X − µ)′)
Linear and quadratic functions of random vectors
• If X is a random vector with K components and a is a K vector of constants, we define the linear function of X
a′X = Σ_{k=1}^{K} ak Xk
• Hence
E(a′X) = E(Σ_{k=1}^{K} ak Xk) = Σ_{k=1}^{K} ak E(Xk) = a′E(X)
• Also
Var(a′X) = E[(a′X − E(a′X))²] = E[(a′X − a′µ)(a′X − a′µ)] =
= E[(a′X − a′µ)(X′a − µ′a)] = E[a′(X − µ)(X − µ)′a] =
= a′E[(X − µ)(X − µ)′]a = a′Σa
Moment generating function of a joint distribution
• If X is a random vector the mgf of X is
MX(t) = E(e^{t1X1 + · · · + tKXK})
if the mgf exists for −h < tk < h, k = 1, . . . , K. Note t = (t1, . . . , tK)′.
• Note that
∂²MX/∂t1∂t2 (t) = E(X1X2 e^{t1X1 + · · · + tKXK})
so that
∂²MX/∂t1∂t2 (0) = E(X1X2)
• This can be used to compute the covariance, because
Cov(X1 , X2 ) = E(X1 X2 ) − E(X1 )E(X2 )
• The mgf of the marginal distribution of X1 is
MX1 (t1 ) = MX (t1 , 0, . . . , 0)
Special multivariate distributions
Multinomial distribution
• Binomial distribution: Number of 1’s in n independent Bernoulli experiments.
• Instead of a Bernoulli experiment with two outcomes, consider a random experiment with K outcomes k = 1, . . . , K.
• Example: pick a student at random from class and record his/her nationality. Label the nationalities k = 1, . . . , K.
• If the fraction with nationality k is pk, then if the outcome of the random selection is Y we have
Pr(Y = k) = pk , k = 1, . . . , K
with Σ_{k=1}^{K} pk = 1.
• Repeat this experiment n times and let the repetitions be independent.
• Define Xk as the number of experiments with outcome k. Note Σ_{k=1}^{K} Xk = n, so that XK is determined by X1, . . . , XK−1.
• Consider a sequence of n outcomes
Experiment:   1   2   3   4   . . .   n − 1   n
Outcome:      3   4   1   1   . . .   K − 1   K
Probability:  p3  p4  p1  p1  . . .   pK−1    pK
The probability of this sequence is
p1^{x1} p2^{x2} · · · pK−1^{xK−1} pK^{xK}
with xK = n − Σ_{k=1}^{K−1} xk.
• To compute Pr(X1 = x1 , . . . , XK−1 = xK−1 ) we count
the number of such sequences.
• This is equivalent to
– Pick x1 experiments with outcome 1, x2 with outcome 2 etc. from the n experiments.
– Start with picking the x1 experiments with outcome 1 among the n experiments. This can be done in C(n, x1) ways.
– From the remaining n − x1 experiments pick the experiments with outcome 2. This can be done in C(n − x1, x2) ways.
– The total number of ways to choose the experiments with outcomes 1 and 2 is
C(n, x1) C(n − x1, x2) = n! / (x1! x2! (n − x1 − x2)!)
– Using the same argument repeatedly we find that the total number of ways to choose the experiments with outcomes 1, 2, . . . , K is
n! / (x1! · · · xK−1! (n − x1 − · · · − xK−1)!) = n! / (x1! · · · xK!)
• Hence
Pr(X1 = x1, . . . , XK−1 = xK−1) = (n! / (x1! · · · xK!)) p1^{x1} p2^{x2} · · · pK−1^{xK−1} pK^{xK}
• The Multinomial joint density of X1, . . . , XK−1 is
fX(x1, . . . , xK−1) = (n! / ∏_{k=1}^{K} xk!) ∏_{k=1}^{K} pk^{xk} , 0 ≤ xk ≤ n, Σ_{k=1}^{K} xk = n
= 0 , otherwise
• Multinomial formula
(a1 + · · · + aK)^n = Σ_{x1 + · · · + xK = n} (n! / (x1! · · · xK!)) a1^{x1} · · · aK^{xK}
• Using this the mgf is
MX(t) = E(e^{t1X1 + · · · + tK−1XK−1}) =
= Σ_{x1 + · · · + xK = n} (n! / (x1! · · · xK!)) (e^{t1}p1)^{x1} · · · (e^{tK−1}pK−1)^{xK−1} pK^{xK} =
= (Σ_{k=1}^{K−1} e^{tk}pk + pK)^n
• From the mgf we find
E(Xk) = npk   Var(Xk) = npk(1 − pk)   Cov(Xk, Xl) = −npkpl (k ≠ l)
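• A quick simulation check of these moments (Python with numpy; n and the pk are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(seed=8)
n, p = 30, np.array([0.5, 0.3, 0.2])

X = rng.multinomial(n, p, size=200_000)          # draws of (X1, X2, X3)

print(X.mean(axis=0), n * p)                     # E(Xk) = n pk
print(X.var(axis=0), n * p * (1 - p))            # Var(Xk) = n pk (1 - pk)
print(np.cov(X[:, 0], X[:, 1])[0, 1], -n * p[0] * p[1])   # Cov(X1, X2) = -n p1 p2
```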
• Exercise: What is the marginal distribution of Xk ?
What is the conditional distribution of X1 , X2 given
X3 = x3 , . . . , XK−1 = xK−1 ?
Multivariate normal distribution
• The K random vector X has K-dimensional Multivariate normal distribution if its distribution has a
density with respect to the K-dimensional Lebesgue
measure equal to
fX(x) = (1 / (|Σ|^{1/2} (2π)^{K/2})) e^{−(1/2)(x − µ)′Σ^{−1}(x − µ)} , x ∈ ℝ^K
• By completion of squares (see the 1-dimensional case) the mgf is
MX(t) = e^{t′µ + (1/2)t′Σt}
Exercise: Derive the mgf.
• Hence
E(X) = µ
Var(X) = Σ
Exercise: Derive these results.
• The marginal distribution of Xk is normal with mean µk and variance σk², the k-th element of the main diagonal of Σ. Exercise: Prove this using the mgf.
• Special case K = 2, the bivariate normal distribution. Let the random vector be (Y, X)′.
• The conditional distribution of Y given X = x is normal with
E(Y | X = x) = µY + (σXY/σX²)(x − µX)
Var(Y | X = x) = σY²(1 − σXY²/(σX²σY²)) = σY²(1 − ρXY²)
with σXY = Cov(X, Y).
• The conditional mean is linear in x. Compare with the result that Pr(Y = a + bX) = 1 if and only if |ρXY| = 1.
Regression fallacy or regression to the mean
• Francis Galton (1822-1911), observed that tall fathers have on average shorter sons, and short fathers
have on average taller sons (in Victorian England
mothers and daughters did not count).
• If this process were to continue, one would expect
that in the long run extremes would disappear and
all fathers and sons will have the average height.
• Using the same reasoning: Short sons have on average taller fathers (with a height closer to the mean)
and tall sons have on average smaller fathers (again
with a height closer to the mean).
• By this argument there is a tendency to move away
from the mean!
• Similar observations can be made about many phenomena: Rookie players who do exceptionally well
in the first year, tend to have a slump in the second; bringing in new management when a company
underperforms seems to improve performance etc.
• Analysis
X = height of father
Y = height of son
• Reasonable assumption: X, Y have a bivariate normal distribution with
E(X) = E(Y) = µ
Var(X) = Var(Y) = σ²
0 < ρXY < 1
• Hence
E(Y | X = x) = µ + ρXY(x − µ)
• If x > µ
0 < E(Y | X = x) − µ < x − µ
i.e. the average height of sons of fathers with more than average height is closer to the mean.
• If x < µ
0 > E(Y | X = x) − µ > x − µ
i.e. the average height of sons of fathers with less than average height is closer to the mean.
• However, heights of fathers and sons have the same
(normal) distribution, i.e. no change over the generations.
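• A simulation sketch of this analysis (Python with numpy; the numbers µ = 175, σ = 7, ρ = 0.5 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=9)
mu, sigma, rho, reps = 175.0, 7.0, 0.5, 500_000

cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
XY = rng.multivariate_normal([mu, mu], cov, size=reps)
X, Y = XY[:, 0], XY[:, 1]                     # father, son

tall = X > mu + sigma                         # fathers at least one sd above the mean
print(X[tall].mean(), Y[tall].mean(), mu)     # sons of tall fathers: mean pulled toward mu
print(Y.mean(), Y.std(), X.std())             # yet X and Y have the same distribution
```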
The distribution of linear and quadratic functions of normal random vectors
• X is a K random vector with
X ∼ N (µ, Σ)
• Consider the random variables
(i) Y1 = a′X with a a K vector of constants (Y1 is a scalar).
(ii) Y2 = AX + b with A an M × K matrix and b an M vector of constants.
(iii) Y3 = X′CX with C a symmetric K × K matrix of constants.
• From the mgfs of Y1 and Y2 we find
(i) Y1 ∼ N(a′µ, a′Σa). Exercise: Derive this.
(ii) Y2 ∼ N(Aµ + b, AΣA′). Exercise: Derive this.
– We verify
E(Y2) = AE(X) + b = Aµ + b
Var(Y2) = E[(Y2 − Aµ − b)(Y2 − Aµ − b)′] =
= E[(AX − Aµ)(AX − Aµ)′] = E[A(X − µ)(X − µ)′A′] =
= AE[(X − µ)(X − µ)′]A′ = AΣA′
(iii) Special case: X ∼ N(0, I) and C idempotent, i.e. C² = C, the matrix generalization of unity.
∗ Let P be the K × K matrix of eigenvectors of C and choose P such that P′P = I, i.e. P is orthonormal.
∗ Define the diagonal matrix of eigenvalues
Λ = diag(λ1, . . . , λK)
∗ We have
CP = PΛ
and
P′CP = Λ
C = PΛP′
because by P′P = I we have P^{−1} = P′.
∗ Hence
PΛP′ = C = C² = PΛ²P′
so that
Λ² = Λ
∗ This implies that λk is either 0 or 1. Let L be the number of eigenvalues equal to 1 and consider
Z = P′X
so that Z ∼ N(0, I).
∗ Hence
Y3 = X′PΛP′X = Z′ΛZ = Σ_{k=1}^{K} λk Zk² ∼ χ²(L)
∗ Finally
tr(C) = tr(PΛP′) = tr(ΛP′P) = tr(Λ) = L
• Let X1 and X2 be subvectors of X of dimensions K1 and K2 with K1 + K2 = K. Then the variance matrix of X is
Σ = [ Σ11   Σ12 ]
    [ Σ12′  Σ22 ]
with Var(X1) = Σ11, Var(X2) = Σ22 and Σ12 = E((X1 − µ1)(X2 − µ2)′). We have that X1 and X2 are independent if and only if Σ12 = 0.
• To see this note that if Σ12 = 0
Σ^{−1} = [ Σ11   0   ]^{−1} = [ Σ11^{−1}   0         ]
         [ 0     Σ22 ]        [ 0          Σ22^{−1}  ]
Hence
(x − µ)′Σ^{−1}(x − µ) = (x1 − µ1)′Σ11^{−1}(x1 − µ1) + (x2 − µ2)′Σ22^{−1}(x2 − µ2)
Substitution in the density of the multivariate normal distribution shows that this density factorizes into a function of x1 and a function of x2, which establishes that these random vectors are independent.
• Conclusion: In the normal distribution X1 , X2 are
independent if and only if Cov(X1 , X2 ) = 0.
• Define
Y4 = X′BX
with B a symmetric idempotent matrix. Then if X ∼ N(0, I)
(i) Y1 and Y3 are independent if and only if Ca = 0.
(ii) Y3 and Y4 are stochastically independent if and only if BC = CB = 0.
Proof:
(i) Y3 = X′CX = X′C²X = X′C′CX, which is a function of CX. Hence Y1 and Y3 are independent if and only if
Cov(CX, a′X) = E(CXX′a) = Ca = 0
(ii) Y3 = X′C′CX and Y4 = X′B′BX, so Y3 and Y4 are independent if and only if
Cov(CX, BX) = E(CXX′B′) = CB′ = CB = 0.