Section 1.2. Random variables and distributions.
The next concept of probability theory: a random variable; and the distribution of a
random variable.
A random variable is a function X : Ω → R – a real-valued function X = X(ω)
defined on the sample space – that has the following property:
• For every interval I ⊆ R the set {ω : X(ω) ∈ I} is an event: {ω : X(ω) ∈ I} ∈ F.
Two random variables X and Y (on the same probability space (Ω, F, P )) are called
equivalent: X ∼ Y , if they are equal to one another almost surely:
P {X ≠ Y } = 0
(1.2.1)
(or, which is the same, P {X = Y } = 1). We are going to systematically disregard the
distinction between equivalent random variables.
The word distribution is used in probability theory in two different ways. Either it
is used as some sort of key word to speak of several similar concepts: e. g., if we are told
to find the distribution of a random variable, it means finding what is called “probability
mass function” if this is a discrete random variable, or finding the distribution density if it
is a continuous random variable, or finding the cumulative distribution function. Or the
word “distribution” may be used as a precise mathematical term. In this course, we are
choosing the second way.
The distribution of the random variable X is, by definition, a function µ = µX of
subsets of the real line R defined by
µ(C) = µX (C) = P {X ∈ C}.
(1.2.2)
The notation P {X ∈ C} is short for P ({ω : X(ω) ∈ C}). You see: something is standing under
the sign of probability. What can it be? Probabilities can be only of events, and every event is a set consisting of sample points ω. So mentioning ω in the notation P ({ω : X(ω) ∈ C}) is redundant. Also, there are braces { } inside the parentheses ( ): too complicated; we had better keep just one pair. Of which kind? It should be the more characteristic kind: the braces { }.
For what class of sets C ⊆ R is the set function µ defined? In other words, for what
subsets of the real line is the set {X ∈ C} an event? This is a question belonging to the
set-theoretic introduction to both measure theory and probability theory; and we have
decided not to speak about such questions. Suffice it to say that of course every interval
belongs to this class of sets; and that we cannot construct a subset of the real line that
does not belong to this class of sets (even if we can prove – non-constructively – that such sets C do exist).
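Here is a small computational illustration (a sketch of mine, not part of the notes: it assumes Python with numpy and scipy, and the Exponential(1) random variable is just a convenient concrete choice). It evaluates µX on an interval and compares with a Monte Carlo estimate:

    import numpy as np
    from scipy.stats import expon

    def mu_X(a, b):
        # mu_X([a, b]) = P{X in [a, b]} = F(b) - F(a) for X ~ Exponential(1)
        return expon.cdf(b) - expon.cdf(a)

    rng = np.random.default_rng(0)
    sample = rng.exponential(size=100_000)           # simulated values X(omega)
    a, b = 0.5, 2.0
    empirical = np.mean((sample >= a) & (sample <= b))
    print(mu_X(a, b), empirical)                     # both near e^{-1/2} - e^{-2}, approx 0.471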
It turns out that the set function µX is necessarily a measure.
Indeed, it is, of course, nonnegative (between 0 and 1); as for countable additivity:
if C1 , C2 , ..., Cn , ... is a sequence of disjoint subsets of the real line (i. e., Ci ∩ Cj = ∅ for i ≠ j), we have:
{ω : X(ω) ∈ ∪_{i=1}^∞ Ci } = ∪_{i=1}^∞ {ω : X(ω) ∈ Ci }
(1.2.3)
(using short notations: {X ∈ ∪_{i=1}^∞ Ci } = ∪_{i=1}^∞ {X ∈ Ci }); these events are disjoint, and
from countable additivity of the probability P we obtain:
µX (∪_{i=1}^∞ Ci ) = P (∪_{i=1}^∞ {X ∈ Ci }) = ∑_{i=1}^∞ P {X ∈ Ci } = ∑_{i=1}^∞ µX (Ci ).
(1.2.4)
Two important classes of distributions:
• Discrete distributions are distributions of discrete random variables, i. e., random variables taking only countably many values x1 , x2 , ..., xn (, ...). Discrete distributions are
given by
µ(C) = ∑_{x ∈ C} p(x) = ∑_{i: xi ∈ C} p(xi ),
(1.2.5)
where the nonnegative numbers p(x) = µ{x} = P {X = x} (only countably many of the numbers p(x) are different from 0; the sum of all non-zero p(x) is equal to P {X ∈ {x1 , x2 , ..., xn (, ...)}} = P (Ω) = 1); the function p(x) = pX (x) is called the probability mass function (of the random variable X).
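A sketch of formula (1.2.5) in code (again an illustration of mine; the Poisson(3) probability mass function and the set C are arbitrary choices):

    from scipy.stats import poisson

    lam = 3.0
    C = {0, 2, 4, 6, 8}                              # a set C of possible values
    mu_C = sum(poisson.pmf(k, lam) for k in C)       # formula (1.2.5)
    print(mu_C)                                      # P{X in C}, approx 0.50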
• Continuous distributions are given by
µ(C) = ∫_C p(x) dx,
(1.2.6)
where the nonnegative function p(x) = pX (x) is called the distribution density, or the probability density, or just density (just as ∑_x p(x) = 1 for the probability mass function, for the density we have ∫ p(x) dx = 1; according to our convention about notations, the integral is taken over all possible values: from −∞ to ∞).
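Similarly, formula (1.2.6) can be checked numerically (an illustrative sketch assuming scipy; the standard normal density and C = [−1, 1] are my choices):

    from scipy.integrate import quad
    from scipy.stats import norm

    mu_C, _ = quad(norm.pdf, -1.0, 1.0)              # formula (1.2.6) with C = [-1, 1]
    print(mu_C)                                      # approx 0.6827
    print(norm.cdf(1.0) - norm.cdf(-1.0))            # the same value via the CDF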
You should keep in mind that a density is not determined uniquely: we can change it,
say, at one point, or at one hundred and fifty points, the integrals won’t change, and still
it will be a density of the same random variable. We can speak about different versions
of the density. All statements that we make about a probability density hold not for all
values of its argument, but, generally, with exceptions on a set that can be disregarded
while integrating (we can call such sets negligible sets).
Theorem 1.2.1. If two random variables X and Y are equivalent: X ∼ Y, they have
the same distribution: µX = µY .
Proof. We have to prove that for an arbitrary set C ⊆ R (yes, yes, an arbitrary set
belonging to a class that we did not describe precisely, but which is very vast, and contains
at least all intervals; but I told you that I won’t pay attention to such things)
µX (C) = µY (C),
(1.2.7)
or
P {X ∈ C} = P {Y ∈ C}.
(1.2.8)
Now draw a picture (I cannot draw pictures here) of the sample space Ω and its two subsets
(events): {X ∈ C} and {Y ∈ C}. We have (look at your picture!):
|P {X ∈ C} − P {Y ∈ C}| = |P ({X ∈ C} \ {Y ∈ C}) − P ({Y ∈ C} \ {X ∈ C})|
≤ P ({X ∈ C} \ {Y ∈ C}) + P ({Y ∈ C} \ {X ∈ C}) = P ({X ∈ C} ∆ {Y ∈ C}),
(1.2.9)
where ∆ is the sign for the symmetric difference of two sets:
A ∆ B = (A \ B) ∪ (B \ A)
(1.2.10)
(draw another picture, or look at the picture you have drawn).
Clearly,
{X ∈ C} ∆ {Y ∈ C} ⊆ {X ≠ Y }
(1.2.11)
(in everyday language rather than in that of sets: one of the events {X ∈ C}, {Y ∈ C} occurring but not the other can happen only if X does not coincide with Y ). So we have:
|P {X ∈ C} − P {Y ∈ C}| ≤ P {X ≠ Y } = 0,
(1.2.12)
so P {X ∈ C} = P {Y ∈ C}.
If X is a random variable, its cumulative distribution function F (t) = FX (t), − ∞ <
t < ∞, is defined as
FX (t) = P {X ≤ t}.
(1.2.13)
It turns out that the cumulative distribution function determines the distribution of
a random variable uniquely:
Theorem 1.2.2. If the cumulative distribution functions of two random variables
coincide:
FX (t) = FY (t)
for all t ∈ (− ∞, ∞),
(1.2.14)
then the distributions of these random variables coincide:
µX (C) = µY (C),    C ⊆ R.
(1.2.15)
This fact is usually mentioned (but not stressed) in the elementary course of probability theory – without proof. I won't give the proof either. In fact, the precise formulation (the above formulation is not absolutely precise: the class of sets C for which the equality (1.2.15) holds is not specified) and the proof of this theorem require knowing measure theory (and the proof requires not just the simple part of this theory).
To compensate for this, let me formulate and prove a simple theorem about cumulative
distribution functions:
Theorem 1.2.3. Every cumulative distribution function has the limit at −∞ equal to 0 and the limit at +∞ equal to 1: lim_{t→−∞} FX (t) = 0, lim_{t→+∞} FX (t) = 1; and it is right-continuous at every point t ∈ (−∞, ∞).
Proof. The events At = {X ≤ t}, − ∞ < t < ∞, form a non-decreasing family
((1.1.12) is satisfied); so
lim_{t→−∞} FX (t) = lim_{t→−∞} P {X ≤ t} = P (∩_t {X ≤ t}) = P (∅) = 0,
lim_{t→∞} FX (t) = lim_{t→∞} P {X ≤ t} = P (∪_t {X ≤ t}) = P (Ω) = 1,
(1.2.16)
lim_{s→t+} FX (s) = lim_{s→t+} P {X ≤ s} = P (∩_{s>t} {X ≤ s}) = P {X ≤ t} = FX (t).
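To see all three properties of Theorem 1.2.3 on a concrete example (a sketch of mine; the Bernoulli(1/2) random variable is chosen because its CDF is a step function, where right-continuity is visible):

    from scipy.stats import bernoulli

    F = bernoulli(0.5).cdf          # CDF of a random variable taking values 0 and 1
    print(F(-100.0), F(100.0))      # 0.0 and 1.0: the limits at -inf and +inf
    print(F(0.0), F(-1e-9))         # 0.5 vs 0.0: right- but not left-continuous at t = 0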
In the same way that we consider random variables taking real values, we can also
consider generalized random variables, taking values in some other space, SP, instead of
the real line R. For example, we often have to consider random vectors, taking values in
the space SP = Rr . A generalized random variable is defined as a function X : Ω → SP
such that, for some specified class of subsets C ⊆ SP we can consider the event {X ∈ C}:
in other words,
{ω : X(ω) ∈ C} ∈ F.
(1.2.17)
How should this class of sets C be chosen? In the case of real-valued random variables we
took all intervals; in the case of random vectors, that is, SP = Rr , we can take either the
class of all r-dimensional “intervals”, i. e., r-dimensional parallelepipeds:
C = I1 × I2 × ... × Ir ,
(1.2.18)
where Ii are intervals; or, with the same result (leading to an equivalent definition), we
can take the class of all open r-dimensional sets C.
The distribution of a generalized random variable is defined the same way as for
real-valued ones: it is a measure on the space SP defined by
µ(C) = µX (C) = P {X ∈ C}
(1.2.19)
(and the class of all C’s that we can consider here is so vast that we cannot produce an
example of a subset C ⊆ SP that cannot be put into the function (1.2.19) – even if such
sets C exist).
If X1 , ..., Xr are random variables (of course, on the same probability space
(Ω, F, P )), their joint distribution µ = µX1 , ..., Xr is, by definition, the distribution of
the r-dimensional random vector X with components X1 , ..., Xr : for an r-dimensional
set C
µ(C) = µX1 , ..., Xr (C) = P {(X1 , ..., Xr ) ∈ C}.
(1.2.20)
Just as in the case of one-dimensional distributions, we can consider r-dimensional discrete
distributions given by
µ(C) = ∑_{(x1 , ..., xr ) ∈ C} p(x1 , ..., xr ),
(1.2.21)
where p(x1 , ..., xr ) = pX1 , ..., Xr (x1 , ..., xr ) is the joint probability mass function (compare formula (1.2.5)); and continuous distributions with an r-dimensional density p(x1 , ..., xr ) = pX1 , ..., Xr (x1 , ..., xr ) (the joint probability density of the random variables X1 , ..., Xr ):
µ(C) = ∫ ··· ∫_C p(x1 , ..., xr ) dx1 ... dxr .
(1.2.22)
You are supposed to know some concrete classes of distributions: the uniform distribution, the Poisson distribution, the exponential, the normal (Gaussian) distribution,
the multidimensional Gaussian (normal) distribution. We’ll return to multidimensional
Gaussian distributions a little later.
An important question is: given the distribution of some random variable or random
vector X, how to find that of a function of it: Y = f (X)?
If the distribution of X is discrete, the question is very simple – in principle: the
distribution of Y is also discrete, and it is characterized by the probability mass function
pY (y) = P {f (X) = y} = P (∪_{x: f (x)=y} {X = x}) = ∑_{x: f (x)=y} P {X = x} = ∑_{x: f (x)=y} pX (x).
(1.2.23)
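Formula (1.2.23) is literally a grouping of the values of pX by the value of f; a short sketch (my illustration, with an arbitrary PMF and f (x) = x²):

    from collections import defaultdict

    p_X = {-2: 0.2, -1: 0.3, 0: 0.1, 1: 0.25, 2: 0.15}   # an arbitrary PMF

    def f(x):
        return x * x                                      # Y = X^2

    p_Y = defaultdict(float)
    for x, p in p_X.items():
        p_Y[f(x)] += p                                    # group p_X by the value f(x)
    print(dict(p_Y))   # p_Y(4) = 0.35, p_Y(1) = 0.55, p_Y(0) = 0.1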
In particular: if we know the probability mass function of the random vector X = (X1 , X2 ) (or, which is the same, the joint probability mass function of the random variables X1 , X2 ), each of its components being a (very simple) function of it, we can find the probability mass function of each component taken separately:
pX1 (x1 ) = ∑_{x2} pX1 , X2 (x1 , x2 ),    pX2 (x2 ) = ∑_{x1} pX1 , X2 (x1 , x2 ).
(1.2.24)
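In code, (1.2.24) amounts to summing a joint PMF table over one index (an illustration of mine, with an arbitrary 2 × 3 joint PMF):

    import numpy as np

    p_joint = np.array([[0.10, 0.20, 0.05],
                        [0.30, 0.15, 0.20]])   # rows: values of X1; columns: values of X2
    p_X1 = p_joint.sum(axis=1)                 # sum over x2: [0.35, 0.65]
    p_X2 = p_joint.sum(axis=0)                 # sum over x1: [0.40, 0.35, 0.25]
    print(p_X1, p_X2)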
For continuous distributions the problem is much more complicated; however, its
particular case of finding the probability density of one component (or several components)
of a random vector given the probability density of the vector, is solved just the same way
as for discrete random variables, only the sums are replaced with integrals:
pX1 (x1 ) = ∫_{−∞}^{∞} pX1 , X2 (x1 , x2 ) dx2 ,    pX2 (x2 ) = ∫_{−∞}^{∞} pX1 , X2 (x1 , x2 ) dx1 ;
(1.2.25)
pX2 (x2 ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} pX1 , X2 , X3 (x1 , x2 , x3 ) dx1 dx3 ,    pX2 , X3 (x2 , x3 ) = ∫_{−∞}^{∞} pX1 , X2 , X3 (x1 , x2 , x3 ) dx1 ;
(1.2.26)
etc.
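A numerical sketch of (1.2.25) (mine; the joint density p(x1 , x2 ) = x1 + x2 on the unit square is an arbitrary valid example):

    from scipy.integrate import quad

    def p_joint(x1, x2):
        # a valid joint density: p(x1, x2) = x1 + x2 on the unit square, 0 elsewhere
        return x1 + x2 if (0.0 <= x1 <= 1.0 and 0.0 <= x2 <= 1.0) else 0.0

    def p_X1(x1):
        # formula (1.2.25), the integral restricted to where p_joint > 0
        return quad(lambda x2: p_joint(x1, x2), 0.0, 1.0)[0]

    print(p_X1(0.3))   # the exact marginal is x1 + 1/2 = 0.8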
1.2.1 Suppose X and Y are two random variables (on the same probability space), X
having a uniform distribution on the interval [0, 1], and Y uniform on [0, 3].
Is it possible for the joint distribution of X and Y to have a (two-dimensional) density?
Is it possible for this joint distribution not to have a density?
1.2.2 Let X and Y be as in the previous problem. Is it possible for the sum X + Y to
have a (one-dimensional) density? Not to have?
Theorem 1.2.4. Let X be a random variable with probability density pX (x); let f (x), x ∈ R, be a one-to-one function (and so having an inverse f −1 (y), with y ranging over the interval ℛ that is the range of the function f : ℛ = f (R) = {f (x) : x ∈ R}), continuously differentiable, with f ′ (x) ≠ 0.
Then the distribution of the random variable Y = f (X) has a density
pY (y) = Iℛ (y) · pX (f −1 (y)) / |f ′ (f −1 (y))|.
(1.2.27)
Proof. We have to prove that for C ⊆ R
µY (C) = P {f (X) ∈ C} = ∫_C Iℛ (y) · pX (f −1 (y)) / |f ′ (f −1 (y))| dy.
(1.2.28)
Here we are confronted with the deficiency of our approach that is not based on measure theory: we don't know precisely for what class of sets C in the real line we need to prove it. It seems reasonable to have it proved for C's being intervals; if (1.2.28) is proved for intervals, then using countable additivity it is proved for finite or countably infinite unions of intervals; and that seems to be enough (possibly not enough from the point of view of rigorous mathematical theory, but clearly enough for extra-mathematical applications).
So we have to prove that for every interval [a, b] (we could consider intervals (a, b)
without their ends, it doesn’t matter)
P {f (X) ∈ [a, b]} = ∫_a^b Iℛ (y) · pX (f −1 (y)) / |f ′ (f −1 (y))| dy.
(1.2.29)
The integral with the integrand multiplied by an indicator function can be rewritten as one over the set intersected with the set ℛ whose indicator is there; since always f (X) ∈ ℛ, the left-hand side also can be rewritten with this intersection: we have to prove that
P {f (X) ∈ [a, b] ∩ ℛ} = ∫_{[a, b] ∩ ℛ} pX (f −1 (y)) / |f ′ (f −1 (y))| dy.
(1.2.30)
The intersection [a, b] ∩ ℛ is again an interval; so we have to prove that for every interval [c, d] ⊆ ℛ
P {f (X) ∈ [c, d]} = ∫_c^d pX (f −1 (y)) / |f ′ (f −1 (y))| dy
(1.2.31)
(think about whether it is justifiable to take closed intervals [c, d]).
The event {f (X) ∈ [c, d]} can be rewritten as
{X ∈ [f −1 (c), f −1 (d)]}
or {X ∈ [f −1 (d), f −1 (c)]}
(1.2.32)
(according to whether f ′ (x) > 0 or < 0). Let us do the calculations in the latter case
(which is a little more complicated):
P {X ∈ [f −1 (d), f −1 (c)]} = ∫_{f −1 (d)}^{f −1 (c)} pX (x) dx.
(1.2.33)
Making the substitution y = f (x), x = f −1 (y), dx = dy / f ′ (f −1 (y)), we get:
P {f (X) ∈ [c, d]} = ∫_d^c pX (f −1 (y)) · 1 / f ′ (f −1 (y)) dy = ∫_c^d pX (f −1 (y)) · (−1) / f ′ (f −1 (y)) dy;
(1.2.34)
since f ′ < 0 in this case, (−1) / f ′ (f −1 (y)) = 1 / |f ′ (f −1 (y))|,
which was to be proved.
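A hedged numerical check of Theorem 1.2.4 (my own sketch; the decreasing function f (x) = e^{−x} applied to an Exponential(1) variable is a convenient test case, since formula (1.2.27) then predicts the uniform density on (0, 1)):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.exponential(size=200_000)                # X ~ Exponential(1)
    Y = np.exp(-X)                                   # f is decreasing; the range is (0, 1)
    # (1.2.27): f^{-1}(y) = -log y, p_X(f^{-1}(y)) = y, |f'(f^{-1}(y))| = y,
    # hence p_Y(y) = y / y = 1 on (0, 1) -- the uniform density.
    hist, _ = np.histogram(Y, bins=10, range=(0.0, 1.0), density=True)
    print(np.round(hist, 2))                         # every bin close to 1.0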
Particular case:
Theorem 1.2.5. Let X be a random variable with probability density pX (x). Then
the distribution of the linear function Y = aX + b of this random variable (a ≠ 0) has the density
pY (y) = (1 / |a|) · pX ((y − b) / a).
(1.2.35)
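A one-line check of (1.2.35) (my illustration; a = −2, b = 3 and a standard normal X are arbitrary, and the negative a exercises the absolute value):

    from scipy.stats import norm

    a, b = -2.0, 3.0                                 # note: a < 0
    y = 1.7
    lhs = (1.0 / abs(a)) * norm.pdf((y - b) / a)     # formula (1.2.35)
    rhs = norm.pdf(y, loc=b, scale=abs(a))           # Y = aX + b is N(3, 2^2)
    print(lhs, rhs)                                  # the two values agree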
Theorem 1.2.6. Let X be a random variable with probability density pX (x); and
Y = f (X). Suppose the real line except for a finite number of points is divided into finitely
many intervals Ii , on each of which the function f is one-to-one and continuously differentiable with non-zero derivative. Let Ri = f (Ii ) = {f (x) : x ∈ Ii }; let fi−1 (y) be a
function defined on Ri , being the inverse function of the function f (x) considered only
on Ii .
Then the distribution of the random variable Y has a density given by
pY (y) = ∑_i IRi (y) · pX (fi−1 (y)) / |f ′ (fi−1 (y))|.
(1.2.36)
The proof is practically the same as that of Theorem 1.2.4, only the first thing we
do is representing the probability P {Y ∈ [a, b]} as the sum
∑_i P {X ∈ Ii , f (X) ∈ [a, b]}.
(1.2.37)
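An illustration of Theorem 1.2.6 (mine, not an answer to the problem below): for f (x) = x² and X standard normal, the line splits into I1 = (−∞, 0) and I2 = (0, ∞), and formula (1.2.36) reproduces the chi-square density with one degree of freedom:

    import numpy as np
    from scipy.stats import norm, chi2

    def p_Y(y):
        # (1.2.36) with f1^{-1}(y) = -sqrt(y), f2^{-1}(y) = sqrt(y) on (0, inf),
        # and |f'(x)| = 2|x|, so |f'(f_i^{-1}(y))| = 2*sqrt(y)
        r = np.sqrt(y)
        return (norm.pdf(r) + norm.pdf(-r)) / (2.0 * r)

    y = 0.8
    print(p_Y(y), chi2.pdf(y, df=1))   # the two values agree: Y = X^2 is chi-square(1)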
1.2.3 Let X be a random variable with probability density
pX (x) = (4 − x)/8 for x ∈ [0, 4],    pX (x) = 0 for x ∉ [0, 4];
(1.2.38)
Y = (X − 1)².
Find the probability density pY (y) of this random variable. Draw the graph of this
density.
We are also going to need the multidimensional versions of these results.
Theorem 1.2.7. Let X be an n-dimensional random vector having a continuous distribution with density pX (x). Let f (x) = (f1 (x1 , ..., xn ), ..., fn (x1 , ..., xn )) be a one-to-one function from Rn onto its range ℛ = f (Rn ), continuously differentiable, and with the Jacobian determinant
            ( ∂f1 /∂x1 ... ∂f1 /∂xn )
J(x) = det  (    ...    ...    ...  ) ≠ 0.
            ( ∂fn /∂x1 ... ∂fn /∂xn )
(1.2.39)
Then the random vector Y = f (X) has a continuous distribution with density
pY (y) = Iℛ (y) · pX (f −1 (y)) / |J (f −1 (y))|.
(1.2.40)
Again a particular case:
Theorem 1.2.8. Let X be an n-dimensional random vector having continuous distribution with density pX (x). Let A be a non-singular n × n matrix, and b a vector
belonging to Rn (we are writing vectors as column vectors now). Then the random vector
Y = AX + b has a continuous distribution with density
pY (y) = (1 / | det(A)|) · pX (A−1 (y − b)).
(1.2.41)
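A pointwise check of (1.2.41) (a sketch of mine; the particular matrix A, vector b and the 2-dimensional standard normal X are arbitrary, and we compare with the exact N (b, AAᵀ) density):

    import numpy as np
    from scipy.stats import multivariate_normal

    A = np.array([[2.0, 1.0], [0.0, 1.0]])      # a non-singular matrix
    b = np.array([1.0, -1.0])
    p_X = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)).pdf

    y = np.array([0.5, 0.2])
    x = np.linalg.solve(A, y - b)               # x = A^{-1}(y - b)
    p_Y = p_X(x) / abs(np.linalg.det(A))        # formula (1.2.41)
    # exact answer: Y = AX + b is normal with mean b and covariance A A^T
    exact = multivariate_normal(mean=b, cov=A @ A.T).pdf(y)
    print(p_Y, exact)                           # the two values agree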
Theorem 1.2.9. Let X be an n-dimensional random vector having continuous distribution with density pX (x). Let the whole space Rn , except for its part that can be
disregarded by integration (a negligible set; e. g., several smooth surfaces of dimension
smaller than n), be divided into finitely many regions Gi , in each of which the function
f : Gi → Ri (Ri = f (Gi )) is one-to-one, continuously differentiable with nonzero Jacobian
determinant and with the inverse fi−1 .
Then the random vector Y = f (X) has a continuous distribution with density
pY (y) = ∑_i IRi (y) · pX (fi−1 (y)) / |J (fi−1 (y))|.
(1.2.42)
In contrast with formulas (1.2.25), (1.2.26), Theorems 1.2.4 – 1.2.9 are about the distributions of functions of a random object (random variable or random vector) X of the same dimension as the object X itself. But combining Theorems 1.2.7 – 1.2.9 with formulas (1.2.25), (1.2.26), sometimes we can, given the density of an n-dimensional random vector X, find the probability density of a function of it Y = f (X) = (f1 (X), ..., fk (X)) of a smaller dimension k. To do this, we complement the functions f1 (x), ..., fk (x) with another (n − k) functions fk+1 (x), ..., fn (x) so that the function f (x) = (f1 (x), ..., fn (x)) satisfies the conditions of one of the Theorems 1.2.7 – 1.2.9; and then apply formulas (1.2.25) – (1.2.26).
For example, this way we can obtain
Theorem 1.2.10. Let X, Y have a joint density pX, Y (x, y). Then the random
variable Z = X + Y has a continuous distribution with density
pZ (z) = ∫_{−∞}^{∞} pX, Y (x, z − x) dx.
(1.2.43)
Proof. The random vector (X, Z)ᵀ = (X, X + Y )ᵀ is obtained from the random vector (X, Y )ᵀ by multiplication:
(X, Z)ᵀ = A · (X, Y )ᵀ,
(1.2.44)
where
A = ( 1 0 )
    ( 1 1 ).
(1.2.45)
The determinant of A is equal to 1, and
A−1 = (  1 0 )
      ( −1 1 ).
By Theorem 1.2.8 we have:
pX, Z (x, z) = pX, Y (A−1 · (x, z)ᵀ) = pX, Y (x, z − x);
(1.2.46)
and by formula (1.2.25)
pZ (z) = ∫_{−∞}^{∞} pX, Z (x, z) dx = ∫_{−∞}^{∞} pX, Y (x, z − x) dx.
(1.2.47)
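Finally, a numerical sketch of the convolution formula (1.2.43) (mine; independent U [0, 1] variables are chosen so that the joint density is the product of the marginals and the answer is the known triangular density):

    from scipy.integrate import quad

    def p_XY(x, y):
        # joint density of independent X ~ U[0, 1], Y ~ U[0, 1]
        return 1.0 if (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0) else 0.0

    def p_Z(z):
        # formula (1.2.43); the integrand vanishes outside [z - 1, z] intersected with [0, 1]
        lo, hi = max(0.0, z - 1.0), min(1.0, z)
        if lo >= hi:
            return 0.0
        return quad(lambda x: p_XY(x, z - x), lo, hi)[0]

    for z in (0.5, 1.0, 1.5):
        print(z, p_Z(z))   # triangle density: 0.5, 1.0, 0.5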