Section 1.2. Random variables and distributions.

The next concepts of probability theory: a random variable, and the distribution of a random variable.

A random variable is a function X : Ω → R – a real-valued function X = X(ω) defined on the sample space – that has the following property:

• For every interval I ⊆ R the set {ω : X(ω) ∈ I} is an event: {ω : X(ω) ∈ I} ∈ F.

Two random variables X and Y (on the same probability space (Ω, F, P)) are called equivalent, X ∼ Y, if they are equal to one another almost surely:

P{X ≠ Y} = 0 (1.2.1)

(or, which is the same, P{X = Y} = 1). We are going to systematically disregard the distinction between equivalent random variables.

The word "distribution" is used in probability theory in two different ways. Either it is used as some sort of key word to speak of several similar concepts: e.g., if we are told to find the distribution of a random variable, it means finding what is called the "probability mass function" if this is a discrete random variable, or finding the distribution density if it is a continuous random variable, or finding the cumulative distribution function. Or the word "distribution" may be used as a precise mathematical term. In this course, we choose the second way.

The distribution of the random variable X is, by definition, a function µ = µX of subsets of the real line R defined by

µ(C) = µX(C) = P{X ∈ C}. (1.2.2)

The notation P{X ∈ C} is short for P({ω : X(ω) ∈ C}). You see: something is standing under the sign of probability. What can it be? Probabilities can be only of events, and every event is a set consisting of sample points ω. So mentioning ω in the notation P({ω : X(ω) ∈ C}) is redundant. Also: there are braces { } inside the parentheses ( ): too complicated; we had better keep just one pair, and of the more characteristic kind: the braces { }.

For what class of sets C ⊆ R is the set function µ defined?
In other words, for what subsets of the real line is the set {X ∈ C} an event? This is a question belonging to the set-theoretic introduction to both measure theory and probability theory; and we have decided not to speak about such questions. Suffice it to say that of course every interval belongs to this class of sets; and that we cannot construct a subset of the real line that does not belong to this class of sets (even if we can prove – ineffectively – that such sets C exist).

It turns out that the set function µX is necessarily a measure. Indeed, it is, of course, nonnegative (between 0 and 1); as for countable additivity: if C1, C2, ..., Cn, ... is a sequence of disjoint subsets of the real line (i.e., Ci ∩ Cj = ∅ for i ≠ j), we have:

{ω : X(ω) ∈ ∪_{i=1}^∞ Ci} = ∪_{i=1}^∞ {ω : X(ω) ∈ Ci} (1.2.3)

(using short notations: {X ∈ ∪_{i=1}^∞ Ci} = ∪_{i=1}^∞ {X ∈ Ci}); these events are disjoint, and from countable additivity of the probability P we obtain:

µX(∪_{i=1}^∞ Ci) = P(∪_{i=1}^∞ {X ∈ Ci}) = ∑_{i=1}^∞ P{X ∈ Ci} = ∑_{i=1}^∞ µX(Ci). (1.2.4)

Two important classes of distributions:

• Discrete distributions are distributions of discrete random variables, i.e., random variables taking only countably many values x1, x2, ..., xn(, ...). Discrete distributions are given by

µ(C) = ∑_{i: xi ∈ C} p(xi) = ∑_{x ∈ C} p(x), (1.2.5)

where the nonnegative numbers p(x) = µ{x} = P{X = x} (only countably many of the numbers p(x) are different from 0; the sum of all non-zero p(x) is equal to P{X ∈ {x1, x2, ..., xn(, ...)}} = P(Ω) = 1); the function p(x) = pX(x) is called the probability mass function (of the random variable X).
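To make formula (1.2.5) concrete, here is a minimal Python sketch for an invented discrete random variable; the values and masses below are our own illustration, not taken from the text.

```python
# A hypothetical discrete random variable X taking the values 1..4
# with made-up masses p(x); mu(C) sums p(x) over x in C, as in (1.2.5).
pmf = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def mu(C):
    """Distribution mu_X evaluated on a set C: sum of p(x) over x in C."""
    return sum(p for x, p in pmf.items() if x in C)

print(mu({1, 2}))                            # mass of C = {1, 2}
print(abs(mu({1}) + mu({2}) - mu({1, 2})))   # additivity over disjoint sets
print(mu({1, 2, 3, 4}))                      # total mass: 1 up to rounding
```

The additivity check mirrors (1.2.4) on a finite example: for disjoint sets, µ of the union equals the sum of the µ's.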
• Continuous distributions are given by

µ(C) = ∫_C p(x) dx, (1.2.6)

where the nonnegative function p(x) = pX(x) is called the distribution density, or the probability density, or just the density (just as ∑_x p(x) = 1 for the probability mass function, for the density we have ∫ p(x) dx = 1; according to our convention about notations, the integral is taken over all possible values: from −∞ to ∞). You should keep in mind that a density is not determined uniquely: we can change it, say, at one point, or at one hundred and fifty points, the integrals won't change, and still it will be a density of the same random variable. We can speak about different versions of the density. All statements that we make about a probability density hold not for all values of its argument, but, generally, with exceptions on a set that can be disregarded while integrating (we can call such sets negligible sets).

Theorem 1.2.1. If two random variables X and Y are equivalent, X ∼ Y, they have the same distribution: µX = µY.

Proof. We have to prove that for an arbitrary set C ⊆ R (yes, yes, an arbitrary set belonging to a class that we did not describe precisely, but which is very vast and contains at least all intervals; but I told you that I won't pay attention to such things)

µX(C) = µY(C), (1.2.7)

or

P{X ∈ C} = P{Y ∈ C}. (1.2.8)

Now draw a picture (I cannot draw pictures here) of the sample space Ω and its two subsets (events): {X ∈ C} and {Y ∈ C}. We have (look at your picture!):

|P{X ∈ C} − P{Y ∈ C}| = |P({X ∈ C} \ {Y ∈ C}) − P({Y ∈ C} \ {X ∈ C})| (1.2.9)
≤ P({X ∈ C} \ {Y ∈ C}) + P({Y ∈ C} \ {X ∈ C}) = P({X ∈ C} ∆ {Y ∈ C}), (1.2.10)

where ∆ is the sign for the symmetric difference of two sets: A ∆ B = (A \ B) ∪ (B \ A) (draw another picture, or look at the picture you have drawn).
Clearly,

{X ∈ C} ∆ {Y ∈ C} ⊆ {X ≠ Y} (1.2.11)

(in everyday language rather than in that of sets: one of the events {X ∈ C}, {Y ∈ C} occurring, but not the other, can only happen if X does not coincide with Y). So we have:

|P{X ∈ C} − P{Y ∈ C}| ≤ P{X ≠ Y} = 0, (1.2.12)

so P{X ∈ C} = P{Y ∈ C}.

If X is a random variable, its cumulative distribution function F(t) = FX(t), −∞ < t < ∞, is defined as

FX(t) = P{X ≤ t}. (1.2.13)

It turns out that the cumulative distribution function determines the distribution of a random variable uniquely:

Theorem 1.2.2. If the cumulative distribution functions of two random variables coincide:

FX(t) = FY(t) for all t ∈ (−∞, ∞), (1.2.14)

then the distributions of these random variables coincide:

µX(C) = µY(C), C ⊆ R. (1.2.15)

This fact is usually mentioned (but not put stress upon) in the elementary course of probability theory – without proof. I won't give the proof either. In fact, the precise formulation (the above formulation is not absolutely precise: the class of sets C for which the equality (1.2.15) holds is not specified) and the proof of this theorem require knowing measure theory (and for the proof, even not the simple part of this theory). To compensate for this, let me formulate and prove a simple theorem about cumulative distribution functions:

Theorem 1.2.3. Every cumulative distribution function has the limit at −∞ equal to 0: lim_{t→−∞} FX(t) = 0; the limit at ∞ equal to 1: lim_{t→∞} FX(t) = 1; and it is right-continuous at every point t ∈ (−∞, ∞).

Proof. The events At = {X ≤ t}, −∞ < t < ∞, form a non-decreasing family ((1.1.12) is satisfied); so

lim_{t→−∞} FX(t) = lim_{t→−∞} P{X ≤ t} = P(∩_t {X ≤ t}) = P(∅) = 0,
lim_{t→∞} FX(t) = lim_{t→∞} P{X ≤ t} = P(∪_t {X ≤ t}) = P(Ω) = 1, (1.2.16)
lim_{s→t+} FX(s) = lim_{s→t+} P{X ≤ s} = P(∩_{s>t} {X ≤ s}) = P{X ≤ t} = FX(t).
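The properties in Theorem 1.2.3 are easy to check numerically on a concrete cumulative distribution function. Below is a small Python sketch using the exponential CDF F(t) = 1 − e^(−t) for t ≥ 0 as our own example (this distribution is only mentioned, not worked out, in the text).

```python
import math

# CDF of the exponential distribution (our own concrete example):
# F(t) = 1 - exp(-t) for t >= 0, and F(t) = 0 for t < 0.
def F(t):
    return 1.0 - math.exp(-t) if t >= 0 else 0.0

print(F(-1e6))                 # far to the left: the limit at -infinity is 0
print(F(50.0))                 # far to the right: the limit at +infinity is 1
for s in (1e-2, 1e-4, 1e-8):   # right-continuity at t = 0:
    print(F(0 + s) - F(0))     # the difference shrinks to 0 as s -> 0+
```

Note that F is also left-continuous here; that is special to continuous distributions, and Theorem 1.2.3 only guarantees continuity from the right.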
In the same way that we consider random variables taking real values, we can also consider generalized random variables, taking values in some other space, SP, instead of the real line R. For example, we often have to consider random vectors, taking values in the space SP = R^r. A generalized random variable is defined as a function X : Ω → SP such that, for every set C in some specified class of subsets of SP, we can consider the event {X ∈ C}: in other words,

{ω : X(ω) ∈ C} ∈ F. (1.2.17)

How should this class of sets C be chosen? In the case of real-valued random variables we took all intervals; in the case of random vectors, that is, SP = R^r, we can take either the class of all r-dimensional "intervals", i.e., r-dimensional parallelepipeds:

C = I1 × I2 × ... × Ir, (1.2.18)

where the Ii are intervals; or, with the same result (leading to an equivalent definition), we can take the class of all open r-dimensional sets C.

The distribution of a generalized random variable is defined the same way as for real-valued ones: it is a measure on the space SP defined by

µ(C) = µX(C) = P{X ∈ C} (1.2.19)

(and the class of all C's that we can consider here is so vast that we cannot produce an example of a subset C ⊆ SP that cannot be put into the function (1.2.19) – even if such sets C exist).

If X1, ..., Xr are random variables (of course, on the same probability space (Ω, F, P)), their joint distribution µ = µX1,...,Xr is, by definition, the distribution of the r-dimensional random vector X with components X1, ..., Xr: for an r-dimensional set C

µ(C) = µX1,...,Xr(C) = P{(X1, ..., Xr) ∈ C}.
(1.2.20)

Just as in the case of one-dimensional distributions, we can consider r-dimensional discrete distributions given by

µ(C) = ∑_{(x1, ..., xr) ∈ C} p(x1, ..., xr), (1.2.21)

where p(x1, ..., xr) = pX1,...,Xr(x1, ..., xr) is the joint probability mass function (compare formula (1.2.5)); and continuous distributions with an r-dimensional density p(x1, ..., xr) = pX1,...,Xr(x1, ..., xr) (the joint probability density of the random variables X1, ..., Xr):

µ(C) = ∫···∫_C p(x1, ..., xr) dx1 ... dxr. (1.2.22)

You are supposed to know some concrete classes of distributions: the uniform distribution, the Poisson distribution, the exponential distribution, the normal (Gaussian) distribution, the multidimensional Gaussian (normal) distribution. We'll return to multidimensional Gaussian distributions a little later.

An important question is: given the distribution of some random variable or random vector X, how to find that of a function of it: Y = f(X)? If the distribution of X is discrete, the question is very simple – in principle: the distribution of Y is also discrete, and it is characterized by the probability mass function

pY(y) = P{f(X) = y} = P(∪_{x: f(x)=y} {X = x}) = ∑_{x: f(x)=y} P{X = x} = ∑_{x: f(x)=y} pX(x). (1.2.23)

In particular: if we know the probability mass function of the random vector X = (X1, X2) (or, which is the same, the joint probability mass function of the random variables X1, X2), each component being a (very simple) function of this vector, we can find the probability mass function of each component taken separately:

pX1(x1) = ∑_{x2} pX1,X2(x1, x2), pX2(x2) = ∑_{x1} pX1,X2(x1, x2).
(1.2.24)

For continuous distributions the problem is much more complicated; however, its particular case of finding the probability density of one component (or several components) of a random vector given the probability density of the vector is solved just the same way as for discrete random variables, only the sums are replaced with integrals:

pX1(x1) = ∫_{−∞}^{∞} pX1,X2(x1, x2) dx2, pX2(x2) = ∫_{−∞}^{∞} pX1,X2(x1, x2) dx1; (1.2.25)

pX2(x2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} pX1,X2,X3(x1, x2, x3) dx1 dx3, pX2,X3(x2, x3) = ∫_{−∞}^{∞} pX1,X2,X3(x1, x2, x3) dx1; (1.2.26)

etc.

1.2.1. Suppose X and Y are two random variables (on the same probability space), X having a uniform distribution on the interval [0, 1], and Y uniform on [0, 3]. Is it possible for the joint distribution of X and Y to have a (two-dimensional) density? Is it possible for this joint distribution not to have a density?

1.2.2. Let X and Y be as in the previous problem. Is it possible for the sum X + Y to have a (one-dimensional) density? Not to have?

Theorem 1.2.4. Let X be a random variable with probability density pX(x); let f(x), x ∈ R, be a one-to-one function (and so having an inverse f⁻¹(y), y changing in the interval ℛ = f(R) = {f(x) : x ∈ R}, the range of the function f; we write ℛ to distinguish the range from the real line R), continuously differentiable, with f′(x) ≠ 0. Then the distribution of the random variable Y = f(X) has a density

pY(y) = I_ℛ(y) · (1 / |f′(f⁻¹(y))|) · pX(f⁻¹(y)). (1.2.27)

Proof. We have to prove that for C ⊆ R

µY(C) = P{f(X) ∈ C} = ∫_C I_ℛ(y) · (1 / |f′(f⁻¹(y))|) · pX(f⁻¹(y)) dy. (1.2.28)

Here we are confronted with the deficiency of our approach that is not based on measure theory: we don't know precisely for what class of sets C in the real line we need to prove it.
It seems reasonable to have it proved for C's being intervals; if (1.2.28) is proved for intervals, then using the countable additivity it is proved for finite or countably infinite unions of intervals; and that seems to be enough (possibly not enough from the point of view of rigorous mathematical theory, but clearly enough for extra-mathematical applications). So we have to prove that for every interval [a, b] (we could consider intervals (a, b) without their ends, it doesn't matter)

P{f(X) ∈ [a, b]} = ∫_a^b I_ℛ(y) · (1 / |f′(f⁻¹(y))|) · pX(f⁻¹(y)) dy. (1.2.29)

The integral with the integrand multiplied by an indicator function can be rewritten as one over the set of integration intersected with the set ℛ whose indicator is there; since always f(X) ∈ ℛ, the left-hand side also can be rewritten with this intersection: we have to prove that

P{f(X) ∈ [a, b] ∩ ℛ} = ∫_{[a, b] ∩ ℛ} (1 / |f′(f⁻¹(y))|) · pX(f⁻¹(y)) dy. (1.2.30)

The intersection [a, b] ∩ ℛ is again an interval; so we have to prove that for every interval [c, d] ⊆ ℛ

P{f(X) ∈ [c, d]} = ∫_c^d (1 / |f′(f⁻¹(y))|) · pX(f⁻¹(y)) dy (1.2.31)

(think about whether it is justifiable to take closed intervals [c, d]).

The event {f(X) ∈ [c, d]} can be rewritten as

{X ∈ [f⁻¹(c), f⁻¹(d)]} or {X ∈ [f⁻¹(d), f⁻¹(c)]} (1.2.32)

(according to whether f′(x) > 0 or < 0). Let us do the calculations in the latter case (which is a little more complicated):

P{X ∈ [f⁻¹(d), f⁻¹(c)]} = ∫_{f⁻¹(d)}^{f⁻¹(c)} pX(x) dx. (1.2.33)

Making the substitution y = f(x), x = f⁻¹(y), dx = (1 / f′(f⁻¹(y))) dy, we get:

P{f(X) ∈ [c, d]} = ∫_d^c pX(f⁻¹(y)) · (1 / f′(f⁻¹(y))) dy = ∫_c^d pX(f⁻¹(y)) · (−1 / f′(f⁻¹(y))) dy = ∫_c^d pX(f⁻¹(y)) · (1 / |f′(f⁻¹(y))|) dy (1.2.34)

(in this case f′ < 0, so −1/f′ = 1/|f′|); which was to be proved.

A particular case:

Theorem 1.2.5. Let X be a random variable with probability density pX(x). Then the distribution of the linear function Y = aX + b of this random variable (a ≠ 0) has the density

pY(y) = (1 / |a|) · pX((y − b) / a). (1.2.35)

Theorem 1.2.6.
Let X be a random variable with probability density pX(x); and Y = f(X). Suppose the real line, except for a finite number of points, is divided into finitely many intervals Ii, on each of which the function f is one-to-one and continuously differentiable with non-zero derivative. Let Ri = f(Ii) = {f(x) : x ∈ Ii}; let fi⁻¹(y) be the function defined on Ri that is the inverse of the function f(x) considered only on Ii. Then the distribution of the random variable Y has a density given by

pY(y) = ∑_i I_{Ri}(y) · (1 / |f′(fi⁻¹(y))|) · pX(fi⁻¹(y)). (1.2.36)

The proof is practically the same as that of Theorem 1.2.4, only the first thing we do is represent the probability P{Y ∈ [a, b]} as the sum

∑_i P{X ∈ Ii, f(X) ∈ [a, b]}. (1.2.37)

1.2.3. Let X be a random variable with probability density

pX(x) = (4 − x)/8 for x ∈ [0, 4], pX(x) = 0 for x ∉ [0, 4]; (1.2.38)

Y = (X − 1)². Find the probability density pY(y) of this random variable. Draw the graph of this density.

We are also going to need the multidimensional versions of these results.

Theorem 1.2.7. Let X be an n-dimensional random vector having continuous distribution with density pX(x). Let f(x) = (f1(x1, ..., xn), ..., fn(x1, ..., xn)) be a one-to-one function from R^n onto its range ℛ = f(R^n), continuously differentiable, and with the Jacobian determinant

J(x) = det | ∂f1/∂x1 ... ∂f1/∂xn |
           | ...     ...     ... | ≠ 0. (1.2.39)
           | ∂fn/∂x1 ... ∂fn/∂xn |

Then the random vector Y = f(X) has a continuous distribution with density

pY(y) = I_ℛ(y) · (1 / |J(f⁻¹(y))|) · pX(f⁻¹(y)). (1.2.40)

Again a particular case:

Theorem 1.2.8. Let X be an n-dimensional random vector having continuous distribution with density pX(x). Let A be a non-singular n × n matrix, and b a vector belonging to R^n (we are writing vectors as column vectors now). Then the random vector Y = AX + b has a continuous distribution with density

pY(y) = (1 / |det(A)|) · pX(A⁻¹(y − b)). (1.2.41)

Theorem 1.2.9.
Let X be an n-dimensional random vector having continuous distribution with density pX(x). Let the whole space R^n, except for a part that can be disregarded in integration (a negligible set; e.g., several smooth surfaces of dimension smaller than n), be divided into finitely many regions Gi, in each of which the function f : Gi → Ri (Ri = f(Gi)) is one-to-one, continuously differentiable with nonzero Jacobian determinant, and with inverse fi⁻¹. Then the random vector Y = f(X) has a continuous distribution with density

pY(y) = ∑_i I_{Ri}(y) · (1 / |J(fi⁻¹(y))|) · pX(fi⁻¹(y)). (1.2.42)

In contrast with formulas (1.2.25), (1.2.26), Theorems 1.2.4 – 1.2.9 are about the distributions of functions of a random object (random variable or random vector) X of the same dimension as the object X itself. But combining Theorems 1.2.7 – 1.2.9 with formulas (1.2.25), (1.2.26), sometimes we can, given the density of an n-dimensional random vector X, find the probability density of a function Y = f(X) = (f1(X), ..., fk(X)) of it of a smaller dimension k. To do this, we complement the functions f1(x), ..., fk(x) with another (n − k) functions fk+1(x), ..., fn(x) so that the function f(x) = (f1(x), ..., fn(x)) satisfies the conditions of one of the Theorems 1.2.7 – 1.2.9; and then apply formulas (1.2.25) – (1.2.26). For example, this way we can obtain

Theorem 1.2.10. Let X, Y have a joint density pX,Y(x, y). Then the random variable Z = X + Y has a continuous distribution with density

pZ(z) = ∫_{−∞}^{∞} pX,Y(x, z − x) dx. (1.2.43)

Proof. The random vector (X, Z) = (X, X + Y) (written as a column vector) is obtained from the random vector (X, Y) by multiplication:

(X, Z)ᵀ = A · (X, Y)ᵀ, (1.2.44)

where

A = | 1 0 |
    | 1 1 |. (1.2.45)

The determinant of A is equal to 1, and

A⁻¹ = |  1 0 |
      | −1 1 |.

By Theorem 1.2.8 we have:

pX,Z(x, z) = pX,Y(A⁻¹ · (x, z)ᵀ) = pX,Y(x, z − x); (1.2.46)

and by formula (1.2.25)

pZ(z) = ∫_{−∞}^{∞} pX,Z(x, z) dx = ∫_{−∞}^{∞} pX,Y(x, z − x) dx.
(1.2.47)
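As a numerical illustration of Theorem 1.2.10, here is a Python sketch for our own example of independent X and Y uniform on [0, 1]; there pX,Y(x, y) = 1 on the unit square, and formula (1.2.43) yields the triangular density, which we check against a simulation.

```python
import random

# For independent X, Y uniform on [0, 1], the integral (1.2.43) of
# p_{X,Y}(x, z - x) = 1_{0 <= x <= 1, 0 <= z - x <= 1} over x gives the
# triangular density: p_Z(z) = z on [0, 1] and p_Z(z) = 2 - z on (1, 2].
def p_Z(z):
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

# Check P{Z <= 1} = integral of p_Z over [0, 1] = 1/2 against simulation.
random.seed(3)
n = 100_000
emp = sum(1 for _ in range(n) if random.random() + random.random() <= 1.0) / n
print(emp)   # close to 0.5
```

The same density could also be reached the way the proof does it: transform (X, Y) by the matrix A of (1.2.45) and then integrate out the first coordinate as in (1.2.25).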