Lecture 3: 1-D Random Variables

1. Probability

Many people have an intuitive understanding of probability. For example, when one hears the weather forecast 'There is a 50% chance of snow today', it suggests that the probability of the event that it will snow today is 0.5. In fact, probability is a measure of the 'size' of an event. Hence, in order to develop a firm grasp of the basic mathematical elements of probability, it is first necessary to understand what an event is. Then it is necessary to discuss in concrete terms what is meant by a measure of size.

Definition 1. An event is a set; that is, it is a collection of entities (or objects). The notation { } will denote a set.

For example, consider a weather forecast that includes the event that it snows and the event that it does not snow. The notation for these events is {recording snow} and {recording no snow}. Note the use of the word 'record'. It is an action verb. It is not correct to say that 'to snow' is an event. It may well snow, but not in sufficient amounts at a given measurement location to record it. This distinction points out a crucial characteristic of an event; namely, how it is recorded/observed/measured. This, in turn, is related directly to the resolution of the sensing device. The following example may help to clarify this.

Example 1. An increasingly important problem in the aviation industry is the problem of aging aircraft. Over time, a part such as a wing or a turbine blade will develop micro-cracks. Even though a part may have a crack, to say that 'the part has a crack' is not an event. It is a claim as to an attribute of that part. This claim is not the same as the claim: 'Based on a nondestructive evaluation of the part with crack width detection resolution of .005 mm, the part has a crack.' This claim includes a description of the act of recording the presence of a crack of at least .005 mm width.
□

At this point a typical course in probability would proceed to discuss the general concepts of a sample space and sigma algebras. Since we do not have the luxury of devoting so much time to probability theory, we will present these and other concepts in relation to a random variable.

Definition 2. A random variable is an action which, when performed, results in a number. Moreover, the number that results is not perfectly predictable. Notationally, the action will be assigned an uppercase letter (e.g. X), whereas the number that results from performing the action will be assigned the corresponding lowercase letter (e.g. x).

Example 1 (continued). Let X = the act of recording whether or not a crack is detected. If we enter the number 0 to represent 'no crack' and 1 to represent 'crack', then we have two possible events. Notationally, these can be represented simply as {0} and {1}, respectively. In the parlance of random variables, these two events are typically denoted as [X = 0] and [X = 1], respectively. For a number x ∈ {0,1}, the notation becomes {x} or, equivalently, [X = x]. □

Definition 3. The set (or collection) of all the possible numerical values that a random variable, X, can have is called the sample space for X. It will be denoted as S_X.

Definition 4. The set (or collection) of all possible subsets of S_X is called the field of events. It will be denoted as ℱ_X.

Example 1 (continued). The sample space for X = the act of noting whether or not a crack of at least .005 mm is detected is S_X = {0,1}. The corresponding field of events is ℱ_X = {{0}, {1}, S_X, ∅}, where ∅ is the symbol for the empty set. □

The above definitions are not entirely mathematically correct. We have ignored the concept of sigma algebras. But since this is not a course in probability theory, these definitions will suffice for our purposes. We are now in a position to discuss probability.

Definition 5. Let X be a random variable with sample space S_X and field of events ℱ_X.
Then probability, denoted Pr(*), is a measure of the size of the sets contained in ℱ_X. Moreover, it has the following properties:

(P1): Pr(∅) = 0;
(P2): Pr(S_X) = 1;
(P3): for any A, B ∈ ℱ_X with A ∩ B = ∅, Pr(A ∪ B) = Pr(A) + Pr(B).

It needs to be emphasized that Pr(*) measures the size of a set, and not the size of a number. For example, Pr(1) makes no sense, whereas Pr({1}) does.

Example 1 (continued). From property (P2), Pr(S_X) = Pr({0,1}) = 1. Suppose that we define Pr({1}) = p. Since {0} ∩ {1} = ∅, and {0} ∪ {1} = S_X, it follows from properties (P2) and (P3) that:

1 = Pr(S_X) = Pr({0} ∪ {1}) = Pr({0}) + Pr({1}) = Pr({0}) + p, so that Pr({0}) = 1 − p. □

Definition 6. Let X be a random variable with sample space S_X, field of events ℱ_X, and probability measure Pr(*). Then the ordered triple (S_X, ℱ_X, Pr) is said to be a probability space for X.

Even though, for a given random variable, say X, the attendant probability measure Pr(*) is appropriate, very often it is not used directly. Instead, one of the following two quantities related to it is used.

Definition 7. Let X be a random variable with sample space S_X, field of events ℱ_X, and probability measure Pr(*). For any number x ∈ (−∞, ∞), the half-open interval (−∞, x], which can also be denoted as [X ≤ x], is called a cumulative event. The probability of this event, Pr{(−∞, x]} = Pr[X ≤ x] = F_X(x), is called the cumulative distribution function (cdf) for X.

Definition 8. The (perhaps generalized) derivative of F_X(x), f_X(x) = dF_X(x)/dx, is called the probability density function (pdf) for X.

Example 1 (continued). Since the sample space for X is S_X = {0,1}, then for any x ∈ (−∞, 0) we have F_X(x) = Pr{(−∞, x]} = Pr[X ≤ x] = 0. Since Pr[X = 0] = 1 − p, we have F_X(0) = Pr{(−∞, 0]} = Pr[X ≤ 0] = 1 − p. Since there is no probability to accumulate for x > 0 until we reach the point x = 1, we have F_X(x) = Pr[X ≤ x] = 1 − p for 0 ≤ x < 1, and F_X(x) = Pr[X ≤ x] = 1 for x ≥ 1. The cdf is shown in Figure 1 below.
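Before turning to the figure, the piecewise cdf just derived is easy to express and check in code. A minimal Python sketch (the value p = 0.3 is chosen arbitrarily for illustration):

```python
def bernoulli_cdf(x, p):
    """cdf of Example 1: 0 for x < 0, 1 - p on [0, 1), 1 for x >= 1."""
    if x < 0:
        return 0.0
    if x < 1:
        return 1.0 - p
    return 1.0

p = 0.3                                           # illustrative value only
assert bernoulli_cdf(-0.5, p) == 0.0              # no probability left of 0
assert abs(bernoulli_cdf(0.0, p) - 0.7) < 1e-12   # jump of size 1 - p at x = 0
assert abs(bernoulli_cdf(0.99, p) - 0.7) < 1e-12  # flat between the two jumps
assert bernoulli_cdf(1.0, p) == 1.0               # total probability 1, as (P2) requires
```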
Figure 1. Graph of the cdf related to Example 1: F_X(x) = 0 for x < 0, F_X(x) = 1 − p for 0 ≤ x < 1, and F_X(x) = 1 for x ≥ 1.

Clearly, the formal derivative f_X(x) = dF_X(x)/dx does not exist at the collection of points S_X = {0,1}. In this situation we will define the generalized derivative:

f_X(x) = dF_X(x)/dx = (1 − p) δ(x) + p δ(x − 1).

The 'function' δ(x) is called the Dirac delta function. It is not a proper function. Rather, it is defined in relation to its integral. Specifically, for any continuous function g(x) and any chosen x₀, δ(x) is defined via its 'sifting' property:

∫ g(x) δ(x − x₀) dx = g(x₀).

In words, what δ(*) does is sift out the value of the function it is integrated against, at the location that makes the argument * equal to zero. Another, less mathematical, term for describing δ(x), and one that is used in systems and control theory, is the unit impulse. In relation to F_X(x) given in Figure 1, the generalized derivative, f_X(x), is shown below.

Figure 2. The generalized derivative of F_X(x) given in Figure 1: impulses of intensity 1 − p at x = 0 and p at x = 1.

Remark. The numerical values 1 − p and p shown in Figure 2 are not the values of f_X(x). This function has no well-defined values on S_X = {0,1}. Loosely speaking, one could also say that its value is infinity on this set. The numerical values 1 − p and p are said to be the intensity values of f_X(x). They are, in fact, the sizes of the jumps of the cdf F_X(x).

Example 2. The lifetimes of electrical components are often such that the probability of failure is greatest at the beginning of life. This is why, when you order a personal computer, the manufacturer will run a series of exhaustive tests (i.e. burn it in) prior to sending it to you. Let X denote the act of recording the lifetime (i.e. time to failure) of an electronic device, and, for convenience, suppose that the recording resolution is infinitely precise. Then the sample space for X is S_X = (0, ∞). Since this is a continuous open interval, X is said to be a continuous random variable.
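An aside before continuing with Example 2: the sifting property of δ(x) stated above can be illustrated numerically by replacing the impulse with a narrow unit-area Gaussian pulse; as the pulse narrows, integrating g(x) against it returns g(x₀). This sketch uses only the standard library, with g(x) = cos x and x₀ = 0.5 as arbitrary illustrative choices:

```python
import math

def delta_approx(u, eps):
    """Unit-area Gaussian pulse; approaches the Dirac delta as eps -> 0."""
    return math.exp(-u * u / (2 * eps * eps)) / (eps * math.sqrt(2 * math.pi))

def sift(g, x0, eps=1e-3, span=0.1, n=20_000):
    """Midpoint-rule approximation of the integral of g(x) * delta(x - x0)."""
    dx = 2 * span / n
    total = 0.0
    for k in range(n):
        u = -span + (k + 0.5) * dx        # offset from x0
        total += g(x0 + u) * delta_approx(u, eps) * dx
    return total

# The integral 'sifts out' the value of g at x0:
print(sift(math.cos, 0.5))                # close to cos(0.5) ~ 0.8776
```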
A very common pdf model for X is the exponential pdf: f_X(x) = λe^(−λx), x > 0. The corresponding cdf is:

F_X(x) = Pr[X ≤ x] = ∫₀ˣ f_X(u) du = ∫₀ˣ λe^(−λu) du = 1 − e^(−λx).

This is a 1-parameter pdf model, since it is parameterized by the single parameter λ. Notice that since x has units of time, and since the cdf is dimensionless, the parameter λ must have units of 1/time. Equivalently, 1/λ must have units of time. Suppose that 1/λ = 0.63 years. The pdf and cdf for X are plotted below.

Figure 3. Plots of the pdf, f_X(x) (blue), and the cdf, F_X(x) (black).

Even though the pdf is more commonly used to describe X, when it comes to computing probability it is often the cdf that can make life a little easier. For example, suppose that we want to compute the probability of the event [1.0, 2.0], which is also denoted as [1.0 ≤ X ≤ 2.0]. This event is the shaded region of the x-axis shown in Figure 3. To compute the probability of this event using the pdf, we need to integrate it over this region:

Pr[1.0 ≤ X ≤ 2.0] = ∫₁² f_X(x) dx = ∫₁² λe^(−λx) dx = e^(−1/0.63) − e^(−2/0.63) ≈ 0.163.

However, since F_X(x) gives all the area to the left of any chosen number, x, we can more easily compute the probability of interest as:

Pr[1.0 ≤ X ≤ 2.0] = F_X(2) − F_X(1) = e^(−1/0.63) − e^(−2/0.63) ≈ 0.163. □

2. Expectations

Again, let X be a random variable, with sample space S_X, and with pdf f_X(x). If one knows f_X(x), then one knows everything about the probabilistic structure of X. However, this knowledge is not in relation to a scalar parameter, but to a collection of scalars {f_X(x) : x ∈ S_X}. In the case where S_X includes only two scalars (e.g. 0 and 1), it follows that the collection of pdf scalars includes only two values. However, suppose that (as is often the case) X is a continuous random variable. Then there are an infinite number of scalars in the collection {f_X(x) : x ∈ S_X}.
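As a quick check of the computation in Example 2 above, both routes to Pr[1.0 ≤ X ≤ 2.0], integrating the pdf versus differencing the cdf, can be reproduced in a short script. A sketch, assuming only 1/λ = 0.63 from the example:

```python
import math

lam = 1 / 0.63                    # rate parameter; 1/lambda = 0.63 years

def pdf(x):
    return lam * math.exp(-lam * x)

def cdf(x):
    return 1.0 - math.exp(-lam * x)

# Route 1: midpoint-rule integration of the pdf over [1, 2]
n = 100_000
dx = 1.0 / n
p_pdf = sum(pdf(1.0 + (k + 0.5) * dx) for k in range(n)) * dx

# Route 2: difference of the cdf -- no integration needed
p_cdf = cdf(2.0) - cdf(1.0)

print(round(p_pdf, 3), round(p_cdf, 3))   # both round to 0.163
```

The cdf route is one subtraction; the pdf route needs a numerical (or symbolic) integration, which is the point made in the text.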
This can be problematic when using a finite number, say n, of measurements of X in order to estimate f_X(x) for every x ∈ S_X. As a general rule of thumb, one needs many more measurements than parameters being estimated. Suppose, for example, that one has 100 measurements with which to construct a histogram-based estimate of f_X(x). If one desires a histogram with high resolution (say, 10 bins), then one can expect that the 10 bin heights will not be very trustworthy. If one desires trustworthy bin heights, then it will be necessary to use fewer bins. Consequently, there is a trade-off: more bins give higher resolution but higher bin-height uncertainty; fewer bins reduce the bin-height uncertainty, but give lower resolution. For this reason, one will often downplay, if not forego, an investigation of f_X(x), and resort to estimating parameters such as the mean and variance of X. These parameters are defined as special cases of the following definition.

Definition 9. Let X be a random variable, with sample space S_X and pdf f_X(x), and let g(X) be any chosen function of X. Then

E[g(X)] = ∫_{S_X} g(x) f_X(x) dx.   (1)

Remark. In any textbook on probability it would be heresy to call (1) a definition. It is, in fact, a theorem. However, since this course is not a course in probability, we will take (1) to be a definition.

Definition 10. The kth moment of a random variable X is E(X^k). In particular, the first moment of X is called the mean (or expected value) of X. It will be denoted as E(X) = μ_X. The kth central moment of X is E[(X − μ_X)^k]. In particular, the second central moment of X is called the variance of X. It will be denoted as σ_X².

Example 1 (continued). We will now use (1) to compute the mean of X. In this case, we have g(X) = X:

μ_X = E(X) = ∫_{S_X} x f_X(x) dx = 0·(1 − p) + 1·p = p.

To compute the variance of X, we use (1) with g(X) = (X − μ_X)².
In this case, we have

σ_X² = E[(X − μ_X)²] = ∫_{S_X} (x − μ_X)² f_X(x) dx = (0 − p)²·(1 − p) + (1 − p)²·p = p(1 − p).

Before we proceed, it is both instructive and expedient to present the following theorem.

THEOREM 1. E(aX + b) = aE(X) + b.

Proof: Let g(X) = aX + b. Then (1) gives

E[g(X)] = E(aX + b) = ∫_{S_X} (ax + b) f_X(x) dx = a ∫_{S_X} x f_X(x) dx + b ∫_{S_X} f_X(x) dx = aE(X) + b. □

We will now use this theorem to prove the next theorem.

THEOREM 2. σ_X² = E(X²) − μ_X².

Proof: σ_X² = E[(X − μ_X)²] = E(X² − 2μ_X X + μ_X²) = E(X²) − 2μ_X E(X) + μ_X² = E(X²) − 2μ_X² + μ_X² = E(X²) − μ_X². □

The method of computing the variance of a random variable using THEOREM 2 can be computationally advantageous.

Example 2 (continued). From a table of integrals, http://en.wikipedia.org/wiki/List_of_integrals_of_exponential_functions, we have:

∫ x e^(−λx) dx = −(λx + 1) e^(−λx)/λ²  and  ∫ x² e^(−λx) dx = −(λ²x² + 2λx + 2) e^(−λx)/λ³.

The mean of X is:

μ_X = E(X) = ∫₀^∞ x λe^(−λx) dx = λ·[−(λx + 1) e^(−λx)/λ²]₀^∞ = λ·(1/λ²) = 1/λ.

The second moment of X is:

E(X²) = ∫₀^∞ x² λe^(−λx) dx = λ·[−(λ²x² + 2λx + 2) e^(−λx)/λ³]₀^∞ = λ·(2/λ³) = 2/λ².

Hence, from THEOREM 2, the variance of X is σ_X² = E(X²) − μ_X² = 2/λ² − (1/λ)² = 1/λ². □

Before proceeding to extend the results of this section and the last to the case of two random variables, we offer the following extension of Example 2.

Example 2 (continued). Suppose that it has been decided that an electrical component should not remain in the field for longer than 1 year. Then the resulting random variable, call it Y, relates to X as follows. For 0 < x < 1, the cumulative events [X ≤ x] and [Y ≤ x] are one and the same event. Also, the events [X ≥ 1] and [Y = 1] are one and the same event. It follows that for 0 < x < 1, Pr[X ≤ x] = Pr[Y ≤ x]; that is, F_X(x) = F_Y(x). Hence, for 0 < x < 1, f_X(x) = f_Y(x), and

Pr[X ≥ 1] = 1 − F_X(1) = e^(−1/0.63) ≈ 0.20 = Pr[Y = 1].

In relation to Figure 3, the pdf for Y is exactly the pdf for X on the interval (0,1). The area beneath f_X(x) over the interval [1, ∞) is mapped to a 'lump' of probability at the location x = 1.
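The closed-form moments just obtained for X, and THEOREM 2 itself, can be sanity-checked by numerical integration. A sketch, again assuming 1/λ = 0.63 from Example 2:

```python
import math

lam = 1 / 0.63

def moment(k, upper=30.0, n=100_000):
    """Midpoint-rule approximation of E(X^k) for the exponential pdf.

    The integral over (0, inf) is truncated at `upper`; the neglected
    tail is of order e^(-lam * upper), which is negligible here.
    """
    dx = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        total += x ** k * lam * math.exp(-lam * x) * dx
    return total

m1 = moment(1)          # approaches 1/lam = 0.63
m2 = moment(2)          # approaches 2/lam^2 ~ 0.794
var = m2 - m1 ** 2      # THEOREM 2: approaches 1/lam^2 ~ 0.397
```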
Figure 4. Plots of f_X(x) (blue) and f_Y(x) (red).

We are now in a position to compute the mean and variance of Y. The component of the mean associated with the continuous part of the pdf is

∫₀¹ x f_X(x) dx = (1/λ)[1 − (λ + 1)e^(−λ)] ≈ 0.30.

The component associated with the lump of probability, 0.20, at the location x = 1 is 0.20·(1) = 0.20. Hence,

μ_Y ≈ 0.30 + 0.20 = 0.50 (versus μ_X = 0.63).

Computation of the variance of Y proceeds in the same manner. The component of the second moment associated with the continuous portion of the pdf is:

∫₀¹ x² f_X(x) dx = 2/λ² − (1 + 2/λ + 2/λ²)e^(−λ) ≈ 0.17.

The component associated with the lump of probability at x = 1 is 0.20·(1²) = 0.20. Hence, the second moment of Y is E(Y²) ≈ 0.17 + 0.20 = 0.37 [versus E(X²) ≈ 0.79]. It follows from THEOREM 2 that the variance of Y is

σ_Y² = E(Y²) − μ_Y² ≈ 0.37 − 0.50² = 0.12 (versus σ_X² ≈ 0.40).

As expected, truncating the lifetime at 1 year reduces both the mean and the variance. □

End of Lecture 3
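Postscript: the mixed (continuous part plus lump) moments of Y computed in the final example can be verified numerically, by integrating over (0, 1) and adding the lump's contribution at x = 1. A sketch, once more assuming 1/λ = 0.63:

```python
import math

lam = 1 / 0.63
lump = math.exp(-lam)        # Pr[Y = 1] = Pr[X >= 1] ~ 0.20

def partial_moment(k, n=100_000):
    """Midpoint-rule integral of x^k * lam * e^(-lam x) over (0, 1)."""
    dx = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        total += x ** k * lam * math.exp(-lam * x) * dx
    return total

mean_y = partial_moment(1) + 1.0 * lump        # continuous part + lump at x = 1
second_y = partial_moment(2) + 1.0 ** 2 * lump
var_y = second_y - mean_y ** 2                 # THEOREM 2 once more

print(round(mean_y, 2), round(var_y, 2))       # ~ 0.50 and ~ 0.12
```

Note that var_y comes out well below σ_X² ≈ 0.40, consistent with the fact that a random variable confined to (0, 1] cannot have variance greater than 1/4.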