Contents

6 Discrete Random Variables
  6.1 Random Variables
  6.2 The PMF and CDF
    6.2.1 The Probability Mass Function of a Discrete Random Variable
    6.2.2 The Cumulative Distribution Function
  6.3 Examples of Discrete Random Variables
    6.3.1 The Bernoulli Distribution
    Categorical and Discrete Uniform Random Variables
  6.4 The Algebra of Random Variables
    6.4.1 Joint Distributions
  6.5 Expectation
    6.5.1 Expected Value
    6.5.2 Expectation of Functions of a Random Variable
    6.5.3 Properties of Expectation
    6.5.4 Variance
    6.5.5 Properties of the Variance
  6.6 Exercises

Chapter 6: Discrete Random Variables

For the rest of the chapter, unless otherwise noted, we will let U denote a universe with finitely or countably infinitely many elements, and P denote a probability measure defined on all events in U. The domain of P is the power set of U, that is, the set of all possible subsets of U (including the empty set as well as U itself). The power set of U is denoted P(U). Together, the triple (U, P(U), P) is called a probability space.

6.1 Random Variables

Most universes in real problems are large.
• U = {all lists of 16 birthdays}, with #(U) = 365^16
• U = {all subsets of 10 items drawn from a lot of 30}, with #(U) = C(30, 10)

If the universe itself is large, the power set of the universe is extraordinarily huge: we have #(P(U)) = 2^#(U) (to count the number of subsets, consider choosing, for every single element, whether it gets included in or excluded from the subset). Typically we aren't interested in every possible event, but only those that pick out outcomes with some common "feature":

• E = lists with a repeated birthday
• E = subsets with no defective items

In principle, we could calculate the probability of these events by breaking them down into smaller disjoint sets whose probabilities we can compute. But it is simpler to introduce random variables, which let us work with these features directly.

Definition 6.1 (Random Variable). A random variable, X, is a function whose domain is a universe, U, and whose range is a set R which is a subset of the real numbers ℝ. That is, for every a ∈ U, X(a) ∈ R ⊂ ℝ.

Example. If U is the set of all subsets of 3 items drawn from a lot of 30, we can define a random variable X which returns the number of defective items. The range of X would then be the set {0, 1, 2, 3}.

Definition 6.2 (Discrete Random Variable). A random variable is called discrete if its range is either a finite set or a countably infinite set (e.g. ℕ, ℤ, etc.).

The random variable X defined above (counting defective items) is discrete because its range is a finite set. In contrast, suppose U is the set of all possible Tucson days. We can define a random variable Y which, for each day, gives its temperature. But since temperatures are real numbers, the range of Y is uncountably infinite, and so Y does not qualify as a discrete random variable.

We are often interested in events of this form: E = {a ∈ U : X(a) ∈ A}. In words, these are events that occur whenever X takes on one of a particular set of values.
• As a shorthand, we can write E = {X ∈ A}.
• If A is a singleton set (i.e., A = {x} for some x ∈ ℝ), then we can write E as {X = x}.
• If A is an interval, for example A = [a, b], we can write E = {a ≤ X ≤ b}. If the interval is unbounded on one end, such as A = (−∞, b], then we have E = {X ≤ b}.

The probability of such events follows from the probability measure we have defined on events in U:

• P(X ∈ A) = P({a ∈ U : X(a) ∈ A})
• P(X = x) = P({a ∈ U : X(a) = x})
• P(X ≤ x) = P({a ∈ U : X(a) ≤ x})

As such, everything we have established so far about the probability of events, their complements, unions, etc., still applies. All we are doing is providing a bit more structure to the universe (which, after all, is what all mathematics is about, isn't it?).

Example 1. Two fair dice are thrown. What is the probability that the sum of the numbers appearing on the two upturned faces equals 7?

Solution. Note that the sample space consists of all possible ordered pairs of the numbers 1 through 6:

U = {(1, 1), (1, 2), ..., (6, 5), (6, 6)}
#(U) = 6 × 6 = 36 (by the generalized counting theorem)

We can define a random variable X whose value is the sum of the scores of the two dice. In terms of X, the event of interest is

E = {X = 7} = {a ∈ U : X(a) = 7} = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}  ⟹  #(E) = 6

It seems reasonable to assume that the ordered pairs are exchangeable in this case, as we would not expect the probabilities to change if we relabeled the dice, so we can compute

P(E) = #(E)/#(U) = 6/36 = 1/6

If we repeat this process for other possible values of X, we will find that P(X = x) = 0 unless x ∈ {2, 3, 4, ..., 11, 12}, and that for these values, we can express the probability that X = x with the following algebraic expression:

P(X = x) = (6 − |x − 7|)/36

Note that by restricting our attention to the sum of the dice, we have reduced the number of fundamental events from 36 to 11.
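The closed-form expression above can be checked by brute-force enumeration of the universe. A minimal sketch in Python (the function name `pmf_sum` is illustrative):

```python
from itertools import product

# Enumerate the 36 equally likely ordered pairs for two fair dice.
universe = list(product(range(1, 7), repeat=2))

def pmf_sum(x):
    """P(X = x) by counting the outcomes whose faces sum to x."""
    event = [a for a in universe if a[0] + a[1] == x]
    return len(event) / len(universe)

# Compare against the closed form P(X = x) = (6 - |x - 7|) / 36.
for x in range(2, 13):
    assert abs(pmf_sum(x) - (6 - abs(x - 7)) / 36) < 1e-12
```

The loop confirms that the counting argument and the algebraic expression agree for every value in the range of X.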
6.2 The PMF and CDF

Notice that a random variable X defines a partition of the universe, whose "compartments" are the events {X = x} for each x in the range of X. For our dice example, the events {X = 2} through {X = 12} partition all the possible outcomes of the two dice.

6.2.1 The Probability Mass Function of a Discrete Random Variable

Definition 6.3 (Probability Mass Function). Let X be a discrete random variable with range R. We define the probability mass function, f, associated with X to be a function that takes in a valid value for X and returns the probability that X takes that value. Formally, the domain of f is the set R (the range of X), the range of f is the interval [0, 1], and f(x) is defined for each x ∈ R to be

f(x) = P(X = x)

Theorem 6.1. Let U be a universe, P a probability measure defined on events in U, let X be a random variable on U, and let f be the PMF of X as determined by the probability measure P. Then, treating R itself as a universe, there is a unique probability measure, P′, defined on all subsets of R, such that P′({x}) = f(x) for each x ∈ R. For any set S ⊂ R, we have

P′(S) = Σ_{x∈S} f(x)

In other words, if we know f(x) for every x ∈ R, then there is exactly one way to extend these values to a probability measure which is defined on all subsets of R. The way to do this is to take any subset S of R and set its probability to be the sum of the atomic probabilities of the elements of S. One reason this matters is that it says that if we know the PMF of a discrete random variable, X, then we know everything that X is capable of telling us.

Proof. There are two parts to this claim: an existence claim and a uniqueness claim. We need to check each part: first, that the probability measure defined as stated obeys the Kolmogorov axioms, and second, that no other probability measure on P(R) agrees with P′ on all elements of R.

Existence.
The first Kolmogorov axiom says that all probabilities are nonnegative. Since f(x) is a probability, we know that P′({x}) is nonnegative for every x ∈ R. Since the value of P′ on any other set is obtained by summing these nonnegative numbers, P′ is always nonnegative.

Next, we need to check that P′(R) = 1. By definition, we have

P′(R) = Σ_{x∈R} f(x) = Σ_{x∈R} P(X = x)

Since the sets {X = x} form a partition of the universe, property (P7) of probability measures implies that Σ_{x∈R} P(X = x) = 1.

Finally, we need to check disjoint additivity. Suppose S and T are disjoint subsets of R. Then

P′(S ∪ T) = Σ_{x∈S∪T} f(x) = Σ_{x∈S} f(x) + Σ_{x∈T} f(x) = P′(S) + P′(T)

where the second equality follows from the fact that S and T are disjoint, and therefore do not share any elements whose probabilities might be counted twice if we split up the sum. This completes the part of the proof that says that P′ as defined is a probability measure, since we have checked all of the Kolmogorov axioms.

Uniqueness. It remains only to check that no other probability measure exists (say, P̃) that satisfies P̃({x}) = f(x). The key here is disjoint additivity. Suppose we define P̃({x}) = f(x) on every x ∈ R. Then for any set S = {x_1, ..., x_M} which is a subset of R, we can write

S = {x_1} ∪ {x_2} ∪ · · · ∪ {x_M}

which is a disjoint union. Hence, by the inductive extension of the disjoint additivity axiom, we must have

P̃(S) = Σ_{i=1}^{M} P̃({x_i}) = Σ_{i=1}^{M} f(x_i) = P′(S)

and so P̃ is indistinguishable from P′.

Note the following important fact about probability mass functions, which has come out in the course of the above proof:

Σ_{x∈R} f(x) = 1    (6.1)

In the case where the range of the random variable is a finite set, we can visualize the probability mass function f with a probability histogram (or "spike plot"), with a bar centered at each x_j whose height is f(x_j).
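The construction in Theorem 6.1 is easy to carry out concretely. A minimal sketch using the dice-sum PMF from Example 1 (the names `f` and `P_prime` are illustrative):

```python
# PMF of the dice-sum variable X: f(x) = (6 - |x - 7|) / 36.
f = {x: (6 - abs(x - 7)) / 36 for x in range(2, 13)}

def P_prime(S):
    """The unique extension from Theorem 6.1: P'(S) = sum of f(x) over x in S."""
    return sum(f[x] for x in S)

# Kolmogorov checks: total mass is 1 (eq. 6.1), and disjoint additivity holds.
assert abs(P_prime(f.keys()) - 1) < 1e-12
S, T = {2, 3, 4}, {10, 11, 12}          # disjoint subsets of the range
assert abs(P_prime(S | T) - (P_prime(S) + P_prime(T))) < 1e-12
```

Once the PMF is known, every event probability is just a sum over atoms, which is exactly what the theorem asserts.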
If we again consider the random variable whose value is equal to the sum of the faces of two six-sided dice, then the probability histogram looks like Fig. 6.1.

Figure 6.1: A probability histogram for the random variable representing the sum of two fair die rolls

6.2.2 The Cumulative Distribution Function

We are often interested in events of the form {X ≤ x} for some x ∈ ℝ. For example, we might ask "What's the probability that there is at most one defective item in our draw?" We can record the probabilities of all events of this form using the cumulative distribution function.

Definition 6.4 (Cumulative Distribution Function). Let X be a discrete random variable with range R. The cumulative distribution function for X is a function, F, from ℝ to [0, 1], defined for each x ∈ ℝ as

F(x) = P(X ≤ x)

Note that if R has a smallest element, x_1, so that we can list the elements of R in increasing order as x_1, x_2, ..., then

F(x_r) = Σ_{j=1}^{r} f(x_j)

where f is the probability mass function associated with X. This follows from the fact that the event {X ≤ x_r} is a disjoint union of the events {X = x_j} for all j ranging from 1 to r, and so the probability of the union is the sum of the probabilities of these events, which are the f(x_j)s.

Theorem 6.2 (Basic Properties of the CDF). Let F be the CDF of a random variable X, with range R indexed by integers so that x_i < x_j whenever i < j (technically this rules out certain discrete ranges, like the rational numbers, which cannot be indexed in this way, but it covers all the cases that we will be interested in, which are usually subsets of the integers). Then

(i) F(x_r) − F(x_{r−1}) = f(x_r) for all r ≥ 2
(ii) F is a nondecreasing function. That is, whenever x_t > x_s, F(x_t) ≥ F(x_s).
(iii) If R has a largest element, x_n, then F(x_n) = 1. (Otherwise, lim_{r→∞} F(x_r) = 1.)

Proof.
(i) We can write

F(x_r) = P({X ≤ x_r})
       = P({X ≤ x_{r−1}} ∪ {X = x_r})
       = P({X ≤ x_{r−1}}) + P({X = x_r})    (disjoint additivity)
       = F(x_{r−1}) + f(x_r)                (defs. of F and f)

Subtracting F(x_{r−1}) from the first and last expressions gives us (i).

(ii) We have, using (i) and the fact that f(x_r) ≥ 0 for all r:

F(x_r) = F(x_{r−1}) + f(x_r) ≥ F(x_{r−1})

(iii)

F(x_n) = Σ_{r=1}^{n} f(x_r)    (by (6.2.2))
       = 1                     (by (6.1))

where the last line follows from the fact that we assume x_n to be the largest element of R, and so the sum is over every element of R.

The CDF tells us directly how to compute probabilities of events of the form {X ≤ x} for some x ∈ ℝ. With barely any additional calculation, we can also use the CDF to compute probabilities for events of the form {X > x} and {x_r ≤ X ≤ x_s}, as the next theorem illustrates.

Theorem 6.3 (More Properties of the CDF). Let X and F be defined as in the previous two theorems. Then

(i) P(X > x_r) = 1 − F(x_r)
(ii) For r ≤ s, we have P(x_r ≤ X ≤ x_s) = F(x_s) − F(x_{r−1})

Proof.

(i) We can write

P(X > x_r) = P({X ≤ x_r}ᶜ)
           = 1 − P(X ≤ x_r)    (by the complement rule)
           = 1 − F(x_r)        (by the def. of F)

(ii) We have

P(x_r ≤ X ≤ x_s) = P({X ≤ x_s} − {X < x_r})
                 = P({X ≤ x_s}) − P({X < x_r})      ((P5), since {X < x_r} ⊂ {X ≤ x_s})
                 = P({X ≤ x_s}) − P({X ≤ x_{r−1}})  ({X < x_r} = {X ≤ x_{r−1}})
                 = F(x_s) − F(x_{r−1})

6.3 Examples of Discrete Random Variables

Every situation has a different sample space. Random variables, however, take these sample spaces and create a partition, which, as we've seen, induces a simplified domain of events and a simplified probability measure. As a result, problems that are very different at the level of the sample space can give rise to random variables with very similar structures. This similarity allows us to generalize several probabilistic properties from one situation to another that is seemingly very different.
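Before turning to the examples, the CDF identities in Theorems 6.2 and 6.3 can be verified numerically for the dice-sum variable. A minimal sketch (all names are illustrative):

```python
# PMF of the dice-sum variable, listed in increasing order of x.
f = {x: (6 - abs(x - 7)) / 36 for x in range(2, 13)}
xs = sorted(f)

# Build the CDF by accumulating the PMF: F(x_r) = f(x_1) + ... + f(x_r).
F, running = {}, 0.0
for x in xs:
    running += f[x]
    F[x] = running

# Theorem 6.2(i): F(x_r) - F(x_{r-1}) = f(x_r).
assert abs((F[5] - F[4]) - f[5]) < 1e-12
# Theorem 6.2(iii): F at the largest element of the range is 1.
assert abs(F[12] - 1) < 1e-12
# Theorem 6.3(ii): P(5 <= X <= 9) = F(9) - F(4).
assert abs((F[9] - F[4]) - sum(f[x] for x in range(5, 10))) < 1e-12
```

Accumulating the PMF this way is exactly the formula F(x_r) = Σ_{j=1}^{r} f(x_j) from the previous section.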
6.3.1 The Bernoulli Distribution

The simplest possible random variable takes the universe and divides it into two pieces. On one piece, the random variable takes the value 1. On the other, it takes the value 0. Any proposition which, for each outcome in U, is either true or false immediately defines such a random variable: it can be set to 1 for all outcomes that make the proposition true, and 0 elsewhere. A random variable with this structure is called a Bernoulli random variable.

• Question: What's the range of a Bernoulli random variable?

Since we must have Σ_{x=0}^{1} f(x) = 1, the probability mass function of a Bernoulli random variable is determined entirely by setting one of its values, since the other value is obtained from the complement rule. We define p, 0 ≤ p ≤ 1, to be equal to f(1), which implies that f(0) = 1 − p.

This value p is called a parameter of the random variable. If X is a Bernoulli random variable with parameter p, you may see the notation

X ∼ Bern(p)

where the ∼ symbol is read "is distributed as". We can write the probability mass function concisely as

f(x) = p^x (1 − p)^(1−x)

The use of exponents here is a bit "clever": since x is always 0 or 1, one of the exponents is always 1 and the other is always 0. Whichever term has an exponent of zero disappears, since its value is just 1.

Categorical and Discrete Uniform Random Variables

We can generalize the notion of a Bernoulli random variable to one that takes more than two (but still a finite number of) values. A random variable whose range is a finite set, R = {x_1, x_2, ..., x_n}, and which has no restrictions on its PMF, is called a categorical random variable. In the most general case, we can't simplify the probability mass function beyond just stating all its values. We can denote p_1 = f(x_1), p_2 = f(x_2), ..., p_n = f(x_n) and say

X ∼ Cat(p_1, p_2, ..., p_n)

(Of course, we do not have complete freedom in choosing the p_r s, as they must sum to 1.)

If we're in a situation where the values of the random variable are exchangeable (that is, the numbers assigned are basically arbitrary labels), then we can set

p_1 = p_2 = · · · = p_n

which, since f(·) sums to 1, implies, as we've seen before, that

f(x_r) = 1/n    for all r ∈ {1, 2, ..., n}

Then, provided x_1 < x_2 < · · · < x_n, by (6.2.2), we get

F(x_r) = r/n

In this case, the random variable has a discrete uniform distribution, and we might write

X ∼ DU(n)

6.4 The Algebra of Random Variables

Remember that random variables are functions from U to ℝ. As such, we can manipulate random variables in the same way that we manipulate arbitrary real-valued functions. If X and Y are random variables, α is a real number, and g is a function from ℝ to ℝ (for example g(x) = x² or g(x) = sin(x)), then the following are all random variables:

(i) X + Y
(ii) αX
(iii) X + α
(iv) XY
(v) g(X)

They are defined in the natural way. For any a ∈ U:

(i) (X + Y)(a) = X(a) + Y(a)
(ii) (αX)(a) = αX(a)
(iii) (X + α)(a) = X(a) + α
(iv) (XY)(a) = X(a)Y(a)
(v) (g(X))(a) = g(X(a))

For example:

(i) If X is the number of boys born in Tucson in a given year, and Y is the number of girls, then X + Y is the total number of babies.
(ii) If X is the number of items of a certain kind that a store sells in a given day, and β is the profit margin per unit, then βX is a random variable representing the gross profit in a given day.
(iii) If α is the fixed operating cost of running the store for a day, then βX − α is the daily profit.
(iv) A machine produces window panes to target dimensions, but with some error. If X is the width of a given window, and Y is its height, then XY describes the area of the window.
(v) If X is the amplitude of a sound wave, then (roughly) log(X) is the decibel level.
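Because random variables are literally functions on the universe, these pointwise definitions can be written out directly. A short sketch using two dice (all names are illustrative):

```python
from itertools import product

# Universe: ordered pairs for two fair dice. Random variables are just
# functions on U, so sums and transforms compose pointwise.
U = list(product(range(1, 7), repeat=2))

X = lambda a: a[0]            # score of the first die
Y = lambda a: a[1]            # score of the second die
S = lambda a: X(a) + Y(a)     # (X + Y)(a) = X(a) + Y(a)
g = lambda x: x ** 2
gX = lambda a: g(X(a))        # (g(X))(a) = g(X(a))

a = (3, 5)                    # one particular outcome in U
assert S(a) == 8
assert gX(a) == 9
```

Nothing probabilistic happens at this stage: the algebra of random variables is ordinary function algebra, and the probability measure on U is only consulted when we ask about events like {S = 7}.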
6.4.1 Joint Distributions

If X and Y are two random variables on the same probability space (U, P(U), P), then any given sample element will have an X value and a Y value. For example, we might take U to be the set of students in this class, and let X be major (for simplicity, say ISTA vs. not ISTA) and Y be handedness. Technically, we should assign numbers to each category, say 0 and 1. Using R_X to denote the range of X and R_Y to denote the range of Y, we have

R_X = {0, 1} (= {Non-ISTA, ISTA})
R_Y = {0, 1} (= {Left, Right})

Then we can talk about the probability of a particular combination of values (say, a left-handed ISTA major).

Definition 6.5 (Joint Distribution). Let X and Y be discrete random variables with ranges R_X = {x_1, x_2, ...} and R_Y = {y_1, y_2, ...}, and probability mass functions f_X and f_Y, respectively. We define the joint distribution (or joint mass function) to be a function f_{X,Y} from R_X × R_Y (the set of all ordered pairs (x, y), where x ∈ R_X and y ∈ R_Y) to [0, 1], which sets

f_{X,Y}(x, y) = P({X = x} ∩ {Y = y})

In a sense, we can think of a "two-dimensional" random variable, or random vector, (X, Y), whose range is R_X × R_Y (i.e., all ordered pairs whose first element is in R_X and whose second element is in R_Y), and whose probability mass function is f_{X,Y}.

In our example, we can think of the joint mass function as applying to the cells of a 2 × 2 table (see Table 6.1).

                          y (Hand)
x (Major)          0 (Left)         1 (Right)
0 (non-ISTA)    f_{X,Y}(0, 0)    f_{X,Y}(0, 1)
1 (ISTA)        f_{X,Y}(1, 0)    f_{X,Y}(1, 1)

Table 6.1: A Simple Joint Mass Function

Then, for example, the probability of selecting a left-handed ISTA major is given by f_{X,Y}(1, 0).

It should be easy to see that the joint distribution function has the following properties:

Lemma 6.4 (Marginalization).

(i) For each y ∈ R_Y, Σ_{x∈R_X} f_{X,Y}(x, y) = f_Y(y)
(ii) For each x ∈ R_X, Σ_{y∈R_Y} f_{X,Y}(x, y) = f_X(x)
(iii) Σ_{x∈R_X} Σ_{y∈R_Y} f_{X,Y}(x, y) = 1

Proof.
Every element of the sample space gets mapped to some cell of the distribution table. The table therefore gives us three different ways of constructing a partition: one by taking each row as a block, one by taking each column as a block, and a third by taking each cell as a block. Part (i) says that when we sum over the cells in a particular column, we get the total ("marginal") probability assigned to that column. Part (ii) says that when we sum over the cells in a particular row, we get the total ("marginal") probability assigned to that row. Part (iii) says that when we sum over all the cells, we get the probability of the entire sample space, which is 1. More formally:

(i) For any y ∈ R_Y, we have

Σ_{x∈R_X} f_{X,Y}(x, y) = Σ_{x∈R_X} P({X = x} ∩ {Y = y})
  = P(∪_{x∈R_X} ({X = x} ∩ {Y = y}))    (disjoint additivity)
  = P((∪_{x∈R_X} {X = x}) ∩ {Y = y})    (distributive law)
  = P(U ∩ {Y = y})                      ({X = x}s form a partition)
  = P(Y = y) = f_Y(y)                   (def. of PMF)

(ii) Exactly the same as (i), with the roles reversed.

(iii) We can either use (6.1), applied to (i) or (ii), or we can note that the sets {X = x_r} ∩ {Y = y_s}, taking every combination of r and s, form a partition of U, and hence (P7) applies.

6.5 Expectation

So far, we have characterized random variables mainly in terms of their PMF and CDF. Both of these give us a list of probabilities that we can use to calculate the probability of any event we want. Often we want more concise information, telling us

• What's a central, "typical" value of the feature represented by the random variable?
• How much variability is there in this feature?

Both of these questions (along with lots of others) can be addressed by taking expectations of functions of the random variable.

Analogy. Suppose I own an electronics store. On 20% of the days my shop is open, I don't sell any laptops. 40% of the time I sell one, 30% of the time I sell two, and 10% of the time I sell three. How many laptops do I expect to sell tomorrow?
Our expectation is a weighted average of all the possible things that can happen, where I weigh more likely outcomes more heavily.

6.5.1 Expected Value

Definition 6.6 (Expected Value/Mean of a Random Variable). Let X be a random variable with range R = {x_1, x_2, ...} and probability mass function f. We define the expectation (or mean) of X, written E[X], to be

E[X] = Σ_{x∈R} x · f(x)

We will sometimes use the Greek letter µ (pronounced "mew", like a cat) to stand for E[X]. This is the Greek version of "m", for "mean".

In the laptop example, we have R = {0, 1, 2, 3}, so I have:

µ = E[X] = Σ_{x=0}^{3} x · f(x)
  = 0 · 0.2 + 1 · 0.4 + 2 · 0.3 + 3 · 0.1
  = 0 + 0.4 + 0.6 + 0.3
  = 1.3

On average, I sell 1.3 laptops a day. Notice that the average is not a possible value of X. This is often the case for discrete random variables.

One reason to define expected value this way is that, under a relative frequency interpretation of probability, it represents the "long run average" if we repeatedly observe the random variable. Consider repeating a random experiment N times, and observing the value of the random variable X each time. Letting x^(i) denote the value of X on the ith run of the experiment, the sample mean of the runs will be

(1/N) Σ_{i=1}^{N} x^(i)

We can simplify this sum by grouping together identical terms, to get

(1/N) Σ_{x∈R} Count(X = x) · x

In a sample of size N, we expect x_1 to occur about N · f(x_1) times (on average), x_2 to occur N · f(x_2) times, and so on, so that if we compute the sample mean, it will tend toward

(1/N) Σ_{x∈R} N · f(x) · x = Σ_{x∈R} f(x) · x

which is exactly how we've defined the expected value.

6.5.2 Expectation of Functions of a Random Variable

Suppose that, in the laptop example above, I charge $1000 for each laptop I sell. How much do I expect to bring in tomorrow?
The logic is the same as when I determine the expected number of laptops sold; I just have to apply the income function first. Let h(x) = 1000x represent my gross income in a day. Then I can define

E[h(X)] = Σ_{x=0}^{3} h(x)f(x)
  = 1000 · 0 · 0.2 + 1000 · 1 · 0.4 + 1000 · 2 · 0.3 + 1000 · 3 · 0.1
  = 0 + 400 + 600 + 300
  = 1300

So I expect to bring in $1300, on average, in a given day.

Suppose I buy my laptops for $800 each. If I want to compute my profit, as opposed to my raw income, I can set g(x) = 1000x − 800x = 200x and compute my expected profit the same way. Or I can take into account my fixed cost of operating the store, say $100 a day. Then I have g(x) = 200x − 100, and I get

E[g(X)] = Σ_{x=0}^{3} (200x − 100)f(x)
  = Σ_{x=0}^{3} 200x f(x) − Σ_{x=0}^{3} 100 f(x)
  = 200 Σ_{x=0}^{3} x f(x) − 100 Σ_{x=0}^{3} f(x)
  = 200 · E[X] − 100 · 1
  = 260 − 100 = 160

So I turn a profit of $160 a day, on average. Notice the general pattern:

Definition 6.7 (Expectation of a Function of a Random Variable). Let X be a random variable with range R, and let h be any function which is defined on R and which returns real numbers. Remember, we showed that h(X) is a random variable in its own right. Then we have

E[h(X)] = Σ_{x∈R} h(x)f(x)

6.5.3 Properties of Expectation

Recall from our discussion of the algebra of random variables that if X and Y are both random variables, then X + Y is also a random variable. Therefore, the expectation of X + Y is a well-defined quantity. We could compute it by looking at the range of X + Y and computing a PMF for this random variable, which we would use to find the expectation. However, it's often easier to stick with pairs (x, y), and compute the weighted average of the numbers x + y, where the weights come from the joint distribution of X and Y. This will give us the same result, just with potentially more terms in the sum.

Definition 6.8.
Let X have range R_X and PMF f_X, let Y have range R_Y and PMF f_Y, and let f_{X,Y} be the joint PMF of X and Y. Then

E[X + Y] = Σ_{(x,y)∈R_X×R_Y} (x + y) f_{X,Y}(x, y)

It turns out that we never have to use this definition directly, due to the following theorem.

Theorem 6.5 (a). For any two random variables, X and Y, on a common probability space,

E[X + Y] = E[X] + E[Y]

Proof. This is a matter of some algebra, and using the Marginalization Lemma.

E[X + Y] = Σ_{(x,y)∈R_X×R_Y} (x + y) f_{X,Y}(x, y)    (Def. 6.8)
  = Σ_{x∈R_X} Σ_{y∈R_Y} (x + y) f_{X,Y}(x, y)
  = Σ_{x∈R_X} Σ_{y∈R_Y} (x f_{X,Y}(x, y) + y f_{X,Y}(x, y))
  = Σ_{x∈R_X} Σ_{y∈R_Y} x f_{X,Y}(x, y) + Σ_{x∈R_X} Σ_{y∈R_Y} y f_{X,Y}(x, y)
  = Σ_{x∈R_X} x Σ_{y∈R_Y} f_{X,Y}(x, y) + Σ_{y∈R_Y} y Σ_{x∈R_X} f_{X,Y}(x, y)
  = Σ_{x∈R_X} x f_X(x) + Σ_{y∈R_Y} y f_Y(y)    (Marginalization Lemma)
  = E[X] + E[Y]    (def. of E)

Theorem 6.5 (b). If the function h is a linear function (that is, h(x) = ax + b for some choice of constants a and b), then we can exchange the order of E and h. That is, E[h(X)] = h(E[X]); that is to say,

E[aX + b] = aE[X] + b

Proof. This comes straight from the definition. Suppose X has range R and PMF f. Then

E[aX + b] = Σ_{x∈R} (ax + b) f(x)
  = Σ_{x∈R} (ax f(x) + b f(x))
  = Σ_{x∈R} ax f(x) + Σ_{x∈R} b f(x)
  = a Σ_{x∈R} x f(x) + b Σ_{x∈R} f(x)
  = aE[X] + b

where in the last line we've used the definition of expectation in the first term, and the fact that the PMF sums to 1 in the second term.

Note that we're free to set b = 0 or a = 1 in the above, in which case we get the following two statements:

E[aX] = aE[X]        (6.2)
E[X + b] = E[X] + b  (6.3)

6.5.4 Variance

Suppose we have already calculated µ = E[X]. Then this is just a number. If we now set h(x) = (x − µ)², then E[h(X)] is called the variance of X. It gives us a way of measuring how far away X tends to be from its mean. It's also a measure of how much uncertainty is associated with our expectation of the value of X itself: if the variance is high, we "expect" to be off by a greater amount.
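Before defining the variance formally, the linearity results of Theorem 6.5 can be checked numerically on the laptop example. A minimal sketch (the helper `E` is illustrative):

```python
# Laptop PMF from the running example.
f = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}

def E(h, f):
    """E[h(X)] = sum over x of h(x) f(x)  (Definition 6.7)."""
    return sum(h(x) * p for x, p in f.items())

mu = E(lambda x: x, f)   # E[X] = 1.3 laptops per day

# Theorem 6.5(b): E[aX + b] = a E[X] + b, here with a = 200, b = -100
# (the daily-profit function g(x) = 200x - 100 from the example).
assert abs(E(lambda x: 200 * x - 100, f) - (200 * mu - 100)) < 1e-9
assert abs(mu - 1.3) < 1e-9
```

The same helper computes E[h(X)] for any h, linear or not; linearity is what lets us skip the sum entirely when h is of the form ax + b.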
Definition 6.9 (Variance of a Random Variable). Let X be defined as above, and let µ = E[X]. We define the variance of X to be

Var[X] = E[(X − µ)²] = Σ_{x∈R} (x − µ)² f(x)

Example. In the case of the laptops, I have

Var[X] = E[(X − µ)²] = E[(X − 1.3)²]
  = Σ_{x=0}^{3} (x − 1.3)² f(x)
  = (0 − 1.3)² · 0.2 + (1 − 1.3)² · 0.4 + (2 − 1.3)² · 0.3 + (3 − 1.3)² · 0.1
  = 1.69 · 0.2 + 0.09 · 0.4 + 0.49 · 0.3 + 2.89 · 0.1
  = 0.338 + 0.036 + 0.147 + 0.289
  = 0.810

The variance has many useful and elegant mathematical properties. One of its limitations, however, is the fact that it's expressed in different units from those of X itself: namely, squared units. This fact is captured in the alternative symbol for variance, σ², where σ is the Greek letter "sigma". We can return to the original units by taking the square root, leaving us with σ. This is called the standard deviation of X.

Definition 6.10 (Standard Deviation of a Random Variable). Let X be defined as above, with mean µ and variance Var[X]. We define the standard deviation of X to be the square root of the variance:

σ = √Var[X]

Example. The standard deviation in the daily number of laptops is σ = √σ² = √0.81 = 0.9.

Notice that there's no function h that we can pick (in general) so that σ = E[h(X)]. We need to compute the variance first in order to get the standard deviation.

6.5.5 Properties of the Variance

The following theorem gives us an easier way to compute the variance than using the direct definition.

Theorem 6.6 (a).

Var[X] = E[X²] − E[X]²

Proof. Just follow the algebra, and apply Thm. 6.5. Denote µ = E[X].

Var[X] = E[(X − µ)²]
  = E[X² − 2µX + µ²]
  = E[X²] − 2µE[X] + µ²        (Thm. 6.5(b))
  = E[X²] − 2E[X]E[X] + E[X]²  (def. of µ)
  = E[X²] − 2E[X]² + E[X]²
  = E[X²] − E[X]²              (E[X] is a number)

If we rescale X by multiplying by a constant, the variance changes in a simple way, but not just by the constant, the way the mean does. Remember, variance is in squared units.
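The shortcut formula of Theorem 6.6(a) can be confirmed on the laptop example. A minimal sketch (all names are illustrative):

```python
# Laptop PMF from the running example.
f = {0: 0.2, 1: 0.4, 2: 0.3, 3: 0.1}
mu = sum(x * p for x, p in f.items())                     # E[X] = 1.3

# Direct definition: Var[X] = E[(X - mu)^2].
var_direct = sum((x - mu) ** 2 * p for x, p in f.items())
# Shortcut (Theorem 6.6(a)): Var[X] = E[X^2] - E[X]^2.
var_short = sum(x ** 2 * p for x, p in f.items()) - mu ** 2

assert abs(var_direct - var_short) < 1e-9
assert abs(var_direct - 0.81) < 1e-9
```

The shortcut needs only E[X] and E[X²], so there is no need to subtract µ inside every term of the sum.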
Theorem 6.6 (b). Let a ∈ ℝ be a constant. We already know that aX is then a random variable, and so it has a variance. In particular,

Var[aX] = a²Var[X]

Proof. This is just algebra, and we could get the result directly from the definition of variance, but it's easier to use the alternative expression for variance from Thm. 6.6(a), and then apply Thm. 6.5(b).

Var[aX] = E[(aX)²] − E[aX]²
  = E[a²X²] − E[aX]²         (Thm. 6.6(a))
  = a²E[X²] − (aE[X])²       (Thm. 6.5(b))
  = a²E[X²] − a²E[X]²
  = a²(E[X²] − E[X]²)
  = a²Var[X]                 (Thm. 6.6(a))

What should happen to the variance if we add a constant to a random variable? We saw that when we add (respectively, subtract) a constant, the mean shifts up (respectively, down) by that exact amount. But the variance is a measure of how far the variable tends to be from its mean. Since both the mean and the individual values are moving the same way, the variance should not change. Indeed, this is the case:

Theorem 6.7. Let X be a random variable with a defined variance, and let c be a real constant. Then

Var[X + c] = Var[X]

Proof. Homework.

6.6 Exercises

1. (Adapted from Applebaum) A discrete random variable has range R = {0, 1, 2, 3, 4, 5} and cumulative distribution whose first five values are

F(0) = 0, F(1) = 1/9, F(2) = 1/6, F(3) = 1/3, F(4) = 1/2

(a) Find f(x), the value of the PMF, for each x ∈ R.
(b) Find E[X] and Var[X].

2. (Adapted from Applebaum) Three fair coins are tossed. Find the PMF of the random variable Y whose value is given by the total number of heads.

3. A bucket has a red balls and b blue ones. You draw k balls at random (1 ≤ k ≤ a + b) without replacement from the bucket. Let Y be a random variable representing the number of red balls in your sample. Find an expression representing the probability that Y = y.

4. Let X be a discrete random variable with range R = {x_1, ..., x_n}, where the elements are listed in increasing order.
Show that the following probabilities can be computed from the CDF as follows:

(a) P(x_r < X ≤ x_s) = F(x_s) − F(x_r)
(b) P(x_r < X < x_s) = F(x_{s−1}) − F(x_r)
(c) P(x_r ≤ X < x_s) = F(x_{s−1}) − F(x_{r−1})
(d) P(X ≥ x_r) = 1 − F(x_{r−1})

5. A game show allows contestants to win money by spinning a wheel. The wheel is divided into four equally sized spaces, three of which contain dollar amounts and one of which says BUST. The contestant can spin as many times as she wants, adding any money won to her total, but if she lands on BUST, the game ends and she loses everything won so far.

(a) Suppose for the moment that the contestant keeps going until she goes bust. Let W be the random variable that represents the number of successful spins before going bust. Indicate the valid range for W, and find an expression for its CDF.
(b) Use the CDF you found above to derive an expression for the PMF of W.
(c) The dollar amounts on the non-BUST spaces are $100, $500, and $5000. Let X_n be the random variable denoting the amount of money won or lost on the nth spin. Find E[X_1] (note that there is nothing to lose on the first spin).
(d) Let S_n be the cumulative amount won after n spins. Find E[X_n | S_{n−1}], as a function of S_{n−1}. How large does S_{n−1} have to get before the expected gain from spinning again is negative?

6. Use induction to extend Theorem 6.5(a), showing that if X_1, ..., X_n are arbitrary random variables, then

E[X_1 + · · · + X_n] = E[X_1] + · · · + E[X_n]

7. Let X be a random variable and c ∈ ℝ a constant. Show that

Var[X + c] = Var[X]

8. (Adapted from Bolstad) A discrete random variable, Y, has a probability mass function given by the following table.

y     0    1    2    3    4
f(y)  0.2  0.3  0.1  0.1  0.1

(a) Calculate P(1 < Y ≤ 3).
(b) Calculate E[Y].
(c) Calculate Var[Y].
(d) Let W = 2Y + 3. Calculate E[W].
(e) Calculate Var[W].

9.
(Adapted from Bolstad) A discrete random variable, Y, has a probability mass function given by the following table.

y     0    1    2    5
f(y)  0.1  0.2  0.3  0.4

(a) Calculate P(0 < Y < 5).
(b) Calculate E[Y].
(c) Calculate Var[Y].
(d) Let W = 3Y − 1. Calculate E[W].
(e) Calculate Var[W].

10. (Adapted from Bolstad) Let X and Y be jointly distributed discrete random variables. Their joint probability distribution is given in the following table:

            x
y       1     2     3     4
1    0.02  0.08  0.05  0.10
2    0.04  0.02  0.05  0.04
3    0.06  0.10  0.03  0.05
4    0.08  0.02  0.02  0.03
5    0.05  0.03  0.10  0.03

(a) Find the marginal PMF of X.
(b) Find the marginal PMF of Y.
(c) Calculate the conditional probability P(X = 3 | Y = 1).