Random variables, mean and variance: Suppose in a collection of people there are some number with height 6’, and equal numbers with heights 5’11” and 6’1”. The mean or average of this distribution is 6’, as can be determined by summing the heights of all the people and dividing by the number of people, or equivalently by summing over distinct heights weighted by the fractional number of people with that height. Suppose, for example, that the numbers in the above height categories are 5, 30, 5; then the latter calculation corresponds to (1/8)·5’11” + (3/4)·6’ + (1/8)·6’1” = 6’. But the average gives only limited information about a distribution. Suppose there were instead only people with heights 5’ and 7’, and an equal number of each; then the average would still be 6’, though these are very different distributions. It is useful to characterize the variation within the distribution from the mean. The average deviation from the mean gives zero due to equal positive and negative variations (as proven below), so the quantity known as the variance (or mean square deviation) is defined as the average of the squares of the differences between the values in the distribution and their mean. For the first distribution above, this gives the variance

    V = (1/8)(−1”)² + (3/4)(0”)² + (1/8)(1”)² = 1/4 (inch)² ,

and for the second distribution the much larger result

    V = (1/2)(−1’)² + (1/2)(1’)² = 1 (foot)² .

The standard or r.m.s. (“root mean square”) deviation σ is defined as the square root of the variance, σ = √V. The above two distributions have σ = 1/2 inch and σ = 1 foot, respectively.
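As a quick numerical check of the arithmetic above, here is a short NumPy sketch (the variable names follow the histogram code below, but the heights-in-inches encoding is spelled out explicitly):

```python
import numpy as np

# The two height distributions from the text, in inches:
# 5 people at 5'11", 30 at 6'0", 5 at 6'1"  vs.  20 each at 5'0" and 7'0"
aheights = np.array([71]*5 + [72]*30 + [73]*5)
bheights = np.array([60]*20 + [84]*20)

# Same mean (72" = 6'), but very different spreads:
print(aheights.mean(), aheights.var(), aheights.std())  # 72.0 0.25 0.5
print(bheights.mean(), bheights.var(), bheights.std())  # 72.0 144.0 12.0
```

Note that NumPy's default `var` and `std` (with `ddof=0`) compute exactly the population variance and r.m.s. deviation defined above: 1/4 (inch)² and 1/2 inch for the first distribution, 1 (foot)² = 144 (inch)² and 1 foot = 12 inches for the second.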
[Figure: overlaid histograms of the two height distributions, “#people with that height” versus “inches”; the legend reports mean 72.0, stdev 0.5 and mean 72.0, stdev 12.0. Generated by:]

    from pylab import *   # pylab-style names: figure, hist, arange, mean, std, ...
    aheights = [6*12+1]*5 + [6*12]*30 + [5*12+11]*5
    bheights = [5*12]*20 + [7*12]*20
    figure(figsize=(5,5))
    hist(aheights, bins=arange(59.5,90))
    hist(bheights, bins=arange(59.5,90))
    xlabel('inches')
    ylabel('#people with that height')
    legend(['mean {}\n stdev {}'.format(mean(d), std(d)) for d in (aheights, bheights)])
    savefig('hhist.pdf')

INFO 2950, 18 Feb 16

More generally, a random variable is a function X : S → ℝ, assigning some real number to each element of the probability space S. The average of this variable is determined by summing the values it can take weighted by the corresponding probability,

    <X> = Σ_{s∈S} p(s) X(s) .

(An alternate notation for this is E[X] = <X>, for the “expectation value” of X.)

Example 1: roll two dice and let X be the sum of the two numbers rolled. Thus X({1,1}) = 2, X({1,2}) = X({2,1}) = 3, . . ., X({6,6}) = 12. The average of X is

    <X> = (1/36)·2 + (2/36)·3 + (3/36)·4 + (4/36)·5 + (5/36)·6 + (6/36)·7 + (5/36)·8 + (4/36)·9 + (3/36)·10 + (2/36)·11 + (1/36)·12 = 7 .

Example 2: flip a coin 3 times, and let X be the number of tails. The average is

    <X> = (1/8)·3 + (3/8)·2 + (3/8)·1 + (1/8)·0 = 3/2 .

The expectation of the sum of two random variables X, Y (defined on the same sample space) satisfies <X + Y> = <X> + <Y>. In general, they satisfy a “linearity of expectation” <aX + bY> = a<X> + b<Y>, proven as follows:

    <aX + bY> = Σ_s p(s)(aX(s) + bY(s)) = a Σ_s p(s)X(s) + b Σ_s p(s)Y(s) = a<X> + b<Y> .

Thus an alternate way to calculate the mean of X = X1 + X2 for the two dice rolls in example 1 above is to calculate the mean for a single die, <X1> = (1 + 2 + 3 + 4 + 5 + 6)/6 = 21/6 = 7/2, and so for two rolls <X> = <X1> + <X2> = 7/2 + 7/2 = 7.

By definition, independent random variables X, Y satisfy p(X=a ∧ Y=b) = p(X=a) p(Y=b) (i.e., the joint probability is the product of their independent probabilities, just as for independent events).
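The two-dice average and the linearity shortcut can be checked by enumerating the sample space directly (a sketch using exact fractions; the variable names are my own):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two dice rolls
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

# <X> for X = sum of the two dice, summed directly over the sample space
EX = sum(p * (a + b) for a, b in outcomes)
print(EX)  # 7

# Linearity of expectation: <X> = <X1> + <X2>, with <Xi> = 7/2 for one die
EX1 = sum(Fraction(1, 6) * k for k in range(1, 7))
print(EX1)  # 7/2
print(EX == EX1 + EX1)  # True
```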
For such variables, it follows that the expectation value of their product satisfies

    <XY> = <X><Y>   (X, Y independent)

since Σ_{r,s} p(r,s) X(r) Y(s) = Σ_{r,s} p(r) p(s) X(r) Y(s) = Σ_r p(r) X(r) Σ_s p(s) Y(s) .

To see that the above relation fails when X and Y are not independent, consider a single coin flip and let X count the number of heads, and Y count the number of tails. Then <X> = <Y> = 1/2, so <X><Y> = 1/4, but <XY> = 0 since one of X or Y is always zero on any given flip. On the other hand, consider flipping a coin ten times and rolling a die 12 times, and let X count the number of heads among the coin flips, and Y the number of times a six is rolled. Then <XY> = <X><Y> = 5 · 2 = 10.

As indicated above, the average of the differences of a random variable from the mean vanishes:

    Σ_{s∈S} p(s)(X(s) − <X>) = Σ_s p(s)X(s) − <X> Σ_s p(s) = <X> − <X> = 0 .

The variance of a probability distribution for a random variable is defined as the average of the squared differences from the mean,

    V[X] = Σ_{s∈S} p(s)(X(s) − <X>)² .   (V1)

The variance satisfies the important relation

    V[X] = <X²> − <X>² ,   (V2)

following directly from the definition above:

    V[X] = Σ_{s∈S} p(s)(X(s) − <X>)²
         = Σ_s X²(s)p(s) − 2<X> Σ_s p(s)X(s) + <X>² Σ_s p(s)
         = <X²> − 2<X>² + <X>²
         = <X²> − <X>² .

In the case of independent random variables X, Y, as defined above, the variance is additive:

    V[X + Y] = V[X] + V[Y] .

To see this, use (V2) together with <XY> = <X><Y>:

    V[X + Y] = <(X + Y)²> − (<X> + <Y>)²
             = <X²> + 2<XY> + <Y²> − <X>² − 2<X><Y> − <Y>²
             = <X²> − <X>² + <Y²> − <Y>²
             = V[X] + V[Y] .

Example: again flip a coin 3 times, and let X be the number of tails. Then

    <X²> = (1/8)·0² + (3/8)·1² + (3/8)·2² + (1/8)·3² = 3 ,

so V[X] = 3 − (3/2)² = 3/4.
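The relation (V2) for the three-coin-flip example can be verified against the direct definition (V1) by enumerating the eight outcomes (a small sketch with names of my own choosing):

```python
from fractions import Fraction
from itertools import product

# Sample space of 3 coin flips; X = number of tails
space = list(product('HT', repeat=3))
p = Fraction(1, 8)

EX  = sum(p * s.count('T') for s in space)      # <X>   = 3/2
EX2 = sum(p * s.count('T')**2 for s in space)   # <X^2> = 3

# (V2): V[X] = <X^2> - <X>^2, compared with the definition (V1)
V_v2 = EX2 - EX**2
V_v1 = sum(p * (s.count('T') - EX)**2 for s in space)
print(V_v2, V_v1)  # 3/4 3/4
```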
If we let X = X1 + X2 + X3, where Xi is the number of tails (0 or 1) for the ith flip, then the Xi are independent variables with <Xi> = 1/2 and <Xi²> = (1/2)·1 + (1/2)·0 = 1/2, so V[Xi] = 1/2 − 1/4 = 1/4 (or equivalently V[Xi] = (1/2)(1/2)² + (1/2)(−1/2)² = 1/8 + 1/8 = 1/4). For the three flips,

    V[X] = V[X1] + V[X2] + V[X3] = 1/4 + 1/4 + 1/4 = 3/4 ,

confirming the result above.

Here’s a brief summary:

    Expectation value: E[X] = Σ_{s∈S} p(s)X(s)

    Variance: V[X] = Σ_{s∈S} p(s)(X(s) − E[X])² = E[X²] − (E[X])²

    Standard deviation: σ[X] = √(V[X])

    For X a sum of random variables, X = Σ_i Xi, the expectation always satisfies E[X] = Σ_i E[Xi]

    If the variables X and Y are independent, then E[XY] = E[X]E[Y]

    If the variables Xi are all independent, then V[X] = Σ_i V[Xi]

Example of coin flips (Xi = 1, 0 according to whether or not flip i is heads): for the ith coin flip, V[Xi] = 1/2 − 1/4 = 1/4. Since they’re independent, for n such flips

    E[X] = n/2 ,   V[X] = n/4 ,   σ[X] = √n/2 .

Note that the fractional standard deviation

    σ[X]/E[X] = 1/√n → 0 for large n ,

so the relative spread of the distribution goes to zero for a large number of trials (the distribution becomes more tightly centered on the mean).

Bernoulli Trial

A Bernoulli trial is a trial with two possible outcomes: “success” with probability p, and “failure” with probability 1 − p. The probability of r successes in N trials is

    C(N,r) p^r (1 − p)^(N−r) ,

where C(N,r) = N!/(r!(N−r)!) is the binomial coefficient. Note that the correct overall normalization automatically follows from

    Σ_{r=0}^{N} C(N,r) p^r (1 − p)^(N−r) = (p + (1 − p))^N = 1^N = 1 .

The overall probability for r successes is a competition between C(N,r), which is maximum at r ∼ N/2, and p^r (1 − p)^(N−r), which is largest for small r when p < 1/2 (or large r for p > 1/2). In class, we considered the case of rolling a standard six-sided die, with a roll of 6 considered a success, so p = 1/6.
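The Bernoulli probability and its normalization can be sketched in a few lines (the function name and the choice N = 12 are my own, illustrating the die example with p = 1/6):

```python
from math import comb
from fractions import Fraction

def binom_pmf(r, N, p):
    """Probability of r successes in N Bernoulli trials: C(N,r) p^r (1-p)^(N-r)."""
    return comb(N, r) * p**r * (1 - p)**(N - r)

# e.g. N = 12 rolls of a die, "success" = rolling a six, p = 1/6
N, p = 12, Fraction(1, 6)
pmf = [binom_pmf(r, N, p) for r in range(N + 1)]

print(sum(pmf))             # 1 -- normalization (p + (1-p))^N = 1, exactly
print(pmf.index(max(pmf)))  # 2 -- most likely number of sixes in 12 rolls
```

Using `Fraction` makes the normalization check exact rather than a floating-point approximation; the peak at a small r reflects the competition described above, since p = 1/6 < 1/2.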
(See figures on the next page showing C(N,r) p^r (1 − p)^(N−r) for N = 1, 2, 4, 10, 40, 80, 160, 320 trials, with the number of successes r plotted along the horizontal axis for each value of N.)

For a larger number N of trials, the distribution of the expected number of successes becomes more narrowly peaked and more symmetrical about r = N/6. To analyze this in the framework outlined above, let the random variable Xi = 1 if the ith trial is a success (and Xi = 0 otherwise). Then <Xi> = p. Let X = X1 + X2 + . . . + XN count the total number of successes. Then it follows that the average satisfies

    <X> = Σ_i <Xi> = Np .   (B1)

From V[Xi] = <Xi²> − <Xi>² = p − p² = p(1 − p), it follows that the variance satisfies

    V[X] = Σ_i V[Xi] = Np(1 − p) ,   (B2)

and the standard deviation is σ = √(V[X]) = √(Np(1 − p)). (Note that for p = 1/2 and N = 3, this gives V[X] = 3/4, reproducing the result of the coin flip example above.)

This explains the observation that the probability gets more sharply peaked as the number of trials increases, since the width of the distribution (σ) divided by the average <X> behaves as σ/<X> ∼ √N/N ∼ 1/√N, a decreasing function of N.

By the “central limit theorem” (not proven in class), many such distributions under fairly relaxed assumptions tend, for a sufficiently large number of trials, to a “gaussian” or “normal” distribution, of the form (as shown explicitly in lecture 22 notes)

    P(x) ≈ (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) .   (G)

This is properly normalized, with ∫_{−∞}^{∞} dx P(x) = 1, and also has ∫_{−∞}^{∞} dx x P(x) = µ and ∫_{−∞}^{∞} dx x² P(x) = σ² + µ², so the above distribution has mean µ and variance σ². Setting µ = Np and σ = √(Np(1 − p)) for p = 1/6 in (G) thus gives a good approximation to the distribution of successful rolls of 6 for a large number of trials in the example above.
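The normal approximation can be compared with the exact binomial probabilities for the largest case in the figures, N = 320 with p = 1/6 (a sketch; function names are my own, and the agreement near the peak is only up to O(1/√N) corrections):

```python
from math import comb, sqrt, pi, exp

def binom_pmf(r, N, p):
    """Exact Bernoulli-trial probability C(N,r) p^r (1-p)^(N-r)."""
    return comb(N, r) * p**r * (1 - p)**(N - r)

def gauss(x, mu, sigma):
    """Normal density (G) with mean mu and standard deviation sigma."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

N, p = 320, 1/6
mu, sigma = N * p, sqrt(N * p * (1 - p))   # (B1) and (B2): mu ~ 53.3, sigma ~ 6.7

# Near the peak r ~ N/6 the two curves should agree closely for large N
for r in range(45, 65, 5):
    print(r, binom_pmf(r, N, p), gauss(r, mu, sigma))
```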
[Figures: eight panels showing the probability of r sixes in N = 1, 2, 4, 10, 40, 80, 160, and 320 trials, each plotting Probability against Number of sixes; as N grows the distribution becomes more symmetrical and, relative to N, more narrowly peaked about N/6.]