Lecture Notes: Variance, Law of Large Numbers, Central Limit Theorem
CS244-Randomness and Computation
March 24, 2015
1 Variance – Definition, Basic Examples
The variance of a random variable is a measure of how much the value of the
random variable differs from its expected value. Let X be a random variable, and
let µ = E(X) be its expected value. Then the variance is defined by
Var(X) = E((X − µ)²).
A related quantity, which you can think of as the average deviation of X from
its mean, is the standard deviation of X, denoted σ_X, defined by

σ_X = √(Var(X)).
You might wonder why we don’t use the more obvious
E(|X − µ|)
as a measure of the average deviation from the mean. The answer, in part, is given
in the last section of these notes on the Central Limit Theorem.
Example. Bernoulli Random Variable. Let X be a Bernoulli random variable
that has the value 1 with probability p and 0 with probability q = 1 − p. As we've
seen, E(X) = p. So (X − µ)² has the value (1 − p)² = q² with probability p, and
the value (0 − p)² = p² with probability q. Thus

Var(X) = pq² + qp² = pq(p + q) = pq,
and
σ_X = √(pq).
Figure 1: PMFs for a fair die and two differently loaded dice
For instance if p = 1/2 then Var(X) = 1/4 and σ_X = 1/2. This represents a kind of
extreme case, at least for Bernoulli random variables, of deviation from the mean.
At the other extreme, if p = 1 then X never varies (it always has the value 1), and
Var(X) = σ_X = 0.
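If you want to check the formula numerically, here is a small Python sketch (the value of p and the number of trials are arbitrary choices) that estimates the variance of a Bernoulli random variable by simulation and compares it with pq:

import random

def bernoulli_variance_estimate(p, trials=100_000):
    # Draw Bernoulli(p) samples and estimate E((X - mu)^2) from them.
    samples = [1 if random.random() < p else 0 for _ in range(trials)]
    mu = sum(samples) / trials
    return sum((x - mu) ** 2 for x in samples) / trials

p = 0.3
print(bernoulli_variance_estimate(p))   # close to p*(1 - p) = 0.21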
Example. Dice, loaded and unloaded. Figure 1 shows the PMFs for three
different distributions of the outcome of a single die roll. The diagram at left
shows the standard uniform random variable, where each of the six outcomes has
probability 1/6. The diagram in the center shows a loaded die that always results
in 1 or 6, each with probability 1/2, and the diagram at right is the case where the
die always results in 3 or 4, again each with probability 1/2.
In all three instances, the expected value of the random variable is 3.5. For the
fair die, the variance is
Σ_{j=1}^{6} (1/6) · (j − 3.5)² = 2.9167,
so the standard deviation is the square root of this, about 1.71.
For the die loaded to come out 1 or 6, the variance is just
(1/2) · ((1 − 3.5)² + (6 − 3.5)²) = 6.25,
so the standard deviation is 2.5. Similarly for the die loaded to come out 3 or 4, the
variance is 0.25 and the standard deviation 0.5. In other words, the center diagram
is the most ‘spread out’, because its value is always quite far from the mean, and
the right diagram the least spread out. The uniform distribution has values both
far from the mean and close to the mean, giving a variance between those of the two loaded
dice.
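These variances can be computed directly from the PMFs. The following Python sketch (representing each PMF as a dictionary is just a convenient choice) reproduces the three values above:

# Each PMF maps a face value to its probability.
pmfs = {
    "fair die":      {j: 1/6 for j in range(1, 7)},
    "loaded 1 or 6": {1: 0.5, 6: 0.5},
    "loaded 3 or 4": {3: 0.5, 4: 0.5},
}

for name, pmf in pmfs.items():
    mu = sum(x * p for x, p in pmf.items())                 # expected value, 3.5 in each case
    var = sum(p * (x - mu) ** 2 for x, p in pmf.items())    # Var(X) = E((X - mu)^2)
    print(name, mu, var, var ** 0.5)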
A (usually) easier way to compute the variance.
We can write (X − µ)² = X² − 2µX + µ², so by linearity of expectation, and
the fact that µ is constant,

Var(X) = E(X² − 2µX + µ²)
       = E(X²) − E(2µX) + µ²
       = E(X²) − 2µE(X) + µ²
       = E(X²) − µ².
Let’s repeat the computation of the variance of the Bernoulli variable, using
this simpler formula. Since X has values 0 and 1, X² = X, so E(X²) = E(X) = p. Thus

Var(X) = E(X²) − µ² = p − p² = p(1 − p) = pq,
just as we found above.
For the fair die, E(X²) = (1/6) · (1² + 2² + 3² + 4² + 5² + 6²) = 15.1667, so the
variance is 15.1667 − 3.5² = 2.9167, which agrees with the previous example.
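As a quick numerical check of the shortcut formula, the following sketch computes E(X²) − µ² for the fair die from its PMF:

pmf = {j: 1/6 for j in range(1, 7)}             # fair die
mu = sum(x * p for x, p in pmf.items())         # E(X) = 3.5
ex2 = sum(p * x**2 for x, p in pmf.items())     # E(X^2), about 15.1667
print(ex2 - mu**2)                              # about 2.9167, same as E((X - mu)^2)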
As another example, if we want to find E(X²) for a continuous random variable X, we compute

∫_{−∞}^{∞} x² p(x) dx
where p(x) is the probability density function. If we take X to be the outcome
of a spinner with values between 0 and 1, then p is the uniform density that is 1
between 0 and 1 and 0 elsewhere. Thus
E(X²) = ∫_0^1 x² dx = 1/3.
Since E(X) = 1/2, we have
Var(X) = E(X²) − E(X)² = 1/3 − (1/2)² = 1/12.
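A Monte Carlo sketch (with an arbitrarily chosen sample size) gives an estimate close to 1/12 ≈ 0.0833:

import random

n = 1_000_000
samples = [random.random() for _ in range(n)]      # the uniform spinner on [0, 1)
mu = sum(samples) / n
var = sum((x - mu) ** 2 for x in samples) / n
print(mu, var)                                     # roughly 0.5 and 0.0833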
As was the case with the expected value, the variance of an infinite discrete or
continuous random variable might not be defined, because the underlying infinite
series or improper integral does not converge.
2 Additivity for Independent Random Variables; Binomial Distribution
For a constant c and random variable X we have, by linearity of expectations,

Var(cX) = E((cX)²) − E(cX)²
        = E(c²X²) − (cE(X))²
        = c²E(X²) − c²E(X)²
        = c²(E(X²) − E(X)²)
        = c²Var(X).
What is the variance of the sum of two random variables? Again, by repeatedly
applying the linearity of expectations,

Var(X + Y) = E((X + Y)²) − E(X + Y)²
           = E(X² + 2XY + Y²) − (E(X) + E(Y))²
           = E(X²) + 2E(XY) + E(Y²) − (E(X)² + 2E(X)E(Y) + E(Y)²)
           = (E(X²) − E(X)²) + (E(Y²) − E(Y)²) + 2(E(XY) − E(X)E(Y))
           = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)).
The expression E(XY) − E(X)E(Y) in the right-hand summand is called the
covariance of X and Y, and we will see it again later. If X and Y are independent
then this expression is 0, so for independent random variables,
Var(X + Y) = Var(X) + Var(Y).
This does not hold in general if the random variables are not independent. In
a homework problem you were asked to show that if X and Y represent, respectively, the smaller and larger values of the dice in a roll of two dice, then
E(X)E(Y) ≠ E(XY), so in this case the sum of the variances is not equal to the
variance of the sum.
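The contrast can be checked exactly by enumerating the 36 equally likely outcomes of a roll of two dice. The sketch below (the helper name var is mine) treats the independent case and the smaller/larger case side by side:

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))    # 36 equally likely rolls (a, b)

def var(values):
    # Variance of a list of equally likely values.
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# Independent case: the two individual dice.
d1 = [a for a, b in outcomes]
d2 = [b for a, b in outcomes]
print(var([a + b for a, b in outcomes]), var(d1) + var(d2))     # equal (about 5.833)

# Dependent case: X = smaller value, Y = larger value.
lo = [min(a, b) for a, b in outcomes]
hi = [max(a, b) for a, b in outcomes]
print(var([x + y for x, y in zip(lo, hi)]), var(lo) + var(hi))  # not equal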
Example. Binomial Random Variable. The binomial random variable S_n counts
the number of heads on n coin tosses; as we've seen before,

P[S_n = k] = C(n, k) p^k (1 − p)^(n−k)

if p is the probability of heads. S_n is itself the sum of n pairwise independent
copies of a Bernoulli random variable X each with probability p. As a result,
using the summation result above,
Var(S_n) = n · Var(X) = np(1 − p),
and
σ_{S_n} = √(np(1 − p)).
Further, let Y = S_n/n be the fraction of heads on n tosses of a coin. Then

Var(Y) = Var(S_n)/n² = p(1 − p)/n.
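Here is a short sketch that confirms Var(S_n) = np(1 − p) by computing the variance exactly from the binomial PMF (it uses math.comb from the standard library):

from math import comb

def binomial_variance(n, p):
    # Exact Var(S_n) from the PMF, using E(S_n^2) - E(S_n)^2.
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mu = sum(k * q for k, q in enumerate(pmf))
    return sum(k * k * q for k, q in enumerate(pmf)) - mu**2

print(binomial_variance(100, 0.5), 100 * 0.5 * 0.5)   # both 25 (up to rounding)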
3 Chebyshev Inequality
The notion of variance allows us to obtain a rough estimate of the probability that
a random value differs by a certain amount from its mean. Before we state and
prove this, we’ll establish a simpler property called the Markov inequality: If X
is a positive-valued random variable with expected value µ, and c > 0, then
E(X)
.
c
Let’s see why this is: We’ll assume that X is a discrete random variable with
finitely many outcomes
0 < c1 < c2 < · · · cn .
P [X ≥ c] ≤
with probabilities p1 , p2 , . . . , pn , respectively. Let ci be the smallest of these values
that is greater than or equal to c. Then
P [X > c] = pi + pi+1 + . . . + pn
1
· (cpi + cpi+1 + · · · cpn )
=
c
1
≤
· (ci pi + ci+1 pi+1 + · · · + cn pn )
c
1
≤
· (c1 p1 + c2 p2 + · · · + cn pn )
c
E(X)
=
.
c
Essentially the same argument works for infinite discrete and continuous random
variables. The Markov inequality by itself does not provide a great deal of
information, but we will use it to show something important. Let X be any random
variable with expected value µ, and let ε be a positive number. (Think of ε as
small.) When we apply the Markov inequality to the random variable (X − µ)²,
we get

P[(X − µ)² ≥ ε²] ≤ E((X − µ)²)/ε² = Var(X)/ε².

Since the left-hand side is the same thing as P[|X − µ| ≥ ε], we can write this as

P[|X − µ| ≥ ε] ≤ Var(X)/ε².
This is called Chebyshev’s inequality. It provides a probability bound on how
likely it is for a random variable to deviate a certain amount from its expected
value.
Example. Let S_n be the number of heads on n tosses of a fair coin. Let's estimate
the probability that the number of heads for n = 100 is between 40 and 60.
Since E(S_100) = 50, we are asking for the complement of the probability that the
number of heads is at least 61 or at most 39; in other words, that it differs by at
least 11 from its mean. Since the variance of S_100 is 100/4 = 25, Chebyshev's
inequality gives

P[|S_100 − 50| ≥ 11] ≤ 25/11² ≈ 0.207,

so the probability of the number of heads being between 40 and 60 is at least 0.793.
As we shall see later, the probability is actually much closer to 1 than this. Chebyshev’s inequality only gives a rough upper bound, not a close approximation. The
advantage is that it applies to absolutely any random variable.
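For comparison, the exact probability can be computed from the binomial PMF; the following sketch uses math.comb:

from math import comb

# Exact P[40 <= S_100 <= 60] for a fair coin.
exact = sum(comb(100, k) for k in range(40, 61)) / 2**100
print(exact)     # about 0.965, much closer to 1 than the Chebyshev lower bound 0.793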
If we take ε = tσ_X, then we can write the inequality as

P[|X − µ| ≥ tσ_X] ≤ 1/t².

For instance, the probability of being at least two standard deviations from the
mean is at most 1/4.
4 Law of Large Numbers
The numerical outcome of an experiment is a random variable X. Let us suppose
X has expected value µ and variance σ². If we perform n independent trials of
the experiment, then the random variable A_n = (X_1 + · · · + X_n)/n, giving the
average of the outcomes of these trials, still has expected value µ, but now the
variance is σ²/n. Thus by Chebyshev's inequality, for any ε > 0,

Pr[|A_n − µ| > ε] ≤ σ²/(nε²).

What does this mean? Imagine that ε is rather small, say 0.01. Then the right-hand
side is 10⁴σ²/n. If we choose n to be large enough, then we can make this right-hand
side as small as we like, which means that we can guarantee, with probability
as large as we like, that the average A_n is within 0.01 of its expected value. Put
more simply, we can get as close to the mean as we like (with probability as high
as we like) by repeating the experiment often enough. (Of course, we cannot make
the probability exactly 1 no matter how many times we repeat the experiment, nor
can we guarantee that the average will be exactly equal to the mean.)
This is the precise statement of what is often described colloquially as ‘the law
of averages’. In probability theory, it is called the Law of Large Numbers.
5 Normal Approximation to Binomial Distribution
Figure 2 shows the PMFs of the random variables S_n for n = 20, 60, 100, where
S_n denotes the number of heads on n tosses of a coin with heads probability
p = 0.4. These PMFs are given by the binomial probability distribution

P[S_n = k] = C(n, k) p^k (1 − p)^(n−k).

The three random variables of course all have different expected values (8, 24,
and 40, respectively) so the PMFs are nonzero on different parts of the number line.
By our previous calculations, the standard deviation of S_n grows proportionally
to the square root of n. As a result, the PMFs get more spread out as n increases.
In Figure 3, S_n is replaced by S_n − np, so that all three random variables have
expected value 0. In Figure 4, we further change this to the random variables

(S_n − np) / √(np(1 − p)),
so that all three have variance 1.
The apparent result is that all three graphs seem to have the same basic shape,
but just differ in the vertical scale. In Figure 5, the vertical scale is adjusted so
that all three have maximum value 1. All the points lie on the same smooth curve.
What is this shape?
The smooth curve was drawn by plotting the graph of

y = e^(−x²/2),
and the crucial result illustrated by these pictures is that this shape closely approximates the binomial distribution. In other words, this famous ‘bell curve’
represents a continuous probability density that is a kind of limiting case of the
binomial distributions as n grows large.
Let's be a little more precise about this: The function e^(−x²/2) is not itself a
probability density function, because the area under the curve is not 1, but it
becomes a density function when we divide by the area under the curve. (This
area is √(2π), a fact that is far from obvious.) The resulting function

φ(x) = (1/√(2π)) e^(−x²/2)

is called the standard normal density.
‘Standard’ here means that it has mean 0 and standard deviation 1.
The corresponding cumulative distribution function is
Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^(−t²/2) dt.
Since we cannot evaluate Φ(x) analytically, it has to be approximated numerically.
You can compute Φ(x) in Python to high accuracy using the related standard-library
function math.erf, as

import math
0.5 + 0.5*math.erf(x/math.sqrt(2))
Our observations above illustrate an important fact: the binomial distribution,
adjusted to have mean 0 and standard deviation 1, is closely approximated by the
normal distribution, especially as n gets larger. Here are a few examples.
Example. Let us estimate the probability that on one hundred tosses of a fair
coin, the number of heads is between 45 and 55. Let X be the random variable
representing the number of heads, so we are asking for

P[45 ≤ X ≤ 55].

We make the same modification as above, subtracting the expected value 50 and
dividing by the standard deviation √(100 · 1/4) = 5. Thus we are looking for

P[−1 ≤ (X − 50)/5 ≤ 1].
Figure 2: PMFs of binomial distribution with n = 20, 60, 100 and p = 0.4

Figure 3: The same distributions shifted to all have mean 0...

Figure 4: ...and scaled horizontally to have standard deviation 1

Figure 5: The previous figure stretched vertically so that all three PMFs appear with the same height, superimposed on the graph of e^(−x²/2)

Figure 6: The standard normal density φ(x): the shaded area is Φ(1), the probability that the standard normal random variable has value less than 1.

Figure 7: The cumulative normal density Φ(x).
Approximating this by the standard normal distribution suggests that this probability is about
Φ(1) − Φ(−1) = 0.6827.
The exact value, of course, is
Σ_{j=45}^{55} C(100, j) · 2^(−100) = 0.72875,
so our approximation is not very impressive, only accurate to one decimal digit
of precision. Part of the reason can be seen in the fact that the probability we are
looking for is also equal to
P [44 < X < 56].
One of the pitfalls of approximating a discrete distribution with a continuous one
is that we don’t necessarily know exactly where we should draw the lines between
values of the random variable. It turns out that what works well in this situation is to
use values for the continuous distribution that are halfway between the relevant
values for the discrete distribution: In this case, that means we should view the
problem as one of calculating
P [44.5 < X < 55.5].
This gives an estimate of
Φ(1.1) − Φ(−1.1) = 0.72867,
accurate to four decimal digits of precision.
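The whole comparison can be reproduced in Python. The sketch below (the helper name Phi is mine, built on math.erf as above) prints the exact value and both normal estimates:

from math import comb, erf, sqrt

def Phi(x):
    # Standard normal cdf via the error function.
    return 0.5 + 0.5 * erf(x / sqrt(2))

exact = sum(comb(100, j) for j in range(45, 56)) / 2**100
plain = Phi(1) - Phi(-1)                # no continuity correction
corrected = Phi(1.1) - Phi(-1.1)        # continuity-corrected estimate
print(exact, plain, corrected)          # about 0.72875, 0.68269, 0.72867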
6 Central Limit Theorem
The last section illustrated the fact that the sum of independent identically distributed Bernoulli random variables is approximately normally distributed. This
is an instance of a much more general phenomenon—every random variable has
this property!
To be more precise, let X be a random variable with µ = E(X) and σ² =
Var(X) defined. Let X_1, . . . , X_n be pairwise independent random variables, each
with the same distribution as X. Think of this as making n independent repetitions
of an experiment whose outcome is modeled by the random variable X. Our claim
is that the sum of the X_i is approximately normally distributed. Again we adjust
the mean and standard deviation to be 0 and 1; then the precise statement is
lim_{n→∞} P[a < (X_1 + · · · + X_n − nµ)/(σ√n) < b] = Φ(b) − Φ(a).
This is called the Central Limit Theorem. Earlier we saw, with the Law of Large
Numbers, that the deviation of the average of n independent identical random variables from its mean approaches 0 as n grows larger. The Central Limit Theorem
says more: it tells us how that deviation is distributed.
Example. Let’s look at an experiment that was the subject of a question on the
midterm: Spin two spinners, each giving a value uniformly distributed between 0
and 1, and let X be the larger of the two values. We saw that the cdf of X was
given by y = x² between 0 and 1, and thus the pdf of X is y = 2x between 0 and
1, and 0 elsewhere. (See the posted exam solutions for the details.) We can then
compute
µ = E(X) = ∫_0^1 x · 2x dx = 2/3,

E(X²) = ∫_0^1 x² · 2x dx = 1/2,

so

σ² = Var(X) = 1/2 − (2/3)² = 1/18,

and σ = 1/√18.
Suppose we perform this experiment 100 times. How likely is it that the sum
is greater than 65? The expected value of the sum is 66 2/3. Since the distribution of
the sum is approximately normal, and thus symmetric about the mean, we should
expect a probability greater than one half. How likely is it that the sum is greater
than 70? Here we should expect a probability less than one-half.
We apply the Central Limit Theorem to obtain an estimate. We first try to
compute
Pr[0 ≤ X_1 + · · · + X_100 < 65],
so making our usual transformation with µ and σ, this is
Pr[−(66 2/3)/(10/√18) < (X_1 + · · · + X_100 − 66 2/3)/(10/√18) < −(1 2/3)/(10/√18)].
The Central Limit Theorem says that this is approximately
Φ(−(1 2/3)/(10/√18)) − Φ(−(66 2/3)/(10/√18)).
The right-hand expression is a very tiny number which we can treat as 0. (Alternatively,
we could just as well have used −∞ instead of 0 in our computation.) So this
gives the approximation
Φ(−(1 2/3)/(10/√18)) = 0.23975,
and thus the probability that the sum is greater than 65 is
1 − 0.23975 = 0.76025,
which is greater than one-half, as we expected. If we replace 65 by 70, then an
identical calculation gives
1 − Φ((3 1/3)/(10/√18)) = 1 − 0.92135 = 0.07865.
I simulated the experiment of spinning 100 pairs of spinners and summing the
maxima of the results. In 10,000 repetitions, I found the number of times the sum
was greater than 65 to be 7594 (as compared to the predicted 7602), and the number
of times the sum was greater than 70 to be 780 (as compared to the predicted
787).
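A simulation along these lines can be written as the following sketch (this is illustrative code, not the program actually used for the counts above, and the results vary from run to run):

import random

def spinner_sum():
    # Sum, over 100 trials, of the larger of two uniform [0, 1) spins.
    return sum(max(random.random(), random.random()) for _ in range(100))

sums = [spinner_sum() for _ in range(10_000)]
print(sum(s > 65 for s in sums))    # roughly 7600 of the 10,000 runs
print(sum(s > 70 for s in sums))    # roughly 790 of the 10,000 runs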