October 25, 2002
Expected Value (again), Variance, and Standard Deviation
Notes for Math 295
We introduce the notions of variance and standard deviation of a random
variable, and summarize the algebraic rules for manipulating the operations E( ), V( ),
and D( ).
These rules are also covered in the text, Sections 3.10-3.11 (expectation) and
3.12-3.13 (variance).
1. Expected Value (summary)
Here is a summary of the formulas we have for expected value.
If X is a discrete random variable, then
E(X) = \sum_k k \, p_X(k)        (1)
where p_X(k) denotes the probability function for X, and the sum is taken over all possible
values of k. Equation (1) means exactly the same thing as
E(X) = \sum_k k \, P(X = k).        (2)
We have seen that—if the sample space itself is discrete—then equation (1) gives the
same result as
E(X) = \sum_s X(s) \, P(s)        (3)
where this time, P(s) represents the probability function for the sample space, and the
sum is taken over all outcomes s.
If a second random variable Y is defined as a function of X, say Y = h(X), then
we can get the expected value of Y from the formula
E(Y) = \sum_k h(k) \, p_X(k).        (4)
This formula is never really necessary, because we can always just construct the
probability function p_Y(k) for Y and use equation (1) directly. But sometimes equation
(4) is easier, especially if we have already constructed a table of p_X(k).
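As a quick illustration (my addition, not part of the original notes), here is a minimal Python sketch of equations (1) and (4); the pmf used, a fair six-sided die, is an assumption chosen for the example:

```python
# Equation (1): E(X) = sum over k of k * p_X(k).
# Equation (4): E(h(X)) = sum over k of h(k) * p_X(k), without building p_Y.
# The pmf here (a fair six-sided die) is an assumption chosen for illustration.

pmf = {k: 1/6 for k in range(1, 7)}          # p_X(k) for k = 1, ..., 6

EX = sum(k * p for k, p in pmf.items())      # equation (1): 3.5
EhX = sum(k**2 * p for k, p in pmf.items())  # equation (4) with h(x) = x^2: 91/6

print(EX, EhX)                               # 3.5 15.1666...
```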
If X is a continuous random variable—that is, if X has a density function f_X(x)—then

E(X) = \int_{-\infty}^{\infty} x \, f_X(x) \, dx.        (5)
If Y is a function of X, say Y = h(X), then the expected value of Y is given by

E(Y) = \int_{-\infty}^{\infty} h(x) \, f_X(x) \, dx.        (6)
Equation (6) is more important than equation (4), because sometimes it is hard to get
E(Y) by any other method. We’ll see an example later on.
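For readers who want to experiment, here is a hedged numerical sketch of equations (5) and (6) (my addition); it uses SciPy's quad integrator, and the exponential density is an arbitrary illustrative choice:

```python
# Equations (5) and (6) evaluated numerically with SciPy's quad integrator.
# The density (exponential with rate 1) is an arbitrary illustrative choice;
# for it, E(X) = 1 and E(X^2) = 2 exactly.
import math
from scipy.integrate import quad

f = lambda x: math.exp(-x)                          # density f_X(x) for x >= 0

EX, _ = quad(lambda x: x * f(x), 0, math.inf)       # equation (5)
EX2, _ = quad(lambda x: x**2 * f(x), 0, math.inf)   # equation (6), h(x) = x^2

print(EX, EX2)                                      # approximately 1.0 and 2.0
```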
We also have a formula for E(X) in terms of F_X, the cumulative distribution
function (cdf) of X. This is rarely used, but it is worth writing down because every
random variable has a cdf, and so this formula applies even if X is not discrete and does
not have a density function:
E(X) = -\int_{-\infty}^{0} F_X(x) \, dx + \int_{0}^{\infty} \bigl(1 - F_X(x)\bigr) \, dx        (7)
whenever both integrals exist.
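Equation (7) is easy to sanity-check numerically. The sketch below is my addition, not the author's; the distribution used (normal with mean 0.5) is an arbitrary test case chosen so that both integrals are nonzero:

```python
# Equation (7): E(X) from the cdf alone. The distribution (normal with
# mean 0.5, sd 1) is an arbitrary test case, so the answer should be 0.5.
from scipy.integrate import quad
from scipy.stats import norm

F = lambda x: norm.cdf(x, loc=0.5, scale=1.0)           # cdf F_X(x)

left, _ = quad(F, -float("inf"), 0)                     # integral of F_X over (-inf, 0]
right, _ = quad(lambda x: 1 - F(x), 0, float("inf"))    # integral of 1 - F_X over [0, inf)

print(right - left)                                     # approximately 0.5
```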
2. Variance
If X is any random variable, then the variance of X, denoted V(X), is defined by
V(X) = E(X^2) - \bigl(E(X)\bigr)^2        (8)
provided both expected values E(X) and E(X^2) exist. Sometimes V(X) is written
Var(X).
Equation (8) is worth memorizing. It may be easier in the form of a slogan:
Variance = Mean of the Square minus Square of the Mean.
(Recall “mean” is a synonym for expected value.)
The standard deviation of X, denoted D(X), is defined by
D(X) = \sqrt{V(X)}.        (9)
Obviously, if you have the variance, you can get the standard deviation in one step, and
vice versa (since V(X) = D(X)^2).
People use the symbol σ (lower-case Greek sigma) for standard deviations.
This usage is so common that some people refer to the standard deviation of X as “the
sigma of X.” If there are other random variables in the problem, they might write σ_X
for D(X). Also, people use σ^2 or σ_X^2 for the variance.
(Note: The notation V(X) is almost universal, but not quite. Among strangers
you might want to write Var(X) to be absolutely clear. The notation D(X) isn’t standard.
Writing StDev(X) is clear but ugly. Most people just write σ_X.)
Example 1. Suppose that a discrete random variable X has the probability
function
k        0     1     2     3
P(X=k)   1/8   3/8   3/8   1/8
What are E(X), V(X), and D(X) ?
(By the way: This is a binomial distribution with n=3 and p=0.5. We saw it
before when we tossed three coins, and X was the number of heads.)
Solution: First,
E(X) = \sum_k k \, p_X(k)
= (0) (1/8) + (1) (3/8) + (2) (3/8) + (3) (1/8)
= 1.5.
(That’s no surprise. If we toss three coins, then on average, we get 1.5 heads.)
In order to get V(X), we need to calculate E(X^2). This is a good time to use
equation (4). (In this case we can write Y = X^2; then Y = h(X) where h(x) = x^2.)
E(X^2) = \sum_k k^2 \, p_X(k)
= (0)^2 (1/8) + (1)^2 (3/8) + (2)^2 (3/8) + (3)^2 (1/8)
= (0) (1/8) + (1) (3/8) + (4) (3/8) + (9) (1/8)
= 3.
So the variance is given by
V(X) = E(X^2) - E(X)^2 = 3 - (1.5)^2 = 0.75.        (10)
The standard deviation is just the square root of the variance:
D(X) = \sqrt{0.75} = \tfrac{1}{2}\sqrt{3} \approx 0.866.
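For readers who want to check this on a machine, here is a short Python verification of Example 1 (my addition, not the author's):

```python
# Verifying Example 1: the pmf table above gives E(X) = 1.5, V(X) = 0.75,
# and D(X) = sqrt(0.75) = 0.866...
import math

pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

EX = sum(k * p for k, p in pmf.items())       # equation (1)
EX2 = sum(k**2 * p for k, p in pmf.items())   # equation (4) with h(x) = x^2

V = EX2 - EX**2                               # equation (8)
print(EX, V, math.sqrt(V))                    # 1.5 0.75 0.8660...
```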
3. A small point of notation
Let’s agree that the notation E(X)^2 always means \bigl(E(X)\bigr)^2, and never means
anything else.
That allows us to save a set of parentheses in the definition of variance:
V(X) = E(X^2) - E(X)^2.
4. More examples
Example 2. Let X be the result of the spinner experiment. Then X has a uniform
density function on the interval from 0 to 1:
f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}
What are E(X), V(X), and D(X)?
Solution:

E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx = \int_0^1 x \cdot 1 \, dx = \left. \frac{x^2}{2} \right|_0^1 = \frac{1}{2}.
This should be no surprise, either. When we pick a number at random between 0 and 1,
the average value is ½.
Continuing: We need E(X^2).
E(X^2) = \int_{-\infty}^{\infty} x^2 \, f(x) \, dx = \int_0^1 x^2 \cdot 1 \, dx = \left. \frac{x^3}{3} \right|_0^1 = \frac{1}{3}.
So:
V(X) = E(X^2) - E(X)^2 = (1/3) - (1/2)^2 = 1/12.
And,
D(X) = \sqrt{1/12} = \tfrac{1}{6}\sqrt{3} \approx 0.289.
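The same numbers come out of a numerical integration (my addition, again using SciPy's quad as in the earlier sketch):

```python
# Verifying Example 2 numerically: for the uniform density on [0, 1],
# E(X) = 1/2, E(X^2) = 1/3, V(X) = 1/12, and D(X) = sqrt(1/12) = 0.288...
import math
from scipy.integrate import quad

f = lambda x: 1.0                             # density f(x) = 1 on [0, 1]

EX, _ = quad(lambda x: x * f(x), 0, 1)        # equation (5)
EX2, _ = quad(lambda x: x**2 * f(x), 0, 1)    # equation (6)

V = EX2 - EX**2
print(EX, V, math.sqrt(V))                    # 0.5 0.08333... 0.28867...
```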
5. Another way to look at the variance
There is another formula for the variance that is in common use. It isn’t usually
as good for computation, but it is good for insight.
If X is any random variable, and E(X) is its expected value, then the variance
of X is given by

V(X) = E\bigl( (X - E(X))^2 \bigr).        (11)
Some people write μ (lower-case Greek mu) for the mean—that is, μ is a synonym
for E(X)—so we could also write this formula as

V(X) = E\bigl( (X - \mu)^2 \bigr).        (12)
We will prove below that these equations really give the same result as
equation (8). For now, let’s concentrate on what equation (11) means.
The variance is a measure of how dispersed (or how scattered) the values of
X are. A low variance means that the values don’t really vary a lot, and that they are
usually close to their average value. A high variance means that they are often very far
from their average value.
With that in mind, let’s “deconstruct” equation (11).
The expression X – E(X) is just the difference between X and its mean. Notice
that E(X) is just a number, but that X is a random variable. So, the quantity X – E(X)
is itself a random variable. We call it the “deviation of X from the mean”, or just the
“deviation.”
(For example: Let X be the result of the spinner experiment. Suppose we carry
out the experiment three times, and X happens to have the values 0.766, 0.249, and
0.545. Then in these three cases, the random variable X – E(X) takes on the values
+0.266, -0.251, and +0.045. That is, X - E(X) is just another random variable, defined as
a function of X.)
If we want to measure the dispersion of X, then the sizes of the deviations are a
good start. But that’s a random variable, and we want a simple number. So what about
the average of the deviations?
Bad idea! The average of the deviations is always exactly zero. So, that’s a
useless concept. The trouble is that the deviations can be positive or negative, and they
cancel out when we take the average.
So, what about the average of the absolute values of the deviations? That’s a
better idea, but mathematicians have had better success by taking the squares of the
deviations. That makes them all positive, so the average really means something.
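A tiny simulation (my addition, not the author's) makes the point concrete for the spinner experiment, where E(X) = 1/2 and, as computed in Example 2, V(X) = 1/12:

```python
# Why we square the deviations: for the spinner experiment, the average
# deviation is essentially zero, while the mean square deviation comes out
# near V(X) = 1/12 (computed in Example 2).
import random
import statistics

xs = [random.random() for _ in range(100_000)]   # spinner outcomes
mu = 0.5                                         # E(X) for the spinner

devs = [x - mu for x in xs]
print(statistics.mean(devs))                     # close to 0: tells us nothing
print(statistics.mean([d**2 for d in devs]))     # close to 1/12 = 0.0833...
```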
Now: (X - E(X))^2 is the “square deviation.” So, E((X - E(X))^2) is the “mean
square deviation.”
Variance = Mean Square Deviation.
It may be hard to remember equation (11), but it’s not hard to remember the slogan in the
box.
So the variance is a summary measure of the dispersion of the random variable X.
It turns out to be a very useful measure from a theoretical point of view. But intuitively,
it is somewhat confusing, because the units are strange. For example, if X is measured in
inches, then V(X) is in square inches. If X is in hours, then V(X) is in square hours. If
you aren’t used to dealing in square hours, then you may find V(X) hard to use.
That’s why we have standard deviations. The standard deviation is just the square
root of the variance. In order to be cute, let’s just say “root” when we mean “square
root”:
Standard Deviation = Root Mean Square Deviation.
Have you heard the term “root-mean-square” before? Or the abbreviation, RMS? It is
used in several places, mostly in engineering applications, but also in measuring the
output of stereo speakers. It’s a kind of average. So, the standard deviation is a kind of
“average” or “typical” deviation of X from its own expected value.
Statisticians often use “standard” as a synonym for “root mean square.” So, now
you know where the “standard deviation” got its name. It’s not just a name, it’s a
formula!
6. Variances are never negative.
From equation (11), we can see that the variance of X is the expected value of a
square. Since squares are never negative, neither is this expected value.
V(X) is never negative.
Also, standard deviations are never negative.
7. Variances, Standard Deviations, and Repeated Experiments
Have you seen the terms “variance” and “standard deviation” in connection with a
list of numbers? For example: The standard deviation of the column of figures
4
5
6
7
8
9
10
is 2.0. Is that a familiar concept? If not, skip this section.
Suppose X is a random variable whose value depends on some experiment.
Suppose you repeat the experiment a large number of times, and write down the value of
X each time. You now have a list of numbers.
The mean of this list of numbers will be about E(X). The variance of this list of
numbers will be about V(X). The standard deviation will be about D(X). The more
times you repeat the experiment, the closer this approximation will be.
That’s the connection between the terms “variance” and “standard deviation” in
probability, and the same terms in statistics.
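Here is a quick simulation of that connection (my addition); the fair die is an arbitrary choice of experiment:

```python
# Repeating an experiment many times: the mean, variance, and standard
# deviation of the list of results approach E(X), V(X), and D(X).
# Illustrated with a fair die: E(X) = 3.5, V(X) = 35/12, D(X) = 1.707...
import random
import statistics

values = [random.randint(1, 6) for _ in range(100_000)]

print(statistics.mean(values))        # close to 3.5
print(statistics.pvariance(values))   # close to 35/12 = 2.9166...
print(statistics.pstdev(values))      # close to 1.7078
```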
8. E, V, and D are operators.
The notations E( ), V( ), and D( ) look like functions. In fact, they really are
functions, but their domains aren’t sets of numbers. Instead, the domains of E( ), V( ),
and D( ) are sets of random variables.
That means that the argument of E( ) or V( ) or D( ) is always a random variable.
And the value of E( ), V( ), or D( ) is always a number (if it exists). In other words: If
we write
E(X)=t
or
V(X)=w
then X is a random variable, and t and w are numbers.
A function whose domain is a set of functions is usually called an “operator.”
That’s actually just a fancy synonym for function. Since random variables are functions,
that means E( ), V( ), and D( ) are operators.
A fine point of notation: We hardly ever write “E(x)”. We use upper-case letters
for random variables, and lower-case letters for placeholders and other stuff. So, people
who write “E(x)” usually mean “E(X)”.
9. An easy example: Constant random variables
Sometimes a random variable isn’t really random. For example, suppose we have
some sample space S, and we define a random variable W by
W(s) = 20 for every outcome s in S.
Then W satisfies the definition of a random variable, even though it is a fairly silly one.
What is the pmf of W? It’s simple: p_W(20) = 1.0, and p_W(anything else) = 0.
What is the expected value of W? We have: E(W) = 20, obviously.
What is the variance of W? Equation (11) is useful here. The deviation W –
E(W) is always zero, so the square deviation is always zero, so the mean square deviation
is zero. So, the variance V(W) = 0. Also, D(W) = 0.
Constant random variables aren’t much use in themselves, but they come up once
in a while in computations, so it’s worth remembering:
E(constant) = the constant; and
V(constant) = D(constant) = 0.
For example,
E(20) = 20.        (13)
But isn’t that inconsistent with the previous section? Doesn’t the argument of E( )
have to be a random variable? We have to be careful how we interpret equation (13).
The last “20” is just a number. But the first “20” should be understood as a random
variable which happens to have the constant value 20.
10. Expectation of a sum
We have seen in an earlier set of notes that
E(X + Y) = E(X) + E(Y)        (14)
for any two random variables X and Y.
Example: Each day Joe eats a random number of apples and a random number of oranges. On
average, Joe eats 2.2 apples per day. On average, Joe eats 1.1 oranges per day. How many fruits does Joe
eat per day, on average?
Solution: E(apples + oranges) = E(apples) + E(oranges) = 2.2 + 1.1 = 3.3.
The example is in small type because it is so obvious. We have been using
equation (14) all our lives. Now we are just expressing it formally.
We didn’t prove equation (14) before, but now we have the machinery to prove it,
at least in the case of a discrete sample space. We saw an example in homework #5.
Here is the proof, based on equation (3) above.
Suppose Z = X + Y. That means that Z(s) = X(s) + Y(s) for every outcome s.
Therefore, using equation (3):
E(Z) = \sum_s Z(s) P(s)
     = \sum_s \bigl( X(s) + Y(s) \bigr) P(s)
     = \sum_s X(s) P(s) + \sum_s Y(s) P(s)
     = E(X) + E(Y).
In two words:
“Expectations add.”
11. Expectation is a Linear Operator
We also have this identity:
E( aX ) = a E(X)
for every random variable X and every number a.
Example: Joe sells a random number of books every day, for $10 each. On average, he sells 6
books per day. How much money does he receive per day, on average?
Solution: Let X = number of books sold, Y = number of dollars received. Then Y = 10X. So,
E(Y) = E( 10 X ) = 10 E(X) = 10 * 6 = 60.
This is also pretty easy to prove from equation (3).
Combining the results from this section and last section, we have
E(X + Y) = E(X) + E(Y) and
E(aX)
= a E(X).
Any operator with these properties is called a “linear operator.” So, we can summarize
these results by saying that E( ) is a linear operator. (You don’t have to remember this
definition, and we won’t use it again.) By the way, V( ) and D( ) are not linear operators.
We can also combine these results to get formulas like this:
E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n).
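As a numerical illustration of this combined formula (my addition; the die and spinner are arbitrary choices):

```python
# Linearity of E( ) by simulation: E(3*X1 + 5*X2) = 3*E(X1) + 5*E(X2).
# X1 is a fair die (E = 3.5) and X2 a spinner value (E = 0.5); linearity
# does not require X1 and X2 to be independent.
import random
import statistics

zs = [3 * random.randint(1, 6) + 5 * random.random() for _ in range(100_000)]
print(statistics.mean(zs))            # close to 3 * 3.5 + 5 * 0.5 = 13.0
```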
12. Our two definitions of variance are the same
Let’s prove that equation (11) gives the same result as equation (8). We calculate
starting with equation (11):

V(X) = E((X - E(X))^2)
     = E(X^2 - 2X E(X) + E(X)^2)          (I just expanded the square)
     = E(X^2) - E(2X E(X)) + E(E(X)^2)    (because “expectations add”)
     = E(X^2) - 2 E(X) E(X) + E(E(X)^2)   (because 2E(X) is just a number, and E(aX) = aE(X))
     = E(X^2) - 2 E(X) E(X) + E(X)^2      (because E(X)^2 is just a number too, and E(constant) = constant)
     = E(X^2) - E(X)^2.
13. How do constants affect variance and standard deviation?
Theorem: If X is any random variable and a is any constant, then
V(aX) = a^2 V(X).
Proof:
V(aX) = E((aX)^2) - E(aX)^2
      = E(a^2 X^2) - (a E(X))^2
      = a^2 E(X^2) - a^2 E(X)^2
      = a^2 (E(X^2) - E(X)^2)
      = a^2 V(X).
Theorem: If X is any random variable and a is any constant, then
D(aX) = |a| D(X).
Proof: In the previous theorem, just take square roots. //
Both of these results are tricky. It’s important to remember the square in the first
theorem, and the absolute value in the second theorem.
Theorem: If X is any random variable and c is any constant, then
V( X + c ) = V( X ) and
D( X + c ) = D( X ).
Proof. Omitted.
The last theorem says: “Adding a constant doesn’t affect variance.”
Example. Suppose X is a random variable, with E(X) = 110 and V(X) = 4.
Suppose that Y is related to X by
Y = –7 X + 23.
What are E(Y), V(Y), and D(Y) ?
Solution. First,
E( Y ) = E (–7 X + 23 )
= E(-7X) + E(23)
(because expectations add)
= –7 E(X) + E(23)
(because E(cX)=cE(X))
= –7 E(X) + 23
(because E(constant)=constant)
= -7 * 110 + 23 = -747.
Next,
V( Y ) = V (–7 X + 23)
= V ( -7X )
(because adding c doesn’t affect V)
= (-7)^2 V(X)
(by the last section)
= 49 * 4 = 196.
Finally,
D(Y) = \sqrt{V(Y)} = \sqrt{196} = 14, OR
D( Y ) = D ( -7X )
= |-7| D(X)
(by the last section)
= 7 * 2 = 14.
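A simulation check of this example (my addition): the example does not say what distribution X has, so the sketch below arbitrarily makes X normal with mean 110 and variance 4.

```python
# Checking the example: E(X) = 110, V(X) = 4, and Y = -7*X + 23 give
# E(Y) = -747, V(Y) = 196, D(Y) = 14. Making X normal is an arbitrary
# choice here; the rules hold for any X with these two moments.
import random
import statistics

xs = [random.gauss(110, 2) for _ in range(200_000)]   # sd 2, so V(X) = 4
ys = [-7 * x + 23 for x in xs]

print(statistics.mean(ys))            # close to -747
print(statistics.pvariance(ys))       # close to 196
print(statistics.pstdev(ys))          # close to 14
```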
14. Independent Random Variables
When X and Y are independent, we have:
E(XY) = E(X) E(Y).
(We saw in homework #7a that this isn’t always true when X and Y aren’t independent.)
We aren’t ready to prove this result yet, but we can use it, as follows:
Theorem. If X and Y are independent, then
V( X + Y ) = V(X) + V(Y).
(This isn’t usually true if X and
Y are not independent.)
Proof: If X and Y are independent, then
V(X + Y) = E((X + Y)^2) - E(X + Y)^2
         = E(X^2 + 2XY + Y^2) - E(X + Y)^2                 (expanding the square)
         = E(X^2) + 2E(XY) + E(Y^2) - E(X + Y)^2
         = E(X^2) + 2E(XY) + E(Y^2) - (E(X) + E(Y))^2      (expectations add)
         = E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
         = (E(X^2) - E(X)^2) + (E(Y^2) - E(Y)^2) + 2(E(XY) - E(X)E(Y))
           (note the last term is zero when X and Y are independent)
         = V(X) + V(Y).
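And a final simulation check of this theorem (my addition; the particular pair of independent random variables is an arbitrary choice):

```python
# V(X + Y) = V(X) + V(Y) for independent X and Y, by simulation.
# X is uniform on [0, 1] (V = 1/12) and Y an independent fair die
# (V = 35/12), so V(X + Y) should come out near 36/12 = 3.
import random
import statistics

sums = [random.random() + random.randint(1, 6) for _ in range(200_000)]
print(statistics.pvariance(sums))     # close to 3.0
```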