Download 10.0 Central Limit Theorem

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Addition wikipedia , lookup

Karhunen–Loève theorem wikipedia , lookup

Non-standard calculus wikipedia , lookup

Central limit theorem wikipedia , lookup

Risk aversion (psychology) wikipedia , lookup

Expected value wikipedia , lookup

Law of large numbers wikipedia , lookup

Transcript
10.0 Central Limit Theorem
• Answer Questions
• Review Bayes
• Box Models
• Expected Value and Standard Error
• Central Limit Theorem
1
10.1 Bayes Review
Recall: Let A1 , . . . , An be mutually exclusive and suppose that
P [A1 or A2 or . . . or An ] = 1. Then
P [B|A1 ] ∗ P [A1 ]
P [A1 |B] = Pn
.
i=1 P [B|Ai ] ∗ P [Ai ]
Suppose that 70% of Republicans favor the medicare prescription plan,
as do 20% of Democrats and 40% of Independents. And assume that
35% of the population are Republicans, 45% are Democrats, and 20% are
Independents. You meet someone who says they support the plan; what
is the chance that they are Republican?
2
Here the Ai are the three parties, which we assume are mutually
exclusive and exhaustive.
A1 : Repulicans
A2 : Democrats
A3 : Independents
We want to find the probability of a Republican given support in terms
of the probabilities of support given the party affiliations.
Pair and Discuss
3
So you want to find:
P [ Rep | supp ] = P [ Rep and supp ]/P [ supp ]
P[ s | R ] ∗ P[ R ]
=
P[ s | R ] ∗ P[ R ] + P[ s | D ] ∗ P[ D ] + P[ s | I ] ∗ P[ I ]
= (.7) ∗ (.35)/[(.7) ∗ (.35) + (.2) ∗ (.45) + (.4)(.2)]
= .590.
4
10.2 Box Models
A Box Model describes a process in terms of making repeated draws,
with replacement, from a box containing numbers.
Since draws are made with replacement, the outcomes in a series of
draws are independent. The value on the first draw does not affect the
value on the second.
Box models describe: repeated rolls of a die, repeated tosses of a coin
(fair or unfair), and (approximately) drawing a random sample of U.S.
citizens and asking for whom they plan to vote.
5
For a box model, the expected value is the average of the numbers in
the box.
For categorical data, such as voting for Bush or Kerry, one averages
zeroes (Bush) and ones (Kerry). When averaging zeroes and ones, the
result is just a proportion, or the total number of people voting for Kerry
divided by the total number of people whom you ask.
The Law of Averages says that if one makes many draws from the box
and averages the results, that average will approach to the expected
value of the box.
6
10.3 Expected Value and Standard Error
Suppose that a box contains B numbers, X1 , . . . , XB . Then the expected
value of the box is is
B
1 X
EV = X̄ =
Xi
B i=1
and the standard deviation of the box (X1 , . . . , XB ) is
r
1
sd =
[(X1 − EV )2 + · · · + (XB − EV )2 ]
B
v
!
u
B
u 1 X
Xi2 − (EV )2 .
= t
B i=1
This is just finding the mean and sd for a list of numbers (X1 , . . . , XB ).
7
The standard error for the average of n draws from a box (with
replacement) is:
sd
se = √ .
n
sd ⇐ standard deviation of a box.
se ⇐ standard error for n draws from a box,.
se is the sd of the averages found in many samples of size n.
8
Recall that The Law of Averages says that if one makes many
draws from the box and averages the results, that average will
approach to the expected value of the box.
Another way to understand law of average:
The standard error is the likely size of the difference between the average
of n draws from the box and the expected value of the box. And the
standard error se goes to zero as n → ∞. It means that the sample
average is a good estimate of the average in the box, and the accuracy of
the estimate improves as you take more and more draws from the box.
In the long run, average will approach to the expected value of the box
9
The standard error for the sum of n draws (with replacement) is:
√
se = n × sd.
This is the sd of many sums of size n.
Analgously to standard error for averages, the standard error of the sum
is the likely size of the difference between the sum of n draws from the
box and n times the expected value of the box.
Note that as n → ∞, this standard error does not go to zero. This means
that as the number of draws increases, the likely difference between the
sum of the draws and nEV gets larger, rather than smaller.
This concern about the sum, rather than the average, arises in contexts
such as investment or gambling, where the total return from multiple
trials is important, not the average return.
10
10.4 The Central Limit Theorem
The Central Limit Theorem is one of the high-water marks of
mathematical thinking. It was worked upon by James Bernoulli,
Abraham de Moivre, and Alan Turing. Over the centuries, the theory
improved from special cases to a very general rule.
Essentially, the Central Limit Theorem allows one to describe how
accurately the Law of Averages works. That is, what is the chance that
the sample average is more than d away from the true EV?
11
Formally, the Central Limit Theorem for averages says:
√
X̄ ∼
˙ N (EV, sd/ n)
X̄ − EV
√ ∼
˙ N (0, 1) (1)
sd/ n
or
where N (0, 1) is the standard normal, X̄ is the average of n draws, EV
is the expected value of the box, and sd is the standard deviation of the
box.
√ ) is a random number that is approximately
This means that ( X̄−EV
sd/ n
normal with mean zero and standard deviation one. Note that EV and
sd is a fixed number and X̄ is a random number.
The approximation gets better as n gets larger.
Modifications of this formula hold for many other situations in practice.
12
Another version of the Central Limit holds for sums:
√
sum ∼
˙ N (nEV, n sd)
sum − nEV
√
∼
˙ N (0, 1). (2)
n sd
or
Note that sum = nX̄.
This formula is useful when calculating the chance of winning a given
amount of money when gambling, or getting more than a specific score
on a test.
With these two central limit formulas, one can answer all sorts of
practical questions.
13
Problem 1a: You want to estimate the average income of people in
Durham. Suppose the true mean income (EV) is $42,000 with sd of
$10,000. You draw a random sample of 100 households (n=100 ). What
is the probability that your sample estimate is over the true value by
$500 or more?
P[X̄ > 42, 000 + 500] =?
• What is the box model for this problem?
• What is the expected value of the box?
• What is the sd of the box?
• What formulae should we use? (1) [average] or (2) [sum]?
14
Recall that
√
X̄ ∼
˙ N (EV, sd/ n)
X̄ − EV
√ ∼
˙ N (0, 1) (1)
sd/ n
or
where N (0, 1) is the standard normal, X̄ is the average of n draws, EV
is the expected value of the box, and sd is the standard deviation of the
box.
15
P[X̄ > 42, 500] = P[X̄ − EV > 500]
= P[(X̄ − EV )/(se) > 500/(se)]
√
=
˙ P[Z > 500/(10, 000/ 100)]
= P[Z > .5].
From the standard normal table, we know this has chance (1/2) (100 38.29) = 30.85%, so the probability of the estimate being too high by
$500 is just .3085.
16
Note that in order to solve this, we made the unreasonable assumption
that we knew the true mean of the box and the sd. Later we shall relax
this assumption.
But this is not really the question one wants to ask in practice, nor is it
the kind of information that one really has from a survey.
17
Problem 1b: You want to estimate the average income of people in
Durham. You draw a random sample of 100 households and find the
sample mean is $42,500 and the sd of your sample is $10,000. What
is the approximate probability that you have overestimated the true
average income of Durham by $500 or more?
First, assume that the sd of your sample equals the sd of the box. This is
an approximation, and later we shall see a way to improve this.
We want to find
P[X̄ − EV > 500] = P[(X̄ − EV )/(se) > 500/(se)]
√
=
˙ P[Z > 500/(10, 000/ 100)]
= P[Z > .5].
The probability is .3085 that $42,500 is $500 too high (or more).
18
Problem 2: You are playing Red and Black in roulette. (A roulette
wheel has 38 pockets; 18 are red, 18 are black, and 2 are green—the
house takes all the money on green). You pick either red or black; if the
ball lands in the color you pick, you win a dollar. Otherwise you lose a
dollar.
Suppose you make 100 plays. What is the chance that you lose $10 or
more?
What is the box model?
What formulae should we use? (1) [average] or (2) [sum]?
.
19
Recall that another version of the Central Limit holds for sums:
√
sum ∼
˙ N (nEV, n sd)
sum − nEV
√
∼
˙ N (0, 1). (2)
n sd
or
20
There are 38 tickets, and 18 are labelled 1 and the 20 are labelled -1.
So the expected value of the box is
EV
1
[1 + 1 + · · · + 1 + (−1) + (−1) + · · · + (−1)]
=
38
1
[−2]
=
38
= −1/19.
The standard deviation of the box is
v
u
38
u 1 X
sd = t(
Xi2 ) − (EV )2
38 i=1
p
=
1 − (−1/19)2
= .998614.
21
The probability of losing more than $10 or more in 100 plays is
P[sum < −10] = P[sum − nEV < −10 − nEV ]
−10 − nEV
sum − nEV
√
√
<
]
= P[
nsd
nsd
−10 − nEV
√
]
=
˙ P[Z <
nsd
= P[Z < [−10 − (100)(−1/19)]/(10 ∗ .998614)]
= P[Z < −.47434].
¿From the standard normal table, the chance of falling between .47434
and -.4743 is about .3473. Thus the chance of being below -.47434 is (1 .3473)/2 = .3264.
22