1/08/03 252oneal (Open this document in 'Outline' view!)
ECONOMICS 252 COURSE OUTLINE
A. Parameter Estimation
1. Review of the Normal Distribution
See 251greatD, 251distrex2, 251distrex3, 251distrex4
2. Point and Interval Estimation
3. A Confidence Interval for the Mean when the
Population Variance is Known.
a. A Two-Sided Confidence Interval
An interval of this type is used in two situations:
(i) where the population variance, $\sigma^2$, is in fact known and the sample size is relatively large; or
(ii) where the variance is not known and the sample variance, $s^2$, is used to replace $\sigma^2$, but the degrees of freedom are so large that the appropriate value of $t^{(n-1)}$ is not very different from $z$.
The first of these situations is not very realistic, but serves as a good introduction to confidence intervals. The formula for this type of confidence interval for the mean is $\mu = \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
Note: If $n > .05N$, use $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$ ($n$ is sample size and $N$ is population size). See 252onealex1.
Don’t use this method unless you know the population variance.
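The two-sided formula can be sketched in Python as follows. This is only an illustration: the function name `z_interval`, its parameters, and the sample numbers are hypothetical and are not taken from the course examples. The finite population correction is applied only when a population size is supplied and $n > .05N$.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf=0.95, N=None):
    """Two-sided interval for the mean when sigma is known (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    se = sigma / sqrt(n)                           # sigma_xbar
    if N is not None and n > 0.05 * N:
        se *= sqrt((N - n) / (N - 1))              # finite population correction
    return xbar - z * se, xbar + z * se

# Made-up numbers: xbar = 50, sigma = 10, n = 100
lo, hi = z_interval(50.0, 10.0, 100)
# Same sample drawn from a population of only N = 500 items
lo_fpc, hi_fpc = z_interval(50.0, 10.0, 100, N=500)
print(lo, hi, lo_fpc, hi_fpc)
```

With the correction the interval is narrower, because a sample that is a large share of the population leaves less uncertainty about the mean.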
b. A One-Sided Confidence Interval.
There are two types of one-sided confidence interval for the mean: (i) an upper bound and (ii) a lower bound. These have the form $\mu \le \bar{x} + z_{\alpha}\,\sigma_{\bar{x}}$ and $\mu \ge \bar{x} - z_{\alpha}\,\sigma_{\bar{x}}$, respectively. An example is in 252oneaex1a.
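A minimal sketch of the two one-sided bounds, assuming a known $\sigma$. The helper name and the numbers are hypothetical. Note that the whole of $\alpha$ goes into one tail, so the code uses $z_{\alpha}$ rather than $z_{\alpha/2}$.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_bounds(xbar, sigma, n, conf=0.95):
    """Return (upper bound, lower bound) for the mean; each is a separate one-sided interval."""
    z = NormalDist().inv_cdf(conf)     # z_alpha: all of alpha is in a single tail
    se = sigma / sqrt(n)
    return xbar + z * se, xbar - z * se

# Made-up numbers: xbar = 50, sigma = 10, n = 100
upper, lower = one_sided_bounds(50.0, 10.0, 100)
print(f"mu <= {upper:.3f}   or   mu >= {lower:.3f}")
```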
4. A Confidence Interval for the Mean when the
Population Variance is not Known.
"The variance is not known " implies that there is no previous
knowledge or assumption about the value of $\sigma^2$. Knowing $s^2$ is having a guess as to what the variance is; it is not the same as knowing the variance. If the population distribution is normal or approximately normal, the formula for a two-sided confidence interval for the mean is $\mu = \bar{x} \pm t^{(n-1)}_{\alpha/2}\, s_{\bar{x}}$, where $s_{\bar{x}} = \frac{s}{\sqrt{n}}$.
Note: If $n > .05N$, use $s_{\bar{x}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}$.
See 252onealex2 and 252oneaex3.
Note: this is the more common case – if you do not know the
population variance and the sample size is not very large, using z
instead of t is a very bad idea.
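The sketch below illustrates the t-based interval and the cost of using $z$ by mistake. It assumes the critical value is read from a t table and passed in, since the Python standard library has no t quantile function. The helper name and the data are hypothetical; 2.131 is the usual table value of $t_{.025}$ with 15 degrees of freedom.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit, N=None):
    """Two-sided interval for the mean using a table value t_crit (hypothetical helper)."""
    se = s / sqrt(n)                               # s_xbar
    if N is not None and n > 0.05 * N:
        se *= sqrt((N - n) / (N - 1))              # finite population correction
    return xbar - t_crit * se, xbar + t_crit * se

# Made-up sample: n = 16 (15 degrees of freedom), xbar = 50, s = 8
t_lo, t_hi = t_interval(50.0, 8.0, 16, t_crit=2.131)
# Same data with z = 1.960 wrongly used in place of t
z_lo, z_hi = t_interval(50.0, 8.0, 16, t_crit=1.960)
print(t_hi - t_lo, z_hi - z_lo)
```

The interval built with $z$ is too narrow, so it would claim more precision than a sample of 16 supports.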
5. Deciding on Sample Size when working with a Mean
The formula usually suggested is $n = \frac{z^2 \sigma^2}{e^2}$, where, if $\sigma$ is not known, it can be approximated by $\sigma \approx \frac{x_{.001} - x_{.999}}{6}$.
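As a rough illustration of the sample size formula, the sketch below computes $n$ for a made-up target. The function names are hypothetical, and rounding up to the next whole observation is an assumption added here, not something the outline states.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, e, conf=0.95):
    """n = z^2 sigma^2 / e^2, rounded up (rounding up is an assumption of this sketch)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z**2 * sigma**2 / e**2)

def sigma_from_range(high, low):
    """Rough guess at sigma when it is unknown: one sixth of the plausible range."""
    return (high - low) / 6

# Made-up target: estimate the mean to within e = 2 when sigma is about 10
n_needed = sample_size_for_mean(sigma=10.0, e=2.0)
# Made-up range: values are believed to lie almost entirely between 20 and 80
sigma_guess = sigma_from_range(80.0, 20.0)
print(n_needed, sigma_guess)
```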
6. A Confidence Interval for a Proportion.
(a. Small Samples.
Table 16 (ConfidenceIntervalsBinominalDistribution.pdf) gives confidence intervals for proportions.
These tables are of use when the conditions under which one can use the normal distribution do not exist. For example, if $n = 10$ and the sample proportion is $\bar{p} = .5$, and we wish to find a 95% confidence interval, we can look at the horizontal axis of the upper table. There we can find $\bar{p} = .5$ and look up to find the upper and lower curves for $n = 10$. The vertical line at $\bar{p} = .5$ intersects these curves. The lower curve meets the vertical line at about $p = .175$. (Read up the vertical axis.) The upper curve meets the vertical line at about $p = .825$, so that our 95% confidence interval is about $.175 \le p \le .825$.)
b. Large Samples.
More usually, using the normal approximation to the binomial distribution, and using $p$ for the population probability of success and $q$ for the population probability of failure, and letting $\bar{p}$ and $\bar{q}$ be the corresponding sample quantities, we can write $p = \bar{p} \pm z_{\alpha/2}\, s_{\bar{p}}$, where $s_{\bar{p}} = \sqrt{\frac{\bar{p}\bar{q}}{n}}$ and $\bar{q} = 1 - \bar{p}$. An example is in 251 proport.
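A short sketch of the large-sample interval for a proportion. The helper name and the counts are hypothetical and are not taken from the 251 proport example.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, conf=0.95):
    """Normal-approximation interval: pbar +/- z_{alpha/2} * sqrt(pbar*qbar/n)."""
    pbar = successes / n
    qbar = 1 - pbar
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    s_pbar = sqrt(pbar * qbar / n)
    return pbar - z * s_pbar, pbar + z * s_pbar

# Made-up data: 40 successes in a sample of 100
p_lo, p_hi = proportion_interval(40, 100)
print(f"{p_lo:.4f} <= p <= {p_hi:.4f}")
```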
c. Deciding on Sample Size.
The usually suggested formula is $n = \frac{pqz^2}{e^2}$, but since $p$ is usually unknown, a conservative choice is to set $p = 0.5$.
This is the formula everyone forgets that we covered.
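To show why $p = 0.5$ is the conservative choice, the sketch below evaluates the formula for two values of $p$. The function name and the error target are hypothetical, and rounding up is an assumption of this sketch.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(e, p=0.5, conf=0.95):
    """n = p*q*z^2 / e^2, rounded up (rounding up is an assumption of this sketch)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(p * (1 - p) * z**2 / e**2)

# Made-up target: a 95% interval no wider than plus or minus 3 percentage points
n_conservative = sample_size_for_proportion(e=0.03)          # p = 0.5
n_if_p_is_small = sample_size_for_proportion(e=0.03, p=0.2)  # smaller, since pq < .25
print(n_conservative, n_if_p_is_small)
```

Because $pq$ is largest at $p = 0.5$, no other value of $p$ can require a larger sample.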
7. A Confidence Interval for a Variance.
This method is only appropriate when the population
distribution is normal or approximately normal.
For small samples
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$,
but if the degrees of freedom are too large for the chi-square table use
$\frac{s\sqrt{2DF}}{z_{\alpha/2} + \sqrt{2DF}} \le \sigma \le \frac{s\sqrt{2DF}}{-z_{\alpha/2} + \sqrt{2DF}}$. An example is in 252oneaex4.
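Both formulas can be sketched as below. The chi-square values are assumed to come from a table with $n - 1$ degrees of freedom and are passed in by hand; 32.852 and 8.907 are the usual table values for 19 degrees of freedom at .025 and .975. The function names and sample figures are hypothetical. Note that the small-sample formula bounds $\sigma^2$, while the large-sample formula bounds $\sigma$.

```python
from math import sqrt
from statistics import NormalDist

def variance_interval(s2, n, chi2_upper, chi2_lower):
    """Interval for sigma^2; chi2_upper has alpha/2 above it, chi2_lower has 1 - alpha/2 above it."""
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower

def sigma_interval_large_df(s, df, conf=0.95):
    """Interval for sigma (not sigma^2) when df is too large for the chi-square table."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    root = sqrt(2 * df)
    return s * root / (z + root), s * root / (-z + root)

# Made-up small sample: n = 20, s^2 = 4
var_lo, var_hi = variance_interval(4.0, 20, chi2_upper=32.852, chi2_lower=8.907)
# Made-up large sample: 200 degrees of freedom, s = 2
sd_lo, sd_hi = sigma_interval_large_df(2.0, 200)
print(var_lo, var_hi, sd_lo, sd_hi)
```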
(8. Appendix
A Confidence Interval for a Median.
In a situation where the population distribution is not normal,
it is often more appropriate to find the median than the mean. The
process of finding a confidence interval for a median is based on one
simple fact: the probability that a single number picked at random from
a population is above (or below) the median is 50%. Similarly, the
probability that any two numbers picked at random from a population
are both above (or both below) the median is 25%. This comes from the multiplication rule: If $A$ is the event that the first number is above the median, and $B$ is the event that the second number is above the median, then $P(A \cap B) = P(A)\,P(B)$ if $A$ and $B$ are independent events. If the probability of both numbers being above the median is 25%, and the probability of both numbers being below the median is 25%, then the probability that both numbers are on the same side of the median is 50%. This is due to the addition rule: Let event $C$ be "both numbers are above the median," and event $D$ be "both numbers are below the median." Then event $C \cup D$ is "both numbers are on the same side of the median." The addition rule says that if $C$ and $D$ are mutually exclusive, $P(C \cup D) = P(C) + P(D)$. Finally, if the probability
that both numbers are on the same side of the median is 50%, then the
probability that the two numbers are on opposite sides of the median is
also 50%. This means that, since any two numbers picked from the
sample have a 50% chance of bracketing the median, these two
numbers constitute a 50% confidence interval.
Note that, since p , the probability that any one number is
above the median, is 0.5, and q , the probability that any one number is
below the median, is also 0.5, we have a problem that resembles
finding the distribution of the number of heads on two tosses of a fair
coin. If we call a head a success, the distribution of heads on two
tosses is described by the binomial distribution with n (the number of
tries) set at 2, and p (the probability of success on one try) set at 0.5.
For convenience, we will use q (the probability of failure on one try)
for the probability that one number is below the median or of getting a
tail on one toss of a fair coin. It is always true that $q = 1 - p$. The formula for the binomial distribution is $P(x) = C^n_x p^x q^{n-x}$, where $x$ is the number of successes. For the probability of two successes (heads) in 2 tries, we find that $P(2) = C^2_2 (.5)^2 (.5)^0 = 1(.25)(1) = .25$. We find the probability of two heads or two tails in two tries by noting that the probability of two failures (tails) is $P(0) = C^2_0 (.5)^0 (.5)^2 = .25$. Thus the probability of two heads or two tails is $P(2) + P(0) = .25 + .25 = .50$. This is the same as the probability of two randomly picked numbers both being on the same side of the median.
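The two-toss arithmetic can be checked with a few lines of Python. The helper name `binom_pmf` is hypothetical; it simply evaluates the binomial formula.

```python
from math import comb

def binom_pmf(x, n, p=0.5):
    """P(x) = C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Two numbers on the same side of the median: two "heads" or two "tails"
same_side = binom_pmf(2, 2) + binom_pmf(0, 2)
# Exactly one "head": the two numbers bracket the median
opposite_sides = binom_pmf(1, 2)
print(same_side, opposite_sides)
```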
To take this a bit further, let us assume that we take a sample
of n numbers from a population and then take two numbers at equal
distances from the ends of the sample (for example, the fourth lowest
and the fourth highest of a sample of 20 numbers). We will find that it
is relatively easy to figure out the probability that these numbers bracket
the median, and this will be our confidence level. This process requires
some new thinking because: (i) we find our confidence interval without
using a point estimate as we did in every previously studied method for
constructing a confidence interval; and (ii) we find the interval first and
then figure out its confidence level instead of starting with a confidence
level and then figuring out the interval. This process serves as an
introduction to the field of nonparametric statistics, which is largely
made up of methods that construct intervals and perform tests without assuming that the
parent distribution (the distribution of the population from which the
sample is drawn) is normal. In the case of finding a median, the
process to be explained would be unnecessary if the parent population
were normal, because in a normal population the mean and median are
identical. Therefore, if the parent population is normal, we could use
a method for finding a confidence interval for the mean in place of a
method for finding a confidence interval for a median.
Assume that we pick a sample of four from a population, and that this sample, when put in ascending order, is 20, 25, 29, 30. If we use two numbers at equal distances from the ends as our confidence interval, we can use $20 \le \nu \le 30$ or $25 \le \nu \le 29$ ($\nu$ (nu) is our symbol for a population median). The first of these intervals ($20 \le \nu \le 30$) is wrong only if all four numbers in the sample are below the median or all four numbers are above the median. The probability that all four are above the median is the same as the probability of four heads in four tosses, $P(4) = C^4_4 (.5)^4 (.5)^0 = .0625$. The probability that all four numbers are below the median is the same as the probability of four tails on four tosses, $P(0) = C^4_0 (.5)^0 (.5)^4 = .0625$. We can find the probability of all four being above the median from a cumulative binomial table by noting that, for $n = 4$, $P(x = 4) = P(x \ge 4) = 1 - P(x \le 3)$.
The binomial table will tell us that, for $p = .5$, $P(x \le 4) = 1$ and $P(x \le 3) = .9375$, so $P(x = 4) = 1 - .9375 = .0625$. (bin) Since the probability that all four numbers are below the median, $P(x = 0)$, is the same as the probability that all four numbers are above the median, the probability that the two numbers do not bracket the median (the probability that we are wrong or the significance level) is $\alpha = 2P(x = 0) = 2(.0625) = .1250$.
The confidence level is thus $1 - \alpha = 1 - 2P(x = 0) = 1 - 2(.0625) = .8750$.
Now try picking the confidence interval $25 \le \nu \le 29$, by choosing the numbers $x_2$ and $x_3$, that is, the second from the bottom and the second from the top in the ordered sample, 20, 25, 29, 30. This interval is invalid if (i) the lowest three or more numbers in the sample are below the median (equivalent to three or more tails when a coin is tossed four times), or (ii) the highest three or more numbers in the sample are above the median (equivalent to three or more heads). The probability of the first of these events is (for $n = 4$ and $p = .5$) $P(x \le 1)$, and the probability of the second event is $P(x \ge 3)$. But, using the binomial table, we find that $P(x \ge 3) = 1 - P(x \le 2) = P(x \le 1) = .3125$. (bin)
So the probability that the interval does not bracket the median is $2P(x \le 1) = 2(.3125) = .6250$, and the confidence level is $1 - \alpha = 1 - 2P(x \le 1) = 1 - 2(.3125) = .3750$.
Generalize this to a situation where we take a random sample of $n$ items from a population and put the numbers in ascending order so that $x_1 \le x_2 \le x_3 \le \cdots \le x_{n-1} \le x_n$. Now pick $x_k$ and $x_{n-k+1}$, the numbers that are the $k$th from the bottom and the $k$th from the top, respectively. This interval is invalid if (i) all the numbers included in the interval and all the numbers below the interval are below the median or (ii) all the numbers in the interval and all the numbers above the interval are above the median. The probability of the first event is $P(x \le k-1)$ and the probability of the second event is $P(x \ge n-k+1) = P(x \le k-1)$ (the equality is due to the symmetry of the binomial distribution for $p = .5$). So $\alpha = 2P(x \le k-1)$, and the confidence level is $1 - \alpha = 1 - 2P(x \le k-1)$. For example, if we take a sample of 100 items and put them in order and then use the interval $x_{38} \le \nu \le x_{63}$, that is, the 38th number from the bottom and the 38th number from the top, the confidence level (from the binomial table for $n = 100$ and $p = .5$) is $1 - \alpha = 1 - 2P(x \le 37) = 1 - 2(.0060) = .9880$. (bin)
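The confidence level $1 - 2P(x \le k-1)$ can also be computed directly instead of being read from a cumulative binomial table. The sketch below is one way to do it; the function names are hypothetical, and it assumes $p = .5$ throughout, as the median argument requires.

```python
from math import comb

def binom_cdf_half(k, n):
    """P(x <= k) for a binomial distribution with p = .5."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n

def median_interval_confidence(n, k):
    """Confidence level of the interval from the k-th lowest to the k-th highest of n values."""
    return 1 - 2 * binom_cdf_half(k - 1, n)

outer_of_four = median_interval_confidence(4, 1)       # interval x_1 to x_4
inner_of_four = median_interval_confidence(4, 2)       # interval x_2 to x_3
sample_of_100 = median_interval_confidence(100, 38)    # interval x_38 to x_63
print(outer_of_four, inner_of_four, round(sample_of_100, 4))
```

The first two values reproduce the .8750 and .3750 found for the sample of four, and the third agrees with the table-based .9880 to about three decimal places.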
There will be some situations in which we cannot find $P(x \le k-1)$ on the cumulative binomial table. Then we must use a normal approximation to the binomial distribution, that is (using a continuity correction), find the normal probability,
$P(x \le k-1) \approx P\left(x \le k - 1 + \tfrac{1}{2}\right) = P\left(z \le \frac{k - 1 + \frac{1}{2} - np}{\sqrt{npq}}\right) = P\left(z \le \frac{k - .5 - .5n}{.5\sqrt{n}}\right)$.
(In the last part of this equality, .5 was substituted for both $p$ and $q$.)
This takes us back to a more conventional formulation for the confidence interval because we can choose $k$ so that $-z_{\alpha/2} = \frac{k - .5 - .5n}{.5\sqrt{n}}$.
If we solve this equation for $k$, we find that $k = \frac{n + 1 - z_{\alpha/2}\sqrt{n}}{2}$.
Thus, if we want a 95% confidence interval for the median and we take a sample of $n = 150$, we pick $k = \frac{150 + 1 - 1.96\sqrt{150}}{2} = 63.4975$, which we round down to $k = 63$. Our interval will then be $x_{63} \le \nu \le x_{88}$.) (ttable)
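The choice of $k$ from the normal approximation can be sketched as follows. The function name is hypothetical, and rounding $k$ down is an assumption made here because it matches the $x_{63}$ used in the example for $n = 150$.

```python
from math import floor, sqrt
from statistics import NormalDist

def median_interval_ranks(n, conf=0.95):
    """Ranks (k, n - k + 1) of the order statistics that bound the median."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}
    k = floor((n + 1 - z * sqrt(n)) / 2)           # rounded down (assumption of this sketch)
    return k, n - k + 1

ranks_150 = median_interval_ranks(150)
print(ranks_150)
```

For $n = 150$ this gives the 63rd value from the bottom and the 88th value in the ordered sample, which is the 63rd from the top.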
© 2002 R. E. Bove