Sampling Distributions
Fall 2001
B6014: Managerial Statistics
Professor Paul Glasserman
403 Uris Hall
Sampled Data
1. So far, in all our probability calculations we have assumed that we know all quantities
needed to solve the problem:
• Portfolio problems: To find the expected return and standard deviation of a portfolio, we assumed we knew the mean and standard deviation of the returns of the
underlying stocks.
• Potato chip example: To find the proportion of bags below the 8-ounce minimum,
we assumed we knew the mean and standard deviation of the weight of chips in each
bag.
In practice, these types of parameters are not given to us; we must estimate them from
data.
2. Statistical analysis usually proceeds along the following lines:
(a) Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty; e.g., assume that a certain quantity follows a normal
distribution.
(b) Use data to estimate the unknown parameters in the model.
(c) Plug the estimated parameters into the model in order to make predictions from
the model.
The first step, picking a model, must be based on an understanding of the situation to
be modeled. Which assumptions are plausible? Which are not? These questions are
answered by judgement, not by precise statistical techniques.
3. Examples:
• We might assume that daily changes in a stock price follow a normal distribution.
We might then use historical data to estimate the mean and standard deviation.
Once we have estimates, we might use the model to predict future price ranges or to
value an option on the stock.
• We might assume that demand for a fashion item is normally distributed. We might
then use historical data to estimate the mean and standard deviation. Once we have
estimates, we might use the model to set production levels.
4. The first step in understanding the process of estimation is understanding basic properties
of sampled data and sample statistics, since these are the basis of estimation.
5. When we talk about sampling it is always in the context of a fixed underlying population:
• If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50
from the population of all daily changes in IBM stock.
• If we ask 150 shoppers whether or not they buy corn flakes, we have a sample of size
150 from all possible shoppers.
If the population is very large (as in these examples), we generally treat it as though
it were infinite; this simplifies matters. Thus, we are primarily concerned with finite
samples from infinite populations.
6. A single sample from a population is a random variable. Its distribution is the population
distribution; e.g.,
• the distribution of a randomly selected daily change in IBM stock is the distribution
over all daily changes;
• the probability that a randomly selected shopper buys corn flakes is the proportion
of the entire population that buys corn flakes.
7. A random sample from a population is a set of randomly selected observations from
that population. If {X1 , . . . , Xn } are a random sample, then
• they are independent;
• they are identically distributed, all with the distribution of the underlying population.
8. A sample statistic is any quantity calculated from a random sample. The most familiar
example of a sample statistic is the sample mean X, given by
X = (X1 + X2 + · · · + Xn )/n.
The sample mean gives an estimate of the population mean µ = E[Xi].
Distribution of the Sample Mean
1. Every sample statistic is a random variable. Randomness is introduced through the
sampling mechanism.
2. As noted above, the sample mean X of a random sample {X1 , X2 , . . . , Xn } is an estimate
of the population mean µ = E[Xi ]. How good an estimate is it? How can we assess the
uncertainty in the estimate? To answer these questions, we need to examine the sampling
distribution of the sample mean; that is, the distribution of the random variable X.
3. We begin by assuming that the underlying population is normal with mean µ and variance
σ 2 . This means that Xi ∼ N (µ, σ 2 ) for all i. Moreover, the Xi ’s are independent, since
we assume we have a random sample.
4. Fact: The sum of independent normal random variables is normally distributed. The
usual rules for means and variances apply: the expected value of the sum is the sum of
the expected values; the variance of the sum is the sum of the variances (by independence).
5. Earlier Fact: Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.
6. Using these two facts, we find that if Xi ∼ N (µ, σ 2 ), i = 1, . . . , n, then
(X1 + X2 + . . . + Xn ) ∼ N (nµ, nσ 2 ),
and
X = (X1 + X2 + . . . + Xn )/n ∼ N (µ, σ 2 /n).
We conclude that the sample mean from a normal population has a normal distribution;
specifically, the sample mean from a random sample of size n from a population N (µ, σ 2 )
has distribution N (µ, σ 2 /n).
7. First consequence: E[X] = µ. In other words, the expected value of the sample mean is
the population mean; “on average” the sample mean correctly estimates the underlying
mean.
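The claim that the sample mean from a normal population has distribution N(µ, σ²/n) can be checked with a short simulation. This is an illustrative sketch, not part of the original notes; the parameter values µ = 10, σ = 2, n = 25 are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
mu, sigma, n = 10.0, 2.0, 25   # hypothetical population parameters
trials = 20000                 # number of independent samples to draw

# Draw many random samples of size n from N(mu, sigma^2)
# and record the sample mean of each.
sample_means = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(sum(xs) / n)

print(statistics.mean(sample_means))   # close to mu = 10
print(statistics.stdev(sample_means))  # close to sigma/sqrt(n) = 0.4
```

The average of the simulated sample means comes out near µ, and their spread near σ/√n, matching the two consequences derived above.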
8. The standard deviation of a sample statistic is called its standard error. Thus, we have
shown that the standard error of the sample mean is σ/√n, where σ is the underlying
standard deviation and n is the sample size.
9. Second consequence: Because the standard error of X is σ/√n, the uncertainty in this
estimate decreases as the sample size n increases. (That’s good.) However, the uncertainty
(as measured by the standard deviation) decreases rather slowly: to cut the standard
deviation in half, we need to collect four times as much data, because of the square root.
(That’s not so good, but that’s life.)
10. Example: Suppose the population standard deviation is σ = 50. If we have a sample of
size 100, the standard error is σ/√n = 50/√100 = 5. Suppose we would like to reduce
the standard error to 2.5 by collecting more data. This would reduce the uncertainty
in our estimate. We would need a total of 400 data points to accomplish this, because
50/√400 = 50/20 = 2.5.
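The arithmetic in this example takes only a few lines to reproduce (a sketch; σ = 50 and the two sample sizes come from the example above):

```python
import math

sigma = 50.0  # population standard deviation from the example
for n in (100, 400):
    # standard error of the sample mean: sigma / sqrt(n)
    print(n, sigma / math.sqrt(n))   # 100 -> 5.0, 400 -> 2.5
```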
11. Example: Suppose the number of miles driven each week by US car owners is normally
distributed with a standard deviation of σ = 75 miles. Suppose we plan to estimate the
population mean number of miles driven per week by US car owners using a random
sample of size n = 100. What is the probability that our estimate will differ from the
true value by more than 10 miles?
Denote the population mean by µ and the sample mean by X. We need to find P (|X − µ| >
10). By symmetry of the normal distribution, this is 2 × P (X − µ > 10). Standardizing,
we find that this is
2P ( (X − µ)/(σ/√n) > 10/(σ/√n) ),
which is
2P (Z > 10/(75/√100)) = 2P (Z > 1.33) = 2 × .0918 = 0.1836.
Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
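The same tail probability can be computed without a normal table, using the error function from Python's standard library (an illustrative sketch, not part of the original notes). The result, about 0.1824, differs slightly from the 0.1836 above because the table rounds z = 1.3333 down to 1.33.

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n = 75.0, 100                 # values from the example
z = 10.0 / (sigma / math.sqrt(n))    # standardized cutoff, 10/(75/10) = 1.333...
prob = 2.0 * (1.0 - phi(z))          # two-sided tail probability
print(round(prob, 4))
```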
12. We now drop the assumption that the underlying population is normal, and just assume
that it has mean µ and variance σ 2 . It is still true that the sample mean X has
E[X] = µ
and
Std Error = Std Dev(X) = σ/√n.
These properties do not use normality; they follow from basic properties of means and
variances.
13. By the central limit theorem, regardless of the underlying population, the distribution
of X tends towards N (µ, σ 2 /n) as n becomes large. In particular, for a sufficiently large
sample size n,
(X − µ)/(σ/√n) ≈ N (0, 1);
i.e., the standardized quantity on the left tends toward the standard normal distribution.
We will use this approximation repeatedly to assess the error in X as an estimate of µ.
14. A consequence of this is that in the example above concerning the mean number of miles
driven per week, we don’t need to assume that the number of miles driven per week is
normally distributed (as long as our sample size n is large).
15. How large should n be for the normal approximation to be accurate? There is no simple
answer (it depends on the underlying distribution), but n ≥ 30 is a reasonable rule of
thumb.
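The central limit theorem and the n ≥ 30 rule of thumb can be illustrated by simulation with a decidedly non-normal population (a sketch, not from the original notes; a Uniform(0, 1) population is used as an example):

```python
import math
import random

random.seed(1)
n, trials = 30, 20000            # n = 30: the rule-of-thumb sample size
mu = 0.5                         # mean of Uniform(0, 1)
sigma = math.sqrt(1.0 / 12.0)    # std dev of Uniform(0, 1)

# Standardize the sample mean of each sample and check how often it
# lands within +/- 1.96, which should be about 95% if the normal
# approximation is accurate.
within = 0
for _ in range(trials):
    xbar = sum(random.random() for _ in range(n)) / n
    z = (xbar - mu) / (sigma / math.sqrt(n))
    if abs(z) <= 1.96:
        within += 1
print(within / trials)
```

Even though the underlying population is uniform rather than normal, the standardized sample mean behaves very nearly like a standard normal at n = 30.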
16. If the underlying population is finite of size N , and if the sample size n is not a small
proportion of N , we use the following small sample correction to the standard error:
Std Error(X) = (σ/√n) × √((N − n)/(N − 1)).
In this course, we generally assume that the underlying population is infinite or else much
larger than the sample size so that the small sample correction is not needed.
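To see how much the correction matters when the sample is a sizable fraction of the population, here is a quick computation (illustrative; the values σ = 50, n = 100, N = 500 are hypothetical):

```python
import math

sigma, n, N = 50.0, 100, 500   # hypothetical: sample is 20% of the population
se_infinite = sigma / math.sqrt(n)
se_corrected = se_infinite * math.sqrt((N - n) / (N - 1))
print(se_infinite, round(se_corrected, 4))  # 5.0 vs. roughly 4.48
```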
Sampling Distribution of the Sample Proportion
1. Consider estimating any of the following quantities:
• Proportion of voters who will vote for a third-party candidate in the next election.
• Proportion of visits to a web site that result in a sale.
• Proportion of shoppers who prefer crunchy over creamy.
In each of these examples, we are trying to estimate a population proportion. We
denote a generic population proportion using the symbol p.
2. We estimate a population proportion using a sample proportion. For example, if a
poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party
candidate, then the sample proportion is 8.5%. The population proportion is what the
poll would find if it could ask every voter in the population.
3. We denote the sample proportion using the symbol p̂, which is read “p-hat.” Once we
have collected a random sample, the sample proportion p̂ is known. We use it to estimate
the true, unknown population proportion p.
4. The problem of estimating a proportion can be formulated as a special case of estimating
a population mean. To see this, consider again the example of a poll of 1000 voters.
Imagine encoding responses to a question about third-party candidates as follows: for the
ith person polled,
Xi = 1, if the ith person plans to vote for a third-party candidate;
Xi = 0, otherwise.
Thus, our random sample consists of X1 , . . . , X1000 . If 85 respondents indicated that they
would vote for a third-party candidate, then
X1 + X2 + . . . + X1000 = 85,
because 85 of the Xi ’s are equal to 1 and all the rest are equal to 0. Moreover,
X = (X1 + X2 + . . . + X1000 )/1000 = 85/1000 = 8.5%.
Thus, the sample proportion is just a special case of the sample mean.
5. We can summarize this example as follows: the sample proportion p̂ is a special case of
the sample mean X when the data consists of 1’s and 0’s.
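This equivalence is easy to see concretely with the poll data (an illustrative sketch using the numbers from the example above):

```python
import statistics

# 85 "yes" answers (coded 1) and 915 "no" answers (coded 0), as in the poll
votes = [1] * 85 + [0] * 915
p_hat = sum(votes) / len(votes)        # sample proportion
print(p_hat, statistics.mean(votes))   # the sample mean of the 0/1 data is the same number
```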
6. How good an estimate of the population proportion p is the sample proportion p̂? For
example, how effective are polls and surveys? These are issues we will be examining.
7. Recall from our discussion above of sample means that E[X] = µ, indicating that the
sample mean is correct “on average” in estimating the population mean. Similarly, E[p̂] =
p, indicating that the sample proportion is also correct “on average.”
8. By how much is the sample proportion p̂ likely to deviate from the true population
proportion p? This is measured by the standard deviation of p̂, also called the standard
error of p̂. This standard error is given by
StdErr[p̂] = √( p(1 − p)/n ).
9. This standard error is greatest when p = 0.5 (the most uncertain case) and becomes 0
when p = 0 or p = 1 (i.e., when there is no uncertainty at all).
10. We could alternatively calculate the standard error by taking the standard deviation of
the underlying 1’s and 0’s (the Xi ’s in our sample). The formula above is just a shortcut
to the same answer.
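The formula and its behavior at the extremes can be checked directly (a sketch, not from the original notes; n = 1000 is an arbitrary choice):

```python
import math

def std_err(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1.0 - p) / n)

n = 1000
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(std_err(p, n), 4))
```

The printed values peak at p = 0.5 and drop to 0 at p = 0 and p = 1, as claimed; note also the symmetry between p and 1 − p.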
11. If the sample size n is large, the error in our estimate is approximately normally distributed. The error is just the difference p̂ − p between the sample and population proportions.
For large n,
p̂ − p ≈ N ( 0, p(1 − p)/n ).
We sometimes write this in the alternative form
(p̂ − p)/√( p(1 − p)/n ) ≈ N (0, 1).
12. EXAMPLE: Suppose that the true, unknown proportion p of voters who will vote for a
third-party candidate in the next election is 9%. What is the probability that a poll of
1000 voters will find a sample proportion p̂ that differs from the true proportion by more
than 2%?
We need to find P (|p̂ − p| > .02). We will use the normal approximation, so by symmetry
this is 2 × P (p̂ − p > .02). Now we standardize to write this as
2 × P ( (p̂ − p)/√( p(1 − p)/n ) > .02/√( p(1 − p)/n ) ),
i.e., as
2 × P ( Z > .02/√( p(1 − p)/n ) ),
with Z a standard normal. Now we plug in numbers to write this as
2 × P ( Z > .02/√( .09(1 − .09)/1000 ) ) = 2 × P (Z > 2.21).
From the normal table, we find that this is 2 × .0135 = .027. We conclude that the
probability that the poll will be off by more than two percentage points is .027.
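This calculation, too, can be reproduced without a normal table (an illustrative sketch using the error function; the inputs p = .09 and n = 1000 come from the example):

```python
import math

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p, n = 0.09, 1000                        # true proportion and poll size
z = 0.02 / math.sqrt(p * (1 - p) / n)    # standardized cutoff, about 2.21
prob = 2.0 * (1.0 - phi(z))              # two-sided tail probability
print(round(z, 2), round(prob, 3))
```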