THE UNIVERSITY OF TEXAS AT AUSTIN
Department of Information, Risk,
and Operations Management
BA 386T
Tom Shively
ESTIMATION AND SAMPLING DISTRIBUTIONS
The purpose of these notes is to summarize the concepts regarding estimation and
sampling distributions that we discussed in class.
ESTIMATING THE POPULATION MEAN (µ)
Consider an example dealing with the salaries of last spring’s MBA graduates.
The random variable X will represent MBA salary. We will assume the population of
last spring’s MBA salaries is normally distributed with unknown mean µ and a known
variance of \(\sigma_X^2 = (10{,}000)^2\). The distribution is drawn below.
[Figure: normal density of X (MBA salary), centered at µ, with standard deviation σX = 10,000.]
The interpretation of µ is the following: µ represents the mean salary of all MBA
graduates from last spring. We could determine the exact value of µ (i.e. the exact mean
salary of all MBAs in the population) by obtaining the salary of every MBA graduate,
adding them up, and dividing the total by the number of MBA graduates. In practice, it is
too expensive and time consuming to obtain every MBA salary, which means that we
cannot determine µ exactly. The important point to understand from this
discussion is that the population mean µ is a number, but we do not know what the
number is without getting the salary of every MBA graduate in the country.
Note that “mean” and “average” mean the same thing. They can be used
interchangeably. For example, “sample mean” and “sample average” mean the same
thing, as do “population mean” and “population average.”
Aside
I am assuming the variance of the population of MBA salaries (\(\sigma_X^2\)) is known. In
particular, I am assuming that \(\sigma_X^2 = (10{,}000)^2\). In practice, if we don’t know the mean (µ)
of the population, we will not know the variance (\(\sigma_X^2\)) of the population either. I am
assuming we know the variance to remove one level of complexity from the problem of
estimating the population mean µ and determining the quality of the estimate. In practice,
we will have to estimate the population variance \(\sigma_X^2\). This will be discussed below in
further detail.
End Aside
Next Aside
It is much easier to interpret the standard deviation σX of the population than it is to
interpret the variance \(\sigma_X^2\) of the population. The reason for this is that we know 68% of
the probability falls within one standard deviation of the mean, and that 95% of the
probability falls within two standard deviations of the mean. The probability calculation
for one standard deviation is given below.
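[Figure: normal density of X (MBA salary) with the region between µ − σX and µ + σX marked.]
\[
\begin{aligned}
\operatorname{pr}(\mu - \sigma_X < X < \mu + \sigma_X)
&= \operatorname{pr}\left(\frac{(\mu - \sigma_X) - \mu}{\sigma_X} < \frac{X - \mu}{\sigma_X} < \frac{(\mu + \sigma_X) - \mu}{\sigma_X}\right) \\
&= \operatorname{pr}(-1.0 < Z < 1.0) \\
&= \operatorname{pr}(Z < 1.0) - \operatorname{pr}(Z < -1.0) \\
&= 0.8413 - 0.1587 \\
&= 0.6826
\end{aligned}
\]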
Also, the units for the standard deviation in this example are dollars, while the units for
the variance are dollars squared. It is considerably easier to think of a measure of
dispersion in dollars than in dollars squared.
In the MBA salary example, 68% of the MBA graduates make within σX = $10,000 of
the population mean salary. An equivalent way to think about this probability is that if we
pick an MBA graduate at random from the population, there is a 68% chance that we will
get a person that makes within $10,000 of the average salary of all MBA graduates.
Similarly, 95% of the MBA graduates make within 2σX = $20,000 of the population
mean salary. An equivalent way to think about this probability is that if we pick an MBA
graduate at random from the population, there is a 95% chance that we will get a person
that makes within $20,000 of the average salary of all MBA graduates.
End Aside
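The 68% and 95% figures in the aside above can be checked numerically. The following is a minimal Python sketch, assuming only the standard library's statistics.NormalDist; the helper name prob_within is my own and is not part of the notes.

```python
from statistics import NormalDist


def prob_within(k: float) -> float:
    """Probability that a normal random variable falls within k
    standard deviations of its mean, i.e. pr(-k < Z < k)."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


p_one_sd = prob_within(1.0)
p_two_sd = prob_within(2.0)
print(f"within 1 standard deviation:  {p_one_sd:.4f}")
print(f"within 2 standard deviations: {p_two_sd:.4f}")
```

The first value agrees with the table calculation above up to rounding in the fourth decimal place. The second is slightly above 95%, because exactly 95% corresponds to about 1.96 standard deviations rather than 2.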
To estimate the population mean, we collect a sample of MBA salaries from the
population of last spring’s MBA graduates and use the mean of the salaries in the sample
as an estimate of the mean of the salaries in the entire population. The steps in the logic
underlying this idea are the following:
(1) The sample is representative of the population.
If the sample is relatively large, it will represent the entire spectrum of MBA salaries
in the population. For example, a few people in the sample will make low salaries
(because a small percentage of the MBA population makes low salaries), a large portion
of the people in the sample will make salaries right around the population average
(because most of the people in the MBA population make a salary close to the population
mean salary), and a few people in the sample will make high salaries (because a small
percentage of the MBA population makes high salaries).
(2) The sample mean (\(\bar{X}\)) is representative of the population mean (µ)
This follows from step (1). If the sample is representative of the entire spectrum of
MBA salaries, then the sample mean (\(\bar{X}\)) must be representative of the population mean
(µ). Another way to say this is that the sample mean (\(\bar{X}\)) is a good proxy for the
unobservable population mean (µ).
(3) The sample mean (\(\bar{X}\)) is a good estimator of the population mean (µ)
This follows from step (2). Step (3) just formalizes the idea in step (2), i.e. if \(\bar{X}\) is a
good proxy for µ, then we say \(\bar{X}\) provides a good estimator for µ.
DETERMINING THE QUALITY OF \(\bar{X}\) AS AN ESTIMATOR FOR µ
A natural question to ask is how good the estimate of µ is that we get using \(\bar{X}\). For
example, suppose we collect a sample of size n = 100. It is possible (although unlikely)
that we get 80 people in the sample that make salaries well above the population mean µ,
and only 20 people in the sample that make salaries below the population mean µ. If this
happens, then the sample mean \(\bar{X}\) will be far above the population mean µ, and we will
get a bad estimate of µ.
The appropriate way to phrase the question concerning the quality of \(\bar{X}\) as an
estimator for µ is the following. First, we must define what a good estimate is. I will say
any estimate within $900 of the population mean µ is a good estimate. The definition of
an accurate estimate (i.e. saying we want an estimate within $900 of µ) is a subject matter
question, not a statistical question. This means you must consider the context of the
problem to determine the degree of accuracy that is required to have a good estimate.
(The $900 figure I chose is admittedly a bit arbitrary. A more natural figure would be
$1000 but I chose $900 to make it easy to differentiate from \(\sigma_{\bar{X}} = 1000\), which is used
below.)
The question we want to answer is: What is the probability that we get a sample of
MBA salaries from the population of last spring’s MBA graduates that gives an \(\bar{X}\)
that is within $900 of µ?
To answer this question we must consider the sampling distribution of \(\bar{X}\). First, note
that \(\bar{X}\) is a random variable. The random experiment used to obtain \(\bar{X}\) is the process of
collecting a random sample, and the outcome of the random experiment (i.e. the sample
mean \(\bar{X}\)) is a numerical value. Therefore, \(\bar{X}\) is a random variable and has a distribution
associated with it. This distribution represents the uncertainty regarding the value of \(\bar{X}\)
that we obtain due to the uncertainty regarding the random sample we will obtain.
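The distribution of \(\bar{X}\) is \(N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2 / n)\), where n is the sample size, \(\sigma_X^2 = (10{,}000)^2\)
is the variance of the population, and \(\sigma_{\bar{X}}^2 = \mathrm{Var}(\bar{X}) = \sigma_X^2 / n\) is the variance of the sample
mean \(\bar{X}\). The distribution is drawn below.
[Figure: normal density of \(\bar{X}\), centered at µ, with standard deviation \(\sigma_{\bar{X}} = \sigma_X / \sqrt{n}\).]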
Intuition underlying the distribution of \(\bar{X}\)
The distribution of \(\bar{X}\) has a mean of µ. The reason is that there is a 50/50 chance we
will obtain a sample of MBA salaries from the MBA population that is weighted towards
good students (i.e. there are more good students in the sample whose salaries are above
the population mean µ than poor students whose salaries are below the population mean
µ), and therefore there is a 50/50 chance the sample average \(\bar{X}\) is above the population
mean µ. This is represented in the sampling distribution because half the area in the
distribution for \(\bar{X}\) is above µ, i.e. half the time we get a sample that gives an \(\bar{X}\) greater
than µ. A similar argument can be used to explain why half the area in the distribution for
\(\bar{X}\) is below µ.
Aside
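There is a theorem in statistics called the Central Limit Theorem. In the setting used
here, it says that if \(X \sim N(\mu, \sigma_X^2)\), then \(\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2 / n)\). This theorem backs up
the intuition we developed in class. It is a formal statement of the intuition (which is
what all theorems are). (Strictly speaking, when the population is exactly normal this
result holds exactly for every n. The Central Limit Theorem is the more general statement
that \(\bar{X}\) is approximately normal with this mean and variance when n is large, even if the
population is not normal.)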
End Aside
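One way to see the sampling distribution of \(\bar{X}\) at work is to simulate the random experiment many times. The sketch below is only an illustration, not part of the original notes: it assumes a hypothetical population mean of $65,000 (the notes treat µ as unknown), uses 2,000 repeated samples (an arbitrary choice), and relies on Python's standard random module.

```python
import random
from statistics import mean, stdev

MU = 65_000       # hypothetical population mean, assumed only for this simulation
SIGMA_X = 10_000  # population standard deviation, as in the notes
N = 100           # sample size
REPS = 2_000      # number of repeated samples (arbitrary)

rng = random.Random(0)  # fixed seed so the run is reproducible


def sample_mean(n: int) -> float:
    """Draw one random sample of n salaries and return its mean."""
    return sum(rng.gauss(MU, SIGMA_X) for _ in range(n)) / n


sample_means = [sample_mean(N) for _ in range(REPS)]

center = mean(sample_means)   # should be close to MU
spread = stdev(sample_means)  # should be close to SIGMA_X / sqrt(N) = 1000
print(f"mean of the sample means:               {center:.0f}")
print(f"standard deviation of the sample means: {spread:.0f}")
```

The sample means cluster around µ with a standard deviation near 1,000, which is far smaller than the population standard deviation of 10,000.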
Suppose we collect a sample of size n = 100. Given that the sampling distribution for \(\bar{X}\)
is \(N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2 / n = (10{,}000)^2 / 100 = 1{,}000{,}000 = (1{,}000)^2)\), we can compute the probability
that we get a sample of MBA salaries that gives an \(\bar{X}\) within $900 of µ (i.e. we can
compute the probability that we get what we define to be a good estimate of µ).
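[Figure: normal density of \(\bar{X}\), centered at µ, with \(\sigma_{\bar{X}} = \sigma_X / \sqrt{n} = 10{,}000 / \sqrt{100} = 1000\) and the region between µ − 900 and µ + 900 marked.]
\[
\begin{aligned}
\operatorname{pr}(\mu - 900 < \bar{X} < \mu + 900)
&= \operatorname{pr}\left(\frac{(\mu - 900) - \mu}{1000} < \frac{\bar{X} - \mu}{1000} < \frac{(\mu + 900) - \mu}{1000}\right) \\
&= \operatorname{pr}(-0.9 < Z < 0.9) \\
&= \operatorname{pr}(Z < 0.9) - \operatorname{pr}(Z < -0.9) \\
&= 0.8159 - 0.1841 \\
&= 0.6318
\end{aligned}
\tag{1}
\]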
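The calculation in equation (1) can be reproduced without a normal table. The following is a minimal sketch using Python's standard library; the function name prob_good_estimate and its parameter names are my own labels, not notation from the notes.

```python
from math import sqrt
from statistics import NormalDist


def prob_good_estimate(sigma_x: float, n: int, tolerance: float) -> float:
    """pr(mu - tolerance < X-bar < mu + tolerance) when
    X-bar ~ N(mu, sigma_x**2 / n). The answer does not depend on mu."""
    sigma_xbar = sigma_x / sqrt(n)  # standard deviation of the sample mean
    z = tolerance / sigma_xbar      # tolerance measured in standard deviations
    std_normal = NormalDist()       # standard normal distribution
    return std_normal.cdf(z) - std_normal.cdf(-z)


p_100 = prob_good_estimate(sigma_x=10_000, n=100, tolerance=900)
print(f"pr(X-bar within $900 of mu) with n = 100: {p_100:.4f}")
```

The result matches equation (1) up to rounding in the fourth decimal place, which comes from the four-digit table values.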
Suppose we collect the following sample of n = 100 salaries from the MBA
population:
56496
75416
63516
61981
73506
84008
66390
76538
57154
65582
75998
72854
47156
83319
68730
64246
78871
64879
46674
71946
71169
68624
53100
57686
82651
62452
70288
34509
71315
71786
55572
68847
81244
74021
58491
60610
69942
61855
68703
66174
58173
68386
62209
64545
56881
75401
60446
73382
52375
62371
53678
65182
63114
61985
51936
58156
69112
57161
54847
61671
51208
60074
70832
52879
51615
59239
48304
58531
65702
58701
73883
73227
54859
65904
80033
70814
57770
57757
58249
67506
63484
80587
84081
69471
62943
76397
56920
62409
55329
56845
46387
50703
70761
74632
70901
62259
46217
68240
56122
78082
The notation we will use is the following: Xi represents the salary of the i-th person in the
sample. Thus, X1 = $56496 is the salary of the first person in the sample, X2 = $64246 is
the salary of the second person in the sample, …, X100 = $71786 is the salary of the last
person in the sample. The sample mean is
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{100} \sum_{i=1}^{100} X_i = 64252
\]
Thus, \(\bar{X}\) = $64252 is our estimate of the population mean µ.
We don’t know whether \(\bar{X}\) = $64252 is close to µ or not because we don’t know µ.
If we knew µ we would not have to bother estimating it. However, we can make the
following statement based on the probability calculation in equation (1). Of all the
possible samples of size n = 100 we could collect from the population of MBA graduates,
63.2% of them give an \(\bar{X}\) within $900 of the population mean µ. 36.8% of the possible
samples we could collect give an \(\bar{X}\) more than $900 from the population mean µ. You
don’t know which kind of sample you get (i.e. you don’t know whether you get one of
the 63.2% that give an \(\bar{X}\) within $900 of µ or one of the 36.8% that give an \(\bar{X}\) more
than $900 from µ). However, in my opinion, a 36.8% chance of failure is too high (i.e. a
36.8% chance of getting a bad estimate is too high).
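The statement "of all the possible samples, 63.2% give a good estimate" can be illustrated by collecting many simulated samples and counting how often the sample mean lands within $900 of µ. This is a rough sketch under stated assumptions: the population mean of $65,000 is hypothetical (it is needed only to run the simulation), and the number of repetitions is arbitrary.

```python
import random

MU = 65_000       # hypothetical population mean, assumed only for this simulation
SIGMA_X = 10_000  # population standard deviation, as in the notes
N = 100           # sample size
TOLERANCE = 900   # our definition of a "good" estimate
REPS = 4_000      # number of simulated samples (arbitrary)

rng = random.Random(1)  # fixed seed so the run is reproducible


def gives_good_estimate() -> bool:
    """Collect one sample of N salaries and report whether its mean
    falls within TOLERANCE dollars of the population mean."""
    sample = [rng.gauss(MU, SIGMA_X) for _ in range(N)]
    x_bar = sum(sample) / N
    return abs(x_bar - MU) < TOLERANCE


good_count = sum(gives_good_estimate() for _ in range(REPS))
fraction_good = good_count / REPS
print(f"fraction of samples giving a good estimate: {fraction_good:.3f}")
```

The simulated fraction should come out near the 63.2% computed in equation (1), with some run-to-run variation because only a finite number of samples is drawn.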
To increase the probability of getting a good estimate we need to increase the sample
size n. If we increase the sample size we are collecting more information about the
population mean µ so we should get a better estimate. This will be reflected in a smaller
probability of collecting a sample that gives an \(\bar{X}\) more than $900 from µ. Intuitively,
with a larger sample (say n = 400) we are less likely to get a “strange” sample. A large
sample is more likely to be very representative of the population. For example, we are
highly unlikely to get all n = 400 people in the sample from good schools.
Given that the sampling distribution for \(\bar{X}\) is
\(N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2 / n = (10{,}000)^2 / 400 = 250{,}000 = (500)^2)\)
when n = 400, we can now compute the probability that we get a sample of 400
MBA salaries that gives an \(\bar{X}\) within $900 of µ (i.e. we can compute the probability that
we get what we define to be a good estimate of µ when n = 400).
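[Figure: normal density of \(\bar{X}\), centered at µ, with \(\sigma_{\bar{X}} = \sigma_X / \sqrt{n} = 10{,}000 / \sqrt{400} = 500\) and the region between µ − 900 and µ + 900 marked.]
\[
\begin{aligned}
\operatorname{pr}(\mu - 900 < \bar{X} < \mu + 900)
&= \operatorname{pr}\left(\frac{(\mu - 900) - \mu}{500} < \frac{\bar{X} - \mu}{500} < \frac{(\mu + 900) - \mu}{500}\right) \\
&= \operatorname{pr}(-1.8 < Z < 1.8) \\
&= \operatorname{pr}(Z < 1.8) - \operatorname{pr}(Z < -1.8) \\
&= 0.9641 - 0.0359 \\
&= 0.9282
\end{aligned}
\tag{2}
\]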
We can now make the following statement based on the probability calculation in
equation (2). Of all the possible samples of size n = 400 we could collect from the
population of MBA graduates, 92.8% of them give an \(\bar{X}\) within $900 of the population
mean µ. 7.2% give an \(\bar{X}\) more than $900 from the population mean µ. Therefore, we can
be highly confident that the \(\bar{X}\) we obtain from a sample of size n = 400 will be within
$900 of µ.
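To see how the probability of a good estimate grows with the sample size, the same calculation can be repeated for several values of n. The sketch below uses the two sample sizes from the notes (100 and 400) and adds n = 1600 purely as an extra illustration of my own.

```python
from math import sqrt
from statistics import NormalDist

SIGMA_X = 10_000  # population standard deviation, as in the notes
TOLERANCE = 900   # a "good" estimate is within $900 of mu
std_normal = NormalDist()

prob_by_n = {}
for n in (100, 400, 1600):  # 1600 is an added example, not from the notes
    sigma_xbar = SIGMA_X / sqrt(n)  # standard deviation of the sample mean
    z = TOLERANCE / sigma_xbar
    prob_by_n[n] = std_normal.cdf(z) - std_normal.cdf(-z)
    print(f"n = {n:5d}   sd of X-bar = {sigma_xbar:6.0f}   "
          f"pr(good estimate) = {prob_by_n[n]:.4f}")
```

Each time the sample size is multiplied by four, the standard deviation of the sample mean is cut in half, so the same $900 window covers twice as many standard deviations.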
ESTIMATING THE POPULATION VARIANCE \(\sigma_X^2\)
The logic for estimating the population variance \(\sigma_X^2\) is the same as the logic used to
estimate the population mean µ. The three steps are the following:
(1) The sample is representative of the population.
(2) The sample dispersion is representative of the population dispersion (\(\sigma_X^2\))
The question is how to measure the dispersion of the sample. A natural way to do this
is to use the average distance squared that each point in the sample is from the center of
the sample. The distance from the i-th point (\(X_i\)) to the center of the sample (\(\bar{X}\)) is
(\(X_i - \bar{X}\)). Thus, the average distance squared is
\[
s_X^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
Rather than divide by n, we divide by n − 1. Intuition tells us to divide by n but some
mathematics (which we will not discuss and you are not responsible for) tells us to divide
by n − 1. Therefore, we will use
\[
s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
to represent the dispersion of the sample. (Technical point: We use average distance
squared instead of average distance because if we added up all the distances \(X_i - \bar{X}\), they would
always add to zero, which means the average distance would also be zero. This would
clearly be a bad measure of dispersion. To avoid the problem of positive and negative
values canceling out, we use distances squared.)
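The technical point and the choice of divisor can both be seen on a small example. The sketch below is only an illustration: it uses five salaries taken from the sample listing (chosen for brevity, so the numbers differ from the full-sample results), and the variable names are my own.

```python
from statistics import variance

# Five salaries taken from the sample listing, used here only for brevity.
salaries = [56496, 75416, 63516, 61981, 73506]
n = len(salaries)

x_bar = sum(salaries) / n                   # sample mean
deviations = [x - x_bar for x in salaries]  # distances X_i - X-bar

# Positive and negative distances cancel, so their sum is zero.
sum_of_deviations = sum(deviations)

# Squaring removes the signs, so the squared distances do not cancel.
sum_of_squares = sum(d ** 2 for d in deviations)

var_divide_by_n = sum_of_squares / n                # the "intuitive" divisor
var_divide_by_n_minus_1 = sum_of_squares / (n - 1)  # the divisor used in these notes

print(f"sample mean:           {x_bar:.0f}")
print(f"sum of distances:      {sum_of_deviations:.0f}")
print(f"variance (divide n):   {var_divide_by_n:.0f}")
print(f"variance (divide n-1): {var_divide_by_n_minus_1:.0f}")
```

Python's statistics.variance also divides by n − 1, so variance(salaries) returns the same value as var_divide_by_n_minus_1.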
(3) The sample variance (\(s_X^2\)) is a good estimator of the population variance (\(\sigma_X^2\))
The reason is that \(s_X^2\) is a measure of the dispersion of the sample, and the dispersion
of the sample is representative of the dispersion of the population. Thus, \(s_X^2\) is
representative of the population dispersion, and is therefore a good estimator of \(\sigma_X^2\).
Example
Consider the MBA salary example. The salaries in the sample we collected are listed
above (the sample of n = 100 salaries). The following dotplot of the salaries (which is
not from Excel output) gives a feel for the dispersion of the sample.
[Dotplot of the 100 sampled salaries; horizontal axis: Salary, with tick marks from 30000 to 80000.]
For this sample, \(\bar{X} = 64252\) and
\[
s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
      = \frac{1}{100-1} \sum_{i=1}^{100} (X_i - \bar{X})^2
      = 94692361 = (9731)^2.
\]
Thus, \(s_X^2 = (9731)^2\) is our estimate of the population variance \(\sigma_X^2\).
The sample standard deviation (called sX) is sX = 9731.
EXPLANATION OF THE THREE TYPES OF VARIANCES
The three types of variances are:
(1) \(\sigma_X^2\): \(\sigma_X^2\) is the population variance. σX is the population standard deviation. The
population standard deviation and population variance provide a measure of the
dispersion of the population. 68% of the probability falls within σX of the population
mean µ, while 95% of the probability falls within 2σX of the population mean µ. For
example, in the MBA salary example, σX = $10,000 so 68% of the MBA population
makes within $10,000 of the average salary of all MBAs in the population, while
95% of the MBA population makes within $20,000 of the average salary of all MBAs
in the population.
(2) \(s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\): \(s_X^2\) is the sample variance. sX is the sample standard
deviation. The sample standard deviation and sample variance provide a measure of
the dispersion of the sample. Because the sample is representative of the population,
the sample variance \(s_X^2\) is representative of the population variance \(\sigma_X^2\), and
therefore \(s_X^2\) is an estimator for \(\sigma_X^2\).
(3) \(\sigma_{\bar{X}}^2 = \sigma_X^2 / n\): \(\sigma_{\bar{X}}^2\) is the variance of the sample mean. It provides a measure of the
uncertainty regarding the value of the sample mean \(\bar{X}\) we obtain from our random
sample. \(\sigma_{\bar{X}}\) is the standard deviation of the sample mean. Of all the possible samples
we could collect from the population, 68% of them will give an \(\bar{X}\) within \(\sigma_{\bar{X}}\) of
the population mean. For example, in the MBA salary example (with n = 100 and
\(\sigma_{\bar{X}}^2 = \sigma_X^2 / n = (10{,}000)^2 / 100 = 1{,}000{,}000 = (1{,}000)^2\), so \(\sigma_{\bar{X}} = 1000\)), 68% of the possible
samples we could collect from the population will give an \(\bar{X}\) within \(\sigma_{\bar{X}}\) = $1000 of
µ. This provides a measure of the quality of \(\bar{X}\) as an estimator of µ.
To summarize, \(\sigma_X^2\) is a measure of the dispersion of the population and \(s_X^2\) is a
measure of the dispersion of the sample. \(\sigma_{\bar{X}}^2\) is a measure of the quality of the sample
mean \(\bar{X}\) as an estimator of the population mean µ.
The population standard deviation σX = $10,000 tells us that the probability that the salary
of a single randomly chosen graduate from the population is within $10,000 of the
population average salary (µ) is 68%.
The standard deviation of the sample mean \(\sigma_{\bar{X}} = \sqrt{\sigma_X^2 / n} = \sqrt{(10000)^2 / 100} = 1000\) tells us that
the probability that we collect a sample of size n = 100 from the population that gives a
sample mean \(\bar{X}\) within \(\sigma_{\bar{X}}\) = $1,000 of µ is 68%.
The sample standard deviation sX is an estimate of σX.
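The difference between the population standard deviation and the standard deviation of the sample mean can be made concrete by asking how likely a single graduate's salary, versus a sample mean, is to fall within a given distance of µ. The following is a minimal sketch using Python's standard library; the helper prob_within and the variable names are my own, and the third calculation is an added comparison that does not appear in the notes.

```python
from math import sqrt
from statistics import NormalDist

SIGMA_X = 10_000                # population standard deviation
N = 100                         # sample size
SIGMA_XBAR = SIGMA_X / sqrt(N)  # standard deviation of the sample mean


def prob_within(distance: float, sd: float) -> float:
    """Probability that a normal random variable with standard deviation sd
    falls within `distance` of its mean."""
    z = distance / sd
    std_normal = NormalDist()
    return std_normal.cdf(z) - std_normal.cdf(-z)


p_single_10000 = prob_within(10_000, SIGMA_X)  # one graduate within $10,000 of mu
p_mean_1000 = prob_within(1_000, SIGMA_XBAR)   # sample mean within $1,000 of mu
p_single_1000 = prob_within(1_000, SIGMA_X)    # one graduate within $1,000 of mu (added comparison)

print(f"sd of the sample mean:         {SIGMA_XBAR:.0f}")
print(f"one graduate within $10,000:   {p_single_10000:.4f}")
print(f"sample mean within $1,000:     {p_mean_1000:.4f}")
print(f"one graduate within $1,000:    {p_single_1000:.4f}")
```

The first two probabilities are both about 68% because each distance equals one standard deviation of the relevant distribution. The third shows why a single salary is a poor estimate of µ compared with the mean of 100 salaries.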
NOTATION
(1) \(X \sim N(\mu, \sigma_X^2)\): X is normally distributed with population mean µ and population
variance \(\sigma_X^2\).
(2) µ: Population mean. µ represents the center of the population. It is the value of X that
we expect on average. For example, in the MBA salary example, µ is the average
salary of all MBAs in the population.
(3) \(\sigma_X^2\): Population variance. \(\sigma_X^2\) provides a measure of the dispersion of the population.
See the description in the section on the three types of variances.
(4) σX: Population standard deviation. σX also provides a measure of the dispersion of
the population. It is easier to interpret than the population variance because the units
are appropriate, e.g. dollars (not dollars squared) in the MBA salary example.
(5) Xi: i-th value in the random sample. For example, in the MBA salary example, X1
represents the salary of the first person in the sample, X2 represents the salary of the
second person in the sample, etc.
(6) \(\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i\): Sample mean (or equivalently, sample average). \(\bar{X}\) represents the
center of the sample. It is an estimator for the population mean µ.
(7) \(\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2 = \sigma_X^2 / n)\): \(\bar{X}\) is normally distributed with mean µ and variance
\(\sigma_{\bar{X}}^2 = \sigma_X^2 / n\).
(8) \(\sigma_{\bar{X}}^2 = \sigma_X^2 / n\): \(\sigma_{\bar{X}}^2\) is the variance of the sample mean. See the description in the
section on the three types of variances.
(9) \(s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\): \(s_X^2\) is the sample variance. It represents the dispersion of
the sample and is an estimate of \(\sigma_X^2\). See the description in the section on the three
types of variances.
(10) \(s_X = \sqrt{s_X^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}\): sX is the sample standard deviation. sX also
provides a measure of the dispersion of the sample. It is easier to interpret than the
sample variance because the units are appropriate, e.g. dollars (not dollars squared)
in the MBA salary example.