YALE School of Management
EMBA-MGT511: HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Review Materials from Data I
1. Review of Concepts
We begin with a brief review of concepts of random sampling, sampling distributions,
point and interval estimation that you learned in Data-I, which will be useful for us in
doing Hypothesis Testing. If you feel comfortable with these concepts, you may skip it
and go directly to Section 2 on practice exercises.
Random Sampling and Sampling Distributions:
Since a census of the population of interest can be too costly or impossible to obtain,
random sampling is a practical approach to learning about the population. Using “sample
statistics” such as the sample mean, sample standard deviation and sample proportion,
we can estimate the true parameters of the population.
Sample statistics will vary from sample to sample. Suppose we knew the population mean
is 200. Even then, the sample mean is very unlikely to be exactly 200 in any particular
sample. The sample statistic itself is a random variable, due to the random variation in
which elements are chosen in the sample. If we use random sampling (where every
element of the population has an equal chance to be part of the sample), then we can use
probability theory to determine the probability distribution of the sample statistic. The
probability distribution of a sample statistic is called its sampling distribution.
Using the sampling distribution, we can make point and interval estimates of the population
parameters.
Sampling Distribution for Means
Consider the random variable Y. The sampling distribution for the sample mean $\bar{Y}$ has:
Mean: $\mu_{\bar{Y}} = \mu$ (the population mean)
Standard deviation: $\sigma_{\bar{Y}} = \sigma / \sqrt{n}$ (the population standard deviation divided by the square root of the sample size)
If the sample size is greater than 30, $\bar{Y}$ will be approximately normally distributed, thanks
to the central limit theorem.
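The two properties above can be checked with a small simulation. The sketch below is illustrative only: the population standard deviation of 20, the sample size of 25, the number of repetitions, and the function name simulate_sample_means are all assumptions chosen for the example, not values from the course materials.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples random samples of size n from a normal population
    and return the list of their sample means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]


# Hypothetical population: mean 200, standard deviation 20.
means = simulate_sample_means(mu=200, sigma=20, n=25, num_samples=5000)

# The sample means should centre on mu, with spread near sigma / sqrt(n) = 4.
print("mean of sample means:", statistics.fmean(means))
print("std dev of sample means:", statistics.stdev(means))
```

Individual sample means differ from 200, but their average sits close to 200 and their standard deviation sits close to 4, which is what the sampling distribution predicts.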
Sampling Distribution for Proportions
Consider a binary random variable Y, which can take one of two values, 0 or 1. Let $\pi$ be
the proportion of 1’s in a population. The sampling distribution for the sample proportion
$\hat{\pi}$ has the following mean and standard deviation:
Mean: $\mu_{\hat{\pi}} = \pi$ (the population proportion)
Standard deviation: $\sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$ (the population standard deviation divided by the square root of the sample size)
If the sample size is such that $n\pi > 5$ and $n(1-\pi) > 5$, $\hat{\pi}$ will be approximately normally
distributed.
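As a rough sketch of the proportion formulas, the snippet below computes the standard deviation of the sample proportion and applies the rule of thumb for the normal approximation. The function names and the example values (a population proportion of 0.5 or 0.02 with n = 100) are hypothetical.

```python
import math


def proportion_sampling_sd(pi, n):
    """Standard deviation of the sample proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)


def normal_approx_ok(pi, n):
    """Rule of thumb from the text: n*pi > 5 and n*(1 - pi) > 5."""
    return n * pi > 5 and n * (1 - pi) > 5


# Hypothetical example values.
print(proportion_sampling_sd(0.5, 100))  # about 0.05
print(normal_approx_ok(0.5, 100))        # both counts are 50, so the rule is met
print(normal_approx_ok(0.02, 100))       # n*pi is only 2, so the rule fails
```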
Point Estimation
The object of estimation is to learn about a population parameter (such as the proportion of
all voters who favor Bush at a given time, or the mean account balance of all American
Express Gold Card accounts) from sample data. Parameters are fixed numbers, but they are
unknown. To learn about them, we employ special random variables called estimators,
which have good properties (such as being correct on average). The properties of
estimators derive from random sampling. The actual numbers that we obtain from the
data are referred to as point estimates.
To clarify the notation for population parameters, estimators and estimates, see the
following table.
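Parameter              Estimator                          Estimate
$\mu$                  $\bar{Y}$                          $\bar{y}$
$\sigma$               $S$                                $s$
$\pi$                  $\hat{\pi}$                        $\hat{\pi}$ (observed value)
$\mu_1 - \mu_2$        $\bar{Y}_1 - \bar{Y}_2$            $\bar{y}_1 - \bar{y}_2$
$\pi_1 - \pi_2$        $\hat{\pi}_1 - \hat{\pi}_2$        $\hat{\pi}_1 - \hat{\pi}_2$ (observed value)

Where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ and $S = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$,
and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$.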
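To make the estimator and estimate distinction concrete, here is a minimal sketch that turns one observed sample into point estimates of the mean and standard deviation. The five data values are made up for illustration; the standard-library function statistics.stdev uses the n - 1 divisor, matching the formula for s.

```python
import math
import statistics

# Hypothetical observed sample (made-up IQ scores).
sample = [98, 105, 110, 92, 120]
n = len(sample)

# Point estimate of the population mean: y-bar = sum(y_i) / n.
y_bar = sum(sample) / n

# Point estimate of the population standard deviation, with the n - 1 divisor.
s_manual = math.sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))

# statistics.stdev applies the same n - 1 formula.
s_library = statistics.stdev(sample)

print(y_bar)      # 105.0
print(s_manual)   # sqrt(117), about 10.82
print(s_library)  # same value as s_manual
```

A different random sample would give different numbers: the formulas (estimators) stay fixed, while the resulting values (estimates) vary from sample to sample.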
Interval Estimate (Confidence Interval)
We recognize that point estimators are random variables, so the point estimates they produce are subject to sampling
error. We therefore wish to quantify the range of values that we can reasonably expect the
true population parameter to take given our sample. This is the interval estimate. It is also
called the confidence interval.
The standard formula for a 2-sided confidence interval is:
Confidence Interval = Point Estimate ± Critical Value × Estimated Standard Deviation of Estimator
We know what the point estimates are from the previous section. We also know how to
estimate the standard deviation of the estimator (for both means and proportions) from above.
So the one thing we need in addition is the critical value.
Getting the critical value
The critical value asks us how confident we want to be that the range we offer will
contain the true population parameter. Often, people use a 95% confidence interval. But
we could also use a 99% confidence interval or a 90% interval. The more confident we
want to be, the wider the interval will be. At the extreme, if you want 100%
confidence, then the confidence interval would be $-\infty$ to $+\infty$. Of course, this confidence
interval would be practically useless. So we have to trade off confidence against managerial
usefulness.
Usually people express confidence intervals using the $100(1-\alpha)\%$ notation. So a 95%
confidence interval corresponds to $\alpha = 0.05$ (called the significance level). A 90% confidence
interval corresponds to $\alpha = 0.1$ (significance level). A 99% confidence interval
corresponds to $\alpha = 0.01$ (significance level).
Since both sample proportions and sample means follow the normal distribution
approximately for large samples, the critical value is obtained from the standard normal
distribution (usually represented by Z). The critical value (z) that leaves a probability of
$\alpha/2$ in each tail is denoted $z_{\alpha/2}$. The corresponding critical values for the 90, 95
and 99% confidence intervals are:
Confidence Level    Significance Level ($\alpha$)    $z_{\alpha/2}$ (Use Hypothesis Testing Worksheet)
90%                 0.1                              $z_{0.05} = 1.645$
95%                 0.05                             $z_{0.025} = 1.96$
99%                 0.01                             $z_{0.005} = 2.58$
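The critical values in the table can also be reproduced in code instead of being read from a worksheet. The following is a small sketch using Python's standard-library NormalDist; the helper name z_critical is an assumption made for this example.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a given confidence level,
    where alpha = 1 - confidence."""
    alpha = 1 - confidence
    # inv_cdf returns the point with the given cumulative probability to its left.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

To three decimals the 99% value is 2.576, which the table reports as 2.58.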
However, if we have a small sample and the population standard deviation is not known
(which is almost always the case), then we cannot use the normal distribution and
therefore we need to use the t-distribution with n-1 degrees of freedom.
Interval estimation works under the following assumptions:
(1) The sample is random
(2) The sample means follow a normal distribution (by appealing to Central Limit
Theorem). Even if we use the t-distribution for the critical value (because we
don’t know the population standard deviation), the sample means are assumed to
follow a normal distribution.
We can check for the assumptions by doing the following:
(1) Look at a histogram of the sample data. If the histogram is symmetric, the sample
means will follow a normal distribution even with a small sample. If the data do
not appear to be symmetric, the central limit theorem works more slowly. This
means that we need to get larger samples before the distribution of sample means
can be assumed to follow a normal distribution.
(2) Correct for any outliers, or identify causes of any unusual values.
These assumptions and checks should apply to hypothesis testing also.
2. Review- Practice Exercises
(i) Computing the Probability of an Individual Outcome Drawn from a Normal Distribution
Suppose IQ in a population is normally distributed across individuals. Based on
historical data, we know the average IQ in the population is 100 and the standard
deviation is 20. What is the probability that a randomly chosen individual will have an IQ
between 80 and 120?
Y: random variable, measuring IQ
$\mu_Y = \mu = 100$
$\sigma_Y = \sigma = 20$
If Y is normally distributed, then
$Z = \frac{Y - \mu_Y}{\sigma_Y} = \frac{Y - \mu}{\sigma}$
has $\mu_Z = 0$; $\sigma_Z = 1$.
Note Z is the standard normal variable.
Therefore,
$\mathrm{Prob}[80 \le Y \le 120] = \mathrm{Prob}\left[\frac{80 - 100}{20} \le Z \le \frac{120 - 100}{20}\right]$
$= \mathrm{Prob}[-1 \le Z \le 1]$
$= 0.6826$
Use p. 800 of the text, to look up the probabilities for the standard normal distribution.
The probability for Z to take value between 0 and 1 is 0.3413 (from the table). By
symmetry of the normal, the probability of Z between –1 to 0 is also 0.3413. Therefore
the total probability is 0.6826. See graph below for a pictorial representation.
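[Figure: standard normal density of Z (vertical axis: Probability), with Prob = 0.3413 marked between -1 and 0 and Prob = 0.3413 marked between 0 and 1.]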
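The table lookup in exercise (i) can be cross-checked numerically. This is a minimal sketch using the standard-library NormalDist with the exercise's mean of 100 and standard deviation of 20; the variable names are illustrative.

```python
from statistics import NormalDist

# IQ of one randomly chosen individual: normal with mean 100, std dev 20.
iq = NormalDist(mu=100, sigma=20)

# P(80 <= Y <= 120) is the difference of the cumulative probabilities.
prob_individual = iq.cdf(120) - iq.cdf(80)

# The same probability on the standardized scale: P(-1 <= Z <= 1).
z = NormalDist()
prob_standardized = z.cdf(1) - z.cdf(-1)

print(round(prob_individual, 4))    # 0.6827
print(round(prob_standardized, 4))  # 0.6827
```

The result agrees with the table-based answer of 0.6826 up to rounding: the table gives each half as 0.3413 to four decimals, so doubling it loses a little precision in the last digit.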
(ii) Computing the Probability of Sample Means
Suppose IQ in a population is normally distributed across individuals. Based on
historical data, we know the average IQ in the population is 100 and the standard
deviation is 20. What is the probability that a randomly chosen sample of size 16 will
have an average IQ between 80 and 120?
$\bar{Y}$: random variable, measuring average IQ of a sample of size n
$\mu_{\bar{Y}} = \mu = 100$
$\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{16}} = 5$
$\bar{Y}$ is normally distributed since Y is normally distributed in this case. (Even if Y is not
normal, you can appeal to the Central Limit Theorem to claim $\bar{Y}$ is approximately normally distributed when the sample is large enough.)
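Probability of sample mean outcomes
$\mathrm{Prob}[80 \le \bar{Y} \le 120] = \mathrm{Prob}\left[\frac{80 - 100}{5} \le Z \le \frac{120 - 100}{5}\right]$
$= \mathrm{Prob}[-4 \le Z \le 4]$
$= 0.9999$
Where: $Z = \frac{\bar{Y} - \mu_{\bar{Y}}}{\sigma_{\bar{Y}}}$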
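Exercises (i) and (ii) differ only in the standard deviation used: 20 for an individual, 20 / sqrt(16) = 5 for the mean of 16 individuals. The sketch below wraps that idea in a hypothetical helper, prob_sample_mean_between, so the effect of the sample size is easy to see. It assumes the sample mean is normally distributed, as stated in the exercise.

```python
import math
from statistics import NormalDist


def prob_sample_mean_between(low, high, mu, sigma, n):
    """P(low <= Y-bar <= high) when Y-bar is normal with mean mu
    and standard deviation sigma / sqrt(n)."""
    sigma_ybar = sigma / math.sqrt(n)
    dist = NormalDist(mu=mu, sigma=sigma_ybar)
    return dist.cdf(high) - dist.cdf(low)


# n = 1 reproduces exercise (i); n = 16 is exercise (ii).
prob_n1 = prob_sample_mean_between(80, 120, mu=100, sigma=20, n=1)
prob_n16 = prob_sample_mean_between(80, 120, mu=100, sigma=20, n=16)

print(round(prob_n1, 4))   # 0.6827
print(round(prob_n16, 4))  # 0.9999
```

Averaging over 16 people shrinks the spread by a factor of 4, so an average outside 80 to 120 becomes extremely unlikely even though about a third of individuals fall outside that range.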
(iii) Point Estimation and Interval Estimation
When Population Standard Deviation is known:
Suppose we draw one sample of size 16 and found the sample average IQ to be 105. The
population standard deviation is known to be 20. Estimate the average IQ in the
population and the 95% confidence interval for average IQ in the population.
Point estimator = $\bar{Y}$
Interval estimator: $\bar{Y} \pm z_{\alpha/2}\,\sigma_{\bar{Y}}$
$n = 16$
$\bar{y} = 105$
$\sigma = 20$
95% confidence interval: $105 \pm 1.96 \times \frac{20}{\sqrt{16}}$
$= 105 \pm 1.96 \times 5$
$= 105 \pm 9.8 = (95.2, 114.8)$
“We are 95% confident that the unknown population mean is between 95 and 115”
(in repeated sampling, an interval so constructed contains the population mean 95% of the time).
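The known-sigma interval can be computed directly from the formula. The following sketch uses a hypothetical helper named z_interval and takes the critical value from the standard normal distribution rather than from a table.

```python
import math
from statistics import NormalDist


def z_interval(y_bar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when the population
    standard deviation sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    margin = z * sigma / math.sqrt(n)                   # z * sigma_ybar
    return y_bar - margin, y_bar + margin


# Values from the exercise: n = 16, sample mean 105, sigma = 20.
low, high = z_interval(y_bar=105, sigma=20, n=16)
print(round(low, 1), round(high, 1))  # 95.2 114.8
```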
When Population Standard Deviation is unknown:
Suppose the above question was changed as follows:
Suppose we draw one sample of size 16 and found the sample average IQ to be 105. The
sample standard deviation is calculated to be 20. Estimate the average IQ in the
population and the 95% confidence interval for average IQ in the population.
Point estimator = $\bar{Y}$
Here we calculate the sample standard deviation from the sample data (rather than
knowing the population standard deviation), so the standardized sample mean follows a t-distribution in this case. The t-distribution is approximately equal to the normal
distribution for large values of n (n > 30), so when n is large, we can work with the normal
distribution for convenience.
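Interval estimator: $\bar{Y} \pm t_{n-1,\alpha/2}\,s_{\bar{Y}}$, where $s_{\bar{Y}} = s/\sqrt{n}$
$n = 16$
$\bar{y} = 105$
$t_{n-1,\alpha/2} = t_{15,0.025} = 2.131$
$s = 20$
95% confidence interval: $105 \pm 2.131 \times \frac{20}{\sqrt{16}}$
$= 105 \pm 2.131 \times 5$
$\approx 105 \pm 10.65 = (94.35, 115.65)$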
Note that when you use the t-distribution, the confidence interval is larger than when you
use the normal distribution. Note that the critical value in this problem is 2.131 compared
to 1.96 in the previous problem. This reflects the greater uncertainty in the estimate,
because we estimate one more parameter (the standard deviation) from the data.
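The comparison between the two intervals can be sketched in a few lines. Python's standard library has no t-distribution quantile function, so this example assumes the critical value 2.131 is looked up from a t-table, as in the exercise; the helper name t_interval is hypothetical.

```python
import math


def t_interval(y_bar, s, n, t_crit):
    """Two-sided confidence interval for the mean when sigma is estimated by s.
    t_crit is t_{n-1, alpha/2}, supplied by the caller from a t-table."""
    margin = t_crit * s / math.sqrt(n)
    return y_bar - margin, y_bar + margin


# Values from the exercise: n = 16, sample mean 105, s = 20, t_{15, 0.025} = 2.131.
t_low, t_high = t_interval(y_bar=105, s=20, n=16, t_crit=2.131)

# Margin with the normal critical value 1.96, for comparison.
z_margin = 1.96 * 20 / math.sqrt(16)
t_margin = (t_high - t_low) / 2

print(t_low, t_high)       # roughly 94.35 and 115.65
print(z_margin, t_margin)  # the t margin is the larger of the two
```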
(iv) Sample Size Determination
Suppose we want to have a maximum range of $\bar{Y} \pm 2$ in the 95% confidence interval. If
the population standard deviation is 20, what is the required sample size?
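Let $z_{\alpha/2}\,\sigma_{\bar{Y}} = E$, that is,
$z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = E$
Solving for n:
$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{1.96 \times 20}{2}\right)^2 = 384.2$, which we round up to 385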
If the population standard deviation is 20, to obtain a 95 percent confidence interval with
a difference between the sample mean and the population mean of 2 units maximum (E),
we need a random sample of size 385.
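The sample size calculation is a one-line formula followed by rounding up, since a fractional observation is not possible and rounding down would leave the margin slightly above E. A minimal sketch, with the hypothetical helper name required_sample_size:

```python
import math


def required_sample_size(sigma, margin, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) is at most the margin E."""
    return math.ceil((z * sigma / margin) ** 2)


# Values from the exercise: sigma = 20, E = 2, 95% confidence (z = 1.96).
print(required_sample_size(sigma=20, margin=2))  # 385
```

Because n grows with the square of sigma / E, halving the margin to 1 roughly quadruples the required sample size.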