1. Estimating a population mean: Large samples
Large-sample $(1-\alpha)100\%$ confidence interval for a population mean $\mu$:
$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
where $z_{\alpha/2}$ is the z-value that locates an area of $\alpha/2$ to its right, $\sigma$ is the
standard deviation of the population from which the sample was selected, n is the sample size,
and $\bar{x}$ is the value of the sample mean.
Assumption: $n \ge 30$
[When the value of $\sigma$ is unknown, the sample standard deviation s may be used to approximate
$\sigma$ in the formula for the confidence interval. The approximation is generally quite
satisfactory when $n \ge 30$.]
Example: Suppose that in the previous year all graduates at a certain university reported
the number of hours spent on their studies during a certain week; the average was 40
hours and the standard deviation was 10 hours. Suppose we want to investigate
whether students now are studying more than they used to. This year a random
sample of n = 50 students is selected. Each student in the sample was interviewed about
the number of hours spent on his/her studies. This experiment produced the following
statistics:
$\bar{x} = 41.5$ hours
$s = 9.2$ hours
Estimate $\mu$, the mean number of hours spent on study, using a 99% confidence interval.
Interpret the interval in terms of the problem.
Solution The general form of a large-sample 99% confidence interval for $\mu$ is
$$\bar{x} \pm z_{.005}\,\frac{\sigma}{\sqrt{n}} \approx 41.5 \pm 2.58\left(\frac{9.2}{\sqrt{50}}\right) = 41.5 \pm 3.36$$
or (38.14, 44.86).
We can be 99% confident that the interval (38.14, 44.86) encloses the true mean weekly
time spent on study this year. Since all the values in the interval fall above 38 hours and
below 45 hours, we conclude that students now tend to spend more than 6
hours and less than 7.5 hours per day on average (supposing that they do not study on
Sunday). Note that the interval also contains last year's mean of 40 hours, so these data
alone do not show that students are studying more than they used to.
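The arithmetic in this example can be checked with a short Python sketch. The function name `large_sample_ci` is hypothetical, and z = 2.58 is assumed to be the rounded table value of z.005 that the solution used, since it reproduces the reported interval.

```python
import math

def large_sample_ci(xbar, s, n, z):
    """Return (lower, upper) for the interval xbar +/- z * s / sqrt(n)."""
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Sample statistics from the example; z = 2.58 is an assumed rounded
# table value for z_.005 (99% confidence).
lower, upper = large_sample_ci(xbar=41.5, s=9.2, n=50, z=2.58)
print(f"99% CI: ({lower:.2f}, {upper:.2f})")  # 99% CI: (38.14, 44.86)
```

Using the sample standard deviation s in place of the unknown population value follows the bracketed note in the formula box, which applies here because n = 50.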
2. Estimating a population mean, Small samples
Assumption required for estimating $\mu$ based on small samples (n < 30): the population
from which the sample is selected has an approximately normal distribution.
Small-sample confidence interval for $\mu$:
$$\bar{x} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}$$
where the distribution of t is based on (n - 1) degrees of freedom.
Example: Determine the t-value that would be used in constructing a 95% confidence
interval for $\mu$ based on a sample of size n = 14.
Solution For a confidence coefficient of .95, we have
$$1 - \alpha = .95, \quad \alpha = .05, \quad \alpha/2 = .025$$
We require the value of $t_{.025}$ for a t-distribution based on (n - 1) = (14 - 1) = 13 degrees of
freedom. In the t-table, at the intersection of the column labeled $t_{.025}$ and the row corresponding
to df = 13, we find the entry 2.160 (see Figure 7.6). Hence, a 95% confidence interval for
$\mu$, based on a sample of size n = 14 observations, would be given by
$$\bar{x} \pm 2.160\,\frac{s}{\sqrt{14}}$$
3. Determining the sample size required to estimate $\mu$ (s.d. known and not known)
Choosing the sample size for estimating a population mean $\mu$ to within d units with
probability $(1-\alpha)$:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{d}\right)^2$$
(Note: The population standard deviation $\sigma$ will usually have to be approximated.)
Choosing the sample size for estimating a population proportion p to within d units with
probability $(1-\alpha)$:
$$n = pq\left(\frac{z_{\alpha/2}}{d}\right)^2$$
where p is the value of the population proportion that we are attempting to estimate and q
= 1 - p.
(Note: This technique requires previous estimates of p and q. If none are available, use
p = q = .5 for a conservative choice of n.)
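The two sample-size rules can be sketched in Python as follows. The function names and the numerical inputs (sigma = 10, d = 2, d = .05) are illustrative assumptions, not values from the text; the result is rounded up because n must be a whole number at least as large as the formula's value.

```python
import math
from statistics import NormalDist

def z_upper(tail_area):
    """z-value with the given area to its right under the standard normal curve."""
    return NormalDist().inv_cdf(1 - tail_area)

def sample_size_mean(sigma, d, confidence):
    """Smallest n that estimates a mean to within d units: (z * sigma / d)^2."""
    z = z_upper((1 - confidence) / 2)
    return math.ceil((z * sigma / d) ** 2)

def sample_size_proportion(d, confidence, p=0.5):
    """Smallest n that estimates a proportion to within d: p*q*(z/d)^2.
    The default p = 0.5 is the conservative choice mentioned in the note."""
    z = z_upper((1 - confidence) / 2)
    return math.ceil(p * (1 - p) * (z / d) ** 2)

# Hypothetical inputs, chosen only to illustrate the formulas.
n_mean = sample_size_mean(sigma=10, d=2, confidence=0.95)
n_prop = sample_size_proportion(d=0.05, confidence=0.95)
print(n_mean, n_prop)  # 97 385
```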
4. Estimating a population proportion
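Large-sample $(1-\alpha)100\%$ confidence interval for a population proportion, p:
$$\hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}} \approx \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
where $\hat{p}$ is the sample proportion of observations with the characteristic of interest, and
$\hat{q} = 1 - \hat{p}$.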
Example: A commission on crime is interested in estimating the proportion of crimes related to
firearms in an area with one of the highest crime rates in a country. The commission
selects a random sample of 300 files of recently committed crimes in the area and
determines that a firearm was reportedly used in 180 of them. Estimate the true
proportion p of all crimes committed in the area in which some type of firearm was
reportedly used. Then construct a 95% confidence interval for p.
Solution A logical candidate for a point estimate of the population proportion p is the
proportion of observations in the sample that have the characteristic of interest (called a
"success"). This is called the sample proportion $\hat{p}$ (read "p hat"). In this example, the
sample proportion of crimes related to firearms is given by
$\hat{p} = 180/300 = .60$
That is, 60% of the crimes in the sample were related to firearms; the value $\hat{p} = .60$
serves as our point estimate of the population proportion p.
For a confidence coefficient of .95, we have $1 - \alpha = .95$; $\alpha = .05$; $\alpha/2 = .025$; and the
required z-value is $z_{.025} = 1.96$. We obtained $\hat{p} = .60$. Thus,
$\hat{q} = 1 - \hat{p} = 1 - .60 = .40$. Substitution of these values into the formula for an approximate
confidence interval for p yields
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = .60 \pm 1.96\sqrt{\frac{(.60)(.40)}{300}} \approx .60 \pm .06$$
or (.54, .66). Note that the approximation is valid since the interval does not contain 0 or
1.
We are 95% confident that the interval from .54 to .66 contains the true proportion of
crimes committed in the area that are related to firearms. That is, in repeated construction
of 95% confidence intervals, 95% of all samples would produce confidence intervals that
enclose p.
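As a check on the firearm example, a minimal Python sketch of the large-sample interval for a proportion is given below. The function name `proportion_ci` is hypothetical, and the z-value is computed from the standard normal distribution rather than read from a table.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Large-sample interval p_hat +/- z_{alpha/2} * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * math.sqrt(p_hat * q_hat / n)
    return p_hat, p_hat - half_width, p_hat + half_width

# 180 of the 300 sampled crime files involved a firearm.
p_hat, lower, upper = proportion_ci(successes=180, n=300, confidence=0.95)
print(f"p_hat = {p_hat:.2f}, 95% CI: ({lower:.2f}, {upper:.2f})")

# Validity check from the text: the interval must not contain 0 or 1.
interval_is_valid = 0 < lower and upper < 1
```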
5. Estimation of population variance:
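A $(1-\alpha)100\%$ confidence interval for a population variance, $\sigma^2$:
$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$
where $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are values of $\chi^2$ that locate an area of
$\alpha/2$ to the right and $\alpha/2$ to the left, respectively, of a chi-square distribution
based on (n - 1) degrees of freedom.
Assumption: The population from which the sample is selected has an approximate
normal distribution.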
Example: There was a study of contaminated fish in a river. Suppose it is important for
the study to know how stable the weights of the contaminated fish are. That is, how large
is the variance $\sigma^2$ in the fish weights? The sample of 144 fish in the study produced the
following summary statistics:
Use this information to construct a 95% confidence interval for the true variance in
weights of contaminated fish in the river.
Solution: For a 95% confidence interval, $(1 - \alpha) = .95$ and $\alpha/2 = .05/2 = .025$. Therefore,
we need the tabulated values $\chi^2_{.025}$ and $\chi^2_{.975}$ for (n - 1) = 143 df. Looking in the df =
150 row of the chi-square table (the row with the df value closest to 143), we find $\chi^2_{.025}$ =
185.800 and $\chi^2_{.975}$ = 117.985. Substituting into the formula given in the box, we obtain
$$\frac{(144-1)s^2}{185.800} \le \sigma^2 \le \frac{(144-1)s^2}{117.985}$$
We are 95% confident that the true variance in weights of contaminated fish in the river
falls between 109,156.8 and 171,898.4.
Figure 7.11 The location of $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ for a chi-square distribution
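The variance interval can be sketched in Python as below. The chi-square values are the tabulated ones quoted in the solution. The sample standard deviation is not given in this text, so `S_ASSUMED = 376.6` is an assumption back-calculated from the reported lower bound; the function name `variance_ci` is also hypothetical.

```python
def variance_ci(n, s, chi2_right, chi2_left):
    """Interval (n-1)s^2 / chi2_right <= sigma^2 <= (n-1)s^2 / chi2_left.

    chi2_right cuts off area alpha/2 in the right tail and chi2_left cuts off
    area alpha/2 in the left tail of a chi-square distribution.
    """
    sum_of_squares = (n - 1) * s ** 2
    return sum_of_squares / chi2_right, sum_of_squares / chi2_left

# ASSUMPTION: s is not stated in the text; 376.6 is inferred from the
# reported lower bound of 109,156.8 and is used only for illustration.
S_ASSUMED = 376.6

# Table values for df = 150, as quoted in the solution.
lower, upper = variance_ci(n=144, s=S_ASSUMED, chi2_right=185.800, chi2_left=117.985)
print(f"95% CI for the variance: ({lower:,.1f}, {upper:,.1f})")
```

With this assumed s the bounds agree with the reported interval to within about one unit; the small difference in the upper bound is presumably rounding in the source.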
6. Finding value of test statistic z
Z is the standard normal variate, written as
$$z = \frac{x - \mu}{\sigma}$$
In confidence interval problems, we define $z_{\alpha/2}$ to be the z-value such that an area of
$\alpha/2$ lies to its right.
Now, if an area of $\alpha/2$ lies beyond $z_{\alpha/2}$ in the right tail of the standard normal (z)
distribution, then an area of $\alpha/2$ lies to the left of $-z_{\alpha/2}$ in the left tail because of the
symmetry of the distribution. The remaining area, $(1-\alpha)$, is equal to the confidence
coefficient; that is, the probability that $\bar{x}$ falls within $z_{\alpha/2}$ standard deviations of
$\mu$ is $(1-\alpha)$.
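The relationship between the confidence coefficient and $z_{\alpha/2}$ can be illustrated with Python's standard library. This is a sketch only: `z_critical` is a hypothetical helper name, and `statistics.NormalDist` stands in for the printed z-table.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_critical(confidence):
    """z_{alpha/2}: the z-value with area alpha/2 to its right."""
    alpha = 1 - confidence
    return STD_NORMAL.inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    z = z_critical(confidence)
    # By symmetry, the area between -z and +z is the confidence coefficient.
    central_area = STD_NORMAL.cdf(z) - STD_NORMAL.cdf(-z)
    print(f"{confidence:.2f}: z = {z:.3f}, central area = {central_area:.2f}")
```

The three values printed (1.645, 1.960, 2.576) are the familiar table entries for 90%, 95% and 99% confidence.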
7. Testing a claim about a mean: Large samples
Large-sample test of hypothesis about a population mean
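ONE-TAILED TEST
H0: $\mu = \mu_0$
Ha: $\mu > \mu_0$ (or Ha: $\mu < \mu_0$)
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Rejection region: $z > z_\alpha$ (or $z < -z_\alpha$)
TWO-TAILED TEST
H0: $\mu = \mu_0$
Ha: $\mu \ne \mu_0$
Test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} \approx \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Rejection region: $z < -z_{\alpha/2}$ (or $z > z_{\alpha/2}$)
where $z_\alpha$ is the z-value such that $P(z > z_\alpha) = \alpha$; and $z_{\alpha/2}$ is the z-value such
that $P(z > z_{\alpha/2}) = \alpha/2$. [Note: $\mu_0$ is our symbol for the particular numerical
value specified for $\mu$ in the null hypothesis.]
Assumption: The sample size must be sufficiently large (say, $n \ge 30$) so that
the sampling distribution of $\bar{x}$ is approximately normal and that s provides a
good approximation to $\sigma$.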
Example: The mean time spent on studies of all students at a university last year
was 40 hours per week. This year, a random sample of 35 students at the university was
drawn. The following summary statistics were computed:
Test the hypothesis that $\mu$, the population mean time spent on studies per week, is equal to
40 hours against the alternative that $\mu$ is larger than 40 hours. Use a significance level of
$\alpha = .05$.
Solution We formulate the hypotheses as:
H0: $\mu = 40$
Ha: $\mu > 40$
Note that the sample size n = 35 is sufficiently large so that the sampling distribution of
$\bar{x}$ is approximately normal and that s provides a good approximation to $\sigma$. Since the
required assumption is satisfied, we may proceed with a large-sample test of hypothesis
about $\mu$.
Using a significance level of $\alpha = .05$, we will reject the null hypothesis for this one-tailed
test if
$z > z_\alpha = z_{.05}$, i.e., if z > 1.645. This rejection region is shown in Figure.
Computing the value of the test statistic, we obtain
$$z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\bar{x} - 40}{s/\sqrt{35}}$$
Since this value does not fall within the rejection region (Figure), we do not reject H0. We
say that there is insufficient evidence (at $\alpha = .05$) to conclude that the mean time spent on
studies per week of all students at the university this year is greater than 40 hours. We
would need to take a larger sample before we could detect whether $\mu > 40$, if in fact this
were the case.
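The decision rule for the large-sample test can be sketched in Python as follows. The sample mean and standard deviation for this example are missing from the text, so the values `xbar=42.1` and `s=13.85` below are hypothetical placeholders chosen only to exercise the code; the function name `z_test_mean` is likewise an assumption.

```python
import math
from statistics import NormalDist

def z_test_mean(xbar, s, n, mu0, alpha, tail="right"):
    """Large-sample z test of H0: mu = mu0. Returns (z, reject_h0)."""
    z = (xbar - mu0) / (s / math.sqrt(n))
    std_normal = NormalDist()
    if tail == "right":      # Ha: mu > mu0, reject if z > z_alpha
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "left":     # Ha: mu < mu0, reject if z < -z_alpha
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "two":      # Ha: mu != mu0, reject if |z| > z_{alpha/2}
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'right', 'left' or 'two'")
    return z, reject

# HYPOTHETICAL summary statistics (not from the text); n, mu0 and alpha
# are the values stated in the example.
z, reject = z_test_mean(xbar=42.1, s=13.85, n=35, mu0=40, alpha=0.05)
print(f"z = {z:.3f}, reject H0: {reject}")
```

With these placeholder inputs z is below 1.645, so the sketch reaches the same "do not reject" decision as the text, but the numerical z-value should not be read as the source's result.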
8. P-value method of testing hypothesis
A P-Value or probability value is the probability of getting a value of the sample test
statistic that is at least as extreme as the one found from the sample data, assuming the
null hypothesis is true.
In testing with the p-value method you follow the same steps, but after you calculate the test
statistic, you find a p-value. You may find a p-value from a table, but a table gives only
a rough approximation. Once you have a p-value, the guideline is:
Reject the null hypothesis if the p-value is less than or equal to the significance level. Fail
to reject if p-value is greater than the significance level.
Example 2: Because of the expense involved, car crash tests often involve small samples.
When 5 BMW cars are crashed under standard conditions, the repair costs are as listed
below. Use a 0.05 significance level to test the claim that the mean for
all BMW cars is less than $1000.
$797 $571 $904 $1147 $418
The hypotheses are H0: $\mu = 1000$ and Ha: $\mu < 1000$. The sample gives $\bar{x} = 767.4$ and
$s \approx 284.7$, so $t = (767.4 - 1000)/(284.7/\sqrt{5}) \approx -1.83$ with 4 degrees of freedom,
and the left-tail p-value is about .07.
Since the p-value is not less than the significance level, we fail to reject the null
hypothesis. The final conclusion is that there is not sufficient sample evidence to support the
claim that the average cost is less than $1000.
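The text gives only the conclusion for this example, so the following Python sketch works through the intermediate computation. It assumes a one-sample t test (the sample has only 5 observations) and approximates the t-distribution's left-tail area by Simpson's rule, since the standard library has no t-distribution; the helper names `t_pdf` and `t_cdf` are hypothetical.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about 0."""
    upper = abs(t)
    h = upper / steps  # steps is even, as Simpson's rule requires
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

costs = [797, 571, 904, 1147, 418]   # repair costs from the example
mu0, alpha = 1000, 0.05              # H0: mu = 1000, Ha: mu < 1000

n = len(costs)
xbar = mean(costs)
s = stdev(costs)                     # sample standard deviation (n - 1 divisor)
t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_value = t_cdf(t_stat, df=n - 1)    # left-tailed test

print(f"xbar = {xbar:.1f}, s = {s:.1f}, t = {t_stat:.3f}, p = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

The p-value of roughly .07 exceeds .05, which matches the stated conclusion of failing to reject the null hypothesis.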
9. Testing a claim about a proportion
Large-sample test of hypothesis about a population proportion
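ONE-TAILED TEST
H0: $p = p_0$
Ha: $p > p_0$ (or Ha: $p < p_0$)
Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$, where $q_0 = 1 - p_0$
Rejection region: $z > z_\alpha$ (or $z < -z_\alpha$)
TWO-TAILED TEST
H0: $p = p_0$
Ha: $p \ne p_0$
Test statistic: $z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}}$, where $q_0 = 1 - p_0$
Rejection region: $z < -z_{\alpha/2}$ (or $z > z_{\alpha/2}$)
Assumption: The interval $\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n}$ does not contain 0 or 1.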
Example: Suppose it is claimed that in a very large batch of components, about 10% of
items contain some form of defect. It is proposed to check whether this proportion has
increased, and this will be done by drawing a random sample of 150 components. In the
sample, 20 are defective. Does this evidence indicate that the true proportion of
defective components is significantly larger than 10%? Test at significance level $\alpha = .05$.
Solution We wish to perform a large-sample test about a population proportion, p:
H0: p = .10 (i.e., no change in proportion of defectives)
Ha: p > .10 (i.e., proportion of defectives has increased)
where p represents the true proportion of defects.
At significance level $\alpha = .05$, the rejection region for this one-tailed test consists of all
values of z for which
$z > z_{.05} = 1.645$
The test statistic requires the calculation of the sample proportion, $\hat{p}$, of defects:
$\hat{p} = 20/150 \approx .133$
Noting that $q_0 = 1 - p_0 = 1 - .10 = .90$, we obtain the following value of the test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0 q_0 / n}} = \frac{20/150 - .10}{\sqrt{(.10)(.90)/150}} \approx 1.36$$
This value of z lies outside the rejection region, so we conclude that the increase in the
sample proportion of defectives is not statistically significant. We have no evidence to reject the null
hypothesis that the proportion defective is .10 at the 5% level of significance. Because we did
not reject H0, the only error we could have made is a Type II error (accepting H0 when, in fact,
it is not true). Its probability $\beta$ is not fixed by the choice $\alpha = .05$, which is the probability
of a Type I error; $\beta$ depends on the true value of p.
[Note that the interval
$\hat{p} \pm 2\sqrt{\hat{p}\hat{q}/n} = .133 \pm 2\sqrt{(.133)(.867)/150} \approx .133 \pm .056$
does not contain 0 or 1. Thus, the sample size is large enough to guarantee the validity of
the hypothesis test.]
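The defective-components test can be reproduced with a short Python sketch. The function name `z_test_proportion` is hypothetical, and the right-tail p-value it reports is an addition for comparison with the p-value method of section 8, not a figure from the text.

```python
import math
from statistics import NormalDist

def z_test_proportion(successes, n, p0, alpha):
    """Right-tailed large-sample test of H0: p = p0 against Ha: p > p0."""
    p_hat = successes / n
    q0 = 1 - p0
    z = (p_hat - p0) / math.sqrt(p0 * q0 / n)   # standard error uses p0, not p_hat
    z_crit = NormalDist().inv_cdf(1 - alpha)    # z_alpha
    p_value = 1 - NormalDist().cdf(z)           # area to the right of z
    return p_hat, z, z_crit, p_value

# 20 defectives in a sample of 150; claimed proportion .10; alpha = .05.
p_hat, z, z_crit, p_value = z_test_proportion(successes=20, n=150, p0=0.10, alpha=0.05)
print(f"p_hat = {p_hat:.3f}, z = {z:.2f}, critical z = {z_crit:.3f}, p = {p_value:.3f}")
print("reject H0" if z > z_crit else "do not reject H0")
```

Both routes agree: z is below the critical value 1.645, and equivalently the p-value is larger than .05.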