Download P - Radford University

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Sufficient statistic wikipedia , lookup

History of statistics wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Foundations of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

German tank problem wikipedia , lookup

Misuse of statistics wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Chapter 9: Inferences Involving
One Population
Student’s t, df = 25
Student’s t, df = 15
Student’s t, df = 5
0
t
Chapter Goals
• Learned about confidence intervals and
hypothesis testing.
• Assumed s was known.
• Consider both types of inference about m
when s is unknown.
• Consider both types of inference about p,
the binomial probability of success.
9.1: Inference About mean m
(s unknown)
• Inferences about m are based on the sample
mean x
• If the sample size is large or the sample
population is normal: z*  ( x  m ) /(s / n )
has a standard normal distribution.
• If s is unknown, use s as a point estimate
for s.
• Estimated standard error of the mean: s / n
• Test statistic: t  ( x  m ) /(s / n )
Student’s t-statistic:
1. When s is used as an estimate for s, the test statistic has
two sources of variation: x and s
2. The resulting test statistic:
xm
t
s n
Known as the Student’s t-statistic.
3. Assumption: samples are taken from normal populations.
4. The population standard deviation, s, is almost never
known in real-world problems.
The standard error will almost always be estimated using
s n.
Almost all real-world inference about the population mean
will be completed using the Student’s t-statistic.
Properties of the t-Distribution (df > 2):
1. t is distributed with a mean of 0.
2. t is distributed symmetrically about its mean.
3. t is distributed so as to form a family of distributions, a
separate distribution for each different number of degrees
of freedom (df  1)
4. The t-distribution approaches the normal distribution as the
number of degrees of freedom increases.
5. t is distributed with a variance greater than 1, but as the
degrees of freedom increase, the variance approaches 1.
6. t is distributed so as to be less peaked at the mean and
thicker at the tails than the normal distribution.
Student’s t-Distributions:
Normal distribution
Student’s t, df = 15
Student’s t, df = 5
0
t
Degrees of Freedom, df:
A parameter that identifies each different distribution of
Student’s t-distribution. For the methods presented in this
chapter, the value of df will be the sample size minus 1, df = n
 1.
Note:
1. The number of degrees of freedom associated with s2 is the
divisor (n  1) used to define the sample variance s2.
Thus: df = n  1
2. The number of degrees of freedom is the number of
unrelated deviations available for use in estimating s2.
3. Table for Student’s t-distribution (Table 6 in Appendix B)
is a table of critical values. Left column = df. When df >
100, critical values of the t-distribution are the same as the
corresponding critical values of the standard normal
distribution.
4. Notation: t(df, a)
Read as: t of df, a.
t-Distribution Showing t(df, a):
a
0
t (df ,a )
t
Example: Find the value of t(12, 0.025).
0.025
0.025
 t (12,0.025)
 2.18
Portion of
Table 6
df

12
0
t (12,0.025)
2.18
t
Amount of a in one-tail


0.025
2.18
Note:
1. If the df is not listed in the left-hand column of Table 6, use
the next smaller value of df that is listed.
2. Most computer software packages will calculate either the
area related to a specified t-value or the t-value that bounds
a specified area.
3. The cumulative distribution function (CDF) is often
used to find area from   to t.
4. If the area from   to t is known and the value of t is
wanted, then the inverse cumulative distribution
function (INVCDF) is used.
cumulative
probability

t
The assumption for inferences about mean m when s is
unknown:
The sampled population is normally distributed.
Confidence Interval Procedure:
1. Procedure for constructing confidence intervals similar to
that used when s is known.
2. Use t in place of z. Use s in place of s.
3. The formula for the 1  a confidence interval for m is
a s

x  t  df , 
2 n

to
a s

x  t  df , 
2 n

where df  n  1
Example: A study is conducted to learn how long it takes the
typical tax payer to complete their federal income tax return.
A random sample of 17 income tax filers showed a mean time
(in hours) of 7.8 and a standard deviation of 2.3. Find a 95%
confidence interval for the true mean time required to
complete a federal income tax return. Assume the time to
complete the return is normally distributed.
Solution:
1. Parameter of Interest: the mean time required to complete
a federal income tax return.
2. Confidence Interval Criteria:
a. Assumptions: Sampled population assumed normal, s
unknown.
b. Test statistic: t will be used.
c. Confidence level: 1  a = 0.95
3. The Sample Evidence:
n  17, x  7.8, and s  2.3
4. The Confidence Interval:
a. Confidence coefficients: t (df ,a / 2)  t (16,0.025)  2.12
b. Maximum error:
s
2.3
E  t (16,0.025)
 (2.12) 
 (2.12)(0.5578)  1.18
n
17
c. Confidence limits:
x  E to x  E
7.8  1.18 to 7.8  1.18
6.62 to 8.98
5. The Results:
6.62 to 8.98 is the 95% confidence interval for m.
Hypothesis-Testing Procedure:
1. The t-statistic is used to complete a hypothesis test about a
population mean m.
2. The test statistic:
xm
t* 
with df  n  1
s n
3. The calculated t is the number of estimated standard errors
x is from the hypothesized mean m.
4. Probability-Value or Classical Approach.
Example: A random sample of 25 students registering for
classes showed the mean waiting time in the registration line
was 22.6 minutes and the standard deviation was 8.0 minutes.
Is there any evidence to support the student newspaper’s
claim that registration time takes longer than 20 minutes?
Use a = 0.05 and assume waiting time is approximately
normal.
Solution:
1. The Set-up:
a. Population parameter of concern: the mean waiting time
spent in the registration line.
b. State the null and alternative hypotheses:
H0: m = 20 () (no longer than)
Ha: m > 20 (longer than)
2. The Hypothesis Test Criteria:
a. Check the assumptions: The sampled population is
approximately normal.
b. Test statistic: t* with df = n  1 = 24
c. Level of significance: a = 0.05
3. The Sample Evidence:
a. Sample information: n  25, x  22.6, and s  8
b. Calculate the value of the test statistic:
x  m 22.6  20 2.6
t* 


 1.625
s n
8 25
1.6
Using the p-value procedure:
4. The Probability Distribution:
a. The p-value: P  P(t*  1.625, with df  24)
Note:
1. If this hypothesis test is done with the aid of a computer,
most likely the computer will compute the p-value for you.
2. Using Table 6: place bounds on the p-value.
3. Using Table 7: read the p-value directly from the table for
many situations.
Using Table 6: 0.05 < P < 0.10
Using Table 7: P  0.061
b. The p-value is not smaller than the level of significance, a.
Using the classical procedure:
4. The Probability Distribution:
a. The critical value:
t (24,0.05)  1.71
b. t* is not in the critical region.
5. The Results:
a. Decision: Fail to reject H0.
b. Conclusion: There is insufficient evidence to show the
mean waiting time is greater than 20 minutes.
Example: A new study indicates that higher than normal (220)
cholesterol levels are a good indicator of possible heart
attacks. A random sample of 27 heart attack victims showed
a mean cholesterol level of 231 and a standard deviation of
20. Is there any evidence to suggest the mean cholesterol
level is higher than normal for heart attack victims? Use a =
0.01.
Solution:
1. The Set-up:
a. Population parameter of concern: The mean cholesterol
level of heart attack victims.
b. State the the null and alternative hypothesis:
H0: m = 220 () (mean is not greater than 220)
Ha: m > 220 (mean is greater than 220)
2. The Hypothesis Test Criteria:
a. Assumptions: We will assume cholesterol level is at
least approximately normal.
b. Test statistic: t* (s unknown), df = n  1 = 26
c. Level of significance: a = 0.01 (given)
3. The Sample Evidence:
a. Sample information: n  27, x  231, and s  20
b. Calculate the value of the test statistic:
t* 
x  m 231  220
11


 2.858
s n 20 27 3.849
4. The Probability Distribution:
a. The critical value: t(26, 0.01) = 2.48
0.01
0
t
t (26,0.01)
2.48
b. t* falls in the critical region.
5. The Results:
a. Decision: Reject H0.
b. Conclusion: At the 0.01 level of significance, there is
sufficient evidence to suggest the mean cholesterol level
in heart attack victims is higher than normal.
9.2: Inferences About the
Binomial Probability of
Success
• Possibly the most common inference of all.
• Many examples of situations in which we
are concerned about something either
happening or not happening.
• Two possible outcomes, and multiple
independent trials.
Background:
1. p: the binomial parameter, the probability of success on a
single trial.
2. p ' : the observed or sample binomial probability.
x
p' 
n
x represents the number of successes that
occur in a sample consisting of n trials.
3. For the binomial random variable x:
m  np,
s  npq ,
where
q  1 p
4. The distribution of x is approximately normal if n is larger
than 20 and if np and nq are both larger than 5.
Sampling Distribution of p’:
If a sample of size n is randomly selected from a large
population with p = P(success), then the sampling distribution
of p’ has
1. a mean m p ' equal to p,
2. a standard error s p ' equal to ( pq) / n , and
3. an approximately normal distribution if n is sufficiently
large.
In practice, use of the following guidelines will ensure
normality:
1. The sample size is greater than 20.
2. The sample consists of less than 10% of the population.
3. The products np and nq are both larger than 5.
The assumptions for inferences about the binomial
parameter p:
The n random observations forming the sample are selected
independently from a population that is not changing during
the sampling.
Confidence Interval Procedure:
The unbiased sample statistic p’ is used to estimate the
population proportion p.
The formula for the 1  a confidence interval for p is
 a  p' q'
p' z   
n
2
where p'  x / n and q'  1  p'
to
 a  p' q'
p' z   
n
2
Example: A recent survey of 300 randomly selected fourth
graders showed 210 participate in at least one organized sport
during one calendar year. Find a 95% confidence interval for
the proportion of fourth graders who participate in an
organized sport during the year.
Solution:
1. Describe the population parameter of concern:
The parameter of interest is the proportion of fourth
graders who participate in an organized sport during the
year.
2. Specify the confidence interval criteria:
a. Check the assumptions.
The sample was randomly selected.
Each subject’s response was independent.
b. Identify the probability distribution:
z is the test statistic.
p’ is approximately normal
n  300  20
np '  300(210 / 300)  210  5
nq '  300(90 / 300)  90  5
c. Determine the level of confidence: 1  a  0.95
3. Collect and present sample evidence.
Sample information: n = 300, and x = 210.
The point estimate: p'  x / n  210 / 300  0.70
4. Determine the confidence interval:
a. Determine the confidence coefficients:
Using Table 4, Appendix B: z (a / 2)  z (0.025)  1.96
b. The maximum error of estimate:
p' q'
(0.70)(0.30)
E  z (a / 2) 
 1.96
n
300
 (1.96) 0.0007  (1.96)(0.0265)  0.0519
c. Find the lower and upper confidence limits:
p ' E
0.70  0.0519
0.6481
to
to
to
p ' E
0.70  0.0519
0.7519
d. The Results:
0.6481 to 0.7519 is a 95% confidence interval for the
true proportion of fourth graders who participate in an
organized sport during the year.
Sample Size Determination:
[ z (a / 2)]2  p * q *
n
E2
E: maximum error of estimate.
1  a: confidence level
p*: provisional value of p (q* = 1  p*)
If no provisional values for p and q are given use p* = q* = 0.5
(Always round up.)
Example: Determine the sample size necessary to estimate the
true proportion of laboratory mice with a certain genetic defect.
We would like the estimate to be within 0.015 with 95%
confidence.
Solution:
1. Level of confidence: 1  a = 0.95, z(a/2) = z(0.025) = 1.96
2. Desired maximum error is E = 0.015.
3. No estimate of p given, use p* = q* = 0.5
4. Use the formula for n:
[ z (a / 2)]2  p * q * (1.96) 2  (0.5)  (0.5)
n

2
E
(0.015) 2
0.9604

 4268.44  n  4269
0.000225
Note: Suppose we know the genetic defect occurs in
approximately 1 of 80 animals.
Use p* = 1/80 = 0.125
[ z (a / 2)]2  p * q * (1.96) 2  (0.0125)  (0.9875)
n

2
E
(0.015) 2
0.0474

 210.75  n  211
0.000225
As illustrated here, it is an advantage to have some indication
of the value expected for p, especially as p becomes
increasingly further from 0.5.
Hypothesis-Testing Procedure:
For hypothesis tests concerning the binomial parameter p, use
the test statistic z*:
p' p
z* 
pq
n
where
x
p' 
n
Example (Probability-Value Approach):A hospital
administrator believes that at least 75% of all adults have a
routine physical once every two years. A random sample of
250 adults showed 172 had physicals within the last two
years. Is there any evidence to refute the administrator's
claim? Use a = 0.05.
Solution:
1. The Set-up:
a. Population parameter of concern: the proportion of
adults who have a physical every two years.
b. State the null and alternative hypotheses:
H0: p = 0.75
Ha: p < 0.75
2. The Hypothesis Test Criteria:
a. Assumptions: 250 adults independently surveyed.
b. Test statistic: z*.
n = 250
np = (250)(0.75) = 187.5 > 5
nq = (250)(0.25) = 62.5 > 5
c. Level of significance: a = 0.05
3. The Sample Evidence:
a. Sample information:
n  250, x  172, and
b. The test statistic:
p'  172 / 250  0.688
p ' p
0.688  0.75

pq
(0.75)(0.25)
n
250
 0.062
 0.062


 2.26
0.00075 0.02738
z* 
4. The Probability Distribution:
a. The p-value:
Use Table 3 Appendix B, Table 5 Appendix B, or
use a computer.
p-value
 2.26
0
z
p  P( z  2.26)  0.5000  0.4881  0.0119
b. The p-value is smaller than the level of significance, a.
5. The Results:
a. Decision: Reject H0.
b. Conclusion: There is evidence to suggest the proportion
of adults who have a routine physical exam every two
years is less than 0.75.
Example (Classical Procedure): A university bookstore employee
in charge of ordering texts believes 65% of all students sell their
statistics book back to the bookstore at the end of the class. To
test this claim, 200 statistics students are selected at random and
141 plan to sell their texts back to the bookstore. Is there any
evidence to suggest the proportion is different from 0.65? Use a
= 0.01.
Solution:
1. The Set-up:
a. Population parameter of concern: p = the proportion of
students who sell their statistics book back to the bookstore.
b. The null and alternative hypotheses:
H0: p = 0.65
Ha: p  0.65
2. The Hypothesis Test Criteria:
a. Assumptions: Sample randomly selected. Each
subject’s response was independent of other responses.
b. Test statistic: z*
n = 200
np = (200)(0.65) = 130 > 5 ; nq = (200)(0.35) = 70 > 5
c. Level of significance: a = 0.01
3. The Sample Evidence:
a. n  200, x  141, p'  141 / 200  0.705
b. Calculate the value of the test statistic:
p ' p
0.705  0.65
z* 

pq
(0.65)(0.35)
n
200
0.055
0.055


 1.63
0.0011375 0.03373
4. The Probability Distribution:
a. The critical value: z(0.005) = 2.58
0.005
0.005
 2.58
0
2.58
z
b. z* is not in the critical region.
5. The Results:
a. Decision: Do not reject H0.
b. Conclusion: There is no evidence to suggest the true
proportion of students who sell their statistics text back
to the bookstore is different from 0.65.
Note:
1. There is a relationship between confidence intervals and
two-tailed hypothesis tests when the level of confidence
and the level of significance add up to 1.
2. The confidence interval and the width of the noncritical
region are the same.
3. The point estimate is the center of the confidence interval,
and the hypothesized mean is the center of the noncritical
region.
4. If the hypothesized value of p is contained in the
confidence interval, then the test statistic will be in the
noncritical region.
5. If the hypothesized value of p does not fall within the
confidence interval, then the test statistic will be in the
critical region.