Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Sufficient statistic wikipedia , lookup

History of statistics wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Foundations of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Misuse of statistics wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Chapter 9 ~ Inferences Involving One Population
Student’s t, df = 25
Student’s t, df = 15
Student’s t, df = 5
0
t
1
Chapter Goals
• Learned about confidence intervals and hypothesis testing
(assumed s was known)
• Consider both types of inference about m when s is unknown
• Consider both types of inference about p, the binomial
probability of success
• Consider the inference of hypothesis testing of s, the
population standard deviation
2
9.1 ~ Inference About mean m (s unknown)
• Inferences about m are based on the sample mean x
• If the sample size is large or the sample population is
normal: z* = ( x - m ) /(s / n )
has a standard normal distribution
• If s is unknown, use s as a point estimate for s
• Estimated standard error of the mean: s / n
• Test statistic: t = ( x - m ) /(s / n)
3
Student’s t-Statistic
1. When s is used as an estimate for s, the test statistic has two
sources of variation: x and s
2. The resulting test statistic:
x-m
t=
Known as the Student’s t-statistic
s n
3. Assumption: samples are taken from normal populations
4. The population standard deviation, s, is almost never known in
real-world problems
The standard error will almost always be estimated using s
n
Almost all real-world inference about the population mean will be
completed using the Student’s t-statistic
4
Properties of the t-Distribution
1. t is distributed with a mean of 0
2. t is distributed symmetrically about its mean
3. t is distributed so as to form a family of distributions, a
separate distribution for each different number of degrees of
freedom (df  1)
4. The t-distribution approaches the normal distribution as the
number of degrees of freedom increases
5. t is distributed with a variance greater than 1, but as the
degrees of freedom increase, the variance approaches 1
6. t is distributed so as to be less peaked at the mean and
thicker at the tails than the normal distribution
5
Student’s t-Distributions
Normal distribution
Student’s t, df = 15
Student’s t, df = 5
0
t
Degrees of Freedom, df (自由度): A parameter that identifies
each different distribution of Student’s t-distribution. For the
methods presented in this chapter, the value of df will be the
sample size minus 1, df = n - 1.
6
Notes
1. The number of degrees of freedom associated with s2 is the
divisor (n - 1) used to define the sample variance s2
Thus: df = n - 1
2. The number of degrees of freedom is the number of unrelated
deviations available for use in estimating s2
3. Table for Student’s t-distribution (Table 6 in Appendix B) is a
table of critical values. Left column = df. When df > 100, critical
values of the t-distribution are the same as the corresponding
critical values of the standard normal distribution.
4. Notation: t(df, a)
Read as: t of df, a
7
t-Distribution Showing t(df, a)
a
0
t (df, a )
t
8
Example
 Example: Find the value of t(12, 0.025)
0.025
0.025
- t (12, 0.025)
- 2.18
Portion of
Table 6
df
0
t (12, 0.025)
2.18
t
Amount of a in one-tail
...
...
0.025
..
.
12
2.18
9
Notes
1. If the df is not listed in the left-hand column of Table 6, use the
next smaller value of df that is listed
2. Most computer software packages will calculate either the area
related to a specified t-value or the t-value that bounds a specified
area
3. The cumulative distribution function (CDF) is often used to find
area from -  to t
4. If the area from -  to t is known and the value of t is wanted,
then the inverse cumulative distribution function (INVCDF) is
used
cumulative
probability
-
t
10
The Assumption...
The assumption for inferences about mean m when s is
unknown: The sampled population is normally distributed
Confidence Interval Procedure:
1. Procedure for constructing confidence intervals similar to that
used when s is known
2. Use t in place of z, use s in place of s
3. The formula for the 1 - a confidence interval for m is:
s
x t(df, a/2)
n
to
s
+
x t(df, a/2)
n
where df = n - 1
11
Example
 A study is conducted to learn how long it takes the typical tax payer to complete
their federal income tax return (所得稅申報). A random sample of 17 income
tax filers showed a mean time (in hours) of 7.8 and a standard deviation of 2.3.
Find a 95% confidence interval for the true mean time required to complete a
federal income tax return. Assume the time to complete the return is normally
distributed.
Solution:
1. Parameter of Interest
The mean time required to complete a federal income tax return
2. Confidence Interval Criteria
a. Assumptions: Sampled population assumed normal, s unknown
b. Test statistic: t will be used
c. Confidence level: 1 - a = 0.95
12
Solution Continued
3. The Sample Evidence:
n = 17,
x = 7.8, and s = 2.3
4. The Confidence Interval
a. Confidence coefficients: t(df, a /2) = t(16, 0.025) = 2.12
b. Maximum error:
s
2.3
×
=
=
= ( 2.12)(0.5578) = 1.18
t
(16,
0.025)
E
( 2.12)
n
17
x-E
to
x+E
7.8 - 1.18
to
7.8 + 1.18
6.62
to
8.98
c. Confidence limits:
5. The Results:
6.62 to 8.98 is the 95% confidence interval for m
13
Hypothesis-Testing Procedure
1. The t-statistic is used to complete a hypothesis test about a
population mean m
x-m
2. The test statistic: t* =
s n
with df = n - 1
3. The calculated t is the number of estimated standard errors
x is from the hypothesized mean m
4. Probability-Value or Classical Approach
14
Example
 A random sample of 25 students registering for classes showed the mean
waiting time in the registration line was 22.6 minutes and the standard
deviation was 8.0 minutes. Is there any evidence to support the student
newspaper’s claim that registration time takes longer than 20 minutes?
Use a = 0.05 and assume waiting time is approximately normal.
Solution:
1. The Set-up
a. Population parameter of concern: the mean waiting time
spent in the registration line
b. State the null and alternative hypotheses:
Ho: m = 20 () (no longer than)
Ha: m > 20 (longer than)
15
Solution Continued
2. The Hypothesis Test Criteria
a. Check the assumptions:
The sampled population is approximately normal
b. Test statistic: t* with df = n - 1 = 24
c. Level of significance: a = 0.05
3. The Sample Evidence
a. Sample information: n = 25,
x = 22.6, and s = 8
b. Calculate the value of the test statistic:
t* =
x - m 22.6 - 20 2.6
=
=
= 1.625
s n
8 25
1.6
16
Solution Continued
Using the p-Value Procedure:
4. The Probability Distribution
a. The p-value: P = P (t * > 1.625, with df = 24 )
?
0
Portion
of
Table 6
Portion
of
Table 7
t*=1.625
t
Amount of a in one-tail
df
..
.
24
t*
..
.
1.6
...
0.10 0.05
...
1.32 1.71
Degrees of Freedom
...
21
25
...
0.062 0.061
17
Notes:

If this hypothesis test is done with the aid of a computer, most likely the
computer will compute the p-value for you

Using Table 6: place bounds on the p-value

Using Table 7: read the p-value directly from the table for many
situations:
Using Table 6: 0.05 < P < 0.10
Using Table 7: P  0.061
b. The p-value is not smaller than the level of significance, a
18
Solution Continued
Using the Classical Procedure:
4. The Probability Distribution
a. The critical value: t(24, 0.05) = 1.71
0.05
b. t* = 1.625 is not in the critical region
0
t
(24, 0.05)=1.71
t
5. The Results
a. Decision: Fail to reject Ho
b. Conclusion: There is insufficient evidence to show the mean
waiting time is greater than 20 minutes at the 0.05 level
of significance
19
Example
 A new study indicates that higher than normal (220) cholesterol(膽固醇)
levels are a good indicator of possible heart attacks (心臟病發作). A
random sample of 27 heart attack victims showed a mean cholesterol
level of 231 and a standard deviation of 20. Is there any evidence to
suggest the mean cholesterol level is higher than normal for heart attack
victims? Use a = 0.01
Solution:
1. The Set-up
a. Population parameter of concern: The mean cholesterol
level of heart attack victims
b. State the the null and alternative hypothesis:
Ho: m = 220 () (mean is not greater than 220)
Ha: m > 220 (mean is greater than 220)
20
Solution Continued
2. The Hypothesis Test Criteria
a. Assumptions: We will assume cholesterol level is at
least approximately normal.
b. Test statistic: t* (s unknown), df = n - 1 = 26
c. Level of significance: a = 0.01 (given)
3. The Sample Evidence
a. Sample information: n = 27,
x = 231,
and
s = 20
b. Calculate the value of the test statistic:
t* =
x - m 231 - 220
11
=
=
= 2.858
s n 20 27 3.849
21
Solution Continued
4. The Probability Distribution
a. The critical value: t(26, 0.01) = 2.48
0.01
*
0
t(26, 0.01)
t
2.48
b. t* falls in the critical region
5. The Results
a. Decision: Reject H0
b. Conclusion: At the 0.01 level of significance, there is
sufficient evidence to suggest the mean cholesterol
level in heart attack victims is higher than normal
22
9.2 ~ Inferences About the
Binomial Probability of Success
• Possibly the most common inference of all
• Many examples of situations in which we are
concerned about something either happening or not
happening
• Two possible outcomes, and multiple independent
trials
23
Background
1. p: the binomial parameter, the probability of success on a
single trial
2. p ': the observed or sample binomial probability
x x represents the number of successes that
p' =
n occur in a sample consisting of n trials
3. For the binomial random variable x:
m = np,
s = npq ,
where
q = 1- p
4. The distribution of x is approximately normal if n is larger than 20
and if np and nq are both larger than 5
24
Sampling Distribution of p'
Sampling Distribution of p': If a sample of size n is randomly
selected from a large population with p = P(success), then the
sampling distribution of p' has:
1. a mean m p ' equal to p,
2. a standard error s p ' equal to ( pq) / n , and
3. an approximately normal distribution if n is sufficiently large
In practice, use of the following guidelines will ensure normality:
1. The sample size is greater than 20
2. The sample consists of less than 10% of the population
3. The products np and nq are both larger than 5
25
The Assumptions...
The assumptions for inferences about the binomial parameter p:
The n random observations forming the sample are selected
independently from a population that is not changing during the
sampling
Confidence Interval Procedure:
The unbiased sample statistic p' is used to estimate the population
proportion p
The formula for the 1 - a confidence interval for p is:
p ' - z(a/2) ×
p' q'
n
to
p ' + z(a/2) ×
p' q'
n
where p' = x / n and q' = 1 - p'
26
Example
 A recent survey of 300 randomly selected fourth graders showed 210
participate in at least one organized sport during one calendar year. Find
a 95% confidence interval for the proportion of fourth graders who
participate in an organized sport during the year.
Solution:
1. Describe the population parameter of concern
The parameter of interest is the proportion of fourth graders who
participate in an organized sport during the year
2. Specify the confidence interval criteria
a. Check the assumptions
The sample was randomly selected
Each subject’s response was independent
27
Solution Continued
b. Identify the probability distribution
z is the test statistic
p' is approximately normal
n = 300 > 20
np' = 300(210 / 300) = 210 > 5
nq' = 300(90 / 300) = 90 > 5
c. Determine the level of confidence: 1 - a = 0.95
3. Collect and present sample evidence
Sample information: n = 300, and x = 210
The point estimate: p' = x / n = 210 / 300 = 0.70
28
Solution Continued
4. Determine the confidence interval
a. Determine the confidence coefficients:
Using Table 4, Appendix B: z(a /2) = z(0.025) = 1.96
b. The maximum error of estimate:
p' q' =
(0.70)(0.30)
E = z(a /2) ×
1.96
n
300
= (1.96) 0.0007 = (1.96)(0.0265) = 0.0519
c. Find the lower and upper confidence limits:
p'- E
0.70 - 0.0519
0.6481
to
to
to
p'+ E
0.70 + 0.0519
0.7519
29
Solution Continued
d. The Results
0.6481 to 0.7519 is a 95% confidence interval for the
true proportion of fourth graders who participate in an
organized sport during the year
z(a /2) ] 2 ×p * ×q *
[
Sample Size Determination: n =
E2
E: maximum error of estimate
1 - a: confidence level
p*: provisional value of p (q* = 1 - p*)
If no provisional values for p and q are given use p* = q* = 0.5
(Always round up)
30
Example
 Determine the sample size necessary to estimate the true
proportion of laboratory mice with a certain genetic defect (遺傳缺
陷). We would like the estimate to be within 0.015 with 95%
confidence.
Solution:
1. Level of confidence: 1 - a = 0.95, z(a/2) = z(0.025) = 1.96
2. Desired maximum error is E = 0.015.
3. No estimate of p given, use p* = q* = 0.5
[ z(a /2) ]2 ×p * ×q *
(1.96) 2 ×(0.5) ×(0.5)
=
4. Use the formula for n: n =
2
E
(0.015) 2
0.9604
=
= 4268.44  n = 4269
0.000225
31
Note
Note: Suppose we know the genetic defect (遺傳缺陷) occurs in
approximately 1 of 80 animals
Use p* = 1/80 = 0.125:
[z(a /2) ]2 ×p * ×q * (1.96) 2 ×(0.0125) ×(0.9875)
=
n=
2
E
(0.015) 2
0.0474
=
= 210.75  n = 211
0.000225
As illustrated here, it is an advantage to have some indication of the
value expected for p, especially as p becomes increasingly further
from 0.5
32
Hypothesis-Testing Procedure
Hypothesis-Testing Procedure: For hypothesis tests concerning the
binomial parameter p, use the test statistic z*:
p'- p
z* =
pq
n
where
x
p' =
n
 Example: (Probability-Value Approach) A hospital
administrator believes that at least 75% of all adults have a routine
physical(身體檢查) once every two years. A random sample of 250
adults showed 172 had physicals within the last two years. Is
there any evidence to refute(反駁) the administrator's claim? Use
a = 0.05
33
Solution
1. The Set-up
a. Population parameter of concern: the proportion of
adults who have a physical every two years
b. State the null and alternative hypotheses:
Ho: p = 0.75 (>)
Ha: p < 0.75
2. The Hypothesis Test Criteria
a. Assumptions: 250 adults independently surveyed
b. Test statistic: z*
n = 250
np = (250)(0.75) = 187.5 > 5
nq = (250)(0.25) = 62.5 > 5
c. Level of significance: a = 0.05
34
Solution Continued
3. The Sample Evidence
a. Sample information:
n = 250,
x = 172,
and
p' = 172 / 250 = 0.688
p '- p
0.688 - 0.75
=
pq
(0.75)(0.25)
n
250
- 0.062
- 0.062
=
=
= -2.26
0.00075 0.02738
b. The test statistic: z* =
4. The Probability Distribution
a. The p-value:
Use Table 3, Table 5 (Appendix B), or use a computer
35
Solution Continued
p-value
- 2.26
0
z
P = P ( z < -2.26) = 0.5000 - 0.4881 = 0.0119
b. The p-value is smaller than the level of significance, a
5. The Results
a. Decision: Reject Ho
b. Conclusion: There is evidence to suggest the proportion of
adults who have a routine physical exam every
two years is less than 0.75 at the 0.05 level of
significance
36
Example
 (Classical Procedure) A university bookstore employee in charge(負責)
of ordering texts believes 65% of all students sell their statistics books
back to the bookstore at the end of the class. To test this claim, 200
statistics students are selected at random and 141 plan to sell their texts
back to the bookstore. Is there any evidence to suggest the proportion is
different from 0.65? Use a = 0.01
Solution:
1. The Set-up
a. Population parameter of concern: p = the proportion of
students who sell their statistics books back to the bookstore
b. The null and alternative hypotheses:
Ho: p = 0.65
Ha: p  0.65
37
Solution Continued
2. The Hypothesis Test Criteria
a. Assumptions: Sample randomly selected. Each subject’s response
was independent of other responses.
b. Test statistic: z*
n = 200
np = (200)(0.65) = 130 > 5 ;
nq = (200)(0.35) = 70 > 5
c. Level of significance: a = 0.01
3. The Sample Evidence
a. n = 200, x = 141,
p' = 141 / 200 = 0.705
p '- p
0.705 - 0.65
=
pq
(0.65)(0.35)
n
200
0.055
0.055
=
=
= 1.63
0.0011375 0.03373
b. Calculate the value of the test statistic: z* =
38
Solution Continued
4. The Probability Distribution
a. The critical value: z(0.005) = 2.58
0.005
0.005
*
- 2.58
0
2.58
z
b. z* is not in the critical region
5. The Results
a. Decision: Do not reject Ho
b. Conclusion: There is no evidence to suggest the true proportion of
students who sell their statistics texts back to the
bookstore is different from 0.65 at the 0.01 level of
significance
39
Notes
1. There is a relationship between confidence intervals and twotailed hypothesis tests when the level of confidence and the level
of significance add up to 1
2. The confidence interval and the width of the noncritical region are
the same
3. The point estimate is the center of the confidence interval, and the
hypothesized mean is the center of the noncritical region
4. If the hypothesized value of p is contained in the confidence
interval, then the test statistic will be in the noncritical region
5. If the hypothesized value of p does not fall within the confidence
interval, then the test statistic will be in the critical region
40
9.3 ~ Inferences About Variance & Standard Deviation
• Problems often arise that require us to make
inferences about variability
• Usually use variance to make inferences about
variability
• Inferences about the variance of a normally
distributed population use the chi-square, c2,
distribution (卡方分佈)
41
Background
1. The chi-square distributions are a family of probability
distributions
2. Each distribution is identified by a parameter called the number of
degrees of freedom
Properties of the Chi-Square Distribution:
1. c2 is nonnegative in value; it is zero or positively valued
2. c2 is not symmetrical; it is skewed to the right
3. c2 is distributed so as to form a family of distributions, a separate
distribution for each different number of degrees of freedom
42
Various Chi-Square Distributions
c2, df = 6
c2, df = 10
c2, df = 20
0
c2
43
Critical Values for Chi-Square Distribution
1. Table 8 in Appendix B
2. c2(df, a)
The symbol used to identify the critical value of a chi-square
distribution with df degrees of freedom
It denotes a point on the measurement axis so that there is a of
the area to the right of that point
3. Since the chi-square distribution is not symmetrical, the critical
values associated with right and left tails are given separately in
Table 8
44
c2 Distribution Showing c2(df,a)
a
0
c2(df, a)
c2
45
Example
 Example: Find the value of c2(18, 0.025)
0.025
c2(18, 0.025)
31.5
0
Portion of
Table 8
df
..
.
18
c2
Area in Right-hand Tail
...
0.025 . . .
31.5
46
Example
 Example: Find the value of c2(12, 0.99)
0.01
0.99
0
c2(12, 0.99)
c2
3.57
Portion of
Table 8
df
..
.
12
Area to the Right
...
...
0.99
Area in Left-hand Tail
...
...
0.010
3.57
47
Notes
1. When df > 2, the mean value of the chi-square distribution is df
2. The mean is located to the right of the mode (the value where the
curve reaches it high point) and just to the right of the median
(the value that splits the distribution, 50% on either side)
Illustration:
0
mode ~
x x
c2
48
The Assumptions...
The assumptions for inferences about the variance s2 or standard
deviation s: The sample population is normally distributed
Hypothesis-Testing Procedure:
1. The statistical procedures for standard deviation are very sensitive
to nonnormal distributions (skewness, in particular). This makes it
difficult to decide if a significant result is due to sample evidence
or a violation of assumptions.
2
(
n
1
)
s
2
2. The test statistic: c * =
s2
If the random sample is drawn from a normal population with
known variance s2, then c2* has a chi-square distribution with n 1 degrees of freedom
49
Example
 A machine used to fill 5-gallon buckets of driveway sealer has a standard
deviation of 2.5 ounces. A random sample of 24 buckets showed a
standard deviation of 2.9 ounces. Is there any evidence to suggest an
increase in variability at the 0.05 level of significance? Assume the
amount of driveway sealer in a bucket is normally distributed.
Solution:
1. The Set-up
a. Population parameter of concern: the variance s2 for the
amount of driveway sealer in a 5-gallon bucket
b. State the null and alternative hypotheses:
Ho: s2 = 6.25 () (variance is not larger than 6.25)
Ha: s2 > 6.25 (variance is larger than 6.25)
50
Solution Continued
2. The Hypothesis Test Criteria
a. Check the assumptions: The sample population is
assumed to be normally distributed
b. Test statistic: c2* with df = n - 1 = 24 - 1 = 23
c. Level of significance: a = 0.05
3. The Sample Evidence
a. Sample information: n = 24,
s2 = (2.9)2 = 8.41
b. Calculate the value of the test statistic:
c 2* =
(n - 1) s 2
s2
=
(24 - 1)(8.41)
= 30.95
6.25
51
Solution Continued
4. The Probability Distribution (p-Value Approach)
a. The p-value: Use Table 8 Appendix B, or use a computer 0.10 < P < 0.25
b. The p-value is larger than the level of significance, a
~ or ~
4. The Probability Distribution (Classical Approach)
a. The critical value: c2(23, 0.05) = 35.2
b. c2* is not is the critical region
5. The Results
a. Decision: Fail to reject H0
b. Conclusion: There is insufficient evidence to show the variability has
increased at the 0.05 level of significance
52