INTRODUCTION TO STATISTICS ST.PAULS UNIVERSITY
INFERENCE ABOUT A POPULATION VARIANCE
Variance can be used to describe a number of different situations, e.g.
• In quality control, engineers must ensure that products coming out of a production line meet specifications such as size, weight and volume.
• In finance, the variance of the returns on a portfolio of investments is a measure of the uncertainty and risk inherent in the portfolio.
The sample variance is an unbiased, consistent estimator of the population variance.
Chi-square sampling distribution
In repeated sampling from a normal population whose variance is $\sigma^2$, the variable $\frac{(n-1)s^2}{\sigma^2}$ is chi-square distributed with $(n-1)$ degrees of freedom. The variable $\frac{(n-1)s^2}{\sigma^2}$ is called the chi-square statistic and is denoted by $\chi^2$. The $\chi^2$ variable can equal any value between 0 and $\infty$.
Chi-square notation
The value of $\chi^2$ such that the area to its right under the chi-square curve is equal to $\alpha$ is denoted by $\chi^2_{\alpha}$. The value $\chi^2_{1-\alpha}$ is the point such that the area to its right is $1-\alpha$. Hence, the area to its left is $\alpha$.
Testing the population variance
Test statistic for $\sigma^2$
The test statistic used to test hypotheses about $\sigma^2$ is
$$\chi^2 = \frac{(n-1)s^2}{\sigma^2}$$
which is chi-square distributed with $(n-1)$ degrees of freedom, provided the population random variable is normally distributed.
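As a rough illustration, the test statistic can be computed directly from the sample size, the sample variance and the hypothesised variance. The sketch below is not part of the original notes: the function name `chi_square_statistic` is hypothetical, the inputs are the summary figures of the bottle-filling example given later in this section (n = 10, s = 1.19, hypothesised standard deviation 2), and the critical value is assumed to be read from a standard chi-square table rather than computed.

```python
def chi_square_statistic(n, sample_variance, hypothesised_variance):
    """Return (n - 1) * s^2 / sigma_0^2 for a test about a population variance."""
    if n < 2:
        raise ValueError("need at least two observations")
    if hypothesised_variance <= 0:
        raise ValueError("hypothesised variance must be positive")
    return (n - 1) * sample_variance / hypothesised_variance


# Summary figures from the bottle-filling example: n = 10, s = 1.19, sigma_0 = 2.
statistic = chi_square_statistic(10, 1.19 ** 2, 2.0 ** 2)

# Assumption: lower-tail critical value chi^2_{0.95, 9} = 3.325, taken from a
# standard chi-square table (it is not computed here).
critical_value = 3.325

# H1: sigma^2 < 4 is a left-tailed test, so H0 is rejected when the statistic
# falls below the lower-tail critical value.
reject_h0 = statistic < critical_value
print(f"chi-square = {statistic:.3f}, reject H0: {reject_h0}")
```

The function only evaluates the formula; the direction of the rejection region still has to be chosen from the alternative hypothesis.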
Estimating the population variance
The confidence interval estimator of $\sigma^2$ is
$$\text{LCL} = \frac{(n-1)s^2}{\chi^2_{\alpha/2}}, \qquad \text{UCL} = \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$
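The interval follows from the chi-square statistic by a short rearrangement. The derivation below is a sketch added for clarity, in the same notation as the notes: with probability $1-\alpha$ the statistic falls between the two tail points, and solving the inequality for $\sigma^2$ gives the two limits.

```latex
\begin{aligned}
P\left(\chi^2_{1-\alpha/2} < \frac{(n-1)s^2}{\sigma^2} < \chi^2_{\alpha/2}\right) &= 1-\alpha \\
P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right) &= 1-\alpha
\end{aligned}
```

Note that the larger table value, $\chi^2_{\alpha/2}$, produces the lower limit because it appears in the denominator.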
Examples
1. A manufacturer of a bottle-filling machine claims that the standard deviation of the fills
from his machine is less than 2 cc. In a random sample of 10 fills, the sample standard
deviation was 1.19 cc. Is this sufficient evidence at the 5% level of significance to support
the manufacturer’s claim? (Assume a normal population).
2. A company manufactures steel shafts for use in engines. One method of judging
inconsistencies in the production process is to determine the variance of the lengths of the
shafts. A random sample of 10 shafts produced the following measurements of their lengths
in centimeters
20.5, 19.8, 21.1, 20.2, 18.9, 19.6, 20.7, 20.1, 19.8, 19.0
Find a 90% confidence interval estimate for the population variance $\sigma^2$, assuming that the
lengths of the steel shafts are normally distributed.
3. One important factor in inventory control is the variance of the daily demand for a product.
Management believes that demand is normally distributed with the variance equal to 250. In
an experiment to test this belief about $\sigma^2$, the daily demand was recorded for 25 days. The
data have been summarized as follows:
$\bar{x} = 50.6$, $S^2 = 500$
Do the data provide sufficient evidence to show that management’s belief about the
variance is untrue? (Use $\alpha = 0.01$).
4. Some traffic experts claim that the variability of automobile speeds is a critical factor in
determining how many accidents are likely to occur on a highway. The greater the
variability, the more the accidents. Suppose a random sample of 101 cars reveals that the
mean and variance of their speeds are 57.3 km/h and 88.7 (km/h)$^2$ respectively.
i. Can we conclude at the 5% significance level that the variance of all cars' speeds exceeds 50 (km/h)$^2$?
ii. Estimate the variance of all cars' speeds with 99% confidence.
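The confidence interval estimator of the variance can be sketched in a few lines, here applied to the shaft lengths of the second example above. This is an illustrative sketch, not the notes' own solution: the function name `variance_confidence_interval` is hypothetical, and the two chi-square table values for 9 degrees of freedom are assumptions taken from a standard table and passed in by hand.

```python
from statistics import variance


def variance_confidence_interval(data, chi2_upper_tail, chi2_lower_tail):
    """Return (LCL, UCL) for the population variance.

    chi2_upper_tail is chi^2_{alpha/2} and chi2_lower_tail is chi^2_{1-alpha/2},
    both with n - 1 degrees of freedom, supplied by the caller from a table.
    """
    n = len(data)
    sum_of_squares = (n - 1) * variance(data)  # (n - 1) * s^2
    return sum_of_squares / chi2_upper_tail, sum_of_squares / chi2_lower_tail


# Shaft lengths (cm) from the second example.
lengths = [20.5, 19.8, 21.1, 20.2, 18.9, 19.6, 20.7, 20.1, 19.8, 19.0]

# Assumption: table values for 9 degrees of freedom and a 90% interval,
# chi^2_{0.05, 9} = 16.919 and chi^2_{0.95, 9} = 3.325.
lcl, ucl = variance_confidence_interval(lengths, 16.919, 3.325)
print(f"90% CI for the variance: ({lcl:.3f}, {ucl:.3f})")
```

`statistics.variance` already divides by n - 1, so multiplying it back by n - 1 recovers the numerator $(n-1)s^2$ used in both limits.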
Sampling distribution of the sample proportion
Sampling distribution of p̂
The sample proportion $\hat{p}$ is approximately normally distributed, with mean $p$ and standard deviation $\sqrt{\frac{pq}{n}}$, provided that $n$ is large ($np \geq 5$ and $nq \geq 5$).
Since $\hat{p}$ is approximately normal, it follows that the standardized variable
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$$
is approximately standard normally distributed.
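The mean and standard deviation quoted above follow from the binomial distribution. As a brief sketch (the symbol $X$ for the number of successes in the sample is introduced here and does not appear in the original notes), write $\hat{p} = X/n$, where $X$ is binomial with parameters $n$ and $p$, and $q = 1 - p$:

```latex
E(\hat{p}) = \frac{E(X)}{n} = \frac{np}{n} = p,
\qquad
\operatorname{Var}(\hat{p}) = \frac{\operatorname{Var}(X)}{n^2} = \frac{npq}{n^2} = \frac{pq}{n},
\qquad
\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}
```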
Testing the population proportion
The null and alternative hypotheses of tests of proportions are set up in the same way as the hypotheses of tests about the mean and the variance. The test statistic for $p$ is
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$$
Example:
An inventor has developed a system that allows visitors to museums, zoos and other attractions
to get information at the touch of a digital code. For example, zoo patrons can listen to an
announcement (recorded on a microchip) about each animal they see. It is anticipated that the
device would rent for $3.00 each. The installation cost for the complete system is expected to be
about $400,000. The ABC zoo is interested in having the system installed, but the management is
uncertain about whether to take the risk. A financial analysis of the problem indicates that if
more than 10% of the zoo visitors rent the system, the zoo will make a profit. To help make the
decision, a random sample of 400 zoo visitors is given details of the system's capabilities and
cost. If 48 people say that they would rent the device, can the management of the zoo conclude at
the 5% significance level that the investment would result in a profit?
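The test statistic for a proportion can be evaluated with the standard library alone. The following is a minimal sketch applied to the zoo example, assuming an upper-tailed test of $H_0: p = 0.10$ against $H_1: p > 0.10$; the function name `proportion_z_test` is hypothetical and not taken from the notes.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0):
    """Return (z, upper-tail p-value) for H0: p = p0 against H1: p > p0."""
    q0 = 1.0 - p0
    if n * p0 < 5 or n * q0 < 5:
        raise ValueError("normal approximation needs n*p0 >= 5 and n*q0 >= 5")
    p_hat = successes / n
    # The standard error uses the hypothesised p0, as in the test statistic above.
    z = (p_hat - p0) / sqrt(p0 * q0 / n)
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Zoo example: 48 of 400 visitors would rent; H0: p = 0.10, H1: p > 0.10.
z, p_value = proportion_z_test(48, 400, 0.10)
z_critical = NormalDist().inv_cdf(0.95)  # upper 5% point, about 1.645
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {z > z_critical}")
```

Comparing z with the critical value and comparing the p-value with 0.05 are equivalent ways of reaching the same decision.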
Confidence interval estimator of $p$ is
$$\hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
Example
1. A factory produces a component that is used in manufacturing computers. Each component is
tested prior to shipment to determine whether or not it is defective. In a random sample of
250 units, 20 were found to be defective. Estimate with 99% confidence the true proportion
of defective components produced by the factory. (0.036, 0.124)
2. In a random sample of 100 units from an assembly line, 22 were defective.
(a) Does this provide sufficient evidence at the 10% significance level to allow us to
conclude that the defective rate among all units exceeds 10%?
(b) Find the p-value of the above test.
(c) Find a 99% confidence interval estimate of the defective rate.
3. A manufacturer of computer chips claims that more than 90% of his products conform to
specifications. In a random sample of 1,000 chips drawn from a large production run, 75
were defective. Do the data provide sufficient evidence at the 1% level of significance to
enable us to conclude that the manufacturer’s claim is true? What is the P-value of the test?
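The confidence interval estimator of $p$ can be checked against the answer quoted for the first example above (20 defective units in a sample of 250, 99% confidence). The sketch below is illustrative only; the function name `proportion_confidence_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence):
    """Return (lower, upper) limits of p_hat +/- z_{alpha/2} * sqrt(p_hat*q_hat/n)."""
    p_hat = successes / n
    q_hat = 1.0 - p_hat
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin


# First example: 20 defective units in a sample of 250, 99% confidence.
lower, upper = proportion_confidence_interval(20, 250, 0.99)
print(f"({lower:.3f}, {upper:.3f})")  # prints (0.036, 0.124)
```

The result agrees with the interval (0.036, 0.124) given in the example.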
INFERENCE ABOUT THE DIFFERENCE BETWEEN TWO MEANS WHEN THE
POPULATION VARIANCES ARE KNOWN
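Sampling distribution of $\bar{x}_1 - \bar{x}_2$ when $\sigma_1^2$ and $\sigma_2^2$ are known.
• $\bar{x}_1 - \bar{x}_2$ is normally distributed if the populations that have been sampled are normal. If the populations are not normal, $\bar{x}_1 - \bar{x}_2$ is approximately normal if the samples are large.
• The expected value of $\bar{x}_1 - \bar{x}_2$ is $E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$
• The standard deviation of $\bar{x}_1 - \bar{x}_2$ is
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
It then follows that
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$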
The test statistic for $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are known is
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
Confidence interval estimator of $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are known is
$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Example
The selection of a new store location depends on many factors, one of which is the level of
household income in areas around the proposed site. A large departmental store chain is trying to
decide whether to build a new store in Nakuru or in the nearby city of Nairobi. Building costs are
lower in Nairobi and the company decides it will build there unless the average household
income is higher in Nakuru than in Nairobi. In a survey of 100 residences in each of the cities,
the mean household income was Sh. 29,980 in Nakuru and Sh. 28,650 in Nairobi. From other sources, it
is known that the population standard deviations of households’ incomes are Sh. 4,740 in Nakuru
and Sh. 5,365 in Nairobi.
(a) At the 5% significance level, can it be concluded that the mean household income in
Nakuru exceeds that of Nairobi?
(b) Estimate, with 90% confidence, the difference between the mean household
income in Nakuru and that in Nairobi.
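Both parts of the store-location example use the same standard error, so they can be sketched together. This is a minimal illustration, not the notes' own worked solution; the function name `two_mean_z` is hypothetical, and population 1 is taken to be Nakuru and population 2 Nairobi.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z(mean1, mean2, sigma1, sigma2, n1, n2, hypothesised_diff=0.0):
    """Return (z, standard error) for the difference of two means, variances known."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((mean1 - mean2) - hypothesised_diff) / standard_error
    return z, standard_error


# Store-location example: population 1 is Nakuru, population 2 is Nairobi.
z, se = two_mean_z(29_980, 28_650, 4_740, 5_365, 100, 100)

# (a) H1: mu1 - mu2 > 0, so use the upper-tail 5% critical value.
reject_h0 = z > NormalDist().inv_cdf(0.95)

# (b) 90% confidence interval: (x1_bar - x2_bar) +/- z_{0.05} * standard error.
margin = NormalDist().inv_cdf(0.95) * se
interval = (29_980 - 28_650 - margin, 29_980 - 28_650 + margin)
print(f"z = {z:.3f}, reject H0: {reject_h0}")
print(f"90% CI: ({interval[0]:.0f}, {interval[1]:.0f})")
```

The same critical value appears in both parts only because a one-tailed 5% test and a two-sided 90% interval each leave 5% in the upper tail.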
INFERENCE ABOUT THE DIFFERENCE BETWEEN TWO MEANS WHEN THE
POPULATION VARIANCES ARE UNKNOWN
The test statistic for $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are unknown and $n_1 > 30$ and $n_2 > 30$ is
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$$
Confidence interval estimator of $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are unknown and $n_1 > 30$ and $n_2 > 30$ is
$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
The test statistic for $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are unknown and $n_1 \leq 30$ and $n_2 \leq 30$ is
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where
$$S_P^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$
The test statistic is Student t distributed with $n_1 + n_2 - 2$ degrees of freedom, provided that the
following conditions are satisfied:
• The two population random variables ($x_1$ and $x_2$) are normally distributed
• The two population variances are equal, i.e. $\sigma_1^2 = \sigma_2^2$
The quantity $S_P^2$ is called the pooled variance estimate.
Confidence interval estimator of $\mu_1 - \mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ are unknown and $n_1 \leq 30$ and $n_2 \leq 30$ is
$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\; n_1 + n_2 - 2}\sqrt{S_P^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Examples
1. Despite some controversy, scientists generally agree that high fibre cereals reduce various
forms of cancer. However, one scientist claims that people who eat high-fibre cereal for
breakfast will consume on average fewer calories for lunch than people who do not eat high-fibre cereal for breakfast. If this is true, high-fibre cereal manufacturers will be able to claim
another advantage of eating their products: potential weight reduction for dieters. To test the
claim, 200 people were randomly sampled and asked what they regularly eat for breakfast
and lunch. Each person was identified as either a consumer or a non-consumer of high fibre
cereal, and the number of calories consumed at lunch was measured and recorded. These data
are summarized below:
Calories consumed at lunch
Consumers of high-fibre cereal: $n_1 = 41$, $\bar{x}_1 = 603$, $S_1 = 110$
Non-consumers of high-fibre cereal: $n_2 = 159$, $\bar{x}_2 = 639$, $S_2 = 141$
(a) Is there sufficient evidence at the 5% significance level to support the scientist’s
claim?
(b) Estimate with 95% confidence the difference in mean consumption of calories at
lunch between those who regularly eat and those who do not eat high fibre cereals
for breakfast.
2. The manager of a large production facility believes that worker productivity is a function of,
among other things, the design of the job, which refers to the sequence of worker movements.
Two designs are being considered for the production of a new product. To help decide which
should be used, an experiment was performed. Six randomly selected workers assembled the
product using design A and another eight workers assembled the product utilizing design B.
The assembly times, which are normally distributed, are shown below:
Design A: 8.3, 5.3, 6.5, 5.1, 9.7, and 10.8
Design B: 9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, and 8.0
(a) Can the manager conclude at the 5% significance level that the assembly times differ
for the two designs?
(b) Estimate with 99% confidence the difference in mean assembly times between design
A and design B.
3. High blood pressure is a leading cause of strokes. Medical researchers are constantly seeking
ways to treat patients suffering from this condition. A specialist in hypertension claims that
regular aerobic exercise can reduce high blood pressure just as successfully as drugs, with
none of the adverse side effects. To test the claim, 50 patients who suffer from high blood
pressure were chosen to participate in an experiment. For 60 days, half the sample exercised
three times per week for one hour; the other half took the standard medication. The
percentage reduction in blood pressure was recorded for each individual; the resulting data
are shown in the table below
Exercise: $\bar{X}_1 = 14.31$, $S_1 = 1.63$
Medication: $\bar{X}_2 = 13.28$, $S_2 = 1.82$
Can we conclude at the 1% significance level that exercise is at least as effective as
medication in reducing hypertension?
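The pooled-variance t statistic defined earlier in this section can be sketched as follows, using the assembly times of the second example above. This is an illustration under the stated assumptions (normal populations, equal variances); the function name `pooled_t` is hypothetical, and the critical value is an assumption read from a standard t table rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2, hypothesised_diff=0.0):
    """Return (t, degrees of freedom, pooled variance) for two independent samples.

    Assumes both populations are normal with equal variances.
    """
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    standard_error = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t = ((mean(sample1) - mean(sample2)) - hypothesised_diff) / standard_error
    return t, df, pooled_var


# Assembly times from the second example.
design_a = [8.3, 5.3, 6.5, 5.1, 9.7, 10.8]
design_b = [9.5, 8.3, 7.5, 10.9, 11.3, 9.3, 8.8, 8.0]
t, df, pooled_var = pooled_t(design_a, design_b)

# Assumption: two-tailed 5% critical value t_{0.025, 12} = 2.179 from a t table.
reject_h0 = abs(t) > 2.179
print(f"t = {t:.3f} with {df} df, pooled variance = {pooled_var:.3f}, reject H0: {reject_h0}")
```

Returning the degrees of freedom alongside t makes it easier to look up the matching row of the t table.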
Inference about the difference between two population proportions
Sampling distribution of $\hat{p}_1 - \hat{p}_2$
1. The statistic $\hat{p}_1 - \hat{p}_2$ is approximately normally distributed provided the sample sizes are
large enough so that $n_1\hat{p}_1$, $n_1\hat{q}_1$, $n_2\hat{p}_2$ and $n_2\hat{q}_2$ are all $\geq 5$.
2. The mean of $\hat{p}_1 - \hat{p}_2$ is $E(\hat{p}_1 - \hat{p}_2) = p_1 - p_2$
3. The variance of $\hat{p}_1 - \hat{p}_2$ is $\operatorname{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}$
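Test statistic for $\hat{p}_1 - \hat{p}_2$: Case 1
If the null hypothesis specifies that $H_0: p_1 - p_2 = 0$, the test statistic is
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled sample proportion ($x_1$ and $x_2$ being the numbers of successes in the two samples) and $\hat{q} = 1 - \hat{p}$.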
Test statistic for $\hat{p}_1 - \hat{p}_2$: Case 2
If the null hypothesis specifies that $H_0: p_1 - p_2 = D$, where $D \neq 0$, the test statistic is
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}}$$
Examples
1. An insurance company is thinking about offering discounts on its life insurance policies to
nonsmokers. As part of its analysis, it randomly selects 200 men who are 50 years old and
asks them if they smoke at least one pack of cigarettes per day and if they have ever suffered
from heart disease. The results indicate that 20 out of 80 smokers and 15 out of 120 nonsmokers suffer from heart disease. Can we conclude at the 5% level of significance that
smokers have a higher incidence of heart disease than nonsmokers?
2. The process that is used to produce a complex component used in medical instruments
typically results in defective rates in the 40% range. Recently, two innovations have been
developed. Innovation one appears to be more promising but is considerably more expensive
to purchase and operate than innovation two. After a careful analysis of the costs,
management decides that it will adopt innovation one only if the proportion of defective
components it produces is at least 8% smaller than that produced by innovation two. In a
random sample of 300 units produced by innovation one, 33 are found to be defective. At the
1% significance level, can we conclude that there is sufficient evidence to justify adopting
innovation 1? Determine the p-value of the test.
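The two cases of the test statistic differ only in the standard error, which a single function can capture. The sketch below is illustrative and applies Case 1 to the insurance example above; the function name `two_proportion_z` is hypothetical, and the pooled proportion is computed as the total number of successes divided by the total sample size.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, hypothesised_diff=0.0):
    """Return z for H0: p1 - p2 = D.

    Case 1 (D = 0) uses the pooled proportion; Case 2 (D != 0) uses the
    separate sample proportions in the standard error.
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    if hypothesised_diff == 0:
        p_pooled = (x1 + x2) / (n1 + n2)
        standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    else:
        standard_error = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return ((p1_hat - p2_hat) - hypothesised_diff) / standard_error


# Insurance example: 20 of 80 smokers and 15 of 120 nonsmokers; H1: p1 - p2 > 0.
z = two_proportion_z(20, 80, 15, 120)
p_value = 1.0 - NormalDist().cdf(z)  # upper-tail p-value
print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0 at 5%: {p_value < 0.05}")
```

Passing a non-zero `hypothesised_diff` switches the function to the Case 2 standard error.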
Estimating the difference between two population proportions
The confidence interval estimator of $p_1 - p_2$ is
$$p_1 - p_2 = (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$
Examples
1. In 1998, a survey of 1654 Kenyans found that 37% believed that the “energy crisis” was a
hoax. In 2004, of 1814 Kenyans, 42% believed that the energy crisis was a hoax. In order to
determine the real size of the change, estimate with 90% confidence the difference between
the 1998 and 2004 proportions.
2. In a public opinion survey, 60 out of a sample of 100 high-income voters and 40 out of a
sample of 75 low-income voters supported a decrease in sales tax.
i. Can we conclude at the 5% level of significance that the proportion of voters
favouring a sales tax decrease differs between high and low income voters?
ii. What is the p-value of the test?
iii. Estimate the differences in proportions, with 99% confidence.
3. In a random sample of 500 television sets from a large production line, there were 80
defective sets. In a random sample of 200 television sets from a second production line, there
were 10 defective sets. Do these data provide sufficient evidence that the proportion of
defective sets from the first line exceeds the proportion of defective sets from the second by
more than 3%? (Use $\alpha = 0.05$)
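The confidence interval estimator of the difference between two proportions can be sketched with the figures of the first example above (37% of 1654 respondents in 1998 and 42% of 1814 in 2004). This is an illustration only; the function name `proportion_difference_ci` is hypothetical, and the difference is taken as the 1998 proportion minus the 2004 proportion.

```python
from math import sqrt
from statistics import NormalDist


def proportion_difference_ci(p1_hat, n1, p2_hat, n2, confidence):
    """Return (lower, upper) limits for p1 - p2 using the unpooled standard error."""
    standard_error = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)  # z_{alpha/2}
    diff = p1_hat - p2_hat
    return diff - z * standard_error, diff + z * standard_error


# First example: 37% of 1654 respondents in 1998, 42% of 1814 in 2004.
lower, upper = proportion_difference_ci(0.37, 1654, 0.42, 1814, 0.90)
print(f"90% CI for p_1998 - p_2004: ({lower:.3f}, {upper:.3f})")
```

Both limits come out negative, which corresponds to a higher proportion in 2004 than in 1998.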