Exercise IV: Confidence intervals.
Suppose the distribution of some trait X in a population depends on some parameter Q
(e.g. the average and/or variance). It is assumed that the form of the density $f(x, Q)$ or
probability function $P(X = x_k) = p_k(Q)$ is known.
Let $E = (X_1, X_2, \ldots, X_n)$ be a vector of observations. E is an n-dimensional random
variable whose distribution depends on the parameter Q.
Let $\underline{U}(E)$, $\overline{U}(E)$ be functions of the random variable E such that $\underline{U}(E) < \overline{U}(E)$.
Let $\alpha$ be a real number, $0 < \alpha < 1$.
If the following holds:

$$P\left(\underline{U}(E) < Q < \overline{U}(E)\right) \ge 1 - \alpha$$

(for continuous random variables, $= 1 - \alpha$), the interval $(\underline{U}(E), \overline{U}(E))$ is called a confidence
interval for Q, and $1 - \alpha$ the level of confidence.
Theoretically, you can build a confidence interval for each parameter of the trait’s
distribution, but in practice confidence intervals are most often constructed for the average
(Q = m) and the variance (Q = Var(X)). Below we show the corresponding confidence intervals
for both parameters.
Confidence interval for population mean (Q=m)
Model I.
Suppose a trait in some population has a $N(m, \sigma)$ distribution. Let’s assume that m is
unknown, $\sigma$ is known, and the sample is small (n < 30).
The estimate of the population average is the sample statistic $\bar{X}$. Given these conditions, $\bar{X}$
has a $N(m, \sigma/\sqrt{n})$ distribution.
Thus, the standardized random variable $U = \frac{\bar{X} - m}{\sigma}\sqrt{n}$ has a N(0, 1) distribution.
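[Figure: density f(u) of the N(0, 1) distribution, with area $1 - \alpha$ between $-u_\alpha$ and $u_\alpha$ and area $\alpha/2$ in each tail.]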
Hence, the random variable U satisfies:

$$P(-u_\alpha < U < u_\alpha) = 1 - \alpha$$

The value $u_\alpha$ can be read from the tables of the standardized normal distribution for a given
$\alpha$:

$$\Phi(u_\alpha) = 1 - \frac{\alpha}{2}$$
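As a rough illustration of the table lookup, the sketch below inverts the standard normal distribution function numerically. It assumes Python's standard `statistics.NormalDist` is an acceptable stand-in for the printed tables; the function name `normal_critical_value` is hypothetical and not part of the handout.

```python
from statistics import NormalDist


def normal_critical_value(alpha: float) -> float:
    """Return u_alpha such that Phi(u_alpha) = 1 - alpha / 2."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # inv_cdf is the inverse of the N(0, 1) distribution function Phi
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


u = normal_critical_value(0.05)
print(f"u_alpha for alpha = 0.05: {u:.2f}")
```

For $\alpha = 0.05$ this should agree with the table value 1.96 to two decimal places.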
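Since U has a N(0, 1) distribution and we have $U = \frac{\bar{X} - m}{\sigma}\sqrt{n}$:

$$P\left(-u_\alpha < \frac{\bar{X} - m}{\sigma}\sqrt{n} < u_\alpha\right) = 1 - \alpha,$$

$$P\left(\bar{X} - u_\alpha\frac{\sigma}{\sqrt{n}} < m < \bar{X} + u_\alpha\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$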
Thus, for example for $\alpha = 0.05$, the confidence interval is as follows ($u_\alpha$ from the table =
1.96):

$$\left(\bar{X} - \frac{1.96\,\sigma}{\sqrt{n}},\ \bar{X} + \frac{1.96\,\sigma}{\sqrt{n}}\right)$$

which means that about 95 out of 100 intervals constructed in this way contain "m". In other
words, the error in estimation is not greater than $\frac{1.96\,\sigma}{\sqrt{n}}$ in 95% of cases.
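The "95 cases out of 100" reading can be checked by simulation. The following is only a sketch under assumed values (true mean 14, $\sigma = 2$, n = 5, a fixed random seed); the function name `coverage_rate` is hypothetical. It draws many samples, builds the Model I interval for each, and counts how often the true mean falls inside.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_rate(m, sigma, n, alpha=0.05, trials=10_000, seed=0):
    """Fraction of simulated Model I intervals that contain the true mean m."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    radius = u_alpha * sigma / sqrt(n)  # half-length of every interval
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(m, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        if x_bar - radius < m < x_bar + radius:
            hits += 1
    return hits / trials


# Assumed illustration values, not taken from the handout.
rate = coverage_rate(m=14.0, sigma=2.0, n=5)
print(f"share of intervals containing m: {rate:.3f}")
```

The printed share should be close to 0.95; it is the interval, not m, that varies from sample to sample.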
Example. The waiting time for a tram has been studied and the following values obtained (in
minutes): 12, 15, 14, 13, 15. Suppose that the waiting time for a tram has a normal
distribution with unknown mean value (m) and a known standard deviation $\sigma = 2$.
The construction of the confidence interval for the mean consists of the following steps:
Step 1 Calculate the sample mean: $\bar{x} = (12 + 15 + 14 + 13 + 15)/5 = 13.8$.
Step 2 Calculate the radius of the confidence interval: $1.96 \cdot 2/\sqrt{5} = 1.753$.
Step 3 Construct the confidence interval:

$$(\bar{x} - 1.753;\ \bar{x} + 1.753) = (12.0469;\ 15.5530).$$

Thus, with a confidence level of $1 - \alpha = 0.95$, the population average lies in the calculated
confidence interval.
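The three steps of this example can be sketched in a few lines. This is an illustration only: it assumes the normal quantile is computed with `statistics.NormalDist` rather than read from a table, and the function name `mean_ci_known_sigma` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mean_ci_known_sigma(sample, sigma, alpha=0.05):
    """Model I interval for the mean when the population sigma is known."""
    n = len(sample)
    x_bar = fmean(sample)                          # Step 1: sample mean
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    radius = u_alpha * sigma / sqrt(n)             # Step 2: radius
    return x_bar - radius, x_bar + radius          # Step 3: interval


waiting_times = [12, 15, 14, 13, 15]  # tram data from the example
low, high = mean_ci_known_sigma(waiting_times, sigma=2.0)
print(f"95% interval for m: ({low:.4f}; {high:.4f})")
```

Because the quantile is not rounded to 1.96, the last printed digit may differ slightly from a hand calculation.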
Model II.
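The trait X has a $N(m, \sigma)$ distribution in some population, where neither m nor $\sigma$ is known.
To build a confidence interval for m, we will use the t-statistic with n-1 degrees of freedom:

$$t = \frac{\bar{X} - m}{S}\sqrt{n - 1},$$

where S is the sample standard deviation computed with the divisor n.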
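[Figure: density f(t) of the Student t distribution, with area $1 - \alpha$ between $-t_\alpha$ and $t_\alpha$ and area $\alpha/2$ in each tail.]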
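The value $t_\alpha$ is read from the tables for the Student distribution with n-1 degrees of freedom:

$$P(-t_\alpha < t < t_\alpha) = 1 - \alpha,$$

$$P\left(-t_\alpha < \frac{\bar{X} - m}{S}\sqrt{n - 1} < t_\alpha\right) = 1 - \alpha,$$

$$P\left(\bar{X} - t_\alpha\frac{S}{\sqrt{n - 1}} < m < \bar{X} + t_\alpha\frac{S}{\sqrt{n - 1}}\right) = 1 - \alpha.$$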
Thus, for example, for $\alpha = 0.05$ the confidence interval is as follows ($t_\alpha$ from tables;
e.g. for n = 26, i.e. 25 degrees of freedom, this value is 2.060):

$$\left(\bar{X} - \frac{2.060\,S}{5};\ \bar{X} + \frac{2.060\,S}{5}\right)$$

which means that about 95 out of 100 intervals constructed in this way contain "m". In other
words, the error in estimation is not greater than $\frac{2.060\,S}{5}$ in 95% of cases.
Note: the length of this interval is itself random, since it depends on the value of S.
Example. Suppose that in the previous example on the average waiting time for a tram there
was no information on the standard deviation in the population. Calculation of the confidence
interval will be carried out in 4 steps.
Step 1 Calculate the sample mean $\bar{x}$. This has not changed and is still 13.8.
Step 2 Calculate the sample standard deviation.
Step 3 Calculate the radius of the confidence interval.
Step 4 Construct the confidence interval:

$$(\bar{x} - 1.618931187;\ \bar{x} + 1.618931187) = (12.18107;\ 15.41893).$$

Thus, with a confidence level of $1 - \alpha = 0.95$ the population average lies in the calculated
confidence interval.
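The four steps can be sketched as follows. Python's standard library has no Student t quantile function, so this sketch assumes the critical value is read from a table: 2.776 for 4 degrees of freedom at the 95% level. The function name `mean_ci_unknown_sigma` is hypothetical, and the standard deviation is taken with the divisor n so that it matches the $S/\sqrt{n-1}$ form of the interval.

```python
from math import sqrt
from statistics import fmean, pstdev

# Assumed table value: two-sided 95% critical value of Student's t, 4 d.f.
T_CRIT_4DF_95 = 2.776


def mean_ci_unknown_sigma(sample, t_crit):
    """Model II interval for the mean when sigma is unknown."""
    n = len(sample)
    x_bar = fmean(sample)                  # Step 1: sample mean
    s_biased = pstdev(sample)              # Step 2: standard deviation, divisor n
    radius = t_crit * s_biased / sqrt(n - 1)  # Step 3: radius
    return x_bar - radius, x_bar + radius  # Step 4: interval


waiting_times = [12, 15, 14, 13, 15]  # tram data from the example
low, high = mean_ci_unknown_sigma(waiting_times, T_CRIT_4DF_95)
print(f"95% interval for m: ({low:.4f}; {high:.4f})")
```

With the table value rounded to three decimals the endpoints agree with the example to about three decimal places. The interval is narrower than the Model I result only because the sample standard deviation happens to be well below the assumed $\sigma = 2$.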
Model III.
For large samples (n > 30), the central limit theorem shows that $\bar{X}$ is approximately
$N(m, \sigma/\sqrt{n})$, while the law of large numbers shows that $S \approx \sigma$.
Therefore, by substituting in model I the population standard deviation $\sigma$ by the sample
standard deviation s, we get:

$$P\left(\bar{X} - u_\alpha\frac{s}{\sqrt{n}} < m < \bar{X} + u_\alpha\frac{s}{\sqrt{n}}\right) \approx 1 - \alpha$$
Example. Suppose that in the analysis of the average waiting time for a tram there was no
information on the standard deviation in the population, but we managed to gather much more
data.
14, 13, 15, 15, 12, 15, 14, 13, 15,
15, 12, 15, 14, 13, 15, 15, 12, 15,
14, 13, 15, 15, 12, 15, 14, 13, 15
Calculation of the confidence interval can be conducted using the "data analysis / descriptive
statistics" Panel. By entering the data into this previously described panel we get:
Mean                        13.94117647
Standard Error              0.202221074
Median                      14
Mode                        15
Standard Deviation          1.179141354
Sample Variance             1.390374332
Kurtosis                    -1.22463316
Skewness                    -0.58712867
Range                       3
Minimum                     12
Maximum                     15
Sum                         474
Count                       34
Confidence Level (95.0%)    0.411421868
Construct a confidence interval:

$$(\bar{x} - 0.41142;\ \bar{x} + 0.41142) = (13.5297;\ 14.3525).$$

Thus, with a confidence level of $1 - \alpha = 0.95$ the population average is included in the
calculated confidence interval. Note that, in agreement with our intuition, the length of the
interval has significantly decreased as a consequence of the large number of observations; the
estimation has become more accurate.
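For comparison, the sketch below computes the radius directly from the Model III formula $u_\alpha s/\sqrt{n}$, using the mean, standard deviation and count reported by the panel. It also divides the panel's "Confidence Level" figure by the standard error; the ratio comes out near 2.03 rather than 1.96, which would be consistent with the panel using a Student t quantile for 33 degrees of freedom. That reading is an inference from the numbers, not something stated in the handout.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics copied from the descriptive statistics panel.
mean, s, n = 13.94117647, 1.179141354, 34
panel_radius = 0.411421868  # "Confidence Level (95.0%)" row

standard_error = s / sqrt(n)
u_alpha = NormalDist().inv_cdf(0.975)

# Model III radius with the normal quantile.
radius_normal = u_alpha * standard_error
# Quantile implied by the panel's radius.
implied_quantile = panel_radius / standard_error

print(f"Model III interval: ({mean - radius_normal:.4f}; {mean + radius_normal:.4f})")
print(f"quantile implied by the panel: {implied_quantile:.4f}")
```

For a sample of this size the two radii differ only in the second decimal place.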
Model IV (confidence interval for a proportion).
If we examine a population according to the presence or absence of a certain characteristic
(e.g. quality control - products classed as good and bad, non-smokers and smokers, etc.), it
can be described by a two-point distribution:
$$P(X = 1) = p, \quad P(X = 0) = 1 - p,$$
where the random variable X takes the value 1 if the feature exists and 0 if not present.
Thus, if a feature is observed m times in an n-element sample, an approximation of p is

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{m}{n}.$$
We find that for 0.05 < p < 0.95 and n > 100:

$$P\left(\frac{m}{n} - u_\alpha\sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}} < p < \frac{m}{n} + u_\alpha\sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}}\right) \approx 1 - \alpha$$
Example. Estimation of the percentage of smokers among students. In a 1800-element sample, the
number of smokers is m = 600. For $1 - \alpha = 0.95$, $u_\alpha = 1.96$, which can be read from the
cumulative distribution function (two-sided interval). We have

$$\frac{m}{n} = \frac{600}{1800} = 0.333, \qquad \sqrt{\frac{\frac{m}{n}\left(1 - \frac{m}{n}\right)}{n}} = 0.011,$$

so the radius of the interval is 1.96 * 0.0111 = 0.0218. Thus, the 95% confidence interval for
the fraction of students smoking is:
31.16% < p < 35.51% .
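The calculation for a proportion can be sketched as below, assuming the normal approximation of Model IV applies (0.05 < p < 0.95 and n > 100). The function name `proportion_ci` is hypothetical, and the quantile is computed with `statistics.NormalDist` instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(m, n, alpha=0.05):
    """Model IV interval for a proportion p from m occurrences in n trials."""
    p_hat = m / n                                   # point estimate m/n
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    radius = u_alpha * standard_error               # quantile times standard error
    return p_hat - radius, p_hat + radius


low, high = proportion_ci(m=600, n=1800)  # smokers example
print(f"95% interval for p: {100 * low:.2f}% < p < {100 * high:.2f}%")
```

Note that the radius is the standard error multiplied by $u_\alpha$; the standard error alone (about 0.011 here) would give a much narrower interval than the 95% level warrants.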
Confidence interval for standard deviation - variance (Q = σ).
Assumption: The feature X has a normal distribution, or close to normal.
Model V:
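The population mean m and population standard deviation $\sigma$ are not known, the sample is
large (n > 30). The statistic $Z = \frac{nS^2}{\sigma^2}$ has a $\chi^2$ distribution with n-1 degrees of freedom:
[Figure: density of the $\chi^2$ distribution, with area $\frac{1}{2}\alpha$ to the left of $c_1$ and area $\frac{1}{2}\alpha$ to the right of $c_2$.]

$$P\left(c_1 < \frac{nS^2}{\sigma^2} < c_2\right) = 1 - \alpha,$$

where: $P(\chi^2 < c_1) = P(\chi^2 \ge c_2) = \frac{1}{2}\alpha$.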
Note: Most tables provide the value $P(\chi^2 \ge a)$.
We obtain the following confidence interval for the variance:

$$P\left(\frac{nS^2}{c_2} < \sigma^2 < \frac{nS^2}{c_1}\right) = 1 - \alpha.$$
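Example. Consider again the data used in model III.
Step I. We calculate the variance of the sample. The panel reports the sample variance
1.390374332, which is computed with the divisor n - 1 = 33. Thus, the observed value of the
non-standardized $\chi^2$ (the numerator $nS^2$ of the equation, i.e. the sum of squared deviations
from the mean) amounts to 1.390374332 * 33 = 45.8824.
Step II. We now calculate the values of c1 and c2 for n - 1 = 33 degrees of freedom.
For c1: $P(\chi^2 < c_1) = 0.025$ gives c1 = 19.0466.
And for c2: $P(\chi^2 \ge c_2) = 0.025$ gives c2 = 50.7251.
Step III. Hence the required confidence interval for the variance is:

$$\left(\frac{45.8824}{50.7251};\ \frac{45.8824}{19.0466}\right) = (0.9045;\ 2.4090).$$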
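The three steps can be sketched as below. The critical values 19.0466 and 50.7251 are the ones used in the example for 33 degrees of freedom; they are hard-coded because Python's standard library has no chi-square quantile function. Two things here are assumptions rather than statements from the handout: the panel's "Sample Variance" is taken to use the divisor n - 1, so the sum of squared deviations $nS^2$ is recovered as (n - 1) times that value, and the function name `variance_ci` is hypothetical.

```python
def variance_ci(sum_sq_dev, c1, c2):
    """Model V interval for the variance.

    sum_sq_dev is n * S^2, the sum of squared deviations from the sample mean;
    c1 < c2 are the lower and upper chi-square critical values.
    """
    if not 0 < c1 < c2:
        raise ValueError("expected 0 < c1 < c2")
    return sum_sq_dev / c2, sum_sq_dev / c1


n = 34
sample_variance = 1.390374332             # panel value, assumed divisor n - 1
sum_sq_dev = (n - 1) * sample_variance    # Step I: n * S^2
C1, C2 = 19.0466, 50.7251                 # Step II: table values, 33 d.f.

low, high = variance_ci(sum_sq_dev, C1, C2)  # Step III
print(f"95% interval for the variance: ({low:.4f}; {high:.4f})")
print(f"95% interval for sigma: ({low ** 0.5:.4f}; {high ** 0.5:.4f})")
```

Taking square roots of the endpoints gives the corresponding interval for the standard deviation.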
Model VI.
The trait of interest has a normal or close to normal distribution, and the sample is large (n > 30):

$$P\left(\frac{s}{1 + \frac{u_\alpha}{\sqrt{2n}}} < \sigma < \frac{s}{1 - \frac{u_\alpha}{\sqrt{2n}}}\right) \approx 1 - \alpha$$

where: $u_\alpha$ is read from tables for the standard normal distribution, N (0, 1).
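As an illustration of Model VI, the sketch below applies the formula to the standard deviation and count reported for the Model III data (s = 1.179141354, n = 34). The handout gives no worked example for this model, so the choice of data and the function name `sigma_ci_large_sample` are assumptions made here.

```python
from math import sqrt
from statistics import NormalDist


def sigma_ci_large_sample(s, n, alpha=0.05):
    """Model VI interval for sigma: s / (1 +/- u_alpha / sqrt(2n))."""
    u_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = u_alpha / sqrt(2 * n)
    if shift >= 1:
        raise ValueError("sample too small for this approximation")
    return s / (1 + shift), s / (1 - shift)


low, high = sigma_ci_large_sample(s=1.179141354, n=34)
print(f"95% interval for sigma: ({low:.4f}; {high:.4f})")
```

The interval is not symmetric around s, because s is divided by, rather than shifted by, the two correction factors.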
Sample Size Determination for interval estimates of the average with given confidence
level.
Note that in each case the length of the interval (the difference between the right and left
ends of the confidence interval) is given by an explicit formula. This allows us to determine
the sample size needed to estimate the required parameter with a predetermined precision at a
specified level of confidence.
Task: what is the required sample size to obtain a confidence interval of given length
(accuracy) at the chosen confidence level $1 - \alpha$?
Let 2d be the required length of the interval.
Model I:
Length of the interval: $2d = 2\frac{u_\alpha\sigma}{\sqrt{n}}$. Thus: $n = \frac{u_\alpha^2\sigma^2}{d^2}$.
Model II:
Length of the interval: $2d = 2\frac{t_\alpha s}{\sqrt{n - 1}}$. Thus: $n = \frac{t_\alpha^2 s^2}{d^2} + 1$.
It is necessary to draw an initial sample (s is calculated from this sample) of size n0. If it is
established that n > n0, an extra n - n0 elements must be drawn.
Model III:
As in model II.
Model IV:
a) if we know the magnitude of p, the required sample size is $n = \frac{u_\alpha^2\, p(1 - p)}{d^2}$;
b) if we do not know the order of magnitude of p, we use the inequality
$p(1 - p) \le \frac{1}{4}$. Thus: $n = \frac{u_\alpha^2}{4d^2}$.
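The sample size formulas can be sketched as small helper functions. All function names are hypothetical, results are rounded up to the next whole observation, and the input values in the usage lines are illustrative assumptions rather than data from the handout. For Model II the t critical value is passed in as a table value, since the standard library does not provide it.

```python
from math import ceil
from statistics import NormalDist


def u_alpha(alpha):
    """Two-sided critical value of the N(0, 1) distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def n_known_sigma(sigma, d, alpha=0.05):
    """Model I: n = u^2 * sigma^2 / d^2."""
    return ceil(u_alpha(alpha) ** 2 * sigma ** 2 / d ** 2)


def n_unknown_sigma(s, d, t_crit):
    """Model II: n = t^2 * s^2 / d^2 + 1, with s from an initial sample."""
    return ceil(t_crit ** 2 * s ** 2 / d ** 2 + 1)


def n_proportion(d, alpha=0.05, p=None):
    """Model IV: uses p(1 - p) if p is known, else the bound 1/4."""
    pq = 0.25 if p is None else p * (1 - p)
    return ceil(u_alpha(alpha) ** 2 * pq / d ** 2)


# Illustrative inputs (assumed): radius d = 1 minute, sigma = 2.
print(n_known_sigma(sigma=2.0, d=1.0))
# Initial sample of n0 = 5 with s^2 = 1.36 and table value t = 2.776.
print(n_unknown_sigma(s=1.36 ** 0.5, d=1.0, t_crit=2.776))
# Proportion to within 2 percentage points, p unknown and p = 0.1.
print(n_proportion(d=0.02), n_proportion(d=0.02, p=0.1))
```

In the Model II case the result is compared with the initial sample size n0; if it is larger, the difference is the number of extra elements to draw.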
Additional problems.
1) Attendance at statistics lectures is as follows (in %)
72, 70, 58, 62, 67, 58, 90, 91, 56, 68, 68, 70, 71, 52, 69
Assuming this is a random sample of lectures, give a 95% confidence interval for the average
percentage attendance.
2) According to the Wall Street Journal , an average of 44 tons of carbon dioxide will be
saved per year if new, more efficient lamps are used. Assume that this average is based on a
random sample of 25 test runs of the new lamps and the sample standard deviation was 19
tons. Give a 90% confidence interval for the average annual savings.
3) A new optical disc system prototype was tested and it is claimed to be able to record an
average of 2.2 hours of HD TV. Assume n=10 trials and σ=0.2 hour. Give a 90% confidence
interval for the mean recording time.
4) In a survey, Fortune rated companies on a 0 to 10 scale. A random sample of 10 firms and
their scores are as follows:
FedEx 8.94, Walt Disney 8.76, CHS 8.67, McDonald’s 7.82, CVS 6.80, Safeway 6.57,
Starbucks 8.09, Sysco 7.42, Staples 6.45, HNI 7.29
Construct a 95% confidence interval for the average rating of a company on Fortune’s entire
list.
6) Use the following random sample of gasoline prices to construct a 90% confidence interval
for the average price of a litre (in PLN):
3.85, 3.95, 4.95, 4.19, 4.50, 4.25, 4.50, 4.32.
7) The percentage of defective pins in a large batch of pins supplied by a vendor is to be
estimated to within 1% with a 90% confidence level. The actual percentage of defective pins
is guessed to be 4%.
a) What is the minimum sample size?
b) If the actual percentage of defective pins may be anywhere between 3% and 6%, tabulate
the minimum sample size required for actual percentages from 3% to 6%.
c) If the cost of sampling and testing n pins is (25+6n) dollars, tabulate the cost for the same
range of percentages as in part (b).
8) For all the above problems of interval estimation, determine the size of the required
research sample such that the required precision is obtained with a confidence level of not less
than 99%