CHAPTER 9
Inference: Estimation
The essential difference between inferential statistics and descriptive statistics is one of knowledge. In descriptive
statistics, the analyst has knowledge of the population data. Descriptive statistics such as the mean, mode,
and standard deviation are typically used to "collapse" the population data for convenience of reporting or
interpretation. In inferential statistics, knowledge about the population is limited to what can be derived from
samples. For economic or logical reasons it is not possible to view all of the population
data, so we must examine our sample data and make inferences about the population. We can view this process as
illustrated in the following figure:
Point Estimation
A point estimate of the population parameter $\theta$ is a single number that can be regarded as the most
plausible value of $\theta$. It is obtained by selecting a suitable statistic $\hat\theta$ and computing its value from the
given sample data. The selected statistic is called the point estimator of $\theta$.
Point estimate of the population mean
Point Estimate of the Population Standard Deviation
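The formulas under the two headings above did not survive in this transcript. The following is a minimal sketch, assuming the usual choices: the sample mean as the point estimate of the population mean, and the sample standard deviation with the n - 1 divisor (the same $S^2$ used later in the chapter) as the point estimate of the population standard deviation. The function name and the data are hypothetical and serve only as an illustration.

```python
import math

def point_estimates(sample):
    """Return (sample mean, sample standard deviation) for a list of numbers."""
    n = len(sample)
    mean = sum(sample) / n
    # Sample variance with the n - 1 divisor.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, math.sqrt(variance)

# Hypothetical data, for illustration only.
data = [4.0, 6.0, 5.0, 7.0, 3.0]
x_bar, s = point_estimates(data)
print(f"point estimate of the mean: {x_bar:.3f}")
print(f"point estimate of the standard deviation: {s:.3f}")
```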
Consider three marksmen, each shooting at a target. Here we have three different
situations.
Target 1 has all its shots clustered tightly together, but none
of them hit the bullseye.
Target 2 has a large spread, but on average the bullseye is hit.
Target 3 has a tight cluster around the bullseye.
In statistical terminology, we say that
Target 1 is biased, with a small variance.
Target 2 is unbiased, with a large variance.
Target 3 is unbiased, with a small variance.
If you were hiring for the police department, which shooter
would you want? In general in statistics, we want an estimator that is
both unbiased and has a small variance: one that is almost always
"on target".
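The three targets can be imitated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with a hypothetical mean of 10 and standard deviation of 1, and the three "shooters" are three hypothetical estimators of the mean chosen to mimic the targets (a shifted sample mean, a single observation, and the sample mean).

```python
import random
import statistics

random.seed(0)          # fixed seed so the run is repeatable
MU, SIGMA = 10.0, 1.0   # hypothetical true population values
N, REPS = 50, 2000      # sample size and number of repeated samples

def simulate(estimator):
    """Return (bias, variance) of an estimator over REPS random samples."""
    estimates = []
    for _ in range(REPS):
        sample = [random.gauss(MU, SIGMA) for _ in range(N)]
        estimates.append(estimator(sample))
    return statistics.fmean(estimates) - MU, statistics.variance(estimates)

# Target 1: tight cluster, off centre (sample mean shifted by +1).
bias1, var1 = simulate(lambda s: statistics.fmean(s) + 1.0)
# Target 2: centred but widely spread (uses only the first observation).
bias2, var2 = simulate(lambda s: s[0])
# Target 3: centred and tight (the sample mean).
bias3, var3 = simulate(statistics.fmean)

for name, b, v in [("Target 1", bias1, var1),
                   ("Target 2", bias2, var2),
                   ("Target 3", bias3, var3)]:
    print(f"{name}: bias = {b:+.3f}, variance = {v:.4f}")
```

The printed bias and variance for each estimator should show the same pattern as the three targets: only the third is both centred and tightly clustered.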
Interval Estimation
A confidence-interval estimate for a population parameter is a set
of numbers obtained from the point estimate of the parameter,
coupled with a "percentage" or probability which characterizes
how confident we are that the parameter lies within the
interval.
The confidence level is the value of that "percentage" of
confidence.
If we wish to be 95% confident that a randomly selected value
drawn from a normal distribution with a mean of 0 and a
standard deviation of 1 will fall within an interval, this
interval must be constructed with end-points $-z$
and $z$ such that:
$$\Pr(-z < Z < z) = 0.95$$
To accomplish the desired result, we must find the appropriate
$z$ value for the interval; that is, the interval looks like the
following:
$$\Pr(-1.96 < Z < 1.96) = 0.95$$
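The value 1.96 can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence

# The central 95% of N(0, 1) leaves alpha/2 in each tail,
# so z is the quantile at 1 - alpha/2.
z = NormalDist().inv_cdf(1 - alpha / 2)
coverage = NormalDist().cdf(z) - NormalDist().cdf(-z)

print(f"z = {z:.2f}")
print(f"Pr(-z < Z < z) = {coverage:.2f}")
```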
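Normal distribution, $\sigma$ known
Recall that if $X_1, X_2, \ldots, X_n$ are a random sample from a $N(\mu, \sigma^2)$ distribution, then
$$\mu_{\bar X} = \mu \quad\text{and}\quad SE(\bar X) = \sigma_{\bar X} = \frac{\sigma}{\sqrt{n}},$$
i.e., $\bar X$ is distributed as $N(\mu, \sigma^2/n)$. From the properties of the normal distribution,
$$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}}$$
is distributed as $N(0,1)$.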
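Thus, if we know $\sigma$, then we know that
$$P\left(-z_{\alpha/2} < \frac{\bar X - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) = 1 - \alpha.$$
This is equivalent, rearranging terms algebraically, to saying that the probability is $1 - \alpha$ that the interval
$$\left(\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
covers $\mu$.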
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and
known variance $\sigma^2$. Then the interval
$$\left(\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
is a $(1-\alpha)100\%$ confidence interval for $\mu$.
When written as $\bar X \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, we can see that $\bar X$ is the (point) estimator of $\mu$ and that
$e = z_{\alpha/2}\,\sigma/\sqrt{n}$ is the margin of error in the estimate, that is, a measure of the accuracy of the
estimate. The length of the confidence interval is $L = 2\,z_{\alpha/2}\,\sigma/\sqrt{n}$.
Example 9.2
The average zinc concentration recovered from a sample of zinc
measurements at 36 different locations is found to be 2.6 grams per
millilitre. Find the 95% and 99% confidence intervals for the mean
zinc concentration in the river. Assume that the population standard
deviation is 0.3.
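One way to work Example 9.2 is to apply the z-interval $\bar X \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ directly. The sketch below is illustrative: the helper name `z_interval` is hypothetical, and the critical value is taken from `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)   # margin of error e
    return x_bar - margin, x_bar + margin

# Values from Example 9.2: x_bar = 2.6, sigma = 0.3, n = 36.
ci95 = z_interval(2.6, 0.3, 36, 0.95)
ci99 = z_interval(2.6, 0.3, 36, 0.99)

print(f"95% CI: ({ci95[0]:.3f}, {ci95[1]:.3f})")
print(f"99% CI: ({ci99[0]:.3f}, {ci99[1]:.3f})")
```

The 99% interval is wider than the 95% interval: a higher confidence level costs precision.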
Example 9.3
How large a sample is required in Example 9.2 if we want to be
95% confident that our estimate of $\mu$ is off by less than 0.05?
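Example 9.3 follows from requiring the margin of error $e = z_{\alpha/2}\,\sigma/\sqrt{n}$ to be at most 0.05 and solving for $n$, which gives $n \ge (z_{\alpha/2}\,\sigma/e)^2$, rounded up to a whole number. A minimal sketch, with a hypothetical helper name:

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, e, confidence):
    """Smallest n with z * sigma / sqrt(n) <= e."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / e) ** 2)

# Values from Examples 9.2 and 9.3: sigma = 0.3, e = 0.05, 95% confidence.
n_required = sample_size_for_mean(0.3, 0.05, 0.95)
print(n_required)
```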
Normal distribution, $\sigma$ unknown
If $X_1, X_2, \ldots, X_n$ are a random sample from a normal distribution with unknown mean $\mu$ and known variance
$\sigma^2$, then the interval
$$\left(\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$
is a $(1-\alpha)100\%$ confidence interval for $\mu$.
If $\sigma^2$ is unknown, it is estimated by
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar X)^2}{n-1}.$$
In that case,
$$T = \frac{\bar X - \mu}{\hat\sigma_{\bar X}}$$
has a t distribution with $\nu = n-1$ degrees of freedom, where $\hat\sigma^2_{\bar X} = S^2/n$.
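Thus,
$$P\left(-t_{\alpha/2} < \frac{\bar X - \mu}{S/\sqrt{n}} < t_{\alpha/2}\right) = P\left(-t_{\alpha/2}\frac{S}{\sqrt{n}} < \bar X - \mu < t_{\alpha/2}\frac{S}{\sqrt{n}}\right) = 1 - \alpha,$$
and this is the same as saying that with a probability of $1 - \alpha$, $\bar X$ is within $t_{\alpha/2}\,S/\sqrt{n}$
of $\mu$, so that we have: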
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution with unknown mean $\mu$ and
unknown variance $\sigma^2$. Then the interval
$$\left(\bar X - t_{\alpha/2}\frac{S}{\sqrt{n}},\; \bar X + t_{\alpha/2}\frac{S}{\sqrt{n}}\right)$$
is a $(1-\alpha)100\%$ t-confidence interval for $\mu$.
Here, $E = t_{\alpha/2}\,S/\sqrt{n}$ is the margin of error and $L = 2\,t_{\alpha/2}\,S/\sqrt{n}$ is the length of the
confidence interval.
Note that, unlike the z-interval case with known $\sigma^2$, E and L are not constant from one sample to another,
even though n is kept the same.
Example 9.4
The contents of 7 similar containers of sulphuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters. Find
a 95% confidence interval for the mean of all such containers, assuming an approximately normal distribution.
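Example 9.4 can be worked with the t-interval $\bar X \pm t_{\alpha/2}\,S/\sqrt{n}$. The sketch below is illustrative. The Python standard library has no t distribution, so the critical value $t_{0.025} = 2.447$ for 6 degrees of freedom is an assumption taken from a standard t table and hard-coded.

```python
import math
import statistics

contents = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]  # liters, from Example 9.4

n = len(contents)
x_bar = statistics.mean(contents)
s = statistics.stdev(contents)      # sample standard deviation (n - 1 divisor)

# Assumption: t_{0.025} with n - 1 = 6 degrees of freedom, from a t table.
T_CRIT = 2.447

margin = T_CRIT * s / math.sqrt(n)  # margin of error E
lower, upper = x_bar - margin, x_bar + margin

print(f"x_bar = {x_bar:.2f}, s = {s:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```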
Large-sample confidence intervals
In constructing a CI for a parameter $\theta$, we often can find an estimator $\hat\theta$ which has the following property:
For n large, $\hat\theta$ is approximately normally distributed with mean $\theta$ and variance $\sigma^2_{\hat\theta}$.
(Note that $\hat\theta$ is a function of the sample $X_1, X_2, \ldots, X_n$.)
Then in such a case we can say that for n large,
$$P\left(-z_{\alpha/2} < \frac{\hat\theta - \theta}{\sigma_{\hat\theta}} < z_{\alpha/2}\right) \approx 1 - \alpha.$$
This, then, can be the basis of a CI with a coverage probability of approximately $1 - \alpha$ and of the form:
Estimator $\pm\, z_{\alpha/2}$ (SE), that is, $\hat\theta \pm z_{\alpha/2}\,\sigma_{\hat\theta}$.
Often it happens that we cannot get our hands on the SE = $\sigma_{\hat\theta}$, but must rely on the
[estimated SE] = $\hat\sigma_{\hat\theta}$.
Under fairly general conditions that deal with the convergence of $\hat\sigma_{\hat\theta}$ to $\sigma_{\hat\theta}$ as the sample size increases,
it is often the case that
$$\frac{\hat\theta - \theta}{\hat\sigma_{\hat\theta}}$$
also has approximately a standard normal distribution, so that the interval
$$\hat\theta \pm z_{\alpha/2}\,\hat\sigma_{\hat\theta}$$
still has an approximate coverage probability of $1 - \alpha$.
(Be careful not to confuse approximate large-sample confidence intervals with z-confidence intervals, which
have exact coverage probabilities under the conditions of sampling from a normal distribution with known
standard deviation $\sigma$.)
9.7. Two samples
If we have two populations with means and standard deviations $\mu_1, \sigma_1$ and $\mu_2, \sigma_2$ respectively, the point
estimator of the difference between $\mu_1$ and $\mu_2$ is given by the statistic $\bar X_1 - \bar X_2$. Therefore, to obtain a point
estimate of $\mu_1 - \mu_2$ we select two independent random samples, one from each population, of sizes $n_1$
and $n_2$, and compute the difference $\bar X_1 - \bar X_2$ of the sample means. We must consider the sampling
distribution of $\bar X_1 - \bar X_2$.
$\bar X_1 - \bar X_2$ is approximately normally distributed with mean and variance given by
$$\mu_{\bar X_1 - \bar X_2} = \mu_1 - \mu_2$$
and
$$\sigma^2_{\bar X_1 - \bar X_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
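Hence
$$Z = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$
is approximately a standard normal variable, so that
$$P\left(-z_{\alpha/2} < \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} < z_{\alpha/2}\right) = 1 - \alpha.$$
A $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is given by
$$(\bar x_1 - \bar x_2) - z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar x_1 - \bar x_2) + z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$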
Example 9.6
Two types of engines, A and B, are compared. Gas mileage was measured. 50 experiments were conducted
using engine type A and 75 experiments were done for engine type B. The gasoline used and other
conditions were held constant. The average gas mileage for A was 36 miles per gallon and for B 42 miles
per gallon. Assume that the population standard deviations are 6 and 8 for engines A and B respectively. Find
a 96% confidence interval on $\mu_B - \mu_A$.
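Example 9.6 can be worked with the two-sample z-interval given above. This is a sketch with a hypothetical helper name; the critical value for 96% confidence is computed with `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def two_sample_z_interval(x1_bar, x2_bar, sigma1, sigma2, n1, n2, confidence):
    """CI for mu1 - mu2 when both population standard deviations are known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    diff = x1_bar - x2_bar
    return diff - z * se, diff + z * se

# Example 9.6, taking population 1 = engine B and population 2 = engine A,
# so that the interval is for mu_B - mu_A.
lower, upper = two_sample_z_interval(42, 36, 8, 6, 75, 50, 0.96)
print(f"96% CI for mu_B - mu_A: ({lower:.2f}, {upper:.2f})")
```

Both end-points are positive, so the data suggest that engine B has the higher mean mileage.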
If the variances are not known and the two distributions are approximately normal, the t-distribution
becomes involved, as in the case of a single sample. Large samples (greater than 30) allow us to use
$s_1$ and $s_2$ in place of $\sigma_1$ and $\sigma_2$ without the assumption of normality. The confidence interval is then
an approximate one:
$$(\bar x_1 - \bar x_2) - t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar x_1 - \bar x_2) + t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},$$
where the t-distribution has degrees of freedom
$$v = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}.$$
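If $\sigma_1$ and $\sigma_2$ are assumed to be equal, $\sigma_1 = \sigma_2 = \sigma$, the common variance is estimated by the pooled variance
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and the confidence interval is
$$(\bar x_1 - \bar x_2) - t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} < \mu_1 - \mu_2 < (\bar x_1 - \bar x_2) + t_{\alpha/2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$
where the t-distribution is used with $n_1 + n_2 - 2$ degrees of freedom.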
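The two-sample t formulas can be sketched as small helper functions. Everything here is illustrative: the function names are hypothetical, the summary statistics are made up because the source gives no data for this case, and the critical value $t_{0.025} = 2.101$ for 18 degrees of freedom is an assumption taken from a standard t table.

```python
import math

def welch_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom when the variances are not assumed equal."""
    a, b = s1_sq / n1, s2_sq / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Pooled estimate of the common variance."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

def pooled_t_interval(x1_bar, x2_bar, s1_sq, n1, s2_sq, n2, t_crit):
    """CI for mu1 - mu2 under equal variances; t_crit has n1 + n2 - 2 df."""
    sp = math.sqrt(pooled_variance(s1_sq, n1, s2_sq, n2))
    margin = t_crit * sp * math.sqrt(1 / n1 + 1 / n2)
    diff = x1_bar - x2_bar
    return diff - margin, diff + margin

# Hypothetical summary statistics, for illustration only.
x1_bar, s1_sq, n1 = 12.0, 4.0, 10
x2_bar, s2_sq, n2 = 10.0, 4.0, 10
T_CRIT = 2.101  # assumption: t_{0.025} with 18 df, from a t table

df = welch_df(s1_sq, n1, s2_sq, n2)
lower, upper = pooled_t_interval(x1_bar, x2_bar, s1_sq, n1, s2_sq, n2, T_CRIT)
print(f"degrees of freedom (unequal-variance formula): {df:.1f}")
print(f"95% CI (pooled): ({lower:.3f}, {upper:.3f})")
```

With equal sample variances and equal sample sizes, the unequal-variance degrees of freedom coincide with $n_1 + n_2 - 2$, which is a quick sanity check on the formula.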
Paired observations
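A $(1-\alpha)100\%$ confidence interval for $\mu_1 - \mu_2$ is given by
$$\left(\bar d - t_{\alpha/2}\frac{S_d}{\sqrt{n}},\; \bar d + t_{\alpha/2}\frac{S_d}{\sqrt{n}}\right),$$
where $\bar d$ and $S_d$ are the mean and standard deviation of the $n$ paired differences and $t_{\alpha/2}$ is taken
from the t-distribution with $n - 1$ degrees of freedom.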
Example 9.9
The levels of the dioxin TCDD of 20 Vietnam veterans who were possibly exposed to Agent Orange were
reported. The TCDD levels in plasma and in fat tissue were as follows (d is the plasma level minus the fat level):

Veteran   TCDD in plasma   TCDD in fat      d
   1            2.5             4.9       -2.4
   2            3.1             5.9       -2.8
   3            2.1             4.4       -2.3
   4            3.5             6.9       -3.4
   5            3.1             7.0       -3.9
   6            1.8             4.2       -2.4
   7            6.0            10.0       -4.0
   8            3.0             5.5       -2.5
   9           36.0            41.0       -5.0
  10            4.7             4.4        0.3
  11            6.9             7.0       -0.1
  12            3.3             2.9        0.4
  13            4.6             4.6        0.0
  14            1.6             1.4        0.2
  15            7.2             7.7       -0.5
  16            1.8             1.1        0.7
  17           20.0            11.0        9.0
  18            2.0             2.5       -0.5
  19            2.5             2.3        0.2
  20            4.1             2.5        1.6

Find a 95% confidence interval for $\mu_1 - \mu_2$, where $\mu_1$ and $\mu_2$ are the mean TCDD levels in plasma and in fat tissue.
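Example 9.9 can be worked with the paired-observations interval. The sketch below recomputes the differences d from the two columns of the table. The critical value $t_{0.025} = 2.093$ for 19 degrees of freedom is an assumption taken from a standard t table, since the Python standard library has no t distribution.

```python
import math
import statistics

# TCDD levels for veterans 1..20, from the table in Example 9.9.
plasma = [2.5, 3.1, 2.1, 3.5, 3.1, 1.8, 6.0, 3.0, 36.0, 4.7,
          6.9, 3.3, 4.6, 1.6, 7.2, 1.8, 20.0, 2.0, 2.5, 4.1]
fat = [4.9, 5.9, 4.4, 6.9, 7.0, 4.2, 10.0, 5.5, 41.0, 4.4,
       7.0, 2.9, 4.6, 1.4, 7.7, 1.1, 11.0, 2.5, 2.3, 2.5]

d = [p - f for p, f in zip(plasma, fat)]  # paired differences
n = len(d)
d_bar = statistics.mean(d)
s_d = statistics.stdev(d)

# Assumption: t_{0.025} with n - 1 = 19 degrees of freedom, from a t table.
T_CRIT = 2.093

margin = T_CRIT * s_d / math.sqrt(n)
lower, upper = d_bar - margin, d_bar + margin

print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.4f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

The interval contains zero, so at this confidence level the data do not show a difference between the mean plasma and fat-tissue levels.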
9.9 Inferences on proportions
A point estimator of p is a statistic. The one we will develop is the sample proportion, designated by
$\hat p$. The sample proportion is a random variable and as such has a sampling distribution with a mean and a
standard error (square root of the variance). The observed value of $\hat p$ from a sample is x/n, where x is the
observed number of "successes" in our sample of size n. Now if the sample is a simple random sample, then
x is the observed value of X, a binomial random variable with parameters n (which is known) and p (which
unfortunately is unknown).
If X is binomial, then it is a discrete random variable with a range of integer values 0, 1, 2, ..., n.
The mean of X is
E(X) = np
and the variance is
Var(X) = np(1 - p).
Since $\hat p$ = X/n, $\hat p$ is also a discrete random variable that takes on the (n + 1) possible values
0, 1/n, 2/n, ..., 1. The expected value of $\hat p$ is
E($\hat p$) = E(X)/n = np/n = p.
This says that $\hat p$ is an unbiased estimator of p.
The variance of $\hat p$ is
Var($\hat p$) = Var(X)/n^2 = p(1 - p)/n.
Because this variance tends to zero as n increases, $\hat p$ is also a consistent estimator of p.
It also says that an estimate of p based on a larger sample size is a better estimate than one based on a
smaller sample size. Therefore $\hat p$ is also an efficient estimator of p. On faith, we also note that $\hat p$ is a
sufficient estimator of p.
Then our point estimate for p is $\hat p$ = x/n. If forced to make a guess, we would offer the
observed value of the sample proportion as our best guess. This answer is very precise but unfortunately
very likely to be wrong. A better idea is to construct an interval estimate for p. We set a confidence
level (say 95%) and we generate an answer that is less precise (it is an interval) but of whose accuracy
we are much surer.
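To obtain a random variable that involves p and whose distribution is known, to serve as a starting point for the
confidence interval derivation, standardize $\hat p$:
$$Z = \frac{\hat p - p}{\sqrt{pq/n}}, \qquad q = 1 - p,$$
which follows an approximately standard normal distribution for large sample sizes. Hence
$$P\left(-z_{\alpha/2} < Z < z_{\alpha/2}\right) = P\left(-z_{\alpha/2} < \frac{\hat p - p}{\sqrt{pq/n}} < z_{\alpha/2}\right) \approx 1 - \alpha.$$
We will write the confidence bounds for the true population proportion p when the sample size is large (n > 30) as
$$\hat p - z_{\alpha/2}\sqrt{\frac{\hat p \hat q}{n}} < p < \hat p + z_{\alpha/2}\sqrt{\frac{\hat p \hat q}{n}}, \qquad \hat q = 1 - \hat p.$$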
Example
To estimate the reliability of 16-kilobit dynamic RAMs being produced by a particular company, a sample of
size 100 is to be drawn and tested. We are interested in estimating p, the proportion of circuits that operate
correctly during the first 1000 hours of operation. It was found that 91 of the 100 circuits tested performed
correctly. Find a 95% confidence interval for the actual proportion of circuits performing properly.
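The RAM example can be worked with the large-sample interval $\hat p \pm z_{\alpha/2}\sqrt{\hat p \hat q / n}$. A minimal sketch, with a hypothetical helper name:

```python
import math
from statistics import NormalDist

def proportion_interval(x, n, confidence):
    """Large-sample CI for a proportion, from x successes in n trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# 91 of the 100 circuits tested performed correctly.
lower, upper = proportion_interval(91, 100, 0.95)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```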
Sample size for estimating p
How large a sample should be selected so that $\hat p$ lies within a specified distance e of p with a stated degree of
confidence?
Example
How large a sample is required in the previous example if we want to be 95 % confident that our estimate of p
is within 0.03?
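The sample-size formula itself is missing from this transcript. The sketch below assumes the usual approach: set the margin of error $z_{\alpha/2}\sqrt{\hat p \hat q / n}$ equal to e and solve for n, giving $n = z_{\alpha/2}^2\,\hat p \hat q / e^2$ rounded up, with $\hat p$ taken from the earlier sample (0.91). The helper name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p_hat, e, confidence):
    """Smallest n with z * sqrt(p_hat * q_hat / n) <= e."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / e ** 2)

# Preliminary estimate p_hat = 0.91 from the previous example, e = 0.03.
n_required = sample_size_for_proportion(0.91, 0.03, 0.95)
print(n_required)
```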
Comparing two proportions
There are two populations of interest. The same trait is studied in each population, and in each population the
proportion having the trait is unknown. Random samples are drawn from each population. Inferences are to
be made on p1, p2, and p1 - p2, where p1 and p2 are the true proportions in populations I and II.
Two assumptions underlie the determination of sample sizes, the estimation of parameters and the conduct of
hypothesis tests discussed in this section. First, we assume that the two samples are independent. Second, we
want both sample sizes to be large (bigger than 50 each).
A point estimator of the difference between two population proportions, $p_1 - p_2$, is the difference between the
sample proportions,
$$\hat p_1 - \hat p_2 = X_1/n_1 - X_2/n_2,$$
with
$$E(\hat p_1 - \hat p_2) = p_1 - p_2, \qquad \mathrm{Var}(\hat p_1 - \hat p_2) = \frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2},$$
where $q_i = 1 - p_i$. The statistic $\hat p_1 - \hat p_2$ is an unbiased estimator for $p_1 - p_2$.
To obtain a random variable that involves $p_1 - p_2$ and whose distribution is known, at least approximately, to serve
as a starting point for the confidence interval derivation, standardize:
$$Z = \frac{(\hat p_1 - \hat p_2) - (p_1 - p_2)}{\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}}.$$
Applying the notion of constructing confidence intervals, the proposed bounds for a confidence interval on
$p_1 - p_2$ are
$$(\hat p_1 - \hat p_2) - z_{\alpha/2}\sqrt{\frac{\hat p_1 \hat q_1}{n_1} + \frac{\hat p_2 \hat q_2}{n_2}} < p_1 - p_2 < (\hat p_1 - \hat p_2) + z_{\alpha/2}\sqrt{\frac{\hat p_1 \hat q_1}{n_1} + \frac{\hat p_2 \hat q_2}{n_2}}.$$
Example 9.13:
A certain change in a process for manufacture of component parts is being considered. Samples are taken
using both the existing and the new procedure so as to determine if the new process results in an
improvement. If 75 of 1500 items from the existing procedure were found to be defective and 80 of 2000
items from the new procedure were found to be defective, find a 90 % confidence interval for the true
difference in the fraction of defectives between the existing and the new process.
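Example 9.13 can be worked with the two-proportion interval above, taking population 1 as the existing procedure and population 2 as the new one. A minimal sketch, with a hypothetical helper name:

```python
import math
from statistics import NormalDist

def two_proportion_interval(x1, n1, x2, n2, confidence):
    """Large-sample CI for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Existing procedure: 75 defective of 1500; new procedure: 80 of 2000.
lower, upper = two_proportion_interval(75, 1500, 80, 2000, 0.90)
print(f"90% CI for p1 - p2: ({lower:.4f}, {upper:.4f})")
```

The interval contains zero, so on these data there is no clear evidence that the new procedure lowers the fraction of defectives.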
9.11 Interval estimation of variability
The statistic $S^2$ is an unbiased estimator for $\sigma^2$. To obtain a $100(1-\alpha)\%$ confidence interval for $\sigma^2$, we need
a random variable whose expression involves $\sigma^2$ and whose probability distribution is known.
If $S^2$ is the variance of a random sample of size n taken from a normal population having mean $\mu$ and
variance $\sigma^2$, then
$$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$$
is a random variable having the chi-square distribution with the parameter v = n - 1 (degrees of freedom).
To use the random variable $(n-1)S^2/\sigma^2$ to derive a $100(1-\alpha)\%$ confidence interval for $\sigma^2$, we first partition
the $\chi^2_{n-1}$ curve:
$$P\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha,$$
where $\chi^2_{\alpha/2}$ denotes the value with area $\alpha/2$ to its right.
Theorem
Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal population with mean $\mu$ and standard deviation $\sigma$.
The lower and upper bounds, $L_1$ and $L_2$ respectively, for a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ are given
by
$$L_1 = \frac{(n-1)S^2}{\chi^2_{\alpha/2}} \quad\text{and}\quad L_2 = \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}.$$
Example.
25 observations on the relative I/O content for a large consulting firm over randomly selected one-hour
periods are obtained. The sample variance is $s^2 = 1.407$ (s = 1.186).
Construct a 95% confidence interval on the standard deviation of the relative I/O content for this installation.
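The I/O example can be worked by computing the bounds $L_1$ and $L_2$ for $\sigma^2$ and taking square roots. The sketch below is illustrative. The Python standard library has no chi-square quantile function, so the critical values for 24 degrees of freedom ($\chi^2_{0.025} = 39.364$ and $\chi^2_{0.975} = 12.401$) are assumptions taken from a standard chi-square table.

```python
import math

n = 25
s_sq = 1.407  # sample variance from the example

# Assumption: chi-square critical values with n - 1 = 24 degrees of freedom,
# taken from a table (area to the right of each value shown in the name).
CHI2_UPPER_0025 = 39.364  # area 0.025 to the right
CHI2_UPPER_0975 = 12.401  # area 0.975 to the right

var_lower = (n - 1) * s_sq / CHI2_UPPER_0025  # L1
var_upper = (n - 1) * s_sq / CHI2_UPPER_0975  # L2

# The interval for sigma is obtained by taking square roots of the bounds.
sd_lower, sd_upper = math.sqrt(var_lower), math.sqrt(var_upper)

print(f"95% CI for the variance: ({var_lower:.3f}, {var_upper:.3f})")
print(f"95% CI for the standard deviation: ({sd_lower:.3f}, {sd_upper:.3f})")
```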
9.12 Two samples: Estimating the ratio of two variances
If $S_1^2$ and $S_2^2$ are the variances of independent random samples of sizes $n_1$ and $n_2$, respectively, taken from
two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, then a point estimate of the ratio of the two population
variances $\sigma_1^2/\sigma_2^2$ is given by the ratio $s_1^2/s_2^2$ of the sample variances. Hence the statistic $S_1^2/S_2^2$ is called an
estimator of $\sigma_1^2/\sigma_2^2$.
To establish an interval estimate of $\sigma_1^2/\sigma_2^2$ we will use the statistic
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2},$$
which is a random variable having the F distribution with the parameters $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$:
$$P\left(f_{1-\alpha/2}(v_1, v_2) < F < f_{\alpha/2}(v_1, v_2)\right) = 1 - \alpha.$$
Exercise 9 p.275
Construct a 98% confidence interval for $\sigma_1/\sigma_2$, where $\sigma_1$ and $\sigma_2$ are, respectively, the standard deviations for the distance
obtained per litre of fuel by the Volkswagen and Toyota minitrucks.
9.14 Maximum likelihood estimation
Let $X_1, X_2, \ldots, X_n$ be independent random variables taken from a probability distribution with
probability density function $f(x; \theta)$, where $\theta$ is a single parameter of the distribution. Then
$$L(X_1, X_2, \ldots, X_n; \theta) = f(X_1, X_2, \ldots, X_n; \theta) = f(X_1; \theta)\, f(X_2; \theta) \cdots f(X_n; \theta)$$
is the joint distribution of the random variables, called the likelihood function.
Let $x_1, x_2, \ldots, x_n$ denote the observed values in a sample. The values are known, since we observed them, and
we want to estimate the true population parameter $\theta$. In the discrete case the likelihood of the
sample is the joint probability
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n).$$
The maximum likelihood estimator is the one that results in a maximum value for this joint
probability.
Consider an example where three items from an assembly line are inspected and classified as defective or
nondefective. Testing the three items resulted in two nondefective items followed by a defective item. We
want to estimate p, the proportion nondefective in the process. The likelihood of the sample is
$$p \cdot p \cdot q = p^2 q, \qquad q = 1 - p.$$
The maximum likelihood estimator is that parameter value that produces the largest probability of obtaining
the sample.
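The assembly-line example can be checked numerically. Setting the derivative of $p^2(1-p)$ to zero gives $2p - 3p^2 = 0$, so the maximum is at $p = 2/3$. The sketch below confirms this with a simple grid search; the grid resolution and the names are arbitrary choices made for illustration.

```python
def likelihood(p):
    """Likelihood of observing nondefective, nondefective, defective."""
    return p * p * (1 - p)

# Evaluate the likelihood on a grid of candidate values 0.000, 0.001, ..., 1.000
# and keep the value of p that gives the largest likelihood.
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)

print(f"maximum likelihood estimate of p: {p_hat:.3f}")
```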
Example 9.19
Consider a Poisson distribution
Example 9.21
Suppose 10 rats are used in a biomedical study where the rats are injected with
cancer cells and given a cancer drug that is designed to increase their survival
rate. The observed survival times are
14, 17, 27, 18, 12, 8, 22, 13, 19, 12
Assume that the exponential distribution applies.
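The question asked in Example 9.21 is not preserved in this transcript. The sketch below assumes it asks for the maximum likelihood estimate of the mean of the exponential distribution. Under the parameterisation $f(x; \beta) = (1/\beta)e^{-x/\beta}$, the log-likelihood is $-n\ln\beta - \sum x_i/\beta$, which is maximised at $\hat\beta = \bar x$. The code computes this for the ten values listed and checks that nearby values of $\beta$ give a smaller log-likelihood.

```python
import math

times = [14, 17, 27, 18, 12, 8, 22, 13, 19, 12]  # values listed in the example

def log_likelihood(beta, data):
    """Exponential log-likelihood with mean parameter beta."""
    return -len(data) * math.log(beta) - sum(data) / beta

# For this parameterisation the maximum likelihood estimate is the sample mean.
beta_hat = sum(times) / len(times)

print(f"maximum likelihood estimate of the mean: {beta_hat:.1f}")
print(f"log-likelihood at the estimate: {log_likelihood(beta_hat, times):.3f}")
```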