Chapter 8
Parametric Estimation
Contents
8.1 Types of Parametric Estimators
8.2 Point Estimation
8.3 Inference of Confidence Intervals for Parameters
8.4 Hypothesis Testing
8.1 Types of Parametric Estimators
• Introduction ---
 • Statistical inference (統計推論) --- Statistical inference is the process of using observed data of a random phenomenon to draw conclusions about the distributions of the random variables which model the phenomenon.
 • Parametric estimation (參數估測) --- Parametric estimation is a type of statistical inference to determine or make decisions about the parameters which characterize the distributions of the random variables modeling a concerned random phenomenon.
• Types of parametric estimation (all using observed data to achieve different purposes) ---
 • Point estimation --- estimation of the parameter value(s) of a distribution, such as the mean and variance of the random variable which has the distribution.
 • Interval estimation --- estimation of an interval which may be believed, to some degree of confidence, to contain a parameter of interest.
 • Hypothesis testing --- decision making to accept or reject a claim, called a hypothesis, regarding a parameter of interest.
8.2 Point Estimation
• Concepts and review ---
 • Point estimation means to estimate the parameter(s) of a certain distribution of a random variable from a set of observed data of the distribution according to a certain estimation rule.
 • Every observed data item of the distribution is itself random in nature, and so may be regarded as a random variable, as mentioned in the last chapter where we had the following definition of random sample:
   A random sample of size n arising from a certain random variable X is a collection of n independent random variables X1, X2, …, Xn such that ∀ i = 1, 2, …, n, Xi is identically distributed with X, meaning that every Xi has the same pmf or pdf as that of X. Each Xi is called a sample variable, X is called the population random variable, and the mean and variance of X are called the population mean and variance, respectively.
 • We have mentioned several point estimators in the last chapter, such as the sample mean, sample variance, and sample covariance.
 • Here, we investigate point estimators in more formal and systematic ways.
• Definition of estimator ---
 • Definition 8.1 --- An estimator of a parameter θ of a random variable X is a function θ̂ = θ̂(X1, X2, …, Xn) of a random sample X1, X2, …, Xn arising from X, so that for a particular set of sample values, say, x1, x2, …, xn, evaluating θ̂ at these values produces a point estimate of θ, namely, θ̂(x1, x2, …, xn).
   (Note: here point estimate, or simply, estimate, is used as an undefined term.)
• More types of mean estimators ---
  The sample mean X̄ of a random sample X1, X2, …, Xn of size n arising from X defined in the last chapter is just one of the many possible estimators of the population mean. Others include the following.
 • X̄_l = linear estimator of the population mean --- a weighted sum of all the sample variables, X̄_l = a1X1 + a2X2 + … + anXn, where Σ_{i=1}^{n} ai = 1; e.g., X̄_l = 0.35X1 + 0.1X2 + 0.1X3 + 0.1X4 + 0.35X5 for n = 5.
 • X̃ = sample median --- the median of the sample variables whose value, as implied by Definition 4.6, is the middle of the sample values after they are reordered by magnitude (if n is even, then it is the average of the middle two magnitude-reordered sample values).
 • X̄_e = sample midrange --- the average of the maximum and the minimum of X1, X2, ..., Xn, i.e., X̄_e = [max_i(Xi) + min_i(Xi)]/2.
 • X̄_tr(r) = sample r% trimmed mean --- the mean obtained from discarding the largest r% and the smallest r% (say, 10%) of the sample variables and then taking the average of the remaining ones.
• Example 8.1 (various types of estimators) ---
  As a continuation of Example 7.4, where three random samples were given with the first one being (8.04, 8.02, 8.07, 7.99, 8.03) and the corresponding sample mean value being computed to be x̄1 = (8.04 + 8.02 + 8.07 + 7.99 + 8.03)/5 = 8.03, we now want to compute the estimates of the above-mentioned four different types of estimators for the first random sample (8.04, 8.02, 8.07, 7.99, 8.03).
  Solution:
 • Value of the linear estimator of the mean --- x̄_l = 0.35x1 + 0.1x2 + 0.1x3 + 0.1x4 + 0.35x5 = 0.35×8.04 + 0.1×(8.02 + 8.07 + 7.99) + 0.35×8.03 = 8.0325.
 • Sample median value --- x̃ = 8.03 because the sample values, after being reordered according to their magnitudes, are 7.99, 8.02, 8.03, 8.04, 8.07.
 • Sample midrange value --- x̄_e = (8.07 + 7.99)/2 = 8.03 because the two extreme values of the sample values are 7.99 and 8.07.
 • Sample r% trimmed mean value with r = 20 --- x̄_tr(20) = (8.04 + 8.02 + 8.03)/3 = 8.03 after the two 20% extreme values 7.99 and 8.07 are "trimmed."
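• A numerical check --- the following short Python sketch (NumPy/SciPy assumed available; all variable names are illustrative, not part of the original notes) reproduces the four estimate values computed above:

    import numpy as np
    from scipy import stats

    x = np.array([8.04, 8.02, 8.07, 7.99, 8.03])

    # Linear estimator with weights a = (0.35, 0.1, 0.1, 0.1, 0.35), which sum to 1.
    a = np.array([0.35, 0.1, 0.1, 0.1, 0.35])
    x_linear = a @ x                        # 8.0325

    x_median = np.median(x)                 # 8.03
    x_midrange = (x.max() + x.min()) / 2    # 8.03
    x_trim20 = stats.trim_mean(x, 0.20)     # trims 20% from each tail; 8.03

    print(x_linear, x_median, x_midrange, x_trim20)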
• Notes ---
 • We use capital letters for estimators and lower-case letters for estimate values in the above example and hereafter.
 • Also, we refer to an estimate value of an estimator by the term "estimator value," e.g., the estimate value, sample midrange value, for the estimator, sample midrange, and so on.
 • The sample mean is just a special case of the linear estimator of the population mean mentioned in the last example, which is defined more formally in the following.
• Definition 8.2 (linear estimator of the population mean) ---
  Given a random sample X1, X2, …, Xn arising from a population random variable X with mean μ, a linear estimator of μ is defined as
    X̄_l = a1X1 + a2X2 + … + anXn.
• A note --- by the above definition, the sample mean X̄ = (X1 + X2 + … + Xn)/n is a linear estimator of the population mean with all ai = 1/n.
• Which estimator is better? ---
 • Idea ---
  • It is desired to have the "best" estimator for a certain parameter θ of a population random variable.
  • This requires a suitable measure of the goodness of each parameter estimator θ̂.
  • A suitable measure for this purpose is the error |θ̂ − θ| or the square of it.
  • However, this measure is not useful for computing the best estimator.
  • A substitute is the mean value of the square of it, namely, E[(θ̂ − θ)²], which we call the mean square error (MSE) of θ̂.
  • Since θ is a constant value, the MSE may be transformed in the following way into a more meaningful form:
      E[(θ̂ − θ)²] = E[θ̂² − 2θθ̂ + θ²]
                  = E[θ̂²] − 2θE[θ̂] + θ²
                  = E[θ̂²] − (E[θ̂])² + (E[θ̂])² − 2θE[θ̂] + θ²
                  = [E[θ̂²] − (E[θ̂])²] + [(E[θ̂])² − 2θE[θ̂] + θ²]
                  = Var(θ̂) + (E[θ̂] − θ)².    (8.1)
  • Accordingly, if Var(θ̂) = 0 and E[θ̂] = θ, then the MSE E[(θ̂ − θ)²] attains the minimum value, zero.
  • Therefore, we have the following definitions.
• Definition 8.3 (better and best estimators) ---
  For a parameter θ, an estimator θ̂1 is said to be better, or more efficient, than another θ̂2 if E[(θ̂1 − θ)²] < E[(θ̂2 − θ)²]. The estimator which is better than any other is called the best estimator for θ, which we denote as θ̂*.
• Definition 8.4 (unbiased estimator) ---
  An estimator θ̂ for a parameter θ is said to be unbiased if E[θ̂] = θ; otherwise, it is said to be biased with the amount E[θ̂] − θ called the bias of it.
• A note and an illustration of the meaning of unbiasedness ---
 • An unbiased estimator θ̂ means that, though the estimate values are not always equal to the exact value of θ, they fall around θ with θ as the center. Therefore, unbiasedness is a good property for an estimator.
 • See Fig. 8.1 for an illustration of unbiasedness where θ̂1 is unbiased while θ̂2 is biased.
   [Figure 8.1: pdfs of θ̂1 and θ̂2 around θ, with the bias of θ̂2 marked.]
   Fig. 8.1 Illustration of unbiasedness of estimators --- θ̂1 is unbiased while θ̂2 is biased.
• Linear unbiased estimator ---
 • A linear estimator X̄_l = a1X1 + a2X2 + … + anXn of the population mean as defined in Definition 8.2 need not be unbiased. For it to be so, the coefficients a1 through an must satisfy a1 + a2 + … + an = 1 because, by Fact 7.2,
     E[X̄_l] = E[a1X1 + a2X2 + … + anXn]
            = a1E[X1] + a2E[X2] + … + anE[Xn]
            = a1μ + a2μ + … + anμ
            = (a1 + a2 + … + an)μ
            = 1·μ
            = μ.
 • A linear estimator of the population mean with a1 + a2 + … + an = 1 will be called a linear unbiased estimator hereafter.
• Facts about unbiased estimators of some population parameters ---
 • Fact 8.1 ---
   The sample mean X̄ and the sample variance S² of a random sample arising from a population random variable X with mean μ and variance σ² are unbiased estimators of μ and σ², respectively.
   Proof: immediate from Facts 7.4 and 7.13, which say respectively that E[X̄] = μ and E[S²] = σ².
 • Fact 8.2 ---
   If the population distribution is continuous and symmetric, the sample median X̃ and any sample trimmed mean X̄_tr(r) of a random sample arising from a population random variable X with mean μ are unbiased estimators of μ.
   Proof: omitted; see http://en.wikipedia.org/wiki/Efficiency_(statistics) for a reference.
• Example 8.2 ---
  Given a binomial random variable X with parameters (n, p), which specifies the number of successes in n trials, prove that the estimator p̂ = X/n, called the sample proportion, is an unbiased estimator of p.
  Proof: easy from Fact 4.7, which says E[X] = np, because then
    E[p̂] = E[X/n] = E[X]/n = np/n = p.
• Notes ---
 • Compare the result of this simple example with that of Example 7.14.
 • From Facts 8.1 and 8.2, we see that the unbiased estimator for a parameter, like the population mean, is not unique; therefore a further criterion is needed, which is the concept of minimum variance of the estimator, as defined next.
• Definition 8.5 (minimum-variance estimator) ---
  A minimum-variance estimator θ̂ for a parameter θ has a smaller variance Var(θ̂) than that of any other estimator for θ.
• A note and an illustration of the meaning of minimum variance ---
 • See Fig. 8.2 for an illustration of the meaning of the variance of the estimator, where though both θ̂1 and θ̂2 are unbiased, θ̂1 has a smaller variance than θ̂2.
 • If an estimator is unbiased, then according to the analysis described by (8.1), if it also has the minimum variance, it will be the best estimator, as stated by the following fact.
  [Figure 8.2: pdfs of θ̂1 and θ̂2, both centered at θ.]
  Fig. 8.2 Illustration of variances of estimators --- θ̂1 has a smaller variance than θ̂2 though both are unbiased.
• Fact 8.3 (best estimator) ---
  An unbiased estimator θ̂ for a parameter θ with the minimum variance is the best estimator for θ.
  Proof: immediate from the above three definitions (Definitions 8.3 through 8.5) and the analysis described by (8.1).
• Discussions on the best estimator ---
 • In general, it is not always easy to find a best estimator for a specific parameter of a population distribution.
 • But under some constraints, such best parameter estimators may be found, as shown by the following facts and examples.
• Fact 8.4 (best linear unbiased estimator of the population mean) ---
  The sample mean is the best linear unbiased estimator of the population mean μ. (A simulation sketch illustrating this fact follows the proof.)
  Proof:
 • According to Definition 8.2 and the previous discussion, a linear unbiased estimator of μ is X̄_l = a1X1 + a2X2 + … + anXn where a1 + a2 + … + an = 1.
 • Therefore, the sample mean X̄ is a linear unbiased estimator of μ with all ai = 1/n, as mentioned previously.
 • Now, we only have to prove that X̄ has the minimum variance among all possible linear unbiased estimators.
 • The proof will be done for the case of n = 2; generalization to the case of any n is left as an exercise.
 • For n = 2, since a1 + a2 = 1, we have a2 = 1 − a1 and
     X̄_l = a1X1 + a2X2 = a1X1 + (1 − a1)X2.
 • By applying a derivation process like those for deriving Facts 7.9 through 7.11 (details left as an exercise), the variance of X̄_l may be computed to be
     Var(X̄_l) = a1²Var(X1) + (1 − a1)²Var(X2) = [a1² + (1 − a1)²]σ².
 • To minimize the above value, we differentiate it with respect to a1 to get
     d[Var(X̄_l)]/da1 = [2a1 − 2(1 − a1)]σ².
 • Setting this result equal to zero, we get a1 = 1/2, and so a2 = 1 − a1 = 1/2, too.
 • That is, for n = 2, the sample mean X̄ = (1/2)X1 + (1/2)X2 has the minimum variance compared with other linear unbiased estimators. Therefore, it is the best linear unbiased estimator.
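• A simulation sketch of Fact 8.4 --- the Python sketch below (with illustrative, assumed population parameters μ = 10 and σ = 2) compares the sample mean for n = 2 against another linear unbiased estimator; both are centered at μ, but the sample mean shows the smaller variance:

    import numpy as np

    rng = np.random.default_rng(0)

    # Many random samples of size n = 2 from a population with sigma^2 = 4.
    samples = rng.normal(loc=10.0, scale=2.0, size=(100_000, 2))

    # Two linear unbiased estimators of the mean: equal weights (the sample
    # mean) and skewed weights a1 = 0.8, a2 = 0.2 (still summing to 1).
    mean_est = samples.mean(axis=1)
    skew_est = 0.8 * samples[:, 0] + 0.2 * samples[:, 1]

    # Theoretical variances: [0.5^2 + 0.5^2]*4 = 2 versus [0.8^2 + 0.2^2]*4 = 2.72.
    print(mean_est.mean(), skew_est.mean())   # both close to 10
    print(mean_est.var(), skew_est.var())     # approx. 2 and 2.72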
• Fact 8.5 (best estimator of the mean of a normal population distribution) ---
  The sample mean is the best estimator of the mean μ of a normal population distribution.
  Proof:
 • First, we state without proof a theorem, called the Cramér-Rao inequality, which says: if θ̂ is an unbiased estimator of a parameter θ in the pdf f(x) of a random variable X, then
     Var(θ̂) ≥ 1 / { n E[(∂ ln f(x)/∂θ)²] }
   where n is the size of the random sample X1, X2, ..., Xn used in the estimator θ̂. For a proof, see the following reference:
     R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 5th Ed., Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1995.
 • Next, if X is normally distributed with mean μ and variance σ², its pdf is
     f(x) = [1/(√(2π) σ)] e^{−(x−μ)²/(2σ²)},  −∞ < x < ∞,
   so that
     ln f(x) = −ln(√(2π) σ) − (1/2)[(x − μ)/σ]².
 • Taking the partial derivative of the last equality with respect to μ yields
     ∂ ln f(x)/∂μ = (x − μ)/σ².
 • Therefore, by Proposition 5.3 and the definition of variance, we have
     E[(∂ ln f(x)/∂μ)²] = E[((x − μ)/σ²)²] = (1/σ⁴)E[(x − μ)²] = σ²/σ⁴ = 1/σ².
 • Substituting the above result into the above Cramér-Rao inequality, we get
     Var(μ̂) ≥ 1/(n/σ²) = σ²/n,
   where μ̂ is an unbiased estimator of the mean μ of X.
 • This means that any unbiased estimator μ̂ has a variance Var(μ̂) ≥ σ²/n.
 • However, the variance of the sample mean X̄, according to Fact 7.11, is exactly σ²/n.
 • This means that the sample mean has a variance not larger than that of any unbiased estimator μ̂ of μ.
 • Also, since E[X̄] = μ (see Fact 7.4), we know that X̄ is unbiased, too.
 • As a consequence, the sample mean X̄ by definition is the best estimator of the mean μ of the normal population random variable X. Done.
• Consistency of unbiased estimators ---
 • Idea ---
  • It is desirable that as the sample size n becomes larger and larger, an estimator θ̂ of a parameter θ will get closer and closer to the parameter θ with high probability.
  • The weak law of large numbers described in the last chapter and repeated in the following says that this statement is true for the sample mean X̄ as an estimator of μ:
      lim_{n→∞} P{|X̄ − μ| < ε} = 1  ∀ ε > 0.    (7.5)
  • Therefore, we have the following definition and fact.
• Definition 8.6 ---
  An estimator θ̂ of a parameter θ of a population distribution is said to be consistent if the following property is satisfied:
    lim_{n→∞} P{|θ̂ − θ| < ε} = 1  ∀ ε > 0.
• Fact 8.6 (consistency of the sample mean) ---
  The sample mean X̄ is a consistent estimator of the population mean μ.
  Proof: use the weak law of large numbers as discussed above.
• A note: it can also be shown that the sample variance S² is a consistent estimator of the population variance σ² (discussed later in this chapter).
• An illustration of the meaning of consistency ---
  See Fig. 8.3 for an illustration of the meaning of consistency of an estimator θ̂ of θ, where θ̂ → θ with high probability as n → ∞.
  [Figure 8.3: pdf of θ̂ for sample size n1 and pdf of θ̂ for a larger sample size n2 > n1, both centered at θ, the latter more concentrated.]
  Fig. 8.3 Consistency of estimators --- θ̂ → θ with high probability as n → ∞.
• Fact 8.7 (consistency of the sample variance of a normal population) ---
  The sample variance S² is a consistent estimator of the variance of a normal population.
  Proof:
 • In the proof of the weak law of large numbers, we obtained Chebyshev's inequality as:
     P{|W − μ| ≥ k} ≤ σ²/k²  ∀ k > 0,    (A)
   where W is a random variable with mean μ and variance σ².
 • Also, Facts 7.13 and 7.14 say that the mean and variance of the sample variance of a normal population respectively are
     E[S²] = σ²;  Var(S²) = 2σ⁴/(n − 1).
 • Taking W in (A) to be S², k to be ε, μ to be E[S²], and σ² to be Var(S²), we get
     P{|S² − σ²| ≥ ε} ≤ 2σ⁴/[(n − 1)ε²]  ∀ ε > 0,
   or equivalently,
     P{|S² − σ²| < ε} = 1 − P{|S² − σ²| ≥ ε} ≥ 1 − 2σ⁴/[(n − 1)ε²]  ∀ ε > 0,
   which reduces, as n → ∞, to
     P{|S² − σ²| < ε} = 1
   based on the fact that the largest probability value is 1.
 • That is, lim_{n→∞} P{|S² − σ²| < ε} = 1 ∀ ε > 0, which says that S² is a consistent estimator of σ² according to Definition 8.6. Done.
• A comment --- a careful check of the above proof of Fact 8.7 leads to the following general fact about the validity of consistency of an estimator.
• Fact 8.7a (condition for consistency) ---
  If θ̂ is an unbiased estimator of θ for which
    lim_{n→∞} Var(θ̂) = 0,
  then θ̂ is a consistent estimator of θ.
  Proof:
 • Since θ̂ is unbiased, we have E[θ̂] = θ by Definition 8.4.
 • Also, let σ_θ̂² denote the variance of θ̂, which approaches 0 as n → ∞ according to the given condition lim_{n→∞} Var(θ̂) = 0.
 • That is, θ̂ has mean θ and variance σ_θ̂².
 • Accordingly, as done in the proof of Fact 8.7, we can get the following Chebyshev's inequality for θ̂:
     P{|θ̂ − θ| ≥ ε} ≤ σ_θ̂²/ε²  ∀ ε > 0,
   or equivalently,
     P{|θ̂ − θ| < ε} = 1 − P{|θ̂ − θ| ≥ ε} ≥ 1 − σ_θ̂²/ε²  ∀ ε > 0,
   leading to
     lim_{n→∞} P{|θ̂ − θ| < ε} = 1
   based on the fact that σ_θ̂² → 0 as n → ∞ and the fact that the largest probability value is 1. Done.
• Principles for choosing point estimators ---
  The above discussions lead to the following reasonable principles for choosing parameter estimators.
 • Step 1 --- choose unbiased estimators first;
 • Step 2 --- then choose the unbiased estimators with the minimum variances;
 • Step 3 --- finally choose the consistent unbiased estimator with the minimum variance.
• Methods for point estimator design ---
 • Idea ---
  • The above discussions are about estimators for single parameters.
  • There are two major general methods for designing point estimators for any specific problem with multiple parameters, namely, (1) the method of moments and (2) maximum likelihood estimation, which we discuss in the following.
  • Recall: the kth moment of a random variable X is E[X^k] (called the kth population moment if X is a population random variable).
  • When no pmf or pdf is available, this moment must be estimated, leading to the following definition.
• Definition 8.7 (sample moment) ---
  Given a random sample X1, X2, ..., Xn of size n from a population random variable X, the kth sample moment is defined as (Σ_{i=1}^{n} Xi^k)/n, denoted as M_k (thus X̄ = M_1).
• Definition 8.8 (moment estimator) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a population random variable X whose pmf or pdf is f(x) with parameters θ1, θ2, …, θm, the moment estimators θ̂1, θ̂2, …, θ̂m are defined as those obtained in the following way: (1) equate the first m sample moments to the corresponding first m population moments; (2) solve for θ1, θ2, …, θm; and (3) use the solutions as θ̂1, θ̂2, …, θ̂m, respectively.
• Example 8.3 ---
  Let X1, X2, ..., Xn be a random sample from a gamma distribution with parameters (t, λ), where t > 0 and λ > 0, with pdf
    f(x) = λ e^{−λx} (λx)^{t−1}/Γ(t)  for x ≥ 0;
    f(x) = 0                          for x < 0.
  Find the moment estimators for t and λ.
  Solution:
 • From Fact 6.13, we know E[X] = t/λ and Var(X) = t/λ².
 • Since Var(X) = E[X²] − (E[X])² according to Proposition 5.2, we get the second population moment as E[X²] = Var(X) + (E[X])² = t/λ² + (t/λ)² = t(1 + t)/λ².
 • By Step (1) in Definition 8.8 --- equating M_1 to E[X] and M_2 to E[X²], we get the following equations:
     M_1 = t/λ;  M_2 = t(1 + t)/λ²
   which may be solved to get
     t = M_1²/(M_2 − M_1²);  λ = M_1/(M_2 − M_1²).
 • Therefore, the estimators for t and λ are respectively
     t̂ = M_1²/(M_2 − M_1²);  λ̂ = M_1/(M_2 − M_1²)
   where M_1 = (Σ_{i=1}^{n} Xi)/n and M_2 = (Σ_{i=1}^{n} Xi²)/n.
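• A numerical try-out --- the sketch below (with illustrative true parameters t = 3 and λ = 2; note that NumPy parameterizes the gamma distribution by scale = 1/λ) recovers the parameters from simulated data via the moment formulas just derived:

    import numpy as np

    rng = np.random.default_rng(1)
    t_true, lam_true = 3.0, 2.0
    x = rng.gamma(shape=t_true, scale=1.0 / lam_true, size=10_000)

    # First and second sample moments M1 and M2.
    m1 = np.mean(x)
    m2 = np.mean(x**2)

    # Moment estimates from t = M1^2/(M2 - M1^2) and lambda = M1/(M2 - M1^2).
    t_hat = m1**2 / (m2 - m1**2)
    lam_hat = m1 / (m2 - m1**2)
    print(t_hat, lam_hat)   # close to 3 and 2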
• Definition 8.9 (maximum likelihood estimator) ---
  Given random variables X1, X2, ..., Xn whose joint pmf or pdf is f(x1, x2, …, xn) with parameters θ1, θ2, …, θm, by regarding f as a likelihood function of θ1, θ2, …, θm and rewriting it as f(x1, x2, …, xn; θ1, θ2, …, θm), the maximum likelihood estimates θ̂1, θ̂2, …, θ̂m are defined as those values of the θi's that maximize the likelihood function, so that for all θ1, θ2, …, θm,
    f(x1, x2, …, xn; θ̂1, θ̂2, …, θ̂m) ≥ f(x1, x2, …, xn; θ1, θ2, …, θm).
  The maximum likelihood estimators are obtained by substituting the xi's in the maximum likelihood estimates with Xi's.
• Why does the maximum likelihood estimator work? ---
 • The likelihood function tells us how likely the observed sample values are under the distribution with the parameters to be estimated.
 • Maximizing the likelihood function value gives the parameter values for which the observed sample is most likely to have been generated.
• Example 8.4 ---
  Let X1, X2, ..., Xn be a random sample arising from a normal distribution with parameters (μ, σ²). Find the maximum likelihood estimators for μ and σ².
  Solution:
 • Since all Xi's are independent by the definition of random sample, applying Proposition 6.2 and the induction principle, we get the likelihood function for them as:
     f(x1, x2, …, xn; μ, σ²) = [1/√(2πσ²)] e^{−(1/(2σ²))(x1−μ)²} × … × [1/√(2πσ²)] e^{−(1/(2σ²))(xn−μ)²}
                             = [1/(2πσ²)]^{n/2} e^{−(1/(2σ²)) Σ_{i=1}^{n} (xi−μ)²}.
 • By taking the natural logarithm, the above equality becomes
     ln[f(x1, x2, …, xn; μ, σ²)] = −(n/2)ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − μ)².
 • Taking the partial derivatives of the above equality with respect to μ and σ², respectively, equating them to zero, and solving the resulting two equations, we get (details omitted)
     μ̂ = X̄,  σ̂² = Σ_{i=1}^{n} (Xi − X̄)²/n.
 • Note that σ̂² above is biased, which is different from the unbiased sample variance S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1) we had before.
 • Nevertheless, as n → ∞, σ̂² and S² become the same. And so, lim_{n→∞} E[σ̂²] = E[S²] = σ². As such, σ̂² is called an asymptotically unbiased estimator in some books.
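• A quick numerical illustration of these maximum likelihood formulas (a sketch with illustrative population parameters μ = 5 and σ = 3):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=3.0, size=1_000)

    n = x.size
    mu_hat = x.mean()                        # MLE of mu (the sample mean)
    var_hat = np.sum((x - mu_hat)**2) / n    # MLE of sigma^2 (divides by n, biased)
    s2 = np.sum((x - mu_hat)**2) / (n - 1)   # unbiased sample variance, for contrast

    print(mu_hat, var_hat, s2)               # mu_hat near 5; var_hat and s2 near 9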
• Proposition 8.1 (invariance property of the maximum likelihood estimator) ---
  If θ̂ is the maximum likelihood estimator of the parameter θ, and g is a one-to-one function of θ, then g(θ̂) is the maximum likelihood estimator of g(θ).
  Proof: omitted; for a proof, see the following reference:
    R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 5th Ed., Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1995.
8.3 Inference of Confidence Intervals for Parameters
• Ideas ---
 • Insufficiency of point estimation ---
  • For example, let X̄ be the sample mean of a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with parameters (μ, σ²).
  • Then, X̄ ~ N(μ, σ²/n) according to Facts 7.4 and 7.11.
  • It follows that P{X̄ > μ} = P{X̄ < μ} = 1/2, but P{X̄ = μ} = 0.
  • Therefore, X̄ ≠ μ with probability 1 (i.e., P{X̄ ≠ μ} = 1).
  • That is, the estimator X̄ for μ is actually not precise from the viewpoint of probability.
 • Need for interval estimation ---
  • Consequently, for inference about a parameter θ, instead of just computing a point estimate of θ, it is desirable to report an interval of values that contains the unknown parameter with high probability (i.e., with confidence).
  • Such an interval is called a confidence interval, as defined in the following.
• Definition 8.10 (confidence interval) ---
  Two estimators θ̂1 and θ̂2 of a parameter θ determined from an estimator θ̂ of θ are said to form a 100(1 − α)% confidence interval (θ̂1, θ̂2) of θ (also called a confidence interval with 100(1 − α)% confidence level) if
    P{θ̂1 < θ < θ̂2} = 1 − α,
  where 1 − α is called the confidence coefficient.
• Confidence interval for the mean of a normal population distribution with known variance ---
 • Reasoning for derivation of the confidence interval ---
  • Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X ~ N(μ, σ²), it is known from Facts 7.4 and 7.11 that the sample mean X̄ of the random sample is normally distributed with mean and variance being μ and σ²/n, respectively, i.e., X̄ ~ N(μ, σ²/n).
  • Therefore, (X̄ − μ)/(σ/√n) is a unit normal random variable Z with cdf Φ(x) (the error function).
  • Define z_α to be a value such that P{Z > z_α} = α (see Fig. 8.4 for an illustration), or equivalently, P{Z ≤ z_α} = Φ(z_α) = 1 − α.
  • Consequently, z_{α/2} is such that P{Z ≤ z_{α/2}} = 1 − α/2 and
      P{−z_{α/2} < Z < z_{α/2}} = Φ(z_{α/2}) − Φ(−z_{α/2})
                                = Φ(z_{α/2}) − (1 − Φ(z_{α/2}))
                                = 2Φ(z_{α/2}) − 1
                                = 2(1 − α/2) − 1
                                = 1 − α.
    That is,
      P{−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}} = 1 − α.    (8.2)
  [Figure 8.4: the standard normal pdf with the upper-tail area α to the right of z_α marked.]
  Fig. 8.4 Illustration of the meaning of α in the confidence coefficient 1 − α.
  • For example, if α = 0.05 so that the confidence coefficient 1 − α = 0.95 (95%), then z_{α/2} = z_{0.025} = 1.96, as can be figured out using the error function table (Table 5.1). Furthermore, if 1 − α = 0.90 or 0.99, then the corresponding z_{α/2} are z_{0.05} = 1.645 and z_{0.005} = 2.58, respectively.
  • Now Equality (8.2) may be transformed into the following form:
      P{X̄ − z_{α/2}σ/√n < μ < X̄ + z_{α/2}σ/√n} = 1 − α,
    from which we get the desired 100(1 − α)% confidence interval (L, U) where
      L = X̄ − z_{α/2}σ/√n (lower limit);
      U = X̄ + z_{α/2}σ/√n (upper limit).    (8.3)
  • That is, the interval (L, U) contains the mean μ of the normal population distribution with known variance σ² with probability 1 − α.
  • Note that L and U are themselves random variables, and possible values of (L, U) computed from random sample values are denoted as (l, u) with l = x̄ − z_{α/2}σ/√n and u = x̄ + z_{α/2}σ/√n, respectively.
  • An illustration of a possible interval (l, u) and the related probabilities is shown in Fig. 8.5.
  • Consequently, by Definition 8.10 we have the following proposition.
  Fig. 8.5 An illustration of the confidence interval (L, U) and the related probabilities.
• Proposition 8.2 (100(1 − α)% confidence interval for a normal population mean with known variance) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with unknown mean μ and known variance σ², the 100(1 − α)% confidence interval for μ is
    (X̄ − z_{α/2}σ/√n, X̄ + z_{α/2}σ/√n),    (8.4)
  or in a simpler form,
    X̄ ± z_{α/2}σ/√n.    (8.5)
 • Recall: we use x̄ to denote the sample mean value x̄ = (x1 + x2 + … + xn)/n for a random sample of X1 = x1, X2 = x2, …, Xn = xn.
• Notes:
 • The upper and lower limits U and L as described in (8.3) through (8.5) are random variables since X̄ is a random variable.
 • If the sample mean value x̄ is used in place of X̄ in these formulas, then the result, still called a confidence interval, is denoted as (l, u), corresponding to (L, U).
 • Note that each (l, u) is just a possible outcome of (L, U). See Fig. 8.6 for an illustration.
  Fig. 8.6 Illustration of the randomness of observed confidence intervals.
• Example 8.5 ---
  Given a random sample of size n = 10 arising from a normal population random variable X ~ N(μ, 4) and the sample mean value x̄ = 15.1, compute a 95% confidence interval for μ.
  Solution:
 • 1 − α = 0.95, α = 0.05, and so z_{α/2} = z_{0.025} = 1.96, as mentioned before.
 • From Proposition 8.2, the confidence interval is (L, U) with
     L = X̄ − 1.96σ/√n,  U = X̄ + 1.96σ/√n.
 • Therefore, with the sample mean value x̄ = 15.1 and known σ² = 4, or equivalently σ = 2, the desired confidence interval is (l, u) where
     l = x̄ − 1.96σ/√n = 15.1 − 1.96×2/√10 ≈ 13.86;
     u = x̄ + 1.96σ/√n = 15.1 + 1.96×2/√10 ≈ 16.34.
 • Therefore, we have 95% confidence that the mean μ of the normal population distribution falls within the interval (13.86, 16.34).
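• The computation of Example 8.5 may be scripted as follows (a minimal sketch; scipy.stats.norm.ppf supplies z_{α/2} in place of a table lookup):

    import numpy as np
    from scipy import stats

    x_bar, sigma, n = 15.1, 2.0, 10
    alpha = 0.05

    z = stats.norm.ppf(1 - alpha / 2)               # z_{0.025} = 1.96
    half_width = z * sigma / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # approx. (13.86, 16.34)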
• Notes ---
 • The results derived in the previous discussions are not practical because the variance was assumed to be known in advance.
 • In real applications, the population variance is usually unknown, and we need more theory before we can compute the confidence interval for the mean under such conditions.
• Student's t distribution ---
 • Definition 8.11 (t distribution) ---
   Let Z denote the unit normal random variable, and W denote a χ² random variable with n degrees of freedom. Also, assume that Z and W are independent. Then the random variable
     T = Z/√(W/n)    (8.6)
   is said to possess a (Student's) t distribution with n degrees of freedom.
• Proposition 8.3 (the pdf of a random variable with a t distribution) ---
  If T is a random variable possessing a t distribution with n degrees of freedom, then its pdf is given by
    f(t) = [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞.    (8.7)
  Proof:
 • We will try to find the pdf of T by applying Theorem 6.1.
 • Define two new random variables T = g(Z, W) = Z/√(W/n) and U = h(Z, W) = W, so that t = g(z, w) = z/√(w/n) and u = h(z, w) = w.
 • The Jacobian of g and h is:
     J(z, w) = det [[∂g/∂z, ∂g/∂w], [∂h/∂z, ∂h/∂w]] = det [[1/√(w/n), −z/(2√(w³/n))], [0, 1]] = 1/√(w/n),
   which we may assume is not equal to zero.
 • Then, the inverse transforms of g and h may be found to be
     z = r(t, u) = t√(u/n) and w = s(t, u) = u.
 • According to Theorem 6.1 and the independence of W and Z, the joint pdf of T and U, with z = t√(u/n) and w = u, is
     fTU(t, u) = fZW(z, w)|J(z, w)|⁻¹
               = fZW(t√(u/n), u)·√(u/n)
               = fZ(t√(u/n))·fW(u)·√(u/n)    (by the independence of Z and W)
               = [1/√(2π)] e^{−ut²/(2n)} · [1/(2^{n/2}Γ(n/2))] e^{−u/2} u^{(n/2)−1} · √(u/n)
               = [1/(√(2πn)·2^{n/2}·Γ(n/2))] u^{(n−1)/2} e^{−(u/2)[1+(t²/n)]}
   for −∞ < t < ∞ and u > 0, and fTU(t, u) = 0 otherwise.
 • The marginal pdf fT(t) of T then is
     fT(t) = ∫₀^∞ fTU(t, u) du = [1/(√(2πn)·2^{n/2}·Γ(n/2))] ∫₀^∞ u^{(n−1)/2} e^{−(u/2)[1+(t²/n)]} du.
 • Let y = (u/2)[1 + (t²/n)]. Then u = 2y[1 + (t²/n)]⁻¹ so that du = 2[1 + (t²/n)]⁻¹ dy.
 • And so
     fT(t) = [1/(√(2πn)·2^{n/2}·Γ(n/2))] ∫₀^∞ {2[1 + (t²/n)]⁻¹y}^{(n−1)/2} e^{−y} · 2[1 + (t²/n)]⁻¹ dy
           = [1/(√(nπ)·Γ(n/2))] [1 + (t²/n)]^{−(n+1)/2} ∫₀^∞ y^{[(n+1)/2]−1} e^{−y} dy.
 • But the above integral ∫₀^∞ y^{[(n+1)/2]−1} e^{−y} dy, according to Definition 6.7: Γ(t) = ∫₀^∞ e^{−y} y^{t−1} dy, is just Γ((n+1)/2).
 • Therefore, fT(t) above is equal to
     fT(t) = [Γ((n+1)/2) / (√(nπ)·Γ(n/2))] (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞.
   Done.
• Fact 8.8 (a limiting property of the t distribution) ---
  The pdf of the t distribution satisfies
    lim_{n→∞} f(t) = (1/√(2π)) e^{−t²/2},
  which means that as n → ∞, the t distribution resembles a unit normal distribution.
  Proof:
 • A fact from elementary calculus is: lim_{n→∞} (1 + t²/n)^n = e^{t²}, so
     lim_{n→∞} (1/√(2π)) (1 + t²/n)^{−n/2} = (1/√(2π)) [lim_{n→∞} (1 + t²/n)^n]^{−1/2}
                                           = (1/√(2π)) e^{−t²/2}.    (8a)
 • A property of the gamma function is:
     Γ(x + δ) / [Γ(x) x^δ] ≈ 1 for large x and small δ
   (see the book by K. Fukunaga listed below, p. 574).
     K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed., Academic Press, San Diego, CA, USA, 1990.
 • Another property of the gamma function is:
     Γ(x + 1/2) / Γ(x) = √x (1 − 1/(8x) + 1/(128x²) + 5/(1024x³) − 21/(32768x⁴) + …)
   (see the website listed below, Eq. (98)).
     http://mathworld.wolfram.com/GammaFunction.html
 • By either of the above two properties and with x = n/2 and δ = 1/2, we can get
     lim_{n→∞} Γ((n+1)/2) / [(n/2)^{1/2} Γ(n/2)] = 1.    (8b)
 • Also, with t being finite, it is easy to see that the following equality is true:
     lim_{n→∞} (1 + t²/n)^{−1/2} = 1.    (8c)
 • According to the three facts (8a)-(8c) derived above, and writing √(nπ) = (n/2)^{1/2} √(2π), lim_{n→∞} f(t) becomes
     lim_{n→∞} f(t) = lim_{n→∞} [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + t²/n)^{−(n+1)/2}
                    = {lim_{n→∞} Γ((n+1)/2) / [(n/2)^{1/2} Γ(n/2)]} × {lim_{n→∞} (1 + t²/n)^{−1/2}} × {lim_{n→∞} (1/√(2π)) (1 + t²/n)^{−n/2}}
                    = (8b) × (8c) × (8a)
                    = 1 × 1 × (1/√(2π)) e^{−t²/2}
                    = (1/√(2π)) e^{−t²/2},
   which is the pdf of a unit normal distribution. Done.
• Fact 8.9 (mean and variance of the t distribution) ---
  A random variable T possessing a t distribution with n degrees of freedom has the following mean and variance:
    E[T] = 0;  Var(T) = n/(n − 2)  (for n > 2).
  Proof: left as an exercise.
• Shape of the pdf of the t distribution ---
 • Some shapes of the t distribution for various degrees of freedom are shown in Fig. 8.7.
 • According to Fact 8.8, as n → ∞, the shape of the t distribution becomes that of a unit normal distribution, which is the top (pink) curve in the figure.
  [Figure 8.7: t pdfs for n = 1, 2, 5, 10, and ∞.]
  Fig. 8.7 Shape of the pdf of the t distribution, where "n" means "degrees of freedom." Note the curve with n = ∞, which becomes that of a unit normal distribution.
• Proposition 8.4 ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with parameters (μ, σ²), let X̄ and S² denote its sample mean and sample variance. Then, the function
    (X̄ − μ)/(S/√n)    (8.8)
  possesses a t distribution with n − 1 degrees of freedom. (Note: S = √(S²) is called the sample standard deviation of the random sample.)
  Proof:
 • Fact 7.15(c) says that (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom.
 • Also, by applying Fact 6.20 repeatedly, X̄ may be proved to be normally distributed since the population random variable X is normal.
 • Furthermore, since the mean and variance of X̄ are μ and σ²/n (by Facts 7.4 and 7.11), we get to know that (X̄ − μ)/(σ/√n) is a unit normal random variable.
 • Therefore, by taking Z and W in Definition 8.11 to be (X̄ − μ)/(σ/√n) and (n − 1)S²/σ², respectively, we get to know that
     T = Z/√(W/(n − 1)) = [(X̄ − μ)/(σ/√n)] / √[(n − 1)S²/(σ²(n − 1))] = (X̄ − μ)/(S/√n)
   possesses a t distribution with n − 1 degrees of freedom. Done.
• Computing the confidence interval for the mean of a normal population distribution with unknown variance ---
 • Reasoning for derivation of the confidence interval ---
  • Let T be a random variable possessing a t distribution with n − 1 degrees of freedom as defined previously.
  • Define t_{α,n−1} to be a value such that P{T > t_{α,n−1}} = α, or equivalently, P{T ≤ t_{α,n−1}} = 1 − α (see Fig. 8.8 for an illustration).
   [Figure 8.8: the t pdf with the upper-tail area α to the right of t_{α,n−1} marked.]
   Fig. 8.8 Illustration of the pdf of a t distribution, where t_{α,n−1} is such that P{T > t_{α,n−1}} = α.
  • To compute the values of t_{α,n}, a table found at the website
      http://wise.xmu.edu.cn/course/ugecon2/t-table.pdf
    can be searched to get the values of t_{α,n} for various n and α. For example, for α = 0.05 and 0.01 and n = 13 and 22, the values of t may be found to be t_{0.05,13} = 1.771; t_{0.05,22} = 1.717; t_{0.01,13} = 2.650; t_{0.01,22} = 2.508.
  • On the other hand, by Proposition 8.4, (X̄ − μ)/(S/√n) has a t distribution with n − 1 degrees of freedom.
  • Therefore, similar to the derivation of (8.2) and by the symmetry of the t distribution, we can get
      P{−t_{α/2,n−1} < (X̄ − μ)/(S/√n) < t_{α/2,n−1}} = 1 − α.    (8.9)
  • Equality (8.9) may be easily transformed into the following form:
      P{X̄ − t_{α/2,n−1}S/√n < μ < X̄ + t_{α/2,n−1}S/√n} = 1 − α.
  • Therefore, by Definition 8.10, we have the following proposition.
• Proposition 8.5 (100(1 − α)% confidence interval for a normal population mean with unknown variance) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population distribution with unknown mean μ and unknown variance σ², the 100(1 − α)% confidence interval for μ is
    (X̄ − t_{α/2,n−1}S/√n, X̄ + t_{α/2,n−1}S/√n),    (8.10)
  or in a simpler form,
    X̄ ± t_{α/2,n−1}S/√n.    (8.11)
• Example 8.6 ---
  Given a random sample of size n = 15 observed from a normal population distribution with the sample values given below,
    26.7, 25.8, 24.0, 24.9, 26.4, 25.9, 24.4, 21.7, 24.1, 25.9, 27.3, 26.9, 27.3, 24.8, 23.6,
  compute a 95% confidence interval for the population mean μ.
  Solution:
 • From Proposition 8.5, the confidence interval is (L, U) (in random variable form) with
     L = X̄ − t_{α/2,n−1}S/√n,  U = X̄ + t_{α/2,n−1}S/√n.
 • The sample mean value x̄ and sample standard deviation value s may be computed to be 25.31 and 1.58, respectively (details omitted).
 • Now, n = 15 and α = 0.05, and so t_{α/2,n−1} = t_{0.025,14} = 2.145 according to the t table found at the above-mentioned website.
 • Therefore, the desired confidence interval is (l, u) where
     l = x̄ − 2.145s/√n = 25.31 − 2.145×1.58/√15 ≈ 24.43;
     u = x̄ + 2.145s/√n = 25.31 + 2.145×1.58/√15 ≈ 26.19.
 • Thus, a 95% confidence interval for the population mean μ is (24.43, 26.19).
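• Example 8.6 as a sketch, with scipy.stats.t.ppf replacing the t-table lookup (a minimal illustration, not part of the original notes):

    import numpy as np
    from scipy import stats

    x = np.array([26.7, 25.8, 24.0, 24.9, 26.4, 25.9, 24.4, 21.7,
                  24.1, 25.9, 27.3, 26.9, 27.3, 24.8, 23.6])
    alpha = 0.05
    n = x.size

    x_bar = x.mean()
    s = x.std(ddof=1)                                # sample std. dev. (n - 1 divisor)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t_{0.025,14} = 2.145
    half_width = t_crit * s / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)    # approx. (24.43, 26.19)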
• Computing a large-sample confidence interval for the mean of any population ---
 • Idea ---
  • Recall the two ways of deriving the confidence intervals for the mean of a normal population:
    (1) Way 1 --- assuming the population variance σ² is known.
    (2) Way 2 --- assuming the population variance σ² is unknown.
  • Both ways require the assumption that the concerned population is normal.
  • When the size n of the random sample is large enough, we may infer confidence intervals for the mean of any population without making the assumptions of normality and the availability of the population variance.
  • That is, we can infer intervals for the mean of any population as n → ∞.
  • The underlying theory supporting this possibility is the central limit theorem and some other facts, as discussed in the following.
• ***Consistency of the variance of the sample variance ---
 • Fact 7.14 says that the variance of the sample variance S² of a normally distributed population with variance σ² is Var(S²) = 2σ⁴/(n − 1).
 • A more general fact with no assumption of the normality of the population, which we state without proof, is:
     Var(S²) = σ⁴(2/(n − 1) + κ/n),
   where κ is the (excess) kurtosis of the population distribution, defined as "the fourth moment around the mean (called the central moment) divided by the square of the variance σ² of the population distribution, minus 3" (or as "the fourth normalized central moment minus 3"):
     κ = μ₄/σ⁴ − 3,
   where the central moment means μ_k = E[(X − μ)^k] (see the following web pages for more details: http://en.wikipedia.org/wiki/Moment_(mathematics) and http://en.wikipedia.org/wiki/Variance).
 • From the following web page: http://en.wikipedia.org/wiki/Normal_distribution, we get to know that the fourth central moment of a normal distribution is 3σ⁴.
 • Therefore, for normal distributions, κ = 0.
 • But for general distributions, κ need not be zero, and Var(S²) may be rewritten as
     Var(S²) = σ⁴(2/(n − 1) + κ/n) = σ⁴[2/(n − 1) + (μ₄/σ⁴ − 3)/n]
             = (1/n)[μ₄ − (n − 3)σ⁴/(n − 1)].    (B)
• Fact 8.10 (consistency of the sample variance of a general population) ---
  The sample variance S² is a consistent estimator of the variance σ² of any population distribution.
  Proof:
 • We know that S² is an unbiased estimator of the variance of the population distribution from Fact 8.1.
 • From the equality (B) above, we have
     lim_{n→∞} Var(S²) = 0,
   because both the central moment μ₄ and the square σ⁴ = (σ²)² of the variance of the population are constants.
 • Therefore, by Fact 8.7a, we get to know that S² is a consistent estimator of σ². Done.
• Comments ---
 • By the definition of consistency, the above fact says that
     lim_{n→∞} P{|S² − σ²| < ε} = 1  ∀ ε > 0,
   which is a form of the weak law of large numbers for the variance σ², and we may say that S² converges toward σ² in probability, or in notation, that S² → σ² as n → ∞, or equivalently, S → σ as n → ∞.
 • Fact 8.7 is just a special case of Fact 8.10 above.
• Reasoning for derivation of the confidence interval ---
 • As n → ∞, according to the central limit theorem, if X̄ is the sample mean of a random sample of size n arising from any random variable with mean μ and variance σ², then Y = (X̄ − μ)/(σ/√n) is approximately a unit normal random variable.
 • Let z_α be such that P{(X̄ − μ)/(σ/√n) ≤ z_α} ≈ Φ(z_α) = 1 − α.
 • Then, using a process similar to that for deriving (8.2), we can get
     P{−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}} ≈ 1 − α.
 • The value σ in the above equality is unknown, so it must be estimated.
 • For this purpose, we may replace it with the sample standard deviation S, because by Fact 8.10 above and the comment following it, we have S → σ as n → ∞.
 • This results in
     P{−z_{α/2} < (X̄ − μ)/(S/√n) < z_{α/2}} ≈ 1 − α,
   which may be transformed into the following form:
     P{X̄ − z_{α/2}S/√n < μ < X̄ + z_{α/2}S/√n} ≈ 1 − α.
 • Therefore, by Definition 8.10, we have the following proposition.
• Proposition 8.6 (large-sample 100(1 − α)% confidence interval for the mean of any population) ---
  Given a large random sample X1, X2, ..., Xn of size n arising from a population distribution with unknown mean μ and unknown variance σ², the large-sample 100(1 − α)% confidence interval for μ is
    (X̄ − z_{α/2}S/√n, X̄ + z_{α/2}S/√n),    (8.12)
  or in a simpler form,
    X̄ ± z_{α/2}S/√n,    (8.13)
  where X̄ and S are the sample mean and the sample standard deviation of the random sample, respectively.
• A rule of thumb for selecting n --- when n > 30, the above proposition may be applied (i.e., the requirement of a large sample is satisfied).
• Example 8.7 ---
  Suppose 40 observations (i.e., random sample values) are made of the weight of a 100-lb rivet bag (鉚釘袋) manufactured by a factory. The sample mean and the sample standard deviation of these observations are x̄ = 99.71 and s = 0.88, respectively. What is the 95% confidence interval for the mean weight of the rivet bag?
  Solution:
 • From Proposition 8.6, the 95% confidence interval (with α = 0.05 and z_{α/2} = z_{0.025} = 1.96) is
     (x̄ − z_{α/2}s/√n, x̄ + z_{α/2}s/√n) = (99.71 − 1.96×0.88/√40, 99.71 + 1.96×0.88/√40)
                                         = (99.44, 99.98).
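• Example 8.7 as a sketch (illustrative only; the same pattern as the earlier confidence interval computations):

    import numpy as np
    from scipy import stats

    x_bar, s, n = 99.71, 0.88, 40
    alpha = 0.05

    z = stats.norm.ppf(1 - alpha / 2)               # 1.96
    half_width = z * s / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # approx. (99.44, 99.98)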
• Precision and sample size: inference of the random sample size n for a fixed width of confidence interval ---
 • It is assumed in the previous discussions that the random sample size is fixed for the corresponding confidence interval to be computed.
 • For the large-sample confidence interval derivation as described in Proposition 8.6, sometimes it is desired instead that the width of the confidence interval be fixed first (for the purpose of reducing the interval width to increase the precision), say, to be ±r around the sample mean value x̄, and then to compute the number n of samples to take.
 • This means that the interval now is
     (x̄ − z_{α/2}s/√n, x̄ + z_{α/2}s/√n) = (x̄ − r, x̄ + r),
   so
     r = z_{α/2}s/√n,
   which leads to the solution
     n = (z_{α/2}s/r)².    (8.14)
   (Note: if necessary, take the ceiling of the above value of n as the solution.)
• Example 8.8 (continued from Example 8.7) ---
  In Example 8.7, the computed 95% confidence interval around the sample mean value x̄ = 99.71 with a sample size of n = 40 is (99.44, 99.98), which means the interval width is 2(99.98 − 99.71) = 2×0.27 = 0.54. Suppose a tighter interval with half this width is desired; what is the new sample size n′ that should be used?
  Solution:
 • From (8.14) with r = 0.27/2 = 0.135, we have
     (z_{α/2}s/r)² = (1.96 × 0.88 / 0.135)² ≈ 163.23,
   so n′ should be taken to be ⌈163.23⌉ = 164, where ⌈·⌉ is the integer ceiling function.
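• A sketch of the sample-size formula (8.14) (illustrative values from Example 8.8):

    import math
    from scipy import stats

    s, r = 0.88, 0.135                        # sample std. dev. and target half-width
    z = stats.norm.ppf(1 - 0.05 / 2)          # z_{0.025} = 1.96
    n_needed = math.ceil((z * s / r) ** 2)    # (8.14), with the ceiling taken
    print(n_needed)                           # 164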
• A note ---
  The above discussion of computing the sample size n from a given fixed confidence interval width is applicable to many other cases of confidence interval inference, as can be seen subsequently.
• Computing a large-sample confidence interval for a proportion of a population ---
 • Idea ---
  • It is often desired to derive a confidence interval for the proportion of a population which has a certain property (such as the proportion of the people in a community who are in favor of a certain election candidate against an opponent, or the proportion of the objects in a certain group which have specific characteristics).
  • Let p denote the proportion of "success" in the population, called the success proportion, with success identifying the above-mentioned property.
  • Then, given a random sample of n individuals, X1, X2, ..., Xn, from the population, obviously the number of successes in the sample may be regarded as a binomial random variable X with parameters (n, p), with each sample variable Xi being a Bernoulli random variable.
  • From Fact 4.7, we have E[X] = np and Var(X) = np(1 − p).
  • Also, from the DeMoivre-Laplace limit theorem (a special case of the central limit theorem, as shown in Example 7.13) mentioned in Chapters 5 and 7, we know that as n → ∞, the standardized X may be approximated by a unit normal random variable in the following way:
      P{a ≤ (X − np)/√(np(1 − p)) ≤ b} ≈ Φ(b) − Φ(a),
    or equivalently,
      P{a ≤ (X/n − p)/√(p(1 − p)/n) ≤ b} ≈ Φ(b) − Φ(a),    (8.15)
    where Φ(·) is the error function (the cdf of the unit normal random variable).
  • On the other hand, from Example 8.2, an unbiased estimator of p is the sample proportion p̂ = X/n, which may be used to replace X/n in (8.15) above.
  • Also, if z_{α/2} is such that P{Z ≤ z_{α/2}} = 1 − α/2, where Z is the unit normal random variable, then based on (8.15) we get
      P{−z_{α/2} ≤ (p̂ − p)/√(p(1 − p)/n) ≤ z_{α/2}} ≈ Φ(z_{α/2}) − Φ(−z_{α/2}) = 1 − α.    (8.16)
  • The left part of (8.16) above may be transformed, by solving the two inequalities, into the following form (with the details omitted):
      P{p_l ≤ p ≤ p_u} ≈ 1 − α    (8.17)
    where
      p_l = [p̂ + z²_{α/2}/(2n) − z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n];
      p_u = [p̂ + z²_{α/2}/(2n) + z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n]
    with q̂ = 1 − p̂.
  • As n → ∞, the three terms in each of p_l and p_u above involving z²_{α/2} are negligible in magnitude, so that p_l ≈ p̂ − z_{α/2}√(p̂q̂/n) and p_u ≈ p̂ + z_{α/2}√(p̂q̂/n).
  • Thus, (8.17) becomes
      P{p̂ − z_{α/2}√(p̂q̂/n) ≤ p ≤ p̂ + z_{α/2}√(p̂q̂/n)} ≈ 1 − α.    (8.18)
 • So, by Definition 8.10 we have the following proposition.
• Proposition 8.7 (large-sample 100(1 − α)% confidence interval for a population proportion) ---
  Given a large random sample X1, X2, ..., Xn of size n arising from a population with a success proportion of p, and letting X be the number of successes in the sample, the large-sample 100(1 − α)% confidence interval for p is
    (p̂ − z_{α/2}√(p̂q̂/n), p̂ + z_{α/2}√(p̂q̂/n)),    (8.19)
  or in a simpler form,
    p̂ ± z_{α/2}√(p̂q̂/n),    (8.20)
  where p̂ = X/n and q̂ = 1 − p̂.
• A rule of thumb for selecting the sample size n --- when np > 5 and n(1 − p) > 5, the above proposition may be applied (i.e., the requirement of a large sample is satisfied).
• Definition 8.12 (sampling error) ---
  The value ±r = ±z_{α/2}√(p̂q̂/n), whose absolute value is half of the width of the confidence interval as mentioned previously, is called the sampling error.
• Chinese terms for some frequently-used notions in the media (newspapers, magazines, etc.) ---
 • Confidence interval --- 信賴區間;
 • Sampling error --- 抽樣誤差;
 • Confidence level of 100(1 − α)% --- 100(1 − α)% 信心水準.
• Example 8.9 ---
  In a survey of family opinions about a certain public policy in the Taipei metropolitan area, 1000 families were taken as a random sample, in which 720 were found to be positive toward the policy. At the confidence level of 95%, what is the sampling error r? And what is the confidence interval for the rate p of families with positive opinions in the metropolitan area?
  Solution:
 • Let "being positive toward the policy" mean "success."
 • Then, the sample value of the "number X of successes" is x = 720, the sample size is n = 1000, the estimate of p is p̂ = x/n = 720/1000 = 0.72, and q̂ = 1 − p̂ = 0.28.
 • The confidence level of 95% means α = 1 − 0.95 = 0.05.
 • So, the sampling error is r = ±z_{α/2}√(p̂q̂/n) = ±z_{0.025}√(p̂q̂/n) = ±1.96√(0.72×0.28/1000) ≈ ±0.028. And the confidence interval is
     (l, u) = (0.72 − 0.028, 0.72 + 0.028) = (0.692, 0.748).
 • A translation of the above results, as they might be reported in the Chinese-language media (rendered here in English), is:
   "According to an opinion poll on a certain public policy, 72 percent of the families in the greater Taipei area support the policy. The poll randomly sampled one thousand families; at the 95% confidence level, the sampling error is about plus or minus 2.8 percentage points."
• A website for computing the sampling error --- http://www.dssresearch.com/toolkit/secalc/error.asp
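• Example 8.9 can also be reproduced with the sketch below (a minimal illustration; scipy.stats.norm.ppf supplies z_{α/2}):

    import numpy as np
    from scipy import stats

    x, n = 720, 1000
    alpha = 0.05

    p_hat = x / n
    q_hat = 1 - p_hat
    z = stats.norm.ppf(1 - alpha / 2)       # 1.96

    r = z * np.sqrt(p_hat * q_hat / n)      # sampling error, approx. 0.028
    print(r, (p_hat - r, p_hat + r))        # approx. (0.692, 0.748)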
• A note ---
  Actually the equality (8.17), repeated below, may also be used directly for confidence interval inference with better precision:
    P{p_l ≤ p ≤ p_u} ≈ 1 − α    (8.17)
  where
    p_l = [p̂ + z²_{α/2}/(2n) − z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n];
    p_u = [p̂ + z²_{α/2}/(2n) + z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n]
  with q̂ = 1 − p̂.
• Precision and sample size ---
 • Again, as discussed previously, sometimes it is desired to fix in advance the precision of the confidence interval, which now is called the sampling error r, and to compute the corresponding sample size n.
 • For this, it seems that we can solve r = z_{α/2}√(p̂q̂/n) to get a solution for n.
 • However, this is impractical because p̂ and q̂ = 1 − p̂, both unknown before the sample is taken, are included in r = z_{α/2}√(p̂q̂/n).
 • One way out is to redefine r as the maximum sampling error we want, and under this assumption, to find the corresponding minimum n.
 • For this, first try to find the maximum value of p̂q̂ = p̂(1 − p̂), which occurs when p̂ = 1/2 (setting the derivative d(p̂q̂)/dp̂ = 1 − 2p̂ = 0 leads to the solution p̂ = 1/2).
 • So, substituting p̂ = 1/2 into r = z_{α/2}√(p̂q̂/n), where q̂ = 1 − p̂, we get the solution for n as
     n = (1/4)(z_{α/2}/r)².    (8.21)
 • Another way is to specify an expected value for p̂, say denoted as p̂₀, and solve r = z_{α/2}√(p̂q̂/n) to get
     n = p̂₀(1 − p̂₀)(z_{α/2}/r)².    (8.22)
• Example 8.10 (continued from Example 8.9) ---
  As a continuation of Example 8.9, suppose that it is desired to have a sampling error no larger than r = 0.02 instead of the original value of 0.028. At least how many families should be sampled? On the other hand, if it is assumed that the estimate p̂ = 0.72 stays the same, what is the number of families that should be sampled?
  Solution:
 • The first question can be answered using the first way mentioned above.
 • So by (8.21), we get
     n = (1/4)(z_{α/2}/r)² = (1/4)(1.96/0.02)² = (1/4)×98² = 49² = 2401.
 • Therefore, at least 2401 families should be sampled.
 • If p̂ is assumed to stay the same at 0.72 when r = 0.02, then the number of families that should be sampled now, by (8.22), is
     n = p̂₀(1 − p̂₀)(z_{α/2}/r)² = 0.72×(1 − 0.72)×(1.96/0.02)² = 0.72×0.28×98² ≈ 1936.17, rounded up to 1937.
 • That is, 1937 instead of 1000 families should be sampled.
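• A sketch of the two sample-size formulas (8.21) and (8.22) (illustrative; the exact z value makes the worst-case count land just under 2401 before the ceiling):

    import math
    from scipy import stats

    z = stats.norm.ppf(1 - 0.05 / 2)         # approx. 1.96
    r = 0.02

    # Worst case (p-hat = 1/2), Eq. (8.21):
    n_worst = math.ceil(0.25 * (z / r) ** 2)
    # With an expected proportion p0 = 0.72, Eq. (8.22):
    p0 = 0.72
    n_expected = math.ceil(p0 * (1 - p0) * (z / r) ** 2)

    print(n_worst, n_expected)               # 2401 and 1937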
• Computing a small-sample confidence interval for a proportion of a population ---
 • Idea ---
  • The above discussions are based on the use of large samples.
  • It is also possible to obtain a confidence interval for the success proportion p when n is small. The inference of the interval is based directly on the use of the binomial distribution.
  • Assume that the number of successes in the n samples is the sample total value X = x.
  • For simplification of notation hereafter, we define B(x; n, p) as the cdf
      P{X ≤ x} = Σ_{i=0}^{x} C(n, i) p^i (1 − p)^{n−i}
    of a binomial random variable X with parameters (n, p). This notation is also used in many binomial distribution tables found in books, like the one at the following website:
      http://webcache.googleusercontent.com/search?q=cache:kHbpn3dKAIYJ:www.statisticshowto.com/tables/binomial-distribution-table/+binomial+distribution+table&cd=10&hl=zh-TW&ct=clnk&gl=tw.
  • To find the 100(1 − α)% confidence interval (p_l, p_u) for p, we may consider first finding two functions, b(p_l) and a(p_u), as limits for X so that for any p, approximately the following equality is true:
      P{a(p_u) < X ≤ b(p_l)} = 1 − α,
    where
    (1) we assign P{X ≤ a(p_u)} = α/2 and P{X > b(p_l)} = α/2 (or equivalently, P{X ≤ b(p_l)} = 1 − α/2) so that the above equality holds; and
    (2) we define a(p_u) and b(p_l) to be such that P{X ≤ a(p_u)} = B(x; n, p_u) and P{X ≤ b(p_l)} = B(x − 1; n, p_l).
  • Accordingly, P{X ≤ a(p_u)} = B(x; n, p_u) = α/2, and P{X ≤ b(p_l)} = B(x − 1; n, p_l) = 1 − α/2, from which, by a binomial distribution table, we can find approximate values of p_u and p_l to satisfy the equalities (note: only approximate values can be found due to the discreteness of the random variable X).
  • Finally, take the found limits p_l and p_u to construct the desired 100(1 − α)% confidence interval (p_l, p_u).
 • A reference for more about the above topic is:
    N. Johnson and F. Leone, Statistics and Experimental Design in Engineering and the Physical Sciences, Vol. II, 2nd Ed., Wiley, New York, 1977.
 • The above discussions lead to the following proposition.
• Proposition 8.8 (small-sample 100(1 − α)% confidence interval for a population proportion) ---
  Given a small random sample X1, X2, ..., Xn of size n arising from a population with a success proportion of p, and letting X be the number of successes in the sample with observed value x, the small-sample 100(1 − α)% confidence interval for p is (p_l, p_u), where p_l and p_u are respectively such that
    B(x − 1; n, p_l) = 1 − α/2;  B(x; n, p_u) = α/2.
• Example 8.11 ---
  Twenty units are sampled from a continuous production line and four items are found to be defective. Find an approximate 90% confidence interval for the true proportion defective, p, using the small-sample approach described previously.
  Solution:
 • The proportion defective is estimated to be p̂ = x/n = 4/20 = 0.20.
 • 1 − α = 0.9, and so α = 0.1 and α/2 = 0.05.
 • The upper limit p_u may be found by solving the following equality:
     B(x; n, p_u) = B(4; 20, p_u) = Σ_{i=0}^{4} C(20, i) p_u^i (1 − p_u)^{20−i} = α/2 = 0.05,
   which leads to the approximate solution p_u = 0.4 (for this p_u, the more precise value of B(4; 20, 0.4) is 0.051).
 • The lower limit p_l may be found by solving the following equality:
     B(x − 1; n, p_l) = B(3; 20, p_l) = Σ_{i=0}^{3} C(20, i) p_l^i (1 − p_l)^{20−i} = 1 − α/2 = 0.95,
   which leads to the approximate solution p_l = 0.071 (for this p_l, the more precise value of B(3; 20, 0.071) is about 0.95).
 • A 90% confidence interval for the proportion defective p, given the computed p̂ = 0.20, is thus (0.071, 0.400).
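• The two defining equalities of Proposition 8.8 can also be solved numerically instead of by table lookup. The sketch below uses a root finder on the binomial cdf (the resulting limits are what the wider literature calls Clopper-Pearson limits; the function names are standard SciPy):

    from scipy import stats
    from scipy.optimize import brentq

    x, n, alpha = 4, 20, 0.10

    # Solve B(x; n, p_u) = alpha/2 and B(x - 1; n, p_l) = 1 - alpha/2 for p.
    p_u = brentq(lambda p: stats.binom.cdf(x, n, p) - alpha / 2,
                 1e-9, 1 - 1e-9)
    p_l = brentq(lambda p: stats.binom.cdf(x - 1, n, p) - (1 - alpha / 2),
                 1e-9, 1 - 1e-9)

    print(p_l, p_u)   # approx. 0.071 and 0.401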
8.4 Hypothesis Testing
• Concept and definitions ---
 • Hypothesis testing means making a decision to accept or reject a claim, called a hypothesis, regarding a parameter of interest.
 • Random sampling is often used in hypothesis testing.
 • Definition 8.13 (statistic) ---
   A statistic is any function of the random variables constituting one or more samples, provided that the function does not depend on any unknown parameter values.
 • A note --- e.g., the various point estimators mentioned previously are statistics.
• Definition 8.14 (hypotheses) ---
  In a hypothesis testing problem, the claim initially favored or believed to be true is called the null hypothesis, denoted by H0; and the other claim in the problem is called the alternative hypothesis, denoted by Ha.
• Another note --- e.g., H0: μ = 0.75; Ha: μ ≠ 0.75, where μ is the mean of a population.
• Definition 8.15 (hypothesis test procedure) ---
  A hypothesis test procedure is specified by:
 • a test statistic --- a function of the sample data on which the decision of rejecting H0 or not is based;
 • a rejection (critical) region --- the set of all test statistic values for which H0 will be rejected.
• Definition 8.16 (type I and II errors) ---
  A test procedure may yield errors in the decision-making result, including the following two types:
 • type I error --- coming from rejecting H0 when H0 is true, with its probability denoted by α and called the significance level of the test;
 • type II error --- coming from accepting H0 when H0 is false (or equivalently, from rejecting Ha when Ha is true), with its probability denoted by β.
• Notes ---
 • A level α test is a hypothesis test whose type I error probability (or significance level) is α.
 • α is often chosen in the first place to be 0.10, 0.05, or 0.01.
• Tests about a population mean --- case I: a normal population with known standard deviation ---
 • Problem definition ---
   Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with unknown mean μ and known standard deviation σ, we want to make a decision about whether the mean μ is equal to a special value μ0, called the null value.
 Reasoning of the solution -- Null hypothesis: H0:  = 0
(e.g., we may want to make a decision about whether the average life 
of a given set of tires of a new design has no change, i.e., it still equals
the old parameter 0).
 A possible alternative hypothesis: Ha:  > 0
(that is, the new design yields instead tires having a longer average life
 than the old one 0).
 Since the population distribution is normal with parameters (, 2), the
sample mean X has mean  and variance 2/n (from Facts 7.4 and
7.11).
 Therefore,
X 
is a unit normal random variable Z with mean  and
/ n
standard deviation.
 If the null hypothesis H0:  = 0 is true, then
X  0
will also be a
/ n
unit normal random variable (note:  is replaced by 0 here).
 Given a set of random sample values x1, x2, ..., xn, we have a sample
n
mean value x = (  xi)/n which is an estimate of , i.e., x  .
i1
 If the null hypothesis H0:  = 0 is true, the distance d between x ()
and 0 should be close to zero; on the contrary, for H0 to be rejected (or
Ha to be accepted), this distance should be large in value.
▪ In more detail,
- if the alternative hypothesis is Ha: μ > μ0, then this distance d should be positive and large enough in magnitude;
- if Ha is μ < μ0, then d should be negative and large enough in magnitude;
- and finally, if Ha is μ ≠ μ0, then d should be large enough in magnitude, no matter whether it is positive or negative.
▪ However, terms like “large enough” are not precise enough, and to improve this,
- first, a better measure is to substitute x̄ into Z = (X̄ − μ0)/(σ/√n) to give a “sample Z value” z = (x̄ − μ0)/(σ/√n), which is in a sense just the standardized distance d between x̄ (≈ μ) and μ0, expressed in units of the standard deviation σ/√n;
- secondly, we need a threshold value to express “large enough,” and for this we use the value zα or zα/2 defined previously by P{Z > zα} = α and P{Z > zα/2} = α/2 (see Fig. 8.4 for an illustration of the meaning of zα).
▪ Accordingly, for H0: μ = μ0, we define the following decision rules for rejecting H0 in favor of the various alternative hypotheses Ha:
- (upper-tailed test) reject H0: μ = μ0 in favor of Ha: μ > μ0 if z ≥ zα;
- (lower-tailed test) reject H0: μ = μ0 in favor of Ha: μ < μ0 if z ≤ −zα;
- (two-tailed test) reject H0: μ = μ0 in favor of Ha: μ ≠ μ0 if z ≥ zα/2 or z ≤ −zα/2.
▪ The term Z = (X̄ − μ0)/(σ/√n) is called a test statistic for this hypothesis testing problem. It comes from the assumption that the population mean μ = μ0.
◦ Critical region and type I and II errors ---
▪ Each of the conditions “z ≥ zα,” “z ≤ −zα,” and “z ≥ zα/2 or z ≤ −zα/2” is said to compose a rejection (critical) region of the corresponding Ha.
▪ For the upper-tailed test ---
- (Type I error α) if z = (x̄ − μ0)/(σ/√n) computed from x̄ falls erroneously in the rejection region described by z ≥ zα, then H0 is rejected in favor of Ha, incurring a type I error with probability
P{type I error} = P{H0 is rejected when H0 is true}
= P{H0 is rejected when μ = μ0}
= P{(X̄ − μ0)/(σ/√n) = (X̄ − μ)/(σ/√n) ≥ zα}
= P{Z ≥ zα} = α;
- (Type II error β) on the other hand, H0 might erroneously not be rejected (i.e., Ha might be accepted) because z < zα (instead of z ≥ zα), incurring a type II error; and this could happen when the real value of μ is a particular value μ′ that exceeds μ0, so that the probability of this type II error may be computed to be
β(μ′) = P{type II error while μ = μ′}
= P{H0 not rejected while μ = μ′}
= P{(X̄ − μ0)/(σ/√n) < zα while μ = μ′}
= P{X̄ < μ0 + zα·σ/√n while μ = μ′}
= P{(X̄ − μ′)/(σ/√n) < zα + (μ0 − μ′)/(σ/√n)}
= Φ(zα + (μ0 − μ′)/(σ/√n)).
- (Illustration of type I and II errors) a diagram showing the two types of error is given in Fig. 8.9(a).
▪ For the other two types of test (the lower-tailed and two-tailed tests), similar derivations of the probabilities of type I and II errors may be conducted. The rejection regions and the corresponding type I error probabilities for the three types of test are illustrated in Figs. 8.9(b) through (d).
[Fig. 8.9 Rejection regions and α values of the three types of tests: (a) α and β values (left curve for μ = μ0, right curve for μ = μ′; critical line z = 1.645); (b) upper-tailed test; (c) lower-tailed test; (d) two-tailed test. Figure not reproduced here.]
◦ Computing n to satisfy selected α and β ---
▪ For the upper-tailed test ---
- Suppose that for a fixed μ′, we hope an n can be found to satisfy arbitrary choices of both α and β; since β(μ′) = Φ(zα + (μ0 − μ′)/(σ/√n)) as just derived, this implies
−zβ = zα + (μ0 − μ′)/(σ/√n)
(note: just as we define zα to satisfy P{Z ≤ zα} = Φ(zα) = 1 − α, here we define zβ as well to satisfy P{Z ≤ zβ} = Φ(zβ) = 1 − β, so that Φ(−zβ) = β, which explains the term −zβ on the left-hand side of the above equality).
- The above equality can be solved to get the desired sample size n as
n = [σ(zα + zβ)/(μ0 − μ′)]².
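As a quick numerical illustration, the following is a minimal Python sketch (our own, not from the text; it assumes SciPy is available, and the function name solve_n and the trial values of β are ours) of this sample-size formula for the upper-tailed z test.

    # A minimal sketch of the sample-size formula
    # n = [sigma * (z_alpha + z_beta) / (mu0 - mu_prime)]^2 for the upper-tailed z test.
    import math
    from scipy.stats import norm

    def solve_n(mu0, mu_prime, sigma, alpha, beta):
        """Smallest n so the level-alpha upper-tailed z test has type II error <= beta at mu'."""
        z_alpha = norm.ppf(1 - alpha)   # z_alpha satisfies P{Z <= z_alpha} = 1 - alpha
        z_beta = norm.ppf(1 - beta)
        n = (sigma * (z_alpha + z_beta) / (mu0 - mu_prime)) ** 2
        return math.ceil(n)             # round up to the next whole sample

    # Hypothetical values in the spirit of Example 8.12: mu0 = 20000, mu' = 21000, sigma = 1500.
    print(solve_n(20000, 21000, 1500, alpha=0.01, beta=0.10))   # -> 30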
▪ For the lower-tailed and two-tailed tests, the values of n for fixed α and β may be derived similarly.
◦ Summary ---
▪ A summary is given in Table 8.1.
▪ The above type of hypothesis testing is called the z test because of the use of the values zα and zα/2.
Table 8.1 Level α z test of the mean of a normal population with known standard deviation.
Null hypothesis: H0: μ = μ0. Test statistic: Z = (X̄ − μ0)/(σ/√n).
◦ Upper-tailed test --- alternative hypothesis Ha: μ > μ0; rejection region z ≥ zα; type I error prob. α; type II error prob. β(μ′) = Φ(zα + (μ0 − μ′)/(σ/√n)); sample size n for fixed α and β: n = [σ(zα + zβ)/(μ0 − μ′)]².
◦ Lower-tailed test --- Ha: μ < μ0; rejection region z ≤ −zα; type I error prob. α; β(μ′) = 1 − Φ(−zα + (μ0 − μ′)/(σ/√n)); n = [σ(zα + zβ)/(μ0 − μ′)]².
◦ Two-tailed test --- Ha: μ ≠ μ0; rejection region z ≥ zα/2 or z ≤ −zα/2; type I error prob. α; β(μ′) = Φ(zα/2 + (μ0 − μ′)/(σ/√n)) − Φ(−zα/2 + (μ0 − μ′)/(σ/√n)); n = [σ(zα/2 + zβ)/(μ0 − μ′)]² (approximately).
• Example 8.12 ---
A manufacturer of tires is considering modifying its design of the tire tread. A study reveals that the modification is justified only if the average tire life under standard test conditions exceeds 20,000 miles. A random sample of n = 16 prototype tires is manufactured and tested, resulting in a sample mean value of x̄ = 20,758 miles. Suppose that tire life is normally distributed with a known σ = 1500 (the standard deviation value for the current version of the tire). Do these data suggest that the modification justifies a decision to change the tire design to the new one? Solve this problem by a hypothesis testing procedure with a significance level of 0.01.
Solution:
◦ Parameter to test: μ = the true average life of the new tire design.
◦ Null hypothesis: H0: μ = μ0 with μ0 = 20,000 miles (so that accepting H0 means that the new design is ineffective).
◦ Alternative hypothesis: Ha: μ > μ0 (so that rejection of H0 in favor of Ha means that the new design should be adopted).
◦ Test statistic value: z = (x̄ − μ0)/(σ/√n) = (x̄ − 20000)/(1500/√n) (because the samples arise from a normal distribution, so that the sample mean X̄ is also normally distributed).
◦ α = 0.01.
◦ Rejection region: according to Table 8.1, the rejection region described by z ≥ zα is adopted, where α = 0.01 and zα = z0.01 = 2.33 (⇐ P{Z ≤ zα} = 1 − α = 1 − 0.01 = 0.99; check Table 5.1); that is, H0 will be rejected if z ≥ 2.33.
◦ Now the sample values yield
z = (x̄ − 20000)/(1500/√n) = (20758 − 20000)/(1500/√16) ≈ 2.02.
◦ Since z = 2.02 < 2.33 = z0.01 (i.e., z does not fall in the rejection region), H0 cannot be rejected at the significance level α = 0.01, meaning that the data do not provide sufficient evidence that the new design yields tires with an average life exceeding 20,000 miles.
◦ So the new design must be abandoned according to this level 0.01 upper-tailed hypothesis test.
◦ Here, the probability of the type I error is α = 0.01, and the probability of the type II error, say for a true μ′ = 21,000, is
β(μ′) = β(21000) = Φ(zα + (μ0 − μ′)/(σ/√n))
= Φ(2.33 + (20000 − 21000)/(1500/√16))
≈ Φ(−0.34) ≈ 0.3669.
• Tests about a population mean --- case II: a normal population with unknown standard deviation ---
◦ Problem definition ---
In real cases, the standard deviation is usually unknown in hypothesis testing about the mean of a normal population, so the above testing procedure must be modified for more realistic applications.
◦ Reasoning of the solution ---
▪ Just as in inferring the confidence interval for this kind of population with an unknown standard deviation, the (Student's) t distribution, instead of the unit normal distribution, should be used here.
▪ The reasoning process is the same as for Case I above and is omitted; only the final results, analogous to those in Table 8.1, are listed in Table 8.2.
▪ The hypothesis testing here is usually called the t test.
▪ Note that unlike the case of the z test described previously, there is no closed-form formula for computing the type II error probability β here. An online calculator for computing β is available at the following website:
http://www.stat.uiowa.edu/%7Erlenth/Power/index.html. (W.A)
To use this calculator, select “one-sample t test (or paired t)” and click the “Run Selection” button to pop up a window; in the window, click the small gray squares at its right end to pop up dialogs, and fill in appropriate values of |μ0 − μ′| (denoted as “True |mu − mu_0|” in the window), σ (denoted as “sigma”), and n, where the value of σ must be guessed according to prior experience. Also, in the “Solve for” and “alpha” dialogs, select the choice of n and fill in the value of α. A value of “power,” meaning 1 − β(μ′), then appears automatically.
▪ Usually, a conservative (large) guessed value of σ will yield a conservative (large) value of β(μ′), and so a conservative estimate of the sample size n necessary for the prescribed α and β(μ′).
▪ There is no closed form for computing the sample size n for fixed values of α and β, either. An online calculator for computing n is available at the following website:
http://www.objectivedoe.com/student/Shared/SSCalculators/sample.php (W.B)
where the “standard deviation” is the σ above, the “difference to detect” is the value |μ0 − μ′|, the “confidence” is the value of 1 − α, and the “power” again is the value of 1 − β. Also, the “Test type” should be set to “Compare a Mean to a Standard.” The sample size generated after clicking the “Submit” button is the value n − 1. (A note: site (W.A) may also be used for the same purpose, but its results will differ slightly from those obtained at (W.B) due to computational accuracy.)
▪ For the above-mentioned computations of the value β for a fixed sample size n, as well as of the sample size n for a fixed β, certain graphs called “curves of β = P{type II error} for t tests,” “OC (operating-characteristic) curves for t tests,” etc., may also be used. Such graphs are not included here but may be found in a reference book like:
Edward R. Dougherty, Probability and Statistics for the Engineering, Computing and Physical Sciences, Prentice-Hall, Englewood Cliffs, NJ, USA, 1990.
• Example 8.13 ---
A company wants to buy a new building at a location not far from an old one. A shuttle bus is operated for 10 trial runs between the two buildings during working hours to test the travel time. Any new building that takes more than an average of 20 minutes to reach is excluded from consideration. A level 0.05 t test about the average bus travel time μ, with H0: μ = 20 (minutes) and Ha: μ > 20, was applied to decide whether the new building should be bought. However, the 10 trial shuttle bus runs yielded a sample mean value x̄ not large enough to reject H0: μ = 20, raising the risk of a type II error. Compute the probability β of this error under the assumptions that the real mean is μ′ = 25 and that the standard deviation is σ = 5 according to prior evidence.
Solution:
◦ The t test is upper-tailed and has α (type I error probability) = 0.05.
◦ Also, the real mean is μ′ = 25 with σ = 5.
◦ After visiting the above-mentioned website (W.A); filling in the values σ = 5, |μ′ − μ0| = |25 − 20| = 5, n = 10, and α = 0.05; selecting the choice of n in the “Solve for” dialog; and removing the default selection of “Two-sided” (since we are using a one-sided t test), a “power” value of 0.8975 appears automatically, which means 1 − β.
◦ Therefore, the value β may be computed as 1 − 0.8975 = 0.1025 ≈ 10%.
◦ This probability, the risk of not rejecting H0 when H0 is false, is too high (which means: the new building is probably too far away or in a traffic-busy area, so that the shuttle bus needs more than 20 minutes to travel between the two buildings; however, the 10 test runs of the shuttle bus have a probability of nearly 10% of failing to detect this fact).
◦ Therefore, the company wants to conduct more shuttle bus runs to lower this probability to the level of 5% (i.e., β = 0.05, or equivalently, power = 1 − β = 0.95).
◦ After visiting website (W.B) above and submitting the values of the “standard deviation” σ = 5, the “difference to detect” |μ0 − μ′| = |20 − 25| = 5, and the “power” 1 − β = 0.95, we get the result n − 1 = 14.
◦ And so the desired sample size is n = 15.
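As a cross-check of the calculator result above, the power of a one-sample t test can also be computed directly: under μ = μ′, the statistic T = (X̄ − μ0)/(S/√n) follows a noncentral t distribution with n − 1 degrees of freedom and noncentrality parameter δ = (μ′ − μ0)/(σ/√n). The following minimal Python sketch (our own illustration, assuming SciPy is available) reproduces the power value of Example 8.13.

    from scipy import stats

    def t_test_power(mu0, mu_prime, sigma, n, alpha):
        """Power (1 - beta) of an upper-tailed one-sample t test of H0: mu = mu0."""
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha, df)            # rejection threshold t_{alpha, n-1}
        delta = (mu_prime - mu0) / (sigma / n ** 0.5)  # noncentrality parameter
        # Under mu = mu', T is noncentral t; beta is the mass below the threshold.
        return 1 - stats.nct.cdf(t_crit, df, delta)

    # Example 8.13: mu0 = 20, mu' = 25, sigma = 5, n = 10, alpha = 0.05.
    print(t_test_power(20, 25, 5, 10, 0.05))   # about 0.897, so beta is about 0.10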
Table 8.2 Level α t test of the mean of a normal population with unknown standard deviation.
Null hypothesis: H0: μ = μ0. Test statistic: T = (X̄ − μ0)/(S/√n).
◦ Upper-tailed test --- alternative hypothesis Ha: μ > μ0; rejection region t ≥ tα,n−1; type I error prob. α.
◦ Lower-tailed test --- Ha: μ < μ0; rejection region t ≤ −tα,n−1; type I error prob. α.
◦ Two-tailed test --- Ha: μ ≠ μ0; rejection region t ≥ tα/2,n−1 or t ≤ −tα/2,n−1; type I error prob. α.
◦ Type II error prob. β: no closed-form solution for any of the three tests; use the online calculators at the websites (W.A) http://www.stat.uiowa.edu/%7Erlenth/Power/index.html and (W.B) http://www.objectivedoe.com/student/Shared/SSCalculators/sample.php.
• Tests about a population mean --- case III: use of large samples ---
◦ Idea ---
When the sample size is large (n > 30), the z test of Case I may be easily modified to yield a test procedure requiring neither a normal population distribution nor a known σ, just like the corresponding large-sample case of confidence interval inference.
◦ Reasoning of the solution ---
▪ When n → ∞ (n > 30 in practice), the sample standard deviation S → σ, the real standard deviation of the population distribution, as mentioned previously when we discussed confidence interval inference for large samples of any population.
▪ Therefore, simply changing the test statistic Z = (X̄ − μ0)/(σ/√n) of Case I to Z = (X̄ − μ0)/(S/√n) suffices for use here.
▪ All the details of the previously discussed z tests then apply, as long as every σ in Table 8.1 is replaced by the sample standard deviation value s.
• Example 8.14 ---
An automobile company recommends that any purchaser of one of its new cars bring it in to a dealer for a 3000-mile checkup. The company wishes to know whether the true average mileage at initial servicing differs from 3000. A random sample of 50 recent purchasers resulted in a sample average mileage of 3208 miles and a sample standard deviation of 273 miles. Do the data strongly suggest that the true average mileage for this checkup is something other than the recommended value? State and test the relevant hypotheses using the significance level 0.01.
Solution:
◦ Let μ = the true average mileage of cars brought to the dealer for the 3000-mile checkup.
◦ Hypotheses: H0: μ = 3000; Ha: μ ≠ 3000.
◦ A large-sample two-tailed z test may be used here because the sample size n = 50 > 30.
◦ Test statistic value: z = (x̄ − μ0)/(s/√n) = (x̄ − 3000)/(s/√n).
◦ Since α = 0.01, zα/2 = z0.005 = 2.58. This means that for the desired two-tailed z test, H0 should be rejected if z ≥ 2.58 or z ≤ −2.58.
◦ Now, z = (x̄ − 3000)/(s/√n) = (3208 − 3000)/(273/√50) ≈ 5.39 > 2.58, so H0 is rejected.
◦ This means that at the significance level α = 0.01, the data do strongly suggest that the true average initial-checkup mileage differs from the manufacturer's recommended value of 3000.
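A minimal Python sketch (our own illustration, assuming SciPy is available) of this large-sample two-tailed z test follows; the two-sided P-value line is an extra convenience not used in the text.

    from scipy.stats import norm

    # Example 8.14: large-sample two-tailed z test with sigma replaced by s.
    mu0, x_bar, s, n, alpha = 3000, 3208, 273, 50, 0.01

    z = (x_bar - mu0) / (s / n ** 0.5)       # about 5.39
    z_half = norm.ppf(1 - alpha / 2)         # z_{alpha/2} = z_0.005, about 2.58
    print(z, z >= z_half or z <= -z_half)    # True -> reject H0
    print(2 * (1 - norm.cdf(abs(z))))        # two-sided P-value, far below 0.01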
◦ Determination of β and of the sample size n --- may be conducted as in either of the above two cases (Case I and Case II).
• Tests about a population proportion --- case I: use of large samples ---
◦ Reasoning ---
▪ Just like the reasoning used to derive the test statistics for the previous cases of inferring the population mean, according to (8.16), repeated in the following:
P{−zα/2 ≤ (p̂ − p)/√(p(1 − p)/n) ≤ zα/2} ≈ 1 − α, (8.16)
where p and √(p(1 − p)/n) are the mean and standard deviation of the sample proportion p̂, we may take the following test statistic for the current case of inferring the population proportion:
Z0 = (p̂ − p0)/√(p0(1 − p0)/n)
where p0 is the population proportion value to be tested.
▪ Omitting the detailed reasoning process, which parallels that for the cases of testing the population mean, we present the results as a summary in Table 8.3 for direct use in applications.
▪ A rule of thumb for choosing n so that the test is valid is: np0 ≥ 5 and n(1 − p0) ≥ 5.
▪ Formulas for computing the β value for a specific p = p′ and the sample size n for fixed α and β are also listed in Table 8.3. The derivations of these formulas are like those for the corresponding contents of Table 8.1 and so are omitted.
• Example 8.15 ---
Suppose that 1000 voters were interviewed about whom they voted for as city mayor. Of the 1000 voters, 550 reported that they voted for the democratic candidate. Is there sufficient evidence to suggest that the democratic candidate will win the election, at the 0.01 level?
Solution:
◦ To win the election, the proportion of votes for the democratic candidate should be larger than p = 0.5.
◦ Set up the test as H0: p = p0 = 0.5; Ha: p > p0 = 0.5, so that rejecting H0 means that the democratic candidate wins.
◦ The significance level of the test is α = 0.01, so that zα = z0.01 = 2.33. The estimated proportion is p̂ = 550/1000 = 0.55.
◦ n = 1000 is large (with np0 = n(1 − p0) = 500 ≥ 5), so we can use the large-sample approach discussed above.
◦ The test statistic value is:
z0 = (p̂ − p0)/√(p0(1 − p0)/n) = (0.55 − 0.5)/√(0.5(1 − 0.5)/1000) ≈ 3.16.
◦ Since z0 = 3.16 > 2.33 = zα, we reject H0, which means that the democratic candidate is predicted to win the election at the significance level α = 0.01.
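The following minimal Python sketch (our own illustration, assuming SciPy is available) reproduces this large-sample proportion test.

    from scipy.stats import norm

    # Example 8.15: upper-tailed large-sample test of H0: p = 0.5 vs Ha: p > 0.5.
    p0, n, successes, alpha = 0.5, 1000, 550, 0.01

    p_hat = successes / n
    z0 = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # about 3.16
    z_alpha = norm.ppf(1 - alpha)                    # about 2.33
    print(z0, z0 >= z_alpha)                         # True -> reject H0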
Table 8.3 Large-sample level α z test of a population proportion.
Null hypothesis: H0: p = p0. Test statistic: Z = (p̂ − p0)/√(p0(1 − p0)/n).
◦ Upper-tailed test --- alternative hypothesis Ha: p > p0; rejection region z ≥ zα; type I error prob. α;
type II error prob. β(p′) = Φ[(p0 − p′ + zα√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
sample size n for fixed α and β: n = [(zα√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
◦ Lower-tailed test --- Ha: p < p0; rejection region z ≤ −zα; type I error prob. α;
β(p′) = 1 − Φ[(p0 − p′ − zα√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
n = [(zα√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
◦ Two-tailed test --- Ha: p ≠ p0; rejection region z ≥ zα/2 or z ≤ −zα/2; type I error prob. α;
β(p′) = Φ[(p0 − p′ + zα/2√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)] − Φ[(p0 − p′ − zα/2√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
n = [(zα/2√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
• Tests about a population proportion --- case II: use of small samples ---
◦ Reasoning ---
▪ When the sample size is small, the test may be based directly on the binomial distribution rather than the normal distribution, as discussed in the following.
▪ Again, let p denote the proportion of “successes” in a population.
▪ Given a random sample X1, X2, ..., Xn of size n arising from the population, then as discussed previously, the sample total To is a binomial random variable specifying the number of successes in the sample, with parameters (n, p), the samples X1, X2, ..., Xn all being Bernoulli random variables.
▪ From Fact 4.7 (with X = To a binomial random variable), we have E[To] = np and Var(To) = np(1 − p); and from Example 8.2, an unbiased estimator of p is the sample proportion p̂ = To/n.
▪ Let H0: p = p0 and Ha: p > p0, where p0 is a specific proportion value to be tested; then a feasible test statistic for these hypotheses is just the sample total
To = X1 + X2 + … + Xn
with rejection region to > c, where to is the sample total value and c is a pre-selected constant.
▪ That is, we reject H0 if to > c, and accept H0 if to ≤ c.
▪ To compute the type I and II errors, we first define a power function as follows:
π(p) = P{To > c | p} = 1 − P{To ≤ c | p} = 1 − B(c; n, p)
where, as defined previously,
B(c; n, p) = Σ_{i=0}^{c} C(n, i) p^i (1 − p)^{n−i},
whose values may be found at the following website, which has been mentioned previously:
http://webcache.googleusercontent.com/search?q=cache:kHbpn3dKAIYJ:www.statisticshowto.com/tables/binomial-distribution-table/+binomial+distribution+table&cd=10&hl=zh-TW&ct=clnk&gl=tw. (W.C)
▪ Then the type I error probability, which is also the significance level α of the test, is
α = π(p0) = P{To > c | p0} = 1 − B(c; n, p0). (8.23)
▪ And the type II error probability β(p′) for a specific p′ > p0 is
β(p′) = P{To ≤ c | p′} = B(c; n, p′). (8.24)
▪ It is, as usual, desired to find a value of c such that α equals a popularly used value such as 0.05 or 0.01, for various n and p0. This requires the use of a binomial distribution table.
▪ However, since To is discrete, it will not always be possible to construct a test with a prescribed significance level α exactly.
▪ For example, if n = 10, p0 = 1/2, and the desired α = 0.05, then from (8.23) we need B(c; n, p0) = 0.95; and from the binomial distribution table mentioned above, we get c ≈ 7, with the real α value being 1 − 0.945 = 0.055 instead of 0.05. If n = 20 with the other parameters unchanged, then c ≈ 13 with real α = 1 − 0.942 = 0.058; and if n = 25, c ≈ 16 with real α = 1 − 0.946 = 0.054.
▪ The above discussions are about the upper-tailed test; the lower-tailed and two-tailed tests can be derived similarly (the details are omitted). A summary is given in Table 8.4.
▪ For more discussions, see the following reference:
Jay L. Devore, Probability and Statistics for the Engineering and the Sciences, 2nd ed., Brooks/Cole Publishing Co., Monterey, CA, USA, 1987.
• Example 8.16 ---
Assume that the natural recovery rate p0 for a certain disease is 40%. A new drug is developed, and to test the manufacturer's claim that it boosts the recovery rate, a random sample of n = 10 patients is given the drug and the total number to of recoveries is recorded. Find the “cutoff value” c that produces a test of H0: p = p0 against Ha: p > p0 at an approximate significance level of α = 0.05, and compute the corresponding type II error probability β for p′ = 0.7.
Solution:
◦ According to (8.23) or Table 8.4, the test with the rejection region described by to > c has a significance level of 0.05 when c satisfies
0.05 = α = 1 − B(c; n, p0) = 1 − B(c; 10, 0.4).
◦ From the binomial distribution table given at website (W.C), we get
B(6; 10, 0.4) = 0.945 < 0.95 < 0.988 = B(7; 10, 0.4).
◦ That is, the choice c = 6 yields α = 1 − 0.945 = 0.055, and the choice c = 7 yields α = 1 − 0.988 = 0.012.
◦ Therefore, for α to be 0.05 or smaller, we have to choose c = 7.
◦ And according to (8.24), the corresponding type II error probability for p′ = 0.7 is
β(p′) = β(0.7) = P{To ≤ 7 | p′ = 0.7} = B(7; 10, 0.7) = 0.617,
which is quite large because of the small sample size n = 10.
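The table lookups in this example are easy to reproduce with the binomial CDF; below is a minimal Python sketch (our own illustration, assuming SciPy is available) that finds the cutoff c and the resulting α and β.

    from scipy.stats import binom

    # Example 8.16: small-sample test of H0: p = 0.4 vs Ha: p > 0.4 with n = 10.
    n, p0, p_prime, alpha_target = 10, 0.4, 0.7, 0.05

    # Smallest cutoff c whose actual significance level does not exceed the target.
    c = next(c for c in range(n + 1) if 1 - binom.cdf(c, n, p0) <= alpha_target)
    alpha = 1 - binom.cdf(c, n, p0)    # actual type I error probability
    beta = binom.cdf(c, n, p_prime)    # type II error probability B(c; n, p')
    print(c, alpha, beta)              # 7, about 0.012, about 0.617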
Table 8.4 Small-sample test of a population proportion.
Null hypothesis: H0: p = p0. Test statistic: To = X1 + X2 + … + Xn.
◦ Upper-tailed test --- alternative hypothesis Ha: p > p0; rejection region to > c;
type I error prob.: P{To > c | p0} = 1 − B(c; n, p0);
type II error prob. β(p′): P{To ≤ c | p′} = B(c; n, p′).
◦ Lower-tailed test --- Ha: p < p0; rejection region to ≤ c;
type I error prob.: P{To ≤ c | p0} = B(c; n, p0);
type II error prob.: P{To > c | p′} = 1 − B(c; n, p′).
◦ Two-tailed test --- Ha: p ≠ p0; rejection region to > b or to ≤ a;
type I error prob.: 1 − B(b; n, p0) + B(a; n, p0);
type II error prob.: P{a < To ≤ b | p′} = B(b; n, p′) − B(a; n, p′).
• Bayes test for two-class pattern classification ---
◦ Idea ---
▪ In many applications, we encounter the following hypothesis testing problem for two-class pattern classification:
H0: x ∈ ω0; Ha: x ∈ ωa
where ω0 and ωa are two classes of patterns (like those described in Examples 3.6 through 3.8).
▪ The following example illustrates how this problem can be formulated as a hypothesis testing problem.
• Example 8.17 (Example 3.8 revisited) ---
In a class of 20 female and 80 male students, 4 of the female students and 8 of the male students wear glasses. Now, if a student is observed to wear glasses, how will you decide the sex of the student? Solve the problem from the viewpoint of hypothesis testing.
Solution:
◦ Regard the feature of “wearing glasses” as a random variable X described as follows:
X = 1 if an observed student wears glasses; X = 0 otherwise.
◦ Let the sex be denoted as a parameter Θ with two values as follows:
Θ = θ1 = 1 for the male students and Θ = θ2 = 0 for the female students.
◦ Based on the given data, the random variable X has a conditional pmf pX|Θ(x|θ) = P{X = x | Θ = θ} with its values given by:
pX|Θ(1|1) = P{X = 1 | θ1} = 8/80; pX|Θ(0|1) = P{X = 0 | θ1} = 72/80;
pX|Θ(1|0) = P{X = 1 | θ2} = 4/20; pX|Θ(0|0) = P{X = 0 | θ2} = 16/20.
◦ We also have the following a priori (class) probabilities pΘ for Θ:
pΘ(1) = 80/(20 + 80) = 80/100; pΘ(0) = 20/100.
◦ Let ω0 = the group of male students and ωa = the group of female students (so that ω0 corresponds to θ1 and ωa to θ2).
◦ The pattern classification problem now can be formulated as a hypothesis testing problem as follows:
H0: x ∈ ω0; Ha: x ∈ ωa;
or equivalently,
H0: Θ = θ1 = 1; Ha: Θ = θ2 = 0.
◦ The a posteriori probability of a value θ (a class) given an observed sample value x of X is the conditional pmf
pΘ|X(θ|x) = P{Θ = θ | X = x}.
A larger value of pΘ|X(θ|x) means that x is more likely to come from the class with the parameter θ.
◦ In terms of the notations defined above, the Bayes formula described by (3.4) says:
pΘ|X(θ|x) = pX(x|θ)pΘ(θ)/[pX(x|θ)pΘ(θ) + pX(x|θᶜ)pΘ(θᶜ)] (8.25)
where θᶜ is the complement of θ (e.g., if θ = θ1, then θᶜ = θ2 for the case here).
◦ The test statistic value in the Bayes sense is taken to be the ratio of the a posteriori probabilities:
b = pΘ|X(θ1|x)/pΘ|X(θ2|x) (8.26)
where x is the previously mentioned observed sample value of X.
◦ And the rejection region is described by b < 1, because if on the contrary b > 1, then pΘ|X(θ1|x) > pΘ|X(θ2|x), meaning that it is more likely that x comes from the class with the parameter θ1 (class ω0) than from the other class (class ωa).
◦ By the Bayes formula, the above test statistic may be transformed (with the details left as an exercise) into
b = [pX(x|θ1)pΘ(θ1)]/[pX(x|θ2)pΘ(θ2)] (8.27)
where the a priori (class) probabilities and the conditional pmf's are used.
◦ The rejection region then becomes the one which includes all the x satisfying
b = [pX(x|θ1)pΘ(θ1)]/[pX(x|θ2)pΘ(θ2)] < 1. (8.28)
◦ Now, the observed X value is x = 1 (wearing glasses), which, after being substituted into (8.27) above, leads to the following values:
pX(1|θ1)pΘ(θ1) = (8/80)(80/100) = 8/100;
pX(1|θ2)pΘ(θ2) = (4/20)(20/100) = 4/100;
and the following inequality:
b = pX(1|θ1)pΘ(θ1)/[pX(1|θ2)pΘ(θ2)] = (8/100)/(4/100) = 2 > 1,
which means that b for x = 1 is not in the rejection region.
◦ Therefore, H0 is not rejected, and so the decision is that “the student is male.”
◦ A note: when the tie b = pX(x|θ1)pΘ(θ1)/[pX(x|θ2)pΘ(θ2)] = 1 is encountered, the choice may be arbitrary.
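A minimal Python sketch (our own illustration; the names and data structures are ours) of this discrete Bayes test follows; it reproduces the decision of Example 8.17 for both feature values.

    # Example 8.17 as a discrete Bayes test: class 1 = male (theta1), class 2 = female (theta2).
    prior = {1: 80 / 100, 2: 20 / 100}              # a priori class probabilities
    pmf = {1: {1: 8 / 80, 0: 72 / 80},              # p(x | theta1): glasses or not
           2: {1: 4 / 20, 0: 16 / 20}}              # p(x | theta2)

    def bayes_decide(x):
        """Return the class whose likelihood-times-prior product is larger."""
        b = (pmf[1][x] * prior[1]) / (pmf[2][x] * prior[2])  # test statistic of (8.27)
        return 1 if b >= 1 else 2                            # b < 1 is the rejection region

    print(bayes_decide(1))   # 1 -> "male" (b = 2, H0 not rejected)
    print(bayes_decide(0))   # 1 -> "male" (b = (72/80 * 0.8)/(16/20 * 0.2) = 4.5)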
• Formal description of the Bayes test for two-class pattern classification ---
◦ The above example may be generalized to formulate the Bayes test, a type of hypothesis testing based on the Bayes formula, for a two-class pattern classification problem.
◦ Let X1, X2, ..., Xn be n random variables, called features, for describing patterns (not necessarily iid), with parameter Θ, which is also a random variable, taking two discrete values θ1 and θ2 representing the two classes of patterns, ω1 and ω2 (corresponding to the ω0 and ωa in the above example).
◦ Let the random vector X denote the n feature random variables, notationally defined as X = [X1, X2, …, Xn]ᵗ with t meaning vector transpose.
◦ Then the bold notation x means a sample vector x = [x1, x2, …, xn]ᵗ of X.
◦ Let the pmf of Θ be denoted as p(θ), and let the conditional pdf (or pmf) of X given Θ = θ be denoted as
f(x|θ)
(note: unlike in the last example, we drop the subscripts of p and f because no ambiguity will arise here).
◦ The hypotheses are taken to be:
H0: Θ = θ1 (or x ∈ ω1); Ha: Θ = θ2 (or x ∈ ω2).
◦ The Bayes test statistic, according to (8.27) above, is:
B = [f(X|θ1)p(θ1)]/[f(X|θ2)p(θ2)] (8.29)
(note: the first step of using the a posteriori probabilities as in the last example is skipped here).
◦ And the rejection region, according to (8.28), now includes all x satisfying:
b = [f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)] < 1. (8.30)
◦ In the pattern recognition area, the above rejection region means equivalently the following Bayes decision rule:
if f(x|θ1)p(θ1) > f(x|θ2)p(θ2), then classify x as from ω1 (Ha rejected); otherwise, as from ω2 (H0 rejected).
◦ The equation
[f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)] = 1,
or equivalently,
f(x|θ1)p(θ1) = f(x|θ2)p(θ2), (8.31)
is called the decision boundary of the classifier, which divides the pattern feature space into two mutually exclusive regions, one of which is the rejection region.
◦ Notes ---
▪ The above test is called the Bayes test with the minimum error probability (or simply with the minimum error), in contrast to another Bayes test, the one with the minimum cost, which is beyond our discussion here (it is taught in detail in courses on pattern recognition or mathematical statistics).
▪ There are also several other similar tests, like the likelihood-ratio test, the Neyman-Pearson test, the minimax test, etc. For more discussions on these tests, the corresponding type I and II errors, and other related topics, see the following reference book, for example:
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, San Diego, CA, USA, 1990.
• Example 8.18 (Bayes test for two-class pattern classification) ---
You are given two classes of patterns ω1 and ω2 which have equal a priori class probabilities 1/2 and are described by two normal random variables X1 and X2 with equal variance σ² but different means μ1 and μ2, respectively. Now, given an observed (sample) value x, use the Bayes test with the minimum error to classify x (i.e., to see whether x comes from X1 with parameters (μ1, σ²) or from X2 with parameters (μ2, σ²)). Derive the test statistic B. If the identical variance is σ² = 4 and the means are μ1 = 0 and μ2 = 2, derive the rejection region, the decision boundary (simplified as much as possible), and the classifier (described as a rule) for the two pattern classes.
Solution:
◦ Here, the hypotheses are: H0: Θ = θ1 (x ∈ ω1) and Ha: Θ = θ2 (x ∈ ω2).
◦ The conditional normal pdf's for the two classes ω1 and ω2 are:
f(x|θ1) = (1/(√(2π)σ)) e^{−(x − μ1)²/(2σ²)}, −∞ < x < ∞;
f(x|θ2) = (1/(√(2π)σ)) e^{−(x − μ2)²/(2σ²)}, −∞ < x < ∞.
◦ The a priori (class) probabilities of the two classes are p(θ1) = p(ω1) ≡ p1 = 1/2 and p(θ2) = p(ω2) ≡ p2 = 1/2.
◦ The Bayes test statistic value is
b = [f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)]
= [(1/(√(2π)σ)) e^{−(x − μ1)²/(2σ²)} p1] / [(1/(√(2π)σ)) e^{−(x − μ2)²/(2σ²)} p2]
= (p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)}.
◦ The rejection region is described by
b = (p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)} < 1.
◦ The decision boundary is described by
(p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)} = 1,
which may be simplified by taking the natural logarithm ln to be
ln(p1/p2) + [−(x − μ1)² + (x − μ2)²]/(2σ²) = 0,
or equivalently,
(μ2 − μ1)x/σ² + ½(μ1² − μ2²)/σ² = ln(p1/p2).
◦ With the given values p1 = p2 = 1/2, σ² = 4, μ1 = 0, and μ2 = 2 substituted into the above formula, the decision boundary equation becomes
(2 − 0)x/4 + ½(0² − 2²)/4 = x/2 − 1/2 = ln(0.5/0.5) = ln 1 = 0,
or equivalently,
x = 1,
which is just the middle point between the two means μ1 = 0 and μ2 = 2 (a property which is always true for two 1-dimensional normally distributed feature classes with equal variances and equal a priori probabilities, as can be proved; do this by yourself).
◦ The above result also means that the rejection region is simplified to include those x which satisfy x > 1.
◦ Therefore, the pattern classifier, in rule form, is:
“if x < 1, then x ∈ ω1; otherwise, x ∈ ω2”
(where “x ∈ ω” means “assign or classify x as from class ω”).
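To close, here is a minimal Python sketch (our own illustration, assuming SciPy is available) of this minimum-error Bayes classifier for the numbers of Example 8.18; it confirms the decision boundary at x = 1.

    from scipy.stats import norm

    # Example 8.18: equal priors, equal variance sigma^2 = 4, means mu1 = 0 and mu2 = 2.
    p1 = p2 = 0.5
    mu1, mu2, sigma = 0.0, 2.0, 2.0

    def classify(x):
        """Bayes decision rule: pick the class with the larger f(x|theta) * p(theta)."""
        g1 = norm.pdf(x, mu1, sigma) * p1
        g2 = norm.pdf(x, mu2, sigma) * p2
        return 1 if g1 >= g2 else 2

    print([classify(x) for x in (-1.0, 0.5, 0.99, 1.01, 3.0)])   # [1, 1, 1, 2, 2]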