Chapter 8
Parametric Estimation
Contents
8.1 Types of Parametric Estimators
8.2 Point Estimation
8.3 Inference of Confidence Intervals for Parameters
8.4 Hypothesis Testing
8.1 Types of Parametric Estimators
• Introduction ---
 • Statistical inference (統計推論) --- Statistical inference is the process of using observed data of a random phenomenon to draw conclusions about the distributions of the random variables which model the phenomenon.
 • Parametric estimation (參數估測) --- Parametric estimation is a type of statistical inference to determine or make decisions about the parameters which characterize the distributions of the random variables modeling a concerned random phenomenon.
• Types of parametric estimation (all using observed data to achieve different purposes) ---
 • Point estimation --- estimation of the parameter value(s) of a distribution, such as the mean and variance of the random variable which has the distribution.
 • Interval estimation --- estimation of an interval which may be believed, to some degree of confidence, to contain a parameter of interest.
 • Hypothesis testing --- decision making to accept or reject a claim, called a hypothesis, regarding a parameter of interest.
8.2 Point Estimation
• Concepts and review ---
 • Point estimation means to estimate the parameter(s) of a certain distribution of a random variable from a set of observed data of the distribution according to a certain estimation rule.
 • Every observed data item of the distribution is itself random in nature, and so may be regarded as a random variable, as mentioned in the last chapter where we had the following definition of random sample:
   A random sample of size n arising from a certain random variable X is a collection of n independent random variables X1, X2, …, Xn such that ∀ i = 1, 2, …, n, Xi is identically distributed with X, meaning that every Xi has the same pmf or pdf as that of X. Each Xi is called a sample variable, X is called the population random variable, and the mean and variance of X are called the population mean and variance, respectively.
 • We have mentioned several point estimators in the last chapter, such as the sample mean, sample variance, and sample covariance.
 • Here, we investigate point estimators in more formal and systematic ways.
• Definition of estimator ---
 • Definition 8.1 --- An estimator of a parameter θ of a random variable X is a function θ̂ = θ̂(X1, X2, …, Xn) of a random sample X1, X2, …, Xn arising from X, so that for a particular set of sample values, say, x1, x2, …, xn, evaluating θ̂ at these values produces a point estimate of θ, namely, θ̂(x1, x2, …, xn).
   (Note: here point estimate, or simply, estimate, is used as an undefined term.)
• More types of mean estimators ---
  The sample mean X̄ of a random sample X1, X2, …, Xn of size n arising from X defined in the last chapter is just one of the many possible estimators of the population mean. Others include the following.
 • X̄_l = linear estimator of the population mean --- a weighted sum of all the sample variables, X̄_l = a1X1 + a2X2 + … + anXn, where Σ_{i=1}^{n} ai = 1; e.g., X̄_l = 0.35X1 + 0.1X2 + 0.1X3 + 0.1X4 + 0.35X5 for n = 5.
 • X̃ = sample median --- the median of the sample variables whose value, as implied by Definition 4.6, is the middle of the sample values after they are reordered by magnitude (if n is even, then it is the average of the middle two magnitude-reordered sample values).
 • X̄_e = sample midrange --- the average of the maximum and the minimum of X1, X2, ..., Xn, i.e., X̄_e = [max_i(Xi) + min_i(Xi)]/2.
 • X̄_tr(r) = sample r% trimmed mean --- the mean obtained from discarding the largest r% and the smallest r% (say, 10%) of the sample variables and then taking the average of the remaining ones.
• Example 8.1 (various types of estimators) ---
  As a continuation of Example 7.4, where three random samples were given with the first one being (8.04, 8.02, 8.07, 7.99, 8.03) and the corresponding sample mean value being computed to be x̄1 = (8.04 + 8.02 + 8.07 + 7.99 + 8.03)/5 = 8.03, we now want to compute the estimates of the above-mentioned four different types of estimators for the first random sample (8.04, 8.02, 8.07, 7.99, 8.03).
  Solution:
 • Value of the linear estimator of the mean --- x̄_l = 0.35x1 + 0.1x2 + 0.1x3 + 0.1x4 + 0.35x5 = 0.35×8.04 + 0.1×(8.02 + 8.07 + 7.99) + 0.35×8.03 = 8.0325.
 • Sample median value --- x̃ = 8.03 because the sample values, after being reordered according to their magnitudes, are 7.99, 8.02, 8.03, 8.04, 8.07.
 • Sample midrange value --- x̄_e = (8.07 + 7.99)/2 = 8.03 because the two extreme values of the sample values are 7.99 and 8.07.
 • Sample r% trimmed mean value with r = 20 --- x̄_tr(20) = (8.04 + 8.02 + 8.03)/3 = 8.03 after the two 20% extreme values 7.99 and 8.07 are "trimmed."
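• A numerical check --- the following short Python sketch (NumPy/SciPy assumed available; all variable names are illustrative, not part of the original notes) reproduces the four estimate values computed above:

    import numpy as np
    from scipy import stats

    x = np.array([8.04, 8.02, 8.07, 7.99, 8.03])

    # Linear estimator with weights a = (0.35, 0.1, 0.1, 0.1, 0.35), which sum to 1.
    a = np.array([0.35, 0.1, 0.1, 0.1, 0.35])
    x_linear = a @ x                        # 8.0325

    x_median = np.median(x)                 # 8.03
    x_midrange = (x.max() + x.min()) / 2    # 8.03
    x_trim20 = stats.trim_mean(x, 0.20)     # trims 20% from each tail; 8.03

    print(x_linear, x_median, x_midrange, x_trim20)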
• Notes ---
 • We use capital letters for estimators and lower-case letters for estimate values in the above example and hereafter.
 • Also, we refer to an estimate value of an estimator by the term "estimator value," e.g., the estimate value, sample midrange value, for the estimator, sample midrange, and so on.
 • The sample mean is just a special case of the linear estimator of the population mean mentioned in the last example, which is defined more formally in the following.
• Definition 8.2 (linear estimator of the population mean) ---
  Given a random sample X1, X2, …, Xn arising from a population random variable X with mean μ, a linear estimator of μ is defined as
    X̄_l = a1X1 + a2X2 + … + anXn.
• A note --- by the above definition, the sample mean X̄ = (X1 + X2 + … + Xn)/n is a linear estimator of the population mean with all ai = 1/n.
• Which estimator is better? ---
 • Idea ---
  • It is desired to have the "best" estimator for a certain parameter θ of a population random variable.
  • This requires a suitable measure of the goodness of each parameter estimator θ̂.
  • A suitable measure for this purpose is the error |θ̂ − θ| or the square of it.
  • However, this measure is not useful for computing the best estimator.
  • A substitute is the mean value of the square of it, namely, E[(θ̂ − θ)²], which we call the mean square error (MSE) of θ̂.
  • Since θ is a constant value, the MSE may be transformed in the following way into a more meaningful form:
      E[(θ̂ − θ)²] = E[θ̂² − 2θθ̂ + θ²]
                  = E[θ̂²] − 2θE[θ̂] + θ²
                  = E[θ̂²] − (E[θ̂])² + (E[θ̂])² − 2θE[θ̂] + θ²
                  = [E[θ̂²] − (E[θ̂])²] + [(E[θ̂])² − 2θE[θ̂] + θ²]
                  = Var(θ̂) + (E[θ̂] − θ)².    (8.1)
  • Accordingly, if Var(θ̂) = 0 and E[θ̂] = θ, then the MSE E[(θ̂ − θ)²] attains the minimum value, zero.
  • Therefore, we have the following definitions.
• Definition 8.3 (better and best estimators) ---
  For a parameter θ, an estimator θ̂1 is said to be better, or more efficient, than another θ̂2 if E[(θ̂1 − θ)²] < E[(θ̂2 − θ)²]. The estimator which is better than any other is called the best estimator for θ, which we denote as θ̂*.
• Definition 8.4 (unbiased estimator) ---
  An estimator θ̂ for a parameter θ is said to be unbiased if E[θ̂] = θ; otherwise, it is said to be biased with the amount E[θ̂] − θ called the bias of it.
• A note and an illustration of the meaning of unbiasedness ---
 • An unbiased estimator θ̂ means that, though the estimate values are not always equal to the exact value of θ, they fall around θ with θ as the center. Therefore, unbiasedness is a good property for an estimator.
 • See Fig. 8.1 for an illustration of unbiasedness where θ̂1 is unbiased while θ̂2 is biased.
   [Figure 8.1: pdfs of θ̂1 and θ̂2 around θ, with the bias of θ̂2 marked.]
   Fig. 8.1 Illustration of unbiasedness of estimators --- θ̂1 is unbiased while θ̂2 is biased.
• Linear unbiased estimator ---
 • A linear estimator X̄_l = a1X1 + a2X2 + … + anXn of the population mean as defined in Definition 8.2 need not be unbiased. For it to be so, the coefficients a1 through an must satisfy a1 + a2 + … + an = 1 because, by Fact 7.2,
     E[X̄_l] = E[a1X1 + a2X2 + … + anXn]
            = a1E[X1] + a2E[X2] + … + anE[Xn]
            = a1μ + a2μ + … + anμ
            = (a1 + a2 + … + an)μ
            = 1·μ
            = μ.
 • A linear estimator of the population mean with a1 + a2 + … + an = 1 will be called a linear unbiased estimator hereafter.
• Facts about unbiased estimators of some population parameters ---
 • Fact 8.1 ---
   The sample mean X̄ and the sample variance S² of a random sample arising from a population random variable X with mean μ and variance σ² are unbiased estimators of μ and σ², respectively.
   Proof: immediate from Facts 7.4 and 7.13, which say respectively that E[X̄] = μ and E[S²] = σ².
 • Fact 8.2 ---
   If the population distribution is continuous and symmetric, the sample median X̃ and any sample trimmed mean X̄_tr(r) of a random sample arising from a population random variable X with mean μ are unbiased estimators of μ.
   Proof: omitted; see http://en.wikipedia.org/wiki/Efficiency_(statistics) for a reference.
• Example 8.2 ---
  Given a binomial random variable X with parameters (n, p), which specifies the number of successes in n trials, prove that the estimator p̂ = X/n, called the sample proportion, is an unbiased estimator of p.
  Proof: easy from Fact 4.7, which says E[X] = np, because then
    E[p̂] = E[X/n] = E[X]/n = np/n = p.
• Notes ---
 • Compare the result of this simple example with that of Example 7.14.
 • From Facts 8.1 and 8.2, we see that the unbiased estimator for a parameter, like the population mean, is not unique; therefore a further criterion is needed, which is the concept of minimum variance of the estimator, as defined next.
• Definition 8.5 (minimum-variance estimator) ---
  A minimum-variance estimator θ̂ for a parameter θ has a smaller variance Var(θ̂) than that of any other estimator for θ.
• A note and an illustration of the meaning of minimum variance ---
 • See Fig. 8.2 for an illustration of the meaning of the variance of the estimator, where though both θ̂1 and θ̂2 are unbiased, θ̂1 has a smaller variance than θ̂2.
 • If an estimator is unbiased, then according to the analysis described by (8.1), if it also has the minimum variance, it will be the best estimator, as stated by the following fact.
  [Figure 8.2: pdfs of θ̂1 and θ̂2, both centered at θ.]
  Fig. 8.2 Illustration of variances of estimators --- θ̂1 has a smaller variance than θ̂2 though both are unbiased.
• Fact 8.3 (best estimator) ---
  An unbiased estimator θ̂ for a parameter θ with the minimum variance is the best estimator for θ.
  Proof: immediate from the above three definitions (Definitions 8.3 through 8.5) and the analysis described by (8.1).
• Discussions on the best estimator ---
 • In general, it is not always easy to find a best estimator for a specific parameter of a population distribution.
 • But under some constraints, such best parameter estimators may be found, as shown by the following facts and examples.
• Fact 8.4 (best linear unbiased estimator of the population mean) ---
  The sample mean is the best linear unbiased estimator of the population mean μ. (A simulation sketch illustrating this fact follows the proof.)
  Proof:
 • According to Definition 8.2 and the previous discussion, a linear unbiased estimator of μ is X̄_l = a1X1 + a2X2 + … + anXn where a1 + a2 + … + an = 1.
 • Therefore, the sample mean X̄ is a linear unbiased estimator of μ with all ai = 1/n, as mentioned previously.
 • Now, we only have to prove that X̄ has the minimum variance among all possible linear unbiased estimators.
 • The proof will be done for the case of n = 2; generalization to the case of any n is left as an exercise.
 • For n = 2, since a1 + a2 = 1, we have a2 = 1 − a1 and
     X̄_l = a1X1 + a2X2 = a1X1 + (1 − a1)X2.
 • By applying a derivation process like those for deriving Facts 7.9 through 7.11 (details left as an exercise), the variance of X̄_l may be computed to be
     Var(X̄_l) = a1²Var(X1) + (1 − a1)²Var(X2) = [a1² + (1 − a1)²]σ².
 • To minimize the above value, we differentiate it with respect to a1 to get
     d[Var(X̄_l)]/da1 = [2a1 − 2(1 − a1)]σ².
 • Setting this result equal to zero, we get a1 = 1/2, and so a2 = 1 − a1 = 1/2, too.
 • That is, for n = 2, the sample mean X̄ = (1/2)X1 + (1/2)X2 has the minimum variance compared with other linear unbiased estimators. Therefore, it is the best linear unbiased estimator.
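• A simulation sketch of Fact 8.4 --- the Python sketch below (with illustrative, assumed population parameters μ = 10 and σ = 2) compares the sample mean for n = 2 against another linear unbiased estimator; both are centered at μ, but the sample mean shows the smaller variance:

    import numpy as np

    rng = np.random.default_rng(0)

    # Many random samples of size n = 2 from a population with sigma^2 = 4.
    samples = rng.normal(loc=10.0, scale=2.0, size=(100_000, 2))

    # Two linear unbiased estimators of the mean: equal weights (the sample
    # mean) and skewed weights a1 = 0.8, a2 = 0.2 (still summing to 1).
    mean_est = samples.mean(axis=1)
    skew_est = 0.8 * samples[:, 0] + 0.2 * samples[:, 1]

    # Theoretical variances: [0.5^2 + 0.5^2]*4 = 2 versus [0.8^2 + 0.2^2]*4 = 2.72.
    print(mean_est.mean(), skew_est.mean())   # both close to 10
    print(mean_est.var(), skew_est.var())     # approx. 2 and 2.72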
• Fact 8.5 (best estimator of the mean of a normal population distribution) ---
  The sample mean is the best estimator of the mean μ of a normal population distribution.
  Proof:
 • First, we state without proof a theorem, called the Cramér-Rao inequality, which says: if θ̂ is an unbiased estimator of a parameter θ in the pdf f(x) of a random variable X, then
     Var(θ̂) ≥ 1 / { n E[(∂ ln f(x)/∂θ)²] }
   where n is the size of the random sample X1, X2, ..., Xn used in the estimator θ̂. For a proof, see the following reference:
     R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 5th Ed., Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1995.
 • Next, if X is normally distributed with mean μ and variance σ², its pdf is
     f(x) = [1/(√(2π) σ)] e^{−(x−μ)²/(2σ²)},  −∞ < x < ∞,
   so that
     ln f(x) = −ln(√(2π) σ) − (1/2)[(x − μ)/σ]².
 • Taking the partial derivative of the last equality with respect to μ yields
     ∂ ln f(x)/∂μ = (x − μ)/σ².
 • Therefore, by Proposition 5.3 and the definition of variance, we have
     E[(∂ ln f(x)/∂μ)²] = E[((x − μ)/σ²)²] = (1/σ⁴)E[(x − μ)²] = σ²/σ⁴ = 1/σ².
 • Substituting the above result into the above Cramér-Rao inequality, we get
     Var(μ̂) ≥ 1/(n/σ²) = σ²/n,
   where μ̂ is an unbiased estimator of the mean μ of X.
 • This means that any unbiased estimator μ̂ has a variance Var(μ̂) ≥ σ²/n.
 • However, the variance of the sample mean X̄, according to Fact 7.11, is exactly σ²/n.
 • This means that the sample mean has a variance not larger than that of any unbiased estimator μ̂ of μ.
 • Also, since E[X̄] = μ (see Fact 7.4), we know that X̄ is unbiased, too.
 • As a consequence, the sample mean X̄ by definition is the best estimator of the mean μ of the normal population random variable X. Done.
• Consistency of unbiased estimators ---
 • Idea ---
  • It is desirable that as the sample size n becomes larger and larger, an estimator θ̂ of a parameter θ will get closer and closer to the parameter θ with high probability.
  • The weak law of large numbers described in the last chapter and repeated in the following says that this statement is true for the sample mean X̄ as an estimator of μ:
      lim_{n→∞} P{|X̄ − μ| < ε} = 1  ∀ ε > 0.    (7.5)
  • Therefore, we have the following definition and fact.
• Definition 8.6 ---
  An estimator θ̂ of a parameter θ of a population distribution is said to be consistent if the following property is satisfied:
    lim_{n→∞} P{|θ̂ − θ| < ε} = 1  ∀ ε > 0.
• Fact 8.6 (consistency of the sample mean) ---
  The sample mean X̄ is a consistent estimator of the population mean μ.
  Proof: use the weak law of large numbers as discussed above.
• A note: it can also be shown that the sample variance S² is a consistent estimator of the population variance σ² (discussed later in this chapter).
• An illustration of the meaning of consistency ---
  See Fig. 8.3 for an illustration of the meaning of consistency of an estimator θ̂ of θ, where θ̂ → θ with high probability as n → ∞.
  [Figure 8.3: pdf of θ̂ for sample size n1 and pdf of θ̂ for a larger sample size n2 > n1, both centered at θ, the latter more concentrated.]
  Fig. 8.3 Consistency of estimators --- θ̂ → θ with high probability as n → ∞.
• Fact 8.7 (consistency of the sample variance of a normal population) ---
  The sample variance S² is a consistent estimator of the variance of a normal population.
  Proof:
 • In the proof of the weak law of large numbers, we obtained Chebyshev's inequality as:
     P{|W − μ| ≥ k} ≤ σ²/k²  ∀ k > 0,    (A)
   where W is a random variable with mean μ and variance σ².
 • Also, Facts 7.13 and 7.14 say that the mean and variance of the sample variance of a normal population respectively are
     E[S²] = σ²;  Var(S²) = 2σ⁴/(n − 1).
 • Taking W in (A) to be S², k to be ε, μ to be E[S²], and σ² to be Var(S²), we get
     P{|S² − σ²| ≥ ε} ≤ 2σ⁴/[(n − 1)ε²]  ∀ ε > 0,
   or equivalently,
     P{|S² − σ²| < ε} = 1 − P{|S² − σ²| ≥ ε} ≥ 1 − 2σ⁴/[(n − 1)ε²]  ∀ ε > 0,
   which reduces, as n → ∞, to
     P{|S² − σ²| < ε} = 1
   based on the fact that the largest probability value is 1.
 • That is, lim_{n→∞} P{|S² − σ²| < ε} = 1 ∀ ε > 0, which says that S² is a consistent estimator of σ² according to Definition 8.6. Done.
• A comment --- a careful check of the above proof of Fact 8.7 leads to the following general fact about the validity of consistency of an estimator.
• Fact 8.7a (condition for consistency) ---
  If θ̂ is an unbiased estimator of θ for which
    lim_{n→∞} Var(θ̂) = 0,
  then θ̂ is a consistent estimator of θ.
  Proof:
 • Since θ̂ is unbiased, we have E[θ̂] = θ by Definition 8.4.
 • Also, let σ_θ̂² denote the variance of θ̂, which approaches 0 as n → ∞ according to the given condition lim_{n→∞} Var(θ̂) = 0.
 • That is, θ̂ has mean θ and variance σ_θ̂².
 • Accordingly, as done in the proof of Fact 8.7, we can get the following Chebyshev's inequality for θ̂:
     P{|θ̂ − θ| ≥ ε} ≤ σ_θ̂²/ε²  ∀ ε > 0,
   or equivalently,
     P{|θ̂ − θ| < ε} = 1 − P{|θ̂ − θ| ≥ ε} ≥ 1 − σ_θ̂²/ε²  ∀ ε > 0,
   leading to
     lim_{n→∞} P{|θ̂ − θ| < ε} = 1
   based on the fact that σ_θ̂² → 0 as n → ∞ and the fact that the largest probability value is 1. Done.
• Principles for choosing point estimators ---
  The above discussions lead to the following reasonable principles for choosing parameter estimators.
 • Step 1 --- choose unbiased estimators first;
 • Step 2 --- then choose the unbiased estimators with the minimum variances;
 • Step 3 --- finally choose the consistent unbiased estimator with the minimum variance.
• Methods for point estimator design ---
 • Idea ---
  • The above discussions are about estimators for single parameters.
  • There are two major general methods for designing point estimators for any specific problem with multiple parameters, namely, (1) the method of moments and (2) maximum likelihood estimation, which we discuss in the following.
  • Recall: the kth moment of a random variable X is E[X^k] (called the kth population moment if X is a population random variable).
  • When no pmf or pdf is available, this moment must be estimated, leading to the following definition.
• Definition 8.7 (sample moment) ---
  Given a random sample X1, X2, ..., Xn of size n from a population random variable X, the kth sample moment is defined as (Σ_{i=1}^{n} Xi^k)/n, denoted as M_k (thus X̄ = M_1).
• Definition 8.8 (moment estimator) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a population random variable X whose pmf or pdf is f(x) with parameters θ1, θ2, …, θm, the moment estimators θ̂1, θ̂2, …, θ̂m are defined as those obtained in the following way: (1) equate the first m sample moments to the corresponding first m population moments; (2) solve for θ1, θ2, …, θm; and (3) use the solutions as θ̂1, θ̂2, …, θ̂m, respectively.
• Example 8.3 ---
  Let X1, X2, ..., Xn be a random sample from a gamma distribution with parameters (t, λ), where t > 0 and λ > 0, with pdf
    f(x) = λ e^{−λx} (λx)^{t−1}/Γ(t)  for x ≥ 0;
    f(x) = 0                          for x < 0.
  Find the moment estimators for t and λ.
  Solution:
 • From Fact 6.13, we know E[X] = t/λ and Var(X) = t/λ².
 • Since Var(X) = E[X²] − (E[X])² according to Proposition 5.2, we get the second population moment as E[X²] = Var(X) + (E[X])² = t/λ² + (t/λ)² = t(1 + t)/λ².
 • By Step (1) in Definition 8.8 --- equating M_1 to E[X] and M_2 to E[X²], we get the following equations:
     M_1 = t/λ;  M_2 = t(1 + t)/λ²
   which may be solved to get
     t = M_1²/(M_2 − M_1²);  λ = M_1/(M_2 − M_1²).
 • Therefore, the estimators for t and λ are respectively
     t̂ = M_1²/(M_2 − M_1²);  λ̂ = M_1/(M_2 − M_1²)
   where M_1 = (Σ_{i=1}^{n} Xi)/n and M_2 = (Σ_{i=1}^{n} Xi²)/n.
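• A numerical try-out --- the sketch below (with illustrative true parameters t = 3 and λ = 2; note that NumPy parameterizes the gamma distribution by scale = 1/λ) recovers the parameters from simulated data via the moment formulas just derived:

    import numpy as np

    rng = np.random.default_rng(1)
    t_true, lam_true = 3.0, 2.0
    x = rng.gamma(shape=t_true, scale=1.0 / lam_true, size=10_000)

    # First and second sample moments M1 and M2.
    m1 = np.mean(x)
    m2 = np.mean(x**2)

    # Moment estimates from t = M1^2/(M2 - M1^2) and lambda = M1/(M2 - M1^2).
    t_hat = m1**2 / (m2 - m1**2)
    lam_hat = m1 / (m2 - m1**2)
    print(t_hat, lam_hat)   # close to 3 and 2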
• Definition 8.9 (maximum likelihood estimator) ---
  Given random variables X1, X2, ..., Xn whose joint pmf or pdf is f(x1, x2, …, xn) with parameters θ1, θ2, …, θm, by regarding f as a likelihood function of θ1, θ2, …, θm and rewriting it as f(x1, x2, …, xn; θ1, θ2, …, θm), the maximum likelihood estimates θ̂1, θ̂2, …, θ̂m are defined as those values of the θi's that maximize the likelihood function, so that for all θ1, θ2, …, θm,
    f(x1, x2, …, xn; θ̂1, θ̂2, …, θ̂m) ≥ f(x1, x2, …, xn; θ1, θ2, …, θm).
  The maximum likelihood estimators are obtained by substituting the xi's in the maximum likelihood estimates with Xi's.
• Why does the maximum likelihood estimator work? ---
 • The likelihood function tells us how likely the observed sample values are under the distribution with the parameters to be estimated.
 • Maximizing the likelihood function value gives the parameter values for which the observed sample is most likely to have been generated.
• Example 8.4 ---
  Let X1, X2, ..., Xn be a random sample arising from a normal distribution with parameters (μ, σ²). Find the maximum likelihood estimators for μ and σ².
  Solution:
 • Since all Xi's are independent by the definition of random sample, applying Proposition 6.2 and the induction principle, we get the likelihood function for them as:
     f(x1, x2, …, xn; μ, σ²) = [1/√(2πσ²)] e^{−(1/(2σ²))(x1−μ)²} × … × [1/√(2πσ²)] e^{−(1/(2σ²))(xn−μ)²}
                             = [1/(2πσ²)]^{n/2} e^{−(1/(2σ²)) Σ_{i=1}^{n} (xi−μ)²}.
 • By taking the natural logarithm, the above equality becomes
     ln[f(x1, x2, …, xn; μ, σ²)] = −(n/2)ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xi − μ)².
 • Taking the partial derivatives of the above equality with respect to μ and σ², respectively, equating them to zero, and solving the resulting two equations, we get (details omitted)
     μ̂ = X̄,  σ̂² = Σ_{i=1}^{n} (Xi − X̄)²/n.
 • Note that σ̂² above is biased, which is different from the unbiased sample variance S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1) we had before.
 • Nevertheless, as n → ∞, σ̂² and S² become the same. And so, lim_{n→∞} E[σ̂²] = E[S²] = σ². As such, σ̂² is called an asymptotically unbiased estimator in some books.
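• A quick numerical illustration of these maximum likelihood formulas (a sketch with illustrative population parameters μ = 5 and σ = 3):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=3.0, size=1_000)

    n = x.size
    mu_hat = x.mean()                        # MLE of mu (the sample mean)
    var_hat = np.sum((x - mu_hat)**2) / n    # MLE of sigma^2 (divides by n, biased)
    s2 = np.sum((x - mu_hat)**2) / (n - 1)   # unbiased sample variance, for contrast

    print(mu_hat, var_hat, s2)               # mu_hat near 5; var_hat and s2 near 9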
• Proposition 8.1 (invariance property of the maximum likelihood estimator) ---
  If θ̂ is the maximum likelihood estimator of the parameter θ, and g is a one-to-one function of θ, then g(θ̂) is the maximum likelihood estimator of g(θ).
  Proof: omitted; for a proof, see the following reference:
    R. V. Hogg and A. T. Craig, Introduction to Mathematical Statistics, 5th Ed., Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1995.
8.3 Inference of Confidence Intervals for Parameters
• Ideas ---
 • Insufficiency of point estimation ---
  • For example, let X̄ be the sample mean of a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with parameters (μ, σ²).
  • Then, X̄ ~ N(μ, σ²/n) according to Facts 7.4 and 7.11.
  • It follows that P{X̄ > μ} = P{X̄ < μ} = 1/2, but P{X̄ = μ} = 0.
  • Therefore, X̄ ≠ μ with probability 1 (i.e., P{X̄ ≠ μ} = 1).
  • That is, the estimator X̄ for μ is actually not precise from the viewpoint of probability.
 • Need for interval estimation ---
  • Consequently, for inference about a parameter θ, instead of just computing a point estimate of θ, it is desirable to report an interval of values that contains the unknown parameter with high probability (i.e., with confidence).
  • Such an interval is called a confidence interval, as defined in the following.
• Definition 8.10 (confidence interval) ---
  Two estimators θ̂1 and θ̂2 of a parameter θ determined from an estimator θ̂ of θ are said to form a 100(1 − α)% confidence interval (θ̂1, θ̂2) of θ (also called a confidence interval with 100(1 − α)% confidence level) if
    P{θ̂1 < θ < θ̂2} = 1 − α,
  where 1 − α is called the confidence coefficient.
• Confidence interval for the mean of a normal population distribution with known variance ---
 • Reasoning for derivation of the confidence interval ---
  • Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X ~ N(μ, σ²), it is known from Facts 7.4 and 7.11 that the sample mean X̄ of the random sample is normally distributed with mean and variance being μ and σ²/n, respectively, i.e., X̄ ~ N(μ, σ²/n).
  • Therefore, (X̄ − μ)/(σ/√n) is a unit normal random variable Z with cdf Φ(x) (the error function).
  • Define z_α to be a value such that P{Z > z_α} = α (see Fig. 8.4 for an illustration), or equivalently, P{Z ≤ z_α} = Φ(z_α) = 1 − α.
  • Consequently, z_{α/2} is such that P{Z ≤ z_{α/2}} = 1 − α/2 and
      P{−z_{α/2} < Z < z_{α/2}} = Φ(z_{α/2}) − Φ(−z_{α/2})
                                = Φ(z_{α/2}) − (1 − Φ(z_{α/2}))
                                = 2Φ(z_{α/2}) − 1
                                = 2(1 − α/2) − 1
                                = 1 − α.
    That is,
      P{−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}} = 1 − α.    (8.2)
  [Figure 8.4: the standard normal pdf with the upper-tail area α to the right of z_α marked.]
  Fig. 8.4 Illustration of the meaning of α in the confidence coefficient 1 − α.
  • For example, if α = 0.05 so that the confidence coefficient 1 − α = 0.95 (95%), then z_{α/2} = z_{0.025} = 1.96, as can be figured out using the error function table (Table 5.1). Furthermore, if 1 − α = 0.90 or 0.99, then the corresponding z_{α/2} are z_{0.05} = 1.645 and z_{0.005} = 2.58, respectively.
  • Now Equality (8.2) may be transformed into the following form:
      P{X̄ − z_{α/2}σ/√n < μ < X̄ + z_{α/2}σ/√n} = 1 − α,
    from which we get the desired 100(1 − α)% confidence interval (L, U) where
      L = X̄ − z_{α/2}σ/√n (lower limit);
      U = X̄ + z_{α/2}σ/√n (upper limit).    (8.3)
  • That is, the interval (L, U) contains the mean μ of the normal population distribution with known variance σ² with probability 1 − α.
  • Note that L and U are themselves random variables, and possible values of (L, U) computed from random sample values are denoted as (l, u) with l = x̄ − z_{α/2}σ/√n and u = x̄ + z_{α/2}σ/√n, respectively.
  • An illustration of a possible interval (l, u) and the related probabilities is shown in Fig. 8.5.
  • Consequently, by Definition 8.10 we have the following proposition.
  Fig. 8.5 An illustration of the confidence interval (L, U) and the related probabilities.
• Proposition 8.2 (100(1 − α)% confidence interval for a normal population mean with known variance) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with unknown mean μ and known variance σ², the 100(1 − α)% confidence interval for μ is
    (X̄ − z_{α/2}σ/√n, X̄ + z_{α/2}σ/√n),    (8.4)
  or in a simpler form,
    X̄ ± z_{α/2}σ/√n.    (8.5)
 • Recall: we use x̄ to denote the sample mean value x̄ = (x1 + x2 + … + xn)/n for a random sample of X1 = x1, X2 = x2, …, Xn = xn.
• Notes:
 • The upper and lower limits U and L as described in (8.3) through (8.5) are random variables since X̄ is a random variable.
 • If the sample mean value x̄ is used in place of X̄ in these formulas, then the result, still called a confidence interval, is denoted as (l, u), corresponding to (L, U).
 • Note that each (l, u) is just a possible outcome of (L, U). See Fig. 8.6 for an illustration.
  Fig. 8.6 Illustration of the randomness of observed confidence intervals.
• Example 8.5 ---
  Given a random sample of size n = 10 arising from a normal population random variable X ~ N(μ, 4) and the sample mean value x̄ = 15.1, compute a 95% confidence interval for μ.
  Solution:
 • 1 − α = 0.95, α = 0.05, and so z_{α/2} = z_{0.025} = 1.96, as mentioned before.
 • From Proposition 8.2, the confidence interval is (L, U) with
     L = X̄ − 1.96σ/√n,  U = X̄ + 1.96σ/√n.
 • Therefore, with the sample mean value x̄ = 15.1 and known σ² = 4, or equivalently σ = 2, the desired confidence interval is (l, u) where
     l = x̄ − 1.96σ/√n = 15.1 − 1.96×2/√10 ≈ 13.86;
     u = x̄ + 1.96σ/√n = 15.1 + 1.96×2/√10 ≈ 16.34.
 • Therefore, we have 95% confidence that the mean μ of the normal population distribution falls within the interval (13.86, 16.34).
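• The computation of Example 8.5 may be scripted as follows (a minimal sketch; scipy.stats.norm.ppf supplies z_{α/2} in place of a table lookup):

    import numpy as np
    from scipy import stats

    x_bar, sigma, n = 15.1, 2.0, 10
    alpha = 0.05

    z = stats.norm.ppf(1 - alpha / 2)               # z_{0.025} = 1.96
    half_width = z * sigma / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # approx. (13.86, 16.34)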
• Notes ---
 • The results derived in the previous discussions are not practical because the variance was assumed to be known in advance.
 • In real applications, the population variance is usually unknown, and we need more theory before we can compute the confidence interval for the mean under such conditions.
• Student's t distribution ---
 • Definition 8.11 (t distribution) ---
   Let Z denote the unit normal random variable, and W denote a χ² random variable with n degrees of freedom. Also, assume that Z and W are independent. Then the random variable
     T = Z/√(W/n)    (8.6)
   is said to possess a (Student's) t distribution with n degrees of freedom.
• Proposition 8.3 (the pdf of a random variable with a t distribution) ---
  If T is a random variable possessing a t distribution with n degrees of freedom, then its pdf is given by
    f(t) = [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞.    (8.7)
  Proof:
 • We will try to find the pdf of T by applying Theorem 6.1.
 • Define two new random variables T = g(Z, W) = Z/√(W/n) and U = h(Z, W) = W, so that t = g(z, w) = z/√(w/n) and u = h(z, w) = w.
 • The Jacobian of g and h is:
     J(z, w) = det [[∂g/∂z, ∂g/∂w], [∂h/∂z, ∂h/∂w]] = det [[1/√(w/n), −z/(2√(w³/n))], [0, 1]] = 1/√(w/n),
   which we may assume is not equal to zero.
 • Then, the inverse transforms of g and h may be found to be
     z = r(t, u) = t√(u/n) and w = s(t, u) = u.
 • According to Theorem 6.1 and the independence of W and Z, the joint pdf of T and U, with z = t√(u/n) and w = u, is
     fTU(t, u) = fZW(z, w)|J(z, w)|⁻¹
               = fZW(t√(u/n), u)·√(u/n)
               = fZ(t√(u/n))·fW(u)·√(u/n)    (by the independence of Z and W)
               = [1/√(2π)] e^{−ut²/(2n)} · [1/(2^{n/2}Γ(n/2))] e^{−u/2} u^{(n/2)−1} · √(u/n)
               = [1/(√(2πn)·2^{n/2}·Γ(n/2))] u^{(n−1)/2} e^{−(u/2)[1+(t²/n)]}
   for −∞ < t < ∞ and u > 0, and fTU(t, u) = 0 otherwise.
 • The marginal pdf fT(t) of T then is
     fT(t) = ∫₀^∞ fTU(t, u) du = [1/(√(2πn)·2^{n/2}·Γ(n/2))] ∫₀^∞ u^{(n−1)/2} e^{−(u/2)[1+(t²/n)]} du.
 • Let y = (u/2)[1 + (t²/n)]. Then u = 2y[1 + (t²/n)]⁻¹ so that du = 2[1 + (t²/n)]⁻¹ dy.
 • And so
     fT(t) = [1/(√(2πn)·2^{n/2}·Γ(n/2))] ∫₀^∞ {2[1 + (t²/n)]⁻¹y}^{(n−1)/2} e^{−y} · 2[1 + (t²/n)]⁻¹ dy
           = [1/(√(nπ)·Γ(n/2))] [1 + (t²/n)]^{−(n+1)/2} ∫₀^∞ y^{[(n+1)/2]−1} e^{−y} dy.
 • But the above integral ∫₀^∞ y^{[(n+1)/2]−1} e^{−y} dy, according to Definition 6.7: Γ(t) = ∫₀^∞ e^{−y} y^{t−1} dy, is just Γ((n+1)/2).
 • Therefore, fT(t) above is equal to
     fT(t) = [Γ((n+1)/2) / (√(nπ)·Γ(n/2))] (1 + t²/n)^{−(n+1)/2},  −∞ < t < ∞.
   Done.
• Fact 8.8 (a limiting property of the t distribution) ---
  The pdf of the t distribution satisfies
    lim_{n→∞} f(t) = (1/√(2π)) e^{−t²/2},
  which means that as n → ∞, the t distribution resembles a unit normal distribution.
  Proof:
 • A fact from elementary calculus is: lim_{n→∞} (1 + t²/n)^n = e^{t²}, so
     lim_{n→∞} (1/√(2π)) (1 + t²/n)^{−n/2} = (1/√(2π)) [lim_{n→∞} (1 + t²/n)^n]^{−1/2}
                                           = (1/√(2π)) e^{−t²/2}.    (8a)
 • A property of the gamma function is:
     Γ(x + δ) / [Γ(x) x^δ] ≈ 1 for large x and small δ
   (see the book by K. Fukunaga listed below, p. 574).
     K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Ed., Academic Press, San Diego, CA, USA, 1990.
 • Another property of the gamma function is:
     Γ(x + 1/2) / Γ(x) = √x (1 − 1/(8x) + 1/(128x²) + 5/(1024x³) − 21/(32768x⁴) + …)
   (see the website listed below, Eq. (98)).
     http://mathworld.wolfram.com/GammaFunction.html
 • By either of the above two properties and with x = n/2 and δ = 1/2, we can get
     lim_{n→∞} Γ((n+1)/2) / [(n/2)^{1/2} Γ(n/2)] = 1.    (8b)
 • Also, with t being finite, it is easy to see that the following equality is true:
     lim_{n→∞} (1 + t²/n)^{−1/2} = 1.    (8c)
 • According to the three facts (8a)-(8c) derived above, and writing √(nπ) = (n/2)^{1/2} √(2π), lim_{n→∞} f(t) becomes
     lim_{n→∞} f(t) = lim_{n→∞} [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + t²/n)^{−(n+1)/2}
                    = {lim_{n→∞} Γ((n+1)/2) / [(n/2)^{1/2} Γ(n/2)]} × {lim_{n→∞} (1 + t²/n)^{−1/2}} × {lim_{n→∞} (1/√(2π)) (1 + t²/n)^{−n/2}}
                    = (8b) × (8c) × (8a)
                    = 1 × 1 × (1/√(2π)) e^{−t²/2}
                    = (1/√(2π)) e^{−t²/2},
   which is the pdf of a unit normal distribution. Done.
• Fact 8.9 (mean and variance of the t distribution) ---
  A random variable T possessing a t distribution with n degrees of freedom has the following mean and variance:
    E[T] = 0;  Var(T) = n/(n − 2)  (for n > 2).
  Proof: left as an exercise.
• Shape of the pdf of the t distribution ---
 • Some shapes of the t distribution for various degrees of freedom are shown in Fig. 8.7.
 • According to Fact 8.8, as n → ∞, the shape of the t distribution becomes that of a unit normal distribution, which is the top (pink) curve in the figure.
  [Figure 8.7: t pdfs for n = 1, 2, 5, 10, and ∞.]
  Fig. 8.7 Shape of the pdf of the t distribution, where "n" means "degrees of freedom." Note the curve with n = ∞, which becomes that of a unit normal distribution.
• Proposition 8.4 ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with parameters (μ, σ²), let X̄ and S² denote its sample mean and sample variance. Then, the function
    (X̄ − μ)/(S/√n)    (8.8)
  possesses a t distribution with n − 1 degrees of freedom. (Note: S = √(S²) is called the sample standard deviation of the random sample.)
  Proof:
 • Fact 7.15(c) says that (n − 1)S²/σ² is a χ² random variable with n − 1 degrees of freedom.
 • Also, by applying Fact 6.20 repeatedly, X̄ may be proved to be normally distributed since the population random variable X is normal.
 • Furthermore, since the mean and variance of X̄ are μ and σ²/n (by Facts 7.4 and 7.11), we get to know that (X̄ − μ)/(σ/√n) is a unit normal random variable.
 • Therefore, by taking Z and W in Definition 8.11 to be (X̄ − μ)/(σ/√n) and (n − 1)S²/σ², respectively, we get to know that
     T = Z/√(W/(n − 1)) = [(X̄ − μ)/(σ/√n)] / √[(n − 1)S²/(σ²(n − 1))] = (X̄ − μ)/(S/√n)
   possesses a t distribution with n − 1 degrees of freedom. Done.
• Computing the confidence interval for the mean of a normal population distribution with unknown variance ---
 • Reasoning for derivation of the confidence interval ---
  • Let T be a random variable possessing a t distribution with n − 1 degrees of freedom as defined previously.
  • Define t_{α,n−1} to be a value such that P{T > t_{α,n−1}} = α, or equivalently, P{T ≤ t_{α,n−1}} = 1 − α (see Fig. 8.8 for an illustration).
   [Figure 8.8: the t pdf with the upper-tail area α to the right of t_{α,n−1} marked.]
   Fig. 8.8 Illustration of the pdf of a t distribution, where t_{α,n−1} is such that P{T > t_{α,n−1}} = α.
  • To compute the values of t_{α,n}, a table found at the website
      http://wise.xmu.edu.cn/course/ugecon2/t-table.pdf
    can be searched to get the values of t_{α,n} for various n and α. For example, for α = 0.05 and 0.01 and n = 13 and 22, the values of t may be found to be t_{0.05,13} = 1.771; t_{0.05,22} = 1.717; t_{0.01,13} = 2.650; t_{0.01,22} = 2.508.
  • On the other hand, by Proposition 8.4, (X̄ − μ)/(S/√n) has a t distribution with n − 1 degrees of freedom.
  • Therefore, similar to the derivation of (8.2) and by the symmetry of the t distribution, we can get
      P{−t_{α/2,n−1} < (X̄ − μ)/(S/√n) < t_{α/2,n−1}} = 1 − α.    (8.9)
  • Equality (8.9) may be easily transformed into the following form:
      P{X̄ − t_{α/2,n−1}S/√n < μ < X̄ + t_{α/2,n−1}S/√n} = 1 − α.
  • Therefore, by Definition 8.10, we have the following proposition.
• Proposition 8.5 (100(1 − α)% confidence interval for a normal population mean with unknown variance) ---
  Given a random sample X1, X2, ..., Xn of size n arising from a normal population distribution with unknown mean μ and unknown variance σ², the 100(1 − α)% confidence interval for μ is
    (X̄ − t_{α/2,n−1}S/√n, X̄ + t_{α/2,n−1}S/√n),    (8.10)
  or in a simpler form,
    X̄ ± t_{α/2,n−1}S/√n.    (8.11)
• Example 8.6 ---
  Given a random sample of size n = 15 observed from a normal population distribution with the sample values given below,
    26.7, 25.8, 24.0, 24.9, 26.4, 25.9, 24.4, 21.7, 24.1, 25.9, 27.3, 26.9, 27.3, 24.8, 23.6,
  compute a 95% confidence interval for the population mean μ.
  Solution:
 • From Proposition 8.5, the confidence interval is (L, U) (in random variable form) with
     L = X̄ − t_{α/2,n−1}S/√n,  U = X̄ + t_{α/2,n−1}S/√n.
 • The sample mean value x̄ and sample standard deviation value s may be computed to be 25.31 and 1.58, respectively (details omitted).
 • Now, n = 15 and α = 0.05, and so t_{α/2,n−1} = t_{0.025,14} = 2.145 according to the t table found at the above-mentioned website.
 • Therefore, the desired confidence interval is (l, u) where
     l = x̄ − 2.145s/√n = 25.31 − 2.145×1.58/√15 ≈ 24.43;
     u = x̄ + 2.145s/√n = 25.31 + 2.145×1.58/√15 ≈ 26.19.
 • Thus, a 95% confidence interval for the population mean μ is (24.43, 26.19).
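• Example 8.6 as a sketch, with scipy.stats.t.ppf replacing the t-table lookup (a minimal illustration, not part of the original notes):

    import numpy as np
    from scipy import stats

    x = np.array([26.7, 25.8, 24.0, 24.9, 26.4, 25.9, 24.4, 21.7,
                  24.1, 25.9, 27.3, 26.9, 27.3, 24.8, 23.6])
    alpha = 0.05
    n = x.size

    x_bar = x.mean()
    s = x.std(ddof=1)                                # sample std. dev. (n - 1 divisor)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)    # t_{0.025,14} = 2.145
    half_width = t_crit * s / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)    # approx. (24.43, 26.19)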
• Computing a large-sample confidence interval for the mean of any population ---
 • Idea ---
  • Recall the two ways of deriving the confidence intervals for the mean of a normal population:
    (1) Way 1 --- assuming the population variance σ² is known.
    (2) Way 2 --- assuming the population variance σ² is unknown.
  • Both ways require the assumption that the concerned population is normal.
  • When the size n of the random sample is large enough, we may infer confidence intervals for the mean of any population without making the assumptions of normality and the availability of the population variance.
  • That is, we can infer intervals for the mean of any population as n → ∞.
  • The underlying theory supporting this possibility is the central limit theorem and some other facts, as discussed in the following.
• ***Consistency of the variance of the sample variance ---
 • Fact 7.14 says that the variance of the sample variance S² of a normally distributed population with variance σ² is Var(S²) = 2σ⁴/(n − 1).
 • A more general fact with no assumption of the normality of the population, which we state without proof, is:
     Var(S²) = σ⁴(2/(n − 1) + κ/n),
   where κ is the (excess) kurtosis of the population distribution, defined as "the fourth moment around the mean (called the central moment) divided by the square of the variance σ² of the population distribution, minus 3" (or as "the fourth normalized central moment minus 3"):
     κ = μ₄/σ⁴ − 3,
   where the central moment means μ_k = E[(X − μ)^k] (see the following web pages for more details: http://en.wikipedia.org/wiki/Moment_(mathematics) and http://en.wikipedia.org/wiki/Variance).
 • From the following web page: http://en.wikipedia.org/wiki/Normal_distribution, we get to know that the fourth central moment of a normal distribution is 3σ⁴.
 • Therefore, for normal distributions, κ = 0.
 • But for general distributions, κ need not be zero, and Var(S²) may be rewritten as
     Var(S²) = σ⁴(2/(n − 1) + κ/n) = σ⁴[2/(n − 1) + (μ₄/σ⁴ − 3)/n]
             = (1/n)[μ₄ − (n − 3)σ⁴/(n − 1)].    (B)
• Fact 8.10 (consistency of the sample variance of a general population) ---
  The sample variance S² is a consistent estimator of the variance σ² of any population distribution.
  Proof:
 • We know that S² is an unbiased estimator of the variance of the population distribution from Fact 8.1.
 • From the equality (B) above, we have
     lim_{n→∞} Var(S²) = 0,
   because both the central moment μ₄ and the square σ⁴ = (σ²)² of the variance of the population are constants.
 • Therefore, by Fact 8.7a, we get to know that S² is a consistent estimator of σ². Done.
• Comments ---
 • By the definition of consistency, the above fact says that
     lim_{n→∞} P{|S² − σ²| < ε} = 1  ∀ ε > 0,
   which is a form of the weak law of large numbers for the variance σ², and we may say that S² converges toward σ² in probability, or in notation, that S² → σ² as n → ∞, or equivalently, S → σ as n → ∞.
 • Fact 8.7 is just a special case of Fact 8.10 above.
• Reasoning for derivation of the confidence interval ---
 • As n → ∞, according to the central limit theorem, if X̄ is the sample mean of a random sample of size n arising from any random variable with mean μ and variance σ², then Y = (X̄ − μ)/(σ/√n) is approximately a unit normal random variable.
 • Let z_α be such that P{(X̄ − μ)/(σ/√n) ≤ z_α} ≈ Φ(z_α) = 1 − α.
 • Then, using a process similar to that for deriving (8.2), we can get
     P{−z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2}} ≈ 1 − α.
 • The value σ in the above equality is unknown, so it must be estimated.
 • For this purpose, we may replace it with the sample standard deviation S, because by Fact 8.10 above and the comment following it, we have S → σ as n → ∞.
 • This results in
     P{−z_{α/2} < (X̄ − μ)/(S/√n) < z_{α/2}} ≈ 1 − α,
   which may be transformed into the following form:
     P{X̄ − z_{α/2}S/√n < μ < X̄ + z_{α/2}S/√n} ≈ 1 − α.
 • Therefore, by Definition 8.10, we have the following proposition.
• Proposition 8.6 (large-sample 100(1 − α)% confidence interval for the mean of any population) ---
  Given a large random sample X1, X2, ..., Xn of size n arising from a population distribution with unknown mean μ and unknown variance σ², the large-sample 100(1 − α)% confidence interval for μ is
    (X̄ − z_{α/2}S/√n, X̄ + z_{α/2}S/√n),    (8.12)
  or in a simpler form,
    X̄ ± z_{α/2}S/√n,    (8.13)
  where X̄ and S are the sample mean and the sample standard deviation of the random sample, respectively.
• A rule of thumb for selecting n --- when n > 30, the above proposition may be applied (i.e., the requirement of a large sample is satisfied).
• Example 8.7 ---
  Suppose 40 observations (i.e., random sample values) are made of the weight of a 100-lb rivet bag (鉚釘袋) manufactured by a factory. The sample mean and the sample standard deviation of these observations are x̄ = 99.71 and s = 0.88, respectively. What is the 95% confidence interval for the mean weight of the rivet bag?
  Solution:
 • From Proposition 8.6, the 95% confidence interval (with α = 0.05 and z_{α/2} = z_{0.025} = 1.96) is
     (x̄ − z_{α/2}s/√n, x̄ + z_{α/2}s/√n) = (99.71 − 1.96×0.88/√40, 99.71 + 1.96×0.88/√40)
                                         = (99.44, 99.98).
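• Example 8.7 as a sketch (illustrative only; the same pattern as the earlier confidence interval computations):

    import numpy as np
    from scipy import stats

    x_bar, s, n = 99.71, 0.88, 40
    alpha = 0.05

    z = stats.norm.ppf(1 - alpha / 2)               # 1.96
    half_width = z * s / np.sqrt(n)
    print(x_bar - half_width, x_bar + half_width)   # approx. (99.44, 99.98)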
• Precision and sample size: inference of the random sample size n for a fixed width of confidence interval ---
 • It is assumed in the previous discussions that the random sample size is fixed for the corresponding confidence interval to be computed.
 • For the large-sample confidence interval derivation as described in Proposition 8.6, sometimes it is desired instead that the width of the confidence interval be fixed first (for the purpose of reducing the interval width to increase the precision), say, to be ±r around the sample mean value x̄, and then to compute the number n of samples to take.
 • This means that the interval now is
     (x̄ − z_{α/2}s/√n, x̄ + z_{α/2}s/√n) = (x̄ − r, x̄ + r),
   so
     r = z_{α/2}s/√n,
   which leads to the solution
     n = (z_{α/2}s/r)².    (8.14)
   (Note: if necessary, take the ceiling of the above value of n as the solution.)
• Example 8.8 (continued from Example 8.7) ---
  In Example 8.7, the computed 95% confidence interval around the sample mean value x̄ = 99.71 with a sample size of n = 40 is (99.44, 99.98), which means the interval width is 2(99.98 − 99.71) = 2×0.27 = 0.54. Suppose a tighter interval with half this width is desired; what is the new sample size n′ that should be used?
  Solution:
 • From (8.14) with r = 0.27/2 = 0.135, we have
     (z_{α/2}s/r)² = (1.96 × 0.88 / 0.135)² ≈ 163.23,
   so n′ should be taken to be ⌈163.23⌉ = 164, where ⌈·⌉ is the integer ceiling function.
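• A sketch of the sample-size formula (8.14) (illustrative values from Example 8.8):

    import math
    from scipy import stats

    s, r = 0.88, 0.135                        # sample std. dev. and target half-width
    z = stats.norm.ppf(1 - 0.05 / 2)          # z_{0.025} = 1.96
    n_needed = math.ceil((z * s / r) ** 2)    # (8.14), with the ceiling taken
    print(n_needed)                           # 164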
• A note ---
  The above discussion of computing the sample size n from a given fixed confidence interval width is applicable to many other cases of confidence interval inference, as can be seen subsequently.
• Computing a large-sample confidence interval for a proportion of a population ---
 • Idea ---
  • It is often desired to derive a confidence interval for the proportion of a population which has a certain property (such as the proportion of the people in a community who are in favor of a certain election candidate against an opponent, or the proportion of the objects in a certain group which have specific characteristics).
  • Let p denote the proportion of "success" in the population, called the success proportion, with success identifying the above-mentioned property.
  • Then, given a random sample of n individuals, X1, X2, ..., Xn, from the population, obviously the number of successes in the sample may be regarded as a binomial random variable X with parameters (n, p), with each sample variable Xi being a Bernoulli random variable.
  • From Fact 4.7, we have E[X] = np and Var(X) = np(1 − p).
  • Also, from the DeMoivre-Laplace limit theorem (a special case of the central limit theorem, as shown in Example 7.13) mentioned in Chapters 5 and 7, we know that as n → ∞, the standardized X may be approximated by a unit normal random variable in the following way:
      P{a ≤ (X − np)/√(np(1 − p)) ≤ b} ≈ Φ(b) − Φ(a),
    or equivalently,
      P{a ≤ (X/n − p)/√(p(1 − p)/n) ≤ b} ≈ Φ(b) − Φ(a),    (8.15)
    where Φ(·) is the error function (the cdf of the unit normal random variable).
  • On the other hand, from Example 8.2, an unbiased estimator of p is the sample proportion p̂ = X/n, which may be used to replace X/n in (8.15) above.
  • Also, if z_{α/2} is such that P{Z ≤ z_{α/2}} = 1 − α/2, where Z is the unit normal random variable, then based on (8.15) we get
      P{−z_{α/2} ≤ (p̂ − p)/√(p(1 − p)/n) ≤ z_{α/2}} ≈ Φ(z_{α/2}) − Φ(−z_{α/2}) = 1 − α.    (8.16)
  • The left part of (8.16) above may be transformed, by solving the two inequalities, into the following form (with the details omitted):
      P{p_l ≤ p ≤ p_u} ≈ 1 − α    (8.17)
    where
      p_l = [p̂ + z²_{α/2}/(2n) − z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n];
      p_u = [p̂ + z²_{α/2}/(2n) + z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n]
    with q̂ = 1 − p̂.
  • As n → ∞, the three terms in each of p_l and p_u above involving z²_{α/2} are negligible in magnitude, so that p_l ≈ p̂ − z_{α/2}√(p̂q̂/n) and p_u ≈ p̂ + z_{α/2}√(p̂q̂/n).
  • Thus, (8.17) becomes
      P{p̂ − z_{α/2}√(p̂q̂/n) ≤ p ≤ p̂ + z_{α/2}√(p̂q̂/n)} ≈ 1 − α.    (8.18)
 • So, by Definition 8.10 we have the following proposition.
• Proposition 8.7 (large-sample 100(1 − α)% confidence interval for a population proportion) ---
  Given a large random sample X1, X2, ..., Xn of size n arising from a population with a success proportion of p, and letting X be the number of successes in the sample, the large-sample 100(1 − α)% confidence interval for p is
    (p̂ − z_{α/2}√(p̂q̂/n), p̂ + z_{α/2}√(p̂q̂/n)),    (8.19)
  or in a simpler form,
    p̂ ± z_{α/2}√(p̂q̂/n),    (8.20)
  where p̂ = X/n and q̂ = 1 − p̂.
• A rule of thumb for selecting the sample size n --- when np > 5 and n(1 − p) > 5, the above proposition may be applied (i.e., the requirement of a large sample is satisfied).
• Definition 8.12 (sampling error) ---
  The value ±r = ±z_{α/2}√(p̂q̂/n), whose absolute value is half of the width of the confidence interval as mentioned previously, is called the sampling error.
• Chinese terms for some frequently-used notions in the media (newspapers, magazines, etc.) ---
 • Confidence interval --- 信賴區間;
 • Sampling error --- 抽樣誤差;
 • Confidence level of 100(1 − α)% --- 100(1 − α)% 信心水準.
• Example 8.9 ---
  In a survey of family opinions about a certain public policy in the Taipei metropolitan area, 1000 families were taken as a random sample, in which 720 were found to be positive toward the policy. At the confidence level of 95%, what is the sampling error r? And what is the confidence interval for the rate p of families with positive opinions in the metropolitan area?
  Solution:
 • Let "being positive toward the policy" mean "success."
 • Then, the sample value of the "number X of successes" is x = 720, the sample size is n = 1000, the estimate of p is p̂ = x/n = 720/1000 = 0.72, and q̂ = 1 − p̂ = 0.28.
 • The confidence level of 95% means α = 1 − 0.95 = 0.05.
 • So, the sampling error is r = ±z_{α/2}√(p̂q̂/n) = ±z_{0.025}√(p̂q̂/n) = ±1.96√(0.72×0.28/1000) ≈ ±0.028. And the confidence interval is
     (l, u) = (0.72 − 0.028, 0.72 + 0.028) = (0.692, 0.748).
 • A translation of the above results, as they might be reported in the Chinese-language media (rendered here in English), is:
   "According to an opinion poll on a certain public policy, 72 percent of the families in the greater Taipei area support the policy. The poll randomly sampled one thousand families; at the 95% confidence level, the sampling error is about plus or minus 2.8 percentage points."
• A website for computing the sampling error --- http://www.dssresearch.com/toolkit/secalc/error.asp
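• Example 8.9 can also be reproduced with the sketch below (a minimal illustration; scipy.stats.norm.ppf supplies z_{α/2}):

    import numpy as np
    from scipy import stats

    x, n = 720, 1000
    alpha = 0.05

    p_hat = x / n
    q_hat = 1 - p_hat
    z = stats.norm.ppf(1 - alpha / 2)       # 1.96

    r = z * np.sqrt(p_hat * q_hat / n)      # sampling error, approx. 0.028
    print(r, (p_hat - r, p_hat + r))        # approx. (0.692, 0.748)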
• A note ---
  Actually the equality (8.17), repeated below, may also be used directly for confidence interval inference with better precision:
    P{p_l ≤ p ≤ p_u} ≈ 1 − α    (8.17)
  where
    p_l = [p̂ + z²_{α/2}/(2n) − z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n];
    p_u = [p̂ + z²_{α/2}/(2n) + z_{α/2}√(p̂q̂/n + z²_{α/2}/(4n²))] / [1 + z²_{α/2}/n]
  with q̂ = 1 − p̂.
• Precision and sample size ---
 • Again, as discussed previously, sometimes it is desired to fix in advance the precision of the confidence interval, which now is called the sampling error r, and to compute the corresponding sample size n.
 • For this, it seems that we can solve r = z_{α/2}√(p̂q̂/n) to get a solution for n.
 • However, this is impractical because p̂ and q̂ = 1 − p̂, both unknown before the sample is taken, are included in r = z_{α/2}√(p̂q̂/n).
 • One way out is to redefine r as the maximum sampling error we want, and under this assumption, to find the corresponding minimum n.
 • For this, first try to find the maximum value of p̂q̂ = p̂(1 − p̂), which occurs when p̂ = 1/2 (setting the derivative d(p̂q̂)/dp̂ = 1 − 2p̂ = 0 leads to the solution p̂ = 1/2).
 • So, substituting p̂ = 1/2 into r = z_{α/2}√(p̂q̂/n), where q̂ = 1 − p̂, we get the solution for n as
     n = (1/4)(z_{α/2}/r)².    (8.21)
 • Another way is to specify an expected value for p̂, say denoted as p̂₀, and solve r = z_{α/2}√(p̂q̂/n) to get
     n = p̂₀(1 − p̂₀)(z_{α/2}/r)².    (8.22)
• Example 8.10 (continued from Example 8.9) ---
  As a continuation of Example 8.9, suppose that it is desired to have a sampling error no larger than r = 0.02 instead of the original value of 0.028. At least how many families should be sampled? On the other hand, if it is assumed that the estimate p̂ = 0.72 stays the same, what is the number of families that should be sampled?
  Solution:
 • The first question can be answered using the first way mentioned above.
 • So by (8.21), we get
     n = (1/4)(z_{α/2}/r)² = (1/4)(1.96/0.02)² = (1/4)×98² = 49² = 2401.
 • Therefore, at least 2401 families should be sampled.
 • If p̂ is assumed to stay the same at 0.72 when r = 0.02, then the number of families that should be sampled now, by (8.22), is
     n = p̂₀(1 − p̂₀)(z_{α/2}/r)² = 0.72×(1 − 0.72)×(1.96/0.02)² = 0.72×0.28×98² ≈ 1936.17, rounded up to 1937.
 • That is, 1937 instead of 1000 families should be sampled.
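• A sketch of the two sample-size formulas (8.21) and (8.22) (illustrative; the exact z value makes the worst-case count land just under 2401 before the ceiling):

    import math
    from scipy import stats

    z = stats.norm.ppf(1 - 0.05 / 2)         # approx. 1.96
    r = 0.02

    # Worst case (p-hat = 1/2), Eq. (8.21):
    n_worst = math.ceil(0.25 * (z / r) ** 2)
    # With an expected proportion p0 = 0.72, Eq. (8.22):
    p0 = 0.72
    n_expected = math.ceil(p0 * (1 - p0) * (z / r) ** 2)

    print(n_worst, n_expected)               # 2401 and 1937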
• Computing a small-sample confidence interval for a proportion of a population ---
 • Idea ---
  • The above discussions are based on the use of large samples.
  • It is also possible to obtain a confidence interval for the success proportion p when n is small. The inference of the interval is based directly on the use of the binomial distribution.
  • Assume that the number of successes in the n samples is the sample total value X = x.
  • For simplification of notation hereafter, we define B(x; n, p) as the cdf
      P{X ≤ x} = Σ_{i=0}^{x} C(n, i) p^i (1 − p)^{n−i}
    of a binomial random variable X with parameters (n, p). This notation is also used in many binomial distribution tables found in books, like the one at the following website:
      http://webcache.googleusercontent.com/search?q=cache:kHbpn3dKAIYJ:www.statisticshowto.com/tables/binomial-distribution-table/+binomial+distribution+table&cd=10&hl=zh-TW&ct=clnk&gl=tw.
  • To find the 100(1 − α)% confidence interval (p_l, p_u) for p, we may consider first finding two functions, b(p_l) and a(p_u), as limits for X so that for any p, approximately the following equality is true:
      P{a(p_u) < X ≤ b(p_l)} = 1 − α,
    where
    (1) we assign P{X ≤ a(p_u)} = α/2 and P{X > b(p_l)} = α/2 (or equivalently, P{X ≤ b(p_l)} = 1 − α/2) so that the above equality holds; and
    (2) we define a(p_u) and b(p_l) to be such that P{X ≤ a(p_u)} = B(x; n, p_u) and P{X ≤ b(p_l)} = B(x − 1; n, p_l).
  • Accordingly, P{X ≤ a(p_u)} = B(x; n, p_u) = α/2, and P{X ≤ b(p_l)} = B(x − 1; n, p_l) = 1 − α/2, from which, by a binomial distribution table, we can find approximate values of p_u and p_l to satisfy the equalities (note: only approximate values can be found due to the discreteness of the random variable X).
  • Finally, take the found limits p_l and p_u to construct the desired 100(1 − α)% confidence interval (p_l, p_u).
 • A reference for more about the above topic is:
    N. Johnson and F. Leone, Statistics and Experimental Design in Engineering and the Physical Sciences, Vol. II, 2nd Ed., Wiley, New York, 1977.
 • The above discussions lead to the following proposition.
• Proposition 8.8 (small-sample 100(1 − α)% confidence interval for a population proportion) ---
  Given a small random sample X1, X2, ..., Xn of size n arising from a population with a success proportion of p, and letting X be the number of successes in the sample with observed value x, the small-sample 100(1 − α)% confidence interval for p is (p_l, p_u), where p_l and p_u are respectively such that
    B(x − 1; n, p_l) = 1 − α/2;  B(x; n, p_u) = α/2.
• Example 8.11 ---
  Twenty units are sampled from a continuous production line and four items are found to be defective. Find an approximate 90% confidence interval for the true proportion defective, p, using the small-sample approach described previously.
  Solution:
 • The proportion defective is estimated to be p̂ = x/n = 4/20 = 0.20.
 • 1 − α = 0.9, and so α = 0.1 and α/2 = 0.05.
 • The upper limit p_u may be found by solving the following equality:
     B(x; n, p_u) = B(4; 20, p_u) = Σ_{i=0}^{4} C(20, i) p_u^i (1 − p_u)^{20−i} = α/2 = 0.05,
   which leads to the approximate solution p_u = 0.4 (for this p_u, the more precise value of B(4; 20, 0.4) is 0.051).
 • The lower limit p_l may be found by solving the following equality:
     B(x − 1; n, p_l) = B(3; 20, p_l) = Σ_{i=0}^{3} C(20, i) p_l^i (1 − p_l)^{20−i} = 1 − α/2 = 0.95,
   which leads to the approximate solution p_l = 0.071 (for this p_l, the more precise value of B(3; 20, 0.071) is about 0.95).
 • A 90% confidence interval for the proportion defective p, given the computed p̂ = 0.20, is thus (0.071, 0.400).
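• The two defining equalities of Proposition 8.8 can also be solved numerically instead of by table lookup. The sketch below uses a root finder on the binomial cdf (the resulting limits are what the wider literature calls Clopper-Pearson limits; the function names are standard SciPy):

    from scipy import stats
    from scipy.optimize import brentq

    x, n, alpha = 4, 20, 0.10

    # Solve B(x; n, p_u) = alpha/2 and B(x - 1; n, p_l) = 1 - alpha/2 for p.
    p_u = brentq(lambda p: stats.binom.cdf(x, n, p) - alpha / 2,
                 1e-9, 1 - 1e-9)
    p_l = brentq(lambda p: stats.binom.cdf(x - 1, n, p) - (1 - alpha / 2),
                 1e-9, 1 - 1e-9)

    print(p_l, p_u)   # approx. 0.071 and 0.401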
8.4 Hypothesis Testing
• Concept and definitions ---
 • Hypothesis testing means making a decision to accept or reject a claim, called a hypothesis, regarding a parameter of interest.
 • Random sampling is often used in hypothesis testing.
 • Definition 8.13 (statistic) ---
   A statistic is any function of the random variables constituting one or more samples, provided that the function does not depend on any unknown parameter values.
 • A note --- e.g., the various point estimators mentioned previously are statistics.
• Definition 8.14 (hypotheses) ---
  In a hypothesis testing problem, the claim initially favored or believed to be true is called the null hypothesis, denoted by H0; and the other claim in the problem is called the alternative hypothesis, denoted by Ha.
• Another note --- e.g., H0: μ = 0.75; Ha: μ ≠ 0.75, where μ is the mean of a population.
• Definition 8.15 (hypothesis test procedure) ---
  A hypothesis test procedure is specified by:
 • a test statistic --- a function of the sample data on which the decision of rejecting H0 or not is based;
 • a rejection (critical) region --- the set of all test statistic values for which H0 will be rejected.
• Definition 8.16 (type I and II errors) ---
  A test procedure may yield errors in the decision-making result, including the following two types:
 • type I error --- coming from rejecting H0 when H0 is true, with its probability denoted by α and called the significance level of the test;
 • type II error --- coming from accepting H0 when H0 is false (or equivalently, from rejecting Ha when Ha is true), with its probability denoted by β.
• Notes ---
 • A level α test is a hypothesis test whose type I error probability (or significance level) is α.
 • α is often chosen in the first place to be 0.10, 0.05, or 0.01.
• Tests about a population mean --- case I: a normal population with known standard deviation ---
 • Problem definition ---
   Given a random sample X1, X2, ..., Xn of size n arising from a normal population random variable X with unknown mean μ and known standard deviation σ, we want to make a decision about whether the mean μ is equal to a special value μ0, called the null value.
 Reasoning of the solution -- Null hypothesis: H0:  = 0
(e.g., we may want to make a decision about whether the average life 
of a given set of tires of a new design has no change, i.e., it still equals
the old parameter 0).
 A possible alternative hypothesis: Ha:  > 0
(that is, the new design yields instead tires having a longer average life
 than the old one 0).
 Since the population distribution is normal with parameters (, 2), the
sample mean X has mean  and variance 2/n (from Facts 7.4 and
7.11).
 Therefore,
X 
is a unit normal random variable Z with mean  and
/ n
standard deviation.
 If the null hypothesis H0:  = 0 is true, then
X  0
will also be a
/ n
unit normal random variable (note:  is replaced by 0 here).
 Given a set of random sample values x1, x2, ..., xn, we have a sample
n
mean value x = (  xi)/n which is an estimate of , i.e., x  .
i1
 If the null hypothesis H0:  = 0 is true, the distance d between x ()
and 0 should be close to zero; on the contrary, for H0 to be rejected (or
Ha to be accepted), this distance should be large in value.
▪ In more detail,
- if the alternative hypothesis is Ha: μ > μ0, then this distance d should be positive and large enough in magnitude;
- if Ha is μ < μ0, then d should be negative and large enough in magnitude;
- and finally, if Ha is μ ≠ μ0, then d should be large enough in magnitude, no matter whether it is positive or negative.
▪ However, terms like “large enough” are not precise enough, and to improve this,
- first, a better measure is to substitute x̄ into Z = (X̄ − μ0)/(σ/√n) to give a “sample Z value” z = (x̄ − μ0)/(σ/√n), which is in a sense just the standardized distance d between x̄ (≈ μ) and μ0, expressed in units of the standard deviation σ/√n;
- secondly, we need a threshold value to express “large enough,” and for this we use the value zα or zα/2 defined previously by P{Z > zα} = α and P{Z > zα/2} = α/2 (see Fig. 8.4 for an illustration of the meaning of zα).
▪ Accordingly, for H0: μ = μ0, we define the following decision rules for rejecting H0 in favor of the various alternative hypotheses Ha:
- (upper-tailed test) reject H0: μ = μ0 in favor of Ha: μ > μ0 if z ≥ zα;
- (lower-tailed test) reject H0: μ = μ0 in favor of Ha: μ < μ0 if z ≤ −zα;
- (two-tailed test) reject H0: μ = μ0 in favor of Ha: μ ≠ μ0 if z ≥ zα/2 or z ≤ −zα/2.
▪ The term Z = (X̄ − μ0)/(σ/√n) is called a test statistic for this hypothesis testing problem. It comes from the assumption that the population mean μ = μ0.
◦ Critical region and type I and II errors ---
▪ Each of the conditions “z ≥ zα,” “z ≤ −zα,” and “z ≥ zα/2 or z ≤ −zα/2” is said to compose a rejection (critical) region of the corresponding Ha.
▪ For the upper-tailed test ---
- (Type I error α) if z = (x̄ − μ0)/(σ/√n) computed from x̄ falls erroneously in the rejection region described by z ≥ zα, then H0 is rejected in favor of Ha, incurring a type I error with probability
P{type I error} = P{H0 is rejected when H0 is true}
= P{H0 is rejected when μ = μ0}
= P{(X̄ − μ0)/(σ/√n) = (X̄ − μ)/(σ/√n) ≥ zα}
= P{Z ≥ zα} = α;
- (Type II error β) on the other hand, H0 might erroneously not be rejected (i.e., Ha might be accepted) because z < zα (instead of z ≥ zα), incurring a type II error; and this could happen when the real value of μ is a particular value μ′ that exceeds μ0, so that the probability of this type II error may be computed to be
β(μ′) = P{type II error while μ = μ′}
= P{H0 not rejected while μ = μ′}
= P{(X̄ − μ0)/(σ/√n) < zα while μ = μ′}
= P{X̄ < μ0 + zα·σ/√n while μ = μ′}
= P{(X̄ − μ′)/(σ/√n) < zα + (μ0 − μ′)/(σ/√n)}
= Φ(zα + (μ0 − μ′)/(σ/√n)).
- (Illustration of type I and II errors) a diagram showing the two types of error is given in Fig. 8.9(a).
▪ For the other two types of test (the lower-tailed and two-tailed tests), similar derivations of the probabilities of type I and II errors may be conducted. The rejection regions and the corresponding type I error probabilities for the three types of test are illustrated in Figs. 8.9(b) through (d).
[Fig. 8.9 Rejection regions and α values of the three types of tests: (a) α and β values (left curve for μ = μ0, right curve for μ = μ′; critical line z = 1.645); (b) upper-tailed test; (c) lower-tailed test; (d) two-tailed test. Figure not reproduced here.]
◦ Computing n to satisfy selected α and β ---
▪ For the upper-tailed test ---
- Suppose that for a fixed μ′, we hope an n can be found to satisfy arbitrary choices of both α and β; since β(μ′) = Φ(zα + (μ0 − μ′)/(σ/√n)) as just derived, this implies
−zβ = zα + (μ0 − μ′)/(σ/√n)
(note: just as we define zα to satisfy P{Z ≤ zα} = Φ(zα) = 1 − α, here we define zβ as well to satisfy P{Z ≤ zβ} = Φ(zβ) = 1 − β, so that Φ(−zβ) = β, which explains the term −zβ on the left-hand side of the above equality).
- The above equality can be solved to get the desired sample size n as
n = [σ(zα + zβ)/(μ0 − μ′)]².
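As a quick numerical illustration, the following is a minimal Python sketch (our own, not from the text; it assumes SciPy is available, and the function name solve_n and the trial values of β are ours) of this sample-size formula for the upper-tailed z test.

    # A minimal sketch of the sample-size formula
    # n = [sigma * (z_alpha + z_beta) / (mu0 - mu_prime)]^2 for the upper-tailed z test.
    import math
    from scipy.stats import norm

    def solve_n(mu0, mu_prime, sigma, alpha, beta):
        """Smallest n so the level-alpha upper-tailed z test has type II error <= beta at mu'."""
        z_alpha = norm.ppf(1 - alpha)   # z_alpha satisfies P{Z <= z_alpha} = 1 - alpha
        z_beta = norm.ppf(1 - beta)
        n = (sigma * (z_alpha + z_beta) / (mu0 - mu_prime)) ** 2
        return math.ceil(n)             # round up to the next whole sample

    # Hypothetical values in the spirit of Example 8.12: mu0 = 20000, mu' = 21000, sigma = 1500.
    print(solve_n(20000, 21000, 1500, alpha=0.01, beta=0.10))   # -> 30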
▪ For the lower-tailed and two-tailed tests, the values of n for fixed α and β may be derived similarly.
◦ Summary ---
▪ A summary is given in Table 8.1.
▪ The above type of hypothesis testing is called the z test because of the use of the values zα and zα/2.
Table 8.1 Level α z test of the mean of a normal population with known standard deviation.
Null hypothesis: H0: μ = μ0. Test statistic: Z = (X̄ − μ0)/(σ/√n).
◦ Upper-tailed test --- alternative hypothesis Ha: μ > μ0; rejection region z ≥ zα; type I error prob. α; type II error prob. β(μ′) = Φ(zα + (μ0 − μ′)/(σ/√n)); sample size n for fixed α and β: n = [σ(zα + zβ)/(μ0 − μ′)]².
◦ Lower-tailed test --- Ha: μ < μ0; rejection region z ≤ −zα; type I error prob. α; β(μ′) = 1 − Φ(−zα + (μ0 − μ′)/(σ/√n)); n = [σ(zα + zβ)/(μ0 − μ′)]².
◦ Two-tailed test --- Ha: μ ≠ μ0; rejection region z ≥ zα/2 or z ≤ −zα/2; type I error prob. α; β(μ′) = Φ(zα/2 + (μ0 − μ′)/(σ/√n)) − Φ(−zα/2 + (μ0 − μ′)/(σ/√n)); n = [σ(zα/2 + zβ)/(μ0 − μ′)]² (approximately).
• Example 8.12 ---
A manufacturer of tires is considering modifying its design of the tire tread. A study reveals that the modification is justified only if the average tire life under standard test conditions exceeds 20,000 miles. A random sample of n = 16 prototype tires is manufactured and tested, resulting in a sample mean value of x̄ = 20,758 miles. Suppose that tire life is normally distributed with a known σ = 1500 (the standard deviation value for the current version of the tire). Do these data suggest that the modification justifies a decision to change the tire design to the new one? Solve this problem by a hypothesis testing procedure with a significance level of 0.01.
Solution:
◦ Parameter to test: μ = the true average life of the new tire design.
◦ Null hypothesis: H0: μ = μ0 with μ0 = 20,000 miles (so that accepting H0 means that the new design is ineffective).
◦ Alternative hypothesis: Ha: μ > μ0 (so that rejection of H0 in favor of Ha means that the new design should be adopted).
◦ Test statistic value: z = (x̄ − μ0)/(σ/√n) = (x̄ − 20000)/(1500/√n) (because the samples arise from a normal distribution, so that the sample mean X̄ is also normally distributed).
◦ α = 0.01.
◦ Rejection region: according to Table 8.1, the rejection region described by z ≥ zα is adopted, where α = 0.01 and zα = z0.01 = 2.33 (⇐ P{Z ≤ zα} = 1 − α = 1 − 0.01 = 0.99; check Table 5.1); that is, H0 will be rejected if z ≥ 2.33.
◦ Now the sample values yield
z = (x̄ − 20000)/(1500/√n) = (20758 − 20000)/(1500/√16) ≈ 2.02.
◦ Since z = 2.02 < 2.33 = z0.01 (i.e., z does not fall in the rejection region), H0 cannot be rejected at the significance level α = 0.01, meaning that the data do not provide sufficient evidence that the new design yields tires with an average life exceeding 20,000 miles.
◦ So the new design must be abandoned according to this level 0.01 upper-tailed hypothesis test.
◦ Here, the probability of the type I error is α = 0.01, and the probability of the type II error, say for a true μ′ = 21,000, is
β(μ′) = β(21000) = Φ(zα + (μ0 − μ′)/(σ/√n))
= Φ(2.33 + (20000 − 21000)/(1500/√16))
≈ Φ(−0.34) ≈ 0.3669.
• Tests about a population mean --- case II: a normal population with unknown standard deviation ---
◦ Problem definition ---
In real cases, the standard deviation is usually unknown in hypothesis testing about the mean of a normal population, so the above testing procedure must be modified for more realistic applications.
◦ Reasoning of the solution ---
▪ Just as in inferring the confidence interval for this kind of population with an unknown standard deviation, the (Student's) t distribution, instead of the unit normal distribution, should be used here.
▪ The reasoning process is the same as for Case I above and is omitted; only the final results, analogous to those in Table 8.1, are listed in Table 8.2.
▪ The hypothesis testing here is usually called the t test.
▪ Note that unlike the case of the z test described previously, there is no closed-form formula for computing the type II error probability β here. An online calculator for computing β is available at the following website:
http://www.stat.uiowa.edu/%7Erlenth/Power/index.html. (W.A)
To use this calculator, select “one-sample t test (or paired t)” and click the “Run Selection” button to pop up a window; in the window, click the small gray squares at its right end to pop up dialogs, and fill in appropriate values of |μ0 − μ′| (denoted as “True |mu − mu_0|” in the window), σ (denoted as “sigma”), and n, where the value of σ must be guessed according to prior experience. Also, in the “Solve for” and “alpha” dialogs, select the choice of n and fill in the value of α. A value of “power,” meaning 1 − β(μ′), then appears automatically.
▪ Usually, a conservative (large) guessed value of σ will yield a conservative (large) value of β(μ′), and so a conservative estimate of the sample size n necessary for the prescribed α and β(μ′).
▪ There is no closed form for computing the sample size n for fixed values of α and β, either. An online calculator for computing n is available at the following website:
http://www.objectivedoe.com/student/Shared/SSCalculators/sample.php (W.B)
where the “standard deviation” is the σ above, the “difference to detect” is the value |μ0 − μ′|, the “confidence” is the value of 1 − α, and the “power” again is the value of 1 − β. Also, the “Test type” should be set to “Compare a Mean to a Standard.” The sample size generated after clicking the “Submit” button is the value n − 1. (A note: site (W.A) may also be used for the same purpose, but its results will differ slightly from those obtained at (W.B) due to computational accuracy.)
▪ For the above-mentioned computations of the value β for a fixed sample size n, as well as of the sample size n for a fixed β, certain graphs called “curves of β = P{type II error} for t tests,” “OC (operating-characteristic) curves for t tests,” etc., may also be used. Such graphs are not included here but may be found in a reference book like:
Edward R. Dougherty, Probability and Statistics for the Engineering, Computing and Physical Sciences, Prentice-Hall, Englewood Cliffs, NJ, USA, 1990.
• Example 8.13 ---
A company wants to buy a new building at a location not far from an old one. A shuttle bus is operated for 10 trial runs between the two buildings during working hours to test the travel time. Any new building that takes more than an average of 20 minutes to reach is excluded from consideration. A level 0.05 t test about the average bus travel time μ, with H0: μ = 20 (minutes) and Ha: μ > 20, was applied to decide whether the new building should be bought. However, the 10 trial shuttle bus runs yielded a sample mean value x̄ not large enough to reject H0: μ = 20, raising the risk of a type II error. Compute the probability β of this error under the assumptions that the real mean is μ′ = 25 and that the standard deviation is σ = 5 according to prior evidence.
Solution:
◦ The t test is upper-tailed and has α (type I error probability) = 0.05.
◦ Also, the real mean is μ′ = 25 with σ = 5.
◦ After visiting the above-mentioned website (W.A); filling in the values σ = 5, |μ′ − μ0| = |25 − 20| = 5, n = 10, and α = 0.05; selecting the choice of n in the “Solve for” dialog; and removing the default selection of “Two-sided” (since we are using a one-sided t test), a “power” value of 0.8975 appears automatically, which means 1 − β.
◦ Therefore, the value β may be computed as 1 − 0.8975 = 0.1025 ≈ 10%.
◦ This probability, the risk of not rejecting H0 when H0 is false, is too high (which means: the new building is probably too far away or in a traffic-busy area, so that the shuttle bus needs more than 20 minutes to travel between the two buildings; however, the 10 test runs of the shuttle bus have a probability of nearly 10% of failing to detect this fact).
◦ Therefore, the company wants to conduct more shuttle bus runs to lower this probability to the level of 5% (i.e., β = 0.05, or equivalently, power = 1 − β = 0.95).
◦ After visiting website (W.B) above and submitting the values of the “standard deviation” σ = 5, the “difference to detect” |μ0 − μ′| = |20 − 25| = 5, and the “power” 1 − β = 0.95, we get the result n − 1 = 14.
◦ And so the desired sample size is n = 15.
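As a cross-check of the calculator result above, the power of a one-sample t test can also be computed directly: under μ = μ′, the statistic T = (X̄ − μ0)/(S/√n) follows a noncentral t distribution with n − 1 degrees of freedom and noncentrality parameter δ = (μ′ − μ0)/(σ/√n). The following minimal Python sketch (our own illustration, assuming SciPy is available) reproduces the power value of Example 8.13.

    from scipy import stats

    def t_test_power(mu0, mu_prime, sigma, n, alpha):
        """Power (1 - beta) of an upper-tailed one-sample t test of H0: mu = mu0."""
        df = n - 1
        t_crit = stats.t.ppf(1 - alpha, df)            # rejection threshold t_{alpha, n-1}
        delta = (mu_prime - mu0) / (sigma / n ** 0.5)  # noncentrality parameter
        # Under mu = mu', T is noncentral t; beta is the mass below the threshold.
        return 1 - stats.nct.cdf(t_crit, df, delta)

    # Example 8.13: mu0 = 20, mu' = 25, sigma = 5, n = 10, alpha = 0.05.
    print(t_test_power(20, 25, 5, 10, 0.05))   # about 0.897, so beta is about 0.10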
Table 8.2 Level α t test of the mean of a normal population with unknown standard deviation.
Null hypothesis: H0: μ = μ0. Test statistic: T = (X̄ − μ0)/(S/√n).
◦ Upper-tailed test --- alternative hypothesis Ha: μ > μ0; rejection region t ≥ tα,n−1; type I error prob. α.
◦ Lower-tailed test --- Ha: μ < μ0; rejection region t ≤ −tα,n−1; type I error prob. α.
◦ Two-tailed test --- Ha: μ ≠ μ0; rejection region t ≥ tα/2,n−1 or t ≤ −tα/2,n−1; type I error prob. α.
◦ Type II error prob. β: no closed-form solution for any of the three tests; use the online calculators at the websites (W.A) http://www.stat.uiowa.edu/%7Erlenth/Power/index.html and (W.B) http://www.objectivedoe.com/student/Shared/SSCalculators/sample.php.
• Tests about a population mean --- case III: use of large samples ---
◦ Idea ---
When the sample size is large (n > 30), the z test of Case I may be easily modified to yield a test procedure requiring neither a normal population distribution nor a known σ, just like the corresponding large-sample case of confidence interval inference.
◦ Reasoning of the solution ---
▪ When n → ∞ (n > 30 in practice), the sample standard deviation S → σ, the real standard deviation of the population distribution, as mentioned previously when we discussed confidence interval inference for large samples of any population.
▪ Therefore, simply changing the test statistic Z = (X̄ − μ0)/(σ/√n) of Case I to Z = (X̄ − μ0)/(S/√n) suffices for use here.
▪ All the details of the previously discussed z tests then apply, as long as every σ in Table 8.1 is replaced by the sample standard deviation value s.
• Example 8.14 ---
An automobile company recommends that any purchaser of one of its new cars bring it in to a dealer for a 3000-mile checkup. The company wishes to know whether the true average mileage at initial servicing differs from 3000. A random sample of 50 recent purchasers resulted in a sample average mileage of 3208 miles and a sample standard deviation of 273 miles. Do the data strongly suggest that the true average mileage for this checkup is something other than the recommended value? State and test the relevant hypotheses using the significance level 0.01.
Solution:
◦ Let μ = the true average mileage of cars brought to the dealer for the 3000-mile checkup.
◦ Hypotheses: H0: μ = 3000; Ha: μ ≠ 3000.
◦ A large-sample two-tailed z test may be used here because the sample size n = 50 > 30.
◦ Test statistic value: z = (x̄ − μ0)/(s/√n) = (x̄ − 3000)/(s/√n).
◦ Since α = 0.01, zα/2 = z0.005 = 2.58. This means that for the desired two-tailed z test, H0 should be rejected if z ≥ 2.58 or z ≤ −2.58.
◦ Now, z = (x̄ − 3000)/(s/√n) = (3208 − 3000)/(273/√50) ≈ 5.39 > 2.58, so H0 is rejected.
◦ This means that at the significance level α = 0.01, the data do strongly suggest that the true average initial-checkup mileage differs from the manufacturer's recommended value of 3000.
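A minimal Python sketch (our own illustration, assuming SciPy is available) of this large-sample two-tailed z test follows; the two-sided P-value line is an extra convenience not used in the text.

    from scipy.stats import norm

    # Example 8.14: large-sample two-tailed z test with sigma replaced by s.
    mu0, x_bar, s, n, alpha = 3000, 3208, 273, 50, 0.01

    z = (x_bar - mu0) / (s / n ** 0.5)       # about 5.39
    z_half = norm.ppf(1 - alpha / 2)         # z_{alpha/2} = z_0.005, about 2.58
    print(z, z >= z_half or z <= -z_half)    # True -> reject H0
    print(2 * (1 - norm.cdf(abs(z))))        # two-sided P-value, far below 0.01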
◦ Determination of β and of the sample size n --- may be conducted as in either of the above two cases (Case I and Case II).
• Tests about a population proportion --- case I: use of large samples ---
◦ Reasoning ---
▪ Just like the reasoning used to derive the test statistics for the previous cases of inferring the population mean, according to (8.16), repeated in the following:
P{−zα/2 ≤ (p̂ − p)/√(p(1 − p)/n) ≤ zα/2} ≈ 1 − α, (8.16)
where p and √(p(1 − p)/n) are the mean and standard deviation of the sample proportion p̂, we may take the following test statistic for the current case of inferring the population proportion:
Z0 = (p̂ − p0)/√(p0(1 − p0)/n)
where p0 is the population proportion value to be tested.
▪ Omitting the detailed reasoning process, which parallels that for the cases of testing the population mean, we present the results as a summary in Table 8.3 for direct use in applications.
▪ A rule of thumb for choosing n so that the test is valid is: np0 ≥ 5 and n(1 − p0) ≥ 5.
▪ Formulas for computing the β value for a specific p = p′ and the sample size n for fixed α and β are also listed in Table 8.3. The derivations of these formulas are like those for the corresponding contents of Table 8.1 and so are omitted.
• Example 8.15 ---
Suppose that 1000 voters were interviewed about whom they voted for as city mayor. Of the 1000 voters, 550 reported that they voted for the democratic candidate. Is there sufficient evidence to suggest that the democratic candidate will win the election, at the 0.01 level?
Solution:
◦ To win the election, the proportion of votes for the democratic candidate should be larger than p = 0.5.
◦ Set up the test as H0: p = p0 = 0.5; Ha: p > p0 = 0.5, so that rejecting H0 means that the democratic candidate wins.
◦ The significance level of the test is α = 0.01, so that zα = z0.01 = 2.33. The estimated proportion is p̂ = 550/1000 = 0.55.
◦ n = 1000 is large (with np0 = n(1 − p0) = 500 ≥ 5), so we can use the large-sample approach discussed above.
◦ The test statistic value is:
z0 = (p̂ − p0)/√(p0(1 − p0)/n) = (0.55 − 0.5)/√(0.5(1 − 0.5)/1000) ≈ 3.16.
◦ Since z0 = 3.16 > 2.33 = zα, we reject H0, which means that the democratic candidate is predicted to win the election at the significance level α = 0.01.
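The following minimal Python sketch (our own illustration, assuming SciPy is available) reproduces this large-sample proportion test.

    from scipy.stats import norm

    # Example 8.15: upper-tailed large-sample test of H0: p = 0.5 vs Ha: p > 0.5.
    p0, n, successes, alpha = 0.5, 1000, 550, 0.01

    p_hat = successes / n
    z0 = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # about 3.16
    z_alpha = norm.ppf(1 - alpha)                    # about 2.33
    print(z0, z0 >= z_alpha)                         # True -> reject H0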
Table 8.3 Large-sample level α z test of a population proportion.
Null hypothesis: H0: p = p0. Test statistic: Z = (p̂ − p0)/√(p0(1 − p0)/n).
◦ Upper-tailed test --- alternative hypothesis Ha: p > p0; rejection region z ≥ zα; type I error prob. α;
type II error prob. β(p′) = Φ[(p0 − p′ + zα√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
sample size n for fixed α and β: n = [(zα√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
◦ Lower-tailed test --- Ha: p < p0; rejection region z ≤ −zα; type I error prob. α;
β(p′) = 1 − Φ[(p0 − p′ − zα√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
n = [(zα√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
◦ Two-tailed test --- Ha: p ≠ p0; rejection region z ≥ zα/2 or z ≤ −zα/2; type I error prob. α;
β(p′) = Φ[(p0 − p′ + zα/2√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)] − Φ[(p0 − p′ − zα/2√(p0(1 − p0)/n)) / √(p′(1 − p′)/n)];
n = [(zα/2√(p0(1 − p0)) + zβ√(p′(1 − p′))) / (p′ − p0)]².
• Tests about a population proportion --- case II: use of small samples ---
◦ Reasoning ---
▪ When the sample size is small, the test may be based directly on the binomial distribution rather than the normal distribution, as discussed in the following.
▪ Again, let p denote the proportion of “successes” in a population.
▪ Given a random sample X1, X2, ..., Xn of size n arising from the population, then as discussed previously, the sample total To is a binomial random variable specifying the number of successes in the sample, with parameters (n, p), the samples X1, X2, ..., Xn all being Bernoulli random variables.
▪ From Fact 4.7 (with X = To a binomial random variable), we have E[To] = np and Var(To) = np(1 − p); and from Example 8.2, an unbiased estimator of p is the sample proportion p̂ = To/n.
▪ Let H0: p = p0 and Ha: p > p0, where p0 is a specific proportion value to be tested; then a feasible test statistic for these hypotheses is just the sample total
To = X1 + X2 + … + Xn
with rejection region to > c, where to is the sample total value and c is a pre-selected constant.
▪ That is, we reject H0 if to > c, and accept H0 if to ≤ c.
▪ To compute the type I and II errors, we first define a power function as follows:
π(p) = P{To > c | p} = 1 − P{To ≤ c | p} = 1 − B(c; n, p)
where, as defined previously,
B(c; n, p) = Σ_{i=0}^{c} C(n, i) p^i (1 − p)^{n−i},
whose values may be found at the following website, which has been mentioned previously:
http://webcache.googleusercontent.com/search?q=cache:kHbpn3dKAIYJ:www.statisticshowto.com/tables/binomial-distribution-table/+binomial+distribution+table&cd=10&hl=zh-TW&ct=clnk&gl=tw. (W.C)
▪ Then the type I error probability, which is also the significance level α of the test, is
α = π(p0) = P{To > c | p0} = 1 − B(c; n, p0). (8.23)
▪ And the type II error probability β(p′) for a specific p′ > p0 is
β(p′) = P{To ≤ c | p′} = B(c; n, p′). (8.24)
▪ It is, as usual, desired to find a value of c such that α equals a popularly used value such as 0.05 or 0.01, for various n and p0. This requires the use of a binomial distribution table.
▪ However, since To is discrete, it will not always be possible to construct a test with a prescribed significance level α exactly.
▪ For example, if n = 10, p0 = 1/2, and the desired α = 0.05, then from (8.23) we need B(c; n, p0) = 0.95; and from the binomial distribution table mentioned above, we get c ≈ 7, with the real α value being 1 − 0.945 = 0.055 instead of 0.05. If n = 20 with the other parameters unchanged, then c ≈ 13 with real α = 1 − 0.942 = 0.058; and if n = 25, c ≈ 16 with real α = 1 − 0.946 = 0.054.
▪ The above discussions are about the upper-tailed test; the lower-tailed and two-tailed tests can be derived similarly (the details are omitted). A summary is given in Table 8.4.
▪ For more discussions, see the following reference:
Jay L. Devore, Probability and Statistics for the Engineering and the Sciences, 2nd ed., Brooks/Cole Publishing Co., Monterey, CA, USA, 1987.
• Example 8.16 ---
Assume that the natural recovery rate p0 for a certain disease is 40%. A new drug is developed, and to test the manufacturer's claim that it boosts the recovery rate, a random sample of n = 10 patients is given the drug and the total number to of recoveries is recorded. Find the “cutoff value” c that produces a test of H0: p = p0 against Ha: p > p0 at an approximate significance level of α = 0.05, and compute the corresponding type II error probability β for p′ = 0.7.
Solution:
◦ According to (8.23) or Table 8.4, the test with the rejection region described by to > c has a significance level of 0.05 when c satisfies
0.05 = α = 1 − B(c; n, p0) = 1 − B(c; 10, 0.4).
◦ From the binomial distribution table given at website (W.C), we get
B(6; 10, 0.4) = 0.945 < 0.95 < 0.988 = B(7; 10, 0.4).
◦ That is, the choice c = 6 yields α = 1 − 0.945 = 0.055, and the choice c = 7 yields α = 1 − 0.988 = 0.012.
◦ Therefore, for α to be 0.05 or smaller, we have to choose c = 7.
◦ And according to (8.24), the corresponding type II error probability for p′ = 0.7 is
β(p′) = β(0.7) = P{To ≤ 7 | p′ = 0.7} = B(7; 10, 0.7) = 0.617,
which is quite large because of the small sample size n = 10.
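The table lookups in this example are easy to reproduce with the binomial CDF; below is a minimal Python sketch (our own illustration, assuming SciPy is available) that finds the cutoff c and the resulting α and β.

    from scipy.stats import binom

    # Example 8.16: small-sample test of H0: p = 0.4 vs Ha: p > 0.4 with n = 10.
    n, p0, p_prime, alpha_target = 10, 0.4, 0.7, 0.05

    # Smallest cutoff c whose actual significance level does not exceed the target.
    c = next(c for c in range(n + 1) if 1 - binom.cdf(c, n, p0) <= alpha_target)
    alpha = 1 - binom.cdf(c, n, p0)    # actual type I error probability
    beta = binom.cdf(c, n, p_prime)    # type II error probability B(c; n, p')
    print(c, alpha, beta)              # 7, about 0.012, about 0.617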
Table 8.4 Small-sample test of a population proportion.
Null hypothesis: H0: p = p0. Test statistic: To = X1 + X2 + … + Xn.
◦ Upper-tailed test --- alternative hypothesis Ha: p > p0; rejection region to > c;
type I error prob.: P{To > c | p0} = 1 − B(c; n, p0);
type II error prob. β(p′): P{To ≤ c | p′} = B(c; n, p′).
◦ Lower-tailed test --- Ha: p < p0; rejection region to ≤ c;
type I error prob.: P{To ≤ c | p0} = B(c; n, p0);
type II error prob.: P{To > c | p′} = 1 − B(c; n, p′).
◦ Two-tailed test --- Ha: p ≠ p0; rejection region to > b or to ≤ a;
type I error prob.: 1 − B(b; n, p0) + B(a; n, p0);
type II error prob.: P{a < To ≤ b | p′} = B(b; n, p′) − B(a; n, p′).
• Bayes test for two-class pattern classification ---
◦ Idea ---
▪ In many applications, we encounter the following hypothesis testing problem for two-class pattern classification:
H0: x ∈ ω0; Ha: x ∈ ωa
where ω0 and ωa are two classes of patterns (like those described in Examples 3.6 through 3.8).
▪ The following example illustrates how this problem can be formulated as a hypothesis testing problem.
• Example 8.17 (Example 3.8 revisited) ---
In a class of 20 female and 80 male students, 4 of the female students and 8 of the male students wear glasses. Now, if a student is observed to wear glasses, how will you decide the sex of the student? Solve the problem from the viewpoint of hypothesis testing.
Solution:
◦ Regard the feature of “wearing glasses” as a random variable X described as follows:
X = 1 if an observed student wears glasses; X = 0 otherwise.
◦ Let the sex be denoted as a parameter Θ with two values as follows:
Θ = θ1 = 1 for the male students and Θ = θ2 = 0 for the female students.
◦ Based on the given data, the random variable X has a conditional pmf pX|Θ(x|θ) = P{X = x | Θ = θ} with its values given by:
pX|Θ(1|1) = P{X = 1 | θ1} = 8/80; pX|Θ(0|1) = P{X = 0 | θ1} = 72/80;
pX|Θ(1|0) = P{X = 1 | θ2} = 4/20; pX|Θ(0|0) = P{X = 0 | θ2} = 16/20.
◦ We also have the following a priori (class) probabilities pΘ for Θ:
pΘ(1) = 80/(20 + 80) = 80/100; pΘ(0) = 20/100.
◦ Let ω0 = the group of male students and ωa = the group of female students (so that ω0 corresponds to θ1 and ωa to θ2).
◦ The pattern classification problem now can be formulated as a hypothesis testing problem as follows:
H0: x ∈ ω0; Ha: x ∈ ωa;
or equivalently,
H0: Θ = θ1 = 1; Ha: Θ = θ2 = 0.
◦ The a posteriori probability of a value θ (a class) given an observed sample value x of X is the conditional pmf
pΘ|X(θ|x) = P{Θ = θ | X = x}.
A larger value of pΘ|X(θ|x) means that x is more likely to come from the class with the parameter θ.
◦ In terms of the notations defined above, the Bayes formula described by (3.4) says:
pΘ|X(θ|x) = pX(x|θ)pΘ(θ)/[pX(x|θ)pΘ(θ) + pX(x|θᶜ)pΘ(θᶜ)] (8.25)
where θᶜ is the complement of θ (e.g., if θ = θ1, then θᶜ = θ2 for the case here).
◦ The test statistic value in the Bayes sense is taken to be the ratio of the a posteriori probabilities:
b = pΘ|X(θ1|x)/pΘ|X(θ2|x) (8.26)
where x is the previously mentioned observed sample value of X.
◦ And the rejection region is described by b < 1, because if on the contrary b > 1, then pΘ|X(θ1|x) > pΘ|X(θ2|x), meaning that it is more likely that x comes from the class with the parameter θ1 (class ω0) than from the other class (class ωa).
◦ By the Bayes formula, the above test statistic may be transformed (with the details left as an exercise) into
b = [pX(x|θ1)pΘ(θ1)]/[pX(x|θ2)pΘ(θ2)] (8.27)
where the a priori (class) probabilities and the conditional pmf's are used.
◦ The rejection region then becomes the one which includes all the x satisfying
b = [pX(x|θ1)pΘ(θ1)]/[pX(x|θ2)pΘ(θ2)] < 1. (8.28)
◦ Now, the observed X value is x = 1 (wearing glasses), which, after being substituted into (8.27) above, leads to the following values:
pX(1|θ1)pΘ(θ1) = (8/80)(80/100) = 8/100;
pX(1|θ2)pΘ(θ2) = (4/20)(20/100) = 4/100;
and the following inequality:
b = pX(1|θ1)pΘ(θ1)/[pX(1|θ2)pΘ(θ2)] = (8/100)/(4/100) = 2 > 1,
which means that b for x = 1 is not in the rejection region.
◦ Therefore, H0 is not rejected, and so the decision is that “the student is male.”
◦ A note: when the tie b = pX(x|θ1)pΘ(θ1)/[pX(x|θ2)pΘ(θ2)] = 1 is encountered, the choice may be arbitrary.
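A minimal Python sketch (our own illustration; the names and data structures are ours) of this discrete Bayes test follows; it reproduces the decision of Example 8.17 for both feature values.

    # Example 8.17 as a discrete Bayes test: class 1 = male (theta1), class 2 = female (theta2).
    prior = {1: 80 / 100, 2: 20 / 100}              # a priori class probabilities
    pmf = {1: {1: 8 / 80, 0: 72 / 80},              # p(x | theta1): glasses or not
           2: {1: 4 / 20, 0: 16 / 20}}              # p(x | theta2)

    def bayes_decide(x):
        """Return the class whose likelihood-times-prior product is larger."""
        b = (pmf[1][x] * prior[1]) / (pmf[2][x] * prior[2])  # test statistic of (8.27)
        return 1 if b >= 1 else 2                            # b < 1 is the rejection region

    print(bayes_decide(1))   # 1 -> "male" (b = 2, H0 not rejected)
    print(bayes_decide(0))   # 1 -> "male" (b = (72/80 * 0.8)/(16/20 * 0.2) = 4.5)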
• Formal description of the Bayes test for two-class pattern classification ---
◦ The above example may be generalized to formulate the Bayes test, a type of hypothesis testing based on the Bayes formula, for a two-class pattern classification problem.
◦ Let X1, X2, ..., Xn be n random variables, called features, for describing patterns (not necessarily iid), with parameter Θ, which is also a random variable, taking two discrete values θ1 and θ2 representing the two classes of patterns, ω1 and ω2 (corresponding to the ω0 and ωa in the above example).
◦ Let the random vector X denote the n feature random variables, notationally defined as X = [X1, X2, …, Xn]ᵗ with t meaning vector transpose.
◦ Then the bold notation x means a sample vector x = [x1, x2, …, xn]ᵗ of X.
◦ Let the pmf of Θ be denoted as p(θ), and let the conditional pdf (or pmf) of X given Θ = θ be denoted as
f(x|θ)
(note: unlike in the last example, we drop the subscripts of p and f because no ambiguity will arise here).
◦ The hypotheses are taken to be:
H0: Θ = θ1 (or x ∈ ω1); Ha: Θ = θ2 (or x ∈ ω2).
◦ The Bayes test statistic, according to (8.27) above, is:
B = [f(X|θ1)p(θ1)]/[f(X|θ2)p(θ2)] (8.29)
(note: the first step of using the a posteriori probabilities as in the last example is skipped here).
◦ And the rejection region, according to (8.28), now includes all x satisfying:
b = [f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)] < 1. (8.30)
◦ In the pattern recognition area, the above rejection region means equivalently the following Bayes decision rule:
if f(x|θ1)p(θ1) > f(x|θ2)p(θ2), then classify x as from ω1 (Ha rejected); otherwise, as from ω2 (H0 rejected).
◦ The equation
[f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)] = 1,
or equivalently,
f(x|θ1)p(θ1) = f(x|θ2)p(θ2), (8.31)
is called the decision boundary of the classifier, which divides the pattern feature space into two mutually exclusive regions, one of which is the rejection region.
◦ Notes ---
▪ The above test is called the Bayes test with the minimum error probability (or simply with the minimum error), in contrast to another Bayes test, the one with the minimum cost, which is beyond our discussion here (it is taught in detail in courses on pattern recognition or mathematical statistics).
▪ There are also several other similar tests, like the likelihood-ratio test, the Neyman-Pearson test, the minimax test, etc. For more discussions on these tests, the corresponding type I and II errors, and other related topics, see the following reference book, for example:
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, San Diego, CA, USA, 1990.
• Example 8.18 (Bayes test for two-class pattern classification) ---
You are given two classes of patterns ω1 and ω2 which have equal a priori class probabilities 1/2 and are described by two normal random variables X1 and X2 with equal variance σ² but different means μ1 and μ2, respectively. Now, given an observed (sample) value x, use the Bayes test with the minimum error to classify x (i.e., to see whether x comes from X1 with parameters (μ1, σ²) or from X2 with parameters (μ2, σ²)). Derive the test statistic B. If the identical variance is σ² = 4 and the means are μ1 = 0 and μ2 = 2, derive the rejection region, the decision boundary (simplified as much as possible), and the classifier (described as a rule) for the two pattern classes.
Solution:
◦ Here, the hypotheses are: H0: Θ = θ1 (x ∈ ω1) and Ha: Θ = θ2 (x ∈ ω2).
◦ The conditional normal pdf's for the two classes ω1 and ω2 are:
f(x|θ1) = (1/(√(2π)σ)) e^{−(x − μ1)²/(2σ²)}, −∞ < x < ∞;
f(x|θ2) = (1/(√(2π)σ)) e^{−(x − μ2)²/(2σ²)}, −∞ < x < ∞.
◦ The a priori (class) probabilities of the two classes are p(θ1) = p(ω1) ≡ p1 = 1/2 and p(θ2) = p(ω2) ≡ p2 = 1/2.
◦ The Bayes test statistic value is
b = [f(x|θ1)p(θ1)]/[f(x|θ2)p(θ2)]
= [(1/(√(2π)σ)) e^{−(x − μ1)²/(2σ²)} p1] / [(1/(√(2π)σ)) e^{−(x − μ2)²/(2σ²)} p2]
= (p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)}.
◦ The rejection region is described by
b = (p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)} < 1.
◦ The decision boundary is described by
(p1/p2) e^{[−(x − μ1)² + (x − μ2)²]/(2σ²)} = 1,
which may be simplified by taking the natural logarithm ln to be
ln(p1/p2) + [−(x − μ1)² + (x − μ2)²]/(2σ²) = 0,
or equivalently,
(μ2 − μ1)x/σ² + ½(μ1² − μ2²)/σ² = ln(p1/p2).
◦ With the given values p1 = p2 = 1/2, σ² = 4, μ1 = 0, and μ2 = 2 substituted into the above formula, the decision boundary equation becomes
(2 − 0)x/4 + ½(0² − 2²)/4 = x/2 − 1/2 = ln(0.5/0.5) = ln 1 = 0,
or equivalently,
x = 1,
which is just the middle point between the two means μ1 = 0 and μ2 = 2 (a property which is always true for two 1-dimensional normally distributed feature classes with equal variances and equal a priori probabilities, as can be proved; do this by yourself).
◦ The above result also means that the rejection region is simplified to include those x which satisfy x > 1.
◦ Therefore, the pattern classifier, in rule form, is:
“if x < 1, then x ∈ ω1; otherwise, x ∈ ω2”
(where “x ∈ ω” means “assign or classify x as from class ω”).
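To close, here is a minimal Python sketch (our own illustration, assuming SciPy is available) of this minimum-error Bayes classifier for the numbers of Example 8.18; it confirms the decision boundary at x = 1.

    from scipy.stats import norm

    # Example 8.18: equal priors, equal variance sigma^2 = 4, means mu1 = 0 and mu2 = 2.
    p1 = p2 = 0.5
    mu1, mu2, sigma = 0.0, 2.0, 2.0

    def classify(x):
        """Bayes decision rule: pick the class with the larger f(x|theta) * p(theta)."""
        g1 = norm.pdf(x, mu1, sigma) * p1
        g2 = norm.pdf(x, mu2, sigma) * p2
        return 1 if g1 >= g2 else 2

    print([classify(x) for x in (-1.0, 0.5, 0.99, 1.01, 3.0)])   # [1, 1, 1, 2, 2]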