MSc QT: Statistics
Part II Statistical Inference
(Weeks 3 and 4)
Sotiris Migkos
Department of Economics, Mathematics and Statistics
Malet Street, London WC1E 7HX
September 2015
MSc Economics & MSc Financial Economics (FT & PT2)
MSc Finance & MSc Financial Risk Management (FT & PT2)
MSc Financial Engineering (FT & PT1)
PG Certificate in Econometrics (PT1)
Contents

Introduction

1 Sampling Distributions
    Literature
    1.1 Introduction
    1.2 Sampling Distributions
    1.3 Sampling Distributions Derived from the Normal
        1.3.1 Chi-Square
        1.3.2 Student-t
        1.3.3 F-distribution
    Problems
    Proofs

2 Large Sample Theory
    Literature
    2.1 Law of Large Numbers
    2.2 The Central Limit Theorem
    2.3 The Normal Approximation to the Binomial Distribution
    Problems
    Proofs

3 Estimation
    Literature
    3.1 Introduction
    3.2 Evaluation Criteria for Estimators
    3.3 Confidence Intervals
    Problems

4 Hypothesis Testing
    Literature
    4.1 Introduction
    4.2 The Elements of a Statistical Test
    4.3 Duality of Hypothesis Testing and Confidence Intervals
    4.4 Attained Significance Levels: P-Values
    4.5 Power of the Test
    Problems

A Exercise Solutions
    A.1 Sampling Distributions
    A.2 Large Sample Theory
    A.3 Estimation
    A.4 Hypothesis Testing
Introduction
Course Content
This part of the course consists of five lectures, followed by a closed book exam. The topics covered
are
1. Sampling distributions.
2. Large sample theory.
3. Estimation.
4. Hypothesis Testing.
The last lecture will cover exercises based on the topics above.
Textbooks
Lecture notes are provided; however, these notes are not a substitute for a textbook. The required
textbook for this part of the course is:
• Wackerly, D., Mendenhall, W. and Schaeffer, R. (2008). Mathematical Statistics with Applications, 7th ed., Cengage. (Henceforth WMS)
Students who desire a more advanced treatment of the materials might want to consider:
• Casella, G. and Berger, R. (2008). Statistical Inference, 2nd ed., Duxbury press. (Henceforth
CB)
• Rice, J. (2006). Mathematical Statistics and Data Analysis, 3rd. ed., Cengage. (Henceforth
R)
Furthermore, the following books are recommended for students that plan to take further courses
in econometrics. The appendices of these books also contain summaries of the material covered in
this class.
• Greene, W. (2011). Econometric Analysis, 7th ed., Prentice-Hall. (Henceforth G)
• Verbeek, M. (2012). A Guide to Modern Econometrics, 4th ed., Wiley. (Henceforth V)
Online Resources
The primary resources for this part of the course are contained in this syllabus. However, further
resources can be found online at www.ems.bbk.ac.uk/for_students/presess/ and on the
course page in the virtual learning environment Moodle (login via moodle.bbk.ac.uk/ ).
Instructor
The instructor for this part of the course is
• Sotiris Migkos, [email protected]
Chapter 1
Sampling Distributions
Literature
Required Reading
• WMS, Chapter 7.1 – 7.2
Recommended Further Reading
• WMS, Chapters 4 and 12
• CB, Chapter 5.1 – 5.4
• R, Chapters 6 – 7
1.1 Introduction
A statistical investigation normally starts with some measures of interest of a distribution. The
totality of elements about which some information is desired is called a population. Often we only
use a small proportion of a population, known as a sample, because it is impractical to gather data
on the whole population. We measure the attributes of this sample and draw conclusions or make
policy decisions based on the data obtained. That is, with statistical inference we estimate the
unknown parameters underlying the statistical distributions of the sample. We can then measure
their precision, test hypotheses about them, and use them to generate forecasts.
Definition 1.1 Population
A population (of size N), x1 , x2 . . . , xN is the totality of elements that we are interested in. The
numerical characteristics of a population are called parameters. Parameters are often denoted
by Greek letters such as θ.
Definition 1.2 Sample
A sample (of size n) is a set of random variables, X1 , X2 , . . . , Xn , that are drawn from the
population. The realization of the sample is denoted by x1 , . . . , xn .
The method of sampling, known sometimes as the design of the experiment, will affect the
structure of the data that you measure, and thus the amount of information and the likelihood of
observing a certain sample outcome. The type of sample you collect may have profound effects on
the way you can make inferences based on that sample. For the moment we will concern ourselves
only with the most basic of sampling methods: simple random sampling.
Definition 1.3 Random Sample
The random variables X1 , · · · , Xn are called a random sample of size n from the population
f (x) if X1 , · · · , Xn are mutually independent random variables and the marginal pdf or pmf
of each Xi is the same function f (x). Alternatively, X1 , · · · , Xn are called independent and
identically distributed random variables with pdf or pmf f (x). This is commonly abbreviated to iid
random variables.
The joint density of the realized xi ’s in a random sample has the form:

f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)   (by independence)
                        = \prod_{i=1}^{n} f(x_i)          (by identical distribution).   (1.1)
Of course, in economics and finance one normally does not have much control on how the data is
collected and the data at hand is often time-series data, which is in most cases neither independent
nor identically distributed. Although addressing these issues is extremely important in empirical
analysis, this course will ignore such considerations to focus on the basic issues.
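As an illustrative aside (not part of the original notes), the following minimal Python sketch draws an iid random sample in the sense of Definition 1.3 and reports its realization; NumPy is assumed to be available, and the population values µ = 100, σ = 3, n = 9 are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)          # reproducible random number generator

# Hypothetical population parameters (not taken from the notes)
mu, sigma, n = 100.0, 3.0, 9

# X1, ..., Xn: an iid random sample from N(mu, sigma^2)
sample = rng.normal(loc=mu, scale=sigma, size=n)

print("realized sample x1,...,xn:", np.round(sample, 2))
print("sample mean:", sample.mean())
print("sample variance (n - 1 divisor):", sample.var(ddof=1))
```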
1.2 Sampling Distributions
When drawing a sample from a population, a researcher is normally interested in reducing the data
into some summary measures. Any well-defined measure may be expressed as a function of the
realized values of the sample. As the function will be based on a vector of random variables, the
function itself, called a statistic, will be a random variable as well.
Definition 1.4 Statistic and Sampling Distribution
Let X1 , . . . , Xn be a sample of size n and let T (x1 , . . . , xn ) be a real-valued or vector-valued function whose domain includes the sample space of (X1 , . . . , Xn ) and that does not involve any unknown parameters. Then the random variable Y = T (X1 , . . . , Xn ) is called a statistic.
The probability distribution of a statistic is called the sampling distribution of Y.
The analysis of these statistics and their sampling distributions is at the very core of econometrics. As the definition of a statistic is very broad, it can include a wide range of different measures. The two most common statistics are probably the sample mean X and the sample variance S 2 . Other
examples include order statistics such as the smallest observation in the sample, X(1) , the largest
observation in the sample, X(n) , and the median, X(n/2) ; correlations, Corr(X, Y), and covariances,
Cov(X, Y), between two sequences of random variables are also common statistics. Statistics do
not need to be scalar, but may also be vector-valued, returning for instance all the unique values
observed in the sample or all the order statistics of the sample.
Note the important difference between the sampling distribution which measures the probability
distribution of the statistic T (x1 , . . . , xn ) and the distribution of the population, which measures the
marginal distribution of each Xi .
The following two sections consider the sampling distributions of the two most important statistics, the sample mean and the sample variance, on the assumption that the sample is drawn from
a normal population. The main features of these sampling distributions are summarized by the
following theorem.
Theorem 1.1 The sample mean and the sample variance of a random normal sample have the
following three characteristics:
1. E[X] = µ, and X has the sampling distribution X ∼ N(µ, σ²/n),
2. E[S²] = σ², and S² has the sampling distribution (n − 1)S²/σ² ∼ χ²_{n−1},
3. X and S 2 are independent random variables.
As some of the most common statistics, such as the sample mean and sample total are linear
combinations of the individual sample points, the following theorem is of great value in determining
the sampling distribution of statistics.
Theorem 1.2 If X1 , . . . , Xn are random variables with defined means, E(Xi ) = µi , and defined
variances, Var(Xi ) = σi², then a linear combination of those random variables,

Z = a + \sum_{i=1}^{n} b_i X_i,   (1.2)

will have the following mean and variance:

E(Z) = a + \sum_{i=1}^{n} b_i E(X_i),   (1.3)

Var(Z) = \sum_{i=1}^{n} \sum_{j=1}^{n} b_i b_j Cov(X_i, X_j).   (1.4)

If the Xi are independent, the variance reduces to

Var(Z) = \sum_{i=1}^{n} b_i^2 Var(X_i).   (1.5)
Sample Mean
Corollary 1.1 If X1 , . . . , Xn is a random sample drawn from a population with mean µ and
variance σ², then, using Theorem 1.2, it can be shown that the mean of this sample,

\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i,   (1.6)

will have expectation

E(\bar{X}_n) = µ,   (1.7)

and variance

Var(\bar{X}_n) = σ²_{\bar{X}_n} = \frac{σ²}{n}.   (1.8)
Let us consider what a sampling distribution may look like. As an example take the case of
the sample mean \bar{X}_n of a random sample drawn from a normally distributed population. Combined
with the knowledge that linear combinations of normal variates are also normally distributed, the
sampling distribution of \bar{X}_n will be

\bar{X}_n ∼ N(µ, σ²/n).   (1.9)

We can now go one step further and calculate the standardized sample mean. Subtracting the
expected value, which is the population mean µ, and dividing by the standard error
creates a random variable with a standard normal distribution:

Z = \frac{\bar{X}_n − µ}{σ/\sqrt{n}} ∼ N(0, 1).   (1.10)

Of course, in reality one does not generally know σ, in which case it is common practice to
replace it with its sample counterpart S, which gives the following sampling distribution:

T = \frac{\bar{X}_n − µ}{S/\sqrt{n}} ∼ t_{n−1}.   (1.11)
The details on why the sampling distribution changes from a normal to a t-distribution are
discussed in the next section.
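To make the contrast between (1.10) and (1.11) concrete, here is a short Monte Carlo sketch (an illustrative addition, not from the notes; NumPy and SciPy assumed, and the values µ = 0, σ = 1, n = 5 are hypothetical). It simulates both standardized means and compares their tail probabilities with the normal and t benchmarks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 1.0, 5, 100_000   # a small n makes the difference visible

z_stats, t_stats = [], []
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    xbar = x.mean()
    z_stats.append((xbar - mu) / (sigma / np.sqrt(n)))          # known sigma  -> N(0,1)
    t_stats.append((xbar - mu) / (x.std(ddof=1) / np.sqrt(n)))  # estimated S -> t_{n-1}

z_stats, t_stats = np.array(z_stats), np.array(t_stats)
# Tail probabilities beyond 2: the S-based statistic has heavier tails
print("P(|Z| > 2) simulated:", np.mean(np.abs(z_stats) > 2),
      " theoretical:", 2 * stats.norm.sf(2))
print("P(|T| > 2) simulated:", np.mean(np.abs(t_stats) > 2),
      " theoretical:", 2 * stats.t.sf(2, df=n - 1))
```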
Sample Variance
Corollary 1.2 If X1 , . . . , Xn is a random sample drawn from a population with mean µ and
variance σ², then the sample variance

S_n^2 = (n − 1)^{-1} \sum_{i=1}^{n} (X_i − \bar{X}_n)^2   (1.12)

will have the following expectation:

E(S²) = σ².   (1.13)

Note that to calculate the sample variance we divide by n − 1 and not n; a proof of this is
provided at the end of this chapter.
If the sample is random and drawn from a normal population, then it can also be shown
that the sampling distribution is as follows:

(n − 1)\frac{S^2}{σ^2} ∼ χ²_{n−1}.   (1.14)
An intuition of this result is provided in WMS; the proof can be found in e.g. Casella and Berger,
chapter 5.
Finite Population Correction
As a short distraction, notice that if the whole population is sampled, the estimation error of the
sample mean will be, logically, equal to zero. Similarly, if a large proportion of the population
is sampled, without replacement, the standard error calculated above will over-estimate the true
standard error. In such cases, the standard error should be adjusted using a so-called finite population
correction. Taking the standard error of the sample mean as an example:
σ_{\bar{X}} = \left(1 − \frac{n − 1}{N − 1}\right) \frac{σ}{\sqrt{n}}.   (1.15)

When the sampling fraction n/N approaches zero, then the correction will approach 1. So for
most applications,

σ_{\bar{X}} ≈ \frac{σ}{\sqrt{n}},   (1.16)
which is the definition of the standard error as given in the previous section. For most samples
considered, the sampling fraction will be very small. Thus, the finite population correction will be
neglected throughout most of this syllabus.
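A small numerical illustration (an addition to the notes, using the correction factor exactly as written in equation (1.15) and hypothetical values N = 1000, σ = 10) shows how quickly the correction becomes negligible as the sampling fraction shrinks:

```python
import numpy as np

# Hypothetical numbers: population size N and population standard deviation sigma
sigma, N = 10.0, 1_000

for n in (10, 100, 900):
    fpc = 1 - (n - 1) / (N - 1)        # correction factor from equation (1.15)
    se_plain = sigma / np.sqrt(n)      # usual standard error sigma / sqrt(n)
    se_fpc = fpc * se_plain            # corrected standard error
    print(f"n={n:4d}  n/N={n/N:.2f}  fpc={fpc:.3f}  se={se_plain:.3f}  se_fpc={se_fpc:.3f}")
```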
1.3 Sampling Distributions Derived from the Normal
The normal distribution plays a central role in econometrics and statistics, for reasons that we will
explore in more depth in the next chapter. However, there are a number of other distributions that
feature as sampling distributions for various (test) statistics. As it turns out, the three most common
of these distributions can actually be derived from the normal distribution.
1.3.1 Chi-Square
Definition 1.5 Chi-Square distribution
Let Z_i ∼ iid N(0, 1). The distribution of U = \sum_{i=1}^{n} Z_i^2 is called the chi-square (χ²) distribution
with n degrees of freedom. This is denoted by χ²_n.
Notice that the definition above implies that if U1 , U2 , . . . , Un are independent chi-square random variables with 1 degree of freedom each, the distribution of V = \sum_{i=1}^{n} U_i will be a chi-square distribution with n degrees of freedom. Also, for large degrees of freedom n the chi-square distribution will
converge to a normal distribution, but this convergence is relatively slow.
The moment generating function of a χ²_n distribution is

M(t) = (1 − 2t)^{−n/2}.   (1.17)

This implies that if V ∼ χ²_n , then

E(V) = n, and   (1.18)
Var(V) = 2n.   (1.19)
Like the other distributions that are derived from the normal distribution, the chi-square distribution often appears as the distribution of a test statistic, for instance when testing for the joint significance
of two (or more) independent normally distributed variables. If Za ∼ N(µa , σa²) and Zb ∼ N(µb , σb²)
and V is defined as

V = \left(\frac{Z_a − µ_a}{σ_a}\right)^2 + \left(\frac{Z_b − µ_b}{σ_b}\right)^2,   (1.20)

then V ∼ χ²_2 (remember that (Z − µ)/σ ∼ N(0, 1)).
Also, if X1 , X2 , . . . , Xn is a sequence of independent normally distributed variables, then the
estimated variance satisfies

(n − 1)\frac{S^2}{σ^2} ∼ χ²_{n−1}.   (1.21)
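The defining property of the chi-square distribution is easy to check by simulation. The sketch below (an illustrative addition, assuming NumPy and SciPy) sums n = 5 squared standard normals and compares the simulated moments and a tail probability with the χ²_5 distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 5, 200_000

z = rng.standard_normal(size=(reps, n))
u = (z ** 2).sum(axis=1)                 # U = sum of n squared N(0,1) draws

print("simulated mean:", u.mean(), " (theory: n =", n, ")")
print("simulated variance:", u.var(), " (theory: 2n =", 2 * n, ")")
# Compare a tail probability with the chi-square(n) distribution (11.07 ~ its 95th percentile)
print("P(U > 11.07) simulated:", np.mean(u > 11.07),
      " chi2 sf:", stats.chi2.sf(11.07, df=n))
```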
1.3.2 Student-t
Definition 1.6 Student t distribution
Let Z ∼ N(0, 1) and Un ∼ χ²_n , with Z and Un independent; then

T_n = \frac{Z}{\sqrt{U_n/n}}   (1.22)

will have a t distribution with n degrees of freedom, often denoted by t_n.
The mean and variance of a t-distribution with n degrees of freedom are

E(T_n) = 0, and   (1.23)
Var(T_n) = \frac{n}{n − 2},  n > 2.   (1.24)
Like the normal, the expected value of the t-distribution is 0, and the distribution is symmetric
around its mean, implying that f (t) = f (−t). In contrast to the normal, the t distribution has more
probability mass in its tails, a property called fat-tailedness. As the degrees of freedom, n, increase,
the tails become lighter.
Indeed in appearance the student t distribution is very similar to the normal distribution; actually
in the limit n −→ ∞ the t distribution converges in distribution to a standard normal distribution.
Already for values of n as small as 20 or 30, the t distribution is very similar to a standard normal.
Remember that for a random sample drawn from a normal distribution, Z = (\bar{X} − µ)/(σ/\sqrt{n}) ∼
N(0, 1). However, in reality we do not have information about σ; thus we normally substitute the
sample estimate S = \sqrt{S^2} for σ. Thus T = (\bar{X} − µ)/(S/\sqrt{n}) will have a t-distribution (a proof of
this can be found at the end of the chapter).
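The construction in Definition 1.6 can be verified directly. The following sketch (an illustrative addition, not from the notes; NumPy and SciPy assumed) builds T_n from an independent Z and U_n and checks its variance and a tail probability against the t distribution with n = 4 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 4, 200_000

z = rng.standard_normal(reps)                            # Z ~ N(0,1)
u = stats.chi2.rvs(df=n, size=reps, random_state=rng)    # U_n ~ chi-square(n), independent of Z
t = z / np.sqrt(u / n)                                   # T_n = Z / sqrt(U_n / n)

print("simulated Var(T):", t.var(), " theory n/(n-2):", n / (n - 2))
# 2.132 is approximately the 95th percentile of t_4
print("P(T > 2.132) simulated:", np.mean(t > 2.132),
      " t sf:", stats.t.sf(2.132, df=n))
```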
1.3.3 F-distribution
Definition 1.7 F distribution
Let Un ∼ χ²_n and Vm ∼ χ²_m , and let Un and Vm be independent of each other; then

W_{n,m} = \frac{U_n/n}{V_m/m}   (1.25)

will have an F distribution with n and m degrees of freedom, often denoted by F_{n,m}.
The mean and variance of an F-distribution with n and m degrees of freedom are

E(F_{n,m}) = \frac{m}{m − 2},  m > 2,   (1.26)

Var(F_{n,m}) = 2\left(\frac{m}{m − 2}\right)^2 \frac{n + m − 2}{n(m − 4)},  m > 4.   (1.27)
Under specific circumstances, the F distribution converges to either a t or a χ² distribution. In
particular,

F_{1,m} = t_m^2,   (1.28)

and

n F_{n,m} \xrightarrow{d} χ²_n   as m → ∞.   (1.29)

The F-distribution often appears when investigating variances. Recall that the standardized
variance of a normal sample has a chi-square distribution. Hence the ratio of the standardized variances
of two independent samples has an F-distribution:

\frac{[n S_1^2/σ_1^2]/n}{[m S_2^2/σ_2^2]/m} = \frac{U_n/n}{V_m/m} = F_{n,m},

where the degrees of freedom are a function of the two sample sizes: n = n_1 − 1 and m = n_2 − 1.
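As an illustration of this last point (an addition to the notes, with hypothetical sample sizes and variances; NumPy and SciPy assumed), the sketch below simulates the ratio of standardized sample variances from two independent normal samples and compares it with the F distribution with n_1 − 1 and n_2 − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n1, n2, reps = 8, 13, 100_000
sigma1, sigma2 = 2.0, 5.0        # unequal population variances are fine:
                                 # each sample variance is standardized by its own sigma^2

ratios = []
for _ in range(reps):
    s1 = rng.normal(0, sigma1, n1).var(ddof=1)
    s2 = rng.normal(0, sigma2, n2).var(ddof=1)
    ratios.append((s1 / sigma1**2) / (s2 / sigma2**2))   # ~ F(n1 - 1, n2 - 1)

ratios = np.array(ratios)
dfn, dfd = n1 - 1, n2 - 1
print("P(ratio > 2.85) simulated:", np.mean(ratios > 2.85),
      " F sf:", stats.f.sf(2.85, dfn, dfd))
```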
Problems
1. The fill of a bottle of soda, dispensed by a certain machine, is normally distributed with mean
µ = 100 and variance σ2 = 9 (measured in centiliters).
(a) Calculate the probability that a single bottle of soda contains less than 98cl
(b) Calculate the probability that a random sample of 9 soda bottles contains, on average,
less than 98cl.
(c) How does your answer in (b) change if the variance of 9 were an estimate (i.e. S² = 9)
rather than a population parameter?
2. Let X1 , X2 , ..., Xm and Y1 , Y2 , ..., Yn be two normally distributed independent random samples,
with Xi ∼ N(µ1 , σ1²) and Yi ∼ N(µ2 , σ2²). Suppose that µ1 = µ2 = 10, σ1² = 2, σ2² = 2.5, and
m = n.
(a) Find E(X) and Var(X).
(b) Find E(X − Y) and Var(X − Y).
(c) Find the sample size n, such that σ(X−Y) = 0.1.
3. Let S1² and S2² be the sample variances of two random samples drawn from a normal population
with population variance σ² = 15. Let the sample size be n = 11.
(a) Find a such that Pr[S1² ≤ a] = 0.95.
(b) Find b such that Pr[S1²/S2² ≤ b] = 0.95.
4. Let Z1 , Z2 , Z3 , Z4 be a sequence of independent standard normal variables. Derive distributions for the following random variables.
(a) X1 = Z1 + Z2 + Z3 + Z4 .
(b) X2 = Z1² + Z2² + Z3² + Z4² .
(c) X3 = \frac{Z_1^2}{(Z_2^2 + Z_3^2 + Z_4^2)/3} .
(d) X4 = \frac{Z_1}{\sqrt{(Z_2^2 + Z_3^2 + Z_4^2)/3}} .
Proofs
Proof 1.1 To prove that the sample variance S² has expectation σ², note that

S_n^2 = \frac{\sum_{i=1}^{n} (X_i − \bar{X}_n)^2}{n − 1} = \frac{\sum_{i=1}^{n} X_i^2 − n\bar{X}_n^2}{n − 1}.

Therefore, by taking expectations we get

E(S_n^2) = E\left[\frac{\sum X_i^2 − n\bar{X}_n^2}{n − 1}\right] = \frac{\sum E[X_i^2] − n E[\bar{X}_n^2]}{n − 1}.

Recall that Var(Z) = E(Z²) − E(Z)², so

E(X_i^2) = σ² + µ²   and   E(\bar{X}_n^2) = n^{-1}σ² + µ².

Substitute to get

E(S_n^2) = \frac{n(σ² + µ²) − n(n^{-1}σ² + µ²)}{n − 1} = \frac{(n − 1)σ²}{n − 1} = σ².
Proof 1.2 To prove that T = (\bar{X} − µ)/(S/\sqrt{n}) ∼ t_{n−1}, rewrite T:

\frac{\bar{X} − µ}{S/\sqrt{n}} = \frac{(\bar{X} − µ)/(σ/\sqrt{n})}{(S/\sqrt{n})/(σ/\sqrt{n})} = \frac{Z}{S/σ} = \frac{Z}{\sqrt{S^2/σ^2}} = \frac{Z}{\sqrt{\frac{(n−1)S^2/σ^2}{n−1}}} = \frac{Z}{\sqrt{U_{n−1}/(n−1)}},

where

Z = \frac{\bar{X} − µ}{σ/\sqrt{n}} ∼ N(0, 1)   and   U_{n−1} = (n − 1)\frac{S^2}{σ^2} ∼ χ²_{n−1}.

Thus

\frac{\bar{X} − µ}{S/\sqrt{n}} ∼ t_{n−1}.
Chapter 2
Large Sample Theory
Literature
Required Reading
• WMS, Chapter 7.3 – 7.6
Recommended Further Reading
• CB, Chapter 5.5
• R, Chapter 5
2.1 Law of Large Numbers
In many situations it is not possible to derive exact distributions of statistics with the use of a
random sample of observations. This problem disappears, in most cases, if the sample size is large,
because we can derive an approximate distribution. Hence the need for large sample or asymptotic
distribution theory. Two of the main results of large sample theory are the Law of Large Numbers
(LLN), discussed in this section, and the Central Limit Theorem, described in the next section.
As large sample theory builds heavily on the notion of limits, let us first define what they are.
Definition 2.1 Limit of a sequence
Suppose a1 , a2 , . . . , an constitute a sequence of real numbers. If there exists a real number a such
that for every real ε > 0, there exists an integer N(ε) with the property that for all n > N(ε), we
have |an − a| < ε, then we say that a is the limit of the sequence {an } and write lim_{n→∞} an = a.
Intuitively, if an lies in an ε neighborhood of a, (a − ε, a + ε), for all n > N(ε), then a is said to be
the limit of the sequence {an }. Examples of limits are
\lim_{n→∞} \left[1 + \left(\frac{1}{n}\right)\right] = 1, and   (2.1)

\lim_{n→∞} \left(1 + \frac{a}{n}\right)^n = e^a.   (2.2)
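The second limit is easy to check numerically. A tiny Python illustration (an addition to the notes, with a = 0.5 chosen arbitrarily):

```python
# Numerical illustration of the limit in equation (2.2): (1 + a/n)^n -> e^a
import math

a = 0.5
for n in (10, 100, 10_000, 1_000_000):
    print(n, (1 + a / n) ** n)
print("e^a =", math.exp(a))
```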
The notion of convergence is easily extended to that of a function f (x).
11
12
QT 2015: Statistical Inference
Definition 2.2 Limit of a function
The function f (x) has the limit A at the point x0 , if for every ǫ > 0 there exists a δ(ǫ) > 0 such
that | f (x) − A |< ǫ whenever 0 <| x − x0 |< δ(ǫ)
One of the core principles in statistics is that the sample estimator will converge to the ‘true’
value when the sample gets larger. For instance, if a coin is flipped enough times, the proportion of
times it comes up tails should get very close to 0.5. The Law of Large Numbers is a formalization
of this notion.
Weak Law of Large Numbers
The concept of convergence in probability can be used to show that, under very general conditions,
the sample mean converges to the population mean, a result that is known as the Weak Law of
Large Numbers (WLLN). This property of convergence is also referred to as consistency, which will
be treated in more detail in the next chapter.
Theorem 2.1 (Weak Law of Large Numbers) Let X1 , X2 , . . . , Xn be iid random variables with
E(Xi ) = µ and Var(Xi ) = σ² < ∞. Define \bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i . Then for every ε > 0,

\lim_{n→∞} Pr(|\bar{X}_n − µ| < ε) = 1;

that is, \bar{X}_n converges in probability to µ.
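The coin-flipping intuition above can be checked with a short simulation (an illustrative addition, not from the notes; NumPy assumed, with ε = 0.02 chosen arbitrarily). As n grows, the estimated probability that the sample proportion lies within ε of 0.5 approaches one, as Theorem 2.1 states.

```python
import numpy as np

rng = np.random.default_rng(4)
p, eps, reps = 0.5, 0.02, 20_000          # fair coin, epsilon band of +/- 0.02

for n in (100, 1_000, 10_000):
    xbar = rng.binomial(n, p, size=reps) / n        # sample proportions of tails
    inside = np.mean(np.abs(xbar - p) < eps)        # estimate of Pr(|Xbar_n - mu| < eps)
    print(f"n={n:6d}  Pr(|Xbar - 0.5| < {eps}) ~= {inside:.3f}")
```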
As stated, the weak law of large numbers relies on the notion of convergence in probability.
This type of convergence is relatively weak and so normally not too hard to verify.
Definition 2.3 Convergence in Probability
If

\lim_{n→∞} Pr[|X_n − x| ≥ ε] = 0   for all ε > 0,

the sequence of random variables X_n is said to converge in probability to the real number x.
We write

X_n \xrightarrow{p} x   or   plim X_n = x.
Convergence in probability implies that it becomes less and less likely that the random variable
(Xn − x) lies outside the interval (−ε, +ε) as the sample size gets larger and larger. There exist
different equivalent definitions of convergence in probability. Some equivalent definitions are given
below:
1. limn−→∞ Pr[|Xn − x| < ǫ] = 1, ǫ > 0.
2. Given ǫ > 0 and δ > 0, there exists N(ǫ, δ) such that Pr[| Xn − x |> ǫ] < δ, for all n > N.
3. Pr[| Xn − x |< ǫ] > 1 − δ , for all n > N, that is, Pr[| XN+1 − x |< ǫ] > 1 − δ, Pr[| XN+2 − x |<
ǫ] > 1 − δ, and so on.
Theorem 2.2 If X_n \xrightarrow{p} X and Y_n \xrightarrow{p} Y, then
(a) (X_n + Y_n) \xrightarrow{p} (X + Y),
(b) (X_n Y_n) \xrightarrow{p} XY, and
(c) (X_n / Y_n) \xrightarrow{p} X/Y (if Y_n, Y ≠ 0).

Theorem 2.3 If g(·) is a continuous function, then X_n \xrightarrow{p} X implies that g(X_n) \xrightarrow{p} g(X). In
other words, convergence in probability is preserved under continuous transformations.
Strong Law of Large Numbers
Like in the case of convergence in probability, almost sure convergence can be used to prove the
convergence (almost surely) of the sample mean to the population mean. This stronger result is
known as the Strong Law of Large Numbers (SLLN).

Definition 2.4 Almost Sure Convergence
If

Pr\left[\lim_{n→∞} X_n = x\right] = 1,

the sequence of random variables X_n is said to converge almost surely to the real number x,
written as

X_n \xrightarrow{a.s.} x.
In other words, almost sure convergence implies that the sequence Xn may not converge everywhere to x, but the points where it does not converge form a set of measure zero in the probability
sense. More formally, given ε and δ > 0, there exists N such that Pr[|X_{N+1} − x| < ε, |X_{N+2} − x| <
ε, . . .] > (1 − δ); that is, the probability of these events jointly occurring can be made arbitrarily
close to 1. X_n is said to converge almost surely to the random variable X if (X_n − X) \xrightarrow{a.s.} 0.
Do not be fooled by the similarity between the definitions of almost sure convergence and convergence in probability. Although they look the same, convergence in probability is much weaker
than almost sure convergence. For almost sure convergence to happen, the Xn must converge at all
points in the sample space (that have a strictly positive probability). For convergence in probability
all that is needed is for the likelihood of convergence to increase as the sequence gets larger.
Theorem 2.4 (Strong Law of Large Numbers) Let X1 , X2 , . . . , Xn be iid random variables
with E(Xi ) = µ and Var(Xi ) = σ² < ∞. Define \bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i . Then for every ε > 0,

Pr\left[\lim_{n→∞} |\bar{X}_n − µ| < ε\right] = 1;   (2.3)

that is, the strong law of large numbers states that \bar{X}_n converges almost surely to µ:

\bar{X}_n − µ \xrightarrow{a.s.} 0.   (2.4)

The SLLN applies under fairly general conditions; some sufficient cases are outlined below.

Theorem 2.5 If the X's are iid, then a necessary and sufficient condition for \bar{X}_n − µ \xrightarrow{a.s.} 0 is
that E|X_i − µ| < ∞ for all i.

Theorem 2.6 (Kolmogorov's Theorem on SLLN) If the X's are independent (but not necessarily identically distributed) with finite variances, and if \sum_{n=1}^{∞} Var(X_n)/n² < ∞, then \bar{X}_n − E\bar{X}_n \xrightarrow{a.s.} 0.
A third form of point-wise convergence is the concept of convergence in mean.
Definition 2.5 Convergence in Mean (r)
The sequence of random variables X_n is said to converge in mean of order r to x (r ≥ 1),
designated X_n \xrightarrow{(r)} x, if E[|X_n − x|^r] exists and \lim_{n→∞} E[|X_n − x|^r] = 0; that is, if the
r-th moment of the difference tends to zero. The most commonly used version is mean squared
convergence, which is when r = 2.
For example, the sample mean (X n ) converges in mean square to µ, because Var(X n ) = E[(X n −
µ)2 ] = (σ2 /n) tends to zero as n goes to infinity. Like convergence almost surely, convergence in
Mean (r) is a stronger concept than convergence in probability.
2.2 The Central Limit Theorem
Perhaps the most important theorem in large sample theory is the central limit theorem, which implies, under quite general conditions, that the standardized mean of a sequence of random variables
(for example the sample mean) converges in distribution to a standard normal distribution, even
though the population is not normal. Thus, even if we did not know the statistical distribution of
the population from which a sample is drawn, we can approximate quite well the distribution of the
sample mean by the normal distribution by having a large sample.
In order to establish this result, we rely on the concept of convergence in distribution.
Definition 2.6 Convergence in Distribution
Let {X_n} be a sequence of random variables whose CDFs are F_n(x), and let the CDF F_X(x)
correspond to the random variable X. We say that X_n converges in distribution to X if

\lim_{n→∞} F_n(x) = F_X(x)

at all points x at which F_X(x) is continuous. This can be written as

X_n \xrightarrow{d} X.

Sometimes, convergence in distribution is also referred to as convergence in law.
Intuitively, convergence in distribution occurs when the distribution of X_n comes closer and
closer to that of X as n increases indefinitely. Thus, F_X(x) can be taken to be an approximation to
the distribution of X_n when n is large. The following relations hold for convergence in distribution:
Theorem 2.7 If X_n \xrightarrow{d} X and Y_n \xrightarrow{p} c, where c is a non-zero constant, then
(a) (X_n + Y_n) \xrightarrow{d} (X + c), and
(b) (X_n / Y_n) \xrightarrow{d} (X/c).
Using the definition of convergence in distribution we can now introduce formally one version
of the Central Limit Theorem.
Theorem 2.8 (Central Limit Theorem) Let X1 , X2 , . . . , Xn be iid random variables with mean
E(Xi ) = µ and a finite variance σ² < ∞. Define the standardized sample mean,

Z_n = \frac{\bar{X}_n − E(\bar{X}_n)}{\sqrt{Var(\bar{X}_n)}}.

Then, under a variety of alternative assumptions,

Z_n \xrightarrow{d} N(0, 1).   (2.5)
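A brief simulation (an illustrative addition, not from the notes; NumPy and SciPy assumed) makes the point that the population need not be normal: even for an Exponential(1) population, the standardized sample mean's tail probabilities approach those of N(0, 1) as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
reps = 100_000
mu, sigma = 1.0, 1.0          # mean and standard deviation of an Exponential(1) population

for n in (2, 5, 30):
    x = rng.exponential(scale=1.0, size=(reps, n))
    z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))   # standardized sample mean Z_n
    # Compare a tail probability with the standard normal limit
    print(f"n={n:3d}  P(Z_n > 1.645) ~= {np.mean(z > 1.645):.3f}"
          f"   (N(0,1) gives {stats.norm.sf(1.645):.3f})")
```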
2.3 The Normal Approximation to the Binomial Distribution
The Bernoulli Distribution
The Bernoulli distribution is a binary distribution, with only two possible outcomes: success (X = 1)
with probability p and failure (X = 0) with probability q = 1 − p. The probability density of a
Bernoulli is
Pr(X = x | p) = p^x (1 − p)^{1−x};   x = 0, 1,   (2.6)

for x = 0, 1 (failure, success) and 0 ≤ p ≤ 1.
The mean and variance of a Bernoulli distribution are given as:

E(X) = p,   (2.7)
Var(X) = p(1 − p) = pq.   (2.8)
The Binomial Distribution
The Binomial distribution gives the number of successes in a sequence of n iid Bernoulli random variables:

Pr(X = x | n, p) = \binom{n}{x} p^x (1 − p)^{n−x} = \frac{n!}{x!(n − x)!} p^x (1 − p)^{n−x},   (2.9)

for x = 0, 1, . . . , n (X is the number of successes in n trials) and 0 ≤ p ≤ 1.
The mean and variance of a binomial distribution are given as:

E(X) = np,   (2.10)
Var(X) = npq.   (2.11)
Example 2.1 Assume a student is given a test with 10 true-false questions. Also assume that
the student is totally unprepared for the test and guesses the answer to every question. What is
the probability that the student will answer 7 or more questions correctly?
Let X be the number of questions answered correctly. The test represents a binomial experiment with n = 10, p = 1/2, so X ∼ Bin(n = 10, p = 1/2).

Pr(X ≥ 7) = Pr(X = 7) + Pr(X = 8) + Pr(X = 9) + Pr(X = 10)
          = \sum_{k=7}^{10} \binom{10}{k} \left(\frac{1}{2}\right)^k \left(\frac{1}{2}\right)^{10−k} = \sum_{k=7}^{10} \binom{10}{k} \left(\frac{1}{2}\right)^{10}
          = 0.172.
The Normal Approximation
For large sample size n and number of successes k, it becomes cumbersome to calculate the exact
probabilities of the binomial. However, we can obtain approximate probabilities by invoking CLT.
As stated before, a Binomial(n, p) random variable can be thought of as the sum of n independent Bernoulli trials, each with
success probability p. Consequently, when n is large, the sample average of the Bernoulli trials,

\frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X},

will be approximately normal with mean E(\bar{X}) = p and variance Var(\bar{X}) = p(1 − p)/n. Thus, approximately,

\frac{\bar{X} − p}{\sqrt{p(1 − p)/n}} ∼ N(0, 1).

Even for fairly low values of n and k the normal approximation is surprisingly accurate. Wackerly provides the useful rule of thumb that the approximation is adequate if

n > 9 \, \frac{\text{larger of } p \text{ and } q}{\text{smaller of } p \text{ and } q}.   (2.12)
Example 2.2 Consider again the student from Example 2.1. What would be the approximate
probability?

Pr(X ≥ 7) = Pr(X/10 ≥ 0.7).

Define \bar{X} = X/10. Then

Pr(\bar{X} ≥ 0.7) = Pr\left(\frac{\bar{X} − p}{\sqrt{p(1 − p)/n}} ≥ \frac{0.7 − p}{\sqrt{p(1 − p)/n}}\right)
               = Pr\left(Z ≥ \frac{0.2}{\sqrt{0.025}}\right)
               = Pr(Z ≥ 1.26)
               = 0.104.
If we compare the approximate probability of 0.104 with the exact probability of 0.172 from
the previous exercise, it becomes clear that there may be a substantial approximation error.
However, as n gets larger, this approximation error becomes progressively smaller.
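The two calculations in Examples 2.1 and 2.2 can be reproduced in a few lines (an illustrative addition, not from the notes; SciPy assumed). The last line adds a continuity-corrected version of the approximation, which is not used in the notes but typically lands much closer to the exact value.

```python
from scipy import stats
import math

n, p = 10, 0.5
exact = stats.binom.sf(6, n, p)                 # P(X >= 7) = P(X > 6): about 0.172
z = (0.7 - p) / math.sqrt(p * (1 - p) / n)      # standardized 0.7, as in Example 2.2
approx = stats.norm.sf(z)                       # about 0.10

print("exact P(X >= 7):", round(exact, 3))
print("normal approximation:", round(approx, 3))
# Continuity correction: evaluate at 6.5/10 = 0.65 instead of 0.7
print("with continuity correction:",
      round(stats.norm.sf((0.65 - p) / math.sqrt(p * (1 - p) / n)), 3))
```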
Problems
1. Let X1 , X2 , . . . , Xn be an independent sample (i.e. independent but not identically distributed),
with E(Xi ) = µi and Var(Xi ) = σi². Also, let n^{-1}\sum_{i=1}^{n} µi −→ µ.
Show that if n^{-2}\sum_{i=1}^{n} σi² −→ 0, then \bar{X} −→ µ in probability.
2. The service times for customers coming through a checkout counter in a retail store are independent random variables with mean 1.5 minutes and variance 1.0. Use CLT to approximate
the probability that 100 customers can be serviced in less than 2 hours of total service time.
3. Suppose that a measurement has mean µ and variance σ2 = 25. Let X be the average of n
such independent measurements. If we are interested in measuring the sample mean with a
degree of precision such that 95% of the time the sample mean lies within 1.5 units (in the
absolute sense) from the true population mean, how large should we make our sample size?
In other words, how large should n be so that Pr(|\bar{X} − µ| < 1.5) = 0.95?
Proofs
Proof 2.1 (Weak Law of Large Numbers) The Weak Law of Large Numbers can be proven
by use of Chebychev’s Inequality:

Pr[g(X) ≥ ε] ≤ \frac{E[g(X)]}{ε},   ε > 0.

For instance, let g(X) be |X − E(X)|; in this case Chebychev’s inequality reduces to

Pr[|X − E(X)| ≥ ε] ≤ \frac{E[|X − E(X)|]}{ε}.

Applying Chebychev’s inequality to the sample mean, for every ε > 0 we have

Pr[|\bar{X} − E(\bar{X})| ≥ ε] ≤ \frac{E[(\bar{X} − µ)^2]}{ε^2},

with

\frac{E[(\bar{X} − µ)^2]}{ε^2} = \frac{Var(\bar{X})}{ε^2} = \frac{σ^2}{nε^2}.

As \lim_{n→∞} \frac{σ^2}{nε^2} = 0, we have

\lim_{n→∞} Pr[|\bar{X} − E(\bar{X})| ≥ ε] = 0.
Chapter 3
Estimation
Literature
Required Reading
• WMS, Chapters 8 & 9.1 – 9.3
Recommended Further Reading
• R, Sections 8.6 – 8.8
• CB, Chapters 7, 9, & 10.1.
3.1 Introduction
The purpose of statistics is to use the information contained in a sample to make inferences about the
parameters of the population that the sample is taken from. The key to making good inferences about
the parameters is to have a good estimation procedure that produces good estimates of the quantities
of interest.
Definition 3.1 Estimator
An estimator is a rule for calculating an estimate of a target parameter based on the information
from a sample. To indicate the link between an estimator and its target parameter, say θ, the
estimator is normally denoted by adding a hat: θ̂.
A point estimation procedure uses the information in the sample to arrive at a single number
that is intended to be close to the true value of the target parameter in the population. For example,
the sample mean
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}   (3.1)
is one possible point estimator of the population mean µ. There may be more than one estimator for
a population parameter. The sample median, X(n/2) , for example might be another estimator for the
population mean. Alternatively one might provide a range of values as estimates for the mean, for
example the range from 0.10 to 0.35. This case is referred to as interval estimation.
3.2 Evaluation Criteria for Estimators
As there are often multiple point estimators available for any given parameter it is important to develop some evaluation criteria to judge the performance of each estimator and compare their relative
effectiveness. The three most important criteria used in economics and finance are: unbiasedness,
efficiency, and consistency.
Unbiasedness
Definition 3.2 Unbiasedness
An estimator θ̂ is called unbiased estimator of θ if E(θ̂) = θ. The bias of an estimator is given
by b(θ) = E(θ̂) − θ.
Definition 3.3 Asymptotic Unbiasedness
If an estimator has the property that Var(θ̂) and \sqrt{n}(θ̂_n − θ) tend to zero as the sample size
increases, then it is said to be asymptotically unbiased.
Efficiency
Definition 3.4 Mean Square Error (MSE)
A commonly used measure of the adequacy of an estimator is E[(θ̂ − θ)2 ], which is called the
mean square error ( MSE). It is a measure of how close θ̂ is, on average, to the true θ. The MSE
can be decomposed into two parts:
MSE = E[(θ̂ − θ)^2] = E[(θ̂ − E(θ̂) + E(θ̂) − θ)^2] = Var(θ̂) + bias^2(θ).   (3.2)
Definition 3.5 Relative Efficiency
Let θ̂1 and θ̂2 be two alternative estimators of θ. Then the ratio of the respective MS Es, E[(θ̂1 −
θ)2 ]/E[(θ̂2 − θ)2 ], is called the relative efficiency of θ̂1 with respect to θ̂2 .
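MSE and relative efficiency are easy to estimate by simulation. The sketch below (an illustrative addition, not from the notes; NumPy assumed, with a standard normal population and n = 25 chosen arbitrarily) compares the sample mean with the sample median as estimators of µ; both are unbiased here, so their MSEs equal their variances, and the ratio is their relative efficiency in the sense of Definition 3.5.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 0.0, 1.0, 25, 50_000

x = rng.normal(mu, sigma, size=(reps, n))
mean_est = x.mean(axis=1)
median_est = np.median(x, axis=1)

mse_mean = np.mean((mean_est - mu) ** 2)      # equals the variance: both estimators are unbiased
mse_median = np.mean((median_est - mu) ** 2)

print("MSE of sample mean:  ", mse_mean)
print("MSE of sample median:", mse_median)
# Relative efficiency of the median with respect to the mean; roughly 1.5 here
# (asymptotically pi/2 for a normal population)
print("relative efficiency:", mse_median / mse_mean)
```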
Consistency
Definition 3.6 Consistency
An estimator θ̂ is consistent if the sequence θ̂_n converges to θ in the limit, i.e. θ̂_n → θ.

There are different types of consistency, corresponding to different versions of the law of large
numbers. Examples are:
1. θ̂_n \xrightarrow{p} θ (Weak Consistency)
2. θ̂_n \xrightarrow{(2)} θ (Squared-error Consistency)
3. θ̂_n \xrightarrow{a.s.} θ (Strong Consistency)

A sufficient condition for weak consistency is that
1. the estimator is asymptotically unbiased, and
2. the variance of the estimator goes to zero as n → ∞.
3.3 Confidence Intervals
An interval estimator is an estimation rule that specifies two numbers that form the endpoints of
an interval, θ̂L and θ̂H . A good interval estimator is chosen such that (i) it will contain the target
parameter θ most of the time and (ii) the interval chosen is as small as possible. Of course, as the
estimators are random variables one or both of the endpoints of the interval will vary from sample
to sample, so one cannot guarantee with certainty that the parameter will lie inside the interval but
we can be fairly confident; as such interval estimators are often referred to as confidence intervals.
The probability (1 − α) that θ will lie in the confidence interval is called the confidence level and the
upper and lower endpoints are called, respectively, the upper and lower confidence limits
Definition 3.7 Confidence Interval
Let θ̂L and θ̂H be interval estimators of θ s.t. Pr(θ̂L ≤ θ ≤ θ̂H ) = 1 − α, then the interval [θ̂L , θ̂H ]
is called the two-sided (1 − α) × 100% confidence interval. Normally the interval is chosen such
that on each side α/2 falls outside the confidence interval.
In addition to two-sided confidence intervals it is also possible to form single-sided confidence
intervals. If θ̂L is chosen s.t.

Pr(θ̂L ≤ θ) = 1 − α,

then the interval [θ̂L , ∞) is the lower confidence interval. Additionally, if θ̂H is chosen such that

Pr(θ ≤ θ̂H) = 1 − α,

the interval (−∞, θ̂H] is the upper confidence interval.
Pivotal Method
A useful method for finding the endpoints of confidence intervals is the pivotal method, which relies
on finding a pivotal quantity
Definition 3.8 Pivotal Quantity
The random variable Q = q(X1 , . . . , Xn ) is said to be a pivotal quantity if the distribution of Q
is independent from θ.
For example, for a random sample drawn from N(µ, 1) the random variable Q = \frac{\bar{X} − µ}{\sqrt{1/n}} is a pivotal
quantity, since Q ∼ N(0, 1). For the more general case of a random sample drawn from N(µ, σ²), the
pivotal quantity associated with µ̂ will be Q = \frac{\bar{X} − µ}{S/\sqrt{n}}, where S is the sample estimate of the standard
deviation, as Q ∼ t_{n−1}.
Pr(q1 ≤ Q ≤ q2) is unaffected by a change of scale or a translation of Q. That is, if

Pr(q1 ≤ Q ≤ q2) = (1 − α),   (3.3)

then

Pr(a + b q1 ≤ a + b Q ≤ a + b q2) = (1 − α).   (3.4)

Thus, if we know the pdf of Q, it may be possible to use the operations of addition and multiplication to find the desired confidence interval. Let’s take as an example a sample drawn from a
normal population with known variance. To build a confidence interval around the mean, the pivotal
quantity of interest is

Q = \frac{\bar{X} − µ}{\sqrt{1/n}} ∼ N(0, 1),   with \bar{X} ∼ N(µ, 1/n).   (3.5)

To find the confidence limits µ̂L and µ̂H s.t.

Pr(µ̂L ≤ µ ≤ µ̂H) = 1 − α,   (3.6)

we start with finding the confidence limits q1 and q2 of our pivotal quantity s.t.

Pr\left(q1 ≤ \frac{\bar{x} − µ}{1/\sqrt{n}} ≤ q2\right) = 1 − α.   (3.7)

After we have found q1 and q2, we can manipulate the probability to find expressions for µ̂L and µ̂H:

Pr\left(q1 ≤ \frac{\bar{x} − µ}{1/\sqrt{n}} ≤ q2\right) = Pr\left(\frac{1}{\sqrt{n}} q1 ≤ \bar{x} − µ ≤ \frac{1}{\sqrt{n}} q2\right)
 = Pr\left(\frac{1}{\sqrt{n}} q1 − \bar{x} ≤ −µ ≤ \frac{1}{\sqrt{n}} q2 − \bar{x}\right)
 = Pr\left(\bar{x} − \frac{1}{\sqrt{n}} q2 ≤ µ ≤ \bar{x} − \frac{1}{\sqrt{n}} q1\right).   (3.8)

So,

µ̂L = \bar{x} − \frac{1}{\sqrt{n}} q2,   (3.9)

µ̂H = \bar{x} − \frac{1}{\sqrt{n}} q1,   (3.10)

and

\left[\bar{x} − \frac{1}{\sqrt{n}} q2, \; \bar{x} − \frac{1}{\sqrt{n}} q1\right]   (3.11)

is the (1 − α)100% confidence interval for µ.
Constructing Confidence Intervals
Confidence Intervals for the Mean of a Normal Population
Consider the case of a sample drawn from a normal population where both µ and σ² are unknown.
We know that

Q = \frac{\bar{x} − µ}{S/\sqrt{n}} ∼ t_{(n−1)}.   (3.12)

As the distribution of Q does not depend on any unknown parameters, Q is a pivotal quantity.
We start with finding the confidence limits q1 and q2 of the pivotal quantity. As a t-distribution
is symmetrical (just like the normal distribution), we can simplify the problem somewhat as it can
be shown that q2 = −q1 = q. So we need to find a number q s.t.

Pr\left(−q ≤ \frac{\bar{x} − µ}{s/\sqrt{n}} ≤ q\right) = 1 − α,   (3.13)

which reduces to finding q s.t.

Pr(Q ≥ q) = \frac{α}{2}.   (3.14)

After we have retrieved q = t_{α/2,(n−1)}, we manipulate the quantities inside the probability to find

Pr\left(\bar{x} − q \frac{s}{\sqrt{n}} ≤ µ ≤ \bar{x} + q \frac{s}{\sqrt{n}}\right) = 1 − α,   (3.15)

to obtain the confidence interval

\left[\bar{x} − t_{α/2,(n−1)} \frac{s}{\sqrt{n}}, \; \bar{x} + t_{α/2,(n−1)} \frac{s}{\sqrt{n}}\right].   (3.16)
Example 3.1 Consider a sample drawn from a normal population with unknown mean and
variance. Let n = 10, \bar{x} = 3.22, s = 1.17, (1 − α) = 0.95. Filling in the numbers in the formula

\left[\bar{x} − t_{α/2,(n−1)} \frac{s}{\sqrt{n}}, \; \bar{x} + t_{α/2,(n−1)} \frac{s}{\sqrt{n}}\right],

the 95% CI for µ equals

\left[3.22 − \frac{(2.262)(1.17)}{\sqrt{10}}, \; 3.22 + \frac{(2.262)(1.17)}{\sqrt{10}}\right] = [2.38, 4.06].   (3.17)
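Example 3.1 can be reproduced directly (an illustrative addition, not from the notes; SciPy assumed), with the t quantile looked up rather than read from a table:

```python
import math
from scipy import stats

n, xbar, s, alpha = 10, 3.22, 1.17, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)     # about 2.262 for 9 degrees of freedom
half_width = t_crit * s / math.sqrt(n)

print("t critical value:", round(t_crit, 3))
print("95% CI for mu:", (round(xbar - half_width, 2), round(xbar + half_width, 2)))
# Expected output: roughly (2.38, 4.06), matching Example 3.1
```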
Confidence Intervals for the Variance of a Normal Population
To find the confidence interval of the variance of a normal population, we start again with finding
an appropriate pivotal quantity. In this case recall that

Q = (n − 1)\frac{S^2}{σ^2} ∼ χ²_{(n−1)}.   (3.18)

Note that the distribution of Q does not depend on any unknown parameters, hence Q is a pivotal
quantity. Therefore we can find limits q1 and q2 such that

Pr(q1 ≤ Q ≤ q2) = 1 − α.   (3.19)

This is slightly more tricky as the chi-square distribution is not symmetric. It is standard to select
the thresholds such that

Pr(Q ≤ q1) = Pr(Q ≥ q2) = \frac{α}{2}.   (3.20)

After retrieving q1 = χ²_{1−α/2,(n−1)} and q2 = χ²_{α/2,(n−1)}, we manipulate the expression to find

Pr(q1 ≤ Q ≤ q2) = Pr\left(q1 ≤ (n − 1)\frac{S^2}{σ^2} ≤ q2\right) = Pr\left((n − 1)\frac{S^2}{q2} ≤ σ^2 ≤ (n − 1)\frac{S^2}{q1}\right).   (3.21)

So, \left[(n − 1)\frac{S^2}{q2}, \; (n − 1)\frac{S^2}{q1}\right] is a (1 − α)100% CI for σ².
Example 3.2 As in the previous example, let n = 10, \bar{x} = 3.22, s = 1.17, (1 − α) = 0.95.
The 95 percent CI for σ² is

\left[(n − 1)\frac{s^2}{q2}, \; (n − 1)\frac{s^2}{q1}\right],

with q2 = χ²_{0.025,(9)} = 19.02 and q1 = χ²_{0.975,(9)} = 2.70, so the 95% CI equals

\left[9 × \frac{1.17^2}{19.02}, \; 9 × \frac{1.17^2}{2.70}\right] = [0.65, 4.56].   (3.22)
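The same interval can be computed in a few lines (an illustrative addition, not from the notes; SciPy assumed):

```python
from scipy import stats

n, s, alpha = 10, 1.17, 0.05

q1 = stats.chi2.ppf(alpha / 2, df=n - 1)          # lower chi-square quantile, about 2.70
q2 = stats.chi2.ppf(1 - alpha / 2, df=n - 1)      # upper chi-square quantile, about 19.02

ci_low = (n - 1) * s**2 / q2
ci_high = (n - 1) * s**2 / q1

print("q1, q2:", round(q1, 2), round(q2, 2))
print("95% CI for sigma^2:", (round(ci_low, 2), round(ci_high, 2)))
# Expected output: roughly (0.65, 4.56), matching Example 3.2
```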
Problems
1. Let X1 , X2 , . . . , Xn be a random sample with mean µ and variance σ2 . Consider the following
estimators:
(i) µ̂1 = \frac{X_1 + X_n}{2}
(ii) µ̂2 = \frac{X_1}{4} + \frac{1}{2}\,\frac{\sum_{i=2}^{n−1} X_i}{n − 2} + \frac{X_n}{4}
(iii) µ̂3 = \frac{\sum_{i=1}^{n} X_i}{n + k}, where 0 < k ≤ 3.
(iv) µ̂4 = \bar{X}
(a) Explain for each estimator whether they are unbiased and/or consistent.
(b) Find the efficiency of µ̂1 , µ̂2 , and µ̂3 relative to µ̂4 . Assume n = 36, σ2 = 20, µ = 15, and
k = 3.
2. Consider the case in which two estimators are available for some parameter, θ.
Suppose that E(θ̂1 ) = E(θ̂2 ) = θ, Var(θ̂1 ) = σ21 , and Var(θ̂2 ) = σ22 .
Consider now a third estimator, θ̂3 , defined as
θ̂3 = aθ̂1 + (1 − a)θ̂2 .
How should a constant a be chosen in order to minimise the variance of θ̂3 ?
(a) Assume that θ̂1 and θ̂2 are independent.
(b) Assume that θ̂1 and θ̂2 are not independent but are such that Cov(θ̂1 , θ̂2 ) = γ ≠ 0.
3. Consider a random sample drawn from a normal population with unknown mean and variance.
You have the following information about the sample: n = 21, x = 10.15, and s = 2.34. Let
α = 0.10 throughout this question.
(a) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for µ.
(b) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for σ2 .
(c) Calculate the (1 − α) two-sided, upper, and lower confidence intervals for σ.
Chapter 4
Hypothesis Testing
Literature
Required Reading
• WMS, Chapter 10
Recommended Further Reading
• R, Sections 9.1 – 9.3
• CB, Chapter 8.
• G, Chapter 5.
4.1 Introduction
Think for a second about a courtroom drama. A defendant is led down the aisle, the prosecution
lays out all the evidence, and at the end the judge has to weigh the evidence and make his verdict:
innocent or guilty. In many ways a legal trial follows the same logic as a statistical hypothesis test.
The testing of statistical hypotheses on unknown parameters of a probability model is one of
the most important steps of any empirical study. Examples of statistical hypotheses that are tested
in economics include
• The comparison of two alternative models,
• The evaluation of the effects of a policy change,
• The testing of the validity of an economic theory.
4.2 The Elements of a Statistical Test
Broadly speaking there are two main approaches to hypothesis testing: the classical approach and
the Bayesian approach. The approach followed in this chapter is the classical approach, which is
most widely used in econometrics. The classical approach is best described by the Neyman-Pearson
methodology; it can be roughly described as a decision rule that follows the logic: ‘What type of
data will lead me to reject the hypothesis?’ A decision rule that selects one of the inferences ‘reject
the null hypothesis’ or ‘do not reject the null hypothesis’ is called a statistical test. Any statistical
test of hypotheses is composed of the same three essential components:
1. Selecting a null hypothesis, H0 , and an alternative hypothesis, H1 ,
2. Choosing a test statistic,
3. Defining the rejection region.
Null and Alternative Hypotheses
A hypothesis can be thought of as a binary partition of the parameter space Θ into two sets, Θ0 and
Θ1 such that
Θ0 ∩ Θ1 = ∅ and Θ0 ∪ Θ1 = Θ.
(4.1)
The set Θ0 is called the null hypothesis, denoted by H0 . The set Θ1 is called the alternative hypothesis, denoted by H1 or Ha .
Take as example a political poll. Let’s assume that the current prime minister declares that
he has got the support of more than half the population and we do not believe him. To test his
statement we randomly select 100 voters and ask them if they approve of the prime minister. We
can now formulate a null and alternative hypothesis.
Let the null hypothesis be that the prime minister is correct, in that case the proportion of people
supporting the prime minister will be at least 0.5, so
H0 : θ ≥ 0.5.
(4.2)
Conversely if the prime minister is wrong then the alternative is true
H1 : θ < 0.5.
(4.3)
Note that this partitioning of the null and alternative is done such that there is no value for θ that
lies both in the domain of the null and the alternative and the union of the null and the alternative
contains all possible values that θ can take.
Often the null hypothesis in the above case is simplified: we are really only interested in the
endpoint of the interval described by the null hypothesis, in this case the point θ = 0.5, so often the
null is written instead as
H0 : θ = 0.5,
(4.4)
where it is implicit that any value for θ larger than 0.5 is covered by this hypothesis by the way the
alternative is formulated.
The above example outlines what is known as a single sided hypothesis as the alternative hypothesis lies to one side of the null hypothesis. Alternatively one can specify a two sided hypothesis
such as
H0 : θ = 0.5 vs. H1 : θ ≠ 0.5.
(4.5)
In this case the alternative hypothesis includes values for θ that lie on both sides of the postulated
null hypothesis.
Test Statistic
Once the null and alternative hypothesis have been defined a procedure needs to be developed to
decide whether the null hypothesis is a reasonable one. This test procedure usually contains a sample
statistic T (x) called the test statistic, which summarizes the ‘evidence’ against the null hypothesis.
Generally the test statistic is chosen such that its limiting distribution is known.
Take again the example of the popularity poll of the prime minister. We can exploit the fact that
(i) the sample consists of an iid sequence of Bernoulli RV and (ii) CLT to show that approximately
θ̂ = \bar{X} ∼ N\left(θ, \frac{θ(1 − θ)}{n}\right).   (4.6)

If we standardize θ̂ and fill in our hypothesized value θ0 = 0.5 for θ, we can create the test statistic

Z(x) = \frac{\bar{X} − 0.5}{\sqrt{0.25/100}} ∼ N(0, 1).   (4.7)
Note that Z(x) does not rely on any unknown quantities and its limiting distribution is known.
Rejection Region
After a test statistic T has been selected, the researcher needs to define a range of values of T for
which the test procedure recommends the rejection of the null. This range is called the rejection
region or the critical region. Conversely the range of values for T in which the null is not rejected is
called the acceptance region. The cut-off point(s) that indicate the boundary between the rejection
region and the acceptance region is called the critical value.
Going back to the example of the popularity poll, we could create the protocol: if the test statistic
T is lower than the critical value τcrit = −2 I reject the null H0 : θ = 0.5 in favour of the alternative
H1 : θ < 0.5. In this case the rejection region consists of the set RR = {t < −2} and the acceptance
region of the set AR = {t ≥ −2}.
To find the right critical value is an interesting problem. In the above example, we know that
finding any value for θ̂ lower than 0.5 (and hence a test statistic lower than 0) is evidence against
the null hypothesis. But how low should we set our threshold exactly? In order to better understand
this dilemma, let’s first take the decision rule as fixed and evaluate the possible outcomes of our
statistical test.
Hopefully our test arrives at the correct conclusion: reject the null when it is not true or not
rejecting it when it is indeed true. However there is the possibility that an erroneous conclusion has
been made and one of two types of errors has been committed:
Type I error : Rejecting H0 when it is true
Type II error: Not Rejecting H0 when it is false
Now that we have identified the two correct outcomes and two errors we can commit, we can
associate probabilities with these events.
Definition 4.1 Size of the test(α)
The probability of rejecting H0 when it is actually true (ie. committing a type I error) is called
the size of the test. Sometimes it is also called the level of significance of the test. This probability is usually denoted as α.
Table 4.1: Decision outcomes and their associated probabilities

                      H0 true                           H0 false
  H0 rejected         α (Type I error; level / size)    (1 − β) (power)
  H0 not rejected     (1 − α)                           β (Type II error; operating characteristic)
Common sizes that are used in hypothesis testing are α = 0.10, α = 0.05, and α = 0.01.
Definition 4.2 Power of the test (1 − β)
The probability of rejecting H0 when it is false is called the power of the test. This probability
is normally denoted as (1 − β).
Definition 4.3 Operating Characteristic (β)
The probability of not rejecting H0 when it is actually false (ie. committing a type II error) is
known as the operating characteristic. This probability is usually denoted as β. This concept is
widely used in statistical quality control theory.
Table 4.1 summarizes these probabilities. Ideally a test is chosen such that both the probability of a type I error, α, and the probability of a type II error, β, are as low as possible. However,
practically this is impossible because, given some fixed sample, reducing α increases β: there is
a trade-off between the two. The only way to decrease both α and β is to increase the sample
size, something that is often not feasible. The classical decision procedure therefore chooses an
acceptable value for the level α.
Note that in small samples the empirical size associated with a critical value of a test statistic is
often larger than the asymptotic size because the approximation of the limiting distribution might
not yet be very good. Thus if a researcher is not careful he risks choosing a test which rejects the
null hypothesis more often than he realizes.
So then how do we select the critical value τcrit after fixing α? Let’s consider once more our
popularity poll. Recall that the test statistic associated with the hypothesis that θ = 0.5 was
(\bar{X} − 0.5)/\sqrt{0.25/100} ∼ N(0, 1). Let’s say that we are willing to reject the null hypothesis if there is
less than 2.5% probability of committing a type I error, i.e. α = 0.025. Since we know the limiting
distribution of T we can find the value τcrit such that

Pr[T < τcrit | θ = 0.5] = α = 0.025.   (4.8)

This value can be found by looking up the CDF of a standard normal: Pr(T ≥ τ) = 1 − 0.025 = 0.975;
in this case τ = −1.96. In any case, we have now found the relevant critical value, and can define
the rejection region as RR = {t < −1.96} and the acceptance region as AR = {t ≥ −1.96}. If we map
the critical value of the test statistic back to a proportion, this translates to θcrit = θ − 1.96 × se =
0.5 − 1.96 × 0.05 = 0.402; ie. we can reject the null (θ = 0.5) at the 2.5% level if we find a sample
mean lower than 0.402.
If a two-sided test of the form H0 : θ = 0.5 vs. H1 : θ ≠ 0.5 had been considered,
the rejection region would have consisted of two parts: RR = {t : t < τl or t > τu }, where
for a symmetric distribution like the normal τu = −τl = τ, which reduces the rejection region to
RR = {t : |t| > τ}. Using the data from the popularity poll, we can easily construct a two-sided
rejection region for the hypothesis H0 : θ = 0.5 vs. H1 : θ ≠ 0.5 at the 5% level by realizing that
5% / 2 = 2.5%. Hence the critical values for the two-sided test will be −1.96 and 1.96, with the
associated rejection region: RR = {t : |t| > 1.96}.
Example 4.1 Consider the hypothetical example in which a subject is asked to draw, 20 times,
a card from a deck of 52 cards and identify, without looking, the suit (hearts, diamonds, clubs,
spades). Let T be the number of correct identifications. Let the null hypothesis be that the
subject guesses randomly, with the alternative being that the person has extrasensory ability
(also called ESP). If the maximum level of the test is set at α = 0.05, what should be the
decision rule and associated rejection region?
T ∼ binomial(20, 0.25).
Find τ0.05 such that Pr[T > τ0.05 | π = 0.25] ≤ 0.05.
P[T ≥ 8 | π = 0.25] = 0.102 > 0.05 and P[T ≥ 9 | π = 0.25] = 0.041 < 0.05.
Thus the critical value of this test is τ0.05 = 9 and the rejection region equals
RR : t ≥ 9.
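The binomial tail probabilities used in Example 4.1 can be computed directly (an illustrative addition, not from the notes; SciPy assumed):

```python
from scipy import stats

n, p, alpha = 20, 0.25, 0.05

# P(T >= k) for candidate critical values; binom.sf(k - 1) gives P(T >= k)
for k in (8, 9, 10):
    print(f"P(T >= {k}) = {stats.binom.sf(k - 1, n, p):.3f}")

# The smallest k with P(T >= k) <= alpha is the critical value (k = 9 here),
# so the rejection region is {t : t >= 9}.
```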
Common Large-Sample Tests
Many hypothesis tests are based around test statistics that are approximately normal by virtue of the
CLT, such as sample means X. We can exploit this fact to construct a test statistic that is commonly
encountered in econometrics.
Z = \frac{θ̂ − θ_0}{σ_{θ̂}} ∼ N(0, 1).   (4.9)

The standard error is often replaced with its sample estimate S/\sqrt{n}, which results in the following test
statistic

T = \frac{θ̂ − θ_0}{S/\sqrt{n}} ∼ t_{(n−1)},   (4.10)

with associated two-sided rejection region

RR : {t : |t| > τ_{α/2}}   or   RR : {θ̂ : θ̂ < θ_0 − τ_{α/2} σ_{θ̂} or θ̂ > θ_0 + τ_{α/2} σ_{θ̂}}.   (4.11)
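A minimal sketch of such a large-sample test (an illustrative addition, not from the notes; NumPy and SciPy assumed, with a hypothetical data set and null value θ0 = 100) computes the statistic in (4.10) by hand and cross-checks it against SciPy's built-in one-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical data: test H0: mu = 100 against H1: mu != 100
x = rng.normal(loc=101.0, scale=3.0, size=40)
theta0 = 100.0

t_stat = (x.mean() - theta0) / (x.std(ddof=1) / np.sqrt(len(x)))   # equation (4.10)
t_crit = stats.t.ppf(0.975, df=len(x) - 1)                         # two-sided, alpha = 0.05

print("test statistic:", round(t_stat, 3), " critical value:", round(t_crit, 3))
print("reject H0:", abs(t_stat) > t_crit)

# Cross-check with the built-in one-sample t-test
print(stats.ttest_1samp(x, popmean=theta0))
```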
4.3 Duality of Hypothesis Testing and Confidence Intervals
Recall the concept of a (1 − α) two-sided confidence interval [θ̂_l , θ̂_h] as an interval that contains
the true parameter θ with probability (1 − α). Also recall that if the sampling distribution of θ̂ is
approximately normal then the (1 − α) confidence interval is given by

θ̂ ± z_{α/2} σ_{θ̂},   (4.12)

with σ_{θ̂} the standard error of the estimator and z_{α/2} the value such that Pr(Z > z_{α/2}) = α/2.
Note the strong similarity between this confidence interval and the test statistic plus associated
rejection region of a two-sided hypothesis test described in the previous section. This is no coincidence. Consider again the two-sided rejection region for a test with level α from the previous
section: RR : {z : |z| > z_{α/2}}. The complement of the rejection region is the acceptance region
AR : {z : |z| ≤ z_{α/2}}, which maps onto the parameter space as: do not reject (‘accept’) the null hypothesis
at level α if the estimate lies in the interval

θ_0 ± z_{α/2} σ_{θ̂}.   (4.13)

Restated, for all θ_0 that lie in the interval

θ̂ − z_{α/2} σ_{θ̂} ≤ θ_0 ≤ θ̂ + z_{α/2} σ_{θ̂},   (4.14)

the estimate θ̂ will lie inside the acceptance region and the null hypothesis cannot be rejected at
level α. This interval is, as you will notice, exactly equal to the (1 − α) confidence interval outlined
above. Thus the duality between confidence intervals and hypothesis testing: if the hypothesized
value θ_0 lies inside the (1 − α) confidence interval, one cannot reject the null hypothesis H0 : θ = θ_0
vs. H1 : θ ≠ θ_0 at level α; if θ_0 does not lie in the confidence interval then the null can be rejected
at level α. A similar statement can be made for upper and lower single-sided confidence intervals.
Notice that any value inside the confidence interval would be an ‘acceptable’ value for the null
hypothesis, in the sense that it cannot be rejected with a hypothesis test of level α. This explains
why in statistics we usually only talk about rejecting the null vs. not rejecting the null, rather than
saying we ‘accept’ the null. Even if we do not reject the null we recognize that there are probably
many other values for θ that would be acceptable and we should be hesitant to make statements
about a single θ being the true value. Likewise, we do not commonly 'accept' the alternative when we reject the null hypothesis, as there are usually many potential values the parameter θ can take under the alternative.
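The duality can be seen directly in a small numerical sketch (my own illustrative numbers, not from the notes): a hypothesized θ0 is rejected at level α exactly when it falls outside the (1 − α) confidence interval.

from scipy.stats import norm

theta_hat, se, alpha = 0.52, 0.02, 0.05        # hypothetical estimate and standard error
z = norm.ppf(1 - alpha / 2)
ci = (theta_hat - z * se, theta_hat + z * se)

for theta0 in (0.50, 0.57):
    reject = abs((theta_hat - theta0) / se) > z
    inside = ci[0] <= theta0 <= ci[1]
    print(f"theta0 = {theta0}: reject H0 = {reject}, inside CI = {inside}")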
4.4 Attained Significance Levels: P-Values
Recall that the most common method of selecting a critical value for the test statistic, and so determining the rejection region, is to fix the level of the test α. Of course we would like α to be as small as possible, as it denotes the probability of committing a type I error. However, as discussed, choosing a low α comes at the cost of increasing β, the probability of a type II error. Choosing the correct value of α is thus important, but also rather arbitrary. While one researcher may be happy to conduct a test at level α = 0.10, another may insist on testing only at levels lower than, say, α = 0.05. Furthermore, the levels of tests are often fixed at 10%, 5%, or 1% not as a result of long deliberation, but rather out of custom and tradition.
There is a way to partially sidestep this issue of selecting the right value for α by reporting
the attained significance level or p-value. For example let T be a test statistic for the hypothesis
H0 : θ = θ0 vs. H1 : θ > θ0 . If the realized value of the test statistic is t, based on our sample, then
the p-value is calculated as the probability
pval = Pr[T > t | θ0 ].
(4.15)
Definition 4.4 (p-value) The attained significance level, or p-value, is the smallest level of significance α at which the null hypothesis can be rejected given the observed sample.
The advantage of reporting a p-value, rather than fixing the level of the test yourself, is that it permits each of your readers to draw their own conclusion about the strength of your results. The procedure for finding a p-value is very similar to that for finding the critical value of a test statistic. However, instead of fixing the probability α and finding the critical value of the test statistic τ, we now fix the value of the test statistic t and find the associated probability pval.
Example 4.2 A financial analyst believes that firms experience positive stock returns upon the announcement that they are targeted for a takeover. To test his hypothesis he has collected a data set comprising 300 takeover announcements with an average abnormal return of r̂abn = 1.5% on the announcement date, with a standard error of 0.5%. Calculate the p-value for the null hypothesis H0 : rabn = 0 vs. H1 : rabn > 0.
Invoking the CLT, the natural test statistic for this hypothesis is
Z = (r̂abn − 0)/Sr̂abn ∼ t(299) ≈ N(0, 1).
The value of the test statistic in this sample equals z = 1.5/0.5 = 3. Looking up the value 3 in the standard normal table yields p-val = Pr[Z > 3] = 0.0013. Thus the p-value of this test is 0.13%, implying that we can easily reject the null hypothesis of no announcement effect at the 10%, 5%, or 1% level.
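The p-value in Example 4.2 is easy to verify with scipy (a cross-check only; the variable names are mine):

from scipy.stats import norm, t

r_hat, se, n = 0.015, 0.005, 300
z = (r_hat - 0.0) / se                  # observed test statistic, here 3.0

print(norm.sf(z))                       # normal p-value, about 0.0013
print(t.sf(z, df=n - 1))                # t(299) p-value, essentially the same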
4.5 Power of the Test
In the previous sections we have primarily focused on the probability α of committing a type I error. However, it is at least as important for a test to have a low probability β of committing a type II error. Remember that a type II error is committed if the testing procedure fails to reject the null when it is in fact false. In econometrics, rather than looking directly at β, many statistical tests are evaluated by the complement (1 − β): the probability that a statistical test rejects the null when it is indeed false; this probability (1 − β) is called the power of the test.
Before we can calculate the power of a test there are two issues that need to be addressed.
Firstly, recall that the alternative hypothesis often contains a large range of potential values for θ.
For instance in the single sided hypothesis H0 : θ = θ0 vs H1 : θ > θ0 , all values of θ larger than
θ0 are included in the alternative. However, the power will normally not be the same for all these
different values included in Θ1 . Therefore the power of a test is often evaluated at specific values
for the alternative, say θ = θ1 .
Secondly, as we have focused on type I errors and the associated α's, we have so far only considered what the sampling distribution looks like under the assumption that the null hypothesis is correct. This sampling distribution is referred to as the null distribution. However, if we are interested in making statements about the power of the test (or type II errors), then we have to consider what the sampling distribution of θ̂ looks like when θ = θ1. That is, we evaluate the sampling distribution for that specific alternative.
Consider once more the one-sided hypothesis H0 : θ = θ0 vs. H1 : θ > θ0 with the associated test statistic T and critical value τα. For a specific alternative θ = θ1 (with θ1 > θ0) the power of the test can be calculated as the conditional probability
(1 − β) = Pr[T > τ | θ = θ1 ].
(4.16)
Note that the main difference with the definition of α
α = Pr[T > τ | θ = θ0 ],
(4.17)
is that the probability is conditioned on the alternative hypothesis being true, rather than the assumption that the null hypothesis is true.
Example 4.3 Many American high-schoolers take the SAT (scholastic aptitude test). The average SAT score for mathematics is 633. Consider the following test: a school is rated 'excellent' if its students obtain an average SAT score of more than 650 (assume a class size of 40). School X believes that its own students have an expected SAT score of 660 with a standard deviation of 113. Thus, school X feels it should be rated excellent; what is the probability that the school will actually be rated 'excellent'?
This problem is really all about the power of the test. Realize first that we can describe the above as a hypothesis test of the form H0 : θ = 633 vs. H1 : θ > 633 with rejection region RR = {θ̂ : θ̂ > 650}. Because we are looking at the power of the test, we have to consider the alternative distribution, not the null distribution. In this case the school wants to evaluate the test at the specific alternative θ1 = 660 and find the probability Pr[θ̂ > 650 | θ1 = 660] = (1 − β), which is the power of the test evaluated at θ1.
Invoking the CLT we have, under the alternative distribution, Z = (θ̂ − 660)/(113/√40) ∼ N(0, 1). We can use this to rewrite the probability above as
(1 − β) = Pr[Z ≥ z650].
Filling in the numbers we find
z650 = (650 − 660)/(113/√40) = −0.56.
Looking up z650 = −0.56 in the standard normal table yields the probability (1 − β) = 0.71.
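The same power calculation can be cross-checked in scipy (the variable names are mine):

import numpy as np
from scipy.stats import norm

mu1, sigma, n, cutoff = 660, 113, 40, 650
se = sigma / np.sqrt(n)

power = norm.sf((cutoff - mu1) / se)    # Pr[sample mean > 650 | mu = 660]
print(round(power, 2))                  # about 0.71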
Asymmetry of Null and Alternative Hypotheses
As should be clear by now, there is an asymmetry between the null and the alternative hypothesis. The testing procedure outlined heavily focuses on the null hypothesis, 'favouring' it over the alternative: the decision rule and test statistic are based around the null distribution and the probability of falsely rejecting the null hypothesis, and the conclusion drawn is mainly about the null (reject the null, do not reject the null). The test only rejects the null if there is a lot of evidence against it, even if the test has low power.
Therefore, the decision as to which hypothesis is the null and which is the alternative is not merely a mathematical one, but depends on context and custom. There are no hard and fast rules on how to choose the null over the alternative, but often the 'logical' null can be deduced from one of several principles:
• Sometimes we have good information about the sampling distribution under one of the two hypotheses, but not really about what it looks like under the other. In this case it is standard to choose the 'simpler' hypothesis, whose distribution we know, as the null hypothesis. For example, if you are interested in whether a certain sample is drawn from a normal population, you know what the distribution looks like under the null (i.e. normal), but have no clue what it might look like under the alternative (exponential, χ², something else?), so the natural null is to assume normality.
• Sometimes the consequences of falsely rejecting one hypothesis are much graver than those of falsely rejecting the other. In that case we should choose the former as the null hypothesis. For example: if you have to judge the safety of a bridge, it is more harmful to wrongly reject the hypothesis that the bridge is unsafe (potentially killing many people) than it is to wrongly reject the hypothesis that the bridge is safe (which may cost money on spurious repairs). In this case the null should be: the bridge is deemed unsafe, unless proven otherwise.
• In scientific investigations it is common to approach the research question with a certain level of scepticism. If a new medicine is introduced, the appropriate null hypothesis would be to assume that it does not perform better than the current drug on the market. If you evaluate the effect of an economic policy, the natural null hypothesis would be to assume that it had no effect whatsoever. In both cases you put the burden of evidence on your new medicine/theory/policy.
Problems
1. The output voltage for a certain electric circuit is specified to be 130. A sample of 40 independent readings on the voltage for this circuit gave a sample mean of 128.6 and a standard
deviation of 2.1.
(a) Test the hypothesis that the average output voltage is 130 against the alternative that it is
less than 130 using a test with level α = 0.05.
(b) If the average voltage falls as low as 128 serious consequences may occur. Calculate the
probability of committing a type II error for H1 : V = 128 given the decision rule outlined
in (a).
2. Let Y1, Y2 , ..., Yn be a random sample of size n = 20 from a normal distribution with unknown
mean µ and known variance σ2 = 5. We wish to test H0 : µ ≤ 7 versus H1 : µ > 7.
(a) Find the uniformly most powerful test with significance level 0.05.
(b) For the test in (a), find the power at each of the following alternative values for µ :
µ1 = 7.5, µ1 = 8.0, µ1 = 8.5, and µ1 = 9.0.
3. In a study to assess various effects of using a female model in automobile advertising, each of
100 male subjects was shown photographs of two automobiles matched for price, colour, and
size but of different makes. Fifty of the subjects (group A) were shown automobile 1 with a
female model and automobile 2 with no model. Both automobiles were shown without the
model to the other 50 subjects (group B). In group A, automobile 1 (shown with the model)
was judged to be more expensive by 37 subjects. In group B, automobile 1 was judged to be
more expensive by 23 subjects. Do these results indicate that using a female model increases
the perceived cost of an automobile? Find the associated p-value and indicate your conclusion
for an α = .05 level test.
Appendix A
Exercise Solutions
A.1 Sampling Distributions
1.
(a) As the population is normally distributed, we have (Xi − µ)/σ = Z ∼ N(0, 1). Here: Z = (98 − 100)/3 = −0.67. Looking up Z = −0.67 in the standard normal table gives Pr[Xi ≤ 98] = Pr[Z ≤ −0.67] = 0.2514.
(b) As the population is normally distributed, we have (X̄ − µ)/(σ/√n) = Z ∼ N(0, 1). Here: Z = (98 − 100)/(3/3) = −2. Looking up Z = −2 in the standard normal table gives Pr[X̄ ≤ 98] = Pr[Z ≤ −2] = 0.0228.
(c) If we use the sample variance to calculate the standard error, rather than the population variance, the resulting sampling distribution will be tn−1 rather than standard normal. That is, (X̄ − µ)/(S/√n) = T ∼ tn−1, so Pr[X̄ ≤ 98] = Pr[T ≤ −2] = 0.0403. (If you use the Student-t tables, you will only be able to find 0.025 < Pr < 0.05.)
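These three probabilities can be cross-checked with scipy (assuming, as the numbers suggest, µ = 100, σ = 3 and a sample of size n = 9):

from scipy.stats import norm, t

mu, sigma, n = 100, 3, 9
print(norm.cdf((98 - mu) / sigma))                 # (a) about 0.2514
print(norm.cdf((98 - mu) / (sigma / n ** 0.5)))    # (b) about 0.0228
print(t.cdf(-2, df=n - 1))                         # (c) about 0.0403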
2.
(a) E(X̄) = µ1,  Var(X̄) = n−1 σ1².
(b) E(X̄ − Ȳ) = 0,
Var(X̄ − Ȳ) = n−1 σ1² + m−1 σ2² = n−1 (σ1² + σ2²) = 4.5/n  (taking m = n).
(c) Setting σX̄−Ȳ = 0.1:
√(4.5/n) = 0.1
n = 4.5/(0.1)²
n = 450.
3.
(a) Find Pr[S1² ≤ a].
We know that (n − 1)S1²/σ1² ∼ χ²n−1. Here n − 1 = 10 and σ1² = 15, and the value χ²(10),0.05 = 18.307 leaves 5% in the upper tail of a χ²10 distribution, so
Pr[(n − 1)S1²/σ1² ≤ 18.307] = 0.95
Pr[10 × S1²/15 ≤ 18.307] = 0.95
Pr[S1² ≤ 18.307 × 15/10] = Pr[S1² ≤ 27.46] = 0.95.
(b) Find Pr[S1²/S2² ≤ b].
We know that (S1²/σ1²)/(S2²/σ2²) ∼ F(n1−1,n2−1). With F(n1−1,n2−1),0.05 = 2.98 and σ1² = σ2² = 15:
Pr[(S1²/σ1²)/(S2²/σ2²) ≤ 2.98] = 0.95
Pr[(S1²/15)/(S2²/15) ≤ 2.98] = 0.95
Pr[S1²/S2² ≤ 2.98] = 0.95.
4.
(a) Σ_{i=1}^{4} Zi ∼ N(0, 4)
(b) Σ_{i=1}^{4} Zi² ∼ χ²4
(c) Z1²/(Σ_{i=2}^{4} Zi²/3) ∼ F1,3
(d) Z1/√(Σ_{i=2}^{4} Zi²/3) ∼ t3
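A rough Monte Carlo check of the distributional results in solution 4 (assuming, as the stated answers imply, that Z1, ..., Z4 are i.i.d. standard normal):

import numpy as np
from scipy.stats import f, t

rng = np.random.default_rng(1)
Z = rng.standard_normal((100_000, 4))

print(np.var(Z.sum(axis=1)))                       # (a) close to 4, i.e. N(0, 4)
print(np.mean((Z ** 2).sum(axis=1)))               # (b) close to 4, the mean of chi-square(4)
ratio = Z[:, 0] ** 2 / ((Z[:, 1:] ** 2).sum(axis=1) / 3)
print(np.mean(ratio > f.ppf(0.95, 1, 3)))          # (c) close to 0.05 under F(1, 3)
tstat = Z[:, 0] / np.sqrt((Z[:, 1:] ** 2).sum(axis=1) / 3)
print(np.mean(np.abs(tstat) > t.ppf(0.975, 3)))    # (d) close to 0.05 under t(3)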
A.2 Large Sample Theory
1. First we show that the expectation of the sample mean equals the average of the population means:
E(X̄) = n−1 E[X1 + . . . + Xn] = n−1 Σ_{i=1}^{n} µi = µ.
The variance of the sample mean is
Var(X̄) = (1/n²) Σ_i σi².
Next we use Chebychev's inequality to establish that
Pr[(X̄n − µ)² > ε²] ≤ Var(X̄n)/ε² = Σ_i σi² / (n²ε²) → 0
(which holds, for instance, when the σi² are bounded), and this concludes our proof.
2. Approximate Pr[S100 < 120] with Xi ∼ CDF(1.5, 1), i.e. i.i.d. with mean 1.5 and variance 1.
The CLT states that
(Sn − E(Sn))/√(σ²Sn) →d N(0, 1),
with σ²Sn = nσ². Thus
(120 − 150)/√100 = −3,
and Pr(Z < −3) = 1 − 0.9987 ≈ 0.0013, i.e. about 0.13%.
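A small simulation check of this approximation (the notes only fix the mean and variance; a gamma distribution with those two moments is assumed here purely for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
shape, scale = 2.25, 2 / 3                    # gamma with mean 1.5 and variance 1
S = rng.gamma(shape, scale, size=(100_000, 100)).sum(axis=1)

print(np.mean(S < 120))                       # simulated Pr[S_100 < 120]
print(norm.cdf((120 - 150) / np.sqrt(100)))   # CLT approximation, about 0.0013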
3. Again, use the CLT to approximate the sampling distribution with a normal distribution. As the variance is known, the sampling distribution will be approximately
(X̄ − µ)/(σ/√n) ∼ N(0, 1),
with σ² = 25. Next we look up the quantile associated with an exceedance probability of 0.05/2 = 0.025: z0.025 = 1.96. So we solve
1.5/(σ/√n) = 1.96
√n = 1.96/1.5 × √25
n = 1.96²/1.5² × 25 = 42.7.
Normally n would be rounded up, in this case to 43, to ensure that the probability is at least the desired level.
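The calculation generalizes to a small helper (my own function, not from the notes) returning the sample size needed for a (1 − α) two-sided margin of error of at most d:

import math
from scipy.stats import norm

def required_n(sigma, d, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return math.ceil((z * sigma / d) ** 2)

print(required_n(sigma=5.0, d=1.5))   # 43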
A.3 Estimation
1.
(a) (i) µ̂1 = (X1 + Xn)/2 is unbiased, but inconsistent.
Unbiasedness:
E[µ̂1] = E[(X1 + Xn)/2] = (E X1 + E Xn)/2 = 2µ/2 = µ.
Consistency:
E(µ̂1) → µ, but Var(µ̂1) = σ²/2 → σ²/2 ≠ 0, so µ̂1 does not converge (in probability) to µ.
(ii) µ̂2 = X1/4 + (1/2) Σ_{i=2}^{n−1} Xi/(n − 2) + Xn/4 is unbiased, but inconsistent.
Unbiasedness:
E(µ̂2) = E[(X1 + Xn)/4] + E[(1/2) Σ_{i=2}^{n−1} Xi/(n − 2)]
= (1/2)µ + (1/2) Σ_{i=2}^{n−1} E Xi/(n − 2)
= (1/2)µ + (1/2)(n − 2)µ/(n − 2)
= µ.
Consistency:
E(µ̂2) → µ, but Var(µ̂2) = σ²/8 + (1/4)σ²/(n − 2) → σ²/8 ≠ 0, so µ̂2 does not converge to µ.
(iii) µ̂3 = Σ_{i=1}^{n} Xi/(n + k), 0 < k ≤ 3, is biased, but consistent.
Unbiasedness:
E(µ̂3) = E[Σ_{i=1}^{n} Xi/(n + k)] = Σ_{i=1}^{n} E(Xi)/(n + k) = nµ/(n + k)
= µ − kµ/(n + k) ≠ µ.
Consistency:
lim_{n→∞} E(µ̂3) = lim_{n→∞} nµ/(n + k) = µ, and Var(µ̂3) = nσ²/(n + k)² → 0, so µ̂3 → µ.
(iv) µ̂4 = X̄ is both unbiased and consistent, see lecture notes.
(b) Relative efficiency: eff(µ̂i, µ̂4) = MSE(µ̂4)/MSE(µ̂i), with MSEi = Var(µ̂i) + bias²(µ̂i). For n = 36, σ² = 20, µ = 15, and k = 3 we have
(i) MSE1 = σ²/2 = 10
(ii) MSE2 = σ²/8 + (1/4)σ²/(n − 2) = 45/17 ≈ 2.65
(iii) MSE3 = nσ²/(n + k)² + (kµ/(n + k))² = 80/169 + 225/169 = 305/169 ≈ 1.80
(iv) MSE4 = σ²/n = 20/36 ≈ 0.56
So the efficiency of each estimator relative to the sample mean µ̂4 is
(i) MSE4/MSE1 = 0.056
(ii) MSE4/MSE2 = 0.21
(iii) MSE4/MSE3 = 0.31
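A quick simulation sketch of the four estimators (same parameter values µ = 15, σ² = 20, k = 3; the population distribution is not fixed in the notes, so normality is assumed here only for illustration). The estimated MSEs at n = 36 should be close to 10, 2.65, 1.80 and 0.56, and only µ̂3 and µ̂4 improve as n grows.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, k, reps = 15.0, np.sqrt(20.0), 3, 20_000

def mse(estimates):
    return np.mean((estimates - mu) ** 2)

for n in (36, 360):
    X = rng.normal(mu, sigma, size=(reps, n))
    mu1 = (X[:, 0] + X[:, -1]) / 2
    mu2 = X[:, 0] / 4 + 0.5 * X[:, 1:-1].mean(axis=1) + X[:, -1] / 4
    mu3 = X.sum(axis=1) / (n + k)
    mu4 = X.mean(axis=1)
    print(n, [round(mse(m), 3) for m in (mu1, mu2, mu3, mu4)])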
2. Note that θ̂3 is unbiased:
E(θ̂3) = E(aθ̂1 + (1 − a)θ̂2) = aθ + (1 − a)θ = θ.
The variance of θ̂3 is
σ3² = Σ_i Σ_j bi bj γi,j = a²σ1² + (1 − a)²σ2² + 2a(1 − a)γ.
Let us consider the general case with γ unconstrained. To minimize σ3², find
argmin_a [a²σ1² + (1 − a)²σ2² + 2a(1 − a)γ]:
∂σ3²/∂a = 2aσ1² − 2(1 − a)σ2² + 2(1 − 2a)γ = 2a(σ1² + σ2² − 2γ) − 2(σ2² − γ) = 0.
Note that γ = ρσ1σ2.
(a) a = σ2²/(σ1² + σ2²)  (the case γ = 0)
(b) a = (σ2² − γ)/(σ1² + σ2² − 2γ)
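A short numerical check (with my own illustrative variances and correlation) that this weight indeed minimizes the variance of the combination:

import numpy as np

s1sq, s2sq, rho = 4.0, 1.0, 0.3             # hypothetical variances and correlation
gamma = rho * np.sqrt(s1sq * s2sq)

def var3(a):
    return a**2 * s1sq + (1 - a)**2 * s2sq + 2 * a * (1 - a) * gamma

a_star = (s2sq - gamma) / (s1sq + s2sq - 2 * gamma)
grid = np.linspace(-1, 2, 3001)
print(round(a_star, 3), round(grid[np.argmin(var3(grid))], 3))   # the two should agree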
3. (a) 90% confidence intervals for the mean (x̄ = 10.15, s = 2.34, n = 21):
(i) Two-sided:
x̄ ± t0.05,20 s/√n, with t0.05,20 = 1.725 and s/√n = 2.34/√21 = 0.51,
CI = [10.15 ± 1.725 × 0.51] = [9.27, 11.03].
(ii) Upper:
(−∞, x̄ + t0.10,20 s/√n], with t0.10,20 = 1.325 and s/√n = 2.34/√21 = 0.51,
CIH = (−∞, 10.83].
(iii) Lower:
[x̄ − t0.10,20 s/√n, ∞), with t0.10,20 = 1.325 and s/√n = 0.51,
CIL = [9.47, ∞).
(b) 90% confidence intervals for the variance (s² = 2.34² = 5.48, n − 1 = 20):
(i) Two-sided: χ²0.05,20 = 31.41, χ²0.95,20 = 10.85,
CI = [20 × 5.48/31.41, 20 × 5.48/10.85] = [3.49, 10.10].
(ii) Upper: χ²0.90,20 = 12.44, CIH = [0, 8.80].
(iii) Lower: χ²0.10,20 = 28.41, CIL = [20 × 5.48/28.41, ∞) = [3.86, ∞).
(c) To find the 90% confidence intervals for the standard deviation, take the square root of the CI for the variance:
(i) Two-sided: [1.87, 3.18]
(ii) Upper: [0, 2.97]
(iii) Lower: [1.96, ∞)
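The two-sided intervals can be cross-checked with scipy (same sample statistics: n = 21, mean 10.15, s = 2.34):

import numpy as np
from scipy.stats import t, chi2

n, xbar, s, alpha = 21, 10.15, 2.34, 0.10

half = t.ppf(1 - alpha / 2, n - 1) * s / np.sqrt(n)
print(xbar - half, xbar + half)                    # mean CI, about [9.27, 11.03]

lo = (n - 1) * s**2 / chi2.ppf(1 - alpha / 2, n - 1)
hi = (n - 1) * s**2 / chi2.ppf(alpha / 2, n - 1)
print(lo, hi)                                      # variance CI, about [3.49, 10.09]
print(np.sqrt(lo), np.sqrt(hi))                    # std. dev. CI, about [1.87, 3.18]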
A.4 Hypothesis Testing
1.
(a) Test H0 : ν ≥ 130 vs. H1 : ν < 130 with α = 0.05; n = 40, ν̂ = 128.6, S = 2.1.
Using the CLT we can construct a test statistic with a known sampling distribution:
z = (ν̂ − ν)/(σ/√n) ∼ N(0, 1),  t = (ν̂ − ν)/(S/√n) ∼ t(n − 1),
tν̂ = (128.6 − 130)/(2.1/√40) = −1.4/0.33 = −4.24.
Rejection region: tν̂ < t0.05, with t0.05(39) ≈ t0.05(40) = −1.684 (compare z0.05 = −1.645).
Since −4.24 < −1.684, reject H0: ν is significantly lower than 130 at the 5% level.
(b) Decision rule: reject H0 if (V̂ − 130)/(2.1/√40) < −1.684, i.e. if V̂ < 129.44.
β = P[V̂ ≥ 129.44 | ν = 128] = P[(V̂ − 128)/0.33 ≥ (129.44 − 128)/0.33].
Since (129.44 − 128)/0.33 = 4.36, looking this up in the t- or z-table shows that this probability is well below 0.1%.
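A scipy version of the same calculation (data as given in the problem):

import numpy as np
from scipy.stats import t

n, xbar, s, mu0, alpha = 40, 128.6, 2.1, 130.0, 0.05
se = s / np.sqrt(n)

t_stat = (xbar - mu0) / se
crit = t.ppf(alpha, n - 1)                 # lower-tail critical value
print(t_stat, crit, t_stat < crit)         # about -4.2, -1.68, True: reject H0

# (b) probability of a type II error when the true mean is 128
v_crit = mu0 + crit * se                   # reject when the sample mean is below this
beta = t.sf((v_crit - 128.0) / se, n - 1)  # Pr[do not reject | nu = 128]
print(beta)                                # well below 0.001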
2. Consider H0 : µ ≤ 7 vs. H1 : µ > 7, with Yi ∼ N(µ, 5) and n = 20.
(a) Uniformly most powerful test:
argmax_{mcrit} P(µ̂ > mcrit | µ1)  s.t.  P(µ̂ > mcrit | µ0) ≤ α,
i.e. set mcrit such that P(µ̂ > mcrit | µ0) = 0.05.
By the CLT we know that
(µ̂ − µ)/(σ/√n) ∼ N(0, 1),  Z0.95 = 1.645.
Thus:
(mcrit − µ0)/(σ/√n) = Z0.95
(mcrit − 7)/√(5/20) = 1.645
mcrit = 7 + 1.645 √(5/20) = 7.8225,
i.e. rejection region: reject if µ̂ > 7.8225.
(b) Find the power of the test,
(1 − β) = P(µ̂ > mcrit | µ1),
when the alternative takes on the following values (again using the normal sampling distribution, with σ/√n = 0.5):
µ1 = 7.5: (7.8225 − 7.5)/0.5 = 0.645,  (1 − β) = 0.26.
µ1 = 8.0: (7.8225 − 8.0)/0.5 = −0.355,  (1 − β) = 0.64.
µ1 = 8.5: (7.8225 − 8.5)/0.5 = −1.355,  (1 − β) = 0.91.
µ1 = 9.0: (7.8225 − 9.0)/0.5 = −2.355,  (1 − β) = 0.99.
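The critical value and power values can be reproduced in scipy (same setup: σ² = 5, n = 20, α = 0.05):

import numpy as np
from scipy.stats import norm

mu0, sigma2, n, alpha = 7.0, 5.0, 20, 0.05
se = np.sqrt(sigma2 / n)

m_crit = mu0 + norm.ppf(1 - alpha) * se    # 7.8225
for mu1 in (7.5, 8.0, 8.5, 9.0):
    power = norm.sf((m_crit - mu1) / se)
    print(mu1, round(power, 2))            # 0.26, 0.64, 0.91, 0.99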
3. In effect there are two random samples, each a sequence of Bernoulli trials with n = 50 and some parameter φ ∈ [0, 1]. Setting up the null and alternative hypotheses yields
H0 : φ1 ≤ φ2, H1 : φ1 > φ2,
or alternatively
H0 : (φ1 − φ2) ≤ 0, H1 : (φ1 − φ2) > 0.
A Bernoulli distribution has mean φ and variance φ(1 − φ). Remember that the setup of this test is strongly reminiscent of exercise 7, implying that Var(φ̂1 − φ̂2) = σ1²/n1 + σ2²/n2 and, by the CLT, the difference will be approximately normally distributed. Replacing the population variances with their sample equivalents yields the following test statistic, which has a t-distribution with approximately (n1 + n2 − 2) degrees of freedom:¹
t = ((φ̂1 − φ̂2) − 0)/√((s1² + s2²)/n).
Filling in the numbers yields
t = (0.74 − 0.46)/√([0.74(1 − 0.74) + 0.46(1 − 0.46)]/50) = 2.982.
Checking the t-table shows that the p-value is less than 0.005; since this is below 0.05, reject H0. Alternatively, the critical value is τ0.95 = 1.661, and t > τ0.95, so reject H0. In both cases the conclusion is that, indeed, the inclusion of a female model significantly increases the probability that a car is perceived to be more expensive. Note also that, under H0, the 5%-level acceptance region for (φ̂1 − φ̂2) is (−∞, 0.156]; the observed difference of 0.28 falls outside it, which again leads to the conclusion that H0 can be rejected.
¹ To be more precise, the degrees of freedom are estimated by
(σ1²/n1 + σ2²/n2)² / [(σ1²/n1)²/(n1 − 1) + (σ2²/n2)²/(n2 − 1)].
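For completeness, the test statistic, p-value and critical value above can be reproduced with scipy (a cross-check only, using the same normal/t approximation):

import numpy as np
from scipy.stats import t

n1 = n2 = 50
p1_hat, p2_hat = 37 / 50, 23 / 50

se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
t_stat = (p1_hat - p2_hat) / se
df = n1 + n2 - 2

print(round(t_stat, 3))        # about 2.982
print(t.sf(t_stat, df))        # one-sided p-value, about 0.002
print(t.ppf(0.95, df))         # critical value, about 1.661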